# Risk Analysis: A Quantitative Guide


Risk Analysis: A Quantitative Guide
David Vose
Third Edition

John Wiley & Sons, Ltd

Copyright © 2008 David Vose

Published by John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, England
Telephone: +44 (0)1243 779777
Email (for orders and customer service enquiries): cs-books@wiley.co.uk
Visit our Home Page on www.wileyeurope.com or www.wiley.com

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP, UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed to the Permissions Department, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England, or emailed to permreq@wiley.co.uk, or faxed to +44 (0)1243 770620.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The Publisher is not associated with any product or vendor mentioned in this book.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Other Wiley Editorial Offices: John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA; Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA; Wiley-VCH Verlag GmbH, Boschstr.
12, D-69469 Weinheim, Germany; John Wiley & Sons Australia Ltd, 42 McDougall Street, Milton, Queensland 4064, Australia; John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809; John Wiley & Sons Canada Ltd, 6045 Freemont Blvd, Mississauga, Ontario, Canada, L5R 4J3.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Library of Congress Cataloging-in-Publication Data
Vose, David.
Risk analysis : a quantitative guide / David Vose. - 3rd ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-51284-5 (cloth : alk. paper)
1. Monte Carlo method. 2. Risk assessment - Mathematical models. I. Title.
QA298.V67 2008
658.4'0352 - dc22

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-470-51284-5 (H/B)

Typeset in 10/12pt Times by Laserwords Private Limited, Chennai, India
Printed and bound in Great Britain by Antony Rowe Ltd, Chippenham, Wiltshire

Contents

Preface

Part 1 Introduction

1 Why do a risk analysis?
1.1 Moving on from "What If" Scenarios
1.2 The Risk Analysis Process
1.3 Risk Management Options
1.4 Evaluating Risk Management Options
1.5 Inefficiencies in Transferring Risks to Others
1.6 Risk Registers

2 Planning a risk analysis
2.1 Questions and Motives
2.2 Determine the Assumptions that are Acceptable or Required
2.3 Time and Timing
2.4 You'll Need a Good Risk Analyst or Team

3 The quality of a risk analysis
3.1 The Reasons Why a Risk Analysis can be Terrible
3.2 Communicating the Quality of Data Used in a Risk Analysis
3.3 Level of Criticality
3.4 The Biggest Uncertainty in a Risk Analysis
3.5 Iterate

4 Choice of model structure
4.1 Software Tools and the Models they Build
4.2 Calculation Methods
4.3 Uncertainty and Variability
4.4 How Monte Carlo Simulation Works
4.5 Simulation Modelling

5 Understanding and using the results of a risk analysis
5.1 Writing a Risk Analysis Report
5.2 Explaining a Model's Assumptions
5.3 Graphical Presentation of a Model's Results
5.4 Statistical Methods of Analysing Results

Part 2 Introduction

6 Probability mathematics and simulation
6.1 Probability Distribution Equations
6.2 The Definition of "Probability"
6.3 Probability Rules
6.4 Statistical Measures

7 Building and running a model
7.1 Model Design and Scope
7.2 Building Models that are Easy to Check and Modify
7.3 Building Models that are Efficient
7.4 Most Common Modelling Errors

8 Some basic random processes
8.1 Introduction
8.2 The Binomial Process
8.3 The Poisson Process
8.4 The Hypergeometric Process
8.5 Central Limit Theorem
8.6 Renewal Processes
8.7 Mixture Distributions
8.8 Martingales
8.9 Miscellaneous Examples

9 Data and statistics
9.1 Classical Statistics
9.2 Bayesian Inference
9.3 The Bootstrap
9.4 Maximum Entropy Principle
9.5 Which Technique Should You Use?
9.6 Adding uncertainty in Simple Linear Least-Squares Regression Analysis

10 Fitting distributions to data
10.1 Analysing the Properties of the Observed Data
10.2 Fitting a Non-Parametric Distribution to the Observed Data
10.3 Fitting a First-Order Parametric Distribution to Observed Data
10.4 Fitting a Second-Order Parametric Distribution to Observed Data

11 Sums of random variables
11.1 The Basic Problem
11.2 Aggregate Distributions

12 Forecasting with uncertainty
12.1 The Properties of a Time Series Forecast
12.2 Common Financial Time Series Models
12.3 Autoregressive Models
12.4 Markov Chain Models
12.5 Birth and Death Models
12.6 Time Series Projection of Events Occurring Randomly in Time
12.7 Time Series Models with Leading Indicators
12.8 Comparing Forecasting Fits for Different Models
12.9 Long-Term Forecasting

13 Modelling correlation and dependencies
13.1 Introduction
13.2 Rank Order Correlation
13.3 Copulas
13.4 The Envelope Method
13.5 Multiple Correlation Using a Look-Up Table

14 Eliciting from expert opinion
14.1 Introduction
14.2 Sources of Error in Subjective Estimation
14.3 Modelling Techniques
14.4 Calibrating Subject Matter Experts
14.5 Conducting a Brainstorming Session
14.6 Conducting the Interview

15 Testing and modelling causal relationships
15.1 Campylobacter Example
15.2 Types of Model to Analyse Data
15.3 From Risk Factors to Causes
15.4 Evaluating Evidence
15.5 The Limits of Causal Arguments
15.6 An Example of a Qualitative Causal Analysis
15.7 Is Causal Analysis Essential?
16 Optimisation in risk analysis
16.1 Introduction
16.2 Optimisation Methods
16.3 Risk Analysis Modelling and Optimisation
16.4 Working Example: Optimal Allocation of Mineral Pots

17 Checking and validating a model
17.1 Spreadsheet Model Errors
17.2 Checking Model Behaviour
17.3 Comparing Predictions Against Reality

18 Discounted cashflow modelling
18.1 Useful Time Series Models of Sales and Market Size
18.2 Summing Random Variables
18.3 Summing Variable Margins on Variable Revenues
18.4 Financial Measures in Risk Analysis

19 Project risk analysis
19.1 Cost Risk Analysis
19.2 Schedule Risk Analysis
19.3 Portfolios of Risks
19.4 Cascading Risks

20 Insurance and finance risk analysis modelling
20.1 Operational Risk Modelling
20.2 Credit Risk
20.3 Credit Ratings and Markov Chain Models
20.4 Other Areas of Financial Risk
20.5 Measures of Risk
20.6 Term Life Insurance
20.7 Accident Insurance
20.8 Modelling a Correlated Insurance Portfolio
20.9 Modelling Extremes
20.10 Premium Calculations

21 Microbial food safety risk assessment
21.1 Growth and Attenuation Models
21.2 Dose-Response Models
21.3 Is Monte Carlo Simulation the Right Approach?
21.4 Some Model Simplifications

22 Animal import risk assessment
22.1 Testing for an Infected Animal
22.2 Estimating True Prevalence in a Population
22.3 Importing Problems
22.4 Confidence of Detecting an Infected Group
22.5 Miscellaneous Animal Health and Food Safety Problems

I Guide for lecturers
II About ModelRisk
III A compendium of distributions
III.1 Discrete and Continuous Distributions
III.2 Bounded and Unbounded Distributions
III.3 Parametric and Non-Parametric Distributions
III.4 Univariate and Multivariate Distributions
III.5 Lists of Applications and the Most Useful Distributions
III.6 How to Read Probability Distribution Equations
III.7 The Distributions
III.8 Introduction to Creating Your Own Distributions
III.9 Approximation of One Distribution with Another
III.10 Recursive Formulae for Discrete Distributions
III.11 A Visual Observation on the Behaviour of Distributions
IV Further reading
V Vose Consulting
References
Index

Preface

I'll try to keep it short. This third edition is an almost complete rewrite. I have thrown out anything from the second edition that was really of pure academic interest - but that wasn't very much, and I had a lot of new topics I wanted to include, so this edition is quite a bit bigger. I apologise if you had to pay postage.

There are two main reasons why there is so much material to add since 2000. The first is that our consultancy firm has grown considerably, and, with the extra staff and talent, we have had the privilege of working on more ambitious and varied projects. We have particularly expanded in the insurance and finance markets, so you will see that a lot of techniques from those areas, which have far wider applications, appear throughout this edition. We have had contracts where we were given carte blanche to think up new ideas, and that really got the creative juices flowing.
I have also been involved in writing and editing various risk analysis guidelines that made me think more about the disconnect between what risk analysts produce and what risk managers need. This edition is split into two parts in an attempt to help remedy that problem.

The second reason is that we have built a really great software team, and the freedom to design our own tools has been a double espresso for our collective imagination. We now build a lot of bespoke risk analysis applications for clients and have our own commercial software products. It has been enormous fun starting off with a typical risk-based problem, researching techniques that would solve that problem if only they were easy to use and then working out how to make that happen. ModelRisk is the result, and we have a few others in the pipeline.

Some thank yous...

I have imposed a lot on Veerle and our children to get this book done. V has spent plenty of evenings without me while I typed away in my office, but I think she suffered much more living with a guy who was perpetually distracted by what he was going to write next. Sophie and Sébastien have also missed out. Papa always seemed to be working instead of playing with them. Worse, perhaps, it didn't stop raining all summer in Belgium, and they had to forego a holiday in the sun so I could finish writing. I'll make it up to all three of you, I promise.

I have the luxury of having some really smart and motivated people working with me. I have leaned rather heavily on the partners and staff in our consultancy firm while I focused on this book, particularly on Huybert Groenendaal, who has largely run the company in my "absence". He also wrote Appendix V. Timour Koupeev heads our programming team and has been infinitely patient in converting my never-ending ideas for our ModelRisk software into reality. He also wrote Appendix II. Murat Tomaev, our head programmer, has made it all work together.
Getting new modules for me to look at always feels a little like Christmas.

My secretary, Jane Pooley, retired from the company this year. She was the first person with enough faith to risk working for me, and I couldn't have wished for a better start. Wouter Smet and Michael van Hauwermeiren in our Belgian office have been a great support, going through the manuscript and models for this book. Michael wrote the enormous Appendix III, which could be a book in its own right, and Wouter offered many suggestions for improving the English, which is embarrassing considering it's his third language. Francisco Zagmutt wrote Chapter 16 while under pressure to finish his thesis for his second doctorate and being a full-time, jumping-on-airplanes, deadline-chasing senior consultant in our US office.

When Wiley sent me copies of the first edition, the first thing I did was go over to my parents' house and give them a copy. I did the same with the second edition, and the Japanese version too. They are all proudly displayed in the sitting room. I will be doing the same with this book. There's little that can beat knowing my parents are proud of me, as I am of them. Mum still plays tennis, rides and competes in target shooting. Dad is still a great golfer, and neither ever seems to stop working on their house, unless they're off to a party. They are a constant reminder to make the most of life.

Paul Curtis copy-edited the manuscript with great diligence and diplomacy. I'd love to know how he spotted inconsistencies and repetitions in parts of the text that were a hundred or more pages apart. Any remaining errors are all my fault.

Finally, have you ever watched those TV programmes where some guy with a long beard is teaching you how to paint in thirty minutes? I did once. He didn't have a landscape in front of him, so he just started painting what he felt like: a lake, then some hills, the sky, trees.
He built up his painting, and after about 20 minutes I thought - yes, that's finished. Then he added reflections, some snow, a bush or two in the foreground. Each time I thought - yes, now it's finished. That's the problem with writing a book (or software) - there's always something more to add or change or rewrite. So I have rather exceeded my deadline, and certainly the page estimate, and my thanks go to my editor at Wiley, Emma Cooper, for her gentle pushing, encouragement and flexibility.

Part 1 Introduction

The first part of this book is focused on helping those who have to make decisions in the face of risk. The second part of the book focuses on modelling techniques and has all the mathematics. The purpose of Part 1 is to help a manager understand what a risk analysis is and how it can help in decision-making. I offer some thoughts on how to build a risk analysis team, how to evaluate the quality of the analysis and how to ask the right questions so you get the most useful answers. This section should also be of use to analysts because they need to understand the managers' viewpoint and work towards the same goal.

Chapter 1 Why do a risk analysis?

In business and government one faces having to make decisions all the time where the outcome is uncertain. Understanding the uncertainty can help us make a much better decision. Imagine that you are a national healthcare provider considering which of two vaccines to purchase. The two vaccines have the same reported level of efficacy (67%), but further study reveals that there is a difference in confidence attached to these two performance measures: one is twice as uncertain as the other (see Figure 1.1). All else being equal, the healthcare provider would purchase the vaccine with the smallest uncertainty about its performance (vaccine A).
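The two-vaccine comparison can be sketched numerically. The book does not specify the distributions behind Figure 1.1, so the Beta distributions and "effective sample size" parameters below are purely illustrative assumptions: they simply construct two distributions with the same 67% mean, one of which is about twice as wide as the other.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative Beta distributions for the two vaccines' true efficacy.
# Both centre on the reported 67% efficacy; vaccine B's "effective
# sample size" is chosen so its distribution is roughly twice as wide
# as vaccine A's. These parameter values are invented for the sketch.
mean = 0.67
n_a, n_b = 99, 24  # smaller effective sample size => wider distribution
vaccine_a = rng.beta(mean * n_a, (1 - mean) * n_a, size=100_000)
vaccine_b = rng.beta(mean * n_b, (1 - mean) * n_b, size=100_000)

print(f"Vaccine A: mean {vaccine_a.mean():.3f}, sd {vaccine_a.std():.3f}")
print(f"Vaccine B: mean {vaccine_b.mean():.3f}, sd {vaccine_b.std():.3f}")
```

Both samples report the same expected efficacy, but vaccine B's standard deviation is about double vaccine A's, which is exactly the situation in which the decision-maker prefers A.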
Replace vaccine with investment and efficacy with profit and we have a problem in business, for which the answer is the same - pick the investment with the smallest uncertainty, all else being equal (investment A). The principal problem is determining that uncertainty, which is the central focus of this book.

We can think of two forms of uncertainty that we have to deal with in risk analysis. The first is a general sense that the quantity we are trying to estimate has some uncertainty attached to it. This is usually described by a distribution like the ones in Figure 1.1. Then we have risk events, which are random events that may or may not occur and for which there is some impact of interest to us. We can distinguish between two types of event:

A risk is a random event that may possibly occur and, if it did occur, would have a negative impact on the goals of the organisation. Thus, a risk is composed of three elements: the scenario; its probability of occurrence; and the size of its impact if it did occur (either a fixed value or a distribution).

An opportunity is also a random event that may possibly occur but, if it did occur, would have a positive impact on the goals of the organisation. Thus, an opportunity is composed of the same three elements as a risk.

A risk and an opportunity can be considered opposite sides of the same coin. It is usually easiest to consider a potential event to be a risk if it would have a negative impact and its probability is less than 50%, and, if the risk has a probability in excess of 50%, to include it in a base plan and then consider the opportunity of it not occurring.

1.1 Moving on from "What If" Scenarios

Single-point or deterministic modelling involves using a single "best-guess" estimate of each variable within a model to determine the model's outcome(s). Sensitivities are then performed on the model to determine how much that outcome might in reality vary from the model outcome.
This is achieved by selecting various combinations for each input variable. These various combinations of possible values around the "best guess" are commonly known as "what if" scenarios. The model is often also "stressed" by putting in values that represent worst-case scenarios.

Figure 1.1 Efficacy comparison for two vaccines: the vertical axis represents how confident we are about the true level of efficacy. I've omitted the scale to avoid some confusion at this stage (see Section III.1.2).

Consider a simple problem that is just the sum of five cost items. We can use the three points, minimum, best guess and maximum, as values to use in a "what if" analysis. Since there are five cost items and three values per item, there are 3^5 = 243 possible "what if" combinations we could produce. Clearly, this is too large a set of scenarios to have any practical use. This process suffers from two other important drawbacks: only three values are being used for each variable, where they could, in fact, take any number of values; and no recognition is being given to the fact that the best-guess value is much more likely to occur than the minimum and maximum values. We can stress the model by adding up the minimum costs to find the best-case scenario, and add up the maximum costs to get the worst-case scenario, but in doing so the range is usually unrealistically large and offers no real insight. The exception is when the worst-case scenario is still acceptable.

Quantitative risk analysis (QRA) using Monte Carlo simulation (the dominant modelling technique in this book) is similar to "what if" scenarios in that it generates a number of possible scenarios. However, it goes one step further by effectively accounting for every possible value that each variable could take and weighting each possible scenario by the probability of its occurrence. QRA achieves this by modelling each variable within a model by a probability distribution.
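The five-cost-item comparison can be sketched in a few lines of Python. The cost figures and the budget threshold below are invented for illustration, and the triangular distribution merely stands in for whatever distribution an analyst might actually choose for each item.

```python
import itertools
import random

random.seed(1)

# Five cost items, each with (minimum, best guess, maximum) estimates.
# The figures are invented for illustration.
items = [(8, 10, 14), (18, 20, 25), (4, 5, 7), (9, 12, 16), (28, 30, 35)]

# "What if" approach: three values per item gives 3**5 = 243 scenarios,
# with no weighting by likelihood.
scenarios = list(itertools.product(*items))

best_case = sum(lo for lo, _, _ in items)   # sum of minima
worst_case = sum(hi for _, _, hi in items)  # sum of maxima

# Monte Carlo approach: sample each item from a triangular distribution,
# so every intermediate value can occur and likelier values appear more
# often; scenarios are thereby weighted by their probability.
totals = [sum(random.triangular(lo, hi, mode) for lo, mode, hi in items)
          for _ in range(50_000)]

budget = 85  # a hypothetical budget threshold
p_over = sum(t > budget for t in totals) / len(totals)
print(f"{len(scenarios)} what-if scenarios; range {best_case}-{worst_case}; "
      f"P(total > {budget}) = {p_over:.3f}")
```

The what-if analysis can only report the (unrealistically wide) 67 to 97 range, whereas the simulation produces a full distribution of the total cost, from which a directly useful statistic such as the probability of exceeding the budget can be read off.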
The structure of a QRA model is usually (there are some important exceptions) very similar to a deterministic model, with all the multiplications, additions, etc., that link the variables together, except that each variable is represented by a probability distribution function instead of a single value. The objective of a QRA is to calculate the combined impact of the uncertainty¹ in the model's parameters in order to determine an uncertainty distribution of the possible model outcomes.

¹ I discuss the exact meaning of "uncertainty", randomness, etc., in Chapter 4.

1.2 The Risk Analysis Process

Figure 1.2 shows a typical flow of activities in a risk analysis, leading from problem formulation to decision. This section and those that follow provide more detail on each activity.

1.2.1 Identifying the risks

Risk identification is the first step in a complete risk analysis, given that the objectives of the decision-maker have been well defined. There are a number of techniques used to help formalise the identification of risks. This part of a formal risk analysis will often prove to be the most informative and constructive element of the whole process, improving company culture by encouraging greater team effort and reducing blame, and should be executed with care. The organisations participating in a formal risk analysis should take pains to create an open and blameless environment in which expressions of concern and doubt can be openly given.

Figure 1.2 The risk analysis process (a flowchart running from risk identification and question definition, through model design and simulation, to review of results and reporting).
Prompt lists

Prompt lists provide a set of categories of risk that are pertinent to the type of project under consideration or the type of risk being considered by an organisation. The lists are used to help people think about and identify risks. Sometimes different types of list are used together to further improve the chance of identifying all of the important risks that may occur. For example, in analysing the risks to some project, one prompt list might look at various aspects of the project (e.g. legal, commercial, technical, etc.) or types of task involved in the project (design, construction, testing). A project plan and a work breakdown structure, with all of the major tasks defined, are natural prompt lists. In analysing the reliability of some manufacturing plant, a list of different types of failure (mechanical, electrical, electronic, human, etc.) or a list of the machines or processes involved could be used. One could also cross-check with a plan of the site or a flow diagram of the manufacturing process. Check lists can be used at the same time: these are a series of questions one asks as a result of experience of previous problems or opportune events.

A prompt list will never be exhaustive but acts as a focus of attention in the identification of risks. Whether a risk falls into one category or another is not important, only that the risk is identified. The following list provides an example of a fairly general project prompt list. There will often be a number of subsections for each category: administration; project acceptance; commercial; communication; environmental; financial; knowledge and information; legal; management; partner; political; quality; resources; strategic; subcontractor; technical.

The identified risks can then be stored in a risk register, described in Section 1.6.
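A minimal risk register entry combines the three elements of a risk defined earlier (scenario, probability and impact) with a prompt-list category. The sketch below is one possible layout; the field names and example risks are hypothetical, not prescribed by the book.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    """One risk register entry: the three elements of a risk plus a
    prompt-list category. Field names are illustrative choices."""
    scenario: str
    category: str       # e.g. "technical", "legal", "commercial"
    probability: float  # chance the event occurs
    impact: float       # cost if it does occur (a fixed value here)

    @property
    def expected_impact(self) -> float:
        return self.probability * self.impact

# A tiny, invented register for a hypothetical project.
register = [
    Risk("Key subcontractor insolvency", "commercial", 0.05, 400_000),
    Risk("Design fails certification test", "technical", 0.20, 150_000),
    Risk("Permit delayed by appeal", "legal", 0.30, 60_000),
]

# Rank risks by expected impact to focus management attention.
for r in sorted(register, key=lambda r: r.expected_impact, reverse=True):
    print(f"{r.category:10s} {r.scenario}: {r.expected_impact:,.0f}")
```

Note that ranking by expected impact is only a first-pass triage: as the rest of the chapter argues, the full distribution of impacts (and the decision-maker's attitude to them) matters as well.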
1.2.2 Modelling the risk problem and making appropriate decisions

This book is concerned with the modelling of identified risks and how to make decisions from those models. In this book I try not to offer too many modelling rules. Instead, I have focused on techniques that I hope readers will be able to put together as necessary to produce a good model of their problem. However, there are a few basic principles that are worth adhering to. Morgan and Henrion (1990) offer the following excellent "ten commandments" in relation to quantitative risk and policy analysis:

1. Do your homework with literature, experts and users.
2. Let the problem drive the analysis.
3. Make the analysis as simple as possible, but no simpler.
4. Identify all significant assumptions.
5. Be explicit about decision criteria and policy strategies.
6. Be explicit about uncertainties.
7. Perform systematic sensitivity and uncertainty analysis.
8. Iteratively refine the problem statement and the analysis.
9. Document clearly and completely.
10. Expose to peer review.

The responses to correctly identified and evaluated risks are many, but generally fall into the following categories: increase (the project plan may be overly cautious); do nothing (because it would cost too much or there is nothing that can be done); collect more data (to better understand the risk); add a contingency (an extra amount to budget, deadline, etc., to allow for the possibility of the risk); reduce (e.g. build in redundancy, take a less risky approach); share (e.g. with a partner or contractor, providing they can reasonably handle the impact); transfer (e.g. insure, back-to-back contract); eliminate (e.g. do it another way); cancel the project.

This list can be helpful in thinking of possible responses to identified risks. It should be borne in mind that these risk responses might in turn carry secondary risks. Fall-back plans should be developed to deal with risks that are identified and not eliminated.
If done well in advance, they can help the organisation react efficiently, calmly and in unison in a situation where blame and havoc might normally reign.

1.3 Risk Management Options

The purpose of risk analysis is to help managers better understand the risks (and opportunities) they face and to evaluate the options available for their control. In general, risk management options can be divided into several groups.

Acceptance (Do nothing)

Nothing is done to control the risk or one's exposure to that risk. This is appropriate for risks where the cost of control is out of proportion with the risk. It is usually appropriate for low-probability, low-impact risks and opportunities, of which one normally has a vast list, but you may be missing some high-value risk mitigation or avoidance options, especially where they control several risks at once. If the chosen response is acceptance, some considerable thought should be given to risk contingency planning.

Increase

You may find that you are already spending considerable resources to manage a risk that is excessive compared with the level of protection that it affords you. In such cases, it is logical to reduce the level of protection and allocate the resources to manage other risks, thereby achieving a superior overall risk efficiency. Examples are: remove a costly safety regulation for nuclear power plants that affects a risk that would otherwise still be minuscule; cease the requirement to test all slaughtered cows for BSE and use the saved money for hospital upgrades. It may be logical but nonetheless politically unacceptable: there are not too many politicians or CEOs who want to explain to the public that they've just authorised less caution in handling a risk.

Get more information

A risk analysis can describe the level of uncertainty there is about the decision problem (here we use uncertainty as distinct from inherent randomness).
Uncertainty can often be reduced by acquiring more information (whereas randomness cannot). Thus, a decision-maker can determine that there is too much uncertainty to make a robust decision and request that more information be collected. Using a risk analysis model, the risk analyst can advise the least-cost method of collecting the extra data that would be needed to achieve the required level of precision. Value-of-information arguments (see Section 5.4.5) can be used to assess how much, if any, extra information should be collected.

Avoidance (Elimination)

This involves changing a method of operation, a project plan, an investment strategy, etc., so that the identified risk is no longer relevant. Avoidance is usually employed for high-probability, high-impact type risks. Examples are: use a tried and tested technology instead of the new one that was originally envisaged; change the country location of a factory to avoid political instability; scrap the project altogether. Note that there may be a very real chance of introducing new (and perhaps much more important) risks by changing your plans.

Reduction (Mitigation)

Reduction involves a range of techniques, which may be used together, to reduce the probability of the risk, its impact or both. Examples are: build in redundancy (standby equipment, back-up computer at a different location); perform more quality tests or inspections; provide better training to personnel; spread the risk over several areas (portfolio effect). Reduction strategies are used for any level of risk where the remaining risk is not of very high severity (very high probability and impact) and where the benefits (the amount by which the risk is reduced) outweigh the reduction costs.

Contingency planning

These are plans devised to optimise the response to risks should they occur. They can be used in conjunction with acceptance and reduction strategies.
A contingency plan should identify individuals who take responsibility for monitoring the occurrence of the risk, and/or identified risk drivers, for changes in the risk's probability or possible impact. The plan should identify what to do, who should do it and in which order, the window of opportunity, etc. Examples are: have a trained firefighting team on site; have a pre-prepared press release; have a visible phone list (or email distribution list) of whom to contact if the risk occurs; reduce police and emergency service leave during a strike; fit lifeboats on ships.

Another response to an identified risk is to add some reserve (buffer) to cover the risk should it occur. This is appropriate for small to medium impact risks. Examples are: allocate extra funds to a project; allocate extra time to complete a project; have cash reserves; have extra stock in shops for a holiday weekend; stockpile medical and food supplies.

Insurance

Essentially, this is a risk reduction strategy, but it is so common that it is worth mentioning separately. If an insurance company has done its numbers correctly, in a competitive market you will pay a little above the expected cost of the risk (i.e. probability × expected impact should the risk occur). In general, we therefore insure against risks that have an impact outside our comfort zone (i.e. where we value the risk higher than its expected value). Alternatively, you may feel that your exposure is higher than that of the average policy purchaser, in which case insurance may cost less than your expected cost and therefore be extremely attractive.

Risk transfer

This involves manipulating the problem so that the risk is transferred from one party to another. A common method of transferring risk is through contracts, where some form of penalty is attached to a contractor's performance. The idea is appealing and used often but can be very inefficient.
Examples are: penalty clause for running over the agreed schedule; performance guarantee of a product; lease a maintained building from the builder instead of purchasing; purchase an advertising campaign from some media body or advertising agency with payment contingent on some agreed measure of success.

You can also consider transferring risks to you, where there is some advantage to relieving another party of a risk. For example, if you can guarantee a second party against some small risk resulting from an activity you wish to take that provides you with much greater benefit than the other party's risk, the second party may remove its objection to your proposed activity.

1.4 Evaluating Risk Management Options

The manager evaluating the possible options for dealing with a defined risk issue needs to consider many things:

- Is the risk assessment of sufficient quality to be relied upon?
- How sensitive is the ranking of each option to model uncertainties?
- What are the benefits relative to the costs associated with each risk management option?
- Are there any secondary risks associated with a chosen risk management option?
- How practical will it be to execute the risk management option?

Is the risk assessment of sufficient quality to be relied upon? (See Chapter 3.)

How sensitive is the ranking of each option to model uncertainties? On this last point, we almost always would like to have better data, or greater certainty about the form of the problem: we would like the distribution of what will happen in the future to be as narrow as possible. However, a decision-maker cannot wait indefinitely for better data and, from a decision-analytic point of view, may quickly reach the point where the best option has been determined and no further data (or perhaps only a very dramatic change in knowledge of the problem) will make another option preferable.

Figure 1.3 Different possible outputs compared with a threshold T.
This concept is known as decision sensitivity. For example, in Figure 1.3 the decision-maker considers any output below a threshold T (shown with a dashed line) to be perfectly acceptable (perhaps this is a regulatory threshold or a budget). The decision-maker would consider option A to be completely unacceptable and option C to be perfectly fine, and would only need more information about option B to be sure whether it was acceptable or not, in spite of all three having considerable uncertainty.

1.5 Inefficiencies in Transferring Risks to Others

A common method of managing risks is to force or persuade another party to accept the risk on your behalf. For example, an oil company could require that a subcontractor welding a pipeline accept the costs to the oil company resulting from any delays they incur or any poor workmanship. The welding company will, in all likelihood, be far smaller than the oil company, so possible penalty payments would be catastrophic. The welding company will therefore value the risk as very high and will require a premium greatly in excess of the expected value of the risk. On the other hand, the oil company may be able to absorb the risk impact relatively easily, so would not value the risk as highly. The difference in the utility of these two companies is shown in Figures 1.4 to 1.7, which demonstrate that the oil company will pay an excessive amount to eliminate the risk. A far more realistic approach to sharing risks is through a partnership arrangement. A list of risks that may impact on the various parties involved in the project is drawn up, and for each risk one then asks: How big is the risk? What are the risk drivers? Who is in control of the risk drivers? Who has the experience to control them? Who could absorb the risk impacts? How can we work together to manage the risks?

Figure 1.4 The contractor's utility function is highly concave over the money gain/loss range in question.
That means, for example, that the contractor would value a loss of 100 units of money (e.g. $100 000) as a vastly larger loss in absolute utility terms than a gain of $100 000 might be.

Figure 1.5 Over that same money gain/loss range, the oil company has an almost exactly linear utility function.

The contractor, required to take on a risk with an expected value of -$60 000, would value this as -X utiles. To compensate, the contractor would have to charge an additional amount well in excess of $100 000. The oil company, on the other hand, would value -$60 000 in rough balance with +$60 000, so will be paying considerably in excess of its own valuation of the risk to transfer it to the contractor.

Figure 1.6 Imagine the risk has a 10 % probability of occurring and its impact would be -$300 000, to give an expected value of -$30 000. If $300 000 is the total capital value of the contractor, it won't much matter to the contractor whether the risk impact is $300 000 or $3 000 000; they still go bust. This is shown by the shortened utility curve and the horizontal dashed line for the contractor.

What arrangement would efficiently allocate the risk impacts and rewards for good risk management? Can we insure, etc., to share risks with outsiders? The more one can allocate ownership of risks, and opportunities, to those who control them the better, up to the point where the owner could not reasonably bear the risk impact where others can. Answering the questions above will help you construct a contractual arrangement that is risk efficient, workable and tolerable to all parties.

Figure 1.7 In this situation, the contractor now values any risk with an impact that exceeds its capital value at a level that is less than the oil company's (shown as "Discrepancy").
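The asymmetry between the two utility functions can be made concrete with certainty equivalents. The exponential utility model and both risk-tolerance figures below are my own illustrative assumptions; the book does not prescribe a particular utility function:

```python
import math

def certainty_equivalent(outcomes, probs, risk_tolerance):
    """Certainty equivalent of a gamble under exponential utility
    U(x) = 1 - exp(-x / R). A small risk tolerance R means the utility
    curve is strongly concave (strong aversion to losses)."""
    expected_u = sum(p * (1 - math.exp(-x / risk_tolerance))
                     for x, p in zip(outcomes, probs))
    return -risk_tolerance * math.log(1 - expected_u)

# The risk of Figure 1.6: a 10 % chance of a $300k loss (amounts in $k)
outcomes, probs = [-300.0, 0.0], [0.10, 0.90]
ev = sum(p * x for x, p in zip(outcomes, probs))  # expected value: -30.0

# Assumed risk tolerances: $200k for the small contractor,
# $10M for the large oil company (hypothetical figures)
ce_contractor = certainty_equivalent(outcomes, probs, risk_tolerance=200)
ce_oil = certainty_equivalent(outcomes, probs, risk_tolerance=10_000)

print(f"Expected value: {ev:.1f} $k")
print(f"Contractor's certainty equivalent: {ce_contractor:.1f} $k")
print(f"Oil company's certainty equivalent: {ce_oil:.1f} $k")
```

Under these assumptions the contractor values the risk at roughly -$60k while the oil company values it near its -$30k expected cost, mirroring the text's point that the oil company pays nearly twice its own valuation to transfer the risk.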
It may mean that the contractor can offer a more competitive bid than another, larger contractor who would feel the full risk impact, but the oil company will not have covered the risk it had hoped to transfer, and so again will be paying more than it should to offload the risk. Of course, one way to avoid this problem is to require evidence from the contractor that they have the necessary insurance or capital base to cover the risk they are being asked to absorb.

1.6 Risk Registers

A risk register is a document or database that lists each risk pertaining to a project or organisation, along with a variety of information that is useful for the management of those risks. The risks listed in a risk register will have come from some collective exercise to identify risks. The following items are essential in any risk register entry:

- date the register was last modified;
- name of the risk;
- a description of what the risk is;
- a description of why it would occur;
- a description of the factors that would increase or decrease its probability of occurrence or size of impact (risk drivers);
- semi-quantitative estimates of its probability and potential impact;
- P-I scores;
- name of the owner of the risk (the person who will assume responsibility for monitoring the risk and effecting any risk reduction strategies that have been agreed);
- details of the risk reduction strategies it is agreed will be taken (i.e. strategies that will reduce the impact on the project should the risk event occur and/or the probability of its occurrence);
- the reduced impact and/or probability of the risk, given that the above agreed risk reduction strategies have been taken;
- ranking of the risk by scores of the reduced P-I;
- cross-referencing of the risk event to identification numbers of tasks in a project plan, or to areas of operation or regulation where the risk may impact;
- a description of secondary risks that may arise as a result of adopting the risk reduction strategies;
- the action window: the period during which risk reduction strategies must be put in place.

The following items may also be useful to include:

- a description of other optional risk reduction strategies;
- ranking of risks by the possible effectiveness of further risk mitigation [effectiveness = (total decrease in risk)/(cost of risk mitigation action)];
- a fall-back plan in the event the risk event still occurs;
- the name of the person who first identified the risk;
- the date the risk was first identified;
- the date the risk was removed from the list of active risks (if appropriate).

A risk register should include a description of the scale used in the semi-quantitative analysis, as explained in the section on P-I scores. A risk register should also have a summary that lists the top risks (ten is a fairly usual number, but this will vary according to the project or overview level). The "top" risks are those that have the highest combination of probability and impact (i.e. severity), after the reducing effects of any agreed risk reduction strategies have been included. Risk registers lend themselves perfectly to being stored in a networked database. In this way, risks from each project or regulatory body's concerns, for example, can be added to a common database. Then, a project manager can access that database to look at all risks to his or her project.
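The essential register fields map naturally onto a record structure. A minimal sketch follows; the field names are my own, not prescribed by the text, and the P + I severity convention follows the semi-quantitative log-scale scoring described under P-I scores:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class RiskRegisterEntry:
    """One row of a risk register (illustrative field names)."""
    name: str
    description: str                 # what the risk is
    cause: str                       # why it would occur
    drivers: list                    # factors changing P or I
    probability_score: int           # semi-quantitative score, e.g. 1-5
    impact_score: int
    owner: str                       # person monitoring the risk
    reduction_strategies: list       # agreed mitigation actions
    reduced_probability_score: int   # scores after agreed mitigation
    reduced_impact_score: int
    linked_task_ids: list = field(default_factory=list)
    secondary_risks: list = field(default_factory=list)
    action_window: str = ""
    last_modified: date = field(default_factory=date.today)

    def severity(self, reduced: bool = True) -> int:
        """P + I on the log scale used for P-I scores."""
        if reduced:
            return self.reduced_probability_score + self.reduced_impact_score
        return self.probability_score + self.impact_score
```

Sorting a list of entries by `entry.severity()` then yields the "top risks" summary directly.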
The finance director, lawyer, etc., can look at all the risks from any project being managed by their departments, and the chief executive can look at the major risks to the organisation as a whole. What is more, head office has an easy means for assessing the threat posed by a risk that may impact on several projects or areas at the same time. "Dashboard" software can bring the outputs of a risk register into appropriate focus for the decision-makers.

1.6.1 P-I tables

The risk identification stage attempts to identify all risks threatening the achievement of the project's or organisation's goals. It is clearly important, however, that attention is focused on those risks that pose the greatest threat.

Defining qualitative risk descriptions

A qualitative assessment of the probability P of a risk event (a possible event that would produce a negative impact on the project or organisation) and the impact(s) it would produce, I, can be made by assigning descriptions to the magnitudes of these probabilities and impacts. The assessor is asked to describe the probability and impact of each risk, selecting from a predetermined set of phrases such as: nil, very low, low, medium, high and very high. A range of values is assigned to each phrase in order to maintain consistency between the estimates of each risk. An example of the value range that might be given to each phrase in a risk register for a particular project is shown in Table 1.1. Note that in Table 1.1 the value ranges are not evenly spaced. Ideally there is a multiple difference between each range (in this case roughly 3). If the same multiple is applied for the probability and impact scales, we can more easily determine severity scores as described below. The value ranges can be selected to match the size of the project. Alternatively, they can be matched to the effect the risks would have on the organisation as a whole. The drawback in making the definition of each phrase specific to a project is that it becomes very difficult to perform a combined analysis of the risks from all projects in which the organisation is involved.

Table 1.1 An example of the value ranges that could be associated with qualitative descriptions of the probabilities and impacts of a risk on a project.

Category   | Probability (%) | Delay (days) | Cost ($k) | Quality
Very high  | 10-50           | >100         | >1000     | Failure to meet acceptance criteria
High       | 5-10            | 30-100       | 300-1000  | Failure to meet >1 important specification
Medium     | 2-5             | 10-30        | 100-300   | Failure to meet an important specification
Low        | 1-2             | 2-10         | 20-100    | Failure to meet >1 minor specification
Very low   | <1              | <2           | <20       | Failure to meet a minor specification

From a corporate perspective one can describe how a risk affects the health of a company, as shown in Table 1.2.

Table 1.2 An example of the descriptions that could be associated with impacts of a risk on a corporation.

Category      | Description
Catastrophic  | Jeopardises the existence of the company
Major         | No longer possible to achieve business objectives
Moderate      | Reduced ability to achieve business objectives
Minor         | Some business disruption but little effect on business objectives
Insignificant | No impact on business strategy objectives

Visualising a portfolio of risks

A P-I table offers a quick way to visualise the relative importance of all identified risks that pertain to a project (or organisation). Table 1.3 illustrates an example. All risks are plotted on the one table, allowing easy identification of the most threatening risks as well as providing a general picture of the overall riskiness of the project. Risk numbers 13, 2, 12 and 15 are the most threatening in this example. The impact of a project risk that is most commonly considered is a delay in the scheduled completion of the project.
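The value ranges of Table 1.1 can be encoded as a simple lookup that keeps category assignments consistent between assessors. A sketch follows; the lower bounds are my reading of Table 1.1, and the cost bands in particular are partly an assumption:

```python
def categorise(value: float, bands: list) -> str:
    """Return the first category whose lower bound the value reaches.
    Bands must be listed from most to least severe."""
    for name, lower_bound in bands:
        if value >= lower_bound:
            return name
    return bands[-1][0]  # fallback; unreachable if the last bound is 0

# Lower bounds read from Table 1.1 (probability in %, cost in $k)
PROBABILITY_BANDS = [("Very high", 10), ("High", 5), ("Medium", 2),
                     ("Low", 1), ("Very low", 0)]
COST_BANDS = [("Very high", 1000), ("High", 300), ("Medium", 100),
              ("Low", 20), ("Very low", 0)]

print(categorise(7, PROBABILITY_BANDS))   # a 7 % chance -> "High"
print(categorise(150, COST_BANDS))        # a $150k impact -> "Medium"
```

Scanning from most to least severe means each band is implicitly bounded above by the band before it, matching the table's non-overlapping ranges.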
Table 1.3 Example of a P-I table for schedule delay.

However, an analysis may also consider the increased cost of the project resulting from each risk. It might further consider other, less numerically definable impacts on the project, for example: the quality of the final product; the goodwill that could be lost; sociological impacts; political damage; or the strategic importance of the project to the organisation. A P-I table can be constructed for each type of impact, enabling the decision-maker to gain a more rounded understanding of a project's riskiness. P-I tables can also be constructed for the various types of impact of each single risk.

Table 1.4 P-I table for a specific risk.

Table 1.4 illustrates an example where the impacts of schedule delay, T, cost, $, and product quality, Q, are shown for a specific risk. The probability of each impact may not be the same. In this example, the probability of the risk event occurring is high, and hence the probabilities of schedule delay and cost impacts are high, but it is considered that, even if this risk event does occur, the probability of a quality impact is still low. In other words, there is a fairly small probability of a quality impact even when the risk event does occur.

Ranking risks

P-I scores can be used to rank the identified risks. A scaling factor, or weighting, is assigned to each phrase used to describe each type of impact. Table 1.5 provides an example of the type of scaling factors that could be associated with each phrase/impact type combination. In this type of scoring system, the higher the score, the greater is the risk. A base measure of risk is probability x impact. The categorising system in Table 1.1 is on a log scale, so, to make Table 1.5 consistent, we can define the severity of a risk with a single type of impact as

severity = P + I

(the sum of the probability and impact scores), which leaves the severity on a log scale too.
If a risk has k possible types of impact (quality, delay, cost, reputation, environmental, etc.), perhaps with different probabilities for each impact type, we can still combine them into one score as follows:

severity = log10(10^(P1+I1) + 10^(P2+I2) + ... + 10^(Pk+Ik))

where Pi and Ii are the probability and impact scores for the ith impact type.

Table 1.5 An example of the scores that could be associated with descriptive risk categories to produce a severity score.

Category  | Score
Very high | 5
High      | 4
Medium    | 3
Low       | 2
Very low  | 1

The severity scores are then used to determine the most important risks, enabling management to focus resources on reducing or eliminating risks from the project in a rational and efficient manner. A drawback to this approach of ranking risks is that the process is quite dependent on the granularity of the scaling factors that are assigned to each phrase describing the risk impacts. If we have better information on probability or impact than the scoring system would allow, we can assign a more accurate (non-integer) score. In the scoring regime of Table 1.5, for example, a high-severity risk could be defined as having a score higher than 7, and a low-severity risk as having a score lower than 5. Given the crude scaling used, risks with a severity of exactly 7 may require further investigation to determine whether they should be categorised as high severity. Table 1.6 shows how this segregates the risks shown in a P-I table into the three regions (high, medium and low severity). P-I scores for a project provide a consistent measure of risk that can be used to define metrics and perform trend analyses. For example, the distribution of severity scores for a project gives an indication of the overall "amount" of risk exposure. More complex metrics can be derived using severity scores, allowing risk exposure to be normalised and compared with a baseline status. These permit trends in risk exposure to be identified and monitored, giving valuable information to those responsible for controlling the project.
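The severity arithmetic can be sketched directly with the scores of Table 1.5. The log10 combination across impact types is my reading of how the text's log-scale severity extends to several impacts, and the example scores are invented:

```python
import math

def severity_single(p_score: int, i_score: int) -> float:
    """Severity on the log scale: P + I, since log(P * I) = log P + log I."""
    return p_score + i_score

def severity_combined(impacts: list) -> float:
    """Combine k impact types, each with its own (P, I) scores, by
    summing on the linear scale and returning to the log scale."""
    return math.log10(sum(10 ** (p + i) for p, i in impacts))

def band(severity: float) -> str:
    """Crude segregation into the three regions of Table 1.6."""
    if severity > 7:
        return "High severity"
    if severity < 5:
        return "Low severity"
    return "Medium severity"

# A risk with high P/high I on schedule, high P/medium I on cost,
# and a low probability of a medium quality impact (invented scores):
s = severity_combined([(4, 4), (4, 3), (2, 3)])
print(round(s, 2), band(s))
```

Note that the combined score is dominated by the largest P + I term, which is the behaviour one wants from a log-scale severity: a single big impact type is not washed out by several trivial ones.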
Efficient risk management with severity scores

Efficient risk management seeks to achieve the maximum reduction in risk for a given amount of investment (of people, time, money, restriction of liberty, etc.). Thus, we need to evaluate in some sense the ratio (reduction in risk)/(investment to achieve reduction). If you use the log scale for severity described here, this would equate to calculating

(10^severity(before mitigation) - 10^severity(after mitigation)) / (cost of mitigation)

since the log-scale severity scores must be converted back to a linear scale before the reduction in risk can be taken. The risk management options that provide the greatest efficiency should logically be preferred, all else being equal. Inherent risks are the risk estimates before accounting for any mitigation efforts. They can be plotted against a guiding risk response framework where the P-I table is split, covered by overlapping areas of avoid, control, transfer and accept, as shown in Figure 1.8 (a P-I graph for inherent risks):

- "Avoid" applies where an organisation would be accepting a high-probability, high-impact risk without any compensating benefits.
- "Control" applies usually to high-probability, low-impact risks, normally associated with repetitive actions, and therefore usually managed through better internal processes.
- "Transfer" applies to low-probability, high-impact risks, usually managed through insurance or other means of transferring the risk to parties better capable of absorbing the impact.
- "Accept" applies to the remaining low-probability, low-impact risks, on which it may not be cost effective to focus too much attention.

Figure 1.9 (a P-I graph for residual risks) plots residual risks after any implemented risk mitigation strategies and tracks the progress in managing the residual risks compared with the previous year using arrows. Grey letters represent the status of the risk last year if it is different. A dashed arrow pointing out of the graph means that the risk has been avoided.
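The efficiency ratio for comparing mitigation options can be sketched as follows; the two options, their severities and their costs are invented for illustration:

```python
def mitigation_efficiency(severity_before: float,
                          severity_after: float,
                          cost: float) -> float:
    """(reduction in risk) / (investment): de-log the log-scale
    severity scores before taking the difference."""
    return (10 ** severity_before - 10 ** severity_after) / cost

# Hypothetical options for the same inherent risk (severity 8.0);
# costs are in the same money units throughout
options = {
    "extra weld inspections": mitigation_efficiency(8.0, 6.5, cost=50.0),
    "penalty clause":         mitigation_efficiency(8.0, 7.5, cost=10.0),
}

best = max(options, key=options.get)
print(best)  # -> "penalty clause"
```

Note that the cheaper option wins here despite removing less risk, because efficiency divides the (linear-scale) risk reduction by the cost; all else being equal, the most efficient option is preferred.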
An enhancement to the residual risk graph that you might like to add is to plot each risk as a circle whose radius reflects how comfortable you are in dealing with the residual risk - for example, perhaps you have handled the occurrence of similar risks before and minimised their impact through good management, or perhaps they got out of hand. A small circle represents risks that one is comfortable managing, and a large circle represents the opposite, so the less manageable risks stand out in the plot.

Chapter 2 Planning a risk analysis

In order to plan a risk analysis properly, you'll need to answer a few questions: What do you want to know and why? What assumptions are acceptable? What is the timing? Who is going to do the risk analysis? I'll go through each of these in turn.

2.1 Questions and Motives

The purpose of a risk analysis is to provide information to help make better decisions in an uncertain world. A decision-maker has to work with the risk analyst to define precisely the questions that need answering. You should consider a number of things:

1. Rank the questions that need answering from "critical" down to "interesting". Often a single model cannot answer all questions, or has to be built in a complicated way to answer several questions, so a common recognition of the extra effort needed to answer each question going down the list helps determine a cut-off point.

2. Discuss with the risk analyst the form of the answer. For example, if you want to know how much extra revenue might be made by buying rather than leasing a vessel, you'll need to specify a currency, whether this should be expressed as a percentage or in actual currency, and whether you want just the mean (which can make the modelling a lot easier) or a graph of the distribution. Explain what statistics you need and to what accuracy (e.g.
asking for the 95th percentile to the nearest $1000), as this will help the risk analyst save time or figure out that an unusual approach might be needed to get the required accuracy.

3. Explain what arguments will be based on these outputs. I am of the view that this is a key breakdown area, because a decision-maker might ask for specific outputs and then put them together into an argument that is probabilistically incorrect. Much embarrassment and frustration all round. It is better to explain the arguments (e.g. comparing with the distribution of another potential project's extra revenue) that would be put forward, and find out whether the risk analyst agrees that this is technically correct, before you get started.

4. Explain whether the risk analysis has to sit within a framework. This could be a formal framework, like a regulatory requirement or a company policy, or it could be informal, like building up a portfolio of risk analyses that can be compared on the same footing (for example, we are helping a large chemical manufacturer to build up a combined toxicological, environmental, etc., risk analysis database for their treasure chest of compounds). It will help the risk analyst ensure the maximum level of compatibility - e.g. that the same base assumptions are used between risk analyses.

5. Explain the target audience. We write reports on all our risk analyses, of course, but sometimes there can be several versions: the executive summary; the main report; and the technical report with all the formulae and guides for testing. Often, others will want to run the model and change parameters, so we make a model version that minimises the ability to mess up the mathematics, and write the code to allow the most flexibility. These days we usually put a VBA user interface on the front to make life easier and perhaps add a reporting facility to compare results. We might add a help file too.
Clients will also sometimes ask us to prepare a PowerPoint presentation. Knowing the knowledge level and focus of each target audience, and knowing at the outset what types of reporting will be needed, saves a lot of time.

6. Discuss any possible hostile reactions. The results of a risk analysis will not always be popular, and when people dislike the answers they start attacking the model (or, if you're unlucky, the modeller). Assumptions are the primary Achilles' heel, as we can argue forever about whether assumptions are right. I talk about getting buy-in for assumptions in Section 5.2. Statistical analysis of data is also rather draining - it usually involves a couple of very technical people with opposing arguments about the appropriateness of a statistical procedure that nobody else understands. The decision to include and exclude certain datasets can also create a lot of tension. The arguments can be minimised, or at least convincingly dismissed, if people likely to be hostile are brought into the analysis process early, or if an external expert is asked to give an independent review.

7. Figure out a timeline. Decision-makers have something of a habit of setting unrealistic deadlines. When these deadlines pass, nothing very dramatic usually happens, as the deadlines are some artificial internal confection. Our consultants deal with deadlines all the time, of course, but we openly discuss whether a deadline is really that important because, if we have to meet a tight deadline (and that happens), the quality of the risk analysis may be lower than would have been achievable with more time. The decision-maker has to be honest about time limits and decide whether it is worth postponing things for a bit.

8. Figure out the priority level. The risk analyst might have other work to juggle too.
The project might be of high importance and justify pulling off other resources to help with the analysis, or instructing others in the organisation to set aside time to provide good-quality input.

9. Decide on how regularly the decision-maker and risk analyst will meet. Things change and the risk analysis may have to be modified, so find that out sooner rather than later.

2.2 Determine the Assumptions that are Acceptable or Required

If a risk analysis is to sit within a certain framework, as discussed above, it may well have to comply with a set of common assumptions to allow meaningful comparisons between the results of different analyses. Sometimes it is better not to revise some assumptions for a new analysis, because doing so makes comparison impossible. You can often see a similar problem with historic data, e.g. in calculating crime or unemployment statistics: the basis for these statistics keeps changing, making it impossible to know whether the problem is getting better or worse. In a corporate environment there will be certain base assumptions used for things like interest and exchange rates, production capacity and energy price. The same assumptions should be used in all models. In a risk analysis world these should be probabilistic forecasts, but they are nonetheless often fixed-point values. Oil companies, for example, have the challenging job of figuring out what the oil price might be in the future. They can get it very wrong, so often take a low price for planning purposes, e.g. $16 a barrel, which in 2007 might seem rather unlikely for the future. The risk analyst working hard on getting everything else really precise could find such an assumption irritating, but it allows consistency between analyses where oil price forecast uncertainty could be so large as to mask the differences between investment opportunities.
Some assumptions we make are conservative, meaning that if, for example, we need a certain percentile of the output to be above X before we accept the risk as acceptable, then a conservative assumption will bias the output to lower values. Thus, if the output still gives numbers that say the risk is acceptable, we know we are on pretty safe ground. Conservative assumptions are most useful as a sensitivity tool to demonstrate that one has not taken an unacceptable risk, but they are to be avoided whenever possible because they run counter to the principle of risk analysis, which is to give an unbiased report of uncertainty.

2.3 Time and Timing

We get a lot of requests to help "risk" a model. The potential client has spent a few months working on a problem, building up a cashflow model, etc., and the decision-makers decide the week before the board meeting that they really should have a risk analysis done. If done properly, risk analysis is an integral part of the planning of a project, not an add-on at the end. One of the prime reasons for doing risk analyses is to identify risks and risk management strategies so that the decision-makers can decide how the risks can be managed, which could well involve a revision of the project plan. That can save a lot of time and money on a project. If risk analysis is added on at the end, you lose all that potential benefit. The data collection efforts required to produce a fixed-value model of a project are little different from the efforts required for a risk analysis, so adding a risk analysis on at the end is inefficient and delays a project, as the risk analyst has to go back over previous work. We advocate that a risk analyst write the report as the model develops. This helps keep track of what one is doing and makes it easier to meet the report submission deadline at the end. I also like to write down my thinking because it helps me spot any mistakes early.
Finally, try to allow the risk analyst enough time to check the model for errors and get it reviewed. Chapter 16 offers some advice on model validation.

2.4 You'll Need a Good Risk Analyst or Team

If the risk analysis is a one-off and the outcome is important to you, I recommend you hire in a consultant risk analyst. Well, I would say that, of course, but it does make a lot of sense. Consultants are expensive on a daily basis but, certainly at Vose Consulting, we are far faster (my guess is over 10 times faster than a novice): we know what we're doing and we know how to communicate and organise effectively. Please don't get a bright person within your organisation, install some risk analysis software on their computer and tell them to get on with the job. It will end in tears. The publishers of risk analysis software (Crystal Ball, @RISK, Analytica, Risk+, PERTmaster, etc.) have made risk analysis modelling very easy to implement from a software viewpoint. The courses they teach show you how to drive the software and reinforce the notion that risk analysis modelling is pretty easy (Vose Consulting courses generally assume you have already attended a software familiarisation course). In a lot of cases, risk analysis is in fact pretty easy, as long as you avoid some common basic errors discussed in Section 7.4. However, it can also become quite tricky, for sometimes subtle reasons, and you should have someone who understands risk analysis well enough to be able to recognise and handle the trickier models. Knowing how to use Excel won't make you an accountant (but it's a good first step), and learning how to use risk analysis software won't make you a risk analyst (but it's also a good first step). There are still very few tertiary courses in risk analysis, and those courses tend to be highly focused on particular areas (financial modelling, environmental risk assessment, etc.).
I don't know of any tertiary courses that aim to produce professional risk analysts who can work across many disciplines. There are very few people who could say they are qualified to be a risk analyst. This makes it pretty tough to know where to search and to be sure you have found someone who will have the knowledge to analyse your risks properly. It seems that industry-specific risk analysts also have little awareness of the narrowness of their knowledge: a little while ago we advertised for two highly qualified actuarial and financial risk analysts with several years' experience and received a large number of applications from people who were risk analysts in toxicology, microbial, environmental and project areas with almost no overlap in the required skill sets.

2.4.1 Qualities of a risk analyst

I often get asked by companies and government agencies what sort of person they should look for to fill a position as a risk analyst. In my view, candidates should have the following characteristics:

- Creative thinkers. Risk analysis is about problem-solving. This is at the top of my list and is the rarest quality.
- Confident. We often have to come up with original solutions. I've seen too many pieces of work that have followed some previously published method because it is "safer". We also have to present to senior decision-makers and maybe defend our work in front of hostile stakeholders or a court.
- Modest. Too many risk analyses fail to meet their requirements because of a risk analyst who thought she/he could do it without help or consultation.
- Thick-skinned. Risk analysts bring together a lot of disparate information and ideas, sometimes conflicting, sometimes controversial, and we produce outputs that are not always what people want to see, so we have to be prepared for a fair amount of enthusiastic criticism.
- Communicators. We have to listen to a lot of people and present ideas that are new and sometimes difficult to understand.
- Pragmatic.
Our models could always be better with more time, data and resources, but decision-makers have deadlines.
- Able to conceptualise. There are a lot of tools at our disposal that have been developed in various fields of risk, so the risk analyst needs to read widely and be able to extrapolate an idea from one application to another.
- Curious. Risk analysts need to keep learning.
- Good at mathematics. Take a look at Part 2 of this book to get a feel for the level. It will depend on the area: project risk requires more intuition and perseverance but less mathematics, insurance and finance require intuition and high mathematical skills, and food safety requires medium levels of everything.
- A feel for numbers. It is one thing to be good at mathematics, but we also have to have an idea of where the numbers should lie, because it (a) helps us check the work and (b) allows us to know where we can take shortcuts.
- Finishers. Some people are great at coming up with ideas, but lose interest when it comes to implementing them. Risk analysts have to get the job done.
- Cynical. We have to maintain a healthy cynicism about published work and about how good our subject matter experts are.
- Pedantic. When developing probability models, one needs to be very precise about exactly what each variable represents.
- Careful. It is easy to make mistakes.
- Social. We have to work in teams.
- Neutral. Our job is to produce an objective risk analysis. A project manager is not usually ideal to perform the project risk analysis because it may reflect on his/her ability to manage and plan. A scientist is not ideal if she/he has a pet theory that could slant the approach taken.

It's a demanding list and indicates, I think, that risk analysis should be performed by people of high skill levels who are fairly senior and in a respected position within a company or agency.
It is also rather unlikely that you will find all these qualities in one person: the best risk analysis units with which we work are composed of a number of individuals with complementary skills and strengths.

2.4.2 Suitable education

I interviewed a statistics student a couple of months back. This person was just finishing a PhD and had top grades throughout from a very reputable school. I asked a pretty simple question about estimating a prevalence and got a vague answer about how this person would perform the appropriate test and report the confidence interval, but the student couldn't tell me what that test might be (this is a really basic Statistics 101-type question). I offered some numbers and asked what the bounds might roughly be, but the interviewee had absolutely no idea. With each question it became very clear that this person had been taught a lot of theory but had no feel for how to use it, and no sense of numbers. We didn't hire. I interviewed another person who had written a very sophisticated traffic model using discrete event simulation (which we use a fair bit) that was helping decide how to manage boat traffic. The model predicted that putting in traffic lights on the narrow part of some waterway would produce a horrendous number of crashes at the traffic light queues, easily outweighing the crashes avoided by letting vessels pass each other in the narrow part of the waterway. Conclusion: no traffic lights. That seemed strange to me and, after some thought, the interviewee explained it was probably because the model used a probability of crashing that was inversely proportional to the distance between the vessels, and vessels in a queue are very close, so the model generated lots of crashes.
But they are also barely moving, I pointed out, so the probability of a collision will be lower at a given distance for vessels at the lights than for vessels passing each other at speed, and any contact between waiting vessels would have a negligible effect. The modeller responded that the probability could be changed. We didn't hire that person either, because the modeller had never stepped back and asked "does this make sense?". I interviewed a student who was just finishing a Masters degree and was writing up a thesis on applying probability models from physics to financial markets. This person explained that studying had become rather dull because it was always about learning what others had done, but the thesis was a different story because there was a chance to think for oneself and come up with something new. The student was very enthusiastic, had great mathematics and could really explain to me what the thesis was about. We hired, and I have no regrets. A prospective hire for a risk analysis position will need some sort of quantitative background. I think the best candidates tend to have a background that combines attempting to model the real world with using the results to make decisions. In these areas, approximations and the tools of approximation are embraced as necessary and useful, and there is a clear purpose to modelling that goes beyond the academic exercise of producing the model itself. Applied physics, engineering, applied statistics and operations research are all very suitable. Applied physics is the most appealing of all of them (I may be biased, I studied physics as an undergraduate) because in physics we hypothesise how the world might work, describe the theory with mathematics, make predictions and figure out an experiment that will challenge the theory, perform the experiment, collect and analyse data and conclude whether our theory was supported.
Learning this basic thinking is extraordinarily valuable: risk analysis follows much of the same process, uses many of the same modelling and statistical techniques, makes approximations and should critically review scientific data when relevant (most published papers describe studies that were designed to show supportive evidence for someone's theory). Pure mathematics and classical statistics are not that great: pure mathematics is too abstract; we find that pure statistics teaching is very constrained, and encourages formulaic thinking and reaching for a computer rather than a pen and paper. The schools also don't seem to emphasise communication skills very much. It's a shame, because the statistician has so much of the basic knowledge requirements. Bayesian statistics is somewhat better - it does not have such a problem with subjective estimates, its techniques are more conducive to risk analysis and it's a newer field, so the teaching is somewhat less staid. Don't be swayed by a Six Sigma black belt qualification - the ideas behind Six Sigma certainly have merit, but the technical knowledge gained to get a black belt is quite basic and the production-line teaching seems to be at the expense of in-depth understanding and creativity. The main things you will need to look out for are a track record of independent thinking, strong communication skills and some reasonable grasp of probability modelling. The more advanced techniques can be learned from courses and books.

2.4.3 Our team

I thought it might be helpful to give you a brief description of how we organise our teams. If your organisation is large enough to need 10 or more people in a risk analysis team, you might get some ideas from how we operate. Vose Consulting has quite a mixture of people, roughly split into three groups, and we seem to have hired organically to match people's skills and characters to the roles of these groups.
I love to learn, teach, develop new talent and dream up new ideas, so my team is made up of conceptual thinkers with great mathematics, computing and researching skills. They are young and very intelligent, but are too young for us to put them into the most stressful jobs, so part of my role is to give them challenging work and the confidence to meet consulting deadlines by solving their problems with them. My office is the nursery for Huybert's team, to which they can migrate once they have more experience. Huybert is an ironman triathlon competitor with boundless energy. His consulting group fly around everywhere solving problems, writing reports and meeting deadlines. They are real finishers, and my team provide as much technical support as they need (though they are no slouches: that team has four quantitative PhDs and nobody with less than a Masters degree). Timour is a very methodical, deep thinker. Unlike me, he tends not to say anything unless he has something to say. His programming group writes our commercial software like ModelRisk, requiring a long-term development view, but he has a couple of people who write bespoke software for our clients to strict deadlines too. When we get a consulting enquiry, the partners will discuss whether we have the time and knowledge to do the job, who it would involve and who would lead it. Then the prospective lead is invited to talk with us and the client about the project and then takes over. The lead consultant has to agree to do the project, his/her name and contact details are put on the MOU and he/she remains in charge and responsible to the client throughout the project. A partner will monitor progress, or a partner could be the lead consultant. The lead consultant can ask anyone within the company for advice, for manpower assistance, to review models and reports, to write bespoke software for the client, to be available for a call with the client, etc.
I like this approach because it means we spread around the satisfaction of a job well done, it encourages responsibility and creativity, it emphasises a flat company structure, we all get to know what others in the company can do, and because poor performance in a project would be the company's failure, not one individual's. I read Ricardo Semler's book Maverick a few months ago and loved it for showing me that much of what we practise in our small company can work in a company as large as Semco. Semco also works in groups that mix around depending on the project and has a flat hierarchy. We give our staff a lot of responsibility, so we also assume that they are responsible: we give them considerable freedom over their working hours and practices, we expect them to keep expenses at a sensible level but don't set daily rates, etc. Staff choose their own computers, can buy a printer, etc., without having to get approval. The only thing we have no flexibility on is honesty.

Chapter 3 The quality of a risk analysis

We've seen a fair number of quantitative risk analyses that are terrible. They might also have been very expensive, taken a long time to complete and used up valuable human resources. In fact, I'll stick my neck out and say the more complex and expensive a quantitative risk analysis is, the more likely it is to be terrible. Worst of all, the people making decisions on the results of these analyses have little if any idea of how bad they are. These are rather attention-grabbing sentences, but this chapter is small and I would really like you not to skip over it: it could save you a lot of heartache. In our company we do a lot of reviews of models for decision-makers. We'd love to be able to say "it's great, trust the results" a lot more often than we do, and I want to spend this short chapter explaining what, in our experience, goes wrong and what you can do about it.
First of all, to give some motivation for this chapter, I want to show you some of the results of a survey we ran a couple of years ago in a well-developed science-based area of risk analysis (Figure 3.1). The question appears in the title of each pane. Which results do you find most worrying?

3.1 The Reasons Why a Risk Analysis can be Terrible

From Figure 3.1 I think you'll see that there really needs to be more communication between decision-makers and their risk analysts and a greater attempt to work as a team. I see the risk analyst as an important avenue of communication between those "on the ground" who understand the problem at hand and hold the data and those who make decisions. The risk analyst needs to understand the context of the decision question and have the flexibility to be able to find the method of analysis that gives the most useful information. I've heard too many risk analysts complain that they get told to produce a quantitative model by the boss, but have to make the numbers up because the data aren't there. Now doesn't that seem silly? I'm sure the decision-maker would be none too happy to know the numbers are all made up, but the risk analyst is often not given access to the decision-makers to let them know. On the other hand, in some business and regulatory environments they are trying to follow a rule that says a quantitative risk analysis needs to be completed - the box needs ticking. Regulations and guidelines can be a real impediment to creative thinking. I've been in plenty of committees gathered to write risk analysis guidelines, and I've done my best to reverse the tendency to be formulaic. My argument is that in 19 years we have never done the same risk analysis twice: every one has its individual peculiarities.
Yet the tendency seems to be the reverse: I trained over a hundred consultants in one of the big four management consultancy firms in business risk modelling techniques, and they decided that, to ensure that they could maintain consistency, they would keep it simple and essentially fill in a template of three-point estimates with some correlation. I can see their point - if every risk analyst developed a fancy and highly individual model it would be impossible to ensure any quality standard. The problem is, of course, that the standard they will maintain is very low. Risk analysis should not be a packaged commodity but a voyage of reasoned thinking leading to the best possible decision at the time.

[Figure 3.1 Some results of a survey of 39 professional risk analysts working in a scientific field where risk analysis is well developed and applied very frequently. Each pane asks how often a given factor jeopardises the value of an assessment (usually, 50:50, seldom or never): insufficient human resources to complete the assessment; insufficient time to complete the assessment; insufficient data to support the risk assessment; insufficient in-house expertise in the area; insufficient general scientific knowledge of the area.]

I think it is usually pretty easy to see early on in the risk analysis process that a quantitative risk analysis will be of little value. There are several key areas where it can fall down:

1. It can't answer all the key questions.
2. There are going to be a lot of assumptions.
3. There is going to be one or more show-stopping assumptions.
4. There aren't enough good data or experts.

We can get around 1 sometimes by doing different risk analyses for different questions, but that can be problematic when each risk analysis has a different set of fundamental assumptions - how do we compare their results?
For 2 we need to have some way of expressing whether a lot of little assumptions compound to make a very vulnerable analysis: if you have 20 assumptions (and 20 is quite a small number), all pretty good ones - e.g. we think there's a 90 % chance each is correct - and the analysis is only useful if all the assumptions are correct, then we only have a 0.9^20 ≈ 12 % chance that the assumption set is correct. Of course, if this were the real problem we wouldn't bother writing models. In reality, in the business world particularly, we deal with assumptions that are good enough because the answers we get are close enough. In some more scientific areas, like human health, we have to deal with assumptions such as: compound X is present; compound X is toxic; people are exposed to compound X; the exposure is sufficient to cause harm; and treatment is ineffective. The sequence then produces the theoretical human harm we might want to protect against, but if any one of those assumptions is wrong there is no human health threat to worry about. If 3 occurs we have a pretty good indication that we don't know enough to produce a decent risk analysis model, but maybe we can produce two or three crude models under different possible assumptions and see whether we come to the same conclusion anyway. Area 4 is the least predictable, because the risk analyst doing a preliminary scoping can be reassured that the relevant data are available, but then finds out they are not, either because the data turn out to be clearly wrong (we see this a lot), the data aren't what was thought, there is a delay past the deadline in the data becoming available or the data are dirty and need so much rework that it becomes impractical to analyse them within the decision timeframe.
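The compounding of assumptions is easy to check numerically. A minimal sketch (the function name is mine; the 20 assumptions and the 90 % figure are the ones used in the text, and the calculation assumes the assumptions hold or fail independently):

```python
def p_all_correct(p_each: float, n: int) -> float:
    """Probability that an entire set of n independent assumptions,
    each with probability p_each of being correct, is all correct."""
    return p_each ** n

# 20 assumptions, each 90 % likely to be correct:
print(round(p_all_correct(0.9, 20), 3))  # 0.122, i.e. about 12 %
```

Even quite good individual assumptions compound quickly: doubling to 40 such assumptions would leave under a 2 % chance that the set is correct.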
There is a lot of emphasis placed on transparency in a risk analysis, which usually manifests itself in a large report describing the model, all the data and sources, the assumptions, etc., and then finishes with some of the graphical and numerical outputs described in Chapter 5. I've seen reports of 100 or 200 pages that seem far from transparent to me - who really has the time or inclination to read such a document? The executive summary tends to focus on the decision question and numerical results, and places little emphasis on the robustness of the study.

3.2 Communicating the Quality of Data Used in a Risk Analysis

Elsewhere in this book you will find lots of techniques for describing the numerical accuracy that a model can provide given the data that are available. These analyses are at the heart of a quantitative risk analysis and give us distributions, percentiles, sensitivity plots, etc. In this section I want to discuss how we can communicate any impact on the robustness of a model owing to the assumptions behind using data or settling on a model scope and structure. Elsewhere in this book I encourage the risk analyst to write down each assumption that is made in developing equations and performing statistical analyses. We get participants to do the same in the training courses we teach as they solve simple class exercises, and there is a general surprise at how many assumptions are implicit in even the simplest type of equation. It becomes rather onerous to write all these assumptions down, but it is even more difficult to convert the conceptual assumptions underpinning our probability models into something that a reader rather less familiar with probability modelling might understand. The NUSAP (Numeral Unit Spread Assessment Pedigree) method (Funtowicz and Ravetz, 1990) is a notational system that communicates the level of uncertainty for data in scientific analysis used for policy making. The idea is to use a number of experts in the field to score independently the data under different categories. The system is well established as being useful in toxicological risk assessment. I will describe here a generalisation of the idea. Its key attractions are that it is easy to implement and can be summarised into consistent pictorial representations.

Table 3.1 Pedigree matrix for parameter strength (adapted from Boone et al., 2007). Each criterion is scored from 4 (strongest) down to 0 (weakest):

Score 4. Proxy: exact measure of the desired quantity (representative). Empirical: large sample, direct measurements, recent data, controlled experiments. Method: best available practice in well-established discipline (accredited method for sampling/diagnostic test). Validation: compared with independent measurements of the same variable over a long domain, rigorous correction of errors.
Score 3. Proxy: good fit or measure. Empirical: small sample, direct measurements, less recent data, uncontrolled experiments, low non-response rate. Method: reliable and common method; best practice in immature discipline.
Score 2. Proxy: well correlated but not the same thing. Empirical: several expert estimates in general agreement. Method: acceptable method but limited consensus on reliability.
Score 1. Proxy: weak correlation (very large geographical differences). Empirical: one expert opinion, rule-of-thumb estimate. Method: method with unknown reliability. Validation: weak, very indirect validation.
Score 0. Proxy: not clearly connected. Empirical: crude speculation. Method: no discernible rigour. Validation: no validation.
In Table 3.1 I have used the categorisation descriptions of data from van der Sluijs, Risbey and Ravetz (2005), which are: proxy - reflecting how close the data being used are to the ideal; empirical - reflecting the quantity and quality of the data; method - reflecting where the method used to collect the data lies between careful and well established and haphazard; and validation - reflecting whether the acquired data have been matched to real-world experience (e.g. does an effect observed in a laboratory actually occur in the wider world). Each dataset is scored in turn by each expert. The average of all scores is calculated and then divided by the maximum attainable score of 4. For example:

             Expert A   Expert B   Expert C
Proxy            3          3          2
Empirical        2          2          1
Method           4          4          3
Validation       3          3          4

This gives an average score of 2.833. Dividing by the maximum score of 4 gives 0.708. An additional level of sophistication is to allow the experts to weight their level of expertise for the particular variable in question (e.g. 0.3 for low, 0.6 for medium and 1.0 for high, as well as allowing experts to select not to make any comment when it is outside their competence), in which case one calculates a weighted average score. One can then plot these scores together and segregate them by different parts of the analysis if desired, which gives an overview of the robustness of data used in the analysis (Figure 3.2).

[Figure 3.2 Plot of average scores for datasets in a toxicological risk assessment; the datasets (x axis: parameter identification number; y axis: average score) are grouped into exposure, release, treatment effectiveness and toxicity parts of the analysis.]
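The scoring scheme just described can be sketched in a few lines of Python (the function name is mine, not from the book; the optional no-comment facility is omitted for brevity):

```python
def support_score(scores_by_expert, weights=None, max_score=4.0):
    """Average the category scores across experts, optionally weighting
    each expert by self-assessed expertise, then normalise by the
    maximum attainable score so the result lies between 0 and 1."""
    if weights is None:
        weights = [1.0] * len(scores_by_expert)
    total = sum(w * s
                for w, expert in zip(weights, scores_by_expert)
                for s in expert)
    weight_sum = sum(w * len(expert)
                     for w, expert in zip(weights, scores_by_expert))
    return total / weight_sum / max_score

# The worked example from the text: three experts each score the four
# pedigree categories (proxy, empirical, method, validation) out of 4.
expert_a = [3, 2, 4, 3]
expert_b = [3, 2, 4, 3]
expert_c = [2, 1, 3, 4]
print(round(support_score([expert_a, expert_b, expert_c]), 3))  # 0.708
```

Passing weights such as `[1.0, 1.0, 0.3]` reproduces the weighted-average refinement, down-weighting an expert who rates their own expertise as low for this variable.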
Scores can be generally categorised as follows: below 0.2, weak; 0.2-0.4, moderate; above 0.4 up to 0.8, high; above 0.8, excellent. So, for example, Figure 3.2 shows that the toxicity part of the analysis appears to be the weakest, with several datasets in the weak category. We can summarise the scores for each dataset using a kite diagram to give a visual "traffic light": green indicating that the parameter support is excellent, red indicating that it is weak, and one or two levels of orange representing gradations between these extremes. Figure 3.3 gives an example: one works from the centre-point, marking on the axes the weighted fraction of all the experts considering the parameter support to be "excellent", then adds the weighted fraction considering the support to be "high", etc. These points are then joined to make the different colour zones - from green in the centre for "excellent", through yellow and orange, to red in the last category: a kite will be all green if every expert agrees the parameter support is excellent, and all red if they agree it is weak. Plotting these kite diagrams together can give a strong visual representation: a sea of green should give great confidence, a sea of red says the risk analysis is extremely weak. In practice, we'll end up with a big mix of colours, but over time one can get a sense of what colour mix is typical for your field, when an analysis is comparatively weak or strong and when it can be relied upon. The only real impediment to using the system above is that you need to develop a database software tool. Some organisations have developed their own in-house products that are effective but somewhat limited in their ability for reviewing, sorting and tracking.
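A small helper for the verbal banding just described (the band edges are those given in the text; how exact boundary values are assigned is my assumption, as the original is ambiguous there):

```python
def support_band(score: float) -> str:
    """Map a normalised support score (0 to 1) to a verbal category."""
    if score < 0.2:
        return "weak"
    if score <= 0.4:
        return "moderate"
    if score <= 0.8:
        return "high"
    return "excellent"

# The worked example's score of 0.708 lands in the "high" band:
print(support_band(0.708))  # high
```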
Our software developers have it on their "to do" list to make a tool that can be used across an organisation, where one can track the current status of a risk analysis, drill down to see the reasons for the vulnerability of a parameter, etc., so you might like to visit www.vosesoftware.com and see if we've got anywhere yet.

[Figure 3.3 A kite diagram summarising the level of data support the experts believe that a model parameter will have: red (dark) in the outer band = weak; green (light) in the inner band = excellent. The axes are the pedigree criteria, such as proxy and method.]

3.3 Level of Criticality

The categorisation system of Section 3.2 helps determine whether a parameter is well supported, but it can still misrepresent the robustness of the risk analysis. For example, we might have done a food safety microbial risk analysis involving 10 parameters - nine enjoy high or excellent support, and one is suffering weak support. If that weakly supported parameter is defining the dose-response relationship (the probability a random individual will experience an adverse health effect given the number of pathogenic organisms ingested), then the whole risk analysis is jeopardised, because the dose-response is the link between all the exposure pathways and the amount of pathogen involved (often a big model) and the size of human health impact that results. It is therefore rather useful to separate the kite diagrams and other analyses into different categories for the level of dependence the analysis has on each parameter: critical, important or small, for example. A more sophisticated version for separating the level of dependence is to analyse statistically the degree of effect each parameter has on the numerical result; for example, one might look at the difference in the mean of the model output when the parameter distribution is replaced by its 95th and 5th percentiles.
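This percentile-swing idea, scaled by how weakly the parameter is supported, can be sketched as follows (a toy illustration; the function name and the numbers are mine, invented for the example):

```python
def vulnerability(output_swing: float, support_score: float) -> float:
    """Crude vulnerability measure: the swing in the model output mean
    (output mean with the parameter fixed at its 95th percentile minus
    the mean with it at its 5th percentile), scaled by (1 - support
    score), which is 0 for excellent support and 1 for terrible."""
    return abs(output_swing) * (1.0 - support_score)

# Hypothetical example: swinging a dose-response parameter between its
# 5th and 95th percentiles moves the mean predicted illnesses by 1200
# cases, and the parameter's support score is 0.5.
print(vulnerability(1200.0, 0.5))  # 600.0
```

Ranking parameters by this number is one way to decide which kite diagrams belong in the "critical" pile, though the caveats discussed next still apply.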
Taking that range and multiplying it by (1 - the support score), giving 0 for excellent and 1 for terrible, gives one a sense of the level of vulnerability of the output numbers. However, this method suffers other problems. Imagine that we are performing a risk analysis on an emerging bacterium for which we have absolutely no dose-response data, so we use a dataset for a surrogate bacterium that we think will have a very similar effect (e.g. because it produces a similar toxin). We might have large amounts of excellent data for the surrogate bacterium and may therefore have little uncertainty about the dose-response model, so using the 5th and 95th percentiles of the uncertainty about that dose-response model will result in a small change in the output, and multiplying that by (1 - the support score) will under-represent the real uncertainty. A second problem is that we often estimate two or more model parameters from the same dataset; for example, a dose-response model often has two or three parameters that are fitted to data. Each parameter might be quite uncertain, but the dose-response curve can nonetheless be quite stable, so this numerical analysis needs to look at the combined effect of the uncertain parameters as a single entity, which requires a fair bit of number juggling.

3.4 The Biggest Uncertainty in a Risk Analysis

The techniques discussed above have focused on the vulnerability of the results of a risk analysis to the parameters of a model.
When we are asked to review or audit a risk analysis, the client is often surprised that our first step is not to look at the model mathematics and supporting statistical analyses, but to consider what the decision questions are, whether there were a number of assumptions, whether it would be possible to do the analysis a different (usually simpler, but sometimes more complex and precise) way and whether this other way would give the same answers, and to see if there are any means for comparing predictions against reality. What we are trying to do is see whether the structure and scope of the analysis are correct. The biggest uncertainty in a risk analysis is whether we started off analysing the right thing and in the right way. Finding the answer is very often not amenable to any numerical technique because we will not have any alternative to compare against. If we do, it might nonetheless take a great deal of effort to put together an alternative risk analysis model, and a model audit is usually too late in the process to be able to start again. A much better idea, in my view, is to get a sense at the beginning of a risk analysis of how confident we should be that the analysis will be scoped sufficiently broadly, or how confident we are that the world is adequately represented by our model. Needless to say, we can also start rather confident that our approach will be quite adequate and then, once having delved into the details of the problem, find out we were quite mistaken, so it is important to keep revisiting our view of the appropriateness of the model. We encourage clients, particularly in the scientific areas of risk in which we work, to instigate a solid brainstorming session of experts and decision-makers whenever it has been decided that a risk analysis is to be undertaken, or maybe is just under consideration. The focus is to discuss the form and scope of the potential risk analysis. 
The experts first of all need to think about the decision questions, discuss with decision-makers any possible alternatives or supplements to those questions and then consider how they can be answered and what the outputs should look like (e.g. only the mean is required, or some high percentile). Each approach will have a set of assumptions that need to be thought through carefully: What would the effect be if the assumptions are wrong? If we use a conservative assumption and estimate a risk that is too high, are we back to where we started? We need to think about data requirements too: Is the quality likely to be good and are the data easily attainable? We also need to think about software. I was once asked to review a 2-year, $3 million model written entirely in interacting C++ modules - nobody else had been able to figure it out (I couldn't either). When the brainstorming is over, I recommend that you pass around a questionnaire and ask those attending independently to answer something like this:

We discussed three risk analysis approaches (description A, description B, description C). Please indicate your level of confidence (0 = none, 1 = slight, 2 = good, 3 = excellent, -1 = no opinion) in response to the following:

1. What is your confidence that method A, B or C will be sufficiently flexible and comprehensive to answer any foreseeable questions from the management about this risk?
2. What is your confidence that method A, B or C is based on assumptions that are correct?
3. What is your confidence for method A, B or C that the necessary data will be available within the required timeframe and budget?
4. What is your confidence that the method A, B or C analysis will be completed in time?
5. What is your confidence that there will be strong support for method A, B or C among reviewing peers?
6. What is your confidence that there will be strong support for method A, B or C among stakeholders?
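Aggregating the questionnaire responses is straightforward. A sketch under my own conventions (the book prescribes only the 0-3 scale with -1 for "no opinion"; excluding the -1 responses from the average is my assumption):

```python
def mean_confidence(responses):
    """Average confidence for one question and method across experts,
    ignoring -1 ('no opinion') responses. Returns None if nobody
    expressed an opinion."""
    valid = [r for r in responses if r >= 0]
    return sum(valid) / len(valid) if valid else None

# Five experts score, say, method A on question 1
# (0 = none, 1 = slight, 2 = good, 3 = excellent, -1 = no opinion):
print(mean_confidence([2, 3, 1, -1, 2]))  # 2.0
```

Comparing these per-question averages across methods A, B and C gives a quick numerical summary to set beside the discussion itself.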
Asking each brainstorming participant independently will help you attain a balanced view, particularly if the chairperson of that meeting has enforced the discipline of requiring participants not to express their views on the above questions during the meeting (it won't be completely possible, but you are trying to make sure that nobody will be influenced into giving a desired answer). Asking people independently rather than trying to achieve consensus during the meeting will also help remove the overconfidence that often appears when people make a group decision.

3.5 Iterate

Things change. The political landscape in which the decision is to be made can become more hostile or accepting to some assumptions, data can prove better or worse than we initially thought, new data turn up, new questions suddenly become important, the timeframe or budget can change, a risk analysis consultant sees an early model and shows you a simpler way, etc. So it makes sense to go back from time to time over the types of assumption analysis I discussed in Sections 3.2 and 3.3 and to remain open to taking a different approach, even to making as dramatic a change as going from a quantitative to a qualitative risk analysis. That means you (analysts and decision-makers alike) should also be a little guarded about making premature promises, so you have some space to adapt. In our consultancy contracts, for example, a client will usually commission us to do a quantitative risk analysis and tell us about the data they have. We'll probably have had a little look at the data too. We prefer to structure our proposal into stages. In the first stage we go over the decision problem, review any constraints (time, money, political, etc.), take a first decent look at the available data and figure out possible ways of getting to the answer. Then we produce a report describing how we want to tackle the problem and why.
At that stage the client can stop the work, continue with us, do it themselves or maybe hire someone else if they wish. It may take a little longer (usually a day or two), but everyone's expectations are kept realistic, we aren't cornered into doing a risk analysis that we know is inappropriate and clients don't waste their time or money. As consultants, we are in the somewhat privileged position of being able to turn down work that we know would be terrible. A risk analyst employed by a company or government department may not have that luxury. If you, the reader, are a risk analyst in the awkward position of being made to produce terrible risk analyses, perhaps you should show your boss this chapter, or maybe check to see if we have any vacancies.

Chapter 4 Choice of model structure

There is a tendency to settle on the form that a risk analysis model will take too early in the risk analysis process. In part that will be because of a limited knowledge of the available options, but also because people tend not to take a step back and ask themselves what the purpose of the analysis is, and also how it might evolve over time. In this chapter I give a short guide to various types of model used in risk analysis.

4.1 Software Tools and the Models they Build

4.1.1 Spreadsheets

Spreadsheets, and by that I mean Excel these days, are the most natural and the first choice for most people because it is perceived that relatively little additional knowledge is required to produce a risk analysis model. Products like @RISK, Crystal Ball, ModelRisk and many other contenders for their shared crown have made adding uncertainty to a spreadsheet as simple as clicking a few buttons. You can run a simulation and look at the distribution results in a few seconds and a few more button clicks.
Monte Carlo simulation software tools for Excel have focused very much on the graphical interface to make risk analysis modelling easy: combine that with the ability to track formulae across spreadsheets, embed graphs and format sheets in many ways, add VBA and data-importing capabilities, and we can see why Excel is so popular. I have even seen a whole trading floor run on Excel using VBA, and not a single recognisable spreadsheet appeared on any dealer's screen. But Excel has its limitations. ModelRisk overcomes many of them for high-level financial and insurance modelling, and I have used its features in this book a fair bit to help explain some modelling concepts. However, there are many types of problem for which Excel is not suitable. Project cost and schedule risk analysis can be done in spreadsheets at a crude level, which I cover in Chapter 19, and a crude level is often enough for large-scale risk analysis, as we are rarely interested in the minutiae that can be built into a project planning model (like you might make with Primavera or Microsoft Project). However, a risk register is better constructed in an electronic database with various levels of access. The problem with building a project plan in a spreadsheet is that expanding the model into greater detail becomes mechanically very awkward, while it is a simple matter in project planning software. In other areas, risk analysis models built with spreadsheets have a number of limitations:

1. They scale very badly, meaning that spreadsheets can become really huge when one has a lot of data, or when one is performing repetitive calculations that could be succinctly written in another language (e.g. a looping formula), although one can get round this to some degree with Visual Basic. Our company reviews many risk models built in spreadsheets, and they can be vast, often unnecessarily so, because there are shortcuts to achieving the same result if one knows a bit of
probability mathematics. The next version of Excel will handle even bigger sheets, so I predict this problem will only get worse.

2. They are limited to the two dimensions of a grid, three at a push if one uses sheets as a third dimension; if you have a multidimensional problem you should really think hard before deciding on a spreadsheet. There are a lot of other modelling environments one could use: C++ is highly flexible, but opaque to anyone who is not a C++ programmer; Matlab and, to a lesser extent, Mathematica and Maple are highly sophisticated mathematical modelling software with very powerful built-in modelling capabilities that will handle many dimensions and can also perform simulations.

3. They are really slow. Running a simulation in Excel will take hundreds or more times longer than in specialised tools. That's a problem if you have a huge model, or if you need to achieve a high level of precision (i.e. require many iterations).

4. Simulation models built in spreadsheets calculate in one direction, meaning that, if one acquires new data that can be matched to a forecast in the model, the data cannot be integrated into the model to update the estimates of the parameters on which the model was based and therefore produce a more accurate forecast. The simulation software WinBUGS can do this, and I give a number of examples throughout this book.

5. Spreadsheets cannot easily handle the modelling of dynamic systems. There are a number of flexible and user-friendly tools, like Simul8, which give very good approximations to continuously varying stochastic systems with many interacting components. I give an example later in this chapter. Attempting to achieve the same in Excel is not worth the pain.

There are other types of model that one can build, and software that will let you do so easily, which I describe below.
4.1.2 Influence diagrams

Influence diagrams are quite popular - they essentially replicate the mathematics you can build in a spreadsheet, but the modelling environment is quite different (Figure 4.1 is a simple example). Analytica is the most popular influence diagram tool. Variables (called nodes) are represented as graphical objects (circles, squares, etc.) and are connected together with arrows (called arcs) which show the direction of interaction between these variables. The visual result is a network that shows the viewer which variables affect which, but you can imagine that such a diagram quickly becomes overly complex, so one builds submodels. Click on a model object and it opens another view to show a lower level of interaction. Personally, I don't like them much because the mathematics and data behind the model are hard to get to, but others love them. They are certainly very visual.

Figure 4.1 Example of a simple influence diagram (nodes for project base cost, additional costs, inflation, risk of political change, risk of bad weather, risk of strike, and total project cost).

4.1.3 Event trees

Event trees offer a way to describe a sequence of probabilistic events, together with their probabilities and impacts. They are perhaps the most useful of all the methods for depicting a probabilistic sequence, because they are very intuitive, the mathematics to combine the probabilities is simple and the diagram helps ensure the necessary discipline. Event trees are built out of nodes (boxes) and arcs (arrows) (Figure 4.2).
The tree starts from the left with a node (in the diagram below, "Select animal" to denote the random selection of an animal from some population), and arrows to the right indicate the possible outcomes (here, whether the animal is infected with some particular disease agent, or not) and their probabilities (p, which would be the prevalence of infected animals in the population, and (1 - p) respectively). Branching out from these boxes are arrows to the next probability event (the testing of an animal for the disease), and attached to these arrows are the conditional probabilities of the next level of event occurring. The conditional nature of the probabilities in an event tree is extremely important to underline. In this example:

Se = P(tests positive for disease given the animal is infected)
Sp = P(tests negative for disease given the animal is not infected)

Thus, following the rules of conditional probability algebra, we can say, for example:

P(animal is infected and tests positive) = p * Se
P(animal is infected and tests negative) = p * (1 - Se)
P(animal tests positive) = p * Se + (1 - p) * (1 - Sp)

Figure 4.2 Example of a simple event tree ("Select animal", branching to Infected/Not infected, then to Tests +ve/Tests -ve, with each step's probabilities conditional on the previous step).

Figure 4.3 Example of a simple decision tree. The decision options are to make either of two investments or do nothing, with associated (high or low) revenues as a result. More involved decision trees would include two or more sequential decisions depending on how well the investment went.

Event trees are very useful for building up your probability thinking, although they will get quite complex rather quickly. We use them a great deal to help understand and communicate a problem.
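The branch arithmetic above can be sketched in a few lines of Python. Note that the values of p, Se and Sp below are illustrative assumptions of mine, not figures from the text:

```python
# Event tree for testing one randomly selected animal.
# Se = P(tests +ve | infected); Sp = P(tests -ve | not infected).
p = 0.15   # assumed prevalence of infection (illustrative)
Se = 0.95  # assumed test sensitivity (illustrative)
Sp = 0.98  # assumed test specificity (illustrative)

# Joint probabilities: multiply the probabilities along each branch.
p_inf_pos = p * Se               # infected and tests +ve
p_inf_neg = p * (1 - Se)         # infected and tests -ve
p_not_pos = (1 - p) * (1 - Sp)   # not infected but tests +ve
p_not_neg = (1 - p) * Sp         # not infected and tests -ve

# The four leaves of the tree are exhaustive, so they sum to 1.
assert abs(p_inf_pos + p_inf_neg + p_not_pos + p_not_neg - 1.0) < 1e-12

# Marginal probability of a positive test: add the branches that
# end in "tests +ve", exactly as in the last equation above.
p_pos = p * Se + (1 - p) * (1 - Sp)
print(p_pos)  # = 0.15*0.95 + 0.85*0.02, i.e. about 0.1595
```

The same pattern (multiply along branches, add across branches that share an outcome) extends to trees of any depth.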
4.1.4 Decision trees

Decision trees are like event trees but add possible decision options (Figure 4.3). They have a role in risk analysis, and in fields like petroleum exploration they are very popular. They sketch the possible decisions that one might make and the outcomes that might result. Decision tree software (which can also produce event trees) can calculate the best option to take under the assumption of some user-defined utility function. Again, personally I am not a big fan of decision trees in actual model writing. I find that it is difficult for decision-makers to be comfortable with defining a utility curve, so I don't have much use for the analytical component of decision tree software, but the trees are helpful for communicating the logic of a problem.

4.1.5 Fault trees

Fault trees start from the reverse approach to an event tree. An event tree looks forward from a starting point and considers the possible future outcomes. A fault tree starts with the outcome and looks at the ways it could have arisen. A fault tree is therefore constructed from the right with the outcome, moves to the left with the possible immediate events that could have made that outcome arise, continues backwards with the possible events that could have made the first set of events arise, etc. Fault trees are very useful for focusing attention on what might go wrong and why. They have been used in reliability engineering for a long time, but also have applications in areas like terrorism. For example, one might start with the risk of deliberate contamination of a city's drinking water supply and then consider the routes that a terrorist could use (pipeline, treatment plant, reservoir, etc.) and the probabilities of succeeding by each route given the security in place.

4.1.6 Discrete event simulation

Discrete event simulation (DES) differs from Monte Carlo simulation mainly in that it models the evolution of a (usually stochastic) system over time.
It does this by allowing the user to define equations for each element in the model describing how it changes, moves and interacts with other elements. It then steps the system through small time increments and keeps track of where all the elements are at any time (e.g. parts in a manufacturing system, passengers in an airport or ships in a harbour). More sophisticated tools can increase the clock steps when nothing is happening, then decrease them again to get a more accurate approximation to the continuous behaviour being modelled. We have used DES for a variety of clients, one of which was a shipping firm that regularly received LNG ships at its site on a narrow shared waterway. The client wanted to investigate the impact of constructing an alternative berthing system designed to reduce the impact of their activities on other shipping movements, and the model evaluated the benefits of such a system. Within the DES model, movements of the client's and any other relevant shipping traffic were simulated, taking into account the restriction of movements by certain rules and regulations and evaluating the costs of delays. The standalone model, as well as documentation and training, was provided to the client and helped them to persuade the other shipping operators and the Federal Energy Regulatory Commission (FERC) of the effectiveness of their plan. Figure 4.4 shows a screen shot of the model (it looks better in colour). Going from left to right, we can see that currently there is one ship in the upper harbour, four in the inner harbour, none at the city front and one in the outer harbour. In the client's berth, two ships are unloading, with 1330 and 2430 units of material still on board. In the upper right-hand corner the number of ships entering the shared waterway is visible, including the number of ships that are currently in a queue (three and two ships of a particular type).
Finally, the lower right-hand corner shows the current waterway conditions, which dictate some of the rules, such as "only ships of a certain draft can enter or exit the waterway given a particular current, tide, wind speed and visibility".

Figure 4.4 Example of a DES model.

DES allows us to model extremely complicated systems in a simple way by defining how the elements interact and then letting the model simulate what might happen. It is used a great deal to model, for example, manufacturing processes, the spread of epidemics, all sorts of complex queuing systems, traffic flows and crowd behaviour (to design emergency exits, for example). The beauty of a visual interface is that anyone who knows the system can check whether it behaves as expected, which makes it a great communication and validation tool.

4.2 Calculation Methods

Given a certain probability model that we wish to evaluate, there are several methods that we could use to produce the required answer, which I describe below.

4.2.1 Calculating moments

This method uses some probability laws that are discussed later in this book. In particular it uses the following rules:

1. The mean of the sum of two distributions is equal to the sum of their means, i.e. E(a + b) = E(a) + E(b) and E(a - b) = E(a) - E(b).
2. The mean of the product of two independent distributions is equal to the product of their means, i.e. E(a * b) = E(a) * E(b).
3. The variance of the sum (or difference) of two independent distributions is equal to the sum of their variances, i.e. V(a + b) = V(a) + V(b) and V(a - b) = V(a) + V(b).
4. For a constant n: E(n * a) = n * E(a) and V(n * a) = n^2 * V(a).

The moments calculation method replaces each uncertain variable with its mean and variance and then uses the above rules to estimate the mean and variance of the model's outcome.
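These four rules are easy to verify numerically. The sketch below checks them by simulation with NumPy; the two distributions are illustrative choices of mine, picked only because their means and variances are known exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000
# Two independent illustrative distributions.
a = rng.normal(10, 2, N)    # Normal: mean 10, variance 4
b = rng.gamma(4, 2.5, N)    # Gamma(shape=4, scale=2.5): mean 10, variance 25

# Rule 1: the mean of a sum/difference is the sum/difference of means
# (an exact identity for sample means, up to floating point).
assert abs(np.mean(a + b) - (np.mean(a) + np.mean(b))) < 1e-9

# Rule 2: for independent variables, the mean of the product is the
# product of the means (true only up to sampling error here).
assert abs(np.mean(a * b) - np.mean(a) * np.mean(b)) < 0.1

# Rule 3: variances add for both the sum AND the difference of
# independent variables (again up to sampling error).
assert abs(np.var(a + b) - (np.var(a) + np.var(b))) < 0.2
assert abs(np.var(a - b) - (np.var(a) + np.var(b))) < 0.2

# Rule 4: E(na) = nE(a) and V(na) = n^2 V(a), shown for n = 3.
assert abs(np.mean(3 * a) - 3 * np.mean(a)) < 1e-9
assert abs(np.var(3 * a) - 9 * np.var(a)) < 1e-6
```

Rules 1 and 4 hold for any distributions; rules 2 and 3 rely on independence, which is why the worked example below requires a, b and c to be independent.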
So, for example, suppose three independent variables a, b and c have the following means and variances:

a: Mean = 70, Variance = 14
b: Mean = 16, Variance = 2
c: Mean = 12, Variance = 4

If the problem is to calculate 2a + b - c, the result can be estimated as follows:

Mean = (2 * 70) + 16 - 12 = 144
Variance = (2^2 * 14) + 2 + 4 = 62

These two values are then used to construct a normal distribution of the outcome:

Result = Normal(144, sqrt(62))

where sqrt(62) is the standard deviation of the distribution, i.e. the square root of the variance. This method is useful in certain situations, like the summation of a large number of potential risks and the determination of aggregate distributions (Section 11.2). It does have some fairly severe limitations - it cannot easily cope with divisions, exponents, power functions, branching, etc. In short, this technique becomes very difficult to execute for all but the most simple models that also reasonably obey its set of assumptions.

4.2.2 Exact algebraic solutions

Each probability distribution has associated with it a probability distribution function that mathematically describes its shape. Algebraic methods have been developed for determining the probability distribution functions of some combinations of variables, so for simple models one may be able to find an equation directly that describes the output distribution. For example, it is quite simple to calculate the probability distribution function of the sum of two independent distributions (the following maths might not make sense until you've read Chapter 6). Let X be the first distribution with density f(x) and cumulative distribution function F_X(x), and let Y be the second distribution with density g(x). Then the cumulative distribution function of the sum of X and Y, F_X+Y, is given by

F_X+Y(a) = INTEGRAL over all x of F_X(a - x) g(x) dx    (4.1)

The sum of two independent distributions is sometimes known as the convolution of the distributions.
By differentiating this equation, we obtain the density function of X + Y:

f_X+Y(a) = INTEGRAL over all x of f(a - x) g(x) dx    (4.2)

So, for example, we can determine the distribution of the sum of two independent Uniform(0, 1) distributions. The probability density functions f(x) and g(x) are both 1 for 0 <= x <= 1, and zero otherwise. From Equation (4.2) we get

f_X+Y(a) = INTEGRAL from 0 to 1 of f(a - x) dx

For 0 <= a <= 1, the integrand is 1 only for 0 <= x <= a, which gives f_X+Y(a) = a. For 1 <= a <= 2, the integrand is 1 only for a - 1 <= x <= 1, which gives f_X+Y(a) = 2 - a. Together these describe a Triangle(0, 1, 2) distribution. Thus, if our risk analysis model were just the sum of several simple distributions, we could use these equations repeatedly to determine the exact output distribution. There are a number of advantages to this approach, for example: the answer is exact; one can immediately see the effect of changing a parameter value; and one can use differential calculus to explore the sensitivity of the output to the model parameters. A variation of the same approach is to recognise the relationships between certain distributions: for example, the sum of two independent Normal distributions is itself a Normal distribution. There are plenty of such relationships, and many are described in Appendix III, but nonetheless the distributions used in a risk analysis model don't usually allow such simple manipulation, and the exact algebraic technique becomes hugely complex and often intractable very quickly, so it cannot usually be considered a practical solution.

4.2.3 Numerical approximations

Some fast Fourier transform and recursive techniques have been developed for directly, and very accurately, determining the aggregate distribution of a random number of independent random variables. A lot of attention has been paid to this particular problem because it is central to the actuarial need to determine the aggregate claim payout an insurance company will face. However, the same generic problem occurs in banking and other areas. I describe these techniques in Section 11.2.2. There are other numerical techniques that can solve certain types of problem, particularly via numerical integration.
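The Triangle(0, 1, 2) result for the sum of two Uniform(0, 1) distributions is easy to confirm by simulation; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 1_000_000
# Sample the sum of two independent Uniform(0, 1) variables.
s = rng.uniform(0, 1, N) + rng.uniform(0, 1, N)

# The density should be f(a) = a on [0, 1] and f(a) = 2 - a on [1, 2],
# i.e. a Triangle(0, 1, 2) distribution.
hist, edges = np.histogram(s, bins=40, range=(0, 2), density=True)
mids = (edges[:-1] + edges[1:]) / 2
expected = np.where(mids <= 1, mids, 2 - mids)
assert np.max(np.abs(hist - expected)) < 0.03  # within sampling noise

# The moments also match Triangle(0, 1, 2): mean = 1, variance = 1/6.
assert abs(s.mean() - 1.0) < 0.01
assert abs(s.var() - 1 / 6) < 0.01
```

The same brute-force check is a useful sanity test whenever an algebraic convolution result is available: simulate the sum and compare the histogram against the derived density.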
ModelRisk, for example, provides the function VoseIntegrate, which will perform a very accurate numerical integration. Consider a function that relates the probability of illness, Pill(D), to the number of virus particles ingested, D. If we believed that the number of virus particles followed a Lognormal(100, 10) distribution, we could calculate the expected probability of illness with a single VoseIntegrate formula, where the VoseIntegrate function interprets "#" as the variable to integrate over and the integration is done between 1 and 1000. The answer is 2.10217E-05 - a value that we could only determine with accuracy using Monte Carlo simulation by running a very large number of iterations.

4.2.4 Monte Carlo simulation

This technique involves the random sampling of each probability distribution within the model to produce hundreds or even thousands of scenarios (also called iterations or trials). Each probability distribution is sampled in a manner that reproduces the distribution's shape. The distribution of the values calculated for the model outcome therefore reflects the probability of the values that could occur. Monte Carlo simulation offers many advantages over the other techniques presented above:

- The distributions of the model's variables do not have to be approximated in any way.
- Correlation and other interdependencies can be modelled.
- The level of mathematics required to perform a Monte Carlo simulation is quite basic.
- The computer does all of the work required in determining the outcome distribution.
- Software is commercially available to automate the tasks involved in the simulation.
- Complex mathematics can be included (e.g. power functions, logs, IF statements, etc.) with no extra difficulty.
- Monte Carlo simulation is widely recognised as a valid technique, so its results are more likely to be accepted.
- The behaviour of the model can be investigated with great ease.
- Changes to the model can be made very quickly and the results compared with previous models.

Monte Carlo simulation is often criticised as being an approximate technique. However, in theory at least, any required level of precision can be achieved by simply increasing the number of iterations in a simulation. The limitations are the number of random numbers that can be produced by a random number generating algorithm and, more commonly, the time a computer needs to generate the iterations. For a great many problems these limitations are irrelevant or can be avoided by structuring the model into sections. The value of Monte Carlo simulation can be demonstrated by considering the cost model problem of Figure 4.5. Triangular distributions represent the uncertain variables in the model. There are many other, very intuitive, distributions in common use (Figure 4.6 gives some examples) that require little or no probability knowledge to understand. The cumulative distribution of the results is shown in Figure 4.7, along with the distribution of the values that are generated from running a "what if" scenario analysis using three values, as discussed at the beginning of this chapter. The figure shows that the Monte Carlo outcome does not have anywhere near as wide a range as the "what if" analysis. This is because the "what if" analysis effectively gives equal probability weighting to all scenarios, including those where all costs turned out to be at their maximum and all costs turned out to be at their minimum. Let us allow, for a minute, the maximum to mean the value that has only a 1 % chance of being exceeded (say). The probability that all five costs could be at their maximum at the same time would then equal (0.01)^5, or 1 in 10 000 000 000: not a realistic outcome! Monte Carlo simulation therefore provides results that are also far more realistic than those produced by simple "what if" scenarios.
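The argument can be reproduced with a toy version of the Figure 4.5 cost model. The five (minimum, most likely, maximum) triples below are illustrative values of my own, not necessarily those printed in the figure:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000

# Illustrative (min, most likely, max) values for five cost items.
items = [(23_500, 27_200, 30_500),
         (172_000, 178_000, 189_000),
         (56_200, 58_500, 63_700),
         (29_600, 33_200, 37_200),
         (31_100, 37_800, 43_600)]

# Simulate the total cost: sample each item from its triangular
# distribution and sum across items, N iterations at a time.
total = sum(rng.triangular(lo, ml, hi, N) for lo, ml, hi in items)

# "What if" extremes: every item at its minimum, or every item at
# its maximum, simultaneously.
lo_sum = sum(lo for lo, _, _ in items)
hi_sum = sum(hi for _, _, hi in items)

# The simulated 1%-99% band sits well inside the what-if range,
# because all five items are almost never near their extremes at once.
p1, p99 = np.percentile(total, [1, 99])
print(lo_sum, p1, p99, hi_sum)
assert lo_sum < p1 < p99 < hi_sum
```

Replacing the triangular distributions with PERT or any of the other distributions of Figure 4.6 requires changing only the sampling line.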
Figure 4.5 Construction project cost model (total construction costs):

Item                    Minimum     Best guess   Maximum
Excavation              £23 500     £27 200      £30 500
Foundations             £172 000    £178 000     £189 000
Structure               £56 200     £58 500      £63 700
Roofing                 £29 600     £33 200      £37 200
Services and finishes   £31 100     £37 800      £43 600

Figure 4.6 Examples of intuitive and simple probability distributions: (a) Triangle distributions; (b) Uniform distributions; (c) PERT distributions, e.g. PERT(0, 49, 50); (d) a Relative distribution, e.g. Relative(4, 15, {7,9,11}, {2,3,0.5}); (e) a Cumulative Ascending distribution, e.g. CumulA(0, 10, {1,4,6}, {0.2,0.5,0.6}); (f) a Discrete distribution, e.g. Discrete({1,2,3}, {0.4,0.5,0.1}).

Figure 4.7 Comparison of the distributions of results from "what if" and risk analyses (cumulative probability of total project cost, from about £310 000 to £370 000).

4.3 Uncertainty and Variability

"Variability is a phenomenon in the physical world to be measured, analysed and where appropriate explained. By contrast, uncertainty is an aspect of knowledge." - Sir David Cox

There are two components of our inability to precisely predict what the future holds: these are variability and uncertainty. This is a difficult subject, not least because of the words that we risk analysts have available to describe the various concepts, and how these words have been used rather carelessly. Bearing this in mind, a good start will be to define the meaning of various keywords. I have used the now fairly standard meanings for uncertainty and variability, but might be considered to be deviating a little from the common path in my explanation of the units of uncertainty and variability.
The reader should bear in mind the comments I'll make about the different meanings that various disciplines assign to certain words. As long as the reader manages to keep the concepts clear, it should be an easy enough task to work out what another author means even if some of the terminology is different.

Variability

Variability is the effect of chance and is a function of the system. It is not reducible through either study or further measurement, but may be reduced by changing the physical system. Variability has been described as "aleatory uncertainty", "stochastic variability" and "interindividual variability". Tossing a coin a number of times provides us with a simple illustration of variability. If I toss the coin once, I will have a head (H) or tail (T), each with a probability of 50 % if one presumes a fair coin. If I toss the coin twice, I have four possible outcomes {HH, HT, TH, TT}, each with a probability of 25 % because of the coin's symmetry. We cannot predict with certainty what the tosses of a coin will produce because of the inherent randomness of the coin toss.

The variation among a population provides us with another simple example. If I randomly select people off the street and note some physical characteristic, like their height, weight, sex, whether they wear glasses, etc., the result will be a random variable with a probability distribution that matches the frequency distribution of the population from which I am sampling. So, for example, if 52 % of the population are female, a randomly sampled person will be female with a probability of 52 %. In the nineteenth century a rather depressing philosophical school of thought, usually attributed to the mathematician the Marquis Pierre-Simon de Laplace, became popular, which proposed that there was no such thing as variability, only uncertainty, i.e. that there is no randomness in the world and that an omniscient being or machine, a "Laplace machine", could predict any future event.
This was the foundation of the physics of the day, Newtonian physics, and even Albert Einstein believed in the determinism of the physical world, saying the often quoted "Der Herr Gott würfelt nicht" - "God does not play dice". Heisenberg's uncertainty principle, one of the foundations of modern physics and, in particular, quantum mechanics, shows us that this is not true at the molecular level, and therefore subtly at any greater scale. In essence, it states that the more one characteristic of a particle is constrained (for example, its location in space), the more random another characteristic becomes (if the first characteristic is location, the second will be its velocity). Einstein tried to prove that it is our knowledge of one characteristic that we are losing as we gain knowledge of another characteristic, rather than any characteristic being a random variable, but he has subsequently been proven wrong both theoretically and experimentally. Quantum mechanics has so far proved itself to be very accurate in predicting experimental outcomes at the molecular level, where the predicted random effects are most easily observed, so we have a lot of empirical evidence to support the theory. Philosophically, the idea that everything is predetermined (i.e. that the world is deterministic) is very difficult to accept too, as it deprives us humans of free will. The non-existence of free will would in turn mean that we are not responsible for our actions - we are reduced to complicated machines and it is meaningless to be either praised or punished for our deeds and misdeeds, which of course is contrary to the principles of any civilisation or religion. Thus, if one accepts the existence of free will, one must also accept an element of randomness in all things that humans affect. Popper (1988) offers a fuller discussion of the subject. Sometimes systems are simply too complex for us to understand properly.
For example, stock markets produce varying stock prices all the time that appear random. Nobody knows all the factors that influence a stock price over time - it is essentially infinitely complex, and we accept that this is best modelled as a random process.

Uncertainty

Uncertainty is the assessor's lack of knowledge (level of ignorance) about the parameters that characterise the physical system being modelled. It is sometimes reducible through further measurement or study, or by consulting more experts. Uncertainty has also been called "fundamental uncertainty", "epistemic uncertainty" and "degree of belief". Uncertainty is by definition subjective, as it is a function of the assessor, but there are techniques available to allow one to be "objectively subjective". This essentially amounts to a logical assessment of the information contained in available data about model parameters, without including any prior, non-quantitative information. The result is an uncertainty analysis that any logical person should agree with, given the available information.

Total uncertainty

Total uncertainty is the combination of uncertainty and variability. These two components act together to erode our ability to predict what the future holds. Uncertainty and variability are philosophically very different, and it is now quite common for them to be kept separate in risk analysis modelling. Common mistakes are failing to include uncertainty in the model, or modelling variability in some parts of the model as if it were uncertainty. The former will produce an overconfident (i.e. insufficiently spread) model output, while the latter can grossly overinflate the total uncertainty. Unfortunately, as you will have gathered, the term "uncertainty" has been applied both to the meaning described above and to total uncertainty, which has left the risk analyst with some problems of terminology.
Colleagues have suggested the word "indeterminability" to describe total uncertainty (perhaps a bit of a mouthful, but still the best suggestion I've heard so far). There has been a rather protracted argument between traditional (frequentist) and Bayesian statisticians over the meaning of words like probability, frequency, confidence, etc. Rather than go through their various interpretations here, I will simply present you with how I use these words. I have found that my terminology clarifies my thoughts and those of my clients and course participants very well. I hope it will do the same for you.

Probability

Probability is a numerical measurement of the likelihood of an outcome of some stochastic process. It is thus one of the two components, along with the values of the possible outcomes, that describe the variability of a system. The concept of probability can be developed neatly from two different approaches. The frequentist approach asks us to imagine repeating the physical process an extremely large number of times (trials) and then to look at the fraction of times that the outcome of interest occurs. That fraction is asymptotically (meaning as we approach an infinite number of trials) equal to the probability of that particular outcome for that physical process. So, for example, the frequentist would imagine that we toss a coin a very large number of times. The fraction of the tosses that comes up heads is approximately the true probability of a single toss producing a head, and the more tosses we do, the closer the fraction becomes to the true probability. So, for a fair coin, we should see the number of heads stabilise at around 50 % of the trials as the number of trials gets truly huge. The philosophical problem with this approach is that one usually does not have the opportunity to repeat the scenario a very large number of times.
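The frequentist idea of probability as a long-run fraction is easy to demonstrate; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toss a fair coin more and more times and watch the fraction of
# heads settle towards the true probability of 0.5.
for n in (100, 10_000, 1_000_000):
    heads = rng.integers(0, 2, n).sum()  # each toss is 0 (tail) or 1 (head)
    print(n, heads / n)

# With a million tosses the fraction is typically within a few
# thousandths of 0.5 (standard error = 0.5 / sqrt(n)), although it
# is never guaranteed to equal it exactly.
frac = rng.integers(0, 2, 1_000_000).mean()
assert abs(frac - 0.5) < 0.005
```

The printed fractions wander further from 0.5 for small n and hug it more tightly as n grows, which is exactly the asymptotic behaviour the frequentist definition appeals to.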
The physicist or engineer, on the other hand, could look at the coin, measure it, spin it, bounce lasers off its surface, etc., until one could declare that, owing to its symmetry, the coin must logically have a 50 % probability of falling on either face (for a fair coin, or some other value for an unbalanced coin as the measurements dictated). Probability is used to define a probability distribution, which describes the range of values the variable may take, together with the probability (likelihood) that the variable will take any specific value.

Degree of uncertainty

In this context, "degree of uncertainty" is our measure of how much we believe something to be true. It is one of the two components, along with the plausible values of the parameter, that describe the uncertainty we may have about a parameter of the physical system ("the state of nature", if you like) to be modelled. We can thus use the degree of uncertainty to define an uncertainty distribution, which describes the range of values within which we believe the parameter lies, as well as the level of confidence we have about the parameter being any particular value, or lying within any particular range. A distribution of confidence looks exactly the same as a distribution of probability, and this can lead, all too easily, to confusion between the two quantities.

Frequency

Frequency is the number of times a particular characteristic appears in a population. Relative frequency is the fraction of times the characteristic appears in the population. So, in a population of 1000 people, 22 of whom have blue eyes, the frequency of blue eyes is 22 and the relative frequency is 0.022 or 2.2 %. Frequency, by the definition used here, must relate to a known population size.

4.3.1 Some illustrations of uncertainty and variability

Let us look at a couple of examples to clarify the meaning of uncertainty and variability. Since variability is the more fundamental concept, we'll deal with it first.
If I toss a fair coin, there is a 50 % chance that each toss will come up heads (let's call this a "success"). The result of each toss is independent of the results of any previous tosses, and it turns out that the probability distribution of the number of heads in n tosses of a fair coin is described by a Binomial(n, 50 %) distribution, which will be explained in detail in Section 8.2. Figure 4.8 illustrates this binomial distribution for n = 1, 2, 5 and 10. This is a distribution of variability because I am not a machine, so I am not perfectly repetitive, and the system (the number of times the coin spins, the air resistance and movement, the angle at which it hits the ground, the topology of the ground, etc.) is too complicated for me to attempt to influence the outcome, and the tosses are therefore random. These binomial distributions are distributions of variability and reflect the randomness inherent in the tossing of a coin (our stochastic system). We are assuming that there is no uncertainty here, as we are assuming the coin to be fair and we are defining the number of tosses; in other words, we are assuming the parameters of the system to be exactly known. The vertical axis of Figure 4.8 gives the probability of each result, and, naturally, these probabilities add up to 1. In general, probability distributions or distributions of variability are simple to understand. They give me some comfort that

Figure 4.8 Examples of the Binomial(n, 50 %) distribution.

Figure 4.9 Confidence distributions for the ball in the box being black: 0 = No, 1 = Yes.
The left panel is confidence before any ball is revealed; the right panel is confidence after seeing a blue ball removed from the sack. randomness (variability) really does exist in the world: if we take a group of 100 people¹ and ask them to toss a coin 10 times, the resulting distribution of the number of heads will closely follow a Binomial(10, 50 %). Now let us look at a distribution of uncertainty. Imagine I have a sack of 10 balls, six of which are black and the remaining four of which are blue, and I know these figures. Now imagine that, out of my sight, a ball is randomly selected from the sack and placed in an opaque box. I am asked the question: "What is the probability that the ball in the box is black?", and I could quickly answer 6/10 or 60 %. Then another ball is removed from the sack and shown to me: it is blue. I am asked: "Now what is the probability that the ball in the box is black?", and, as there are now a total of nine balls I have not seen, six of which I know are black, I could answer 6/9 or 66.66 %. But that is strange, because it is hard to believe that the probability of the ball in the box being black has changed as a result of events that occurred after its selection. The problem lies in my use of the term "probability", which is inconsistent with the definition I have given above. When the ball has been placed in the box, the deed is done: either the ball in the box is black (i.e. the probability is 1) or it is not (i.e. the probability is 0). I don't know the truth, but I could collect information (i.e. look in the box or look in the sack) to find out what the true state is. Before any ball was revealed to me, I should have said that I was 60 % confident that the probability was 1, and therefore 40 % confident that the probability was 0. This is an uncertainty distribution of the true probability. Now, when the blue ball was revealed to me from the sack, I had extra information and would therefore change my uncertainty distribution to show a 66.66 % confidence that the ball in the box was black (i.e. that the probability was 1).
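The ball-in-the-box reasoning is easy to check numerically. The sketch below (Python, my own illustration rather than anything from the book) computes the two confidence values directly, and then confirms the updated 6/9 figure by replaying the physical process many times and keeping only the replays consistent with the observation, i.e. a blue ball revealed from the sack:

```python
import random

# Direct calculation: 6 black among the 10 balls I have not seen,
# then 6 black among the 9 unseen balls after a blue ball is revealed.
prior = 6 / 10        # confidence before any ball is revealed
posterior = 6 / 9     # confidence after seeing a blue ball removed

# Simulation check: replay the physical process and condition on the
# observation, which is exactly what updating an uncertainty
# distribution amounts to.
random.seed(1)
consistent = boxed_black = 0
for _ in range(200_000):
    balls = ["black"] * 6 + ["blue"] * 4
    random.shuffle(balls)
    boxed, revealed = balls[0], balls[1]   # one ball boxed, then one shown
    if revealed == "blue":                 # keep only the matching replays
        consistent += 1
        boxed_black += (boxed == "black")

print(prior, round(posterior, 4), round(boxed_black / consistent, 4))
```

The conditioned frequency settles near 0.667, matching the 66.66 % confidence, whereas the unconditioned frequency would stay at the prior 60 %.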
These two confidence distributions are shown in Figure 4.9. Note that the distributions of Figure 4.9 are pure uncertainty distributions. The quantity in question is a probability, but it may only take a value of 0 or 1, and it has no variability: its outcome is deterministic.

¹ I don't recommend the experiment: the coins went everywhere when I did this a couple of times with a big lecture group in a large banked auditorium.

4.3.2 Combining uncertainty and variability in a risk analysis model

To all intents and purposes, uncertainty and variability are described by distributions that look and behave exactly the same. One might therefore reasonably conclude that they can be combined in the same Monte Carlo model: some distributions reflecting the uncertainty about certain parameters in the model, the other distributions reflecting the inherent stochastic nature of the system. We could then run a simulation on such a model which would randomly sample from all the distributions, and our output would therefore take account of all uncertainty and variability. Unfortunately, this does not work out completely. The resultant single distribution is equivalent to our "best-guess" distribution of the composite of the two components. Technically, it is difficult to interpret, as the vertical scale represents neither uncertainty nor variability, and we have lost the information about what component of the resultant distribution is due to the inherent randomness (variability) of the system, and what component is due to our ignorance of that system. It is therefore useful to know how to keep these two components separate in an analysis if necessary.

Why separate uncertainty and variability?

Keeping uncertainty and variability separate in a risk analysis model is mathematically more correct. Mixing the two together, i.e.
by simulating them together, produces a reasonable estimate of the level of total uncertainty under most conditions. Figure 4.10 shows a Binomial(10, p) distribution, where p is uncertain with distribution Beta(10, 10). The spaghetti-looking graph represents a number of possible true binomial distributions, shown in cumulative form, and the bold line shows the result one gets from simulating the binomial and beta distributions together. The combined model may be wrong, but it covers the possible range very well. But consider doing the same with just one binomial trial, e.g. Binomial(1, Beta(10, 10)). The result is either a 1 or a 0, each occurring in about 50 % of the iterations of the simulation run, the same result as we would have had by modelling Binomial(1, 50 %). The output has lost the information that p is uncertain. Mixing uncertainty and variability also means, of course, that we cannot see how much of the total uncertainty comes from variability and how much from uncertainty, and that information is useful. If we know that a large part of the total uncertainty is due to uncertainty (as in the example of Figure 4.11), then we know that collecting further information, thereby reducing uncertainty, would enable us to improve our estimate of the future. On the other hand, if the total uncertainty is nearly all due to variability (as in the example of Figure 4.12), we know that it is a waste of time to collect more

Figure 4.10 300 Binomial(10, p) distributions resulting from random samples of p from a Beta(10, 10) distribution.

Figure 4.11 Example of second-order risk analysis model output with uncertainty dominating variability.
information and the only way to reduce the total uncertainty would be to change the physical system.

Figure 4.12 Example of second-order risk analysis model output with variability dominating uncertainty.

In general, the separation of uncertainty and variability allows us to understand what steps can be taken to reduce the total uncertainty of our model, and allows us to gauge the value of more information or of some potential change we can make to the system. A much larger problem than mixing uncertainty and variability distributions together can occur when a variability distribution is used as if it were an uncertainty distribution. Separating uncertainty and variability very deliberately gives us the discipline and understanding to avoid the much larger errors that this mistake will produce. Consider the following problem. A group of 10 jurors is randomly picked from a population for some court case. In this population, 50 % are female, 0.2 % have a severe visual disability and 1.1 % are Native American. The defence would like to have at least one member on the jury who is female and either Native American or visually disabled or both. What is the probability that there will be at least one such juror in the selection? This is a pure variability problem, as all the parameters are considered well known, and the answer is quite easy to calculate, assuming independence between the characteristics. The probability that a person is not Native American and not visually disabled is (100 % - 1.1 %) × (100 % - 0.2 %) = 98.7022 %. The probability that a person is either Native American or visually disabled or both is (100 % - 98.7022 %) = 1.2978 %. Thus, the probability that a person is either Native American or visually disabled or both and female is (50 % × 1.2978 %) = 0.6489 %.
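This chain of arithmetic, carried through to the full jury of 10, can be verified with a short script. This is an illustrative sketch (Python), not part of the book's spreadsheet model:

```python
p_female = 0.50     # P(female)
p_native = 0.011    # P(Native American)
p_visual = 0.002    # P(severe visual disability)

# P(Native American or visually disabled or both), assuming independence
p_either = 1 - (1 - p_native) * (1 - p_visual)     # 1.2978 %

# ... and also female
p_person = p_female * p_either                     # 0.6489 %

# P(at least one such person among 10 randomly picked jurors)
p_at_least_one = 1 - (1 - p_person) ** 10          # about 6.303 %

print(f"{p_person:.4%}  {p_at_least_one:.3%}")
```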
The probability that none of the potential jurors is either Native American or visually disabled or both and female is then (100 % - 0.6489 %)^10 = 93.697 %, and so, finally, the probability that at least one potential juror is either Native American or visually disabled or both and female is (100 % - 93.697 %) = 6.303 %. Now let's compare this calculation with the spreadsheet of Figure 4.13 and the result it produces in Figure 4.14.

Figure 4.13 Example of model that incorrectly mixes uncertainty and variability.

Figure 4.14 Result of the model of Figure 4.13.

In this model, the number of females in the jury has been simulated, but the rest of the calculation has been explicitly calculated. The output thus has a distribution that is meaningless, since it should be a single figure. The reason for this is that the model both calculated and simulated variability: we are treating the number of females as if it were an uncertain parameter rather than a variable. Now, having said how useful it is to separate uncertainty and variability, we must take a step back and ask whether the effort is worth the extra information that can be gained. In truth, if we run simulations that combine uncertainty and variability in the same simulation, we can get a good idea of their contribution to total uncertainty by running the model twice: the first time sampling from all distributions, and the second time setting all the uncertainty distributions to their mean value. The difference in spread is a reasonable description of the contribution of uncertainty to total uncertainty. Writing a model where uncertainty and variability are kept separate, as described in the next section, can be very time consuming and cumbersome, so we must keep an eye out for the value of such an exercise.

4.3.3 Structuring a Monte Carlo model to separate uncertainty and variability

The core structure of a risk analysis model is the variability of the stochastic system.
Once this variability model has been constructed, the uncertainty about parameters in that variability model can be overlaid. A risk analysis model that separates uncertainty and variability is described as second order. A variability model comes in two forms: explicit calculation and simulation. In a variability model with explicit calculation, the probability of each possible outcome is explicitly calculated. So, for example, if one were calculating the number of heads in 10 tosses of a coin, the explicit calculation model would take the form of the spreadsheet in Figure 4.15.

Figure 4.15 Model calculating the outcome of 10 tosses of a coin.

Here, we have used the Excel function BINOMDIST(x, n, p, cumulative), which returns the probability of x successes in n trials with a binomial probability of success p. The cumulative parameter requires either a TRUE (or 1) or a FALSE (or 0): using TRUE, the function returns the cumulative probability F(x); using FALSE, the function returns the probability mass f(x). Plotting columns E and F together in an x-y scatter plot produces the binomial distribution, which can be the output of the model. Statistical results, like the mean and standard deviation shown in the spreadsheet model, can also be determined explicitly as needed. The formulae calculating the mean and standard deviation use the Excel array function SUMPRODUCT, which multiplies terms in the two arrays pair by pair and then sums these pair products. In an explicitly calculated model like this it is a simple matter to include uncertainty about any parameters of the model. For example, if we are not confident that the coin was truly fair but instead wish to describe our estimate of the probability of heads as a Beta(12, 11) distribution (see Section 8.2.3 for an explanation of the beta distribution in this context), we can simply enter the beta distribution in place of the 0.5 value in cell C3 and simulate for the cells in column F containing the outputs.
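For readers working outside Excel, the explicit-calculation model of Figure 4.15 can be reproduced with ordinary code. The sketch below (Python, standard library only; the layout is my own, not the book's) builds the f(x) and F(x) columns and the SUMPRODUCT-style statistics:

```python
from math import comb

n, p = 10, 0.5   # 10 tosses of a fair coin

# f(x) and F(x), the analogues of BINOMDIST(x, n, p, FALSE) and
# BINOMDIST(x, n, p, TRUE) in the spreadsheet columns.
pmf = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]
cdf = [sum(pmf[: x + 1]) for x in range(n + 1)]

# SUMPRODUCT-style statistics, calculated explicitly from the pmf.
mean = sum(x * f for x, f in zip(range(n + 1), pmf))
variance = sum((x - mean) ** 2 * f for x, f in zip(range(n + 1), pmf))

print(mean, variance ** 0.5)   # 5.0 and about 1.581 for a fair coin
```

Replacing the fixed p = 0.5 with a draw from a Beta(12, 11) generator and recomputing is the code equivalent of entering the beta distribution in place of the value in cell C3.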
The separation of uncertainty and variability is simple and clear when using a model that explicitly calculates the variability, as we use formulae for the variability and simulation for the uncertainty. But what do we do if the model is set up to simulate the variability? Figure 4.16 shows the same coin-tossing problem, but now we are simulating the number of heads using a Binomial(n, p) function in @RISK. Admittedly, it seems rather unnecessary here to simulate such a simple problem, but in many circumstances it is extremely unwieldy, if not impossible, to use explicit calculation models, and simulation is the only feasible approach. Since we are using the random sampling of simulation to model the variability, it is no longer available to us to model uncertainty. Let us imagine that we put a possible value for the binomial probability p into the model and run a simulation. The result is the binomial distribution that would be the correct model of variability if that value of p were correct. Now, we believe that p could actually be quite a different value - our confidence about the true value of p is described by a Beta(12, 11) distribution - so we would really like to take repeated samples from the beta distribution, run a simulation for each sample and plot all the binomial distributions together to give us a true picture. This sounds immensely tedious, but @RISK provides a RiskSimtable function that will automate the process. Crystal Ball also provides a similar facility in its Pro version that allows one to nominate uncertainty and variability distributions within a model separately and then completely automates the process. We proceed by taking (say) 50 Latin hypercube samples from the beta distribution and importing them back into the spreadsheet model. We then use a RiskSimtable function to reference the list of values.

Figure 4.16 A simulation version of the model of Figure 4.15.
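The procedure just described amounts to a two-level loop, which can be sketched in plain code. The following Python sketch is my own approximation of it: it uses simple random samples of p rather than the Latin hypercube samples the text recommends, and it stands in for the RiskSimtable mechanism rather than reproducing it:

```python
import random
random.seed(1)

n_sims, n_iters = 50, 500   # 50 simulations of 500 iterations each

# Outer level: uncertainty. Each simulation uses one sample of the
# uncertain binomial probability p from our Beta(12, 11) belief.
# Inner level: variability. Each iteration simulates 10 coin tosses.
distributions = []
for _ in range(n_sims):
    p = random.betavariate(12, 11)
    heads = [sum(random.random() < p for _ in range(10)) for _ in range(n_iters)]
    distributions.append(heads)

# Each entry of `distributions` is one possible Binomial(10, p)
# distribution; plotted together they give a picture like Figure 4.10.
sim_means = [sum(d) / n_iters for d in distributions]
print(min(sim_means), max(sim_means))
```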
The RiskSimtable function returns the first value in the list, but when we instruct @RISK to run 50 simulations, each of say 500 iterations, the RiskSimtable function will go through the list, using one value at a time for each simulation. Note that the number of simulations is set to equal the number of samples we have from the beta uncertainty distribution. The binomial distribution is then linked to the RiskSimtable function and named as an output. We now run the 50 simulations and produce 50 different possible binomial distributions, which can be plotted together and analysed in much the same way as an explicit calculation output. Of course, there are an infinite number of possible binomial distributions, but, by using Latin hypercube sampling (see Section 4.4.3 for an explanation of the value of Latin hypercube sampling), we are ensuring that we get a good representation of the uncertainty with a few simulations. In spite of the automation provided by the RiskSimtable function in @RISK or the facilities of Crystal Ball Pro and the speed of modern computers, the simulations can take some time. However, in most non-trivial models that time is easily balanced by the reduction in complexity of the model itself, and therefore the time it takes to construct, as well as by the more intuitive manner in which the models can be constructed, which greatly helps to avoid errors. The ModelRisk software makes uncertainty analysis much easier, as all its fitting functions offer the option of either returning best-fitting parameters (or distributions, time series, etc., based on best-fitting parameters), which is the more common practice, or including the statistical uncertainty about those parameters, which is more correct.

4.4 How Monte Carlo Simulation Works

This section looks at the technical aspects of how Monte Carlo risk analysis software generates random samples for the input distributions of a model.
The difference between Monte Carlo and Latin hypercube sampling is explained. An illustration of the improvement in reliability and efficiency of Latin hypercube sampling over Monte Carlo is also presented. The use of a random number generator seed is explained, and the reader is shown how it is possible to generate probability distributions of one's own design. Finally, a brief introduction is given into the methods used by risk analysis software to produce rank order correlation of input variables.

4.4.1 Random sampling from input distributions

Consider the distribution of an uncertain input variable x. The cumulative distribution function F(x), defined in Section 6.1.1, gives the probability P that the variable X will be less than or equal to x, i.e.

F(x) = P(X ≤ x)

F(x) obviously ranges from 0 to 1. Now, we can look at this equation in the reverse direction: what is the value of x for a given value of F(x)? This inverse function is written as

x = G(F(x))

It is this concept of the inverse function G(F(x)) that is used in the generation of random samples from each distribution in a risk analysis model. Figure 4.17 provides a graphical representation of the relationship between F(x) and G(F(x)).

Figure 4.17 The relationship between x, F(x) and G(F(x)).

To generate a random sample for a probability distribution, a random number r is generated between 0 and 1. This value is then fed into the equation to determine the value to be generated for the distribution:

x = G(r)

The random number r is generated from a Uniform(0, 1) distribution to provide equal opportunity of an x value being generated in any percentile range. The inverse function concept is employed in a number of sampling methods, discussed in the following sections. In practice, for some types of probability distribution it is not possible to determine an equation for G(F(x)), in which case numerical solving techniques can be employed.
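When G(F(x)) is available in closed form, the inversion method takes only a couple of lines of code. As an illustrative sketch (my own example, not one from the book), the exponential distribution has F(x) = 1 − exp(−λx), so its inverse is G(r) = −ln(1 − r)/λ:

```python
import math
import random
random.seed(1)

lam = 2.0   # rate parameter of an Exponential(lambda) distribution

def sample_exponential(lam):
    r = random.random()             # r ~ Uniform(0, 1)
    return -math.log(1 - r) / lam   # x = G(r), the inverse of F(x)

samples = [sample_exponential(lam) for _ in range(100_000)]

# The sample mean should sit close to the true mean 1/lambda = 0.5.
print(sum(samples) / len(samples))
```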
ModelRisk uses the inversion method for all of its 70+ families of univariate distributions and allows the user to control how each distribution is sampled via its "U-parameter". For example, in VoseNormal(mu, sigma, U), mu and sigma are the mean and standard deviation of the normal distribution, and VoseNormal(mu, sigma, 0.9) returns the 90th percentile of the distribution. VoseNormal(mu, sigma), or VoseNormal(mu, sigma, RiskUniform(0, 1)) for @RISK users, or VoseNormal(mu, sigma, CB.Uniform(0, 1)) for Crystal Ball users, etc., returns random samples from the distribution that are controlled by ModelRisk, @RISK or Crystal Ball respectively. The inversion method also allows us to make use of copulas to correlate variables, as explained in Section 13.3.

4.4.2 Monte Carlo sampling

Monte Carlo sampling uses the above sampling method exactly as described. It is the least sophisticated of the sampling methods discussed here, but it is the oldest and best known. Monte Carlo sampling got its name as the code word for work that von Neumann and Ulam were doing during World War II on the Manhattan Project at Los Alamos for the atom bomb, where it was used to integrate otherwise intractable mathematical functions (Rubinstein, 1981). However, one of the earliest examples of the use of the Monte Carlo method was the famous Buffon's needle problem, where needles were physically thrown randomly onto a gridded field to estimate the value of π. At the beginning of the twentieth century the Monte Carlo method was also used to examine the Boltzmann equation, and in 1908 the famous statistician Student (W. S. Gosset) used the Monte Carlo method for estimating the correlation coefficient in his t-distribution. Monte Carlo sampling satisfies the purist's desire for an unadulterated random sampling method. It is useful if one is trying to get a model to imitate random sampling from a population, or for doing statistical experiments.
However, the randomness of its sampling means that it will over- and undersample from various parts of the distribution, and it cannot be relied upon to replicate the input distribution's shape unless a very large number of iterations are performed. For nearly all risk analysis modelling, the pure randomness of Monte Carlo sampling is not really relevant. We are almost always far more concerned that the model reproduces the distributions that we have determined for its inputs. Otherwise, what would be the point of expending so much effort on getting these distributions right? Latin hypercube sampling addresses this issue by providing a sampling method that appears random but that also guarantees to reproduce the input distribution with much greater efficiency than Monte Carlo sampling.

4.4.3 Latin hypercube sampling

Latin hypercube sampling, or LHS, is an option that is now available for most risk analysis simulation software programs. It uses a technique known as "stratified sampling without replacement" (Iman, Davenport and Zeigler, 1980) and proceeds as follows. The probability distribution is split into n intervals of equal probability, where n is the number of iterations that are to be performed on the model. Figure 4.18 illustrates an example of the stratification that is produced for 20 iterations of a normal distribution. The bands can be seen to get progressively wider towards the tails as the probability density drops away. In the first iteration, one of these intervals is selected using a random number. A second random number is then generated to determine where, within that interval, F(x) should lie. In practice, the second half of the first random number can be used for this purpose, reducing simulation time. x = G(F(x)) is then calculated for that value of F(x).

Figure 4.18 Example of the effect of stratification in Latin hypercube sampling.
The process is repeated for the second iteration, but the interval used in the first iteration is marked as having already been used and therefore will not be selected again. This process is repeated for all of the iterations. Since the number of iterations n is also the number of intervals, each interval will have been sampled exactly once and the distribution will have been reproduced with predictable uniformity over the F(x) range. The improvement offered by LHS over Monte Carlo can be easily demonstrated. Figure 4.19 compares the results obtained by sampling from a Triangle(0, 10, 20) distribution with LHS and Monte Carlo sampling. The top panels of Figure 4.19 show histograms of the triangular distribution after one simulation of 300 iterations. The LHS clearly reproduces the distribution much better. The middle panels of Figure 4.19 show an example of the convergence of the two sampling techniques to the true values of the distribution's mean and standard deviation. In the Monte Carlo test, the distribution was sampled 50 times, then another 50 to make 100, then another 100 to make 200, and so on, to give simulations of 50, 100, 200, 300, 500, 1000 and 5000 iterations. In the LHS test, seven different simulations were run for the seven different numbers of iterations. This difference in approach was necessary because LHS has a "memory" and Monte Carlo sampling does not: a "memory" is where the sampling algorithm takes account of where it has already sampled from in the distribution. From these two panels, one can get a feel for the consistency provided by LHS. The bottom two panels provide a more general picture. To produce these diagrams, the triangular distribution was again sampled in seven separate simulations, with the same numbers of iterations (50, 100, 200, 300, 500, 1000 and 5000), for both LHS and Monte Carlo sampling. This was repeated 100 times, and the mean and standard deviation of the results were noted.
The standard deviations of these statistics were calculated to give a feel for how much the results might naturally vary from one simulation to another. LHS consistently produces values for the distribution's statistics that are nearer to the theoretical values of the input distribution than Monte Carlo sampling. In fact, one can see that the spread in results using just 100 LHS samples is smaller than the spread using 5000 MC samples!

Figure 4.19 Comparison of the performance of Monte Carlo and Latin hypercube sampling.

Figure 4.20 Example comparison of the convergence of the mean for Monte Carlo and Latin hypercube distributions.

The benefit of LHS is eroded if one does not complete the number of iterations nominated at the beginning, i.e. if one halts the program in mid-simulation. Figure 4.20 illustrates an example where a Normal(1, 0.1) distribution is simulated for 100 iterations with both Monte Carlo sampling and LHS. The mean of the values generated has roughly the same degree of variance from the true mean of 1 until the number of iterations completed gets close to the prescribed 100, when LHS pulls in more sharply to the desired value.

4.4.4 Other sampling methods

There are a couple of other sampling methods, and I mention them here for completeness, although they do not appear very often and are not offered by the standard risk analysis packages.
Mid-point LHS is a version of standard LHS where the mid-point of each interval is used for the sampling. In other words, the data points xi generated from a distribution using n iterations will be at the (i - 0.5)/n percentiles. Mid-point LHS will produce even more precise and predictable values for the output statistics than LHS, and in most situations it would be very useful. However, there are odd occasions where its equidistancing between the F(x) values causes interference effects that would not be observed in standard LHS. In certain problems, one might only be concerned with the extreme tail of the distribution of possible outcomes. In such cases, even a very large number of iterations may fail to produce sufficient values in the extreme tail of the output for an accurate representation of the area of interest. It can then be useful to employ importance sampling (Clark, 1961), which artificially raises the probability of sampling from the ranges within the input distributions that would cause the extreme values of interest in the output. The accentuated tail of the output distribution is rescaled back to its correct probability density at the end of the simulation, but there is now good detail in the tail. In Section 4.5.1 we will look at another method of simulation that ensures that one can get sufficient detail in the modelling of rare events. Sobol numbers are non-random sequences of numbers that progressively fill in the Latin hypercube space. The advantage they offer is that one can keep adding more iterations and they keep filling gaps previously left. Contrast that with LHS, for which we need to define the number of iterations at the beginning of the simulation and, once it is complete, we have to start again - we can't build on the sampling already done.
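The standard LHS recipe of Section 4.4.3, together with its mid-point variant, is short enough to write out directly. The sketch below is my own illustration in Python, using the standard normal as the target distribution because its inverse cumulative function G is available in the standard library:

```python
import random
from statistics import NormalDist

random.seed(1)
norm = NormalDist()   # standard normal; inv_cdf plays the role of G(F(x))

def lhs(n):
    """Latin hypercube sample: one draw from each of n equal-probability
    intervals, visited in random order (sampling without replacement)."""
    intervals = list(range(n))
    random.shuffle(intervals)   # each interval selected exactly once
    # (i + r)/n places F(x) at a random point within interval i
    return [norm.inv_cdf((i + random.random()) / n) for i in intervals]

def midpoint_lhs(n):
    """Mid-point variant: the (i - 0.5)/n percentiles of the distribution."""
    return [norm.inv_cdf((i + 0.5) / n) for i in range(n)]

sample = lhs(100)
print(sum(sample) / len(sample))   # close to the true mean of 0
```

Because every interval is visited exactly once, re-running lhs(100) reproduces the distribution's shape far more reliably than 100 plain Monte Carlo draws would.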
4.4.5 Random number generator seeds

There are many algorithms that have been developed to generate a series of random numbers between 0 and 1 with equal probability density for all possible values; there are plenty of reviews you can find online. The best general-purpose algorithm is currently widely held to be the Mersenne twister. These algorithms will start with a value between 0 and 1, and all subsequent random numbers that are generated will rely on this initial seed value. This can be very useful. Most decent risk analysis packages now offer the option to select a seed value. I personally do this as a matter of course, setting the seed to 1 (because I can remember it!). Providing the model is not changed, and that includes the position of the distributions in a spreadsheet model and therefore the order in which they are sampled, the same simulation results can be exactly repeated. More importantly, one or more distributions can be changed within the model and a second simulation run to look at the effect these changes have on the model's outputs. It is then certain that any observed change in the result is due to changes in the model, and not a result of the randomness of the sampling.

4.5 Simulation Modelling

My cardinal rule of risk analysis modelling is: "Every iteration of a risk analysis model must be a scenario that could physically occur". If the modeller follows this "cardinal rule", he or she has a much better chance of producing a model that is both accurate and realistic, and will avoid most of the problems I so frequently encounter when reviewing a client's work. Section 7.4 discusses the most common risk modelling errors. A second very useful rule is: "Simulate when you can't calculate". In other words, don't simulate when it is possible, and not too onerous, to determine the answer exactly and directly through normal mathematics.
There are several reasons for this: simulation provides an approximate answer where mathematics can give an exact one; simulation will often not be able to provide the entire distribution, especially at the low-probability tails; mathematical equations can be updated instantaneously in light of a change in the value of a parameter; and techniques like partial differentiation that can be applied to mathematical equations provide methods to optimise decisions much more easily than simulation. In spite of all these benefits, algebraic solutions can be excessively time consuming or intractable for all but the simplest problems. For those who are not particularly mathematically inclined or trained, simulation provides an efficient and intuitive approach to modelling risky issues.

4.5.1 Rare events

It is often tempting in a risk analysis model to include very unlikely events that would have a very large impact should they occur; for example, including the risk of a large earthquake in a cost model of a Sydney construction project. True, the large earthquake could happen and the effect would be devastating, but there is generally little to be gained from including the rare event in an overview model. The expected impact of a rare event is determined by two factors: the probability that it will occur and, if it did occur, the distribution of possible impact it would have. For example, we may determine that there is about a 1:50 000 chance of a very large earthquake during the construction of a skyscraper. However, if there were an earthquake, it would inflict anything between a few hundred pounds' damage and a few million. In general, the distribution of the impact of a rare event is far more straightforward to determine than the probability that the rare event will occur in the first place. We often can be no more precise about the probability than to within one or two orders of magnitude (i.e. to within a factor of 10-100).
It is usually this determination of the probability of the event that provides a stumbling block for the analyst. One method to determine the probability is to look at past frequencies and assume that they will represent the future. This may be of use if we are able to collect a sufficiently large and reliable dataset. Earthquake data in the New World, for example, only extend back 200 or 300 years, so the smallest frequency such data could support is of the order of one event in 200 years. Another method, commonly used in fields like nuclear power reliability, is to break the problem down into components. For an explosion to occur in a nuclear power station (excluding human error), a potential hazard would have to occur and a string of safety devices would all have to fail together. The probability of an explosion is the product of the probability of the initial conditions necessary for an explosion and the probabilities of each safety device failing. This method has also been applied in epidemiology, where agricultural authorities have sought to determine the risk of introduction of an exotic disease. These analyses typically attempt to map out the various routes through which contaminated animals or animal products can enter the country and then infect the country's livestock. In some cases, the structure of the problem is relatively simple and the probabilities can be reasonably calculated; for example, the risk of introducing a disease through importing semen straws or embryos. In this case the volume is easily estimated, its source is determinable, and regulations can be imposed to minimise the risk. In other cases, the structure of the problem is extremely complex and a sensible analysis may be impossible except to place an upper limit on the probability; for example, the risk of introducing disease into native fish by importing salmon.
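The product-of-probabilities decomposition used in the nuclear example above can be sketched in a few lines; every probability value here is hypothetical, chosen only to illustrate the arithmetic:

```python
# Probability of the initiating hazard occurring (hypothetical value)
p_hazard = 1e-3

# Probabilities that each of three independent safety devices
# fails on demand (hypothetical values)
p_device_failures = [1e-2, 5e-3, 2e-2]

# Product rule for independent events: the hazard must occur AND
# every safety device must fail together
p_explosion = p_hazard
for p in p_device_failures:
    p_explosion *= p

# p_explosion is now about 1e-9: of the order of one in a billion
```

The decomposition is only as good as the independence assumption; common-cause failures (a flood disabling all devices at once) would break the simple product rule.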
There are so many paths through which a fish in a stream or fish farm could be exposed to imported contaminated salmon, ranging from a seagull picking up a scrap from a dump and dropping it in a stream right in front of a fish, to a saboteur deliberately buying some salmon and feeding it to fish in a farm. It is clearly impossible to cover all of the scenarios that might exist, or even to calculate the probability of each individual scenario. In such cases, it makes more sense to set an upper bound on the probability that infection occurs. It is very common for people to include rare events in a risk analysis model that is primarily concerned with the general uncertainty of the problem, but this provides little extra insight. For example, we might construct a model to estimate how long it will take to develop a software application for a client: designing, coding, testing, etc. The model would be broken down into key tasks and probabilistic estimates made for the duration of each task. We would then run a simulation to find the total effect of all these uncertainties. We would not include in such an analysis the effect of a plane crashing into the office or the project manager quitting. We might recognise these risks and hold back-up files at a separate location or make the project manager sign a tight contract, but we would gain no greater understanding of our project's chance of meeting the deadline by incorporating such risks into our model.

4.5.2 Model uncertainty

Model building is subjective. The analyst has to decide how to build a necessarily simple model to attempt to represent a frequently very complicated reality. One needs to make decisions about which bits can be left out as insignificant, perhaps without a great deal of data to back up the decision. We also have to reason about which type of stochastic process is actually operating.
In truth, we rarely have a purely binomial, Poisson or any other theoretical stochastic process occurring in nature. However, we can often convince ourselves that the degree of deviation from the simplified model we chose to use is not terribly significant. It is important in any model to consider how it could fail to represent the real world. In any mathematical abstraction we are making certain assumptions, and it is important to run through these assumptions, both the explicit assumptions that are easy to identify and the implicit assumptions that one may easily fail to spot. For example, using a Poisson process to model frequencies of epidemics may seem quite reasonable, as they could be considered to occur randomly in time. However, the individuals in one epidemic can be the source of the next epidemic, in which case the events are not independent. Seasonality of epidemics means that the Poisson intensity varies with the month, which can be catered for once it is recognised, but if there are other random elements affecting the Poisson intensity then it may be more appropriate to model the epidemics as a mixture process. Sometimes one may have two possible models (for example, two equations relating bacteria growth rates to time and ambient temperature, or two equations for the lifetime of a device), both of which seem plausible. In my view, these represent subjective uncertainty that should be included in the model, just as other uncertain parameters have distributions assigned to them. So, for example, if I have two plausible growth models, I might use a discrete distribution to select one or the other randomly during each iteration of the model. There is no easy solution to the problems of model uncertainty. It is essential to identify the simplifications and assumptions one is making when presenting the model and its results, in order for the reader to have an appropriate level of confidence in the model.
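The two-growth-models idea can be sketched as follows; both model equations and the 50/50 weighting are hypothetical stand-ins, not taken from the text:

```python
import math
import random

def growth_model_a(t):
    # Hypothetical exponential growth equation
    return math.exp(0.05 * t)

def growth_model_b(t):
    # Hypothetical logistic alternative, plausible over the same range
    return 20.0 / (1.0 + 19.0 * math.exp(-0.12 * t))

random.seed(1)
results = []
for _ in range(1000):
    t = random.uniform(0, 48)          # uncertain elapsed time (hours)
    # Discrete 50/50 choice: each iteration samples one plausible model,
    # so the model uncertainty flows into the output distribution
    model = growth_model_a if random.random() < 0.5 else growth_model_b
    results.append(model(t))
```

The output distribution now mixes both models, so a decision based on it automatically reflects the subjective uncertainty about which equation is right.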
Arguments and counterarguments can be presented for the factors that would bring about a failure of the model. Analysts can be nervous about pointing out these assumptions, but practical decision-makers will understand that any model has assumptions, and they would rather be aware of them than not. In any case, I think it is always much better for me to be the person who points out the potential weaknesses of my models first. One can also often analyse the effects of changing the model assumptions, which gives the reader some feel for the reliability of the model's results.

Chapter 5 Understanding and using the results of a risk analysis

A risk analysis model, however carefully crafted, is of no value unless its results are understandable, useful, believable and tailored to the problem in hand. This chapter looks at various techniques to help the analyst achieve these goals. Section 5.1 gives a brief overview of the points that should be borne in mind in the preparation of a risk analysis report. Section 5.2 looks at how to present the assumptions of the model in a succinct and comprehensible way. The results of a risk analysis model are far more likely to be accepted by decision-makers if they understand the model and accept its assumptions. Section 5.3 illustrates a number of graphical presentations that can be employed to demonstrate a model's results and offers guidance for their most appropriate use. Finally, Section 5.4 looks at a variety of statistical analyses that can be performed on the output data of a risk analysis.
In addition to writing comprehensive risk analysis reports, I have found it particularly helpful to my clients to run short courses for senior management that explain: how to manage a risk assessment (time and resources required, typical sequence of activities, etc.); how to ensure that a risk assessment is being performed properly; what a risk assessment can and cannot do; what outputs one can ask for; and how to interpret, present and communicate a risk assessment and its results. This type of training eases the introduction of risk analysis into an organisation. We see many organisations where the engineers, analysts, scientists, etc., have embraced risk analysis, trained themselves and acquired the right tools, and then fail to push the extra knowledge up the decision chain because the decision-makers remain unfamiliar with, and perhaps intimidated by, all this new "risk analysis stuff". If you are intending to present the results of a risk analysis to an unknown audience, consider assuming that the audience knows nothing about risk analysis modelling and explain some basic concepts (like Monte Carlo simulation) at the beginning of the presentation.

5.1 Writing a Risk Analysis Report

Complex models, probability distributions and statistics often leave the reader of a risk analysis report confused (and probably bored). The reader may have little understanding of the methods employed in risk analysis or of how to interpret, and make decisions from, its results. In this environment it is essential that a risk analysis report guide the reader through the assumptions, results and conclusions (if any) in a manner that is transparently clear but neither esoteric nor oversimplistic. The model's assumptions should always be presented in the report, even if only in a very shorthand form.
I have found that a report puts across its message to the reader much more effectively if these model assumptions are put at the back of the report, the front being reserved for the model's results, an assessment of its robustness (see Chapter 3) and any conclusions. We tend to write reports with the following components (depending on the situation):

- summary;
- introduction to the problem;
- decision questions addressed and those not addressed;
- discussion of available data and relation to model choice;
- major model assumptions and the impact on the results if incorrect;
- critique of model, comment on validation;
- presentation of results;
- discussion of possible options for improvement, extra data that would change the model or its results, additional work that could be done;
- discussion of modelling strategy;
- decision question(s);
- available data;
- methods of addressing decision questions with available information;
- assumptions inherent in different modelling options;
- explanation of choice of model;
- discussion of model used;
- overview of model structure, how the sections relate together;
- discussion of each section (data, mathematics, assumptions, partial results);
- results (graphical and statistical analyses);
- model validation;
- references and datasets;
- technical appendices;
- explanation of unusual equation derivations;
- guide on how to interpret and use statistical and graphical outputs.

The results of the model must be presented in a form that clearly answers the questions that the analyst sets out to answer. It sounds rather obvious, but I have seen many reports that have failed in this respect for several reasons:

- The report relied purely on statistics. Graphs help the reader enormously to get a "feel" for the uncertainty that the model is demonstrating.
- The key question is never answered. The reader is left instead to make the last logical step.
For example, a distribution of a project's estimated cost is produced, but no guidance is offered for determining a budget, risk contingency or margin.

- The graphs and statistics use values to five, six or more significant figures. This is an unnatural way for most readers to think of values and impairs their ability to use the results.
- The report is filled with volumes of meaningless statistics. Risk analysis software programs, like @RISK and Crystal Ball, automatically generate very comprehensive statistics reports. However, most of the statistics they produce will be of no relevance to any one particular model. The analyst should pare down any statistics report to those few statistics that are germane to the problem being modelled.
- The graphs are not properly labelled! Arrows and notes on a graph can be particularly useful.

In summary:

1. Tailor the report to the audience and the problem.
2. Keep statistics to a minimum.
3. Use graphs wherever appropriate.
4. Always include an explanation of the model's assumptions.

5.2 Explaining a Model's Assumptions

We recommend that you are very explicit about your assumptions, and make a summary of them in a prominent place in the report, rather than just have them scattered through the report in the explanation of each model component. A risk analysis model will often have a fairly complex structure, and the analyst needs to find ways of explaining the model that can quickly be checked. The first step is usually to draw up a schematic diagram of the structure of the model. The type of schematic diagram will obviously depend on the problem being modelled: GANTT charts, site plans with phases, work breakdown structures, flow diagrams, event trees, etc. - any pictorial representation that conveys the required information. The next step is to show the key quantitative assumptions that are made for the model's variables.
Distribution parameters

Using the parameters of a distribution to explain how a model variable has been characterised will often be the most informative way of explaining a model's logic. We tend to use tables of formulae for more technical models where there are a lot of parametric distributions and probability equations, because the logic is apparent from the relationship between a distribution's parameters and other variables. For nonparametric distributions, which are generally used to model expert opinion or to represent a dataset, a thumbnail sketch helps the reader most. Influence diagram plots (Figure 5.1 illustrates a simple example) are excellent for showing the flow of the logic and the interrelationships between model components, but not the mathematics underlying the links.

Figure 5.1 Example of a schematic diagram of a model's structure: a total project cost driven by additional costs, inflation, and the risks of political change, strike and bad weather.

Graphical illustrations of quantitative assumptions are particularly useful when non-parametric distributions have been used. For example, a sketch of a VoseRelative (Custom in Crystal Ball, General in @RISK), a VoseHistogram or a VoseCumulA distribution will be a lot more informative than noting its parameter values. Sketches are also very good when you want to explain partial model results. For example, summary plots are useful for demonstrating the numbers that come out of what might be a quite complex time series model. Scatter plots are useful for giving an overview of what might be a very complicated correlation structure between two or more variables. Figure 5.2 illustrates a simple format for an assumptions report. Crystal Ball offers a report-writing feature that will do most of this automatically. There will usually be a wealth of data behind these key quantitative assumptions and the formulae that have been used to link them. Explanations of the
data and how they translate into the quantitative assumptions can be relegated to an appendix of the risk analysis report, if they are to be included at all.

5.3 Graphical Presentation of a Model's Results

There are two forms in which a model's results can be presented: graphs and numbers. Graphs have the advantage of providing a quick, intuitive way to understand what is usually a fairly complex, number-intensive set of information. Numbers, on the other hand, give us the raw data and statistics from which we can make quantitative decisions. This section looks at graphical presentations of results, and the following section reviews statistical methods of reporting. The reader is strongly encouraged to use graphs wherever it is useful to do so, and to avoid intensive use of statistics.

5.3.1 Histogram plots

The histogram, or relative frequency, plot is the most commonly used in risk analysis. It is produced by grouping the data generated for a model's output into a number of bars or classes. The number of values in any class is its frequency. The frequency divided by the total number of values gives an approximate probability that the output variable will lie in that class's range. We can easily recognise common distributions such as triangular, normal, uniform, etc., and we can see whether a variable is skewed. Figure 5.3 shows the result of a simulation of 500 iterations, plotted as a 20-bar histogram. The most common mistake in interpreting a histogram is to read off the y-scale value as the probability of the x value occurring. In fact, the probability of any single x value, given that the output is continuous (and most are), is infinitesimally small. If the model's output is discrete, the histogram will show the probability of each allowable x value, provided the class width is less than the distance between each allowable x value. The number of classes used in a histogram plot will determine the scale of the y axis.
Clearly, the wider the bar width, the more chance there will be that values will fall within it. So, for example, by doubling the number of histogram bars, the probability scale will approximately halve. Monte Carlo add-ins generally offer two options for scaling the vertical axis: density and relative frequency plots, shown in Figures 5.4 and 5.5. In plotting a histogram, the number of bars should be chosen to balance between a lack of detail (too few bars) and overwhelming random noise (too many bars).

Figure 5.3 Doubling the number of bars on average halves the probability height for a bar.

Figure 5.4 Histogram "density" plot. The vertical scale is calculated so that the sum of the histogram bar areas equals unity. This is only appropriate for continuous outputs (left). Simulation software won't recognise if an output is discrete (right), so it treats the generated output data in the same way as a continuous output. The result is a plot where the probability values make no intuitive sense: in the right-hand plot the probabilities appear to add up to more than 1. To be able to tell the probability of the output being equal to 4, for example, we first need to know the width of the histogram bar.

When the result of a risk analysis model
is a discrete distribution, it is usually advisable to set the number of histogram bars to the maximum possible, as this will reveal the discrete nature of the output unless the output distribution takes a large number of discrete values. Some risk analysis software programs offer the facility to smooth out a histogram plot. I don't recommend this approach because: (a) it suggests greater accuracy than actually exists; (b) it fits a spline curve that will accentuate (unnecessarily) any peaks and troughs; and (c) if the scale remains the same, the area does not integrate to 1 unless the original bandwidths were one x-axis unit wide. The histogram plot is an excellent way of illustrating the distribution of a variable, but is of little value for determining quantitative information about that variability, which is where the cumulative frequency plot takes over. Several histogram plots can be overlaid on each other if the histograms are not filled in. This allows one to make a visual comparison, for example, between two decision options one may be considering. The same type of graph can also be used to represent the results of a second-order risk analysis model where the uncertainty and variability have been separated, in which case each distribution curve would represent the system variability given a random sample from the uncertainty distribution of the model.

Figure 5.5 Histogram "relative frequency" plot. The vertical scale is calculated as the fraction of the generated values that fall into each histogram bar's range. Thus, the sum of the bar heights equals unity. Relative frequency is only appropriate for discrete variables (right), where the histogram heights now sum to unity. For continuous variables (left), the area under the curve no longer sums to unity.
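The two axis-scaling conventions described for Figures 5.4 and 5.5 can be checked numerically. A sketch using numpy, with a stand-in continuous output:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(100, 15, size=5000)   # stand-in continuous model output

counts, edges = np.histogram(data, bins=20)
bar_width = edges[1] - edges[0]

# Relative frequency scaling: bar heights sum to 1
rel_freq = counts / counts.sum()

# Density scaling: bar areas (height x width) sum to 1
density = rel_freq / bar_width

assert np.isclose(rel_freq.sum(), 1.0)
assert np.isclose((density * bar_width).sum(), 1.0)
```

Rerunning with `bins=40` roughly halves each `rel_freq` height, which is the effect noted in the caption of Figure 5.3.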
5.3.2 The cumulative frequency plot

The cumulative frequency plot has two forms: ascending and descending, shown in Figure 5.6. The ascending cumulative frequency plot is the more commonly used of the two and shows the probability of being less than or equal to the x-axis value. The descending cumulative frequency plot, on the other hand, shows the probability of being greater than or equal to the x-axis value. From now on, we shall assume use of the ascending plot. Note that the mean of the distribution is sometimes marked on the curve, in this case using a black square.

Figure 5.6 Ascending and descending cumulative frequency plots.

The cumulative frequency distribution of an output can be plotted directly from the generated data as follows:

1. Rank the data in ascending order.
2. Next to each value, calculate its cumulative percentile P_i = i/(n + 1), where i is the rank of that data value and n is the total number of generated values. i/(n + 1) is used because it is the best estimate of the theoretical cumulative distribution function of the output that the data are attempting to reproduce.
3. Plot the data (x axis) against the i/(n + 1) values (y axis).

Figure 5.7 Producing a cumulative frequency plot from generated data points.

Figure 5.7 illustrates an example. A total of 200-300 iterations is usually quite sufficient to plot a smooth curve. This technique is very useful if one wishes to avoid the standard format that Monte Carlo software offers, or to plot two or more cumulative frequency plots together. The cumulative frequency plot is very useful for reading off quantitative information about the distribution of the variable. One can read off the probability of exceeding any value; for example, the probability of going over budget, failing to meet a deadline or of achieving a positive NPV (net present value).
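The three plotting steps above translate directly into code. The output data here are a stand-in (lognormal costs), not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.lognormal(mean=7.0, sigma=0.3, size=300)  # stand-in output data

# 1. Rank the data in ascending order
x = np.sort(data)

# 2. Cumulative percentile P_i = i/(n + 1) for ranks i = 1..n
n = len(x)
p = np.arange(1, n + 1) / (n + 1)

# 3. Plot x against p; here we instead read off, say, P(output <= 1500)
below = x <= 1500
prob_under_1500 = p[below][-1] if below.any() else 0.0
```

Reading `p` at a given x value gives exactly the kind of "probability of staying under budget" figure discussed in the text.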
One can also find the probability of lying between any two x-axis values: it is simply the difference between their cumulative probabilities. From Figure 5.8 we can see that the probability of lying between 1000 and 2000 is 89 % - 48 % = 41 %.

Figure 5.8 Using the cumulative frequency plot to determine the probability of being between two values.

The cumulative frequency plot is often used in project planning to determine contract bid prices and project budgets, as shown in Figure 5.9. The budget is set as the expected (mean) value of the variable, determined from the statistics report. A risk contingency is then added to the budget to bring it up to a cumulative percentile that is comfortable for the organisation. The risk contingency is typically the amount available to project managers to spend without recourse to their board. The (budget + contingency) value is set to match a cumulative probability that the board of directors is happy to plan for: in this case 85 %. A more controlling board might set the sum at the 80th percentile or lower. The margin is then added to the (budget + contingency) to determine a bid price or project budget. The project cost might still possibly exceed the bid price, and the company would then make a loss. Conversely, they would hope, by careful management of the project, to avoid using all of the risk contingency and actually increase their margin.

Figure 5.9 Using the cumulative frequency plot to determine appropriate values for a project's budget, contingency and margin.

The x axis of a cumulative distribution of project cost or duration can be thought of roughly as listing risks in decreasing order of importance. The easiest risks to manage, i.e.
those that should be removed with good project management, are the first to erode the total cost or duration. So a target set at the 80th percentile, sometimes called the 20 % risk level, is roughly equivalent to removing the identified, easily managed risks. Then there are those risks that will be removed with a lot of hard work, good management and some luck, which brings us down to the 50th percentile, or so. To reduce the actual cost or duration to somewhere around the 20th percentile will usually require very hard work, good management and a lot of luck.

Figure 5.10 Overlaying of the cumulative frequency plots of several project milestones (Milestones A to E) illustrates any increase in uncertainty with time.

It is sometimes useful to overlay cumulative frequency plots together. One reason to do this is to get a visual picture of stochastic dominance, described in Section 5.4.5. Another reason is to visualise the increase (or perhaps decrease) in uncertainty as a project progresses. Figure 5.10 illustrates an example for a project with five milestones. The time until completion of a milestone becomes progressively more uncertain the further from the start the milestone is. Furthermore, the results of a second-order risk analysis can be plotted as a number of overlying cumulative distributions, each curve representing a distribution of variability for a particular random sample from the uncertainty distributions of the model.

5.3.3 Second-order cumulative probability plot

A second-order cdf is the best presentation of an output probability distribution when you run a second-order Monte Carlo simulation.
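A sketch of how the many lines of such a plot can be generated, using an outer loop over uncertainty and an inner set of variability samples; both the parameter-uncertainty distributions and the variability distribution are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

curves = []
for _ in range(50):                     # outer loop: uncertainty
    # One sample from each (hypothetical) parameter-uncertainty distribution
    mu = rng.normal(10.0, 1.0)          # uncertain mean
    sigma = rng.uniform(1.0, 3.0)       # uncertain standard deviation

    # Inner samples: variability, given those parameter values
    sample = np.sort(rng.normal(mu, sigma, size=500))
    p = np.arange(1, 501) / 501         # P_i = i/(n + 1) percentiles
    curves.append((sample, p))          # one cdf line per uncertainty draw
```

Plotting all fifty (sample, p) pairs on one chart gives the bundle of cdf lines the following paragraphs describe: the horizontal spread of the bundle shows uncertainty, each line's own spread shows randomness.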
The second-order cdf is composed of many lines, each of which represents a distribution of possible variability or probability generated by picking a single value from each uncertainty distribution in the model (Figures 5.11 to 5.13).

Figure 5.11 A second-order plot of a discrete random variable. The step nature of the plot makes it difficult to read.

Figure 5.12 Another second-order plot of a discrete variable, where the probabilities are marked with small points and joined by straight lines. The connection between the probability estimates is now clear, and the uncertainty and randomness components can now be compared: at its widest the uncertainty contributes a spread of about two units (dashed horizontal line), while the randomness ranges over some eight units (filled horizontal line), so the inability to predict this variable is more driven by its randomness than by our uncertainty in the model parameters.

Figure 5.13 A second-order plot of a continuous variable where our inability to predict its value is equally driven by uncertainty (dashed horizontal line) about the model parameters as by the randomness of the system (filled horizontal line). This is a useful plot for decision-makers because it tells them potentially how much more sure one would be of the predicted value if more information could be collected, and thus the uncertainty reduced.

5.3.4 Overlaying of cdf plots

Several cumulative distribution plots can be overlaid together (Figure 5.14). The plots are easier to read if the curves are formatted into line plots rather than area plots.

Figure 5.14 Several cumulative distribution plots overlaid together.

The overlaying of cumulative plots like this is an intuitive and easy way of comparing probabilities, and is the basis of stochastic dominance tests.
It is not very useful, however, for comparing the location, spread and shape of two or more distributions, for which overlaid density plots are much better. We recommend that a complementary cumulative distribution plot be given alongside the histogram (density) plot to provide the maximum information.

5.3.5 Plotting a variable with discrete and continuous elements

If a risk event does not occur, we could say it has zero impact, but if it occurs it will have an uncertain impact. For example: a fire may have a 20 % chance of occurring and, if it does, will incur $Lognormal(120 000, 30 000) of damage. We could model this as the product of an event indicator (taking the value 1 with probability 20 % and 0 otherwise) and the impact distribution. Running a simulation with this variable as an output, we would get the uninformative relative frequency histogram plot (shown with different numbers of bars) in Figure 5.15. There really is no useful way to show such a distribution as a histogram, because the spike at zero (in this case) requires a relative frequency scale, while the continuous component requires a continuous scale. A cumulative distribution, however, would produce the plot in Figure 5.16, which is meaningful.

5.3.6 Relationship between cdf and density (histogram) plots

For a continuous variable, the gradient of a cdf plot is equal to the probability density at that value. That means that, the steeper the slope of a cdf, the higher a relative frequency (histogram) plot would look at that point (Figure 5.17). The disadvantage of a cdf is that one cannot readily determine the central location or shape of the distribution. We cannot even easily recognise common distributions such as triangular, normal and uniform in cdf form without practice. Looking at the plots in Figure 5.18, you will readily identify the distribution form from the left panels, but not so easily from the right panels.
Figure 5.15 Histogram plot of a risk event.

Figure 5.16 Cumulative distribution of a risk event.

Figure 5.17 Relationship between density and cumulative probability curves.

Figure 5.18 Density and cumulative plots for some easily recognised distributions.

For a discrete distribution, the cdf increases in steps equal to the probability of the x value occurring (Figure 5.19).

Figure 5.19 Relationship between probability mass and cumulative probability plots for a discrete distribution.

5.3.7 Crude sensitivity analysis and tornado charts

Most Monte Carlo add-ins can perform a crude sensitivity analysis that is often used to identify the key input variables, as a precursor to performing a tornado chart or similar, more advanced, analysis on these key variables. It achieves this by performing one of two statistical analyses on data that have been generated from input distributions and data calculated for the selected output. Built into this operation are two important assumptions:

1. All the tested input parameters have either a purely positive or purely negative statistical correlation with the output.
2. Each uncertain variable is modelled with a single distribution.

Figure 5.20 Example input-output relationships for which crude sensitivity analysis is inappropriate.
Assumption 1 is rarely invalid, but would be incorrect if the output value were at a maximum or minimum for an input value somewhere in the middle of its range (see, for example, Figure 5.20). Assumption 2 is very often incorrect. For example, the impact of a risk event might be modelled as the product of a Bernoulli distribution (does the risk occur?) and a triangular distribution (the impact if it does). Monte Carlo software will generate the Bernoulli (or, equivalently, the binomial) and triangular distributions independently. Performing the standard sensitivity analysis will evaluate the effect of the Bernoulli and the triangular distributions separately, so the measured effect on the output will be divided between these two distributions. ModelRisk gets round this by providing the function VoseRiskEvent. The function constructs a single distribution, so only one Uniform(0, 1) variate is being used to drive the sampling of the risk impact. If you use @RISK, you can construct the risk event in the same single-variate fashion, and @RISK will then drive the sampling for that risk event so that the @RISK built-in sensitivity analysis will work correctly. Similarly, if you were an insurance company you might be interested in the impact on your corporate cashflow of the aggregate claims distribution for some particular policy. ModelRisk offers a number of aggregate distribution functions that internally calculate the aggregation of claim size and frequency distributions. One can, for example, write a single aggregate function that returns the aggregate cost of Poisson(5500) claims, each drawn independently from a Lognormal(2350, 1285) distribution, with the generated aggregate cost value controlled by a single U variate. ModelRisk has many such tools for simulating from constructed distributions to help you perform a correct sensitivity analysis. Assumption 2 also means that this method of sensitivity analysis is invalid for a variable that is modelled over a series of cells, like a time series of exchange rates or sales volumes.
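The idea behind a combined risk-event function like VoseRiskEvent can be sketched as an inverse transform of the mixed cdf: one Uniform(0, 1) variate decides both whether the event occurs and, if so, how large the impact is. The following Python sketch is an illustration of that principle only (not ModelRisk's actual implementation; the lognormal parameters are the hypothetical mu = 11.66, sigma = 0.246 of an underlying normal):

```python
import math
import random
import statistics

random.seed(3)

ND = statistics.NormalDist()  # standard normal, used to invert the lognormal cdf

def risk_event_inv(u, p, mu, sigma):
    """Invert the mixed cdf of a risk event with a single U(0,1) variate.

    With probability 1 - p the event does not occur (cost 0); otherwise
    the cost is Lognormal with underlying-normal parameters mu, sigma.
    """
    if u <= 1 - p:
        return 0.0
    # Rescale the remaining probability mass onto (0, 1) and invert.
    q = (u - (1 - p)) / p
    return math.exp(mu + sigma * ND.inv_cdf(q))

# One uniform variate drives the whole risk event, so the sensitivity
# analysis sees a single input distribution rather than two.
u = random.random()
print(risk_event_inv(u, 0.2, 11.66, 0.246))
```

Because the function is monotone in u, larger values of the driving variate always mean a larger (or equal) cost, which is exactly what a correlation-based sensitivity analysis needs.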
The automated analysis will evaluate the sensitivity of the output to each distribution within the time series separately. You can still evaluate the sensitivity of a time series by running two simulations: one with all the distributions simulating random values, the other with the distributions of the time series locked to their expected values. If the output distributions differ significantly, the variable time series is important.

Two statistical analyses

Tornado charts for two different methods of sensitivity analysis are in common use. Both methods plot the variable against a statistic that takes values from -1 (the output is wholly dependent on this input, but when the input is large, the output is small), through 0 (no influence), to +1 (the output is wholly dependent on this input, and when the input is large, the output is also large):

- Stepwise least-squares regression between collected input distribution values and the selected output values. The assumption here is that there is a relationship between each input I and the output O (when all other inputs are held constant) of the form O = m * I + c, where m and c are constants. That assumption is correct for additive and subtractive models, and will give very accurate results in those circumstances, but is otherwise less reliable and somewhat unpredictable. The r-squared statistic is then used as the measure of sensitivity in a tornado chart.
- Rank order correlation. This analysis replaces each collected value by its rank among the other values generated for that input or output, and then calculates Spearman's rank order correlation coefficient r between each input and the output. Since this is a non-parametric analysis, it is considerably more robust than the regression analysis option where there are complex relationships between the inputs and output.
Tornado charts are used to show the influence each input distribution has on the change in value of the output (Figure 5.21). They are also useful for checking that the model is behaving as you expect. Each input distribution is represented by a bar, and the horizontal range the bars cover gives some measure of the input distribution's influence on the selected model output. Their main use is as a quick overview to identify the most influential input model parameters. Once these parameters are determined, other sensitivity analysis methods like spider plots and scatter plots are more effective.

Figure 5.21 Examples of tornado charts (profit sensitivity; profit variation).

The left-hand plot of Figure 5.21 is the crudest type of sensitivity analysis, where some statistical measure of the correlation is calculated between the input and output values. The logic is that, the higher the degree of correlation between the input and output variables, the more the input variable is affecting the output. The degree of correlation can be calculated using either rank order correlation or stepwise least-squares regression. My preference is to use rank order correlation because it makes no assumption about the form of the relationship between the input and the output, beyond the assumption that the direction of the relationship is the same across the entire input parameter's range. Least-squares regression, on the other hand, assumes that there is a straight-line relationship between the input and the output variables. If the model is a sum of costs or task durations, or some other purely additive model, this assumption is fine. However, divisions and power functions in a model will strongly violate such an assumption. Be careful with this simple type of sensitivity analysis, because input-output relationships that strongly deviate from a continuously increasing or decreasing trend can be completely missed.
The x-axis scale is a correlation statistic, so it is not very intuitive because it does not relate to the impact on the output in terms of the output's units. Moreover, rank order correlation can be deceptive. Consider the following simple model, in which B is driven by A and C = Normal(1, 3):

D(output) = A + B + C

Running a simulation gives correlation levels that split the measured influence roughly equally between A and B. Yet from the model structure we can see that variable A is actually driving most of the output uncertainty. If we set the standard deviation of each variable to zero in turn and compare the drop in the standard deviation of the output (a good measure of variation in this case, because we are just adding normal distributions), we find:

A: drops output standard deviation by 85.1562 %
B: drops output standard deviation by 0.0004 %
C: drops output standard deviation by 1.1037 %

which tells an entirely different story from the regression and correlation statistics. The reason is that variable B is being driven by A, so the influence of A is being divided essentially equally between A and B. A proper regression analysis would require us to build in the direction of influence from A to B, and then the influence of B would come out as insignificant; but to do so we would have to specify that relationship - a very difficult thing to do in a complex spreadsheet model.

The right-hand plot of Figure 5.21 is a little more robust and is typically created by fixing an input distribution at a low value (say its 5th percentile), running a simulation, recording the output mean and then repeating the process with a medium value (say the 50th percentile) and a high value (say the 95th percentile) of the input distribution: these output means define the extremes of the bars. This type of plot is a cut-down version of a spider plot. It is a little more robust, and the x-axis scale is in units of the output, so it is more intuitive.
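The "fix each input and watch the output standard deviation" analysis can be reproduced analytically for an additive normal model. The parameters below are a hypothetical reconstruction (the book does not show the exact definitions of A and B here) chosen so that the drops come out close to the figures quoted above:

```python
import math

# Hypothetical reconstruction of the A, B, C model discussed above:
#   A = Normal(0, 10);  B = A + Normal(0, 0.1);  C = Normal(1, 3)
#   D(output) = A + B + C = 2A + noise_B + C
sd_A, sd_Bnoise, sd_C = 10.0, 0.1, 3.0

def output_sd(a=sd_A, b=sd_Bnoise, c=sd_C):
    # Independent normal components, so variances add.
    return math.sqrt((2 * a) ** 2 + b ** 2 + c ** 2)

base = output_sd()
drops = {
    "A": 100 * (1 - output_sd(a=0.0) / base),
    "B": 100 * (1 - output_sd(b=0.0) / base),
    "C": 100 * (1 - output_sd(c=0.0) / base),
}
for name, drop in drops.items():
    print(f"fixing {name}: output sd drops by {drop:.4f} %")
```

Fixing A collapses almost all the output variation (about 85 %), while fixing B changes almost nothing, even though a naive correlation analysis would score A and B roughly equally.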
At low levels of correlation you will often see a variable with a correlation of the opposite sign to what you would expect. This is particularly so for rank order correlation. It just means that the level of correlation is so low that a spurious correlation of generated values will occur. For presentation purposes, it is obviously better to remove these bars. It is standard practice to plot the variables from the top down in decreasing size of correlation. If there are positive and negative correlations, the result looks a bit like a tornado, hence the name. It is sensible, of course, to limit the number of variables that are shown on the plot. I usually limit the plot to those variables that have a correlation of at least a quarter of the maximum observed correlation, or at least down to the first correlation that has the opposite sign to what one would logically have expected. Below such levels of correlation the relationships are usually statistically insignificant, although of course one can make a mistake in reasoning the sense of a correlation.

The tornado chart is useful for identifying the key variables and uncertain parameters that are driving the result of the model. It makes sense that, if the uncertainty of these key parameters can be reduced through improved knowledge, or the variability of the problem can be reduced by changing the system, the total uncertainty of the problem will be reduced too. The tornado chart is therefore very useful for planning any strategy for the reduction of total uncertainty. The key model components can often be made more certain by:

- Collecting more information on the parameter if it has some level of uncertainty.
- Determining strategies to reduce the effect of the variability of the model component. For a project schedule, this might be altering the project plan to take the task off the critical path. For a project cost, this might be offloading the uncertainty via a fixed-price subcontract.
For a model of the reliability of a system, this might be increasing the scheduled number of checks or installing some parallel redundancy.

The rank order correlation between the model components and the output can easily be calculated if the uncertainty and variability components are all simulated together, because the simulation software will have all the values generated for the input distributions and the output together in the one database. It may sometimes be useful to show in a tornado chart that certain model components are uncertain and others are variable by using, for example, white bars for uncertainty and black bars for variability.

5.3.8 More advanced sensitivity analysis with spider plots

To construct a spider plot we proceed as follows. Before starting:

- Set the number of iterations to a fairly low value (e.g. 300).
- Determine the input distributions to analyse (performing a crude sensitivity analysis will guide you).
- Determine the cumulative probabilities you wish to test (we generally use 1 %, 5 %, 25 %, 50 %, 75 %, 95 %, 99 %).
- Determine the output statistic you wish to measure (mean, a particular percentile, etc.).

Then:

- Select an input distribution.
- Replace the distribution with one of the percentiles you specified.
- Run a simulation and record the statistic of the output.
- Select the next cumulative percentile and run another simulation.
- Repeat until all percentiles have been run for this input, then put back the distribution and move on to the next selected input.

Once all inputs have been treated this way, we can produce the spider plot shown in Figure 5.22. This type of plot usually has several horizontal lines for variables that have almost no influence on the output. It makes the graph a lot clearer to delete these (Figure 5.23). Now we can very clearly see how the output mean is influenced by each input.
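The spider-plot procedure above can be sketched as a loop in code. The two-input profit model below is entirely hypothetical (uniform price and volume inputs, a fixed cost); the point is the structure: for each input, fix it at each chosen percentile in turn, simulate, and record the output mean.

```python
import random

random.seed(5)

# Percentiles at which each input is fixed in turn:
percentiles = [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99]

def price_inv(q):    # hypothetical input: Uniform(50, 70), inverse cdf
    return 50 + 20 * q

def volume_inv(q):   # hypothetical input: Uniform(800, 1200), inverse cdf
    return 800 + 400 * q

def profit(price, volume):
    return price * volume - 55_000

def mean_profit(fixed=None, q=None, n=3000):
    """Simulate the output mean, optionally fixing one input at its
    q-th cumulative percentile while the other inputs stay random."""
    total = 0.0
    for _ in range(n):
        p = price_inv(q) if fixed == "price" else price_inv(random.random())
        v = volume_inv(q) if fixed == "volume" else volume_inv(random.random())
        total += profit(p, v)
    return total / n

# One line of the spider plot per input:
spider = {name: [mean_profit(fixed=name, q=q) for q in percentiles]
          for name in ("price", "volume")}
for name, line in spider.items():
    print(name, [round(v) for v in line])
```

Plotting each input's list against the percentiles gives the lines of Figure 5.22; near-horizontal lines are the inconsequential inputs to delete.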
The vertical range produced by the oil price line shows the range of expected profits there would be if the oil price were fixed somewhere between its minimum and maximum (a range of $180 million). The next largest range is for the gas price ($110 million), etc. The analysis helps us understand the degree of sensitivity in terms decision-makers understand, as opposed to correlation or regression coefficients. The plot will also allow us to see variables that have unusual relationships, e.g. a variable that has no influence except at its extremes, or some sort of U-shaped relationship that would be missed in a correlation analysis.

Figure 5.22 Spider plot example (mean of profit vs input distribution percentile; inputs include thickness, exchange rate and oil price).
Figure 5.23 Spider plot example with inconsequential variables removed.

5.3.9 More advanced sensitivity analysis with scatter plots

By plotting the generated values for an input against the corresponding output values for each model iteration in a scatter plot, one can get perhaps the best understanding of the effect of the input on the output value. Plotting generated values for two outputs is also commonly done; for example, plotting a project's duration against its total cost. Scatter plots are easy to produce by exporting the simulation data at the end of a simulation into Excel. It takes a little effort to generate these scatter plots, so we recommend that you first perform a rough sensitivity analysis to help you determine which of a model's input distributions most affect the output. Figure 5.24 shows 3000 points, which is enough to get across any relationship but not too many to block out central areas if you use small circular markers.
The chart tells the story that the model predicts increasing advertising expenditure will increase sales - up to a point. Since this is an Excel plot, we can add a few useful refinements. For example, we could show scenarios above and below a certain advertising budget (Figure 5.25). We could also perform some statistical analysis of the two subsets, like a regression analysis (Figure 5.26 shows how in an Excel chart). The equations of the fitted lines show that you are getting about 3 times more return for your advertising dollar below $150k than above (0.0348/0.0132 ≈ 2.6). It is also possible, though mind-bogglingly tedious, to plot scatter plot matrices in Excel to show the interrelationship of several variables. Much better is to export the generated values to a statistical package like SPSS. At the time of writing (2007), planned versions of @RISK and Crystal Ball will also do this.

Figure 5.24 Example scatter plot (sales vs advertising expenditure $k).
Figure 5.25 Scatter plot separating scenarios where expenditure was above or below $150k.
Figure 5.26 Scatter plot with separate regression analysis for scenarios above or below $150k.

5.3.10 Trend plots

If a model includes a time series forecast or other type of trend, it is useful to be able to picture the general behaviour of the trend. A trend or summary plot provides this information. Figure 5.27 illustrates an example using the mean and the 5th, 20th, 80th and 95th percentiles. Trend plots can be plotted using cumulative percentiles as shown here, or with the mean ± one and two standard deviations, etc. I recommend that you avoid using standard deviations, unless they are of particular interest for some technical reason, because a spread of, say, one standard deviation around the mean will encompass a
varying percentage of the distribution depending on its form. That means that there is no consistent probability interpretation attached to the mean ± k standard deviations. The trend plot is useful for reviewing a trending model to ensure that seasonality and any other patterns are being reproduced. One can also see at a glance whether nonsensical values are being produced; a forecasting series can be fairly tricky to model, as described in Chapter 12, so this is a nice reality check.

Figure 5.27 A trend or summary plot (market size predictions: mean with 5th, 20th, 80th and 95th percentiles).

An alternative to the trend plot above is a Tukey or box plot (Figure 5.28). A Tukey plot is more commonly used to represent variations between datasets, but it does have the possibility of including more information than trend plots. A word of caution: the minimum and maximum generated values from a simulation can vary enormously between simulations with different random number seeds, which means they are not usually values to be relied upon. Plotting the maximum value of an inflation model going out 15 years, for example, might produce a very large value if you ran it for many iterations, and that value would dominate the graph scaling.

Figure 5.28 A Tukey or box plot. The box contains the 25-75 percentile range.

5.3.11 Risk-return plots

Risk-return (or cost-benefit) plots are one way to compare several decision options graphically on the same plot. The expected return in some appropriate measure is plotted on the vertical axis versus the expected cost in some measure on the horizontal axis (Figure 5.29). The plot should be tailored to the decision question, and it may be useful to plot two or more such plots to show different aspects.
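Building the percentile bands of a trend plot is mechanical once you have the simulated paths: summarise each time period's column of generated values at the chosen percentiles. The random-walk forecast below is a hypothetical stand-in for any time-series model:

```python
import random

random.seed(7)

# Sketch: 2000 multiplicative random-walk paths over 12 periods,
# standing in for any simulated time-series forecast.
n_paths, n_periods = 2000, 12
paths = []
for _ in range(n_paths):
    level, path = 100.0, []
    for _ in range(n_periods):
        level *= 1 + random.gauss(0.01, 0.03)
        path.append(level)
    paths.append(path)

def percentile(values, p):
    # Simple empirical percentile by sorting (adequate for a plot).
    s = sorted(values)
    return s[int(p * (len(s) - 1))]

# One row of band values per period; these rows are the trend plot lines.
summary = []
for t in range(n_periods):
    col = [path[t] for path in paths]
    summary.append({p: percentile(col, p) for p in (0.05, 0.20, 0.50, 0.80, 0.95)})

print("period 12 bands:", {p: round(v, 1) for p, v in summary[-1].items()})
```

Plotting each percentile series against time reproduces the fan shape of Figure 5.27; the widening of the bands over time is the growth of forecast uncertainty.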
Examples of measures of return (benefit) are as follows:

- the probability of making a profit;
- the income or expected return;
- the number of animals that could be imported for a given level of risk (if one were looking at various border control options for disease control, say);
- the number of extra votes that would be gained in an election campaign;
- the time that would be saved;
- the reduction in the number of complaints received by a utility company;
- the extra life expectancy of a kidney transplant patient.

Examples of measures of risk (cost) are as follows:

- the amount of capital invested;
- the probability of exceeding a schedule deadline;
- the probability of financial loss;
- the conditional mean loss;
- the standard deviation or variance of profit or cashflow;
- the probability of introduction of a disease;
- the semi-standard deviation of loss;
- the number of employees that would be made redundant;
- the increased number of fatalities;
- the level of chemical emission into the environment.

Figure 5.29 Example risk-return plot.

5.4 Statistical Methods of Analysing Results

Monte Carlo add-ins offer a number of statistical descriptions to help analyse and compare results. There are also a number of other statistical measures that you may find useful. I have categorised the statistical measures into three groups:

1. Measures of location - where the distribution is "centered".
2. Measures of spread - how broad the distribution is.
3. Measures of shape - how lopsided or peaked the distribution is.

In general, at Vose Consulting we use very few statistical measures in writing our reports.
The following statistics are easy to understand and, for nearly any problem, communicate all the information one needs to get across:

- the mean, which tells you where the distribution is located and has some important properties for comparing and combining risks;
- cumulative percentiles, which give the probability statements that decision-makers need (like the probability of being above or below X, or between X and Y);
- relative measures of spread: the normalised standard deviation (occasionally) for comparing the level of uncertainty of different options relative to their size (i.e. as a dimensionless measure) where the outputs are roughly normal, and the normalised interpercentile range (more commonly) for the same purpose where the outputs being compared are not all normal.

5.4.1 Measures of location

There are essentially three measures of central tendency (i.e. measures of the central location of a distribution) that are commonly provided in statistics reports: the mode, the median and the mean. These are described below, along with the conditional mean, which the reader may find more useful in certain circumstances.

Mode

The mode is the output value that is most likely to occur (Figure 5.30). For a discrete output, this is the value with the greatest observed frequency. For a continuous distribution output, the mode is determined by the point at which the gradient of the cumulative distribution of the model output's generated values is at its maximum. The estimate of the mode is quite imprecise if a risk analysis output is continuous, or if it is discrete and the two (or more) most likely values have similar probabilities (Figure 5.31). In fact the mode is of no practical value in the assessment of most risk analysis results and, as it is difficult to determine precisely, it should generally be ignored.

Median x50

The median is the value above and below which the model output has generated equal numbers of data, i.e. the 50th percentile.
This is simply another cumulative percentile and, in most cases, has no particular benefits over any other percentile.

Figure 5.31 A discrete distribution with two modes, or no mode, depending on how you look at it.

Mean x̄

This is the average of all the generated output values. It has less immediate intuitive appeal than the mode or median, but it has far more value. One can think of the mean of the output distribution as the x-axis point of balance of the histogram plot of the distribution. The mean is also known as the expected value, although I don't recommend the term as it implies, for most people, the most likely value. Sometimes also known as the first moment about the origin, it is the most useful statistic in risk analysis. The mean of a dataset {xi} is often given the notation x̄. It is particularly useful for the following two reasons: if a and b are two stochastic variables, then

mean(a + b) = ā + b̄ and mean(a - b) = ā - b̄

In other words: (1) the mean of the sum is the sum of the means; (2) the mean of the difference is the difference of the means. These two results are very useful if one wishes to combine risk analysis results or look at the difference between them.

Conditional mean

The conditional mean is used when one is interested only in the expected outcome of a portion of the output distribution; for example, the expected loss that would occur should the project fail to make a profit. The conditional mean is found by calculating the average of only those data points that fall into the scenario in question. In the example of expected loss, it would be found by taking the average of all the profit output's data points that were negative. The conditional mean is sometimes accompanied by the probability of the output falling within the required range. In the loss example, it would be the probability of producing a negative profit.
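The conditional mean and its accompanying probability fall straight out of the generated data. A minimal Python sketch (the Normal(50, 80) profit distribution is purely illustrative):

```python
import random
import statistics

random.seed(8)

# Hypothetical profit output: Normal(mean 50, sd 80), in $k say.
profits = [random.gauss(50, 80) for _ in range(20_000)]

# Conditional mean of the loss scenario: average only the data points
# that fall below zero, and report the probability of that scenario.
losses = [x for x in profits if x < 0]
p_loss = len(losses) / len(profits)
conditional_mean_loss = statistics.mean(losses)

print(f"P(loss)                = {p_loss:.3f}")
print(f"mean loss given a loss = {conditional_mean_loss:.1f}")
```

Reporting the pair "about a 27 % chance of a loss, and if a loss occurs it averages about 49" is usually far more useful to a decision-maker than the unconditional mean alone.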
Relative positions of the mode, median and mean

For any unimodal (single-mode) distribution that is positively skewed (i.e. has a longer right tail than left tail), the mode, median and mean fall in that order (Figure 5.32).

Figure 5.32 Relative positions of the mode, median and mean of a univariate distribution.

If the distribution has a longer left tail than right, the order is reversed. Of course, if the distribution is symmetric and unimodal, like the normal or Student distributions, the mode, median and mean will be equal.

5.4.2 Measures of spread

The three measures of spread commonly provided in statistics reports are the standard deviation s, the variance V and the range. There are several other measures of spread, discussed below, that the reader may also find useful under certain circumstances.

Variance V

The variance is calculated on the generated values as

V = Σ(xi - x̄)² / (n - 1)

i.e. it is essentially the average of the squared distances of all generated values from their mean. The larger the variance, the greater is the spread. The variance is called the second moment about the mean (because of its square term) and has units that are the square of the variable. So, if the output is in £, the variance is measured in £², making it difficult to have any intuitive feel for the statistic. Since the distance between the mean and each generated value is squared, the variance is far more sensitive to the data points that make up the tails of the distribution. For example, a data point that was three units from the mean would contribute 9 times as much (3² = 9) to the variance as a data point that was only one unit from the mean (1² = 1).
The variance is useful if one wishes to determine the spread of the sum of several uncorrelated variables X and Y, as it follows these rules:

V(X + Y) = V(X) + V(Y)
V(X - Y) = V(X) + V(Y)
V(nX) = n²V(X), where n is some constant

These formulae also provide a guideline on how to disaggregate an additive model uniformly, so that each component provides a roughly equal contribution to the total output uncertainty. If the model sums a number of variables, the contribution of each variable to the output uncertainty will be approximately equal if each variable has about the same variance.

Standard deviation s

The standard deviation is calculated as the square root of the variance:

s = √V

It has the advantage over the variance that it is in the same units as the output to which it refers. However, it still sums the squares of the distances of each generated value from the mean and is therefore far more sensitive to the outlying data points that make up the tails of the distribution than to those that are close to the mean. The standard deviation is frequently used in connection with the normal distribution. Results in risk analysis are often quoted using the output's mean and standard deviation, implicitly assuming that the output is normally distributed, and therefore that:

- the range x̄ - s to x̄ + s contains 68 % or so of the distribution;
- the range x̄ - 2s to x̄ + 2s contains 95 % or so of the distribution.

Some care should be exercised here. The distribution of a risk analysis output is often quite skewed, and these assumptions then do not follow at all. However, Tchebysheff's rule provides some weak interpretation of the fraction of a distribution contained within k standard deviations.

Range

The range of an output is the difference between the maximum and minimum generated values.
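The additivity rules above are easy to confirm by simulation. In this sketch (two hypothetical independent normal variables with standard deviations 3 and 4), both the sum and the difference come out with variance close to 9 + 16 = 25:

```python
import random
import statistics

random.seed(9)

# Two independent (hypothetical) variables: X ~ Normal(0, 3), Y ~ Normal(0, 4).
x = [random.gauss(0, 3) for _ in range(50_000)]
y = [random.gauss(0, 4) for _ in range(50_000)]

# V(X + Y) and V(X - Y) should both be close to 9 + 16 = 25:
v_sum  = statistics.variance([a + b for a, b in zip(x, y)])
v_diff = statistics.variance([a - b for a, b in zip(x, y)])
print(f"V(X+Y) ~ {v_sum:.1f}")
print(f"V(X-Y) ~ {v_diff:.1f}")
```

Note that the variance of the difference is the sum, not the difference, of the variances - a frequent source of error when combining risk analysis results.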
In most cases this is not a very useful measure, as it is obviously sensitive only to the two extreme values (which are, after all, randomly generated and could often take a wide range of legitimate values for any particular model).

Mean deviation (MD)

The mean deviation is calculated as

MD = Σ|xi - x̄| / n

i.e. the average of the absolute differences between the data points and their mean. This can be thought of as the expected distance that the variable will actually be from the mean. The mean deviation offers two potential advantages over the other measures of spread: it has the same units as the output, and it gives equal weighting to all generated data points.

Semi-variance Vs and semi-standard deviation ss

Variance and standard deviation are often used as measures of risk in the financial sector because they represent uncertainty. However, in a distribution of cashflow, a large positive tail (equivalent to the chance of a large income) is not really a "risk", although this tail will contribute to, and often dominate, the value of the calculated standard deviation and variance. The semi-standard deviation and semi-variance compensate for this problem by considering only those generated values below (or above, as required) a threshold, the threshold delineating those scenarios that represent a "risk" and therefore should be included from those that are not a risk and therefore should be excluded (Figure 5.33). The semi-variance and semi-standard deviation are

Vs = Σ(xi - x0)² / k (summed over i = 1 to k) and ss = √Vs

where x0 is the specified threshold value and x1, ..., xk are all of the data points that are either above or below x0, as required.

Figure 5.33 The semi-standard deviation concept.

Normalised standard deviation sn

This is the standard deviation divided by the mean:

sn = s / x̄

It achieves two purposes:

1. The standard deviation is given as a fraction of its mean.
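The semi-standard deviation definition above translates directly into code. The cashflow figures below are illustrative only; the threshold x0 = 0 delineates the loss scenarios, and the large positive tail that inflates the full standard deviation is deliberately excluded:

```python
import math
import statistics

# Illustrative cashflow data with a large positive tail:
data = [-40, -10, -5, 3, 8, 15, 60, 120]
x0 = 0  # threshold: values below this count as "risk"

# Semi-variance: average squared distance from the threshold, using
# only the k data points below it.
below = [x for x in data if x < x0]
semi_var = sum((x - x0) ** 2 for x in below) / len(below)
semi_sd = math.sqrt(semi_var)

full_sd = statistics.pstdev(data)
print(f"standard deviation      = {full_sd:.1f}")
print(f"semi-standard deviation = {semi_sd:.1f}")
```

Here the full standard deviation is dominated by the big positive values (the chance of a large income), while the semi-standard deviation reflects only the downside - which is the point of the statistic.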
Using this statistic allows the spread of the distribution of a variable with a large mean and correspondingly large standard deviation to be compared more appropriately with the spread of the distribution of another variable with a smaller mean and a correspondingly smaller standard deviation.

2. The standard deviation is now independent of its units. So, for example, the relative variability of the EUR:HKD and USD:GBP exchange rates can be compared.

The normalised interpercentile range works in the same way: it is calculated as (xB - xA)/x50, where xB > xA are percentiles like x95 and x05 respectively.

Interpercentile range

The interpercentile range of an output is calculated as the difference between two percentiles, for example:

- x95 - x05, to give the central 90 % range;
- x90 - minimum, to give the lower 90 % range;
- x90 - x10, to give the central 80 % range.

The interpercentile range is a stable measure of spread (unless one of the percentiles is the minimum or maximum), meaning that the value is obtained quickly for relatively few iterations of a model. It also has the great advantage of having a consistent interpretation between distributions. One potential problem you should be aware of is in applying an interpercentile range calculation to a discrete distribution, particularly when there are only a few important values, as shown in Figure 5.34. In this example, several key cumulative percentiles fall on the same values, so of course several different interpercentile ranges take the same values. In addition, the interpercentile range becomes very sensitive to the percentile chosen.

5.4.3 Measures of shape

Skewness S

This is the degree to which the distribution is "lopsided". A positive skewness means a longer right tail; a negative skewness means a longer left tail; zero skewness means the distribution is symmetric about its mean (Figure 5.35).
Figure 5.34 Demonstration of how interpercentile ranges can be confusing with discrete distributions.
Figure 5.35 Skewness examples.

The skewness S is calculated as

S = Σ(xi - x̄)³ / (n s³)

The s³ factor is put in to make the skewness a pure number, i.e. it has no units of measurement. Skewness is also known as the third moment about the mean and, because of the cubed term, is even more sensitive to the data points in the tails of a distribution than the variance or standard deviation. It may be useful to note, for comparative purposes, that an exponential distribution has a skewness of 2.0, an extreme value distribution has a skewness of 1.14, a triangular distribution has a skewness of between 0 and 0.562 (depending on its shape), and the skewness of a lognormal distribution goes from zero to infinity as its mean approaches 0. Skewness has little practical purpose for most risk analysis work, although it is sometimes used in conjunction with kurtosis (see below) to test whether the output distribution is approximately normal. High skewness values from a simulation run are really quite unstable - if your simulation gives a skewness value of 100, say, think of it as "really big" rather than taking its value as being usable.

Another measure of skewness, though rarely used, is the percentile skewness Sp, calculated as

Sp = (x90 - x50) / (x50 - x10)

It has the advantage over the standard skewness of being quite stable, because it is not affected by the values of the extreme data points. However, its scaling is different to that of the standard skewness:

- if 0 < Sp < 1, the distribution is negatively skewed;
- if Sp = 1, the distribution is symmetric;
- if Sp > 1, the distribution is positively skewed.

Kurtosis K

Kurtosis is a measure of the peakedness of a distribution. Like the skewness statistic, it is not of much use in general risk analysis.
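The moment-based skewness can be checked against the comparison values quoted above. This Python sketch computes the statistic from generated values (using n in the denominator, a common convention) and confirms that an exponential sample comes out close to 2.0:

```python
import random
import statistics

random.seed(10)

def skewness(data):
    # Third moment about the mean, scaled by s^3 to make it dimensionless.
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    n = len(data)
    return sum((x - m) ** 3 for x in data) / (n * s ** 3)

# An exponential distribution has skewness 2.0; check by simulation.
sample = [random.expovariate(1.0) for _ in range(200_000)]
sk = skewness(sample)
print(f"skewness of exponential sample ~ {sk:.2f}")
```

Note how many iterations are needed before the statistic settles: the cubed term makes it dominated by the tail, which is exactly the instability warned about above.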
Kurtosis is calculated as

K = Σ(xi − x̄)⁴ / (nσ⁴)

In a similar manner to skewness, the σ⁴ factor is put in to make the kurtosis a pure number. Kurtosis is often known as the fourth moment about the mean and is even more sensitive to the values of the data points in the tails of the distribution than the standard skewness statistic. Stable values for the kurtosis of a risk analysis result therefore require many more iterations than for other statistics. High kurtosis values from a simulation run are very unstable - if your simulation gives a kurtosis in the hundreds or thousands, say, it means there is a big spike in the output and the simulation kurtosis is very dependent on whether that spike was appropriately sampled, so for such large values just think of it as "really big". Kurtosis is sometimes used in conjunction with the skewness statistic to determine whether an output is approximately normally distributed. A normal distribution has a kurtosis of 3, so any output that looks symmetric and bell-shaped and has a zero skewness and a kurtosis of 3 can probably be considered normal. A uniform distribution has a kurtosis of 1.8, a triangular distribution has a kurtosis of 2.387, the kurtosis of a lognormal distribution goes from 3.0 to infinity as its mean approaches 0, and an exponential distribution has a kurtosis of 9.0. The kurtosis statistic is sometimes (in Excel, for example) calculated as K − 3, called the excess kurtosis, which can cause confusion, so be careful what statistic your software is reporting.

5.4.4 Percentiles

Cumulative percentiles

These are values below which the specified percentage of the generated data for an output fall. Standard notation is xP, where P is the cumulative percentage, e.g. x0.75 is the value that 75 % of the generated data were less than or equal to. The cumulative percentiles can be plotted together to form the cumulative frequency plot, the use of which has been explained above.
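A matching Python sketch for the kurtosis, using a normal sample so the quoted value of 3 (excess 0) can be checked; note that Excel's KURT applies small-sample corrections, so it will differ slightly from this population formula:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200_000)  # hypothetical, approximately normal output

mu, sigma = x.mean(), x.std()
kurtosis = np.mean((x - mu) ** 4) / sigma**4  # fourth moment about the mean
excess = kurtosis - 3.0                       # the "excess kurtosis" convention

print(round(kurtosis, 1), round(excess, 1))   # close to 3.0 and 0.0
```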
Differences between cumulative percentiles are often used as a measure of the variable's range, e.g. x0.95 − x0.05 would include the middle 90 % of the possible output values and x0.80 − x0.20 would include the middle 60 % of the possible values of the output; x0.25, x0.50 and x0.75 are sometimes referred to as the quartiles.

Relative percentiles

The relative percentiles are the fractions of the output data points that fall into each bar range of a histogram plot. They are of little use in most risk analyses and are dependent upon the number of bars that are used to plot the histogram. Relative percentiles can, however, be used to replicate the output distribution for inclusion in another risk analysis model. For example, cashflow models may have been produced for a number of subsidiaries of a large company. If an analyst wants to combine these uncertain cashflows into an aggregate model, he would want distributions of the cashflow from each subsidiary. This is achieved by using histogram distributions to model each subsidiary's cashflow and taking the required parameters (minimum, maximum, relative percentiles) from the statistics report. Providing the cashflow distributions are independent, they can then be summed in another model.

5.4.5 Stochastic dominance tests

Stochastic dominance tests are a statistical means of determining the superiority of one distribution over another. There are several types (or degrees) of stochastic dominance. We have never found any particular use for any but the first- and second-order tests described here. It would be a very rare problem where one of two options had to be selected for no better reason than the very marginal ordering provided by a statistical test.
In the real world there are usually far more persuasive reasons to select one option over another: option A would expose us to a greater chance of losing money than B, or a greater maximum loss, or would cost more to implement; we feel more comfortable with option A because we've done something similar before; option B will make us more strategically placed for the future; option B is based on an analysis with fewer assumptions; etc.

Figure 5.36 First-order stochastic dominance: FA < FB, so option A dominates option B.

First-order stochastic dominance

Consider options A and B having the distribution functions FA(x) and FB(x), where it is desirable to maximise the value of x. If FA(x) ≤ FB(x) for all x, then option A dominates option B. That amounts to saying that the cdf of option A is to the right of that of option B in an ascending plot. This is shown graphically in Figure 5.36. Option A has a smaller probability than option B of being less than or equal to each x value, so it is the better option (unless FA(x) = FB(x) everywhere). First-order stochastic dominance is intuitive and makes virtually no assumptions about the decision-maker's utility function, only that it is continuous and monotonically increasing with increasing x.

Second-order stochastic dominance

If

D(z) = ∫[min, z] (FB(x) − FA(x)) dx ≥ 0

for all z, then option A dominates option B. Figure 5.37 illustrates how this looks graphically. Figure 5.38 illustrates a situation when second-order stochastic dominance does not hold. Second-order stochastic dominance makes the additional assumption that the decision-maker has a risk-averse utility function over the entire range of x. This assumption is not very restrictive and can almost always be assumed to apply.
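Empirically, the first-order test is just a comparison of the two simulated cdfs at every x. A Python sketch with hypothetical option outcomes (the book does this on simulation iterations via Excel/ModelRisk):

```python
import numpy as np

rng = np.random.default_rng(4)
option_a = rng.normal(130, 20, 5_000)  # hypothetical outcome iterations
option_b = rng.normal(100, 20, 5_000)

grid = np.linspace(min(option_a.min(), option_b.min()),
                   max(option_a.max(), option_b.max()), 200)

# Empirical cdfs FA and FB evaluated on a common grid
F_a = np.searchsorted(np.sort(option_a), grid, side="right") / option_a.size
F_b = np.searchsorted(np.sort(option_b), grid, side="right") / option_b.size

a_dominates_b = bool(np.all(F_a <= F_b))  # FA(x) <= FB(x) for all x
print(a_dominates_b)
```

With a modest number of iterations, sampling noise can break dominance in the extreme tails even when the underlying distributions dominate, so the test is best applied to well-converged outputs.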
In most fields of risk analysis (finance being an obvious exception) it will not be necessary to resort to second-degree (or higher) dominance tests since the decision-maker should be able to find other, more important, differences between the available options. Stochastic dominance is great in principle but tends to be rather onerous to apply in practice, particularly if one is comparing several possible options. ModelRisk has the facility to compare as many options as you wish. First of all one simulates, say, 5000 iterations of the outcome of each possible option and imports these into contiguous columns in a spreadsheet. These are then fed into the ModelRisk interface, as shown in Figure 5.39. Selecting an output location allows you to insert the stochastic dominance matrix as an array function (VoseDominance), which will show all the dominance combinations and update if the simulation output arrays are altered.

Figure 5.37 Second-order stochastic dominance: option A dominates option B because D(z) is always ≥ 0.

Figure 5.38 Second-order stochastic dominance: option A does not dominate option B because D(z) is not always ≥ 0.
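The second-order integral D(z) can be approximated the same way by accumulating the area between the two empirical cdfs. A Python sketch (hypothetical outcomes: option A has a slightly higher mean and much less spread, the classic case where second-order but not first-order dominance holds):

```python
import numpy as np

rng = np.random.default_rng(5)
option_a = rng.normal(155, 10, 20_000)  # hypothetical profit iterations
option_b = rng.normal(150, 40, 20_000)

grid = np.linspace(min(option_a.min(), option_b.min()),
                   max(option_a.max(), option_b.max()), 500)
F_a = np.searchsorted(np.sort(option_a), grid, side="right") / option_a.size
F_b = np.searchsorted(np.sort(option_b), grid, side="right") / option_b.size

dx = grid[1] - grid[0]
D = np.cumsum(F_b - F_a) * dx            # D(z) at each grid point
first_order = bool(np.all(F_a <= F_b))   # fails: the two cdfs cross
second_order = bool(np.all(D >= -1e-9))  # holds: D(z) >= 0 everywhere
print(first_order, second_order)
```

A risk-averse decision-maker would therefore prefer option A here even though option B offers a chance of much higher profit.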
5.4.6 Value-of-information methods

Value-of-information (VOI) methods determine the worth of acquiring extra information to help the decision-maker. From a decision analysis perspective, acquiring extra information is only useful if it has a significant probability of changing the decision-maker's currently preferred strategy. The penalty of acquiring more information is usually valued as the cost of that extra information, and sometimes also the delay incurred in waiting for the information.

Figure 5.39 ModelRisk interface to determine stochastic dominance.

VOI techniques are based on analysing the revised estimates of model inputs that come with extra data, together with the costs of acquiring the extra data and a decision rule that can be converted into a mathematical formula to analyse whether the decision would alter. The ideas are well developed (Clemen and Reilly (2001) and Morgan and Henrion (1990), for example, explain VOI concepts in some detail), but the probability algebra can be somewhat complex, and simulation is more flexible and a lot easier for most VOI calculations. The usual starting point of a VOI analysis is to consider the value of perfect information (VOPI), i.e. answering the question "What would be the benefit, in the terms we are focusing on (usually money, but it could be lives saved, etc.), of being able to know some parameter(s) perfectly?". If perfect knowledge would not change a decision, the extra information is worthless; if it does change a decision, then the value of the extra knowledge is the difference in expected net benefit between the new selected option and that previously favoured. VOPI is a useful limiting tool, because it tells us the maximum value that any data may have in better evaluating the input parameter of concern. If the information costs more than that maximum value, we know not to pursue it any further.
After a VOPI check, one then looks at the value of imperfect information (VOII). Usually, the collection of more data will decrease, not eliminate, uncertainty about an input parameter, so VOII focuses on whether the decrease in uncertainty is worth the cost of collecting extra information. In fact, if new data are inconsistent with previous data or beliefs that were used to estimate the parameter, new data may even increase the uncertainty. If the data being used are n random observations (e.g. survey or experimental results), the uncertainty about the value of a parameter has a width (roughly) proportional to 1/SQRT(n). So, if you already have n observations and would like to halve the uncertainty, you will need a total of 4n observations (an increase of 3n). If you want to decrease uncertainty by a factor of 10, you will need a total of 100n observations (an increase of 99n). In other words, a decrease in uncertainty about a parameter value becomes increasingly expensive the closer the uncertainty gets to zero. Thus, if a VOPI analysis shows that it is economically justified to collect more information before making a decision, there will certainly be a point in the data collection where the cost of collecting data will outweigh their benefit.

VOPI analysis method

Consider the range of possible values for the parameter(s) for which you could collect more information. Determine whether there are possible values for these parameters that, if known, would make the decision-maker select a different option from the one currently deemed to be best. Calculate the extra value (e.g. expected profit) that the more informed decision would give. This is the VOPI.

VOII analysis method

Start with a prior belief about a parameter (or parameters), based on data or opinion. Model what observations might be made with new data using the prior belief. Determine the decision rule that would be affected by these new data.
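The square-root rule above is worth internalising; a tiny illustration (the baseline of 100 observations is an assumed example, not from the book):

```python
import math

def relative_width(n, n0=100):
    """Uncertainty width from n observations, relative to a baseline of n0."""
    return math.sqrt(n0 / n)

print(relative_width(400))     # 0.5: 4n observations halve the width
print(relative_width(10_000))  # 0.1: 100n observations cut it tenfold
```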
Calculate any improvement in the decision capability given the new data; the measure of improvement requires some valuation and comparison of possible outcomes, which is usually taken to be expected monetary or utility value, although this is rather restrictive. Determine whether any improvement in the decision capability exceeds the cost of the extra information.

VOI example

Your company wants to develop a new cosmetic, but there is some concern that people will have a minor adverse skin reaction to the product. The cost of development of the product to market is $1.8 million. The revenue NPV (including the cost of development) if the product is of the required quality is $3.7 million. Cosmetic regulations state that you will have to withdraw the product if 2 % or more of consumers have an adverse reaction to it. You have already performed some preliminary trials on 200 random people selected from the target demographic, at a cost per person of $500. Three of those people had an adverse reaction to the product. Management decide the product will only be developed if they can be 85 % confident that the product will affect less than the required 2 % of the population.

Decision question: Should we test more people or just abandon the product development now? If we should test more people, then how many more?

Having observed three affected people out of 200, our prior belief about p can be modelled as Beta(3 + 1, 200 − 3 + 1) = Beta(4, 198), which gives a 57.24 % confidence that 2 % or less of the target demographic will be affected (calculated as VoseBetaProb(2 %, 4, 198, 1) or BETADIST(2 %, 4, 198)). Thus, the current level of information means that management would not pursue development of the product, with no resultant cost or revenue, i.e. a net revenue of $0. However, the beta distribution shows that it is quite possible that p is less than 2 %, and we could be losing a good opportunity by quitting now.
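The 57.24 % figure can be reproduced without Excel or ModelRisk. A stdlib Python check, using the identity P(Beta(a, b) ≤ x) = P(Binomial(a + b − 1, x) ≥ a), which holds for integer a and b:

```python
from math import comb

a, b, x = 4, 198, 0.02  # Beta(4, 198) prior; the 2 % regulatory limit
n = a + b - 1           # 201 equivalent binomial trials

# P(p <= 2 %) = 1 - P(Binomial(201, 0.02) <= 3)
confidence = 1 - sum(comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(a))
print(round(confidence * 100, 2))  # -> 57.24, matching the text
```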
If this were known for sure, the company would get a profit of $3.7 million, so the VOPI = $3.7 million × 57.24 % + $0 million × 42.76 % = $2.12 million, and each test only costs $500; it is certainly possible that more information could be worth the expense.

VOII analysis

The model in Figure 5.40 performs the VOII steps described above. The parameter of concern is the fraction of people (prevalence), p, in the target demographic (women 18-65) who would have an adverse reaction, with a prior uncertainty described by Beta(4, 198), cell C12. The people in the study are randomly sampled from this demographic, so if we test m extra people (cell C22) we can assume the number of people who would be adversely affected, s, would follow a Binomial(m, p) distribution (cell C24). The revised estimate for p would then become Beta(4 + s, 198 + (m − s)). The confidence we then have that p is < 2 % is given by VoseBetaProb(2 %, 4 + s, 198 + (m − s), 1), cell C27. If this confidence exceeds 85 %, management would take the decision to develop the product (cells C31:C32). The model simulates different possible values of p from the prior. It models various possible numbers of extra tests, m, and simulates the extra data generated (s out of m), then evaluates the expected return of the resultant decision. Of course, although one may have reached the required confidence for p, the true value for p doesn't change and a bad decision may still be taken. The value of information is calculated for each iteration, and the mean function is used to calculate the expected value of information. Note that for this example the question being posed is how many more people to test in one go. A more optimal strategy would be to test a smaller number, review the results and perform another VOII analysis.
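The simulation loop just described can be sketched in Python. This is my simplified stand-in for the book's spreadsheet model, with an assumed payoff structure: +$3.7m if the product is developed and the true prevalence really is below 2 %, −$1.8m (the lost development cost) if it is developed and then withdrawn, and $500 per extra test:

```python
import random

def beta_cdf(x, a, b):
    """P(Beta(a, b) <= x) for integer a, b, via P(Binomial(a+b-1, x) >= a)."""
    n = a + b - 1
    pk = (1 - x) ** n      # P(Binomial = 0)
    below = 0.0
    for k in range(a):     # accumulate P(Binomial = 0 .. a-1)
        below += pk
        pk *= (n - k) / (k + 1) * x / (1 - x)
    return 1.0 - below

def expected_voii(m, iterations=2_000, seed=7):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iterations):
        p = rng.betavariate(4, 198)                  # scenario from the prior
        s = sum(rng.random() < p for _ in range(m))  # extra test results
        if beta_cdf(0.02, 4 + s, 198 + m - s) >= 0.85:
            total += 3.7e6 if p < 0.02 else -1.8e6   # develop; maybe wrongly
    return total / iterations - m * 500              # net of testing cost

print(expected_voii(700))
```

Running `expected_voii` over a range of m values reproduces the shape of the analysis described below, though the exact numbers depend on the assumed payoffs.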
This iterative process will either achieve the required confidence at a smaller test cost or lead one to abandon further testing because one is fairly sure that the required performance will not be achieved. It might at first seem that we are getting something for nothing here. After all, we don't actually know anything more until we perform the extra tests. However, the decision that would be made would depend on the results of those extra tests, and those results depend on what the true value of p actually is. Thus, the analysis is based on our prior for p (i.e. what we know to date about p) and the decision rule. When the model generates a scenario, it selects a value from the prior for p. It is saying: "Let's imagine that this is the true value for p". If that value is < 2 %, we should develop the product of course, but we'll never know the value of p (until we have launched the product and have enough customer history to know its value). However, extra tests will get us closer to knowing its true value,

Figure 5.41 VOI example model results.

Figure 5.42 VOI example model results where tests have no cost.

and so we end up taking less of a gamble. When the model picks a small value for p, it will probably generate a small number of affected people in our new tests, and our interpretation of this small number as meaning p is small will often be correct.
The danger is that a high p value could by chance result in an unrepresentatively small fraction of m being affected, which will be misinterpreted as a small p and lead management to make the wrong decision. However, as m gets bigger, that risk diminishes. The balance that needs to be made is that the tests cost money. The model simulates 20 scenarios where m is varied between 100 and 3000, with the results shown in Figure 5.41. It tells us that the optimal strategy, i.e. the strategy with the greatest expected VOII, is to perform about another 700 tests. The sawtooth effect in these plots occurs because of the discrete nature of the extra number affected that one would observe in the new data. Note that, if the tests had no cost, the graph would look very different (Figure 5.42). Now it is continually worth collecting more information (providing it is actually feasible to do) because there is no penalty to be paid in running more tests (except perhaps time, which is not included as part of this problem). In this case the value of information asymptotically approaches the VOPI (= $2.12 million) as the number of people tested approaches infinity.

Part 2 Introduction

Part 2 constitutes the bulk of this book and covers a wide range of risk analysis modelling techniques that are in general use. I have again almost exclusively used Microsoft Excel as the modelling environment because it is ubiquitous and makes it easy to show the principles of a model with printouts of the spreadsheet. I have also used Vose Consulting's ModelRisk add-in to Excel (see Appendix II), but I have done my best to avoid making this book a glorified advertisement for a software tool. The reality is that you will need some specialist software to do risk analysis.
Using ModelRisk gives me the opportunity to explain the thinking behind risk analysis modelling without the message getting lost in very long calculations or wrestling with the mechanical limitations of modelling in spreadsheets. Some of the simpler functions in ModelRisk are available in other risk analysis software tools, and Excel has some statistical functions (although they are of dubious quality). When I have used more complex functions in ModelRisk (like copulas or time series, for example), I have tried to give you enough information for you to do it yourself. Of course, we'd love you to buy ModelRisk - there is a lot more in the software than I have used in this book (Appendix II gives some highlights and explains how ModelRisk interacts with other risk analysis spreadsheet add-ins), it has a lot of very nice user interfaces and its routines can be called from C++ and VBA. We offer an extended demo period for ModelRisk on the inside back cover of this book, together with files for the models created for this book that you can play around with.

Notation used in the spreadsheet models

I have given printouts of spreadsheet models throughout this book. The models were produced in Microsoft Excel version 2003 and ModelRisk version 2.0, which complies with the standard Excel rules for cell formulae. The equations easily translate to @RISK, Crystal Ball and other Monte Carlo simulation packages where they have similar functions. In each spreadsheet I have given a formulae table so that the reader can follow and reproduce the model: for example, an entry for cells D2:D8 as = VoseLognormal(B2, C2). Where I have given one formula for a range of cells, it refers to the first cell of the range, and the formulae for other cells in the range are those that would appear by copying that formula over, for example by using the Excel Autofill facility.
The formulae in the other cells in the range will vary according to their position: copying the formula above into the other cells would give = VoseLognormal(B3, C3), = VoseLognormal(B4, C4), etc. If the formula had included a fixed reference using the "$" symbol in Excel notation, e.g. = VoseLognormal(B$2, C2), it would have copied down as = VoseLognormal(B$2, C3), = VoseLognormal(B$2, C4), etc. The VoseLognormal function generates random samples from a lognormal distribution, a very common distribution that features in pretty much all Monte Carlo simulation add-ins to Excel. So, for example, VoseLognormal(2, 3) could be replaced as follows:

@RISK: = RiskLognorm(2, 3)
Crystal Ball: = CB.Lognormal(2, 3)

There are maybe a dozen other, less common, Monte Carlo add-ins with varying levels of sophistication, and they all follow the same principle, but be careful to ensure that they parameterise a distribution in the same way. Excel allows you to input a function as an array, meaning that one function covers several cells. Array formulae in Excel are entered by highlighting a range of cells, typing the formula and then pressing CTRL-SHIFT-Enter together. The function then appears within curly brackets in the formula bar. Array functions are used rather extensively with ModelRisk. For example, with the values 1 to 7 in cells B2:B8 and the array formula {=VoseShuffle(B2:B8)} entered across C2:C8, column C might show the shuffled order 3, 5, 2, 6, 4, 1, 7. The VoseShuffle function simply randomises the order of the values listed in its parameter array. I display the formula within curly brackets because the VoseShuffle function covers that whole range, which is how it appears in Excel's formula bar. Note also that functions with names all in upper-case letters are native Excel functions, which is how they appear in the spreadsheet. Functions of the form VoseXxxx belong to ModelRisk.

Types of function in ModelRisk

ModelRisk has several types of function that apply to a probability distribution.
I'll use the normal distribution as an example. VoseNormal(2, 3) generates random values from a normal distribution with mean = 2 and standard deviation = 3. An optional third parameter (we call it the "U-parameter") is the quantile of the distribution; for example, VoseNormal(2, 3, 0.9) returns the 90th percentile of the distribution. The U-parameter must obviously lie on [0, 1]. The main use of the U-parameter is to control how random samples are generated from the distribution: passing a uniform random number produced by @RISK, Crystal Ball or Excel as the U-parameter will generate random values from the normal distribution using the random number generator of that package to control the sampling.

The second type of function calculates probabilities for each distribution featured in ModelRisk. For example, VoseNormalProb(0.7, 2, 3, FALSE) returns the probability density function of the normal distribution evaluated at x = 0.7, as would VoseNormalProb(0.7, 2, 3, 0) or VoseNormalProb(0.7, 2, 3), since the last parameter is assumed FALSE if omitted. VoseNormalProb(0.7, 2, 3, TRUE) or VoseNormalProb(0.7, 2, 3, 1) returns the cumulative distribution function of the normal distribution evaluated at x = 0.7. To this degree, these functions are analogous to Excel's NORMDIST function, e.g. NORMDIST(0.7, 2, 3, TRUE). However, the probability calculation functions can take an array of x values and then return the joint probability. For example, VoseNormalProb({0.1, 0.2, 0.3}, 2, 3, 0) = VoseNormalProb(0.1, 2, 3, 0) * VoseNormalProb(0.2, 2, 3, 0) * VoseNormalProb(0.3, 2, 3, 0). There are two advantages to this feature: we don't need a vast array of functions to calculate the joint probability for a large dataset, and the functions are far faster and more accurate than multiplying a long array because, depending on the distribution, there will be a lot of calculations that can be simplified.
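For readers without ModelRisk, the three behaviours (density, cumulative probability, and the U-parameter as a quantile) can be mimicked for the Normal(2, 3) example with the standard library; the bisection inverse is a deliberately simple stand-in for a proper quantile routine:

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu, sigma):       # like VoseNormalProb(x, mu, sigma, FALSE)
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu, sigma):       # like VoseNormalProb(x, mu, sigma, TRUE)
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def normal_quantile(u, mu, sigma):  # like VoseNormal(mu, sigma, u)
    lo, hi = mu - 12 * sigma, mu + 12 * sigma
    while hi - lo > 1e-10:          # invert the cdf by bisection
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if normal_cdf(mid, mu, sigma) < u else (lo, mid)
    return (lo + hi) / 2

print(round(normal_quantile(0.9, 2, 3), 4))  # 90th percentile -> 5.8447
```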
Joint probabilities can quickly tend to very small values, beyond the range that Excel can handle, so ModelRisk offers log base 10 versions of these functions too. These allow us to develop very efficient log-likelihood models, for example, which we can then optimise to fit to data (see Chapter 10).

Finally, ModelRisk offers what we call object functions, for example VoseNormalObject(2, 3). If you type =VoseNormalObject(2, 3) into a cell, it returns the string "VoseNormalObject(2, 3)". In many types of risk analysis calculation we want to do more with a distribution than simply take a random sample or calculate a probability. For example, we might want to determine its moments (mean, variance, etc.): the VoseMoments array function returns the first four moments of a distribution, such as a Gamma(3, 7), and takes as its input parameter the distribution type and parameter values. There are many other situations in which we want to manipulate distributions as objects: for example, a function that uses a hybrid Monte Carlo approach to add n Lognormal(10, 5) distributions together, where n is itself a Poisson(50) random variable. Note that the lognormal distribution is defined as an object here because we are using the distribution many times, taking on average 50 independent samples from the distribution for each execution of the function. However, the Poisson distribution is not an object because for one execution of the function it simply draws a single random sample. Objects can be embedded into other objects too: for example, the object for a distribution constructed by splicing a gamma distribution (left) and a shifted Pareto2 distribution (right) together at x = 3. Allowing objects to exist alone in cells allows us to create very transparent and efficient models.
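The hybrid aggregate calculation mentioned above (a Poisson(50) number of Lognormal(10, 5) samples summed) can be mimicked by brute force. Here Lognormal(10, 5) is assumed to mean "mean 10, standard deviation 5", the parameterisation ModelRisk uses, which must first be converted to the underlying normal's parameters:

```python
import math
import random

def lognormal_mu_sigma(mean, sd):
    """Convert the lognormal's mean/sd to the underlying normal's mu/sigma."""
    sigma2 = math.log(1 + (sd / mean) ** 2)
    return math.log(mean) - sigma2 / 2, math.sqrt(sigma2)

def poisson_sample(rng, lam):
    """Knuth's multiplication method; adequate for modest means like 50."""
    n, prod, limit = 0, rng.random(), math.exp(-lam)
    while prod > limit:
        n += 1
        prod *= rng.random()
    return n

def aggregate_sample(rng):
    mu, sigma = lognormal_mu_sigma(10, 5)  # severity: Lognormal(10, 5)
    n = poisson_sample(rng, 50)            # frequency: Poisson(50)
    return sum(rng.lognormvariate(mu, sigma) for _ in range(n))

rng = random.Random(9)
mean_total = sum(aggregate_sample(rng) for _ in range(5_000)) / 5_000
print(round(mean_total))  # near 50 * 10 = 500
```

The object-based ModelRisk function performs this far more efficiently, but the simulation makes the structure of the calculation explicit.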
Mathematical notation

There are some mathematical notations listed below that the reader will come across in a few parts of the text. I have tried to keep the algebra to a minimum and the reader should not worry unduly about this list. There is nothing in this book that really extends beyond the level of mathematics that one learns in a quantitative undergraduate course.

x — the label generally given to the value of a variable
θ — the label generally given to an uncertain parameter
∫[a, b] f(x) dx — the integral between a and b of the function f(x)
Σ[i = 1..n] xi — the sum of all xi values, where i is between 1 and n, i.e. x1 + x2 + ... + xn
Π[i = 1..n] xi — the product of all xi values, where i = 1 to n, i.e. x1·x2·...·xn
df(x)/dx — the differential of f(x) with respect to x
∂f(x, y)/∂x — the partial derivative of a function of x and y, f(x, y), with respect to x
≈ — "is approximately equal to"
≤, ≥ — "is less than or equal to" and "is greater than or equal to"
<<, >> — "is much less than" and "is much greater than"
x! — "x-factorial", = 1 × 2 × 3 × ... × x
exp[x] or e^x — "exponential x", e = 2.7182818...
ln[x] — the natural logarithm of x, so ln[exp[x]] = x
x̄ — the average of all x values
|x| — "modulus x", the absolute value of x
Γ(x) — the gamma function evaluated at x: Γ(x) = ∫[0, ∞] u^(x−1) e^(−u) du
B(x, y) — the beta function evaluated at (x, y): B(x, y) = ∫[0, 1] t^(x−1) (1 − t)^(y−1) dt = Γ(x)Γ(y)/Γ(x + y)

Other special functions are explained in the text where they appear. For those readers with some background in probability modelling, you might not be used to the notation I use for stating that a variable follows some distribution. I write X = Normal(100, 10), whereas the reader might be used to X ~ Normal(100, 10). I use the "=" notation because it is easier to write formulae that combine variables and it reflects how one uses Excel.
For example, where I might write

X = Normal(100, 10) + Gamma(2, 3)

using the other notation we would need to write

Y ~ Normal(100, 10)
Z ~ Gamma(2, 3)
X = Y + Z

which gets to be rather tedious.

This chapter is set out in sections, each of which solves a number of problems in a particular area. I hope that the problem-solving approach will complement the theory discussed earlier in the book. References are made to where the theory used in the problems is more fully discussed. The solution to each problem finishes with the symbol +.

Chapter 6

Probability mathematics and simulation

This chapter explores some very basic theories of probability and statistics that are essential for risk analysis modelling and that we need to understand before moving on. In my experience, ignorance of these fundamentals is a prime cause of the logical failure of a model. Risk analysis software is often sold on the merits of removing the need for any in-depth statistical theory. Although this is quite true with respect to using the software, it is often not the case when it comes to producing a logical model. In this chapter we begin by looking at the concepts that are used in the mathematics of probability distributions. Then we define some basic statistics in common use. We look at a few probability concepts that are essential to understand if one is to be assured of producing logical models. This chapter is designed to offer a reference of statistical and probability concepts: the application of these principles is left to the appropriate chapters later in the book.

For most people (myself included), probability theory and statistics were not their favourite subjects at college. I would, however, encourage those readers who find themselves equipped with limited endurance for statistical theory to get at least as far as the end of Section 6.4.4 before moving on.
6.1 Probability Distribution Equations

6.1.1 Cumulative distribution function (cdf)

The (cumulative) distribution function, or probability distribution function, F(x), is the mathematical equation that describes the probability that a variable X is less than or equal to x, i.e.

F(x) = P(X ≤ x) for all x

where P(X ≤ x) means the probability of the event X ≤ x. A cumulative distribution function has the following properties:

1. F(x) is always non-decreasing, i.e. dF(x)/dx ≥ 0.
2. F(x) = 0 at x = −∞; F(x) = 1 at x = +∞.

6.1.2 Probability mass function (pmf)

If a random variable X is discrete, i.e. it may take any of a specific set of n values x_i, i = 1, ..., n, then p(x) is called the probability mass function. Note that

Σ_{i=1}^n p(x_i) = 1

and

F(x_j) = Σ_{i=1}^j p(x_i)

For example, if a coin is tossed three times, the number of observed heads is discrete. The possible values of x_i are shown in Figure 6.1 against their probability mass function f(x) and probability distribution function F(x).

Figure 6.1 Distribution of the possible number of heads in three tosses of a coin.

In this book, I will often show a discrete variable's probability mass function by joining together the probability masses with straight lines and marking each allowed value with a point. Vertical histograms are usually more appropriate representations of discrete variables, but, by using the points-and-lines type of graph, one can show several discrete distributions together in the same plot.

6.1.3 Probability density function (pdf)

If a random variable X is continuous, i.e. it may take any value within a defined range (or sometimes ranges), the probability of X having any precise value within that range is vanishingly small, because we are allocating a probability of 1 between an infinite number of values. In other words, there is no probability mass associated with any specific allowable value of X. Instead, we define a probability density function f(x) as

f(x) = dF(x)/dx

i.e. f(x) is the rate of change (the gradient) of the cumulative distribution function. Since F(x) is always non-decreasing, f(x) is always non-negative. So, for a continuous distribution we cannot define the probability of observing any exact value. However, we can determine the probability of x lying between any two exact values (a, b):

P(a < X ≤ b) = F(b) − F(a)    where b > a    (6.3)

Example 6.1

Consider a continuous variable that takes a Rayleigh(1) distribution. Its cumulative distribution function is given by

F(x) = 1 − exp(−x²/2)

and its probability density function is given by

f(x) = x exp(−x²/2)

The probability that the variable will be between 1 and 2 is given by

F(2) − F(1) = exp(−1/2) − exp(−2) ≈ 0.471

F(x) and f(x) for this example are shown in Figure 6.2.

Figure 6.2 Probability density and cumulative probability plots for a Rayleigh(1) distribution.

In this book, we will show a continuous variable's probability density function with a smooth curve, as illustrated. A square sometimes plotted in the middle of this curve represents the position of the mean of the distribution. Providing the distribution is unimodal, if this point is higher than the 50th percentile the distribution will be right skewed, and if lower than the 50th percentile it will be left skewed. +

6.2 The Definition of "Probability"

Probability is a numerical measurement of the likelihood of an outcome of some random process. Randomness is the effect of chance and is a fundamental property of the system, even if we cannot directly measure it. It is not reducible through either study or further measurement, but may be reduced by changing the physical system. Randomness has been described as "aleatory uncertainty" and "stochastic variability".
The concept of probability can be developed neatly from two different approaches:

Frequentist definition

The frequentist approach asks us to imagine repeating the physical process an extremely large number of times (trials) and then to look at the fraction of times that the outcome of interest occurs. That fraction is asymptotically (meaning as we approach an infinite number of trials) equal to the probability of that particular outcome for that physical process. So, for example, the frequentist would imagine that we toss a coin a very large number of times. The fraction of the tosses that come up heads is approximately the true probability of a single toss producing a head, and the more tosses we do the closer the fraction becomes to the true probability. So, for a fair coin, we should see the number of heads stabilise at around 50 % of the trials as the number of trials gets truly huge. The philosophical problem with this approach is that one usually does not have the opportunity to repeat the scenario a very large number of times. How do we match this approach with, for example, the probability of it raining tomorrow, or of you having a car crash?

Axiomatic definition

The physicist or engineer, on the other hand, could look at the coin, measure it, spin it, bounce lasers off its surface, etc., until one could declare that, owing to symmetry, the coin must logically have a 50 % probability of falling on either surface (for a fair coin, or some other value for an unbalanced coin, as the measurements dictated). Determining probabilities on the basis of deductive reasoning has a far broader application than the frequency approach because it does not require us to imagine being able to repeat the same physical process infinitely.

A third, subjective, definition

In this context, "probability" would be our measure of how much we believe something to be true.
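The frequentist idea of a stabilising head fraction is easy to see in a quick simulation (a minimal Python sketch, not from the book):

```python
import random

rng = random.Random(7)

def head_fraction(n_tosses):
    """Fraction of heads observed in n simulated tosses of a fair coin."""
    return sum(rng.random() < 0.5 for _ in range(n_tosses)) / n_tosses

# the running fraction settles towards the true probability of 0.5
# as the number of trials grows
for n in (100, 10_000, 1_000_000):
    print(n, head_fraction(n))
```

With 100 tosses the fraction can easily be several percentage points away from 0.5; by a million tosses it is typically within a tenth of a percentage point.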
I'll use the term "confidence" instead of probability to make the separation between belief and real-world probability clear. A distribution of confidence looks exactly the same as a distribution of probability and must follow the same rules of complementation, addition, etc., which easily leads to mixing up the two ideas. Uncertainty is the assessor's lack of knowledge (level of ignorance) about the parameters that characterise the physical system being modelled. It is sometimes reducible through further measurement or study. Uncertainty has also been called "fundamental uncertainty", "epistemic uncertainty" and "degree of belief".

6.3 Probability Rules

There are four important probability theorems for risk analysis, the meaning and use of which are discussed in this section:

- strong law of large numbers (also called Tchebysheff's inequality¹);
- binomial theorem;
- Bayes' theorem;
- central limit theorem (CLT).

I will also describe a number of mathematical techniques useful in risk analysis and referenced elsewhere:

- Taylor series;
- Tchebysheff's rule (theorem);
- Markov inequality;
- least-squares linear regression;
- rank order correlation coefficient.

We'll begin with some basics on conditional probability, using Venn diagrams to help visualise the thinking.

6.3.1 Venn diagrams

Venn diagrams are introduced here to help visualise some basic rules of probability. In a Venn diagram the squared area, denoted by ε, contains all possible events, and we assign it an area equal to 1. The circles represent specific events. Probabilities are represented by the ratios of areas. For example, the probability of event A in Figure 6.3 is the ratio of area A to the total area ε:

P(A) = A/ε

Figure 6.3 Venn diagram for a single event A.

Mutually exclusive events

Figure 6.4 gives an example of a Venn diagram where two events (A and B) are identified.

¹ After the Russian mathematician Pafnuti Tchebysheff (1821–1894). Other transliterations of his name are Tchebycheff, Chebyshev and Tchebichef.
The events are mutually exclusive, meaning that they cannot occur together, and therefore the circles do not overlap.

Figure 6.4 Venn diagram for two mutually exclusive events.

The areas of the circles are denoted by A and B, and the probabilities of the occurrence of events A and B are denoted by P(A) and P(B):

P(A) = A/ε    P(B) = B/ε

You can think of a Venn diagram as an archery target. Imagine that you are firing an arrow at the target and that you have an equal chance of landing anywhere within the target area, but will definitely hit it somewhere. The circles on the target represent each possible event, so if your arrow lands in circle A, it represents event A happening. In Figure 6.4 you cannot fire an arrow that will land in both A and B at the same time, so events A and B cannot occur at the same time:

P(A ∩ B) = 0

The probability of either event occurring is then just the sum of the probabilities of each event, because we just need to add the A and B areas together:

P(A ∪ B) = P(A) + P(B)

Events that are not mutually exclusive

In Figure 6.5, A and B are not mutually exclusive: they can occur together, represented by the overlap in the Venn diagram. The figure shows the four different areas that are now produced. It can be seen from these areas that

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Figure 6.5 Venn diagram for two events that are not mutually exclusive.

Figure 6.6 More complex Venn diagram example.

6.3.3 Central limit theorem

The central limit theorem (CLT) is one of the most important theorems for risk analysis modelling.
It says that the mean x̄ of a set of n variables (where n is large), drawn independently from the same distribution f(x), will be normally distributed:

x̄ = Normal(μ, σ/√n)    (6.4)

where μ and σ are the mean and standard deviation of the f(x) distribution from which the n samples are drawn.

Example 6.2

If we had 40 variables, each following a Uniform(1, 3) distribution (with mean = 2 and standard deviation = 1/√3), the average of these variables would (approximately) have the following distribution:

x̄ = Normal(2, (1/√3)/√40) = Normal(2, 1/√120)

i.e. x̄ is approximately normally distributed with mean = 2 and standard deviation = 1/√120. +

Exercise 6.1: Create a variety of Monte Carlo models, averaging n distributions of the same type with the same parameter values, and see what the resultant distribution looks like. Try different values for n, e.g. n = 2, 5, 20, 50 and 100, and different distribution types, e.g. triangular, normal, uniform and exponential. For what values of n are these average distributions close to normal? For the triangular distribution, does this value of n vary depending on where the most likely parameter's value lies relative to the minimum and maximum parameter values?

It follows, by multiplying both sides of Equation (6.4) by n, that the sum, Σ, of n variables drawn independently from the same distribution is given by

Σ = Normal(nμ, σ√n)

Example 6.3

The sum Σ of 40 independent Uniform(1, 3) variables will have (approximately) the following distribution:

Σ = Normal(40 × 2, (1/√3)√40) = Normal(80, 3.65)

Remarkably, this theorem also applies to the sum (or average) of a large number of independent variables that have different probability distribution types, in that their sum will be approximately normally distributed providing no variable dominates the uncertainty of the sum. The theorem can also be applied where a large number of positive variables are being multiplied together. Consider a set of X_i, i = 1, ..., n, variables that are being independently sampled from the same distribution. Then their product, Π, is given by

Π = X_1 · X_2 · ... · X_n

Taking the natural log of both sides:

ln Π = Σ_{i=1}^n ln X_i

Since each variable X_i has the same distribution, the variables (ln X_i) must also have the same distribution and thus, from the central limit theorem, ln Π is normally distributed. Now, a variable is lognormally distributed if its natural log is normally distributed, i.e. Π is lognormally distributed. In fact, this application of the central limit theorem still approximately holds for the product of a large number of independent positive variables that have different distribution functions. There are a lot of situations where this seems to apply. For example, the volume of recoverable oil reserves within a field is approximately lognormally distributed, since it is the product of a number of independent(ish) variables, i.e. reserve area, average thickness, porosity, gas/oil ratio, (1 − water saturation), etc.

Most risk analysis models are a combination of adding (subtracting) and multiplying variables together. It should come as no surprise, therefore, that, from the above discussions, most risk analysis results seem to be somewhere between normally and lognormally distributed. A lognormal distribution also looks like a normal distribution when its mean is much larger than its standard deviation, so a risk analysis model result even more frequently looks approximately normal. This particularly applies to project and financial risk analyses, where one is looking at cost or time to completion or the value of a series of cashflows.

It is important to note from the results of this theorem that the distribution of the average of a set of variables depends on the number of variables that are being averaged, as well as the uncertainty of each variable.
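Both results, sums tending towards a normal distribution and products of positive variables tending towards a lognormal, are easy to check by simulation. A minimal Python sketch, using the Uniform(1, 3) variables of Examples 6.2 and 6.3 (the 20 000-iteration Monte Carlo run is my own illustration, not the book's):

```python
import math
import random
import statistics

rng = random.Random(42)

# Example 6.3: sum of n = 40 Uniform(1, 3) variables.
# CLT prediction: Normal(n*mu, sigma*sqrt(n)) = Normal(80, sqrt(40/3)) ~ Normal(80, 3.65)
n = 40
sums = [sum(rng.uniform(1, 3) for _ in range(n)) for _ in range(20_000)]
print(round(statistics.mean(sums), 1), round(statistics.stdev(sums), 2))

# Product of the same variables: ln(product) is a sum of ln(X_i), so the
# CLT says the product is approximately lognormal. As a crude normality
# check, the sample skewness of ln(product) should be small.
logs = [sum(math.log(rng.uniform(1, 3)) for _ in range(n)) for _ in range(20_000)]
m, s = statistics.mean(logs), statistics.stdev(logs)
skew = sum((x - m) ** 3 for x in logs) / (len(logs) * s ** 3)
print(abs(skew) < 0.2)  # True: ln(product) is close to normal
```

The simulated mean and standard deviation of the sums come out very close to the theoretical 80 and 3.65.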
It may be tempting, at times, to seek an expert's estimate of the distribution of the average of a number of variables; for example, the average time it will take to lay a kilometre of road, or the average weight of the fleece of a particular breed of sheep. The reader can now see that it will be a difficult task for experts to provide a distribution of an average measure: they would have to know the number of variables for which the estimate is the average and then apply the central limit theorem, which is no easy task to do in one's head. It is much better to estimate the distribution of the individual items and do the central limit theorem calculations oneself.

Many parametric distributions can be thought of as the sum of a number of other identical distributions. In general, if the mean is much larger than the standard deviation for these summary distributions, they can be approximated by a normal distribution. The central limit theorem is then useful for determining the parameters of the normal distribution approximation. Section III.9 discusses many of the useful approximations of one distribution for another.

6.3.4 Binomial theorem

The binomial theorem says that, for some values a and b and a positive integer n,

(a + b)^n = Σ_{x=0}^n (n choose x) a^x b^(n−x)

The binomial coefficient (n choose x), also sometimes written as nCx, is read as "n choose x" and is calculated as

(n choose x) = n!/(x!(n − x)!)    (6.6)

where the exclamation mark denotes factorial, so 4! = 1 · 2 · 3 · 4, for example. The binomial coefficient calculates the number of different ways one can order n articles where x of those articles are of one type and therefore indistinguishable from one another and the remaining (n − x) are of another type, again each being indistinguishable from another. The Excel function COMBIN calculates the binomial coefficient.

The arguments underpinning this equation go as follows. There are n! ways of ordering n articles, as there are n choices for the first article, then (n − 1) choices for the second, (n − 2) choices for the third, etc., until we are left with just the one choice for the last article. Thus, there are n · (n − 1) · (n − 2) · ... · 1 = n! different ways of ordering these articles. Now, suppose that x of these articles were identical: we would not be able to differentiate between two orderings where we simply swapped the positions of two of these articles. Repeating the logic above, there are x! different orderings that would all appear the same to us, so we would only recognise 1/x! of the possible orderings, and the number of orderings would now be n!/x!. Now, suppose that the remaining (n − x) articles are also identical but differentiable from the x articles. Then we could only distinguish 1/(n − x)! of the remaining possible orderings, and thus the total number of different combinations is given by

n!/(x!(n − x)!)

A useful way of quickly calculating the binomial coefficients for small n is given by Pascal's triangle (Figure 6.7). The outside of the triangle is filled with 1s, and each value inside the triangle is calculated as the sum of the two values immediately above it. Row n then represents the binomial coefficients for n, which also appears as the second value in each row. So, for example, row 10 reads 1, 10, 45, 120, 210, 252, 210, 120, 45, 10, 1, as highlighted in the figure.

Figure 6.7 Pascal's triangle.

Note that the binomial coefficients are symmetric, so that

(n choose x) = (n choose n − x)

This makes sense, as, if we swap x for (n − x) in Equation (6.6), we arrive back at the same formula. If we replace a with probability p, and b with probability (1 − p), the equation becomes

1 = (p + (1 − p))^n = Σ_{x=0}^n (n choose x) p^x (1 − p)^(n−x)

The summed component is the binomial probability mass function for x successes in n trials, where each trial has a probability p of success.
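Pascal's-triangle construction, the factorial formula and the pmf normalisation can all be cross-checked in a few lines of Python (standing in for Excel's COMBIN; the n = 10, p = 0.3 values are arbitrary illustrations):

```python
from math import comb

def pascal_row(n):
    """Row n of Pascal's triangle, built by summing adjacent values of the
    previous row, with 1s on the outside, as in Figure 6.7."""
    row = [1]
    for _ in range(n):
        row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]
    return row

print(pascal_row(4))  # [1, 4, 6, 4, 1]

# the construction agrees with the factorial formula n!/(x!(n - x)!)
assert pascal_row(10) == [comb(10, x) for x in range(11)]

# symmetry: C(n, x) = C(n, n - x)
print(comb(8, 3), comb(8, 5))  # 56 56

# the binomial pmf terms C(n, x) p^x (1 - p)^(n - x) sum to 1
n, p = 10, 0.3
print(round(sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)), 10))  # 1.0
```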
In a binomial process, all successes are considered identical and interchangeable, as are all failures.

Properties of the binomial coefficient

Several standard identities exist for binomial coefficients; one of them, known as Vandermonde's theorem (A. T. Vandermonde, 1735–1796), is

(m + n choose r) = Σ_{k=0}^r (m choose k)(n choose r − k)

Calculating x! for large x

x! is very laborious to calculate for high values of x. For example, 100! ≈ 9.3326E+157, and Excel's FACT() cannot calculate values higher than 170!. The probability mass functions of many discrete probability distributions contain factorials, and we therefore often want to work out factorials for values larger than 170. Algorithms for generating distributions get around any calculation restriction by using approximations; for example, the following equation, known as the Stirling² formula, can be used instead to get a very close approximation:

n! ~ √(2πn) (n/e)^n

where ~ is read "asymptotically equal" and means that the right-hand side approaches the left-hand side as n approaches infinity. However, if you are attempting to calculate a probability exactly, you can still use the Excel function GAMMALN():

ln(x!) = GAMMALN(x + 1)

This may allow you to manipulate multiplications of factorials, etc., by adding them in log space. But, be warned, this formula will not return exactly the same answer as FACT(), and, while it is possible to get values for GAMMALN(x) where x > 171, Excel will return an error if you attempt to calculate the corresponding EXP(GAMMALN(x)).

6.3.5 Bayes' theorem

Bayes' theorem³ is a logical extension of the conditional probability arguments we looked at in the Venn diagram section. We saw that

P(A|B) = P(A ∩ B)/P(B)    and    P(B|A) = P(B ∩ A)/P(A)

and hence

P(A|B) = P(B|A)P(A)/P(B)

which is Bayes' theorem, and, in general,

P(A_i|B) = P(B|A_i)P(A_i) / Σ_j P(B|A_j)P(A_j)

The following example illustrates the use of this equation.

² James Stirling (1692–1770), Scots mathematician.
³ Rev. Thomas Bayes (1702–1761), English philosopher. A short biography and a reprint of his original paper describing Bayes' theorem appear in Press (1989).
Many more are given in the section on Bayesian inference.

Example 6.4

Three machines A, B and C produce 20 %, 45 % and 35 % respectively of a factory's wheel nut output; 2 %, 1 % and 3 % respectively of these machines' outputs are defective.

(a) What is the probability that any wheel nut randomly selected from the factory's stock will be defective?

Let X be the event where the wheel nut is defective, and A, B and C be the events where the selected wheel nut comes from machines A, B and C respectively:

P(X) = P(X|A)P(A) + P(X|B)P(B) + P(X|C)P(C)
     = 0.02 × 0.20 + 0.01 × 0.45 + 0.03 × 0.35 = 0.019

(b) What is the probability that a randomly selected wheel nut will have come from machine A if it is defective? From Bayes' theorem,

P(A|X) = P(X|A)P(A)/P(X) = (0.02 × 0.20)/0.019 ≈ 0.211

In other words, in Bayes' theorem we divide the probability of the required path (the probability that it came from machine A and was defective) by the probability of all possible paths (the probability that it came from any machine and was defective). +

Example 6.5

We wish to know the probability that an animal will be infected (I), given that it passes (Pa) a specific veterinary check, i.e. P(I|Pa). The problem can be visualised by an event tree diagram (Figure 6.8). First of all, the animal will be infected (I) or not infected (N). Secondly, the animal will either pass (Pa) or fail (F) the test.

Figure 6.8 Event tree for Example 6.5.
From Bayes' theorem,

P(I|Pa) = P(Pa|I)P(I) / [P(Pa|I)P(I) + P(Pa|N)P(N)]

In veterinary terminology,

P(I) = the prevalence p, and thus P(N) = (1 − p)
P(F|I) = the sensitivity of the test, Se, and thus P(Pa|I) = (1 − Se)
P(Pa|N) = the specificity of the test, Sp

Putting these elements into Bayes' theorem,

P(I|Pa) = p(1 − Se) / [p(1 − Se) + (1 − p)Sp]

6.3.6 Taylor series

The Taylor series is a formula that determines a polynomial approximation in x of some mathematical function f(x) centred at some value x₀:

f(x) = Σ_{m=0}^∞ [f^(m)(x₀)/m!] (x − x₀)^m

where f^(m) represents the mth derivative with respect to x of the function f. In the special case where x₀ = 0, the series is known as the Maclaurin series of f(x):

f(x) = Σ_{m=0}^∞ [f^(m)(0)/m!] x^m

The Taylor and Maclaurin series expansions are also used to provide polynomial approximations to probability distribution functions.

6.3.7 Tchebysheff's rule

If a dataset has mean x̄ and standard deviation s, we are used to saying that 68 % of the data will lie between (x̄ − s) and (x̄ + s), 95 % lie between (x̄ − 2s) and (x̄ + 2s), etc. However, that is only true when the data follow a normal distribution. The same applies to a probability distribution. So, when the data, or probability distribution, are not normally distributed, how can we interpret the standard deviation? Tchebysheff's rule applies to any probability distribution or dataset. It states:

"For any number k greater than 1, at least (1 − 1/k²) of the measurements will fall within k standard deviations of the mean."

Substituting k = 1, Tchebysheff's rule says that at least 0 % of the data or probability distribution lies within one standard deviation of the mean. Well, we already knew that! However, substituting k = 2 tells us that at least 75 % of the data or distribution lie within two standard deviations of the mean. That is useful information because it applies to all distributions. This is a fairly conservative rule in that, if we know the distribution type, we can specify a much higher percentage (e.g. 95 % for two standard deviations for a normal distribution, compared with 75 % for Tchebysheff's rule), but it is certainly helpful in interpreting the standard deviation of a dataset or probability distribution that is grossly non-normally distributed. From Figure 6.9 you can see that, for any k, knowing the distribution type allows you to specify a much higher fraction of the distribution to be contained in the range mean ± k standard deviations. The bimodal distribution tested was as shown in Figure 6.10.

Figure 6.9 Comparison of Tchebysheff's rule with the results of a few distributions.

Figure 6.10 A bimodal distribution.

6.3.8 Markov inequality

The Markov inequality gives some indication of the range of a distribution, in a similar way to Tchebysheff's rule. It states that, for a non-negative random variable X with mean μ,

P(X > k) ≤ μ/k

for any constant k greater than μ. So, for example, for a random variable with mean 6, the probability of being greater than 20 is less than or equal to 6/20 = 30 %. Of course, being very general like Tchebysheff's rule, it makes a rather conservative statement. For most distributions, the probability is much smaller than μ/k (see Table 6.1 for some examples).

Table 6.1 Markov's rule for different distributions.

Distribution with μ = 6        P(X > 20)
Lognormal(6, σ)                Max. of 6.0 %
Pareto(θ, 6(θ − 1)/θ)          Max. of 3.21 %

6.3.9 Least-squares linear regression

The purpose of least-squares linear regression is to represent the relationship between one or more independent variables x₁, x₂, ... and a variable y that is dependent upon them in the following form:

y_i = β₀ + Σ_j β_j x_ji + ε_i

where x_ji is the ith observed value of the independent variable x_j, y_i is the ith observed value of the dependent variable y, ε_i is the error term or residual (i.e. the difference between the observed y value and that predicted by the model), β_j is the regression slope for the variable x_j and β₀ is the y-axis intercept. Simple least-squares linear regression assumes that there is only one independent variable x. If we assume that the error terms are normally distributed, the equation reduces to

y = Normal(mx + c, s)

where m is the slope of the line, c is the y-axis intercept and s is the standard deviation of the variation of y about this line. Simple least-squares linear regression is a very standard statistical analysis technique, particularly when one has little or no idea of the relationship between the x and y variables. It is probably particularly common because the analysis mathematics are simple (because of the normality assumption), rather than it being a very common rule for the relationship between variables.

LSR makes four important assumptions (Figure 6.11):

1. Individual y values are independent.
2. For each x_i there are an infinite number of possible values of y, which are normally distributed.
3. The distribution of y given a value of x has equal standard deviation for all x values and is centred about the least-squares regression line.
4. The means of the distribution of y at each x value can be connected by a straight line y = mx + c.

Figure 6.11 An illustration of the concepts of least-squares regression.

Statisticians often make transformations of the data (e.g. log(y), √x) to force a linear relationship. That greatly extends the applicability of the regression model, but one must be particularly careful that the errors are reasonably normal, and one runs an enormous risk in using the regression equations for making predictions outside the range of observations.

Estimation of parameters

The simple least-squares regression model determines the straight line that minimises the sum of the squares of the ε_i errors.
It can be shown that this occurs when

m = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²
c = ȳ − m x̄

where x̄ and ȳ are the means of the observed x and y data and n is the number of data pairs (x_i, y_i). The fraction of the total variation in the dependent variable that is explained by the independent variable is known as the coefficient of determination R², which is calculated as

R² = 1 − SSE/TSS

where the sum of squared errors, SSE, is given by

SSE = Σ(y_i − ŷ_i)²

and the total sum of squares, TSS, is given by

TSS = Σ(y_i − ȳ)²

and where ŷ_i are the predicted y values at each x_i:

ŷ_i = m x_i + c

For simple least-squares regression (i.e. only one independent variable), the square root of R² is equivalent to the simple correlation coefficient r:

r = √R²

The correlation coefficient r may alternatively be calculated as

r = Σ(x_i − x̄)(y_i − ȳ) / √[Σ(x_i − x̄)² Σ(y_i − ȳ)²]

Coefficient r provides a quantitative measure of the linear relationship between x and y. It ranges from −1 to +1: a value of r = −1 or +1 indicates a perfect linear fit, and r = 0 indicates that no linear relationship exists at all. As SSE (the sum of squared errors between the observed and predicted y values) tends to zero, r² tends to 1 and therefore r tends to −1 or +1, its sign depending on whether m is negative or positive respectively. The value of r is used to determine the statistical significance of the fitted line, by first calculating the test statistic t as

t = r √[(n − 2)/(1 − r²)]

The t-statistic follows a t-distribution with (n − 2) degrees of freedom (provided the linear regression assumptions of normally distributed variation of y about the regression line hold), which is used to determine whether the fit should be rejected or not at the required level of confidence. The standard error of the y estimate, S_yx, is calculated as

S_yx = √[SSE/(n − 2)]

This is equivalent to the standard deviation of the error terms ε_i. These errors reflect the true variability of the dependent variable y from the least-squares regression line.
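The estimation formulas above are straightforward to compute directly. A minimal Python sketch on a small made-up dataset (the numbers are purely illustrative, not the book's survey data):

```python
import math

# least-squares fit of y = m*x + c on a small illustrative dataset
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# slope and intercept that minimise the sum of squared errors
m = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
     / sum((x - xbar) ** 2 for x in xs))
c = ybar - m * xbar

# coefficient of determination R^2 = 1 - SSE/TSS
preds = [m * x + c for x in xs]
sse = sum((y - p) ** 2 for y, p in zip(ys, preds))
tss = sum((y - ybar) ** 2 for y in ys)
r2 = 1 - sse / tss

# standard error of the y estimate, S_yx, with (n - 2) degrees of freedom
s_yx = math.sqrt(sse / (n - 2))

print(round(m, 3), round(c, 3), round(r2, 4), round(s_yx, 3))
# 2.02 -0.02 0.9982 0.179
```

In Excel the same four quantities come from worksheet functions rather than hand-coded sums, but the arithmetic is identical.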
The denominator (n − 2) is used, instead of the (n − 1) we have seen before for sample standard deviation calculations, because two values, m and c, have been estimated from the data to determine the equation values, and we have therefore lost two degrees of freedom instead of the one degree of freedom usually lost in determining the mean. The equation of the regression line and the S_yx statistic can be used together to produce a stochastic model of the relationship between X and Y, as follows:

Y = Normal(mX + c, S_yx)

Some caution is needed in using such a model. The regression model is intended to work within the range of the independent variable X for which there have been observations. Using the model outside this range can produce very significant errors if the relationship between x and y deviates from this linear relationship. This is also purely a model of variability, i.e. we are assuming that the linear relationship is correct and that the parameters are known. We should also include our uncertainty about the parameters, and perhaps about whether the linear relationship is even appropriate.

Example 6.6

Consider the dataset in Table 6.2, which shows the result of a survey of 30 people. They were asked to provide details of their monthly net income {x_i} and the amount they spent on food each month {y_i}. The values of m, c, r and S_yx were calculated using the corresponding Excel functions. The line ŷ_i = m x_i + c is plotted against the data points in Figure 6.12. +

Table 6.2 Data for Example 6.6 (net monthly income X, monthly food expenditure Y, least-squares regression estimate Ŷ and error terms ε). The income values are: 505, 517, 523, 608, 609, 805, 974, 1095, 1110, 1139, 1352, 1453, 1461, 1543, 1581, 1656, 1748, 1760, 1811, 1944, 1998, 2054, 2158, 2229, 2319, 2371, 2637, 2843, 2889, 3096.

Figure 6.12 The line ŷ_i = m x_i + c plotted against the data points from Table 6.2.

Figure 6.13 Distribution of the error terms.
The error terms ε_i = y_i − ŷ_i are shown in Figure 6.13. A distribution fit of these ε_i values shows that they are approximately normally distributed. A test of significance of r also shows that, for 28 degrees of freedom (n − 2), there is only a vanishingly small chance that such a high value of r could have been observed from purely random data. We would therefore feel confident in modelling the relationship between any net monthly income value N (within the observed range of incomes) and monthly expenditure on food F using

F = Normal(mN + c, S_yx)

Uncertainty about least-squares regression parameters

The parameters m, c and S_yx for the least-squares regression represent the best estimate of the variability model, where we are assuming some stochastically linear relationship between x and y. However, since we will have only a limited number of observations (i.e. {x, y} pairs), we do not have perfect knowledge of the stochastic system, and there is therefore some uncertainty about the regression parameters. The t-test tells us whether the linear relationship might exist at some level of confidence. More useful, however, from a risk analysis perspective is that we can readily determine distributions of uncertainty about these parameters using the bootstrap.

6.3.10 Rank order correlation coefficient

Spearman's rank order correlation coefficient ρ is a non-parametric statistic for quantifying the correlation relationship between two variables. Non-parametric means that the correlation statistic is not affected by the type of mathematical relationship between the variables, unlike linear least-squares regression analysis, for example, which requires the relationship to be described by a straight line with normally distributed variation of the dependent variable about that line. Calculating the rank order correlation proceeds as follows.
Replace the n observed values for the two variables X and Y by their rankings: the largest value for each variable has a rank of 1, the smallest a rank of n, or vice versa. The Excel function RANK() can do this, but it is inaccurate where there are ties, i.e. where two or more observations have the same value. In such cases, one should assign to each of the same-valued observations the average of the ranks they would have had if they had been infinitesimally different from the value they take. The Spearman rank order correlation coefficient ρ is then calculated as

ρ = 1 − 6 Σ (ui − vi)² / (n(n² − 1))

where ui and vi are the ranks of the ith pair of the X and Y variables. This is, in fact, a shortcut formula: it is not exact when there are tied measurements, but still works well when there are not too many ties relative to the size of n. The exact formula is the Pearson correlation coefficient applied to the ranks:

ρ = Σ (ui − ū)(vi − v̄) / √( Σ (ui − ū)² Σ (vi − v̄)² )

where ū and v̄ are the mean ranks, and ui and vi are the ranks of the ith observation in samples 1 and 2 respectively. This calculation does not require that one identify which variable is dependent and which is independent: the calculation for ρ is symmetric, so X and Y could swap places with no effect on its value. The value of ρ varies from −1 to 1 in the same way as the least-squares correlation coefficient r; a value of ρ close to ±1 indicates a strong monotonic relationship between the two variables.

The mode

The mode is the x value with the greatest probability p(x) for a discrete distribution, or the greatest probability density f(x) for a continuous distribution. The mode is not uniquely defined for a discrete distribution with two or more values that have the equal highest probability. For example, a distribution of the number of heads in three tosses of a coin gives equal probability to both one and two heads. The mode may also not be uniquely defined if a distribution is multimodal (i.e. it has two or more peaks).

The median x0.5

The median is the value that the variable has a 50 % probability of exceeding, i.e. F(x0.5) = 0.5.
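The tie-handling ranking rule and the shortcut formula described above can be sketched as follows. This is an illustrative Python version (the function names are mine), not the Excel RANK() approach the book discusses:

```python
def average_ranks(values):
    """Rank values (largest = 1), giving tied observations the average of
    the ranks they would otherwise occupy."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values starting at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based rank positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Shortcut formula: exact when there are no ties, a good approximation
    when ties are few relative to n."""
    n = len(xs)
    u, v = average_ranks(xs), average_ranks(ys)
    d2 = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

For the tied dataset {10, 20, 20, 5}, for example, the two 20s share the average of ranks 1 and 2, i.e. 1.5 each.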
An interesting property of unimodal probability distributions relates the relative positions of the mean, mode and median. If the distribution is right (positively) skewed, these three measures of central tendency are positioned from left to right: mode, median and mean (see Figure 6.14). Conversely, a unimodal left (negatively) skewed distribution has them in the reverse order. For a unimodal, symmetric distribution, the mode, median and mean are all equal.

6.4.2 Measures of spread

Variance V

The variance is a measure of how much the distribution is spread about the mean:

V = E[(X − μ)²]

where E[] denotes the expected value (mean) of whatever is in the brackets, so V = E[X²] − μ².

Figure 6.14 Relative positions of the mode, median and mean of a right-skewed unimodal distribution (50 % of the distribution lies either side of the median).

Thus, the variance sums up the squared distance from the mean of all possible values of x, weighted by the probability of x occurring. The variance is known as the second moment about the mean. It has units that are the square of the units of x. So, if x is cows in a random field, V has units of cows². This limits the intuitive value of the variance.

Standard deviation σ

The standard deviation is the positive square root of the variance, i.e. σ = √V. Thus, if the variance has units of cows², the standard deviation has units of cows, the same as the variable x. The standard deviation is therefore more popularly used to express a measure of spread.

Example 6.8

The variance V of the Uniform(1, 3) distribution is calculated as follows:

V = E(X²) − μ² = 13/3 − 2² = 1/3

since μ = 2 from before, and the standard deviation σ is therefore √(1/3) ≈ 0.577.

Variance and standard deviation have the following properties, where a is some constant and X and Xi are random variables:

1. V(X) ≥ 0 and σ(X) ≥ 0.
2. V(aX) = a²V(X) and σ(aX) = |a|σ(X).
3. V(Σⁿi=1 Xi) = Σⁿi=1 V(Xi), providing the Xi are uncorrelated.
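Example 6.8 and the scaling property V(aX) = a²V(X) can be checked numerically. A small Python sketch (the helper is my own), computing the Uniform(a, b) moments from E[X] and E[X²]:

```python
def uniform_moments(a, b):
    """Mean and variance of Uniform(a, b):
    E[X] = (a+b)/2, E[X^2] = (a^2 + ab + b^2)/3, V = E[X^2] - E[X]^2."""
    mean = (a + b) / 2
    ex2 = (a * a + a * b + b * b) / 3
    return mean, ex2 - mean * mean
```

Uniform(5, 15) is 5 × Uniform(1, 3), and its variance is indeed 25 times larger (100/12 against 1/3), consistent with property 2.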
6.4.3 Mean, standard deviation and the normal distribution

For a normal distribution only, the areas bounded by 1, 2 and 3 standard deviations either side of the mean contain approximately 68.27, 95.45 and 99.73 % of the distribution, as shown in Figure 6.15. Since a lot of distributions look similar to a normal distribution under certain conditions, people often think of 70 % of a distribution being reasonably contained within one standard deviation either side of the mean, but this rule of thumb must be used with care. If it is applied to a distribution that is significantly non-normal, like an exponential distribution, the error can be quite large (the range μ ± σ contains about 87 % of an exponential distribution, for example).

Figure 6.15 Some probability areas of the normal distribution.

Example 6.9

Panes of bullet-proof glass manufactured at a factory have a mean thickness over a pane that is normally distributed, with a mean of 25 mm and a variance of 0.04 mm². If 10 panes are purchased, what is the probability that all the panes will have a mean thickness between 24.8 and 25.4 mm?

The distribution of the mean thickness of a randomly selected pane is Normal(25, 0.2) mm, since the standard deviation is the square root of the variance; 24.8 mm is one standard deviation below the mean and 25.4 mm is two standard deviations above the mean. The probability p that a pane lies between 24.8 and 25.4 mm is then half the probability of lying within ± one standard deviation of the mean plus half the probability of lying within ± two standard deviations of the mean, i.e. p = (68.27 % + 95.45 %)/2 = 81.86 %. The probability that all 10 panes will have a mean thickness between 24.8 and 25.4 mm, provided that they are independent of each other, is therefore (81.86 %)¹⁰ = 13.51 %.

6.4.4 Measures of shape

The mean and variance are called the first moment about zero and the second moment about the mean.
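The arithmetic in Example 6.9 can be reproduced from the standard normal cumulative distribution function. A short Python check (norm_cdf is my own helper, built on the error function):

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Example 6.9: thickness ~ Normal(25, 0.2) mm; each pane must fall in
# (24.8, 25.4) mm, i.e. between -1 and +2 standard deviations of the mean.
p_one = norm_cdf(2) - norm_cdf(-1)   # ~0.8186
p_all_ten = p_one ** 10              # ~0.1351, assuming independent panes
```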
The third and fourth moments about the mean, called skewness and kurtosis respectively, are also occasionally used in risk analysis.

Skewness S

The skewness statistic is calculated from the following formulae:

Discrete variable: S = Σ (x − μ)³ p(x) / σ³, summed over all x from min to max

Continuous variable: S = ∫ (x − μ)³ f(x) dx / σ³, integrated from min to max

This is often called the standardised skewness, as it is divided by σ³ to give a unitless statistic. The skewness statistic refers to the lopsidedness of the distribution (see left-hand panel of Figure 6.16). If a distribution has a negative skewness (sometimes described as left skewed), it has a longer tail to the left than to the right. A positively skewed distribution (right skewed) has a longer tail to the right, and zero-skewed distributions are usually symmetric.

Figure 6.16 Examples of skewness (left-hand panel) and kurtosis (right-hand panel).

Kurtosis K

The kurtosis statistic is calculated from the following formulae:

Discrete variable: K = Σ (x − μ)⁴ p(x) / σ⁴, summed over all x from min to max

Continuous variable: K = ∫ (x − μ)⁴ f(x) dx / σ⁴, integrated from min to max

This is often called the standardised kurtosis, since it is divided by σ⁴, again to give a unitless statistic. The kurtosis statistic refers to the peakedness of the distribution (see right-hand panel of Figure 6.16) - the higher the kurtosis, the more peaked is the distribution. A normal distribution has a kurtosis of 3, so kurtosis values for a distribution are often compared with 3. For example, if a distribution has a kurtosis below 3 it is flatter than a normal distribution. Table 6.3 gives some examples of skewness and kurtosis for common distributions.

6.4.5 Raw and central moments

There are three sets of moments that are used in probability modelling to describe a distribution of a random variable x with density function f(x). The first set are called raw moments μ′k. The kth raw moment is defined as

μ′k = E[xᵏ] = ∫ xᵏ f(x) dx, integrated from min to max

Table 6.3 Skewness and kurtosis.
Distribution      Skewness                         Kurtosis
Binomial(n, p)    (1 − 2p)/√(np(1 − p))            3 + (1 − 6p(1 − p))/(np(1 − p))
ChiSq(ν)          √(8/ν)                           3 + 12/ν
Exponential       2                                9
Lognormal         (e^(s²) + 2)√(e^(s²) − 1)        e^(4s²) + 2e^(3s²) + 3e^(2s²) − 3
Normal            0                                3
Poisson(λ)        1/√λ                             3 + 1/λ
Triangular        depends on parameters            2.4
Uniform           0                                1.8

(s in the Lognormal rows is the standard deviation of ln x.)

where k = 1, 2, 3, . . ., or, for discrete variables with probability mass p(x), as

μ′k = E[xᵏ] = Σ xᵏ p(x), summed over all x from min to max

Then we have the central moments μk, defined as

μk = E[(X − μ)ᵏ] = ∫ (x − μ)ᵏ f(x) dx, k = 2, 3, . . .

where μ = μ′1 is the mean of the distribution. Finally, we have the normalised moments:

Mean = μ
Variance = μ2
Skewness = μ3 / (Variance)^(3/2)
Kurtosis = μ4 / (Variance)²

The normalised moments are what appear most often in this book because they allow us to compare distributions most easily. One can translate between raw and central moments as follows:

From raw moments to central moments:
μ2 = μ′2 − μ²
μ3 = μ′3 − 3μμ′2 + 2μ³
μ4 = μ′4 − 4μμ′3 + 6μ²μ′2 − 3μ⁴

From central moments to raw moments:
μ′2 = μ2 + μ²
μ′3 = μ3 + 3μμ2 + μ³
μ′4 = μ4 + 4μμ3 + 6μ²μ2 + μ⁴

You might wonder why we don't always use normalised moments and avoid any confusion. Central moments don't actually have much use in risk analysis - they are more of an intermediary calculation step - but raw moments are very useful. First of all, the equations are simpler and therefore sometimes easier to calculate than central moments, and we can then convert them to central moments using the equations above. Secondly, they allow us to determine the moments of some combinations of random variables. For example, consider a variable Y that has probability p of taking a value from variable A and a probability (1 − p) of taking a value from variable B: the raw moments combine linearly, i.e. E[Yᵏ] = pE[Aᵏ] + (1 − p)E[Bᵏ].

You may also come across something called a moment generating function. This is a function MX(t) specific to each distribution and defined as

MX(t) = E[e^(tX)]

where t is a dummy variable. This leads to the relationship with raw moments:

μ′k = dᵏMX(t)/dtᵏ evaluated at t = 0

For example, the Normal(μ, σ) distribution has MX(t) = exp(μt + σ²t²/2), from which we get μ′1 = μ and μ′2 = σ² + μ². The great thing about moment generating functions is that we can use them with the sums of random variables.
For example, if Y = rA + sB, where A and B are random variables and r and s are constants, then

MY(t) = MA(rt) MB(st)

provided A and B are independent. Note that, for a few distributions, not all moments are defined. The calculation of the moments of the Cauchy distribution, for example, is the difference between two integrals that give infinite values. More commonly, a few distributions don't have defined moments unless their parameters exceed a certain value. Appendix III lists these distributions and the restrictions.

Chapter 7 Building and running a model

In this chapter I give a few tips on how to build a risk analysis model and techniques for making it run faster - very useful if your model is either very large or needs to be run for many iterations. I also explain the most common errors people make in their modelling.

7.1 Model Design and Scope

Risk analysis is about supporting decisions by answering questions about risk. We attempt to provide qualitative and, where time and knowledge permit, quantitative information to decision-makers that is pertinent to their questions. Inevitably, decision-makers must deal with other factors that may not be quantified in a risk analysis, which can be frustrating for a risk analyst when they see their work being "ignored". Don't let it frustrate you: the best risk analysts remain professionally neutral about the decisions that are made from their work. Our job is to make sure that we have represented the current knowledge and how that affects the variables on which decisions are made. Remaining neutral also relieves you of being frustrated by a lack of available data or adequate opinion - you just have to work with what you have. The first step to designing a good model is to put yourself in the position of the decision-maker by understanding how the information you might provide connects to the questions they are asking.
A decision-maker often does not appreciate all that comes with asking a question in a certain way, and may not initially have worked out all the possible options for handling the risk (or opportunity). When you believe that you properly understand the risk question or questions that need answering, it is time to brainstorm with colleagues, stakeholders and the managers about how you might put an analysis together that satisfies the managers' needs. Effort put into this stage pays back tenfold: everyone is clear on the purpose of your analysis; the participants will be more cooperative in providing information and estimates; and you can discuss the feasibility of any risk analysis approach. Consider going through the quality check methods I described in Chapter 3. I recommend you think of mapping out your ideas with Venn diagrams and event trees. Then look at the data (and perhaps expertise for subjective estimates) you believe are available to populate the model. If there are data gaps (there usually are), consider whether you will be able to get the necessary data to fill the gaps, and quickly enough to be able to produce an analysis within the decision-maker's timeframe. If the answer is "no", look for other ways to produce an analysis that will meet the decision-maker's needs, or perhaps a subset of those needs. But, whatever you do, don't embark on a risk analysis where you know that data gaps will remain and your decision-maker will be left with no useful support. Some scientists argue that risk analysis can also be for research purposes - to determine where the data gaps lie. We see the value in that determination, of course, but, if that is your purpose, state it clearly and don't leave the managers with any expectation that will be unfulfilled.

7.2 Building Models that are Easy to Check and Modify

The better a model is explained and the better it is laid out, the easier it is to check.
Model building is an iterative process, which means that you should construct your model to make it easy to add, remove and modify elements. A few basic rules will help you do this:

- Dedicate one sheet of the workbook to recording the history of changes to the model since conception, with emphasis on changes since the previous version.
- Document the model logic, data sources, etc., during the model build. It may seem tedious, especially for the parts you end up discarding, but writing down what you do as you go along ensures the documentation does get done (otherwise we move on to the next problem, the model remains a black box to others, etc.) and also gives you a great self-check on your approach.
- Avoid really long formulae if possible, unless it is a formula you use very often. It might be rather satisfying to condense some complex logic into a single cell, but it will be very hard for someone else to figure out what you did.
- Avoid writing macros that rely on model elements being at specific locations in the workbook or in other files. Add plenty of annotations to macros. Don't put model parameter values in the macro code. Give each macro and input parameter a sensible name.
- Avoid being geeky. I'm reviewing a spreadsheet model right now written 10 years ago by a guy who is no longer around. It is almost completely written in macros, with almost no annotation, but worst of all is that he wrote the model to allow it automatically to expand to accommodate more assets, though there was no such requirement. He created dozens of macros to do simple things, like searching a table, that would normally be done with a VLOOKUP or OFFSET function, and placing everything in macros linked to other macros means one cannot use Excel's audit tools like Trace Precedents. It also takes maybe 100 times longer to run than it should.
- Break down a complex section into its constituent parts.
This may best be done in a separate area of the model and the result placed into a summary area. Hit the F9 key (or whatever will generate another scenario) to see that the constituent parts are all working well. Often, in developing ModelRisk functions, we have built spreadsheet models to replicate the logic and have found that doing so can give us ideas for improvements too.

- Use a single formula for an array (e.g. a column) so that only one cell need be changed and the formula copied across the rest of the array.
- Keep linking between sheets to a minimum. For example, if you need to do a calculation on a dataset residing in one sheet, do it in that sheet, then link the calculation to wherever it needs to be used. This saves huge formulae that are difficult to follow, like:
=VoseCumulA('Capital required'!G25,'Capital required'!G26,'Capital required'!G28:G106,'Capital required'!H28:H106)
- Create conditional formatting and alerts that tell you when impossible or irrelevant values occur in the model. ModelRisk functions have a lot of imbedded checks so that, for example, VoseNormal(0, -1) will return the text "Error: sigma must be >= 0" rather than Excel's rather unhelpful #VALUE! approach. If you write macros, include similarly meaningful error messages.
- Use the Data/Validation tool in Excel to format cells so that another user cannot input inappropriate values into the model - for example, they cannot input a non-integer value for an integer variable.
- Use the Excel Tools/Protection/Protect Sheet function together with the Tools/Protection/Allow Users to Edit Ranges function to ensure other users can only modify input parameters (not calculation cells).
- In general, keep the number of unique formulae as small as possible - we often write columns containing the same formulae repeatedly with just the references changing.
If you do need to write a different formula in certain cells of an array (usually at the beginning or end), consider giving those cells a different format (we tend to use a grey background).

- Colour-code the model elements: we use blue for input data and red for outputs.
- Make good use of range naming. To give a name to a cell or range of contiguous cells, select the cells, click in the name box and type the name you want to use. So, for example, cell A1 might contain the value 22. Giving it the label "Peter" means that typing "=Peter" anywhere else in the sheet will return the value 22. For a lot of probability distributions there are standard conventions for naming the parameters of your model, for example =VoseHypergeo(n, D, M) and =VoseGamma(alpha, beta). So, if you have just one or two of these distributions in your model, using these names (e.g. alpha1, alpha2, etc., for each gamma distribution) actually makes it easier to write the formulae too. Note that a cell or range may have several names, and a cell in a range may have a separate name from the range's name. Don't follow my lead here because, for the purposes of writing models you can read in a book, I've rarely used range names.

7.3 Building Models that are Efficient

A model is most efficient when:

1. It takes the least time to run.
2. It takes the least effort to maintain and requires the least amount of assumptions.
3. It has a small file size (memory and speed issues).
4. It supports the most decision options (see Chapters 3 and 4).

7.3.1 Least time to run

Microsoft are making efforts to speed up Excel, but it has a very heavy visual interface that really can slow things down. I'll look at a few tips for making Excel run faster first, then for making your simulation software run faster and then for making a model that gets the answer faster. Finally, I'll give you some ideas on how to determine whether you can stop the model because you've run enough iterations.
Making Excel run faster

Excel scans for calculations through worksheets in alphabetical order of the worksheet name, and starts at cell A1 in each sheet, scans the row and drops down to the next row. Then it dances around for all the links to other cells until it finds the cells it has to calculate first. It can therefore speed things up if you give names to each sheet that reflect their sequence (e.g. start each sheet with "1. Assumptions", "2. Market projection", "3. ...", etc.), and keep the calculations within a sheet flowing down and across.

- Avoid array functions, as they are slow to calculate, although faster than an equivalent VBA function.
- Use megaformulae (with the above caution), as they run about twice as fast as intermediary calculations, and 10 times as fast as VBA calculations.
- Custom Excel functions run more slowly than built-in functions but speed up model building and model reliability. Be careful with custom functions because they are hard to check through. There are a number of vendors, particularly in the finance field, who sell function libraries.
- Avoid links to external files. Keep the simulation model in one workbook.

Making your simulation software run faster

- Turn off the Update Display feature if your Monte Carlo add-in has that ability. It makes an enormous difference if there are imbedded graphs.
- Use multiple CPUs if your simulation software offers this. It can make a big difference.
- Avoid the VoseCumulA(), VoseDiscrete(), VoseDUniform(), VoseRelative() and VoseHistogram() distributions (or other products' equivalents) with large arrays if possible, as they take much longer to generate values than other distributions.
- Latin hypercube sampling gets to the stable output quicker than Monte Carlo sampling, but the effect gets increasingly quickly lost the more significant distributions there are in the model, particularly if the model is not just adding and/or subtracting distributions.
The two sampling methods take the same time to run, however, so it makes sense to use Latin hypercube sampling for simulation runs.

- Run bootstrap analyses and Bayesian distribution calculations in a separate spreadsheet when you are estimating uncorrelated parameters, fit the results using your simulation software's fitting tool and, if the fit is good, use just the fitted distributions in your simulation model. This does have the disadvantage, however, of being more laborious to maintain when more data become available.
- If you write VBA macros, consider whether they need to be declared as volatile.

Getting the answer faster

As a general rule, it is much better to be able to create a probability model that calculates, rather than simulates, the required probability or probability distribution. Calculation is preferable because the model answer is updated immediately if a parameter value changes (rather than requiring a re-simulation of the model) and, more importantly within this context, it is far more efficient. For example, let's imagine that a machine has 2000 bolts, each of which could shear off within a certain timeframe with a 0.02 % probability. We'll also say that, if a bolt shears off, there is a 0.3 % probability that it will cause some serious injury. What is the probability that at least one injury will occur within the timeframe? How many injuries could there be?
The pure simulation way would be to model the number of bolt shears as

Shears = VoseBinomial(2000, 0.02 %)

and then model the number of injuries as

Injuries = VoseBinomial(Shears, 0.3 %)

Or we could recognise that each bolt has a 0.02 % * 0.3 % chance of causing injury, so

Injuries = VoseBinomial(Bolts, 0.02 % * 0.3 %)

Run a simulation for enough iterations and the fraction of the iterations where Injuries > 0 is the required probability, and collecting the simulated values gives us the required distribution. However, on average we should see 2000 * 0.02 % * 0.3 % = 0.0012 injuries (that's 1 in 833), so your simulation will generate about 830 zeros for every non-zero value; for us to get an accurate description of the result (e.g. have 1000 or so non-zero values), we would have to run the model a long time. A better approach is to calculate the probabilities and construct the required distribution, as in the model shown in Figure 7.1. I have used Excel's BINOMDIST function to calculate the probability of each number of injuries x. You can see the probability of non-zero values is pretty small, hence the need for the y axis in the chart to be shown on a log scale. The beauty of this method is that any change to the parameters immediately produces a new output.

Figure 7.1 Example model determining a risk analysis outcome by calculation. Probability of x injuries:

x    Excel (BINOMDIST)    ModelRisk (VoseBinomialProb)
0    9.988E-01            9.988E-01
1    1.199E-03            1.199E-03
2    7.188E-07            7.188E-07
3    2.872E-10            2.872E-10
4    8.604E-14            8.604E-14
5    0.000E+00            2.061E-17

Formulae table: C8:C13 =BINOMDIST(B8,Bolts,Pshear*Pinjury,FALSE); D8:D13 =VoseBinomialProb(B8,Bolts,Pshear*Pinjury,FALSE).
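The calculated distribution of this example can be reproduced directly from the binomial probability mass function. A short Python sketch (the variable names are mine; this is the pmf that BINOMDIST and VoseBinomialProb evaluate):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# 2000 bolts; each causes an injury with probability 0.02% * 0.3%
n_bolts = 2000
p_injury = 0.0002 * 0.003   # 6e-7 per bolt

probs = [binomial_pmf(k, n_bolts, p_injury) for k in range(6)]
p_at_least_one = 1 - probs[0]
expected_injuries = n_bolts * p_injury   # 0.0012, i.e. about 1 in 833
```

The values agree with the Excel column above, and the pmf at x = 5 is small but non-zero, matching the VoseBinomialProb column rather than BINOMDIST's displayed 0.000E+00.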
I have also shown the same calculation with ModelRisk's VoseBinomialProb function, which shows that the probability that x = 5 is not actually zero (obviously), as BINOMDIST would have us believe - Excel's statistical functions aren't very good. Of course, most of the risk analysis problems we face are not as simple as the example above, but we can nonetheless often find shortcuts. For example, imagine that we believe that the maximum daily wave height around a particular offshore rig follows a Rayleigh(7.45) metres distribution. The deck height (the distance from the water at rest to the underside of the lower deck structure) is 32 metres, and the damage that will be caused, as a fraction f of the value of the rig, is a function of the wave height x above the deck level D following the equation

f = (1 + ((x − D)/1.6)^−0.91)^−0.82 for x > D, and f = 0 otherwise

We would like to know the expected damage cost per year as a fraction of the rig value (this is a typical question, among others, that insurers need answered).

Figure 7.2 Offshore platform damage model showing three methods to estimate expected damage as a fraction of rig value (deck height 32 metres, Rayleigh parameter 7.45):
(a) Pure simulation: for each of 365 days, Max wave height =VoseRayleigh($D$2) and Loss =IF(C6>$D$1,(1+((C6-$D$1)/1.6)^-0.91)^-0.82,0); the output is the sum of the daily losses.
(b) Simulation and calculation: P(wave > deck) =1-VoseRayleighProb($D$1,$D$2,1), about 9.86E-05; size of wave given > deck =VoseRayleigh($D$2,,VoseXBounds($D$1,)); expected damage over year = 365 * P(wave > deck) * resultant damage (output = mean).
(c) Calculation only: expected fractional loss per day =VoseIntegrate("VoseRayleighProb(#,D2,0)*(1+((#-D1)/1.6)^-0.91)^-0.82",D1,200,10), about 0.0000471; expected fractional loss over the year = 365 times that, about 0.0172.
We could determine this by (a) pure simulation, (b) a combination of calculation and simulation or (c) pure calculation, as shown in the model of Figure 7.2. The pure simulation model is simple enough: the maximum wave height is simulated for each day, and then the resultant damage is simulated by writing an IF statement for when the wave height exceeds the deck height. The model has the advantage of being easy to follow, but the probability of damage is low, so it needs to run a long time. You also need an accurate algorithm for simulating a Rayleigh distribution. The simulation-and-calculation model calculates the probability that a wave will exceed the deck height in cell D374 (about one in 10 000). ModelRisk has equivalent probability functions for all its distributions, whereas other Monte Carlo add-ins tend to focus only on generating random numbers, but Appendix III gives the relevant formulae so you can replicate this. Cell D375 generates a Rayleigh(7.45) distribution truncated to have a minimum equal to the deck height, i.e. we are only simulating those waves that would cause any damage. I've used the ModelRisk generating function, but @RISK, Crystal Ball and some other simulation tools offer distribution truncation. Cell D376 then calculates the damage fraction for the generated wave height. Finally, cell D377 multiplies the probability that a wave will exceed the deck height by the damage it would then do and by 365 for the days in the year. Running a simulation and taking the mean (=RiskMean(D377) in @RISK, =CB.GetForeStatFN(D377,2) in Crystal Ball) will give us the required answer. This version of the model is still pretty easy to understand, but has 1/365 of the simulation load and only simulates the 1 in 10 000 scenario where a wave hits the deck, so it achieves the same accuracy in about 1/3 650 000th of the iterations of the first model.
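Method (b) - simulating only the waves that overtop the deck - can be sketched in a few lines. This Python version is my own illustration (assuming the Rayleigh wave-height model and damage equation above); it samples the truncated Rayleigh by inverting its conditional survival function:

```python
import math
import random

SIGMA = 7.45   # Rayleigh parameter for maximum daily wave height (metres)
DECK = 32.0    # deck height (metres)

def wave_given_exceeds_deck(rng):
    """Inverse-CDF sample from Rayleigh(SIGMA) truncated to x > DECK:
    if U ~ Uniform(0,1], then x = sqrt(DECK^2 - 2*SIGMA^2*ln(U))."""
    u = 1.0 - rng.random()   # in (0, 1]
    return math.sqrt(DECK ** 2 - 2 * SIGMA ** 2 * math.log(u))

def damage_fraction(x):
    """Damage as a fraction of rig value for a wave of height x."""
    return (1 + ((x - DECK) / 1.6) ** -0.91) ** -0.82 if x > DECK else 0.0

rng = random.Random(42)
p_exceed = math.exp(-DECK ** 2 / (2 * SIGMA ** 2))   # ~9.86e-5, one in 10 000
mean_damage = sum(damage_fraction(wave_given_exceeds_deck(rng))
                  for _ in range(100_000)) / 100_000
annual_loss = 365 * p_exceed * mean_damage            # ~0.0172
```

With 100 000 conditional samples the estimate lands close to the 0.0172 annual figure quoted for the model, at a tiny fraction of the pure-simulation workload.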
The third model performs the integral in cell D380:

expected fractional loss per day = ∫ from D to ∞ of (1 + ((x − D)/1.6)^−0.91)^−0.82 f(x) dx

where f(x) is the Rayleigh(7.45) density function and D is the deck height. This is summing up the damage fraction for each possible wave height x, weighted by x's probability of occurrence. The VoseIntegrate function in ModelRisk performs one-dimensional integration over the variable "#" using a sophisticated error-minimisation algorithm that gives very accurate answers with a short computation time (it took about 0.01 seconds in this model, for example). Mathematical software like Mathematica and Maple will also perform such integrals. The advantage of this approach is that the results are instantaneous and very accurate (to 15 significant figures!), but the downside is that you need to know what you are doing in probability modelling (plus you need a fancier tool such as ModelRisk, Maple, etc.). ModelRisk helps out with the explanation and checking by displaying a plot of the function and the integrated area when you click the Vf (View Function) icon. Note that for numerical integration you have to pick a high value for the upper integration limit in place of infinity, but a quick look at the Rayleigh(7.45) shows that its probability of being above 200 is so small that it's outside a computer's floating-point ability to display it anyway.

In summary, calculation is fast and more accurate (true, with simulation you can improve accuracy by running the model longer, but there's a limit) and simulation is slow. On the other hand, simulation is easier to understand and check than calculation. I often use the phrase "calculate when you can, simulate when you can't", and when you "can't" is as much a function of the expertise level of the reviewers as it is of the modeller.
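The pure-calculation method can be replicated with any numerical integrator; the specialised VoseIntegrate function is not essential. A plain trapezoidal-rule Python sketch (my own, using 200 as the stand-in for infinity, as in the spreadsheet formula):

```python
import math

SIGMA = 7.45   # Rayleigh parameter for maximum daily wave height (metres)
DECK = 32.0    # deck height (metres)

def rayleigh_pdf(x, sigma=SIGMA):
    """Rayleigh density: (x/sigma^2) * exp(-x^2 / (2*sigma^2))."""
    return (x / sigma ** 2) * math.exp(-x * x / (2 * sigma ** 2))

def damage_fraction(x, deck=DECK):
    """Fraction of rig value lost for a wave of height x (0 below the deck)."""
    if x <= deck:
        return 0.0
    return (1 + ((x - deck) / 1.6) ** -0.91) ** -0.82

def expected_daily_loss(upper=200.0, steps=200_000):
    """Trapezoidal integration of damage(x) * f(x) from the deck height up to
    a cutoff standing in for infinity (P(X > 200) is negligibly small)."""
    h = (upper - DECK) / steps
    total = 0.0
    for i in range(steps + 1):
        x = DECK + i * h
        w = 0.5 if i in (0, steps) else 1.0
        total += w * damage_fraction(x) * rayleigh_pdf(x)
    return total * h

daily = expected_daily_loss()
annual = daily * 365
```

This reproduces the spreadsheet results of about 0.0000471 expected fractional loss per day and about 0.0172 per year.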
If you really would like to use a calculation method, or want to have a mixed calculation-simulation model, but worry about getting it right, consider writing both versions in parallel and checking that they produce the same answers for a range of different parameter values.

7.3.2 Least effort to maintain

The biggest problem in maintaining a spreadsheet model is usually updating data, so make sure that you keep the data in predictable areas (colour-coding the tabs of each sheet is a nice way). Also, avoid Excel's data analysis features that dump the results of a data analysis as fixed values into a sheet. I think this is dreadful programming. Software like @RISK and Crystal Ball, which fit distributions to data, can be "hot-linked" to a dataset, which is a much better method than just exporting the fitted parameters if you think the dataset may be altered at some point. ModelRisk has a huge range of "hot-linking" fit functions that will return fitted parameters or random numbers for copulas, time series and distributions. You can sometimes replicate the same idea quite easily. For example, to fit a normal distribution one need only determine the mean and standard deviation of the dataset if the data are random samples, so using Excel's AVERAGE and STDEV functions on the dataset will automatically update a distribution fit. Sometimes you need to run Solver, e.g. to use maximum likelihood methods to fit a gamma distribution, so make a macro with a button that will perform that operation (see, for example, Figure 7.3).

Figure 7.3 Spreadsheet with automation to run Solver.
The button runs the following macro, which asks the user for the data array, runs Solver against a temporary sheet containing the likelihood calculation, and finally asks the user where to place the results (cells D3:E3 in this case):

    Private Sub CommandButton1_Click()
        On Error Resume Next
        Dim DataRange As Excel.Range
        Dim n As Long, Mean As Double, Var As Double
        Dim Alpha As Double, Beta As Double
        '---------------- Selecting input data ----------------
    1   Set DataRange = Application.InputBox("Select one-dimensional input data array", _
            "Data", Selection.Address, , , , , 8)
        If DataRange Is Nothing Then Exit Sub
        n = DataRange.Cells.Count
        '------------------- Error messages -------------------
        If n < 2 Then MsgBox "Please enter at least two data values": GoTo 1
        If DataRange.Columns.Count > 1 And DataRange.Rows.Count > 1 Then _
            MsgBox "Selected data is not one-dimensional": GoTo 1
        If Application.WorksheetFunction.Min(DataRange.Value) <= 0 Then _
            MsgBox "Input data must be non-negative": GoTo 1
        Sheets.Add Sheets(1)                    ' adding a temporary sheet
        '----- Pasting input data into the temporary sheet -----
        If DataRange.Columns.Count > 1 Then
            Sheets(1).Range("A1:A" & n).Value = _
                Application.WorksheetFunction.Transpose(DataRange.Value)
        Else
            Sheets(1).Range("A1:A" & n).Value = DataRange.Value
        End If
        Mean = Application.WorksheetFunction.Average(Sheets(1).Range("A1:A" & n)) ' mean of data
        Var = Application.WorksheetFunction.Var(Sheets(1).Range("A1:A" & n))      ' variance of data
        Alpha = Mean ^ 2 / Var                  ' best-guess estimate for Alpha
        Beta = Var / Mean                       ' best-guess estimate for Beta
        '------- Setting initial values for the Solver -------
        Sheets(1).Range("D1").Value = Alpha
        Sheets(1).Range("E1").Value = Beta
        '-------- Setting the LogLikelihood function --------
        Sheets(1).Range("B1:B" & n).Formula = "=LOG10(GAMMADIST(A1,$D$1,$E$1,0))"
        '--------- Setting the objective function ---------
        Sheets(1).Range("G1").Formula = "=SUM(B1:B" & n & ")"
        '---------------- Launching the Solver ----------------
        SOLVER.SolverReset
        SOLVER.SolverOk SetCell:=Sheets(1).Range("G1"), MaxMinVal:=1, _
            ByChange:=Sheets(1).Range("D1:E1")
        SolverAdd CellRef:="$D$1", Relation:=3, FormulaText:="0.000000000000001"
        SolverAdd CellRef:="$E$1", Relation:=3, FormulaText:="0.000000000000001"
        SOLVER.SolverSolve UserFinish:=True
        SOLVER.SolverFinish KeepFinal:=1
        '------------- Remembering output values -------------
        Alpha = Sheets(1).Range("D1").Value
        Beta = Sheets(1).Range("E1").Value
        '----------- Deleting the temporary sheet -----------
        Application.DisplayAlerts = False
        Sheets(1).Delete
        Application.DisplayAlerts = True
        '------------ Selecting output location ------------
    2   Set DataRange = Application.InputBox("Select 2x1 output location", "Output", _
            Selection.Address, , , , , 8)
        If DataRange Is Nothing Then Exit Sub
        n = DataRange.Cells.Count
        If n < 2 Then MsgBox "Enter at least two data values": GoTo 2
        '----- Pasting outputs into the selected range -----
        DataRange.Cells(1, 1) = Alpha
        If DataRange.Columns.Count = 2 Then
            DataRange.Cells(1, 2) = Beta
        Else
            DataRange.Cells(2, 1) = Beta
        End If
    End Sub

A minimum limit of 0.000000000000001 is placed on alpha and beta to avoid errors, and LOG10(...) is used around the GAMMADIST(...) functions because a log-likelihood behaves less dramatically and lets Solver find the solution more reliably. The moments-based estimates for alpha (= DataMean^2 / DataVariance) and beta (= DataVariance / DataMean) are used as starting values for Solver so that it will find the answer more quickly.

If a user needs to perform some operations prior to running a model, then write a description of what needs doing and why.
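The moment-matching starting values are easy to reproduce outside Excel. Below is a minimal sketch in Python rather than VBA, purely for brevity; the data values are made up for illustration:

```python
import statistics

# Hypothetical positive data sample, standing in for the user's selected range
data = [2.3, 4.1, 3.7, 5.9, 2.8, 4.4, 3.2, 6.1, 4.9, 3.5]

mean = statistics.mean(data)
var = statistics.variance(data)   # sample variance, like Excel's VAR

# Moment-matching starting values mirroring the macro:
# Alpha = Mean^2 / Var, Beta = Var / Mean
alpha0 = mean ** 2 / var
beta0 = var / mean
print(alpha0, beta0)
```

By construction a Gamma(alpha0, beta0) distribution reproduces the sample mean (alpha0 * beta0) and variance (alpha0 * beta0^2) exactly, which is why it makes a good starting point for the likelihood maximisation.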
These days we attach a help file to the model, which allows us to embed little videos, something that is very helpful; but at the very least try to embed or couple the model to a PDF file with screen captures of each step. In my experience, the other main reason a model can be hard to maintain is that it is complex and uses many different sources of data that go out of date. When you plan out a risk analysis (Chapters 3 and 4) for a model that will be used periodically, or that could take a long time to complete, consider whether there is a simpler model that will give answers that are pretty close in decision terms to those of the more complex model being planned. If the difference in accuracy is small, it may be balanced by the greater applicability that comes with updating the inputs more frequently.

7.3.3 Smallest file size

Megaformulae reduce the file size considerably. Maintaining large datasets in your model will also increase the file size: it is better to do the analysis outside the spreadsheet and copy across the results. Sometimes large datasets or calculation arrays are used to construct distributions (e.g. fitting first- or second-order non-parametric distributions to data, constructing Bayesian posterior distributions and bootstrap analysis). Replacing these calculations with a fitted distribution can have a marked effect on model size and speed. ModelRisk has been designed to maximise speed and minimise memory requirements. It has a large number of functions that will perform complex calculations in a single cell or small array. You might also be able to achieve some of the same effect in your models with VBA code, particularly if you need to perform iterative loops.

7.3.4 How many iterations of a model to run

You will often see risk analysis reports, or papers in journals, that show the results and tell you that they were based on 10 000 (or whatever) Latin hypercube (or whatever) iterations of the model. I suppose that may sometimes be useful to know, but not often.
The author is usually trying to communicate that the model was run long enough for the results to be stable. The problem is that, for one model trying to determine a mean, 500 iterations may be good enough; for another trying to determine a 99.9th percentile, 100 000 iterations might be needed. It also depends on how sensitive the decision question is to the output's accuracy. A frequent question that pops up in our courses is "how many iterations do I need to run?", and you can see there is no absolute answer to that. A short answer, burdened with many caveats, is "no less than 300" if you are interested in the entire output distribution. At 300 iterations you start to get a reasonably well-defined cumulative distribution, so you can approximately read off the 50th and 85th percentiles, for example, and the mean is pretty well determined for most output distributions. At the same time, if you export the generated values from two or more random variables in your model to produce scatter plots, 300 is really the minimum you need to get any sense of the patterns that they produce (i.e. their joint distribution). We usually have our models set to run 3000 iterations as a default (but obviously increase that figure if a particularly high level of accuracy is warranted), because we plot a great many scatter plots from generated data, and this is about the right number of points before the scatter plot gets clogged up, and certainly enough for all the percentiles and statistics to be well specified.

Figure 7.4 Comparison of cumulative distribution plots for 20 model runs each of 3000 and 300 Monte Carlo iterations for a well-behaved output (i.e. a nice smooth curve).

Figure 7.4 shows what type of variation you would typically get for a cumulative distribution between runs of 300 iterations and of 3000 iterations.
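The effect shown in Figure 7.4 is easy to reproduce: run the same model many times at each iteration count and compare how much an estimated percentile wanders between runs. A rough Python sketch, where the "model" is just an arbitrary skewed output chosen for illustration:

```python
import random
import statistics

def percentile_85(n_iter, seed):
    """Run a toy Monte Carlo model for n_iter iterations; return its 85th percentile."""
    rng = random.Random(seed)
    values = sorted(rng.lognormvariate(3, 0.5) for _ in range(n_iter))
    return values[int(0.85 * n_iter)]

# 20 independent runs at each iteration count, as in Figure 7.4
runs_300 = [percentile_85(300, seed) for seed in range(20)]
runs_3000 = [percentile_85(3000, seed + 100) for seed in range(20)]

spread_300 = statistics.stdev(runs_300)
spread_3000 = statistics.stdev(runs_3000)
# The percentile estimate is noticeably more stable at 3000 iterations
print(spread_300, spread_3000)
```

The between-run spread of the percentile estimate shrinks roughly with the square root of the iteration count, which is why the 3000-iteration curves in Figure 7.4 sit so much closer together.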
Since most models include an element of guesswork in the choice of model, distributions or parameter values, one should not usually be too concerned about exact precision in the Monte Carlo results, but you'll see that 300 iterations is probably the lowest level of accuracy you might find acceptable. Figure 7.5 shows the same input and output plotted together as a scatter plot for 300 and 3000 iterations. We find scatter plots to be a great, intuitive presentation of how, among other things, the input variability influences the output value. You'll see that the pattern is just about visible at 300 iterations, and just starting to get clogged up at 3000 iterations (of course, if you run more than 3000 iterations, you can plot a sample of just 3000 of them to keep the scatter plot clear). If the pattern were simpler, the left-hand panel of 300 iterations would of course be clearer.

In general, you'll have two opposing pressures: too few iterations and you get inaccurate outputs and graphs (particularly histogram plots) that look "scruffy"; too many iterations and it takes a long time to simulate, and it may take even longer to plot graphs, export and analyse data, etc., afterwards. Export the data into Excel and you may also come up against row limitations, and limitations on the number of points that can be plotted in a chart.

There will usually be one or more statistics in which you are interested from your model outputs, so it is quite natural to wish to run sufficient iterations to ensure a certain level of accuracy. Typically, that accuracy can be described in the following way: "I need the statistic Z to be accurate to within ±δ with confidence α". I will show you how you can determine the number of iterations you need to run to get some specified level of accuracy for the most common statistics: the mean and cumulative probabilities. The example models let you monitor the level of accuracy in real time.
Note that all these models assume that you are using Monte Carlo sampling. They will therefore somewhat overestimate the number of iterations you'll need if you are using Latin hypercube sampling (which we recommend, in general). That said, in practice Latin hypercube sampling will only offer a useful improvement when a model is linear, or when there are very few distributions in the model.

Figure 7.5 Comparison of scatter plots for model runs of 3000 and 300 Monte Carlo iterations.

Iterations to run to get sufficient accuracy for the mean

Monte Carlo simulation estimates the true mean μ of the output distribution by summing all of the generated values xi and dividing by the number of iterations n:

    x̄ = (1/n) Σ xi

If Monte Carlo sampling is used, each xi is iid (an independent, identically distributed random variable). The central limit theorem then says that the distribution of the estimate of the true mean is (asymptotically) given by

    x̄ = Normal(μ, σ/√n)    (7.1)

where σ is the true standard deviation of the model's output. Using a statistical principle called the pivotal method, we can rearrange this equation to make it an equation for μ:

    μ = Normal(x̄, σ/√n)

Figure 7.6 shows the cumulative form of the normal distribution of Equation (7.1).

Figure 7.6 Cumulative distribution plot for the normal distribution of Equation (7.1).

Specifying the level of confidence α we require for our mean estimate to be within ±δ translates into a relationship between δ, σ and n. More formally, this relationship is

    δ = Φ⁻¹((1 + α)/2) σ/√n    (7.2)

where Φ⁻¹(·) is the inverse of the standard normal cumulative distribution function. Rearranging Equation (7.2) and recognising that we want at least this accuracy gives a minimum value for n:

    n ≥ (Φ⁻¹((1 + α)/2) σ/δ)²    (7.3)

We have one problem left: we don't know the true output standard deviation σ.
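The minimum-n formula just derived is simple enough to evaluate directly. A Python sketch, using the standard library's inverse normal CDF in place of Excel's NORMSINV (the σ value below is the estimate used in the Figure 7.7 example; the text explains next how such an estimate is obtained):

```python
import math
from statistics import NormalDist

def iterations_for_mean(sigma, delta, alpha):
    """Minimum Monte Carlo iterations so the estimated mean is within
    +/- delta of the true mean with confidence alpha."""
    z = NormalDist().inv_cdf((1 + alpha) / 2)   # Excel: NORMSINV((1+alpha)/2)
    return math.ceil((z * sigma / delta) ** 2)

# Output standard deviation estimated at 3.921402, required accuracy
# +/-0.01 about the mean, with 90% confidence
n = iterations_for_mean(3.921402, 0.01, 0.90)
print(n)
```

Note the quadratic cost of precision: halving δ quadruples the number of iterations required.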
It turns out that we can estimate this perfectly well for our purposes by taking the standard deviation of the first few (say 50) iterations. The model in Figure 7.7 shows how you can do this continuously, using Excel's function NORMSINV to return values for Φ⁻¹(·). With an estimated output standard deviation of 3.921402, a required accuracy of ±0.01 about the mean and a confidence of 90 %, the model calculates that some 416 042 iterations remain to be run. The key cell formulae (delta in E3, alpha in H3) are:

Crystal Ball:
    E6:  =CB.GetForeStatFN(D2,5)
    D7:  =IF((E6*NORMSINV((1+H3)/2)/E3)^2-CB.IterationsFN()>0,
         ROUNDUP((E6*NORMSINV((1+H3)/2)/E3)^2-CB.IterationsFN(),0),
         "Sufficient accuracy achieved")

@RISK:
    E10: =RiskStdDev(D2)
    D11: =IF((E10*NORMSINV((1+H3)/2)/E3)^2-RiskCurrentIter()>0,
         ROUNDUP((E10*NORMSINV((1+H3)/2)/E3)^2-RiskCurrentIter(),0),
         "Sufficient accuracy achieved")

Figure 7.7 Models in @RISK and Crystal Ball to monitor whether the simulation mean has reached a required accuracy.

If you name cell D7 or D11 as an output, together with any other model outputs you are actually interested in, and select the "Pause on Error in Outputs" option in your host Monte Carlo add-in, it will automatically stop simulating when the required accuracy is achieved, because the cell returns the text "Sufficient accuracy achieved" instead of a number.

Iterations to run to get sufficient accuracy for the cumulative probability F(x) associated with a particular value x

Percentiles closer to the 50th percentile of an output distribution will reach a stable value far more quickly than percentiles towards the tails.
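The analogous minimum-n calculation for a cumulative probability F(x), derived in what follows, replaces σ with √(p̂(1−p̂)). A Python sketch of the resulting formula; the confidence level and tolerances are illustrative choices:

```python
import math
from statistics import NormalDist

def iterations_for_percentile(p_hat, delta, alpha):
    """Minimum iterations so the estimated cumulative probability F(x)
    is within +/- delta of the truth with confidence alpha."""
    z = NormalDist().inv_cdf((1 + alpha) / 2)
    return math.ceil(p_hat * (1 - p_hat) * (z / delta) ** 2)

# F(x) near the 80th percentile, to within +/-0.01 with 95% confidence
print(iterations_for_percentile(0.80, 0.01, 0.95))

# Pinning down a 99.9th percentile needs a much tighter delta to be
# meaningful, and hence far more iterations
print(iterations_for_percentile(0.999, 0.0001, 0.95))
```

Although p̂(1−p̂) is small in the tails, the tolerance δ must shrink with it, which is why tail percentiles dominate the iteration budget.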
On the other hand, we are often most interested in what is going on in the tails, because that is where the risks and opportunities lie. For example, Basel II and credit rating agencies often require that the 99.9th percentile or greater be accurately determined. The following technique shows how you can ensure that you have the required level of accuracy for the percentile associated with a particular value.

Your Monte Carlo add-in will estimate the cumulative percentile F(x) of the output distribution associated with a value x by determining what fraction of the iterations fell at or below x. Imagine that x is actually the 80th percentile of the true output distribution. Then, for Monte Carlo simulation, the generated value in each iteration independently has an 80 % probability of falling below x: it is a binomial process with probability p = 80 %. Thus, if so far we have had n iterations and s have fallen at or below x, the distribution Beta(s + 1, n − s + 1) describes the uncertainty associated with the true cumulative percentile we should associate with x (see Section 8.2.3). When we are estimating a percentile close to the median of the distribution, or when we are performing a large number of iterations, s and n will both be large, and we can use a normal approximation to the beta distribution:

    Beta(s + 1, n − s + 1) ≈ Normal(p̂, √(p̂(1 − p̂)/n))

where p̂ = s/n is the best-guess estimate for F(x). Thus, we can produce a relationship similar to that in Equation (7.2) for determining the number of iterations needed to get the required precision:

    δ = Φ⁻¹((1 + α)/2) √(p̂(1 − p̂)/n)    (7.4)

Rearranging Equation (7.4) and recognising that we want at least this accuracy gives a minimum value for n:

    n ≥ p̂(1 − p̂) (Φ⁻¹((1 + α)/2)/δ)²

A model can now be written in a very similar fashion to Figure 7.7.

7.4 Most Common Modelling Errors

This section describes, and provides examples for, the three most common mistakes we come across in auditing risk models, even at the more elementary level.
These mistakes probably constitute around 90 % of the errors we see. I strongly recommend studying them and going through the examples thoroughly:

Common error 1. Calculating means instead of simulating scenarios.
Common error 2. Representing an uncertain variable more than once in a model.
Common error 3. Manipulating probability distributions as if they were fixed numbers.

Common error 1: calculating means instead of simulating scenarios

When we first start thinking about risk, it is quite natural to want to convert the impact of a risk into a single number. For example, we might consider that there is a 20 % chance of losing a contract, which would result in a loss of income of $100 000. Put together, a person might reason that to be a risk of some $20 000 (i.e. 20 % * $100 000). This $20 000 figure is known as the "expected value" of the variable. It is the probability-weighted average of all possible outcomes. So, with the two outcomes $100 000 at 20 % probability and $0 at 80 % probability:

    Mean risk (expected value) = 0.2 * $100 000 + 0.8 * $0 = $20 000

Calculating the expected values of risks might also seem a reasonable and simple method for comparing risks. For example, in Table 7.1, risks A to J are ranked in descending order of expected cost.

Table 7.1 A list of probabilities and impacts for 10 risks. (Risk A: probability 0.25, impact $400 000, expected impact $100 000; total expected impact across all 10 risks: $367 000.)

If a loss of $500 000 or more would ruin your company, you may well rank the risks differently: risks C, D, I and, to a lesser extent, J pose a survival threat to your company. Note also that you may value the impact of risk D as no more severe than that of risk C because, if either of them occurs, your company has gone bust. On the other hand, if risk A occurs, giving you a loss of $400k, you are precariously close to ruin: it would take just one more of the risks, except F and H (unless they both occurred), to push you under.
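The contrast between a single expected-value figure and the chance of ruin is easy to demonstrate by simulation. In the sketch below the probabilities and impacts are invented stand-ins (only risk A's figures match Table 7.1; the rest are chosen so the total expected impact comes out in the same region):

```python
import random

# Illustrative risks (probability, impact in $000). Risk A matches Table 7.1;
# the other nine are invented for this example.
risks = [(0.25, 400), (0.10, 500), (0.05, 600), (0.30, 100),
         (0.50, 50), (0.20, 150), (0.10, 300), (0.40, 80),
         (0.05, 550), (0.15, 200)]

expected_total = sum(p * impact for p, impact in risks)

rng = random.Random(42)
n_iter = 20000
ruin = 0
for _ in range(n_iter):
    # One scenario: each risk independently occurs or not
    total = sum(impact for p, impact in risks if rng.random() < p)
    if total >= 500:            # the $500k ruin threshold from the text
        ruin += 1

print(expected_total)           # the single "expected impact" number
print(ruin / n_iter)            # the probability of a ruinous year
```

The expected total is well below the ruin threshold, yet the simulation shows a substantial probability of crossing it: exactly the information the expected-value summary throws away.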
Looking at the sum of the expected values gives you no appreciation of how close you are to ruin. Figure 7.8 plots the distribution of possible outcomes for this set of risks.

Figure 7.8 Probability distribution of total impact from risks A to J.

From a risk analysis point of view, by representing the impact of a risk by its expected value we have removed the uncertainty (i.e. we can't see the breadth of different outcomes), which is a fundamental reason for doing risk analysis in the first place. You might think that people running Monte Carlo simulations would be more attuned to describing risks with distributions rather than single values, but this is nonetheless one of the most common errors.

Another, slightly more disguised, example of the same error is when the impact is uncertain. For example, let's imagine that there will be an election this year and that two parties are running: the Socialist Democrats Party and the Democratic Socialists Party. The SDP are currently in power and have vowed to keep the corporate tax rate at 17 % if they win the election. Political analysts reckon they have about a 65 % chance of staying in power. The DSP promise to lower the corporate tax rate by 1-4 %, most probably 3 %. We might choose to express next year's corporate tax rate as

    Rate = 0.35 * VosePERT(13 %, 14 %, 16 %) + 0.65 * 17 %

Checking the formula by simulating, we'd get a probability distribution that could give us some comfort that we've assigned uncertainty properly to this parameter. However, a correct model would draw a value of 17 % with a probability of 0.65 and a random value from the PERT distribution with a probability of 0.35.
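The difference between the two formulations is easy to see in simulation. A sketch using a plain triangular draw as a stand-in for VosePERT (an approximation, but it keeps the point intact):

```python
import random

rng = random.Random(7)
n_iter = 10000

def dsp_rate():
    # Stand-in for VosePERT(13%, 14%, 16%): triangular(low, high, mode)
    return rng.triangular(0.13, 0.16, 0.14)

# Wrong: blending the two outcomes into one weighted-average rate
wrong = [0.35 * dsp_rate() + 0.65 * 0.17 for _ in range(n_iter)]

# Right: draw the scenario first, then the rate within that scenario
right = [0.17 if rng.random() < 0.65 else dsp_rate() for _ in range(n_iter)]

# Both formulations have roughly the same mean...
print(sum(wrong) / n_iter, sum(right) / n_iter)
# ...but the wrong model never produces either real outcome: every value
# falls in a narrow band around 16%, while the right model produces 17%
# about 65% of the time and a 13-16% value otherwise.
print(min(wrong), max(wrong))
```

The means agree, which is exactly why the error is so easy to miss when eyeballing summary statistics rather than the distribution itself.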
Common error 2: representing an uncertain variable more than once in a model

When we develop a large spreadsheet model, perhaps with several linked sheets in the same file, it is often convenient to have parameter values that are used in several sheets appear in each of those sheets. This makes it quicker to write formulae and trace back precedents in a formula. Even in a deterministic model (i.e. a model where there are only best-guess values, not distributions), it is important that there is only one place in the model where a parameter value can be changed (at Vose Consulting we use the convention that all changeable input parameter values or distributions are labelled blue). There are two reasons: first, it is easier to update the model with new parameter values; second, it avoids the potential mistake of changing the parameter value in only some of the cells in which it appears, forgetting the others, and thereby having a model that is internally inconsistent. For example, a model could have a parameter "Cargo (mt)" with a value of 10 000 in sheet 1 and a value of 12 000 in sheet 2. It becomes even more important to maintain this discipline in a Monte Carlo model if that parameter is modelled with a distribution. Although each cell in the model might carry the same probability distribution, left unchecked each distribution will generate a different value for the parameter in the same iteration, thus rendering the generated scenario impossible.
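The inconsistency, and its cure, can be sketched outside a spreadsheet. Two independent draws of the "same" distribution disagree within an iteration, while driving both cells from one shared Uniform(0, 1) draw by inverse transform keeps them identical (this is my assumption about the mechanism behind ModelRisk's U-parameter, described next):

```python
import random
from statistics import NormalDist

rng = random.Random(1)

# Two cells that each contain "Normal(100, 10)" but sample independently:
# in any given iteration they disagree, making the scenario inconsistent.
indep_1 = NormalDist(100, 10).inv_cdf(rng.random())
indep_2 = NormalDist(100, 10).inv_cdf(rng.random())
print(indep_1, indep_2)   # two different values for the "same" parameter

# The fix: both cells invert the same CDF at one shared uniform draw
u = rng.random()
cell_a1 = NormalDist(100, 10).inv_cdf(u)
cell_a2 = NormalDist(100, 10).inv_cdf(u)
print(cell_a1, cell_a2)   # identical in every iteration
```

Because the inverse CDF is a deterministic function, feeding both cells the same u guarantees the same generated value, iteration after iteration.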
If it really is important to you to have the probability distribution formula in each cell where the parameter features (perhaps because you wish to see what distribution equation was used without having to switch to the source sheet), you can make use of the U-parameter in ModelRisk's simulation functions to ensure that the same value is generated in each place:

    Cell A1 := VoseNormal(100, 10, Random1)
    Cell A2 := VoseNormal(100, 10, Random1)

where Random1 is a Uniform(0, 1) distribution placed somewhere in the model. You can achieve the same thing using a 100 % rank order correlation in @RISK or Crystal Ball, for example, but this will only work while the simulation is running, because rank order correlation generates a set of values before a simulation run and orders them; when you look at the model stepping through some scenarios, they won't match.

The error described so far is where the formula for the distribution of a random variable features in more than one cell of a spreadsheet model. These errors are quite easy to spot. Another form of the same error is where two or more distributions incorporate the same random variable in some way. For example, consider the following problem. A company is considering restructuring its operations, with the inevitable layoffs, and wishes to analyse how much it would save in the process. Looking at just the office space component, a consultant estimates that, if the company were to make the maximum number of redundancies and outsource some of its operations, it would save $PERT(1.1, 1.3, 1.6)M of office space costs. On the other hand, by just making the redundancies in the accounting section and outsourcing that activity, it could save $PERT(0.4, 0.5, 0.9)M of office space costs. It would be quite natural, at first sight, to put these two distributions into a model and run a simulation to determine the savings for the two redundancy options. On their own, each cost saving distribution would be valid.
We might also decide to calculate in a spreadsheet cell the difference between the two savings, and here we would potentially be making a big mistake. Why? Well, what if there is an uncertain component that is common to both office cost savings? For example, what if inside these cost distributions there is the cost of getting out of a current lease contract, uncertain because negotiations would need to take place? The problem is that, by sampling from these two distributions independently, we are not recognising the common element, which is a problem if that common element is not a fixed value, because it induces some level of correlation. The takeaway message from this example is: consider whether two or more uncertain parameters in your model share a common element in some way. If they do, you will need to separate out that common element and thereby allow it to appear just once in your model.

Common error 3: manipulating probability distributions as if they were fixed numbers

At school we learn simple arithmetic. Later, when we take algebra, we learn

    A + B = C therefore C − A = B
    D * E = F therefore F / D = E

The problem is that these trusted rules do not apply so universally when manipulating random variables. This section explains how and when these simple algebraic rules no longer work, shows you how to identify such situations in your model, and how to make the appropriate corrections.

An example

Most deterministic spreadsheet models consist of linked formulae that contain nothing more complicated than simple operations like +, −, * and /. When we decide to start adding uncertainty to the values of the components in the model, it seems natural enough simply to replace a fixed value with a probability distribution describing our uncertainty.
So, for example, consider the simple model for a company offering some credit service:

    Money borrowed by a client M:   €10 000
    Number of clients n:            6500
    Interest rate per annum r:      7.5 %
    Yearly revenue:                 M * n * r = €4 875 000

The model can now be "risked":

    Money borrowed by a client M:   Lognormal(€10 000, €4000)
    Number of clients n:            PERT(6638, 6500, 8200)
    Interest rate per annum r:      7.5 %
    Yearly revenue:                 M * n * r

The best-guess estimates of the money borrowed by a client and of the number of clients have been replaced by distributions, but the model is otherwise unchanged. This model is probably very wrong. The error is most easily seen by watching random values being generated on screen. Look at the values that are being used for the entire client base and compare with where these values sit on the Lognormal(10 000, 4000) distribution. For example, the Lognormal(10 000, 4000) distribution has 10 % of its probability below €5670. Thus, in 10 % of its iterations it will generate a value below this figure, and that value will be used for all customers. The lognormal distribution undoubtedly reflects the variability that is expected between customers (perhaps, for example, it was fit to a relevant dataset of amounts individual customers have previously borrowed). The probability that two randomly selected customers will both borrow less than €5670 is 10 % * 10 % = 1 %. The probability that all (say) 6500 customers borrow less than €5670, if the amounts they borrow are independent, is 0.1^6500, i.e. effectively impossible, yet our model gives it a 10 % probability.

In order to model this problem correctly, we need to consider what the sources of uncertainty about the amount a customer borrows are. If the source is specific to each individual client, then the amounts can be considered independent and the techniques of Chapter 11 should be applied.
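You can quantify how wrong the single-draw model is by comparing it with a per-customer simulation. A rough sketch follows; to keep the pure-Python run time reasonable the client base is scaled down to 500, and since random.lognormvariate is parameterised by the underlying normal's μ and σ, those values are back-calculated to give a mean of about 10 000 and a standard deviation of about 4000:

```python
import random
import statistics

rng = random.Random(3)
n_clients = 500         # scaled down from 6500 purely for speed
n_iter = 1000
r = 0.075

# Underlying normal parameters giving mean ~10 000 and sd ~4000
mu, sigma = 9.13613, 0.38525

# Wrong model: one draw reused for every client in an iteration
wrong_rev = [n_clients * r * rng.lognormvariate(mu, sigma) for _ in range(n_iter)]

# Independent-clients model: sum of individual draws
right_rev = [r * sum(rng.lognormvariate(mu, sigma) for _ in range(n_clients))
             for _ in range(n_iter)]

# Similar means, but the single-draw model wildly exaggerates the spread
print(statistics.stdev(wrong_rev), statistics.stdev(right_rev))
```

Both models have the same expected revenue; the single-draw version simply scales one customer's variability up by the whole client base, inflating the spread by roughly √n_clients.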
If there is some systematic influence (like the state of the economy, recent bad press for companies offering credit, etc.), it will have to be separated out from the individual, independent component.

Let's look at another example. The sum of two independent Uniform(0, 1) distributions is... what do you think? The answer often surprises people. It is hard to imagine a simpler problem, yet when we canvass a class we get quite a range of answers. Perhaps a Uniform(0, 2)? That's the most common response. Or something looking a little normal? The answer is a Triangle(0, 1, 2), so we could write

    U(0, 1) + U(0, 1) = T(0, 1, 2)

The first message in this example is that it is difficult for a person not very well versed in risk analysis modelling to predict the results of even the most trivial model. Of course, that makes it very hard to check the model and be comfortable about its results. On to the next question we often pose our class:

    T(0, 1, 2) − U(0, 1) = ?

Now wise to the trickiness of the question, most class participants are pretty sure that their first guess (i.e. = U(0, 1)) is wrong but don't have anything else to suggest. The answer is a symmetric distribution that looks a little normal, stretching from −1 to 2 with a peak at 0.5. But why isn't it U(0, 1)? An easy way to visualise this is to run a simulation adding two Uniform(0, 1) distributions and plotting the generated values from one uniform distribution together with the calculated sum of them both. You get a scatter plot that looks like that in Figure 7.9.

Figure 7.9 A plot of random samples of C against A, where A = U(0, 1), B = U(0, 1) and C = A + B.

The line y = x shows the lowest value of the sum C for any given value of the uniform distribution A, and the line y = 1 + x is the highest value, which makes intuitive sense. The uniform vertical distribution of points between these two lines is the effect of the second Uniform(0, 1) distribution B. Also note that all the generated values lie uniformly (but randomly) between these two lines. This is actually quite helpful in visualising why the sum of two Uniform(0, 1) distributions is a Triangle(0, 1, 2): project all the dots onto the y axis. Can you extend this graph to work out graphically what U(0, 1) + U(0, 3) would look like?

The point of the graph is to show you that there is a strong dependency pattern between these two distributions (a uniform and the triangle sum), which would need to be taken into account if one wished to extract the uniform distributions back out of each other. For example, the formula below does just that:

    B := VoseUniform(IF(A < 1, 0, A − 1), IF(A > 1, 1, A))
The uniform vertical distribution of points between these two lines is the effect of the second Uniform(0, 1) distribution B. Also note that all the generated values lie uniformly (but randomly) between these two lines. This actually is quite helpful in visualising why the sum of two Uniform(0, 1) distributions is a Triangle(0, 1, 2) by projecting all the dots onto the y axis. Can you extend this graph to work out graphically what U(0, 1) U(O,3) would look like? The point of the graph is to show you that there is a strong dependency pattern between these two distributions (a uniform and the triangle sum), which would need to be taken into account if one wished to extract back out the two uniform distributions from each other. For example, the formulae below do just that: + + B := VoseUniform(IF(A < 1,O, A - l ) , IF(A > 1, 1, A)) Chapter 7 Building and running a model 165 Try to follow the logic for the formula for B from the graph. B will generate a Uniform(0, 1) distribution with the right dependency relationship with A to leave C a Uniform(0, 1) distribution too. To recap, the problem is that we have three variables linked together as follows: We know the distributions for A and C. How do we find B and how do we simulate A , B and C all together? The simple example above using two uniform distributions allows us to simulate A , B and C all together, but only because we assumed A and B were independent and the problem was very simple. In general, we cannot correctly determine B, so we need either to construct a model that avoids having to perform such a calculation or admit that we have insufficient information to specify B. 
Chapter 8 Some basic random processes

8.1 Introduction

If you want to get the most out of the risk analysis and statistical modelling tools that are available, you really need to understand the conceptual thinking behind random processes and the equations and distributions that result, and be able to identify where these random processes occur in the real world. In this chapter we look at the binomial, Poisson and hypergeometric processes first, because they share a common basis, and a very great many risk analysis problems can be tackled with a good knowledge of just these three processes. I've added the central limit theorem here too because it explains a lot about the behaviour of distributions. We'll look at the theory and assumptions behind each process, and the distributions that are used in their modelling. This approach provides us with an excellent opportunity to become very familiar with a number of important distributions, and to see the relationships between them, even between the distributions of the different random processes. Then we'll look at some extensions to these processes that greatly increase their range of applications. Finally, we look at a number of problems. There are a number of other random processes discussed in this book relating to the sums of random variables (Chapter 11), time series modelling (Chapter 12) and correlated variables (Chapter 13). Chapter 9 on statistics relies heavily on an understanding of the random processes described here.

8.2 The Binomial Process

A binomial process is a random counting system where there are n independent identical trials, each one of which has the same probability of success p, and which produces s successes from those n trials (where 0 ≤ s ≤ n and n > 0, obviously). There are thus three quantities {n, p, s} that between them completely describe a binomial process.
Associated with each of these three quantities are three distributions that describe the uncertainty about, or variability of, these quantities. Each distribution requires knowledge of the other two quantities in order to estimate the third. The simplest example of a binomial process is the toss of a coin. If we define "heads" as a success, each toss has the same probability of success p (0.5 for a fair coin). Then, for a given number of trials n (tosses of a coin), the number of successes will be s (the number of "heads"). Each trial can be thought of as a random variable that returns either a 1 with probability p or a 0 with probability (1 − p). Such a trial is often known as a Bernoulli trial, and the probability (1 − p) is often given the label q.

8.2.1 Number of successes in n trials

We start our exploration of the binomial process by looking at the probability of a certain number of successes s for a given number of trials n and probability of success p. Imagine we have one toss of a coin. The two outcomes are "heads" (H) with probability p and "tails" (T) with probability (1 − p), as shown in the event tree of Figure 8.1(a).

Figure 8.1 Event trees for the tossing of (a) one coin and (b) two coins.

If we have two tosses of a coin there are four possible outcomes, as shown in Figure 8.1(b), namely HH, HT, TH and TT, where HT means "heads" followed by "tails", etc. These outcomes have probabilities p², p(1 − p), (1 − p)p and (1 − p)² respectively. If we are tossing a fair coin (i.e. p = 0.5), then each of the four outcomes has the same probability of 0.25. Now, the binomial process considers each success to be identical and therefore does not differentiate between the two events HT and TH: they are both just one success in two trials. The probability of one success in two trials is then just 2p(1 − p) or, for a fair coin, 0.5.
The 2 in this equation is the number of different paths that result in one success in two trials. Now imagine that we toss a coin three times. The eight outcomes are: HHH, HHT, HTH, HTT, THH, THT, TTH and TTT. Thus, one event produces three "heads", three events produce two "heads", three events produce one "head" and one event produces no "heads" in three coin tosses. In general, the number of ways that we can get s successes from n trials can be calculated directly using the binomial coefficient nCs, which is given by

    nCs = n! / (s!(n − s)!)

We can check this is right by choosing n = 3 (remembering that 0! = 1):

    3C0 = 3!/(0!3!) = 1,  3C1 = 3!/(1!2!) = 3,  3C2 = 3!/(2!1!) = 3,  3C3 = 3!/(3!0!) = 1

which match the numbers of combinations we have already counted. Each of the ways of getting x successes in n trials has the same probability, namely p^x (1 − p)^(n−x), so the probability of observing x successes in n trials is given by

    p(x) = nCx p^x (1 − p)^(n−x)

which is the probability mass function of the Binomial(n, p) distribution. In other words, the number of successes s one will observe in n trials, where each trial has the same probability of success p, is given by

    s = Binomial(n, p)

Figure 8.2 shows this distribution for four different combinations of n and p. The binomial distribution was first derived by Bernoulli (1713).

Figure 8.2 Examples of the binomial distribution.

8.2.2 Number of trials needed to achieve s successes

We have seen how the binomial distribution allows us to model the number of successes that will occur in n trials where we know the probability of success p. Sometimes we know how many successes we wish to have, we know the probability p, and we would like to know the number of trials we will have to complete in order to achieve the s successes, assuming we stop once the sth success has occurred. In this case, n is the random variable. Now that we have the binomial distribution, we can readily determine the distribution for n.
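Before moving on, the path counts and the binomial pmf from Section 8.2.1 can be checked in a few lines of Python, with math.comb computing nCs:

```python
import math

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n trials with success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Path counts for three coin tosses: 1, 3, 3, 1
print([math.comb(3, s) for s in range(4)])

# One head in two fair tosses: 2p(1-p) = 0.5
print(binomial_pmf(1, 2, 0.5))

# The pmf sums to 1 over all possible numbers of successes
print(sum(binomial_pmf(x, 10, 0.3) for x in range(11)))
```

The last line is a useful sanity check for any pmf you implement by hand: summing over the whole support must give 1.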
Let x be the total number of failures. The total number of trials we will execute is then (s + x), and by the (s + x - 1)th trial we must have observed (s - 1) successes and x failures (since the very last trial is, by assumption, a success). The probability of (s - 1) successes in (s + x - 1) trials is given immediately by the binomial distribution as

    P(s - 1 successes in s + x - 1 trials) = C(s + x - 1, s - 1) * p^(s-1) * (1 - p)^x

The probability of this being followed by a success is the same equation multiplied by p, i.e.

    p(x) = C(s + x - 1, s - 1) * p^s * (1 - p)^x

which is the probability mass function of the negative binomial distribution NegBin(s, p). In other words, the NegBin(s, p) distribution returns the number of failures one will have before observing s successes. The total number of trials n is thus given by

    n = s + NegBin(s, p)

Figure 8.3 shows various negative binomial distributions. If s = 1, then the distribution (known as the geometric distribution) is very right skewed and p(0) = p, i.e. the probability that there will be zero failures equals p, the probability that the first trial is a success. We can also see that, as s gets larger, the distribution looks more like a normal distribution. In fact, it is common to approximate the negative binomial distribution with a normal distribution when s is large, in order to avoid calculating the large factorials in p(x) above.

[Figure 8.3 Examples of the negative binomial distribution: NegBin(1, 0.5), NegBin(3, 0.5) and NegBin(100, 0.95).]

A negative binomial distribution shifted k values along the domain is sometimes called a binomial waiting time distribution, or a Pascal distribution.
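The negative binomial pmf can be sketched the same way (an illustrative transcription, not book material):

```python
from math import comb

# Negative binomial pmf from the text: the probability of exactly x
# failures before the s-th success.
def negbin_pmf(x: int, s: int, p: float) -> float:
    return comb(s + x - 1, s - 1) * p**s * (1 - p)**x

# For s = 1 (the geometric case) the text notes p(0) = p:
print(negbin_pmf(0, 1, 0.5))  # 0.5
# And the probabilities over x = 0, 1, 2, ... sum to 1:
print(round(sum(negbin_pmf(x, 3, 0.5) for x in range(200)), 6))  # 1.0
```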
8.2.3 Estimate of probability of success p

These results for the binomial and negative binomial distributions both model variability: that is to say, they return probability distributions of possible future outcomes. At times, however, we are looking back at the results of a binomial process and wish to determine one of its parameters. For example, we may have observed n trials of which s were successes, and from that information we would like to estimate p. This binomial probability is a fundamental property of the stochastic system and can never be observed, but we can become progressively more certain about its true value by collecting data. As we shall see in Section 9.2.2, we can readily quantify our uncertainty about the true value of p by using a beta distribution. In brief, if we have no prior information about p, or do not wish to assume any, then it is quite natural to use a uniform prior for p, and, through Bayes' theorem, we obtain a posterior density proportional to p^s * (1 - p)^(n-s), which is just the Beta(s + 1, n - s + 1) distribution, so

    p = Beta(s + 1, n - s + 1)

The beta distribution can also be used in the event that we have an informed opinion about the value of p prior to collecting data. In such cases, provided we can reasonably model our prior opinion about p with a beta distribution of the form Beta(a, b), the posterior turns out to be a Beta(a + s, b + n - s) distribution, because the beta distribution is conjugate to the binomial distribution (see Section III.7.1). Figure 8.4 illustrates a number of beta distributions.

8.2.4 Estimate of the number of trials n that were completed

Consider the situation where we have observed s successes and know the probability of success p, but would like to know how many trials were actually done to have observed those successes. We wish to estimate a value that is fixed, so we require a distribution that represents our uncertainty about what the true value is.
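A simulation sketch of this beta posterior (the counts s and n are made up for illustration):

```python
import random

# Uncertainty about p after observing s successes in n trials, with a
# uniform prior: p = Beta(s + 1, n - s + 1). Counts are illustrative.
random.seed(1)
s, n = 7, 20
draws = [random.betavariate(s + 1, n - s + 1) for _ in range(100_000)]

# The mean of Beta(a, b) is a/(a + b), so the posterior mean is
# (s + 1)/(n + 2) rather than the raw proportion s/n.
mean = sum(draws) / len(draws)
print(mean)  # close to 8/22 = 0.3636...
```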
There are two possible situations: we either know that the trials stopped on the sth success or we do not. If we know that the trials stopped on the sth success, we can model our uncertainty about the true value of n as

    n = s + NegBin(s, p)

If, on the other hand, we do not know that the last trial was a success (though it could have been), then our uncertainty about n is modelled as

    n = s + NegBin(s + 1, p)

Both of these formulae result from a Bayesian analysis with uniform priors. We will now derive these two results using standard Bayesian inference. The reader unfamiliar with this technique should refer to Section 9.2.

[Figure 8.4 Examples of the beta distribution, including Beta(2, 20) and Beta(30, 1), plotted against the binomial probability.]

Let x be the number of failures that occurred before the sth success. We will use a uniform prior for x, i.e. p(x) = c, and, from the binomial distribution, the likelihood function is the probability that at the (s + x - 1)th trial there had been (s - 1) successes and then the (s + x)th trial was a success, which is just the negative binomial probability mass function:

    l(X|x) = C(s + x - 1, s - 1) * p^s * (1 - p)^x

As we are using a uniform prior, and the equation for l(X|x) comes directly from a distribution and so must sum to unity, we can dispense with the formality of normalising the posterior distribution and observe

    p(x) = C(s + x - 1, s - 1) * p^s * (1 - p)^x

i.e. that x = NegBin(s, p). In the second case, we do not know that the last trial was a success, only that, in however many trials were completed, there were just s successes. We have the same uniform prior for the number of failures, but our likelihood function is just the binomial probability mass function, i.e.

    l(X|x) = C(s + x, s) * p^s * (1 - p)^x

As this does not have the form of a probability mass function of a distribution, we need to complete the Bayesian analysis, so

    p(x) = C(s + x, s) * p^s * (1 - p)^x / SUM[x=0 to infinity] C(s + x, s) * p^s * (1 - p)^x        (8.1)

The sum in the denominator equals 1/p.
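The claim that the denominator sums to 1/p can also be checked numerically; a quick sketch with arbitrary s and p:

```python
from math import comb

# Numerical check that SUM over x of C(s+x, s) * p**s * (1-p)**x
# equals 1/p, the normalising denominator above. s and p are arbitrary.
s, p = 4, 0.35
total = sum(comb(s + x, s) * p**s * (1 - p)**x for x in range(2000))
print(total, 1 / p)  # both approximately 2.857
```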
This can easily be seen by substituting s = a - 1, which gives

    SUM[x=0 to infinity] C(a + x - 1, a - 1) * p^(a-1) * (1 - p)^x

If the exponent of p were equal to a instead of (a - 1), each term would be the probability mass function of the negative binomial distribution, which sums to unity; since each term is instead that pmf divided by p, our denominator must sum to 1/p. The posterior distribution from Equation (8.1) then reduces to

    p(x) = C(s + x, s) * p^(s+1) * (1 - p)^x

which is just a NegBin(s + 1, p) distribution.

8.2.5 Summary of results for the binomial process

The results are shown in Table 8.1.

Table 8.1 Distributions of the binomial process.

    Quantity                 Formula                        Notes
    Number of successes      s = Binomial(n, p)
    Probability of success   p = Beta(s + 1, n - s + 1)     Assuming a uniform prior
                             p = Beta(a + s, b + n - s)     Assuming a Beta(a, b) prior
    Number of trials         n = s + NegBin(s, p)           When the last trial is a success
                             n = s + NegBin(s + 1, p)       When the last trial is not known to be a success

8.2.6 The beta-binomial process

An extension of the binomial process is to consider the probability p to be a random variable. A natural candidate to model this variability is the Beta(a, b) distribution, because it lies on [0, 1] and can take a wide variety of shapes, so it offers a great deal of flexibility. The beta-binomial distribution models the number of successes:

    s = Binomial(n, Beta(a, b))

The beta-negative binomial models the number of failures that will occur before achieving s successes:

    x = NegBin(s, Beta(a, b))

Both distributions are included in ModelRisk. It is important to remember that in the beta-binomial process the same value of p is applied to all the binomial trials, meaning that, if p is randomly 0.4 (say) for one trial, it is 0.4 for all the others too.
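A simulation sketch of the beta-binomial process just described (parameter values are made up): the key point is that a single p is drawn and then shared across all n trials.

```python
import random

# Beta-binomial sampling sketch: draw p once from Beta(a, b), then use
# that same p for every one of the n trials, as the text stresses.
random.seed(3)

def beta_binomial(n: int, a: float, b: float) -> int:
    p = random.betavariate(a, b)  # one p shared by the whole batch
    return sum(random.random() < p for _ in range(n))

draws = [beta_binomial(20, 2, 5) for _ in range(50_000)]
mean = sum(draws) / len(draws)
print(mean)  # close to n*a/(a + b) = 20*2/7 = 5.71
```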
If p were randomly varying between each trial, we would have each trial being an independent Bernoulli(Beta(a, b)). But since a Bernoulli distribution can only return 0 or 1, this condenses to a Bernoulli(a/(a + b)), where a/(a + b) is the mean of the beta distribution, and a collection of n such independent Bernoulli trials would therefore be just a Binomial(n, a/(a + b)).

8.2.7 The multinomial process

Whereas in the binomial process there are only two possible outcomes of a trial (0 or 1, yes or no, male or female, etc.), the multinomial process allows for multiple outcomes. The list of possible outcomes must be exhaustive, meaning a trial cannot result in something that isn't listed as an outcome. For example, if we throw a die there are six possible mutually exclusive (they can't happen at the same time) and exhaustive (one must occur) outcomes. There are three distributions associated with the multinomial process:

Multinomial(n, {p1 ... pk}) describes the number of successes in n trials that fall into each of the k categories. Its joint probability mass function parallels the binomial equation. You can think of a multinomial distribution as a recursive sequence of nested binomial distributions in which the number of trials and the probability of success are modified through the sequence. For example, imagine that a person being treated in hospital has three possible outcomes {cured, not cured, deceased} with probabilities {0.6, 0.3, 0.1}. Assuming the patients' outcomes are independent, we can model the outcomes for 100 patients as follows:

    Cured    = Binomial(100, 0.6)
    NotCured = Binomial(100 - Cured, 0.3/(0.3 + 0.1))
    Deceased = Binomial(100 - Cured - NotCured, 0.1/0.1)
             = Binomial(100 - Cured - NotCured, 1)
             = 100 - Cured - NotCured

The model in Figure 8.5 shows this calculation in a spreadsheet, together with the ModelRisk distribution VoseMultinomial, which achieves the same result in a single array function.

[Figure 8.5 Model for the multinomial process.]

Negative Multinomial({s1 ... sk}, {p1 ... pk}) is the extension of the negative binomial distribution and describes the number of extra trials (we can't really say "failures" any more, because there are several outcomes rather than the two of the binomial case, where we could designate success or failure) there will be to observe {s1 ... sk} successes. There are two versions of this question: "How many extra trials will there be in total?", which has a univariate answer, and "How many extra trials will there be in each success category beyond the number required?", which has a multivariate answer. The probability mass function is quite complicated for both, but the modelling is easy to see in the spreadsheet in Figure 8.6. Note in this model that there will always be one zero among the per-category extra trials, and that the total of the multivariate version returns the same distribution as the univariate version.

[Figure 8.6 Model for the negative multinomial process, using the ModelRisk functions VoseNegMultinomial and VoseNegMultinomial2.]

Dirichlet({a1 ... ak}) is the multivariate equivalent of the beta distribution, as can be seen from its joint density function:

    f(x1, ..., xk) is proportional to PRODUCT[i=1 to k] xi^(ai - 1)

where 0 <= xi <= 1 (a probability lies on [0, 1]), SUM[i=1 to k] xi = 1 (the probabilities must sum to 1) and ai > 0.
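A common way to simulate a Dirichlet is to normalise independent gamma draws; the sketch below assumes that construction (the a values are arbitrary):

```python
import random

# Simulate Dirichlet({a1 ... ak}) by drawing Gamma(ai, 1) variates and
# normalising them so they sum to 1. The alphas below are arbitrary.
random.seed(4)
alphas = [3.0, 2.0, 5.0]

def dirichlet_draw(alphas: list[float]) -> list[float]:
    gammas = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

probs = dirichlet_draw(alphas)
print(round(sum(probs), 10))  # 1.0: a valid probability vector
```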
We can use the Dirichlet distribution to model the uncertainty about the set of probabilities {p1 ... pk} of a multinomial process. There is a neat relationship with gamma distributions that we can use to simulate a Dirichlet distribution, which is shown in the model of Figure 8.7, together with the VoseDirichlet function. In this example, a clinical trial of some face cream has been performed with 300 randomly selected people to ascertain the level of allergic reactions, with the following outcomes: 227 - no effect; 41 - mild itching; 27 - significant discomfort; and 5 - lots of pain and regret. The Dirichlet({s1 + 1, ..., sk + 1}) will return the joint uncertain estimate of the probabilities that another random person (a consumer) would experience each effect.

[Figure 8.7 Model for the Dirichlet distribution: each probability estimate is a Gamma(si + 1, 1) draw divided by the sum of all the Gamma(si + 1, 1) draws.]

8.3 The Poisson Process

In the binomial process there are n discrete opportunities for an event (a "success") to occur. In the Poisson process there is a continuous and constant opportunity for an event to occur. For example, lightning strikes might be considered to occur as a Poisson process during a storm. That would mean that, in any small time interval during the storm, there is a certain probability that a lightning strike will occur. In the case of lightning strikes, the continuum of opportunity is time. However, there are other types of exposure. The occurrence of discontinuities in the continuous manufacture of wire could be considered to be a Poisson process where the measure of exposure is, for example, kilometres or tonnes of wire produced.
If Giardia cysts were randomly distributed in a lake, the consumption of cysts by campers drinking the water would be a Poisson process, where the measure of exposure would be the amount of water consumed. Typographic errors in a book might be Poisson distributed, in which case the measure of exposure could be inches of text, although one could just as easily consider the errors to be binomially distributed with n = the number of characters in the book.

In a Poisson process, unlike the binomial, as there is a continuum of opportunity for an event to occur, we can theoretically have anything between zero and an infinite number of events within a specific amount of opportunity, and there is a probability of the event occurring no matter how small a unit of exposure we might consider. In practice, few physical systems will exactly conform to such a set of assumptions, but many systems are nevertheless very well approximated by a Poisson process. In the Giardia cyst example above, assuming a Poisson process would theoretically mean that we could have any number of cysts in a volume of water, no matter how small we made that volume. Obviously, this assumption breaks down when we consider a volume of liquid around the size of a cyst, or smaller, but this is almost never a restriction in practice. The distributions describing the Poisson and binomial processes are strongly related to each other, as shown in Figure 8.8.

[Figure 8.8 Comparison of the distributions of the binomial process (number of trials n, number of successes s, probability of success p), the Poisson process (number of observations α, number of events per unit time λ) and the hypergeometric process (number of successes s, number of trials n, population M, subpopulation D).]
In a binomial process, the key descriptive parameter is p, the probability of occurrence of an event, which is the same for all trials, so the trials are independent of each other. The key descriptive parameter for the Poisson process is λ, the mean number of events that will occur per unit of exposure, which is also considered to be constant over the total amount of exposure t. That means that there is a constant probability per second, for example, of an event occurring, whether or not an event has just occurred, has not occurred for an unexpectedly long time, etc. Such a process is called "memoryless", and both the binomial and Poisson processes can be so described. Like p for a binomial process, λ is a property of the physical system. For static systems (stochastic processes), p and λ are not variables, but we still need distributions to express the state of our knowledge (uncertainty) about their values.

In a Poisson process we consider, together with the number of events that may occur in a period t, the amount of "time" one will have to wait to observe α events, and λ, the average number of events that could occur per unit of exposure, known as the Poisson intensity. This section will now show how the Poisson distribution, which describes the number of events α that may occur in a period of exposure t, can be derived from the binomial distribution as p tends to zero and n tends to infinity. We will then look at how to determine the variability distribution of the time t one will need to wait before observing α events, which also turns out to be the distribution of uncertainty about the time one must have waited before having observed α events. Finally, we will discuss how to determine our state of knowledge (uncertainty) about λ given a set of observed events α in a period t.
8.3.1 Deriving the Poisson distribution from the binomial

Consider a binomial process where the number of trials n tends to infinity and the probability of success p at the same time tends to zero, with the constraint that the mean of the binomial distribution, np, remains finitely large. The probability mass function of the binomial distribution can be manipulated to model the number of successes that will occur under such conditions, as follows:

    p(X = x) = n!/(x!(n - x)!) * p^x * (1 - p)^(n-x)

Using λt = np, and noting that for n large and p small

    n!/(n - x)! is approximately n^x    and    (1 - p)^(n-x) is approximately e^(-np)

the equation simplifies to

    p(X = x) = (λt)^x * e^(-λt) / x!

This is the probability mass function for the Poisson(λt) distribution, i.e.

    Number of events α in time t = Poisson(λt)

when the average number of events that will occur in a unit interval of exposure is known to be λ. We can see how this interpretation fits in with the derivation from the binomial distribution. Imagine that a young lady decides to buy a pair of very high platform shoes that are in fashion. After some practice she gets used to the shoes, but there remains a smallish probability (say 1 in 50) that she will fall over with each step she takes. She decides to go for a short walk, say 100 metres. If we say that each step measures 1 metre, then we can model the number of falls she will have on her walk as either Binomial(100, 2 %) or Poisson(100 * 0.02) = Poisson(2). Figure 8.9 plots these two distributions together and shows how closely the binomial distribution is approximated by the Poisson distribution in such limiting cases.

The Poisson distribution is often mistakenly considered to be only a distribution of rare events. It is certainly used in this sense to approximate a binomial distribution, but it has far more importance than that.
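A numerical version of the Figure 8.9 comparison can be produced with a few lines (an illustrative sketch):

```python
from math import comb, exp, factorial

# Compare Binomial(100, 0.02) with its Poisson(2) approximation, as in
# the platform-shoes example.
n, p = 100, 0.02
lam = n * p

rows = []
for x in range(6):
    b = comb(n, x) * p**x * (1 - p)**(n - x)
    po = lam**x * exp(-lam) / factorial(x)
    rows.append((x, b, po))
    print(x, round(b, 4), round(po, 4))
# The two columns agree closely; at x = 2 both are about 0.27.
```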
Where there is a continuum of exposure to an event, the measure of exposure can be split up into smaller and smaller divisions until the probability of the event occurring in each division becomes extremely small, while there is also an enormous number of divisions. For example, I could stand on a street corner during rush hour, looking for red cars to pass by. For the duration of the rush hour, one could consider that the frequency of cars going by is quite constant and that the red cars are randomly distributed among the city's traffic. Then the number of red cars passing by will be Poisson distributed.

[Figure 8.9 Comparison of the Binomial(100, 0.02) and Poisson(2) distributions.]

If, on average, 0.6 red cars passed by per minute, I could model the number of cars passing by in the next 10 seconds as Poisson(0.1), in the next hour as Poisson(36), etc. I could divide up the time I stand on the street corner into such tiny elements (for example 1/100th of a second) that the probability of a red car passing by within a particular 1/100th of a second would be extremely small. The probability would be so small that the chance of two cars going by within that period would be absolutely negligible. In such circumstances, we can consider each of these small elements of time to be independent Bernoulli trials. Similarly, the number of raindrops falling on my head each second during a shower would also be Poisson distributed.

8.3.2 "Time" to wait to observe α events

The Poisson process assumes that there is a constant probability that an event will occur per increment of time. If we consider a small element of time Δt, then the probability that an event will occur in that element of time is kΔt, where k is some constant. Now let P(t) be the probability that the event will not have occurred by time t.
The probability that an event occurs for the first time during the small interval Δt after time t is then kΔt * P(t). This is also equal to P(t) - P(t + Δt), so we have

    P(t) - P(t + Δt) = kΔt * P(t)

Making Δt infinitesimally small, this becomes the differential equation

    dP(t)/dt = -k * P(t)

Integration gives

    P(t) = e^(-kt)

If we define F(t) as the probability that the event will occur before time t (i.e. the cumulative distribution function for t), we then have

    F(t) = 1 - e^(-kt)

which is the cumulative distribution function for an exponential distribution Expon(1/k) with mean 1/k. Thus, 1/k is the mean time between occurrences of events or, equivalently, k is the mean number of events per unit time, which is the Poisson parameter λ. The parameter 1/λ, the mean time between occurrences of events, is given the notation β. We have thus shown that the time until occurrence of the first event of a Poisson process is given by

    t1 = Expon(β)

where β = 1/λ. It can also be shown (although the maths is too laborious to repeat here) that the time until α events have occurred is given by a gamma distribution:

    tα = Gamma(α, β)

The Expon(β) distribution is therefore simply a special case of the gamma distribution, namely Gamma(1, β).

It is interesting to check the idea that a Poisson process is "memoryless". The probability that the first event will not have occurred by time x, given that it has not occurred by time t (x > t), is given by

    P(x)/P(t) = e^(-kx)/e^(-kt) = e^(-k(x - t))

which is another exponential distribution, now in the remaining time (x - t). Thus, although the event may not have occurred by time t, the remaining time until it occurs has the same probability distribution as it had at any prior point in time.

8.3.3 Estimate of the mean number of events per period (Poisson intensity) λ

Like the binomial probability p, the mean events per period λ is a fundamental property of the stochastic system in question. It can never be observed and it can never be exactly known. However, we can become progressively more certain about its value as more data are collected.
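That growing certainty can be illustrated with a quick simulation (not from the book): the naive estimate α/t tightens around the true λ as exposure accumulates.

```python
import math
import random

# As exposure grows, the observed count alpha pins down the Poisson
# intensity: the estimate alpha/t approaches the true lambda.
random.seed(10)
true_lambda = 3.0

def poisson(rate: float) -> int:
    # Knuth's multiplication method; fine for small rates.
    L, k, prod = math.exp(-rate), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= L:
            return k
        k += 1

estimates = []
for t in (10, 100, 1000):
    alpha = sum(poisson(true_lambda) for _ in range(t))  # t unit periods
    estimates.append(alpha / t)
    print(t, alpha / t)  # estimates tighten around 3.0 as t grows
```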
Bayesian inference (see Section 9.2) provides us with a means of quantifying the state of our knowledge as we accumulate data. Assume an uninformed prior π(λ) = 1/λ (see Section 9.2.2) and the Poisson likelihood function for observing α events in period t:

    l(α|λ) = (λt)^α * e^(-λt) / α!

Since we can ignore terms that don't involve λ, we get the posterior distribution

    p(λ) proportional to λ^(α-1) * e^(-λt)

which is a Gamma(α, 1/t) distribution. The gamma distribution can also be used to describe our uncertainty about λ if we start off with an informed opinion and then observe α events in time t. From Table 9.1, if we can reasonably describe our prior belief with a Gamma(a, b) distribution, the posterior is given by a Gamma(a + α, b/(1 + bt)) distribution.

The choice of π(λ) = 1/λ (which is equivalent to a Gamma(1/z, z) distribution, where z is extremely large) as an uninformed prior is an uncomfortable one for many. This prior makes mathematical sense in that it is transformation invariant and therefore would give the same answer whether one performed an analysis in terms of λ or β = 1/λ, or even changed the unit of exposure relating to λ. On the other hand, a plot of this prior doesn't really seem "uninformed", since it is so peaked at zero. However, the shape of the posterior gamma distribution becomes progressively less sensitive to the prior distribution as data are collected. We can get a feel for the importance of the prior with the following train of thought:

(i) A π(λ) = 1/λ prior is equivalent to Gamma(1/z, z), where z approaches infinity. You can prove this by looking at the gamma probability density function and setting α to zero and β to infinity.

(ii) A flat prior (the opposite extreme to the π(λ) = 1/λ prior) would be equivalent to a Gamma(1, z), where z approaches infinity, i.e. an infinitely drawn-out exponential distribution.
(iii) We have seen that, for a Gamma(a, b) prior, the resultant posterior is Gamma(a + α, b/(1 + bt)), which means that the posterior for (i) would be Gamma(α, 1/t), and the posterior for (ii) would be Gamma(α + 1, 1/t).

(iv) Thus, the sensitivity of the gamma posterior to the prior amounts to whether (α + 1) is approximately the same as α. Moreover, Gamma(α, β) is the sum of α independent Exponential(β) distributions, so one can think of the choice of priors as being whether or not we add one extra exponential distribution to the α exponential distributions from the data. Thus, if α were 100, for example, the distribution would be roughly 1 % influenced by the prior and 99 % influenced by the data.

8.3.4 Estimate of the elapsed period t

We can estimate the period t that has elapsed if we know λ and the number of events α that have occurred in time t. The maths turns out to be exactly the same as for the estimate of λ in the previous section. The reader may like to verify that, by using a prior of π(t) = 1/t, we obtain a posterior distribution t = Gamma(α, 1/λ), which is the same result we would obtain if we were trying to predict forward (i.e. determine a distribution of variability of) the time required to observe α events given λ = 1/β. Also, if we can reasonably describe our prior belief with a Gamma(a, b) distribution, the posterior is given by a Gamma(a + α, b/(1 + bλ)) distribution.

8.3.5 Summary of results for the Poisson process

The results are shown in Table 8.2.

Table 8.2 Distributions of the Poisson process.

    Quantity                                    Formula                          Notes
    Number of events                            α = Poisson(λt)
    Mean number of events per unit exposure     λ = Gamma(α, 1/t)                Assuming uninformed prior
                                                λ = Gamma(a + α, b/(1 + bt))     Assuming Gamma(a, b) prior
    Time until observation of first event       t1 = Expon(1/λ) = Gamma(1, 1/λ)
    Time until observation of first α events    tα = Gamma(α, 1/λ)
    Time that has elapsed for α events          tα = Gamma(α, 1/λ)               Assuming uninformed prior
                                                tα = Gamma(a + α, b/(1 + bλ))    Assuming Gamma(a, b) prior

8.3.6 The multivariate Poisson process

The properties of the Poisson process make extending to a multivariate situation very easy. Imagine that we have three categories of car accident: (a) no injury; (b) one or more persons injured but no fatalities; (c) one or more persons killed. We'll assume that the accidents occur independently and follow a Poisson process with expected occurrences λa, λb and λc per year. The number that will occur in the next T years (assuming that the rates won't change over time) is Poisson(T * (λa + λb + λc)). The probability that the next accident is of type (a) is

    λa / (λa + λb + λc)

The time until the next type (a) accident is Gamma(1, 1/λa) = Expon(1/λa), and the uncertainty about the true values of each λ can be estimated separately as described in Sections 8.3.3 and 9.1.5.

8.3.7 Modifying λ in a Poisson process

The Poisson model assumes that λ will be constant over the time in which we are counting. That can be a tenuous assumption. Hurricanes, disease outbreaks, suicides, etc., occur more frequently at certain times of the year; car accidents, robberies and high-street brawls occur more frequently at certain times of the day (and sometimes year too). In fact it turns out that, if λ has a consistent (even if unknown) seasonal variation, we can often get round the problem. Imagine that boat accidents occur in each month i at a rate λi, i = 1, ..., 12. The number occurring in each future month i will be αi = Poisson(λi), and the total over the year will be the sum of the twelve Poisson(λi) distributions. From the identity Poisson(a) + Poisson(b) = Poisson(a + b), this can be rewritten as

    Poisson(λ1 + λ2 + ... + λ12)

i.e. the boat accidents occurring in a year also follow a Poisson process.
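The identity Poisson(a) + Poisson(b) = Poisson(a + b) can be checked by simulation; the monthly rates below are made up:

```python
import math
import random

# Check the superposition identity: monthly Poisson counts summed over
# a year behave like one Poisson with the summed rate.
random.seed(8)

def poisson(rate: float) -> int:
    # Knuth's multiplication method; fine for small rates.
    L, k, prod = math.exp(-rate), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= L:
            return k
        k += 1

monthly = [1.0, 1.2, 2.5, 3.0, 2.0, 1.5, 1.1, 0.9, 2.2, 2.6, 1.8, 3.2]
yearly = [sum(poisson(l) for l in monthly) for _ in range(20_000)]
mean = sum(yearly) / len(yearly)
var = sum((y - mean) ** 2 for y in yearly) / len(yearly)
print(mean, var)  # both close to sum(monthly) = 23.0
```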
Thus, as long as we ensure that we analyse data over a complete number of seasonal periods (a whole number of years in this case) and predict for a whole number of seasonal periods, we can ignore the fact that λ changes seasonally. That is immensely useful. If I've observed that historically there have been an average of 23 outbreaks per year of campylobacteriosis in a city (an outbreak is defined in epidemiology as an event unconnected to others, so we can think of them as occurring randomly in time and independently of each other), then I can model the number of outbreaks next year as Poisson(23) without worrying that most of those will occur over the summer months. I can also compare year-on-year data on outbreaks using Poisson mathematics. What I cannot do, of course, is say that July will have Poisson(23/12) outbreaks.

I used to live in a rural area of the South of France. As winter approached, the first time there was black ice on the roads in the morning you would see cars buried in hedges, woods and fields along the roadside. The more intense the sudden cold snap, the more cars you would see. Some years there weren't so many, others it was mayhem. Clearly, in situations like this, the expected rate of accidents is itself a random variable. The most common way to model that random variation is to multiply λ by a Gamma(1/h^2, h^2) distribution. This gamma has a mean of 1 and a standard deviation of h, giving a Poisson rate of Gamma(1/h^2, h^2 * λ). The idea therefore is that the gamma distribution is just adding a coefficient of variation of h to λ. It turns out that the combination of these two distributions is a Pólya(1/h^2, h^2 * λ) or, if 1/h^2 is an integer, simplifies to a NegBin(1/h^2, 1/(1 + h^2 * λ)). The result is convenient because it means we can use the Pólya or NegBin distributions to model this Poisson(Gamma(α, β)) mixture.
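A simulation sketch of this Poisson-gamma mixing (parameter values made up): drawing the rate from a gamma with mean λ and coefficient of variation h inflates the count variance above the mean.

```python
import math
import random

# Poisson-gamma mixture sketch: the rate itself is Gamma distributed
# with mean lam and coefficient of variation h, so counts end up with
# variance lam + (h*lam)**2 > lam. Parameter values are made up.
random.seed(9)
lam, h = 5.0, 0.5

def poisson(rate: float) -> int:
    L, k, prod = math.exp(-rate), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= L:
            return k
        k += 1

a = 1 / h**2              # Gamma(a, b) with mean lam and CV h
b = lam / a
draws = [poisson(random.gammavariate(a, b)) for _ in range(50_000)]
m = sum(draws) / len(draws)
v = sum((d - m) ** 2 for d in draws) / len(draws)
print(m, v)  # mean near 5, variance near 5 + 2.5**2 = 11.25
```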
Along the way, we can also see that the Pólya and NegBin distributions have a greater coefficient of variation than the Poisson. Often you will see in statistics that researchers call data "overdispersed" when they want to fit a Poisson distribution but the data have a variance greater than their mean (the two would be equal for a Poisson distribution), and the statisticians then turn to a NegBin (although they would often be better off with a Pólya, which is less well known). The gamma distribution is useful because we have an extra parameter h to play with and can therefore match, for example, the mean and variance (or any two other statistics) of the data. However, at times that is not enough, and we might need more control in order to match, for example, the skewness too. Instead of modelling λ in the form Gamma(a, b), we can add a positive shift c so that we get Poisson(Gamma(a, b) + c), which turns out to be a Delaporte(a, b, c) distribution.

8.4 The Hypergeometric Process

The hypergeometric process occurs when one is sampling randomly without replacement from some population, and where one is counting the number in that sample that have some particular characteristic. This is a very common type of scenario. For example, population surveys, herd testing and lotto are all hypergeometric processes. In many situations the population is very large in comparison with the sample, so that, if a sampled item were put back into the population, the probability that it would be picked again is very small. In that case, each sample would have essentially the same probability of picking an individual with the particular characteristic: in other words, the process becomes binomial. When the population is not very large compared with the sample (a good rule is when the population is less than 10 times the size of the sample), we cannot make a binomial approximation to the hypergeometric. This section discusses the distributions associated with the hypergeometric process.
8.4.1 Number in a sample with a particular characteristic

Consider a group of M individual items, D of which have a certain characteristic. Randomly picking n items from this group without replacement, where each of the M items has the same probability of being selected, is a hypergeometric process. For example, imagine I have a bag of seven balls, three of which are red and four of which are blue. What is the probability that I will select two red balls from the bag if I randomly pick three balls out without replacement?

First of all, we note that the probability of the second ball picked being red depends on the colour of the first picked ball. If the first ball was red (with probability 3/7), there are only two red balls left among the six balls remaining. The probability of the second ball being red, given the first ball was red, is therefore 2/6 = 1/3. However, each ball remaining in the bag has the same probability of being picked, which means that each event resulting in x red balls being selected in total has the same probability. We thus need only consider the different combinations of events that are possible. There are, from the discussion in Section 6.3.4, C(7, 3) = 35 different possible ways that one can select three items from seven. There are C(3, 2) = 3 ways to select two red balls from the three in the bag, and there are C(4, 1) = 4 ways to select one blue ball from the four in the bag. Thus, out of the 35 ways we could have picked three balls from the group of seven, only 3 * 4 = 12 of those ways would give us two red balls. Thus, the probability of selecting two red balls is 12/35 = 34.29 %.

In general, for a population of size M of which D have the characteristic of interest, in selecting a sample of size n from that population at random without replacement, the probability of observing x with the characteristic of interest is given by

    p(x) = C(D, x) * C(M - D, n - x) / C(M, n)

which is the probability mass function of the hypergeometric distribution Hypergeo(n, D, M).
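The hypergeometric pmf transcribes directly; a minimal sketch checked against the bag-of-balls example:

```python
from math import comb

# Hypergeometric pmf from the text:
# p(x) = C(D, x) * C(M - D, n - x) / C(M, n).
def hypergeo_pmf(x: int, n: int, D: int, M: int) -> float:
    return comb(D, x) * comb(M - D, n - x) / comb(M, n)

# The bag-of-balls example: M = 7 balls, D = 3 red, sample n = 3.
print(hypergeo_pmf(2, 3, 3, 7))  # 12/35 = 0.342857...
```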
Just in case you are curious, the hypergeometric distribution gets its name because its probabilities are successive terms in a Gaussian hypergeometric series.

Binomial approximation to the hypergeometric
If we replaced each item one at a time back into the population when taking our sample of size n, the probability of each individual item having the characteristic of interest would be D/M, and the number sampled from D would then be given by a Binomial(n, D/M). More usefully, if M is very large compared with n, the chance of picking the same item more than once, were one to replace each item after selection, would be very small. Thus, for large M (usually n < 0.1M is quoted as a satisfactory condition) there is little difference in the sampling result whether we sample with or without replacement, and we can approximate a Hypergeo(n, D, M) with a Binomial(n, D/M), which is much easier to calculate.

Multivariate hypergeometric distribution
The hypergeometric distribution can be extended to situations where there are more than two types of item in the population (i.e. more than just D of one type and (M − D) of another). The probability of getting s1 from subpopulation D1, s2 from D2, etc., all in a sample of size n, is given by

f(s1, ..., sk) = [C(D1, s1) C(D2, s2) ... C(Dk, sk)] / C(M, n)

where the si sum to n, the Di sum to M, Di ≥ si ≥ 0 and M > Di > 0.

8.4.2 Number of samples to get a specific s
Consider the situation where we are sampling without replacement from a population M with D items having the characteristic of interest until we have s items with that characteristic. The distribution of the number of failures we will have before the sth success can easily be calculated in the same manner as we developed for the negative binomial distribution in Section 8.2.2. The probability of observing (s − 1) successes in (x + s − 1) trials (i.e.
x failures) is given by direct application of the hypergeometric distribution:

f(x failures so far) = C(D, s − 1) C(M − D, x) / C(M, s + x − 1)

The probability p of then observing a success on the next trial (the (s + x)th trial) is simply the number of D items remaining (= D − (s − 1)) divided by the size of the population remaining (= M − (s + x − 1)):

p = (D − s + 1) / (M − s − x + 1)

and the probability of having exactly x failures up to the sth success, where trials are stopped at the sth success, is then the product of these two probabilities. This is the probability mass function for the inverse hypergeometric distribution InvHypergeo(s, D, M), which is analogous to the negative binomial distribution for the binomial process and the gamma distribution for the Poisson process.

For a population M that is large compared with s, the inverse hypergeometric distribution approximates the negative binomial:

InvHypergeo(s, D, M) ≈ NegBin(s, D/M)

and if the probability D/M is also very small

InvHypergeo(s, D, M) ≈ Gamma(s, M/D)

Figure 8.10 shows some examples of the inverse hypergeometric distribution. An inverse hypergeometric distribution shifted k units along the domain is sometimes called a negative hypergeometric distribution. ModelRisk offers the InvHypergeo(s, D, M) distribution, and the negative hypergeometric can be achieved by writing VoseInvHypergeo(s, D, M, VoseShift(k)).

Figure 8.10 Examples of the inverse hypergeometric distribution: InvHypergeo(2, 2, 50) and InvHypergeo(4, 5, 50), plotted as probability against number of failures.

8.4.3 Number of samples to have observed a specific s
The inverse hypergeometric distribution was derived above as a distribution of variability in predicting the number of failures one will have before the sth success. However, it can equally be derived as a distribution of uncertainty about the number of failures x = n − s one must have had if one knows s, M and D, using Bayes' theorem and a uniform (i.e.
uninformed) prior on x. In the case where we do not know that the trials stopped with the sth success, we can still apply Bayes' theorem with a uniform prior for x and a likelihood function given by a hypergeometric probability, which, with a uniform prior, is also the posterior distribution. Substituting n − s for x yields

f(n) ∝ n! (M − n)! / [(n − s)! (M − D − n + s)!]     (8.6)

Equation (8.6) has dropped all the terms that are not a function of n, since they can be normalised out of the equation. The uncertainty distribution for n does not equate to a standard distribution, so it needs to be normalised manually; it is easiest just to work with Equation (8.6) and normalise in the spreadsheet. Figure 8.11 shows an example of such a calculation, where the final distribution is in cell G18. Note that, if one uses a discrete distribution as shown in that spreadsheet, it is actually unnecessary to normalise the probabilities, since software like @RISK, Crystal Ball and ModelRisk automatically normalises them to sum to unity.

Figure 8.11 A Bayesian inference model with hypergeometric uncertainty: candidate values of n with their unnormalised f(n) and the normalised posterior probabilities. Note that the discrete distribution could have been used with columns B and C, removing the necessity to normalise the distribution.

8.4.4 Estimate of population and subpopulation sizes
The sizes of D and M are fundamental properties of the stochastic system, like p for a binomial process and λ for a Poisson process. Distributions of our uncertainty about the values of these parameters can be determined by Bayesian inference, given a sample of size n taken from the population M, of which s belonged to the subpopulation D.
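The spreadsheet normalisation described for Equation (8.6) can equally be sketched in code. Note that f(n) in Equation (8.6) is proportional to C(n, s) C(M − n, D − s), which is how it is written below; the values of s, D and M are illustrative only:

```python
from math import comb

def unnormalised(n, s, D, M):
    # Equation (8.6) up to a constant: f(n) ∝ C(n, s) * C(M - n, D - s)
    return comb(n, s) * comb(M - n, D - s)

s, D, M = 2, 5, 20
support = range(s, M - D + s + 1)   # n must allow both binomial coefficients
weights = [unnormalised(n, s, D, M) for n in support]
total = sum(weights)
posterior = {n: w / total for n, w in zip(support, weights)}
print(sum(posterior.values()))  # 1 after normalisation
```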
The hypergeometric probability of s successes in n samples from a population M of which D have the characteristic of interest is given by Equation (8.5). So, with a uniform prior, we get the following posterior equations for D and M:

f(D) ∝ D! (M − D)! / [(D − s)! (M − D − n + s)!]

f(M) ∝ (M − D)! (M − n)! / [(M − D − n + s)! M!]

These formulae do not equate to standard distributions and need to be normalised in the same way as discussed for Equation (8.6).

8.4.5 Summary of results for the hypergeometric process
The results are shown in Table 8.3.

Table 8.3 Distributions of the hypergeometric process.
- Number of the subpopulation in the sample: Hypergeo(n, D, M).
- Number of samples n there were to observe s from the subpopulation: n = s + InvHypergeo(s, D, M), where the last sample is known to have been from the subpopulation.
- Number of samples n there were to have observed s from the subpopulation: f(n) ∝ n!(M − n)! / [(n − s)!(M − D − n + s)!], where the last sample is not known to have been from the subpopulation. This uncertainty distribution needs to be normalised.
- Size of subpopulation D: f(D) ∝ D!(M − D)! / [(D − s)!(M − D − n + s)!]. This uncertainty distribution needs to be normalised.
- Size of population M: f(M) ∝ (M − D)!(M − n)! / [(M − D − n + s)! M!]. This uncertainty distribution needs to be normalised.

8.5 Central Limit Theorem
The central limit theorem (CLT) is an asymptotic result about summing probability distributions. It turns out to be very useful for obtaining sums of individuals (e.g. sums of animal weights, yields, scraps). It also explains why so many distributions sometimes look like normal distributions. We won't look at the derivation, just some examples and uses.
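Before the formal statement, a quick simulation shows the kind of result the CLT delivers. The parent distribution here is my own choice (a uniform with mean 27.4 and standard deviation 1.3, to match the nail-weight example discussed below); the CLT says the sum of 100 such variables should be close to Normal(2740, 13) whatever the parent's shape:

```python
import math
import random
import statistics

random.seed(2)

n, mu, sigma = 100, 27.4, 1.3
half = sigma * math.sqrt(3.0)   # a U(a, b) has sd (b - a) / (2 * sqrt(3))

# 5000 simulated "boxes", each the sum of 100 uniform nail weights
boxes = [sum(random.uniform(mu - half, mu + half) for _ in range(n))
         for _ in range(5000)]

# CLT prediction: mean n*mu = 2740, sd sigma*sqrt(n) = 13
print(statistics.mean(boxes), statistics.stdev(boxes))
```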
The sum C of n independent random variables Xi (where n is large), all of which have the same distribution, will asymptotically approach a normal distribution with known mean and standard deviation:

C ≈ Normal(nμ, σ√n)     (8.7)

where μ and σ are the mean and standard deviation of the distribution from which the n samples are drawn.

8.5.1 Examples
Imagine that the distribution of the weight (read "mass" if you want to be technical) of random nails produced by some company has a mean of 27.4 g and a standard deviation of 1.3 g. What will be the weight of a box of 100 nails? The answer, assuming that the nail weight distribution isn't strongly skewed, is the following normal distribution:

Normal(100 × 27.4, 1.3 × √100) = Normal(2740, 13) grams

This CLT result turns out to be very important in risk analysis. Many distributions are the sum of a number of identical random variables, and so, as that sum gets larger, the distribution tends to look like a normal distribution. For example, Gamma(α, β) is the sum of α independent Expon(β) distributions, so, as α gets larger, the gamma distribution looks progressively more like a normal distribution. An exponential distribution has mean and standard deviation of β, so we have

Gamma(α, β) ≈ Normal(αβ, β√α) for large α

Other examples are discussed in the section on approximating one distribution with another.

How large does n have to be for the sum to be distributed normally? It depends on the shape of the distribution being summed:
- Uniform: 12 (try it: summing 12 uniforms is an old way of generating normal distributions).
- Symmetric triangular: 6 (because U(a, b) + U(a, b) = Triangle(2a, a + b, 2b)).
- Fairly skewed: 30+ (e.g. 30 lots of Poisson(2) = Poisson(60)).
- Exponential: 150+ (check with Gamma(α, β) = sum of α Exponential(β)s).

8.5.2 Other related results
The average of a large number of independent, identical distributions
Dividing both sides of Equation (8.7) by n, the average x̄ of n variables drawn independently from the same distribution is given by
x̄ = (1/n) Σ Xi ≈ Normal(μ, σ/√n)     (8.8)

Note that the result of Equation (8.8) is correct because both the mean and standard deviation of the normal distribution are in the same units as the variable itself. However, be warned that for most distributions one cannot simply divide the distribution parameters of a variable X by n to get the distribution of X/n.

The product of a large number of independent, identical distributions
CLT can also be applied where a large number of identical random variables are being multiplied together. Let P be the product of a large number of random variables Xi, i = 1, ..., n. Taking logs of both sides, we get

ln(P) = Σ ln(Xi)

The right-hand side is the sum of a large number of random variables and will therefore tend to a normal distribution. Thus, from the definition of a lognormal distribution, P will be asymptotically lognormally distributed. A neat corollary is that, if all the Xi are lognormally distributed, their product will be exactly lognormally distributed too.

Is CLT the reason the normal distribution is so popular?
Many stochastic variables are neatly described as the sum or product, or a mixture, of a number of random variables. A very loose form of CLT says that, if you add up a large number n of different random variables, and if none of those variables dominates the resultant distribution spread, the sum will eventually look normal as n gets bigger. The same applies to multiplying (positive) different random variables and the lognormal distribution. In fact, a lognormal distribution will also look very similar to a normal distribution if its mean is much larger than its standard deviation (see Figure 8.12), so perhaps it should not be too surprising that so many variables in nature seem to be somewhere between lognormally and normally distributed.

8.6 Renewal Processes
In a Poisson process, the times between successive events are described by independent identical exponential distributions.
In a renewal process, like a Poisson process, the times between successive events are independent and identically distributed, but they can take any distribution; the Poisson process is thus a particular case of a renewal process. The mathematics of the distribution of the number of events in a period (equivalent to the Poisson distribution for the Poisson process) and of the time to wait to observe x events (equivalent to the gamma distribution for the Poisson process) can be quite complicated, depending on the distribution of time between events. However, Monte Carlo simulation lets us bypass the mathematics to arrive at both of these distributions, as we will see in the following examples.

Figure 8.12 Graphs of the normal and lognormal distribution.

Example 8.1 Number of events in a specific period
It is known that a certain type of light bulb has a lifetime that is Weibull(1.3, 4020) hours distributed. (a) If I have one light bulb working at all times, replacing each failed light bulb immediately with another, how many light bulbs will have failed in 10 000 hours? (b) If I have 10 light bulbs going at all times, how many will fail in 1000 hours? (c) If I had one light bulb going constantly, and I had 10 light bulbs to use, how long would it take before the last light bulb failed?

(a) Figure 8.13 shows a model to provide the solution to this question. Note that it takes account of the possibility of 0 failures. (b) Figure 8.14 shows a model to provide the solution to this question. Figure 8.15 compares the results for this question and part (a). Note that they are significantly different: had the time between events been exponentially distributed, the results would have been exactly the same. (c) The answer is simply the sum of 10 independent Weibull(1.3, 4020) distributions.

Figure 8.13 Model solution to Example 8.1(a): cumulative failure times are built up as running sums of Weibull(1.3, 4020) lifetimes, and each failure is counted only if it falls inside the 10 000-hour period of interest (e.g. =IF(B3>$D$21,0,1)).
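The renewal counting of Example 8.1(a) can be sketched outside a spreadsheet too. This is a minimal Monte Carlo version of the Figure 8.13 logic (the loop structure is mine; the Weibull parameters are the book's):

```python
import random

random.seed(4)

def failures_in(period_hours, shape=1.3, scale=4020.0):
    """Count bulb failures within the period, replacing immediately."""
    t, failures = 0.0, 0
    while True:
        t += random.weibullvariate(scale, shape)  # next bulb's lifetime
        if t > period_hours:
            return failures  # can legitimately be 0
        failures += 1

counts = [failures_in(10_000) for _ in range(5000)]
print(sum(counts) / len(counts))  # mean number of failures in 10 000 hours
```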
Figure 8.14 Model solution to Example 8.1(b): each of the 10 bulbs has its own running sum of Weibull(1.3, 4020) lifetimes (=VoseWeibull(1.3,4020) for the first lifetime, then =B3+VoseWeibull(1.3,4020) and so on), and failures within the 1000-hour period are counted.

Figure 8.15 Comparison of results from the models of Figures 8.13 and 8.14.

8.7 Mixture Distributions
Sometimes a stochastic process can be a combination of two or more separate processes. For example, car accidents at some particular place and time could be considered to be a Poisson variable, but the mean number of accidents per unit time λ may be a variable too, as we have seen in Section 8.3.7. A mixture distribution can be written symbolically as

FA(Θ) with Θ ~ FB

where FA represents the base distribution and FB represents the mixing distribution, i.e. the distribution of the parameter Θ. So, for example, we might have

Poisson(Gamma(α, β))

which reads as "a gamma mixture of Poisson distributions". There are a number of commonly used mixture distributions. For example,

Binomial(n, Beta(α, β))

which is the Beta-Binomial(n, α, β) distribution, and a beta mixture of Poisson distributions, where the Poisson parameter λ is the product of a constant and a Beta(α, β) variable. [Though also used in biology, this should not be confused with the beta-Poisson dose-response model.]

The cumulative distribution function for a mixture distribution with parameters Θi is given by the expectation of the conditional cumulative distribution function, where the expectation is taken with respect to the parameters that are random variables. Thus, the functional form of mixture distributions can quickly become extremely complicated or even intractable. However, Monte Carlo simulation allows us very simply to include mixture distributions in our models, provided that the Monte Carlo software being used (for example @RISK, Crystal Ball, ModelRisk) generates samples for each iteration in the correct logical sequence. So, for example, a Beta-Binomial(n, α, β) distribution is easily generated by writing =Binomial(n, Beta(α, β)).
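The same two-stage generation can be sketched in Python (parameter values are illustrative). Each draw samples p from the beta distribution first and then a binomial count with that p, and the resulting mixture is visibly overdispersed relative to a plain binomial:

```python
import random
import statistics

random.seed(5)

def beta_binomial(n, a, b):
    """One Beta-Binomial(n, a, b) draw: p from Beta(a, b), then Binomial(n, p)."""
    p = random.betavariate(a, b)
    return sum(random.random() < p for _ in range(n))

n, a, b = 20, 2, 3
draws = [beta_binomial(n, a, b) for _ in range(20000)]

# Theory: mean = n*a/(a+b) = 8; the variance comfortably exceeds the
# plain Binomial(20, 0.4) variance of 4.8 because p itself varies.
print(statistics.mean(draws), statistics.variance(draws))
```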
In each iteration, the software generates a value first from the beta distribution, then creates the appropriate binomial distribution using this value of p, and finally samples from that binomial distribution.

8.8 Martingales
A martingale is a stochastic process with sequential variables Xi (i = 1, 2, ...), where the expected value of each variable is the same and independent of previous observations. Written more formally,

E(Xn+1 | X1, ..., Xn) = E(X1) = μ

Thus, a martingale is any stochastic process with a constant mean. The theory was originally developed to demonstrate the fairness of gambling games, i.e. to show that the expected winnings of each turn of a game were constant; for example, to show that remembering the cards that had already been played in previous hands of a card game wouldn't impact upon your expected winnings. [Next time a friend says to you "21 hasn't come up in the lottery numbers for ages, so it must show soon", you can tell him or her "Not true, I'm afraid, it's a martingale" - they'll be sure finally to understand.] However, the theory has proven to be of considerable value in many real-world problems.

A martingale gets its name from the gambling "system" of doubling your bet on each loss of an even-odds bet (e.g. betting Red or Impair at the roulette wheel) until you have a win. It works too - well, in theory anyway: you must have a huge bankroll, and the casino must have no bet limit. It gives low returns for high risk, so as risk analysis consultants we would advise you to invest in (gamble on) the stock market instead.

8.9 Miscellaneous Examples
I have given below a few example problems for the different random processes discussed in this chapter, to give you some practice.

8.9.1 Binomial process problems
In addition to the problems below, the reader will find the binomial process appearing in the following examples distributed through this book: examples in Sections 4.3.1, 4.3.2 and 5.4.6 and Examples 22.2 to 22.6, 22.8 and 22.10, as well as many places in Chapter 9.
Example 8.2 Wine sampling
Two wine experts are each asked to guess the year of 20 different wines. Expert A guesses 11 correctly, while expert B guesses 14 correctly. How confident can we be that expert B is really better at this exercise than expert A?

If we allow that the guess of the year for each wine tasted is independent of every other guess, we can assume this to be a binomial process. We are thus interested in whether the probability of one expert guessing correctly is greater than the other's. We can model our uncertainty about the true probability of success for expert A as Beta(12, 10) and for expert B as Beta(15, 7). The model in Figure 8.16 then randomly samples from the two distributions, and cell C5 returns a 1 if the distribution for expert B has a greater value than the distribution for expert A. We run a simulation on this cell, and the mean result equals the fraction of the time that the distribution for expert B generated a higher value than that for expert A, and thus represents our confidence that expert B is indeed better at this exercise. In this case, we are 83 % confident.

Example 8.3 Run of luck
If I toss a coin 10 times, what is the distribution of the maximum number of heads I will get in a row? The solution is provided in the spreadsheet model of Figure 8.17.

Example 8.4 Multiple-choice exam
A multiple-choice exam gives three options for each of 50 questions. One student scores 21 out of 50. (a) What is the probability that the student would have achieved this score or higher without knowing anything about the subject? (b) Estimate how many questions the student actually knew the answer to.

(a) The student has a 1/3 probability of getting any answer right without knowing anything, so his or her score would then follow a Binomial(50, 1/3) distribution. The probability that the student would have achieved 21/50 or higher is then =1-BINOMDIST(20,50,1/3,1), i.e.
(1 − the probability of achieving 20 or lower).

Figure 8.16 Model for Example 8.2 (output cell C5: =IF(C4>C3,1,0)).

Figure 8.17 Model for Example 8.3.

(b) This is a Bayesian problem. Figure 8.18 illustrates a spreadsheet model of the Bayesian inference with a flat prior on the number of answers the student knew and a binomial likelihood function (=BINOMDIST(21-B3,50,1/3,0) for each candidate value, normalised by the column total, =C3/$C$25). The embedded graph is the posterior distribution of our belief about how many questions the student actually knew.

Figure 8.18 Model for Example 8.4(b).

8.9.2 Poisson process problems
In addition to the problems below, the reader will find the Poisson process appearing in the following examples distributed through this book: examples in Sections 9.2.2 and 9.3.2 and Examples 9.6, 9.11, 22.12, 22.14 and 22.16.

Example 8.5 Insurance problem
My company insures aeroplanes. They crash at a rate of 0.23 crashes per month. Each crash costs $Lognormal(120, 52) million. (a) What is the distribution of cost to the company for the next 5 years? (b) What is the distribution of the value of the liability if I discount it at the risk-free rate of 5 %?

The solution to part (a) is provided in the spreadsheet model of Figure 8.19, which uses the VLOOKUP Excel function. Part (b) requires that one know the time at which each accident occurs, using exponential distributions; the solution is shown in Figure 8.20.

Example 8.6 Rainwater barrel problem
It is a monsoon and rain is falling at a rate of 270 drops per second per square metre. The rain drops each contain 1 millilitre of water.
If I have a drum standing in the rain, measuring 1 metre high and 0.3 metres in radius, how long will it be before the drum is full? The solution is provided in the spreadsheet model of Figure 8.21: the drum volume is about 0.283 m³, so the number of drops needed is that volume divided by 0.000001 m³ (=ROUNDUP(D4/0.000001,0)); drops fall into the barrel at about 76.341 per second; and the waiting time is the sum of that many exponential inter-drop times, i.e. a gamma distribution (=Gamma(D6,1/D5)), giving around 3714 seconds in the iteration shown.

Figure 8.19 Model for Example 8.5(a).

Figure 8.20 Model for Example 8.5(b): accident times are built up from exponential inter-arrival times, each cost is discounted back at the risk-free rate, and accidents falling after the 60-month period contribute zero cost.

Figure 8.21 Model for Example 8.6.

Example 8.7 Equipment reliability
A piece of electronic equipment is composed of six components A to F, with the mean times between failures (MTBF) shown in Table 8.4. The components are in the serial and parallel configuration shown in Figure 8.22. What is the probability that the machine will fail within 250 hours?

Table 8.4 Mean time between failures of electronic equipment components (hours; the values include 27.8, 299, 1742.1, 1234 and 1417.9).

We first assume that the components fail with a constant probability per unit time, i.e. that their times to failure are exponentially distributed, which is a reasonable assumption implied by an MTBF figure. This problem belongs to reliability engineering. Components in series make the machine fail if any one of them fails; for parallel components, all the components in parallel must fail before the machine fails. Thus, from Figure 8.22, the machine will fail if A fails, or if B, C and D all fail, or if E and F both fail.
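That series/parallel logic can be simulated directly. The MTBF values below are placeholders of my own (the book's Table 8.4 values should be substituted); the structure - min over the series elements, max within each parallel block - is the point:

```python
import random

random.seed(7)

# Illustrative MTBFs in hours - NOT the book's Table 8.4 values.
mtbf = {"A": 1234.0, "B": 27.8, "C": 299.0, "D": 78.0,
        "E": 1417.9, "F": 1742.1}

def time_to_system_failure():
    # Exponential time to failure for each component,
    t = {c: random.expovariate(1.0 / m) for c, m in mtbf.items()}
    # then: fail when A fails, OR when B, C and D have ALL failed,
    # OR when both E and F have failed.
    return min(t["A"], max(t["B"], t["C"], t["D"]), max(t["E"], t["F"]))

runs = [time_to_system_failure() for _ in range(10000)]
print(sum(x < 250 for x in runs) / len(runs))  # P(failure within 250 h)
```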
Figure 8.22 also shows the spreadsheet modelling the time to failure: each component's time to failure is =Expon(MTBF) (cells D3:D8), and the system time to failure (the output, cell D10) is =MIN(D3,MAX(D4:D6),MAX(D7:D8)). Running a simulation with 10 000 iterations on cell D10 gives an output distribution of which 63.5 % of the trials were less than 250 hours.

Figure 8.22 Model for Example 8.7.

8.9.3 Hypergeometric process problems
In addition to the problems below, the reader will find the hypergeometric process appearing in the following examples distributed through this book: examples in Sections 22.4.2 and 22.4.4, as well as Examples 9.2, 9.3, 22.4, 22.6 and 22.8.

Example 8.8 Equal selection
I am to pick out at random 10 names from each of two bags. The first bag contains the names of 15 men and 22 women. The second bag contains the names of 12 men and 15 women. (a) What is the probability that I will have the same proportion of men in the two selections? (b) How many times would I have to sample from these bags before I did have the same proportion?

(a) The solution can be worked out mathematically or by simulation. Figure 8.23 provides the mathematical calculation and Figure 8.24 a simulation model, where the required probability is the mean of the output result. The mathematical model computes, for each possible number of men x from 0 to 10, the hypergeometric probability of drawing x men from each bag (=HYPGEOMDIST($B9,C$3,C$4,C$5) for each bag), multiplies the two probabilities together (=C9*D9) and sums the products (=SUM(E9:E19)): the total probability of the same number of men in each sample is 20.93 %.

Figure 8.23 Mathematical model for Example 8.8.

Figure 8.24 Simulation model for Example 8.8: each bag's sample of men is =Hypergeo(C3,C4,C5), and the output returns 1 when the two counts match (=IF(C7=D7,1,0)).
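The Figure 8.23 calculation is easy to reproduce outside the spreadsheet (this is a sketch of the same product-and-sum, using the hypergeometric pmf from earlier in the chapter):

```python
from math import comb

def hyper_pmf(x, n, D, M):
    return comb(D, x) * comb(M - D, n - x) / comb(M, n)

# Example 8.8(a): probability that both samples of 10 contain the same
# number of men. Bag 1: 15 men of 37 names; bag 2: 12 men of 27 names.
p_same = sum(hyper_pmf(x, 10, 15, 37) * hyper_pmf(x, 10, 12, 27)
             for x in range(11))
print(p_same)  # about 0.2093, the 20.93 % of Figure 8.23
```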
Figure 8.25 Model for Example 8.9.

(b) Each trial is independent of every other, so the number of trials before one success = 1 + NegBin(1, p) = 1 + Geometric(p), where p is the probability calculated in part (a).

Example 8.9 Playing cards
How many cards in a well-shuffled pack, complete with jokers, do I need to turn over to see a heart? There are 54 (= M) cards, of which 13 (= D) are hearts, and I am looking for s = 1 heart. The number of cards I must turn over is given by the formula 1 + InvHypergeo(1, 13, 54), which is the distribution shown in Figure 8.25.

Example 8.10 Faulty tyres
A tyre manufacturer has accidentally mixed up four tyres from a faulty batch with 20 other good tyres. Testing a tyre for the fault ruins it. If each tyre costs $75, and if the tyres are tested one at a time until the four faulty tyres are found, how much will this mistake cost? The solution is provided in the spreadsheet model of Figure 8.26.

8.9.4 Renewal and mixed process problems
In addition to the problems below, Examples 12.8 and 12.9 also deal with renewal and mixed process problems.

Example 8.11 Batteries
A certain brand of batteries lasts Weibull(2, 27) hours in my CD player, which takes two batteries at a time. I have a pack of 10 batteries. For how long can I run my CD player, given that I replace both batteries when one has run down? The solution is provided in the spreadsheet model of Figure 8.27.

Example 8.12 Queuing at a bank (Visual Basic modelling with Monte Carlo simulation)
A post office has one counter that it recognises is insufficient for its customer volume. It is considering putting in another counter and wishes to model the effect on the maximum number in a queue at any one time. It is open from 9 a.m. to 5 p.m. each working day. Past data show that, when the doors
Past data show that, when the doors 202 Risk Analysis tested tyres Probability Tyres actually tested /(COMBIN(M,B4+s-l)'(M-64-s+l )) =D+Discrete(64:622,C4:C22) Figure 8.26 Model for Example 8.10. Figure 8.27 Model for Example 8.11 open at 9 a.m., the number of people waiting to come in will be as shown in Table 8.5. People arrive throughout the day at a constant rate of one every 12 minutes. The amount of time it takes to serve each person is Lognormal(29, 23) minutes. What is the maximum queue size in a day? This problem requires that one simulate a day, monitor the maximum queue size during the day and then repeat the simulation. One thus builds up a distribution of the maximum number in a queue. The solution provided in Figures 8.28 and 8.29 and in the following program runs a looping Visual Basic macro called "Main Program" at each iteration of the model. This is an advanced technique and, although this problem is very simple, one can see how it can be greatly extended. For example, one could change the rate of arrival of the customers to be a function of the time of day; one could add Chapter 8 Some bas~crandom processes 203 Table 8.5 Historic data on the number of people waiting at the start of the business day. A1 People Probability 0 1 2 3 4 5 0.6 0.2 0.1 0.05 0.035 0.015 B C D E F IG 1 2 lnputs Average interarrivaltime (mins) Serving time mean Serving time stdev 3 4 5 6 7 8 9 --o l11 12 13 14 15 16 17 18 19 20 12 29 23 Model People in queue Time of day (minutesfrom 00:OO:OO) Latest customer at counter 1 Latest customer at counter 2 outputs Total customers served Maximum number in queue 12 1037.92 Customer Arrive time Serving time Finish time 0 997.24 40.68 1037.92 0 1006.16 36.58 1042.74 35 18 Formulae table C8:C9, C11:E12, C15:C16 Values updated by macro F11:F12 =El1 + D l 1 21 Figure 8.28 Sheet "model" for the model for Example 8.12. 
more counters; and one could monitor other statistical parameters besides the maximum queue size, like the maximum amount of time any one person waits or the amount of free time the people working behind the counters have.

Figure 8.29 Sheet "variables" for the model for Example 8.12: counter serving time =Lognorm(Model!$C$4,Model!$C$5); customers arriving while serving =IF(B10=0,0,Poisson(B10/Model!C3)); wait time for next customer =Expon(Model!C3); people waiting at 9:00 a.m. =Discrete({0,1,2,3,4,5},{0.6,0.2,0.1,0.05,0.035,0.015}); time in last step updated by the macro.

Visual Basic macro for Example 8.12:

'Set model variables
Dim modelWS As Object
Dim variableWS As Object

Sub Main_Program()
    Set modelWS = Workbooks("queue_model_test.xls").Worksheets("model")
    Set variableWS = Workbooks("queue_model_test.xls").Worksheets("variables")
    'Reset the model with the starting values (time of day, in minutes from 00:00:00)
    modelWS.Range("c9").Value = 9 * 60
    'Start serving customers
    Serve_First_Customer
    Serve_Next_Customer
End Sub

Sub Serve_First_Customer()
    'Serve at counter 1 if 0 people in queue
    If modelWS.Range("c8") = 0 Then
        modelWS.Range("c9") = modelWS.Range("c9").Value + variableWS.
Range("b6").Value
        modelWS.Range("c8") = 1
        Application.Calculate
        Routine_A
    End If
    'Serve at counter 1 if 1 person in queue
    If modelWS.Range("c8") = 1 Then
        Routine_A
    End If
    'Serve at counters 1 and 2 if 2 or more people in queue
    If modelWS.Range("c8") >= 2 Then
        Routine_A
        Routine_B
    End If
End Sub

Sub Serve_Next_Customer()
    'Calculate the new time of day
    variableWS.Range("b10") = Evaluate("=Max(Model!C9,Min(model!F11,model!F12))-Model!C9")
    modelWS.Range("c8") = modelWS.Range("c8").Value + variableWS.Range("B4").Value
    'Calculate the maximum number of people left in queue
    modelWS.Range("C16") = Evaluate("=max(model!c16,model!c8)")
    Application.Calculate
    modelWS.Range("c9") = modelWS.Range("c9").Value + variableWS.Range("B10").Value
    Application.Calculate
    'Check how many people are in the queue
    If modelWS.Range("c8") = 0 Then
        modelWS.Range("c9") = modelWS.Range("c9").Value + variableWS.Range("b6").Value
        modelWS.Range("c8") = 1
    End If
    Application.Calculate
    If modelWS.Range("c9") > 1020 Then Exit Sub   'stop at 5 p.m. (1020 minutes)
    If modelWS.Range("f11") <= modelWS.Range("f12") Then
        Routine_A
    Else
        Routine_B
    End If
    Application.Calculate
    Serve_Next_Customer
End Sub

'Next customer for counter 1
Sub Routine_A()
    modelWS.Range("c11") = 1
    modelWS.Range("D11") = modelWS.Range("c9").Value
    Application.Calculate
End Sub

'Next customer for counter 2
Sub Routine_B()
    modelWS.Range("c12") = 1
    modelWS.Range("d12") = modelWS.Range("c9").Value
    Application.Calculate
    modelWS.Range("e12") = variableWS.Range("B2").Value
    modelWS.Range("c15") = modelWS.Range("c15") + 1
    modelWS.Range("c8") = modelWS.Range("c8") - 1
    modelWS.Range("C12") = 0
    Application.Calculate
End Sub

Chapter 9 Data and statistics

Statistics is the discipline of fitting probability models to data.
In this chapter I go through a number of basic statistical techniques, from the simple z-tests and t-tests of the classical statistics world, through the basic ideas behind Bayesian statistics, to the application of simulation in statistics - the bootstrap for classical statistics and Markov chain Monte Carlo modelling for Bayesian statistics. If you have some statistics training, you may think my approach is rather inconsistent, as I have no problem using Bayesian and classical methods in the same model in spite of the philosophical inconsistencies between them. That's because classical statistics is still the most readily accepted type of statistical analysis - so a model using these methods is less contentious among certain audiences - but, on the other hand, Bayesian statistics can solve more problems. Moreover, Bayesian statistics is more consistent with risk analysis modelling, because we need to simulate uncertainty about model parameters so that we can see how that uncertainty propagates through a model to affect our ability to predict the outputs of interest, not just quote confidence intervals.

There are a few key messages I would like you to take away from this chapter. The first is that statistics is subjective: the choice of model that we fit to our data is a highly subjective decision. Even the most established statistical tests, like the z-test, t-test, F-test, chi-square test and regression models, have at their heart the (subjective) assumption that the underlying variable is normally distributed - which is very rarely the truth. These tests are really old - a hundred years old - and came to be used so much because one could restructure a number of basic problems into the form of one of these tests and look up the confidence values in published tables. We don't use tables any more - well, we shouldn't, anyway: they aren't very accurate, and even basic software like Excel can give you the answers directly.
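As a tiny illustration of "the software gives you the answer directly", here is the kind of lookup a z-table used to provide, computed exactly from the normal CDF (the z value is illustrative):

```python
import math

def normal_cdf(z):
    # Standard normal cumulative distribution function via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z = 1.96
p_two_sided = 2.0 * (1.0 - normal_cdf(z))
print(p_two_sided)  # about 0.05, with no table interpolation needed
```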
It's rather strange, then, that statistics books often still publish such tables. The second key message is that statistics does not need to be a black box. With a little understanding of probability models, it can become quite intuitive. The third is that there is ample room in statistics for creative thinking. If you have access to simulation methods, you are freed from having to find the right "test" for your particular problem. Most real-world problems are too complex for standardised statistical testing. The fourth is that statistics is intimately related to probability modelling. You won't understand statistics until you've understood probability theory, so learn that first. And lastly, statistics can be really quite a lot of fun as well as very informative. It's rare that a person coming to one of our courses is excited about the statistics part, and I can't blame them, but I like to think that they change their mind by the end. I studied mathematics and physics at undergraduate level and came away with really no useful appreciation of statistics, just a solid understanding of how astonishingly boring it was, because statistics was taught to me as a set of rules and equations, and any explanation of "Why?" was far beyond what we could hope to understand (at the same time we were learning about general relativity theory, quantum electrodynamics, etc.). At the beginning of this book, I discussed the importance of being able to distinguish between uncertainty (or epistemic uncertainty) and variability (or stochastic uncertainty). This chapter lays out a number of techniques that enable one quantitatively to describe the uncertainty (epistemic uncertainty) associated with the parameters of a model. Uncertainty is a function of the risk analyst, inasmuch as it is the description of the state of knowledge the risk analyst's clients have about particular parameters within his or her model.
A quantitative risk analysis model is structured around modelling the variability (randomness) of the world. However, we have imperfect knowledge of the parameters that define that model, so we must estimate their values from data, and, because we have only finite amounts of data, there will remain some uncertainty that we have to layer over our probability model. This chapter is concerned with determining the distributions of uncertainty for these parameters. I will assume that the analyst has somehow accumulated a set of data X = {x_1, x_2, ..., x_n} of n data points that has been obtained in such a manner as to be considered a random sample from a random process. The purpose of this chapter will be to determine the level of uncertainty, given these available data, associated with some parameter or parameters of the probability model. It will be useful here to set out some simple terminology:

The estimate of some statistical parameter of the parent distribution with true (but unknown) value, say p, is denoted by a hat, e.g. \hat{p}.

The sample mean of the dataset X is denoted by \bar{x}, i.e. \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.

The (unbiased) sample standard deviation of the dataset X is denoted by s, i.e. s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}.

The true mean and standard deviation of the population distribution are denoted by \mu and \sigma respectively.

9.1 Classical Statistics

The classical statistics techniques we all know (or at least remember we were once taught) are the z-test, t-test and chi-square test. They allow us to estimate the mean and variance of a random variable for which we have some randomly sampled data, as well as to address a number of other problems. I'm going to offer some fairly simple ways of understanding these statistical tests, but I first want to explain why the "tests" aren't much good to us as risk analysts in their standard form.
Let's take a typical t-test result: it will say something like the true mean = 9.63 with a 95 % confidence interval of [9.32, 9.94], meaning that we are 95 % sure that the true mean lies between 9.32 and 9.94. It doesn't mean that there is a 95 % probability that it will lie within these values - it either does or does not; what we are describing is how well we (the data holders, i.e. it is subjective) know the mean value. In risk analysis I may have several such parameters in my model. Let's say we have just three such parameters A, B and C estimated from different datasets, each with its best estimate and 95 % confidence bound. Let the model be A * B^(1/C). How can I combine these numbers to make an estimate of the uncertainty of my calculation? The answer is I can't. However, if I could convert the estimates to distributions I could perform a Monte Carlo simulation and get the answer at any confidence interval, or any percentile the decision-maker wishes. Thus, we have to convert these classical tests to distributions of uncertainty.

The classical statistics tests above are based on two basic statistical principles:

1. The pivotal method. This requires that I rearrange an equation so that the parameter being estimated is separated from any random variable.
2. A sufficient statistic. This means a sample statistic calculated from the data that contains all the information in the data that is related to estimating the parameter.

I'll use these ideas to explain the tests above and how they can be converted to uncertainty distributions.

9.1.1 The z-test

The z-test allows us to determine the best estimate and confidence interval for the mean of a normally distributed population where we happen to know the standard deviation of that population.
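To make the point concrete, here is a minimal Python sketch of that conversion. Only A's best estimate and interval (9.63, [9.32, 9.94]) come from the text; the figures for B and C, and the choice of normal uncertainty distributions, are hypothetical assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

def normal_from_ci(best, lo, hi, size):
    """Turn a best estimate and 95 % confidence interval into a normal
    uncertainty distribution: sd = half-width / 1.959964."""
    sd = (hi - lo) / (2 * 1.959964)
    return rng.normal(best, sd, size)

A = normal_from_ci(9.63, 9.32, 9.94, N)  # the t-test example from the text
B = normal_from_ci(4.0, 3.5, 4.5, N)     # hypothetical parameter B
C = normal_from_ci(2.0, 1.6, 2.4, N)     # hypothetical parameter C

# The model A * B^(1/C): propagating all three uncertainties together
model = A * B ** (1 / C)

# Any percentile of the output uncertainty is now available to the decision-maker
print(np.percentile(model, [2.5, 50, 97.5]))
```

The point is that once each parameter is a distribution rather than a three-number summary, the output uncertainty at any percentile falls out of the simulation directly.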
That would be quite an unusual situation, since the mean is usually more fundamental than the standard deviation, but it does occur sometimes; for example, when we take repeated measurements of some quantity (like the length of a room, or the weight of a beam). In this situation the random variable is not the length of the room, etc., but the results we will get. Look at the manual of a scientific measuring instrument and it should tell you the accuracy (e.g. ±1 mm). Sadly, the manufacturers don't usually tell us how to interpret these values - will the measurement lie within 1 mm of the true value 68 % (one standard deviation), 95 % (two standard deviations), etc., of the time? If the instrument manual were to say the measurement error has a standard deviation of 1 mm, we could apply the z-test. Let's say we are measuring some fixed quantity and that we take n such measurements. The sample mean is given by the formula

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

Here \bar{x} is the sufficient statistic. If the errors are normally distributed with mean \mu and standard deviation \sigma we have

\bar{x} = \mu + \text{Normal}(0, 1)\,\frac{\sigma}{\sqrt{n}}

Note how we have managed to rearrange the equation to place the random element Normal(0, 1) apart from the parameter we are trying to estimate. Now, thanks to the pivotal method, we can rearrange to make \mu the focus:

\mu = \bar{x} + \text{Normal}(0, 1)\,\frac{\sigma}{\sqrt{n}}    (9.1)

In the z-test we would have specified a confidence interval, say 95 %, and then looked up the "z-score" values for a Normal(0, 1) distribution that would correspond to 2.5 % and 97.5 % (i.e. centrally positioned values with 95 % between them), which are -1.95996 and +1.95996 respectively.¹ Then we'd write

\mu_{2.5\%} = \bar{x} - 1.95996\,\frac{\sigma}{\sqrt{n}}, \quad \mu_{97.5\%} = \bar{x} + 1.95996\,\frac{\sigma}{\sqrt{n}}

to get the lower and upper bounds respectively.

¹ You can get these values with ModelRisk using VoseNormal(0, 1, 0.025) and VoseNormal(0, 1, 0.975), or in Excel with =NORMSINV(0.025) and =NORMSINV(0.975).
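The z-test bounds, and the simulation version of the same uncertainty, can be sketched as follows (a Python sketch; the measurement data are hypothetical, and sigma = 1 mm follows the instrument-manual example):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical repeated measurements (mm) with known error sd sigma = 1 mm
sigma = 1.0
data = np.array([1002.3, 1001.1, 1003.0, 1002.2, 1001.8, 1002.6])
n, xbar = len(data), data.mean()

# Classical z-test: fixed 95 % bounds
z = 1.959964
lo = xbar - z * sigma / np.sqrt(n)
hi = xbar + z * sigma / np.sqrt(n)

# Risk-analysis version: simulate the whole uncertainty distribution
# mu = xbar + Normal(0,1) * sigma / sqrt(n)
mu = xbar + rng.standard_normal(100_000) * sigma / np.sqrt(n)

print(lo, hi)
print(np.percentile(mu, [2.5, 97.5]))  # closely matches (lo, hi)
```

The simulated 2.5th and 97.5th percentiles reproduce the classical bounds, but the full distribution of mu can now be fed into any downstream model.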
In a risk analysis simulation we just use

\mu = \text{Normal}\left(\bar{x}, \frac{\sigma}{\sqrt{n}}\right)

9.1.2 The chi-square test

The chi-square (χ²) test allows us to determine the best estimate and confidence interval for the standard deviation of a normally distributed population. There are two situations: we either know the mean \mu or we don't. Knowing the mean seems like an unusual scenario but happens, for example, when we are calibrating a measuring device against some known standard. In this case, the formula for the sample variance is given by

\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2

The sample variance in this case is the sufficient statistic for the population variance. Rewriting to get a pivotal quantity, we have

\frac{n\hat{\sigma}^2}{\sigma^2} = \sum_{i=1}^{n}\text{Normal}(0, 1)^2

However, the sum of n unit normal distributions squared is the definition of a chi-square distribution. Rearranging, we get

\sigma^2 = \frac{n\hat{\sigma}^2}{\chi^2(n)}    (9.2)

A χ²(n) distribution has mean n, so this formula is simply multiplying the sample variance by a random variable with mean 1. The chi-square test finds, say, the 2.5 and 97.5 percentiles² and inserts them into the above equation. For example, these percentiles for 10 degrees of freedom are 3.247 and 20.483. Since we are dividing by the chi-square random variable, the upper estimate corresponds to the lower chi-square value, and vice versa:

\sigma^2_{97.5\%} = \frac{n\hat{\sigma}^2}{3.247}, \quad \sigma^2_{2.5\%} = \frac{n\hat{\sigma}^2}{20.483}

In risk analysis modelling we would instead simulate values for \sigma using Equation (9.2):

\sigma = \sqrt{\frac{n\hat{\sigma}^2}{\chi^2(n)}}

² In ModelRisk use VoseChiSq(n, 0.025) and VoseChiSq(n, 0.975), and in Excel use CHIINV(0.975, n) and CHIINV(0.025, n) respectively.

Now let's consider what happens when we don't know the population mean, in which case statistical convention says that we use a slightly different formula for the sample variance measure:

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2

However, for a normal distribution it turns out that

\frac{(n-1)s^2}{\sigma^2} = \chi^2(n-1)

Rearranging, we get

\sigma^2 = \frac{(n-1)s^2}{\chi^2(n-1)}

9.1.3 The t-test

The t-test allows us to determine the best estimate and confidence interval for the mean of a normally distributed population where we don't know its standard deviation.
From Equation (9.1) we had the result when the population variance was known:

\mu = \bar{x} + \text{Normal}(0, 1)\,\frac{\sigma}{\sqrt{n}}

and from the chi-square analysis above we had the estimate for the variance when the mean is unknown:

\sigma^2 = \frac{(n-1)s^2}{\chi^2(n-1)}

Substituting for \sigma, we get

\mu = \bar{x} + \text{Normal}(0, 1)\sqrt{\frac{n-1}{\chi^2(n-1)}}\,\frac{s}{\sqrt{n}}

The definition of a Student(ν) distribution is a normal distribution with mean 0 and variance following a random variable ν/ChiSq(ν), so we have

\mu = \bar{x} + \text{Student}(n-1)\,\frac{s}{\sqrt{n}}    (9.4)

Knowing that the Student t-distribution is just a unit normal distribution with some randomness about its variance explains why a Student distribution has longer tails than a normal. The Student(ν) distribution has variance ν/(ν − 2), ν > 2, so at ν = 3 the variance is 3 and it rapidly decreases, so that by ν = 30 it is only 1.07 (a standard deviation of 1.035), and for ν = 50 a standard deviation of 1.02. The practical implication is that, when you have, say, 50 data points, there is only a 2 % difference in the confidence interval range whether you use a t-test (Equation (9.4)) or approximate with a z-test (Equation (9.1)), using the sample standard deviation s in place of \sigma.

9.1.4 Estimating a binomial probability or a proportion

In many problems we need to determine a binomial probability (e.g. the probability of a flood in a certain week of the year) or a proportion (e.g. the proportion of components that are made to a certain tolerance). In estimating both, we collect data. Each measurement point is a random variable that has a probability p of having the characteristic of interest. If all measurements are independent, and we assign a value to the measurement of 1 when the measurement has the characteristic of interest and 0 when it does not, the measurements can be thought of as a set of Bernoulli trials.
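Before turning to the binomial case, the chi-square and t-test conversions above can be sketched in the same simulation style (a Python sketch; the dataset is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000

# Hypothetical sample with both mean and sd unknown
data = np.array([9.4, 9.9, 9.6, 9.8, 9.5, 9.7, 9.3, 9.8, 9.6, 9.9])
n, xbar, s = len(data), data.mean(), data.std(ddof=1)

# Uncertainty about sigma: sigma^2 = (n-1) s^2 / ChiSq(n-1)
sigma = np.sqrt((n - 1) * s**2 / rng.chisquare(n - 1, N))

# Uncertainty about mu (the t-test form): mu = xbar + Student(n-1) * s / sqrt(n)
mu = xbar + rng.standard_t(n - 1, N) * s / np.sqrt(n)

print(np.percentile(sigma, [2.5, 97.5]))
print(np.percentile(mu, [2.5, 97.5]))
```

Both parameters now arrive as full uncertainty distributions rather than single intervals, ready to be sampled jointly inside a larger model.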
Letting P be the random variable of the proportion of this set of n trials {X_i} that have the characteristic of interest, it will take a distribution given by

P = \frac{\text{Binomial}(n, p)}{n}    (9.5)

We observe the proportion \hat{p} of the n trials that have the characteristic of interest, which is our one observation from the random variable P and also our MLE (see later) and unbiased estimate for p. Switching around Equation (9.5), we can get an uncertainty distribution for the true value of p:

p = \frac{\text{Binomial}(n, \hat{p})}{n}    (9.6)

We shall see later how this exactly equates to the non-parametric and parametric bootstrap estimates of a binomial probability. Equation (9.6) is a bit awkward since it will allow only (n + 1) discrete values for p, i.e. {0, 1/n, 2/n, ..., (n − 1)/n, 1}, whereas our uncertainty about p should really take into account all values between zero and 1. However, a Binomial(n, \hat{p}) has a mean and standard deviation given by

\text{mean} = n\hat{p}, \quad \text{sd} = \sqrt{n\hat{p}(1-\hat{p})}

and, from the central limit theorem, as n gets large the proportion of observations P will tend to a normal distribution, in which case Equation (9.6) can be rewritten as

p = \text{Normal}\left(\hat{p}, \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)

Equation (9.6) gives us what is known as the "exact binomial confidence interval", which is an awful name in my view because it actually gives us bounds for which we have at least the required confidence that the true value of p lies within. We never use this method. Another classical statistics method is to construct a cumulative uncertainty distribution, which is far more useful. We start by saying that, if we've observed s successes in n trials, the confidence that the true value of the probability is less than some value x is given by

F(x) = P(Y \ge s)

where Y = Binomial(n, x). In Excel we would write

= 1 - BINOMDIST(s - 1, n, x, TRUE)

By varying the value x from 0 to 1, we can construct the cumulative confidence. For example, Figure 9.1 shows examples with n = 10.

Figure 9.1 Cumulative distributions of the estimate of p for n = 10 trials and varying numbers of successes s.

This is an interesting method.
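The cumulative-confidence construction can be sketched directly from its definition (a Python sketch; `conf_p_below` is a hypothetical helper name implementing P(Binomial(n, x) >= s)):

```python
from math import comb

def conf_p_below(x, s, n):
    """Confidence that the true binomial probability is below x, having
    observed s successes in n trials: P(Binomial(n, x) >= s)."""
    return sum(comb(n, k) * x**k * (1 - x)**(n - k) for k in range(s, n + 1))

# Trace the curves of the kind shown in Figure 9.1 for n = 10 trials
n = 10
for s in (0, 2, 5, 8, 10):
    curve = [round(conf_p_below(x / 20, s, n), 3) for x in range(0, 21, 5)]
    print(s, curve)
```

Note that for s = 0 this raw formula is identically 1 over the whole range, which is exactly the degenerate behaviour the next paragraph discusses.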
Look at the scenario for s = 0: the cumulative distribution starts with a value of 50 % at p = 0, so it is saying that, with no successes observed, we have 50 % confidence that there is no binomial process at all - trials can't become successes - and the remaining 50 % confidence is distributed over p = (0, 1). The reverse logic applies where s = n. In ModelRisk we have a function VoseBinomialP(s, n, ProcessExists, U), where you input the successes s and trials n and, in the situation where s = 0 or n, you have the option to specify whether you know that the probability lies within (0, 1) (ProcessExists = TRUE). The U parameter also allows you to specify a cumulative percentile - if omitted, the function simulates random values of what the value of p might be. So, for example:

VoseBinomialP(10, 20, TRUE, 0.99) = VoseBinomialP(10, 20, FALSE, 0.99) = 0.74605
VoseBinomialP(0, 20, TRUE, 0.99) = 0.02522 (it assumes that p cannot be zero)
VoseBinomialP(0, 20, FALSE, 0.4) = 0 (it allows that p could be zero)

9.1.5 Estimating a Poisson intensity

In a Poisson process, countable events occur randomly in time or space - like earthquakes, financial crashes, car crashes, epidemics and customer arrivals. We need to estimate the base rate \lambda at which these events occur. So, for example, a city of 500 000 people may have had a murders last year: perhaps that was unluckily high, or luckily low. We'd like to know the degree of accuracy that we can place around the statement "The risk is a murders per year". Following a classical statistics approach similar to Section 9.1.4, we could write

\lambda = \frac{\text{Poisson}(a)}{1}

where 1 refers to the single year of counting. We could recognise that a Poisson(a) distribution has mean and variance equal to a and looks normal when a is large:

\lambda = \frac{\text{Normal}(a, \sqrt{a})}{1}

The method suffers the same problems as the binomial: if we haven't yet observed any murders this year, the formulae don't work.
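A quick sketch of these two uncertainty simulations (Python; the observed count a = 14 murders in t = 1 year is a hypothetical figure):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 100_000

a, t = 14, 1.0  # hypothetical: 14 events observed over one year

# Normal approximation (valid for large a): lambda = Normal(a, sqrt(a)) / t
lam_norm = rng.normal(a, np.sqrt(a), N) / t

# Poisson-resampling version, the direct analogue of the binomial Equation (9.6)
lam_pois = rng.poisson(a, N) / t

print(np.percentile(lam_norm, [2.5, 97.5]))
print(np.percentile(lam_pois, [2.5, 97.5]))
```

With a = 0 both recipes collapse to a point at zero, which is the failure the text describes.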
A classical statistics alternative is again to construct the cumulative confidence distribution using

F(\lambda) = P(X \ge a), \text{ where } X = \text{Poisson}(\lambda t)

Figure 9.2 shows some examples of the cumulative distribution that can be constructed from this formula.

Figure 9.2 Cumulative distributions of the estimate of \lambda for varying numbers of observations a.

In ModelRisk there is a function VosePoissonLambda(a, t, ProcessExists, U), where you input the counts a and the time t over which they have been observed, and in the situation where a = 0 you have the option to specify whether you know that the intensity is non-zero (ProcessExists = TRUE). The U parameter also allows you to specify a cumulative percentile - if omitted, the function simulates random values of what the value of \lambda might be. So, for example:

VosePoissonLambda(2, 3, TRUE, 0.2) = VosePoissonLambda(2, 3, FALSE, 0.2)
VosePoissonLambda(0, 3, TRUE, 0.2) = 0.203324 (it assumes that \lambda cannot be zero)
VosePoissonLambda(0, 3, FALSE, 0.2) = 0 (it allows that \lambda could be zero)

9.2 Bayesian Inference

The Bayesian approach to statistics has enjoyed something of a renaissance over the latter half of the twentieth century, but there still remains a schism among the scientific community over the Bayesian position. Many scientists, and particularly many classically trained statisticians, believe that science should be objective and therefore dislike any methodology that is based on subjectivism. There are, of course, a host of counterarguments. Experimental design is subjective to begin with; classical statistics are limited in that they make certain assumptions (normally distributed errors or populations, for example), and scientists have to use their judgement in deciding whether such an assumption is sufficiently well met; moreover, at the end of a statistical analysis one is often asked to accept or reject a hypothesis by picking (quite subjectively) a level of significance (p values).
For the risk analyst, subjectivism is a fact of life. Each model one builds is only an approximation of the real world. Decisions about the structure and acceptable accuracy of the risk analyst's model are very subjective. Added to all this, the risk analyst must very often rely on subjective estimates for many model inputs, frequently without any data to back them up. Bayesian inference is an extremely powerful technique, based on Bayes' theorem (sometimes called Bayes' formula), for using data to improve one's estimate of a parameter. There are essentially three steps involved: (1) determining a prior estimate of the parameter in the form of a confidence distribution; (2) finding an appropriate likelihood function for the observed data; (3) calculating the posterior (i.e. revised) estimate of the parameter by multiplying the prior distribution and the likelihood function, then normalising so that the result is a true distribution of confidence (i.e. the area under the curve equals 1). The first part of this section introduces the concept and provides some simple examples. The second part explains how to determine prior distributions. The third part looks more closely at likelihood functions, and the fourth part explains how normalising of the posterior distribution is carried out.

9.2.1 Introduction

Bayesian inference is based on Bayes' theorem (Section 6.3.5), the logic of which was first proposed in Bayes (1763). Bayes' theorem states that

P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}

We will change the notation of that formula, for the purpose of explaining Bayesian inference, to a notation often used in the Bayesian world:

f(\theta|x) = \frac{\pi(\theta)\,l(x|\theta)}{\int \pi(\theta)\,l(x|\theta)\,d\theta}    (9.8)

Bayesian inference mathematically describes the learning process. We start off with an opinion, however vague, and then modify our opinion when presented with evidence. The components of Equation (9.8) are:

\pi(\theta) - the "prior distribution". \pi(\theta) is the density function of our prior belief about the parameter value \theta before we have observed the data x.
In other words, \pi(\theta) is not a probability distribution of \theta but rather an uncertainty distribution: it is an adequate representation of the state of our knowledge about \theta before the data x were observed.

l(x|\theta) - the "likelihood function". l(x|\theta) is the calculated probability of randomly observing the data x for a given value of \theta. The shape of the likelihood function embodies the amount of information contained in the data. If the information it contains is small, the likelihood function will be broadly distributed, whereas if the information it contains is large, the likelihood function will be very focused around some particular value of the parameter. However, if the shape of the likelihood function corresponds strongly to the prior distribution, the amount of extra information the likelihood function embodies is relatively small and the posterior distribution will not differ greatly from the prior. In other words, one would not have learned very much from the data. On the other hand, if the shape of the likelihood function is very different from the prior, we will have learned a lot from the data.

f(\theta|x) - the "posterior distribution". f(\theta|x) is the description of our state of knowledge of \theta after we have observed the data x, given our opinion of the value of \theta before x was observed.

The denominator in Equation (9.8) simply normalises the posterior distribution to have a total area equal to 1. Since the denominator is simply a scalar value and not a function of \theta, one can rewrite Equation (9.8) in a form that is generally more convenient:

f(\theta|x) \propto \pi(\theta)\,l(x|\theta)    (9.9)

The \propto symbol means "is proportional to", so this equation shows that the value of the posterior distribution density function, evaluated at some value of \theta, is proportional to the product of the prior distribution density function at that value of \theta and the likelihood of observing the dataset x if that value of \theta were the parameter's true value.
It is interesting to observe that Bayesian inference is thus not interested in the absolute values of the prior and likelihood function, but only their shapes. In writing equations of the form of Equation (9.9), we are taking as read that one will eventually have to normalise the distribution. Bayesian inference seems to confuse a lot of people rather quickly. I have found that the easiest way to understand it, and to explain it, is through examples.

Example 9.1

I have three "loonies" (Canadian one dollar coins - they have a loon on the tail face) in my pocket. Two of them are regular coins, but the third is a weighted coin that has a 70 % chance of landing heads up. I cannot tell the coins apart on inspection. I take a coin out of my pocket at random and toss it - it lands heads up. What is the probability that the coin is the weighted coin? Let's start by noting that the probability, as I have defined the term probability in Chapter 6.2, that the coin is the weighted one is either 0 or 1: it either is not the weighted coin or it is. The problem should really be phrased "What confidence do I have that the tossed coin is weighted?", as I am only dealing with the state of my knowledge. When I took the coin out of my pocket but before I had tossed it, I would have said I was 1/3 confident that the coin in my hand was weighted, and 2/3 confident it was not weighted. My prior distribution \pi(\theta) for the state of the coin would thus look like Figure 9.3, i.e. a discrete distribution with two allowed values {not weighted, weighted} with confidences {2/3, 1/3} respectively. Now I toss the coin and it lands heads up. If the coin were fair, it would have a probability of 1/2 of landing that way. My confidence that I took out a fair coin from my pocket and then tossed a head (call it scenario A) is therefore proportional to my prior belief multiplied by the likelihood, i.e. 2/3 * 1/2 = 1/3.
On the other hand, I am also 1/3 confident that the coin could have been weighted, and then it would have had a probability of 7/10 of landing that way. My confidence that I took out the weighted coin from my pocket and then tossed a head (call it scenario B) is therefore proportional to 1/3 * 7/10 = 7/30.

Figure 9.3 Prior distribution for the weighted coin example: a Discrete({0, 1}, {2/3, 1/3}).

The two values 1/2 and 7/10 used for the probability of observing a head were conditional on the type of coin that was being tossed. These two values represent, in this problem, the likelihood function. We will look at some more general likelihood functions in the following examples. Now, we know that one of scenarios A and B must have actually occurred, since we did observe a head. We must therefore normalise my confidence for these two scenarios so that they add up to 1, i.e. 1/3 becomes 10/17 and 7/30 becomes 7/17. This normalising is the purpose of the denominator in Equation (9.8). I am now 10/17 confident that the coin is fair and 7/17 confident that it is weighted: I still think it more likely I tossed a fair coin than a weighted coin. Let us imagine that we toss the coin again and observe another head. How would this affect my confidence distribution of the state of the coin? Well, the posterior confidence of selecting a fair coin and observing two heads (scenario C) is proportional to 2/3 * 1/2 * 1/2 = 1/6. The posterior confidence of selecting the weighted coin and observing two heads (scenario D) is proportional to 1/3 * 7/10 * 7/10 = 49/300. Normalising these two, we get 50/99 and 49/99. Now I am roughly equally confident about whether I had tossed a fair or a weighted coin. Figure 9.4 depicts posterior distributions for the above example, plus the posterior distributions for a few more tosses of a coin where each toss resulted in a head. One can see that, as the number of observations (data) we have grows, our prior belief gets swamped by what the data say is really possible, i.e.
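The arithmetic of this example can be checked with exact fractions (a Python sketch of the Bayesian update described above):

```python
from fractions import Fraction

# Prior: two fair coins and one weighted (P(head) = 7/10) coin in the pocket
prior = {"fair": Fraction(2, 3), "weighted": Fraction(1, 3)}
p_head = {"fair": Fraction(1, 2), "weighted": Fraction(7, 10)}

def update(prior, heads):
    """Posterior over the coin's state after observing that many heads
    in that many tosses: multiply prior by likelihood, then normalise."""
    post = {c: prior[c] * p_head[c] ** heads for c in prior}
    total = sum(post.values())
    return {c: v / total for c, v in post.items()}

print(update(prior, 1))  # fair 10/17, weighted 7/17 - scenario A vs B
print(update(prior, 2))  # fair 50/99, weighted 49/99 - scenario C vs D
```

Working in `Fraction` rather than floats reproduces the text's 10/17, 7/17 and 50/99, 49/99 exactly.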
by the information contained in the data.

Figure 9.4 Posterior distributions for the coin tossing example with increasing numbers of heads (2 heads in 2 tosses, 3 in 3, 4 in 4, 5 in 5 and 10 in 10).

Example 9.2

A game warden on a tropical island would like to know how many tigers she has on her island. It is a big island with dense jungle and she has a limited budget, so she can't search every inch of the island methodically. Besides, she wants to disturb the tigers and the other fauna as little as possible. She arranges for a capture-recapture survey to be carried out as follows. Hidden traps are laid at random points on the island. The traps are furnished with transmitters that signal a catch, and each captured tiger is retrieved immediately. When 20 tigers have been caught, the traps are removed. Each of these 20 tigers is carefully sedated and marked with an ear tag, then all are released together back to the positions where they were originally caught. Some short time later, hidden traps are laid again, but at different points on the island, until 30 tigers have been caught, and the number of tagged tigers is recorded. Captured tigers are held in captivity until the 30th tiger has been caught. The game warden tries the experiment, and seven of the 30 tigers captured in the second set of traps are tagged. How many tigers are there on the island? The warden has gone to some lengths to specify the experiment precisely. This is so that we will be able to assume within reasonable accuracy that the experiment is taking a hypergeometric sample from the tiger population (Section 8.4). A hypergeometric sample assumes that an individual with the characteristic of interest (in this case, a tagged tiger) has the same probability of being sampled as any individual that does not have that characteristic (i.e. the untagged tigers).
The reader may enjoy thinking through what assumptions are being made in this analysis and where the experimental design has attempted to minimise any deviation from a true hypergeometric sampling. We will use the usual notation for a hypergeometric process:

n - the sample size = 30.
D - the number of individuals in the population of interest (tagged tigers) = 20.
M - the population (the number of tigers in the jungle). In the Bayesian inference terminology, this is given the symbol \theta as it is the parameter we are attempting to estimate.
x - the number of individuals in the sample that have the characteristic of interest = 7.

We could get a best guess for M by noting that the most likely scenario would be for us to see tagged tigers in the sample in the same proportion as they occur in the population. In other words

\frac{x}{n} \approx \frac{D}{M}, \text{ i.e. } \frac{7}{30} \approx \frac{20}{M}, \text{ which gives } M \approx 85 \text{ to } 86

but this does not take account of the uncertainty that occurs owing to the random sampling involved in the experiment. Let us imagine that before the experiment was started the warden and her staff believed that the number of tigers was equally likely to be any one value as any other. In other words, they knew absolutely nothing about the number of tigers in the jungle, and their prior distribution is thus a discrete uniform distribution over all non-negative integers. This is rather unlikely, of course, but we will discuss better prior distributions in Section 9.2.2. The likelihood function is given by the probability mass function of the hypergeometric distribution, i.e.

l(x|\theta) = \frac{\binom{D}{x}\binom{\theta - D}{n - x}}{\binom{\theta}{n}} \text{ for } \theta \ge 43, \text{ and } l(x|\theta) = 0 \text{ otherwise}

The likelihood function is 0 for values of \theta below 43, as the experiment tells us that there must be at least 43 tigers: 20 that were tagged plus the (30 - 7) that were caught in the recapture part of the experiment and were not tagged.
The probability mass function (Section 6.1.2) applies to a discrete distribution and equals the probability that exactly x events will occur. Excel provides a convenient function HYPGEOMDIST(x, n, D, M) that will calculate the hypergeometric distribution mass function automatically, but it generates errors instead of zero when \theta < 43, so I have used the equivalent ModelRisk function. Figure 9.5 illustrates a spreadsheet where a discrete uniform prior, with values of \theta running from 0 to 150, is multiplied by the likelihood function above to arrive at a posterior distribution. We know that the total confidence must add up to 1, which is done in column F to produce the normalised posterior distribution. The shape of this posterior distribution is shown in Figure 9.6 by plotting column B against column F from the spreadsheet. The graph peaks at a value of 85, as we would expect, but it appears cut off at the right tail, which shows that we should also look at values of \theta larger than 150. The analysis is repeated for values of \theta up to 300, and this more complete posterior distribution is plotted in Figure 9.7. This second plot represents a good model of the state of the warden's knowledge about the number of tigers on the island. Don't forget that this is a distribution of belief and is not a true probability distribution, since there is an exact number of tigers on the island.

Formulae table:
C3:C6      constants
B10:B117   {43, ..., 150}
C10:C117   1
D10:D117   =VoseHypergeoProb(x, n, D, B10)
E10:E117   =D10*C10
E7         =SUM(E10:E117)
F10:F117   =E10/$E$7

Figure 9.5 Bayesian inference model for the tiger capture-release-recapture problem.

Figure 9.6 First pass at a posterior distribution for the tagged tiger problem.

In this example, we had to adjust our range of tested values of \theta in light of the posterior distribution.
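The spreadsheet calculation of Figure 9.5 can be sketched outside Excel as well (a Python sketch; `hyper_pmf` is a hypothetical helper standing in for VoseHypergeoProb):

```python
from math import comb

D, n, x = 20, 30, 7  # tagged tigers, recapture sample size, tagged in sample

def hyper_pmf(x, n, D, M):
    """Hypergeometric likelihood of seeing x tagged tigers in a sample of n
    from a population of M containing D tagged ones; 0 where impossible."""
    if M < D + (n - x):
        return 0.0
    return comb(D, x) * comb(M - D, n - x) / comb(M, n)

thetas = range(43, 301)                    # candidate population sizes
prior = [1.0] * len(thetas)                # discrete uniform prior
like = [hyper_pmf(x, n, D, M) for M in thetas]
unnorm = [p * l for p, l in zip(prior, like)]
total = sum(unnorm)                        # normalising constant (column E7)
posterior = [u / total for u in unnorm]    # normalised posterior (column F)

mode = thetas[max(range(len(posterior)), key=posterior.__getitem__)]
print(mode)  # the posterior peaks at M = 85, as in the text
```

With a uniform prior, the posterior mode coincides with the maximum-likelihood estimate, which is why the grid peaks at the same value as the simple proportional guess.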
It is quite common to review the set of tested values of \theta, either expanding the prior's range or modelling some part of the prior's range in more detail when the posterior distribution is concentrated around a small range. It is entirely appropriate to expand the range of the prior as long as we would have been happy to have extended our prior to the new range before seeing the data. However, it would not be appropriate if we had a much more informed prior belief that gave an absolute range for the uncertain parameter outside of which we are now considering stepping. This would not be right because we would be revising our prior belief in light of the data: putting the cart before the horse, if you like. However, if the likelihood function is concentrated very much at one end of the range of the prior, it may well be worth reviewing whether the prior distribution or the likelihood function is appropriate, since the analysis could be suggesting that the true value of the parameter lies outside the preconceived range of the prior. Continuing with our tigers on an island, let us imagine that the warden is unsatisfied with the level of uncertainty that remains about the number of tigers, which, from 50 to 250, is rather large. She decides to wait a short while and then capture another 30 tigers. The experiment is completed, and this time t tagged tigers are captured. Assuming that a tagged tiger still has the same probability of being captured as an untagged tiger, what is her uncertainty distribution now for the number of tigers on the island?

Figure 9.7 Improved posterior distribution for the tagged tiger problem.

This is simply a replication of the first problem, except that we no longer use a discrete uniform distribution as her prior.
Instead, the distribution of Figure 9.7 represents the state of her knowledge prior to doing this second experiment, and the likelihood function is now given by the Excel function HYPGEOMDIST(t, 30, 20, θ), or equivalently VoseHypergeoProb(t, 30, 20, θ, 0). The six panels of Figure 9.8 show what the warden's posterior distribution would have been if the second experiment had trapped t = 1, 3, 5, 7, 10 and 15 tagged tigers. These posteriors are plotted together with the prior of Figure 9.7 and the likelihood functions, normalised to sum to 1 for ease of comparison. You might initially imagine that performing another experiment would make you more confident about the actual number of tigers on the island, but the graphs of Figure 9.8 show that this is not necessarily so. In the top two panels the posterior distribution is more spread than the prior because the data contradict the prior (the prior and likelihood peak at very different values of θ). In the middle left panel, the likelihood disagrees moderately with the prior, but the extra information in the data compensates for this, leaving us with about the same level of uncertainty but with a posterior distribution that is to the right of the prior. The middle right panel represents the scenario where the second experiment has the same result as the first. You'll see that the prior and likelihood overlay each other because the prior of the first experiment was uniform and therefore the posterior's shape was influenced only by the likelihood function. Since both experiments produced the same result, our confidence is improved and remains centred around the best guess of 85. In the bottom two panels, the likelihood functions disagree with the priors, yet the posterior distributions have a narrower uncertainty. This is because the likelihood function is placing emphasis on the left tail of the possible range of values for θ, which is bounded at θ = 43.
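The sequential updating in these panels is the same grid calculation run twice: the first posterior becomes the prior for the second experiment, whose likelihood is again hypergeometric (still 20 tagged tigers, a new sample of 30). Here is a Python sketch of the middle right panel's case (t = 7), with my own illustrative names, not the book's model:

```python
import math

def hypergeo_pmf(x, n, D, M):
    # hypergeometric mass, zero outside the support
    if x > D or x > n or n - x > M - D or n > M:
        return 0.0
    return math.comb(D, x) * math.comb(M - D, n - x) / math.comb(M, n)

def update(prior, x, n, D):
    # one Bayesian grid update: multiply by the likelihood, then renormalise
    post = {t: p * hypergeo_pmf(x, n, D, t) for t, p in prior.items()}
    norm = sum(post.values())
    return {t: p / norm for t, p in post.items()}

def sd(dist):
    # standard deviation of a discrete distribution given as {value: prob}
    m = sum(t * p for t, p in dist.items())
    return sum(p * (t - m) ** 2 for t, p in dist.items()) ** 0.5

uniform = {t: 1.0 for t in range(43, 301)}   # discrete uniform prior
post1 = update(uniform, 7, 30, 20)           # first experiment: 7 tagged of 30
post2 = update(post1, 7, 30, 20)             # second experiment, same result
```

Because the two experiments gave the same result, the posterior stays centred near 85 but is narrower than after the first experiment alone.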
In summary, the graphs of Figure 9.8 show that the amount of information contained in data depends on two things: (1) the manner in which the data were collected (i.e. the level of randomness inherent in the collection), which is described by the likelihood function; and (2) the state of our knowledge prior to observing the data and the degree to which it compares with the likelihood function. If the data tell us what we are already fairly sure of, there is little information contained in the data for us (though the data would contain much more information for those more ignorant of the parameter). On the other hand, if the data contradict what we already know, our uncertainty may either decrease or increase, depending on the circumstances.

Figure 9.8 Tagged tiger problem: panels (a), (b), (c), (d), (e) and (f) show prior distributions, likelihood functions and posterior distributions if the second experiment had trapped 1, 3, 5, 7, 10 and 15 tagged tigers respectively (prior distribution shown as empty circles, likelihood function as grey lines and posterior distributions as black lines).

Example 9.3 Twenty people are randomly picked off a city street in France. Whether they are male or female is noted on 20 identical pieces of paper, the papers are put into a hat and the hat is brought to me. I have not seen these 20 people. I take out five pieces of paper from the hat and read them - three are female. I am then asked to estimate the number of females in the original group of 20.
I can express my estimate as a confidence distribution of the possible values. I might argue that, prior to reading the five names, I had no knowledge of the number of people who would be female and so would assign a discrete uniform prior from 0 to 20. However, it would be better to argue that roughly 50% of people are female, and so a much better prior distribution would be a Binomial(20, 0.5). This is equivalent to a Duniform prior, followed by a Binomial(20, 0.5) likelihood for the number of females that would be randomly selected from a population in a sample of 20. The likelihood function relating to sampling five people from the group is again hypergeometric, except that in this problem we know the total population (i.e. M = 20), we know the sample size (n = 5) and we know the number observed in the sample with the required property (x = 3), but we don't know the number of females D, which we denote by θ as it is the parameter to be estimated. Figure 9.9 illustrates the spreadsheet model for this problem, using the binomial distribution prior. This spreadsheet makes use of ModelRisk's VoseBinomialProb(x, n, p, cumulative), equivalently the Excel function BINOMDIST(x, n, p, cumulative), which returns a probability evaluated at x for a Binomial(n, p) distribution. The cumulative parameter toggles the function between returning a probability mass (cumulative = 0 or FALSE) and a cumulative probability (cumulative = 1 or TRUE). The IF statement in cells D8:D28 is unnecessary with the VoseHypergeoProb function, which will return a zero, but is needed to avoid errors if you use Excel's HYPGEOMDIST function in its place. Figure 9.10 shows the resultant posterior distribution, together with the likelihood function and the prior. Here we can see that the prior is very strong and the amount of information embedded in the likelihood function is small, so the posterior distribution is quite close to the prior.
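The grid calculation of Figure 9.9 can be sketched in Python (my own illustrative code, not the book's):

```python
import math

def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def hypergeo_pmf(x, n, D, M):
    # zero outside the support - the job of the IF statement in the spreadsheet
    if x > D or n - x > M - D:
        return 0.0
    return math.comb(D, x) * math.comb(M - D, n - x) / math.comb(M, n)

M, n, x = 20, 5, 3                                   # population, sample, observed
thetas = range(21)                                   # column B: theta = 0..20
prior = [binom_pmf(t, 20, 0.5) for t in thetas]      # column C
like = [hypergeo_pmf(x, n, t, M) for t in thetas]    # column D
post = [pr * li for pr, li in zip(prior, like)]      # column E
norm = sum(post)                                     # cell E29
post = [p / norm for p in post]                      # column F

mean = sum(t * p for t, p in zip(thetas, post))
```

Note that the normalising constant comes out at exactly 0.3125, the value in cell E29, and the posterior is zero outside θ = 3 to 18, which is what the IF statement guards against.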
The posterior distribution is a sort of compromise between the prior and likelihood function, in that it finds a distribution that agrees as much as possible with both. Hence, the peak of the posterior distribution now lies somewhere between the peaks of the prior and the likelihood function. The effect of the likelihood function is small because the sample is small (a sample of 5) and because it does not disagree with the prior (the prior has a maximum at θ = 10, and this value of θ also produces one of the highest likelihood function values). For comparison, Figure 9.11 shows the prior and posterior distributions if one had used a discrete uniform prior. Since the prior is flat in this case, it contributes nothing to the posterior's shape and the likelihood function becomes the posterior distribution.

Figure 9.9 Bayesian inference model for the number of "females in a hat" problem. Formulae table:
C3:C4 constants (n, x)
B8:B28 {0, 1, ..., 19, 20}
C8:C28 =VoseBinomialProb(B8, 20, 0.5, 0)
D8:D28 =IF(OR(B8<x, B8>20-(n-x)), 0, VoseHypergeoProb(x, n, B8, 20))
E8:E28 =C8*D8
E29 =SUM(E8:E28) (which evaluates to 0.3125)
F8:F28 =E8/$E$29

Figure 9.10 Prior distribution, likelihood function and posterior distribution for the model of Figure 9.9 using a Binomial(20, 0.5) prior.

Figure 9.11 Prior and posterior distributions for the model of Figure 9.9 with a Duniform({0, ..., 20}) prior.

Hyperparameters

I assumed in Example 9.3 that the prevalence of females in France is 50%.
However, knowing that females on average live longer than males, this figure will be a slight underestimate. Perhaps I should have used a value of 51% or 52%. In Bayesian inference, I can include uncertainty about one or more of the parameters in the analysis. For example, I could model p with a PERT(50%, 51%, 52%). Uncertain parameters are called hyperparameters. In the algebraic form of a Bayesian inference calculation, I then integrate out this nuisance parameter, which in reality can be a bit tricky to carry out. Let's look again at the Bayesian inference calculation in the spreadsheet of Figure 9.9. If I have uncertainty about the prevalence of females p, I should assign a distribution to its value, in which case there would then be uncertainty about the posterior distribution. I cannot have uncertainty about my uncertainty: it doesn't make sense. This is why we must integrate out (i.e. aggregate) the effect of uncertainty about p on the posterior distribution. We can do this very easily using Monte Carlo simulation, instead of the more onerous algebraic integration. We simply include a distribution for p in our model, nominate the entire array for the posterior distribution as an output and simulate. The set of means of the generated values for each cell in the array constitutes the final posterior distribution.

Simulating a Bayesian inference calculation

We could have done the same Bayesian inference analysis for Example 9.3 by simulation. Figure 9.12 illustrates a spreadsheet model that performs the Bayesian inference, together with a plot of the model result. In cell C3, a Binomial(20, 0.5) distribution represents the prior.
It is randomly generating possible scenarios of the number of "females" in the hat. In cell C4 a sample of five people is modelled using a Hypergeo(5, D, 20), where D is the result from the binomial distribution. The IF statement here (=IF(C3=0, 0, VoseHypergeo(5, C3, 20))) is unnecessary because VoseHypergeo supports D = 0 but, for example, @RISK's RiskHypergeo(5, 0, 20) returns an error. This represents one-half of the likelihood function logic. Finally, in cell C5, the generated value from the binomial distribution in cell C3 is accepted (and therefore stored in memory) if the hypergeometric distribution produces a 3 - the number of females observed in the experiment. This is equivalent to the second half of the likelihood function logic. By running a large number of iterations, a large number of generated values from the binomial will be accepted. The proportion of times that a particular value from the binomial distribution is accepted equates to the hypergeometric probability that three females would subsequently be observed in a random sample of five from the group. I ran this model for 100,000 iterations, and 31,343 values were accepted, which equates to about 31% of the iterations.

Figure 9.12 Simulation model for the problem of Figure 9.9.
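The accept/reject logic is easy to reproduce in Python (an illustrative stand-in for the spreadsheet model, with my own names): draw the number of females from the prior, sample five slips without replacement, and keep the prior draw only when exactly three females appear.

```python
import random

random.seed(7)
iterations = 20000
accepted = []
for _ in range(iterations):
    # prior draw (cell C3): number of females among 20 people, each 50:50
    females = sum(random.random() < 0.5 for _ in range(20))
    # likelihood draw (cell C4): pull 5 of the 20 slips without replacement;
    # females = 0 poses no problem here, unlike RiskHypergeo
    slips = [1] * females + [0] * (20 - females)
    observed = sum(random.sample(slips, 5))
    if observed == 3:                 # the acceptance test (cell C5)
        accepted.append(females)

rate = len(accepted) / iterations
post_mean = sum(accepted) / len(accepted)
```

The acceptance rate converges on 0.3125, matching the roughly 31% reported above, and the accepted values trace out the posterior distribution.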
The technique is interesting but has limited applications, since, for more complex problems or those with larger numbers, it becomes very inefficient as the percentage of iterations that are accepted becomes very small indeed. It is also difficult to use where the parameter being estimated is continuous rather than discrete, in which case one is forced to use a logic that accepts the generated prior value if the generated result lies within some range of the observed result. However, to combat this inefficiency, one can alter the prior distribution to generate only values that the experimental results have shown to be possible. For example, in this problem, there must be between 3 and 18 females in the group of 20, whereas the Binomial(20, 0.5) is generating values between 0 and 20. Furthermore, one could run several passes, cutting down the prior with each pass to home in on only those values that are feasible. One can also get more detail in the tails by multiplying up the mass of some values x, y, z (for example, in the tails of the prior) by some factor, then dividing the heights of the posterior tail at x, y and z by that factor. While this technique consumes a lot of simulation time, the models are very simple to construct and one can also consider multiple-parameter priors.

Let us look again at the choice of priors for this problem, i.e. either a Duniform({0, ..., 20}) or a Binomial(20, 50%). One might consider that the Duniform distribution is less informed (i.e. says less) than the binomial distribution. However, we can turn the Duniform distribution around and ask what it would have said about our prior belief of the probability p of a person randomly selected from the French population being female. We can show that a uniform assumption for p translates to a Duniform distribution of females in a group, as follows. Let s_n be the number of successes in n Bernoulli trials, where θ is the unknown probability of success of a trial.
Then the probability that s_n = r, r = {0, 1, 2, ..., n}, is given by the de Finetti theorem:

P(s_n = r) = \int_0^1 \binom{n}{r} \theta^r (1 - \theta)^{n-r} f(\theta) \, d\theta

where f(θ) is the probability density function for the uncertainty distribution for θ. The formula simply calculates, for any value of r, the binomial probability of observing r successes, integrated over the uncertainty distribution for the binomial probability θ. If we use a Uniform(0, 1) distribution to describe our uncertainty about θ, then f(θ) = 1:

P(s_n = r) = \binom{n}{r} \int_0^1 \theta^r (1 - \theta)^{n-r} \, d\theta

The integral is a beta function and, for integer values of r and n, we have the standard identity

\int_0^1 \theta^r (1 - \theta)^{n-r} \, d\theta = \frac{r!\,(n - r)!}{(n + 1)!}

Thus,

P(s_n = r) = \frac{n!}{(n - r)!\,r!} \cdot \frac{r!\,(n - r)!}{(n + 1)!} = \frac{1}{n + 1}

So each of the n + 1 possible values {0, 1, 2, ..., n} has the same likelihood of 1/(n + 1). In other words, using a Duniform prior for the number of females in a group equates to saying that we are equally confident that the true probability of an individual from the population being female is any value between 0 and 1.

Example 9.4

A magician has three cups turned over on his table. Under one of the cups you see him put a pea. With much ceremony, he changes the cups around in a dazzling swirl. He then offers you a bet to pick which cup the pea is under. You pick one. He then shows you under one of the other cups - empty. The magician asks you whether you would like to swap your choice for the third, untouched cup. What is your answer? Note that the magician knows which cup has the pea and would not turn it over. In this problem, until the magician turns over a cup, we are equally sure about which cup has the pea, so our prior confidence assigns equal weighting to the three cups. We now need to calculate the probability of what was observed if the pea had been under each of the cups in turn. We can label the three cups as A for the cup I chose, B for the cup the magician chose and C for the remaining cup. Let's start with the easy cup, B.
What is the probability that the magician would turn over cup B if he knew the pea was under cup B? Answer: 0, because he would have spoiled the trick. Next, look at the untouched cup, C. What is the probability that the magician would turn over cup B if he knew the pea was under cup C? Answer: 1, since he had no choice as I had already picked A, and C contained the pea. Now look at my cup, A. What is the probability that the magician would turn over cup B if he knew the pea was under cup A? Answer: 1/2, since he could have chosen to turn over either B or C. Thus, from Bayes' theorem,

P(C|X) = \frac{P(X|C)P(C)}{P(X|A)P(A) + P(X|B)P(B) + P(X|C)P(C)}

where P(A) = P(B) = P(C) = 1/3 are the confidences we assign to the three cups before observing the data X (i.e. the magician turning over cup B), and P(X|A) = 0.5, P(X|B) = 0 and P(X|C) = 1. Thus, P(A|X) = 1/3, P(B|X) = 0 and P(C|X) = 2/3. So, after having made our choice of cup and then watching the magician turn over one of the other two cups, we should always change our mind and pick the third cup, as we should now be twice as confident that the untouched cup contains the pea as the one we originally chose. The result is a little hard for many people to believe: the obstinate among us would like to stick to our original choice, and it does not seem that the probability that our chosen cup contains the pea can really have changed. Indeed, the probability has not changed after the magician's selection: it remains either 0 or 1, depending on whether we picked the right cup. What has changed is our confidence (the state of our knowledge) about whether that probability is 1. Originally, we had a 1/3 confidence that the pea was under our cup, and that has not changed. There is another way to think of the same problem: we had 1/3 confidence in our original choice of cup and 2/3 in the other two cups, and we also knew that one of those other cups did not contain the pea, so the 2/3 migrated to the remaining cup that was not turned over.
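The 1/3 versus 2/3 split is easy to check by simulation. This illustrative Python script (not from the book) plays the game many times and compares sticking with switching:

```python
import random

random.seed(3)
trials = 30000
stick_wins = switch_wins = 0
for _ in range(trials):
    pea = random.randrange(3)       # cup hiding the pea
    choice = random.randrange(3)    # the cup I pick (A)
    # the magician turns over a cup that is neither mine nor the pea's (B)
    shown = random.choice([c for c in range(3) if c != choice and c != pea])
    # the remaining untouched cup (C)
    other = next(c for c in range(3) if c != choice and c != shown)
    stick_wins += (choice == pea)
    switch_wins += (other == pea)

stick_rate = stick_wins / trials
switch_rate = switch_wins / trials
```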
This exercise is known as the Monty Hall problem - Wikipedia has a nice explanatory page, and www.stat.sc.edu/~west/javahtml/letsMakeaDeal.html has a nice simulation applet to test out the answer.

Exercise 9.1: Try repeating this problem where there are (a) four cups and one pea, and (b) five cups and two peas. Each time you get to select a cup, and each time the magician turns one of the others over.

9.2.2 Prior distributions

As we have seen above, the prior distribution is the description of one's state of knowledge about the parameter in question prior to observation of the data. Determination of the prior distribution is the primary focus for criticism of Bayesian inference, and one needs to be quite sure of the effects of choosing one particular prior over another. This section describes three different types of prior distribution: the uninformed prior, the conjugate prior and the subjective prior. We will look at the practical reasons for selecting each type and arguments for and against each selection. An argument presented by frequentist statisticians (i.e. those who use only traditional statistical techniques) is that the Bayesian inference methodology is subjective. A frequentist might argue that, because we use prior distributions representing the state of one's belief prior to accumulation of data, Bayesian inference may easily produce quite different results from one practitioner to the next, because they can choose quite different priors. This is, of course, true - in principle. It is both one of the strengths of the technique and certainly its Achilles' heel. On the one hand, it is very useful in a statistical technique to be able to include one's prior experience and knowledge of the parameter, even if that is not available in a pure data form. On the other hand, one party could argue that the resultant posterior distribution produced by another party was incorrect. The solution to this dilemma is, in principle, fairly simple.
If the purpose of the Bayesian inference is to make internal decisions within your organisation, you are very much at liberty to use any experience you have available to determine your prior. On the other hand, if the result of your analysis is likely to be challenged by a party with a conflicting agenda to your own, you may be better off choosing an "uninformed" prior, i.e. one that is neutral in that it provides no extra information. All that said, in the event that one has accumulated a reasonable dataset, the controversy regarding selection of priors disappears, as the prior is overwhelmed by the information contained in the data. It is important to specify a prior with a sufficiently large range to cover all possible true values for the parameter, as we saw in Figure 9.6. Failure to specify a wide enough prior will curtail the posterior distribution, although this will nearly always be apparent when plotting the posterior distribution, and a correction can be made. The only time it may not be apparent that the prior range is inadequate is when the likelihood function has more than one peak, in which case one might have extended the range of the prior to show the first peak but no further.

Uninformed priors

An uninformed prior has a distribution that would be considered to add no information to the Bayesian inference, except to specify the possible range of the parameter in question. For example, a Uniform(0, 1) distribution could be considered an uninformed prior when estimating a binomial probability because it states that, prior to collection of any data, we consider every possible value for the true probability to be as likely as every other. An uninformed prior is often desirable in the development of public policy to demonstrate impartiality.
Laplace (1812), who also independently stated Bayes' theorem (Laplace, 1774) 11 years after Bayes' essay was published (he apparently had not seen Bayes' essay), proposed that public policy priors should assume all allowable values to have equal likelihood (i.e. uniform or Duniform distributions). At first glance, then, it might seem that uninformed priors will just be uniform distributions running across the entire range of possible values for the parameter. That this is not true can be easily demonstrated with the following example. Consider the task of estimating the true mean number of events per unit exposure λ of a Poisson process. We have observed a certain number of events within a certain period, which we can use to give us a likelihood function very easily (see Example 9.6). It might seem reasonable to assign a Uniform(0, z) prior to λ, where z is some large number. However, we could just as easily have parameterised the problem in terms of β, the mean exposure between events. Since β = 1/λ, we can quickly check what a Uniform(0, z) prior for λ would look like as a prior for β by running a simulation on the formula β = 1/Uniform(0, z). Figure 9.13 shows the result of such a simulation. It is alarmingly far from being uninformed with respect to β! Of course, the reverse equally applies: if we had performed a Bayesian inference on β with a uniform prior, the prior for λ would be just as far from being uninformed.

Figure 9.13 Distribution resulting from the formula β = 1/Uniform(0, 20).

The probability density function for the prior distribution of a parameter must be known in order to perform a Bayesian inference calculation. However, one can often choose between a number of different parameterisations that would equally well describe the same stochastic process.
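The simulation behind Figure 9.13 takes one line, and a quick check (illustrative Python, not from the book) shows just how lopsided the implied prior on β is: with λ ~ Uniform(0, 20), 95% of the probability for β = 1/λ is squeezed below 1, even though β can take arbitrarily large values.

```python
import random

random.seed(11)
z = 20
# beta = 1/lambda where lambda ~ Uniform(0, z), as in Figure 9.13
betas = [1 / random.uniform(0, z) for _ in range(100_000)]

frac_below_1 = sum(b < 1 for b in betas) / len(betas)
# analytically P(beta < 1) = P(lambda > 1) = (z - 1)/z = 0.95,
# which is hardly "uninformed" with respect to beta
```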
For example, one could describe a Poisson process by λ, the mean number of events per unit exposure; by β, the mean exposure between events, as above; or by P(x > 0), the probability of at least one event in a unit of exposure. The Jacobian transformation lets us calculate the prior distribution for a Bayesian inference problem after reparameterising. If x is the original parameter with probability density function f(x) and cumulative distribution function F(x), and y is the new parameter with probability density function f(y) and cumulative distribution function F(y), related to x by some function such that x and y increase monotonically, then we can equate the changes dF(y) and dF(x), i.e.

f(y) \, dy = f(x) \, dx

Rearranging a little, we get

f(y) = f(x) \left| \frac{dx}{dy} \right|

where |dx/dy| is known as the Jacobian. So, for example, if x = Uniform(0, c) and y = 1/x, then

\frac{dx}{dy} = -\frac{1}{y^2} \quad \text{so the Jacobian is} \quad \left| \frac{dx}{dy} \right| = \frac{1}{y^2}

which gives the distribution for y:

f(y) = \frac{1}{c y^2}

Two advanced exercises for those who like algebra:

Exercise 9.2: Suppose we model p = U(0, 1). What is the density function for Q = 1 - (1 - p)^n?

Exercise 9.3: Suppose we want to model P(0) = exp(-λ) = U(0, 1). What is the density function for λ?

There is no all-embracing solution to the problem of setting uninformed priors that don't become "informed" under some reparameterising of the problem. However, one useful method is to use a prior such that log10(θ) is Uniform(-z, z) distributed, which, using the Jacobian transformation, can be shown to give the prior density π(θ) ∝ 1/θ, for a parameter that can take any positive real value. We could just as easily use natural logs, i.e. loge(θ) = Uniform(-y, y), but in practice it is easier to set the value z because our minds think quite naturally in powers of 10. Using this prior, we get log10(1/θ) = -log10(θ) = -Uniform(-z, z) = Uniform(-z, z). In other words, 1/θ is distributed the same as θ: in mathematical terminology, the prior distribution is transformation invariant.
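The transformation invariance just described can be confirmed by simulation (illustrative Python, not the book's): draws of θ = 10^Uniform(-z, z) and their reciprocals share the same distribution.

```python
import random

random.seed(5)
z = 3
# theta with log10(theta) ~ Uniform(-z, z), i.e. theta = 10^Uniform(-z, z)
thetas = sorted(10 ** random.uniform(-z, z) for _ in range(200_000))
inverses = sorted(1 / t for t in thetas)

# a few sample quantiles of theta and of 1/theta should agree closely
def quantile(xs, f):
    return xs[int(f * len(xs))]

ratios = [quantile(thetas, f) / quantile(inverses, f)
          for f in (0.1, 0.25, 0.5, 0.75, 0.9)]
```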
Now, if log10(θ) is Uniform(-z, z) distributed, then θ is distributed as 10^Uniform(-z, z). Figure 9.14 shows a graph of π(θ) ∝ 1/θ. You probably wouldn't describe that distribution as very uninformed, but it is arguably the best one can do for this particular problem. It is worth remembering too that, if there is a reasonable amount of data available, the likelihood function l(X|θ) will overpower the prior π(θ) ∝ 1/θ, and then the shape of the prior becomes unimportant. This will occur much more quickly if the likelihood function has its maximum in a region of θ where the prior is flatter: anywhere from 3 or 4 onwards in Figure 9.14, for example.

Figure 9.14 Prior distribution π(θ) ∝ 1/θ.

Another requirement might be to ensure that the prior distribution remains invariant under some rescaling. For example, the location parameter of a distribution should have the same effective prior under the linear shifting transformation y = θ - a, where a is some constant. This is achieved if we select a uniform prior for θ, i.e. π(θ) = constant. Similarly, a scale parameter should have a prior that is invariant under a change of units, i.e. y = kθ, where k is some constant. In other words, we require that the parameter be invariant under a linear transformation, which, from the discussion in the previous paragraph, is achieved if we select the prior log(θ) = uniform (i.e. π(θ) ∝ 1/θ) on the real line, since log(y) = log(kθ) = log(k) + log(θ), which is still uniformly distributed. A parametric distribution often has a location parameter, a scale parameter or both. If more than one parameter is unknown and one is attempting to estimate these parameters, it is common practice to assume independence between the parameters in the prior: the logic is that an assumption of independence is more uninformed than an assumption of any specific degree of dependence.
The joint prior for a scale parameter and a location parameter is then simply the product of the two priors. So, for example, the prior for the mean of a normal distribution is π(μ) ∝ 1, as μ is a location parameter; the prior for the standard deviation of the normal distribution is π(σ) ∝ 1/σ, as σ is a scale parameter; and their joint prior is given by the product of the two priors, i.e. π(μ, σ) ∝ 1/σ. The use of joint priors is discussed more fully in Chapter 10, where we will be fitting distributions to data.

Jeffreys prior

The Jeffreys prior, described in Jeffreys (1961), provides an easily computed prior that is invariant under any one-to-one transformation and therefore determines one version of what could be described as an uninformed prior. The idea is that one finds a likelihood function, under some transformation of the data, that produces the same shape for all datasets and simply changes the location of its peak. Thus, a non-informative prior under this transformation would be unambiguous, i.e. flat. Although it is often impossible to determine such a likelihood function, Jeffreys developed a useful approximation given by

\pi(\theta) \propto \sqrt{I(\theta)}

where I(θ) is the expected Fisher information in the model:

I(\theta) = -E_x\left[ \frac{\partial^2 \log l(x; \theta)}{\partial \theta^2} \right]

The formula averages, over all values of x (the data), the second-order partial derivative of the log-likelihood function. The form of the likelihood function helps determine the prior, but the data themselves do not. This is important, since the prior must be "blind" to the data. [Interestingly, empirical Bayes methods (another field of Bayesian inference, though not discussed in this book) do use the data to determine the prior distribution and then try to make appropriate corrections for the bias this creates.] Some of the Jeffreys prior results are a little counterintuitive. For example, the Jeffreys prior for a binomial probability is the Beta(1/2, 1/2) shown in Figure 9.15.
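The expected Fisher information can be evaluated by brute force from this definition. The Python below is my own illustrative check (not from the book): for a Binomial(n, p) likelihood it recovers the known result I(p) = n/(p(1-p)), whose square root is the Beta(1/2, 1/2) kernel.

```python
import math

def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def fisher_info(n, p, h=1e-5):
    # I(p) = -E_x[d^2 log l(x; p) / dp^2]: the second derivative is taken
    # numerically by central differences, and the expectation is taken by
    # summing over every possible observation x
    total = 0.0
    for x in range(n + 1):
        ll = lambda q: math.log(binom_pmf(x, n, q))
        d2 = (ll(p + h) - 2 * ll(p) + ll(p - h)) / h**2
        total += binom_pmf(x, n, p) * d2
    return -total

n = 10
info_half = fisher_info(n, 0.5)   # analytically n/(p(1-p)) = 40 at p = 0.5

def jeffreys(p):
    # unnormalised Jeffreys prior: sqrt of the expected Fisher information
    return math.sqrt(fisher_info(n, p))
```

The prior dips at p = 0.5 and rises towards the edges, exactly the Beta(1/2, 1/2) shape discussed next.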
The Beta(1/2, 1/2) prior peaks at p = 0 and p = 1, dipping to its lowest value at p = 0.5, which does not equate well with most people's intuitive notion of uninformed. The Jeffreys prior for the Poisson mean λ is π(λ) ∝ 1/λ^{1/2}. But, using the Jacobian transformation, we see that this gives a prior for β = 1/λ of π(β) ∝ β^{-3/2}, so the prior is not transformation invariant.

Figure 9.15 The Beta(1/2, 1/2) distribution.

Improper priors

We have seen how a uniform prior can be used to represent uninformed knowledge about a parameter. However, if that parameter can take on any value between zero and infinity, for example, then it is not strictly possible to use the uniform prior π(θ) = c, where c is some constant, since no value of c will let the area of the distribution sum to 1, and the prior is called improper. Other common improper priors include using 1/σ for the standard deviation of a normal distribution and 1/σ² for the variance. It turns out that we can use improper priors provided the denominator in Equation (9.8) equals some constant (i.e. is not infinite), because this means that the posterior distribution can be normalised. Savage et al. (1962) pointed out that an uninformed prior can be uniformly distributed over the area of interest, then slope smoothly down to zero outside the area of interest. Such a prior can, of course, be designed to have an area of 1, eliminating the need for improper priors. However, the extra effort required in designing such a prior is not really necessary if one can accept using an improper prior.

Hyperpriors

Occasionally, one may wish to specify a prior that itself has one or more uncertain parameters. For instance, in Example 9.3 we used a Binomial(20, 0.5) prior because we believed that about 50% of the population were female, and we discussed the effect of changing this value to a distribution representing the uncertainty about the true female prevalence.
Such a distribution is described as a hyperprior for the hyperparameter p in Binomial(20, p). As previously discussed, Bayesian inference can account for hyperpriors, but we are then required to integrate over all values of the hyperparameter to determine the shape of the prior, and that can be time consuming and at times very difficult. An alternative to the algebraic approach is to find the prior distribution by Monte Carlo simulation. We run a simulation for this model, naming as outputs the array of cells calculating the prior. At the end of the simulation, we collect the mean values for each output cell, which together form our prior. The posterior distribution will naturally have a greater spread if there is uncertainty about any parameters in the prior. If we had used a Beta(a, b) distribution for p, the prior would have been a Beta-Binomial(20, a, b) distribution, and a beta-binomial distribution always has a greater spread than the best-fitting binomial. Theoretically, one could continue applying uncertainty distributions to the parameters of hyperpriors, etc., but there is little if any accuracy to be gained by doing so, and the model starts to seem pretty silly. It is also worth remembering that the likelihood function often quickly overpowers the prior distribution as more data become available, so the effort expended in subtle changes to defining a prior will often be wasted.

Conjugate priors

A conjugate prior has the same functional form in θ as the likelihood function, which leads to a posterior distribution belonging to the same distribution family as the prior.
For example, the Beta(α1, α2) distribution has probability density function f(θ) given by

f(θ) = θ^(α1−1) (1 − θ)^(α2−1) / B(α1, α2)

The denominator is a constant for particular values of α1 and α2, so we can rewrite the equation as

f(θ) ∝ θ^(α1−1) (1 − θ)^(α2−1)     (9.9)

If we had observed s successes in n trials and were attempting to estimate the true probability of success p, the likelihood function l(s, n; θ) would be given by the binomial distribution probability mass function, written (using θ to represent the unknown parameter p) as

l(s, n; θ) = C(n, s) θ^s (1 − θ)^(n−s)

Since the binomial coefficient is constant for the given dataset (i.e. known n, s), we can rewrite the equation as

l(s, n; θ) ∝ θ^s (1 − θ)^(n−s)

We can see that the beta distribution and the binomial likelihood function have the same functional form in θ, i.e. θ^a (1 − θ)^b, where a and b are constants. Since the posterior distribution is a product of the prior and likelihood function, it too will have the same functional form, i.e. using Equation (9.9) we have

f(θ|s, n) ∝ θ^(α1−1+s) (1 − θ)^(α2−1+n−s)     (9.10)

Since this is a true distribution, it must normalise to 1, so the probability distribution function is actually

f(θ|s, n) = θ^(α1−1+s) (1 − θ)^(α2−1+n−s) / ∫₀¹ t^(α1−1+s) (1 − t)^(α2−1+n−s) dt

which is just the Beta(α1 + s, α2 + n − s) distribution. (In fact, with a bit of practice, one starts to recognise distributions because of their functional form, e.g. that Equation (9.10) represents a beta distribution, without having to go through the step of obtaining the normalised equation.) Thus, if one uses a beta distribution as a prior for p with a binomial likelihood function, the posterior distribution is also a beta. The value of using conjugate priors is that we can avoid actually doing any of the mathematics and get directly to the answer. Conjugate priors are often called convenience priors for obvious reasons. The Beta(1, 1) distribution is exactly the same as a Uniform(0, 1) distribution, so, if we want to start with a Uniform(0, 1) prior for p, our posterior distribution is given by Beta(s + 1, n − s + 1).
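As a quick numerical sanity check (my own sketch, not from the book), one can normalise the product of a beta prior and a binomial likelihood on a grid and confirm that it matches the Beta(α1 + s, α2 + n − s) conjugate result; the parameter values below are arbitrary:

```python
import numpy as np

# Prior Beta(a1, a2); data: s successes in n trials (arbitrary values).
a1, a2, s, n = 2.0, 3.0, 7, 20

# Brute-force posterior: normalise prior x likelihood on a grid of theta.
theta = np.linspace(1e-6, 1 - 1e-6, 200_001)
unnorm = theta**(a1 - 1 + s) * (1 - theta)**(a2 - 1 + n - s)
pdf = unnorm / (unnorm.sum() * (theta[1] - theta[0]))

# The conjugate result says the posterior is Beta(a1 + s, a2 + n - s),
# whose mean is (a1 + s) / (a1 + a2 + n); the grid answer agrees.
grid_mean = (theta * pdf).sum() * (theta[1] - theta[0])
assert abs(grid_mean - (a1 + s) / (a1 + a2 + n)) < 1e-4
```

The same grid check works for any prior-likelihood pair, which is exactly why conjugacy is a convenience rather than a necessity.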
This is a particularly useful result that will be used repeatedly in this book. By comparison, the Jeffreys prior for a binomial probability is a Beta(½, ½). Haldane (1948) discusses using a Beta(0, 0) prior, which is mathematically undefined and therefore meaningless by itself, but gives a posterior distribution of Beta(s, n − s) that has a mean of s/n: in other words, it provides an unbiased estimate for the binomial probability. Table 9.1 lists other conjugate priors and the associated likelihood functions. Morris (1983) has shown that exponential families of distributions, from which one often draws the likelihood function, all have conjugate priors, so the technique can be used frequently in practice. Conjugate priors are also often used to provide approximate but very convenient representations of subjective priors, as described in the next section.

Subjective priors

A subjective prior (sometimes called an elicited prior) describes the informed opinion of the value of a parameter prior to the collection of data. Chapter 14 discusses in some depth the techniques for eliciting opinions. A subjective prior can be represented as a series of points on a graph, as shown in Figure 9.16. It is a simple enough exercise to read off a number of points from such graphs and use the height of each point as a substitute for π(θ). That makes it quite difficult to normalise the posterior distribution,

Table 9.1 Likelihood functions and their associated conjugate priors.

Binomial: f(x) = C(n, x) p^x (1 − p)^(n−x); estimated parameter: probability p; prior: Beta(α1, α2); posterior: Beta(α1', α2') with α1' = α1 + x, α2' = α2 + n − x.
Exponential: f(x) = λ e^(−λx); estimated parameter: mean⁻¹ = λ; prior: Gamma(α, β); posterior: Gamma(α', β') with α' = α + n, β' = β/(1 + β Σxi).
Normal (with known σ): f(x) = [1/(σ√(2π))] exp[−½((x − μ)/σ)²]; estimated parameter: mean μ; prior: Normal(μ0, σ0); posterior: Normal(μ0', σ0') with μ0' = (μ0/σ0² + n x̄/σ²)/(1/σ0² + n/σ²) and σ0'² = (1/σ0² + n/σ²)⁻¹.
Poisson: f(x) = (λt)^x e^(−λt)/x!; estimated parameter: mean events per unit time λ; prior: Gamma(α, β); posterior: Gamma(α', β') with α' = α + x, β' = β/(1 + βt).
but we will see in Section 9.2.4 a technique that one can use in Monte Carlo modelling that removes that problem.

Figure 9.16 Example of a subjective prior (elicited distribution for the weight of a statue).

Sometimes it is possible reasonably to match a subjective opinion like that of Figure 9.16 to a convenience prior for the likelihood function one is intending to use. Software products like ModelRisk, BestFit® and RiskView Pro® can help in this regard. An exact match is not usually important because (a) the subjective prior is not usually specified that accurately anyway and (b) the prior has progressively less influence on the posterior the larger the set of data used in calculating the likelihood function. At other times, a single conjugate prior may be inadequate for describing a subjective prior, but a composite of two or more conjugate priors will produce a good representation.

Multivariate priors

I have concentrated discussion of the quantification of uncertainty in this chapter on a single parameter θ. In practice one may find that θ is multivariate, i.e. that it is multidimensional, in which case one needs multivariate priors. In general, such techniques are beyond the scope of this book, and the reader is referred to more specialised texts on Bayesian inference: I have listed some texts I have found useful (and readable) in Appendix IV. Multivariate priors are, however, discussed briefly with respect to fitting distributions to data in Section 10.2.2.

9.2.3 Likelihood functions

The likelihood function l(X|θ) is a function of θ with the data X fixed. It calculates the probability of observing the data X as a function of θ. Sometimes the likelihood function is simple: often it is just the probability distribution function of a distribution like the binomial, Poisson or hypergeometric. At other times, it can quickly become very complex.
Examples 9.2, 9.3 and 9.6 to 9.8 illustrate some different likelihood functions. As likelihood functions are calculating probabilities (or probability densities), they can be combined in the same way as we usually do in probability calculus, discussed in Section 6.3. The likelihood principle states that all relevant evidence about θ from an experiment and its observed outcome should be present in the likelihood function. For example, in binomial sampling with n fixed, s is binomially distributed for a given p. If s is fixed, n is negative binomially distributed for a given p. In both cases the likelihood function is proportional to p^s (1 − p)^(n−s), i.e. it is independent of how the sampling was carried out and dependent only on the type of sampling and the result.

9.2.4 Normalising the posterior distribution

A problem often faced by those using Bayesian inference is the difficulty of determining the normalising integral that is the denominator of Equation (9.8). For all but the simplest likelihood functions this can be a complex equation. Although sophisticated commercial software products like Mathematica®, Mathcad® and Maple® are available to perform these integrations for the analyst, many integrals remain intractable and have to be solved numerically. This means that the calculation has to be redone every time new data are acquired or a slightly different problem is encountered. For the risk analyst using Monte Carlo techniques, the normalising part of the Bayesian inference analysis can be bypassed altogether. Most Monte Carlo packages offer two functions that enable us to do this: a Discrete({x}, {p}) distribution and a Relative(min, max, {x}, {p}) distribution. The first defines a discrete distribution where the allowed values are given by the {x} array and the relative likelihood of each of these values is given by the {p} array.
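A minimal sketch of how such a self-normalising discrete function might behave (the helper name `discrete` and the numbers are my own illustration, not any package's actual API):

```python
import random

def discrete(values, weights, size=1, rng=random):
    """Mimic a Discrete({x}, {p}) function: sample from `values` with
    relative likelihoods `weights`. The weights need not sum to 1;
    they are normalised internally."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(values, weights=probs, k=size)

# Unnormalised posterior heights for three candidate parameter values:
sample = discrete([0.1, 0.2, 0.3], [2.0, 5.0, 3.0], size=10_000)
# About half the samples should land on 0.2 (weight 5 out of 10).
```

The point is that the caller supplies raw posterior heights; the normalisation that would otherwise require the integral in Equation (9.8) happens inside the sampler.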
The second function defines a continuous distribution with a minimum = min, a maximum = max and several x values given by the array {x}, each of which has a relative likelihood "density" given by the {p} array. The reason that these two functions are so useful is that the user is not required to ensure that, for the discrete distribution, the probabilities in {p} sum to 1 and, for the relative distribution, the area under the curve equals 1. The functions normalise themselves automatically.

9.2.5 Taylor series approximation to a Bayesian posterior distribution

When we have a reasonable amount of data with which to calculate the likelihood function, the posterior distribution tends to come out looking approximately normally distributed. In this section we will examine why that is, and provide a shorthand method to determine the approximating normal distribution directly without needing to go through a complete Bayesian analysis. Our best estimate θ0 of the value of a parameter θ is the value for which the posterior distribution f(θ) is at its maximum. Mathematically, this equates to the condition

df(θ)/dθ |θ=θ0 = 0     (9.11)

That is to say, θ0 occurs where the gradient of f(θ) is zero. Strictly speaking, we also require that the gradient of f(θ) go from positive to negative for θ0 to be a maximum, i.e.

d²f(θ)/dθ² |θ=θ0 < 0

The second condition is only of any importance if the posterior distribution has two or more peaks, for which a normal approximation to the posterior distribution would be inappropriate anyway. Taking the first and second derivatives of f(θ) assumes that θ is a continuous variable, but the principle applies equally to discrete variables, in which case we are just looking for that value of θ for which the posterior distribution has the highest value.
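When no algebraic maximum is available, that highest-posterior value can be located numerically. A minimal sketch (my own example, using a binomial posterior where the answer is known in advance to be s/n):

```python
import numpy as np

# Unnormalised log posterior for s successes in n trials with a uniform
# prior: L(theta) = s*ln(theta) + (n - s)*ln(1 - theta), up to a constant.
s, n = 7, 20
theta = np.linspace(1e-6, 1 - 1e-6, 1_000_001)
log_post = s * np.log(theta) + (n - s) * np.log(1 - theta)

theta0 = theta[np.argmax(log_post)]  # numerical mode of the posterior
# Matches the analytic answer theta0 = s/n = 0.35 to grid precision.
assert abs(theta0 - s / n) < 1e-5
```

Working with the log posterior avoids numerical underflow, and the constant dropped from L(θ) does not move the location of the maximum.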
The Taylor series expansion of a function (see Section 6.3.6) allows one to produce a polynomial approximation to some function f(x) about some value x0 that usually has a much simpler form than the original function. The Taylor series expansion says

f(x) ≈ Σ_{m=0}^{M} [f^(m)(x0)/m!] (x − x0)^m

where f^(m)(x) represents the mth derivative of f(x) with respect to x. To make the next calculation a little easier to manage, we first define the log of the posterior distribution L(θ) = ln[f(θ)]. Since L(θ) increases with f(θ), the maximum of L(θ) occurs at the same value of θ as the maximum of f(θ). We now apply the Taylor series expansion of L(θ) about θ0 (the MLE) for the first three terms:

L(θ) ≈ L(θ0) + dL(θ)/dθ |θ0 (θ − θ0) + ½ d²L(θ)/dθ² |θ0 (θ − θ0)²

The first term in this expansion is just a constant value (k) and tells us nothing about the shape of L(θ); the second term equals zero from Equation (9.11), so we are left with the simplified form

L(θ) ≈ k + ½ d²L(θ)/dθ² |θ0 (θ − θ0)²

This approximation will be good providing the higher-order terms (m = 3, 4, etc.) have much smaller values than the m = 2 term here. We can now take the exponential of L(θ) to get back to f(θ):

f(θ) ≈ K exp[ ½ d²L(θ)/dθ² |θ0 (θ − θ0)² ]

where K is a normalising constant. Now, the Normal(μ, σ) distribution has probability density function f(x) given by

f(x) = [1/(σ√(2π))] exp[ −(x − μ)²/(2σ²) ]

Comparing the above two equations, we can see that f(θ) has the same functional form as a normal distribution, where μ = θ0 and

σ = [ −d²L(θ)/dθ² |θ0 ]^(−1/2)     (9.12)

and we can thus often approximate the Bayesian posterior distribution with the following normal distribution:

θ ≈ Normal( θ0, [ −d²L(θ)/dθ² |θ0 ]^(−1/2) )

We shall illustrate this normal (or quadratic) approximation with a few simple examples.

Example 9.5 Approximation to the beta distribution

We have seen above that the Beta(s + 1, n − s + 1) distribution provides an estimate of the binomial probability p when we have observed s successes in n independent trials, and assuming a prior Uniform(0, 1) distribution.
The posterior density has the functional form

f(θ) ∝ θ^s (1 − θ)^(n−s)

Taking logs gives

L(θ) = s ln(θ) + (n − s) ln(1 − θ) + constant

and

dL(θ)/dθ = s/θ − (n − s)/(1 − θ),   d²L(θ)/dθ² = −s/θ² − (n − s)/(1 − θ)²

We first find our best estimate θ0 of θ:

s/θ0 − (n − s)/(1 − θ0) = 0

which gives the intuitively encouraging answer

θ0 = s/n

i.e. our best guess for the binomial probability is the proportion of trials that were successes. Next, we find the standard deviation σ for the normal approximation to this beta distribution:

d²L(θ)/dθ² |θ0 = −s/θ0² − (n − s)/(1 − θ0)² = −n/[θ0(1 − θ0)]

which gives

σ = [θ0(1 − θ0)/n]^(1/2)     (9.13)

and so we get the approximation

θ ≈ Normal( s/n, [θ0(1 − θ0)/n]^(1/2) )     (9.14)

The equation for σ allows us some useful insight into the behaviour of the beta distribution. We can see in the numerator that the spread of the beta distribution, and therefore our measure of uncertainty about the true value of θ, is a function of our best estimate for θ. The function [θ0(1 − θ0)] is at its maximum when θ0 = ½, so, for a given set of trials n, we will be more uncertain about the true value of θ if the proportion of successes is close to ½ than if it were closer to 0 or 1. Looking at the denominator, we see that the degree of uncertainty, represented by σ, is proportional to n^(−1/2). We will see time and again that the level of uncertainty of some parameter is inversely proportional to the square root of the amount of data available. Note also that Equation (9.14) is exactly the same as the classical statistics result of Equation (9.7). But when is this quadratic approximation to L(θ), i.e. the normal approximation to f(θ), a reasonably good fit? The mean μ and variance V of a Beta(s + 1, n − s + 1) distribution are as follows:

μ = (s + 1)/(n + 2),   V = (s + 1)(n − s + 1)/[(n + 2)²(n + 3)]

Comparing these identities with Equation (9.13), we can see that the normal approximation works when s and (n − s) are both sufficiently large for adding 1 to s and adding 3 to n proportionally to have little effect, i.e.
when

(s + 1)/s ≈ 1  and  (n + 3)/n ≈ 1

Figure 9.17 compares the beta distribution with its normal approximation for several combinations of s successes in n trials.

Example 9.6 Uncertainty of λ in a Poisson process

The number of earthquakes that have occurred in a region of the Pacific during each of the last 20 years is shown in Table 9.2. What is the probability that there will be more than 10 earthquakes next year? Let us assume that the earthquakes come from a Poisson process (it probably doesn't, I admit, since one big earthquake can release built-up pressure and give a hiatus until the next one), i.e. that there is a constant probability per unit time of an earthquake and that all earthquakes are independent of each other. If such an assumption is acceptable, then we need to determine the value of the Poisson process parameter λ, the theoretical true mean number of earthquakes there would be per year.

Table 9.2 Pacific earthquakes (annual counts for 1979–1998).

Assuming no prior knowledge, we can proceed with a Bayesian analysis, labelling λ = θ as the parameter to be estimated. The prior distribution should be uninformed, which, as discussed in Section 9.2.2, leads us to use a prior π(θ) = 1/θ. The likelihood function l(θ|X) for the xi observations in n years is given by

l(θ|X) = Π_{i=1}^{n} θ^{xi} e^{−θ}/xi! ∝ θ^{Σxi} e^{−nθ}

which gives a posterior function

f(θ) ∝ θ^{Σxi − 1} e^{−nθ}

Taking logs gives

L(θ) = (Σxi − 1) ln(θ) − nθ + constant

Our best estimate θ0 is determined by

dL(θ)/dθ |θ0 = (Σxi − 1)/θ0 − n = 0

which gives

θ0 = (Σxi − 1)/n

and the standard deviation for the normal approximation is given by

σ = (Σxi − 1)^(1/2)/n

since

d²L(θ)/dθ² |θ0 = −(Σxi − 1)/θ0² = −n²/(Σxi − 1)

which gives our estimate for λ:

λ ≈ Normal( (Σxi − 1)/n, (Σxi − 1)^(1/2)/n )

Again this solution makes sense, and again we see that the uncertainty decreases proportionally to the square root of the amount of data n.
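The Table 9.2 counts are not reproduced here, so this sketch (my own, not from the book) uses synthetic Poisson counts simply to check the mode and standard deviation formulas numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
counts = rng.poisson(7.0, size=20)   # stand-in for Table 9.2 (n = 20 years)
n, total = len(counts), counts.sum()

# With the 1/theta prior, the posterior is proportional to
# theta^(sum(x) - 1) * exp(-n*theta), i.e. a Gamma(sum(x), 1/n).
theta0 = (total - 1) / n             # posterior mode
sigma = np.sqrt(total - 1) / n       # from d2L/dtheta2 = -n^2/(sum(x) - 1)

# Grid check that theta0 really maximises the log posterior.
theta = np.linspace(0.01, 20, 200_000)
log_post = (total - 1) * np.log(theta) - n * theta
assert abs(theta[np.argmax(log_post)] - theta0) < 1e-3
```

Substituting the real Table 9.2 counts for the synthetic ones would reproduce the normal approximation plotted in Figure 9.18.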
The central limit theorem (see Section 6.3.3) says that, for large n, the uncertainty about the true mean μ of a population can be described as

μ ≈ Normal( x̄, s/√n )

where x̄ is the mean and s is the standard deviation of the data sampled from the parent distribution. The Poisson distribution has a variance equal to its mean λ, and therefore a standard deviation equal to √λ. As Σxi gets large, so the "−1" in the above formula for θ0 gets progressively less important and θ0 gets closer and closer to the mean number of observations per period x̄, and we see that the Bayesian approach and the central limit theorem of classical statistics converge to the same answer. Σxi will be large when either λ is large, so each xi is large, or when there are a lot of data (i.e. n is large), so that the sum of a lot of small xi is still large. Figure 9.18 provides three estimates of λ, the true mean number of earthquakes for the system, given the data for earthquakes for the last 20 years, namely: the standard Bayesian approach, the normal approximation to the Bayesian and the central limit theorem approximation.

Example 9.7 Estimate of the mean of a normal distribution with unknown standard deviation

Assume that we have a set of n data samples from a normal distribution with unknown mean μ and unknown standard deviation σ. We would like to determine our best estimate of the mean together with the appropriate level of uncertainty. A normal distribution can have a mean anywhere in (−∞, +∞), so we could use a uniform improper prior π(μ) = k. From the discussion in Section 9.2.2, the uninformed prior for the standard deviation should be π(σ) = 1/σ to ensure invariance under a linear transformation. The likelihood function is given by the normal distribution density function:

l(X|μ, σ) = Π_{i=1}^{n} [1/(σ√(2π))] exp[ −(xi − μ)²/(2σ²) ]

Figure 9.18 Uncertainty distributions for λ by various methods.
Multiplying the priors together with the likelihood function and integrating over all possible values of σ, we arrive at the posterior distribution for μ:

f(μ) ∝ [ (n − 1)s² + n(μ − x̄)² ]^(−n/2)

where x̄ and s are the mean and sample standard deviation of the data values. Now the Student t-distribution with ν degrees of freedom has the probability density

f(x) ∝ [ 1 + x²/ν ]^(−(ν+1)/2)

The equation for f(μ) is of the same form as the equation for f(x) if we set ν = n − 1. If we divide the term inside the square brackets for f(μ) by the constant (n − 1)s², we get

f(μ) ∝ [ 1 + (1/(n − 1)) ((μ − x̄)/(s/√n))² ]^(−n/2)

so the equation above for f(μ) equates to a shifted, rescaled Student t-distribution with (n − 1) degrees of freedom. Specifically, μ can be modelled as

μ = x̄ + (s/√n) t(n − 1)

where t(n − 1) represents the Student t-distribution with (n − 1) degrees of freedom. This is the exact result used in classical statistics, as described in Section 9.1.3.

Example 9.8 Estimate of the mean of a normal distribution with known standard deviation

This is a more specific case than the previous example and might occur, for example, if one was making many measurements of the same parameter but believed that the measurements had independent, normally distributed errors and no bias (so the distribution of possible values would be centred about the true value). We proceed in exactly the same way as before, giving a uniform prior for μ and using a normal likelihood function for the observed n measurements {xi}. No prior is needed for σ since it is known, and we arrive at a posterior distribution for μ given by

f(μ) ∝ Π_{i=1}^{n} exp[ −(xi − μ)²/(2σ²) ]

Taking logs gives

L(μ) = k − Σ_{i=1}^{n} (xi − μ)²/(2σ²)

where k is some constant (since σ is known). Differentiating twice, we get

dL(μ)/dμ = Σ(xi − μ)/σ²,   d²L(μ)/dμ² = −n/σ²

The best estimate μ0 of μ is that value for which dL(μ)/dμ = 0:

Σ(xi − μ0)/σ² = 0

i.e. μ0 is the average of the data values x̄: no surprise there! A Taylor series expansion of this function about μ0 gives

L(μ) = L(μ0) − n(μ − μ0)²/(2σ²)     (9.16)

The first-derivative term is missing because it equals zero, and there are no other higher-order terms since d²L(μ)/dμ² = −n/σ² is independent of μ and any further differential therefore equals zero.
Consequently, Equation (9.16) is an exact result. Taking natural exponents to convert back to f(μ), and rearranging a little, we get

f(μ) = K exp[ −n(μ − x̄)²/(2σ²) ]

where K is a normalising constant. By comparison with the probability density function for the normal distribution, it is easy to see that this is just a normal density function with mean x̄ and standard deviation σ/√n. In other words,

μ = Normal( x̄, σ/√n )

which is the classical statistics result of Equation (9.4) and a result predictable from the central limit theorem.

Exercise 9.4: Bayesian uncertainty for the standard deviation of a normal distribution. Show that the Bayesian inference results for uncertainty about the standard deviation of a normal distribution take a similar form to the classical statistics results of Section 9.1.2.

9.2.6 Markov chain simulation: the Metropolis algorithm and the Gibbs sampler

Gibbs sampling is a simulation technique to obtain a required Bayesian posterior distribution and is particularly useful for multiparameter models where it is difficult algebraically to define, normalise and draw from a posterior distribution. The method is based on Markov chain simulation: a technique that creates a Markov process (a type of random walk) whose stationary distribution (the distribution of the values it will take after a very large number of steps) is the required posterior distribution. The technique requires that one runs the Markov chain a sufficiently large number of steps to be close to the stationary distribution, and then records the generated values. The trick to a Markov chain model is to determine a transition distribution T_t(θ^i | θ^{i−1}) (the distribution of possible values for the Markov chain at its ith step θ^i, conditional on the value generated in the (i − 1)th step θ^{i−1}) that converges to the posterior distribution.
The Metropolis algorithm

The transition distribution is a combination of some symmetric jumping distribution J_t(θ* | θ^{i−1}), which lets one move from one value θ^{i−1} to another randomly selected θ*, and a weighting function that assigns the probability of jumping to θ* (as opposed to staying still) as the ratio r, where

r = f(θ* | X) / f(θ^{i−1} | X)

so that

θ^i = θ*  with probability min[1, r]
    = θ^{i−1}  otherwise

The technique relies on being able to sample from J_t for all i and θ^{i−1}, as well as being able to calculate r for all jumps. For multiparameter problems, the Metropolis algorithm is very inefficient: the Gibbs sampler provides a method that achieves the same posterior distribution but with far fewer model iterations.

The Gibbs sampler

The Gibbs sampler, also called alternating conditional sampling, is used in multiparameter problems, i.e. where θ is a d-dimensional vector with components (θ1, . . . , θd). The Gibbs sampler cycles through all the components of θ for each iteration, so there are d steps in each iteration. The order in which the components are taken is changed at random from one iteration to the next. In a cycle, the kth component is replaced (k = 1 to d, while all of the other components are kept fixed in turn) with a value drawn from a distribution with probability density

p( θk | θ^{i−1}_{−k}, X )

where θ^{i−1}_{−k} are all the other components of θ except for θk at their current value. This may look rather awkward, as one has to determine and sample from d separate distributions for each iteration of the Gibbs sampler. However, the conditional distributions are often conjugate distributions, which makes sampling from them a lot simpler and quicker. Have a look at Gelman et al. (1995) for a very readable discussion of various Markov chain models, and for a number of examples of their use. Gilks et al. (1996) is written by some of the real gurus of MCMC methods.
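A minimal random-walk Metropolis sketch (my own illustration, not from the book, using the binomial posterior of Example 9.5 as the target; the function names, step counts and tuning values are all assumptions):

```python
import math
import random

def log_post(theta, s=7, n=20):
    """Unnormalised log posterior for a binomial probability with a
    Uniform(0, 1) prior: s*ln(theta) + (n - s)*ln(1 - theta)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return s * math.log(theta) + (n - s) * math.log(1.0 - theta)

def metropolis(steps=50_000, start=0.5, scale=0.1, seed=0):
    rng = random.Random(seed)
    theta, chain = start, []
    for _ in range(steps):
        proposal = theta + rng.gauss(0.0, scale)  # symmetric jumping distribution
        log_r = log_post(proposal) - log_post(theta)
        # Jump to the proposal with probability min(1, r).
        if log_r >= 0 or rng.random() < math.exp(log_r):
            theta = proposal
        chain.append(theta)
    return chain[5_000:]  # discard a burn-in period

draws = metropolis()
mean = sum(draws) / len(draws)
# With s = 7, n = 20 the true posterior is Beta(8, 14), whose mean is
# 8/22, roughly 0.364; the chain mean should land close to that.
```

Note that only the unnormalised posterior is ever evaluated: the normalising constant cancels in the ratio r, which is exactly why MCMC sidesteps the integral in Equation (9.8).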
MCMC in practice

Some terribly smart people write their own Gibbs sampling programs, but for the rest of us there is a product called WinBUGS, developed originally at Cambridge University. It is free to download and is the software most used for MCMC modelling. It isn't that easy to get the software to work for you unless you are familiar with S-Plus or R type script, and one always waits with bated breath for the message "Compiled successfully" because there is rather little in the way of hints about what to do when it doesn't compile. On the plus side, the actual probability model is quite intuitive to write, and WinBUGS has the flexibility to allow different datasets to be incorporated into the same model. The software is also continuously improving, and several people have written interfaces to it through the OpenBUGS project. To use the WinBUGS output, you will need to export the CODA file for data (after a sufficient burn-in) to a spreadsheet, move the data around to have one column per parameter and then randomly sample across a line (i.e. one MCMC iteration) in just the same way I explain for bootstrapping paired data. The ModelRisk function VoseNBootPaired allows you to do this very simply.

9.3 The Bootstrap

The bootstrap was introduced by Efron (1979) and is explored in great depth in Efron and Tibshirani (1993) and perhaps more practically in Davison and Hinkley (1997). This section presents a rather brief introduction that covers most of the important concepts. The bootstrap appears at first sight to be rather dubious, but it has earned its place as a useful technique because (a) it corresponds well to traditional techniques where they are available, particularly when a large dataset has been obtained, and (b) it offers an opportunity to assess the uncertainty about a parameter where classical statistics has no technique available, and without recourse to determining a prior.
The "bootstrap" gets its name from the phrase "to pull yourself up by your bootstraps", which is thought to originate from one of the tales in the Adventures of Baron Munchausen by Rudolph Erich Raspe (1737–1794). Baron Munchausen (1720–1797) actually existed and was known as an enormous boaster, especially of his exploits during his time as a Russian cavalry officer. Raspe wrote ludicrous stories supposedly in his name (he would have been sued these days). In one story, the Baron was at the bottom of a deep lake and in some trouble, until he thought of pulling himself up by his bootstraps. The name "bootstrap" does not perhaps engender much confidence in the technique: you get the impression that there is an attempt somehow to get something from nothing. Actually, it does seem that way when one first looks at the technique itself. However, the bootstrap has shown itself to be a powerful method of statistical analysis and, if used with care, can provide results very easily and in areas where traditional statistical techniques are not available. In its simplest form, which is the non-parametric bootstrap, the technique is very straightforward indeed. The standard notation as used by Efron is perhaps a little confusing to the beginner, and, since I am not going into any great sophistication in this book, I have modified the notation a little to keep it as simple as possible. The bootstrap is used in similar conditions to Bayesian inference, i.e. we have a set of data x randomly drawn from some population distribution F for which we wish to estimate some statistical parameter.

The jackknife

The bootstrap was originally developed from a much earlier technique called the jackknife. The jackknife was used to review the accuracy of a statistic calculated from a set of data.
A jackknife value is the statistic of interest calculated with the ith value removed from the dataset. With a dataset of n values, one thus has n jackknife values, the distribution of which gives a feel for the uncertainty one has about the true value of the statistic. I say "gives a feel" because the reader is certainly not recommended to use the jackknife as a method for obtaining any precise estimate of uncertainty. The jackknife turns out to be quite a poor estimator of uncertainty and can be greatly improved upon.

9.3.1 The non-parametric bootstrap

Imagine that we have a set of n random measurements of some characteristic of a population (the height of 100 blades of grass from my lawn, for example) and we wish to estimate some parameter of that population (the true mean height of all blades of grass from my lawn, for example). Bootstrap theory says that the true distribution F of these blades of grass can be reasonably approximated by the distribution F̂ of observed values. Obviously, this is a more reasonable assumption the more data one has collected. The theory then constructs this distribution F̂ of the n observed values, takes another n random samples (with replacement) from that constructed distribution and calculates the statistic of interest from that sample. The sampling from the constructed distribution and the statistic calculation are repeated a large number of times until a reasonably stable distribution of the statistic of interest is obtained. This is the distribution of uncertainty about the parameter. The method is best illustrated with a simple example. Imagine that I work for a contact lens manufacturer in Auckland and for some reason would really like to know the mean diameter of the pupils of the eyes of New Zealand's population under some specific light condition. I have a limited budget, so I randomly select 10 people off the street and measure their pupils while controlling the ambient light.
The results I get are (in mm): 5.92, 5.06, 6.16, 5.60, 4.87, 5.61, 5.72, 5.36, 6.03 and 5.71. This dataset forms my bootstrap estimate of the true distribution for the whole population, so I now randomly sample with replacement from that distribution to get 10 bootstrap samples. The spreadsheet in Figure 9.19 illustrates the bootstrap sampling: column B lists the original data, and column C gives 10 bootstrap samples from these data using the Duniform({x}) distribution (Duniform({x}) is a discrete distribution where all values in the {x} array are equally likely), i.e. cells C4:C13 contain =VoseDUniform($B$4:$B$13). Cell C14 then calculates the statistic of interest (the mean) from this sample with =AVERAGE(C4:C13). Running a 10 000 iteration simulation on this cell produces the bootstrap uncertainty distribution shown in Figure 9.20. The distribution is roughly normal (skewness = −0.16, kurtosis = 3.02) with mean = 5.604, the mean of the original dataset.

Figure 9.19 Example of a non-parametric bootstrap model.

Figure 9.20 Uncertainty distribution resulting from the model of Figure 9.19.

In summary, the non-parametric bootstrap proceeds as follows:

1. Collect the dataset of n samples {x1, . . . , xn}.
2. Create B bootstrap samples {x1*, . . . , xn*}, where each xi* is a random sample, with replacement, from {x1, . . . , xn}.
3. For each bootstrap sample {x1*, . . . , xn*}, calculate the required statistic θ̂.

The distribution of these B estimates of θ̂ represents the bootstrap estimate of uncertainty about the true value of θ.

Example 9.9 Bootstrap estimate of prevalence

Prevalence is the proportion of a population that has a particular characteristic.
An estimate of the prevalence P is usually made by randomly sampling from the population and seeing what proportion of the sample has that particular characteristic. Our confidence around this single-point estimate can be obtained quite easily using the non-parametric bootstrap. Imagine that we have randomly surveyed 50 voters in Washington, DC, and asked them how many will be voting for the Democrats in a presidential election the following day. Let's rather naïvely assume that they all tell the truth and that none of them will have a change of mind before tomorrow. The result of the survey is that 19 people said they will vote Democrat. Our dataset is therefore a set of 50 values, 19 of which are 1 and 31 of which are 0. A non-parametric bootstrap would sample from this dataset. Thus, the bootstrap replicate would be equivalent to a Binomial(50, 19/50), and the estimate of prevalence is then just the proportion of the bootstrap samples that are 1, i.e.

P = Binomial(50, 19/50)/50

This is exactly the same as the classical statistics estimate given in Equation (9.6), and, interestingly, the parametric bootstrap (see next section) gives exactly the same estimate in this example too. The distribution being sampled in a parametric bootstrap is a Binomial(1, P), from which we have 50 samples, and our MLE (maximum likelihood estimate) for P is 19/50. Thus, the 50 parametric bootstrap replicates can be summed together as a Binomial(50, 19/50), and our estimate for P is again Binomial(50, 19/50)/50. We could instead have used a Bayesian inference approach. With a Uniform(0, 1) prior and a binomial likelihood function (which assumes the population is much larger than the sample), we would have an estimate of prevalence using the beta distribution (see Section 8.2.3):

P = Beta(20, 32)

Figure 9.21 plots the Bayesian estimate alongside the bootstrap for comparison.
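The two uncertainty distributions being compared in Figure 9.21 can be drawn by simulation; this sketch (my own NumPy version, not from the book) generates both:

```python
import numpy as np

rng = np.random.default_rng(42)
n, s = 50, 19                                # 19 of 50 surveyed said Democrat

# Non-parametric bootstrap: resampling the 50 0/1 answers with replacement
# is equivalent to drawing Binomial(50, 19/50) and dividing by 50.
boot_P = rng.binomial(n, s / n, size=100_000) / n

# Bayesian estimate with a Uniform(0, 1) prior: P ~ Beta(s + 1, n - s + 1).
bayes_P = rng.beta(s + 1, n - s + 1, size=100_000)

# Both centre near 19/50 = 0.38; the bootstrap version only ever takes the
# 51 discrete values 0/50, 1/50, ..., 50/50, while the Bayesian is continuous.
```

Plotting histograms of `boot_P` and `bayes_P` side by side reproduces the qualitative picture of Figure 9.21.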
They are very close, except that the bootstrap estimate is discrete and the Bayesian is continuous, and, as the sample size increases, they would become progressively closer.

Figure 9.21 Bootstrap and Bayesian estimates of prevalence for Example 9.9.

9.3.2 The parametric bootstrap

The non-parametric bootstrap in the previous section made no assumptions about the distributional form of the population (parent) distribution. However, there will be many times when we know to which family of distributions the parent distribution belongs. For example, the number of earthquakes each year and the number of Giardia cysts in litres of water drawn from a lake will logically both be approximately Poisson distributed; the time between phone calls to an exchange will be roughly exponentially distributed; and the number of males in randomly sampled groups of a certain size will be binomially distributed. The parametric bootstrap gives us a means to use the extra information we have about the population distribution. The procedure is as follows:

1. Collect the dataset of n samples {x1, ..., xn}.
2. Determine the parameter(s) of the distribution that best fit(s) the data from the known distribution family using maximum likelihood estimators (MLEs - see Section 10.3.1).
3. Generate B bootstrap samples {x1*, ..., xn*} by randomly sampling from this fitted distribution.
4. For each bootstrap sample {x1*, ..., xn*}, calculate the required statistic θ̂.
5. The distribution of these B estimates of θ̂ represents the bootstrap estimate of uncertainty about the true value of θ.

We can illustrate the technique by using the pupil measurement data again. Let us assume that we know for some reason (perhaps experience from other countries) that this measurement should be normally distributed for a population.
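The procedure above can be sketched in Python for the pupil-diameter data from the earlier example, under the normal assumption just stated (the book does this in a spreadsheet; the seed and B are arbitrary choices for the sketch):

```python
import random
import statistics

random.seed(3)

data = [5.92, 5.06, 6.16, 5.60, 4.87, 5.61, 5.72, 5.36, 6.03, 5.71]
n = len(data)

# Step 2: fit a normal distribution to the data (mean and standard deviation)
mu_hat = statistics.mean(data)
sd_hat = statistics.stdev(data)  # sample sd; the strict MLE divides by n instead

B = 10_000
# Steps 3-4: draw n values from the fitted normal and record the mean
boot_means = [
    statistics.mean(random.gauss(mu_hat, sd_hat) for _ in range(n))
    for _ in range(B)
]

print(statistics.mean(boot_means))   # close to the sample mean, 5.604
print(statistics.stdev(boot_means))  # close to sd_hat / sqrt(n)
```

The spread of the bootstrap means matches the familiar standard error of the mean, sd/√n, which is why the result resembles the classical answer.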
The normal distribution has two parameters, its mean and standard deviation, both of which we will assume to be unknown, and their MLEs are the mean and standard deviation of the data to be fitted. The mean and standard deviation of the pupil measurements are 5.604 mm and 0.410 mm respectively. Figure 9.22 shows a spreadsheet model where, in column D, 10 Normal(5.604, 0.410) distributions are randomly sampled to give the bootstrap sample. Cell D14 calculates the mean (the statistic of interest) of the bootstrap sample. Figure 9.23 shows the results of this parametric bootstrap model, together with the result from applying the classical statistics method of Equation (9.2) - they are very similar. The result also looks very similar to the non-parametric distribution of Figure 9.20. In comparison with the classical statistics model, which happens to be exact for this particular problem (i.e. when the parent distribution is normal), both bootstrap methods provide a narrower range. In other words, the bootstrap in its simplest form tends to underestimate the uncertainty associated with the parameter of interest. A number of corrective measures are proposed in Efron and Tibshirani (1993).

Figure 9.22 Example of a parametric bootstrap model. (Cells C4:C13 hold the data; C14 = AVERAGE(C4:C13) and C15 = STDEV(C4:C13) give the fitted parameters; cells D4:D13 contain =VoseNormal($C$14,$C$15); cell D14 averages the bootstrap sample.)

Figure 9.23 Results of the parametric bootstrap model of Figure 9.22, together with the classical statistics result (true mean pupil diameter, mm).

Imagine that we wish to estimate the true depth of a well using some sort of sonic probe. The probe has a known standard error σ = 0.2 m, i.e.
σ is the standard deviation of the normally distributed variation in results the probe will produce when repeatedly measuring the same depth. In order to estimate this depth, we take n separate measurements. These measurements have a mean of x̄ metres. The parametric bootstrap model would take the average of n Normal(x̄, σ) distributions to estimate the true mean μ of the distribution of possible measurement results, i.e. the true well depth. From the central limit theorem, we know that this calculation is equivalent to

μ = Normal(x̄, σ/√n)

which is the classical statistics result of Equation (9.3).

Parametric bootstrap estimate of the standard deviation of a normal distribution

It can also be shown that the parametric bootstrap estimates of the standard deviation of a normal distribution when the mean is and is not known are exactly the same as the classical statistics estimates given in Equations (9.5) and (9.6) (the reader may like to prove this, bearing in mind that the ChiSq(ν) distribution is the sum of the squares of ν independent unit normal distributions).

Example 9.10 Parametric bootstrap estimate of mean time between calls at a telephone exchange

Imagine that we want to predict the number of phone calls there will be at an exchange during a particular hour in the working day (say 2 p.m. to 3 p.m.). Imagine that we have collected data from this period on n separate, randomly selected days. It is reasonable to assume that telephone calls will arrive at a Poisson rate, since each call will be, roughly speaking, independent of every other. Thus, we could use a Poisson distribution to model the number of calls in an hour. The maximum likelihood estimate (MLE) of the mean number of calls per hour at this time of day is simply the average number of calls observed in the test periods, x̄ (see Example 10.3 for proof). Thus, our bootstrap replicate is a set of n independent Poisson(x̄) distributions.
To generate our uncertainty about the true mean number of phone calls per hour at this time of the day, we calculate the mean of the sum of the bootstrap replicate, i.e. the average of n independent Poisson(x̄) distributions. The sum of n independent Poisson(x̄) distributions is simply Poisson(nx̄), so the average of n Poisson(x̄) distributions is Poisson(nx̄)/n, where nx̄ is simply the sum of the observations. So, in general, if one has observations from n periods, the Poisson parametric bootstrap for the mean number of observations per period λ is given by

λ = Poisson(S)/n

where S is the sum of observations in the n periods. The uncertainty distribution of λ should be continuous, as λ can take any positive real value. However, the bootstrap will only generate discrete values for λ, i.e. {0, 1/n, 2/n, ...}. When n is large this is not a problem, since the allowable values are close together, but when S is small the approximation starts to fall down. Figure 9.24 illustrates three Poisson parametric bootstrap estimates for λ, for S = 2, 10 and 20 combined with n = 5. For S = 2, the discreteness will in some circumstances be an inadequate uncertainty model for λ, and a different technique like Bayesian inference would be preferable. However, for values of S around 20 or more, the allowable values are relatively close together. For large S, one can also add back the continuous characteristic of the parameter by making a normal approximation to the Poisson, i.e. since Poisson(a) ≈ Normal(a, √a) we get

λ ≈ Normal(S/n, √S/n)

or, replacing S/n with x̄, we get

λ ≈ Normal(x̄, √(x̄/n))

which also illustrates the familiar reduction in uncertainty as the square root of the number of data points n.

Figure 9.24 Three Poisson parametric bootstrap estimates for λ for S = 2, 10 and 20 from Example 9.10.
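The identity this rests on, that the average of n independent Poisson(x̄) samples has the same distribution as Poisson(nx̄)/n = Poisson(S)/n, can be checked by simulation. Python's standard library has no Poisson sampler, so this illustrative sketch uses Knuth's multiplication method, which is fine for small means:

```python
import math
import random
import statistics

random.seed(4)

def poisson(lam):
    """Knuth's method: count uniform factors until the running product
    drops to exp(-lam). Suitable for small lam."""
    limit = math.exp(-lam)
    k, prod = 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

n, S = 5, 20   # 5 observation periods, 20 calls observed in total
xbar = S / n   # MLE of the mean number of calls per period

B = 20_000
# Route 1: average of n independent Poisson(xbar) samples
route1 = [statistics.mean(poisson(xbar) for _ in range(n)) for _ in range(B)]
# Route 2: a single Poisson(S) sample divided by n
route2 = [poisson(S) / n for _ in range(B)]

print(statistics.mean(route1), statistics.mean(route2))          # both near S/n = 4.0
print(statistics.variance(route1), statistics.variance(route2))  # both near S/n**2 = 0.8
```

Both routes agree in mean and variance, and both only ever produce multiples of 1/n, which is the discreteness discussed in the text.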
9.3.3 The Bayesian bootstrap

The Bayesian bootstrap is considered to be a robust Bayesian approach for estimating a parameter of a distribution where one has a random sample x from that distribution. It proceeds in the usual bootstrap way, determining a distribution of θ, the density of which is then interpreted as the likelihood function l(x|θ). This is then used in the standard Bayesian inference formula (Equation (9.8)) along with a prior distribution π(θ) for θ to determine the posterior distribution. In many cases, the bootstrap distribution for θ closely approximates a normal distribution, so, by calculating the mean and standard deviation of the B bootstrap replicates θ̂, one can quickly define a likelihood function.

9.4 Maximum Entropy Principle

The maximum entropy formalism (sometimes known as MaxEnt) is a statistical method for determining a distribution of maximum logical uncertainty about some parameter, consistent with a certain limited amount of information. For a discrete variable, MaxEnt determines the distribution that maximises the function H(x), where

H(x) = -Σ (i = 1 to M) pi ln(pi)

and where pi is the confidence for each of the M possible values xi of the variable x. The function H(x) takes the equation of a statistical mechanics property known as entropy, which gives the principle its name. For a continuous variable, H(x) takes the form of an integral function:

H(x) = -∫ f(x) ln(f(x)) dx

The appropriate uncertainty distribution is determined by the method of Lagrange multipliers, and, in practice, the continuous variable equation for H(x) is replaced by its discrete counterpart. It is beyond the scope of this book to look too deeply into the mathematics, but there are a number of results that are of general interest. MaxEnt is often used to determine appropriate priors in a Bayesian analysis, so the results listed in Table 9.3 give some reassurance to prior distributions we might wish to use conservatively to represent our prior knowledge.
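A quick numerical illustration of the discrete result (not from the book): with no constraints beyond the probabilities summing to 1, the entropy H = -Σ pi ln(pi) over M values is maximised by the discrete uniform distribution, which attains H = ln(M):

```python
import math
import random

def entropy(p):
    """Discrete entropy H = -sum(p_i * ln(p_i)), with 0 * ln(0) taken as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

M = 4
h_uniform = entropy([1 / M] * M)  # equals ln(M)

# Compare against 1000 random distributions over the same M values
random.seed(5)
h_others = []
for _ in range(1000):
    w = [random.random() for _ in range(M)]
    total = sum(w)
    h_others.append(entropy([wi / total for wi in w]))

print(h_uniform)        # ln(4), about 1.386
print(max(h_others))    # always below the uniform's entropy
```

None of the randomly generated distributions beats the uniform, consistent with the first row of Table 9.3.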
The reader is recommended Sivia (1996) for a very readable explanation of the principle of MaxEnt and derivation of some of its results. Gzyl (1995) provides a far more advanced treatise on the subject, but requires a much higher level of mathematical understanding. The normal distribution result is interesting and provides some justification for the common use of the normal distribution when all we know is the mean and variance (standard deviation), since it represents the most reasonably conservative estimate of the parameter given that set of knowledge. The uniform distribution result is also very encouraging when estimating a binomial probability, for example. The use of a Beta(s + a, n - s + b) to represent the uncertainty about the binomial probability p when we have observed s successes in n trials assumes a Beta(a, b) prior. A Beta(1, 1) is a Uniform(0, 1) distribution, and thus our most honest estimate of p is given by Beta(s + 1, n - s + 1).

Table 9.3 Maximum entropy method.

State of knowledge | MaxEnt distribution
Discrete parameter, n possible values {xi} | DUniform({xi}), i.e. p(xi) = 1/n
Continuous parameter, minimum and maximum | Uniform(min, max), i.e. f(x) = 1/(max - min)
Continuous parameter, known mean μ and variance σ² | Normal(μ, σ)
Continuous parameter, known mean μ | Expon(μ)
Discrete parameter, known mean μ | Poisson(μ)

9.5 Which Technique Should You Use?

I have discussed a variety of methods for estimating your uncertainty about some model parameter. The question now is: which one is best? There are some situations where classical statistics has exact methods for determining confidence intervals. In such cases, it is of course sensible to use those methods, and the results are unlikely to be challenged. In situations where the assumptions behind traditional statistical methods are being stretched rather too much for comfort, you will have to use your judgement as to which technique to use.
Bootstraps, particularly the parametric bootstrap, are powerful classical statistics techniques and have the advantage of remaining purely objective. They are widely accepted by statisticians and can also be used to determine uncertainty distributions for statistics like the median, kurtosis or standard deviation for parent distributions where classical statistics has no method to offer. However, the bootstrap is a fairly new technique (in statistical terms), so you may find people resistant to making decisions based on its results, and the results can be rather "grainy". The Bayesian inference technique requires some knowledge of an appropriate likelihood function, which may be difficult and will often require some subjectivity in assessing what is a sufficiently accurate function to use. Bayesian inference also requires a prior, which can be contentious at times, but has the potential to include knowledge that the other techniques cannot allow for. Traditional statisticians will sometimes offer a technique to use on your data that implicitly assumes a random sample from a normal distribution, though the parent distribution is clearly not normal. This usually involves some sort of approximation or a transformation of the data (e.g. by taking logs) to make the data better fit a normal distribution. While I appreciate the reasons for doing this, I do find it difficult to know what errors one is introducing by such data manipulation. Quite often in our consulting work there is no option but to use Gibbs sampling, because it is the only practical way to produce the multivariate estimates that a risk analysis needs. The WinBUGS program may be a little difficult to use, but the models can be made very transparent. I suggest that, if the parameter to your model is important, it may well be worth comparing two techniques (for example, non-parametric bootstrap (or parametric, if possible) and Bayesian inference with an uninformed prior).
It will certainly give you greater confidence if there is reasonable agreement between any two methods you might choose. What is meant by reasonable will depend on your model and the level of accuracy your decision-maker needs from that model. If you find there appears to be appreciable disagreement between two methods that you test, you could try running your model twice, once with each estimate, and seeing whether the model outputs are significantly different. Finally, if the uncertainty distributions between two methods are significantly different and you cannot choose between them, it makes sense to accept that this is another source of uncertainty and simply combine the two distributions, using a discrete distribution, in the same way I describe in Section 14.3.4 on combining differing expert opinions.

9.6 Adding Uncertainty in Simple Linear Least-Squares Regression Analysis

In least-squares regression, one is attempting to model the change in one variable y (the response or dependent variable) as a function of one or more other variables {x} (the explanatory or independent variables). The regression relationship between {x} and y minimises the sum of squared errors between a fitted equation for y and the observations. The theory of least-squares regression assumes the random variations about this line (resulting from effects not explained by the explanatory variables) to be normally distributed with constant variance across all {x} values, which means the fitted line describes the mean y value for a given set of {x}. For simplicity we will consider a single explanatory variable x (i.e. simple regression analysis), and that the relationship between x and y is linear (which is linear regression analysis), i.e.
we will use a model of the variability in y as a result of changes in x with the following equation:

y = Normal(mx + c, σ)

where m and c are the gradient and y-intercept of the straight-line relationship between x and y, and σ is the standard deviation of the additional variation observed in y that is not explained by the linear equation in x. Figure 6.11 illustrates these concepts. In least-squares linear regression, we typically have a set of n paired observations {xi, yi} for which we wish to fit this linear relationship.

9.6.1 Classical statistics

Classical statistics theory (see Section 6.3.9) provides us with the best-fitting values for m, c and σ, assuming the model's assumptions to be correct, which we will name m̂, ĉ and σ̂. It also gives us exact distributions of uncertainty for the estimate ŷp = m̂xp + ĉ at some value xp (see, for example, McClave, Dietrich and Sincich, 1997) and for σ, as follows:

ŷp = m̂xp + ĉ + t(n - 2) · s · √(1/n + (xp - x̄)²/SSx)

σ = s · √((n - 1)/χ²(n - 1))

where t(n - 2) is a Student t-distribution with (n - 2) degrees of freedom, χ²(n - 1) is a chi-square distribution with (n - 1) degrees of freedom, SSx = Σ(xi - x̄)², and s is the standard deviation of the differences ei between the observed value yi and its predictor ŷi = m̂xi + ĉ.

Figure 9.25 Simple least-squares regression uncertainty about ŷ for the dataset of Table 9.4 (body weight, kg, against log10 body weight).

The uncertainty distribution for σ is independent of the uncertainty distribution for (mx + c), since the model assumes that the random variations about the regression line are constant, i.e. that they are independent of the values of x and y. It turns out that these same results are given by Bayesian inference with uninformed priors, i.e. π(m, c, σ) ∝ 1/σ. The uncertainty equation for ŷi = mxi + c produces a relationship between x and y with uncertainty that is pinched at the middle, as shown in the simple least-squares regression analysis of Figure 9.25 for the data in Table 9.4.
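In classical simple regression, the uncertainty about the mean response at xp scales with the factor √(1/n + (xp - x̄)²/SSx), which is smallest when xp = x̄ and grows towards either extreme; that is what pinches the bounds at the middle. A quick check of this, using the log10 brain-weight values from the dataset as the x values (the grid of test points is arbitrary):

```python
import statistics

# Explanatory values: log10 brain weights from the mammal dataset
x = [-1.361, -0.348, 0.230, 0.454, 1.167, 1.211, 1.348, 2.572, 2.854, 3.515]
n = len(x)
xbar = statistics.mean(x)
ssx = sum((xi - xbar) ** 2 for xi in x)

def se_factor(xp):
    """Relative width of the uncertainty about y-hat at xp
    (the factor multiplying t(n-2) * s)."""
    return (1 / n + (xp - xbar) ** 2 / ssx) ** 0.5

# Width is minimised at the mean of the x values, growing towards the extremes
widths = [(xp, se_factor(xp)) for xp in (-1.5, 0.0, xbar, 2.0, 3.5)]
for xp, w in widths:
    print(f"x = {xp:6.3f}  width factor = {w:.3f}")
```

At xp = x̄ the factor reduces to √(1/n), the uncertainty of a plain sample mean; everywhere else the quadratic term adds to it.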
This makes sense as, the further we move towards the extremes of the set of observations, the more uncertain we should be about the relationship. This dataset describes the relationship between the weight of a mammal in kilograms and the mean weight of the brain of a mammal in grams at that body weight. Strictly speaking, the theory of regression analysis says that the relationship can only be considered to hold within the range of observed values for x. However, with caution, one can reasonably extrapolate a little past the range of observed body weights, although, the further one extends beyond the observed range, the more tenuous the validity of the analysis becomes. Including uncertainty in a regression analysis means that we now have a family of normal distributions representing the possible value of y, given a specific value for x. The normal distribution reflects the observed variability about the regression line. That there is a family of these distributions reflects our uncertainty about the coefficients for the regression equation and therefore the parameters for the normal distribution.

Table 9.4 Experimental measurements of the weight of mammals' bodies and brains.

Brain weight (g) | Body weight (kg)
0.0436 | 0.685
0.4492 | 29.05
1.698 | 175.92
2.844 | 50.856
14.69 | 155.74
16.265 | 294.52
22.309 | 193.49
372.97 | 1034.4
713.72 | 9958.02
3270.15 | 35160.5

The bootstrap

The variables x and y will fit a simple least-squares regression model if the underlying relationship between these two variables is one of two forms: type A, where the {xi, yi} observations are drawn from a bivariate normal distribution in x and y; or type B, where, for any value x, the distribution of possible response values in y is Normal(mx + c, σ(x)) distributed and, for the time being, σ(x) = σ, i.e. the random variations about the line have the same standard deviation (known as homoscedasticity). In order to use the bootstrap to determine the uncertainty about the regression coefficients, we must first determine which of these two relationships is occurring.
Essentially, this is equivalent to the design of the experiment that produced the {xi, yi} observations. The experiment design is of type A if we are making random observations of x and y together, whereas the experiment design is of type B if we are testing at different specific values of x to determine the response in y. So, for example, the {body weight, brain weight} data from Table 9.4 are of type A if we have attempted to pick a fairly random sample of mammals, whereas they would be of type B if we had picked an animal from each of the 20 subspecies of a species of some particular mammal. If, for example, we were doing an experiment to demonstrate Hooke's law by adding incremental weights to a hanging spring and observing the resultant extension beyond the spring's original length, the {mass, extension} observations would again be of type B, because we are specifically controlling the x values to observe the resultant y values. For type A data, the regression coefficients can be thought of as parameters of a bivariate normal distribution. Thus, using the non-parametric bootstrap, we simply resample from the paired observations {xi, yi} and, at each bootstrap replicate, calculate the regression coefficients. Figure 9.26 illustrates this type of analysis set out in a spreadsheet model for the dataset of Table 9.4. For type B data, the x values are fixed, since they were predetermined rather than resulting from a random sample from a distribution.
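The type A pairs-resampling bootstrap can be sketched in Python using the log10 brain/body weight pairs from the mammal dataset (the book's spreadsheet version is Figure 9.26; B and the seed here are arbitrary choices for the sketch):

```python
import math
import random
import statistics

random.seed(6)

brain = [0.0436, 0.4492, 1.698, 2.844, 14.69, 16.265, 22.309, 372.97, 713.72, 3270.15]
body = [0.685, 29.05, 175.92, 50.856, 155.74, 294.52, 193.49, 1034.4, 9958.02, 35160.5]
pairs = [(math.log10(bw), math.log10(by)) for bw, by in zip(brain, body)]

def fit(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    xs, ys = zip(*points)
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, ybar - m * xbar

m_orig, c_orig = fit(pairs)

B = 5_000
slopes = []
for _ in range(B):
    sample = random.choices(pairs, k=len(pairs))   # resample the pairs, not the residuals
    if len(set(x for x, _ in sample)) > 1:         # guard against a degenerate resample
        slopes.append(fit(sample)[0])

print(m_orig)                   # slope fitted to the original data
print(statistics.mean(slopes))  # bootstrap slopes centre near m_orig
print(statistics.stdev(slopes)) # spread = uncertainty about the slope
```

Each replicate refits the line to a resampled set of pairs, so the scatter of the replicate coefficients is the bootstrap uncertainty about m and c.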
Figure 9.26 Example model for a data pairs resampling (type A) bootstrap regression analysis. (Columns B and C hold the brain and body weights from Table 9.4; columns D and E take their log10 values with =LOG(B4) etc.; column F draws a bootstrap x value with =VoseDuniform($D$4:$D$13) and column G looks up its paired y value with =VLOOKUP(F4,$D$4:$E$13,2); the replicate's coefficients are then =SLOPE(G4:G13,F4:F13), =INTERCEPT(G4:G13,F4:F13) and =STEYX(G4:G13,F4:F13).)

Assuming the random variations about the regression line to be homoscedastic and the straight-line relationship to be correct, the only random variable involved is
that producing the variations about the line, and so we seek to bootstrap the residuals. If we know the residuals are normally distributed, we can use a parametric bootstrap model, as follows:

1. Determine Syx, the standard deviation of the residuals about the least-squares regression line for the original dataset.
2. For each of the x values in the dataset, randomly sample from a Normal(ŷ, Syx), where ŷ = m̂x + ĉ and m̂ and ĉ are the least-squares regression coefficients for the original dataset.
3. Determine the least-squares regression coefficients for this bootstrap sample.
4. Repeat for B iterations.

Figure 9.27 illustrates this procedure in a spreadsheet model for the {body weight, brain weight} data.

Figure 9.27 Example model for a residuals resampling (type B) parametric bootstrap regression analysis. (Columns D and E hold the log10 data; column G computes the residuals with =E4-TREND($E$4:$E$13,$D$4:$D$13,D4) and cell G15 their standard deviation; column H generates the bootstrap sample with =VoseNormal(E4-G4,$G$15); the outputs are =SLOPE(H4:H13,D4:D13), =INTERCEPT(H4:H13,D4:D13) and =STEYX(H4:H13,D4:D13).)

Although this procedure works quite well, it would be better to use the classical statistics approach described above, which offers exact answers under these conditions. However, a slight modification to the above approach allows one to use a non-parametric bootstrap, i.e.
where we can remove the assumption of normally distributed residuals, which may often not be very accurate. For the non-parametric model, we must first develop a non-parametric distribution of residuals by changing them to have constant variance. We define the modified residual ri as follows:

ri = ei/√(1 - hi)

where the leverage hi is given by

hi = 1/n + (xi - x̄)²/SSx

The mean of the modified residuals, r̄, is calculated. Then a bootstrap sample rj* is drawn from the set of ri values and used to determine the quantity (ŷj + rj* - r̄) for each xj value, which is used in step 2 of the algorithm above. Figure 9.28 provides a spreadsheet illustration of this type of model using data from Table 9.5.

Figure 9.28 Example model for a residuals resampling (type B) non-parametric bootstrap regression analysis. (Column B holds the masses 0.0 to 1.5 kg and column C the observed extensions; column D computes the residuals ei with =C3-TREND($C$3:$C$18,$B$3:$B$18,B3); column E computes the leverages with =1/16+(B3-$B$20)^2/$B$22, where B20 is the mean of the masses and B22 is the array formula {=SUM((B3:B18-$B$20)^2)}; column F computes the modified residuals with =D3/SQRT(1-E3); column G generates the bootstrap extension with =TREND($C$3:$C$18,$B$3:$B$18,B3)+Duniform($F$3:$F$18)-$F$19, where F19 is the mean of the modified residuals; the outputs are =SLOPE(G3:G18,B3:B18) and =INTERCEPT(G3:G18,B3:B18).)

In certain problems, it is logical that the y-intercept value c be set to zero.
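The modified-residual calculation, with ei the raw residuals and hi = 1/n + (xi - x̄)²/SSx, can be sketched in Python. Two properties worth checking: the raw residuals of a least-squares fit with an intercept sum to zero, and the leverages sum to exactly 2 (one per fitted parameter). This sketch reuses the log10 brain/body weight data purely for illustration:

```python
import math
import statistics

brain = [0.0436, 0.4492, 1.698, 2.844, 14.69, 16.265, 22.309, 372.97, 713.72, 3270.15]
body = [0.685, 29.05, 175.92, 50.856, 155.74, 294.52, 193.49, 1034.4, 9958.02, 35160.5]
xs = [math.log10(v) for v in brain]
ys = [math.log10(v) for v in body]
n = len(xs)

# Least-squares fit
xbar, ybar = statistics.mean(xs), statistics.mean(ys)
ssx = sum((x - xbar) ** 2 for x in xs)
m = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ssx
c = ybar - m * xbar

# Raw residuals e_i, leverages h_i and modified residuals r_i
e = [y - (m * x + c) for x, y in zip(xs, ys)]
h = [1 / n + (x - xbar) ** 2 / ssx for x in xs]
r = [ei / math.sqrt(1 - hi) for ei, hi in zip(e, h)]

print(sum(e))  # ~0: residuals about a fitted line with intercept sum to zero
print(sum(h))  # exactly 2 for simple regression with an intercept
```

Because 1 - hi < 1, each modified residual is at least as large in magnitude as its raw residual, with the biggest inflation at the extreme x values where the fitted line tracks the data most closely.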
In this situation, the leverage values are different:

hi = xi²/Σxj²

The modified residuals are thus also different and won't sum to zero, so it is essential to mean-correct the residuals before they are used to simulate random errors.

Table 9.5 Experimental measurements of the variation in length of a vertical spring as weight is attached to its end (masses of 0.0 to 1.5 kg in steps of 0.1 kg, with the corresponding extensions in mm; the values are shown in Figure 9.28).

Bootstrapping the data pairs is more robust than bootstrapping the residuals, as it is less sensitive to any deviation from the regression assumptions, but it won't be as accurate where the assumptions are correct. However, as the dataset increases in size, the results from bootstrapping the pairs approach those from bootstrapping the residuals, and it is also easier to execute, of course. These techniques can be extended to non-linear, non-constant variance and multiple linear regressions, described in detail in Efron and Tibshirani (1993) and Davison and Hinkley (1997).

Chapter 10 Fitting distributions to data

In this chapter I use the statistical methods I've described in Chapter 9 to fit probability distributions to data. I also briefly describe how regression models are fitted to data. There are other types of probability models we use in risk analysis: fitting time series and copulas are described elsewhere in this book. This chapter is concerned with a problem frequently confronted by the risk analyst: that of determining a distribution to represent some variable in a risk analysis model. There are essentially two sources of information used to quantify the variables within a risk analysis model. The first is available data and the second is expert opinion. Chapter 14 deals with the quantification of the parameters that describe the variability purely from expert opinion. Here I am going to look at techniques to interpret observed data for a variable in order to derive a distribution that realistically models its true variability and our uncertainty about that true variability.
Any interpretation of data by definition requires some subjective input, usually in the form of assumptions about the variable. The key assumption here is that the observed data can be thought of as a random sample from some probability distribution that we are attempting to identify. The observed data may come from a variety of sources: scientific experiments, surveys, computer databases, literature searches, even computer simulations. It is assumed here that the analyst has satisfied himself that the observed data are both reliable and as representative as possible. Anomalies in the data should be checked out first where possible and any unreliable data points discarded. Thought should also be given to any possible biases that could be produced by the method of data collection, for example: a high-street survey may have visited an unrepresentative number of large or affluent towns; the data may have come from an organisation that would benefit from doctoring the data; etc. I start by encouraging analysts to review the data they have available and the characteristics of the variable that is to be modelled. Several techniques are then discussed that enable analysts to fit the available data to an empirical (non-parametric) distribution. The key advantages of this intuitive approach are the simplicity of use, the avoidance of assuming some distribution form and the omission of inappropriate or confusing theoretical (parametric or model-based) distributions. Techniques are then described for fitting theoretical distributions to observed data, including the use of maximum likelihood estimators, optimising goodness-of-fit statistics and plots. For both non-parametric and parametric distribution fitting, I have offered two approaches. The first approach provides a first-order distribution, i.e. a best-fitting (best-guess) distribution that describes the variability only.
The second approach provides second-order distributions that describe both the variability of the variable and the uncertainty we have about what that true distribution of variability really is. Second-order distributions are more complete than their first-order counterparts and require more effort: if there is a sufficiently large set of data such that the inclusion of uncertainty provides only marginally more information, it is quite reasonable to approximate the distribution to one of variability only. That said, it is often difficult to gauge the degree of uncertainty one has about a distribution without having first formally determined its uncertainty. The reader is therefore encouraged at least to go through the exercise of describing the uncertainty of a variability distribution to determine whether the uncertainty needs to be included.

10.1 Analysing the Properties of the Observed Data

Before attempting to fit a probability distribution to a set of observed data, it is worth first considering the properties of the variable in question. The properties of the distribution or distributions chosen to be fitted to the data should match those of the variable being modelled. Software packages like BestFit, EasyFit, Stat::Fit and ExpertFit have made fitting distributions to data very easy and removed the need for any in-depth statistical knowledge. These products can be very useful but, through their automation and ease of use, inadvertently encourage the user to attempt fits to wholly inappropriate distributions. It is therefore worth considering the following points before attempting a fit: Is the variable to be modelled discrete or continuous? A discrete variable may only take certain specific values, for example the number of bridges along a motorway, but a measurement such as the volume of tarmac, for example, is continuous. A variable that is discrete in nature is usually, but not always, best fitted to a discrete distribution.
A very common exception is where the increment between contiguous allowable values is insignificant compared with the range that the variable may take. For example, consider a distribution of the number of people using the London Underground on any particular day. Although there can only be a whole number of people using the Tube, it is easier to model this number as a continuous variable since the number of users will number in the millions and there is little importance and considerable practical difficulty in recognising the discreteness of the number. In certain circumstances, discrete distributions can be very closely approximated by continuous distributions for large values of x. If a discrete variable has been modelled by a continuous distribution for convenience, its discrete nature can easily be put back into the risk analysis model by using the ROUND(...) function in Excel. The reverse of the above, however, never occurs, i.e. data from a continuous variable are always fitted to a continuous distribution and never a discrete distribution. Do I really need to fit a mathematical (parametric) distribution to my data? It is often practical to use the data points directly to define an empirical distribution, without having to attempt a fit to any theoretical probability distribution type. Section 10.2 describes these methods. Does the theoretical range of the variable match that of the fitted distribution? The fitted distribution should, within reason, cover the range over which the variable being modelled may theoretically extend. If the fitted distribution extends beyond the variable's possible range, a risk analysis model will produce impossible scenarios. If the distribution fails to extend over the entire possible range of the variable, the risk analysis will not reflect the true uncertainty of the problem.
For example, data on the oil saturation of a hydrocarbon reserve should be fitted to a distribution that is bounded at zero and 1, as values outside that range are nonsensical. It may turn out that a normal distribution, for example, fits the data far better than any other shape, but, of course, it extends from −∞ to +∞. In order to ensure that the risk analysis only produces meaningful scenarios, the normal distribution would be truncated in the risk analysis model at zero and 1. Note that a correctly fitted distribution will usually cover a range that is greater than that displayed by the observed data. This is quite acceptable because data are rarely observed at the theoretical extremes for the variable in question.

Do you already know the value of the distribution parameters? This applies most often to discrete variables. For example, a Hypergeometric(n, D, M) distribution describes the number of successes we might have in n samples taken without replacement from a population of size M, where a success means the individual comes from a subpopulation of size D. It seems unlikely that we would not know how many samples were taken to have observed our dataset of successes. More likely is that we already know n and D and are trying to estimate M, or we know n and M and are trying to estimate D. Discrete distributions like the binomial, beta-binomial, negative binomial, beta negative binomial, hypergeometric and inverse hypergeometric have either the number of samples n or the number of required successes s as parameters, and these will generally be known.

Is this variable independent of other variables in the model? The variable may be correlated with, or a function of, another variable within the model. It may also be related to another variable outside the model but which, in turn, affects a third variable within the risk analysis model. Figure 10.1 illustrates a couple of examples.
In example (a), a high-street bank's revenue is modelled as a function of the interest and mortgage rates, among other things. The mortgage rate is correlated to the interest rate since the interest rate largely defines what the mortgage rate is to be. This relationship must be included in the model to ensure that the simulation will only produce meaningful scenarios. There are two approaches to modelling such dependency relationships:

1. Determine distributions for the mortgage and interest rates on the basis of historical data and then correlate the sampling from these distributions during simulation.

2. Determine the distribution of interest rate from historical data and a (stochastic) functional relationship with the mortgage rate.

Figure 10.1 Examples of dependency between model variables: (a) direct and (b) indirect.

Method 1 is tempting because of its simple execution, but method 2 offers greater opportunity to reproduce any observed relationship between the two variables. In example (b) of Figure 10.1, a construction subcontractor is calculating her bid price to supply labour for a roofing job. The choice of roofing material has not yet been decided, and this uncertainty has implications for the person-hours that will be needed to construct the roofing timbers and to lay the roof. There is therefore an indirect dependency between these two variables that could easily have been missed, had she not looked outside the immediate components of her cost calculation. Missing this correlation would have resulted in an underestimation of the spread of the subcontractor's cost and potentially could have led her to quote a price that exposed her to significant loss.
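Method 2 can be sketched in a few lines of code. The sketch below is not the bank model from Figure 10.1; the rate means, standard deviations and the lender's margin are invented numbers, and numpy is assumed. The point is simply that deriving the mortgage rate from the interest rate through a stochastic functional relationship produces the correlation automatically, rather than imposing it on two independently fitted distributions.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_iterations = 10_000

# Hypothetical distribution for the interest rate (mean 4%, sd 1%),
# standing in for one fitted to historical data.
interest_rate = rng.normal(0.04, 0.01, n_iterations)

# Method 2: the mortgage rate is the interest rate plus an uncertain
# lender's margin (illustrative mean 1.5%, sd 0.2%).
margin = rng.normal(0.015, 0.002, n_iterations)
mortgage_rate = interest_rate + margin

# The two inputs are now strongly correlated without any explicit
# correlation coefficient having been imposed.
print(np.corrcoef(interest_rate, mortgage_rate)[0, 1])  # close to 1
```

Because the margin's spread is small relative to the interest rate's, the induced correlation is high; widening the margin distribution weakens it, which is exactly the flexibility method 2 offers over a fixed rank correlation.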
Correlation and dependency relationships form a vital part of many risk analyses. Chapter 13 describes several techniques to model correlation and dependencies between variables.

Does a theoretical distribution exist that fits the mathematics of this variable? Many theoretical distributions have developed as a result of modelling specific types of problem. These distributions then find a wider use in other problems that have the same mathematical structure. Examples include: the times between telephone calls at a telephone exchange, or between fires in a railway system, may be accurately represented by an exponential distribution; the time until failure of an electronics component may be represented by a Weibull distribution; how many treble 20s a darts player will score with a specific number of darts may be represented by a binomial distribution; the number of cars going through a road junction in any one hour may be represented by a Poisson distribution; and the heights of the tallest and shortest children in UK school classes may be represented by Gumbel distributions. If a distribution can be found with the same mathematical basis as the variable being modelled, it only remains to find the appropriate parameters to define the distribution, as explained in Section 10.3.

Does a theoretical distribution exist that is well known to fit this type of variable? Many types of variable have been observed closely to follow specific distribution types without any mathematical rationale being available to explain such close matching. Examples abound with the normal distribution: the weight of babies and other measures that come from nature, which is how the normal distribution got its name; measurement errors in engineering; variables that are the sum of other variables (e.g. means of samples from a population), etc. However, there are many other examples for distributions like the lognormal, Pareto and Rayleigh, some of which are noted in Appendix III.
If a distribution is known to be a close fit to the type of variable being modelled, usually as a result of published academic work, all that remains is to find the best-fitting distribution parameters, as explained in Section 10.3.

Errors - systematic and non-systematic

The collected data will at times have measurement errors that add another level of uncertainty. In most scientific data collection, the random error is well understood and can be quantified, usually by simply repeating the same measurement and reviewing the distribution of results. Such random errors are described as non-systematic. Systematic errors, on the other hand, mean that the values of a measurement deviate from the true value in a systematic fashion, consistently either over- or underestimating the true value. This type of error is often very difficult to identify and quantify. One will often attempt to estimate the degree of suspected systematic measurement error by comparing with measurements using another technique that is known (or believed) to have little or no systematic error.

Systematic and non-systematic error can both be accounted for in determining a distribution of fit. In determining a first-order distribution, one need only adjust the data by the systematic error (the non-systematic error has, by definition, a mean shift of zero). In second-order distribution fitting, one can model the data as being uncertain, with appropriate distributions representing both the non-systematic error and the systematic error (including uncertainty about what these error parameters are).

Sample size

Is the number of data points available sufficient to give a good idea of the true variability? Consider the 20 plots of Figure 10.2, which each show a random sample of 20 values drawn from a Normal(100, 10) distribution. These samples are all plotted as histograms with six evenly spaced bars, three either side of 100.
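A numerical sketch of this exercise, assuming numpy; the seed and the particular bin edges (70 to 130 in steps of 10, giving six bars, three either side of 100) are my own choices:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Draw 20 independent samples of 20 values each from Normal(100, 10)
# and bin each sample into six evenly spaced bars around 100.
edges = np.arange(70, 131, 10)  # 70, 80, ..., 130 -> 6 bins
counts = []
for _ in range(20):
    sample = rng.normal(100, 10, 20)
    hist, _ = np.histogram(sample, bins=edges)
    counts.append(hist)
counts = np.array(counts)

# Even though every sample comes from the same parent distribution,
# the bar heights vary widely from plot to plot:
print(counts.min(axis=0))
print(counts.max(axis=0))
```

Running this (or plotting each row of `counts`) reproduces the effect the figure illustrates: few of the 20 histograms look anything like a symmetric bell shape.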
The variation in shapes is something of an eye-opener to a lot of people, who expect to see plots that look at least reasonably like bell-shaped curves, symmetric about 100. After all, one might think that 20 data points is a reasonable number from which to draw some inference. The bottom-right panel in Figure 10.2 shows all 400 data values (i.e. 20 plots * 20 data values each), which looks something like a normal distribution but nonetheless still has a significant degree of asymmetry. It is an interesting and useful exercise when attempting to fit data to a distribution to see what sort of patterns one would observe if the data did truly come from the distribution that is being fitted. So, for example, if I had 30 data values that I was fitting to a Lognormal(10, 2) distribution, I could plot a variety of 30 Monte Carlo samples (not Latin hypercube samples, which force a better-fitting sample to the true distribution than a random sample would produce) from a Lognormal(10, 2) distribution in histogram form and see the different patterns they produce. I am at least then aware of the range of data patterns that I should accept as feasibly coming from that distribution for that size of sample.

Overdispersion of data

Sometimes we wish to fit a parametric distribution to observations, but note that the data appear to show a much larger spread than the fitted distribution would suggest. For example, in fitting a binomial distribution to the results of a multiple-question exam taken by a large class, one might imagine that the distribution of the number of correct answers could be modelled by a Binomial(n, p) distribution, where n is the number of questions and p is the average probability for the class of correctly answering a question.
The spread of the fitted binomial distribution is essentially determined by the mean np, since n is fixed, so there is no opportunity to match the fitted distribution to the data in terms of the observed spread in results as well as the average result. One plausible reason for the fit being poor is that there will be a range of abilities in the class. If one models the range of probabilities of successfully answering a question across all the individuals as a beta distribution, the resultant distribution of results will be drawn from a beta-binomial distribution, which is then the appropriate distribution to fit to the data. The extra variability added to the binomial distribution by making p beta distributed means that the beta-binomial distribution will always have more spread than the binomial. The beta-binomial distribution has three parameters: α, β and n, where α and β (sometimes written as α1 and α2) are the parameters of the beta distribution and n remains the number of trials. These three parameters allow a better and logical match to the mean and variance of the observations. As α and β become larger, the beta distribution becomes narrower, i.e. the participants have a narrow range of probabilities of successfully answering a question (the population is more homogeneous), and the Beta-Binomial(n, α, β) is then approximated well by a Binomial(n, α/(α + β)).

Figure 10.2 Examples of distributions of 20 random samples from a Normal(100, 10) distribution.

The same type of problem applies in fitting the Poisson(λ) distribution to data. Since the mean and variance are both equal to λ, the spread of the distribution is determined by the mean. Observed data are often more widely dispersed than a Poisson distribution might suggest, and this is often because the n observations come from Poisson processes with different means λ1, ..., λn.
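The mixing idea can be demonstrated numerically. The sketch below (numpy assumed; the parameter values are illustrative only) draws each observation's λ from a gamma distribution before drawing the Poisson count, and compares the resulting variance with that of a plain Poisson with the same mean:

```python
import numpy as np

rng = np.random.default_rng(seed=3)
n = 100_000

# Plain Poisson: mean and variance are both lambda.
lam = 4.0
plain = rng.poisson(lam, n)

# Gamma-mixed Poisson: each observation gets its own lambda drawn from
# a gamma distribution (shape=2, scale=2 are illustrative, chosen so
# the mean lambda is still 2 * 2 = 4).
lambdas = rng.gamma(shape=2.0, scale=2.0, size=n)
mixed = rng.poisson(lambdas)

print(plain.mean(), plain.var())  # both near 4
print(mixed.mean(), mixed.var())  # mean near 4, variance clearly larger
```

The mixed counts keep the same mean but show a variance of roughly 4 + 8 = 12 (the Poisson variance plus the variance of the gamma), which is exactly the overdispersion the text describes.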
For example, one might be looking at the failure rates of computers. Each computer will be slightly different from the next and so will have its own λ. If one models the distribution of variability of the λs using a Gamma(α, β) distribution, the resultant distribution of failures in a single time period is a Pólya(α, β). The Pólya distribution always has a variance greater than the mean, and its two parameters allow a greater flexibility in matching the distribution to the mean and variance of the observations.

Finally, data fitted to a normal distribution can often demonstrate longer tails than a normal distribution. In such cases, the three-parameter Student t-distribution can be used, i.e. Student(ν) * σ + μ, where μ is the fitted distribution's mean, σ is the fitted distribution's standard deviation and ν is the "degrees of freedom" parameter that determines the shape of the distribution. For ν = 1, this is the Cauchy distribution, which has infinite (i.e. undeterminable) mean and standard deviation. As ν gets larger, the tails shrink until at very large ν (some 50 or more) this looks like a Normal(μ, σ) distribution. The three-parameter Student t-distribution can be derived as the mixture of normal distributions with the same mean and different variances distributed as a scaled inverse χ². So, in attempting to fit data to the three-parameter Student t-distribution instead of a normal distribution, you would need to be able reasonably to convince yourself that the observations were drawn from normal distributions with the same mean and different variances.

10.2 Fitting a Non-Parametric Distribution to the Observed Data

This section discusses techniques for fitting an empirical distribution to data. We look at continuous and then discrete variables, and both first-order (variability only) and second-order (variability and uncertainty) fitting.
10.2.1 Modelling a continuous variable (first order)

If the observed variable is continuous and the dataset reasonably extensive, it is often sufficient to use a cumulative frequency plot of the data points themselves to define its probability distribution. Figure 10.3 illustrates an example with 18 data points. The observed F(x) values are calculated as the expected F(x) that would correspond to a random sampling from the distribution, i.e. F(x) = i/(n + 1), where i is the rank of the observed data point and n is the number of data points. An explanation for this formula is provided in the next section. Determination of the empirical cumulative distribution proceeds as follows:

- The minimum and maximum for the empirical distribution are subjectively determined on the basis of the analyst's knowledge of the variable. For a continuous variable, these values will generally be outside the observed range of the data. The minimum and maximum values selected here are 0 and 45.
- The data points are ranked in ascending order between the minimum and maximum values.
- The cumulative probability F(xi) for each xi value is calculated as:

F(xi) = i/(n + 1)    (10.1)

This formula maximises the chance of replicating the true distribution.

Figure 10.3 Fitting a continuous empirical distribution to data using a cumulative distribution (number of data points n = 18).

The two arrays, {xi} and {F(xi)}, along with the minimum and maximum values, can then be used as direct inputs into a cumulative distribution CumulA(min, max, {xi}, {F(xi)}). The VoseOgive function in ModelRisk will simulate values from a distribution constructed using the method above. If there is a very large amount of data, it becomes impracticable to use all of the data points to define the cumulative distribution. In such cases it is useful to batch the data first.
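Before turning to batching, the unbatched construction (rank the data, assign F(xi) = i/(n + 1), close the distribution with subjective minimum and maximum, and sample by inverse transform with linear interpolation) can be sketched as follows. This is not the ModelRisk implementation; numpy is assumed, and the 18 data values below are invented purely to mirror the shape of the Figure 10.3 example, with minimum 0 and maximum 45:

```python
import numpy as np

def empirical_cdf_points(data, minimum, maximum):
    """Rank the data and assign F(x_i) = i / (n + 1), closing the
    distribution at the subjective minimum (F = 0) and maximum (F = 1)."""
    xs = np.sort(np.asarray(data, dtype=float))
    n = len(xs)
    ps = np.arange(1, n + 1) / (n + 1)
    return (np.concatenate(([minimum], xs, [maximum])),
            np.concatenate(([0.0], ps, [1.0])))

def sample_ogive(data, minimum, maximum, size, rng):
    """Inverse-transform sampling, interpolating linearly between the
    cumulative points (the same idea as a CumulA/Ogive distribution)."""
    xs, ps = empirical_cdf_points(data, minimum, maximum)
    u = rng.random(size)
    return np.interp(u, ps, xs)

rng = np.random.default_rng(seed=11)
data = [12, 7, 30, 22, 18, 25, 9, 15, 21, 28, 11, 19,
        24, 16, 33, 14, 20, 26]  # 18 illustrative points
samples = sample_ogive(data, minimum=0, maximum=45, size=10_000, rng=rng)
print(samples.min(), samples.max())  # all values lie within [0, 45]
```

Note how the subjective minimum and maximum give the distribution tails beyond the observed data, as the text recommends.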
The number of batches should be set to the practical maximum that balances fineness of detail (large number of bars) with the practicalities of having large arrays defining the distribution (lower number of bars).

Example 10.1 Fitting a continuous non-parametric distribution to data

Figure 10.4 illustrates an example where 221 data points are plotted in histogram form over the range of the observed data. The analyst considers that the variable could conceivably range from 0 to 300. Since there are no observed data with values below 20 or above 280, the histogram bar ranges need to be altered to accommodate the subjective minimum and maximum.

Observed frequencies (number of data points n = 221):

From A   To B   Histogram probability f(A < x <= B)   Cumulative probability F(x <= B)
20       40     0.018                                 0.018
40       60     0.113                                 0.131
60       80     0.204                                 0.335
80       100    0.199                                 0.534
100      120    0.145                                 0.679
120      140    0.118                                 0.796
140      160    0.050                                 0.846
160      180    0.045                                 0.891
180      200    0.045                                 0.937
200      220    0.023                                 0.959
220      240    0.027                                 0.986
240      260    0.009                                 0.995
260      280    0.005                                 1.000

Modelled distribution:

From A   To B   Cumulative probability F(x <= B)
0        40     0.018
40       60     0.131
60       80     0.335
80       100    0.534
100      120    0.679
120      140    0.796
140      160    0.846
160      180    0.891
180      200    0.937
200      220    0.959
220      240    0.986
240      260    0.995
260      300    1.000

Figure 10.4 Fitting an empirical distribution to histogrammed data using a cumulative distribution.

The easiest way to achieve this is to extend the range of the first and last bars with non-zero probability to cover the required range, but without altering their probabilities. In this example, the histogram bar with range 20-40 is expanded to a range 0-40, and the bar with range 260-280 is expanded to range 260-300. We will probably have slightly exaggerated the tails of the distribution.
However, if the number of bars initially selected is quite large, there will be little real effect on the model. The {xi} array input into the cumulative distribution is then {40, 60, ..., 240, 260}, the {Pi} array is {0.018, 0.131, ..., 0.986, 0.995} and the minimum and maximum are, of course, 0 and 300 respectively.

Converting a histogram distribution into a cumulative distribution may seem a little pointless when the histogram can be used in a risk analysis model. However, this technique allows analysts to select varying bar widths to suit their needs, as in the above example, and therefore to maximise detail in the distribution where it is needed.

10.2.2 Modelling a continuous variable (second order)

When we do not have a great deal of data, a considerable amount of uncertainty will remain about an empirical distribution determined directly from the data. It would be very useful to have the flexibility of using an empirical distribution, i.e. not having to assume a parametric distribution, and also to be able to quantify the uncertainty about that distribution. The following technique provides both.

Consider a set of n data values {xj} drawn from a distribution and ranked in ascending order into {xi}, so that xi < xi+1. Data thus ranked are known as the order statistics of {xj}. Individually, each of the values of {xj} maps onto the cumulative probability F(x) of the parent distribution as a U(0, 1) random variable. We therefore take a U(0, 1) distribution as the prior distribution for the cumulative probability Pi = F(xi) of the ith observation. However, we have the additional information that, of n values drawn randomly from this distribution, xi ranked ith, i.e. (i − 1) of the data values are less than xi, and (n − i) values are greater than xi.
Using Bayes' theorem and the binomial theorem, the posterior marginal distribution for Pi can readily be determined, remembering that Pi has a U(0, 1) prior and therefore a prior probability density of 1:

f(Pi | xi ranked ith of n) ∝ Pi^(i−1) * (1 − Pi)^(n−i)

which is simply the standard beta distribution:

Pi = Beta(i, n − i + 1)    (10.2)

Equation (10.2) could actually be determined directly from the fact that the beta distribution is the conjugate prior to the binomial likelihood function and that a U(0, 1) = Beta(1, 1). The mean of the Beta(i, n − i + 1) distribution equals i/(n + 1): the formula that was used in Equation (10.1) to estimate the best-fitting first-order non-parametric cumulative distribution.

Since Pi+1 > Pi, these beta distributions are not independent, so we need to determine the conditional distribution f(Pi+1 | Pi), as follows. The joint distribution f(Pi, Pj) for any two Pi, Pj is calculated using the binomial theorem in a similar manner to the numerator above, that is

f(Pi, Pj) ∝ Pi^(i−1) * (Pj − Pi)^(j−i−1) * (1 − Pj)^(n−j)

where Pj > Pi, and remembering that the prior probability densities for Pi and Pj equal 1 since they have U(0, 1) priors. Thus, for j = i + 1,

f(Pi, Pi+1) ∝ Pi^(i−1) * (1 − Pi+1)^(n−i−1)

The conditional probability density f(Pi+1 | Pi) is thus given by

f(Pi+1 | Pi) = k * (1 − Pi+1)^(n−i−1) / (1 − Pi)^(n−i)

where k is some constant. The corresponding cumulative distribution function F(Pi+1 | Pi) is then given by

F(Pi+1 | Pi) = (k/(n − i)) * (1 − ((1 − Pi+1)/(1 − Pi))^(n−i))

Since this must equal 1 when Pi+1 = 1, k = n − i and the formula reduces to

F(Pi+1 | Pi) = 1 − ((1 − Pi+1)/(1 − Pi))^(n−i)    (10.3)

Together, Equations (10.2) and (10.3) provide us with the tools to construct a non-parametric second-order distribution for a continuous variable given a dataset sampled from that distribution.

[Footnote: I submitted a paper on this technique (I developed the idea) for publication in a journal a long time ago. One reviewer was horribly dismissive, saying that the derivation was one of the most drunken s/he had ever seen, and anyway it was a Bayesian method (it isn't) so it was of no value. Actually, this has proven to be one of the most useful things I ever figured out.]
The distribution for the cumulative probability P1 that maps onto the first-order statistic x1 can be obtained from Equation (10.2) by setting i = 1:

P1 = Beta(1, n)    (10.4)

The distribution for the cumulative probability P2 that maps onto the second-order statistic x2 can then be obtained from Equation (10.3). Being a cumulative distribution function, F(Pi+1 | Pi) is Uniform(0, 1) distributed. Thus, writing Ui+1 to represent a Uniform(0, 1) distribution in place of F(Pi+1 | Pi), using the identity 1 − U(0, 1) = U(0, 1), and rewriting for Pi+1, we obtain

Pi+1 = 1 − (1 − Pi) * Ui+1^(1/(n−i))    (10.5)

which gives

P2 = 1 − (1 − P1) * U2^(1/(n−1))
P3 = 1 − (1 − P2) * U3^(1/(n−2))

etc. Note that each of the U2, U3, ..., Un uniform distributions is independent of the others.

The formulae from Equations (10.4) and (10.5) can be used as inputs into a cumulative distribution function available from standard Monte Carlo software tools like @RISK and Crystal Ball, together with subjective estimates of the minimum and maximum values that the variable may take. The variability ("inner loop") is described by the range for the variable in question and estimates of the cumulative distribution shape via the {xi} and {Pi} values. The uncertainty ("outer loop") is catered for by the uncertainty distributions for the minimum, maximum and Pi values.

The RiskCumul distribution function in @RISK, the VoseCumulA function in ModelRisk and the cumulative version of the custom distribution in Crystal Ball have the same cumulative distribution function, namely

F(x) = Pi + (x − xi) * (Pi+1 − Pi)/(xi+1 − xi)

where x0 = minimum, xn+1 = maximum, P0 = 0, Pn+1 = 1 and xi <= x < xi+1.

Figure 10.5 illustrates a model where a dataset is being used to create a second-order distribution using this technique. If the model is created in the current version of @RISK, the uncertainty distributions for F(x) in column D are nominated as outputs, a smallish number of iterations are run and the resultant data are exported back to a spreadsheet.
Those data are then used to perform multiple simulations (the "outer loop") of uncertainty using @RISK's RiskSimtable function; the "inner loop" of variability comes from the cumulative distribution itself, as shown in Figure 10.6.

Figure 10.5 Model to produce a second-order non-parametric continuous distribution.

Figure 10.6 @RISK model to run a second-order risk analysis using the data generated from the model of Figure 10.5.

Figure 10.7 Crystal Ball Pro model to run a second-order risk analysis using the data generated from the model of Figure 10.5.
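Outside a spreadsheet, the recursion of Equations (10.4) and (10.5) takes only a few lines. The sketch below (numpy assumed; n = 100 mirrors the 100-point example, and the 200 outer-loop draws are an arbitrary choice) generates repeated uncertainty samples of the cumulative probabilities P1, ..., Pn:

```python
import numpy as np

def second_order_cumulative_probs(n, rng):
    """One uncertainty sample of P_1..P_n:
    P_1 ~ Beta(1, n), then P_{i+1} = 1 - (1 - P_i) * U^(1/(n - i))."""
    p = np.empty(n)
    p[0] = rng.beta(1, n)
    for i in range(1, n):
        u = rng.random()
        p[i] = 1.0 - (1.0 - p[i - 1]) * u ** (1.0 / (n - i))
    return p

rng = np.random.default_rng(seed=5)
n = 100
draws = np.array([second_order_cumulative_probs(n, rng) for _ in range(200)])

# Each row is one plausible set of cumulative probabilities for the n
# order statistics; the spread across rows is the uncertainty.
print(draws[:, 49].mean())  # mean of P_50, near 50/101
```

Pairing each sampled row of Pi values with the sorted data (plus subjective minimum and maximum) gives one candidate cumulative distribution per outer-loop draw, which is exactly what the spreadsheet models in Figures 10.5 to 10.7 do.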
If one creates the model in Crystal Ball Pro, the F(x) distributions can be nominated as uncertainty distributions and the cumulative distribution nominated as the variability distribution, and the inner/outer loop procedure will run automatically (Figure 10.7).

There are a few limitations to this technique. In using a cumulative distribution function, one is assuming a histogram-style probability density function. When there are a large number of data points, this approximation becomes irrelevant. However, for small datasets the approximation tends to accentuate the tails of the distribution: a result of the histogram "squaring-off" effect of using the cumulative distribution. In other words, the variability will be slightly exaggerated. The squaring effect can be reduced, if required, by using some sort of smoothing algorithm and defining points between each observed value. In addition, for small datasets, the tails' contribution to the variability will often be more influenced by the subjective estimates of the minimum and maximum values: a fact one can view positively (one is recognising the real uncertainty about a distribution's tail) and negatively (the smaller the dataset, the more the technique relies on subjective assessment). The fewer the data points, the wider the confidence intervals will become, quite naturally, and, in general, the more emphasis will be placed on the subjectively defined minimum and maximum values. Conversely, the more data points available, the less influence the minimum and maximum estimates will have on the estimated distribution. In any case, the values of the minimum and maximum only influence the width (and therefore height) of the end two histogram bars in the fitted distribution. The fact that the technique is non-parametric, i.e.
that no statistical distribution with a particular cumulative distribution function is assumed to be underlying the data, allows the analyst a far greater degree of flexibility and objectivity than that afforded by fitting parametric distributions.

A further sophistication to this technique would be to correlate the uncertainty distributions for the minimum and maximum parameter values to the uncertainty distributions for P1 and Pn respectively. If P1 were to be sampled with a high value, it would make sense that the variability distribution had a long left tail and the value sampled for the minimum should be towards its lowest value. Similarly, a high value for Pn would suggest a low value for the maximum. One could model these relationships using either very high levels of negative rank order correlation, for simplicity, or some more involved but more explicit equation.

Example 10.2 Fitting a second-order non-parametric distribution to continuous data

Three datasets of five, then a further 15 and then another 80 random samples were drawn from a Normal(100, 10) distribution to give sets of five, 20 and 100 samples. The graphs of Figure 10.8 show, naturally enough, that the population distribution is approached with increasing confidence the more data values one has available.

Figure 10.8 Results of fitting a non-parametric distribution to data from a normal parent distribution: (a) five data points; (b) 20 data points; (c) 100 data points; (d) the true population distribution.

There are classical statistical techniques for determining confidence distributions for the mean and standard deviation of a normal distribution that is fitted to a dataset with a population normal distribution, as discussed in Section 9.1, namely:

Mean: μ = x̄ + (s/√n) * t(n − 1)

Standard deviation: σ = s * √((n − 1)/χ²(n − 1))
where μ and σ are the mean and standard deviation of the population distribution; x̄ and s are the mean and sample standard deviation of the n data points being fitted; t(n − 1) is a t-distribution with n − 1 degrees of freedom; and χ²(n − 1) is a chi-square distribution with n − 1 degrees of freedom.

The second-order distribution that would be fitted to the 100 data point set using the non-parametric technique is shown in the right-hand panel of Figure 10.9. The second-order distribution produced using the above statistical theory with the assumption of a normal distribution is shown in the left-hand panel of Figure 10.9. There is strong agreement between the two techniques. The statistical technique produces less uncertainty in the tails because the assumption of normality adds extra information that the non-parametric technique does not use. This is, of course, fine providing we know that the population distribution is truly normal, but it leads to overconfidence in the tails if the assumption is incorrect.

The advantage of the technique offered here is that it works for all continuous smooth distributions, not just the normal distribution. It can also be used to determine distributions of uncertainty for specific percentiles and quantiles of the population distribution, essentially by reading off values from the fitted cumulative distribution and interpolating as necessary between the defined points. Figure 10.10 shows a spreadsheet model for determining the percentile, defined in cell E3, of the population distribution, given the 100 data points from the normal distribution used previously. The uncertainty distribution for the percentile is produced by running a simulation with cell G3 as the output. Similarly, Figure 10.11 illustrates a spreadsheet to determine the cumulative probability that the value in cell F2 represents in the population distribution.
The distribution of uncertainty of this cumulative probability is produced by running a simulation with cell H2 as the output. In other words, the model in Figure 10.10 is slicing horizontally through the second-order fitted distribution at F(x) = 50 %, while the model of Figure 10.11 is slicing vertically at x = 99. The spreadsheets can, of course, be expanded or contracted to suit the number of data points available. ModelRisk includes the VoseOgive2 function that generates the array of F(x) variables required for second-order distribution modelling.

Figure 10.9 Comparison of second-order distributions using the non-parametric technique and classical statistics.

Figure 10.10 Model to determine the uncertainty distribution for a percentile.

10.2.3 Modelling a discrete variable (first order)

Data from a discrete variable can be used to define an empirical distribution in two ways: if the number of allowable x values is not very large, the frequency of data at each x value can be used directly to define a discrete distribution; and if the number of allowable x values is very large, it is usually easier to arrange the data into histogram form and then define a cumulative distribution, as above. The discrete nature of the variable can be reintroduced by embedding the cumulative distribution inside the standard spreadsheet ROUND(...) function.

10.2.4 Modelling a discrete variable (second order)

Uncertainty can be added to the discrete probabilities in the previous technique to provide a second-order discrete distribution.
Assuming that the variable in question is stable (i.e. is not varying with time), there is a constant (i.e. binomial) probability pi that any observation will have a particular value xi. If k of the n observations have taken the value xi, then our estimate of the probability pi is given by Beta(k + 1, n − k + 1) from Section 8.2.3. However, all these pi probabilities have to sum to 1.0, so we normalise the pi values.

Figure 10.11 Model to determine the uncertainty distribution for a quantile.

Figure 10.12 illustrates a spreadsheet that calculates the discrete second-order non-parametric distribution for the set of data in Table 10.1, where the distribution has been assumed to finish at the maximum observed value. There remains a difficulty in selecting the range of this distribution, and it will be a matter of judgement how far one extends the range beyond the observed values. In the simple form described here there is also a problem in determining the pi values for these unobserved tails, and for any middle range that has no observed values, since all such pi values will have the same (normalised) Beta(1, n + 1) distribution, no matter how extreme their position in the distribution's tail. This obviously makes no sense, and, if it is important to recognise the possibility of a long tail beyond the observed data, a modification is necessary.
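The Beta(k + 1, n − k + 1) estimates and their normalisation can be sketched as follows, with numpy standing in for the spreadsheet's VoseBeta formulae; the frequencies are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Illustrative observed frequencies k_i for each allowable value x_i
values = np.arange(6)
freqs = np.array([2, 11, 30, 35, 17, 5])
n = freqs.sum()                       # 100 observations in total

# One sample from the joint uncertainty distribution:
# each p_i ~ Beta(k_i + 1, n - k_i + 1), then normalised to sum to 1
p_raw = rng.beta(freqs + 1, n - freqs + 1)
p = p_raw / p_raw.sum()
```

Re-running the last two lines each simulation iteration gives a fresh normalised set of probabilities, i.e. one realisation of the second-order discrete distribution.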
The tail can be forced to zero by multiplying the beta distributions by some function that attenuates the tail, although the choice of function and the severity of the attenuation will ultimately be a subjective matter.

These last two techniques have the advantages that the distribution derived from the observed data is unaffected by any subjectivity in selecting a distribution type and that maximum use of the data has been made in defining the distribution. There is an obvious disadvantage in that the process is fairly laborious for large datasets. However, the FREQUENCY() function and Histogram facility in Excel, the BestFit statistics report, and other statistics packages can make sorting the data and calculating the cumulative frequencies very easy. More importantly, there remains a difficulty in estimating probabilities for values of the variable that have not been observed. If this is important, it may well be better to fit the data to a parametric distribution.

Figure 10.12 Model to determine a discrete non-parametric second-order distribution (key formula: =VoseBeta(C3+1,$C$23-C3+1), i.e. Beta(k + 1, n − k + 1)).

Table 10.1 Dataset to fit a discrete second-order non-parametric distribution.

10.3 Fitting a First-Order Parametric Distribution to Observed Data

This section describes methods of finding a theoretical (parametric) distribution that best fits the observed data. The following section deals with fitting a second-order parametric distribution, i.e.
a distribution where the uncertainty about the parameters needs to be recognised. A parametric distribution type may be selected as the most appropriate to fit the data for three reasons:

- The distribution's mathematics corresponds to a model that accurately represents the behaviour of the variable being considered (see Section 10.1).
- The distribution to be fitted is well known to fit this type of variable closely (see Section 10.1 again).
- The analyst simply wants to find the theoretical distribution that best fits the data, whatever it may be.

The third option is very tempting, especially when distribution-fitting software is available that can automatically attempt fits to a large number of distribution types at the click of an icon. However, this option should be used with caution. Analysts must ensure that the fitted distribution covers the same range over which, in theory, the variable being modelled may extend; for example, a four-parameter beta distribution fitted to data will not extend past the range of the observed data if its minimum and maximum are determined by the minimum and maximum of the observed data. Analysts should ensure that the discrete or continuous nature of the distribution matches that of the variable. They should also be flexible about using a different distribution type in a later model, should more data become available, although this may cause confusion when comparing old and new versions of the same model. Finally, they may find it difficult to persuade the decision-maker of the validity of the model: seeing an unusual distribution in a model, with no intuitive logic associated with its parameters, can easily invoke distrust of the model itself. Analysts should consider including in their report a plot of the distribution being used against the observed data to reassure the decision-maker of its appropriateness.
The distribution parameters that make a distribution type best fit the available data can be determined in several ways. The most common and most flexible technique is to determine parameter values known as maximum likelihood estimators (MLEs), described in Section 10.3.1. The MLEs of a distribution are the parameter values that maximise the joint probability density or probability mass for the observed data. MLEs are very useful because, for several common distributions, they provide a quick way to arrive at the best-fitting parameters. For example, the normal distribution is defined by its mean and standard deviation, and its MLEs are the mean and standard deviation of the observed data. More often than not, however, when we fit a distribution to data using maximum likelihood, we need to use an optimiser (like Microsoft Solver, which comes with Microsoft Excel) to find the combination of parameter values that maximises the likelihood function. Other methods of fit tend to find parameter values that minimise some measure of goodness of fit, some of which are described in Section 10.3.4. Both using MLEs and minimising goodness-of-fit statistics enable us to determine first-order distributions. For fitting second-order distributions, however, we need additional techniques for quantifying the uncertainty about parameter values, such as the bootstrap, Bayesian inference and some classical statistics.

10.3.1 Maximum likelihood estimators

The maximum likelihood estimators (MLEs) of a distribution type are the values of its parameters that produce the maximum joint probability density for the observed dataset x. In the case of a discrete distribution, MLEs maximise the actual probability of that distribution type being able to generate the observed data. Consider a probability distribution type defined by a single parameter a.
The likelihood function L(a) that a set of n data points {xi} could be generated from the distribution with probability density f(x) — or, in the case of a discrete distribution, probability mass — is given by

L(a) = f(x1; a) * f(x2; a) * ... * f(xn; a)

The MLE â is then that value of a that maximises L(a). It is determined by taking the partial derivative of L(a) with respect to a and setting it to zero:

∂L(a)/∂a = 0

For some distribution types this is a relatively simple algebraic problem; for others the differential equation is extremely complicated and is solved numerically instead. This is the equivalent of using Bayesian inference with a uniform prior and then finding the peak of the posterior uncertainty distribution for a. Distribution-fitting software has made this process very easy to perform automatically.

Example 10.3 Determining the MLE for the Poisson distribution

The Poisson distribution has one parameter, the product λt, or just λ if we let t be a constant. Its probability mass function f(x) is given by

f(x) = e^(−λt) (λt)^x / x!

Because of the memoryless character of the Poisson process, if we have observed x events in a total time t, the likelihood function is given by

L(λ) = e^(−λt) (λt)^x / x!

Let l(λ) = ln L(λ). Using the fact that t is a constant:

l(λ) = −λt + x ln(λ) + x ln(t) − ln(x!)

The maximum value of l(λ), and therefore of L(λ), occurs when the partial derivative with respect to λ equals zero, i.e.

∂l(λ)/∂λ = −t + x/λ = 0

Rearranging yields

λ̂ = x/t

i.e. the MLE is the average number of observations per unit time. ♦

10.3.2 Finding the best-fitting parameters using optimisation

Figure 10.13 illustrates a Microsoft Excel spreadsheet set up to find the parameters of a gamma distribution that best match the observed data. Excel provides the GAMMADIST function, which returns the probability density of a gamma distribution. The Microsoft Solver in Excel is set to find the maximum value for cell F5 (or, equivalently, F7) by changing the values of α and β in cells F2 and F3.
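The Poisson result of Example 10.3 and the Solver-style optimisation just described can both be sketched with scipy standing in for Excel and Solver; all data, seeds and starting values are illustrative:

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(seed=3)

# Example 10.3's result: the MLE of a Poisson rate is simply the observed mean
pois_data = rng.poisson(lam=4.5, size=500)
lam_hat = pois_data.mean()

# Mirroring the Solver set-up of Figure 10.13: maximise the gamma log-likelihood
# numerically (the dataset and its shape/scale are arbitrary choices)
data = rng.gamma(shape=2.5, scale=4.0, size=240)

def neg_log_lik(params):
    alpha, beta = params
    if alpha <= 0 or beta <= 0:        # keep the optimiser in the valid region
        return np.inf
    return -np.sum(stats.gamma.logpdf(data, a=alpha, scale=beta))

res = optimize.minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")
alpha_hat, beta_hat = res.x
```

Minimising the negative log-likelihood is numerically better behaved than maximising the product of densities directly, which is why the spreadsheet model also works with log densities.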
10.3.3 Fitting distributions to truncated, censored or binned data

Maximum likelihood methods offer the greatest flexibility for distribution fitting because we need only be able to write a probability model that corresponds with how our data are observed and then maximise that probability by varying the parameters.

Censored data are those observations that we do not know precisely, only that they fall above or below a certain value. For example, a weight scale will have a maximum value X it can record: we might have some measurements off the scale, and all we can say is that they are greater than X. Truncated data are those observations that we do not see above or below some level. For example, a bank may not be required to record an error below $100, and a sieve system may not select out diamonds from a river below a certain diameter. Binned data are those observations whose values we only know in terms of bins or categories. For example, one might record in a survey that customers were (0, 10], (10, 20], (20, 40] or 40+ years of age.

Figure 10.13 Using Solver to perform a maximum likelihood fit of a gamma distribution to data (key formulae: =LOG10(GAMMADIST(B3,alpha,beta,0)) per data point, =SUM(C3:C242), and the equivalent =VoseGammaProb10(B3:B242,alpha,beta,0)).

It is a simple matter to produce a probability model for each category or combination, as shown in the following examples, where we are fitting to a continuous variable with density f(x) and cumulative probability F(x):

Example 10.4 Censored data

Observations: measurement censored at Min and Max. Observations between Min and Max are a, b, c, d and e; p observations below Min and q observations above Max.

Likelihood function: f(a) * f(b) * f(c) * f(d) * f(e) * F(Min)^p * (1 − F(Max))^q

Explanation: for the p values we only know that they are below some value Min, and the probability of being below Min is F(Min).
We know that q values are above Max, each with probability (1 − F(Max)). For the other values we have exact measurements. ♦

Example 10.5 Truncated data

Observations: measurement truncated at Min and Max. Observations between Min and Max are a, b, c, d and e.

Likelihood function: f(a) * f(b) * f(c) * f(d) * f(e) / (F(Max) − F(Min))^5

Explanation: we only observe a value if it lies between Min and Max, which has probability (F(Max) − F(Min)). ♦

Example 10.6 Binned data

Observations: measurements binned into the continuous categories x ≤ 10, 10 < x ≤ 20, 20 < x ≤ 50 and x > 50, with observed frequencies n1, n2, n3 and n4 respectively.

Likelihood function: F(10)^n1 * (F(20) − F(10))^n2 * (F(50) − F(20))^n3 * (1 − F(50))^n4

Explanation: we observe values in bins between a Low and a High value with probability F(High) − F(Low). ♦

10.3.4 Goodness-of-fit statistics

Many goodness-of-fit statistics have been developed, but two are in most common use: the chi-square (χ²) and Kolmogorov-Smirnoff (K-S) statistics, generally used for discrete and continuous distributions respectively. The Anderson-Darling (A-D) statistic is a sophistication of the K-S statistic. The lower the value of these statistics, the more closely the theoretical distribution appears to fit the data.

Goodness-of-fit statistics are not intuitively easy to understand or interpret. They do not provide a true measure of the probability that the data actually come from the fitted distribution. Instead, they provide the probability that random data generated from the fitted distribution would have produced a goodness-of-fit statistic value as low as that calculated for the observed data. By far the most intuitive measure of goodness of fit is a visual comparison of probability distributions, as described in Section 10.3.5. The reader is encouraged to produce these plots to be assured of the validity of the fit before labouring over goodness-of-fit statistics.
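The likelihood functions of Examples 10.4 to 10.6 translate directly into code. The sketch below combines censored and binned contributions into one log-likelihood for an assumed normal fit; every value in it is hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical observations under the schemes above
exact = np.array([12.1, 14.7, 15.3, 17.9, 21.4])   # precisely measured points
p_below, q_above = 3, 2                            # censored counts at Min/Max
lo, hi = 10.0, 25.0
bin_edges = np.array([0.0, 10.0, 20.0, 50.0])      # binned data: 3 bins plus 50+
bin_counts = np.array([4, 9, 3, 2])                # last count is the 50+ tail

def log_lik(mu, sigma):
    """Log-likelihood for a Normal(mu, sigma) fit, combining the schemes."""
    d = stats.norm(mu, sigma)
    # Censored (Example 10.4): exact densities, plus F(Min)^p and (1 - F(Max))^q
    ll = d.logpdf(exact).sum()
    ll += p_below * d.logcdf(lo) + q_above * np.log(d.sf(hi))
    # Binned (Example 10.6): each bin contributes (F(high) - F(low))^count
    probs = np.append(np.diff(d.cdf(bin_edges)), d.sf(bin_edges[-1]))
    ll += (bin_counts * np.log(probs)).sum()
    # Truncated data (Example 10.5) would instead divide each exact density
    # by (F(Max) - F(Min)), i.e. subtract n * log(F(Max) - F(Min)) here.
    return ll

ll = log_lik(16.0, 8.0)
```

Maximising this function over (mu, sigma), e.g. with scipy.optimize, gives the MLE fit to the combined dataset.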
Critical values and confidence intervals for goodness-of-fit statistics

Analysis of the χ², K-S and A-D statistics can provide confidence intervals proportional to the probability that the fitted distribution could have produced the observed data. It is important to note that this is not equivalent to the probability that the data did, in fact, come from the fitted distribution, since there may be many distributions with similar shapes that could have been quite capable of generating the observed data. This is particularly so for data that are approximately normally distributed, since many distributions tend to a normal shape under certain conditions.

Critical values are determined by the required confidence level α. They are the values of the goodness-of-fit statistic that have a probability of being exceeded equal to the specified confidence level. Critical values for the χ² test are found directly from the χ² distribution. The shape and range of the χ² distribution are defined by the degrees of freedom ν, where ν = N − a − 1, N is the number of histogram bars or classes and a is the number of parameters that were estimated to determine the best-fitting distribution. Figure 10.14 shows a descending cumulative plot for the χ²(11) distribution, i.e. a χ² distribution with 11 degrees of freedom. The plot shows an 80 % chance (the confidence interval) that a value higher than 6.988 (the critical value at an 80 % confidence level) would have occurred for data that were actually drawn from the fitted distribution, i.e. there is only a 20 % chance that the χ² value could be this small. If analysts are conservative and accept this 80 % chance of falsely rejecting the fit, their confidence interval α equals 80 %, the corresponding critical value is 6.988, and they will not accept any distribution as a good fit if its χ² is greater than 6.988.
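The 6.988 critical value quoted above is easy to reproduce with a quick scipy check:

```python
from scipy import stats

# The 80 % confidence level critical value: the chi-square value that data
# genuinely drawn from the fitted distribution would exceed 80 % of the time
critical = stats.chi2.ppf(0.2, df=11)       # lower 20th percentile of chi2(11)
exceed_prob = stats.chi2.sf(critical, df=11)
```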
Critical values for K-S and A-D statistics have been found by Monte Carlo simulation (Stephens, 1974, 1977; Chandra, Singpurwalla and Stephens, 1981). Tables of critical values for the K-S statistic are very commonly found in statistical textbooks. Unfortunately, the standard K-S and A-D values are of limited use for comparing critical values if there are fewer than about 30 data points. The problem arises because these statistics are designed to test whether a distribution with known parameters could have produced the observed data. If the parameters of the fitted distribution have been estimated from the data, the K-S and A-D statistics will produce conservative test results, i.e. there is a smaller chance of a well-fitting distribution being accepted. The size of this effect varies between the types of distribution being fitted. One technique for getting round this problem is to use the first two-fifths or so of the data to estimate the parameters of a distribution, using MLEs for example, and then to use the remaining data to check the goodness of fit. Modifications to the K-S and A-D statistics have been determined to correct for this problem, as shown in Tables 10.2 and 10.3 (see the BestFit manual published in 1993), where n is the number of data points and Dn and A²n are the unmodified K-S and A-D statistics respectively.

Another goodness-of-fit statistic with intuitive appeal, similar to the A-D and K-S statistics, is the Cramer-von Mises statistic W²:

W² = 1/(12n) + Σ [F0(X(i)) − (2i − 1)/(2n)]²  summed over i = 1 to n

The statistic essentially sums the square of the differences between the cumulative percentile F0(Xi) of the fitted distribution at each observation Xi and the average of i/n and (i − 1)/n: the low and high plots of the empirical cumulative distribution of the Xi values. Tables for this statistic can be found in Anderson and Darling (1952).

Table 10.2 Kolmogorov-Smirnoff statistic.
Distribution | Modified test statistic
Normal | Dn(√n − 0.01 + 0.85/√n)
Exponential | (Dn − 0.2/n)(√n + 0.26 + 0.5/√n)
Weibull and extreme value | Dn √n
All others | Dn(√n + 0.12 + 0.11/√n)

Table 10.3 Anderson-Darling statistics.

Distribution | Modified test statistic
Normal | A²n(1 + 0.75/n + 2.25/n²)
Exponential | A²n(1 + 0.6/n)
Weibull and extreme value | A²n(1 + 0.2/√n)
All others | A²n

The chi-square goodness-of-fit statistic

The chi-square (χ²) statistic measures how well the expected frequency of the fitted distribution compares with the observed frequency of a histogram of the observed data. The chi-square test makes the following assumptions:

1. The observed data consist of a random sample of n independent data points.
2. The measurement scale can be nominal (i.e. non-numeric) or numerical.
3. The n data points can be arranged into histogram form with N non-overlapping classes or bars that cover the entire possible range of the variable.

The chi-square statistic is calculated as follows:

χ² = Σ {O(i) − E(i)}² / E(i)  summed over i = 1 to N     (10.6)

where O(i) is the observed frequency of the ith histogram class or bar and E(i) is the expected frequency from the fitted distribution of x values falling within the x range of the ith histogram bar. E(i) is calculated as

E(i) = (F(imax) − F(imin)) * n     (10.7)

where F(x) is the distribution function of the fitted distribution, imax is the x-value upper bound of the ith histogram bar and imin is the x-value lower bound. Since the χ² statistic sums the squares of all of the errors {O(i) − E(i)}, it can be disproportionately sensitive to any large errors: if the error of one bar is 3 times that of another, it will contribute 9 times more to the statistic (assuming the same E(i) for both). χ² is the most commonly used of the goodness-of-fit statistics described here. However, it is very dependent on the number of bars, N, that are used. By changing the value of N, one can quite easily switch the ranking between two distribution types.
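Equations (10.6) and (10.7) can be sketched as follows, testing an invented histogram against a fully specified Normal(70, 20):

```python
import numpy as np
from scipy import stats

# Illustrative observed histogram frequencies (the counts are made up)
edges = np.array([-np.inf, 40, 50, 60, 70, 80, 90, 100, np.inf])
observed = np.array([14, 15, 25, 32, 31, 25, 15, 8])
n = observed.sum()

dist = stats.norm(70, 20)
expected = (dist.cdf(edges[1:]) - dist.cdf(edges[:-1])) * n   # equation (10.7)

chi_sq = np.sum((observed - expected) ** 2 / expected)        # equation (10.6)
p_value = stats.chi2.sf(chi_sq, df=len(observed) - 1)         # no fitted parameters
```

Because the Normal(70, 20) is fully specified here, a = 0 and the degrees of freedom are simply N − 1.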
Unfortunately, there are no hard and fast rules for selecting the value of N. A good guide, however, is Scott's normal approximation, which generally appears to work very well: it sets the histogram bin width to 3.49σn^(−1/3), where n is the number of data points and σ is estimated by the sample standard deviation. Another useful guide is to ensure that no bar has an expected frequency smaller than about 1, i.e. E(i) > 1 for all i. Note that the χ² statistic does not require that all, or indeed any, histogram bars are of the same width. The χ² statistic is most useful for fitting distributions to discrete data and is the only statistic described here that can be used for nominal (i.e. non-numeric) data.

Example 10.7 Use of χ² with continuous data

A dataset of 165 points is thought to come from a Normal(70, 20) distribution. The data are first put into histogram form with 14 bars, as suggested by Scott's normal approximation (Table 10.4(a)). The four extreme bars have expected frequencies below 1 for a Normal(70, 20) distribution with 165 observations. These outside bars are therefore combined to produce a revised set of bar ranges. Table 10.4(b) shows the χ² calculation with the revised bar ranges.

Table 10.4 Calculation of the χ² statistic for a continuous dataset: (a) determining the bar ranges to be used; (b) calculation of χ² with the revised bar ranges.

(a) Histogram bars and expected frequencies of Normal(70, 20):

From A | To B | Expected frequency
−∞ | 10 | 0.22
10 | 20 | 0.80
20 | 30 | 2.73
30 | 40 | 7.27
40 | 50 | 15.15
50 | 60 | 24.73
60 | 70 | 31.59
70 | 80 | 31.59
80 | 90 | 24.73
90 | 100 | 15.15
100 | 110 | 7.27
110 | 120 | 2.73
120 | 130 | 0.80
130 | +∞ | 0.22

(b) Revised bars:

From A | To B | E(i) | O(i) | {O(i) − E(i)}²/E(i)
−∞ | 20 | 1.02 | 3 | 3.80854
20 | 30 | 2.73 | 5 | 1.88948
30 | 40 | 7.27 | 6 | 0.22168
40 | 50 | 15.15 | 10 | 1.75344
50 | 60 | 24.73 | 21 | 0.56275
60 | 70 | 31.59 | 25 | 1.37523
70 | 80 | 31.59 | 37 | 0.92601
80 | 90 | 24.73 | 21 | 0.56275
90 | 100 | 15.15 | 17 | 0.22463
100 | 110 | 7.27 | 11 | 1.91447
110 | 120 | 2.73 | 6 | 3.92002
120 | +∞ | 1.02 | 3 | 3.80854
Chi-square: | | | | 20.96754

Hypotheses

H0: the data come from a Normal(70, 20) distribution.
H1: the data do not come from a Normal(70, 20) distribution.

Decision

The χ² test statistic has a value of 21.0 from Table 10.4(b). There are ν = N − 1 = 12 − 1 = 11 degrees of freedom (a = 0 since no distribution parameters were determined from the data). Looking this up in a χ²(11) distribution, the probability of obtaining such a high value of χ² when H0 is true is around 3 %. We therefore conclude that the data did not come from a Normal(70, 20) distribution. ♦

Table 10.5 Calculation of the χ² statistic for a discrete dataset: (a) tabulation of the data; (b) calculation of χ².

(a)

x value | Observed frequency O(i) | Frequency E(i) of Poisson(4.456)
0 | 0 | 1.579
1 | 8 | 7.036
2 | 18 | 15.675
3 | 20 | 23.282
4 | 29 | 25.936
5 | 21 | 23.113
6 | 18 | 17.165
7 | 10 | 10.926
8 | 8 | 6.086
9 | 2 | 3.013
10 | 1 | 1.343
11+ | 1 | 0.846
Total: | 136 |

(b)

x value | Observed frequency O(i) | Frequency E(i) of Poisson(4.456) | {O(i) − E(i)}²/E(i)
0 | 0 | 1.579 | 1.5790
1 | 8 | 7.036 | 0.1322
2 | 18 | 15.675 | 0.3448
3 | 20 | 23.282 | 0.4627
4 | 29 | 25.936 | 0.3621
5 | 21 | 23.113 | 0.1932
6 | 18 | 17.165 | 0.0406
7 | 10 | 10.926 | 0.0786
8 | 8 | 6.086 | 0.6020
9 | 2 | 3.013 | 0.3406
10+ | 2 | 2.189 | 0.0163
Chi-square: | | | 4.1521

Example 10.8 Use of χ² with discrete data

A set of 136 data points is believed to come from a Poisson distribution. The MLE for the Poisson parameter λ is estimated by taking the mean of the data points: λ = 4.4559. The data are tabulated in frequency form in Table 10.5(a) and, next to it, the expected frequency from a Poisson(4.4559) distribution, i.e.
E(i) = f(x) * 136, where

f(x) = e^(−λ) λ^x / x!  with λ = 4.4559

The expected frequency for a value of 11+, calculated as 136 minus the sum of all the other expected frequencies, is less than 1. The number of bars is therefore decreased, as shown in Table 10.5(b), to ensure that all expected frequencies are greater than 1.

Hypotheses

H0: the data come from a Poisson distribution.
H1: the data do not come from a Poisson distribution.

Decision

The χ² test statistic has a value of 4.152 from Table 10.5(b). There are ν = N − a − 1 = 11 − 1 − 1 = 9 degrees of freedom (a = 1 since one distribution parameter, the mean, was determined from the data). Looking this up in a χ²(9) distribution, the probability of obtaining such a high value of χ² when H0 is true is just over 90 %. Since this is such a large probability, we cannot reasonably reject H0 and therefore conclude that the data fit a Poisson(4.4559) distribution. ♦

I've covered the chi-square statistic quite a bit here, because it is used often, but let's just trace back a moment to the assumptions behind it. The χ²(ν) distribution is the sum of ν unit normal distributions squared. Equation (10.6) therefore assumes that each {O(i) − E(i)}²/E(i) is approximately a Normal(0, 1)², i.e. that O(i) is approximately Normal(E(i), √E(i)) distributed. In fact, O(i) is a Binomial(n, p) variable, where p = F(imax) − F(imin), and it will only look somewhat normal when n is large and p is not near 0 or 1, in which case it will be approximately Normal(np, √(np(1 − p))). The point is that the chi-square test is based on an implicit assumption that there are a lot of observations in each bin, so don't rely on it.
Maximum likelihood methods will give better fits than optimising the chi-square statistic and have more flexibility. Moreover, the value of the chi-square statistic as a measure for comparing goodness of fit between distributions is highly questionable, since one should change the bin widths for each fitted distribution so that each bin has the same probability of containing a random sample, and those bin ranges will be different for each fitted distribution.

Kolmogorov-Smirnoff (K-S) statistic

The K-S statistic Dn is defined as

Dn = max |Fn(x) − F(x)|

where Dn is known as the K-S distance, n is the total number of data points, F(x) is the distribution function of the fitted distribution, Fn(x) = i/n and i is the cumulative rank of the data point. The K-S statistic is thus only concerned with the maximum vertical distance between the cumulative distribution function of the fitted distribution and the cumulative distribution of the data.

Figure 10.15 illustrates the concept for data fitted to a Uniform(0, 1) distribution. The data are ranked in ascending order. The upper FU(i) and lower FL(i) cumulative percentiles are calculated as follows:

FU(i) = i/n,  FL(i) = (i − 1)/n

where i is the rank of the data point and n is the total number of data points. F(x) is calculated for the uniform distribution (in this case F(x) = x).

Figure 10.15 Calculation of the Kolmogorov-Smirnoff distance Dn for data fitted to a Uniform(0, 1) distribution.

The maximum distance Di between F(i) and F(x) is calculated for each i:

Di = max[ABS(FU(i) − F(xi)), ABS(FL(i) − F(xi))]

where ABS(...) finds the absolute value. The maximum of the Di distances is then the K-S distance Dn:

Dn = max[Di]

The K-S statistic is generally more useful than the χ² statistic in that the data are assessed at all data points, which avoids the problem of determining the number of bands into which to split the data.
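The Di and Dn calculations above can be sketched directly; the Uniform(0, 1) sample is illustrative, and scipy's kstest is used only as a cross-check:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=11)
data = np.sort(rng.uniform(0, 1, size=50))     # illustrative Uniform(0, 1) sample
n = len(data)

# D_i = max(|FU(i) - F(x_i)|, |FL(i) - F(x_i)|) with FU = i/n, FL = (i-1)/n
i = np.arange(1, n + 1)
F = stats.uniform(0, 1).cdf(data)              # for Uniform(0, 1), F(x) = x
D_n = np.maximum(np.abs(i / n - F), np.abs((i - 1) / n - F)).max()

# Cross-check against scipy's two-sided K-S statistic
D_scipy = stats.kstest(data, stats.uniform(0, 1).cdf).statistic
```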
However, its value is determined only by the one largest discrepancy and takes no account of the lack of fit across the rest of the distribution. Thus, in Figure 10.16 it would give a worse fit to the distribution in (a), which has one large discrepancy, than to the distribution in (b), which has a poor general fit over the whole x range.

The vertical distance between the observed distribution Fn(x) and the theoretical fitted distribution F(x) at any point, say x0, itself has a distribution with a mean of zero and a standard deviation σK-S given by binomial theory:

σK-S = √( F(x0)(1 − F(x0)) / n )

The size of the standard deviation σK-S over the x range is shown in Figure 10.17 for a number of distribution types with n = 100. The position of Dn along the x axis is more likely to occur where σK-S is greatest which, as Figure 10.17 shows, will generally be away from the low-probability tails. This insensitivity of the K-S statistic to lack of fit at the extremes of the distributions is corrected for in the Anderson-Darling statistic.

The enlightened statistical literature is quite scathing about distribution-fitting software that uses the K-S statistic as a goodness-of-fit measure, particularly if one has estimated the parameters of the fitted distribution from the data (as opposed to comparing the data against a predefined distribution). This was not the intention of the K-S statistic, which assumes that the fitted distribution is fully specified. In order to use it as a goodness-of-fit measure that ranks levels of distribution fit, one must perform simulation experiments to determine the critical region of the K-S statistic in each case.

Anderson-Darling (A-D) statistic

The A-D statistic A²n is defined as

A²n = n ∫ [Fn(x) − F(x)]² Ψ(x) f(x) dx,  where Ψ(x) = 1 / [F(x)(1 − F(x))]

and where n is the total number of data points, F(x) is the distribution function of the fitted distribution, f(x) is the density function of the fitted distribution, Fn(x) = i/n and i is the cumulative rank of the data point.
The Anderson-Darling statistic is a more sophisticated version of the Kolmogorov-Smirnoff statistic. It is more powerful for the following reasons:

- Ψ(x) compensates for the increased variance of the vertical distances between the distributions, described in Figure 10.17.
- f(x) weights the observed distances by the probability that a value will be generated at that x value.
- The vertical distances are integrated over all values of x to make maximum use of the observed data (the K-S statistic only looks at the maximum vertical distance).

Figure 10.16 How the K-S distance Dn can give a false measure of fit because of its reliance on the single largest distance between the two cumulative distributions rather than looking at the distances over the whole possible range: (a) a distribution that is generally a good fit except in one particular area; (b) a distribution that is generally a poor fit but with no single large discrepancy.

Figure 10.17 Variation in the standard deviation of the K-S statistic Dn over the range of a variety of distributions (Pareto(1, 2), Normal(100, 10), Triangular(0, 5, 20), Uniform(0, 10), Exponential(25) and Rayleigh(3)). The greater the standard deviation, the more chance that Dn will fall in that part of the range, which shows that the K-S statistic will tend to focus on the degree of fit at x values away from a distribution's tails.
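For a fully specified fitted distribution, A²n can be computed from the standard computational form of the integral definition above. The sketch below assumes a Normal(0, 1) fit to invented data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)
data = np.sort(rng.normal(0, 1, size=200))     # illustrative sample
n = len(data)

# Standard computational form of the A-D integral for ranked data:
#   A2 = -n - (1/n) * sum_i (2i - 1) * [ln F(x_(i)) + ln(1 - F(x_(n+1-i)))]
F = stats.norm(0, 1).cdf(data)
i = np.arange(1, n + 1)
A2 = -n - np.sum((2 * i - 1) * (np.log(F) + np.log(1 - F[::-1]))) / n
```

Note that scipy.stats.anderson instead estimates the parameters from the data and applies modified critical values, i.e. the estimated-parameter case discussed in the text.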
The A-D statistic is therefore a generally more useful measure of fit than the K-S statistic, especially where it is important to place equal emphasis on fitting a distribution at the tails as well as the main body. Nonetheless, it still suffers from the same problem as the K-S statistic in that the fitted distribution should in theory be fully specified, not estimated from the data. It suffers from a larger problem in that the confidence region has been determined for only a very few distributions.

A better goodness-of-fit measure

For the reasons I have explained above, the chi-square, Kolmogorov-Smirnoff and Anderson-Darling goodness-of-fit statistics are technically all inappropriate as a method of comparing fits of distributions to data. They are also limited to precise observations and cannot incorporate censored, truncated or binned data. Realistically, most of the time we are fitting a continuous distribution to a set of precise observations, and then the Anderson-Darling does a reasonable job. However, for important work you should instead consider using statistical measures of fit called information criteria.

Let n be the number of observations (e.g. data values, frequencies), k be the number of parameters to be estimated (e.g. the normal distribution has two parameters: mu and sigma) and Lmax be the maximised value of the likelihood for the estimated model.

1. SIC (Schwarz information criterion, a.k.a. Bayesian information criterion, BIC):

SIC = ln[n]k − 2 ln[Lmax]

2. AICc (Akaike information criterion, corrected for small samples):

AICc = (2n/(n − k − 1))k − 2 ln[Lmax]

3. HQIC (Hannan-Quinn information criterion):

HQIC = 2 ln[ln[n]]k − 2 ln[Lmax]

The aim is to find the model with the lowest value of the selected information criterion. The −2 ln[Lmax] term appearing in each formula is an estimate of the deviance of the model fit. The coefficient of k in the first part of each formula shows the degree to which the number of model parameters is being penalised. For n ≥ 20 or so, the SIC (Schwarz, 1978) is the strictest in penalising the loss of degrees of freedom from having more parameters in the fitted model. For n ≥ 40 the AICc (Akaike, 1974, 1976) is the least strict of the three, and the HQIC (Hannan and Quinn, 1979) holds the middle ground, or is the least penalising for n ≤ 20. ModelRisk applies modified versions of these three criteria as a means of ranking each fitted model, whether it be fitting a distribution, a time series model or a copula.

If you fit a number of models to your data, try not to pick the fitted distribution with the best statistical result automatically, particularly if the top two or three are close. Also, look at the range and shape of the fitted distribution and consider whether they correspond to what you think is appropriate.

10.3.5 Goodness-of-fit plots

Goodness-of-fit plots offer the analyst a visual comparison between the data and fitted distributions. They provide an overall picture of the errors in a way that a goodness-of-fit statistic cannot, and they allow the analyst to select the best-fitting distribution in a more qualitative and intuitive way. Several types of plot are in common use. Their individual merits are discussed below.

Comparison of probability density

Overlaying a histogram plot of the data with the density function of the fitted distribution is usually the most informative comparison (see Figure 10.18(a)). It is easy to see where the main discrepancies are and whether the general shape of the data and fitted distribution compare well. The same scale and number of histogram bars should be used for all plots if a direct comparison of several distribution fits is to be made for the same data.
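Returning to the information criteria of Section 10.3.4, they are straightforward to compute once the maximised log-likelihood is known. A sketch for a normal fit to invented data, using the standard small-sample form of AICc (an assumption on my part; the exact modified forms used by ModelRisk may differ):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
data = rng.normal(50, 8, size=120)    # illustrative dataset
n, k = len(data), 2                   # the normal has k = 2 parameters

mu_hat = data.mean()                  # normal MLEs
sigma_hat = data.std()                # note: the MLE uses the 1/n variance
log_L = stats.norm(mu_hat, sigma_hat).logpdf(data).sum()

SIC  = np.log(n) * k - 2 * log_L                 # a.k.a. BIC
AICc = (2 * n / (n - k - 1)) * k - 2 * log_L     # small-sample corrected AIC
HQIC = 2 * np.log(np.log(n)) * k - 2 * log_L
```

For this n and k the penalty coefficients are about 4.79 (SIC), 3.13 (HQIC) and 2.05 (AICc), illustrating the strictness ordering described in the text.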
For n ≥ 20 or so, the SIC (Schwarz, 1997) is the strictest of the three in penalising the loss of degrees of freedom from having more parameters in the fitted model. For n ≥ 40 the AICc (Akaike, 1974, 1976) is the least strict of the three, and the HQIC (Hannan and Quinn, 1979) holds the middle ground, or is the least penalising for n ≤ 20. ModelRisk applies modified versions of these three criteria as a means of ranking each fitted model, whether it be fitting a distribution, a time series model or a copula. If you fit a number of models to your data, try not to pick automatically the fitted distribution with the best statistical result, particularly if the top two or three are close. Also, look at the range and shape of the fitted distribution and see whether they correspond to what you think is appropriate.

10.3.5 Goodness-of-fit plots

Goodness-of-fit plots offer the analyst a visual comparison between the data and fitted distributions. They provide an overall picture of the errors in a way that a goodness-of-fit statistic cannot, and allow the analyst to select the best-fitting distribution in a more qualitative and intuitive way. Several types of plot are in common use. Their individual merits are discussed below.

Comparison of probability density

Overlaying a histogram plot of the data with a density function of the fitted distribution is usually the most informative comparison (see Figure 10.18(a)). It is easy to see where the main discrepancies are and whether the general shape of the data and fitted distribution compare well. The same scale and number of histogram bars should be used for all plots if a direct comparison of several distribution fits is to be made for the same data.
Figure 10.18 Examples of goodness-of-fit plots, comparing an input data distribution with a fitted Normal(99.18, 16.52): (a) comparison of probability density; (b) comparison of cumulative probability distributions; (c) plot of the difference between probability densities; (d) probability-probability plot; (e) probability-probability plot for a discrete (Poisson) distribution; (f) quantile-quantile plot.

Comparison of probability distributions

An overlay of the cumulative frequency plots of the data and the fitted distribution is sometimes used (see Figure 10.18(b)). However, this plot has a very insensitive scale, and the cumulative frequencies of most distribution types follow very similar S-curves. This type of plot will therefore only show up very large differences between the data and fitted distributions and is not generally recommended as a visual measure of the goodness of fit.

Difference between probability densities

This plot is derived from the comparison of probability density above, plotting the difference between the probability densities (see Figure 10.18(c)). It has a far more sensitive scale than the other plots described here. The size of the deviations is also a function of the number of classes (bars) used to plot the histogram.
In order to make a direct comparison between other distribution function fits using this type of plot, analysts must ensure that the same number of histogram classes is used for all plots. They must also ensure that the same vertical scale is used, as this can vary widely between fits.

Probability-probability (P-P) plots

This is a plot of the cumulative distribution of the fitted curve F(x_i) against the cumulative frequency F_n(x_i) = i/(n + 1) for all values of x_i (see Figure 10.18(d)). The better the fit, the more closely this plot resembles a straight line. It can be useful if one is interested in closely matching cumulative percentiles, and it will show significant differences between the middles of the two distributions. However, the plot is far less sensitive to any discrepancies in fit than the comparison of probability density plot and is therefore not often used. It can also be rather confusing when used to review discrete data (see Figure 10.18(e)), where a fairly good fit can easily be masked, especially if there are only a few allowable x values.

Quantile-quantile (Q-Q) plots

This is a plot of the observed data x_i against the x values where F(x) = F_n(x_i), i.e. F(x) = i/(n + 1) (see Figure 10.18(f)). As with P-P plots, the better the fit, the more closely this plot resembles a straight line. It can be useful if one is interested in closely matching cumulative percentiles, and it will show significant differences between the tails of the two distributions. However, the plot suffers from the same insensitivity problem as the P-P plots.

10.4 Fitting a Second-Order Parametric Distribution to Observed Data

The techniques for quantifying uncertainty, described in the first part of this chapter, can be used to determine the distribution of uncertainty for parameters of a parametric distribution fitted to data. The three main techniques are classical statistics methods, the bootstrap and Bayesian inference by Gibbs sampling.
The main issue in estimating the parameters of a distribution from data is that the uncertainty distributions of the estimated parameters are usually linked together in some way. Classical statistics tends to overcome this problem by assuming that the parameter uncertainty distributions are normally distributed, in which case it determines a covariance between these distributions. However, in most situations one comes across, the parameter uncertainty distributions are not normal (although they tend towards normality as the amount of data gets very large), so the approach is very limited. The parametric bootstrap is much better, since one simply resamples from the MLE fitted distribution in the same fashion in which the data appear, and in the same amount, of course. Then, refitting using MLE again gives us random samples from the joint uncertainty distribution for the parameters. The main limitation to the bootstrap is in fitting a discrete distribution, particularly one where there are few allowable values, as this will make the joint uncertainty distribution very "grainy". Markov chain Monte Carlo will also generate random samples from the joint uncertainty density. It is very flexible but has the small problem of setting the prior distributions.

Example 10.9 Fitting a second-order normal distribution to data with classical statistics

The normal distribution is easy to fit to data since the z-test and chi-square test give us precise formulae. There are not many other distributions that can be handled so conveniently. Classical statistics tells us that the uncertainty distributions for the mean and standard deviation of the normal distribution are given by Equation (9.3) when we don't know the mean, and by Equation (9.1) when we know the standard deviation. So, if we simulate possible values for the standard deviation first with Equation (9.3), we can feed these values into Equation (9.1) to determine the mean.

Example 10.10 Fitting a second-order normal distribution to data using the parametric bootstrap

The sample mean (Excel: AVERAGE) and sample standard deviation (Excel: STDEV) are the MLE estimates for the normal distribution. Thus, if we have n data values with mean x̄ and standard deviation s, we generate n independent Normal(x̄, s) distributions and recalculate their mean and standard deviation using AVERAGE and STDEV to generate uncertainty values for the population parameters.

Example 10.11 Fitting a second-order gamma distribution to data using the parametric bootstrap

There are no equations for direct determination of the MLE parameter values for a gamma distribution, so one needs to construct the likelihood function and optimise it by varying the parameters, which is rather tiresome but by far the more common situation encountered. ModelRisk offers distribution-fitting algorithms that do this automatically. For example, the two-cell array {VoseGammaFitP(data, TRUE)} will generate values from the joint uncertainty distribution for a gamma distribution fit to the set of values data. The array {VoseGammaFitP(data, FALSE)} will return just the MLE values. The function VoseGammaFit(data, TRUE) returns random samples from a gamma distribution fitted to data, with the parameter uncertainty embedded, and VoseGammaFit(data, 0.99, TRUE) will return random samples from the uncertainty distribution for the 99th percentile of a gamma distribution fit to data.

Example 10.12 Fitting a second-order gamma distribution to data using WinBUGS

The following WinBUGS model takes 47 data values (that were in fact drawn from a Gamma(4, 7) distribution) and fits a gamma distribution. There are two important things to note here: in WinBUGS the scale parameter lambda is defined as the reciprocal of the beta scale parameter more commonly used (and this book's convention); and I have used a prior for each parameter of Gamma(1, 1000) [in the more standard convention], which is an exponential with mean 1000.
The exponential distribution is used because it extends from zero to infinity, which matches the parameters' domains, and an exponential with such a large mean will appear quite flat over the range of interest (so it is reasonably uninformed). The model is:

model {
    for (i in 1:M) {
        x[i] ~ dgamma(alpha, lambda)
    }
    alpha ~ dgamma(1.0, 1.0E-3)
    beta ~ dgamma(1.0, 1.0E-3)
    lambda <- 1/beta
}

After a burn-in of 100 000 iterations, the estimates are as shown in Figure 10.19. The estimates are centred roughly around 4 (mean = 4.111) and 7 (mean = 6.288), as we might have hoped, having generated the samples from a Gamma(4, 7). We can check to see whether the choice of prior has much effect. For alpha, the uncertainty distribution ranges from about 2 to 6: the Exponential(1000) densities at 2 and 6 are 9.98E-4 and 9.94E-4 respectively, a ratio of 1.004, so essentially flat over the posterior region. Between 4 and 13, the range for the beta parameter, the ratio is 1.009, again essentially flat.

Figure 10.19 WinBUGS estimates of gamma distribution parameters for Example 10.12.

Figure 10.20 5000 posterior distribution samples from the WinBUGS model to estimate gamma distribution parameters for Example 10.12.

Figure 10.21 Plot showing the empirical cumulative distribution of the data in bold and the second-order fitted lognormal distribution in grey.

Figure 10.20 shows why it is necessary to estimate the joint uncertainty distribution. The banana shape of this scatter plot shows that there is a strong correlation between the parameter estimates.
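The same kind of joint uncertainty sample can be obtained without an MCMC run via the parametric bootstrap of Examples 10.10 and 10.11. A minimal sketch for the normal case of Example 10.10, using the standard library in place of Excel's AVERAGE and STDEV (the data here are simulated purely for illustration):

```python
import random
import statistics

def normal_parametric_bootstrap(data, n_boot=2000, seed=1):
    """Example 10.10: treat (mean, stdev) of the data as the MLE fit,
    regenerate samples of the same size from Normal(mean, stdev), and
    refit each one to get draws from the joint uncertainty distribution."""
    rng = random.Random(seed)
    m, s = statistics.mean(data), statistics.stdev(data)
    n = len(data)
    draws = []
    for _ in range(n_boot):
        resample = [rng.gauss(m, s) for _ in range(n)]
        draws.append((statistics.mean(resample), statistics.stdev(resample)))
    return draws

# Illustrative data (not the book's): 30 values from Normal(100, 10)
rng = random.Random(7)
data = [rng.gauss(100, 10) for _ in range(30)]
draws = normal_parametric_bootstrap(data)
means = [d[0] for d in draws]
print(round(min(means), 1), round(max(means), 1))  # spread of the mean's uncertainty
```

Plotting the (mean, stdev) pairs in `draws` against each other gives the bootstrap analogue of the scatter plot in Figure 10.20.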
You can understand why this relationship occurs intuitively as follows: the mean of a population distribution can be estimated quite quickly from the data and will have roughly normally distributed uncertainty. In this case the 47 observations have sample mean = 25.794 and sample variance = 184.06, so the population mean uncertainty is Normal(25.794, SQRT(184.06/47)) = Normal(25.794, 1.979). The mean of a Gamma(α, β) distribution is αβ. Equating the two says that if α = 6 then β must be about 4.3 ± 0.3, and if α = 3 then β is about 8.6 ± 0.6, which can be seen in Figure 10.20.

10.4.1 Second-order goodness-of-fit plots

Second-order goodness-of-fit plots are the same as the first-order plots in Figure 10.18, except that uncertainty about the distribution is expressed as a series of lines describing possible true distributions (sometimes called a candyfloss or spaghetti plot). Figure 10.21 gives an example. In Figure 10.21 the grey lines represent the fitted lognormal cumulative distribution function for 15 samples from the joint uncertainty distribution for the lognormal's mean and standard deviation. This gives an intuitive visual description of how certain we are about the fitted distribution. ModelRisk's distribution-fitting facility will show these plots automatically with a user-defined number of "spaghetti" lines.

Chapter 11 Sums of random variables

One of the most common mistakes people make in producing even the most simple Monte Carlo simulation model is in calculating sums of random variables. In this chapter we look at a number of techniques that have extremely broad use in risk analysis in estimating the sum of random variables. We start with the basic problem and how this can be simulated. Then we examine how simulation can be improved, and then how it can often be replaced with a direct construction of the distribution of the sum of the random variables. Finally, I introduce the ability to model correlation between variables that are being summed.
11.1 The Basic Problem

We are very often in the situation of wanting to estimate the aggregate (sum) of a number n of variables, each of which follows the same distribution or takes the same value X (see Table 11.1, for example). We have six situations to deal with (Table 11.2).

Situations A, B, D and E

For situations A, B, D and E the mathematics is very easy to simulate: SUM = n * X.

Situation C

For situation C, where the X are independent random variables (i.e. each X being summed can take a different value) and n is fixed, we often have a simple way to determine the aggregate distribution based on known identities. The most common identities are listed in Table 11.3. We also know from the central limit theorem that, if n is large enough, the sum will often look like a normal distribution. If X has a mean μ and standard deviation σ, then, as n becomes large, we get

Sum ≈ Normal(n * μ, √n * σ)

which is rather nice because it means we can take a distribution like the Relative distribution and determine its moments (the ModelRisk function VoseMoments will do this automatically for you), or just take the mean and standard deviation of relevant observations of X, and use them. It also explains why the distributions in the right-hand column of Table 11.3 often look approximately normal. When none of these identities applies, we have to simulate a column of X variables of length n and add them up, which is usually not too onerous in computing time or spreadsheet size because, if n is large, we can usually use the central limit theorem approximation instead.

Table 11.1 Variables and their aggregate distribution.
X | n | Aggregate distribution
Purchase of each customer | Customers in a year | Total receipts in a year
Bacteria in a contaminated egg | Contaminated eggs | Bacteria in my three-raw-egg milkshake
Amount owed by a creditor | Credit defaults | Total credit default exposure
Amount due on death for a policyholder | Life insurance holders who die next year | Total financial exposure of insurance company

Table 11.2 Different situations where aggregate distributions are needed.

Situation | n | X
A | Fixed value | Fixed value
B | Fixed value | Random variable, all n take the same value
C | Fixed value | Random variables, all n take different values (iid)
D | Random variable | Fixed value
E | Random variable | Random variable, all n take the same value
F | Random variable | Random variables, all n take different values (iid)

Table 11.3 Known identities for aggregate distributions.

X | Aggregate distribution
Bernoulli(p) | Binomial(n, p)
BetaBinomial(m, α, β) | BetaBinomial(n * m, α, β)
Binomial(m, p) | Binomial(n * m, p)
Cauchy(a, b) | n * Cauchy(a, b)
ChiSq(ν) | ChiSq(n * ν)
Erlang(m, β) | Erlang(n * m, β)
Exponential(β) | Gamma(n, β)

An alternative for situation C available in ModelRisk is to use the VoseAggregateMC(n, distribution) function; for example, if we write

=VoseAggregateMC(1000, VoseLognormalObject(2, 6))

the function will generate and add together 1000 independent random samples from a Lognormal(2, 6) distribution. However, were we to write

=VoseAggregateMC(1000, VoseGammaObject(2, 6))

the function would generate a single value from a Gamma(2 * 1000, 6) distribution, because all of the identities in Table 11.3 are programmed into the function.

Situation F

This leaves us with situation F: the sum of a random number of random variables. The most basic simulation method is to produce a model where a value for n is generated in one spreadsheet cell and then a column of X variables is created that varies in size according to the value of n (see, for example, Figure 11.1). In this model, n is a Poisson(12) random variable generated at cell C2.
The Lognormal(100, 10) X values are generated in column C only if the count value in column B is smaller than or equal to n. For example, in the iteration shown, a value of 14 is generated for n, so 14 X values are generated in column C. The method is quite generally applicable but, among other problems, is inefficient. Imagine if n had been Poisson(10 000), for example: we would need huge B and C columns to make the model work. It is also difficult from a modelling perspective because the model has to be written for a specific range of n. One cannot simply change the parameter in the Poisson distribution.

Figure 11.1 Model for the sum of a random number of random variables. Formulae table: C2: =VosePoisson(12); C5:C24: =IF(B5>$C$2, 0, VoseLognormal(100, 10)); F6 (output): =SUM(C5:C24).

We have a couple of options based on the techniques described above for situation C. If we are adding together X variables shown in Table 11.3, then we can apply those identities by simulating n in one cell and linking that to a cell that simulates from the aggregate variable conditioned on n. For example, imagine we are summing Poisson(100) X variables where each X variable takes a Gamma(2, 6) distribution. Then we can write:

Cell A1: =VosePoisson(100)
Cell A2 (output): =VoseGamma(A1 * 2, 6)

We can also use the central limit theorem method. Imagine we have n = Poisson(1000) and X = Beta4(3, 7, 0, 8), which is illustrated in Figure 11.2.
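The conditional-identity trick above (simulate n, then make a single draw from the aggregate distribution conditioned on n) mirrors directly into code. A sketch for the Poisson(100) count of Gamma(2, 6) severities, cross-checked against brute-force summation (illustrative only, not ModelRisk's implementation; Gamma(α, β) is read as shape α, scale β, so the mean is αβ):

```python
import math
import random

def poisson(rng, lam):
    # Knuth's multiplication method; adequate for modest lambda
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(42)

def aggregate_identity():
    # Sum of n iid Gamma(2, 6) variables is Gamma(2*n, 6), so a single
    # gamma draw conditioned on n replaces the whole column of X values.
    n = poisson(rng, 100)
    return rng.gammavariate(2 * n, 6) if n > 0 else 0.0

def aggregate_brute_force():
    n = poisson(rng, 100)
    return sum(rng.gammavariate(2, 6) for _ in range(n))

ident = [aggregate_identity() for _ in range(5000)]
brute = [aggregate_brute_force() for _ in range(5000)]
print(round(sum(ident) / 5000, 1), round(sum(brute) / 5000, 1))  # both near 100*2*6 = 1200
```

The identity version does one gamma draw per iteration instead of roughly a hundred, which is exactly why VoseAggregateMC exploits Table 11.3 when it can.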
The distribution is not terribly asymmetric, so adding roughly 1000 of them will look very close to a normal distribution, which means that we can be confident in applying the central limit theorem approximation, shown in the model of Figure 11.3. Here we have made use of the VoseMoments array function, which returns the moments of a distribution object. Most software, however, will allow you at least to view the moments of a distribution and, if not, you can simulate the distribution on its own and empirically determine its moments from the values or, if you need greater accuracy or speed, apply the equations given in the distribution compendium in Appendix III. The VoseCLTSum function performs the same calculation as that shown in F5 but is a little more intuitive. Alternatively, the VoseAggregateMC function will, in this iteration, add together 957 values drawn from the Beta4 distribution, because there is no known identity for sums of Beta4 distributions.

Figure 11.2 A Beta4(3, 7, 0, 8) distribution.

Figure 11.3 Model for the central limit theorem approximation. Formulae table: C2: =VosePoisson(1000); C3: =VoseBeta4Object(3, 7, 0, 8); {B5:C8}: {=VoseMoments(C3)}; F5 (output): =VoseNormal(C2*C5, SQRT(C2*C6)); F6 (alternative): =VoseCLTSum(C2, C5, SQRT(C6)); F7 (alternative): =VoseAggregateMC(C2, C3).

11.2 Aggregate Distributions

11.2.1 Moments of an aggregate distribution

There are general formulae for determining the moments of an aggregate distribution, given that one has the moments of the frequency distribution for n and the severity distribution for X.
If the frequency distribution has mean, variance and skewness of μF, VF and SF respectively, and the severity distribution has mean, variance and skewness of μC, VC and SC respectively, then the aggregate distribution has the following moments:

Mean = μF * μC    (11.1)
Variance = μF * VC + VF * μC^2    (11.2)
Skewness = (μF * SC * VC^(3/2) + 3 * VF * μC * VC + SF * VF^(3/2) * μC^3) / (Variance)^(3/2)    (11.3)

There is also a formula for kurtosis, but it is rather ugly. The ModelRisk function VoseAggregateMoments determines the first four moments of an aggregate distribution for any frequency and severity distribution, even if they are bounded and/or shifted.

Equations (11.1) to (11.3) deserve a little more exploration. Firstly, let's consider the situation where n is a fixed value, so μF = n, VF = 0 and SF is undefined. Then we have moments for the aggregate distribution of

Mean = n * μC
Variance = n * VC
Skewness = SC / √n

You can see that this gives support to the central limit theorem, which states that, if n is large enough, the aggregate distribution approaches a normal distribution with mean = n * μC and variance = n * VC. The skewness equation shows that the aggregate skewness is proportional to the skewness of X but decreases rapidly at first with increasing n, then more slowly, and asymptotically towards zero.

Another interesting example is to consider the aggregate moment equations when n follows a Poisson(λ) distribution, which is very commonly the most appropriate distribution for n, and also has the convenience of being described by just one parameter. Now we have μF = λ, VF = λ and SF = 1/√λ, and the aggregate moments are

Mean = λ * μC
Variance = λ * (VC + μC^2)
Skewness = (SC * VC^(3/2) + 3 * μC * VC + μC^3) / (√λ * (VC + μC^2)^(3/2))

The mean and variance equations are simple formulae. We can see that the skewness decreases with 1/√λ in the same way as it does for a fixed value of n.
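Equations (11.1) to (11.3) translate directly into code. This sketch evaluates them for a Poisson(50) frequency and a Gamma(2, 6) severity (an illustrative choice: mean 12, variance 72, skewness 2/√2); the Poisson frequency has mean = variance = λ and skewness 1/√λ as noted above:

```python
import math

def aggregate_moments(muF, VF, SF, muC, VC, SC):
    """Equations (11.1)-(11.3): moments of the aggregate of a random
    number n (frequency) of iid severities X."""
    mean = muF * muC
    variance = muF * VC + VF * muC ** 2
    third = muF * SC * VC ** 1.5 + 3 * VF * muC * VC + SF * VF ** 1.5 * muC ** 3
    skewness = third / variance ** 1.5
    return mean, variance, skewness

lam = 50.0                                  # Poisson(50) frequency
muC, VC, SC = 12.0, 72.0, 2 / math.sqrt(2)  # Gamma(2, 6) severity
m, v, s = aggregate_moments(lam, lam, 1 / math.sqrt(lam), muC, VC, SC)
print(m, v, round(s, 4))  # 600.0, 10800.0 and a small positive skewness
```

As a cross-check, for a Poisson frequency the third central moment collapses to λ·E[X^3], and the numbers above agree with that identity.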
If X is symmetrically distributed then, for a given λ, the skewness is at its maximum when the mean and standard deviation of X are the same, and at its lowest when the standard deviation is very high. Thus, the aggregate distribution will be more closely normal when VC is large. Being able to determine the aggregate moments is pretty useful. One can directly compare sums of random variables, which I will discuss more in Chapter 21. One can also match these moments to some parametric distribution and use that as an approximation to the aggregate distribution. An aggregate distribution is almost always right skewed, so we can select from a number of right-skewed distributions like the lognormal and gamma and match moments. For example, a Gamma(α, β) distribution shifted by a value T has

Mean = αβ + T    (11.4)
Variance = αβ^2    (11.5)
Skewness = 2/√α    (11.6)

Thus, matching skewness gives us a value for α. Then, matching variance gives us β and, finally, matching mean gives us T. Adding a shift gives us three parameters to estimate, so we can match three moments. The model in Figure 11.4 offers an example. Cells C3:C5 are the parameters for the model. Cells D3 and D4 use ModelRisk functions to create distribution objects. B8:C11 and D8:E11 use the VoseMoments function to calculate the moments of the two distributions. Alternatively, you can use the equations in the distribution compendium in Appendix III. F8:F10 manually calculates the first three aggregate moments from Equations (11.1) to (11.3), and G8:H11 calculates all four using the VoseAggregateMoments function as a check. In C15:C17, Equations (11.4) to (11.6) are inverted to determine the gamma distribution parameters. Finally, G14:H17 uses the VoseMoments function again to determine the moments of the gamma distribution. You can see that they match the mean, variance and skewness of the aggregate distribution, as they should, but also that the kurtosis is very close, so the gamma distribution would likely be a good substitute for the aggregate distribution.

Figure 11.4 Model for determining aggregate moments.

To be sure, we would need to plot the two together, which we'll look
You can see that they match the mean, variance and skewness of the aggregate distribution - as they should - but also that the kurtosis is very close, so the gamma distribution would likely be a good substitute for the aggregate distribution. To be sure, we would need to plot the two together, which we'll look Chapter I I Sums of random variables Aggregate 25 Mean 25 Variance 0.2 Skewness 3.04 Kurtosis 307 VoseAMoments 1850 Variance 6.944 1.018515312 Skewness 75.1 056 Kurtosis 1.018515 4.723443 =C8*E9+CYE8A2 Figure 11.4 Model for determining aggregate moments. at later: a feature in ModelRisk uses the matching moments principle to match shifted versions of the gamma, inverse gamma, lognormal, Pearson5, Pearson6 and fatigue distributions to constructed aggregate distributions and overlay the distributions for an extra visual comparison. 1 1.2.2 Methods for constructing an aggregate distribution In this section I want to turn to a range of very neat techniques for constructing the aggregate distribution when n is a random variable and X are independent identically distributed random variables. There are a lot of advantages to being able to construct such an aggregate distribution, among which are: a a We can determine tail probabilities to a high precision. It is much faster than Monte Carlo simulation. We can manipulate the aggregate distribution as with any other in Monte Carlo simulation, e.g. correlate it with other variables. The main disadvantage to these methods is that they are computationally intensive and need to run calculations through often very long arrays. This makes them impractical to show in a spreadsheet 308 Risk Analysis environment, so I will only describe the theory here. All methods are implemented in ModelRisk, however, which runs the calculations internally in C++. We start by loolung at the Panjer recursive method, and then the fast Fourier transform (FFT) method. 
These two have a similar feel to them, and similar applications, although their mathematics is quite different. Then we'll look at a multivariate FFT method that allows us to extend the aggregate calculation to a set of {n, X} variables. The De Pril recursive method is similar to Panjer's and has a specific use. Finally, I give a summary of these methods and when and why they are useful.

Panjer's recursive method

Panjer's recursive method (Panjer, 1981; Panjer and Willmot, 1992) applies where the number of variables n being added together follows one of these distributions: binomial, geometric, negative binomial, Poisson or Pólya. The technique begins by taking the claim size distribution and discretising it into a number of values with increment C. Then the probability is redistributed so that the discretised claim distribution has the same mean as the continuous variable. There are a few ways of doing this but, if the discretisation steps are small, they give essentially the same answer. A simple method is to assign the value (i * C) the probability s_i as follows:

s_i = F((i + 1/2) * C) - F((i - 1/2) * C)

where F is the claim size distribution function. In the discretisation process we have to decide on a maximum value of i (called r) so that we don't have an infinite number of calculations to perform. Now comes the clever part. The above discrete distributions lead to a simple one-time summation through a recursive formula to calculate the probability p(j) that the aggregate distribution will equal j * C:

p(j) = (1 / (1 - a * s_0)) * SUM[i = 1 to min(j, r)] (a + b * i / j) * s_i * p(j - i)

The formula works for all frequency distributions for n that are of the (a, b, 0) class, which means that, from P(n = 0) up, we have a recursive relationship between P(n = i) and P(n = i - 1) of the form

P(n = i) = (a + b / i) * P(n = i - 1)    (11.8)

where a and b are fixed values that depend on which of the discrete distributions is used and their parameter values.
The specific values of p_0, a and b are available for each member of the (a, b, 0) class of discrete distributions; for example, for the Poisson(λ) they are

p_0 = exp[λ * s_0 - λ], a = 0, b = λ

with corresponding formulae for the Binomial(n, p), the Geometric(p), the NegBin(s, p) and the Pólya(α, β). The output of the algorithm is two arrays {i}, {p(i)} that can be constructed into a distribution, for example as VoseDiscrete({i}, {p(i)}) * C. Panjer's method can occasionally numerically "blow up" with the binomial distribution, but when it does so it generates negative probabilities, so this is immediately obvious. A small change to Panjer's algorithm allows the formula to be applied to (a, b, 1) distributions, which means that the recursive formula (11.8) works from P(n = 1) onwards. This allows us to include the logarithmic distribution. Panjer's method cannot, however, be applied to the Delaporte distribution.

Panjer's method requires a bit of hands-on management, because one has to experiment with the maximum value r to ensure sufficient coverage and accuracy of the distribution. ModelRisk uses two controls for this: MaxP specifies the upper percentile value of the distribution of X at which the algorithm will stop, and Intervals specifies how many steps will be used in the discretisation of the X distribution. In general, the larger one makes Intervals, the more accurate the model will be, but at the expense of computation time. The MaxP value should be set high enough realistically to cover the distribution of X but, if one sets it too high for a long-tailed distribution, there will be an insufficient number of increments in the main body of the distribution. In ModelRisk one can compare the exact moments of the aggregate distribution with those of the Panjer-constructed distribution to ensure that the two correspond with sufficient accuracy for the analyst's needs.
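A sketch of Panjer's recursion for the Poisson case (a = 0, b = λ, p_0 = exp[λ·s_0 − λ] as above); the severity here is an already-discrete two-point distribution chosen for illustration, so the recursion is exact and easy to cross-check against the compound Poisson mean λ·E[X]:

```python
import math

def panjer_poisson(lam, s, n_terms):
    """Panjer recursion with Poisson(lam) frequency.
    s[i] = probability the discretised severity equals i*C (i = 0..r).
    Returns p[j] = probability the aggregate equals j*C, j = 0..n_terms-1."""
    r = len(s) - 1
    p = [math.exp(lam * s[0] - lam)]                    # p_0 = exp[lam*s_0 - lam]
    for j in range(1, n_terms):
        total = 0.0
        for i in range(1, min(j, r) + 1):
            total += (lam * i / j) * s[i] * p[j - i]    # a = 0, b = lam
        p.append(total)
    return p

# Severity: value 1*C with prob 0.6, 2*C with prob 0.4; frequency Poisson(3)
p = panjer_poisson(3.0, [0.0, 0.6, 0.4], 60)
print(round(p[0], 6), round(sum(p), 6))  # p[0] = exp(-3); probabilities sum to ~1
```

The hands-on choices the text describes (the discretisation step C, the cut-off r and the number of terms to compute) are exactly the `s` vector and `n_terms` arguments here.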
Fast Fourier transform (FFT) method

The density function f(x) of a continuous random variable can always be converted into its Fourier transform φ_X(t) (also called its characteristic function) as follows:

φ_X(t) = INTEGRAL[min to max] e^(itx) f(x) dx = E[e^(itX)]

and we can transform back using the inverse transform. Characteristic functions are really useful for determining the sums of random variables because φ_(X+Y)(t) = φ_X(t) * φ_Y(t), i.e. we just multiply the characteristic functions of variables X and Y to get the characteristic function of (X + Y). For example, the characteristic function for a Normal(μ, σ) distribution is

φ(t) = exp(iμt - σ^2 t^2 / 2)

Thus, for variables X = Normal(μX, σX) and Y = Normal(μY, σY) we have

φ_(X+Y)(t) = φ_X(t) * φ_Y(t) = exp(i(μX + μY)t - (σX^2 + σY^2) t^2 / 2)

In this particular example, the functional form of φ_(X+Y)(t) equates to another normal distribution with mean (μX + μY) and variance (σX^2 + σY^2), so we don't have to apply a transformation back: we can already recognise the result.

The fast Fourier transform method of constructing an aggregate distribution, where there are a random number n of identically distributed random variables X to be summed, is described fully in Robertson (1992). The technique involves discretising the severity distribution X as in Panjer's method, so that one has two sets of discrete vectors, one each for the frequency and severity distributions. The mathematics involves complex numbers and is based on the convolution theorem of discrete Fourier transforms, which states that to obtain the aggregate distribution one multiplies the two discrete Fourier transforms of these vectors pointwise and then computes the inverse discrete Fourier transform. The fast Fourier transform is used as a very quick method for computing the discrete Fourier transform of long vectors. The main advantage of the FFT method is that it is not recursive, so, when one has a large array of possible values, the FFT won't suffer the same error propagation that Panjer's recursion will.
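The recipe can be demonstrated with a toy radix-2 FFT: transform the discretised severity, apply the Poisson probability generating function exp(λ(z − 1)) pointwise, and transform back. This is a minimal illustration only (real implementations, ModelRisk's included, use optimised libraries and padding/wrap-around safeguards); the frequency and severity match the Panjer example so the two constructions can be compared:

```python
import cmath
import math

def fft(a, invert=False):
    # Recursive radix-2 Cooley-Tukey; len(a) must be a power of two
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def aggregate_fft(lam, s, n):
    # Pad severity to length n, transform, apply the Poisson pgf
    # exp(lam*(z - 1)) pointwise, and invert (dividing by n).
    sev = s + [0.0] * (n - len(s))
    shat = fft([complex(x) for x in sev])
    phat = [cmath.exp(lam * (z - 1)) for z in shat]
    return [z.real / n for z in fft(phat, invert=True)]

p = aggregate_fft(3.0, [0.0, 0.6, 0.4], 128)
print(round(p[0], 6), round(sum(p), 6))  # p[0] = exp(-3); probabilities sum to ~1
```

The length-128 vector comfortably covers the aggregate's support here; too short a vector would wrap tail probability back onto small values, which is the FFT method's analogue of choosing r too small in Panjer's recursion.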
The FFT method can also take any discrete distribution for its frequency distribution (and, in principle, any other non-negative continuous distribution if one discretises it). The FFT can also be started away from zero, whereas the Panjer method must calculate the probability of every value starting at zero. Thus, as a rough guide, consider using Panjer's method where the frequency distribution does not take very large values and where it is one of those for which Panjer's method applies; otherwise use the FFT method. ModelRisk offers a version of the FFT method with some adjustments to improve efficiency and allow for a continuous aggregate distribution. FFT methods can also be extended to a group of {n, X} paired distributions, which ModelRisk makes available via its VoseAggregateMultiFFT function.

De Pril method

For a portfolio of n independent life insurance policies, each policy y has a particular probability of a claim p_y in some period (usually a year) and benefit B_y. There are various methods for calculating the aggregate payout. Dickson (2005) is an excellent (and very readable) review of these methods and other areas of insurance risk and ruin. The De Pril method is an exact method for determining the aggregate payout distribution. The compound Poisson approximation discussed next is a faster method that will usually work too. De Pril (1986) offers an exact calculation of the aggregate distribution under the assumptions that:

- The benefits are fixed values rather than random variables and take integer multiples of some convenient base (e.g. $1000) with a maximum value M * base, i.e. B_i = (1 ... M) * base.
- The probabilities of a claim can similarly be grouped into a set of J values (i.e. into tranches of mortality rates), p_j = {p_1 ... p_J}.

Let n_ij be the number of policies with benefit i and probability of claim p_j.
Then De Pril's paper demonstrates that p(y), the probability that the aggregate payout will be equal to y * base, is given by the recursive formula

p(y) = (1/y) Σ_{i=1..min(y,M)} Σ_{k=1..⌊y/i⌋} p(y − ik) h(i, k)   for y = 1, 2, 3, ...

where

h(i, k) = i (−1)^(k−1) Σ_{j=1..J} n_ij (p_j / (1 − p_j))^k

and the recursion starts from p(0) = Π_{i=1..M} Π_{j=1..J} (1 − p_j)^(n_ij). The formula has the benefit of being exact, but it is very computationally intensive. However, the number of computations can usually be significantly reduced if one accepts ignoring small aggregate costs to the insurer. Let K be a positive integer; the recursive formulae above are then modified by truncating the sums over k at k ≤ K. Dickson (2005) recommends using a value of 4 for K. The De Pril method can be seen as the counterpart to Panjer's recursive method for the collective model. ModelRisk offers a set of functions for implementing De Pril's method.

Compound Poisson approximation

The compound Poisson approximation assumes that the probability of payout for an individual policy is fairly small - which is usually true - but has the advantage over the De Pril method of allowing the payout distribution to be a random variable rather than a fixed amount. Let n_j be the number of policies with probability of claim p_j. The number of payouts in this stratum is therefore Binomial(n_j, p_j). If n_j is large and p_j is small, the binomial is well approximated by a Poisson(n_j * p_j) = Poisson(λ_j) distribution. The additive property of the Poisson distribution tells us that the frequency distribution for payouts over all groups of lines of insurance is given by λ_all = Σ_j λ_j, and the total number of claims = Poisson(λ_all). The probability that one of these claims, randomly selected, comes from stratum j is given by λ_j / λ_all. Let F_j(x) be the cumulative distribution function for the claim size of stratum j. The probability that a random claim is less than or equal to some value x is therefore F(x) = Σ_j (λ_j / λ_all) F_j(x). Thus, we can consider the aggregate distribution for the total claims to have a frequency distribution equal to Poisson(λ_all) and a severity distribution given by F(x).
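De Pril's recursion can be sketched in a few lines (a minimal sketch without the K truncation; the portfolio layout as a dict is my own illustrative choice, not ModelRisk's interface):

```python
def de_pril(policies, y_max):
    """Exact aggregate-claims pmf p(0..y_max) for the individual life model.
    policies: dict mapping (benefit i, claim probability p) -> policy count,
    with benefits as positive integer multiples of the base unit."""
    M = max(i for i, _ in policies)
    p0 = 1.0
    for (_, q), cnt in policies.items():
        p0 *= (1.0 - q) ** cnt               # probability of no claims at all
    p = [p0]

    def h(i, k):
        s = sum(cnt * (q / (1.0 - q)) ** k
                for (bi, q), cnt in policies.items() if bi == i)
        return i * (-1.0) ** (k - 1) * s

    for y in range(1, y_max + 1):
        total = sum(h(i, k) * p[y - i * k]
                    for i in range(1, min(y, M) + 1)
                    for k in range(1, y // i + 1))
        p.append(total / y)
    return p

# Two policies, benefit 1 unit each, claim probability 0.1:
pmf = de_pril({(1, 0.1): 2}, 2)   # -> [0.81, 0.18, 0.01]
```

The two-policy example can be checked directly: the aggregate payout is Binomial(2, 0.1), so the probabilities of paying 0, 1 or 2 units are 0.81, 0.18 and 0.01.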
Adding correlation in aggregate calculations

Simulation

The most common method for determining the aggregate distribution of a number of correlated random variables is to simulate each random variable in its own spreadsheet cell, using one of the correlation methods described elsewhere in this book, and then sum them in another cell. For example, the model in Figure 11.5 adds together Poisson(100) random variables each following a Lognormal(2, 5) distribution, but where these variables are correlated through a Clayton(10) copula. Cell C7 determines the 99.99th percentile of the Poisson(100) distribution - a value of 139 - which is used as a guide to set the maximum number of rows in the table. The Clayton copula values are used as "U-parameter" inputs into the lognormal distributions, meaning that they make the lognormal distributions return the percentile equating to the copula value; for example, cell D12 returns a value of 2.5539..., which is the 80.98...th percentile of the Lognormal(2, 5) distribution.

[Figure 11.5 Model for simulating the aggregate distribution of correlated random variables.]

A Clayton copula provides a particularly high level of correlation of the variables at their low end. For example, the plot in Figure 11.6 shows the level of correlation of two variables with a Clayton(10).
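The same construction can be sketched outside a spreadsheet. The Marshall-Olkin method below is a standard way of drawing Clayton copula uniforms; the parameter values mirror the example, but the code is an illustrative sketch, not the ModelRisk model:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def clayton_uniforms(theta, n):
    """n uniforms joined by a Clayton(theta) copula (Marshall-Olkin sampling)."""
    v = rng.gamma(1.0 / theta, 1.0)
    e = rng.exponential(size=n)
    return (1.0 + e / v) ** (-1.0 / theta)

def correlated_sum(lam=100, theta=10, mu=2.0, sd=5.0):
    """Sum of a Poisson(lam) number of lognormals (mean mu, sd sd)
    tied together by the Clayton copula via their percentiles."""
    n = rng.poisson(lam)
    u = clayton_uniforms(theta, n)            # the "U-parameter" percentiles
    s2 = np.log(1.0 + (sd / mu) ** 2)         # lognormal from its mean and sd
    x = stats.lognorm.ppf(u, np.sqrt(s2), scale=mu * np.exp(-s2 / 2.0))
    return x.sum()

totals = np.array([correlated_sum() for _ in range(1000)])
```

Because all the percentiles within one iteration share the same latent gamma draw, the summands move up and down together, widening the distribution of the total without changing its mean.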
Thus, the model will produce a wider range for the sum than an uncorrelated set of variables, but in particular will produce more extreme low-end values from a probabilistic view (the correlated sum has about a 70 % probability of taking a lower value than the uncorrelated sum). The use of one of the Archimedean copulas is an appropriate tool here because we are adding up a random number of these variables, but the number being summed does not affect the copula's behaviour - all variables will be related to the same degree no matter how many are being summed. The effect of the correlation is readily observed by repeating the model without any correlation. The plot in Figure 11.7 compares the two cumulative distributions.

[Figure 11.6 Correlation of two variables with a Clayton(10).]
[Figure 11.7 Comparison of correlated and uncorrelated sums.]

Complete correlation

In the situation where the source of the randomness or uncertainty of the distribution associated with a random variable is the same for the whole group you are adding up, there is really just one random variable. For example, imagine that a railway network company must purchase 127 000 sleepers (the beams under the rails) next year. The sleepers will be made of wood, but the price is uncertain because the cost of timber may fluctuate. It is estimated that the cost will be $PERT(22.1, 22.7, 33.4) each. If all the timber is being purchased at the same time, it might be reasonable to believe that all the sleepers will have the same price. In that case, the total cost can be modelled simply as 127 000 * PERT(22.1, 22.7, 33.4). If there are a large number n of random variables X_i (i = 1, ..., n) being summed and the uncertainty of the sum is not dominated by a few of these distributions, the sum is approximately normally distributed
according to the central limit theorem as follows:

Σ X_i ≈ Normal( Σ_i μ_i , sqrt( Σ_i Σ_j σ_ij ) )

The equation states that the aggregate sum takes a normal distribution with a mean equal to the sum of the means of the individual distributions being added together. It also states that the variance (the square of the standard deviation in the formula) of the normal distribution is equal to the sum of the covariance terms between each pair of variables. The covariance terms σ_ij are calculated as follows:

σ_ij = ρ_ij σ_i σ_j = E[(X_i − μ_i)(X_j − μ_j)]

where σ_i and σ_j are the standard deviations of variables i and j, ρ_ij is the correlation coefficient and E[.] means "the expected value of" the thing in the brackets. If we have datasets for the variables being modelled, Excel can calculate the covariance and correlation coefficients using the functions COVAR() and CORREL() respectively. If we were thinking of using a rank order correlation matrix, each element corresponds reasonably accurately to ρ_ij for roughly normal distributions (at least, not very heavily skewed distributions), so the standard deviation of the normally distributed sum could be calculated directly from the correlation matrix.

Correlating partial sums

We will sometimes be in the situation of having two or more sums of random variables that have some correlation relationship between them. For example, imagine that you are a hospital trying to forecast the number of patient-days you will need to provide next year, and you split the patients into three groups: surgery, maternity and chronic illness (e.g. cancer). Let's say that the distribution of days that a person will spend in hospital under each category is independent of the other categories, but the number of individuals being treated is correlated with the number of people in the catchment area, which is uncertain because hospital catchments are being redefined in your area.
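Returning to the central limit approximation above, the normal parameters of the sum follow directly from the means, standard deviations and correlation matrix; a small sketch with made-up numbers:

```python
import numpy as np

means = np.array([10.0, 20.0, 30.0])
sds   = np.array([2.0, 3.0, 4.0])
rho   = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])    # illustrative correlation matrix

cov = np.outer(sds, sds) * rho         # sigma_ij = rho_ij * sigma_i * sigma_j
sum_mean = means.sum()                 # mean of the normal approximation
sum_sd = np.sqrt(cov.sum())            # sqrt of the sum of ALL covariance terms
```

Note that `cov.sum()` includes the diagonal terms σ_i², so it is exactly the double sum Σ_i Σ_j σ_ij in the formula above.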
There are plenty of ways to model this problem, but perhaps the most convenient is to start with the uncertainty about the number of people in the catchment area and derive what the demand will be for each type of care as a consequence, then make a projection of what the total patient-days might be as a result, as shown in the model in Figure 11.8. In this model the uncertainty about the catchment area population is modelled with a PERT distribution, the bed-days for each category of healthcare are modelled by lognormal distributions with different parameters, and the number of patients in each category is modelled with a Poisson distribution with a mean equal to (population size in 000s) * (expected cases/year/1000 people). I have shown three different methods for simulating the aggregate distribution in each class: pure Monte Carlo for surgery, FFT for maternity and Panjer's recursive method for chronic. Any of the three could be used to model each category. You'll notice that the Monte Carlo method is slightly different from the others in that I've used VosePoisson(...) instead of VosePoissonObject(...) because the VoseAggregateMC function requires a numerical input for how many variables to sum (allowing the flexibility that this could be a calculated value), whereas the FFT and Panjer methods perform calculations on the Poisson distribution and therefore need it to be defined as an object. Note that the same model could be achieved with other Monte Carlo simulation software by making randomly varying arrays for each category, the technique illustrated in Figure 11.1, but the numbers in this problem would require very long arrays.
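The pure Monte Carlo route (what VoseAggregateMC does for one category) amounts to the following sketch; the Poisson mean and lognormal parameters echo the figure but are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def aggregate_mc(lam, mu, sd, sims=2000):
    """Simulate sums of a Poisson(lam) number of lognormal hospital stays,
    where the lognormal is parameterised by its mean mu and std dev sd."""
    s2 = np.log(1.0 + (sd / mu) ** 2)            # log-space variance
    out = np.empty(sims)
    for k in range(sims):
        n = rng.poisson(lam)                     # number of patients this year
        out[k] = rng.lognormal(np.log(mu) - s2 / 2.0, np.sqrt(s2), n).sum()
    return out

bed_days = aggregate_mc(3520.0, 4.1, 2.5)        # one hospital category
```

The mean of the simulated totals should be close to λ times the mean stay (3520 × 4.1 ≈ 14 432 bed-days here), which gives a quick sanity check on any aggregate model.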
[Figure 11.8 Model for forecasting the number of patient-days in a hospital.]
[Figure 11.9 Using a normal copula to correlate the Poisson frequency distributions.]

Using the same basic problem, let us now consider the situation where the frequency distribution for each category is correlated in some fashion, as we had before, but not because of their direct relationship
to any observable variable. Imagine that the population size is known, but we want to model the effects of increased pollution in the area, so we want the surgery and chronic Poisson variables to be positively correlated with each other but negatively correlated with maternity. The model in Figure 11.9 uses a normal copula to correlate the Poisson frequency distributions. There is in fact an FFT method to achieve this correlation between frequency distributions, but the algorithm is not particularly stable. Turning now to the severity (length of hospital stay) distributions, we may wish to correlate the length of stay for all individuals in a certain category. In the above model, this can be achieved by creating a separate scaling variable for each lognormal distribution, for example a Gamma(1/h², h²) distribution, which has the required mean of 1 and a standard deviation of h (Figure 11.10). Note that this means that the lognormal distributions will no longer have the standard deviations they were given before. Finally, let's consider how to correlate the aggregate distributions themselves. We can construct the distribution of the number of bed-days required for each type of healthcare using either the FFT or Panjer method. Since the distribution is constructed rather than simulated, we can easily correlate the aggregate distributions by controlling how they are sampled. In the example in Figure 11.11, the model uses the FFT method to construct the aggregate variables and correlates them together using a Frank copula.
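The normal-copula correlation of the frequencies can be sketched directly: draw correlated standard normals, map them to percentiles, and feed those percentiles to the Poisson inverse cdf. The correlation matrix and means below echo Figure 11.9 but are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

corr = np.array([[ 1.00, -0.30,  0.20],     # surgery, maternity, chronic
                 [-0.30,  1.00, -0.25],
                 [ 0.20, -0.25,  1.00]])
lams = np.array([19667.0, 1573.0, 2932.0])  # illustrative Poisson means

L = np.linalg.cholesky(corr)
z = L @ rng.standard_normal((3, 5000))      # correlated standard normals
u = stats.norm.cdf(z)                       # normal copula percentiles
counts = stats.poisson.ppf(u, lams[:, None])  # correlated patient counts
```

Each row of `counts` keeps its Poisson marginal distribution, while the rank correlations follow the signs of the copula's correlation matrix.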
[Figure 11.10 Creating separate scaling variables for each lognormal distribution.]
[Figure 11.11 Using the FFT method to combine correlated aggregate variables.]
[Figure 11.12 Model that calculates the distribution for n.]

11.2.3 Number of variables to reach a total

So far in this chapter we have focused on determining the distribution of the sum of a (usually random) number of random variables. We are also often interested in the reverse question: how many random variables will it take to exceed some total? For example, we might want to answer the following questions: How many random people entering a lift will it take to exceed the maximum load allowed? How many sales will a company need to make to reach its year-end target?
How many random exposures to a chemical will it take to reach the exposure limit?
Some questions like this are directly answered by known distributions; for example, the negative binomial, beta-negative binomial and inverse hypergeometric describe how many trials will be needed to achieve s successes for the binomial, beta-binomial and hypergeometric processes respectively. However, if the random variables are not 0 or 1 but are continuous distributions, there are no distributions available that are directly useful. The most general method is to use Monte Carlo simulation with a loop that consecutively adds a random sample from the distribution in question until the required sum is produced. ModelRisk offers such a function, called VoseStopSum(Distribution, Threshold). This can, however, be quite computationally intensive when the required number is large, so it would be useful to have some quicker methods available. Table 11.3 gives us some identities that we can use.
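For instance, using the identity that the sum of n independent Gamma(α, β) variables is Gamma(nα, β), the distribution of the number n needed to exceed a threshold T can be computed directly and checked against the brute-force loop (an illustrative sketch, not the VoseStopSum implementation):

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(5)
alpha, beta, T = 2.0, 1.0, 10.0    # Gamma(2, 1) summands, threshold 10

def p_n(n):
    """P(the n-th variable takes the running sum over T) = F_(n-1)(T) - F_n(T),
    where F_k is the cdf of Gamma(k*alpha, beta) and F_0(T) = 1."""
    F = lambda k: gamma.cdf(T, k * alpha, scale=beta) if k else 1.0
    return F(n - 1) - F(n)

probs = np.array([p_n(n) for n in range(1, 61)])
analytic_mean = (np.arange(1, 61) * probs).sum()

# Brute-force check: keep adding Gamma(2, 1) samples until the total exceeds T
def stop_sum():
    total, n = 0.0, 0
    while total <= T:
        total += rng.gamma(alpha, beta)
        n += 1
    return n

sim_mean = np.mean([stop_sum() for _ in range(4000)])
```

The analytic probabilities sum to 1 (the threshold is always eventually crossed), and the two estimates of the mean number of variables should agree closely.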
For example, the sum of n independent variables following a Gamma(α, β) distribution is equal to a Gamma(n * α, β). If we require a total of at least T, then the probability that (n − 1) Gamma(α, β) variables will exceed T is 1 − F_(n−1)(T), where F_(n−1)(T) is the cumulative probability at T of a Gamma((n − 1) * α, β). Excel has the GAMMADIST function, which calculates F(x) for a gamma distribution (ModelRisk has the function VoseGammaProb, which performs the same task but without the errors GAMMADIST sometimes produces). The probability that n variables will exceed T is given by 1 − F_n(T). Thus, the probability that it was the nth random variable that took the sum over the threshold is (1 − F_n(T)) − (1 − F_(n−1)(T)) = F_(n−1)(T) − F_n(T). You can therefore construct a model that calculates the distribution for n directly, as shown in the spreadsheet in Figure 11.12. The same idea can be applied with the Cauchy, chi-square, Erlang, exponential, Lévy, normal and Student distributions. The VoseStopSum function in ModelRisk implements this shortcut automatically.

Chapter 12 Forecasting with uncertainty

[Dilbert cartoon. © United Feature Syndicate, Inc. Syndicated by Bruno Productions B.V. Reproduced by permission.]

This chapter looks at several forecasting methods in common use and how variability and uncertainty can be incorporated into their forecasts. Time series modelling is usually based on extrapolating a set of observations from the past or, where data are not available or inadequate, the modelling focuses on expert opinion of how the variable may behave in the future. In this chapter we will look first of all at the more formal techniques of time series modelling based on past observations, then look at some ways that the reader may find useful to model expert opinion of what the future holds.
The prerequisites of formal quantitative forecasting techniques are that a reliable time series of past observations is available and that it is believed that the factors determining the patterns exhibited in that time series are likely to continue to exist or, if not, that we can determine the effect of changes in these factors. We begin by discussing ways of measuring the performance of a forecasting technique. Then we look at the naive forecast, which is simply repeating the last, deseasonalised, value in the available time series. This simplistic forecasting technique is useful for providing a benchmark against which the performance of the other techniques can be compared. This is followed by a look at various forecasting techniques, divided into three sections according to the length of the period that is to be forecast. Finally, we will look at a couple of examples of a different approach that aims at modelling the variability based on a reasonable theoretical model of the actual system. There are a few useful basic tips I recommend when you are producing a stochastic time series as part of your risk analysis: Check the model's behaviour with embedded Excel x-y scatter plots. Split the model up into components rather than create long, complicated formulae. That way you'll see that each component is working correctly, and therefore have confidence in the time series projection as a whole.

[Figure 12.1 Six plots from the same geometric Brownian motion model. Each pattern could easily be what follows on from any other pattern.]

Be realistic about the match between historic patterns and projections. For example, write a simple geometric Brownian motion model, plot the series and hit the F9 key (recalculate) a few times and see the variation in patterns you get.
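A geometric Brownian motion series of the kind described can be sketched in a few lines; re-running it plays the role of hitting F9, producing the strikingly different patterns of Figure 12.1 (the drift and volatility values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def gbm_path(s0=100.0, mu=0.05, sigma=0.25, dt=1.0 / 12.0, steps=60):
    """One GBM path: S_{t+dt} = S_t * exp((mu - sigma^2/2) dt + sigma sqrt(dt) Z)."""
    z = rng.standard_normal(steps)
    log_steps = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z
    return s0 * np.exp(np.cumsum(log_steps))

paths = [gbm_path() for _ in range(6)]   # six very different-looking series
```

All six paths come from exactly the same stochastic model, which is the point of the exercise: a single realisation tells you surprisingly little about the generating process.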
Remember that these all come from the same stochastic model, but they will often look convincingly different (see Figure 12.1): if any of these had been our historical data, a statistical analysis of the data would have tended to agree with, and reinforce, any preconception about the appropriate model, because statistical analysis requires you to specify the model to test. So don't always go for a forecast model just because it fits the data best - also look at whether there is a logical reason for choosing one model over another. Be creative. Short-term forecasts (say 20-30 % of the historic period for which you have good data) are often adequately produced from a statistical analysis of your data. Even then, be selective about the model. However, beyond that timeframe we move into crystal ball gazing. Including your perceptions of where the future may go, possible influencing events, etc., will be just as valid as an extrapolation of historic data.

12.1 The Properties of a Time Series Forecast

When producing a risk analysis model that forecasts some variable over time, I recommend you go through a list of several properties that variable might exhibit over time, as this will help you both statistically analyse any past data you have and select the most appropriate model to use. The properties are: trend, randomness, seasonality, cyclicity or shocks, and constraints.

[Figure 12.2 Examples of expected value trend over time.]

12.1.1 Trend

Most variables we model have a general direction in which they have been moving, or in which we believe they will move in the future.
The four plots in Figure 12.2 give some examples of the expected value of a variable over time: top left - a steady relative decrease, such as one might expect for sales of an old technology, or the number of individuals remaining alive from a group; top right - a steady (straight-line) increase, such as is often assumed for financial returns over a reasonably short period (sometimes called "drift"); bottom left - a steady relative increase, such as bacterial growth or take-up of new technology; and bottom right - a drop turning into an increase, such as the rate of component failures over time (like the bathtub curve in reliability modelling) or advertising expenditure (more at a launch, then lower, then ramping up to offset reduced sales).

12.1.2 Randomness

The second most important property is randomness. The four plots in Figure 12.3 give some examples of the different types of randomness: top left - a relatively small and constant level of randomness that doesn't hide the underlying trend; top right - a relatively large and constant level of randomness that can disguise the underlying trend; bottom left - a steadily increasing randomness, which one typically sees in forecasting (care needs to be taken to ensure that the extreme values don't become unrealistic); and bottom right - levels of randomness that vary seasonally.

[Figure 12.3 Examples of the behaviour of randomness over time.]

12.1.3 Seasonality

Seasonality means a consistent pattern of variation in the expected value (but also sometimes its randomness) of the variable.
There can be several overlaying seasonal periods, but we should usually have a pretty good guess at what the periods of seasonality might be: hour of the day; day of the week; time of the year (summer/winter, for example, or holidays, or end of financial year). The plot in Figure 12.4 shows the effect of two overlaying seasonal periods. The first is weekly with a period of 7; the second is monthly with a period of 30, which complicates the pattern. Monthly seasonality often occurs with financial transactions that take place on a certain day of the month: for example, volumes of documents that a bank's printing facility must produce each day - at the end of the month they have to churn out bank and credit card statements and get them in the post within some legally defined time. One difficulty in analysing monthly seasonality from data is that months have different lengths, so one cannot simply investigate a difference each 30 days, say. Another hurdle in analysing data on variables with monthly and holiday peaks is that there can be some spread of the effect over 2 or 3 days. For example, we performed an analysis recently looking at the calls received into a US insurance company's national call centre to help them optimise how to staff the centre. We were asked to produce a model that predicted every 15 minutes for the next 2 weeks, and another model to predict out 6 weeks. We looked at the patterns by individual state and language (Spanish and English). There was a very obvious and stable pattern through the day that was constant during the working week, but a different pattern on Saturday and on Sunday. The pattern was largely the same between states but different between languages.
Holidays like Thanksgiving (the fourth Thursday of November, so not even a fixed date) were very interesting: call rates dropped hugely on the holiday, to 10 % of the level one would usually have expected, but were slightly lower than normal the day before (Wednesday), significantly lower the day after (Friday), a little lower during the following weekend, and then significantly higher the following Monday and Tuesday (presumably because people were catching up on calls they needed to make). Memorial Day, the last Monday of May, exhibited a similar pattern, as shown in Figure 12.5.

[Figure 12.4 The expected value of a variable with two overlapping seasonal periods.]
[Figure 12.5 Effect of holidays on daily calls to a call centre. The four lines show the effect on the last 4 years. Zero on the x axis is the day of the holiday.]

The final models had logic built into them to look for forthcoming holidays and apply these patterns to forecast expected levels, which had a trend by state and a daily seasonality. For the 15-minute models we also had to take into account the time zone of the state, since all calls from around the US were received into one location, which also involved thinking about when states changed their clocks from summer to winter, and little peculiarities like some states having two time zones (Arizona doesn't observe daylight saving, to conserve energy used by air conditioners, etc.).

[Figure 12.6 Two examples of the effect of a cyclicity shock. On the left, the shock produces a sudden and sustained increase in the variable; on the right, the shock produces a sudden increase that gradually reduces over time - an exponential distribution is often used to model this reduction.]
12.1.4 Cyclicity or shocks

Cyclicity is a confusing term (being rather similar to seasonality) that refers to the effect of obvious single events on the variable being modelled (Figure 12.6 illustrates two basic forms). For example, the Hatfield rail crash in the UK on 12 October 2000 was a single event with a long-term effect on the UK railway network. The accident was caused by the lapsed maintenance of the track, which led to "gauge corner cracking", resulting in the rail separating. Investigators found many more such cracks in the area, and a temporary speed restriction was imposed over very large lengths of track because of fears that other track might be suffering from the same degradation. The UK network was already at capacity levels, so slowing down trains resulted in huge delays. The cost of repairs to the undermaintained track also sent Railtrack, the company managing the network, into administration. In analysing the cause of train delays for our client, Network Rail, a not-for-dividend company that took over from Railtrack, we had to estimate and remove the persistent effect of Hatfield. Another obvious example is 9/11. Anyone who regularly flies on commercial airlines will have experienced the extra delays and security checks. The airline industry was also greatly affected, with several US carriers filing for protection under Chapter 11 of the US Bankruptcy Code, although other factors also played a part, such as oil price increases and other terrorist attacks (also cyclicity events) which dissuaded people from going abroad. We performed a study to determine what price should be charged for parking at a US national airport, part of which included estimating future demand.
From analysing historic data it was evident that the effect of 9/11 on passenger levels was quite immediate, and, as of 2006, levels were only just returning to 2000 levels, where previously there had been consistent growth in passenger numbers, so levels still remain far below what would have been predicted before the terrorist attack. Events like Hatfield and 9/11 are, of course, almost impossible to predict with any confidence. However, other types of cyclicity event are more predictable. As I write this (20 June 2007), there are 7 days left before Tony Blair steps down as Prime Minister of the UK, which he announced on 10 May, and Gordon Brown takes over. Newspaper columnists are debating what changes will come about, and, for people in the know, there are probably some predictable elements.

12.1.5 Constraints

Randomly varying time series projections can quite easily produce extreme values far beyond the range that the variable might realistically take. There are a number of ways to constrain a model. Mean reversion, discussed later, will pull a variable back to its mean so that it is far less likely to produce extreme values. Simple logical bounds like IF(St > 100, 100, St) will constrain a variable to remain at or below 100, and one can make the constraining parameter (100) a function of time too. The section describing market modelling below offers some other techniques that are based on more model-based constraints.

12.2 Common Financial Time Series Models

In this section I describe the most commonly used time series for financial models of variables such as stock prices, exchange rates, interest rates and economic indicators such as producers' price index (PPI) and gross domestic product (GDP). Although they have been developed for financial markets, I encourage you to review the ideas and models presented here because they have much wider applications.
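The logical-bound idea can be sketched in a few lines of code. This is a minimal illustration, assuming numpy; the function name `constrained_walk` and the random-walk form are ours, not from the text - the point is only the `min(..., cap)` step, which mirrors the spreadsheet rule IF(St > 100, 100, St).

```python
import numpy as np

def constrained_walk(s0, mu, sigma, n_steps, cap=100.0, seed=1):
    """Random-walk projection with a simple logical bound, mirroring
    the spreadsheet rule IF(St > 100, 100, St). The cap is constant
    here, but it could equally be made a function of time t."""
    rng = np.random.default_rng(seed)
    s = np.empty(n_steps + 1)
    s[0] = s0
    for t in range(1, n_steps + 1):
        step = rng.normal(mu, sigma)
        s[t] = min(s[t - 1] + step, cap)  # constrain at or below the cap
    return s

path = constrained_walk(90.0, mu=0.5, sigma=2.0, n_steps=200)
```

Every value in `path` stays at or below 100, however large the random steps become.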
Financial time series are considered to vary continuously, even if perhaps we only observe them at certain moments in time. They are based on stochastic differential equations (SDEs), which are the most general descriptions of continuously evolving random variables. The problem with SDEs from a simulation perspective is that they are not always amenable to being exactly converted to algorithms that will generate random possible observations at specific moments in time, and there are often no exact methods for estimating their parameters from data. On the other hand, the advantage is that we have a consistent framework for comparing the time series, and there are sometimes analytical solutions available to us for determining, say, the probability that the variable exceeds some value at a certain point in time - answers that are useful for pricing derivatives and other financial instruments, for example. We can get around the problems with a bit of intense computing, as I will explain for each type of time series. Financial time series model a variable in one of two forms: the actual price S_t of the stock (or the value of a variable such as exchange rate, interest rate, etc., if it is not a stock) at some time t, or its return (aka its relative change if it is not an investment) r_t over a period Δt, ΔS/S. It might seem that modelling S_t would be more natural, but in fact modelling the return of the variable is often more helpful: apart from making the mathematics simpler, it is usually the more fundamental variable. In this section, I will refer to S_t when talking specifically about a price, to r_t when talking specifically about a return and to x_t when it could be either. I introduce geometric Brownian motion (GBM) first, as it is the simplest and most common financial time series, the basis of the Black-Scholes model, etc., and the launching pad for a number of more advanced models.
I have developed the theory a little for GBM, so you get the feel of the thinking, but keep the theory to a minimum after that, so don't be too put off. ModelRisk provides facilities (Figure 12.7) to fit and/or model all of the time series described in the chapter. For financial models, data and forecasts can be either returns or prices, and the fitting algorithms can automatically include uncertainty about parameter estimates if required.

Figure 12.7 ModelRisk time series fit window.

12.2.1 Geometric Brownian motion

Consider the formula

x_{t+1} = x_t + Normal(μ, σ)     (12.1)

It states that the variable's value changes in one unit of time by an amount that is normally distributed with mean μ and standard deviation σ. The normal distribution is a good first choice for a lot of variables because we can think of the model as stating (from the central limit theorem) that the variable x is being affected additively by many independent random variables. We can iterate the equation to give us the relationship between x_t and x_{t+2}:

x_{t+2} = x_t + Normal(2μ, σ√2)

and generalise to any time interval T:

x_{t+T} = x_t + Normal(μT, σ√T)

This is a rather convenient equation because (a) we keep using normal distributions and (b) we can make a prediction between any time intervals we choose. The above equation deals with discrete units of time but can be written in a continuous-time form, where we consider any small time interval Δt:

Δx = Normal(μΔt, σ√Δt)

The SDE equivalent is

dx = μ dt + σ dz,     dz = ε√dt     (12.2)

where dz is the generalised Wiener process, called variously the "perturbation", "innovation" or "error", and ε is a Normal(0, 1) distribution. The notation might seem to be a rather unnecessary complication, but when you get used to SDEs they give us the most succinct description of a stochastic time series. A more general version of Equations (12.2) is

dx = f(x, t) dt + g(x, t) dz

where f and g are two functions.
It is really just shorthand for writing

x_{t+T} = x_t + ∫ₜ^{t+T} f(x, s) ds + ∫ₜ^{t+T} g(x, s) dz

Equation (12.1) can allow the variable x to take any real value, including negative values, so it would not be much good at modelling a stock price, interest rate or exchange rate, for example. However, it has the desirable property of being memoryless, i.e. to make a prediction of the value of x some time T from now, we only need to know the value of x now, not anything about the path it took to get to the present value. We can use Equations (12.2) to model the return of a stock:

dS/S = μ dt + σ dz     (12.3)

There is an identity known as Itô's lemma which states that, for a function F of a stochastic variable x following an Itô process of the form

dx(t) = a(x, t) dt + b(x, t) dz

we have

dF = (a ∂F/∂x + ∂F/∂t + ½ b² ∂²F/∂x²) dt + b ∂F/∂x dz

Choosing F(S) = log[S] together with Equation (12.3), where x = S, a(x, t) = μS and b(x, t) = σS:

d(log S) = (μ − σ²/2) dt + σ dz

Integrating over time T, we get the relationship between some initial value S_t and some later value S_{t+T}:

S_{t+T} = S_t exp[r_T],     r_T = Normal((μ − σ²/2)T, σ√T)     (12.6)

where r_T is called the log return¹ of the stock over the period T. The exp[·] term in Equation (12.6) means that S is always > 0, and we still retain the memoryless property, which corresponds to some financial thinking that a stock's value encompasses all information available about a stock at the time, so there should be no memory in the system (I'd argue against that, personally). The log return r of a stock S is (roughly) the fractional change in the stock's value. For stocks this is a more interesting value than the stock's actual price because it would be more profitable to own 10 shares in a $1 stock that increased by 6 % over a year than one share in a $10 stock that increased by 4 %, for example. Equation (12.6) is the GBM model: the "geometric" part comes because we are effectively multiplying lots of distributions together (adding them in log space).
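Equation (12.6) can be simulated directly. A minimal sketch, assuming numpy (the function name `gbm_draws` is ours): draw the log return r_T and exponentiate.

```python
import numpy as np

def gbm_draws(s_t, mu, sigma, T, n, rng):
    """n draws of S_(t+T) from Equation (12.6):
    S_(t+T) = S_t * exp(Normal((mu - sigma^2/2)*T, sigma*sqrt(T)))."""
    r_T = rng.normal((mu - sigma**2 / 2) * T, sigma * np.sqrt(T), size=n)
    return s_t * np.exp(r_T)

rng = np.random.default_rng(42)
# 100,000 one-year draws from S_t = 100 with mu = 0.05, sigma = 0.2:
# the sample mean should sit near S_t*exp(mu*T) = 100*exp(0.05)
draws = gbm_draws(100.0, mu=0.05, sigma=0.2, T=1.0, n=100_000, rng=rng)
```

Note that the exp[·] keeps every draw strictly positive, as the text requires of a price.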
From the definition of a lognormal random variable, if ln[S] is normally distributed, then S is lognormally distributed, so Equation (12.6) is modelling S_{t+T} as a lognormal random variable. From the equation of the mean of the lognormal distribution in Appendix III you can see that S_{t+T} has a mean given by

E[S_{t+T}] = S_t exp[μT]

hence μ is also called the exponential growth rate, and a variance given by

V[S_{t+T}] = S_t² exp[2μT](exp[σ²T] − 1)

GBM is very easy to reproduce in Excel, as shown by the model in Figure 12.8, even with different time increments. It is also very easy to estimate its parameters from a dataset when the observations have a constant time increment between them, as shown by the model in Figure 12.9.

Figure 12.8 GBM model with unequal time increments (Mu = 0.01, Sigma = 0.033). Formulae: returns (C7:C42) =VoseNormal((Mu-Sigma^2/2)*(B7-B6), Sigma*SQRT(B7-B6)); prices (D7:D42) =D6*EXP(C7).

¹ Not to be confused with the simple return R_t, which is the fractional increase of the variable over time t, and where r_t = ln[1 + R_t].
Figure 12.9 Estimating GBM model parameters with equal time increments. Formulae: log returns (D4:D105) =LN(C4)-LN(C3); their average and standard deviation =AVERAGE(D4:D105) and =STDEV(D4:D105) are rescaled by the time increment to give the estimates σ̂ = G6/SQRT(G2) and μ̂ = G5/G2 + G9^2/2.

Figure 12.10 Estimating GBM model parameters with unequal or missing time increments.

If there are missing observations or observations with different time increments, it is still possible to estimate the GBM parameters. In the model in Figure 12.10, the observations are transformed to Normal(0, 1) variables {z}:

z = (ln[S_t] − ln[S_{t−Δt}] − (μ − σ²/2)Δt)/(σ√Δt)

and then Excel's Solver is used to vary mu and sigma to make the {z} values have a mean of zero and a standard deviation of 1, by minimising the error sum in cell G8: =ABS(AVERAGE(D4:D187)) + ABS(STDEV(D4:D187)-1). An alternative method would be to regress (ln[S_t] − ln[S_{t−Δt}])/√Δt against √Δt with zero intercept: the slope estimates (μ − σ²/2) and the standard error estimates σ.
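The equal-increment estimation of Figure 12.9 translates into a few lines. A sketch, assuming numpy; `estimate_gbm` is our name for it, and the synthetic-data check at the end simply regenerates log returns with known parameters to confirm the estimator recovers them.

```python
import numpy as np

def estimate_gbm(prices, dt):
    """Estimate GBM parameters from equally spaced price observations,
    mirroring Figure 12.9: sigma from the sd of the log returns scaled
    by sqrt(dt), then mu from their mean scaled by dt plus sigma^2/2."""
    log_returns = np.diff(np.log(prices))
    sigma = log_returns.std(ddof=1) / np.sqrt(dt)
    mu = log_returns.mean() / dt + sigma**2 / 2
    return mu, sigma

# check on synthetic data generated with known parameters
rng = np.random.default_rng(7)
true_mu, true_sigma, dt = 0.1, 0.3, 0.01
log_r = rng.normal((true_mu - true_sigma**2 / 2) * dt,
                   true_sigma * np.sqrt(dt), size=50_000)
prices = 100.0 * np.exp(np.cumsum(np.insert(log_r, 0, 0.0)))
mu_hat, sigma_hat = estimate_gbm(prices, dt)
```

With 50 000 observations the estimates land close to the true μ = 0.1 and σ = 0.3; σ is pinned down much more tightly than μ, which is typical of GBM estimation.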
The spread of possible values in a GBM increases rapidly with time. For example, the plot in Figure 12.11 shows 50 possible forecasts with S₀ = 1, μ = 0.001 and σ = 0.02.

Figure 12.11 Plot of 50 possible scenarios with a GBM(μ = 0.001, σ = 0.02) model with a starting value of 1.

Mean reversion, discussed next, is a modification to GBM that progressively encourages the series to move back towards a mean the further it strays away. Jump diffusion, discussed after that, acknowledges that there may be shocks to the variable that result in large discrete jumps. ModelRisk has functions for fitting and projecting GBM, and GBM with mean reversion and/or jump diffusion. The functions work with both returns r and stock prices S.

12.2.2 GBM with mean reversion

The long-run time series properties of equity prices (among other variables) are, of course, of particular interest to financial analysts. There is a strong interest in determining whether stock prices can be characterised as random-walk or mean-reverting processes, because this has an important effect on an asset's value. A stock price follows a mean-reverting process if it has a tendency to return to some average value over time, which means that investors may be able to forecast future returns better by using information on past returns to determine the level of reversion to the long-term trend path. A random walk has no memory, which means that any large move in a stock price following a random-walk process is permanent and there is no tendency for the price level to return to a trend path over time. The random-walk property also implies that the volatility of the stock price can grow without bound in the long run: increased volatility lowers a stock's value, so a reduction in volatility (Figure 12.12) owing to mean reversion would increase a stock's value.
For a variable x following a Brownian motion random walk, we have the SDE of Equation (12.2):

dx = μ dt + σ dz

For mean reversion, this equation can be modified as follows:

dx = α(μ − x) dt + σ dz     (12.7)

Figure 12.12 Plots of sample GBM series with mean reversion for different values of alpha (0.0001, 0.1 and 0.4; μ = 0, σ = 0.001).

where α > 0 is the speed of reversion. The effect of the dt coefficient is to produce an expectation of moving downwards if x is currently above μ, and vice versa. Mean reversion models are produced in terms of S or r. In terms of S,

dS = α(μ − S) dt + σ dz

is known as the Ornstein-Uhlenbeck process, and was one of the first models used to describe short-term interest rates, where it is called the Vasicek model. The problem with the equation is that we can get negative stock prices; modelling in terms of r, however, keeps the stock price positive:

dr = α(μ − r) dt + σ dz

Integrating this last equation over time gives

r_{t+T} = Normal(μ + exp[−αT](r_t − μ), σ√((1 − exp[−2αT])/(2α)))

which is very easy to simulate. The plots in Figure 12.12 show some typical behaviour for r_t. Typical values of α would be in the range 0.1-0.3. A slight modification to Equation (12.7) is called the Cox-Ingersoll-Ross or CIR model (Cox, Ingersoll and Ross, 1985), again used for short-term interest rates, and has the useful property of not allowing negative values (so we can use it to model the variable S) because the volatility goes to zero as S approaches zero:

dS = α(μ − S) dt + σ√S dz

Integrating over time, we get

S_{t+T} = (σ²(1 − exp[−αT])/(4α)) Y

where Y is a non-central chi-square distribution with 4αμ/σ² degrees of freedom and non-centrality parameter (4α exp[−αT]/(σ²(1 − exp[−αT]))) S_t.
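The mean-reverting closed form described above is indeed easy to simulate. A sketch, assuming numpy; `ou_step` is an illustrative name for one exact draw of r at a horizon T ahead.

```python
import numpy as np

def ou_step(r_t, mu, alpha, sigma, T, rng):
    """Exact draw of r_(t+T) for the mean-reverting (Ornstein-Uhlenbeck)
    model: Normal with mean mu + exp(-alpha*T)*(r_t - mu) and
    sd sigma*sqrt((1 - exp(-2*alpha*T))/(2*alpha))."""
    mean = mu + np.exp(-alpha * T) * (r_t - mu)
    sd = sigma * np.sqrt((1 - np.exp(-2 * alpha * T)) / (2 * alpha))
    return rng.normal(mean, sd)

rng = np.random.default_rng(0)
# starting far above mu, the simulated path drifts back towards it
r = [0.5]
for _ in range(500):
    r.append(ou_step(r[-1], mu=0.0, alpha=0.2, sigma=0.01, T=1.0, rng=rng))
```

Because each step uses the exact transition distribution, the time step T can be as large as you like without discretisation error, unlike the Euler approach needed later for the combined models.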
This is a little harder to simulate, since you need the uncommon non-central chi-square distribution in your simulation software, but it has the attraction of being tractable (we can precisely determine the form of the distribution for the variable S_{t+T}), which makes it easier to determine its parameters using maximum likelihood methods.

12.2.3 GBM with jump diffusion

Jump diffusion refers to sudden shocks to the variable that occur randomly in time. The idea is to recognise that, beyond the usual background randomness of a time series variable, there will be events that have a much larger impact on the variable, e.g. a CEO resigns, a terrorist attack takes place, a drug gets FDA approval. The frequency of the jumps is usually modelled as a Poisson distribution with intensity λ, so that in some time increment T there will be Poisson(λT) jumps. The jump size for r is usually modelled as Normal(μ_J, σ_J) for mathematical convenience and ease of estimating the parameters. Adding jump diffusion to the discrete-time Equation (12.6) for one period, we get the following:

r₁ = Normal(μ − σ²/2, σ) + Σᵢ₌₁^{Poisson(λ)} Normal(μ_J, σ_J)

If we define k = Poisson(λ), this reduces to

r₁ = Normal(μ − σ²/2 + kμ_J, √(σ² + kσ_J²))

or for T periods we have

r_T = Normal((μ − σ²/2)T + kμ_J, √(σ²T + kσ_J²)),     k = Poisson(λT)     (12.9)

which is easy to model with Monte Carlo simulation and easy to estimate parameters for by matching moments, although one must be careful to ensure that the λ estimate isn't too high (e.g. > 0.2) because the Poisson jumps are meant to be rare events, not form part of each period's volatility. The plot in Figure 12.13 shows a typical jump diffusion model giving both r and S values and with jumps marked as circles.

12.2.4 GBM with jump diffusion and mean reversion

You can imagine that, if the return r has just received a large shock, there might well be a "correction" over time that brings it back to the expected return μ of the series. Combining mean reversion with jump diffusion will allow us to model these characteristics quite well and with few parameters.
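The T-period jump-diffusion return above really is a two-line Monte Carlo model: draw the Poisson number of jumps, then draw one normal conditional on it. A sketch, assuming numpy; `jump_diffusion_return` is our name for it.

```python
import numpy as np

def jump_diffusion_return(mu, sigma, mu_j, sigma_j, lam, T, rng):
    """One draw of the T-period log return under GBM with jump
    diffusion: k = Poisson(lam*T) jumps, each contributing
    Normal(mu_j, sigma_j) to the return."""
    k = rng.poisson(lam * T)
    mean = (mu - sigma**2 / 2) * T + k * mu_j
    sd = np.sqrt(sigma**2 * T + k * sigma_j**2)
    return rng.normal(mean, sd)

rng = np.random.default_rng(3)
r = np.array([jump_diffusion_return(0.05, 0.2, -0.1, 0.15, 0.1, 1.0, rng)
              for _ in range(50_000)])
# moment check: E[r] = (mu - sigma^2/2)*T + lam*T*mu_j = 0.03 - 0.01 = 0.02
```

The moment-matching comment shows the hook for parameter estimation: the sample mean and variance of observed returns pin down combinations of (μ, σ, μ_J, σ_J, λ).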
However, the additive model of Equation (12.9) for mean and variance no longer applies, particularly when the reversion speed is large, because one needs to model when within the period the jump took place: if it was at the beginning of the period, it may well have already strongly reverted before one observes the value at the period's end. The most practical solution, called Euler's method, is to split up a time period into many small increments. The number of increments will be sufficient when the model produces the same output for decision purposes as any greater number of increments.

12.3 Autoregressive Models

An ever-increasing number of autoregressive models are being developed in the financial area. The ones of more general interest discussed here are AR, MA, ARMA, ARCH and GARCH, and it is more standard to apply the models to the return r rather than to the stock price S. I also give the equations for EGARCH and APARCH. Let me just repeat my earlier warning that, before being convinced that some subtle variation of the model gives a genuine advantage, try generating a few samples for simpler models that you have fit to the data and see whether they can create scenarios of a similar pattern. ModelRisk offers functions that fit each of these series to data and produce forecasts. The data can be live linked to historical values, which is very convenient for keeping your model automatically up to date.

Figure 12.13 Sample of a GBM with jump diffusion, showing both r and S values, with jumps marked as circles (σ = 0.01, μ_J = 0.04, σ_J = 0.2, λ = 0.02).

12.3.1 AR

The equation for an autoregressive process of order p, or AR(p), is

r_t = μ + Σᵢ₌₁ᵖ aᵢ(r_{t−i} − μ) + ε_t

where the ε_t are independent Normal(0, σ) random variables. Some constraints on the parameters {aᵢ} are needed if one wants to keep the model stationary (meaning the marginal distribution of r_t is the same for all t), e.g. for an AR(1), |a₁| < 1.
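Euler's method for the combined mean-reversion-plus-jump model can be sketched as follows. This is an illustration under our own parameterisation, assuming numpy - `mr_jump_euler` and its arguments are not ModelRisk's implementation - but it shows the key point: the period T is split into n_inc increments, so a jump that lands early in the period gets partly reverted by the period's end.

```python
import numpy as np

def mr_jump_euler(r0, mu, alpha, sigma, mu_j, sigma_j, lam, T, n_inc, rng):
    """Euler's method sketch for mean reversion combined with jump
    diffusion: step through T in n_inc small increments, applying the
    reversion drift, the diffusion term and any Poisson jumps."""
    dt = T / n_inc
    r = r0
    for _ in range(n_inc):
        k = rng.poisson(lam * dt)                      # jumps in this slice
        jump = rng.normal(k * mu_j, sigma_j * np.sqrt(k)) if k > 0 else 0.0
        r += alpha * (mu - r) * dt + rng.normal(0.0, sigma * np.sqrt(dt)) + jump
    return r

rng = np.random.default_rng(13)
# with strong reversion (alpha = 5) and a long horizon, a start far
# from mu is pulled essentially all the way back by the period's end
r_end = mr_jump_euler(0.5, mu=0.0, alpha=5.0, sigma=0.001,
                      mu_j=0.0, sigma_j=0.001, lam=0.0,
                      T=10.0, n_inc=1000, rng=rng)
```

In practice one would rerun the model with ever more increments until, as the text says, the output stops changing for decision purposes.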
In most situations, an AR(1) or AR(2) is sufficiently elaborate, i.e.

r_t = μ + a₁(r_{t−1} − μ) + ε_t

You can see that this is just a regression model where r_t is the dependent variable and the r_{t−i} are the explanatory variables. It is usual, though not essential, that aᵢ > aᵢ₊₁, i.e. that r_t is explained more by more recent values (t − 1, t − 2, ...) than by older values (t − 10, t − 11, ...).

12.3.2 MA

The equation for a moving-average process of order q, or MA(q), is

r_t = μ + Σᵢ₌₁^q bᵢ ε_{t−i} + ε_t

This says that the variable r_t is normally distributed about a mean equal to

μ + Σᵢ₌₁^q bᵢ ε_{t−i}

where the ε_t are independent Normal(0, σ) random variables again. In other words, the mean of r_t is the mean of the process as a whole, μ, plus some weighting of the variation of the q previous terms from the mean. Similarly to AR models, it is usual that bᵢ > bᵢ₊₁, i.e. that r_t is explained more by more recent terms (t − 1, t − 2, ...) than by older terms (t − 10, t − 11, ...).

12.3.3 ARMA

We can put the AR(p) and MA(q) processes together to create an autoregressive moving-average model, the ARMA(p, q) process with mean μ, described by the following equation:

r_t = μ + Σᵢ₌₁ᵖ aᵢ(r_{t−i} − μ) + Σⱼ₌₁^q bⱼ ε_{t−j} + ε_t

In practice, the ARMA(1, 1) is usually sufficiently complex, so the equation simplifies to

r_t = μ + a(r_{t−1} − μ) + bε_{t−1} + ε_t

12.3.4 ARCH

ARCH models were originally developed to account for fat tails by allowing clustering of periods of volatility (heteroscedastic, or heteroskedastic, means "having different variances"). One of the assumptions in regression models that were previously used for analysis of high-frequency financial data was that the error terms have a constant variance. Engle (1982), who won the 2003 Nobel Memorial Prize for Economics, introduced the ARCH model, applying it to quarterly UK inflation data. ARCH was later generalised to GARCH by Bollerslev (1986), which has proven more successful in fitting to financial data.
Let r_t denote the returns or return residuals and assume that

r_t = μ + σ_t z_t

where the z_t are independent, Normal(0, 1) distributed, and σ_t is modelled by

σ_t² = ω + Σᵢ₌₁^q aᵢ(r_{t−i} − μ)²

where ω > 0, aᵢ ≥ 0, i = 1, ..., q, and at least one aᵢ > 0. Then r_t is said to follow an autoregressive conditional heteroskedastic, ARCH(q), process with mean μ. It models the variance of the current error term as a function of the previous periods' squared deviations (r_{t−i} − μ)². Since each aᵢ ≥ 0, it has the effect of grouping low (or high) volatilities together. If an autoregressive moving-average process (ARMA process) is assumed for the variance, then r_t is said to be a generalised autoregressive conditional heteroskedastic GARCH(p, q) process with mean μ:

σ_t² = ω + Σᵢ₌₁^q aᵢ(r_{t−i} − μ)² + Σⱼ₌₁ᵖ bⱼ σ_{t−j}²

where p is the order of the GARCH terms and q is the order of the ARCH terms, ω > 0, aᵢ ≥ 0, i = 1, ..., q; bⱼ ≥ 0, j = 1, ..., p, and at least one aᵢ or bⱼ > 0. In practice, the model most generally used is a GARCH(1, 1):

σ_t² = ω + a(r_{t−1} − μ)² + bσ_{t−1}²

12.3.5 APARCH

The asymmetric power autoregressive conditional heteroskedasticity, APARCH(p, q), model was introduced by Ding, Granger and Engle (1993) and is defined as follows:

σ_t^δ = ω + Σᵢ₌₁^q aᵢ(|r_{t−i} − μ| − γᵢ(r_{t−i} − μ))^δ + Σⱼ₌₁ᵖ bⱼ σ_{t−j}^δ

where −1 < γᵢ < 1 and at least one aᵢ or bⱼ > 0. δ plays the role of a Box-Cox transformation of the conditional standard deviation σ_t, while the γᵢ reflect the so-called leverage effect. APARCH has proved very promising and is now quite widespread because it nests several other models as special cases, e.g. ARCH (δ = 2, γᵢ = 0, bⱼ = 0), GARCH (δ = 2, γᵢ = 0), TS-GARCH (δ = 1, γᵢ = 0), GJR-GARCH (δ = 2), TARCH (δ = 1) and NARCH (bⱼ = 0, γᵢ = 0). In practice, the model most generally used is an APARCH(1, 1):

σ_t^δ = ω + a(|r_{t−1} − μ| − γ(r_{t−1} − μ))^δ + bσ_{t−1}^δ

12.3.6 EGARCH

The exponential general autoregressive conditional heteroskedastic, EGARCH(p, q), model was another form of GARCH model, with the purpose of allowing negative values in the linear error variance equation.
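The GARCH(1, 1) recursion is easy to simulate. A sketch, assuming numpy; `garch11_path` is our illustrative name, and seeding σ² at the long-run variance ω/(1 − a − b) is a common convenience, not part of the model's definition.

```python
import numpy as np

def garch11_path(mu, omega, a, b, n, rng):
    """Simulate n returns from a GARCH(1,1):
    sigma_t^2 = omega + a*(r_(t-1) - mu)^2 + b*sigma_(t-1)^2,
    r_t = mu + sigma_t * z_t with z_t ~ Normal(0,1).
    sigma^2 is seeded at the long-run variance omega/(1 - a - b)."""
    sigma2 = omega / (1 - a - b)
    r = np.empty(n)
    eps = 0.0                         # previous deviation r_(t-1) - mu
    for t in range(n):
        sigma2 = omega + a * eps**2 + b * sigma2
        eps = np.sqrt(sigma2) * rng.normal()
        r[t] = mu + eps
    return r

rng = np.random.default_rng(11)
r = garch11_path(mu=0.0, omega=1e-5, a=0.08, b=0.9, n=5000, rng=rng)
```

With a + b = 0.98 the volatility is highly persistent, so the simulated path shows the clusters of calm and turbulent periods that motivated the ARCH family in the first place.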
The GARCH model imposes non-negativity constraints on the parameters aᵢ and bⱼ, while there are no such restrictions on these parameters in the EGARCH model. In the EGARCH(p, q) model, the conditional variance σ_t² is formulated as an asymmetric function of the lagged disturbances:

ln σ_t² = ω + Σᵢ₌₁^q aᵢ g(z_{t−i}) + Σⱼ₌₁ᵖ bⱼ ln σ_{t−j}²

where

g(z) = θz + λ(|z| − E[|z|])

and E[|z|] = √(2/π) when z is a standard normal variable. Again, in practice the model most generally used has p = q = 1, i.e. is an EGARCH(1, 1):

ln σ_t² = ω + a g(z_{t−1}) + b ln σ_{t−1}²

12.4 Markov Chain Models

Markov chains² comprise a number of individuals who begin in certain allowed states of the system and who may or may not randomly change (transition) into other allowed states over time. A Markov chain has no memory, meaning that the joint distribution of how many individuals will be in each allowed state depends only on how many were in each state the moment before, not on the pathways that led there. This lack of memory is known as the Markov property. Markov chains come in two flavours: continuous time and discrete time. We will look at a discrete-time process first because it is the easiest to model.

12.4.1 Discrete-time Markov chain

In a discrete-time Markov process the individuals can move between states only at set (usually equally spaced) intervals of time. Consider a set of 100 individuals in the following four marital states: 43 are single; 29 are married; 11 are separated; 17 are divorced. We write this as a vector:

{Single, Married, Separated, Divorced} = {43, 29, 11, 17}

Given sufficient time (let's say a year) there is a reasonable probability that the individuals can change state. We can construct a matrix of the transition probabilities as follows:

                           Is now:
Was:          Single    Married    Separated    Divorced
Single         0.85       0.12        0.02         0.01
Married        0          0.88        0.08         0.04
Separated      0          0.13        0.45         0.42
Divorced       0          0.09        0.02         0.89

We read this matrix row by row.

² Named after Andrey Markov (1856-1922), a Russian mathematician.
For example, it says (first row) that a single person has an 85 % chance of still being single 1 year later, a 12 % chance of being married, a 2 % chance of being separated and a 1 % chance of being divorced. Since these are the only allowed states (e.g. we haven't included "engaged", so that must be rolled up into "single"), the probabilities must sum to 100 %. Of course, we'd have to decide what a death would mean: the transition matrix could either be defined such that if a person dies they retain their marital status for this model, or we could make this a transition matrix conditional on them surviving a year. Notice that the "single" column is all 0s, except the single/single cell, because, once one is married, the only states allowed after that are married, separated and divorced. Also note that one can go directly from single to separated or divorced, which implies that during that year the individual had passed through the married state. Markov chain transition matrices describe the probability that one is in a state at some precise time, given some state at a previous time, and are not concerned with how one got there, i.e. all the other states one might have passed through. We now have the two elements of the model, the initial state vector and the transition matrix, to estimate how many individuals will be in each state after a year. Let's go through an example calculation to estimate how many people will be married in one year:

- for the single people, Binomial(43, 0.12) will be married;
- for the married people, Binomial(29, 0.88) will be married;
- for the separated people, Binomial(11, 0.13) will be married;
- for the divorced people, Binomial(17, 0.09) will be married.

Add together these four binomial distributions and we get an estimate of the number of people from our group who will be married next year.
However, the above calculation does not work when we want to look at the joint distribution of how many people will be in each state: clearly we cannot add four sets of four binomial distributions because the total must sum to 100 people. Instead, we need to use the multinomial distribution. The number of people who were single but are now {Single, Married, Separated, Divorced} equals Multinomial(43, {0.85, 0.12, 0.02, 0.01}). Applying the multinomial distribution for the other three initial states, we can take a random sample from each multinomial and add up how many are in each state, as shown in the model in Figure 12.14.

Figure 12.14 Multinomial method of performing a Markov chain model.

Let's now look at extending the model to predict further ahead in time, say 5 years. If we can assume that the probability transition matrix remains valid for that period, and that nobody in our group dies, we could repeat the above exercise 5 times - calculating in each year how many individuals are in each state and using that as the input into the next year, etc. However, there is a more efficient method. The probability that a person starting in state i is in state j after 2 years is determined by looking at the probability of the person going from state i to each state after 1 year, and then going from that state to state j in the second year. So, for example, the probability of changing from single to divorced after 2 years is

P(Single to Single) * P(Single to Divorced)
+ P(Single to Married) * P(Married to Divorced)
+ P(Single to Separated) * P(Separated to Divorced)
+ P(Single to Divorced) * P(Divorced to Divorced)

Notice how we have multiplied the elements in the first row (single) by the elements in the last column (divorced) and added them. This is the operation performed in matrix multiplication.
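The multinomial method of Figure 12.14 can be sketched in a few lines, assuming numpy (the function name `markov_step` and the matrix values, taken from the marital-status example, are for illustration): draw one multinomial per starting state and sum the results state by state.

```python
import numpy as np

def markov_step(counts, P, rng):
    """One period of the multinomial method: the individuals in each
    starting state are spread over the destination states with a
    multinomial draw, and the draws are summed column-wise."""
    return sum(rng.multinomial(n, p) for n, p in zip(counts, P))

# 1-year transition matrix; rows: Single, Married, Separated, Divorced
P = np.array([[0.85, 0.12, 0.02, 0.01],
              [0.00, 0.88, 0.08, 0.04],
              [0.00, 0.13, 0.45, 0.42],
              [0.00, 0.09, 0.02, 0.89]])
rng = np.random.default_rng(2024)
after_one_year = markov_step([43, 29, 11, 17], P, rng)
```

Unlike adding independent binomials, this construction guarantees that the simulated state counts always sum to the 100 people we started with.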
We can therefore determine the probability transition matrix over the 2-year period by simply multiplying the 1-year transition matrix by itself (using Excel's MMULT function), as in the model in Figure 12.15. When one wants to forecast T periods in advance, where T is large, performing the matrix multiplication (T − 1) times can become rather tedious, but there is some mathematics based on transforming the matrix that allows one to determine directly the transition matrix over any number of periods. ModelRisk provides some efficient means to do this: the VoseMarkovMatrix function calculates the transition matrix for any time length, and the VoseMarkovSample function goes the next step, simulating how many individuals are in each final state after some period. In this next example (Figure 12.16) we calculate the transition matrix and simulate how many individuals will be in each state after 25 years. Notice how after 25 years the probability of being married is about 45 %, irrespective of what state one started in; a similar situation occurs for separated and divorced. This stabilising property is very common and, as a matter of interest, is the basis of a statistical technique discussed briefly elsewhere in this book called Markov chain Monte Carlo. Of course, the above calculation does assume that the transition matrix for 1 year is valid to apply over such a long period (a big assumption in this case).

Figure 12.15 Multinomial method of performing a Markov chain model with time an integer > 1 unit.
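The repeated-MMULT idea is a matrix power, which one call computes directly. A sketch, assuming numpy, using the 1-year marital-status matrix from the text; `np.linalg.matrix_power` plays the role that VoseMarkovMatrix plays for integer periods.

```python
import numpy as np

# 1-year transition matrix; rows: Single, Married, Separated, Divorced
P = np.array([[0.85, 0.12, 0.02, 0.01],
              [0.00, 0.88, 0.08, 0.04],
              [0.00, 0.13, 0.45, 0.42],
              [0.00, 0.09, 0.02, 0.89]])

# raising P to the 25th power gives the 25-year transition matrix,
# equivalent to multiplying P by itself 24 times with MMULT
P25 = np.linalg.matrix_power(P, 25)

# each row still sums to 1, and the rows have converged towards a
# common long-run distribution - the stabilising property in the text
```

Inspecting `P25` shows the "married" column hovering around 0.45 whatever the starting state, matching the observation in the text.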
Figure 12.16 ModelRisk method of performing a Markov chain model with time an integer > 1 unit. The array formulae used are {=VoseMarkovMatrix(F4:I7, B11)} for the transition matrix and {=VoseMarkovSample(B4:B7, F4:I7, B11)} for the number in each final state.

12.4.2 Continuous-time Markov chain

For a continuous-time Markov process we need to be able to produce the transition matrix for any positive time increment, not just an integer multiple of the time that applies to the base transition matrix. So, for example, we might have the above marital status transition matrix for a single year but wish to know what the matrix is for half a year, or 2.5 years. There is a mathematical technique for finding the required matrix, based on converting the multinomial probabilities in the matrix into Poisson intensities that match the required probabilities. The mathematical manipulation is somewhat complex, particularly when one has to wrestle with numerical stability. The ModelRisk functions VoseMarkovMatrix and VoseMarkovSample detect when you are using non-integer time and automatically convert to the alternative mathematics. So, for example, we can have the model described above for a half-year.

12.5 Birth and Death Models

There are two strongly related probabilistic time series models called the Yule (or pure birth) and pure death models. We have certainly found them very useful in modelling numbers in a bacterial population, but they could be helpful in modelling other variables where numbers of individuals increase or decrease according to their population size.
12.5.1 Yule growth model

This is a pure birth growth model and is a stochastic analogue to the deterministic exponential growth models one often sees in, for example, microbial risk analysis. In exponential growth models, the rate of growth of a population of n individuals is proportional to the size of the population:

dn/dt = βn

where β is the mean rate of growth per unit time t. This gives the number of individuals n_t in the population after time t as

n_t = n0 exp(βt)

where n0 is the initial population size. The model is limited because it takes no account of any randomness in the growth. It also takes no account of the discrete nature of the population, which is important at low values of n. Moreover, there are no defensible statistical tests to apply to fit an exponential growth curve to observations (regression is often used as a surrogate) because an exponential growth model is not probabilistic, so no probabilistic (i.e. statistical) interpretation of data is possible. The Yule model starts with the premise that individuals have offspring on their own (e.g. by division), that they procreate independently, that procreating is a Poisson process in time and that all individuals in the population are the same. The expected number of offspring from an individual per unit time (over some infinitesimal time increment) is defined as β. This leads to the result that, after time t, an individual will have Geometric(exp(−βt)) offspring, giving a new total population of 1 + Geometric(exp(−βt)). Thus, if we start with n0 individuals, then by some later time t we will have

n_t = n0 + NegBin(n0, exp(−βt))

from the relationship

NegBin(s, p) = sum from i = 1 to s of Geometric(p)

with mean n̄_t = n0 exp(βt), corresponding to the exponential growth model. A possible problem in implementing this type of model is that n0 and n_t can be very large, and simulation programs tend to produce errors for discrete distributions like the negative binomial for large input parameters and output values.
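The negative binomial result makes the Yule model easy to simulate directly. A Python sketch with hypothetical parameter values (numpy's negative_binomial counts failures before n0 successes, which matches the parametrisation above):

```python
import numpy as np

rng = np.random.default_rng(7)

# Yule (pure birth) process: starting from n0 individuals, each
# reproducing as a Poisson process of rate beta, the population at
# time t is  n_t = n0 + NegBin(n0, exp(-beta * t)).
n0, beta, t = 50, 0.3, 5.0          # hypothetical values
p = np.exp(-beta * t)
n_t = n0 + rng.negative_binomial(n0, p, size=20_000)

print(n_t.mean())        # close to n0 * exp(beta * t), about 224.1
print(n_t.min() >= n0)   # a pure birth process never shrinks: True
```

The simulated mean matches the deterministic exponential growth curve, while the spread of `n_t` supplies the randomness the deterministic model lacks.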
ModelRisk has two time series functions to model the Yule process that work for all input values: one generates values for n_t, and VoseTimeSeriesYule10(Log10(n0), LogIncrease, t) generates values for Log10(n_t), as one often finds it more convenient to deal with logs for exponentially growing populations because of the large numbers that can be generated. LogIncrease is the number of logs (in base 10) by which one expects the population to increase per time unit. The parameters β and LogIncrease are related by

LogIncrease = Log10[exp(β)]

12.5.2 Death model

The pure death model is a stochastic analogue to the deterministic exponential death models one often sees in, for example, microbial risk analysis. Individuals are assumed to die independently and randomly in time, following a Poisson process. Thus, the time until death can be described by an exponential distribution, which has a cdf

F(t) = 1 − exp(−λt)

where λ is the expected instantaneous death rate of an individual. The probability that an individual is still alive at time t is therefore

P(alive at time t) = exp(−λt)

Thus, if n0 is the initial population, the number n_t surviving until time t follows a binomial distribution:

n_t = Binomial(n0, exp(−λt))

which has a mean of n0 exp(−λt), i.e. the same as the exponential death model. The cdf for the time until extinction t_E of the population is given by

F(t_E) = (1 − exp(−λt_E))^n0

The binomial death model offered here is an improvement over the exponential death model for several reasons: The exponential death model takes no account of any randomness in the deaths, so cannot interpret variations from an exponential line fit. The exponential death model takes no account of the discrete nature of the population, which is important at low values of n. There are no defensible statistical tests to apply to fit an exponential death curve to observations (regression is often used as a surrogate) because an exponential model is not probabilistic, so there can be no probabilistic interpretation of data.
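The binomial death model is equally direct to simulate. A sketch with hypothetical values, including the extinction-time cdf just given:

```python
import numpy as np

rng = np.random.default_rng(11)

# Binomial death model: each of n0 individuals survives to time t
# independently with probability exp(-lam * t).
n0, lam, t = 1000, 0.5, 3.0            # hypothetical values
p_survive = np.exp(-lam * t)
n_t = rng.binomial(n0, p_survive, size=20_000)

print(n_t.mean())      # close to n0 * exp(-lam * t), about 223.1

# cdf of the extinction time: all n0 individuals must have died by t.
p_extinct_by_t = (1.0 - p_survive) ** n0
print(p_extinct_by_t)  # negligible while n_t is still in the hundreds
```

Because the model is a genuine probability distribution, a likelihood for observed counts follows immediately from the binomial pmf, which is the point made in the next paragraph.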
A likelihood function is possible, however, for the death model described here. A possible difficulty in implementing this death model is that n0 and n_t can be very large, and simulation programs tend to produce errors for discrete distributions like the binomial for large input parameters and output values. ModelRisk has two time series functions to model the death model that eliminate this problem: one generates values for n_t, and VoseTimeSeriesDeath10(Log10(n0), LogDecrease, t) generates values for Log10(n_t), as one often finds it more convenient to deal with logs for bacterial populations (for example) because of the large numbers that can be involved. The LogDecrease parameter is the number of logs (in base 10) that one expects the population to decrease by per time unit. The parameters λ and LogDecrease are related by

LogDecrease = λ Log10(e)

12.6 Time Series Projection of Events Occurring Randomly in Time

Many things we are concerned about occur randomly in time: people arriving at a queue (customers, emergency patients, telephone calls into a centre, etc.), accidents, natural disasters, shocks to a market, terrorist attacks, particles passing through a bubble chamber (a physics experiment), etc. Naturally, we may want to model these over time, perhaps to figure out whether we will have enough vaccine stock, storage space, etc. The natural contender for modelling random events is the Poisson distribution (see Section 8.3), which returns the number of random events occurring in time t when λ events are expected per unit time within t. Often we might think that the expected number of events may increase or decrease over time, so we make λ a function of t, as shown by the model in Figure 12.17. A variation of this model is to take account of seasonality by multiplying the expected number of events by seasonal indices (which should average to 1).
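The trend-plus-seasonality idea can be sketched as follows. The gradient, intercept and seasonal indices below are hypothetical, standing in for the spreadsheet inputs of Figure 12.17:

```python
import numpy as np

rng = np.random.default_rng(3)

# Expected events per period as a linear trend (hypothetical values),
# multiplied by monthly seasonal indices that average to 1.
periods = np.arange(1, 37)                   # 36 months
trend = 2.0 + 0.05 * periods                 # gradient * t + intercept
season = np.array([1.3, 1.2, 1.1, 1.0, 0.9, 0.8,
                   0.7, 0.8, 0.9, 1.0, 1.1, 1.2])
assert np.isclose(season.mean(), 1.0)        # indices must average to 1

lam = trend * season[(periods - 1) % 12]     # expected events each month
counts = rng.poisson(lam)                    # one simulated path
print(counts)
```

Plotting `lam` alongside `counts`, as the figures in this section do, makes it easy to see whether the linear trend is drifting towards impossible (negative) expected values.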
In Section 8.3.7 I have discussed the Pólya and Delaporte distributions, which are counting distributions similar to the Poisson but which allow λ to be a random variable too. The Pólya is particularly helpful because, with one extra parameter, h, we can add some volatility to the expected number of events, as shown by the model in Figure 12.18. Notice the much greater peaks in the plot for this model compared with that of the previous model in Figure 12.17. Mixing a Poisson with a gamma distribution to create the Pólya is a helpful tool because we can get the likelihood function directly from the probability mass function (pmf) of the Pólya and therefore fit to historical data. If the MLE value for h is very small, then the Poisson model will be as good a fit and has one less parameter to estimate, so the Pólya model is a useful first test.

[Figure 12.18 A Pólya time series with expected intensity λ as a linear function of time and a coefficient of variation of λ = 0.3. Formulae table: C6:C55 =Gradient*B6+Intercept; D6:D55 =VosePoisson(C6).]

The linear equation used in the above two models for giving an approximate description of the relationship of the expected number with time is often quite convenient, but one needs to be careful because a negative slope will ultimately produce a negative expected value, which is clearly nonsensical (which is why it is good practice to plot the expected value together with the modelled counts, as shown in the two figures above). The more correct Poisson regression model considers the log of the expected value of the number of counts to be a linear function of time, i.e.

ln(λ_t) = β0 + β1·t + ln(e)    (12.10)

where β0 and β1 are regression parameters.
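The log-linear relationship just described can be fitted by maximum likelihood. A Python sketch with simulated (hypothetical) annual counts and constant exposure, so the ln(e) term is absorbed into the intercept; the book's models do this in Excel with Solver instead:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(9)

# Simulated annual counts with a log-linear trend (hypothetical data):
# ln(lambda_t) = b0 + b1 * t
t = np.arange(-20, 1)                  # 21 years of history, year <= 0
true_b0, true_b1 = 3.0, 0.04
y = rng.poisson(np.exp(true_b0 + true_b1 * t))

def neg_log_lik(params):
    b0, b1 = params
    lam = np.exp(b0 + b1 * t)          # lambda > 0 automatically
    return -np.sum(y * np.log(lam) - lam - gammaln(y + 1))

fit = minimize(neg_log_lik, x0=[0.0, 0.0], method="Nelder-Mead")
b0_hat, b1_hat = fit.x
print(b0_hat, b1_hat)                  # near the true 3.0 and 0.04

# Point forecasts for the next 3 years (wrap in rng.poisson to get
# full predictive distributions).
forecast = np.exp(b0_hat + b1_hat * np.arange(1, 4))
print(forecast.round(1))
```

One advantage of the log link is visible in the code: the λ > 0 constraint that Figure 12.19 imposes on Solver is satisfied automatically.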
The ln(e) term in Equation (12.10) is included for data where the amount of exposure e varies between observations; for example, if we were analysing data to determine the annual increase in burglaries across a country where our data are given for different parts of the country with different population levels, or where the population size is changing significantly (so the exposure measure e would be person-years). Where e is constant, we can simplify Equation (12.10) to

ln(λ_t) = β0 + β1·t    (12.11)

The model in Figure 12.19 fits a Pólya regression to data (year <= 0) and projects out the next 3 years of annual sports accidents, where the population is considered constant so we can use Equation (12.11).

[Figure 12.19 A Pólya regression model fitted to data and projected 3 years into the future. The LogL variable is optimised using Excel's Solver with the constraint that λ > 0.]

ModelRisk offers Poisson and Pólya regression fits for multiple explanatory variables and variable exposure levels.

12.7 Time Series Models with Leading Indicators

Leading indicators are variables whose movement has some relationship to the movement of the variable you are actually interested in. The leading indicator may move in the same or opposite direction as the variable of interest, as shown in Figure 12.20. In order to evaluate the leading indicator relationship, you will have to determine: the causal relationship; and the quantitative nature of the relationship. The causal relationship is critical. It gives a plausible argument for why the movement in the leading indicator should in some way presage the movement of the variable of interest.
It will be very easy to find apparent leading indicator patterns if you try out enough variables, but, if you can't logically argue why there should be any relationship (preferably make the argument before you do the analysis on the potential indicator variable: it's much easier to convince yourself of a causal argument when you've seen a temptingly strong statistical correlation), it's likely that the observed relationship is spurious. The quantitative nature of the relationship should come from a mixture of analysis of historic data and practical thinking. Some leading indicators will have a cumulative effect over time (e.g. rainfall as an indicator of the water available for use at a hydroelectric plant) and so need to be summed or averaged. Other leading indicators may have a shorter response time to the same, perhaps unmeasurable, causal variable as the variable in which you are interested (if the causal variable were measurable, you would use that as the leading indicator instead), and so your variable may exhibit the same pattern with a time lag. The analysis of historic data to determine the leading indicator relationship will depend largely on the type of causal relationship. Linear regression is one possible method, where one regresses historic values of the variable of interest against the lead indicator values, with either a specific lag time, if that can be causally deduced, or with a varying lag time to produce the greatest r-squared fit if one is estimating the lag time. Note that any forecast can only be made a distance into the future equal to the lag time: otherwise one needs to make a forecast of the lead indicator too. The model in Figure 12.21 provides a fairly simple example in which the historic data (used to create the left pane of Figure 12.20 below) of the variable of interest Y are compared visually with lead indicator X data for different lag periods. The closest pattern match occurs for a lag of 11 periods (Figure 12.22).
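The varying-lag regression idea can be sketched numerically. The data below are made up for illustration (a random-walk indicator and a lagged, noisy response), not the book's series; the scan regresses Y(t) on X(t − lag) for a range of lags and keeps the one with the greatest r-squared:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic illustration: Y follows the leading indicator X with an
# 11-period lag (hypothetical slope/intercept/noise values).
n = 150
x = np.cumsum(rng.normal(0.5, 2.0, n)) + 100        # leading indicator
y = np.full(n, np.nan)
y[11:] = 0.045 * x[:-11] - 0.018 + rng.normal(0, 0.05, n - 11)

# Scan candidate lags; keep the one with the best r-squared.
best_lag, best_r2 = None, -np.inf
for lag in range(1, 21):
    xx, yy = x[:n - lag], y[lag:]
    ok = ~np.isnan(yy)
    r = np.corrcoef(xx[ok], yy[ok])[0, 1]
    if r ** 2 > best_r2:
        best_lag, best_r2 = lag, r ** 2

ok = ~np.isnan(y[best_lag:])
slope, intercept = np.polyfit(x[:n - best_lag][ok], y[best_lag:][ok], 1)
print(best_lag, round(best_r2, 4))   # the scan should recover lag 11
print(round(slope, 4))               # near the true 0.045
```

As the text warns, such a scan will happily "find" a best lag in pure noise too, so the causal argument for the candidate indicator should come first.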
[Figure 12.20 Lead indicator patterns: left, lead indicator variable is positively correlated with the variable of interest; right, negatively correlated.]

[Figure 12.21 Leading indicator fit and projection model, overlaying the variable of interest Y with Y offset by 11 periods. Formulae table: =SLOPE($E$5:$E$83,$C$5:$C$83), =INTERCEPT($E$5:$E$83,$C$5:$C$83), =STEYX($E$5:$E$83,$C$5:$C$83); outputs: R-squared 0.971492, slope (m) 0.045557, intercept (c) −0.017818, SteYX (syx) 0.163501.]

[Figure 12.22 Overlay of variable of interest and lead indicator variable lagged by 10, 11 and 12 periods, showing the closest pattern correlation at 11 periods.]

A scatter plot of Y(t) against X(t − 11) shows a strong linear relationship, so a least-squares regression seems appropriate (Figure 12.23). The regression parameters are: slope = 0.04555, intercept = −0.01782, SteYX = 0.1635. (We could use the linear regression parametric bootstrap to give us uncertainty about these parameters if we wished.) The resultant model is then

Y(t) = 0.04555·X(t − 11) − 0.01782 + Normal(0, 0.1635)

which we can use to predict {Y(1) ... Y(11)}.

[Figure 12.23 Scatter plot of variable of interest observations against lead indicator observations lagged by 11 periods.]

12.8 Comparing Forecasting Fits for Different Models

There are three components to evaluating the relative merits of the various forecasting models fitted to data.
The first is to take an honest look at the data you are going to fit: do they come from a world that you think is similar to the one you are forecasting into? If not (e.g. there are fewer companies in the market now, there are stricter controls, the product for which you are forecasting sales is getting rather old and uninteresting, etc.), then consider some of the forecasting techniques I describe in Chapter 17, which are based more on intuition than mathematics and statistics. The second step is also common sense: ask yourself whether the assumptions behind the model could actually be true and why that might be. Perhaps you can investigate whether this type of model has been used successfully for similar variables (e.g. a different exchange rate, interest rate, share price, water level or hurricane frequency than the one you are modelling). In fact, I recommend that you use this as a first step in selecting which models might be appropriate for the variable you are modelling. Then you will need to evaluate statistically the degree to which each model fits the data and to compensate for the fact that a model with more parameters will have greater flexibility to fit the data but may not mean anything. Statistical techniques for model selection and comparison have improved, and the best methods now use "information criteria", of which there are three in common usage, described at the end of Section 10.3.4. The main advantage over the older log-likelihood ratio method is that the models don't have to be nested, meaning that each tested model does not need to be a simplified (some parameters removed) version of a more complex model. For ARCH, GARCH, APARCH and EGARCH you should subtract n(1 + ln[2π]), where n is the number of data points, from each of the criteria. If you fit a number of models to your data, try not to pick automatically the model with the best statistical result, particularly if the top two or three are close.
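The parameter-count penalty the criteria apply can be made concrete. The sketch below uses the standard smaller-is-better forms of AIC, BIC and AICc with hypothetical log-likelihoods (the book's Section 10.3.4 may normalise the criteria differently, so treat the exact formulas as an assumption):

```python
import numpy as np

# Standard smaller-is-better information criteria for a model with
# k parameters, n data points and maximised log-likelihood ln_l.
def criteria(ln_l, k, n):
    aic = 2 * k - 2 * ln_l
    bic = k * np.log(n) - 2 * ln_l
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)   # small-sample AIC
    return aic, bic, aicc

# Hypothetical fits: model B has a better likelihood but two more
# parameters than model A; with n = 40 the criteria arbitrate.
n = 40
aic_a, bic_a, aicc_a = criteria(ln_l=-61.0, k=2, n=n)
aic_b, bic_b, aicc_b = criteria(ln_l=-59.5, k=4, n=n)
print(aic_a, aic_b)   # 126.0 vs 127.0: the simpler model wins on AIC
print(bic_a, bic_b)   # BIC penalises the extra parameters even harder
```

Here the extra 1.5 units of log-likelihood do not buy back the two extra parameters, illustrating the warning above about not automatically picking the best-scoring model when the top candidates are close.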
Also, simulate projections out into the future and see whether the range and behaviour correspond to what you think is realistic (you can do this automatically in the time series fitting window in ModelRisk, overlaying any number of paths).

12.9 Long-Term Forecasting

By long-term forecasts I mean making projections out into the future that span more than, say, 20-30% of your historical experience. I am not a big believer in using very technical models in these situations. For a start, there should be a lot of uncertainty to the projections, but more importantly the world is ever-changing, and the key assumption you implicitly make by producing a forecast with a model fitted to historic data is that the world will carry on behaving in the same way. I know that historically I have been hopeless at predicting what my life will be like in 5 years' time: in 1985 I fully expected to be a physical oceanographer in the UK; in 1987 I'd become a qualified photographer living in New Zealand, etc. I'd fixed on being a risk analyst by 1988, but then moved to the UK, Ireland, France and Belgium. Five years ago I had no idea that our company would have grown in the way it has, or that we would have developed such a strong software capability. Try applying the same test to the world you are attempting to model. The alternative is to combine lessons learned from the past (e.g. how sensitive your sales are to the US economy) with a good look around to see how the world is changing (mergers coming up, wars starting or ending, new technology, etc.) and draw up scenarios of what the world might look like and how they would affect the variables you want to forecast. I give a number of techniques for this in Chapter 14. Now I have three kids, a partner, a nice home, a dog and an estate car, so maybe things are settling down.
Chapter 13 Modelling correlation and dependencies

13.1 Introduction

In previous chapters we have looked at building a risk analysis model and assigning distributions to various components of the model. We have also seen how risk analysis models are more complex than the deterministic models they are expanding upon. The chief reason for this increase in complexity is that a risk analysis model is dynamic. In most cases there is a potentially infinite number of possible combinations of scenarios that can be generated for a risk analysis model. We have seen in Chapter 4 that a golden rule of risk analysis is that each one of these scenarios must be potentially observable in real life. The model, therefore, must be restricted to prevent it from producing, in any iteration, a scenario that could not physically occur. One of the restrictions we must place on our model is to recognise any interdependencies between its uncertain components. For example, we may have both next year's interest rate and next year's mortgage rate represented as distributions. Figure 13.1 gives an example of two distributions modelling these interest rate and mortgage rate predictions. Clearly, these two components are strongly positively correlated, i.e. if the interest rate turns out to be at the high end of the distribution, the mortgage rate should show a correspondingly high value. If we neglect to model the interdependency between these two components, the joint probabilities of the various combinations of these two parameters will be incorrect. Impossible combinations will also be generated: for example, a value for the interest rate of 6.5% could occur with a value for the mortgage rate of 5.5%. There are three reasons why we might observe a correlation between observed data. The first is that there is a logical relationship between the two (or more) variables. For example, the interest rate statistically determines the mortgage rate, as discussed above.
The second is that there is another external factor that is affecting both variables. For example, the weather during construction of a building will affect how long it takes both to excavate the site and to construct the foundations. The third reason is that the observed correlation has occurred purely by chance and no correlation actually exists. Chapter 6 outlines some statistical confidence tests to help determine whether observed correlations are real. However, there are many examples of strong correlation between variables that would pass any tests of significance but where there is no relationship between the variables. For example, the number of personal computer users in the UK over the last 8 years and the population of Asia will probably be strongly correlated - not because there is any relationship but because both have steadily increased over that period.

[Figure 13.1 Distributions of interest and mortgage rate predictions.]

13.1.1 Explanation of dependency, correlation and regression

The terms dependency, correlation and regression are often used interchangeably, causing some confusion, but they have quite specific meanings. A dependency relationship in risk analysis modelling is where the sampled value from one variable (called the independent) has a statistical relationship that approximately determines the value that will be generated for the other variable (called the dependent). A statistical relationship has an underlying or average relationship between the variables around which the individual observations will be scattered. Its chief difference from correlation is that it presumes a causal relationship. As an example, the interest rate and mortgage rate will be highly correlated. Moreover, the mortgage rate will be in essence dependent on the interest rate, but not the other way round. Correlation is a statistic used to describe the degree to which one variable is related to another.
Pearson's correlation coefficient (also known as Pearson's product moment correlation coefficient) is given by

r = Cov(X, Y) / (σ(X)·σ(Y))

where Cov(X, Y) is the covariance between datasets X and Y, and σ(X) and σ(Y) are the sample standard deviations as defined in Chapter 6. Correlation can be considered to be a normalised covariance between the two datasets: dividing by the standard deviation of each dataset produces a unitless index between −1 and +1. The correlation coefficient is frequently used alongside a regression analysis to measure how well the regression line explains the observed variations of the dependent variable. The above correlation statistic is not to be confused with Spearman's rank order correlation coefficient, which provides an alternative, non-parametric approach to measuring the correlation between two variables. A little care is needed in interpreting covariance. Independent variables are always uncorrelated, but uncorrelated variables are not always independent. A classic, if somewhat theoretical, example is to consider the variables X = Uniform(−1, 1) and Y = X². There is a direct link between X and Y, but they have zero covariance since

Cov(X, Y) = E[XY] − E[X]E[Y]¹ (the definition) = E[X³] − E[X]E[X²]

and both E[X] and E[X³] = 0. This is one reason we look at scatter plots of data as well as calculating correlation statistics.

¹ E[ ] denotes the expectation, i.e. the mean of all values weighted by their probability.

Regression is a mathematical technique used to determine the equation that relates the independent and dependent variables with the least margin of error. If we were to plot a scatter plot of the available data, this equation would be represented by a line that passed as close as possible through the data points (see Figure 13.2). The most common technique is that of simple least-squares linear regression.
This objectively determines the straight line (Y = aX + b) such that the sum of the squares of the vertical deviations of the data points from the line is a minimum. The assumptions, mathematics and statistics relating to least-squares linear regression are provided in Section 6.3.9.

13.1.2 General comments on dependency modelling

The remainder of this chapter offers several techniques for modelling correlation and dependencies between uncertain components, with examples of where and how they are used. The sections on rank order correlation and copulas provide techniques for modelling correlation. The other sections offer techniques for dependency modelling. The analyst will need to determine whether it is important to focus on any particular correlation or dependency structure in the model. A simple way to determine this is to run two simulations, one with a zero rank order correlation and one with a +1 or −1 correlation, using two approximate distributions to define the correlated pair. If the model's results from these two simulations are significantly different, the correlation is obviously an important component of the general model. Scatter plots are an extremely useful way of visualising the form of a correlation or dependency. The common practice is to plot observed data for the independent variable (when known) on the x axis and corresponding data for the dependent variable (again, when known) on the y axis.

[Figure 13.3 Examples of dependency patterns; the panels include a fisherman's prediction of the weight of fish against experience, and sales against advertising expenditure.]

Figure 13.3 illustrates four dependency patterns that you may meet: top left, positive linear; top right, negative linear; bottom left, positive curvilinear; and bottom right, mixed curvilinear. Scatter plots also provide an excellent way of previewing a correlation pattern that you have defined in your own models.
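The uncorrelated-but-dependent example from Section 13.1.1 (X = Uniform(−1, 1), Y = X²) is easy to verify by simulation, and it shows why a scatter plot catches what the correlation statistic misses:

```python
import numpy as np

rng = np.random.default_rng(4)

# X = Uniform(-1, 1) and Y = X^2: fully dependent, yet Cov(X, Y) = 0
# because E[X^3] = E[X] = 0 by symmetry.
x = rng.uniform(-1.0, 1.0, 200_000)
y = x ** 2

cov = np.cov(x, y)[0, 1]
pearson = np.corrcoef(x, y)[0, 1]
print(round(cov, 4), round(pearson, 4))   # both essentially zero

# Yet Y is a deterministic function of X:
print(np.allclose(y, x ** 2))             # True
```

A scatter plot of these samples would show a perfect parabola, so the total dependence is obvious to the eye even though both correlation statistics are near zero.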
Most risk analysis packages allow the user to export the Monte Carlo generated values for any component in your model to the Windows clipboard or directly into a spreadsheet. The data can then be plotted in a scatter plot using the standard spreadsheet-charting facilities. The number of iterations (and therefore the number of generated data points) should be set to a value that will produce a scatter plot that fills out the low-probability areas reasonably well while avoiding overpopulation of the high-probability areas. High-resolution screens now make it reasonable to plot around 3000 data points as little dots that will show the pattern and give an impression of density quite nicely.

13.2 Rank Order Correlation

Most risk analysis software products now offer a facility to correlate probability distributions within a risk analysis model using rank order correlation. The technique is very simple to use, requiring only that the analyst nominates the two distributions that are to be correlated and a correlation value between −1 and +1. This coefficient is known as Spearman's rank order correlation coefficient. A correlation value of −1 forces the two probability distributions to be exactly negatively correlated, i.e. the X percentile value in one distribution will appear in the same iteration as the (100 − X) percentile value of the other distribution. A correlation value of +1 forces the two probability distributions to be exactly positively correlated, i.e. the X percentile value in one distribution will appear in the same iteration as the X percentile value of the other distribution. In practice, one rarely uses correlation values of −1 and +1. Negative correlation values between 0 and −1 produce varying degrees of inverse correlation, i.e. a low value from one distribution will correspond to a high value in the other distribution, and vice versa.
The closer the correlation is to zero, the looser will be the relationship between the two distributions. Positive correlation values between 0 and +1 produce varying degrees of positive correlation, i.e. a low value from one distribution will correspond to a low value in the other distribution and a high value from one distribution will correspond to a high value from the other. A correlation value of 0 means that there is no relationship between the two distributions.

13.2.1 How rank order correlation works

The rank order correlation coefficient uses the ranking of the data, i.e. what position (rank) the data point takes in an ordered list from the minimum to maximum values, rather than the actual data values themselves. It is therefore independent of the distribution shapes of the datasets and allows the integrity of the input distributions to be maintained. Spearman's ρ is calculated as

ρ = 1 − (6·ΣΔR²) / (n(n² − 1))

where n is the number of data pairs and ΔR is the difference in the ranks between data values in the same pair. This is in fact a short-cut formula for when there are few or no ties: the exact formula is discussed in Section 6.3.10.

Example 13.1

The spreadsheet in Figure 13.4 calculates Spearman's ρ for a small dataset.

[Figure 13.4 An example of the calculation of Spearman's rank order correlation coefficient: 20 data pairs are ranked, the squared rank differences are summed, and the resulting rank order correlation is 0.72.]

This correlation coefficient is symmetric in the distributions being correlated, i.e. only the difference between ranks is important, and not whether distribution A is being correlated with distribution B or the other way round.
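The short-cut formula is simple to reproduce. The sketch below applies it to 20 simulated (hypothetical) data pairs, mirroring the layout of Figure 13.4, and cross-checks against scipy's implementation:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

rng = np.random.default_rng(6)

# 20 hypothetical data pairs with a positive relationship.
a = rng.normal(100, 15, 20)
b = 0.8 * a + rng.normal(0, 5, 20)

# Short-cut formula (valid with few or no ties):
# rho = 1 - 6 * sum(dR^2) / (n * (n^2 - 1))
ra, rb = rankdata(a), rankdata(b)
n = len(a)
rho_shortcut = 1 - 6 * np.sum((ra - rb) ** 2) / (n * (n ** 2 - 1))

rho_scipy = spearmanr(a, b)[0]
print(round(rho_shortcut, 4), round(rho_scipy, 4))   # identical here
```

With continuous data there are no ties, so the short-cut and exact formulas agree to machine precision.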
In order to apply rank order correlation to a pair of probability distributions, risk analysis software has to go through several steps. Firstly, a number of rank scores equivalent to the number of iterations is generated for each distribution that is to be correlated. Secondly, these rank score lists are jumbled up so that the specified correlation is achieved between correlated pairs. Thirdly, the same number of samples is drawn from each distribution and sorted from minimum to maximum. Finally, these values are used during the simulation: the first to be used has the same ranking in the sorted list as the first value in its rank score list, and so on, until all rank scores and all generated values have been used.

13.2.2 Use, advantages and disadvantages of rank order correlation

Rank order correlation provides a very quick and easy-to-use method of modelling correlation between probability distributions. The technique is "distribution independent", i.e. it has no effect on the shape of the correlated distributions. One is therefore guaranteed that the distributions used to model the correlated variables will still be replicated. The primary disadvantage of rank order correlation is the difficulty of selecting the appropriate correlation coefficient. If one is simply seeking to reproduce a correlation that has been observed in previous data, the correlation coefficient can be calculated directly from the data using the formula in the previous section. The difficulty appears when attempting to model an expert's opinion of the degree of correlation between distributions. A rank order correlation lacks intuitive appeal, and it is therefore very difficult for experts to decide which level of correlation best represents their opinion. This difficulty is compounded by the fact that the same degree of correlation will look quite different on a scatter plot for different distribution types, e.g.
two lognormals with a 0.7 correlation will produce a different scatter pattern from two uniform distributions with the same correlation. Determining the appropriate correlation coefficient is more difficult still if the two distributions do not share the same geometry, e.g. one is normal and the other uniform, or one is a negatively skewed triangle and the other a positively skewed triangle. In such cases, the scatter plot will often show quite surprising results (Figure 13.5 illustrates some examples). Figure 13.6 shows that correlation only becomes visually evident at levels of about 0.5 or above (or about −0.5 or below for negative correlation). Producing scatter plots like this at various levels of correlation for two variables can help subject matter experts provide estimates of the levels of correlation to be applied. Another disadvantage of rank order correlation is that it ignores any causal relationship between the two distributions. It is usually more logical to think of a dependency relationship along the lines of that described in Sections 13.4 and 13.5. A further disadvantage of which most people are unaware is that an assumption about the correlation shape has already been built into the simulation software. The programming technique was originally developed in a seminal paper by Iman and Conover (1982), who used an intermediate step of translating the random numbers through van der Waerden scores. Iman and Conover found that these scores produced "natural-looking" correlations: variables correlated using van der Waerden scores produced elliptical-shaped scatter plots, while using the ranking of the variables directly produced scatter patterns that were pinched in the middle and fanned out at each end. For example, correlating two Uniform(0, 1) distributions together (the same as plotting the cdfs of any two continuous rank order correlated distributions) produces the patterns in Figure 13.7.
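The reordering mechanism behind this can be sketched compactly. This is a simplified version of the Iman and Conover idea: the actual method derives van der Waerden scores from the ranks and applies a Cholesky adjustment, whereas here correlated random normal scores stand in for them, so the achieved correlation is only approximately the target:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

rng = np.random.default_rng(8)

def rank_correlate(u, v, rho, rng):
    """Reorder samples u and v so the pairs show roughly the target
    Spearman correlation rho, leaving both marginal distributions
    untouched (a simplified sketch of the Iman-Conover approach)."""
    n = len(u)
    # Correlated normal "scores" provide the target rank pattern.
    z1 = rng.standard_normal(n)
    z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n)
    # Each sorted sample is placed according to its score's rank.
    u_out = np.sort(u)[rankdata(z1).astype(int) - 1]
    v_out = np.sort(v)[rankdata(z2).astype(int) - 1]
    return u_out, v_out

n = 5000
a = rng.lognormal(0.0, 0.5, n)        # any two marginals will do
b = rng.uniform(0.0, 1.0, n)
x, y = rank_correlate(a, b, rho=0.7, rng=rng)

print(round(spearmanr(x, y)[0], 3))          # near (a little below) 0.7
print(np.allclose(np.sort(x), np.sort(a)))   # marginals preserved: True
```

Because only the ordering changes, the lognormal and uniform marginals are reproduced exactly, which is the "distribution independent" property described above; the elliptical versus pinched scatter shapes of Figure 13.7 come from the choice of scores used in this intermediate step.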
Chapter 13 Modelling correlation and dependencies

A correlation value of +1 forces the two probability distributions to be exactly positively correlated, i.e. the X percentile value in one distribution will appear in the same iteration as the X percentile value of the other distribution. In practice, one rarely uses correlation values of -1 and +1. Negative correlation values between 0 and -1 produce varying degrees of inverse correlation, i.e. a low value from one distribution will correspond to a high value in the other distribution, and vice versa. The closer the correlation to zero, the looser the relationship between the two distributions. Positive correlation values between 0 and +1 produce varying degrees of positive correlation, i.e. a low value from one distribution will correspond to a low value from the other, and a high value from one distribution will correspond to a high value from the other. A correlation value of 0 means that there is no relationship between the two distributions.

13.2.1 How rank order correlation works

The rank order correlation coefficient uses the ranking of the data, i.e. what position (rank) each data point takes in a list ordered from the minimum to the maximum value, rather than the actual data values themselves. It is therefore independent of the distribution shapes of the datasets and allows the integrity of the input distributions to be maintained. Spearman's ρ is calculated as

ρ = 1 - 6 Σ ΔR^2 / (n(n^2 - 1))

where n is the number of data pairs and ΔR is the difference in the ranks between data values in the same pair. This is in fact a short-cut formula for use where there are few or no ties: the exact formula is discussed in Section 6.3.10.

Example 13.1

The spreadsheet in Figure 13.4 calculates Spearman's ρ for a small dataset. This correlation coefficient is symmetric about the distributions being correlated, i.e.
only the difference between ranks is important and not whether distribution A is being correlated with distribution B or the other way round.

Figure 13.4 An example of the calculation of Spearman's rank order correlation coefficient: 20 data pairs are ranked, the squared rank differences summed, and the short-cut formula returns a rank order correlation of 0.72 (the number of data pairs is counted with =COUNT(B4:B23)).

In order to apply rank order correlation to a pair of probability distributions, risk analysis software has to go through several steps. Firstly, a list of rank scores, one per iteration, is generated for each distribution that is to be correlated. Secondly, these rank score lists are jumbled up so that the specified correlation is achieved between correlated pairs. Thirdly, the same number of samples is drawn from each distribution and sorted from minimum to maximum. Finally, these values are used during the simulation: the first value to be used takes the position in its sorted list given by the first entry in its rank score list, and so on, until all rank scores and all generated values have been used.

13.2.2 Use, advantages and disadvantages of rank order correlation

Rank order correlation provides a very quick and easy-to-use method of modelling correlation between probability distributions. The technique is "distribution independent", i.e. it has no effect on the shape of the correlated distributions, so one is guaranteed that the distributions used to model the correlated variables will still be replicated. The primary disadvantage of rank order correlation is the difficulty in selecting the appropriate correlation coefficient.
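Where past data are available, the short-cut formula from Example 13.1 can be applied directly; a minimal sketch in Python (the helper names are mine, not the book's, and ties are assumed absent, as the short-cut formula requires):

```python
def ranks(values):
    """Rank from 1 (minimum) to n (maximum), assuming no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Short-cut Spearman formula: 1 - 6*sum(dR^2) / (n*(n^2 - 1))."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))
```

Perfectly co-monotonic data return +1 and perfectly anti-monotonic data return -1, matching the interpretation of the coefficient given earlier.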
If one is simply seeking to reproduce a correlation that has been observed in previous data, the correlation coefficient can be calculated directly from the data using the formula in the previous section. The difficulty appears when attempting to model an expert's opinion of the degree of correlation between distributions. A rank order correlation lacks intuitive appeal, and it is therefore very difficult for experts to decide which level of correlation best represents their opinion. This difficulty is compounded by the fact that the same degree of correlation will look quite different on a scatter plot for different distribution types, e.g. two lognormals with a 0.7 correlation will produce a different scatter pattern to two uniform distributions with the same correlation. Determining the appropriate correlation coefficient is more difficult still if the two distributions do not share the same geometry, e.g. one is normal and the other uniform, or one is a negatively skewed triangle and the other a positively skewed triangle. In such cases, the scatter plot will often show quite surprising results (Figure 13.5 illustrates some examples). Figure 13.6 shows that correlation only becomes visually evident at levels of about 0.5 or above (or about -0.5 or below for negative correlation). Producing scatter plots like these at various levels of correlation for two variables can help subject matter experts provide estimates of the levels of correlation to be applied.

Another disadvantage of rank order correlation is that it ignores any causal relationship between the two distributions. It is usually more logical to think of a dependency relationship along the lines of that described in Sections 13.4 and 13.5. A further disadvantage, of which most people are unaware, is that an assumption of the correlation shape has already been built into the simulation software.
The programming technique was originally developed in a seminal paper by Iman and Conover (1982), who used an intermediate step of translating the random numbers through van der Waerden scores. Iman and Conover found that these scores produced "natural-looking" correlations: variables correlated using van der Waerden scores produced elliptical-shaped scatter plots, while using the ranking of the variables directly produced scatter patterns that were pinched in the middle and fanned out at each end. For example, correlating two Uniform(0, 1) distributions together (the same as plotting the cdfs of any two continuous rank order correlated distributions) produces the patterns in Figure 13.7.

Figure 13.5 Examples of patterns produced by correlating different distribution types with a rank order correlation of 0.8.

Figure 13.6 Patterns produced by two normal distributions with varying degrees of rank order correlation (panels at correlation = 0, 0.2, 0.4, 0.6, 0.8 and 0.99).

Figure 13.7 Patterns produced by two Uniform(0, 1) distributions with varying degrees of rank order correlation (panels at 0.5, 0.8, 0.9 and 0.95).

Notice that the patterns are symmetric about the diagonals of Figure 13.7.
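The van der Waerden scores just mentioned are simply normal quantiles of the fractional ranks, score_i = Φ⁻¹(i/(n + 1)). A sketch of the scores themselves (an illustration only, not of the full Iman-Conover reordering algorithm):

```python
from statistics import NormalDist

def van_der_waerden_scores(n):
    """Map ranks 1..n to the i/(n+1) quantiles of a standard normal."""
    inv = NormalDist().inv_cdf
    return [inv(i / (n + 1)) for i in range(1, n + 1)]

scores = van_der_waerden_scores(9)
```

Replacing raw ranks with these scores is what pulls the induced scatter towards the elliptical shape described above.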
In particular, rank order correlation will "pinch" the variables to the same extent at each extreme. In fact there is a wide variety of different patterns that could give us the same level of rank correlation. To illustrate the point, the plots in Figure 13.8 give the same 0.9 correlation as the bottom-left pane of Figure 13.7, but are based on copulas, which I discuss in the next section.

There are times when two variables are perhaps much more correlated at one end of their distribution than the other. In financial markets, for example, we might believe that returns from two correlated stocks of companies in the same area (let's say mobile phone manufacture) are largely uncorrelated except when the mobile phone market takes a huge dive, in which case the returns are highly correlated. Then the Clayton copula in Figure 13.8 would be a much better candidate than rank order correlation.

The final problem with rank order correlation is that it is a simulation technique rather than a probability model. This means that, although we can calculate the rank order correlation between variables (ModelRisk has the VoseSpearman function to do this; it is possible in Excel but one has to create a large array to do it), and although we can use a bootstrap technique to gauge the uncertainty about that correlation coefficient (VoseSpearmanU), it is not possible to compare correlation structures statistically; for example, it is not possible to use maximum likelihood methods and produce goodness-of-fit statistics. Copulas, on the other hand, are probability models and can be compared, ranked and tested for significance.

Figure 13.8 Patterns produced by different copulas with an equivalent 0.9 rank order correlation (Frank, Clayton, T (nu = 2) and Gumbel copulas).
In spite of the inherent disadvantages of rank order correlation, its ease of use and speed of implementation make it a very practical technique. In summary, the following guidelines for using rank order correlation will help the analyst avoid problems:

- Use rank order correlation to model dependencies that have only a small impact on your model's results. If you are unsure of its impact, run two simulations: one with the selected correlation coefficient and one with zero correlation. If there is a substantial difference between the model's final results, you should choose one of the other, more precise techniques explained later in this chapter.
- Wherever possible, restrict its use to pairs of similarly shaped distributions.
- If differently shaped distributions are being correlated, preview the correlation using a scatter plot before accepting it into the model.
- If using subject matter experts (SMEs) to estimate correlations, use charts at various levels of correlation to help the expert determine the appropriate level.
- Consider using copulas if the correlation is important or shows an unusual pattern.
- Avoid modelling a correlation where there is neither a logical reason nor evidence for its existence.

This last point is a contentious issue, since many would argue that it is safer to assume a 100% positive or negative correlation (whichever increases the spread of the model output) rather than zero. In my view, if there is neither a logical reason to believe that the variables are related in some way nor any statistical evidence to suggest that they are, one would be unjustified in assuming high levels of correlation.
On the other hand, using levels of correlation throughout a model that maximise the spread of the output, and other correlation levels that minimise the spread of the output, does provide us with bounds within which we know the true output distribution(s) must lie. This technique is sometimes used in project risk analysis, for example, where for the sake of reassurance one would like to see the most widely spread output feasible given the available data and expert estimates. I suspect that using such pessimistic correlation coefficients proves helpful because it in some general way compensates for the tendency we all have to be overconfident about our estimates (of the time to complete the project's tasks, for example, which narrows the distribution of possible outcomes for model outputs like the completion date), as well as quietly recognising that there are elements running through a whole project, like management competence, team efficiency and the quality of the initial planning, that it would be uncomfortable to model explicitly.

13.2.3 Uncertainty about the value of the correlation coefficient

We will often be uncertain about the level of rank order correlation to apply, and will be guided by either available data or expert opinion. In the latter case, determining an uncertainty distribution for the correlation coefficient is simply a matter of asking a subject matter expert to estimate a feasible correlation coefficient: perhaps just minimum, most likely and maximum values, which can then be fed into a PERT distribution, for example. The expert can be helped in providing these three values by being shown scatter plots of various degrees of correlation for the two variables of interest. In the case where data are available on which the estimate of the level of correlation is to be based, we need some objective technique for determining a distribution of uncertainty for the correlation coefficient.
Classical statistics and the bootstrap both provide techniques that accomplish this. In classical statistics, the uncertainty about the correlation coefficient, given the dataset ({xi}, {yi}), i = 1, ..., n, was shown by R. A. Fisher to be as follows (Paradine and Rivett, 1964, pp. 208-210):

ρ = tanh(Normal(tanh⁻¹(r), 1/√(n - 3)))

where tanh is the hyperbolic tangent, tanh⁻¹ is the inverse hyperbolic tangent, r is the rank correlation of the set of observations and ρ is the true rank correlation between the two variables.

The bootstrap technique that applies here is the same as that usually used to estimate a statistic, except that we have to sample the data in pairs rather than individually. Figure 13.9 illustrates a spreadsheet where this has been done. Note that the formula that calculates the rank is modified from the Excel function RANK(), since that function assigns the same lowest-value rank to all data values that are equal: in calculating ρ we require the ranks of tied data values to equal the average of the ranks that the tied values would have had if they had been infinitesimally separated.

Figure 13.9 Model to determine the uncertainty of a correlation coefficient using the bootstrap. Key formulae: ranks are calculated as =RANK(B4,B$4:B$28)+(COUNTIF(B$4:B$28,B4)-1)/2; the rank correlation of the 25 sorted data pairs as {=1-6*SUM((E4:E28-D4:D28)^2)/(25*(25^2-1))}; bootstrap samples are drawn with =VoseDuniform(B$4:B$28) and paired via =VLOOKUP(F4,B$4:C$28,2); and the Fisher comparison in cell I30 is =TANH(VoseNormal(ATANH(E29),1/SQRT(22))).
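The average-rank adjustment for ties can be sketched in plain Python (an equivalent of the RANK()+COUNTIF() correction above; the function name is mine):

```python
def average_ranks(values):
    """Rank values from 1 (minimum) upwards, giving tied values the
    average of the ranks they would span if infinitesimally separated."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        lowest = ordered.index(v) + 1            # first rank of the tied group
        highest = lowest + ordered.count(v) - 1  # last rank of the tied group
        ranks.append((lowest + highest) / 2)
    return ranks
```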
So, for example, the dataset {1, 2, 2, 3, 3, 3, 4} would be assigned the ranks {1, 2.5, 2.5, 5, 5, 5, 7}. The 2s have to share the ranks 2 and 3, so are allocated the average, 2.5. The 3s have to share the ranks 4, 5 and 6, so are allocated the average, 5.

The Duniform distribution has been used to sample randomly from the {xi} values, and the VLOOKUP() function has been used to sample the {yi} values, ensuring that the data are sampled in their proper pairs. For this reason, the data pairs have to be ranked in ascending order by {xi} so that the VLOOKUP function will work correctly. Note in cell I30 that the uncertainty distribution for the correlation coefficient is also calculated, for comparison, using the traditional statistics technique above. While the results from the two techniques will not normally be in exact agreement, the difference is not excessive and they will return almost exactly the same mean values. The ModelRisk function VoseSpearmanU simulates the bootstrap estimate directly.

If one uses rank order correlation, uncertainty about correlation coefficients can only be included by running multiple simulations. As discussed previously (Chapter 7), simulating uncertainty and randomness together produces a single combined distribution that expresses quite well the total indeterminability of our output, but without separating the degree due to uncertainty from that due to randomness. However, it is not possible to do this with uncertainty about rank order correlation coefficients, as the scores used to simulate the correlation between variables are generated before the simulation starts. If one is intending to simulate uncertainty and randomness together, a representative value for the correlation needs to be determined, which is not easy because of the difficulty of assessing the effect of a correlation coefficient on a model's output(s).
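Fisher's classical result quoted above can also be simulated directly; a sketch, assuming the observed rank correlation of 0.72 and n = 25 from Figure 13.9 (the function name is mine):

```python
import math
import random

def fisher_correlation_sample(r_observed, n):
    """One draw from Fisher's uncertainty distribution:
    tanh(Normal(atanh(r), 1/sqrt(n - 3)))."""
    z = random.gauss(math.atanh(r_observed), 1 / math.sqrt(n - 3))
    return math.tanh(z)

random.seed(7)
draws = [fisher_correlation_sample(0.72, 25) for _ in range(2000)]
mean_rho = sum(draws) / len(draws)
```

The mean of such draws, or a conservative percentile, could serve as the single representative value just discussed.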
The reader may choose to use the mean of the uncertainty distribution for the correlation coefficient, or may choose to play safe and pick a value at an extreme, say the 5th or 95th percentile, whichever is the more conservative for the purposes of the model.

13.2.4 Rank order correlation matrices

An important benefit of rank order correlation is that one can apply it to a set of several variables together. In this case, we must construct a matrix of correlation coefficients. Each distribution must clearly have a correlation of 1.0 with itself, so the elements on the top-left to bottom-right diagonal are all 1.0. Furthermore, because the formula for the rank order correlation coefficient is symmetric, as explained above, the matrix elements are also symmetric about this diagonal.

Example 13.2

Figure 13.10 shows a simple example for a three-phase engineering project. The cost of each phase is considered to be strongly correlated with the amount of time it takes to complete (0.8). The construction time is moderately correlated (0.5) with the design time: it is considered that the more complex the design, the longer it will take to finish the design and construct the machine, etc.

There are some restrictions on the correlation coefficients that may be used within the matrix. For example, if A and B are highly positively correlated and B and C are also highly positively correlated, A and C cannot be highly negatively correlated. For the mathematically minded, the restriction is that the matrix can have no negative eigenvalues. In practice, the risk analysis software should determine whether the values entered are valid and either alter your entries to the closest allowable values or, at least, reject the entered values and post a warning.
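For three variables the no-negative-eigenvalues restriction has a simple closed form; a sketch (my own formulation, not from the book), where a, b and c are the off-diagonal correlations A:B, A:C and B:C of a unit-diagonal 3 x 3 matrix:

```python
def valid_3x3_correlations(a, b, c):
    """A unit-diagonal 3x3 correlation matrix [[1,a,b],[a,1,c],[b,c,1]]
    has no negative eigenvalues iff its determinant is non-negative
    (the 2x2 principal minors are automatically fine when |a|,|b|,|c| <= 1)."""
    if max(abs(a), abs(b), abs(c)) > 1:
        return False
    determinant = 1 + 2 * a * b * c - a * a - b * b - c * c
    return determinant >= 0
```

For example, A:B = 0.9 and B:C = 0.9 combined with A:C = -0.9 fails the test, matching the restriction described above.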
While correlation matrices suffer from the same drawbacks as those outlined for simple rank order correlation, they are nonetheless an excellent way of producing a complex multiple correlation that would be laborious and quite difficult to achieve otherwise.

Figure 13.10 An example of a rank order correlation matrix:

                    Design  Design  Constr.  Constr.  Testing  Testing
                    cost    time    cost     time     cost     time
Design cost         1       0.8     0        0        0        0
Design time         0.8     1       0        0.5      0        0.4
Construction cost   0       0       1        0.8      0        0
Construction time   0       0.5     0.8      1        0        0.4
Testing cost        0       0       0        0        1        0.8
Testing time        0       0.4     0        0.4      0.8      1

Adding uncertainty to a correlation matrix

Uncertainty about the correlation coefficients in a correlation matrix can easily be added when there are data available. The technique requires a repeated application of the bootstrap procedure described in the previous section for determining the uncertainty about a single parameter.

Example 13.3

Figure 13.11 provides a spreadsheet model where a dataset for three variables is used to determine the correlation coefficient between each pair of variables. By using the bootstrap method, we automatically retain the correlation between the uncertainty distributions of the correlation coefficients.
Cells C32:E32 are the outputs of this model, providing the uncertainty distributions for the correlation coefficients for A:B, B:C and A:C. The exact formula has been used to calculate the correlation coefficients because the number of ties can be large compared with the number of data pairs when there are few data pairs.

Figure 13.11 Model to add uncertainty to a correlation matrix. Key formulae: bootstrap triplets are drawn with =VoseDuniform(B$4:B$13) and paired via =VLOOKUP(E4,B$4:D$13,2) and =VLOOKUP(E4,B$4:D$13,3); ranks use =RANK(E4,E$4:E$13)+(COUNTIF(E$4:E$13,E4)-1)/2; and the output cells are =F28/SQRT(C28*D28), =G28/SQRT(D28*E28) and =H28/SQRT(C28*E28).

ModelRisk offers two functions, VoseCorrMatrix and VoseCorrMatrixU, that will construct the correlation matrix of the data and generate uncertainty about those matrix values respectively, as shown in the model in Figure 13.12. The functions are particularly useful when you have a large data array because they use less memory and spreadsheet space, and calculate far faster than attempting the entire analysis in Excel.

Figure 13.12 Using VoseCorrMat and VoseCorrMatU to calculate a rank order correlation matrix from data.

Note that, since the uncertainty distributions for the correlation coefficients in a correlation matrix are correlated together, the traditional statistics technique by Fisher cannot be used here.
Fisher's technique described the uncertainty about an individual correlation coefficient, but not its relationship to other correlation coefficients in a matrix, whereas the bootstrap captures this automatically.

13.3 Copulas

Quantifying dependence has long been a major topic in finance and insurance risk analysis and has led to an intense interest in, and development of, copulas, but they are now enjoying increasing popularity in other areas of risk analysis where one has considerable amounts of data. The rank order correlation employed by most Monte Carlo simulation tools is certainly a meaningful measure of dependence but is very limited in the patterns it can produce, as discussed above. Copulas offer a far more flexible method for combining marginal distributions into multivariate distributions, and an enormous improvement in capturing the real correlation pattern. Understanding the mathematics is a little more onerous but is not all that important if you just want to use copulas as a correlation tool, so feel free to skim over the equations a bit. In the following presentation of copulas I have used the formulae for a bivariate copula, to keep them reasonably readable, and show graphs of bivariate copulas, but keep in mind that the ideas extend to multivariate copulas too. I start off with an introduction to some copulas from a theoretical viewpoint, and then look at how we can use them in models. Cherubini et al. (2004) is a very thorough and readable exploration of copulas and gives algorithms for their generation and estimation, some of which we use in ModelRisk.

A d-dimensional copula C is a multivariate distribution with uniformly distributed marginals U(0, 1) on [0, 1]. Every multivariate distribution F with marginals F1, F2, ..., Fd can be written as

F(x1, x2, ..., xd) = C(F1(x1), F2(x2), ..., Fd(xd))

for some copula C (this is known as Sklar's theorem).
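Sklar's theorem also works in the generative direction: feed the copula's U(0, 1) outputs through the inverse marginal cdfs to recover a multivariate distribution with those marginals. A sketch with an exponential marginal, chosen purely for illustration because its inverse cdf has a closed form:

```python
import math

def exponential_inverse_cdf(u, rate):
    """F^-1(u) for an Exponential(rate) marginal: -ln(1 - u)/rate."""
    return -math.log(1 - u) / rate

# e.g. the median of an Exponential(rate = 2) marginal:
x = exponential_inverse_cdf(0.5, 2.0)
```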
Because the copula of a multivariate distribution describes its dependence structure, we can use measures of dependence that are copula based. The concordance measures Kendall's tau and Spearman's rho, as well as the coefficient of tail dependence, can, unlike the linear correlation coefficient, be expressed in terms of the underlying copula alone. I will focus particularly on Kendall's tau, as the relationships between the value of Kendall's tau (τ) and the parameters of the copulas discussed in this section are quite straightforward. The general relationship between Kendall's tau of two variables X and Y and the copula C(u, v) of the bivariate distribution function of X and Y is

τ = 4 ∫∫ C(u, v) dC(u, v) - 1

This relationship gives us a tool for fitting a copula to a dataset: we simply determine Kendall's tau for the data and then apply a transformation to get the appropriate parameter value(s) for the copula being fitted.

13.3.1 Archimedean copulas

An important class of copulas, because of the ease with which they can be constructed and the nice properties they possess, are the Archimedean copulas, which are defined by

C(u, v) = φ⁻¹(φ(u) + φ(v))

where φ is the generator of the copula, which I will explain later. The general relationship between Kendall's tau and the generator φ(t) of an Archimedean copula for a bivariate dataset can be written as

τ = 1 + 4 ∫₀¹ (φ(t)/φ′(t)) dt

For example, the relationship between Kendall's tau and the Clayton copula parameter α for a bivariate dataset is given by

τ = α/(α + 2)

The definition doesn't extend to a multivariate dataset of n variables because there will be multiple values of tau, one for each pairing. However, one can calculate tau for each pair and use the average, i.e.

τ̄ = (2/(n(n - 1))) Σ(i<j) τij

There are three Archimedean copulas in common use: the Clayton, Frank and Gumbel. These are discussed below.

The Clayton copula

The Clayton copula is an asymmetric Archimedean copula exhibiting greater dependence in the negative tail than in the positive, as shown in Figure 13.13.
This copula is given by

C(u, v) = (u^(-α) + v^(-α) - 1)^(-1/α)

and its generator is

φ(t) = (t^(-α) - 1)/α

where α ∈ [-1, ∞)\{0}, meaning α is greater than or equal to -1 but cannot take the value zero. The relationship between Kendall's tau and the Clayton copula parameter α for a bivariate dataset is given by

α = 2τ/(1 - τ)

The model in Figure 13.14 generates a Clayton copula for four variables.

Figure 13.13 Plot of two marginal distributions using 3000 samples taken from a Clayton copula with α = 8.

Figure 13.14 Model to generate values from a Clayton(alpha) copula.

The Gumbel copula

The Gumbel copula (a.k.a. the Gumbel-Hougaard copula) is an asymmetric Archimedean copula, exhibiting greater dependence in the positive tail than in the negative, as shown in Figure 13.15. This copula is given by

C(u, v) = exp(-((-ln u)^α + (-ln v)^α)^(1/α))

and its generator is φ(t) = (-ln t)^α, where α ∈ [1, ∞). The relationship between Kendall's tau and the Gumbel copula parameter α for a bivariate dataset is given by

α = 1/(1 - τ)

The model in Figure 13.16 shows how to generate the Gumbel copula.

Figure 13.15 Plot of two marginal distributions using 3000 samples taken from a Gumbel copula with α = 5.

The Frank copula

The Frank copula is a symmetric Archimedean copula, exhibiting an even, sausage-type correlation structure, as shown in Figure 13.17. This copula is given by

C(u, v) = -(1/α) ln(1 + (e^(-αu) - 1)(e^(-αv) - 1)/(e^(-α) - 1))

and its generator is

φ(t) = -ln((e^(-αt) - 1)/(e^(-α) - 1))

where α ∈ (-∞, ∞)\{0}. The relationship between Kendall's tau and the Frank copula parameter α for a bivariate dataset is given by

τ = 1 - (4/α)(1 - D₁(α))

where

D₁(α) = (1/α) ∫₀^α t/(e^t - 1) dt

is a Debye function of the first kind. There is a simple way to generate values for the Frank copula using the logarithmic distribution, as shown by the model in Figure 13.18.
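Putting the pieces together for the Clayton case: estimate Kendall's tau from data by counting concordant and discordant pairs, transform it with α = 2τ/(1 - τ), then sample (u, v) pairs by inverting the conditional distribution of v given u. This sketch uses the standard conditional-inversion method rather than the spreadsheet recursion of Figure 13.14, and all function names are mine:

```python
import random

def kendall_tau(xs, ys):
    """Kendall's tau via an O(n^2) count of concordant/discordant pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def clayton_alpha(tau):
    """Invert tau = alpha/(alpha + 2) to get the Clayton parameter."""
    return 2 * tau / (1 - tau)

def clayton_pair(alpha):
    """Sample (u, v) from a bivariate Clayton copula (alpha > 0) by
    inverting the conditional distribution C(v | u)."""
    u = random.random()
    t = random.random()  # uniform used to invert the conditional cdf
    v = (u ** (-alpha) * (t ** (-alpha / (1 + alpha)) - 1) + 1) ** (-1 / alpha)
    return u, v

alpha = clayton_alpha(0.8)  # tau = 0.8, as in Figure 13.13 (alpha = 8)
random.seed(3)
pairs = [clayton_pair(alpha) for _ in range(1000)]
```

With tau = 0.8 the points cluster tightly in the lower-left tail, reproducing the Figure 13.13 pattern.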
Figure 13.16 Model to generate values from a Gumbel(theta) copula.

13.3.2 Elliptical copulas

Elliptical copulas are simply the copulas of elliptically contoured (or elliptical) distributions. The most commonly used elliptical distributions are the multivariate normal and Student t-distributions. The key advantage of elliptical copulas is that one can specify different levels of correlation between the marginals; the key disadvantages are that elliptical copulas do not have closed-form expressions and are restricted to having radial symmetry. For elliptical copulas, the relationship between the linear correlation coefficient ρ and Kendall's tau is given by

τ = (2/π) arcsin(ρ)

The normal and Student t-copulas are described below.

The normal copula

The normal copula (Figure 13.19) is an elliptical copula given by

C(u, v) = Φ_ρ(Φ⁻¹(u), Φ⁻¹(v))

where Φ⁻¹ is the inverse of the univariate standard normal distribution function, Φ_ρ is the bivariate standard normal distribution function with linear correlation coefficient ρ, and ρ is the copula parameter.

Figure 13.17 Plot of two marginal distributions using 3000 samples taken from a Frank copula with α = 8.
The relationship between Kendall's tau and the normal copula parameter ρ is given by

ρ(X, Y) = sin(πτ/2)

The normal copula is generated by first generating a multinormal distribution with mean vector {0} and the required correlation matrix, and then transforming these values into percentiles of a Normal(0, 1) distribution, as shown by the model in Figure 13.20.

Figure 13.18 Model to generate values from a Frank(theta) copula.

Figure 13.19 Graph of 3000 samples taken from a bivariate normal copula with parameter ρ = 0.95.

Figure 13.20 Model to generate values from a normal copula.

The Student t-copula (or just "the t-copula")

The Student t-copula is an elliptical copula defined as

C(u, v) = t_(ν,ρ)(t_ν⁻¹(u), t_ν⁻¹(v))

where ν (the number of degrees of freedom) and ρ (the linear correlation coefficient) are the parameters of the copula, t_(ν,ρ) is the bivariate Student t distribution function and t_ν⁻¹ is the inverse of the univariate Student t distribution function. When the number of degrees of freedom ν is large (around 30 or so), the copula converges to the normal copula, just as the Student distribution converges to the normal. But for a limited number of degrees of freedom the behaviour of the copulas is different: the t-copula has more points in the tails.

Figure 13.21 Graph of 3000 samples taken from a bivariate Student t-copula with ν = 2 degrees of freedom and parameter ρ = 0.95.

As in the normal case (and for all other elliptical copulas), the relationship between Kendall's tau and the Student t-copula parameter ρ is given by

ρ(X, Y) = sin(πτ/2)

Fitting a Student t-copula is slightly more complicated than fitting the normal. We first estimate τ and then, starting with ν = 2, determine the likelihood of observing the dataset. We then repeat the exercise for ν = 3, 4, ..., 50 and find the combination that produces the maximum likelihood.
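The normal-copula generation just described (correlated normals, then a transform to their cumulative probabilities) can be sketched for the bivariate case; this uses the two-variable Cholesky shortcut rather than a full multinormal, and the function name is mine:

```python
import math
import random
from statistics import NormalDist

def normal_copula_pair(rho):
    """Draw (u, v) from a bivariate normal copula with parameter rho."""
    z1 = random.gauss(0, 1)
    z2 = rho * z1 + math.sqrt(1 - rho * rho) * random.gauss(0, 1)
    cdf = NormalDist().cdf
    return cdf(z1), cdf(z2)  # map each normal to its percentile in (0, 1)

random.seed(11)
pairs = [normal_copula_pair(0.95) for _ in range(500)]
```

With rho = 0.95 the scatter of (u, v) reproduces the tight elliptical band of Figure 13.19.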
For ν values of 50 or more there will be no discernible difference from using a fitted normal copula, which is simpler to generate values from. Generating values from a Student copula requires determining the Cholesky decomposition of the covariance matrix, as shown by the model in Figure 13.22.

Figure 13.22 Model to generate values from a Student copula. [The model combines the Cholesky decomposition of the covariance matrix {=VoseCholesky(B4:F8)}, a column of =VoseNormal(0,1) draws, a chi-squared draw and the array formula {=MMULT(B12:F16,B19:B23)}.]

13.3.3 Modelling with copulas

In order to make use of copulas in your risk analysis, you need three things:

1. A method to estimate the copula's parameter(s), which has been described above.
2. A model that generates the copula, described above.
3. Functions that use the inversion method to generate values from the marginal distributions to which you wish to apply the copula. Excel offers a very limited number of such functions, but they are notoriously inaccurate and unstable. You can derive many other inversion functions from the F(x) equations in Appendix III.

Let's say that we have a dataset of 1000 joint observations for each of five variables, we fit the data to gamma distributions for each variable and we correlate them together with a normal copula. In principle one could do all these things in Excel, but it would be a pretty large spreadsheet, so I am going to compromise a little. (By the way, I am using gamma distributions here so I can make a model that works with Excel, though be warned that Excel's GAMMAINV is one of the most unstable.)
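The Figure 13.22 recipe for the Student copula — Cholesky-decompose the covariance matrix, correlate standard normal draws, scale by a chi-squared draw and map through the Student t distribution function — might look like this in Python (a sketch, not the book's model):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
nu, rho, n = 2, 0.95, 3000
cov = np.array([[1.0, rho], [rho, 1.0]])

L = np.linalg.cholesky(cov)              # Cholesky decomposition of the covariance matrix
z = L @ rng.standard_normal((2, n))      # correlated standard normals
s = rng.chisquare(nu, size=n)            # one chi-squared draw per iteration
t_vals = z * np.sqrt(nu / s)             # bivariate Student t with nu degrees of freedom
u = stats.t.cdf(t_vals, df=nu)           # t-copula values in (0, 1)
```

Note that a single chi-squared draw is shared by both components in each iteration; that shared scaling is what puts the extra points in the tails.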
Figure 13.23 A model using copulas. [The 1000 joint observations for the five variables, their data statistics (mean, variance), the gamma parameter estimates (alpha, beta), the fitted normal copula and the resulting correlated gamma variables.]

(Excel's inversion functions are BETAINV, CHIINV, FINV, GAMMAINV, LOGINV, NORMINV, NORMSINV and TINV.)

In the model in Figure 13.23 I am also fitting a marginal gamma distribution to each variable using the method
of moments: usually you would want to use maximum likelihood, but this involves optimisation, so the method of moments is easier to follow, and with 1000 data points there won't be much difference. I am also foregoing the rather elaborate calculations needed to estimate the normal copula's covariance matrix by using Excel's CORREL as an approximation. I have used ModelRisk's normal copula function because it takes up less space, and I have already shown you how to generate this copula above. The model in Figure 13.24 is the equivalent with ModelRisk.

Figure 13.24 The same model as in Figure 13.23, but now in ModelRisk. [Key formulae: {=VoseCopulaMultiNormalFit($B$3:$F$1002,FALSE)} for the fitted normal copula and =VoseGammaFit(B3:B1002,I4) for each marginal.]

13.3.4 Making a special case of bivariate copulas

In the standard formulation for copulas there is no distinction between a bivariate (only two marginals) and a multivariate (more than two marginals) copula. However, we can manipulate a bivariate copula in ways that greatly extend its applicability.
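For reference, the Figures 13.23/13.24 workflow — gamma marginals fitted by the method of moments, correlation estimated CORREL-style, and a normal copula driving the fitted marginals — can be sketched outside the spreadsheet. The dataset below is synthetic (an assumption standing in for the book's 1000 joint observations, and reduced to two variables):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Stand-in for the 1000 joint observations (true marginals Gamma(2, 3))
data = rng.gamma(shape=2.0, scale=3.0, size=(1000, 2))

# Method-of-moments gamma fit: alpha = mean^2/variance, beta = variance/mean
m = data.mean(axis=0)
v = data.var(axis=0, ddof=1)
alpha, beta = m**2 / v, v / m

# Approximate the copula correlation matrix with linear correlation (CORREL-style)
corr = np.corrcoef(data, rowvar=False)

# Normal copula -> percentiles -> invert the fitted gamma marginals
z = rng.multivariate_normal(np.zeros(2), corr, size=1000)
u = stats.norm.cdf(z)
correlated = stats.gamma.ppf(u, a=alpha, scale=beta)
```

`stats.gamma.ppf` plays the role of the (much less stable) GAMMAINV inversion step in the Excel version.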
Sometimes, when creating a model, one is interested in a particular copula (say the Clayton copula) but with greater dependence in the positive tail than in the negative (a Clayton copula has greater dependence in the negative tail than in the positive; see Figure 13.13 above). For a bivariate copula it is possible to change the direction of the copula by calculating 1 − X, where X is one of the copula outputs. For example, with a Clayton copula with α = 8 in {A1:A2}:

B1 =1-A1
B2 =1-A2

A scatter plot of B1:B2 is now as in Figure 13.25. ModelRisk offers an extra parameter to allow control over the possible directional combinations. For Clayton and Gumbel copulas there are four possible directions, but for the Frank there are just two possibilities, since it is symmetric about its centre. The plots in Figures 13.26 and 13.27 illustrate the four possible bivariate Clayton copulas (1000 samples) with parameter α = 15 and the two possible bivariate Frank copulas (1000 samples) with parameter α = 21. Estimation of which direction gives the closest fit to data simply requires that one repeat the fitting methods described above, calculate the likelihood of the data for each direction and select the direction with the maximum likelihood. ModelRisk has bivariate copula functions that do this directly, returning either the parameters of the fitted copula or values generated from a fitted copula.

Figure 13.25 Graph of 3000 samples taken from a bivariate Clayton(8) copula with both directions reversed.

13.3.5 An empirical copula

In spite of the extra flexibility over rank order correlation afforded by the copulas I have introduced in this chapter, you can see that they still rely on a symmetrical relationship between the variables: draw a line between (0, 0) and (1, 1) and you get a symmetric pattern about that line (assuming you didn't alter the copula direction). Unfortunately, real-world variables tend to have other ideas.
As risk analysts, we put ourselves in a difficult situation if we try to squeeze data into a model that just doesn't fit. An empirical copula gives us a possible solution. Provided we have a good number of observations, we can bootstrap the ranks of the data to construct an approximation to an empirical copula, as the model in Figure 13.28 demonstrates. The model uses the empirical estimate rank/(n + 1), described in Section 10.2, for the quantile that should be associated with a value in a set of n data points. The VoseStepUniform distribution simply picks at random an integer value between 1 and the number of observations (1000). This method is very general and will replicate any correlation structure that the data show. It will be rather slow in Excel when you have large datasets, because each RANK function passes through the whole array of data for a variable to determine its rank; it would be more efficient to use the VoseRank array function, which takes far fewer passes through the data. However, the main drawback to this method occurs when we have relatively few observations. For example, if we have just nine observations, the empirical copula will only generate the values {0.1, 0.2, ..., 0.9}, so our model will only generate between the 10th and 90th percentiles of the marginal distributions. This problem can be corrected by applying some order statistics thinking along the lines of Equations 10.4 and 10.5. The ModelRisk function VoseCopulaData encapsulates that thinking. In the model in Figure 13.29 there are just 21 observations, so any correlation structure is only vaguely known. The plots in Figure 13.30 show how VoseCopulaData performs.

Figure 13.26 The four directional possibilities for a bivariate Clayton copula.

The large grey dots are the data and the small dots are 3000 samples from the empirical copula: notice that the copula extends over
(0, 1) for all variables and fills in the areas between the observations, with the greatest density concentrated around the observations.

Figure 13.27 The two directional possibilities for a bivariate Frank copula.

Figure 13.28 Constructing an approximate empirical copula from data. [The ranks of the joint observations are converted to quantiles, and a StepUniform distribution selects a whole row of joint ranks at random.]

13.4 The Envelope Method

The envelope method offers a more flexible technique for modelling dependencies that is both intuitive and easy to control. It models the logic whereby the value of the independent variable statistically determines the value of the dependent variable. Its drawback is that it requires considerably more effort than rank order correlation, and it is therefore really only used where the dependency relationship will have a significant effect on the final outcome of the model.

13.4.1 Using the envelope method for approximate modelling of straight-line correlation in observed data

A large number of observed correlations can be quite adequately modelled using a straight-line relationship, as already discussed. If this is the case, the following techniques can prove very valuable. However, you may sometimes come across a dependency relationship that is curvilinear and/or has a vertical spread that changes across the range of the independent variable. The bottom graphs in Figure 13.3 illustrate curvilinear relationships.
The following section offers some advice on how the envelope method can still be used to model such relationships.

Figure 13.29 Constructing an empirical copula with few data using ModelRisk. [The 21 joint observations are passed to the array function {=VoseCopulaData($B$3:$D$23)}.]

Using a uniform distribution

The envelope method first requires that all available data are plotted in a scatter plot, with the independent variable on the x axis and the dependent variable on the y axis. Bounding lines are then determined that contain the minimum and maximum observed values of the dependent variable for all values of the independent variable.

Example 13.4

Data on the time that 40 participants took to practise making a wicker basket were negatively correlated with the time they took to make the basket in a subsequent test, shown in Figure 13.31. Two straight lines, drawn by eye, neatly contain all of the data points: a minimum line of y = -0.28x + 57 and a maximum line of y = -0.42x + 88. The data look roughly uniformly distributed vertically between these two lines for all values of the x axis. We could therefore predict the test time for any value of the practice time as follows:

Test time = Uniform(-0.28 * Practice time + 57, -0.42 * Practice time + 88)

Figure 13.30 Scatter plots of random samples from the empirical copula fitted to the data in Figure 13.29.

Figure 13.31 Setting boundary lines for the envelope method of modelling dependencies. [The minimum line min = -0.28x + 57 and maximum line max = -0.42x + 88 drawn over the practice time (hours) data.]

Figure 13.32 Dependency model using the envelope method with a uniform distribution.

We have thus defined a uniform distribution for the test time that varies according to the practice time taken.
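Example 13.4's uniform envelope is easy to sketch outside the spreadsheet. Here is a Python version using the two bounding lines and a Triangle(0, 20, 60) practice time (NumPy's triangular generator stands in for the spreadsheet distribution):

```python
import numpy as np

rng = np.random.default_rng(2)

# Practice time: Triangle(0, 20, 60)
practice = rng.triangular(0, 20, 60, size=5000)

# Bounding lines from Example 13.4
lo = -0.28 * practice + 57          # minimum line
hi = -0.42 * practice + 88          # maximum line

# Test time = Uniform(minimum line, maximum line), element-wise per iteration
test_time = rng.uniform(lo, hi)
```

At a practice time of 30 hours the bounds work out to Uniform(48.6, 75.4).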
If we believe that the practice time taken by future workers is Triangle(0, 20, 60), we can use this dependency model to generate the distribution of test times, as illustrated in the spreadsheet of Figure 13.32. Consider the Triangle(0, 20, 60) generating a value of 30 in one iteration (see Figure 13.33). The equation for the minimum test time produces a value of -0.28 * 30 + 57 = 48.6. The equation for the maximum test time produces a value of -0.42 * 30 + 88 = 75.4. Thus, for this iteration, the value for the test time will be generated from a Uniform(48.6, 75.4) distribution.

The above example is a little simplistic. Using a uniform distribution to model the dispersion between the minimum and maximum lines obviously gives equal weighting to all values within the range. It is quite simple to extend this technique to using a triangular or normal distribution in place of the uniform approximation, both of which provide a more realistic central tendency.

Figure 13.33 Illustration of how the dependency model of Figure 13.32 works. [The dependent uniform distribution sits between the maximum and minimum lines at the value generated for Practice time: Triang(0, 20, 60).]

Using a triangular distribution

Employing a triangular distribution requires that, in addition to the minimum and maximum lines, we also provide the equation of a line that defines the most likely value of the dependent variable for each value of the independent variable. The triangular distribution is still a fairly approximate modelling tool, so it is quite reasonable to draw a line through the points of greatest vertical density. Alternatively, you may prefer to find the least-squares fit line through the available data. All professional spreadsheet programs now offer the facility to find this line automatically, making the task very simple. A third option is to say that the most likely value lies midway between the minimum and maximum.
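The triangular variant can reuse the Example 13.4 bounding lines. This sketch takes the third option just listed (most likely value midway between the bounds) purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(8)

practice = rng.triangular(0, 20, 60, size=5000)
lo = -0.28 * practice + 57          # minimum line (Example 13.4)
hi = -0.42 * practice + 88          # maximum line
ml = (lo + hi) / 2                  # "midway" choice for the most likely line

# Dependent model: Triangle(minimum line, most likely line, maximum line)
test_time = rng.triangular(lo, ml, hi)
```

In practice the most likely line would more often come from the points of greatest vertical density or a least-squares fit, as described above.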
Example 13.5

Figures 13.34 and 13.35 provide an illustration of the envelope method with triangular distributions.

Figure 13.34 Dependency model using the envelope method with a triangular distribution.

Figure 13.35 Illustration of how the dependency model of Figure 13.34 works.

Using a normal distribution

This option involves running a least-squares regression analysis and finding the equation of the least-squares line and the standard error of the y-estimate, Syx. The Syx statistic is the standard deviation of the vertical distances of each point from the least-squares line. Least-squares regression assumes that the error of the data about the least-squares line is normally distributed. Thus, if y = ax + b is the equation of the least-squares line, we can model the dependent distribution as y = Normal(ax + b, Syx).

Example 13.6

Figure 13.36 provides an illustration of the envelope method with normal distributions.

Comparison of the uniform, triangular and normal methods

Figure 13.37 compares how the three envelope methods behave. The graphs on the left cross-plot a Triangle(0, 20, 60) for the practice time (x axis) against the resulting test time (y axis). The graphs on the right show histograms of the resulting test time distributions. The uniform method produces a scatter plot that is vertically evenly distributed and strongly bounded. Its test time histogram has the flattest shape, with the widest "shoulders" of the three methods. The triangular method produces a scatter plot that has a vertical central tendency and is also strongly bounded. Its histogram is the most peaked of the three methods, producing the smallest standard deviation. The normal method produces a scatter plot that has a vertical central tendency but that is unbounded.
This will generally be a closer approximation to a plot of available data. The histogram has the widest range of the three methods. Using the normal distribution has two advantages over the other two methods: the equation of the line and the standard deviation are both calculated directly from the available data and don't involve any subjective estimation; and the unbounded nature of the normal distribution gives generated values the opportunity to fall outside the range of the observed values. This second point may help ensure that the range of the dependent distribution is not underestimated.

Figure 13.36 Using the normal distribution to model a dependency relationship. [Regression line y = -0.4594x + 74.51 with Syx = 8.16; the dependent normal distribution sits about the line at each practice time.]

Finally, it is important to be sure that the formula you develop will be valid over the entire range of values that are to be generated for the two variables. For example, the normal formula can potentially generate negative values for test time. It could, however, be mathematically restricted to prevent a negative tail, for example by using an IF(test_time < 0, 0, test_time) statement.

13.4.2 Using the envelope method for non-linear correlation observed from available data

One may come across a correlation relationship that cannot be adequately modelled using a straight-line fit, as in the examples of Section 13.4.1. However, with a little extra work, the techniques described above can be adapted to model most relationships. The first stage is to find the best curvilinear line that fits the data. Microsoft Excel, for example, offers a choice of automatic line fitting: linear, logarithmic, polynomial (up to sixth order), power and exponential. Several of these fitted lines can be overlaid on the data to help determine the most appropriate equation.
The second stage is to use the equation of the selected line to determine the predicted values of the dependent variable for each value of the independent variable. The differences between the observed and predicted values of the dependent variable (i.e. the error terms) are then calculated and cross-plotted against the independent variable. The third stage is to determine how these error terms should be modelled. Any of the three techniques described in Section 13.4.1 could be used. The final stage is to combine the equation of the best-fit line with the distribution for the error term.

Figure 13.37 Comparison of the results of the envelope method of modelling dependency using uniform, triangular and normal distributions.

Example 13.7

Data on the amount of money a cosmetic company spends on advertising the launch of a new product are compared with the volume of initial orders it receives (Figure 13.38) and cross-plotted in Figure 13.39. Clearly, the relationship is not linear: an example of the law of diminishing returns. The best-fit line is determined to be logarithmic: y = 1374.8 * LN(x) - 10713. The error terms appear to have approximately the same distribution across the whole range of advertising budget values. Since the distribution of error terms appears to have a greater concentration around zero, we might assume that they are normally distributed and calculate their standard deviation (= 126 from Figure 13.38).
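The four stages of Example 13.7 can be sketched end to end in Python. The data here are synthetic — generated from the fitted equation plus Normal(0, 126) noise, standing in for the observations of Figure 13.38:

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical (budget, order) data scattered about Example 13.7's logarithmic fit
budget = rng.uniform(9000, 25000, size=200)
order = 1374.8 * np.log(budget) - 10713 + rng.normal(0, 126, size=200)

# Stages 1-2: fit y = a*ln(x) + b and compute the error terms
a, b = np.polyfit(np.log(budget), order, 1)
errors = order - (a * np.log(budget) + b)

# Stage 3: model the errors as Normal(0, s), s = their standard deviation
s = errors.std(ddof=1)

# Stage 4: combined model, equivalent to Normal(a*ln(x) + b, s)
def total_initial_order(x, rng):
    return rng.normal(a * np.log(x) + b, s, size=np.shape(x))
```

With real data the fitted `a`, `b` and `s` would of course come out near, but not equal to, the generating values.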
The final equation for the total initial order can then be written as

Total initial order = 1374.8 * LN(Advertising budget) - 10713 + Normal(0, 126)

or, equivalently,

Total initial order = Normal(1374.8 * LN(Advertising budget) - 10713, 126)

Figure 13.38 Analysis of data and error terms for a curvilinear regression for Example 13.7. [Fourteen (advertising budget, initial order) pairs with their observed differences from the prediction; formulae: D4:D17 =C4-(1374.8*LN(B4)-10713), standard deviation =STDEV(D4:D17).]

Figure 13.39 The best-fitting non-linear correlation for the data of Example 13.7.

13.4.3 Using the envelope method to model expert opinion of correlation

It is very difficult to get an intuitive feel for rank order correlation coefficients, even when one is familiar with probabilistic modelling. It is therefore recommended that the more intuitive envelope method be employed for modelling an expert's opinion of a dependency where that dependency is likely to have a large impact.

The technique involves the following steps:

- Discuss with the expert the logic of how he or she perceives the relationship between the two variables to be correlated. Review any available data.
- Determine the independent and dependent variables. If the causal relationship is unclear, select either to be the independent variable according to which will be easiest.
- Define the range of the independent variable and determine its distribution (using a technique from Chapter 9 or 10).
- Select several values for the independent variable. These values should include the minimum and maximum and a couple of strategic points in between.
- Ask the expert his or her opinion of the minimum, most likely and maximum values for the dependent variable should each of these selected values of the independent variable occur. I often prefer to ask for the practical minimum and maximum.
- Plot these values on a scatter diagram and find the best-fit lines through the three sets of points (minima, most likely values and maxima). Check that the expert agrees that the plot is consistent with his or her opinion.
- Use the equations of these best-fit lines in a triangular or PERT distribution to define the dependent variable.

Example 13.8

Figure 13.40 illustrates an example where the expert is defining the relationship between a bank's average mortgage rate and the number of new mortgages it will sell. The expert has given her opinion of the practical minimum, most likely and practical maximum values of the number of new mortgages for four values of the mortgage rate, as shown in Table 13.1. She has defined the practical minimum and maximum to mean, for her, that there is only a 5% chance that the number of mortgages will be below or above those values respectively.

Figure 13.40 An example of the use of the envelope method to model an expert's opinion of a dependency relationship or correlation. [Best-fit lines through the elicited minimum, most likely and maximum points, plotted over mortgage rates from 6% to 14%.]

Table 13.1 Data from expert elicitation. [Columns: mortgage rate (%); minimum, most likely and maximum number of new mortgages. The values are not legible in this copy.]

This technique has the advantage of being very intuitive. The expert is asked questions that are both meaningful and easy to think about. It also has the advantage of avoiding the need to define the distribution shape for the dependent variable: the shape will be dictated by its relationship to the independent variable.

13.4.4 Adding uncertainty in the envelope method

It is a relatively simple matter to add uncertainty into the envelope method.
If data exist from which to develop the dependency relationship, one can use the bootstrap method or traditional statistics to give uncertainty distributions for the least-squares fit parameters. Uncertainty about the boundaries can be included by simply looking at extreme possibilities for the minimum and maximum boundaries on y, as well as the best-guess lines.

13.5 Multiple Correlation Using a Look-Up Table

There may be times when it is necessary to model the simultaneous effect of an external factor on several parameters within a model. An example is the effect of poor weather on a construction site. The times taken to do an archaeological survey of the land, dig out the foundations, put in the form work, build the foundations, construct the walls and floors and assemble the roof could all be affected by the weather to varying degrees. A simple method of modelling such a scenario is to use a spreadsheet look-up table.

Example 13.9

Figure 13.41 illustrates the example above, showing the values for one particular iteration. The model works as follows: Cells D5:D10 list the estimates of the duration of each activity if the weather is normal. The look-up table F4:J10 lists the percentages by which the activities will increase or decrease owing to the weather conditions. Cell D13 generates a value for the weather from 1 to 5 using a discrete distribution that reflects the relative likelihood of the various weather conditions.
Figure 13.41 Using a look-up table to model multiple dependencies. [Base and revised estimates for six activities (archaeology, dig foundations, form work, lay foundations, walls and floors, lay roofing) under a weather index running from 1 (very poor) to 5 (very good). Formulae: D5:D10 =VoseTriangle(3,4,6), =VoseTriangle(9,11,13), etc.; E5:E10 =D5*(1+HLOOKUP(D$13,F$4:J$10,B5)); D13 =VoseDiscrete({1,2,3,4,5},{2,5,4,3,2}); E11 =SUM(E5:E10).]

Cells E5:E10 add the appropriate percentage change for that iteration to the base estimate time by looking it up in the look-up table. Cell E11 adds up all the revised durations to obtain the total construction time.

It is a simple matter to include uncertainty in this technique. One needs simply to add uncertainty distributions for the magnitude of each effect (in this case, the values in cells F5:J10). A little care is needed if the uncertainty distributions overlap for an activity. So, for example, if we used a PERT(30%, 40%, 50%) uncertainty distribution for the parameter in cell F5 and a PERT(20%, 28%, 35%) uncertainty distribution for the parameter in cell G5, we could be modelling a simulation where very poor weather increases the archaeological digging time by 31% but poor weather increases it by 33%. Using high levels of correlation between the uncertainty distributions of effect size across a task will remove this problem quite efficiently, and reflects that errors in estimating the effect (in this case of weather) will probably be similar for each effect size.

Chapter 14 Eliciting from Expert Opinion

14.1 Introduction

Risk analysis models almost invariably involve some element of subjective estimation.
It is usually impossible to obtain data from which to determine accurately the uncertainty of all of the variables within the model, for a number of reasons:

- The data have simply never been collected in the past.
- The data are too expensive to obtain.
- Past data are no longer relevant (new technology, changes in the political or commercial environment, etc.).
- The data are sparse, requiring expert opinion "to fill in the holes".
- The area being modelled is new.

The uncertainty in subjective estimates has two components: the inherent randomness of the variable itself and the uncertainty arising from the expert's lack of knowledge of the parameters that describe that variability. In a risk analysis model these uncertainties may or may not be distinguished, but both types of uncertainty should at least be accounted for in the model. The variability is best included by assuming some sort of stochastic model, and the uncertainty is then included in the uncertainty distributions for the model parameters.

When insufficient data are available to specify the uncertainty of a variable completely, one or more experts will usually be consulted to provide their opinion of the variable's uncertainty. This chapter offers guidelines to help the analyst model the experts' opinions as accurately as possible. I will start by discussing sources of bias and error that the analyst will encounter when collecting subjective estimates. We then look at a number of techniques used in the modelling of probabilistic estimates, and particularly at the use of various types of distribution. The analyst is then shown how to employ brainstorming sessions to ensure that all of the available information relevant to the problem is disseminated among the experts and the uncertainty of the problem openly discussed. Finally, we look at methods for eliciting expert opinion in one-to-one interviews with the analyst.
Before delving into the techniques of subjective estimation, I would like the reader to consider the following two points, which have been the downfall of many a model I have been asked to evaluate.

Firstly, the most significant subjective estimate in a model is often the design of the structure of the model itself. It is surprising how often the structure of a model evades criticism while the figures within it are given all the scrutiny. Before committing to a specific model structure, it is recommended that the analyst seeks comment from other interested parties as to its validity. In turn, this action will greatly enhance the analyst's chances of having the model's results accepted and of receiving cooperation in determining the input uncertainties. Good analysts should take this stage very seriously and promote an environment in which it is possible to provide open criticism of their work.

The second point is that analysts should not take it upon themselves to provide all of the subjective assessments in a model. This sounds painfully obvious, but it still astounds me how many analysts believe that they can estimate all or most of the variables within their model by themselves, without consulting others who are closer to the particular problem.

14.2 Sources of Error in Subjective Estimation

Before looking at the techniques for eliciting distributions from an expert, it is very useful to have an understanding of the biases that commonly occur in subjective estimation. To introduce this subject, Section 14.2.1 describes two exercises I run in my risk analysis training seminars, which the reader might find educational to conduct in his or her own organisation. In each exercise the class members have their own PCs and risk analysis software to help them with their estimates. Section 14.2.2 summarises the sources of heuristic errors and biases: that is, errors produced by the way people mentally approach the task of parameter estimation.
Finally, Section 14.2.3 looks at other factors that may cause inaccuracy in the experts' estimates.

14.2.1 Class experiments on estimating

This section looks at two estimating exercises I regularly use in my training seminars on risk analysis modelling. Their purpose is to highlight some of the thought processes (heuristics) people use to produce quantitative estimates. The reader should consider the observations from these exercises in conjunction with the points raised in Section 14.2.2.

Class estimating exercise 1

Each member of the class is asked to provide practical minimum, most likely and practical maximum estimates for a number of quantities (usually eight). The class is instructed that the minimum and maximum should be as close as possible to each other, such that they are 90% confident that the true value falls between them. The class is encouraged to ask questions if anything is unclear. The quantities being estimated are obscure enough that the class members will not have an exact knowledge of their values, but hopefully familiar enough that they can have a go at estimating them. The questions are changed to be relevant to the country in which the seminar is run. Examples of these quantities are:

- the distance from Oxford to Edinburgh along main highway routes in kilometres;
- the area of the United Kingdom in square kilometres;
- the mass of the Earth in metric tonnes;
- the length of the Nile in kilometres;
- the number of pages in the December Vogue UK magazine;
- the population of Scranton, USA;
- the height of K2, Kashmir, in metres;
- the deepest ocean depth in metres.

Figure 14.1 How to draw up class estimates from exercises 1 and 2 on a blackboard. [Estimates a to j of the number of pages in the October 1995 UK Cosmopolitan magazine, plotted as ranges against the actual value.]

Each member of the class fills out a form giving the three values for each quantity.
When everyone has completed their forms, I get the class to pick one of these quantities, e.g. the length of the Nile. I then question each member of the class to find out the minimum and the maximum, i.e. the total range of all of the estimates they have made. On the blackboard, I draw up a plot of each class member's three-point estimate, as illustrated in Figure 14.1, and then superimpose the true value. There is almost invariably an expression of surprise at the true value. Sometimes, after I have drawn all of the estimates up on the blackboard, I will ask if any of the class wishes to change their estimate before I reveal the true value. Some will choose to do so, but this rarely increases their chance of encompassing the true value. I will often repeat this process for four or five of the measurements to collect as many of the lessons to be learned from the exercise as possible. Now, if the class members were perfectly calibrated, there would be a 90 % chance (i.e. the confidence level defined above) that each true value would lie within their minimum to maximum range. By "calibrated" I mean that their perceptions of the precision of their knowledge were accurate. If there are eight quantities to be estimated, the number that fall within their minimum to maximum range (their score for this exercise) can be estimated by a Binomial(8, 90 %) distribution, as shown in Figure 14.2. A host of interesting observations invariably comes out of this exercise. The underlying reasons for these observations and those of the following exercise are summarised in Section 14.2.2: In the hundred or so seminars in which I have performed this exercise, I have very rarely seen a score higher than 6. From Figure 14.2 we can see that there is only a 4 % chance that anyone would score 5 or less if they were perfectly calibrated.
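The 4 % figure can be checked directly from the Binomial(8, 90 %) distribution: if each of the eight ranges independently has a 90 % chance of capturing the true value, the probability of capturing five or fewer is the lower tail of that binomial. A quick sketch (Python here, though any tool with a binomial function will do):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# A perfectly calibrated estimator: 8 questions, 90 % capture probability each
p_5_or_less = sum(binomial_pmf(k, 8, 0.9) for k in range(6))
print(f"P(score <= 5) = {p_5_or_less:.3f}")  # about 0.038, i.e. roughly 4 %
```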
If we take the average score for all members of the class and assume the distribution of scores to be approximately binomial, we can estimate the real probability encompassed by their minimum to maximum range. The mean of a binomial distribution is np, where n is the number of trials (in this case, 8) and p is the probability of success (here, the probability of falling between the minima and maxima). The average individual score for the whole class is usually around 3, giving a probability p of 3/8 = 37.5 %. In other words, where they were providing a minimum and maximum for which they believed there was a 90 % chance of the quantity falling between those values, there was in fact only about a 37 % chance. One reason for this "overconfidence" (i.e. the estimated uncertainty is much smaller than the real uncertainty) is anchoring, discussed in Section 14.2.2. Figure 14.3 shows the distribution for the largest class for which I have run this exercise (and the only class for which I kept the results).

[Figure 14.2 Binomial(8, 90 %) distribution for forecasting test scores.]

[Figure 14.3 Example of scores produced by a large class in the estimating exercise.]

The estimators often confuse the units (e.g. miles instead of kilometres, kilograms instead of tonnes), resulting in a gross error. In estimating the population of Scranton, some estimators provide a huge maximum estimate. Since most people have never heard of Scranton, it makes sense that it has a smaller population than London, New York, etc., but some people ignore this obvious deduction and offer a maximum that has no logical basis (their estimation is strongly affected by the fact that they have never heard of Scranton rather than any logic they could apply to the problem). When the class discusses the quantities, they can usually agree on a logic for their estimation.
If estimators are very sure of their quantity, they may nonetheless provide an unrealistically large range given their knowledge ("better to be safe") or, more commonly, provide just slightly too narrow a range (resulting in a protest when I don't award them a correct answer!). I once asked a class in New Zealand to estimate the area of their country. A gentleman from their Met Office asked if that was at low or high tide, to the amusement of us all. He knew the answer precisely, but the true value fell outside his range because he had not known the precise conversion factor between acres and square kilometres and had made insufficient allowance for that uncertainty. If offered the choice of a revision to their estimates after I have drawn them all on the board, those that change will usually gravitate to any grouping of the others' estimates or to the estimate of an individual in the group whose opinion is highly valued. These actions often do not get them closer to the correct answer. This observation has encouraged me to avoid asking for distribution estimates during brainstorming sessions (see Section 14.4). In many cases, people who have given a vast range to their estimates (to howls of laughter from the others) are the only ones to get it inside their range. People attending my seminars are almost always computer literate, but it is surprising how many have little feel for numbers and offer estimates that could not possibly be correct. Faced with a quantity that seems impossible to quantify at first, the estimator can often arrive at a reasonable estimate by being encouraged either to break the quantity down into smaller components or to make a comparison with other quantities. For example, the mass of the Earth could be estimated by first estimating the average density of rock and then multiplying it by an estimate of the volume of the Earth (requiring an estimate of its radius or circumference).
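That decomposition can be carried a step further by attaching a three-point estimate to each component and simulating, so the answer is itself a range rather than a single guess. The figures below are hypothetical class-style inputs, not recommended values (note that the Earth's average density is well above that of surface rock because of its iron core, so the density range here is deliberately generous):

```python
import math
import random

random.seed(1)

# Hypothetical three-point guesses: radius 6000-6400-7000 km, density 3-5.5-8 t/m^3
masses = []
for _ in range(20000):
    radius_m = random.triangular(6.0e6, 7.0e6, 6.4e6)    # args: (low, high, mode)
    density = random.triangular(3000.0, 8000.0, 5500.0)  # kg/m^3
    volume = (4.0 / 3.0) * math.pi * radius_m**3
    masses.append(volume * density / 1000.0)             # metric tonnes

masses.sort()
p5, p95 = masses[int(0.05 * len(masses))], masses[int(0.95 * len(masses))]
print(f"90 % range: {p5:.2e} to {p95:.2e} tonnes")
```

With these inputs the 90 % range comfortably brackets the accepted value of about 5.97e21 tonnes, illustrating how decomposition turns an "impossible" question into a defensible estimate.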
Occasionally, this method has come up with some huge errors where the estimator has confused the formula for the volume of a sphere with that for its area, etc. Very occasionally, individuals lacking confidence will refuse to read out their opinions to the class. Sometimes, estimators will provide a set of answers without really understanding the quantity they are estimating (e.g. not knowing that K2 is a mountain, the second highest in the world). Note that the person in question did not seek clarification, even after being encouraged to do so. This "shyness" seems to be much more common in some nationalities than others. This exercise can legitimately be criticised on several points:

1. The class members are asked to estimate quantities that they have no real knowledge of, and their score is therefore not reflective of their ability to estimate the quantities that would be required of them in their work.
2. In most real-life problems, the quantity being estimated does not have a fixed known value but is itself uncertain.
3. In real-life problems, if the estimator has provided a range that was small but just missed the true value, that estimate would still be more useful than another estimate with a much wider range but that included the true value.
4. In real-life problems, estimators would presumably check formulae and conversion factors that they were unsure of.

The scores should not be taken very seriously (I don't keep a record of the results). The exercise is simply a good way to highlight some of the issues concerned in estimating. A more realistic exercise would be to compare probabilistic estimates from an expert for real problems with the values that were eventually observed. Of course, such an exercise could take many months or years to complete.
Class estimating exercise 2

The class is grouped in pairs and asked to give the same three-point estimate, as used for the above exercise, of the total weight (mass) of the members of the class in kilograms, including myself, and our total height in metres. While they are estimating, I go round the class and ask each member quietly for their own measurements. At the end of the exercise, I draw up the estimates as in Figure 14.1 and superimpose the true value. Then we discuss how each group produced its estimates. The following points generally come out. Three estimating techniques are usually used by the class:

1. Produce a three-point estimate of the distributions of height and mass for individuals in the class and multiply by the number of people in the class. This logic is incorrect since it ignores the central limit theorem, which states that the spread of the sum of a set of n variables is proportional to √n, not n. It generally manages to encompass the true result but with a very wide (and therefore inaccurate) range.
2. Produce a three-point estimate of each individual in the class and add up the minima to get the final-estimate minimum, add up the most likely values to get the final-estimate most likely and add up the maxima to get the final-estimate maximum. Again, this is incorrect since it ignores the central limit theorem and therefore produces too wide a range.
3. Produce a three-point estimate of each individual in the class and then run a simulation to add them up. Take the 5 %, mode and 95 % values of the simulation result as the final three-point estimate. This generally has the narrowest range but is still quite likely to encompass the true value.

There is often a dominant person in a pair who takes over the whole estimating, either because that person is very enthusiastic or more familiar with the software or because the other person is a bit laid back or quiet. This, of course, loses the value of being in pairs.
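The gap between methods 2 and 3 above is easy to demonstrate. The sketch below uses a hypothetical class of 20 people who all receive the same three-point weight estimate; method 2's range grows in proportion to n, while the simulated sum's 90 % range grows only with √n:

```python
import random

random.seed(7)

n_people = 20
a, b, c = 55.0, 75.0, 110.0   # hypothetical per-person weight estimate (kg)

# Method 2: add the minima, modes and maxima directly
naive_min, naive_mode, naive_max = n_people * a, n_people * b, n_people * c

# Method 3: simulate the sum of 20 independent triangular estimates
totals = sorted(
    sum(random.triangular(a, c, b) for _ in range(n_people))  # args: (low, high, mode)
    for _ in range(10000)
)
sim_p5, sim_p95 = totals[500], totals[9500]

print(f"method 2 range:      {naive_min:.0f} to {naive_max:.0f} kg")
print(f"method 3 90 % range: {sim_p5:.0f} to {sim_p95:.0f} kg")
```

The simulated 90 % range comes out several times narrower than the naive added-extremes range, which is exactly the central limit theorem effect the text describes.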
The estimators often forget to exclude themselves from their uncertainty estimates. They have given me their measurements, so they should only assign uncertainty to the others' measurements and then add their own measurements to the total. If the central limit theorem corrections are applied to the violating estimates, the class scores average out at about 1.4 compared with the 1.8 it should have been (i.e. 2 × 90 %). In other words, their minimum to maximum range, which was supposed to have a 90 % probability of including the true value, actually had about a 70 % probability.

14.2.2 Common heuristic biases and errors

The analyst should bear in mind the following heuristics that the expert may employ when attempting to provide subjective estimates and that are potential sources of systematic bias and errors. These biases are explained in considerably more detail in Hertz and Thomas (1983) and in Morgan and Henrion (1990) (the latter includes a very comprehensive list of references).

Availability

This is where experts use their recollection of past occurrences of an event to provide an estimate. The accuracy of their estimates is dictated by their ability to remember past occurrences of the event or how easily they can imagine the event occurring. This may work very well if the event is a regular part of their life, e.g. how much they spend on petrol. It also works well if the event is something that sticks in their mind, e.g. the probability of having a flat tyre. On the other hand, it can produce poor estimates if it is difficult for the experts to remember past occurrences of the event: for example, they may not be able confidently to estimate the number of people they passed in the street that day since they would have no interest in noting each passer-by. Availability can produce overestimates of frequency if the experts can remember past occurrences very clearly because of the impact they had on them.
For example, if a computer manager was asked how often her mainframe had crashed in the last two years, she might well overestimate the frequency because she could remember every crash and the crises they caused but, because of the clarity of her recollection ("it seems like only yesterday"), include some crashes that happened well over two years ago and therefore overestimate the frequency as a result. The availability heuristic is also affected by the degree to which we are exposed to information. For example, one might consider that the chance of dying in a motoring accident was much higher than dying from stomach cancer, because car crashes are always being reported in the media and stomach cancer fatalities are not. On the other hand, an older person may have had several acquaintances who have died from stomach cancer and would therefore offer the reverse opinion.

Representativeness

One type of bias is the erroneous belief that the large-scale nature of uncertainty is reflected in small-scale sampling. For example, in the National Lottery, many would say I had no chance of winning if I selected the consecutive numbers 16, 17, 18, 19, 20 and 21. The lottery numbers are randomly picked each week, so it is believed that the winning numbers should also exhibit a random pattern, e.g. 3, 11, 15, 21, 29 and 41. Of course, both sets of numbers are actually equally likely. I once reviewed a paper that noted that, out of 200 houses fitted with a new type of gas supply piping and tested over a period of a year and a half, one of those houses suffered a gas leak due to a rat gnawing through the pipe. It concluded that there was a 1:300 chance of a "rodent attack" per house per year. What should the answer have been? A second type of representativeness bias is where people concentrate on an enticing detail of the problem and forget the overall picture.
In a frequently cited paper by Kahneman and Tversky, described in Morgan and Henrion (1990), subjects in an experiment were asked to determine the probability of a person being an engineer on the basis of a written description of that person. If they were given a bland description that gave no clue to the person's profession, the answer given was usually 50:50, despite being told beforehand that, of the 100 described people, 70 were lawyers and 30 were engineers. However, when the subjects were asked what probability they would give if they had no description of the person, they said 30 %, illustrating that they understood how to use the information but had just ignored it.

Adjustment and anchoring

This is probably the most important heuristic of the three. Individuals will usually begin their estimate of the distribution of uncertainty of a variable with a single value (usually the most likely value) and then make adjustments for its minimum and maximum from that first value. The problem is that these adjustments are rarely sufficient to encompass the range of values that could actually occur: the estimators appear to be "anchored" to their first estimated value. This is certainly one source of overconfidence and can have a dramatic impact on the validity of a risk analysis model.

14.2.3 Other sources of estimating inaccuracy

There are other elements that may affect the correct assessment of uncertainty, and the analyst should be aware of them in order to avoid unnecessary errors.

Inexpert expert

The person nominated (wrongly) as being able to provide the most knowledgeable opinion occasionally actually has very little idea. Rather than referring the analyst on to another person more expert in the problem, that person may try to provide an opinion "to be helpful", even though that opinion is of little real value. The analyst, seeing the inexpertness of the interviewee, should seek an alternative opinion, although the inexpertness may not become apparent until later.
Culture of the organisation

The environment within which people work may sometimes affect their estimating. Sales people will often provide unduly optimistic estimates of future sales because of the optimistic culture within which they work. Managers may offer high estimates of running costs because, if they achieve a lower operating cost, their organisation will view them favourably. The analyst should try to be aware of any potential conflict and seek to eliminate it through cross-checking with data and other people in the organisation.

Conflicting agendas

Sometimes the expert will have a vested interest in the values that are submitted to a model. In one model I developed, managers were deliberately providing hugely optimistic growth rate predictions to me because, in the organisation they worked for, it could aid their individual empire building. In another, I was offered very optimistic estimates of completion time and costs for a project because, if that project were given approval, the person in question would become the project's manager with a big wage increase to match. Lawyers may offer a low estimate of the cost of litigation because, if they get the brief, they can usually increase the fees later. The analyst must be aware of such conflicting agendas and seek a second, disinterested opinion.

Unwillingness to consider extremes

The expert will frequently find it difficult, or be unwilling, to envisage circumstances that would cause a variable to be extremely low or high. The analyst will often have to encourage the development of such extreme scenarios in order to elicit an opinion that realistically covers the entire possible range. This can be done by the analyst dreaming up some examples of extreme circumstances and discussing them with the expert.

Eagerness to say the right thing

Occasionally, interviewees will be trying to provide the answer they think the analyst wants to hear.
For this reason, it is important not to ask questions that are leading and never to offer a value for the expert to comment on. For example, if I said "How long do you think this task will take? Twelve weeks? More? Less?" I could well get an answer nearer to 12 weeks than if I had simply said "How long do you think this task will take?".

Units used in the estimation

People are frequently confused between the magnitudes of units of measurement. An older (or English) person may be used to thinking of distances in miles and liquid volumes in (UK) gallons and pints. If the model uses SI units, the analyst should let the experts describe their estimates in the units in which they are comfortable and convert the figures afterwards.

Expert too busy

People always seem to be busy and under pressure. A risk analyst coming to ask a lot of difficult questions may not be very welcome. The expert may act brusquely or give the whole process lip service. Obvious symptoms are when the expert offers oversimplistic estimates like X ± Y % or minimum, most likely and maximum values that are equally spaced for all estimated variables. The solution to such problems is to get top management visibly to support the development of the risk model, ensuring that employees are given the message that this work is a priority.

Belief that the expert should be quite certain

Experts may perceive that assigning a large uncertainty to a parameter would indicate a lack of knowledge and thereby undermine their reputation. The expert may need to be reassured that this is not the case. An expert should have a more precise understanding of a parameter's true uncertainty and may, in fact, appreciate that the uncertainty could be greater than a layperson would have expected.
14.3 Modelling Techniques

This section describes a range of techniques, including the role of various types of probability distribution, that are useful in eliciting expert opinion. I have only included those techniques that have worked for me, so the reader will find some omissions when comparing with other risk analysis texts.

14.3.1 Disaggregation

A key technique in eliciting distributions of opinion is to disaggregate the problem sufficiently well that experts can concentrate on estimating something that is tangible and easy to envisage. For example, it will generally be more useful to ask experts to break down their company's revenue into logical components (like region, product, subsidiary company, etc.) rather than to estimate the total revenue in one go. Disaggregation allows the expert and analyst to recognise dependencies between components of the total revenue. It also means that the risk analysis result will be less critically dependent on the estimate of each model component. Aggregating the estimates of the various revenue components will produce a more complex and accurate distribution than could ever have been achieved by directly estimating the sum. The aggregation will also take care of the effects of the central limit theorem automatically - something that is extremely hard for experts to do in their head. Another benefit of disaggregation is that the logic of the problem usually becomes more apparent and the model therefore becomes more realistic. During the disaggregation process, analysts should be aware of where the key uncertainties lie within their model and therefore where they should place their emphasis. The analyst can check whether an appropriate level of disaggregation has been achieved by running a sensitivity analysis on the model (see Section 5.3.7) and looking to see whether the Tornado chart is dominated by one or two model inputs.
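As a sketch of that check, the fragment below builds a toy disaggregated revenue model (three regional components with invented three-point estimates, not figures from the text) and ranks the inputs by their correlation with the total - the calculation that sits behind a Tornado chart. Here no single region dominates, suggesting a reasonably balanced disaggregation:

```python
import random

random.seed(3)
N = 5000

# Hypothetical regional revenue components ($m): (min, most likely, max)
regions = {"North": (8, 10, 13), "South": (4, 5, 9), "Export": (1, 2, 6)}
draws = {
    name: [random.triangular(lo, hi, ml) for _ in range(N)]  # args: (low, high, mode)
    for name, (lo, ml, hi) in regions.items()
}
total = [sum(draws[name][i] for name in regions) for i in range(N)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / (sxx * syy) ** 0.5

# Tornado ranking: inputs sorted by the size of their correlation with the output
tornado = sorted(
    ((name, pearson(draws[name], total)) for name in regions),
    key=lambda item: -abs(item[1]),
)
for name, r in tornado:
    print(f"{name:7s} r = {r:.2f}")
```

If one input's bar dwarfed the others, that component would be the candidate for further disaggregation.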
14.3.2 Distributions used in modelling expert opinion

This section describes the role of various types of probability distribution in modelling expert opinion.

Non-parametric and parametric distributions

Probability distribution functions fall into two categories: non-parametric and parametric distributions, the meanings of which are discussed in detail in Appendix III.3. A parametric distribution is based on a mathematical function whose shape and range are determined by one or more distribution parameters. These parameters often have little obvious or intuitive relationship to the distribution shapes they define. Examples of parametric distributions are: lognormal, normal, beta, Weibull, Pareto, loglogistic, hypergeometric - most distribution types, in fact. Non-parametric distributions, on the other hand, have their shape and range determined by their parameters directly in an obvious and intuitive way. Their distribution function is simply a mathematical description of their shape. Non-parametric distributions are: uniform, relative, triangular, cumulative and discrete. As a rule, non-parametric distributions are far more reliable and flexible for modelling expert opinion about a model parameter. The questions that the analyst poses to the expert to determine the distribution's parameters are intuitive and easy to respond to. Changes to these parameters also produce an easily predicted change in the distribution's shape and range. The application of each non-parametric distribution type to modelling expert opinion is discussed below. There are three common exceptions to the above preference for using non-parametric distributions to model expert opinion:

1. The PERT distribution is frequently used to model an expert's opinion. Although it is, strictly speaking, a parametric distribution, it has been adapted so that the expert need only provide estimates of the minimum, most likely and maximum values for the variable, and the PERT function finds a shape that fits these restrictions. The PERT distribution is explained more fully below.
2. The expert may occasionally be very familiar with the parameters that define the particular distribution. For example, a toxicologist may regularly determine the mean and standard error of a chemical concentration in a set of samples. In this case it might be quite helpful to ask the expert for the mean and standard deviation of his/her uncertainty about some concentration.
3. The parameters of a parametric distribution are sometimes intuitive, and the analyst can therefore ask for their estimation directly. For example, a binomial distribution is defined by n, the number of trials that will be conducted, and p, the probability of success of each trial. In cases where I consider the binomial distribution to be the most appropriate, I generally ask the expert for estimates of n and p, recognising that I will have to insert them into a binomial distribution, but I would try to avoid any discussion of the binomial distribution that might cause confusion. Note that the estimates of n and p can also be distributions themselves.

There are other problems associated with using parametric distributions for modelling expert opinion: a model that includes parametric distributions to represent opinion is more difficult to review later because the parameters of the distribution may have no intuitive appeal; and it is very difficult to get the precise shape right when using parametric distributions to model expert opinion, as the effects of changes in the parameters are not usually obvious.

[Figure 14.4 Examples of triangular distributions.]
The triangular distribution

The triangular distribution is the most commonly used distribution for modelling expert opinion. It is defined by its minimum (a), most likely (b) and maximum (c) values. Figure 14.4 shows three triangular distributions: Triangle(0, 10, 20), Triangle(0, 10, 50) and Triangle(0, 50, 50), which are symmetric, right skewed and left skewed respectively. The triangular distribution has a very obvious appeal because it is so easy to think about the three defining parameters and to envisage the effect of any changes. The mean and standard deviation of the triangular distribution are determined from its three parameters:

Mean = (a + b + c)/3
Standard deviation = √[(a² + b² + c² − ab − ac − bc)/18]

From these formulae it can be seen that the mean and standard deviation are equally sensitive to all three parameters. Many models involve parameters for which it is fairly easy to estimate the minimum and most likely values, but for which the maximum is almost unbounded and could be enormous. The central limit theorem tells us that, when adding up a large number of distributions (for example, adding costs or task durations), it is the distributions' means and standard deviations that are most important because they determine the mean and standard deviation of the risk analysis result. In situations where the maximum is so difficult to determine, the triangular distribution is not usually appropriate since the result will depend a great deal on how the estimation of the maximum is approached. For example, if the maximum is assumed to be the absolutely largest possible value, the risk analysis output will have a far larger mean and standard deviation than if the maximum is assumed to be a "practical" maximum by the estimating experts. The triangular distribution is often considered to be appropriate where little is known about the parameter outside an approximate estimate of its minimum, most likely and maximum values.
On the other hand, its sharp, very localised peak and straight lines produce a very definite and unusual (and very unnatural) shape, which conflicts with the assumption of little knowledge of the parameter.

[Figure 14.5 Example of a Trigen distribution.]

There is another useful variation of the triangular distribution, called Trigen in @RISK and TriangGen in Risk Solver, for example. The Trigen distribution requires five parameters, Trigen(a, b, c, p, q), which have the following meanings:

a: the practical minimum
b: the most likely value
c: the practical maximum
p: the probability that the parameter value could be below a
q: the probability that the parameter value could be below c

Figure 14.5 shows a Trigen(40, 50, 80, 5 %, 95 %) distribution, with the 5 % areas extending beyond the minimum and maximum (40 and 80 here). The Trigen distribution is a useful way of avoiding asking experts for their estimate of the absolute minimum and maximum of a parameter: questions that experts often have difficulty in answering meaningfully since there may theoretically be no minimum or maximum. Instead, the analyst can discuss what values of p and q the experts would use to define "practical" minima and maxima respectively. Once this has been decided, the experts only have to give their estimates for practical minimum, most likely and practical maximum for each estimated parameter, and the same p and q values are used for all their estimates. One drawback is that the expert may not appreciate the final range to which the distribution may extend, so it is wise to plot the distribution and have it agreed by the expert before using it in the model. The Tri1090 distribution, featured in @RISK, presumes that p and q are 10 % and 90 % respectively, which is generally about right, but I prefer to use the Trigen because it adapts to each expert's concept of "practical".
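One practical question with Trigen-style inputs is what underlying triangle actually gets sampled. Under the usual reading - a Triangle(A, b, C) whose p-th and q-th percentiles equal the stated practical minimum and maximum - A and C can be recovered numerically from the triangular CDF. The sketch below does this for the Trigen(40, 50, 80, 5 %, 95 %) example; the solving approach is my own illustration, not a vendor algorithm:

```python
import math

def tri_cdf(x, A, b, C):
    """CDF of Triangle(A, b, C)."""
    if x <= b:
        return (x - A) ** 2 / ((C - A) * (b - A))
    return 1.0 - (C - x) ** 2 / ((C - A) * (C - b))

def trigen_support(a, b, c, p, q):
    """Find the true minimum A and maximum C of a Triangle(A, b, C) whose
    p-th percentile is a and q-th percentile is c (Trigen-style inputs).
    Assumes A <= a <= b <= c <= C."""
    k = 1.0 - q

    def C_given_A(A):
        # (C - c)^2 = k (C - A)(C - b) rearranges to a quadratic in C;
        # the larger root is the one with C >= c.
        qa, qb, qc = q, k * (A + b) - 2.0 * c, c * c - k * A * b
        return (-qb + math.sqrt(qb * qb - 4.0 * qa * qc)) / (2.0 * qa)

    def residual(A):
        # Zero when F(a) = p on the rising side of the triangle.
        return (a - A) ** 2 - p * (C_given_A(A) - A) * (b - A)

    lo, hi = a - 100.0 * (c - a), a - 1e-9   # residual > 0 far left, < 0 near a
    for _ in range(200):                      # plain bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if residual(mid) > 0 else (lo, mid)
    A = 0.5 * (lo + hi)
    return A, C_given_A(A)

A, C = trigen_support(40.0, 50.0, 80.0, 0.05, 0.95)
print(f"underlying triangle: Triangle({A:.2f}, 50, {C:.2f})")
```

Plotting this recovered triangle is one way to show the expert the full range their "practical" estimates imply.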
The uniform distribution

The uniform distribution is generally a very poor modeller of expert opinion since all values within its range have equal probability density, but that density falls sharply to zero at the minimum and maximum in an unnatural way. The uniform distribution obeys the maximum entropy formalism (see Section 9.4) where only the minimum and maximum are known, but in my experience it is rare indeed that the expert will be able to define the minimum and maximum but have no opinion to offer on a most likely value. The uniform distribution does, however, have several uses: to highlight or exaggerate the fact that little is known about the parameter; to model circular variables (like the direction of wind from 0 to 2π) and other specific problems; to produce spider sensitivity plots (see Section 5.3.8).

The PERT distribution

The PERT distribution gets its name because it uses the same assumption about the mean (see below) as PERT networks (used in the past for project planning). It is a version of the beta distribution and requires the same three parameters as the triangular distribution, namely minimum (a), most likely (b) and maximum (c). Figure 14.6 shows three PERT distributions whose shape can be compared with the triangular distributions of Figure 14.4. The equation of a PERT distribution is related to the beta distribution as follows:

PERT(a, b, c) = Beta(α1, α2) × (c − a) + a

where

α1 = 6[(μ − a)/(c − a)], α2 = 6[(c − μ)/(c − a)]

and the mean

μ = (a + 4b + c)/6

The last equation for the mean is a restriction that is assumed in order to be able to determine values for α1 and α2. It also shows how the mean for the PERT distribution is four times more sensitive to the most likely value than to the minimum and maximum values.

[Figure 14.6 Examples of PERT distributions.]

[Figure 14.7 Comparison of the standard deviation of Triangle(0, most likely, 1) and PERT(0, most likely, 1) distributions.]
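These relationships translate directly into a sampler: draw from the underlying beta and rescale. A minimal sketch, using Python's built-in betavariate rather than any particular risk package:

```python
import random

def pert(a, b, c, rng=random):
    """Sample from PERT(a, b, c) via the scaled beta form:
    mu = (a + 4b + c)/6, alpha1 = 6(mu - a)/(c - a), alpha2 = 6(c - mu)/(c - a)."""
    mu = (a + 4.0 * b + c) / 6.0
    alpha1 = 6.0 * (mu - a) / (c - a)
    alpha2 = 6.0 * (c - mu) / (c - a)
    return a + rng.betavariate(alpha1, alpha2) * (c - a)

random.seed(11)
xs = [pert(0.0, 10.0, 50.0) for _ in range(100000)]
print(f"simulated mean: {sum(xs) / len(xs):.2f}")  # theory: (0 + 40 + 50)/6 = 15
```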
This should be compared with the triangular distribution, where the mean is equally sensitive to each parameter. The PERT distribution therefore does not suffer to the same extent the potential systematic bias problems of the triangular distribution, that is, producing too great a value for the mean of the risk analysis results where the maximum for the distribution is very large. The standard deviation of a PERT distribution is also less sensitive to the estimate of the extremes. Although the equation for the PERT standard deviation is rather complex, the point can be illustrated very well graphically. Figure 14.7 compares the standard deviations of triangular and PERT distributions that have the same a, b and c values. To illustrate the point, the figure uses values of 0 and 1 for a and c respectively and allows b to vary between 0 and 1, although the observed pattern extends to any {a, b, c} set of values. You can see that the PERT distribution produces a systematically lower standard deviation than the triangular distribution, particularly where the distribution is highly skewed (i.e. b is close to 0 or 1 in this case). As a general rough rule of thumb, cost and duration distributions for project tasks often have a ratio of about 2:1 between (maximum − most likely) and (most likely − minimum), equivalent to b = 0.3333 in Figure 14.7. The standard deviation of the PERT distribution at this point is about 88 % of that for the triangular distribution. This implies that using PERT distributions throughout a cost or schedule model, or any other additive model, will display about 10 % less uncertainty than the equivalent model using triangular distributions. Some readers would perhaps argue that the increased uncertainty that occurs with triangular distributions will compensate to some degree for the "overconfidence" that is often apparent in subjective estimating.
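The 88 % figure for b = 1/3 can be reproduced from the two closed forms: the triangular standard deviation formula given earlier, and the beta variance applied to the PERT's shape parameters:

```python
import math

a, c = 0.0, 1.0
b = 1.0 / 3.0  # the ~2:1 (max - mode):(mode - min) case discussed in the text

# Triangular standard deviation
tri_sd = math.sqrt((a*a + b*b + c*c - a*b - a*c - b*c) / 18.0)

# PERT standard deviation via its beta shape parameters
mu = (a + 4.0 * b + c) / 6.0
a1 = 6.0 * (mu - a) / (c - a)
a2 = 6.0 * (c - mu) / (c - a)
pert_sd = math.sqrt(a1 * a2 / ((a1 + a2) ** 2 * (a1 + a2 + 1.0))) * (c - a)

print(f"PERT sd / triangular sd = {pert_sd / tri_sd:.3f}")  # about 0.886
```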
The argument is quite appealing at first sight but is not conducive to the long-term improvement of the organisation's ability to estimate. I would rather see an expert's opinion modelled as precisely as is practical. Then, if the expert is consistently overconfident, this will become apparent with time and his/her estimating can be corrected.

The modified PERT distribution

The PERT distribution can also be manipulated to produce shapes with varying degrees of uncertainty for the same minimum, most likely and maximum by changing the assumption about the mean:

μ = (a + γb + c)/(γ + 2)

In the standard PERT, γ = 4, which is the PERT network assumption that μ = (a + 4b + c)/6. However, if we increase the value of γ, the distribution becomes progressively more peaked and concentrated around b (and therefore less uncertain). Conversely, if we decrease γ, the distribution becomes flatter and more uncertain. Figure 14.8 illustrates the effect of three different values of γ for a modified PERT(5, 7, 10) distribution.

Figure 14.8 Examples of modified PERT distributions with varying most likely weighting γ.

This modified PERT distribution can be very useful in modelling expert opinion. The expert is asked to estimate the same three values as before (i.e. minimum, most likely and maximum). Then a set of modified PERT distributions is plotted and the expert is asked to select the shape that fits his/her opinion most accurately. It is a fairly simple matter to set up a spreadsheet program that will do all this automatically.

The relative distribution

The relative distribution (also called the general in @RISK, and a version of the Custom in Crystal Ball) is the most flexible of all of the continuous distribution functions. It enables the analyst and expert to tailor the shape of the distribution to reflect, as closely as possible, the opinion of the expert.
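The modified PERT described above follows the same scaled-beta recipe, with α1 = (γ + 2)(μ − a)/(c − a) and α2 = (γ + 2)(c − μ)/(c − a) (an algebraic restatement on my part; it reduces to the standard formulas when γ = 4). A Python/scipy sketch showing the spread narrowing as γ grows:

```python
from scipy import stats

def modified_pert(a, b, c, gamma):
    """Modified PERT with mean mu = (a + gamma*b + c)/(gamma + 2)."""
    mu = (a + gamma * b + c) / (gamma + 2)
    alpha1 = (gamma + 2) * (mu - a) / (c - a)
    alpha2 = (gamma + 2) * (c - mu) / (c - a)
    return stats.beta(alpha1, alpha2, loc=a, scale=c - a)

# As in Figure 14.8: modified PERT(5, 7, 10) at several gamma values.
# Larger gamma -> more peaked around b = 7 -> smaller standard deviation.
for g in (2, 4, 10):
    print(g, modified_pert(5, 7, 10, g).std())
```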
The relative distribution has the form Relative(minimum, maximum, {xi}, {pi}), where {xi} is an array of x values with probability densities {pi} and where the distribution falls between the minimum and maximum. The {pi} values are not constrained to give an area under the curve of 1, since the software recalibrates the probability scale. Figure 14.9 shows a Relative(4, 15, {7, 9, 11}, {2, 3, 0.5}).

Figure 14.9 Example of a relative distribution.

The cumulative distribution

The cumulative distribution has the form CumulativeA(minimum, maximum, {xi}, {Pi}), where {xi} is an array of x values with cumulative probabilities {Pi} and where the distribution falls between the minimum and maximum. Figure 14.10 shows the distribution CumulativeA(0, 10, {1, 4, 6}, {0.1, 0.6, 0.8}) as it is defined in its cumulative form and how it looks as a relative frequency plot.

Figure 14.10 Example of a cumulative distribution and its relative frequency plot.

The cumulative distribution is used in some texts to model expert opinion. However, I have found it largely unsatisfactory because of the insensitivity of its probability scale. A small change in the shape of the cumulative distribution that would pass unnoticed produces a radical change in the corresponding relative frequency plot that would not be acceptable. Figure 14.11 provides an illustration: a smooth and natural relative frequency plot (A) is converted to a cumulative frequency plot (B) and then altered slightly (C). Converting back to a relative frequency plot (D) shows that the modified distribution is dramatically different to the original, although this would almost certainly not have been appreciated by comparing the cumulative frequency plots. For this reason, I usually prefer to model expert opinion looking at the relative frequency distribution instead.
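A relative distribution can be mimicked by discretising a piecewise-linear density. The sketch below is my own (Python/numpy), and it assumes the density falls to zero at the minimum and maximum, as the shape in Figure 14.9 suggests; it normalises the supplied weights just as the software does.

```python
import numpy as np

def relative_sample(minimum, maximum, xs, ps, size=10_000, seed=1):
    """Mimic Relative(min, max, {xi}, {pi}): a piecewise-linear density
    through (min, 0), (xi, pi), (max, 0), rescaled to unit area."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(minimum, maximum, 2001)
    dens = np.interp(grid, [minimum, *xs, maximum], [0.0, *ps, 0.0])
    return rng.choice(grid, size=size, p=dens / dens.sum())

# The Relative(4, 15, {7, 9, 11}, {2, 3, 0.5}) of Figure 14.9
samples = relative_sample(4, 15, [7, 9, 11], [2, 3, 0.5])
print(samples.mean())   # most of the mass sits between 7 and 11
```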
One circumstance where the cumulative distribution is very useful is in attempting to estimate a variable whose range covers several orders of magnitude. For example, the number of bacteria in 1 kg of meat will increase exponentially with time. The meat may contain 100 units of bacteria or 1 million. In such circumstances, it is fruitless to attempt to use a relative distribution directly. This point is discussed more fully in Section 14.3.3.

Figure 14.11 Example of how small changes in a distribution's cumulative plot can dramatically affect its shape.

The discrete distribution

The discrete distribution has the form Discrete({xi}, {pi}), where {xi} is an array of the possible values of the variable with probability weightings {pi}. The {pi} values do not have to add up to unity, as the software will normalise them automatically. It is actually often useful just to consider the ratio of likelihood of the different values and not to worry about the actual probability values. The discrete distribution can be used to model a discrete parameter (that is, a parameter that may take one of two or more distinct values), e.g. the number of turbines that will be used in a power station, and to combine two or more conflicting expert opinions (see Section 14.3.4).

14.3.3 Modelling opinion of a variable that covers several orders of magnitude

A continuous parameter whose uncertainty extends over several orders of magnitude generally cannot be modelled in the usual manner. For example, an expert may consider that 1 g of meat could contain any number of units of bacteria from 1 to 10 000 but that this figure is just as likely to be around 100 or 1000. If we were to model this estimate using a Uniform(1, 10 000) distribution, for example, we would almost certainly not match the expert's opinion of the values of the cumulative percentiles.
The expert would probably place the 25, 50 and 75 percentiles at about 10, 100 and 1000, where our model places them at 2500, 5000 and 7500 respectively. The reason for such a large discrepancy is that the expert is subconsciously making his/her estimate in log-space, i.e. s/he is thinking of the log10 values: log10 1 = 0, log10 10 = 1, log10 100 = 2, etc. To match the expert's approach to estimating, the analyst can also work in log-space, so the distribution becomes

Number of units of bacteria = 10^Uniform(0, 4)

Figure 14.12 compares these two interpretations of the expert opinion by looking at the cumulative distributions and statistics they would produce. The Uniform(1, 10 000) has a much larger mean and standard deviation than the 10^Uniform(0, 4) distribution and an entirely different shape:

                   Uniform(1, 10 000)   10^Uniform(0, 4)
Mean                    5000.5               1085
Std deviation           2886                 2062
Skewness                0                    2.4
Kurtosis                1.8                  5.2

Figure 14.12 Comparison of two ways to model expert opinion of a variable that covers several orders of magnitude.

If the expert had said instead that there could be between 1 and 10 000 units of bacteria in 1 g of meat, but the most likely number is around 500, we would probably have the greatest success in modelling this variable as

Number of units of bacteria = 10^PERT(0, 2.7, 4)

where log10 500 ≈ 2.7. If the variable is to be modelled as a 10^x type formula described above, it is judicious to compare the cumulative percentiles at a few sensible points with those the expert would expect. Any radical differences would suggest that the expert is not actually thinking in log-space and the cumulative distribution could be used instead.

14.3.4 Incorporating differences in expert opinions

Experts will sometimes produce profoundly different probability distribution estimates of a parameter. This is usually because the experts have estimated different things, made differing assumptions or have different sets of information on which to base their opinion.
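The two interpretations compared in Figure 14.12 (Section 14.3.3 above) can be checked with a quick simulation; a numpy sketch of my own, not taken from the book:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
u = rng.uniform(1, 10_000, n)        # naive interpretation
logu = 10 ** rng.uniform(0, 4, n)    # log-space interpretation

# Log-space reproduces the expert's 25/50/75 percentiles of 10/100/1000
print(np.percentile(logu, [25, 50, 75]))
print(u.mean(), logu.mean())         # ≈ 5000.5 versus ≈ 1086
```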
However, occasionally two or more experts simply genuinely disagree. How should the analyst approach the problem? The first step is usually to confer with someone more senior and find out whether one expert is preferred over the other. If those more senior have some confidence in both opinions, a method is needed to combine these opinions in some way.

Recommended approach

I have used the following method for a number of years with good results. Use a Discrete({xi}, {pi}) distribution where the {xi} are the expert opinions and the {pi} are the weights given to each opinion according to the emphasis one wishes to place on them. Figure 14.13 illustrates an example combining three differing opinions, but where expert A is given twice the emphasis of the others owing to the greater experience of that expert.

Figure 14.13 Combining three dissimilar expert opinions.

Two incorrect approaches are frequently used:

- Pick the most pessimistic estimate. This is generally unsatisfactory, as a risk analysis model should be attempting to produce an unbiased estimate of the uncertainty. The caution should only be applied at the decision-making stage after reviewing the risk analysis results.
- Take the average of the two distributions. This is incorrect as the resultant distribution will be too narrow. By way of illustration, consider the test situation where both experts believed a parameter should be modelled by a Normal(100, 10) distribution. Whatever technique was used to combine their opinions, the result should be the same Normal(100, 10) distribution. The average of these two distributions, i.e. AVERAGE(Normal(100, 10), Normal(100, 10)), would be a Normal(100, 10/√2) = Normal(100, 7.07) from the central limit theorem. In other words, we would have produced far too small a spread.
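The difference between averaging two opinions and picking one of them at random (the Discrete approach) is easy to demonstrate; a numpy sketch of my own, not from the book:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200_000
a = rng.normal(100, 10, n)    # expert A: Normal(100, 10)
b = rng.normal(100, 10, n)    # expert B: the same opinion

averaged = (a + b) / 2                        # wrong: averages the samples
picked = np.where(rng.random(n) < 0.5, a, b)  # Discrete: pick one opinion

print(averaged.std())   # ≈ 7.07 — spread shrunk by 1/sqrt(2)
print(picked.std())     # ≈ 10.0 — the agreed opinion is preserved
```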
I have been offered suggestions for other approaches to this problem:

- Take the weighted average of the relative or cumulative percentiles. This will correctly construct the combined distribution (it is how the ModelRisk function VoseCombined works), but it is very laborious to execute for all but the most simple distributions of opinion unless you have a library of density and cdf functions, so it is somewhat impractical to start from scratch.
- Multiply together the probability densities at each x value. This is incorrect because (a) it produces combined distributions with exaggerated peakedness, (b) the area under the curve is no longer 1 and (c) the combined distribution is contained between the highest minimum and the lowest maximum.

  SME     Min   Mode   Max   Distribution                 Weight
  Peter    11    13     17   VosePERT($C$3,$D$3,$E$3)      0.3
  Jane     12    13     16   VosePERT($C$4,$D$4,$E$4)      0.2
  Paul      8    10     13   VosePERT($C$5,$D$5,$E$5)      0.4
  Susan     9    10     15   VosePERT($C$6,$D$6,$E$6)      0.1

  Combined estimate (E8):   8.680244
  P(<14) (E9):              0.878805

Formulae table
F3:F6         =VosePERTObject(C3,D3,E3)
E8 (output)   =VoseCombined(F3:F6,G3:G6,B3:B6)
E9 (output)   =VoseCombinedProb(14,F3:F6,G3:G6,B3:B6,1)

Figure 14.14 Combining weighted SME estimates using VoseCombined functions.

ModelRisk has the function VoseCombined({Distributions}, {Weights}) and related probability calculation functions that perform the combination described above. In the model in Figure 14.14, four expert estimates are combined to construct the one estimate. The advantage of this function is that it then allows one to perform a sensitivity analysis on the estimate as a whole: if you were to use the Discrete({Distributions}, {Weights}) method, your Monte Carlo software would, in this case, be performing a sensitivity analysis of five distributions: the four estimates and the discrete distribution, which will dilute the perceived influence of the combined uncertainty.
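Without ModelRisk, the weighted combination can be sketched as a mixture of scaled-beta PERTs; the cdf of the mixture at 14 reproduces the P(<14) output of the model above. This is my reconstruction in Python/scipy, not the book's code:

```python
from scipy import stats

def pert(a, b, c):
    """PERT(a, b, c) as a scaled beta: mu = (a + 4b + c)/6."""
    mu = (a + 4 * b + c) / 6
    return stats.beta(6 * (mu - a) / (c - a), 6 * (c - mu) / (c - a),
                      loc=a, scale=c - a)

# The four SME estimates and weights from Figure 14.14
estimates = [pert(11, 13, 17), pert(12, 13, 16),
             pert(8, 10, 13), pert(9, 10, 15)]
weights = [0.3, 0.2, 0.4, 0.1]   # already sum to 1 here

# Weighted mixture cdf at 14 — the model's P(<14) output
p_below_14 = sum(w * d.cdf(14) for d, w in zip(estimates, weights))
print(p_below_14)   # ≈ 0.88
```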
In the model in Figure 14.14, the VoseCombined function generates random values from a distribution constructed by weighting the four SME estimates. The weights do not need to sum to 1: they will be normalised. The VoseCombinedProb(. . ., 1) function calculates the probability that this distribution will take a value less than 14. Note that the names of the experts are an optional parameter: this simply records who said what and has no effect on the calculation. However, select cell E8 and then click the Vf (View Function) icon from the ModelRisk toolbar and you will get the graph shown in Figure 14.15, which allows us to compare each SME's estimate and see how they are weighted.

Figure 14.15 Screen capture of graphic interface for the VoseCombined function used in the model of Figure 14.14.

14.4 Calibrating Subject Matter Experts

When subject matter experts (SMEs) are first asked to provide probabilistic estimates, they usually won't be particularly good at it because it is a new way of thinking. We need some techniques that allow us to help the SMEs gauge how well they are estimating and, over time, correct any biases they have. We may also need a method for selecting between or weighting SMEs' estimates.

Imagine that an SME has estimated that a bespoke generator being placed on a ship will cost $PERT(1.2, 1.35, 1.9) million, and we compare the actual outturn cost against that estimate. Let's say it ended up costing $1.83 million. Did the SME provide a good estimate? Well, it fell within the range provided, which is a good start, but it was at the high end, as Figure 14.16 shows. The 1.83 value fell at the 99.97th percentile of the PERT distribution. That seems rather high considering the SME's estimate lay from 1.2 to 1.9 and 1.83 is only 90 % along that range, but it is the result of how the PERT distribution interprets the minimum, mode and maximum values.

Figure 14.16 An SME estimate.

The distribution is
quite right skewed, in which case the PERT has a thin right tail - in fact it assigns only a 1 % probability to values larger than 1.73. For this exercise, however, we'll assume that the SME had seen the plots above and was comfortable with the estimate.

We can't be certain with just one data point that the SME tends to underestimate. In areas like engineering, capital investment and project planning, one SME will often provide many estimates over time, so let's imagine we repeat the exercise some 10 times and determine the percentile at which each outturn cost lies on each corresponding distribution estimate. In theory, if our SME was perfectly calibrated, these would be random samples from a Uniform(0, 1) distribution, so the mean should rapidly approach 0.5. A Uniform(0, 1) distribution has a variance of 1/12, so the mean of 10 samples from a perfectly calibrated SME should, from the central limit theorem, fall on a Normal(0.5, 1/SQRT(12 * 10)) = Normal(0.5, 0.091287). If the 10 values average to 0.7, we can be pretty sure that the SME is underestimating, since there is only a (1 - NORMDIST(0.7, 0.5, 0.091287, 1)) = 1.4 % chance that a perfectly calibrated SME would have produced a value of 0.7 or larger. Similarly, we can analyse the variance of the 10 values. It should be close to 1/12: if the variance is smaller then the SME's distributions are too wide, or, as is more likely, if the variance is larger then the SME's distributions are too narrow.

The above analysis assumes, of course, that all the estimates actually fell within the SME's distribution range, which may well not be the case. The plots in Figure 14.17 can help provide a more comprehensive picture.

Experts are also sometimes asked to estimate the probability that an event will occur, which is no easy task. In theory one can roughly estimate how good an SME is at providing these estimates by grouping estimated probabilities into bands (e.g.
the same bands as in Figure 14.17) and determining what fraction of those risk events actually occurred. Obviously, around 15 % of risks that were thought to have between 10 % and 20 % chance of occurring should actually occur. However, this breaks down at the lowest and highest categories because many identified potential risks are perceived to have a very small probability of occurrence, so we will almost never actually have any observations.

14.5 Conducting a Brainstorming Session

When the initial structure of the problem has been decided and subjective estimates of the key uncertainties are now required, it is often very useful to conduct one or more brainstorming sessions with several experts in the area of the problem being analysed. If the model covers several different disciplines, for example engineering, production, marketing and finance, it may be better to hold a brainstorming session for each discipline group as well as one for everybody. The objectives of the brainstorming session are to ensure that everyone has the same information pertinent to the problem and then to debate the uncertainties of the problem.

In some risk analysis texts, the analyst is encouraged to determine a distribution of each uncertain parameter during these meetings. I have tried this approach and find it very difficult to do well because it relies very heavily on controlling the group's dynamics: ensuring that the loudest voice does not get all the air time; encouraging the individuals to express their own opinion rather than following the leader; etc. These meetings can also end up dragging on, and some of the experts may have to leave before the end of the session, reducing its effectiveness. My aim in brainstorming sessions is to ensure that all those attending leave with a common perception of the risks and uncertainties of the problem.
This is achieved by doing the following:

- Gathering all relevant information and circulating it to the attending experts prior to the meeting.
- Presenting data in easily digested forms, e.g. using scatter plots, trend charts, statistics and histograms wherever possible rather than columns of figures.
- At the meeting, encouraging discussion of the variability and uncertainty in the problem, including the logical structure and any correlations.
- Discussing scenarios that would produce extreme values for the uncertain variables to get a feel for the true extent of the total uncertainty. Some of the experts may also have extra information to add to the pot of knowledge.
- The analyst, acting as chairperson, ensuring that the discussion is structured.
- Taking minutes of the meeting and circulating them afterwards to the attendees.

Figure 14.17 Histogram of SME outturn percentiles. Percentiles are grouped into 10 bands so roughly 10 % of the percentile scores should lie in each band (when there are a lot of scores). Expert A is well calibrated. Expert B provides estimates that are too narrow and tends to underestimate. Expert C provides estimates that are far too wide and tends to overestimate.

After a suitable, but short, period for contemplation following the brainstorming session, the analyst conducts individual interviews with each expert and attempts to determine their opinions of the uncertainty of each variable that was discussed. The techniques for eliciting these opinions are discussed in Section 14.6.1. Since all the experts will have the same level of knowledge, they should produce similar estimates of uncertainty.
Where there are large differences between opinions, the experts can be reconvened to discuss the issue. If no agreement can be reached, the conflicting opinions can be treated as described in Section 14.3.4. I believe that this procedure has several distinct benefits over attempting to determine distributions during brainstorming sessions:

- Each expert has been given the time to think about the problem.
- They are encouraged to develop their own opinion after the benefit of discussion with the other experts.
- A quiet individual is given as much prominence as a dominating one.
- Differences in opinion between experts are easier to identify.
- The whole process can be conducted in a much more orderly fashion.

14.6 Conducting the Interview

Initial resistance

Expert opinion of the uncertainty of a parameter is generally determined in a one-to-one interview between the relevant expert and the analyst developing the model. In preparing for such interviews, analysts should make themselves familiar with the various techniques for modelling expert opinion described earlier in this chapter. They should also be familiar with the various sources of biases and errors involved in subjective estimation. The experts, in their turn, having been informed of the interviews well in advance, should have evaluated any relevant information either on their own or in a brainstorming session as described above.

There is occasionally some initial resistance by the experts to providing estimates in the form of distributions, particularly if they have not been through the process before. This may be because they are unfamiliar with probability theory. Alternatively, they may feel they know so little about the variable (perhaps because it is so uncertain) that they would find it hard enough to give a single point estimate, let alone a whole probability distribution. I like to start by explaining how, by using uncertainty distributions, we are allowing the experts to express their lack of certainty.
I explain that providing a distribution of the uncertainty of a parameter does not require any great knowledge of probability theory. Neither does it demand a greater knowledge of the parameter itself than a single-point estimate - quite the reverse. It gives the experts a means to express their lack of exact knowledge of the parameter. Where in the past their single-point estimates were always doomed never to occur precisely, their estimates now using distributions will be correct if the actual value falls anywhere within the distribution's range.

The next step is to discuss the nature of the parameter's uncertainty. I prefer to let the experts explain how they see the logic of the uncertainty, rather than impose on them a structure I may have had in mind, and then to model what I hear.

Opportunity to revise estimates

Experts are usually more comfortable about providing estimates if they are told before the interviews that they have the opportunity to revise their estimates at a later date. It is also good practice to leave the experts with a printed copy of each estimate and to get them to sign a copy for the analyst's records. Note that the copy should have a date on it. This is important since the experts' opinion could change dramatically after the occurrence of some event or the acquisition of more data.

14.6.1 Eliciting distributions of the expert opinion

Once the model has been sufficiently disaggregated, it is usually not necessary to provide very precise estimates of each individual component of the model. In fact, three-point estimates are usually quite sufficient, the three points being the minimum, most likely and maximum values the expert believes the value could take. These three values can be used to define either a triangular distribution or some form of PERT distribution.
My preference is to use a modified PERT, as described in Section 14.3.2, because it has a natural shape that will invariably match the expert's view better than a triangular distribution would. The analyst should attempt to determine the expert's opinion of the maximum value first and then the minimum, by considering scenarios that could produce such extremes. Then, the expert should be asked for his/her opinion of the most likely value within that range. Determining the parameters in the order (1) maximum, (2) minimum and (3) most likely will go some way to removing the "anchoring" error described in Section 14.2.2.

Occasionally, a model will not disaggregate evenly into sufficiently small components, leaving the model's outputs strongly affected by one or more individual subjective estimates. When this is the case, it is useful to employ a more rigorous approach to eliciting an expert's opinion than a simple three-point estimate. In such cases, the modified PERT distribution is a good start but, on review of the plotted distribution, the expert might still want to modify the shape a little. This can be done with pen and graph paper as shown in Figure 14.18.

Figure 14.18 Graphing distribution of expert opinion.

In this example, the marketing manager believes that the amount of wool her company will sell next month will be at least 5 metric tons (mt), no more than 10 mt and most probably about 7 mt. These figures are then used to define a PERT distribution that is printed out onto graph paper. On reflection, the manager decides that there is a little too much emphasis being placed on the right tail and draws out a more realistic shape. The revised curve can then be converted to a relative distribution and entered into the model. Crosses are placed at strategic points along the curve so that drawing straight lines between these crosses will produce a reasonable approximation of the distribution.
Then the x- and y-axis values are read off for each point and noted. Finally, the manager is asked to sign and date the figure for the records. The above technique is flexible, quite accurate and reassuringly transparent to the expert being questioned.

This technique can now also be done without the need for pen and paper, using RISKview software. Figure 14.19 illustrates the same example using RISKview. The PERT(5, 7, 10) distribution (top panel) is moved to the Distribution Artist facility of RISKview and automatically converted into a relative distribution (bottom panel) with a user-defined number of points (10 in this example). This distribution can now be modified to better reflect the expert's opinion by sliding the points up and down. The modified distribution can also immediately be viewed as an ascending or descending cumulative frequency plot to allow the expert to see if the cumulative percentiles also make sense. When the final distribution has been settled on, it can be directly inserted into a spreadsheet model at the click of an icon.

14.6.2 Subjective estimation of discrete probabilities

Experts will sometimes be called upon to provide an estimate of the probability of occurrence of a discrete event. This is a difficult task for experts. It requires that they have some feel for probabilities, which is both difficult for them to acquire and to calibrate. If the discrete event in question has occurred in the past, the analyst can assist by presenting the data and a beta distribution of the probabilities possible from that data (see Section 8.2.3). The experts can then give their opinion based on the amount of information available. However, it is quite usual that past information has no relevance to the problem at hand. For example, political analysts cannot look to past general election results for guidance in estimating whether the Labour Party will win the next general election.
They will have to rely on their gut feeling based on their understanding of the current political climate. In effect, they will be asked to pick a probability out of the air - a daunting task, complicated by the difficulty of having to visualise the difference between, say, 60 and 70 %. A possible way to avoid this problem is to offer experts a list of probability phrases, for example:

- almost certain;
- very likely;
- highly likely;
- reasonably likely;
- fairly likely;
- even chance;
- fairly unlikely;
- reasonably unlikely;
- highly unlikely;
- very unlikely;
- almost impossible.

Figure 14.20 Visual aid for estimating probabilities: A = 1 %, B = 5 %, C = 10 %, D = 20 %, E = 30 %, F = 40 %, G = 50 %, H = 60 %, I = 70 %, J = 80 %, K = 90 %, L = 95 %, M = 99 %.

The phrases are ranked in order and the experts are told of this ranking. They are then asked to select a phrase that best fits their understanding of the probability of each event that has to be considered. At the end of the session, they are also asked to match as many of the phrases as possible to visual representations of probability, for example matching a phrase to the probability of picking out a black ball at random from the trays of Figure 14.20. Since we know the percentage of black balls in each tray, we can associate a probability with each phrase and thus with each estimated event.

14.6.3 Subjective estimation of very low and very high probabilities

Risk analysis models occasionally incorporate very unlikely events, i.e. those with a very low probability of occurrence. It is recommended that readers review Section 4.5.1 before deciding to incorporate rare events into their model. The risk of the rare event is usually modelled as the probability of its occurrence combined with a distribution of its impact should it occur. An example might be the risk of a large earthquake on a chemical plant.
The distribution of impact on the chemical plant (in terms of damage and lost production, etc.) can be reasonably estimated since there is a basis on which to make the estimation (the components most at risk in an earthquake, the cost of replacement, the time required to effect the repairs, production rates, etc.). However, the probability of an earthquake is far less easy to estimate. Since it is so rare, there will be very few recorded occurrences on which to base the estimate of probability. When data are not available to determine estimates of probability of very unlikely events, experts will often be consulted for their opinion. Such consultation is fraught with difficulties. Experts, like the rest of us, are very unlikely to have any feel whatsoever for low probabilities unless there is a reasonable amount of data on which to base their estimates (in which case they can offer their opinion based around the frequency of observed occurrences). The best that the experts can do is to make some comparison with the frequency of other low-probability events whose probabilities are well defined. 
Figure 14.21 offers a list of well-determined low probabilities in a graphical format that the reader may find helpful in this regard: a risk ladder of the annual risk of dying in the US (number of deaths per 1 000 000), ranging from an 80-year-old dying before age 81 (60 000 per million) down to being hit on the ground by a falling airplane or a hurricane (well below 1 per million).

Figure 14.21 Illustration of a risk ladder (for the USA) to aid in expert elicitation (from Williams, 1999, with the author's permission).

This inaccuracy in estimating the probability of a rare event will have a very large impact on a risk analysis. Consider two experts estimating the expected cost of the risk of a gas turbine failing. They agree that it would cost the company about £600 000 ± £200 000 should it fail. However, the first expert estimates the probability of the event as 1:1000/year and the second as 1:5000/year. Both see the probability as very low, but the expected cost for the first estimate is 5 times that of the second, i.e. £600 000 × 1/1000 = £600 compared with £600 000 × 1/5000 = £120.

An estimate of the probability of a rare event can sometimes be broken down into a series of consecutive events that are easier to determine. For example, the failure of the cooling system of a nuclear power plant would require a number of safety mechanisms all to fail at the same time.
The probability of failure of the cooling system is then the product of the probability of failure of each safety mechanism, each of which is usually easier to estimate than the total probability of the event. As another example, this technique is also enjoying increasing popularity in epidemiology for the assessment of the risks of introducing exotic diseases to a country through imports. The imported entity (animal, vegetable or product of either) must first have the disease. Then it must slip through any quality checks in its country of origin. After that, it must still slip through quarantine checks in the importing country, and finally it has to infect a potential host. Each step has a probability (which may often be broken down even further) which is estimated, and these probabilities are then multiplied together to determine the final probability of the introduction of the disease from one animal.

Chapter 15 Testing and modelling causal relationships

Testing and modelling causal relationships is the subject of plenty of books. I recommend Pearl (2000), Neapolitan (2004) and Shipley (2002) because they are thorough, fairly readable if you're good at mathematics and take a practical viewpoint. The technical details of causal inference lie very firmly in the domain of statistics, so I'll leave it to these books to explain them. In this chapter I want to look at some practical issues of causality from a risk analysis perspective. The main impetus for including this as a separate topic is to help you avoid some of the nonsense that I have come across over the years while reviewing models and scientific papers, battling in court as an expert witness or just watching the news on TV. There are a few simple, very practical and intuitive rules that will help you test a hypothesised causal relationship.
Causal inference is mostly applied to health issues, although the thinking has potential applications in other areas such as econometrics (in his book, Pearl laments the lack of rigorous causal thinking in current econometric practices), so I am going to use health issues as examples in this chapter. We can attempt to use a causal model to answer three different types of question:

Predictions - what will happen given a certain set of conditions?
Interventions - what would be the effect of controlling one or more conditions?
Counterfactuals - what would have happened differently if one or more conditions had been different?

In a deterministic (non-random) world there is a straightforward interpretation of causality. CSI Miami and its derivatives, and all those medical dramas, are such fun programmes because we viewers try to figure out what really happened - what caused this week's murder(s) - and of course the programme always finishes with a satisfyingly unequivocal solution. I was once stranded in a US airport hotel in which a real-world CSI conference was taking place, and the attendees were keen to tell me how their reality was rather different. They don't have the flashy cars, cool clothes, ultrasophisticated equipment or trendy offices bathed in moody light. More importantly, when they search a database of fingerprints, it comes up with a list, if they're lucky, of a dozen or so possible candidates, probably with "whereabouts unknown". For them, the truth is far more elusive. In the risk analysis world we have to work with causal relationships that are usually probabilistic in nature, for example: the probability of having lung cancer within your life is x if you smoke; the probability of having lung cancer within your life is y if you don't smoke. We all know that x > y, which makes being a smoker a risk factor. But life is more complicated than that: there is a biological gradient, meaning in this case that the more you smoke, the more likely the cancer.
If we were to do a study designed to determine the causal relationship between smoking and cancer, we should look not just at whether people smoked at all, but at how much a person has smoked, for how long and in what way (cigars, cigarettes with or without filters, pipes, little puffs or deep inhaling, brand, etc.). Things are further complicated because people can change their smoking habits over time. How about: the probability of having lung cancer within your life is a if you carry matches; the probability of having lung cancer within your life is b if you don't carry matches. I haven't done the study, but I bet a > b, although carrying matches should not be a risk factor. A correct statistical analysis will determine the high correlation between carrying matches (or lighters) and using tobacco products. A sensible statistician would figure out that matches should be removed from the analysis. An uncontrolled statistical analysis can produce some silly results (imagine we had no idea that tobacco could be related to cancer and didn't collect any tobacco-related data), so we should always apply some disciplined thinking to how we structure and interpret a statistical model. We need a few definitions to begin:

A risk factor is an aspect of personal behaviour or lifestyle, environment or characteristic thought to be associated positively or negatively with a particular adverse condition.

A counterfactual world is an epidemiological hypothetical idea of a world similar to our own in all ways but for which the exposure to the hazard, or people's behaviour or characteristics, or some other change that affects exposure, has been changed in some way.

The population attributable risk (PAR) (aka population aetiological fraction, among many others) is the proportion of the incidence in the population attributable to exposure to a risk factor.
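One standard way of computing a PAR, given the prevalence of exposure and the relative risk, is Levin's formula. The text does not specify a particular formula, so treat this as one common choice among several, and the input numbers below are invented for illustration:

```python
def levin_par(exposure_prevalence, relative_risk):
    """Levin's formula for population attributable risk:
    PAR = p_e*(RR - 1) / (1 + p_e*(RR - 1)),
    where p_e is the prevalence of exposure in the population
    and RR is the relative risk in the exposed group."""
    excess = exposure_prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)

# Hypothetical inputs: 25 % of the population smokes, and the relative
# risk of lung cancer for smokers vs non-smokers is 10.
par = levin_par(0.25, 10.0)
print(round(par, 3))  # 0.692 -> roughly 69 % of cases attributable
```

Note that if the relative risk is 1 (no effect), the formula correctly returns a PAR of zero.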
It represents the fraction by which the incidence in the population would have been reduced in a counterfactual world where the effect associated with that risk factor was not present. These concepts are often used to help model what the future might look like if we were to eliminate a risk factor, but we need to be careful, as they technically only refer to the comparison of an observed world and a counterfactual parallel world in which the risk factor does not appear - making predictions of the future means that we have to assume that the future world would look just like that counterfactual one. In figuring out the PAR, we may well have to consider interactions between risk factors. Consider the situation where the presence of either of two risk factors gives an extremely high probability of the risk of interest, and where a significant fraction of the population is exposed to both risk factors. In this case there is a lot of overlap, and an individual risk factor has less impact because the other risk factor is competing for the same victims. On the other hand, exposure to two chemicals at the same time might produce a far greater effect than either chemical alone. We talk about synergism and antagonism when the risk factors work together or against each other respectively. Synergism is more common, so the PAR for the combination of two or more risk factors is usually less than the sum of their individual PARs.

15.1 Campylobacter Example

A large survey conducted by the CDC (the highly reputable Centers for Disease Control and Prevention) in the United States looked at why people end up getting a certain type of food poisoning (campylobacteriosis). You get campylobacteriosis when bacteria called Campylobacter enter your intestine, find a suitably protected location and multiply (form a colony).
Thus, the sequence of events resulting in campylobacteriosis must include some exposure to the bacteria, then survival of those bacteria through the stomach (the acid can kill them), then setting up a colony. In order for us to observe the infection, that person has to become ill. In order to identify the disease as campylobacteriosis, a doctor has to ask for a stool sample, it has to be provided, the stool sample has to be cultured and the Campylobacter have to be isolated and identified. Campylobacteriosis will usually resolve itself after a week or so of unpleasantness, so many more people therefore have campylobacteriosis than a healthcare provider will observe. The US survey looked at behaviour patterns of people with confirmed cases, tried to match them with others of the same sex, age, etc., known not to have suffered from a foodborne illness and looked for patterns of differences. This is called a case-control study. Some of the key factors were as follows (+ meaning positively associated with illness, - meaning negatively associated):

1. Ate barbecued chicken (+).
2. Ate in a restaurant (+).
3. Were male and young (+).
4. Had healthcare insurance (+).
5. Were in a low socioeconomic band (+).
6. There was another member of the family with an illness (+).
7. The person was old (+).
8. Regularly ate chicken at home (-).
9. Had a dog or cat (+).
10. Worked on a farm (+).

Let's see whether this matches our understanding of the world: Factor 1 makes sense since Campylobacter naturally occur in chicken and are very frequently to be found in chicken meat. People are also somewhat less careful with their hygiene and the cooking is less controlled at a barbecue (healthcare tip: when you've cooked a piece of meat, place it on a different plate than the one used to bring the raw meat to the barbecue).
Factor 2 makes sense because of cross-contamination in the restaurant kitchen, so you might eat a veggie burger but still have consumed Campylobacter originating from a chicken. Factor 3 makes sense because we guys tend not to pay much attention to kitchen practices when we're young and start off rather hopeless when we first leave home. Factor 4 makes sense in that, in the USA, visiting a doctor is expensive, and that is the only way the healthcare system will observe the infection. Factor 5 maybe seems right because poorer people will eat cheaper-quality food and will visit restaurants with higher capacity and lower standards (related to factor 2). Factor 6 is obvious since faecal-oral transmission is a known route (healthcare tip: wash your hands very well, particularly when you are ill). Factor 7 makes sense because older people have a less robust immune system, but maybe they also eat in restaurants more (less?) often, maybe they like chicken more, etc. Factor 8 seems strange. It appears from a number of studies that if you eat chicken at home you are less likely to get ill. Maybe that is because it displaces eating chicken at a restaurant, maybe it's because people who cook are wealthier or care more about their food, or maybe (the current theory) it is because these people get regular small exposures to Campylobacter that boost their immune system. Factor 9 is trickier. Perhaps pet food contains Campylobacter, or perhaps the animal gets uncooked scraps, then cross-infects the family. Factor 10 makes sense. People working in chicken farms are obviously more at risk, but a farm will often have just a few chickens, or will buy in manure as fertiliser or used chicken bedding as cattle feed. Other animals also carry Campylobacter. Each of the above is a demonstrable risk factor because each passed a test of statistical significance in this study (and others) and one can find a possible rational explanation.
Of course, the possible rational explanation is often to be expected because the survey was put together with questions that were designed to test suspected risk factors, not the ones that weren't thought of. Note that the causal arguments are often interlinked in some way, making it difficult to figure out the importance of each factor in isolation. Statistical software can deal with this given the appropriate control.

15.2 Types of Model to Analyse Data

Data can be analysed in several different ways in an attempt to determine the magnitude of hypothesised causal relationships between variables (possible risk factors). Note that these models will never prove a causal relationship, just as it is not possible to prove a theory, only disprove it.

Neural nets - look for patterns within datasets between several variables associated with a set of individuals. They can find correlations within datasets, and make predictions of where a new observation might lie on the basis of values for the conditioning variables, but they do not have a causal interpretation and tend to be rather black box in nature. Neural nets are used a lot in profiling. For example, they are used to estimate the level of credit risk associated with a credit card or mortgage applicant, or to identify a possible terrorist or smuggler at an airport. They don't seek to determine why a person might be a poor credit risk, for example, just match the typical behaviour or history of someone who fails to pay their bills - things like having defaulted before, changing jobs frequently, not owning a home.

Classification trees - can be used to break down case-control data to list from the top down the most important factors influencing the outcome of interest. This is done by looking at the difference in the fraction of cases and controls that have the outcome of interest (e.g. disease) when they are split by each possible explanatory variable.
So, for example, in a case-control study of lung cancer, one might find that the fraction of people with lung cancer is much larger among smokers than among non-smokers, which forms the first fork in the tree. Looking then at the non-smokers only, one might find that the fraction of people with lung cancer is much higher for those who worked in a smoky environment compared with those who did not. One continually breaks down the population splits, figuring out which variable is the next most correlated with a difference in the risk, until you run out of variables or statistical significance.

Regression models - logistic regression is used a lot to determine whether there is a possible relationship between variables in a dataset and the variable to be predicted. The probability of a "success" (e.g. exhibiting the disease) of a dichotomous (two-possible-outcome) variable we wish to predict, pi, is related to the various possible influencing variables by a regression equation, for example

ln(pi / (1 - pi)) = b0 + b1xi1 + b2xi2 + ... + bkxik

where subscript i refers to each observation, and subscript j refers to each possible explanatory variable in the dataset, of which there are k in total. Stepwise regression is used in two forms: forward selection starts off with no predictive variables and sequentially adds them until there is no statistically significant improvement in matching the data; backward selection has all variables in the pot and keeps taking away the least significant variable until the model's statistical predictive capability begins to suffer. Logistic regression can take account of important correlations between possible risk factors by including covariance terms. Like neural nets, it has no in-built causal thinking.

Bayesian belief networks (aka directed acyclic graphs) - visually, these are networks of nodes (observed variables) connected together by arcs (probabilistic relationships).
They offer the closest connection to causal inference thinking. In principle you could let DAG software run on a set of data and come up with a set of conditional probabilities - it sounds appealing and objectively hands-off, but the networks need the benefit of human experience to know the direction in which these arcs should go, i.e. what the directions of influence really are (and whether they exist at all). I'm a firm believer in assigning some constraints to what the model should test, but make sure you know why you are applying those constraints. To quote Judea Pearl (Pearl, 2000): "[C]ompliance with human intuition has been the ultimate criterion of adequacy in every philosophical study of causation, and the proper incorporation of background information into statistical studies likewise relies on accurate interpretation of causal judgment". Commercial software is available for each of these methods. The algorithms they use are often proprietary and can give different results on the same datasets, which is rather frustrating and presents some opportunities to those who are looking for a particular answer (don't do that). In all of the above techniques, it is important to split your data into a training set and a validation set to test whether the relationships that the software finds in the training set will let you reasonably accurately (i.e. at the decision-maker's required accuracy) predict the outcome observations in the validation dataset. Best practice involves repeated random splitting of your data into training and validation sets.

15.3 From Risk Factors to Causes

Let's say that you have completed a statistical analysis of your data and your software has come up with a list of risk factors. The numerical outputs of your statistical analysis will allow you to calculate the PAR for each factor, and here you should apply a little common sense because the PAR relates to the decision question you are answering.
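To make the logistic regression relationship from Section 15.2 concrete, here is a minimal sketch of computing pi from a fitted set of coefficients. The coefficient values and variables are invented for illustration; the fitting itself would be done by statistical software:

```python
import math

def logistic_probability(coeffs, x):
    """Computes p_i = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk))),
    the probability of 'success' under a logistic regression model.
    coeffs[0] is the intercept b0; coeffs[1:] are b1..bk."""
    z = coeffs[0] + sum(b * xj for b, xj in zip(coeffs[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

# Invented fitted coefficients: intercept, smoker indicator (0/1),
# age in decades
coeffs = [-2.0, 1.5, 0.3]

p_smoker_60 = logistic_probability(coeffs, [1, 6.0])
p_nonsmoker_60 = logistic_probability(coeffs, [0, 6.0])
print(round(p_smoker_60, 3), round(p_nonsmoker_60, 3))  # 0.786 0.45
```

The logit link guarantees that pi always lies between 0 and 1, whatever values the explanatory variables take.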
Let me take the campylobacteriosis study as an example. You first need to know a couple of things about Campylobacter. It does not survive long outside its natural host (animals like chickens, ducks and pigs, where it causes no illness) and so it does not establish reservoirs in the ground, in water, etc. It also does not generally stay long in a human gut, although many people could be harbouring the bacteria unknowingly. This means that, if we were to eliminate all the Campylobacter at their animal sources, we would no longer have human campylobacteriosis cases (ignoring infections from travelling). I was lead risk analyst for the US FDA, where we wanted to estimate the number of people who are infected with fluoroquinolone-resistant Campylobacter from poultry - fluoroquinolone is used to treat poultry (particularly chickens) for the respiratory disease they get from living in sheds with poor ventilation, where the ammonia strips out the lining of their lungs. We reasoned: if say 100 000 people were getting campylobacteriosis from poultry, and say 10 % of the poultry Campylobacter were fluoroquinolone resistant, then about 10 000 were suffering campylobacteriosis that would not be treatable by administering fluoroquinolone (the antimicrobial is also often used to treat suspected cases of food poisoning). We used the CDC study and their PAR estimates. The case ended up going to court, and a risk analyst hired by the opposing side (the drug sponsor, who sold a lot more of their antimicrobial to chicken farms than to humans) got the CDC data under the Freedom of Information Act and did a variety of statistical analyses using various tools. He concluded: "A more realistic assessment based on the CDC case-control data is that the chicken-attributable fraction for [the pathogen] is between -11.6 % (protective effect) and 0.72 % (not statistically significantly different from zero) depending on how missing data values are treated".
In other words, he is saying with this -11.6 % attributable fraction figure that chicken is protective, so in a counterfactual world without chicken contaminated with Campylobacter there would be more campylobacteriosis, i.e. if we could remove the largest source of exposure we have to Campylobacter (poultry), more people would get ill. Put another way, he believes that the Campylobacter on poultry are protective, but the Campylobacter from other sources are not. Using classification trees, for example, he determined that the major risk factors were, in descending order of importance: visiting a farm, travelling, having a pet, drinking unprocessed water, being male (then eating ground beef at home, eating pink hamburgers and buying raw chicken) or being female (and then having no health insurance, eating high levels of fast food, eating hamburgers at home . . . and finally, eating fried chicken at home). Note that chicken is at the bottom of both sequences. So how did this risk analyst manage to justify his claim that eating chicken was actually protective - it did not pose a threat of campylobacteriosis? He did so by misinterpreting the risk factors. There is really no sense in considering a counterfactual world where people are all neuter (neither male nor female) - and anyway, since we don't have any of those, we have no idea how their behaviour will be different from males or females. Should we really be including whether people have insurance as a risk factor to which we assign a PAR? I think not. It is perhaps true that all these factors are associated with the risk - meaning that the probability of campylobacteriosis is correlated with each factor, but they are not risk factors within the context of the decision question. I don't think that by paying people's health insurance we would likely change the number of illnesses, although we would of course change the number reported and treated. 
What we hope to achieve is an understanding of how much disease is caused by Campylobacter from chicken, so the level of total human illness needs to be distributed among the sources of Campylobacter. That brings some focus to the PAR calculations: dining in a restaurant is only a risk factor because Campylobacter is in the restaurant kitchen. How did it get there? Probably chickens mostly, but also ducks and other poultry, although the US eats those in far lower volumes. It could also sometimes be a kitchen worker with poor hygiene unknowingly carrying Campylobacter, but where did that worker originally get the infection? Most probably from chicken. The sex* of a person is no longer relevant. Having a pet (it was mostly puppies) is a debatable point, since the puppy probably became infected from contaminated meat rather than being a natural carrier itself.

* Not "gender", which I found out one day listening to a debate in the UK House of Lords is what one feels oneself to be, while "sex" is defined by the reproductive equipment with which we are born.

Looking just at Campylobacter sources, we get a better picture, and, although regular small amounts of exposure (eating at home) may be protective, this is protecting against other, mostly chicken-derived, Campylobacter exposure, and we end up with the same risk attribution that CDC determined from its own survey data. We won the court case, and the other risk analyst's testimony was, very unusually, rejected as being unreliable - in no small part because of his selective and doctored quoting of papers.

15.4 Evaluating Evidence

The first test of causality you should make is to consider whether there is a known or possible causal mechanism that can connect two variables together.
For this, you may need to think out of the box: the history of science is full of examples where people considered something impossible, in spite of an enormous amount of evidence to the contrary, because they were so firmly attached to their pet theory. The second test is temporal ordering: if a change in variable A has an effect on variable B, then the change in A should occur before the resultant change in B. If a person dies of radiation poisoning (B) then that person must have received a large dose of radiation (A) at some previous time. We can often test for temporal ordering with statistics, usually some form of regression. But be careful, temporal ordering doesn't imply a causal relationship. Imagine you have a variable X that affects variables A and B, but B responds faster than A. If X is unobserved, all we see is that A exhibits some behaviour that strongly correlates in some way with the previous behaviour of B. The third test is to determine in some way the size of the possible causal effect. That's where statistics comes in. From a risk analysis perspective, we are usually interested in what we can change about the world. That ultimately implies that we are only really interested in determining the magnitude of the causal relationships between variables we can control and those in which we are interested. Risk analysts are not scientists - our job is not to devise new theories but to adapt the current scientific (or financial, engineering, etc.) knowledge to help decision-makers make probabilistic decisions. However, as a breed, I like to think that we are quite adept at stepping back and asking whether a tightly held belief is correct, and then posing the awkward questions. 
It's quite possible that we can come up with an alternative explanation of the world supported by the available evidence, which is fine, but that explanation has to be presented back to the scientific community for their blessing before we can rely on it to give decision-making advice.

15.5 The Limits of Causal Arguments

My son is just starting his "Why?" phase. I can see the interminable conversations we will have: "Papa, why does a plane stay in the air?" "Because it has wings." "Why?" "Because the wings hold it up." "Why?" "Because when an airplane goes fast the wind pushes the wings up." "Why?" Dim memories of Bernoulli's equation won't be of much help. "I don't know" is the inevitable end to the conversation. I can see why kids love this game - once we get to three or four answers, we parents reach the limit of our understanding. He's soon going to find out I don't know everything after all, and I'll plummet from my pedestal (he's already realised that I can't mend everything he breaks). Causal thinking is the same. At some point we are going to have to accept the existence of the causal relationships we are using without really knowing why. If we're lucky, the causal link will be supported by a statistical analysis of good data, some experiential knowledge and a feeling that it makes sense. If we go back far enough, all that we believe we know is based on assumptions. My point is that, when you have completed your causal analysis, try to be aware that the analysis will always be based on some assumptions, so sometimes a simple analysis is all you need to get the necessary guidance for your problem.

15.6 An Example of a Qualitative Causal Analysis

Our company does a lot of work in the field of animal health, where we help determine the risk of introducing or exacerbating animal and human disease by moving animals or their products around the world.
This is a very well-developed area of risk analysis, and a lot of models and guidelines have been written to help ensure that there is a scientifically based rationale for accepting, rejecting and controlling such risks. Chapter 22 discusses animal health risk analysis. I present a risk analysis below as an illustration of the need for a healthy cynicism when reviewing scientific literature and official reports, and as an example of a causal analysis that I performed with absolutely no quantitative data for an issue for which we do not yet have a complete understanding.

15.6.1 The problem

A year ago I was asked to perform a risk analysis on a particularly curious problem with pigs. Postweaning multisystemic wasting syndrome (PMWS) affects pigs after they have finished suckling. I had had some dealings with this problem before in another court case. The "syndrome" part of the name means a pattern of symptoms, which is the closest veterinarians can come to defining the disease, since nobody knows for sure what the pathogen is that creates the problem. Until recently there hasn't even been an agreed definition of what the pattern of symptoms actually is. A herd case definition for PMWS was recently agreed by an EU-funded consortium (EU, 2005) led by Belfast University. The PMWS case definition on herd level is based on two elements: (1) the clinical appearance in the herd and (2) laboratory examination of necropsied (autopsy for animals) pigs suffering from wasting.

1. Clinical appearance on herd level

The occurrence of PMWS is characterised by an excessive increase in mortality and wasting post weaning compared with the historical level in the herd. There are two options for recognising this increase, of which 1a should be used whenever possible:

1a. If the mortality has been recorded in the herd, then the increase in mortality may be recognised in either of two ways:

1. Current mortality ≥ mean of historical levels in previous periods + 1.66 standard deviations.
2. Statistical testing of whether or not the mortality in the current period is higher than in the previous periods by the chi-square test.

In this context, mortality is defined as the prevalence of dead pigs within a specific period of time. The current time period is typically 1 or 2 months. The historical reference period should be at least 3 months.

1b. If there are no records of the mortality in the herd, an increase in mortality exceeding the national or regional level by 50 % is considered indicative of PMWS.

2. Pathological and histopathological diagnosis of PMWS

Autopsy should be performed on at least five pigs per herd. A herd is considered positive for PMWS when the pathological and histopathological findings indicative of PMWS are all present at the same time in at least one of the autopsied pigs. The pathological and histopathological findings are:

1. Clinical signs including growth retardation and wasting. Enlargement of inguinal lymph nodes, dyspnoea, diarrhoea and jaundice may be seen sporadically.
2. Presence of characteristic histopathological lesions in lymphoid tissues: lymphocyte depletion together with histiocytic infiltration and/or inclusion bodies and/or giant cells.
3. Detection of PCV2 in moderate to massive quantity within the lesions in lymphoid tissues of affected pigs (basically using antigen detection in tissue by immunostaining or in situ hybridisation).

Other relevant diagnostic procedures must be carried out to exclude other obvious reasons for high mortality (e.g. E. coli post-weaning diarrhoea or acute pleuropneumonia). The herd case definition is highly unusual: a result of the lack of identification of the pathogenic organism. It will need revision when more is known about the syndrome. The definition is also vulnerable from a statistical viewpoint. To begin with, the definition acknowledges the wasting symptom in PMWS, but the definitions only apply to mortality.
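The first option of criterion 1a can be sketched as a simple threshold check. This is an illustration only: the monthly mortality figures are invented, and a real surveillance system would also implement the chi-square alternative and take care with small herds:

```python
from statistics import mean, stdev

def pmws_mortality_flag(historical_rates, current_rate):
    """Criterion 1a, option 1: flag the herd if current mortality is at
    least the historical mean plus 1.66 historical standard deviations."""
    threshold = mean(historical_rates) + 1.66 * stdev(historical_rates)
    return current_rate >= threshold

# Invented monthly post-weaning mortality rates for one herd
history = [0.020, 0.022, 0.019, 0.021]

print(pmws_mortality_flag(history, 0.035))  # True: well above threshold
print(pmws_mortality_flag(history, 0.022))  # False: within normal variation
```

Note how sensitive the flag is to the spread of the historical rates: a herd with noisy records needs a much larger jump in mortality before it triggers, which is one aspect of the statistical vulnerability discussed here.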
PMWS can only be defined at a herd level because one has statistically to differentiate the increase in rate of mortality and wasting post weaning from historical levels in the herd or from other unaffected herds. Thus, for example, PMWS can never be diagnosed for a backyard pig using this definition. The chi-square test quoted above is based on making a normal approximation to a binomial variable. The approximation is only good if one has a sufficiently large number of animals n in a herd and a sufficiently high prevalence p of mortality or wasting in both unaffected and affected herds. Thus, it becomes progressively more difficult to differentiate an affected from an unaffected herd where the herd is small. The alternative requirement of prevalence at > 1.66 standard deviations above previous levels and the chi-square table provided in this definition are determined by assuming that one should only diagnose that a herd has PMWS when one is at least 95 % confident that the observed prevalence is greater than normal. This means that one can choose to declare a herd as PMWS positive when one is only 95 % confident that the fraction of animals dying or wasting is greater than usual. While one needs to set a standard confidence for consistency, this is illustrative of the difference in approach between statistics and risk analysis: in risk analysis one balances the cost associated with correct and incorrect diagnosis and chooses a confidence level that minimises losses. The definition has other statistical issues; for example, the use of prevalence assumes that a population is static (all in, all out) within a herd, rather than a continuous flow. It also does not take into account the possible effects of a deteriorated farm management that would raise the mortality and wasting rates, nor of an improved farm management whose improvements would balance against, and therefore mask, the increased mortality and wasting due to PMWS. Other definitions of PMWS have been used. 
New Zealand, for example, made its PMWS diagnosis on the basis of at least a 15 % post-weaning mortality rate together with characteristic histopathological lesions and the demonstration of PCV2 antigen in tissues. Denmark diagnoses the disease in a herd on the basis of histopathology and demonstration of PCV2 antigen in pigs with or without clinical signs indicative of PMWS, and regardless of the number of animals.
15.6.2 Collecting information
PMWS is a worldwide problem among domestic pig populations. It is very difficult to compare experiences in different countries because until recently there was no single agreed definition, and there are different motivations involved in reporting the problem. In one country I investigated, farmers were declaring they had PMWS with, it seemed, completely new symptoms - but when I talked confidentially to people "on the ground" I found out that, if the problem were declared to be PMWS, the farmers would be fully compensated by their government, whereas if it were another, more obvious issue they would not. Another country I investigated declared that it was completely free of PMWS, which seemed extraordinary given the ubiquitous nature of the problem and that genetically indistinguishable PCV2 had been detected at levels similar to those in other countries battling with PMWS. But the pig industry of this country wanted to keep out pork imports, and its freedom from the ubiquitous PMWS was a good reason justifiable under international trading law. The country used a different (unpublished) definition of PMWS that included the necessity of observing an increased wasting rate, and I was told that in its one suspected herd the pigs that were wasting were destroyed prior to the government assessment, with the result that the required wasting rate was not observed.
The essence of my risk analysis was to try to determine which, if any, of the various causal theories could be true and then determine whether one could find a way to control the import risk for our clients given the set of plausible theories. The main impediment to doing so was that it seemed every scientist investigating the problem had their own pet theory and completely dismissed the others. Moreover, they conducted experiments designed to affirm their theory, rather than refute it. I distilled the various theories into the following components:
Theory 1. PCV2 is the causal agent of PMWS in concert with a modulation of the pig's immune system.
Theory 2. A mutation (or mutations) of PCV2 is the causal agent (sometimes called PCV2A).
Theory 3. PCV2 is the causal agent, but only for pigs that are genetically more susceptible to the virus.
Theory 4. An unidentified pathogen is the causal agent (sometimes called Agent X).
Theory 5. PMWS does not actually exist as a unique disease but is the combination of other clinical infections.
Note that the five theories are not all mutually exclusive - one theory being true does not necessarily imply that the other theories are false. Theory 1 could be true together with theories 2 or 3 or both. Theories 2 and 3 are true only if theory 1 is true, and theories 4 and 5 each eliminate the possibility of all the other theories. A theory of causality can never be proved, only disproved - an absence of observation of a causal relationship cannot eliminate the possibility of that relationship. The five theories, with their partial overlap, were structured to provide the most flexible means of evaluating the cause of PMWS.
I did a review of all (15) pieces of meaningful evidence I could find and categorised the level of support that each gave to the five theories as follows: conflicts (C), meaning that the observations in this evidence would not realistically have occurred if the theory being tested were correct; neutral (N), meaning that the observations in this evidence provide no information about the theory being tested; partially supports (P), meaning that the observations in this evidence could have occurred if the theory being tested were correct, but other theories could also account for the observations; supports (S), meaning that the observations in this evidence could only have occurred if the theory being tested were correct.
15.6.3 Results and conclusions
The results are presented in Table 15.1.
Theory 1 (PCV2 immune system modulation causes PMWS). This theory is well supported by the available evidence. It explains the onset of PMWS post weaning and the presence of other infections, or vaccines, stimulating the immune system as being cofactors. It explains how the use of more stringent sanitary measures on a farm can help contain and avoid PMWS. On its own it does not explain the radially spreading epidemic observed in some countries, nor the difference in susceptibility observed between pigs and pig breeds.
Theory 2 (PCV2A). This theory is also well supported by the available evidence. It explains the radially spreading epidemic observed in some countries but does not explain the difference in susceptibility observed between pigs and between pig breeds.
Theory 3 (PCV2 genetic susceptibility). This theory is supported by the small amount of data available. It could explain the targeting of certain herds over others and the difference in attack rates between pig breeds.
Theory 4 (Agent X). This theory is unanimously contradicted by all the available evidence that could be used to test it.
Theory 5 (PMWS does not actually exist). This theory is unanimously contradicted by all the available evidence that could be used to test it.
As a result, I concluded (rightly or wrongly - at the time of writing we still don't know the truth) that it appears from the available evidence that PMWS requires at least two components to be established:
1. A mutated PCV2 that is more pathogenic than the ubiquitous strain(s). There may well be several different localised mutations of PCV2 in the world's pig population that have varying levels of pathogenicity. This would in part explain the high variance in attack rates in different countries, although farm practices, pig genetics and other disease levels will be confounders.
Table 15.1 Comparison of theories on the relationship between PCV2 and PMWS and the available evidence (S = supports; P = partially supports; N = neutral; C = conflicts).
2. Some immune response modulation, due to another disease, stress, a live vaccine, etc. The theory that PMWS requires an immune system modulation is particularly well supported by the data, both in in vitro and in vivo experiments and from field observations that co-infection and stress are major risk factors.
There is also some limited, but very convincing, evidence (Evidence 15) from Ghent University (by coincidence the town I live in) that the onset of PMWS is related to a third factor:
3. Susceptibility of individual pigs to the mutated virus. The evidence collected for this report suggests that the variation in susceptibility, while genetic in nature, is not obviously linked to the parents of a pig. The apparent variation in susceptibility owing to breed may mean that susceptibility can be inherited over many generations, i.e.
that there will be a statistically significant difference over many generations, but the variation between individuals in a single litter would exceed the generational inherited variation.
15.7 Is Causal Analysis Essential?
In human and animal health risk assessment, we attempt to determine the causal agent(s) of a health impact. Once determined, one then attempts to apportion that risk among the various sources of the causal agent(s), if there is more than one source. Some risk analysts, particularly in the area of human health, argue that a causal analysis is essential to performing a correct risk analysis. The US Environmental Protection Agency, for example, in its guidelines on hazard identification, describes the first step in its risk analysis process: "The objective of hazard identification is to determine whether the available scientific data describe a causal relationship between an environmental agent and demonstrated injury to human health or the environment". Their approach is understandable. It is extremely difficult to establish any causal relationship between a chemical and any human effect that can arise owing to chronic exposure to that chemical (e.g. a carcinogen), since many chemicals can precipitate the onset of cancer, and that onset may only eventuate after many years of exposure, probably to many different carcinogens. We can't start by assuming that all chemicals can cause cancer. On the other hand, we may fail to identify many carcinogens because the data and scientific understanding are not there. If we are to protect the population and environment, we have to rely on the suspicion that a chemical may be carcinogenic because of similarities with other known carcinogens, and act cautiously until we have the evidence that eliminates that suspicion.
In microbial risk assessment, the problem is simpler: either an exposure to bacteria will immediately result in infection or the bacteria will pass through the human gut without effect, and cultures of stools or blood analyses will usually tell us which bacterium has caused the infection. By definition, Campylobacter causes campylobacteriosis, for example, so the risk of campylobacteriosis must logically be distributed among the sources of Campylobacter, because if all sources of Campylobacter were removed in a counterfactual world there would be no more campylobacteriosis. I am of the view that we should definitely take the first step of hazard identification and attempt to amass causal evidence, but a lack of evidence should not lead us to dismiss a suspected hazard from concern, although clear evidence of a lack of causality should. We should also perform broad causal studies with an open mind because, although a strong though unsuspected statistical inference does not prove a causal relationship, finding one may nevertheless offer some lines of investigation leading to the discovery of previously unidentified hazards.
Chapter 16 Optimisation in risk analysis, by Dr Francisco Zagmutt, Vose Consulting US
16.1 Introduction
Analysts are often faced with the question of how to find a combination of values for interrelated decision variables (i.e. variables that one can control) that will provide an optimal result. For example, a bakery may want to know the best combination of materials to make good bread at a minimum price; a portfolio manager may want to find the asset allocation that yields the highest returns for a certain level of risk; or a medical researcher may want to design a battery of tests that will provide the most accurate results. The purpose of this chapter is to introduce the reader to the basic principles of optimisation methods and their application in risk analysis.
For more exhaustive treatments of different optimisation methods, readers are directed to specialised books on the subject, such as Rardin (1997), Dantzig and Thapa (1997, 2003) and Bazaraa et al. (2004, 2006). Optimisation methods aim to find the values of a set of related variable(s) in the objective function that will produce the minimum or maximum value as required. There are two types of objective function: deterministic and stochastic. When the objective function is a calculated value in the model (deterministic), we simply find the combination of parameter values that optimises this calculated value. When the objective function is a simulated random variable, we need to decide on some statistical measure associated with that variable that should be optimised (e.g. its mean, its 95th percentile or perhaps the ratio of standard deviation to mean). Then the optimising algorithm must run a simulation for each set of decision variable values and record the statistic. If one wanted, for example, to minimise the 0.1th percentile, it would be necessary to run thousands of iterations, for each set of decision variable values tested, to have a reasonable level of accuracy - and that can make optimising under uncertainty very time consuming. As a general rule, we strongly advise that you try to find some means to calculate the objective function if at all possible. ModelRisk, for example, has many functions that return statistical measures for certain types of model, and the relationships between stochastic models discussed in Chapter 8 can help greatly simplify a model. Let's start by introducing an example. When a pet food manufacturer wants to make an economically optimal allocation of ingredients for a dog formula, it may have the choice of using different commodities (i.e.
corn or wheat as the main source of carbohydrates), but the company will want to use the combination of components that minimises the cost of manufacturing without losing nutritional quality. Since the price of commodities fluctuates over short periods of time, the feed inputs will have to be optimised every time a new contract for commodities is placed. Hence, an optimal feed would be one that minimises the ration cost but also maintains the nutritional value of the feed (i.e. the required carbohydrate, protein and fat contents in a dog's healthy diet). With this example we have introduced the reader to the concept of constrained optimisation, where the objective is still to minimise or maximise the output from a function by varying the input variables, but now the values of some input variables are constrained to only feasible values of those variables (the nutritional requirements). Going back to the dog feed example, if we know that adult dogs require a minimum of 18 % protein (as % of dry matter), then the model solution should be constrained to the combination of ingredients that will minimise the cost while still providing at least 18 % protein. An input can take more than one constraint; for example, dogs may also have a maximum protein requirement (to avoid certain metabolic diseases) which can also be constrained in the model. The optimal blending of diets is in fact a classical application of linear programming, an area of optimisation that will be revisited later in this chapter. Optimisation requires three basic elements:
The objective function f and its goal (minimisation or maximisation). This is a function that expresses the relationship among the model variables. The outputs from the objective function are called responses, performance measures or criteria.
Input variable(s), also called decision variables, factors, parameter settings and design variables, among many other names.
These are the variables whose values we want to experiment with using the optimisation procedure, and that we can change or control (make a decision about, hence the name decision variable).
Constraints (if needed), which are conditions that a solution to an optimisation problem must satisfy to be acceptable. For example, when only limited resources are available, that constraint should be explicit in the optimisation model. Variable bounds represent a special case of constraints. For example, diet components can only take positive values; hence they are bounded at zero.
Throughout this chapter we will review how these elements combine to create an optimisation model. The field of optimisation is vast, and there are literally hundreds of techniques that can be used to solve different problems. However, in practical terms the main differences between methods reside in whether the objective function and constraints are linear or non-linear, whether the parameters are fixed or include variability and/or uncertainty, and whether all or some parameters are continuous or integers. The following sections give the background to basic optimisation methods, and then present practical examples.
16.2 Optimisation Methods
There are many optimisation methods available in the literature and implemented in commercial software. In this section we introduce some of the most widely used methods in risk analysis.
16.2.1 Linear and non-linear methods
In Section 16.1 we presented a diet blend model and mentioned that it was a typical linear programming application. This model is linear since the objective function and constraints are linear. The general form of a linear objective function can be expressed as:

max/min f(x1, x2, ..., xn) = a1x1 + a2x2 + ... + anxn    (16.1)

where f is the objective function to be minimised or maximised, and the xi and ai are the input variables and their respective coefficients.
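To make the linear form concrete, here is a toy version of the diet blend problem from Section 16.1, solved not with the simplex algorithm but with a brute-force search over a 1 % grid; the ingredient costs and protein contents are invented numbers, purely for illustration.

```python
# Hypothetical two-ingredient dog feed: fractions of corn and soymeal sum to 1.
# Costs and protein contents are made-up numbers chosen for this example.
cost = {"corn": 0.10, "soymeal": 0.30}       # $ per kg of ingredient
protein = {"corn": 0.09, "soymeal": 0.44}    # protein fraction of dry matter

best = None
for i in range(101):                         # corn fraction in 1 % steps
    corn = i / 100
    soy = 1 - corn
    # linear constraint: the blend must contain at least 18 % protein
    if protein["corn"] * corn + protein["soymeal"] * soy < 0.18:
        continue
    # linear objective, as in (16.1): total cost of the blend
    c = cost["corn"] * corn + cost["soymeal"] * soy
    if best is None or c < best[0]:
        best = (c, corn, soy)

print(best)   # cheapest feasible blend: about 74 % corn, 26 % soymeal
```

The search simply keeps the cheapest blend that satisfies the protein constraint; a real formulation with many ingredients would use a linear programming solver instead of a grid.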
The objective function can be subject to constraints of the form:

b1x1 + b2x2 + ... + bnxn <= c    (16.2)

(and similarly with >= or =). Equation (16.2) shows that the constraints imposed on the optimisation problem must also be linear for it to be considered a valid linear optimisation problem. From Equations (16.1) and (16.2) we can deduce two important assumptions of linear optimisation: additivity and proportionality.
Additivity entails that the values of the objective function are the result of the sum of all the variables multiplied by their coefficients, independently. In other words, the increase in the result of the objective function will be the same whether a certain variable increases from 10 to 11 or from 50 to 51.
Proportionality requires that the value of a term in the linear function is directly proportional to the amount of that variable in the term. For example, if we are optimising a diet blend, the total cost of corn in the blend is directly related to the amount of corn used in the blend. Hence, for example, the concept of economies of scale would violate the assumption of proportionality, since the marginal cost decreases as we increase production.
The most common methodology for solving linear programming problems is called the simplex algorithm, which was invented by George Dantzig in 1947 and is still used to solve purely linear optimisation problems. For a good explanation of the simplex methodology the reader is directed to the excellent book by Dantzig and Thapa (1997). We cannot apply linear programming if our objective function includes a multiplicative term such as f(x1, x2) = a1x1 * a2x2, because we would be violating the additivity assumption. Recall that we mentioned that a unit increase in a decision variable will have the same impact on the result of the objective function, regardless of the current absolute value of the variable.
We can't make this assumption with our multiplicative example, since now the impact that a change in one variable has on the objective function will depend on the size of the other variable by which it is multiplied. For example, in a simple function f(x) = ax^2, with a = 5, if we increase x from 1 to 2, the result will change by 15 units (5 * 2^2 - 5 * 1^2), whereas if x increases from, say, 6 to 7, the function will change by 65 units (5 * 7^2 - 5 * 6^2). Non-linear problems impose an extra challenge in optimisation, since they may present more than one minimum or maximum depending on the domain being evaluated. Optimisation methods aiming at finding the absolute largest (or smallest) value of the objective function in the domain observed are called global optimisation methods. We will discuss different approaches to global optimisation in Section 16.3. The final type of function to consider is one where the relationships are not only non-linear but also non-smooth. For example, the relationships among some variables in the model may use Boolean logic (e.g. IF, VLOOKUP, INDEX, CHOOSE), with the effect that the function will present sudden changes, e.g. drastic jumps or drops, making it uneven or "jumpy". These functions are particularly hard to solve using standard non-linear programming methods and hence require special techniques to find reasonable solutions.
16.2.2 Stochastic optimisation
Stochastic optimisation has received a great deal of attention in recent years. One of the reasons for this growth is that many applied optimisation problems are too complex to be solved mathematically (i.e. using the linear and non-linear mathematical methods described in the previous section).
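The arithmetic in the a = 5 example above is easy to verify directly, and shows why additivity fails for a non-linear term:

```python
def f(x, a=5):
    return a * x ** 2   # the simple non-linear function from the text

# The same unit step in x changes f by different amounts, so the effect of
# a change depends on the current value of x - additivity is violated.
print(f(2) - f(1))   # 15
print(f(7) - f(6))   # 65
```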
Stochastic optimisation is the preferred methodology when problems include many complex combinations of options and/or relationships that are highly non-linear, since such problems either are impossible to solve mathematically or cannot feasibly be solved within a realistic timeframe. Simulation optimisation is also essential if the parameters of the model are random or include uncertainty, which is usually the case in many of the models applied to real-world situations in risk analysis. Fu (2002) presents a summary of current methodologies in stochastic optimisation, and some of the applications of this method. Most commercial stochastic optimisation software uses metaheuristics to find the optimal solutions. In this method, the simulation model is treated as a black-box function evaluator, where the optimiser has no knowledge of the detailed structure of the model. Instead, combinations of the decision variables that achieve desirable results (i.e. minimise the objective function more than other combinations) are stored and recombined by the optimiser into updated combinations, which should eventually find better solutions. The main advantage of this method is that it does not get "stuck" in local minima or maxima. Some software vendors claim that this methodology also finds optimal values faster than other methods, but this is not necessarily true, especially when the optimisation problem can be quickly solved with well-formulated mathematical functions. Usually, three steps are taken at each iteration of the stochastic optimisation:
1. Possible solutions for the variables are found.
2. The solutions found in the previous step are applied to the objective function.
3. If the stopping criterion is not met, a new set of solutions is calculated after the results of the previous combinations are evaluated. Otherwise, stop.
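The three-step loop above can be sketched with the simplest possible black-box optimiser, a pure random search. This is only a stand-in for the commercial metaheuristics mentioned (which recombine good solutions rather than sampling blindly), and the test function is invented for the example:

```python
import random

def random_search(objective, bounds, iterations=2000, seed=1):
    """Black-box optimisation loop: propose a solution (step 1), evaluate it
    on the objective (step 2), and keep the best candidate until the
    iteration budget - our stopping criterion - is exhausted (step 3)."""
    rng = random.Random(seed)
    lo, hi = bounds
    best_x, best_f = None, float("inf")
    for _ in range(iterations):
        x = rng.uniform(lo, hi)      # step 1: propose a candidate solution
        fx = objective(x)            # step 2: apply it to the objective
        if fx < best_f:              # step 3: retain improving candidates
            best_x, best_f = x, fx
    return best_x, best_f

# Minimise an arbitrary non-linear objective over [-5, 5]
x, fx = random_search(lambda v: (v - 1.5) ** 2, (-5, 5))
print(round(x, 1))   # close to 1.5, the true minimum
```

The optimiser never inspects the structure of the objective; it only records which proposed solutions performed well, which is exactly the black-box treatment described above.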
Although the above process is conceptually simple, the key to a successful stochastic optimisation resides in the last step, because trying all the combinations of values from different random variables becomes unfeasible (especially when the model includes continuous variables). For this reason, most implementations of stochastic optimisation focus their efforts on how to narrow the potential solutions based on the solutions already known. Some of the methods used for this purpose include genetic algorithms, evolutionary algorithms, simulated annealing, path relinking, scatter search and tabu search, to name a few. It is beyond the objective of this chapter to review these methodologies, but interested readers are directed to the chapter on metaheuristics in Pardalos and Resende (2002), and to the work by Goldberg (1989) and by Glover, Laguna and Marti (2000). Most commercial Excel add-ins include metaheuristic-based stochastic optimisation algorithms. Some of the most popular include OptQuest for Crystal Ball, RISKOptimizer for @RISK and, very recently, Risk Solver. Similar tools are also available for discrete-event simulation suites. There is also a myriad of statistical and mathematical packages, such as R, SAS and Mathematica, that allow for complicated optimisation algorithms. At Vose Consulting we rely quite heavily on these applications (particularly R) when developing advanced models, but we will stick to Excel-based optimisers here to avoid having to explain their syntax structure.
16.3 Risk Analysis Modelling and Optimisation
In this section we introduce the reader to some applied principles for implementing optimisation models in a spreadsheet environment, and then briefly explain the use of the different possible settings in Solver, the default optimisation tool in Excel.
16.3.1 Global optimisation
In the previous section we discussed some of the limitations of linear programming, including the problem of local minima and maxima depending on the starting values. Figure 16.1 shows a simple function of the form f(x) = sin(cos(x) exp(...)). The function has several peaks (maxima) and valleys (minima) within the plotted range. A function like this is called non-linear (changes in f(x) are not monotonically increasing with x), and also non-convex (i.e. line segments drawn from any point to another point can lie above or below the graph of f(x), depending on the region of the function domain). Optimisation software like Excel's Solver and other linear and non-linear constrained optimisation software follow a path from the starting values to the final solution values, using as a guide the direction and curvature of the objective function (and constraints). The algorithm will usually stop at the minimum or maximum closest to the initial values provided, making the optimiser output quite sensitive to the starting values. For example, if the function in Figure 16.1 is to be maximised and the starting value is close to the smaller peak (Max 1), the "best" solution the software will find will be Max 1, when in fact the global peak for this particular function is located at Max 2. Evidently, in most risk analysis applications the desirable solution will be the highest (or the lowest) peak and not a local one. In other words, we always want to make sure that the optimisation is global rather than local. Depending on the software used, there are several ways to make sure we obtain a global optimisation. Excel's Solver is among the most broadly used optimisation software, as it is part of the popular spreadsheet bundle, and its algorithms are very sensitive to the initial values provided by the analyst. Thus, when possible, the entire feasible range of the objective function should be plotted to identify the global peaks or valleys.
From evaluating the graph, a rough estimate can then be used as an initial value. Consider the model shown in Figure 16.2. The objective function is again f(x) = sin(cos(x) exp(...)) and is unconstrained within the boundaries shown (-4.2 to 8). From plotting the function, we know the global maximum is somewhere close to -0.02, so we will use this value in Solver. To do so, we first enter the value -0.02 into cell x (C2), then we select Tools → Add-Ins, check the Solver Add-In box and click the OK button. Then go back to Excel and select Tools → Solver to obtain the menu shown in Figure 16.3.
Figure 16.1 A non-linear function presenting multiple maxima and minima.
Figure 16.2 Sensitivity of Excel's Solver to local conditions. The dot represents the optimal solution found by Solver.
Under "Set Target Cell" we add a reference to the named cell fx (C3); then, since in this example we want to maximise the function, we select "Equal To" Max, and we finally add a reference to the named cell x (C2) under the "By Changing Cells" box. Now we are ready to run the optimisation procedure (we will see more about the Solver menus and options later in this chapter; for now we will use the default settings). We click the "Solve" button and after a very short period we should see a form stating that a solution has been found.
Figure 16.3 Excel's Solver main menu.
Select the "Keep Solver Solution" option and click the "OK" button. We can see that Solver successfully found the global maximum since we provided a good initial value. What would happen if we didn't provide a reasonable initial value? If we repeated the same procedure but started with, say, -3 in cell x, we would obtain a maximum at -3.38, which turns out to be the first peak (Max 1 in Figure 16.1). If we started with a larger value, i.e. 4, Solver would find 6.04 as the optimal maximum, which is Max 3 in Figure 16.1.
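The sensitivity to starting values described above can be reproduced with a naive hill-climber in a few lines. The function used here is a deliberately simple stand-in (sin x), not the book's plotted function, but the behaviour is the same: the search stops at whichever local peak is nearest the start.

```python
import math

def hill_climb_max(f, x0, step=0.01, max_iters=10000):
    """Crude local search: step in whichever direction increases f and stop
    at the nearest local maximum - the answer depends entirely on x0."""
    x = x0
    for _ in range(max_iters):
        if f(x + step) > f(x):
            x += step
        elif f(x - step) > f(x):
            x -= step
        else:
            break
    return x

# sin(x) has local maxima at pi/2 + 2*k*pi; different starts find different peaks
print(round(hill_climb_max(math.sin, 1.0), 2))   # ~1.57, i.e. pi/2
print(round(hill_climb_max(math.sin, 6.0), 2))   # ~7.85, i.e. 5*pi/2 - another peak
```

Both answers are genuine local maxima of the same function; which one the search reports is decided purely by the starting value, just as with Solver's path-following algorithms.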
The reader can use the supplied model to try initial values when looking for minima and maxima and explore how the optimisation algorithm behaves, particularly to notice the model's behaviour when the Solver options (e.g. linearity assumption, quadratic estimates) are changed. An alternative for dealing with local minima and maxima is to restrict the domain to be evaluated. We have already limited the domain by exploring only a limited section of our objective function (-4 to 8). However, the domain still contained several peaks and valleys. In contrast, if the domain observed contains only one peak or valley (e.g. (-2, 2)), the function becomes concave (or convex), which can be solved with a variety of fast and reliable techniques, such as the interior point methods readily implemented in Solver. Since we know the global peak resides somewhere around zero, we can restrict the domain of the objective function to (-2, 2) using the constraint feature in Solver. First enter -2 in cell C6 and 2 in cell C7. Then name the cells "Min" and "Max" respectively. After that, open Solver and click on the Add button. Type "x" under Cell Reference, select <= and then type "=Max" in the Constraint box. Once that is completed, click the Add button and, following the same procedure, add the second constraint, x >= Min. Once both constraints are added, click OK and then Solve. Solver should find an optimal x close to -0.25, which is the global maximum; so, even though the function has many local optimal values, we have now successfully restricted the domain enough that the numerical method can easily find the optimal values. Even if an aberrant number is entered (e.g. 1000) as the initial value, the domain is so narrow now that the algorithm will still find the optimal value. Try it! When the function is not tractable (e.g.
complex simulation models), plotting is not an option, since the figure could be k-dimensional (and we all have a hard time interpreting elements with more than three dimensions). Hence, in this case, if the user plans on using Solver, he or she should attempt different initial values manually, based on knowledge of the system being modelled. Another, more automated option is to use more sophisticated applications that rely on metaheuristic methods, as explained in Section 16.2.2. Later in this chapter we present the solution to a problem where the function not only is intractable but also is highly non-linear and non-smooth and contains a series of integer decision variables and complex constraints. Commercial optimisation software uses different methods to make sure only global optimal solutions are found. As already discussed, metaheuristic methods can be very efficient in finding global optimal solutions. Other commercial software relies upon multistart methods for global optimisation, which automatically try different starting values until a global solution is found. Although they are reasonably effective, such methodologies can be quite time consuming when solving highly non-linear and non-smooth functions, or when little is known about the parameters to optimise (uninformed starting values).
16.3.2 A few notes on using Excel's Solver
We have already mentioned that Excel's Solver is an optimisation tool built into Microsoft's Excel and shipped with all copies of Excel. Although the tool has limitations, it can be used in a variety of situations where stochastic simulation is not required. Solver implements a variety of algorithms to solve linear and non-linear problems. It uses the generalised reduced gradient (GRG) algorithm to solve non-linear programming problems and, when the correct settings are used, it can use the simplex method, a well-known and robust method for solving linear optimisation problems.
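The multistart idea discussed above is easy to sketch: run the same local search from several starting values spread across the domain and keep the best result. The hill-climber and the multimodal test function here are invented for illustration; commercial multistart implementations are considerably more sophisticated about choosing and pruning starting points.

```python
import math

def local_max(f, x0, step=0.01, max_iters=10000):
    """Naive local search that stops at the nearest local maximum."""
    x = x0
    for _ in range(max_iters):
        if f(x + step) > f(x):
            x += step
        elif f(x - step) > f(x):
            x -= step
        else:
            break
    return x

def multistart_max(f, lo, hi, starts=21):
    """Multistart global optimisation: launch a local search from evenly
    spaced starting values and return the best local maximum found."""
    xs = [lo + i * (hi - lo) / (starts - 1) for i in range(starts)]
    return max((local_max(f, x0) for x0 in xs), key=f)

# A multimodal test function with its global maximum at x = 0
f = lambda x: math.cos(x) * math.exp(-0.05 * x * x)

print(round(local_max(f, 6.0), 1))         # a single start gets stuck near 5.8
print(round(multistart_max(f, -10, 10), 1))  # multistart finds the peak near 0
```

A single search launched from x = 6 stalls on a nearby local peak, while the multistart wrapper recovers the global maximum, which is exactly the failure mode and remedy described in the text.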
The mysterious Options menu in Solver
It is likely that many readers have used or tried to use Solver in the past and have managed fairly well. It is also likely that the reader has clicked on the Options button and didn't quite understand the meaning of all the settings. Furthermore, many readers may have found the explanations in the help file rather cryptic, so we will explain the various options. We have already explained in previous sections how to use the general Solver menu. Now we will focus on the menus that appear under the Options button. To get there, select Tools → Solver and then click the Options button. The menu in Figure 16.4 should be displayed.
Figure 16.4 The Options menu in Excel's Solver.
We briefly describe the meaning of each option below:
The Load Model and Save Model buttons enable the user to recall and retain model settings so they don't need to be re-entered every time the optimisation is run.
Max Time limits the time taken to find the solution (in seconds). The default 100 seconds should be appropriate for standard linear problems.
Iterations restricts the number of iterations the algorithm can use to find a solution.
Precision is used to determine the accuracy with which the value of a constraint meets a target value. It is a fractional number between 0 and 1: the higher the precision, the smaller the number (i.e. 0.01 is less precise than 0.0001).
Tolerance applies only to integer constraints and is the level of tolerance (as a percentage) by which a solution satisfying the constraints can differ from the true optimal value and still be considered acceptable. In other words, the lower the tolerance level, the longer it will take for the solutions to be acceptable.
Convergence applies only to non-linear models and is a fractional number between 0 and 1.
If after five iterations the relative change in the objective function is less than the convergence value specified, Solver stops. As with precision and tolerance, the smaller the convergence number, the longer it will take to find a solution (up to Max Time, that is). Lowering the precision, tolerance and convergence values will slow down the optimisation, but it may help the algorithm to find a solution. In general, these defaults should only be changed if Solver is experiencing problems finding an optimal solution. Assume Linear Model is a very important choice. If the optimisation problem is truly linear, then this option should be chosen because Solver will use the simplex method, which should find a solution faster and be more robust than the default optimisation method. However, the function has to be truly linear for this option to be used. Solver has a built-in algorithm that checks for linearity conditions, but the analyst should not rely solely on this to assess the model structure. When the option Show Iteration Results is selected, Solver will pause to show the result of each iteration and will require user input to initiate the next iteration. This option is certainly not recommended for computationally intensive optimisations. When selected, Use Automatic Scaling will rescale the variables in cases where variables and results differ greatly in magnitude. Assume Non-Negative will bound at zero all the decision variables that have not been explicitly constrained. It is preferable, however, to specify the variable boundaries explicitly in the model. The Estimates section allows one to use either a Tangent method or a Quadratic method to estimate the optimal solution. The tangent method extrapolates from a tangent vector, whereas the quadratic method is the method of choice for highly non-linear problems. The Derivatives section specifies the differencing method used to estimate partial derivatives of the objective and constraint functions (when differentiable, of course).
In general, Forward should be used for most problems, where the constraint values change slowly, whereas the Central method should be used when the constraints change more dynamically. The Central method can also be chosen when Solver cannot find improving solutions. Finally, the Search section allows one to specify the algorithm used to determine the direction to search at each iteration. The options are Newton, a quasi-Newton method to be used when speed is an issue and computer power is a limiting factor, and Conjugate, the preferred method when memory is an issue but speed can be slightly compromised. Automating Solver with Visual Basic for Excel One of the most powerful tools in Excel is its integration with Visual Basic for Applications (VBA). This integration can also be extended to optimisation models with Solver. We will use the model presented in Section 16.3.1 to show how to automate Solver in Excel. The steps are:
1. Record a macro using Tools → Macro → Record New Macro and name the macro accordingly (e.g. "SolverRun").
2. Open the Solver form as previously explained and press Reset All to clear existing settings.
3. Repeat the steps followed to optimise the model (e.g. set the objective function, decision variables and constraints, and click the Solve button).
4. Once Solver has found a solution, stop recording the macro by clicking on the small red square in the macro toolbar, or by using Tools → Macro → Stop Recording.
5. Use the Forms toolbar to add a button to the sheet.
6. Assign the macro (e.g. "SolverRun") to the button by double-clicking on it while in Design Mode and typing "Call SolverRun" in the procedure. For example, assuming the button is called CommandButton1, the VBA procedure should look as follows:

Private Sub CommandButton1_Click()
    Call SolverRun
End Sub

7.
Add a reference to Solver in Visual Basic by pressing Alt+F11, then, in the Visual Basic menu, select Tools → References and make sure the box next to "Solver" is selected.
8. The VBA code for the recorded macro should look similar to the example below:

Sub SolverRun()
    'This macro runs Solver automatically
    SolverOk SetCell:="$C$3", MaxMinVal:=1, ValueOf:="0", ByChange:="$C$2"
    SolverAdd CellRef:="$C$2", Relation:=1, FormulaText:="Max"
    SolverAdd CellRef:="$C$2", Relation:=3, FormulaText:="Min"
    SolverOk SetCell:="$C$3", MaxMinVal:=1, ValueOf:="0", ByChange:="$C$2"
    SolverSolve UserFinish:=True
End Sub

Notice we have added an extra line, "SolverSolve UserFinish:=True", which suppresses the Solver Results dialog from being shown at the end of the optimisation. Now everything should be ready to use the macro. Make sure to exit Design Mode and click on the button. The resulting model is not shown here but is provided for the user to explore. 16.4 Working Example: Optimal Allocation of Mineral Pots This exercise is based on a simplified version of a real-life example taken from our consulting work. A metallurgic company processes metal into 14 small containers called pots. The contents of the pots are then split among four larger tubs, which are then used to create the final metal product that is sold. The resulting product receives a premium based on its level of purity (lack of unwanted minerals). Since the input ore differs from batch to batch, the impurity levels will likely differ too. It is then economically important to achieve a certain purity level among batches while avoiding "bad" levels. The goal of the model is to optimise the allocation of pot metal contents into tubs in order to achieve a certain purity level in the final product.¹ Note that, in reality, since the impurity level is estimated from samples, there is uncertainty about the actual impurity level of each batch.
The client required that one was, say, 90 % confident that the concentration of each impurity in a tub was less than a certain threshold. Since speed was an important issue for the client, we avoided simulation by using classical statistics estimates of a mean (Chapter 9) to determine the 10th percentile of the uncertainty distribution for the true concentration in a tub. For each pot, the variables are: purity of metal A (as a percentage of total weight); purity of metal B (as a percentage of total weight); weight (in pounds). As the reader may imagine, the plant's operations present several constraints to be modelled, which are listed below:
1. A minimum of 1000 lb should be taken per pot.
2. The quantities taken from the pots are measured in discrete increments of 20 lb.
3. A maximum of five pots can be allocated to a given tub.
4. Pots can only be split in two parts (i.e. the contents of a pot cannot be split across three or four different tubs).
5. The maximum metal tonnage taken from a pot is equal to the pot weight (obvious, but this needs to be explicit in the model).
6. Every pot should be allocated to at least one tub (no "leftover" pots).
7. The maximum and minimum weights contained per tub are constrained (for this example, a minimum of 5000 lb and a maximum of 10 000 lb are assumed).
Given the number of constraints and possible combinations to be optimised, this model would be quite complex to define in mathematical terms (especially when considering parameter uncertainty), and hence a more practical approach is to use optimisation software. For this particular example, we employed OptQuest with Crystal Ball for its ease of use and connection with Excel, but other commercial spreadsheet add-ins could be used to achieve similar results. OptQuest is used here for a deterministic model but handles stochastic optimisation equally well.
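To make the constraint set concrete, here is a sketch (our own illustration in Python, not the OptQuest model itself) of a single feasibility test over a candidate allocation, held as a 14 × 4 matrix of pounds taken from each pot into each tub. The function name, defaults and data layout are all assumptions:

```python
def allocation_ok(alloc, pot_weights,
                  min_take=1000, step=20, max_pots_per_tub=5,
                  tub_min=5000, tub_max=10_000):
    """Return True if a candidate allocation satisfies all seven constraints.

    alloc[p][t] = pounds taken from pot p into tub t (0 means not allocated).
    """
    n_pots, n_tubs = len(alloc), len(alloc[0])
    for p in range(n_pots):
        takes = [alloc[p][t] for t in range(n_tubs) if alloc[p][t] > 0]
        if any(x < min_take or x % step != 0 for x in takes):
            return False                # constraints 1 and 2
        if not 1 <= len(takes) <= 2:
            return False                # constraints 6 and 4
        if sum(takes) != pot_weights[p]:
            return False                # constraint 5 plus "no leftover material"
    for t in range(n_tubs):
        column = [alloc[p][t] for p in range(n_pots) if alloc[p][t] > 0]
        if len(column) > max_pots_per_tub:
            return False                # constraint 3
        if not tub_min <= sum(column) <= tub_max:
            return False                # constraint 7
    return True
```

In the simulation-requirement style described next, a candidate failing such a test is simply discarded rather than penalised inside the objective function.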
One powerful feature of simulation optimisation is that complex constraints such as those imposed in this model can be specified by removing the scenarios that violate them rather than by including them explicitly in the objective function. Such constraints are sometimes called simulation requirements. Although this approach can be slower than incorporating the constraint directly in the model, it allows for very complex interactions in the model. Also, the model can be sped up significantly by compiling many input variables into only one requirement variable. Figure 16.5 shows the general structure of the model. Cells with a grey background represent input variables (variables that are changed during the optimisation process), and cells with a black background are requirements that are used to set the model constraints and define the objective function.
Figure 16.5 The metallurgic optimisation model implemented in Excel.
Figure 16.6 Dialogue to create decision variables in OptQuest with Crystal Ball.
¹ Another goal was to optimise for several purity levels by their dollar premiums, but that is omitted here for simplicity.
The larger table in range G2:J17 contains the purity levels for both minerals and the weight of each pot. The small table on the right contains the target purity levels that the model will optimise towards. The "Pounds" table (range B2:E16) contains the input variables that are modified during the optimisation. By selecting Define → Define Decision in Crystal Ball's menu, the user will see the form shown in Figure 16.6 with the settings for cell C3. The variables are discrete and can only increment in steps of 20 pounds (constraint 2), and are constrained to a fixed minimum of 1000 lb (constraint 1) and a maximum equal to the total content of the pot, which will vary from batch to batch; hence, the maximum value is linked to the cell that contains the pot weight. Similar variables are created for each combination of pots in tub 1, the only difference being the cell reference for their maximum weight. Decision variables are only needed for the first tub since the allocation for the other tubs is calculated on the basis of the initial allocation to the first tub. Thus, the remaining cells in the "Pounds" matrix are left empty or with a constant value of 1. The "Switches" matrix (range B19:E34) contains input variables that can only take values of 0 or 1. The set of input variables from the "Pounds" matrix is multiplied by the variables in the "Switches" matrix to generate the output matrix "Output for objective Fx". Notice that, for the "Switch" variables, input variables are only needed for the first three tubs, because the fourth tub can be filled with what is left in the pots after their contents have been allocated to the other three. The remaining components of the model are the constraints and the objective function. As previously mentioned, for this model some constraints are built into the simulation model, whereas others are set as scenarios that cannot be included in the optimal solution.
Hence, anything that does not meet the requirements is "tossed" from the set of possible options. The equation for pot 6, tub 1, in the output matrix incorporates constraint 3, and summarises as: "if five pots have already been allocated to tub 1, then do not allocate the product from pot 6 to tub 1; otherwise, allocate the content defined in cell C8 (which is a decision variable as in Figure 16.6) multiplied by cell C26". The first part of this equation limits to five the number of pots that can be allocated to the tub (constraint 3). The second part (the multiplication of two cells) is used to make sure that there is no bias in the order of the allocation of the pots to tubs (by using the binary decision variables in the "Switch" matrix). The same logic is used for pots 7 to 14 in tub 1. For tubs 2 and 3, the equation for pot 6 is modified so that we add "if the remaining weight left in the pot is less than 1000 pounds, do not allocate any metal to this tub (constraint 1); otherwise, allocate the remaining material from pot 6 into this tub". The subtraction from the total pot weight satisfies constraint 5. The reader will notice that, since we can only allocate one pot to one or two tubs (constraint 4), there is no need for an input variable in columns D and E, since the material allocated to tubs 2 to 4 depends on whether tub 1 received material from a given pot. Thus, the pot/tub combinations for tubs 2 to 4 in the "Pounds" matrix contain only 1s, so a 1 is returned when multiplied by the 1s from the "Switch" matrix.
² For some reason unknown to this author, sometimes the cell reference in the decision variables may be lost after opening OptQuest and replaced by the last number in the cell, e.g. the maximum weight entered for the pot (we are using OptQuest with Crystal Ball v. 7.3, Build 7.3.814). Readers should be aware of this issue when using this and other models with dynamic referencing on decision variable parameters.
Finally, metal from a pot that has not been allocated to tubs 1 to 3 (and that amounts to at least 1000 pounds) is allocated to tub 4. As for the other tubs, the formulas from pot 6 onwards are constrained so that no more than five pots can be allocated to one tub. We cannot waste any remaining material in a pot, of course, so another exogenous constraint (requirement) that we add is that the sum of the pounds allocated from a pot should be exactly the same as the total weight of that pot. In addition, we can include constraint 6 in the same requirement to speed up the optimisation. The resulting formula (cell M21 shown; the same for all pots) returns a 1 if the pot has been allocated to no more than two tubs and the sum of the weights allocated equals the weight of the pot, and a 0 otherwise. The same test is applied to each pot. Therefore, to meet the conditions, the sum of cells M21:M34 (cell M36) should be exactly 14 because, if all pots "pass the test", each individual pot test returns a 1. Some readers may wonder why constraint 6 was added to this formula although it was already mentioned that, if nothing is allocated to tubs 1 to 3, the total weight is allocated to tub 4. In reality, the constraint is not necessary but is left in the equation to exemplify how to combine several constraints into one formula, making the model computations significantly faster. Also, when a model is going to be continuously modified, it is always good to have logical checks to make sure the algorithm is working the way it is supposed to. Before we include the final values in the objective function, we need to identify which tubs are below the desired impurity thresholds for minerals A and B. The formula we use for this (cells H40:K40; H40 shown) is = IF(AND(H38 < OptA, H39 < OptB), 1, 0), where OptA and OptB are the optimal purity levels for metals A and B respectively. This formula returns a 1 when the requirements are met.
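The two indicator formulas just described translate directly into code. Here is a sketch of the same logic in Python (our own function names; the spreadsheet cell references are replaced by plain arguments):

```python
def pot_requirement(takes, pot_weight):
    """1 if the pot is used by at most two tubs AND is fully allocated, else 0
    (the combined requirement held in cells M21:M34)."""
    used = [x for x in takes if x > 0]
    return 1 if len(used) <= 2 and sum(takes) == pot_weight else 0

def good_tub(conc_a, conc_b, opt_a, opt_b):
    """1 if both impurity concentrations are below their thresholds, else 0
    (the =IF(AND(H38 < OptA, H39 < OptB), 1, 0) test)."""
    return 1 if conc_a < opt_a and conc_b < opt_b else 0
```

Summing pot_requirement over all 14 pots reproduces the check that cell M36 must equal exactly 14.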
Finally, the objective function is contained in cell N40 and is the sum of the total weights per tub multiplied by the "good tub" indicator. The optimisation model will try to maximise the value of this objective function (the total weight of "good" metal in tubs). Once the variables, constraints and objective function are defined, the last step is to use OptQuest to set up and then run the optimisation procedure. To do so, in the Crystal Ball menu, select Run → OptQuest and open a new optimisation file. All variables in the Decision Variables form should be selected. In the Forecast Selection form the inputs should be selected as in Figure 16.7.
Figure 16.7 Forecast selection menu in OptQuest for the metallurgic optimisation model.
The objective function is maximised (we want to have the maximum amount of pure metal), the constraint test should equal 14, and the minimum and maximum contents of the tubs should be 5000 and 10 000 lb respectively (constraint 7). The software will discard any scenario that does not meet the requirements, and the objective function will be maximised by finding the best combination of input variables. Provided the initial values are reasonable, an optimal solution takes less than an hour to find on a modern PC, which is important because the production line has to run this model twice a day. 16.4.1 Uncertainty in the model In the actual model for our client we included the uncertainty about the impurity concentrations. The user set a required confidence level CL (e.g. 90 %), and the model optimised to produce tubs that had less than the specified impurity level with this confidence.
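The confidence-level shortcut used in the client model (the product-moment formulas and normal-percentile step detailed below) can be sketched with Python's standard library. The function name and example numbers are our own illustration:

```python
from statistics import NormalDist

def impurity_at_confidence(mu_w, sd_w, mu_c, sd_c, cl=0.90):
    """Percentile of (pot weight x impurity concentration) via a normal approximation.

    Uses the exact mean and variance of a product of two independent random
    variables, then NormalDist.inv_cdf (the equivalent of Excel's NORMINV).
    """
    mu_prod = mu_w * mu_c
    var_prod = mu_w**2 * sd_c**2 + mu_c**2 * sd_w**2 + sd_w**2 * sd_c**2
    return NormalDist(mu_prod, var_prod ** 0.5).inv_cdf(cl)

# Hypothetical pot: weight (mean 2000 lb, sd 50), concentration (mean 5 %, sd 0.5 %)
bound = impurity_at_confidence(2000, 50, 0.05, 0.005, cl=0.90)
```

One such calculation replaces an entire simulation run when checking a tub against its impurity threshold, which is exactly the speed gain described in the section below.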
The amount of impurity is determined by: weight of pot × impurity concentration. The uncertainty comes from the uncertainty about the weight of a pot (mean μP and standard deviation σP, in lb) and from the uncertainty about the impurity concentrations (mean μA and standard deviation σA for impurity A, for example). For two independent variables, the mean and standard deviation of the distribution of their product are given by: μproduct = μP μA and σproduct = √(μP²σA² + μA²σP² + σP²σA²). In order to calculate the impurity level at the required confidence, we use Excel's NORMINV(CL, μproduct, σproduct) function. The normal approximation is reasonable in this case because the uncertainty about the concentration was close to a normal distribution and was greater than the weight uncertainty, so it dominated the shape of the product. As mentioned before, finding a way of avoiding having to optimise a simulation model (rather than the calculation model used here) is very helpful because it hugely speeds up the optimisation: one calculation replaces a simulation of, say, 1000 iterations needed to be sure of the required confidence level value. Chapter 17 Checking and validating a model In this chapter I describe various methods that can be used to help validate the quality and predictive capabilities of a model. Some techniques can be carried out during a model's construction, which will help ensure that the finished model is as free from errors and as accurate and useful as possible. Other techniques can only be executed at a future time when some of the model's predictions can be compared against what actually happened, but one may nonetheless devise a plan to help facilitate that comparison. Key points to consider are: Does the model meet management needs? Is the model free from errors? Are the model's predictions robust? The following topics describe the methods we use to help answer these questions: Ensuring the model meets the decision-makers' requirements. Comparing predictions against reality. Informal auditing.
Checking units propagate correctly. Checking model behaviour. Comparing results of alternative models. 17.1 Spreadsheet Model Errors Your company may have hundreds or thousands of spreadsheet models in use. If even 1 % of these have errors, you could be making many decisions based on quite inaccurate information. If you now introduce risk analysis models using Monte Carlo simulation, which are more difficult to write (because we have to write models that work dynamically) and to check (because the numbers change with each iteration), the problem could get much worse. Errors come in several forms: Syntax errors, where a formula is incorrectly put together. For example, you mismatch brackets, forget to make a formula into an array formula (by entering it with Enter instead of Ctrl + Shift + Enter), use the wrong function, etc. Mechanical errors, which are hitting the wrong key, pointing to the wrong cell, etc. About 1 % of spreadsheet cells contain such errors. Logical errors, which are incorrect formulae due to mistaken reasoning, misunderstanding of a function or of the appropriate use of probability mathematics. These errors are more difficult to detect than mechanical errors and occur in about 4 % of spreadsheet cells in normal (unrisked) models. Application errors, where the spreadsheet function does not perform as it should. Excel generates incorrect results for some statistical functions: GAMMADIST and BINOMDIST are awful, for example. Some versions of Excel also don't automatically update all formulae correctly - use Ctrl + Alt + F9 instead of F9 to be sure everything updates correctly.
Random number generation for certain distributions is quite numerically difficult, so you will see artificial limits on the parameters allowed for distributions in a lot of software: @RISK, for example, allows a maximum of 32 767 trials in a binomial distribution and for a hypergeometric population, while Crystal Ball allows a maximum of 1000 for a Poisson mean, and parameters for the beta distribution must lie on [0.3, 1000]. It is frustrating, of course, to have to work around such limits, and often you'll only find them because the model didn't work for some iterations, so we have designed ModelRisk to have no such issues. Omission errors, where a necessary component of the model has been forgotten. These are the most difficult errors to detect. Administrative errors, for example using an old version of a spreadsheet or graph, failing to update a model with new data, failing to get the spreadsheet to recalculate after changes, importing data from another application incorrectly, etc. We have tried to help reduce the frequency of these types of error with ModelRisk. Each function returns an informative error message when inappropriate parameter values are entered. For example: = VoseNormal(100, -10) returns "Error: sigma must be >= 0" because a standard deviation cannot be negative. = VoseHypergeo(20, 30, 10) returns "Error: n must be <= M" because one cannot take more samples without replacement (n = 20) than there are individuals in the population (M = 10). {= VoseAggregateMoments(VosePoissonObject(10), VoseLognormal(10, 3))} returns "Error: Severity distribution not valid" because the severity distribution needs to be an object, e.g. VoseLognormalObject(10, 3). If you write any user-defined functions, with which the Excel user will be less familiar, please consider doing the same.
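The same defensive style is easy to reproduce in user-defined functions. Here is a sketch in Python of a hypergeometric probability function (our own signature and messages, not ModelRisk's) that validates its parameters with informative errors and, anticipating the point made next, returns a plain zero for impossible outcomes rather than failing:

```python
from math import comb

def hypergeo_prob(k, n, D, M):
    """P(exactly k successes in n draws, without replacement, from a
    population of M items of which D are successes).

    Invalid parameters raise an informative error; an impossible k simply
    has probability 0, so callers need no special-case code.
    """
    if not 0 <= D <= M:
        raise ValueError("D must satisfy 0 <= D <= M")
    if not 0 <= n <= M:
        raise ValueError("n must satisfy 0 <= n <= M")
    if k < max(0, n - (M - D)) or k > min(n, D):
        return 0.0              # outside the support: an impossible outcome
    return comb(D, k) * comb(M - D, n - k) / comb(M, n)
```

For example, drawing 10 items from a population of 30 that contains 25 successes must yield at least 5 successes, so hypergeo_prob(2, 10, 25, 30) returns 0.0 rather than an error.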
In ModelRisk we have also chosen to return pedantically correct answers for probability calculations, for example: = VoseHypergeoProb(2, 10, 25, 30, 0) returns 0: this is the probability of observing exactly two successes where the minimum possible is five. If an outcome is impossible, its probability is zero. = VoseBinomialProb(50, 10, 0.5, 1) returns 1: the probability of observing less than or equal to 50 successes when there are only 10 trials. This means that you don't have to write special code to get around the function giving errors. For comparison, the equivalent Excel formulae are: = HYPGEOMDIST(2, 10, 25, 30) returns #NUM! = BINOMDIST(50, 10, 0.5, 1) returns #NUM! You also need to check how your Monte Carlo simulation software handles special cases for particular parameter values. Poisson(0), for example, means that the variable can only be zero. In a simulation model, it would be perfectly reasonable for a cell simulating a concentration to produce a zero value that fed into a Poisson distribution. However, different software handles this differently: @RISK: = RiskPoisson(0) returns #VALUE! Crystal Ball: = CB.Poisson(0) returns #NUM! ModelRisk: = VosePoisson(0) returns 0. Perhaps the most useful error-reducing feature in ModelRisk is that we have interfaces that give a visual explanation and check of most ModelRisk features. For example, a cell containing the formula = VoseGammaProb(C3:C7, 2, 3, 0) returns the joint probability of the values in cells C3:C7 being randomly generated from a Gamma(2, 3) distribution. Selecting the cell with this formula and then clicking ModelRisk's View Function icon pulls up the interface shown in Figure 17.1. Crystal Ball and @RISK both have very good interfaces, although these are limited to input distributions only. A quick Internet search for "spreadsheet model errors" will provide you with a wealth of individuals and organisations who research the source and control of spreadsheet errors.
For example, the European Spreadsheet Risks Interest Group is dedicated to the topic. Raymond Panko from the University of Hawaii is a leader in the field and provides an interesting summary of spreadsheet error rates and reasons at http://panko.shidler.hawaii.edu/SSR/index.htm.
Figure 17.1 Visual interface in ModelRisk for the formula VoseGammaProb(C3:C7, 2, 3, 0).
Looking at the error percentages, for large models the question is not "Are there any errors?" but "How many errors are there?". A company can help minimise model errors by establishing and enforcing a policy for model development and for model auditing. Dr Panko reports the recommendation of professional model auditors that one should spend one-third of the development time checking the model. 17.1.1 Informal auditing Studies have shown that the original builder of a spreadsheet model has a lower rate of error detection than an equivalently skilled coworker. That is not so surprising, of course, since we are more inclined than a reviewer to repeat our own logical, omission and administrative errors. At Vose Consulting we do a lot of internal auditing. An important part of the process is sitting down with another analyst and explaining, with pen and paper, the decision question(s) and the model structure, and then how we've executed it in a spreadsheet. Just the process of providing an explanation will often lead to finding errors in your logic, or to finding simpler ways to write the model. Get another analyst to go through your code with the objective of finding your errors, so that a successful exercise is one that finds errors rather than one that pronounces your model error free. Having several analysts look at your model is even better, of course - it is interesting how people find different errors.
For example, in writing our software, some of our team are just great at finding numerical bugs, others at finding wrong formulae, and others still at finding inconsistencies in structure or presentation. Different things jump out at different people. 17.1.2 Checking units propagate correctly I studied physics at university, and one of the first things you learn to do is a "dimensional analysis" of formulae. For example, there exists an equation relating initial speed u and final speed v to the distance s over which a body undergoes constant acceleration a: v² = u² + 2as. The dimensions involved are length L (in metres, for example) and time T (in seconds, for example). Distance has units L, speed has units L/T, and acceleration has units L/T². Replacing the elements in the above formula with their dimensions gives (L/T)² = (L/T)² + (L/T²) × L. You can see that the left- and right-hand sides of the equation have the same units and that, when we add two things together, they have the same units too (so we are not adding "apples and oranges"). In a spreadsheet model we can use the same logic to help make sure our model is constructed properly. It is good practice to label cells containing a number or formula with some explanation of what that value represents, but including units makes the logic of the model even clearer; for example, noting the currency when there is more than one in your model or, if a value is a rate, noting the denominator, e.g. "$US/ticket" or "cases/outbreak". Then checking that the units flow through the model using dimensional analysis will often reveal errors. Checking that the same units are used for a dimension (length, mass, etc.) is also important. We commonly come across two easily avoided problems in this category in our auditing activities: Fractions. The first is the use of a fraction, where the modeller might label a cell "Interest rate ( %)" and then write a value like "6.5".
Of course, to apply that interest rate, he or she will have to remember to divide by 100, and we've found that this is sometimes forgotten. Better by far, in our view, is to label the cell "Interest rate" and input the value "6.5 %", which will show on screen as 6.5 % but will be interpreted by Excel as 0.065 and can therefore be used directly. Thousands, millions, etc. In large investment analyses, for example, one is often dealing with very large numbers, so the modeller finds it more convenient to use units of thousands or millions. This would not present a problem if the entire spreadsheet used the same units, but very commonly there will be certain elements that do not; for example, cost/unit or price/unit for a manufacturer or retailer of high-volume products. The danger is that, in summary calculations that evaluate cashflow streams, the modeller may forget to divide by 1000 or 1 000 000 in keeping with the other currency cells. Even if it is all done correctly, it is more difficult to follow formulae where "/1000" and "*1000000" appear without explanation. Our preference is that the model be kept in the same units throughout - a base currency unit, for example, like $, € or £. Admittedly this can be tricky if you're converting from values you know in thousands or millions - we can easily get all those zeros mixed up. A convenient way to get around this in Excel is to use special number formatting, employing Excel's Format | Cells | Custom feature. We use a few formats in particular: one which will display 1 234 567 890 as £123.5M; one which will display 1 234 567 890 as £123.5M as above, but will display negative values in red; one which does the same as the second option but has the "£M" next to the numbers rather than left justified; and one which will display 1 234 567 890 as £123,456.8k. You can, of course, substitute a different currency symbol. Time series summary plots; correlation and regression statistics.
They are discussed at length in Chapter 5. 17.2.5 Stressing parameter values A very useful, simple and powerful way of checking your model is to look at the effect of changing the model parameters. We use two different methods. Propagate an error In order to check quickly which elements of your model are affected by a particular spreadsheet cell, you can replace the cell contents with the Excel formula =NA(). This will show the warning "#N/A" (meaning data not available) in that cell and in any other cell that relies on it (except where the ISNA() or ISERROR() functions are used). Embedded Excel charts will simply leave the cell out. I like this method very much because it is quicker than using the Excel audit toolbar to trace dependents, and it also works when you have VBA macros that pick up values from cells within the code, i.e. when the cells aren't inputs to the macro function, Excel's Trace Dependents feature won't work in that situation. Set parameter values to extremes It is difficult to see whether your Monte Carlo simulation model is performing correctly for low-probability outcomes because generating scenarios on screen will obviously only rarely show those low-probability scenarios. However, there are a couple of techniques for concentrating on these low-probability events by temporarily altering the input distributions. We suggest that you first resave your model with another name (e.g. append "test" to the file name) to avoid accidentally leaving the model with the altered distributions. You can generate model extremes as follows: (a) Set a discrete variable to an extreme instead of its distribution. The theoretical minima and maxima of discrete bounded distributions are provided in the formulae pages for each distribution in Appendix III. Many distributions have a zero minimum, but only a few distributions have a maximum value (e.g. binomial).
In general, however, it is not a good idea to stress a continuous variable with its minimum or maximum, because such values have a zero probability of occurrence and so the scenario is meaningless. (b) Modify the distribution to generate values only from an extreme range. This is particularly useful for continuous distributions, and for discrete distributions where there is no defined minimum and/or maximum. Monte Carlo Excel add-ins normally offer the ability to bound a distribution. For example, in ModelRisk we can write the following to constrain a lognormal distribution:

Only values above 30: =VoseLognormal(10, 5, , VoseXBounds(30, ))
Only values below 5: =VoseLognormal(10, 5, , VoseXBounds(, 5))
Values between 10 and 11: =VoseLognormal(10, 5, , VoseXBounds(10, 11))

In @RISK, this would be =RiskLognorm(10, 5, RiskTruncate(30, )), etc. In Crystal Ball you apply bounds in the visual interface. Note that occasionally a model will have an acute response to a variable that is within a small range. For example, a model of the amplitude of vibrations of a car may have a very acute (highly non-linear) response to an input variable modelling the frequency of an external vibrating force, like the bounce from driving over a slatted bridge, when that frequency approaches the natural frequency of the car. In that case, the rare event that needs to be tested is not necessarily an extreme of the input variable but the scenario that produces the extreme response in the rest of the model. (c) Modify the probability of a risk occurring. Often in a risk analysis model we have one or more risk events. We can simulate them occurring (with some probability) or not in a variety of ways. We can stress the model to see the effect of an individual risk occurring, or a combination of risks, by increasing their probability during the test.
For example, setting a risk to have a 50 % probability (where perhaps we actually believe it to have a 10 % probability) and generating on-screen scenarios allows us comfortably to watch how the model behaves with and without the risk occurring. Setting two risks each to a 70 % probability will show both risks occurring at the same time in about 50 % of the scenarios (0.7 * 0.7 = 0.49), and so on.

17.2.6 Comparing results of alternative models

There are often several ways that one could construct a Monte Carlo model to tackle the same problem. Each method should give you the same answer, of course. So, if you are unsure about one way of manipulating distributions, try it another (perhaps less efficient) way and see if the answers are the same. The more difficult area is where you may feel that there are two or more completely different stochastic processes that could explain the problem at hand. Ideally, one would like to be able to construct both models and see whether they come up with similar answers. But what do we mean by similar? In fact, from a decision analysis point of view we don't actually mean that they come up with the same numbers or distributions: we mean that, if presented with either result, the decision-maker would make the same decision. If we do have the luxury of being able to construct two completely different model interpretations of the world, we may be able to use a technique called Bayesian model averaging, which weights the likelihood of each model on the basis of how probable each would make our observations. We will nearly always lack the luxury of being able to model two or more different approaches to the same problem because of time and resource constraints. If you are going to have to put all your efforts into one model, try to make sure that your peers agree with your approach, and that the decision-maker will be comfortable with making a decision based on the model's assumptions.
The decision-maker could prefer you to construct a model that may not be the most likely explanation for your problem, but that offers the most conservative guidance for managing it. Finally, simple "back-of-the-envelope" checks can also be useful. Managers will often look at the results of a risk analysis and compare them with their gut feeling and/or a simple calculation. It is surprising how often a modeller can get too involved in the modelling and pay too little attention to the numbers that come out at the end.

17.3 Comparing Predictions Against Reality

In many cases, this might be akin to "shutting the stable door after the horse has bolted". Clearly, if you have made an irreversible decision on the basis of a risk assessment, this exercise may be of limited value. However, even when that is true, analysing which parts of the model turned out to be the most inaccurate will help you focus on how you might improve your risk models for the next decision, or prepare you for how badly you will have got it wrong. Perhaps it is possible to structure a decision into a series of steps, each informed by risk analysis, so that at each step in the series of decisions the risk analysis predictions can be compared against what has happened so far. For example, setting up an investment that started with a pilot roll-out in a test market would let a company limit the risks and at the same time evaluate how well it had been able to predict the initial level of success. Project risk analysis models, in which the cost and duration of the elements of a project are estimated, are an excellent example of where predictions can be continuously compared with reality. The uncertainty of the cost and time elements can be updated as each task is being completed to estimate the remaining duration and costs, while a review of each task estimate against what actually happened can give you a feel for whether your estimators have been systematically pessimistic or optimistic.
Chapter 13 gives a number of techniques for monitoring and calibrating expert estimates.

Chapter 18 Discounted cashflow modelling

A typical discounted cashflow model for a potential investment makes forecasts of costs and revenues over the life of the project and discounts those revenues back to a present value. Most analysts start with a "base case" model and add uncertainty to the important elements of the model. Happily, the mathematics involved in adding risk to these types of model is quite simple. In this chapter, I will assume that you can build a base case cashflow model that will look something like Figure 18.2, and I will focus on the input modelling elements of Figure 18.1 and some financial outputs. There are a number of topics that are already well covered in this book:

Expert estimates. In capital investment models we rely a great deal on expert judgement to estimate variables like costs, time to market, sales volumes, discount levels, etc. Chapter 14 discusses how to elicit estimates from subject matter experts.

Fitting distributions to data. We don't usually have a great deal of historic data to work with in capital investment projects because the investment is new. I have worked with a very successful retail company that investigates levels of pedestrian traffic at different locations in a town where it is considering locating a new outlet. It has excellent regional data on how that traffic converts to till receipts. That is quite typical of the type of data one might have for a cashflow analysis, and I will go through such a model later in this chapter. Hydrocarbon and mineral exploration will generally have improving levels of data about the reserves, but have specialised methods (e.g. kriging) for statistically analysing their data, so I won't consider them further here. Otherwise, Chapter 10 discusses distribution fitting in some detail.

Correlation.
Simple forms of correlation modelling - recognising that two or more variables are likely to be linked in some way - are very important in cashflow models. The correlation techniques described in Sections 13.4 and 13.5 are particularly useful in cashflow models.

Time series. Chapter 12 deals with many different technical time series models. GBM, seasonal and autoregressive models are useful for modelling inflation, exchange and interest rates over time in a cashflow model. Lead indicators can help predict market size a short time into the future. In this chapter I consider variables such as demand for products and sales volumes that are generally built on a more intuitive basis.

Common errors. Risk analysis cashflow models are not generally that technically complicated, but our reviews show that the types of error described in Section 7.4 appear very frequently, so I very much encourage you to read that section carefully. The rest of Chapter 7 offers some ideas on model building that are very applicable to cashflow models.

Figure 18.1 Modelling elements in a capital investment discounted cashflow model.

Figure 18.2 A typical, if somewhat reduced, discounted cashflow model. [The figure shows a spreadsheet with sections for the cashflow (total revenue, cost of goods sold, gross margin, operating expenses, earnings before taxes, income tax, net income), market conditions (number of competitors, unit cost, inflation rate, tax rate), sales activity (sales price, market volume, sales volume) and expenses (production, product development, capital, overhead, total).]

18.1 Useful Time Series Models of Sales and Market Size

18.1.1 Effect of an intervention at some uncertain point in time

Time series variables are often affected by single identifiable "shocks", like elections, changes to a law, the introduction of a competitor, the start or finish of a war, a scandal, etc. The modelling of the occurrence of a shock and its effects may need to take into account several elements: when the shock may occur (this could be random); whether this changes the probability or impact of other possible shocks; and the effect of the shock - its magnitude and duration. Consider the following problem. People are purchasing your product at a current rate of 88/month, and the rate appears to be increasing by 1.3 sales/month with each month. However, we are 80 % sure that a competitor is going to enter the market and will do so between 20 and 50 months from now.
If the competitor enters the market, they will take about 30 % of your sales. Forecast the number of sales there will be for each of the next 100 months. Two typical pathways for this problem are shown in Figure 18.3, and the model that created them is shown in Figure 18.4. The Bernoulli variable returns a 1 with 80 % probability, otherwise a 0. It is used as a "flag", the 1 representing a competitor entry, the 0 representing no competitor. Other cells use conditional logic to adapt to the scenario. You can use a Binomial(1, 80 %) if your software does not have a Bernoulli distribution; in Crystal Ball this is also called a Yes:No distribution. The StepUniform generates integer values between 20 and 50, and cell E4 returns the month 1000 if the competitor does not enter the market, i.e. a time beyond the modelled period. If you use this type of technique, it is a good idea to make such a number very far from the range of the modelled period, in case someone decides to extend the period analysed. A Poisson distribution is used to model the number of sales, reflecting that the sales are independent of each other and randomly distributed in time. The nice thing about a Poisson distribution is that it takes just one parameter - its mean - so you don't have to think about variation about that mean separately (e.g. determine a standard deviation).

Figure 18.3 Possible pathways generated by the model depending on whether the competitor enters the market. [The figure plots sales each month over the modelled period.]

Figure 18.4 Model of Poisson sales affected by the possible entry of a competitor. [The figure shows the spreadsheet, with rows for month, expected sales, sales fraction lost and simulated sales; key formulae: E3: =VoseBernoulli(E2); E4: =IF(E3=1,VoseStepUniform(20,50),1000).]
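Outside Excel, the same three-part logic (a Bernoulli entry flag, a uniform-integer entry month and Poisson monthly sales) can be sketched in a few lines. The following Python fragment is illustrative only: the function names and the pure-Python Poisson sampler are my own, while the 88/month base rate, 1.3/month trend, 80 % entry probability, 20-50 month entry window and 30 % share loss come from the problem above.

```python
import math
import random

def poisson(lam):
    """Sample one Poisson(lam) variate (Knuth's multiplication method)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def simulate_sales(months=100, base=88.0, trend=1.3,
                   p_entry=0.80, entry_window=(20, 50), share_lost=0.30):
    """One Monte Carlo pathway of monthly sales with a possible competitor entry."""
    # Bernoulli flag: 1 (competitor enters) with probability p_entry, else 0
    enters = random.random() < p_entry
    # StepUniform equivalent: integer entry month; 1000 = far beyond the horizon
    entry_month = random.randint(*entry_window) if enters else 1000
    path = []
    for t in range(1, months + 1):
        expected = base + trend * t          # trending mean sales rate
        if t >= entry_month:
            expected *= (1 - share_lost)     # competitor takes ~30 % of sales
        path.append(poisson(expected))       # sales are Poisson about that mean
    return path
```

Running simulate_sales() repeatedly reproduces the two kinds of pathway in Figure 18.3: about 80 % of runs show a drop of roughly 30 % somewhere between months 20 and 50, and the rest follow the uninterrupted trend.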
18.1.2 Distributing market share

When competitors enter an established market they have to establish the reputation of their product and fight for market share with others that are already established. This takes time, so it is more realistic to model a gradual loss of market share to competitors. Consider the following problem. Market volume for your product is expected to grow each year by PERT(10 %, 20 %, 40 %), beginning next year at PERT(2500, 3000, 5000) units, up to a maximum of 20 000 units. You expect one competitor to emerge as soon as the market volume reaches 3500 units in the previous year. A second will appear at 8500 units. Your competitors' shares of the market will grow linearly until you all have equal market share after 3 years. Model the sales you will make. Figure 18.5 shows the model. It is mostly self-explanatory. The interesting component lies in cells F10:L10, which divide the forecast market for your product among the average of the number of competitors over the last 3 years and yourself (the "+1" in the equation). Averaging over 3 years is a neat way of allocating an emerging competitor 1/3 of your market strength in the first year, 2/3 in the second and equal strength from the third year on - meaning that they will then sell as many units as you. What is so helpful about this little trick is that it automatically takes into account each new competitor and when they entered the market, which is rather difficult to do otherwise. Note that we need three zeros in cells C8:E8 to initialise the model.
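The averaging trick can be captured in a one-line function. This Python fragment is a hypothetical sketch (the function name and example numbers are mine, not from the model); it reproduces the F10:L10 calculation of dividing each year's market volume by the 3-year average number of competitors plus 1 (ourselves):

```python
def my_sales(market_volume, competitor_counts):
    """Our sales this year: the market volume shared out by the 3-year
    average competitor count plus 1, the '+1' being ourselves."""
    avg_competitors = sum(competitor_counts[-3:]) / 3.0
    return round(market_volume / (avg_competitors + 1))

# A new entrant carries weight 1/3, then 2/3, then parity with us:
# counts (0, 0, 1) divide the market by 4/3, (0, 1, 1) by 5/3, (1, 1, 1) by 2.
```

So for a 4000-unit market, a competitor in their first, second and third year leaves us 3000, 2400 and 2000 units respectively, exactly the linear three-year convergence to equal share described above.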
Figure 18.5 Model of sales where the total market is shared with new-entry competitors. [Key formulae: growth rate: =VosePERT(10%,20%,40%); C8:E8: {0,0,0}; F8:L8 (competitors): =IF(E9>$C$4,2,IF(E9>$C$3,1,0)); F9:L9 (market volume): =VosePERT(2500,3000,5000) in the first year, then =MIN(20000,E9*(1+$C$5)); F10:L10 (sales volume, output): =ROUND(E9/(AVERAGE(C8:E8)+1),0).]

18.1.3 Reduced sales over time to a finite market

Some products are essentially a once-in-a-lifetime purchase, e.g. life insurance, a big flat-screen TV, a new guttering system or a pet identification chip. If we are initially quite successful in selling the product into the potential market, the remaining market size decreases, although this can be compensated for to some degree by new potential consumers. Consider the following problem: There are currently PERT(50 000, 55 000, 60 000) possible purchasers of your product. Each year there will be about a 10 % turnover (meaning 10 % more possible purchasers will appear). The probability that you will sell to any particular purchaser in a year is PERT(10 %, 20 %, 35 %). Forecast sales for the next 10 years. Figure 18.6 shows the model for this problem. Note that C8:C16 subtracts the sales already made from the previous year's market size but also adds in a regenerated market element. The binomial distribution then converts the current market size to sales. In the particular scenario shown in Figure 18.6, the probability of selling is high (26 %), so sales start off high and drop off quickly because the regeneration rate is so much lower (10 %). Note that some Monte Carlo software cannot handle large numbers of trials in their binomial distribution, in which case you will need to use a Poisson or normal approximation (Section III.9.1).
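As just noted, very large trial counts can be handled with a normal approximation to the binomial. The following Python sketch of the finite-market model is illustrative only: the function names are mine, and fixed mid-range values (55 000 initial purchasers, a 20 % chance of selling to each) stand in for the PERT inputs above.

```python
import random

def binomial_normal_approx(n, p):
    """Normal approximation to a Binomial(n, p) variate, for when n is
    too large for the software's exact binomial sampler."""
    mean = n * p
    sd = (n * p * (1 - p)) ** 0.5
    return max(0, round(random.gauss(mean, sd)))

def forecast_sales(initial_market=55000, p_sell=0.20, turnover=0.10, years=10):
    """Each year: sell to a binomial share of the market, remove those buyers,
    then regenerate the market by the turnover rate (new potential purchasers)."""
    market, sales = initial_market, []
    for _ in range(years):
        sold = binomial_normal_approx(market, p_sell)
        sales.append(sold)
        # next market = previous market minus sales, plus regeneration
        market = market - sold + round(turnover * market)
    return sales
```

With a selling probability well above the 10 % regeneration rate, the market shrinks each year by a factor of roughly 1 - p_sell + turnover, so simulated sales start high and tail off, as in Figure 18.6.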
Figure 18.6 Model forecasting sales over time to a finite market.

18.1.4 Growth of sales over time up to a maximum as a function of marketing effort

Sometimes we might find it easier to estimate what our annual sales will be when stabilised, but be unsure of how quickly we will be able to achieve that stability. In this sort of situation it can be easier to model a theoretical maximum sales level and match it to some ramping function. A typical form of such a ramping function r(t) is

r(t) = t / (t + t1/2)

which will produce a curve that starts at 0 for t = 0 and asymptotically reaches 1 at an infinite value of t, but reaches 0.5 at t = t1/2. Consider the following problem: you expect a final sales rate of PERT(1800, 2300, 3600) and expect to achieve half that in the next PERT(3.5, 4, 5) years. Produce a sales forecast for the next 10 years. Figure 18.7 provides a solution.

Figure 18.7 Model forecasting ramping sales to an uncertain theoretical maximum. [Key formulae: t1/2: =VosePERT(3.5,4,5); maximum sales: =VosePERT(1800,2300,3600).]

18.2 Summing Random Variables

Perhaps the most common errors in cashflow modelling occur when one wishes to sum a number of random costs, sales or revenues. For example, imagine that you expect to have Lognormal(100 000, 25 000) customers enter your store per year and that they will spend $Lognormal(55, 12) each - how would you estimate the total revenue? People generally write something like

Revenue = ROUND(Lognormal(100 000, 25 000), 0) * Lognormal(55, 12)    (18.1)

using the ROUND function in Excel to recognise that the number of people must be discrete. But let's think what happens when the software starts simulating. It will pick a random value from each distribution and multiply them together.
Picking a reasonably high till receipt, the probability that a random customer will spend more than $70, for example, is

P(Lognormal(55, 12) > 70) ≈ 11 %

The probability that two people will do the same is 11 % * 11 % = 1.2 %, and the probability that thousands of people will all spend that much is infinitesimally small. However, Equation (18.1) will assign an 11 % probability that all customers will spend over $70, no matter how many there are. The equation is wrong because it should have summed ROUND(Lognormal(100 000, 25 000), 0) separate Lognormal(55, 12) distributions. That's a big, slow model, so we use a variety of techniques to shortcut to the answer, which is the topic of Chapter 11.

18.3 Summing Variable Margins on Variable Revenues

A common situation is that we have a large random number of revenue items that follow the same probability distribution but that are independent of each other, and we have independent profit margins that follow another distribution that must be applied to each revenue item. This type of model quickly becomes extremely cumbersome to implement, because for each revenue item we need two distributions, one for the revenue and another for the profit margin, and we may have large numbers of revenue items. It is such a common problem that we designed a function in ModelRisk to handle it, allowing you to keep the model to a manageable size, speeding up simulation time and making the model far simpler to review. Perhaps most importantly, it allows you to avoid a lot of conditional logic that is easy to get wrong. (I apologise if this comes across as a sales pitch for ModelRisk, but it is designed with finance people in mind.) Consider the following problem. A venture capital company is considering investing in a company that makes TV shows. They expect to make PERT(8, 11, 17) pilots next year, each of which will independently generate a revenue of $PERT(120, 150, 250)k, from which the profit margin is PERT(1 %, 5 %, 12 %).
There is a 30 % chance that each pilot is made into a TV show in that country, running for Discrete({1, 2, 3, 4, 5}, {0.4, 0.25, 0.2, 0.1, 0.05}) series, where each season of each series generates $PERT(120, 150, 250)k with margins of PERT(15 %, 25 %, 45 %). There is a 20 % chance that these local series will be sold to the US, generating $PERT(240, 550, 1350)k per season sold, of which the profit margin is PERT(65 %, 70 %, 85 %). What is the total profit generated from next year's pilots? The problem is not technically difficult, but the scale of the modelling explodes very quickly. We worked on the model for a real investment of this type and it had many more layers: pilots in several countries, merchandising of various types, repeats, etc., and it took a lot of effort to manage. Figure 18.8 shows a surprisingly succinct model: rows 2 to 11 are the input data, rows 14 to 16 are the actual calculations.

Figure 18.8 Model forecasting profits from TV series. [Key formulae: F2: =ROUND(VosePERT(8-0.5,11,17+0.5),0); F3 (F4, F7, F10, F11 similar): =VosePERTObject(120,150,250); F6: =VoseDiscreteObject({1,2,3,4,5},{0.4,0.25,0.2,0.1,0.05}); F14: =VoseBinomial(F2,F5); E14: =VoseBinomial(F14,F9); D14: =F14-E14; D15:E15: =VoseAggregateMC(D14,$F$6); C16: =VoseSumProduct(F2,F3,F4); D16: =VoseSumProduct(D15,F7,F8); E16: =VoseSumProduct(E15,F7,F8)+VoseSumProduct(E15,F10,F11); F16 (output): =SUM(C16:E16).]

There are a few things to point out. In cell F2, 1/2 is subtracted from and added to the minimum and maximum estimates respectively of the number of pilots to give a more realistic chance of their occurrence after rounding. Distributions are input as ModelRisk objects in cells F3, F4, F6, F7, F8, F10 and F11 because we want to use these distributions many times. Cell C16, and elsewhere, uses the VoseSumProduct function to add together revenue * margin for each pilot, where the revenue and
margin distributions are defined by the distribution objects in cells F3 and F4 respectively. Cell F14 simulates the number of pilots that made it to become series, from which the model determines how many of those become series also sold into the US in cell E14, the difference being the number of pilots that only became local series in cell D14. Setting up the logic this way ensures that we have a consistent model: the local-only and the US-and-local series always add up to the total series produced. Cells D15 and E15 use the VoseAggregate(x, y) function to simulate the sum of x random variables all taking the same distribution y, defined as an object.

18.4 Financial Measures in Risk Analysis

The two main measures of profitability in DCF models are net present value (NPV) and internal rate of return (IRR). The two main measures of financial exposure are value at risk (VaR) and expected shortfall. Their pros and cons are discussed in Section 20.5.

18.4.1 Net present value

Net present value (NPV) attempts to determine the present value of a series of cashflows from a project that stretches out into the future. This present value is a measure of how much the company is gaining in today's money by undertaking the project: in other words, how much more the company itself will be worth by accepting the project. An NPV calculation discounts future cashflows at a specified discount rate r that takes account of:

1. The time value of money (e.g. if inflation is running at 4 %, £1.04 in a year's time is only worth £1.00 today).
2. The interest that could have been earned over inflation by investing instead in a guaranteed investment.
3. The extra ret