# Risk Analysis: A Quantitative Guide


Risk Analysis: A Quantitative Guide
David Vose
Third Edition

John Wiley & Sons, Ltd

Copyright © 2008 David Vose

Published by John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, England
Telephone: +44 (0)1243 779777
Email (for orders and customer service enquiries): cs-books@wiley.co.uk
Visit our Home Page on www.wileyeurope.com or www.wiley.com

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP, UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed to the Permissions Department, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England, or emailed to permreq@wiley.co.uk, or faxed to +44 (0)1243 770620.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The Publisher is not associated with any product or vendor mentioned in this book.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Other Wiley Editorial Offices: John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA; Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA; Wiley-VCH Verlag GmbH, Boschstr.
12, D-69469 Weinheim, Germany; John Wiley & Sons Australia Ltd, 42 McDougall Street, Milton, Queensland 4064, Australia; John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809; John Wiley & Sons Canada Ltd, 6045 Freemont Blvd, Mississauga, Ontario, Canada, L5R 4J3.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Library of Congress Cataloging-in-Publication Data
Vose, David.
Risk analysis : a quantitative guide / David Vose. - 3rd ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-51284-5 (cloth : alk. paper)
1. Monte Carlo method. 2. Risk assessment - Mathematical models. I. Title.
QA298.V67 2008
658.4'0352 - dc22

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-470-51284-5 (H/B)

Typeset in 10/12pt Times by Laserwords Private Limited, Chennai, India
Printed and bound in Great Britain by Antony Rowe Ltd, Chippenham, Wiltshire

Contents

Preface

Part 1 Introduction

1 Why do a risk analysis?
1.1 Moving on from "What If" Scenarios
1.2 The Risk Analysis Process
1.3 Risk Management Options
1.4 Evaluating Risk Management Options
1.5 Inefficiencies in Transferring Risks to Others
1.6 Risk Registers

2 Planning a risk analysis
2.1 Questions and Motives
2.2 Determine the Assumptions that are Acceptable or Required
2.3 Time and Timing
2.4 You'll Need a Good Risk Analyst or Team

3 The quality of a risk analysis
3.1 The Reasons Why a Risk Analysis can be Terrible
3.2 Communicating the Quality of Data Used in a Risk Analysis
3.3 Level of Criticality
3.4 The Biggest Uncertainty in a Risk Analysis
3.5 Iterate

4 Choice of model structure
4.1 Software Tools and the Models they Build
4.2 Calculation Methods
4.3 Uncertainty and Variability
4.4 How Monte Carlo Simulation Works
4.5 Simulation Modelling

5 Understanding and using the results of a risk analysis
5.1 Writing a Risk Analysis Report
5.2 Explaining a Model's Assumptions
5.3 Graphical Presentation of a Model's Results
5.4 Statistical Methods of Analysing Results

Part 2 Introduction

6 Probability mathematics and simulation
6.1 Probability Distribution Equations
6.2 The Definition of "Probability"
6.3 Probability Rules
6.4 Statistical Measures

7 Building and running a model
7.1 Model Design and Scope
7.2 Building Models that are Easy to Check and Modify
7.3 Building Models that are Efficient
7.4 Most Common Modelling Errors

8 Some basic random processes
8.1 Introduction
8.2 The Binomial Process
8.3 The Poisson Process
8.4 The Hypergeometric Process
8.5 Central Limit Theorem
8.6 Renewal Processes
8.7 Mixture Distributions
8.8 Martingales
8.9 Miscellaneous Examples

9 Data and statistics
9.1 Classical Statistics
9.2 Bayesian Inference
9.3 The Bootstrap
9.4 Maximum Entropy Principle
9.5 Which Technique Should You Use?
9.6 Adding uncertainty in Simple Linear Least-Squares Regression Analysis

10 Fitting distributions to data
10.1 Analysing the Properties of the Observed Data
10.2 Fitting a Non-Parametric Distribution to the Observed Data
10.3 Fitting a First-Order Parametric Distribution to Observed Data
10.4 Fitting a Second-Order Parametric Distribution to Observed Data

11 Sums of random variables
11.1 The Basic Problem
11.2 Aggregate Distributions

12 Forecasting with uncertainty
12.1 The Properties of a Time Series Forecast
12.2 Common Financial Time Series Models
12.3 Autoregressive Models
12.4 Markov Chain Models
12.5 Birth and Death Models
12.6 Time Series Projection of Events Occurring Randomly in Time
12.7 Time Series Models with Leading Indicators
12.8 Comparing Forecasting Fits for Different Models
12.9 Long-Term Forecasting

13 Modelling correlation and dependencies
13.1 Introduction
13.2 Rank Order Correlation
13.3 Copulas
13.4 The Envelope Method
13.5 Multiple Correlation Using a Look-Up Table

14 Eliciting from expert opinion
14.1 Introduction
14.2 Sources of Error in Subjective Estimation
14.3 Modelling Techniques
14.4 Calibrating Subject Matter Experts
14.5 Conducting a Brainstorming Session
14.6 Conducting the Interview

15 Testing and modelling causal relationships
15.1 Campylobacter Example
15.2 Types of Model to Analyse Data
15.3 From Risk Factors to Causes
15.4 Evaluating Evidence
15.5 The Limits of Causal Arguments
15.6 An Example of a Qualitative Causal Analysis
15.7 Is Causal Analysis Essential?
16 Optimisation in risk analysis
16.1 Introduction
16.2 Optimisation Methods
16.3 Risk Analysis Modelling and Optimisation
16.4 Working Example: Optimal Allocation of Mineral Pots

17 Checking and validating a model
17.1 Spreadsheet Model Errors
17.2 Checking Model Behaviour
17.3 Comparing Predictions Against Reality

18 Discounted cashflow modelling
18.1 Useful Time Series Models of Sales and Market Size
18.2 Summing Random Variables
18.3 Summing Variable Margins on Variable Revenues
18.4 Financial Measures in Risk Analysis

19 Project risk analysis
19.1 Cost Risk Analysis
19.2 Schedule Risk Analysis
19.3 Portfolios of Risks
19.4 Cascading Risks

20 Insurance and finance risk analysis modelling
20.1 Operational Risk Modelling
20.2 Credit Risk
20.3 Credit Ratings and Markov Chain Models
20.4 Other Areas of Financial Risk
20.5 Measures of Risk
20.6 Term Life Insurance
20.7 Accident Insurance
20.8 Modelling a Correlated Insurance Portfolio
20.9 Modelling Extremes
20.10 Premium Calculations

21 Microbial food safety risk assessment
21.1 Growth and Attenuation Models
21.2 Dose-Response Models
21.3 Is Monte Carlo Simulation the Right Approach?
21.4 Some Model Simplifications

22 Animal import risk assessment
22.1 Testing for an Infected Animal
22.2 Estimating True Prevalence in a Population
22.3 Importing Problems
22.4 Confidence of Detecting an Infected Group
22.5 Miscellaneous Animal Health and Food Safety Problems

I Guide for lecturers
II About ModelRisk
III A compendium of distributions
III.1 Discrete and Continuous Distributions
III.2 Bounded and Unbounded Distributions
III.3 Parametric and Non-Parametric Distributions
III.4 Univariate and Multivariate Distributions
III.5 Lists of Applications and the Most Useful Distributions
III.6 How to Read Probability Distribution Equations
III.7 The Distributions
III.8 Introduction to Creating Your Own Distributions
III.9 Approximation of One Distribution with Another
III.10 Recursive Formulae for Discrete Distributions
III.11 A Visual Observation on the Behaviour of Distributions
IV Further reading
V Vose Consulting
References
Index

Preface

I'll try to keep it short. This third edition is an almost complete rewrite. I have thrown out anything from the second edition that was really of pure academic interest - but that wasn't very much, and I had a lot of new topics I wanted to include, so this edition is quite a bit bigger. I apologise if you had to pay postage.

There are two main reasons why there is so much material to add since 2000. The first is that our consultancy firm has grown considerably, and, with the extra staff and talent, we have had the privilege of working on more ambitious and varied projects. We have particularly expanded in the insurance and finance markets, so you will see that a lot of techniques from those areas, which have far wider applications, appear throughout this edition. We have had contracts where we were given carte blanche to think up new ideas, and that really got the creative juices flowing.
I have also been involved in writing and editing various risk analysis guidelines that made me think more about the disconnect between what risk analysts produce and what risk managers need. This edition is split into two parts in an attempt to help remedy that problem.

The second reason is that we have built a really great software team, and the freedom to design our own tools has been a double espresso for our collective imagination. We now build a lot of bespoke risk analysis applications for clients and have our own commercial software products. It has been enormous fun starting off with a typical risk-based problem, researching techniques that would solve that problem if only they were easy to use and then working out how to make that happen. ModelRisk is the result, and we have a few others in the pipeline.

Some thank yous...

I have imposed a lot on Veerle and our children to get this book done. V has spent plenty of evenings without me while I typed away in my office, but I think she suffered much more living with a guy who was perpetually distracted by what he was going to write next. Sophie and Sébastien have also missed out. Papa always seemed to be working instead of playing with them. Worse, perhaps, it didn't stop raining all summer in Belgium, and they had to forego a holiday in the sun so I could finish writing. I'll make it up to all three of you, I promise.

I have the luxury of having some really smart and motivated people working with me. I have leaned rather heavily on the partners and staff in our consultancy firm while I focused on this book, particularly on Huybert Groenendaal, who has largely run the company in my "absence". He also wrote Appendix V. Timour Koupeev heads our programming team and has been infinitely patient in converting my never-ending ideas for our ModelRisk software into reality. He also wrote Appendix II. Murat Tomaev, our head programmer, has made it all work together.
Getting new modules for me to look at always feels a little like Christmas.

My secretary, Jane Pooley, retired from the company this year. She was the first person with enough faith to risk working for me, and I couldn't have wished for a better start. Wouter Smet and Michael van Hauwermeiren in our Belgian office have been a great support, going through the manuscript and models for this book. Michael wrote the enormous Appendix III, which could be a book in its own right, and Wouter offered many suggestions for improving the English, which is embarrassing considering it's his third language. Francisco Zagmutt wrote Chapter 16 while under pressure to finish his thesis for his second doctorate and being a full-time, jumping-on-airplanes, deadline-chasing senior consultant in our US office.

When Wiley sent me copies of the first edition, the first thing I did was go over to my parents' house and give them a copy. I did the same with the second edition, and the Japanese version too. They are all proudly displayed in the sitting room. I will be doing the same with this book. There's little that can beat knowing my parents are proud of me, as I am of them. Mum still plays tennis, rides and competes in target shooting. Dad is still a great golfer, and neither ever seems to stop working on their house, unless they're off to a party. They are a constant reminder to make the most of life.

Paul Curtis copy-edited the manuscript with great diligence and diplomacy. I'd love to know how he spotted inconsistencies and repetitions in parts of the text that were a hundred or more pages apart. Any remaining errors are all my fault.

Finally, have you ever watched those TV programmes where some guy with a long beard is teaching you how to paint in thirty minutes? I did once. He didn't have a landscape in front of him, so he just started painting what he felt like: a lake, then some hills, the sky, trees.
He built up his painting, and after about 20 minutes I thought - yes, that's finished. Then he added reflections, some snow, a bush or two in the foreground. Each time I thought - yes, now it's finished. That's the problem with writing a book (or software) - there's always something more to add or change or rewrite. So I have rather exceeded my deadline, and certainly the page estimate, and my thanks go to my editor at Wiley, Emma Cooper, for her gentle pushing, encouragement and flexibility.

Part 1 Introduction

The first part of this book is focused on helping those who have to make decisions in the face of risk. The second part of the book focuses on modelling techniques and has all the mathematics. The purpose of Part 1 is to help a manager understand what a risk analysis is and how it can help in decision-making. I offer some thoughts on how to build a risk analysis team, how to evaluate the quality of the analysis and how to ask the right questions so you get the most useful answers. This section should also be of use to analysts because they need to understand the managers' viewpoint and work towards the same goal.

Chapter 1 Why do a risk analysis?

In business and government one faces having to make decisions all the time where the outcome is uncertain. Understanding the uncertainty can help us make a much better decision. Imagine that you are a national healthcare provider considering which of two vaccines to purchase. The two vaccines have the same reported level of efficacy (67%), but further study reveals that there is a difference in confidence attached to these two performance measures: one is twice as uncertain as the other (see Figure 1.1). All else being equal, the healthcare provider would purchase the vaccine with the smallest uncertainty about its performance (vaccine A).
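The two-vaccine comparison can be sketched numerically. The book does not specify the distributions behind Figure 1.1, so the Beta distributions and "effective sample size" parameters below are purely illustrative assumptions: they simply construct two distributions with the same 67% mean, one of which is about twice as wide as the other.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative Beta distributions for the two vaccines' true efficacy.
# Both centre on the reported 67% efficacy; vaccine B's "effective
# sample size" is chosen so its distribution is roughly twice as wide
# as vaccine A's. These parameter values are invented for the sketch.
mean = 0.67
n_a, n_b = 99, 24  # smaller effective sample size => wider distribution
vaccine_a = rng.beta(mean * n_a, (1 - mean) * n_a, size=100_000)
vaccine_b = rng.beta(mean * n_b, (1 - mean) * n_b, size=100_000)

print(f"Vaccine A: mean {vaccine_a.mean():.3f}, sd {vaccine_a.std():.3f}")
print(f"Vaccine B: mean {vaccine_b.mean():.3f}, sd {vaccine_b.std():.3f}")
```

Both samples report the same expected efficacy, but vaccine B's standard deviation is about double vaccine A's, which is exactly the situation in which the decision-maker prefers A.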
Replace vaccine with investment and efficacy with profit and we have a problem in business, for which the answer is the same - pick the investment with the smallest uncertainty, all else being equal (investment A). The principal problem is determining that uncertainty, which is the central focus of this book.

We can think of two forms of uncertainty that we have to deal with in risk analysis. The first is a general sense that the quantity we are trying to estimate has some uncertainty attached to it. This is usually described by a distribution like the ones in Figure 1.1. Then we have risk events, which are random events that may or may not occur and for which there is some impact of interest to us. We can distinguish between two types of event:

A risk is a random event that may possibly occur and, if it did occur, would have a negative impact on the goals of the organisation. Thus, a risk is composed of three elements: the scenario; its probability of occurrence; and the size of its impact if it did occur (either a fixed value or a distribution).

An opportunity is also a random event that may possibly occur but, if it did occur, would have a positive impact on the goals of the organisation. Thus, an opportunity is composed of the same three elements as a risk.

A risk and an opportunity can be considered opposite sides of the same coin. It is usually easiest to consider a potential event to be a risk if it would have a negative impact and its probability is less than 50%, and, if the risk has a probability in excess of 50%, to include it in a base plan and then consider the opportunity of it not occurring.

1.1 Moving on from "What If" Scenarios

Single-point or deterministic modelling involves using a single "best-guess" estimate of each variable within a model to determine the model's outcome(s). Sensitivities are then performed on the model to determine how much that outcome might in reality vary from the model outcome.
This is achieved by selecting various combinations for each input variable. These various combinations of possible values around the "best guess" are commonly known as "what if" scenarios. The model is often also "stressed" by putting in values that represent worst-case scenarios.

Figure 1.1 Efficacy comparison for two vaccines: the vertical axis represents how confident we are about the true level of efficacy. I've omitted the scale to avoid some confusion at this stage (see Section III.1.2).

Consider a simple problem that is just the sum of five cost items. We can use the three points, minimum, best guess and maximum, as values to use in a "what if" analysis. Since there are five cost items and three values per item, there are 3^5 = 243 possible "what if" combinations we could produce. Clearly, this is too large a set of scenarios to have any practical use. This process suffers from two other important drawbacks: only three values are being used for each variable, where they could, in fact, take any number of values; and no recognition is being given to the fact that the best-guess value is much more likely to occur than the minimum and maximum values. We can stress the model by adding up the minimum costs to find the best-case scenario, and add up the maximum costs to get the worst-case scenario, but in doing so the range is usually unrealistically large and offers no real insight. The exception is when the worst-case scenario is still acceptable.

Quantitative risk analysis (QRA) using Monte Carlo simulation (the dominant modelling technique in this book) is similar to "what if" scenarios in that it generates a number of possible scenarios. However, it goes one step further by effectively accounting for every possible value that each variable could take and weighting each possible scenario by the probability of its occurrence. QRA achieves this by modelling each variable within a model by a probability distribution.
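The five-cost-item comparison can be sketched in a few lines of Python. The cost figures and the budget threshold below are invented for illustration, and the triangular distribution merely stands in for whatever distribution an analyst might actually choose for each item.

```python
import itertools
import random

random.seed(1)

# Five cost items, each with (minimum, best guess, maximum) estimates.
# The figures are invented for illustration.
items = [(8, 10, 14), (18, 20, 25), (4, 5, 7), (9, 12, 16), (28, 30, 35)]

# "What if" approach: three values per item gives 3**5 = 243 scenarios,
# with no weighting by likelihood.
scenarios = list(itertools.product(*items))

best_case = sum(lo for lo, _, _ in items)   # sum of minima
worst_case = sum(hi for _, _, hi in items)  # sum of maxima

# Monte Carlo approach: sample each item from a triangular distribution,
# so every intermediate value can occur and likelier values appear more
# often; scenarios are thereby weighted by their probability.
totals = [sum(random.triangular(lo, hi, mode) for lo, mode, hi in items)
          for _ in range(50_000)]

budget = 85  # a hypothetical budget threshold
p_over = sum(t > budget for t in totals) / len(totals)
print(f"{len(scenarios)} what-if scenarios; range {best_case}-{worst_case}; "
      f"P(total > {budget}) = {p_over:.3f}")
```

The what-if analysis can only report the (unrealistically wide) 67 to 97 range, whereas the simulation produces a full distribution of the total cost, from which a directly useful statistic such as the probability of exceeding the budget can be read off.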
The structure of a QRA model is usually (there are some important exceptions) very similar to a deterministic model, with all the multiplications, additions, etc., that link the variables together, except that each variable is represented by a probability distribution function instead of a single value. The objective of a QRA is to calculate the combined impact of the uncertainty¹ in the model's parameters in order to determine an uncertainty distribution of the possible model outcomes.

¹ I discuss the exact meaning of "uncertainty", randomness, etc., in Chapter 4.

1.2 The Risk Analysis Process

Figure 1.2 shows a typical flow of activities in a risk analysis, leading from problem formulation to decision. This section and those that follow provide more detail on each activity.

1.2.1 Identifying the risks

Risk identification is the first step in a complete risk analysis, given that the objectives of the decision-maker have been well defined. There are a number of techniques used to help formalise the identification of risks. This part of a formal risk analysis will often prove to be the most informative and constructive element of the whole process, improving company culture by encouraging greater team effort and reducing blame, and should be executed with care. The organisations participating in a formal risk analysis should take pains to create an open and blameless environment in which expressions of concern and doubt can be openly given.

Figure 1.2 The risk analysis process (a flowchart running from risk identification and question definition, through model design and simulation, to review of results and reporting).
Prompt lists

Prompt lists provide a set of categories of risk that are pertinent to the type of project under consideration or the type of risk being considered by an organisation. The lists are used to help people think about and identify risks. Sometimes different types of list are used together to further improve the chance of identifying all of the important risks that may occur. For example, in analysing the risks to some project, one prompt list might look at various aspects of the project (e.g. legal, commercial, technical, etc.) or types of task involved in the project (design, construction, testing). A project plan and a work breakdown structure, with all of the major tasks defined, are natural prompt lists. In analysing the reliability of some manufacturing plant, a list of different types of failure (mechanical, electrical, electronic, human, etc.) or a list of the machines or processes involved could be used. One could also cross-check with a plan of the site or a flow diagram of the manufacturing process. Check lists can be used at the same time: these are a series of questions one asks as a result of experience of previous problems or opportune events.

A prompt list will never be exhaustive but acts as a focus of attention in the identification of risks. Whether a risk falls into one category or another is not important, only that the risk is identified. The following list provides an example of a fairly general project prompt list. There will often be a number of subsections for each category: administration; project acceptance; commercial; communication; environmental; financial; knowledge and information; legal; management; partner; political; quality; resources; strategic; subcontractor; technical.

The identified risks can then be stored in a risk register, described in Section 1.6.
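A minimal risk register entry combines the three elements of a risk defined earlier (scenario, probability and impact) with a prompt-list category. The sketch below is one possible layout; the field names and example risks are hypothetical, not prescribed by the book.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    """One risk register entry: the three elements of a risk plus a
    prompt-list category. Field names are illustrative choices."""
    scenario: str
    category: str       # e.g. "technical", "legal", "commercial"
    probability: float  # chance the event occurs
    impact: float       # cost if it does occur (a fixed value here)

    @property
    def expected_impact(self) -> float:
        return self.probability * self.impact

# A tiny, invented register for a hypothetical project.
register = [
    Risk("Key subcontractor insolvency", "commercial", 0.05, 400_000),
    Risk("Design fails certification test", "technical", 0.20, 150_000),
    Risk("Permit delayed by appeal", "legal", 0.30, 60_000),
]

# Rank risks by expected impact to focus management attention.
for r in sorted(register, key=lambda r: r.expected_impact, reverse=True):
    print(f"{r.category:10s} {r.scenario}: {r.expected_impact:,.0f}")
```

Note that ranking by expected impact is only a first-pass triage: as the rest of the chapter argues, the full distribution of impacts (and the decision-maker's attitude to them) matters as well.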
1.2.2 Modelling the risk problem and making appropriate decisions

This book is concerned with the modelling of identified risks and how to make decisions from those models. In this book I try not to offer too many modelling rules. Instead, I have focused on techniques that I hope readers will be able to put together as necessary to produce a good model of their problem. However, there are a few basic principles that are worth adhering to. Morgan and Henrion (1990) offer the following excellent "ten commandments" in relation to quantitative risk and policy analysis:

1. Do your homework with literature, experts and users.
2. Let the problem drive the analysis.
3. Make the analysis as simple as possible, but no simpler.
4. Identify all significant assumptions.
5. Be explicit about decision criteria and policy strategies.
6. Be explicit about uncertainties.
7. Perform systematic sensitivity and uncertainty analysis.
8. Iteratively refine the problem statement and the analysis.
9. Document clearly and completely.
10. Expose to peer review.

The responses to correctly identified and evaluated risks are many, but generally fall into the following categories: increase (the project plan may be overly cautious); do nothing (because it would cost too much or there is nothing that can be done); collect more data (to better understand the risk); add a contingency (an extra amount to budget, deadline, etc., to allow for the possibility of the risk); reduce (e.g. build in redundancy, take a less risky approach); share (e.g. with a partner or contractor, providing they can reasonably handle the impact); transfer (e.g. insure, back-to-back contract); eliminate (e.g. do it another way); cancel the project.

This list can be helpful in thinking of possible responses to identified risks. It should be borne in mind that these risk responses might in turn carry secondary risks. Fall-back plans should be developed to deal with risks that are identified and not eliminated.
If done well in advance, they can help the organisation react efficiently, calmly and in unison in a situation where blame and havoc might normally reign.

1.3 Risk Management Options

The purpose of risk analysis is to help managers better understand the risks (and opportunities) they face and to evaluate the options available for their control. In general, risk management options can be divided into several groups.

Acceptance (Do nothing)

Nothing is done to control the risk or one's exposure to that risk. This is appropriate for risks where the cost of control is out of proportion with the risk. It is usually appropriate for low-probability, low-impact risks and opportunities, of which one normally has a vast list, but you may be missing some high-value risk mitigation or avoidance options, especially where they control several risks at once. If the chosen response is acceptance, some considerable thought should be given to risk contingency planning.

Increase

You may find that you are already spending considerable resources to manage a risk that is excessive compared with the level of protection that it affords you. In such cases, it is logical to reduce the level of protection and allocate the resources to manage other risks, thereby achieving a superior overall risk efficiency. Examples are: remove a costly safety regulation for nuclear power plants that affects a risk that would otherwise still be minuscule; cease the requirement to test all slaughtered cows for BSE and use the saved money for hospital upgrades. It may be logical but nonetheless politically unacceptable: there are not too many politicians or CEOs who want to explain to the public that they've just authorised less caution in handling a risk.

Get more information

A risk analysis can describe the level of uncertainty there is about the decision problem (here we use uncertainty as distinct from inherent randomness).
Uncertainty can often be reduced by acquiring more information (whereas randomness cannot). Thus, a decision-maker can determine that there is too much uncertainty to make a robust decision and request that more information be collected. Using a risk analysis model, the risk analyst can advise the least-cost method of collecting the extra data that would be needed to achieve the required level of precision. Value-of-information arguments (see Section 5.4.5) can be used to assess how much, if any, extra information should be collected.

Avoidance (Elimination)

This involves changing a method of operation, a project plan, an investment strategy, etc., so that the identified risk is no longer relevant. Avoidance is usually employed for high-probability, high-impact type risks. Examples are: use a tried and tested technology instead of the new one that was originally envisaged; change the country location of a factory to avoid political instability; scrap the project altogether. Note that there may be a very real chance of introducing new (and perhaps much more important) risks by changing your plans.

Reduction (Mitigation)

Reduction involves a range of techniques, which may be used together, to reduce the probability of the risk, its impact or both. Examples are: build in redundancy (standby equipment, back-up computer at a different location); perform more quality tests or inspections; provide better training to personnel; spread the risk over several areas (portfolio effect). Reduction strategies are used for any level of risk where the remaining risk is not of very high severity (very high probability and impact) and where the benefits (the amount by which the risk is reduced) outweigh the reduction costs.

Contingency planning

These are plans devised to optimise the response to risks should they occur. They can be used in conjunction with acceptance and reduction strategies.
A contingency plan should identify individuals who take responsibility for monitoring the occurrence of the risk, and/or identified risk drivers, for changes in the risk's probability or possible impact. The plan should identify what to do, who should do it and in which order, the window of opportunity, etc. Examples are: have a trained firefighting team on site; have a pre-prepared press release; have a visible phone list (or email distribution list) of whom to contact if the risk occurs; reduce police and emergency service leave during a strike; fit lifeboats on ships.

Another response to an identified risk is to add some reserve (buffer) to cover the risk should it occur. This is appropriate for small to medium impact risks. Examples are: allocate extra funds to a project; allocate extra time to complete a project; have cash reserves; have extra stock in shops for a holiday weekend; stockpile medical and food supplies.

Insurance

Essentially, this is a risk reduction strategy, but it is so common that it is worth mentioning separately. If an insurance company has done its numbers correctly, in a competitive market you will pay a little above the expected cost of the risk (i.e. probability × expected impact should the risk occur). In general, we therefore insure against risks that have an impact outside our comfort zone (i.e. where we value the risk higher than its expected value). Alternatively, you may feel that your exposure is higher than that of the average policy purchaser, in which case insurance may cost less than your expected cost and therefore be extremely attractive.

Risk transfer

This involves manipulating the problem so that the risk is transferred from one party to another. A common method of transferring risk is through contracts, where some form of penalty is attached to a contractor's performance. The idea is appealing and used often but can be very inefficient.
Examples are: penalty clause for running over the agreed schedule; performance guarantee of a product; lease a maintained building from the builder instead of purchasing; purchase an advertising campaign from some media body or advertising agency with payment contingent on some agreed measure of success.

You can also consider transferring risks to you, where there is some advantage to relieving another party of a risk. For example, if you can guarantee a second party against some small risk resulting from an activity you wish to take that provides you with much greater benefit than the other party's risk, the second party may remove its objection to your proposed activity.

1.4 Evaluating Risk Management Options

The manager evaluating the possible options for dealing with a defined risk issue needs to consider many things:

- Is the risk assessment of sufficient quality to be relied upon?
- How sensitive is the ranking of each option to model uncertainties?
- What are the benefits relative to the costs associated with each risk management option?
- Are there any secondary risks associated with a chosen risk management option?
- How practical will it be to execute the risk management option?

Is the risk assessment of sufficient quality to be relied upon? (See Chapter 3.)

How sensitive is the ranking of each option to model uncertainties? On this last point, we almost always would like to have better data, or greater certainty about the form of the problem: we would like the distribution of what will happen in the future to be as narrow as possible. However, a decision-maker cannot wait indefinitely for better data and, from a decision-analytic point of view, may quickly reach the point where the best option has been determined and no further data (or perhaps only a very dramatic change in knowledge of the problem) will make another option preferable.

Figure 1.3 Different possible outputs compared with a threshold T.
This concept is known as decision sensitivity. For example, in Figure 1.3 the decision-maker considers any output below a threshold T (shown with a dashed line) to be perfectly acceptable (perhaps this is a regulatory threshold or a budget). The decision-maker would consider option A to be completely unacceptable and option C to be perfectly fine, and would only need more information about option B to be sure whether it was acceptable or not, in spite of all three having considerable uncertainty.

1.5 Inefficiencies in Transferring Risks to Others

A common method of managing risks is to force or persuade another party to accept the risk on your behalf. For example, an oil company could require that a subcontractor welding a pipeline accept the costs to the oil company resulting from any delays they incur or any poor workmanship. The welding company will, in all likelihood, be far smaller than the oil company, so possible penalty payments would be catastrophic. The welding company will therefore value the risk as very high and will require a premium greatly in excess of the expected value of the risk. On the other hand, the oil company may be able to absorb the risk impact relatively easily, so would not value the risk as highly. The difference in the utility of these two companies is shown in Figures 1.4 to 1.7, which demonstrate that the oil company will pay an excessive amount to eliminate the risk. A far more realistic approach to sharing risks is through a partnership arrangement. A list of risks that may impact on the various parties involved in the project is drawn up, and for each risk one then asks: How big is the risk? What are the risk drivers? Who is in control of the risk drivers? Who has the experience to control them? Who could absorb the risk impacts? How can we work together to manage the risks?

Figure 1.4 The contractor's utility function is highly concave over the money gain/loss range in question.
That means, for example, that the contractor would value a loss of 100 units of money (e.g. $100 000) as a vastly larger loss in absolute utility terms than a gain of $100 000 might be.

Figure 1.5 Over that same money gain/loss range, the oil company has an almost exactly linear utility function.

The contractor, required to take on a risk with an expected value of -$60 000, would value this as -X utiles. To compensate, the contractor would have to charge an additional amount well in excess of $100 000. The oil company, on the other hand, would value -$60 000 in rough balance with +$60 000, so will be paying considerably in excess of its own valuation of the risk to transfer it to the contractor.

Figure 1.6 Imagine the risk has a 10 % probability of occurring and its impact would be -$300 000, to give an expected value of -$30 000. If $300 000 is the total capital value of the contractor, it won't much matter to the contractor whether the risk impact is $300 000 or $3 000 000; they still go bust. This is shown by the shortened utility curve and the horizontal dashed line for the contractor.

What arrangement would efficiently allocate the risk impacts and rewards for good risk management? Can we insure, etc., to share risks with outsiders? The more one can allocate ownership of risks, and opportunities, to those who control them the better, up to the point where the owner could not reasonably bear the risk impact where others can. Answering the questions above will help you construct a contractual arrangement that is risk efficient, workable and tolerable to all parties.

Figure 1.7 In this situation, the contractor now values any risk with an impact that exceeds its capital value at a level that is less than the oil company's (shown as "Discrepancy").
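The asymmetry between the two utility functions can be made concrete with certainty equivalents. The exponential utility model and both risk-tolerance figures below are my own illustrative assumptions; the book does not prescribe a particular utility function:

```python
import math

def certainty_equivalent(outcomes, probs, risk_tolerance):
    """Certainty equivalent of a gamble under exponential utility
    U(x) = 1 - exp(-x / R). A small risk tolerance R means the utility
    curve is strongly concave (strong aversion to losses)."""
    expected_u = sum(p * (1 - math.exp(-x / risk_tolerance))
                     for x, p in zip(outcomes, probs))
    return -risk_tolerance * math.log(1 - expected_u)

# The risk of Figure 1.6: a 10 % chance of a $300k loss (amounts in $k)
outcomes, probs = [-300.0, 0.0], [0.10, 0.90]
ev = sum(p * x for x, p in zip(outcomes, probs))  # expected value: -30.0

# Assumed risk tolerances: $200k for the small contractor,
# $10M for the large oil company (hypothetical figures)
ce_contractor = certainty_equivalent(outcomes, probs, risk_tolerance=200)
ce_oil = certainty_equivalent(outcomes, probs, risk_tolerance=10_000)

print(f"Expected value: {ev:.1f} $k")
print(f"Contractor's certainty equivalent: {ce_contractor:.1f} $k")
print(f"Oil company's certainty equivalent: {ce_oil:.1f} $k")
```

Under these assumptions the contractor values the risk at roughly -$60k while the oil company values it near its -$30k expected cost, mirroring the text's point that the oil company pays nearly twice its own valuation to transfer the risk.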
It may mean that the contractor can offer a more competitive bid than another, larger contractor who would feel the full risk impact, but the oil company will not have covered the risk it had hoped to transfer, and so again will be paying more than it should to offload the risk. Of course, one way to avoid this problem is to require evidence from the contractor that they have the necessary insurance or capital base to cover the risk they are being asked to absorb.

1.6 Risk Registers

A risk register is a document or database that lists each risk pertaining to a project or organisation, along with a variety of information that is useful for the management of those risks. The risks listed in a risk register will have come from some collective exercise to identify risks. The following items are essential in any risk register entry:

- date the register was last modified;
- name of the risk;
- a description of what the risk is;
- a description of why it would occur;
- a description of the factors that would increase or decrease its probability of occurrence or size of impact (risk drivers);
- semi-quantitative estimates of its probability and potential impact;
- P-I scores;
- name of the owner of the risk (the person who will assume responsibility for monitoring the risk and effecting any risk reduction strategies that have been agreed);
- details of the risk reduction strategies it is agreed will be taken (i.e. strategies that will reduce the impact on the project should the risk event occur and/or the probability of its occurrence);
- the reduced impact and/or probability of the risk, given that the above agreed risk reduction strategies have been taken;
- ranking of the risk by scores of the reduced P-I;
- cross-referencing of the risk event to identification numbers of tasks in a project plan, or to areas of operation or regulation where the risk may impact;
- a description of secondary risks that may arise as a result of adopting the risk reduction strategies;
- the action window: the period during which risk reduction strategies must be put in place.

The following items may also be useful to include:

- a description of other optional risk reduction strategies;
- ranking of risks by the possible effectiveness of further risk mitigation [effectiveness = (total decrease in risk)/(cost of risk mitigation action)];
- a fall-back plan in the event the risk event still occurs;
- the name of the person who first identified the risk;
- the date the risk was first identified;
- the date the risk was removed from the list of active risks (if appropriate).

A risk register should include a description of the scale used in the semi-quantitative analysis, as explained in the section on P-I scores. A risk register should also have a summary that lists the top risks (ten is a fairly usual number, but this will vary according to the project or overview level). The "top" risks are those that have the highest combination of probability and impact (i.e. severity), after the reducing effects of any agreed risk reduction strategies have been included. Risk registers lend themselves perfectly to being stored in a networked database. In this way, risks from each project or regulatory body's concerns, for example, can be added to a common database. Then, a project manager can access that database to look at all risks to his or her project.
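The essential register fields map naturally onto a record structure. A minimal sketch follows; the field names are my own, not prescribed by the text, and the P + I severity convention follows the semi-quantitative log-scale scoring described under P-I scores:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class RiskRegisterEntry:
    """One row of a risk register (illustrative field names)."""
    name: str
    description: str                 # what the risk is
    cause: str                       # why it would occur
    drivers: list                    # factors changing P or I
    probability_score: int           # semi-quantitative score, e.g. 1-5
    impact_score: int
    owner: str                       # person monitoring the risk
    reduction_strategies: list       # agreed mitigation actions
    reduced_probability_score: int   # scores after agreed mitigation
    reduced_impact_score: int
    linked_task_ids: list = field(default_factory=list)
    secondary_risks: list = field(default_factory=list)
    action_window: str = ""
    last_modified: date = field(default_factory=date.today)

    def severity(self, reduced: bool = True) -> int:
        """P + I on the log scale used for P-I scores."""
        if reduced:
            return self.reduced_probability_score + self.reduced_impact_score
        return self.probability_score + self.impact_score
```

Sorting a list of entries by `entry.severity()` then yields the "top risks" summary directly.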
The finance director, lawyer, etc., can look at all the risks from any project being managed by their departments, and the chief executive can look at the major risks to the organisation as a whole. What is more, head office has an easy means for assessing the threat posed by a risk that may impact on several projects or areas at the same time. "Dashboard" software can bring the outputs of a risk register into appropriate focus for the decision-makers.

1.6.1 P-I tables

The risk identification stage attempts to identify all risks threatening the achievement of the project's or organisation's goals. It is clearly important, however, that attention is focused on those risks that pose the greatest threat.

Defining qualitative risk descriptions

A qualitative assessment of the probability P of a risk event (a possible event that would produce a negative impact on the project or organisation) and the impact(s) it would produce, I, can be made by assigning descriptions to the magnitudes of these probabilities and impacts. The assessor is asked to describe the probability and impact of each risk, selecting from a predetermined set of phrases such as: nil, very low, low, medium, high and very high. A range of values is assigned to each phrase in order to maintain consistency between the estimates of each risk. An example of the value range that might be given to each phrase in a risk register for a particular project is shown in Table 1.1. Note that in Table 1.1 the value ranges are not evenly spaced. Ideally there is a multiple difference between each range (in this case roughly 3). If the same multiple is applied for the probability and impact scales, we can more easily determine severity scores as described below. The value ranges can be selected to match the size of the project. Alternatively, they can be matched to the effect the risks would have on the organisation as a whole. The drawback in making the definition of each phrase specific to a project is that it becomes very difficult to perform a combined analysis of the risks from all projects in which the organisation is involved.

Table 1.1 An example of the value ranges that could be associated with qualitative descriptions of the probabilities and impacts of a risk on a project.

Category   | Probability (%) | Delay (days) | Cost ($k) | Quality
Very high  | 10-50           | >100         | >1000     | Failure to meet acceptance criteria
High       | 5-10            | 30-100       | 300-1000  | Failure to meet >1 important specification
Medium     | 2-5             | 10-30        | 100-300   | Failure to meet an important specification
Low        | 1-2             | 2-10         | 20-100    | Failure to meet >1 minor specification
Very low   | <1              | <2           | <20       | Failure to meet a minor specification

From a corporate perspective one can describe how a risk affects the health of a company, as shown in Table 1.2.

Table 1.2 An example of the descriptions that could be associated with impacts of a risk on a corporation.

Category      | Description
Catastrophic  | Jeopardises the existence of the company
Major         | No longer possible to achieve business objectives
Moderate      | Reduced ability to achieve business objectives
Minor         | Some business disruption but little effect on business objectives
Insignificant | No impact on business strategy objectives

Visualising a portfolio of risks

A P-I table offers a quick way to visualise the relative importance of all identified risks that pertain to a project (or organisation). Table 1.3 illustrates an example. All risks are plotted on the one table, allowing easy identification of the most threatening risks as well as providing a general picture of the overall riskiness of the project. Risk numbers 13, 2, 12 and 15 are the most threatening in this example. The impact of a project risk that is most commonly considered is a delay in the scheduled completion of the project.
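The value ranges of Table 1.1 can be encoded as a simple lookup that keeps category assignments consistent between assessors. A sketch follows; the lower bounds are my reading of Table 1.1, and the cost bands in particular are partly an assumption:

```python
def categorise(value: float, bands: list) -> str:
    """Return the first category whose lower bound the value reaches.
    Bands must be listed from most to least severe."""
    for name, lower_bound in bands:
        if value >= lower_bound:
            return name
    return bands[-1][0]  # fallback; unreachable if the last bound is 0

# Lower bounds read from Table 1.1 (probability in %, cost in $k)
PROBABILITY_BANDS = [("Very high", 10), ("High", 5), ("Medium", 2),
                     ("Low", 1), ("Very low", 0)]
COST_BANDS = [("Very high", 1000), ("High", 300), ("Medium", 100),
              ("Low", 20), ("Very low", 0)]

print(categorise(7, PROBABILITY_BANDS))   # a 7 % chance -> "High"
print(categorise(150, COST_BANDS))        # a $150k impact -> "Medium"
```

Scanning from most to least severe means each band is implicitly bounded above by the band before it, matching the table's non-overlapping ranges.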
Table 1.3 Example of a P-I table for schedule delay.

However, an analysis may also consider the increased cost of the project resulting from each risk. It might further consider other, less numerically definable impacts on the project, for example: the quality of the final product; the goodwill that could be lost; sociological impacts; political damage; or the strategic importance of the project to the organisation. A P-I table can be constructed for each type of impact, enabling the decision-maker to gain a more rounded understanding of a project's riskiness. P-I tables can also be constructed for the various types of impact of each single risk.

Table 1.4 P-I table for a specific risk.

Table 1.4 illustrates an example where the impacts of schedule delay, T, cost, $, and product quality, Q, are shown for a specific risk. The probability of each impact may not be the same. In this example, the probability of the risk event occurring is high, and hence the probabilities of schedule delay and cost impacts are high, but it is considered that, even if this risk event does occur, the probability of a quality impact is still low. In other words, there is a fairly small probability of a quality impact even when the risk event does occur.

Ranking risks

P-I scores can be used to rank the identified risks. A scaling factor, or weighting, is assigned to each phrase used to describe each type of impact. Table 1.5 provides an example of the type of scaling factors that could be associated with each phrase/impact type combination. In this type of scoring system, the higher the score, the greater is the risk. A base measure of risk is probability x impact. The categorising system in Table 1.1 is on a log scale, so, to make Table 1.5 consistent, we can define the severity of a risk with a single type of impact as

severity = P + I

(the sum of the probability and impact scores), which leaves the severity on a log scale too.
If a risk has k possible types of impact (quality, delay, cost, reputation, environmental, etc.), perhaps with different probabilities for each impact type, we can still combine them into one score as follows:

severity = log10(10^(P1+I1) + 10^(P2+I2) + ... + 10^(Pk+Ik))

where Pi and Ii are the probability and impact scores for the ith impact type.

Table 1.5 An example of the scores that could be associated with descriptive risk categories to produce a severity score.

Category  | Score
Very high | 5
High      | 4
Medium    | 3
Low       | 2
Very low  | 1

The severity scores are then used to determine the most important risks, enabling management to focus resources on reducing or eliminating risks from the project in a rational and efficient manner. A drawback to this approach of ranking risks is that the process is quite dependent on the granularity of the scaling factors that are assigned to each phrase describing the risk impacts. If we have better information on probability or impact than the scoring system would allow, we can assign a more accurate (non-integer) score. In the scoring regime of Table 1.5, for example, a high-severity risk could be defined as having a score higher than 7, and a low-severity risk as having a score lower than 5. Given the crude scaling used, risks with a severity of exactly 7 may require further investigation to determine whether they should be categorised as high severity. Table 1.6 shows how this segregates the risks shown in a P-I table into the three regions (high, medium and low severity). P-I scores for a project provide a consistent measure of risk that can be used to define metrics and perform trend analyses. For example, the distribution of severity scores for a project gives an indication of the overall "amount" of risk exposure. More complex metrics can be derived using severity scores, allowing risk exposure to be normalised and compared with a baseline status. These permit trends in risk exposure to be identified and monitored, giving valuable information to those responsible for controlling the project.
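The severity arithmetic can be sketched directly with the scores of Table 1.5. The log10 combination across impact types is my reading of how the text's log-scale severity extends to several impacts, and the example scores are invented:

```python
import math

def severity_single(p_score: int, i_score: int) -> float:
    """Severity on the log scale: P + I, since log(P * I) = log P + log I."""
    return p_score + i_score

def severity_combined(impacts: list) -> float:
    """Combine k impact types, each with its own (P, I) scores, by
    summing on the linear scale and returning to the log scale."""
    return math.log10(sum(10 ** (p + i) for p, i in impacts))

def band(severity: float) -> str:
    """Crude segregation into the three regions of Table 1.6."""
    if severity > 7:
        return "High severity"
    if severity < 5:
        return "Low severity"
    return "Medium severity"

# A risk with high P/high I on schedule, high P/medium I on cost,
# and a low probability of a medium quality impact (invented scores):
s = severity_combined([(4, 4), (4, 3), (2, 3)])
print(round(s, 2), band(s))
```

Note that the combined score is dominated by the largest P + I term, which is the behaviour one wants from a log-scale severity: a single big impact type is not washed out by several trivial ones.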
Efficient risk management with severity scores

Efficient risk management seeks to achieve the maximum reduction in risk for a given amount of investment (of people, time, money, restriction of liberty, etc.). Thus, we need to evaluate in some sense the ratio (reduction in risk)/(investment to achieve reduction). If you use the log scale for severity described here, this would equate to calculating

(10^severity(before mitigation) - 10^severity(after mitigation)) / (cost of mitigation)

since the log-scale severity scores must be converted back to a linear scale before the reduction in risk can be taken. The risk management options that provide the greatest efficiency should logically be preferred, all else being equal. Inherent risks are the risk estimates before accounting for any mitigation efforts. They can be plotted against a guiding risk response framework where the P-I table is split, covered by overlapping areas of avoid, control, transfer and accept, as shown in Figure 1.8 (a P-I graph for inherent risks):

- "Avoid" applies where an organisation would be accepting a high-probability, high-impact risk without any compensating benefits.
- "Control" applies usually to high-probability, low-impact risks, normally associated with repetitive actions, and therefore usually managed through better internal processes.
- "Transfer" applies to low-probability, high-impact risks, usually managed through insurance or other means of transferring the risk to parties better capable of absorbing the impact.
- "Accept" applies to the remaining low-probability, low-impact risks, on which it may not be cost effective to focus too much attention.

Figure 1.9 (a P-I graph for residual risks) plots residual risks after any implemented risk mitigation strategies and tracks the progress in managing the residual risks compared with the previous year using arrows. Grey letters represent the status of the risk last year if it is different. A dashed arrow pointing out of the graph means that the risk has been avoided.
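The efficiency ratio for comparing mitigation options can be sketched as follows; the two options, their severities and their costs are invented for illustration:

```python
def mitigation_efficiency(severity_before: float,
                          severity_after: float,
                          cost: float) -> float:
    """(reduction in risk) / (investment): de-log the log-scale
    severity scores before taking the difference."""
    return (10 ** severity_before - 10 ** severity_after) / cost

# Hypothetical options for the same inherent risk (severity 8.0);
# costs are in the same money units throughout
options = {
    "extra weld inspections": mitigation_efficiency(8.0, 6.5, cost=50.0),
    "penalty clause":         mitigation_efficiency(8.0, 7.5, cost=10.0),
}

best = max(options, key=options.get)
print(best)  # -> "penalty clause"
```

Note that the cheaper option wins here despite removing less risk, because efficiency divides the (linear-scale) risk reduction by the cost; all else being equal, the most efficient option is preferred.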
An enhancement to the residual risk graph that you might like to add is to plot each risk as a circle whose radius reflects how comfortable you are in dealing with the residual risk - for example, perhaps you have handled the occurrence of similar risks before and minimised their impact through good management, or perhaps they got out of hand. A small circle represents risks that one is comfortable managing, and a large circle represents the opposite, so the less manageable risks stand out in the plot.

Chapter 2 Planning a risk analysis

In order to plan a risk analysis properly, you'll need to answer a few questions: What do you want to know and why? What assumptions are acceptable? What is the timing? Who is going to do the risk analysis? I'll go through each of these in turn.

2.1 Questions and Motives

The purpose of a risk analysis is to provide information to help make better decisions in an uncertain world. A decision-maker has to work with the risk analyst to define precisely the questions that need answering. You should consider a number of things:

1. Rank the questions that need answering from "critical" down to "interesting". Often a single model cannot answer all questions, or has to be built in a complicated way to answer several questions, so a common recognition of the extra effort needed to answer each question going down the list helps determine a cut-off point.

2. Discuss with the risk analyst the form of the answer. For example, if you want to know how much extra revenue might be made by buying rather than leasing a vessel, you'll need to specify a currency, whether this should be expressed as a percentage or in actual currency, and whether you want just the mean (which can make the modelling a lot easier) or a graph of the distribution. Explain what statistics you need and to what accuracy (e.g.
asking for the 95th percentile to the nearest $1000), as this will help the risk analyst save time or figure out that an unusual approach might be needed to get the required accuracy.

3. Explain what arguments will be based on these outputs. I am of the view that this is a key breakdown area, because a decision-maker might ask for specific outputs and then put them together into an argument that is probabilistically incorrect. Much embarrassment and frustration all round. It is better to explain the arguments (e.g. comparing with the distribution of another potential project's extra revenue) that would be put forward, and find out whether the risk analyst agrees that this is technically correct, before you get started.

4. Explain whether the risk analysis has to sit within a framework. This could be a formal framework, like a regulatory requirement or a company policy, or it could be informal, like building up a portfolio of risk analyses that can be compared on the same footing (for example, we are helping a large chemical manufacturer to build up a combined toxicological, environmental, etc., risk analysis database for their treasure chest of compounds). It will help the risk analyst ensure the maximum level of compatibility - e.g. that the same base assumptions are used between risk analyses.

5. Explain the target audience. We write reports on all our risk analyses, of course, but sometimes there can be several versions: the executive summary; the main report; and the technical report with all the formulae and guides for testing. Often, others will want to run the model and change parameters, so we make a model version that minimises the ability to mess up the mathematics, and write the code to allow the most flexibility. These days we usually put a VBA user interface on the front to make life easier and perhaps add a reporting facility to compare results. We might add a help file too.
Clients will also sometimes ask us to prepare a PowerPoint presentation. Knowing the knowledge level and focus of each target audience, and knowing at the outset what types of reporting will be needed, saves a lot of time.

6. Discuss any possible hostile reactions. The results of a risk analysis will not always be popular, and when people dislike the answers they start attacking the model (or, if you're unlucky, the modeller). Assumptions are the primary Achilles' heel, as we can argue forever about whether assumptions are right. I talk about getting buy-in for assumptions in Section 5.2. Statistical analysis of data is also rather draining - it usually involves a couple of very technical people with opposing arguments about the appropriateness of a statistical procedure that nobody else understands. The decision to include and exclude certain datasets can also create a lot of tension. The arguments can be minimised, or at least convincingly dismissed, if people likely to be hostile are brought into the analysis process early, or if an external expert is asked to give an independent review.

7. Figure out a timeline. Decision-makers have something of a habit of setting unrealistic deadlines. When these deadlines pass, nothing very dramatic usually happens, as the deadlines are some artificial internal confection. Our consultants deal with deadlines all the time, of course, but we openly discuss whether a deadline is really that important because, if we have to meet a tight deadline (and that happens), the quality of the risk analysis may be lower than would have been achievable with more time. The decision-maker has to be honest about time limits and decide whether it is worth postponing things for a bit.

8. Figure out the priority level. The risk analyst might have other work to juggle too.
The project might be of high importance and justify pulling off other resources to help with the analysis, or instructing others in the organisation to set aside time to provide good-quality input.

9. Decide on how regularly the decision-maker and risk analyst will meet. Things change and the risk analysis may have to be modified, so find that out sooner rather than later.

2.2 Determine the Assumptions that are Acceptable or Required

If a risk analysis is to sit within a certain framework, as discussed above, it may well have to comply with a set of common assumptions to allow meaningful comparisons between the results of different analyses. Sometimes it is better not to revise some assumptions for a new analysis, because doing so makes comparison impossible. You can often see a similar problem with historic data, e.g. in calculating crime or unemployment statistics: the basis for these statistics keeps changing, making it impossible to know whether the problem is getting better or worse. In a corporate environment there will be certain base assumptions used for things like interest and exchange rates, production capacity and energy price. The same assumptions should be used in all models. In a risk analysis world these should be probabilistic forecasts, but they are nonetheless often fixed-point values. Oil companies, for example, have the challenging job of figuring out what the oil price might be in the future. They can get it very wrong, so often take a low price for planning purposes, e.g. $16 a barrel, which in 2007 might seem rather unlikely for the future. The risk analyst working hard on getting everything else really precise could find such an assumption irritating, but it allows consistency between analyses where oil price forecast uncertainty could be so large as to mask the differences between investment opportunities.
Some assumptions we make are conservative, meaning that if, for example, we need a certain percentile of the output to be above X before we accept the risk as acceptable, then a conservative assumption will bias the output to lower values. Thus, if the output still gives numbers that say the risk is acceptable, we know we are on pretty safe ground. Conservative assumptions are most useful as a sensitivity tool to demonstrate that one has not taken an unacceptable risk, but they are to be avoided whenever possible because they run counter to the principle of risk analysis, which is to give an unbiased report of uncertainty.

2.3 Time and Timing

We get a lot of requests to help "risk" a model. The potential client has spent a few months working on a problem, building up a cashflow model, etc., and the decision-makers decide the week before the board meeting that they really should have a risk analysis done. If done properly, risk analysis is an integral part of the planning of a project, not an add-on at the end. One of the prime reasons for doing risk analyses is to identify risks and risk management strategies so that the decision-makers can decide how the risks can be managed, which could well involve a revision of the project plan. That can save a lot of time and money on a project. If risk analysis is added on at the end, you lose all that potential benefit. The data collection efforts required to produce a fixed-value model of a project are little different from the efforts required for a risk analysis, so adding a risk analysis on at the end is inefficient and delays a project, as the risk analyst has to go back over previous work. We advocate that a risk analyst write the report as the model develops. This helps keep track of what one is doing and makes it easier to meet the report submission deadline at the end. I also like to write down my thinking because it helps me spot any mistakes early.
Finally, try to allow the risk analyst enough time to check the model for errors and get it reviewed. Chapter 16 offers some advice on model validation.

2.4 You'll Need a Good Risk Analyst or Team

If the risk analysis is a one-off and the outcome is important to you, I recommend you hire in a consultant risk analyst. Well, I would say that, of course, but it does make a lot of sense. Consultants are expensive on a daily basis but, certainly at Vose Consulting, we are far faster (my guess is over 10 times faster than a novice): we know what we're doing and we know how to communicate and organise effectively. Please don't get a bright person within your organisation, install some risk analysis software on their computer and tell them to get on with the job. It will end in tears. The publishers of risk analysis software (Crystal Ball, @RISK, Analytica, Risk+, PERTmaster, etc.) have made risk analysis modelling very easy to implement from a software viewpoint. The courses they teach show you how to drive the software and reinforce the notion that risk analysis modelling is pretty easy (Vose Consulting courses generally assume you have already attended a software familiarisation course). In a lot of cases, risk analysis is in fact pretty easy, as long as you avoid some common basic errors discussed in Section 7.4. However, it can also become quite tricky, for sometimes subtle reasons, and you should have someone who understands risk analysis well enough to be able to recognise and handle the trickier models. Knowing how to use Excel won't make you an accountant (but it's a good first step), and learning how to use risk analysis software won't make you a risk analyst (but it's also a good first step). There are still very few tertiary courses in risk analysis, and those courses tend to be highly focused on particular areas (financial modelling, environmental risk assessment, etc.).
I don't know of any tertiary courses that aim to produce professional risk analysts who can work across many disciplines. There are very few people who could say they are qualified to be a risk analyst. This makes it pretty tough to know where to search and to be sure you have found someone who will have the knowledge to analyse your risks properly. It seems that industry-specific risk analysts also have little awareness of the narrowness of their knowledge: a little while ago we advertised for two highly qualified actuarial and financial risk analysts with several years' experience and received a large number of applications from people who were risk analysts in toxicology, microbial, environmental and project areas with almost no overlap in the required skill sets.

2.4.1 Qualities of a risk analyst

I often get asked by companies and government agencies what sort of person they should look for to fill a position as a risk analyst. In my view, candidates should have the following characteristics:

- Creative thinkers. Risk analysis is about problem-solving. This is at the top of my list and is the rarest quality.
- Confident. We often have to come up with original solutions. I've seen too many pieces of work that have followed some previously published method because it is "safer". We also have to present to senior decision-makers and maybe defend our work in front of hostile stakeholders or a court.
- Modest. Too many risk analyses fail to meet their requirements because of a risk analyst who thought she/he could do it without help or consultation.
- Thick-skinned. Risk analysts bring together a lot of disparate information and ideas, sometimes conflicting, sometimes controversial, and we produce outputs that are not always what people want to see, so we have to be prepared for a fair amount of enthusiastic criticism.
- Communicators. We have to listen to a lot of people and present ideas that are new and sometimes difficult to understand.
- Pragmatic.
Our models could always be better with more time, data and resources, but decision-makers have deadlines.
- Able to conceptualise. There are a lot of tools at our disposal that have been developed in various fields of risk, so the risk analyst needs to read widely and be able to extrapolate an idea from one application to another.
- Curious. Risk analysts need to keep learning.
- Good at mathematics. Take a look at Part 2 of this book to get a feel for the level. It will depend on the area: project risk requires more intuition and perseverance but less mathematics, insurance and finance require intuition and high mathematical skills, and food safety requires medium levels of everything.
- A feel for numbers. It is one thing to be good at mathematics, but we also have to have an idea of where the numbers should lie, because it (a) helps us check the work and (b) allows us to know where we can take shortcuts.
- Finishers. Some people are great at coming up with ideas, but lose interest when it comes to implementing them. Risk analysts have to get the job done.
- Cynical. We have to maintain a healthy cynicism about published work and about how good our subject matter experts are.
- Pedantic. When developing probability models, one needs to be very precise about exactly what each variable represents.
- Careful. It is easy to make mistakes.
- Social. We have to work in teams.
- Neutral. Our job is to produce an objective risk analysis. A project manager is not usually ideal to perform the project risk analysis because it may reflect on his/her ability to manage and plan. A scientist is not ideal if she/he has a pet theory that could slant the approach taken.

It's a demanding list and indicates, I think, that risk analysis should be performed by people of high skill levels who are fairly senior and in a respected position within a company or agency.
It is also rather unlikely that you will find all these qualities in one person: the best risk analysis units with which we work are composed of a number of individuals with complementary skills and strengths.

2.4.2 Suitable education

I interviewed a statistics student a couple of months back. This person was just finishing a PhD and had top grades throughout from a very reputable school. I asked a pretty simple question about estimating a prevalence and got a vague answer about how this person would perform the appropriate test and report the confidence interval, but the student couldn't tell me what that test might be (this is a really basic Statistics 101-type question). I offered some numbers and asked what the bounds might roughly be, but the interviewee had absolutely no idea. With each question it became very clear that this person had been taught a lot of theory but had no feel for how to use it, and no sense of numbers. We didn't hire. I interviewed another person who had written a very sophisticated traffic model using discrete event simulation (which we use a fair bit) that was helping decide how to manage boat traffic. The model predicted that putting in traffic lights on the narrow part of some waterway would produce a horrendous number of crashes at the traffic light queues, easily outweighing the crashes avoided by letting vessels pass each other in the narrow part of the waterway. Conclusion: no traffic lights. That seemed strange to me and, after some thought, the interviewee explained it was probably because the model used a probability of crashing that was inversely proportional to the distance between the vessels, and vessels in a queue are very close, so the model generated lots of crashes.
But they are also barely moving, I pointed out, so the probability of a collision will be lower at a given distance for vessels at the lights than for vessels passing each other at speed, and any contact between waiting vessels would have a negligible effect. The modeller responded that the probability could be changed. We didn't hire that person either, because the modeller had never stepped back and asked "does this make sense?". I interviewed a student who was just finishing a Masters degree and was writing up a thesis on applying probability models from physics to financial markets. This person explained that studying had become rather dull because it was always about learning what others had done, but the thesis was a different story because there was a chance to think for oneself and come up with something new. The student was very enthusiastic, had great mathematics and could really explain to me what the thesis was about. We hired, and I have no regrets. A prospective hire for a risk analysis position will need some sort of quantitative background. I think the best candidates tend to have a background that combines attempting to model the real world with using the results to make decisions. In these areas, approximations and the tools of approximation are embraced as necessary and useful, and there is a clear purpose to modelling that goes beyond the academic exercise of producing the model itself. Applied physics, engineering, applied statistics and operations research are all very suitable. Applied physics is the most appealing of all of them (I may be biased, I studied physics as an undergraduate) because in physics we hypothesise how the world might work, describe the theory with mathematics, make predictions and figure out an experiment that will challenge the theory, perform the experiment, collect and analyse data and conclude whether our theory was supported.
Learning this basic thinking is extraordinarily valuable: risk analysis follows much of the same process, uses many of the same modelling and statistical techniques, makes approximations and should critically review scientific data when relevant (most published papers describe studies that were designed to show supportive evidence for someone's theory). Pure mathematics and classical statistics are not that great: pure mathematics is too abstract; we find that pure statistics teaching is very constrained, and encourages formulaic thinking and reaching for a computer rather than a pen and paper. The schools also don't seem to emphasise communication skills very much. It's a shame, because the statistician has so much of the basic knowledge requirements. Bayesian statistics is somewhat better - it does not have such a problem with subjective estimates, its techniques are more conducive to risk analysis and it's a newer field, so the teaching is somewhat less staid. Don't be swayed by a Six Sigma black belt qualification - the ideas behind Six Sigma certainly have merit, but the technical knowledge gained to get a black belt is quite basic and the production-line teaching seems to be at the expense of in-depth understanding and creativity. The main things you will need to look out for are a track record of independent thinking, strong communication skills and some reasonable grasp of probability modelling. The more advanced techniques can be learned from courses and books.

2.4.3 Our team

I thought it might be helpful to give you a brief description of how we organise our teams. If your organisation is large enough to need 10 or more people in a risk analysis team, you might get some ideas from how we operate. Vose Consulting has quite a mixture of people, roughly split into three groups, and we seem to have hired organically to match people's skills and characters to the roles of these groups.
I love to learn, teach, develop new talent and dream up new ideas, so my team is made up of conceptual thinkers with great mathematics, computing and researching skills. They are young and very intelligent, but are too young for us to put them into the most stressful jobs, so part of my role is to give them challenging work and the confidence to meet consulting deadlines by solving their problems with them. My office is the nursery for Huybert's team, to which they can migrate once they have more experience. Huybert is an ironman triathlon competitor with boundless energy. His consulting group fly around everywhere solving problems, writing reports and meeting deadlines. They are real finishers, and my team provide as much technical support as they need (though they are no slouches: that team has four quantitative PhDs and nobody with less than a Masters degree). Timour is a very methodical, deep thinker. Unlike me, he tends not to say anything unless he has something to say. His programming group writes our commercial software like ModelRisk, requiring a long-term development view, but he has a couple of people who write bespoke software for our clients to strict deadlines too. When we get a consulting enquiry, the partners will discuss whether we have the time and knowledge to do the job, who it would involve and who would lead it. Then the prospective lead is invited to talk with us and the client about the project and then takes over. The lead consultant has to agree to do the project, his/her name and contact details are put on the MOU and he/she remains in charge and responsible to the client throughout the project. A partner will monitor progress, or a partner could be the lead consultant. The lead consultant can ask anyone within the company for advice, for manpower assistance, to review models and reports, to write bespoke software for the client, to be available for a call with the client, etc.
I like this approach because it means we spread around the satisfaction of a job well done, it encourages responsibility and creativity, it emphasises a flat company structure, we all get to know what others in the company can do, and because poor performance in a project would be the company's failure, not one individual's. I read Ricardo Semler's book Maverick a few months ago and loved it for showing me that much of what we practise in our small company can work in a company as large as Semco. Semco also works in groups that mix around depending on the project and has a flat hierarchy. We give our staff a lot of responsibility, so we also assume that they are responsible: we give them considerable freedom over their working hours and practices, we expect them to keep expenses at a sensible level but don't set daily rates, etc. Staff choose their own computers, can buy a printer, etc., without having to get approval. The only thing we have no flexibility on is honesty.

Chapter 3 The quality of a risk analysis

We've seen a fair number of quantitative risk analyses that are terrible. They might also have been very expensive, taken a long time to complete and used up valuable human resources. In fact, I'll stick my neck out and say the more complex and expensive a quantitative risk analysis is, the more likely it is to be terrible. Worst of all, the people making decisions on the results of these analyses have little if any idea of how bad they are. These are rather attention-grabbing sentences, but this chapter is small and I would really like you not to skip over it: it could save you a lot of heartache. In our company we do a lot of reviews of models for decision-makers. We'd love to be able to say "it's great, trust the results" a lot more often than we do, and I want to spend this short chapter explaining what, in our experience, goes wrong and what you can do about it.
First of all, to give some motivation for this chapter, I want to show you some of the results of a survey we ran a couple of years ago in a well-developed science-based area of risk analysis (Figure 3.1). The question appears in the title of each pane. Which results do you find most worrying?

3.1 The Reasons Why a Risk Analysis can be Terrible

From Figure 3.1 I think you'll see that there really needs to be more communication between decision-makers and their risk analysts and a greater attempt to work as a team. I see the risk analyst as an important avenue of communication between those "on the ground" who understand the problem at hand and hold the data and those who make decisions. The risk analyst needs to understand the context of the decision question and have the flexibility to be able to find the method of analysis that gives the most useful information. I've heard too many risk analysts complain that they get told to produce a quantitative model by the boss, but have to make the numbers up because the data aren't there. Now doesn't that seem silly? I'm sure the decision-maker would be none too happy to know the numbers are all made up, but the risk analyst is often not given access to the decision-makers to let them know. On the other hand, in some business and regulatory environments they are trying to follow a rule that says a quantitative risk analysis needs to be completed - the box needs ticking. Regulations and guidelines can be a real impediment to creative thinking. I've been in plenty of committees gathered to write risk analysis guidelines, and I've done my best to reverse the tendency to be formulaic. My argument is that in 19 years we have never done the same risk analysis twice: every one has its individual peculiarities.
Yet the tendency seems to be the reverse: I trained over a hundred consultants in one of the big four management consultancy firms in business risk modelling techniques, and they decided that, to ensure that they could maintain consistency, they would keep it simple and essentially fill in a template of three-point estimates with some correlation. I can see their point - if every risk analyst developed a fancy and highly individual model it would be impossible to ensure any quality standard. The problem is, of course, that the standard they will maintain is very low. Risk analysis should not be a packaged commodity but a voyage of reasoned thinking leading to the best possible decision at the time.

[Figure 3.1 Some results of a survey of 39 professional risk analysts working in a scientific field where risk analysis is well developed and applied very frequently. Each pane asks how often a given factor jeopardises the value of an assessment (usually, 50:50, seldom or never): insufficient human resources to complete the assessment; insufficient time to complete the assessment; insufficient data to support the risk assessment; insufficient in-house expertise in the area; insufficient general scientific knowledge of the area.]

I think it is usually pretty easy to see early on in the risk analysis process that a quantitative risk analysis will be of little value. There are several key areas where it can fall down:

1. It can't answer all the key questions.
2. There are going to be a lot of assumptions.
3. There is going to be one or more show-stopping assumptions.
4. There aren't enough good data or experts.

We can get around 1 sometimes by doing different risk analyses for different questions, but that can be problematic when each risk analysis has a different set of fundamental assumptions - how do we compare their results?
For 2 we need to have some way of expressing whether a lot of little assumptions compound to make a very vulnerable analysis: if you have 20 assumptions (and 20 is quite a small number), all pretty good ones - e.g. we think there's a 90 % chance each is correct - and the analysis is only useful if all the assumptions are correct, then we only have a 0.9^20 ≈ 12 % chance that the assumption set is correct. Of course, if this were the real problem we wouldn't bother writing models. In reality, in the business world particularly, we deal with assumptions that are good enough because the answers we get are close enough. In some more scientific areas, like human health, we have to deal with assumptions such as: compound X is present; compound X is toxic; people are exposed to compound X; the exposure is sufficient to cause harm; and treatment is ineffective. The sequence then produces the theoretical human harm we might want to protect against, but if any one of those assumptions is wrong there is no human health threat to worry about. If 3 occurs we have a pretty good indication that we don't know enough to produce a decent risk analysis model, but maybe we can produce two or three crude models under different possible assumptions and see whether we come to the same conclusion anyway. Area 4 is the least predictable, because the risk analyst doing a preliminary scoping can be reassured that the relevant data are available, but then finds out they are not, either because the data turn out to be clearly wrong (we see this a lot), the data aren't what was thought, there is a delay past the deadline in the data becoming available or the data are dirty and need so much rework that it becomes impractical to analyse them within the decision timeframe.
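The compounding of assumptions is easy to check numerically. A minimal sketch (the function name is mine; the 20 assumptions and the 90 % figure are the ones used in the text, and the calculation assumes the assumptions hold or fail independently):

```python
def p_all_correct(p_each: float, n: int) -> float:
    """Probability that an entire set of n independent assumptions,
    each with probability p_each of being correct, is all correct."""
    return p_each ** n

# 20 assumptions, each 90 % likely to be correct:
print(round(p_all_correct(0.9, 20), 3))  # 0.122, i.e. about 12 %
```

Even quite good individual assumptions compound quickly: doubling to 40 such assumptions would leave under a 2 % chance that the set is correct.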
There is a lot of emphasis placed on transparency in a risk analysis, which usually manifests itself in a large report describing the model, all the data and sources, the assumptions, etc., and then finishes with some of the graphical and numerical outputs described in Chapter 5. I've seen reports of 100 or 200 pages that seem far from transparent to me - who really has the time or inclination to read such a document? The executive summary tends to focus on the decision question and numerical results, and places little emphasis on the robustness of the study.

3.2 Communicating the Quality of Data Used in a Risk Analysis

Elsewhere in this book you will find lots of techniques for describing the numerical accuracy that a model can provide given the data that are available. These analyses are at the heart of a quantitative risk analysis and give us distributions, percentiles, sensitivity plots, etc. In this section I want to discuss how we can communicate any impact on the robustness of a model owing to the assumptions behind using data or settling on a model scope and structure. Elsewhere in this book I encourage the risk analyst to write down each assumption that is made in developing equations and performing statistical analyses. We get participants to do the same in the training courses we teach as they solve simple class exercises, and there is a general surprise at how many assumptions are implicit in even the simplest type of equation. It becomes rather onerous to write all these assumptions down, but it is even more difficult to convert the conceptual assumptions underpinning our probability models into something that a reader rather less familiar with probability modelling might understand. The NUSAP (Numeral Unit Spread Assessment Pedigree) method (Funtowicz and Ravetz, 1990) is a notational system that communicates the level of uncertainty for data in scientific analysis used for policy making. The idea is to use a number of experts in the field to score independently the data under different categories. The system is well established as being useful in toxicological risk assessment. I will describe here a generalisation of the idea. Its key attractions are that it is easy to implement and can be summarised into consistent pictorial representations.

Table 3.1 Pedigree matrix for parameter strength (adapted from Boone et al., 2007). Each criterion is scored from 4 (strongest) down to 0 (weakest):

Score 4. Proxy: exact measure of the desired quantity (representative). Empirical: large sample, direct measurements, recent data, controlled experiments. Method: best available practice in well-established discipline (accredited method for sampling/diagnostic test). Validation: compared with independent measurements of the same variable over a long domain, rigorous correction of errors.
Score 3. Proxy: good fit or measure. Empirical: small sample, direct measurements, less recent data, uncontrolled experiments, low non-response rate. Method: reliable and common method; best practice in immature discipline.
Score 2. Proxy: well correlated but not the same thing. Empirical: several expert estimates in general agreement. Method: acceptable method but limited consensus on reliability.
Score 1. Proxy: weak correlation (very large geographical differences). Empirical: one expert opinion, rule-of-thumb estimate. Method: method with unknown reliability. Validation: weak, very indirect validation.
Score 0. Proxy: not clearly connected. Empirical: crude speculation. Method: no discernible rigour. Validation: no validation.
In Table 3.1 I have used the categorisation descriptions of data from van der Sluijs, Risbey and Ravetz (2005), which are: proxy - reflecting how close the data being used are to the ideal; empirical - reflecting the quantity and quality of the data; method - reflecting where the method used to collect the data lies between careful and well established and haphazard; and validation - reflecting whether the acquired data have been matched to real-world experience (e.g. does an effect observed in a laboratory actually occur in the wider world). Each dataset is scored in turn by each expert. The average of all scores is calculated and then divided by the maximum attainable score of 4. For example:

             Expert A   Expert B   Expert C
Proxy            3          3          2
Empirical        2          2          1
Method           4          4          3
Validation       3          3          4

This gives an average score of 2.833. Dividing by the maximum score of 4 gives 0.708. An additional level of sophistication is to allow the experts to weight their level of expertise for the particular variable in question (e.g. 0.3 for low, 0.6 for medium and 1.0 for high, as well as allowing experts to select not to make any comment when it is outside their competence), in which case one calculates a weighted average score. One can then plot these scores together and segregate them by different parts of the analysis if desired, which gives an overview of the robustness of data used in the analysis (Figure 3.2).

[Figure 3.2 Plot of average scores for datasets in a toxicological risk assessment; the datasets (x axis: parameter identification number; y axis: average score) are grouped into exposure, release, treatment effectiveness and toxicity parts of the analysis.]
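The scoring scheme just described can be sketched in a few lines of Python (the function name is mine, not from the book; the optional no-comment facility is omitted for brevity):

```python
def support_score(scores_by_expert, weights=None, max_score=4.0):
    """Average the category scores across experts, optionally weighting
    each expert by self-assessed expertise, then normalise by the
    maximum attainable score so the result lies between 0 and 1."""
    if weights is None:
        weights = [1.0] * len(scores_by_expert)
    total = sum(w * s
                for w, expert in zip(weights, scores_by_expert)
                for s in expert)
    weight_sum = sum(w * len(expert)
                     for w, expert in zip(weights, scores_by_expert))
    return total / weight_sum / max_score

# The worked example from the text: three experts each score the four
# pedigree categories (proxy, empirical, method, validation) out of 4.
expert_a = [3, 2, 4, 3]
expert_b = [3, 2, 4, 3]
expert_c = [2, 1, 3, 4]
print(round(support_score([expert_a, expert_b, expert_c]), 3))  # 0.708
```

Passing weights such as `[1.0, 1.0, 0.3]` reproduces the weighted-average refinement, down-weighting an expert who rates their own expertise as low for this variable.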
Scores can be generally categorised as follows: below 0.2, weak; 0.2-0.4, moderate; above 0.4 up to 0.8, high; above 0.8, excellent. So, for example, Figure 3.2 shows that the toxicity part of the analysis appears to be the weakest, with several datasets in the weak category. We can summarise the scores for each dataset using a kite diagram to give a visual "traffic light": green indicating that the parameter support is excellent, red indicating that it is weak, and one or two levels of orange representing gradations between these extremes. Figure 3.3 gives an example: one works from the centre-point, marking on the axes the weighted fraction of all the experts considering the parameter support to be "excellent", then adds the weighted fraction considering the support to be "high", etc. These points are then joined to make the different colour zones - from green in the centre for "excellent", through yellow and orange, to red in the last category: a kite will be all green if every expert agrees the parameter support is excellent, and all red if they agree it is weak. Plotting these kite diagrams together can give a strong visual representation: a sea of green should give great confidence, a sea of red says the risk analysis is extremely weak. In practice, we'll end up with a big mix of colours, but over time one can get a sense of what colour mix is typical for your field, when an analysis is comparatively weak or strong and when it can be relied upon. The only real impediment to using the system above is that you need to develop a database software tool. Some organisations have developed their own in-house products that are effective but somewhat limited in their ability for reviewing, sorting and tracking.
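A small helper for the verbal banding just described (the band edges are those given in the text; how exact boundary values are assigned is my assumption, as the original is ambiguous there):

```python
def support_band(score: float) -> str:
    """Map a normalised support score (0 to 1) to a verbal category."""
    if score < 0.2:
        return "weak"
    if score <= 0.4:
        return "moderate"
    if score <= 0.8:
        return "high"
    return "excellent"

# The worked example's score of 0.708 lands in the "high" band:
print(support_band(0.708))  # high
```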
Our software developers have it on their "to do" list to make a tool that can be used across an organisation, where one can track the current status of a risk analysis, drill down to see the reasons for the vulnerability of a parameter, etc., so you might like to visit www.vosesoftware.com and see if we've got anywhere yet.

[Figure 3.3 A kite diagram summarising the level of data support the experts believe that a model parameter will have: red (dark) in the outer band = weak; green (light) in the inner band = excellent. The axes are the pedigree criteria, such as proxy and method.]

3.3 Level of Criticality

The categorisation system of Section 3.2 helps determine whether a parameter is well supported, but it can still misrepresent the robustness of the risk analysis. For example, we might have done a food safety microbial risk analysis involving 10 parameters - nine enjoy high or excellent support, and one is suffering weak support. If that weakly supported parameter is defining the dose-response relationship (the probability a random individual will experience an adverse health effect given the number of pathogenic organisms ingested), then the whole risk analysis is jeopardised, because the dose-response is the link between all the exposure pathways and the amount of pathogen involved (often a big model) and the size of human health impact that results. It is therefore rather useful to separate the kite diagrams and other analyses into different categories for the level of dependence the analysis has on each parameter: critical, important or small, for example. A more sophisticated version for separating the level of dependence is to analyse statistically the degree of effect each parameter has on the numerical result; for example, one might look at the difference in the mean of the model output when the parameter distribution is replaced by its 95th and 5th percentiles.
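This percentile-swing idea, scaled by how weakly the parameter is supported, can be sketched as follows (a toy illustration; the function name and the numbers are mine, invented for the example):

```python
def vulnerability(output_swing: float, support_score: float) -> float:
    """Crude vulnerability measure: the swing in the model output mean
    (output mean with the parameter fixed at its 95th percentile minus
    the mean with it at its 5th percentile), scaled by (1 - support
    score), which is 0 for excellent support and 1 for terrible."""
    return abs(output_swing) * (1.0 - support_score)

# Hypothetical example: swinging a dose-response parameter between its
# 5th and 95th percentiles moves the mean predicted illnesses by 1200
# cases, and the parameter's support score is 0.5.
print(vulnerability(1200.0, 0.5))  # 600.0
```

Ranking parameters by this number is one way to decide which kite diagrams belong in the "critical" pile, though the caveats discussed next still apply.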
Taking that range and multiplying it by (1 - the support score), giving 0 for excellent and 1 for terrible, gives one a sense of the level of vulnerability of the output numbers. However, this method suffers other problems. Imagine that we are performing a risk analysis on an emerging bacterium for which we have absolutely no dose-response data, so we use a dataset for a surrogate bacterium that we think will have a very similar effect (e.g. because it produces a similar toxin). We might have large amounts of excellent data for the surrogate bacterium and may therefore have little uncertainty about the dose-response model, so using the 5th and 95th percentiles of the uncertainty about that dose-response model will result in a small change in the output, and multiplying that by (1 - the support score) will under-represent the real uncertainty. A second problem is that we often estimate two or more model parameters from the same dataset; for example, a dose-response model often has two or three parameters that are fitted to data. Each parameter might be quite uncertain, but the dose-response curve can nonetheless be quite stable, so this numerical analysis needs to look at the combined effect of the uncertain parameters as a single entity, which requires a fair bit of number juggling.

3.4 The Biggest Uncertainty in a Risk Analysis

The techniques discussed above have focused on the vulnerability of the results of a risk analysis to the parameters of a model.
When we are asked to review or audit a risk analysis, the client is often surprised that our first step is not to look at the model mathematics and supporting statistical analyses, but to consider what the decision questions are, whether there were a number of assumptions, whether it would be possible to do the analysis a different (usually simpler, but sometimes more complex and precise) way and whether this other way would give the same answers, and to see if there are any means for comparing predictions against reality. What we are trying to do is see whether the structure and scope of the analysis are correct. The biggest uncertainty in a risk analysis is whether we started off analysing the right thing and in the right way. Finding the answer is very often not amenable to any numerical technique because we will not have any alternative to compare against. If we do, it might nonetheless take a great deal of effort to put together an alternative risk analysis model, and a model audit is usually too late in the process to be able to start again. A much better idea, in my view, is to get a sense at the beginning of a risk analysis of how confident we should be that the analysis will be scoped sufficiently broadly, or how confident we are that the world is adequately represented by our model. Needless to say, we can also start rather confident that our approach will be quite adequate and then, once having delved into the details of the problem, find out we were quite mistaken, so it is important to keep revisiting our view of the appropriateness of the model. We encourage clients, particularly in the scientific areas of risk in which we work, to instigate a solid brainstorming session of experts and decision-makers whenever it has been decided that a risk analysis is to be undertaken, or maybe is just under consideration. The focus is to discuss the form and scope of the potential risk analysis. 
The experts first of all need to think about the decision questions, discuss with decision-makers any possible alternatives or supplements to those questions and then consider how they can be answered and what the outputs should look like (e.g. only the mean is required, or some high percentile). Each approach will have a set of assumptions that need to be thought through carefully: What would the effect be if the assumptions are wrong? If we use a conservative assumption and estimate a risk that is too high, are we back to where we started? We need to think about data requirements too: Is the quality likely to be good and are the data easily attainable? We also need to think about software. I was once asked to review a 2-year, $3 million model written entirely in interacting C++ modules - nobody else had been able to figure it out (I couldn't either). When the brainstorming is over, I recommend that you pass around a questionnaire and ask those attending independently to answer something like this:

We discussed three risk analysis approaches (description A, description B, description C). Please indicate your level of confidence (0 = none, 1 = slight, 2 = good, 3 = excellent, -1 = no opinion) in response to the following:

1. What is your confidence that method A, B or C will be sufficiently flexible and comprehensive to answer any foreseeable questions from the management about this risk?
2. What is your confidence that method A, B or C is based on assumptions that are correct?
3. What is your confidence for method A, B or C that the necessary data will be available within the required timeframe and budget?
4. What is your confidence that the method A, B or C analysis will be completed in time?
5. What is your confidence that there will be strong support for method A, B or C among reviewing peers?
6. What is your confidence that there will be strong support for method A, B or C among stakeholders?
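Aggregating the questionnaire responses is straightforward. A sketch under my own conventions (the book prescribes only the 0-3 scale with -1 for "no opinion"; excluding the -1 responses from the average is my assumption):

```python
def mean_confidence(responses):
    """Average confidence for one question and method across experts,
    ignoring -1 ('no opinion') responses. Returns None if nobody
    expressed an opinion."""
    valid = [r for r in responses if r >= 0]
    return sum(valid) / len(valid) if valid else None

# Five experts score, say, method A on question 1
# (0 = none, 1 = slight, 2 = good, 3 = excellent, -1 = no opinion):
print(mean_confidence([2, 3, 1, -1, 2]))  # 2.0
```

Comparing these per-question averages across methods A, B and C gives a quick numerical summary to set beside the discussion itself.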
Asking each brainstorming participant independently will help you attain a balanced view, particularly if the chairperson of that meeting has enforced the discipline of requiring participants not to express their views on the above questions during the meeting (it won't be completely possible, but you are trying to make sure that nobody will be influenced into giving a desired answer). Asking people independently rather than trying to achieve consensus during the meeting will also help remove the overconfidence that often appears when people make a group decision.

3.5 Iterate

Things change. The political landscape in which the decision is to be made can become more hostile or accepting to some assumptions, data can prove better or worse than we initially thought, new data turn up, new questions suddenly become important, the timeframe or budget can change, a risk analysis consultant sees an early model and shows you a simpler way, etc. So it makes sense to go back from time to time over the types of assumption analysis I discussed in Sections 3.2 and 3.3 and to remain open to taking a different approach, even to making as dramatic a change as going from a quantitative to a qualitative risk analysis. That means you (analysts and decision-makers alike) should also be a little guarded about making premature promises, so you have some space to adapt. In our consultancy contracts, for example, a client will usually commission us to do a quantitative risk analysis and tell us about the data they have. We'll probably have had a little look at the data too. We prefer to structure our proposal into stages. In the first stage we go over the decision problem, review any constraints (time, money, political, etc.), take a first decent look at the available data and figure out possible ways of getting to the answer. Then we produce a report describing how we want to tackle the problem and why.
At that stage the client can stop the work, continue with us, do it themselves or maybe hire someone else if they wish. It may take a little longer (usually a day or two), but everyone's expectations are kept realistic, we aren't cornered into doing a risk analysis that we know is inappropriate and clients don't waste their time or money. As consultants, we are in the somewhat privileged position of being able to turn down work that we know would be terrible. A risk analyst employed by a company or government department may not have that luxury. If you, the reader, are a risk analyst in the awkward position of being made to produce terrible risk analyses, perhaps you should show your boss this chapter, or maybe check to see if we have any vacancies.

Chapter 4 Choice of model structure

There is a tendency to settle on the form that a risk analysis model will take too early in the risk analysis process. In part that will be because of a limited knowledge of the available options, but also because people tend not to take a step back and ask themselves what the purpose of the analysis is, and also how it might evolve over time. In this chapter I give a short guide to various types of model used in risk analysis.

4.1 Software Tools and the Models they Build

4.1.1 Spreadsheets

Spreadsheets, and by that I mean Excel these days, are the most natural and the first choice for most people because it is perceived that relatively little additional knowledge is required to produce a risk analysis model. Products like @RISK, Crystal Ball, ModelRisk and many other contenders for their shared crown have made adding uncertainty to a spreadsheet as simple as clicking a few buttons. You can run a simulation and look at the distribution results in a few seconds and a few more button clicks.
Monte Carlo simulation software tools for Excel have focused very much on the graphical interface to make risk analysis modelling easy: combine that with the ability to track formulae across spreadsheets, embed graphs and format sheets in many ways, add VBA and data-importing capabilities, and we can see why Excel is so popular. I have even seen a whole trading floor run on Excel using VBA, and not a single recognisable spreadsheet appeared on any dealer's screen. But Excel has its limitations. ModelRisk overcomes many of them for high-level financial and insurance modelling, and I have used its features in this book a fair bit to help explain some modelling concepts. However, there are many types of problem for which Excel is not suitable. Project cost and schedule risk analysis can be done in spreadsheets at a crude level, which I cover in Chapter 19, and a crude level is often enough for large-scale risk analysis, as we are rarely interested in the minutiae that can be built into a project planning model (like you might make with Primavera or Microsoft Project). However, a risk register is better constructed in an electronic database with various levels of access. The problem with building a project plan in a spreadsheet is that expanding the model into greater detail becomes mechanically very awkward, while it is a simple matter in project planning software. In other areas, risk analysis models built with spreadsheets have a number of limitations:

1. They scale very badly, meaning that spreadsheets can become really huge when one has a lot of data, or when one is performing repetitive calculations that could be succinctly written in another language (e.g. a looping formula), although one can get round this to some degree with Visual Basic. Our company reviews many risk models built in spreadsheets, and they can be vast, often unnecessarily so, because there are shortcuts to achieving the same result if one knows a bit of
probability mathematics. The next version of Excel will handle even bigger sheets, so I predict this problem will only get worse.

2. They are limited to the two dimensions of a grid, three at a push if one uses sheets as a third dimension; if you have a multidimensional problem you should really think hard before deciding on a spreadsheet. There are a lot of other modelling environments one could use: C++ is highly flexible, but opaque to anyone who is not a C++ programmer; Matlab and, to a lesser extent, Mathematica and Maple are highly sophisticated mathematical modelling software with very powerful built-in modelling capabilities that will handle many dimensions and can also perform simulations.

3. They are really slow. Running a simulation in Excel will take hundreds or more times longer than in specialised tools. That's a problem if you have a huge model, or if you need to achieve a high level of precision (i.e. require many iterations).

4. Simulation models built in spreadsheets calculate in one direction, meaning that, if one acquires new data that can be matched to a forecast in the model, the data cannot be integrated into the model to update the estimates of the parameters on which the model was based and therefore produce a more accurate forecast. The simulation software WinBUGS can do this, and I give a number of examples throughout this book.

5. Spreadsheets cannot easily handle the modelling of dynamic systems. There are a number of flexible and user-friendly tools, like Simul8, which give very good approximations to continuously varying stochastic systems with many interacting components. I give an example later in this chapter. Attempting to achieve the same in Excel is not worth the pain.

There are other types of model that one can build, and software that will let you do so easily, which I describe below.
4.1.2 Influence diagrams

Influence diagrams are quite popular - they essentially replicate the mathematics you can build in a spreadsheet, but the modelling environment is quite different (Figure 4.1 is a simple example). Analytica is the most popular influence diagram tool. Variables (called nodes) are represented as graphical objects (circles, squares, etc.) and are connected together with arrows (called arcs) which show the direction of interaction between these variables. The visual result is a network that shows the viewer which variables affect which, but you can imagine that such a diagram quickly becomes overly complex, so one builds submodels. Click on a model object and it opens another view to show a lower level of interaction. Personally, I don't like them much because the mathematics and data behind the model are hard to get to, but others love them. They are certainly very visual.

Figure 4.1 Example of a simple influence diagram (nodes for project base cost, additional costs, inflation, risk of political change, risk of bad weather, risk of strike, and total project cost).

4.1.3 Event trees

Event trees offer a way to describe a sequence of probabilistic events, together with their probabilities and impacts. They are perhaps the most useful of all the methods for depicting a probabilistic sequence, because they are very intuitive, the mathematics to combine the probabilities is simple and the diagram helps ensure the necessary discipline. Event trees are built out of nodes (boxes) and arcs (arrows) (Figure 4.2).
The tree starts from the left with a node (in the diagram below, "Select animal" to denote the random selection of an animal from some population), and arrows to the right indicate the possible outcomes (here, whether the animal is infected with some particular disease agent, or not) and their probabilities (p, which would be the prevalence of infected animals in the population, and (1 - p) respectively). Branching out from these boxes are arrows to the next probability event (the testing of an animal for the disease), and attached to these arrows are the conditional probabilities of the next level of event occurring. The conditional nature of the probabilities in an event tree is extremely important to underline. In this example:

Se = P(tests positive for disease given the animal is infected)
Sp = P(tests negative for disease given the animal is not infected)

Thus, following the rules of conditional probability algebra, we can say, for example:

P(animal is infected and tests positive) = p * Se
P(animal is infected and tests negative) = p * (1 - Se)
P(animal tests positive) = p * Se + (1 - p) * (1 - Sp)

Figure 4.2 Example of a simple event tree ("Select animal", branching to Infected/Not infected, then to Tests +ve/Tests -ve, with each step's probabilities conditional on the previous step).

Figure 4.3 Example of a simple decision tree. The decision options are to make either of two investments or do nothing, with associated (high or low) revenues as a result. More involved decision trees would include two or more sequential decisions depending on how well the investment went.

Event trees are very useful for building up your probability thinking, although they will get quite complex rather quickly. We use them a great deal to help understand and communicate a problem.
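The branch arithmetic above can be sketched in a few lines of Python. Note that the values of p, Se and Sp below are illustrative assumptions of mine, not figures from the text:

```python
# Event tree for testing one randomly selected animal.
# Se = P(tests +ve | infected); Sp = P(tests -ve | not infected).
p = 0.15   # assumed prevalence of infection (illustrative)
Se = 0.95  # assumed test sensitivity (illustrative)
Sp = 0.98  # assumed test specificity (illustrative)

# Joint probabilities: multiply the probabilities along each branch.
p_inf_pos = p * Se               # infected and tests +ve
p_inf_neg = p * (1 - Se)         # infected and tests -ve
p_not_pos = (1 - p) * (1 - Sp)   # not infected but tests +ve
p_not_neg = (1 - p) * Sp         # not infected and tests -ve

# The four leaves of the tree are exhaustive, so they sum to 1.
assert abs(p_inf_pos + p_inf_neg + p_not_pos + p_not_neg - 1.0) < 1e-12

# Marginal probability of a positive test: add the branches that
# end in "tests +ve", exactly as in the last equation above.
p_pos = p * Se + (1 - p) * (1 - Sp)
print(p_pos)  # = 0.15*0.95 + 0.85*0.02, i.e. about 0.1595
```

The same pattern (multiply along branches, add across branches that share an outcome) extends to trees of any depth.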
4.1.4 Decision trees

Decision trees are like event trees but add possible decision options (Figure 4.3). They have a role in risk analysis, and in fields like petroleum exploration they are very popular. They sketch the possible decisions that one might make and the outcomes that might result. Decision tree software (which can also produce event trees) can calculate the best option to take under the assumption of some user-defined utility function. Again, personally I am not a big fan of decision trees in actual model writing. I find that it is difficult for decision-makers to be comfortable with defining a utility curve, so I don't have much use for the analytical component of decision tree software, but the trees are helpful for communicating the logic of a problem.

4.1.5 Fault trees

Fault trees start from the reverse approach to an event tree. An event tree looks forward from a starting point and considers the possible future outcomes. A fault tree starts with the outcome and looks at the ways it could have arisen. A fault tree is therefore constructed from the right with the outcome, moves to the left with the possible immediate events that could have made that outcome arise, continues backwards with the possible events that could have made the first set of events arise, etc. Fault trees are very useful for focusing attention on what might go wrong and why. They have been used in reliability engineering for a long time, but also have applications in areas like terrorism. For example, one might start with the risk of deliberate contamination of a city's drinking water supply and then consider the routes that a terrorist could use (pipeline, treatment plant, reservoir, etc.) and the probabilities of succeeding by each route given the security in place.

4.1.6 Discrete event simulation

Discrete event simulation (DES) differs from Monte Carlo simulation mainly in that it models the evolution of a (usually stochastic) system over time.
It does this by allowing the user to define equations for each element in the model describing how it changes, moves and interacts with other elements. It then steps the system through small time increments and keeps track of where all the elements are at any time (e.g. parts in a manufacturing system, passengers in an airport or ships in a harbour). More sophisticated tools can increase the clock steps when nothing is happening, then decrease them again to get a more accurate approximation to the continuous behaviour being modelled. We have used DES for a variety of clients, one of which was a shipping firm that regularly received LNG ships at its site on a narrow shared waterway. The client wanted to investigate the impact of constructing an alternative berthing system designed to reduce the impact of their activities on other shipping movements, and the model evaluated the benefits of such a system. Within the DES model, movements of the client's and any other relevant shipping traffic were simulated, taking into account the restriction of movements by certain rules and regulations and evaluating the costs of delays. The standalone model, as well as documentation and training, was provided to the client and helped them to persuade the other shipping operators and the Federal Energy Regulatory Commission (FERC) of the effectiveness of their plan. Figure 4.4 shows a screen shot of the model (it looks better in colour). Going from left to right, we can see that currently there is one ship in the upper harbour, four in the inner harbour, none at the city front and one in the outer harbour. In the client's berth, two ships are unloading, with 1330 and 2430 units of material still on board. In the upper right-hand corner the number of ships entering the shared waterway is visible, including the number of ships that are currently in a queue (three and two ships of a particular type).
Finally, the lower right-hand corner shows the current waterway conditions, which dictate some of the rules, such as "only ships of a certain draft can enter or exit the waterway given a particular current, tide, wind speed and visibility".

Figure 4.4 Example of a DES model.

DES allows us to model extremely complicated systems in a simple way by defining how the elements interact and then letting the model simulate what might happen. It is used a great deal to model, for example, manufacturing processes, the spread of epidemics, all sorts of complex queuing systems, traffic flows and crowd behaviour (to design emergency exits, for example). The beauty of a visual interface is that anyone who knows the system can check whether it behaves as expected, which makes it a great communication and validation tool.

4.2 Calculation Methods

Given a certain probability model that we wish to evaluate, there are several methods that we could use to produce the required answer, which I describe below.

4.2.1 Calculating moments

This method uses some probability laws that are discussed later in this book. In particular it uses the following rules:

1. The mean of the sum of two distributions is equal to the sum of their means, i.e. E(a + b) = E(a) + E(b) and E(a - b) = E(a) - E(b).
2. The mean of the product of two independent distributions is equal to the product of their means, i.e. E(a * b) = E(a) * E(b).
3. The variance of the sum (or difference) of two independent distributions is equal to the sum of their variances, i.e. V(a + b) = V(a) + V(b) and V(a - b) = V(a) + V(b).
4. For a constant n: E(n * a) = n * E(a) and V(n * a) = n^2 * V(a).

The moments calculation method replaces each uncertain variable with its mean and variance and then uses the above rules to estimate the mean and variance of the model's outcome.
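These four rules are easy to verify numerically. The sketch below checks them by simulation with NumPy; the two distributions are illustrative choices of mine, picked only because their means and variances are known exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000
# Two independent illustrative distributions.
a = rng.normal(10, 2, N)    # Normal: mean 10, variance 4
b = rng.gamma(4, 2.5, N)    # Gamma(shape=4, scale=2.5): mean 10, variance 25

# Rule 1: the mean of a sum/difference is the sum/difference of means
# (an exact identity for sample means, up to floating point).
assert abs(np.mean(a + b) - (np.mean(a) + np.mean(b))) < 1e-9

# Rule 2: for independent variables, the mean of the product is the
# product of the means (true only up to sampling error here).
assert abs(np.mean(a * b) - np.mean(a) * np.mean(b)) < 0.1

# Rule 3: variances add for both the sum AND the difference of
# independent variables (again up to sampling error).
assert abs(np.var(a + b) - (np.var(a) + np.var(b))) < 0.2
assert abs(np.var(a - b) - (np.var(a) + np.var(b))) < 0.2

# Rule 4: E(na) = nE(a) and V(na) = n^2 V(a), shown for n = 3.
assert abs(np.mean(3 * a) - 3 * np.mean(a)) < 1e-9
assert abs(np.var(3 * a) - 9 * np.var(a)) < 1e-6
```

Rules 1 and 4 hold for any distributions; rules 2 and 3 rely on independence, which is why the worked example below requires a, b and c to be independent.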
So, for example, suppose three independent variables a, b and c have the following means and variances:

a: Mean = 70, Variance = 14
b: Mean = 16, Variance = 2
c: Mean = 12, Variance = 4

If the problem is to calculate 2a + b - c, the result can be estimated as follows:

Mean = (2 * 70) + 16 - 12 = 144
Variance = (2^2 * 14) + 2 + 4 = 62

These two values are then used to construct a normal distribution of the outcome:

Result = Normal(144, sqrt(62))

where sqrt(62) is the standard deviation of the distribution, i.e. the square root of the variance. This method is useful in certain situations, like the summation of a large number of potential risks and the determination of aggregate distributions (Section 11.2). It does have some fairly severe limitations - it cannot easily cope with divisions, exponents, power functions, branching, etc. In short, this technique becomes very difficult to execute for all but the most simple models that also reasonably obey its set of assumptions.

4.2.2 Exact algebraic solutions

Each probability distribution has associated with it a probability distribution function that mathematically describes its shape. Algebraic methods have been developed for determining the probability distribution functions of some combinations of variables, so for simple models one may be able to find an equation directly that describes the output distribution. For example, it is quite simple to calculate the probability distribution function of the sum of two independent distributions (the following maths might not make sense until you've read Chapter 6). Let X be the first distribution with density f(x) and cumulative distribution function F_X(x), and let Y be the second distribution with density g(x). Then the cumulative distribution function of the sum of X and Y, F_X+Y, is given by

F_X+Y(a) = INTEGRAL over all x of F_X(a - x) g(x) dx    (4.1)

The sum of two independent distributions is sometimes known as the convolution of the distributions.
By differentiating this equation, we obtain the density function of X + Y:

f_X+Y(a) = INTEGRAL over all x of f(a - x) g(x) dx    (4.2)

So, for example, we can determine the distribution of the sum of two independent Uniform(0, 1) distributions. The probability density functions f(x) and g(x) are both 1 for 0 <= x <= 1, and zero otherwise. From Equation (4.2) we get

f_X+Y(a) = INTEGRAL from 0 to 1 of f(a - x) dx

For 0 <= a <= 1, the integrand is 1 only for 0 <= x <= a, which gives f_X+Y(a) = a. For 1 <= a <= 2, the integrand is 1 only for a - 1 <= x <= 1, which gives f_X+Y(a) = 2 - a. Together these describe a Triangle(0, 1, 2) distribution. Thus, if our risk analysis model were just the sum of several simple distributions, we could use these equations repeatedly to determine the exact output distribution. There are a number of advantages to this approach, for example: the answer is exact; one can immediately see the effect of changing a parameter value; and one can use differential calculus to explore the sensitivity of the output to the model parameters. A variation of the same approach is to recognise the relationships between certain distributions: for example, the sum of two independent Normal distributions is itself a Normal distribution. There are plenty of such relationships, and many are described in Appendix III, but nonetheless the distributions used in a risk analysis model don't usually allow such simple manipulation, and the exact algebraic technique becomes hugely complex and often intractable very quickly, so it cannot usually be considered a practical solution.

4.2.3 Numerical approximations

Some fast Fourier transform and recursive techniques have been developed for directly, and very accurately, determining the aggregate distribution of a random number of independent random variables. A lot of attention has been paid to this particular problem because it is central to the actuarial need to determine the aggregate claim payout an insurance company will face. However, the same generic problem occurs in banking and other areas. I describe these techniques in Section 11.2.2. There are other numerical techniques that can solve certain types of problem, particularly via numerical integration.
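The Triangle(0, 1, 2) result for the sum of two Uniform(0, 1) distributions is easy to confirm by simulation; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 1_000_000
# Sample the sum of two independent Uniform(0, 1) variables.
s = rng.uniform(0, 1, N) + rng.uniform(0, 1, N)

# The density should be f(a) = a on [0, 1] and f(a) = 2 - a on [1, 2],
# i.e. a Triangle(0, 1, 2) distribution.
hist, edges = np.histogram(s, bins=40, range=(0, 2), density=True)
mids = (edges[:-1] + edges[1:]) / 2
expected = np.where(mids <= 1, mids, 2 - mids)
assert np.max(np.abs(hist - expected)) < 0.03  # within sampling noise

# The moments also match Triangle(0, 1, 2): mean = 1, variance = 1/6.
assert abs(s.mean() - 1.0) < 0.01
assert abs(s.var() - 1 / 6) < 0.01
```

The same brute-force check is a useful sanity test whenever an algebraic convolution result is available: simulate the sum and compare the histogram against the derived density.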
ModelRisk, for example, provides the function VoseIntegrate, which will perform a very accurate numerical integration. Consider a function that relates the probability of illness, Pill(D), to the number of virus particles ingested, D. If we believed that the number of virus particles followed a Lognormal(100, 10) distribution, we could calculate the expected probability of illness with a single VoseIntegrate formula, where the VoseIntegrate function interprets "#" as the variable to integrate over and the integration is done between 1 and 1000. The answer is 2.10217E-05 - a value that we could only determine with accuracy using Monte Carlo simulation by running a very large number of iterations.

4.2.4 Monte Carlo simulation

This technique involves the random sampling of each probability distribution within the model to produce hundreds or even thousands of scenarios (also called iterations or trials). Each probability distribution is sampled in a manner that reproduces the distribution's shape. The distribution of the values calculated for the model outcome therefore reflects the probability of the values that could occur. Monte Carlo simulation offers many advantages over the other techniques presented above:

- The distributions of the model's variables do not have to be approximated in any way.
- Correlation and other interdependencies can be modelled.
- The level of mathematics required to perform a Monte Carlo simulation is quite basic.
- The computer does all of the work required in determining the outcome distribution.
- Software is commercially available to automate the tasks involved in the simulation.
- Complex mathematics can be included (e.g. power functions, logs, IF statements, etc.) with no extra difficulty.
- Monte Carlo simulation is widely recognised as a valid technique, so its results are more likely to be accepted.
- The behaviour of the model can be investigated with great ease.
- Changes to the model can be made very quickly and the results compared with previous models.

Monte Carlo simulation is often criticised as being an approximate technique. However, in theory at least, any required level of precision can be achieved by simply increasing the number of iterations in a simulation. The limitations are the number of random numbers that can be produced by a random number generating algorithm and, more commonly, the time a computer needs to generate the iterations. For a great many problems these limitations are irrelevant or can be avoided by structuring the model into sections. The value of Monte Carlo simulation can be demonstrated by considering the cost model problem of Figure 4.5. Triangular distributions represent the uncertain variables in the model. There are many other, very intuitive, distributions in common use (Figure 4.6 gives some examples) that require little or no probability knowledge to understand. The cumulative distribution of the results is shown in Figure 4.7, along with the distribution of the values that are generated from running a "what if" scenario analysis using three values, as discussed at the beginning of this chapter. The figure shows that the Monte Carlo outcome does not have anywhere near as wide a range as the "what if" analysis. This is because the "what if" analysis effectively gives equal probability weighting to all scenarios, including those where all costs turned out to be at their maximum and all costs turned out to be at their minimum. Let us allow, for a minute, the maximum to mean the value that has only a 1 % chance of being exceeded (say). The probability that all five costs could be at their maximum at the same time would then equal (0.01)^5, or 1 in 10 000 000 000: not a realistic outcome! Monte Carlo simulation therefore provides results that are also far more realistic than those produced by simple "what if" scenarios.
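The argument can be reproduced with a toy version of the Figure 4.5 cost model. The five (minimum, most likely, maximum) triples below are illustrative values of my own, not necessarily those printed in the figure:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000

# Illustrative (min, most likely, max) values for five cost items.
items = [(23_500, 27_200, 30_500),
         (172_000, 178_000, 189_000),
         (56_200, 58_500, 63_700),
         (29_600, 33_200, 37_200),
         (31_100, 37_800, 43_600)]

# Simulate the total cost: sample each item from its triangular
# distribution and sum across items, N iterations at a time.
total = sum(rng.triangular(lo, ml, hi, N) for lo, ml, hi in items)

# "What if" extremes: every item at its minimum, or every item at
# its maximum, simultaneously.
lo_sum = sum(lo for lo, _, _ in items)
hi_sum = sum(hi for _, _, hi in items)

# The simulated 1%-99% band sits well inside the what-if range,
# because all five items are almost never near their extremes at once.
p1, p99 = np.percentile(total, [1, 99])
print(lo_sum, p1, p99, hi_sum)
assert lo_sum < p1 < p99 < hi_sum
```

Replacing the triangular distributions with PERT or any of the other distributions of Figure 4.6 requires changing only the sampling line.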
Figure 4.5 Construction project cost model (total construction costs):

Item                    Minimum     Best guess   Maximum
Excavation              £23 500     £27 200      £30 500
Foundations             £172 000    £178 000     £189 000
Structure               £56 200     £58 500      £63 700
Roofing                 £29 600     £33 200      £37 200
Services and finishes   £31 100     £37 800      £43 600

Figure 4.6 Examples of intuitive and simple probability distributions: (a) Triangle distributions; (b) Uniform distributions; (c) PERT distributions, e.g. PERT(0, 49, 50); (d) a Relative distribution, e.g. Relative(4, 15, {7,9,11}, {2,3,0.5}); (e) a Cumulative Ascending distribution, e.g. CumulA(0, 10, {1,4,6}, {0.2,0.5,0.6}); (f) a Discrete distribution, e.g. Discrete({1,2,3}, {0.4,0.5,0.1}).

Figure 4.7 Comparison of the distributions of results from "what if" and risk analyses (cumulative probability of total project cost, from about £310 000 to £370 000).

4.3 Uncertainty and Variability

"Variability is a phenomenon in the physical world to be measured, analysed and where appropriate explained. By contrast, uncertainty is an aspect of knowledge." - Sir David Cox

There are two components of our inability to precisely predict what the future holds: these are variability and uncertainty. This is a difficult subject, not least because of the words that we risk analysts have available to describe the various concepts, and how these words have been used rather carelessly. Bearing this in mind, a good start will be to define the meaning of various keywords. I have used the now fairly standard meanings for uncertainty and variability, but might be considered to be deviating a little from the common path in my explanation of the units of uncertainty and variability.
The reader should bear in mind the comments I'll make about the different meanings that various disciplines assign to certain words. As long as the reader manages to keep the concepts clear, it should be an easy enough task to work out what another author means even if some of the terminology is different.

Variability

Variability is the effect of chance and is a function of the system. It is not reducible through either study or further measurement, but may be reduced by changing the physical system. Variability has been described as "aleatory uncertainty", "stochastic variability" and "interindividual variability". Tossing a coin a number of times provides us with a simple illustration of variability. If I toss the coin once, I will have a head (H) or tail (T), each with a probability of 50 % if one presumes a fair coin. If I toss the coin twice, I have four possible outcomes {HH, HT, TH, TT}, each with a probability of 25 % because of the coin's symmetry. We cannot predict with certainty what the tosses of a coin will produce because of the inherent randomness of the coin toss.

The variation among a population provides us with another simple example. If I randomly select people off the street and note some physical characteristic, like their height, weight, sex, whether they wear glasses, etc., the result will be a random variable with a probability distribution that matches the frequency distribution of the population from which I am sampling. So, for example, if 52 % of the population are female, a randomly sampled person will be female with a probability of 52 %. In the nineteenth century a rather depressing philosophical school of thought, usually attributed to the mathematician the Marquis Pierre-Simon de Laplace, became popular, which proposed that there was no such thing as variability, only uncertainty, i.e. that there is no randomness in the world and that an omniscient being or machine, a "Laplace machine", could predict any future event.
This was the foundation of the physics of the day, Newtonian physics, and even Albert Einstein believed in the determinism of the physical world, saying the often quoted "Der Herr Gott würfelt nicht" - "God does not play dice". Heisenberg's uncertainty principle, one of the foundations of modern physics and, in particular, quantum mechanics, shows us that this is not true at the molecular level, and therefore subtly at any greater scale. In essence, it states that the more one characteristic of a particle is constrained (for example, its location in space), the more random another characteristic becomes (if the first characteristic is location, the second will be its velocity). Einstein tried to prove that it is our knowledge of one characteristic that we are losing as we gain knowledge of another characteristic, rather than any characteristic being a random variable, but he has subsequently been proven wrong both theoretically and experimentally. Quantum mechanics has so far proved itself to be very accurate in predicting experimental outcomes at the molecular level, where the predicted random effects are most easily observed, so we have a lot of empirical evidence to support the theory. Philosophically, the idea that everything is predetermined (i.e. that the world is deterministic) is very difficult to accept too, as it deprives us humans of free will. The non-existence of free will would in turn mean that we are not responsible for our actions - we are reduced to complicated machines and it is meaningless to be either praised or punished for our deeds and misdeeds, which of course is contrary to the principles of any civilisation or religion. Thus, if one accepts the existence of free will, one must also accept an element of randomness in all things that humans affect. Popper (1988) offers a fuller discussion of the subject. Sometimes systems are simply too complex for us to understand properly.
For example, stock markets produce varying stock prices all the time that appear random. Nobody knows all the factors that influence a stock price over time - it is essentially infinitely complex, and we accept that this is best modelled as a random process.

Uncertainty

Uncertainty is the assessor's lack of knowledge (level of ignorance) about the parameters that characterise the physical system being modelled. It is sometimes reducible through further measurement or study, or by consulting more experts. Uncertainty has also been called "fundamental uncertainty", "epistemic uncertainty" and "degree of belief". Uncertainty is by definition subjective, as it is a function of the assessor, but there are techniques available to allow one to be "objectively subjective". This essentially amounts to a logical assessment of the information contained in available data about model parameters, without including any prior, non-quantitative information. The result is an uncertainty analysis that any logical person should agree with, given the available information.

Total uncertainty

Total uncertainty is the combination of uncertainty and variability. These two components act together to erode our ability to predict what the future holds. Uncertainty and variability are philosophically very different, and it is now quite common for them to be kept separate in risk analysis modelling. Common mistakes are failing to include uncertainty in the model, or modelling variability in some parts of the model as if it were uncertainty. The former will produce an overconfident (i.e. insufficiently spread) model output, while the latter can grossly overinflate the total uncertainty. Unfortunately, as you will have gathered, the term "uncertainty" has been applied both to the meaning described above and to total uncertainty, which has left the risk analyst with some problems of terminology.
Colleagues have suggested the word "indeterminability" to describe total uncertainty (perhaps a bit of a mouthful, but still the best suggestion I've heard so far). There has been a rather protracted argument between traditional (frequentist) and Bayesian statisticians over the meaning of words like probability, frequency, confidence, etc. Rather than go through their various interpretations here, I will simply present you with how I use these words. I have found that my terminology clarifies my thoughts and those of my clients and course participants very well. I hope it will do the same for you.

Probability

Probability is a numerical measurement of the likelihood of an outcome of some stochastic process. It is thus one of the two components, along with the values of the possible outcomes, that describe the variability of a system. The concept of probability can be developed neatly from two different approaches. The frequentist approach asks us to imagine repeating the physical process an extremely large number of times (trials) and then to look at the fraction of times that the outcome of interest occurs. That fraction is asymptotically (meaning as we approach an infinite number of trials) equal to the probability of that particular outcome for that physical process. So, for example, the frequentist would imagine that we toss a coin a very large number of times. The fraction of the tosses that comes up heads is approximately the true probability of a single toss producing a head, and the more tosses we do, the closer the fraction becomes to the true probability. So, for a fair coin, we should see the number of heads stabilise at around 50 % of the trials as the number of trials gets truly huge. The philosophical problem with this approach is that one usually does not have the opportunity to repeat the scenario a very large number of times.
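The frequentist idea of probability as a long-run fraction is easy to demonstrate; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toss a fair coin more and more times and watch the fraction of
# heads settle towards the true probability of 0.5.
for n in (100, 10_000, 1_000_000):
    heads = rng.integers(0, 2, n).sum()  # each toss is 0 (tail) or 1 (head)
    print(n, heads / n)

# With a million tosses the fraction is typically within a few
# thousandths of 0.5 (standard error = 0.5 / sqrt(n)), although it
# is never guaranteed to equal it exactly.
frac = rng.integers(0, 2, 1_000_000).mean()
assert abs(frac - 0.5) < 0.005
```

The printed fractions wander further from 0.5 for small n and hug it more tightly as n grows, which is exactly the asymptotic behaviour the frequentist definition appeals to.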
The physicist or engineer, on the other hand, could look at the coin, measure it, spin it, bounce lasers off its surface, etc., until one could declare that, owing to its symmetry, the coin must logically have a 50 % probability of falling on either face (for a fair coin, or some other value for an unbalanced coin as the measurements dictated). Probability is used to define a probability distribution, which describes the range of values the variable may take, together with the probability (likelihood) that the variable will take any specific value.

Degree of uncertainty

In this context, "degree of uncertainty" is our measure of how much we believe something to be true. It is one of the two components, along with the plausible values of the parameter, that describe the uncertainty we may have about a parameter of the physical system ("the state of nature", if you like) to be modelled. We can thus use the degree of uncertainty to define an uncertainty distribution, which describes the range of values within which we believe the parameter lies, as well as the level of confidence we have about the parameter being any particular value, or lying within any particular range. A distribution of confidence looks exactly the same as a distribution of probability, and this can lead, all too easily, to confusion between the two quantities.

Frequency

Frequency is the number of times a particular characteristic appears in a population. Relative frequency is the fraction of times the characteristic appears in the population. So, in a population of 1000 people, 22 of whom have blue eyes, the frequency of blue eyes is 22 and the relative frequency is 0.022 or 2.2 %. Frequency, by the definition used here, must relate to a known population size.

4.3.1 Some illustrations of uncertainty and variability

Let us look at a couple of examples to clarify the meaning of uncertainty and variability. Since variability is the more fundamental concept, we'll deal with it first.
If I toss a fair coin, there is a 50 % chance that each toss will come up heads (let's call this a "success"). The result of each toss is independent of the results of any previous tosses, and it turns out that the probability distribution of the number of heads in n tosses of a fair coin is described by a Binomial(n, 50 %) distribution, which will be explained in detail in Section 8.2. Figure 4.8 illustrates this binomial distribution for n = 1, 2, 5 and 10. This is a distribution of variability because I am not a machine, so I am not perfectly repetitive, and the system (the number of times the coin spins, the air resistance and movement, the angle at which it hits the ground, the topology of the ground, etc.) is too complicated for me to attempt to influence the outcome, and the tosses are therefore random. These binomial distributions are distributions of variability and reflect the randomness inherent in the tossing of a coin (our stochastic system). We are assuming that there is no uncertainty here, as we are assuming the coin to be fair and we are defining the number of tosses; in other words, we are assuming the parameters of the system to be exactly known. The vertical axis of Figure 4.8 gives the probability of each result, and, naturally, these probabilities add up to 1. In general, probability distributions or distributions of variability are simple to understand. They give me some comfort that

Figure 4.8 Examples of the Binomial(n, 50 %) distribution.

Figure 4.9 Confidence distributions for the ball in the box being black: 0 = No, 1 = Yes.
The left panel is confidence before any ball is revealed; the right panel is confidence after seeing a blue ball removed from the sack. randomness (variability) really does exist in the world: if we take a group of 100 people¹ and ask them to toss a coin 10 times, the resulting distribution of the number of heads will closely follow a Binomial(10, 50 %). Now let us look at a distribution of uncertainty. Imagine I have a sack of 10 balls, six of which are black and the remaining four of which are blue, and I know these figures. Now imagine that, out of my sight, a ball is randomly selected from the sack and placed in an opaque box. I am asked the question: "What is the probability that the ball in the box is black?", and I could quickly answer 6/10 or 60 %. Then another ball is removed from the sack and shown to me: it is blue. I am asked: "Now what is the probability that the ball in the box is black?", and, as there are now a total of nine balls I have not seen, six of which I know are black, I could answer 6/9 or 66.66 %. But that is strange, because it is hard to believe that the probability of the ball in the box being black has changed as a result of events that occurred after its selection. The problem lies in my use of the term "probability", which is inconsistent with the definition I have given above. When the ball has been placed in the box, the deed is done: either the ball in the box is black (i.e. the probability is 1) or it is not (i.e. the probability is 0). I don't know the truth, but I could collect information (i.e. look in the box or look in the sack) to find out what the true state is. Before any ball was revealed to me, I should have said that I was 60 % confident that the probability was 1, and therefore 40 % confident that the probability was 0. This is an uncertainty distribution of the true probability. Now, when the blue ball was revealed to me from the sack, I had extra information and would therefore change my uncertainty distribution to show a 66.66 % confidence that the ball in the box was black (i.e. that the probability was 1).
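The ball-in-the-box reasoning is easy to check numerically. The sketch below (Python, my own illustration rather than anything from the book) computes the two confidence values directly, and then confirms the updated 6/9 figure by replaying the physical process many times and keeping only the replays consistent with the observation, i.e. a blue ball revealed from the sack:

```python
import random

# Direct calculation: 6 black among the 10 balls I have not seen,
# then 6 black among the 9 unseen balls after a blue ball is revealed.
prior = 6 / 10        # confidence before any ball is revealed
posterior = 6 / 9     # confidence after seeing a blue ball removed

# Simulation check: replay the physical process and condition on the
# observation, which is exactly what updating an uncertainty
# distribution amounts to.
random.seed(1)
consistent = boxed_black = 0
for _ in range(200_000):
    balls = ["black"] * 6 + ["blue"] * 4
    random.shuffle(balls)
    boxed, revealed = balls[0], balls[1]   # one ball boxed, then one shown
    if revealed == "blue":                 # keep only the matching replays
        consistent += 1
        boxed_black += (boxed == "black")

print(prior, round(posterior, 4), round(boxed_black / consistent, 4))
```

The conditioned frequency settles near 0.667, matching the 66.66 % confidence, whereas the unconditioned frequency would stay at the prior 60 %.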
These two confidence distributions are shown in Figure 4.9. Note that the distributions of Figure 4.9 are pure uncertainty distributions. The quantity in question is a probability, but it may only take a value of 0 or 1, and it has no variability: its outcome is deterministic.

¹ I don't recommend the experiment: the coins went everywhere when I did this a couple of times with a big lecture group in a large banked auditorium.

4.3.2 Combining uncertainty and variability in a risk analysis model

To all intents and purposes, uncertainty and variability are described by distributions that look and behave exactly the same. One might therefore reasonably conclude that they can be combined in the same Monte Carlo model: some distributions reflecting the uncertainty about certain parameters in the model, the other distributions reflecting the inherent stochastic nature of the system. We could then run a simulation on such a model which would randomly sample from all the distributions, and our output would therefore take account of all uncertainty and variability. Unfortunately, this does not work out completely. The resultant single distribution is equivalent to our "best-guess" distribution of the composite of the two components. Technically, it is difficult to interpret, as the vertical scale represents neither uncertainty nor variability, and we have lost the information about what component of the resultant distribution is due to the inherent randomness (variability) of the system, and what component is due to our ignorance of that system. It is therefore useful to know how to keep these two components separate in an analysis if necessary.

Why separate uncertainty and variability?

Keeping uncertainty and variability separate in a risk analysis model is mathematically more correct. Mixing the two together, i.e.
by simulating them together, produces a reasonable estimate of the level of total uncertainty under most conditions. Figure 4.10 shows a Binomial(10, p) distribution, where p is uncertain with distribution Beta(10, 10). The spaghetti-looking graph represents a number of possible true binomial distributions, shown in cumulative form, and the bold line shows the result one gets from simulating the binomial and beta distributions together. The combined model may be wrong, but it covers the possible range very well. But consider doing the same with just one binomial trial, e.g. Binomial(1, Beta(10, 10)). The result is either a 1 or a 0, each occurring in about 50 % of the iterations of the simulation run, the same result as we would have had by modelling Binomial(1, 50 %). The output has lost the information that p is uncertain. Mixing uncertainty and variability also means, of course, that we cannot see how much of the total uncertainty comes from variability and how much from uncertainty, and that information is useful. If we know that a large part of the total uncertainty is due to uncertainty (as in the example of Figure 4.11), then we know that collecting further information, thereby reducing uncertainty, would enable us to improve our estimate of the future. On the other hand, if the total uncertainty is nearly all due to variability (as in the example of Figure 4.12), we know that it is a waste of time to collect more

Figure 4.10 300 Binomial(10, p) distributions resulting from random samples of p from a Beta(10, 10) distribution.

Figure 4.11 Example of second-order risk analysis model output with uncertainty dominating variability.
information and the only way to reduce the total uncertainty would be to change the physical system.

Figure 4.12 Example of second-order risk analysis model output with variability dominating uncertainty.

In general, the separation of uncertainty and variability allows us to understand what steps can be taken to reduce the total uncertainty of our model, and allows us to gauge the value of more information or of some potential change we can make to the system. A much larger problem than mixing uncertainty and variability distributions together can occur when a variability distribution is used as if it were an uncertainty distribution. Separating uncertainty and variability very deliberately gives us the discipline and understanding to avoid the much larger errors that this mistake will produce. Consider the following problem. A group of 10 jurors is randomly picked from a population for some court case. In this population, 50 % are female, 0.2 % have a severe visual disability and 1.1 % are Native American. The defence would like to have at least one member on the jury who is female and either Native American or visually disabled or both. What is the probability that there will be at least one such juror in the selection? This is a pure variability problem, as all the parameters are considered well known, and the answer is quite easy to calculate, assuming independence between the characteristics. The probability that a person is not Native American and not visually disabled is (100 % - 1.1 %) × (100 % - 0.2 %) = 98.7022 %. The probability that a person is either Native American or visually disabled or both is (100 % - 98.7022 %) = 1.2978 %. Thus, the probability that a person is either Native American or visually disabled or both and female is (50 % × 1.2978 %) = 0.6489 %.
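This chain of arithmetic, carried through to the full jury of 10, can be verified with a short script. This is an illustrative sketch (Python), not part of the book's spreadsheet model:

```python
p_female = 0.50     # P(female)
p_native = 0.011    # P(Native American)
p_visual = 0.002    # P(severe visual disability)

# P(Native American or visually disabled or both), assuming independence
p_either = 1 - (1 - p_native) * (1 - p_visual)     # 1.2978 %

# ... and also female
p_person = p_female * p_either                     # 0.6489 %

# P(at least one such person among 10 randomly picked jurors)
p_at_least_one = 1 - (1 - p_person) ** 10          # about 6.303 %

print(f"{p_person:.4%}  {p_at_least_one:.3%}")
```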
The probability that none of the potential jurors is either Native American or visually disabled or both and female is then (100 % - 0.6489 %)^10 = 93.697 %, and so, finally, the probability that at least one potential juror is either Native American or visually disabled or both and female is (100 % - 93.697 %) = 6.303 %. Now let's compare this calculation with the spreadsheet of Figure 4.13 and the result it produces in Figure 4.14.

Figure 4.13 Example of model that incorrectly mixes uncertainty and variability.

Figure 4.14 Result of the model of Figure 4.13.

In this model, the number of females in the jury has been simulated, but the rest of the calculation has been explicitly calculated. The output thus has a distribution that is meaningless, since it should be a single figure. The reason for this is that the model both calculated and simulated variability: we are treating the number of females as if it were an uncertain parameter rather than a variable. Now, having said how useful it is to separate uncertainty and variability, we must take a step back and ask whether the effort is worth the extra information that can be gained. In truth, if we run simulations that combine uncertainty and variability in the same simulation, we can get a good idea of their contribution to total uncertainty by running the model twice: the first time sampling from all distributions, and the second time setting all the uncertainty distributions to their mean value. The difference in spread is a reasonable description of the contribution of uncertainty to total uncertainty. Writing a model where uncertainty and variability are kept separate, as described in the next section, can be very time consuming and cumbersome, so we must keep an eye out for the value of such an exercise.

4.3.3 Structuring a Monte Carlo model to separate uncertainty and variability

The core structure of a risk analysis model is the variability of the stochastic system.
Once this variability model has been constructed, the uncertainty about parameters in that variability model can be overlaid. A risk analysis model that separates uncertainty and variability is described as second order. A variability model comes in two forms: explicit calculation and simulation. In a variability model with explicit calculation, the probability of each possible outcome is explicitly calculated. So, for example, if one were calculating the number of heads in 10 tosses of a coin, the explicit calculation model would take the form of the spreadsheet in Figure 4.15.

Figure 4.15 Model calculating the outcome of 10 tosses of a coin.

Here, we have used the Excel function BINOMDIST(x, n, p, cumulative), which returns the probability of x successes in n trials with a binomial probability of success p. The cumulative parameter requires either a TRUE (or 1) or a FALSE (or 0): using TRUE, the function returns the cumulative probability F(x); using FALSE, the function returns the probability mass f(x). Plotting columns E and F together in an x-y scatter plot produces the binomial distribution, which can be the output of the model. Statistical results, like the mean and standard deviation shown in the spreadsheet model, can also be determined explicitly as needed. The formulae calculating the mean and standard deviation use the Excel array function SUMPRODUCT, which multiplies terms in the two arrays pair by pair and then sums these pair products. In an explicitly calculated model like this it is a simple matter to include uncertainty about any parameters of the model. For example, if we are not confident that the coin was truly fair but instead wish to describe our estimate of the probability of heads as a Beta(12, 11) distribution (see Section 8.2.3 for an explanation of the beta distribution in this context), we can simply enter the beta distribution in place of the 0.5 value in cell C3 and simulate for the cells in column F containing the outputs.
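For readers working outside Excel, the explicit-calculation model of Figure 4.15 can be reproduced with ordinary code. The sketch below (Python, standard library only; the layout is my own, not the book's) builds the f(x) and F(x) columns and the SUMPRODUCT-style statistics:

```python
from math import comb

n, p = 10, 0.5   # 10 tosses of a fair coin

# f(x) and F(x), the analogues of BINOMDIST(x, n, p, FALSE) and
# BINOMDIST(x, n, p, TRUE) in the spreadsheet columns.
pmf = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]
cdf = [sum(pmf[: x + 1]) for x in range(n + 1)]

# SUMPRODUCT-style statistics, calculated explicitly from the pmf.
mean = sum(x * f for x, f in zip(range(n + 1), pmf))
variance = sum((x - mean) ** 2 * f for x, f in zip(range(n + 1), pmf))

print(mean, variance ** 0.5)   # 5.0 and about 1.581 for a fair coin
```

Replacing the fixed p = 0.5 with a draw from a Beta(12, 11) generator and recomputing is the code equivalent of entering the beta distribution in place of the value in cell C3.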
The separation of uncertainty and variability is simple and clear when using a model that explicitly calculates the variability, as we use formulae for the variability and simulation for the uncertainty. But what do we do if the model is set up to simulate the variability? Figure 4.16 shows the same coin-tossing problem, but now we are simulating the number of heads using a Binomial(n, p) function in @RISK. Admittedly, it seems rather unnecessary here to simulate such a simple problem, but in many circumstances it is extremely unwieldy, if not impossible, to use explicit calculation models, and simulation is the only feasible approach. Since we are using the random sampling of simulation to model the variability, it is no longer available to us to model uncertainty. Let us imagine that we put a possible value for the binomial probability p into the model and run a simulation. The result is the binomial distribution that would be the correct model of variability if that value of p were correct. Now, we believe that p could actually be quite a different value - our confidence about the true value of p is described by a Beta(12, 11) distribution - so we would really like to take repeated samples from the beta distribution, run a simulation for each sample and plot all the binomial distributions together to give us a true picture. This sounds immensely tedious, but @RISK provides a RiskSimtable function that will automate the process. Crystal Ball also provides a similar facility in its Pro version that allows one to nominate uncertainty and variability distributions within a model separately and then completely automates the process. We proceed by taking (say) 50 Latin hypercube samples from the beta distribution and importing them back into the spreadsheet model. We then use a RiskSimtable function to reference the list of values.

Figure 4.16 A simulation version of the model of Figure 4.15.
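The procedure just described amounts to a two-level loop, which can be sketched in plain code. The following Python sketch is my own approximation of it: it uses simple random samples of p rather than the Latin hypercube samples the text recommends, and it stands in for the RiskSimtable mechanism rather than reproducing it:

```python
import random
random.seed(1)

n_sims, n_iters = 50, 500   # 50 simulations of 500 iterations each

# Outer level: uncertainty. Each simulation uses one sample of the
# uncertain binomial probability p from our Beta(12, 11) belief.
# Inner level: variability. Each iteration simulates 10 coin tosses.
distributions = []
for _ in range(n_sims):
    p = random.betavariate(12, 11)
    heads = [sum(random.random() < p for _ in range(10)) for _ in range(n_iters)]
    distributions.append(heads)

# Each entry of `distributions` is one possible Binomial(10, p)
# distribution; plotted together they give a picture like Figure 4.10.
sim_means = [sum(d) / n_iters for d in distributions]
print(min(sim_means), max(sim_means))
```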
The RiskSimtable function returns the first value in the list, but when we instruct @RISK to run 50 simulations, each of say 500 iterations, the RiskSimtable function will go through the list, using one value at a time for each simulation. Note that the number of simulations is set to equal the number of samples we have from the beta uncertainty distribution. The binomial distribution is then linked to the RiskSimtable function and named as an output. We now run the 50 simulations and produce 50 different possible binomial distributions, which can be plotted together and analysed in much the same way as an explicit calculation output. Of course, there are an infinite number of possible binomial distributions, but, by using Latin hypercube sampling (see Section 4.4.3 for an explanation of the value of Latin hypercube sampling), we are ensuring that we get a good representation of the uncertainty with a few simulations. In spite of the automation provided by the RiskSimtable function in @RISK or the facilities of Crystal Ball Pro and the speed of modern computers, the simulations can take some time. However, in most non-trivial models that time is easily balanced by the reduction in complexity of the model itself, and therefore the time it takes to construct, as well as by the more intuitive manner in which the models can be constructed, which greatly helps to avoid errors. The ModelRisk software makes uncertainty analysis much easier, as all its fitting functions offer the option of either returning best-fitting parameters (or distributions, time series, etc., based on best-fitting parameters), which is the more common practice, or including the statistical uncertainty about those parameters, which is more correct.

4.4 How Monte Carlo Simulation Works

This section looks at the technical aspects of how Monte Carlo risk analysis software generates random samples for the input distributions of a model.
The difference between Monte Carlo and Latin hypercube sampling is explained. An illustration of the improvement in reliability and efficiency of Latin hypercube sampling over Monte Carlo is also presented. The use of a random number generator seed is explained, and the reader is shown how it is possible to generate probability distributions of one's own design. Finally, a brief introduction is given into the methods used by risk analysis software to produce rank order correlation of input variables.

4.4.1 Random sampling from input distributions

Consider the distribution of an uncertain input variable x. The cumulative distribution function F(x), defined in Section 6.1.1, gives the probability P that the variable X will be less than or equal to x, i.e.

F(x) = P(X ≤ x)

F(x) obviously ranges from 0 to 1. Now, we can look at this equation in the reverse direction: what is the value of x for a given value of F(x)? This inverse function is written as

x = G(F(x))

It is this concept of the inverse function G(F(x)) that is used in the generation of random samples from each distribution in a risk analysis model. Figure 4.17 provides a graphical representation of the relationship between F(x) and G(F(x)).

Figure 4.17 The relationship between x, F(x) and G(F(x)).

To generate a random sample for a probability distribution, a random number r is generated between 0 and 1. This value is then fed into the equation to determine the value to be generated for the distribution:

x = G(r)

The random number r is generated from a Uniform(0, 1) distribution to provide equal opportunity of an x value being generated in any percentile range. The inverse function concept is employed in a number of sampling methods, discussed in the following sections. In practice, for some types of probability distribution it is not possible to determine an equation for G(F(x)), in which case numerical solving techniques can be employed.
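When G(F(x)) is available in closed form, the inversion method takes only a couple of lines of code. As an illustrative sketch (my own example, not one from the book), the exponential distribution has F(x) = 1 − exp(−λx), so its inverse is G(r) = −ln(1 − r)/λ:

```python
import math
import random
random.seed(1)

lam = 2.0   # rate parameter of an Exponential(lambda) distribution

def sample_exponential(lam):
    r = random.random()             # r ~ Uniform(0, 1)
    return -math.log(1 - r) / lam   # x = G(r), the inverse of F(x)

samples = [sample_exponential(lam) for _ in range(100_000)]

# The sample mean should sit close to the true mean 1/lambda = 0.5.
print(sum(samples) / len(samples))
```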
ModelRisk uses the inversion method for all of its 70+ families of univariate distributions and allows the user to control how each distribution is sampled via its "U-parameter". For example, in VoseNormal(mu, sigma, U), mu and sigma are the mean and standard deviation of the normal distribution, and VoseNormal(mu, sigma, 0.9) returns the 90th percentile of the distribution. VoseNormal(mu, sigma), or VoseNormal(mu, sigma, RiskUniform(0, 1)) for @RISK users, or VoseNormal(mu, sigma, CB.Uniform(0, 1)) for Crystal Ball users, etc., returns random samples from the distribution that are controlled by ModelRisk, @RISK or Crystal Ball respectively. The inversion method also allows us to make use of copulas to correlate variables, as explained in Section 13.3.

4.4.2 Monte Carlo sampling

Monte Carlo sampling uses the above sampling method exactly as described. It is the least sophisticated of the sampling methods discussed here, but it is the oldest and best known. Monte Carlo sampling got its name as the code word for work that von Neumann and Ulam were doing during World War II on the Manhattan Project at Los Alamos for the atom bomb, where it was used to integrate otherwise intractable mathematical functions (Rubinstein, 1981). However, one of the earliest examples of the use of the Monte Carlo method was the famous Buffon's needle problem, where needles were physically thrown randomly onto a gridded field to estimate the value of π. At the beginning of the twentieth century the Monte Carlo method was also used to examine the Boltzmann equation, and in 1908 the famous statistician Student (W. S. Gosset) used the Monte Carlo method for estimating the correlation coefficient in his t-distribution. Monte Carlo sampling satisfies the purist's desire for an unadulterated random sampling method. It is useful if one is trying to get a model to imitate random sampling from a population, or for doing statistical experiments.
However, the randomness of its sampling means that it will over- and undersample from various parts of the distribution, and it cannot be relied upon to replicate the input distribution's shape unless a very large number of iterations are performed. For nearly all risk analysis modelling, the pure randomness of Monte Carlo sampling is not really relevant. We are almost always far more concerned that the model reproduces the distributions that we have determined for its inputs. Otherwise, what would be the point of expending so much effort on getting these distributions right? Latin hypercube sampling addresses this issue by providing a sampling method that appears random but that also guarantees to reproduce the input distribution with much greater efficiency than Monte Carlo sampling.

4.4.3 Latin hypercube sampling

Latin hypercube sampling, or LHS, is an option that is now available for most risk analysis simulation software programs. It uses a technique known as "stratified sampling without replacement" (Iman, Davenport and Zeigler, 1980) and proceeds as follows. The probability distribution is split into n intervals of equal probability, where n is the number of iterations that are to be performed on the model. Figure 4.18 illustrates an example of the stratification that is produced for 20 iterations of a normal distribution. The bands can be seen to get progressively wider towards the tails as the probability density drops away. In the first iteration, one of these intervals is selected using a random number. A second random number is then generated to determine where, within that interval, F(x) should lie. In practice, the second half of the first random number can be used for this purpose, reducing simulation time. x = G(F(x)) is then calculated for that value of F(x).

Figure 4.18 Example of the effect of stratification in Latin hypercube sampling.
The process is repeated for the second iteration, but the interval used in the first iteration is marked as having already been used and therefore will not be selected again. This process is repeated for all of the iterations. Since the number of iterations n is also the number of intervals, each interval will have been sampled exactly once and the distribution will have been reproduced with predictable uniformity over the F(x) range. The improvement offered by LHS over Monte Carlo can be easily demonstrated. Figure 4.19 compares the results obtained by sampling from a Triangle(0, 10, 20) distribution with LHS and Monte Carlo sampling. The top panels of Figure 4.19 show histograms of the triangular distribution after one simulation of 300 iterations. The LHS clearly reproduces the distribution much better. The middle panels of Figure 4.19 show an example of the convergence of the two sampling techniques to the true values of the distribution's mean and standard deviation. In the Monte Carlo test, the distribution was sampled 50 times, then another 50 to make 100, then another 100 to make 200, and so on, to give simulations of 50, 100, 200, 300, 500, 1000 and 5000 iterations. In the LHS test, seven different simulations were run for the seven different numbers of iterations. This difference in approach was necessary because LHS has a "memory" and Monte Carlo sampling does not: a "memory" is where the sampling algorithm takes account of where it has already sampled from in the distribution. From these two panels, one can get a feel for the consistency provided by LHS. The bottom two panels provide a more general picture. To produce these diagrams, the triangular distribution was again sampled in seven separate simulations, with the same numbers of iterations (50, 100, 200, 300, 500, 1000 and 5000), for both LHS and Monte Carlo sampling. This was repeated 100 times, and the mean and standard deviation of the results were noted.
The standard deviations of these statistics were calculated to give a feel for how much the results might naturally vary from one simulation to another. LHS consistently produces values for the distribution's statistics that are nearer to the theoretical values of the input distribution than Monte Carlo sampling. In fact, one can see that the spread in results using just 100 LHS samples is smaller than the spread using 5000 MC samples!

Figure 4.19 Comparison of the performance of Monte Carlo and Latin hypercube sampling.

Figure 4.20 Example comparison of the convergence of the mean for Monte Carlo and Latin hypercube distributions.

The benefit of LHS is eroded if one does not complete the number of iterations nominated at the beginning, i.e. if one halts the program in mid-simulation. Figure 4.20 illustrates an example where a Normal(1, 0.1) distribution is simulated for 100 iterations with both Monte Carlo sampling and LHS. The mean of the values generated has roughly the same degree of variance from the true mean of 1 until the number of iterations completed gets close to the prescribed 100, when LHS pulls in more sharply to the desired value.

4.4.4 Other sampling methods

There are a couple of other sampling methods, and I mention them here for completeness, although they do not appear very often and are not offered by the standard risk analysis packages.
Mid-point LHS is a version of standard LHS where the mid-point of each interval is used for the sampling. In other words, the data points xi generated from a distribution using n iterations will be at the (i - 0.5)/n percentiles. Mid-point LHS will produce even more precise and predictable values for the output statistics than LHS, and in most situations it would be very useful. However, there are odd occasions where its equidistancing between the F(x) values causes interference effects that would not be observed in standard LHS. In certain problems, one might only be concerned with the extreme tail of the distribution of possible outcomes. In such cases, even a very large number of iterations may fail to produce sufficient values in the extreme tail of the output for an accurate representation of the area of interest. It can then be useful to employ importance sampling (Clark, 1961), which artificially raises the probability of sampling from the ranges within the input distributions that would cause the extreme values of interest in the output. The accentuated tail of the output distribution is rescaled back to its correct probability density at the end of the simulation, but there is now good detail in the tail. In Section 4.5.1 we will look at another method of simulation that ensures that one can get sufficient detail in the modelling of rare events. Sobol numbers are non-random sequences of numbers that progressively fill in the Latin hypercube space. The advantage they offer is that one can keep adding more iterations and they keep filling gaps previously left. Contrast that with LHS, for which we need to define the number of iterations at the beginning of the simulation and, once it is complete, we have to start again - we can't build on the sampling already done.
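The standard LHS recipe of Section 4.4.3, together with its mid-point variant, is short enough to write out directly. The sketch below is my own illustration in Python, using the standard normal as the target distribution because its inverse cumulative function G is available in the standard library:

```python
import random
from statistics import NormalDist

random.seed(1)
norm = NormalDist()   # standard normal; inv_cdf plays the role of G(F(x))

def lhs(n):
    """Latin hypercube sample: one draw from each of n equal-probability
    intervals, visited in random order (sampling without replacement)."""
    intervals = list(range(n))
    random.shuffle(intervals)   # each interval selected exactly once
    # (i + r)/n places F(x) at a random point within interval i
    return [norm.inv_cdf((i + random.random()) / n) for i in intervals]

def midpoint_lhs(n):
    """Mid-point variant: the (i - 0.5)/n percentiles of the distribution."""
    return [norm.inv_cdf((i + 0.5) / n) for i in range(n)]

sample = lhs(100)
print(sum(sample) / len(sample))   # close to the true mean of 0
```

Because every interval is visited exactly once, re-running lhs(100) reproduces the distribution's shape far more reliably than 100 plain Monte Carlo draws would.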
4.4.5 Random number generator seeds

There are many algorithms that have been developed to generate a series of random numbers between 0 and 1 with equal probability density for all possible values; there are plenty of reviews you can find online. The best general-purpose algorithm is currently widely held to be the Mersenne twister. These algorithms will start with a value between 0 and 1, and all subsequent random numbers that are generated will rely on this initial seed value. This can be very useful. Most decent risk analysis packages now offer the option to select a seed value. I personally do this as a matter of course, setting the seed to 1 (because I can remember it!). Providing the model is not changed, and that includes the position of the distributions in a spreadsheet model and therefore the order in which they are sampled, the same simulation results can be exactly repeated. More importantly, one or more distributions can be changed within the model and a second simulation run to look at the effect these changes have on the model's outputs. It is then certain that any observed change in the result is due to changes in the model, and not a result of the randomness of the sampling.

4.5 Simulation Modelling

My cardinal rule of risk analysis modelling is: "Every iteration of a risk analysis model must be a scenario that could physically occur". If the modeller follows this "cardinal rule", he or she has a much better chance of producing a model that is both accurate and realistic, and will avoid most of the problems I so frequently encounter when reviewing a client's work. Section 7.4 discusses the most common risk modelling errors. A second very useful rule is: "Simulate when you can't calculate". In other words, don't simulate when it is possible, and not too onerous, to determine the answer exactly and directly through normal mathematics.
There are several reasons for this: simulation provides an approximate answer where mathematics can give an exact one; simulation will often not be able to provide the entire distribution, especially at the low-probability tails; mathematical equations can be updated instantaneously in light of a change in the value of a parameter; and techniques like partial differentiation that can be applied to mathematical equations provide methods to optimise decisions much more easily than simulation. In spite of all these benefits, algebraic solutions can be excessively time consuming or intractable for all but the simplest problems. For those who are not particularly mathematically inclined or trained, simulation provides an efficient and intuitive approach to modelling risky issues.

4.5.1 Rare events

It is often tempting in a risk analysis model to include very unlikely events that would have a very large impact should they occur; for example, including the risk of a large earthquake in a cost model of a Sydney construction project. True, the large earthquake could happen and the effect would be devastating, but there is generally little to be gained from including the rare event in an overview model. The expected impact of a rare event is determined by two factors: the probability that it will occur and, if it did occur, the distribution of possible impact it would have. For example, we may determine that there is about a 1:50 000 chance of a very large earthquake during the construction of a skyscraper. However, if there were an earthquake, it would inflict anything between a few hundred pounds' damage and a few million. In general, the distribution of the impact of a rare event is far more straightforward to determine than the probability that the rare event will occur in the first place. We often can be no more precise about the probability than to within one or two orders of magnitude (i.e. to within a factor of 10-100).
It is usually this determination of the probability of the event that provides a stumbling block for the analyst. One method to determine the probability is to look at past frequencies and assume that they will represent the future. This may be of use if we are able to collect a sufficiently large and reliable dataset. Earthquake data in the New World, for example, only extend back 200 or 300 years, so the smallest frequency such data could support is of the order of one event in 200 years. Another method, commonly used in fields like nuclear power reliability, is to break the problem down into components. For an explosion to occur in a nuclear power station (excluding human error), a potential hazard would have to occur and a string of safety devices would all have to fail together. The probability of an explosion is the product of the probability of the initial conditions necessary for an explosion and the probabilities of each safety device failing. This method has also been applied in epidemiology, where agricultural authorities have sought to determine the risk of introduction of an exotic disease. These analyses typically attempt to map out the various routes through which contaminated animals or animal products can enter the country and then infect the country's livestock. In some cases, the structure of the problem is relatively simple and the probabilities can be reasonably calculated; for example, the risk of introducing a disease through importing semen straws or embryos. In this case the volume is easily estimated, its source is determinable, and regulations can be imposed to minimise the risk. In other cases, the structure of the problem is extremely complex and a sensible analysis may be impossible except to place an upper limit on the probability; for example, the risk of introducing disease into native fish by importing salmon.
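The product-of-probabilities decomposition used in the nuclear example above can be sketched in a few lines; every probability value here is hypothetical, chosen only to illustrate the arithmetic:

```python
# Probability of the initiating hazard occurring (hypothetical value)
p_hazard = 1e-3

# Probabilities that each of three independent safety devices
# fails on demand (hypothetical values)
p_device_failures = [1e-2, 5e-3, 2e-2]

# Product rule for independent events: the hazard must occur AND
# every safety device must fail together
p_explosion = p_hazard
for p in p_device_failures:
    p_explosion *= p

# p_explosion is now about 1e-9: of the order of one in a billion
```

The decomposition is only as good as the independence assumption; common-cause failures (a flood disabling all devices at once) would break the simple product rule.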
There are so many paths through which a fish in a stream or fish farm could be exposed to imported contaminated salmon, ranging from a seagull picking up a scrap from a dump and dropping it in a stream right in front of a fish, to a saboteur deliberately buying some salmon and feeding it to fish in a farm. It is clearly impossible to cover all of the scenarios that might exist, or even to calculate the probability of each individual scenario. In such cases, it makes more sense to set an upper bound on the probability that infection occurs. It is very common for people to include rare events in a risk analysis model that is primarily concerned with the general uncertainty of the problem, but this provides little extra insight. For example, we might construct a model to estimate how long it will take to develop a software application for a client: designing, coding, testing, etc. The model would be broken down into key tasks and probabilistic estimates made for the duration of each task. We would then run a simulation to find the total effect of all these uncertainties. We would not include in such an analysis the effect of a plane crashing into the office or the project manager quitting. We might recognise these risks and hold back-up files at a separate location or make the project manager sign a tight contract, but we would gain no greater understanding of our project's chance of meeting the deadline by incorporating such risks into our model.

4.5.2 Model uncertainty

Model building is subjective. The analyst has to decide how to build a necessarily simple model to attempt to represent a frequently very complicated reality. One needs to make decisions about which bits can be left out as insignificant, perhaps without a great deal of data to back up the decision. We also have to reason about which type of stochastic process is actually operating.
In truth, we rarely have a purely binomial, Poisson or any other theoretical stochastic process occurring in nature. However, we can often convince ourselves that the degree of deviation from the simplified model we chose to use is not terribly significant. It is important in any model to consider how it could fail to represent the real world. In any mathematical abstraction we are making certain assumptions, and it is important to run through these assumptions, both the explicit assumptions that are easy to identify and the implicit assumptions that one may easily fail to spot. For example, using a Poisson process to model frequencies of epidemics may seem quite reasonable, as they could be considered to occur randomly in time. However, the individuals in one epidemic can be the source of the next epidemic, in which case the events are not independent. Seasonality of epidemics means that the Poisson intensity varies with the month, which can be catered for once it is recognised, but if there are other random elements affecting the Poisson intensity then it may be more appropriate to model the epidemics as a mixture process. Sometimes one may have two possible models (for example, two equations relating bacteria growth rates to time and ambient temperature, or two equations for the lifetime of a device), both of which seem plausible. In my view, these represent subjective uncertainty that should be included in the model, just as other uncertain parameters have distributions assigned to them. So, for example, if I have two plausible growth models, I might use a discrete distribution to select one or the other randomly during each iteration of the model. There is no easy solution to the problems of model uncertainty. It is essential to identify the simplifications and assumptions one is making when presenting the model and its results, in order for the reader to have an appropriate level of confidence in the model.
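The two-growth-models idea can be sketched as follows; both model equations and the 50/50 weighting are hypothetical stand-ins, not taken from the text:

```python
import math
import random

def growth_model_a(t):
    # Hypothetical exponential growth equation
    return math.exp(0.05 * t)

def growth_model_b(t):
    # Hypothetical logistic alternative, plausible over the same range
    return 20.0 / (1.0 + 19.0 * math.exp(-0.12 * t))

random.seed(1)
results = []
for _ in range(1000):
    t = random.uniform(0, 48)          # uncertain elapsed time (hours)
    # Discrete 50/50 choice: each iteration samples one plausible model,
    # so the model uncertainty flows into the output distribution
    model = growth_model_a if random.random() < 0.5 else growth_model_b
    results.append(model(t))
```

The output distribution now mixes both models, so a decision based on it automatically reflects the subjective uncertainty about which equation is right.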
Arguments and counterarguments can be presented for the factors that would bring about a failure of the model. Analysts can be nervous about pointing out these assumptions, but practical decision-makers will understand that any model has assumptions, and they would rather be aware of them than not. In any case, I think it is always much better for me to be the person who points out the potential weaknesses of my models first. One can also often analyse the effects of changing the model assumptions, which gives the reader some feel for the reliability of the model's results.

Chapter 5 Understanding and using the results of a risk analysis

A risk analysis model, however carefully crafted, is of no value unless its results are understandable, useful, believable and tailored to the problem in hand. This chapter looks at various techniques to help the analyst achieve these goals. Section 5.1 gives a brief overview of the points that should be borne in mind in the preparation of a risk analysis report. Section 5.2 looks at how to present the assumptions of the model in a succinct and comprehensible way. The results of a risk analysis model are far more likely to be accepted by decision-makers if they understand the model and accept its assumptions. Section 5.3 illustrates a number of graphical presentations that can be employed to demonstrate a model's results and offers guidance for their most appropriate use. Finally, Section 5.4 looks at a variety of statistical analyses that can be performed on the output data of a risk analysis.
In addition to writing comprehensive risk analysis reports, I have found it particularly helpful to my clients to run short courses for senior management that explain: how to manage a risk assessment (time and resources required, typical sequence of activities, etc.); how to ensure that a risk assessment is being performed properly; what a risk assessment can and cannot do; what outputs one can ask for; and how to interpret, present and communicate a risk assessment and its results. This type of training eases the introduction of risk analysis into an organisation. We see many organisations where the engineers, analysts, scientists, etc., have embraced risk analysis, trained themselves and acquired the right tools, and then fail to push the extra knowledge up the decision chain because the decision-makers remain unfamiliar with, and perhaps intimidated by, all this new "risk analysis stuff". If you are intending to present the results of a risk analysis to an unknown audience, consider assuming that the audience knows nothing about risk analysis modelling and explain some basic concepts (like Monte Carlo simulation) at the beginning of the presentation.

5.1 Writing a Risk Analysis Report

Complex models, probability distributions and statistics often leave the reader of a risk analysis report confused (and probably bored). The reader may have little understanding of the methods employed in risk analysis or of how to interpret, and make decisions from, its results. In this environment it is essential that a risk analysis report guide the reader through the assumptions, results and conclusions (if any) in a manner that is transparently clear but neither esoteric nor oversimplistic. The model's assumptions should always be presented in the report, even if only in a very shorthand form.
I have found that a report puts across its message to the reader much more effectively if these model assumptions are put at the back of the report, the front being reserved for the model's results, an assessment of its robustness (see Chapter 3) and any conclusions. We tend to write reports with the following components (depending on the situation):

- summary;
- introduction to the problem;
- decision questions addressed and those not addressed;
- discussion of available data and relation to model choice;
- major model assumptions and the impact on the results if incorrect;
- critique of model, comment on validation;
- presentation of results;
- discussion of possible options for improvement, extra data that would change the model or its results, additional work that could be done;
- discussion of modelling strategy;
- decision question(s);
- available data;
- methods of addressing decision questions with available information;
- assumptions inherent in different modelling options;
- explanation of choice of model;
- discussion of model used;
- overview of model structure, how the sections relate together;
- discussion of each section (data, mathematics, assumptions, partial results);
- results (graphical and statistical analyses);
- model validation;
- references and datasets;
- technical appendices;
- explanation of unusual equation derivations;
- guide on how to interpret and use statistical and graphical outputs.

The results of the model must be presented in a form that clearly answers the questions that the analyst sets out to answer. It sounds rather obvious, but I have seen many reports that have failed in this respect for several reasons:

- The report relied purely on statistics. Graphs help the reader enormously to get a "feel" for the uncertainty that the model is demonstrating.
- The key question is never answered. The reader is left instead to make the last logical step.
For example, a distribution of a project's estimated cost is produced, but no guidance is offered for determining a budget, risk contingency or margin.

- The graphs and statistics use values to five, six or more significant figures. This is an unnatural way for most readers to think of values and impairs their ability to use the results.
- The report is filled with volumes of meaningless statistics. Risk analysis software programs, like @RISK and Crystal Ball, automatically generate very comprehensive statistics reports. However, most of the statistics they produce will be of no relevance to any one particular model. The analyst should pare down any statistics report to those few statistics that are germane to the problem being modelled.
- The graphs are not properly labelled! Arrows and notes on a graph can be particularly useful.

In summary:

1. Tailor the report to the audience and the problem.
2. Keep statistics to a minimum.
3. Use graphs wherever appropriate.
4. Always include an explanation of the model's assumptions.

5.2 Explaining a Model's Assumptions

We recommend that you are very explicit about your assumptions, and make a summary of them in a prominent place in the report, rather than just have them scattered through the report in the explanation of each model component. A risk analysis model will often have a fairly complex structure, and the analyst needs to find ways of explaining the model that can quickly be checked. The first step is usually to draw up a schematic diagram of the structure of the model. The type of schematic diagram will obviously depend on the problem being modelled: GANTT charts, site plans with phases, work breakdown structures, flow diagrams, event trees, etc. - any pictorial representation that conveys the required information. The next step is to show the key quantitative assumptions that are made for the model's variables.
Distribution parameters

Using the parameters of a distribution to explain how a model variable has been characterised will often be the most informative way of explaining a model's logic. We tend to use tables of formulae for more technical models where there are a lot of parametric distributions and probability equations, because the logic is apparent from the relationship between a distribution's parameters and other variables. For nonparametric distributions, which are generally used to model expert opinion or to represent a dataset, a thumbnail sketch helps the reader most. Influence diagram plots (Figure 5.1 illustrates a simple example) are excellent for showing the flow of the logic and the interrelationships between model components, but not the mathematics underlying the links.

Figure 5.1 Example of a schematic diagram of a model's structure: a total project cost driven by additional costs, inflation, and the risks of political change, strike and bad weather.

Graphical illustrations of quantitative assumptions are particularly useful when non-parametric distributions have been used. For example, a sketch of a VoseRelative (Custom in Crystal Ball, General in @RISK), a VoseHistogram or a VoseCumulA distribution will be a lot more informative than noting its parameter values. Sketches are also very good when you want to explain partial model results. For example, summary plots are useful for demonstrating the numbers that come out of what might be a quite complex time series model. Scatter plots are useful for giving an overview of what might be a very complicated correlation structure between two or more variables. Figure 5.2 illustrates a simple format for an assumptions report. Crystal Ball offers a report-writing feature that will do most of this automatically. There will usually be a wealth of data behind these key quantitative assumptions and the formulae that have been used to link them. Explanations of the
data and how they translate into the quantitative assumptions can be relegated to an appendix of the risk analysis report, if they are to be included at all.

5.3 Graphical Presentation of a Model's Results

There are two forms in which a model's results can be presented: graphs and numbers. Graphs have the advantage of providing a quick, intuitive way to understand what is usually a fairly complex, number-intensive set of information. Numbers, on the other hand, give us the raw data and statistics from which we can make quantitative decisions. This section looks at graphical presentations of results, and the following section reviews statistical methods of reporting. The reader is strongly encouraged to use graphs wherever it is useful to do so, and to avoid intensive use of statistics.

5.3.1 Histogram plots

The histogram, or relative frequency, plot is the most commonly used in risk analysis. It is produced by grouping the data generated for a model's output into a number of bars or classes. The number of values in any class is its frequency. The frequency divided by the total number of values gives an approximate probability that the output variable will lie in that class's range. We can easily recognise common distributions such as triangular, normal, uniform, etc., and we can see whether a variable is skewed. Figure 5.3 shows the result of a simulation of 500 iterations, plotted as a 20-bar histogram. The most common mistake in interpreting a histogram is to read off the y-scale value as the probability of the x value occurring. In fact, the probability of any single x value, given that the output is continuous (and most are), is infinitesimally small. If the model's output is discrete, the histogram will show the probability of each allowable x value, provided the class width is less than the distance between each allowable x value. The number of classes used in a histogram plot will determine the scale of the y axis.
Clearly, the wider the bar width, the more chance there will be that values will fall within it. So, for example, by doubling the number of histogram bars, the probability scale will approximately halve. Monte Carlo add-ins generally offer two options for scaling the vertical axis: density and relative frequency plots, shown in Figures 5.4 and 5.5. In plotting a histogram, the number of bars should be chosen to balance between a lack of detail (too few bars) and overwhelming random noise (too many bars).

Figure 5.3 Doubling the number of bars on average halves the probability height for a bar.

Figure 5.4 Histogram "density" plot. The vertical scale is calculated so that the sum of the histogram bar areas equals unity. This is only appropriate for continuous outputs (left). Simulation software won't recognise if an output is discrete (right), so it treats the generated output data in the same way as a continuous output. The result is a plot where the probability values make no intuitive sense: in the right-hand plot the probabilities appear to add up to more than 1. To be able to tell the probability of the output being equal to 4, for example, we first need to know the width of the histogram bar.

When the result of a risk analysis model
is a discrete distribution, it is usually advisable to set the number of histogram bars to the maximum possible, as this will reveal the discrete nature of the output unless the output distribution takes a large number of discrete values. Some risk analysis software programs offer the facility to smooth out a histogram plot. I don't recommend this approach because: (a) it suggests greater accuracy than actually exists; (b) it fits a spline curve that will accentuate (unnecessarily) any peaks and troughs; and (c) if the scale remains the same, the area does not integrate to 1 unless the original bandwidths were one x-axis unit wide. The histogram plot is an excellent way of illustrating the distribution of a variable, but is of little value for determining quantitative information about that variability, which is where the cumulative frequency plot takes over. Several histogram plots can be overlaid on each other if the histograms are not filled in. This allows one to make a visual comparison, for example, between two decision options one may be considering. The same type of graph can also be used to represent the results of a second-order risk analysis model where the uncertainty and variability have been separated, in which case each distribution curve would represent the system variability given a random sample from the uncertainty distribution of the model.

Figure 5.5 Histogram "relative frequency" plot. The vertical scale is calculated as the fraction of the generated values that fall into each histogram bar's range. Thus, the sum of the bar heights equals unity. Relative frequency is only appropriate for discrete variables (right), where the histogram heights now sum to unity. For continuous variables (left), the area under the curve no longer sums to unity.
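The two axis-scaling conventions described for Figures 5.4 and 5.5 can be checked numerically. A sketch using numpy, with a stand-in continuous output:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(100, 15, size=5000)   # stand-in continuous model output

counts, edges = np.histogram(data, bins=20)
bar_width = edges[1] - edges[0]

# Relative frequency scaling: bar heights sum to 1
rel_freq = counts / counts.sum()

# Density scaling: bar areas (height x width) sum to 1
density = rel_freq / bar_width

assert np.isclose(rel_freq.sum(), 1.0)
assert np.isclose((density * bar_width).sum(), 1.0)
```

Rerunning with `bins=40` roughly halves each `rel_freq` height, which is the effect noted in the caption of Figure 5.3.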
5.3.2 The cumulative frequency plot

The cumulative frequency plot has two forms: ascending and descending, shown in Figure 5.6. The ascending cumulative frequency plot is the more commonly used of the two and shows the probability of being less than or equal to the x-axis value. The descending cumulative frequency plot, on the other hand, shows the probability of being greater than or equal to the x-axis value. From now on, we shall assume use of the ascending plot. Note that the mean of the distribution is sometimes marked on the curve, in this case using a black square.

Figure 5.6 Ascending and descending cumulative frequency plots.

The cumulative frequency distribution of an output can be plotted directly from the generated data as follows:

1. Rank the data in ascending order.
2. Next to each value, calculate its cumulative percentile P_i = i/(n + 1), where i is the rank of that data value and n is the total number of generated values. i/(n + 1) is used because it is the best estimate of the theoretical cumulative distribution function of the output that the data are attempting to reproduce.
3. Plot the data (x axis) against the i/(n + 1) values (y axis).

Figure 5.7 Producing a cumulative frequency plot from generated data points.

Figure 5.7 illustrates an example. A total of 200-300 iterations is usually quite sufficient to plot a smooth curve. This technique is very useful if one wishes to avoid the standard format that Monte Carlo software offers, or to plot two or more cumulative frequency plots together. The cumulative frequency plot is very useful for reading off quantitative information about the distribution of the variable. One can read off the probability of exceeding any value; for example, the probability of going over budget, failing to meet a deadline or of achieving a positive NPV (net present value).
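The three plotting steps above translate directly into code. The output data here are a stand-in (lognormal costs), not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.lognormal(mean=7.0, sigma=0.3, size=300)  # stand-in output data

# 1. Rank the data in ascending order
x = np.sort(data)

# 2. Cumulative percentile P_i = i/(n + 1) for ranks i = 1..n
n = len(x)
p = np.arange(1, n + 1) / (n + 1)

# 3. Plot x against p; here we instead read off, say, P(output <= 1500)
below = x <= 1500
prob_under_1500 = p[below][-1] if below.any() else 0.0
```

Reading `p` at a given x value gives exactly the kind of "probability of staying under budget" figure discussed in the text.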
One can also find the probability of lying between any two x-axis values: it is simply the difference between their cumulative probabilities. From Figure 5.8 we can see that the probability of lying between 1000 and 2000 is 89 % - 48 % = 41 %.

Figure 5.8 Using the cumulative frequency plot to determine the probability of being between two values.

The cumulative frequency plot is often used in project planning to determine contract bid prices and project budgets, as shown in Figure 5.9. The budget is set as the expected (mean) value of the variable, determined from the statistics report. A risk contingency is then added to the budget to bring it up to a cumulative percentile that is comfortable for the organisation. The risk contingency is typically the amount available to project managers to spend without recourse to their board. The (budget + contingency) value is set to match a cumulative probability that the board of directors is happy to plan for: in this case 85 %. A more controlling board might set the sum at the 80th percentile or lower. The margin is then added to the (budget + contingency) to determine a bid price or project budget. The project cost might still possibly exceed the bid price, and the company would then make a loss. Conversely, they would hope, by careful management of the project, to avoid using all of the risk contingency and actually increase their margin.

Figure 5.9 Using the cumulative frequency plot to determine appropriate values for a project's budget, contingency and margin.

The x axis of a cumulative distribution of project cost or duration can be thought of roughly as listing risks in decreasing order of importance. The easiest risks to manage, i.e.
those that should be removed with good project management, are the first to erode the total cost or duration. So a target set at the 80th percentile, sometimes called the 20 % risk level, is roughly equivalent to removing the identified, easily managed risks. Then there are those risks that will be removed with a lot of hard work, good management and some luck, which brings us down to the 50th percentile, or so. To reduce the actual cost or duration to somewhere around the 20th percentile will usually require very hard work, good management and a lot of luck.

Figure 5.10 Overlaying of the cumulative frequency plots of several project milestones (Milestones A to E) illustrates any increase in uncertainty with time.

It is sometimes useful to overlay cumulative frequency plots together. One reason to do this is to get a visual picture of stochastic dominance, described in Section 5.4.5. Another reason is to visualise the increase (or perhaps decrease) in uncertainty as a project progresses. Figure 5.10 illustrates an example for a project with five milestones. The time until completion of a milestone becomes progressively more uncertain the further from the start the milestone is. Furthermore, the results of a second-order risk analysis can be plotted as a number of overlying cumulative distributions, each curve representing a distribution of variability for a particular random sample from the uncertainty distributions of the model.

5.3.3 Second-order cumulative probability plot

A second-order cdf is the best presentation of an output probability distribution when you run a second-order Monte Carlo simulation.
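A sketch of how the many lines of such a plot can be generated, using an outer loop over uncertainty and an inner set of variability samples; both the parameter-uncertainty distributions and the variability distribution are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

curves = []
for _ in range(50):                     # outer loop: uncertainty
    # One sample from each (hypothetical) parameter-uncertainty distribution
    mu = rng.normal(10.0, 1.0)          # uncertain mean
    sigma = rng.uniform(1.0, 3.0)       # uncertain standard deviation

    # Inner samples: variability, given those parameter values
    sample = np.sort(rng.normal(mu, sigma, size=500))
    p = np.arange(1, 501) / 501         # P_i = i/(n + 1) percentiles
    curves.append((sample, p))          # one cdf line per uncertainty draw
```

Plotting all fifty (sample, p) pairs on one chart gives the bundle of cdf lines the following paragraphs describe: the horizontal spread of the bundle shows uncertainty, each line's own spread shows randomness.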
The second-order cdf is composed of many lines, each of which represents a distribution of possible variability or probability generated by picking a single value from each uncertainty distribution in the model (Figures 5.11 to 5.13).

Figure 5.11 A second-order plot of a discrete random variable. The step nature of the plot makes it difficult to read.

Figure 5.12 Another second-order plot of a discrete variable, where the probabilities are marked with small points and joined by straight lines. The connection between the probability estimates is now clear, and the uncertainty and randomness components can now be compared: at its widest the uncertainty contributes a spread of about two units (dashed horizontal line), while the randomness ranges over some eight units (filled horizontal line), so the inability to predict this variable is more driven by its randomness than by our uncertainty in the model parameters.

Figure 5.13 A second-order plot of a continuous variable where our inability to predict its value is equally driven by uncertainty (dashed horizontal line) about the model parameters as by the randomness of the system (filled horizontal line). This is a useful plot for decision-makers because it tells them potentially how much more sure one would be of the predicted value if more information could be collected, and thus the uncertainty reduced.

5.3.4 Overlaying of cdf plots

Several cumulative distribution plots can be overlaid together (Figure 5.14). The plots are easier to read if the curves are formatted into line plots rather than area plots.

Figure 5.14 Several cumulative distribution plots overlaid together.

The overlaying of cumulative plots like this is an intuitive and easy way of comparing probabilities, and is the basis of stochastic dominance tests.
It is not very useful, however, for comparing the location, spread and shape of two or more distributions, for which overlaid density plots are much better. We recommend that a complementary cumulative distribution plot be given alongside the histogram (density) plot to provide the maximum information.

5.3.5 Plotting a variable with discrete and continuous elements

If a risk event does not occur, we could say it has zero impact, but if it occurs it will have an uncertain impact. For example: a fire may have a 20 % chance of occurring and, if it does, will incur $Lognormal(120 000, 30 000) of damage. We could model this as the product of an event indicator (taking the value 1 with probability 20 % and 0 otherwise) and the impact distribution. Running a simulation with this variable as an output, we would get the uninformative relative frequency histogram plot (shown with different numbers of bars) in Figure 5.15. There really is no useful way to show such a distribution as a histogram, because the spike at zero (in this case) requires a relative frequency scale, while the continuous component requires a continuous scale. A cumulative distribution, however, would produce the plot in Figure 5.16, which is meaningful.

5.3.6 Relationship between cdf and density (histogram) plots

For a continuous variable, the gradient of a cdf plot is equal to the probability density at that value. That means that, the steeper the slope of a cdf, the higher a relative frequency (histogram) plot would look at that point (Figure 5.17). The disadvantage of a cdf is that one cannot readily determine the central location or shape of the distribution. We cannot even easily recognise common distributions such as triangular, normal and uniform in cdf form without practice. Looking at the plots in Figure 5.18, you will readily identify the distribution form from the left panels, but not so easily from the right panels.
Figure 5.15 Histogram plot of a risk event.

Figure 5.16 Cumulative distribution of a risk event.

Figure 5.17 Relationship between density and cumulative probability curves.

Figure 5.18 Density and cumulative plots for some easily recognised distributions.

For a discrete distribution, the cdf increases in steps equal to the probability of the x value occurring (Figure 5.19).

Figure 5.19 Relationship between probability mass and cumulative probability plots for a discrete distribution.

5.3.7 Crude sensitivity analysis and tornado charts

Most Monte Carlo add-ins can perform a crude sensitivity analysis that is often used to identify the key input variables, as a precursor to performing a tornado chart or similar, more advanced, analysis on these key variables. It achieves this by performing one of two statistical analyses on data that have been generated from input distributions and data calculated for the selected output. Built into this operation are two important assumptions:

1. All the tested input parameters have either a purely positive or purely negative statistical correlation with the output.
2. Each uncertain variable is modelled with a single distribution.

Figure 5.20 Example input-output relationships for which crude sensitivity analysis is inappropriate.
Assumption 1 is rarely invalid, but would be incorrect if the output value were at a maximum or minimum for an input value somewhere in the middle of its range (see, for example, Figure 5.20). Assumption 2 is very often incorrect. For example, the impact of a risk event might be modelled as the product of a Bernoulli distribution (does the risk occur?) and a triangular distribution (the impact if it does). Monte Carlo software will generate the Bernoulli (or, equivalently, the binomial) and triangular distributions independently. Performing the standard sensitivity analysis will evaluate the effect of the Bernoulli and the triangular distributions separately, so the measured effect on the output will be divided between these two distributions. ModelRisk gets round this by providing the function VoseRiskEvent. The function constructs a single distribution, so only one Uniform(0, 1) variate is being used to drive the sampling of the risk impact. If you use @RISK, you can construct the risk event in the same single-variate fashion, and @RISK will then drive the sampling for that risk event so that the @RISK built-in sensitivity analysis will work correctly. Similarly, if you were an insurance company you might be interested in the impact on your corporate cashflow of the aggregate claims distribution for some particular policy. ModelRisk offers a number of aggregate distribution functions that internally calculate the aggregation of claim size and frequency distributions. One can, for example, write a single aggregate function that returns the aggregate cost of Poisson(5500) claims, each drawn independently from a Lognormal(2350, 1285) distribution, with the generated aggregate cost value controlled by a single U variate. ModelRisk has many such tools for simulating from constructed distributions to help you perform a correct sensitivity analysis. Assumption 2 also means that this method of sensitivity analysis is invalid for a variable that is modelled over a series of cells, like a time series of exchange rates or sales volumes.
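The idea behind a combined risk-event function like VoseRiskEvent can be sketched as an inverse transform of the mixed cdf: one Uniform(0, 1) variate decides both whether the event occurs and, if so, how large the impact is. The following Python sketch is an illustration of that principle only (not ModelRisk's actual implementation; the lognormal parameters are the hypothetical mu = 11.66, sigma = 0.246 of an underlying normal):

```python
import math
import random
import statistics

random.seed(3)

ND = statistics.NormalDist()  # standard normal, used to invert the lognormal cdf

def risk_event_inv(u, p, mu, sigma):
    """Invert the mixed cdf of a risk event with a single U(0,1) variate.

    With probability 1 - p the event does not occur (cost 0); otherwise
    the cost is Lognormal with underlying-normal parameters mu, sigma.
    """
    if u <= 1 - p:
        return 0.0
    # Rescale the remaining probability mass onto (0, 1) and invert.
    q = (u - (1 - p)) / p
    return math.exp(mu + sigma * ND.inv_cdf(q))

# One uniform variate drives the whole risk event, so the sensitivity
# analysis sees a single input distribution rather than two.
u = random.random()
print(risk_event_inv(u, 0.2, 11.66, 0.246))
```

Because the function is monotone in u, larger values of the driving variate always mean a larger (or equal) cost, which is exactly what a correlation-based sensitivity analysis needs.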
The automated analysis will evaluate the sensitivity of the output to each distribution within the time series separately. You can still evaluate the sensitivity of a time series by running two simulations: one with all the distributions simulating random values, the other with the distributions of the time series locked to their expected values. If the output distributions differ significantly, the variable time series is important.

Two statistical analyses

Tornado charts for two different methods of sensitivity analysis are in common use. Both methods plot the variable against a statistic that takes values from -1 (the output is wholly dependent on this input, but when the input is large, the output is small), through 0 (no influence), to +1 (the output is wholly dependent on this input, and when the input is large, the output is also large):

- Stepwise least-squares regression between collected input distribution values and the selected output values. The assumption here is that there is a relationship between each input I and the output O (when all other inputs are held constant) of the form O = m * I + c, where m and c are constants. That assumption is correct for additive and subtractive models, and will give very accurate results in those circumstances, but is otherwise less reliable and somewhat unpredictable. The r-squared statistic is then used as the measure of sensitivity in a tornado chart.
- Rank order correlation. This analysis replaces each collected value by its rank among the other values generated for that input or output, and then calculates Spearman's rank order correlation coefficient r between each input and the output. Since this is a non-parametric analysis, it is considerably more robust than the regression analysis option where there are complex relationships between the inputs and output.
Tornado charts are used to show the influence each input distribution has on the change in value of the output (Figure 5.21). They are also useful for checking that the model is behaving as you expect. Each input distribution is represented by a bar, and the horizontal range the bars cover gives some measure of the input distribution's influence on the selected model output. Their main use is as a quick overview to identify the most influential input model parameters. Once these parameters are determined, other sensitivity analysis methods like spider plots and scatter plots are more effective.

Figure 5.21 Examples of tornado charts (profit sensitivity; profit variation).

The left-hand plot of Figure 5.21 is the crudest type of sensitivity analysis, where some statistical measure of the correlation is calculated between the input and output values. The logic is that, the higher the degree of correlation between the input and output variables, the more the input variable is affecting the output. The degree of correlation can be calculated using either rank order correlation or stepwise least-squares regression. My preference is to use rank order correlation because it makes no assumption about the form of the relationship between the input and the output, beyond the assumption that the direction of the relationship is the same across the entire input parameter's range. Least-squares regression, on the other hand, assumes that there is a straight-line relationship between the input and the output variables. If the model is a sum of costs or task durations, or some other purely additive model, this assumption is fine. However, divisions and power functions in a model will strongly violate such an assumption. Be careful with this simple type of sensitivity analysis, because input-output relationships that strongly deviate from a continuously increasing or decreasing trend can be completely missed.
The x-axis scale is a correlation statistic, so it is not very intuitive because it does not relate to the impact on the output in terms of the output's units. Moreover, rank order correlation can be deceptive. Consider the following simple model, in which B is driven by A and C = Normal(1, 3):

D(output) = A + B + C

Running a simulation gives correlation levels that split the measured influence roughly equally between A and B. Yet from the model structure we can see that variable A is actually driving most of the output uncertainty. If we set the standard deviation of each variable to zero in turn and compare the drop in the standard deviation of the output (a good measure of variation in this case, because we are just adding normal distributions), we find:

A: drops output standard deviation by 85.1562 %
B: drops output standard deviation by 0.0004 %
C: drops output standard deviation by 1.1037 %

which tells an entirely different story from the regression and correlation statistics. The reason is that variable B is being driven by A, so the influence of A is being divided essentially equally between A and B. A proper regression analysis would require us to build in the direction of influence from A to B, and then the influence of B would come out as insignificant; but to do so we would have to specify that relationship - a very difficult thing to do in a complex spreadsheet model.

The right-hand plot of Figure 5.21 is a little more robust and is typically created by fixing an input distribution at a low value (say its 5th percentile), running a simulation, recording the output mean and then repeating the process with a medium value (say the 50th percentile) and a high value (say the 95th percentile) of the input distribution: these output means define the extremes of the bars. This type of plot is a cut-down version of a spider plot. It is a little more robust, and the x-axis scale is in units of the output, so it is more intuitive.
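The "fix each input and watch the output standard deviation" analysis can be reproduced analytically for an additive normal model. The parameters below are a hypothetical reconstruction (the book does not show the exact definitions of A and B here) chosen so that the drops come out close to the figures quoted above:

```python
import math

# Hypothetical reconstruction of the A, B, C model discussed above:
#   A = Normal(0, 10);  B = A + Normal(0, 0.1);  C = Normal(1, 3)
#   D(output) = A + B + C = 2A + noise_B + C
sd_A, sd_Bnoise, sd_C = 10.0, 0.1, 3.0

def output_sd(a=sd_A, b=sd_Bnoise, c=sd_C):
    # Independent normal components, so variances add.
    return math.sqrt((2 * a) ** 2 + b ** 2 + c ** 2)

base = output_sd()
drops = {
    "A": 100 * (1 - output_sd(a=0.0) / base),
    "B": 100 * (1 - output_sd(b=0.0) / base),
    "C": 100 * (1 - output_sd(c=0.0) / base),
}
for name, drop in drops.items():
    print(f"fixing {name}: output sd drops by {drop:.4f} %")
```

Fixing A collapses almost all the output variation (about 85 %), while fixing B changes almost nothing, even though a naive correlation analysis would score A and B roughly equally.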
At low levels of correlation you will often see a variable with a correlation of the opposite sign to what you would expect. This is particularly so for rank order correlation. It just means that the level of correlation is so low that a spurious correlation of generated values will occur. For presentation purposes, it is obviously better to remove these bars. It is standard practice to plot the variables from the top down in decreasing size of correlation. If there are positive and negative correlations, the result looks a bit like a tornado, hence the name. It is sensible, of course, to limit the number of variables that are shown on the plot. I usually limit the plot to those variables that have a correlation of at least a quarter of the maximum observed correlation, or at least down to the first correlation that has the opposite sign to what one would logically have expected. Below such levels of correlation the relationships are usually statistically insignificant, although of course one can make a mistake in reasoning the sense of a correlation.

The tornado chart is useful for identifying the key variables and uncertain parameters that are driving the result of the model. It makes sense that, if the uncertainty of these key parameters can be reduced through improved knowledge, or the variability of the problem can be reduced by changing the system, the total uncertainty of the problem will be reduced too. The tornado chart is therefore very useful for planning any strategy for the reduction of total uncertainty. The key model components can often be made more certain by:

- Collecting more information on the parameter if it has some level of uncertainty.
- Determining strategies to reduce the effect of the variability of the model component. For a project schedule, this might be altering the project plan to take the task off the critical path. For a project cost, this might be offloading the uncertainty via a fixed-price subcontract.
For a model of the reliability of a system, this might be increasing the scheduled number of checks or installing some parallel redundancy.

The rank order correlation between the model components and the output can easily be calculated if the uncertainty and variability components are all simulated together, because the simulation software will have all the values generated for the input distributions and the output together in the one database. It may sometimes be useful to show in a tornado chart that certain model components are uncertain and others are variable by using, for example, white bars for uncertainty and black bars for variability.

5.3.8 More advanced sensitivity analysis with spider plots

To construct a spider plot we proceed as follows. Before starting:

- Set the number of iterations to a fairly low value (e.g. 300).
- Determine the input distributions to analyse (performing a crude sensitivity analysis will guide you).
- Determine the cumulative probabilities you wish to test (we generally use 1 %, 5 %, 25 %, 50 %, 75 %, 95 %, 99 %).
- Determine the output statistic you wish to measure (mean, a particular percentile, etc.).

Then:

- Select an input distribution.
- Replace the distribution with one of the percentiles you specified.
- Run a simulation and record the statistic of the output.
- Select the next cumulative percentile and run another simulation.
- Repeat until all percentiles have been run for this input, then put back the distribution and move on to the next selected input.

Once all inputs have been treated this way, we can produce the spider plot shown in Figure 5.22. This type of plot usually has several horizontal lines for variables that have almost no influence on the output. It makes the graph a lot clearer to delete these (Figure 5.23). Now we can very clearly see how the output mean is influenced by each input.
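The spider-plot procedure above can be sketched as a loop in code. The two-input profit model below is entirely hypothetical (uniform price and volume inputs, a fixed cost); the point is the structure: for each input, fix it at each chosen percentile in turn, simulate, and record the output mean.

```python
import random

random.seed(5)

# Percentiles at which each input is fixed in turn:
percentiles = [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99]

def price_inv(q):    # hypothetical input: Uniform(50, 70), inverse cdf
    return 50 + 20 * q

def volume_inv(q):   # hypothetical input: Uniform(800, 1200), inverse cdf
    return 800 + 400 * q

def profit(price, volume):
    return price * volume - 55_000

def mean_profit(fixed=None, q=None, n=3000):
    """Simulate the output mean, optionally fixing one input at its
    q-th cumulative percentile while the other inputs stay random."""
    total = 0.0
    for _ in range(n):
        p = price_inv(q) if fixed == "price" else price_inv(random.random())
        v = volume_inv(q) if fixed == "volume" else volume_inv(random.random())
        total += profit(p, v)
    return total / n

# One line of the spider plot per input:
spider = {name: [mean_profit(fixed=name, q=q) for q in percentiles]
          for name in ("price", "volume")}
for name, line in spider.items():
    print(name, [round(v) for v in line])
```

Plotting each input's list against the percentiles gives the lines of Figure 5.22; near-horizontal lines are the inconsequential inputs to delete.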
The vertical range produced by the oil price line shows the range of expected profits there would be if the oil price were fixed somewhere between its minimum and maximum (a range of $180 million). The next largest range is for the gas price ($110 million), etc. The analysis helps us understand the degree of sensitivity in terms decision-makers understand, as opposed to correlation or regression coefficients. The plot will also allow us to see variables that have unusual relationships, e.g. a variable that has no influence except at its extremes, or some sort of U-shaped relationship that would be missed in a correlation analysis.

Figure 5.22 Spider plot example (mean of profit vs input distribution percentile; inputs include thickness, exchange rate and oil price).
Figure 5.23 Spider plot example with inconsequential variables removed.

5.3.9 More advanced sensitivity analysis with scatter plots

By plotting the generated values for an input against the corresponding output values for each model iteration in a scatter plot, one can get perhaps the best understanding of the effect of the input on the output value. Plotting generated values for two outputs is also commonly done; for example, plotting a project's duration against its total cost. Scatter plots are easy to produce by exporting the simulation data at the end of a simulation into Excel. It takes a little effort to generate these scatter plots, so we recommend that you first perform a rough sensitivity analysis to help you determine which of a model's input distributions most affect the output. Figure 5.24 shows 3000 points, which is enough to get across any relationship but not too many to block out central areas if you use small circular markers.
The chart tells the story that the model predicts increasing advertising expenditure will increase sales - up to a point. Since this is an Excel plot, we can add a few useful refinements. For example, we could show scenarios above and below a certain advertising budget (Figure 5.25). We could also perform some statistical analysis of the two subsets, like a regression analysis (Figure 5.26 shows how in an Excel chart). The equations of the fitted lines show that you are getting about 3 times more return for your advertising dollar below $150k than above (0.0348/0.0132 ≈ 2.6). It is also possible, though mind-bogglingly tedious, to plot scatter plot matrices in Excel to show the interrelationship of several variables. Much better is to export the generated values to a statistical package like SPSS. At the time of writing (2007), planned versions of @RISK and Crystal Ball will also do this.

Figure 5.24 Example scatter plot (sales vs advertising expenditure $k).
Figure 5.25 Scatter plot separating scenarios where expenditure was above or below $150k.
Figure 5.26 Scatter plot with separate regression analysis for scenarios above or below $150k.

5.3.10 Trend plots

If a model includes a time series forecast or other type of trend, it is useful to be able to picture the general behaviour of the trend. A trend or summary plot provides this information. Figure 5.27 illustrates an example using the mean and the 5th, 20th, 80th and 95th percentiles. Trend plots can be plotted using cumulative percentiles as shown here, or with the mean ± one and two standard deviations, etc. I recommend that you avoid using standard deviations, unless they are of particular interest for some technical reason, because a spread of, say, one standard deviation around the mean will encompass a
varying percentage of the distribution depending on its form. That means that there is no consistent probability interpretation attached to the mean ± k standard deviations. The trend plot is useful for reviewing a trending model to ensure that seasonality and any other patterns are being reproduced. One can also see at a glance whether nonsensical values are being produced; a forecasting series can be fairly tricky to model, as described in Chapter 12, so this is a nice reality check.

Figure 5.27 A trend or summary plot (market size predictions: mean with 5th, 20th, 80th and 95th percentiles).

An alternative to the trend plot above is a Tukey or box plot (Figure 5.28). A Tukey plot is more commonly used to represent variations between datasets, but it does have the possibility of including more information than trend plots. A word of caution: the minimum and maximum generated values from a simulation can vary enormously between simulations with different random number seeds, which means they are not usually values to be relied upon. Plotting the maximum value of an inflation model going out 15 years, for example, might produce a very large value if you ran it for many iterations, and that value would dominate the graph scaling.

Figure 5.28 A Tukey or box plot. The box contains the 25-75 percentile range.

5.3.11 Risk-return plots

Risk-return (or cost-benefit) plots are one way to compare several decision options graphically on the same plot. The expected return in some appropriate measure is plotted on the vertical axis versus the expected cost in some measure on the horizontal axis (Figure 5.29). The plot should be tailored to the decision question, and it may be useful to plot two or more such plots to show different aspects.
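Building the percentile bands of a trend plot is mechanical once you have the simulated paths: summarise each time period's column of generated values at the chosen percentiles. The random-walk forecast below is a hypothetical stand-in for any time-series model:

```python
import random

random.seed(7)

# Sketch: 2000 multiplicative random-walk paths over 12 periods,
# standing in for any simulated time-series forecast.
n_paths, n_periods = 2000, 12
paths = []
for _ in range(n_paths):
    level, path = 100.0, []
    for _ in range(n_periods):
        level *= 1 + random.gauss(0.01, 0.03)
        path.append(level)
    paths.append(path)

def percentile(values, p):
    # Simple empirical percentile by sorting (adequate for a plot).
    s = sorted(values)
    return s[int(p * (len(s) - 1))]

# One row of band values per period; these rows are the trend plot lines.
summary = []
for t in range(n_periods):
    col = [path[t] for path in paths]
    summary.append({p: percentile(col, p) for p in (0.05, 0.20, 0.50, 0.80, 0.95)})

print("period 12 bands:", {p: round(v, 1) for p, v in summary[-1].items()})
```

Plotting each percentile series against time reproduces the fan shape of Figure 5.27; the widening of the bands over time is the growth of forecast uncertainty.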
Examples of measures of return (benefit) are as follows:

- the probability of making a profit;
- the income or expected return;
- the number of animals that could be imported for a given level of risk (if one were looking at various border control options for disease control, say);
- the number of extra votes that would be gained in an election campaign;
- the time that would be saved;
- the reduction in the number of complaints received by a utility company;
- the extra life expectancy of a kidney transplant patient.

Examples of measures of risk (cost) are as follows:

- the amount of capital invested;
- the probability of exceeding a schedule deadline;
- the probability of financial loss;
- the conditional mean loss;
- the standard deviation or variance of profit or cashflow;
- the probability of introduction of a disease;
- the semi-standard deviation of loss;
- the number of employees that would be made redundant;
- the increased number of fatalities;
- the level of chemical emission into the environment.

Figure 5.29 Example risk-return plot.

5.4 Statistical Methods of Analysing Results

Monte Carlo add-ins offer a number of statistical descriptions to help analyse and compare results. There are also a number of other statistical measures that you may find useful. I have categorised the statistical measures into three groups:

1. Measures of location - where the distribution is "centered".
2. Measures of spread - how broad the distribution is.
3. Measures of shape - how lopsided or peaked the distribution is.

In general, at Vose Consulting we use very few statistical measures in writing our reports.
The following statistics are easy to understand and, for nearly any problem, communicate all the information one needs to get across:

- the mean, which tells you where the distribution is located and has some important properties for comparing and combining risks;
- cumulative percentiles, which give the probability statements that decision-makers need (like the probability of being above or below X, or between X and Y);
- relative measures of spread: the normalised standard deviation (occasionally) for comparing the level of uncertainty of different options relative to their size (i.e. as a dimensionless measure) where the outputs are roughly normal, and the normalised interpercentile range (more commonly) for the same purpose where the outputs being compared are not all normal.

5.4.1 Measures of location

There are essentially three measures of central tendency (i.e. measures of the central location of a distribution) that are commonly provided in statistics reports: the mode, the median and the mean. These are described below, along with the conditional mean, which the reader may find more useful in certain circumstances.

Mode

The mode is the output value that is most likely to occur (Figure 5.30). For a discrete output, this is the value with the greatest observed frequency. For a continuous distribution output, the mode is determined by the point at which the gradient of the cumulative distribution of the model output's generated values is at its maximum. The estimate of the mode is quite imprecise if a risk analysis output is continuous, or if it is discrete and the two (or more) most likely values have similar probabilities (Figure 5.31). In fact the mode is of no practical value in the assessment of most risk analysis results and, as it is difficult to determine precisely, it should generally be ignored.

Median x50

The median is the value above and below which the model output has generated equal numbers of data, i.e. the 50th percentile.
This is simply another cumulative percentile and, in most cases, has no particular benefits over any other percentile.

Figure 5.31 A discrete distribution with two modes, or no mode, depending on how you look at it.

Mean x̄

This is the average of all the generated output values. It has less immediate intuitive appeal than the mode or median, but it has far more value. One can think of the mean of the output distribution as the x-axis point of balance of the histogram plot of the distribution. The mean is also known as the expected value, although I don't recommend the term as it implies, for most people, the most likely value. Sometimes also known as the first moment about the origin, it is the most useful statistic in risk analysis. The mean of a dataset {xi} is often given the notation x̄. It is particularly useful for the following two reasons: if a and b are two stochastic variables, then

mean(a + b) = ā + b̄ and mean(a - b) = ā - b̄

In other words: (1) the mean of the sum is the sum of the means; (2) the mean of the difference is the difference of the means. These two results are very useful if one wishes to combine risk analysis results or look at the difference between them.

Conditional mean

The conditional mean is used when one is interested only in the expected outcome of a portion of the output distribution; for example, the expected loss that would occur should the project fail to make a profit. The conditional mean is found by calculating the average of only those data points that fall into the scenario in question. In the example of expected loss, it would be found by taking the average of all the profit output's data points that were negative. The conditional mean is sometimes accompanied by the probability of the output falling within the required range. In the loss example, it would be the probability of producing a negative profit.
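The conditional mean and its accompanying probability fall straight out of the generated data. A minimal Python sketch (the Normal(50, 80) profit distribution is purely illustrative):

```python
import random
import statistics

random.seed(8)

# Hypothetical profit output: Normal(mean 50, sd 80), in $k say.
profits = [random.gauss(50, 80) for _ in range(20_000)]

# Conditional mean of the loss scenario: average only the data points
# that fall below zero, and report the probability of that scenario.
losses = [x for x in profits if x < 0]
p_loss = len(losses) / len(profits)
conditional_mean_loss = statistics.mean(losses)

print(f"P(loss)                = {p_loss:.3f}")
print(f"mean loss given a loss = {conditional_mean_loss:.1f}")
```

Reporting the pair "about a 27 % chance of a loss, and if a loss occurs it averages about 49" is usually far more useful to a decision-maker than the unconditional mean alone.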
Relative positions of the mode, median and mean

For any unimodal (single-mode) distribution that is positively skewed (i.e. has a longer right tail than left tail), the mode, median and mean fall in that order (Figure 5.32).

Figure 5.32 Relative positions of the mode, median and mean of a univariate distribution.

If the distribution has a longer left tail than right, the order is reversed. Of course, if the distribution is symmetric and unimodal, like the normal or Student distributions, the mode, median and mean will be equal.

5.4.2 Measures of spread

The three measures of spread commonly provided in statistics reports are the standard deviation s, the variance V and the range. There are several other measures of spread, discussed below, that the reader may also find useful under certain circumstances.

Variance V

The variance is calculated on the generated values as

V = Σ(xi - x̄)² / (n - 1)

i.e. it is essentially the average of the squared distances of all generated values from their mean. The larger the variance, the greater is the spread. The variance is called the second moment about the mean (because of its square term) and has units that are the square of the variable. So, if the output is in £, the variance is measured in £², making it difficult to have any intuitive feel for the statistic. Since the distance between the mean and each generated value is squared, the variance is far more sensitive to the data points that make up the tails of the distribution. For example, a data point that was three units from the mean would contribute 9 times as much (3² = 9) to the variance as a data point that was only one unit from the mean (1² = 1).
The variance is useful if one wishes to determine the spread of the sum of several uncorrelated variables X and Y, as it follows these rules:

V(X + Y) = V(X) + V(Y)
V(X - Y) = V(X) + V(Y)
V(nX) = n²V(X), where n is some constant

These formulae also provide a guideline on how to disaggregate an additive model uniformly, so that each component provides a roughly equal contribution to the total output uncertainty. If the model sums a number of variables, the contribution of each variable to the output uncertainty will be approximately equal if each variable has about the same variance.

Standard deviation s

The standard deviation is calculated as the square root of the variance:

s = √V

It has the advantage over the variance that it is in the same units as the output to which it refers. However, it still sums the squares of the distances of each generated value from the mean and is therefore far more sensitive to the outlying data points that make up the tails of the distribution than to those that are close to the mean. The standard deviation is frequently used in connection with the normal distribution. Results in risk analysis are often quoted using the output's mean and standard deviation, implicitly assuming that the output is normally distributed, and therefore that:

- the range x̄ - s to x̄ + s contains 68 % or so of the distribution;
- the range x̄ - 2s to x̄ + 2s contains 95 % or so of the distribution.

Some care should be exercised here. The distribution of a risk analysis output is often quite skewed, and these assumptions then do not follow at all. However, Tchebysheff's rule provides some weak interpretation of the fraction of a distribution contained within k standard deviations.

Range

The range of an output is the difference between the maximum and minimum generated values.
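The additivity rules above are easy to confirm by simulation. In this sketch (two hypothetical independent normal variables with standard deviations 3 and 4), both the sum and the difference come out with variance close to 9 + 16 = 25:

```python
import random
import statistics

random.seed(9)

# Two independent (hypothetical) variables: X ~ Normal(0, 3), Y ~ Normal(0, 4).
x = [random.gauss(0, 3) for _ in range(50_000)]
y = [random.gauss(0, 4) for _ in range(50_000)]

# V(X + Y) and V(X - Y) should both be close to 9 + 16 = 25:
v_sum  = statistics.variance([a + b for a, b in zip(x, y)])
v_diff = statistics.variance([a - b for a, b in zip(x, y)])
print(f"V(X+Y) ~ {v_sum:.1f}")
print(f"V(X-Y) ~ {v_diff:.1f}")
```

Note that the variance of the difference is the sum, not the difference, of the variances - a frequent source of error when combining risk analysis results.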
In most cases this is not a very useful measure, as it is obviously sensitive only to the two extreme values (which are, after all, randomly generated and could often take a wide range of legitimate values for any particular model).

Mean deviation (MD)

The mean deviation is calculated as

MD = Σ|xi - x̄| / n

i.e. the average of the absolute differences between the data points and their mean. This can be thought of as the expected distance that the variable will actually be from the mean. The mean deviation offers two potential advantages over the other measures of spread: it has the same units as the output, and it gives equal weighting to all generated data points.

Semi-variance Vs and semi-standard deviation ss

Variance and standard deviation are often used as measures of risk in the financial sector because they represent uncertainty. However, in a distribution of cashflow, a large positive tail (equivalent to the chance of a large income) is not really a "risk", although this tail will contribute to, and often dominate, the value of the calculated standard deviation and variance. The semi-standard deviation and semi-variance compensate for this problem by considering only those generated values below (or above, as required) a threshold, the threshold delineating those scenarios that represent a "risk" and therefore should be included from those that are not a risk and therefore should be excluded (Figure 5.33). The semi-variance and semi-standard deviation are

Vs = Σ(xi - x0)² / k (summed over i = 1 to k) and ss = √Vs

where x0 is the specified threshold value and x1, ..., xk are all of the data points that are either above or below x0, as required.

Figure 5.33 The semi-standard deviation concept.

Normalised standard deviation sn

This is the standard deviation divided by the mean:

sn = s / x̄

It achieves two purposes:

1. The standard deviation is given as a fraction of its mean.
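The semi-standard deviation definition above translates directly into code. The cashflow figures below are illustrative only; the threshold x0 = 0 delineates the loss scenarios, and the large positive tail that inflates the full standard deviation is deliberately excluded:

```python
import math
import statistics

# Illustrative cashflow data with a large positive tail:
data = [-40, -10, -5, 3, 8, 15, 60, 120]
x0 = 0  # threshold: values below this count as "risk"

# Semi-variance: average squared distance from the threshold, using
# only the k data points below it.
below = [x for x in data if x < x0]
semi_var = sum((x - x0) ** 2 for x in below) / len(below)
semi_sd = math.sqrt(semi_var)

full_sd = statistics.pstdev(data)
print(f"standard deviation      = {full_sd:.1f}")
print(f"semi-standard deviation = {semi_sd:.1f}")
```

Here the full standard deviation is dominated by the big positive values (the chance of a large income), while the semi-standard deviation reflects only the downside - which is the point of the statistic.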
Using this statistic allows the spread of the distribution of a variable with a large mean and correspondingly large standard deviation to be compared more appropriately with the spread of the distribution of another variable with a smaller mean and a correspondingly smaller standard deviation.

2. The standard deviation is now independent of its units. So, for example, the relative variability of the EUR:HKD and USD:GBP exchange rates can be compared.

The normalised interpercentile range works in the same way: it is calculated as (xB - xA)/x50, where xB > xA are percentiles like x95 and x05 respectively.

Interpercentile range

The interpercentile range of an output is calculated as the difference between two percentiles, for example:

- x95 - x05, to give the central 90 % range;
- x90 - minimum, to give the lower 90 % range;
- x90 - x10, to give the central 80 % range.

The interpercentile range is a stable measure of spread (unless one of the percentiles is the minimum or maximum), meaning that the value is obtained quickly for relatively few iterations of a model. It also has the great advantage of having a consistent interpretation between distributions. One potential problem you should be aware of is in applying an interpercentile range calculation to a discrete distribution, particularly when there are only a few important values, as shown in Figure 5.34. In this example, several key cumulative percentiles fall on the same values, so of course several different interpercentile ranges take the same values. In addition, the interpercentile range becomes very sensitive to the percentile chosen.

5.4.3 Measures of shape

Skewness S

This is the degree to which the distribution is "lopsided". A positive skewness means a longer right tail; a negative skewness means a longer left tail; zero skewness means the distribution is symmetric about its mean (Figure 5.35).
Figure 5.34 Demonstration of how interpercentile ranges can be confusing with discrete distributions.
Figure 5.35 Skewness examples.

The skewness S is calculated as

S = Σ(xi - x̄)³ / (n s³)

The s³ factor is put in to make the skewness a pure number, i.e. it has no units of measurement. Skewness is also known as the third moment about the mean and, because of the cubed term, is even more sensitive to the data points in the tails of a distribution than the variance or standard deviation. It may be useful to note, for comparative purposes, that an exponential distribution has a skewness of 2.0, an extreme value distribution has a skewness of 1.14, a triangular distribution has a skewness of between 0 and 0.562 (depending on its shape), and the skewness of a lognormal distribution goes from zero to infinity as its mean approaches 0. Skewness has little practical purpose for most risk analysis work, although it is sometimes used in conjunction with kurtosis (see below) to test whether the output distribution is approximately normal. High skewness values from a simulation run are really quite unstable - if your simulation gives a skewness value of 100, say, think of it as "really big" rather than taking its value as being usable.

Another measure of skewness, though rarely used, is the percentile skewness Sp, calculated as

Sp = (x90 - x50) / (x50 - x10)

It has the advantage over the standard skewness of being quite stable, because it is not affected by the values of the extreme data points. However, its scaling is different to that of the standard skewness:

- if 0 < Sp < 1, the distribution is negatively skewed;
- if Sp = 1, the distribution is symmetric;
- if Sp > 1, the distribution is positively skewed.

Kurtosis K

Kurtosis is a measure of the peakedness of a distribution. Like the skewness statistic, it is not of much use in general risk analysis.
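The moment-based skewness can be checked against the comparison values quoted above. This Python sketch computes the statistic from generated values (using n in the denominator, a common convention) and confirms that an exponential sample comes out close to 2.0:

```python
import random
import statistics

random.seed(10)

def skewness(data):
    # Third moment about the mean, scaled by s^3 to make it dimensionless.
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    n = len(data)
    return sum((x - m) ** 3 for x in data) / (n * s ** 3)

# An exponential distribution has skewness 2.0; check by simulation.
sample = [random.expovariate(1.0) for _ in range(200_000)]
sk = skewness(sample)
print(f"skewness of exponential sample ~ {sk:.2f}")
```

Note how many iterations are needed before the statistic settles: the cubed term makes it dominated by the tail, which is exactly the instability warned about above.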
Kurtosis is calculated as

K = Σ(xi − x̄)⁴ / (nσ⁴)

In a similar manner to skewness, the σ⁴ factor is put in to make the kurtosis a pure number. Kurtosis is often known as the fourth moment about the mean and is even more sensitive to the values of the data points in the tails of the distribution than the standard skewness statistic. Stable values for the kurtosis of a risk analysis result therefore require many more iterations than for other statistics. High kurtosis values from a simulation run are very unstable - if your simulation gives a kurtosis in the hundreds or thousands, say, it means there is a big spike in the output and the simulation kurtosis is very dependent on whether that spike was appropriately sampled, so for such large values just think of it as "really big". Kurtosis is sometimes used in conjunction with the skewness statistic to determine whether an output is approximately normally distributed. A normal distribution has a kurtosis of 3, so any output that looks symmetric and bell-shaped and has a zero skewness and a kurtosis of 3 can probably be considered normal. A uniform distribution has a kurtosis of 1.8, a triangular distribution has a kurtosis of 2.387, the kurtosis of a lognormal distribution goes from 3.0 to infinity as its mean approaches 0, and an exponential distribution has a kurtosis of 9.0. The kurtosis statistic is sometimes (in Excel, for example) calculated as K − 3, called the excess kurtosis, which can cause confusion, so be careful what statistic your software is reporting.

5.4.4 Percentiles

Cumulative percentiles

These are values below which the specified percentage of the generated data for an output fall. Standard notation is xP, where P is the cumulative percentage, e.g. x0.75 is the value that 75 % of the generated data were less than or equal to. The cumulative percentiles can be plotted together to form the cumulative frequency plot, the use of which has been explained above.
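A matching Python sketch for the kurtosis, using a normal sample so the quoted value of 3 (excess 0) can be checked; note that Excel's KURT applies small-sample corrections, so it will differ slightly from this population formula:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200_000)  # hypothetical, approximately normal output

mu, sigma = x.mean(), x.std()
kurtosis = np.mean((x - mu) ** 4) / sigma**4  # fourth moment about the mean
excess = kurtosis - 3.0                       # the "excess kurtosis" convention

print(round(kurtosis, 1), round(excess, 1))   # close to 3.0 and 0.0
```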
Differences between cumulative percentiles are often used as a measure of the variable's range, e.g. x0.95 − x0.05 would include the middle 90 % of the possible output values and x0.80 − x0.20 would include the middle 60 % of the possible values of the output; x0.25, x0.50 and x0.75 are sometimes referred to as the quartiles.

Relative percentiles

The relative percentiles are the fractions of the output data points that fall into each bar range of a histogram plot. They are of little use in most risk analyses and are dependent upon the number of bars that are used to plot the histogram. Relative percentiles can, however, be used to replicate the output distribution for inclusion in another risk analysis model. For example, cashflow models may have been produced for a number of subsidiaries of a large company. If an analyst wants to combine these uncertain cashflows into an aggregate model, he would want distributions of the cashflow from each subsidiary. This is achieved by using histogram distributions to model each subsidiary's cashflow and taking the required parameters (minimum, maximum, relative percentiles) from the statistics report. Providing the cashflow distributions are independent, they can then be summed in another model.

5.4.5 Stochastic dominance tests

Stochastic dominance tests are a statistical means of determining the superiority of one distribution over another. There are several types (or degrees) of stochastic dominance. We have never found any particular use for any but the first- and second-order tests described here. It would be a very rare problem where one of two options had to be selected for no better reason than the very marginal ordering provided by a statistical test.
In the real world there are usually far more persuasive reasons to select one option over another: option A would expose us to a greater chance of losing money than B, or a greater maximum loss, or would cost more to implement; we feel more comfortable with option A because we've done something similar before; option B will make us more strategically placed for the future; option B is based on an analysis with fewer assumptions; etc.

Figure 5.36 First-order stochastic dominance: FA < FB, so option A dominates option B.

First-order stochastic dominance

Consider options A and B having the distribution functions FA(x) and FB(x), where it is desirable to maximise the value of x. If FA(x) ≤ FB(x) for all x, then option A dominates option B. That amounts to saying that the cdf of option A is to the right of that of option B in an ascending plot. This is shown graphically in Figure 5.36. Option A has a smaller probability than option B of being less than or equal to each x value, so it is the better option (unless FA(x) = FB(x) everywhere). First-order stochastic dominance is intuitive and makes virtually no assumptions about the decision-maker's utility function, only that it is continuous and monotonically increasing with increasing x.

Second-order stochastic dominance

If

D(z) = ∫[min, z] (FB(x) − FA(x)) dx ≥ 0

for all z, then option A dominates option B. Figure 5.37 illustrates how this looks graphically. Figure 5.38 illustrates a situation when second-order stochastic dominance does not hold. Second-order stochastic dominance makes the additional assumption that the decision-maker has a risk-averse utility function over the entire range of x. This assumption is not very restrictive and can almost always be assumed to apply.
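Empirically, the first-order test is just a comparison of the two simulated cdfs at every x. A Python sketch with hypothetical option outcomes (the book does this on simulation iterations via Excel/ModelRisk):

```python
import numpy as np

rng = np.random.default_rng(4)
option_a = rng.normal(130, 20, 5_000)  # hypothetical outcome iterations
option_b = rng.normal(100, 20, 5_000)

grid = np.linspace(min(option_a.min(), option_b.min()),
                   max(option_a.max(), option_b.max()), 200)

# Empirical cdfs FA and FB evaluated on a common grid
F_a = np.searchsorted(np.sort(option_a), grid, side="right") / option_a.size
F_b = np.searchsorted(np.sort(option_b), grid, side="right") / option_b.size

a_dominates_b = bool(np.all(F_a <= F_b))  # FA(x) <= FB(x) for all x
print(a_dominates_b)
```

With a modest number of iterations, sampling noise can break dominance in the extreme tails even when the underlying distributions dominate, so the test is best applied to well-converged outputs.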
In most fields of risk analysis (finance being an obvious exception) it will not be necessary to resort to second-degree (or higher) dominance tests since the decision-maker should be able to find other, more important, differences between the available options. Stochastic dominance is great in principle but tends to be rather onerous to apply in practice, particularly if one is comparing several possible options. ModelRisk has the facility to compare as many options as you wish. First of all one simulates, say, 5000 iterations of the outcome of each possible option and imports these into contiguous columns in a spreadsheet. These are then fed into the ModelRisk interface, as shown in Figure 5.39. Selecting an output location allows you to insert the stochastic dominance matrix as an array function (VoseDominance), which will show all the dominance combinations and update if the simulation output arrays are altered.

Figure 5.37 Second-order stochastic dominance: option A dominates option B because D(z) is always ≥ 0.

Figure 5.38 Second-order stochastic dominance: option A does not dominate option B because D(z) is not always ≥ 0.
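The second-order integral D(z) can be approximated the same way by accumulating the area between the two empirical cdfs. A Python sketch (hypothetical outcomes: option A has a slightly higher mean and much less spread, the classic case where second-order but not first-order dominance holds):

```python
import numpy as np

rng = np.random.default_rng(5)
option_a = rng.normal(155, 10, 20_000)  # hypothetical profit iterations
option_b = rng.normal(150, 40, 20_000)

grid = np.linspace(min(option_a.min(), option_b.min()),
                   max(option_a.max(), option_b.max()), 500)
F_a = np.searchsorted(np.sort(option_a), grid, side="right") / option_a.size
F_b = np.searchsorted(np.sort(option_b), grid, side="right") / option_b.size

dx = grid[1] - grid[0]
D = np.cumsum(F_b - F_a) * dx            # D(z) at each grid point
first_order = bool(np.all(F_a <= F_b))   # fails: the two cdfs cross
second_order = bool(np.all(D >= -1e-9))  # holds: D(z) >= 0 everywhere
print(first_order, second_order)
```

A risk-averse decision-maker would therefore prefer option A here even though option B offers a chance of much higher profit.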
5.4.6 Value-of-information methods

Value-of-information (VOI) methods determine the worth of acquiring extra information to help the decision-maker. From a decision analysis perspective, acquiring extra information is only useful if it has a significant probability of changing the decision-maker's currently preferred strategy. The penalty of acquiring more information is usually valued as the cost of that extra information, and sometimes also the delay incurred in waiting for the information.

Figure 5.39 ModelRisk interface to determine stochastic dominance.

VOI techniques are based on analysing the revised estimates of model inputs that come with extra data, together with the costs of acquiring the extra data and a decision rule that can be converted into a mathematical formula to analyse whether the decision would alter. The ideas are well developed (Clemen and Reilly (2001) and Morgan and Henrion (1990), for example, explain VOI concepts in some detail), but the probability algebra can be somewhat complex, and simulation is more flexible and a lot easier for most VOI calculations. The usual starting point of a VOI analysis is to consider the value of perfect information (VOPI), i.e. answering the question "What would be the benefit, in the terms we are focusing on (usually money, but it could be lives saved, etc.), of being able to know some parameter(s) perfectly?". If perfect knowledge would not change a decision, the extra information is worthless; if it does change a decision, then the value of the extra knowledge is the difference in expected net benefit between the new selected option and that previously favoured. VOPI is a useful limiting tool, because it tells us the maximum value that any data may have in better evaluating the input parameter of concern. If the information costs more than that maximum value, we know not to pursue it any further.
After a VOPI check, one then looks at the value of imperfect information (VOII). Usually, the collection of more data will decrease, not eliminate, uncertainty about an input parameter, so VOII focuses on whether the decrease in uncertainty is worth the cost of collecting extra information. In fact, if new data are inconsistent with previous data or beliefs that were used to estimate the parameter, new data may even increase the uncertainty. If the data being used are n random observations (e.g. survey or experimental results), the uncertainty about the value of a parameter has a width (roughly) proportional to 1/SQRT(n). So, if you already have n observations and would like to halve the uncertainty, you will need a total of 4n observations (an increase of 3n). If you want to decrease uncertainty by a factor of 10, you will need a total of 100n observations (an increase of 99n). In other words, a decrease in uncertainty about a parameter value becomes increasingly expensive the closer the uncertainty gets to zero. Thus, if a VOPI analysis shows that it is economically justified to collect more information before making a decision, there will certainly be a point in the data collection where the cost of collecting data will outweigh their benefit.

VOPI analysis method

Consider the range of possible values for the parameter(s) for which you could collect more information. Determine whether there are possible values for these parameters that, if known, would make the decision-maker select a different option from the one currently deemed to be best. Calculate the extra value (e.g. expected profit) that the more informed decision would give. This is the VOPI.

VOII analysis method

Start with a prior belief about a parameter (or parameters), based on data or opinion. Model what observations might be made with new data using the prior belief. Determine the decision rule that would be affected by these new data.
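The square-root rule above is worth internalising; a tiny illustration (the baseline of 100 observations is an assumed example, not from the book):

```python
import math

def relative_width(n, n0=100):
    """Uncertainty width from n observations, relative to a baseline of n0."""
    return math.sqrt(n0 / n)

print(relative_width(400))     # 0.5: 4n observations halve the width
print(relative_width(10_000))  # 0.1: 100n observations cut it tenfold
```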
Calculate any improvement in the decision capability given the new data; the measure of improvement requires some valuation and comparison of possible outcomes, which is usually taken to be expected monetary or utility value, although this is rather restrictive. Determine whether any improvement in the decision capability exceeds the cost of the extra information.

VOI example

Your company wants to develop a new cosmetic, but there is some concern that people will have a minor adverse skin reaction to the product. The cost of development of the product to market is $1.8 million. The revenue NPV (including the cost of development) if the product is of the required quality is $3.7 million. Cosmetic regulations state that you will have to withdraw the product if 2 % or more of consumers have an adverse reaction to it. You have already performed some preliminary trials on 200 random people selected from the target demographic, at a cost per person of $500. Three of those people had an adverse reaction to the product. Management decide the product will only be developed if they can be 85 % confident that the product will affect less than the required 2 % of the population.

Decision question: Should we test more people or just abandon the product development now? If we should test more people, then how many more?

Having observed three affected people out of 200, our prior belief about p can be modelled as Beta(3 + 1, 200 − 3 + 1) = Beta(4, 198), which gives a 57.24 % confidence that 2 % or less of the target demographic will be affected (calculated as VoseBetaProb(2 %, 4, 198, 1) or BETADIST(2 %, 4, 198)). Thus, the current level of information means that management would not pursue development of the product, with no resultant cost or revenue, i.e. a net revenue of $0. However, the beta distribution shows that it is quite possible that p is less than 2 %, and we could be losing a good opportunity by quitting now.
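The 57.24 % figure can be reproduced without Excel or ModelRisk. A stdlib Python check, using the identity P(Beta(a, b) ≤ x) = P(Binomial(a + b − 1, x) ≥ a), which holds for integer a and b:

```python
from math import comb

a, b, x = 4, 198, 0.02  # Beta(4, 198) prior; the 2 % regulatory limit
n = a + b - 1           # 201 equivalent binomial trials

# P(p <= 2 %) = 1 - P(Binomial(201, 0.02) <= 3)
confidence = 1 - sum(comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(a))
print(round(confidence * 100, 2))  # -> 57.24, matching the text
```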
If this were known for sure, the company would get a profit of $3.7 million, so the VOPI = $3.7 million × 57.24 % + $0 million × 42.76 % = $2.12 million, and each test only costs $500; it is certainly possible that more information could be worth the expense.

VOII analysis

The model in Figure 5.40 performs the VOII steps described above. The parameter of concern is the fraction of people (prevalence), p, in the target demographic (women 18-65) who would have an adverse reaction, with a prior uncertainty described by Beta(4, 198), cell C12. The people in the study are randomly sampled from this demographic, so if we test m extra people (cell C22) we can assume the number of people who would be adversely affected, s, would follow a Binomial(m, p) distribution (cell C24). The revised estimate for p would then become Beta(4 + s, 198 + (m − s)). The confidence we then have that p is < 2 % is given by VoseBetaProb(2 %, 4 + s, 198 + (m − s), 1), cell C27. If this confidence exceeds 85 %, management would take the decision to develop the product (cells C31:C32). The model simulates different possible values of p from the prior. It models various possible numbers of extra tests, m, and simulates the extra data generated (s out of m), then evaluates the expected return of the resultant decision. Of course, although one may have reached the required confidence for p, the true value for p doesn't change and a bad decision may still be taken. The value of information is calculated for each iteration, and the mean function is used to calculate the expected value of information. Note that for this example the question being posed is how many more people to test in one go. A more optimal strategy would be to test a smaller number, review the results and perform another VOII analysis.
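The simulation loop just described can be sketched in Python. This is my simplified stand-in for the book's spreadsheet model, with an assumed payoff structure: +$3.7m if the product is developed and the true prevalence really is below 2 %, −$1.8m (the lost development cost) if it is developed and then withdrawn, and $500 per extra test:

```python
import random

def beta_cdf(x, a, b):
    """P(Beta(a, b) <= x) for integer a, b, via P(Binomial(a+b-1, x) >= a)."""
    n = a + b - 1
    pk = (1 - x) ** n      # P(Binomial = 0)
    below = 0.0
    for k in range(a):     # accumulate P(Binomial = 0 .. a-1)
        below += pk
        pk *= (n - k) / (k + 1) * x / (1 - x)
    return 1.0 - below

def expected_voii(m, iterations=2_000, seed=7):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iterations):
        p = rng.betavariate(4, 198)                  # scenario from the prior
        s = sum(rng.random() < p for _ in range(m))  # extra test results
        if beta_cdf(0.02, 4 + s, 198 + m - s) >= 0.85:
            total += 3.7e6 if p < 0.02 else -1.8e6   # develop; maybe wrongly
    return total / iterations - m * 500              # net of testing cost

print(expected_voii(700))
```

Running `expected_voii` over a range of m values reproduces the shape of the analysis described below, though the exact numbers depend on the assumed payoffs.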
This iterative process will either achieve the required confidence at a smaller test cost or lead one to abandon further testing because one is fairly sure that the required performance will not be achieved. It might at first seem that we are getting something for nothing here. After all, we don't actually know anything more until we perform the extra tests. However, the decision that would be made would depend on the results of those extra tests, and those results depend on what the true value of p actually is. Thus, the analysis is based on our prior for p (i.e. what we know to date about p) and the decision rule. When the model generates a scenario, it selects a value from the prior for p. It is saying: "Let's imagine that this is the true value for p". If that value is < 2 %, we should develop the product of course, but we'll never know the value of p (until we have launched the product and have enough customer history to know its value). However, extra tests will get us closer to knowing its true value,

Figure 5.41 VOI example model results.

Figure 5.42 VOI example model results where tests have no cost.

and so we end up taking less of a gamble. When the model picks a small value for p, it will probably generate a small number of affected people in our new tests, and our interpretation of this small number as meaning p is small will often be correct.
The danger is that a high p value could by chance result in an unrepresentatively small fraction of m being affected, which will be misinterpreted as a small p and lead management to make the wrong decision. However, as m gets bigger, that risk diminishes. The balance that needs to be made is that the tests cost money. The model simulates 20 scenarios where m is varied between 100 and 3000, with the results shown in Figure 5.41. It tells us that the optimal strategy, i.e. the strategy with the greatest expected VOII, is to perform about another 700 tests. The sawtooth effect in these plots occurs because of the discrete nature of the extra number affected that one would observe in the new data. Note that, if the tests had no cost, the graph would look very different (Figure 5.42). Now it is continually worth collecting more information (providing it is actually feasible to do) because there is no penalty to be paid in running more tests (except perhaps time, which is not included as part of this problem). In this case the value of information asymptotically approaches the VOPI (= $2.12 million) as the number of people tested approaches infinity.

Part 2 Introduction

Part 2 constitutes the bulk of this book and covers a wide range of risk analysis modelling techniques that are in general use. I have again almost exclusively used Microsoft Excel as the modelling environment because it is ubiquitous and makes it easy to show the principles of a model with printouts of the spreadsheet. I have also used Vose Consulting's ModelRisk add-in to Excel (see Appendix II), but I have done my best to avoid making this book a glorified advertisement for a software tool. The reality is that you will need some specialist software to do risk analysis.
Using ModelRisk gives me the opportunity to explain the thinking behind risk analysis modelling without the message getting lost in very long calculations or wrestling with the mechanical limitations of modelling in spreadsheets. Some of the simpler functions in ModelRisk are available in other risk analysis software tools, and Excel has some statistical functions (although they are of dubious quality). When I have used more complex functions in ModelRisk (like copulas or time series, for example), I have tried to give you enough information for you to do it yourself. Of course, we'd love you to buy ModelRisk - there is a lot more in the software than I have used in this book (Appendix II gives some highlights and explains how ModelRisk interacts with other risk analysis spreadsheet add-ins), it has a lot of very nice user interfaces and its routines can be called from C++ and VBA. We offer an extended demo period for ModelRisk on the inside back cover of this book, together with files for the models created for this book that you can play around with.

Notation used in the spreadsheet models

I have given printouts of spreadsheet models throughout this book. The models were produced in Microsoft Excel version 2003 and ModelRisk version 2.0, which complies with the standard Excel rules for cell formulae. The equations easily translate to @RISK, Crystal Ball and other Monte Carlo simulation packages where they have similar functions. In each spreadsheet I have given a formulae table so that the reader can follow and reproduce the model: for example, an entry for cells D2:D8 as = VoseLognormal(B2, C2). Where I have given one formula for a range of cells, it refers to the first cell of the range, and the formulae for other cells in the range are those that would appear by copying that formula over, for example by using the Excel Autofill facility.
The formulae in the other cells in the range will vary according to their position: copying the formula above into the other cells would give = VoseLognormal(B3, C3), = VoseLognormal(B4, C4), etc. If the formula had included a fixed reference using the "$" symbol in Excel notation, e.g. = VoseLognormal(B$2, C2), it would have copied down as = VoseLognormal(B$2, C3), = VoseLognormal(B$2, C4), etc. The VoseLognormal function generates random samples from a lognormal distribution, a very common distribution that features in pretty much all Monte Carlo simulation add-ins to Excel. So, for example, VoseLognormal(2, 3) could be replaced as follows:

@RISK: = RiskLognorm(2, 3)
Crystal Ball: = CB.Lognormal(2, 3)

There are maybe a dozen other, less common, Monte Carlo add-ins with varying levels of sophistication, and they all follow the same principle, but be careful to ensure that they parameterise a distribution in the same way. Excel allows you to input a function as an array, meaning that one function covers several cells. Array formulae in Excel are entered by highlighting a range of cells, typing the formula and then pressing CTRL-SHIFT-Enter together. The function then appears within curly brackets in the formula bar. Array functions are used rather extensively with ModelRisk. For example, with the values 1 to 7 in cells B2:B8 and the array formula {=VoseShuffle(B2:B8)} entered across C2:C8, column C might show the shuffled order 3, 5, 2, 6, 4, 1, 7. The VoseShuffle function simply randomises the order of the values listed in its parameter array. I display the formula within curly brackets because the VoseShuffle function covers that whole range, which is how it appears in Excel's formula bar. Note also that functions with names all in upper-case letters are native Excel functions, which is how they appear in the spreadsheet. Functions of the form VoseXxxx belong to ModelRisk.

Types of function in ModelRisk

ModelRisk has several types of function that apply to a probability distribution.
I'll use the normal distribution as an example. VoseNormal(2, 3) generates random values from a normal distribution with mean = 2 and standard deviation = 3. An optional third parameter (we call it the "U-parameter") is the quantile of the distribution; for example, VoseNormal(2, 3, 0.9) returns the 90th percentile of the distribution. The U-parameter must obviously lie on [0, 1]. The main use of the U-parameter is to control how random samples are generated from the distribution: passing a uniform random number produced by @RISK, Crystal Ball or Excel as the U-parameter will generate random values from the normal distribution using the random number generator of that package to control the sampling.

The second type of function calculates probabilities for each distribution featured in ModelRisk. For example, VoseNormalProb(0.7, 2, 3, FALSE) returns the probability density function of the normal distribution evaluated at x = 0.7, as would VoseNormalProb(0.7, 2, 3, 0) or VoseNormalProb(0.7, 2, 3), since the last parameter is assumed FALSE if omitted. VoseNormalProb(0.7, 2, 3, TRUE) or VoseNormalProb(0.7, 2, 3, 1) returns the cumulative distribution function of the normal distribution evaluated at x = 0.7. To this degree, these functions are analogous to Excel's NORMDIST function, e.g. NORMDIST(0.7, 2, 3, TRUE). However, the probability calculation functions can take an array of x values and then return the joint probability. For example, VoseNormalProb({0.1, 0.2, 0.3}, 2, 3, 0) = VoseNormalProb(0.1, 2, 3, 0) * VoseNormalProb(0.2, 2, 3, 0) * VoseNormalProb(0.3, 2, 3, 0). There are two advantages to this feature: we don't need a vast array of functions to calculate the joint probability for a large dataset, and the functions are far faster and more accurate than multiplying a long array because, depending on the distribution, there will be a lot of calculations that can be simplified.
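For readers without ModelRisk, the three behaviours (density, cumulative probability, and the U-parameter as a quantile) can be mimicked for the Normal(2, 3) example with the standard library; the bisection inverse is a deliberately simple stand-in for a proper quantile routine:

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu, sigma):       # like VoseNormalProb(x, mu, sigma, FALSE)
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu, sigma):       # like VoseNormalProb(x, mu, sigma, TRUE)
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def normal_quantile(u, mu, sigma):  # like VoseNormal(mu, sigma, u)
    lo, hi = mu - 12 * sigma, mu + 12 * sigma
    while hi - lo > 1e-10:          # invert the cdf by bisection
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if normal_cdf(mid, mu, sigma) < u else (lo, mid)
    return (lo + hi) / 2

print(round(normal_quantile(0.9, 2, 3), 4))  # 90th percentile -> 5.8447
```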
Joint probabilities can quickly tend to very small values, beyond the range that Excel can handle, so ModelRisk offers log base 10 versions of these functions too. These allow us to develop very efficient log-likelihood models, for example, which we can then optimise to fit to data (see Chapter 10).

Finally, ModelRisk offers what we call object functions, for example VoseNormalObject(2, 3). If you type =VoseNormalObject(2, 3) into a cell, it returns the string "VoseNormalObject(2, 3)". In many types of risk analysis calculation we want to do more with a distribution than simply take a random sample or calculate a probability. For example, we might want to determine its moments (mean, variance, etc.): the VoseMoments array function returns the first four moments of a distribution, such as a Gamma(3, 7), and takes as its input parameter the distribution type and parameter values. There are many other situations in which we want to manipulate distributions as objects: for example, a function that uses a hybrid Monte Carlo approach to add n Lognormal(10, 5) distributions together, where n is itself a Poisson(50) random variable. Note that the lognormal distribution is defined as an object here because we are using the distribution many times, taking on average 50 independent samples from the distribution for each execution of the function. However, the Poisson distribution is not an object because for one execution of the function it simply draws a single random sample. Objects can be embedded into other objects too: for example, the object for a distribution constructed by splicing a gamma distribution (left) and a shifted Pareto2 distribution (right) together at x = 3. Allowing objects to exist alone in cells allows us to create very transparent and efficient models.
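The hybrid aggregate calculation mentioned above (a Poisson(50) number of Lognormal(10, 5) samples summed) can be mimicked by brute force. Here Lognormal(10, 5) is assumed to mean "mean 10, standard deviation 5", the parameterisation ModelRisk uses, which must first be converted to the underlying normal's parameters:

```python
import math
import random

def lognormal_mu_sigma(mean, sd):
    """Convert the lognormal's mean/sd to the underlying normal's mu/sigma."""
    sigma2 = math.log(1 + (sd / mean) ** 2)
    return math.log(mean) - sigma2 / 2, math.sqrt(sigma2)

def poisson_sample(rng, lam):
    """Knuth's multiplication method; adequate for modest means like 50."""
    n, prod, limit = 0, rng.random(), math.exp(-lam)
    while prod > limit:
        n += 1
        prod *= rng.random()
    return n

def aggregate_sample(rng):
    mu, sigma = lognormal_mu_sigma(10, 5)  # severity: Lognormal(10, 5)
    n = poisson_sample(rng, 50)            # frequency: Poisson(50)
    return sum(rng.lognormvariate(mu, sigma) for _ in range(n))

rng = random.Random(9)
mean_total = sum(aggregate_sample(rng) for _ in range(5_000)) / 5_000
print(round(mean_total))  # near 50 * 10 = 500
```

The object-based ModelRisk function performs this far more efficiently, but the simulation makes the structure of the calculation explicit.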
Mathematical notation

There are some mathematical notations listed below that the reader will come across in a few parts of the text. I have tried to keep the algebra to a minimum and the reader should not worry unduly about this list. There is nothing in this book that really extends beyond the level of mathematics that one learns in a quantitative undergraduate course.

x — the label generally given to the value of a variable
θ — the label generally given to an uncertain parameter
∫[a, b] f(x) dx — the integral between a and b of the function f(x)
Σ[i = 1..n] xi — the sum of all xi values, where i is between 1 and n, i.e. x1 + x2 + ... + xn
Π[i = 1..n] xi — the product of all xi values, where i = 1 to n, i.e. x1·x2·...·xn
df(x)/dx — the differential of f(x) with respect to x
∂f(x, y)/∂x — the partial derivative of a function of x and y, f(x, y), with respect to x
≈ — "is approximately equal to"
≤, ≥ — "is less than or equal to" and "is greater than or equal to"
<<, >> — "is much less than" and "is much greater than"
x! — "x-factorial", = 1 × 2 × 3 × ... × x
exp[x] or e^x — "exponential x", e = 2.7182818...
ln[x] — the natural logarithm of x, so ln[exp[x]] = x
x̄ — the average of all x values
|x| — "modulus x", the absolute value of x
Γ(x) — the gamma function evaluated at x: Γ(x) = ∫[0, ∞] u^(x−1) e^(−u) du
B(x, y) — the beta function evaluated at (x, y): B(x, y) = ∫[0, 1] t^(x−1) (1 − t)^(y−1) dt = Γ(x)Γ(y)/Γ(x + y)

Other special functions are explained in the text where they appear. For those readers with some background in probability modelling, you might not be used to the notation I use for stating that a variable follows some distribution. I write X = Normal(100, 10), whereas the reader might be used to X ~ Normal(100, 10). I use the "=" notation because it is easier to write formulae that combine variables and it reflects how one uses Excel.
For example, where I might write

X = Normal(100, 10) + Gamma(2, 3)

using the other notation we would need to write

Y ~ Normal(100, 10)
Z ~ Gamma(2, 3)
X = Y + Z

which gets to be rather tedious.

This chapter is set out in sections, each of which solves a number of problems in a particular area. I hope that the problem-solving approach will complement the theory discussed earlier in the book. References are made to where the theory used in the problems is more fully discussed. The solution to each problem finishes with the symbol +.

Chapter 6

Probability mathematics and simulation

This chapter explores some very basic theories of probability and statistics that are essential for risk analysis modelling and that we need to understand before moving on. In my experience, ignorance of these fundamentals is a prime cause of the logical failure of a model. Risk analysis software is often sold on the merits of removing the need for any in-depth statistical theory. Although this is quite true with respect to using the software, it is often not the case when it comes to producing a logical model. In this chapter we begin by looking at the concepts that are used in the mathematics of probability distributions. Then we define some basic statistics in common use. We look at a few probability concepts that are essential to understand if one is to be assured of producing logical models. This chapter is designed to offer a reference of statistical and probability concepts: the application of these principles is left to the appropriate chapters later in the book.

For most people (myself included), probability theory and statistics were not their favourite subjects at college. I would, however, encourage those readers who find themselves equipped with limited endurance for statistical theory to get at least as far as the end of Section 6.4.4 before moving on.
6.1 Probability Distribution Equations

6.1.1 Cumulative distribution function (cdf)

The (cumulative) distribution function, or probability distribution function, F(x), is the mathematical equation that describes the probability that a variable X is less than or equal to x, i.e.

F(x) = P(X ≤ x) for all x

where P(X ≤ x) means the probability of the event X ≤ x. A cumulative distribution function has the following properties:

1. F(x) is always non-decreasing, i.e. dF(x)/dx ≥ 0.
2. F(x) = 0 at x = −∞; F(x) = 1 at x = +∞.

6.1.2 Probability mass function (pmf)

If a random variable X is discrete, i.e. it may take any of a specific set of n values x_i, i = 1, ..., n, then p(x) is called the probability mass function. Note that

Σ_{i=1}^n p(x_i) = 1

and

F(x_j) = Σ_{i=1}^j p(x_i)

For example, if a coin is tossed three times, the number of observed heads is discrete. The possible values of x_i are shown in Figure 6.1 against their probability mass function f(x) and probability distribution function F(x).

Figure 6.1 Distribution of the possible number of heads in three tosses of a coin.

In this book, I will often show a discrete variable's probability mass function by joining together the probability masses with straight lines and marking each allowed value with a point. Vertical histograms are usually more appropriate representations of discrete variables, but, by using the points-and-lines type of graph, one can show several discrete distributions together in the same plot.

6.1.3 Probability density function (pdf)

If a random variable X is continuous, i.e. it may take any value within a defined range (or sometimes ranges), the probability of X having any precise value within that range is vanishingly small, because we are allocating a probability of 1 between an infinite number of values. In other words, there is no probability mass associated with any specific allowable value of X. Instead, we define a probability density function f(x) as

f(x) = dF(x)/dx

i.e. f(x) is the rate of change (the gradient) of the cumulative distribution function. Since F(x) is always non-decreasing, f(x) is always non-negative. So, for a continuous distribution we cannot define the probability of observing any exact value. However, we can determine the probability of x lying between any two exact values (a, b):

P(a < X ≤ b) = F(b) − F(a)    where b > a    (6.3)

Example 6.1

Consider a continuous variable that takes a Rayleigh(1) distribution. Its cumulative distribution function is given by

F(x) = 1 − exp(−x²/2)

and its probability density function is given by

f(x) = x exp(−x²/2)

The probability that the variable will be between 1 and 2 is given by

F(2) − F(1) = exp(−1/2) − exp(−2) ≈ 0.471

F(x) and f(x) for this example are shown in Figure 6.2.

Figure 6.2 Probability density and cumulative probability plots for a Rayleigh(1) distribution.

In this book, we will show a continuous variable's probability density function with a smooth curve, as illustrated. A square sometimes plotted in the middle of this curve represents the position of the mean of the distribution. Providing the distribution is unimodal, if this point is higher than the 50th percentile the distribution will be right skewed, and if lower than the 50th percentile it will be left skewed. +

6.2 The Definition of "Probability"

Probability is a numerical measurement of the likelihood of an outcome of some random process. Randomness is the effect of chance and is a fundamental property of the system, even if we cannot directly measure it. It is not reducible through either study or further measurement, but may be reduced by changing the physical system. Randomness has been described as "aleatory uncertainty" and "stochastic variability".
The concept of probability can be developed neatly from two different approaches:

Frequentist definition

The frequentist approach asks us to imagine repeating the physical process an extremely large number of times (trials) and then to look at the fraction of times that the outcome of interest occurs. That fraction is asymptotically (meaning as we approach an infinite number of trials) equal to the probability of that particular outcome for that physical process. So, for example, the frequentist would imagine that we toss a coin a very large number of times. The fraction of the tosses that come up heads is approximately the true probability of a single toss producing a head, and the more tosses we do the closer the fraction becomes to the true probability. So, for a fair coin, we should see the number of heads stabilise at around 50 % of the trials as the number of trials gets truly huge. The philosophical problem with this approach is that one usually does not have the opportunity to repeat the scenario a very large number of times. How do we match this approach with, for example, the probability of it raining tomorrow, or of you having a car crash?

Axiomatic definition

The physicist or engineer, on the other hand, could look at the coin, measure it, spin it, bounce lasers off its surface, etc., until one could declare that, owing to symmetry, the coin must logically have a 50 % probability of falling on either surface (for a fair coin, or some other value for an unbalanced coin, as the measurements dictated). Determining probabilities on the basis of deductive reasoning has a far broader application than the frequency approach because it does not require us to imagine being able to repeat the same physical process infinitely.

A third, subjective, definition

In this context, "probability" would be our measure of how much we believe something to be true.
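The frequentist idea of a stabilising head fraction is easy to see in a quick simulation (a minimal Python sketch, not from the book):

```python
import random

rng = random.Random(7)

def head_fraction(n_tosses):
    """Fraction of heads observed in n simulated tosses of a fair coin."""
    return sum(rng.random() < 0.5 for _ in range(n_tosses)) / n_tosses

# the running fraction settles towards the true probability of 0.5
# as the number of trials grows
for n in (100, 10_000, 1_000_000):
    print(n, head_fraction(n))
```

With 100 tosses the fraction can easily be several percentage points away from 0.5; by a million tosses it is typically within a tenth of a percentage point.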
I'll use the term "confidence" instead of probability to make the separation between belief and real-world probability clear. A distribution of confidence looks exactly the same as a distribution of probability and must follow the same rules of complementation, addition, etc., which easily leads to mixing up the two ideas. Uncertainty is the assessor's lack of knowledge (level of ignorance) about the parameters that characterise the physical system being modelled. It is sometimes reducible through further measurement or study. Uncertainty has also been called "fundamental uncertainty", "epistemic uncertainty" and "degree of belief".

6.3 Probability Rules

There are four important probability theorems for risk analysis, the meaning and use of which are discussed in this section:

- strong law of large numbers (also called Tchebysheff's inequality¹);
- binomial theorem;
- Bayes' theorem;
- central limit theorem (CLT).

I will also describe a number of mathematical techniques useful in risk analysis and referenced elsewhere:

- Taylor series;
- Tchebysheff's rule (theorem);
- Markov inequality;
- least-squares linear regression;
- rank order correlation coefficient.

We'll begin with some basics on conditional probability, using Venn diagrams to help visualise the thinking.

6.3.1 Venn diagrams

Venn diagrams are introduced here to help visualise some basic rules of probability. In a Venn diagram the squared area, denoted by ε, contains all possible events, and we assign it an area equal to 1. The circles represent specific events. Probabilities are represented by the ratios of areas. For example, the probability of event A in Figure 6.3 is the ratio of area A to the total area ε:

P(A) = A/ε

Figure 6.3 Venn diagram for a single event A.

Mutually exclusive events

Figure 6.4 gives an example of a Venn diagram where two events (A and B) are identified.

¹ After the Russian mathematician Pafnuti Tchebysheff (1821–1894). Other transliterations of his name are Tchebycheff, Chebyshev and Tchebichef.
The events are mutually exclusive, meaning that they cannot occur together, and therefore the circles do not overlap.

Figure 6.4 Venn diagram for two mutually exclusive events.

The areas of the circles are denoted by A and B, and the probabilities of the occurrence of events A and B are denoted by P(A) and P(B):

P(A) = A/ε    P(B) = B/ε

You can think of a Venn diagram as an archery target. Imagine that you are firing an arrow at the target and that you have an equal chance of landing anywhere within the target area, but will definitely hit it somewhere. The circles on the target represent each possible event, so if your arrow lands in circle A, it represents event A happening. In Figure 6.4 you cannot fire an arrow that will land in both A and B at the same time, so events A and B cannot occur at the same time:

P(A ∩ B) = 0

The probability of either event occurring is then just the sum of the probabilities of each event, because we just need to add the A and B areas together:

P(A ∪ B) = P(A) + P(B)

Events that are not mutually exclusive

In Figure 6.5, A and B are not mutually exclusive: they can occur together, represented by the overlap in the Venn diagram. The figure shows the four different areas that are now produced. It can be seen from these areas that

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Figure 6.5 Venn diagram for two events that are not mutually exclusive.

Figure 6.6 More complex Venn diagram example.

6.3.3 Central limit theorem

The central limit theorem (CLT) is one of the most important theorems for risk analysis modelling.
It says that the mean x̄ of a set of n variables (where n is large), drawn independently from the same distribution f(x), will be normally distributed:

x̄ = Normal(μ, σ/√n)    (6.4)

where μ and σ are the mean and standard deviation of the f(x) distribution from which the n samples are drawn.

Example 6.2

If we had 40 variables, each following a Uniform(1, 3) distribution (with mean = 2 and standard deviation = 1/√3), the average of these variables would (approximately) have the following distribution:

x̄ = Normal(2, (1/√3)/√40) = Normal(2, 1/√120)

i.e. x̄ is approximately normally distributed with mean = 2 and standard deviation = 1/√120. +

Exercise 6.1: Create a variety of Monte Carlo models, averaging n distributions of the same type with the same parameter values, and see what the resultant distribution looks like. Try different values for n, e.g. n = 2, 5, 20, 50 and 100, and different distribution types, e.g. triangular, normal, uniform and exponential. For what values of n are these average distributions close to normal? For the triangular distribution, does this value of n vary depending on where the most likely parameter's value lies relative to the minimum and maximum parameter values?

It follows, by multiplying both sides of Equation (6.4) by n, that the sum, Σ, of n variables drawn independently from the same distribution is given by

Σ = Normal(nμ, σ√n)

Example 6.3

The sum Σ of 40 independent Uniform(1, 3) variables will have (approximately) the following distribution:

Σ = Normal(40 × 2, (1/√3)√40) = Normal(80, 3.65)

Remarkably, this theorem also applies to the sum (or average) of a large number of independent variables that have different probability distribution types, in that their sum will be approximately normally distributed providing no variable dominates the uncertainty of the sum. The theorem can also be applied where a large number of positive variables are being multiplied together. Consider a set of X_i, i = 1, ..., n, variables that are being independently sampled from the same distribution. Then their product, Π, is given by

Π = X_1 · X_2 · ... · X_n

Taking the natural log of both sides:

ln Π = Σ_{i=1}^n ln X_i

Since each variable X_i has the same distribution, the variables (ln X_i) must also have the same distribution and thus, from the central limit theorem, ln Π is normally distributed. Now, a variable is lognormally distributed if its natural log is normally distributed, i.e. Π is lognormally distributed. In fact, this application of the central limit theorem still approximately holds for the product of a large number of independent positive variables that have different distribution functions. There are a lot of situations where this seems to apply. For example, the volume of recoverable oil reserves within a field is approximately lognormally distributed, since it is the product of a number of independent(ish) variables, i.e. reserve area, average thickness, porosity, gas/oil ratio, (1 − water saturation), etc.

Most risk analysis models are a combination of adding (subtracting) and multiplying variables together. It should come as no surprise, therefore, that, from the above discussions, most risk analysis results seem to be somewhere between normally and lognormally distributed. A lognormal distribution also looks like a normal distribution when its mean is much larger than its standard deviation, so a risk analysis model result even more frequently looks approximately normal. This particularly applies to project and financial risk analyses, where one is looking at cost or time to completion or the value of a series of cashflows.

It is important to note from the results of this theorem that the distribution of the average of a set of variables depends on the number of variables that are being averaged, as well as the uncertainty of each variable.
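Both results, sums tending towards a normal distribution and products of positive variables tending towards a lognormal, are easy to check by simulation. A minimal Python sketch, using the Uniform(1, 3) variables of Examples 6.2 and 6.3 (the 20 000-iteration Monte Carlo run is my own illustration, not the book's):

```python
import math
import random
import statistics

rng = random.Random(42)

# Example 6.3: sum of n = 40 Uniform(1, 3) variables.
# CLT prediction: Normal(n*mu, sigma*sqrt(n)) = Normal(80, sqrt(40/3)) ~ Normal(80, 3.65)
n = 40
sums = [sum(rng.uniform(1, 3) for _ in range(n)) for _ in range(20_000)]
print(round(statistics.mean(sums), 1), round(statistics.stdev(sums), 2))

# Product of the same variables: ln(product) is a sum of ln(X_i), so the
# CLT says the product is approximately lognormal. As a crude normality
# check, the sample skewness of ln(product) should be small.
logs = [sum(math.log(rng.uniform(1, 3)) for _ in range(n)) for _ in range(20_000)]
m, s = statistics.mean(logs), statistics.stdev(logs)
skew = sum((x - m) ** 3 for x in logs) / (len(logs) * s ** 3)
print(abs(skew) < 0.2)  # True: ln(product) is close to normal
```

The simulated mean and standard deviation of the sums come out very close to the theoretical 80 and 3.65.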
It may be tempting, at times, to seek an expert's estimate of the distribution of the average of a number of variables; for example, the average time it will take to lay a kilometre of road, or the average weight of the fleece of a particular breed of sheep. The reader can now see that it will be a difficult task for experts to provide a distribution of an average measure: they would have to know the number of variables for which the estimate is the average and then apply the central limit theorem, which is no easy task to do in one's head. It is much better to estimate the distribution of the individual items and do the central limit theorem calculations oneself.

Many parametric distributions can be thought of as the sum of a number of other identical distributions. In general, if the mean is much larger than the standard deviation for these summary distributions, they can be approximated by a normal distribution. The central limit theorem is then useful for determining the parameters of the normal distribution approximation. Section III.9 discusses many of the useful approximations of one distribution for another.

6.3.4 Binomial theorem

The binomial theorem says that, for some values a and b and a positive integer n,

(a + b)^n = Σ_{x=0}^n (n choose x) a^x b^(n−x)

The binomial coefficient (n choose x), also sometimes written as nCx, is read as "n choose x" and is calculated as

(n choose x) = n!/(x!(n − x)!)    (6.6)

where the exclamation mark denotes factorial, so 4! = 1 · 2 · 3 · 4, for example. The binomial coefficient calculates the number of different ways one can order n articles where x of those articles are of one type and therefore indistinguishable from one another and the remaining (n − x) are of another type, again each being indistinguishable from another. The Excel function COMBIN calculates the binomial coefficient.

The arguments underpinning this equation go as follows. There are n! ways of ordering n articles, as there are n choices for the first article, then (n − 1) choices for the second, (n − 2) choices for the third, etc., until we are left with just the one choice for the last article. Thus, there are n · (n − 1) · (n − 2) · ... · 1 = n! different ways of ordering these articles. Now, suppose that x of these articles were identical: we would not be able to differentiate between two orderings where we simply swapped the positions of two of these articles. Repeating the logic above, there are x! different orderings that would all appear the same to us, so we would only recognise 1/x! of the possible orderings, and the number of orderings would now be n!/x!. Now, suppose that the remaining (n − x) articles are also identical but differentiable from the x articles. Then we could only distinguish 1/(n − x)! of the remaining possible orderings, and thus the total number of different combinations is given by

n!/(x!(n − x)!)

A useful way of quickly calculating the binomial coefficients for small n is given by Pascal's triangle (Figure 6.7). The outside of the triangle is filled with 1s, and each value inside the triangle is calculated as the sum of the two values immediately above it. Row n then represents the binomial coefficients for n, which also appears as the second value in each row. So, for example, row 10 reads 1, 10, 45, 120, 210, 252, 210, 120, 45, 10, 1, as highlighted in the figure.

Figure 6.7 Pascal's triangle.

Note that the binomial coefficients are symmetric, so that

(n choose x) = (n choose n − x)

This makes sense, as, if we swap x for (n − x) in Equation (6.6), we arrive back at the same formula. If we replace a with probability p, and b with probability (1 − p), the equation becomes

1 = (p + (1 − p))^n = Σ_{x=0}^n (n choose x) p^x (1 − p)^(n−x)

The summed component is the binomial probability mass function for x successes in n trials, where each trial has a probability p of success.
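Pascal's-triangle construction, the factorial formula and the pmf normalisation can all be cross-checked in a few lines of Python (standing in for Excel's COMBIN; the n = 10, p = 0.3 values are arbitrary illustrations):

```python
from math import comb

def pascal_row(n):
    """Row n of Pascal's triangle, built by summing adjacent values of the
    previous row, with 1s on the outside, as in Figure 6.7."""
    row = [1]
    for _ in range(n):
        row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]
    return row

print(pascal_row(4))  # [1, 4, 6, 4, 1]

# the construction agrees with the factorial formula n!/(x!(n - x)!)
assert pascal_row(10) == [comb(10, x) for x in range(11)]

# symmetry: C(n, x) = C(n, n - x)
print(comb(8, 3), comb(8, 5))  # 56 56

# the binomial pmf terms C(n, x) p^x (1 - p)^(n - x) sum to 1
n, p = 10, 0.3
print(round(sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)), 10))  # 1.0
```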
In a binomial process, all successes are considered identical and interchangeable, as are all failures.

Properties of the binomial coefficient

Several standard identities exist for binomial coefficients; one of them, known as Vandermonde's theorem (A. T. Vandermonde, 1735–1796), is

(m + n choose r) = Σ_{k=0}^r (m choose k)(n choose r − k)

Calculating x! for large x

x! is very laborious to calculate for high values of x. For example, 100! ≈ 9.3326E+157, and Excel's FACT() cannot calculate values higher than 170!. The probability mass functions of many discrete probability distributions contain factorials, and we therefore often want to work out factorials for values larger than 170. Algorithms for generating distributions get around any calculation restriction by using approximations; for example, the following equation, known as the Stirling² formula, can be used instead to get a very close approximation:

n! ~ √(2πn) (n/e)^n

where ~ is read "asymptotically equal" and means that the right-hand side approaches the left-hand side as n approaches infinity. However, if you are attempting to calculate a probability exactly, you can still use the Excel function GAMMALN():

ln(x!) = GAMMALN(x + 1)

This may allow you to manipulate multiplications of factorials, etc., by adding them in log space. But, be warned, this formula will not return exactly the same answer as FACT(), and, while it is possible to get values for GAMMALN(x) where x > 171, Excel will return an error if you attempt to calculate the corresponding EXP(GAMMALN(x)).

6.3.5 Bayes' theorem

Bayes' theorem³ is a logical extension of the conditional probability arguments we looked at in the Venn diagram section. We saw that

P(A|B) = P(A ∩ B)/P(B)    and    P(B|A) = P(B ∩ A)/P(A)

and hence

P(A|B) = P(B|A)P(A)/P(B)

which is Bayes' theorem, and, in general,

P(A_i|B) = P(B|A_i)P(A_i) / Σ_j P(B|A_j)P(A_j)

The following example illustrates the use of this equation.

² James Stirling (1692–1770), Scots mathematician.
³ Rev. Thomas Bayes (1702–1761), English philosopher. A short biography and a reprint of his original paper describing Bayes' theorem appear in Press (1989).
Many more are given in the section on Bayesian inference.

Example 6.4

Three machines A, B and C produce 20 %, 45 % and 35 % respectively of a factory's wheel nut output; 2 %, 1 % and 3 % respectively of these machines' outputs are defective.

(a) What is the probability that any wheel nut randomly selected from the factory's stock will be defective?

Let X be the event where the wheel nut is defective, and A, B and C be the events where the selected wheel nut comes from machines A, B and C respectively:

P(X) = P(X|A)P(A) + P(X|B)P(B) + P(X|C)P(C)
     = 0.02 × 0.20 + 0.01 × 0.45 + 0.03 × 0.35 = 0.019

(b) What is the probability that a randomly selected wheel nut will have come from machine A if it is defective? From Bayes' theorem,

P(A|X) = P(X|A)P(A)/P(X) = (0.02 × 0.20)/0.019 ≈ 0.211

In other words, in Bayes' theorem we divide the probability of the required path (the probability that it came from machine A and was defective) by the probability of all possible paths (the probability that it came from any machine and was defective). +

Example 6.5

We wish to know the probability that an animal will be infected (I), given that it passes (Pa) a specific veterinary check, i.e. P(I|Pa). The problem can be visualised by an event tree diagram (Figure 6.8). First of all, the animal will be infected (I) or not infected (N). Secondly, the animal will either pass (Pa) or fail (F) the test.

Figure 6.8 Event tree for Example 6.5.
From Bayes' theorem,

P(I|Pa) = P(Pa|I)P(I) / [P(Pa|I)P(I) + P(Pa|N)P(N)]

In veterinary terminology,

P(I) = the prevalence p, and thus P(N) = (1 − p)
P(F|I) = the sensitivity of the test, Se, and thus P(Pa|I) = (1 − Se)
P(Pa|N) = the specificity of the test, Sp

Putting these elements into Bayes' theorem,

P(I|Pa) = p(1 − Se) / [p(1 − Se) + (1 − p)Sp]

6.3.6 Taylor series

The Taylor series is a formula that determines a polynomial approximation in x of some mathematical function f(x) centred at some value x₀:

f(x) = Σ_{m=0}^∞ [f^(m)(x₀)/m!] (x − x₀)^m

where f^(m) represents the mth derivative with respect to x of the function f. In the special case where x₀ = 0, the series is known as the Maclaurin series of f(x):

f(x) = Σ_{m=0}^∞ [f^(m)(0)/m!] x^m

The Taylor and Maclaurin series expansions are also used to provide polynomial approximations to probability distribution functions.

6.3.7 Tchebysheff's rule

If a dataset has mean x̄ and standard deviation s, we are used to saying that 68 % of the data will lie between (x̄ − s) and (x̄ + s), 95 % lie between (x̄ − 2s) and (x̄ + 2s), etc. However, that is only true when the data follow a normal distribution. The same applies to a probability distribution. So, when the data, or probability distribution, are not normally distributed, how can we interpret the standard deviation? Tchebysheff's rule applies to any probability distribution or dataset. It states:

"For any number k greater than 1, at least (1 − 1/k²) of the measurements will fall within k standard deviations of the mean."

Substituting k = 1, Tchebysheff's rule says that at least 0 % of the data or probability distribution lies within one standard deviation of the mean. Well, we already knew that! However, substituting k = 2 tells us that at least 75 % of the data or distribution lie within two standard deviations of the mean. That is useful information because it applies to all distributions. This is a fairly conservative rule in that, if we know the distribution type, we can specify a much higher percentage (e.g. 95 % for two standard deviations for a normal distribution, compared with 75 % for Tchebysheff's rule), but it is certainly helpful in interpreting the standard deviation of a dataset or probability distribution that is grossly non-normally distributed. From Figure 6.9 you can see that, for any k, knowing the distribution type allows you to specify a much higher fraction of the distribution to be contained in the range mean ± k standard deviations. The bimodal distribution tested was as shown in Figure 6.10.

Figure 6.9 Comparison of Tchebysheff's rule with the results of a few distributions.

Figure 6.10 A bimodal distribution.

6.3.8 Markov inequality

The Markov inequality gives some indication of the range of a distribution, in a similar way to Tchebysheff's rule. It states that, for a non-negative random variable X with mean μ,

P(X > k) ≤ μ/k

for any constant k greater than μ. So, for example, for a random variable with mean 6, the probability of being greater than 20 is less than or equal to 6/20 = 30 %. Of course, being very general like Tchebysheff's rule, it makes a rather conservative statement. For most distributions, the probability is much smaller than μ/k (see Table 6.1 for some examples).

Table 6.1 Markov's rule for different distributions.

Distribution with μ = 6        P(X > 20)
Lognormal(6, σ)                Max. of 6.0 %
Pareto(θ, 6(θ − 1)/θ)          Max. of 3.21 %

6.3.9 Least-squares linear regression

The purpose of least-squares linear regression is to represent the relationship between one or more independent variables x₁, x₂, ... and a variable y that is dependent upon them in the following form:

y_i = β₀ + Σ_j β_j x_ji + ε_i

where x_ji is the ith observed value of the independent variable x_j, y_i is the ith observed value of the dependent variable y, ε_i is the error term or residual (i.e. the difference between the observed y value and that predicted by the model), β_j is the regression slope for the variable x_j and β₀ is the y-axis intercept. Simple least-squares linear regression assumes that there is only one independent variable x. If we assume that the error terms are normally distributed, the equation reduces to

y = Normal(mx + c, s)

where m is the slope of the line, c is the y-axis intercept and s is the standard deviation of the variation of y about this line. Simple least-squares linear regression is a very standard statistical analysis technique, particularly when one has little or no idea of the relationship between the x and y variables. It is probably particularly common because the analysis mathematics are simple (because of the normality assumption), rather than it being a very common rule for the relationship between variables.

LSR makes four important assumptions (Figure 6.11):

1. Individual y values are independent.
2. For each x_i there are an infinite number of possible values of y, which are normally distributed.
3. The distribution of y given a value of x has equal standard deviation for all x values and is centred about the least-squares regression line.
4. The means of the distribution of y at each x value can be connected by a straight line y = mx + c.

Figure 6.11 An illustration of the concepts of least-squares regression.

Statisticians often make transformations of the data (e.g. log(y), √x) to force a linear relationship. That greatly extends the applicability of the regression model, but one must be particularly careful that the errors are reasonably normal, and one runs an enormous risk in using the regression equations for making predictions outside the range of observations.

Estimation of parameters

The simple least-squares regression model determines the straight line that minimises the sum of the squares of the ε_i errors.
It can be shown that this occurs when

m = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²
c = ȳ − m x̄

where x̄ and ȳ are the means of the observed x and y data and n is the number of data pairs (x_i, y_i). The fraction of the total variation in the dependent variable that is explained by the independent variable is known as the coefficient of determination R², which is calculated as

R² = 1 − SSE/TSS

where the sum of squared errors, SSE, is given by

SSE = Σ(y_i − ŷ_i)²

and the total sum of squares, TSS, is given by

TSS = Σ(y_i − ȳ)²

and where ŷ_i are the predicted y values at each x_i:

ŷ_i = m x_i + c

For simple least-squares regression (i.e. only one independent variable), the square root of R² is equivalent to the simple correlation coefficient r:

r = √R²

The correlation coefficient r may alternatively be calculated as

r = Σ(x_i − x̄)(y_i − ȳ) / √[Σ(x_i − x̄)² Σ(y_i − ȳ)²]

Coefficient r provides a quantitative measure of the linear relationship between x and y. It ranges from −1 to +1: a value of r = −1 or +1 indicates a perfect linear fit, and r = 0 indicates that no linear relationship exists at all. As SSE (the sum of squared errors between the observed and predicted y values) tends to zero, r² tends to 1 and therefore r tends to −1 or +1, its sign depending on whether m is negative or positive respectively. The value of r is used to determine the statistical significance of the fitted line, by first calculating the test statistic t as

t = r √[(n − 2)/(1 − r²)]

The t-statistic follows a t-distribution with (n − 2) degrees of freedom (provided the linear regression assumptions of normally distributed variation of y about the regression line hold), which is used to determine whether the fit should be rejected or not at the required level of confidence. The standard error of the y estimate, S_yx, is calculated as

S_yx = √[SSE/(n − 2)]

This is equivalent to the standard deviation of the error terms ε_i. These errors reflect the true variability of the dependent variable y from the least-squares regression line.
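The estimation formulas above are straightforward to compute directly. A minimal Python sketch on a small made-up dataset (the numbers are purely illustrative, not the book's survey data):

```python
import math

# least-squares fit of y = m*x + c on a small illustrative dataset
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# slope and intercept that minimise the sum of squared errors
m = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
     / sum((x - xbar) ** 2 for x in xs))
c = ybar - m * xbar

# coefficient of determination R^2 = 1 - SSE/TSS
preds = [m * x + c for x in xs]
sse = sum((y - p) ** 2 for y, p in zip(ys, preds))
tss = sum((y - ybar) ** 2 for y in ys)
r2 = 1 - sse / tss

# standard error of the y estimate, S_yx, with (n - 2) degrees of freedom
s_yx = math.sqrt(sse / (n - 2))

print(round(m, 3), round(c, 3), round(r2, 4), round(s_yx, 3))
# 2.02 -0.02 0.9982 0.179
```

In Excel the same four quantities come from worksheet functions rather than hand-coded sums, but the arithmetic is identical.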
The denominator (n − 2) is used, instead of the (n − 1) we have seen before for sample standard deviation calculations, because two values, m and c, have been estimated from the data to determine the equation values, and we have therefore lost two degrees of freedom instead of the one degree of freedom usually lost in determining the mean. The equation of the regression line and the S_yx statistic can be used together to produce a stochastic model of the relationship between X and Y, as follows:

Y = Normal(mX + c, S_yx)

Some caution is needed in using such a model. The regression model is intended to work within the range of the independent variable X for which there have been observations. Using the model outside this range can produce very significant errors if the relationship between x and y deviates from this linear relationship. This is also purely a model of variability, i.e. we are assuming that the linear relationship is correct and that the parameters are known. We should also include our uncertainty about the parameters, and perhaps about whether the linear relationship is even appropriate.

Example 6.6

Consider the dataset in Table 6.2, which shows the result of a survey of 30 people. They were asked to provide details of their monthly net income {x_i} and the amount they spent on food each month {y_i}. The values of m, c, r and S_yx were calculated using the corresponding Excel functions. The line ŷ_i = m x_i + c is plotted against the data points in Figure 6.12. +

Table 6.2 Data for Example 6.6 (net monthly income X, monthly food expenditure Y, least-squares regression estimate Ŷ and error terms ε). The income values are: 505, 517, 523, 608, 609, 805, 974, 1095, 1110, 1139, 1352, 1453, 1461, 1543, 1581, 1656, 1748, 1760, 1811, 1944, 1998, 2054, 2158, 2229, 2319, 2371, 2637, 2843, 2889, 3096.

Figure 6.12 The line ŷ_i = m x_i + c plotted against the data points from Table 6.2.

Figure 6.13 Distribution of the error terms.
The error terms ε_i = y_i − ŷ_i are shown in Figure 6.13. A distribution fit of these ε_i values shows that they are approximately normally distributed. A test of significance of r also shows that, for 28 degrees of freedom (n − 2), there is only a vanishingly small chance that such a high value of r could have been observed from purely random data. We would therefore feel confident in modelling the relationship between any net monthly income value N (within the observed range of incomes) and monthly expenditure on food F using

F = Normal(mN + c, S_yx)

Uncertainty about least-squares regression parameters

The parameters m, c and S_yx for the least-squares regression represent the best estimate of the variability model, where we are assuming some stochastically linear relationship between x and y. However, since we will have only a limited number of observations (i.e. {x, y} pairs), we do not have perfect knowledge of the stochastic system, and there is therefore some uncertainty about the regression parameters. The t-test tells us whether the linear relationship might exist at some level of confidence. More useful, however, from a risk analysis perspective is that we can readily determine distributions of uncertainty about these parameters using the bootstrap.

6.3.10 Rank order correlation coefficient

Spearman's rank order correlation coefficient ρ is a non-parametric statistic for quantifying the correlation relationship between two variables. Non-parametric means that the correlation statistic is not affected by the type of mathematical relationship between the variables, unlike linear least-squares regression analysis, for example, which requires the relationship to be described by a straight line with normally distributed variation of the dependent variable about that line. Calculating the rank order correlation proceeds as follows.
Replace the n observed values for the two variables X and Y by their rankings: the largest value for each variable has a rank of 1, the smallest a rank of n, or vice versa. The Excel function RANK() can do this, but it is inaccurate where there are ties, i.e. where two or more observations have the same value. In such cases, one should assign to each of the same-valued observations the average of the ranks they would have had if they had been infinitesimally different from the value they take. The Spearman rank order correlation coefficient ρ is then calculated as

ρ = 1 − 6 Σ (ui − vi)² / (n(n² − 1))

where ui and vi are the ranks of the ith pair of the X and Y variables. This is, in fact, a shortcut formula: it is not exact when there are tied measurements, but still works well when there are not too many ties relative to the size of n. The exact formula is the Pearson correlation coefficient applied to the ranks:

ρ = Σ (ui − ū)(vi − v̄) / √( Σ (ui − ū)² Σ (vi − v̄)² )

where ū and v̄ are the mean ranks, and ui and vi are the ranks of the ith observation in samples 1 and 2 respectively. This calculation does not require that one identify which variable is dependent and which is independent: the calculation for ρ is symmetric, so X and Y could swap places with no effect on its value. The value of ρ varies from −1 to 1 in the same way as the least-squares correlation coefficient r; a value of ρ close to ±1 indicates a strong monotonic relationship between the two variables.

The mode

The mode is the x value with the greatest probability p(x) for a discrete distribution, or the greatest probability density f(x) for a continuous distribution. The mode is not uniquely defined for a discrete distribution with two or more values that have the equal highest probability. For example, a distribution of the number of heads in three tosses of a coin gives equal probability to both one and two heads. The mode may also not be uniquely defined if a distribution is multimodal (i.e. it has two or more peaks).

The median x0.5

The median is the value that the variable has a 50 % probability of exceeding, i.e. F(x0.5) = 0.5.
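The tie-handling ranking rule and the shortcut formula described above can be sketched as follows. This is an illustrative Python version (the function names are mine), not the Excel RANK() approach the book discusses:

```python
def average_ranks(values):
    """Rank values (largest = 1), giving tied observations the average of
    the ranks they would otherwise occupy."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values starting at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based rank positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Shortcut formula: exact when there are no ties, a good approximation
    when ties are few relative to n."""
    n = len(xs)
    u, v = average_ranks(xs), average_ranks(ys)
    d2 = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

For the tied dataset {10, 20, 20, 5}, for example, the two 20s share the average of ranks 1 and 2, i.e. 1.5 each.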
An interesting property of unimodal probability distributions relates the relative positions of the mean, mode and median. If the distribution is right (positively) skewed, these three measures of central tendency are positioned from left to right: mode, median and mean (see Figure 6.14). Conversely, a unimodal left (negatively) skewed distribution has them in the reverse order. For a unimodal, symmetric distribution, the mode, median and mean are all equal.

6.4.2 Measures of spread

Variance V

The variance is a measure of how much the distribution is spread about the mean:

V = E[(X − μ)²]

where E[] denotes the expected value (mean) of whatever is in the brackets, so V = E[X²] − μ².

Figure 6.14 Relative positions of the mode, median and mean of a right-skewed unimodal distribution (50 % of the distribution lies either side of the median).

Thus, the variance sums up the squared distance from the mean of all possible values of x, weighted by the probability of x occurring. The variance is known as the second moment about the mean. It has units that are the square of the units of x. So, if x is cows in a random field, V has units of cows². This limits the intuitive value of the variance.

Standard deviation σ

The standard deviation is the positive square root of the variance, i.e. σ = √V. Thus, if the variance has units of cows², the standard deviation has units of cows, the same as the variable x. The standard deviation is therefore more popularly used to express a measure of spread.

Example 6.8

The variance V of the Uniform(1, 3) distribution is calculated as follows:

V = E(X²) − μ² = 13/3 − 2² = 1/3

since μ = 2 from before, and the standard deviation σ is therefore √(1/3) ≈ 0.577.

Variance and standard deviation have the following properties, where a is some constant and X and Xi are random variables:

1. V(X) ≥ 0 and σ(X) ≥ 0.
2. V(aX) = a²V(X) and σ(aX) = |a|σ(X).
3. V(Σⁿi=1 Xi) = Σⁿi=1 V(Xi), providing the Xi are uncorrelated.
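Example 6.8 and the scaling property V(aX) = a²V(X) can be checked numerically. A small Python sketch (the helper is my own), computing the Uniform(a, b) moments from E[X] and E[X²]:

```python
def uniform_moments(a, b):
    """Mean and variance of Uniform(a, b):
    E[X] = (a+b)/2, E[X^2] = (a^2 + ab + b^2)/3, V = E[X^2] - E[X]^2."""
    mean = (a + b) / 2
    ex2 = (a * a + a * b + b * b) / 3
    return mean, ex2 - mean * mean
```

Uniform(5, 15) is 5 × Uniform(1, 3), and its variance is indeed 25 times larger (100/12 against 1/3), consistent with property 2.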
6.4.3 Mean, standard deviation and the normal distribution

For a normal distribution only, the areas bounded by 1, 2 and 3 standard deviations either side of the mean contain approximately 68.27, 95.45 and 99.73 % of the distribution, as shown in Figure 6.15. Since a lot of distributions look similar to a normal distribution under certain conditions, people often think of 70 % of a distribution being reasonably contained within one standard deviation either side of the mean, but this rule of thumb must be used with care. If it is applied to a distribution that is significantly non-normal, like an exponential distribution, the error can be quite large (the range μ ± σ contains about 87 % of an exponential distribution, for example).

Figure 6.15 Some probability areas of the normal distribution.

Example 6.9

Panes of bullet-proof glass manufactured at a factory have a mean thickness over a pane that is normally distributed, with a mean of 25 mm and a variance of 0.04 mm². If 10 panes are purchased, what is the probability that all the panes will have a mean thickness between 24.8 and 25.4 mm?

The distribution of the mean thickness of a randomly selected pane is Normal(25, 0.2) mm, since the standard deviation is the square root of the variance; 24.8 mm is one standard deviation below the mean and 25.4 mm is two standard deviations above the mean. The probability p that a pane lies between 24.8 and 25.4 mm is then half the probability of lying within ± one standard deviation of the mean plus half the probability of lying within ± two standard deviations of the mean, i.e. p = (68.27 % + 95.45 %)/2 = 81.86 %. The probability that all 10 panes will have a mean thickness between 24.8 and 25.4 mm, provided that they are independent of each other, is therefore (81.86 %)¹⁰ = 13.51 %.

6.4.4 Measures of shape

The mean and variance are called the first moment about zero and the second moment about the mean.
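The arithmetic in Example 6.9 can be reproduced from the standard normal cumulative distribution function. A short Python check (norm_cdf is my own helper, built on the error function):

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Example 6.9: thickness ~ Normal(25, 0.2) mm; each pane must fall in
# (24.8, 25.4) mm, i.e. between -1 and +2 standard deviations of the mean.
p_one = norm_cdf(2) - norm_cdf(-1)   # ~0.8186
p_all_ten = p_one ** 10              # ~0.1351, assuming independent panes
```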
The third and fourth moments about the mean, called skewness and kurtosis respectively, are also occasionally used in risk analysis.

Skewness S

The skewness statistic is calculated from the following formulae:

Discrete variable: S = Σ (x − μ)³ p(x) / σ³, summed over all x from min to max

Continuous variable: S = ∫ (x − μ)³ f(x) dx / σ³, integrated from min to max

This is often called the standardised skewness, as it is divided by σ³ to give a unitless statistic. The skewness statistic refers to the lopsidedness of the distribution (see left-hand panel of Figure 6.16). If a distribution has a negative skewness (sometimes described as left skewed), it has a longer tail to the left than to the right. A positively skewed distribution (right skewed) has a longer tail to the right, and zero-skewed distributions are usually symmetric.

Figure 6.16 Examples of skewness (left-hand panel) and kurtosis (right-hand panel).

Kurtosis K

The kurtosis statistic is calculated from the following formulae:

Discrete variable: K = Σ (x − μ)⁴ p(x) / σ⁴, summed over all x from min to max

Continuous variable: K = ∫ (x − μ)⁴ f(x) dx / σ⁴, integrated from min to max

This is often called the standardised kurtosis, since it is divided by σ⁴, again to give a unitless statistic. The kurtosis statistic refers to the peakedness of the distribution (see right-hand panel of Figure 6.16) - the higher the kurtosis, the more peaked is the distribution. A normal distribution has a kurtosis of 3, so kurtosis values for a distribution are often compared with 3. For example, if a distribution has a kurtosis below 3 it is flatter than a normal distribution. Table 6.3 gives some examples of skewness and kurtosis for common distributions.

6.4.5 Raw and central moments

There are three sets of moments that are used in probability modelling to describe a distribution of a random variable x with density function f(x). The first set are called raw moments μ′k. The kth raw moment is defined as

μ′k = E[xᵏ] = ∫ xᵏ f(x) dx, integrated from min to max

Table 6.3 Skewness and kurtosis.
Distribution      Skewness                         Kurtosis
Binomial(n, p)    (1 − 2p)/√(np(1 − p))            3 + (1 − 6p(1 − p))/(np(1 − p))
ChiSq(ν)          √(8/ν)                           3 + 12/ν
Exponential       2                                9
Lognormal         (e^(s²) + 2)√(e^(s²) − 1)        e^(4s²) + 2e^(3s²) + 3e^(2s²) − 3
Normal            0                                3
Poisson(λ)        1/√λ                             3 + 1/λ
Triangular        depends on parameters            2.4
Uniform           0                                1.8

(s in the Lognormal rows is the standard deviation of ln x.)

where k = 1, 2, 3, . . ., or, for discrete variables with probability mass p(x), as

μ′k = E[xᵏ] = Σ xᵏ p(x), summed over all x from min to max

Then we have the central moments μk, defined as

μk = E[(X − μ)ᵏ] = ∫ (x − μ)ᵏ f(x) dx, k = 2, 3, . . .

where μ = μ′1 is the mean of the distribution. Finally, we have the normalised moments:

Mean = μ
Variance = μ2
Skewness = μ3 / (Variance)^(3/2)
Kurtosis = μ4 / (Variance)²

The normalised moments are what appear most often in this book because they allow us to compare distributions most easily. One can translate between raw and central moments as follows:

From raw moments to central moments:
μ2 = μ′2 − μ²
μ3 = μ′3 − 3μμ′2 + 2μ³
μ4 = μ′4 − 4μμ′3 + 6μ²μ′2 − 3μ⁴

From central moments to raw moments:
μ′2 = μ2 + μ²
μ′3 = μ3 + 3μμ2 + μ³
μ′4 = μ4 + 4μμ3 + 6μ²μ2 + μ⁴

You might wonder why we don't always use normalised moments and avoid any confusion. Central moments don't actually have much use in risk analysis - they are more of an intermediary calculation step - but raw moments are very useful. First of all, the equations are simpler and therefore sometimes easier to calculate than central moments, and we can then convert them to central moments using the equations above. Secondly, they allow us to determine the moments of some combinations of random variables. For example, consider a variable Y that has probability p of taking a value from variable A and a probability (1 − p) of taking a value from variable B: the raw moments combine linearly, i.e. E[Yᵏ] = pE[Aᵏ] + (1 − p)E[Bᵏ].

You may also come across something called a moment generating function. This is a function MX(t) specific to each distribution and defined as

MX(t) = E[e^(tX)]

where t is a dummy variable. This leads to the relationship with raw moments:

μ′k = dᵏMX(t)/dtᵏ evaluated at t = 0

For example, the Normal(μ, σ) distribution has MX(t) = exp(μt + σ²t²/2), from which we get μ′1 = μ and μ′2 = σ² + μ². The great thing about moment generating functions is that we can use them with the sums of random variables.
For example, if Y = rA + sB, where A and B are random variables and r and s are constants, then

MY(t) = MA(rt) MB(st)

provided A and B are independent. Note that, for a few distributions, not all moments are defined. The calculation of the moments of the Cauchy distribution, for example, is the difference between two integrals that give infinite values. More commonly, a few distributions don't have defined moments unless their parameters exceed a certain value. Appendix III lists these distributions and the restrictions.

Chapter 7 Building and running a model

In this chapter I give a few tips on how to build a risk analysis model and techniques for making it run faster - very useful if your model is either very large or needs to be run for many iterations. I also explain the most common errors people make in their modelling.

7.1 Model Design and Scope

Risk analysis is about supporting decisions by answering questions about risk. We attempt to provide qualitative and, where time and knowledge permit, quantitative information to decision-makers that is pertinent to their questions. Inevitably, decision-makers must deal with other factors that may not be quantified in a risk analysis, which can be frustrating for a risk analyst when they see their work being "ignored". Don't let it frustrate you: the best risk analysts remain professionally neutral about the decisions that are made from their work. Our job is to make sure that we have represented the current knowledge and how that affects the variables on which decisions are made. Remaining neutral also relieves you of being frustrated by a lack of available data or adequate opinion - you just have to work with what you have. The first step to designing a good model is to put yourself in the position of the decision-maker by understanding how the information you might provide connects to the questions they are asking.
A decision-maker often does not appreciate all that comes with asking a question in a certain way, and may not initially have worked out all the possible options for handling the risk (or opportunity). When you believe that you properly understand the risk question or questions that need answering, it is time to brainstorm with colleagues, stakeholders and the managers about how you might put an analysis together that satisfies the managers' needs. Effort put into this stage pays back tenfold: everyone is clear on the purpose of your analysis; the participants will be more cooperative in providing information and estimates; and you can discuss the feasibility of any risk analysis approach. Consider going through the quality check methods I described in Chapter 3. I recommend you think of mapping out your ideas with Venn diagrams and event trees. Then look at the data (and perhaps expertise for subjective estimates) you believe are available to populate the model. If there are data gaps (there usually are), consider whether you will be able to get the necessary data to fill the gaps, and quickly enough to be able to produce an analysis within the decision-maker's timeframe. If the answer is "no", look for other ways to produce an analysis that will meet the decision-maker's needs, or perhaps a subset of those needs. But, whatever you do, don't embark on a risk analysis where you know that data gaps will remain and your decision-maker will be left with no useful support. Some scientists argue that risk analysis can also be for research purposes - to determine where the data gaps lie. We see the value in that determination, of course, but, if that is your purpose, state it clearly and don't leave the managers with any expectation that will be unfulfilled.

7.2 Building Models that are Easy to Check and Modify

The better a model is explained and the better it is laid out, the easier it is to check.
Model building is an iterative process, which means that you should construct your model to make it easy to add, remove and modify elements. A few basic rules will help you do this:

- Dedicate one sheet of the workbook to recording the history of changes to the model since conception, with emphasis on changes since the previous version.
- Document the model logic, data sources, etc., during the model build. It may seem tedious, especially for the parts you end up discarding, but writing down what you do as you go along ensures the documentation does get done (otherwise we move on to the next problem, the model remains a black box to others, etc.) and also gives you a great self-check on your approach.
- Avoid really long formulae if possible, unless it is a formula you use very often. It might be rather satisfying to condense some complex logic into a single cell, but it will be very hard for someone else to figure out what you did.
- Avoid writing macros that rely on model elements being at specific locations in the workbook or in other files. Add plenty of annotations to macros. Don't put model parameter values in the macro code. Give each macro and input parameter a sensible name.
- Avoid being geeky. I'm reviewing a spreadsheet model right now written 10 years ago by a guy who is no longer around. It is almost completely written in macros, with almost no annotation, but worst of all is that he wrote the model to allow it automatically to expand to accommodate more assets, though there was no such requirement. He created dozens of macros to do simple things, like searching a table, that would normally be done with a VLOOKUP or OFFSET function, and placing everything in macros linked to other macros means one cannot use Excel's audit tools like Trace Precedents. It also takes maybe 100 times longer to run than it should.
- Break down a complex section into its constituent parts.
This may best be done in a separate area of the model and the result placed into a summary area. Hit the F9 key (or whatever will generate another scenario) to see that the constituent parts are all working well. Often, in developing ModelRisk functions, we have built spreadsheet models to replicate the logic and have found that doing so can give us ideas for improvements too.

- Use a single formula for an array (e.g. a column) so that only one cell need be changed and the formula copied across the rest of the array.
- Keep linking between sheets to a minimum. For example, if you need to do a calculation on a dataset residing in one sheet, do it in that sheet, then link the calculation to wherever it needs to be used. This saves huge formulae that are difficult to follow, like:
=VoseCumulA('Capital required'!G25,'Capital required'!G26,'Capital required'!G28:G106,'Capital required'!H28:H106)
- Create conditional formatting and alerts that tell you when impossible or irrelevant values occur in the model. ModelRisk functions have a lot of imbedded checks so that, for example, VoseNormal(0, -1) will return the text "Error: sigma must be >= 0" rather than Excel's rather unhelpful #VALUE! approach. If you write macros, include similarly meaningful error messages.
- Use the Data/Validation tool in Excel to format cells so that another user cannot input inappropriate values into the model - for example, they cannot input a non-integer value for an integer variable.
- Use the Excel Tools/Protection/Protect Sheet function together with the Tools/Protection/Allow Users to Edit Ranges function to ensure other users can only modify input parameters (not calculation cells).
- In general, keep the number of unique formulae as small as possible - we often write columns containing the same formulae repeatedly with just the references changing.
If you do need to write a different formula in certain cells of an array (usually at the beginning or end), consider giving those cells a different format (we tend to use a grey background).

- Colour-code the model elements: we use blue for input data and red for outputs.
- Make good use of range naming. To give a name to a cell or range of contiguous cells, select the cells, click in the name box and type the name you want to use. So, for example, cell A1 might contain the value 22. Giving it the label "Peter" means that typing "=Peter" anywhere else in the sheet will return the value 22. For a lot of probability distributions there are standard conventions for naming the parameters of your model, for example =VoseHypergeo(n, D, M) and =VoseGamma(alpha, beta). So, if you have just one or two of these distributions in your model, using these names (e.g. alpha1, alpha2, etc., for each gamma distribution) actually makes it easier to write the formulae too. Note that a cell or range may have several names, and a cell in a range may have a separate name from the range's name. Don't follow my lead here because, for the purposes of writing models you can read in a book, I've rarely used range names.

7.3 Building Models that are Efficient

A model is most efficient when:

1. It takes the least time to run.
2. It takes the least effort to maintain and requires the least amount of assumptions.
3. It has a small file size (memory and speed issues).
4. It supports the most decision options (see Chapters 3 and 4).

7.3.1 Least time to run

Microsoft are making efforts to speed up Excel, but it has a very heavy visual interface that really can slow things down. I'll look at a few tips for making Excel run faster first, then for making your simulation software run faster and then for making a model that gets the answer faster. Finally, I'll give you some ideas on how to determine whether you can stop the model because you've run enough iterations.
Making Excel run faster

Excel scans for calculations through worksheets in alphabetical order of the worksheet name, and starts at cell A1 in each sheet, scans the row and drops down to the next row. Then it dances around for all the links to other cells until it finds the cells it has to calculate first. It can therefore speed things up if you give names to each sheet that reflect their sequence (e.g. start each sheet with "1. Assumptions", "2. Market projection", "3. ...", etc.), and keep the calculations within a sheet flowing down and across.

- Avoid array functions, as they are slow to calculate, although faster than an equivalent VBA function.
- Use megaformulae (with the above caution), as they run about twice as fast as intermediary calculations, and 10 times as fast as VBA calculations.
- Custom Excel functions run more slowly than built-in functions but speed up model building and model reliability. Be careful with custom functions because they are hard to check through. There are a number of vendors, particularly in the finance field, who sell function libraries.
- Avoid links to external files. Keep the simulation model in one workbook.

Making your simulation software run faster

- Turn off the Update Display feature if your Monte Carlo add-in has that ability. It makes an enormous difference if there are imbedded graphs.
- Use multiple CPUs if your simulation software offers this. It can make a big difference.
- Avoid the VoseCumulA(), VoseDiscrete(), VoseDUniform(), VoseRelative() and VoseHistogram() distributions (or other products' equivalents) with large arrays if possible, as they take much longer to generate values than other distributions.
- Latin hypercube sampling gets to the stable output quicker than Monte Carlo sampling, but the effect gets increasingly quickly lost the more significant distributions there are in the model, particularly if the model is not just adding and/or subtracting distributions.
The two sampling methods take the same time to run, however, so it makes sense to use Latin hypercube sampling for simulation runs.

- Run bootstrap analyses and Bayesian distribution calculations in a separate spreadsheet when you are estimating uncorrelated parameters, fit the results using your simulation software's fitting tool and, if the fit is good, use just the fitted distributions in your simulation model. This does have the disadvantage, however, of being more laborious to maintain when more data become available.
- If you write VBA macros, consider whether they need to be declared as volatile.

Getting the answer faster

As a general rule, it is much better to be able to create a probability model that calculates, rather than simulates, the required probability or probability distribution. Calculation is preferable because the model answer is updated immediately if a parameter value changes (rather than requiring a re-simulation of the model) and, more importantly within this context, it is far more efficient. For example, let's imagine that a machine has 2000 bolts, each of which could shear off within a certain timeframe with a 0.02 % probability. We'll also say that, if a bolt shears off, there is a 0.3 % probability that it will cause some serious injury. What is the probability that at least one injury will occur within the timeframe? How many injuries could there be?
The pure simulation way would be to model the number of bolt shears as

Shears = VoseBinomial(2000, 0.02 %)

and then model the number of injuries as

Injuries = VoseBinomial(Shears, 0.3 %)

Or we could recognise that each bolt has a 0.02 % * 0.3 % chance of causing injury, so

Injuries = VoseBinomial(Bolts, 0.02 % * 0.3 %)

Run a simulation for enough iterations and the fraction of the iterations where Injuries > 0 is the required probability, and collecting the simulated values gives us the required distribution. However, on average we should see 2000 * 0.02 % * 0.3 % = 0.0012 injuries (that's 1 in 833), so your simulation will generate about 830 zeros for every non-zero value; for us to get an accurate description of the result (e.g. have 1000 or so non-zero values), we would have to run the model a long time. A better approach is to calculate the probabilities and construct the required distribution, as in the model shown in Figure 7.1. I have used Excel's BINOMDIST function to calculate the probability of each number of injuries x. You can see the probability of non-zero values is pretty small, hence the need for the y axis in the chart to be shown on a log scale. The beauty of this method is that any change to the parameters immediately produces a new output.

Figure 7.1 Example model determining a risk analysis outcome by calculation. Probability of x injuries:

x    Excel (BINOMDIST)    ModelRisk (VoseBinomialProb)
0    9.988E-01            9.988E-01
1    1.199E-03            1.199E-03
2    7.188E-07            7.188E-07
3    2.872E-10            2.872E-10
4    8.604E-14            8.604E-14
5    0.000E+00            2.061E-17

Formulae table: C8:C13 =BINOMDIST(B8,Bolts,Pshear*Pinjury,FALSE); D8:D13 =VoseBinomialProb(B8,Bolts,Pshear*Pinjury,FALSE).
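The calculated distribution of this example can be reproduced directly from the binomial probability mass function. A short Python sketch (the variable names are mine; this is the pmf that BINOMDIST and VoseBinomialProb evaluate):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# 2000 bolts; each causes an injury with probability 0.02% * 0.3%
n_bolts = 2000
p_injury = 0.0002 * 0.003   # 6e-7 per bolt

probs = [binomial_pmf(k, n_bolts, p_injury) for k in range(6)]
p_at_least_one = 1 - probs[0]
expected_injuries = n_bolts * p_injury   # 0.0012, i.e. about 1 in 833
```

The values agree with the Excel column above, and the pmf at x = 5 is small but non-zero, matching the VoseBinomialProb column rather than BINOMDIST's displayed 0.000E+00.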
I have also shown the same calculation with ModelRisk's VoseBinomialProb function, which shows that the probability that x = 5 is not actually zero (obviously), as BINOMDIST would have us believe - Excel's statistical functions aren't very good. Of course, most of the risk analysis problems we face are not as simple as the example above, but we can nonetheless often find shortcuts. For example, imagine that we believe that the maximum daily wave height around a particular offshore rig follows a Rayleigh(7.45) metres distribution. The deck height (the distance from the water at rest to the underside of the lower deck structure) is 32 metres, and the damage that will be caused, as a fraction f of the value of the rig, is a function of the wave height x above the deck level D following the equation

f = (1 + ((x − D)/1.6)^−0.91)^−0.82 for x > D, and f = 0 otherwise

We would like to know the expected damage cost per year as a fraction of the rig value (this is a typical question, among others, that insurers need answered).

Figure 7.2 Offshore platform damage model showing three methods to estimate expected damage as a fraction of rig value (deck height 32 metres, Rayleigh parameter 7.45):
(a) Pure simulation: for each of 365 days, Max wave height =VoseRayleigh($D$2) and Loss =IF(C6>$D$1,(1+((C6-$D$1)/1.6)^-0.91)^-0.82,0); the output is the sum of the daily losses.
(b) Simulation and calculation: P(wave > deck) =1-VoseRayleighProb($D$1,$D$2,1), about 9.86E-05; size of wave given > deck =VoseRayleigh($D$2,,VoseXBounds($D$1,)); expected damage over year = 365 * P(wave > deck) * resultant damage (output = mean).
(c) Calculation only: expected fractional loss per day =VoseIntegrate("VoseRayleighProb(#,D2,0)*(1+((#-D1)/1.6)^-0.91)^-0.82",D1,200,10), about 0.0000471; expected fractional loss over the year = 365 times that, about 0.0172.
We could determine this by (a) pure simulation, (b) a combination of calculation and simulation or (c) pure calculation, as shown in the model of Figure 7.2. The pure simulation model is simple enough: the maximum wave height is simulated for each day, and then the resultant damage is simulated by writing an IF statement for when the wave height exceeds the deck height. The model has the advantage of being easy to follow, but the probability of damage is low, so it needs to run a long time. You also need an accurate algorithm for simulating a Rayleigh distribution. The simulation-and-calculation model calculates the probability that a wave will exceed the deck height in cell D374 (about one in 10 000). ModelRisk has equivalent probability functions for all its distributions, whereas other Monte Carlo add-ins tend to focus only on generating random numbers, but Appendix III gives the relevant formulae so you can replicate this. Cell D375 generates a Rayleigh(7.45) distribution truncated to have a minimum equal to the deck height, i.e. we are only simulating those waves that would cause any damage. I've used the ModelRisk generating function, but @RISK, Crystal Ball and some other simulation tools offer distribution truncation. Cell D376 then calculates the damage fraction for the generated wave height. Finally, cell D377 multiplies the probability that a wave will exceed the deck height by the damage it would then do and by 365 for the days in the year. Running a simulation and taking the mean (=RiskMean(D377) in @RISK, =CB.GetForeStatFN(D377,2) in Crystal Ball) will give us the required answer. This version of the model is still pretty easy to understand, but has 1/365 of the simulation load and only simulates the 1 in 10 000 scenario where a wave hits the deck, so it achieves the same accuracy in about 1/3 650 000th of the iterations of the first model.
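Method (b) - simulating only the waves that overtop the deck - can be sketched in a few lines. This Python version is my own illustration (assuming the Rayleigh wave-height model and damage equation above); it samples the truncated Rayleigh by inverting its conditional survival function:

```python
import math
import random

SIGMA = 7.45   # Rayleigh parameter for maximum daily wave height (metres)
DECK = 32.0    # deck height (metres)

def wave_given_exceeds_deck(rng):
    """Inverse-CDF sample from Rayleigh(SIGMA) truncated to x > DECK:
    if U ~ Uniform(0,1], then x = sqrt(DECK^2 - 2*SIGMA^2*ln(U))."""
    u = 1.0 - rng.random()   # in (0, 1]
    return math.sqrt(DECK ** 2 - 2 * SIGMA ** 2 * math.log(u))

def damage_fraction(x):
    """Damage as a fraction of rig value for a wave of height x."""
    return (1 + ((x - DECK) / 1.6) ** -0.91) ** -0.82 if x > DECK else 0.0

rng = random.Random(42)
p_exceed = math.exp(-DECK ** 2 / (2 * SIGMA ** 2))   # ~9.86e-5, one in 10 000
mean_damage = sum(damage_fraction(wave_given_exceeds_deck(rng))
                  for _ in range(100_000)) / 100_000
annual_loss = 365 * p_exceed * mean_damage            # ~0.0172
```

With 100 000 conditional samples the estimate lands close to the 0.0172 annual figure quoted for the model, at a tiny fraction of the pure-simulation workload.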
The third model performs the integral in cell D380:

expected fractional loss per day = ∫ from D to ∞ of (1 + ((x − D)/1.6)^−0.91)^−0.82 f(x) dx

where f(x) is the Rayleigh(7.45) density function and D is the deck height. This is summing up the damage fraction for each possible wave height x, weighted by x's probability of occurrence. The VoseIntegrate function in ModelRisk performs one-dimensional integration over the variable "#" using a sophisticated error-minimisation algorithm that gives very accurate answers with a short computation time (it took about 0.01 seconds in this model, for example). Mathematical software like Mathematica and Maple will also perform such integrals. The advantage of this approach is that the results are instantaneous and very accurate (to 15 significant figures!), but the downside is that you need to know what you are doing in probability modelling (plus you need a fancier tool such as ModelRisk, Maple, etc.). ModelRisk helps out with the explanation and checking by displaying a plot of the function and the integrated area when you click the Vf (View Function) icon. Note that for numerical integration you have to pick a high value for the upper integration limit in place of infinity, but a quick look at the Rayleigh(7.45) shows that its probability of being above 200 is so small that it's outside a computer's floating-point ability to display it anyway.

In summary, calculation is fast and more accurate (true, with simulation you can improve accuracy by running the model longer, but there's a limit) and simulation is slow. On the other hand, simulation is easier to understand and check than calculation. I often use the phrase "calculate when you can, simulate when you can't", and when you "can't" is as much a function of the expertise level of the reviewers as it is of the modeller.
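The pure-calculation method can be replicated with any numerical integrator; the specialised VoseIntegrate function is not essential. A plain trapezoidal-rule Python sketch (my own, using 200 as the stand-in for infinity, as in the spreadsheet formula):

```python
import math

SIGMA = 7.45   # Rayleigh parameter for maximum daily wave height (metres)
DECK = 32.0    # deck height (metres)

def rayleigh_pdf(x, sigma=SIGMA):
    """Rayleigh density: (x/sigma^2) * exp(-x^2 / (2*sigma^2))."""
    return (x / sigma ** 2) * math.exp(-x * x / (2 * sigma ** 2))

def damage_fraction(x, deck=DECK):
    """Fraction of rig value lost for a wave of height x (0 below the deck)."""
    if x <= deck:
        return 0.0
    return (1 + ((x - deck) / 1.6) ** -0.91) ** -0.82

def expected_daily_loss(upper=200.0, steps=200_000):
    """Trapezoidal integration of damage(x) * f(x) from the deck height up to
    a cutoff standing in for infinity (P(X > 200) is negligibly small)."""
    h = (upper - DECK) / steps
    total = 0.0
    for i in range(steps + 1):
        x = DECK + i * h
        w = 0.5 if i in (0, steps) else 1.0
        total += w * damage_fraction(x) * rayleigh_pdf(x)
    return total * h

daily = expected_daily_loss()
annual = daily * 365
```

This reproduces the spreadsheet results of about 0.0000471 expected fractional loss per day and about 0.0172 per year.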
If you really would like to use a calculation method, or want to have a mixed calculation-simulation model, but worry about getting it right, consider writing both versions in parallel and checking that they produce the same answers for a range of different parameter values.

7.3.2 Least effort to maintain

The biggest problem in maintaining a spreadsheet model is usually updating data, so make sure that you keep the data in predictable areas (colour-coding the tabs of each sheet is a nice way). Also, avoid Excel's data analysis features that dump the results of a data analysis as fixed values into a sheet. I think this is dreadful programming. Software like @RISK and Crystal Ball, which fit distributions to data, can be "hot-linked" to a dataset, which is a much better method than just exporting the fitted parameters if you think the dataset may be altered at some point. ModelRisk has a huge range of "hot-linking" fit functions that will return fitted parameters or random numbers for copulas, time series and distributions. You can sometimes replicate the same idea quite easily. For example, to fit a normal distribution one need only determine the mean and standard deviation of the dataset if the data are random samples, so using Excel's AVERAGE and STDEV functions on the dataset will automatically update a distribution fit. Sometimes you need to run Solver, e.g. to use maximum likelihood methods to fit a gamma distribution, so make a macro with a button that will perform that operation (see, for example, Figure 7.3).

Figure 7.3 Spreadsheet with automation to run Solver.
The button runs the following macro, which asks the user for the data array, runs Solver against a temporary sheet containing the likelihood calculation, and finally asks the user where to place the results (cells D3:E3 in this case):

    Private Sub CommandButton1_Click()
        On Error Resume Next
        Dim DataRange As Excel.Range
        Dim n As Long, Mean As Double, Var As Double
        Dim Alpha As Double, Beta As Double
        '---------------- Selecting input data ----------------
    1   Set DataRange = Application.InputBox("Select one-dimensional input data array", _
            "Data", Selection.Address, , , , , 8)
        If DataRange Is Nothing Then Exit Sub
        n = DataRange.Cells.Count
        '------------------- Error messages -------------------
        If n < 2 Then MsgBox "Please enter at least two data values": GoTo 1
        If DataRange.Columns.Count > 1 And DataRange.Rows.Count > 1 Then _
            MsgBox "Selected data is not one-dimensional": GoTo 1
        If Application.WorksheetFunction.Min(DataRange.Value) <= 0 Then _
            MsgBox "Input data must be non-negative": GoTo 1
        Sheets.Add Sheets(1)                    ' adding a temporary sheet
        '----- Pasting input data into the temporary sheet -----
        If DataRange.Columns.Count > 1 Then
            Sheets(1).Range("A1:A" & n).Value = _
                Application.WorksheetFunction.Transpose(DataRange.Value)
        Else
            Sheets(1).Range("A1:A" & n).Value = DataRange.Value
        End If
        Mean = Application.WorksheetFunction.Average(Sheets(1).Range("A1:A" & n)) ' mean of data
        Var = Application.WorksheetFunction.Var(Sheets(1).Range("A1:A" & n))      ' variance of data
        Alpha = Mean ^ 2 / Var                  ' best-guess estimate for Alpha
        Beta = Var / Mean                       ' best-guess estimate for Beta
        '------- Setting initial values for the Solver -------
        Sheets(1).Range("D1").Value = Alpha
        Sheets(1).Range("E1").Value = Beta
        '-------- Setting the LogLikelihood function --------
        Sheets(1).Range("B1:B" & n).Formula = "=LOG10(GAMMADIST(A1,$D$1,$E$1,0))"
        '--------- Setting the objective function ---------
        Sheets(1).Range("G1").Formula = "=SUM(B1:B" & n & ")"
        '---------------- Launching the Solver ----------------
        SOLVER.SolverReset
        SOLVER.SolverOk SetCell:=Sheets(1).Range("G1"), MaxMinVal:=1, _
            ByChange:=Sheets(1).Range("D1:E1")
        SolverAdd CellRef:="$D$1", Relation:=3, FormulaText:="0.000000000000001"
        SolverAdd CellRef:="$E$1", Relation:=3, FormulaText:="0.000000000000001"
        SOLVER.SolverSolve UserFinish:=True
        SOLVER.SolverFinish KeepFinal:=1
        '------------- Remembering output values -------------
        Alpha = Sheets(1).Range("D1").Value
        Beta = Sheets(1).Range("E1").Value
        '----------- Deleting the temporary sheet -----------
        Application.DisplayAlerts = False
        Sheets(1).Delete
        Application.DisplayAlerts = True
        '------------ Selecting output location ------------
    2   Set DataRange = Application.InputBox("Select 2x1 output location", "Output", _
            Selection.Address, , , , , 8)
        If DataRange Is Nothing Then Exit Sub
        n = DataRange.Cells.Count
        If n < 2 Then MsgBox "Enter at least two data values": GoTo 2
        '----- Pasting outputs into the selected range -----
        DataRange.Cells(1, 1) = Alpha
        If DataRange.Columns.Count = 2 Then
            DataRange.Cells(1, 2) = Beta
        Else
            DataRange.Cells(2, 1) = Beta
        End If
    End Sub

A minimum limit of 0.000000000000001 is placed on alpha and beta to avoid errors, and LOG10(...) is used around the GAMMADIST(...) functions because a log-likelihood behaves less dramatically and lets Solver find the solution more reliably. The moments-based estimates for alpha (= DataMean^2 / DataVariance) and beta (= DataVariance / DataMean) are used as starting values for Solver so that it will find the answer more quickly.

If a user needs to perform some operations prior to running a model, then write a description of what needs doing and why.
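The moment-matching starting values are easy to reproduce outside Excel. Below is a minimal sketch in Python rather than VBA, purely for brevity; the data values are made up for illustration:

```python
import statistics

# Hypothetical positive data sample, standing in for the user's selected range
data = [2.3, 4.1, 3.7, 5.9, 2.8, 4.4, 3.2, 6.1, 4.9, 3.5]

mean = statistics.mean(data)
var = statistics.variance(data)   # sample variance, like Excel's VAR

# Moment-matching starting values mirroring the macro:
# Alpha = Mean^2 / Var, Beta = Var / Mean
alpha0 = mean ** 2 / var
beta0 = var / mean
print(alpha0, beta0)
```

By construction a Gamma(alpha0, beta0) distribution reproduces the sample mean (alpha0 * beta0) and variance (alpha0 * beta0^2) exactly, which is why it makes a good starting point for the likelihood maximisation.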
These days we attach a help file to the model, which allows us to embed little videos, something that is very helpful; but at the very least try to embed or couple the model to a PDF file with screen captures of each step. In my experience, the other main reason a model can be hard to maintain is that it is complex and uses many different sources of data that go out of date. When you plan out a risk analysis (Chapters 3 and 4) for a model that will be used periodically, or that could take a long time to complete, consider whether there is a simpler model that will give answers that are pretty close in decision terms to those of the more complex model being planned. If the difference in accuracy is small, it may be balanced by the greater applicability that comes with updating the inputs more frequently.

7.3.3 Smallest file size

Megaformulae reduce the file size considerably. Maintaining large datasets in your model will also increase the file size: it is better to do the analysis outside the spreadsheet and copy across the results. Sometimes large datasets or calculation arrays are used to construct distributions (e.g. fitting first- or second-order non-parametric distributions to data, constructing Bayesian posterior distributions and bootstrap analysis). Replacing these calculations with a fitted distribution can have a marked effect on model size and speed. ModelRisk has been designed to maximise speed and minimise memory requirements. It has a large number of functions that will perform complex calculations in a single cell or small array. You might also be able to achieve some of the same effect in your models with VBA code, particularly if you need to perform iterative loops.

7.3.4 How many iterations of a model to run

You will often see risk analysis reports, or papers in journals, that show the results and tell you that they were based on 10 000 (or whatever) Latin hypercube (or whatever) iterations of the model. I suppose that may sometimes be useful to know, but not often.
The author is usually trying to communicate that the model was run long enough for the results to be stable. The problem is that, for one model trying to determine a mean, 500 iterations may be good enough; for another trying to determine a 99.9th percentile, 100 000 iterations might be needed. It also depends on how sensitive the decision question is to the output's accuracy. A frequent question that pops up in our courses is "how many iterations do I need to run?", and you can see there is no absolute answer to that. A short answer, burdened with many caveats, is "no less than 300" if you are interested in the entire output distribution. At 300 iterations you start to get a reasonably well-defined cumulative distribution, so you can approximately read off the 50th and 85th percentiles, for example, and the mean is pretty well determined for most output distributions. At the same time, if you export the generated values from two or more random variables in your model to produce scatter plots, 300 is really the minimum you need to get any sense of the patterns that they produce (i.e. their joint distribution). We usually have our models set to run 3000 iterations as a default (but obviously increase that figure if a particularly high level of accuracy is warranted), because we plot a great many scatter plots from generated data, and this is about the right number of points before the scatter plot gets clogged up, and certainly enough for all the percentiles and statistics to be well specified.

Figure 7.4 Comparison of cumulative distribution plots for 20 model runs each of 3000 and 300 Monte Carlo iterations for a well-behaved output (i.e. a nice smooth curve).

Figure 7.4 shows what type of variation you would typically get for a cumulative distribution between runs of 300 iterations and of 3000 iterations.
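The effect shown in Figure 7.4 is easy to reproduce: run the same model many times at each iteration count and compare how much an estimated percentile wanders between runs. A rough Python sketch, where the "model" is just an arbitrary skewed output chosen for illustration:

```python
import random
import statistics

def percentile_85(n_iter, seed):
    """Run a toy Monte Carlo model for n_iter iterations; return its 85th percentile."""
    rng = random.Random(seed)
    values = sorted(rng.lognormvariate(3, 0.5) for _ in range(n_iter))
    return values[int(0.85 * n_iter)]

# 20 independent runs at each iteration count, as in Figure 7.4
runs_300 = [percentile_85(300, seed) for seed in range(20)]
runs_3000 = [percentile_85(3000, seed + 100) for seed in range(20)]

spread_300 = statistics.stdev(runs_300)
spread_3000 = statistics.stdev(runs_3000)
# The percentile estimate is noticeably more stable at 3000 iterations
print(spread_300, spread_3000)
```

The between-run spread of the percentile estimate shrinks roughly with the square root of the iteration count, which is why the 3000-iteration curves in Figure 7.4 sit so much closer together.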
Since most models include an element of guesswork in the choice of model, distributions or parameter values, one should not usually be too concerned about exact precision in the Monte Carlo results, but you'll see that 300 iterations is probably the lowest level of accuracy you might find acceptable. Figure 7.5 shows the same input and output plotted together as a scatter plot for 300 and 3000 iterations. We find scatter plots to be a great, intuitive presentation of how, among other things, the input variability influences the output value. You'll see that the pattern is just about visible at 300 iterations, and just starting to get clogged up at 3000 iterations (of course, if you run more than 3000 iterations, you can plot a sample of just 3000 of them to keep the scatter plot clear). If the pattern were simpler, the left-hand panel of 300 iterations would of course be clearer.

In general, you'll have two opposing pressures: too few iterations and you get inaccurate outputs and graphs (particularly histogram plots) that look "scruffy"; too many iterations and it takes a long time to simulate, and it may take even longer to plot graphs, export and analyse data, etc., afterwards. Export the data into Excel and you may also come up against row limitations, and limitations on the number of points that can be plotted in a chart.

There will usually be one or more statistics in which you are interested from your model outputs, so it is quite natural to wish to run sufficient iterations to ensure a certain level of accuracy. Typically, that accuracy can be described in the following way: "I need the statistic Z to be accurate to within ±δ with confidence α". I will show you how you can determine the number of iterations you need to run to get some specified level of accuracy for the most common statistics: the mean and cumulative probabilities. The example models let you monitor the level of accuracy in real time.
Note that all these models assume that you are using Monte Carlo sampling. They will therefore somewhat overestimate the number of iterations you'll need if you are using Latin hypercube sampling (which we recommend, in general). That said, in practice Latin hypercube sampling will only offer a useful improvement when a model is linear, or when there are very few distributions in the model.

Figure 7.5 Comparison of scatter plots for model runs of 3000 and 300 Monte Carlo iterations.

Iterations to run to get sufficient accuracy for the mean

Monte Carlo simulation estimates the true mean μ of the output distribution by summing all of the generated values xi and dividing by the number of iterations n:

    x̄ = (1/n) Σ xi

If Monte Carlo sampling is used, each xi is iid (an independent, identically distributed random variable). The central limit theorem then says that the distribution of the estimate of the true mean is (asymptotically) given by

    x̄ = Normal(μ, σ/√n)    (7.1)

where σ is the true standard deviation of the model's output. Using a statistical principle called the pivotal method, we can rearrange this equation to make it an equation for μ:

    μ = Normal(x̄, σ/√n)

Figure 7.6 shows the cumulative form of the normal distribution of Equation (7.1).

Figure 7.6 Cumulative distribution plot for the normal distribution of Equation (7.1).

Specifying the level of confidence α we require for our mean estimate to be within ±δ translates into a relationship between δ, σ and n. More formally, this relationship is

    δ = Φ⁻¹((1 + α)/2) σ/√n    (7.2)

where Φ⁻¹(·) is the inverse of the standard normal cumulative distribution function. Rearranging Equation (7.2) and recognising that we want at least this accuracy gives a minimum value for n:

    n ≥ (Φ⁻¹((1 + α)/2) σ/δ)²    (7.3)

We have one problem left: we don't know the true output standard deviation σ.
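The minimum-n formula just derived is simple enough to evaluate directly. A Python sketch, using the standard library's inverse normal CDF in place of Excel's NORMSINV (the σ value below is the estimate used in the Figure 7.7 example; the text explains next how such an estimate is obtained):

```python
import math
from statistics import NormalDist

def iterations_for_mean(sigma, delta, alpha):
    """Minimum Monte Carlo iterations so the estimated mean is within
    +/- delta of the true mean with confidence alpha."""
    z = NormalDist().inv_cdf((1 + alpha) / 2)   # Excel: NORMSINV((1+alpha)/2)
    return math.ceil((z * sigma / delta) ** 2)

# Output standard deviation estimated at 3.921402, required accuracy
# +/-0.01 about the mean, with 90% confidence
n = iterations_for_mean(3.921402, 0.01, 0.90)
print(n)
```

Note the quadratic cost of precision: halving δ quadruples the number of iterations required.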
It turns out that we can estimate this perfectly well for our purposes by taking the standard deviation of the first few (say 50) iterations. The model in Figure 7.7 shows how you can do this continuously, using Excel's function NORMSINV to return values for Φ⁻¹(·). With an estimated output standard deviation of 3.921402, a required accuracy of ±0.01 about the mean and a confidence of 90 %, the model calculates that some 416 042 iterations remain to be run. The key cell formulae (delta in E3, alpha in H3) are:

Crystal Ball:
    E6:  =CB.GetForeStatFN(D2,5)
    D7:  =IF((E6*NORMSINV((1+H3)/2)/E3)^2-CB.IterationsFN()>0,
         ROUNDUP((E6*NORMSINV((1+H3)/2)/E3)^2-CB.IterationsFN(),0),
         "Sufficient accuracy achieved")

@RISK:
    E10: =RiskStdDev(D2)
    D11: =IF((E10*NORMSINV((1+H3)/2)/E3)^2-RiskCurrentIter()>0,
         ROUNDUP((E10*NORMSINV((1+H3)/2)/E3)^2-RiskCurrentIter(),0),
         "Sufficient accuracy achieved")

Figure 7.7 Models in @RISK and Crystal Ball to monitor whether the simulation mean has reached a required accuracy.

If you name cell D7 or D11 as an output, together with any other model outputs you are actually interested in, and select the "Pause on Error in Outputs" option in your host Monte Carlo add-in, it will automatically stop simulating when the required accuracy is achieved, because the cell returns the text "Sufficient accuracy achieved" instead of a number.

Iterations to run to get sufficient accuracy for the cumulative probability F(x) associated with a particular value x

Percentiles closer to the 50th percentile of an output distribution will reach a stable value far more quickly than percentiles towards the tails.
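The analogous minimum-n calculation for a cumulative probability F(x), derived in what follows, replaces σ with √(p̂(1−p̂)). A Python sketch of the resulting formula; the confidence level and tolerances are illustrative choices:

```python
import math
from statistics import NormalDist

def iterations_for_percentile(p_hat, delta, alpha):
    """Minimum iterations so the estimated cumulative probability F(x)
    is within +/- delta of the truth with confidence alpha."""
    z = NormalDist().inv_cdf((1 + alpha) / 2)
    return math.ceil(p_hat * (1 - p_hat) * (z / delta) ** 2)

# F(x) near the 80th percentile, to within +/-0.01 with 95% confidence
print(iterations_for_percentile(0.80, 0.01, 0.95))

# Pinning down a 99.9th percentile needs a much tighter delta to be
# meaningful, and hence far more iterations
print(iterations_for_percentile(0.999, 0.0001, 0.95))
```

Although p̂(1−p̂) is small in the tails, the tolerance δ must shrink with it, which is why tail percentiles dominate the iteration budget.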
On the other hand, we are often most interested in what is going on in the tails, because that is where the risks and opportunities lie. For example, Basel II and credit rating agencies often require that the 99.9th percentile or greater be accurately determined. The following technique shows how you can ensure that you have the required level of accuracy for the percentile associated with a particular value.

Your Monte Carlo add-in will estimate the cumulative percentile F(x) of the output distribution associated with a value x by determining what fraction of the iterations fell at or below x. Imagine that x is actually the 80th percentile of the true output distribution. Then, for Monte Carlo simulation, the generated value in each iteration independently has an 80 % probability of falling below x: it is a binomial process with probability p = 80 %. Thus, if so far we have had n iterations and s have fallen at or below x, the distribution Beta(s + 1, n − s + 1) describes the uncertainty associated with the true cumulative percentile we should associate with x (see Section 8.2.3). When we are estimating a percentile close to the median of the distribution, or when we are performing a large number of iterations, s and n will both be large, and we can use a normal approximation to the beta distribution:

    Beta(s + 1, n − s + 1) ≈ Normal(p̂, √(p̂(1 − p̂)/n))

where p̂ = s/n is the best-guess estimate for F(x). Thus, we can produce a relationship similar to that in Equation (7.2) for determining the number of iterations needed to get the required precision:

    δ = Φ⁻¹((1 + α)/2) √(p̂(1 − p̂)/n)    (7.4)

Rearranging Equation (7.4) and recognising that we want at least this accuracy gives a minimum value for n:

    n ≥ p̂(1 − p̂) (Φ⁻¹((1 + α)/2)/δ)²

A model can now be written in a very similar fashion to Figure 7.7.

7.4 Most Common Modelling Errors

This section describes, and provides examples for, the three most common mistakes we come across in auditing risk models, even at the more elementary level.
These mistakes probably constitute around 90 % of the errors we see. I strongly recommend studying them and going through the examples thoroughly:

Common error 1. Calculating means instead of simulating scenarios.
Common error 2. Representing an uncertain variable more than once in a model.
Common error 3. Manipulating probability distributions as if they were fixed numbers.

Common error 1: calculating means instead of simulating scenarios

When we first start thinking about risk, it is quite natural to want to convert the impact of a risk into a single number. For example, we might consider that there is a 20 % chance of losing a contract, which would result in a loss of income of $100 000. Put together, a person might reason that to be a risk of some $20 000 (i.e. 20 % * $100 000). This $20 000 figure is known as the "expected value" of the variable. It is the probability-weighted average of all possible outcomes. So, with the two outcomes $100 000 at 20 % probability and $0 at 80 % probability:

    Mean risk (expected value) = 0.2 * $100 000 + 0.8 * $0 = $20 000

Calculating the expected values of risks might also seem a reasonable and simple method for comparing risks. For example, in Table 7.1, risks A to J are ranked in descending order of expected cost.

Table 7.1 A list of probabilities and impacts for 10 risks. (Risk A: probability 0.25, impact $400 000, expected impact $100 000; total expected impact across all 10 risks: $367 000.)

If a loss of $500 000 or more would ruin your company, you may well rank the risks differently: risks C, D, I and, to a lesser extent, J pose a survival threat to your company. Note also that you may value the impact of risk D as no more severe than that of risk C because, if either of them occurs, your company has gone bust. On the other hand, if risk A occurs, giving you a loss of $400k, you are precariously close to ruin: it would take just one more of the risks, except F and H (unless they both occurred), to push you under.
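The contrast between a single expected-value figure and the chance of ruin is easy to demonstrate by simulation. In the sketch below the probabilities and impacts are invented stand-ins (only risk A's figures match Table 7.1; the rest are chosen so the total expected impact comes out in the same region):

```python
import random

# Illustrative risks (probability, impact in $000). Risk A matches Table 7.1;
# the other nine are invented for this example.
risks = [(0.25, 400), (0.10, 500), (0.05, 600), (0.30, 100),
         (0.50, 50), (0.20, 150), (0.10, 300), (0.40, 80),
         (0.05, 550), (0.15, 200)]

expected_total = sum(p * impact for p, impact in risks)

rng = random.Random(42)
n_iter = 20000
ruin = 0
for _ in range(n_iter):
    # One scenario: each risk independently occurs or not
    total = sum(impact for p, impact in risks if rng.random() < p)
    if total >= 500:            # the $500k ruin threshold from the text
        ruin += 1

print(expected_total)           # the single "expected impact" number
print(ruin / n_iter)            # the probability of a ruinous year
```

The expected total is well below the ruin threshold, yet the simulation shows a substantial probability of crossing it: exactly the information the expected-value summary throws away.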
Looking at the sum of the expected values gives you no appreciation of how close you are to ruin. Figure 7.8 plots the distribution of possible outcomes for this set of risks.

Figure 7.8 Probability distribution of total impact from risks A to J.

From a risk analysis point of view, by representing the impact of a risk by its expected value we have removed the uncertainty (i.e. we can't see the breadth of different outcomes), which is a fundamental reason for doing risk analysis in the first place. You might think that people running Monte Carlo simulations would be more attuned to describing risks with distributions rather than single values, but this is nonetheless one of the most common errors.

Another, slightly more disguised, example of the same error is when the impact is uncertain. For example, let's imagine that there will be an election this year and that two parties are running: the Socialist Democrats Party and the Democratic Socialists Party. The SDP are currently in power and have vowed to keep the corporate tax rate at 17 % if they win the election. Political analysts reckon they have about a 65 % chance of staying in power. The DSP promise to lower the corporate tax rate by 1-4 %, most probably 3 %. We might choose to express next year's corporate tax rate as

    Rate = 0.35 * VosePERT(13 %, 14 %, 16 %) + 0.65 * 17 %

Checking the formula by simulating, we'd get a probability distribution that could give us some comfort that we've assigned uncertainty properly to this parameter. However, a correct model would draw a value of 17 % with a probability of 0.65 and a random value from the PERT distribution with a probability of 0.35.
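The difference between the two formulations is easy to see in simulation. A sketch using a plain triangular draw as a stand-in for VosePERT (an approximation, but it keeps the point intact):

```python
import random

rng = random.Random(7)
n_iter = 10000

def dsp_rate():
    # Stand-in for VosePERT(13%, 14%, 16%): triangular(low, high, mode)
    return rng.triangular(0.13, 0.16, 0.14)

# Wrong: blending the two outcomes into one weighted-average rate
wrong = [0.35 * dsp_rate() + 0.65 * 0.17 for _ in range(n_iter)]

# Right: draw the scenario first, then the rate within that scenario
right = [0.17 if rng.random() < 0.65 else dsp_rate() for _ in range(n_iter)]

# Both formulations have roughly the same mean...
print(sum(wrong) / n_iter, sum(right) / n_iter)
# ...but the wrong model never produces either real outcome: every value
# falls in a narrow band around 16%, while the right model produces 17%
# about 65% of the time and a 13-16% value otherwise.
print(min(wrong), max(wrong))
```

The means agree, which is exactly why the error is so easy to miss when eyeballing summary statistics rather than the distribution itself.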
Common error 2: representing an uncertain variable more than once in a model

When we develop a large spreadsheet model, perhaps with several linked sheets in the same file, it is often convenient to have parameter values that are used in several sheets appear in each of those sheets. This makes it quicker to write formulae and trace back precedents in a formula. Even in a deterministic model (i.e. a model where there are only best-guess values, not distributions), it is important that there is only one place in the model where a parameter value can be changed (at Vose Consulting we use the convention that all changeable input parameter values or distributions are labelled blue). There are two reasons: first, it is easier to update the model with new parameter values; second, it avoids the potential mistake of changing the parameter value in only some of the cells in which it appears, forgetting the others, and thereby having a model that is internally inconsistent. For example, a model could have a parameter "Cargo (mt)" with a value of 10 000 in sheet 1 and a value of 12 000 in sheet 2. It becomes even more important to maintain this discipline in a Monte Carlo model if that parameter is modelled with a distribution. Although each cell in the model might carry the same probability distribution, left unchecked each distribution will generate a different value for the parameter in the same iteration, thus rendering the generated scenario impossible.
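The inconsistency, and its cure, can be sketched outside a spreadsheet. Two independent draws of the "same" distribution disagree within an iteration, while driving both cells from one shared Uniform(0, 1) draw by inverse transform keeps them identical (this is my assumption about the mechanism behind ModelRisk's U-parameter, described next):

```python
import random
from statistics import NormalDist

rng = random.Random(1)

# Two cells that each contain "Normal(100, 10)" but sample independently:
# in any given iteration they disagree, making the scenario inconsistent.
indep_1 = NormalDist(100, 10).inv_cdf(rng.random())
indep_2 = NormalDist(100, 10).inv_cdf(rng.random())
print(indep_1, indep_2)   # two different values for the "same" parameter

# The fix: both cells invert the same CDF at one shared uniform draw
u = rng.random()
cell_a1 = NormalDist(100, 10).inv_cdf(u)
cell_a2 = NormalDist(100, 10).inv_cdf(u)
print(cell_a1, cell_a2)   # identical in every iteration
```

Because the inverse CDF is a deterministic function, feeding both cells the same u guarantees the same generated value, iteration after iteration.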
If it really is important to you to have the probability distribution formula in each cell where the parameter features (perhaps because you wish to see what distribution equation was used without having to switch to the source sheet), you can make use of the U-parameter in ModelRisk's simulation functions to ensure that the same value is generated in each place:

    Cell A1 := VoseNormal(100, 10, Random1)
    Cell A2 := VoseNormal(100, 10, Random1)

where Random1 is a Uniform(0, 1) distribution placed somewhere in the model. You can achieve the same thing using a 100 % rank order correlation in @RISK or Crystal Ball, for example, but this will only work while the simulation is running, because rank order correlation generates a set of values before a simulation run and orders them; when you look at the model stepping through some scenarios, they won't match.

The error described so far is where the formula for the distribution of a random variable features in more than one cell of a spreadsheet model. These errors are quite easy to spot. Another form of the same error is where two or more distributions incorporate the same random variable in some way. For example, consider the following problem. A company is considering restructuring its operations, with the inevitable layoffs, and wishes to analyse how much it would save in the process. Looking at just the office space component, a consultant estimates that, if the company were to make the maximum number of redundancies and outsource some of its operations, it would save $PERT(1.1, 1.3, 1.6)M of office space costs. On the other hand, by just making the redundancies in the accounting section and outsourcing that activity, it could save $PERT(0.4, 0.5, 0.9)M of office space costs. It would be quite natural, at first sight, to put these two distributions into a model and run a simulation to determine the savings for the two redundancy options. On their own, each cost saving distribution would be valid.
We might also decide to calculate in a spreadsheet cell the difference between the two savings, and here we would potentially be making a big mistake. Why? Well, what if there is an uncertain component that is common to both office cost savings? For example, what if inside these cost distributions there is the cost of getting out of a current lease contract, uncertain because negotiations would need to take place? The problem is that, by sampling from these two distributions independently, we are not recognising the common element, which is a problem if that common element is not a fixed value, because it induces some level of correlation. The takeaway message from this example is: consider whether two or more uncertain parameters in your model share a common element in some way. If they do, you will need to separate out that common element and thereby allow it to appear just once in your model.

Common error 3: manipulating probability distributions as if they were fixed numbers

At school we learn simple arithmetic. Later, when we take algebra, we learn

    A + B = C therefore C − A = B
    D * E = F therefore F / D = E

The problem is that these trusted rules do not apply so universally when manipulating random variables. This section explains how and when these simple algebraic rules no longer work, shows you how to identify such situations in your model, and how to make the appropriate corrections.

An example

Most deterministic spreadsheet models consist of linked formulae that contain nothing more complicated than simple operations like +, −, * and /. When we decide to start adding uncertainty to the values of the components in the model, it seems natural enough simply to replace a fixed value with a probability distribution describing our uncertainty.
So, for example, consider the simple model for a company offering some credit service:

    Money borrowed by a client M:   €10 000
    Number of clients n:            6500
    Interest rate per annum r:      7.5 %
    Yearly revenue:                 M * n * r = €4 875 000

The model can now be "risked":

    Money borrowed by a client M:   Lognormal(€10 000, €4000)
    Number of clients n:            PERT(6638, 6500, 8200)
    Interest rate per annum r:      7.5 %
    Yearly revenue:                 M * n * r

The best-guess estimates of the money borrowed by a client and of the number of clients have been replaced by distributions, but the model is otherwise unchanged. This model is probably very wrong. The error is most easily seen by watching random values being generated on screen. Look at the values that are being used for the entire client base and compare with where these values sit on the Lognormal(10 000, 4000) distribution. For example, the Lognormal(10 000, 4000) distribution has 10 % of its probability below €5670. Thus, in 10 % of its iterations it will generate a value below this figure, and that value will be used for all customers. The lognormal distribution undoubtedly reflects the variability that is expected between customers (perhaps, for example, it was fit to a relevant dataset of amounts individual customers have previously borrowed). The probability that two randomly selected customers will both borrow less than €5670 is 10 % * 10 % = 1 %. The probability that all (say) 6500 customers borrow less than €5670, if the amounts they borrow are independent, is 0.1^6500, i.e. effectively impossible, yet our model gives it a 10 % probability.

In order to model this problem correctly, we need to consider what the sources of uncertainty about the amount a customer borrows are. If the source is specific to each individual client, then the amounts can be considered independent and the techniques of Chapter 11 should be applied.
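You can quantify how wrong the single-draw model is by comparing it with a per-customer simulation. A rough sketch follows; to keep the pure-Python run time reasonable the client base is scaled down to 500, and since random.lognormvariate is parameterised by the underlying normal's μ and σ, those values are back-calculated to give a mean of about 10 000 and a standard deviation of about 4000:

```python
import random
import statistics

rng = random.Random(3)
n_clients = 500         # scaled down from 6500 purely for speed
n_iter = 1000
r = 0.075

# Underlying normal parameters giving mean ~10 000 and sd ~4000
mu, sigma = 9.13613, 0.38525

# Wrong model: one draw reused for every client in an iteration
wrong_rev = [n_clients * r * rng.lognormvariate(mu, sigma) for _ in range(n_iter)]

# Independent-clients model: sum of individual draws
right_rev = [r * sum(rng.lognormvariate(mu, sigma) for _ in range(n_clients))
             for _ in range(n_iter)]

# Similar means, but the single-draw model wildly exaggerates the spread
print(statistics.stdev(wrong_rev), statistics.stdev(right_rev))
```

Both models have the same expected revenue; the single-draw version simply scales one customer's variability up by the whole client base, inflating the spread by roughly √n_clients.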
If there is some systematic influence (like the state of the economy, recent bad press for companies offering credit, etc.), it will have to be separated out from the individual, independent component.

Let's look at another example. The sum of two independent Uniform(0, 1) distributions is... what do you think? The answer often surprises people. It is hard to imagine a simpler problem, yet when we canvass a class we get quite a range of answers. Perhaps a Uniform(0, 2)? That's the most common response. Or something looking a little normal? The answer is a Triangle(0, 1, 2), so we could write

    U(0, 1) + U(0, 1) = T(0, 1, 2)

The first message in this example is that it is difficult for a person not very well versed in risk analysis modelling to predict the results of even the most trivial model. Of course, that makes it very hard to check the model and be comfortable about its results. On to the next question we often pose our class:

    T(0, 1, 2) − U(0, 1) = ?

Now wise to the trickiness of the question, most class participants are pretty sure that their first guess (i.e. = U(0, 1)) is wrong but don't have anything else to suggest. The answer is a symmetric distribution that looks a little normal, stretching from −1 to 2 with a peak at 0.5. But why isn't it U(0, 1)? An easy way to visualise this is to run a simulation adding two Uniform(0, 1) distributions and plotting the generated values from one uniform distribution together with the calculated sum of them both. You get a scatter plot that looks like that in Figure 7.9.

Figure 7.9 A plot of random samples of C against A, where A = U(0, 1), B = U(0, 1) and C = A + B.

The line y = x shows the lowest value of the sum C for any given value of the uniform distribution A, and the line y = 1 + x is the highest value, which makes intuitive sense. The uniform vertical distribution of points between these two lines is the effect of the second Uniform(0, 1) distribution B. Also note that all the generated values lie uniformly (but randomly) between these two lines. This is actually quite helpful in visualising why the sum of two Uniform(0, 1) distributions is a Triangle(0, 1, 2): project all the dots onto the y axis. Can you extend this graph to work out graphically what U(0, 1) + U(0, 3) would look like?

The point of the graph is to show you that there is a strong dependency pattern between these two distributions (a uniform and the triangle sum), which would need to be taken into account if one wished to extract the uniform distributions back out of each other. For example, the formula below does just that:

    B := VoseUniform(IF(A < 1, 0, A − 1), IF(A > 1, 1, A))
The uniform vertical distribution of points between these two lines is the effect of the second Uniform(0, 1) distribution B. Also note that all the generated values lie uniformly (but randomly) between these two lines. This actually is quite helpful in visualising why the sum of two Uniform(0, 1) distributions is a Triangle(0, 1, 2) by projecting all the dots onto the y axis. Can you extend this graph to work out graphically what U(0, 1) U(O,3) would look like? The point of the graph is to show you that there is a strong dependency pattern between these two distributions (a uniform and the triangle sum), which would need to be taken into account if one wished to extract back out the two uniform distributions from each other. For example, the formulae below do just that: + + B := VoseUniform(IF(A < 1,O, A - l ) , IF(A > 1, 1, A)) Chapter 7 Building and running a model 165 Try to follow the logic for the formula for B from the graph. B will generate a Uniform(0, 1) distribution with the right dependency relationship with A to leave C a Uniform(0, 1) distribution too. To recap, the problem is that we have three variables linked together as follows: We know the distributions for A and C. How do we find B and how do we simulate A , B and C all together? The simple example above using two uniform distributions allows us to simulate A , B and C all together, but only because we assumed A and B were independent and the problem was very simple. In general, we cannot correctly determine B, so we need either to construct a model that avoids having to perform such a calculation or admit that we have insufficient information to specify B. 
Chapter 8 Some basic random processes

8.1 Introduction

If you want to get the most out of the risk analysis and statistical modelling tools that are available, you really need to understand the conceptual thinking behind random processes and the equations and distributions that result, and be able to identify where these random processes occur in the real world. In this chapter we look at the binomial, Poisson and hypergeometric processes first, because they share a common basis, and a very great many risk analysis problems can be tackled with a good knowledge of just these three processes. I've added the central limit theorem here too because it explains a lot about the behaviour of distributions. We'll look at the theory and assumptions behind each process, and the distributions that are used in their modelling. This approach provides us with an excellent opportunity to become very familiar with a number of important distributions, and to see the relationships between them, even between the distributions of the different random processes. Then we'll look at some extensions to these processes that greatly increase their range of applications. Finally, we look at a number of problems. There are a number of other random processes discussed in this book relating to the sums of random variables (Chapter 11), time series modelling (Chapter 12) and correlated variables (Chapter 13). Chapter 9 on statistics relies heavily on an understanding of the random processes described here.

8.2 The Binomial Process

A binomial process is a random counting system where there are n independent identical trials, each one of which has the same probability of success p, and which produces s successes from those n trials (where 0 ≤ s ≤ n and n > 0, obviously). There are thus three quantities {n, p, s} that between them completely describe a binomial process.
Associated with each of these three quantities are three distributions that describe the uncertainty about, or variability of, these quantities. Each distribution requires knowledge of the other two quantities in order to estimate the third. The simplest example of a binomial process is the toss of a coin. If we define "heads" as a success, each toss has the same probability of success p (0.5 for a fair coin). Then, for a given number of trials n (tosses of a coin), the number of successes will be s (the number of "heads"). Each trial can be thought of as a random variable that returns either a 1 with probability p or a 0 with probability (1 − p). Such a trial is often known as a Bernoulli trial, and the probability (1 − p) is often given the label q.

8.2.1 Number of successes in n trials

We start our exploration of the binomial process by looking at the probability of a certain number of successes s for a given number of trials n and probability of success p. Imagine we have one toss of a coin. The two outcomes are "heads" (H) with probability p and "tails" (T) with probability (1 − p), as shown in the event tree of Figure 8.1(a).

Figure 8.1 Event trees for the tossing of (a) one coin and (b) two coins.

If we have two tosses of a coin there are four possible outcomes, as shown in Figure 8.1(b), namely HH, HT, TH and TT, where HT means "heads" followed by "tails", etc. These outcomes have probabilities p², p(1 − p), (1 − p)p and (1 − p)² respectively. If we are tossing a fair coin (i.e. p = 0.5), then each of the four outcomes has the same probability of 0.25. Now, the binomial process considers each success to be identical and therefore does not differentiate between the two events HT and TH: they are both just one success in two trials. The probability of one success in two trials is then just 2p(1 − p) or, for a fair coin, 0.5.
The 2 in this equation is the number of different paths that result in one success in two trials. Now imagine that we toss a coin three times. The eight outcomes are: HHH, HHT, HTH, HTT, THH, THT, TTH and TTT. Thus, one event produces three "heads", three events produce two "heads", three events produce one "head" and one event produces no "heads" in three coin tosses. In general, the number of ways that we can get s successes from n trials can be calculated directly using the binomial coefficient nCs, which is given by

    nCs = n! / (s!(n − s)!)

We can check this is right by choosing n = 3 (remembering that 0! = 1):

    3C0 = 3!/(0!3!) = 1,  3C1 = 3!/(1!2!) = 3,  3C2 = 3!/(2!1!) = 3,  3C3 = 3!/(3!0!) = 1

which match the numbers of combinations we have already counted. Each of the ways of getting x successes in n trials has the same probability, namely p^x (1 − p)^(n−x), so the probability of observing x successes in n trials is given by

    p(x) = nCx p^x (1 − p)^(n−x)

which is the probability mass function of the Binomial(n, p) distribution. In other words, the number of successes s one will observe in n trials, where each trial has the same probability of success p, is given by

    s = Binomial(n, p)

Figure 8.2 shows this distribution for four different combinations of n and p. The binomial distribution was first derived by Bernoulli (1713).

Figure 8.2 Examples of the binomial distribution.

8.2.2 Number of trials needed to achieve s successes

We have seen how the binomial distribution allows us to model the number of successes that will occur in n trials where we know the probability of success p. Sometimes we know how many successes we wish to have, we know the probability p, and we would like to know the number of trials we will have to complete in order to achieve the s successes, assuming we stop once the sth success has occurred. In this case, n is the random variable. Now that we have the binomial distribution, we can readily determine the distribution for n.
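Before moving on, the path counts and the binomial pmf from Section 8.2.1 can be checked in a few lines of Python, with math.comb computing nCs:

```python
import math

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n trials with success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Path counts for three coin tosses: 1, 3, 3, 1
print([math.comb(3, s) for s in range(4)])

# One head in two fair tosses: 2p(1-p) = 0.5
print(binomial_pmf(1, 2, 0.5))

# The pmf sums to 1 over all possible numbers of successes
print(sum(binomial_pmf(x, 10, 0.3) for x in range(11)))
```

The last line is a useful sanity check for any pmf you implement by hand: summing over the whole support must give 1.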
Let x be the total number of failures. The total number of trials we will execute is then (s + x), and by the (s + x - 1)th trial we must have observed (s - 1) successes and x failures (since the very last trial is, by assumption, a success). The probability of (s - 1) successes in (s + x - 1) trials is given immediately by the binomial distribution as

    P(s - 1 successes in s + x - 1 trials) = C(s + x - 1, s - 1) * p^(s-1) * (1 - p)^x

The probability of this being followed by a success is the same equation multiplied by p, i.e.

    p(x) = C(s + x - 1, s - 1) * p^s * (1 - p)^x

which is the probability mass function of the negative binomial distribution NegBin(s, p). In other words, the NegBin(s, p) distribution returns the number of failures one will have before observing s successes. The total number of trials n is thus given by

    n = s + NegBin(s, p)

Figure 8.3 shows various negative binomial distributions. If s = 1, then the distribution (known as the geometric distribution) is very right skewed and p(0) = p, i.e. the probability that there will be zero failures equals p, the probability that the first trial is a success. We can also see that, as s gets larger, the distribution looks more like a normal distribution. In fact, it is common to approximate the negative binomial distribution with a normal distribution when s is large, in order to avoid calculating the large factorials in p(x) above.

[Figure 8.3 Examples of the negative binomial distribution: NegBin(1, 0.5), NegBin(3, 0.5) and NegBin(100, 0.95).]

A negative binomial distribution shifted k values along the domain is sometimes called a binomial waiting time distribution, or a Pascal distribution.
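The negative binomial pmf can be sketched the same way (an illustrative transcription, not book material):

```python
from math import comb

# Negative binomial pmf from the text: the probability of exactly x
# failures before the s-th success.
def negbin_pmf(x: int, s: int, p: float) -> float:
    return comb(s + x - 1, s - 1) * p**s * (1 - p)**x

# For s = 1 (the geometric case) the text notes p(0) = p:
print(negbin_pmf(0, 1, 0.5))  # 0.5
# And the probabilities over x = 0, 1, 2, ... sum to 1:
print(round(sum(negbin_pmf(x, 3, 0.5) for x in range(200)), 6))  # 1.0
```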
8.2.3 Estimate of probability of success p

These results for the binomial and negative binomial distributions both model variability: that is to say, they return probability distributions of possible future outcomes. At times, however, we are looking back at the results of a binomial process and wish to determine one of its parameters. For example, we may have observed n trials of which s were successes, and from that information we would like to estimate p. This binomial probability is a fundamental property of the stochastic system and can never be observed, but we can become progressively more certain about its true value by collecting data. As we shall see in Section 9.2.2, we can readily quantify our uncertainty about the true value of p by using a beta distribution. In brief, if we have no prior information about p, or do not wish to assume any, then it is quite natural to use a uniform prior for p, and, through Bayes' theorem, we obtain a posterior density proportional to p^s * (1 - p)^(n-s), which is just the Beta(s + 1, n - s + 1) distribution, so

    p = Beta(s + 1, n - s + 1)

The beta distribution can also be used in the event that we have an informed opinion about the value of p prior to collecting data. In such cases, provided we can reasonably model our prior opinion about p with a beta distribution of the form Beta(a, b), the posterior turns out to be a Beta(a + s, b + n - s) distribution, because the beta distribution is conjugate to the binomial distribution (see Section III.7.1). Figure 8.4 illustrates a number of beta distributions.

8.2.4 Estimate of the number of trials n that were completed

Consider the situation where we have observed s successes and know the probability of success p, but would like to know how many trials were actually done to have observed those successes. We wish to estimate a value that is fixed, so we require a distribution that represents our uncertainty about what the true value is.
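A simulation sketch of this beta posterior (the counts s and n are made up for illustration):

```python
import random

# Uncertainty about p after observing s successes in n trials, with a
# uniform prior: p = Beta(s + 1, n - s + 1). Counts are illustrative.
random.seed(1)
s, n = 7, 20
draws = [random.betavariate(s + 1, n - s + 1) for _ in range(100_000)]

# The mean of Beta(a, b) is a/(a + b), so the posterior mean is
# (s + 1)/(n + 2) rather than the raw proportion s/n.
mean = sum(draws) / len(draws)
print(mean)  # close to 8/22 = 0.3636...
```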
There are two possible situations: we either know that the trials stopped on the sth success or we do not. If we know that the trials stopped on the sth success, we can model our uncertainty about the true value of n as

    n = s + NegBin(s, p)

If, on the other hand, we do not know that the last trial was a success (though it could have been), then our uncertainty about n is modelled as

    n = s + NegBin(s + 1, p)

Both of these formulae result from a Bayesian analysis with uniform priors. We will now derive these two results using standard Bayesian inference. The reader unfamiliar with this technique should refer to Section 9.2.

[Figure 8.4 Examples of the beta distribution, including Beta(2, 20) and Beta(30, 1), plotted against the binomial probability.]

Let x be the number of failures that occurred before the sth success. We will use a uniform prior for x, i.e. p(x) = c, and, from the binomial distribution, the likelihood function is the probability that at the (s + x - 1)th trial there had been (s - 1) successes and then the (s + x)th trial was a success, which is just the negative binomial probability mass function:

    l(X|x) = C(s + x - 1, s - 1) * p^s * (1 - p)^x

As we are using a uniform prior, and the equation for l(X|x) comes directly from a distribution and so must sum to unity, we can dispense with the formality of normalising the posterior distribution and observe

    p(x) = C(s + x - 1, s - 1) * p^s * (1 - p)^x

i.e. that x = NegBin(s, p). In the second case, we do not know that the last trial was a success, only that, in however many trials were completed, there were just s successes. We have the same uniform prior for the number of failures, but our likelihood function is just the binomial probability mass function, i.e.

    l(X|x) = C(s + x, s) * p^s * (1 - p)^x

As this does not have the form of a probability mass function of a distribution, we need to complete the Bayesian analysis, so

    p(x) = C(s + x, s) * p^s * (1 - p)^x / SUM[x=0 to infinity] C(s + x, s) * p^s * (1 - p)^x        (8.1)

The sum in the denominator equals 1/p.
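The claim that the denominator sums to 1/p can also be checked numerically; a quick sketch with arbitrary s and p:

```python
from math import comb

# Numerical check that SUM over x of C(s+x, s) * p**s * (1-p)**x
# equals 1/p, the normalising denominator above. s and p are arbitrary.
s, p = 4, 0.35
total = sum(comb(s + x, s) * p**s * (1 - p)**x for x in range(2000))
print(total, 1 / p)  # both approximately 2.857
```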
This can easily be seen by substituting s = a - 1, which gives

    SUM[x=0 to infinity] C(a + x - 1, a - 1) * p^(a-1) * (1 - p)^x

If the exponent of p were equal to a instead of (a - 1), each term would be the probability mass function of the negative binomial distribution, which sums to unity; since each term is instead that pmf divided by p, our denominator must sum to 1/p. The posterior distribution from Equation (8.1) then reduces to

    p(x) = C(s + x, s) * p^(s+1) * (1 - p)^x

which is just a NegBin(s + 1, p) distribution.

8.2.5 Summary of results for the binomial process

The results are shown in Table 8.1.

Table 8.1 Distributions of the binomial process.

    Quantity                 Formula                        Notes
    Number of successes      s = Binomial(n, p)
    Probability of success   p = Beta(s + 1, n - s + 1)     Assuming a uniform prior
                             p = Beta(a + s, b + n - s)     Assuming a Beta(a, b) prior
    Number of trials         n = s + NegBin(s, p)           When the last trial is a success
                             n = s + NegBin(s + 1, p)       When the last trial is not known to be a success

8.2.6 The beta-binomial process

An extension of the binomial process is to consider the probability p to be a random variable. A natural candidate to model this variability is the Beta(a, b) distribution, because it lies on [0, 1] and can take a wide variety of shapes, so it offers a great deal of flexibility. The beta-binomial distribution models the number of successes:

    s = Binomial(n, Beta(a, b))

The beta-negative binomial models the number of failures that will occur before achieving s successes:

    x = NegBin(s, Beta(a, b))

Both distributions are included in ModelRisk. It is important to remember that in the beta-binomial process the same value of p is applied to all the binomial trials, meaning that, if p is randomly 0.4 (say) for one trial, it is 0.4 for all the others too.
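A simulation sketch of the beta-binomial process just described (parameter values are made up): the key point is that a single p is drawn and then shared across all n trials.

```python
import random

# Beta-binomial sampling sketch: draw p once from Beta(a, b), then use
# that same p for every one of the n trials, as the text stresses.
random.seed(3)

def beta_binomial(n: int, a: float, b: float) -> int:
    p = random.betavariate(a, b)  # one p shared by the whole batch
    return sum(random.random() < p for _ in range(n))

draws = [beta_binomial(20, 2, 5) for _ in range(50_000)]
mean = sum(draws) / len(draws)
print(mean)  # close to n*a/(a + b) = 20*2/7 = 5.71
```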
If p were randomly varying between each trial, we would have each trial being an independent Bernoulli(Beta(a, b)). But since a Bernoulli distribution can only return 0 or 1, this condenses to a Bernoulli(a/(a + b)), where a/(a + b) is the mean of the beta distribution, and a collection of n such independent Bernoulli trials would therefore be just a Binomial(n, a/(a + b)).

8.2.7 The multinomial process

Whereas in the binomial process there are only two possible outcomes of a trial (0 or 1, yes or no, male or female, etc.), the multinomial process allows for multiple outcomes. The list of possible outcomes must be exhaustive, meaning a trial cannot result in something that isn't listed as an outcome. For example, if we throw a die there are six possible mutually exclusive (they can't happen at the same time) and exhaustive (one must occur) outcomes. There are three distributions associated with the multinomial process:

Multinomial(n, {p1 ... pk}) describes the number of successes in n trials that fall into each of the k categories. Its joint probability mass function parallels the binomial equation. You can think of a multinomial distribution as a recursive sequence of nested binomial distributions in which the number of trials and the probability of success are modified through the sequence. For example, imagine that a person being treated in hospital has three possible outcomes {cured, not cured, deceased} with probabilities {0.6, 0.3, 0.1}. Assuming the patients' outcomes are independent, we can model the outcomes for 100 patients as follows:

    Cured    = Binomial(100, 0.6)
    NotCured = Binomial(100 - Cured, 0.3/(0.3 + 0.1))
    Deceased = Binomial(100 - Cured - NotCured, 0.1/0.1)
             = Binomial(100 - Cured - NotCured, 1)
             = 100 - Cured - NotCured

The model in Figure 8.5 shows this calculation in a spreadsheet, together with the ModelRisk distribution VoseMultinomial, which achieves the same result in a single array function.

[Figure 8.5 Model for the multinomial process.]

Negative Multinomial({s1 ... sk}, {p1 ... pk}) is the extension of the negative binomial distribution and describes the number of extra trials (we can't really say "failures" any more, because there are several outcomes rather than the two of the binomial case, where we could designate success or failure) there will be to observe {s1 ... sk} successes. There are two versions of this question: "How many extra trials will there be in total?", which has a univariate answer, and "How many extra trials will there be in each success category beyond the number required?", which has a multivariate answer. The probability mass function is quite complicated for both, but the modelling is easy to see in the spreadsheet in Figure 8.6. Note in this model that there will always be one zero among the per-category extra trials, and that the total of the multivariate version returns the same distribution as the univariate version.

[Figure 8.6 Model for the negative multinomial process, using the ModelRisk functions VoseNegMultinomial and VoseNegMultinomial2.]

Dirichlet({a1 ... ak}) is the multivariate equivalent of the beta distribution, as can be seen from its joint density function:

    f(x1, ..., xk) is proportional to PRODUCT[i=1 to k] xi^(ai - 1)

where 0 <= xi <= 1 (a probability lies on [0, 1]), SUM[i=1 to k] xi = 1 (the probabilities must sum to 1) and ai > 0.
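A common way to simulate a Dirichlet is to normalise independent gamma draws; the sketch below assumes that construction (the a values are arbitrary):

```python
import random

# Simulate Dirichlet({a1 ... ak}) by drawing Gamma(ai, 1) variates and
# normalising them so they sum to 1. The alphas below are arbitrary.
random.seed(4)
alphas = [3.0, 2.0, 5.0]

def dirichlet_draw(alphas: list[float]) -> list[float]:
    gammas = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

probs = dirichlet_draw(alphas)
print(round(sum(probs), 10))  # 1.0: a valid probability vector
```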
We can use the Dirichlet distribution to model the uncertainty about the set of probabilities {p1 ... pk} of a multinomial process. There is a neat relationship with gamma distributions that we can use to simulate a Dirichlet distribution, which is shown in the model of Figure 8.7, together with the VoseDirichlet function. In this example, a clinical trial of some face cream has been performed with 300 randomly selected people to ascertain the level of allergic reactions, with the following outcomes: 227 - no effect; 41 - mild itching; 27 - significant discomfort; and 5 - lots of pain and regret. The Dirichlet({s1 + 1, ..., sk + 1}) will return the joint uncertain estimate of the probabilities that another random person (a consumer) would experience each effect.

[Figure 8.7 Model for the Dirichlet distribution: each probability estimate is a Gamma(si + 1, 1) draw divided by the sum of all the Gamma(si + 1, 1) draws.]

8.3 The Poisson Process

In the binomial process there are n discrete opportunities for an event (a "success") to occur. In the Poisson process there is a continuous and constant opportunity for an event to occur. For example, lightning strikes might be considered to occur as a Poisson process during a storm. That would mean that, in any small time interval during the storm, there is a certain probability that a lightning strike will occur. In the case of lightning strikes, the continuum of opportunity is time. However, there are other types of exposure. The occurrence of discontinuities in the continuous manufacture of wire could be considered to be a Poisson process where the measure of exposure is, for example, kilometres or tonnes of wire produced.
If Giardia cysts were randomly distributed in a lake, the consumption of cysts by campers drinking the water would be a Poisson process, where the measure of exposure would be the amount of water consumed. Typographic errors in a book might be Poisson distributed, in which case the measure of exposure could be inches of text, although one could just as easily consider the errors to be binomially distributed with n = the number of characters in the book.

In a Poisson process, unlike the binomial, as there is a continuum of opportunity for an event to occur, we can theoretically have anything between zero and an infinite number of events within a specific amount of opportunity, and there is a probability of the event occurring no matter how small a unit of exposure we might consider. In practice, few physical systems will exactly conform to such a set of assumptions, but many systems are nevertheless very well approximated by a Poisson process. In the Giardia cyst example above, assuming a Poisson process would theoretically mean that we could have any number of cysts in a volume of water, no matter how small we made that volume. Obviously, this assumption breaks down when we consider a volume of liquid around the size of a cyst, or smaller, but this is almost never a restriction in practice. The distributions describing the Poisson and binomial processes are strongly related to each other, as shown in Figure 8.8.

[Figure 8.8 Comparison of the distributions of the binomial process (number of trials n, number of successes s, probability of success p), the Poisson process (number of observations α, number of events per unit time λ) and the hypergeometric process (number of successes s, number of trials n, population M, subpopulation D).]
In a binomial process, the key descriptive parameter is p, the probability of occurrence of an event, which is the same for all trials, so the trials are independent of each other. The key descriptive parameter for the Poisson process is λ, the mean number of events that will occur per unit of exposure, which is also considered to be constant over the total amount of exposure t. That means that there is a constant probability per second, for example, of an event occurring, whether or not an event has just occurred, has not occurred for an unexpectedly long time, etc. Such a process is called "memoryless", and both the binomial and Poisson processes can be so described. Like p for a binomial process, λ is a property of the physical system. For static systems (stochastic processes), p and λ are not variables, but we still need distributions to express the state of our knowledge (uncertainty) about their values.

In a Poisson process we consider, together with the number of events that may occur in a period t, the amount of "time" one will have to wait to observe α events, and λ, the average number of events that could occur per unit of exposure, known as the Poisson intensity. This section will now show how the Poisson distribution, which describes the number of events α that may occur in a period of exposure t, can be derived from the binomial distribution as p tends to zero and n tends to infinity. We will then look at how to determine the variability distribution of the time t one will need to wait before observing α events, which also turns out to be the distribution of uncertainty about the time one must have waited before having observed α events. Finally, we will discuss how to determine our state of knowledge (uncertainty) about λ given a set of observed events α in a period t.
8.3.1 Deriving the Poisson distribution from the binomial

Consider a binomial process where the number of trials n tends to infinity and the probability of success p at the same time tends to zero, with the constraint that the mean of the binomial distribution, np, remains finitely large. The probability mass function of the binomial distribution can be manipulated to model the number of successes that will occur under such conditions, as follows:

    p(X = x) = n!/(x!(n - x)!) * p^x * (1 - p)^(n-x)

Using λt = np, and noting that for n large and p small

    n!/(n - x)! is approximately n^x    and    (1 - p)^(n-x) is approximately e^(-np)

the equation simplifies to

    p(X = x) = (λt)^x * e^(-λt) / x!

This is the probability mass function for the Poisson(λt) distribution, i.e.

    Number of events α in time t = Poisson(λt)

when the average number of events that will occur in a unit interval of exposure is known to be λ. We can see how this interpretation fits in with the derivation from the binomial distribution. Imagine that a young lady decides to buy a pair of very high platform shoes that are in fashion. After some practice she gets used to the shoes, but there remains a smallish probability (say 1 in 50) that she will fall over with each step she takes. She decides to go for a short walk, say 100 metres. If we say that each step measures 1 metre, then we can model the number of falls she will have on her walk as either Binomial(100, 2 %) or Poisson(100 * 0.02) = Poisson(2). Figure 8.9 plots these two distributions together and shows how closely the binomial distribution is approximated by the Poisson distribution in such limiting cases.

The Poisson distribution is often mistakenly considered to be only a distribution of rare events. It is certainly used in this sense to approximate a binomial distribution, but it has far more importance than that.
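A numerical version of the Figure 8.9 comparison can be produced with a few lines (an illustrative sketch):

```python
from math import comb, exp, factorial

# Compare Binomial(100, 0.02) with its Poisson(2) approximation, as in
# the platform-shoes example.
n, p = 100, 0.02
lam = n * p

rows = []
for x in range(6):
    b = comb(n, x) * p**x * (1 - p)**(n - x)
    po = lam**x * exp(-lam) / factorial(x)
    rows.append((x, b, po))
    print(x, round(b, 4), round(po, 4))
# The two columns agree closely; at x = 2 both are about 0.27.
```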
Where there is a continuum of exposure to an event, the measure of exposure can be split up into smaller and smaller divisions until the probability of the event occurring in each division becomes extremely small, while there is also an enormous number of divisions. For example, I could stand on a street corner during rush hour, looking for red cars to pass by. For the duration of the rush hour, one could consider that the frequency of cars going by is quite constant and that the red cars are randomly distributed among the city's traffic. Then the number of red cars passing by will be Poisson distributed.

[Figure 8.9 Comparison of the Binomial(100, 0.02) and Poisson(2) distributions.]

If, on average, 0.6 red cars passed by per minute, I could model the number of cars passing by in the next 10 seconds as Poisson(0.1), in the next hour as Poisson(36), etc. I could divide up the time I stand on the street corner into such tiny elements (for example 1/100th of a second) that the probability of a red car passing by within a particular 1/100th of a second would be extremely small. The probability would be so small that the chance of two cars going by within that period would be absolutely negligible. In such circumstances, we can consider each of these small elements of time to be independent Bernoulli trials. Similarly, the number of raindrops falling on my head each second during a shower would also be Poisson distributed.

8.3.2 "Time" to wait to observe α events

The Poisson process assumes that there is a constant probability that an event will occur per increment of time. If we consider a small element of time Δt, then the probability that an event will occur in that element of time is kΔt, where k is some constant. Now let P(t) be the probability that the event will not have occurred by time t.
The probability that an event occurs for the first time during the small interval Δt after time t is then kΔt * P(t). This is also equal to P(t) - P(t + Δt), so we have

    P(t) - P(t + Δt) = kΔt * P(t)

Making Δt infinitesimally small, this becomes the differential equation

    dP(t)/dt = -k * P(t)

Integration gives

    P(t) = e^(-kt)

If we define F(t) as the probability that the event will occur before time t (i.e. the cumulative distribution function for t), we then have

    F(t) = 1 - e^(-kt)

which is the cumulative distribution function for an exponential distribution Expon(1/k) with mean 1/k. Thus, 1/k is the mean time between occurrences of events or, equivalently, k is the mean number of events per unit time, which is the Poisson parameter λ. The parameter 1/λ, the mean time between occurrences of events, is given the notation β. We have thus shown that the time until occurrence of the first event of a Poisson process is given by

    t1 = Expon(β)

where β = 1/λ. It can also be shown (although the maths is too laborious to repeat here) that the time until α events have occurred is given by a gamma distribution:

    tα = Gamma(α, β)

The Expon(β) distribution is therefore simply a special case of the gamma distribution, namely Gamma(1, β).

It is interesting to check the idea that a Poisson process is "memoryless". The probability that the first event will not have occurred by time x, given that it has not occurred by time t (x > t), is given by

    P(x)/P(t) = e^(-kx)/e^(-kt) = e^(-k(x - t))

which is another exponential distribution, now in the remaining time (x - t). Thus, although the event may not have occurred by time t, the remaining time until it occurs has the same probability distribution as it had at any prior point in time.

8.3.3 Estimate of the mean number of events per period (Poisson intensity) λ

Like the binomial probability p, the mean events per period λ is a fundamental property of the stochastic system in question. It can never be observed and it can never be exactly known. However, we can become progressively more certain about its value as more data are collected.
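That growing certainty can be illustrated with a quick simulation (not from the book): the naive estimate α/t tightens around the true λ as exposure accumulates.

```python
import math
import random

# As exposure grows, the observed count alpha pins down the Poisson
# intensity: the estimate alpha/t approaches the true lambda.
random.seed(10)
true_lambda = 3.0

def poisson(rate: float) -> int:
    # Knuth's multiplication method; fine for small rates.
    L, k, prod = math.exp(-rate), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= L:
            return k
        k += 1

estimates = []
for t in (10, 100, 1000):
    alpha = sum(poisson(true_lambda) for _ in range(t))  # t unit periods
    estimates.append(alpha / t)
    print(t, alpha / t)  # estimates tighten around 3.0 as t grows
```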
Bayesian inference (see Section 9.2) provides us with a means of quantifying the state of our knowledge as we accumulate data. Assume an uninformed prior π(λ) = 1/λ (see Section 9.2.2) and the Poisson likelihood function for observing α events in period t:

    l(α|λ) = (λt)^α * e^(-λt) / α!

Since we can ignore terms that don't involve λ, we get the posterior distribution

    p(λ) proportional to λ^(α-1) * e^(-λt)

which is a Gamma(α, 1/t) distribution. The gamma distribution can also be used to describe our uncertainty about λ if we start off with an informed opinion and then observe α events in time t. From Table 9.1, if we can reasonably describe our prior belief with a Gamma(a, b) distribution, the posterior is given by a Gamma(a + α, b/(1 + bt)) distribution.

The choice of π(λ) = 1/λ (which is equivalent to a Gamma(1/z, z) distribution, where z is extremely large) as an uninformed prior is an uncomfortable one for many. This prior makes mathematical sense in that it is transformation invariant and therefore would give the same answer whether one performed an analysis in terms of λ or β = 1/λ, or even changed the unit of exposure relating to λ. On the other hand, a plot of this prior doesn't really seem "uninformed", since it is so peaked at zero. However, the shape of the posterior gamma distribution becomes progressively less sensitive to the prior distribution as data are collected. We can get a feel for the importance of the prior with the following train of thought:

(i) A π(λ) = 1/λ prior is equivalent to Gamma(1/z, z), where z approaches infinity. You can prove this by looking at the gamma probability density function and setting α to zero and β to infinity.

(ii) A flat prior (the opposite extreme to the π(λ) = 1/λ prior) would be equivalent to a Gamma(1, z), where z approaches infinity, i.e. an infinitely drawn-out exponential distribution.
(iii) We have seen that, for a Gamma(a, b) prior, the resultant posterior is Gamma(a + α, b/(1 + bt)), which means that the posterior for (i) would be Gamma(α, 1/t), and the posterior for (ii) would be Gamma(α + 1, 1/t).

(iv) Thus, the sensitivity of the gamma posterior to the prior amounts to whether (α + 1) is approximately the same as α. Moreover, Gamma(α, β) is the sum of α independent Exponential(β) distributions, so one can think of the choice of priors as being whether or not we add one extra exponential distribution to the α exponential distributions from the data. Thus, if α were 100, for example, the distribution would be roughly 1 % influenced by the prior and 99 % influenced by the data.

8.3.4 Estimate of the elapsed period t

We can estimate the period t that has elapsed if we know λ and the number of events α that have occurred in time t. The maths turns out to be exactly the same as for the estimate of λ in the previous section. The reader may like to verify that, by using a prior of π(t) = 1/t, we obtain a posterior distribution t = Gamma(α, 1/λ), which is the same result we would obtain if we were trying to predict forward (i.e. determine a distribution of variability of) the time required to observe α events given λ = 1/β. Also, if we can reasonably describe our prior belief with a Gamma(a, b) distribution, the posterior is given by a Gamma(a + α, b/(1 + bλ)) distribution.

8.3.5 Summary of results for the Poisson process

The results are shown in Table 8.2.

Table 8.2 Distributions of the Poisson process.

    Quantity                                    Formula                          Notes
    Number of events                            α = Poisson(λt)
    Mean number of events per unit exposure     λ = Gamma(α, 1/t)                Assuming uninformed prior
                                                λ = Gamma(a + α, b/(1 + bt))     Assuming Gamma(a, b) prior
    Time until observation of first event       t1 = Expon(1/λ) = Gamma(1, 1/λ)
    Time until observation of first α events    tα = Gamma(α, 1/λ)
    Time that has elapsed for α events          tα = Gamma(α, 1/λ)               Assuming uninformed prior
                                                tα = Gamma(a + α, b/(1 + bλ))    Assuming Gamma(a, b) prior

8.3.6 The multivariate Poisson process

The properties of the Poisson process make extending to a multivariate situation very easy. Imagine that we have three categories of car accident: (a) no injury; (b) one or more persons injured but no fatalities; (c) one or more persons killed. We'll assume that the accidents occur independently and follow a Poisson process with expected occurrences λa, λb and λc per year. The number that will occur in the next T years (assuming that the rates won't change over time) is Poisson(T * (λa + λb + λc)). The probability that the next accident is of type (a) is

    λa / (λa + λb + λc)

The time until the next type (a) accident is Gamma(1, 1/λa) = Expon(1/λa), and the uncertainty about the true values of each λ can be estimated separately as described in Sections 8.3.3 and 9.1.5.

8.3.7 Modifying λ in a Poisson process

The Poisson model assumes that λ will be constant over the time in which we are counting. That can be a tenuous assumption. Hurricanes, disease outbreaks, suicides, etc., occur more frequently at certain times of the year; car accidents, robberies and high-street brawls occur more frequently at certain times of the day (and sometimes year too). In fact it turns out that, if λ has a consistent (even if unknown) seasonal variation, we can often get round the problem. Imagine that boat accidents occur in each month i at a rate λi, i = 1, ..., 12. The number occurring in each future month i will be αi = Poisson(λi), and the total over the year will be the sum of the twelve Poisson(λi) distributions. From the identity Poisson(a) + Poisson(b) = Poisson(a + b), this can be rewritten as

    Poisson(λ1 + λ2 + ... + λ12)

i.e. the boat accidents occurring in a year also follow a Poisson process.
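The identity Poisson(a) + Poisson(b) = Poisson(a + b) can be checked by simulation; the monthly rates below are made up:

```python
import math
import random

# Check the superposition identity: monthly Poisson counts summed over
# a year behave like one Poisson with the summed rate.
random.seed(8)

def poisson(rate: float) -> int:
    # Knuth's multiplication method; fine for small rates.
    L, k, prod = math.exp(-rate), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= L:
            return k
        k += 1

monthly = [1.0, 1.2, 2.5, 3.0, 2.0, 1.5, 1.1, 0.9, 2.2, 2.6, 1.8, 3.2]
yearly = [sum(poisson(l) for l in monthly) for _ in range(20_000)]
mean = sum(yearly) / len(yearly)
var = sum((y - mean) ** 2 for y in yearly) / len(yearly)
print(mean, var)  # both close to sum(monthly) = 23.0
```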
Thus, as long as we ensure that we analyse data over a complete number of seasonal periods (a whole number of years in this case) and predict for a whole number of seasonal periods, we can ignore the fact that λ changes seasonally. That is immensely useful. If I've observed that historically there have been an average of 23 outbreaks per year of campylobacteriosis in a city (an outbreak is defined in epidemiology as an event unconnected to others, so we can think of them as occurring randomly in time and independently of each other), then I can model the number of outbreaks next year as Poisson(23) without worrying that most of those will occur over the summer months. I can also compare year-on-year data on outbreaks using Poisson mathematics. What I cannot do, of course, is say that July will have Poisson(23/12) outbreaks.

I used to live in a rural area of the South of France. As winter approached, the first time there was black ice on the roads in the morning you would see cars buried in hedges, woods and fields along the roadside. The more intense the sudden cold snap, the more cars you would see. Some years there weren't so many, others it was mayhem. Clearly, in situations like this, the expected rate of accidents is itself a random variable. The most common way to model that random variation is to multiply λ by a Gamma(1/h^2, h^2) distribution. This gamma has a mean of 1 and a standard deviation of h, giving a Poisson rate of Gamma(1/h^2, h^2 * λ). The idea therefore is that the gamma distribution is just adding a coefficient of variation of h to λ. It turns out that the combination of these two distributions is a Pólya(1/h^2, h^2 * λ) or, if 1/h^2 is an integer, simplifies to a NegBin(1/h^2, 1/(1 + h^2 * λ)). The result is convenient because it means we can use the Pólya or NegBin distributions to model this Poisson(Gamma(α, β)) mixture.
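A simulation sketch of this Poisson-gamma mixing (parameter values made up): drawing the rate from a gamma with mean λ and coefficient of variation h inflates the count variance above the mean.

```python
import math
import random

# Poisson-gamma mixture sketch: the rate itself is Gamma distributed
# with mean lam and coefficient of variation h, so counts end up with
# variance lam + (h*lam)**2 > lam. Parameter values are made up.
random.seed(9)
lam, h = 5.0, 0.5

def poisson(rate: float) -> int:
    L, k, prod = math.exp(-rate), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= L:
            return k
        k += 1

a = 1 / h**2              # Gamma(a, b) with mean lam and CV h
b = lam / a
draws = [poisson(random.gammavariate(a, b)) for _ in range(50_000)]
m = sum(draws) / len(draws)
v = sum((d - m) ** 2 for d in draws) / len(draws)
print(m, v)  # mean near 5, variance near 5 + 2.5**2 = 11.25
```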
Along the way, we can also see that the Pólya and NegBin distributions have a greater coefficient of variation than the Poisson. Often you will see in statistics that researchers call data "overdispersed" when they want to fit a Poisson distribution but the data have a variance greater than their mean (the two would be equal for a Poisson distribution), and the statisticians then turn to a NegBin (although they would often be better off with a Pólya, which is less well known). The gamma distribution is useful because we have an extra parameter h to play with and can therefore match, for example, the mean and variance (or any two other statistics) of the data. However, at times that is not enough, and we might need more control in order to match, for example, the skewness too. Instead of modelling λ in the form Gamma(a, b), we can add a positive shift c so that we get Poisson(Gamma(a, b) + c), which turns out to be a Delaporte(a, b, c) distribution.

8.4 The Hypergeometric Process

The hypergeometric process occurs when one is sampling randomly without replacement from some population, and where one is counting the number in that sample that have some particular characteristic. This is a very common type of scenario. For example, population surveys, herd testing and lotto are all hypergeometric processes. In many situations the population is very large in comparison with the sample, so that, if a sampled item were put back into the population, the probability that it would be picked again is very small. In that case, each sample would have essentially the same probability of picking an individual with the particular characteristic: in other words, the process becomes binomial. When the population is not very large compared with the sample (a good rule is when the population is less than 10 times the size of the sample), we cannot make a binomial approximation to the hypergeometric. This section discusses the distributions associated with the hypergeometric process.
8.4.1 Number in a sample with a particular characteristic

Consider a group of M individual items, D of which have a certain characteristic. Randomly picking n items from this group without replacement, where each of the M items has the same probability of being selected, is a hypergeometric process. For example, imagine I have a bag of seven balls, three of which are red and four of which are blue. What is the probability that I will select two red balls from the bag if I randomly pick three balls out without replacement?

First of all, we note that the probability of the second ball picked being red depends on the colour of the first picked ball. If the first ball was red (with probability 3/7), there are only two red balls left among the six balls remaining. The probability of the second ball being red, given the first ball was red, is therefore 2/6 = 1/3. However, each ball remaining in the bag has the same probability of being picked, which means that each event resulting in x red balls being selected in total has the same probability. We thus need only consider the different combinations of events that are possible. There are, from the discussion in Section 6.3.4, C(7, 3) = 35 different possible ways that one can select three items from seven. There are C(3, 2) = 3 ways to select two red balls from the three in the bag, and there are C(4, 1) = 4 ways to select one blue ball from the four in the bag. Thus, out of the 35 ways we could have picked three balls from the group of seven, only 3 * 4 = 12 of those ways would give us two red balls. Thus, the probability of selecting two red balls is 12/35 = 34.29 %.

In general, for a population of size M of which D have the characteristic of interest, in selecting a sample of size n from that population at random without replacement, the probability of observing x with the characteristic of interest is given by

    p(x) = C(D, x) * C(M - D, n - x) / C(M, n)

which is the probability mass function of the hypergeometric distribution Hypergeo(n, D, M).
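The hypergeometric pmf transcribes directly; a minimal sketch checked against the bag-of-balls example:

```python
from math import comb

# Hypergeometric pmf from the text:
# p(x) = C(D, x) * C(M - D, n - x) / C(M, n).
def hypergeo_pmf(x: int, n: int, D: int, M: int) -> float:
    return comb(D, x) * comb(M - D, n - x) / comb(M, n)

# The bag-of-balls example: M = 7 balls, D = 3 red, sample n = 3.
print(hypergeo_pmf(2, 3, 3, 7))  # 12/35 = 0.342857...
```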
Just in case you are curious, the hypergeometric distribution gets its name because its probabilities are successive terms in a Gaussian hypergeometric series.

Binomial approximation to the hypergeometric
If we replaced each item one at a time back into the population when taking our sample of size n, the probability of each individual item having the characteristic of interest would be D/M, and the number sampled from D would then be given by a Binomial(n, D/M). More usefully, if M is very large compared with n, the chance of picking the same item more than once, were one to replace each item after selection, would be very small. Thus, for large M (usually n < 0.1M is quoted as a satisfactory condition) there is little difference in the sampling result whether we sample with or without replacement, and we can approximate a Hypergeo(n, D, M) with a Binomial(n, D/M), which is much easier to calculate.

Multivariate hypergeometric distribution
The hypergeometric distribution can be extended to situations where there are more than two types of item in the population (i.e. more than just D of one type and (M − D) of another). The probability of getting s1 from subpopulation D1, s2 from D2, etc., all in a sample of size n, is given by

f(s1, ..., sk) = [C(D1, s1) C(D2, s2) ... C(Dk, sk)] / C(M, n)

where the si sum to n, the Di sum to M, Di ≥ si ≥ 0 and M > Di > 0.

8.4.2 Number of samples to get a specific s
Consider the situation where we are sampling without replacement from a population M with D items having the characteristic of interest until we have s items with that characteristic. The distribution of the number of failures we will have before the sth success can easily be calculated in the same manner as we developed for the negative binomial distribution in Section 8.2.2. The probability of observing (s − 1) successes in (x + s − 1) trials (i.e.
x failures) is given by direct application of the hypergeometric distribution:

f(x failures so far) = C(D, s − 1) C(M − D, x) / C(M, s + x − 1)

The probability p of then observing a success on the next trial (the (s + x)th trial) is simply the number of D items remaining (= D − (s − 1)) divided by the size of the population remaining (= M − (s + x − 1)):

p = (D − s + 1) / (M − s − x + 1)

and the probability of having exactly x failures up to the sth success, where trials are stopped at the sth success, is then the product of these two probabilities. This is the probability mass function for the inverse hypergeometric distribution InvHypergeo(s, D, M), which is analogous to the negative binomial distribution for the binomial process and the gamma distribution for the Poisson process.

For a population M that is large compared with s, the inverse hypergeometric distribution approximates the negative binomial:

InvHypergeo(s, D, M) ≈ NegBin(s, D/M)

and if the probability D/M is also very small

InvHypergeo(s, D, M) ≈ Gamma(s, M/D)

Figure 8.10 shows some examples of the inverse hypergeometric distribution. An inverse hypergeometric distribution shifted k units along the domain is sometimes called a negative hypergeometric distribution. ModelRisk offers the InvHypergeo(s, D, M) distribution, and the negative hypergeometric can be achieved by writing VoseInvHypergeo(s, D, M, VoseShift(k)).

Figure 8.10 Examples of the inverse hypergeometric distribution: InvHypergeo(2, 2, 50) and InvHypergeo(4, 5, 50), plotted as probability against number of failures.

8.4.3 Number of samples to have observed a specific s
The inverse hypergeometric distribution was derived above as a distribution of variability in predicting the number of failures one will have before the sth success. However, it can equally be derived as a distribution of uncertainty about the number of failures x = n − s one must have had if one knows s, M and D, using Bayes' theorem and a uniform (i.e.
uninformed) prior on x. In the case where we do not know that the trials stopped with the sth success, we can still apply Bayes' theorem with a uniform prior for x and a likelihood function given by a hypergeometric probability, which, with a uniform prior, is also the posterior distribution. Substituting n − s for x yields

f(n) ∝ n! (M − n)! / [(n − s)! (M − D − n + s)!]     (8.6)

Equation (8.6) has dropped all the terms that are not a function of n, since they can be normalised out of the equation. The uncertainty distribution for n does not equate to a standard distribution, so it needs to be normalised manually; it is easiest just to work with Equation (8.6) and normalise in the spreadsheet. Figure 8.11 shows an example of such a calculation, where the final distribution is in cell G18. Note that, if one uses a discrete distribution as shown in that spreadsheet, it is actually unnecessary to normalise the probabilities, since software like @RISK, Crystal Ball and ModelRisk automatically normalises them to sum to unity.

Figure 8.11 A Bayesian inference model with hypergeometric uncertainty: candidate values of n with their unnormalised f(n) and the normalised posterior probabilities. Note that the discrete distribution could have been used with columns B and C, removing the necessity to normalise the distribution.

8.4.4 Estimate of population and subpopulation sizes
The sizes of D and M are fundamental properties of the stochastic system, like p for a binomial process and λ for a Poisson process. Distributions of our uncertainty about the values of these parameters can be determined by Bayesian inference, given a sample of size n taken from the population M, of which s belonged to the subpopulation D.
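The spreadsheet normalisation described for Equation (8.6) can equally be sketched in code. Note that f(n) in Equation (8.6) is proportional to C(n, s) C(M − n, D − s), which is how it is written below; the values of s, D and M are illustrative only:

```python
from math import comb

def unnormalised(n, s, D, M):
    # Equation (8.6) up to a constant: f(n) ∝ C(n, s) * C(M - n, D - s)
    return comb(n, s) * comb(M - n, D - s)

s, D, M = 2, 5, 20
support = range(s, M - D + s + 1)   # n must allow both binomial coefficients
weights = [unnormalised(n, s, D, M) for n in support]
total = sum(weights)
posterior = {n: w / total for n, w in zip(support, weights)}
print(sum(posterior.values()))  # 1 after normalisation
```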
The hypergeometric probability of s successes in n samples from a population M of which D have the characteristic of interest is given by Equation (8.5). So, with a uniform prior, we get the following posterior equations for D and M:

f(D) ∝ D! (M − D)! / [(D − s)! (M − D − n + s)!]

f(M) ∝ (M − D)! (M − n)! / [(M − D − n + s)! M!]

These formulae do not equate to standard distributions and need to be normalised in the same way as discussed for Equation (8.6).

8.4.5 Summary of results for the hypergeometric process
The results are shown in Table 8.3.

Table 8.3 Distributions of the hypergeometric process.
- Number of the subpopulation in the sample: Hypergeo(n, D, M).
- Number of samples n there were to observe s from the subpopulation: n = s + InvHypergeo(s, D, M), where the last sample is known to have been from the subpopulation.
- Number of samples n there were to have observed s from the subpopulation: f(n) ∝ n!(M − n)! / [(n − s)!(M − D − n + s)!], where the last sample is not known to have been from the subpopulation. This uncertainty distribution needs to be normalised.
- Size of subpopulation D: f(D) ∝ D!(M − D)! / [(D − s)!(M − D − n + s)!]. This uncertainty distribution needs to be normalised.
- Size of population M: f(M) ∝ (M − D)!(M − n)! / [(M − D − n + s)! M!]. This uncertainty distribution needs to be normalised.

8.5 Central Limit Theorem
The central limit theorem (CLT) is an asymptotic result about summing probability distributions. It turns out to be very useful for obtaining sums of individuals (e.g. sums of animal weights, yields, scraps). It also explains why so many distributions sometimes look like normal distributions. We won't look at the derivation, just some examples and uses.
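Before the formal statement, a quick simulation shows the kind of result the CLT delivers. The parent distribution here is my own choice (a uniform with mean 27.4 and standard deviation 1.3, to match the nail-weight example discussed below); the CLT says the sum of 100 such variables should be close to Normal(2740, 13) whatever the parent's shape:

```python
import math
import random
import statistics

random.seed(2)

n, mu, sigma = 100, 27.4, 1.3
half = sigma * math.sqrt(3.0)   # a U(a, b) has sd (b - a) / (2 * sqrt(3))

# 5000 simulated "boxes", each the sum of 100 uniform nail weights
boxes = [sum(random.uniform(mu - half, mu + half) for _ in range(n))
         for _ in range(5000)]

# CLT prediction: mean n*mu = 2740, sd sigma*sqrt(n) = 13
print(statistics.mean(boxes), statistics.stdev(boxes))
```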
The sum C of n independent random variables Xi (where n is large), all of which have the same distribution, will asymptotically approach a normal distribution with known mean and standard deviation:

C ≈ Normal(nμ, σ√n)     (8.7)

where μ and σ are the mean and standard deviation of the distribution from which the n samples are drawn.

8.5.1 Examples
Imagine that the distribution of the weight (read "mass" if you want to be technical) of random nails produced by some company has a mean of 27.4 g and a standard deviation of 1.3 g. What will be the weight of a box of 100 nails? The answer, assuming that the nail weight distribution isn't strongly skewed, is the following normal distribution:

Normal(100 × 27.4, 1.3 × √100) = Normal(2740, 13) grams

This CLT result turns out to be very important in risk analysis. Many distributions are the sum of a number of identical random variables, and so, as that sum gets larger, the distribution tends to look like a normal distribution. For example, Gamma(α, β) is the sum of α independent Expon(β) distributions, so, as α gets larger, the gamma distribution looks progressively more like a normal distribution. An exponential distribution has mean and standard deviation of β, so we have

Gamma(α, β) ≈ Normal(αβ, β√α) for large α

Other examples are discussed in the section on approximating one distribution with another.

How large does n have to be for the sum to be distributed normally? It depends on the shape of the distribution being summed:
- Uniform: 12 (try it: summing 12 uniforms is an old way of generating normal distributions).
- Symmetric triangular: 6 (because U(a, b) + U(a, b) = Triangle(2a, a + b, 2b)).
- Fairly skewed: 30+ (e.g. 30 lots of Poisson(2) = Poisson(60)).
- Exponential: 150+ (check with Gamma(α, β) = sum of α Exponential(β)s).

8.5.2 Other related results
The average of a large number of independent, identical distributions
Dividing both sides of Equation (8.7) by n, the average x̄ of n variables drawn independently from the same distribution is given by
x̄ = (1/n) Σ Xi ≈ Normal(μ, σ/√n)     (8.8)

Note that the result of Equation (8.8) is correct because both the mean and standard deviation of the normal distribution are in the same units as the variable itself. However, be warned that for most distributions one cannot simply divide the distribution parameters of a variable X by n to get the distribution of X/n.

The product of a large number of independent, identical distributions
CLT can also be applied where a large number of identical random variables are being multiplied together. Let P be the product of a large number of random variables Xi, i = 1, ..., n. Taking logs of both sides, we get

ln(P) = Σ ln(Xi)

The right-hand side is the sum of a large number of random variables and will therefore tend to a normal distribution. Thus, from the definition of a lognormal distribution, P will be asymptotically lognormally distributed. A neat corollary is that, if all the Xi are lognormally distributed, their product will be exactly lognormally distributed too.

Is CLT the reason the normal distribution is so popular?
Many stochastic variables are neatly described as the sum or product, or a mixture, of a number of random variables. A very loose form of CLT says that, if you add up a large number n of different random variables, and if none of those variables dominates the resultant distribution spread, the sum will eventually look normal as n gets bigger. The same applies to multiplying (positive) different random variables and the lognormal distribution. In fact, a lognormal distribution will also look very similar to a normal distribution if its mean is much larger than its standard deviation (see Figure 8.12), so perhaps it should not be too surprising that so many variables in nature seem to be somewhere between lognormally and normally distributed.

8.6 Renewal Processes
In a Poisson process, the times between successive events are described by independent identical exponential distributions.
In a renewal process, like a Poisson process, the times between successive events are independent and identically distributed, but they can take any distribution; the Poisson process is thus a particular case of a renewal process. The mathematics of the distribution of the number of events in a period (equivalent to the Poisson distribution for the Poisson process) and of the time to wait to observe x events (equivalent to the gamma distribution for the Poisson process) can be quite complicated, depending on the distribution of time between events. However, Monte Carlo simulation lets us bypass the mathematics to arrive at both of these distributions, as we will see in the following examples.

Figure 8.12 Graphs of the normal and lognormal distribution.

Example 8.1 Number of events in a specific period
It is known that a certain type of light bulb has a lifetime that is Weibull(1.3, 4020) hours distributed. (a) If I have one light bulb working at all times, replacing each failed light bulb immediately with another, how many light bulbs will have failed in 10 000 hours? (b) If I have 10 light bulbs going at all times, how many will fail in 1000 hours? (c) If I had one light bulb going constantly, and I had 10 light bulbs to use, how long would it take before the last light bulb failed?

(a) Figure 8.13 shows a model to provide the solution to this question. Note that it takes account of the possibility of 0 failures. (b) Figure 8.14 shows a model to provide the solution to this question. Figure 8.15 compares the results for this question and part (a). Note that they are significantly different: had the time between events been exponentially distributed, the results would have been exactly the same. (c) The answer is simply the sum of 10 independent Weibull(1.3, 4020) distributions.

Figure 8.13 Model solution to Example 8.1(a): cumulative failure times are built up as running sums of Weibull(1.3, 4020) lifetimes, and each failure is counted only if it falls inside the 10 000-hour period of interest (e.g. =IF(B3>$D$21,0,1)).
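The renewal counting of Example 8.1(a) can be sketched outside a spreadsheet too. This is a minimal Monte Carlo version of the Figure 8.13 logic (the loop structure is mine; the Weibull parameters are the book's):

```python
import random

random.seed(4)

def failures_in(period_hours, shape=1.3, scale=4020.0):
    """Count bulb failures within the period, replacing immediately."""
    t, failures = 0.0, 0
    while True:
        t += random.weibullvariate(scale, shape)  # next bulb's lifetime
        if t > period_hours:
            return failures  # can legitimately be 0
        failures += 1

counts = [failures_in(10_000) for _ in range(5000)]
print(sum(counts) / len(counts))  # mean number of failures in 10 000 hours
```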
Figure 8.14 Model solution to Example 8.1(b): each of the 10 bulbs has its own running sum of Weibull(1.3, 4020) lifetimes (=VoseWeibull(1.3,4020) for the first lifetime, then =B3+VoseWeibull(1.3,4020) and so on), and failures within the 1000-hour period are counted.

Figure 8.15 Comparison of results from the models of Figures 8.13 and 8.14.

8.7 Mixture Distributions
Sometimes a stochastic process can be a combination of two or more separate processes. For example, car accidents at some particular place and time could be considered to be a Poisson variable, but the mean number of accidents per unit time λ may be a variable too, as we have seen in Section 8.3.7. A mixture distribution can be written symbolically as

FA(Θ) with Θ ~ FB

where FA represents the base distribution and FB represents the mixing distribution, i.e. the distribution of the parameter Θ. So, for example, we might have

Poisson(Gamma(α, β))

which reads as "a gamma mixture of Poisson distributions". There are a number of commonly used mixture distributions. For example,

Binomial(n, Beta(α, β))

which is the Beta-Binomial(n, α, β) distribution, and a beta mixture of Poisson distributions, where the Poisson parameter λ is the product of a constant and a Beta(α, β) variable. [Though also used in biology, this should not be confused with the beta-Poisson dose-response model.]

The cumulative distribution function for a mixture distribution with parameters Θi is given by the expectation of the conditional cumulative distribution function, where the expectation is taken with respect to the parameters that are random variables. Thus, the functional form of mixture distributions can quickly become extremely complicated or even intractable. However, Monte Carlo simulation allows us very simply to include mixture distributions in our models, provided that the Monte Carlo software being used (for example @RISK, Crystal Ball, ModelRisk) generates samples for each iteration in the correct logical sequence. So, for example, a Beta-Binomial(n, α, β) distribution is easily generated by writing =Binomial(n, Beta(α, β)).
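The same two-stage generation can be sketched in Python (parameter values are illustrative). Each draw samples p from the beta distribution first and then a binomial count with that p, and the resulting mixture is visibly overdispersed relative to a plain binomial:

```python
import random
import statistics

random.seed(5)

def beta_binomial(n, a, b):
    """One Beta-Binomial(n, a, b) draw: p from Beta(a, b), then Binomial(n, p)."""
    p = random.betavariate(a, b)
    return sum(random.random() < p for _ in range(n))

n, a, b = 20, 2, 3
draws = [beta_binomial(n, a, b) for _ in range(20000)]

# Theory: mean = n*a/(a+b) = 8; the variance comfortably exceeds the
# plain Binomial(20, 0.4) variance of 4.8 because p itself varies.
print(statistics.mean(draws), statistics.variance(draws))
```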
In each iteration, the software generates a value first from the beta distribution, then creates the appropriate binomial distribution using this value of p, and finally samples from that binomial distribution.

8.8 Martingales
A martingale is a stochastic process with sequential variables Xi (i = 1, 2, ...), where the expected value of each variable is the same and independent of previous observations. Written more formally,

E(Xn+1 | X1, ..., Xn) = E(X1) = μ

Thus, a martingale is any stochastic process with a constant mean. The theory was originally developed to demonstrate the fairness of gambling games, i.e. to show that the expected winnings of each turn of a game were constant; for example, to show that remembering the cards that had already been played in previous hands of a card game wouldn't impact upon your expected winnings. [Next time a friend says to you "21 hasn't come up in the lottery numbers for ages, so it must show soon", you can tell him or her "Not true, I'm afraid, it's a martingale" - they'll be sure finally to understand.] However, the theory has proven to be of considerable value in many real-world problems.

A martingale gets its name from the gambling "system" of doubling your bet on each loss of an even-odds bet (e.g. betting Red or Impair at the roulette wheel) until you have a win. It works too - well, in theory anyway: you must have a huge bankroll, and the casino must have no bet limit. It gives low returns for high risk, so as risk analysis consultants we would advise you to invest in (gamble on) the stock market instead.

8.9 Miscellaneous Examples
I have given below a few example problems for the different random processes discussed in this chapter, to give you some practice.

8.9.1 Binomial process problems
In addition to the problems below, the reader will find the binomial process appearing in the following examples distributed through this book: examples in Sections 4.3.1, 4.3.2 and 5.4.6 and Examples 22.2 to 22.6, 22.8 and 22.10, as well as many places in Chapter 9.
Example 8.2 Wine sampling
Two wine experts are each asked to guess the year of 20 different wines. Expert A guesses 11 correctly, while expert B guesses 14 correctly. How confident can we be that expert B is really better at this exercise than expert A?

If we allow that the guess of the year for each wine tasted is independent of every other guess, we can assume this to be a binomial process. We are thus interested in whether the probability of one expert guessing correctly is greater than the other's. We can model our uncertainty about the true probability of success for expert A as Beta(12, 10) and for expert B as Beta(15, 7). The model in Figure 8.16 then randomly samples from the two distributions, and cell C5 returns a 1 if the distribution for expert B has a greater value than the distribution for expert A. We run a simulation on this cell, and the mean result equals the fraction of the time that the distribution for expert B generated a higher value than that for expert A, and thus represents our confidence that expert B is indeed better at this exercise. In this case, we are 83 % confident.

Example 8.3 Run of luck
If I toss a coin 10 times, what is the distribution of the maximum number of heads I will get in a row? The solution is provided in the spreadsheet model of Figure 8.17.

Example 8.4 Multiple-choice exam
A multiple-choice exam gives three options for each of 50 questions. One student scores 21 out of 50. (a) What is the probability that the student would have achieved this score or higher without knowing anything about the subject? (b) Estimate how many questions the student actually knew the answer to.

(a) The student has a 1/3 probability of getting any answer right without knowing anything, so his or her score would then follow a Binomial(50, 1/3) distribution. The probability that the student would have achieved 21/50 or higher is then =1-BINOMDIST(20,50,1/3,1), i.e.
(1 − the probability of achieving 20 or lower).

Figure 8.16 Model for Example 8.2 (output cell C5: =IF(C4>C3,1,0)).

Figure 8.17 Model for Example 8.3.

(b) This is a Bayesian problem. Figure 8.18 illustrates a spreadsheet model of the Bayesian inference with a flat prior on the number of answers the student knew and a binomial likelihood function (=BINOMDIST(21-B3,50,1/3,0) for each candidate value, normalised by the column total, =C3/$C$25). The embedded graph is the posterior distribution of our belief about how many questions the student actually knew.

Figure 8.18 Model for Example 8.4(b).

8.9.2 Poisson process problems
In addition to the problems below, the reader will find the Poisson process appearing in the following examples distributed through this book: examples in Sections 9.2.2 and 9.3.2 and Examples 9.6, 9.11, 22.12, 22.14 and 22.16.

Example 8.5 Insurance problem
My company insures aeroplanes. They crash at a rate of 0.23 crashes per month. Each crash costs $Lognormal(120, 52) million. (a) What is the distribution of cost to the company for the next 5 years? (b) What is the distribution of the value of the liability if I discount it at the risk-free rate of 5 %?

The solution to part (a) is provided in the spreadsheet model of Figure 8.19, which uses the VLOOKUP Excel function. Part (b) requires that one know the time at which each accident occurs, using exponential distributions; the solution is shown in Figure 8.20.

Example 8.6 Rainwater barrel problem
It is a monsoon and rain is falling at a rate of 270 drops per second per square metre. The rain drops each contain 1 millilitre of water.
If I have a drum standing in the rain, measuring 1 metre high and 0.3 metres in radius, how long will it be before the drum is full? The solution is provided in the spreadsheet model of Figure 8.21: the drum volume is about 0.283 m³, so the number of drops needed is that volume divided by 0.000001 m³ (=ROUNDUP(D4/0.000001,0)); drops fall into the barrel at about 76.341 per second; and the waiting time is the sum of that many exponential inter-drop times, i.e. a gamma distribution (=Gamma(D6,1/D5)), giving around 3714 seconds in the iteration shown.

Figure 8.19 Model for Example 8.5(a).

Figure 8.20 Model for Example 8.5(b): accident times are built up from exponential inter-arrival times, each cost is discounted back at the risk-free rate, and accidents falling after the 60-month period contribute zero cost.

Figure 8.21 Model for Example 8.6.

Example 8.7 Equipment reliability
A piece of electronic equipment is composed of six components A to F, with the mean times between failures (MTBF) shown in Table 8.4. The components are in the serial and parallel configuration shown in Figure 8.22. What is the probability that the machine will fail within 250 hours?

Table 8.4 Mean time between failures of electronic equipment components (hours; the values include 27.8, 299, 1742.1, 1234 and 1417.9).

We first assume that the components fail with a constant probability per unit time, i.e. that their times to failure are exponentially distributed, which is a reasonable assumption implied by an MTBF figure. This problem belongs to reliability engineering. Components in series make the machine fail if any one of them fails; for parallel components, all the components in parallel must fail before the machine fails. Thus, from Figure 8.22, the machine will fail if A fails, or if B, C and D all fail, or if E and F both fail.
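That series/parallel logic can be simulated directly. The MTBF values below are placeholders of my own (the book's Table 8.4 values should be substituted); the structure - min over the series elements, max within each parallel block - is the point:

```python
import random

random.seed(7)

# Illustrative MTBFs in hours - NOT the book's Table 8.4 values.
mtbf = {"A": 1234.0, "B": 27.8, "C": 299.0, "D": 78.0,
        "E": 1417.9, "F": 1742.1}

def time_to_system_failure():
    # Exponential time to failure for each component,
    t = {c: random.expovariate(1.0 / m) for c, m in mtbf.items()}
    # then: fail when A fails, OR when B, C and D have ALL failed,
    # OR when both E and F have failed.
    return min(t["A"], max(t["B"], t["C"], t["D"]), max(t["E"], t["F"]))

runs = [time_to_system_failure() for _ in range(10000)]
print(sum(x < 250 for x in runs) / len(runs))  # P(failure within 250 h)
```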
Figure 8.22 also shows the spreadsheet modelling the time to failure: each component's time to failure is =Expon(MTBF) (cells D3:D8), and the system time to failure (the output, cell D10) is =MIN(D3,MAX(D4:D6),MAX(D7:D8)). Running a simulation with 10 000 iterations on cell D10 gives an output distribution of which 63.5 % of the trials were less than 250 hours.

Figure 8.22 Model for Example 8.7.

8.9.3 Hypergeometric process problems
In addition to the problems below, the reader will find the hypergeometric process appearing in the following examples distributed through this book: examples in Sections 22.4.2 and 22.4.4, as well as Examples 9.2, 9.3, 22.4, 22.6 and 22.8.

Example 8.8 Equal selection
I am to pick out at random 10 names from each of two bags. The first bag contains the names of 15 men and 22 women. The second bag contains the names of 12 men and 15 women. (a) What is the probability that I will have the same proportion of men in the two selections? (b) How many times would I have to sample from these bags before I did have the same proportion?

(a) The solution can be worked out mathematically or by simulation. Figure 8.23 provides the mathematical calculation and Figure 8.24 a simulation model, where the required probability is the mean of the output result. The mathematical model computes, for each possible number of men x from 0 to 10, the hypergeometric probability of drawing x men from each bag (=HYPGEOMDIST($B9,C$3,C$4,C$5) for each bag), multiplies the two probabilities together (=C9*D9) and sums the products (=SUM(E9:E19)): the total probability of the same number of men in each sample is 20.93 %.

Figure 8.23 Mathematical model for Example 8.8.

Figure 8.24 Simulation model for Example 8.8: each bag's sample of men is =Hypergeo(C3,C4,C5), and the output returns 1 when the two counts match (=IF(C7=D7,1,0)).
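The Figure 8.23 calculation is easy to reproduce outside the spreadsheet (this is a sketch of the same product-and-sum, using the hypergeometric pmf from earlier in the chapter):

```python
from math import comb

def hyper_pmf(x, n, D, M):
    return comb(D, x) * comb(M - D, n - x) / comb(M, n)

# Example 8.8(a): probability that both samples of 10 contain the same
# number of men. Bag 1: 15 men of 37 names; bag 2: 12 men of 27 names.
p_same = sum(hyper_pmf(x, 10, 15, 37) * hyper_pmf(x, 10, 12, 27)
             for x in range(11))
print(p_same)  # about 0.2093, the 20.93 % of Figure 8.23
```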
Figure 8.25 Model for Example 8.9.

(b) Each trial is independent of every other, so the number of trials before one success = 1 + NegBin(1, p) = 1 + Geometric(p), where p is the probability calculated in part (a).

Example 8.9 Playing cards
How many cards in a well-shuffled pack, complete with jokers, do I need to turn over to see a heart? There are 54 (= M) cards, of which 13 (= D) are hearts, and I am looking for s = 1 heart. The number of cards I must turn over is given by the formula 1 + InvHypergeo(1, 13, 54), which is the distribution shown in Figure 8.25.

Example 8.10 Faulty tyres
A tyre manufacturer has accidentally mixed up four tyres from a faulty batch with 20 other good tyres. Testing a tyre for the fault ruins it. If each tyre costs $75, and if the tyres are tested one at a time until the four faulty tyres are found, how much will this mistake cost? The solution is provided in the spreadsheet model of Figure 8.26.

8.9.4 Renewal and mixed process problems
In addition to the problems below, Examples 12.8 and 12.9 also deal with renewal and mixed process problems.

Example 8.11 Batteries
A certain brand of batteries lasts Weibull(2, 27) hours in my CD player, which takes two batteries at a time. I have a pack of 10 batteries. For how long can I run my CD player, given that I replace both batteries when one has run down? The solution is provided in the spreadsheet model of Figure 8.27.

Example 8.12 Queuing at a bank (Visual Basic modelling with Monte Carlo simulation)
A post office has one counter that it recognises is insufficient for its customer volume. It is considering putting in another counter and wishes to model the effect on the maximum number in a queue at any one time. It is open from 9 a.m. to 5 p.m. each working day. Past data show that, when the doors
Past data show that, when the doors 202 Risk Analysis tested tyres Probability Tyres actually tested /(COMBIN(M,B4+s-l)'(M-64-s+l )) =D+Discrete(64:622,C4:C22) Figure 8.26 Model for Example 8.10. Figure 8.27 Model for Example 8.11 open at 9 a.m., the number of people waiting to come in will be as shown in Table 8.5. People arrive throughout the day at a constant rate of one every 12 minutes. The amount of time it takes to serve each person is Lognormal(29, 23) minutes. What is the maximum queue size in a day? This problem requires that one simulate a day, monitor the maximum queue size during the day and then repeat the simulation. One thus builds up a distribution of the maximum number in a queue. The solution provided in Figures 8.28 and 8.29 and in the following program runs a looping Visual Basic macro called "Main Program" at each iteration of the model. This is an advanced technique and, although this problem is very simple, one can see how it can be greatly extended. For example, one could change the rate of arrival of the customers to be a function of the time of day; one could add Chapter 8 Some bas~crandom processes 203 Table 8.5 Historic data on the number of people waiting at the start of the business day. A1 People Probability 0 1 2 3 4 5 0.6 0.2 0.1 0.05 0.035 0.015 B C D E F IG 1 2 lnputs Average interarrivaltime (mins) Serving time mean Serving time stdev 3 4 5 6 7 8 9 --o l11 12 13 14 15 16 17 18 19 20 12 29 23 Model People in queue Time of day (minutesfrom 00:OO:OO) Latest customer at counter 1 Latest customer at counter 2 outputs Total customers served Maximum number in queue 12 1037.92 Customer Arrive time Serving time Finish time 0 997.24 40.68 1037.92 0 1006.16 36.58 1042.74 35 18 Formulae table C8:C9, C11:E12, C15:C16 Values updated by macro F11:F12 =El1 + D l 1 21 Figure 8.28 Sheet "model" for the model for Example 8.12. 
more counters; and one could monitor other statistical parameters besides the maximum queue size, like the maximum amount of time any one person waits or the amount of free time the people working behind the counters have.

Figure 8.29 Sheet "variables" for the model for Example 8.12: counter serving time =Lognorm(Model!$C$4,Model!$C$5); customers arriving while serving =IF(B10=0,0,Poisson(B10/Model!C3)); wait time for next customer =Expon(Model!C3); people waiting at 9:00 a.m. =Discrete({0,1,2,3,4,5},{0.6,0.2,0.1,0.05,0.035,0.015}); time in last step updated by the macro.

Visual Basic macro for Example 8.12:

'Set model variables
Dim modelWS As Object
Dim variableWS As Object

Sub Main_Program()
    Set modelWS = Workbooks("queue_model_test.xls").Worksheets("model")
    Set variableWS = Workbooks("queue_model_test.xls").Worksheets("variables")
    'Reset the model with the starting values (time of day, in minutes from 00:00:00)
    modelWS.Range("c9").Value = 9 * 60
    'Start serving customers
    Serve_First_Customer
    Serve_Next_Customer
End Sub

Sub Serve_First_Customer()
    'Serve at counter 1 if 0 people in queue
    If modelWS.Range("c8") = 0 Then
        modelWS.Range("c9") = modelWS.Range("c9").Value + variableWS.
Range("b6").Value
        modelWS.Range("c8") = 1
        Application.Calculate
        Routine_A
    End If
    'Serve at counter 1 if 1 person in queue
    If modelWS.Range("c8") = 1 Then
        Routine_A
    End If
    'Serve at counters 1 and 2 if 2 or more people in queue
    If modelWS.Range("c8") >= 2 Then
        Routine_A
        Routine_B
    End If
End Sub

Sub Serve_Next_Customer()
    'Calculate the new time of day
    variableWS.Range("b10") = Evaluate("=Max(Model!C9,Min(model!F11,model!F12))-Model!C9")
    modelWS.Range("c8") = modelWS.Range("c8").Value + variableWS.Range("B4").Value
    'Calculate the maximum number of people left in queue
    modelWS.Range("C16") = Evaluate("=max(model!c16,model!c8)")
    Application.Calculate
    modelWS.Range("c9") = modelWS.Range("c9").Value + variableWS.Range("B10").Value
    Application.Calculate
    'Check how many people are in the queue
    If modelWS.Range("c8") = 0 Then
        modelWS.Range("c9") = modelWS.Range("c9").Value + variableWS.Range("b6").Value
        modelWS.Range("c8") = 1
    End If
    Application.Calculate
    If modelWS.Range("c9") > 1020 Then Exit Sub   'stop at 5 p.m. (1020 minutes)
    If modelWS.Range("f11") <= modelWS.Range("f12") Then
        Routine_A
    Else
        Routine_B
    End If
    Application.Calculate
    Serve_Next_Customer
End Sub

'Next customer for counter 1
Sub Routine_A()
    modelWS.Range("c11") = 1
    modelWS.Range("D11") = modelWS.Range("c9").Value
    Application.Calculate
End Sub

'Next customer for counter 2
Sub Routine_B()
    modelWS.Range("c12") = 1
    modelWS.Range("d12") = modelWS.Range("c9").Value
    Application.Calculate
    modelWS.Range("e12") = variableWS.Range("B2").Value
    modelWS.Range("c15") = modelWS.Range("c15") + 1
    modelWS.Range("c8") = modelWS.Range("c8") - 1
    modelWS.Range("C12") = 0
    Application.Calculate
End Sub

Chapter 9 Data and statistics

Statistics is the discipline of fitting probability models to data.
In this chapter I go through a number of basic statistical techniques, from the simple z-tests and t-tests of the classical statistics world, through the basic ideas behind Bayesian statistics, to the application of simulation in statistics - the bootstrap for classical statistics and Markov chain Monte Carlo modelling for Bayesian statistics. If you have some statistics training, you may think my approach is rather inconsistent, as I have no problem using Bayesian and classical methods in the same model in spite of the philosophical inconsistencies between them. That's because classical statistics is still the most readily accepted type of statistical analysis - so a model using these methods is less contentious among certain audiences - but, on the other hand, Bayesian statistics can solve more problems. Moreover, Bayesian statistics is more consistent with risk analysis modelling, because we need to simulate uncertainty about model parameters so that we can see how that uncertainty propagates through a model to affect our ability to predict the outputs of interest, not just quote confidence intervals.

There are a few key messages I would like you to take away from this chapter. The first is that statistics is subjective: the choice of model that we fit to our data is a highly subjective decision. Even the most established statistical tests, like the z-test, t-test, F-test, chi-square test and regression models, have at their heart the (subjective) assumption that the underlying variable is normally distributed - which is very rarely the truth. These tests are really old - a hundred years old - and came to be used so much because one could restructure a number of basic problems into the form of one of these tests and look up the confidence values in published tables. We don't use tables any more - well, we shouldn't, anyway: they aren't very accurate, and even basic software like Excel can give you the answers directly.
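As a tiny illustration of "the software gives you the answer directly", here is the kind of lookup a z-table used to provide, computed exactly from the normal CDF (the z value is illustrative):

```python
import math

def normal_cdf(z):
    # Standard normal cumulative distribution function via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z = 1.96
p_two_sided = 2.0 * (1.0 - normal_cdf(z))
print(p_two_sided)  # about 0.05, with no table interpolation needed
```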
It's rather strange, then, that statistics books often still publish such tables. The second key message is that statistics does not need to be a black box. With a little understanding of probability models, it can become quite intuitive. The third is that there is ample room in statistics for creative thinking. If you have access to simulation methods, you are freed from having to find the right "test" for your particular problem. Most real-world problems are too complex for standardised statistical testing. The fourth is that statistics is intimately related to probability modelling. You won't understand statistics until you've understood probability theory, so learn that first. And lastly, statistics can be really quite a lot of fun as well as very informative. It's rare that a person coming to one of our courses is excited about the statistics part, and I can't blame them, but I like to think that they change their mind by the end. I studied mathematics and physics at undergraduate level and came away with really no useful appreciation of statistics, just a solid understanding of how astonishingly boring it was, because statistics was taught to me as a set of rules and equations, and any explanation of "Why?" was far beyond what we could hope to understand (at the same time we were learning about general relativity theory, quantum electrodynamics, etc.). At the beginning of this book, I discussed the importance of being able to distinguish between uncertainty (or epistemic uncertainty) and variability (or stochastic uncertainty). This chapter lays out a number of techniques that enable one quantitatively to describe the uncertainty (epistemic uncertainty) associated with the parameters of a model. Uncertainty is a function of the risk analyst, inasmuch as it is the description of the state of knowledge the risk analyst's clients have about particular parameters within his or her model.
A quantitative risk analysis model is structured around modelling the variability (randomness) of the world. However, we have imperfect knowledge of the parameters that define that model, so we must estimate their values from data, and, because we have only finite amounts of data, there will remain some uncertainty that we have to layer over our probability model. This chapter is concerned with determining the distributions of uncertainty for these parameters. I will assume that the analyst has somehow accumulated a set of data X = {x_1, x_2, ..., x_n} of n data points that has been obtained in such a manner as to be considered a random sample from a random process. The purpose of this chapter will be to determine the level of uncertainty, given these available data, associated with some parameter or parameters of the probability model. It will be useful here to set out some simple terminology:

The estimate of some statistical parameter of the parent distribution with true (but unknown) value, say p, is denoted by a hat, e.g. \hat{p}.

The sample mean of the dataset X is denoted by \bar{x}, i.e. \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.

The (unbiased) sample standard deviation of the dataset X is denoted by s, i.e. s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}.

The true mean and standard deviation of the population distribution are denoted by \mu and \sigma respectively.

9.1 Classical Statistics

The classical statistics techniques we all know (or at least remember we were once taught) are the z-test, t-test and chi-square test. They allow us to estimate the mean and variance of a random variable for which we have some randomly sampled data, as well as to address a number of other problems. I'm going to offer some fairly simple ways of understanding these statistical tests, but I first want to explain why the "tests" aren't much good to us as risk analysts in their standard form.
Let's take a typical t-test result: it will say something like the true mean = 9.63 with a 95 % confidence interval of [9.32, 9.94], meaning that we are 95 % sure that the true mean lies between 9.32 and 9.94. It doesn't mean that there is a 95 % probability that it will lie within these values - it either does or does not; what we are describing is how well we (the data holders, i.e. it is subjective) know the mean value. In risk analysis I may have several such parameters in my model. Let's say we have just three such parameters A, B and C estimated from different datasets, each with its best estimate and 95 % confidence bound. Let the model be A * B^(1/C). How can I combine these numbers to make an estimate of the uncertainty of my calculation? The answer is I can't. However, if I could convert the estimates to distributions I could perform a Monte Carlo simulation and get the answer at any confidence interval, or any percentile the decision-maker wishes. Thus, we have to convert these classical tests to distributions of uncertainty.

The classical statistics tests above are based on two basic statistical principles:

1. The pivotal method. This requires that I rearrange an equation so that the parameter being estimated is separated from any random variable.
2. A sufficient statistic. This means a sample statistic calculated from the data that contains all the information in the data that is related to estimating the parameter.

I'll use these ideas to explain the tests above and how they can be converted to uncertainty distributions.

9.1.1 The z-test

The z-test allows us to determine the best estimate and confidence interval for the mean of a normally distributed population where we happen to know the standard deviation of that population.
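To make the point concrete, here is a minimal Python sketch of that conversion. Only A's best estimate and interval (9.63, [9.32, 9.94]) come from the text; the figures for B and C, and the choice of normal uncertainty distributions, are hypothetical assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

def normal_from_ci(best, lo, hi, size):
    """Turn a best estimate and 95 % confidence interval into a normal
    uncertainty distribution: sd = half-width / 1.959964."""
    sd = (hi - lo) / (2 * 1.959964)
    return rng.normal(best, sd, size)

A = normal_from_ci(9.63, 9.32, 9.94, N)  # the t-test example from the text
B = normal_from_ci(4.0, 3.5, 4.5, N)     # hypothetical parameter B
C = normal_from_ci(2.0, 1.6, 2.4, N)     # hypothetical parameter C

# The model A * B^(1/C): propagating all three uncertainties together
model = A * B ** (1 / C)

# Any percentile of the output uncertainty is now available to the decision-maker
print(np.percentile(model, [2.5, 50, 97.5]))
```

The point is that once each parameter is a distribution rather than a three-number summary, the output uncertainty at any percentile falls out of the simulation directly.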
That would be quite an unusual situation, since the mean is usually more fundamental than the standard deviation, but it does occur sometimes; for example, when we take repeated measurements of some quantity (like the length of a room, or the weight of a beam). In this situation the random variable is not the length of the room, etc., but the results we will get. Look at the manual of a scientific measuring instrument and it should tell you the accuracy (e.g. ±1 mm). Sadly, the manufacturers don't usually tell us how to interpret these values - will the measurement lie within 1 mm of the true value 68 % (one standard deviation), 95 % (two standard deviations), etc., of the time? If the instrument manual were to say the measurement error has a standard deviation of 1 mm, we could apply the z-test. Let's say we are measuring some fixed quantity and that we take n such measurements. The sample mean is given by the formula

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

Here \bar{x} is the sufficient statistic. If the errors are normally distributed with mean \mu and standard deviation \sigma we have

\bar{x} = \mu + \text{Normal}(0, 1)\,\frac{\sigma}{\sqrt{n}}

Note how we have managed to rearrange the equation to place the random element Normal(0, 1) apart from the parameter we are trying to estimate. Now, thanks to the pivotal method, we can rearrange to make \mu the focus:

\mu = \bar{x} + \text{Normal}(0, 1)\,\frac{\sigma}{\sqrt{n}}    (9.1)

In the z-test we would have specified a confidence interval, say 95 %, and then looked up the "z-score" values for a Normal(0, 1) distribution that would correspond to 2.5 % and 97.5 % (i.e. centrally positioned values with 95 % between them), which are -1.95996 and +1.95996 respectively.¹ Then we'd write

\mu_{2.5\%} = \bar{x} - 1.95996\,\frac{\sigma}{\sqrt{n}}, \quad \mu_{97.5\%} = \bar{x} + 1.95996\,\frac{\sigma}{\sqrt{n}}

to get the lower and upper bounds respectively.

¹ You can get these values with ModelRisk using VoseNormal(0, 1, 0.025) and VoseNormal(0, 1, 0.975), or in Excel with =NORMSINV(0.025) and =NORMSINV(0.975).
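The z-test bounds, and the simulation version of the same uncertainty, can be sketched as follows (a Python sketch; the measurement data are hypothetical, and sigma = 1 mm follows the instrument-manual example):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical repeated measurements (mm) with known error sd sigma = 1 mm
sigma = 1.0
data = np.array([1002.3, 1001.1, 1003.0, 1002.2, 1001.8, 1002.6])
n, xbar = len(data), data.mean()

# Classical z-test: fixed 95 % bounds
z = 1.959964
lo = xbar - z * sigma / np.sqrt(n)
hi = xbar + z * sigma / np.sqrt(n)

# Risk-analysis version: simulate the whole uncertainty distribution
# mu = xbar + Normal(0,1) * sigma / sqrt(n)
mu = xbar + rng.standard_normal(100_000) * sigma / np.sqrt(n)

print(lo, hi)
print(np.percentile(mu, [2.5, 97.5]))  # closely matches (lo, hi)
```

The simulated 2.5th and 97.5th percentiles reproduce the classical bounds, but the full distribution of mu can now be fed into any downstream model.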
In a risk analysis simulation we just use

\mu = \text{Normal}\left(\bar{x}, \frac{\sigma}{\sqrt{n}}\right)

9.1.2 The chi-square test

The chi-square (χ²) test allows us to determine the best estimate and confidence interval for the standard deviation of a normally distributed population. There are two situations: we either know the mean \mu or we don't. Knowing the mean seems like an unusual scenario but happens, for example, when we are calibrating a measuring device against some known standard. In this case, the formula for the sample variance is given by

\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2

The sample variance in this case is the sufficient statistic for the population variance. Rewriting to get a pivotal quantity, we have

\frac{n\hat{\sigma}^2}{\sigma^2} = \sum_{i=1}^{n}\text{Normal}(0, 1)^2

However, the sum of n unit normal distributions squared is the definition of a chi-square distribution. Rearranging, we get

\sigma^2 = \frac{n\hat{\sigma}^2}{\chi^2(n)}    (9.2)

A χ²(n) distribution has mean n, so this formula is simply multiplying the sample variance by a random variable with mean 1. The chi-square test finds, say, the 2.5 and 97.5 percentiles² and inserts them into the above equation. For example, these percentiles for 10 degrees of freedom are 3.247 and 20.483. Since we are dividing by the chi-square random variable, the upper estimate corresponds to the lower chi-square value, and vice versa:

\sigma^2_{97.5\%} = \frac{n\hat{\sigma}^2}{3.247}, \quad \sigma^2_{2.5\%} = \frac{n\hat{\sigma}^2}{20.483}

In risk analysis modelling we would instead simulate values for \sigma using Equation (9.2):

\sigma = \sqrt{\frac{n\hat{\sigma}^2}{\chi^2(n)}}

² In ModelRisk use VoseChiSq(n, 0.025) and VoseChiSq(n, 0.975), and in Excel use CHIINV(0.975, n) and CHIINV(0.025, n) respectively.

Now let's consider what happens when we don't know the population mean, in which case statistical convention says that we use a slightly different formula for the sample variance measure:

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2

However, for a normal distribution it turns out that

\frac{(n-1)s^2}{\sigma^2} = \chi^2(n-1)

Rearranging, we get

\sigma^2 = \frac{(n-1)s^2}{\chi^2(n-1)}

9.1.3 The t-test

The t-test allows us to determine the best estimate and confidence interval for the mean of a normally distributed population where we don't know its standard deviation.
From Equation (9.1) we had the result when the population variance was known:

\mu = \bar{x} + \text{Normal}(0, 1)\,\frac{\sigma}{\sqrt{n}}

and from the chi-square analysis above we had the estimate for the variance when the mean is unknown:

\sigma^2 = \frac{(n-1)s^2}{\chi^2(n-1)}

Substituting for \sigma, we get

\mu = \bar{x} + \text{Normal}(0, 1)\sqrt{\frac{n-1}{\chi^2(n-1)}}\,\frac{s}{\sqrt{n}}

The definition of a Student(ν) distribution is a normal distribution with mean 0 and variance following a random variable ν/ChiSq(ν), so we have

\mu = \bar{x} + \text{Student}(n-1)\,\frac{s}{\sqrt{n}}    (9.4)

Knowing that the Student t-distribution is just a unit normal distribution with some randomness about its variance explains why a Student distribution has longer tails than a normal. The Student(ν) distribution has variance ν/(ν − 2), ν > 2, so at ν = 3 the variance is 3 and it rapidly decreases, so that by ν = 30 it is only 1.07 (a standard deviation of 1.035), and for ν = 50 a standard deviation of 1.02. The practical implication is that, when you have, say, 50 data points, there is only a 2 % difference in the confidence interval range whether you use a t-test (Equation (9.4)) or approximate with a z-test (Equation (9.1)), using the sample standard deviation s in place of \sigma.

9.1.4 Estimating a binomial probability or a proportion

In many problems we need to determine a binomial probability (e.g. the probability of a flood in a certain week of the year) or a proportion (e.g. the proportion of components that are made to a certain tolerance). In estimating both, we collect data. Each measurement point is a random variable that has a probability p of having the characteristic of interest. If all measurements are independent, and we assign a value to the measurement of 1 when the measurement has the characteristic of interest and 0 when it does not, the measurements can be thought of as a set of Bernoulli trials.
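Before turning to the binomial case, the chi-square and t-test conversions above can be sketched in the same simulation style (a Python sketch; the dataset is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000

# Hypothetical sample with both mean and sd unknown
data = np.array([9.4, 9.9, 9.6, 9.8, 9.5, 9.7, 9.3, 9.8, 9.6, 9.9])
n, xbar, s = len(data), data.mean(), data.std(ddof=1)

# Uncertainty about sigma: sigma^2 = (n-1) s^2 / ChiSq(n-1)
sigma = np.sqrt((n - 1) * s**2 / rng.chisquare(n - 1, N))

# Uncertainty about mu (the t-test form): mu = xbar + Student(n-1) * s / sqrt(n)
mu = xbar + rng.standard_t(n - 1, N) * s / np.sqrt(n)

print(np.percentile(sigma, [2.5, 97.5]))
print(np.percentile(mu, [2.5, 97.5]))
```

Both parameters now arrive as full uncertainty distributions rather than single intervals, ready to be sampled jointly inside a larger model.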
Letting P be the random variable of the proportion of this set of n trials {X_i} that have the characteristic of interest, it will take a distribution given by

P = \frac{\text{Binomial}(n, p)}{n}    (9.5)

We observe the proportion \hat{p} of the n trials that have the characteristic of interest, which is our one observation from the random variable P and also our MLE (see later) and unbiased estimate for p. Switching around Equation (9.5), we can get an uncertainty distribution for the true value of p:

p = \frac{\text{Binomial}(n, \hat{p})}{n}    (9.6)

We shall see later how this exactly equates to the non-parametric and parametric bootstrap estimates of a binomial probability. Equation (9.6) is a bit awkward since it will allow only (n + 1) discrete values for p, i.e. {0, 1/n, 2/n, ..., (n − 1)/n, 1}, whereas our uncertainty about p should really take into account all values between zero and 1. However, a Binomial(n, \hat{p}) has a mean and standard deviation given by

\text{mean} = n\hat{p}, \quad \text{sd} = \sqrt{n\hat{p}(1-\hat{p})}

and, from the central limit theorem, as n gets large the proportion of observations P will tend to a normal distribution, in which case Equation (9.6) can be rewritten as

p = \text{Normal}\left(\hat{p}, \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)

Equation (9.6) gives us what is known as the "exact binomial confidence interval", which is an awful name in my view because it actually gives us bounds for which we have at least the required confidence that the true value of p lies within. We never use this method. Another classical statistics method is to construct a cumulative uncertainty distribution, which is far more useful. We start by saying that, if we've observed s successes in n trials, the confidence that the true value of the probability is less than some value x is given by

F(x) = P(Y \ge s)

where Y = Binomial(n, x). In Excel we would write

= 1 - BINOMDIST(s - 1, n, x, TRUE)

By varying the value x from 0 to 1, we can construct the cumulative confidence. For example, Figure 9.1 shows examples with n = 10.

Figure 9.1 Cumulative distributions of the estimate of p for n = 10 trials and varying numbers of successes s.

This is an interesting method.
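The cumulative-confidence construction can be sketched directly from its definition (a Python sketch; `conf_p_below` is a hypothetical helper name implementing P(Binomial(n, x) >= s)):

```python
from math import comb

def conf_p_below(x, s, n):
    """Confidence that the true binomial probability is below x, having
    observed s successes in n trials: P(Binomial(n, x) >= s)."""
    return sum(comb(n, k) * x**k * (1 - x)**(n - k) for k in range(s, n + 1))

# Trace the curves of the kind shown in Figure 9.1 for n = 10 trials
n = 10
for s in (0, 2, 5, 8, 10):
    curve = [round(conf_p_below(x / 20, s, n), 3) for x in range(0, 21, 5)]
    print(s, curve)
```

Note that for s = 0 this raw formula is identically 1 over the whole range, which is exactly the degenerate behaviour the next paragraph discusses.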
Look at the scenario for s = 0: the cumulative distribution starts with a value of 50 % at p = 0, so it is saying that, with no successes observed, we have 50 % confidence that there is no binomial process at all - trials can't become successes - and the remaining 50 % confidence is distributed over p = (0, 1). The reverse logic applies where s = n. In ModelRisk we have a function VoseBinomialP(s, n, ProcessExists, U), where you input the successes s and trials n and, in the situation where s = 0 or n, you have the option to specify whether you know that the probability lies within (0, 1) (ProcessExists = TRUE). The U parameter also allows you to specify a cumulative percentile - if omitted, the function simulates random values of what the value of p might be. So, for example:

VoseBinomialP(10, 20, TRUE, 0.99) = VoseBinomialP(10, 20, FALSE, 0.99) = 0.74605
VoseBinomialP(0, 20, TRUE, 0.99) = 0.02522 (it assumes that p cannot be zero)
VoseBinomialP(0, 20, FALSE, 0.4) = 0 (it allows that p could be zero)

9.1.5 Estimating a Poisson intensity

In a Poisson process, countable events occur randomly in time or space - like earthquakes, financial crashes, car crashes, epidemics and customer arrivals. We need to estimate the base rate \lambda at which these events occur. So, for example, a city of 500 000 people may have had a murders last year: perhaps that was unluckily high, or luckily low. We'd like to know the degree of accuracy that we can place around the statement "The risk is a murders per year". Following a classical statistics approach similar to Section 9.1.4, we could write

\lambda = \frac{\text{Poisson}(a)}{1}

where 1 refers to the single year of counting. We could recognise that a Poisson(a) distribution has mean and variance equal to a and looks normal when a is large:

\lambda = \frac{\text{Normal}(a, \sqrt{a})}{1}

The method suffers the same problems as the binomial: if we haven't yet observed any murders this year, the formulae don't work.
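A quick sketch of these two uncertainty simulations (Python; the observed count a = 14 murders in t = 1 year is a hypothetical figure):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 100_000

a, t = 14, 1.0  # hypothetical: 14 events observed over one year

# Normal approximation (valid for large a): lambda = Normal(a, sqrt(a)) / t
lam_norm = rng.normal(a, np.sqrt(a), N) / t

# Poisson-resampling version, the direct analogue of the binomial Equation (9.6)
lam_pois = rng.poisson(a, N) / t

print(np.percentile(lam_norm, [2.5, 97.5]))
print(np.percentile(lam_pois, [2.5, 97.5]))
```

With a = 0 both recipes collapse to a point at zero, which is the failure the text describes.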
A classical statistics alternative is again to construct the cumulative confidence distribution using

F(\lambda) = P(X \ge a), \text{ where } X = \text{Poisson}(\lambda t)

Figure 9.2 shows some examples of the cumulative distribution that can be constructed from this formula.

Figure 9.2 Cumulative distributions of the estimate of \lambda for varying numbers of observations a.

In ModelRisk there is a function VosePoissonLambda(a, t, ProcessExists, U), where you input the counts a and the time t over which they have been observed, and in the situation where a = 0 you have the option to specify whether you know that the intensity is non-zero (ProcessExists = TRUE). The U parameter also allows you to specify a cumulative percentile - if omitted, the function simulates random values of what the value of \lambda might be. So, for example:

VosePoissonLambda(2, 3, TRUE, 0.2) = VosePoissonLambda(2, 3, FALSE, 0.2)
VosePoissonLambda(0, 3, TRUE, 0.2) = 0.203324 (it assumes that \lambda cannot be zero)
VosePoissonLambda(0, 3, FALSE, 0.2) = 0 (it allows that \lambda could be zero)

9.2 Bayesian Inference

The Bayesian approach to statistics has enjoyed something of a renaissance over the latter half of the twentieth century, but there still remains a schism among the scientific community over the Bayesian position. Many scientists, and particularly many classically trained statisticians, believe that science should be objective and therefore dislike any methodology that is based on subjectivism. There are, of course, a host of counterarguments. Experimental design is subjective to begin with; classical statistics are limited in that they make certain assumptions (normally distributed errors or populations, for example), and scientists have to use their judgement in deciding whether such an assumption is sufficiently well met; moreover, at the end of a statistical analysis one is often asked to accept or reject a hypothesis by picking (quite subjectively) a level of significance (p values).
For the risk analyst, subjectivism is a fact of life. Each model one builds is only an approximation of the real world. Decisions about the structure and acceptable accuracy of the risk analyst's model are very subjective. Added to all this, the risk analyst must very often rely on subjective estimates for many model inputs, frequently without any data to back them up. Bayesian inference is an extremely powerful technique, based on Bayes' theorem (sometimes called Bayes' formula), for using data to improve one's estimate of a parameter. There are essentially three steps involved: (1) determining a prior estimate of the parameter in the form of a confidence distribution; (2) finding an appropriate likelihood function for the observed data; (3) calculating the posterior (i.e. revised) estimate of the parameter by multiplying the prior distribution and the likelihood function, then normalising so that the result is a true distribution of confidence (i.e. the area under the curve equals 1). The first part of this section introduces the concept and provides some simple examples. The second part explains how to determine prior distributions. The third part looks more closely at likelihood functions, and the fourth part explains how normalising of the posterior distribution is carried out.

9.2.1 Introduction

Bayesian inference is based on Bayes' theorem (Section 6.3.5), the logic of which was first proposed in Bayes (1763). Bayes' theorem states that

P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}

We will change the notation of that formula, for the purpose of explaining Bayesian inference, to a notation often used in the Bayesian world:

f(\theta|x) = \frac{\pi(\theta)\,l(x|\theta)}{\int \pi(\theta)\,l(x|\theta)\,d\theta}    (9.8)

Bayesian inference mathematically describes the learning process. We start off with an opinion, however vague, and then modify our opinion when presented with evidence. The components of Equation (9.8) are:

\pi(\theta) - the "prior distribution". \pi(\theta) is the density function of our prior belief about the parameter value \theta before we have observed the data x.
In other words, \pi(\theta) is not a probability distribution of \theta but rather an uncertainty distribution: it is an adequate representation of the state of our knowledge about \theta before the data x were observed.

l(x|\theta) - the "likelihood function". l(x|\theta) is the calculated probability of randomly observing the data x for a given value of \theta. The shape of the likelihood function embodies the amount of information contained in the data. If the information it contains is small, the likelihood function will be broadly distributed, whereas if the information it contains is large, the likelihood function will be very focused around some particular value of the parameter. However, if the shape of the likelihood function corresponds strongly to the prior distribution, the amount of extra information the likelihood function embodies is relatively small and the posterior distribution will not differ greatly from the prior. In other words, one would not have learned very much from the data. On the other hand, if the shape of the likelihood function is very different from the prior, we will have learned a lot from the data.

f(\theta|x) - the "posterior distribution". f(\theta|x) is the description of our state of knowledge of \theta after we have observed the data x, given our opinion of the value of \theta before x was observed.

The denominator in Equation (9.8) simply normalises the posterior distribution to have a total area equal to 1. Since the denominator is simply a scalar value and not a function of \theta, one can rewrite Equation (9.8) in a form that is generally more convenient:

f(\theta|x) \propto \pi(\theta)\,l(x|\theta)    (9.9)

The \propto symbol means "is proportional to", so this equation shows that the value of the posterior distribution density function, evaluated at some value of \theta, is proportional to the product of the prior distribution density function at that value of \theta and the likelihood of observing the dataset x if that value of \theta were the parameter's true value.
It is interesting to observe that Bayesian inference is thus not interested in the absolute values of the prior and likelihood function, but only their shapes. In writing equations of the form of Equation (9.9), we are taking as read that one will eventually have to normalise the distribution. Bayesian inference seems to confuse a lot of people rather quickly. I have found that the easiest way to understand it, and to explain it, is through examples.

Example 9.1

I have three "loonies" (Canadian one dollar coins - they have a loon on the tail face) in my pocket. Two of them are regular coins, but the third is a weighted coin that has a 70 % chance of landing heads up. I cannot tell the coins apart on inspection. I take a coin out of my pocket at random and toss it - it lands heads up. What is the probability that the coin is the weighted coin? Let's start by noting that the probability, as I have defined the term probability in Chapter 6.2, that the coin is the weighted one is either 0 or 1: it either is not the weighted coin or it is. The problem should really be phrased "What confidence do I have that the tossed coin is weighted?", as I am only dealing with the state of my knowledge. When I took the coin out of my pocket but before I had tossed it, I would have said I was 1/3 confident that the coin in my hand was weighted, and 2/3 confident it was not weighted. My prior distribution \pi(\theta) for the state of the coin would thus look like Figure 9.3, i.e. a discrete distribution with two allowed values {not weighted, weighted} with confidences {2/3, 1/3} respectively. Now I toss the coin and it lands heads up. If the coin were fair, it would have a probability of 1/2 of landing that way. My confidence that I took out a fair coin from my pocket and then tossed a head (call it scenario A) is therefore proportional to my prior belief multiplied by the likelihood, i.e. 2/3 * 1/2 = 1/3.
On the other hand, I am also 1/3 confident that the coin could have been weighted, and then it would have had a probability of 7/10 of landing that way. My confidence that I took out the weighted coin from my pocket and then tossed a head (call it scenario B) is therefore proportional to 1/3 * 7/10 = 7/30.

Figure 9.3 Prior distribution for the weighted coin example: a Discrete({0, 1}, {2/3, 1/3}).

The two values 1/2 and 7/10 used for the probability of observing a head were conditional on the type of coin that was being tossed. These two values represent, in this problem, the likelihood function. We will look at some more general likelihood functions in the following examples. Now, we know that one of scenarios A and B must have actually occurred, since we did observe a head. We must therefore normalise my confidence for these two scenarios so that they add up to 1, i.e. 1/3 becomes 10/17 and 7/30 becomes 7/17. This normalising is the purpose of the denominator in Equation (9.8). I am now 10/17 confident that the coin is fair and 7/17 confident that it is weighted: I still think it more likely I tossed a fair coin than a weighted coin. Let us imagine that we toss the coin again and observe another head. How would this affect my confidence distribution of the state of the coin? Well, the posterior confidence of selecting a fair coin and observing two heads (scenario C) is proportional to 2/3 * 1/2 * 1/2 = 1/6. The posterior confidence of selecting the weighted coin and observing two heads (scenario D) is proportional to 1/3 * 7/10 * 7/10 = 49/300. Normalising these two, we get 50/99 and 49/99. Now I am roughly equally confident about whether I had tossed a fair or a weighted coin. Figure 9.4 depicts posterior distributions for the above example, plus the posterior distributions for a few more tosses of a coin where each toss resulted in a head. One can see that, as the number of observations (data) we have grows, our prior belief gets swamped by what the data say is really possible, i.e.
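The arithmetic of this example can be checked with exact fractions (a Python sketch of the Bayesian update described above):

```python
from fractions import Fraction

# Prior: two fair coins and one weighted (P(head) = 7/10) coin in the pocket
prior = {"fair": Fraction(2, 3), "weighted": Fraction(1, 3)}
p_head = {"fair": Fraction(1, 2), "weighted": Fraction(7, 10)}

def update(prior, heads):
    """Posterior over the coin's state after observing that many heads
    in that many tosses: multiply prior by likelihood, then normalise."""
    post = {c: prior[c] * p_head[c] ** heads for c in prior}
    total = sum(post.values())
    return {c: v / total for c, v in post.items()}

print(update(prior, 1))  # fair 10/17, weighted 7/17 - scenario A vs B
print(update(prior, 2))  # fair 50/99, weighted 49/99 - scenario C vs D
```

Working in `Fraction` rather than floats reproduces the text's 10/17, 7/17 and 50/99, 49/99 exactly.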
by the information contained in the data.

Figure 9.4 Posterior distributions for the coin tossing example with increasing numbers of heads (2 heads in 2 tosses, 3 in 3, 4 in 4, 5 in 5 and 10 in 10).

Example 9.2

A game warden on a tropical island would like to know how many tigers she has on her island. It is a big island with dense jungle and she has a limited budget, so she can't search every inch of the island methodically. Besides, she wants to disturb the tigers and the other fauna as little as possible. She arranges for a capture-recapture survey to be carried out as follows. Hidden traps are laid at random points on the island. The traps are furnished with transmitters that signal a catch, and each captured tiger is retrieved immediately. When 20 tigers have been caught, the traps are removed. Each of these 20 tigers is carefully sedated and marked with an ear tag, then all are released together back to the positions where they were originally caught. Some short time later, hidden traps are laid again, but at different points on the island, until 30 tigers have been caught, and the number of tagged tigers is recorded. Captured tigers are held in captivity until the 30th tiger has been caught. The game warden tries the experiment, and seven of the 30 tigers captured in the second set of traps are tagged. How many tigers are there on the island? The warden has gone to some lengths to specify the experiment precisely. This is so that we will be able to assume within reasonable accuracy that the experiment is taking a hypergeometric sample from the tiger population (Section 8.4). A hypergeometric sample assumes that an individual with the characteristic of interest (in this case, a tagged tiger) has the same probability of being sampled as any individual that does not have that characteristic (i.e. the untagged tigers).
The reader may enjoy thinking through what assumptions are being made in this analysis and where the experimental design has attempted to minimise any deviation from a true hypergeometric sampling. We will use the usual notation for a hypergeometric process:

n - the sample size = 30.
D - the number of individuals in the population of interest (tagged tigers) = 20.
M - the population (the number of tigers in the jungle). In the Bayesian inference terminology, this is given the symbol \theta as it is the parameter we are attempting to estimate.
x - the number of individuals in the sample that have the characteristic of interest = 7.

We could get a best guess for M by noting that the most likely scenario would be for us to see tagged tigers in the sample in the same proportion as they occur in the population. In other words

\frac{x}{n} \approx \frac{D}{M}, \text{ i.e. } \frac{7}{30} \approx \frac{20}{M}, \text{ which gives } M \approx 85 \text{ to } 86

but this does not take account of the uncertainty that occurs owing to the random sampling involved in the experiment. Let us imagine that before the experiment was started the warden and her staff believed that the number of tigers was equally likely to be any one value as any other. In other words, they knew absolutely nothing about the number of tigers in the jungle, and their prior distribution is thus a discrete uniform distribution over all non-negative integers. This is rather unlikely, of course, but we will discuss better prior distributions in Section 9.2.2. The likelihood function is given by the probability mass function of the hypergeometric distribution, i.e.

l(x|\theta) = \frac{\binom{D}{x}\binom{\theta - D}{n - x}}{\binom{\theta}{n}} \text{ for } \theta \ge 43, \text{ and } l(x|\theta) = 0 \text{ otherwise}

The likelihood function is 0 for values of \theta below 43, as the experiment tells us that there must be at least 43 tigers: 20 that were tagged plus the (30 - 7) that were caught in the recapture part of the experiment and were not tagged.
The probability mass function (Section 6.1.2) applies to a discrete distribution and equals the probability that exactly x events will occur. Excel provides a convenient function HYPGEOMDIST(x, n, D, M) that will calculate the hypergeometric distribution mass function automatically, but it generates errors instead of zero when \theta < 43, so I have used the equivalent ModelRisk function. Figure 9.5 illustrates a spreadsheet where a discrete uniform prior, with values of \theta running from 0 to 150, is multiplied by the likelihood function above to arrive at a posterior distribution. We know that the total confidence must add up to 1, which is done in column F to produce the normalised posterior distribution. The shape of this posterior distribution is shown in Figure 9.6 by plotting column B against column F from the spreadsheet. The graph peaks at a value of 85, as we would expect, but it appears cut off at the right tail, which shows that we should also look at values of \theta larger than 150. The analysis is repeated for values of \theta up to 300, and this more complete posterior distribution is plotted in Figure 9.7. This second plot represents a good model of the state of the warden's knowledge about the number of tigers on the island. Don't forget that this is a distribution of belief and is not a true probability distribution, since there is an exact number of tigers on the island.

Formulae table:
C3:C6      constants
B10:B117   {43, ..., 150}
C10:C117   1
D10:D117   =VoseHypergeoProb(x, n, D, B10)
E10:E117   =D10*C10
E7         =SUM(E10:E117)
F10:F117   =E10/$E$7

Figure 9.5 Bayesian inference model for the tiger capture-release-recapture problem.

Figure 9.6 First pass at a posterior distribution for the tagged tiger problem.

In this example, we had to adjust our range of tested values of \theta in light of the posterior distribution.
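The spreadsheet calculation of Figure 9.5 can be sketched outside Excel as well (a Python sketch; `hyper_pmf` is a hypothetical helper standing in for VoseHypergeoProb):

```python
from math import comb

D, n, x = 20, 30, 7  # tagged tigers, recapture sample size, tagged in sample

def hyper_pmf(x, n, D, M):
    """Hypergeometric likelihood of seeing x tagged tigers in a sample of n
    from a population of M containing D tagged ones; 0 where impossible."""
    if M < D + (n - x):
        return 0.0
    return comb(D, x) * comb(M - D, n - x) / comb(M, n)

thetas = range(43, 301)                    # candidate population sizes
prior = [1.0] * len(thetas)                # discrete uniform prior
like = [hyper_pmf(x, n, D, M) for M in thetas]
unnorm = [p * l for p, l in zip(prior, like)]
total = sum(unnorm)                        # normalising constant (column E7)
posterior = [u / total for u in unnorm]    # normalised posterior (column F)

mode = thetas[max(range(len(posterior)), key=posterior.__getitem__)]
print(mode)  # the posterior peaks at M = 85, as in the text
```

With a uniform prior, the posterior mode coincides with the maximum-likelihood estimate, which is why the grid peaks at the same value as the simple proportional guess.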
It is quite common to review the set of tested values of \theta, either expanding the prior's range or modelling some part of the prior's range in more detail when the posterior distribution is concentrated around a small range. It is entirely appropriate to expand the range of the prior as long as we would have been happy to have extended our prior to the new range before seeing the data. However, it would not be appropriate if we had a much more informed prior belief that gave an absolute range for the uncertain parameter outside of which we are now considering stepping. This would not be right because we would be revising our prior belief in light of the data: putting the cart before the horse, if you like. However, if the likelihood function is concentrated very much at one end of the range of the prior, it may well be worth reviewing whether the prior distribution or the likelihood function is appropriate, since the analysis could be suggesting that the true value of the parameter lies outside the preconceived range of the prior. Continuing with our tigers on an island, let us imagine that the warden is unsatisfied with the level of uncertainty that remains about the number of tigers, which, from 50 to 250, is rather large. She decides to wait a short while and then capture another 30 tigers. The experiment is completed, and this time t tagged tigers are captured. Assuming that a tagged tiger still has the same probability of being captured as an untagged tiger, what is her uncertainty distribution now for the number of tigers on the island?

Figure 9.7 Improved posterior distribution for the tagged tiger problem.

This is simply a replication of the first problem, except that we no longer use a discrete uniform distribution as her prior.
Instead, the distribution of Figure 9.7 represents the state of her knowledge prior to doing this second experiment, and the likelihood function is now given by the Excel function HYPGEOMDIST(t, 30, 20, θ), or equivalently VoseHypergeoProb(t, 30, 20, θ, 0). The six panels of Figure 9.8 show what the warden's posterior distribution would have been if the second experiment had trapped t = 1, 3, 5, 7, 10 and 15 tagged tigers. These posteriors are plotted together with the prior of Figure 9.7 and the likelihood functions, normalised to sum to 1 for ease of comparison. You might initially imagine that performing another experiment would make you more confident about the actual number of tigers on the island, but the graphs of Figure 9.8 show that this is not necessarily so. In the top two panels the posterior distribution is more spread than the prior because the data contradict the prior (the prior and likelihood peak at very different values of θ). In the middle left panel, the likelihood disagrees moderately with the prior, but the extra information in the data compensates for this, leaving us with about the same level of uncertainty but with a posterior distribution that is to the right of the prior. The middle right panel represents the scenario where the second experiment has the same result as the first. You'll see that the prior and likelihood overlay each other because the prior of the first experiment was uniform and therefore the posterior's shape was influenced only by the likelihood function. Since both experiments produced the same result, our confidence is improved and remains centred around the best guess of 85. In the bottom two panels, the likelihood functions disagree with the priors, yet the posterior distributions have a narrower uncertainty. This is because the likelihood function is placing emphasis on the left tail of the possible range of values for θ, which is bounded at θ = 43.
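The sequential updating in these panels is the same grid calculation run twice: the first posterior becomes the prior for the second experiment, whose likelihood is again hypergeometric (still 20 tagged tigers, a new sample of 30). Here is a Python sketch of the middle right panel's case (t = 7), with my own illustrative names, not the book's model:

```python
import math

def hypergeo_pmf(x, n, D, M):
    # hypergeometric mass, zero outside the support
    if x > D or x > n or n - x > M - D or n > M:
        return 0.0
    return math.comb(D, x) * math.comb(M - D, n - x) / math.comb(M, n)

def update(prior, x, n, D):
    # one Bayesian grid update: multiply by the likelihood, then renormalise
    post = {t: p * hypergeo_pmf(x, n, D, t) for t, p in prior.items()}
    norm = sum(post.values())
    return {t: p / norm for t, p in post.items()}

def sd(dist):
    # standard deviation of a discrete distribution given as {value: prob}
    m = sum(t * p for t, p in dist.items())
    return sum(p * (t - m) ** 2 for t, p in dist.items()) ** 0.5

uniform = {t: 1.0 for t in range(43, 301)}   # discrete uniform prior
post1 = update(uniform, 7, 30, 20)           # first experiment: 7 tagged of 30
post2 = update(post1, 7, 30, 20)             # second experiment, same result
```

Because the two experiments gave the same result, the posterior stays centred near 85 but is narrower than after the first experiment alone.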
In summary, the graphs of Figure 9.8 show that the amount of information contained in data depends on two things: (1) the manner in which the data were collected (i.e. the level of randomness inherent in the collection), which is described by the likelihood function; and (2) the state of our knowledge prior to observing the data and the degree to which it compares with the likelihood function. If the data tell us what we are already fairly sure of, there is little information contained in the data for us (though the data would contain much more information for those more ignorant of the parameter). On the other hand, if the data contradict what we already know, our uncertainty may either decrease or increase, depending on the circumstances.

Figure 9.8 Tagged tiger problem: panels (a), (b), (c), (d), (e) and (f) show prior distributions, likelihood functions and posterior distributions if the second experiment had trapped 1, 3, 5, 7, 10 and 15 tagged tigers respectively (prior distribution shown as empty circles, likelihood function as grey lines and posterior distributions as black lines).

Example 9.3 Twenty people are randomly picked off a city street in France. Whether they are male or female is noted on 20 identical pieces of paper, the papers are put into a hat and the hat is brought to me. I have not seen these 20 people. I take out five pieces of paper from the hat and read them - three are female. I am then asked to estimate the number of females in the original group of 20.
I can express my estimate as a confidence distribution of the possible values. I might argue that, prior to reading the five names, I had no knowledge of the number of people who would be female and so would assign a discrete uniform prior from 0 to 20. However, it would be better to argue that roughly 50% of people are female, and so a much better prior distribution would be a Binomial(20, 0.5). This is equivalent to a Duniform prior, followed by a Binomial(20, 0.5) likelihood for the number of females that would be randomly selected from a population in a sample of 20. The likelihood function relating to sampling five people from the group is again hypergeometric, except that in this problem we know the total population (i.e. M = 20), we know the sample size (n = 5) and we know the number observed in the sample with the required property (x = 3), but we don't know the number of females D, which we denote by θ as it is the parameter to be estimated. Figure 9.9 illustrates the spreadsheet model for this problem, using the binomial distribution prior. This spreadsheet makes use of ModelRisk's VoseBinomialProb(x, n, p, cumulative), equivalently the Excel function BINOMDIST(x, n, p, cumulative), which returns a probability evaluated at x for a Binomial(n, p) distribution. The cumulative parameter toggles the function between returning a probability mass (cumulative = 0 or FALSE) and a cumulative probability (cumulative = 1 or TRUE). The IF statement in cells D8:D28 is unnecessary with the VoseHypergeoProb function, which will return a zero, but is needed to avoid errors if you use Excel's HYPGEOMDIST function in its place. Figure 9.10 shows the resultant posterior distribution, together with the likelihood function and the prior. Here we can see that the prior is very strong and the amount of information embedded in the likelihood function is small, so the posterior distribution is quite close to the prior.
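The grid calculation of Figure 9.9 can be sketched in Python (my own illustrative code, not the book's):

```python
import math

def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def hypergeo_pmf(x, n, D, M):
    # zero outside the support - the job of the IF statement in the spreadsheet
    if x > D or n - x > M - D:
        return 0.0
    return math.comb(D, x) * math.comb(M - D, n - x) / math.comb(M, n)

M, n, x = 20, 5, 3                                   # population, sample, observed
thetas = range(21)                                   # column B: theta = 0..20
prior = [binom_pmf(t, 20, 0.5) for t in thetas]      # column C
like = [hypergeo_pmf(x, n, t, M) for t in thetas]    # column D
post = [pr * li for pr, li in zip(prior, like)]      # column E
norm = sum(post)                                     # cell E29
post = [p / norm for p in post]                      # column F

mean = sum(t * p for t, p in zip(thetas, post))
```

Note that the normalising constant comes out at exactly 0.3125, the value in cell E29, and the posterior is zero outside θ = 3 to 18, which is what the IF statement guards against.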
The posterior distribution is a sort of compromise between the prior and likelihood function, in that it finds a distribution that agrees as much as possible with both. Hence, the peak of the posterior distribution now lies somewhere between the peaks of the prior and the likelihood function. The effect of the likelihood function is small because the sample is small (a sample of 5) and because it does not disagree with the prior (the prior has a maximum at θ = 10, and this value of θ also produces one of the highest likelihood function values). For comparison, Figure 9.11 shows the prior and posterior distributions if one had used a discrete uniform prior. Since the prior is flat in this case, it contributes nothing to the posterior's shape and the likelihood function becomes the posterior distribution.

Figure 9.9 Bayesian inference model for the number of "females in a hat" problem. Formulae table:
C3:C4 constants (n, x)
B8:B28 {0, 1, ..., 19, 20}
C8:C28 =VoseBinomialProb(B8, 20, 0.5, 0)
D8:D28 =IF(OR(B8<x, B8>20-(n-x)), 0, VoseHypergeoProb(x, n, B8, 20))
E8:E28 =C8*D8
E29 =SUM(E8:E28) (which evaluates to 0.3125)
F8:F28 =E8/$E$29

Figure 9.10 Prior distribution, likelihood function and posterior distribution for the model of Figure 9.9 using a Binomial(20, 0.5) prior.

Figure 9.11 Prior and posterior distributions for the model of Figure 9.9 with a Duniform({0, ..., 20}) prior.

Hyperparameters

I assumed in Example 9.3 that the prevalence of females in France is 50%.
However, knowing that females on average live longer than males, this figure will be a slight underestimate. Perhaps I should have used a value of 51% or 52%. In Bayesian inference, I can include uncertainty about one or more of the parameters in the analysis. For example, I could model p with a PERT(50%, 51%, 52%). Uncertain parameters are called hyperparameters. In the algebraic form of a Bayesian inference calculation, I then integrate out this nuisance parameter, which in reality can be a bit tricky to carry out. Let's look again at the Bayesian inference calculation in the spreadsheet of Figure 9.9. If I have uncertainty about the prevalence of females p, I should assign a distribution to its value, in which case there would then be uncertainty about the posterior distribution. I cannot have uncertainty about my uncertainty: it doesn't make sense. This is why we must integrate out (i.e. aggregate) the effect of uncertainty about p on the posterior distribution. We can do this very easily using Monte Carlo simulation, instead of the more onerous algebraic integration. We simply include a distribution for p in our model, nominate the entire array for the posterior distribution as an output and simulate. The set of means of the generated values for each cell in the array constitutes the final posterior distribution.

Simulating a Bayesian inference calculation

We could have done the same Bayesian inference analysis for Example 9.3 by simulation. Figure 9.12 illustrates a spreadsheet model that performs the Bayesian inference, together with a plot of the model result. In cell C3, a Binomial(20, 0.5) distribution represents the prior.
It is randomly generating possible scenarios of the number of "females" in the hat. In cell C4 a sample of five people is modelled using a Hypergeo(5, D, 20), where D is the result from the binomial distribution. The IF statement here (=IF(C3=0, 0, VoseHypergeo(5, C3, 20))) is unnecessary because VoseHypergeo supports D = 0 but, for example, @RISK's RiskHypergeo(5, 0, 20) returns an error. This represents one-half of the likelihood function logic. Finally, in cell C5, the generated value from the binomial distribution in cell C3 is accepted (and therefore stored in memory) if the hypergeometric distribution produces a 3 - the number of females observed in the experiment. This is equivalent to the second half of the likelihood function logic. By running a large number of iterations, a large number of generated values from the binomial will be accepted. The proportion of times that a particular value from the binomial distribution is accepted equates to the hypergeometric probability that three females would subsequently be observed in a random sample of five from the group. I ran this model for 100,000 iterations, and 31,343 values were accepted, which equates to about 31% of the iterations.

Figure 9.12 Simulation model for the problem of Figure 9.9.
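The accept/reject logic is easy to reproduce in Python (an illustrative stand-in for the spreadsheet model, with my own names): draw the number of females from the prior, sample five slips without replacement, and keep the prior draw only when exactly three females appear.

```python
import random

random.seed(7)
iterations = 20000
accepted = []
for _ in range(iterations):
    # prior draw (cell C3): number of females among 20 people, each 50:50
    females = sum(random.random() < 0.5 for _ in range(20))
    # likelihood draw (cell C4): pull 5 of the 20 slips without replacement;
    # females = 0 poses no problem here, unlike RiskHypergeo
    slips = [1] * females + [0] * (20 - females)
    observed = sum(random.sample(slips, 5))
    if observed == 3:                 # the acceptance test (cell C5)
        accepted.append(females)

rate = len(accepted) / iterations
post_mean = sum(accepted) / len(accepted)
```

The acceptance rate converges on 0.3125, matching the roughly 31% reported above, and the accepted values trace out the posterior distribution.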
The technique is interesting but has limited applications, since, for more complex problems or those with larger numbers, it becomes very inefficient as the percentage of iterations that are accepted becomes very small indeed. It is also difficult to use where the parameter being estimated is continuous rather than discrete, in which case one is forced to use a logic that accepts the generated prior value if the generated result lies within some range of the observed result. However, to combat this inefficiency, one can alter the prior distribution to generate only values that the experimental results have shown to be possible. For example, in this problem, there must be between 3 and 18 females in the group of 20, whereas the Binomial(20, 0.5) is generating values between 0 and 20. Furthermore, one could run several passes, cutting down the prior with each pass to home in on only those values that are feasible. One can also get more detail in the tails by multiplying up the mass of some values x, y, z (for example, in the tails of the prior) by some factor, then dividing the heights of the posterior tail at x, y and z by that factor. While this technique consumes a lot of simulation time, the models are very simple to construct and one can also consider multiple-parameter priors.

Let us look again at the choice of priors for this problem, i.e. either a Duniform({0, ..., 20}) or a Binomial(20, 50%). One might consider that the Duniform distribution is less informed (i.e. says less) than the binomial distribution. However, we can turn the Duniform distribution around and ask what it would have said about our prior belief of the probability p of a person randomly selected from the French population being female. We can show that a uniform assumption for p translates to a Duniform distribution of females in a group, as follows. Let s_n be the number of successes in n Bernoulli trials, where θ is the unknown probability of success of a trial.
Then the probability that s_n = r, r = {0, 1, 2, ..., n}, is given by the de Finetti theorem:

P(s_n = r) = \int_0^1 \binom{n}{r} \theta^r (1 - \theta)^{n-r} f(\theta) \, d\theta

where f(θ) is the probability density function for the uncertainty distribution for θ. The formula simply calculates, for any value of r, the binomial probability of observing r successes, integrated over the uncertainty distribution for the binomial probability θ. If we use a Uniform(0, 1) distribution to describe our uncertainty about θ, then f(θ) = 1:

P(s_n = r) = \binom{n}{r} \int_0^1 \theta^r (1 - \theta)^{n-r} \, d\theta

The integral is a beta function and, for integer values of r and n, we have the standard identity

\int_0^1 \theta^r (1 - \theta)^{n-r} \, d\theta = \frac{r!\,(n - r)!}{(n + 1)!}

Thus,

P(s_n = r) = \frac{n!}{(n - r)!\,r!} \cdot \frac{r!\,(n - r)!}{(n + 1)!} = \frac{1}{n + 1}

So each of the n + 1 possible values {0, 1, 2, ..., n} has the same likelihood of 1/(n + 1). In other words, using a Duniform prior for the number of females in a group equates to saying that we are equally confident that the true probability of an individual from the population being female is any value between 0 and 1.

Example 9.4

A magician has three cups turned over on his table. Under one of the cups you see him put a pea. With much ceremony, he changes the cups around in a dazzling swirl. He then offers you a bet to pick which cup the pea is under. You pick one. He then shows you under one of the other cups - empty. The magician asks you whether you would like to swap your choice for the third, untouched cup. What is your answer? Note that the magician knows which cup has the pea and would not turn it over. In this problem, until the magician turns over a cup, we are equally sure about which cup has the pea, so our prior confidence assigns equal weighting to the three cups. We now need to calculate the probability of what was observed if the pea had been under each of the cups in turn. We can label the three cups as A for the cup I chose, B for the cup the magician chose and C for the remaining cup. Let's start with the easy cup, B.
What is the probability that the magician would turn over cup B if he knew the pea was under cup B? Answer: 0, because he would have spoiled the trick. Next, look at the untouched cup, C. What is the probability that the magician would turn over cup B if he knew the pea was under cup C? Answer: 1, since he had no choice as I had already picked A, and C contained the pea. Now look at my cup, A. What is the probability that the magician would turn over cup B if he knew the pea was under cup A? Answer: 1/2, since he could have chosen to turn over either B or C. Thus, from Bayes' theorem,

P(C|X) = \frac{P(X|C)P(C)}{P(X|A)P(A) + P(X|B)P(B) + P(X|C)P(C)}

where P(A) = P(B) = P(C) = 1/3 are the confidences we assign to the three cups before observing the data X (i.e. the magician turning over cup B), and P(X|A) = 0.5, P(X|B) = 0 and P(X|C) = 1. Thus, P(A|X) = 1/3, P(B|X) = 0 and P(C|X) = 2/3. So, after having made our choice of cup and then watching the magician turn over one of the other two cups, we should always change our mind and pick the third cup, as we should now be twice as confident that the untouched cup contains the pea as the one we originally chose. The result is a little hard for many people to believe: the obstinate among us would like to stick to our original choice, and it does not seem that the probability that our chosen cup contains the pea can really have changed. Indeed, the probability has not changed after the magician's selection: it remains either 0 or 1, depending on whether we picked the right cup. What has changed is our confidence (the state of our knowledge) about whether that probability is 1. Originally, we had a 1/3 confidence that the pea was under our cup, and that has not changed. There is another way to think of the same problem: we had 1/3 confidence in our original choice of cup and 2/3 in the other two cups, and we also knew that one of those other cups did not contain the pea, so the 2/3 migrated to the remaining cup that was not turned over.
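The 1/3 versus 2/3 split is easy to check by simulation. This illustrative Python script (not from the book) plays the game many times and compares sticking with switching:

```python
import random

random.seed(3)
trials = 30000
stick_wins = switch_wins = 0
for _ in range(trials):
    pea = random.randrange(3)       # cup hiding the pea
    choice = random.randrange(3)    # the cup I pick (A)
    # the magician turns over a cup that is neither mine nor the pea's (B)
    shown = random.choice([c for c in range(3) if c != choice and c != pea])
    # the remaining untouched cup (C)
    other = next(c for c in range(3) if c != choice and c != shown)
    stick_wins += (choice == pea)
    switch_wins += (other == pea)

stick_rate = stick_wins / trials
switch_rate = switch_wins / trials
```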
This exercise is known as the Monty Hall problem - Wikipedia has a nice explanatory page, and www.stat.sc.edu/~west/javahtml/letsMakeaDeal.html has a nice simulation applet to test out the answer.

Exercise 9.1: Try repeating this problem where there are (a) four cups and one pea, and (b) five cups and two peas. Each time you get to select a cup, and each time the magician turns one of the others over.

9.2.2 Prior distributions

As we have seen above, the prior distribution is the description of one's state of knowledge about the parameter in question prior to observation of the data. Determination of the prior distribution is the primary focus for criticism of Bayesian inference, and one needs to be quite sure of the effects of choosing one particular prior over another. This section describes three different types of prior distribution: the uninformed prior, the conjugate prior and the subjective prior. We will look at the practical reasons for selecting each type and arguments for and against each selection. An argument presented by frequentist statisticians (i.e. those who use only traditional statistical techniques) is that the Bayesian inference methodology is subjective. A frequentist might argue that, because we use prior distributions representing the state of one's belief prior to accumulation of data, Bayesian inference may easily produce quite different results from one practitioner to the next, because they can choose quite different priors. This is, of course, true - in principle. It is both one of the strengths of the technique and certainly its Achilles' heel. On the one hand, it is very useful in a statistical technique to be able to include one's prior experience and knowledge of the parameter, even if that is not available in a pure data form. On the other hand, one party could argue that the resultant posterior distribution produced by another party was incorrect. The solution to this dilemma is, in principle, fairly simple.
If the purpose of the Bayesian inference is to make internal decisions within your organisation, you are very much at liberty to use any experience you have available to determine your prior. On the other hand, if the result of your analysis is likely to be challenged by a party with a conflicting agenda to your own, you may be better off choosing an "uninformed" prior, i.e. one that is neutral in that it provides no extra information. All that said, in the event that one has accumulated a reasonable dataset, the controversy regarding selection of priors disappears, as the prior is overwhelmed by the information contained in the data. It is important to specify a prior with a sufficiently large range to cover all possible true values for the parameter, as we saw in Figure 9.6. Failure to specify a wide enough prior will curtail the posterior distribution, although this will nearly always be apparent when plotting the posterior distribution, and a correction can be made. The only time it may not be apparent that the prior range is inadequate is when the likelihood function has more than one peak, in which case one might have extended the range of the prior to show the first peak but no further.

Uninformed priors

An uninformed prior has a distribution that would be considered to add no information to the Bayesian inference, except to specify the possible range of the parameter in question. For example, a Uniform(0, 1) distribution could be considered an uninformed prior when estimating a binomial probability because it states that, prior to collection of any data, we consider every possible value for the true probability to be as likely as every other. An uninformed prior is often desirable in the development of public policy to demonstrate impartiality.
Laplace (1812), who also independently stated Bayes' theorem (Laplace, 1774) 11 years after Bayes' essay was published (he apparently had not seen Bayes' essay), proposed that public policy priors should assume all allowable values to have equal likelihood (i.e. uniform or Duniform distributions). At first glance, then, it might seem that uninformed priors will just be uniform distributions running across the entire range of possible values for the parameter. That this is not true can be easily demonstrated with the following example. Consider the task of estimating the true mean number of events per unit exposure λ of a Poisson process. We have observed a certain number of events within a certain period, which we can use to give us a likelihood function very easily (see Example 9.6). It might seem reasonable to assign a Uniform(0, z) prior to λ, where z is some large number. However, we could just as easily have parameterised the problem in terms of β, the mean exposure between events. Since β = 1/λ, we can quickly check what a Uniform(0, z) prior for λ would look like as a prior for β by running a simulation on the formula β = 1/Uniform(0, z). Figure 9.13 shows the result of such a simulation. It is alarmingly far from being uninformed with respect to β! Of course, the reverse equally applies: if we had performed a Bayesian inference on β with a uniform prior, the prior for λ would be just as far from being uninformed.

Figure 9.13 Distribution resulting from the formula β = 1/Uniform(0, 20).

The probability density function for the prior distribution of a parameter must be known in order to perform a Bayesian inference calculation. However, one can often choose between a number of different parameterisations that would equally well describe the same stochastic process.
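The simulation behind Figure 9.13 takes one line, and a quick check (illustrative Python, not from the book) shows just how lopsided the implied prior on β is: with λ ~ Uniform(0, 20), 95% of the probability for β = 1/λ is squeezed below 1, even though β can take arbitrarily large values.

```python
import random

random.seed(11)
z = 20
# beta = 1/lambda where lambda ~ Uniform(0, z), as in Figure 9.13
betas = [1 / random.uniform(0, z) for _ in range(100_000)]

frac_below_1 = sum(b < 1 for b in betas) / len(betas)
# analytically P(beta < 1) = P(lambda > 1) = (z - 1)/z = 0.95,
# which is hardly "uninformed" with respect to beta
```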
For example, one could describe a Poisson process by λ, the mean number of events per unit exposure; by β, the mean exposure between events, as above; or by P(x > 0), the probability of at least one event in a unit of exposure. The Jacobian transformation lets us calculate the prior distribution for a Bayesian inference problem after reparameterising. If x is the original parameter with probability density function f(x) and cumulative distribution function F(x), and y is the new parameter with probability density function f(y) and cumulative distribution function F(y), related to x by some function such that x and y increase monotonically, then we can equate the changes dF(y) and dF(x), i.e.

f(y) \, dy = f(x) \, dx

Rearranging a little, we get

f(y) = f(x) \left| \frac{dx}{dy} \right|

where |dx/dy| is known as the Jacobian. So, for example, if x = Uniform(0, c) and y = 1/x, then

\frac{dx}{dy} = -\frac{1}{y^2} \quad \text{so the Jacobian is} \quad \left| \frac{dx}{dy} \right| = \frac{1}{y^2}

which gives the distribution for y:

f(y) = \frac{1}{c y^2}

Two advanced exercises for those who like algebra:

Exercise 9.2: Suppose we model p = U(0, 1). What is the density function for Q = 1 - (1 - p)^n?

Exercise 9.3: Suppose we want to model P(0) = exp(-λ) = U(0, 1). What is the density function for λ?

There is no all-embracing solution to the problem of setting uninformed priors that don't become "informed" under some reparameterising of the problem. However, one useful method is to use a prior such that log10(θ) is Uniform(-z, z) distributed, which, using the Jacobian transformation, can be shown to give the prior density π(θ) ∝ 1/θ, for a parameter that can take any positive real value. We could just as easily use natural logs, i.e. loge(θ) = Uniform(-y, y), but in practice it is easier to set the value z because our minds think quite naturally in powers of 10. Using this prior, we get log10(1/θ) = -log10(θ) = -Uniform(-z, z) = Uniform(-z, z). In other words, 1/θ is distributed the same as θ: in mathematical terminology, the prior distribution is transformation invariant.
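The transformation invariance just described can be confirmed by simulation (illustrative Python, not the book's): draws of θ = 10^Uniform(-z, z) and their reciprocals share the same distribution.

```python
import random

random.seed(5)
z = 3
# theta with log10(theta) ~ Uniform(-z, z), i.e. theta = 10^Uniform(-z, z)
thetas = sorted(10 ** random.uniform(-z, z) for _ in range(200_000))
inverses = sorted(1 / t for t in thetas)

# a few sample quantiles of theta and of 1/theta should agree closely
def quantile(xs, f):
    return xs[int(f * len(xs))]

ratios = [quantile(thetas, f) / quantile(inverses, f)
          for f in (0.1, 0.25, 0.5, 0.75, 0.9)]
```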
Now, if log10(θ) is Uniform(-z, z) distributed, then θ is distributed as 10^Uniform(-z, z). Figure 9.14 shows a graph of π(θ) ∝ 1/θ. You probably wouldn't describe that distribution as very uninformed, but it is arguably the best one can do for this particular problem. It is worth remembering too that, if there is a reasonable amount of data available, the likelihood function l(X|θ) will overpower the prior π(θ) ∝ 1/θ, and then the shape of the prior becomes unimportant. This will occur much more quickly if the likelihood function has its maximum in a region of θ where the prior is flatter: anywhere from 3 or 4 onwards in Figure 9.14, for example.

Figure 9.14 Prior distribution π(θ) ∝ 1/θ.

Another requirement might be to ensure that the prior distribution remains invariant under some rescaling. For example, the location parameter of a distribution should have the same effective prior under the linear shifting transformation y = θ - a, where a is some constant. This is achieved if we select a uniform prior for θ, i.e. π(θ) = constant. Similarly, a scale parameter should have a prior that is invariant under a change of units, i.e. y = kθ, where k is some constant. In other words, we require that the parameter be invariant under a linear transformation, which, from the discussion in the previous paragraph, is achieved if we select the prior log(θ) = uniform (i.e. π(θ) ∝ 1/θ) on the real line, since log(y) = log(kθ) = log(k) + log(θ), which is still uniformly distributed. A parametric distribution often has a location parameter, a scale parameter or both. If more than one parameter is unknown and one is attempting to estimate these parameters, it is common practice to assume independence between the parameters in the prior: the logic is that an assumption of independence is more uninformed than an assumption of any specific degree of dependence.
The joint prior for a scale parameter and a location parameter is then simply the product of the two priors. So, for example, the prior for the mean of a normal distribution is π(μ) ∝ 1, as μ is a location parameter; the prior for the standard deviation of the normal distribution is π(σ) ∝ 1/σ, as σ is a scale parameter; and their joint prior is given by the product of the two priors, i.e. π(μ, σ) ∝ 1/σ. The use of joint priors is discussed more fully in Chapter 10, where we will be fitting distributions to data.

Jeffreys prior

The Jeffreys prior, described in Jeffreys (1961), provides an easily computed prior that is invariant under any one-to-one transformation and therefore determines one version of what could be described as an uninformed prior. The idea is that one finds a likelihood function, under some transformation of the data, that produces the same shape for all datasets and simply changes the location of its peak. Thus, a non-informative prior under this transformation would be unambiguous, i.e. flat. Although it is often impossible to determine such a likelihood function, Jeffreys developed a useful approximation given by

\pi(\theta) \propto \sqrt{I(\theta)}

where I(θ) is the expected Fisher information in the model:

I(\theta) = -E_x\left[ \frac{\partial^2 \log l(x; \theta)}{\partial \theta^2} \right]

The formula averages, over all values of x (the data), the second-order partial derivative of the log-likelihood function. The form of the likelihood function helps determine the prior, but the data themselves do not. This is important, since the prior must be "blind" to the data. [Interestingly, empirical Bayes methods (another field of Bayesian inference, though not discussed in this book) do use the data to determine the prior distribution and then try to make appropriate corrections for the bias this creates.] Some of the Jeffreys prior results are a little counterintuitive. For example, the Jeffreys prior for a binomial probability is the Beta(1/2, 1/2) shown in Figure 9.15.
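The expected Fisher information can be evaluated by brute force from this definition. The Python below is my own illustrative check (not from the book): for a Binomial(n, p) likelihood it recovers the known result I(p) = n/(p(1-p)), whose square root is the Beta(1/2, 1/2) kernel.

```python
import math

def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def fisher_info(n, p, h=1e-5):
    # I(p) = -E_x[d^2 log l(x; p) / dp^2]: the second derivative is taken
    # numerically by central differences, and the expectation is taken by
    # summing over every possible observation x
    total = 0.0
    for x in range(n + 1):
        ll = lambda q: math.log(binom_pmf(x, n, q))
        d2 = (ll(p + h) - 2 * ll(p) + ll(p - h)) / h**2
        total += binom_pmf(x, n, p) * d2
    return -total

n = 10
info_half = fisher_info(n, 0.5)   # analytically n/(p(1-p)) = 40 at p = 0.5

def jeffreys(p):
    # unnormalised Jeffreys prior: sqrt of the expected Fisher information
    return math.sqrt(fisher_info(n, p))
```

The prior dips at p = 0.5 and rises towards the edges, exactly the Beta(1/2, 1/2) shape discussed next.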
The Beta(1/2, 1/2) prior peaks at p = 0 and p = 1, dipping to its lowest value at p = 0.5, which does not equate well with most people's intuitive notion of uninformed. The Jeffreys prior for the Poisson mean λ is π(λ) ∝ 1/λ^{1/2}. But, using the Jacobian transformation, we see that this gives a prior for β = 1/λ of π(β) ∝ β^{-3/2}, so the prior is not transformation invariant.

Figure 9.15 The Beta(1/2, 1/2) distribution.

Improper priors

We have seen how a uniform prior can be used to represent uninformed knowledge about a parameter. However, if that parameter can take on any value between zero and infinity, for example, then it is not strictly possible to use the uniform prior π(θ) = c, where c is some constant, since no value of c will let the area of the distribution sum to 1, and the prior is called improper. Other common improper priors include using 1/σ for the standard deviation of a normal distribution and 1/σ² for the variance. It turns out that we can use improper priors provided the denominator in Equation (9.8) equals some constant (i.e. is not infinite), because this means that the posterior distribution can be normalised. Savage et al. (1962) pointed out that an uninformed prior can be uniformly distributed over the area of interest, then slope smoothly down to zero outside the area of interest. Such a prior can, of course, be designed to have an area of 1, eliminating the need for improper priors. However, the extra effort required in designing such a prior is not really necessary if one can accept using an improper prior.

Hyperpriors

Occasionally, one may wish to specify a prior that itself has one or more uncertain parameters. For instance, in Example 9.3 we used a Binomial(20, 0.5) prior because we believed that about 50% of the population were female, and we discussed the effect of changing this value to a distribution representing the uncertainty about the true female prevalence.
Such a distribution is described as a hyperprior for the hyperparameter p in Binomial(20, p). As previously discussed, Bayesian inference can account for hyperpriors, but we are then required to integrate over all values of the hyperparameter to determine the shape of the prior, and that can be time consuming and at times very difficult. An alternative to the algebraic approach is to find the prior distribution by Monte Carlo simulation. We run a simulation for this model, naming as outputs the array of cells calculating the prior. At the end of the simulation, we collect the mean values for each output cell, which together form our prior. The posterior distribution will naturally have a greater spread if there is uncertainty about any parameters in the prior. If we had used a Beta(a, b) distribution for p, the prior would have been a Beta-Binomial(20, a, b) distribution, and a beta-binomial distribution always has a greater spread than the best-fitting binomial. Theoretically, one could continue applying uncertainty distributions to the parameters of hyperpriors, etc., but there is little if any accuracy to be gained by doing so, and the model starts to seem pretty silly. It is also worth remembering that the likelihood function often quickly overpowers the prior distribution as more data become available, so the effort expended in subtle changes to defining a prior will often be wasted.

Conjugate priors

A conjugate prior has the same functional form in θ as the likelihood function, which leads to a posterior distribution belonging to the same distribution family as the prior.
For example, the Beta(α1, α2) distribution has probability density function f(θ) given by

f(θ) = θ^(α1−1) (1 − θ)^(α2−1) / B(α1, α2)

The denominator is a constant for particular values of α1 and α2, so we can rewrite the equation as

f(θ) ∝ θ^(α1−1) (1 − θ)^(α2−1)     (9.9)

If we had observed s successes in n trials and were attempting to estimate the true probability of success p, the likelihood function l(s, n; θ) would be given by the binomial distribution probability mass function, written (using θ to represent the unknown parameter p) as

l(s, n; θ) = C(n, s) θ^s (1 − θ)^(n−s)

Since the binomial coefficient is constant for the given dataset (i.e. known n, s), we can rewrite the equation as

l(s, n; θ) ∝ θ^s (1 − θ)^(n−s)

We can see that the beta distribution and the binomial likelihood function have the same functional form in θ, i.e. θ^a (1 − θ)^b, where a and b are constants. Since the posterior distribution is a product of the prior and likelihood function, it too will have the same functional form, i.e. using Equation (9.9) we have

f(θ|s, n) ∝ θ^(α1−1+s) (1 − θ)^(α2−1+n−s)     (9.10)

Since this is a true distribution, it must normalise to 1, so the probability distribution function is actually

f(θ|s, n) = θ^(α1−1+s) (1 − θ)^(α2−1+n−s) / ∫₀¹ t^(α1−1+s) (1 − t)^(α2−1+n−s) dt

which is just the Beta(α1 + s, α2 + n − s) distribution. (In fact, with a bit of practice, one starts to recognise distributions because of their functional form, e.g. that Equation (9.10) represents a beta distribution, without having to go through the step of obtaining the normalised equation.) Thus, if one uses a beta distribution as a prior for p with a binomial likelihood function, the posterior distribution is also a beta. The value of using conjugate priors is that we can avoid actually doing any of the mathematics and get directly to the answer. Conjugate priors are often called convenience priors for obvious reasons. The Beta(1, 1) distribution is exactly the same as a Uniform(0, 1) distribution, so, if we want to start with a Uniform(0, 1) prior for p, our posterior distribution is given by Beta(s + 1, n − s + 1).
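As a quick numerical sanity check (my own sketch, not from the book), one can normalise the product of a beta prior and a binomial likelihood on a grid and confirm that it matches the Beta(α1 + s, α2 + n − s) conjugate result; the parameter values below are arbitrary:

```python
import numpy as np

# Prior Beta(a1, a2); data: s successes in n trials (arbitrary values).
a1, a2, s, n = 2.0, 3.0, 7, 20

# Brute-force posterior: normalise prior x likelihood on a grid of theta.
theta = np.linspace(1e-6, 1 - 1e-6, 200_001)
unnorm = theta**(a1 - 1 + s) * (1 - theta)**(a2 - 1 + n - s)
pdf = unnorm / (unnorm.sum() * (theta[1] - theta[0]))

# The conjugate result says the posterior is Beta(a1 + s, a2 + n - s),
# whose mean is (a1 + s) / (a1 + a2 + n); the grid answer agrees.
grid_mean = (theta * pdf).sum() * (theta[1] - theta[0])
assert abs(grid_mean - (a1 + s) / (a1 + a2 + n)) < 1e-4
```

The same grid check works for any prior-likelihood pair, which is exactly why conjugacy is a convenience rather than a necessity.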
This is a particularly useful result that will be used repeatedly in this book. By comparison, the Jeffreys prior for a binomial probability is a Beta(½, ½). Haldane (1948) discusses using a Beta(0, 0) prior, which is mathematically undefined and therefore meaningless by itself, but gives a posterior distribution of Beta(s, n − s) that has a mean of s/n: in other words, it provides an unbiased estimate for the binomial probability. Table 9.1 lists other conjugate priors and the associated likelihood functions. Morris (1983) has shown that exponential families of distributions, from which one often draws the likelihood function, all have conjugate priors, so the technique can be used frequently in practice. Conjugate priors are also often used to provide approximate but very convenient representations of subjective priors, as described in the next section.

Subjective priors

A subjective prior (sometimes called an elicited prior) describes the informed opinion of the value of a parameter prior to the collection of data. Chapter 14 discusses in some depth the techniques for eliciting opinions. A subjective prior can be represented as a series of points on a graph, as shown in Figure 9.16. It is a simple enough exercise to read off a number of points from such graphs and use the height of each point as a substitute for π(θ). That makes it quite difficult to normalise the posterior distribution,

Table 9.1 Likelihood functions and their associated conjugate priors.

Binomial: f(x) = C(n, x) p^x (1 − p)^(n−x); estimated parameter: probability p; prior: Beta(α1, α2); posterior: Beta(α1', α2') with α1' = α1 + x, α2' = α2 + n − x.
Exponential: f(x) = λ e^(−λx); estimated parameter: mean⁻¹ = λ; prior: Gamma(α, β); posterior: Gamma(α', β') with α' = α + n, β' = β/(1 + β Σxi).
Normal (with known σ): f(x) = [1/(σ√(2π))] exp[−½((x − μ)/σ)²]; estimated parameter: mean μ; prior: Normal(μ0, σ0); posterior: Normal(μ0', σ0') with μ0' = (μ0/σ0² + n x̄/σ²)/(1/σ0² + n/σ²) and σ0'² = (1/σ0² + n/σ²)⁻¹.
Poisson: f(x) = (λt)^x e^(−λt)/x!; estimated parameter: mean events per unit time λ; prior: Gamma(α, β); posterior: Gamma(α', β') with α' = α + x, β' = β/(1 + βt).
but we will see in Section 9.2.4 a technique that one can use in Monte Carlo modelling that removes that problem.

Figure 9.16 Example of a subjective prior (elicited distribution for the weight of a statue).

Sometimes it is possible reasonably to match a subjective opinion like that of Figure 9.16 to a convenience prior for the likelihood function one is intending to use. Software products like ModelRisk, BestFit® and RiskView Pro® can help in this regard. An exact match is not usually important because (a) the subjective prior is not usually specified that accurately anyway and (b) the prior has progressively less influence on the posterior the larger the set of data used in calculating the likelihood function. At other times, a single conjugate prior may be inadequate for describing a subjective prior, but a composite of two or more conjugate priors will produce a good representation.

Multivariate priors

I have concentrated discussion of the quantification of uncertainty in this chapter on a single parameter θ. In practice one may find that θ is multivariate, i.e. that it is multidimensional, in which case one needs multivariate priors. In general, such techniques are beyond the scope of this book, and the reader is referred to more specialised texts on Bayesian inference: I have listed some texts I have found useful (and readable) in Appendix IV. Multivariate priors are, however, discussed briefly with respect to fitting distributions to data in Section 10.2.2.

9.2.3 Likelihood functions

The likelihood function l(X|θ) is a function of θ with the data X fixed. It calculates the probability of observing the data X as a function of θ. Sometimes the likelihood function is simple: often it is just the probability distribution function of a distribution like the binomial, Poisson or hypergeometric. At other times, it can quickly become very complex.
Examples 9.2, 9.3 and 9.6 to 9.8 illustrate some different likelihood functions. As likelihood functions are calculating probabilities (or probability densities), they can be combined in the same way as we usually do in probability calculus, discussed in Section 6.3. The likelihood principle states that all relevant evidence about θ from an experiment and its observed outcome should be present in the likelihood function. For example, in binomial sampling with n fixed, s is binomially distributed for a given p. If s is fixed, n is negative binomially distributed for a given p. In both cases the likelihood function is proportional to p^s (1 − p)^(n−s), i.e. it is independent of how the sampling was carried out and dependent only on the type of sampling and the result.

9.2.4 Normalising the posterior distribution

A problem often faced by those using Bayesian inference is the difficulty of determining the normalising integral that is the denominator of Equation (9.8). For all but the simplest likelihood functions this can be a complex equation. Although sophisticated commercial software products like Mathematica®, Mathcad® and Maple® are available to perform these integrations for the analyst, many integrals remain intractable and have to be solved numerically. This means that the calculation has to be redone every time new data are acquired or a slightly different problem is encountered. For the risk analyst using Monte Carlo techniques, the normalising part of the Bayesian inference analysis can be bypassed altogether. Most Monte Carlo packages offer two functions that enable us to do this: a Discrete({x}, {p}) distribution and a Relative(min, max, {x}, {p}) distribution. The first defines a discrete distribution where the allowed values are given by the {x} array and the relative likelihood of each of these values is given by the {p} array.
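A minimal sketch of how such a self-normalising discrete function might behave (the helper name `discrete` and the numbers are my own illustration, not any package's actual API):

```python
import random

def discrete(values, weights, size=1, rng=random):
    """Mimic a Discrete({x}, {p}) function: sample from `values` with
    relative likelihoods `weights`. The weights need not sum to 1;
    they are normalised internally."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(values, weights=probs, k=size)

# Unnormalised posterior heights for three candidate parameter values:
sample = discrete([0.1, 0.2, 0.3], [2.0, 5.0, 3.0], size=10_000)
# About half the samples should land on 0.2 (weight 5 out of 10).
```

The point is that the caller supplies raw posterior heights; the normalisation that would otherwise require the integral in Equation (9.8) happens inside the sampler.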
The second function defines a continuous distribution with a minimum = min, a maximum = max and several x values given by the array {x}, each of which has a relative likelihood "density" given by the {p} array. The reason that these two functions are so useful is that the user is not required to ensure that, for the discrete distribution, the probabilities in {p} sum to 1 and, for the relative distribution, the area under the curve equals 1. The functions normalise themselves automatically.

9.2.5 Taylor series approximation to a Bayesian posterior distribution

When we have a reasonable amount of data with which to calculate the likelihood function, the posterior distribution tends to come out looking approximately normally distributed. In this section we will examine why that is, and provide a shorthand method to determine the approximating normal distribution directly without needing to go through a complete Bayesian analysis. Our best estimate θ0 of the value of a parameter θ is the value for which the posterior distribution f(θ) is at its maximum. Mathematically, this equates to the condition

df(θ)/dθ |θ=θ0 = 0     (9.11)

That is to say, θ0 occurs where the gradient of f(θ) is zero. Strictly speaking, we also require that the gradient of f(θ) go from positive to negative for θ0 to be a maximum, i.e.

d²f(θ)/dθ² |θ=θ0 < 0

The second condition is only of any importance if the posterior distribution has two or more peaks, for which a normal approximation to the posterior distribution would be inappropriate anyway. Taking the first and second derivatives of f(θ) assumes that θ is a continuous variable, but the principle applies equally to discrete variables, in which case we are just looking for that value of θ for which the posterior distribution has the highest value.
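When no algebraic maximum is available, that highest-posterior value can be located numerically. A minimal sketch (my own example, using a binomial posterior where the answer is known in advance to be s/n):

```python
import numpy as np

# Unnormalised log posterior for s successes in n trials with a uniform
# prior: L(theta) = s*ln(theta) + (n - s)*ln(1 - theta), up to a constant.
s, n = 7, 20
theta = np.linspace(1e-6, 1 - 1e-6, 1_000_001)
log_post = s * np.log(theta) + (n - s) * np.log(1 - theta)

theta0 = theta[np.argmax(log_post)]  # numerical mode of the posterior
# Matches the analytic answer theta0 = s/n = 0.35 to grid precision.
assert abs(theta0 - s / n) < 1e-5
```

Working with the log posterior avoids numerical underflow, and the constant dropped from L(θ) does not move the location of the maximum.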
The Taylor series expansion of a function (see Section 6.3.6) allows one to produce a polynomial approximation to some function f(x) about some value x0 that usually has a much simpler form than the original function. The Taylor series expansion says

f(x) ≈ Σ_{m=0}^{M} [f^(m)(x0)/m!] (x − x0)^m

where f^(m)(x) represents the mth derivative of f(x) with respect to x. To make the next calculation a little easier to manage, we first define the log of the posterior distribution L(θ) = ln[f(θ)]. Since L(θ) increases with f(θ), the maximum of L(θ) occurs at the same value of θ as the maximum of f(θ). We now apply the Taylor series expansion of L(θ) about θ0 (the MLE) for the first three terms:

L(θ) ≈ L(θ0) + dL(θ)/dθ |θ0 (θ − θ0) + ½ d²L(θ)/dθ² |θ0 (θ − θ0)²

The first term in this expansion is just a constant value (k) and tells us nothing about the shape of L(θ); the second term equals zero from Equation (9.11), so we are left with the simplified form

L(θ) ≈ k + ½ d²L(θ)/dθ² |θ0 (θ − θ0)²

This approximation will be good providing the higher-order terms (m = 3, 4, etc.) have much smaller values than the m = 2 term here. We can now take the exponential of L(θ) to get back to f(θ):

f(θ) ≈ K exp[ ½ d²L(θ)/dθ² |θ0 (θ − θ0)² ]

where K is a normalising constant. Now, the Normal(μ, σ) distribution has probability density function f(x) given by

f(x) = [1/(σ√(2π))] exp[ −(x − μ)²/(2σ²) ]

Comparing the above two equations, we can see that f(θ) has the same functional form as a normal distribution, where μ = θ0 and

σ = [ −d²L(θ)/dθ² |θ0 ]^(−1/2)     (9.12)

and we can thus often approximate the Bayesian posterior distribution with the following normal distribution:

θ ≈ Normal( θ0, [ −d²L(θ)/dθ² |θ0 ]^(−1/2) )

We shall illustrate this normal (or quadratic) approximation with a few simple examples.

Example 9.5 Approximation to the beta distribution

We have seen above that the Beta(s + 1, n − s + 1) distribution provides an estimate of the binomial probability p when we have observed s successes in n independent trials, and assuming a prior Uniform(0, 1) distribution.
The posterior density has the functional form

f(θ) ∝ θ^s (1 − θ)^(n−s)

Taking logs gives

L(θ) = s ln(θ) + (n − s) ln(1 − θ) + constant

and

dL(θ)/dθ = s/θ − (n − s)/(1 − θ),   d²L(θ)/dθ² = −s/θ² − (n − s)/(1 − θ)²

We first find our best estimate θ0 of θ:

s/θ0 − (n − s)/(1 − θ0) = 0

which gives the intuitively encouraging answer

θ0 = s/n

i.e. our best guess for the binomial probability is the proportion of trials that were successes. Next, we find the standard deviation σ for the normal approximation to this beta distribution:

d²L(θ)/dθ² |θ0 = −s/θ0² − (n − s)/(1 − θ0)² = −n/[θ0(1 − θ0)]

which gives

σ = [θ0(1 − θ0)/n]^(1/2)     (9.13)

and so we get the approximation

θ ≈ Normal( s/n, [θ0(1 − θ0)/n]^(1/2) )     (9.14)

The equation for σ allows us some useful insight into the behaviour of the beta distribution. We can see in the numerator that the spread of the beta distribution, and therefore our measure of uncertainty about the true value of θ, is a function of our best estimate for θ. The function [θ0(1 − θ0)] is at its maximum when θ0 = ½, so, for a given set of trials n, we will be more uncertain about the true value of θ if the proportion of successes is close to ½ than if it were closer to 0 or 1. Looking at the denominator, we see that the degree of uncertainty, represented by σ, is proportional to n^(−1/2). We will see time and again that the level of uncertainty of some parameter is inversely proportional to the square root of the amount of data available. Note also that Equation (9.14) is exactly the same as the classical statistics result of Equation (9.7). But when is this quadratic approximation to L(θ), i.e. the normal approximation to f(θ), a reasonably good fit? The mean μ and variance V of a Beta(s + 1, n − s + 1) distribution are as follows:

μ = (s + 1)/(n + 2),   V = (s + 1)(n − s + 1)/[(n + 2)²(n + 3)]

Comparing these identities with Equation (9.13), we can see that the normal approximation works when s and (n − s) are both sufficiently large for adding 1 to s and adding 3 to n proportionally to have little effect, i.e.
when

(s + 1)/s ≈ 1  and  (n + 3)/n ≈ 1

Figure 9.17 compares the beta distribution with its normal approximation for several combinations of s successes in n trials.

Example 9.6 Uncertainty of λ in a Poisson process

The number of earthquakes that have occurred in a region of the Pacific during each of the last 20 years is shown in Table 9.2. What is the probability that there will be more than 10 earthquakes next year? Let us assume that the earthquakes come from a Poisson process (it probably doesn't, I admit, since one big earthquake can release built-up pressure and give a hiatus until the next one), i.e. that there is a constant probability per unit time of an earthquake and that all earthquakes are independent of each other. If such an assumption is acceptable, then we need to determine the value of the Poisson process parameter λ, the theoretical true mean number of earthquakes there would be per year.

Table 9.2 Pacific earthquakes (annual counts for 1979–1998).

Assuming no prior knowledge, we can proceed with a Bayesian analysis, labelling λ = θ as the parameter to be estimated. The prior distribution should be uninformed, which, as discussed in Section 9.2.2, leads us to use a prior π(θ) = 1/θ. The likelihood function l(θ|X) for the xi observations in n years is given by

l(θ|X) = Π_{i=1}^{n} θ^{xi} e^{−θ}/xi! ∝ θ^{Σxi} e^{−nθ}

which gives a posterior function

f(θ) ∝ θ^{Σxi − 1} e^{−nθ}

Taking logs gives

L(θ) = (Σxi − 1) ln(θ) − nθ + constant

Our best estimate θ0 is determined by

dL(θ)/dθ |θ0 = (Σxi − 1)/θ0 − n = 0

which gives

θ0 = (Σxi − 1)/n

and the standard deviation for the normal approximation is given by

σ = (Σxi − 1)^(1/2)/n

since

d²L(θ)/dθ² |θ0 = −(Σxi − 1)/θ0² = −n²/(Σxi − 1)

which gives our estimate for λ:

λ ≈ Normal( (Σxi − 1)/n, (Σxi − 1)^(1/2)/n )

Again this solution makes sense, and again we see that the uncertainty decreases proportionally to the square root of the amount of data n.
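The Table 9.2 counts are not reproduced here, so this sketch (my own, not from the book) uses synthetic Poisson counts simply to check the mode and standard deviation formulas numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
counts = rng.poisson(7.0, size=20)   # stand-in for Table 9.2 (n = 20 years)
n, total = len(counts), counts.sum()

# With the 1/theta prior, the posterior is proportional to
# theta^(sum(x) - 1) * exp(-n*theta), i.e. a Gamma(sum(x), 1/n).
theta0 = (total - 1) / n             # posterior mode
sigma = np.sqrt(total - 1) / n       # from d2L/dtheta2 = -n^2/(sum(x) - 1)

# Grid check that theta0 really maximises the log posterior.
theta = np.linspace(0.01, 20, 200_000)
log_post = (total - 1) * np.log(theta) - n * theta
assert abs(theta[np.argmax(log_post)] - theta0) < 1e-3
```

Substituting the real Table 9.2 counts for the synthetic ones would reproduce the normal approximation plotted in Figure 9.18.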
The central limit theorem (see Section 6.3.3) says that, for large n, the uncertainty about the true mean μ of a population can be described as

μ ≈ Normal( x̄, s/√n )

where x̄ is the mean and s is the standard deviation of the data sampled from the parent distribution. The Poisson distribution has a variance equal to its mean λ, and therefore a standard deviation equal to √λ. As Σxi gets large, so the "−1" in the above formula for θ0 gets progressively less important and θ0 gets closer and closer to the mean number of observations per period x̄, and we see that the Bayesian approach and the central limit theorem of classical statistics converge to the same answer. Σxi will be large when either λ is large, so each xi is large, or when there are a lot of data (i.e. n is large), so that the sum of a lot of small xi is still large. Figure 9.18 provides three estimates of λ, the true mean number of earthquakes for the system, given the data for earthquakes for the last 20 years, namely: the standard Bayesian approach, the normal approximation to the Bayesian and the central limit theorem approximation.

Example 9.7 Estimate of the mean of a normal distribution with unknown standard deviation

Assume that we have a set of n data samples from a normal distribution with unknown mean μ and unknown standard deviation σ. We would like to determine our best estimate of the mean together with the appropriate level of uncertainty. A normal distribution can have a mean anywhere in (−∞, +∞), so we could use a uniform improper prior π(μ) = k. From the discussion in Section 9.2.2, the uninformed prior for the standard deviation should be π(σ) = 1/σ to ensure invariance under a linear transformation. The likelihood function is given by the normal distribution density function:

l(X|μ, σ) = Π_{i=1}^{n} [1/(σ√(2π))] exp[ −(xi − μ)²/(2σ²) ]

Figure 9.18 Uncertainty distributions for λ by various methods.
Multiplying the priors together with the likelihood function and integrating over all possible values of σ, we arrive at the posterior distribution for μ:

f(μ) ∝ [ (n − 1)s² + n(μ − x̄)² ]^(−n/2)

where x̄ and s are the mean and sample standard deviation of the data values. Now the Student t-distribution with ν degrees of freedom has the probability density

f(x) ∝ [ 1 + x²/ν ]^(−(ν+1)/2)

The equation for f(μ) is of the same form as the equation for f(x) if we set ν = n − 1. If we divide the term inside the square brackets for f(μ) by the constant (n − 1)s², we get

f(μ) ∝ [ 1 + (1/(n − 1)) ((μ − x̄)/(s/√n))² ]^(−n/2)

so the equation above for f(μ) equates to a shifted, rescaled Student t-distribution with (n − 1) degrees of freedom. Specifically, μ can be modelled as

μ = x̄ + (s/√n) t(n − 1)

where t(n − 1) represents the Student t-distribution with (n − 1) degrees of freedom. This is the exact result used in classical statistics, as described in Section 9.1.3.

Example 9.8 Estimate of the mean of a normal distribution with known standard deviation

This is a more specific case than the previous example and might occur, for example, if one was making many measurements of the same parameter but believed that the measurements had independent, normally distributed errors and no bias (so the distribution of possible values would be centred about the true value). We proceed in exactly the same way as before, giving a uniform prior for μ and using a normal likelihood function for the observed n measurements {xi}. No prior is needed for σ since it is known, and we arrive at a posterior distribution for μ given by

f(μ) ∝ Π_{i=1}^{n} exp[ −(xi − μ)²/(2σ²) ]

Taking logs gives

L(μ) = k − Σ_{i=1}^{n} (xi − μ)²/(2σ²)

where k is some constant (since σ is known). Differentiating twice, we get

dL(μ)/dμ = Σ(xi − μ)/σ²,   d²L(μ)/dμ² = −n/σ²

The best estimate μ0 of μ is that value for which dL(μ)/dμ = 0:

Σ(xi − μ0)/σ² = 0

i.e. μ0 is the average of the data values x̄: no surprise there! A Taylor series expansion of this function about μ0 gives

L(μ) = L(μ0) − n(μ − μ0)²/(2σ²)     (9.16)

The first-derivative term is missing because it equals zero, and there are no other higher-order terms since d²L(μ)/dμ² = −n/σ² is independent of μ and any further differential therefore equals zero.
Consequently, Equation (9.16) is an exact result. Taking natural exponents to convert back to f(μ), and rearranging a little, we get

f(μ) = K exp[ −n(μ − x̄)²/(2σ²) ]

where K is a normalising constant. By comparison with the probability density function for the normal distribution, it is easy to see that this is just a normal density function with mean x̄ and standard deviation σ/√n. In other words,

μ = Normal( x̄, σ/√n )

which is the classical statistics result of Equation (9.4) and a result predictable from the central limit theorem.

Exercise 9.4: Bayesian uncertainty for the standard deviation of a normal distribution. Show that the Bayesian inference results for uncertainty about the standard deviation of a normal distribution take a similar form to the classical statistics results of Section 9.1.2.

9.2.6 Markov chain simulation: the Metropolis algorithm and the Gibbs sampler

Gibbs sampling is a simulation technique to obtain a required Bayesian posterior distribution and is particularly useful for multiparameter models where it is difficult algebraically to define, normalise and draw from a posterior distribution. The method is based on Markov chain simulation: a technique that creates a Markov process (a type of random walk) whose stationary distribution (the distribution of the values it will take after a very large number of steps) is the required posterior distribution. The technique requires that one runs the Markov chain a sufficiently large number of steps to be close to the stationary distribution, and then records the generated values. The trick to a Markov chain model is to determine a transition distribution T_t(θ^i | θ^{i−1}) (the distribution of possible values for the Markov chain at its ith step θ^i, conditional on the value generated in the (i − 1)th step θ^{i−1}) that converges to the posterior distribution.
The Metropolis algorithm

The transition distribution is a combination of some symmetric jumping distribution J_t(θ* | θ^{i−1}), which lets one move from one value θ^{i−1} to another randomly selected θ*, and a weighting function that assigns the probability of jumping to θ* (as opposed to staying still) as the ratio r, where

r = f(θ* | X) / f(θ^{i−1} | X)

so that

θ^i = θ*  with probability min[1, r]
    = θ^{i−1}  otherwise

The technique relies on being able to sample from J_t for all i and θ^{i−1}, as well as being able to calculate r for all jumps. For multiparameter problems, the Metropolis algorithm is very inefficient: the Gibbs sampler provides a method that achieves the same posterior distribution but with far fewer model iterations.

The Gibbs sampler

The Gibbs sampler, also called alternating conditional sampling, is used in multiparameter problems, i.e. where θ is a d-dimensional vector with components (θ1, . . . , θd). The Gibbs sampler cycles through all the components of θ for each iteration, so there are d steps in each iteration. The order in which the components are taken is changed at random from one iteration to the next. In a cycle, the kth component is replaced (k = 1 to d, while all of the other components are kept fixed in turn) with a value drawn from a distribution with probability density

p( θk | θ^{i−1}_{−k}, X )

where θ^{i−1}_{−k} are all the other components of θ except for θk at their current value. This may look rather awkward, as one has to determine and sample from d separate distributions for each iteration of the Gibbs sampler. However, the conditional distributions are often conjugate distributions, which makes sampling from them a lot simpler and quicker. Have a look at Gelman et al. (1995) for a very readable discussion of various Markov chain models, and for a number of examples of their use. Gilks et al. (1996) is written by some of the real gurus of MCMC methods.
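A minimal random-walk Metropolis sketch (my own illustration, not from the book, using the binomial posterior of Example 9.5 as the target; the function names, step counts and tuning values are all assumptions):

```python
import math
import random

def log_post(theta, s=7, n=20):
    """Unnormalised log posterior for a binomial probability with a
    Uniform(0, 1) prior: s*ln(theta) + (n - s)*ln(1 - theta)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return s * math.log(theta) + (n - s) * math.log(1.0 - theta)

def metropolis(steps=50_000, start=0.5, scale=0.1, seed=0):
    rng = random.Random(seed)
    theta, chain = start, []
    for _ in range(steps):
        proposal = theta + rng.gauss(0.0, scale)  # symmetric jumping distribution
        log_r = log_post(proposal) - log_post(theta)
        # Jump to the proposal with probability min(1, r).
        if log_r >= 0 or rng.random() < math.exp(log_r):
            theta = proposal
        chain.append(theta)
    return chain[5_000:]  # discard a burn-in period

draws = metropolis()
mean = sum(draws) / len(draws)
# With s = 7, n = 20 the true posterior is Beta(8, 14), whose mean is
# 8/22, roughly 0.364; the chain mean should land close to that.
```

Note that only the unnormalised posterior is ever evaluated: the normalising constant cancels in the ratio r, which is exactly why MCMC sidesteps the integral in Equation (9.8).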
MCMC in practice

Some terribly smart people write their own Gibbs sampling programs, but for the rest of us there is a product called WinBUGS, developed originally at Cambridge University. It is free to download and is the software most used for MCMC modelling. It isn't that easy to get the software to work for you unless you are familiar with S-Plus or R type script, and one always waits with bated breath for the message "Compiled successfully" because there is rather little in the way of hints about what to do when it doesn't compile. On the plus side, the actual probability model is quite intuitive to write, and WinBUGS has the flexibility to allow different datasets to be incorporated into the same model. The software is also continuously improving, and several people have written interfaces to it through the OpenBUGS project. To use the WinBUGS output, you will need to export the CODA file for data (after a sufficient burn-in) to a spreadsheet, move the data around to have one column per parameter and then randomly sample across a line (i.e. one MCMC iteration) in just the same way I explain for bootstrapping paired data. The ModelRisk function VoseNBootPaired allows you to do this very simply.

9.3 The Bootstrap

The bootstrap was introduced by Efron (1979) and is explored in great depth in Efron and Tibshirani (1993) and perhaps more practically in Davison and Hinkley (1997). This section presents a rather brief introduction that covers most of the important concepts. The bootstrap appears at first sight to be rather dubious, but it has earned its place as a useful technique because (a) it corresponds well to traditional techniques where they are available, particularly when a large dataset has been obtained, and (b) it offers an opportunity to assess the uncertainty about a parameter where classical statistics has no technique available, and without recourse to determining a prior.
The "bootstrap" gets its name from the phrase "to pull yourself up by your bootstraps", which is thought to originate from one of the tales in the Adventures of Baron Munchausen by Rudolph Erich Raspe (1737–1794). Baron Munchausen (1720–1797) actually existed and was known as an enormous boaster, especially of his exploits during his time as a Russian cavalry officer. Raspe wrote ludicrous stories supposedly in his name (he would have been sued these days). In one story, the Baron was at the bottom of a deep lake and in some trouble, until he thought of pulling himself up by his bootstraps. The name "bootstrap" does not perhaps engender much confidence in the technique: you get the impression that there is an attempt somehow to get something from nothing. Actually, it does seem that way when one first looks at the technique itself. However, the bootstrap has shown itself to be a powerful method of statistical analysis and, if used with care, can provide results very easily and in areas where traditional statistical techniques are not available. In its simplest form, which is the non-parametric bootstrap, the technique is very straightforward indeed. The standard notation as used by Efron is perhaps a little confusing to the beginner, and, since I am not going into any great sophistication in this book, I have modified the notation a little to keep it as simple as possible. The bootstrap is used in similar conditions to Bayesian inference, i.e. we have a set of data x randomly drawn from some population distribution F for which we wish to estimate some statistical parameter.

The jackknife

The bootstrap was originally developed from a much earlier technique called the jackknife. The jackknife was used to review the accuracy of a statistic calculated from a set of data.
A jackknife value is the statistic of interest calculated with the ith value removed from the dataset. With a dataset of n values, one thus has n jackknife values, the distribution of which gives a feel for the uncertainty one has about the true value of the statistic. I say "gives a feel" because the reader is certainly not recommended to use the jackknife as a method for obtaining any precise estimate of uncertainty. The jackknife turns out to be quite a poor estimator of uncertainty and can be greatly improved upon.

9.3.1 The non-parametric bootstrap

Imagine that we have a set of n random measurements of some characteristic of a population (the height of 100 blades of grass from my lawn, for example) and we wish to estimate some parameter of that population (the true mean height of all blades of grass from my lawn, for example). Bootstrap theory says that the true distribution F of these blades of grass can be reasonably approximated by the distribution F̂ of observed values. Obviously, this is a more reasonable assumption the more data one has collected. The theory then constructs this distribution F̂ of the n observed values, takes another n random samples (with replacement) from that constructed distribution and calculates the statistic of interest from that sample. The sampling from the constructed distribution and the statistic calculation are repeated a large number of times until a reasonably stable distribution of the statistic of interest is obtained. This is the distribution of uncertainty about the parameter. The method is best illustrated with a simple example. Imagine that I work for a contact lens manufacturer in Auckland and for some reason would really like to know the mean diameter of the pupils of the eyes of New Zealand's population under some specific light condition. I have a limited budget, so I randomly select 10 people off the street and measure their pupils while controlling the ambient light.
The results I get are (in mm): 5.92, 5.06, 6.16, 5.60, 4.87, 5.61, 5.72, 5.36, 6.03 and 5.71. This dataset forms my bootstrap estimate of the true distribution for the whole population, so I now randomly sample with replacement from that distribution to get 10 bootstrap samples. The spreadsheet in Figure 9.19 illustrates the bootstrap sampling: column B lists the original data, and column C gives 10 bootstrap samples from these data using the Duniform({x}) distribution (Duniform({x}) is a discrete distribution where all values in the {x} array are equally likely), i.e. cells C4:C13 contain =VoseDUniform($B$4:$B$13). Cell C14 then calculates the statistic of interest (the mean) from this sample with =AVERAGE(C4:C13). Running a 10 000 iteration simulation on this cell produces the bootstrap uncertainty distribution shown in Figure 9.20. The distribution is roughly normal (skewness = −0.16, kurtosis = 3.02) with mean = 5.604, the mean of the original dataset.

Figure 9.19 Example of a non-parametric bootstrap model.

Figure 9.20 Uncertainty distribution resulting from the model of Figure 9.19.

In summary, the non-parametric bootstrap proceeds as follows:

1. Collect the dataset of n samples {x1, . . . , xn}.
2. Create B bootstrap samples {x1*, . . . , xn*}, where each xi* is a random sample, with replacement, from {x1, . . . , xn}.
3. For each bootstrap sample {x1*, . . . , xn*}, calculate the required statistic θ̂.

The distribution of these B estimates of θ̂ represents the bootstrap estimate of uncertainty about the true value of θ.

Example 9.9 Bootstrap estimate of prevalence

Prevalence is the proportion of a population that has a particular characteristic.
An estimate of the prevalence P is usually made by randomly sampling from the population and seeing what proportion of the sample has that particular characteristic. Our confidence around this single-point estimate can be obtained quite easily using the non-parametric bootstrap. Imagine that we have randomly surveyed 50 voters in Washington, DC, and asked them how many will be voting for the Democrats in a presidential election the following day. Let's rather naïvely assume that they all tell the truth and that none of them will have a change of mind before tomorrow. The result of the survey is that 19 people said they will vote Democrat. Our dataset is therefore a set of 50 values, 19 of which are 1 and 31 of which are 0. A non-parametric bootstrap would sample from this dataset. Thus, the bootstrap replicate would be equivalent to a Binomial(50, 19/50), and the estimate of prevalence is then just the proportion of the bootstrap samples that are 1, i.e.

P = Binomial(50, 19/50)/50

This is exactly the same as the classical statistics estimate given in Equation (9.6), and, interestingly, the parametric bootstrap (see next section) gives exactly the same estimate in this example too. The distribution being sampled in a parametric bootstrap is a Binomial(1, P), from which we have 50 samples, and our MLE (maximum likelihood estimate) for P is 19/50. Thus, the 50 parametric bootstrap replicates can be summed together as a Binomial(50, 19/50), and our estimate for P is again Binomial(50, 19/50)/50. We could instead have used a Bayesian inference approach. With a Uniform(0, 1) prior and a binomial likelihood function (which assumes the population is much larger than the sample), we would have an estimate of prevalence using the beta distribution (see Section 8.2.3):

P = Beta(20, 32)

Figure 9.21 plots the Bayesian estimate alongside the bootstrap for comparison.
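The two uncertainty distributions being compared in Figure 9.21 can be drawn by simulation; this sketch (my own NumPy version, not from the book) generates both:

```python
import numpy as np

rng = np.random.default_rng(42)
n, s = 50, 19                                # 19 of 50 surveyed said Democrat

# Non-parametric bootstrap: resampling the 50 0/1 answers with replacement
# is equivalent to drawing Binomial(50, 19/50) and dividing by 50.
boot_P = rng.binomial(n, s / n, size=100_000) / n

# Bayesian estimate with a Uniform(0, 1) prior: P ~ Beta(s + 1, n - s + 1).
bayes_P = rng.beta(s + 1, n - s + 1, size=100_000)

# Both centre near 19/50 = 0.38; the bootstrap version only ever takes the
# 51 discrete values 0/50, 1/50, ..., 50/50, while the Bayesian is continuous.
```

Plotting histograms of `boot_P` and `bayes_P` side by side reproduces the qualitative picture of Figure 9.21.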
They are very close, except that the bootstrap estimate is discrete and the Bayesian is continuous, and, as the sample size increases, they would become progressively closer.

Figure 9.21 Bootstrap and Bayesian estimates of prevalence for Example 9.9.

9.3.2 The parametric bootstrap

The non-parametric bootstrap in the previous section made no assumptions about the distributional form of the population (parent) distribution. However, there will be many times when we know to which family of distributions the parent distribution belongs. For example, the number of earthquakes each year and the number of Giardia cysts in litres of water drawn from a lake will logically both be approximately Poisson distributed; the time between phone calls to an exchange will be roughly exponentially distributed; and the number of males in randomly sampled groups of a certain size will be binomially distributed. The parametric bootstrap gives us a means to use the extra information we have about the population distribution. The procedure is as follows:

1. Collect the dataset of n samples {x1, ..., xn}.
2. Determine the parameter(s) of the distribution that best fit(s) the data from the known distribution family using maximum likelihood estimators (MLEs - see Section 10.3.1).
3. Generate B bootstrap samples {x1*, ..., xn*} by randomly sampling from this fitted distribution.
4. For each bootstrap sample {x1*, ..., xn*}, calculate the required statistic θ̂.
5. The distribution of these B estimates of θ̂ represents the bootstrap estimate of uncertainty about the true value of θ.

We can illustrate the technique by using the pupil measurement data again. Let us assume that we know for some reason (perhaps experience from other countries) that this measurement should be normally distributed for a population.
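The procedure above can be sketched in Python for the pupil-diameter data from the earlier example, under the normal assumption just stated (the book does this in a spreadsheet; the seed and B are arbitrary choices for the sketch):

```python
import random
import statistics

random.seed(3)

data = [5.92, 5.06, 6.16, 5.60, 4.87, 5.61, 5.72, 5.36, 6.03, 5.71]
n = len(data)

# Step 2: fit a normal distribution to the data (mean and standard deviation)
mu_hat = statistics.mean(data)
sd_hat = statistics.stdev(data)  # sample sd; the strict MLE divides by n instead

B = 10_000
# Steps 3-4: draw n values from the fitted normal and record the mean
boot_means = [
    statistics.mean(random.gauss(mu_hat, sd_hat) for _ in range(n))
    for _ in range(B)
]

print(statistics.mean(boot_means))   # close to the sample mean, 5.604
print(statistics.stdev(boot_means))  # close to sd_hat / sqrt(n)
```

The spread of the bootstrap means matches the familiar standard error of the mean, sd/√n, which is why the result resembles the classical answer.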
The normal distribution has two parameters, its mean and standard deviation, both of which we will assume to be unknown, and their MLEs are the mean and standard deviation of the data to be fitted. The mean and standard deviation of the pupil measurements are 5.604 mm and 0.410 mm respectively. Figure 9.22 shows a spreadsheet model where, in column D, 10 Normal(5.604, 0.410) distributions are randomly sampled to give the bootstrap sample. Cell D14 calculates the mean (the statistic of interest) of the bootstrap sample. Figure 9.23 shows the results of this parametric bootstrap model, together with the result from applying the classical statistics method of Equation (9.2) - they are very similar. The result also looks very similar to the non-parametric distribution of Figure 9.20. In comparison with the classical statistics model, which happens to be exact for this particular problem (i.e. when the parent distribution is normal), both bootstrap methods provide a narrower range. In other words, the bootstrap in its simplest form tends to underestimate the uncertainty associated with the parameter of interest. A number of corrective measures are proposed in Efron and Tibshirani (1993).

Figure 9.22 Example of a parametric bootstrap model. (Cells C4:C13 hold the data; C14 = AVERAGE(C4:C13) and C15 = STDEV(C4:C13) give the fitted parameters; cells D4:D13 contain =VoseNormal($C$14,$C$15); cell D14 averages the bootstrap sample.)

Figure 9.23 Results of the parametric bootstrap model of Figure 9.22, together with the classical statistics result (true mean pupil diameter, mm).

Imagine that we wish to estimate the true depth of a well using some sort of sonic probe. The probe has a known standard error σ = 0.2 m, i.e.
σ is the standard deviation of the normally distributed variation in results the probe will produce when repeatedly measuring the same depth. In order to estimate this depth, we take n separate measurements. These measurements have a mean of x̄ metres. The parametric bootstrap model would take the average of n Normal(x̄, σ) distributions to estimate the true mean μ of the distribution of possible measurement results, i.e. the true well depth. From the central limit theorem, we know that this calculation is equivalent to

μ = Normal(x̄, σ/√n)

which is the classical statistics result of Equation (9.3).

Parametric bootstrap estimate of the standard deviation of a normal distribution

It can also be shown that the parametric bootstrap estimates of the standard deviation of a normal distribution when the mean is and is not known are exactly the same as the classical statistics estimates given in Equations (9.5) and (9.6) (the reader may like to prove this, bearing in mind that the ChiSq(ν) distribution is the sum of the squares of ν independent unit normal distributions).

Example 9.10 Parametric bootstrap estimate of mean time between calls at a telephone exchange

Imagine that we want to predict the number of phone calls there will be at an exchange during a particular hour in the working day (say 2 p.m. to 3 p.m.). Imagine that we have collected data from this period on n separate, randomly selected days. It is reasonable to assume that telephone calls will arrive at a Poisson rate, since each call will be, roughly speaking, independent of every other. Thus, we could use a Poisson distribution to model the number of calls in an hour. The maximum likelihood estimate (MLE) of the mean number of calls per hour at this time of day is simply the average number of calls observed in the test periods, x̄ (see Example 10.3 for proof). Thus, our bootstrap replicate is a set of n independent Poisson(x̄) distributions.
To generate our uncertainty about the true mean number of phone calls per hour at this time of the day, we calculate the mean of the sum of the bootstrap replicate, i.e. the average of n independent Poisson(x̄) distributions. The sum of n independent Poisson(x̄) distributions is simply Poisson(nx̄), so the average of n Poisson(x̄) distributions is Poisson(nx̄)/n, where nx̄ is simply the sum of the observations. So, in general, if one has observations from n periods, the Poisson parametric bootstrap for the mean number of observations per period λ is given by

λ = Poisson(S)/n

where S is the sum of observations in the n periods. The uncertainty distribution of λ should be continuous, as λ can take any positive real value. However, the bootstrap will only generate discrete values for λ, i.e. {0, 1/n, 2/n, ...}. When n is large this is not a problem, since the allowable values are close together, but when S is small the approximation starts to fall down. Figure 9.24 illustrates three Poisson parametric bootstrap estimates for λ, for S = 2, 10 and 20 combined with n = 5. For S = 2, the discreteness will in some circumstances be an inadequate uncertainty model for λ, and a different technique like Bayesian inference would be preferable. However, for values of S around 20 or more, the allowable values are relatively close together. For large S, one can also add back the continuous characteristic of the parameter by making a normal approximation to the Poisson, i.e. since Poisson(a) ≈ Normal(a, √a) we get

λ ≈ Normal(S/n, √S/n)

or, replacing S/n with x̄, we get

λ ≈ Normal(x̄, √(x̄/n))

which also illustrates the familiar reduction in uncertainty as the square root of the number of data points n.

Figure 9.24 Three Poisson parametric bootstrap estimates for λ for S = 2, 10 and 20 from Example 9.10.
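The identity this rests on, that the average of n independent Poisson(x̄) samples has the same distribution as Poisson(nx̄)/n = Poisson(S)/n, can be checked by simulation. Python's standard library has no Poisson sampler, so this illustrative sketch uses Knuth's multiplication method, which is fine for small means:

```python
import math
import random
import statistics

random.seed(4)

def poisson(lam):
    """Knuth's method: count uniform factors until the running product
    drops to exp(-lam). Suitable for small lam."""
    limit = math.exp(-lam)
    k, prod = 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

n, S = 5, 20   # 5 observation periods, 20 calls observed in total
xbar = S / n   # MLE of the mean number of calls per period

B = 20_000
# Route 1: average of n independent Poisson(xbar) samples
route1 = [statistics.mean(poisson(xbar) for _ in range(n)) for _ in range(B)]
# Route 2: a single Poisson(S) sample divided by n
route2 = [poisson(S) / n for _ in range(B)]

print(statistics.mean(route1), statistics.mean(route2))          # both near S/n = 4.0
print(statistics.variance(route1), statistics.variance(route2))  # both near S/n**2 = 0.8
```

Both routes agree in mean and variance, and both only ever produce multiples of 1/n, which is the discreteness discussed in the text.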
9.3.3 The Bayesian bootstrap

The Bayesian bootstrap is considered to be a robust Bayesian approach for estimating a parameter of a distribution where one has a random sample x from that distribution. It proceeds in the usual bootstrap way, determining a distribution of θ, the density of which is then interpreted as the likelihood function l(x|θ). This is then used in the standard Bayesian inference formula (Equation (9.8)) along with a prior distribution π(θ) for θ to determine the posterior distribution. In many cases, the bootstrap distribution for θ closely approximates a normal distribution, so, by calculating the mean and standard deviation of the B bootstrap replicates θ̂, one can quickly define a likelihood function.

9.4 Maximum Entropy Principle

The maximum entropy formalism (sometimes known as MaxEnt) is a statistical method for determining a distribution of maximum logical uncertainty about some parameter, consistent with a certain limited amount of information. For a discrete variable, MaxEnt determines the distribution that maximises the function H(x), where

H(x) = -Σ (i = 1 to M) pi ln(pi)

and where pi is the confidence for each of the M possible values xi of the variable x. The function H(x) takes the equation of a statistical mechanics property known as entropy, which gives the principle its name. For a continuous variable, H(x) takes the form of an integral function:

H(x) = -∫ f(x) ln(f(x)) dx

The appropriate uncertainty distribution is determined by the method of Lagrange multipliers, and, in practice, the continuous variable equation for H(x) is replaced by its discrete counterpart. It is beyond the scope of this book to look too deeply into the mathematics, but there are a number of results that are of general interest. MaxEnt is often used to determine appropriate priors in a Bayesian analysis, so the results listed in Table 9.3 give some reassurance to prior distributions we might wish to use conservatively to represent our prior knowledge.
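A quick numerical illustration of the discrete result (not from the book): with no constraints beyond the probabilities summing to 1, the entropy H = -Σ pi ln(pi) over M values is maximised by the discrete uniform distribution, which attains H = ln(M):

```python
import math
import random

def entropy(p):
    """Discrete entropy H = -sum(p_i * ln(p_i)), with 0 * ln(0) taken as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

M = 4
h_uniform = entropy([1 / M] * M)  # equals ln(M)

# Compare against 1000 random distributions over the same M values
random.seed(5)
h_others = []
for _ in range(1000):
    w = [random.random() for _ in range(M)]
    total = sum(w)
    h_others.append(entropy([wi / total for wi in w]))

print(h_uniform)        # ln(4), about 1.386
print(max(h_others))    # always below the uniform's entropy
```

None of the randomly generated distributions beats the uniform, consistent with the first row of Table 9.3.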
The reader is recommended Sivia (1996) for a very readable explanation of the principle of MaxEnt and derivation of some of its results. Gzyl (1995) provides a far more advanced treatise on the subject, but requires a much higher level of mathematical understanding. The normal distribution result is interesting and provides some justification for the common use of the normal distribution when all we know is the mean and variance (standard deviation), since it represents the most reasonably conservative estimate of the parameter given that set of knowledge. The uniform distribution result is also very encouraging when estimating a binomial probability, for example. The use of a Beta(s + a, n - s + b) to represent the uncertainty about the binomial probability p when we have observed s successes in n trials assumes a Beta(a, b) prior. A Beta(1, 1) is a Uniform(0, 1) distribution, and thus our most honest estimate of p is given by Beta(s + 1, n - s + 1).

Table 9.3 Maximum entropy method.

State of knowledge | MaxEnt distribution
Discrete parameter, n possible values {xi} | DUniform({xi}), i.e. p(xi) = 1/n
Continuous parameter, minimum and maximum | Uniform(min, max), i.e. f(x) = 1/(max - min)
Continuous parameter, known mean μ and variance σ² | Normal(μ, σ)
Continuous parameter, known mean μ | Expon(μ)
Discrete parameter, known mean μ | Poisson(μ)

9.5 Which Technique Should You Use?

I have discussed a variety of methods for estimating your uncertainty about some model parameter. The question now is: which one is best? There are some situations where classical statistics has exact methods for determining confidence intervals. In such cases, it is of course sensible to use those methods, and the results are unlikely to be challenged. In situations where the assumptions behind traditional statistical methods are being stretched rather too much for comfort, you will have to use your judgement as to which technique to use.
Bootstraps, particularly the parametric bootstrap, are powerful classical statistics techniques and have the advantage of remaining purely objective. They are widely accepted by statisticians and can also be used to determine uncertainty distributions for statistics like the median, kurtosis or standard deviation for parent distributions where classical statistics has no method to offer. However, the bootstrap is a fairly new technique (in statistical terms), so you may find people resistant to making decisions based on its results, and the results can be rather "grainy". The Bayesian inference technique requires some knowledge of an appropriate likelihood function, which may be difficult and will often require some subjectivity in assessing what is a sufficiently accurate function to use. Bayesian inference also requires a prior, which can be contentious at times, but has the potential to include knowledge that the other techniques cannot allow for. Traditional statisticians will sometimes offer a technique to use on your data that implicitly assumes a random sample from a normal distribution, though the parent distribution is clearly not normal. This usually involves some sort of approximation or a transformation of the data (e.g. by taking logs) to make the data better fit a normal distribution. While I appreciate the reasons for doing this, I do find it difficult to know what errors one is introducing by such data manipulation. Quite often in our consulting work there is no option but to use Gibbs sampling, because it is the only practical way to produce the multivariate estimates that a risk analysis needs. The WinBUGS program may be a little difficult to use, but the models can be made very transparent. I suggest that, if the parameter to your model is important, it may well be worth comparing two techniques (for example, non-parametric bootstrap (or parametric, if possible) and Bayesian inference with an uninformed prior).
It will certainly give you greater confidence if there is reasonable agreement between any two methods you might choose. What is meant by reasonable will depend on your model and the level of accuracy your decision-maker needs from that model. If you find there appears to be appreciable disagreement between two methods that you test, you could try running your model twice, once with each estimate, and seeing whether the model outputs are significantly different. Finally, if the uncertainty distributions between two methods are significantly different and you cannot choose between them, it makes sense to accept that this is another source of uncertainty and simply combine the two distributions, using a discrete distribution, in the same way I describe in Section 14.3.4 on combining differing expert opinions.

9.6 Adding Uncertainty in Simple Linear Least-Squares Regression Analysis

In least-squares regression, one is attempting to model the change in one variable y (the response or dependent variable) as a function of one or more other variables {x} (the explanatory or independent variables). The regression relationship between {x} and y minimises the sum of squared errors between a fitted equation for y and the observations. The theory of least-squares regression assumes the random variations about this line (resulting from effects not explained by the explanatory variables) to be normally distributed with constant variance across all {x} values, which means the fitted line describes the mean y value for a given set of {x}. For simplicity we will consider a single explanatory variable x (i.e. simple regression analysis), and that the relationship between x and y is linear (which is linear regression analysis), i.e.
we will use a model of the variability in y as a result of changes in x with the following equation:

y = Normal(mx + c, σ)

where m and c are the gradient and y-intercept of the straight-line relationship between x and y, and σ is the standard deviation of the additional variation observed in y that is not explained by the linear equation in x. Figure 6.11 illustrates these concepts. In least-squares linear regression, we typically have a set of n paired observations {xi, yi} for which we wish to fit this linear relationship.

9.6.1 Classical statistics

Classical statistics theory (see Section 6.3.9) provides us with the best-fitting values for m, c and σ, assuming the model's assumptions to be correct, which we will name m̂, ĉ and σ̂. It also gives us exact distributions of uncertainty for the estimate ŷp = m̂xp + ĉ at some value xp (see, for example, McClave, Dietrich and Sincich, 1997) and for σ, as follows:

ŷp = m̂xp + ĉ + t(n - 2) · s · √(1/n + (xp - x̄)²/SSx)

σ = s · √((n - 1)/χ²(n - 1))

where t(n - 2) is a Student t-distribution with (n - 2) degrees of freedom, χ²(n - 1) is a chi-square distribution with (n - 1) degrees of freedom, SSx = Σ(xi - x̄)², and s is the standard deviation of the differences ei between the observed value yi and its predictor ŷi = m̂xi + ĉ.

Figure 9.25 Simple least-squares regression uncertainty about ŷ for the dataset of Table 9.4 (body weight, kg, against log10 body weight).

The uncertainty distribution for σ is independent of the uncertainty distribution for (mx + c), since the model assumes that the random variations about the regression line are constant, i.e. that they are independent of the values of x and y. It turns out that these same results are given by Bayesian inference with uninformed priors, i.e. π(m, c, σ) ∝ 1/σ. The uncertainty equation for ŷi = mxi + c produces a relationship between x and y with uncertainty that is pinched at the middle, as shown in the simple least-squares regression analysis of Figure 9.25 for the data in Table 9.4.
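In classical simple regression, the uncertainty about the mean response at xp scales with the factor √(1/n + (xp - x̄)²/SSx), which is smallest when xp = x̄ and grows towards either extreme; that is what pinches the bounds at the middle. A quick check of this, using the log10 brain-weight values from the dataset as the x values (the grid of test points is arbitrary):

```python
import statistics

# Explanatory values: log10 brain weights from the mammal dataset
x = [-1.361, -0.348, 0.230, 0.454, 1.167, 1.211, 1.348, 2.572, 2.854, 3.515]
n = len(x)
xbar = statistics.mean(x)
ssx = sum((xi - xbar) ** 2 for xi in x)

def se_factor(xp):
    """Relative width of the uncertainty about y-hat at xp
    (the factor multiplying t(n-2) * s)."""
    return (1 / n + (xp - xbar) ** 2 / ssx) ** 0.5

# Width is minimised at the mean of the x values, growing towards the extremes
widths = [(xp, se_factor(xp)) for xp in (-1.5, 0.0, xbar, 2.0, 3.5)]
for xp, w in widths:
    print(f"x = {xp:6.3f}  width factor = {w:.3f}")
```

At xp = x̄ the factor reduces to √(1/n), the uncertainty of a plain sample mean; everywhere else the quadratic term adds to it.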
This makes sense as, the further we move towards the extremes of the set of observations, the more uncertain we should be about the relationship. This dataset describes the relationship between the weight of a mammal in kilograms and the mean weight of the brain of a mammal in grams at that body weight. Strictly speaking, the theory of regression analysis says that the relationship can only be considered to hold within the range of observed values for x. However, with caution, one can reasonably extrapolate a little past the range of observed body weights, although, the further one extends beyond the observed range, the more tenuous the validity of the analysis becomes. Including uncertainty in a regression analysis means that we now have a family of normal distributions representing the possible value of y, given a specific value for x. The normal distribution reflects the observed variability about the regression line. That there is a family of these distributions reflects our uncertainty about the coefficients for the regression equation and therefore the parameters for the normal distribution.

Table 9.4 Experimental measurements of the weight of mammals' bodies and brains.

Brain weight (g) | Body weight (kg)
0.0436 | 0.685
0.4492 | 29.05
1.698 | 175.92
2.844 | 50.856
14.69 | 155.74
16.265 | 294.52
22.309 | 193.49
372.97 | 1034.4
713.72 | 9958.02
3270.15 | 35160.5

The bootstrap

The variables x and y will fit a simple least-squares regression model if the underlying relationship between these two variables is one of two forms: type A, where the {xi, yi} observations are drawn from a bivariate normal distribution in x and y; or type B, where, for any value x, the distribution of possible response values in y is Normal(mx + c, σ(x)) distributed and, for the time being, σ(x) = σ, i.e. the random variations about the line have the same standard deviation (known as homoscedasticity). In order to use the bootstrap to determine the uncertainty about the regression coefficients, we must first determine which of these two relationships is occurring.
Essentially, this is equivalent to the design of the experiment that produced the {xi, yi} observations. The experiment design is of type A if we are making random observations of x and y together, whereas the experiment design is of type B if we are testing at different specific values of x to determine the response in y. So, for example, the {body weight, brain weight} data from Table 9.4 are of type A if we have attempted to pick a fairly random sample of mammals, whereas they would be of type B if we had picked an animal from each of the 20 subspecies of a species of some particular mammal. If, for example, we were doing an experiment to demonstrate Hooke's law by adding incremental weights to a hanging spring and observing the resultant extension beyond the spring's original length, the {mass, extension} observations would again be of type B, because we are specifically controlling the x values to observe the resultant y values. For type A data, the regression coefficients can be thought of as parameters of a bivariate normal distribution. Thus, using the non-parametric bootstrap, we simply resample from the paired observations {xi, yi} and, at each bootstrap replicate, calculate the regression coefficients. Figure 9.26 illustrates this type of analysis set out in a spreadsheet model for the dataset of Table 9.4. For type B data, the x values are fixed, since they were predetermined rather than resulting from a random sample from a distribution.
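The type A pairs-resampling bootstrap can be sketched in Python using the log10 brain/body weight pairs from the mammal dataset (the book's spreadsheet version is Figure 9.26; B and the seed here are arbitrary choices for the sketch):

```python
import math
import random
import statistics

random.seed(6)

brain = [0.0436, 0.4492, 1.698, 2.844, 14.69, 16.265, 22.309, 372.97, 713.72, 3270.15]
body = [0.685, 29.05, 175.92, 50.856, 155.74, 294.52, 193.49, 1034.4, 9958.02, 35160.5]
pairs = [(math.log10(bw), math.log10(by)) for bw, by in zip(brain, body)]

def fit(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    xs, ys = zip(*points)
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, ybar - m * xbar

m_orig, c_orig = fit(pairs)

B = 5_000
slopes = []
for _ in range(B):
    sample = random.choices(pairs, k=len(pairs))   # resample the pairs, not the residuals
    if len(set(x for x, _ in sample)) > 1:         # guard against a degenerate resample
        slopes.append(fit(sample)[0])

print(m_orig)                   # slope fitted to the original data
print(statistics.mean(slopes))  # bootstrap slopes centre near m_orig
print(statistics.stdev(slopes)) # spread = uncertainty about the slope
```

Each replicate refits the line to a resampled set of pairs, so the scatter of the replicate coefficients is the bootstrap uncertainty about m and c.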
Figure 9.26 Example model for a data pairs resampling (type A) bootstrap regression analysis. (Columns B and C hold the brain and body weights from Table 9.4; columns D and E take their log10 values with =LOG(B4) etc.; column F draws a bootstrap x value with =VoseDuniform($D$4:$D$13) and column G looks up its paired y value with =VLOOKUP(F4,$D$4:$E$13,2); the replicate's coefficients are then =SLOPE(G4:G13,F4:F13), =INTERCEPT(G4:G13,F4:F13) and =STEYX(G4:G13,F4:F13).)

Assuming the random variations about the regression line to be homoscedastic and the straight-line relationship to be correct, the only random variable involved is
that producing the variations about the line, and so we seek to bootstrap the residuals. If we know the residuals are normally distributed, we can use a parametric bootstrap model, as follows:

1. Determine Syx, the standard deviation of the residuals about the least-squares regression line for the original dataset.
2. For each of the x values in the dataset, randomly sample from a Normal(ŷ, Syx), where ŷ = m̂x + ĉ and m̂ and ĉ are the least-squares regression coefficients for the original dataset.
3. Determine the least-squares regression coefficients for this bootstrap sample.
4. Repeat for B iterations.

Figure 9.27 illustrates this procedure in a spreadsheet model for the {body weight, brain weight} data.

Figure 9.27 Example model for a residuals resampling (type B) parametric bootstrap regression analysis. (Columns D and E hold the log10 data; column G computes the residuals with =E4-TREND($E$4:$E$13,$D$4:$D$13,D4) and cell G15 their standard deviation; column H generates the bootstrap sample with =VoseNormal(E4-G4,$G$15); the outputs are =SLOPE(H4:H13,D4:D13), =INTERCEPT(H4:H13,D4:D13) and =STEYX(H4:H13,D4:D13).)

Although this procedure works quite well, it would be better to use the classical statistics approach described above, which offers exact answers under these conditions. However, a slight modification to the above approach allows one to use a non-parametric bootstrap, i.e.
where we can remove the assumption of normally distributed residuals, which may often not be very accurate. For the non-parametric model, we must first develop a non-parametric distribution of residuals by changing them to have constant variance. We define the modified residual ri as follows:

ri = ei/√(1 - hi)

where the leverage hi is given by

hi = 1/n + (xi - x̄)²/SSx

The mean of the modified residuals, r̄, is calculated. Then a bootstrap sample rj* is drawn from the set of ri values and used to determine the quantity (ŷj + rj* - r̄) for each xj value, which is used in step 2 of the algorithm above. Figure 9.28 provides a spreadsheet illustration of this type of model using data from Table 9.5.

Figure 9.28 Example model for a residuals resampling (type B) non-parametric bootstrap regression analysis. (Column B holds the masses 0.0 to 1.5 kg and column C the observed extensions; column D computes the residuals ei with =C3-TREND($C$3:$C$18,$B$3:$B$18,B3); column E computes the leverages with =1/16+(B3-$B$20)^2/$B$22, where B20 is the mean of the masses and B22 is the array formula {=SUM((B3:B18-$B$20)^2)}; column F computes the modified residuals with =D3/SQRT(1-E3); column G generates the bootstrap extension with =TREND($C$3:$C$18,$B$3:$B$18,B3)+Duniform($F$3:$F$18)-$F$19, where F19 is the mean of the modified residuals; the outputs are =SLOPE(G3:G18,B3:B18) and =INTERCEPT(G3:G18,B3:B18).)

In certain problems, it is logical that the y-intercept value c be set to zero.
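The modified-residual calculation, with ei the raw residuals and hi = 1/n + (xi - x̄)²/SSx, can be sketched in Python. Two properties worth checking: the raw residuals of a least-squares fit with an intercept sum to zero, and the leverages sum to exactly 2 (one per fitted parameter). This sketch reuses the log10 brain/body weight data purely for illustration:

```python
import math
import statistics

brain = [0.0436, 0.4492, 1.698, 2.844, 14.69, 16.265, 22.309, 372.97, 713.72, 3270.15]
body = [0.685, 29.05, 175.92, 50.856, 155.74, 294.52, 193.49, 1034.4, 9958.02, 35160.5]
xs = [math.log10(v) for v in brain]
ys = [math.log10(v) for v in body]
n = len(xs)

# Least-squares fit
xbar, ybar = statistics.mean(xs), statistics.mean(ys)
ssx = sum((x - xbar) ** 2 for x in xs)
m = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ssx
c = ybar - m * xbar

# Raw residuals e_i, leverages h_i and modified residuals r_i
e = [y - (m * x + c) for x, y in zip(xs, ys)]
h = [1 / n + (x - xbar) ** 2 / ssx for x in xs]
r = [ei / math.sqrt(1 - hi) for ei, hi in zip(e, h)]

print(sum(e))  # ~0: residuals about a fitted line with intercept sum to zero
print(sum(h))  # exactly 2 for simple regression with an intercept
```

Because 1 - hi < 1, each modified residual is at least as large in magnitude as its raw residual, with the biggest inflation at the extreme x values where the fitted line tracks the data most closely.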
In this situation, the leverage values are different:

hi = xi²/Σxj²

The modified residuals are thus also different and won't sum to zero, so it is essential to mean-correct the residuals before they are used to simulate random errors.

Table 9.5 Experimental measurements of the variation in length of a vertical spring as weight is attached to its end (masses of 0.0 to 1.5 kg in steps of 0.1 kg, with the corresponding extensions in mm; the values are shown in Figure 9.28).

Bootstrapping the data pairs is more robust than bootstrapping the residuals, as it is less sensitive to any deviation from the regression assumptions, but it won't be as accurate where the assumptions are correct. However, as the dataset increases in size, the results from bootstrapping the pairs approach those from bootstrapping the residuals, and it is also easier to execute, of course. These techniques can be extended to non-linear, non-constant variance and multiple linear regressions, described in detail in Efron and Tibshirani (1993) and Davison and Hinkley (1997).

Chapter 10 Fitting distributions to data

In this chapter I use the statistical methods I've described in Chapter 9 to fit probability distributions to data. I also briefly describe how regression models are fitted to data. There are other types of probability models we use in risk analysis: fitting time series and copulas are described elsewhere in this book. This chapter is concerned with a problem frequently confronted by the risk analyst: that of determining a distribution to represent some variable in a risk analysis model. There are essentially two sources of information used to quantify the variables within a risk analysis model. The first is available data and the second is expert opinion. Chapter 14 deals with the quantification of the parameters that describe the variability purely from expert opinion. Here I am going to look at techniques to interpret observed data for a variable in order to derive a distribution that realistically models its true variability and our uncertainty about that true variability.
Any interpretation of data by definition requires some subjective input, usually in the form of assumptions about the variable. The key assumption here is that the observed data can be thought of as a random sample from some probability distribution that we are attempting to identify. The observed data may come from a variety of sources: scientific experiments, surveys, computer databases, literature searches, even computer simulations. It is assumed here that the analyst has satisfied himself that the observed data are both reliable and as representative as possible. Anomalies in the data should be checked out first where possible and any unreliable data points discarded. Thought should also be given to any possible biases that could be produced by the method of data collection, for example: a high-street survey may have visited an unrepresentative number of large or affluent towns; the data may have come from an organisation that would benefit from doctoring the data; etc. I start by encouraging analysts to review the data they have available and the characteristics of the variable that is to be modelled. Several techniques are then discussed that enable analysts to fit the available data to an empirical (non-parametric) distribution. The key advantages of this intuitive approach are the simplicity of use, the avoidance of assuming some distribution form and the omission of inappropriate or confusing theoretical (parametric or model-based) distributions. Techniques are then described for fitting theoretical distributions to observed data, including the use of maximum likelihood estimators, optimising goodness-of-fit statistics and plots. For both non-parametric and parametric distribution fitting, I have offered two approaches. The first approach provides a first-order distribution, i.e. a best-fitting (best-guess) distribution that describes the variability only.
The second approach provides second-order distributions that describe both the variability of the variable and the uncertainty we have about what that true distribution of variability really is. Second-order distributions are more complete than their first-order counterparts and require more effort: if there is a sufficiently large set of data such that the inclusion of uncertainty provides only marginally more information, it is quite reasonable to approximate the distribution to one of variability only. That said, it is often difficult to gauge the degree of uncertainty one has about a distribution without having first formally determined its uncertainty. The reader is therefore encouraged at least to go through the exercise of describing the uncertainty of a variability distribution to determine whether the uncertainty needs to be included.

10.1 Analysing the Properties of the Observed Data

Before attempting to fit a probability distribution to a set of observed data, it is worth first considering the properties of the variable in question. The properties of the distribution or distributions chosen to be fitted to the data should match those of the variable being modelled. Software packages like BestFit, EasyFit, Stat::Fit and ExpertFit have made fitting distributions to data very easy and removed the need for any in-depth statistical knowledge. These products can be very useful but, through their automation and ease of use, inadvertently encourage the user to attempt fits to wholly inappropriate distributions. It is therefore worth considering the following points before attempting a fit: Is the variable to be modelled discrete or continuous? A discrete variable may only take certain specific values, for example the number of bridges along a motorway, but a measurement such as the volume of tarmac, for example, is continuous. A variable that is discrete in nature is usually, but not always, best fitted to a discrete distribution.
A very common exception is where the increment between contiguous allowable values is insignificant compared with the range that the variable may take. For example, consider a distribution of the number of people using the London Underground on any particular day. Although there can only be a whole number of people using the Tube, it is easier to model this number as a continuous variable since the number of users will number in the millions and there is little importance and considerable practical difficulty in recognising the discreteness of the number. In certain circumstances, discrete distributions can be very closely approximated by continuous distributions for large values of x. If a discrete variable has been modelled by a continuous distribution for convenience, its discrete nature can easily be put back into the risk analysis model by using the ROUND(...) function in Excel. The reverse of the above, however, never occurs, i.e. data from a continuous variable are always fitted to a continuous distribution and never a discrete distribution. Do I really need to fit a mathematical (parametric) distribution to my data? It is often practical to use the data points directly to define an empirical distribution, without having to attempt a fit to any theoretical probability distribution type. Section 10.2 describes these methods. Does the theoretical range of the variable match that of the fitted distribution? The fitted distribution should, within reason, cover the range over which the variable being modelled may theoretically extend. If the fitted distribution extends beyond the variable's possible range, a risk analysis model will produce impossible scenarios. If the distribution fails to extend over the entire possible range of the variable, the risk analysis will not reflect the true uncertainty of the problem.
For example, data on the oil saturation of a hydrocarbon reserve should be fitted to a distribution that is bounded at zero and 1, as values outside that range are nonsensical. It may turn out that a normal distribution, for example, fits the data far better than any other shape, but, of course, it extends from −∞ to +∞. In order to ensure that the risk analysis only produces meaningful scenarios, the normal distribution would be truncated in the risk analysis model at zero and 1. Note that a correctly fitted distribution will usually cover a range that is greater than that displayed by the observed data. This is quite acceptable because data are rarely observed at the theoretical extremes for the variable in question.

Do you already know the value of the distribution parameters? This applies most often to discrete variables. For example, a Hypergeometric(n, D, M) distribution describes the number of successes we might have in n samples taken without replacement from a population of size M, where a success means the individual comes from a subpopulation of size D. It seems unlikely that we would not know how many samples were taken to have observed our dataset of successes. More likely is that we already know n and D and are trying to estimate M, or we know n and M and are trying to estimate D. Discrete distributions like the binomial, beta-binomial, negative binomial, beta negative binomial, hypergeometric and inverse hypergeometric have either the number of samples n or the number of required successes s as parameters, and these will generally be known.

Is this variable independent of other variables in the model? The variable may be correlated with, or a function of, another variable within the model. It may also be related to another variable outside the model but which, in turn, affects a third variable within the risk analysis model. Figure 10.1 illustrates a couple of examples.
In example (a), a high-street bank's revenue is modelled as a function of the interest and mortgage rates, among other things. The mortgage rate is correlated to the interest rate since the interest rate largely defines what the mortgage rate is to be. This relationship must be included in the model to ensure that the simulation will only produce meaningful scenarios. There are two approaches to modelling such dependency relationships:

1. Determine distributions for the mortgage and interest rates on the basis of historical data and then correlate the sampling from these distributions during simulation.

2. Determine the distribution of interest rate from historical data and a (stochastic) functional relationship with the mortgage rate.

Figure 10.1 Examples of dependency between model variables: (a) direct and (b) indirect.

Method 1 is tempting because of its simple execution, but method 2 offers greater opportunity to reproduce any observed relationship between the two variables. In example (b) of Figure 10.1, a construction subcontractor is calculating her bid price to supply labour for a roofing job. The choice of roofing material has not yet been decided, and this uncertainty has implications for the person-hours that will be needed to construct the roofing timbers and to lay the roof. There is therefore an indirect dependency between these two variables that could easily have been missed, had she not looked outside the immediate components of her cost calculation. Missing this correlation would have resulted in an underestimation of the spread of the subcontractor's cost and potentially could have led her to quote a price that exposed her to significant loss.
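Method 2 can be sketched in a few lines of code. The sketch below is not the bank model from Figure 10.1; the rate means, standard deviations and the lender's margin are invented numbers, and numpy is assumed. The point is simply that deriving the mortgage rate from the interest rate through a stochastic functional relationship produces the correlation automatically, rather than imposing it on two independently fitted distributions.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_iterations = 10_000

# Hypothetical distribution for the interest rate (mean 4%, sd 1%),
# standing in for one fitted to historical data.
interest_rate = rng.normal(0.04, 0.01, n_iterations)

# Method 2: the mortgage rate is the interest rate plus an uncertain
# lender's margin (illustrative mean 1.5%, sd 0.2%).
margin = rng.normal(0.015, 0.002, n_iterations)
mortgage_rate = interest_rate + margin

# The two inputs are now strongly correlated without any explicit
# correlation coefficient having been imposed.
print(np.corrcoef(interest_rate, mortgage_rate)[0, 1])  # close to 1
```

Because the margin's spread is small relative to the interest rate's, the induced correlation is high; widening the margin distribution weakens it, which is exactly the flexibility method 2 offers over a fixed rank correlation.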
Correlation and dependency relationships form a vital part of many risk analyses. Chapter 13 describes several techniques to model correlation and dependencies between variables.

Does a theoretical distribution exist that fits the mathematics of this variable? Many theoretical distributions have developed as a result of modelling specific types of problem. These distributions then find a wider use in other problems that have the same mathematical structure. Examples include: the times between telephone calls at a telephone exchange, or between fires in a railway system, may be accurately represented by an exponential distribution; the time until failure of an electronics component may be represented by a Weibull distribution; how many treble 20s a darts player will score with a specific number of darts may be represented by a binomial distribution; the number of cars going through a road junction in any one hour may be represented by a Poisson distribution; and the heights of the tallest and shortest children in UK school classes may be represented by Gumbel distributions. If a distribution can be found with the same mathematical basis as the variable being modelled, it only remains to find the appropriate parameters to define the distribution, as explained in Section 10.3.

Does a theoretical distribution exist that is well known to fit this type of variable? Many types of variable have been observed closely to follow specific distribution types without any mathematical rationale being available to explain such close matching. Examples abound with the normal distribution: the weight of babies and other measures that come from nature, which is how the normal distribution got its name; measurement errors in engineering; variables that are the sum of other variables (e.g. means of samples from a population), etc. However, there are many other examples for distributions like the lognormal, Pareto and Rayleigh, some of which are noted in Appendix III.
If a distribution is known to be a close fit to the type of variable being modelled, usually as a result of published academic work, all that remains is to find the best-fitting distribution parameters, as explained in Section 10.3.

Errors - systematic and non-systematic

The collected data will at times have measurement errors that add another level of uncertainty. In most scientific data collection, the random error is well understood and can be quantified, usually by simply repeating the same measurement and reviewing the distribution of results. Such random errors are described as non-systematic. Systematic errors, on the other hand, mean that the values of a measurement deviate from the true value in a systematic fashion, consistently either over- or underestimating the true value. This type of error is often very difficult to identify and quantify. One will often attempt to estimate the degree of suspected systematic measurement error by comparing with measurements using another technique that is known (or believed) to have little or no systematic error.

Systematic and non-systematic error can both be accounted for in determining a distribution of fit. In determining a first-order distribution, one need only adjust the data by the systematic error (the non-systematic error has, by definition, a mean shift of zero). In second-order distribution fitting, one can model the data as being uncertain, with appropriate distributions representing both the non-systematic error and the systematic error (including uncertainty about what these error parameters are).

Sample size

Is the number of data points available sufficient to give a good idea of the true variability? Consider the 20 plots of Figure 10.2, which each show a random sample of 20 values drawn from a Normal(100, 10) distribution. These samples are all plotted as histograms with six evenly spaced bars, three either side of 100.
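A numerical sketch of this exercise, assuming numpy; the seed and the particular bin edges (70 to 130 in steps of 10, giving six bars, three either side of 100) are my own choices:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Draw 20 independent samples of 20 values each from Normal(100, 10)
# and bin each sample into six evenly spaced bars around 100.
edges = np.arange(70, 131, 10)  # 70, 80, ..., 130 -> 6 bins
counts = []
for _ in range(20):
    sample = rng.normal(100, 10, 20)
    hist, _ = np.histogram(sample, bins=edges)
    counts.append(hist)
counts = np.array(counts)

# Even though every sample comes from the same parent distribution,
# the bar heights vary widely from plot to plot:
print(counts.min(axis=0))
print(counts.max(axis=0))
```

Running this (or plotting each row of `counts`) reproduces the effect the figure illustrates: few of the 20 histograms look anything like a symmetric bell shape.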
The variation in shapes is something of an eye-opener to a lot of people, who expect to see plots that look at least reasonably like bell-shaped curves, symmetric about 100. After all, one might think that 20 data points is a reasonable number from which to draw some inference. The bottom-right panel in Figure 10.2 shows all 400 data values (i.e. 20 plots * 20 data values each), which looks something like a normal distribution but nonetheless still has a significant degree of asymmetry. It is an interesting and useful exercise when attempting to fit data to a distribution to see what sort of patterns one would observe if the data did truly come from the distribution that is being fitted. So, for example, if I had 30 data values that I was fitting to a Lognormal(10, 2) distribution, I could plot a variety of 30 Monte Carlo samples (not Latin hypercube samples, which force a better-fitting sample to the true distribution than a random sample would produce) from a Lognormal(10, 2) distribution in histogram form and see the different patterns they produce. I am at least then aware of the range of data patterns that I should accept as feasibly coming from that distribution for that size of sample.

Overdispersion of data

Sometimes we wish to fit a parametric distribution to observations, but note that the data appear to show a much larger spread than the fitted distribution would suggest. For example, in fitting a binomial distribution to the results of a multiple-question exam taken by a large class, one might imagine that the distribution of the number of correct answers could be modelled by a Binomial(n, p) distribution, where n is the number of questions and p is the average probability for the class of correctly answering a question.
The spread of the fitted binomial distribution is essentially determined by the mean np, since n is fixed, so there is no opportunity to match the fitted distribution to the data in terms of the observed spread in results as well as the average result. One plausible reason for the fit being poor is that there will be a range of abilities in the class. If one models the range of probabilities of successfully answering a question across all the individuals as a beta distribution, the resultant distribution of results will be drawn from a beta-binomial distribution, which is then the appropriate distribution to fit to the data. The extra variability added to the binomial distribution by making p beta distributed means that the beta-binomial distribution will always have more spread than the binomial. The beta-binomial distribution has three parameters: α, β and n, where α and β (sometimes written as α1 and α2) are the parameters of the beta distribution and n remains the number of trials. These three parameters allow a better and logical match to the mean and variance of the observations. As α and β become larger, the beta distribution becomes narrower, i.e. the participants have a narrow range of probabilities of successfully answering a question (the population is more homogeneous), and the Beta-Binomial(n, α, β) is then approximated well by a Binomial(n, α/(α + β)).

Figure 10.2 Examples of distributions of 20 random samples from a Normal(100, 10) distribution.

The same type of problem applies in fitting the Poisson(λ) distribution to data. Since the mean and variance are both equal to λ, the spread of the distribution is determined by the mean. Observed data are often more widely dispersed than a Poisson distribution might suggest, and this is often because the n observations come from Poisson processes with different means λ1, ..., λn.
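The mixing idea can be demonstrated numerically. The sketch below (numpy assumed; the parameter values are illustrative only) draws each observation's λ from a gamma distribution before drawing the Poisson count, and compares the resulting variance with that of a plain Poisson with the same mean:

```python
import numpy as np

rng = np.random.default_rng(seed=3)
n = 100_000

# Plain Poisson: mean and variance are both lambda.
lam = 4.0
plain = rng.poisson(lam, n)

# Gamma-mixed Poisson: each observation gets its own lambda drawn from
# a gamma distribution (shape=2, scale=2 are illustrative, chosen so
# the mean lambda is still 2 * 2 = 4).
lambdas = rng.gamma(shape=2.0, scale=2.0, size=n)
mixed = rng.poisson(lambdas)

print(plain.mean(), plain.var())  # both near 4
print(mixed.mean(), mixed.var())  # mean near 4, variance clearly larger
```

The mixed counts keep the same mean but show a variance of roughly 4 + 8 = 12 (the Poisson variance plus the variance of the gamma), which is exactly the overdispersion the text describes.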
For example, one might be looking at the failure rates of computers. Each computer will be slightly different from the next and so will have its own λ. If one models the distribution of variability of the λs using a Gamma(α, β) distribution, the resultant distribution of failures in a single time period is a Pólya(α, β). The Pólya distribution always has a variance greater than the mean, and its two parameters allow a greater flexibility in matching the distribution to the mean and variance of the observations.

Finally, data fitted to a normal distribution can often demonstrate longer tails than a normal distribution. In such cases, the three-parameter Student t-distribution can be used, i.e. Student(ν) * σ + μ, where μ is the fitted distribution's mean, σ is the fitted distribution's standard deviation and ν is the "degrees of freedom" parameter that determines the shape of the distribution. For ν = 1, this is the Cauchy distribution, which has infinite (i.e. undeterminable) mean and standard deviation. As ν gets larger, the tails shrink until at very large ν (some 50 or more) this looks like a Normal(μ, σ) distribution. The three-parameter Student t-distribution can be derived as the mixture of normal distributions with the same mean and different variances distributed as a scaled inverse χ². So, in attempting to fit data to the three-parameter Student t-distribution instead of a normal distribution, you would need to be able reasonably to convince yourself that the observations were drawn from normal distributions with the same mean and different variances.

10.2 Fitting a Non-Parametric Distribution to the Observed Data

This section discusses techniques for fitting an empirical distribution to data. We look at continuous and then discrete variables, and both first-order (variability only) and second-order (variability and uncertainty) fitting.
10.2.1 Modelling a continuous variable (first order)

If the observed variable is continuous and the dataset reasonably extensive, it is often sufficient to use a cumulative frequency plot of the data points themselves to define its probability distribution. Figure 10.3 illustrates an example with 18 data points. The observed F(x) values are calculated as the expected F(x) that would correspond to a random sampling from the distribution, i.e. F(x) = i/(n + 1), where i is the rank of the observed data point and n is the number of data points. An explanation for this formula is provided in the next section. Determination of the empirical cumulative distribution proceeds as follows:

- The minimum and maximum for the empirical distribution are subjectively determined on the basis of the analyst's knowledge of the variable. For a continuous variable, these values will generally be outside the observed range of the data. The minimum and maximum values selected here are 0 and 45.
- The data points are ranked in ascending order between the minimum and maximum values.
- The cumulative probability F(xi) for each xi value is calculated as:

F(xi) = i/(n + 1)    (10.1)

This formula maximises the chance of replicating the true distribution.

Figure 10.3 Fitting a continuous empirical distribution to data using a cumulative distribution (number of data points n = 18).

The two arrays, {xi} and {F(xi)}, along with the minimum and maximum values, can then be used as direct inputs into a cumulative distribution CumulA(min, max, {xi}, {F(xi)}). The VoseOgive function in ModelRisk will simulate values from a distribution constructed using the method above. If there is a very large amount of data, it becomes impracticable to use all of the data points to define the cumulative distribution. In such cases it is useful to batch the data first.
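Before turning to batching, the unbatched construction (rank the data, assign F(xi) = i/(n + 1), close the distribution with subjective minimum and maximum, and sample by inverse transform with linear interpolation) can be sketched as follows. This is not the ModelRisk implementation; numpy is assumed, and the 18 data values below are invented purely to mirror the shape of the Figure 10.3 example, with minimum 0 and maximum 45:

```python
import numpy as np

def empirical_cdf_points(data, minimum, maximum):
    """Rank the data and assign F(x_i) = i / (n + 1), closing the
    distribution at the subjective minimum (F = 0) and maximum (F = 1)."""
    xs = np.sort(np.asarray(data, dtype=float))
    n = len(xs)
    ps = np.arange(1, n + 1) / (n + 1)
    return (np.concatenate(([minimum], xs, [maximum])),
            np.concatenate(([0.0], ps, [1.0])))

def sample_ogive(data, minimum, maximum, size, rng):
    """Inverse-transform sampling, interpolating linearly between the
    cumulative points (the same idea as a CumulA/Ogive distribution)."""
    xs, ps = empirical_cdf_points(data, minimum, maximum)
    u = rng.random(size)
    return np.interp(u, ps, xs)

rng = np.random.default_rng(seed=11)
data = [12, 7, 30, 22, 18, 25, 9, 15, 21, 28, 11, 19,
        24, 16, 33, 14, 20, 26]  # 18 illustrative points
samples = sample_ogive(data, minimum=0, maximum=45, size=10_000, rng=rng)
print(samples.min(), samples.max())  # all values lie within [0, 45]
```

Note how the subjective minimum and maximum give the distribution tails beyond the observed data, as the text recommends.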
The number of batches should be set to the practical maximum that balances fineness of detail (large number of bars) with the practicalities of having large arrays defining the distribution (lower number of bars).

Example 10.1 Fitting a continuous non-parametric distribution to data

Figure 10.4 illustrates an example where 221 data points are plotted in histogram form over the range of the observed data. The analyst considers that the variable could conceivably range from 0 to 300. Since there are no observed data with values below 20 or above 280, the histogram bar ranges need to be altered to accommodate the subjective minimum and maximum.

Observed frequencies (number of data points n = 221):

From A   To B   Histogram probability f(A < x <= B)   Cumulative probability F(x <= B)
20       40     0.018                                 0.018
40       60     0.113                                 0.131
60       80     0.204                                 0.335
80       100    0.199                                 0.534
100      120    0.145                                 0.679
120      140    0.118                                 0.796
140      160    0.050                                 0.846
160      180    0.045                                 0.891
180      200    0.045                                 0.937
200      220    0.023                                 0.959
220      240    0.027                                 0.986
240      260    0.009                                 0.995
260      280    0.005                                 1.000

Modelled distribution:

From A   To B   Cumulative probability F(x <= B)
0        40     0.018
40       60     0.131
60       80     0.335
80       100    0.534
100      120    0.679
120      140    0.796
140      160    0.846
160      180    0.891
180      200    0.937
200      220    0.959
220      240    0.986
240      260    0.995
260      300    1.000

Figure 10.4 Fitting an empirical distribution to histogrammed data using a cumulative distribution.

The easiest way to achieve this is to extend the range of the first and last bars with non-zero probability to cover the required range, but without altering their probabilities. In this example, the histogram bar with range 20-40 is expanded to a range 0-40, and the bar with range 260-280 is expanded to range 260-300. We will probably have slightly exaggerated the tails of the distribution.
However, if the number of bars initially selected is quite large, there will be little real effect on the model. The {xi} array input into the cumulative distribution is then {40, 60, ..., 240, 260}, the {Pi} array is {0.018, 0.131, ..., 0.986, 0.995} and the minimum and maximum are, of course, 0 and 300 respectively.

Converting a histogram distribution into a cumulative distribution may seem a little pointless when the histogram can be used in a risk analysis model. However, this technique allows analysts to select varying bar widths to suit their needs, as in the above example, and therefore to maximise detail in the distribution where it is needed.

10.2.2 Modelling a continuous variable (second order)

When we do not have a great deal of data, a considerable amount of uncertainty will remain about an empirical distribution determined directly from the data. It would be very useful to have the flexibility of using an empirical distribution, i.e. not having to assume a parametric distribution, and also to be able to quantify the uncertainty about that distribution. The following technique provides both.

Consider a set of n data values {xj} drawn from a distribution and ranked in ascending order into {xi}, so that xi < xi+1. Data thus ranked are known as the order statistics of {xj}. Individually, each of the values of {xj} maps onto the cumulative probability F(x) of the parent distribution as a U(0, 1) random variable. We therefore take a U(0, 1) distribution as the prior distribution for the cumulative probability Pi = F(xi) of the ith observation. However, we have the additional information that, of n values drawn randomly from this distribution, xi ranked ith, i.e. (i − 1) of the data values are less than xi, and (n − i) values are greater than xi.
Using Bayes' theorem and the binomial theorem, the posterior marginal distribution for Pi can readily be determined, remembering that Pi has a U(0, 1) prior and therefore a prior probability density of 1:

f(Pi | xi ranked ith of n) ∝ Pi^(i−1) * (1 − Pi)^(n−i)

which is simply the standard beta distribution:

Pi = Beta(i, n − i + 1)    (10.2)

Equation (10.2) could actually be determined directly from the fact that the beta distribution is the conjugate prior to the binomial likelihood function and that a U(0, 1) = Beta(1, 1). The mean of the Beta(i, n − i + 1) distribution equals i/(n + 1): the formula that was used in Equation (10.1) to estimate the best-fitting first-order non-parametric cumulative distribution.

Since Pi+1 > Pi, these beta distributions are not independent, so we need to determine the conditional distribution f(Pi+1 | Pi), as follows. The joint distribution f(Pi, Pj) for any two Pi, Pj is calculated using the binomial theorem in a similar manner to the numerator above, that is

f(Pi, Pj) ∝ Pi^(i−1) * (Pj − Pi)^(j−i−1) * (1 − Pj)^(n−j)

where Pj > Pi, and remembering that the prior probability densities for Pi and Pj equal 1 since they have U(0, 1) priors. Thus, for j = i + 1,

f(Pi, Pi+1) ∝ Pi^(i−1) * (1 − Pi+1)^(n−i−1)

The conditional probability density f(Pi+1 | Pi) is thus given by

f(Pi+1 | Pi) = k * (1 − Pi+1)^(n−i−1) / (1 − Pi)^(n−i)

where k is some constant. The corresponding cumulative distribution function F(Pi+1 | Pi) is then given by

F(Pi+1 | Pi) = (k/(n − i)) * (1 − ((1 − Pi+1)/(1 − Pi))^(n−i))

Since this must equal 1 when Pi+1 = 1, k = n − i and the formula reduces to

F(Pi+1 | Pi) = 1 − ((1 − Pi+1)/(1 − Pi))^(n−i)    (10.3)

Together, Equations (10.2) and (10.3) provide us with the tools to construct a non-parametric second-order distribution for a continuous variable given a dataset sampled from that distribution.

[Footnote: I submitted a paper on this technique (I developed the idea) for publication in a journal a long time ago. One reviewer was horribly dismissive, saying that the derivation was one of the most drunken s/he had ever seen, and anyway it was a Bayesian method (it isn't) so it was of no value. Actually, this has proven to be one of the most useful things I ever figured out.]
The distribution for the cumulative probability P1 that maps onto the first-order statistic x1 can be obtained from Equation (10.2) by setting i = 1:

P1 = Beta(1, n)    (10.4)

The distribution for the cumulative probability P2 that maps onto the second-order statistic x2 can then be obtained from Equation (10.3). Being a cumulative distribution function, F(Pi+1 | Pi) is Uniform(0, 1) distributed. Thus, writing Ui+1 to represent a Uniform(0, 1) distribution in place of F(Pi+1 | Pi), using the identity 1 − U(0, 1) = U(0, 1), and rewriting for Pi+1, we obtain

Pi+1 = 1 − (1 − Pi) * Ui+1^(1/(n−i))    (10.5)

which gives

P2 = 1 − (1 − P1) * U2^(1/(n−1))
P3 = 1 − (1 − P2) * U3^(1/(n−2))

etc. Note that each of the U2, U3, ..., Un uniform distributions is independent of the others.

The formulae from Equations (10.4) and (10.5) can be used as inputs into a cumulative distribution function available from standard Monte Carlo software tools like @RISK and Crystal Ball, together with subjective estimates of the minimum and maximum values that the variable may take. The variability ("inner loop") is described by the range for the variable in question and estimates of the cumulative distribution shape via the {xi} and {Pi} values. The uncertainty ("outer loop") is catered for by the uncertainty distributions for the minimum, maximum and Pi values.

The RiskCumul distribution function in @RISK, the VoseCumulA function in ModelRisk and the cumulative version of the custom distribution in Crystal Ball have the same cumulative distribution function, namely

F(x) = Pi + (x − xi) * (Pi+1 − Pi)/(xi+1 − xi)

where x0 = minimum, xn+1 = maximum, P0 = 0, Pn+1 = 1 and xi <= x < xi+1.

Figure 10.5 illustrates a model where a dataset is being used to create a second-order distribution using this technique. If the model is created in the current version of @RISK, the uncertainty distributions for F(x) in column D are nominated as outputs, a smallish number of iterations are run and the resultant data are exported back to a spreadsheet.
Those data are then used to perform multiple simulations (the "outer loop") of uncertainty using @RISK's RiskSimtable function; the "inner loop" of variability comes from the cumulative distribution itself, as shown in Figure 10.6.

Figure 10.5 Model to produce a second-order non-parametric continuous distribution.

Figure 10.6 @RISK model to run a second-order risk analysis using the data generated from the model of Figure 10.5.

Figure 10.7 Crystal Ball Pro model to run a second-order risk analysis using the data generated from the model of Figure 10.5.
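Outside a spreadsheet, the recursion of Equations (10.4) and (10.5) takes only a few lines. The sketch below (numpy assumed; n = 100 mirrors the 100-point example, and the 200 outer-loop draws are an arbitrary choice) generates repeated uncertainty samples of the cumulative probabilities P1, ..., Pn:

```python
import numpy as np

def second_order_cumulative_probs(n, rng):
    """One uncertainty sample of P_1..P_n:
    P_1 ~ Beta(1, n), then P_{i+1} = 1 - (1 - P_i) * U^(1/(n - i))."""
    p = np.empty(n)
    p[0] = rng.beta(1, n)
    for i in range(1, n):
        u = rng.random()
        p[i] = 1.0 - (1.0 - p[i - 1]) * u ** (1.0 / (n - i))
    return p

rng = np.random.default_rng(seed=5)
n = 100
draws = np.array([second_order_cumulative_probs(n, rng) for _ in range(200)])

# Each row is one plausible set of cumulative probabilities for the n
# order statistics; the spread across rows is the uncertainty.
print(draws[:, 49].mean())  # mean of P_50, near 50/101
```

Pairing each sampled row of Pi values with the sorted data (plus subjective minimum and maximum) gives one candidate cumulative distribution per outer-loop draw, which is exactly what the spreadsheet models in Figures 10.5 to 10.7 do.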
If one creates the model in Crystal Ball Pro, the F(x) distributions can be nominated as uncertainty distributions and the cumulative distribution nominated as the variability distribution, and the inner/outer loop procedure will run automatically (Figure 10.7).

There are a few limitations to this technique. In using a cumulative distribution function, one is assuming a histogram-style probability density function. When there are a large number of data points, this approximation becomes irrelevant. However, for small datasets the approximation tends to accentuate the tails of the distribution: a result of the histogram "squaring-off" effect of using the cumulative distribution. In other words, the variability will be slightly exaggerated. The squaring effect can be reduced, if required, by using some sort of smoothing algorithm and defining points between each observed value. In addition, for small datasets, the tails' contribution to the variability will often be more influenced by the subjective estimates of the minimum and maximum values: a fact one can view positively (one is recognising the real uncertainty about a distribution's tail) and negatively (the smaller the dataset, the more the technique relies on subjective assessment). The fewer the data points, the wider the confidence intervals will become, quite naturally, and, in general, the more emphasis will be placed on the subjectively defined minimum and maximum values. Conversely, the more data points available, the less influence the minimum and maximum estimates will have on the estimated distribution. In any case, the values of the minimum and maximum only influence the width (and therefore height) of the end two histogram bars in the fitted distribution. The fact that the technique is non-parametric, i.e.
that no statistical distribution with a particular cumulative distribution function is assumed to be underlying the data, allows the analyst a far greater degree of flexibility and objectivity than that afforded by fitting parametric distributions.

A further sophistication to this technique would be to correlate the uncertainty distributions for the minimum and maximum parameter values to the uncertainty distributions for P1 and Pn respectively. If P1 were to be sampled with a high value, it would make sense that the variability distribution had a long left tail and the value sampled for the minimum should be towards its lowest value. Similarly, a high value for Pn would suggest a low value for the maximum. One could model these relationships using either very high levels of negative rank order correlation, for simplicity, or some more involved but more explicit equation.

Example 10.2 Fitting a second-order non-parametric distribution to continuous data

Three datasets of five, then a further 15 and then another 80 random samples were drawn from a Normal(100, 10) distribution to give sets of five, 20 and 100 samples. The graphs of Figure 10.8 show, naturally enough, that the population distribution is approached with increasing confidence the more data values one has available.

Figure 10.8 Results of fitting a non-parametric distribution to data from a normal parent distribution: (a) five data points; (b) 20 data points; (c) 100 data points; (d) the true population distribution.

There are classical statistical techniques for determining confidence distributions for the mean and standard deviation of a normal distribution that is fitted to a dataset with a population normal distribution, as discussed in Section 9.1, namely:

Mean: μ = x̄ + (s/√n) * t(n − 1)

Standard deviation: σ = s * √((n − 1)/χ²(n − 1))
where μ and σ are the mean and standard deviation of the population distribution; x̄ and s are the mean and sample standard deviation of the n data points being fitted; t(n − 1) is a t-distribution with n − 1 degrees of freedom; and χ²(n − 1) is a chi-square distribution with n − 1 degrees of freedom.

The second-order distribution that would be fitted to the 100 data point set using the non-parametric technique is shown in the right-hand panel of Figure 10.9. The second-order distribution produced using the above statistical theory with the assumption of a normal distribution is shown in the left-hand panel of Figure 10.9. There is strong agreement between the two techniques. The statistical technique produces less uncertainty in the tails because the assumption of normality adds extra information that the non-parametric technique does not use. This is, of course, fine providing we know that the population distribution is truly normal, but it leads to overconfidence in the tails if the assumption is incorrect.

The advantage of the technique offered here is that it works for all continuous smooth distributions, not just the normal distribution. It can also be used to determine distributions of uncertainty for specific percentiles and quantiles of the population distribution, essentially by reading off values from the fitted cumulative distribution and interpolating as necessary between the defined points. Figure 10.10 shows a spreadsheet model for determining the percentile, defined in cell E3, of the population distribution, given the 100 data points from the normal distribution used previously. The uncertainty distribution for the percentile is produced by running a simulation with cell G3 as the output. Similarly, Figure 10.11 illustrates a spreadsheet to determine the cumulative probability that the value in cell F2 represents in the population distribution.
The distribution of uncertainty of this cumulative probability is produced by running a simulation with cell H2 as the output. In other words, the model in Figure 10.10 is slicing horizontally through the second-order fitted distribution at F(x) = 50 %, while the model of Figure 10.11 is slicing vertically at x = 99. The spreadsheets can, of course, be expanded or contracted to suit the number of data points available. ModelRisk includes the VoseOgive2 function that generates the array of F(x) variables required for second-order distribution modelling.

Figure 10.9 Comparison of second-order distributions using the non-parametric technique and classical statistics.

Figure 10.10 Model to determine the uncertainty distribution for a percentile.

10.2.3 Modelling a discrete variable (first order)

Data from a discrete variable can be used to define an empirical distribution in two ways: if the number of allowable x values is not very large, the frequency of data at each x value can be used directly to define a discrete distribution; and if the number of allowable x values is very large, it is usually easier to arrange the data into histogram form and then define a cumulative distribution, as above. The discrete nature of the variable can be reintroduced by embedding the cumulative distribution inside the standard spreadsheet ROUND(...) function.

10.2.4 Modelling a discrete variable (second order)

Uncertainty can be added to the discrete probabilities in the previous technique to provide a second-order discrete distribution.
Assuming that the variable in question is stable (i.e. is not varying with time), there is a constant (i.e. binomial) probability pi that any observation will have a particular value xi. If k of the n observations have taken the value xi, then our estimate of the probability pi is given by Beta(k + 1, n − k + 1) from Section 8.2.3. However, all these pi probabilities have to sum to 1.0, so we normalise the pi values.

Figure 10.11 Model to determine the uncertainty distribution for a quantile.

Figure 10.12 illustrates a spreadsheet that calculates the discrete second-order non-parametric distribution for the set of data in Table 10.1, where the distribution has been assumed to finish at the maximum observed value. There remains a difficulty in selecting the range of this distribution, and it will be a matter of judgement how far one extends the range beyond the observed values. In the simple form described here there is also a problem in determining the pi values for these unobserved tails, and for any middle range that has no observed values, since all such pi values will have the same (normalised) Beta(1, n + 1) distribution, no matter how extreme their position in the distribution's tail. This obviously makes no sense, and, if it is important to recognise the possibility of a long tail beyond the observed data, a modification is necessary.
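The Beta(k + 1, n − k + 1) estimates and their normalisation can be sketched as follows, with numpy standing in for the spreadsheet's VoseBeta formulae; the frequencies are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Illustrative observed frequencies k_i for each allowable value x_i
values = np.arange(6)
freqs = np.array([2, 11, 30, 35, 17, 5])
n = freqs.sum()                       # 100 observations in total

# One sample from the joint uncertainty distribution:
# each p_i ~ Beta(k_i + 1, n - k_i + 1), then normalised to sum to 1
p_raw = rng.beta(freqs + 1, n - freqs + 1)
p = p_raw / p_raw.sum()
```

Re-running the last two lines each simulation iteration gives a fresh normalised set of probabilities, i.e. one realisation of the second-order discrete distribution.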
The tail can be forced to zero by multiplying the beta distributions by some function that attenuates the tail, although the choice of function and the severity of the attenuation will ultimately be a subjective matter.

These last two techniques have the advantages that the distribution derived from the observed data is unaffected by any subjectivity in selecting a distribution type and that maximum use of the data has been made in defining the distribution. There is an obvious disadvantage in that the process is fairly laborious for large datasets. However, the FREQUENCY() function and Histogram facility in Excel, the BestFit statistics report, and other statistics packages can make sorting the data and calculating the cumulative frequencies very easy. More importantly, there remains a difficulty in estimating probabilities for values of the variable that have not been observed. If this is important, it may well be better to fit the data to a parametric distribution.

Figure 10.12 Model to determine a discrete non-parametric second-order distribution (key formula: =VoseBeta(C3+1,$C$23-C3+1), i.e. Beta(k + 1, n − k + 1)).

Table 10.1 Dataset to fit a discrete second-order non-parametric distribution.

10.3 Fitting a First-Order Parametric Distribution to Observed Data

This section describes methods of finding a theoretical (parametric) distribution that best fits the observed data. The following section deals with fitting a second-order parametric distribution, i.e.
a distribution where the uncertainty about the parameters needs to be recognised. A parametric distribution type may be selected as the most appropriate to fit the data for three reasons:

- The distribution's mathematics corresponds to a model that accurately represents the behaviour of the variable being considered (see Section 10.1).
- The distribution to be fitted is well known to fit this type of variable closely (see Section 10.1 again).
- The analyst simply wants to find the theoretical distribution that best fits the data, whatever it may be.

The third option is very tempting, especially when distribution-fitting software is available that can automatically attempt fits to a large number of distribution types at the click of an icon. However, this option should be used with caution. Analysts must ensure that the fitted distribution covers the same range over which, in theory, the variable being modelled may extend; for example, a four-parameter beta distribution fitted to data will not extend past the range of the observed data if its minimum and maximum are determined by the minimum and maximum of the observed data. Analysts should ensure that the discrete or continuous nature of the distribution matches that of the variable. They should also be flexible about using a different distribution type in a later model, should more data become available, although this may cause confusion when comparing old and new versions of the same model. Finally, they may find it difficult to persuade the decision-maker of the validity of the model: seeing an unusual distribution in a model, with no intuitive logic associated with its parameters, can easily invoke distrust of the model itself. Analysts should consider including in their report a plot of the distribution being used against the observed data to reassure the decision-maker of its appropriateness.
The distribution parameters that make a distribution type best fit the available data can be determined in several ways. The most common and most flexible technique is to determine parameter values known as maximum likelihood estimators (MLEs), described in Section 10.3.1. The MLEs of a distribution are the parameter values that maximise the joint probability density or probability mass for the observed data. MLEs are very useful because, for several common distributions, they provide a quick way to arrive at the best-fitting parameters. For example, the normal distribution is defined by its mean and standard deviation, and its MLEs are the mean and standard deviation of the observed data. More often than not, however, when we fit a distribution to data using maximum likelihood, we need to use an optimiser (like Microsoft Solver, which comes with Microsoft Excel) to find the combination of parameter values that maximises the likelihood function. Other methods of fit tend to find parameter values that minimise some measure of goodness of fit, some of which are described in Section 10.3.4. Both using MLEs and minimising goodness-of-fit statistics enable us to determine first-order distributions. For fitting second-order distributions, however, we need additional techniques for quantifying the uncertainty about parameter values, such as the bootstrap, Bayesian inference and some classical statistics.

10.3.1 Maximum likelihood estimators

The maximum likelihood estimators (MLEs) of a distribution type are the values of its parameters that produce the maximum joint probability density for the observed dataset x. In the case of a discrete distribution, MLEs maximise the actual probability of that distribution type being able to generate the observed data. Consider a probability distribution type defined by a single parameter a.
The likelihood function L(a) that a set of n data points {xi} could be generated from the distribution with probability density f(x) — or, in the case of a discrete distribution, probability mass — is given by

L(a) = f(x1; a) * f(x2; a) * ... * f(xn; a)

The MLE â is then that value of a that maximises L(a). It is determined by taking the partial derivative of L(a) with respect to a and setting it to zero:

∂L(a)/∂a = 0

For some distribution types this is a relatively simple algebraic problem; for others the differential equation is extremely complicated and is solved numerically instead. This is the equivalent of using Bayesian inference with a uniform prior and then finding the peak of the posterior uncertainty distribution for a. Distribution-fitting software has made this process very easy to perform automatically.

Example 10.3 Determining the MLE for the Poisson distribution

The Poisson distribution has one parameter, the product λt, or just λ if we let t be a constant. Its probability mass function f(x) is given by

f(x) = e^(−λt) (λt)^x / x!

Because of the memoryless character of the Poisson process, if we have observed x events in a total time t, the likelihood function is given by

L(λ) = e^(−λt) (λt)^x / x!

Let l(λ) = ln L(λ). Using the fact that t is a constant:

l(λ) = −λt + x ln(λ) + x ln(t) − ln(x!)

The maximum value of l(λ), and therefore of L(λ), occurs when the partial derivative with respect to λ equals zero, i.e.

∂l(λ)/∂λ = −t + x/λ = 0

Rearranging yields

λ̂ = x/t

i.e. the MLE is the average number of observations per unit time. ♦

10.3.2 Finding the best-fitting parameters using optimisation

Figure 10.13 illustrates a Microsoft Excel spreadsheet set up to find the parameters of a gamma distribution that best match the observed data. Excel provides the GAMMADIST function, which returns the probability density of a gamma distribution. The Microsoft Solver in Excel is set to find the maximum value for cell F5 (or, equivalently, F7) by changing the values of α and β in cells F2 and F3.
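The Poisson result of Example 10.3 and the Solver-style optimisation just described can both be sketched with scipy standing in for Excel and Solver; all data, seeds and starting values are illustrative:

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(seed=3)

# Example 10.3's result: the MLE of a Poisson rate is simply the observed mean
pois_data = rng.poisson(lam=4.5, size=500)
lam_hat = pois_data.mean()

# Mirroring the Solver set-up of Figure 10.13: maximise the gamma log-likelihood
# numerically (the dataset and its shape/scale are arbitrary choices)
data = rng.gamma(shape=2.5, scale=4.0, size=240)

def neg_log_lik(params):
    alpha, beta = params
    if alpha <= 0 or beta <= 0:        # keep the optimiser in the valid region
        return np.inf
    return -np.sum(stats.gamma.logpdf(data, a=alpha, scale=beta))

res = optimize.minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")
alpha_hat, beta_hat = res.x
```

Minimising the negative log-likelihood is numerically better behaved than maximising the product of densities directly, which is why the spreadsheet model also works with log densities.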
10.3.3 Fitting distributions to truncated, censored or binned data

Maximum likelihood methods offer the greatest flexibility for distribution fitting because we need only be able to write a probability model that corresponds with how our data are observed and then maximise that probability by varying the parameters.

Censored data are those observations that we do not know precisely, only that they fall above or below a certain value. For example, a weight scale will have a maximum value X it can record: we might have some measurements off the scale, and all we can say is that they are greater than X. Truncated data are those observations that we do not see above or below some level. For example, a bank may not be required to record an error below $100, and a sieve system may not select out diamonds from a river below a certain diameter. Binned data are those observations whose values we only know in terms of bins or categories. For example, one might record in a survey that customers were (0, 10], (10, 20], (20, 40] or 40+ years of age.

Figure 10.13 Using Solver to perform a maximum likelihood fit of a gamma distribution to data (key formulae: =LOG10(GAMMADIST(B3,alpha,beta,0)) per data point, =SUM(C3:C242), and the equivalent =VoseGammaProb10(B3:B242,alpha,beta,0)).

It is a simple matter to produce a probability model for each category or combination, as shown in the following examples, where we are fitting to a continuous variable with density f(x) and cumulative probability F(x):

Example 10.4 Censored data

Observations: measurement censored at Min and Max. Observations between Min and Max are a, b, c, d and e; p observations below Min and q observations above Max.

Likelihood function: f(a) * f(b) * f(c) * f(d) * f(e) * F(Min)^p * (1 − F(Max))^q

Explanation: for the p values we only know that they are below some value Min, and the probability of being below Min is F(Min).
We know that q values are above Max, each with probability (1 − F(Max)). For the other values we have exact measurements. ♦

Example 10.5 Truncated data

Observations: measurement truncated at Min and Max. Observations between Min and Max are a, b, c, d and e.

Likelihood function: f(a) * f(b) * f(c) * f(d) * f(e) / (F(Max) − F(Min))^5

Explanation: we only observe a value if it lies between Min and Max, which has probability (F(Max) − F(Min)). ♦

Example 10.6 Binned data

Observations: measurements binned into the continuous categories x ≤ 10, 10 < x ≤ 20, 20 < x ≤ 50 and x > 50, with observed frequencies n1, n2, n3 and n4 respectively.

Likelihood function: F(10)^n1 * (F(20) − F(10))^n2 * (F(50) − F(20))^n3 * (1 − F(50))^n4

Explanation: we observe values in bins between a Low and a High value with probability F(High) − F(Low). ♦

10.3.4 Goodness-of-fit statistics

Many goodness-of-fit statistics have been developed, but two are in most common use: the chi-square (χ²) and Kolmogorov-Smirnoff (K-S) statistics, generally used for discrete and continuous distributions respectively. The Anderson-Darling (A-D) statistic is a sophistication of the K-S statistic. The lower the value of these statistics, the more closely the theoretical distribution appears to fit the data.

Goodness-of-fit statistics are not intuitively easy to understand or interpret. They do not provide a true measure of the probability that the data actually come from the fitted distribution. Instead, they provide the probability that random data generated from the fitted distribution would have produced a goodness-of-fit statistic value as low as that calculated for the observed data. By far the most intuitive measure of goodness of fit is a visual comparison of probability distributions, as described in Section 10.3.5. The reader is encouraged to produce these plots to be assured of the validity of the fit before labouring over goodness-of-fit statistics.
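The likelihood functions of Examples 10.4 to 10.6 translate directly into code. The sketch below combines censored and binned contributions into one log-likelihood for an assumed normal fit; every value in it is hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical observations under the schemes above
exact = np.array([12.1, 14.7, 15.3, 17.9, 21.4])   # precisely measured points
p_below, q_above = 3, 2                            # censored counts at Min/Max
lo, hi = 10.0, 25.0
bin_edges = np.array([0.0, 10.0, 20.0, 50.0])      # binned data: 3 bins plus 50+
bin_counts = np.array([4, 9, 3, 2])                # last count is the 50+ tail

def log_lik(mu, sigma):
    """Log-likelihood for a Normal(mu, sigma) fit, combining the schemes."""
    d = stats.norm(mu, sigma)
    # Censored (Example 10.4): exact densities, plus F(Min)^p and (1 - F(Max))^q
    ll = d.logpdf(exact).sum()
    ll += p_below * d.logcdf(lo) + q_above * np.log(d.sf(hi))
    # Binned (Example 10.6): each bin contributes (F(high) - F(low))^count
    probs = np.append(np.diff(d.cdf(bin_edges)), d.sf(bin_edges[-1]))
    ll += (bin_counts * np.log(probs)).sum()
    # Truncated data (Example 10.5) would instead divide each exact density
    # by (F(Max) - F(Min)), i.e. subtract n * log(F(Max) - F(Min)) here.
    return ll

ll = log_lik(16.0, 8.0)
```

Maximising this function over (mu, sigma), e.g. with scipy.optimize, gives the MLE fit to the combined dataset.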
Critical values and confidence intervals for goodness-of-fit statistics

Analysis of the χ², K-S and A-D statistics can provide confidence intervals proportional to the probability that the fitted distribution could have produced the observed data. It is important to note that this is not equivalent to the probability that the data did, in fact, come from the fitted distribution, since there may be many distributions with similar shapes that could have been quite capable of generating the observed data. This is particularly so for data that are approximately normally distributed, since many distributions tend to a normal shape under certain conditions.

Critical values are determined by the required confidence level α. They are the values of the goodness-of-fit statistic that have a probability of being exceeded equal to the specified confidence level. Critical values for the χ² test are found directly from the χ² distribution. The shape and range of the χ² distribution are defined by the degrees of freedom ν, where ν = N − a − 1, N is the number of histogram bars or classes and a is the number of parameters that were estimated to determine the best-fitting distribution. Figure 10.14 shows a descending cumulative plot for the χ²(11) distribution, i.e. a χ² distribution with 11 degrees of freedom. The plot shows an 80 % chance (the confidence interval) that a value higher than 6.988 (the critical value at an 80 % confidence level) would have occurred for data that were actually drawn from the fitted distribution, i.e. there is only a 20 % chance that the χ² value could be this small. If analysts are conservative and accept this 80 % chance of falsely rejecting the fit, their confidence interval α equals 80 %, the corresponding critical value is 6.988, and they will not accept any distribution as a good fit if its χ² is greater than 6.988.
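The 6.988 critical value quoted above is easy to reproduce with a quick scipy check:

```python
from scipy import stats

# The 80 % confidence level critical value: the chi-square value that data
# genuinely drawn from the fitted distribution would exceed 80 % of the time
critical = stats.chi2.ppf(0.2, df=11)       # lower 20th percentile of chi2(11)
exceed_prob = stats.chi2.sf(critical, df=11)
```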
Critical values for K-S and A-D statistics have been found by Monte Carlo simulation (Stephens, 1974, 1977; Chandra, Singpurwalla and Stephens, 1981). Tables of critical values for the K-S statistic are very commonly found in statistical textbooks. Unfortunately, the standard K-S and A-D values are of limited use for comparing critical values if there are fewer than about 30 data points. The problem arises because these statistics are designed to test whether a distribution with known parameters could have produced the observed data. If the parameters of the fitted distribution have been estimated from the data, the K-S and A-D statistics will produce conservative test results, i.e. there is a smaller chance of a well-fitting distribution being accepted. The size of this effect varies between the types of distribution being fitted. One technique for getting round this problem is to use the first two-fifths or so of the data to estimate the parameters of a distribution, using MLEs for example, and then to use the remaining data to check the goodness of fit. Modifications to the K-S and A-D statistics have been determined to correct for this problem, as shown in Tables 10.2 and 10.3 (see the BestFit manual published in 1993), where n is the number of data points and Dn and A²n are the unmodified K-S and A-D statistics respectively.

Another goodness-of-fit statistic with intuitive appeal, similar to the A-D and K-S statistics, is the Cramer-von Mises statistic W²:

W² = 1/(12n) + Σ [F0(X(i)) − (2i − 1)/(2n)]²  summed over i = 1 to n

The statistic essentially sums the square of the differences between the cumulative percentile F0(Xi) of the fitted distribution at each observation Xi and the average of i/n and (i − 1)/n: the low and high plots of the empirical cumulative distribution of the Xi values. Tables for this statistic can be found in Anderson and Darling (1952).

Table 10.2 Kolmogorov-Smirnoff statistic.
Distribution | Modified test statistic
Normal | Dn(√n − 0.01 + 0.85/√n)
Exponential | (Dn − 0.2/n)(√n + 0.26 + 0.5/√n)
Weibull and extreme value | Dn √n
All others | Dn(√n + 0.12 + 0.11/√n)

Table 10.3 Anderson-Darling statistics.

Distribution | Modified test statistic
Normal | A²n(1 + 0.75/n + 2.25/n²)
Exponential | A²n(1 + 0.6/n)
Weibull and extreme value | A²n(1 + 0.2/√n)
All others | A²n

The chi-square goodness-of-fit statistic

The chi-square (χ²) statistic measures how well the expected frequency of the fitted distribution compares with the observed frequency of a histogram of the observed data. The chi-square test makes the following assumptions:

1. The observed data consist of a random sample of n independent data points.
2. The measurement scale can be nominal (i.e. non-numeric) or numerical.
3. The n data points can be arranged into histogram form with N non-overlapping classes or bars that cover the entire possible range of the variable.

The chi-square statistic is calculated as follows:

χ² = Σ {O(i) − E(i)}² / E(i)  summed over i = 1 to N     (10.6)

where O(i) is the observed frequency of the ith histogram class or bar and E(i) is the expected frequency from the fitted distribution of x values falling within the x range of the ith histogram bar. E(i) is calculated as

E(i) = (F(imax) − F(imin)) * n     (10.7)

where F(x) is the distribution function of the fitted distribution, imax is the x-value upper bound of the ith histogram bar and imin is the x-value lower bound. Since the χ² statistic sums the squares of all of the errors {O(i) − E(i)}, it can be disproportionately sensitive to any large errors: if the error of one bar is 3 times that of another, it will contribute 9 times more to the statistic (assuming the same E(i) for both). χ² is the most commonly used of the goodness-of-fit statistics described here. However, it is very dependent on the number of bars, N, that are used. By changing the value of N, one can quite easily switch the ranking between two distribution types.
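Equations (10.6) and (10.7) can be sketched as follows, testing an invented histogram against a fully specified Normal(70, 20):

```python
import numpy as np
from scipy import stats

# Illustrative observed histogram frequencies (the counts are made up)
edges = np.array([-np.inf, 40, 50, 60, 70, 80, 90, 100, np.inf])
observed = np.array([14, 15, 25, 32, 31, 25, 15, 8])
n = observed.sum()

dist = stats.norm(70, 20)
expected = (dist.cdf(edges[1:]) - dist.cdf(edges[:-1])) * n   # equation (10.7)

chi_sq = np.sum((observed - expected) ** 2 / expected)        # equation (10.6)
p_value = stats.chi2.sf(chi_sq, df=len(observed) - 1)         # no fitted parameters
```

Because the Normal(70, 20) is fully specified here, a = 0 and the degrees of freedom are simply N − 1.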
Unfortunately, there are no hard and fast rules for selecting the value of N. A good guide, however, is Scott's normal approximation, which generally appears to work very well: it sets the histogram bin width to 3.49σn^(−1/3), where n is the number of data points and σ is estimated by the sample standard deviation. Another useful guide is to ensure that no bar has an expected frequency smaller than about 1, i.e. E(i) > 1 for all i. Note that the χ² statistic does not require that all, or indeed any, histogram bars are of the same width. The χ² statistic is most useful for fitting distributions to discrete data and is the only statistic described here that can be used for nominal (i.e. non-numeric) data.

Example 10.7 Use of χ² with continuous data

A dataset of 165 points is thought to come from a Normal(70, 20) distribution. The data are first put into histogram form with 14 bars, as suggested by Scott's normal approximation (Table 10.4(a)). The four extreme bars have expected frequencies below 1 for a Normal(70, 20) distribution with 165 observations. These outside bars are therefore combined to produce a revised set of bar ranges. Table 10.4(b) shows the χ² calculation with the revised bar ranges.

Table 10.4 Calculation of the χ² statistic for a continuous dataset: (a) determining the bar ranges to be used; (b) calculation of χ² with the revised bar ranges.

(a) Histogram bars and expected frequencies of Normal(70, 20):

From A | To B | Expected frequency
−∞ | 10 | 0.22
10 | 20 | 0.80
20 | 30 | 2.73
30 | 40 | 7.27
40 | 50 | 15.15
50 | 60 | 24.73
60 | 70 | 31.59
70 | 80 | 31.59
80 | 90 | 24.73
90 | 100 | 15.15
100 | 110 | 7.27
110 | 120 | 2.73
120 | 130 | 0.80
130 | +∞ | 0.22

(b) Revised bars:

From A | To B | E(i) | O(i) | {O(i) − E(i)}²/E(i)
−∞ | 20 | 1.02 | 3 | 3.80854
20 | 30 | 2.73 | 5 | 1.88948
30 | 40 | 7.27 | 6 | 0.22168
40 | 50 | 15.15 | 10 | 1.75344
50 | 60 | 24.73 | 21 | 0.56275
60 | 70 | 31.59 | 25 | 1.37523
70 | 80 | 31.59 | 37 | 0.92601
80 | 90 | 24.73 | 21 | 0.56275
90 | 100 | 15.15 | 17 | 0.22463
100 | 110 | 7.27 | 11 | 1.91447
110 | 120 | 2.73 | 6 | 3.92002
120 | +∞ | 1.02 | 3 | 3.80854
Chi-square: | | | | 20.96754

Hypotheses

H0: the data come from a Normal(70, 20) distribution.
H1: the data do not come from a Normal(70, 20) distribution.

Decision

The χ² test statistic has a value of 21.0 from Table 10.4(b). There are ν = N − 1 = 12 − 1 = 11 degrees of freedom (a = 0 since no distribution parameters were determined from the data). Looking this up in a χ²(11) distribution, the probability of obtaining such a high value of χ² when H0 is true is around 3 %. We therefore conclude that the data did not come from a Normal(70, 20) distribution. ♦

Table 10.5 Calculation of the χ² statistic for a discrete dataset: (a) tabulation of the data; (b) calculation of χ².

(a)

x value | Observed frequency O(i) | Frequency E(i) of Poisson(4.456)
0 | 0 | 1.579
1 | 8 | 7.036
2 | 18 | 15.675
3 | 20 | 23.282
4 | 29 | 25.936
5 | 21 | 23.113
6 | 18 | 17.165
7 | 10 | 10.926
8 | 8 | 6.086
9 | 2 | 3.013
10 | 1 | 1.343
11+ | 1 | 0.846
Total: | 136 |

(b)

x value | Observed frequency O(i) | Frequency E(i) of Poisson(4.456) | {O(i) − E(i)}²/E(i)
0 | 0 | 1.579 | 1.5790
1 | 8 | 7.036 | 0.1322
2 | 18 | 15.675 | 0.3448
3 | 20 | 23.282 | 0.4627
4 | 29 | 25.936 | 0.3621
5 | 21 | 23.113 | 0.1932
6 | 18 | 17.165 | 0.0406
7 | 10 | 10.926 | 0.0786
8 | 8 | 6.086 | 0.6020
9 | 2 | 3.013 | 0.3406
10+ | 2 | 2.189 | 0.0163
Chi-square: | | | 4.1521

Example 10.8 Use of χ² with discrete data

A set of 136 data points is believed to come from a Poisson distribution. The MLE for the Poisson parameter λ is estimated by taking the mean of the data points: λ = 4.4559. The data are tabulated in frequency form in Table 10.5(a) and, next to it, the expected frequency from a Poisson(4.4559) distribution, i.e.
E(i) = f(x) * 136, where

f(x) = e^(−λ) λ^x / x!  with λ = 4.4559

The expected frequency for a value of 11+, calculated as 136 minus the sum of all the other expected frequencies, is less than 1. The number of bars is therefore decreased, as shown in Table 10.5(b), to ensure that all expected frequencies are greater than 1.

Hypotheses

H0: the data come from a Poisson distribution.
H1: the data do not come from a Poisson distribution.

Decision

The χ² test statistic has a value of 4.152 from Table 10.5(b). There are ν = N − a − 1 = 11 − 1 − 1 = 9 degrees of freedom (a = 1 since one distribution parameter, the mean, was determined from the data). Looking this up in a χ²(9) distribution, the probability of obtaining such a high value of χ² when H0 is true is just over 90 %. Since this is such a large probability, we cannot reasonably reject H0 and therefore conclude that the data fit a Poisson(4.4559) distribution. ♦

I've covered the chi-square statistic quite a bit here, because it is used often, but let's just trace back a moment to the assumptions behind it. The χ²(ν) distribution is the sum of ν unit normal distributions squared. Equation (10.6) therefore assumes that each {O(i) − E(i)}²/E(i) is approximately a Normal(0, 1)², i.e. that O(i) is approximately Normal(E(i), √E(i)) distributed. In fact, O(i) is a Binomial(n, p) variable, where p = F(imax) − F(imin), and it will only look somewhat normal when n is large and p is not near 0 or 1, in which case it will be approximately Normal(np, √(np(1 − p))). The point is that the chi-square test is based on an implicit assumption that there are a lot of observations in each bin, so don't rely on it.
Maximum likelihood methods will give better fits than optimising the chi-square statistic and have more flexibility. Moreover, the value of the chi-square statistic as a measure for comparing goodness of fit between distributions is highly questionable, since one should change the bin widths for each fitted distribution so that each bin has the same probability of containing a random sample, and those bin ranges will be different for each fitted distribution.

Kolmogorov-Smirnoff (K-S) statistic

The K-S statistic Dn is defined as

Dn = max |Fn(x) − F(x)|

where Dn is known as the K-S distance, n is the total number of data points, F(x) is the distribution function of the fitted distribution, Fn(x) = i/n and i is the cumulative rank of the data point. The K-S statistic is thus only concerned with the maximum vertical distance between the cumulative distribution function of the fitted distribution and the cumulative distribution of the data.

Figure 10.15 illustrates the concept for data fitted to a Uniform(0, 1) distribution. The data are ranked in ascending order. The upper FU(i) and lower FL(i) cumulative percentiles are calculated as follows:

FU(i) = i/n,  FL(i) = (i − 1)/n

where i is the rank of the data point and n is the total number of data points. F(x) is calculated for the uniform distribution (in this case F(x) = x).

Figure 10.15 Calculation of the Kolmogorov-Smirnoff distance Dn for data fitted to a Uniform(0, 1) distribution.

The maximum distance Di between F(i) and F(x) is calculated for each i:

Di = max[ABS(FU(i) − F(xi)), ABS(FL(i) − F(xi))]

where ABS(...) finds the absolute value. The maximum of the Di distances is then the K-S distance Dn:

Dn = max[Di]

The K-S statistic is generally more useful than the χ² statistic in that the data are assessed at all data points, which avoids the problem of determining the number of bands into which to split the data.
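The Di and Dn calculations above can be sketched directly; the Uniform(0, 1) sample is illustrative, and scipy's kstest is used only as a cross-check:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=11)
data = np.sort(rng.uniform(0, 1, size=50))     # illustrative Uniform(0, 1) sample
n = len(data)

# D_i = max(|FU(i) - F(x_i)|, |FL(i) - F(x_i)|) with FU = i/n, FL = (i-1)/n
i = np.arange(1, n + 1)
F = stats.uniform(0, 1).cdf(data)              # for Uniform(0, 1), F(x) = x
D_n = np.maximum(np.abs(i / n - F), np.abs((i - 1) / n - F)).max()

# Cross-check against scipy's two-sided K-S statistic
D_scipy = stats.kstest(data, stats.uniform(0, 1).cdf).statistic
```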
However, its value is determined only by the one largest discrepancy and takes no account of the lack of fit across the rest of the distribution. Thus, in Figure 10.16 it would give a worse fit to the distribution in (a), which has one large discrepancy, than to the distribution in (b), which has a poor general fit over the whole x range.

The vertical distance between the observed distribution Fn(x) and the theoretical fitted distribution F(x) at any point, say x0, itself has a distribution with a mean of zero and a standard deviation σK-S given by binomial theory:

σK-S = √( F(x0)(1 − F(x0)) / n )

The size of the standard deviation σK-S over the x range is shown in Figure 10.17 for a number of distribution types with n = 100. The position of Dn along the x axis is more likely to occur where σK-S is greatest which, as Figure 10.17 shows, will generally be away from the low-probability tails. This insensitivity of the K-S statistic to lack of fit at the extremes of the distributions is corrected for in the Anderson-Darling statistic.

The enlightened statistical literature is quite scathing about distribution-fitting software that uses the K-S statistic as a goodness-of-fit measure, particularly if one has estimated the parameters of the fitted distribution from the data (as opposed to comparing the data against a predefined distribution). This was not the intention of the K-S statistic, which assumes that the fitted distribution is fully specified. In order to use it as a goodness-of-fit measure that ranks levels of distribution fit, one must perform simulation experiments to determine the critical region of the K-S statistic in each case.

Anderson-Darling (A-D) statistic

The A-D statistic A²n is defined as

A²n = n ∫ [Fn(x) − F(x)]² Ψ(x) f(x) dx,  where Ψ(x) = 1 / [F(x)(1 − F(x))]

and where n is the total number of data points, F(x) is the distribution function of the fitted distribution, f(x) is the density function of the fitted distribution, Fn(x) = i/n and i is the cumulative rank of the data point.
The Anderson-Darling statistic is a more sophisticated version of the Kolmogorov-Smirnoff statistic. It is more powerful for the following reasons:

- Ψ(x) compensates for the increased variance of the vertical distances between the distributions, described in Figure 10.17.
- f(x) weights the observed distances by the probability that a value will be generated at that x value.
- The vertical distances are integrated over all values of x to make maximum use of the observed data (the K-S statistic only looks at the maximum vertical distance).

Figure 10.16 How the K-S distance Dn can give a false measure of fit because of its reliance on the single largest distance between the two cumulative distributions rather than looking at the distances over the whole possible range: (a) a distribution that is generally a good fit except in one particular area; (b) a distribution that is generally a poor fit but with no single large discrepancy.

Figure 10.17 Variation in the standard deviation of the K-S statistic Dn over the range of a variety of distributions (Pareto(1, 2), Normal(100, 10), Triangular(0, 5, 20), Uniform(0, 10), Exponential(25) and Rayleigh(3)). The greater the standard deviation, the more chance that Dn will fall in that part of the range, which shows that the K-S statistic will tend to focus on the degree of fit at x values away from a distribution's tails.
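For a fully specified fitted distribution, A²n can be computed from the standard computational form of the integral definition above. The sketch below assumes a Normal(0, 1) fit to invented data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)
data = np.sort(rng.normal(0, 1, size=200))     # illustrative sample
n = len(data)

# Standard computational form of the A-D integral for ranked data:
#   A2 = -n - (1/n) * sum_i (2i - 1) * [ln F(x_(i)) + ln(1 - F(x_(n+1-i)))]
F = stats.norm(0, 1).cdf(data)
i = np.arange(1, n + 1)
A2 = -n - np.sum((2 * i - 1) * (np.log(F) + np.log(1 - F[::-1]))) / n
```

Note that scipy.stats.anderson instead estimates the parameters from the data and applies modified critical values, i.e. the estimated-parameter case discussed in the text.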
The A-D statistic is therefore a generally more useful measure of fit than the K-S statistic, especially where it is important to place equal emphasis on fitting a distribution at the tails as well as the main body. Nonetheless, it still suffers from the same problem as the K-S statistic in that the fitted distribution should in theory be fully specified, not estimated from the data. It suffers from a larger problem in that the confidence region has been determined for only a very few distributions.

A better goodness-of-fit measure

For the reasons I have explained above, the chi-square, Kolmogorov-Smirnoff and Anderson-Darling goodness-of-fit statistics are technically all inappropriate as a method of comparing fits of distributions to data. They are also limited to precise observations and cannot incorporate censored, truncated or binned data. Realistically, most of the time we are fitting a continuous distribution to a set of precise observations, and then the Anderson-Darling does a reasonable job. However, for important work you should instead consider using statistical measures of fit called information criteria.

Let n be the number of observations (e.g. data values, frequencies), k be the number of parameters to be estimated (e.g. the normal distribution has two parameters: mu and sigma) and Lmax be the maximised value of the likelihood for the estimated model.

1. SIC (Schwarz information criterion, a.k.a. Bayesian information criterion, BIC):

SIC = ln[n]k − 2 ln[Lmax]

2. AICc (Akaike information criterion, corrected for small samples):

AICc = (2n/(n − k − 1))k − 2 ln[Lmax]

3. HQIC (Hannan-Quinn information criterion):

HQIC = 2 ln[ln[n]]k − 2 ln[Lmax]

The aim is to find the model with the lowest value of the selected information criterion. The −2 ln[Lmax] term appearing in each formula is an estimate of the deviance of the model fit. The coefficient of k in the first part of each formula shows the degree to which the number of model parameters is being penalised. For n ≥ 20 or so, the SIC (Schwarz, 1978) is the strictest in penalising the loss of degrees of freedom from having more parameters in the fitted model. For n ≥ 40 the AICc (Akaike, 1974, 1976) is the least strict of the three, and the HQIC (Hannan and Quinn, 1979) holds the middle ground, or is the least penalising for n ≤ 20. ModelRisk applies modified versions of these three criteria as a means of ranking each fitted model, whether it be fitting a distribution, a time series model or a copula.

If you fit a number of models to your data, try not to pick the fitted distribution with the best statistical result automatically, particularly if the top two or three are close. Also, look at the range and shape of the fitted distribution and consider whether they correspond to what you think is appropriate.

10.3.5 Goodness-of-fit plots

Goodness-of-fit plots offer the analyst a visual comparison between the data and fitted distributions. They provide an overall picture of the errors in a way that a goodness-of-fit statistic cannot, and they allow the analyst to select the best-fitting distribution in a more qualitative and intuitive way. Several types of plot are in common use. Their individual merits are discussed below.

Comparison of probability density

Overlaying a histogram plot of the data with the density function of the fitted distribution is usually the most informative comparison (see Figure 10.18(a)). It is easy to see where the main discrepancies are and whether the general shape of the data and fitted distribution compare well. The same scale and number of histogram bars should be used for all plots if a direct comparison of several distribution fits is to be made for the same data.
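Returning to the information criteria of Section 10.3.4, they are straightforward to compute once the maximised log-likelihood is known. A sketch for a normal fit to invented data, using the standard small-sample form of AICc (an assumption on my part; the exact modified forms used by ModelRisk may differ):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
data = rng.normal(50, 8, size=120)    # illustrative dataset
n, k = len(data), 2                   # the normal has k = 2 parameters

mu_hat = data.mean()                  # normal MLEs
sigma_hat = data.std()                # note: the MLE uses the 1/n variance
log_L = stats.norm(mu_hat, sigma_hat).logpdf(data).sum()

SIC  = np.log(n) * k - 2 * log_L                 # a.k.a. BIC
AICc = (2 * n / (n - k - 1)) * k - 2 * log_L     # small-sample corrected AIC
HQIC = 2 * np.log(np.log(n)) * k - 2 * log_L
```

For this n and k the penalty coefficients are about 4.79 (SIC), 3.13 (HQIC) and 2.05 (AICc), illustrating the strictness ordering described in the text.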
For n ≥ 20 or so, the SIC (Schwarz, 1997) is the strictest of the three in penalising the loss of degrees of freedom from having more parameters in the fitted model. For n ≥ 40 the AICc (Akaike, 1974, 1976) is the least strict of the three, and the HQIC (Hannan and Quinn, 1979) holds the middle ground, or is the least penalising for n ≤ 20. ModelRisk applies modified versions of these three criteria as a means of ranking each fitted model, whether it be fitting a distribution, a time series model or a copula. If you fit a number of models to your data, try not to pick automatically the fitted distribution with the best statistical result, particularly if the top two or three are close. Also, look at the range and shape of the fitted distribution and see whether they correspond to what you think is appropriate.

10.3.5 Goodness-of-fit plots

Goodness-of-fit plots offer the analyst a visual comparison between the data and fitted distributions. They provide an overall picture of the errors in a way that a goodness-of-fit statistic cannot, and allow the analyst to select the best-fitting distribution in a more qualitative and intuitive way. Several types of plot are in common use. Their individual merits are discussed below.

Comparison of probability density

Overlaying a histogram plot of the data with a density function of the fitted distribution is usually the most informative comparison (see Figure 10.18(a)). It is easy to see where the main discrepancies are and whether the general shape of the data and fitted distribution compare well. The same scale and number of histogram bars should be used for all plots if a direct comparison of several distribution fits is to be made for the same data.
Figure 10.18 Examples of goodness-of-fit plots, comparing an input data distribution with a fitted Normal(99.18, 16.52): (a) comparison of probability density; (b) comparison of cumulative probability distributions; (c) plot of the difference between probability densities; (d) probability-probability plot; (e) probability-probability plot for a discrete (Poisson) distribution; (f) quantile-quantile plot.

Comparison of probability distributions

An overlay of the cumulative frequency plots of the data and the fitted distribution is sometimes used (see Figure 10.18(b)). However, this plot has a very insensitive scale, and the cumulative frequencies of most distribution types follow very similar S-curves. This type of plot will therefore only show up very large differences between the data and fitted distributions and is not generally recommended as a visual measure of the goodness of fit.

Difference between probability densities

This plot is derived from the comparison of probability density above, plotting the difference between the probability densities (see Figure 10.18(c)). It has a far more sensitive scale than the other plots described here. The size of the deviations is also a function of the number of classes (bars) used to plot the histogram.
In order to make a direct comparison between other distribution function fits using this type of plot, analysts must ensure that the same number of histogram classes is used for all plots. They must also ensure that the same vertical scale is used, as this can vary widely between fits.

Probability-probability (P-P) plots

This is a plot of the cumulative distribution of the fitted curve F(x_i) against the cumulative frequency F_n(x_i) = i/(n + 1) for all values of x_i (see Figure 10.18(d)). The better the fit, the more closely this plot resembles a straight line. It can be useful if one is interested in closely matching cumulative percentiles, and it will show significant differences between the middles of the two distributions. However, the plot is far less sensitive to any discrepancies in fit than the comparison of probability density plot and is therefore not often used. It can also be rather confusing when used to review discrete data (see Figure 10.18(e)), where a fairly good fit can easily be masked, especially if there are only a few allowable x values.

Quantile-quantile (Q-Q) plots

This is a plot of the observed data x_i against the x values where F(x) = F_n(x_i), i.e. F(x) = i/(n + 1) (see Figure 10.18(f)). As with P-P plots, the better the fit, the more closely this plot resembles a straight line. It can be useful if one is interested in closely matching cumulative percentiles, and it will show significant differences between the tails of the two distributions. However, the plot suffers from the same insensitivity problem as the P-P plots.

10.4 Fitting a Second-Order Parametric Distribution to Observed Data

The techniques for quantifying uncertainty, described in the first part of this chapter, can be used to determine the distribution of uncertainty for parameters of a parametric distribution fitted to data. The three main techniques are classical statistics methods, the bootstrap and Bayesian inference by Gibbs sampling.
The main issue in estimating the parameters of a distribution from data is that the uncertainty distributions of the estimated parameters are usually linked together in some way. Classical statistics tends to overcome this problem by assuming that the parameter uncertainty distributions are normally distributed, in which case it determines a covariance between these distributions. However, in most situations one comes across, the parameter uncertainty distributions are not normal (although they tend towards normality as the amount of data gets very large), so the approach is very limited. The parametric bootstrap is much better, since one simply resamples from the MLE fitted distribution in the same fashion in which the data appear, and in the same amount, of course. Then, refitting using MLE again gives us random samples from the joint uncertainty distribution for the parameters. The main limitation to the bootstrap is in fitting a discrete distribution, particularly one where there are few allowable values, as this will make the joint uncertainty distribution very "grainy". Markov chain Monte Carlo will also generate random samples from the joint uncertainty density. It is very flexible but has the small problem of setting the prior distributions.

Example 10.9 Fitting a second-order normal distribution to data with classical statistics

The normal distribution is easy to fit to data since the z-test and chi-square test give us precise formulae. There are not many other distributions that can be handled so conveniently. Classical statistics tells us that the uncertainty distributions for the mean and standard deviation of the normal distribution are given by Equation (9.3) when we don't know the mean, and by Equation (9.1) when we know the standard deviation. So, if we simulate possible values for the standard deviation first with Equation (9.3), we can feed these values into Equation (9.1) to determine the mean.

Example 10.10 Fitting a second-order normal distribution to data using the parametric bootstrap

The sample mean (Excel: AVERAGE) and sample standard deviation (Excel: STDEV) are the MLE estimates for the normal distribution. Thus, if we have n data values with mean x̄ and standard deviation s, we generate n independent Normal(x̄, s) distributions and recalculate their mean and standard deviation using AVERAGE and STDEV to generate uncertainty values for the population parameters.

Example 10.11 Fitting a second-order gamma distribution to data using the parametric bootstrap

There are no equations for direct determination of the MLE parameter values for a gamma distribution, so one needs to construct the likelihood function and optimise it by varying the parameters, which is rather tiresome but by far the more common situation encountered. ModelRisk offers distribution-fitting algorithms that do this automatically. For example, the two-cell array {VoseGammaFitP(data, TRUE)} will generate values from the joint uncertainty distribution for a gamma distribution fit to the set of values data. The array {VoseGammaFitP(data, FALSE)} will return just the MLE values. The function VoseGammaFit(data, TRUE) returns random samples from a gamma distribution fitted to data, with the parameter uncertainty embedded, and VoseGammaFit(data, 0.99, TRUE) will return random samples from the uncertainty distribution for the 99th percentile of a gamma distribution fit to data.

Example 10.12 Fitting a second-order gamma distribution to data using WinBUGS

The following WinBUGS model takes 47 data values (that were in fact drawn from a Gamma(4, 7) distribution) and fits a gamma distribution. There are two important things to note here: in WinBUGS the scale parameter lambda is defined as the reciprocal of the beta scale parameter more commonly used (and this book's convention); and I have used a prior for each parameter of Gamma(1, 1000) [in the more standard convention], which is an exponential with mean 1000.
The exponential distribution is used because it extends from zero to infinity, which matches the parameters' domains, and an exponential with such a large mean will appear quite flat over the range of interest (so it is reasonably uninformed). The model is:

model {
    for (i in 1:M) {
        x[i] ~ dgamma(alpha, lambda)
    }
    alpha ~ dgamma(1.0, 1.0E-3)
    beta ~ dgamma(1.0, 1.0E-3)
    lambda <- 1/beta
}

After a burn-in of 100 000 iterations, the estimates are as shown in Figure 10.19. The estimates are centred roughly around 4 (mean = 4.111) and 7 (mean = 6.288), as we might have hoped, having generated the samples from a Gamma(4, 7). We can check to see whether the choice of prior has much effect. For alpha, the uncertainty distribution ranges from about 2 to 6: the Exponential(1000) densities at 2 and 6 are 9.98E-4 and 9.94E-4 respectively, a ratio of 1.004, so essentially flat over the posterior region. Between 4 and 13, the range for the beta parameter, the ratio is 1.009, again essentially flat.

Figure 10.19 WinBUGS estimates of gamma distribution parameters for Example 10.12.

Figure 10.20 5000 posterior distribution samples from the WinBUGS model to estimate gamma distribution parameters for Example 10.12.

Figure 10.21 Plot showing the empirical cumulative distribution of the data in bold and the second-order fitted lognormal distribution in grey.

Figure 10.20 shows why it is necessary to estimate the joint uncertainty distribution. The banana shape of this scatter plot shows that there is a strong correlation between the parameter estimates.
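The same kind of joint uncertainty sample can be obtained without an MCMC run via the parametric bootstrap of Examples 10.10 and 10.11. A minimal sketch for the normal case of Example 10.10, using the standard library in place of Excel's AVERAGE and STDEV (the data here are simulated purely for illustration):

```python
import random
import statistics

def normal_parametric_bootstrap(data, n_boot=2000, seed=1):
    """Example 10.10: treat (mean, stdev) of the data as the MLE fit,
    regenerate samples of the same size from Normal(mean, stdev), and
    refit each one to get draws from the joint uncertainty distribution."""
    rng = random.Random(seed)
    m, s = statistics.mean(data), statistics.stdev(data)
    n = len(data)
    draws = []
    for _ in range(n_boot):
        resample = [rng.gauss(m, s) for _ in range(n)]
        draws.append((statistics.mean(resample), statistics.stdev(resample)))
    return draws

# Illustrative data (not the book's): 30 values from Normal(100, 10)
rng = random.Random(7)
data = [rng.gauss(100, 10) for _ in range(30)]
draws = normal_parametric_bootstrap(data)
means = [d[0] for d in draws]
print(round(min(means), 1), round(max(means), 1))  # spread of the mean's uncertainty
```

Plotting the (mean, stdev) pairs in `draws` against each other gives the bootstrap analogue of the scatter plot in Figure 10.20.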
You can understand why this relationship occurs intuitively as follows: the mean of a population distribution can be estimated quite quickly from the data and will have roughly normally distributed uncertainty. In this case the 47 observations have sample mean = 25.794 and sample variance = 184.06, so the population mean uncertainty is Normal(25.794, SQRT(184.06/47)) = Normal(25.794, 1.979). The mean of a Gamma(α, β) distribution is αβ. Equating the two says that if α = 6 then β must be about 4.3 ± 0.3, and if α = 3 then β is about 8.6 ± 0.6, which can be seen in Figure 10.20.

10.4.1 Second-order goodness-of-fit plots

Second-order goodness-of-fit plots are the same as the first-order plots in Figure 10.18, except that uncertainty about the distribution is expressed as a series of lines describing possible true distributions (sometimes called a candyfloss or spaghetti plot). Figure 10.21 gives an example. In Figure 10.21 the grey lines represent the fitted lognormal cumulative distribution function for 15 samples from the joint uncertainty distribution for the lognormal's mean and standard deviation. This gives an intuitive visual description of how certain we are about the fitted distribution. ModelRisk's distribution-fitting facility will show these plots automatically with a user-defined number of "spaghetti" lines.

Chapter 11 Sums of random variables

One of the most common mistakes people make in producing even the most simple Monte Carlo simulation model is in calculating sums of random variables. In this chapter we look at a number of techniques that have extremely broad use in risk analysis in estimating the sum of random variables. We start with the basic problem and how this can be simulated. Then we examine how simulation can be improved, and then how it can often be replaced with a direct construction of the distribution of the sum of the random variables. Finally, I introduce the ability to model correlation between variables that are being summed.
11.1 The Basic Problem

We are very often in the situation of wanting to estimate the aggregate (sum) of a number n of variables, each of which follows the same distribution or takes the same value X (see Table 11.1, for example). We have six situations to deal with (Table 11.2).

Situations A, B, D and E

For situations A, B, D and E the mathematics is very easy to simulate: SUM = n * X.

Situation C

For situation C, where the X are independent random variables (i.e. each X being summed can take a different value) and n is fixed, we often have a simple way to determine the aggregate distribution based on known identities. The most common identities are listed in Table 11.3. We also know from the central limit theorem that, if n is large enough, the sum will often look like a normal distribution. If X has a mean μ and standard deviation σ, then, as n becomes large, we get

Sum ≈ Normal(n * μ, √n * σ)

which is rather nice because it means we can take a distribution like the Relative distribution and determine its moments (the ModelRisk function VoseMoments will do this automatically for you), or just take the mean and standard deviation of relevant observations of X, and use them. It also explains why the distributions in the right-hand column of Table 11.3 often look approximately normal. When none of these identities applies, we have to simulate a column of X variables of length n and add them up, which is usually not too onerous in computing time or spreadsheet size because, if n is large, we can usually use the central limit theorem approximation instead.

Table 11.1 Variables and their aggregate distribution.
X | n | Aggregate distribution
Purchase of each customer | Customers in a year | Total receipts in a year
Bacteria in a contaminated egg | Contaminated eggs | Bacteria in my three-raw-egg milkshake
Amount owed by a creditor | Credit defaults | Total credit default exposure
Amount due on death for a policyholder | Life insurance holders who die next year | Total financial exposure of insurance company

Table 11.2 Different situations where aggregate distributions are needed.

Situation | n | X
A | Fixed value | Fixed value
B | Fixed value | Random variable, all n take the same value
C | Fixed value | Random variables, all n take different values (iid)
D | Random variable | Fixed value
E | Random variable | Random variable, all n take the same value
F | Random variable | Random variables, all n take different values (iid)

Table 11.3 Known identities for aggregate distributions.

X | Aggregate distribution
Bernoulli(p) | Binomial(n, p)
BetaBinomial(m, α, β) | BetaBinomial(n * m, α, β)
Binomial(m, p) | Binomial(n * m, p)
Cauchy(a, b) | n * Cauchy(a, b)
ChiSq(ν) | ChiSq(n * ν)
Erlang(m, β) | Erlang(n * m, β)
Exponential(β) | Gamma(n, β)

An alternative for situation C available in ModelRisk is to use the VoseAggregateMC(n, distribution) function; for example, if we write

=VoseAggregateMC(1000, VoseLognormalObject(2, 6))

the function will generate and add together 1000 independent random samples from a Lognormal(2, 6) distribution. However, were we to write

=VoseAggregateMC(1000, VoseGammaObject(2, 6))

the function would generate a single value from a Gamma(2 * 1000, 6) distribution, because all of the identities in Table 11.3 are programmed into the function.

Situation F

This leaves us with situation F: the sum of a random number of random variables. The most basic simulation method is to produce a model where a value for n is generated in one spreadsheet cell and then a column of X variables is created that varies in size according to the value of n (see, for example, Figure 11.1). In this model, n is a Poisson(12) random variable generated at cell C2.
The Lognormal(100, 10) X values are generated in column C only if the count value in column B is smaller than or equal to n. For example, in the iteration shown, a value of 14 is generated for n, so 14 X values are generated in column C. The method is quite generally applicable but, among other problems, is inefficient. Imagine if n had been Poisson(10 000), for example: we would need huge B and C columns to make the model work. It is also difficult from a modelling perspective because the model has to be written for a specific range of n. One cannot simply change the parameter in the Poisson distribution.

Figure 11.1 Model for the sum of a random number of random variables. Formulae table: C2: =VosePoisson(12); C5:C24: =IF(B5>$C$2, 0, VoseLognormal(100, 10)); F6 (output): =SUM(C5:C24).

We have a couple of options based on the techniques described above for situation C. If we are adding together X variables shown in Table 11.3, then we can apply those identities by simulating n in one cell and linking that to a cell that simulates from the aggregate variable conditioned on n. For example, imagine we are summing Poisson(100) X variables where each X variable takes a Gamma(2, 6) distribution. Then we can write:

Cell A1: =VosePoisson(100)
Cell A2 (output): =VoseGamma(A1 * 2, 6)

We can also use the central limit theorem method. Imagine we have n = Poisson(1000) and X = Beta4(3, 7, 0, 8), which is illustrated in Figure 11.2.
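The conditional-identity trick above (simulate n, then make a single draw from the aggregate distribution conditioned on n) mirrors directly into code. A sketch for the Poisson(100) count of Gamma(2, 6) severities, cross-checked against brute-force summation (illustrative only, not ModelRisk's implementation; Gamma(α, β) is read as shape α, scale β, so the mean is αβ):

```python
import math
import random

def poisson(rng, lam):
    # Knuth's multiplication method; adequate for modest lambda
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(42)

def aggregate_identity():
    # Sum of n iid Gamma(2, 6) variables is Gamma(2*n, 6), so a single
    # gamma draw conditioned on n replaces the whole column of X values.
    n = poisson(rng, 100)
    return rng.gammavariate(2 * n, 6) if n > 0 else 0.0

def aggregate_brute_force():
    n = poisson(rng, 100)
    return sum(rng.gammavariate(2, 6) for _ in range(n))

ident = [aggregate_identity() for _ in range(5000)]
brute = [aggregate_brute_force() for _ in range(5000)]
print(round(sum(ident) / 5000, 1), round(sum(brute) / 5000, 1))  # both near 100*2*6 = 1200
```

The identity version does one gamma draw per iteration instead of roughly a hundred, which is exactly why VoseAggregateMC exploits Table 11.3 when it can.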
The distribution is not terribly asymmetric, so adding roughly 1000 of them will look very close to a normal distribution, which means that we can be confident in applying the central limit theorem approximation, shown in the model of Figure 11.3. Here we have made use of the VoseMoments array function, which returns the moments of a distribution object. Most software, however, will allow you at least to view the moments of a distribution and, if not, you can simulate the distribution on its own and empirically determine its moments from the values or, if you need greater accuracy or speed, apply the equations given in the distribution compendium in Appendix III. The VoseCLTSum function performs the same calculation as that shown in F5 but is a little more intuitive. Alternatively, the VoseAggregateMC function will, in this iteration, add together 957 values drawn from the Beta4 distribution, because there is no known identity for sums of Beta4 distributions.

Figure 11.2 A Beta4(3, 7, 0, 8) distribution.

Figure 11.3 Model for the central limit theorem approximation. Formulae table: C2: =VosePoisson(1000); C3: =VoseBeta4Object(3, 7, 0, 8); {B5:C8}: {=VoseMoments(C3)}; F5 (output): =VoseNormal(C2*C5, SQRT(C2*C6)); F6 (alternative): =VoseCLTSum(C2, C5, SQRT(C6)); F7 (alternative): =VoseAggregateMC(C2, C3).

11.2 Aggregate Distributions

11.2.1 Moments of an aggregate distribution

There are general formulae for determining the moments of an aggregate distribution, given that one has the moments of the frequency distribution for n and the severity distribution for X.
If the frequency distribution has mean, variance and skewness of μF, VF and SF respectively, and the severity distribution has mean, variance and skewness of μC, VC and SC respectively, then the aggregate distribution has the following moments:

Mean = μF * μC    (11.1)
Variance = μF * VC + VF * μC^2    (11.2)
Skewness = (μF * SC * VC^(3/2) + 3 * VF * μC * VC + SF * VF^(3/2) * μC^3) / (Variance)^(3/2)    (11.3)

There is also a formula for kurtosis, but it is rather ugly. The ModelRisk function VoseAggregateMoments determines the first four moments of an aggregate distribution for any frequency and severity distribution, even if they are bounded and/or shifted.

Equations (11.1) to (11.3) deserve a little more exploration. Firstly, let's consider the situation where n is a fixed value, so μF = n, VF = 0 and SF is undefined. Then we have moments for the aggregate distribution of

Mean = n * μC
Variance = n * VC
Skewness = SC / √n

You can see that this gives support to the central limit theorem, which states that, if n is large enough, the aggregate distribution approaches a normal distribution with mean = n * μC and variance = n * VC. The skewness equation shows that the aggregate skewness is proportional to the skewness of X but decreases rapidly at first with increasing n, then more slowly, and asymptotically towards zero.

Another interesting example is to consider the aggregate moment equations when n follows a Poisson(λ) distribution, which is very commonly the most appropriate distribution for n, and also has the convenience of being described by just one parameter. Now we have μF = λ, VF = λ and SF = 1/√λ, and the aggregate moments are

Mean = λ * μC
Variance = λ * (VC + μC^2)
Skewness = (SC * VC^(3/2) + 3 * μC * VC + μC^3) / (√λ * (VC + μC^2)^(3/2))

The mean and variance equations are simple formulae. We can see that the skewness decreases with 1/√λ in the same way as it does for a fixed value of n.
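Equations (11.1) to (11.3) translate directly into code. This sketch evaluates them for a Poisson(50) frequency and a Gamma(2, 6) severity (an illustrative choice: mean 12, variance 72, skewness 2/√2); the Poisson frequency has mean = variance = λ and skewness 1/√λ as noted above:

```python
import math

def aggregate_moments(muF, VF, SF, muC, VC, SC):
    """Equations (11.1)-(11.3): moments of the aggregate of a random
    number n (frequency) of iid severities X."""
    mean = muF * muC
    variance = muF * VC + VF * muC ** 2
    third = muF * SC * VC ** 1.5 + 3 * VF * muC * VC + SF * VF ** 1.5 * muC ** 3
    skewness = third / variance ** 1.5
    return mean, variance, skewness

lam = 50.0                                  # Poisson(50) frequency
muC, VC, SC = 12.0, 72.0, 2 / math.sqrt(2)  # Gamma(2, 6) severity
m, v, s = aggregate_moments(lam, lam, 1 / math.sqrt(lam), muC, VC, SC)
print(m, v, round(s, 4))  # 600.0, 10800.0 and a small positive skewness
```

As a cross-check, for a Poisson frequency the third central moment collapses to λ·E[X^3], and the numbers above agree with that identity.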
If X is symmetrically distributed then, for a given λ, the skewness is at its maximum when the mean and standard deviation of X are the same, and at its lowest when the standard deviation is very high. Thus, the aggregate distribution will be more closely normal when VC is large. Being able to determine the aggregate moments is pretty useful. One can directly compare sums of random variables, which I will discuss more in Chapter 21. One can also match these moments to some parametric distribution and use that as an approximation to the aggregate distribution. An aggregate distribution is almost always right skewed, so we can select from a number of right-skewed distributions like the lognormal and gamma and match moments. For example, a Gamma(α, β) distribution shifted by a value T has

Mean = αβ + T    (11.4)
Variance = αβ^2    (11.5)
Skewness = 2/√α    (11.6)

Thus, matching skewness gives us a value for α. Then, matching variance gives us β and, finally, matching mean gives us T. Adding a shift gives us three parameters to estimate, so we can match three moments. The model in Figure 11.4 offers an example. Cells C3:C5 are the parameters for the model. Cells D3 and D4 use ModelRisk functions to create distribution objects. B8:C11 and D8:E11 use the VoseMoments function to calculate the moments of the two distributions. Alternatively, you can use the equations in the distribution compendium in Appendix III. F8:F10 manually calculates the first three aggregate moments from Equations (11.1) to (11.3), and G8:H11 calculates all four using the VoseAggregateMoments function as a check. In C15:C17, Equations (11.4) to (11.6) are inverted to determine the gamma distribution parameters. Finally, G14:H17 uses the VoseMoments function again to determine the moments of the gamma distribution. You can see that they match the mean, variance and skewness of the aggregate distribution, as they should, but also that the kurtosis is very close, so the gamma distribution would likely be a good substitute for the aggregate distribution.

Figure 11.4 Model for determining aggregate moments.

To be sure, we would need to plot the two together, which we'll look
You can see that they match the mean, variance and skewness of the aggregate distribution - as they should - but also that the kurtosis is very close, so the gamma distribution would likely be a good substitute for the aggregate distribution. To be sure, we would need to plot the two together, which we'll look Chapter I I Sums of random variables Aggregate 25 Mean 25 Variance 0.2 Skewness 3.04 Kurtosis 307 VoseAMoments 1850 Variance 6.944 1.018515312 Skewness 75.1 056 Kurtosis 1.018515 4.723443 =C8*E9+CYE8A2 Figure 11.4 Model for determining aggregate moments. at later: a feature in ModelRisk uses the matching moments principle to match shifted versions of the gamma, inverse gamma, lognormal, Pearson5, Pearson6 and fatigue distributions to constructed aggregate distributions and overlay the distributions for an extra visual comparison. 1 1.2.2 Methods for constructing an aggregate distribution In this section I want to turn to a range of very neat techniques for constructing the aggregate distribution when n is a random variable and X are independent identically distributed random variables. There are a lot of advantages to being able to construct such an aggregate distribution, among which are: a a We can determine tail probabilities to a high precision. It is much faster than Monte Carlo simulation. We can manipulate the aggregate distribution as with any other in Monte Carlo simulation, e.g. correlate it with other variables. The main disadvantage to these methods is that they are computationally intensive and need to run calculations through often very long arrays. This makes them impractical to show in a spreadsheet 308 Risk Analysis environment, so I will only describe the theory here. All methods are implemented in ModelRisk, however, which runs the calculations internally in C++. We start by loolung at the Panjer recursive method, and then the fast Fourier transform (FFT) method. 
These two have a similar feel to them, and similar applications, although their mathematics is quite different. Then we'll look at a multivariate FFT method that allows us to extend the aggregate calculation to a set of {n, X} variables. The De Pril recursive method is similar to Panjer's and has a specific use. Finally, I give a summary of these methods and when and why they are useful.

Panjer's recursive method

Panjer's recursive method (Panjer, 1981; Panjer and Willmot, 1992) applies where the number of variables n being added together follows one of these distributions: binomial, geometric, negative binomial, Poisson or Pólya. The technique begins by taking the claim size distribution and discretising it into a number of values with increment C. Then the probability is redistributed so that the discretised claim distribution has the same mean as the continuous variable. There are a few ways of doing this but, if the discretisation steps are small, they give essentially the same answer. A simple method is to assign the value (i * C) the probability s_i as follows:

s_i = F((i + 1/2) * C) - F((i - 1/2) * C)

where F is the claim size distribution function. In the discretisation process we have to decide on a maximum value of i (called r) so that we don't have an infinite number of calculations to perform. Now comes the clever part. The above discrete distributions lead to a simple one-time summation through a recursive formula to calculate the probability p(j) that the aggregate distribution will equal j * C:

p(j) = (1 / (1 - a * s_0)) * SUM[i = 1 to min(j, r)] (a + b * i / j) * s_i * p(j - i)

The formula works for all frequency distributions for n that are of the (a, b, 0) class, which means that, from P(n = 0) up, we have a recursive relationship between P(n = i) and P(n = i - 1) of the form

P(n = i) = (a + b / i) * P(n = i - 1)    (11.8)

where a and b are fixed values that depend on which of the discrete distributions is used and their parameter values.
The specific values of p_0, a and b are available for each member of the (a, b, 0) class of discrete distributions; for example, for the Poisson(λ) they are

p_0 = exp[λ * s_0 - λ], a = 0, b = λ

with corresponding formulae for the Binomial(n, p), the Geometric(p), the NegBin(s, p) and the Pólya(α, β). The output of the algorithm is two arrays {i}, {p(i)} that can be constructed into a distribution, for example as VoseDiscrete({i}, {p(i)}) * C. Panjer's method can occasionally numerically "blow up" with the binomial distribution, but when it does so it generates negative probabilities, so this is immediately obvious. A small change to Panjer's algorithm allows the formula to be applied to (a, b, 1) distributions, which means that the recursive formula (11.8) works from P(n = 1) onwards. This allows us to include the logarithmic distribution. Panjer's method cannot, however, be applied to the Delaporte distribution.

Panjer's method requires a bit of hands-on management, because one has to experiment with the maximum value r to ensure sufficient coverage and accuracy of the distribution. ModelRisk uses two controls for this: MaxP specifies the upper percentile value of the distribution of X at which the algorithm will stop, and Intervals specifies how many steps will be used in the discretisation of the X distribution. In general, the larger one makes Intervals, the more accurate the model will be, but at the expense of computation time. The MaxP value should be set high enough realistically to cover the distribution of X but, if one sets it too high for a long-tailed distribution, there will be an insufficient number of increments in the main body of the distribution. In ModelRisk one can compare the exact moments of the aggregate distribution with those of the Panjer-constructed distribution to ensure that the two correspond with sufficient accuracy for the analyst's needs.
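A sketch of Panjer's recursion for the Poisson case (a = 0, b = λ, p_0 = exp[λ·s_0 − λ] as above); the severity here is an already-discrete two-point distribution chosen for illustration, so the recursion is exact and easy to cross-check against the compound Poisson mean λ·E[X]:

```python
import math

def panjer_poisson(lam, s, n_terms):
    """Panjer recursion with Poisson(lam) frequency.
    s[i] = probability the discretised severity equals i*C (i = 0..r).
    Returns p[j] = probability the aggregate equals j*C, j = 0..n_terms-1."""
    r = len(s) - 1
    p = [math.exp(lam * s[0] - lam)]                    # p_0 = exp[lam*s_0 - lam]
    for j in range(1, n_terms):
        total = 0.0
        for i in range(1, min(j, r) + 1):
            total += (lam * i / j) * s[i] * p[j - i]    # a = 0, b = lam
        p.append(total)
    return p

# Severity: value 1*C with prob 0.6, 2*C with prob 0.4; frequency Poisson(3)
p = panjer_poisson(3.0, [0.0, 0.6, 0.4], 60)
print(round(p[0], 6), round(sum(p), 6))  # p[0] = exp(-3); probabilities sum to ~1
```

The hands-on choices the text describes (the discretisation step C, the cut-off r and the number of terms to compute) are exactly the `s` vector and `n_terms` arguments here.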
Fast Fourier transform (FFT) method

The density function f(x) of a continuous random variable can always be converted into its Fourier transform φ_X(t) (also called its characteristic function) as follows:

φ_X(t) = INTEGRAL[min to max] e^(itx) f(x) dx = E[e^(itX)]

and we can transform back using the inverse transform. Characteristic functions are really useful for determining the sums of random variables because φ_(X+Y)(t) = φ_X(t) * φ_Y(t), i.e. we just multiply the characteristic functions of variables X and Y to get the characteristic function of (X + Y). For example, the characteristic function for a Normal(μ, σ) distribution is

φ(t) = exp(iμt - σ^2 t^2 / 2)

Thus, for variables X = Normal(μX, σX) and Y = Normal(μY, σY) we have

φ_(X+Y)(t) = φ_X(t) * φ_Y(t) = exp(i(μX + μY)t - (σX^2 + σY^2) t^2 / 2)

In this particular example, the functional form of φ_(X+Y)(t) equates to another normal distribution with mean (μX + μY) and variance (σX^2 + σY^2), so we don't have to apply a transformation back: we can already recognise the result.

The fast Fourier transform method of constructing an aggregate distribution, where there are a random number n of identically distributed random variables X to be summed, is described fully in Robertson (1992). The technique involves discretising the severity distribution X as in Panjer's method, so that one has two sets of discrete vectors, one each for the frequency and severity distributions. The mathematics involves complex numbers and is based on the convolution theorem of discrete Fourier transforms, which states that to obtain the aggregate distribution one multiplies the two discrete Fourier transforms of these vectors pointwise and then computes the inverse discrete Fourier transform. The fast Fourier transform is used as a very quick method for computing the discrete Fourier transform of long vectors. The main advantage of the FFT method is that it is not recursive, so, when one has a large array of possible values, the FFT won't suffer the same error propagation that Panjer's recursion will.
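The recipe can be demonstrated with a toy radix-2 FFT: transform the discretised severity, apply the Poisson probability generating function exp(λ(z − 1)) pointwise, and transform back. This is a minimal illustration only (real implementations, ModelRisk's included, use optimised libraries and padding/wrap-around safeguards); the frequency and severity match the Panjer example so the two constructions can be compared:

```python
import cmath
import math

def fft(a, invert=False):
    # Recursive radix-2 Cooley-Tukey; len(a) must be a power of two
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def aggregate_fft(lam, s, n):
    # Pad severity to length n, transform, apply the Poisson pgf
    # exp(lam*(z - 1)) pointwise, and invert (dividing by n).
    sev = s + [0.0] * (n - len(s))
    shat = fft([complex(x) for x in sev])
    phat = [cmath.exp(lam * (z - 1)) for z in shat]
    return [z.real / n for z in fft(phat, invert=True)]

p = aggregate_fft(3.0, [0.0, 0.6, 0.4], 128)
print(round(p[0], 6), round(sum(p), 6))  # p[0] = exp(-3); probabilities sum to ~1
```

The length-128 vector comfortably covers the aggregate's support here; too short a vector would wrap tail probability back onto small values, which is the FFT method's analogue of choosing r too small in Panjer's recursion.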
The FFT method can also take any discrete distribution for its frequency distribution (and, in principle, any other non-negative continuous distribution if one discretises it). The FFT can also be started away from zero, whereas the Panjer method must calculate the probability of every value starting at zero. Thus, as a rough guide, consider using Panjer's method where the frequency distribution does not take very large values and where it is one of those for which Panjer's method applies; otherwise use the FFT method. ModelRisk offers a version of the FFT method with some adjustments to improve efficiency and allow for a continuous aggregate distribution. FFT methods can also be extended to a group of {n, X} paired distributions, which ModelRisk makes available via its VoseAggregateMultiFFT function.

De Pril method

For a portfolio of n independent life insurance policies, each policy y has a particular probability of a claim p_y in some period (usually a year) and benefit B_y. There are various methods for calculating the aggregate payout. Dickson (2005) is an excellent (and very readable) review of these methods and other areas of insurance risk and ruin. The De Pril method is an exact method for determining the aggregate payout distribution. The compound Poisson approximation discussed next is a faster method that will usually work too. De Pril (1986) offers an exact calculation of the aggregate distribution under the assumptions that:

- The benefits are fixed values rather than random variables and take integer multiples of some convenient base (e.g. $1000) with a maximum value M * base, i.e. B_i = (1 ... M) * base.
- The probabilities of a claim can similarly be grouped into a set of J values (i.e. into tranches of mortality rates), p_j = {p_1 ... p_J}.

Let n_ij be the number of policies with benefit i and probability of claim p_j.
Then De Pril's paper demonstrates that p(y), the probability that the aggregate payout will be equal to y * base, is given by the recursive formula

p(y) = (1/y) Σ_{i=1..min(y,M)} Σ_{k=1..⌊y/i⌋} p(y − ik) h(i, k)   for y = 1, 2, 3, ...

where

h(i, k) = i (−1)^(k−1) Σ_{j=1..J} n_ij (p_j / (1 − p_j))^k

and the recursion starts from p(0) = Π_{i=1..M} Π_{j=1..J} (1 − p_j)^(n_ij). The formula has the benefit of being exact, but it is very computationally intensive. However, the number of computations can usually be significantly reduced if one accepts ignoring small aggregate costs to the insurer. Let K be a positive integer; the recursive formulae above are then modified by truncating the sums over k at k ≤ K. Dickson (2005) recommends using a value of 4 for K. The De Pril method can be seen as the counterpart to Panjer's recursive method for the collective model. ModelRisk offers a set of functions for implementing De Pril's method.

Compound Poisson approximation

The compound Poisson approximation assumes that the probability of payout for an individual policy is fairly small - which is usually true - but has the advantage over the De Pril method of allowing the payout distribution to be a random variable rather than a fixed amount. Let n_j be the number of policies with probability of claim p_j. The number of payouts in this stratum is therefore Binomial(n_j, p_j). If n_j is large and p_j is small, the binomial is well approximated by a Poisson(n_j * p_j) = Poisson(λ_j) distribution. The additive property of the Poisson distribution tells us that the frequency distribution for payouts over all groups of lines of insurance is given by λ_all = Σ_j λ_j, and the total number of claims = Poisson(λ_all). The probability that one of these claims, randomly selected, comes from stratum j is given by λ_j / λ_all. Let F_j(x) be the cumulative distribution function for the claim size of stratum j. The probability that a random claim is less than or equal to some value x is therefore F(x) = Σ_j (λ_j / λ_all) F_j(x). Thus, we can consider the aggregate distribution for the total claims to have a frequency distribution equal to Poisson(λ_all) and a severity distribution given by F(x).
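De Pril's recursion can be sketched in a few lines (a minimal sketch without the K truncation; the portfolio layout as a dict is my own illustrative choice, not ModelRisk's interface):

```python
def de_pril(policies, y_max):
    """Exact aggregate-claims pmf p(0..y_max) for the individual life model.
    policies: dict mapping (benefit i, claim probability p) -> policy count,
    with benefits as positive integer multiples of the base unit."""
    M = max(i for i, _ in policies)
    p0 = 1.0
    for (_, q), cnt in policies.items():
        p0 *= (1.0 - q) ** cnt               # probability of no claims at all
    p = [p0]

    def h(i, k):
        s = sum(cnt * (q / (1.0 - q)) ** k
                for (bi, q), cnt in policies.items() if bi == i)
        return i * (-1.0) ** (k - 1) * s

    for y in range(1, y_max + 1):
        total = sum(h(i, k) * p[y - i * k]
                    for i in range(1, min(y, M) + 1)
                    for k in range(1, y // i + 1))
        p.append(total / y)
    return p

# Two policies, benefit 1 unit each, claim probability 0.1:
pmf = de_pril({(1, 0.1): 2}, 2)   # -> [0.81, 0.18, 0.01]
```

The two-policy example can be checked directly: the aggregate payout is Binomial(2, 0.1), so the probabilities of paying 0, 1 or 2 units are 0.81, 0.18 and 0.01.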
Adding correlation in aggregate calculations

Simulation

The most common method for determining the aggregate distribution of a number of correlated random variables is to simulate each random variable in its own spreadsheet cell, using one of the correlation methods described elsewhere in this book, and then sum them in another cell. For example, the model in Figure 11.5 adds together Poisson(100) random variables each following a Lognormal(2, 5) distribution, but where these variables are correlated through a Clayton(10) copula. Cell C7 determines the 99.99th percentile of the Poisson(100) distribution - a value of 139 - which is used as a guide to set the maximum number of rows in the table. The Clayton copula values are used as "U-parameter" inputs into the lognormal distributions, meaning that they make the lognormal distributions return the percentile equating to the copula value; for example, cell D12 returns a value of 2.5539..., which is the 80.98...th percentile of the Lognormal(2, 5) distribution.

[Figure 11.5 Model for simulating the aggregate distribution of correlated random variables.]

A Clayton copula provides a particularly high level of correlation of the variables at their low end. For example, the plot in Figure 11.6 shows the level of correlation of two variables with a Clayton(10).
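The same construction can be sketched outside a spreadsheet. The Marshall-Olkin method below is a standard way of drawing Clayton copula uniforms; the parameter values mirror the example, but the code is an illustrative sketch, not the ModelRisk model:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def clayton_uniforms(theta, n):
    """n uniforms joined by a Clayton(theta) copula (Marshall-Olkin sampling)."""
    v = rng.gamma(1.0 / theta, 1.0)
    e = rng.exponential(size=n)
    return (1.0 + e / v) ** (-1.0 / theta)

def correlated_sum(lam=100, theta=10, mu=2.0, sd=5.0):
    """Sum of a Poisson(lam) number of lognormals (mean mu, sd sd)
    tied together by the Clayton copula via their percentiles."""
    n = rng.poisson(lam)
    u = clayton_uniforms(theta, n)            # the "U-parameter" percentiles
    s2 = np.log(1.0 + (sd / mu) ** 2)         # lognormal from its mean and sd
    x = stats.lognorm.ppf(u, np.sqrt(s2), scale=mu * np.exp(-s2 / 2.0))
    return x.sum()

totals = np.array([correlated_sum() for _ in range(1000)])
```

Because all the percentiles within one iteration share the same latent gamma draw, the summands move up and down together, widening the distribution of the total without changing its mean.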
Thus, the model will produce a wider range for the sum than an uncorrelated set of variables, but in particular will produce more extreme low-end values from a probabilistic view (the correlated sum has about a 70 % probability of taking a lower value than the uncorrelated sum). The use of one of the Archimedean copulas is an appropriate tool here because we are adding up a random number of these variables, but the number being summed does not affect the copula's behaviour - all variables will be related to the same degree no matter how many are being summed. The effect of the correlation is readily observed by repeating the model without any correlation. The plot in Figure 11.7 compares the two cumulative distributions.

[Figure 11.6 Correlation of two variables with a Clayton(10).]
[Figure 11.7 Comparison of correlated and uncorrelated sums.]

Complete correlation

In the situation where the source of the randomness or uncertainty of the distribution associated with a random variable is the same for the whole group you are adding up, there is really just one random variable. For example, imagine that a railway network company must purchase 127 000 sleepers (the beams under the rails) next year. The sleepers will be made of wood, but the price is uncertain because the cost of timber may fluctuate. It is estimated that the cost will be $PERT(22.1, 22.7, 33.4) each. If all the timber is being purchased at the same time, it might be reasonable to believe that all the sleepers will have the same price. In that case, the total cost can be modelled simply as 127 000 * PERT(22.1, 22.7, 33.4). If there are a large number n of random variables X_i (i = 1, ..., n) being summed and the uncertainty of the sum is not dominated by a few of these distributions, the sum is approximately normally distributed
according to the central limit theorem as follows:

Σ X_i ≈ Normal( Σ_i μ_i , sqrt( Σ_i Σ_j σ_ij ) )

The equation states that the aggregate sum takes a normal distribution with a mean equal to the sum of the means of the individual distributions being added together. It also states that the variance (the square of the standard deviation in the formula) of the normal distribution is equal to the sum of the covariance terms between each pair of variables. The covariance terms σ_ij are calculated as follows:

σ_ij = ρ_ij σ_i σ_j = E[(X_i − μ_i)(X_j − μ_j)]

where σ_i and σ_j are the standard deviations of variables i and j, ρ_ij is the correlation coefficient and E[.] means "the expected value of" the thing in the brackets. If we have datasets for the variables being modelled, Excel can calculate the covariance and correlation coefficients using the functions COVAR() and CORREL() respectively. If we were thinking of using a rank order correlation matrix, each element corresponds reasonably accurately to ρ_ij for roughly normal distributions (at least, not very heavily skewed distributions), so the standard deviation of the normally distributed sum could be calculated directly from the correlation matrix.

Correlating partial sums

We will sometimes be in the situation of having two or more sums of random variables that have some correlation relationship between them. For example, imagine that you are a hospital trying to forecast the number of patient-days you will need to provide next year, and you split the patients into three groups: surgery, maternity and chronic illness (e.g. cancer). Let's say that the distribution of days that a person will spend in hospital under each category is independent of the other categories, but the number of individuals being treated is correlated with the number of people in the catchment area, which is uncertain because hospital catchments are being redefined in your area.
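Returning to the central limit approximation above, the normal parameters of the sum follow directly from the means, standard deviations and correlation matrix; a small sketch with made-up numbers:

```python
import numpy as np

means = np.array([10.0, 20.0, 30.0])
sds   = np.array([2.0, 3.0, 4.0])
rho   = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])    # illustrative correlation matrix

cov = np.outer(sds, sds) * rho         # sigma_ij = rho_ij * sigma_i * sigma_j
sum_mean = means.sum()                 # mean of the normal approximation
sum_sd = np.sqrt(cov.sum())            # sqrt of the sum of ALL covariance terms
```

Note that `cov.sum()` includes the diagonal terms σ_i², so it is exactly the double sum Σ_i Σ_j σ_ij in the formula above.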
There are plenty of ways to model this problem, but perhaps the most convenient is to start with the uncertainty about the number of people in the catchment area and derive what the demand will be for each type of care as a consequence, then make a projection of what the total patient-days might be as a result, as shown in the model in Figure 11.8. In this model the uncertainty about the catchment area population is modelled with a PERT distribution, the bed-days for each category of healthcare are modelled by lognormal distributions with different parameters, and the number of patients in each category is modelled with a Poisson distribution with a mean equal to (population size in 000s) * (expected cases/year/1000 people). I have shown three different methods for simulating the aggregate distribution in each class: pure Monte Carlo for surgery, FFT for maternity and Panjer's recursive method for chronic. Any of the three could be used to model each category. You'll notice that the Monte Carlo method is slightly different from the others in that I've used VosePoisson(...) instead of VosePoissonObject(...) because the VoseAggregateMC function requires a numerical input for how many variables to sum (allowing the flexibility that this could be a calculated value), whereas the FFT and Panjer methods perform calculations on the Poisson distribution and therefore need it to be defined as an object. Note that the same model could be achieved with other Monte Carlo simulation software by making randomly varying arrays for each category, the technique illustrated in Figure 11.1, but the numbers in this problem would require very long arrays.
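The pure Monte Carlo route (what VoseAggregateMC does for one category) amounts to the following sketch; the Poisson mean and lognormal parameters echo the figure but are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def aggregate_mc(lam, mu, sd, sims=2000):
    """Simulate sums of a Poisson(lam) number of lognormal hospital stays,
    where the lognormal is parameterised by its mean mu and std dev sd."""
    s2 = np.log(1.0 + (sd / mu) ** 2)            # log-space variance
    out = np.empty(sims)
    for k in range(sims):
        n = rng.poisson(lam)                     # number of patients this year
        out[k] = rng.lognormal(np.log(mu) - s2 / 2.0, np.sqrt(s2), n).sum()
    return out

bed_days = aggregate_mc(3520.0, 4.1, 2.5)        # one hospital category
```

The mean of the simulated totals should be close to λ times the mean stay (3520 × 4.1 ≈ 14 432 bed-days here), which gives a quick sanity check on any aggregate model.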
[Figure 11.8 Model for forecasting the number of patient-days in a hospital.]
[Figure 11.9 Using a normal copula to correlate the Poisson frequency distributions.]

Using the same basic problem, let us now consider the situation where the frequency distribution for each category is correlated in some fashion, as we had before, but not because of their direct relationship
to any observable variable. Imagine that the population size is known, but we want to model the effects of increased pollution in the area, so we want the surgery and chronic Poisson variables to be positively correlated with each other but negatively correlated with maternity. The model in Figure 11.9 uses a normal copula to correlate the Poisson frequency distributions. There is in fact an FFT method to achieve this correlation between frequency distributions, but the algorithm is not particularly stable. Turning now to the severity (length of hospital stay) distributions, we may wish to correlate the length of stay for all individuals in a certain category. In the above model, this can be achieved by creating a separate scaling variable for each lognormal distribution, for example a Gamma(1/h², h²) distribution, which has the required mean of 1 and a standard deviation of h (Figure 11.10). Note that this means that the lognormal distributions will no longer have the standard deviations they were given before. Finally, let's consider how to correlate the aggregate distributions themselves. We can construct the distribution of the number of bed-days required for each type of healthcare using either the FFT or Panjer method. Since the distribution is constructed rather than simulated, we can easily correlate the aggregate distributions by controlling how they are sampled. In the example in Figure 11.11, the model uses the FFT method to construct the aggregate variables and correlates them together using a Frank copula.
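The normal-copula correlation of the frequencies can be sketched directly: draw correlated standard normals, map them to percentiles, and feed those percentiles to the Poisson inverse cdf. The correlation matrix and means below echo Figure 11.9 but are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

corr = np.array([[ 1.00, -0.30,  0.20],     # surgery, maternity, chronic
                 [-0.30,  1.00, -0.25],
                 [ 0.20, -0.25,  1.00]])
lams = np.array([19667.0, 1573.0, 2932.0])  # illustrative Poisson means

L = np.linalg.cholesky(corr)
z = L @ rng.standard_normal((3, 5000))      # correlated standard normals
u = stats.norm.cdf(z)                       # normal copula percentiles
counts = stats.poisson.ppf(u, lams[:, None])  # correlated patient counts
```

Each row of `counts` keeps its Poisson marginal distribution, while the rank correlations follow the signs of the copula's correlation matrix.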
[Figure 11.10 Creating separate scaling variables for each lognormal distribution.]
[Figure 11.11 Using the FFT method to combine correlated aggregate variables.]
[Figure 11.12 Model that calculates the distribution for n.]

11.2.3 Number of variables to reach a total

So far in this chapter we have focused on determining the distribution of the sum of a (usually random) number of random variables. We are also often interested in the reverse question: how many random variables will it take to exceed some total? For example, we might want to answer the following questions: How many random people entering a lift will it take to exceed the maximum load allowed? How many sales will a company need to make to reach its year-end target?
How many random exposures to a chemical will it take to reach the exposure limit?
Some questions like this are directly answered by known distributions; for example, the negative binomial, beta-negative binomial and inverse hypergeometric describe how many trials will be needed to achieve s successes for the binomial, beta-binomial and hypergeometric processes respectively. However, if the random variables are not 0 or 1 but are continuous distributions, there are no distributions available that are directly useful. The most general method is to use Monte Carlo simulation with a loop that consecutively adds a random sample from the distribution in question until the required sum is produced. ModelRisk offers such a function, called VoseStopSum(Distribution, Threshold). This can, however, be quite computationally intensive when the required number is large, so it would be useful to have some quicker methods available. Table 11.3 gives us some identities that we can use.
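For instance, using the identity that the sum of n independent Gamma(α, β) variables is Gamma(nα, β), the distribution of the number n needed to exceed a threshold T can be computed directly and checked against the brute-force loop (an illustrative sketch, not the VoseStopSum implementation):

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(5)
alpha, beta, T = 2.0, 1.0, 10.0    # Gamma(2, 1) summands, threshold 10

def p_n(n):
    """P(the n-th variable takes the running sum over T) = F_(n-1)(T) - F_n(T),
    where F_k is the cdf of Gamma(k*alpha, beta) and F_0(T) = 1."""
    F = lambda k: gamma.cdf(T, k * alpha, scale=beta) if k else 1.0
    return F(n - 1) - F(n)

probs = np.array([p_n(n) for n in range(1, 61)])
analytic_mean = (np.arange(1, 61) * probs).sum()

# Brute-force check: keep adding Gamma(2, 1) samples until the total exceeds T
def stop_sum():
    total, n = 0.0, 0
    while total <= T:
        total += rng.gamma(alpha, beta)
        n += 1
    return n

sim_mean = np.mean([stop_sum() for _ in range(4000)])
```

The analytic probabilities sum to 1 (the threshold is always eventually crossed), and the two estimates of the mean number of variables should agree closely.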
For example, the sum of n independent variables following a Gamma(α, β) distribution is equal to a Gamma(n * α, β). If we require a total of at least T, then the probability that (n − 1) Gamma(α, β) variables will exceed T is 1 − F_(n−1)(T), where F_(n−1)(T) is the cumulative probability at T of a Gamma((n − 1) * α, β). Excel has the GAMMADIST function, which calculates F(x) for a gamma distribution (ModelRisk has the function VoseGammaProb, which performs the same task but without the errors GAMMADIST sometimes produces). The probability that n variables will exceed T is given by 1 − F_n(T). Thus, the probability that it was the nth random variable that took the sum over the threshold is (1 − F_n(T)) − (1 − F_(n−1)(T)) = F_(n−1)(T) − F_n(T). You can therefore construct a model that calculates the distribution for n directly, as shown in the spreadsheet in Figure 11.12. The same idea can be applied with the Cauchy, chi-square, Erlang, exponential, Lévy, normal and Student distributions. The VoseStopSum function in ModelRisk implements this shortcut automatically.

Chapter 12 Forecasting with uncertainty

[Dilbert cartoon. © United Feature Syndicate, Inc. Syndicated by Bruno Productions B.V. Reproduced by permission.]

This chapter looks at several forecasting methods in common use and how variability and uncertainty can be incorporated into their forecasts. Time series modelling is usually based on extrapolating a set of observations from the past or, where data are not available or inadequate, the modelling focuses on expert opinion of how the variable may behave in the future. In this chapter we will look first of all at the more formal techniques of time series modelling based on past observations, then look at some ways that the reader may find useful to model expert opinion of what the future holds.
The prerequisites of formal quantitative forecasting techniques are that a reliable time series of past observations is available and that it is believed that the factors determining the patterns exhibited in that time series are likely to continue to exist or, if not, that we can determine the effect of changes in these factors. We begin by discussing ways of measuring the performance of a forecasting technique. Then we look at the naive forecast, which is simply repeating the last, deseasonalised, value in the available time series. This simplistic forecasting technique is useful for providing a benchmark against which the performance of the other techniques can be compared. This is followed by a look at various forecasting techniques, divided into three sections according to the length of the period that is to be forecast. Finally, we will look at a couple of examples of a different approach that aims at modelling the variability based on a reasonable theoretical model of the actual system. There are a few useful basic tips I recommend when you are producing a stochastic time series as part of your risk analysis: Check the model's behaviour with embedded Excel x-y scatter plots. Split the model up into components rather than create long, complicated formulae. That way you'll see that each component is working correctly, and therefore have confidence in the time series projection as a whole.

[Figure 12.1 Six plots from the same geometric Brownian motion model. Each pattern could easily be what follows on from any other pattern.]

Be realistic about the match between historic patterns and projections. For example, write a simple geometric Brownian motion model, plot the series and hit the F9 key (recalculate) a few times and see the variation in patterns you get.
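A geometric Brownian motion series of the kind described can be sketched in a few lines; re-running it plays the role of hitting F9, producing the strikingly different patterns of Figure 12.1 (the drift and volatility values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def gbm_path(s0=100.0, mu=0.05, sigma=0.25, dt=1.0 / 12.0, steps=60):
    """One GBM path: S_{t+dt} = S_t * exp((mu - sigma^2/2) dt + sigma sqrt(dt) Z)."""
    z = rng.standard_normal(steps)
    log_steps = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z
    return s0 * np.exp(np.cumsum(log_steps))

paths = [gbm_path() for _ in range(6)]   # six very different-looking series
```

All six paths come from exactly the same stochastic model, which is the point of the exercise: a single realisation tells you surprisingly little about the generating process.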
Remember that these all come from the same stochastic model, but they will often look convincingly different (see Figure 12.1): if any of these had been our historical data, a statistical analysis of the data would have tended to agree with, and reinforce, any preconception about the appropriate model, because statistical analysis requires you to specify the model to test. So don't always go for a forecast model just because it fits the data best - also look at whether there is a logical reason for choosing one model over another. Be creative. Short-term forecasts (say 20-30 % of the historic period for which you have good data) are often adequately produced from a statistical analysis of your data. Even then, be selective about the model. However, beyond that timeframe we move into crystal ball gazing. Including your perceptions of where the future may go, possible influencing events, etc., will be just as valid as an extrapolation of historic data.

12.1 The Properties of a Time Series Forecast

When producing a risk analysis model that forecasts some variable over time, I recommend you go through a list of several properties that variable might exhibit over time, as this will help you both statistically analyse any past data you have and select the most appropriate model to use. The properties are: trend, randomness, seasonality, cyclicity or shocks, and constraints.

[Figure 12.2 Examples of expected value trend over time.]

12.1.1 Trend

Most variables we model have a general direction in which they have been moving, or in which we believe they will move in the future.
The four plots in Figure 12.2 give some examples of the expected value of a variable over time: top left - a steady relative decrease, such as one might expect for sales of an old technology, or the number of individuals remaining alive from a group; top right - a steady (straight-line) increase, such as is often assumed for financial returns over a reasonably short period (sometimes called "drift"); bottom left - a steady relative increase, such as bacterial growth or take-up of new technology; and bottom right - a drop turning into an increase, such as the rate of component failures over time (like the bathtub curve in reliability modelling) or advertising expenditure (more at a launch, then lower, then ramping up to offset reduced sales).

12.1.2 Randomness

The second most important property is randomness. The four plots in Figure 12.3 give some examples of the different types of randomness: top left - a relatively small and constant level of randomness that doesn't hide the underlying trend; top right - a relatively large and constant level of randomness that can disguise the underlying trend; bottom left - a steadily increasing randomness, which one typically sees in forecasting (care needs to be taken to ensure that the extreme values don't become unrealistic); and bottom right - levels of randomness that vary seasonally.

[Figure 12.3 Examples of the behaviour of randomness over time.]

12.1.3 Seasonality

Seasonality means a consistent pattern of variation in the expected value (but also sometimes its randomness) of the variable.
There can be several overlaying seasonal periods, but we should usually have a pretty good guess at what the periods of seasonality might be: hour of the day; day of the week; time of the year (summer/winter, for example, or holidays, or end of financial year). The plot in Figure 12.4 shows the effect of two overlaying seasonal periods. The first is weekly with a period of 7; the second is monthly with a period of 30, which complicates the pattern. Monthly seasonality often occurs with financial transactions that take place on a certain day of the month: for example, volumes of documents that a bank's printing facility must produce each day - at the end of the month they have to churn out bank and credit card statements and get them in the post within some legally defined time. One difficulty in analysing monthly seasonality from data is that months have different lengths, so one cannot simply investigate a difference each 30 days, say. Another hurdle in analysing data on variables with monthly and holiday peaks is that there can be some spread of the effect over 2 or 3 days. For example, we performed an analysis recently looking at the calls received into a US insurance company's national call centre to help them optimise how to staff the centre. We were asked to produce a model that predicted every 15 minutes for the next 2 weeks, and another model to predict out 6 weeks. We looked at the patterns by individual state and language (Spanish and English). There was a very obvious and stable pattern through the day that was constant during the working week, but a different pattern on Saturday and on Sunday. The pattern was largely the same between states but different between languages.
Holidays like Thanksgiving (the fourth Thursday of November, so not even a fixed date) were very interesting: call rates dropped hugely on the holiday, to 10 % of the level one would usually have expected, but were slightly lower than normal the day before (Wednesday), significantly lower the day after (Friday), a little lower during the following weekend, and then significantly higher the following Monday and Tuesday (presumably because people were catching up on calls they needed to make). Memorial Day, the last Monday of May, exhibited a similar pattern, as shown in Figure 12.5.

[Figure 12.4 The expected value of a variable with two overlapping seasonal periods.]
[Figure 12.5 Effect of holidays on daily calls to a call centre. The four lines show the effect on the last 4 years. Zero on the x axis is the day of the holiday.]

The final models had logic built into them to look for forthcoming holidays and apply these patterns to forecast expected levels, which had a trend by state and a daily seasonality. For the 15-minute models we also had to take into account the time zone of the state, since all calls from around the US were received into one location, which also involved thinking about when states changed their clocks from summer to winter, and little peculiarities like some states having two time zones (Arizona doesn't observe daylight saving, to conserve energy used by air conditioners, etc.).

[Figure 12.6 Two examples of the effect of a cyclicity shock. On the left, the shock produces a sudden and sustained increase in the variable; on the right, the shock produces a sudden increase that gradually reduces over time - an exponential distribution is often used to model this reduction.]
12.1.4 Cyclicity or shocks

Cyclicity is a confusing term (being rather similar to seasonality) that refers to the effect of obvious single events on the variable being modelled (Figure 12.6 illustrates two basic forms). For example, the Hatfield rail crash in the UK on 12 October 2000 was a single event with a long-term effect on the UK railway network. The accident was caused by the lapsed maintenance of the track, which led to "gauge corner cracking", resulting in the rail separating. Investigators found many more such cracks in the area, and a temporary speed restriction was imposed over very large lengths of track because of fears that other track might be suffering from the same degradation. The UK network was already at capacity levels, so slowing down trains resulted in huge delays. The cost of repairs to the undermaintained track also sent Railtrack, the company managing the network, into administration. In analysing the cause of train delays for our client, Network Rail, a not-for-dividend company that took over from Railtrack, we had to estimate and remove the persistent effect of Hatfield. Another obvious example is 9/11. Anyone who regularly flies on commercial airlines will have experienced the extra delays and security checks. The airline industry was also greatly affected, with several US carriers filing for protection under Chapter 11 of the US Bankruptcy Code, although other factors also played a part, such as oil price increases and other terrorist attacks (also cyclicity events) which dissuaded people from going abroad. We performed a study to determine what price should be charged for parking at a US national airport, part of which included estimating future demand.
From analysing historic data it was evident that the effect of 9/11 on passenger levels was quite immediate, and, as of 2006, levels were only just returning to 2000 levels, where previously there had been consistent growth in passenger numbers, so levels still remain far below what would have been predicted before the terrorist attack. Events like Hatfield and 9/11 are, of course, almost impossible to predict with any confidence. However, other types of cyclicity event are more predictable. As I write this (20 June 2007), there are 7 days left before Tony Blair steps down as Prime Minister of the UK, which he announced on 10 May, and Gordon Brown takes over. Newspaper columnists are debating what changes will come about, and, for people in the know, there are probably some predictable elements.

12.1.5 Constraints

Randomly varying time series projections can quite easily produce extreme values far beyond the range that the variable might realistically take. There are a number of ways to constrain a model. Mean reversion, discussed later, will pull a variable back to its mean so that it is far less likely to produce extreme values. Simple logical bounds like IF(St > 100, 100, St) will constrain a variable to remain at or below 100, and one can make the constraining parameter (100) a function of time too. The section describing market modelling below offers some other techniques that are based on more model-based constraints.

12.2 Common Financial Time Series Models

In this section I describe the most commonly used time series for financial models of variables such as stock prices, exchange rates, interest rates and economic indicators such as producers' price index (PPI) and gross domestic product (GDP). Although they have been developed for financial markets, I encourage you to review the ideas and models presented here because they have much wider applications.
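The logical-bound idea can be sketched in a few lines of code. This is a minimal illustration, assuming numpy; the function name `constrained_walk` and the random-walk form are ours, not from the text - the point is only the `min(..., cap)` step, which mirrors the spreadsheet rule IF(St > 100, 100, St).

```python
import numpy as np

def constrained_walk(s0, mu, sigma, n_steps, cap=100.0, seed=1):
    """Random-walk projection with a simple logical bound, mirroring
    the spreadsheet rule IF(St > 100, 100, St). The cap is constant
    here, but it could equally be made a function of time t."""
    rng = np.random.default_rng(seed)
    s = np.empty(n_steps + 1)
    s[0] = s0
    for t in range(1, n_steps + 1):
        step = rng.normal(mu, sigma)
        s[t] = min(s[t - 1] + step, cap)  # constrain at or below the cap
    return s

path = constrained_walk(90.0, mu=0.5, sigma=2.0, n_steps=200)
```

Every value in `path` stays at or below 100, however large the random steps become.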
Financial time series are considered to vary continuously, even if perhaps we only observe them at certain moments in time. They are based on stochastic differential equations (SDEs), which are the most general descriptions of continuously evolving random variables. The problem with SDEs from a simulation perspective is that they are not always amenable to being exactly converted to algorithms that will generate random possible observations at specific moments in time, and there are often no exact methods for estimating their parameters from data. On the other hand, the advantage is that we have a consistent framework for comparing the time series, and there are sometimes analytical solutions available to us for determining, say, the probability that the variable exceeds some value at a certain point in time - answers that are useful for pricing derivatives and other financial instruments, for example. We can get around the problems with a bit of intense computing, as I will explain for each type of time series. Financial time series model a variable in one of two forms: the actual price S_t of the stock (or the value of a variable such as exchange rate, interest rate, etc., if it is not a stock) at some time t, or its return (aka its relative change if it is not an investment) r_t over a period Δt, ΔS/S. It might seem that modelling S_t would be more natural, but in fact modelling the return of the variable is often more helpful: apart from making the mathematics simpler, it is usually the more fundamental variable. In this section, I will refer to S_t when talking specifically about a price, to r_t when talking specifically about a return and to x_t when it could be either. I introduce geometric Brownian motion (GBM) first, as it is the simplest and most common financial time series, the basis of the Black-Scholes model, etc., and the launching pad for a number of more advanced models.
I have developed the theory a little for GBM, so you get the feel of the thinking, but keep the theory to a minimum after that, so don't be too put off. ModelRisk provides facilities (Figure 12.7) to fit and/or model all of the time series described in the chapter. For financial models, data and forecasts can be either returns or prices, and the fitting algorithms can automatically include uncertainty about parameter estimates if required.

Figure 12.7 ModelRisk time series fit window.

12.2.1 Geometric Brownian motion

Consider the formula

x_{t+1} = x_t + Normal(μ, σ)     (12.1)

It states that the variable's value changes in one unit of time by an amount that is normally distributed with mean μ and standard deviation σ. The normal distribution is a good first choice for a lot of variables because we can think of the model as stating (from the central limit theorem) that the variable x is being affected additively by many independent random variables. We can iterate the equation to give us the relationship between x_t and x_{t+2}:

x_{t+2} = x_t + Normal(2μ, σ√2)

and generalise to any time interval T:

x_{t+T} = x_t + Normal(μT, σ√T)

This is a rather convenient equation because (a) we keep using normal distributions and (b) we can make a prediction between any time intervals we choose. The above equation deals with discrete units of time but can be written in a continuous-time form, where we consider any small time interval Δt:

Δx = Normal(μΔt, σ√Δt)

The SDE equivalent is

dx = μ dt + σ dz,     dz = ε√dt     (12.2)

where dz is the generalised Wiener process, called variously the "perturbation", "innovation" or "error", and ε is a Normal(0, 1) distribution. The notation might seem to be a rather unnecessary complication, but when you get used to SDEs they give us the most succinct description of a stochastic time series. A more general version of Equations (12.2) is

dx = f(x, t) dt + g(x, t) dz

where f and g are two functions.
It is really just shorthand for writing

x_{t+T} = x_t + ∫ₜ^{t+T} f(x, s) ds + ∫ₜ^{t+T} g(x, s) dz

Equation (12.1) can allow the variable x to take any real value, including negative values, so it would not be much good at modelling a stock price, interest rate or exchange rate, for example. However, it has the desirable property of being memoryless, i.e. to make a prediction of the value of x some time T from now, we only need to know the value of x now, not anything about the path it took to get to the present value. We can use Equations (12.2) to model the return of a stock:

dS/S = μ dt + σ dz     (12.3)

There is an identity known as Itô's lemma which states that, for a function F of a stochastic variable x following an Itô process of the form

dx(t) = a(x, t) dt + b(x, t) dz

we have

dF = (a ∂F/∂x + ∂F/∂t + ½ b² ∂²F/∂x²) dt + b ∂F/∂x dz

Choosing F(S) = log[S] together with Equation (12.3), where x = S, a(x, t) = μS and b(x, t) = σS:

d(log S) = (μ − σ²/2) dt + σ dz

Integrating over time T, we get the relationship between some initial value S_t and some later value S_{t+T}:

S_{t+T} = S_t exp[r_T],     r_T = Normal((μ − σ²/2)T, σ√T)     (12.6)

where r_T is called the log return¹ of the stock over the period T. The exp[·] term in Equation (12.6) means that S is always > 0, and we still retain the memoryless property, which corresponds to some financial thinking that a stock's value encompasses all information available about a stock at the time, so there should be no memory in the system (I'd argue against that, personally). The log return r of a stock S is (roughly) the fractional change in the stock's value. For stocks this is a more interesting value than the stock's actual price because it would be more profitable to own 10 shares in a $1 stock that increased by 6 % over a year than one share in a $10 stock that increased by 4 %, for example. Equation (12.6) is the GBM model: the "geometric" part comes because we are effectively multiplying lots of distributions together (adding them in log space).
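Equation (12.6) can be simulated directly. A minimal sketch, assuming numpy (the function name `gbm_draws` is ours): draw the log return r_T and exponentiate.

```python
import numpy as np

def gbm_draws(s_t, mu, sigma, T, n, rng):
    """n draws of S_(t+T) from Equation (12.6):
    S_(t+T) = S_t * exp(Normal((mu - sigma^2/2)*T, sigma*sqrt(T)))."""
    r_T = rng.normal((mu - sigma**2 / 2) * T, sigma * np.sqrt(T), size=n)
    return s_t * np.exp(r_T)

rng = np.random.default_rng(42)
# 100,000 one-year draws from S_t = 100 with mu = 0.05, sigma = 0.2:
# the sample mean should sit near S_t*exp(mu*T) = 100*exp(0.05)
draws = gbm_draws(100.0, mu=0.05, sigma=0.2, T=1.0, n=100_000, rng=rng)
```

Note that the exp[·] keeps every draw strictly positive, as the text requires of a price.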
From the definition of a lognormal random variable, if ln[S] is normally distributed, then S is lognormally distributed, so Equation (12.6) is modelling S_{t+T} as a lognormal random variable. From the equation of the mean of the lognormal distribution in Appendix III you can see that S_{t+T} has a mean given by

E[S_{t+T}] = S_t exp[μT]

hence μ is also called the exponential growth rate, and a variance given by

V[S_{t+T}] = S_t² exp[2μT](exp[σ²T] − 1)

GBM is very easy to reproduce in Excel, as shown by the model in Figure 12.8, even with different time increments. It is also very easy to estimate its parameters from a dataset when the observations have a constant time increment between them, as shown by the model in Figure 12.9.

Figure 12.8 GBM model with unequal time increments (Mu = 0.01, Sigma = 0.033). Formulae: returns (C7:C42) =VoseNormal((Mu-Sigma^2/2)*(B7-B6), Sigma*SQRT(B7-B6)); prices (D7:D42) =D6*EXP(C7).

¹ Not to be confused with the simple return R_t, which is the fractional increase of the variable over time t, and where r_t = ln[1 + R_t].
Figure 12.9 Estimating GBM model parameters with equal time increments. Formulae: log returns (D4:D105) =LN(C4)-LN(C3); their average and standard deviation =AVERAGE(D4:D105) and =STDEV(D4:D105) are rescaled by the time increment to give the estimates σ̂ = G6/SQRT(G2) and μ̂ = G5/G2 + G9^2/2.

Figure 12.10 Estimating GBM model parameters with unequal or missing time increments.

If there are missing observations or observations with different time increments, it is still possible to estimate the GBM parameters. In the model in Figure 12.10, the observations are transformed to Normal(0, 1) variables {z}:

z = (ln[S_t] − ln[S_{t−Δt}] − (μ − σ²/2)Δt)/(σ√Δt)

and then Excel's Solver is used to vary mu and sigma to make the {z} values have a mean of zero and a standard deviation of 1, by minimising the error sum in cell G8: =ABS(AVERAGE(D4:D187)) + ABS(STDEV(D4:D187)-1). An alternative method would be to regress (ln[S_t] − ln[S_{t−Δt}])/√Δt against √Δt with zero intercept: the slope estimates (μ − σ²/2) and the standard error estimates σ.
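The equal-increment estimation of Figure 12.9 translates into a few lines. A sketch, assuming numpy; `estimate_gbm` is our name for it, and the synthetic-data check at the end simply regenerates log returns with known parameters to confirm the estimator recovers them.

```python
import numpy as np

def estimate_gbm(prices, dt):
    """Estimate GBM parameters from equally spaced price observations,
    mirroring Figure 12.9: sigma from the sd of the log returns scaled
    by sqrt(dt), then mu from their mean scaled by dt plus sigma^2/2."""
    log_returns = np.diff(np.log(prices))
    sigma = log_returns.std(ddof=1) / np.sqrt(dt)
    mu = log_returns.mean() / dt + sigma**2 / 2
    return mu, sigma

# check on synthetic data generated with known parameters
rng = np.random.default_rng(7)
true_mu, true_sigma, dt = 0.1, 0.3, 0.01
log_r = rng.normal((true_mu - true_sigma**2 / 2) * dt,
                   true_sigma * np.sqrt(dt), size=50_000)
prices = 100.0 * np.exp(np.cumsum(np.insert(log_r, 0, 0.0)))
mu_hat, sigma_hat = estimate_gbm(prices, dt)
```

With 50 000 observations the estimates land close to the true μ = 0.1 and σ = 0.3; σ is pinned down much more tightly than μ, which is typical of GBM estimation.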
The spread of possible values in a GBM increases rapidly with time. For example, the plot in Figure 12.11 shows 50 possible forecasts with S₀ = 1, μ = 0.001 and σ = 0.02.

Figure 12.11 Plot of 50 possible scenarios with a GBM(μ = 0.001, σ = 0.02) model with a starting value of 1.

Mean reversion, discussed next, is a modification to GBM that progressively encourages the series to move back towards a mean the further it strays away. Jump diffusion, discussed after that, acknowledges that there may be shocks to the variable that result in large discrete jumps. ModelRisk has functions for fitting and projecting GBM, and GBM with mean reversion and/or jump diffusion. The functions work with both returns r and stock prices S.

12.2.2 GBM with mean reversion

The long-run time series properties of equity prices (among other variables) are, of course, of particular interest to financial analysts. There is a strong interest in determining whether stock prices can be characterised as random-walk or mean-reverting processes, because this has an important effect on an asset's value. A stock price follows a mean-reverting process if it has a tendency to return to some average value over time, which means that investors may be able to forecast future returns better by using information on past returns to determine the level of reversion to the long-term trend path. A random walk has no memory, which means that any large move in a stock price following a random-walk process is permanent and there is no tendency for the price level to return to a trend path over time. The random-walk property also implies that the volatility of the stock price can grow without bound in the long run: increased volatility lowers a stock's value, so a reduction in volatility (Figure 12.12) owing to mean reversion would increase a stock's value.
For a variable x following a Brownian motion random walk, we have the SDE of Equation (12.2):

dx = μ dt + σ dz

For mean reversion, this equation can be modified as follows:

dx = α(μ − x) dt + σ dz     (12.7)

Figure 12.12 Plots of sample GBM series with mean reversion for different values of alpha (0.0001, 0.1 and 0.4; μ = 0, σ = 0.001).

where α > 0 is the speed of reversion. The effect of the dt coefficient is to produce an expectation of moving downwards if x is currently above μ, and vice versa. Mean reversion models are produced in terms of S or r. In terms of S,

dS = α(μ − S) dt + σ dz

is known as the Ornstein-Uhlenbeck process, and was one of the first models used to describe short-term interest rates, where it is called the Vasicek model. The problem with the equation is that we can get negative stock prices; modelling in terms of r, however, keeps the stock price positive:

dr = α(μ − r) dt + σ dz

Integrating this last equation over time gives

r_{t+T} = Normal(μ + exp[−αT](r_t − μ), σ√((1 − exp[−2αT])/(2α)))

which is very easy to simulate. The plots in Figure 12.12 show some typical behaviour for r_t. Typical values of α would be in the range 0.1-0.3. A slight modification to Equation (12.7) is called the Cox-Ingersoll-Ross or CIR model (Cox, Ingersoll and Ross, 1985), again used for short-term interest rates, and has the useful property of not allowing negative values (so we can use it to model the variable S) because the volatility goes to zero as S approaches zero:

dS = α(μ − S) dt + σ√S dz

Integrating over time, we get

S_{t+T} = (σ²(1 − exp[−αT])/(4α)) Y

where Y is a non-central chi-square distribution with 4αμ/σ² degrees of freedom and non-centrality parameter (4α exp[−αT]/(σ²(1 − exp[−αT]))) S_t.
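The mean-reverting closed form described above is indeed easy to simulate. A sketch, assuming numpy; `ou_step` is an illustrative name for one exact draw of r at a horizon T ahead.

```python
import numpy as np

def ou_step(r_t, mu, alpha, sigma, T, rng):
    """Exact draw of r_(t+T) for the mean-reverting (Ornstein-Uhlenbeck)
    model: Normal with mean mu + exp(-alpha*T)*(r_t - mu) and
    sd sigma*sqrt((1 - exp(-2*alpha*T))/(2*alpha))."""
    mean = mu + np.exp(-alpha * T) * (r_t - mu)
    sd = sigma * np.sqrt((1 - np.exp(-2 * alpha * T)) / (2 * alpha))
    return rng.normal(mean, sd)

rng = np.random.default_rng(0)
# starting far above mu, the simulated path drifts back towards it
r = [0.5]
for _ in range(500):
    r.append(ou_step(r[-1], mu=0.0, alpha=0.2, sigma=0.01, T=1.0, rng=rng))
```

Because each step uses the exact transition distribution, the time step T can be as large as you like without discretisation error, unlike the Euler approach needed later for the combined models.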
This is a little harder to simulate, since you need the uncommon non-central chi-square distribution in your simulation software, but it has the attraction of being tractable (we can precisely determine the form of the distribution for the variable S_{t+T}), which makes it easier to determine its parameters using maximum likelihood methods.

12.2.3 GBM with jump diffusion

Jump diffusion refers to sudden shocks to the variable that occur randomly in time. The idea is to recognise that, beyond the usual background randomness of a time series variable, there will be events that have a much larger impact on the variable, e.g. a CEO resigns, a terrorist attack takes place, a drug gets FDA approval. The frequency of the jumps is usually modelled as a Poisson distribution with intensity λ, so that in some time increment T there will be Poisson(λT) jumps. The jump size for r is usually modelled as Normal(μ_J, σ_J) for mathematical convenience and ease of estimating the parameters. Adding jump diffusion to the discrete-time Equation (12.6) for one period, we get the following:

r₁ = Normal(μ − σ²/2, σ) + Σᵢ₌₁^{Poisson(λ)} Normal(μ_J, σ_J)

If we define k = Poisson(λ), this reduces to

r₁ = Normal(μ − σ²/2 + kμ_J, √(σ² + kσ_J²))

or for T periods we have

r_T = Normal((μ − σ²/2)T + kμ_J, √(σ²T + kσ_J²)),     k = Poisson(λT)     (12.9)

which is easy to model with Monte Carlo simulation and easy to estimate parameters for by matching moments, although one must be careful to ensure that the λ estimate isn't too high (e.g. > 0.2) because the Poisson jumps are meant to be rare events, not form part of each period's volatility. The plot in Figure 12.13 shows a typical jump diffusion model giving both r and S values and with jumps marked as circles.

12.2.4 GBM with jump diffusion and mean reversion

You can imagine that, if the return r has just received a large shock, there might well be a "correction" over time that brings it back to the expected return μ of the series. Combining mean reversion with jump diffusion will allow us to model these characteristics quite well and with few parameters.
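The T-period jump-diffusion return above really is a two-line Monte Carlo model: draw the Poisson number of jumps, then draw one normal conditional on it. A sketch, assuming numpy; `jump_diffusion_return` is our name for it.

```python
import numpy as np

def jump_diffusion_return(mu, sigma, mu_j, sigma_j, lam, T, rng):
    """One draw of the T-period log return under GBM with jump
    diffusion: k = Poisson(lam*T) jumps, each contributing
    Normal(mu_j, sigma_j) to the return."""
    k = rng.poisson(lam * T)
    mean = (mu - sigma**2 / 2) * T + k * mu_j
    sd = np.sqrt(sigma**2 * T + k * sigma_j**2)
    return rng.normal(mean, sd)

rng = np.random.default_rng(3)
r = np.array([jump_diffusion_return(0.05, 0.2, -0.1, 0.15, 0.1, 1.0, rng)
              for _ in range(50_000)])
# moment check: E[r] = (mu - sigma^2/2)*T + lam*T*mu_j = 0.03 - 0.01 = 0.02
```

The moment-matching comment shows the hook for parameter estimation: the sample mean and variance of observed returns pin down combinations of (μ, σ, μ_J, σ_J, λ).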
However, the additive model of Equation (12.9) for mean and variance no longer applies, particularly when the reversion speed is large, because one needs to model when within the period the jump took place: if it was at the beginning of the period, it may well have already strongly reverted before one observes the value at the period's end. The most practical solution, called Euler's method, is to split up a time period into many small increments. The number of increments will be sufficient when the model produces the same output for decision purposes as any greater number of increments.

12.3 Autoregressive Models

An ever-increasing number of autoregressive models are being developed in the financial area. The ones of more general interest discussed here are AR, MA, ARMA, ARCH and GARCH, and it is more standard to apply the models to the return r rather than to the stock price S. I also give the equations for EGARCH and APARCH. Let me just repeat my earlier warning that, before being convinced that some subtle variation of the model gives a genuine advantage, try generating a few samples for simpler models that you have fit to the data and see whether they can create scenarios of a similar pattern. ModelRisk offers functions that fit each of these series to data and produce forecasts. The data can be live linked to historical values, which is very convenient for keeping your model automatically up to date.

Figure 12.13 Sample of a GBM with jump diffusion, showing both r and S values, with jumps marked as circles (σ = 0.01, μ_J = 0.04, σ_J = 0.2, λ = 0.02).

12.3.1 AR

The equation for an autoregressive process of order p, or AR(p), is

r_t = μ + Σᵢ₌₁ᵖ aᵢ(r_{t−i} − μ) + ε_t

where the ε_t are independent Normal(0, σ) random variables. Some constraints on the parameters {aᵢ} are needed if one wants to keep the model stationary (meaning the marginal distribution of r_t is the same for all t), e.g. for an AR(1), |a₁| < 1.
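Euler's method for the combined mean-reversion-plus-jump model can be sketched as follows. This is an illustration under our own parameterisation, assuming numpy - `mr_jump_euler` and its arguments are not ModelRisk's implementation - but it shows the key point: the period T is split into n_inc increments, so a jump that lands early in the period gets partly reverted by the period's end.

```python
import numpy as np

def mr_jump_euler(r0, mu, alpha, sigma, mu_j, sigma_j, lam, T, n_inc, rng):
    """Euler's method sketch for mean reversion combined with jump
    diffusion: step through T in n_inc small increments, applying the
    reversion drift, the diffusion term and any Poisson jumps."""
    dt = T / n_inc
    r = r0
    for _ in range(n_inc):
        k = rng.poisson(lam * dt)                      # jumps in this slice
        jump = rng.normal(k * mu_j, sigma_j * np.sqrt(k)) if k > 0 else 0.0
        r += alpha * (mu - r) * dt + rng.normal(0.0, sigma * np.sqrt(dt)) + jump
    return r

rng = np.random.default_rng(13)
# with strong reversion (alpha = 5) and a long horizon, a start far
# from mu is pulled essentially all the way back by the period's end
r_end = mr_jump_euler(0.5, mu=0.0, alpha=5.0, sigma=0.001,
                      mu_j=0.0, sigma_j=0.001, lam=0.0,
                      T=10.0, n_inc=1000, rng=rng)
```

In practice one would rerun the model with ever more increments until, as the text says, the output stops changing for decision purposes.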
In most situations, an AR(1) or AR(2) is sufficiently elaborate, i.e.

r_t = μ + a₁(r_{t−1} − μ) + ε_t

You can see that this is just a regression model where r_t is the dependent variable and the r_{t−i} are the explanatory variables. It is usual, though not essential, that aᵢ > aᵢ₊₁, i.e. that r_t is explained more by more recent values (t − 1, t − 2, ...) than by older values (t − 10, t − 11, ...).

12.3.2 MA

The equation for a moving-average process of order q, or MA(q), is

r_t = μ + Σᵢ₌₁^q bᵢ ε_{t−i} + ε_t

This says that the variable r_t is normally distributed about a mean equal to

μ + Σᵢ₌₁^q bᵢ ε_{t−i}

where the ε_t are independent Normal(0, σ) random variables again. In other words, the mean of r_t is the mean of the process as a whole, μ, plus some weighting of the variation of the q previous terms from the mean. Similarly to AR models, it is usual that bᵢ > bᵢ₊₁, i.e. that r_t is explained more by more recent terms (t − 1, t − 2, ...) than by older terms (t − 10, t − 11, ...).

12.3.3 ARMA

We can put the AR(p) and MA(q) processes together to create an autoregressive moving-average model, the ARMA(p, q) process with mean μ, described by the following equation:

r_t = μ + Σᵢ₌₁ᵖ aᵢ(r_{t−i} − μ) + Σⱼ₌₁^q bⱼ ε_{t−j} + ε_t

In practice, the ARMA(1, 1) is usually sufficiently complex, so the equation simplifies to

r_t = μ + a(r_{t−1} − μ) + bε_{t−1} + ε_t

12.3.4 ARCH

ARCH models were originally developed to account for fat tails by allowing clustering of periods of volatility (heteroscedastic, or heteroskedastic, means "having different variances"). One of the assumptions in regression models that were previously used for analysis of high-frequency financial data was that the error terms have a constant variance. Engle (1982), who won the 2003 Nobel Memorial Prize for Economics, introduced the ARCH model, applying it to quarterly UK inflation data. ARCH was later generalised to GARCH by Bollerslev (1986), which has proven more successful in fitting to financial data.
Let r_t denote the returns or return residuals and assume that

r_t = μ + σ_t z_t

where the z_t are independent, Normal(0, 1) distributed, and σ_t is modelled by

σ_t² = ω + Σᵢ₌₁^q aᵢ(r_{t−i} − μ)²

where ω > 0, aᵢ ≥ 0, i = 1, ..., q, and at least one aᵢ > 0. Then r_t is said to follow an autoregressive conditional heteroskedastic, ARCH(q), process with mean μ. It models the variance of the current error term as a function of the previous periods' squared deviations (r_{t−i} − μ)². Since each aᵢ ≥ 0, it has the effect of grouping low (or high) volatilities together. If an autoregressive moving-average process (ARMA process) is assumed for the variance, then r_t is said to be a generalised autoregressive conditional heteroskedastic GARCH(p, q) process with mean μ:

σ_t² = ω + Σᵢ₌₁^q aᵢ(r_{t−i} − μ)² + Σⱼ₌₁ᵖ bⱼ σ_{t−j}²

where p is the order of the GARCH terms and q is the order of the ARCH terms, ω > 0, aᵢ ≥ 0, i = 1, ..., q; bⱼ ≥ 0, j = 1, ..., p, and at least one aᵢ or bⱼ > 0. In practice, the model most generally used is a GARCH(1, 1):

σ_t² = ω + a(r_{t−1} − μ)² + bσ_{t−1}²

12.3.5 APARCH

The asymmetric power autoregressive conditional heteroskedasticity, APARCH(p, q), model was introduced by Ding, Granger and Engle (1993) and is defined as follows:

σ_t^δ = ω + Σᵢ₌₁^q aᵢ(|r_{t−i} − μ| − γᵢ(r_{t−i} − μ))^δ + Σⱼ₌₁ᵖ bⱼ σ_{t−j}^δ

where −1 < γᵢ < 1 and at least one aᵢ or bⱼ > 0. δ plays the role of a Box-Cox transformation of the conditional standard deviation σ_t, while the γᵢ reflect the so-called leverage effect. APARCH has proved very promising and is now quite widespread because it nests several other models as special cases, e.g. ARCH (δ = 2, γᵢ = 0, bⱼ = 0), GARCH (δ = 2, γᵢ = 0), TS-GARCH (δ = 1, γᵢ = 0), GJR-GARCH (δ = 2), TARCH (δ = 1) and NARCH (bⱼ = 0, γᵢ = 0). In practice, the model most generally used is an APARCH(1, 1):

σ_t^δ = ω + a(|r_{t−1} − μ| − γ(r_{t−1} − μ))^δ + bσ_{t−1}^δ

12.3.6 EGARCH

The exponential general autoregressive conditional heteroskedastic, EGARCH(p, q), model was another form of GARCH model, with the purpose of allowing negative values in the linear error variance equation.
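The GARCH(1, 1) recursion is easy to simulate. A sketch, assuming numpy; `garch11_path` is our illustrative name, and seeding σ² at the long-run variance ω/(1 − a − b) is a common convenience, not part of the model's definition.

```python
import numpy as np

def garch11_path(mu, omega, a, b, n, rng):
    """Simulate n returns from a GARCH(1,1):
    sigma_t^2 = omega + a*(r_(t-1) - mu)^2 + b*sigma_(t-1)^2,
    r_t = mu + sigma_t * z_t with z_t ~ Normal(0,1).
    sigma^2 is seeded at the long-run variance omega/(1 - a - b)."""
    sigma2 = omega / (1 - a - b)
    r = np.empty(n)
    eps = 0.0                         # previous deviation r_(t-1) - mu
    for t in range(n):
        sigma2 = omega + a * eps**2 + b * sigma2
        eps = np.sqrt(sigma2) * rng.normal()
        r[t] = mu + eps
    return r

rng = np.random.default_rng(11)
r = garch11_path(mu=0.0, omega=1e-5, a=0.08, b=0.9, n=5000, rng=rng)
```

With a + b = 0.98 the volatility is highly persistent, so the simulated path shows the clusters of calm and turbulent periods that motivated the ARCH family in the first place.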
The GARCH model imposes non-negativity constraints on the parameters aᵢ and bⱼ, while there are no such restrictions on these parameters in the EGARCH model. In the EGARCH(p, q) model, the conditional variance σ_t² is formulated as an asymmetric function of the lagged disturbances:

ln σ_t² = ω + Σᵢ₌₁^q aᵢ g(z_{t−i}) + Σⱼ₌₁ᵖ bⱼ ln σ_{t−j}²

where

g(z) = θz + λ(|z| − E[|z|])

and E[|z|] = √(2/π) when z is a standard normal variable. Again, in practice the model most generally used has p = q = 1, i.e. is an EGARCH(1, 1):

ln σ_t² = ω + a g(z_{t−1}) + b ln σ_{t−1}²

12.4 Markov Chain Models

Markov chains² comprise a number of individuals who begin in certain allowed states of the system and who may or may not randomly change (transition) into other allowed states over time. A Markov chain has no memory, meaning that the joint distribution of how many individuals will be in each allowed state depends only on how many were in each state the moment before, not on the pathways that led there. This lack of memory is known as the Markov property. Markov chains come in two flavours: continuous time and discrete time. We will look at a discrete-time process first because it is the easiest to model.

12.4.1 Discrete-time Markov chain

In a discrete-time Markov process the individuals can move between states only at set (usually equally spaced) intervals of time. Consider a set of 100 individuals in the following four marital states: 43 are single; 29 are married; 11 are separated; 17 are divorced. We write this as a vector:

{Single, Married, Separated, Divorced} = {43, 29, 11, 17}

Given sufficient time (let's say a year) there is a reasonable probability that the individuals can change state. We can construct a matrix of the transition probabilities as follows:

                           Is now:
Was:          Single    Married    Separated    Divorced
Single         0.85       0.12        0.02         0.01
Married        0          0.88        0.08         0.04
Separated      0          0.13        0.45         0.42
Divorced       0          0.09        0.02         0.89

We read this matrix row by row.

² Named after Andrey Markov (1856-1922), a Russian mathematician.
For example, it says (first row) that a single person has an 85 % chance of still being single 1 year later, a 12 % chance of being married, a 2 % chance of being separated and a 1 % chance of being divorced. Since these are the only allowed states (e.g. we haven't included "engaged", so that must be rolled up into "single"), the probabilities must sum to 100 %. Of course, we'd have to decide what a death would mean: the transition matrix could either be defined such that if a person dies they retain their marital status for this model, or we could make this a transition matrix conditional on them surviving a year. Notice that the "single" column is all 0s, except the single/single cell, because, once one is married, the only states allowed after that are married, separated and divorced. Also note that one can go directly from single to separated or divorced, which implies that during that year the individual had passed through the married state. Markov chain transition matrices describe the probability that one is in a state at some precise time, given some state at a previous time, and are not concerned with how one got there, i.e. all the other states one might have passed through. We now have the two elements of the model, the initial state vector and the transition matrix, to estimate how many individuals will be in each state after a year. Let's go through an example calculation to estimate how many people will be married in one year:

- for the single people, Binomial(43, 0.12) will be married;
- for the married people, Binomial(29, 0.88) will be married;
- for the separated people, Binomial(11, 0.13) will be married;
- for the divorced people, Binomial(17, 0.09) will be married.

Add together these four binomial distributions and we get an estimate of the number of people from our group who will be married next year.
However, the above calculation does not work when we want to look at the joint distribution of how many people will be in each state: clearly we cannot add four sets of four binomial distributions because the total must sum to 100 people. Instead, we need to use the multinomial distribution. The number of people who were single but are now {Single, Married, Separated, Divorced} equals Multinomial(43, {0.85, 0.12, 0.02, 0.01}). Applying the multinomial distribution for the other three initial states, we can take a random sample from each multinomial and add up how many are in each state, as shown in the model in Figure 12.14.

Figure 12.14 Multinomial method of performing a Markov chain model.

Let's now look at extending the model to predict further ahead in time, say 5 years. If we can assume that the probability transition matrix remains valid for that period, and that nobody in our group dies, we could repeat the above exercise 5 times - calculating in each year how many individuals are in each state and using that as the input into the next year, etc. However, there is a more efficient method. The probability that a person starting in state i is in state j after 2 years is determined by looking at the probability of the person going from state i to each state after 1 year, and then going from that state to state j in the second year. So, for example, the probability of changing from single to divorced after 2 years is

P(Single to Single) * P(Single to Divorced)
+ P(Single to Married) * P(Married to Divorced)
+ P(Single to Separated) * P(Separated to Divorced)
+ P(Single to Divorced) * P(Divorced to Divorced)

Notice how we have multiplied the elements in the first row (single) by the elements in the last column (divorced) and added them. This is the operation performed in matrix multiplication.
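The multinomial method of Figure 12.14 can be sketched in a few lines, assuming numpy (the function name `markov_step` and the matrix values, taken from the marital-status example, are for illustration): draw one multinomial per starting state and sum the results state by state.

```python
import numpy as np

def markov_step(counts, P, rng):
    """One period of the multinomial method: the individuals in each
    starting state are spread over the destination states with a
    multinomial draw, and the draws are summed column-wise."""
    return sum(rng.multinomial(n, p) for n, p in zip(counts, P))

# 1-year transition matrix; rows: Single, Married, Separated, Divorced
P = np.array([[0.85, 0.12, 0.02, 0.01],
              [0.00, 0.88, 0.08, 0.04],
              [0.00, 0.13, 0.45, 0.42],
              [0.00, 0.09, 0.02, 0.89]])
rng = np.random.default_rng(2024)
after_one_year = markov_step([43, 29, 11, 17], P, rng)
```

Unlike adding independent binomials, this construction guarantees that the simulated state counts always sum to the 100 people we started with.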
We can therefore determine the probability transition matrix over the 2-year period by simply multiplying the 1-year transition matrix by itself (using Excel's MMULT function), as in the model in Figure 12.15. When one wants to forecast T periods in advance, where T is large, performing the matrix multiplication (T − 1) times can become rather tedious, but there is some mathematics based on transforming the matrix that allows one to determine directly the transition matrix over any number of periods. ModelRisk provides some efficient means to do this: the VoseMarkovMatrix function calculates the transition matrix for any time length, and the VoseMarkovSample function goes the next step, simulating how many individuals are in each final state after some period. In this next example (Figure 12.16) we calculate the transition matrix and simulate how many individuals will be in each state after 25 years. Notice how after 25 years the probability of being married is about 45 %, irrespective of what state one started in; a similar situation occurs for separated and divorced. This stabilising property is very common and, as a matter of interest, is the basis of a statistical technique discussed briefly elsewhere in this book called Markov chain Monte Carlo. Of course, the above calculation does assume that the transition matrix for 1 year is valid to apply over such a long period (a big assumption in this case).

Figure 12.15 Multinomial method of performing a Markov chain model with time an integer > 1 unit.
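The repeated-MMULT idea is a matrix power, which one call computes directly. A sketch, assuming numpy, using the 1-year marital-status matrix from the text; `np.linalg.matrix_power` plays the role that VoseMarkovMatrix plays for integer periods.

```python
import numpy as np

# 1-year transition matrix; rows: Single, Married, Separated, Divorced
P = np.array([[0.85, 0.12, 0.02, 0.01],
              [0.00, 0.88, 0.08, 0.04],
              [0.00, 0.13, 0.45, 0.42],
              [0.00, 0.09, 0.02, 0.89]])

# raising P to the 25th power gives the 25-year transition matrix,
# equivalent to multiplying P by itself 24 times with MMULT
P25 = np.linalg.matrix_power(P, 25)

# each row still sums to 1, and the rows have converged towards a
# common long-run distribution - the stabilising property in the text
```

Inspecting `P25` shows the "married" column hovering around 0.45 whatever the starting state, matching the observation in the text.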
Figure 12.16 ModelRisk method of performing a Markov chain model with time an integer > 1 unit. The array formulae used are {=VoseMarkovMatrix(F4:I7, B11)} for the transition matrix and {=VoseMarkovSample(B4:B7, F4:I7, B11)} for the number in each final state.

12.4.2 Continuous-time Markov chain

For a continuous-time Markov process we need to be able to produce the transition matrix for any positive time increment, not just an integer multiple of the time that applies to the base transition matrix. So, for example, we might have the above marital status transition matrix for a single year but wish to know what the matrix is for half a year, or 2.5 years. There is a mathematical technique for finding the required matrix, based on converting the multinomial probabilities in the matrix into Poisson intensities that match the required probabilities. The mathematical manipulation is somewhat complex, particularly when one has to wrestle with numerical stability. The ModelRisk functions VoseMarkovMatrix and VoseMarkovSample detect when you are using non-integer time and automatically convert to the alternative mathematics. So, for example, we can have the model described above for a half-year.

12.5 Birth and Death Models

There are two strongly related probabilistic time series models called the Yule (or pure birth) and pure death models. We have certainly found them very useful in modelling numbers in a bacterial population, but they could be helpful in modelling other variables where numbers of individuals increase or decrease according to their population size.
12.5.1 Yule growth model

This is a pure birth growth model and is a stochastic analogue to the deterministic exponential growth models one often sees in, for example, microbial risk analysis. In exponential growth models, the rate of growth of a population of n individuals is proportional to the size of the population:

dn/dt = βn

where β is the mean rate of growth per unit time t. This gives the number of individuals n_t in the population after time t as

n_t = n0 exp(βt)

where n0 is the initial population size. The model is limited because it takes no account of any randomness in the growth. It also takes no account of the discrete nature of the population, which is important at low values of n. Moreover, there are no defensible statistical tests to apply to fit an exponential growth curve to observations (regression is often used as a surrogate) because an exponential growth model is not probabilistic, so no probabilistic (i.e. statistical) interpretation of data is possible. The Yule model starts with the premise that individuals have offspring on their own (e.g. by division), that they procreate independently, that procreating is a Poisson process in time and that all individuals in the population are the same. The expected number of offspring from an individual per unit time (over some infinitesimal time increment) is defined as β. This leads to the result that, after time t, an individual will have Geometric(exp(−βt)) offspring, giving a new total population of 1 + Geometric(exp(−βt)). Thus, if we start with n0 individuals, then by some later time t we will have

n_t = n0 + NegBin(n0, exp(−βt))

from the relationship

NegBin(s, p) = sum from i = 1 to s of Geometric(p)

with mean n̄_t = n0 exp(βt), corresponding to the exponential growth model. A possible problem in implementing this type of model is that n0 and n_t can be very large, and simulation programs tend to produce errors for discrete distributions like the negative binomial for large input parameters and output values.
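The negative binomial result makes the Yule model easy to simulate directly. A Python sketch with hypothetical parameter values (numpy's negative_binomial counts failures before n0 successes, which matches the parametrisation above):

```python
import numpy as np

rng = np.random.default_rng(7)

# Yule (pure birth) process: starting from n0 individuals, each
# reproducing as a Poisson process of rate beta, the population at
# time t is  n_t = n0 + NegBin(n0, exp(-beta * t)).
n0, beta, t = 50, 0.3, 5.0          # hypothetical values
p = np.exp(-beta * t)
n_t = n0 + rng.negative_binomial(n0, p, size=20_000)

print(n_t.mean())        # close to n0 * exp(beta * t), about 224.1
print(n_t.min() >= n0)   # a pure birth process never shrinks: True
```

The simulated mean matches the deterministic exponential growth curve, while the spread of `n_t` supplies the randomness the deterministic model lacks.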
ModelRisk has two time series functions to model the Yule process that work for all input values: one generates values for n_t, and VoseTimeSeriesYule10(Log10(n0), LogIncrease, t) generates values for Log10(n_t), as one often finds it more convenient to deal with logs for exponentially growing populations because of the large numbers that can be generated. LogIncrease is the number of logs (in base 10) by which one expects the population to increase per time unit. The parameters β and LogIncrease are related by

LogIncrease = Log10[exp(β)]

12.5.2 Death model

The pure death model is a stochastic analogue to the deterministic exponential death models one often sees in, for example, microbial risk analysis. Individuals are assumed to die independently and randomly in time, following a Poisson process. Thus, the time until death can be described by an exponential distribution, which has a cdf

F(t) = 1 − exp(−λt)

where λ is the expected instantaneous death rate of an individual. The probability that an individual is still alive at time t is therefore

P(alive at time t) = exp(−λt)

Thus, if n0 is the initial population, the number n_t surviving until time t follows a binomial distribution:

n_t = Binomial(n0, exp(−λt))

which has a mean of n0 exp(−λt), i.e. the same as the exponential death model. The cdf for the time until extinction t_E of the population is given by

F(t_E) = (1 − exp(−λt_E))^n0

The binomial death model offered here is an improvement over the exponential death model for several reasons: The exponential death model takes no account of any randomness in the deaths, so cannot interpret variations from an exponential line fit. The exponential death model takes no account of the discrete nature of the population, which is important at low values of n. There are no defensible statistical tests to apply to fit an exponential death curve to observations (regression is often used as a surrogate) because an exponential model is not probabilistic, so there can be no probabilistic interpretation of data.
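The binomial death model is equally direct to simulate. A sketch with hypothetical values, including the extinction-time cdf just given:

```python
import numpy as np

rng = np.random.default_rng(11)

# Binomial death model: each of n0 individuals survives to time t
# independently with probability exp(-lam * t).
n0, lam, t = 1000, 0.5, 3.0            # hypothetical values
p_survive = np.exp(-lam * t)
n_t = rng.binomial(n0, p_survive, size=20_000)

print(n_t.mean())      # close to n0 * exp(-lam * t), about 223.1

# cdf of the extinction time: all n0 individuals must have died by t.
p_extinct_by_t = (1.0 - p_survive) ** n0
print(p_extinct_by_t)  # negligible while n_t is still in the hundreds
```

Because the model is a genuine probability distribution, a likelihood for observed counts follows immediately from the binomial pmf, which is the point made in the next paragraph.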
A likelihood function is possible, however, for the death model described here. A possible difficulty in implementing this death model is that n0 and n_t can be very large, and simulation programs tend to produce errors for discrete distributions like the binomial for large input parameters and output values. ModelRisk has two time series functions to model the death model that eliminate this problem: one generates values for n_t, and VoseTimeSeriesDeath10(Log10(n0), LogDecrease, t) generates values for Log10(n_t), as one often finds it more convenient to deal with logs for bacterial populations (for example) because of the large numbers that can be involved. The LogDecrease parameter is the number of logs (in base 10) that one expects the population to decrease by per time unit. The parameters λ and LogDecrease are related by

LogDecrease = λ Log10(e)

12.6 Time Series Projection of Events Occurring Randomly in Time

Many things we are concerned about occur randomly in time: people arriving at a queue (customers, emergency patients, telephone calls into a centre, etc.), accidents, natural disasters, shocks to a market, terrorist attacks, particles passing through a bubble chamber (a physics experiment), etc. Naturally, we may want to model these over time, perhaps to figure out whether we will have enough vaccine stock, storage space, etc. The natural contender for modelling random events is the Poisson distribution (see Section 8.3), which returns the number of random events occurring in time t when λ events are expected per unit time within t. Often we might think that the expected number of events may increase or decrease over time, so we make λ a function of t, as shown by the model in Figure 12.17. A variation of this model is to take account of seasonality by multiplying the expected number of events by seasonal indices (which should average to 1).
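The trend-plus-seasonality idea can be sketched as follows. The gradient, intercept and seasonal indices below are hypothetical, standing in for the spreadsheet inputs of Figure 12.17:

```python
import numpy as np

rng = np.random.default_rng(3)

# Expected events per period as a linear trend (hypothetical values),
# multiplied by monthly seasonal indices that average to 1.
periods = np.arange(1, 37)                   # 36 months
trend = 2.0 + 0.05 * periods                 # gradient * t + intercept
season = np.array([1.3, 1.2, 1.1, 1.0, 0.9, 0.8,
                   0.7, 0.8, 0.9, 1.0, 1.1, 1.2])
assert np.isclose(season.mean(), 1.0)        # indices must average to 1

lam = trend * season[(periods - 1) % 12]     # expected events each month
counts = rng.poisson(lam)                    # one simulated path
print(counts)
```

Plotting `lam` alongside `counts`, as the figures in this section do, makes it easy to see whether the linear trend is drifting towards impossible (negative) expected values.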
In Section 8.3.7 I have discussed the Pólya and Delaporte distributions, which are counting distributions similar to the Poisson but which allow λ to be a random variable too. The Pólya is particularly helpful because, with one extra parameter, h, we can add some volatility to the expected number of events, as shown by the model in Figure 12.18. Notice the much greater peaks in the plot for this model compared with that of the previous model in Figure 12.17. Mixing a Poisson with a gamma distribution to create the Pólya is a helpful tool because we can get the likelihood function directly from the probability mass function (pmf) of the Pólya and therefore fit to historical data. If the MLE value for h is very small, then the Poisson model will be as good a fit and has one less parameter to estimate, so the Pólya model is a useful first test.

[Figure 12.18 A Pólya time series with expected intensity λ as a linear function of time and a coefficient of variation of λ = 0.3. Formulae table: C6:C55 =Gradient*B6+Intercept; D6:D55 =VosePoisson(C6).]

The linear equation used in the above two models for giving an approximate description of the relationship of the expected number with time is often quite convenient, but one needs to be careful because a negative slope will ultimately produce a negative expected value, which is clearly nonsensical (which is why it is good practice to plot the expected value together with the modelled counts, as shown in the two figures above). The more correct Poisson regression model considers the log of the expected value of the number of counts to be a linear function of time, i.e.

ln(λ_t) = β0 + β1·t + ln(e)    (12.10)

where β0 and β1 are regression parameters.
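The log-linear relationship just described can be fitted by maximum likelihood. A Python sketch with simulated (hypothetical) annual counts and constant exposure, so the ln(e) term is absorbed into the intercept; the book's models do this in Excel with Solver instead:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(9)

# Simulated annual counts with a log-linear trend (hypothetical data):
# ln(lambda_t) = b0 + b1 * t
t = np.arange(-20, 1)                  # 21 years of history, year <= 0
true_b0, true_b1 = 3.0, 0.04
y = rng.poisson(np.exp(true_b0 + true_b1 * t))

def neg_log_lik(params):
    b0, b1 = params
    lam = np.exp(b0 + b1 * t)          # lambda > 0 automatically
    return -np.sum(y * np.log(lam) - lam - gammaln(y + 1))

fit = minimize(neg_log_lik, x0=[0.0, 0.0], method="Nelder-Mead")
b0_hat, b1_hat = fit.x
print(b0_hat, b1_hat)                  # near the true 3.0 and 0.04

# Point forecasts for the next 3 years (wrap in rng.poisson to get
# full predictive distributions).
forecast = np.exp(b0_hat + b1_hat * np.arange(1, 4))
print(forecast.round(1))
```

One advantage of the log link is visible in the code: the λ > 0 constraint that Figure 12.19 imposes on Solver is satisfied automatically.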
The ln(e) term in Equation (12.10) is included for data where the amount of exposure e varies between observations; for example, if we were analysing data to determine the annual increase in burglaries across a country where our data are given for different parts of the country with different population levels, or where the population size is changing significantly (so the exposure measure e would be person-years). Where e is constant, we can simplify Equation (12.10) to

ln(λ_t) = β0 + β1·t    (12.11)

The model in Figure 12.19 fits a Pólya regression to data (year <= 0) and projects out the next 3 years of annual sports accidents, where the population is considered constant so we can use Equation (12.11).

[Figure 12.19 A Pólya regression model fitted to data and projected 3 years into the future. The LogL variable is optimised using Excel's Solver with the constraint that λ > 0.]

ModelRisk offers Poisson and Pólya regression fits for multiple explanatory variables and variable exposure levels.

12.7 Time Series Models with Leading Indicators

Leading indicators are variables whose movement has some relationship to the movement of the variable you are actually interested in. The leading indicator may move in the same or opposite direction as the variable of interest, as shown in Figure 12.20. In order to evaluate the leading indicator relationship, you will have to determine: the causal relationship; and the quantitative nature of the relationship. The causal relationship is critical. It gives a plausible argument for why the movement in the leading indicator should in some way presage the movement of the variable of interest.
It will be very easy to find apparent leading indicator patterns if you try out enough variables, but, if you can't logically argue why there should be any relationship (preferably make the argument before you do the analysis on the potential indicator variable: it's much easier to convince yourself of a causal argument when you've seen a temptingly strong statistical correlation), it's likely that the observed relationship is spurious. The quantitative nature of the relationship should come from a mixture of analysis of historic data and practical thinking. Some leading indicators will have a cumulative effect over time (e.g. rainfall as an indicator of the water available for use at a hydroelectric plant) and so need to be summed or averaged. Other leading indicators may have a shorter response time to the same, perhaps unmeasurable, causal variable as the variable in which you are interested (if the causal variable were measurable, you would use that as the leading indicator instead), and so your variable may exhibit the same pattern with a time lag. The analysis of historic data to determine the leading indicator relationship will depend largely on the type of causal relationship. Linear regression is one possible method, where one regresses historic values of the variable of interest against the lead indicator values, with either a specific lag time, if that can be causally deduced, or with a varying lag time to produce the greatest r-squared fit if one is estimating the lag time. Note that any forecast can only be made a distance into the future equal to the lag time: otherwise one needs to make a forecast of the lead indicator too. The model in Figure 12.21 provides a fairly simple example in which the historic data (used to create the left pane of Figure 12.20 below) of the variable of interest Y are compared visually with lead indicator X data for different lag periods. The closest pattern match occurs for a lag of 11 periods (Figure 12.22).
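The varying-lag regression idea can be sketched numerically. The data below are made up for illustration (a random-walk indicator and a lagged, noisy response), not the book's series; the scan regresses Y(t) on X(t − lag) for a range of lags and keeps the one with the greatest r-squared:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic illustration: Y follows the leading indicator X with an
# 11-period lag (hypothetical slope/intercept/noise values).
n = 150
x = np.cumsum(rng.normal(0.5, 2.0, n)) + 100        # leading indicator
y = np.full(n, np.nan)
y[11:] = 0.045 * x[:-11] - 0.018 + rng.normal(0, 0.05, n - 11)

# Scan candidate lags; keep the one with the best r-squared.
best_lag, best_r2 = None, -np.inf
for lag in range(1, 21):
    xx, yy = x[:n - lag], y[lag:]
    ok = ~np.isnan(yy)
    r = np.corrcoef(xx[ok], yy[ok])[0, 1]
    if r ** 2 > best_r2:
        best_lag, best_r2 = lag, r ** 2

ok = ~np.isnan(y[best_lag:])
slope, intercept = np.polyfit(x[:n - best_lag][ok], y[best_lag:][ok], 1)
print(best_lag, round(best_r2, 4))   # the scan should recover lag 11
print(round(slope, 4))               # near the true 0.045
```

As the text warns, such a scan will happily "find" a best lag in pure noise too, so the causal argument for the candidate indicator should come first.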
[Figure 12.20 Lead indicator patterns: left, lead indicator variable is positively correlated with the variable of interest; right, negatively correlated.]

[Figure 12.21 Leading indicator fit and projection model, overlaying the variable of interest Y with Y offset by 11 periods. Formulae table: =SLOPE($E$5:$E$83,$C$5:$C$83), =INTERCEPT($E$5:$E$83,$C$5:$C$83), =STEYX($E$5:$E$83,$C$5:$C$83); outputs: R-squared 0.971492, slope (m) 0.045557, intercept (c) −0.017818, SteYX (syx) 0.163501.]

[Figure 12.22 Overlay of variable of interest and lead indicator variable lagged by 10, 11 and 12 periods, showing the closest pattern correlation at 11 periods.]

A scatter plot of Y(t) against X(t − 11) shows a strong linear relationship, so a least-squares regression seems appropriate (Figure 12.23). The regression parameters are: slope = 0.04555, intercept = −0.01782, SteYX = 0.1635. (We could use the linear regression parametric bootstrap to give us uncertainty about these parameters if we wished.) The resultant model is then

Y(t) = 0.04555·X(t − 11) − 0.01782 + Normal(0, 0.1635)

which we can use to predict {Y(1) ... Y(11)}.

[Figure 12.23 Scatter plot of variable of interest observations against lead indicator observations lagged by 11 periods.]

12.8 Comparing Forecasting Fits for Different Models

There are three components to evaluating the relative merits of the various forecasting models fitted to data.
The first is to take an honest look at the data you are going to fit: do they come from a world that you think is similar to the one you are forecasting into? If not (e.g. there are fewer companies in the market now, there are stricter controls, the product for which you are forecasting sales is getting rather old and uninteresting, etc.), then consider some of the forecasting techniques I describe in Chapter 17, which are based more on intuition than mathematics and statistics. The second step is also common sense: ask yourself whether the assumptions behind the model could actually be true and why that might be. Perhaps you can investigate whether this type of model has been used successfully for similar variables (e.g. a different exchange rate, interest rate, share price, water level or hurricane frequency than the one you are modelling). In fact, I recommend that you use this as a first step in selecting which models might be appropriate for the variable you are modelling. Then you will need to evaluate statistically the degree to which each model fits the data and to compensate for the fact that a model with more parameters will have greater flexibility to fit the data but may not mean anything. Statistical techniques for model selection and comparison have improved, and the best methods now use "information criteria", of which there are three in common usage, described at the end of Section 10.3.4. The main advantage over the older log-likelihood ratio method is that the models don't have to be nested, meaning that each tested model does not need to be a simplified (some parameters removed) version of a more complex model. For ARCH, GARCH, APARCH and EGARCH you should subtract n(1 + ln[2π]), where n is the number of data points, from each of the criteria. If you fit a number of models to your data, try not to pick automatically the model with the best statistical result, particularly if the top two or three are close.
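The parameter-count penalty the criteria apply can be made concrete. The sketch below uses the standard smaller-is-better forms of AIC, BIC and AICc with hypothetical log-likelihoods (the book's Section 10.3.4 may normalise the criteria differently, so treat the exact formulas as an assumption):

```python
import numpy as np

# Standard smaller-is-better information criteria for a model with
# k parameters, n data points and maximised log-likelihood ln_l.
def criteria(ln_l, k, n):
    aic = 2 * k - 2 * ln_l
    bic = k * np.log(n) - 2 * ln_l
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)   # small-sample AIC
    return aic, bic, aicc

# Hypothetical fits: model B has a better likelihood but two more
# parameters than model A; with n = 40 the criteria arbitrate.
n = 40
aic_a, bic_a, aicc_a = criteria(ln_l=-61.0, k=2, n=n)
aic_b, bic_b, aicc_b = criteria(ln_l=-59.5, k=4, n=n)
print(aic_a, aic_b)   # 126.0 vs 127.0: the simpler model wins on AIC
print(bic_a, bic_b)   # BIC penalises the extra parameters even harder
```

Here the extra 1.5 units of log-likelihood do not buy back the two extra parameters, illustrating the warning above about not automatically picking the best-scoring model when the top candidates are close.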
Also, simulate projections out into the future and see whether the range and behaviour correspond to what you think is realistic (you can do this automatically in the time series fitting window in ModelRisk, overlaying any number of paths).

12.9 Long-Term Forecasting

By long-term forecasts I mean making projections out into the future that span more than, say, 20-30% of your historical experience. I am not a big believer in using very technical models in these situations. For a start, there should be a lot of uncertainty to the projections, but more importantly the world is ever-changing, and the key assumption you implicitly make by producing a forecast with a model fitted to historic data is that the world will carry on behaving in the same way. I know that historically I have been hopeless at predicting what my life will be like in 5 years' time: in 1985 I fully expected to be a physical oceanographer in the UK; in 1987 I'd become a qualified photographer living in New Zealand, etc. I'd fixed on being a risk analyst by 1988, but then moved to the UK, Ireland, France and Belgium. Five years ago I had no idea that our company would have grown in the way it has, or that we would have developed such a strong software capability. Try applying the same test to the world you are attempting to model. The alternative is to combine lessons learned from the past (e.g. how sensitive your sales are to the US economy) with a good look around to see how the world is changing (mergers coming up, wars starting or ending, new technology, etc.) and draw up scenarios of what the world might look like and how they would affect the variables you want to forecast. I give a number of techniques for this in Chapter 14. Now I have three kids, a partner, a nice home, a dog and an estate car, so maybe things are settling down.
Chapter 13 Modelling correlation and dependencies

13.1 Introduction

In previous chapters we have looked at building a risk analysis model and assigning distributions to various components of the model. We have also seen how risk analysis models are more complex than the deterministic models they are expanding upon. The chief reason for this increase in complexity is that a risk analysis model is dynamic. In most cases there is a potentially infinite number of possible combinations of scenarios that can be generated for a risk analysis model. We have seen in Chapter 4 that a golden rule of risk analysis is that each one of these scenarios must be potentially observable in real life. The model, therefore, must be restricted to prevent it from producing, in any iteration, a scenario that could not physically occur. One of the restrictions we must place on our model is to recognise any interdependencies between its uncertain components. For example, we may have both next year's interest rate and next year's mortgage rate represented as distributions. Figure 13.1 gives an example of two distributions modelling these interest rate and mortgage rate predictions. Clearly, these two components are strongly positively correlated, i.e. if the interest rate turns out to be at the high end of the distribution, the mortgage rate should show a correspondingly high value. If we neglect to model the interdependency between these two components, the joint probabilities of the various combinations of these two parameters will be incorrect. Impossible combinations will also be generated: for example, a value for the interest rate of 6.5% could occur with a value for the mortgage rate of 5.5%. There are three reasons why we might observe a correlation between observed data. The first is that there is a logical relationship between the two (or more) variables. For example, the interest rate statistically determines the mortgage rate, as discussed above.
The second is that there is another external factor that is affecting both variables. For example, the weather during construction of a building will affect how long it takes both to excavate the site and to construct the foundations. The third reason is that the observed correlation has occurred purely by chance and no correlation actually exists. Chapter 6 outlines some statistical confidence tests to help determine whether observed correlations are real. However, there are many examples of strong correlation between variables that would pass any tests of significance but where there is no relationship between the variables. For example, the number of personal computer users in the UK over the last 8 years and the population of Asia will probably be strongly correlated - not because there is any relationship but because both have steadily increased over that period.

[Figure 13.1 Distributions of interest and mortgage rate predictions.]

13.1.1 Explanation of dependency, correlation and regression

The terms dependency, correlation and regression are often used interchangeably, causing some confusion, but they have quite specific meanings. A dependency relationship in risk analysis modelling is where the sampled value from one variable (called the independent) has a statistical relationship that approximately determines the value that will be generated for the other variable (called the dependent). A statistical relationship has an underlying or average relationship between the variables around which the individual observations will be scattered. Its chief difference from correlation is that it presumes a causal relationship. As an example, the interest rate and mortgage rate will be highly correlated. Moreover, the mortgage rate will be in essence dependent on the interest rate, but not the other way round. Correlation is a statistic used to describe the degree to which one variable is related to another.
Pearson's correlation coefficient (also known as Pearson's product moment correlation coefficient) is given by

r = Cov(X, Y) / (σ(X)·σ(Y))

where Cov(X, Y) is the covariance between datasets X and Y, and σ(X) and σ(Y) are the sample standard deviations as defined in Chapter 6. Correlation can be considered to be a normalised covariance between the two datasets: dividing by the standard deviation of each dataset produces a unitless index between −1 and +1. The correlation coefficient is frequently used alongside a regression analysis to measure how well the regression line explains the observed variations of the dependent variable. The above correlation statistic is not to be confused with Spearman's rank order correlation coefficient, which provides an alternative, non-parametric approach to measuring the correlation between two variables. A little care is needed in interpreting covariance. Independent variables are always uncorrelated, but uncorrelated variables are not always independent. A classic, if somewhat theoretical, example is to consider the variables X = Uniform(−1, 1) and Y = X². There is a direct link between X and Y, but they have zero covariance since

Cov(X, Y) = E[XY] − E[X]E[Y]¹ (the definition) = E[X³] − E[X]E[X²]

and both E[X] and E[X³] = 0. This is one reason we look at scatter plots of data as well as calculating correlation statistics.

¹ E[ ] denotes the expectation, i.e. the mean of all values weighted by their probability.

Regression is a mathematical technique used to determine the equation that relates the independent and dependent variables with the least margin of error. If we were to plot a scatter plot of the available data, this equation would be represented by a line that passed as close as possible through the data points (see Figure 13.2). The most common technique is that of simple least-squares linear regression.
This objectively determines the straight line (Y = aX + b) such that the sum of the squares of the vertical deviations of the data points from the line is a minimum. The assumptions, mathematics and statistics relating to least-squares linear regression are provided in Section 6.3.9.

13.1.2 General comments on dependency modelling

The remainder of this chapter offers several techniques for modelling correlation and dependencies between uncertain components, with examples of where and how they are used. The sections on rank order correlation and copulas provide techniques for modelling correlation. The other sections offer techniques for dependency modelling. The analyst will need to determine whether it is important to focus on any particular correlation or dependency structure in the model. A simple way to determine this is to run two simulations, one with a zero rank order correlation and one with a +1 or −1 correlation, using two approximate distributions to define the correlated pair. If the model's results from these two simulations are significantly different, the correlation is obviously an important component of the general model. Scatter plots are an extremely useful way of visualising the form of a correlation or dependency. The common practice is to plot observed data for the independent variable (when known) on the x axis and corresponding data for the dependent variable (again, when known) on the y axis.

[Figure 13.3 Examples of dependency patterns; the panels include a fisherman's prediction of the weight of fish against experience, and sales against advertising expenditure.]

Figure 13.3 illustrates four dependency patterns that you may meet: top left, positive linear; top right, negative linear; bottom left, positive curvilinear; and bottom right, mixed curvilinear. Scatter plots also provide an excellent way of previewing a correlation pattern that you have defined in your own models.
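The uncorrelated-but-dependent example from Section 13.1.1 (X = Uniform(−1, 1), Y = X²) is easy to verify by simulation, and it shows why a scatter plot catches what the correlation statistic misses:

```python
import numpy as np

rng = np.random.default_rng(4)

# X = Uniform(-1, 1) and Y = X^2: fully dependent, yet Cov(X, Y) = 0
# because E[X^3] = E[X] = 0 by symmetry.
x = rng.uniform(-1.0, 1.0, 200_000)
y = x ** 2

cov = np.cov(x, y)[0, 1]
pearson = np.corrcoef(x, y)[0, 1]
print(round(cov, 4), round(pearson, 4))   # both essentially zero

# Yet Y is a deterministic function of X:
print(np.allclose(y, x ** 2))             # True
```

A scatter plot of these samples would show a perfect parabola, so the total dependence is obvious to the eye even though both correlation statistics are near zero.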
Most risk analysis packages allow the user to export the Monte Carlo generated values for any component in your model to the Windows clipboard or directly into a spreadsheet. The data can then be plotted in a scatter plot using the standard spreadsheet-charting facilities. The number of iterations (and therefore the number of generated data points) should be set to a value that will produce a scatter plot that fills out the low-probability areas reasonably well while avoiding overpopulation of the high-probability areas. High-resolution screens now make it reasonable to plot around 3000 data points as little dots that will show the pattern and give an impression of density quite nicely.

13.2 Rank Order Correlation

Most risk analysis software products now offer a facility to correlate probability distributions within a risk analysis model using rank order correlation. The technique is very simple to use, requiring only that the analyst nominates the two distributions that are to be correlated and a correlation value between −1 and +1. This coefficient is known as Spearman's rank order correlation coefficient. A correlation value of −1 forces the two probability distributions to be exactly negatively correlated, i.e. the X percentile value in one distribution will appear in the same iteration as the (100 − X) percentile value of the other distribution. A correlation value of +1 forces the two probability distributions to be exactly positively correlated, i.e. the X percentile value in one distribution will appear in the same iteration as the X percentile value of the other distribution. In practice, one rarely uses correlation values of −1 and +1. Negative correlation values between 0 and −1 produce varying degrees of inverse correlation, i.e. a low value from one distribution will correspond to a high value in the other distribution, and vice versa.
The closer the correlation is to zero, the looser will be the relationship between the two distributions. Positive correlation values between 0 and +1 produce varying degrees of positive correlation, i.e. a low value from one distribution will correspond to a low value in the other distribution and a high value from one distribution will correspond to a high value from the other. A correlation value of 0 means that there is no relationship between the two distributions.

13.2.1 How rank order correlation works

The rank order correlation coefficient uses the ranking of the data, i.e. what position (rank) the data point takes in an ordered list from the minimum to maximum values, rather than the actual data values themselves. It is therefore independent of the distribution shapes of the datasets and allows the integrity of the input distributions to be maintained. Spearman's ρ is calculated as

ρ = 1 − (6·ΣΔR²) / (n(n² − 1))

where n is the number of data pairs and ΔR is the difference in the ranks between data values in the same pair. This is in fact a short-cut formula for when there are few or no ties: the exact formula is discussed in Section 6.3.10.

Example 13.1

The spreadsheet in Figure 13.4 calculates Spearman's ρ for a small dataset.

[Figure 13.4 An example of the calculation of Spearman's rank order correlation coefficient: 20 data pairs are ranked, the squared rank differences are summed, and the resulting rank order correlation is 0.72.]

This correlation coefficient is symmetric in the distributions being correlated, i.e. only the difference between ranks is important, and not whether distribution A is being correlated with distribution B or the other way round.
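The short-cut formula is simple to reproduce. The sketch below applies it to 20 simulated (hypothetical) data pairs, mirroring the layout of Figure 13.4, and cross-checks against scipy's implementation:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

rng = np.random.default_rng(6)

# 20 hypothetical data pairs with a positive relationship.
a = rng.normal(100, 15, 20)
b = 0.8 * a + rng.normal(0, 5, 20)

# Short-cut formula (valid with few or no ties):
# rho = 1 - 6 * sum(dR^2) / (n * (n^2 - 1))
ra, rb = rankdata(a), rankdata(b)
n = len(a)
rho_shortcut = 1 - 6 * np.sum((ra - rb) ** 2) / (n * (n ** 2 - 1))

rho_scipy = spearmanr(a, b)[0]
print(round(rho_shortcut, 4), round(rho_scipy, 4))   # identical here
```

With continuous data there are no ties, so the short-cut and exact formulas agree to machine precision.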
In order to apply rank order correlation to a pair of probability distributions, risk analysis software has to go through several steps. Firstly, a number of rank scores equivalent to the number of iterations is generated for each distribution that is to be correlated. Secondly, these rank score lists are jumbled up so that the specified correlation is achieved between correlated pairs. Thirdly, the same number of samples is drawn from each distribution and sorted from minimum to maximum. Finally, these values are used during the simulation: the first to be used has the same ranking in the sorted list as the first value in its rank score list, and so on, until all rank scores and all generated values have been used.

13.2.2 Use, advantages and disadvantages of rank order correlation

Rank order correlation provides a very quick and easy-to-use method of modelling correlation between probability distributions. The technique is "distribution independent", i.e. it has no effect on the shape of the correlated distributions. One is therefore guaranteed that the distributions used to model the correlated variables will still be replicated. The primary disadvantage of rank order correlation is the difficulty of selecting the appropriate correlation coefficient. If one is simply seeking to reproduce a correlation that has been observed in previous data, the correlation coefficient can be calculated directly from the data using the formula in the previous section. The difficulty appears when attempting to model an expert's opinion of the degree of correlation between distributions. A rank order correlation lacks intuitive appeal, and it is therefore very difficult for experts to decide which level of correlation best represents their opinion. This difficulty is compounded by the fact that the same degree of correlation will look quite different on a scatter plot for different distribution types, e.g.
two lognormals with a 0.7 correlation will produce a different scatter pattern from two uniform distributions with the same correlation. Determining the appropriate correlation coefficient is more difficult still if the two distributions do not share the same geometry, e.g. one is normal and the other uniform, or one is a negatively skewed triangle and the other a positively skewed triangle. In such cases, the scatter plot will often show quite surprising results (Figure 13.5 illustrates some examples). Figure 13.6 shows that correlation only becomes visually evident at levels of about 0.5 or above (or about −0.5 or below for negative correlation). Producing scatter plots like this at various levels of correlation for two variables can help subject matter experts provide estimates of the levels of correlation to be applied. Another disadvantage of rank order correlation is that it ignores any causal relationship between the two distributions. It is usually more logical to think of a dependency relationship along the lines of that described in Sections 13.4 and 13.5. A further disadvantage of which most people are unaware is that an assumption about the correlation shape has already been built into the simulation software. The programming technique was originally developed in a seminal paper by Iman and Conover (1982), who used an intermediate step of translating the random numbers through van der Waerden scores. Iman and Conover found that these scores produced "natural-looking" correlations: variables correlated using van der Waerden scores produced elliptical-shaped scatter plots, while using the ranking of the variables directly produced scatter patterns that were pinched in the middle and fanned out at each end. For example, correlating two Uniform(0, 1) distributions together (the same as plotting the cdfs of any two continuous rank order correlated distributions) produces the patterns in Figure 13.7.
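The reordering mechanism behind this can be sketched compactly. This is a simplified version of the Iman and Conover idea: the actual method derives van der Waerden scores from the ranks and applies a Cholesky adjustment, whereas here correlated random normal scores stand in for them, so the achieved correlation is only approximately the target:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

rng = np.random.default_rng(8)

def rank_correlate(u, v, rho, rng):
    """Reorder samples u and v so the pairs show roughly the target
    Spearman correlation rho, leaving both marginal distributions
    untouched (a simplified sketch of the Iman-Conover approach)."""
    n = len(u)
    # Correlated normal "scores" provide the target rank pattern.
    z1 = rng.standard_normal(n)
    z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n)
    # Each sorted sample is placed according to its score's rank.
    u_out = np.sort(u)[rankdata(z1).astype(int) - 1]
    v_out = np.sort(v)[rankdata(z2).astype(int) - 1]
    return u_out, v_out

n = 5000
a = rng.lognormal(0.0, 0.5, n)        # any two marginals will do
b = rng.uniform(0.0, 1.0, n)
x, y = rank_correlate(a, b, rho=0.7, rng=rng)

print(round(spearmanr(x, y)[0], 3))          # near (a little below) 0.7
print(np.allclose(np.sort(x), np.sort(a)))   # marginals preserved: True
```

Because only the ordering changes, the lognormal and uniform marginals are reproduced exactly, which is the "distribution independent" property described above; the elliptical versus pinched scatter shapes of Figure 13.7 come from the choice of scores used in this intermediate step.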
Chapter 13 Modelling correlation and dependencies

A correlation value of +1 forces the two probability distributions to be exactly positively correlated, i.e. the X percentile value in one distribution will appear in the same iteration as the X percentile value of the other distribution. In practice, one rarely uses correlation values of -1 and +1. Negative correlation values between 0 and -1 produce varying degrees of inverse correlation, i.e. a low value from one distribution will correspond to a high value in the other distribution, and vice versa. The closer the correlation to zero, the looser the relationship between the two distributions. Positive correlation values between 0 and +1 produce varying degrees of positive correlation, i.e. a low value from one distribution will correspond to a low value from the other, and a high value from one distribution will correspond to a high value from the other. A correlation value of 0 means that there is no relationship between the two distributions.

13.2.1 How rank order correlation works

The rank order correlation coefficient uses the ranking of the data, i.e. what position (rank) each data point takes in a list ordered from the minimum to the maximum value, rather than the actual data values themselves. It is therefore independent of the distribution shapes of the datasets and allows the integrity of the input distributions to be maintained. Spearman's ρ is calculated as

ρ = 1 - 6 Σ ΔR^2 / (n(n^2 - 1))

where n is the number of data pairs and ΔR is the difference in the ranks between data values in the same pair. This is in fact a short-cut formula for use where there are few or no ties: the exact formula is discussed in Section 6.3.10.

Example 13.1

The spreadsheet in Figure 13.4 calculates Spearman's ρ for a small dataset. This correlation coefficient is symmetric about the distributions being correlated, i.e.
only the difference between ranks is important and not whether distribution A is being correlated with distribution B or the other way round.

Figure 13.4 An example of the calculation of Spearman's rank order correlation coefficient: 20 data pairs are ranked, the squared rank differences summed, and the short-cut formula returns a rank order correlation of 0.72 (the number of data pairs is counted with =COUNT(B4:B23)).

In order to apply rank order correlation to a pair of probability distributions, risk analysis software has to go through several steps. Firstly, a list of rank scores, one per iteration, is generated for each distribution that is to be correlated. Secondly, these rank score lists are jumbled up so that the specified correlation is achieved between correlated pairs. Thirdly, the same number of samples is drawn from each distribution and sorted from minimum to maximum. Finally, these values are used during the simulation: the first value to be used takes the position in its sorted list given by the first entry in its rank score list, and so on, until all rank scores and all generated values have been used.

13.2.2 Use, advantages and disadvantages of rank order correlation

Rank order correlation provides a very quick and easy-to-use method of modelling correlation between probability distributions. The technique is "distribution independent", i.e. it has no effect on the shape of the correlated distributions, so one is guaranteed that the distributions used to model the correlated variables will still be replicated. The primary disadvantage of rank order correlation is the difficulty in selecting the appropriate correlation coefficient.
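Where past data are available, the short-cut formula from Example 13.1 can be applied directly; a minimal sketch in Python (the helper names are mine, not the book's, and ties are assumed absent, as the short-cut formula requires):

```python
def ranks(values):
    """Rank from 1 (minimum) to n (maximum), assuming no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Short-cut Spearman formula: 1 - 6*sum(dR^2) / (n*(n^2 - 1))."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))
```

Perfectly co-monotonic data return +1 and perfectly anti-monotonic data return -1, matching the interpretation of the coefficient given earlier.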
If one is simply seeking to reproduce a correlation that has been observed in previous data, the correlation coefficient can be calculated directly from the data using the formula in the previous section. The difficulty appears when attempting to model an expert's opinion of the degree of correlation between distributions. A rank order correlation lacks intuitive appeal, and it is therefore very difficult for experts to decide which level of correlation best represents their opinion. This difficulty is compounded by the fact that the same degree of correlation will look quite different on a scatter plot for different distribution types, e.g. two lognormals with a 0.7 correlation will produce a different scatter pattern to two uniform distributions with the same correlation. Determining the appropriate correlation coefficient is more difficult still if the two distributions do not share the same geometry, e.g. one is normal and the other uniform, or one is a negatively skewed triangle and the other a positively skewed triangle. In such cases, the scatter plot will often show quite surprising results (Figure 13.5 illustrates some examples). Figure 13.6 shows that correlation only becomes visually evident at levels of about 0.5 or above (or about -0.5 or below for negative correlation). Producing scatter plots like these at various levels of correlation for two variables can help subject matter experts provide estimates of the levels of correlation to be applied.

Another disadvantage of rank order correlation is that it ignores any causal relationship between the two distributions. It is usually more logical to think of a dependency relationship along the lines of that described in Sections 13.4 and 13.5. A further disadvantage, of which most people are unaware, is that an assumption of the correlation shape has already been built into the simulation software.
The programming technique was originally developed in a seminal paper by Iman and Conover (1982), who used an intermediate step of translating the random numbers through van der Waerden scores. Iman and Conover found that these scores produced "natural-looking" correlations: variables correlated using van der Waerden scores produced elliptical-shaped scatter plots, while using the ranking of the variables directly produced scatter patterns that were pinched in the middle and fanned out at each end. For example, correlating two Uniform(0, 1) distributions together (the same as plotting the cdfs of any two continuous rank order correlated distributions) produces the patterns in Figure 13.7.

Figure 13.5 Examples of patterns produced by correlating different distribution types with a rank order correlation of 0.8.

Figure 13.6 Patterns produced by two normal distributions with varying degrees of rank order correlation (panels at correlation = 0, 0.2, 0.4, 0.6, 0.8 and 0.99).

Figure 13.7 Patterns produced by two Uniform(0, 1) distributions with varying degrees of rank order correlation (panels at 0.5, 0.8, 0.9 and 0.95).

Notice that the patterns are symmetric about the diagonals of Figure 13.7.
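The van der Waerden scores just mentioned are simply normal quantiles of the fractional ranks, score_i = Φ⁻¹(i/(n + 1)). A sketch of the scores themselves (an illustration only, not of the full Iman-Conover reordering algorithm):

```python
from statistics import NormalDist

def van_der_waerden_scores(n):
    """Map ranks 1..n to the i/(n+1) quantiles of a standard normal."""
    inv = NormalDist().inv_cdf
    return [inv(i / (n + 1)) for i in range(1, n + 1)]

scores = van_der_waerden_scores(9)
```

Replacing raw ranks with these scores is what pulls the induced scatter towards the elliptical shape described above.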
In particular, rank order correlation will "pinch" the variables to the same extent at each extreme. In fact there is a wide variety of different patterns that could give us the same level of rank correlation. To illustrate the point, the plots in Figure 13.8 give the same 0.9 correlation as the bottom-left pane of Figure 13.7, but are based on copulas, which I discuss in the next section.

There are times when two variables are perhaps much more correlated at one end of their distribution than the other. In financial markets, for example, we might believe that returns from two correlated stocks of companies in the same area (let's say mobile phone manufacture) are largely uncorrelated except when the mobile phone market takes a huge dive, in which case the returns are highly correlated. Then the Clayton copula in Figure 13.8 would be a much better candidate than rank order correlation.

The final problem with rank order correlation is that it is a simulation technique rather than a probability model. This means that, although we can calculate the rank order correlation between variables (ModelRisk has the VoseSpearman function to do this; it is possible in Excel but one has to create a large array to do it), and although we can use a bootstrap technique to gauge the uncertainty about that correlation coefficient (VoseSpearmanU), it is not possible to compare correlation structures statistically; for example, it is not possible to use maximum likelihood methods and produce goodness-of-fit statistics. Copulas, on the other hand, are probability models and can be compared, ranked and tested for significance.

Figure 13.8 Patterns produced by different copulas with an equivalent 0.9 rank order correlation (Frank, Clayton, T (nu = 2) and Gumbel copulas).
In spite of the inherent disadvantages of rank order correlation, its ease of use and speed of implementation make it a very practical technique. In summary, the following guidelines for using rank order correlation will help the analyst avoid problems:

- Use rank order correlation to model dependencies that have only a small impact on your model's results. If you are unsure of its impact, run two simulations: one with the selected correlation coefficient and one with zero correlation. If there is a substantial difference between the model's final results, you should choose one of the other, more precise techniques explained later in this chapter.
- Wherever possible, restrict its use to pairs of similarly shaped distributions.
- If differently shaped distributions are being correlated, preview the correlation using a scatter plot before accepting it into the model.
- If using subject matter experts (SMEs) to estimate correlations, use charts at various levels of correlation to help the expert determine the appropriate level.
- Consider using copulas if the correlation is important or shows an unusual pattern.
- Avoid modelling a correlation where there is neither a logical reason nor evidence for its existence.

This last point is a contentious issue, since many would argue that it is safer to assume a 100% positive or negative correlation (whichever increases the spread of the model output) rather than zero. In my view, if there is neither a logical reason to believe that the variables are related in some way nor any statistical evidence to suggest that they are, one would be unjustified in assuming high levels of correlation.
On the other hand, using levels of correlation throughout a model that maximise the spread of the output, and other correlation levels that minimise the spread of the output, does provide us with bounds within which we know the true output distribution(s) must lie. This technique is sometimes used in project risk analysis, for example, where for the sake of reassurance one would like to see the most widely spread output feasible given the available data and expert estimates. I suspect that using such pessimistic correlation coefficients proves helpful because it in some general way compensates for the tendency we all have to be overconfident about our estimates (of the time to complete the project's tasks, for example, which narrows the distribution of possible outcomes for model outputs like the completion date), as well as quietly recognising that there are elements running through a whole project, like management competence, team efficiency and the quality of the initial planning, that it would be uncomfortable to model explicitly.

13.2.3 Uncertainty about the value of the correlation coefficient

We will often be uncertain about the level of rank order correlation to apply, and will be guided by either available data or expert opinion. In the latter case, determining an uncertainty distribution for the correlation coefficient is simply a matter of asking a subject matter expert to estimate a feasible correlation coefficient: perhaps just minimum, most likely and maximum values, which can then be fed into a PERT distribution, for example. The expert can be helped in providing these three values by being shown scatter plots of various degrees of correlation for the two variables of interest. In the case where data are available on which the estimate of the level of correlation is to be based, we need some objective technique for determining a distribution of uncertainty for the correlation coefficient.
Classical statistics and the bootstrap both provide techniques that accomplish this. In classical statistics, the uncertainty about the correlation coefficient, given the dataset ({xi}, {yi}), i = 1, ..., n, was shown by R. A. Fisher to be as follows (Paradine and Rivett, 1964, pp. 208-210):

ρ = tanh(Normal(tanh⁻¹(r), 1/√(n - 3)))

where tanh is the hyperbolic tangent, tanh⁻¹ is the inverse hyperbolic tangent, r is the rank correlation of the set of observations and ρ is the true rank correlation between the two variables.

The bootstrap technique that applies here is the same as that usually used to estimate a statistic, except that we have to sample the data in pairs rather than individually. Figure 13.9 illustrates a spreadsheet where this has been done. Note that the formula that calculates the rank is modified from the Excel function RANK(), since that function assigns the same lowest-value rank to all data values that are equal: in calculating ρ we require the ranks of tied data values to equal the average of the ranks that the tied values would have had if they had been infinitesimally separated.

Figure 13.9 Model to determine the uncertainty of a correlation coefficient using the bootstrap. Key formulae: ranks are calculated as =RANK(B4,B$4:B$28)+(COUNTIF(B$4:B$28,B4)-1)/2; the rank correlation of the 25 sorted data pairs as {=1-6*SUM((E4:E28-D4:D28)^2)/(25*(25^2-1))}; bootstrap samples are drawn with =VoseDuniform(B$4:B$28) and paired via =VLOOKUP(F4,B$4:C$28,2); and the Fisher comparison in cell I30 is =TANH(VoseNormal(ATANH(E29),1/SQRT(22))).
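The average-rank adjustment for ties can be sketched in plain Python (an equivalent of the RANK()+COUNTIF() correction above; the function name is mine):

```python
def average_ranks(values):
    """Rank values from 1 (minimum) upwards, giving tied values the
    average of the ranks they would span if infinitesimally separated."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        lowest = ordered.index(v) + 1            # first rank of the tied group
        highest = lowest + ordered.count(v) - 1  # last rank of the tied group
        ranks.append((lowest + highest) / 2)
    return ranks
```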
So, for example, the dataset {1, 2, 2, 3, 3, 3, 4} would be assigned the ranks {1, 2.5, 2.5, 5, 5, 5, 7}. The 2s have to share the ranks 2 and 3, so are allocated the average, 2.5. The 3s have to share the ranks 4, 5 and 6, so are allocated the average, 5.

The Duniform distribution has been used to sample randomly from the {xi} values, and the VLOOKUP() function has been used to sample the {yi} values, ensuring that the data are sampled in their proper pairs. For this reason, the data pairs have to be ranked in ascending order by {xi} so that the VLOOKUP function will work correctly. Note in cell I30 that the uncertainty distribution for the correlation coefficient is also calculated, for comparison, using the traditional statistics technique above. While the results from the two techniques will not normally be in exact agreement, the difference is not excessive and they will return almost exactly the same mean values. The ModelRisk function VoseSpearmanU simulates the bootstrap estimate directly.

If one uses rank order correlation, uncertainty about correlation coefficients can only be included by running multiple simulations. As discussed previously (Chapter 7), simulating uncertainty and randomness together produces a single combined distribution that expresses quite well the total indeterminability of our output, but without separating the degree due to uncertainty from that due to randomness. However, it is not possible to do this with uncertainty about rank order correlation coefficients, as the scores used to simulate the correlation between variables are generated before the simulation starts. If one is intending to simulate uncertainty and randomness together, a representative value for the correlation needs to be determined, which is not easy because of the difficulty of assessing the effect of a correlation coefficient on a model's output(s).
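Fisher's classical result quoted above can also be simulated directly; a sketch, assuming the observed rank correlation of 0.72 and n = 25 from Figure 13.9 (the function name is mine):

```python
import math
import random

def fisher_correlation_sample(r_observed, n):
    """One draw from Fisher's uncertainty distribution:
    tanh(Normal(atanh(r), 1/sqrt(n - 3)))."""
    z = random.gauss(math.atanh(r_observed), 1 / math.sqrt(n - 3))
    return math.tanh(z)

random.seed(7)
draws = [fisher_correlation_sample(0.72, 25) for _ in range(2000)]
mean_rho = sum(draws) / len(draws)
```

The mean of such draws, or a conservative percentile, could serve as the single representative value just discussed.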
The reader may choose to use the mean of the uncertainty distribution for the correlation coefficient, or may choose to play safe and pick a value at an extreme, say the 5th or 95th percentile, whichever is the more conservative for the purposes of the model.

13.2.4 Rank order correlation matrices

An important benefit of rank order correlation is that one can apply it to a set of several variables together. In this case, we must construct a matrix of correlation coefficients. Each distribution must clearly have a correlation of 1.0 with itself, so the elements on the top-left to bottom-right diagonal are all 1.0. Furthermore, because the formula for the rank order correlation coefficient is symmetric, as explained above, the matrix elements are also symmetric about this diagonal.

Example 13.2

Figure 13.10 shows a simple example for a three-phase engineering project. The cost of each phase is considered to be strongly correlated with the amount of time it takes to complete (0.8). The construction time is moderately correlated (0.5) with the design time: it is considered that the more complex the design, the longer it will take to finish the design and construct the machine, etc.

There are some restrictions on the correlation coefficients that may be used within the matrix. For example, if A and B are highly positively correlated and B and C are also highly positively correlated, A and C cannot be highly negatively correlated. For the mathematically minded, the restriction is that the matrix can have no negative eigenvalues. In practice, the risk analysis software should determine whether the values entered are valid and either alter your entries to the closest allowable values or, at least, reject the entered values and post a warning.
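For three variables the no-negative-eigenvalues restriction has a simple closed form; a sketch (my own formulation, not from the book), where a, b and c are the off-diagonal correlations A:B, A:C and B:C of a unit-diagonal 3 x 3 matrix:

```python
def valid_3x3_correlations(a, b, c):
    """A unit-diagonal 3x3 correlation matrix [[1,a,b],[a,1,c],[b,c,1]]
    has no negative eigenvalues iff its determinant is non-negative
    (the 2x2 principal minors are automatically fine when |a|,|b|,|c| <= 1)."""
    if max(abs(a), abs(b), abs(c)) > 1:
        return False
    determinant = 1 + 2 * a * b * c - a * a - b * b - c * c
    return determinant >= 0
```

For example, A:B = 0.9 and B:C = 0.9 combined with A:C = -0.9 fails the test, matching the restriction described above.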
While correlation matrices suffer from the same drawbacks as those outlined for simple rank order correlation, they are nonetheless an excellent way of producing a complex multiple correlation that would be laborious and quite difficult to achieve otherwise.

Figure 13.10 An example of a rank order correlation matrix:

                    Design  Design  Constr.  Constr.  Testing  Testing
                    cost    time    cost     time     cost     time
Design cost         1       0.8     0        0        0        0
Design time         0.8     1       0        0.5      0        0.4
Construction cost   0       0       1        0.8      0        0
Construction time   0       0.5     0.8      1        0        0.4
Testing cost        0       0       0        0        1        0.8
Testing time        0       0.4     0        0.4      0.8      1

Adding uncertainty to a correlation matrix

Uncertainty about the correlation coefficients in a correlation matrix can easily be added when there are data available. The technique requires a repeated application of the bootstrap procedure described in the previous section for determining the uncertainty about a single parameter.

Example 13.3

Figure 13.11 provides a spreadsheet model where a dataset for three variables is used to determine the correlation coefficient between each pair of variables. By using the bootstrap method, we automatically retain the correlation between the uncertainty distributions of the correlation coefficients.
Cells C32:E32 are the outputs of this model, providing the uncertainty distributions for the correlation coefficients for A:B, B:C and A:C. The exact formula has been used to calculate the correlation coefficients because the number of ties can be large compared with the number of data pairs when there are few data pairs.

Figure 13.11 Model to add uncertainty to a correlation matrix. Key formulae: bootstrap triplets are drawn with =VoseDuniform(B$4:B$13) and paired via =VLOOKUP(E4,B$4:D$13,2) and =VLOOKUP(E4,B$4:D$13,3); ranks use =RANK(E4,E$4:E$13)+(COUNTIF(E$4:E$13,E4)-1)/2; and the output cells are =F28/SQRT(C28*D28), =G28/SQRT(D28*E28) and =H28/SQRT(C28*E28).

ModelRisk offers two functions, VoseCorrMatrix and VoseCorrMatrixU, that will construct the correlation matrix of the data and generate uncertainty about those matrix values respectively, as shown in the model in Figure 13.12. The functions are particularly useful when you have a large data array because they use less memory and spreadsheet space, and calculate far faster than attempting the entire analysis in Excel.

Figure 13.12 Using VoseCorrMat and VoseCorrMatU to calculate a rank order correlation matrix from data.

Note that, since the uncertainty distributions for the correlation coefficients in a correlation matrix are correlated together, the traditional statistics technique by Fisher cannot be used here.
Fisher's technique described the uncertainty about an individual correlation coefficient, but not its relationship to other correlation coefficients in a matrix, whereas the bootstrap captures this automatically.

13.3 Copulas

Quantifying dependence has long been a major topic in finance and insurance risk analysis and has led to an intense interest in, and development of, copulas, but they are now enjoying increasing popularity in other areas of risk analysis where one has considerable amounts of data. The rank order correlation employed by most Monte Carlo simulation tools is certainly a meaningful measure of dependence but is very limited in the patterns it can produce, as discussed above. Copulas offer a far more flexible method for combining marginal distributions into multivariate distributions, and an enormous improvement in capturing the real correlation pattern. Understanding the mathematics is a little more onerous but is not all that important if you just want to use copulas as a correlation tool, so feel free to skim over the equations a bit. In the following presentation of copulas I have used the formulae for a bivariate copula, to keep them reasonably readable, and show graphs of bivariate copulas, but keep in mind that the ideas extend to multivariate copulas too. I start off with an introduction to some copulas from a theoretical viewpoint, and then look at how we can use them in models. Cherubini et al. (2004) is a very thorough and readable exploration of copulas and gives algorithms for their generation and estimation, some of which we use in ModelRisk.

A d-dimensional copula C is a multivariate distribution with uniformly distributed marginals U(0, 1) on [0, 1]. Every multivariate distribution F with marginals F1, F2, ..., Fd can be written as

F(x1, x2, ..., xd) = C(F1(x1), F2(x2), ..., Fd(xd))

for some copula C (this is known as Sklar's theorem).
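Sklar's theorem also works in the generative direction: feed the copula's U(0, 1) outputs through the inverse marginal cdfs to recover a multivariate distribution with those marginals. A sketch with an exponential marginal, chosen purely for illustration because its inverse cdf has a closed form:

```python
import math

def exponential_inverse_cdf(u, rate):
    """F^-1(u) for an Exponential(rate) marginal: -ln(1 - u)/rate."""
    return -math.log(1 - u) / rate

# e.g. the median of an Exponential(rate = 2) marginal:
x = exponential_inverse_cdf(0.5, 2.0)
```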
Because the copula of a multivariate distribution describes its dependence structure, we can use measures of dependence that are copula based. The concordance measures Kendall's tau and Spearman's rho, as well as the coefficient of tail dependence, can, unlike the linear correlation coefficient, be expressed in terms of the underlying copula alone. I will focus particularly on Kendall's tau, as the relationships between the value of Kendall's tau (τ) and the parameters of the copulas discussed in this section are quite straightforward. The general relationship between Kendall's tau of two variables X and Y and the copula C(u, v) of the bivariate distribution function of X and Y is

τ = 4 ∫∫ C(u, v) dC(u, v) - 1

This relationship gives us a tool for fitting a copula to a dataset: we simply determine Kendall's tau for the data and then apply a transformation to get the appropriate parameter value(s) for the copula being fitted.

13.3.1 Archimedean copulas

An important class of copulas, because of the ease with which they can be constructed and the nice properties they possess, are the Archimedean copulas, which are defined by

C(u, v) = φ⁻¹(φ(u) + φ(v))

where φ is the generator of the copula, which I will explain later. The general relationship between Kendall's tau and the generator φ(t) of an Archimedean copula for a bivariate dataset can be written as

τ = 1 + 4 ∫₀¹ (φ(t)/φ′(t)) dt

For example, the relationship between Kendall's tau and the Clayton copula parameter α for a bivariate dataset is given by

τ = α/(α + 2)

The definition doesn't extend to a multivariate dataset of n variables because there will be multiple values of tau, one for each pairing. However, one can calculate tau for each pair and use the average, i.e.

τ̄ = (2/(n(n - 1))) Σ(i<j) τij

There are three Archimedean copulas in common use: the Clayton, Frank and Gumbel. These are discussed below.

The Clayton copula

The Clayton copula is an asymmetric Archimedean copula exhibiting greater dependence in the negative tail than in the positive, as shown in Figure 13.13.
This copula is given by

C(u, v) = (u^(-α) + v^(-α) - 1)^(-1/α)

and its generator is

φ(t) = (t^(-α) - 1)/α

where α ∈ [-1, ∞)\{0}, meaning α is greater than or equal to -1 but cannot take the value zero. The relationship between Kendall's tau and the Clayton copula parameter α for a bivariate dataset is given by

α = 2τ/(1 - τ)

The model in Figure 13.14 generates a Clayton copula for four variables.

Figure 13.13 Plot of two marginal distributions using 3000 samples taken from a Clayton copula with α = 8.

Figure 13.14 Model to generate values from a Clayton(alpha) copula.

The Gumbel copula

The Gumbel copula (a.k.a. the Gumbel-Hougaard copula) is an asymmetric Archimedean copula, exhibiting greater dependence in the positive tail than in the negative, as shown in Figure 13.15. This copula is given by

C(u, v) = exp(-((-ln u)^α + (-ln v)^α)^(1/α))

and its generator is φ(t) = (-ln t)^α, where α ∈ [1, ∞). The relationship between Kendall's tau and the Gumbel copula parameter α for a bivariate dataset is given by

α = 1/(1 - τ)

The model in Figure 13.16 shows how to generate the Gumbel copula.

Figure 13.15 Plot of two marginal distributions using 3000 samples taken from a Gumbel copula with α = 5.

The Frank copula

The Frank copula is a symmetric Archimedean copula, exhibiting an even, sausage-type correlation structure, as shown in Figure 13.17. This copula is given by

C(u, v) = -(1/α) ln(1 + (e^(-αu) - 1)(e^(-αv) - 1)/(e^(-α) - 1))

and its generator is

φ(t) = -ln((e^(-αt) - 1)/(e^(-α) - 1))

where α ∈ (-∞, ∞)\{0}. The relationship between Kendall's tau and the Frank copula parameter α for a bivariate dataset is given by

τ = 1 - (4/α)(1 - D₁(α))

where

D₁(α) = (1/α) ∫₀^α t/(e^t - 1) dt

is a Debye function of the first kind. There is a simple way to generate values for the Frank copula using the logarithmic distribution, as shown by the model in Figure 13.18.
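Putting the pieces together for the Clayton case: estimate Kendall's tau from data by counting concordant and discordant pairs, transform it with α = 2τ/(1 - τ), then sample (u, v) pairs by inverting the conditional distribution of v given u. This sketch uses the standard conditional-inversion method rather than the spreadsheet recursion of Figure 13.14, and all function names are mine:

```python
import random

def kendall_tau(xs, ys):
    """Kendall's tau via an O(n^2) count of concordant/discordant pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def clayton_alpha(tau):
    """Invert tau = alpha/(alpha + 2) to get the Clayton parameter."""
    return 2 * tau / (1 - tau)

def clayton_pair(alpha):
    """Sample (u, v) from a bivariate Clayton copula (alpha > 0) by
    inverting the conditional distribution C(v | u)."""
    u = random.random()
    t = random.random()  # uniform used to invert the conditional cdf
    v = (u ** (-alpha) * (t ** (-alpha / (1 + alpha)) - 1) + 1) ** (-1 / alpha)
    return u, v

alpha = clayton_alpha(0.8)  # tau = 0.8, as in Figure 13.13 (alpha = 8)
random.seed(3)
pairs = [clayton_pair(alpha) for _ in range(1000)]
```

With tau = 0.8 the points cluster tightly in the lower-left tail, reproducing the Figure 13.13 pattern.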
Figure 13.16 Model to generate values from a Gumbel(theta) copula.

13.3.2 Elliptical copulas

Elliptical copulas are simply the copulas of elliptically contoured (or elliptical) distributions. The most commonly used elliptical distributions are the multivariate normal and Student t-distributions. The key advantage of elliptical copulas is that one can specify different levels of correlation between the marginals; the key disadvantages are that elliptical copulas do not have closed-form expressions and are restricted to having radial symmetry. For elliptical copulas, the relationship between the linear correlation coefficient ρ and Kendall's tau is given by

τ = (2/π) arcsin(ρ)

The normal and Student t-copulas are described below.

The normal copula

The normal copula (Figure 13.19) is an elliptical copula given by

C(u, v) = Φ_ρ(Φ⁻¹(u), Φ⁻¹(v))

where Φ⁻¹ is the inverse of the univariate standard normal distribution function, Φ_ρ is the bivariate standard normal distribution function with linear correlation coefficient ρ, and ρ is the copula parameter.

Figure 13.17 Plot of two marginal distributions using 3000 samples taken from a Frank copula with α = 8.
The relationship between Kendall's tau and the normal copula parameter ρ is given by

ρ(X, Y) = sin(πτ/2)

The normal copula is generated by first generating a multinormal distribution with mean vector {0} and the required correlation matrix, and then transforming these values into percentiles of a Normal(0, 1) distribution, as shown by the model in Figure 13.20.

Figure 13.18 Model to generate values from a Frank(theta) copula.

Figure 13.19 Graph of 3000 samples taken from a bivariate normal copula with parameter ρ = 0.95.

Figure 13.20 Model to generate values from a normal copula.

The Student t-copula (or just "the t-copula")

The Student t-copula is an elliptical copula defined as

C(u, v) = t_(ν,ρ)(t_ν⁻¹(u), t_ν⁻¹(v))

where ν (the number of degrees of freedom) and ρ (the linear correlation coefficient) are the parameters of the copula, t_(ν,ρ) is the bivariate Student t distribution function and t_ν⁻¹ is the inverse of the univariate Student t distribution function. When the number of degrees of freedom ν is large (around 30 or so), the copula converges to the normal copula, just as the Student distribution converges to the normal. But for a limited number of degrees of freedom the behaviour of the copulas is different: the t-copula has more points in the tails.

Figure 13.21 Graph of 3000 samples taken from a bivariate Student t-copula with ν = 2 degrees of freedom and parameter ρ = 0.95.

As in the normal case (and for all other elliptical copulas), the relationship between Kendall's tau and the Student t-copula parameter ρ is given by

ρ(X, Y) = sin(πτ/2)

Fitting a Student t-copula is slightly more complicated than fitting the normal. We first estimate τ and then, starting with ν = 2, determine the likelihood of observing the dataset. We then repeat the exercise for ν = 3, 4, ..., 50 and find the combination that produces the maximum likelihood.
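The normal-copula generation just described (correlated normals, then a transform to their cumulative probabilities) can be sketched for the bivariate case; this uses the two-variable Cholesky shortcut rather than a full multinormal, and the function name is mine:

```python
import math
import random
from statistics import NormalDist

def normal_copula_pair(rho):
    """Draw (u, v) from a bivariate normal copula with parameter rho."""
    z1 = random.gauss(0, 1)
    z2 = rho * z1 + math.sqrt(1 - rho * rho) * random.gauss(0, 1)
    cdf = NormalDist().cdf
    return cdf(z1), cdf(z2)  # map each normal to its percentile in (0, 1)

random.seed(11)
pairs = [normal_copula_pair(0.95) for _ in range(500)]
```

With rho = 0.95 the scatter of (u, v) reproduces the tight elliptical band of Figure 13.19.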
For ν values of 50 or more there will be no discernible difference from using a fitted normal copula, which is simpler to generate values from. Generating values from a Student copula requires determining the Cholesky decomposition of the covariance matrix, as shown by the model in Figure 13.22.

Figure 13.22 Model to generate values from a Student copula. [The model combines the Cholesky decomposition of the covariance matrix {=VoseCholesky(B4:F8)}, a column of =VoseNormal(0,1) draws, a chi-squared draw and the array formula {=MMULT(B12:F16,B19:B23)}.]

13.3.3 Modelling with copulas

In order to make use of copulas in your risk analysis, you need three things:

1. A method to estimate the copula's parameter(s), which has been described above.
2. A model that generates the copula, described above.
3. Functions that use the inversion method to generate values from the marginal distributions to which you wish to apply the copula. Excel offers a very limited number of such functions, but they are notoriously inaccurate and unstable. You can derive many other inversion functions from the F(x) equations in Appendix III.

Let's say that we have a dataset of 1000 joint observations for each of five variables, we fit the data to gamma distributions for each variable and we correlate them together with a normal copula. In principle one could do all these things in Excel, but it would be a pretty large spreadsheet, so I am going to compromise a little. (By the way, I am using gamma distributions here so I can make a model that works with Excel, though be warned that Excel's GAMMAINV is one of the most unstable.)
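The Figure 13.22 recipe for the Student copula — Cholesky-decompose the covariance matrix, correlate standard normal draws, scale by a chi-squared draw and map through the Student t distribution function — might look like this in Python (a sketch, not the book's model):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
nu, rho, n = 2, 0.95, 3000
cov = np.array([[1.0, rho], [rho, 1.0]])

L = np.linalg.cholesky(cov)              # Cholesky decomposition of the covariance matrix
z = L @ rng.standard_normal((2, n))      # correlated standard normals
s = rng.chisquare(nu, size=n)            # one chi-squared draw per iteration
t_vals = z * np.sqrt(nu / s)             # bivariate Student t with nu degrees of freedom
u = stats.t.cdf(t_vals, df=nu)           # t-copula values in (0, 1)
```

Note that a single chi-squared draw is shared by both components in each iteration; that shared scaling is what puts the extra points in the tails.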
Figure 13.23 A model using copulas. [The 1000 joint observations for the five variables, their data statistics (mean, variance), the gamma parameter estimates (alpha, beta), the fitted normal copula and the resulting correlated gamma variables.]

(Excel's inversion functions are BETAINV, CHIINV, FINV, GAMMAINV, LOGINV, NORMINV, NORMSINV and TINV.)

In the model in Figure 13.23 I am also fitting a marginal gamma distribution to each variable using the method
of moments: usually you would want to use maximum likelihood, but this involves optimisation, so the method of moments is easier to follow, and with 1000 data points there won't be much difference. I am also foregoing the rather elaborate calculations needed to estimate the normal copula's covariance matrix by using Excel's CORREL as an approximation. I have used ModelRisk's normal copula function because it takes up less space, and I have already shown you how to generate this copula above. The model in Figure 13.24 is the equivalent with ModelRisk.

Figure 13.24 The same model as in Figure 13.23, but now in ModelRisk. [Key formulae: {=VoseCopulaMultiNormalFit($B$3:$F$1002,FALSE)} for the fitted normal copula and =VoseGammaFit(B3:B1002,I4) for each marginal.]

13.3.4 Making a special case of bivariate copulas

In the standard formulation for copulas there is no distinction between a bivariate (only two marginals) and a multivariate (more than two marginals) copula. However, we can manipulate a bivariate copula in ways that greatly extend its applicability.
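For reference, the Figures 13.23/13.24 workflow — gamma marginals fitted by the method of moments, correlation estimated CORREL-style, and a normal copula driving the fitted marginals — can be sketched outside the spreadsheet. The dataset below is synthetic (an assumption standing in for the book's 1000 joint observations, and reduced to two variables):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Stand-in for the 1000 joint observations (true marginals Gamma(2, 3))
data = rng.gamma(shape=2.0, scale=3.0, size=(1000, 2))

# Method-of-moments gamma fit: alpha = mean^2/variance, beta = variance/mean
m = data.mean(axis=0)
v = data.var(axis=0, ddof=1)
alpha, beta = m**2 / v, v / m

# Approximate the copula correlation matrix with linear correlation (CORREL-style)
corr = np.corrcoef(data, rowvar=False)

# Normal copula -> percentiles -> invert the fitted gamma marginals
z = rng.multivariate_normal(np.zeros(2), corr, size=1000)
u = stats.norm.cdf(z)
correlated = stats.gamma.ppf(u, a=alpha, scale=beta)
```

`stats.gamma.ppf` plays the role of the (much less stable) GAMMAINV inversion step in the Excel version.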
Sometimes, when creating a model, one is interested in a particular copula (say the Clayton copula) but with greater dependence in the positive tail than in the negative (a Clayton copula has greater dependence in the negative tail than in the positive; see Figure 13.13 above). For a bivariate copula it is possible to change the direction of the copula by calculating 1 − X, where X is one of the copula outputs. For example, with a Clayton copula with α = 8 in {A1:A2}:

B1 =1-A1
B2 =1-A2

A scatter plot of B1:B2 is now as in Figure 13.25. ModelRisk offers an extra parameter to allow control over the possible directional combinations. For Clayton and Gumbel copulas there are four possible directions, but for the Frank there are just two possibilities, since it is symmetric about its centre. The plots in Figures 13.26 and 13.27 illustrate the four possible bivariate Clayton copulas (1000 samples) with parameter α = 15 and the two possible bivariate Frank copulas (1000 samples) with parameter α = 21. Estimation of which direction gives the closest fit to data simply requires that one repeat the fitting methods described above, calculate the likelihood of the data for each direction and select the direction with the maximum likelihood. ModelRisk has bivariate copula functions that do this directly, returning either the parameters of the fitted copula or values generated from a fitted copula.

Figure 13.25 Graph of 3000 samples taken from a bivariate Clayton(8) copula with both directions reversed.

13.3.5 An empirical copula

In spite of the extra flexibility over rank order correlation afforded by the copulas I have introduced in this chapter, you can see that they still rely on a symmetrical relationship between the variables: draw a line between (0, 0) and (1, 1) and you get a symmetric pattern about that line (assuming you didn't alter the copula direction). Unfortunately, real-world variables tend to have other ideas.
As risk analysts, we put ourselves in a difficult situation if we try to squeeze data into a model that just doesn't fit. An empirical copula gives us a possible solution. Provided we have a good number of observations, we can bootstrap the ranks of the data to construct an approximation to an empirical copula, as the model in Figure 13.28 demonstrates. The model uses the empirical estimate rank/(n + 1), described in Section 10.2, for the quantile that should be associated with a value in a set of n data points. The VoseStepUniform distribution simply picks at random an integer value between 1 and the number of observations (1000). This method is very general and will replicate any correlation structure that the data show. It will be rather slow in Excel when you have large datasets, because each RANK function passes through the whole array of data for a variable to determine its rank; it would be more efficient to use the VoseRank array function, which takes far fewer passes through the data. However, the main drawback to this method occurs when we have relatively few observations. For example, if we have just nine observations, the empirical copula will only generate the values {0.1, 0.2, ..., 0.9}, so our model will only generate between the 10th and 90th percentiles of the marginal distributions. This problem can be corrected by applying some order statistics thinking along the lines of Equations 10.4 and 10.5. The ModelRisk function VoseCopulaData encapsulates that thinking. In the model in Figure 13.29 there are just 21 observations, so any correlation structure is only vaguely known. The plots in Figure 13.30 show how VoseCopulaData performs.

Figure 13.26 The four directional possibilities for a bivariate Clayton copula.

The large grey dots are the data and the small dots are 3000 samples from the empirical copula: notice that the copula extends over
(0, 1) for all variables and fills in the areas between the observations, with the greatest density concentrated around the observations.

Figure 13.27 The two directional possibilities for a bivariate Frank copula.

Figure 13.28 Constructing an approximate empirical copula from data. [The ranks of the joint observations are converted to quantiles, and a StepUniform distribution selects a whole row of joint ranks at random.]

13.4 The Envelope Method

The envelope method offers a more flexible technique for modelling dependencies that is both intuitive and easy to control. It models the logic whereby the value of the independent variable statistically determines the value of the dependent variable. Its drawback is that it requires considerably more effort than rank order correlation, and it is therefore really only used where the dependency relationship will have a significant effect on the final outcome of the model.

13.4.1 Using the envelope method for approximate modelling of straight-line correlation in observed data

A large number of observed correlations can be quite adequately modelled using a straight-line relationship, as already discussed. If this is the case, the following techniques can prove very valuable. However, you may sometimes come across a dependency relationship that is curvilinear and/or has a vertical spread that changes across the range of the independent variable. The bottom graphs in Figure 13.3 illustrate curvilinear relationships.
The following section offers some advice on how the envelope method can still be used to model such relationships.

Figure 13.29 Constructing an empirical copula with few data using ModelRisk. [The 21 joint observations are passed to the array function {=VoseCopulaData($B$3:$D$23)}.]

Using a uniform distribution

The envelope method first requires that all available data are plotted in a scatter plot, with the independent variable on the x axis and the dependent variable on the y axis. Bounding lines are then determined that contain the minimum and maximum observed values of the dependent variable for all values of the independent variable.

Example 13.4

Data on the time that 40 participants took to practise making a wicker basket were negatively correlated with the time they took to make the basket in a subsequent test, shown in Figure 13.31. Two straight lines, drawn by eye, neatly contain all of the data points: a minimum line of y = -0.28x + 57 and a maximum line of y = -0.42x + 88. The data look roughly uniformly distributed vertically between these two lines for all values of the x axis. We could therefore predict the test time for any value of the practice time as follows:

Test time = Uniform(-0.28 * Practice time + 57, -0.42 * Practice time + 88)

Figure 13.30 Scatter plots of random samples from the empirical copula fitted to the data in Figure 13.29.

Figure 13.31 Setting boundary lines for the envelope method of modelling dependencies. [The minimum line min = -0.28x + 57 and maximum line max = -0.42x + 88 drawn over the practice time (hours) data.]

Figure 13.32 Dependency model using the envelope method with a uniform distribution.

We have thus defined a uniform distribution for the test time that varies according to the practice time taken.
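Example 13.4's uniform envelope is easy to sketch outside the spreadsheet. Here is a Python version using the two bounding lines and a Triangle(0, 20, 60) practice time (NumPy's triangular generator stands in for the spreadsheet distribution):

```python
import numpy as np

rng = np.random.default_rng(2)

# Practice time: Triangle(0, 20, 60)
practice = rng.triangular(0, 20, 60, size=5000)

# Bounding lines from Example 13.4
lo = -0.28 * practice + 57          # minimum line
hi = -0.42 * practice + 88          # maximum line

# Test time = Uniform(minimum line, maximum line), element-wise per iteration
test_time = rng.uniform(lo, hi)
```

At a practice time of 30 hours the bounds work out to Uniform(48.6, 75.4).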
If we believe that the practice time taken by future workers is Triangle(0, 20, 60), we can use this dependency model to generate the distribution of test times, as illustrated in the spreadsheet of Figure 13.32. Consider the Triangle(0, 20, 60) generating a value of 30 in one iteration (see Figure 13.33). The equation for the minimum test time produces a value of -0.28 * 30 + 57 = 48.6. The equation for the maximum test time produces a value of -0.42 * 30 + 88 = 75.4. Thus, for this iteration, the value for the test time will be generated from a Uniform(48.6, 75.4) distribution.

The above example is a little simplistic. Using a uniform distribution to model the dispersion between the minimum and maximum lines obviously gives equal weighting to all values within the range. It is quite simple to extend this technique to using a triangular or normal distribution in place of the uniform approximation, both of which provide a more realistic central tendency.

Figure 13.33 Illustration of how the dependency model of Figure 13.32 works. [The dependent uniform distribution sits between the maximum and minimum lines at the value generated for Practice time: Triang(0, 20, 60).]

Using a triangular distribution

Employing a triangular distribution requires that, in addition to the minimum and maximum lines, we also provide the equation of a line that defines the most likely value of the dependent variable for each value of the independent variable. The triangular distribution is still a fairly approximate modelling tool, so it is quite reasonable to draw a line through the points of greatest vertical density. Alternatively, you may prefer to find the least-squares fit line through the available data. All professional spreadsheet programs now offer the facility to find this line automatically, making the task very simple. A third option is to say that the most likely value lies midway between the minimum and maximum.
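The triangular variant can reuse the Example 13.4 bounding lines. This sketch takes the third option just listed (most likely value midway between the bounds) purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(8)

practice = rng.triangular(0, 20, 60, size=5000)
lo = -0.28 * practice + 57          # minimum line (Example 13.4)
hi = -0.42 * practice + 88          # maximum line
ml = (lo + hi) / 2                  # "midway" choice for the most likely line

# Dependent model: Triangle(minimum line, most likely line, maximum line)
test_time = rng.triangular(lo, ml, hi)
```

In practice the most likely line would more often come from the points of greatest vertical density or a least-squares fit, as described above.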
Example 13.5

Figures 13.34 and 13.35 provide an illustration of the envelope method with triangular distributions.

Figure 13.34 Dependency model using the envelope method with a triangular distribution.

Figure 13.35 Illustration of how the dependency model of Figure 13.34 works.

Using a normal distribution

This option involves running a least-squares regression analysis and finding the equation of the least-squares line and the standard error of the y-estimate, Syx. The Syx statistic is the standard deviation of the vertical distances of each point from the least-squares line. Least-squares regression assumes that the error of the data about the least-squares line is normally distributed. Thus, if y = ax + b is the equation of the least-squares line, we can model the dependent distribution as y = Normal(ax + b, Syx).

Example 13.6

Figure 13.36 provides an illustration of the envelope method with normal distributions.

Comparison of the uniform, triangular and normal methods

Figure 13.37 compares how the three envelope methods behave. The graphs on the left cross-plot a Triangle(0, 20, 60) for the practice time (x axis) against the resulting test time (y axis). The graphs on the right show histograms of the resulting test time distributions. The uniform method produces a scatter plot that is vertically evenly distributed and strongly bounded. Its test time histogram has the flattest shape, with the widest "shoulders" of the three methods. The triangular method produces a scatter plot that has a vertical central tendency and is also strongly bounded. Its histogram is the most peaked of the three methods, producing the smallest standard deviation. The normal method produces a scatter plot that has a vertical central tendency but that is unbounded.
This will generally be a closer approximation to a plot of available data. The histogram has the widest range of the three methods. Using the normal distribution has two advantages over the other two methods: the equation of the line and the standard deviation are both calculated directly from the available data and don't involve any subjective estimation; and the unbounded nature of the normal distribution gives generated values the opportunity to fall outside the range of the observed values. This second point may help ensure that the range of the dependent distribution is not underestimated.

Figure 13.36 Using the normal distribution to model a dependency relationship. [Regression line y = -0.4594x + 74.51 with Syx = 8.16; the dependent normal distribution sits about the line at each practice time.]

Finally, it is important to be sure that the formula you develop will be valid over the entire range of values that are to be generated for the two variables. For example, the normal formula can potentially generate negative values for test time. It could, however, be mathematically restricted to prevent a negative tail, for example by using an IF(test_time < 0, 0, test_time) statement.

13.4.2 Using the envelope method for non-linear correlation observed from available data

One may come across a correlation relationship that cannot be adequately modelled using a straight-line fit, as in the examples of Section 13.4.1. However, with a little extra work, the techniques described above can be adapted to model most relationships. The first stage is to find the best curvilinear line that fits the data. Microsoft Excel, for example, offers a choice of automatic line fitting: linear, logarithmic, polynomial (up to sixth order), power and exponential. Several of these fitted lines can be overlaid on the data to help determine the most appropriate equation.
The second stage is to use the equation of the selected line to determine the predicted values of the dependent variable for each value of the independent variable. The differences between the observed and predicted values of the dependent variable (i.e. the error terms) are then calculated and cross-plotted against the independent variable. The third stage is to determine how these error terms should be modelled. Any of the three techniques described in Section 13.4.1 could be used. The final stage is to combine the equation of the best-fit line with the distribution for the error term.

Figure 13.37 Comparison of the results of the envelope method of modelling dependency using uniform, triangular and normal distributions.

Example 13.7

Data on the amount of money a cosmetic company spends on advertising the launch of a new product are compared with the volume of initial orders it receives (Figure 13.38) and cross-plotted in Figure 13.39. Clearly, the relationship is not linear: an example of the law of diminishing returns. The best-fit line is determined to be logarithmic: y = 1374.8 * LN(x) - 10713. The error terms appear to have approximately the same distribution across the whole range of advertising budget values. Since the distribution of error terms appears to have a greater concentration around zero, we might assume that they are normally distributed and calculate their standard deviation (= 126 from Figure 13.38).
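The four stages of Example 13.7 can be sketched end to end in Python. The data here are synthetic — generated from the fitted equation plus Normal(0, 126) noise, standing in for the observations of Figure 13.38:

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical (budget, order) data scattered about Example 13.7's logarithmic fit
budget = rng.uniform(9000, 25000, size=200)
order = 1374.8 * np.log(budget) - 10713 + rng.normal(0, 126, size=200)

# Stages 1-2: fit y = a*ln(x) + b and compute the error terms
a, b = np.polyfit(np.log(budget), order, 1)
errors = order - (a * np.log(budget) + b)

# Stage 3: model the errors as Normal(0, s), s = their standard deviation
s = errors.std(ddof=1)

# Stage 4: combined model, equivalent to Normal(a*ln(x) + b, s)
def total_initial_order(x, rng):
    return rng.normal(a * np.log(x) + b, s, size=np.shape(x))
```

With real data the fitted `a`, `b` and `s` would of course come out near, but not equal to, the generating values.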
The final equation for the total initial order can then be written as

Total initial order = 1374.8 * LN(Advertising budget) - 10713 + Normal(0, 126)

or, equivalently,

Total initial order = Normal(1374.8 * LN(Advertising budget) - 10713, 126)

Figure 13.38 Analysis of data and error terms for a curvilinear regression for Example 13.7. [Fourteen (advertising budget, initial order) pairs with their observed differences from the prediction; formulae: D4:D17 =C4-(1374.8*LN(B4)-10713), standard deviation =STDEV(D4:D17).]

Figure 13.39 The best-fitting non-linear correlation for the data of Example 13.7.

13.4.3 Using the envelope method to model expert opinion of correlation

It is very difficult to get an intuitive feel for rank order correlation coefficients, even when one is familiar with probabilistic modelling. It is therefore recommended that the more intuitive envelope method be employed for modelling an expert's opinion of a dependency where that dependency is likely to have a large impact.

The technique involves the following steps:

- Discuss with the expert the logic of how he or she perceives the relationship between the two variables to be correlated. Review any available data.
- Determine the independent and dependent variables. If the causal relationship is unclear, select either to be the independent variable according to which will be easiest.
- Define the range of the independent variable and determine its distribution (using a technique from Chapter 9 or 10).
- Select several values for the independent variable. These values should include the minimum and maximum and a couple of strategic points in between.
- Ask the expert his or her opinion of the minimum, most likely and maximum values for the dependent variable should each of these selected values of the independent variable occur. I often prefer to ask for the practical minimum and maximum.
- Plot these values on a scatter diagram and find the best-fit lines through the three sets of points (minima, most likely values and maxima). Check that the expert agrees that the plot is consistent with his or her opinion.
- Use the equations of these best-fit lines in a triangular or PERT distribution to define the dependent variable.

Example 13.8

Figure 13.40 illustrates an example where the expert is defining the relationship between a bank's average mortgage rate and the number of new mortgages it will sell. The expert has given her opinion of the practical minimum, most likely and practical maximum values of the number of new mortgages for four values of the mortgage rate, as shown in Table 13.1. She has defined the practical minimum and maximum to mean, for her, that there is only a 5% chance that the number of mortgages will be below or above those values respectively.

Figure 13.40 An example of the use of the envelope method to model an expert's opinion of a dependency relationship or correlation. [Best-fit lines through the elicited minimum, most likely and maximum points, plotted over mortgage rates from 6% to 14%.]

Table 13.1 Data from expert elicitation. [Columns: mortgage rate (%); minimum, most likely and maximum number of new mortgages. The values are not legible in this copy.]

This technique has the advantage of being very intuitive. The expert is asked questions that are both meaningful and easy to think about. It also has the advantage of avoiding the need to define the distribution shape for the dependent variable: the shape will be dictated by its relationship to the independent variable.

13.4.4 Adding uncertainty in the envelope method

It is a relatively simple matter to add uncertainty into the envelope method.
If data exist from which to develop the dependency relationship, one can use the bootstrap method or traditional statistics to give uncertainty distributions for the least-squares fit parameters. Uncertainty about the boundaries can be included by simply looking at extreme possibilities for the minimum and maximum boundaries on y, as well as the best-guess lines.

13.5 Multiple Correlation Using a Look-Up Table

There may be times when it is necessary to model the simultaneous effect of an external factor on several parameters within a model. An example is the effect of poor weather on a construction site. The times taken to do an archaeological survey of the land, dig out the foundations, put in the form work, build the foundations, construct the walls and floors and assemble the roof could all be affected by the weather to varying degrees. A simple method of modelling such a scenario is to use a spreadsheet look-up table.

Example 13.9

Figure 13.41 illustrates the example above, showing the values for one particular iteration. The model works as follows: Cells D5:D10 list the estimates of the duration of each activity if the weather is normal. The look-up table F4:J10 lists the percentages by which the activities will increase or decrease owing to the weather conditions. Cell D13 generates a value for the weather from 1 to 5 using a discrete distribution that reflects the relative likelihood of the various weather conditions.
Figure 13.41 Using a look-up table to model multiple dependencies. [Base and revised estimates for six activities (archaeology, dig foundations, form work, lay foundations, walls and floors, lay roofing) under a weather index running from 1 (very poor) to 5 (very good). Formulae: D5:D10 =VoseTriangle(3,4,6), =VoseTriangle(9,11,13), etc.; E5:E10 =D5*(1+HLOOKUP(D$13,F$4:J$10,B5)); D13 =VoseDiscrete({1,2,3,4,5},{2,5,4,3,2}); E11 =SUM(E5:E10).]

Cells E5:E10 add the appropriate percentage change for that iteration to the base estimate time by looking it up in the look-up table. Cell E11 adds up all the revised durations to obtain the total construction time.

It is a simple matter to include uncertainty in this technique. One needs simply to add uncertainty distributions for the magnitude of each effect (in this case, the values in cells F5:J10). A little care is needed if the uncertainty distributions overlap for an activity. So, for example, if we used a PERT(30%, 40%, 50%) uncertainty distribution for the parameter in cell F5 and a PERT(20%, 28%, 35%) uncertainty distribution for the parameter in cell G5, we could be modelling a simulation where very poor weather increases the archaeological digging time by 31% but poor weather increases it by 33%. Using high levels of correlation between the uncertainty distributions of effect size across a task will remove this problem quite efficiently, and reflects that errors in estimating the effect (in this case of weather) will probably be similar for each effect size.

Chapter 14 Eliciting from Expert Opinion

14.1 Introduction

Risk analysis models almost invariably involve some element of subjective estimation.
It is usually impossible to obtain data from which to determine accurately the uncertainty of all of the variables within the model, for a number of reasons:

- The data have simply never been collected in the past.
- The data are too expensive to obtain.
- Past data are no longer relevant (new technology, changes in the political or commercial environment, etc.).
- The data are sparse, requiring expert opinion "to fill in the holes".
- The area being modelled is new.

The uncertainty in subjective estimates has two components: the inherent randomness of the variable itself and the uncertainty arising from the expert's lack of knowledge of the parameters that describe that variability. In a risk analysis model these uncertainties may or may not be distinguished, but both types of uncertainty should at least be accounted for in the model. The variability is best included by assuming some sort of stochastic model, and the uncertainty is then included in the uncertainty distributions for the model parameters.

When insufficient data are available to specify the uncertainty of a variable completely, one or more experts will usually be consulted to provide their opinion of the variable's uncertainty. This chapter offers guidelines to help the analyst model the experts' opinions as accurately as possible. I will start by discussing sources of bias and error that the analyst will encounter when collecting subjective estimates. We then look at a number of techniques used in the modelling of probabilistic estimates, and particularly at the use of various types of distribution. The analyst is then shown how to employ brainstorming sessions to ensure that all of the available information relevant to the problem is disseminated among the experts and the uncertainty of the problem openly discussed. Finally, we look at methods for eliciting expert opinion in one-to-one interviews with the analyst.
Before delving into the techniques of subjective estimation, I would like the reader to consider the following two points, which have been the downfall of many a model I have been asked to evaluate.

Firstly, the most significant subjective estimate in a model is often the design of the structure of the model itself. It is surprising how often the structure of a model evades criticism while the figures within it are given all the scrutiny. Before committing to a specific model structure, it is recommended that the analyst seeks comment from other interested parties as to its validity. In turn, this action will greatly enhance the analyst's chances of having the model's results accepted and of receiving cooperation in determining the input uncertainties. Good analysts should take this stage very seriously and promote an environment in which it is possible to provide open criticism of their work.

The second point is that analysts should not take it upon themselves to provide all of the subjective assessments in a model. This sounds painfully obvious, but it still astounds me how many analysts believe that they can estimate all or most of the variables within their model by themselves, without consulting others who are closer to the particular problem.

14.2 Sources of Error in Subjective Estimation

Before looking at the techniques for eliciting distributions from an expert, it is very useful to have an understanding of the biases that commonly occur in subjective estimation. To introduce this subject, Section 14.2.1 describes two exercises I run in my risk analysis training seminars, which the reader might find educational to conduct in his or her own organisation. In each exercise the class members have their own PCs and risk analysis software to help them with their estimates. Section 14.2.2 summarises the sources of heuristic errors and biases: that is, errors produced by the way people mentally approach the task of parameter estimation.
Finally, Section 14.2.3 looks at other factors that may cause inaccuracy in the experts' estimates.

14.2.1 Class experiments on estimating

This section looks at two estimating exercises I regularly use in my training seminars on risk analysis modelling. Their purpose is to highlight some of the thought processes (heuristics) people use to produce quantitative estimates. The reader should consider the observations from these exercises in conjunction with the points raised in Section 14.2.2.

Class estimating exercise 1

Each member of the class is asked to provide practical minimum, most likely and practical maximum estimates for a number of quantities (usually eight). The class is instructed that the minimum and maximum should be as close as possible to each other, such that they are 90% confident that the true value falls between them. The class is encouraged to ask questions if anything is unclear. The quantities being estimated are obscure enough that the class members will not have an exact knowledge of their values, but hopefully familiar enough that they can have a go at estimating them. The questions are changed to be relevant to the country in which the seminar is run. Examples of these quantities are:

- the distance from Oxford to Edinburgh along main highway routes in kilometres;
- the area of the United Kingdom in square kilometres;
- the mass of the Earth in metric tonnes;
- the length of the Nile in kilometres;
- the number of pages in the December Vogue UK magazine;
- the population of Scranton, USA;
- the height of K2, Kashmir, in metres;
- the deepest ocean depth in metres.

Figure 14.1 How to draw up class estimates from exercises 1 and 2 on a blackboard. [Estimates a to j of the number of pages in the October 1995 UK Cosmopolitan magazine, plotted as ranges against the actual value.]

Each member of the class fills out a form giving the three values for each quantity.
When everyone has completed their forms, I get the class to pick one of these quantities, e.g. the length of the Nile. I then question each member of the class to find out the minimum and the maximum, i.e. the total range of all of the estimates they have made. On the blackboard, I draw up a plot of each class member's three-point estimate, as illustrated in Figure 14.1, and then superimpose the true value. There is almost invariably an expression of surprise at the true value. Sometimes, after I have drawn all of the estimates up on the blackboard, I will ask if any of the class wishes to change their estimate before I reveal the true value. Some will choose to do so, but this rarely increases their chance of encompassing the true value. I will often repeat this process for four or five of the measurements to collect as many of the lessons to be learned from the exercise as possible. Now, if the class members were perfectly calibrated, there would be a 90 % chance (i.e. the confidence level defined above) that each true value would lie within their minimum to maximum range. By "calibrated" I mean that their perceptions of the precision of their knowledge were accurate. If there are eight quantities to be estimated, the number that fall within their minimum to maximum range (their score for this exercise) can be estimated by a Binomial(8, 90 %) distribution, as shown in Figure 14.2. A host of interesting observations invariably comes out of this exercise. The underlying reasons for these observations and those of the following exercise are summarised in Section 14.2.2: In the hundred or so seminars in which I have performed this exercise, I have very rarely seen a score higher than 6. From Figure 14.2 we can see that there is only a 4 % chance that anyone would score 5 or less if they were perfectly calibrated.
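The 4 % figure can be checked directly from the Binomial(8, 90 %) distribution: if each of the eight ranges independently has a 90 % chance of capturing the true value, the probability of capturing five or fewer is the lower tail of that binomial. A quick sketch (Python here, though any tool with a binomial function will do):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# A perfectly calibrated estimator: 8 questions, 90 % capture probability each
p_5_or_less = sum(binomial_pmf(k, 8, 0.9) for k in range(6))
print(f"P(score <= 5) = {p_5_or_less:.3f}")  # about 0.038, i.e. roughly 4 %
```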
If we take the average score for all members of the class and assume the distribution of scores to be approximately binomial, we can estimate the real probability encompassed by their minimum to maximum range. The mean of a binomial distribution is np, where n is the number of trials (in this case, 8) and p is the probability of success (here, the probability of falling between the minima and maxima). The average individual score for the whole class is usually around 3, giving a probability p of 3/8 = 37.5 %. In other words, where they were providing a minimum and maximum for which they believed there was a 90 % chance of the quantity falling between those values, there was in fact only about a 37 % chance. One reason for this "overconfidence" (i.e. the estimated uncertainty is much smaller than the real uncertainty) is anchoring, discussed in Section 14.2.2. Figure 14.3 shows the distribution for the largest class for which I have run this exercise (and the only class for which I kept the results).

[Figure 14.2 Binomial(8, 90 %) distribution for forecasting test scores.]

[Figure 14.3 Example of scores produced by a large class in the estimating exercise.]

The estimators often confuse the units (e.g. miles instead of kilometres, kilograms instead of tonnes), resulting in a gross error. In estimating the population of Scranton, some estimators provide a huge maximum estimate. Since most people have never heard of Scranton, it makes sense that it has a smaller population than London, New York, etc., but some people ignore this obvious deduction and offer a maximum that has no logical basis (their estimation is strongly affected by the fact that they have never heard of Scranton rather than any logic they could apply to the problem). When the class discusses the quantities, they can usually agree on a logic for their estimation.
If estimators are very sure of their quantity, they may nonetheless provide an unrealistically large range given their knowledge ("better to be safe") or, more commonly, provide just slightly too narrow a range (resulting in a protest when I don't award them a correct answer!). I once asked a class in New Zealand to estimate the area of their country. A gentleman from their Met Office asked if that was at low or high tide, to the amusement of us all. He knew the answer precisely, but the true value fell outside his range because he had not known the precise conversion factor between acres and square kilometres and had made insufficient allowance for that uncertainty. If offered the choice of a revision to their estimates after I have drawn them all on the board, those that change will usually gravitate to any grouping of the others' estimates or to the estimate of an individual in the group whose opinion is highly valued. These actions often do not get them closer to the correct answer. This observation has encouraged me to avoid asking for distribution estimates during brainstorming sessions (see Section 14.4). In many cases, people who have given a vast range to their estimates (to howls of laughter from the others) are the only ones to get it inside their range. People attending my seminars are almost always computer literate, but it is surprising how many have little feel for numbers and offer estimates that could not possibly be correct. Faced with a quantity that seems impossible to quantify at first, the estimator can often arrive at a reasonable estimate by being encouraged either to break the quantity down into smaller components or to make a comparison with other quantities. For example, the mass of the Earth could be estimated by first estimating the average density of rock and then multiplying it by an estimate of the volume of the Earth (requiring an estimate of its radius or circumference).
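That decomposition can be carried a step further by attaching a three-point estimate to each component and simulating, so the answer is itself a range rather than a single guess. The figures below are hypothetical class-style inputs, not recommended values (note that the Earth's average density is well above that of surface rock because of its iron core, so the density range here is deliberately generous):

```python
import math
import random

random.seed(1)

# Hypothetical three-point guesses: radius 6000-6400-7000 km, density 3-5.5-8 t/m^3
masses = []
for _ in range(20000):
    radius_m = random.triangular(6.0e6, 7.0e6, 6.4e6)    # args: (low, high, mode)
    density = random.triangular(3000.0, 8000.0, 5500.0)  # kg/m^3
    volume = (4.0 / 3.0) * math.pi * radius_m**3
    masses.append(volume * density / 1000.0)             # metric tonnes

masses.sort()
p5, p95 = masses[int(0.05 * len(masses))], masses[int(0.95 * len(masses))]
print(f"90 % range: {p5:.2e} to {p95:.2e} tonnes")
```

With these inputs the 90 % range comfortably brackets the accepted value of about 5.97e21 tonnes, illustrating how decomposition turns an "impossible" question into a defensible estimate.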
Occasionally, this method has come up with some huge errors where the estimator has confused the formula for the volume of a sphere with that for its area, etc. Very occasionally, individuals lacking confidence will refuse to read out their opinions to the class. Sometimes, estimators will provide a set of answers without really understanding the quantity they are estimating (e.g. not knowing that K2 is a mountain, the second highest in the world). Note that the person in question did not seek clarification, even after being encouraged to do so. This "shyness" seems to be much more common in some nationalities than others. This exercise can legitimately be criticised on several points:

1. The class members are asked to estimate quantities that they have no real knowledge of, and their score is therefore not reflective of their ability to estimate the quantities that would be required of them in their work.
2. In most real-life problems, the quantity being estimated does not have a fixed known value but is itself uncertain.
3. In real-life problems, if the estimator has provided a range that was small but just missed the true value, that estimate would still be more useful than another estimate with a much wider range but that included the true value.
4. In real-life problems, estimators would presumably check formulae and conversion factors that they were unsure of.

The scores should not be taken very seriously (I don't keep a record of the results). The exercise is simply a good way to highlight some of the issues concerned in estimating. A more realistic exercise would be to compare probabilistic estimates from an expert for real problems with the values that were eventually observed. Of course, such an exercise could take many months or years to complete.
Class estimating exercise 2

The class is grouped in pairs and asked to give the same three-point estimate, as used for the above exercise, of the total weight (mass) of the members of the class in kilograms, including myself, and our total height in metres. While they are estimating, I go round the class and ask each member quietly for their own measurements. At the end of the exercise, I draw up the estimates as in Figure 14.1 and superimpose the true value. Then we discuss how each group produced its estimates. The following points generally come out. Three estimating techniques are usually used by the class:

1. Produce a three-point estimate of the distributions of height and mass for individuals in the class and multiply by the number of people in the class. This logic is incorrect since it ignores the central limit theorem, which states that the spread of the sum of a set of n variables is proportional to √n, not n. It generally manages to encompass the true result but with a very wide (and therefore inaccurate) range.
2. Produce a three-point estimate of each individual in the class and add up the minima to get the final-estimate minimum, add up the most likely values to get the final-estimate most likely and add up the maxima to get the final-estimate maximum. Again, this is incorrect since it ignores the central limit theorem and therefore produces too wide a range.
3. Produce a three-point estimate of each individual in the class and then run a simulation to add them up. Take the 5 %, mode and 95 % values of the simulation result as the final three-point estimate. This generally has the narrowest range but is still quite likely to encompass the true value.

There is often a dominant person in a pair who takes over the whole estimating, either because that person is very enthusiastic or more familiar with the software or because the other person is a bit laid back or quiet. This, of course, loses the value of being in pairs.
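The gap between methods 2 and 3 above is easy to demonstrate. The sketch below uses a hypothetical class of 20 people who all receive the same three-point weight estimate; method 2's range grows in proportion to n, while the simulated sum's 90 % range grows only with √n:

```python
import random

random.seed(7)

n_people = 20
a, b, c = 55.0, 75.0, 110.0   # hypothetical per-person weight estimate (kg)

# Method 2: add the minima, modes and maxima directly
naive_min, naive_mode, naive_max = n_people * a, n_people * b, n_people * c

# Method 3: simulate the sum of 20 independent triangular estimates
totals = sorted(
    sum(random.triangular(a, c, b) for _ in range(n_people))  # args: (low, high, mode)
    for _ in range(10000)
)
sim_p5, sim_p95 = totals[500], totals[9500]

print(f"method 2 range:      {naive_min:.0f} to {naive_max:.0f} kg")
print(f"method 3 90 % range: {sim_p5:.0f} to {sim_p95:.0f} kg")
```

The simulated 90 % range comes out several times narrower than the naive added-extremes range, which is exactly the central limit theorem effect the text describes.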
The estimators often forget to exclude themselves from their uncertainty estimates. They have given me their measurements, so they should only assign uncertainty to the others' measurements and then add their own measurements to the total. If the central limit theorem corrections are applied to the violating estimates, the class scores average out at about 1.4 compared with the 1.8 it should have been (i.e. 2 × 90 %). In other words, their minimum to maximum range, which was supposed to have a 90 % probability of including the true value, actually had about a 70 % probability.

14.2.2 Common heuristic biases and errors

The analyst should bear in mind the following heuristics that the expert may employ when attempting to provide subjective estimates and that are potential sources of systematic bias and errors. These biases are explained in considerably more detail in Hertz and Thomas (1983) and in Morgan and Henrion (1990) (the latter includes a very comprehensive list of references).

Availability

This is where experts use their recollection of past occurrences of an event to provide an estimate. The accuracy of their estimates is dictated by their ability to remember past occurrences of the event or how easily they can imagine the event occurring. This may work very well if the event is a regular part of their life, e.g. how much they spend on petrol. It also works well if the event is something that sticks in their mind, e.g. the probability of having a flat tyre. On the other hand, it can produce poor estimates if it is difficult for the experts to remember past occurrences of the event: for example, they may not be able confidently to estimate the number of people they passed in the street that day since they would have no interest in noting each passer-by. Availability can produce overestimates of frequency if the experts can remember past occurrences very clearly because of the impact they had on them.
For example, if a computer manager was asked how often her mainframe had crashed in the last two years, she might well overestimate the frequency because she could remember every crash and the crises they caused but, because of the clarity of her recollection ("it seems like only yesterday"), include some crashes that happened well over two years ago and therefore overestimate the frequency as a result. The availability heuristic is also affected by the degree to which we are exposed to information. For example, one might consider that the chance of dying in a motoring accident was much higher than dying from stomach cancer, because car crashes are always being reported in the media and stomach cancer fatalities are not. On the other hand, an older person may have had several acquaintances who have died from stomach cancer and would therefore offer the reverse opinion.

Representativeness

One type of bias is the erroneous belief that the large-scale nature of uncertainty is reflected in small-scale sampling. For example, in the National Lottery, many would say I had no chance of winning if I selected the consecutive numbers 16, 17, 18, 19, 20 and 21. The lottery numbers are randomly picked each week, so it is believed that the winning numbers should also exhibit a random pattern, e.g. 3, 11, 15, 21, 29 and 41. Of course, both sets of numbers are actually equally likely. I once reviewed a paper that noted that, out of 200 houses fitted with a new type of gas supply piping and tested over a period of a year and a half, one of those houses suffered a gas leak due to a rat gnawing through the pipe. It concluded that there was a 1:300 chance of a "rodent attack" per house per year. What should the answer have been? A second type of representativeness bias is where people concentrate on an enticing detail of the problem and forget the overall picture.
In a frequently cited paper by Kahneman and Tversky, described in Morgan and Henrion (1990), subjects in an experiment were asked to determine the probability of a person being an engineer on the basis of a written description of that person. If they were given a bland description that gave no clue to the person's profession, the answer given was usually 50:50, despite being told beforehand that, of the 100 described people, 70 were lawyers and 30 were engineers. However, when the subjects were asked what probability they would give if they had no description of the person, they said 30 %, illustrating that they understood how to use the information but had just ignored it.

Adjustment and anchoring

This is probably the most important heuristic of the three. Individuals will usually begin their estimate of the distribution of uncertainty of a variable with a single value (usually the most likely value) and then make adjustments for its minimum and maximum from that first value. The problem is that these adjustments are rarely sufficient to encompass the range of values that could actually occur: the estimators appear to be "anchored" to their first estimated value. This is certainly one source of overconfidence and can have a dramatic impact on the validity of a risk analysis model.

14.2.3 Other sources of estimating inaccuracy

There are other elements that may affect the correct assessment of uncertainty, and the analyst should be aware of them in order to avoid unnecessary errors.

Inexpert expert

The person nominated (wrongly) as being able to provide the most knowledgeable opinion occasionally actually has very little idea. Rather than referring the analyst on to another person more expert in the problem, that person may try to provide an opinion "to be helpful", even though that opinion is of little real value. The analyst, seeing the inexpertness of the interviewee, should seek an alternative opinion, although the inexpertness may not become apparent until later.
Culture of the organisation

The environment within which people work may sometimes affect their estimating. Sales people will often provide unduly optimistic estimates of future sales because of the optimistic culture within which they work. Managers may offer high estimates of running costs because, if they achieve a lower operating cost, their organisation will view them favourably. The analyst should try to be aware of any potential conflict and seek to eliminate it through cross-checking with data and other people in the organisation.

Conflicting agendas

Sometimes the expert will have a vested interest in the values that are submitted to a model. In one model I developed, managers were deliberately providing hugely optimistic growth rate predictions to me because, in the organisation they worked for, it could aid their individual empire building. In another, I was offered very optimistic estimates of completion time and costs for a project because, if that project were given approval, the person in question would become the project's manager with a big wage increase to match. Lawyers may offer a low estimate of the cost of litigation because, if they get the brief, they can usually increase the fees later. The analyst must be aware of such conflicting agendas and seek a second, disinterested opinion.

Unwillingness to consider extremes

The expert will frequently find it difficult, or be unwilling, to envisage circumstances that would cause a variable to be extremely low or high. The analyst will often have to encourage the development of such extreme scenarios in order to elicit an opinion that realistically covers the entire possible range. This can be done by the analyst dreaming up some examples of extreme circumstances and discussing them with the expert.

Eagerness to say the right thing

Occasionally, interviewees will be trying to provide the answer they think the analyst wants to hear.
For this reason, it is important not to ask questions that are leading and never to offer a value for the expert to comment on. For example, if I said "How long do you think this task will take? Twelve weeks? More? Less?" I could well get an answer nearer to 12 weeks than if I had simply said "How long do you think this task will take?".

Units used in the estimation

People are frequently confused between the magnitudes of units of measurement. An older (or English) person may be used to thinking of distances in miles and liquid volumes in (UK) gallons and pints. If the model uses SI units, the analyst should let the experts describe their estimates in the units in which they are comfortable and convert the figures afterwards.

Expert too busy

People always seem to be busy and under pressure. A risk analyst coming to ask a lot of difficult questions may not be very welcome. The expert may act brusquely or give the whole process lip service. Obvious symptoms are when the expert offers oversimplistic estimates like X ± Y % or minimum, most likely and maximum values that are equally spaced for all estimated variables. The solution to such problems is to get top management visibly to support the development of the risk model, ensuring that employees are given the message that this work is a priority.

Belief that the expert should be quite certain

Experts may perceive that assigning a large uncertainty to a parameter would indicate a lack of knowledge and thereby undermine their reputation. The expert may need to be reassured that this is not the case. An expert should have a more precise understanding of a parameter's true uncertainty and may, in fact, appreciate that the uncertainty could be greater than a layperson would have expected.
14.3 Modelling Techniques

This section describes a range of techniques, including the role of various types of probability distribution, that are useful in eliciting expert opinion. I have only included those techniques that have worked for me, so the reader will find some omissions when comparing with other risk analysis texts.

14.3.1 Disaggregation

A key technique in eliciting distributions of opinion is to disaggregate the problem sufficiently well that experts can concentrate on estimating something that is tangible and easy to envisage. For example, it will generally be more useful to ask experts to break down their company's revenue into logical components (like region, product, subsidiary company, etc.) rather than to estimate the total revenue in one go. Disaggregation allows the expert and analyst to recognise dependencies between components of the total revenue. It also means that the risk analysis result will be less critically dependent on the estimate of each model component. Aggregating the estimates of the various revenue components will produce a more complex and accurate distribution than could ever have been achieved by directly estimating the sum. The aggregation will also take care of the effects of the central limit theorem automatically - something that is extremely hard for experts to do in their head. Another benefit of disaggregation is that the logic of the problem usually becomes more apparent and the model therefore becomes more realistic. During the disaggregation process, analysts should be aware of where the key uncertainties lie within their model and therefore where they should place their emphasis. The analyst can check whether an appropriate level of disaggregation has been achieved by running a sensitivity analysis on the model (see Section 5.3.7) and looking to see whether the Tornado chart is dominated by one or two model inputs.
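As a sketch of that check, the fragment below builds a toy disaggregated revenue model (three regional components with invented three-point estimates, not figures from the text) and ranks the inputs by their correlation with the total - the calculation that sits behind a Tornado chart. Here no single region dominates, suggesting a reasonably balanced disaggregation:

```python
import random

random.seed(3)
N = 5000

# Hypothetical regional revenue components ($m): (min, most likely, max)
regions = {"North": (8, 10, 13), "South": (4, 5, 9), "Export": (1, 2, 6)}
draws = {
    name: [random.triangular(lo, hi, ml) for _ in range(N)]  # args: (low, high, mode)
    for name, (lo, ml, hi) in regions.items()
}
total = [sum(draws[name][i] for name in regions) for i in range(N)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / (sxx * syy) ** 0.5

# Tornado ranking: inputs sorted by the size of their correlation with the output
tornado = sorted(
    ((name, pearson(draws[name], total)) for name in regions),
    key=lambda item: -abs(item[1]),
)
for name, r in tornado:
    print(f"{name:7s} r = {r:.2f}")
```

If one input's bar dwarfed the others, that component would be the candidate for further disaggregation.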
14.3.2 Distributions used in modelling expert opinion

This section describes the role of various types of probability distribution in modelling expert opinion.

Non-parametric and parametric distributions

Probability distribution functions fall into two categories: non-parametric and parametric distributions, the meanings of which are discussed in detail in Appendix III.3. A parametric distribution is based on a mathematical function whose shape and range are determined by one or more distribution parameters. These parameters often have little obvious or intuitive relationship to the distribution shapes they define. Examples of parametric distributions are: lognormal, normal, beta, Weibull, Pareto, loglogistic, hypergeometric - most distribution types, in fact. Non-parametric distributions, on the other hand, have their shape and range determined by their parameters directly in an obvious and intuitive way. Their distribution function is simply a mathematical description of their shape. Non-parametric distributions are: uniform, relative, triangular, cumulative and discrete. As a rule, non-parametric distributions are far more reliable and flexible for modelling expert opinion about a model parameter. The questions that the analyst poses to the expert to determine the distribution's parameters are intuitive and easy to respond to. Changes to these parameters also produce an easily predicted change in the distribution's shape and range. The application of each non-parametric distribution type to modelling expert opinion is discussed below. There are three common exceptions to the above preference for using non-parametric distributions to model expert opinion:

1. The PERT distribution is frequently used to model an expert's opinion. Although it is, strictly speaking, a parametric distribution, it has been adapted so that the expert need only provide estimates of the minimum, most likely and maximum values for the variable, and the PERT function finds a shape that fits these restrictions. The PERT distribution is explained more fully below.
2. The expert may occasionally be very familiar with the parameters that define the particular distribution. For example, a toxicologist may regularly determine the mean and standard error of a chemical concentration in a set of samples. In this case it might be quite helpful to ask the expert for the mean and standard deviation of his/her uncertainty about some concentration.
3. The parameters of a parametric distribution are sometimes intuitive, and the analyst can therefore ask for their estimation directly. For example, a binomial distribution is defined by n, the number of trials that will be conducted, and p, the probability of success of each trial. In cases where I consider the binomial distribution to be the most appropriate, I generally ask the expert for estimates of n and p, recognising that I will have to insert them into a binomial distribution, but I would try to avoid any discussion of the binomial distribution that might cause confusion. Note that the estimates of n and p can also be distributions themselves.

There are other problems associated with using parametric distributions for modelling expert opinion: a model that includes parametric distributions to represent opinion is more difficult to review later because the parameters of the distribution may have no intuitive appeal; and it is very difficult to get the precise shape right when using parametric distributions to model expert opinion, as the effects of changes in the parameters are not usually obvious.

[Figure 14.4 Examples of triangular distributions.]
The triangular distribution

The triangular distribution is the most commonly used distribution for modelling expert opinion. It is defined by its minimum (a), most likely (b) and maximum (c) values. Figure 14.4 shows three triangular distributions: Triangle(0, 10, 20), Triangle(0, 10, 50) and Triangle(0, 50, 50), which are symmetric, right skewed and left skewed respectively. The triangular distribution has a very obvious appeal because it is so easy to think about the three defining parameters and to envisage the effect of any changes. The mean and standard deviation of the triangular distribution are determined from its three parameters:

Mean = (a + b + c)/3
Standard deviation = √[(a² + b² + c² − ab − ac − bc)/18]

From these formulae it can be seen that the mean and standard deviation are equally sensitive to all three parameters. Many models involve parameters for which it is fairly easy to estimate the minimum and most likely values, but for which the maximum is almost unbounded and could be enormous. The central limit theorem tells us that, when adding up a large number of distributions (for example, adding costs or task durations), it is the distributions' means and standard deviations that are most important because they determine the mean and standard deviation of the risk analysis result. In situations where the maximum is so difficult to determine, the triangular distribution is not usually appropriate since the result will depend a great deal on how the estimation of the maximum is approached. For example, if the maximum is assumed to be the absolutely largest possible value, the risk analysis output will have a far larger mean and standard deviation than if the maximum is assumed to be a "practical" maximum by the estimating experts. The triangular distribution is often considered to be appropriate where little is known about the parameter outside an approximate estimate of its minimum, most likely and maximum values.
On the other hand, its sharp, very localised peak and straight lines produce a very definite and unusual (and very unnatural) shape, which conflicts with the assumption of little knowledge of the parameter.

[Figure 14.5 Example of a Trigen distribution.]

There is another useful variation of the triangular distribution, called Trigen in @RISK and TriangGen in Risk Solver, for example. The Trigen distribution requires five parameters, Trigen(a, b, c, p, q), which have the following meanings:

a: the practical minimum
b: the most likely value
c: the practical maximum
p: the probability that the parameter value could be below a
q: the probability that the parameter value could be below c

Figure 14.5 shows a Trigen(40, 50, 80, 5 %, 95 %) distribution, with the 5 % areas extending beyond the minimum and maximum (40 and 80 here). The Trigen distribution is a useful way of avoiding asking experts for their estimate of the absolute minimum and maximum of a parameter: questions that experts often have difficulty in answering meaningfully since there may theoretically be no minimum or maximum. Instead, the analyst can discuss what values of p and q the experts would use to define "practical" minima and maxima respectively. Once this has been decided, the experts only have to give their estimates for practical minimum, most likely and practical maximum for each estimated parameter, and the same p and q values are used for all their estimates. One drawback is that the expert may not appreciate the final range to which the distribution may extend, so it is wise to plot the distribution and have it agreed by the expert before using it in the model. The Tri1090 distribution, featured in @RISK, presumes that p and q are 10 % and 90 % respectively, which is generally about right, but I prefer to use the Trigen because it adapts to each expert's concept of "practical".
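One practical question with Trigen-style inputs is what underlying triangle actually gets sampled. Under the usual reading - a Triangle(A, b, C) whose p-th and q-th percentiles equal the stated practical minimum and maximum - A and C can be recovered numerically from the triangular CDF. The sketch below does this for the Trigen(40, 50, 80, 5 %, 95 %) example; the solving approach is my own illustration, not a vendor algorithm:

```python
import math

def tri_cdf(x, A, b, C):
    """CDF of Triangle(A, b, C)."""
    if x <= b:
        return (x - A) ** 2 / ((C - A) * (b - A))
    return 1.0 - (C - x) ** 2 / ((C - A) * (C - b))

def trigen_support(a, b, c, p, q):
    """Find the true minimum A and maximum C of a Triangle(A, b, C) whose
    p-th percentile is a and q-th percentile is c (Trigen-style inputs).
    Assumes A <= a <= b <= c <= C."""
    k = 1.0 - q

    def C_given_A(A):
        # (C - c)^2 = k (C - A)(C - b) rearranges to a quadratic in C;
        # the larger root is the one with C >= c.
        qa, qb, qc = q, k * (A + b) - 2.0 * c, c * c - k * A * b
        return (-qb + math.sqrt(qb * qb - 4.0 * qa * qc)) / (2.0 * qa)

    def residual(A):
        # Zero when F(a) = p on the rising side of the triangle.
        return (a - A) ** 2 - p * (C_given_A(A) - A) * (b - A)

    lo, hi = a - 100.0 * (c - a), a - 1e-9   # residual > 0 far left, < 0 near a
    for _ in range(200):                      # plain bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if residual(mid) > 0 else (lo, mid)
    A = 0.5 * (lo + hi)
    return A, C_given_A(A)

A, C = trigen_support(40.0, 50.0, 80.0, 0.05, 0.95)
print(f"underlying triangle: Triangle({A:.2f}, 50, {C:.2f})")
```

Plotting this recovered triangle is one way to show the expert the full range their "practical" estimates imply.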
The uniform distribution

The uniform distribution is generally a very poor modeller of expert opinion since all values within its range have equal probability density, but that density falls sharply to zero at the minimum and maximum in an unnatural way. The uniform distribution obeys the maximum entropy formalism (see Section 9.4) where only the minimum and maximum are known, but in my experience it is rare indeed that the expert will be able to define the minimum and maximum but have no opinion to offer on a most likely value. The uniform distribution does, however, have several uses: to highlight or exaggerate the fact that little is known about the parameter; to model circular variables (like the direction of wind from 0 to 2π) and other specific problems; to produce spider sensitivity plots (see Section 5.3.8).

The PERT distribution

The PERT distribution gets its name because it uses the same assumption about the mean (see below) as PERT networks (used in the past for project planning). It is a version of the beta distribution and requires the same three parameters as the triangular distribution, namely minimum (a), most likely (b) and maximum (c). Figure 14.6 shows three PERT distributions whose shape can be compared with the triangular distributions of Figure 14.4. The equation of a PERT distribution is related to the beta distribution as follows:

PERT(a, b, c) = Beta(α1, α2) × (c − a) + a

where

α1 = 6[(μ − a)/(c − a)], α2 = 6[(c − μ)/(c − a)]

and the mean

μ = (a + 4b + c)/6

The last equation for the mean is a restriction that is assumed in order to be able to determine values for α1 and α2. It also shows how the mean for the PERT distribution is four times more sensitive to the most likely value than to the minimum and maximum values.

[Figure 14.6 Examples of PERT distributions.]

[Figure 14.7 Comparison of the standard deviation of Triangle(0, most likely, 1) and PERT(0, most likely, 1) distributions.]
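These relationships translate directly into a sampler: draw from the underlying beta and rescale. A minimal sketch, using Python's built-in betavariate rather than any particular risk package:

```python
import random

def pert(a, b, c, rng=random):
    """Sample from PERT(a, b, c) via the scaled beta form:
    mu = (a + 4b + c)/6, alpha1 = 6(mu - a)/(c - a), alpha2 = 6(c - mu)/(c - a)."""
    mu = (a + 4.0 * b + c) / 6.0
    alpha1 = 6.0 * (mu - a) / (c - a)
    alpha2 = 6.0 * (c - mu) / (c - a)
    return a + rng.betavariate(alpha1, alpha2) * (c - a)

random.seed(11)
xs = [pert(0.0, 10.0, 50.0) for _ in range(100000)]
print(f"simulated mean: {sum(xs) / len(xs):.2f}")  # theory: (0 + 40 + 50)/6 = 15
```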
This should be compared with the triangular distribution, where the mean is equally sensitive to each parameter. The PERT distribution therefore does not suffer to the same extent the potential systematic bias problems of the triangular distribution, that is, producing too great a value for the mean of the risk analysis results where the maximum for the distribution is very large. The standard deviation of a PERT distribution is also less sensitive to the estimate of the extremes. Although the equation for the PERT standard deviation is rather complex, the point can be illustrated very well graphically. Figure 14.7 compares the standard deviations of triangular and PERT distributions that have the same a, b and c values. To illustrate the point, the figure uses values of 0 and 1 for a and c respectively and allows b to vary between 0 and 1, although the observed pattern extends to any {a, b, c} set of values. You can see that the PERT distribution produces a systematically lower standard deviation than the triangular distribution, particularly where the distribution is highly skewed (i.e. b is close to 0 or 1 in this case). As a general rough rule of thumb, cost and duration distributions for project tasks often have a ratio of about 2:1 between (maximum − most likely) and (most likely − minimum), equivalent to b = 0.3333 in Figure 14.7. The standard deviation of the PERT distribution at this point is about 88 % of that for the triangular distribution. This implies that using PERT distributions throughout a cost or schedule model, or any other additive model, will display about 10 % less uncertainty than the equivalent model using triangular distributions. Some readers would perhaps argue that the increased uncertainty that occurs with triangular distributions will compensate to some degree for the "overconfidence" that is often apparent in subjective estimating.
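The 88 % figure for b = 1/3 can be reproduced from the two closed forms: the triangular standard deviation formula given earlier, and the beta variance applied to the PERT's shape parameters:

```python
import math

a, c = 0.0, 1.0
b = 1.0 / 3.0  # the ~2:1 (max - mode):(mode - min) case discussed in the text

# Triangular standard deviation
tri_sd = math.sqrt((a*a + b*b + c*c - a*b - a*c - b*c) / 18.0)

# PERT standard deviation via its beta shape parameters
mu = (a + 4.0 * b + c) / 6.0
a1 = 6.0 * (mu - a) / (c - a)
a2 = 6.0 * (c - mu) / (c - a)
pert_sd = math.sqrt(a1 * a2 / ((a1 + a2) ** 2 * (a1 + a2 + 1.0))) * (c - a)

print(f"PERT sd / triangular sd = {pert_sd / tri_sd:.3f}")  # about 0.886
```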
The argument is quite appealing at first sight but is not conducive to the long-term improvement of the organisation's ability to estimate. I would rather see an expert's opinion modelled as precisely as is practical. Then, if the expert is consistently overconfident, this will become apparent with time and his/her estimating can be corrected.

The modified PERT distribution

The PERT distribution can also be manipulated to produce shapes with varying degrees of uncertainty for the same minimum, most likely and maximum by changing the assumption about the mean:

μ = (a + γb + c)/(γ + 2)

In the standard PERT, γ = 4, which is the PERT network assumption that μ = (a + 4b + c)/6. However, if we increase the value of γ, the distribution becomes progressively more peaked and concentrated around b (and therefore less uncertain). Conversely, if we decrease γ, the distribution becomes flatter and more uncertain. Figure 14.8 illustrates the effect of three different values of γ for a modified PERT(5, 7, 10) distribution.

Figure 14.8 Examples of modified PERT distributions with varying most likely weighting γ.

This modified PERT distribution can be very useful in modelling expert opinion. The expert is asked to estimate the same three values as before (i.e. minimum, most likely and maximum). Then a set of modified PERT distributions is plotted and the expert is asked to select the shape that fits his/her opinion most accurately. It is a fairly simple matter to set up a spreadsheet program that will do all this automatically.

The relative distribution

The relative distribution (also called the general in @RISK, and a version of the Custom in Crystal Ball) is the most flexible of all of the continuous distribution functions. It enables the analyst and expert to tailor the shape of the distribution to reflect, as closely as possible, the opinion of the expert.
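The modified PERT described above follows the same scaled-beta recipe, with α1 = (γ + 2)(μ − a)/(c − a) and α2 = (γ + 2)(c − μ)/(c − a) (an algebraic restatement on my part; it reduces to the standard formulas when γ = 4). A Python/scipy sketch showing the spread narrowing as γ grows:

```python
from scipy import stats

def modified_pert(a, b, c, gamma):
    """Modified PERT with mean mu = (a + gamma*b + c)/(gamma + 2)."""
    mu = (a + gamma * b + c) / (gamma + 2)
    alpha1 = (gamma + 2) * (mu - a) / (c - a)
    alpha2 = (gamma + 2) * (c - mu) / (c - a)
    return stats.beta(alpha1, alpha2, loc=a, scale=c - a)

# As in Figure 14.8: modified PERT(5, 7, 10) at several gamma values.
# Larger gamma -> more peaked around b = 7 -> smaller standard deviation.
for g in (2, 4, 10):
    print(g, modified_pert(5, 7, 10, g).std())
```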
The relative distribution has the form Relative(minimum, maximum, {xi}, {pi}), where {xi} is an array of x values with probability densities {pi} and where the distribution falls between the minimum and maximum. The {pi} values are not constrained to give an area under the curve of 1, since the software recalibrates the probability scale. Figure 14.9 shows a Relative(4, 15, {7, 9, 11}, {2, 3, 0.5}).

Figure 14.9 Example of a relative distribution.

The cumulative distribution

The cumulative distribution has the form CumulativeA(minimum, maximum, {xi}, {Pi}), where {xi} is an array of x values with cumulative probabilities {Pi} and where the distribution falls between the minimum and maximum. Figure 14.10 shows the distribution CumulativeA(0, 10, {1, 4, 6}, {0.1, 0.6, 0.8}) as it is defined in its cumulative form and how it looks as a relative frequency plot.

Figure 14.10 Example of a cumulative distribution and its relative frequency plot.

The cumulative distribution is used in some texts to model expert opinion. However, I have found it largely unsatisfactory because of the insensitivity of its probability scale. A small change in the shape of the cumulative distribution that would pass unnoticed produces a radical change in the corresponding relative frequency plot that would not be acceptable. Figure 14.11 provides an illustration: a smooth and natural relative frequency plot (A) is converted to a cumulative frequency plot (B) and then altered slightly (C). Converting back to a relative frequency plot (D) shows that the modified distribution is dramatically different to the original, although this would almost certainly not have been appreciated by comparing the cumulative frequency plots. For this reason, I usually prefer to model expert opinion looking at the relative frequency distribution instead.
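A relative distribution can be mimicked by discretising a piecewise-linear density. The sketch below is my own (Python/numpy), and it assumes the density falls to zero at the minimum and maximum, as the shape in Figure 14.9 suggests; it normalises the supplied weights just as the software does.

```python
import numpy as np

def relative_sample(minimum, maximum, xs, ps, size=10_000, seed=1):
    """Mimic Relative(min, max, {xi}, {pi}): a piecewise-linear density
    through (min, 0), (xi, pi), (max, 0), rescaled to unit area."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(minimum, maximum, 2001)
    dens = np.interp(grid, [minimum, *xs, maximum], [0.0, *ps, 0.0])
    return rng.choice(grid, size=size, p=dens / dens.sum())

# The Relative(4, 15, {7, 9, 11}, {2, 3, 0.5}) of Figure 14.9
samples = relative_sample(4, 15, [7, 9, 11], [2, 3, 0.5])
print(samples.mean())   # most of the mass sits between 7 and 11
```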
One circumstance where the cumulative distribution is very useful is in attempting to estimate a variable whose range covers several orders of magnitude. For example, the number of bacteria in 1 kg of meat will increase exponentially with time. The meat may contain 100 units of bacteria or 1 million. In such circumstances, it is fruitless to attempt to use a relative distribution directly. This point is discussed more fully in Section 14.3.3.

Figure 14.11 Example of how small changes in a distribution's cumulative plot can dramatically affect its shape.

The discrete distribution

The discrete distribution has the form Discrete({xi}, {pi}), where {xi} is an array of the possible values of the variable with probability weightings {pi}. The {pi} values do not have to add up to unity, as the software will normalise them automatically. It is actually often useful just to consider the ratio of likelihood of the different values and not to worry about the actual probability values. The discrete distribution can be used to model a discrete parameter (that is, a parameter that may take one of two or more distinct values), e.g. the number of turbines that will be used in a power station, and to combine two or more conflicting expert opinions (see Section 14.3.4).

14.3.3 Modelling opinion of a variable that covers several orders of magnitude

A continuous parameter whose uncertainty extends over several orders of magnitude generally cannot be modelled in the usual manner. For example, an expert may consider that 1 g of meat could contain any number of units of bacteria from 1 to 10 000 but that this figure is just as likely to be around 100 or 1000. If we were to model this estimate using a Uniform(1, 10 000) distribution, for example, we would almost certainly not match the expert's opinion of the values of the cumulative percentiles.
The expert would probably place the 25, 50 and 75 percentiles at about 10, 100 and 1000, where our model places them at 2500, 5000 and 7500 respectively. The reason for such a large discrepancy is that the expert is subconsciously making his/her estimate in log-space, i.e. s/he is thinking of the log10 values: log10 1 = 0, log10 10 = 1, log10 100 = 2, etc. To match the expert's approach to estimating, the analyst can also work in log-space, so the distribution becomes

Number of units of bacteria = 10^Uniform(0, 4)

Figure 14.12 compares these two interpretations of the expert opinion by looking at the cumulative distributions and statistics they would produce. The Uniform(1, 10 000) has a much larger mean and standard deviation than the 10^Uniform(0, 4) distribution and an entirely different shape:

                   Uniform(1, 10 000)   10^Uniform(0, 4)
Mean                    5000.5               1085
Std deviation           2886                 2062
Skewness                0                    2.4
Kurtosis                1.8                  5.2

Figure 14.12 Comparison of two ways to model expert opinion of a variable that covers several orders of magnitude.

If the expert had said instead that there could be between 1 and 10 000 units of bacteria in 1 g of meat, but the most likely number is around 500, we would probably have the greatest success in modelling this variable as

Number of units of bacteria = 10^PERT(0, 2.7, 4)

where log10 500 ≈ 2.7. If the variable is to be modelled as a 10^x type formula described above, it is judicious to compare the cumulative percentiles at a few sensible points with those the expert would expect. Any radical differences would suggest that the expert is not actually thinking in log-space and the cumulative distribution could be used instead.

14.3.4 Incorporating differences in expert opinions

Experts will sometimes produce profoundly different probability distribution estimates of a parameter. This is usually because the experts have estimated different things, made differing assumptions or have different sets of information on which to base their opinion.
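The two interpretations compared in Figure 14.12 (Section 14.3.3 above) can be checked with a quick simulation; a numpy sketch of my own, not taken from the book:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
u = rng.uniform(1, 10_000, n)        # naive interpretation
logu = 10 ** rng.uniform(0, 4, n)    # log-space interpretation

# Log-space reproduces the expert's 25/50/75 percentiles of 10/100/1000
print(np.percentile(logu, [25, 50, 75]))
print(u.mean(), logu.mean())         # ≈ 5000.5 versus ≈ 1086
```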
However, occasionally two or more experts simply genuinely disagree. How should the analyst approach the problem? The first step is usually to confer with someone more senior and find out whether one expert is preferred over the other. If those more senior have some confidence in both opinions, a method is needed to combine these opinions in some way.

Recommended approach

I have used the following method for a number of years with good results. Use a Discrete({xi}, {pi}) distribution where the {xi} are the expert opinions and the {pi} are the weights given to each opinion according to the emphasis one wishes to place on them. Figure 14.13 illustrates an example combining three differing opinions, but where expert A is given twice the emphasis of the others owing to the greater experience of that expert.

Figure 14.13 Combining three dissimilar expert opinions.

Two incorrect approaches are frequently used:

- Pick the most pessimistic estimate. This is generally unsatisfactory, as a risk analysis model should be attempting to produce an unbiased estimate of the uncertainty. The caution should only be applied at the decision-making stage after reviewing the risk analysis results.
- Take the average of the two distributions. This is incorrect as the resultant distribution will be too narrow. By way of illustration, consider the test situation where both experts believed a parameter should be modelled by a Normal(100, 10) distribution. Whatever technique was used to combine their opinions, the result should be the same Normal(100, 10) distribution. The average of these two distributions, i.e. AVERAGE(Normal(100, 10), Normal(100, 10)), would be a Normal(100, 10/√2) = Normal(100, 7.07) from the central limit theorem. In other words, we would have produced far too small a spread.
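The difference between averaging two opinions and picking one of them at random (the Discrete approach) is easy to demonstrate; a numpy sketch of my own, not from the book:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200_000
a = rng.normal(100, 10, n)    # expert A: Normal(100, 10)
b = rng.normal(100, 10, n)    # expert B: the same opinion

averaged = (a + b) / 2                        # wrong: averages the samples
picked = np.where(rng.random(n) < 0.5, a, b)  # Discrete: pick one opinion

print(averaged.std())   # ≈ 7.07 — spread shrunk by 1/sqrt(2)
print(picked.std())     # ≈ 10.0 — the agreed opinion is preserved
```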
I have been offered suggestions for other approaches to this problem:

- Take the weighted average of the relative or cumulative percentiles. This will correctly construct the combined distribution (it is how the ModelRisk function VoseCombined works), but it is very laborious to execute for all but the most simple distributions of opinion unless you have a library of density and cdf functions, so it is somewhat impractical to start from scratch.
- Multiply together the probability densities at each x value. This is incorrect because (a) it produces combined distributions with exaggerated peakedness, (b) the area under the curve is no longer 1 and (c) the combined distribution is contained between the highest minimum and the lowest maximum.

  SME     Min   Mode   Max   Distribution                 Weight
  Peter    11    13     17   VosePERT($C$3,$D$3,$E$3)      0.3
  Jane     12    13     16   VosePERT($C$4,$D$4,$E$4)      0.2
  Paul      8    10     13   VosePERT($C$5,$D$5,$E$5)      0.4
  Susan     9    10     15   VosePERT($C$6,$D$6,$E$6)      0.1

  Combined estimate (E8):   8.680244
  P(<14) (E9):              0.878805

Formulae table
F3:F6         =VosePERTObject(C3,D3,E3)
E8 (output)   =VoseCombined(F3:F6,G3:G6,B3:B6)
E9 (output)   =VoseCombinedProb(14,F3:F6,G3:G6,B3:B6,1)

Figure 14.14 Combining weighted SME estimates using VoseCombined functions.

ModelRisk has the function VoseCombined({Distributions}, {Weights}) and related probability calculation functions that perform the combination described above. In the model in Figure 14.14, four expert estimates are combined to construct the one estimate. The advantage of this function is that it then allows one to perform a sensitivity analysis on the estimate as a whole: if you were to use the Discrete({Distributions}, {Weights}) method, your Monte Carlo software would, in this case, be performing a sensitivity analysis of five distributions: the four estimates and the discrete distribution, which will dilute the perceived influence of the combined uncertainty.
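Without ModelRisk, the weighted combination can be sketched as a mixture of scaled-beta PERTs; the cdf of the mixture at 14 reproduces the P(<14) output of the model above. This is my reconstruction in Python/scipy, not the book's code:

```python
from scipy import stats

def pert(a, b, c):
    """PERT(a, b, c) as a scaled beta: mu = (a + 4b + c)/6."""
    mu = (a + 4 * b + c) / 6
    return stats.beta(6 * (mu - a) / (c - a), 6 * (c - mu) / (c - a),
                      loc=a, scale=c - a)

# The four SME estimates and weights from Figure 14.14
estimates = [pert(11, 13, 17), pert(12, 13, 16),
             pert(8, 10, 13), pert(9, 10, 15)]
weights = [0.3, 0.2, 0.4, 0.1]   # already sum to 1 here

# Weighted mixture cdf at 14 — the model's P(<14) output
p_below_14 = sum(w * d.cdf(14) for d, w in zip(estimates, weights))
print(p_below_14)   # ≈ 0.88
```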
In the model in Figure 14.14, the VoseCombined function generates random values from a distribution constructed by weighting the four SME estimates. The weights do not need to sum to 1: they will be normalised. The VoseCombinedProb(. . ., 1) function calculates the probability that this distribution will take a value less than 14. Note that the names of the experts are an optional parameter: this simply records who said what and has no effect on the calculation. However, select cell E8 and then click the Vf (View Function) icon from the ModelRisk toolbar and you will get the graph shown in Figure 14.15, which allows us to compare each SME's estimate and see how they are weighted.

Figure 14.15 Screen capture of graphic interface for the VoseCombined function used in the model of Figure 14.14.

14.4 Calibrating Subject Matter Experts

When subject matter experts (SMEs) are first asked to provide probabilistic estimates, they usually won't be particularly good at it because it is a new way of thinking. We need some techniques that allow us to help the SMEs gauge how well they are estimating and, over time, correct any biases they have. We may also need a method for selecting between or weighting SMEs' estimates.

Imagine that an SME has estimated that a bespoke generator being placed on a ship will cost $PERT(1.2, 1.35, 1.9) million, and we compare the actual outturn cost against that estimate. Let's say it ended up costing $1.83 million. Did the SME provide a good estimate? Well, it fell within the range provided, which is a good start, but it was at the high end, as Figure 14.16 shows. The 1.83 value fell at the 99.97th percentile of the PERT distribution. That seems rather high considering the SME's estimate lay from 1.2 to 1.9 and 1.83 is only 90 % along that range, but it is the result of how the PERT distribution interprets the minimum, mode and maximum values.

Figure 14.16 An SME estimate.

The distribution is
quite right skewed, in which case the PERT has a thin right tail - in fact it assigns only a 1 % probability to values larger than 1.73. For this exercise, however, we'll assume that the SME had seen the plots above and was comfortable with the estimate.

We can't be certain with just one data point that the SME tends to underestimate. In areas like engineering, capital investment and project planning, one SME will often provide many estimates over time, so let's imagine we repeat the exercise some 10 times and determine the percentile at which each outturn cost lies on each corresponding distribution estimate. In theory, if our SME was perfectly calibrated, these would be random samples from a Uniform(0, 1) distribution, so the mean should rapidly approach 0.5. A Uniform(0, 1) distribution has a variance of 1/12, so the mean of 10 samples from a perfectly calibrated SME should, from the central limit theorem, fall on a Normal(0.5, 1/SQRT(12 * 10)) = Normal(0.5, 0.091287). If the 10 values average to 0.7, we can be pretty sure that the SME is underestimating, since there is only a (1 - NORMDIST(0.7, 0.5, 0.091287, 1)) = 1.4 % chance that a perfectly calibrated SME would have produced a value of 0.7 or larger. Similarly, we can analyse the variance of the 10 values. It should be close to 1/12: if the variance is smaller then the SME's distributions are too wide, or, as is more likely, if the variance is larger then the SME's distributions are too narrow.

The above analysis assumes, of course, that all the estimates actually fell within the SME's distribution range, which may well not be the case. The plots in Figure 14.17 can help provide a more comprehensive picture.

Experts are also sometimes asked to estimate the probability that an event will occur, which is no easy task. In theory one can roughly estimate how good an SME is at providing these estimates by grouping estimated probabilities into bands (e.g.
the same bands as in Figure 14.17) and determining what fraction of those risk events actually occurred. Obviously, around 15 % of risks that were thought to have between 10 % and 20 % chance of occurring should actually occur. However, this breaks down at the lowest and highest categories because many identified potential risks are perceived to have a very small probability of occurrence, so we will almost never actually have any observations.

14.5 Conducting a Brainstorming Session

When the initial structure of the problem has been decided and subjective estimates of the key uncertainties are now required, it is often very useful to conduct one or more brainstorming sessions with several experts in the area of the problem being analysed. If the model covers several different disciplines, for example engineering, production, marketing and finance, it may be better to hold a brainstorming session for each discipline group as well as one for everybody. The objectives of the brainstorming session are to ensure that everyone has the same information pertinent to the problem and then to debate the uncertainties of the problem.

In some risk analysis texts, the analyst is encouraged to determine a distribution of each uncertain parameter during these meetings. I have tried this approach and find it very difficult to do well because it relies very heavily on controlling the group's dynamics: ensuring that the loudest voice does not get all the air time; encouraging the individuals to express their own opinion rather than following the leader; etc. These meetings can also end up dragging on, and some of the experts may have to leave before the end of the session, reducing its effectiveness. My aim in brainstorming sessions is to ensure that all those attending leave with a common perception of the risks and uncertainties of the problem.
This is achieved by doing the following:

- Gathering all relevant information and circulating it to the attending experts prior to the meeting.
- Presenting data in easily digested forms, e.g. using scatter plots, trend charts, statistics and histograms wherever possible rather than columns of figures.
- At the meeting, encouraging discussion of the variability and uncertainty in the problem, including the logical structure and any correlations.
- Discussing scenarios that would produce extreme values for the uncertain variables to get a feel for the true extent of the total uncertainty. Some of the experts may also have extra information to add to the pot of knowledge.
- The analyst, acting as chairperson, ensuring that the discussion is structured.
- Taking minutes of the meeting and circulating them afterwards to the attendees.

Figure 14.17 Histogram of SME outturn percentiles. Percentiles are grouped into 10 bands so roughly 10 % of the percentile scores should lie in each band (when there are a lot of scores). Expert A is well calibrated. Expert B provides estimates that are too narrow and tends to underestimate. Expert C provides estimates that are far too wide and tends to overestimate.

After a suitable, but short, period for contemplation following the brainstorming session, the analyst conducts individual interviews with each expert and attempts to determine their opinions of the uncertainty of each variable that was discussed. The techniques for eliciting these opinions are discussed in Section 14.6.1. Since all the experts will have the same level of knowledge, they should produce similar estimates of uncertainty.
Where there are large differences between opinions, the experts can be reconvened to discuss the issue. If no agreement can be reached, the conflicting opinions can be treated as described in Section 14.3.4. I believe that this procedure has several distinct benefits over attempting to determine distributions during brainstorming sessions:

- Each expert has been given the time to think about the problem.
- They are encouraged to develop their own opinion after the benefit of discussion with the other experts.
- A quiet individual is given as much prominence as a dominating one.
- Differences in opinion between experts are easier to identify.
- The whole process can be conducted in a much more orderly fashion.

14.6 Conducting the Interview

Initial resistance

Expert opinion of the uncertainty of a parameter is generally determined in a one-to-one interview between the relevant expert and the analyst developing the model. In preparing for such interviews, analysts should make themselves familiar with the various techniques for modelling expert opinion described earlier in this chapter. They should also be familiar with the various sources of biases and errors involved in subjective estimation. The experts, in their turn, having been informed of the interviews well in advance, should have evaluated any relevant information either on their own or in a brainstorming session as described above.

There is occasionally some initial resistance by the experts to providing estimates in the form of distributions, particularly if they have not been through the process before. This may be because they are unfamiliar with probability theory. Alternatively, they may feel they know so little about the variable (perhaps because it is so uncertain) that they would find it hard enough to give a single point estimate, let alone a whole probability distribution. I like to start by explaining how, by using uncertainty distributions, we are allowing the experts to express their lack of certainty.
I explain that providing a distribution of the uncertainty of a parameter does not require any great knowledge of probability theory. Neither does it demand a greater knowledge of the parameter itself than a single-point estimate - quite the reverse. It gives the experts a means to express their lack of exact knowledge of the parameter. Where in the past their single-point estimates were always doomed never to occur precisely, their estimates now using distributions will be correct if the actual value falls anywhere within the distribution's range.

The next step is to discuss the nature of the parameter's uncertainty. I prefer to let the experts explain how they see the logic of the uncertainty, rather than impose on them a structure I may have had in mind, and then to model what I hear.

Opportunity to revise estimates

Experts are usually more comfortable about providing estimates if they are told before the interviews that they have the opportunity to revise their estimates at a later date. It is also good practice to leave the experts with a printed copy of each estimate and to get them to sign a copy for the analyst's records. Note that the copy should have a date on it. This is important since the experts' opinion could change dramatically after the occurrence of some event or the acquisition of more data.

14.6.1 Eliciting distributions of the expert opinion

Once the model has been sufficiently disaggregated, it is usually not necessary to provide very precise estimates of each individual component of the model. In fact, three-point estimates are usually quite sufficient, the three points being the minimum, most likely and maximum values the expert believes the value could take. These three values can be used to define either a triangular distribution or some form of PERT distribution.
My preference is to use a modified PERT, as described in Section 14.3.2, because it has a natural shape that will invariably match the expert's view better than a triangular distribution would. The analyst should attempt to determine the expert's opinion of the maximum value first and then the minimum, by considering scenarios that could produce such extremes. Then, the expert should be asked for his/her opinion of the most likely value within that range. Determining the parameters in the order (1) maximum, (2) minimum and (3) most likely will go some way to removing the "anchoring" error described in Section 14.2.2.

Occasionally, a model will not disaggregate evenly into sufficiently small components, leaving the model's outputs strongly affected by one or more individual subjective estimates. When this is the case, it is useful to employ a more rigorous approach to eliciting an expert's opinion than a simple three-point estimate. In such cases, the modified PERT distribution is a good start but, on review of the plotted distribution, the expert might still want to modify the shape a little. This can be done with pen and graph paper as shown in Figure 14.18.

Figure 14.18 Graphing distribution of expert opinion.

In this example, the marketing manager believes that the amount of wool her company will sell next month will be at least 5 metric tons (mt), no more than 10 mt and most probably about 7 mt. These figures are then used to define a PERT distribution that is printed out onto graph paper. On reflection, the manager decides that there is a little too much emphasis being placed on the right tail and draws out a more realistic shape. The revised curve can then be converted to a relative distribution and entered into the model. Crosses are placed at strategic points along the curve so that drawing straight lines between these crosses will produce a reasonable approximation of the distribution.
Then the x- and y-axis values are read off for each point and noted. Finally, the manager is asked to sign and date the figure for the records. The above technique is flexible, quite accurate and reassuringly transparent to the expert being questioned.

This technique can now also be done without the need for pen and paper, using RISKview software. Figure 14.19 illustrates the same example using RISKview. The PERT(5, 7, 10) distribution (top panel) is moved to the Distribution Artist facility of RISKview and automatically converted into a relative distribution (bottom panel) with a user-defined number of points (10 in this example). This distribution can now be modified to better reflect the expert's opinion by sliding the points up and down. The modified distribution can also immediately be viewed as an ascending or descending cumulative frequency plot to allow the expert to see if the cumulative percentiles also make sense. When the final distribution has been settled on, it can be directly inserted into a spreadsheet model at the click of an icon.

14.6.2 Subjective estimation of discrete probabilities

Experts will sometimes be called upon to provide an estimate of the probability of occurrence of a discrete event. This is a difficult task for experts. It requires that they have some feel for probabilities, which is both difficult for them to acquire and to calibrate. If the discrete event in question has occurred in the past, the analyst can assist by presenting the data and a beta distribution of the probabilities possible from that data (see Section 8.2.3). The experts can then give their opinion based on the amount of information available. However, it is quite usual that past information has no relevance to the problem at hand. For example, political analysts cannot look to past general election results for guidance in estimating whether the Labour Party will win the next general election.
They will have to rely on their gut feeling based on their understanding of the current political climate. In effect, they will be asked to pick a probability out of the air - a daunting task, complicated by the difficulty of having to visualise the difference between, say, 60 and 70 %. A possible way to avoid this problem is to offer experts a list of probability phrases, for example:

- almost certain;
- very likely;
- highly likely;
- reasonably likely;
- fairly likely;
- even chance;
- fairly unlikely;
- reasonably unlikely;
- highly unlikely;
- very unlikely;
- almost impossible.

Figure 14.20 Visual aid for estimating probabilities: A = 1 %, B = 5 %, C = 10 %, D = 20 %, E = 30 %, F = 40 %, G = 50 %, H = 60 %, I = 70 %, J = 80 %, K = 90 %, L = 95 %, M = 99 %.

The phrases are ranked in order and the experts are told of this ranking. They are then asked to select a phrase that best fits their understanding of the probability of each event that has to be considered. At the end of the session, they are also asked to match as many of the phrases as possible to visual representations of probability, for example matching a phrase to the probability of picking out a black ball at random from the trays of Figure 14.20. Since we know the percentage of black balls in each tray, we can associate a probability with each phrase and thus with each estimated event.

14.6.3 Subjective estimation of very low and very high probabilities

Risk analysis models occasionally incorporate very unlikely events, i.e. those with a very low probability of occurrence. It is recommended that readers review Section 4.5.1 before deciding to incorporate rare events into their model. The risk of the rare event is usually modelled as the probability of its occurrence combined with a distribution of its impact should it occur. An example might be the risk of a large earthquake on a chemical plant.
The distribution of impact on the chemical plant (in terms of damage and lost production, etc.) can be reasonably estimated since there is a basis on which to make the estimation (the components most at risk in an earthquake, the cost of replacement, the time required to effect the repairs, production rates, etc.). However, the probability of an earthquake is far less easy to estimate. Since it is so rare, there will be very few recorded occurrences on which to base the estimate of probability. When data are not available to determine estimates of probability of very unlikely events, experts will often be consulted for their opinion. Such consultation is fraught with difficulties. Experts, like the rest of us, are very unlikely to have any feel whatsoever for low probabilities unless there is a reasonable amount of data on which to base their estimates (in which case they can offer their opinion based around the frequency of observed occurrences). The best that the experts can do is to make some comparison with the frequency of other low-probability events whose probabilities are well defined. 
Figure 14.21 offers a list of well-determined low probabilities in a graphical format that the reader may find helpful in this regard: a risk ladder of the annual risk of dying in the US (number of deaths per 1 000 000), ranging from an 80-year-old dying before age 81 (60 000 per million) down to being hit on the ground by a falling airplane or a hurricane (well below 1 per million).

Figure 14.21 Illustration of a risk ladder (for the USA) to aid in expert elicitation (from Williams, 1999, with the author's permission).

This inaccuracy in estimating the probability of a rare event will have a very large impact on a risk analysis. Consider two experts estimating the expected cost of the risk of a gas turbine failing. They agree that it would cost the company about £600 000 ± £200 000 should it fail. However, the first expert estimates the probability of the event as 1:1000/year and the second as 1:5000/year. Both see the probability as very low, but the expected cost for the first estimate is 5 times that of the second, i.e. £600 000 × 1/1000 = £600 compared with £600 000 × 1/5000 = £120.

An estimate of the probability of a rare event can sometimes be broken down into a series of consecutive events that are easier to determine. For example, the failure of the cooling system of a nuclear power plant would require a number of safety mechanisms all to fail at the same time.
The probability of failure of the cooling system is then the product of the probability of failure of each safety mechanism, each of which is usually easier to estimate than the total probability of the event. As another example, this technique is also enjoying increasing popularity in epidemiology for the assessment of the risks of introducing exotic diseases to a country through imports. The imported entity (animal, vegetable or product of either) must first have the disease. Then it must slip through any quality checks in its country of origin. After that, it must still slip through quarantine checks in the importing country, and finally it has to infect a potential host. Each step has a probability (which may often be broken down even further) which is estimated, and these probabilities are then multiplied together to determine the final probability of the introduction of the disease from one animal.

Chapter 15 Testing and modelling causal relationships

Testing and modelling causal relationships is the subject of plenty of books. I recommend Pearl (2000), Neapolitan (2004) and Shipley (2002) because they are thorough, fairly readable if you're good at mathematics and take a practical viewpoint. The technical details of causal inference lie very firmly in the domain of statistics, so I'll leave it to these books to explain them. In this chapter I want to look at some practical issues of causality from a risk analysis perspective. The main impetus for including this as a separate topic is to help you avoid some of the nonsense that I have come across over the years while reviewing models and scientific papers, battling in court as an expert witness or just watching the news on TV. There are a few simple, very practical and intuitive rules that will help you test a hypothesised causal relationship.
Causal inference is mostly applied to health issues, although the thinking has potential applications in other areas such as econometrics (in his book, Pearl laments the lack of rigorous causal thinking in current econometric practices), so I am going to use health issues as examples in this chapter. We can attempt to use a causal model to answer three different types of question:

Predictions - what will happen given a certain set of conditions?
Interventions - what would be the effect of controlling one or more conditions?
Counterfactuals - what would have happened differently if one or more conditions had been different?

In a deterministic (non-random) world there is a straightforward interpretation of causality. CSI Miami and its derivatives, and all those medical dramas, are such fun programmes because we viewers try to figure out what really happened - what caused this week's murder(s) - and of course the programme always finishes with a satisfyingly unequivocal solution. I was once stranded in a US airport hotel in which a real-world CSI conference was taking place, and the attendees were keen to tell me how their reality was rather different. They don't have the flashy cars, cool clothes, ultrasophisticated equipment or trendy offices bathed in moody light. More importantly, when they search a database of fingerprints, it comes up with a list, if they're lucky, of a dozen or so possible candidates, probably with "whereabouts unknown". For them, the truth is far more elusive. In the risk analysis world we have to work with causal relationships that are usually probabilistic in nature, for example: the probability of having lung cancer within your life is x if you smoke; the probability of having lung cancer within your life is y if you don't smoke. We all know that x > y, which makes being a smoker a risk factor. But life is more complicated than that: there is a biological gradient, meaning in this case that the more you smoke, the more likely the cancer.
If we were to do a study designed to determine the causal relationship between smoking and cancer, we should look not just at whether people smoked at all, but at how much a person has smoked, for how long and in what way (cigars, cigarettes with or without filters, pipes, little puffs or deep inhaling, brand, etc.). Things are further complicated because people can change their smoking habits over time. How about: the probability of having lung cancer within your life is a if you carry matches; the probability of having lung cancer within your life is b if you don't carry matches. I haven't done the study, but I bet a > b, although carrying matches should not be a risk factor. A correct statistical analysis will determine the high correlation between carrying matches (or lighters) and using tobacco products. A sensible statistician would figure out that matches should be removed from the analysis. An uncontrolled statistical analysis can produce some silly results (imagine we had no idea that tobacco could be related to cancer and didn't collect any tobacco-related data), so we should always apply some disciplined thinking to how we structure and interpret a statistical model. We need a few definitions to begin:

A risk factor is an aspect of personal behaviour or lifestyle, environment or characteristic thought to be associated positively or negatively with a particular adverse condition.

A counterfactual world is an epidemiological hypothetical idea of a world similar to our own in all ways but for which the exposure to the hazard, or people's behaviour or characteristics, or some other change that affects exposure, has been changed in some way.

The population attributable risk (PAR) (aka population aetiological fraction, among many others) is the proportion of the incidence in the population attributable to exposure to a risk factor.
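One standard way of computing a PAR, given the prevalence of exposure and the relative risk, is Levin's formula. The text does not specify a particular formula, so treat this as one common choice among several, and the input numbers below are invented for illustration:

```python
def levin_par(exposure_prevalence, relative_risk):
    """Levin's formula for population attributable risk:
    PAR = p_e*(RR - 1) / (1 + p_e*(RR - 1)),
    where p_e is the prevalence of exposure in the population
    and RR is the relative risk in the exposed group."""
    excess = exposure_prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)

# Hypothetical inputs: 25 % of the population smokes, and the relative
# risk of lung cancer for smokers vs non-smokers is 10.
par = levin_par(0.25, 10.0)
print(round(par, 3))  # 0.692 -> roughly 69 % of cases attributable
```

Note that if the relative risk is 1 (no effect), the formula correctly returns a PAR of zero.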
It represents the fraction by which the incidence in the population would have been reduced in a counterfactual world where the effect associated with that risk factor was not present. These concepts are often used to help model what the future might look like if we were to eliminate a risk factor, but we need to be careful, as they technically only refer to the comparison of an observed world and a counterfactual parallel world in which the risk factor does not appear - making predictions of the future means that we have to assume that the future world would look just like that counterfactual one. In figuring out the PAR, we may well have to consider interactions between risk factors. Consider the situation where the presence of either of two risk factors gives an extremely high probability of the risk of interest, and where a significant fraction of the population is exposed to both risk factors. In this case there is a lot of overlap, and an individual risk factor has less impact because the other risk factor is competing for the same victims. On the other hand, exposure to two chemicals at the same time might produce a far greater effect than either chemical alone. We talk about synergism and antagonism when the risk factors work together or against each other respectively. Synergism is more common, so the PAR for the combination of two or more risk factors is usually less than the sum of their individual PARs.

15.1 Campylobacter Example

A large survey conducted by the CDC (the highly reputable Centers for Disease Control and Prevention) in the United States looked at why people end up getting a certain type of food poisoning (campylobacteriosis). You get campylobacteriosis when bacteria called Campylobacter enter your intestine, find a suitably protected location and multiply (form a colony).
Thus, the sequence of events resulting in campylobacteriosis must include some exposure to the bacteria, then survival of those bacteria through the stomach (the acid can kill them), then setting up a colony. In order for us to observe the infection, that person has to become ill. In order to identify the disease as campylobacteriosis, a doctor has to ask for a stool sample, it has to be provided, the stool sample has to be cultured and the Campylobacter have to be isolated and identified. Campylobacteriosis will usually resolve itself after a week or so of unpleasantness, so many more people therefore have campylobacteriosis than a healthcare provider will observe. The US survey looked at behaviour patterns of people with confirmed cases, tried to match them with others of the same sex, age, etc., known not to have suffered from a foodborne illness and looked for patterns of differences. This is called a case-control study. Some of the key factors were as follows (+ meaning positively associated with illness, - meaning negatively associated):

1. Ate barbecued chicken (+).
2. Ate in a restaurant (+).
3. Were male and young (+).
4. Had healthcare insurance (+).
5. Were in a low socioeconomic band (+).
6. There was another member of the family with an illness (+).
7. The person was old (+).
8. Regularly ate chicken at home (-).
9. Had a dog or cat (+).
10. Worked on a farm (+).

Let's see whether this matches our understanding of the world: Factor 1 makes sense since Campylobacter naturally occur in chicken and are very frequently to be found in chicken meat. People are also somewhat less careful with their hygiene and the cooking is less controlled at a barbecue (healthcare tip: when you've cooked a piece of meat, place it on a different plate than the one used to bring the raw meat to the barbecue).
Factor 2 makes sense because of cross-contamination in the restaurant kitchen, so you might eat a veggie burger but still have consumed Campylobacter originating from a chicken. Factor 3 makes sense because we guys tend not to pay much attention to kitchen practices when we're young and start off rather hopeless when we first leave home. Factor 4 makes sense in that, in the USA, visiting a doctor is expensive, and that is the only way the healthcare system will observe the infection. Factor 5 maybe seems right because poorer people will eat cheaper-quality food and will visit restaurants with higher capacity and lower standards (related to factor 2). Factor 6 is obvious since faecal-oral transmission is a known route (healthcare tip: wash your hands very well, particularly when you are ill). Factor 7 makes sense because older people have a less robust immune system, but maybe they also eat in restaurants more (less?) often, maybe they like chicken more, etc. Factor 8 seems strange. It appears from a number of studies that if you eat chicken at home you are less likely to get ill. Maybe that is because it displaces eating chicken at a restaurant, maybe it's because people who cook are wealthier or care more about their food, or maybe (the current theory) it is because these people get regular small exposures to Campylobacter that boost their immune system. Factor 9 is trickier. Perhaps pet food contains Campylobacter, or perhaps the animal gets uncooked scraps, then cross-infects the family. Factor 10 makes sense. People working in chicken farms are obviously more at risk, but a farm will often have just a few chickens, or will buy in manure as fertiliser or used chicken bedding as cattle feed. Other animals also carry Campylobacter. Each of the above is a demonstrable risk factor because each passed a test of statistical significance in this study (and others) and one can find a possible rational explanation.
Of course, the possible rational explanation is often to be expected because the survey was put together with questions that were designed to test suspected risk factors, not the ones that weren't thought of. Note that the causal arguments are often interlinked in some way, making it difficult to figure out the importance of each factor in isolation. Statistical software can deal with this given the appropriate control.

15.2 Types of Model to Analyse Data

Data can be analysed in several different ways in an attempt to determine the magnitude of hypothesised causal relationships between variables (possible risk factors). Note that these models will never prove a causal relationship, just as it is not possible to prove a theory, only disprove it.

Neural nets - look for patterns within datasets between several variables associated with a set of individuals. They can find correlations within datasets, and make predictions of where a new observation might lie on the basis of values for the conditioning variables, but they do not have a causal interpretation and tend to be rather black box in nature. Neural nets are used a lot in profiling. For example, they are used to estimate the level of credit risk associated with a credit card or mortgage applicant, or to identify a possible terrorist or smuggler at an airport. They don't seek to determine why a person might be a poor credit risk, for example, just match the typical behaviour or history of someone who fails to pay their bills - things like having defaulted before, changing jobs frequently, not owning a home.

Classification trees - can be used to break down case-control data to list from the top down the most important factors influencing the outcome of interest. This is done by looking at the difference in the fraction of cases and controls that have the outcome of interest (e.g. disease) when they are split by each possible explanatory variable.
So, for example, in a case-control study of lung cancer, one might find that the fraction of people with lung cancer is much larger among smokers than among non-smokers, which forms the first fork in the tree. Looking then at the non-smokers only, one might find that the fraction of people with lung cancer is much higher for those who worked in a smoky environment compared with those who did not. One continually breaks down the population splits, figuring out which variable is the next most correlated with a difference in the risk, until you run out of variables or statistical significance.

Regression models - logistic regression is used a lot to determine whether there is a possible relationship between variables in a dataset and the variable to be predicted. The probability of a "success" (e.g. exhibiting the disease) of a dichotomous (two-possible-outcome) variable we wish to predict, pi, is related to the various possible influencing variables by a regression equation, for example

ln(pi / (1 - pi)) = b0 + b1xi1 + b2xi2 + ... + bkxik

where subscript i refers to each observation, and subscript j refers to each possible explanatory variable in the dataset, of which there are k in total. Stepwise regression is used in two forms: forward selection starts off with no predictive variables and sequentially adds them until there is no statistically significant improvement in matching the data; backward selection has all variables in the pot and keeps taking away the least significant variable until the model's statistical predictive capability begins to suffer. Logistic regression can take account of important correlations between possible risk factors by including covariance terms. Like neural nets, it has no in-built causal thinking.

Bayesian belief networks (aka directed acyclic graphs) - visually, these are networks of nodes (observed variables) connected together by arcs (probabilistic relationships).
They offer the closest connection to causal inference thinking. In principle you could let DAG software run on a set of data and come up with a set of conditional probabilities - it sounds appealing and objectively hands-off, but the networks need the benefit of human experience to know the direction in which these arcs should go, i.e. what the directions of influence really are (and whether they exist at all). I'm a firm believer in assigning some constraints to what the model should test, but make sure you know why you are applying those constraints. To quote Judea Pearl (Pearl, 2000): "[C]ompliance with human intuition has been the ultimate criterion of adequacy in every philosophical study of causation, and the proper incorporation of background information into statistical studies likewise relies on accurate interpretation of causal judgment". Commercial software is available for each of these methods. The algorithms they use are often proprietary and can give different results on the same datasets, which is rather frustrating and presents some opportunities to those who are looking for a particular answer (don't do that). In all of the above techniques, it is important to split your data into a training set and a validation set to test whether the relationships that the software finds in the training set will let you reasonably accurately (i.e. at the decision-maker's required accuracy) predict the outcome observations in the validation dataset. Best practice involves repeated random splitting of your data into training and validation sets.

15.3 From Risk Factors to Causes

Let's say that you have completed a statistical analysis of your data and your software has come up with a list of risk factors. The numerical outputs of your statistical analysis will allow you to calculate the PAR for each factor, and here you should apply a little common sense because the PAR relates to the decision question you are answering.
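To make the logistic regression relationship from Section 15.2 concrete, here is a minimal sketch of computing pi from a fitted set of coefficients. The coefficient values and variables are invented for illustration; the fitting itself would be done by statistical software:

```python
import math

def logistic_probability(coeffs, x):
    """Computes p_i = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk))),
    the probability of 'success' under a logistic regression model.
    coeffs[0] is the intercept b0; coeffs[1:] are b1..bk."""
    z = coeffs[0] + sum(b * xj for b, xj in zip(coeffs[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

# Invented fitted coefficients: intercept, smoker indicator (0/1),
# age in decades
coeffs = [-2.0, 1.5, 0.3]

p_smoker_60 = logistic_probability(coeffs, [1, 6.0])
p_nonsmoker_60 = logistic_probability(coeffs, [0, 6.0])
print(round(p_smoker_60, 3), round(p_nonsmoker_60, 3))  # 0.786 0.45
```

The logit link guarantees that pi always lies between 0 and 1, whatever values the explanatory variables take.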
Let me take the campylobacteriosis study as an example. You first need to know a couple of things about Campylobacter. It does not survive long outside its natural host (animals like chickens, ducks and pigs, where it causes no illness) and so it does not establish reservoirs in the ground, in water, etc. It also does not generally stay long in a human gut, although many people could be harbouring the bacteria unknowingly. This means that, if we were to eliminate all the Campylobacter at their animal sources, we would no longer have human campylobacteriosis cases (ignoring infections from travelling). I was lead risk analyst for the US FDA, where we wanted to estimate the number of people who are infected with fluoroquinolone-resistant Campylobacter from poultry - fluoroquinolone is used to treat poultry (particularly chickens) for the respiratory disease they get from living in sheds with poor ventilation, where the ammonia strips out the lining of their lungs. We reasoned: if say 100 000 people were getting campylobacteriosis from poultry, and say 10 % of the poultry Campylobacter were fluoroquinolone resistant, then about 10 000 were suffering campylobacteriosis that would not be treatable by administering fluoroquinolone (the antimicrobial is also often used to treat suspected cases of food poisoning). We used the CDC study and their PAR estimates. The case ended up going to court, and a risk analyst hired by the opposing side (the drug sponsor, who sold a lot more of their antimicrobial to chicken farms than to humans) got the CDC data under the Freedom of Information Act and did a variety of statistical analyses using various tools. He concluded: "A more realistic assessment based on the CDC case-control data is that the chicken-attributable fraction for [the pathogen] is between -11.6 % (protective effect) and 0.72 % (not statistically significantly different from zero) depending on how missing data values are treated".
In other words, he is saying with this -11.6 % attributable fraction figure that chicken is protective, so in a counterfactual world without chicken contaminated with Campylobacter there would be more campylobacteriosis, i.e. if we could remove the largest source of exposure we have to Campylobacter (poultry), more people would get ill. Put another way, he believes that the Campylobacter on poultry are protective, but the Campylobacter from other sources are not. Using classification trees, for example, he determined that the major risk factors were, in descending order of importance: visiting a farm, travelling, having a pet, drinking unprocessed water, being male (then eating ground beef at home, eating pink hamburgers and buying raw chicken) or being female (and then having no health insurance, eating high levels of fast food, eating hamburgers at home . . . and finally, eating fried chicken at home). Note that chicken is at the bottom of both sequences. So how did this risk analyst manage to justify his claim that eating chicken was actually protective - it did not pose a threat of campylobacteriosis? He did so by misinterpreting the risk factors. There is really no sense in considering a counterfactual world where people are all neuter (neither male nor female) - and anyway, since we don't have any of those, we have no idea how their behaviour will be different from males or females. Should we really be including whether people have insurance as a risk factor to which we assign a PAR? I think not. It is perhaps true that all these factors are associated with the risk - meaning that the probability of campylobacteriosis is correlated with each factor, but they are not risk factors within the context of the decision question. I don't think that by paying people's health insurance we would likely change the number of illnesses, although we would of course change the number reported and treated. 
What we hope to achieve is an understanding of how much disease is caused by Campylobacter from chicken, so the level of total human illness needs to be distributed among the sources of Campylobacter. That brings some focus to the PAR calculations: dining in a restaurant is only a risk factor because Campylobacter is in the restaurant kitchen. How did it get there? Probably chickens mostly, but also ducks and other poultry, although the US eats those in far lower volumes. It could also sometimes be a kitchen worker with poor hygiene unknowingly carrying Campylobacter, but where did that worker originally get the infection? Most probably from chicken. The sex* of a person is no longer relevant. Having a pet (it was mostly puppies) is a debatable point, since the puppy probably became infected from contaminated meat rather than being a natural carrier itself.

* Not "gender", which I found out one day listening to a debate in the UK House of Lords is what one feels oneself to be, while "sex" is defined by the reproductive equipment with which we are born.

Looking just at Campylobacter sources, we get a better picture, and, although regular small amounts of exposure (eating at home) may be protective, this is protecting against other, mostly chicken-derived, Campylobacter exposure, and we end up with the same risk attribution that CDC determined from its own survey data. We won the court case, and the other risk analyst's testimony was, very unusually, rejected as being unreliable - in no small part because of his selective and doctored quoting of papers.

15.4 Evaluating Evidence

The first test of causality you should make is to consider whether there is a known or possible causal mechanism that can connect two variables together.
For this, you may need to think out of the box: the history of science is full of examples where people considered something impossible, in spite of an enormous amount of evidence to the contrary, because they were so firmly attached to their pet theory. The second test is temporal ordering: if a change in variable A has an effect on variable B, then the change in A should occur before the resultant change in B. If a person dies of radiation poisoning (B) then that person must have received a large dose of radiation (A) at some previous time. We can often test for temporal ordering with statistics, usually some form of regression. But be careful, temporal ordering doesn't imply a causal relationship. Imagine you have a variable X that affects variables A and B, but B responds faster than A. If X is unobserved, all we see is that A exhibits some behaviour that strongly correlates in some way with the previous behaviour of B. The third test is to determine in some way the size of the possible causal effect. That's where statistics comes in. From a risk analysis perspective, we are usually interested in what we can change about the world. That ultimately implies that we are only really interested in determining the magnitude of the causal relationships between variables we can control and those in which we are interested. Risk analysts are not scientists - our job is not to devise new theories but to adapt the current scientific (or financial, engineering, etc.) knowledge to help decision-makers make probabilistic decisions. However, as a breed, I like to think that we are quite adept at stepping back and asking whether a tightly held belief is correct, and then posing the awkward questions. 
It's quite possible that we can come up with an alternative explanation of the world supported by the available evidence, which is fine, but that explanation has to be presented back to the scientific community for their blessing before we can rely on it to give decision-making advice.

15.5 The Limits of Causal Arguments

My son is just starting his "Why?" phase. I can see the interminable conversations we will have: "Papa, why does a plane stay in the air?" "Because it has wings." "Why?" "Because the wings hold it up." "Why?" "Because when an airplane goes fast the wind pushes the wings up." "Why?" Dim memories of Bernoulli's equation won't be of much help. "I don't know" is the inevitable end to the conversation. I can see why kids love this game - once we get to three or four answers, we parents reach the limit of our understanding. He's soon going to find out I don't know everything after all, and I'll plummet from my pedestal (he's already realised that I can't mend everything he breaks). Causal thinking is the same. At some point we are going to have to accept the existence of the causal relationships we are using without really knowing why. If we're lucky, the causal link will be supported by a statistical analysis of good data, some experiential knowledge and a feeling that it makes sense. If we go back far enough, all that we believe we know is based on assumptions. My point is that, when you have completed your causal analysis, try to be aware that the analysis will always be based on some assumptions, so sometimes a simple analysis is all you need to get the necessary guidance for your problem.

15.6 An Example of a Qualitative Causal Analysis

Our company does a lot of work in the field of animal health, where we help determine the risk of introducing or exacerbating animal and human disease by moving animals or their products around the world.
This is a very well-developed area of risk analysis, and a lot of models and guidelines have been written to help ensure that there is a scientifically based rationale for accepting, rejecting and controlling such risks. Chapter 22 discusses animal health risk analysis. I present a risk analysis below as an illustration of the need for a healthy cynicism when reviewing scientific literature and official reports, and as an example of a causal analysis that I performed with absolutely no quantitative data for an issue for which we do not yet have a complete understanding.

15.6.1 The problem

A year ago I was asked to perform a risk analysis on a particularly curious problem with pigs. Postweaning multisystemic wasting syndrome (PMWS) affects pigs after they have finished suckling. I had had some dealings with this problem before in another court case. The "syndrome" part of the name means a pattern of symptoms, which is the closest veterinarians can come to defining the disease, since nobody knows for sure what the pathogen is that creates the problem. Until recently there hasn't even been an agreed definition of what the pattern of symptoms actually is. A herd case definition for PMWS was recently agreed by an EU-funded consortium (EU, 2005) led by Belfast University. The PMWS case definition on herd level is based on two elements: (1) the clinical appearance in the herd and (2) laboratory examination of necropsied (autopsy for animals) pigs suffering from wasting.

1. Clinical appearance on herd level

The occurrence of PMWS is characterised by an excessive increase in mortality and wasting post weaning compared with the historical level in the herd. There are two options for recognising this increase, of which 1a should be used whenever possible:

1a. If the mortality has been recorded in the herd, then the increase in mortality may be recognised in either of two ways:

1. Current mortality ≥ mean of historical levels in previous periods + 1.66 standard deviations.
2. Statistical testing of whether or not the mortality in the current period is higher than in the previous periods by the chi-square test.

In this context, mortality is defined as the prevalence of dead pigs within a specific period of time. The current time period is typically 1 or 2 months. The historical reference period should be at least 3 months.

1b. If there are no records of the mortality in the herd, an increase in mortality exceeding the national or regional level by 50 % is considered indicative of PMWS.

2. Pathological and histopathological diagnosis of PMWS

Autopsy should be performed on at least five pigs per herd. A herd is considered positive for PMWS when the pathological and histopathological findings indicative of PMWS are all present at the same time in at least one of the autopsied pigs. The pathological and histopathological findings are:

1. Clinical signs including growth retardation and wasting. Enlargement of inguinal lymph nodes, dyspnoea, diarrhoea and jaundice may be seen sporadically.
2. Presence of characteristic histopathological lesions in lymphoid tissues: lymphocyte depletion together with histiocytic infiltration and/or inclusion bodies and/or giant cells.
3. Detection of PCV2 in moderate to massive quantity within the lesions in lymphoid tissues of affected pigs (basically using antigen detection in tissue by immunostaining or in situ hybridisation).

Other relevant diagnostic procedures must be carried out to exclude other obvious reasons for high mortality (e.g. E. coli post-weaning diarrhoea or acute pleuropneumonia). The herd case definition is highly unusual: a result of the lack of identification of the pathogenic organism. It will need revision when more is known about the syndrome. The definition is also vulnerable from a statistical viewpoint. To begin with, the definition acknowledges the wasting symptom in PMWS, but the definitions only apply to mortality.
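The first option of criterion 1a can be sketched as a simple threshold check. This is an illustration only: the monthly mortality figures are invented, and a real surveillance system would also implement the chi-square alternative and take care with small herds:

```python
from statistics import mean, stdev

def pmws_mortality_flag(historical_rates, current_rate):
    """Criterion 1a, option 1: flag the herd if current mortality is at
    least the historical mean plus 1.66 historical standard deviations."""
    threshold = mean(historical_rates) + 1.66 * stdev(historical_rates)
    return current_rate >= threshold

# Invented monthly post-weaning mortality rates for one herd
history = [0.020, 0.022, 0.019, 0.021]

print(pmws_mortality_flag(history, 0.035))  # True: well above threshold
print(pmws_mortality_flag(history, 0.022))  # False: within normal variation
```

Note how sensitive the flag is to the spread of the historical rates: a herd with noisy records needs a much larger jump in mortality before it triggers, which is one aspect of the statistical vulnerability discussed here.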
PMWS can only be defined at a herd level because one has statistically to differentiate the increase in rate of mortality and wasting post weaning from historical levels in the herd or from other unaffected herds. Thus, for example, PMWS can never be diagnosed for a backyard pig using this definition. The chi-square test quoted above is based on making a normal approximation to a binomial variable. The approximation is only good if one has a sufficiently large number of animals n in a herd and a sufficiently high prevalence p of mortality or wasting in both unaffected and affected herds. Thus, it becomes progressively more difficult to differentiate an affected from an unaffected herd where the herd is small. The alternative requirement of prevalence at > 1.66 standard deviations above previous levels and the chi-square table provided in this definition are determined by assuming that one should only diagnose that a herd has PMWS when one is at least 95 % confident that the observed prevalence is greater than normal. This means that one can choose to declare a herd as PMWS positive when one is only 95 % confident that the fraction of animals dying or wasting is greater than usual. While one needs to set a standard confidence for consistency, this is illustrative of the difference in approach between statistics and risk analysis: in risk analysis one balances the cost associated with correct and incorrect diagnosis and chooses a confidence level that minimises losses. The definition has other statistical issues; for example, the use of prevalence assumes that a population is static (all in, all out) within a herd, rather than a continuous flow. It also does not take into account the possible effects of a deteriorated farm management that would raise the mortality and wasting rates, nor of an improved farm management whose improvements would balance against, and therefore mask, the increased mortality and wasting due to PMWS. Other definitions of PMWS have been used. 
New Zealand, for example, made its PMWS diagnosis on the basis of at least a 15 % post-weaning mortality rate together with characteristic histopathological lesions and the demonstration of PCV2 antigen in tissues. Denmark diagnoses the disease in a herd on the basis of histopathology and demonstration of PCV2 antigen in pigs with or without clinical signs indicative of PMWS, and regardless of the number of animals.
15.6.2 Collecting information
PMWS is a worldwide problem among domestic pig populations. It is very difficult to compare experiences in different countries because until recently there was no single agreed definition, and there are different motivations involved in reporting the problem. In one country I investigated, farmers were declaring they had PMWS with, it seemed, completely new symptoms - but when I talked confidentially to people "on the ground" I found out that, if the problem were declared to be PMWS, the farmers would be fully compensated by their government, whereas if it were another, more obvious issue they would not. Another country I investigated declared that it was completely free of PMWS, which seemed extraordinary given the ubiquitous nature of the problem and that genetically indistinguishable PCV2 had been detected at levels similar to those in other countries battling with PMWS. But the pig industry of this country wanted to keep out pork imports, and its freedom from the ubiquitous PMWS was a good reason justifiable under international trading law. The country used a different (unpublished) definition of PMWS that included the necessity of observing an increased wasting rate, and I was told that in its one suspected herd the pigs that were wasting were destroyed prior to the government assessment, with the result that the required wasting rate was not observed.
The essence of my risk analysis was to try to determine which, if any, of the various causal theories could be true and then determine whether one could find a way to control the import risk for our clients given the set of plausible theories. The main impediment to doing so was that it seemed every scientist investigating the problem had their own pet theory and completely dismissed the others. Moreover, they conducted experiments designed to affirm their theory, rather than refute it. I distilled the various theories into the following components:
Theory 1. PCV2 is the causal agent of PMWS in concert with a modulation of the pig's immune system.
Theory 2. A mutation (or mutations) of PCV2 is the causal agent (sometimes called PCV2A).
Theory 3. PCV2 is the causal agent, but only for pigs that are genetically more susceptible to the virus.
Theory 4. An unidentified pathogen is the causal agent (sometimes called Agent X).
Theory 5. PMWS does not actually exist as a unique disease but is the combination of other clinical infections.
Note that the five theories are not all mutually exclusive - one theory being true does not necessarily imply that the other theories are false. Theory 1 could be true together with theories 2 or 3 or both. Theories 2 and 3 are true only if theory 1 is true, and theories 4 and 5 each eliminate the possibility of all the other theories. A theory of causality can never be proved, only disproved - an absence of observation of a causal relationship cannot eliminate the possibility of that relationship. The five theories, with their partial overlap, were structured to provide the most flexible means of evaluating the cause of PMWS.
I did a review of all (15) pieces of meaningful evidence I could find and categorised the level of support that each gave to the five theories as follows: conflicts (C), meaning that the observations in this evidence would not realistically have occurred if the theory being tested were correct; neutral (N), meaning that the observations in this evidence provide no information about the theory being tested; partially supports (P), meaning that the observations in this evidence could have occurred if the theory being tested were correct, but other theories could also account for the observations; supports (S), meaning that the observations in this evidence could only have occurred if the theory being tested were correct.
15.6.3 Results and conclusions
The results are presented in Table 15.1.
Theory 1 (PCV2 immune system modulation causes PMWS). This theory is well supported by the available evidence. It explains the onset of PMWS post weaning and the presence of other infections, or vaccines, stimulating the immune system as being cofactors. It explains how the use of more stringent sanitary measures on a farm can help contain and avoid PMWS. On its own it does not explain the radially spreading epidemic observed in some countries, nor the difference in susceptibility observed between pigs and pig breeds.
Theory 2 (PCV2A). This theory is also well supported by the available evidence. It explains the radially spreading epidemic observed in some countries but does not explain the difference in susceptibility observed between pigs and between pig breeds.
Theory 3 (PCV2 genetic susceptibility). This theory is supported by the small amount of data available. It could explain the targeting of certain herds over others and the difference in attack rates between pig breeds.
Theory 4 (Agent X). This theory is unanimously contradicted by all the available evidence that could be used to test it.
Theory 5 (PMWS does not actually exist). This theory is unanimously contradicted by all the available evidence that could be used to test it.
As a result, I concluded (rightly or wrongly - at the time of writing we still don't know the truth) that it appears from the available evidence that PMWS requires at least two components to be established:
1. A mutated PCV2 that is more pathogenic than the ubiquitous strain(s). There may well be several different localised mutations of PCV2 in the world's pig population that have varying levels of pathogenicity. This would in part explain the high variance in attack rates in different countries, although farm practices, pig genetics and other disease levels will be confounders.
Table 15.1 Comparison of theories on the relationship between PCV2 and PMWS and the available evidence (S = supports; P = partially supports; N = neutral; C = conflicts).
2. Some immune response modulation, due to another disease, stress, a live vaccine, etc. The theory that PMWS requires an immune system modulation is particularly well supported by the data, both in in vitro and in vivo experiments and from field observations that co-infection and stress are major risk factors.
There is also some limited, but very convincing, evidence (Evidence 15) from Ghent University (by coincidence the town I live in) that the onset of PMWS is related to a third factor:
3. Susceptibility of individual pigs to the mutated virus. The evidence collected for this report suggests that the variation in susceptibility, while genetic in nature, is not obviously linked to the parents of a pig. The apparent variation in susceptibility owing to breed may mean that susceptibility can be inherited over many generations, i.e.
that there will be a statistically significant difference over many generations, but the variation between individuals in a single litter would exceed the generational inherited variation.
15.7 Is Causal Analysis Essential?
In human and animal health risk assessment, we attempt to determine the causal agent(s) of a health impact. Once determined, one then attempts to apportion that risk among the various sources of the causal agent(s), if there is more than one source. Some risk analysts, particularly in the area of human health, argue that a causal analysis is essential to performing a correct risk analysis. The US Environmental Protection Agency, for example, in its guidelines on hazard identification, describes the first step in its risk analysis process: "The objective of hazard identification is to determine whether the available scientific data describe a causal relationship between an environmental agent and demonstrated injury to human health or the environment". Their approach is understandable. It is extremely difficult to establish any causal relationship between a chemical and any human effect that can arise owing to chronic exposure to that chemical (e.g. a carcinogen), since many chemicals can precipitate the onset of cancer, and that onset may only eventuate after many years of exposure, probably to many different carcinogens. We can't start by assuming that all chemicals can cause cancer. On the other hand, we may fail to identify many carcinogens because the data and scientific understanding are not there. If we are to protect the population and environment, we have to rely on the suspicion that a chemical may be carcinogenic because of similarities with other known carcinogens, and act cautiously until we have the evidence that eliminates that suspicion.
In microbial risk assessment, the problem is simpler: either an exposure to bacteria will immediately result in infection or the bacteria will pass through the human gut without effect, and cultures of stools or blood analyses will usually tell us which bacterium has caused the infection. By definition, Campylobacter causes campylobacteriosis, for example, so the risk of campylobacteriosis must logically be distributed among the sources of Campylobacter, because if all sources of Campylobacter were removed in a counterfactual world there would be no more campylobacteriosis. I am of the view that we should definitely take the first step of hazard identification and attempt to amass causal evidence, but a lack of evidence should not lead us to dismiss a suspected hazard from concern, although clear evidence of a lack of causality should. We should also perform broad causal studies with an open mind because, although a strong though unsuspected statistical inference does not prove a causal relationship, finding one may nevertheless offer some lines of investigation leading to the discovery of previously unidentified hazards.
Chapter 16 Optimisation in risk analysis, by Dr Francisco Zagmutt, Vose Consulting US
16.1 Introduction
Analysts are often faced with the question of how to find a combination of values for interrelated decision variables (i.e. variables that one can control) that will provide an optimal result. For example, a bakery may want to know the best combination of materials to make good bread at a minimum price; a portfolio manager may want to find the asset allocation that yields the highest returns for a certain level of risk; or a medical researcher may want to design a battery of tests that will provide the most accurate results. The purpose of this chapter is to introduce the reader to the basic principles of optimisation methods and their application in risk analysis.
For more exhaustive treatments of different optimisation methods, readers are directed to specialised books on the subject, such as Rardin (1997), Dantzig and Thapa (1997, 2003) and Bazaraa et al. (2004, 2006). Optimisation methods aim to find the values of a set of related variable(s) in the objective function that will produce the minimum or maximum value as required. There are two types of objective function: deterministic and stochastic. When the objective function is a calculated value in the model (deterministic), we simply find the combination of parameter values that optimises this calculated value. When the objective function is a simulated random variable, we need to decide on some statistical measure associated with that variable that should be optimised (e.g. its mean, its 95th percentile or perhaps the ratio of standard deviation to mean). Then the optimising algorithm must run a simulation for each set of decision variable values and record the statistic. If one wanted, for example, to minimise the 0.1th percentile, it would be necessary to run thousands of iterations, for each set of decision variable values tested, to have a reasonable level of accuracy - and that can make optimising under uncertainty very time consuming. As a general rule, we strongly advise that you try to find some means to calculate the objective function if at all possible. ModelRisk, for example, has many functions that return statistical measures for certain types of model, and the relationships between stochastic models discussed in Chapter 8 can help greatly simplify a model. Let's start by introducing an example. When a pet food manufacturer wants to make an economically optimal allocation of ingredients for a dog formula, it may have the choice of using different commodities (i.e.
corn or wheat as the main source of carbohydrates), but the company will want to use the combination of components that minimises the cost of manufacturing without losing nutritional quality. Since the price of commodities fluctuates over short periods of time, the feed inputs will have to be optimised every time a new contract for commodities is placed. Hence, an optimal feed would be one that minimises the ration cost but also maintains the nutritional value of the feed (i.e. the required carbohydrate, protein and fat contents in a dog's healthy diet). With this example we have introduced the reader to the concept of constrained optimisation, where the objective is still to minimise or maximise the output from a function by varying the input variables, but now the values of some input variables are constrained to only feasible values of those variables (the nutritional requirements). Going back to the dog feed example, if we know that adult dogs require a minimum of 18 % protein (as % of dry matter), then the model solution should be constrained to the combination of ingredients that will minimise the cost while still providing at least 18 % protein. An input can take more than one constraint; for example, dogs may also have a maximum protein requirement (to avoid certain metabolic diseases) which can also be constrained in the model. The optimal blending of diets is in fact a classical application of linear programming, an area of optimisation that will be revisited later in this chapter. Optimisation requires three basic elements:
The objective function f and its goal (minimisation or maximisation). This is a function that expresses the relationship among the model variables. The outputs from the objective function are called responses, performance measures or criteria.
Input variable(s), also called decision variables, factors, parameter settings and design variables, among many other names.
These are the variables whose values we want to experiment with using the optimisation procedure, and that we can change or control (make a decision about, hence the name decision variable).
Constraints (if needed), which are conditions that a solution to an optimisation problem must satisfy to be acceptable. For example, when only limited resources are available, that constraint should be explicit in the optimisation model. Variable bounds represent a special case of constraints. For example, diet components can only take positive values; hence they are bounded at zero.
Throughout this chapter we will review how these elements combine to create an optimisation model. The field of optimisation is vast, and there are literally hundreds of techniques that can be used to solve different problems. However, in practical terms the main differences between methods reside in whether the objective function and constraints are linear or non-linear, whether the parameters are fixed or include variability and/or uncertainty, and whether all or some parameters are continuous or integers. The following sections give the background to basic optimisation methods, and then present practical examples.
16.2 Optimisation Methods
There are many optimisation methods available in the literature and implemented in commercial software. In this section we introduce some of the most widely used methods in risk analysis.
16.2.1 Linear and non-linear methods
In Section 16.1 we presented a diet blend model and mentioned that it was a typical linear programming application. This model is linear since the objective function and constraints are linear. The general form of a linear objective function can be expressed as:

max/min f(x1, x2, ..., xn) = a1x1 + a2x2 + ... + anxn    (16.1)

where f is the objective function to be minimised or maximised, and the xi and ai are the input variables and their respective coefficients.
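To make the linear form concrete, here is a toy version of the diet blend problem from Section 16.1, solved not with the simplex algorithm but with a brute-force search over a 1 % grid; the ingredient costs and protein contents are invented numbers, purely for illustration.

```python
# Hypothetical two-ingredient dog feed: fractions of corn and soymeal sum to 1.
# Costs and protein contents are made-up numbers chosen for this example.
cost = {"corn": 0.10, "soymeal": 0.30}       # $ per kg of ingredient
protein = {"corn": 0.09, "soymeal": 0.44}    # protein fraction of dry matter

best = None
for i in range(101):                         # corn fraction in 1 % steps
    corn = i / 100
    soy = 1 - corn
    # linear constraint: the blend must contain at least 18 % protein
    if protein["corn"] * corn + protein["soymeal"] * soy < 0.18:
        continue
    # linear objective, as in (16.1): total cost of the blend
    c = cost["corn"] * corn + cost["soymeal"] * soy
    if best is None or c < best[0]:
        best = (c, corn, soy)

print(best)   # cheapest feasible blend: about 74 % corn, 26 % soymeal
```

The search simply keeps the cheapest blend that satisfies the protein constraint; a real formulation with many ingredients would use a linear programming solver instead of a grid.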
The objective function can be subject to constraints of the form:

b1x1 + b2x2 + ... + bnxn <= c    (16.2)

(and similarly with >= or =). Equation (16.2) shows that the constraints imposed on the optimisation problem must also be linear for it to be considered a valid linear optimisation problem. From Equations (16.1) and (16.2) we can deduce two important assumptions of linear optimisation: additivity and proportionality.
Additivity entails that the values of the objective function are the result of the sum of all the variables multiplied by their coefficients, independently. In other words, the increase in the result of the objective function will be the same whether a certain variable increases from 10 to 11 or from 50 to 51.
Proportionality requires that the value of a term in the linear function is directly proportional to the amount of that variable in the term. For example, if we are optimising a diet blend, the total cost of corn in the blend is directly related to the amount of corn used in the blend. Hence, for example, the concept of economies of scale would violate the assumption of proportionality, since the marginal cost decreases as we increase production.
The most common methodology for solving linear programming problems is called the simplex algorithm, which was invented by George Dantzig in 1947 and is still used to solve purely linear optimisation problems. For a good explanation of the simplex methodology the reader is directed to the excellent book by Dantzig and Thapa (1997). We cannot apply linear programming if our objective function includes a multiplicative term such as f(x1, x2) = a1x1 * a2x2, because we would be violating the additivity assumption. Recall that we mentioned that a unit increase in a decision variable will have the same impact on the result of the objective function, regardless of the current absolute value of the variable.
We can't make this assumption with our multiplicative example, since now the impact that a change in one variable has on the objective function will depend on the size of the other variable by which it is multiplied. For example, in a simple function f(x) = ax^2, with a = 5, if we increase x from 1 to 2, the result will change by 15 units (5 * 2^2 - 5 * 1^2), whereas if x increases from, say, 6 to 7, the function will change by 65 units (5 * 7^2 - 5 * 6^2). Non-linear problems impose an extra challenge in optimisation, since they may present more than one minimum or maximum depending on the domain being evaluated. Optimisation methods aiming at finding the absolute largest (or smallest) value of the objective function in the domain observed are called global optimisation methods. We will discuss different approaches to global optimisation in Section 16.3. The final type of function to consider is one where the relationships are not only non-linear but also non-smooth. For example, the relationships among some variables in the model may use Boolean logic (e.g. IF, VLOOKUP, INDEX, CHOOSE), with the effect that the function will present sudden changes, e.g. drastic jumps or drops, making it uneven or "jumpy". These functions are particularly hard to solve using standard non-linear programming methods and hence require special techniques to find reasonable solutions.
16.2.2 Stochastic optimisation
Stochastic optimisation has received a great deal of attention in recent years. One of the reasons for this growth is that many applied optimisation problems are too complex to be solved mathematically (i.e. using the linear and non-linear mathematical methods described in the previous section).
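The arithmetic in the a = 5 example above is easy to verify directly, and shows why additivity fails for a non-linear term:

```python
def f(x, a=5):
    return a * x ** 2   # the simple non-linear function from the text

# The same unit step in x changes f by different amounts, so the effect of
# a change depends on the current value of x - additivity is violated.
print(f(2) - f(1))   # 15
print(f(7) - f(6))   # 65
```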
Stochastic optimisation is the preferred methodology when problems include many complex combinations of options and/or relationships that are highly non-linear, since such problems either are impossible to solve mathematically or cannot feasibly be solved within a realistic timeframe. Simulation optimisation is also essential if the parameters of the model are random or include uncertainty, which is usually the case in many of the models applied to real-world situations in risk analysis. Fu (2002) presents a summary of current methodologies in stochastic optimisation, and some of the applications of this method. Most commercial stochastic optimisation software uses metaheuristics to find the optimal solutions. In this method, the simulation model is treated as a black-box function evaluator, where the optimiser has no knowledge of the detailed structure of the model. Instead, combinations of the decision variables that achieve desirable results (i.e. minimise the objective function more than other combinations) are stored and recombined by the optimiser into updated combinations, which should eventually find better solutions. The main advantage of this method is that it does not get "stuck" in local minima or maxima. Some software vendors claim that this methodology also finds optimal values faster than other methods, but this is not necessarily true, especially when the optimisation problem can be quickly solved with well-formulated mathematical functions. Usually, three steps are taken at each iteration of the stochastic optimisation:
1. Possible solutions for the variables are found.
2. The solutions found in the previous step are applied to the objective function.
3. If the stopping criterion is not met, a new set of solutions is calculated after the results of the previous combinations are evaluated. Otherwise, stop.
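The three-step loop above can be sketched with the simplest possible black-box optimiser, a pure random search. This is only a stand-in for the commercial metaheuristics mentioned (which recombine good solutions rather than sampling blindly), and the test function is invented for the example:

```python
import random

def random_search(objective, bounds, iterations=2000, seed=1):
    """Black-box optimisation loop: propose a solution (step 1), evaluate it
    on the objective (step 2), and keep the best candidate until the
    iteration budget - our stopping criterion - is exhausted (step 3)."""
    rng = random.Random(seed)
    lo, hi = bounds
    best_x, best_f = None, float("inf")
    for _ in range(iterations):
        x = rng.uniform(lo, hi)      # step 1: propose a candidate solution
        fx = objective(x)            # step 2: apply it to the objective
        if fx < best_f:              # step 3: retain improving candidates
            best_x, best_f = x, fx
    return best_x, best_f

# Minimise an arbitrary non-linear objective over [-5, 5]
x, fx = random_search(lambda v: (v - 1.5) ** 2, (-5, 5))
print(round(x, 1))   # close to 1.5, the true minimum
```

The optimiser never inspects the structure of the objective; it only records which proposed solutions performed well, which is exactly the black-box treatment described above.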
Although the above process is conceptually simple, the key to a successful stochastic optimisation resides in the last step, because trying all the combinations of values from different random variables becomes unfeasible (especially when the model includes continuous variables). For this reason, most implementations of stochastic optimisation focus their efforts on how to narrow the potential solutions based on the solutions already known. Some of the methods used for this purpose include genetic algorithms, evolutionary algorithms, simulated annealing, path relinking, scatter search and tabu search, to name a few. It is beyond the objective of this chapter to review these methodologies, but interested readers are directed to the chapter on metaheuristics in Pardalos and Resende (2002), and to the work by Goldberg (1989) and by Glover, Laguna and Marti (2000). Most commercial Excel add-ins include metaheuristic-based stochastic optimisation algorithms. Some of the most popular include OptQuest for Crystal Ball, RISKOptimizer for @RISK and, very recently, Risk Solver. Similar tools are also available for discrete-event simulation suites. There is also a myriad of statistical and mathematical packages, such as R, SAS and Mathematica, that allow for complicated optimisation algorithms. At Vose Consulting we rely quite heavily on these applications (particularly R) when developing advanced models, but we will stick to Excel-based optimisers here to avoid having to explain their syntax structure.
16.3 Risk Analysis Modelling and Optimisation
In this section we introduce the reader to some applied principles for implementing optimisation models in a spreadsheet environment, and then briefly explain the use of the different possible settings in Solver, the default optimisation tool in Excel.
16.3.1 Global optimisation
In the previous section we discussed some of the limitations of linear programming, including the problem of local minima and maxima depending on the starting values. Figure 16.1 shows a simple function of the form f(x) = sin(cos(x) exp(...)). The function has several peaks (maxima) and valleys (minima) within the plotted range. A function like this is called non-linear (changes in f(x) are not monotonically increasing with x), and also non-convex (i.e. line segments drawn from any point to another point can lie above or below the graph of f(x), depending on the region of the function domain). Optimisation software like Excel's Solver and other linear and non-linear constrained optimisation software follow a path from the starting values to the final solution values, using as a guide the direction and curvature of the objective function (and constraints). The algorithm will usually stop at the minimum or maximum closest to the initial values provided, making the optimiser output quite sensitive to the starting values. For example, if the function in Figure 16.1 is to be maximised and the starting value is close to the smaller peak (Max 1), the "best" solution the software will find will be Max 1, when in fact the global peak for this particular function is located at Max 2. Evidently, in most risk analysis applications the desirable solution will be the highest (or the lowest) peak and not a local one. In other words, we always want to make sure that the optimisation is global rather than local. Depending on the software used, there are several ways to make sure we obtain a global optimisation. Excel's Solver is among the most broadly used optimisation software, as it is part of the popular spreadsheet bundle, and its algorithms are very sensitive to the initial values provided by the analyst. Thus, when possible, the entire feasible range of the objective function should be plotted to identify the global peaks or valleys.
From evaluating the graph, a rough estimate can then be used as an initial value. Consider the model shown in Figure 16.2. The objective function is again f(x) = sin(cos(x) exp(...)) and is unconstrained within the boundaries shown (-4.2 to 8). From plotting the function, we know the global maximum is somewhere close to -0.02, so we will use this value in Solver. To do so, we first enter the value -0.02 into cell x (C2), then we select Tools → Add-Ins, check the Solver Add-In box and click the OK button. Then go back to Excel and select Tools → Solver to obtain the menu shown in Figure 16.3.
Figure 16.1 A non-linear function presenting multiple maxima and minima.
Figure 16.2 Sensitivity of Excel's Solver to local conditions. The dot represents the optimal solution found by Solver.
Under "Set Target Cell" we add a reference to the named cell fx (C3); then, since in this example we want to maximise the function, we select "Equal To" Max, and we finally add a reference to the named cell x (C2) under the "By Changing Cells" box. Now we are ready to run the optimisation procedure (we will see more about the Solver menus and options later in this chapter; for now we will use the default settings). We click the "Solve" button and after a very short period we should see a form stating that a solution has been found.
Figure 16.3 Excel's Solver main menu.
Select the "Keep Solver Solution" option and click the "OK" button. We can see that Solver successfully found the global maximum since we provided a good initial value. What would happen if we didn't provide a reasonable initial value? If we repeated the same procedure but started with, say, -3 in cell x, we would obtain a maximum at -3.38, which turns out to be the first peak (Max 1 in Figure 16.1). If we started with a larger value, i.e. 4, Solver would find 6.04 as the optimal maximum, which is Max 3 in Figure 16.1.
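The sensitivity to starting values described above can be reproduced with a naive hill-climber in a few lines. The function used here is a deliberately simple stand-in (sin x), not the book's plotted function, but the behaviour is the same: the search stops at whichever local peak is nearest the start.

```python
import math

def hill_climb_max(f, x0, step=0.01, max_iters=10000):
    """Crude local search: step in whichever direction increases f and stop
    at the nearest local maximum - the answer depends entirely on x0."""
    x = x0
    for _ in range(max_iters):
        if f(x + step) > f(x):
            x += step
        elif f(x - step) > f(x):
            x -= step
        else:
            break
    return x

# sin(x) has local maxima at pi/2 + 2*k*pi; different starts find different peaks
print(round(hill_climb_max(math.sin, 1.0), 2))   # ~1.57, i.e. pi/2
print(round(hill_climb_max(math.sin, 6.0), 2))   # ~7.85, i.e. 5*pi/2 - another peak
```

Both answers are genuine local maxima of the same function; which one the search reports is decided purely by the starting value, just as with Solver's path-following algorithms.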
The reader can use the supplied model to try initial values when looking for minima and maxima and explore how the optimisation algorithm behaves, particularly to notice the model's behaviour when the Solver options (e.g. linearity assumption, quadratic estimates) are changed. An alternative for dealing with local minima and maxima is to restrict the domain to be evaluated. We have already limited the domain by exploring only a limited section of our objective function (-4 to 8). However, the domain still contained several peaks and valleys. In contrast, if the domain observed contains only one peak or valley (e.g. (-2, 2)), the function becomes concave (or convex), which can be solved with a variety of fast and reliable techniques, such as the interior point methods readily implemented in Solver. Since we know the global peak resides somewhere around zero, we can restrict the domain of the objective function to (-2, 2) using the constraint feature in Solver. First enter -2 in cell C6 and 2 in cell C7. Then name the cells "Min" and "Max" respectively. After that, open Solver and click on the Add button. Type "x" under Cell Reference, select <= and then type "=Max" in the Constraint box. Once that is completed, click the Add button and, following the same procedure, add the second constraint, x >= Min. Once both constraints are added, click OK and then Solve. Solver should find an optimal x close to -0.25, which is the global maximum; so, even though the function has many local optimal values, we have now successfully restricted the domain enough that the numerical method can easily find the optimal values. Even if an aberrant number is entered (e.g. 1000) as the initial value, the domain is so narrow now that the algorithm will still find the optimal value. Try it! When the function is not tractable (e.g.
complex simulation models), plotting is not an option, since the figure could be k-dimensional (and we all have a hard time interpreting elements with more than three dimensions). Hence, in this case, if the user plans on using Solver, he or she should attempt different initial values manually, based on knowledge of the system being modelled. Another, more automated option is to use more sophisticated applications that rely on metaheuristic methods, as explained in Section 16.2.2. Later in this chapter we present the solution to a problem where the function not only is intractable but also is highly non-linear and non-smooth and contains a series of integer decision variables and complex constraints. Commercial optimisation software uses different methods to make sure only global optimal solutions are found. As already discussed, metaheuristic methods can be very efficient in finding global optimal solutions. Other commercial software relies upon multistart methods for global optimisation, which automatically try different starting values until a global solution is found. Although they are reasonably effective, such methodologies can be quite time consuming when solving highly non-linear and non-smooth functions, or when little is known about the parameters to optimise (uninformed starting values).
16.3.2 A few notes on using Excel's Solver
We have already mentioned that Excel's Solver is an optimisation tool built into Microsoft's Excel and shipped with all copies of Excel. Although the tool has limitations, it can be used in a variety of situations where stochastic simulation is not required. Solver implements a variety of algorithms to solve linear and non-linear problems. It uses the generalised reduced gradient (GRG) algorithm to solve non-linear programming problems and, when the correct settings are used, it can use the simplex method, a well-known and robust method for solving linear optimisation problems.
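The multistart idea discussed above is easy to sketch: run the same local search from several starting values spread across the domain and keep the best result. The hill-climber and the multimodal test function here are invented for illustration; commercial multistart implementations are considerably more sophisticated about choosing and pruning starting points.

```python
import math

def local_max(f, x0, step=0.01, max_iters=10000):
    """Naive local search that stops at the nearest local maximum."""
    x = x0
    for _ in range(max_iters):
        if f(x + step) > f(x):
            x += step
        elif f(x - step) > f(x):
            x -= step
        else:
            break
    return x

def multistart_max(f, lo, hi, starts=21):
    """Multistart global optimisation: launch a local search from evenly
    spaced starting values and return the best local maximum found."""
    xs = [lo + i * (hi - lo) / (starts - 1) for i in range(starts)]
    return max((local_max(f, x0) for x0 in xs), key=f)

# A multimodal test function with its global maximum at x = 0
f = lambda x: math.cos(x) * math.exp(-0.05 * x * x)

print(round(local_max(f, 6.0), 1))         # a single start gets stuck near 5.8
print(round(multistart_max(f, -10, 10), 1))  # multistart finds the peak near 0
```

A single search launched from x = 6 stalls on a nearby local peak, while the multistart wrapper recovers the global maximum, which is exactly the failure mode and remedy described in the text.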
The mysterious Options menu in Solver
It is likely that many readers have used or tried to use Solver in the past and have managed fairly well. It is also likely that the reader has clicked on the Options button and didn't quite understand the meaning of all the settings. Furthermore, many readers may have found the explanations in the help file rather cryptic, so we will explain the various options. We have already explained in previous sections how to use the general Solver menu. Now we will focus on the menus that appear under the Options button. To get there, select Tools → Solver and then click the Options button. The menu in Figure 16.4 should be displayed.
Figure 16.4 The Options menu in Excel's Solver.
We briefly describe the meaning of each option below:
The Load Model and Save Model buttons enable the user to recall and retain model settings so they don't need to be re-entered every time the optimisation is run.
Max Time limits the time taken to find the solution (in seconds). The default 100 seconds should be appropriate for standard linear problems.
Iterations restricts the number of iterations the algorithm can use to find a solution.
Precision is used to determine the accuracy with which the value of a constraint meets a target value. It is a fractional number between 0 and 1: the higher the precision, the smaller the number (i.e. 0.01 is less precise than 0.0001).
Tolerance applies only to integer constraints and is the level of tolerance (as a percentage) by which a solution satisfying the constraints can differ from the true optimal value and still be considered acceptable. In other words, the lower the tolerance level, the longer it will take for the solutions to be acceptable.
Convergence applies only to non-linear models and is a fractional number between 0 and 1.
If after five iterations the relative change in the objective function is less than the convergence value specified, Solver stops. As with precision and tolerance, the smaller the convergence number, the longer it will take to find a solution (up to Max Time, that is). Lowering the precision, tolerance and convergence values will slow down the optimisation, but it may help the algorithm to find a solution. In general, these defaults should only be changed if Solver is experiencing problems finding an optimal solution. Assume Linear Model is a very important choice. If the optimisation problem is truly linear, then this option should be chosen because Solver will use the simplex method, which should find a solution faster and be more robust than the default optimisation method. However, the function has to be truly linear for this option to be used. Solver has a built-in algorithm that checks for linearity conditions, but the analyst should not rely solely on this to assess the model structure. When the option Show Iteration Results is selected, Solver will pause to show the result of each iteration and will require user input to initiate the next iteration. This option is certainly not recommended for computationally intensive optimisations. When selected, Use Automatic Scaling will rescale the variables in cases where variables and results differ greatly in magnitude. Assume Non-Negative will bound at zero all the decision variables that have not been explicitly constrained. It is preferable, however, to specify the variable boundaries explicitly in the model. The Estimates section allows one to use either a Tangent method or a Quadratic method to estimate the optimal solution. The tangent method extrapolates from a tangent vector, whereas the quadratic method is the method of choice for highly non-linear problems. The Derivatives section specifies the differencing method used to estimate partial derivatives of the objective and constraint functions (when differentiable, of course).
In general, Forward should be used for most problems, where the constraint values change slowly, whereas the Central method should be used when the constraints change more dynamically. The Central method can also be chosen when Solver cannot find improving solutions. Finally, the Search section allows one to specify the algorithm used to determine the direction to search at each iteration. The options are Newton, a quasi-Newton method to be used when speed is an issue and computer power is a limiting factor, and Conjugate, the preferred method when memory is an issue but speed can be slightly compromised. Automating Solver with Visual Basic for Excel One of the most powerful tools in Excel is its integration with Visual Basic for Applications (VBA). This integration can also be extended to optimisation models with Solver. We will use the model presented in Section 16.3.1 to show how to automate Solver in Excel. The steps are:
1. Record a macro using Tools → Macro → Record New Macro and name the macro accordingly (e.g. "SolverRun").
2. Open the Solver form as previously explained and press Reset All to clear existing settings.
3. Repeat the steps followed to optimise the model (e.g. set the objective function, decision variables and constraints, and click the Solve button).
4. Once Solver has found a solution, stop recording the macro by clicking on the small red square in the macro toolbar, or by using Tools → Macro → Stop Recording.
5. Use the Forms toolbar to add a button to the sheet.
6. Assign the macro (e.g. "SolverRun") to the button by double-clicking on it while in Design Mode and typing "Call SolverRun" in the procedure. For example, assuming the button is called CommandButton1, the VBA procedure should look as follows:

Private Sub CommandButton1_Click()
    Call SolverRun
End Sub

7.
Add a reference to Solver in Visual Basic by pressing Alt+F11, then, in the Visual Basic menu, select Tools → References and make sure the box next to "Solver" is selected.
8. The VBA code for the recorded macro should look similar to the example below:

Sub SolverRun()
    'This macro runs Solver automatically
    SolverOk SetCell:="$C$3", MaxMinVal:=1, ValueOf:="0", ByChange:="$C$2"
    SolverAdd CellRef:="$C$2", Relation:=1, FormulaText:="Max"
    SolverAdd CellRef:="$C$2", Relation:=3, FormulaText:="Min"
    SolverOk SetCell:="$C$3", MaxMinVal:=1, ValueOf:="0", ByChange:="$C$2"
    SolverSolve UserFinish:=True
End Sub

Notice we have added an extra line, "SolverSolve UserFinish:=True", which suppresses the Solver Results dialog from being shown at the end of the optimisation. Now everything should be ready to use the macro. Make sure to exit Design Mode and click on the button. The resulting model is not shown here but is provided for the user to explore. 16.4 Working Example: Optimal Allocation of Mineral Pots This exercise is based on a simplified version of a real-life example taken from our consulting work. A metallurgic company processes metal into 14 small containers called pots. The contents of the pots are then split among four larger tubs, which are then used to create the final metal product that is sold. The resulting product receives a premium based on its level of purity (lack of unwanted minerals). Since the input ore differs from batch to batch, the impurity levels will likely differ too. It is then economically important to achieve a certain purity level among batches while avoiding "bad" levels. The goal of the model is to optimise the allocation of pot metal contents into tubs in order to achieve a certain purity level in the final product.¹ Note that, in reality, since the impurity level is estimated from samples, there is uncertainty about the actual impurity level of each batch.
The client required that one was, say, 90 % confident that the concentration of each impurity in a tub was less than a certain threshold. Since speed was an important issue for the client, we avoided simulation by using classical statistics estimates of a mean (Chapter 9) to determine the 10th percentile of the uncertainty distribution for the true concentration in a tub. For each pot, the variables are: purity of metal A (as a percentage of total weight); purity of metal B (as a percentage of total weight); weight (in pounds). As the reader may imagine, the plant's operations present several constraints to be modelled, which are listed below:
1. A minimum of 1000 lb should be taken per pot.
2. The quantities taken from the pots are measured in discrete increments of 20 lb.
3. A maximum of five pots can be allocated to a given tub.
4. Pots can only be split in two parts (i.e. the contents of a pot cannot be split across three or four different tubs).
5. The maximum metal tonnage taken from a pot is equal to the pot weight (obvious, but this needs to be explicit in the model).
6. Every pot should be allocated to at least one tub (no "leftover" pots).
7. The maximum and minimum weights contained per tub are constrained (for this example, a minimum of 5000 lb and a maximum of 10 000 lb are assumed).
Given the number of constraints and possible combinations to be optimised, this model would be quite complex to define in mathematical terms (especially when considering parameter uncertainty), and hence a more practical approach is to use optimisation software. For this particular example, we employed OptQuest with Crystal Ball for its ease of use and connection with Excel, but other commercial spreadsheet add-ins could be used to achieve similar results. OptQuest is used here for a deterministic model but handles stochastic optimisation equally well.
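To make the constraint set concrete, here is a sketch (our own illustration in Python, not the OptQuest model itself) of a single feasibility test over a candidate allocation, held as a 14 × 4 matrix of pounds taken from each pot into each tub. The function name, defaults and data layout are all assumptions:

```python
def allocation_ok(alloc, pot_weights,
                  min_take=1000, step=20, max_pots_per_tub=5,
                  tub_min=5000, tub_max=10_000):
    """Return True if a candidate allocation satisfies all seven constraints.

    alloc[p][t] = pounds taken from pot p into tub t (0 means not allocated).
    """
    n_pots, n_tubs = len(alloc), len(alloc[0])
    for p in range(n_pots):
        takes = [alloc[p][t] for t in range(n_tubs) if alloc[p][t] > 0]
        if any(x < min_take or x % step != 0 for x in takes):
            return False                # constraints 1 and 2
        if not 1 <= len(takes) <= 2:
            return False                # constraints 6 and 4
        if sum(takes) != pot_weights[p]:
            return False                # constraint 5 plus "no leftover material"
    for t in range(n_tubs):
        column = [alloc[p][t] for p in range(n_pots) if alloc[p][t] > 0]
        if len(column) > max_pots_per_tub:
            return False                # constraint 3
        if not tub_min <= sum(column) <= tub_max:
            return False                # constraint 7
    return True
```

In the simulation-requirement style described next, a candidate failing such a test is simply discarded rather than penalised inside the objective function.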
One powerful feature of simulation optimisation is that complex constraints such as those imposed in this model can be specified by removing the scenarios that violate them rather than by including them explicitly in the objective function. Such constraints are sometimes called simulation requirements. Although this approach can be slower than incorporating the constraint directly in the model, it allows for very complex interactions in the model. Also, the model can be sped up significantly by compiling many input variables into only one requirement variable. Figure 16.5 shows the general structure of the model. Cells with a grey background represent input variables (variables that are changed during the optimisation process), and cells with a black background are requirements that are used to set the model constraints and define the objective function.
Figure 16.5 The metallurgic optimisation model implemented in Excel.
Figure 16.6 Dialogue to create decision variables in OptQuest with Crystal Ball.
¹ Another goal was to optimise for several purity levels by their dollar premiums, but that is omitted here for simplicity.
The larger table in range G2:J17 contains the purity levels for both minerals and the weight of each pot. The small table on the right contains the target purity levels that the model will optimise towards. The "Pounds" table (range B2:E16) contains the input variables that are modified during the optimisation. By selecting Define → Define Decision in Crystal Ball's menu, the user will see the form shown in Figure 16.6 with the settings for cell C3. The variables are discrete and can only increment in steps of 20 pounds (constraint 2), and are constrained to a fixed minimum of 1000 lb (constraint 1) and a maximum equal to the total content of the pot, which will vary from batch to batch; hence, the maximum value is linked to the cell that contains the pot weight. Similar variables are created for each combination of pots in tub 1, the only difference being the cell reference for their maximum weight. Decision variables are only needed for the first tub since the allocation for the other tubs is calculated on the basis of the initial allocation to the first tub. Thus, the remaining cells in the "Pounds" matrix are left empty or with a constant value of 1. The "Switches" matrix (range B19:E34) contains input variables that can only take values of 0 or 1. The set of input variables from the "Pounds" matrix is multiplied by the variables in the "Switches" matrix to generate the output matrix "Output for objective Fx". Notice that, for the "Switch" variables, input variables are only needed for the first three tubs, because the fourth tub can be filled with what is left in the pots after their contents have been allocated to the other three. The remaining components of the model are the constraints and the objective function. As previously mentioned, for this model some constraints are built into the simulation model, whereas others are set as scenarios that cannot be included in the optimal solution.
Hence, anything that does not meet the requirements is "tossed" from the set of possible options. The equation for pot 6, tub 1, in the output matrix incorporates constraint 3, and summarises as: "if five pots have already been allocated to tub 1, then do not allocate the product from pot 6 to tub 1; otherwise, allocate the content defined in cell C8 (which is a decision variable as in Figure 16.6) multiplied by cell C26". The first part of this equation limits to five the number of pots that can be allocated to the tub (constraint 3). The second part (the multiplication of two cells) is used to make sure that there is no bias in the order of the allocation of the pots to tubs (by using the binary decision variables in the "Switch" matrix). The same logic is used for pots 7 to 14 in tub 1. For tubs 2 and 3, the equation for pot 6 is modified so that we add "if the remaining weight left in the pot is less than 1000 pounds, do not allocate any metal to this tub (constraint 1); otherwise, allocate the remaining material from pot 6 into this tub". The subtraction from the total pot weight satisfies constraint 5. The reader will notice that, since we can only allocate one pot to one or two tubs (constraint 4), there is no need for an input variable in columns D and E, since the material allocated to tubs 2 to 4 depends on whether tub 1 received material from a given pot. Thus, the pot/tub combinations for tubs 2 to 4 in the "Pounds" matrix contain only 1s, so a 1 is returned when multiplied by the 1s from the "Switch" matrix.
² For some reason unknown to this author, sometimes the cell reference in the decision variables may be lost after opening OptQuest and replaced by the last number in the cell, e.g. the maximum weight entered for the pot (we are using OptQuest with Crystal Ball v. 7.3, Build 7.3.814). Readers should be aware of this issue when using this and other models with dynamic referencing on decision variable parameters.
Finally, metal from a pot that has not been allocated to tubs 1 to 3 (and that amounts to at least 1000 pounds) is allocated to tub 4. As for the other tubs, the formulas from pot 6 onwards are constrained so that no more than five pots can be allocated to one tub. We cannot waste any remaining material in a pot, of course, so another exogenous constraint (requirement) that we add is that the sum of the pounds allocated from a pot should be exactly the same as the total weight of that pot. In addition, we can include constraint 6 in the same requirement to speed up the optimisation. The resulting formula (cell M21 shown; the same for all pots) returns a 1 if the pot has been allocated to no more than two tubs and the sum of the weights allocated equals the weight of the pot, and a 0 otherwise. The same test is applied to each pot. Therefore, to meet the conditions, the sum of cells M21:M34 (cell M36) should be exactly 14 because, if all pots "pass the test", each individual pot test returns a 1. Some readers may wonder why constraint 6 was added to this formula although it was already mentioned that, if nothing is allocated to tubs 1 to 3, the total weight is allocated to tub 4. In reality, the constraint is not necessary but is left in the equation to exemplify how to combine several constraints into one formula, making the model computations significantly faster. Also, when a model is going to be continuously modified, it is always good to have logical checks to make sure the algorithm is working the way it is supposed to. Before we include the final values in the objective function, we need to identify which tubs are below the desired impurity thresholds for minerals A and B. The formula we use for this (cells H40:K40; H40 shown) is = IF(AND(H38 < OptA, H39 < OptB), 1, 0), where OptA and OptB are the optimal purity levels for metals A and B respectively. This formula returns a 1 when the requirements are met.
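The two indicator formulas just described translate directly into code. Here is a sketch of the same logic in Python (our own function names; the spreadsheet cell references are replaced by plain arguments):

```python
def pot_requirement(takes, pot_weight):
    """1 if the pot is used by at most two tubs AND is fully allocated, else 0
    (the combined requirement held in cells M21:M34)."""
    used = [x for x in takes if x > 0]
    return 1 if len(used) <= 2 and sum(takes) == pot_weight else 0

def good_tub(conc_a, conc_b, opt_a, opt_b):
    """1 if both impurity concentrations are below their thresholds, else 0
    (the =IF(AND(H38 < OptA, H39 < OptB), 1, 0) test)."""
    return 1 if conc_a < opt_a and conc_b < opt_b else 0
```

Summing pot_requirement over all 14 pots reproduces the check that cell M36 must equal exactly 14.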
Finally, the objective function is contained in cell N40 and is the sum of the total weights per tub multiplied by the "good tub" indicator. The optimisation model will try to maximise the value of this objective function (the total weight of "good" metal in tubs). Once the variables, constraints and objective function are defined, the last step is to use OptQuest to set up and then run the optimisation procedure. To do so, in the Crystal Ball menu, select Run → OptQuest and open a new optimisation file. All variables in the Decision Variables form should be selected. In the Forecast Selection form the inputs should be selected as in Figure 16.7.
Figure 16.7 Forecast selection menu in OptQuest for the metallurgic optimisation model.
The objective function is maximised (we want to have the maximum amount of pure metal), the constraint test should equal 14, and the minimum and maximum contents of the tubs should be 5000 and 10 000 lb respectively (constraint 7). The software will discard any scenario that does not meet the requirements, and the objective function will be maximised by finding the best combination of input variables. Provided the initial values are reasonable, an optimal solution takes less than an hour to find on a modern PC, which is important because the production line has to run this model twice a day. 16.4.1 Uncertainty in the model In the actual model for our client we included the uncertainty about the impurity concentrations. The user set a required confidence level CL (e.g. 90 %), and the model optimised to produce tubs that had less than the specified impurity level with this confidence.
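The confidence-level shortcut used in the client model (the product-moment formulas and normal-percentile step detailed below) can be sketched with Python's standard library. The function name and example numbers are our own illustration:

```python
from statistics import NormalDist

def impurity_at_confidence(mu_w, sd_w, mu_c, sd_c, cl=0.90):
    """Percentile of (pot weight x impurity concentration) via a normal approximation.

    Uses the exact mean and variance of a product of two independent random
    variables, then NormalDist.inv_cdf (the equivalent of Excel's NORMINV).
    """
    mu_prod = mu_w * mu_c
    var_prod = mu_w**2 * sd_c**2 + mu_c**2 * sd_w**2 + sd_w**2 * sd_c**2
    return NormalDist(mu_prod, var_prod ** 0.5).inv_cdf(cl)

# Hypothetical pot: weight (mean 2000 lb, sd 50), concentration (mean 5 %, sd 0.5 %)
bound = impurity_at_confidence(2000, 50, 0.05, 0.005, cl=0.90)
```

One such calculation replaces an entire simulation run when checking a tub against its impurity threshold, which is exactly the speed gain described in the section below.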
The amount of impurity is determined by: weight of pot × impurity concentration. The uncertainty comes from the uncertainty about the weight of a pot (mean μP and standard deviation σP, in lb) and from the uncertainty about the impurity concentrations (mean μA and standard deviation σA for impurity A, for example). For two independent variables, the mean and standard deviation of the distribution of their product are given by: μproduct = μP μA and σproduct = √(μP²σA² + μA²σP² + σP²σA²). In order to calculate the impurity level at the required confidence, we use Excel's NORMINV(CL, μproduct, σproduct) function. The normal approximation is reasonable in this case because the uncertainty about the concentration was close to a normal distribution and was greater than the weight uncertainty, so it dominated the shape of the product. As mentioned before, finding a way of avoiding having to optimise a simulation model (rather than the calculation model used here) is very helpful because it hugely speeds up the optimisation: one calculation replaces a simulation of, say, 1000 iterations needed to be sure of the required confidence level value. Chapter 17 Checking and validating a model In this chapter I describe various methods that can be used to help validate the quality and predictive capabilities of a model. Some techniques can be carried out during a model's construction, which will help ensure that the finished model is as free from errors and as accurate and useful as possible. Other techniques can only be executed at a future time when some of the model's predictions can be compared against what actually happened, but one may nonetheless devise a plan to help facilitate that comparison. Key points to consider are: Does the model meet management needs? Is the model free from errors? Are the model's predictions robust? The following topics describe the methods we use to help answer these questions: Ensuring the model meets the decision-makers' requirements. Comparing predictions against reality. Informal auditing.
Checking units propagate correctly. Checking model behaviour. Comparing results of alternative models. 17.1 Spreadsheet Model Errors Your company may have hundreds or thousands of spreadsheet models in use. If even 1 % of these have errors, you could be making many decisions based on quite inaccurate information. If you now introduce risk analysis models using Monte Carlo simulation, which are more difficult to write (because we have to write models that work dynamically) and to check (because the numbers change with each iteration), the problem could get much worse. Errors come in several forms: Syntax errors, where a formula is incorrectly put together. For example, you mismatch brackets, forget to make a formula into an array formula (by entering it with Enter instead of Ctrl + Shift + Enter), use the wrong function, etc. Mechanical errors, which are hitting the wrong key, pointing to the wrong cell, etc. About 1 % of spreadsheet cells contain such errors. Logical errors, which are incorrect formulae due to mistaken reasoning, misunderstanding of a function or of the appropriate use of probability mathematics. These errors are more difficult to detect than mechanical errors and occur in about 4 % of spreadsheet cells in normal (unrisked) models. Application errors, where the spreadsheet function does not perform as it should. Excel generates incorrect results for some statistical functions: GAMMADIST and BINOMDIST are awful, for example. Some versions of Excel also don't automatically update all formulae correctly - use Ctrl + Alt + F9 instead of F9 to be sure everything updates correctly.
Random number generation for certain distributions is quite numerically difficult, so you will see artificial limits on the parameters allowed for distributions in a lot of software: @RISK, for example, allows a maximum of 32 767 trials in a binomial distribution and for a hypergeometric population, while Crystal Ball allows a maximum of 1000 for a Poisson mean, and parameters for the beta distribution must lie on [0.3, 1000]. It is frustrating, of course, to have to work around such limits, and often you'll only find them because the model didn't work for some iterations, so we have designed ModelRisk to have no such issues. Omission errors, where a necessary component of the model has been forgotten. These are the most difficult errors to detect. Administrative errors, for example using an old version of a spreadsheet or graph, failing to update a model with new data, failing to get the spreadsheet to recalculate after changes, importing data from another application incorrectly, etc. We have tried to help reduce the frequency of these types of error with ModelRisk. Each function returns an informative error message when inappropriate parameter values are entered. For example: = VoseNormal(100, -10) returns "Error: sigma must be >= 0" because a standard deviation cannot be negative. = VoseHypergeo(20, 30, 10) returns "Error: n must be <= M" because one cannot take more samples without replacement (n = 20) than there are individuals in the population (M = 10). {= VoseAggregateMoments(VosePoissonObject(10), VoseLognormal(10, 3))} returns "Error: Severity distribution not valid" because the severity distribution needs to be an object, e.g. VoseLognormalObject(10, 3). If you write any user-defined functions, with which the Excel user will be less familiar, please consider doing the same.
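The same defensive style is easy to reproduce in user-defined functions. Here is a sketch in Python of a hypergeometric probability function (our own signature and messages, not ModelRisk's) that validates its parameters with informative errors and, anticipating the point made next, returns a plain zero for impossible outcomes rather than failing:

```python
from math import comb

def hypergeo_prob(k, n, D, M):
    """P(exactly k successes in n draws, without replacement, from a
    population of M items of which D are successes).

    Invalid parameters raise an informative error; an impossible k simply
    has probability 0, so callers need no special-case code.
    """
    if not 0 <= D <= M:
        raise ValueError("D must satisfy 0 <= D <= M")
    if not 0 <= n <= M:
        raise ValueError("n must satisfy 0 <= n <= M")
    if k < max(0, n - (M - D)) or k > min(n, D):
        return 0.0              # outside the support: an impossible outcome
    return comb(D, k) * comb(M - D, n - k) / comb(M, n)
```

For example, drawing 10 items from a population of 30 that contains 25 successes must yield at least 5 successes, so hypergeo_prob(2, 10, 25, 30) returns 0.0 rather than an error.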
In ModelRisk we have also chosen to return pedantically correct answers for probability calculations, for example: = VoseHypergeoProb(2, 10, 25, 30, 0) returns 0: this is the probability of observing exactly two successes where the minimum possible is five. If an outcome is impossible, its probability is zero. = VoseBinomialProb(50, 10, 0.5, 1) returns 1: the probability of observing less than or equal to 50 successes when there are only 10 trials. This means that you don't have to write special code to get around the function giving errors. For comparison, the equivalent Excel formulae are: = HYPGEOMDIST(2, 10, 25, 30) returns #NUM! = BINOMDIST(50, 10, 0.5, 1) returns #NUM! You also need to check how your Monte Carlo simulation software handles special cases for particular parameter values. Poisson(0), for example, means that the variable can only be zero. In a simulation model, it would be perfectly reasonable for a cell simulating a concentration to produce a zero value that fed into a Poisson distribution. However, different software handles this differently: @RISK: = RiskPoisson(0) returns #VALUE! Crystal Ball: = CB.Poisson(0) returns #NUM! ModelRisk: = VosePoisson(0) returns 0. Perhaps the most useful error-reducing feature in ModelRisk is that we have interfaces that give a visual explanation and check of most ModelRisk features. For example, a cell containing the formula = VoseGammaProb(C3:C7, 2, 3, 0) returns the joint probability of the values in cells C3:C7 being randomly generated from a Gamma(2, 3) distribution. Selecting the cell with this formula and then clicking ModelRisk's View Function icon pulls up the interface shown in Figure 17.1. Crystal Ball and @RISK both have very good interfaces, although these are limited to input distributions only. A quick Internet search for "spreadsheet model errors" will provide you with a wealth of individuals and organisations who research the source and control of spreadsheet errors.
For example, the European Spreadsheet Risks Interest Group is dedicated to the topic. Raymond Panko from the University of Hawaii is a leader in the field and provides an interesting summary of spreadsheet error rates and reasons at http://panko.shidler.hawaii.edu/SSR/index.htm.
Figure 17.1 Visual interface in ModelRisk for the formula VoseGammaProb(C3:C7, 2, 3, 0).
Looking at the error percentages, for large models the question is not "Are there any errors?" but "How many errors are there?". A company can help minimise model errors by establishing and enforcing a policy for model development and for model auditing. Dr Panko reports the recommendation of professional model auditors that one should spend one-third of the development time checking the model. 17.1.1 Informal auditing Studies have shown that the original builder of a spreadsheet model has a lower rate of error detection than an equivalently skilled coworker. That is not so surprising, of course, since we are more inclined than a reviewer to repeat our own logical, omission and administrative errors. At Vose Consulting we do a lot of internal auditing. An important part of the process is sitting down with another analyst and explaining, with pen and paper, the decision question(s) and the model structure, and then how we've executed it in a spreadsheet. Just the process of providing an explanation will often lead to finding errors in your logic, or to finding simpler ways to write the model. Get another analyst to go through your code with the objective of finding your errors, so that a successful exercise is one that finds errors rather than one that pronounces your model error free. Having several analysts look at your model is even better, of course - it is interesting how people find different errors.
For example, in writing our software, some of our team are just great at finding numerical bugs, others at finding wrong formulae, and others still at finding inconsistencies in structure or presentation. Different things jump out at different people. 17.1.2 Checking units propagate correctly I studied physics at university, and one of the first things you learn to do is a "dimensional analysis" of formulae. For example, there exists an equation relating initial speed u and final speed v to the distance s over which a body undergoes constant acceleration a: v² = u² + 2as. The dimensions involved are length L (in metres, for example) and time T (in seconds, for example). Distance has units L, speed has units L/T, and acceleration has units L/T². Replacing the elements in the above formula with their dimensions gives (L/T)² = (L/T)² + (L/T²) × L. You can see that the left- and right-hand sides of the equation have the same units and that, when we add two things together, they have the same units too (so we are not adding "apples and oranges"). In a spreadsheet model we can use the same logic to help make sure our model is constructed properly. It is good practice to label cells containing a number or formula with some explanation of what that value represents, but including units makes the logic of the model even clearer; for example, noting the currency when there is more than one in your model or, if a value is a rate, noting the denominator, e.g. "$US/ticket" or "cases/outbreak". Then checking that the units flow through the model using dimensional analysis will often reveal errors. Checking that the same units are used for a dimension (length, mass, etc.) is also important. We commonly come across two easily avoided problems in this category in our auditing activities: Fractions. The first is the use of a fraction, where the modeller might label a cell "Interest rate ( %)" and then write a value like "6.5".
Of course, to apply that interest rate, he or she will have to remember to divide by 100, and we've found that this is sometimes forgotten. Better by far, in our view, is to label the cell "Interest rate" and input the value "6.5 %", which will show on screen as 6.5 % but will be interpreted by Excel as 0.065 and can therefore be used directly. Thousands, millions, etc. In large investment analyses, for example, one is often dealing with very large numbers, so the modeller finds it more convenient to use units of thousands or millions. This would not present a problem if the entire spreadsheet used the same units, but very commonly there will be certain elements that do not; for example, cost/unit or price/unit for a manufacturer or retailer of high-volume products. The danger is that, in summary calculations that evaluate cashflow streams, the modeller may forget to divide by 1000 or 1 000 000 in keeping with the other currency cells. Even if it is all done correctly, it is more difficult to follow formulae where "/1000" and "*1000000" appear without explanation. Our preference is that the model be kept in the same units throughout - a base currency unit, for example, like $, € or £. Admittedly this can be tricky if you're converting from values you know in thousands or millions - we can easily get all those zeros mixed up. A convenient way to get around this in Excel is to use special number formatting, employing Excel's Format | Cells | Custom feature. We use a few formats in particular: one which will display 1 234 567 890 as £123.5M; one which will display 1 234 567 890 as £123.5M as above, but will display negative values in red; one which does the same as the second option but has the "£M" next to the numbers rather than left justified; and one which will display 1 234 567 890 as £123,456.8k. You can, of course, substitute a different currency symbol. Time series summary plots; correlation and regression statistics.
They are discussed at length in Chapter 5. 17.2.5 Stressing parameter values A very useful, simple and powerful way of checking your model is to look at the effect of changing the model parameters. We use two different methods. Propagate an error In order to check quickly which elements of your model are affected by a particular spreadsheet cell, you can replace the cell contents with the Excel formula =NA(). This will show the warning "#N/A" (meaning data not available) in that cell and in any other cell that relies on it (except where the ISNA() or ISERROR() functions are used). Embedded Excel charts will simply leave the cell out. I like this method very much because it is quicker than using the Excel audit toolbar to trace dependents, and it also works when you have VBA macros that pick up values from cells within the code, i.e. when the cells aren't inputs to the macro function, Excel's Trace Dependents feature won't work in that situation. Set parameter values to extremes It is difficult to see whether your Monte Carlo simulation model is performing correctly for low-probability outcomes because generating scenarios on screen will obviously only rarely show those low-probability scenarios. However, there are a couple of techniques for concentrating on these low-probability events by temporarily altering the input distributions. We suggest that you first resave your model with another name (e.g. append "test" to the file name) to avoid accidentally leaving the model with the altered distributions. You can generate model extremes as follows: (a) Set a discrete variable to an extreme instead of its distribution. The theoretical minima and maxima of discrete bounded distributions are provided in the formulae pages for each distribution in Appendix III. Many distributions have a zero minimum, but only a few distributions have a maximum value (e.g. binomial).
In general, however, it is not a good idea to stress a continuous variable with its minimum or maximum, because such values have a zero probability of occurrence and so the scenario is meaningless. (b) Modify the distribution to generate values only from an extreme range. This is particularly useful for continuous distributions, and for discrete distributions where there is no defined minimum and/or maximum. Monte Carlo Excel add-ins normally offer the ability to bound a distribution. For example, in ModelRisk we can write the following to constrain a lognormal distribution:

Only values above 30: =VoseLognormal(10, 5, , VoseXBounds(30, ))
Only values below 5: =VoseLognormal(10, 5, , VoseXBounds(, 5))
Values between 10 and 11: =VoseLognormal(10, 5, , VoseXBounds(10, 11))

In @RISK, this would be =RiskLognorm(10, 5, RiskTruncate(30, )), etc. In Crystal Ball you apply bounds in the visual interface. Note that occasionally a model will have an acute response to a variable that is within a small range. For example, a model of the amplitude of vibrations of a car may have a very acute (highly non-linear) response to an input variable modelling the frequency of an external vibrating force, like the bounce from driving over a slatted bridge, when that frequency approaches the natural frequency of the car. In that case, the rare event that needs to be tested is not necessarily an extreme of the input variable but the scenario that produces the extreme response in the rest of the model. (c) Modify the probability of a risk occurring. Often in a risk analysis model we have one or more risk events. We can simulate them occurring (with some probability) or not in a variety of ways. We can stress the model to see the effect of an individual risk occurring, or a combination of risks, by increasing their probability during the test.
For example, setting a risk to have a 50 % probability (where perhaps we actually believe it to have a 10 % probability) and generating on-screen scenarios allows us comfortably to watch how the model behaves with and without the risk occurring. Setting two risks each to a 70 % probability will show both risks occurring at the same time in about 50 % of the scenarios (0.7 * 0.7 = 0.49), and so on.

17.2.6 Comparing results of alternative models

There are often several ways that one could construct a Monte Carlo model to tackle the same problem. Each method should give you the same answer, of course. So, if you are unsure about one way of manipulating distributions, try it another (perhaps less efficient) way and see if the answers are the same. The more difficult area is where you may feel that there are two or more completely different stochastic processes that could explain the problem at hand. Ideally, one would like to be able to construct both models and see whether they come up with similar answers. But what do we mean by similar? In fact, from a decision analysis point of view we don't actually mean that they come up with the same numbers or distributions: we mean that, if presented with either result, the decision-maker would make the same decision. If we do have the luxury of being able to construct two completely different model interpretations of the world, we may be able to use a technique called Bayesian model averaging, which weights the likelihood of each model on the basis of how probable each would make our observations. We will nearly always lack the luxury of being able to model two or more different approaches to the same problem because of time and resource constraints. If you are going to have to put all your efforts into one model, try to make sure that your peers agree with your approach, and that the decision-maker will be comfortable with making a decision based on the model's assumptions.
The decision-maker could prefer you to construct a model that may not be the most likely explanation for your problem, but that offers the most conservative guidance for managing it. Finally, simple "back-of-the-envelope" checks can also be useful. Managers will often look at the results of a risk analysis and compare them with their gut feeling and/or a simple calculation. It is surprising how often a modeller can get too involved in the modelling and pay too little attention to the numbers that come out at the end.

17.3 Comparing Predictions Against Reality

In many cases, this might be akin to "shutting the stable door after the horse has bolted". Clearly, if you have made an irreversible decision on the basis of a risk assessment, this exercise may be of limited value. However, even when that is true, analysing which parts of the model turned out to be the most inaccurate will help you focus on how you might improve your risk models for the next decision, or prepare you for how badly you will have got it wrong. Perhaps it is possible to structure a decision into a series of steps, each informed by risk analysis, so that at each step in the series of decisions the risk analysis predictions can be compared against what has happened so far. For example, setting up an investment that started with a pilot roll-out in a test market would let a company limit the risks and at the same time evaluate how well it had been able to predict the initial level of success. Project risk analysis models, in which the cost and duration of the elements of a project are estimated, are an excellent example of where predictions can be continuously compared with reality. The uncertainty of the cost and time elements can be updated as each task is being completed to estimate the remaining duration and costs, while a review of each task estimate against what actually happened can give you a feel for whether your estimators have been systematically pessimistic or optimistic.
Chapter 13 gives a number of techniques for monitoring and calibrating expert estimates.

Chapter 18 Discounted cashflow modelling

A typical discounted cashflow model for a potential investment makes forecasts of costs and revenues over the life of the project and discounts those revenues back to a present value. Most analysts start with a "base case" model and add uncertainty to the important elements of the model. Happily, the mathematics involved in adding risk to these types of model is quite simple. In this chapter, I will assume that you can build a base case cashflow model that will look something like Figure 18.2, and I will focus on the input modelling elements of Figure 18.1 and some financial outputs. There are a number of topics that are already well covered in this book:

Expert estimates. In capital investment models we rely a great deal on expert judgement to estimate variables like costs, time to market, sales volumes, discount levels, etc. Chapter 14 discusses how to elicit estimates from subject matter experts.

Fitting distributions to data. We don't usually have a great deal of historic data to work with in capital investment projects because the investment is new. I have worked with a very successful retail company that investigates levels of pedestrian traffic at different locations in a town where it is considering locating a new outlet. It has excellent regional data on how that traffic converts to till receipts. That is quite typical of the type of data one might have for a cashflow analysis, and I will go through such a model later in this chapter. Hydrocarbon and mineral exploration will generally have improving levels of data about the reserves, but have specialised methods (e.g. kriging) for statistically analysing their data, so I won't consider them further here. Otherwise, Chapter 10 discusses distribution fitting in some detail.

Correlation.
Simple forms of correlation modelling - recognising that two or more variables are likely to be linked in some way - are very important in cashflow models. The correlation techniques described in Sections 13.4 and 13.5 are particularly useful in cashflow models.

Time series. Chapter 12 deals with many different technical time series models. GBM, seasonal and autoregressive models are useful for modelling inflation, exchange and interest rates over time in a cashflow model. Lead indicators can help predict market size a short time into the future. In this chapter I consider variables such as demand for products and sales volumes that are generally built on a more intuitive basis.

Common errors. Risk analysis cashflow models are not generally that technically complicated, but our reviews show that the types of error described in Section 7.4 appear very frequently, so I very much encourage you to read that section carefully. The rest of Chapter 7 offers some ideas on model building that are very applicable to cashflow models.

Figure 18.1 Modelling elements in a capital investment discounted cashflow model.

Figure 18.2 A typical, if somewhat reduced, discounted cashflow model. [The figure shows a spreadsheet with sections for the cashflow (total revenue, cost of goods sold, gross margin, operating expenses, earnings before taxes, income tax, net income), market conditions (number of competitors, unit cost, inflation rate, tax rate), sales activity (sales price, market volume, sales volume) and expenses (production, product development, capital, overhead, total).]

18.1 Useful Time Series Models of Sales and Market Size

18.1.1 Effect of an intervention at some uncertain point in time

Time series variables are often affected by single identifiable "shocks", like elections, changes to a law, the introduction of a competitor, the start or finish of a war, a scandal, etc. The modelling of the occurrence of a shock and its effects may need to take into account several elements: when the shock may occur (this could be random); whether this changes the probability or impact of other possible shocks; and the effect of the shock - its magnitude and duration. Consider the following problem. People are purchasing your product at a current rate of 88/month, and the rate appears to be increasing by 1.3 sales/month with each month. However, we are 80 % sure that a competitor is going to enter the market and will do so between 20 and 50 months from now.
If the competitor enters the market, they will take about 30 % of your sales. Forecast the number of sales there will be for each of the next 100 months. Two typical pathways for this problem are shown in Figure 18.3, and the model that created them is shown in Figure 18.4. The Bernoulli variable returns a 1 with 80 % probability, otherwise a 0. It is used as a "flag", the 1 representing a competitor entry, the 0 representing no competitor. Other cells use conditional logic to adapt to the scenario. You can use a Binomial(1, 80 %) if your software does not have a Bernoulli distribution; in Crystal Ball this is also called a Yes:No distribution. The StepUniform generates integer values between 20 and 50, and cell E4 returns the month 1000 if the competitor does not enter the market, i.e. a time beyond the modelled period. If you use this type of technique, it is a good idea to make such a number very far from the range of the modelled period, in case someone decides to extend the period analysed. A Poisson distribution is used to model the number of sales, reflecting that the sales are independent of each other and randomly distributed in time. The nice thing about a Poisson distribution is that it takes just one parameter - its mean - so you don't have to think about variation about that mean separately (e.g. determine a standard deviation).

Figure 18.3 Possible pathways generated by the model depending on whether the competitor enters the market. [The figure plots sales each month over the modelled period.]

Figure 18.4 Model of Poisson sales affected by the possible entry of a competitor. [The figure shows the spreadsheet, with rows for month, expected sales, sales fraction lost and simulated sales; key formulae: E3: =VoseBernoulli(E2); E4: =IF(E3=1,VoseStepUniform(20,50),1000).]
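Outside Excel, the same three-part logic (a Bernoulli entry flag, a uniform-integer entry month and Poisson monthly sales) can be sketched in a few lines. The following Python fragment is illustrative only: the function names and the pure-Python Poisson sampler are my own, while the 88/month base rate, 1.3/month trend, 80 % entry probability, 20-50 month entry window and 30 % share loss come from the problem above.

```python
import math
import random

def poisson(lam):
    """Sample one Poisson(lam) variate (Knuth's multiplication method)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def simulate_sales(months=100, base=88.0, trend=1.3,
                   p_entry=0.80, entry_window=(20, 50), share_lost=0.30):
    """One Monte Carlo pathway of monthly sales with a possible competitor entry."""
    # Bernoulli flag: 1 (competitor enters) with probability p_entry, else 0
    enters = random.random() < p_entry
    # StepUniform equivalent: integer entry month; 1000 = far beyond the horizon
    entry_month = random.randint(*entry_window) if enters else 1000
    path = []
    for t in range(1, months + 1):
        expected = base + trend * t          # trending mean sales rate
        if t >= entry_month:
            expected *= (1 - share_lost)     # competitor takes ~30 % of sales
        path.append(poisson(expected))       # sales are Poisson about that mean
    return path
```

Running simulate_sales() repeatedly reproduces the two kinds of pathway in Figure 18.3: about 80 % of runs show a drop of roughly 30 % somewhere between months 20 and 50, and the rest follow the uninterrupted trend.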
18.1.2 Distributing market share

When competitors enter an established market they have to establish the reputation of their product and fight for market share with others that are already established. This takes time, so it is more realistic to model a gradual loss of market share to competitors. Consider the following problem. Market volume for your product is expected to grow each year by PERT(10 %, 20 %, 40 %), beginning next year at PERT(2500, 3000, 5000) units, up to a maximum of 20 000 units. You expect one competitor to emerge as soon as the market volume reaches 3500 units in the previous year. A second will appear at 8500 units. Your competitors' shares of the market will grow linearly until you all have equal market share after 3 years. Model the sales you will make. Figure 18.5 shows the model. It is mostly self-explanatory. The interesting component lies in cells F10:L10, which divide the forecast market for your product among the average of the number of competitors over the last 3 years and yourself (the "+1" in the equation). Averaging over 3 years is a neat way of allocating an emerging competitor 1/3 of your market strength in the first year, 2/3 in the second and equal strength from the third year on - meaning that they will then sell as many units as you. What is so helpful about this little trick is that it automatically takes into account each new competitor and when they entered the market, which is rather difficult to do otherwise. Note that we need three zeros in cells C8:E8 to initialise the model.
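The averaging trick can be captured in a one-line function. This Python fragment is a hypothetical sketch (the function name and example numbers are mine, not from the model); it reproduces the F10:L10 calculation of dividing each year's market volume by the 3-year average number of competitors plus 1 (ourselves):

```python
def my_sales(market_volume, competitor_counts):
    """Our sales this year: the market volume shared out by the 3-year
    average competitor count plus 1, the '+1' being ourselves."""
    avg_competitors = sum(competitor_counts[-3:]) / 3.0
    return round(market_volume / (avg_competitors + 1))

# A new entrant carries weight 1/3, then 2/3, then parity with us:
# counts (0, 0, 1) divide the market by 4/3, (0, 1, 1) by 5/3, (1, 1, 1) by 2.
```

So for a 4000-unit market, a competitor in their first, second and third year leaves us 3000, 2400 and 2000 units respectively, exactly the linear three-year convergence to equal share described above.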
Figure 18.5 Model of sales where the total market is shared with new-entry competitors. [Key formulae: growth rate: =VosePERT(10%,20%,40%); C8:E8: {0,0,0}; F8:L8 (competitors): =IF(E9>$C$4,2,IF(E9>$C$3,1,0)); F9:L9 (market volume): =VosePERT(2500,3000,5000) in the first year, then =MIN(20000,E9*(1+$C$5)); F10:L10 (sales volume, output): =ROUND(E9/(AVERAGE(C8:E8)+1),0).]

18.1.3 Reduced sales over time to a finite market

Some products are essentially a once-in-a-lifetime purchase, e.g. life insurance, a big flat-screen TV, a new guttering system or a pet identification chip. If we are initially quite successful in selling the product into the potential market, the remaining market size decreases, although this can be compensated for to some degree by new potential consumers. Consider the following problem: There are currently PERT(50 000, 55 000, 60 000) possible purchasers of your product. Each year there will be about a 10 % turnover (meaning 10 % more possible purchasers will appear). The probability that you will sell to any particular purchaser in a year is PERT(10 %, 20 %, 35 %). Forecast sales for the next 10 years. Figure 18.6 shows the model for this problem. Note that C8:C16 subtracts the sales already made from the previous year's market size but also adds in a regenerated market element. The binomial distribution then converts the current market size to sales. In the particular scenario shown in Figure 18.6, the probability of selling is high (26 %), so sales start off high and drop off quickly because the regeneration rate is so much lower (10 %). Note that some Monte Carlo software cannot handle large numbers of trials in their binomial distribution, in which case you will need to use a Poisson or normal approximation (Section III.9.1).
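As just noted, very large trial counts can be handled with a normal approximation to the binomial. The following Python sketch of the finite-market model is illustrative only: the function names are mine, and fixed mid-range values (55 000 initial purchasers, a 20 % chance of selling to each) stand in for the PERT inputs above.

```python
import random

def binomial_normal_approx(n, p):
    """Normal approximation to a Binomial(n, p) variate, for when n is
    too large for the software's exact binomial sampler."""
    mean = n * p
    sd = (n * p * (1 - p)) ** 0.5
    return max(0, round(random.gauss(mean, sd)))

def forecast_sales(initial_market=55000, p_sell=0.20, turnover=0.10, years=10):
    """Each year: sell to a binomial share of the market, remove those buyers,
    then regenerate the market by the turnover rate (new potential purchasers)."""
    market, sales = initial_market, []
    for _ in range(years):
        sold = binomial_normal_approx(market, p_sell)
        sales.append(sold)
        # next market = previous market minus sales, plus regeneration
        market = market - sold + round(turnover * market)
    return sales
```

With a selling probability well above the 10 % regeneration rate, the market shrinks each year by a factor of roughly 1 - p_sell + turnover, so simulated sales start high and tail off, as in Figure 18.6.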
Figure 18.6 Model forecasting sales over time to a finite market.

18.1.4 Growth of sales over time up to a maximum as a function of marketing effort

Sometimes we might find it easier to estimate what our annual sales will be when stabilised, but be unsure of how quickly we will be able to achieve that stability. In this sort of situation it can be easier to model a theoretical maximum sales level and match it to some ramping function. A typical form of such a ramping function r(t) is

r(t) = t / (t + t1/2)

which will produce a curve that starts at 0 for t = 0 and asymptotically reaches 1 at an infinite value of t, but reaches 0.5 at t = t1/2. Consider the following problem: you expect a final sales rate of PERT(1800, 2300, 3600) and expect to achieve half that in the next PERT(3.5, 4, 5) years. Produce a sales forecast for the next 10 years. Figure 18.7 provides a solution.

Figure 18.7 Model forecasting ramping sales to an uncertain theoretical maximum. [Key formulae: t1/2: =VosePERT(3.5,4,5); maximum sales: =VosePERT(1800,2300,3600).]

18.2 Summing Random Variables

Perhaps the most common errors in cashflow modelling occur when one wishes to sum a number of random costs, sales or revenues. For example, imagine that you expect to have Lognormal(100 000, 25 000) customers enter your store per year and that they will spend $Lognormal(55, 12) each - how would you estimate the total revenue? People generally write something like

Revenue = ROUND(Lognormal(100 000, 25 000), 0) * Lognormal(55, 12)    (18.1)

using the ROUND function in Excel to recognise that the number of people must be discrete. But let's think what happens when the software starts simulating. It will pick a random value from each distribution and multiply them together.
Picking a reasonably high till receipt, the probability that a random customer will spend more than $70, for example, is

P(Lognormal(55, 12) > 70) ≈ 11 %

The probability that two people will do the same is 11 % * 11 % = 1.2 %, and the probability that thousands of people will all spend that much is infinitesimally small. However, Equation (18.1) will assign an 11 % probability that all customers will spend over $70, no matter how many there are. The equation is wrong because it should have summed ROUND(Lognormal(100 000, 25 000), 0) separate Lognormal(55, 12) distributions. That's a big, slow model, so we use a variety of techniques to shortcut to the answer, which is the topic of Chapter 11.

18.3 Summing Variable Margins on Variable Revenues

A common situation is that we have a large random number of revenue items that follow the same probability distribution but that are independent of each other, and we have independent profit margins that follow another distribution that must be applied to each revenue item. This type of model quickly becomes extremely cumbersome to implement, because for each revenue item we need two distributions, one for the revenue and another for the profit margin, and we may have large numbers of revenue items. It is such a common problem that we designed a function in ModelRisk to handle it, allowing you to keep the model to a manageable size, speeding up simulation time and making the model far simpler to review. Perhaps most importantly, it allows you to avoid a lot of conditional logic that is easy to get wrong. (I apologise if this comes across as a sales pitch for ModelRisk, but it is designed with finance people in mind.) Consider the following problem. A venture capital company is considering investing in a company that makes TV shows. They expect to make PERT(8, 11, 17) pilots next year, each of which will independently generate a revenue of $PERT(120, 150, 250)k, from which the profit margin is PERT(1 %, 5 %, 12 %).
There is a 30 % chance that each pilot is made into a TV show in that country, running for Discrete({1, 2, 3, 4, 5}, {0.4, 0.25, 0.2, 0.1, 0.05}) series, where each season of each series generates $PERT(120, 150, 250)k with margins of PERT(15 %, 25 %, 45 %). There is a 20 % chance that these local series will be sold to the US, generating $PERT(240, 550, 1350)k per season sold, of which the profit margin is PERT(65 %, 70 %, 85 %). What is the total profit generated from next year's pilots? The problem is not technically difficult, but the scale of the modelling explodes very quickly. We worked on the model for a real investment of this type and it had many more layers: pilots in several countries, merchandising of various types, repeats, etc., and it took a lot of effort to manage. Figure 18.8 shows a surprisingly succinct model: rows 2 to 11 are the input data, rows 14 to 16 are the actual calculations.

Figure 18.8 Model forecasting profits from TV series. [Key formulae: F2: =ROUND(VosePERT(8-0.5,11,17+0.5),0); F3 (F4, F7, F10, F11 similar): =VosePERTObject(120,150,250); F6: =VoseDiscreteObject({1,2,3,4,5},{0.4,0.25,0.2,0.1,0.05}); F14: =VoseBinomial(F2,F5); E14: =VoseBinomial(F14,F9); D14: =F14-E14; D15:E15: =VoseAggregateMC(D14,$F$6); C16: =VoseSumProduct(F2,F3,F4); D16: =VoseSumProduct(D15,F7,F8); E16: =VoseSumProduct(E15,F7,F8)+VoseSumProduct(E15,F10,F11); F16 (output): =SUM(C16:E16).]

There are a few things to point out. In cell F2, 1/2 is subtracted from and added to the minimum and maximum estimates respectively of the number of pilots to give a more realistic chance of their occurrence after rounding. Distributions are input as ModelRisk objects in cells F3, F4, F6, F7, F8, F10 and F11 because we want to use these distributions many times. Cell C16, and elsewhere, uses the VoseSumProduct function to add together revenue * margin for each pilot, where the revenue and
margin distributions are defined by the distribution objects in cells F3 and F4 respectively. Cell F14 simulates the number of pilots that made it to become series, from which the model determines how many of those become series also sold into the US in cell E14, the difference being the number of pilots that only became local series in cell D14. Setting up the logic this way ensures that we have a consistent model: the local-only and the US-and-local series always add up to the total series produced. Cells D15 and E15 use the VoseAggregate(x, y) function to simulate the sum of x random variables all taking the same distribution y, defined as an object.

18.4 Financial Measures in Risk Analysis

The two main measures of profitability in DCF models are net present value (NPV) and internal rate of return (IRR). The two main measures of financial exposure are value at risk (VaR) and expected shortfall. Their pros and cons are discussed in Section 20.5.

18.4.1 Net present value

Net present value (NPV) attempts to determine the present value of a series of cashflows from a project that stretches out into the future. This present value is a measure of how much the company is gaining in today's money by undertaking the project: in other words, how much more the company itself will be worth by accepting the project. An NPV calculation discounts future cashflows at a specified discount rate r that takes account of:

1. The time value of money (e.g. if inflation is running at 4 %, £1.04 in a year's time is only worth £1.00 today).
2. The interest that could have been earned over inflation by investing instead in a guaranteed investment.
3. The extra ret