A Practical Guide To Quantitative Portfolio Trading

Page Count: 743

Quantitative Analytics

A Practical Guide To Quantitative Portfolio Trading
Daniel Bloch
30th of December 2014

The copyright to this computer software and documentation is the property of Quant Finance Ltd. It may be
used and/or copied only with the written consent of the company or in accordance with the terms and conditions
stipulated in the agreement/contract under which the material has been supplied.

Copyright © 2015 Quant Finance Ltd
Quantitative Analytics, London

Created: 14 January 2015

A Practical Guide To Quantitative Portfolio Trading
Daniel Bloch (db@quantfin.eu)
Quant Finance Ltd
eBook

30th of December 2014
Version 1.01

Abstract
We discuss risk, preference and valuation in classical economics, which led academics to develop a theory of
market prices and, in turn, the general equilibrium theories. In practice, however, the decision process does not follow
that theory, since it omits the qualitative aspect of human decision making. Further, a large
number of studies in empirical finance have shown that financial assets exhibit trends or cycles, resulting in persistent
market inefficiencies that can be exploited. The uneven assimilation of information highlighted the multifractal
nature of capital markets and the need to recognise their complexity. New theories developed to explain financial
markets, among them the view of markets as a multitude of interacting agents forming a complex system characterised
by a high level of uncertainty.
Recently, with the increased availability of data, econophysics emerged as a mix of physical sciences and economics
that aims to get the best of both worlds and to analyse the predictability of assets more deeply. For instance, data mining and
machine learning methodologies provide a range of general techniques for classification, prediction, and optimisation
of structured and unstructured data. Using these techniques, one can describe financial markets through degrees
of freedom which may be both qualitative and quantitative in nature. In this book we detail how the growing use
of quantitative methods changed finance and investment theory. The most significant benefit is the power of
automation, which enforces a systematic investment approach and a structured, unified framework. We present, in
chronological order, the necessary steps to identify trading signals, build quantitative strategies, assess expected returns,
measure and score strategies, and allocate portfolios.


I would like to thank my wife and children for their patience and support during this adventure.


I would like to thank Antoine Haddad and Philippe Ankaoua for giving me the opportunity, and
the means, of completing this book. I would also like to thank Sebastien Gurrieri for writing a
section on CUDA programming in finance.


Contents

0.1 Introduction
0.1.1 Preamble
0.1.2 An overview of quantitative trading

I Quantitative trading in classical economics

1 Risk, preference, and valuation
1.1 A brief history of ideas
1.2 Solving the St. Petersburg paradox
1.2.1 The simple St. Petersburg game
1.2.2 The sequential St. Petersburg game
1.2.3 Using time averages
1.2.4 Using option pricing theory
1.3 Modelling future cashflows in presence of risk
1.3.1 Introducing the discount rate
1.3.2 Valuing payoffs in continuous time
1.3.3 Modelling the discount factor
1.4 The pricing kernel
1.4.1 Defining the pricing kernel
1.4.2 The empirical pricing kernel
1.4.3 Analysing the expected risk premium
1.4.4 Inferring risk premium from option prices
1.5 Modelling asset returns
1.5.1 Defining the return process
1.5.2 Valuing portfolios
1.5.3 Presenting the factor models
1.5.3.1 The presence of common factors
1.5.3.2 Defining factor models
1.5.3.3 CAPM: a one factor model
1.5.3.4 APT: a multi-factor model
1.6 Introducing behavioural finance
1.6.1 The Von Neumann and Morgenstern model
1.6.2 Preferences
1.6.3 Discussion
1.6.4 Some critics
1.7 Predictability of financial markets
1.7.1 The martingale theory of asset prices
1.7.2 The efficient market hypothesis
1.7.3 Some major critics
1.7.4 Contrarian and momentum strategies
1.7.5 Beyond the EMH
1.7.6 Risk premia and excess returns
1.7.6.1 Risk premia in option prices
1.7.6.2 The existence of excess returns

2 Introduction to asset management
2.1 Portfolio management
2.1.1 Defining portfolio management
2.1.2 Asset allocation
2.1.2.1 Objectives and methods
2.1.2.2 Active portfolio strategies
2.1.2.3 A review of asset allocation techniques
2.1.3 Presenting some trading strategies
2.1.3.1 Some examples of behavioural strategies
2.1.3.2 Some examples of market neutral strategies
2.1.3.3 Predicting changes in business cycles
2.1.4 Risk premia investing
2.1.5 Introducing technical analysis
2.1.5.1 Defining technical analysis
2.1.5.2 Presenting a few trading indicators
2.1.5.3 The limitation of indicators
2.1.5.4 The risk of overfitting
2.1.5.5 Evaluating trading system performance
2.2 Portfolio construction
2.2.1 The problem of portfolio selection
2.2.1.1 Minimising portfolio variance
2.2.1.2 Maximising portfolio return
2.2.1.3 Accounting for portfolio risk
2.3 A market equilibrium theory of asset prices
2.3.1 The capital asset pricing model
2.3.1.1 Markowitz solution to the portfolio allocation problem
2.3.1.2 The Sharpe-Lintner CAPM
2.3.1.3 Some critics and improvements of the CAPM
2.3.2 The growth optimal portfolio
2.3.2.1 Discrete time
2.3.2.2 Continuous time
2.3.2.3 Discussion
2.3.2.4 Comparing the GOP with the MV approach
2.3.2.5 Time taken by the GOP to outperform other portfolios
2.3.3 Measuring and predicting performances
2.3.4 Predictable variation in the Sharpe ratio
2.4 Risk and return analysis
2.4.1 Some financial meaning to alpha and beta
2.4.1.1 The financial beta
2.4.1.2 The financial alpha
2.4.2 Performance measures
2.4.2.1 The Sharpe ratio
2.4.2.2 More measures of risk
2.4.2.3 Alpha as a measure of risk
2.4.2.4 Empirical measures of risk
2.4.2.5 Incorporating tail risk
2.4.3 Some downside risk measures
2.4.4 Considering the value at risk
2.4.4.1 Introducing the value at risk
2.4.4.2 The reward to VaR
2.4.4.3 The conditional Sharpe ratio
2.4.4.4 The modified Sharpe ratio
2.4.4.5 The constant adjusted Sharpe ratio
2.4.5 Considering drawdown measures
2.4.6 Some limitations
2.4.6.1 Dividing by zero
2.4.6.2 Anomaly in the Sharpe ratio
2.4.6.3 The weak stochastic dominance

3 Introduction to financial time series analysis
3.1 Prologue
3.2 An overview of data analysis
3.2.1 Presenting the data
3.2.1.1 Data description
3.2.1.2 Analysing the data
3.2.1.3 Removing outliers
3.2.2 Basic tools for summarising and forecasting data
3.2.2.1 Presenting forecasting methods
3.2.2.2 Summarising the data
3.2.2.3 Measuring the forecasting accuracy
3.2.2.4 Prediction intervals
3.2.2.5 Estimating model parameters
3.2.3 Modelling time series
3.2.3.1 The structural time series
3.2.3.2 Some simple statistical models
3.2.4 Introducing parametric regression
3.2.4.1 Some rules for conducting inference
3.2.4.2 The least squares estimator
3.2.5 Introducing state-space models
3.2.5.1 The state-space form
3.2.5.2 The Kalman filter
3.2.5.3 Model specification
3.3 Asset returns and their characteristics
3.3.1 Defining financial returns
3.3.1.1 Asset returns
3.3.1.2 The percent returns versus the logarithm returns
3.3.1.3 Portfolio returns
3.3.1.4 Modelling returns: The random walk
3.3.2 The properties of returns
3.3.2.1 The distribution of returns
3.3.2.2 The likelihood function
3.3.3 Testing the series against trend
3.3.4 Testing the assumption of normally distributed returns
3.3.4.1 Testing for the fitness of the Normal distribution
3.3.4.2 Quantifying deviations from a Normal distribution
3.3.5 The sample moments
3.3.5.1 The population mean and volatility
3.3.5.2 The population skewness and kurtosis
3.3.5.3 Annualisation of the first two moments
3.4 Introducing the volatility process
3.4.1 An overview of risk and volatility
3.4.1.1 The need to forecast volatility
3.4.1.2 A first decomposition
3.4.2 The structure of volatility models
3.4.2.1 Benchmark volatility models
3.4.2.2 Some practical considerations
3.4.3 Forecasting volatility with RiskMetrics methodology
3.4.3.1 The exponential weighted moving average
3.4.3.2 Forecasting volatility
3.4.3.3 Assuming zero-drift in volatility calculation
3.4.3.4 Estimating the decay factor
3.4.4 Computing historical volatility

II Statistical tools applied to finance

4 Filtering and smoothing techniques
4.1 Presenting the challenge
4.1.1 Describing the problem
4.1.2 Regression smoothing
4.1.3 Introducing trend filtering
4.1.3.1 Filtering in frequency
4.1.3.2 Filtering in the time domain
4.2 Smoothing techniques and nonparametric regression
4.2.1 Histogram
4.2.1.1 Definition of the Histogram
4.2.1.2 Smoothing the histogram by WARPing
4.2.2 Kernel density estimation
4.2.2.1 Definition of the Kernel estimate
4.2.2.2 Statistics of the Kernel density
4.2.2.3 Confidence intervals and confidence bands
4.2.3 Bandwidth selection in practice
4.2.3.1 Kernel estimation using reference distribution
4.2.3.2 Plug-in methods
4.2.3.3 Cross-validation
4.2.4 Nonparametric regression
4.2.4.1 The Nadaraya-Watson estimator
4.2.4.2 Kernel smoothing algorithm
4.2.4.3 The K-nearest neighbour
4.2.5 Bandwidth selection
4.2.5.1 Estimation of the average squared error
4.2.5.2 Penalising functions
4.2.5.3 Cross-validation
4.3 Trend filtering in the time domain
4.3.1 Some basic principles
4.3.2 The local averages
4.3.3 The Savitzky-Golay filter
4.3.4 The least squares filters
4.3.4.1 The L2 filtering
4.3.4.2 The L1 filtering
4.3.4.3 The Kalman filters
4.3.5 Calibration
4.3.6 Introducing linear prediction

5 Presenting time series analysis
5.1 Basic principles of linear time series
5.1.1 Stationarity
5.1.2 The autocorrelation function
5.1.3 The portmanteau test
5.2 Linear time series
5.2.1 Defining time series
5.2.2 The autoregressive models
5.2.2.1 Definition
5.2.2.2 Some properties
5.2.2.3 Identifying and estimating AR models
5.2.2.4 Parameter estimation
5.2.3 The moving-average models
5.2.4 The simple ARMA model
5.3 Forecasting
5.3.1 Forecasting with the AR models
5.3.2 Forecasting with the MA models
5.3.3 Forecasting with the ARMA models
5.4 Nonstationarity and serial correlation
5.4.1 Unit-root nonstationarity
5.4.1.1 The random walk
5.4.1.2 The random walk with drift
5.4.1.3 The unit-root test
5.4.2 Regression models with time series
5.4.3 Long-memory models
5.5 Multivariate time series
5.5.1 Characteristics
5.5.2 Introduction to a few models
5.5.3 Principal component analysis
5.6 Some conditional heteroscedastic models
5.6.1 The ARCH model
5.6.2 The GARCH model
5.6.3 The integrated GARCH model
5.6.4 The GARCH-M model
5.6.5 The exponential GARCH model
5.6.6 The stochastic volatility model
5.6.7 Another approach: high-frequency data
5.6.8 Forecasting evaluation
5.7 Exponential smoothing and forecasting data
5.7.1 The moving average
5.7.1.1 Simple moving average
5.7.1.2 Weighted moving average
5.7.1.3 Exponential smoothing
5.7.1.4 Exponential moving average revisited
5.7.2 Introducing exponential smoothing models
5.7.2.1 Linear exponential smoothing
5.7.2.2 The damped trend model
5.7.3 A summary
5.7.4 Model fitting
5.7.5 Prediction intervals and random simulation
5.7.6 Random coefficient state space model

6 Filtering and forecasting with wavelet analysis
6.1 Introducing wavelet analysis
6.1.1 From spectral analysis to wavelet analysis
6.1.1.1 Spectral analysis
6.1.1.2 Wavelet analysis
6.1.2 The à trous wavelet decomposition
6.2 Some applications
6.2.1 A brief review
6.2.2 Filtering with wavelets
6.2.3 Non-stationarity
6.2.4 Decomposition tool for seasonality extraction
6.2.5 Interdependence between variables
6.2.6 Introducing long memory processes
6.3 Presenting wavelet-based forecasting methods
6.3.1 Forecasting with the à trous wavelet transform
6.3.2 The redundant Haar wavelet transform for time-varying data
6.3.3 The multiresolution autoregressive model
6.3.3.1 Linear model
6.3.3.2 Non-linear model
6.3.4 The neuro-wavelet hybrid model
6.4 Some wavelets applications to finance
6.4.1 Deriving strategies from wavelet analysis
6.4.2 Literature review

III Quantitative trading in inefficient markets

7 Introduction to quantitative strategies
7.1 Presenting hedge funds . . . . . . . . . . . . . . . . .
7.1.1 Classifying hedge funds . . . . . . . . . . . .
7.1.2 Some facts about leverage . . . . . . . . . . .
7.1.2.1 Defining leverage . . . . . . . . . .
7.1.2.2 Different measures of leverage . . .
7.1.2.3 Leverage and risk . . . . . . . . . .
7.2 Different types of strategies . . . . . . . . . . . . . . .
7.2.1 Long-short portfolio . . . . . . . . . . . . . .
7.2.1.1 The problem with long-only portfolio

8

261
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

262
262
262
263
263
263
264
264
264
264

Quantitative Analytics

7.2.1.2 The benefits of long-short portfolio
7.2.2 Equity market neutral . . . . . . . . . . . . .
7.2.3 Pairs trading . . . . . . . . . . . . . . . . .
7.2.4 Statistical arbitrage . . . . . . . . . . . . . .
7.2.5 Mean-reversion strategies . . . . . . . . . .
7.2.6 Adaptive strategies . . . . . . . . . . . . . .
7.2.7 Constraints and fees on short-selling . . . . .
Enhanced active strategies . . . . . . . . . . . . . .
7.3.1 Definition . . . . . . . . . . . . . . . . . . .
7.3.2 Some misconceptions . . . . . . . . . . . .
7.3.3 Some benefits . . . . . . . . . . . . . . . . .
7.3.4 The enhanced prime brokerage structures . .
Measuring the efficiency of portfolio implementation
7.4.1 Measures of efficiency . . . . . . . . . . . .
7.4.2 Factors affecting performances . . . . . . . .


265
266
267
269
270
270
271
271
271
272
273
274
275
275
276

Describing quantitative strategies
8.1 Time series momentum strategies . . . . . . . . . . .
8.1.1 The univariate time-series strategy . . . . . .
8.1.2 The momentum signals . . . . . . . . . . . .
8.1.2.1 Return sign . . . . . . . . . . . . .
8.1.2.2 Moving Average . . . . . . . . . .
8.1.2.3 EEMD Trend Extraction . . . . . .
8.1.2.4 Time-Trend t-statistic . . . . . . .
8.1.2.5 Statistically Meaningful Trend . .
8.1.3 The signal speed . . . . . . . . . . . . . . .
8.1.4 The relative strength index . . . . . . . . . .
8.1.5 Regression analysis . . . . . . . . . . . . . .
8.1.6 The momentum profitability . . . . . . . . .
8.2 Factors analysis . . . . . . . . . . . . . . . . . . . .
8.2.1 Presenting the factor model . . . . . . . . .
8.2.2 Some trading applications . . . . . . . . . .
8.2.2.1 Pairs-trading . . . . . . . . . . . .
8.2.2.2 Decomposing stock returns . . . .
8.2.3 A systematic approach . . . . . . . . . . . .
8.2.3.1 Modelling returns . . . . . . . . .
8.2.3.2 The market neutral portfolio . . . .
8.2.4 Estimating the factor model . . . . . . . . .
8.2.4.1 The PCA approach . . . . . . . .
8.2.4.2 The selection of the eigenportfolios
8.2.5 Strategies based on mean-reversion . . . . .
8.2.5.1 The mean-reverting model . . . . .
8.2.5.2 Pure mean-reversion . . . . . . . .
8.2.5.3 Mean-reversion with drift . . . . .
8.2.6 Portfolio optimisation . . . . . . . . . . . .
8.2.7 Back-testing . . . . . . . . . . . . . . . . .
8.3 The meta strategies . . . . . . . . . . . . . . . . . .
8.3.1 Presentation . . . . . . . . . . . . . . . . . .
8.3.1.1 The trading signal . . . . . . . . .
8.3.1.2 The strategies . . . . . . . . . . .


278
278
278
279
279
279
280
280
280
281
281
282
283
284
284
287
287
287
288
288
289
290
290
291
292
292
294
294
295
297
297
297
297
298

7.3

7.4

8

9


8.3.2


298
298
299
300
301
301
301

Portfolio management under constraints
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.2 Robust portfolio allocation . . . . . . . . . . . . . . . . . . . . . . . .
9.2.1 Long-short mean-variance approach under constraints . . . . .
9.2.2 Portfolio selection . . . . . . . . . . . . . . . . . . . . . . . .
9.2.2.1 Long only investment: non-leveraged . . . . . . . . .
9.2.2.2 Short selling: No ruin constraints . . . . . . . . . . .
9.2.2.3 Long only investment: leveraged . . . . . . . . . . .
9.2.2.4 Short selling and leverage . . . . . . . . . . . . . . .
9.3 Empirical log-optimal portfolio selections . . . . . . . . . . . . . . . .
9.3.1 Static portfolio selection . . . . . . . . . . . . . . . . . . . . .
9.3.2 Constantly rebalanced portfolio selection . . . . . . . . . . . .
9.3.2.1 Log-optimal portfolio for memoryless market process
9.3.2.2 Semi-log-optimal portfolio . . . . . . . . . . . . . .
9.3.3 Time varying portfolio selection . . . . . . . . . . . . . . . . .
9.3.3.1 Log-optimal portfolio for stationary market process .
9.3.3.2 Empirical portfolio selection . . . . . . . . . . . . .
9.3.4 Regression function estimation: The local averaging estimates .
9.3.4.1 The partitioning estimate . . . . . . . . . . . . . . .
9.3.4.2 The Nadaraya-Watson kernel estimate . . . . . . . .
9.3.4.3 The k-nearest neighbour estimate . . . . . . . . . . .
9.3.4.4 The correspondence . . . . . . . . . . . . . . . . . .
9.4 A simple example . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.4.1 A self-financed long-short portfolio . . . . . . . . . . . . . . .
9.4.2 Allowing for capital inflows and outflows . . . . . . . . . . . .
9.4.3 Allocating the weights . . . . . . . . . . . . . . . . . . . . . .
9.4.3.1 Choosing uniform weights . . . . . . . . . . . . . .
9.4.3.2 Choosing Beta for the weight . . . . . . . . . . . . .
9.4.3.3 Choosing Alpha for the weight . . . . . . . . . . . .
9.4.3.4 Combining Alpha and Beta for the weight . . . . . .
9.4.4 Building a beta neutral portfolio . . . . . . . . . . . . . . . . .
9.4.4.1 A quasi-beta neutral portfolio . . . . . . . . . . . . .
9.4.4.2 An exact beta-neutral portfolio . . . . . . . . . . . .
9.5 Value at Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.5.1 Defining value at risk . . . . . . . . . . . . . . . . . . . . . . .
9.5.2 Computing value at risk . . . . . . . . . . . . . . . . . . . . .
9.5.2.1 RiskMetrics . . . . . . . . . . . . . . . . . . . . . .
9.5.2.2 Econometric models to VaR calculation . . . . . . . .
9.5.2.3 Quantile estimation to VaR calculation . . . . . . . .
9.5.2.4 Extreme value theory to VaR calculation . . . . . . .


303
303
304
304
307
308
310
312
313
314
314
315
316
318
318
318
319
320
320
321
322
322
322
322
325
326
326
326
327
327
327
327
328
328
328
329
329
330
332
334

8.4

9

The risk measures . . . . . . . . . . . . .
8.3.2.1 Conditional expectations . . . .
8.3.2.2 Some examples . . . . . . . . .
8.3.3 Computing the Sharpe ratio of the strategies
Random sampling measures of risk . . . . . . . . .
8.4.1 The sample Sharpe ratio . . . . . . . . . .
8.4.2 The sample conditional Sharpe ratio . . . .

10



IV

Quantitative trading in multifractal markets

10 The fractal market hypothesis
10.1 Fractal structure in the markets . . . . . . . . . . . . . . . . . .
10.1.1 Introducing fractal analysis . . . . . . . . . . . . . . . .
10.1.1.1 A brief history . . . . . . . . . . . . . . . . .
10.1.1.2 Presenting the results . . . . . . . . . . . . .
10.1.2 Defining random fractals . . . . . . . . . . . . . . . . .
10.1.2.1 The fractional Brownian motion . . . . . . . .
10.1.2.2 The multidimensional fBm . . . . . . . . . .
10.1.2.3 The fractional Gaussian noise . . . . . . . . .
10.1.2.4 The fractal process and its distribution . . . .
10.1.2.5 An application to finance . . . . . . . . . . .
10.1.3 A first approach to generating random fractals . . . . . .
10.1.3.1 Approximating fBm by spectral synthesis . .
10.1.3.2 The ARFIMA models . . . . . . . . . . . . .
10.1.4 From efficient to fractal market hypothesis . . . . . . .
10.1.4.1 Some limits of the efficient market hypothesis
10.1.4.2 The Larrain KZ model . . . . . . . . . . . . .
10.1.4.3 The coherent market hypothesis . . . . . . . .
10.1.4.4 Defining the fractal market hypothesis . . . .
10.2 The R/S analysis . . . . . . . . . . . . . . . . . . . . . . . . .
10.2.1 Defining R/S analysis for financial series . . . . . . . .
10.2.2 A step-by-step guide to R/S analysis . . . . . . . . . . .
10.2.2.1 A first approach . . . . . . . . . . . . . . . .
10.2.2.2 A better step-by-step method . . . . . . . . .
10.2.3 Testing the limits of R/S analysis . . . . . . . . . . . . .
10.2.4 Improving the R/S analysis . . . . . . . . . . . . . . . .
10.2.4.1 Reducing bias . . . . . . . . . . . . . . . . .
10.2.4.2 Lo’s modified R/S statistic . . . . . . . . . .
10.2.4.3 Removing short-term memory . . . . . . . .
10.2.5 Detecting periodic and nonperiodic cycles . . . . . . . .
10.2.5.1 The natural period of a system . . . . . . . .
10.2.5.2 The V statistic . . . . . . . . . . . . . . . . .
10.2.5.3 The Hurst exponent and chaos theory . . . . .
10.2.6 Possible models for FMH . . . . . . . . . . . . . . . .
10.2.6.1 A few points about chaos theory . . . . . . .
10.2.6.2 Using R/S analysis to detect noisy chaos . .
10.2.6.3 A unified theory . . . . . . . . . . . . . . . .
10.2.7 Revisiting the measures of volatility risk . . . . . . . . .
10.2.7.1 The standard deviation . . . . . . . . . . . . .
10.2.7.2 The fractal dimension as a measure of risk . .
10.3 Hurst exponent estimation methods . . . . . . . . . . . . . . . .
10.3.1 Estimating the Hurst exponent with wavelet analysis . .
10.3.2 Detrending methods . . . . . . . . . . . . . . . . . . .
10.3.2.1 Detrended fluctuation analysis . . . . . . . .
10.3.2.2 A modified DFA . . . . . . . . . . . . . . . .
10.3.2.3 Detrending moving average . . . . . . . . . .
10.3.2.4 DMA in high dimensions . . . . . . . . . . .
10.3.2.5 The periodogram and the Whittle estimator . .

11

337

338
338
338
338
339
342
342
344
344
345
346
347
347
348
350
350
351
352
353
353
353
355
355
356
357
358
358
359
360
360
360
361
361
362
362
363
364
365
365
366
367
367
369
370
372
372
373
374


10.4 Testing for market efficiency . . . . . . . . . . . . . . . . . . . . . . . .
10.4.1 Presenting the main controversy . . . . . . . . . . . . . . . . . .
10.4.2 Using the Hurst exponent to define the null hypothesis . . . . . .
10.4.2.1 Defining long-range dependence . . . . . . . . . . . .
10.4.2.2 Defining the null hypothesis . . . . . . . . . . . . . . .
10.4.3 Measuring temporal correlation in financial data . . . . . . . . .
10.4.3.1 Statistical studies . . . . . . . . . . . . . . . . . . . .
10.4.3.2 An example on foreign exchange rates . . . . . . . . .
10.4.4 Applying R/S analysis to financial data . . . . . . . . . . . . . .
10.4.4.1 A first analysis on the capital markets . . . . . . . . . .
10.4.4.2 A deeper analysis on the capital markets . . . . . . . .
10.4.4.3 Defining confidence intervals for long-memory analysis
10.4.5 Some criticisms of Lo’s modified R/S statistic . . . . . . . . . . . .
10.4.6 The problem of non-stationary and dependent increments . . . . .
10.4.6.1 Non-stationary increments . . . . . . . . . . . . . . . .
10.4.6.2 Finite sample . . . . . . . . . . . . . . . . . . . . . .
10.4.6.3 Dependent increments . . . . . . . . . . . . . . . . . .
10.4.6.4 Applying stress testing . . . . . . . . . . . . . . . . .
10.4.7 Some results on measuring the Hurst exponent . . . . . . . . . .
10.4.7.1 Accuracy of the Hurst estimation . . . . . . . . . . . .
10.4.7.2 Robustness for various sample size . . . . . . . . . . .
10.4.7.3 Computation time . . . . . . . . . . . . . . . . . . . .


374
374
375
375
376
376
376
377
378
378
378
379
380
381
381
381
382
382
383
383
385
387

11 The multifractal markets
11.1 Multifractality as a new stylised fact . . . . . . . . . . . . . . . . . . . . . .
11.1.1 The multifractal scaling behaviour of time series . . . . . . . . . . .
11.1.1.1 Analysing complex signals . . . . . . . . . . . . . . . . .
11.1.1.2 A direct application to financial time series . . . . . . . . .
11.1.2 Defining multifractality . . . . . . . . . . . . . . . . . . . . . . . . .
11.1.2.1 Fractal measures and their singularities . . . . . . . . . . .
11.1.2.2 Scaling analysis . . . . . . . . . . . . . . . . . . . . . . .
11.1.2.3 Multifractal analysis . . . . . . . . . . . . . . . . . . . . .
11.1.2.4 The wavelet transform and the thermodynamical formalism
11.1.3 Observing multifractality in financial data . . . . . . . . . . . . . . .
11.1.3.1 Applying multiscaling analysis . . . . . . . . . . . . . . .
11.1.3.2 Applying multifractal fluctuation analysis . . . . . . . . .
11.2 Holder exponent estimation methods . . . . . . . . . . . . . . . . . . . . . .
11.2.1 Applying the multifractal formalism . . . . . . . . . . . . . . . . . .
11.2.2 The multifractal wavelet analysis . . . . . . . . . . . . . . . . . . .
11.2.2.1 The wavelet transform modulus maxima . . . . . . . . . .
11.2.2.2 Wavelet multifractal DFA . . . . . . . . . . . . . . . . . .
11.2.3 The multifractal fluctuation analysis . . . . . . . . . . . . . . . . . .
11.2.3.1 Direct and indirect procedure . . . . . . . . . . . . . . . .
11.2.3.2 Multifractal detrended fluctuation . . . . . . . . . . . . . .
11.2.3.3 Multifractal empirical mode decomposition . . . . . . . .
11.2.3.4 The R/S analysis extended . . . . . . . . . . . . . . . . .
11.2.3.5 Multifractal detrending moving average . . . . . . . . . .
11.2.3.6 Some comments about using MFDFA . . . . . . . . . . . .
11.2.4 General comments on multifractal analysis . . . . . . . . . . . . . .
11.2.4.1 Characteristics of the generalised Hurst exponent . . . . .


390
390
390
390
391
391
391
394
396
399
400
400
401
402
402
403
404
405
406
406
407
408
408
409
409
411
411

12



11.3

11.4

11.5

11.6

11.2.4.2 Characteristics of the multifractal spectrum . . . . . . . .
11.2.4.3 Some issues regarding terminology and definition . . . .
The need for time and scale dependent Hurst exponent . . . . . . . . . . .
11.3.1 Computing the Hurst exponent on a sliding window . . . . . . . . .
11.3.1.1 Introducing time-dependent Hurst exponent . . . . . . .
11.3.1.2 Describing the sliding window . . . . . . . . . . . . . .
11.3.1.3 Understanding the time-dependent Hurst exponent . . . .
11.3.1.4 Time and scale Hurst exponent . . . . . . . . . . . . . .
11.3.2 Testing the markets for multifractality . . . . . . . . . . . . . . . .
11.3.2.1 A summary on temporal correlation in financial data . . .
11.3.2.2 Applying sliding windows . . . . . . . . . . . . . . . . .
Local Holder exponent estimation methods . . . . . . . . . . . . . . . . .
11.4.1 The wavelet analysis . . . . . . . . . . . . . . . . . . . . . . . . .
11.4.1.1 The effective Holder exponent . . . . . . . . . . . . . .
11.4.1.2 Gradient modulus wavelet projection . . . . . . . . . . .
11.4.1.3 Testing the performances of wavelet multifractal methods
11.4.2 The fluctuation analysis . . . . . . . . . . . . . . . . . . . . . . .
11.4.2.1 Local detrended fluctuation analysis . . . . . . . . . . .
11.4.2.2 The multifractal spectrum and the local Hurst exponent .
11.4.3 Detection and localisation of outliers . . . . . . . . . . . . . . . .
11.4.4 Testing for the validity of the local Hurst exponent . . . . . . . . .
11.4.4.1 Local change of fractal structure . . . . . . . . . . . . .
11.4.4.2 Abrupt change of fractal structure . . . . . . . . . . . . .
11.4.4.3 A simple explanation . . . . . . . . . . . . . . . . . . .
Analysing the multifractal markets . . . . . . . . . . . . . . . . . . . . . .
11.5.1 Describing the method . . . . . . . . . . . . . . . . . . . . . . . .
11.5.2 Testing for trend and mean-reversion . . . . . . . . . . . . . . . .
11.5.2.1 The equity market . . . . . . . . . . . . . . . . . . . . .
11.5.2.2 The FX market . . . . . . . . . . . . . . . . . . . . . . .
11.5.3 Testing for crash prediction . . . . . . . . . . . . . . . . . . . . .
11.5.3.1 The Asian crisis in 1997 . . . . . . . . . . . . . . . . . .
11.5.3.2 The dot-com bubble in 2000 . . . . . . . . . . . . . . . .
11.5.3.3 The financial crisis of 2007 . . . . . . . . . . . . . . . .
11.5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Some multifractal models for asset pricing . . . . . . . . . . . . . . . . . .

12 Systematic trading
12.1 Introduction . . . . . . . . . . . . . . . . . .
12.2 Technical analysis . . . . . . . . . . . . . . .
12.2.1 Definition . . . . . . . . . . . . . . .
12.2.2 Technical indicator . . . . . . . . . .
12.2.3 Optimising portfolio selection . . . .
12.2.3.1 Classifying strategies . . .
12.2.3.2 Examples of multiple rules

V


Numerical Analysis


411
412
415
415
415
415
416
417
417
417
418
421
421
421
422
423
423
423
425
425
426
426
427
427
428
428
430
430
431
432
432
433
434
435
436


441
441
442
442
443
443
444
445

446

13 Presenting some machine-learning methods
448
13.1 Some facts on machine-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448

13


13.2

13.3

13.4

13.5

13.6

13.1.1 Introduction to data mining . . . . . . . . . . . . . . . . .
13.1.2 The challenges of computational learning . . . . . . . . .
Introduction to information theory . . . . . . . . . . . . . . . . .
13.2.1 Presenting a few concepts . . . . . . . . . . . . . . . . .
13.2.2 Some facts on entropy in information theory . . . . . . .
13.2.3 Relative entropy and mutual information . . . . . . . . .
13.2.4 Bounding performance measures . . . . . . . . . . . . . .
13.2.5 Feature selection . . . . . . . . . . . . . . . . . . . . . .
Introduction to artificial neural networks . . . . . . . . . . . . . .
13.3.1 Presentation . . . . . . . . . . . . . . . . . . . . . . . . .
13.3.2 Gradient descent and the delta rule . . . . . . . . . . . . .
13.3.3 Introducing multilayer networks . . . . . . . . . . . . . .
13.3.3.1 Describing the problem . . . . . . . . . . . . .
13.3.3.2 Describing the algorithm . . . . . . . . . . . .
13.3.3.3 A simple example . . . . . . . . . . . . . . . .
13.3.4 Multi-layer back propagation . . . . . . . . . . . . . . . .
13.3.4.1 The output layer . . . . . . . . . . . . . . . . .
13.3.4.2 The first hidden layer . . . . . . . . . . . . . .
13.3.4.3 The next hidden layer . . . . . . . . . . . . . .
13.3.4.4 Some remarks . . . . . . . . . . . . . . . . . .
Online learning and regret-minimising algorithms . . . . . . . . .
13.4.1 Simple online algorithms . . . . . . . . . . . . . . . . . .
13.4.1.1 The Halving algorithm . . . . . . . . . . . . .
13.4.1.2 The weighted majority algorithm . . . . . . . .
13.4.2 The online convex optimisation . . . . . . . . . . . . . .
13.4.2.1 The online linear optimisation problem . . . . .
13.4.2.2 Considering Bregman divergence . . . . . . . .
13.4.2.3 More on the online convex optimisation problem
Presenting the problem of automated market making . . . . . . .
13.5.1 The market neutral case . . . . . . . . . . . . . . . . . .
13.5.2 The case of infinite outcome space . . . . . . . . . . . . .
13.5.3 Relating market design to machine learning . . . . . . . .
13.5.4 The assumptions of market completeness . . . . . . . . .
Presenting scoring rules . . . . . . . . . . . . . . . . . . . . . . .
13.6.1 Describing a few scoring rules . . . . . . . . . . . . . . .
13.6.1.1 The proper scoring rules . . . . . . . . . . . . .
13.6.1.2 The market scoring rules . . . . . . . . . . . .
13.6.2 Relating MSR to cost function based market makers . . .

14 Introducing Differential Evolution
14.1 Introduction . . . . . . . . . . . . . . . . . . . . .
14.2 Calibration to implied volatility . . . . . . . . . . .
14.2.1 Introducing calibration . . . . . . . . . . .
14.2.1.1 The general idea . . . . . . . . .
14.2.1.2 Measures of pricing errors . . . .
14.2.2 The calibration problem . . . . . . . . . .
14.2.3 The regularisation function . . . . . . . . .
14.2.4 Beyond deterministic optimisation method
14.3 Nonlinear programming problems with constraints
14.3.1 Describing the problem . . . . . . . . . . .

14


448
449
451
451
452
453
455
457
460
460
461
462
463
463
465
465
466
466
468
470
471
471
471
471
473
473
473
474
475
475
476
479
480
480
480
480
481
482


483
483
483
483
483
484
485
486
487
487
487


14.3.1.1 A brief history . . . . . . . . . . . . . . . . . . .
14.3.1.2 Defining the problems . . . . . . . . . . . . . . .
14.3.2 Some optimisation methods . . . . . . . . . . . . . . . . .
14.3.2.1 Random optimisation . . . . . . . . . . . . . . .
14.3.2.2 Harmony search . . . . . . . . . . . . . . . . . .
14.3.2.3 Particle swarm optimisation . . . . . . . . . . . .
14.3.2.4 Cross entropy optimisation . . . . . . . . . . . .
14.3.2.5 Simulated annealing . . . . . . . . . . . . . . . .
14.3.3 The DE algorithm . . . . . . . . . . . . . . . . . . . . . .
14.3.3.1 The mutation . . . . . . . . . . . . . . . . . . . .
14.3.3.2 The recombination . . . . . . . . . . . . . . . . .
14.3.3.3 The selection . . . . . . . . . . . . . . . . . . . .
14.3.3.4 Convergence criteria . . . . . . . . . . . . . .
14.3.4 Pseudocode . . . . . . . . . . . . . . . . . . . . . . . . . .
14.3.5 The strategies . . . . . . . . . . . . . . . . . . . . . . . . .
14.3.5.1 Scheme DE1 . . . . . . . . . . . . . . . . . . . .
14.3.5.2 Scheme DE2 . . . . . . . . . . . . . . . . . . . .
14.3.5.3 Scheme DE3 . . . . . . . . . . . . . . . . . . . .
14.3.5.4 Scheme DE4 . . . . . . . . . . . . . . . . . . . .
14.3.5.5 Scheme DE5 . . . . . . . . . . . . . . . . . . . .
14.3.5.6 Scheme DE6 . . . . . . . . . . . . . . . . . . . .
14.3.5.7 Scheme DE7 . . . . . . . . . . . . . . . . . . . .
14.3.5.8 Scheme DE8 . . . . . . . . . . . . . . . . . . . .
14.3.6 Improvements . . . . . . . . . . . . . . . . . . . . . . . . .
14.3.6.1 Ageing . . . . . . . . . . . . . . . . . . . . . . .
14.3.6.2 Constraints on parameters . . . . . . . . . . . . .
14.3.6.3 Convergence . . . . . . . . . . . . . . . . . . . .
14.3.6.4 Self-adaptive parameters . . . . . . . . . . . . .
14.3.6.5 Selection . . . . . . . . . . . . . . . . . . . . . .
14.4 Handling the constraints . . . . . . . . . . . . . . . . . . . . . . .
14.4.1 Describing the problem . . . . . . . . . . . . . . . . . . . .
14.4.2 Defining the feasibility rules . . . . . . . . . . . . . . . . .
14.4.3 Improving the feasibility rules . . . . . . . . . . . . . . . .
14.4.4 Handling diversity . . . . . . . . . . . . . . . . . . . . . .
14.5 The proposed algorithm . . . . . . . . . . . . . . . . . . . . . . . .
14.6 Describing some benchmarks . . . . . . . . . . . . . . . . . . . . .
14.6.1 Minimisation of the sphere function . . . . . . . . . . . . .
14.6.2 Minimisation of the Rosenbrock function . . . . . . . . . .
14.6.3 Minimisation of the step function . . . . . . . . . . . . . .
14.6.4 Minimisation of the Rastrigin function . . . . . . . . . . . .
14.6.5 Minimisation of the Griewank function . . . . . . . . . . .
14.6.6 Minimisation of the Easom function . . . . . . . . . . . . .
14.6.7 Image from polygons . . . . . . . . . . . . . . . . . . . . .
14.6.8 Minimisation problem g01 . . . . . . . . . . . . . . . . . .
14.6.9 Maximisation problem g03 . . . . . . . . . . . . . . . . . .
14.6.10 Maximisation problem g08 . . . . . . . . . . . . . . . . . .
14.6.11 Minimisation problem g11 . . . . . . . . . . . . . . . . . .
14.6.12 Minimisation of the weight of a tension/compression spring

15

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

487
487
489
489
490
491
492
493
494
494
494
495
495
495
496
496
496
496
496
497
497
497
498
498
498
499
499
499
499
500
500
500
501
502
503
504
505
505
505
506
506
506
507
507
508
508
508
508

Quantitative Analytics

15 Introduction to CUDA Programming in Finance
15.1 Introduction
15.1.1 A brief overview
15.1.2 Preliminary words on parallel programming
15.1.3 Why GPUs?
15.1.4 Why CUDA?
15.1.5 Applications in financial computing
15.2 Programming with CUDA
15.2.1 Hardware
15.2.2 Thread hierarchy
15.2.3 Memory management
15.2.4 Syntax and connection to C/C++
15.2.5 Random number generation
15.2.5.1 Memory storage
15.2.5.2 Inline
15.3 Case studies
15.3.1 Exotic swaps in Monte-Carlo
15.3.1.1 Product and model
15.3.1.2 Single-thread algorithm
15.3.1.3 Multi-thread algorithm
15.3.1.4 Using the texture memory
15.3.2 Volatility calibration by differential evolution
15.3.2.1 Model and difficulties
15.3.2.2 Single-thread algorithm
15.3.2.3 Multi-thread algorithm
15.4 Conclusion

Appendices
A Review of some mathematical facts
A.1 Some facts on convex and concave analysis
A.1.1 Convex functions
A.1.2 Concave functions
A.1.3 Some approximations
A.1.4 Conjugate duality
A.1.5 A note on Legendre transformation
A.1.6 A note on the Bregman divergence
A.2 The logistic function
A.3 The convergence of series
A.4 The Dirac function
A.5 Some linear algebra
A.6 Some facts on matrices
A.7 Utility function
A.7.1 Definition
A.7.2 Some properties
A.7.3 Some specific utility functions
A.7.4 Mean-variance criterion
A.7.4.1 Normal returns
A.7.4.2 Non-normal returns
A.8 Optimisation
A.9 Conjugate gradient method
B Some probabilities
B.1 Some definitions
B.2 Random variables
B.2.1 Discrete random variables
B.2.2 Continuous random variables
B.3 Introducing stochastic processes
B.4 The characteristic function and moments
B.4.1 Definitions
B.4.2 The first two moments
B.4.3 Trading correlation
B.5 Conditional moments
B.5.1 Conditional expectation
B.5.2 Conditional variance
B.5.3 More details on conditional expectation
B.5.3.1 Some discrete results
B.5.3.2 Some continuous results
B.6 About fractal analysis
B.6.1 The fractional Brownian motion
B.6.2 The R/S analysis
B.7 Some continuous variables and their distributions
B.7.1 Some popular distributions
B.7.1.1 Uniform distribution
B.7.1.2 Exponential distribution
B.7.1.3 Normal distribution
B.7.1.4 Gamma distribution
B.7.1.5 Chi-square distribution
B.7.1.6 Weibull distribution
B.7.2 Normal and Lognormal distributions
B.7.3 Multivariate Normal distributions
B.7.4 Distributions arising from the Normal distribution
B.7.4.1 Presenting the problem
B.7.4.2 The t-distribution
B.7.4.3 The F-distribution
B.8 Some results on Normal sampling
B.8.1 Estimating the mean and variance
B.8.2 Estimating the mean with known variance
B.8.3 Estimating the mean with unknown variance
B.8.4 Estimating the parameters of a linear model
B.8.5 Asymptotic confidence interval
B.8.6 The setup of the Monte Carlo engine
B.9 Some random sampling
B.9.1 The sample moments
B.9.2 Estimation of a ratio
B.9.3 Stratified random sampling
B.9.4 Geometric mean
C Stochastic processes and Time Series
C.1 Introducing time series
C.1.1 Definitions
C.1.2 Estimation of trend and seasonality
C.1.3 Some sample statistics
C.2 The ARMA model
C.3 Fitting ARIMA models
C.4 State space models
C.5 ARCH and GARCH models
C.5.1 The ARCH process
C.5.2 The GARCH process
C.5.3 Estimating model parameters
C.6 The linear equation
C.6.1 Solving linear equation
C.6.2 A simple example
C.6.2.1 Covariance matrix
C.6.2.2 Expectation
C.6.2.3 Distribution and probability
C.6.3 From OU to AR(1) process
C.6.3.1 The Ornstein-Uhlenbeck process
C.6.3.2 Deriving the discrete model
C.6.4 Some facts about AR series
C.6.4.1 Persistence
C.6.4.2 Prewhitening and detrending
C.6.4.3 Simulation and prediction
C.6.5 Estimating the model parameters
D Defining market equilibrium and asset prices
D.1 Introducing the theory of general equilibrium
D.1.1 1 period, (d + 1) assets, k states of the world
D.1.2 Complete market
D.1.3 Optimisation with consumption
D.2 An introduction to the model of Von Neumann Morgenstern
D.2.1 Part I
D.2.2 Part II
D.3 Simple equilibrium model
D.3.1 m agents, (d + 1) assets
D.3.2 The consumption based asset pricing model
D.4 The n-dates model
D.5 Discrete option valuation
D.6 Valuation in financial markets
D.6.1 Pricing securities
D.6.2 Introducing the recovery theorem
D.6.3 Using implied volatilities
D.6.4 Bounding the pricing kernel
E Pricing and hedging options
E.1 Valuing options on multi-underlyings
E.1.1 Self-financing portfolios
E.1.2 Absence of arbitrage opportunity and rate of returns
E.1.3 Numeraire
E.1.4 Evaluation and hedging
E.2 The dynamics of financial assets
E.2.1 The Black-Scholes world
E.2.2 The dynamics of the bond price
E.3 From market prices to implied volatility
E.3.1 The Black-Scholes formula
E.3.2 The implied volatility in the Black-Scholes formula
E.3.3 The robustness of the Black-Scholes formula
E.4 Some properties satisfied by market prices
E.4.1 The no-arbitrage conditions
E.4.2 Pricing two special market products
E.4.2.1 The digital option
E.4.2.2 The butterfly option
E.5 Introduction to indifference pricing theory
E.5.1 Martingale measures and state-price densities
E.5.2 An overview
E.5.2.1 Describing the optimisation problem
E.5.2.2 The dual problem
E.5.3 The non-traded assets model
E.5.3.1 Discrete time
E.5.3.2 Continuous time
E.5.4 The pricing method
E.5.4.1 Computing indifference prices
E.5.4.2 Computing option prices
F Some results on signal processing
F.1 A short introduction to Fourier transform methods
F.1.1 Some analytical formalism
F.1.2 The Fourier integral
F.1.3 The Fourier transformation
F.1.4 The discrete Fourier transform
F.1.5 The Fast Fourier Transform algorithm
F.2 From spline analysis to wavelet analysis
F.2.1 An introduction to splines
F.2.2 Multiresolution spline processing
F.3 A short introduction to wavelet transform methods
F.3.1 The continuous wavelet transform
F.3.2 The discrete wavelet transform
F.3.2.1 An infinite summation of discrete wavelet coefficients
F.3.2.2 The scaling function
F.3.2.3 The FWT algorithm
F.3.3 Discrete input signals of finite length
F.3.3.1 Describing the algorithm
F.3.3.2 Presenting thresholding
F.3.4 Wavelet-based statistical measures
F.4 The problem of shift-invariance
F.4.1 A brief overview
F.4.1.1 Describing the problem
F.4.1.2 The à trous algorithm
F.4.1.3 Relating the à trous and Mallat algorithms
F.4.2 Describing some redundant transforms
F.4.2.1 The multiresolution analysis
F.4.2.2 The standard DWT
F.4.2.3 The ε-decimated DWT
F.4.2.4 The stationary wavelet transform
F.4.3 The autocorrelation functions of compactly supported wavelets

0.1 Introduction

0.1.1 Preamble

There is a vast literature on the investment decision making process and the associated assessment of expected returns on
investments. Traditionally, historical performances, economic theories, and forward looking indicators were
put forward for investors to judge expected returns. However, modern finance theory, including quantitative models
and econometric techniques, provided the foundation that has revolutionised the investment management industry over
the last 20 years. Technical analysis has initiated a broad current of literature in economics and statistical physics
refining and expanding the underlying concepts and models. It is remarkable to note that some of the features of
financial data were general enough to have sparked the interest of several scientific fields, from economics and
econometrics, to mathematics and physics, to further explore the behaviour of this data and develop models explaining
these characteristics. As a result, some theories found by one group of scientists were rediscovered at a later stage
by another group, or simply observed and mentioned in studies but not formalised. Financial textbooks presenting
academic and practitioner findings tend to be too vague and too restrictive, while published articles tend to be too
technical and too specialised. This guide tries to bridge the gap by presenting the necessary tools for performing
quantitative portfolio selection and allocation in a simple, yet robust way. We present in chronological order the
necessary steps to identify trading signals, build quantitative strategies, assess expected returns, measure and score
strategies, and allocate portfolios. This is done with the help of various published articles referenced throughout this guide,
as well as finance and economics textbooks. In the spirit of Alfred North Whitehead, we aim to seek the simplest
explanations of complex facts, which is achieved by structuring this book from the simple to the complex. This
pedagogic approach inevitably leads to some repetition of material. We first introduce some simple
ideas and concepts used to describe financial data, and then show how empirical evidence led to the introduction of
complexity, which modified the existing market consensus. This book is divided into five parts. We first present
and describe quantitative trading in classical economics, and provide the essential statistical tools. We then discuss
quantitative trading in inefficient markets before detailing quantitative trading in multifractal markets. Finally, we
present a few numerical tools to carry out the computations required by quantitative trading strategies.
The decision making process and portfolio allocation being a vast subject, this is not an exhaustive guide, and some
fields and techniques have not been covered. However, we intend to fill the gap over time by reviewing and updating
this book.

0.1.2 An overview of quantitative trading

Following the spirit of Focardi et al. [2004], who detailed how the growing use of quantitative methods changed
finance and investment theory, we present an overview of quantitative portfolio trading. Just as automation
and mechanisation were the cornerstones of the Industrial Revolution at the turn of the 19th century, modern finance
theory, quantitative models, and econometric techniques provide the foundation that has revolutionised the investment
management industry over the last 20 years. Quantitative models and scientific techniques are playing an increasingly
important role in the financial industry, affecting all steps in the investment management process, such as
• defining the policy statement
• setting the investment objectives
• selecting investment strategies
• implementing the investment plan
• constructing the portfolio
• monitoring, measuring, and evaluating investment performance

The most significant benefit is the power of automation, which enforces a systematic investment approach and a structured
and unified framework. Not only do completely automated risk models and marking-to-market processes provide
a powerful tool for analysing and tracking portfolio performance in real time, but they also provide the foundation for
complete process and system backtests. Quantifying the chain of decisions allows a portfolio manager to more fully
understand, compare, and calibrate investment strategies, underlying investment objectives and policies.
Since the pioneering work of Pareto [1896] at the end of the 19th century and the work of Von Neumann et al.
[1944], decision making has been modelled using both
1. a utility function to order choices, and
2. some probabilities to identify choices.
As a result, in order to complete the investment management process, market participants, or agents, can rely either
on subjective information, on a forecasting model, or on a combination of both. This heavy dependence of financial asset
management on the ability to forecast risk and returns led academics to develop a theory of market prices, resulting in
the general equilibrium theories (GET). In the classical approach, the Efficient Market Hypothesis (EMH) states that
current prices reflect all available or public information, so that future price changes can be determined only by new
information. That is, the markets follow a random walk (see Bachelier [1900] and Fama [1970]). Hence, agents are
coordinated by a central price signal, and as such, do not interact so that they can be aggregated to form a representative
agent whose optimising behaviour sets the optimal price process. Classical economics is based on the principles that
1. the agent decision making process can be represented as the maximisation of expected utility, and,
2. that agents have a perfect knowledge of the future (the stochastic processes on which they optimise are exactly
the true stochastic processes).
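As a minimal sketch of the first principle, the following Python snippet ranks two hypothetical lotteries by expected utility. The logarithmic utility, the lottery payoffs, and all function names are illustrative assumptions and are not taken from the text.

```python
import math

def expected_utility(lottery, utility=math.log):
    """Expected utility of a discrete lottery given as (probability, wealth) pairs."""
    total_prob = sum(p for p, _ in lottery)
    if abs(total_prob - 1.0) > 1e-12:
        raise ValueError("probabilities must sum to one")
    return sum(p * utility(w) for p, w in lottery)

def certainty_equivalent(lottery):
    """Sure wealth that gives the same log utility as the lottery."""
    return math.exp(expected_utility(lottery))

# Hypothetical choices: A is a sure wealth, B is a fair coin flip.
lotteries = {
    "A": [(1.0, 100.0)],
    "B": [(0.5, 50.0), (0.5, 160.0)],
}

# The agent orders the choices by expected utility and picks the best one.
best = max(lotteries, key=lambda name: expected_utility(lotteries[name]))
```

In this example lottery B has the higher expected wealth (105 against 100), yet the risk-averse log-utility agent ranks A first, which shows how the utility function, and not the expected payoff alone, orders the choices.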
The essence of general equilibrium theories (GET) is that the instantaneous and continuous interaction among
agents, taking advantage of arbitrage opportunities (AO) in the market, is the process that will force asset prices toward
equilibrium. Markowitz [1952] first introduced portfolio selection using a quantitative optimisation technique that
balances the trade-off between risk and return. His work laid the ground for the capital asset pricing model (CAPM),
the most fundamental general equilibrium theory in modern finance. The CAPM states that the expected value of
the excess return of any asset is proportional to the excess return of the total investible market, where the constant of
proportionality (the beta) is the covariance between the asset return and the market return divided by the variance of
the market return. Many criticisms of the mean-variance optimisation framework were formulated, such as the
oversimplified and unrealistic assumptions on the distribution of asset returns, and the high sensitivity of the
optimisation to its inputs (the expected returns of each asset and their covariance
matrix). Extensions to classical mean-variance optimisation were proposed to make the portfolio allocation process
more robust to different sources of risk, such as Bayesian approaches and Robust Portfolio Allocation. In addition,
higher moments were introduced in the optimisation process. Nonetheless, the question of whether general equilibrium
theories are appropriate representations of economic systems cannot be answered empirically.
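The CAPM relation can be illustrated with a short sketch. In the standard formulation, the constant of proportionality (the beta) is the covariance between the asset and market excess returns divided by the variance of the market excess return. The return series, the market premium, and the function names below are hypothetical and serve only as an illustration.

```python
from statistics import mean

def sample_cov(x, y):
    """Unbiased sample covariance of two equally long return series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def capm_beta(asset_excess, market_excess):
    """Beta = Cov(asset, market) / Var(market)."""
    return sample_cov(asset_excess, market_excess) / sample_cov(market_excess, market_excess)

def capm_expected_excess_return(beta, market_premium):
    """Expected excess return of the asset implied by the CAPM."""
    return beta * market_premium

# Hypothetical excess returns: the asset moves exactly twice as much as the market.
market = [0.01, -0.02, 0.03, 0.00, 0.02]
asset = [2.0 * m for m in market]

beta = capm_beta(asset, market)                          # close to 2
implied = capm_expected_excess_return(beta, 0.05)        # close to 0.10
```

With real data the estimated beta depends strongly on the sample window, which is one face of the input sensitivity mentioned above.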
Classical economics is founded on the concept of equilibrium. On one hand, econometric analysis assumes that, if
there are no outside, or exogenous, influences, then a system is at rest. The system reacts to external perturbation by
reverting to equilibrium in a linear fashion. On the other hand, it ignores time, or treats time as a simple variable by
assuming the market has no memory, or only limited memory of the past. These two points might explain why classical
economists had trouble forecasting our economic future. Clearly, the qualitative aspect coming from human decision
making process is missing. Over the last 30 years, econometric analysis has shown that asset prices present some
level of predictability contradicting models such as the CAPM or the APT, which are based on constant trends. As a
result, a different view on financial markets emerged postulating that markets are populated by interacting agents, that
is, agents making only imperfect forecasts and directly influencing each other, leading to feedback in financial markets
and potential asset price predictability. In consequence, factor models and other econometric techniques developed
to forecast price processes in view of capturing these financial patterns at some level. However, until recently, asset
price predictability seemed to be greater at the portfolio level than at the individual asset level. Since in most cases
it is not possible to measure the agent’s utility function and its ability to forecast returns, GET are considered as
abstract mathematical constructs which are either not easy or impossible to validate empirically. On the other hand,
econometrics has a strong data-mining component since it attempts at fitting generalised models to the market with
free parameters. As such, it has a strong empirical basis but a relatively simple theoretical foundation. Recently, with
the increased availability of data, econophysics emerged as a mix of physical sciences and economics to get the best
of both worlds, with a view to analysing asset predictability more deeply.
Since the EMH implicitly assumes that all investors immediately react to new information, so that the future is
unrelated to the past or the present, the Central Limit Theorem (CLT) could therefore be applied to capital market
analysis. The CLT was necessary to justify the use of probability calculus and linear models. However, in practice,
the decision process does not follow the general equilibrium theories (GET), as some agents may react to information
as it is received, while most agents wait for confirming information and do not react until a trend is established. The
uneven assimilation of information may cause a biased random walk (called fractional Brownian motion), which was
extensively studied by Hurst in the 1940s, and by Mandelbrot in the 1960s and 1970s. A large number of studies
showed that market returns were persistent time series with an underlying fractal probability distribution, following
a biased random walk. Stocks having Hurst exponents, H, greater than 1/2 are fractal, and application of standard
statistical analysis becomes of questionable value. In that case, variances are undefined, or infinite, making volatility
a useless and misleading estimate of risk. Since high H values mean less noise, more persistence, and clearer trends
than lower values of H, we might assume that higher values of H mean less risk. However, stocks with high H values
do have a higher risk of abrupt changes. The fractal nature of the capital markets contradicts the EMH and all the
quantitative models derived from it, such as the Capital Asset Pricing Model (CAPM), the Arbitrage Pricing Theory
(APT), and the Black-Scholes option pricing model, and other models depending on the normal distribution and/or
finite variance. This is because they simplify reality by assuming random behaviour, and they ignore the influence
of time on decision making. By assuming randomness, the models can be optimised for a single optimal solution.
That is, we can find optimal portfolios, intrinsic value, and fair price. On the other hand, fractal structure recognises
complexity and provides cycles, trends, and a range of fair values.
New theories to explain financial markets are gaining ground, among which is a multitude of interacting agents
forming a complex system characterised by a high level of uncertainty. Complexity theory deals with processes where
a large number of seemingly independent agents act coherently. Multiple interacting agent systems are subject to
contagion and propagation phenomena generating feedbacks and producing fat tails. Real feedback systems involve
long-term correlations and trends since memories of long-past events can still affect the decisions made in the present.
Most complex natural systems can be modelled by nonlinear differential, or difference, equations. These systems are
characterised by a high level of uncertainty which is embedded in the probabilistic structure of models. As a result,
econometrics can now supply the empirical foundation of economics. For instance, science being highly stratified, one
can build complex theories on the foundation of simpler theories. That is, starting with a collection of econometric
data, we model it and analyse it, obtaining statistical facts of an empirical nature that provide us with the building
blocks of future theoretical development. For instance, assuming that economic agents are heterogeneous, make
mistakes, and mutually interact leads to more freedom to devise economic theory (see Aoki [2004]).
With the growing quantity of data available, machine-learning methods that have been successfully applied in science are now applied to mining the markets. Data mining and more recent machine-learning methodologies provide
a range of general techniques for the classification, prediction, and optimisation of structured and unstructured data.
Neural networks, classification and decision trees, k-nearest neighbour methods, and support vector machines (SVM)
are some of the more common classification and prediction techniques used in machine learning. Further, combinatorial optimisation, genetic algorithms and reinforcement learning are now widespread. Using these techniques, one
can describe financial markets through degrees of freedom which may be both qualitative and quantitative in nature,
each node being the seat of a complicated mathematical entity. One could use a matrix form to represent interactions
between the various degrees of freedom of the different nodes, each link having a weight and a direction. Further, time
delays should be taken into account, leading to a non-symmetric matrix (see Ausloos [2010]).
Future success for portfolio managers will not only depend on their ability to provide excess returns in a risk-controlled fashion to investors, but also on their ability to incorporate financial innovation and process automation
into their frameworks. However, the quantitative approach is not without risk, introducing new sources of risk such as
model risk, operational risk, and an inescapable dependence on historical data as its raw material. One must therefore
be cautious on how the models are used, understand their weaknesses and limitations, and prevent applications beyond
what they were originally designed for. With more model parameters and more sophisticated econometric techniques,
we run the risk of over-fitting models, and distinguishing spurious phenomena as a result of data mining becomes a
difficult task.
In the rest of this guide we will present an overview of asset valuation in presence of risk and we will review the
evolution of quantitative methods. We will then present the necessary tools and techniques to design the main steps of
an automated investment management system, and we will address some of the challenges that need to be met.

Part I

Quantitative trading in classical economics

Chapter 1

Risk, preference, and valuation
1.1 A brief history of ideas

Pacioli [1494] as well as Pascal and Fermat (1654) considered the problem of the points, where a game of dice has to
be abandoned before it can be concluded, and asked how the pot (the total wager) should be distributed among the players in a fair
manner, introducing the concept of fairness (see Devlin [2008] for historical details). Pascal and Fermat agreed that
the fair solution is to give to each player the expectation value of his winnings. The expectation value they computed
is an ensemble average, where all possible outcomes of the game are enumerated, and the products of winnings and
probabilities associated with each outcome for each player are added up. Instead of considering only the state of the
universe as it is, or will be, an infinity of additional equally probable universes is imagined. The proportion of those
universes where some event occurs is the probability of that event. Following Pascal’s and Fermat’s work, others
recognised the potential of their investigation for making predictions. For instance, Halley [1693] devised a method
for pricing life annuities. Huygens [1657] is credited with making the concept of expectation values explicit and with
first proposing an axiomatic form of probability theory. A proven result in probability theory follows from the axioms
of probability theory, now usually those of Kolmogorov [1933].
Once the concept of probability and expectation values was introduced by Pascal and Fermat, the St Petersburg
paradox was the first well-documented example of a situation where the use of ensembles leads to absurd conclusions. The St Petersburg paradox rests on the apparent contradiction between a positively infinite expectation value of
winnings in a game and real people’s unwillingness to pay much to be allowed to participate in the game. Bernoulli
[1738-1954] (G. Cramer 1728, personal communication with N. Bernoulli) pointed out that because of this incongruence, the expectation value of net winnings has to be discarded as a descriptive or prescriptive behavioural rule. As
pointed out by Peters [2011a], one can decide what to change about the expectation value of net winnings, either the
expectation value or the net winnings. Bernoulli (and Cramer) chose to replace the net winnings by introducing utility,
and computing the expectation value of the gain in utility. They argued that the desirability or utility associated with a
financial gain depends not only on the gain itself but also on the wealth of the person who is making this gain. The expected utility theory (EUT) deals with the analysis of choices among risky projects with multidimensional outcomes.
The classical resolution is to apply a utility function to the wealth, which reflects the notion that the usefulness of an
amount of money depends on how much of it one already has, and then to maximise the expectation of this. The choice
of utility function is often framed in terms of the individual’s risk preferences and may vary between individuals. The
first important use of the EUT was that of Von Neumann and Morgenstern (VNM) [1944] who used the assumption of
expected utility maximisation in their formulation of game theory. When comparing objects one needs to rank utilities
but also compare the sizes of utilities. The VNM method of comparison involves considering probabilities. If a person can
choose between various randomised events (lotteries), then it is possible to additively compare for example a shirt and
a sandwich. Later, Kelly [1956], who contributed to the debate on time averages, computed time-average exponential
growth rates in games of chance (optimise wager sizes in a hypothetical horse race using private information) and
argued that utility was not necessary and too general to shed any light on the specific problems he considered. In the
same spirit, Peters [2011a] considered an alternative to Bernoulli’s approach by replacing the expectation value (or
ensemble average) with a time average, without introducing utility.
It is argued that Kelly [1956] is at the origin of the growth optimal portfolio (GOP), when he studied gambling
and information theory and stated that there is an optimal gambling strategy that will accumulate more wealth than
any other different strategy. This strategy is the growth optimal strategy. We refer the reader to Mosegaard Christensen
[2011] who presented a comprehensive review of the different connections in which the GOP has been applied. Since
one aspect of the GOP is the maximisation of the geometric mean, one can go back to Williams [1936] who considered
speculators in a multi-period setting and reached the conclusion that due to compounding, speculators should worry
about the geometric mean and not the arithmetic one. One can further go back in time by recognising that the GOP is
the choice of a log-utility investor, which was first discussed by Bernoulli and Cramer in the St. Petersburg paradox.
However, it was argued (leading to debate among economists) that the choice of the logarithm appears to have nothing
to do with the growth properties of the strategy (Cramer solved the paradox with a square-root function). Nonetheless,
the St. Petersburg paradox inspired Latane [1959], who, independently of Kelly, suggested that investors should
maximise the geometric mean of their portfolios, as this would maximise the probability that the portfolio would be
more valuable than any other portfolio. It was recently proved that when denominated in terms of the GOP, asset
prices become supermartingales, leading Long [1990] to consider change of numeraire and suggest a method for
measuring abnormal returns. The change of numeraire technique was then used for derivative pricing. No matter the
approach chosen, the perspective described by Bernoulli and Cramer has consequences far beyond the St Petersburg
paradox, including predictions and investment decisions, as in this case, the conceptual context changes from moral to
predictive. In the latter, one can assume that the expected gain (or growth factor or exponential growth rate) is the
relevant quantity for an individual deciding whether to take part in the lottery. However, considering the growth of an
investment over time can make this assumption somersault into being trivially false.
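To illustrate why compounding shifts the attention from the arithmetic mean to the geometric mean, we can consider a hypothetical repeated bet in which wealth is multiplied by 1.5 or by 0.6 with equal probability. These numbers are assumptions chosen for illustration and are not taken from the references above. A minimal Python sketch compares the two averages of the one-round growth factor:

```python
import math

# Hypothetical bet (illustrative numbers only): wealth is multiplied
# by 1.5 or by 0.6, each with probability 1/2.
factors = [1.5, 0.6]
probs = [0.5, 0.5]

# Ensemble (arithmetic) average of the one-round growth factor.
arithmetic_mean = sum(p * r for p, r in zip(probs, factors))

# Time-average (geometric) growth factor, exp(E[ln r]).
geometric_mean = math.exp(sum(p * math.log(r) for p, r in zip(probs, factors)))

print(f"arithmetic mean factor: {arithmetic_mean:.4f}")
print(f"geometric mean factor:  {geometric_mean:.4f}")
```

In this example the expected gain per round is positive, while the geometric mean factor is below one, so that a player who reinvests his whole wealth at each round sees it shrink in the long run.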
In order to explain the prices of economical goods, Walras [1874-7] started the theory of general equilibrium
by considering demand and supply and equating them, which was formalised later by Arrow-Debreu [1954] and
McKenzie [1959]. In parallel Arrow [1953] and then Debreu [1953] generalised the theory, which was static and deterministic, to the case of an uncertain future by introducing contingent prices (Arrow-Debreu state-prices). Arrow [1953]
proposed to create financial markets, and was at the origin of the modern theory of financial markets equilibrium. This
theory was developed to value asset prices and define market equilibrium. Radner [1976] improved Arrow’s model by
considering more general assets and introducing the concept of rational anticipation. Radner is also at the origin of the
incomplete market theory. Defining an arbitrage as a transaction involving no negative cash flow at any probabilistic or
temporal state, and a positive cash flow in at least one state (that is, the possibility of a risk-free profit after transaction
costs), the prices are said to constitute an arbitrage equilibrium if they do not allow for profitable arbitrage (see Ross
[1976]). An arbitrage equilibrium is a precondition for a general economic equilibrium (see Harrison et al. [1979]).
In complete markets, no arbitrage implies the existence of positive Arrow-Debreu state-prices, a risk-neutral measure
under which the expected return on any asset is the risk-free rate, and equivalently, the existence of a strictly positive
pricing kernel that can be used to price all assets by taking the expectation of their payoffs weighted by the kernel (see
Ross [2005]).
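As an illustration of the link between no arbitrage, state-prices, the risk-neutral measure and the pricing kernel, we can consider a one-period market with two states, a risk-free bond and a stock. The sketch below is only a toy example: the prices, the rate and the physical probability `p_up` are hypothetical values, and the function name `state_prices` is ours.

```python
def state_prices(s0, s_up, s_down, rate):
    """Solve the 2x2 system for the Arrow-Debreu prices of a two-state market.

    psi_up + psi_down = 1 / (1 + rate)          (bond paying 1 in both states)
    psi_up * s_up + psi_down * s_down = s0      (stock)
    """
    bond_price = 1.0 / (1.0 + rate)
    psi_up = (s0 - s_down * bond_price) / (s_up - s_down)
    psi_down = bond_price - psi_up
    return psi_up, psi_down

# Hypothetical market data (assumptions for illustration).
s0, s_up, s_down, rate = 100.0, 120.0, 90.0, 0.02
p_up = 0.6  # assumed physical probability of the up state

psi_up, psi_down = state_prices(s0, s_up, s_down, rate)

# Risk-neutral probabilities are the normalised state-prices.
q_up = psi_up / (psi_up + psi_down)
q_down = 1.0 - q_up

# Pricing kernel: state-price per unit of physical probability.
kernel_up = psi_up / p_up
kernel_down = psi_down / (1.0 - p_up)

# Price a call with strike 100 (payoff 20 in the up state, 0 otherwise) in two ways.
call_payoff = (max(s_up - 100.0, 0.0), max(s_down - 100.0, 0.0))
call_from_state_prices = psi_up * call_payoff[0] + psi_down * call_payoff[1]
call_from_kernel = (p_up * kernel_up * call_payoff[0]
                    + (1.0 - p_up) * kernel_down * call_payoff[1])

# Expected stock return under the risk-neutral measure.
stock_return_q = (q_up * s_up + q_down * s_down) / s0 - 1.0

print(f"state-prices: {psi_up:.4f}, {psi_down:.4f}")
print(f"risk-neutral probability of the up state: {q_up:.4f}")
print(f"call price: {call_from_state_prices:.4f}")
```

Both state-prices are positive here because the down and up payoffs of the stock bracket its forward value, which is the no-arbitrage condition in this toy market.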
More recently, the concepts of EUT have been adapted for derivative security (contingent claim) pricing (see
Hodges et al. [1989]). In a financial market, given an investor receiving a particular contingent claim offering payoff
CT at future time T > 0 and assuming market completeness, then the price the investor would pay can be found
uniquely. Option pricing in complete markets uses the idea of replication whereby a portfolio in stocks and bonds recreates the terminal payoff of the option, thus removing all risk and uncertainty. However, in reality, most situations are
incomplete as market frictions, transactions costs, non-traded assets and portfolio constraints make perfect replication
impossible. The price is no longer unique and several potential approaches exist, including utility indifference pricing
(UIP), superreplication, the selection of one particular measure according to a minimal distance criterion (for example
the minimal martingale measure or the minimal entropy measure) and convex risk measures. The UIP will be of
particular interest to us in the rest of this book (see Henderson et al. [2004] for an overview). In that setting, the
investor can maximise expected utility of wealth and may be able to reduce the risk due to the uncertain payoff
through dynamic trading. As explained by Hodges et al. [1989], the investor is willing to pay a certain amount today
for the right to receive the claim such that he is no worse off in expected utility terms than he would have been without
the claim. Some of the advantages of UIP include its economic justification, incorporation of wealth dependence, and
incorporation of risk aversion, leading to a non-linear price in the number of units of claim, which is in contrast to
prices in complete markets.

1.2 Solving the St. Petersburg paradox

1.2.1 The simple St. Petersburg game

The expected utility theory (EUT) deals with the analysis of choices among risky projects with (possibly multidimensional) outcomes. The expected utility model originates in the St. Petersburg paradox, which was first posed by Nicholas Bernoulli in 1713 and resolved by Daniel
Bernoulli [1738-1954] in 1738. A casino offers a game of chance for a single player in
which a fair coin is tossed at each stage. The pot starts at $1 and is doubled every time a head appears. The first time a
tail appears, the game ends and the player wins whatever is in the pot. The player wins the payout D_k = 2^{k-1} dollars when
the first tail appears on the k-th toss, that is, after k − 1 heads. The random number of coin tosses, k, thus follows a geometric
distribution with parameter 1/2, and the payouts increase exponentially with k. The question is what a fair price
to pay the casino for entering the game would be. Following Pascal and Fermat, one answer is to consider the average payout
(expected value)
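\[
E[D_k] = \langle D_k \rangle = \sum_{k=1}^{\infty} \left(\frac{1}{2}\right)^k 2^{k-1} = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \dots = \sum_{k=1}^{\infty} \frac{1}{2} = \infty
\]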
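The divergence of this sum can be checked numerically. The following sketch, written with exact fractions, computes the partial sums of p_k D_k with p_k = 2^{-k} and D_k = 2^{k-1}. The function name `truncated_expected_payout` is ours, and the truncation simply ignores the outcomes beyond `max_tosses` (it is not a renormalised game).

```python
from fractions import Fraction

def truncated_expected_payout(max_tosses):
    """Partial sum of p_k * D_k for k = 1..max_tosses, with p_k = 2^-k and D_k = 2^(k-1)."""
    return sum(Fraction(1, 2**k) * 2**(k - 1) for k in range(1, max_tosses + 1))

for n in (1, 10, 100):
    print(n, truncated_expected_payout(n))
```

Each term contributes exactly 1/2, so that the partial sum over the first n outcomes is n/2 and grows without bound.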

A gamble is worth taking if the expectation value of the net change of wealth, < Dk > −c where c is the cost charged
to enter the game, is positive. Assuming infinite time and unlimited resources, this sum grows without bound and so
the expected win for repeated play is an infinite amount of money. Hence, considering nothing but the expectation
value of the net change in one’s monetary wealth, one should therefore play the game at any price if offered the
opportunity, but people are ready to pay only a few dollars. The paradox is the discrepancy between what people
seem willing to pay to enter the game and the infinite expected value. Instead of computing the expectation value
of the monetary winnings, Bernoulli [1738-1954] proposed to compute the expectation value of the gain in utility.
He argued that the paradox could be resolved if decision-makers displayed risk aversion and argued for a logarithmic
cardinal utility function u(w) = ln w where w is the gambler’s total initial wealth. It was based on the intuition that
the increase in wealth should correspond to an increase in utility which is inversely proportional to the wealth a person
already has, that is, du/dx = 1/x, whose solution is the logarithm. The expected utility hypothesis posits that a utility
function exists whose expected net change is a good criterion for real people’s behaviour. For each possible event, the
change in utility will be weighted by the probability of that event occurring. Letting c be the cost charged to enter the
game, the expected net change in logarithmic utility is
\[
E[\Delta u] = \langle \Delta u \rangle = \sum_{k=1}^{\infty} \frac{1}{2^k} \left( \ln(w + 2^{k-1} - c) - \ln w \right) < \infty \qquad (1.2.1)
\]

where w + 2^{k-1} − c is the wealth after the event. This sum converges to a finite value. The formula gives an implicit relationship
between the gambler’s wealth and how much he should be willing to pay to play (specifically, any c that gives a positive
expected change in utility). However, this solution by Cramer and Bernoulli is not completely satisfying, since the lottery can
easily be changed in a way such that the paradox reappears. For instance, we just need to change the game so that
it gives the (even larger) payoff e^{2^k}. More generally, it is argued that one can find a lottery that allows for a variant
of the St. Petersburg paradox for every unbounded utility function (see Menger [1934]). But, this conclusion was
shown by Peters [2011b] to be incorrect. Nicholas Bernoulli himself proposed an alternative idea for solving the
paradox, conjecturing that people will neglect unlikely events, since only unlikely events yield the high prizes leading
to an infinite expected value. The idea of probability weighting resurfaced much later in the work of Kahneman et al.
[1979], but their experiments indicated that, very much to the contrary, people tend to overweight small probability
events. Alternatively, relaxing the unrealistic assumption of infinite resources for the casino, one can show that the
expected value of the lottery only grows logarithmically with the resources of the casino, so that the
expected value of the lottery is quite modest.
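Two of the resolutions discussed above can be sketched numerically. The first function evaluates a truncated version of the series in Equation (1.2.1), and the second one searches by bisection for the cost c at which the expected change in logarithmic utility vanishes. The third function assumes that the casino pays min(D_k, bankroll), which is our way of modelling finite resources. All function names, the truncation at 200 terms and the numerical tolerances are assumptions of this sketch.

```python
import math

def expected_log_utility_change(wealth, cost, n_terms=200):
    """Truncated series of Equation (1.2.1) with p_k = 2^-k and D_k = 2^(k-1)."""
    return sum(
        0.5 ** k * (math.log(wealth + 2.0 ** (k - 1) - cost) - math.log(wealth))
        for k in range(1, n_terms + 1)
    )

def break_even_cost(wealth, n_iter=100):
    """Bisection for the cost c at which the expected change in log-utility is zero."""
    # The upper bound keeps w + 1 - c > 0, so that every logarithm is defined.
    lo, hi = 0.0, wealth + 1.0 - 1e-6
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if expected_log_utility_change(wealth, mid) > 0.0:
            lo = mid   # the game is still attractive: the player could pay more
        else:
            hi = mid
    return 0.5 * (lo + hi)

def capped_expected_payout(bankroll, n_terms=200):
    """Expected payout when the casino pays at most `bankroll` (assumed cap)."""
    return sum(0.5 ** k * min(2.0 ** (k - 1), bankroll) for k in range(1, n_terms + 1))

for w in (1_000.0, 1_000_000.0):
    print(f"wealth {w:>12,.0f}: break-even cost {break_even_cost(w):.2f}")
print("expected payout with a bankroll of 2^30:", capped_expected_payout(2.0 ** 30))
```

With a bankroll of 2^m the capped expected payout equals m/2 + 1, which makes the logarithmic dependence on the resources of the casino explicit.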

1.2.2 The sequential St. Petersburg game

As a way of illustrating the GOP (presented in Section (2.3.2)) for constantly rebalanced portfolios, Gyorfi et al. [2009]
[2011] introduced the sequential St. Petersburg game, which is a multi-period game having exponential growth. Before
presenting that game we first discuss an alternative version (called the iterated St. Petersburg game) where in each round
the player invests C_A = $1, and we let X_n denote the payoff of the n-th simple game. Assuming the sequence {X_n}_{n=1}^{\infty}
to be independent and identically distributed, after n rounds the player’s wealth in the repeated game becomes
\[
C_A(n) = \sum_{i=1}^{n} X_i
\]
so that in the limit we get
\[
\lim_{n \to \infty} \frac{C_A(n)}{n \log_2 n} = 1
\]
in probability, where \log_2 denotes the logarithm with base 2. We can now introduce the sequential St. Petersburg
game. Starting with initial capital C_A(0) = $1 and assuming an independent sequence of simple St. Petersburg
games, for each simple game the player reinvests his capital. If C_A^c(n − 1) is the capital after the (n − 1)-th simple
game, then the invested capital is C_A^c(n − 1)(1 − f_c), while C_A^c(n − 1) f_c is the proportional cost of the simple game
with commission factor 0 < f_c < 1. Hence, after the n-th round the capital is
\[
C_A^c(n) = C_A^c(n-1) (1 - f_c) X_n = C_A(0) (1 - f_c)^n \prod_{i=1}^{n} X_i = (1 - f_c)^n \prod_{i=1}^{n} X_i
\]

Given its multiplicative definition, C_A^c(n) has exponential trend
\[
C_A^c(n) = e^{n W_n^c} \approx e^{n W^c}
\]
with average growth rate
\[
W_n^c = \frac{1}{n} \ln C_A^c(n)
\]
and with asymptotic average growth rate
\[
W^c = \lim_{n \to \infty} \frac{1}{n} \ln C_A^c(n)
\]
From the definition of the average growth rate, we get
\[
W_n^c = \frac{1}{n} \left( n \ln(1 - f_c) + \sum_{i=1}^{n} \ln X_i \right)
\]

and applying the strong law of large numbers, we obtain the asymptotic average growth rate
\[
W^c = \ln(1 - f_c) + \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \ln X_i = \ln(1 - f_c) + E[\ln X_1] \quad \text{a.s.} \qquad (1.2.2)
\]

so that W^c can be calculated via expected log-utility. The commission factor f_c is called fair if
\[
W^c = 0
\]
so that the growth rate of the sequential game is 0. We can calculate the fair factor f_c as
\[
\ln(1 - f_c) = -E[\ln X_1] = -\sum_{k=1}^{\infty} k \ln 2 \, \frac{1}{2^k} = -2 \ln 2
\]
and we get
\[
f_c = \frac{3}{4}
\]
Note, Gyorfi et al. [2009] studied the portfolio game, where a fraction of the capital is invested in the simple fair St.
Petersburg game and the rest is kept in cash.
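The fair commission factor can be checked numerically. The sketch below takes the payoff of a simple game to be X = 2^k with probability 2^{-k}, as implied by the series used above for E[ln X_1], and then simulates one path of the sequential game to estimate the average growth rate. The function names, the truncation at 200 terms, the number of rounds and the random seed are assumptions of this sketch.

```python
import math
import random

def expected_log_payoff(n_terms=200):
    """Truncated series for E[ln X_1] with X = 2^k and probability 2^-k."""
    return sum(k * math.log(2.0) * 0.5 ** k for k in range(1, n_terms + 1))

def simulated_growth_rate(commission, n_rounds, seed=0):
    """Average growth rate ln(1 - f_c) + (1/n) sum_i ln X_i along one simulated path."""
    rng = random.Random(seed)
    total_log_payoff = 0.0
    for _ in range(n_rounds):
        k = 1
        while rng.random() < 0.5:   # keep tossing while heads come up
            k += 1
        total_log_payoff += k * math.log(2.0)   # ln X_i with X_i = 2^k
    return math.log(1.0 - commission) + total_log_payoff / n_rounds

e_log = expected_log_payoff()
fair_commission = 1.0 - math.exp(-e_log)

print(f"E[ln X_1]       = {e_log:.6f}")
print(f"fair commission = {fair_commission:.6f}")
print(f"simulated growth rate at the fair commission: "
      f"{simulated_growth_rate(fair_commission, 100_000):.4f}")
```

For a commission factor below the fair value the simulated growth rate is positive, and the capital of the player grows exponentially along the path.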

1.2.3 Using time averages

Peters [2011a] used the notion of ergodicity in stochastic systems, where it is meaningless to assign a probability to
a single event, as the event has to be embedded within other similar events. While Fermat and Pascal chose to embed
events within parallel universes, alternatively we can embed them within time, as the consequences of the decision will
unfold over time (the dynamics of a single system are averaged along a time trajectory). However, the system under
investigation, a mathematical representation of the dynamics of wealth of an individual, is not ergodic, and this
manifests itself as a difference between the ensemble average and the time average of the growth rate of wealth. The
origins of ergodic theory lie in the mechanics of gases (large-scale effects of the molecular dynamics) where the key
rationale is that the systems considered are in equilibrium. Replacing time averages by ensemble averages is permissible only under strict conditions of stationarity (see
Grimmet et al. [1992]). While the literature on ergodic systems is concerned with deterministic dynamics, the basic
question whether time averages may be replaced by ensemble averages is equally applicable to stochastic systems,
such as Langevin equations or lotteries. The essence of ergodicity is the question whether the system when observed
for a sufficiently long time t samples all states in its sample space in such a way that the relative frequencies f (x, t)dx
with which they are observed approach a unique (independent of initial conditions) probability P (x)dx
\[
\lim_{t \to \infty} f(x, t) = P(x)
\]
If this distribution does not exist or is not unique, then the time average \bar{A} = \lim_{T \to \infty} \frac{1}{T} \int_0^T A(x(t)) dt of an observable A cannot be computed as an ensemble average in Huygens’ sense, \langle A \rangle = \int_x A(x) P(x) dx. Peters [2011a]
pointed out that computing the naive expected payout is mathematically equivalent to considering multiple outcomes
of the same lottery in parallel universes. It is therefore unclear why expected wealth should be a quantity whose
maximization should lead to a sound decision theory. Indeed, the St. Petersburg paradox is only a paradox if one
accepts the premise that rational actors seek to maximize their expected wealth. The choice of utility function is often
framed in terms of the individual’s risk preferences and may vary between individuals. An alternative premise, which
is less arbitrary and makes fewer assumptions, is that the performance over time of an investment better characterises
an investor’s prospects and, therefore, better informs his investment decision. To compute ensemble averages, only a
probability distribution is required, whereas time averages require a dynamic, implying an additional assumption. This
assumption corresponds to the multiplicative nature of wealth accumulation. That is, any wealth gained can itself be
employed to generate further wealth, which leads to exponential growth (banks and governments offer exponentially
growing interest payments on savings). The accumulation of wealth over time is well characterized by an exponential
growth rate, see Equation (1.2.2). To compute this, we consider the factor r_k by which a player’s wealth changes in
one round of the lottery (one sequence of coin tosses until a tails-event occurs)
\[
r_k = \frac{w - c + D_k}{w}
\]
where D_k is the k-th (positive finite) payout. Note, this factor corresponds to the payoff X_k for the k-th simple game
described in Section (1.2.2). To convert this factor into an exponential growth rate g (so that e^{gt} is the factor by which
wealth changes in t rounds of the lottery), we take the logarithm g_k = \ln r_k. The ensemble-average growth factor is
\[
\langle r \rangle = \sum_{k=1}^{\infty} p_k r_k
\]

where p_k is the (non-zero) probability of the k-th payout. The logarithm of \langle r \rangle expresses this as the ensemble-average exponential
growth rate, that is \langle g \rangle = \ln \langle r \rangle, and we get
\[
\langle g \rangle = \ln \sum_{k=1}^{\infty} p_k r_k
\]
Note, the exponential growth rate \langle g \rangle should be rescaled by 1/t to be consistent with the factor e^{gt}. Further, this rate
corresponds to the rate of an M-market. The main idea in the time averages is to consider the rate of an ergodic process
and to average over time. In this case, the passage of time is incorporated by identifying as the quantity of interest
the Average Rate of Exponential Growth of the player’s wealth in a single round of the lottery. Repeating the simple
game in sequences, the time-average growth factor r is
\[
r = \prod_{k=1}^{\infty} r_k^{p_k}
\]
corresponding to the player’s wealth C_A^c(\infty). The logarithm of r expresses this as the time-average exponential growth
rate, that is, g = \ln r. Hence, the time-average exponential growth rate is

\[
g(w, c) = \sum_{k=1}^{\infty} p_k \ln\left( \frac{w - c + D_k}{w} \right)
\]
where p_k is the (non-zero) probability of receiving the payout D_k. In the standard St. Petersburg lottery, D_k = 2^{k-1} and p_k = 1/2^k.
Note, given Jensen’s inequality (see Appendix (A.1.1)), when f is concave (here, the logarithm function), we get
\[
\langle g \rangle \geq g(w, c)
\]
Although the rate g(w, c) is an expectation value of a growth rate g_k = \ln r_k (the time unit being one lottery game), and
may therefore be thought of in one sense as an average over parallel universes, it is in fact equivalent to the time-average
growth rate that would be obtained if repeated lotteries were played over time. This is because the logarithm
function, taken as a utility function, has the special property of encoding the multiplicative nature common to gambling
and investing in a linear additive object. The expectation value is
\[
\sum_{k=1}^{\infty} p_k \ln r_k = \ln\left( \lim_{T \to \infty} \left( \prod_{i=1}^{T} r_i \right)^{1/T} \right)
\]

which is the geometric average return (for details see Section (2.3.2)). It is reasonable to assume that the intuition
behind the human behaviour is a result of making repeated decisions and considering repeated games. While g(., .)
is identical to the rate of change of the expected logarithmic utility in Equation (1.2.1), it has been obtained without
making any assumptions about the player’s risk preferences or behaviour, other than that he is interested in the Rate
of Growth of his wealth. Under this paradigm, an individual with wealth w should buy a ticket at a price c provided
g(w, c) > 0. Note, this equation can also be considered a criterion for how much risk a person should take.
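The difference between the ensemble-average rate and the time-average rate can be made concrete on a truncated lottery, for which the ensemble average is finite. In the sketch below the game is stopped after at most 30 tosses, and the wealth w = 100 and ticket price c = 10 are hypothetical values. The stopping rule and the function names are assumptions of this sketch.

```python
import math

def truncated_lottery(max_tosses):
    """Probabilities and payouts of a St. Petersburg game stopped after max_tosses tosses."""
    probs = [0.5 ** k for k in range(1, max_tosses)]
    payouts = [2.0 ** (k - 1) for k in range(1, max_tosses)]
    # Assumed stopping rule: if the first max_tosses - 1 tosses are all heads
    # (probability 2^-(max_tosses - 1)), the player receives the pot 2^(max_tosses - 1).
    probs.append(0.5 ** (max_tosses - 1))
    payouts.append(2.0 ** (max_tosses - 1))
    return probs, payouts

def ensemble_growth_rate(wealth, cost, probs, payouts):
    """<g> = ln sum_k p_k r_k with r_k = (w - c + D_k) / w."""
    return math.log(sum(p * (wealth - cost + d) / wealth for p, d in zip(probs, payouts)))

def time_average_growth_rate(wealth, cost, probs, payouts):
    """g(w, c) = sum_k p_k ln r_k."""
    return sum(p * math.log((wealth - cost + d) / wealth) for p, d in zip(probs, payouts))

probs, payouts = truncated_lottery(30)
w, c = 100.0, 10.0   # hypothetical wealth and ticket price
g_ensemble = ensemble_growth_rate(w, c, probs, payouts)
g_time = time_average_growth_rate(w, c, probs, payouts)

print(f"ensemble-average rate <g>: {g_ensemble:+.4f}")
print(f"time-average rate g(w, c): {g_time:+.4f}")
```

For these values the ensemble-average rate is positive while the time-average rate is negative, so that the two criteria give opposite recommendations on whether to buy the ticket, in line with Jensen’s inequality.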

1.2.4 Using option pricing theory

Bachelier [1900] asserted that every price follows a martingale stochastic process, leading to the notion of perfect
market. One of the fundamental concepts in the mathematical theory of financial markets is the no-arbitrage condition.
The fundamental theorem of asset pricing states that in an arbitrage free market model there exists a probability
measure Q on (Ω, F) such that every discounted price process X is a martingale under Q, and Q is equivalent to P.
Using the notion of Arrow-Debreu state-price density from economics, Harrison et al. [1979] showed that the absence
of arbitrage implies the existence of a density or pricing kernel, also called stochastic discount factor, that prices all
assets. We consider the probability space (Ω, F, P) where F_t is a right continuous filtration including all P negligible
sets in F. Given the payoff C_T at maturity T, the prices π_t seen at time t can be calculated as the expectation under
the physical measure
\[
\pi_t = E[\xi_T C_T | F_t] \qquad (1.2.3)
\]
where ξ_T is the state-price density at time T, which depends on the market price of risk λ. The pricing kernel measures
the degree of risk aversion in the market, and serves as a benchmark for preferences. In the special case where the
interest rates and the market price of risk are null, one can easily compute the price of a contingent claim as the
expected value of the terminal flux. These conditions are satisfied in the M-market (see Remark (E.1.2) in Appendix
(E.1)) also called the market numeraire and introduced by Long [1990]. A common approach to option pricing in
complete markets, in the mathematical financial literature, is to fix a measure Q under which the discounted traded
assets are martingales and to calculate option prices via expectation under this measure (see Harrison et al. [1981]).
The option pricing theory (OPT) states that when the market is complete, and market price of risk and rates are
bounded, then there exists a unique risk-neutral probability Q, and the risk-neutral rule of valuation can be applied to
all contingent claims which are square integrable (see Theorem (E.1.5)). Risk-neutral probabilities are the product of
an unknown kernel (risk aversion) and natural probabilities. While in a complete market there is just one martingale
measure or state-price density, there are an infinity of state-price densities in an incomplete market (see Cont et
al. [2003] for a description of incomplete market theory). In the utility indifference pricing (UIP) theory (see an
introduction to UIP in Appendix (E.5)), assuming the investor initially has wealth w, the value function (see Equation
(E.5.16)) is given by
\[ V(w, k) = \sup_{W_T \in \mathcal{A}(w)} E[u(W_T + k C_T)] \]

with k > 0 units of the claim, and where the supremum is taken over all wealth WT which can be generated from
initial fortune w. The utility indifference buy price πb (k) (see Equation (E.5.18)) is the solution to
V (w − πb (k), k) = V (w, 0)
which involves solving two stochastic control problems (see Merton [1969] [1971]). An alternative solution is to
convert this primal problem into the dual problem which involves minimising over state-price densities or martingale
measures (see Equation (E.5.19)) (a simple example is given in Appendix (D.5)). A consequence of the dual problem
is that the market price of risk plays a fundamental role in the characterisation of the solution to the utility indifference
pricing problem (see Remark (E.5.1)).
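
As a minimal illustration of the indifference condition V (w − πb (k), k) = V (w, 0), the Python sketch below assumes a one-period setting with no traded asset, so that the admissible set A(w) collapses to {w} and the supremum is trivial. The discrete claim distribution, the exponential utility, the risk aversion level and all function names are illustrative assumptions rather than part of the text. The buy price is found by bisection and compared with the closed form available for exponential utility in this degenerate case.

```python
import math

def value(w, k, payoffs, probs, u):
    """V(w, k) = E[u(w + k*C_T)], assuming no hedging asset is available."""
    return sum(p * u(w + k * c) for c, p in zip(payoffs, probs))

def indifference_buy_price(w, k, payoffs, probs, u, tol=1e-12):
    """Solve V(w - price, k) = V(w, 0) for price by bisection."""
    target = value(w, 0.0, payoffs, probs, u)
    lo, hi = k * min(payoffs), k * max(payoffs)  # price lies between worst and best payoff
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # V decreases with the price paid: still above target means the price can rise
        if value(w - mid, k, payoffs, probs, u) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

gamma = 0.5                                          # hypothetical absolute risk aversion
u = lambda x: -math.exp(-gamma * x)                  # exponential (CARA) utility, an assumption
payoffs, probs = [0.0, 1.0, 4.0], [0.5, 0.3, 0.2]    # hypothetical claim C_T

price = indifference_buy_price(w=10.0, k=1.0, payoffs=payoffs, probs=probs, u=u)
closed_form = -math.log(sum(p * math.exp(-gamma * c) for c, p in zip(payoffs, probs))) / gamma
expected_payoff = sum(p * c for c, p in zip(payoffs, probs))
```

In this sketch the indifference price sits below the expected payoff, which is the effect of risk aversion. Once hedging assets are available, the supremum over A(w) is no longer trivial and the two control problems mentioned above must be solved.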
Clearly the St Petersburg paradox is neither an example of a complete market situation nor one of an incomplete market situation, since the payout grows without bound, making the payoff not square integrable. Further, the
expectation value of net winnings proposed by Pascal and Fermat implicitly assumes the situation of an M-market, corresponding to a market with null rates and null market price of risk. As pointed out by Huygens [1657], this concept
of expectation is agnostic regarding fluctuations, which is harmless only if the consequences of the fluctuations, such
as associated risks, are negligible. The ability to bear risk depends not only on the risk but also on the risk-bearer’s
resources. Similarly, Bernoulli [1738-1954] noted ”if I am not wrong then it seems clear that all men can not use the


same rule to evaluate the gamble". That is, in the M-market investors are risk-neutral which does not correspond to a
real market situation, and one must incorporate rates and market price of risk in the pricing of a claim.
Rather than explicitly introducing the market price of risk, Bernoulli and Cramer proposed to compute the expectation value of the gain in some function of wealth (utility). It leads to solving Equation (1.2.1) which is to be compared
with Equation (E.5.16) in the UIP discussed above. If the utility function is properly defined, it has the advantage of
bounding the claim so that a solution can be found. Clearly, the notion of time is ignored, and there is no supremum
taken over all wealth generated from initial fortune w. Note, the lottery is played only once with wealth w and one
round of the game is assumed to be very fast (instantaneous). Further, there is no mention of repetitions of the game in
N. Bernoulli’s letter, only human behaviour. It is assumed that the gambler’s wealth is only modified by the outcome
of the game, so that his wealth after the event becomes w + 2^(k−1) − c where c is the ticket price. Hence, the absence of
supremum in the ensemble average. As a result, the ensemble average on gain by Pascal and Fermat has been replaced
by an ensemble average on a function of wealth (bounding the claim), but no notion of time and market price of risk has
been considered. As discussed by Peters [2011a], utility functions (in Bernoulli’s framework) are externally provided
to represent risk preferences, but are unable by construction to recommend appropriate levels of risk. A quantity that
is more directly relevant to the financial well-being of an individual is the growth of an investment over time. In UIP
and time averages, any wealth gained can itself be employed to generate further wealth, leading to exponential growth.
By proposing a time average, Peters introduced wealth optimisation over time, but he had to assume something about
wealth W in the future. In the present situation, similarly to the sequential St. Petersburg game discussed in Section
(1.2.2), he based his results on the assumption that equivalent lotteries can be played in sequence as often as desired,
implying that irrespective of how close a player gets to bankruptcy, losses will be recovered over time. To summarise,
UIP and time averages are meaningless in the absence of time (here sequences of equivalent rounds).
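
To make the ensemble average on a function of wealth concrete, the following Python sketch evaluates the expected change in log-wealth for a single round of the lottery, using the post-game wealth w + 2^(k−1) − c quoted above. The choice of the logarithm as the function of wealth, the probability 2^(−k) of outcome k, the truncation of the series, and the function names are assumptions made for illustration only.

```python
import math

def expected_log_change(w, c, n_terms=200):
    """Sum over k of 2**-k * ln(w - c + 2**(k-1)), minus ln(w); series truncated at n_terms."""
    total = 0.0
    for k in range(1, n_terms + 1):
        total += 2.0 ** (-k) * math.log(w - c + 2.0 ** (k - 1))
    return total - math.log(w)

def max_ticket_price(w, tol=1e-10):
    """Largest price c (for w > 2) at which the expected log change is non-negative."""
    lo, hi = 0.0, w
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_log_change(w, mid) > 0.0:
            lo = mid   # the ticket is still attractive at this price
        else:
            hi = mid
    return 0.5 * (lo + hi)

c_star = max_ticket_price(1000.0)   # hypothetical wealth of 1000
```

The resulting price is finite and increases with wealth, even though the expected payoff itself diverges. The bounding comes from the concavity of the function of wealth, not from any notion of time or market price of risk.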

1.3 Modelling future cashflows in presence of risk

The rationally oriented academic literature still considered the pricing equation (1.3.4), but in a world of uncertainty
and time-varying expected returns (see Arrow [1953] and Debreu [1953]). That is, the discount rate (µt )t≥0 is now
a stochastic process, leading to the notion of stochastic discount factor (SDF). As examined in Section (1.1), a vast
literature discussed the various ways of valuing asset prices and defining market equilibrium. In view of introducing
the main ideas and concepts, we present in Appendix (D) some simple models in discrete time with one or two time
periods and with a finite number of states of the world. We then consider in Appendix (E) more complex models
in continuous time and discuss the valuation of portfolios on multi-underlyings where we express the dynamics of a
self-financing portfolio in terms of the rate of return and volatility of each asset. As a consequence of the concept of
absence of arbitrage opportunity (AAO) (see Ross [1976]), the idea that there exists a constraint on the rate of return
of financial assets developed, leading to the presence of a market price of risk λt which is characterised in Equation
(E.1.2). Further, in a complete market with no-arbitrage opportunity, the price of a contingent claim is equal to the
expected value of the terminal flux expressed in the cash numeraire, under the risk-neutral probability Q (see details
in Appendix (D) and Appendix (E) and especially Theorem (E.1.5)).

1.3.1 Introducing the discount rate

We saw in Section (1.2.4) that at the core of finance is the present value relation stating that the market price of an
asset should equal its expected discounted cash flows under the right probability measure (see Harrison et al. [1979],
Dana et al. [1994]). The question is how to define the pricing kernel or, equivalently, how to define the
martingale measures. We let πt be the price at time t of an asset with random cash flows Fk = F (Tk ) at the time Tk
for k = 1, .., N such that 0 < T1 < T2 < ... < TN . Note, N can possibly go to infinity. Given the price of an asset in
Equation (1.2.3) and assuming ξ_{T_k} = e^{−µ(T_k − t)}, the present value of the asset at time t is given by
\[ \pi_t = E\Big[ \sum_{k=1}^{N} e^{-\mu (T_k - t)} F_k \,\Big|\, \mathcal{F}_t \Big] \tag{1.3.4} \]

where µ is the discount rate, and such that the most common cash flows Fk can be coupons and principal payments
of bonds, or dividends of equities. The main question becomes: What discount rate to use? As stated by Walras
[1874-7], equilibrium market prices are set by supply and demand among investors applying the pricing Equation
(1.3.4), so that the discount rates they require for holding assets are the expected returns they can rationally anticipate.
Hence, the discount rate or the expected return contains both compensation for time and compensation for risk bearing.
Williams [1936] discussed the effects of risk on valuation and argued that ”the customary way to find the value of
a risky security has been to add a premium for risk to the pure rate of interest, and then use the sum as the interest
rate for discounting future receipts”. The expected excess return of a given asset over the risk-free rate is called the
expected risk premium of that asset. In equity, the risk premium is the growth rate of earnings, plus the dividend yield,
minus the riskless rate. Since all these variables are dynamic, so must be the risk premium. Further, the introduction
of stock and bond option markets confirmed that implied volatilities vary over time, reinforcing the view that expected
returns vary over time. Fama et al. [1989] documented counter-cyclical patterns in expected returns for both stocks
and bonds, in line with required risk premia being high in bad times, such as cyclical troughs or financial crises.
Understanding the various risk premia is at the heart of finance, and the capital asset pricing model (CAPM) as well as
the option pricing theory (OPT) are two possible answers among others. To proceed, one must find a way to observe
risk premia in view of analysing them and possibly forecasting them. Throughout this guide we are going to describe
the tools and techniques used by institutional asset holders to estimate the risk premia via historical data as well as the
approach used by option traders to implicitly infer the risk premia from option prices.
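
The present value relation in Equation (1.3.4) can be sketched in Python for the simplest case of deterministic cash flows and a constant discount rate µ, so that the conditional expectation is trivial. The bond, the rates and the function name below are hypothetical and only illustrate how adding a risk premium to the pure rate of interest lowers the value.

```python
import math

def present_value(cash_flows, pay_times, mu, t=0.0):
    """Equation (1.3.4) for deterministic cash flows F_k paid at T_k > t, constant rate mu."""
    return sum(f * math.exp(-mu * (tk - t))
               for f, tk in zip(cash_flows, pay_times) if tk > t)

# Hypothetical 3-year bond: annual coupon of 5 and principal of 100
flows, times = [5.0, 5.0, 105.0], [1.0, 2.0, 3.0]
risk_free, risk_premium = 0.02, 0.03          # illustrative numbers only

pv_riskless = present_value(flows, times, risk_free)
pv_risky = present_value(flows, times, risk_free + risk_premium)
```

With random cash flows or a time-varying discount rate, the sum would be replaced by an expectation over scenarios, which is where the choice of measure and of risk premium enters.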

1.3.2 Valuing payoffs in continuous time

A consequence of the fundamental theorem of asset pricing introduced in Section (1.2.4) is that the price process
S needs to be a semimartingale under the original measure P. In this section we give a brief introduction to some
fundamental finance concepts. While some more general models might include discontinuous semimartingales and
even stochastic processes which are not semimartingales (for example fractional Brownian motion), we consider the
continuous decomposable semimartingales models for equity securities
S(t) = A(t) + M (t)
where the drift A is a finite variation process and the volatility M is a local martingale. For simplicity of exposition
we focus on an Ft -adapted market consisting of the (N + 1) multi-underlying diffusion model described in Appendix
(E.1) with 0 ≤ t ≤ T and dynamics given by
\[ \frac{dS_t}{S_t} = b_t\,dt + \langle \sigma_t, d\hat{W}_t \rangle \tag{1.3.5} \]

where the instantaneous rate of return bt is an adapted vector in RN , Ŵt is a k-dimensional Brownian motion with
components Ŵtj , and σt is a N × k adapted volatility matrix with elements σji (t). The market {St }t∈[0,T ] is called
normalised if S_t^0 = 1. We can always normalise the market by defining
\[ \bar{S}_t^i = \frac{S_t^i}{S_t^0}, \quad 1 \le i \le N \]
so that
\[ \bar{S}_t = (1, \bar{S}_t^1, \ldots, \bar{S}_t^N) \]
is the normalisation of St . Hence, it corresponds to regarding the price S_t^0 of the safe investment as the unit of price
(the numeraire) and computing the other prices in terms of this unit. We define the riskless asset or accumulation


factor B(t) as the value at time t of a fund created by investing $1 at time 0 on the money market and continuously
reinvested at the instantaneous interest rate r(t). Assuming that for almost all ω, t → rt (ω) is strictly positive and
continuous, and rt is an Ft measurable process, then the riskless asset is given by
\[ B(t) = S_t^0 = e^{\int_0^t r(s)\,ds} \tag{1.3.6} \]

We can now introduce the notion of arbitrage in the market.
Lemma 1.3.1 Suppose there exists a measure Q on FT such that Q ∼ P and such that the normalised price process
{S t }t∈[0,T ] is a martingale with respect to Q. Then the market {St }t∈[0,T ] has no arbitrage.
Definition 1.3.1 A measure Q ∼ P such that the normalised process {S t }t∈[0,T ] is a martingale with respect to Q is
called an equivalent martingale measure.
That is, if there exists an equivalent martingale measure then the market has no arbitrage. In that setting the market
also satisfies the stronger condition called no free lunch with vanishing risk (NFLVR) (see Delbaen et al. [1994]). We
can consider a weaker result. We let Q be an equivalent martingale measure with dQ = ξdP , such that ξ is strictly
positive and square integrable. Further, the process {ξt }0≤t≤T defined by ξt = E[ξ|Ft ] is a strictly positive martingale
over the Brownian fields {Ft } with ξT = ξ and E[ξt ] = E[ξ] = 1 for all t. Given φ an arbitrary bounded, real-valued
function on the real line, Harrison et al. [1979] showed that the Radon-Nikodym derivative is given by
\[ \xi_t = \frac{dQ}{dP}\Big|_{\mathcal{F}_t} = \exp\Big( -\frac{1}{2}\int_0^t \phi^2(W_s)\,ds + \int_0^t \phi(W_s)\,d\tilde{W}(s) \Big), \quad 0 \le t \le T \]

where W̃ is a Brownian motion on the probability space (Ω, F, P). Note, λt = φ(W̃t ) is the market price of risk such
that (λt )0≤t≤T is a bounded adapted process. Then, ξ is a positive martingale under P and one can define the new
probability measure Q for arbitrary t > 0 by
Q(A) = E[ξt IA ]
where IA is the indicator function for the event A ∈ Ft . Moreover, using the theorem of Girsanov [1960], the
process
\[ W(t) = \tilde{W}(t) + \int_0^t \lambda_s\,ds \]

is a Brownian motion on the probability space (Ω, F, Q) (see Appendix (E.2)). Using the Girsanov transformation
(see Girsanov [1960]), we choose λt in a clever way, and verify that each discounted price process is a martingale under the probability measure Q. For instance, in the special case where we assume that the stock prices are lognormally
distributed with drift µ and volatility σ, the market price of risk is given by Equation (E.2.4) (see details in Appendix
(E.2.1)). Hence, we see that modelling the rate of return of risky asset is about modelling the market price of risk λt .
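
The role of the market price of risk in the lognormal case can be checked numerically. The Python sketch below assumes constant µ, r and σ, takes λ = (µ − r)/σ (the one-dimensional form of b = r + σλ), and uses the sign convention ξT = exp(−λ W̃T − λ²T /2), under which W̃ (t) + λt is a Q-Brownian motion. That sign convention, the parameter values and the helper names are assumptions of the sketch. Expectations under P are computed by a trapezoid rule against the standard normal density.

```python
import math

def normal_expectation(g, half_width=10.0, n=20001):
    """E[g(Z)] for Z ~ N(0,1), by the trapezoid rule on [-half_width, half_width]."""
    h = 2.0 * half_width / (n - 1)
    total = 0.0
    for i in range(n):
        z = -half_width + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * g(z) * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return total * h

s0, mu, r, sigma, T = 100.0, 0.10, 0.05, 0.20, 1.0    # hypothetical parameters
lam = (mu - r) / sigma                                 # market price of risk

def s_T(z):
    """Terminal stock price under P, with the P-Brownian motion at T equal to sqrt(T)*z."""
    return s0 * math.exp((mu - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)

def xi_T(z):
    """Density dQ/dP on F_T under the sign convention assumed in this sketch."""
    return math.exp(-lam * math.sqrt(T) * z - 0.5 * lam ** 2 * T)

mass = normal_expectation(xi_T)                                          # E[xi_T]
discounted_q = math.exp(-r * T) * normal_expectation(lambda z: xi_T(z) * s_T(z))
discounted_p = math.exp(-r * T) * normal_expectation(s_T)
```

Weighting by ξT brings the discounted expected price back to the initial price, whereas the unweighted discounted expectation under P exceeds it whenever µ > r. The difference is exactly the compensation for risk carried by λ.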
Assuming a viable Ito market, Theorem (E.1.1) applies, that is, there exists an adapted random vector λt and an
equivalent martingale measure Q such that the instantaneous rate of returns bt of the risky assets satisfies Equation
(E.1.2), that is
bt = rt I + σt λt , dP × dt a.s.
Remark 1.3.1 As a result of the absence of arbitrage opportunity, there is a constraint on the rate of return of financial
assets. The riskier the asset, the higher the return, to justify its presence in the portfolio.


Hence, in absence of arbitrage opportunity, the multi-underlyings model has dynamics
\[ \frac{dS_t^i}{S_t^i} = r_t\,dt + \sum_{j=1}^{k} \sigma_j^i(t)\,\lambda_t^j\,dt + \sum_{j=1}^{k} \sigma_j^i(t)\,d\hat{W}_t^j \]

and we see that the normalised market S t is a Q-martingale, and the conclusion of no-arbitrage follows from Lemma
(1.3.1). Geman et al. [1995] proved that many other probability measures can be defined in a similar way, which
reveal themselves to be very useful in complex option pricing. We can now price claims with future cash flows. Given
XT an FT -measurable random variable, and assuming πT (H) = X for some self-financing H, then by the Law of
One Price, πt (H) is the price at time t of X defined as
\[ \pi_t(H) = B_t E^Q\Big[ \frac{\pi_T}{B_T} \,\Big|\, \mathcal{F}_t \Big] = B_t E^Q\Big[ \frac{X}{B_T} \,\Big|\, \mathcal{F}_t \Big] \]
where B_t = e^{rt} is the riskless asset, and π(H)/B is a martingale under Q (see Harrison et al. [1981] [1983]). As a
result, the market price πt in Equation (1.3.4) becomes
\[ \pi_t = E^Q\Big[ \sum_{k=1}^{N} e^{-\int_t^{T_k} r_s\,ds} F_k \Big] = \sum_{k=1}^{N} \hat{F}_k P(t, T_k) \]

where (rt )0≤t≤T is the risk-free rate (possibly stochastic), P (t, Tk ) is the discount factor, and
\[ \hat{F}_k(t) = \frac{E_t^Q\big[ e^{-\int_t^{T_k} r_s\,ds} F_k \big]}{P(t, T_k)} \]
is the expected cash flow seen at time t for the maturity Tk , for k = 1, .., N .

1.3.3 Modelling the discount factor

We let the zero-coupon bond price P (t, T ) be the price at time t of $1 paid at maturity. Given the pair (rt , λt )t≥0 of
bounded adapted stochastic processes, the price of a zero-coupon bond in the period [t, T ] and under the risk-neutral
probability measure Q is given by
\[ P(t, T) = E^Q\big[ e^{-\int_t^T r_s\,ds} \,\big|\, \mathcal{F}_t \big] \]

reflecting the uncertainty in time-varying discount rates. As a result, to model the bond price we can characterise a
dynamic of the short rate rt . Alternatively, the AAO allows us to describe the dynamic of the bond price from its
initial value and the knowledge of its volatility function. Therefore, under further assumptions, the shape taken by
the volatility function fully characterises the dynamic of the bond price, and some specific functions gave their names
to popular models commonly used in practice. Hence, the dynamics of the zero-coupon bond price are
\[ \frac{dP(t, T)}{P(t, T)} = r_t\,dt \pm \Gamma_P(t, T)\,dW_P(t) \quad \text{with } P(T, T) = 1 \tag{1.3.7} \]

where (WP (t))t≥0 is valued in Rn and ΓP (t, T ) 1 is a family of local volatilities parameterised by their maturities T .
However, practitioners would rather work with the forward instantaneous rate which is related to the bond price by
fP (t, T ) = −∂T ln P (t, T )

Footnote 1: \( \Gamma_P(t, T)\,dW(t) = \sum_{j=1}^{n} \Gamma_{P,j}(t, T)\,dW_j(t) \)

The relationship between the bond price and the rates in general was found by Heath et al. [1992] and following their
approach the forward instantaneous rate is
\[ f_P(t, T) = f_P(0, T) \mp \int_0^t \gamma_P(s, T)\,dW_P(s) + \int_0^t \gamma_P(s, T)\,\Gamma_P(s, T)^{\top} ds \]
where γP (s, T ) = ∂T ΓP (s, T ). The spot rate rt = fP (t, t) is therefore
\[ r_t = f_P(0, t) \mp \int_0^t \gamma_P(s, t)\,dW_P(s) + \int_0^t \gamma_P(s, t)\,\Gamma_P(s, t)^{\top} ds \]

Similarly to the bond price, the short rate is characterised by the initial yield curve and a family of bond price volatility
functions. However, both the bond price and the short rate dynamics above are too general, and additional constraints must be
imposed on the volatility function. A large literature flourished to model the discount-rate process and risk premia in
continuous time with stochastic processes. The literature on term structure of interest rates is currently dominated by
two different frameworks. The first one originated with Vasicek [1977] and was extended among others by Cox, Ingersoll,
and Ross [1985]. It assumes that a finite number of latent factors drive the whole dynamics of term structure, among
which are the Affine models. The other framework comprises curve models which are calibrated to the relevant
forward curve. Among them are forward rate models generalised by Heath, Jarrow and Morton (HJM) [1992], the
libor market models (LMM) initiated by Brace, Gatarek and Musiela (BGM) [1997] and the random field models
introduced by Kennedy [1994].
As an example of the HJM models, Frachot [1995] and Duffie et al. [1993] considered respectively the special
case of the quadratic and linear factor model for the yield price. In that setting, we assume that at a given time t all
the zero-coupon bond prices are functions of some state variables. We can further restrict the model by assuming that
the market price of claims is a function of some Markov process.
Assumption 1 We assume there exists a Markov process Xt valued in some open subset D ⊂ Rn × [0, ∞) such that
the market value at time t of an asset maturing at t + τ is of the form
f (τ, Xt )
where f ∈ C 1,2 (D × [0, ∞[) and τ ∈ [0, T ] with some fixed and finite T .
For tractability, we assume that the drift term and the market price of risk of the Markov process X are nontrivially
affine under the historical probability measure P. Only technical regularity is required for equivalence between absence
of arbitrage and the existence of an equivalent martingale measure.
That is, the price process of any security is a Q-martingale after normalisation at each time t by the riskless asset
e^{\int_0^t R(X_s)\,ds} (see Lemma (1.3.1)). Therefore, there is
a standard Brownian motion W in Rn under the probability measure Q such that
\[ dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dW_t \tag{1.3.8} \]

where the drift µ : D → Rn and the diffusion σ : D → Rn×n are regular enough to have a unique strong solution
valued in D. To be more precise, the domain D is a subset of Rn × [0, ∞) and we treat the state process X defined so
that (Xt , t) is in D for all t. We assume that for each t, {x : (t, x) ∈ D} contains an open subset of Rn . We are now
going to be interested in the choice for (f, µ, σ) that are compatible in the sense that f characterises a price process.
El Karoui et al. [1992] explained interest rates as regular functions of an n-dimensional state variable process X
fP (t, T ) = F (t, T, Xt ) , t ≤ T
where F is at most quadratic in X. By constraining X to be linear, it became the Quadratic Gaussian model (QG).
Similarly, introducing n state variables, Frachot [1995] described the linear factor model
\[ f_P(t, T) = b(t, T) + a(t, T) \cdot X_t, \quad \forall t \le T \tag{1.3.9} \]

where fP (t, T ) is the instantaneous forward rate and the functions b(t, T ) and a(t, T ) = [a1 (t, T ), ..., an (t, T )]^⊤ are
deterministic. He showed that by discarding the variables Xi (t) one can identify the state variables to some particular
rates such that
fP (t, T ) = b(t, T ) + a1 (t, T )fP (t, t + θ1 ) + ... + an (t, T )fP (t, t + θn ) , ∀t ≤ T
where θi for i = 1, .., n are distinct maturities and the functions a(., .) and b(., .) must satisfy extra compatibility
conditions
b(t, t + θi ) = 0 , ai (t, t + θi ) = 1 , aj (t, t + θi ) = 0 (j ≠ i)
that is, the rate with maturity (t + θi ) can only be identified with a single state variable. Frachot also showed that
the QG model could be seen as a linear factor model with constraints, that is by considering a model linear in X and
XX^⊤. In that case, the entire model can not be identified in terms of some particular rates but only XX^⊤, as in the
case of X it would lead to ai (t, T ) = 0 for all i belonging to X. Frachot [1995] and Filipovic [2001] proved that the
quadratic class represents the highest order of polynomial functions that one can apply to consistent time-separable
term structure models. Another example is the Yield factor model defined by Duffie et al. [1993] which is a special
case of the linear factor model where the asset f (τ, Xt ) in assumption (1) is the price of a zero-coupon bond of
maturity t + τ
\[ f(T - t, X_t) = E^Q\big[ e^{-\int_t^T R(X_s)\,ds} \,\big|\, \mathcal{F}_t \big] \tag{1.3.10} \]

so that given the definition of the risk-free zero coupon bond, P (t, T ) is its price at time t. The short rate is assumed
to be such that there is a measurable function R : D → R defined as the limit of yields as the maturity goes to zero,
that is
\[ R(x) = \lim_{\tau \to 0} -\frac{1}{\tau} \log f(\tau, x) \quad \text{for } x \in D \]
Depending on the asset prices and the markets under consideration, many different authors have constructed such
compatible sets (f, µ, σ). For instance, the factor models were extended by Duffie, Pan and Singleton [2000] in
the case of jump-diffusion models, and a unified presentation was developed by Duffie, Filipovic and Schachermayer
[2003]. Affine models are a class of time-homogeneous Markov processes that has arisen from a large and growing
range of useful applications in finance. They imply risk premium on untraded risks which are linear functions of the
underlying state variables, creating a method for going between the objective and risk-neutral probability measures
while retaining convenient affine properties for both measures. Duffie et al. [2003] provided a definition and complete
characterization of regular affine processes. Given a state space of the form D = R_+^m × R^n for integers m ≥ 0 and
n ≥ 0, the key affine property is roughly that the logarithm of the characteristic function of the transition distribution
pt (x, .) of such a process is affine with respect to the initial state x ∈ D. Given a regular affine process X, and a
discount-rate process {R(Xt ) : t ≥ 0} defined by an affine map x → R(x) on D into R, the discount factor
\[ P(t, T) = E\big[ e^{-\int_t^T R(X_u)\,du} \,\big|\, X_t \big] \]

is well defined under certain conditions, and is of the anticipated exponential-affine form in Xt .
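
As a one-factor illustration of the exponential-affine form, the Python sketch below takes the Vasicek short rate dr = a(θ − r)dt + σdW under Q with R(x) = x, for which the discount factor has the well-known closed form A(τ) e^{−B(τ) r}. It is compared with a crude Euler Monte Carlo estimate of the expectation. The parameter values, the discretisation, the seed and the function names are assumptions chosen for illustration.

```python
import math
import random

def vasicek_bond(r0, tau, a, theta, sigma):
    """Closed-form P(t, t+tau) = A(tau) * exp(-B(tau) * r0) for the Vasicek model."""
    b = (1.0 - math.exp(-a * tau)) / a
    log_a = (theta - sigma ** 2 / (2.0 * a ** 2)) * (b - tau) - sigma ** 2 * b ** 2 / (4.0 * a)
    return math.exp(log_a - b * r0)

def vasicek_bond_mc(r0, tau, a, theta, sigma, n_paths=4000, n_steps=50, seed=0):
    """Monte Carlo estimate of E^Q[exp(-integral of r)] using an Euler scheme."""
    rng = random.Random(seed)
    dt = tau / n_steps
    total = 0.0
    for _ in range(n_paths):
        r, integral = r0, 0.0
        for _ in range(n_steps):
            integral += r * dt          # left-endpoint rule for the time integral
            r += a * (theta - r) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        total += math.exp(-integral)
    return total / n_paths

r0, tau, a, theta, sigma = 0.03, 2.0, 0.5, 0.05, 0.01   # hypothetical parameters
closed = vasicek_bond(r0, tau, a, theta, sigma)
simulated = vasicek_bond_mc(r0, tau, a, theta, sigma)
```

The logarithm of the closed-form price is linear in the current short rate, which is the affine property in its simplest form. The Monte Carlo estimate only agrees with it up to sampling and discretisation error.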

1.4 The pricing kernel

We saw in Section (1.3.2) that risk neutral returns are risk-adjusted natural returns. That is, the return under the risk
neutral measure is the return under the natural measure with the risk premium subtracted out. Hence, to use risk
neutral prices to estimate natural probabilities, we must know the risk adjustment to add it back in. This is equivalent


to knowing both the agent’s risk aversion and his subjective probability which are non-observable. Various authors
tried to infer these risks from the market with the help of a model, with varying success. Further, the natural
expected return of a strategy depends on the risk premium for that strategy, so that knowledge of the kernel can help
in estimating the variability of the risk premium. Finally, we are unable to obtain the current market forecast of the
expected returns on equities directly from their prices, and we are left to using historical returns. Even though the risk
premium is not directly observable from option prices various authors tried to infer it. This is because there is a rich
market in equity option prices and a well developed theory to extract the martingale or risk neutral probabilities from
these prices. As a result, one can use these probabilities to forecast the probability distribution of future returns.

1.4.1 Defining the pricing kernel

The asset pricing kernel summarises investor preferences for payoffs over different states of the world. In absence of
arbitrage, all asset prices can be expressed as the expected value of the product of the pricing kernel and the asset payoff
(see Equation (1.2.3)). Discounting payoffs using time and risk preferences, it is also called the stochastic discount
factor. Hence, combined with a probability model for the states, the pricing kernel gives a complete description of
asset prices, expected returns, and risk premia. In a discrete time world with asset payoffs h(X) at time T , contingent
on the realisation of a state of nature X ∈ Ω, absence of arbitrage opportunity (AAO) (see Dybvig et al. [2003])
implies the existence of positive state space prices, that is, the Arrow-Debreu contingent claims prices p(X) paying
$1 in state X and nothing in any other states (see Theorem (D.1.1)). If the market is complete, then these state prices
are unique. The current value πh of an asset paying h(X) in one period is given by
\[ \pi_h = \int h(X)\,dP(X) \]
where P (X) is a price distribution function. Letting r(X^0) be the riskless rate as a function of the current state X^0,
such that \( \int p(X)\,dX = e^{-r(X^0)T} \), we can rewrite the price as
\[ \begin{aligned} \pi_h &= \int h(X)\,dP(X) = \Big( \int dP(X) \Big) \int h(X)\,\frac{dP(X)}{\int dP(X)} = e^{-r(X^0)T} \int h(X)\,dq^*(X) \\ &= e^{-r(X^0)T} E^*[h(X)] = E[h(X)\,\xi(X)] \end{aligned} \tag{1.4.11} \]

where the asterisk denotes the expectation in the martingale measure and where the pricing kernel, that is, the state-price density ξ(X), is the Radon-Nikodym derivative of P (X) with respect to the natural measure denoted F (X).
With continuous distribution, we get ξ(X) = p(X)/f (X) where f (X) is the natural probability, that is, the actual or relevant
objective probability distribution, and the risk-neutral probabilities are given by
\[ q^*(X) = \frac{p(X)}{\int p(X)\,dX} = e^{r(X^0)T} p(X) \]
so that
\[ \xi(X) = e^{-r(X^0)T}\,\frac{q^*(X)}{f(X)} \tag{1.4.12} \]
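
The chain of equalities in Equation (1.4.11) and the kernel in Equation (1.4.12) can be checked on a finite state space, where the integrals become sums. The Python sketch below uses three states with hypothetical state prices, natural probabilities and payoffs. All numbers and variable names are illustrative assumptions.

```python
import math

T = 1.0
state_prices = [0.30, 0.40, 0.25]     # hypothetical Arrow-Debreu prices p(X)
natural_probs = [0.25, 0.45, 0.30]    # hypothetical natural probabilities f(X)
payoff = [80.0, 100.0, 125.0]         # hypothetical payoff h(X)

discount = sum(state_prices)                        # plays the role of exp(-r T)
r = -math.log(discount) / T                         # implied riskless rate
q_star = [p / discount for p in state_prices]       # risk-neutral probabilities q*(X)
kernel = [p / f for p, f in zip(state_prices, natural_probs)]    # xi(X) = p(X) / f(X)

# Three ways of writing the same price, following Equation (1.4.11)
price_state = sum(h * p for h, p in zip(payoff, state_prices))
price_risk_neutral = discount * sum(h * q for h, q in zip(payoff, q_star))
price_kernel = sum(f * h * xi for f, h, xi in zip(natural_probs, payoff, kernel))
```

Only the state prices are needed to price the payoff. Splitting them into a kernel and natural probabilities requires additional information, which is the identification problem discussed in the remainder of this section.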

We let Xt denote the current state and Xt+1 be a state after one period and assume that it fully describes the state of
nature. Then, ξt,t+1 is the empirical pricing kernel associated with returns between date t and t + 1, conditional on
the information available at date t ≤ t + 1. It is estimated as
\[ \xi_{t,t+1} = \xi(X_t, X_{t+1}) = \frac{p(X_{t+1}|X_t)}{f(X_{t+1}|X_t)} = e^{-r\Delta t}\,\frac{q(X_{t+1}|X_t)}{f(X_{t+1}|X_t)} \]

where q is the risk-neutral density, and f is the objective (historical) density. Hence, the kernel is defined as the price
per unit of probability in continuous state spaces (see Equation (D.6.13)). Note, we can always rewrite the risk-neutral
probability as
q(Xt+1 |Xt ) = er∆t ξt,t+1 f (Xt+1 |Xt )
where the natural probability transition function f (Xt+1 |Xt ), the kernel ξt,t+1 , and the discount factor e^{−r∆t} are
unknowns. One can therefore either model each element separately and try to recombine them, or
infer them directly from the risk-neutral probability. However, one can not disentangle them, and restrictions on the
kernel, or the natural distribution must be imposed to identify them separately from the knowledge of the risk-neutral
probability.
One approach is to use the historical distribution of returns to estimate the unknown kernel and then link the
historical estimate of the natural distribution to the risk neutral distribution. For instance, Jackwerth et al. [1996]
used implied binomial trees to represent the stochastic process, Ait Sahalia et al. [1998] [2000] combined state prices
derived from option prices with estimates of the natural distribution to determine the kernel, Bollerslev et al. [2011]
used high frequency data to estimate the premium for jump risk in a jump diffusion model. Assuming a transition
independent kernel, Ross [2013] showed that the equilibrium system above could be solved without the need of
historical data.

1.4.2 The empirical pricing kernel

The discount factor can be seen as an index of bad times such that the required risk premium for any asset reflects
its covariation with bad times. Investors require higher risk premia for assets suffering more in bad times, where bad
times are periods when the marginal utility (MU) of investors is high.
Lucas [1978] expressed the pricing kernel as the intertemporal marginal rate of substitution
\[ \xi_{t,t+1} = \frac{u'(c_{t+1})}{u'(c_t)} \]

where u(•) is a utility function and ct is the consumption at time t. Under the assumption of power utility, the pricing
kernel becomes
\[ \xi_{t,t+1} = e^{-\rho} \Big( \frac{c_{t+1}}{c_t} \Big)^{-\gamma} \]

where ρ is the rate of time preference and γ is a level of relative risk aversion. The standard risk-aversion measures
are usually functions of the pricing kernel slope, and Pratt showed that it is the negative of the ratio of the derivative
of the pricing kernel to the pricing kernel, that is
\[ \gamma_t = -\frac{c_{t+1}\,\xi'_{t,t+1}(c_{t+1})}{\xi_{t,t+1}(c_{t+1})} \]
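
The link between the slope of the kernel and risk aversion can be verified numerically for the power utility kernel given above. The Python sketch below evaluates −c ξ′(c)/ξ(c) with a central finite difference and recovers γ. The parameter values, the step size and the function names are illustrative assumptions.

```python
import math

def power_kernel(c_next, c_now, rho, gamma):
    """xi_{t,t+1} = exp(-rho) * (c_next / c_now) ** (-gamma)."""
    return math.exp(-rho) * (c_next / c_now) ** (-gamma)

def implied_risk_aversion(c_next, c_now, rho, gamma, rel_step=1e-6):
    """-c * xi'(c) / xi(c), with xi' approximated by a central finite difference in c_next."""
    h = rel_step * c_next
    slope = (power_kernel(c_next + h, c_now, rho, gamma)
             - power_kernel(c_next - h, c_now, rho, gamma)) / (2.0 * h)
    return -c_next * slope / power_kernel(c_next, c_now, rho, gamma)

rho, gamma = 0.02, 3.0                                  # hypothetical preference parameters
recovered = implied_risk_aversion(1.03, 1.0, rho, gamma)
kernel_bad_times = power_kernel(0.95, 1.0, rho, gamma)  # consumption falls
kernel_good_times = power_kernel(1.05, 1.0, rho, gamma) # consumption rises
```

The kernel is higher when consumption falls than when it rises, consistent with the reading of the discount factor as an index of bad times.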
Generally, the pricing kernel depends not just on current and future consumption, but also on all variables affecting
the marginal utility. When the pricing kernel is a function of multiple state variables, the level of risk aversion can also
fluctuate as these variables change. Several authors used Equation (1.2.3) to investigate the characteristics
of investor preferences in relation to equity securities, that is
St = E[ξt,t+1 Xt+1 |Ft ]
where St is the current stock price and Xt+1 is the asset payoff in one period. Hansen et al. [1982] identified the
pricing kernel equation above with the unconditional version

\[ E\Big[ \frac{S_{t+1}}{S_t}\,\xi_{t,t+1} \Big] = 1 \tag{1.4.13} \]

They specified the aggregate consumption growth rate as a pricing kernel state variable, and measured consumption
using data from the National Income and Products Accounts. Campbell et al. [1997] and Cochrane [2001] provided
comprehensive treatments of the role of the pricing kernel in asset pricing. As researchers do not agree on
which state variables should enter the pricing kernel, Rosenberg et al. [2001] [2002] considered a pricing
kernel projection estimated without specifying the state variables. Writing the original pricing kernel as ξt,t+1 =
ξt,t+1 (Zt , Zt+1 ) where Zt is a vector of pricing kernel state variables they re-wrote the pricing equation by factoring
the joint density ft (Xt+1 , Zt+1 ) into the product of the conditional density ft (Zt+1 |Xt+1 ) and the marginal density
ft (Xt+1 ). The expectation is then evaluated in two steps. First, the pricing kernel is integrated using the conditional
density, giving the projected pricing kernel ξ*_{t,t+1}(X_{t+1}). Second, the product of the projected pricing kernel and the
payoff variable is integrated using the marginal density, giving the asset price
\[ S_t = E_t[\xi^*_{t,t+1}(X_{t+1})\,X_{t+1}], \qquad \xi^*_{t,t+1}(X_{t+1}) = E_t[\xi_{t,t+1}(Z_t, Z_{t+1}) \mid X_{t+1}] \]

Thus, for the valuation of an asset with payoffs depending only on Xt+1 , the pricing kernel is summarised as a
function of the asset payoff which can vary over time, reflecting time-variation in the state variables. It is called the
empirical pricing kernel (EPK) and was estimated on monthly basis from 1991 to 1995 on the S&P 500 index option
data. Barone-Adesi et al. [2008] combined an empirical and a theoretical approach, relaxing the restriction that the
objective return distribution and the risk-neutral distribution share the same variance and higher order moments such
as skewness and kurtosis.
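The two-step evaluation of the expectation described above can be checked on a small discrete example. The sketch below is only illustrative: the joint probabilities, the payoff values, the three state labels and the kernel values are assumptions chosen for the example, and the kernel is taken to depend on Zt+1 only (Zt being held fixed).

```python
# Joint probabilities f_t(x, z) of the next-period payoff x and state variable z.
# All numbers are illustrative assumptions, not estimates.
joint = {
    (90.0, "bad"): 0.20, (90.0, "normal"): 0.10, (90.0, "good"): 0.00,
    (110.0, "bad"): 0.05, (110.0, "normal"): 0.35, (110.0, "good"): 0.30,
}
# Original pricing kernel xi(Z_t, z) as a function of z, with Z_t held fixed.
kernel = {"bad": 1.30, "normal": 0.95, "good": 0.70}


def price_full(joint, kernel):
    """S_t = E_t[xi(Z) X], integrating over the joint density."""
    return sum(p * kernel[z] * x for (x, z), p in joint.items())


def marginal_payoff(joint):
    """Marginal density f_t(x) of the payoff."""
    marg = {}
    for (x, _), p in joint.items():
        marg[x] = marg.get(x, 0.0) + p
    return marg


def projected_kernel(joint, kernel):
    """Step 1: xi*(x) = E_t[xi(Z) | X = x], using the conditional density."""
    marg = marginal_payoff(joint)
    proj = {x: 0.0 for x in marg}
    for (x, z), p in joint.items():
        proj[x] += (p / marg[x]) * kernel[z]
    return proj


def price_projected(joint, kernel):
    """Step 2: S_t = E_t[xi*(X) X], using the marginal density."""
    marg = marginal_payoff(joint)
    proj = projected_kernel(joint, kernel)
    return sum(marg[x] * proj[x] * x for x in marg)


print(price_full(joint, kernel), price_projected(joint, kernel))
```

Both routes give the same price, which is just the law of iterated expectations; the projected kernel only needs to be known as a function of the payoff.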

1.4.3 Analysing the expected risk premium

One of the consequences of the absence of arbitrage opportunities (AAO) is that the expected value of the product of
the pricing kernel and the gross asset return $R^g_t$ defined in Equation (3.3.4) must equal unity (see Equation (1.4.13)).
That is, assuming dividend adjusted prices and setting the one period gross return as
$R^g_{t+1} = 1 + R_{t,t+1} = \frac{S_{t+1}}{S_t}$, we get
\[
\pi_t = E[\xi_{t+1} R^g_{t+1} \mid \mathcal{F}_t] = 1
\]
Hence, the one-period (gross) risk-free rate $r_{ft}$ can be written as the inverse of the expectation of the pricing kernel
\[
r_{ft} = E[\xi_{t+1} \mid \mathcal{F}_t]^{-1}
\]
Further, Equation (1.4.13) implies that the expected risk premium is proportional to the conditional covariance of its
return with the pricing kernel
\[
E_t[R^g_{t+1} - r_{ft}] = -r_{ft} \, Cov_t(\xi_{t+1}, R^g_{t+1})
\]

where $Cov_t(\bullet, \bullet)$ is the covariance conditional on information available at time t (see Whitelaw [1997]). As a result,
the conditional Sharpe ratio of any asset, defined as the ratio of the conditional mean excess return to the conditional
standard deviation of this return, can be written in terms of the volatility of the pricing kernel and the correlation
between the kernel and the return
\[
\frac{E_t[R^g_{t+1} - r_{ft}]}{\sigma_t(R^g_{t+1} - r_{ft})}
= -r_{ft} \frac{Cov_t(\xi_{t+1}, R^g_{t+1})}{\sigma_t(R^g_{t+1})}
= -r_{ft} \, \sigma_t(\xi_{t+1}) \, Corr_t(\xi_{t+1}, R^g_{t+1})
\]
where $\sigma_t(\bullet)$ and $Corr_t(\bullet, \bullet)$ are respectively the standard deviation and correlation, conditional on information at
time t. Hence, this equation shows that the conditional Sharpe ratio $M_{SR,t}$ is proportional to the correlation between
the pricing kernel and the return on the market $R_{m,t}$
\[
M_{SR,t} = -r_{ft} \, \sigma_t(\xi_{t+1}) \, Corr_t(\xi_{t+1}, R^g_{m,t+1})
\]

Hence, if the Sharpe ratio varies substantially over time, then the variation is mostly attributable to the variation in the
conditional correlation and depends critically on the modelling of the pricing kernel.
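The identities of this section can be verified numerically in a one-period, three-state economy. The sketch below is a minimal illustration: the state probabilities, the kernel values and the payoffs are assumptions, and the current price is set from the pricing equation so that the expectation of the kernel times the gross return equals one by construction.

```python
import math

# Illustrative three-state economy; every number below is an assumption.
probs = [0.25, 0.50, 0.25]       # conditional state probabilities
xi = [1.20, 0.95, 0.75]          # pricing kernel xi_{t+1} in each state
payoff = [80.0, 105.0, 130.0]    # next-period price S_{t+1} in each state


def mean(values, p):
    """Probability-weighted mean."""
    return sum(pi * v for pi, v in zip(p, values))


def cov(a, b, p):
    """Probability-weighted covariance."""
    ma, mb = mean(a, p), mean(b, p)
    return sum(pi * (ai - ma) * (bi - mb) for pi, ai, bi in zip(p, a, b))


# Price today from S_t = E[xi * S_{t+1}], so that E[xi * R^g] = 1 holds.
price = mean([k * x for k, x in zip(xi, payoff)], probs)
gross = [x / price for x in payoff]

rf = 1.0 / mean(xi, probs)                 # gross risk-free rate
premium = mean(gross, probs) - rf          # expected risk premium
cov_xi_r = cov(xi, gross, probs)
sd_r = math.sqrt(cov(gross, gross, probs))
sd_xi = math.sqrt(cov(xi, xi, probs))
corr = cov_xi_r / (sd_xi * sd_r)
sharpe = premium / sd_r

print(premium, -rf * cov_xi_r)        # the two numbers should agree
print(sharpe, -rf * sd_xi * corr)     # the two numbers should agree
```

Since the kernel is high in the low-payoff state, the covariance is negative and the risk premium is positive, as expected for a risky asset.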
One approach is to specify the kernel ξt as a function of asset returns. For instance, modelling the pricing kernel as
a linear function of the market return produces the conditional CAPM. Since risk aversion implies a negative coefficient
on the market return, modelling the pricing kernel as a linear function of the market return leads the correlation to be
around −1 so that the market Sharpe ratio is approximately constant over time. Alternatively, modelling the discount
factor as a quadratic function of the market return gives the conditional three moment CAPM proposed by Kraus
et al. [1976] allowing for some time variation in the market SRs due to the pricing of skewness risk, but the
correlation is still pushed towards −1. Other specifications of the pricing kernel exist, such as a nonlinear function of
the market return or a linear function of multiple asset returns, but with limited time variation in the correlation
(see Bansal et al. [1993]). Based on explanatory and predictive power, a number of additional factors, including
small firm returns and return spreads between long-term and short-term bonds, have been proposed and tested but
the correlations between the discount factor and the market return tend to be relatively stable. Another approach
uses results from a representative agent and exchange economy, and models the pricing kernel as the marginal rate of
substitution (MRS) over consumption leading to the consumption CAPM (see Breeden et al. [1989]). When the MRS
depends on consumption growth and the stock market is modelled as a claim on aggregate consumption, one might
expect the correlation and the Sharpe ratio to be relatively stable. Assuming consumption growth follows a two-regime
autoregressive process, Whitelaw [1997a] obtained regime shifts (phases of business cycle) with mean and volatility
being negatively correlated, implying significant time variation in the Sharpe ratio.

1.4.4 Inferring risk premium from option prices

In a recent article, assuming that the state-price transition function p(Xi , Xj ) is observable, Ross [2013] used the
recovery theorem to uniquely determine the kernel, the discount rate, future values of the kernel, and the underlying
natural probability distribution of return from the transition state prices alone. Note, the notion of transition independence was necessary to separately determine the kernel and the natural probability distribution. Other approaches used
the historical distribution of returns to estimate the unknown kernel and thereby link the historical estimate of the natural distribution to the risk-neutral distribution. Alternatively, one could assume a functional form for the kernel. Ross
[2013] showed that the equilibrium system in Equation (D.6.15) could be solved without the need to use either the
historical distribution of returns or independent parametric assumptions on preferences to find the market’s subjective
distribution of future returns. The approach relies on the knowledge of the state transition matrix whose elements give
the price of one dollar in a future state, conditional on any other state. One way forward is to find these transition
prices from the state-prices for different maturities derived from the market prices of simple options by using a version
of the forward equation for Markov processes. As pointed out by Ross, there are a lot of possible applications if we
know the kernel (market’s risk aversion) and the market’s subjective assessment of the distribution of returns. We can
use the market’s future distribution of returns much as we use forward rates as forecasts of future spot rates. Rather
than using historical estimates of the risk premium on the market as an input into asset allocation models (see Section
(1.4.3)), we should use the market’s current subjective forecast. The idea can extend to all project valuation using
historical estimates of the risk premium.
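The mechanics of the recovery can be sketched on a finite state space. The example below is not the system in Equation (D.6.15); it assumes the usual finite-state form of transition independence, where the state-price transition matrix P satisfies p(i,j) = δ (m_j / m_i) f(i,j), with δ the discount factor, m the marginal utilities and f the natural transition probabilities. Under that assumption, the vector z = 1/m is the positive (Perron-Frobenius) eigenvector of P with eigenvalue δ. All inputs are synthetic, and the function names are hypothetical.

```python
def mat_vec(matrix, vector):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def perron_pair(matrix, iterations=500):
    """Dominant eigenvalue and positive eigenvector by power iteration."""
    z = [1.0 / len(matrix)] * len(matrix)
    for _ in range(iterations):
        w = mat_vec(matrix, z)
        total = sum(w)
        z = [wi / total for wi in w]
    eigenvalue = sum(mat_vec(matrix, z))  # valid because z sums to one
    return eigenvalue, z


def recover(state_prices):
    """Recover discount factor, natural probabilities and kernel from P."""
    delta, z = perron_pair(state_prices)
    n = len(state_prices)
    natural = [[state_prices[i][j] * z[j] / (delta * z[i]) for j in range(n)]
               for i in range(n)]
    kernel = [[delta * z[i] / z[j] for j in range(n)] for i in range(n)]
    return delta, natural, kernel


# Synthetic "true" economy used to build the observable state prices.
true_delta = 0.97
true_natural = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]]
marginal_utility = [1.4, 1.0, 0.7]
state_prices = [
    [true_delta * marginal_utility[j] / marginal_utility[i] * true_natural[i][j]
     for j in range(3)]
    for i in range(3)
]

delta, natural, kernel = recover(state_prices)
print(delta)
print(natural)
```

In this synthetic setting the recovered discount factor and natural transition matrix coincide with the ones used to generate the state prices. With market data, the difficult step is the estimation of the transition prices themselves.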


1.5 Modelling asset returns

1.5.1 Defining the return process

Setting $\alpha_t = \log(B_t)$ for 0 ≤ t ≤ T where $B_t$ is the riskless asset defined in Equation (1.3.6), we call α the return
process for B, such that dB = Bdα with α0 = 0. Since the riskless asset, B, is absolutely continuous, then
\[
\alpha_t = \int_0^t r_s \, ds
\]

and, rs , is the time-s interest rate with continuous compounding. Similarly to the riskless asset, assuming any semimartingale
price process, S, possibly with jumps, we want its return process, R, to satisfy the equation
\[
dS = S_{-} \, dR
\]
which is equivalent to
\[
S_t = S_0 + \int_0^t S_{u-} \, dR_u \tag{1.5.14}
\]
as well as
\[
R_t = \int_0^t \frac{1}{S_{u-}} \, dS_u \tag{1.5.15}
\]
where the return, Rt , is defined over the range [0, t]. Note, $\frac{dS_t}{S_{t-}}$, is the infinitesimal rate of return of the price process, S,
while, Rt , is its integrated version. Given a (reasonable) price process, S, the above equation defines the corresponding
return process R. Similarly, given, S0 , and a semimartingale, R, Equation (1.5.14) always has a semimartingale
solution S. It is unique and given by

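\[
S_t = S_0 \psi_t(R) , \quad 0 \le t \le T
\]
where
\[
\psi_t(R) = e^{R_t - R_0 - \frac{1}{2}[R,R]_t} \prod_{s \le t} (1 + \Delta R_s) \, e^{-\Delta R_s + \frac{1}{2}(\Delta R_s)^2}
\]
with $\Delta R_s = R_s - R_{s-}$ and quadratic variation
\[
[R, R]_t = R_t R_t - 2 \int_0^t R_{s-} \, dR_s
\]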

Note, R, is such that 1 + ∆R > 0 for any and all jumps if and only if ψt (R) > 0 for all t. Similarly with weak
inequalities. Further, we have ψ0 (R) = 1. In that setting, the discounted price process is given by
\[
\bar{S} = \frac{S}{B} = S_0 \psi(R) e^{-\alpha}
\]
Since −α is continuous, α0 = 0, and [−α, −α] = 0 we get the semimartingale exponential expression
\[
\bar{S} = S_0 \psi(R) \psi(-\alpha)
\]
Further, since [R, −α] = 0, using the product rule for stochastic exponentials, ψ(X)ψ(Y) = ψ(X + Y + [X, Y]), we get
\[
\bar{S} = S_0 \psi(R - \alpha) = \bar{S}_0 \psi(R - \alpha)
\]

so that Y = R − α can be interpreted as the return process for the discounted price process $\bar{S}$. For example, in the
Black-Scholes [1973] model the return process is given by
\[
R_t = \int_0^t \frac{1}{S_u} \, dS_u = \mu t + \sigma \hat{W}_t
\]
where Ŵt is a standard Brownian motion in the historical probability measure P. With a constant interest rate, r, we
get αt = rt, and the return process for the discounted price process becomes
Yt = (µ − r)t + σ Ŵt
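To see how the semimartingale exponential works when the return process moves by jumps, consider the simplest case of a piecewise constant return process that only changes at discrete dates. Then Rt − R0 is the sum of the jumps, [R, R]t is the sum of the squared jumps, and the general formula collapses to the product of the terms (1 + ∆Rs ). The sketch below checks this against the recursion implied by dS = S− dR; the jump sizes and the function names are illustrative assumptions.

```python
import math


def stochastic_exponential(jumps):
    """psi_t(R) for a pure-jump, piecewise constant return process."""
    r_total = sum(jumps)                      # R_t - R_0
    quad_var = sum(dr * dr for dr in jumps)   # [R, R]_t
    exp_part = math.exp(r_total - 0.5 * quad_var)
    jump_part = 1.0
    for dr in jumps:
        jump_part *= (1.0 + dr) * math.exp(-dr + 0.5 * dr * dr)
    return exp_part * jump_part


def price_path(s0, jumps):
    """Solve dS = S_- dR step by step: S_t = S_{t-} (1 + Delta R_t)."""
    path = [s0]
    for dr in jumps:
        path.append(path[-1] * (1.0 + dr))
    return path


jumps = [0.02, -0.01, 0.03]   # illustrative one-period returns
s0 = 100.0
print(s0 * stochastic_exponential(jumps), price_path(s0, jumps)[-1])
```

The two numbers agree, and the price stays positive as long as every jump satisfies 1 + ∆R > 0. In the continuous Black-Scholes case there are no jumps and [R, R]t = σ²t, so the same formula gives the familiar lognormal price with drift µ − σ²/2 in the exponent.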

1.5.2 Valuing portfolios

In order to describe general portfolio valuation and its optimisation, we consider a general continuous time framework
where the stock price, S, is a decomposable semimartingale. The example of a multi-underlying diffusion model
is detailed in Appendix (E.1). We let the world be defined as in Assumption (E.1.1), and we assume (N + 1) assets
$S = (S^0, S^1, .., S^N)$ traded between 0 and $T_H$ where the risky-assets $S^i$; 1 ≤ i ≤ N are Ito's random functions given
in Equation (1.3.5). The risk-free asset $S^0$ satisfies the dynamics
\[
dS^0_t = r_t S^0_t \, dt
\]
The data format of interest is returns, that is, relative price change. From the definition of the percent return given in
Equation (3.3.4), we consider the one period net returns $R^i_{t,t+1} = \frac{S^i_{t+1}}{S^i_t} - 1$, which for high-frequency data (at daily or
higher frequency) are almost identical to log-price changes $r^i_L(t, t+1) = \ln S^i_{t+1} - \ln S^i_t$. The portfolio strategy
is given by the process $(\delta^i(t))_{0 \le i \le N}$ corresponding to the quantity invested in each asset. The financial value of the
portfolio δ is given by V (δ), and its value at time t satisfies
\[
V_t(\delta) = \langle \delta(t), S_t \rangle = \sum_{i=0}^{N} \delta^i(t) S^i_t \tag{1.5.16}
\]

Assuming a self-financing portfolio (see Definition (E.1.2)), its dynamics satisfy
\[
dV_t(\delta) = \sum_{i=0}^{N} \delta^i(t) \, dS^i_t = \delta^0(t) \, dS^0_t + \sum_{i=1}^{N} \delta^i(t) \, dS^i_t
\]
Assuming a simple strategy, for any dates $t < t'$, the self-financing condition can be written as
\[
V_{t'} - V_t = \int_t^{t'} \sum_{i=0}^{N} \delta^i(u) \, dS^i_u
\]
From the definition of the portfolio we get $\delta^0(t) S^0_t = V_t(\delta) - \sum_{i=1}^{N} \delta^i(t) S^i_t$, so that, plugging back in the SDE and
factorising, the dynamics of the portfolio become
\[
dV_t(\delta) = V_t(\delta) r_t \, dt + \sum_{i=1}^{N} \delta^i(t) S^i_t \left( \frac{dS^i_t}{S^i_t} - r_t \, dt \right)
\]

Further, we let $(h^i(t))_{1 \le i \le N} = (\delta^i(t) S^i_t)_{1 \le i \le N}$ be the amount of wealth, in dollars, invested in the ith risky
security, so that the dynamics of the portfolio rewrite
\[
dV_t(h) = \left( V_t(h) - \sum_{i=1}^{N} h^i(t) \right) r_t \, dt + \sum_{i=1}^{N} h^i(t) R^i_{t,t+1} \tag{1.5.17}
\]
which can be rewritten as
\[
dV_t(h) = V_t(h) r_t \, dt + \sum_{i=1}^{N} h^i(t) \left( R^i_{t,t+1} - r_t \, dt \right)
\]
corresponding to Equation (E.1.1), that is
\[
dV_t(h) = r_t V_t(h) \, dt + \langle h_t, R_t - r_t I \, dt \rangle
\]
where $h_t = (\delta S)_t$ corresponds to the vector with components $(\delta^i(t) S^i_t)_{1 \le i \le N}$ describing the amount to be invested in
each stock, and where $R_t$ is a vector with components $(R^i_{t,t+1})_{1 \le i \le N}$ describing the return of each risky security. It is
often convenient to consider the portfolio fractions
\[
\pi_v = \pi_v(t) = (\pi^0_v(t), .., \pi^N_v(t))^\top , \quad t \in [0, \infty)
\]
with coordinates defined by
\[
\pi^i_v(t) = \frac{h^i(t)}{V_t(h)} = \frac{\delta^i(t) S^i_t}{V_t(h)} \tag{1.5.18}
\]
denoting the proportion of the investor's capital invested in the ith asset at time t. It leads to the dynamics of the
portfolio defined as
\[
\frac{dV_t(h)}{V_t(h)} = r_t \, dt + \langle \pi_v(t), R_t - r_t I \, dt \rangle
\]
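A one-period, discrete-time analogue of these dynamics can be checked directly. The sketch below is an illustration under assumptions: the prices, holdings and the single rebalancing period are made up, the risk-free asset is stored at index 0, and the one-period simple return of that asset plays the role of rt dt.

```python
def portfolio_value(holdings, prices):
    """V_t(delta) = sum_i delta^i(t) S^i_t."""
    return sum(d * s for d, s in zip(holdings, prices))


def net_returns(prices_start, prices_end):
    """One-period net returns S_{t+1} / S_t - 1 for each asset."""
    return [end / start - 1.0 for start, end in zip(prices_start, prices_end)]


# Illustrative data: index 0 is the risk-free asset, indices 1..N are risky.
prices_t = [1.000, 50.0, 20.0]
prices_t1 = [1.001, 51.0, 19.8]
holdings = [400.0, 8.0, 10.0]   # quantities delta^i(t), held over the period

v_t = portfolio_value(holdings, prices_t)
v_t1 = portfolio_value(holdings, prices_t1)   # self-financing: no cash added

rets = net_returns(prices_t, prices_t1)
r_f = rets[0]
h = [d * s for d, s in zip(holdings[1:], prices_t[1:])]   # dollar amounts
pi = [hi / v_t for hi in h]                                # portfolio fractions

growth_from_values = v_t1 / v_t - 1.0
growth_from_weights = r_f + sum(w * (r - r_f) for w, r in zip(pi, rets[1:]))
print(growth_from_values, growth_from_weights)
```

Both computations give the same portfolio return, which shows that the initial value and the fractions invested in the risky assets are enough to propagate the portfolio value once a model for the asset returns is given.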
The linear Equation having a unique solution, knowing the initial investment and the weights of the portfolio is enough
to characterise the value of the portfolio. In a viable market, one of the consequences of the absence of arbitrage
opportunity is that the instantaneous rate of returns of the risky assets satisfies Equation (E.1.2). Hence, in that setting,
the dynamics of the portfolio become
\[
\frac{dV_t(h)}{V_t(h)} = r_t \, dt + \langle \pi_v(t), \sigma_t \lambda_t \, dt \rangle + \langle \pi_v(t), \sigma_t \, d\tilde{W}_t \rangle
\]
where $\langle \pi_v(t), \sigma_t \lambda_t \, dt \rangle$ represents the systematic component of returns of the portfolio, and
$\langle \pi_v(t), \sigma_t \, d\tilde{W}_t \rangle$ represents the idiosyncratic component of the portfolio.

Remark 1.5.1 To value the portfolio, one must therefore assume a model for the process $R^i_{t,t+1}$.

One way forward is to consider Equation (1.5.15) and follow the interest rate modelling described in Section (1.3.3)
by assuming that the equity return is a function of some latent factors or state variables.


1.5.3 Presenting the factor models

1.5.3.1 The presence of common factors

Following Ilmanen [2011], we state that the expected returns on a zero-coupon government bond are known, but
for all other assets the expected returns are uncertain ex-ante and unknowable ex-post, while the realised returns
are knowable ex-post but do not reveal what investors expected. As a result, apart from ZC-bonds, institutional asset
holders, such as pension funds, must infer expected returns from empirical data, past returns, and investor surveys, and
from statistical models, the aim being to forecast expected future returns in time and across assets in view of valuing
portfolios (see Section (1.5.2)) and performing the asset allocation process described in Section (2.1.2). Given constant
expected returns, the best way of estimating expected future returns is to calculate the sample average realised return
over a long sample period. However, when the risk premia λt are time-varying or stochastic, this average becomes
biased. It is understood that time-varying expected returns (risk-premia) can make historical average returns highly
misleading since they can reflect historical average unexpected returns. As a result, practitioners became interested in
ex-ante return measures such as valuation ratios or dividend discount models. Even though market’s expected returns
are unobservable, both academics and practitioners focused on forward-looking indicators such as simple value and
carry measures as proxies of long-run expected returns. The carry measures include any income return and other
returns that are earned when capital market conditions are unchanged. Among these measures, dividend yield was the
early leader, but broader payout yields including share buybacks have replaced it as the preferred carry measure, while
earnings yield and the Gordon model (DDM) equity premium became the preferred valuation measures (see Campbell
et al. [1988]). This is because they give better signals than extrapolation of past decade realised returns. For instance,
in order to estimate prospective returns, one can relate current market prices to some sort of value anchor (historical
average price, yield, spread etc.) assuming that prices will revert to fair values which are proxied by the anchors.
Note, value or carry indicators are inherently narrow and one should consider broader statistical forecasting models
such as momentum, volatility, macroeconomic environment etc. The presence of portfolio forecastability, even if
single assets are sometimes unforecastable, is one of the key features of cointegration. Two or more processes are said
to be cointegrated if there are long-term stable regression relationships between them, even if the processes themselves
are individually integrated. This means that there are linear combinations of the processes that are autocorrelated and
thus forecastable. Hence, the presence of cointegrating relationships is associated with common trends, or common
factors, so that cointegration and the presence of common factors are equivalent concepts. This equivalence can be
expressed more formally in terms of state-space models which are dynamic models representing a set of processes as
regressions over a set of possibly hidden factors or state variables. One approach is to express asset returns in terms
of factor models.
1.5.3.2 Defining factor models

The economic literature considered factor models of the risk premia in discrete time, and concentrated on the special
case of affine models. In the equity setting, given the one period percentage return $R_t = R_{t-1,t}$, the econometric
linear factor model in Equation (1.3.9) can be expressed as
\[
R_t = \alpha_t + \sum_{k=1}^{m} \beta_k F_{kt} + \epsilon_t \tag{1.5.19}
\]
where the terms $F_{kt}$ for k = 1, .., m represent returns of the risk factors associated with the market under consideration,
m is the number of factors explaining the stock returns, and αt is the drift of the idiosyncratic component. We can
then view the residuals as increments of a process that will be estimated
\[
X_t = X_0 + \sum_{s=1}^{t} \epsilon_{i,s}
\]
where the index i refers to the ith stock.
As a result, in continuous-time notation, the model for the evolution of stock prices becomes
\[
\frac{dS(t)}{S(t)} = \alpha_t \, dt + \sum_{k=1}^{m} \beta_k F_k(t) + dX(t) \tag{1.5.20}
\]
where the term $\sum_{k=1}^{m} \beta_k F_k(t)$ represents the systematic component of returns. The coefficients βk are the
corresponding loadings. The idiosyncratic component of the stock returns is given by
\[
d\tilde{X}(t) = \alpha_t \, dt + dX(t)
\]
where αt represents the drift of the idiosyncratic component, which implies that αt dt represents the excess rate of
return of the stock with respect to the stock market, or some industry sector, over a particular period of time. The term
dX(t) is assumed to be the increment of a stationary stochastic process which models price fluctuations corresponding
to over-reactions or other idiosyncratic fluctuations in the stock price which are not reflected in the industry sector.
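As an illustration of how the residual process can be built in practice, the sketch below fits the discrete-time model in Equation (1.5.19) with a single factor (m = 1) by ordinary least squares and cumulates the residuals. The use of least squares, the simulated data, the constant coefficients and the function names are all assumptions made for the example; the text does not prescribe an estimation method.

```python
import random


def ols_one_factor(returns, factor):
    """Least-squares estimates of alpha and beta in R_t = alpha + beta F_t + eps_t."""
    n = len(returns)
    mean_r = sum(returns) / n
    mean_f = sum(factor) / n
    cov_fr = sum((f - mean_f) * (r - mean_r) for f, r in zip(factor, returns))
    var_f = sum((f - mean_f) ** 2 for f in factor)
    beta = cov_fr / var_f
    alpha = mean_r - beta * mean_f
    return alpha, beta


def residual_process(returns, factor, alpha, beta, x0=0.0):
    """X_t = X_0 + cumulative sum of the residuals eps_s, s = 1..t."""
    path = [x0]
    for r, f in zip(returns, factor):
        path.append(path[-1] + (r - alpha - beta * f))
    return path


# Simulated daily data (assumed parameters, fixed seed for reproducibility).
rng = random.Random(0)
true_alpha, true_beta = 0.0002, 1.2
factor = [rng.gauss(0.0, 0.01) for _ in range(500)]
returns = [true_alpha + true_beta * f + rng.gauss(0.0, 0.005) for f in factor]

alpha_hat, beta_hat = ols_one_factor(returns, factor)
x_path = residual_process(returns, factor, alpha_hat, beta_hat)
print(alpha_hat, beta_hat, x_path[-1])
```

Note that in-sample least-squares residuals sum to zero when an intercept is included, so the estimated process returns to X0 at the end of the estimation window; it is the path in between that carries the information about idiosyncratic fluctuations.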
1.5.3.3 CAPM: a one factor model

The capital asset pricing model (CAPM) introduced by Sharpe [1964], which is an approach to understanding expected
asset returns or discount rates beyond the riskless rate, is a simplification of the factor models defined above, based on
some restrictive assumptions
• one-period world (constant investment opportunity set and constant risk premia over time)
• access to unlimited riskless borrowing and lending as well as tradable risky assets
• no taxes or transaction costs
• investors are rational mean variance optimisers (normally distributed asset returns, or quadratic utility function)
• investors have homogeneous expectations (all agree about asset means and covariances)
These assumptions ensure that every investor holds the same portfolio of risky assets, combining it with some amount
of the riskless asset (based on the investor’s risk aversion) in an optimal way. Even though these assumptions are too
restrictive, the main insight is that only systematic risk is priced in the sense that it influences expected asset returns.
In the CAPM, the ith asset's return is given by
\[
R_{it} = \alpha_i(t) + \beta_{iM}(t) R_{Mt} + \epsilon_{it}
\]
where $R_{Mt}$ is the market return, and the residual $\epsilon_{it}$ is normally distributed with zero mean and variance $\sigma_i^2(t)$. It
represents a straight line in the $(R_i, R_M)$ plane where αi is the intercept and $\beta_{iM} = \frac{Cov(R_i, R_M)}{\sigma_M^2}$ is the slope called the
beta value. According to the CAPM, in equilibrium, the ith asset's expected excess return $E_t[R_{i,(t+1)}]$ is a product of
its market beta $\beta_{iM}(t)$ and the market risk premium $\lambda_M(t)$, given by
\[
E_t[R_{i,(t+1)}] = \beta_{iM}(t) \lambda_M(t)
\]
Note, λM is the market risk premium common to all assets, but it also reflects the price of risk (investor risk aversion).
Defining the portfolio as $P(t) = \sum_{i=1}^{N} w_i R_{it}$ where wi is the ith weight of the portfolio, the portfolio beta is
$\beta_P(t) = \sum_{i=1}^{N} w_i \beta_{iM}(t)$, and the variance of the portfolio is given by
\[
Var(P(t)) = \beta_P^2(t) \sigma_M^2 + \sum_{i=1}^{N} w_i^2 \, Var(\epsilon_{it})
\]
Note, in the case where the weights are uniform $w_1 = w_2 = .. = w_N = \frac{1}{N}$, we get the limit
\[
\sum_{i=1}^{N} \left( \frac{1}{N} \right)^2 Var(\epsilon_{it}) \to 0 , \quad N \to \infty
\]

and the idiosyncratic risk vanishes. Realised returns reflect both common (systematic) risk factors and asset specific
(idiosyncratic) risk. While idiosyncratic risk can be diversified away as N increases, risk premia compensate investors
for systematic risk that can not be diversified away. However, the CAPM does not specify how large the market risk
premium should be. While the CAPM is a static model, a more realistic approach should reflect the view that market
risk aversion varies with recent market moves and economic conditions, and that the amount of risk in the market
varies with stock market volatility and asset correlations.
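The diversification argument can be made concrete with a short computation of the two terms of the portfolio variance. In the sketch below every asset is given the same beta and the same residual variance, and the residuals are taken to be uncorrelated across assets and with the market; these inputs are assumptions for illustration only.

```python
def variance_split(weights, betas, market_var, idio_vars):
    """Return (systematic, idiosyncratic) parts of the portfolio variance."""
    beta_p = sum(w * b for w, b in zip(weights, betas))
    systematic = beta_p ** 2 * market_var
    idiosyncratic = sum(w * w * v for w, v in zip(weights, idio_vars))
    return systematic, idiosyncratic


def equal_weight_split(n, beta=1.0, market_var=0.04, idio_var=0.09):
    """Variance split for N identical assets held with weights 1/N."""
    weights = [1.0 / n] * n
    return variance_split(weights, [beta] * n, market_var, [idio_var] * n)


splits = {n: equal_weight_split(n) for n in (1, 10, 100, 1000)}
for n, (systematic, idiosyncratic) in splits.items():
    print(n, systematic, idiosyncratic)
```

The systematic part stays at its market level whatever the number of assets, while the idiosyncratic part decays like 1/N, which is why only the former commands a risk premium.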
1.5.3.4 APT: a multi-factor model

Relaxing some of the simplifying assumptions of the CAPM, new approaches flourished to explain expected returns
such as multiple systematic factors, time-varying risk premia, skewness and liquidity preferences, market frictions,
investor irrationalities etc. The Arbitrage Pricing Theory (APT) developed by Ross [1976] is one of the theories that
relate stock returns to macroeconomic state variables. However, identifying risk factors in the multi-factor models in
Equation (1.5.20) that are effective at explaining realised return variations over time is an open problem, since theory
gives limited guidance. One can either consider theoretical factor models or empirical factor models where factors are
chosen to fit empirical asset return behaviour (see Ilmanen [2011]). Factors with a strong theoretical basis include
aggregate consumption growth, investment growth, as well as overall labour income growth and idiosyncratic job risk.
Equity factors that are primarily empirical include value, size, and often momentum. The list can be extended to
indicators like return reversals, volatility, liquidity, distress, earnings momentum, quality factors such as accruals, and
corporate actions such as asset growth and net issuance. Beyond equities, sensitivities to major asset classes are the
most obvious factors to consider. For instance, the inflation factor is especially important for bonds, and liquidities
for other assets. In the case where several common factors generate undiversifiable risk, then a multi-factor relation
holds. For instance, with K systematic factors, the ith asset's expected excess return reflects its factor sensitivities
$(\beta_{i1}, .., \beta_{iK})$ and the factor risk premia $(\lambda_1, .., \lambda_K)$
\[
E_t[R_{i,(t+1)}] = \beta_{i1}(t) \lambda_1(t) + ... + \beta_{iK}(t) \lambda_K(t)
\]
More generally, if we assume that the stochastic discount factor (SDF) is linearly related to a set of common risk
factors, then asset returns can be described by a linear factor model. Moreover, the idea that the risk premium depends
on covariation with the SDF (bad times, high MU periods) also applies to the risk factors, not just to individual assets.
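The link between a linear SDF and the multi-factor relation can be sketched in a few lines, using the covariance identity of Section (1.4.3). The derivation below assumes an SDF of the form ξ = a − b'F, with coefficients a and b known at time t and a nonsingular conditional covariance matrix of the factors; these assumptions are ours and serve only to illustrate the statement above.

```latex
\begin{align*}
\xi_{t+1} &= a_t - b_t^\top F_{t+1} \\
Cov_t(\xi_{t+1}, R^g_{i,t+1}) &= -\, b_t^\top Cov_t(F_{t+1}, R^g_{i,t+1}) \\
E_t[R^g_{i,t+1} - r_{ft}] &= -\, r_{ft} \, Cov_t(\xi_{t+1}, R^g_{i,t+1})
  = r_{ft} \, b_t^\top Cov_t(F_{t+1}, R^g_{i,t+1}) \\
  &= \beta_i(t)^\top \lambda(t) \\
\text{with} \quad \beta_i(t) &= Var_t(F_{t+1})^{-1} Cov_t(F_{t+1}, R^g_{i,t+1}) ,
  \quad \lambda(t) = r_{ft} \, Var_t(F_{t+1}) \, b_t
\end{align*}
```

The vector β collects the regression coefficients of the asset return on the factors, and λ depends only on the factors and the SDF loadings, so it is common to all assets.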

1.6 Introducing behavioural finance

We saw in Section (1.3) that understanding the risk premia in view of predicting asset returns was at the heart of
finance, and that several answers existed. One of them, called behavioural finance, implies that market prices do not
only reflect the rationally expected discounted cash flows from an asset (see Section (1.4.12)), but also incorporate
noise from irrational investors. In complete markets, rational investors make forecasts that correctly incorporate all
available information, leaving no room for systematic forecasting errors. Behavioural economics and behavioural
finance have challenged this paradigm (see Barberis et al. [2002]). As fully rational behaviour requires quite complex calculations, an alternative idea developed suggesting that investors exhibit bounded rationality, rationality that
is limited by their cognitive resources and observational powers. Psychological biases predict specific systematic deviations from rationality causing mispricings. The main biases are heuristic simplifications such as rules of thumb,
mental shortcuts, attention and memory biases, representativeness, conservatism, and/or self-deception such as overconfidence, overoptimism, biased self-attribution, confirmation bias, hindsight bias. The best known market-level
mispricings are speculative bubbles, and the best known relative mispricings are value and momentum effects. Even
though it is argued that rational traders will view any such mispricings as trading opportunities, which when realised,


will undo any price impact irrational traders might have, it did not happen in practice since these strategies are risky
and costly (see Shleifer et al. [1997]). In fact, any observed predictability pattern can be interpreted to reflect either
irrational mispricings or rational time-varying risk premia (rational learning about structural changes). It is likely that
both irrational and rational forces drive asset prices and expected returns. That is, available information is not fully
reflected in current market prices, leading to incomplete market situations. Hence, by challenging the complete market
paradigm, behavioural finance attempted to propose an alternative theory, which is to be compared with the incomplete
market theory (IMT).

1.6.1 The Von Neumann and Morgenstern model

The first important use of the EUT was that of Von Neumann and Morgenstern (VNM) [1944] who used the assumption of expected utility maximisation in their formulation of game theory. When comparing objects one needs to rank
utilities but also compare the sizes of utilities. VNM method of comparison involves considering probabilities. If a
person can choose between various randomised events (lotteries), then it is possible to additively compare for example
a shirt and a sandwich. It is possible to compare a sandwich with probability 1, to a shirt with probability p or nothing
with probability 1 − p. By adjusting p, the point at which the sandwich becomes preferable defines the ratio of the
utilities of the two options. If options A and B have probability p and 1 − p in the lottery, we can write the lottery L
as a linear combination
L = pA + (1 − p)B
and for a lottery with n possible options, we get
\[
L = \sum_{i=1}^{n} p_i A_i
\]
with $\sum_{i=1}^{n} p_i = 1$. VNM showed that, under some assumptions, if an agent can choose between the lotteries, then this
agent has a utility function which can be added and multiplied by real numbers, which means the utility of an arbitrary
lottery can be calculated as a linear combination of the utility of its parts. This is called the expected utility theorem.
The required assumptions are made of four axioms about the properties of the agent's preference relation over simple
lotteries, which are lotteries with just two options. Writing $B \preceq A$ for A is weakly preferred to B, the axioms are
1. Completeness: for any two simple lotteries L and M, either $L \preceq M$ or $M \preceq L$ (or both).
2. Transitivity: for any three lotteries L, M, N, if $L \preceq M$ and $M \preceq N$, then $L \preceq N$.
3. Convexity (also known as continuity): if $L \preceq M \preceq N$, then there is p ∈ [0, 1] such that the lottery pL + (1 − p)N
is equally preferable to M.
4. Independence: for any three lotteries L, M, N and any p ∈ (0, 1], $L \preceq M$ if and only if
$pL + (1 - p)N \preceq pM + (1 - p)N$.
A VNM utility function is a function from choices to the real numbers u : X → R which assigns a real number
to every outcome in a way that captures the agent’s preferences over simple lotteries. Under the four assumptions
mentioned above, the agent will prefer a lottery L2 to a lottery L1 , if and only if the expected utility of L2 is greater
than the expected utility of L1
\[
L_1 \preceq L_2 \quad \text{if and only if} \quad u(L_1) \le u(L_2)
\]
More formally, we assume a finite number of states of the world such that the jth state happens with the probability
pj , and we let the consumption C be a random variable taking the values cj for j = 1, .., k (see Appendix (D.2) for
details). To get a Von-Neumann Morgenstern (VNM) utility there must exist $v : \mathbb{R}^+ \to \mathbb{R}$ such that
\[
u(P) = \int_0^{\infty} v(x) \, dP(x)
\]
In the special case where $P_C$ is a discrete sum, the VNM utility simplifies to
\[
u(P) = \sum_{j=1}^{k} p_j v(c_j)
\]
Hence, the criterion becomes that of maximising the expected value of the utility of consumption where u(P ) =
E[v(C)], with
\[
E[v(C)] = \langle v(C) \rangle = \sum_{j=1}^{k} p_j v(c_j)
\]
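The linearity of the VNM utility in the probabilities can be illustrated with a small computation. In the sketch below a lottery is a list of (probability, consumption) pairs, and v is taken to be the square root function; that choice is an assumption made only because it is concave and easy to check by hand, and it ignores the boundedness discussed for the utility of money.

```python
import math


def expected_utility(lottery, v):
    """u(P) = sum_j p_j v(c_j) for a list of (probability, consumption) pairs."""
    total = sum(p for p, _ in lottery)
    if abs(total - 1.0) > 1e-12:
        raise ValueError("probabilities must sum to one")
    return sum(p * v(c) for p, c in lottery)


def mix(p, lottery_a, lottery_b):
    """Compound lottery pA + (1 - p)B reduced to a simple lottery."""
    return ([(p * q, c) for q, c in lottery_a]
            + [((1.0 - p) * q, c) for q, c in lottery_b])


v = math.sqrt                        # assumed utility of consumption
lottery_a = [(0.5, 100.0), (0.5, 0.0)]
lottery_b = [(1.0, 36.0)]

u_a = expected_utility(lottery_a, v)
u_b = expected_utility(lottery_b, v)
u_mix = expected_utility(mix(0.25, lottery_a, lottery_b), v)
print(u_a, u_b, u_mix)
```

The utility of the mixture equals 0.25 u(A) + 0.75 u(B), which is the content of the expected utility theorem in this discrete setting.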

A variety of generalized expected utility theories have arisen, most of which drop or relax the independence axiom.
One of the most common uses of a utility function, especially in economics, is the utility of money. The utility function
for money is a nonlinear function that is bounded and asymmetric about the origin. The utility function is concave
in the positive region, reflecting the phenomenon of diminishing marginal utility. The boundedness reflects the fact
that beyond a certain point money ceases being useful at all, as the size of any economy at any point in time is itself
bounded. The asymmetry about the origin reflects the fact that gaining and losing money can have radically different
implications both for individuals and businesses. The nonlinearity of the utility function for money has profound
implications in decision making processes: in situations where outcomes of choices influence utility through gains or
losses of money, which are the norm in most business settings, the optimal choice for a given decision depends on the
possible outcomes of all other decisions in the same time-period.

1.6.2 Preferences

In traditional decision theory, people form beliefs, or make judgments about some probabilities by assigning values, or
utilities, to outcomes (see Arrow [1953] [1971]). Attitudes toward risk (risk aversion) determine what choices people
make among the various opportunities that exist, given their beliefs. We saw in Section (1.6.1) that people calculate the
expected utility of each gamble or lottery as the probability weighted sum of utility outcomes, then choose the gamble
with the highest expected utility (see Von Neumann et al. [1944] and Friedman et al. [1948]). More formally, one can
view decision making under risk as a choice between prospects or gambles (lotteries). A prospect (x1 , p1 ; ...; xn , pn ) is
a contract yielding outcome xi with probability pi where $\sum_{i=1}^{n} p_i = 1$. To simplify notation, omitting null outcomes,
we use (x, p) to denote the prospect (x, p; 0, (1 − p)) that yields x with probability p and 0 with probability (1 − p).
It is equivalent to a simple lottery. The riskless prospect that yields x with certainty is denoted (x). The application of
expected utility theory is based on
1. Expectation: u(x1 , p1 ; ...; xn , pn ) =< u(X) >= p1 u(x1 ) + ... + pn u(xn ) where u(.) is the utility function of
a prospect. The utilities of outcomes are weighted by their probabilities.
2. Asset integration: (x1 , p1 ; ...; xn , pn ) is acceptable at asset position w if and only if u(w + x1 , p1 ; ...; w +
xn , pn ) ≥ u(w). A prospect is acceptable if the utility resulting from integrating the prospect with one’s assets
exceeds the utility of those assets alone. The domain of the utility function is final states rather than gains or
losses.
3. Risk aversion: u is concave, that is, $u'' < 0$.
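The three tenets above can be put together in a few lines. The sketch below evaluates prospects written as lists of (outcome, probability) pairs at a given asset position; the logarithmic utility, the wealth levels and the prospects are assumptions chosen for illustration.

```python
import math


def prospect_utility(prospect, u, wealth=0.0):
    """Expectation with asset integration: sum_i p_i u(w + x_i)."""
    return sum(p * u(wealth + x) for x, p in prospect)


def is_acceptable(prospect, u, wealth):
    """A prospect is acceptable at wealth w if it does not lower expected utility."""
    return prospect_utility(prospect, u, wealth) >= u(wealth)


u = math.log                                   # assumed concave utility
fair = [(100.0, 0.5), (-100.0, 0.5)]           # zero expected gain
favourable = [(150.0, 0.5), (-100.0, 0.5)]     # positive expected gain

print(is_acceptable(fair, u, 1000.0))          # rejected by a risk averse agent
print(is_acceptable(favourable, u, 1000.0))
print(is_acceptable(favourable, u, 250.0))
```

With a concave utility the fair gamble is always rejected, and the same favourable gamble can be accepted at a high asset position and rejected at a low one, since the utility is defined over final states rather than over gains and losses.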
However, experimental work found substantial violations of this rational model (see Allais [1953]), leading to the
development of various alternative theories, such as the prospect theory (PT). Kahneman et al. [1979] described
several classes of choice problems in which preferences systematically violate the axioms of expected utility theory,


and developed an intentionally positive, or descriptive, model of preferences that would best characterise the deviations
of actual choices from the normative expected utility model. Their experimental studies revealed that people
• overweight outcomes that are considered certain, relative to outcomes which are merely probable (certainty
effect).
• care more about gains and losses (changes in wealth) than about overall wealth
• exhibit loss aversion and can be risk seeking when facing the possibility of loss (reflection effect).
• overweight low probability events.
The reflection effect implies that risk aversion in the positive domain is accompanied by risk seeking in the negative
domain. Further, outcomes which are obtained with certainty are overweighted relative to uncertain outcomes. In the
positive domain, the certainty effect contributes to a risk averse preference for a sure gain over a larger gain that is
merely probable. In the negative domain, the same effect leads to a risk seeking preference for a loss that is merely
probable over a smaller loss that is certain. The following example (Problem 11, 12) illustrates these findings.
1. suppose that you are paid $1000 to participate in a gamble that presents you with the following further choices
(a) a sure $500 gain
(b) a 50% chance of winning $1000, a 50% chance of $0
2. suppose that you are paid $2000 to participate in a gamble that presents you with the following further choices
(a) a sure $500 loss
(b) a 50% chance of losing $1000, a 50% chance of losing nothing
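The final wealth positions implied by the two problems can be tabulated with a short sketch. It assumes that option (2b) is a 50% chance of losing $1000, which is what the identity of total wealth outcomes requires. The function name `final_wealth` and the representation of a prospect as a list of (change, probability) pairs are illustrative choices, not notation from the text.

```python
from fractions import Fraction

def final_wealth(bonus, prospect):
    """Map a prospect on changes in wealth to a distribution over final wealth.

    prospect is a list of (change, probability) pairs (illustrative format).
    """
    dist = {}
    for change, prob in prospect:
        dist[bonus + change] = dist.get(bonus + change, 0) + prob
    return dist

half = Fraction(1, 2)

# Problem 1: bonus of 1000, then a choice between (a) and (b)
p1a = final_wealth(1000, [(500, 1)])
p1b = final_wealth(1000, [(1000, half), (0, half)])

# Problem 2: bonus of 2000, then a choice between (a) and (b)
p2a = final_wealth(2000, [(-500, 1)])
p2b = final_wealth(2000, [(-1000, half), (0, half)])

print(p1a == p2a)  # both put probability 1 on a final wealth of 1500
print(p1b == p2b)  # both put probability 1/2 on 1000 and 1/2 on 2000
```

Since the distributions over final wealth coincide, a decision maker whose utility is defined on final states would have to choose the same option in both problems.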
The typical strategies chosen are (1a) (certainty) and (2b) (risk seeking), even though the total wealth outcomes in
(1a) and (2a) are identical, and those in (1b) and (2b) are likewise identical. People act as if they are risk averse when
only gains are involved but become risk seeking when facing the possibility of loss, and they view the initial bonus
and the gamble separately (isolation effect). The experimental studies showed that the intuitive notion of risk is not
adequately captured by the assumed concavity of the utility function for wealth. Prospect theory (PT) distinguishes
two phases in the choice process, an early phase of editing and a subsequent phase of evaluation. The major operations
of the editing phase are coding, combination, segregation, and cancellation. The edited prospects are then evaluated
and the prospect of highest value is chosen. As the editing operations facilitate the task of decision, they can explain
many anomalies of preference. For example, the inconsistencies associated with the isolation effect result from the
cancellation of common components. In PT, people maximise a weighted sum of values (utilities) where the weights
are not the probabilities themselves (as rationality would require) but a nonlinear transformation of them.
• whether an outcome is seen as a gain or a loss (relative to some neutral reference point) determines the value
(utility) of a dollar. Its value depends on the context (coding). The reference point can be affected by the
formulation of the offered prospects, and by the expectations of the decision maker.
• the value function is kinked at the origin (reference point) where the steeper slope below zero implies loss
aversion. Many studies show that losses hurt twice to two and a half times as much as same-sized gains satisfy.
In general, the value function has concave (convex) shape to the right (left) of the reference point, implying risk
aversion among gambles involving only gains but risk seeking among gambles involving only losses.
• overweighting low-probability events is the main feature of the probability weighting function, explaining the
simultaneous demand for both lotteries and insurance. Such overweighting of small probabilities can be strong
enough to reverse the sign of risk appetite in the value function. Lotteries offering a small chance of very large
gains can induce risk seeking, despite a general tendency to risk aversion when gambles only involve gains.

To summarise, the main idea was to assume that values are attached to changes rather than to final states, and that
decision weights do not coincide with stated probabilities, leading to inconsistencies, intransitivities, and violations
of dominance. More formally, the overall value of an edited prospect, denoted V, is expressed in terms of two scales
π and v, where the former associates with each probability p a decision weight π(p) reflecting the impact of p on the
overall value of the prospect, and the latter assigns to each outcome x a number v(x) reflecting the subjective value
of that outcome. Note, v measures the value of deviations from the reference point (gains and losses). Kahneman et
al. [1979] considered simple prospects (x, p; y, q) with at most two non-zero outcomes, where one receives x with
probability p, y with probability q, and nothing with probability 1 − p − q, with p + q ≤ 1. An offered prospect is
strictly positive if its outcomes are all positive, that is, if x, y > 0 and p + q = 1, and it is strictly negative if its outcomes are
all negative. A prospect is regular if it is neither strictly positive nor strictly negative (either p + q < 1, or x ≥ 0 ≥ y,
or x ≤ 0 ≤ y). If (x, p; y, q) is a regular prospect, then the value is given by
V (x, p; y, q) = π(p)v(x) + π(q)v(y)
where v(0) = 0, π(0) = 0, and π(1) = 1. The two scales coincide for sure prospects, where V (x, 1) = V (x) = v(x).
In that setting, the expectation principle (1) of the expected utility theory is relaxed since π is not a probability
measure. The evaluation of strictly positive and strictly negative prospects is slightly modified. If p + q = 1, and either
x > y > 0, or x < y < 0, then
V (x, p; y, q) = v(y) + π(p)(v(x) − v(y))
which reduces to the previous expression only if π(p) + π(1 − p) = 1, a condition that is not generally satisfied. Markowitz [1952b] is at the origin of the idea that utility be defined on gains and losses
rather than on final asset positions. He proposed a utility function which has convex and concave regions in both
the positive and the negative domains. However, in PT, the carriers of value are changes in wealth or welfare, rather
than final states, which is consistent with basic principles of perception and judgment. Value should be treated as a
function in two arguments, the asset position that serves as reference point, and the magnitude of the change (positive
or negative) from that reference point. Kahneman et al. [1979] assumed that the value function for changes of wealth
is normally concave above the reference point (v''(x) < 0, for x > 0) and often convex below it (v''(x) > 0, for
x < 0). Put another way, PT postulates the leaning S-shape of the value function. However, the actual scaling is
considerably more complicated than in utility theory, because of the introduction of decision weights. They measure
the impact of events on the desirability of prospects, and should not be interpreted as measures of degree of belief.
Note, π(p) = p only if the expectation principle holds, but not otherwise. In general, π is an increasing function of p,
with π(0) = 0 and π(1) = 1. For small values of p, π is a subadditive function of p, that is, π(rp) > rπ(p)
for 0 < r < 1, and very low probabilities are generally overweighted, that is, π(p) > p for small p. In general,
for 0 < p < 1 we get π(p) + π(1 − p) < 1, a property called subcertainty. Further, for a fixed ratio of probabilities, the ratio of the
corresponding decision weights is closer to unity when the probabilities are low than when they are high (subproportionality). This holds
if and only if log π is a convex function of log p. The slope of π in the interval (0, 1) can be viewed as a measure of
the sensitivity of preferences to changes in probability. These properties entail that π is relatively shallow in the open
interval and changes abruptly near the end-points where π(0) = 0 and π(1) = 1. Because people are limited in their
ability to comprehend and evaluate extreme probabilities, highly unlikely events are either ignored or overweighted,
and the difference between high probability and certainty is either neglected or exaggerated. Consequently, π is not
well-behaved near the end-points (sharp drops, discontinuities).
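The valuation rules above can be sketched numerically. The text does not specify functional forms for v and π, so the sketch below assumes a power value function with a loss-aversion coefficient and an inverse-S weighting function, with parameter values of the kind commonly quoted in the later cumulative prospect theory literature. All names (`value`, `decision_weight`, `prospect_value`) and all parameter values are assumptions made for illustration. Note, the assumed π is smooth, so it does not reproduce the ill-behaved end-points described above.

```python
import math

ALPHA = 0.88   # curvature of the value function (assumed)
LAMBDA = 2.25  # loss-aversion coefficient (assumed)
GAMMA = 0.61   # curvature of the weighting function (assumed)

def value(x):
    """Subjective value v(x) of a gain or loss x relative to the reference point."""
    if x >= 0:
        return x ** ALPHA
    return -LAMBDA * (-x) ** ALPHA

def decision_weight(p):
    """Decision weight pi(p) attached to the stated probability p."""
    num = p ** GAMMA
    return num / (num + (1.0 - p) ** GAMMA) ** (1.0 / GAMMA)

def prospect_value(x, p, y=0.0, q=0.0):
    """Overall value V(x, p; y, q) of a simple prospect."""
    same_sign = (x > 0 and y > 0) or (x < 0 and y < 0)
    if same_sign and math.isclose(p + q, 1.0):
        # Strictly positive or strictly negative prospect:
        # y must be the outcome closer to the reference point.
        if abs(x) < abs(y):
            x, p, y, q = y, q, x, p
        return value(y) + decision_weight(p) * (value(x) - value(y))
    # Regular prospect
    return decision_weight(p) * value(x) + decision_weight(q) * value(y)

# Sure gain of 3000 against (4000, 0.8), and the mirrored choice over losses
print(prospect_value(3000, 1.0) > prospect_value(4000, 0.8))
print(prospect_value(-4000, 0.8) > prospect_value(-3000, 1.0))
```

Under these assumed parameters the sure gain is valued above the risky gain while the risky loss is valued above the sure loss, which is the pattern of risk aversion over gains and risk seeking over losses described in the text.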

1.6.3 Discussion

Note, most applications of PT have been concerned with monetary outcomes. We saw that one consequence
of the reflection effect is that risk aversion in the positive domain is accompanied by risk seeking in the negative
domain. That is, the preference between negative prospects is the mirror image of the preference between positive
prospects. Denoting > the prevalent preference, in Problem 3 we get (3000) > (4000, 0.8) with 80% against 20%,
and in Problem 3' we get (−4000, 0.8) > (−3000) with 92% against 8%. The majority of subjects were willing to
accept a risk of 0.8 to lose 4000, in preference to a sure loss of 3000, although the gamble has a lower expected value.

These Problems demonstrate that outcomes which are obtained with certainty are overweighted relative to uncertain
outcomes. The same psychological principle, the overweighting of certainty, favours risk aversion in the domain
of gains and risk seeking in the domain of losses. Referencing Markowitz [1959] and Tobin [1958], the authors
examine aversion for uncertainty or variability as a candidate explanation of the certainty effect within the rational theory.
That is, people would prefer prospects with high expected value and small variance (high Sharpe ratio). For instance, (3000)
is chosen over (4000, 0.8) despite its lower expected value because it has no variance. They argued that the difference
in variance between (3000, 0.25) and (4000, 0.20) may be insufficient to overcome the difference in expected value.
They further noted that since (−3000) has both higher expected value and lower variance than (−4000, .80), the
sure loss should then be preferred, contrary to the data. They concluded that their data were incompatible with the notion
that certainty is generally desirable. Rather, certainty increases the aversiveness of losses as well as the desirability
of gains. In all these Problems, there is no notion of volatility so that one can not define preferences in terms of
measures such as the Sharpe ratio, except in the case of sure outcomes. In Problem 3, the prospect (3000) has an
infinite Sharpe ratio (MSR = ∞), while in Problem 3' the prospect (−3000) has a negative infinite Sharpe ratio
(MSR = −∞), contradicting the conclusion that (−3000) should be preferred over (−4000, .80).
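The expected values and dispersions used in this mean-variance argument can be checked directly. The sketch below treats a prospect (x, p) as paying x with probability p and nothing otherwise; the function name `moments` is an illustrative choice.

```python
import math

def moments(x, p):
    """Mean and standard deviation of the prospect (x, p): x with probability p, else 0."""
    mean = p * x
    var = p * (1.0 - p) * x ** 2
    return mean, math.sqrt(var)

prospects = [(3000, 1.0), (4000, 0.8), (3000, 0.25), (4000, 0.20),
             (-3000, 1.0), (-4000, 0.8)]
for x, p in prospects:
    mean, std = moments(x, p)
    print(f"({x}, {p}): mean = {mean:.0f}, std = {std:.0f}")
```

The sure prospects have zero dispersion, (4000, 0.8) has a higher mean than (3000) but a large dispersion, and (−3000) dominates (−4000, 0.8) in both mean and dispersion, which is why a mean-variance account predicts a preference for the sure loss.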
Other research proposed an alternative foundation based on salience while using standard risk preferences (see
Bordalo et al. [2010]). Decision makers overweight the likelihood of salient states (where lotteries have extreme,
contrasting payoffs), explaining both the reflecting shape of the value function and the overweighting of low probability events. Note, one feature of PT preferences is that people derive decision utility from the gains and losses of
a single trade. To obtain testable predictions, PT must be combined with other assumptions such as narrow framing
or the house money effect. The former stipulates that ignoring the rest of wealth implies narrow framing (analysing
problems in a too isolated framework). For instance, focusing too much on asset specific risks (volatility, default) and
ignoring correlation effects. One important aspect of framing is the selection of a reference point as the benchmark for
comparisons (doing nothing, one’s current asset position). The latter stipulates that gamblers tend to become less loss
averse and more willing to take risks when they are ahead (playing with house money). Risk preferences in a sequence
of gambles depend on how prior gains and losses influence loss aversion over time (more aggressive risk taking following successful trading, and cautiousness following losses). Further, people dislike vague uncertainty (ambiguity) more
than objective uncertainty. While risky choices always involve the possibility of adverse outcomes, some outcomes
are more likely to trigger regret than others. Hence, we might want to minimise regret (when hanging on to losers)
which is also a motivation for diversification. Finally, moods and feelings may be the most plausible explanations
for empirical observations that average stock returns tend to be higher or lower near major events (holidays, weather,
lunar cycles).

1.6.4 Some criticisms

Kahneman et al. [1979] considered a very simplistic approach that can be interpreted (in the positive domain) as
a special case of the model of Von Neumann Morgenstern with a single agent, one period of time, a single good of
consumption and a finite number of states of the world, described in Appendix (D.2). However, in their settings,
there is no notion of good of consumption, and as a result, no objective of maximising wealth over time via optimal
consumption. The idea is to compute the expected value of the utility function rather than to maximise that
expected value over consumption. Consequently, their class of choice problems is to be related to the St Petersburg
paradox described in Section (1.2.1). Hence, all the criticisms of Bernoulli’s expected utility theory raised in
Section (1.2.3) apply. That is, it is meaningless to assign a probability to a single event, as the event has to be embedded
within other similar events. The main difference with the EUT is that the decision weights π(p) do not coincide
with stated probabilities, relaxing the expectation principle of the EUT, since π is no longer a probability measure.
Recognising that EUT can represent risk preferences, but is unable to recommend appropriate levels of risk (see
Section (1.2.4)), PT, via the modification of the historical probability, is an artifact to include the missing notion
of risk premia in the EUT. It is a non-mathematical approach where the choices of the decision weights have to be
related to the minimisation over martingale measures introduced in the utility indifference pricing theory discussed in
Section (1.2.4). Obviously the previous example (Problem 11, 12) is a free lunch and one can not use the results from

53

Quantitative Analytics

this experimental work to elaborate a pricing theory. However, the main idea of behavioural finance is to recognise
that rationality requires complex calculations, and that when facing a situation where the decision process must be
instantaneous, investors exhibit bounded rationality. Put another way, investors can not by themselves value the fair
price of uncertain future outcomes, and require shortcuts and heuristic simplifications with an early phase of editing
and a subsequent phase of evaluation.

1.7 Predictability of financial markets

1.7.1 The martingale theory of asset prices

As told by Mandelbrot [1982], the question of the predictability of financial markets is an old one, as financial newspapers have always presented analysis of charts claiming that they could predict the future from the geometry of those
charts. However, as early as 1900, Bachelier [1900] asserted that successive price changes were statistically independent, implying that charting was useless. Weakening that statement, he stated that every price follows a martingale
stochastic process, leading to the concept of perfect market (for the definition of independent processes and martingales see Appendix (B.3)). That is, everything in its past has been discounted fully.
Bachelier introduced an even weaker statement, the notion of efficient market where imperfections remain only as long as they are smaller than transaction costs. A more specific assertion by Bachelier is
that any competitive price follows, in the first approximation, a one-dimensional Brownian motion. Since the thesis of
Bachelier, the option pricing theory (see Sections (1.2.4) and (1.3.2)) developed around the martingale theory of dynamic price processes, stating that discounted traded assets are martingales under the appropriate probability measure.
In a martingale, the expected value of a process at future dates is its current value. If after the appropriate discounting
(by taking into account the time value of money and risk), all price processes behave as martingales, the best forecast
of future prices is present prices. That is, prices might have different distributions, but the conditional mean, after
appropriate discounting, is the present price. Hence, under the appropriate probability measure, discounted prices are
not exponentially diverging. In principle, it is satisfied by any price process that precludes arbitrage opportunities.
Delbaen et al. [1994] proved that an arbitrage opportunity exists if a price process, P , is not a semimartingale. Hence,
an important question in the quantitative analysis of financial data is to check whether an observed process is
a semimartingale. However, the discount factors taking risk into account, also called kernels, are in general stochastic
making the theory difficult to validate.
Estimating the Hurst exponent (see Hurst [1951]) for a data set provides a measure of whether the data is a
pure white noise random process or has underlying trends (see details in Section (10.1)). For instance, processes
that we might assume are purely white noise sometimes turn out to exhibit Hurst exponent statistics for long memory
processes. In practice, asset prices have dependence (autocorrelation) where the change at time t has some dependence
on the change at time t−1, so that the Brownian motion is a poor representation of financial data. Actual stock returns,
especially daily returns, do not have a normal distribution as the curve of the distribution exhibits fatter tails (called the
stylised facts of asset distribution). One approach to validate the theory is to introduce a test about the Hurst coefficient
H of a fractional Brownian motion (fBm). An fBm is an example of a stochastic process that is not a semimartingale
except in the case of a Hurst coefficient H equal to 1/2. Hence, a financial market model with a price process, P , that is
assumed to be an fBm with H ≠ 1/2 implies an arbitrage opportunity. Rogers [1997] provided a direct construction of
a trading strategy producing arbitrage in this situation.
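To make the idea of testing H concrete, the following is a minimal sketch of one simple estimator, based on the scaling of the standard deviation of increments with the lag, std(x[t+lag] − x[t]) ∼ lag^H. It is not the rescaled range method of Hurst [1951] and is not claimed to be the estimator detailed in Section (10.1). The function name `hurst_exponent`, the choice of lags, the seed and the sample size are all assumptions made for illustration.

```python
import math
import random

def hurst_exponent(series, lags=(1, 2, 4, 8, 16, 32)):
    """Estimate H from the scaling std(x[t+lag] - x[t]) ~ lag**H."""
    log_lag, log_std = [], []
    for lag in lags:
        diffs = [series[i + lag] - series[i] for i in range(len(series) - lag)]
        mean = sum(diffs) / len(diffs)
        var = sum((d - mean) ** 2 for d in diffs) / len(diffs)
        log_lag.append(math.log(lag))
        log_std.append(0.5 * math.log(var))
    # Ordinary least-squares slope of log_std on log_lag
    n = len(lags)
    mx, my = sum(log_lag) / n, sum(log_std) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(log_lag, log_std))
    sxx = sum((a - mx) ** 2 for a in log_lag)
    return sxy / sxx

# Simulated random walk with independent Gaussian increments (assumed test data)
random.seed(0)
walk = [0.0]
for _ in range(20000):
    walk.append(walk[-1] + random.gauss(0.0, 1.0))

print(round(hurst_exponent(walk), 2))  # expected to be close to 0.5 for a random walk
```

An estimate materially away from 1/2 on observed prices would point to persistence (H > 1/2) or anti-persistence (H < 1/2), subject to the sampling error of the estimator.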
Many applied areas of financial economics such as option pricing theory (see Black et al. [1973]) and portfolio
theory (see Markowitz [1952] and [1959]) followed Bachelier’s assumption of normally distributed returns. The
justification for this assumption is provided by the central limit theorem, stating that if price changes at the smallest
unit of time are independently and identically distributed (i.i.d.) random numbers, returns over longer intervals can be
seen as the sum of a large number of such i.i.d. observations, and, irrespective of the distribution of their summands,
should under some weak additional assumptions converge to the normal distribution. While this seemed plausible
and the resulting Gaussian distribution would also come very handy for many applied purposes, Mandelbrot [1963]
was the first to demonstrate that empirical data are distinctly non-Gaussian, exhibiting excess kurtosis and higher
probability mass in the center and in their tails than the normal distribution. Given a sufficiently long record of stock
market, foreign exchange or other financial data, the Gaussian distribution can always be rejected with statistical
significance beyond all usual boundaries, and the observed largest historical price changes would be so unlikely under
the normal law that one would have to wait for horizons beyond at least the history of stock markets to observe them
occur with non-negligible probability.
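Excess kurtosis and tail frequency are easy to measure on a sample. The sketch below compares a Gaussian sample with a toy mixture of two normal regimes, used here only as a stand-in for fat-tailed returns; the mixture weights and volatilities, the seed, and the function names `excess_kurtosis` and `tail_share` are assumptions, not estimates from market data.

```python
import random

def excess_kurtosis(xs):
    """Sample kurtosis minus 3, so that a normal distribution scores about 0."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3.0

def tail_share(xs, k=3.0):
    """Fraction of observations more than k sample standard deviations from the mean."""
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return sum(1 for x in xs if abs(x - mean) > k * std) / n

random.seed(1)
gaussian = [random.gauss(0.0, 1.0) for _ in range(50000)]
# Assumed toy model: 90% calm observations (sigma = 1), 10% turbulent ones (sigma = 4)
mixture = [random.gauss(0.0, 4.0 if random.random() < 0.1 else 1.0)
           for _ in range(50000)]

for name, sample in [("gaussian", gaussian), ("mixture", mixture)]:
    print(name, round(excess_kurtosis(sample), 2), round(tail_share(sample), 4))
```

The mixture has the same symmetric shape as the Gaussian but shows large positive excess kurtosis and far more observations beyond three standard deviations, which is the signature reported for empirical returns.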

1.7.2 The efficient market hypothesis

From a macroeconomic perspective, it is often assumed that the economy is the superposition of different economic
cycles ranging from short to long periods. They were first studied by Clément Juglar in the 19th century to
prevent France from being hit by repetitive crises. Since then, many economists have been interested in these phenomena,
such as Mitchell [1927], Kondratiev [1925] or Schumpeter [1927], to name but a few. Nowadays, many new
classical economists such as Kydland and Prescott (Nobel Prize in 2004) and Sargent (Nobel Prize in
2011) are still working on this area. As a consequence, it is widely believed that equity prices react sensitively to the
developments of these macroeconomic fundamentals. That is, the changes in the current fundamental value of the firm
will depend upon the present value of the future earnings, explaining the behaviour of stock markets in the long-run.
However, some economists such as Fama [1965a] [1970] demonstrated that stock prices were extremely difficult to
predict in the short run, with new information quickly incorporated into prices. Even though Bachelier is at the origin
of using statistical methods to analyse returns, his work was largely ignored and forgotten until the late 1940s. The
basis of the efficient market hypothesis (EMH) was later collected in a book by Cootner [1964], which presents the
rationale for what was to be formalised as the EMH by Fama in the 1960s. Originally, during the 1920s through the
1940s, on one hand the Fundamentalists assumed investors to be rational, in order for value to reassert itself, and on
the other hand the Technicians assumed markets were driven by emotions. In the 1950s the Quants made an appeal for
widespread use of statistical analysis (see Roberts [1964]). At the same time, Osborn [1964] formalised the claim that
stock prices follow a random walk. Similarly, Samuelson [1965] postulated that properly anticipated prices fluctuate
randomly. Later, Malkiel [1973] stated that the past movement or direction of the price of a stock or overall market
could not be used to predict its future movements, leading to the Random Walk Theory (RWT). Note, in his conclusion
(Assumption 7), Osborn states that since price changes are independent (random walk), we expect the distribution of
changes to be normal, with a stable mean and finite variance. This is a result of the Central Limit Theorem stating
that the sum of a large number of i.i.d. random variables with finite variance is approximately normally distributed. This postulate relies on
the fact that capital markets are large systems with a large number of degrees of freedom (investors or agents), so that
current prices must reflect the information everyone already has. Hence, investors value stocks based on their expected
value (expected return), which is the probability weighted average of possible returns. It is assumed that investors set
their subjective probabilities in a rational and unbiased manner. Consequently, if we can not beat the market, the best
investment strategy we can apply is buy-and-hold where an investor buys stocks and holds them for a long period of
time, regardless of market fluctuations.
Fama [1970] formalised the concept of EMH by presenting three basic models, which state that the market follows a
martingale model, a random walk model, or a fair game model. The main implications of the EMH are
• Homogeneity of investors based on their rationality: if all investors are rational and have the access to the same
information, they necessarily arrive at the same expectations and are therefore homogeneous.
• Normal distribution of returns: random walk can be represented by AR(1) process in the form of P_t = P_{t−1} + ε_t
implying that r_t = P_t − P_{t−1} = ε_t where ε_t ∼ N(0, σ) is an independent normally distributed variable.
• Standard deviation as a measure of volatility and thus risk : since returns are normally distributed, it implies that
standard deviation is stable and finite and thus is a good measure of volatility.
• Tradeoff between risk and return: since standard deviation is stable and finite, there is a relationship between
risk and return based on non-satiation and risk-aversion of the investors.
• Unpredictability of future returns: since returns follow random walk, the known information is already incorporated in the prices and thus their prediction is not possible.
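The random walk representation listed above can be simulated to see what unpredictability means in terms of sample statistics. The sketch below is a minimal illustration; the starting price, the innovation scale, the seed and the helper name `autocorrelation` are assumptions.

```python
import random

def autocorrelation(xs, lag=1):
    """Sample autocorrelation of xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return cov / var

random.seed(2)
sigma = 1.0        # assumed scale of the innovations
prices = [100.0]   # assumed starting price
for _ in range(10000):
    prices.append(prices[-1] + random.gauss(0.0, sigma))

returns = [b - a for a, b in zip(prices, prices[1:])]
print(round(autocorrelation(returns), 3))  # near 0: past changes do not help forecast the next one
print(round(autocorrelation(prices), 3))   # near 1: the price level itself is highly persistent
```

A lag-one autocorrelation of price changes that is statistically indistinguishable from zero is what the random walk model predicts, whereas the price level remains strongly autocorrelated.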
While the EMH does not require independence through time or accept only i.i.d. observations, the random walk does.
That is, if returns are random, the markets are efficient, but the converse may not be true. Over time, a semistrong
version of the EMH was accepted by the investment community which states that markets are efficient because prices
reflect all public information. In a weak form efficient market, the price changes are independent and may be a
random walk. It can be restated in terms of information sets such that, in the weak form, only historical prices
of stocks are available for current price formation, while the semi-strong form broadens the information set by all
publicly available information, and the strong form includes insider information in the information set. A market is
then said to be weakly efficient when investors can not reach above-average risk-adjusted returns based on historical
prices, and similarly for the other forms. While the efficient market hypothesis (EMH) implies that all available
information is reflected in current market prices, making future returns unpredictable, this assumption has
been rejected both by practitioners and academics. Testing the EMH requires understanding the restrictions it imposes
on probabilistic models of returns. In theory, the EMH is embodied in the martingale theory of dynamic price processes
(see Section (1.7.1)) which can hardly, if at all, be tested in practice since the full reflection of information in prices
is hard to define². LeRoy [1976] criticised all presented definitions of EMH and argued that they were tautologies
and therefore impossible to test. Further, some authors raised the joint-hypothesis problem, which states that even
when a potential inefficiency of the market is uncovered, it can be due to a wrongly chosen asset-pricing model. It
is therefore impossible to reject EMH in general, and one must state under which conditions to test EMH (see Lo
[2008]). For instance, we can either follow Fama [1965a] (Fa65) and assume that the market is weakly efficient if
it follows a random walk, or we can follow Samuelson [1965] (Sa65) and assume that the market is efficient if it
follows a martingale process. Note, the assumption of martingale process is more general than the random walk one
and allows for dependence in the process.

1.7.3 Some major criticisms

For the technical community, this idea of purely random movements of prices was totally rejected. A large number
of studies showed that stock prices are too volatile to be explained entirely by fundamentals (see Shiller [1981] and
LeRoy et al. [1981]). Even Fama [1965] found that returns were negatively skewed, the tails were fatter, and the peak
around the mean was higher than predicted by the normal distribution (leptokurtosis). This was also noted by Sharpe
[1970] on annual returns. Similarly, using daily S&P returns from 1928 till 1990, Turner et al. [1990] found the
distributions to be negatively skewed and having leptokurtosis. While any frequency distribution including October
1987 will be negatively skewed with a fat negative tail, earlier studies showed the same phenomenon. Considering
quarterly S&P 500 returns from 1946 till 1999, Friedman et al. [1989] noted that in addition to being leptokurtotic,
large movements have more often been crashes than rallies, and significant leptokurtosis appears regardless of the
period chosen. Analysing financial futures prices of Treasury Bond, Treasury Note, and Eurodollar contracts, Sterge
[1989] found that very large (three or more standard deviations from the norm) price changes could be expected to
occur two or three times as often as predicted by normality. Evidence suggests that in the short-run equity prices
deviate from their fundamental values, and are also driven by non-fundamental forces (see Chung et al. [1998] and
Lee [1998]). This is due to noise and can be driven by irrational expectations such as irrational waves of optimism or
pessimism, feedback trading, or other inefficiencies. The fact that stock market returns are not normally distributed
weakens statistical analysis such as correlation coefficients and t-statistics as well as the concept of random walk.
Nonetheless, over a longer period of time, the deviations from the fundamentals diminish, and stock prices can be
compatible with economic theories such as the Present Value Model.

² The problem was mentioned by Fama [1970].

Mandelbrot [1964] postulated that capital market returns follow the family of Stable Paretian distributions with
high peaks at the mean, and fat tails. These distributions are characterised by a tendency to have trends and cycles as
well as abrupt and discontinuous changes, and can be adjusted for skewness. However, if financial returns fall into the
Stable Paretian family of distributions, variance can be infinite (or undefined), being stable and
finite only for the normal distribution. Hence, given that market returns are not normally distributed, volatility was found
to be disturbingly unstable. For instance, Turner et al. [1990] found that monthly and quarterly volatility were higher
than they should be compared to annual volatility, but daily volatility was lower. Engle [1982] proposed to model
volatility as conditional upon its previous level, that is, high volatility levels are followed by more high volatility, while
low volatility is followed by more low volatility. This is consistent with the observation by Mandelbrot [1964] that
the size of price changes (ignoring the sign) seems to be correlated. Engle [1982] and LeBaron [1990], among others,
found supportive evidence of the autoregressive conditional heteroskedastic (ARCH) model family, such that standard
deviation is not a standard measure (at least over the short term).
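Volatility clustering of this kind can be reproduced with a minimal ARCH(1) simulation, in which the conditional variance is σ_t² = ω + α r_{t−1}² and r_t = σ_t z_t with z_t standard normal. The parameter values, the seed and the function names below are assumptions chosen for illustration, not estimates from data.

```python
import math
import random

def autocorrelation(xs, lag=1):
    """Sample autocorrelation of xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return cov / var

def simulate_arch1(n, omega=1.0, alpha=0.3, seed=3):
    """Simulate n returns with conditional variance omega + alpha * r_prev**2."""
    rng = random.Random(seed)
    r_prev = 0.0
    out = []
    for _ in range(n):
        sigma2 = omega + alpha * r_prev ** 2
        r_prev = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        out.append(r_prev)
    return out

returns = simulate_arch1(20000)
squared = [r * r for r in returns]
print(round(autocorrelation(returns), 3))  # near 0: the sign of the next move is not predictable
print(round(autocorrelation(squared), 3))  # clearly positive: the size of moves is correlated
```

The returns themselves are serially uncorrelated while their squares are not, which matches the observation that the size of price changes, ignoring the sign, seems to be correlated.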
Defining rationality as the ability to value securities on the basis of all available information, and to price them
accordingly, we see that the EMH is heavily dependent on rational investors. Even though investors are risk-averse,
Kahneman et al. [1979] and Tversky [1990] suggested that when losses are involved, people tend to be risk-seeking.
That is, they are more likely to gamble if gambling can minimise their loss (see Section (1.6)). In addition, in practice,
markets are not complete, and investors are not always logical. Shiller [1981] showed that investors could be irrational
and that assets from stocks to housing could develop into bubbles. He concluded that rational models of the stock
market, in which stock prices reflect rational expectations of future payouts, were in error. He argued that a clever
combination of logic, statistics, and data implied that stock markets were, instead, prone to irrational exuberance.
Following from these studies and from the paper by DeBondt et al. [1985], behavioural finance developed, in which
people do not recognise, or react to, trends until they are well established. This behaviour is quite different from
that of the rational investor, who would immediately adjust to new information, and it leads to market inefficiencies as
not all information has yet been reflected in prices. As explained by Peters [1991-96], if the reaction to information
occurs in clumps, and investors ignore information until trends are well in place, and then react in a cumulative fashion
to all the information previously ignored, then people react to information in a nonlinear way. This sequence implies
that the present is influenced by the past, which is a clear violation of the EMH.
This old debate was partly put to an end by The Royal Swedish Academy of Sciences which awarded the 2013 Nobel
prize in economics to both Fama and Shiller. Nonetheless, there is a consensus in empirical finance around the
idea that financial assets may exhibit trends or cycles, resulting in persistent inefficiencies in the market that can be
exploited (see Keim et al. [1986], Fama et al. [1989]). For instance, Kandel et al. [1996] showed that even a low
level of statistical predictability can generate economic significance with abnormal returns attained even if the market
is successfully timed only one out of a hundred times. One argument put forward is that risk premiums are time varying
and depend on business cycle so that returns are related to some slow moving economic variables exhibiting cyclical
patterns in accordance with business cycles (see Cochrane [2001]). Another argument states that some agents are not
fully rational (for theory of behavioural finance see Barberis et al. [1998], Barberis et al. [2002]), leading prices to
underreact in the short run and to overreact at longer horizons.

1.7.4 Contrarian and momentum strategies

Since the seminal article of Fama [1970], a large number of articles have provided substantial evidence that
stock returns and portfolio returns can be predicted from historical data. For instance, Campbell et al. [1997]
showed that while the returns of individual stocks do not seem to be autocorrelated, portfolio returns are significantly
autocorrelated. The presence of significant cross-autocorrelations leads to more evident predictability at the portfolio
level than at the level of individual stocks. The profits generated by trading strategies based on momentum and reversal
effects are further evidence of cross-autocorrelations. Lo et al. [1990b] and Lewellen [2002] showed that momentum
and reversal effects are not due to significant positive or negative autocorrelation of individual stocks but to cross-autocorrelation effects and other cross-sectional phenomena. This empirical evidence challenged the paradigm of the
weak form efficient market hypothesis (EMH), calling into question the well accepted capital asset pricing model
(CAPM).
The two most popular strategies that emerged from the literature are the contrarian and the momentum strategies.
A contrarian strategy takes advantage of the negative autocorrelation of asset returns and is constructed by taking a
long position in stocks performing badly in the past and shorting stocks performing well in the past. In contrast, a
momentum strategy is based on short selling past losers and buying past winners. Empirical evidence suggested that
these two strategies co-exist and that their profitability is international (see Griffin et al. [2003], Chui et al.
[2005]). However, although there is sufficient supportive evidence for both strategies, the source and interpretation
of the profits are a subject of much debate. Three alternative explanations for such an outcome were proposed:
1. the size effect: with the losers tending to be those stocks with small market value and overreaction being most
significant for small firms.
2. time-varying risk: the coefficients of the risk premia of the losers are larger than those of the winners in the
period after the formation of the portfolios.
3. market microstructure related effect: part of the return reversal is due to bid-ask biases, illiquidity, etc.

Although contrarian strategies (buying past losers and selling past winners) have received a lot of attention in the
early literature on market efficiency, recent literature focused on relative strength strategies that buy past winners and
sell past losers. It seems that the authors favoring contrarian strategies focus on trading strategies based on either
very short-term return reversals (1 week or 1 month) or very long-term return reversals (3 to 5 years), while evidence
suggests that practitioners using relative strength rules base their selections on price movements over the past 3 to 12
months. As individuals tend to overreact to information, De Bondt et al. [1985] [1987] assumed that stock prices
also overreact to information, and showed that over 3 to 5-year holding periods stocks performing poorly over the
previous 3 to 5 years achieved higher returns than stocks performing well over the same period. When ranked by the
previous cumulated returns in the past 3 to 5 years, the losers outperformed the previous winners by nearly 25% in the
subsequent 3 to 5 years. Jegadeesh [1990] and Lehmann [1990] provided evidence of shorter-term return reversal.
That is, contrarian strategies selecting stocks based on their returns in the previous week or month generated significant
abnormal returns. In the case of momentum strategies, Levy [1967] claimed that a trading rule that buys stocks with
current prices being substantially higher than their average prices over the past 27 weeks realised significant abnormal
returns. Jegadeesh et al. [1993] provided analysis of relative strength trading strategies over 3 to 12-month horizons on
NYSE and AMEX stocks, and showed significant profits in the 1965 to 1989 sample period for each of the relative
strength strategies examined. For example, a strategy selecting stocks based on their past 6-month returns and holding
them for 6 months realised a compounded excess return of 12.01% per year on average. Additional evidence indicates
that the profitability of the relative strength strategies is not due to their systematic risk. These results also indicated
that the relative strength profits can not be attributed to lead-lag effects resulting from delayed stock price reactions to
common factors. However, the evidence is consistent with delayed price reactions to firm specific information. That is,
part of the abnormal returns generated in the first year after portfolio formation dissipates in the following two years.
Using out-of-sample tests, Jegadeesh et al. [2001] found that momentum profits continued after 1990, indicating that
their original findings were not due to a data snooping bias. They suggested that the robustness of momentum returns
could be driven by investors’ cognitive biases or underreaction to information, such as earnings announcements.
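The ranking rule behind the contrarian and relative strength strategies described above can be sketched in a few lines. This is a minimal illustration only, not the exact procedure of Jegadeesh et al. [1993]: the function names, the equal weighting within each leg, the `fraction` parameter and the toy return histories are all assumptions made for the example.

```python
def cumulative_return(returns):
    """Compound a list of simple periodic returns into one cumulative return."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0


def relative_strength_weights(history, lookback=6, fraction=0.1, contrarian=False):
    """Equal-weighted long-short weights from returns over the last `lookback` periods.

    history:    dict mapping ticker -> list of monthly simple returns (oldest first).
    fraction:   share of the universe put in each leg (assumed <= 0.5, universe >= 2).
    contrarian: if True, buy past losers and short past winners instead.
    """
    past = {t: cumulative_return(r[-lookback:]) for t, r in history.items()}
    ranked = sorted(past, key=past.get)            # worst past performer first
    n_leg = max(1, int(len(ranked) * fraction))
    losers, winners = ranked[:n_leg], ranked[-n_leg:]
    sign = -1.0 if contrarian else 1.0             # contrarian flips both legs
    weights = {t: 0.0 for t in history}
    for t in winners:
        weights[t] = sign / n_leg
    for t in losers:
        weights[t] = -sign / n_leg
    return weights


# Toy universe (hypothetical tickers and returns, for illustration only).
toy_history = {
    "A": [0.02] * 6,
    "B": [-0.01] * 6,
    "C": [0.00] * 6,
    "D": [0.01] * 6,
}
momentum = relative_strength_weights(toy_history, lookback=6, fraction=0.25)
reversal = relative_strength_weights(toy_history, lookback=6, fraction=0.25, contrarian=True)
```

In this toy universe the momentum portfolio is long A and short B, and the contrarian portfolio holds the opposite positions; holding the weights fixed for K periods would give the J-month/K-month structure discussed in the text.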
Time series momentum or trend is an asset pricing anomaly with effect persisting for about a year and then partially
reversing over longer horizons. Hurst et al. [2010] noted that the main driver of many managed futures strategies
pursued by CTAs is trend-following or momentum investing. That is, buying assets whose price is rising and selling
assets whose price is falling. Rather than focus on the relative returns of securities in the cross-section, time series
momentum focuses purely on a security’s own past return. These findings are robust across a number of subsamples,
look-back periods, and holding periods (see Moskowitz et al. [2012]). They argued that time series momentum
directly matches the predictions of many prominent behavioral and rational asset pricing theories. They found that
the correlations of time series momentum strategies across asset classes are larger than the correlations of the asset
classes themselves, suggesting a stronger common component to time series momentum across different assets than
is present among the assets themselves. They decomposed the returns to a time series and cross-sectional momentum
strategy to identify the properties of returns that contribute to these patterns and found that positive auto-covariance
in futures contracts’ returns drives most of the time series and cross-sectional momentum effects. One explanation
is that speculators trade in the same direction as a return shock and reduce their positions as the shock dissipates,
whereas hedgers take the opposite side of these trades. In general, spot price changes are mostly driven by information
shocks, and they are associated with long-term reversals, consistent with the idea that investors may be over-reacting
to information in the spot market. This finding of time series momentum in virtually every instrument challenges the
random walk hypothesis for stock prices. Further, they showed that a diversified portfolio of time series momentum
across all assets is remarkably stable and robust, yielding a Sharpe ratio greater than one on an annual basis, or roughly
2.5 times the Sharpe ratio for the equity market portfolio, with little correlation to passive benchmarks in each asset
class or a host of standard asset pricing factors. Lastly, the abnormal returns to time series momentum also do
not appear to be compensation for crash risk or tail events. Rather, the return to time series momentum tends to
be largest when the stock market’s returns are most extreme, performing best when the market experiences large up
and down moves. Note, the studies of autocorrelation examine, by definition, return predictability where the length
of the look-back period is the same as the holding period over which returns are predicted. This restriction masks
significant predictability that is uncovered once look-back periods are allowed to differ from predicted or holding
periods. Note, return continuation can be detected implicitly from variance ratios. Also, a significant component
of the higher frequency findings in equities is contaminated by market microstructure effects such as stale prices.
Focusing on liquid futures instead of individual stocks and looking at lower frequency data mitigates many of these
issues (see Ahn et al. [2003]).
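To make the contrast with cross-sectional momentum concrete, the sketch below computes a time series momentum signal from each asset's own past return only. It is a simplified illustration: the function names, the equal weighting across assets and the sample numbers are assumptions, and the position sizing used in the cited studies is not reproduced here.

```python
def tsmom_signal(returns, lookback=12):
    """Return +1, -1 or 0 according to the sign of the asset's own past return.

    returns: list of simple periodic returns for one asset (oldest first).
    """
    cumulative = 1.0
    for r in returns[-lookback:]:
        cumulative *= 1.0 + r
    past = cumulative - 1.0
    return (past > 0) - (past < 0)


def tsmom_portfolio_return(history, next_returns, lookback=12):
    """Equal-weighted average of signal times next-period return across assets.

    history:      dict asset -> list of past returns (oldest first).
    next_returns: dict asset -> realised return over the holding period.
    """
    signals = {a: tsmom_signal(r, lookback) for a, r in history.items()}
    return sum(signals[a] * next_returns[a] for a in history) / len(history)


# Hypothetical two-asset example: one rising market, one falling market.
futures_history = {"bond": [0.01] * 12, "oil": [-0.02] * 12}
realised = {"bond": 0.01, "oil": -0.03}
portfolio_return = tsmom_portfolio_return(futures_history, realised)
```

Unlike the relative strength ranking, nothing forces the portfolio to be market neutral: if every asset has a positive past return, every position is long.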
Recently, Baltas et al. [2012a] extended existing studies of futures time-series momentum strategies in three dimensions (time-series, cross-section and trading frequency) and documented strong return continuation patterns across
different portfolio rebalancing frequencies with the Sharpe ratio of the momentum portfolios exceeding 1.20. These
strategies are typically applied to exchange traded futures contracts which are considered relatively liquid compared to
cash equity or bond markets. However, capacity constraints have limited these funds in the past. The larger they get,
the more difficult it is to maintain the diversity of their trading books. Baltas et al. rigorously establish a link between
CTAs and momentum strategies by showing that time-series momentum strategies have high explanatory power in the
time series of CTA returns. They do not find evidence of capacity constraints when looking at momentum strategies in
commodities markets only, suggesting that the futures markets are relatively deep and liquid enough to accommodate
the trading activity of the CTA industry.

1.7.5

Beyond the EMH

In financial markets, volatility is a measure of price fluctuations of risky assets over time which can not be directly
observed and must be estimated via appropriate measures. These measures of volatility show volatility clustering,
asymmetry and mean reversion, comovements of volatilities across assets and financial markets, stronger correlation
of volatility compared to that of raw returns, (semi-) heavy-tails of the distribution of returns, anomalous scaling
behaviour, changes in shape of the return distribution over time horizons, leverage effects, asymmetric lead-lag correlation of volatilities, strong seasonality, and some dependence of scaling exponents on market structure. Mandelbrot
[1963] showed that the standard Geometric Brownian motion (gBm) proposed by Bachelier was unable to reproduce
these stylised facts. In particular, the fat tails and the strong correlation observed in volatility are in sharp contrast
to the mild, uncorrelated fluctuations implied by models with Brownian random terms. He presented an alternative
description of asset prices constructed on the basis of a scaling assumption. From simple observation, a continuous
process can not account for a phenomenon characterised by very sharp discontinuities such as asset prices. When
P(t) is a price at time t, then log(P(t)) has the property that its increment over an arbitrary time lag d, that is
∆(d) = log(P(t + d)) − log(P(t)), has a distribution independent of d, except for a scale factor. Hence, in a competitive market no time lag is more special than any other. Under this assumption, typical statistics to summarise data
such as sample average to measure location and sample root mean square to measure dispersion have poor descriptive
properties. This led Mandelbrot to assume that the increment ∆(d) has infinite variance and to conclude that price
changes are ruled by a Levy stable distribution. It was motivated by the fact that in a generalised version of the central
limit law dispensing with the assumption of a finite second moment, sums of i.i.d. random variables converge to these
more general distributions (where the normal law is a special case of the Levy stable law obtained in the borderline
case of a finite second moment). Therefore, the desirable stability property indicates the choice of the Levy stable law
which has a shape that, in the standard case of infinite variance, is characterised by fat tails. It is interesting to note
that Fama [1963] discussed the Levy stable law applied to market returns and Fama et al. [1971] proposed statistical
techniques for estimating the parameters of the Levy distributions. While the investment community accepted variance
and standard deviation as the measures of risk, the early founders of the capital market theory (Samuelson, Sharpe,
Fama, and others) were well aware of these assumptions and their limitations as they all published work modifying the
MPT for non-normal distributions. Through the 1960s and 1970s, empirical evidence continued to accumulate proving
the non-normality of market returns (see Section (1.7.3)). Sharpe [1970] and Fama et al. [1972] published books including sections on needed modifications to standard portfolio theory accounting for the Stable Paretian Hypothesis of
Mandelbrot [1964]. As the weak-form EMH became widely accepted more complex applications developed such as
the option pricing of Black et al. [1973] and the Arbitrage Pricing Theory (APT) of Ross [1976]. The APT postulates
that price changes come from unexpected changes in factors, allowing the structure to handle nonlinear relationships.
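The scaling assumption behind the Levy stable law discussed above can be written out explicitly. The block below is a sketch for the symmetric case only, assuming independent and stationary increments; the scale parameter $\sigma$ and the constant $C$ are notation introduced here and do not appear in the original text.

```latex
% Self-similarity of the log-price increment over a lag d
\Delta(d) \overset{d}{=} d^{1/\alpha}\,\Delta(1), \qquad 0 < \alpha \le 2 .

% Characteristic function of a symmetric alpha-stable increment
\mathbb{E}\!\left[ e^{\,i u \Delta(d)} \right]
  = \exp\!\left( - d\,\sigma^{\alpha} |u|^{\alpha} \right).

% alpha = 2 recovers the Gaussian case with variance 2 sigma^2 d,
% while for alpha < 2 the variance is infinite and the tails decay as a power law
\mathbb{P}\big( |\Delta(d)| > x \big) \sim C\, x^{-\alpha}
  \quad \text{as } x \to \infty .
```

Replacing $u$ by $u\,d^{1/\alpha}$ in the characteristic function of $\Delta(1)$ gives back that of $\Delta(d)$, which is the statement that the distribution is independent of $d$ up to a scale factor.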
In the statistical extreme value theory the extremes and the tail regions of a sample of i.i.d. random variables
converge in distribution to one of only three types of limiting laws (see Reiss et al. [1997])
1. exponential decay
2. power-law decay
3. the behaviour of distributions with finite endpoint of their support
Fat tails are often used as a synonym for power-law tails, so that the highest realisations of returns would obey a law
like P(x_t < x) ∼ 1 − x^(−α) after appropriate normalisation with the transformation x_t = a r_t + b. Hence, the universe of
fat-tailed distributions can be indexed by their tail index α with α ∈ (0, ∞). Levy stable distributions are characterised
by tail indices α < 2 (2 characterising the case of the normal distribution). All other distributions with a tail index
smaller than 2 converge under summation to the Levy stable law with the same index, while all distributions with an
asymptotic tail behaviour with α > 2 converge under aggregation to the Gaussian law. Various authors such as Jansen
et al. [1991] and Lux [1996] used semi-parametric methods of inference to estimate the tail index without assuming
a particular shape of the entire distribution. The outcome of these studies on daily records is a tail index α in the range
of 3 to 4 counting as a stylised fact. Using intra-daily data records Dacorogna et al. [2001] confirmed the previous
results on daily data giving more weight to the stability of the tail behaviour under time aggregation as predicted by
extreme-value theory. As a result, it was then assumed that the unconditional distribution of returns converged toward
the Gaussian distribution, but was distinctly different from it at the daily (and higher) frequencies. Hence, the non-normal shape of the distribution motivated the quest for the best non-stable characterisation at intermediate levels of
aggregation. A large literature developed on mixtures of normal distributions (see Kon [1984]) as well as on a broad
range of generalised distributions (see Fergussen et al. [2006]) leading to the distribution of daily returns close to a
Student-t distribution with three degrees of freedom. Note, even though a tail index between 3 and 4 was typically
found for stock and foreign exchange markets, some other markets were found to have fatter tails (see Koedijk et al.
[1992]).
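As an illustration of how a tail index can be estimated without assuming a shape for the whole distribution, the sketch below uses the Hill estimator, a common semi-parametric choice. The text does not state which estimator the cited studies used, so this is an assumption; the number of order statistics `k` and the synthetic Pareto sample are also choices made for the example.

```python
import math
import random


def hill_tail_index(samples, k):
    """Hill estimate of the tail index alpha from the k largest positive observations."""
    ordered = sorted((x for x in samples if x > 0), reverse=True)
    if not 0 < k < len(ordered):
        raise ValueError("k must satisfy 0 < k < number of positive samples")
    threshold = math.log(ordered[k])               # log of the (k+1)-th largest value
    mean_log_excess = sum(math.log(x) - threshold for x in ordered[:k]) / k
    return 1.0 / mean_log_excess


# Synthetic data with a known tail index of 3 (Pareto draws, seeded for reproducibility).
rng = random.Random(0)
pareto_draws = [rng.paretovariate(3.0) for _ in range(20000)]
alpha_hat = hill_tail_index(pareto_draws, k=2000)
```

On real returns one would apply the estimator to the absolute values (or to each tail separately), and the result is known to be sensitive to the choice of `k`, so it is usually inspected over a range of values.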
Even though the limiting laws of extreme value theory apply to samples of i.i.d. random variables it may still
be valid for certain deviations from i.i.d. behaviour, but dependency in the time series of returns will dramatically
slow down convergence leading to a long regime of pre-asymptotic behaviour. While this dependency of long lasting
autocorrelation is subject to debate for raw (signed) returns, it is plainly visible in absolute returns, squared returns, or
any other measure of the extent of fluctuations (volatility). For instance, Ausloos et al. [1999] identified such effects
on raw returns. With sufficiently long time series, significant autocorrelation can be found for time lags (of daily data)
up to a few years. This positive feedback effect is called volatility clustering: turbulent (tranquil) periods are more
likely to be followed by further turbulent (tranquil) periods than vice versa. Lo [1991] proposed a rigorous statistical test for
long-term dependence, with mixed success in finding deviations from the null hypothesis of short memory for
raw asset returns, but strongly significant evidence of long memory in squared or absolute returns. In general, short
memory comes along with exponential decay of the autocorrelation, while one speaks of long memory if the decay
follows a power-law. Evidence of the latter type of behaviour both on rate of returns and volatility accumulated over
time. Lobato et al. [1998] claimed that such long-range memory in volatility measures was a universal stylised fact
of financial markets. Note, this long-range memory effect applies differently on the foreign exchange markets and the
stock markets (see Genacy et al. [2001a]). Further, LeBaron [1992] showed that, due to the leverage effect, stock
markets exhibited correlation between volatility and raw returns.
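The difference between raw and absolute returns described above is easy to reproduce on synthetic data. The sketch below compares the lag-1 sample autocorrelation of a simulated return series with that of its absolute values. The GARCH(1,1) recursion is used purely as a convenient toy generator of volatility clustering; it is an assumption of this example, not a model proposed in the text, and it produces short memory rather than the power-law decay discussed here. The parameter values are arbitrary.

```python
import math
import random


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at a given positive lag."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    covariance_sum = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance_sum / variance_sum


def simulate_garch(n, omega=1e-6, alpha=0.10, beta=0.85, seed=0):
    """Simulate n returns from a GARCH(1,1) recursion with Gaussian shocks."""
    rng = random.Random(seed)
    variance = omega / (1.0 - alpha - beta)        # start at the unconditional variance
    returns = []
    for _ in range(n):
        r = math.sqrt(variance) * rng.gauss(0.0, 1.0)
        returns.append(r)
        variance = omega + alpha * r * r + beta * variance
    return returns


simulated = simulate_garch(20000)
acf_raw = autocorrelation(simulated, lag=1)
acf_abs = autocorrelation([abs(r) for r in simulated], lag=1)
```

The raw returns are close to uncorrelated while the absolute returns are clearly positively autocorrelated, which is the qualitative pattern reported for financial data.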
The hyperbolic decay of the unconditional pdf together with the hyperbolic decay of the autocorrelations of many
measures of volatility (squared, absolute returns) fall into the category of scaling laws in the natural sciences. The
identification of such universal scaling laws in financial markets sparked the interest of natural scientists to further explore the behaviour of financial data and to develop models explaining these characteristics. From this line of research,
multifractality, multi-scaling or anomalous scaling emerged gradually over the 90s as a more subtle characteristic of
financial data, motivating the adaptation of known generating mechanisms for multifractal processes from the natural
sciences in empirical finance. The background of these models is the theory of multifractal measures originally developed by Mandelbrot [1974] in order to model turbulent flows. The formal analysis of such measures and processes,
called multifractal formalism, was developed by Frisch et al. [1985], Mandelbrot [1989], and Evertz et al. [1992],
among others. Mandelbrot et al. [1997] introduced the concept of multifractality in finance by adapting an earlier
asset pricing framework of Mandelbrot [1974]. Subsequent literature moved from the more combinatorial style of
the multifractal model of asset returns (MMAR) to iterative, causal models of similar design principles such as the
Markov-switching multifractal (MSM) model proposed by Calvet et al. [2004] and the multifractal random walk
(MRW) by Bacry et al. [2001] constituting the second generation of multifractal models.
Mantegna et al. [2000] and Bouchaud et al. [2000] considered econophysics to study the herd behaviour of
financial markets via return fluctuations, leading to a better understanding of the scaling properties based on methods
and approaches in scientific fields. To measure the multifractals of dynamical dissipative systems, the generalised
dimension and the spectrum have effectively been used to calculate the trajectory of chaotic attractors that may be
classified by the type and number of the unstable periodic orbits. Even though a time series can be tested for correlation
in many different ways (see Taqqu et al. [1995]), some attempts at computing these statistical quantities emerged
from the box-counting method, while others extended the R/S analysis. Using detrended fluctuation analysis (DFA)
or detrended moving average (DMA) to analyse asset returns on different markets, various authors observed that
the Hurst exponent would change over time indicating a multifractal process (see Costa et al. [2003], Kim et al.
[2004]). Then methods for the multifractal characterisation of nonstationary time series were developed based on the
generalisation of DFA, such as the MFDFA by Kantelhardt et al. [2002]. Consequently, the multifractal properties
as a measure of efficiency (or inefficiency) of financial markets were extensively studied in stock market indices,
foreign exchange, commodities, traded volume and interest rates (see Matia et al. [2003], Ho et al. [2004], Moyano
et al. [2006], Zunino et al. [2008], Stosic et al. [2014]). It was also shown that observables in the dynamics
of financial markets have a richer multifractality for emerging markets than for mature ones. As a rule, the presence of
multifractality signals time series exhibiting a complex behaviour with long-range time correlations manifested on
different intrinsic time scales. Considering an artificial multifractal process and daily records of the S&P 500 index
gathered over a period of 50 years, and using multifractal detrended fluctuation analysis (MFDFA) and multifractal
diffusion entropy analysis (MFDEA), Jizba et al. [2012] showed that the latter possess highly nonlinear, long-range interactions, which are the manifestation of a number of interlocked driving dynamics operating at different time
scales each with its own scaling function. Such a behaviour typically points to the presence of recurrent economic
cycles, crises, large fluctuations (spikes or sudden jumps), and other non-linear phenomena that are out of reach of
more conventional multivariate methods (see Mantegna et al. [2000]).
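The detrended fluctuation analysis mentioned above can be sketched as follows. This is the plain (monofractal) DFA with linear detrending in non-overlapping windows, written as a minimal illustration; the multifractal generalisation (MFDFA) repeats the same steps with q-th order moments of the window fluctuations. The function name, the choice of scales and the white-noise test series are assumptions made for the example.

```python
import math
import random


def dfa_exponent(series, scales):
    """Estimate the DFA scaling exponent of a series (linear detrending).

    The exponent is the slope of log F(s) against log s, where F(s) is the
    root-mean-square deviation of the integrated series from its local linear
    trend in non-overlapping windows of length s.
    """
    n = len(series)
    mean = sum(series) / n
    profile, running = [], 0.0
    for x in series:                               # integrate the demeaned series
        running += x - mean
        profile.append(running)

    log_s, log_f = [], []
    for s in scales:
        t_bar = (s - 1) / 2.0
        t_var = sum((t - t_bar) ** 2 for t in range(s))
        squared_sum, count = 0.0, 0
        for start in range(0, n - s + 1, s):
            window = profile[start:start + s]
            y_bar = sum(window) / s
            slope = sum((t - t_bar) * (y - y_bar) for t, y in enumerate(window)) / t_var
            squared_sum += sum(
                (y - y_bar - slope * (t - t_bar)) ** 2 for t, y in enumerate(window)
            )
            count += s
        log_s.append(math.log(s))
        log_f.append(0.5 * math.log(squared_sum / count))

    # Ordinary least-squares slope of log F(s) on log s.
    ms, mf = sum(log_s) / len(log_s), sum(log_f) / len(log_f)
    return sum((a - ms) * (b - mf) for a, b in zip(log_s, log_f)) / sum(
        (a - ms) ** 2 for a in log_s
    )


rng = random.Random(1)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(8192)]
h_noise = dfa_exponent(white_noise, scales=[16, 32, 64, 128, 256])
```

For uncorrelated noise the estimate should be close to 0.5; values above 0.5 indicate persistence and values below 0.5 anti-persistence. An exponent that drifts when the calculation is repeated on rolling windows is the kind of evidence the cited authors interpret as multifractality.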

1.7.6

Risk premia and excess returns

1.7.6.1

Risk premia in option prices

The Black-Scholes model [1973] for pricing European options assumes a continuous-time economy where trading can
take place continuously with no differences between lending and borrowing rates, no taxes and no short-sale constraints.
Investors require no compensation for taking risk, and can construct a self-financing riskless hedge which must be
continuously adjusted as the asset price changes over time. In that model, the volatility is a parameter quantifying
the risk associated to the returns of the underlying asset, and it is the only unknown variable. However, since the
market crash of October 1987, options with different strikes and expirations exhibit different Black-Scholes implied
volatilities (IV). The out-of-the-money (OTM) put prices have been viewed as an insurance product against substantial
downward movements of the stock price and have been overpriced relative to OTM calls that will pay off only if the
market rises substantially. As a result, the implicit distribution inferred from option prices is substantially negatively
skewed compared to the lognormal distribution inferred from the Black-Scholes model. That is, given the Black-Scholes assumptions of lognormally distributed returns, the market assumes a higher return than the risk-free rate in
the tails of the distributions.
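The notion of a Black-Scholes implied volatility used above can be made concrete with a short sketch: price a European option with the Black-Scholes formula, then invert the formula numerically for the volatility that matches a given price. The example assumes no dividends and uses a simple bisection; the function names, bounds and sample inputs are assumptions made for illustration.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_price(spot, strike, rate, vol, maturity, is_call=True):
    """Black-Scholes price of a European option on a non-dividend-paying asset."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * maturity) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    discount = math.exp(-rate * maturity)
    if is_call:
        return spot * norm_cdf(d1) - strike * discount * norm_cdf(d2)
    return strike * discount * norm_cdf(-d2) - spot * norm_cdf(-d1)


def implied_vol(price, spot, strike, rate, maturity, is_call=True,
                low=1e-6, high=5.0, tol=1e-10):
    """Volatility that reproduces `price`, found by bisection.

    Relies on the option price being increasing in volatility and assumes the
    target price lies between the prices at `low` and `high`.
    """
    for _ in range(200):
        mid = 0.5 * (low + high)
        if bs_price(spot, strike, rate, mid, maturity, is_call) > price:
            high = mid
        else:
            low = mid
        if high - low < tol:
            break
    return 0.5 * (low + high)


# Round trip on an out-of-the-money put (hypothetical inputs).
otm_put_price = bs_price(100.0, 90.0, 0.01, 0.25, 0.5, is_call=False)
recovered_vol = implied_vol(otm_put_price, 100.0, 90.0, 0.01, 0.5, is_call=False)
```

Applying `implied_vol` to market prices across strikes and expirations, rather than to model prices as in this round trip, is what produces the strike and maturity dependent implied volatilities described in the text.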
Market efficiency states that in a free market all available information about an asset is already included in its
price so that there is no good buy. However, in financial markets, perfect hedges do not exist and option prices induce
market risks called gamma risk and vega risk whose order of magnitude is much larger than market corrections such
as transaction costs and other imperfections. In general, these risks can not be hedged away even in continuous time
trading, and hedging becomes approximating a target payoff with a trading strategy. The value of the option price
is thus the cost of the hedging strategy plus a risk premium required by the seller to cover his residual risk which is
unhedgeable. The no-arbitrage pricing theory tells us about the first component of the option value while the second
component depends on the preferences of investors. Thus, the unhedgeable portion is a risky asset and one must decide
how much one is willing to pay for taking the risk. The no-arbitrage argument implies a unique price for that extra risk
called the market price of risk. Hence, when pricing in an incomplete market, the market price of risk enters explicitly the
pricing equation, leading to a distribution of prices rather than a single price, so that one considers bounds. Therefore,
one can either simply ignore the risk premium associated to a discontinuity in the underlying, or one can choose any
equivalent martingale measure as a self-consistent pricing rule but in that case the option price does not correspond to
the cost of a specific hedging strategy. Hence, one should first discuss a hedging strategy and then derive a valuation
for the options in terms of the cost of hedging plus a risk premium.
Incorporating market incompleteness, alternative explanations for the divergence between the risk-neutral distributions and observed returns include peso problems, risk premia and option mispricing but no consensus has yet been
reached. For instance, in the analysis performed by Britten-Jones et al. [2000] the bias may not be due to model
misspecification or measurement errors, but to the way the market prices volatility risk. Similarly Duarte et al. [2007]
documented strong evidence of a conditional risk premium that varies positively with the overall level of market volatility. Their results indicate that the bias induced by censoring options that do not satisfy arbitrage bounds can be large,
possibly resulting in biases in expected returns as large as several percentage points per day. Option investors are
willing to pay more to purchase options as hedges under adverse market conditions, which is indicative of a negative
volatility risk premium. These results are consistent with the existence of time-varying risk premiums and volatility
feedback, but there may be other factors driving the results. Nonetheless, negative market price of volatility risk is the
key premium in explaining the noticeable differences between implied volatility and realised volatility in the equity
market. Thus, research now proposes the volatility risk premium as a possible explanation (Lin et al. [2009], Bakshi
et al. [2003] found supportive evidence of a negative market volatility risk premium).

1.7.6.2

The existence of excess returns

To capture the extra returns embedded in the tails of the market distributions, the literature focused on adding stochastic
processes to the diffusion coefficient of the asset prices or even jumps to the asset prices as the drift was forced to match
the risk-free rate. The notion that equity returns exhibit stochastic volatility is well documented in the literature, and
evidence indicates the existence of a negative volatility risk premium in the options market (see Bakshi et al. [2003]).
CAPM suggests that the only common risk factor relevant to the pricing of any asset is its covariance with the market
portfolio, making beta the right measure of risk. However, excess returns on the traded index options and on the
market portfolio explain this variation, implying that options are non-redundant securities. As a result, Detemple et
al. [1991] argued that there is a general interaction between the returns of risky assets and the returns of options,
implying that option returns should help explain stock returns. That is, option returns should appear as factors in
explaining the cross section of asset returns. For example, Bekaert et al. [2000] investigated the leverage effect and
the time-varying risk premium explanations of the asymmetric volatility phenomenon at both the market and firm
level. They found covariance asymmetry to be the main mechanism behind the asymmetry for the high and medium
leverage portfolios. Negative shocks increase conditional covariances substantially, whereas positive shocks have a
mixed impact on conditional covariances. While the above evidence indicates that volatility risk is priced in the options
market, Arisoy et al. [2006] used straddle returns (volatility trade) on the S&P 500 index and showed that it is also
priced in securities markets.

Chapter 2

Introduction to asset management

2.1

Portfolio management

2.1.1

Defining portfolio management

A financial portfolio consists of a group of financial assets, also called securities or investments, such as stocks, bonds,
futures, or groups of these investment vehicles referred to as exchange-traded funds (ETFs). Building a financial
portfolio is a well known problem in financial markets, requiring a rigorous analysis in order to select the
most profitable assets. Portfolio construction consists of two interrelated tasks
1. an asset allocation task for choosing how to allocate the investor’s wealth between a risk-free security and a set
of N risky securities, and
2. a risky portfolio construction task for choosing how to distribute wealth among the N risky securities.
Therefore, in order to construct a portfolio, we must define investment objectives by focusing on the acceptable degree of
risk for a given return. Portfolio management is the act of deciding which assets need to be included in the portfolio,
how much capital should be allocated to each kind of security, and when to remove a specific investment from the
holding portfolio while taking the investor’s preferences into account. We can apply two forms of management (see
Maginn et al. [2007])
1. Passive management in which the investor concentrates his objective on tracking a market index. This is related
to the idea that it is not possible to beat the market index, as stated by the Random Walk Theory (see Section
(1.7.2)). A passive strategy aims only at establishing a well diversified portfolio without trying to find under or
overvalued stocks.
2. Active management where the main goal of the investor consists in outperforming an investment benchmark
index, buying undervalued stocks and selling overvalued ones.
As explained by Jacobs et al. [2006], a typical equity portfolio is constructed and managed relative to an underlying
benchmark. Designed to track a benchmark, an indexed equity portfolio is a passive management style with no active
returns and residual risk constrained to be close to zero (see Equation (8.2.1)). While the indexed equity portfolio
may underperform the benchmark after costs are considered, enhanced indexed portfolios are designed to provide an
index-like performance plus some excess return after costs. The latter are allowed to relax the constraint on residual
risk by slightly overweighting securities expected to perform well and slightly underweighting securities expected
to perform poorly. This active portfolio incurs controlled anticipated residual risk at a level generally not exceeding
2%. Rather than placing a hard constraint on the portfolio’s residual risk, active equity management seeks portfolios
with a natural level of residual risk based on the return opportunities available and consistent with the investor’s
level of risk tolerance. The aim of most active equity portfolios is to generate attractive risk-adjusted returns (or
alpha). While both the portfolio and the benchmark are defined in terms of constituent securities and their percentage
weights, active equity portfolios have active weights (weights that differ from those in the benchmark), giving rise to active
returns measured as the difference between the returns of the actively managed equity portfolio and the returns of its
benchmark. In general, an actively managed portfolio overweights the securities expected to outperform the benchmark
and underweights the securities expected to perform below the benchmark. In a long-only portfolio, while any security
can be overweighted to achieve a significant positive active weight, the maximum attainable underweight is equal to
the security’s weight in the underlying benchmark index which is achieved by not holding any of the security in the
portfolio. As the weights of most securities in most benchmarks are very small, there is extremely limited opportunity
to profit from underweighting unattractive securities in long-only portfolios. Allowing short-selling by relaxing the
long-only constraint gives the investor more flexibility to underweight overvalued stocks and enhances the actively
managed portfolio’s ability to produce attractive active equity returns. It also reduces the portfolio’s market exposure.
Greater diversification across underweighted and overweighted opportunities should result in greater consistency of
performance relative to the benchmark (see details in Section (7.2)).
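As an illustration of active weights and of the long-only underweighting limit discussed above, we give a minimal sketch. The tickers, weights and returns are hypothetical and only serve the example.

```python
# Hypothetical benchmark and long-only portfolio (illustrative numbers only).
benchmark = {"AAA": 0.50, "BBB": 0.30, "CCC": 0.15, "DDD": 0.05}
portfolio = {"AAA": 0.45, "BBB": 0.40, "CCC": 0.15, "DDD": 0.00}
asset_returns = {"AAA": 0.02, "BBB": 0.05, "CCC": 0.01, "DDD": -0.04}


def active_weights(portfolio, benchmark):
    """Active weight = portfolio weight minus benchmark weight."""
    names = sorted(set(portfolio) | set(benchmark))
    return {n: portfolio.get(n, 0.0) - benchmark.get(n, 0.0) for n in names}


def weighted_return(weights, asset_returns):
    """Return of a set of holdings given per-asset returns."""
    return sum(w * asset_returns[n] for n, w in weights.items())


active = active_weights(portfolio, benchmark)

# Active return = portfolio return minus benchmark return.
active_return = (weighted_return(portfolio, asset_returns)
                 - weighted_return(benchmark, asset_returns))

# Long-only: the largest possible underweight of a security is its benchmark
# weight, reached by not holding it at all (DDD in this example).
largest_underweight = {n: -w for n, w in benchmark.items()}

print(active)
print(active_return)
```

Note, the active weights sum to zero, and the small benchmark weight of DDD caps the bet that a long-only manager can place against it.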
Active portfolio management tries to find undervalued or overvalued stocks in order to achieve a significant profit
when prices are rising or falling. Both trend measurement and portfolio allocation are part of momentum trading
strategies. The former requires the selection of trend filtering techniques which can involve a pool of methods and the
need for an aggregation procedure. This can be done through averaging or dynamic model selection. The resulting
trend indicator can be used to analyse past data or to forecast future asset returns for a given horizon. The latter
requires quantifying the size of each long or short position given a clear investment process. This process should
account for the risk entailed by each position given the expected return. In general, individual risks are calculated
in relation to asset volatility, while a correlation matrix aggregates those individual risks into a global portfolio risk.
Note, rather than considering the correlation of assets one can also consider the correlation of each individual strategy.
In any case, the distribution of these risks between assets or strategies remains an open problem. One would like
the distribution to account for the individual risks, their correlations, and the expected return of each asset. Wagman
[1999] provided a simple framework based on Genetic Programming (GP), which tries to find an optimal portfolio
by resorting to a simple technical analysis indicator, the moving average (MA). The approach starts by generating a
set of random portfolios, and the GP algorithm tries to converge to an optimal portfolio by using an evaluation function
which considers the weight of each asset within the portfolio and the respective degree of satisfaction against the MA
indicator, using different period parameters. Similarly, Yan [2003] and Yan et al. [2005] used a GP approach to find
an optimal model to classify the stocks within the market. The top-ranked stocks are held long while the bottom-ranked
ones are held short. Their model is based on Fundamental Analysis, which consists in studying
the underlying forces of the economy to forecast the market development.
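The aggregation of individual risks into a global portfolio risk mentioned above can be sketched as follows, assuming the usual quadratic form where the portfolio variance is the sum of w_i w_j sigma_i sigma_j rho_ij. The weights, volatilities and correlation are hypothetical.

```python
import math


def portfolio_volatility(weights, vols, corr):
    """Aggregate individual volatilities through a correlation matrix."""
    n = len(weights)
    variance = sum(weights[i] * weights[j] * vols[i] * vols[j] * corr[i][j]
                   for i in range(n) for j in range(n))
    return math.sqrt(variance)


# Hypothetical two-asset example.
weights = [0.6, 0.4]
vols = [0.20, 0.10]
corr = [[1.0, 0.3],
        [0.3, 1.0]]

print(portfolio_volatility(weights, vols, corr))
```

The same function applies when the entries are strategies rather than assets, in which case the correlation matrix is that of the strategy returns.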
In order to outperform a benchmark index by buying and selling properly selected assets, a portfolio manager must
detect profitable opportunities. For instance, in the capital asset pricing model (CAPM), or extension to a multi-factor
model (APT), the skill of a fund manager relies on accurate forecasts of the expected returns and systematic risk on
all risky assets, and on the market portfolio. That is, conditional on the returns on the market portfolio and the risk
free asset, and given forecasts of the systematic risks of risky assets, the fund manager must identify those assets
presenting promising investment opportunities. As a result, markets need to be predictable in some way for active
management to be applied successfully. Fortunately, we saw in Section (1.7.4) that there was substantial evidence
showing that market returns and portfolio returns could be predicted from historical data. We also saw that the two
most popular strategies in financial markets are the contrarian and the momentum strategies. The literature trying
to explain why momentum exists seems to be split into two categories:
1. behavioural explanations (irrational),
2. and market-based explanations (rational).

The literature on behavioural explanations usually focuses on investors under-reacting to information such as
news. This under-reaction can manifest itself either in investors not reacting early enough or in their actions being
insufficiently drastic to protect themselves from the volatility of the market. As a result, prices rise or fall for
longer than would normally be expected by the market players. Market-based explanations regarding momentum are
based on the fact that poor market performance can diminish liquidity, putting further downward pressure
on performance (see Smith [2012]). Another market-based explanation for momentum is that an investor’s appetite
for risk changes over time. When the values of an investor’s assets are forced towards their base level of wealth, the
investor begins to worry about further losses. This leads the investor to sell, putting downward pressures on prices,
further lowering risk appetites and prices. Markets can therefore generate their own momentum when risk appetites
fall or liquidity is low.
In a study on the inequality of capital returns, Piketty [2013] stressed the importance of active portfolio management by showing that skilled portfolio managers could generate substantial additional profits. Analysing capital
returns from the world’s richest persons ranked in Forbes magazine as well as the returns generated from donations to
the US universities, he showed that the rate of return was proportional to the size of the initial capital invested. The
net annualised rate of return of the wealthiest persons and universities is around 6-7% against 3-4% for the rest.
The main reason is that a larger fraction of the capital could be invested in riskier assets, necessitating the services
of skilled portfolio managers to identify and select the best portfolio in an optimal way. For instance, with about
30 billion dollars of donations invested in 2010, Harvard University obtained a rate of return of about 10.2% from
1980-2010. On the other hand, the roughly 500 universities out of 850 with a capital smaller than or equal to 100 million
dollars invested obtained a rate of return of about 6.2% from 1980-2010 (5.1% from 1990-2010). Universities investing over 1 billion dollars have 60% or more of their capital invested in risky assets, while for universities investing
between 50 and 100 million dollars 25% of the capital is invested in risky assets, and for universities investing less than
50 million dollars only 10% is invested in risky assets. While Harvard University spent about 100 million dollars on
management fees, which is 0.3% of 30 billion dollars, the same amount would represent 10% for a university investing 1 billion dollars.
Considering that a university pays between 0.5% and 1% in management fees, it would spend between 5 and 10 million dollars to
manage 1 billion dollars. A university such as North Iowa Community College, which invests 11.5 million dollars,
would spend 150,000 dollars in management fees.

2.1.2

Asset allocation

2.1.2.1

Objectives and methods

While the objective of investing is to increase the purchasing power of capital, the main goal of asset allocation is
to improve the risk-reward trade-off in an investment portfolio. As explained by Darst [2003], investors pursue this
objective by selecting an appropriate mix of asset classes and underlying investments based on
• the investor’s needs and temperament
• the characteristics of risk, return, and correlation coefficients of the assets under consideration in the portfolio
• the financial market outlook
The objective is
1. to increase the overall return from a portfolio for a given degree of risk, or,
2. to reduce the overall risk from the portfolio for a targeted level of return.
For most investors asset allocation often means
1. calculating the rates of return from, standard deviations on, and correlations between, various asset classes
2. running these variables through a mean-variance optimisation program to select asset mixes with different risk-reward profiles
3. analysing and implementing some version of the desired asset allocation in light of the institution’s goals,
history, preferences, constraints, and other factors
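The first step in the list above, the calculation of rates of return, standard deviations and correlations, can be sketched with sample estimators. The return histories below are hypothetical, and the use of the unbiased (n - 1) estimator is our assumption.

```python
import math


def mean(xs):
    """Average rate of return."""
    return sum(xs) / len(xs)


def sample_std(xs):
    """Sample standard deviation with the (n - 1) denominator."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def correlation(xs, ys):
    """Sample correlation between two return series of equal length."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (sample_std(xs) * sample_std(ys))


# Hypothetical periodic returns for two asset classes.
history = {
    "equities": [0.04, -0.02, 0.03, 0.01],
    "bonds": [0.00, 0.02, -0.01, 0.01],
}

for name, returns in history.items():
    print(name, mean(returns), sample_std(returns))
print(correlation(history["equities"], history["bonds"]))
```

These estimates are the inputs that a mean-variance optimisation program would take in the second step.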
A disciplined asset allocation process tends to proceed in a series of sequential steps
1. investor examines and proposes some assumptions on future expected returns, risk, and the correlation of future
returns between asset classes
2. investor selects asset classes that best match his profile and objectives with the maximum expected return for a
given level of risk
3. investor establishes a long-term asset allocation policy (Strategic Asset Allocation (SAA)) reflecting the optimal
long-term standard around which future asset mixes might be expected to vary
4. investor may decide to implement Tactical Asset Allocation (TAA) decisions against the broad guidelines of the
Strategic Asset Allocation
5. investor will periodically rebalance the portfolio of assets, with sensitivity to the tax and transaction cost consequences of such rebalancing, taking account of the SAA framework
6. from time to time, the investor may carefully review the SAA itself to ensure overall appropriateness given current circumstances, frame of mind, the outlook for each of the respective asset classes, and overall expectations
for the financial markets
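The periodic rebalancing step can be sketched as follows. We assume, for illustration only, a tolerance band around the SAA weights as a crude way of limiting transaction costs; the band width and the asset returns are hypothetical.

```python
def drifted_weights(target, asset_returns):
    """Weights after one period of market moves, starting from the target mix."""
    grown = {a: w * (1.0 + asset_returns[a]) for a, w in target.items()}
    total = sum(grown.values())
    return {a: v / total for a, v in grown.items()}


def rebalancing_trades(current, target, band=0.05):
    """Trades (as fractions of portfolio value) back to the target weights.

    No trade is made while every weight stays within `band` of its target,
    which is an assumed rule and not a recommendation.
    """
    drift = max(abs(current[a] - target[a]) for a in target)
    if drift <= band:
        return {a: 0.0 for a in target}
    return {a: target[a] - current[a] for a in target}


saa = {"equities": 0.60, "bonds": 0.40}  # hypothetical strategic allocation
after_rally = drifted_weights(saa, {"equities": 0.30, "bonds": 0.00})
print(after_rally)
print(rebalancing_trades(after_rally, saa))
```

A wider band reduces turnover and the associated tax and transaction costs, at the price of larger deviations from the strategic mix.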
Asset allocation seeks, through diversification, to provide higher returns with lower risk over a sufficiently long time
frame and to appropriately compensate the investor for bearing non-diversifiable volatility. Some of the foundations
of asset allocation are related to
• the asset - such as the selection of asset classes, the assessment of asset characteristics, the evaluation of the
outlook for each asset class
• the market - such as gauging divergence, scenario analysis, risk estimation
• the investor - such as investor circumstances review, models efficacy analysis, application of judgment
While the scope of asset allocation for any investor defines his universe of investment activity, the types of asset
allocation are classified according to their style, orientation, and inputs and can be combined accordingly
• The style of an asset allocation can be described as conservative, moderate, or aggressive (cash, bonds, equities,
derivatives). A conservative style should exhibit lower price volatility (measured by the standard deviation of
returns from the portfolio), and generate a greater proportion of its returns in the form of dividend and interest
income. An aggressive style may exhibit higher price volatility and generate a greater proportion of its returns
in the form of capital gains.
• The orientation type can be described as strategic, tactical, or a blend of the two. A strategic asset allocation
(SAA) attempts to establish the best long-term mix of assets for the investor, with relatively less focus on
short-term market fluctuations. It helps determine which asset classes to include in the long-term asset mix.
Some investors may adopt a primarily tactical approach to asset allocation by viewing the long term as an
ongoing series of short-term time frames. Others can use TAA to either reinforce or counteract the portfolio’s
strategic allocation policies. Owing to the price-aware, opportunistic nature of TAA, special forms of tactical
risk management can include price alerts, limit and stop-loss orders, simultaneous transaction techniques, and
value-at-risk (VaR) models. While SAA allows investors to map out a long-term plan, TAA helps investors to
anticipate and respond to significant shifts in asset prices.
• Investors can use different inputs to formulate the percentages of the overall portfolio that they will invest in
each asset class. These percentages can be determined with the help of quantitative models, qualitative judgements, or a combination of both. The quantitative approach consists in selecting the asset classes and subclasses
for the portfolio, and proposing assumptions on the future expected returns and risk of the asset classes and on the correlations of future
expected returns between each pair of asset classes. Then, a portfolio optimisation program can generate a set of
possible asset allocations, each with its own level of expected risk and return. As a result, investors can select
a series of Efficient Frontier asset allocations showing portfolios with the minimum risk for a given level of
expected return. Investors may then decide to set upper and lower percentage limits on the amounts allowed in the portfolio by imposing constraints on the optimisation. Qualitative asset allocation
assesses fundamental measures, valuation measures, psychology and liquidity measures. These assessments,
carried out on an absolute basis and relative to long-term historical averages, are often expressed in terms of the
number of standard deviations above or below their long-term mean.
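To illustrate the quantitative approach, the sketch below generates a set of possible long-only allocations over three asset classes and keeps those that are not dominated in risk-return terms. It is a brute-force grid search that we substitute for a proper optimisation program; the expected returns, volatilities, correlations and the optional upper and lower limits are all hypothetical assumptions.

```python
import math
from itertools import product

# Assumed inputs for three asset classes (cash-like, bonds, equities).
mu = [0.03, 0.05, 0.08]
vol = [0.01, 0.06, 0.18]
corr = [[1.0, 0.1, 0.0],
        [0.1, 1.0, 0.2],
        [0.0, 0.2, 1.0]]


def risk_return(w):
    """Expected return and volatility of the allocation w."""
    ret = sum(wi * mi for wi, mi in zip(w, mu))
    var = sum(w[i] * w[j] * vol[i] * vol[j] * corr[i][j]
              for i in range(3) for j in range(3))
    return ret, math.sqrt(var)


def candidate_weights(step=0.05, lower=0.0, upper=1.0):
    """All allocations on a grid, summing to one and within the limits."""
    n = round(1.0 / step)
    for i, j in product(range(n + 1), repeat=2):
        k = n - i - j
        if k < 0:
            continue
        w = (i / n, j / n, k / n)
        if all(lower <= x <= upper for x in w):
            yield w


def efficient_frontier(step=0.05, lower=0.0, upper=1.0):
    """Keep allocations with more return than every less risky candidate."""
    points = sorted((risk_return(w) + (w,)
                     for w in candidate_weights(step, lower, upper)),
                    key=lambda p: (p[1], -p[0]))  # by risk, then by return
    frontier, best_return = [], -math.inf
    for ret, risk, w in points:
        if ret > best_return + 1e-12:
            frontier.append((ret, risk, w))
            best_return = ret
    return frontier


for ret, risk, w in efficient_frontier(upper=0.6):
    print(round(ret, 4), round(risk, 4), w)
```

Each retained allocation has the minimum risk among the grid candidates offering at least its level of expected return, and the `lower` and `upper` arguments play the role of the percentage limits mentioned above.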
2.1.2.2

Active portfolio strategies

The use of predetermined variables to predict asset returns in view of constructing optimum portfolios has produced
new insights into asset pricing models, which have been applied to improve existing policies based upon unconditional estimates. Several strategies exist to take advantage of market predictability in view of generating excess
return. Extending Sharpe’s CAPM [1964] to account for the presence of pervasive risk, Fama et al. [1992] decomposed portfolio returns into:
• systematic market risk,
• systematic style risk, and,
• specific risk.
As a result, a new classification of active portfolio strategies appears where
• Market Timing or Tactical Asset Allocation (TAA) strategies aim at exploiting evidence of predictability in
market factors, while,
• Style Timing or Tactical Style Allocation (TSA) strategies aim at exploiting evidence of predictability in style
factors, and,
• Stock Picking (SP) strategies are based on stock specific risk.
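The decomposition into market, style and specific components can be sketched under the assumption of a linear factor model with known exposures. The betas and factor returns below are hypothetical, and in practice the exposures would have to be estimated.

```python
def decompose_return(portfolio_return, beta_market, market_return,
                     beta_style, style_return):
    """Split a return into market, style and specific (residual) parts."""
    market_part = beta_market * market_return
    style_part = beta_style * style_return
    specific_part = portfolio_return - market_part - style_part
    return {"market": market_part, "style": style_part,
            "specific": specific_part}


# Hypothetical period: portfolio +6%, market +4%, style factor +2%.
parts = decompose_return(0.06, beta_market=1.1, market_return=0.04,
                         beta_style=0.5, style_return=0.02)
print(parts)
```

In this reading, TAA acts on the market part, TSA on the style part, and stock picking on the specific part.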
As early as the 1970s, Tactical Asset Allocation (TAA) was considered as a way of allocating wealth between two asset
classes, typically shifting between stocks and bonds. It is a style timing strategy, that is, a dynamic investment strategy
actively adjusting a portfolio’s asset allocation in view of improving the risk-adjusted returns of passive management
investing. The objectives are to maximise total return on investment, limit risk, and maintain an appropriate degree
of portfolio diversification. Systematic TAA uses quantitative investment models such as trend following or relative
strength techniques, capitalising on momentum, to exploit inefficiencies and produce excess returns. Market timing is
another form of asset allocation where investors attempt to time the market by adding funds to or withdrawing funds
from the asset class in question according to a periodic schedule, seeking to take advantage of downward or upward
price fluctuations. Momentum strategies are examples of market timing. For instance, momentum strategies try to
benefit from either market trends or market cycles. Being an investment style based only on the history of past prices,
one can identify two types of momentum strategies:
1. On one hand the trend following strategy consisting in buying (or selling) an asset when the estimated price
trend is positive (or negative).
2. On the other hand the contrarian (or mean-reverting) strategy consisting in selling (or buying) an asset when the
estimated price trend is positive (or negative).
For example, earnings momentum strategies involve buying the shares of companies exhibiting strong growth in reported earnings, and selling shares of companies experiencing a slowdown in the rate of growth in earnings. Similarly, price momentum strategies are based on buying shares with increasing prices, and selling shares with declining prices. Note, such
momentum based methods involve a high rate of portfolio turnover and trading activity, and can be quite risky. Stock
Selection criteria or Stock Picking strategies aim at exploiting evidence of predictability in individual stock specific
risk. Perhaps one of the most popular stock picking strategies is that of the long/short with the majority of equity
managers still favouring this strategy to generate returns. The stock investment or position can be long to benefit from
a stock price increase or short to benefit from a stock price decrease, depending on the investor’s expectation of how
the stock price is going to move. The stock selection criteria may include systematic stock picking methods utilising
computer software and/or data. Note, most mutual fund managers actually make discretionary, and sometimes unintended, bets on styles as much as they make bets on stocks. In other words, they perform tactical asset allocation
(TAA), tactical style allocation (TSA) and stock picking (SP) at the same time.
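The two types of momentum strategies described above differ only in the sign of the position taken for a given estimated trend. A minimal sketch follows, where the trend is estimated, as an assumption, by the difference between a short and a long moving average; the window lengths and prices are hypothetical.

```python
def moving_average(prices, window):
    """Simple moving averages over a rolling window."""
    return [sum(prices[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(prices))]


def trend_estimate(prices, short_window=3, long_window=5):
    """Crude trend estimate: latest short average minus latest long average."""
    return (moving_average(prices, short_window)[-1]
            - moving_average(prices, long_window)[-1])


def position(prices, style="trend", short_window=3, long_window=5):
    """+1 (buy), -1 (sell) or 0, for a trend following or contrarian style."""
    trend = trend_estimate(prices, short_window, long_window)
    sign = (trend > 0) - (trend < 0)
    return sign if style == "trend" else -sign


rising = [100, 101, 102, 103, 104, 105]
print(position(rising, "trend"), position(rising, "contrarian"))
```

Any other trend filter could replace the moving average difference without changing the sign logic.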
2.1.2.3

A review of asset allocation techniques

We present a few allocation techniques among the numerous methodologies developed in the financial literature. For
more details see the textbook by Meucci [2005].
• Equally-weighted is the simplest allocation algorithm, where the same weight is attributed to each strategy. It is
used as a benchmark for other allocation methods.
• Inverse volatility consists of weighting the strategies in proportion to the inverse of their volatility, that is, it
takes a large exposure to assets with low volatility.
• Minimum variance seeks to build a portfolio such that the overall variance is minimal. If all assets in the basket
are perfectly correlated, a long-only minimum variance portfolio allocates everything to the lowest volatility
asset, resulting in poor diversification.
• Improved minimum variance improves the portfolio’s covariance by using a correlation matrix based on the
Spearman correlation which is calculated on the ranks of the variables and tends to be a more reliable estimate
of correlation.
• Minimum value-at-risk (VaR) seeks to build a portfolio such that the overall VaR is minimal. The marginal
distribution of each strategy is measured empirically and the relationship between the strategies is modelled by
the Gaussian copula which considers a single correlation coefficient.
• Minimum expected shortfall seeks to build a portfolio such that the overall expected shortfall (average loss beyond
the VaR) is minimal.
• Equally-weighted risk contribution (ERC) seeks to equalise the risk contribution of each strategy. The risk
contribution of a strategy is the share of the total portfolio risk due to the strategy, represented by the product of
its weight, its standard deviation and its correlation with the portfolio.
• Factor based minimum variance applies the minimum variance method on a covariance matrix reduced to the
first three factors of a rolling principal component analysis (PCA).
• Factor based ERC applies the ERC method on a covariance matrix reduced to the first three factors of a rolling
PCA.
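Some of the techniques listed above can be sketched in a few lines. The code below computes equally-weighted, inverse volatility and minimum variance weights, together with the risk contribution of each strategy. The minimum variance weights are the unconstrained solution (weights sum to one, short positions allowed), which is our simplifying assumption, and the volatilities and correlations are hypothetical.

```python
import math


def covariance_matrix(vols, corr):
    """Covariance matrix from volatilities and a correlation matrix."""
    n = len(vols)
    return [[vols[i] * vols[j] * corr[i][j] for j in range(n)]
            for i in range(n)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def equal_weight(n):
    return [1.0 / n] * n


def inverse_volatility(vols):
    inv = [1.0 / v for v in vols]
    return [x / sum(inv) for x in inv]


def minimum_variance(cov):
    """Unconstrained minimum variance: proportional to inverse(cov) times ones."""
    z = solve(cov, [1.0] * len(cov))
    return [x / sum(z) for x in z]


def risk_contributions(w, cov):
    """Contribution of each strategy to the portfolio volatility."""
    n = len(w)
    marginal = [sum(cov[i][j] * w[j] for j in range(n)) for i in range(n)]
    total = math.sqrt(sum(w[i] * marginal[i] for i in range(n)))
    return [w[i] * marginal[i] / total for i in range(n)]


vols = [0.10, 0.15, 0.25]
corr = [[1.0, 0.2, 0.1],
        [0.2, 1.0, 0.5],
        [0.1, 0.5, 1.0]]
cov = covariance_matrix(vols, corr)
for w in (equal_weight(3), inverse_volatility(vols), minimum_variance(cov)):
    print([round(x, 3) for x in w],
          [round(x, 4) for x in risk_contributions(w, cov)])
```

The risk contributions sum to the portfolio volatility, so an ERC allocation is one for which the entries returned by `risk_contributions` are all equal.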

Optimal portfolios are designed to offer best risk metrics by computing estimates of future covariances and risk metrics
measured by looking at past data over a certain rolling window. However, future returns may vary widely from the
past and strategies may subsequently fail. Investors have preferences in terms of risk, return, and diversification. One
can classify the allocation strategies into two groups:
1. the low volatility allocation techniques (inverse volatility, minimum variance),
2. and the strong performance allocation techniques (equally-weighted allocation, minimum VaR, minimum expected shortfall).
While low volatility techniques deliver the best risk metrics once adjusted for volatility, strong performance techniques,
such as minimum VaR and shortfall, lead to higher extreme risks. Techniques such as inverse volatility and minimum
variance reduce risk and improve Sharpe ratios, mostly by steering away from the most volatile strategies at the right
time. But in some circumstances this is done at the cost of lower diversification. Note, getting away from too volatile
strategies by reducing the scope of the portfolio may be preferable. While the equally weighted allocation is the most
diversified allocation strategy, equal risk contribution offers an attractive combination of low risk and high returns.
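The VaR and expected shortfall used by two of the techniques above can be estimated empirically from past returns. The sketch below uses one simple historical convention among several, which is an assumption on our part; the return sample is hypothetical.

```python
def historical_var(returns, level=0.95):
    """Empirical VaR: loss at the given confidence level (positive = loss)."""
    losses = sorted(-r for r in returns)
    index = min(len(losses) - 1, int(level * len(losses)))
    return losses[index]


def expected_shortfall(returns, level=0.95):
    """Average of the losses at or beyond the VaR."""
    var = historical_var(returns, level)
    tail = [-r for r in returns if -r >= var]
    return sum(tail) / len(tail)


# Hypothetical sample of ten periodic returns.
sample = [-0.10, -0.05, -0.02, 0.0, 0.01, 0.01, 0.02, 0.02, 0.03, 0.04]
print(historical_var(sample, level=0.8))
print(expected_shortfall(sample, level=0.8))
```

By construction the expected shortfall is never smaller than the VaR at the same level, which is why it is more sensitive to extreme losses.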

2.1.3

Presenting some trading strategies

2.1.3.1

Some examples of behavioural strategies

For applications of behavioural finance such as bubbles and other anomalies see Shiller [2000]. To summarise, the
bubble story states that shifting investor sentiment over time creates periods of overvaluation and undervaluation in
the aggregate equity market level that a contrarian market timer can exploit. In addition, varying investor sentiment
across individual stocks creates cross-sectional opportunities that a contrarian stock-picker can exploit. However,
short-term market timers may also join the bandwagon while the bubble keeps growing (there is a cross-sectional
counterpart for this behaviour). Specifically, momentum-based stock selection strategies involving buying recent
winners appear profitable. Cross-sectional trading strategies may be relatively value oriented (buy low, sell high) or
momentum oriented (buy rising stocks, sell falling ones), and they may be applied within one market or across many
asset markets. Micro-inefficiency refers to either the rare extreme case of riskless arbitrage opportunities or the more
plausible case of risky trades and strategies with attractive reward-to-risk ratios. Note, cross-sectional opportunities are
safer to exploit than market directional opportunities since one can hedge away directional risk and diversify specific
risk much more effectively. Also, the value effect refers to the pattern that value stocks, for instance those with low
valuation ratios (low price/earnings, price/cash flow, price/sales, price/dividend, price/book value ratios), tend to offer
higher long-run average returns than growth stocks or glamour stocks with high valuation ratios. Some of the most
important biases of behavioural finance are
1. momentum,
2. and reversal effects.
DeBondt et al. [1985] found that stocks which had underperformed in the previous 12 to 36 months tended to subsequently
outperform the market. Jegadeesh et al. [1993] found a short to medium term momentum effect where stocks that
had outperformed in recent months tended to keep outperforming up to 12 months ahead. In addition, time series
evidence suggests that many financial time series exhibit positive autocorrelation over short horizons and negative
autocorrelation over multi-year horizons. As a result, trend following strategies are profitable for many risky assets in
the short run, while value strategies, which may in part rely on long-term (relative) mean reversion, are profitable in
the long run. Momentum and value strategies also appear to be profitable when applied across countries, within other
asset classes, and across asset classes (global tactical asset allocation) but with different time horizons. It seems that
behavioural finance was better equipped than rational finance to explain the combination of momentum patterns up
to 12 months followed by reversal patterns beyond 12 months. Other models relying on different behavioural errors
were developed to explain observed momentum and reversal patterns. Assuming that noise traders follow positive
feedback strategies (buying recent winners and selling recent losers) which could reflect extrapolative expectations,
stop-loss orders, margin calls, portfolio insurance, wealth dependent risk aversion or sentiment, De Long et al. [1990]
developed a formal model to predict both short-term momentum and long-term reversal. Positive feedback trading
creates short-term momentum and price over-reaction, with the eventual return toward fundamental values creating long-term price reversal. Hong et al. [1999] developed a model relying on the interaction between two not-fully rational
investor groups, news-watchers and momentum traders, under conditions of heterogeneous information. Slow diffusion
of private information across news-watchers creates under-reaction and momentum effects. Momentum traders then
jump on the bandwagon when observing trends, in the hope of profiting from the continued diffusion of information,
generating further momentum and causing prices to over-react beyond fundamental values. All these models use
behavioural finance to explain long-term reversal as a return toward fundamental values that corrects the over-reaction.
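A positive feedback strategy of the kind discussed above, buying recent winners and selling recent losers, can be sketched as a cross-sectional ranking on past returns. The lookback of 12 periods, the fraction of stocks held on each side and the price histories are hypothetical choices made for the example.

```python
def formation_return(prices, lookback):
    """Return over the last `lookback` periods."""
    return prices[-1] / prices[-1 - lookback] - 1.0


def momentum_portfolio(price_history, lookback=12, fraction=0.25):
    """Long the top fraction of past performers, short the bottom fraction."""
    ranked = sorted(price_history,
                    key=lambda n: formation_return(price_history[n], lookback))
    k = max(1, int(len(ranked) * fraction))
    losers, winners = ranked[:k], ranked[-k:]
    weights = {n: 0.0 for n in price_history}
    for n in winners:
        weights[n] += 1.0 / k
    for n in losers:
        weights[n] -= 1.0 / k
    return weights


# Hypothetical monthly prices (13 observations give a 12-month lookback).
price_history = {
    "WIN": [100 * 1.03 ** t for t in range(13)],
    "UP": [100 * 1.01 ** t for t in range(13)],
    "DOWN": [100 * 0.99 ** t for t in range(13)],
    "LOSE": [100 * 0.97 ** t for t in range(13)],
}
print(momentum_portfolio(price_history))
```

Reversing the sign of the weights gives the corresponding contrarian portfolio, which would be the natural candidate over the longer horizons where reversal is reported.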
2.1.3.2

Some examples of market neutral strategies

As described by Guthrie [2006], equity market neutral hedge funds buy and sell stocks with the goal of neutralising
exposure to the market, while capturing a positive return, regardless of the market’s direction. It includes different
equity strategies with varying degrees of volatility seeking to exploit the market inefficiencies. This is in direct contradiction with the efficient market hypothesis (EMH) (see Section (1.7.2)). The main strategy involves simultaneously
holding matched long and short stock positions, taking advantage of relatively under-priced and over-priced stocks.
The spread between the performance of the longs and the shorts, and the interest earned from the short rebate, provides
the primary return for this strategy. An equity market neutral strategy can be established in terms of dollar amount,
beta, country, currency, industry or sector, market capitalisation, style, and other factors or a combination thereof. The
three basic steps to build a market neutral strategy are
1. Select the universe: The universe consists of all equity securities that are candidates for the portfolio in one or
more industry sectors, spanning one or more stock exchanges. The stocks in the universe should have sufficient
liquidity so that entering and exiting positions can be done quickly, and it should be feasible to sell stocks short
with reasonable borrowing cost.
2. Generate a forecast: Fund managers use proprietary trading models to generate potential trades. The algorithms should indicate each trade’s expected return and risk, and implementation costs should be included when
determining the net risk-return profile.
3. Construct the portfolio: In the portfolio construction process, the manager assigns weights (both positive and
negative) to each security in the universe. There are different portfolio construction techniques, but in any case
risk management issues must be considered, for instance the maximum exposure to any single security or
sector, and the appropriate amount of leverage to be employed.
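The difference between dollar neutrality and beta neutrality in the portfolio construction step can be sketched as follows. The tickers, weights and betas are hypothetical, and rescaling the short side is only one of several ways to remove the net beta.

```python
def exposures(weights, betas):
    """Net dollar exposure and net beta of a long/short portfolio."""
    dollar = sum(weights.values())
    beta = sum(w * betas[n] for n, w in weights.items())
    return dollar, beta


def beta_neutralise(weights, betas):
    """Rescale the short side so that the net beta is zero.

    Assumes the short side has a non-zero beta; the long side is unchanged.
    """
    long_beta = sum(w * betas[n] for n, w in weights.items() if w > 0)
    short_beta = sum(w * betas[n] for n, w in weights.items() if w < 0)
    scale = -long_beta / short_beta
    return {n: (w * scale if w < 0 else w) for n, w in weights.items()}


# Hypothetical dollar-neutral book: 1 unit long, 1 unit short.
weights = {"A": 0.5, "B": 0.5, "C": -0.5, "D": -0.5}
betas = {"A": 1.2, "B": 0.8, "C": 1.5, "D": 1.0}

print(exposures(weights, betas))
neutral = beta_neutralise(weights, betas)
print(neutral)
print(exposures(neutral, betas))
```

In this example the dollar-neutral book carries a negative net beta, and removing it leaves a positive net dollar exposure, so the two notions of neutrality do not coincide in general.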
One can distinguish two main approaches to equity market neutral:
1. statistical arbitrage,
2. and fundamental arbitrage, the two of which can be combined.
Statistical arbitrage involves model-based, short-term trading using quantitative and technical analysis to detect profit
opportunities. A particular type of arbitrage opportunity is hypothesised, formalised into a set of trading rules and
back-tested with historical data. This way, the manager hopes to discover a persistent and statistically significant
method to detect profit opportunities. Three typical statistical arbitrage techniques are
1. Pairs or peer group involves simultaneously buying and selling short stocks of companies in the same economic
sector or peer group. Typical correlations are measured and positions are established when current prices fall
outside of a normal band. Position sizes can be weighted to achieve dollar, beta, or volatility neutrality. Positions
are closed when prices revert to the normal range or when stop losses are breached. Portfolios of multiple pair
trades are blended to reduce stock specific risk.
2. Stub trading involves simultaneously buying and selling short stocks of a parent company and its subsidiaries,
depending on short-term discrepancies in market valuation versus actual stock ownership. Position sizes are
typically weighted by percentage ownership.
3. Multi-class trading involves simultaneously buying and selling short different classes of stocks of the same
company, typically voting and non-voting or multi-voting and single-voting share classes. Much like pairs
trading, typical correlations are measured and positions are established when current prices fall outside of a
normal band.
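The band-based rule common to pairs and multi-class trading can be sketched with a z-score of the spread between the two legs. The entry, exit and stop-loss thresholds are hypothetical, and the definition of the spread itself (price difference, log ratio, hedge-ratio weighted) is left open here.

```python
import statistics


def zscore(spread_history, current_spread):
    """Distance of the current spread from its historical mean, in std units."""
    mu = statistics.mean(spread_history)
    sd = statistics.stdev(spread_history)
    return (current_spread - mu) / sd


def pair_signal(z, position, entry=2.0, exit_band=0.5, stop=4.0):
    """New position in the spread: +1 long, -1 short, 0 flat.

    Open when the spread leaves the normal band, close when it reverts
    inside `exit_band` or when the stop-loss level is breached.
    """
    if position == 0:
        if entry < z < stop:
            return -1  # spread unusually wide: short it
        if -stop < z < -entry:
            return 1   # spread unusually narrow: buy it
        return 0
    if abs(z) < exit_band or abs(z) > stop:
        return 0
    return position


print(zscore([1.0, 2.0, 3.0], 4.0))
print(pair_signal(2.5, position=0))
```

Being short the spread means selling short the relatively expensive stock and buying the relatively cheap one, with sizes chosen for dollar, beta or volatility neutrality.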
Fundamental arbitrage consists mainly of building portfolios in certain industries by buying the strongest companies
and selling short companies showing signs of weakness. Even though the analysis is mainly fundamental and less
quantitative than statistical arbitrage, some managers use technical and price momentum indicators (moving averages,
relative strength and trading volume) to help them in their decision making. Fundamental factors used in the analysis
include valuation ratios (price/earnings, price/cash flow, price/earnings before interest and tax, price/book), discounted
cash flows, return on equity, operating margins and other indicators. Portfolio turnover is generally lower than in
statistical arbitrage as the signals are stronger but change less frequently.
Among the factors contributing to the different sources of return are
• No index constraint: Equity market neutral removes the index constraints limiting the buy-and-hold market
participants. Selling a stock short is different from not owning a stock in the index, since the weight of the short
position is limited only by the manager’s forecast accuracy, confidence and ability to offset market risk with
long positions.
• Inefficiencies in short selling: Significant inefficiencies are available in selling stocks short.
• Time arbitrage: Equity market neutral involves a time arbitrage for short-term traders at the expense of long-term investors. With higher turnover and more frequent signals, the equity market neutral manager can often
profit at the expense of the long-term equity investor.
• Additional active return potential: Equity market neutral involves double the market exposure by being both
long and short stocks. At a minimum, two dollars are at work for every one dollar of invested capital. Hence,
a market neutral manager has the potential to generate more returns than the active return of a long-only equity
manager.
• Managing volatility: Through an integrated optimisation, the co-relationship between all stocks in an index can
be exploited. Depending on the dispersion of stock returns, risk can be significantly reduced by systematically
reweighting positions to profit from offsetting volatility. Reducing volatility allows for leverage to be used,
which is an additional source of return.
• Profit potential in all market conditions: By managing a relatively fixed volatility portfolio, an equity market
neutral manager may have an advantage over a long-only equity manager allowing him to remain fully invested
in all market conditions.
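The primary return of a dollar-neutral book, the spread between the longs and the shorts plus the short rebate, can be written as a small calculation. We assume one unit of capital held long and one unit sold short, and ignore borrowing fees and transaction costs; the figures are hypothetical.

```python
def market_neutral_return(long_return, short_return, rebate_rate, capital=1.0):
    """Return on capital of a book that is `capital` long and `capital` short."""
    long_pnl = capital * long_return      # gain on the long side
    short_pnl = -capital * short_return   # gain when the shorted stocks fall
    rebate = capital * rebate_rate        # interest on the short sale proceeds
    return (long_pnl + short_pnl + rebate) / capital


# Longs +8%, shorts +3%, rebate 2%.
print(market_neutral_return(0.08, 0.03, 0.02))
# The same book when every stock falls by a further 10%.
print(market_neutral_return(-0.02, -0.07, 0.02))
```

Two units of stock exposure are at work for each unit of capital, and a common market move cancels between the two sides, which leaves the spread and the rebate as the sources of return.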
The key risk factors of an equity market neutral strategy are
• Unintended beta mismatch: Long and short equity portfolios can easily be dollar neutral, but not beta neutral. Reaction to large market movements is therefore unpredictable, as one side of the portfolio will behave
differently than the other.
• Unintended factor mismatch: Long and short equity portfolios can be both dollar neutral and beta neutral, but
severely mismatched on other important factors (liquidity, turnover, value/growth, market capitalisation). Again,
large market moves will affect one side of the portfolio differently from the other.
• Model risk: All risk exposures of the model should be assessed to prevent bad forecast generation, and practical
implementation issues should be considered. For instance, even if the model indicates that a certain stock should
be shorted at a particular instant in time, this may not be feasible due to the uptick rule. Finally, the effectiveness
of the model may diminish as the market environment changes.
• Changes in volatility: The total volatility of a market neutral position depends on the volatility of each position,
so that the manager must carefully assess the volatility of each long and short position as well as the relationship
between them.
• Low interest rates: As part of the return from an equity market neutral strategy is due to the interest earned
on the proceeds from a short sale (rebate), a lower interest rate environment places more pressure on the other
return sources of this strategy.
• Higher borrowing costs for stock lending: Higher borrowing costs cause friction on the short stock side, and
decrease the number of market neutral opportunities available.
• Short squeeze: A sudden increase in the price of a stock that is heavily shorted will cause short sellers to
scramble to cover their positions, resulting in a further increase in price.
• Currency risk: Buying and selling stocks in multiple countries may create currency risk for an equity market
neutral fund. The cost of hedging, or not hedging, can significantly affect the fund’s return.
• Lack of rebalancing risk: The success of a market neutral fund is contingent on constantly rebalancing the
portfolio to reflect current market conditions. Failure to rebalance the portfolio is a primary risk of the strategy.
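To make the beta mismatch risk listed above concrete, the following minimal sketch computes the net dollar exposure and the net beta exposure of a long-short book. The function name, the position values and the betas are hypothetical and chosen for illustration only.

```python
def net_exposures(positions):
    """Return (net dollar exposure, net beta-weighted exposure).

    `positions` is a list of (signed market value, beta) pairs,
    with short positions carrying a negative market value.
    """
    dollar = sum(value for value, _ in positions)
    beta = sum(value * b for value, b in positions)
    return dollar, beta

# Hypothetical book: long 100 of a high-beta stock, short 100 of a low-beta stock.
book = [(100.0, 1.3), (-100.0, 0.8)]
dollar, beta = net_exposures(book)

# The book is dollar neutral, yet it keeps a positive beta exposure,
# so a market fall is expected to hurt the long side more than it helps the short side.
market_move = -0.10
print(dollar, beta, beta * market_move)
```

In this sketch the manager would need to resize one leg (or add an index hedge) so that the beta-weighted exposure, rather than the dollar exposure, sums to zero.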
2.1.3.3 Predicting changes in business cycles

We saw in Section (2.1.3) that there is evidence in the market that different equity styles perform better at different
points in time. For instance, the stock market can be divided into two types of stocks, value and growth where value
stocks are bargain or out-of-favour stocks that are inexpensive relative to company earnings or assets, and growth
stocks represent companies with rapidly expanding earnings growth. Hence, an investment style which is based around
growth searches for investments whose returns are expected to grow at a faster pace than the rest of the market, while
the value style of investment seeks to find investments that are thought to be under-priced. Investors have an intuitive
understanding that equity indexes have contrasting performance at different points of the business cycle. Historically,
value investing tends to be more prominent in periods when the economy is experiencing a recession, while growth
investing performs better during times of economic boom (BlackRock, 2013). As a result, excess returns are
produced by value and growth styles at different points within the business cycle since growth assets and sectors are
affected in a different way than their value equivalents. Therefore, predicting the changes in the business cycle is very
important as it has a direct impact on the tactical style allocation decisions. There are actually two approaches which
can be considered when predicting these changes:
• One approach consists in forecasting returns by first forecasting the values of various economic variables (scenarios on the contemporaneous variables).
• The other approach to forecasting returns is based on anticipating market reactions to known economic variables
(econometric model with lagged variables).
A number of academic studies (see Bernard et al. [1989]) suggested that the reaction of market participants to known
variables was easier to predict than financial and economic factors. The performance of timing decisions based on an
econometric model with lagged variables results from a better ability to process available information, as opposed to
privileged access to private information. This makes a strong case towards using time series modelling in order to gain
insights into the momentum that a market exhibits. Therefore, the objective of a Systematic Tactical Allocator is to
set up an econometric model capable of predicting the time when a given Style is going to outperform other Styles.

For instance, using a robust multi-factor recursive modelling approach, Amenc et al. [2003] found strong evidence
of predictability in value and size style differentials. Since forecasting returns based on anticipating market reactions
to known economic variables is more favourable than trying to forecast financial or economic factors, econometric
models which include lagged variables are usually used. This type of modelling is usually associated with univariate
time series models but can be extended to account for cross-sections. These types of models attempt to predict
variables using only the information contained in their own past values. The class of time series models one should
first consider is the ARMA/ARIMA family of univariate time series models. These types of models are usually atheoretical, that is, their construction is not based on any underlying theoretical model describing the behaviour of
a particular variable. ARMA/ARIMA models try to capture any empirically relevant features of the data, and can be
used to forecast stock returns from their past values as well as to improve the signals associated with time series momentum strategies.
More sophisticated models, such as Exponential Smoothing models and more generally State-Space models,
can also be used.
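As an illustration of a model that uses only the information contained in a variable's own past values, the sketch below fits the simplest member of the ARMA family, an AR(1), by ordinary least squares and iterates it forward. It is a minimal example under our own assumptions: the function names and the return series are hypothetical, and a practical application would rely on a dedicated time series library and proper order selection.

```python
from statistics import mean

def fit_ar1(series):
    """Least-squares fit of x_t = c + phi * x_{t-1} + e_t."""
    x_prev, x_curr = series[:-1], series[1:]
    m_prev, m_curr = mean(x_prev), mean(x_curr)
    cov = sum((a - m_prev) * (b - m_curr) for a, b in zip(x_prev, x_curr))
    var = sum((a - m_prev) ** 2 for a in x_prev)
    phi = cov / var
    c = m_curr - phi * m_prev
    return c, phi

def forecast_ar1(series, c, phi, steps=1):
    """Iterate the fitted recursion forward from the last observation."""
    out, last = [], series[-1]
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out

# Hypothetical return series, for illustration only.
returns = [0.010, -0.004, 0.006, 0.002, -0.008, 0.005, 0.001, -0.003]
c, phi = fit_ar1(returns)
print(c, phi, forecast_ar1(returns, c, phi, steps=2))
```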

2.1.4 Risk premia investing

As discussed in Section (1.5.3), a large range of effective multi-factor models exist to explain realised return variation
over time. A different approach is to use multi-factor models directly on strategies. Risk premia investing is a way of
improving asset allocation decisions with the goal of delivering a more dependable and less volatile investment return.
This new approach is about allocating investment to strategies rather than to assets (see Ilmanen [2011]). Traditional
portfolio allocation such as a 60/40 allocation between equities and bonds remains volatile and dominated by equity
risk. Risk premia investing introduces a different approach to portfolio diversification by constructing portfolios using
available risk premia within the traditional asset classes or risk premia from systematic trading strategies rather than
focusing on classic risk premia, such as equities and bonds. Correlations between many risk premia have historically
been low, offering significant diversification potential, particularly during periods of distress (see Bender et al. [2010]).
There is a large selection of risk premia strategies across assets covering equities, bonds, credit, currency and derivative
markets, and using risk-return characteristics, most of them can be classified as either income, momentum, or relative
value.
1. Income strategies aim at receiving a certain steady flow of money, typically in the form of interest rates or dividend payments. These strategies are often exposed to large losses during confidence crises, when the expected
income no longer offsets the risk of holding the instruments. Credit carry, VIX contango, variance swap strategies, volatility premium strategies, equity value, dividend and size, FX carry, the rates roll-down strategy, and
volatility tail-event strategies belong to that grouping.
2. Momentum strategies are designed to bring significant gains in market downturns, whilst maintaining a decent
performance in other circumstances. For example, CTA-type momentum strategies perform well when markets
rise or fall significantly. An equity overwriting strategy performs best when stock prices fall, and still benefits
from the option premia in other circumstances. Equity/rates/FX momentum, quality equity, overwriting, and
interest rates carry belong to that grouping. Further, a momentum system has a lot in common with a strategy
that buys options and can be used as a hedge during crises (see Ungari [2013]).
3. Relative value strategies are outright systematic alpha strategies based on market anomalies and inefficiencies, for example
convertible arbitrage and dividend swaps. Such discrepancies are expected to correct over a time frame varying
from a few days for technical strategies to a few months or even years for more fundamental strategies.
These categories represent distinct investment styles, which are a key component in understanding risk premia payoffs
and can be compared to risk factors in traditional asset classes. Portfolio managers have always tried to reap a reward
for bearing extra risk, and risk premia investing is one way forward since risk premia have
• demonstrated an attractive positive historical return profile
• fundamental value allowing for a judgement on future expected returns

• some diversification benefits when combined with a multi-asset portfolio
The basic concept recognises that assets are driven by a set of common risk factors for which the investor typically gets
paid and, by controlling and timing exposures to these risk factors the investor can deliver a superior and more robust
outcome than through more traditional forms of asset allocation (see Clarke et al. [2005]). Principal Components
Analysis (PCA), which is an efficient way of summing up the information conveyed by a correlation matrix, is generally used to determine the main drivers of the strategy. For instance, in order to assess the performances of risk premia
investing, Turc et al. [2013] compiled two equity portfolios based on the same five factors (value, momentum, size,
quality and yield). The former is an equal-weighted combination of the five factors in the form of risk premia indices,
and the latter is a quant specific model with a quantitative stock selection process scoring each stock on a combination
of the five factors and creating a long-short portfolio. Using PCA, they identified three risk factors to which each
strategy is exposed, market crisis, equity direction, and volatility premium. In their study, the traditional equity quant
portfolio outperformed by far the combined risk premia approach. Nonetheless, when comparing the two approaches,
most risk premia are transparent, easy to access, and obtained through taking a longer term view exploiting the short
term consideration of other active market participants. An important issue with risk premia strategies is the way in
which they are combined, as their performance and behaviour are linked to the common factors, making them difficult
to implement. In the equity market, returns are expressed on a long-short basis, adding a significant amount of cost
and complexity, as portfolios are often rebalanced to such a degree that annual turnover rates not only eat into returns,
but also involve a considerable amount of portfolio management. Capacity constraints are another concern faced by
many portfolio managers.
In order to compare risk premia strategies across different asset classes, one needs to use a variety of risk metrics
based on past returns such as volatility, skew and kurtosis. The skewness is a measure of the symmetry of a distribution,
such that a negative skew means that the most extreme movements are on the downside. On the other hand, a strategy
with a positive skew is more likely to make large gains than to suffer large losses. Kurtosis is a measure of extreme risk,
and a high kurtosis indicates potential fat-tails, that is, a tendency to post unusually large returns, whether on
the upside or the downside. Practitioners compare some standard performance ratios and statistics commonly used
in asset management including the Sharpe ratio (returns divided by volatility), the Sortino ratio (returns divided by
downside volatility), maximum drawdown and time to recovery, or measures designed to evaluate extreme risks such
as value-at-risk and expected shortfall.
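The metrics above can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: skewness and kurtosis are computed from population moments, kurtosis is reported in excess of the Gaussian value of 3, the Sharpe ratio follows the simple definition given in the text (no risk-free rate is subtracted), and downside volatility is taken as the root mean square of negative returns. The function names and the sample returns are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def skewness(r):
    """Third standardised moment; negative values point to downside extremes."""
    m = mean(r)
    m2 = mean([(x - m) ** 2 for x in r])
    m3 = mean([(x - m) ** 3 for x in r])
    return m3 / m2 ** 1.5

def excess_kurtosis(r):
    """Fourth standardised moment minus 3; positive values suggest fat tails."""
    m = mean(r)
    m2 = mean([(x - m) ** 2 for x in r])
    m4 = mean([(x - m) ** 4 for x in r])
    return m4 / m2 ** 2 - 3.0

def sharpe(r):
    """Average return divided by the volatility of returns."""
    return mean(r) / stdev(r)

def sortino(r):
    """Average return divided by downside volatility (target return of zero)."""
    downside = sqrt(mean([min(x, 0.0) ** 2 for x in r]))
    return mean(r) / downside

# Hypothetical monthly returns, for illustration only.
returns = [0.02, 0.01, -0.03, 0.015, 0.005, -0.06, 0.025, 0.01]
print(stdev(returns), skewness(returns), excess_kurtosis(returns))
print(sharpe(returns), sortino(returns))
```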

2.1.5 Introducing technical analysis

2.1.5.1 Defining technical analysis

We saw in Section (1.7.2) that the accumulating evidence against the efficiency of the market has caused a resurgence
of interest in the claims of technical analysis as the belief that the distribution of price dynamics is totally random
is now being questioned. There exists a large body of work on the mathematical analysis of the behaviour of stock
prices, stock markets and successful strategies for trading in these environments. While investing involves the study of
the basic market fundamentals which may take several years to be reflected in the market, trading involves the study
of technical factors governing short-term market movements together with the behaviour of the market. As a result,
trading is riskier than long-term investing, but it offers opportunities for greater profits (see Hill et al. [2000]).
Technical analysis is about market traders studying market price history with the view of predicting future price
changes in order to enhance trading profitability. Technical trading rules involve the use of technical analysis to design
indicators that help a trader determine whether current behaviour is indicative of a particular trend, together with
the timing of a potential future trade. As a result, in order to apply Technical Analysis, which tries to analyse the
securities’ past performance in view of evaluating possible future investments, we must assume that the historical data
in the markets forms appropriate indications about the market’s future performance. Technical Analysis relies on three
principles (see Murphy [1999])

1. market action discounts everything
2. prices move in trends or are contrarian
3. history tends to repeat itself
Hence, by analysing financial data and studying charts, we can anticipate which way the market is most likely to go.
That is, even though we do not know when we pick a specific stock if its price is going to rise or fall, we can use
technical indicators to give us a future perspective on its behaviour in order to determine the best choice when building
a portfolio. Technical indicators try to capture the behaviour and investment psychology in order to determine if a
stock is under or overvalued. For instance, in order to classify each stock within the market, we can employ a set of
rules based on technical indicators applied to the asset’s prices, their volumes, and/or other financial factors. Based on
entry/exit signals and other plot characteristics, we can define different rules allowing us to score the distinct stocks
within the market and subsequently pick the best securities according to the indicator employed. However, there are
several problems occurring when using technical indicators. There is no single best indicator, so that the indicators should
be combined in order to provide different perspectives. Further, a technical indicator always needs to be applied to a
time window, and determining the best time window is a complex task. For instance, the problem of determining the
best time window can be the solution to an optimisation problem (see Fernandez-Blanco et al. [2008]).
New techniques combining elements of learning, evolution and adaptation from the field of Computational Intelligence developed, aiming at generating profitable portfolios by using technical analysis indicators in an automated
way. In particular, subjects such as Neural Networks (28), Swarm Intelligence, Fuzzy Systems and Evolutionary Computation can be applied to financial markets in a variety of ways such as predicting the future movement of a stock’s
price or optimising a collection of investment assets (funds and portfolios). These techniques assume that there exist
patterns in stock returns and that they can be exploited by analysis of the history of stock prices, returns, and other key
indicators (see Schwager [1996]). With the rapid progress of technology in computer science, new techniques can be
applied to financial markets in view of developing applications capable of automatically managing a portfolio. Consequently, there is substantial interest and possible incentive in developing automated programs that would trade in the
market much like a technical trader would, while being relatively autonomous. A mechanical trading system (MTS),
founded on technical analysis, is a mathematically defined algorithm designed to help the user make objective trading
decisions based on historically reoccurring events. Some of the reasons why a trader should use a trading system are
• continuous and simultaneous multimarket analysis
• elimination of human emotions
• back test and verification capabilities
Mechanical system traders assume that the trending nature of the markets can be understood through the use of
mathematical formulas. For instance, properly filtering time series by removing the noise (congestion), they recover
the trend which is analysed in view of inferring trading signals. Assuming that assets are in a continuous state of
flux, a single system can profitably trade many markets allowing a trader to be exposed to different markets without
fully understanding the nuances of all the individual markets. Since MTS can be verified and analysed with accuracy
through back testing, they are very popular. Commodity Trading Advisors (CTA) use systems due to their ease of
use, non-emotional factor, and their ability to be used as a foundation for an entire trading platform. Everything being
mathematically defined, a CTA can demonstrate a hypothetical track record based on the different needs of his client
and customise a specific trading plan. However, one has to make sure that the system is not over-fitting the data in the
back test. A system must have robust parameters, that is, parameters not curve fitted to historical data. Hence, one
must understand the logic and mathematics behind a system. Unlike actual performance records, simulated results
do not represent actual trading. As a rule of thumb, one should expect only one half of the total profit and twice the
maximum draw down of a hypothetical track record.

2.1.5.2 Presenting a few trading indicators

Any mechanical trading system (MTS) must have some consistent method or trigger for entering and exiting the market
based on some type of indicator or mathematical statistics with price forecasting capability. Anything that indicates
what the future may hold is an indicator. Most system traders and developers spend 90% of their time developing
entry and exit techniques, and the rest of their time is dedicated to the decision process determining profitability. Some
of the most popular indicators include moving average, rate of change, momentum, stochastic (Lane 1950), Relative
Strength Index (RSI) (see Wilder [1978]), moving average convergence divergence (MACD) (see Appel [1999]),
Donchian breakout, Bollinger bands, Keltner bands (see Keltner [1960]). However, indicators cannot stand alone,
and should be used in concert with other ideas and logic. There is a large list of indicators with price forecasting
capability and we will describe a few of them. For more details we refer the reader to Hill et al. [2000].
• In the 50s Lane introduced an oscillator-type indicator called Stochastics that compares the current market
close to the range of prices over a specific time period, indicating when a market is overbought or oversold (see
Lane [1984]). This indicator is based on the assumption that when an uptrend/downtrend approaches a turning
point, the closing prices start moving away from the high/low price of a specific range. The number generated
by this indicator is a percentage in the range [0, 100] such that a reading of 70 or more indicates that the
close is near the high of the range. A reading of 30 or below indicates the close is near the low of the range.
Note, in general the values of the indicator are smoothed with a moving average (MA).
• The Donchian breakout is an envelope indicator involving two lines that are plotted above and below the market.
The top line represents the highest high of n days back (or weeks) and conversely the bottom line represents the
lowest low of n days back. The idea is to buy when the day’s high penetrates the highest high of four weeks
back and to sell when the day’s low penetrates the lowest low of four weeks back.
• The Moving Average Crossover involves two or more moving averages usually consisting of a longer-term and
shorter-term average. When the short-term MA crosses from below the long-term MA, it usually indicates a
buying opportunity, and selling opportunities occur when the shorter-term MA crosses from above the longer-term MA. Moving averages can be calculated as simple, exponential, and weighted averages. Exponential and
weighted MAs tend to skew the moving averages toward the most recent prices, increasing their volatility.
• In the 70s Appel developed another price oscillator called the moving average convergence divergence (MACD)
which is derived from three different exponentially smoothed moving averages (see Appel et al. [2008]). It is
plotted as two different lines, the first line (MACD line) being the difference between the two MAs (long-term
and short-term MAs), and the second line (signal or trigger line) being an exponentially smoothed MA of the
MACD line. Note, the difference (or divergence) between the MACD line and the signal line is shown as a
bar graph. The purpose is to try to eliminate the lag associated with MA type systems. This is done by
anticipating the MA crossover and taking action before the actual crossover. The system buys when the MACD
line crosses the signal line from below and sells when the MACD line crosses the signal line from above.
• As an example of channel trading, Keltner [1960] proposed a system called the 10-day moving average rule
using a constant width channel to time buy-sell signals with the following rules
1. compute the daily average price $\frac{\text{high} + \text{low} + \text{close}}{3}$.
2. compute a 10-day average of the daily average price.
3. compute a 10-day average of the daily range.
4. add and subtract the 10-day average of the daily range to and from the 10-day moving average to form a band or channel.
5. buy when the market penetrates the upper band and sell when it breaks the lower band.
While this system is buying on strength and selling on weakness, some practitioners have modified the rules
as follows

1. instead of buying at the upper band, you sell and vice versa.
2. the number of days is changed to a three-day average with bands around that average.
• The Donchian channel is an indicator formed by taking the highest high and the lowest low of the last n periods.
The area between the high and the low is the channel for the period chosen. While it is an indicator of the volatility
of a market price, it is used for providing signals for long and short positions. If a security trades above the
highest high of the last n periods, then a long is established, otherwise if it trades below the lowest low of the last n periods, a short is
established.
• The Bollinger bands, also called alpha-beta bands, usually use 20 or more days in their calculation and do not
oscillate around a fixed point. They consist of three lines, where the middle line is a simple moving average and
the outside lines are two (that number can vary) standard deviations above and below the MA.
A typical BB type system buys when price reaches the bottom band and liquidates as the price moves up past the
MA. The sell side is simply the opposite. It is assumed that when a price goes beyond two standard deviations
it should revert to the MA. Note, some practitioners reverse the logic and sell rather than buy when prices reach
the lower band, and vice versa with the upper band. Alternatively, one can use a 20-day MA with one and two
standard deviations above and below the MA. Looking at the chart we can deduce trend, volatility, and overbought/oversold conditions. A market more than one standard deviation above the MA is overbought, and it becomes extremely
overbought above two standard deviations. That is, most underlyings will pull back to the average even in strongly
trending markets. Further, with narrow bands we should buy volatility (calls and puts) and when the bands are
widening we should sell volatility.
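The indicators described in the list above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names are ours, the default window lengths (such as 12, 26 and 9 for the MACD, or 14 for the stochastic) are common conventions that are not prescribed by the text, and we assume that the Donchian lines exclude the current bar so that a breakout can be detected against them.

```python
from statistics import mean, pstdev

def sma(values, n):
    """Simple moving average; None until n observations are available."""
    return [None if i + 1 < n else mean(values[i + 1 - n:i + 1])
            for i in range(len(values))]

def ema(values, n):
    """Exponential moving average with smoothing factor 2 / (n + 1)."""
    alpha, out = 2.0 / (n + 1), [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

def crossover_signals(short_ma, long_ma):
    """+1 when the short MA crosses above the long MA, -1 when it crosses below."""
    out = [0] * len(short_ma)
    for i in range(1, len(short_ma)):
        if None in (short_ma[i - 1], long_ma[i - 1], short_ma[i], long_ma[i]):
            continue
        prev, curr = short_ma[i - 1] - long_ma[i - 1], short_ma[i] - long_ma[i]
        if prev <= 0 < curr:
            out[i] = 1
        elif prev >= 0 > curr:
            out[i] = -1
    return out

def macd(closes, fast=12, slow=26, signal=9):
    """MACD line (fast EMA minus slow EMA), signal line and their difference."""
    line = [f - s for f, s in zip(ema(closes, fast), ema(closes, slow))]
    sig = ema(line, signal)
    return line, sig, [l - s for l, s in zip(line, sig)]

def stochastic_k(highs, lows, closes, n=14):
    """Position of the close within the n-period range, in [0, 100] (unsmoothed)."""
    out = []
    for i in range(len(closes)):
        if i + 1 < n:
            out.append(None)
            continue
        hi, lo = max(highs[i + 1 - n:i + 1]), min(lows[i + 1 - n:i + 1])
        out.append(100.0 * (closes[i] - lo) / (hi - lo) if hi > lo else 50.0)
    return out

def donchian(highs, lows, n=20):
    """Highest high and lowest low of the previous n bars (current bar excluded)."""
    upper = [None if i < n else max(highs[i - n:i]) for i in range(len(highs))]
    lower = [None if i < n else min(lows[i - n:i]) for i in range(len(lows))]
    return upper, lower

def keltner(highs, lows, closes, n=10):
    """Channel around the n-day average of (high + low + close) / 3."""
    typical = [(h + l + c) / 3.0 for h, l, c in zip(highs, lows, closes)]
    ranges = [h - l for h, l in zip(highs, lows)]
    mid, width = sma(typical, n), sma(ranges, n)
    return [(None, None) if m is None else (m - w, m + w)
            for m, w in zip(mid, width)]

def bollinger(closes, n=20, k=2.0):
    """(lower, middle, upper) bands: SMA plus or minus k standard deviations."""
    bands = []
    for i, m in enumerate(sma(closes, n)):
        if m is None:
            bands.append((None, None, None))
        else:
            sd = pstdev(closes[i + 1 - n:i + 1])
            bands.append((m - k * sd, m, m + k * sd))
    return bands
```

For example, `crossover_signals(sma(closes, 10), sma(closes, 50))` would flag the bars at which a hypothetical 10/50 moving average crossover system changes side.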

Considering projected charts to map future market activity, Drummond et al. [1999] introduced the Drummond
Geometry (DG) which is both a trend-following and a congestion-action methodology, leading rather than lagging the
market. It tries to foretell the most likely scenario that shows the highest probability of occurring in the immediate
future and can be custom fitted to one’s personality and trading style. The key elements of DG include a combination
of the following three basic categories of trading tools and techniques
1. a series of short-term moving averages
2. short-term trend lines
3. multiple time-period overlays
The concept of Point and Line (PL) reflecting how all of life moves from one extreme to another, flowing back and
forth in a cyclical or wave-like manner, is applied to the market. The PLdot (average of the high, low, and close of the
last three bars) is a short-term MA based on three bars (or time periods) of data capturing the trend/nontrend activity
of the time frame that is being charted. It represents the center of all market activity. The PLdot is very sensitive to
trending markets, it is also very quick at registering the change of a market out of congestion (noise) into trend, and it
is sensitive to the ending of a trend.
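Following the definition above, the PLdot can be sketched as the three-bar average of (high + low + close) / 3. The function name is ours, and we assume here that the three bars include the current one; whether the resulting value is plotted on the current or on the following bar is a charting convention that the text does not specify.

```python
def pldot(highs, lows, closes):
    """Average of the high, low and close of the last three bars."""
    out = []
    for i in range(len(closes)):
        if i < 2:
            out.append(None)  # fewer than three bars available
            continue
        bars = [(highs[j] + lows[j] + closes[j]) / 3.0
                for j in range(i - 2, i + 1)]
        out.append(sum(bars) / 3.0)
    return out

# Hypothetical bars, for illustration only.
print(pldot([11, 12, 13, 12], [9, 10, 11, 10], [10, 11, 12, 11]))
```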
Since pattern is defined as a predictable route or movement, all trading systems are pattern recognition systems.
For instance, a long-term moving average cross over system uses pattern recognition, the crossover, in its decision
to buy or sell. Similarly, an open range breakout is pattern recognition given by the movement from the open to the
breakout point. All systems look for some type of reoccurring event and try to capitalise on it. Hill demonstrated the
success of pattern recognition when used as a filter. The system was developed around the idea of a pattern consisting
of the last four days’ closing prices. A buy or sell signal is not generated unless the range of the past four days’ closes
is less than the 30-day average true range, indicating that the market has reached a state of rest and any movement, up
or down, from this state will most likely result in a significant move.
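The filter described above can be sketched as follows. The text does not define the average true range, so we assume the usual definition of the true range (the largest of high minus low, and the absolute distances of the high and of the low from the previous close) and a simple average over 30 days; the function names are hypothetical.

```python
def true_ranges(highs, lows, closes):
    """True range of each bar; the first bar uses high minus low only."""
    trs = [highs[0] - lows[0]]
    for i in range(1, len(closes)):
        trs.append(max(highs[i] - lows[i],
                       abs(highs[i] - closes[i - 1]),
                       abs(lows[i] - closes[i - 1])))
    return trs

def at_rest(highs, lows, closes, n_closes=4, atr_days=30):
    """True when the range of the last n_closes closes is below the average true range."""
    if len(closes) < atr_days:
        return False  # not enough history to compute the average
    recent = closes[-n_closes:]
    atr = sum(true_ranges(highs, lows, closes)[-atr_days:]) / atr_days
    return max(recent) - min(recent) < atr
```

A trading rule would then only act on a buy or sell signal when `at_rest` returns `True` for the bars preceding the signal.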

Eventually, we want to develop an approach into a comprehensive, effective, trading methodology that combines
analytical sophistication with tradable rules and principles. One way forward is to consider a multiple time frame
approach. A time frame is any regular sampling of prices in a time series, from the smallest such as one minute up to
the longest capped out at ten years. The multiple time frame approach has proven to be a fundamental advance in the
field of TA allowing for significant improvement in trading results. For instance, if market analysis is coordinated to
show the interaction of these time frames, then the trader can monitor what happens when the support and resistance
lines of the different time frames coincide. Assuming we are interested in analysing the potential of the integration of
time frames, then we need to look at both a higher time frame and a lower time frame.

2.1.5.3 The limitation of indicators

Since each indicator has a significant failure rate, the random nature of price change being one reason why indicators
fail, Chande et al. [1994] explained how most traders developed several indicators to analyse prices. Traders use
multiple indicators to confirm the signal of one indicator with respect to another one, believing that the consensus is
more likely to be correct. However, this is not a viable approach due to the strong similarities existing between price
based momentum indicators. In general, momentum based indicators fail most of the time because
• none of them is a pure momentum oscillator that measures momentum directly.
• the time period of the calculations is fixed, giving a different picture of market action for different time periods.
• they all mirror the price pattern, so that it may be better to trade prices themselves.
• they do not consistently show extremes in prices because they use a constant time period.
• the smoothing mechanism introduces lags and obscures short-term price extremes that are valuable for trading.
2.1.5.4 The risk of overfitting

While a simplified representation of reality can either be descriptive or predictive in nature, or both, financial models
are predictive: they forecast unknown or future values based on current or known values using mathematical equations or
sets of rules. However, the forecasting power of a model is limited by the appropriateness of the inputs and assumptions
so that one must identify the sources of model risk to understand these limitations. Model risk generally occurs as
a result of incorrect assumptions, model identification or specification errors, inappropriate estimation procedures, or
in models used without satisfactory out-of-sample testing. For instance, some models can be very sensitive to small
changes in inputs, resulting in big changes in outputs. Further, a model may be overfitted, meaning that it captures
the underlying structure or the dynamics in the data as well as random noise. This generally occurs when too many
model parameters are used, restricting the degrees of freedom relative to the size of the sample data. It often results
in good in-sample fit but poor out-of-sample behaviour. Hence, while an incorrect or misspecified model can be made
to fit the available data by systematically searching the parameter space, it has no descriptive or predictive
power. Familiar examples of such problems include the spurious correlations popularised in the media, such as the observation that,
over the past 30 years, when the winner of the Super Bowl championship in American football came from a particular league, a
leading stock market index went up in the following months. Similar examples are plentiful in economics
and the social sciences, where data are often relatively sparse but models and theories to fit the data are relatively
prolific. In economic time series prediction, there may be a relatively short time-span of historical data available in
conjunction with a large number of economic indicators. One particularly humorous example of this type of prediction
was provided by Leinweber who achieved almost perfect prediction of annual values of the S&P 500 financial index
as a function of annual values from previous years for butter production, cheese production, and sheep populations
in Bangladesh and the United States. There are no easy technical solutions to this problem, even though various
strategies have been developed. In order to avoid model overfitting and data snooping, one should decide upon the
framework by defining how the model should be specified before beginning to analyse the actual data. By first formulating model hypotheses that make financial or economic sense, and then carefully determining the number
of independent variables in a regression model, or the number of factors and components in a stochastic model, one
can expect to avoid or reduce storytelling and data mining. To increase confidence in a model, true out-of-sample
studies of model performance should be conducted after the model has been fully developed. We should also be more
comfortable with a model working cross-sectionally and producing similar results in different countries. Note, in a
general setting, sampling, bootstrapping, and randomisation techniques can be used to evaluate whether a given model
has predictive power over a benchmark model (see White [2000]). All forecasting models should be monitored and
compared on a regular basis, and deteriorating results from a model or variable should be investigated and understood.
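The danger of systematically searching the parameter space can be illustrated on data that contain no structure at all. In the sketch below, which is our own construction, returns are pure Gaussian noise, a hypothetical momentum rule is evaluated for 60 lookback windows on the first half of the sample, and the window with the best in-sample profit is then applied to the second half. The selected window is the best in-sample by construction, while its out-of-sample profit has no reason to be similar since the data are random.

```python
import random

def momentum_pnl(returns, lookback):
    """PnL of holding the sign of the sum of the previous `lookback` returns."""
    pnl = 0.0
    for t in range(lookback, len(returns)):
        signal = sum(returns[t - lookback:t])
        position = 1 if signal > 0 else -1
        pnl += position * returns[t]
    return pnl

rng = random.Random(0)  # fixed seed so that the run is reproducible
returns = [rng.gauss(0.0, 0.01) for _ in range(2000)]  # noise by construction
in_sample, out_sample = returns[:1000], returns[1000:]

# Search the parameter space in-sample, then check the winner out-of-sample.
candidates = range(1, 61)
best = max(candidates, key=lambda n: momentum_pnl(in_sample, n))
print("selected lookback:", best)
print("in-sample PnL:", momentum_pnl(in_sample, best))
print("out-of-sample PnL:", momentum_pnl(out_sample, best))
```

Keeping the second half of the data untouched until the rule is fully specified is what makes the last line a true out-of-sample check.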
2.1.5.5 Evaluating trading system performance

We define a successful strategy to be one that maximises the number of profitable days, as well as positive average
profits over a substantial period of time, coupled with reasonably consistent behaviour. As a result, while we must
look at profit when evaluating trading system performance, we must also look at other statistics such as
• maximum drawdown: the decline from the highest point in equity to the subsequent lowest point in equity. It is the largest
amount of money the system lost before it recovered.
• longest flat time: the amount of time the system went without making money.
• average drawdown: maximum drawdown is a one-time occurrence, but the average drawdown takes all of the
yearly drawdowns into consideration.
• profit to loss ratio: it represents the magnitude of winning trade dollars to the magnitude of losing trade dollars.
As it tells us the ratio of wins to losses, the higher the ratio the better.
• average trade: the amount of profit or loss we can expect on any given trade.
• profit to drawdown ratio: risk in this statistic comes in the form of drawdown, whereas reward is in the form of
profit.
• outlier adjusted profit: the probability of monstrous wins and/or losses reoccurring being extremely slim, they
should not be included in an overall track record.
• most consecutive losses: it is the total number of losses that occurred consecutively. It gives the user an idea of
how many losing trades one may have to go through before a winner occurs.
• Sharpe ratio: it indicates the smoothness of the equity curve. The ratio is calculated by dividing the average
monthly or yearly return by the standard deviation of those returns.
• long and short net profit: as a robust system would split the profits between the long trades and the short trades,
we need to make sure that the money made is well balanced between both sides.
• percent winning months: it checks the number of winning months out of one year.
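Several of the statistics above can be computed directly from a list of trade results. The following is a minimal sketch in Python, assuming NumPy is available; the function names and the sample trade figures are hypothetical and serve only as an illustration, and the Sharpe ratio follows the simple definition given in the list (mean return over standard deviation).

```python
import numpy as np

def max_drawdown(equity):
    """Largest decline from a running peak of the equity curve (currency units)."""
    equity = np.asarray(equity, dtype=float)
    running_peak = np.maximum.accumulate(equity)
    return float(np.max(running_peak - equity))

def profit_to_loss_ratio(trade_pnl):
    """Winning trade dollars divided by the magnitude of losing trade dollars."""
    pnl = np.asarray(trade_pnl, dtype=float)
    gains = pnl[pnl > 0].sum()
    losses = -pnl[pnl < 0].sum()
    return float(gains / losses) if losses > 0 else float("inf")

def most_consecutive_losses(trade_pnl):
    """Length of the longest run of losing trades."""
    longest = current = 0
    for p in trade_pnl:
        current = current + 1 if p < 0 else 0
        longest = max(longest, current)
    return longest

def sharpe_ratio(returns):
    """Average periodic return divided by the standard deviation of those returns."""
    r = np.asarray(returns, dtype=float)
    return float(r.mean() / r.std(ddof=1))

# Hypothetical profit and loss per trade, starting from an equity of 1000
trade_pnl = [120.0, -50.0, -30.0, 200.0, -80.0, -20.0, -10.0, 90.0]
equity = 1000.0 + np.cumsum(trade_pnl)

average_trade = float(np.mean(trade_pnl))
print("maximum drawdown:", max_drawdown(equity))
print("profit to loss ratio:", profit_to_loss_ratio(trade_pnl))
print("average trade:", average_trade)
print("most consecutive losses:", most_consecutive_losses(trade_pnl))
```

The drawdown is measured here in currency units from the running peak; a percentage version would divide by the running peak instead.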

2.2 Portfolio construction

We assume that the assets have already been selected, but we do not know the allocations, and we try to make the
best choice for the portfolio weights. In his article about the St. Petersburg paradox, Bernoulli [1738-1954] argued
that risk-averse investors will want to diversify: “... it is advisable to divide goods which are exposed to some
small danger into several portions rather than to risk them all together”. Later, Fisher [1906] suggested variance
as a measure of economic risk, which led investors to allocate their portfolio weights by minimising the variance
of the portfolio subject to several constraints. For instance, the theory of mean-variance based portfolio selection,
proposed by Markowitz [1952], assumes that rational investors choose among risky assets purely on the basis of
expected return and risk, where risk is measured as variance. Markowitz concluded that rational investors should
diversify their investments in order to reduce the respective risk and increase the expected returns. The author’s
assumption is that, for a well diversified portfolio, the risk of an individual investment, taken as the average deviation
from the mean, makes a minor contribution to the overall portfolio risk. Instead, it is the difference (covariance) between
individual investments’ levels of risk that determines the global risk. Based on this assumption, Markowitz provided
a mathematical model which can easily be solved by meta-heuristics such as Simulated Annealing (SA) or Genetic
Algorithm (GA). Solutions based on this model focus their goals on optimising either a single objective, the risk
inherent to the portfolio, or two conflicting objectives, the global risk and the expected returns of the securities within
the portfolio. That is, a portfolio is considered mean-variance efficient
• if it minimises the variance for a given expected mean return, or,
• if it maximises the expected mean return for a given variance.
On a theoretical ground, mean-variance efficiency assumes that
• investors exhibit quadratic utility, ignoring non-normality in the data, or
• returns are multivariate normal, such that all higher moments are irrelevant for the utility function.

2.2.1 The problem of portfolio selection

The expected return on a linear portfolio being a weighted sum of the returns on its constituents, we denote the
expected return by $\mu_p = w^\top E[r]$, where
$$E[r] = (E[r_1], \ldots, E[r_N])^\top \quad \text{and} \quad w = (w_1, \ldots, w_N)^\top$$
are the vectors of expected returns on $N$ risky assets and portfolio weights. The variance of a linear portfolio has the
quadratic form $\sigma^2 = w^\top Q w$ where $Q$ is the covariance matrix of the asset returns. In practice, we can either
• minimise portfolio variance for all portfolios ranging from minimum return to maximum return to trace out an
efficient frontier, or
• construct optimal portfolios for different risk-tolerance parameters, and by varying the parameters, find the efficient frontier.
2.2.1.1 Minimising portfolio variance

We assume that the elements of the $N \times 1$ vector $w$ are all non-negative and sum to 1. We can write the $N \times N$
covariance matrix $Q$ as $Q = DCD$ where $D$ is the $N \times N$ diagonal matrix of standard deviations and $C$ is the
correlation matrix of the asset returns (see details in Appendix (A.6)). One can show that whenever asset returns are
less than perfectly correlated, the risk from holding a long-only portfolio will be less than the weighted sum of the
component risks. We can write the variance of the portfolio return $R$ as
$$Q(R) = w^\top Q w = w^\top DCD w = x^\top C x$$
where
$$x = Dw = (w_1 \sigma_1, \ldots, w_N \sigma_N)^\top$$
is a vector where each portfolio weight is multiplied by the standard deviation of the corresponding asset return. If all
asset returns are perfectly correlated, then every element of $C$ equals 1, that is $C = \mathbf{1}\mathbf{1}^\top$ where $\mathbf{1}$ is the $N \times 1$ vector of ones, and the volatility of the portfolio becomes
$$(w^\top Q w)^{\frac{1}{2}} = w_1 \sigma_1 + \ldots + w_N \sigma_N$$
where the standard deviation of the portfolio return is the weighted sum of the asset return standard deviations. However, when some asset returns have less than perfect correlation, then $C$ has elements less than 1. As the portfolio is
long-only, the vector $x$ has non-negative elements, and we get
$$Q(R) = w^\top Q w = x^\top C x \leq x^\top \mathbf{1}\mathbf{1}^\top x = (w_1 \sigma_1 + \ldots + w_N \sigma_N)^2$$
which is an upper bound for the portfolio variance. It corresponds to the Principle of Portfolio Diversification (PPD).
Maximum risk reduction for a long-only portfolio occurs when correlations are highly negative. However, if the
portfolio contains short positions, we want the short positions to have a high positive correlation with the long positions
for the maximum diversification benefit. The PPD implies that investors can make their net specific risk very small by
holding a large portfolio with many assets. However, they are still exposed to irreducible risk since the exposure to a
general market risk factor is common to all assets.
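The diversification bound can be checked numerically. The sketch below assumes NumPy and uses hypothetical volatilities, weights and correlations for three assets; it builds the covariance matrix from the volatilities and correlations, and compares the portfolio volatility with the weighted sum of the asset volatilities.

```python
import numpy as np

# Hypothetical inputs for three assets (illustration only)
sigma = np.array([0.20, 0.30, 0.25])          # asset volatilities
w = np.array([0.5, 0.3, 0.2])                 # long-only weights summing to 1
C = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.5],
              [0.1, 0.5, 1.0]])               # correlation matrix

D = np.diag(sigma)                            # diagonal matrix of standard deviations
Q = D @ C @ D                                 # covariance matrix
x = D @ w                                     # weights scaled by volatilities

port_vol = float(np.sqrt(w @ Q @ w))          # portfolio volatility
upper_bound = float(x.sum())                  # weighted sum of asset volatilities

# With perfect correlation every element of the correlation matrix equals 1
C_perfect = np.ones((3, 3))
vol_perfect = float(np.sqrt(x @ C_perfect @ x))

print("portfolio volatility:", port_vol)
print("weighted sum of volatilities:", upper_bound)
print("volatility under perfect correlation:", vol_perfect)
```

In this setting the portfolio volatility lies below the weighted sum, and the two coincide only when all correlations are set to one.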
We obtain the minimum variance portfolio (MVP) when the portfolio weights are chosen so that the portfolio
variance is as small as possible. That is
$$\min_w \; w^\top Q w \tag{2.2.1}$$
with the constraint
$$\sum_{i=1}^{N} w_i = 1$$
in the case of a long-only portfolio. Any constraint on the portfolio weights restricts the feasible set of solutions to the
minimum variance problem. In the case of the single constraint above, the solution to the MVP is given by
$$\tilde{w}_i = \psi_i \Big( \sum_{j=1}^{N} \psi_j \Big)^{-1}$$
where $\psi_i$ is the sum of all the elements in the $i$th column of $Q^{-1}$. The portfolio with these weights is called the global
minimum variance portfolio with variance
$$V^* = \Big( \sum_{i=1}^{N} \psi_i \Big)^{-1}$$

In general, there is no analytic solution to the MVP when more constraints are added. While the MVP ignores the
return characteristics of the portfolio, more risk may be perfectly acceptable for higher returns. As a result, Markowitz
[1952] [1959] considered adding another constraint to the MVP by allowing the portfolio to meet or exceed a target
level of return $R$, leading to the optimisation problem of solving Equation (2.2.1) subject to the constraints
$$\sum_{i=1}^{N} w_i = 1 \quad \text{and} \quad w^\top E[r] = R$$
where $R$ is a target level for the portfolio return. Using Lagrange multipliers we can obtain the solution
analytically.
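As an illustration, the global minimum variance weights and the target-return problem can be sketched in a few lines of Python. We assume NumPy, a hypothetical covariance matrix and vector of expected returns, and we solve the target-return problem through the linear system given by the first-order conditions of the Lagrangian, which is one of several equivalent ways of writing the analytic solution. No long-only restriction is enforced, so weights may turn out negative for other inputs.

```python
import numpy as np

def global_min_variance(Q):
    """Closed-form weights and variance of the global minimum variance portfolio."""
    psi = np.linalg.inv(Q).sum(axis=0)     # psi_i: sum of the i-th column of Q^{-1}
    total = psi.sum()
    return psi / total, 1.0 / total

def min_variance_with_target(Q, mu, target):
    """Minimise w'Qw subject to sum(w) = 1 and w'mu = target.

    The first-order conditions 2Qw + A'gamma = 0 and Aw = c form a linear
    system in the weights w and the two multipliers gamma.
    """
    n = len(mu)
    A = np.vstack([np.ones(n), mu])                    # 2 x N constraint matrix
    kkt = np.block([[2.0 * Q, A.T],
                    [A, np.zeros((2, 2))]])
    rhs = np.concatenate([np.zeros(n), [1.0, target]])
    return np.linalg.solve(kkt, rhs)[:n]               # keep the weights only

# Hypothetical covariance matrix and expected returns for three assets
Q = np.array([[0.0400, 0.0180, 0.0050],
              [0.0180, 0.0900, 0.0375],
              [0.0050, 0.0375, 0.0625]])
mu = np.array([0.05, 0.08, 0.07])

w_gmv, v_gmv = global_min_variance(Q)
w_tgt = min_variance_with_target(Q, mu, target=0.07)

print("GMV weights:", w_gmv, "variance:", v_gmv)
print("target-return weights:", w_tgt, "variance:", w_tgt @ Q @ w_tgt)
```

Since the target-return portfolio satisfies one more constraint, its variance cannot be lower than that of the global minimum variance portfolio.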

2.2.1.2 Maximising portfolio return

An alternative approach is to maximise portfolio return by defining the utility function
$$U = \mu_p - \frac{1}{2\lambda} \sigma^2 = w^\top \mu - \frac{1}{2\lambda} w^\top Q w$$
where $\mu = E[r]$. We let $\lambda$ be a risk-tolerance parameter, and compute the optimal solution by taking the first derivative
with respect to portfolio weights, setting the term to zero
$$\frac{dU}{dw} = \mu - \frac{1}{2\lambda} 2Qw = \mu - \frac{1}{\lambda} Qw = 0$$
and solving for the optimal vector $w^*$, getting
$$w^* = \lambda Q^{-1} \mu$$
To be more realistic, we introduce general linear constraints of the form $Aw = b$, where $A$ is an $M \times N$ matrix,
$N$ is the number of assets, $M$ is the number of equality constraints, and $b$ is an $M \times 1$ vector of limits. We now
maximise
$$U = w^\top \mu - \frac{1}{2\lambda} w^\top Q w \quad \text{subject to} \quad Aw = b$$
We can write the Lagrangian
$$L = w^\top \mu - \frac{1}{2\lambda} w^\top Q w - \delta^\top (Aw - b)$$
where $\delta$ is the $M \times 1$ vector of Lagrangian multipliers (one for each constraint). Taking the first derivatives with
respect to the weight vector and the vector of multipliers yields
$$\frac{dL}{dw} = \mu - \frac{1}{\lambda} Qw - A^\top \delta = 0 \;, \quad w^* = \lambda Q^{-1} (\mu - A^\top \delta)$$
$$\frac{dL}{d\delta} = -(Aw - b) = 0 \;, \quad Aw = b$$
From the above equations, we obtain
$$\lambda A Q^{-1} \mu - b = \lambda A Q^{-1} A^\top \delta$$
$$\delta = (A Q^{-1} A^\top)^{-1} A Q^{-1} \mu - \frac{1}{\lambda} (A Q^{-1} A^\top)^{-1} b$$
Replacing in the derivative of the Lagrangian, we get the optimal solution under linear equality constraints
$$w^* = Q^{-1} A^\top (A Q^{-1} A^\top)^{-1} b + \lambda Q^{-1} \big( \mu - A^\top (A Q^{-1} A^\top)^{-1} A Q^{-1} \mu \big)$$
The optimal solution is split into a (constrained) minimum variance portfolio and a speculative portfolio. It is called
a two-fund separation because the first term does not depend on expected returns or on risk tolerance, and the second
term is sensitive to both inputs. To test for the significance of the difference between the constrained and unconstrained optimisation we
can use the Sharpe ratio $R_S$. Assuming an unconstrained optimisation with $N'$ assets and a constrained optimisation
with only $N$ assets ($N' > N$) we can use the measure
$$\frac{(T - N')(N' - N) \big( R_S^2(N') - R_S^2(N) \big)}{1 + R_S^2(N)} \sim F_{N', T-(N'+N+1)}$$
where $T$ is the number of observations. This statistic is F-distributed.
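The two-fund separation of the constrained solution can be illustrated numerically. The sketch below assumes NumPy, a hypothetical covariance matrix and expected returns, and a single budget constraint (weights summing to one) written as $Aw = b$; the helper name `constrained_optimum` is ours.

```python
import numpy as np

def constrained_optimum(Q, mu, lam, A, b):
    """Return the two funds of the optimum under the equality constraints Aw = b.

    The first fund depends only on Q, A and b (constrained minimum variance);
    the second depends on expected returns and on the risk tolerance lam.
    """
    Q_inv = np.linalg.inv(Q)
    M = np.linalg.inv(A @ Q_inv @ A.T)                    # (A Q^{-1} A')^{-1}
    min_var_part = Q_inv @ A.T @ M @ b
    speculative_part = lam * (Q_inv @ (mu - A.T @ M @ A @ Q_inv @ mu))
    return min_var_part, speculative_part

# Hypothetical inputs for three assets
Q = np.array([[0.0400, 0.0180, 0.0050],
              [0.0180, 0.0900, 0.0375],
              [0.0050, 0.0375, 0.0625]])
mu = np.array([0.05, 0.08, 0.07])
A = np.ones((1, 3))          # one constraint: weights sum to b
b = np.array([1.0])
lam = 0.5                    # risk-tolerance parameter

w_unconstrained = lam * (np.linalg.inv(Q) @ mu)           # w* = lam Q^{-1} mu
min_var_part, speculative_part = constrained_optimum(Q, mu, lam, A, b)
w_star = min_var_part + speculative_part

print("unconstrained optimum:", w_unconstrained)
print("minimum variance fund:", min_var_part)
print("speculative fund:", speculative_part)
print("constrained optimum:", w_star)
```

With a budget constraint only, the first fund is the global minimum variance portfolio and the speculative fund is a zero-investment portfolio, so that the constraint holds for any value of the risk tolerance.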

2.2.1.3 Accounting for portfolio risk

While there are many competing allocation procedures such as Markowitz portfolio theory (PT), or risk budgeting
methods to name a few, in all cases risk must be decided. Scherer [2007] argued that portfolio construction, using
various portfolio optimisation tools to assess expected return versus expected risk, is equivalent to risk budgeting. In
both cases, investors have to trade off risk and return in an optimal way. Even though the former is an allocation
either in nominal dollar terms or in percentage weights, and the latter arrives at risk exposures expressed in terms
of value at risk (VaR) or percentage contributions to risk, this is just a presentational difference. While the average
target volatility of the portfolio is closely related to the risk aversion of the investors, this amount is not constant
over time. In general, any consistent investment process should measure and control the global risk of a portfolio.
Nonetheless, it seems that full risk-return optimisation at the portfolio level is only done in the most quantitative firms,
and that portfolio management remains a pure judgemental process based on qualitative, not quantitative, assessments.
Portfolio managers developed risk measures to represent the level of risk in a particular portfolio, where risk is defined
as underperformance relative to a mandate. In the financial industry, there is a large variety of risk indicators, and
portfolio managers must decide which ones to consider. For instance, one may consider the maximum drawdown of
the cumulative profit or total open equity of a financial trading strategy. One can also consider performance measures
such as the Sharpe ratio, the Burke ratio, the Calmar ratio and many more. In practice, portfolio
managers must consider other issues such as execution policies and transaction cost management on a regular basis.
The main reason for using qualitative measures is the difficulty of applying optimisation technology in practice.
For instance, the classical mean-variance optimisation is very sensitive to inputs such as expected returns of each asset
and their covariance matrix. Chopra et al. [1993] studied the sensitivity of classical
mean-variance optimisation to errors in the input parameters. According to their research, errors in expected returns have a much
bigger influence on the performance than errors in variances or covariances. This leads to optimal portfolios having
extreme or non-intuitive weights for some of the individual assets. Consequently, practitioners added constraints to the
original problem to limit or reduce these drawbacks, resulting in an optimum portfolio dominated by the constraints.
Additional problems to portfolio optimisation exist, such as
• poor model ex-post performance, coupled with the risk of maximising error rather than minimising it.
• difficulty in estimating a stable covariance matrix for a large number of assets.
• sensitivity of portfolio weights to small changes in forecasts.
Different methods exist to make the portfolio allocation process more robust to different sources of risk (estimation
risk, model risk etc.) among which are
• Bayesian approaches
• Robust Portfolio Allocation
In the classical approach future expected returns are estimated by assuming that the true expected returns and covariances of returns are unknown and fixed. Hence, a point estimate of expected returns and (co)variances is obtained
using forecasting models of observed market data, influencing the mean-variance portfolio allocation decision by the
estimation error of the forecasts. Once the expected returns and the covariance matrix of returns have been estimated,
the portfolio optimisation problem is typically treated and solved as a deterministic problem with no uncertainty.
A more realistic model would incorporate the uncertainty of expected returns and risk into the optimisation problem.
One way forward is to choose an optimum portfolio under different scenarios that is robust to some worst-case model
misspecification. The goal of the Robust Portfolio Allocation (RPA) framework is to get a portfolio, which will
perform well under a number of different scenarios instead of one scenario. However, to obtain such a portfolio
the investor has to give up some performance under the most likely scenarios to have some insurance for the less
likely scenarios. In order to construct such portfolios, a distribution of expected returns is necessary instead of a point
estimate. One method to obtain such distributions is the Bayesian method, which assumes that the true expected returns
are unknown and random. A prior distribution is used, reflecting the investor’s knowledge about the probability before
any data are observed. The posterior distribution, computed with Bayes’ formula, is based on the knowledge of
the prior probability distribution plus the new data. For instance, Black et al. [1990] estimated future expected
returns by combining market equilibrium (CAPM equilibrium) with an investor’s views. The Bayesian framework
allows forecasting systems to use external information sources and subjective interventions in addition to traditional
information sources. The only restriction is that additional information is combined with the model following the
laws of probability (see Carter et al. [1994]).
One alternative approach, discussed by Focardi et al. [2004], is to use a Monte Carlo technique by sampling from the
return distribution and averaging the resulting portfolios. In this method a set of returns is drawn iteratively from the
expected return distribution. In each iteration, a mean-variance optimisation is run on the set of expected returns. The
robust portfolio is then the average of all the portfolios created in the different iterations. Although this method will
create portfolios that are more or less robust, it is computationally very expensive because an optimisation must be run
for each iteration step. Furthermore there is no guarantee that the resulting average portfolio will satisfy the constraints
on which the original portfolios are created. Note, in the Robust Portfolio Allocation approach, the portfolio is not
created with an iterative process but the distribution of the expected returns is directly taken into account, resulting in a
single optimisation process. Therefore this approach is computationally more effective than the Monte Carlo process.
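The Monte Carlo averaging described above can be sketched as follows. This is only an illustration under strong assumptions of ours: expected returns are drawn from a normal distribution centred on the point estimate with covariance $Q / T$ (the sampling distribution of a sample mean over $T$ observations), and each iteration uses the unconstrained optimum $w^* = \lambda Q^{-1} \mu$ of Section (2.2.1.2) in place of a full numerical optimisation. The function name and inputs are hypothetical.

```python
import numpy as np

def resampled_portfolio(mu_hat, Q, lam, n_obs, n_draws, seed=0):
    """Average the mean-variance optima obtained from resampled expected returns."""
    rng = np.random.default_rng(seed)
    Q_inv = np.linalg.inv(Q)
    weights = np.empty((n_draws, len(mu_hat)))
    for k in range(n_draws):
        # Draw one plausible vector of expected returns around the point estimate
        mu_k = rng.multivariate_normal(mu_hat, Q / n_obs)
        # One (unconstrained) mean-variance optimisation per draw
        weights[k] = lam * (Q_inv @ mu_k)
    return weights.mean(axis=0), weights

# Hypothetical inputs for three assets
Q = np.array([[0.0400, 0.0180, 0.0050],
              [0.0180, 0.0900, 0.0375],
              [0.0050, 0.0375, 0.0625]])
mu_hat = np.array([0.05, 0.08, 0.07])

w_avg, draws = resampled_portfolio(mu_hat, Q, lam=1.0, n_obs=120, n_draws=2000)
w_plugin = np.linalg.inv(Q) @ mu_hat

print("averaged weights:", w_avg)
print("plug-in weights:", w_plugin)
print("dispersion of weights across draws:", draws.std(axis=0))
```

In this unconstrained case the optimum is linear in the expected returns, so the averaged portfolio stays close to the plug-in solution and the main output of interest is the dispersion of the weights across draws. The averaging changes the allocation materially only once inequality constraints, such as long-only weights, enter each optimisation.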

2.3 A market equilibrium theory of asset prices

In the problem of portfolio selection, the mean-variance approach introduced by Markowitz [1952] is a simple tradeoff between return and uncertainty, where one is left with the choice of one free parameter, the amount of variance
acceptable to the individual investor. For proofs and rigorous introduction to the mean-variance portfolio technique
see Huang et al. [1988]. For a retrospective on Markowitz’s portfolio selection see Rubinstein [2002]. Investment
theory based on growth is an alternative to utility theory with a simple goal. Following this approach, Kelly [1956]
used the role of time in multiplicative processes to solve the problem of portfolio selection.

2.3.1 The capital asset pricing model

2.3.1.1 Markowitz solution to the portfolio allocation problem

We showed in Section (2.2.1) how a rational investor should allocate his funds between the different risky assets in his
universe, leading to the portfolio allocation problem. To solve this problem Markowitz [1952] introduced the concept
of utility functions (see Section (1.6.1) and Appendix (A.7)) to express investor’s risk preferences. Markowitz first
considered the rule that the investor should maximise discounted expected, or anticipated returns (which is linked to
the St. Petersburg paradox). However, he showed that the law of large numbers (see Bernoulli [1713]) can not apply to a
portfolio of securities since the returns from securities are too intercorrelated. That is, diversification can not eliminate
all variance. Hence, rejecting the first hypothesis, he then considered the rule that the investor should consider expected
return a desirable thing and variance of return an undesirable thing. Note, Marschak [1938] suggested using the means
and covariance matrix of consumption of commodities as a first order approximation to utility.
We saw in Section (2.2.1) that the mean-variance efficient portfolios are obtained as the solution to a quadratic
optimisation program. Its theoretical justification requires either a quadratic utility function or some fairly restrictive
assumptions on the class of return distribution, such as the assumption of normally distributed returns. For instance, we
assume zero transaction costs and portfolios with prices $V_t$ taking values in $\mathbb{R}$ and following the geometric Brownian
motion with dynamics under the historical probability measure $\mathbb{P}$ given by
$$\frac{dV_t}{V_t} = \mu \, dt + \sigma_V \, dW_t \tag{2.3.2}$$
where $\mu$ is the drift, $\sigma_V$ is the volatility and $W_t$ is a standard Brownian motion. Markowitz first considered the problem
of maximising the expected rate of return
$$g = \frac{1}{dt} \left\langle \frac{dV_t}{V_t} \right\rangle = \mu$$

(also called ensemble average growth rate) and rejected such strategy because the portfolio with maximum expected
rate of return is likely to be under-diversified, and as a result, to have an unacceptably high volatility. As a result, he
postulated that while diversification would reduce risk, it would not eliminate it, so that an investor should maximise
the expected portfolio return $\mu$ while minimising portfolio variance of return $\sigma_V^2$. This follows from the relation between
the variance of the return of the portfolio $\sigma_V^2$ and the variance of return of its constituent securities $\sigma_j^2$ for $j = 1, 2, \ldots, N$
given by
$$\sigma_V^2 = \sum_j w_j^2 \sigma_j^2 + \sum_j \sum_{k \neq j} w_j w_k \rho_{jk} \sigma_j \sigma_k$$
where the $w_j$ are the portfolio weights such that $\sum_j w_j = 1$, and $\rho_{jk}$ is the correlation of the returns of securities $j$
and $k$. Therefore, $\rho_{jk} \sigma_j \sigma_k$ is the covariance of their returns. So, the decision to hold any security would depend on
what other securities the investor wants to hold. That is, securities can not be properly evaluated in isolation, but only
as a group. Consequently, Markowitz suggested calling portfolio i efficient if
1. there exists no other portfolio j in the market with equal or smaller volatility, σj ≤ σi , whose drift term µj
exceeds that of portfolio i. That is, for all j such that σj ≤ σi , we have µj ≤ µi .
2. there exists no other portfolio j in the market with equal or greater drift term, µj ≥ µi , whose volatility σj is
smaller than that of portfolio i. That is, for all j such that µj ≥ µi , we have σj ≥ σi .
In the presence of a riskless asset (with σi = 0), all efficient portfolios lie along a straight line, the efficient frontier,
intersecting in the space of volatility and drift terms, the riskless asset rf and the so-called market portfolio M . Since
any point along the efficient frontier represents an efficient portfolio, additional information is needed in order to select
the optimal portfolio. For instance, one can specify the usefulness or desirability of a particular investment outcome to
a particular investor, namely his risk preference, and represent it with a utility function u = u(Vt ). Following the work
of von Neumann et al. [1944] and Savage [1954], Markowitz [1959] found a way to reconcile his mean-variance
criterion with the maximisation of the expected utility of wealth after many reinvestment periods. He advised using the
strategy of maximising the expected logarithmic utility of return each period for investors with a long-term horizon,
and developed a quadratic approximation to this strategy allowing the investor to choose portfolios based on mean and
variance.
As an alternative to the problem of portfolio selection, Kelly [1956] proposed to maximise the expected growth
rate
$$g_b = \frac{1}{dt} \langle d \ln V_t \rangle = \mu - \frac{1}{2} \sigma_V^2$$
obtained by using Ito’s formula (see Peters [2011c]). This rate is called the expected growth rate, or the logarithmic
geometric mean rate of return (also called the time average growth rate). In that setting, we observe that large returns
and small volatilities are desirable. Note, Ito’s formula changes the behaviour in time without changing the noise
term. That is, Ito’s formula encodes the multiplicative effect of time (for noise terms) in the ensemble average (see
Oksendal [1998]). Hence, it can be seen as a means of accounting for the effects of time. For self-financing portfolios,
where eventual outcomes are the product over intermediate returns, maximising $g_b$ yields meaningful results. This
is because it is equivalent to using the logarithmic utility function $u(V_t) = \ln V_t$. In that setting, the rate of change of
the ensemble average utility happens to be the time average of the growth rate in a multiplicative process. Note, the
problem that additional information is needed to select the right portfolio disappears when using the expected growth
rate $g_b$. That is, there is no need to use a utility function to express risk preferences as one can use solely the role of time
in multiplicative processes.
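The gap between the two growth rates can be observed on simulated data. The sketch below assumes NumPy and hypothetical values for the drift and volatility; it simulates the exact log-increments of the geometric Brownian motion in Equation (2.3.2) over a long horizon, and compares the arithmetic average of simple returns (an estimate of the ensemble average rate $\mu$) with the average of log returns (an estimate of the time average rate $\mu - \frac{1}{2} \sigma_V^2$).

```python
import numpy as np

mu, sigma = 0.08, 0.30                 # hypothetical annual drift and volatility
dt = 1.0 / 252                         # daily time step in years
n_steps = 252 * 400                    # a long horizon to reduce sampling noise
rng = np.random.default_rng(42)

# Exact log-increments of geometric Brownian motion over one time step
dW = rng.normal(0.0, np.sqrt(dt), size=n_steps)
log_increments = (mu - 0.5 * sigma**2) * dt + sigma * dW

# Arithmetic mean of simple returns per unit of time
simple_returns = np.expm1(log_increments)
ensemble_rate = simple_returns.mean() / dt

# Mean of log returns per unit of time, i.e. (1/T) ln(V_T / V_0)
time_avg_rate = log_increments.mean() / dt

print("ensemble average rate:", ensemble_rate, "theory:", mu)
print("time average rate:", time_avg_rate, "theory:", mu - 0.5 * sigma**2)
print("difference:", ensemble_rate - time_avg_rate, "theory:", 0.5 * sigma**2)
```

Both estimates are noisy, but their difference is estimated much more precisely and sits close to $\frac{1}{2} \sigma_V^2$, which is the correction produced by Ito’s formula.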
2.3.1.2 The Sharpe-Lintner CAPM

As discussed in Section (2.2.1), in equilibrium and under proper diversification, market prices are made of the risk-free
rate and the price of risk in such a way that an investor can attain any desired point along a capital market line (CML).
Higher expected rate of return can be obtained only by incurring additional risk. In view of properly describing the
price of risk, Sharpe [1964] extended the model of investor behaviour (see Arrow [1952] and Markowitz [1952]) to
construct a market equilibrium theory of asset prices under conditions of risk. That is, the purpose of the capital asset
pricing model (CAPM) is to deduce how to price risky assets when the market is in equilibrium. The conditions under
which a risky asset may be added to an already well diversified portfolio depend on the Systematic Risk of the asset,
also called the undiversifiable risk of the asset. Assuming that an investor views the possible result of an investment
in terms of some probability distribution, he might consider the first two moments of the distribution represented by
the total utility function
$$u = f(E_w, \sigma_w)$$
where $E_w$ is the expected future wealth and $\sigma_w$ the predicted standard deviation, and such that $\frac{du}{dE_w} > 0$ and
$\frac{du}{d\sigma_w} < 0$. Letting $W_i$ be the quantity of the investor’s present wealth and $W_t$ be his terminal wealth, we get
$$W_t = W_i (1 + R)$$
where $R$ is the rate of return on the investment. One can therefore express the utility function in terms of $R$, getting
$$u = g(E_R, \sigma_R)$$
so that the investor can choose from a set of investment opportunities, represented by a point in the $(E_R, \sigma_R)$ plane
(with $E_R$ on the x-axis and $\sigma_R$ on the y-axis), the one maximising his utility. Both Markowitz [1952] [1959] and Tobin
[1958] derived the indifference curves by maximising the expected utility with total utility represented by a quadratic
function of R with decreasing marginal utility. The investor will choose the plan placing him on the indifference curve
representing the highest level of utility. A plan is said to be efficient if and only if there is no alternative with either
1. the same ER and a lower σR
2. the same σR and a higher ER
3. a higher ER and a lower σR
For example, in the case of two investment plans A and B, each with one or more assets, such that $\alpha$ is the proportion
of the individual’s wealth placed in plan A and $(1 - \alpha)$ in plan B, the expected rate of return is
$$E_{R_c} = \alpha E_{R_a} + (1 - \alpha) E_{R_b}$$
and the predicted standard deviation of return is
$$\sigma_{R_c} = \sqrt{\alpha^2 \sigma_{R_a}^2 + (1 - \alpha)^2 \sigma_{R_b}^2 + 2 \rho_{ab} \alpha (1 - \alpha) \sigma_{R_a} \sigma_{R_b}} \tag{2.3.3}$$
where $\rho_{ab}$ is the correlation between $R_a$ and $R_b$. In case of perfect correlation between the two plans ($\rho_{ab} = 1$), both
$E_{R_c}$ and $\sigma_{R_c}$ are linearly related to the proportions invested in the two plans and the standard deviation simplifies to
$$\sigma_{R_c} = \sigma_{R_b} + \alpha (\sigma_{R_a} - \sigma_{R_b})$$
Considering the riskless asset $P$ with $\sigma_{R_p} = 0$, and an investor placing $\alpha$ of his wealth in $P$ and the remainder in the risky
asset $A$, we obtain the expected rate of return
$$E_{R_c} = \alpha E_{R_p} + (1 - \alpha) E_{R_a}$$
and the standard deviation reduces to
$$\sigma_{R_c} = (1 - \alpha) \sigma_{R_a}$$
such that all combinations involving any risky asset or combination of assets with the riskless asset must have the
values $(E_{R_c}, \sigma_{R_c})$ lying along a straight line between the points representing the two components. To prove it, we set
$(1 - \alpha) = \frac{\sigma_{R_c}}{\sigma_{R_a}}$ and replace in the expected rate of return, getting
$$E_{R_c} = E_{R_p} + \frac{E_{R_a} - E_{R_p}}{\sigma_{R_a}} \sigma_{R_c}$$
Remark 2.3.1 The investment plan lying at the point of the original investment opportunity curve where a ray from
point P is tangent to the curve will dominate.
Since borrowing is equivalent to disinvesting, assuming that the rate at which funds can be borrowed equals the lending
rate, we obtain the same dominant curve.
To reach equilibrium conditions, Sharpe [1964] showed that by assuming a common risk-free rate with all investors
borrowing or lending funds on equal terms, and homogeneity of investor expectations, capital asset prices must keep
changing until a set of prices is attained for which every asset enters at least one combination lying on the capital
market line (CML). While many alternative combinations of risky assets are efficient, they must be perfectly positively
correlated as they lie along a linear border of the (ER , σR ) region, even though the contained individual securities are
not perfectly correlated. For individual assets, the pair (ERi , σi ) for the ith asset (with ERi on the x-axis and σi on the
y-axis) will lie above the capital market line (due to inefficiency of undiversified holdings) and be scattered throughout
the feasible region. Given a single capital asset (point i) and an efficient combination of assets (point g) of which it is
part, we can combine them in a linear way such that the expected return of the combination is
$$E = \alpha E_{R_i} + (1 - \alpha) E_{R_g}$$
In equilibrium, Sharpe obtained a tangent curve to the CML at point $g$, leading to a simple formula relating $E_{R_i}$ to
some risk in combination $g$. The standard deviation of a combination of $i$ and $g$ is given by Equation (2.3.3) with $a$
and $b$ replaced with $i$ and $g$, respectively. Further, at $\alpha = 0$ we get
$$\frac{d\sigma}{dE} = \frac{\sigma_{R_g} - \rho_{ig} \sigma_{R_i}}{E_{R_g} - E_{R_i}}$$
Letting the equation of the capital market line (CML) be
$$\sigma_R = s (E_R - P) \quad \text{or} \quad E_R = P + b \sigma_R \tag{2.3.4}$$
with $b = \frac{1}{s}$, where $P$ is the risk-free rate, and since we have a tangent line at point $g$ with the pair $(E_{R_g}, \sigma_{R_g})$ lying
on that line, we get
$$\frac{\sigma_{R_g} - \rho_{ig} \sigma_{R_i}}{E_{R_g} - E_{R_i}} = \frac{\sigma_{R_g}}{E_{R_g} - P}$$

Given a number of ex-post observations of the return of the two investments with $E_{R_f}$ approximated with $R_f$ for
$f = i, g$ and total risk $\sigma_{R_f}$ approximated with $\sigma$, we call $B_{ig}$ the slope of the regression line between the two returns,
and observe that the response of $R_i$ to changes in $R_g$ accounts for much of the variation in $R_i$. This component $B_{ig}$ of
the asset’s total risk is called the Systematic Risk, and the remainder which is uncorrelated with $R_g$ is the unsystematic
component. This relationship between $R_i$ and $R_g$ can be employed ex-ante as a predictive model where $B_{ig}$ is the
predicted response of $R_i$ to changes in $R_g$. Hence, all assets entering efficient combination $g$ have $B_{ig}$ and $E_{R_i}$ values
lying on a straight line (minimum variance condition)
$$E_{R_i} = B_{ig} (E_{R_g} - P) + P \tag{2.3.5}$$
where $P$ is the risk-free rate and
$$B_{ig} = \frac{\rho_{ig} \sigma_{R_i}}{\sigma_{R_g}} = \frac{\mathrm{Cov}(R_i, R_g)}{\sigma_{R_g}^2} \tag{2.3.6}$$
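The step from the tangency condition to Equation (2.3.5) can be spelled out as follows. It is a short sketch using only the quantities defined above: we cross-multiply the tangency condition and isolate $E_{R_i}$.

$$\begin{aligned}
\frac{\sigma_{R_g} - \rho_{ig} \sigma_{R_i}}{E_{R_g} - E_{R_i}} = \frac{\sigma_{R_g}}{E_{R_g} - P}
&\;\Longrightarrow\; (\sigma_{R_g} - \rho_{ig} \sigma_{R_i})(E_{R_g} - P) = \sigma_{R_g} (E_{R_g} - E_{R_i}) \\
&\;\Longrightarrow\; \sigma_{R_g} E_{R_i} = \sigma_{R_g} P + \rho_{ig} \sigma_{R_i} (E_{R_g} - P) \\
&\;\Longrightarrow\; E_{R_i} = P + \frac{\rho_{ig} \sigma_{R_i}}{\sigma_{R_g}} (E_{R_g} - P)
\end{aligned}$$

The coefficient multiplying $(E_{R_g} - P)$ in the last line is the slope defined in Equation (2.3.6).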

The slope $B_{ig}$, also called the CAPM beta, represents the part of an asset’s risk which is due to its correlation with the
return on a combination and can not be diversified away when the asset is added to the combination. Consequently,
it should be directly related to the expected return ERi . This result is true for any efficient combinations because
the rates of return from all efficient combinations are perfectly correlated. Risk resulting from swings in economic
activity being set aside, the theory states that after diversification, only the responsiveness of an asset’s rate of return
to the level of economic activity is relevant in assessing its risk. Therefore, prices will adjust until there is a linear
relationship between the magnitude of such responsiveness and expected return.
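The equivalent expressions for the beta in Equation (2.3.6) can be compared on simulated returns. The sketch below assumes NumPy; the return series are generated from a hypothetical linear model with a known slope of 1.3, and the risk-free rate and expected return used at the end are hypothetical as well.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs = 5000

# Hypothetical returns: combination g, and asset i responding to g with slope 1.3
r_g = rng.normal(0.01, 0.04, size=n_obs)
r_i = 0.002 + 1.3 * r_g + rng.normal(0.0, 0.03, size=n_obs)

# Beta as covariance over variance
cov = np.cov(r_i, r_g, ddof=1)
beta_cov = cov[0, 1] / cov[1, 1]

# Beta as correlation times the ratio of standard deviations
rho = np.corrcoef(r_i, r_g)[0, 1]
beta_rho = rho * r_i.std(ddof=1) / r_g.std(ddof=1)

# Beta as the slope of the regression line of r_i on r_g
slope, intercept = np.polyfit(r_g, r_i, 1)

# Expected return implied by Equation (2.3.5) for hypothetical P and E_Rg
P, e_g = 0.02, 0.08
e_i = beta_cov * (e_g - P) + P

print("beta (covariance):", beta_cov)
print("beta (correlation):", beta_rho)
print("beta (regression slope):", slope)
print("implied expected return of asset i:", e_i)
```

The three estimates agree up to rounding, since they are algebraically the same quantity computed from the same sample.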
2.3.1.3 Some criticisms and improvements of the CAPM

Fama et al. [2004] discussed the CAPM and argued that whether the model’s problems reflect weakness in the theory
or in its empirical implementation, the failure of the CAPM in empirical tests implies that most applications of the
model are invalid. The CAPM is based on the model of portfolio choice developed by Markowitz [1952] where
an investor selects a portfolio at time t − 1 that produces a stochastic return at time t. Investors are risk averse in
choosing among portfolios, and care only about the mean and variance of their one-period investment return. It results
in algebraic condition on asset weights in mean-variance efficient portfolios (see Section (2.3.1.2)). In view of making
prediction about the relation between risk and expected return Sharpe [1964] and Lintner added two assumptions,
complete agreement of all investors on the joint distribution of asset returns from t − 1 to t supposed to be the true
one, and that there is borrowing and lending at a risk-free rate. As a result, the market portfolio M (tangency portfolio)
must be on the minimum variance frontier if the asset market is to clear and satisfy Equation (2.3.5) with g replaced by
M. The slope $B_{iM}$ measures the sensitivity of the asset’s return to variation in the market return, or put another way,
it is proportional to the risk each dollar invested in asset $i$ contributes to the market portfolio. Consequently, it can
be seen as a sensitivity risk measure relative to the market risk factor. To stress the proportionality of the normalised
excess return of the risky asset with that of the market portfolio, we can rewrite Equation (2.3.5) as
$$\frac{E_{R_i} - P}{\sigma_{R_i}} = \rho_{iM} \frac{E_{R_M} - P}{\sigma_{R_M}}$$
Black [1972] developed a version of the CAPM without risk-free borrowing or lending by allowing unrestricted short
sales of risky assets. These unrealistic simplifications were tested by Fama et al. [2004]. They were faced with
measurement errors when estimating the beta of individual assets to explain average returns, and they obtained positive
correlation in the residuals, which biases the ordinary least squares (OLS) standard errors. Since the CAPM explains
security returns, it also explains portfolio returns so that one can work with portfolios rather than securities to estimate
betas. Letting wip for i = 1, .., N be the weights for the assets in some portfolio p, the expected return and market
beta for the portfolio are given by


$$ER_p = \sum_{i=1}^{N} w_{ip} ER_i \quad \text{and} \quad \beta_{pM} = \sum_{i=1}^{N} w_{ip} \beta_{iM}$$

so that the CAPM relation in Equation (2.3.5) also holds when the ith asset is a portfolio. However, grouping stocks
shrinks the range of betas and reduces statistical power, so that one should sort securities on beta when forming
portfolios where the first portfolio contains securities with the lowest betas, and so on, up to the last portfolio with the
highest beta assets (see Black et al. [1972]). Jensen [1968] argued that the CAPM relation in Equation (2.3.5) was
also a time-series regression test
$$ER_{it} = \alpha_i + B_{iM}(ER_{Mt} - R_{ft}) + R_{ft} + \epsilon_{it} \tag{2.3.7}$$

where $R_{ft}$ is the risk-free rate at time t, $\epsilon_{it}$ is assumed to be white noise, and the intercept term in the
regression, also called Jensen's alpha, is zero for each asset. In the CAPM equilibrium, no single asset may have an
abnormal return where it earns a rate of return of alpha above (or below) the risk free rate without taking any market
risk. In the case where $\alpha_i \neq 0$ for any risky asset i, the market is not in equilibrium, and pairs $(ER_i, B_{iM})$ will lie
above or below the security market line (SML) according to the sign of $\alpha_i$. If the market is not in equilibrium with an asset having a positive
alpha, it should have an expected return in excess of its equilibrium return and should be bought. Similarly, an asset
with a negative alpha has expected return below its equilibrium return, and it should be sold. In the CAPM, abnormal
returns should not continue indefinitely, and price should rise as a result of buying pressure so that abnormal profits
will vanish. Forecasting the alpha of an asset, using a regression model based on the CAPM, one can decide whether
to add it or not in a portfolio. While the CAPM is a cross-sectional model, it is common to cast the model into time
series context and to test the hypothesis
H0 : α1 = α2 = ... = 0
using historical data on the excess returns on the assets and the excess return on the market portfolio. A large number
of tests rejected the Sharpe-Lintner version of the CAPM by showing that the regressions consistently produced an intercept
greater than the average risk-free rate and a coefficient on beta less than the average excess market return (see Black et al.
[1972], Fama et al. [1973], Fama et al. [1992]). More recently, Fama et al. [2004] considered market return for
1928-2003 to estimate the predicted line, and confirmed that the relation between beta and average return is much
flatter than the Sharpe-Lintner CAPM predicts. They also tested the prediction that the market portfolio is mean-variance
efficient (differences in expected returns are entirely explained by differences in market beta) by considering additional variables with
cross-section regressions and time-series regressions. They found that standard market proxies seemed to be on the
minimum variance frontier, that is, market betas suffice to explain expected returns and the risk premium for beta is
positive, but the idea of the SL-CAPM that the premium per unit of beta is the expected market return minus the risk-free rate was consistently rejected. Nonetheless, further research on the scaling of asset prices (or ratios), such as
earnings-price, debt-equity and book-to-market ratios (B/M), showed that much of the variation in expected return was
unrelated to market beta (see Fama et al. [1992]). Among the possible explanations for the empirical failures of the
CAPM, some refer to the behaviour of investors over-extrapolating past performance, while others point to the need for
a more complex asset pricing model. For example, in the intertemporal capital asset pricing model (ICAPM) presented
by Merton [1973], investors prefer high expected return and low return variance, but they are also concerned with
the covariances of portfolio returns with state variables, so that portfolios are multifactor efficient. The ICAPM is a
generalisation of the CAPM requiring additional betas, along with a market beta, to explain expected returns, and
necessitates the specification of state variables affecting expected returns (see Fama [1996]). One approach is to derive
an extension to the CAPM equilibrium where the systematic risk of a risky asset is related to the higher moments
of the joint distribution between the return on an asset and the return on the market portfolio. Kraus et al. [1976]
considered the coskewness to capture the asymmetric nature of returns on risky asset, and Fang et al. [1997] used the
cokurtosis to capture the returns leptokurtosis. In both cases, the derivation of higher moment CAPM models is based
on the higher moment extension of the investor’s utility function derived in Appendix (A.7.4). Alternatively, to avoid
specifying state variables, Fama et al. [1993] followed the arbitrage pricing theory by Ross [1976] and considered


unidentified state variables producing undiversifiable risks (covariances) in returns not captured by the market return
and priced separately from market betas. For instance, the returns on the stocks of small firms covary more with one
another than with returns on the stocks of large firms, and returns on high B/M stocks covary more with one another
than with returns on low B/M stocks. As a result, Fama et al. [1993] [1996] proposed a three-factor model for
expected returns given by
$$E[R_{it}] - R_{ft} = \beta_{iM}(E[R_{Mt}] - R_{ft}) + \beta_{is} E[SMB_t] + \beta_{ih} E[HML_t]$$
where $SMB_t$ (small minus big) is the difference between the returns on diversified portfolios of small and big stocks,
$HML_t$ (high minus low) is the difference between the returns on diversified portfolios of high and low B/M stocks,
and the betas are the slopes in the multiple regression of $R_{it} - R_{ft}$ on $R_{Mt} - R_{ft}$, $SMB_t$ and $HML_t$. Given the
time-series regression
$$R_{it} - R_{ft} = \alpha_i + \beta_{iM}(R_{Mt} - R_{ft}) + \beta_{is} SMB_t + \beta_{ih} HML_t + \epsilon_{it}$$
the three-factor model implies that the intercept $\alpha_i$ is zero for all assets i. Estimates of $\alpha_i$ from the time-series are used to calibrate
how rapidly stock prices respond to new information, as well as to measure the special information of portfolio
managers, for instance in performance evaluation. The momentum effect of Jegadeesh et al. [1993] states that stocks doing well relative
to the market over the last three to twelve months tend to continue doing well for the next few months, and stocks doing
poorly continue to do poorly. Even though this momentum effect is not explained by the CAPM or the three-factor
model, one can add to these models a momentum factor consisting of the difference between the returns on diversified
portfolios of short-term winners and losers. For instance, Carhart [1997] proposed the four factor model
$$R_{it} - R_{ft} = \alpha_i + \beta_{iM}(R_{Mt} - R_{ft}) + \beta_{is} SMB_t + \beta_{ih} HML_t + \beta_{im} UMD_t + \epsilon_{it}$$
where $UMD_t$ is the monthly return of the style-attribution Carhart momentum factor.
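
The time-series regressions above can be illustrated on synthetic data. The following is a minimal sketch, assuming simulated monthly factor returns (the factor means, volatilities and loadings are hypothetical values chosen for illustration, not estimates from market data) and plain OLS through NumPy. The helper name `factor_regression` is an assumption of this sketch.

```python
import numpy as np

def factor_regression(excess_asset, factors):
    """OLS of asset excess returns on factor returns.

    Returns (alpha, betas), where alpha is the regression intercept
    and betas are the slopes on each factor column.
    """
    X = np.column_stack([np.ones(len(excess_asset)), factors])
    coef, *_ = np.linalg.lstsq(X, excess_asset, rcond=None)
    return coef[0], coef[1:]

# Hypothetical monthly factor returns: market excess return, SMB, HML.
rng = np.random.default_rng(0)
T = 600
factors = rng.normal([0.005, 0.002, 0.003], [0.04, 0.03, 0.03], size=(T, 3))

# Simulated asset with zero alpha, as the three-factor model implies.
true_betas = np.array([1.1, 0.4, -0.2])
excess = factors @ true_betas + rng.normal(0.0, 0.01, T)

alpha, betas = factor_regression(excess, factors)
print("alpha:", alpha)
print("betas:", betas)
```

With a zero-alpha data-generating process, the estimated intercept should be close to zero and the slopes close to the simulated loadings; a momentum column could be appended to `factors` in the same way for the four factor model.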

2.3.2 The growth optimal portfolio

The growth optimal portfolio (GOP) is a portfolio having maximal expected growth rate over any time horizon, and
as such, it is sure to outperform any other significantly different strategy as the time horizon increases. As a result, it
is an investment tool for long horizon investors. Calculating the growth optimal strategy is in general very difficult in
discrete time (in incomplete market), but it is much easier in the continuous time continuous diffusion case and was
solved by Merton [1969]. Solutions to the problem exist in a semi-explicit form and in the general case, the GOP can
be characterised in terms of the semimartingale characteristic triplet. Following Mosegaard Christensen [2011], we
briefly review the discrete time case, providing the main properties of the GOP and extend the results to the continuous
case. Details can be found in Algoet et al. [1988], Goll et al. [2000], Becherer [2001], Christensen et al. [2005].
2.3.2.1 Discrete time

Consider a market consisting of a finite number of non-dividend paying assets. The market consists of N + 1 assets,
represented by a N + 1 dimensional vector process S where
$$S = \{S(t) = (S^0(t), \dots, S^N(t)),\ t \in \{0, 1, \dots, T\}\}$$
and T is assumed to be a finite number. The first asset $S^0$ is sometimes assumed to be risk-free from one period to the
next, that is, it is a predictable process. The price of each asset is known at time t, given the information $\mathcal{F}_t$. Define
the return process
$$R = \{R(t) = (R^0(t), \dots, R^N(t)),\ t \in \{1, \dots, T\}\}$$
by


$$R^i(t) = \frac{S^i(t)}{S^i(t-1)} - 1$$

Often it is assumed that returns are independent over time, and for simplicity this assumption is made in this section.
Investors in such a market consider the choice of a strategy
$$b = \{b(t) = (b^0(t), \dots, b^N(t)),\ t \in \{0, 1, \dots, T\}\}$$
where $b^i(t)$ denotes the number of units of asset i held during the period (t, t + 1].
Definition 2.3.1 A trading strategy b generates the portfolio value process $S^b(t) = b(t) \cdot S(t)$. The strategy is called
admissible if it satisfies the three conditions
1. Non-anticipative: the process b is adapted to the filtration $\mathcal{F}$.
2. Limited liability: the strategy generates a portfolio process $S^b(t)$ which is non-negative.
3. Self-financing: $b(t-1) \cdot S(t) = b(t) \cdot S(t)$ for $t \in \{1, \dots, T\}$ or equivalently $\Delta S^b(t) = b(t-1) \cdot \Delta S(t)$,
where $x \cdot y$ denotes the standard Euclidean inner product. The set of admissible portfolios in the market is denoted
$\Theta(S)$, and $\overline{\Theta}(S)$ denotes the strictly positive portfolios. It is assumed that $\overline{\Theta}(S) \neq \emptyset$. The third part requires that the
investor re-invests all money in each time step. No wealth is withdrawn or added to the portfolio. This means that
intermediate consumption is not possible. Consider an investor who invests a dollar of wealth in some portfolio. At
the end of period T his wealth becomes
$$S^b(T) = S^b(0) \prod_{j=1}^{T} (1 + R^b(j))$$

where $R^b(t)$ is the return in period t. The ratio is given by
$$\frac{S^b(T)}{S^b(T-1)} = 1 + R^b(T)$$

If the portfolio fractions are fixed during the period, the right-hand-side is the product of T independent and identically
distributed (i.i.d.) random variables. The geometric average return over the period is then
$$\Big( \prod_{j=1}^{T} (1 + R^b(j)) \Big)^{\frac{1}{T}}$$

Because the returns of each period are i.i.d., this average is a sample of the geometric mean value of the one-period
return distribution. For discrete random variables, the geometric mean of a random variable X taking (not necessarily
distinct) values x1 , ..., xS with equal probabilities is defined as
$$G(X) = \Big( \prod_{s=1}^{S} x_s \Big)^{\frac{1}{S}} = \prod_{k=1}^{K} \tilde{x}_k^{f_k} = e^{E[\log X]}$$

where $\tilde{x}_k$ are the distinct values of X and $f_k$ is the frequency with which $X = \tilde{x}_k$, that is, $f_k = P(X = \tilde{x}_k)$. In other
words, the geometric mean is the exponential function of the growth rate
$$g^b(t) = E[\log(1 + R^b(t))]$$
of some portfolio. Hence if Ω is discrete or more precisely if the σ-algebra F on Ω is countable, maximising the
geometric mean is equivalent to maximising the expected growth rate. Generally, one defines the geometric mean of
an arbitrary random variable by


$$G(X) = e^{E[\log X]}$$
assuming the mean value E[log X] is well defined. Over long stretches, intuition dictates that each realised value of the
return distribution should appear on average the number of times dictated by its frequency, and hence as the number
of periods increases, it would hold that
$$\Big( \prod_{j=1}^{T} (1 + R^b(j)) \Big)^{\frac{1}{T}} = e^{\frac{1}{T} \sum_{j=1}^{T} \log(1 + R^b(j))} \to G(1 + R^b(1))$$

as T → ∞. This states that the average growth rate converges to the expected growth rate. In fact this heuristic
argument can be made precise by an application of the law of large numbers. In multi-period models, the geometric
mean was suggested by Williams [1936] as a natural performance measure, because it took into account the effects
from compounding. Instead of worrying about the average expected return, an investor who invests repeatedly should
worry about the geometric mean return. It explains why one might consider the problem
$$\sup_{S^b(T) \in \Theta} E\Big[\log \frac{S^b(T)}{S^b(0)}\Big] \tag{2.3.8}$$

Definition 2.3.2 A solution $S^b$ to Equation (2.3.8) is called a GOP.
Hence the objective given by Equation (2.3.8) is often referred to as the geometric mean criterion. Economists may
view this as the maximisation of expected terminal wealth for an individual with logarithmic utility. However, the
GOP was introduced because of the properties of the geometric mean, when the investment horizon stretches over
several periods. For simplicity it is always assumed that S b (0) = 1, i.e. the investors start with one unit of wealth.
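
The link between the geometric mean and the long-run realised growth can be checked numerically. The sketch below is only an illustration, assuming a hypothetical two-state gross return $1 + R^b$ with equal probabilities; it computes $G(X) = e^{E[\log X]}$ and compares it with the realised geometric average over many simulated i.i.d. periods.

```python
import numpy as np

def geometric_mean(values, probs):
    """Geometric mean G(X) = exp(E[log X]) of a discrete random variable."""
    return float(np.exp(np.sum(probs * np.log(values))))

# Hypothetical one-period gross returns 1 + R^b in two equally likely states.
gross = np.array([1.30, 0.80])
probs = np.array([0.5, 0.5])

G = geometric_mean(gross, probs)       # exp(E[log X])
arith = float(np.sum(probs * gross))   # arithmetic mean E[X]

# Realised geometric average of T i.i.d. draws, computed in logs for stability.
rng = np.random.default_rng(1)
draws = rng.choice(gross, size=200_000, p=probs)
realised = float(np.exp(np.mean(np.log(draws))))

print("geometric mean:", G)
print("arithmetic mean:", arith)
print("realised geometric average:", realised)
```

The arithmetic mean overstates the per-period growth an investor actually compounds at, which is why the geometric mean is the natural criterion for repeated investment.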
Definition 2.3.3 An admissible strategy b is called an arbitrage strategy if
$$S^b(0) = 0, \quad P(S^b(T) \geq 0) = 1, \quad P(S^b(T) > 0) > 0$$
It is closely related to the existence of a solution to the problem in Equation (2.3.8), because the existence of a strategy
that creates something out of nothing would provide an infinitely high growth rate.
Theorem 2.3.1 There exists a GOP S b if and only if there is no arbitrage. If the GOP exists its value process is
unique.
The necessity of no arbitrage is straightforward as indicated above. The sufficiency will follow directly once the
numeraire property of the GOP has been established. It is possible to infer some simple properties of the GOP
strategy, without further specifications of the model:
Theorem 2.3.2 The GOP strategy has the following properties:
1. The fractions of wealth invested in each asset are independent of the level of total wealth.
2. The invested fraction of wealth in asset i is proportional to the return on asset i.
3. The strategy is myopic.
To see why the GOP strategy depends only on the distribution of asset returns one period ahead, note that
$$E[\log S^b(T)] = \log S^b(0) + \sum_{j=1}^{T} E[\log(1 + R^b(j))]$$

In general, obtaining the strategy in an explicit closed form is not possible, as it involves solving a non-linear optimisation problem. To see this, we derive the first order conditions of Equation (2.3.8). Since the GOP strategy is myopic
and the invested fractions are independent of wealth, one needs to solve the problem
$$\sup_{b(t)} E_t\Big[\log \frac{S^b(t+1)}{S^b(t)}\Big]$$

for each $t \in \{0, 1, \dots, T-1\}$. This is equivalent to solving the problem
$$\sup_{b(t)} E_t[\log(1 + R^b(t+1))]$$
Using the fractions $\pi_b^i(t) = \frac{b^i(t) S^i(t)}{S^b(t)}$, the problem can be written
$$\sup_{\pi_b(t) \in \mathbb{R}^N} E\Big[\log\Big(1 + \Big(1 - \sum_{k=1}^{N} \pi_b^k(t)\Big) R^0(t+1) + \sum_{k=1}^{N} \pi_b^k(t) R^k(t+1)\Big)\Big]$$

since
$$1 + \Big(1 - \sum_{i=1}^{N} \pi_b^i(t)\Big) R^0(t+1) + \sum_{i=1}^{N} \pi_b^i(t) R^i(t+1) = \frac{1}{S^b(t)} \Big[ (1 + R^0(t+1)) S^b(t) - \sum_{i=1}^{N} b^i(t) S^i(t) R^0(t+1) + \sum_{i=1}^{N} b^i(t) S^i(t) R^i(t+1) \Big]$$

which gives
$$1 + \Big(1 - \sum_{i=1}^{N} \pi_b^i(t)\Big) R^0(t+1) + \sum_{i=1}^{N} \pi_b^i(t) R^i(t+1) = \frac{1}{S^b(t)} \Big[ (1 + R^0(t+1)) S^b(t) - (1 + R^0(t+1)) \sum_{i=1}^{N} b^i(t) S^i(t) + \sum_{i=1}^{N} b^i(t) S^i(t+1) \Big]$$

Since $S^b(t) - b^0(t) S^0(t) = \sum_{i=1}^{N} b^i(t) S^i(t)$, we get
$$1 + \Big(1 - \sum_{i=1}^{N} \pi_b^i(t)\Big) R^0(t+1) + \sum_{i=1}^{N} \pi_b^i(t) R^i(t+1) = \frac{1}{S^b(t)} \Big[ (1 + R^0(t+1)) b^0(t) S^0(t) + \sum_{i=1}^{N} b^i(t) S^i(t+1) \Big]$$

and since the portfolio is self-financing, we get
$$1 + \Big(1 - \sum_{i=1}^{N} \pi_b^i(t)\Big) R^0(t+1) + \sum_{i=1}^{N} \pi_b^i(t) R^i(t+1) = \frac{S^b(t+1)}{S^b(t)}$$

The properties of the logarithm ensure that the portfolio will automatically become admissible. By differentiation,
the first order conditions become
$$E_{t-1}\Big[\frac{1 + R^k(t)}{1 + R^b(t)}\Big] = 1, \quad k = 0, 1, \dots, N$$

This constitutes a set of N + 1 non-linear equations to be solved simultaneously, one of which is a consequence
of the others, due to the constraint that $\sum_{i=0}^{N} \pi_b^i = 1$. Although these equations do not generally possess an explicit
closed-form solution, there are some special cases which can be handled.
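
One such special case is a single risky asset with two possible outcomes next to a risk-free asset. The sketch below is a minimal illustration under that assumption: the return values and probabilities are hypothetical, and the closed-form fraction in the code is derived here directly from the first order condition for this binomial case rather than taken from the text. It compares the closed form with a brute-force grid search and checks the first order conditions for both assets.

```python
import numpy as np

def expected_log_growth(pi, r, outcomes, probs):
    """E[log(1 + (1 - pi) * r + pi * R)] for a fraction pi in the risky asset."""
    gross = 1.0 + (1.0 - pi) * r + pi * outcomes
    return float(np.sum(probs * np.log(gross)))

# Hypothetical one-period market: risk-free return r, risky return in two states.
r = 0.01
outcomes = np.array([0.20, -0.10])
probs = np.array([0.5, 0.5])

# Closed form for the binomial case (derived from the first order condition):
# pi* = -(1 + r) * E[R - r] / ((R_up - r) * (R_down - r))
x = outcomes - r
pi_star = float(-(1.0 + r) * np.sum(probs * x) / (x[0] * x[1]))

# Brute-force check on a grid of admissible fractions.
grid = np.linspace(0.0, 3.0, 3001)
growth = [expected_log_growth(p, r, outcomes, probs) for p in grid]
pi_grid = float(grid[int(np.argmax(growth))])

# First order conditions E[(1 + R^k) / (1 + R^b)] = 1 for k = 0 (risk-free), 1 (risky).
gross_gop = 1.0 + (1.0 - pi_star) * r + pi_star * outcomes
foc_riskfree = float(np.sum(probs * (1.0 + r) / gross_gop))
foc_risky = float(np.sum(probs * (1.0 + outcomes) / gross_gop))

print("closed form:", pi_star, "grid search:", pi_grid)
print("first order conditions:", foc_riskfree, foc_risky)
```

With more assets or more states the same objective can be handed to a numerical optimiser, since no closed form is available in general.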
2.3.2.2 Continuous time

Being a (N + 1)-dimensional semimartingale and satisfying the usual conditions, S can be decomposed as
S(t) = A(t) + M (t)
where A is a finite variation process and M is a local martingale. The reader is encouraged to think of these as drift
and volatility respectively, but should beware that the decomposition above is not always unique. If A can be chosen
to be predictable, then the decomposition is unique. This is exactly the case when S is a special semimartingale (see
Protter [2004]). Following standard conventions, the first security is assumed to be the numeraire, and hence it is
assumed that S 0 (t) = 1 almost surely for all t ∈ [0, T ]. The investor needs to choose a strategy, represented by the
N + 1 dimensional process
$$b = \{b(t) = (b^0(t), \dots, b^N(t)),\ t \in [0, T]\}$$
Definition 2.3.4 An admissible trading strategy b satisfies the three conditions:
1. b is an S-integrable, predictable process.
2. The resulting portfolio value $S^b(t) = \sum_{i=0}^{N} b^i(t) S^i(t)$ is nonnegative.
3. The portfolio is self-financing, that is $S^b(t) = S^b(0) + \int_0^t b(s) dS(s)$.
The last requirement states that the investor does not withdraw or add any funds. It is often convenient to consider
portfolio fractions, i.e.
$$\pi_b = \{\pi_b(t) = (\pi_b^0(t), \dots, \pi_b^N(t))^\top,\ t \in [0, \infty)\}$$
with coordinates defined by
$$\pi_b^i(t) = \frac{b^i(t) S^i(t)}{S^b(t)}$$

One may define the GOP $S^b$ as the solution to the problem
$$S^b = \arg \sup_{S^b \in \Theta(S)} E\Big[\log \frac{S^b(T)}{S^b(0)}\Big] \tag{2.3.9}$$

Definition 2.3.5 A portfolio is called a GOP if it satisfies Equation (2.3.9).
The essential feature of No Free Lunch with Vanishing Risk (NFLVR) is the fact that it implies the existence of an
equivalent martingale measure. More precisely, if asset prices are locally bounded, the measure is an equivalent local
martingale measure and if they are unbounded, the measure becomes an equivalent sigma martingale measure. Here,
these measures will all be referred to collectively as equivalent martingale measures (EMM).


Theorem 2.3.3 Assume that
$$\sup_{S^b} E\Big[\log \frac{S^b(T)}{S^b(0)}\Big] < \infty$$
and that NFLVR holds. Then there is a GOP.
A less stringent and numeraire invariant condition is the requirement that the market should have a martingale density.
A martingale density is a strictly positive process Z, such that $\int S dZ$ is a local martingale. In other words, a Radon-Nikodym
derivative of some EMM is a martingale density, but a martingale density is only the Radon-Nikodym
derivative of an EMM if it is a true martingale. Modifying the definition of the GOP slightly, one may show that:
Corollary 1 There is a GOP if and only if there is a martingale density.
We present a simple example to get a feel for how to find the growth optimal strategy in the continuous setting.
Example : two assets
Let the market consist of two assets, a stock and a bond. Specifically the SDEs describing these assets are given by
$$dS^0(t) = S^0(t) r dt$$
$$dS^1(t) = S^1(t) \big( a dt + \sigma dW(t) \big)$$

where W is a Wiener process and r, a, σ are constants. Since $S^b(t) = b^0(t) S^0(t) + b^1(t) S^1(t)$, applying Ito's lemma
we get
$$dS^b(t) = b^0(t) S^0(t) r dt + b^1(t) S^1(t) a dt + b^1(t) S^1(t) \sigma dW(t)$$
Using fractions $\pi^1(t) = \frac{b^1(t) S^1(t)}{S^b(t)}$, any admissible strategy can be written
$$dS^b(t) = S^b(t) \Big( (r + \pi^1(t)(a - r)) dt + \pi^1(t) \sigma dW(t) \Big)$$
since
$$dS^b(t) = S^b(t) \Big( r dt + \frac{b^1(t) S^1(t)}{S^b(t)} (a - r) dt + \frac{b^1(t) S^1(t)}{S^b(t)} \sigma dW(t) \Big)$$

which gives
$$dS^b(t) = (S^b(t) - b^1(t) S^1(t)) r dt + b^1(t) S^1(t) a dt + b^1(t) S^1(t) \sigma dW(t)$$

and since $S^b(t) - b^1(t) S^1(t) = b^0(t) S^0(t)$, we recover the SDE
$$dS^b(t) = b^0(t) S^0(t) r dt + b^1(t) S^1(t) a dt + b^1(t) S^1(t) \sigma dW(t)$$

Applying Ito's lemma to $Y(t) = \log S^b(t)$ we get
$$dY(t) = \Big( r + \pi^1(t)(a - r) - \frac{1}{2} (\pi^1(t))^2 \sigma^2 \Big) dt + \pi^1(t) \sigma dW(t)$$

Hence, assuming the local martingale with differential $\pi^1(t) \sigma dW(t)$ to be a true martingale, it follows that
$$E[\log S^b(T)] = E\Big[ \int_0^T \Big( r + \pi^1(t)(a - r) - \frac{1}{2} (\pi^1(t))^2 \sigma^2 \Big) dt \Big]$$

so by maximizing the expression for each (t, ω) the optimal fraction is obtained as
$$\pi_b^1(t) = \frac{a - r}{\sigma^2}$$
Hence, inserting the optimal fractions into the wealth process, the GOP is described by the SDE
$$dS^b(t) = S^b(t) \Big( \Big( r + \Big( \frac{a - r}{\sigma} \Big)^2 \Big) dt + \frac{a - r}{\sigma} dW(t) \Big)$$

which we rewrite as
$$dS^b(t) = S^b(t) \Big( (r + \theta^2) dt + \theta dW(t) \Big)$$
where $\theta = \frac{a - r}{\sigma}$ is the market price of risk process.
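
The two-asset result can be verified with a few lines of arithmetic. This is a minimal sketch assuming hypothetical constant parameters r, a and σ; the function names are illustrative. It evaluates the drift of $\log S^b$ for a constant fraction, the optimal fraction $(a - r)/\sigma^2$, and the resulting growth rate $r + \frac{1}{2}\theta^2$.

```python
def growth_rate(pi, r, a, sigma):
    """Drift of log S^b for a constant fraction pi held in the stock."""
    return r + pi * (a - r) - 0.5 * pi**2 * sigma**2

def gop_fraction(r, a, sigma):
    """Growth optimal fraction of wealth in the stock, (a - r) / sigma^2."""
    return (a - r) / sigma**2

# Hypothetical constant parameters: short rate, stock drift, stock volatility.
r, a, sigma = 0.02, 0.08, 0.20

pi_star = gop_fraction(r, a, sigma)          # optimal fraction
theta = (a - r) / sigma                      # market price of risk
g_star = growth_rate(pi_star, r, a, sigma)   # equals r + theta^2 / 2

print("optimal fraction:", pi_star)
print("growth rate at optimum:", g_star)
print("growth rate at half the optimal fraction:", growth_rate(0.5 * pi_star, r, a, sigma))
print("growth rate at twice the optimal fraction:", growth_rate(2.0 * pi_star, r, a, sigma))
```

Because the growth rate is a concave parabola in the fraction, holding twice the optimal fraction brings the growth rate back down to the short rate r, while half the optimal fraction gives up only a quarter of the excess growth.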

Fix a truncation function h i.e. a bounded function with compact support h : RN → RN such that h(x) = x in a
neighbourhood around zero. For instance, a common choice would be
$$h(x) = x I_{\{|x| \leq 1\}}$$
For such a truncation function, there is a triplet (A, B, ν) describing the behaviour of the semimartingale. There exists
a locally integrable, increasing, predictable process $\hat{A}$ such that (A, B, ν) can be written as
$$A = \int a \, d\hat{A}, \quad B = \int b \, d\hat{A} \quad \text{and} \quad \nu(dt, dv) = d\hat{A}_t F(t, dv)$$
The process A is related to the finite variation part of the semimartingale, and it can be thought of as a generalised
drift. The process B is similarly interpreted as the quadratic variation of the continuous part of S, or in other words it
is the square volatility where volatility is measured in absolute terms. The process ν is the compensated jump measure,
interpreted as the expected number of jumps with a given size over a small interval and F essentially characterises the
jump size.
Example
Let $S^1$ be geometric Brownian motion. Then $\hat{A} = t$ and
$$dA(t) = S^1(t) a dt, \quad dB(t) = (S^1(t) \sigma)^2 dt$$
Theorem 2.3.4 (Goll and Kallsen [2000])
Let S have a characteristic triplet (A, B, ν) as described above. Suppose there is an admissible strategy b with
corresponding fractions πb such that
$$a^k(t) - \sum_{i=1}^{N} \frac{\pi_b^i}{S^i(t)} b^{i,k}(t) + \int_{\mathbb{R}^N} \Big( \frac{x^k}{1 + \sum_{i=1}^{N} \frac{\pi_b^i}{S^i(t)} x^i} - h(x) \Big) F(t, dx) = 0$$

for $P \times d\hat{A}$ almost all $(\omega, t) \in \Omega \times [0, T]$ where $k \in \{0, \dots, N\}$ and × denotes the standard product measure. Then b is
the GOP strategy.
This equation represents the first order conditions for optimality, which would be obtained easily if one tried to
solve the problem in a pathwise sense.
Example
Assume that discounted asset prices are driven by an m-dimensional Wiener process. The locally risk free asset is
used as numeraire, whereas the remaining risky assets evolve according to

$$dS^i(t) = S^i(t) a^i(t) dt + \sum_{k=1}^{m} S^i(t) b^{i,k}(t) dW^k(t)$$

for $i \in \{1, \dots, N\}$. Here $a^i(t)$ is the excess return above the risk free rate. From this equation, the decomposition of the
semimartingale S follows directly. Choosing $\hat{A} = t$, a good version of the characteristic triplet becomes
$$(A, B, \nu) = \Big( \int a(t) S(t) dt, \int S(t) b(t) (S(t) b(t))^\top dt, 0 \Big)$$
Consequently, in vector form and after division by $S^i(t)$, the above Equation yields that
$$a(t) - (b(t) b(t)^\top) \pi_b(t) = 0$$
In the particular case where m = N and the matrix b is invertible, we get the well-known result that
$$\pi(t) = (b(t)^\top)^{-1} \theta(t)$$
where $\theta(t) = b^{-1}(t) a(t)$ is the market price of risk. Generally, whenever the asset prices can be represented by a
continuous semimartingale, a closed form solution to the GOP strategy may be found. The cases where jumps are
included are less trivial. In general when jumps are present, there is no explicit solution in an incomplete market. In
such cases, it is necessary to use numerical methods.
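
For the continuous diffusion case, the first order condition $a(t) - (b(t) b(t)^\top) \pi_b(t) = 0$ is a linear system that can be solved directly. The following is a minimal sketch assuming a hypothetical constant, square and invertible volatility matrix b and excess return vector a; it solves the system with NumPy and checks that the resulting excess growth rate equals $\frac{1}{2}|\theta|^2$.

```python
import numpy as np

# Hypothetical constant coefficients for N = m = 2 risky assets.
b = np.array([[0.20, 0.00],
              [0.06, 0.15]])    # volatility matrix b^{i,k}
a = np.array([0.05, 0.04])      # excess returns above the risk free rate

cov = b @ b.T                   # instantaneous covariance matrix b b^T

# GOP fractions from the first order condition a - (b b^T) pi = 0.
pi = np.linalg.solve(cov, a)

# Market price of risk theta = b^{-1} a.
theta = np.linalg.solve(b, a)

def excess_growth(fractions):
    """Growth rate in excess of the risk free rate: pi.a - 0.5 * pi^T (b b^T) pi."""
    return float(fractions @ a - 0.5 * fractions @ cov @ fractions)

print("GOP fractions:", pi)
print("excess growth at optimum:", excess_growth(pi))
print("half squared market price of risk:", 0.5 * float(theta @ theta))
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit matrix inverse, which is the usual choice when the covariance matrix is poorly conditioned.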
As it was done in discrete time, the GOP can be characterised in terms of its growth properties.
Theorem 2.3.5 The GOP has the following properties:
1. The GOP maximises the instantaneous growth rate of investments.
2. In the long term, the GOP will have a higher realised growth rate than any other strategy, i.e.
$$\limsup_{T \to \infty} \frac{1}{T} \log S^b(T) \leq \limsup_{T \to \infty} \frac{1}{T} \log S^{\underline{b}}(T)$$

for any other admissible strategy $S^b$, where $S^{\underline{b}}$ denotes the GOP.
The instantaneous growth rate is the drift of $\log S^b(t)$.
Example
Given the previous example, the instantaneous growth rate $g^b(t)$ of a portfolio $S^b$ was found by applying Ito's
formula to get
$$dY(t) = \Big( r + \pi(t)(a - r) - \frac{1}{2} \pi^2(t) \sigma^2 \Big) dt + \pi(t) \sigma dW(t)$$

Hence, the instantaneous growth rate is
$$g^b(t) = r + \pi(t)(a - r) - \frac{1}{2} \pi^2(t) \sigma^2 \tag{2.3.10}$$

Differentiating the instantaneous growth rate $g^b(t)$ with respect to the fraction π(t) and setting the result to zero, we
recover $\pi_b^1(t)$ in the Example with two assets. Hence, by construction, the GOP maximises the instantaneous growth
rate. As in the discrete setting, the GOP enjoys the numeraire property. However, there are some subtle differences.
Theorem 2.3.6 Let $S^b$ denote any admissible portfolio process and define $\hat{S}^b(t) = \frac{S^b(t)}{S^{\underline{b}}(t)}$. Then
1. $\hat{S}^b(t)$ is a supermartingale if and only if $S^{\underline{b}}(t)$ is the GOP.
2. The process $\frac{1}{\hat{S}^b(t)}$ is a submartingale.
3. If asset prices are continuous, then $\hat{S}^b(t)$ is a local martingale.
2.3.2.3 Discussion

Mosegaard Christensen [2011] reviewed the discussion around the attractiveness of the GOP against the CAPM and
concluded that there is agreement on the fact that the GOP can neither proxy for, nor dominate other strategies
in terms of expected utility, and no matter how long a (finite) horizon the investor has, utility based preferences can
make other portfolios more attractive because they have a more appropriate risk profile. Authors favouring the GOP
believe growth optimality to be a reasonable investment goal, with attractive properties being relevant to long horizon
investors (see Kelly [1956], Latane [1959]). On the other hand, authors disagreeing do so because they do not believe
that every investor can be described as a log-utility maximising investor (see Markowitz [1976]). To summarise, the
disagreement had its roots in two very fundamental issues, namely whether or not utility theory is a reasonable way of
approaching investment decisions in practice, and whether utility functions different from the logarithm are a realistic
description of individual long-term investors. The requirement that investors be aware of their own utility functions is a
very abstract statement which is not a fundamental law of nature. Once a portfolio has been built using the CAPM,
it is impossible to verify ex-post that it was the right choice. On the other hand, maximising growth over time is
formulated in dollars, so that one has a good idea of the final wealth one will get.
2.3.2.4 Comparing the GOP with the MV approach

The main results on the comparison between mean variance (MV) and growth optimality can be found in Hakansson
[1971]. Mosegaard Christensen [2011] presented a review of whether or not the GOP and MV approach could be
united or if they were fundamentally different. He concluded by stating that they were in general two different things.
The mean-variance efficient portfolios are obtained as the solution to a quadratic optimisation program. Its theoretical
justification requires either a quadratic utility function or some fairly restrictive assumptions on the class of return
distribution, such as the assumption of normally distributed returns. As a result, the GOP is in general not mean-variance
efficient. Mean-variance efficient portfolios have the possibility of ruin, and they are not consistent with first
order stochastic dominance. We saw earlier that the mean-variance approach was further developed into the CAPM,
where the market is assumed mean-variance efficient. Similarly, it was assumed that if all agents were to maximise the
expected logarithm of wealth, then the GOP becomes the market portfolio and from this an equilibrium asset pricing
model appears. As with the CAPM, the conclusions of the analysis provide empirically testable predictions. As a
result of assuming log-utility, the martingale or numeraire condition becomes a key element of the equilibrium model.
Recall, $R^i(t)$ is the return on the ith asset between time t − 1 and t, and $R^b$ is the return process for the GOP. Then the
equilibrium is
$$1 = E_{t-1}\Big[\frac{1 + R^i(t)}{1 + R^b(t)}\Big]$$
which is the first order condition for a logarithmic investor. We assume a world with a finite number of states,
$\Omega = \{\omega_1, \dots, \omega_n\}$ and define $p_i = P(\{\omega_i\})$. Then if $S^i(t)$ is an Arrow-Debreu price, paying off one unit of wealth at
time t + 1, we get
$$S^i(t) = E_t\Big[\frac{I_{\{\omega = \omega_i\}}}{1 + R^b(t+1)}\Big]$$

and consequently summing over all states, which gives the price of a sure unit payoff, provides an equilibrium condition for the risk-free rate
$$\frac{1}{1 + r(t, t+1)} = E_t\Big[\frac{1}{1 + R^b(t+1)}\Big]$$

Combining with the previous equations, defining $\bar{R}^i = R^i - r$, and performing some basic calculations, we get
$$E_t[\bar{R}^i(t+1)] = \beta_t^i E_t[\bar{R}^b(t+1)]$$
where
$$\beta_t^i = \frac{Cov\Big(\bar{R}^i(t+1), \frac{\bar{R}^b(t+1)}{R^b(t+1)}\Big)}{Cov\Big(\bar{R}^b(t+1), \frac{\bar{R}^b(t+1)}{R^b(t+1)}\Big)}$$

This is to be compared with the CAPM, where the β is given by
$$\beta_{CAPM} = \frac{Cov(R^i, R^*)}{Var(R^*)}$$

Only in some cases are the CAPM and the CAPM based on the GOP similar. Note, the mean-variance portfolio provides
a simple trade-off between expected return and variance which can be parametrised in a closed-form, requiring only
the estimation of a variance-covariance matrix of returns and the ability to invert that matrix. Further, choosing a
portfolio being either a fractional Kelly strategy or logarithmic mean-variance efficient provides the same trade-off,
but it is computationally more involved.
In the continuous case, for simplicity of exposition, we use the notation given in Appendix (E.1) and rewrite the
dynamics of the ith risky asset as
$$\frac{dS^i(t)}{S^i(t)} = a_t^i dt + \sigma_t^i dW_t$$
where $dW_t$ is a column vector of dimension (M, 1) of independent Brownian motions with elements $(dW_t^j)_{j=1}^{M}$, and
$\sigma_t^i$ is a volatility matrix of dimension (1, M) with elements $(\sigma_j^i(t))_{j=1}^{M}$ such that
$$< \sigma_t^i, dW_t > = \sum_{j=1}^{M} \sigma_j^i(t) dW_t^j$$

with Euclidean norm
$$|\sigma_t^i|^2 = \sum_{j=1}^{M} (\sigma_j^i(t))^2$$

In that setting, the portfolio in Equation (E.1.1) becomes
$$dV_t^b = r_t V_t^b dt + < (bS)_t, a_t - r_t I > dt + < (bS)_t, \sigma_t dW_t >$$
where we let $\sigma_t$ be an adapted matrix of dimension N × M, and $(bS)_t$ corresponds to the vector with components
$(b^i(t) S_t^i)_{1 \leq i \leq N}$ describing the amount to be invested in each stock. Also, $\frac{1}{S^0(t)} \big( V_t^b - (bS)_t \big)$ is invested in the
riskless asset. Writing $\pi(t) = \frac{(bS)_t}{V_t^b}$ as a (N, 1) vector with elements $(\pi^i(t))_{i=1}^{N} = \frac{(b^i(t) S_t^i)_{1 \leq i \leq N}}{V_t^b}$, the dynamics of
the portfolio become
$$\frac{dV_t^b}{V_t^b} = r_t dt + < \pi(t), a_t - r_t I > dt + < \pi(t), \sigma_t dW_t >$$
In that setting, the instantaneous mean-variance efficient portfolio is the solution to the problem
$$\sup_{b \in \Theta(S)} a^b(t) \quad \text{s.t.} \quad \sigma_t^b \leq \kappa(t)$$
where κ(t) is some non-negative adapted process. Defining the process $Y_t^b = \ln V_t^b$ and applying Ito's lemma, we get
$$dY_t^b = r_t dt + < \pi(t), a_t - r_t I > dt - \frac{1}{2} |\pi(t) \sigma_t|^2 dt + < \pi(t), \sigma_t dW_t >$$
with instantaneous growth rate being
$$g^b(t) = r_t + < \pi(t), a_t - r_t I > - \frac{1}{2} |\pi(t) \sigma_t|^2$$
which is a generalisation of Equation (2.3.10). By construction, the GOP maximises the instantaneous growth rate.
Taking the expectation, we get
$$E[Y_T^b] = E\Big[ \int_0^T \Big( r_s + < \pi(s), a_s - r_s I > - \frac{1}{2} |\pi(s) \sigma_s|^2 \Big) ds \Big]$$
Note, we can define the minimal market price of risk as
$$\theta(t) = \sigma_t \frac{a_t - r_t I}{\sigma_t \sigma_t^\top}$$
Any efficient portfolio along the straight efficient frontier can be specified by its fractional holdings of the market
portfolio, called the leverage and denoted by α. The instantaneously mean-variance efficient portfolios have fractions
solving the equation
$$\pi_b(t) \sigma_t = \alpha(t) \theta(t) = \alpha(t) \sigma_t \frac{a_t - r_t I}{\sigma_t \sigma_t^\top}$$
for some non-negative process α. Hence, the optimum fractions become
$$\pi_b(t) = \alpha(t) \frac{a_t - r_t I}{\sigma_t \sigma_t^\top} \tag{2.3.11}$$
In the special case where we assume the volatility matrix to be of dimension (N, N) and invertible, the market price
of risk becomes $\theta(t) = \frac{a_t - r_t I}{\sigma_t}$ and the optimum fractions simplify to
$$\pi_b(t) = \alpha(t) \frac{\theta(t)}{\sigma_t} \tag{2.3.12}$$
Using the optimum fractions, the SDE for such leveraged portfolios becomes
$$\frac{dV_t^b}{V_t^b} = r_t dt + \alpha(t) |\theta(t)|^2 dt + \alpha(t) < \theta(t), dW_t >$$
where the volatility of the portfolio is now $\alpha(t) \theta(t)$. The GOP is instantaneously mean-variance efficient, corresponding to
the choice of α = 1. The GOP belongs to the class of instantaneous Sharpe ratio maximising strategies, where for
some strategy b, it is defined as
$$M_{SR}^b = \frac{b_t r_t + < b_t, a_t - r_t I > - r_t}{|(b\sigma_j)_t|^2} = \frac{< b_t, a_t > + r_t (b_t^0 - 1)}{|(b\sigma_j)_t|^2} \tag{2.3.13}$$

where $(b\sigma_j)_t$ is a weighted volatility vector with elements $(b_t^i \sigma_j^i(t))_{i=1}^{N}$. Recall, $b_t r_t - < b_t, r_t I > = b_t^0 r_t$, so that the
mean-variance portfolios consist of a position in the GOP and the rest in the riskless asset, that is, a fractional Kelly
strategy.


2.3.2.5

Time taken by the GOP to outperform other portfolios

As the GOP was advocated, not as a particular utility function, but as an alternative to utility theory relying on its
ability to outperform other portfolios over time, it is important to document this ability over horizons relevant to actual
investors. We present a simple example illustrating the time it takes for the GOP to dominate other assets.
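Example Assume a two asset Black-Scholes model with constant parameters, with risk-free asset $S^0(t) = e^{rt}$ and
a single stock following a geometric Brownian motion with drift a and volatility σ. Solving the SDE, the stock price is
given as

$$S^1(t) = e^{(a - \frac{1}{2}\sigma^2)t + \sigma W(t)}$$

The GOP is given by the process

$$S^b(t) = e^{(r + \frac{1}{2}\theta^2)t + \theta W(t)}$$

where $\theta = \frac{a - r}{\sigma}$.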

Some simple calculations imply that the probability

$$P_0(t) = P(S^b(t) \ge S^0(t))$$

of the GOP outperforming the savings account over a period of length t and the probability

$$P_1(t) = P(S^b(t) \ge S^1(t))$$

of the GOP outperforming the stock over a period of length t are given by

$$P_0(t) = N\left( \frac{1}{2} \theta \sqrt{t} \right)$$

and

$$P_1(t) = N\left( \frac{1}{2} |\theta - \sigma| \sqrt{t} \right)$$

where N(.) is the cumulative distribution function of the standard Gaussian distribution. Both probabilities are
independent of the short rate. Moreover, the probabilities are increasing in the market price of risk and time horizon.
They converge to one as the time horizon increases to infinity, which is a manifestation of the growth properties of the
GOP. The time needed for the GOP to outperform the risk free asset at a 99% confidence level is 8659 years for a market
price of risk θ = 0.05 and 87 years for θ = 0.5. Similarly, at a 95% confidence level the time is 4329 years for θ = 0.05
and 43 years for θ = 0.5. One can conclude that the long run may be very long. Hence, the argument that one should
choose the GOP to maximise the probability of doing better than other portfolios is somewhat weakened.
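The horizons quoted above can be recovered by inverting $N(\frac{1}{2} d \sqrt{t}) = p$, which gives $t = (2 N^{-1}(p) / d)^2$ with d = θ against the savings account and d = |θ − σ| against the stock. The following is a minimal sketch of that inversion; the function name `years_to_outperform` and its arguments are ours and only serve the illustration.

```python
from statistics import NormalDist
from typing import Optional


def years_to_outperform(theta: float, confidence: float,
                        sigma: Optional[float] = None) -> float:
    """Horizon t (in years) at which N(0.5 * d * sqrt(t)) equals `confidence`.

    d = theta when the GOP is compared with the savings account (P0), and
    d = |theta - sigma| when it is compared with the stock (P1).
    """
    if not 0.5 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0.5 and 1")
    d = theta if sigma is None else abs(theta - sigma)
    if d <= 0.0:
        raise ValueError("d must be positive, otherwise the probability never rises above 1/2")
    z = NormalDist().inv_cdf(confidence)  # standard Gaussian quantile
    return (2.0 * z / d) ** 2


# Horizons against the savings account for the confidence levels used in the text.
horizons = {
    (conf, theta): years_to_outperform(theta, conf)
    for conf in (0.99, 0.95)
    for theta in (0.05, 0.5)
}
for (conf, theta), years in horizons.items():
    print(f"confidence={conf:.2f}, theta={theta:.2f}: {years:.0f} years")
```

Since the horizon scales as $1/\theta^2$, multiplying the market price of risk by ten divides the waiting time by one hundred, which is the pattern seen in the figures above.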

2.3.3

Measuring and predicting performances

We saw in Section (2.3.1.1) that since any point along the efficient frontier represents an efficient portfolio, the investor
needs additional information in order to select the optimal portfolio. That is, the key element in mean-variance
portfolio analysis being one’s view on expected return and risk, the selection of a preferred combination of risk and
expected return depends on the investor. However, we saw in Section (2.3.1.2) that one can attempt to find efficient
portfolios promising the greatest expected return for a given degree of risk (see Sharpe [1966]). Hence, one must
translate predictions about security performance into predictions of portfolio performance, and select one efficient
portfolio based on some utility function. The process for mutual funds becomes that of security analysis and portfolio
analysis given some degree of risk. As a result, there is room for major and persisting differences in the performance
of different funds. Over time, security analysis moved towards evaluating the interrelationships between securities,
while portfolio analysis focused more on diversification as any diversified portfolio should be efficient in a perfect
market. For example, one may only require the spreading of holdings among standard industrial classes.


In the CAPM presented in Section (2.3.1.2), one assumes that the predicted performance of the ith portfolio is
described with two measures, namely the expected rate of return $E_{R_i}$ and the predicted variability or risk expressed as
the standard deviation of return $\sigma_i$. Further, assuming that all investors can invest and borrow at the risk-free rate, all
efficient portfolios satisfy Equation (2.3.4) and follow the linear representation

$$E_{R_i} = a + b\sigma_i$$

where a is the risk-free rate and b is the risk premium. Hence, by allocating his funds between the ith portfolio and
borrowing or lending, the investor can attain any point on the line

$$E_R = a + \frac{E_{R_i} - a}{\sigma_i} \sigma_R$$

for a given pair $(E_R, \sigma_R)$, such that the best portfolio is the one for which the slope $\frac{E_{R_i} - a}{\sigma_i}$ is the greatest (see Tobin
[1958]). The predictions of future performance being difficult to obtain, ex-post values must be used in the model.
That is, the average rate of return of a portfolio $\bar{R}_i$ must be substituted for its expected rate of return, and the actual
standard deviation $\bar{\sigma}_i$ of its rate of return for its predicted risk. In the ex-post settings, funds with properly diversified
portfolios should provide returns giving $\bar{R}_i$ and $\bar{\sigma}_i$ lying along a straight line, but if they fail to diversify the returns will
yield inferior values for $\bar{R}_i$ and $\bar{\sigma}_i$. In order to analyse the performances of different funds, Sharpe [1966] proposed
a single measure by substituting the ex-post measures $\bar{R}$ and $\bar{\sigma}_R$ for the ex-ante measures $E_R$ and $\sigma_R$ obtaining the
formula

$$\bar{R} = a + \frac{\bar{R}_i - a}{\bar{\sigma}_i} \bar{\sigma}_R$$

whose slope $\frac{\bar{R}_i - a}{\bar{\sigma}_i}$ is a reward-to-variability ratio (RV) or a reward per unit of variability. The numerator is the
reward provided to the investor for bearing risk, and the denominator measures the standard deviation of the annual rate
of return. The results of his analysis based on 34 funds over two periods, from 1944 till 1953 and from 1954 till 1963,
showed that differences in performance can be predicted, although imperfectly, but one cannot identify the sources of
the differences. Further, there is no assurance that past performance is the best predictor of future performance. During
the period 1954-63, almost 90% of the variance of the return of a typical fund of the sample was due to its comovement
with the return of the other securities used to compute the Dow-Jones Industrial Average, with a similar percentage for
most of the 34 funds. Taking advantage of this relationship, Treynor [1965] used the volatility of a fund as a measure of
its risk instead of the total variability used in the RV ratio. Letting $B_i$ be the volatility of the ith fund defined as the
change in the rate of return of the fund associated with a 1% change in the rate of return of a benchmark or index, the
Treynor index can be written as

$$M_{TI} = \frac{\bar{R}_i - a}{B_i}$$

According to Sharpe [1966], Treynor intended that his index be used both for measuring a fund’s performance, and
for predicting its performance in the future. Treynor [1965] argued that a good historical performance pattern is
one which, if continued in the future, would cause investors to prefer it to others. Given the level of contribution of
volatility to the over-all variability, one can expect the ranking of funds on the basis of the Treynor index to be very
close to that based on the RV ratio, especially when funds hold highly diversified portfolios. Differences appear in the
case of undiversified funds since the Treynor index does not capture the portion of variability due to the lack of
diversification. For this reason Sharpe concluded that the TI ratio was an inferior measure of past performance but a
possibly superior measure for predicting future performance.

Note, independently from Markowitz, Roy [1952] set down the same equation relating portfolio variance of return
to the variances of return of the constituent securities, developing a similar mean-variance efficient set. However, while
Markowitz left it up to the investor to choose where along the efficient set he would invest, Roy advised choosing the
single portfolio in the mean-variance efficient set maximising


$$\frac{\mu - d}{\sigma_M^2}$$

where d is a disaster level return the investor places a high priority on not falling below. It is very similar to the
reward-to-variability ratio (RV) proposed by Sharpe. In his measure of quality and performance of a portfolio, Sharpe
[1966] did not distinguish between time and ensemble averages. We saw in Section (2.3.2.4) that the measure is
also meaningful in the context of time averages in geometric Brownian motion and derived in Equation (2.3.13) the
GOP Sharpe ratio. Assuming a portfolio following the simple geometric Brownian motion in Equation (2.3.2), Peters
[2011c] derived the dynamics of the leveraged portfolio, and, applying Ito’s lemma to obtain the dynamics of the
log-portfolio, computed the time-average leveraged exponential growth rate as

$$g_\alpha^b = \frac{1}{dt} \langle d \ln V_t^\alpha \rangle = r + \alpha\mu - \frac{1}{2} \alpha^2 \sigma_M^2$$

where $\sigma_M$ is the volatility of the market portfolio. Differentiating with respect to α and setting the result to zero, the
optimum leverage becomes

$$\alpha^* = \frac{\mu}{\sigma_M^2}$$

corresponding to the optimum fraction in Equation (2.3.12) with α = 1. Note, Peters chose to optimise the leverage
rather than optimising the fraction π(t). Differing from the Sharpe ratio for the market portfolio only by a square in the
volatility, the optimum leverage or GOP Sharpe ratio is also a fundamental measure of the quality and performance
of a portfolio. Further, unlike the Sharpe ratio, the optimum leverage is a dimensionless quantity, and as such can
distinguish between fundamentally different dynamical regimes.
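As a numerical check of the first-order condition, the sketch below evaluates the growth rate $r + \alpha\mu - \frac{1}{2}\alpha^2\sigma_M^2$ on a grid of leverages and compares the maximiser with the closed form $\mu / \sigma_M^2$. The parameter values (r = 2%, µ = 5%, σ_M = 20%) are hypothetical and chosen only for illustration, with µ read as the excess drift since it enters the growth rate on top of r.

```python
def growth_rate(alpha: float, r: float, mu: float, sigma_m: float) -> float:
    """Time-average growth rate of the leveraged portfolio."""
    return r + alpha * mu - 0.5 * alpha ** 2 * sigma_m ** 2


def optimal_leverage(mu: float, sigma_m: float) -> float:
    """Closed-form maximiser obtained by setting d g / d alpha = 0."""
    return mu / sigma_m ** 2


# Hypothetical parameters, not taken from the text.
r, mu, sigma_m = 0.02, 0.05, 0.20

alpha_star = optimal_leverage(mu, sigma_m)

# Brute-force search over leverages 0.00, 0.01, ..., 3.00.
grid = [k / 100.0 for k in range(0, 301)]
alpha_grid = max(grid, key=lambda a: growth_rate(a, r, mu, sigma_m))

print(f"closed form: {alpha_star:.4f}, grid search: {alpha_grid:.2f}")
print(f"growth rate at the optimum: {growth_rate(alpha_star, r, mu, sigma_m):.5f}")
```

With these inputs the optimum leverage is 1.25, and leveraging beyond it lowers the growth rate even though the expected return keeps increasing.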

2.3.4

Predictable variation in the Sharpe ratio

The Sharpe ratio (SR) is the most common measure of risk-adjusted return used by private investors to assess the
performance of mutual funds (see Modigliani et al. [1997]). Given evidence on predictable variation in the mean and
volatility of equity returns (see Fama et al. [1989]), various authors studied the predictable variation in equity market
SRs. However, due to the independence of the sample mean and sample variance of independently normally distributed
variables (see Theorem (B.7.3)), predictable variation in the individual moments does not imply predictable variation
in the Sharpe ratio. One must therefore ask whether these moments move together, leading to SRs which are more
stable and potentially less predictable than the two components individually. The intuition is that volatility in the
SR is not a good proxy for priced risk. Using regression analysis, some studies suggested a negative relation between
the conditional mean and volatility of returns, indicating the likelihood of substantial predictable variation in market
SRs. Using linear functions of four predetermined financial variables to estimate conditional moments, Whitelaw
[1997] showed that estimated conditional SRs exhibit substantial time-variation that coincides with the variation in
ex-post SRs and with the phases of the business cycle. For instance, the conditional SRs had monthly values ranging
from less than −0.3 to more than 1.0 relative to an unconditional SR of 0.14 over the full sample period. This variation
in estimated SRs closely matches variation in ex-post SRs measured over short horizons. Subsamples chosen on the
basis of in-sample regression have SRs more than three times larger than SRs over the full sample. On an out-of-sample
basis, using 10-year rolling regressions, subsample SRs exhibited similar magnitudes. As a result, Whitelaw showed
that relatively naive market-timing strategies exploiting this predictability could generate SRs more than 70% larger
than a buy-and-hold strategy. These active trading strategies involve switching between the market and the risk-free
asset depending on the level of the estimated SR relative to a specific threshold. This result is critical in asset allocation
decisions, and it has implications for the use of SRs in investment performance evaluation.
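The switching rule described above can be sketched in a few lines. This is a simplified illustration and not Whitelaw’s estimation procedure: the estimated SRs are taken as given inputs (in practice they would come from regressions using only information available before each period), and the series, the helper names and the use of the unconditional value 0.14 as threshold are our assumptions.

```python
from typing import List, Sequence


def timing_weights(estimated_sr: Sequence[float], threshold: float) -> List[float]:
    """Weight in the market: 1.0 when the estimated SR exceeds the threshold, else 0.0."""
    return [1.0 if sr > threshold else 0.0 for sr in estimated_sr]


def strategy_returns(weights: Sequence[float], market: Sequence[float],
                     risk_free: Sequence[float]) -> List[float]:
    """Per-period return when holding the market with weight w and the risk-free asset with 1 - w."""
    return [w * m + (1.0 - w) * f for w, m, f in zip(weights, market, risk_free)]


# Hypothetical inputs: SR estimates formed at the start of each period.
estimated_sr = [0.30, -0.10, 0.05, 0.40]
market = [0.02, -0.03, 0.01, 0.04]
risk_free = [0.003, 0.003, 0.003, 0.003]

weights = timing_weights(estimated_sr, threshold=0.14)
timed = strategy_returns(weights, market, risk_free)
print(weights)
print(timed)
```

In this toy sample the rule sits in the risk-free asset during the two periods with low estimated SRs and holds the market otherwise.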
While the Sharpe ratio is regarded as a reliable measure during periods of increasing stock prices, it leads to
erroneous conclusions during periods of declining share prices. However, there are still contradictions in the literature
with respect to the interpretation of the SR in bear market periods. Scholz et al. [2006] showed that ex-post Sharpe


ratios do not allow for meaningful performance assessment of funds during non-normal periods. Using a single factor
model, they showed the resulting SRs to be subject to random market climates (random mean and standard deviation
of market excess returns). Considering a sample of 532 US equity mutual funds, funds exhibiting relatively high
proportions of fund-specific risk showed on average superior ranking according to the SR in bear markets, and vice
versa. Using regression analysis, they ascertained that the SRs of funds significantly depend on the mean excess
returns of the market.

2.4

Risk and return analysis

Asset managers employ risk metrics to provide their investors with an accurate report of the return of the fund as well
as its risk. Risk measures allow investors to choose the best strategies per rebalancing frequency in a more robust way.
Performance evaluation of any asset, strategy, or fund tends to be done on returns that are adjusted for the average risk
taken. We call active return and active risk the return and risk measured relative to a benchmark. Since all investors
in funds have some degree of risk aversion and require limits on the active risk of the funds, they consider the ratio of
active return to active risk in a risk adjusted performance measure (RAPM) to rank different investment opportunities.
In general, RAPMs are used to rank portfolios in order of preference, implying that preferences are already embodied
in the measure. However, we saw in Section (1.6.2) that to make a decision we need a utility function. While some
RAPMs have a direct link to a utility function, others are still used to rank investments but we cannot deduce anything
about preferences from their ranking so that no decision can be based on their ranks (see Alexander [2008]).
The three measures by which the risk/return framework describes the universe of assets are the mean (taken as
the arithmetic mean), the standard deviation, and the correlation of an asset to other assets’ returns. Concretely,
historical time series of assets are used to calculate the statistics from it, then these statistics are interpreted as true
estimators of the future behaviour of the assets. In addition, invoking the central limit theorem, returns of individual
assets are assumed to be jointly normally distributed. Thus, given the assumption of a Gaussian (normal) distribution,
the first two moments suffice to completely describe the distribution of a multi-asset portfolio. As a result, adjustment
for volatility is the most common risk adjustment leading to Sharpe type metrics. Implicit in the use of the Sharpe ratio
is the assumption that the preferences of investors can be represented by the exponential utility function. This is because
the tractability of an exponential utility function allows an investor to form optimal portfolios by maximising a
mean-variance criterion. However, some RAPMs are based on downside risk metrics which are only concerned with returns
falling short of a benchmark or threshold returns, and are not linked to a utility function. Nonetheless these metrics
are used by practitioners irrespective of their theoretical foundation.

2.4.1

Some financial meaning to alpha and beta

2.4.1.1

The financial beta

In finance, the beta of a stock or portfolio is a number describing the correlated volatility of an asset in relation to the
volatility of the benchmark that this asset is being compared to. We saw in Section (2.3.1.2) that the beta coefficient
was born out of linear regression analysis of the returns of a portfolio (such as a stock index) (x-axis) in a specific
period versus the returns of an individual asset (y-axis) in a specific year. The regression line is then called the Security
Characteristic Line (SCL)

$$SCL : R_a(t) = \alpha_a + \beta_a R_m(t) + \epsilon_t$$

where $\alpha_a$ is called the asset’s alpha and $\beta_a$ is called the asset’s beta coefficient. Note, if we let $R_f$ be a constant rate,
we can rewrite the SCL as

$$SCL : R_a(t) - R_f = \alpha_a + \beta_a (R_m(t) - R_f) + \epsilon_t \qquad (2.4.14)$$

Both coefficients have an important role in Modern portfolio theory. For

• β < 0 the asset generally moves in the opposite direction as compared to the index.
• β = 0 movement of the asset is uncorrelated with the movement of the benchmark
• 0 < β < 1 movement of the asset is generally in the same direction as, but less than the movement of the
benchmark.
• β = 1 movement of the asset is generally in the same direction as, and about the same amount as the movement
of the benchmark
• β > 1 movement of the asset is generally in the same direction as, but more than the movement of the benchmark
We consider that a stock with β = 1 is a representative stock, or a stock that is a strong contributor to the index itself.
For β > 1 we get a volatile stock, or stocks which are very strongly influenced by day-to-day market news. Higher-beta stocks tend to be more volatile and therefore riskier, but provide the potential for higher returns. Lower-beta
stocks pose less risk but generally offer lower returns. For instance, a stock with a beta of 2 has returns that change,
on average, by twice the magnitude of the overall market’s returns: when the market’s return falls or rises by 3%, the
stock’s return will fall or rise (respectively) by 6% on average.
The Beta measures the part of the asset’s statistical variance that cannot be removed by the diversification provided
by the portfolio of many risky assets, because of the correlation of its returns with the returns of the other assets that
are in the portfolio. Beta can be estimated for individual companies by using regression analysis against a stock market
index. The formula for the beta of an asset within a portfolio is
$$\beta_a = \frac{Cov(R_a, R_b)}{Var(R_b)}$$

where Ra measures the rate of return of the asset, Rb measures the rate of return of the portfolio benchmark, and
Cov(Ra , Rb ) is the covariance between the rates of return. The portfolio of interest in the Capital Asset Pricing
Model (CAPM) formulation is the market portfolio that contains all risky assets, and so the Rb terms in the formula
are replaced by Rm , the rate of return of the market. Beta is also referred to as financial elasticity or correlated
relative volatility, and can be referred to as a measure of the sensitivity of the asset’s returns to market returns, its
non-diversifiable risk, its systematic risk, or market risk. On an individual asset level, measuring beta can give clues to
volatility and liquidity in the market place. As beta also depends on the correlation of returns, there can be considerable
variance about that average: the higher the correlation, the less variance; the lower the correlation, the higher the
variance.
In order to estimate beta, one needs a list of returns for the asset and returns for the index which can be daily,
weekly or any period. Then one uses standard formulas from linear regression. The slope of the fitted line from
the linear least-squares calculation is the estimated beta. The y-intercept is the estimated alpha. Beta is a statistical
variable and should be considered with its statistical significance (R square value of the regression line). Higher R
square value implies higher correlation and a stronger relationship between returns of the asset and benchmark index.
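The estimation steps above can be sketched without any external library: the least-squares slope equals the sample covariance over the sample variance of the benchmark, the intercept is the alpha, and the R square compares residual and total sums of squares. The return series below are made up for illustration and the function name `market_model` is ours.

```python
from typing import Sequence, Tuple


def market_model(asset: Sequence[float], bench: Sequence[float]) -> Tuple[float, float, float]:
    """Return (alpha, beta, r_squared) from an ordinary least-squares fit of asset on bench."""
    n = len(asset)
    if n != len(bench) or n < 3:
        raise ValueError("need two series of equal length with at least 3 observations")
    mean_a = sum(asset) / n
    mean_b = sum(bench) / n
    cov_ab = sum((a - mean_a) * (b - mean_b) for a, b in zip(asset, bench)) / (n - 1)
    var_b = sum((b - mean_b) ** 2 for b in bench) / (n - 1)
    beta = cov_ab / var_b              # slope of the fitted line
    alpha = mean_a - beta * mean_b     # y-intercept
    ss_res = sum((a - alpha - beta * b) ** 2 for a, b in zip(asset, bench))
    ss_tot = sum((a - mean_a) ** 2 for a in asset)
    r_squared = 1.0 - ss_res / ss_tot
    return alpha, beta, r_squared


# Hypothetical periodic returns for a benchmark index and an asset.
bench = [0.010, -0.020, 0.030, 0.000, 0.015]
asset = [0.021, -0.038, 0.062, 0.001, 0.029]

alpha, beta, r_squared = market_model(asset, bench)
print(f"alpha={alpha:.4f}, beta={beta:.3f}, R^2={r_squared:.3f}")
```

The same function applies to daily, weekly or monthly returns, as long as both series use the same period.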
Using beta as a measure of relative risk has its own limitations. Beta views risk solely from the perspective of
market prices, failing to take into consideration specific business fundamentals or economic developments. The price
level is also ignored. Beta also assumes that the upside potential and downside risk of any investment are essentially
equal, being simply a function of that investment’s volatility compared with that of the market as a whole. This too
is inconsistent with the world as we know it. The reality is that past security price volatility does not reliably predict
future investment performance (or even future volatility) and therefore is a poor measure of risk.


2.4.1.2

The financial alpha

Alpha is a risk-adjusted measure of the so-called active return on an investment. It is the part of the asset’s excess
return not explained by the market excess return. Put another way, it is the return in excess of the compensation for
the risk borne, and thus commonly used to assess active managers’ performances. Often, the return of a benchmark
is subtracted in order to consider relative performance, which yields Jensen’s [1968] alpha. It is the intercept of the
security characteristic line (SCL), that is, the coefficient of the constant in a market model regression in Equation
(2.4.14). Therefore the alpha coefficient indicates how an investment has performed after accounting for the risk it
involved:
• α < 0 the investment has earned too little for its risk (or, was too risky for the return)
• α = 0 the investment has earned a return adequate for the risk taken
• α > 0 the investment has a return in excess of the reward for the assumed risk
For instance, although a return of 20% may appear good, the investment can still have a negative alpha if it is involved
in an excessively risky position.
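To make the last remark concrete, taking averages in Equation (2.4.14) gives the alpha as the excess return of the asset minus beta times the excess return of the market. The sketch below applies this to a 20% return; the market return, risk-free rate and beta are hypothetical numbers chosen for the illustration.

```python
def jensen_alpha(asset_return: float, market_return: float,
                 risk_free: float, beta: float) -> float:
    """Average excess return left unexplained by the market excess return."""
    return (asset_return - risk_free) - beta * (market_return - risk_free)


# Hypothetical inputs: a 20% return earned with a beta of 2
# while the market returned 12% and the risk-free rate was 2%.
alpha = jensen_alpha(asset_return=0.20, market_return=0.12, risk_free=0.02, beta=2.0)
print(f"alpha = {alpha:.2%}")
```

Here the compensation for the risk borne is 2% + 2 × 10% = 22%, so the 20% return corresponds to an alpha of about −2%.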
A simple observation: during the middle of the twentieth century, around 75% of stock investment managers did
not make as much money picking investments as someone who simply invested in every stock in proportion to the
weight it occupied in the overall market in terms of market capitalisation, or indexing. A belief in efficient markets
spawned the creation of market capitalisation weighted index funds that seek to replicate the performance of investing
in an entire market in the weights that each of the equity securities comprises in the overall market. This phenomenon
created a new standard of performance that must be matched: an investment manager should not only avoid losing
money for the client and should make a certain amount of money, but in fact he should make more money than the
passive strategy of investing in everything equally.
Although the strategy of investing in every stock appeared to perform better than 75% of investment managers, the
price of the stock market as a whole fluctuates up and down. The passive strategy appeared to generate the market-beating return over periods of 10 years or more. This strategy may be risky for those who feel they might need to
withdraw their money before a 10-year holding period. Investors can use both Alpha and Beta to judge a manager’s
performance. If the manager has had a high alpha, but also a high beta, investors might not find that acceptable,
because of the chance they might have to withdraw their money when the investment is doing poorly.

2.4.2

Performance measures

When considering the performance evaluation of mutual funds, one needs to assess whether these funds are earning
higher returns than the benchmark returns (portfolio or index returns) in terms of risk. Three measures developed in
the framework of the Capital Asset Pricing Model (CAPM) proposed by Treynor [1965], Sharpe [1964] and Lintner
[1965] directly relate to the beta of the portfolio through the security market line (SML). Jensen’s [1968] alpha is
defined as the portfolio excess return earned in addition to the required average return, while the Treynor ratio and the
Information ratio are defined as the alpha divided by the portfolio beta and by the standard deviation of the portfolio
residual returns, respectively. More recent performance measures developed alongside hedge funds, such as the Sortino ratio, the $M^2$
and the Omega, focus on a measure of total risk, in the continuation of the Sharpe ratio applied to the capital market
line (CML). In the context of the extension of the CAPM to linear multi-factor asset pricing models, the development
of measures has not been so prolific (see Hubner [2007]).
The Sharpe ratio or Reward to Variability and Sterling ratio have been widely used to measure commodity trading
advisor (CTA) performance. One can group investment statistics as Sharpe type combining risk and return in a ratio,
or descriptive statistics (neither good nor bad) providing information about the pattern of returns. Examples of the
latter are regression statistics (systematic risk), covariance and R2 . Additional risk measures exist to accommodate
the risk concerns of different types of investors. Some of these measures have been categorised in Table (2.1).


Table 2.1: List of measures

Type              Combined Return and Risk Ratio
Normal            Sharpe, Information, Modified Information
Regression        Appraisal, Treynor
Partial Moments   Sortino, Omega, Upside Potential, Omega-Sharpe, Prospect
Drawdown          Calmar, Sterling, Burke, Sterling-Calmar, Pain, Martin
Value at Risk     Reward to VaR, Conditional Sharpe, Modified Sharpe

2.4.2.1

The Sharpe ratio

The Sharpe ratio measures the excess return per unit of deviation in an investment asset or a trading strategy defined
as

$$M_{SR} = \frac{E[R_a - R_b]}{\sigma} \qquad (2.4.15)$$

where $R_a$ is the asset return and $R_b$ is the return of a benchmark asset such as the risk free rate or an index. Hence,
$E[R_a - R_b]$ is the expected value of the excess of the asset return over the benchmark return, and σ is the standard
deviation of this excess return. It characterises how well the return of an asset compensates the investor for the
risk taken. If we graph the measure of return on the vertical axis and the measure of risk on the horizontal axis, then
the Sharpe ratio simply measures the gradient of the line from the risk-free rate to the combined return and risk of each
asset (or portfolio). Thus, the steeper the gradient, the higher the Sharpe ratio, and the better the combined
performance of risk and return.

Remark 2.4.1 The ex-post Sharpe ratio uses the above equation with the realised returns of the asset and benchmark
rather than expected returns.

$$M_{SR} = \frac{r_P - r_F}{\sigma_P}$$

where $r_P$ is the asset/portfolio return (annualised), $r_F$ is the annualised risk-free rate, and $\sigma_P$ is the portfolio risk or
standard deviation of return.
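A minimal sketch of the ex-post calculation follows, assuming a constant per-period risk-free rate and annualising with the square-root-of-time rule, which presumes uncorrelated returns. The monthly figures and the function name `ex_post_sharpe` are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def ex_post_sharpe(returns: Sequence[float], risk_free_per_period: float,
                   periods_per_year: int) -> float:
    """Annualised ex-post Sharpe ratio from realised per-period returns."""
    excess = [r - risk_free_per_period for r in returns]
    per_period = mean(excess) / stdev(excess)      # sample standard deviation
    return per_period * math.sqrt(periods_per_year)


# Hypothetical monthly returns with a zero risk-free rate.
monthly = [0.02, 0.00, 0.04, -0.02, 0.01]
sr = ex_post_sharpe(monthly, risk_free_per_period=0.0, periods_per_year=12)
print(f"annualised Sharpe ratio: {sr:.3f}")
```

With so few observations the estimate is very noisy, and the series is only meant to show the mechanics.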
This measure can be compared with the Information ratio in finance defined in general as mean over standard deviation of a series of measurements. The Sharpe ratio is directly computable from any observed series of returns without
the need for additional information surrounding the source of profitability. While the Treynor ratio only works with
the systematic risk of a portfolio, the Sharpe ratio observes both systematic and idiosyncratic risks. The SR has some shortcomings because all volatility is not equal, and the volatility taken in the measure ignores the distinction between
systematic and diversifiable risks. Further, volatility does not distinguish between losses occurring in good or bad times
or even between upside and downside surprises.
Remark 2.4.2 The returns measured can be any frequency (daily, weekly, monthly or annually) as long as they are
normally distributed, as the returns can always be annualised. However, not all asset returns are normally distributed.
The SR assumes that assets are normally distributed or equivalently that the investors’ preferences can be represented
by the quadratic (exponential) utility function. That is, the portfolio is completely characterised by its mean and
volatility. As soon as the portfolio is invested in technology stocks, distressed companies, hedge funds or high yield
bonds, this ratio is no longer valid. In that case, the risk comes not only from volatility but also from higher moments
like skewness and kurtosis. Abnormalities like kurtosis, fatter tails and higher peaks or skewness on the distribution
can be problematic for the computation of the ratio as standard deviation does not have the same effectiveness when
these problems exist. As a result, we can get very misleading measures of risk-return. In addition, the Sharpe ratio
being a dimensionless ratio it may be difficult to interpret the measure of different investments. This weakness was


well addressed by the development of the Modigliani risk-adjusted performance measure, which is in units of percent
returns. One needs to consider a proper risk-adjusted return measure to get a better feel of risk-adjusted out-performance
such as $M^2$ defined as

$$M^2 = (r_P - r_F) \frac{\sigma_M}{\sigma_P} + r_F$$

where $\sigma_M$ is the market risk or standard deviation of a benchmark return (see Modigliani et al. [1997]). It can also be
rewritten as

$$M^2 = r_P + M_{SR} (\sigma_M - \sigma_P)$$
where the variability can be replaced by any measure of risk and M 2 can be calculated for different types of risk
measures. This statistic introduces a return penalty for asset or portfolio risk greater than benchmark risk and a reward
if it is lower.
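The two expressions for $M^2$ agree because $M_{SR} \sigma_P = r_P - r_F$. The sketch below checks this on hypothetical annualised figures (a 10% portfolio return, a 2% risk-free rate, 20% portfolio risk and 15% benchmark risk); the function names are ours.

```python
def sharpe_ratio(r_p: float, r_f: float, sigma_p: float) -> float:
    return (r_p - r_f) / sigma_p


def m2(r_p: float, r_f: float, sigma_p: float, sigma_m: float) -> float:
    """Portfolio return rescaled to the benchmark level of risk."""
    return (r_p - r_f) * sigma_m / sigma_p + r_f


def m2_from_sharpe(r_p: float, r_f: float, sigma_p: float, sigma_m: float) -> float:
    """Equivalent form: portfolio return plus the Sharpe ratio times the risk gap."""
    return r_p + sharpe_ratio(r_p, r_f, sigma_p) * (sigma_m - sigma_p)


# Hypothetical annualised inputs.
r_p, r_f, sigma_p, sigma_m = 0.10, 0.02, 0.20, 0.15

print(f"M2 = {m2(r_p, r_f, sigma_p, sigma_m):.4f}")
print(f"M2 (Sharpe form) = {m2_from_sharpe(r_p, r_f, sigma_p, sigma_m):.4f}")
```

Since the portfolio is riskier than the benchmark in this example, the measure of 8% sits below the raw 10% return, which is the return penalty mentioned above.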
2.4.2.2

More measures of risk

Treynor proposed a risk adjusted performance measure associated with abnormal returns in the CAPM. The Treynor
ratio or reward to volatility is a Sharpe type ratio where the numerator (or vertical axis graphically speaking) is identical
but the denominator (horizontal axis) replaces total risk with systematic risk as calculated by beta

$$M_{TR} = \frac{r_P - r_F}{\beta_P}$$

where $\beta_P$ is the market beta (see Treynor [1965]). Although well known, the Treynor ratio is less useful precisely
because it ignores specific risk. It will converge to the Sharpe ratio in a fully diversified portfolio with no specific
risk. The Appraisal ratio suggested by Treynor & Black [1973] is a Sharpe ratio type with excess return adjusted for
systematic risk in the numerator, and specific risk and not total risk in the denominator

$$M_{AR} = \frac{\alpha}{\sigma_\epsilon}$$

where α is the Jensen’s alpha and $\sigma_\epsilon$ is the specific risk. It measures the systematic risk adjusted reward for each unit
of specific risk taken.
While the Sharpe ratio compares absolute return and absolute risk, the Information ratio compares the excess return
and tracking error (the standard deviation of excess return). That is, the Information ratio is a Sharpe ratio type with
excess return on the vertical axis and tracking error or relative risk on the horizontal axis. As we are not using the
risk-free rate, the information ratio lines radiate from the origin and can be negative indicating underperformance.
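The three ratios of this section can be sketched as follows. The Information ratio is computed here from the active returns (portfolio minus benchmark) and their sample standard deviation, taken as the tracking error; all inputs are hypothetical and the function names are ours.

```python
from statistics import mean, stdev
from typing import Sequence


def treynor_ratio(r_p: float, r_f: float, beta_p: float) -> float:
    """Excess return per unit of systematic risk."""
    return (r_p - r_f) / beta_p


def appraisal_ratio(alpha: float, specific_risk: float) -> float:
    """Jensen's alpha per unit of specific (residual) risk."""
    return alpha / specific_risk


def information_ratio(portfolio: Sequence[float], benchmark: Sequence[float]) -> float:
    """Mean active return divided by the tracking error."""
    active = [p - b for p, b in zip(portfolio, benchmark)]
    return mean(active) / stdev(active)


# Hypothetical per-period returns of a portfolio and its benchmark.
portfolio = [0.01, 0.02, -0.01, 0.03]
benchmark = [0.02, 0.02, 0.00, 0.03]

print(f"Treynor ratio: {treynor_ratio(0.10, 0.02, 1.2):.4f}")
print(f"Appraisal ratio: {appraisal_ratio(0.01, 0.05):.2f}")
print(f"Information ratio: {information_ratio(portfolio, benchmark):.3f}")
```

The Information ratio is negative in this sample because the portfolio trails its benchmark on average, illustrating the underperformance case mentioned above.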
2.4.2.3

Alpha as a measure of risk

The Jensen’s alpha, which is the excess return adjusted for systematic risk, was argued by Jensen to be a more appropriate measure than TR or IR for ranking the potential performance of different portfolios, implying that asset managers
should view the best investment as the one with the largest alpha, irrespective of its risk. In view of keeping track
of the portfolio’s returns Ross extended the CAPM to multiple risk factors, leading to the multi-factor return models
commonly used to identify alpha opportunities. Some multi-factor models include size and value factors besides the
market beta factor, and others include various industry and style factors for equities. In contrast to the CAPM that has
only one risk factor, namely the overall market, the arbitrage pricing theory (APT) has multiple risk factors. Each risk factor has a corresponding
beta indicating the responsiveness of the asset being priced to that risk factor. Whatever risk factors are used, significant average loadings on any risk factor are viewed as evidence of a systematic risk tilt. Therefore, while the Jensen’s
alpha is the intercept when regressing asset returns on equity market returns, it is also the intercept of any risk factor
model and can be used as a metric. However, it is extremely difficult to obtain a consistent ranking of portfolios
using Jensen’s alpha because the estimated alpha is too dependent on the multi-factor model used. For example, as the
number of factors increases, there is ever less scope for alpha.


2.4.2.4

Empirical measures of risk

In the CAPM, the empirical derivation of the security market line (SML) corresponds to the market model
$$R_i^e = \alpha_i + \beta_i R_m + \epsilon_i$$
where $R_i^e = R_i - R_f$ denotes the excess return on the ith security. The risk and return are forecast ex-ante using a
model for the risk and the expected return. They may also be estimated ex-post using some historical data on returns,
assuming that investors believe that historical information is relevant to infer the future behaviour of financial assets.
The ex-post (or realised) version of the SML is
$$\bar{R}_i^e = \hat{\alpha}_i + \hat{\beta}_i \bar{R}^m$$
where $\bar{R}_i^e$ is the average excess return for the ith security, and $\hat{\beta}_i = \widehat{Cov}(R_i^e, R_m) / \widehat{Var}(R_m)$ and $\hat{\alpha}_i = \bar{R}_i^e - \hat{\beta}_i \bar{R}^m$ are the estimators
of $\beta_i$ and $\alpha_i$, respectively. The Jensen’s alpha is measured by the $\hat{\alpha}_i$ in the ex-post SML, while the Treynor ratio is
defined as the ratio of Jensen’s alpha over the stock beta, that is, $M_{TR}(i) = \hat{\alpha}_i / \hat{\beta}_i$. Finally, the Information Ratio is a
measure of Jensen’s alpha per unit of portfolio specific risk, measured as the standard deviation of the market model
residuals, $M_{IR} = \hat{\alpha}_i / \hat{\sigma}(\epsilon_i)$.
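As a minimal sketch of the ex-post estimators above, the following Python function fits the market model by ordinary least squares and returns the alpha, the beta, the Treynor ratio as defined here (alpha over beta) and the information ratio. The function name, the use of the least-squares slope (sample covariance divided by the sample variance of the market return) and the n − 2 degrees of freedom for the residual standard deviation are assumptions made for illustration.

```python
import math


def market_model_stats(asset_excess, market):
    """Ex-post alpha, beta, Treynor ratio (alpha / beta) and information ratio.

    asset_excess: excess returns of the security, one value per period
    market:       market returns over the same periods
    """
    n = len(asset_excess)
    mean_a = sum(asset_excess) / n
    mean_m = sum(market) / n
    # Sample covariance and variance (both with n - 1, so the ratio is the OLS slope).
    cov = sum((a - mean_a) * (m - mean_m)
              for a, m in zip(asset_excess, market)) / (n - 1)
    var_m = sum((m - mean_m) ** 2 for m in market) / (n - 1)
    beta = cov / var_m
    alpha = mean_a - beta * mean_m
    # Residuals of the fitted market model; two parameters were estimated.
    residuals = [a - alpha - beta * m for a, m in zip(asset_excess, market)]
    resid_sd = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    return {
        "alpha": alpha,
        "beta": beta,
        "treynor": alpha / beta,             # unstable when beta is close to zero
        "information_ratio": alpha / resid_sd,
    }
```

The estimates are per period (daily, weekly or monthly, depending on the input), so they still have to be annualised before portfolios sampled at different frequencies are compared.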
The sample estimates being based on monthly, weekly or daily data, the moments in the risk measures must be
annualised. However, the formulas used to convert returns or volatility measures from one time period to another assume
a particular underlying model or process. When portfolio returns are autocorrelated, the standard deviation does
not obey the square-root-of-time rule and one must use higher moments leading to Equation (3.3.11). When returns
are perfectly correlated, with 100% autocorrelation, a positive return is followed by a positive return and we get a
trending market. On the other hand, when the autocorrelation is −100%, a positive return is followed by a negative
return and we get a mean reverting or contrarian market. Therefore, assuming a 100% daily correlated market with
1% daily return (5% weekly return), the volatility annualised from daily data is about 16% ($1\% \times \sqrt{252}$) but the
volatility annualised from weekly data is about 36% ($5\% \times \sqrt{52}$), which is more than twice as large (see Bennett et al. [2012]).
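The arithmetic of this example can be checked with a short Python sketch. The first function is the square-root-of-time rule. The second one is an illustration only: it assumes that the autocorrelation at lag k equals rho to the power k (an AR(1)-type structure, which is our assumption and not necessarily the model behind Equation (3.3.11)), in which case the variance of the sum of q returns is the one-period variance times q + 2 Σ (q − k) rho^k.

```python
import math


def annualise_vol(period_vol, periods_per_year):
    """Square-root-of-time rule, valid only for uncorrelated returns."""
    return period_vol * math.sqrt(periods_per_year)


def aggregate_vol_ar1(period_vol, q, rho):
    """Volatility of the sum of q consecutive returns when the lag-k
    autocorrelation is rho**k (illustrative assumption)."""
    factor = q + 2.0 * sum((q - k) * rho ** k for k in range(1, q))
    return period_vol * math.sqrt(factor)


# Annualising 1% per day and 5% per week with the square-root-of-time rule.
from_daily = annualise_vol(0.01, 252)
from_weekly = annualise_vol(0.05, 52)

# With rho = 1 a 1% daily move aggregates to 5% over a five-day week,
# with rho = 0 to 1% * sqrt(5), and with rho = -1 the moves largely cancel.
trending_week = aggregate_vol_ar1(0.01, 5, 1.0)
uncorrelated_week = aggregate_vol_ar1(0.01, 5, 0.0)
contrarian_week = aggregate_vol_ar1(0.01, 5, -1.0)
```

The gap between the two annualised figures comes entirely from applying a rule built for uncorrelated returns to a perfectly trending series.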
When the use of a single index or benchmark in a market model is not sufficient to keep track of the systematic sources of portfolio returns in excess of the risk free rate, one can consider the families of linear multi-index
unconditional asset pricing models among which is the ex-post multidimensional equation
$$\bar{R}_i^e = \hat{\alpha}_i + \sum_{j=1}^{k} \hat{\beta}_{ij} \bar{R}_j^e = \hat{\alpha}_i + \hat{B}_i \bar{R}^e$$
where $j = 1, ..., k$ indexes the $k$ distinct risk factors, and the row vector $\hat{B}_i = (\hat{\beta}_{i1}, .., \hat{\beta}_{ik})$ and the column vector
$\bar{R}^e = (\bar{R}_1^e, .., \bar{R}_k^e)^\top$ represent risk loadings and average returns for the factors, respectively. In this setting, the alpha
remains a scalar, and the standard deviation of the regression residuals is also a positive number, so that the multi-index
counterparts of the Jensen’s alpha and the information ratio are similar to the performance measures applied to the
single index model. To conserve the same interpretation as the original Treynor ratio, Hubner [2005] proposed the
following generalisation
$$M_{GTR}(i) = \hat{\alpha}_i \frac{\hat{B}_l \bar{R}^e}{\hat{B}_i \bar{R}^e}$$
where l denotes the benchmark portfolio against which the ith portfolio is compared.


2.4.2.5 Incorporating tail risk

While the problem of fat tails is everywhere in financial risk analysis, there is no solution, and one should only consider
partial solutions. Hence, one practical strategy for dealing with the messy but essential issues related to measuring
and managing risk starts with the iron rule of never relying on one risk metric. Even though all of the standard metrics
have well-known flaws, that does not make them worthless, but it is a reminder that we must understand where any
one risk measure stumbles, where it can provide insight, and what are the possible fixes, if any.
The main flaw of the Sharpe ratio (SR) is that it uses standard deviation as a proxy for risk. This is a problem
because standard deviation, which is the second moment of the distribution, works best with normal distributions.
Therefore, investors must have a minimal type of risk aversion to variance alone, as if their utility function was
exponential. Even though normality has some validity over long periods of time, in the short run it is very unlikely
(see short maturity smile on options in Section (1.7.6.1)). Note, when moving away from the normality assumption
for the stock returns, only the denominator in the Sharpe ratio is modified. Hence, all the partial solutions are attempts
at expressing one way or another the proper noise of the stock returns. While extensions of the SR relaxing the normality
assumption have been successful, extensions to different types of utility function have been more problematic.
The problem is that extreme losses occur more frequently than one would expect when assuming that price changes
are always and forever random (normally distributed). Put another way, statistics calculated using normal assumption
might underestimate risk. One must therefore account for higher moments to get a better understanding of the shape
of the distribution of returns in view of assessing the relative qualities of portfolios. Investors should prefer high
average returns, lower variance or standard deviation, positive skewness, and lower kurtosis. The adjusted Sharpe
ratio suggested by Pezier et al. [2006] explicitly rewards positive skewness and low kurtosis (below 3, the kurtosis of
a normal distribution) in its calculation
$$M_{ASR} = M_{SR}\left[1 + \frac{S}{6} M_{SR} - \frac{K-3}{24} M_{SR}^2\right]$$
where S is the skew and K is the kurtosis. This adjustment will tend to lower the SR if there is negative skewness
and positive excess kurtosis in the returns. Hence, it potentially removes one of the possible criticisms of the Sharpe
ratio. Hodges [1997] introduced another extension of the SR accounting for non-normality and incorporating utility
function. Assuming that investors are able to find the expected maximum utility E[u∗ ] associated with any portfolio,
the generalised Sharpe ratio (GSR) of the portfolio is
$$M_{GSR} = \left(-2 \ln\left(-E[u^*]\right)\right)^{\frac{1}{2}}$$
One can avoid the difficulty of computing the maximum expected utility by assuming the investor has an exponential
utility function. Using the fourth order Taylor approximation of the certainty equivalent, and approximating the multiplicative factor, Pezier et al. [2006] obtained the maximum expected utility function in that setting and showed that
the GSR simplifies to the ASR. Thus, when the utility function is exponential and the returns are normally distributed,
the GSR is identical to the SR. Otherwise a negative skewness and high positive kurtosis will reduce the GSR relative
to the SR.
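The adjusted Sharpe ratio is straightforward to evaluate once the Sharpe ratio, the skew and the kurtosis are known. The sketch below is a direct transcription of the formula of Pezier et al. [2006] quoted above; the function and argument names are ours, and the kurtosis argument is the raw kurtosis K (equal to 3 for a normal distribution), not the excess kurtosis.

```python
def adjusted_sharpe_ratio(sharpe, skew, kurtosis):
    """Adjusted Sharpe ratio: SR * [1 + (S / 6) * SR - ((K - 3) / 24) * SR**2].

    kurtosis is the raw kurtosis K, so a normal distribution has K = 3.
    """
    return sharpe * (1.0
                     + (skew / 6.0) * sharpe
                     - ((kurtosis - 3.0) / 24.0) * sharpe ** 2)


# Normal returns leave the Sharpe ratio unchanged.
asr_normal = adjusted_sharpe_ratio(1.0, 0.0, 3.0)
# Negative skew and fat tails lower it.
asr_fat_tails = adjusted_sharpe_ratio(1.0, -1.0, 6.0)
```

With a Sharpe ratio of 1, a skew of −1 and a kurtosis of 6, the adjustment factor is 1 − 1/6 − 3/24, so the adjusted ratio falls to roughly 0.71.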

2.4.3 Some downside risk measures

Downside risk measures the variability of underperformance below a minimum target rate which could be the risk
free rate, the benchmark or any other fixed threshold required by the client. All positive returns are included as zero
in the calculation of downside risk or semi-standard deviation. Investors being less concerned with variability on the
upside, and extremely concerned about the variability on the downside, an extended family of risk-adjusted measures
flourished, reflecting the downside risk tolerances of investors seeking absolute and not relative returns. Lower partial
moments (LPMs) measure risk by negative deviations of the returns realised in relation to a minimal acceptable return
$r_T$. The LPM of kth-order is computed as
$$LPM(k) = E[\max(r_T - r, 0)^k] = \frac{1}{n}\sum_{i=1}^{n} \max[r_T - r_i, 0]^k$$
Kaplan et al. [2004] introduced a Sharpe type ratio with lower partial moments in the denominator, given by
$\sqrt[k]{LPM(k)}$. The Kappa index of order k is
$$M_K = \frac{r_P - r_T}{\sqrt[k]{LPM(k)}}$$
Kappa indices can be tailored to the degree of risk aversion of the investor, but cannot be used to rank portfolios’
performance according to the investor’s preference. One possible calculation of semi-standard deviation or downside risk
in the period [0, T ] is
$$\sigma_D = \sqrt{\sum_{i=1}^{n} \frac{1}{n} \min[r_i - r_T, 0]^2}$$
where $r_T$ is the minimum target return. Downside potential is simply the average of returns below target
$$\frac{1}{n}\sum_{i=1}^{n} \min[r_i - r_T, 0] = \frac{1}{n}\sum_{i=1}^{n} I_{\{r_i < r_T\}} (r_i - r_T)$$
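The lower partial moment, the Kappa index, the downside risk and the downside potential defined so far can be sketched in a few lines of Python. The function names and the small return sample used below are ours and purely illustrative; the portfolio return $r_P$ is taken to be the sample mean of the returns.

```python
def lpm(returns, target, k):
    """Lower partial moment of order k: average of max(target - r, 0)**k."""
    return sum(max(target - r, 0.0) ** k for r in returns) / len(returns)


def kappa(returns, target, k):
    """Kappa index of order k: (mean return - target) / LPM(k)**(1/k)."""
    mean_r = sum(returns) / len(returns)
    return (mean_r - target) / lpm(returns, target, k) ** (1.0 / k)


def downside_risk(returns, target):
    """Semi-standard deviation: square root of the second lower partial moment."""
    return lpm(returns, target, 2) ** 0.5


def downside_potential(returns, target):
    """Average of min(r - target, 0); a non-positive number."""
    return sum(min(r - target, 0.0) for r in returns) / len(returns)


sample = [0.04, -0.02, 0.01, -0.03, 0.05]   # hypothetical periodic returns
k1 = kappa(sample, 0.0, 1)
k2 = kappa(sample, 0.0, 2)
```

Note that the downside potential is minus the first-order lower partial moment, and that the Kappa index is undefined when no return falls below the target.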

Shadwick et al. [2002] proposed a gain-loss ratio, called Omega, that captures the information in the higher moments
of the return distribution
$$M_\Omega = \frac{E[\max(r - r_T, 0)]}{E[\max(r_T - r, 0)]} = \frac{\frac{1}{n}\sum_{i=1}^{n} \max(r_i - r_T, 0)}{\frac{1}{n}\sum_{i=1}^{n} \max(r_T - r_i, 0)}$$
This ratio implicitly adjusts for both skewness and kurtosis. It can also be used as a ranking statistic (the higher, the
better). Note, the ratio is equal to 1 when $r_T$ is the mean return. Kaplan et al. [2004] showed that the Omega ratio
can be rewritten as a Sharpe type ratio called Omega-Sharpe ratio
$$M_{OSR} = \frac{r_P - r_T}{\frac{1}{n}\sum_{i=1}^{n} \max(r_T - r_i, 0)}$$
which is simply Ω − 1, thus generating the same ranking as the Omega ratio. Setting $r_T = 0$ in the Omega ratio,
Bernardo et al. [1996] obtained the Bernardo Ledoit ratio (or Gain-Loss ratio)
$$M_{BLR} = \frac{\frac{1}{n}\sum_{i=1}^{n} \max(r_i, 0)}{\frac{1}{n}\sum_{i=1}^{n} \max(-r_i, 0)}$$
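The three gain-loss measures above differ only in the target and in the numerator, as the following Python sketch illustrates. The function names and the sample are hypothetical, and $r_P$ is taken to be the sample mean of the returns.

```python
def _average_gain(returns, target):
    return sum(max(r - target, 0.0) for r in returns) / len(returns)


def _average_loss(returns, target):
    return sum(max(target - r, 0.0) for r in returns) / len(returns)


def omega_ratio(returns, target):
    """Average gain above the target over average loss below the target."""
    return _average_gain(returns, target) / _average_loss(returns, target)


def omega_sharpe_ratio(returns, target):
    """(mean return - target) over average loss below the target."""
    mean_r = sum(returns) / len(returns)
    return (mean_r - target) / _average_loss(returns, target)


def bernardo_ledoit_ratio(returns):
    """Omega ratio with a zero target."""
    return omega_ratio(returns, 0.0)


sample = [0.04, -0.02, 0.01, -0.03, 0.05]   # hypothetical periodic returns
omega = omega_ratio(sample, 0.0)
omega_sharpe = omega_sharpe_ratio(sample, 0.0)
```

Since the mean return minus the target equals the average gain minus the average loss, the Omega-Sharpe ratio is the Omega ratio minus one. All three ratios are undefined when no return falls below the target.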

Sortino et al. [1991] proposed an extension of the Omega-Sharpe ratio by using downside risk in the denominator
$$M_{SoR} = \frac{r_P - r_T}{\sigma_D}$$
In that measure, portfolio managers will only be penalised for variability below the minimum target return, but will
not be penalised for upside variability. In order to rank portfolio performance while combining upside potential with
downside risk, Sortino et al. [1999] proposed the Upside Potential ratio
$$M_{UPR} = \frac{\frac{1}{n}\sum_{i=1}^{n} \max(r_i - r_T, 0)}{\sigma_D}$$
This measure is similar to the Omega ratio except that performance below target is penalised further by using downside
risk rather than downside potential. Going further, we can replace the upside potential in the numerator with the upside
risk, getting the Variability Skewness
$$M_{VSR} = \frac{\sigma_U}{\sigma_D}$$
where $\sigma_U$ denotes the upside risk, the counterpart of $\sigma_D$ computed from returns above the target.

2.4.4 Considering the value at risk

2.4.4.1 Introducing the value at risk

The Value at Risk (VaR) is a widely used measure of the risk of loss on a specific portfolio of financial assets. VaR
is defined as a threshold value such that the probability that the mark-to-market loss on the portfolio over the given
time horizon exceeds this value (assuming normal markets and no trading in the portfolio) is the given probability
level. For example, if a portfolio of stocks has a one-day 5% VaR of $1 million, there is a 0.05 probability that the
portfolio will fall in value by more than $1 million over a one day period if there is no trading. Informally, a loss of
$1 million or more on this portfolio is expected on 1 day out of 20 days (because of 5% probability). VaR represents a
percentile of the predictive probability distribution for the size of a future financial loss. That is, if you have a record
of changes in portfolio value over time then the VaR is simply the negative of the corresponding quantile of those changes.
Given a confidence level α ∈ (0, 1), the VaR of the portfolio at the confidence level α is given by the smallest
number $z_p$ such that the probability that the loss L exceeds $z_p$ is at most (1 − α). Assuming normally distributed
returns, the Value-at-Risk (daily or monthly) is
$$VaR(p) = W_0 (\mu - z_p \sigma)$$
where $W_0$ is the initial portfolio wealth, µ is the expected asset return (daily or monthly), σ is the standard deviation
(daily or monthly), and $z_p$ is the number of standard deviations at (1 − α) (distance between µ and the VaR in number
of standard deviations). It ensures that
$$P(dW \le -VaR(p)) = 1 - \alpha$$
Note, $VaR(p)$ represents the lower bound of the confidence interval given in Appendix (B.8.6). For example, a two-sided
95% confidence interval, with the remaining 5% split between the two tails, gives $z_p = 1.96$ with p = 97.5%.
If returns do not display a normal distribution pattern, the Cornish-Fisher expansion can be used to include skewness and kurtosis in computing value at risk (see Favre et al. [2002]). It adjusts the z-value of a standard VaR for
skewness and kurtosis as follows
$$z_{cf} = z_p + \frac{1}{6}(z_p^2 - 1)S + \frac{1}{24}(z_p^3 - 3z_p)K - \frac{1}{36}(2z_p^3 - 5z_p)S^2$$
where $z_p$ is the critical value according to the chosen α-confidence level in a standard normal distribution, S is the
skewness, K is the excess kurtosis. Integrating them into the VaR measure by means of the Cornish-Fisher expansion
$z_{cf}$, we end up with a modified formulation for the VaR, called MVaR
$$MVaR(p) = W_0 (\mu - z_{cf} \sigma)$$
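The Cornish-Fisher adjustment can be sketched in Python as follows. Sign conventions for the critical value differ between authors, so the example states its own: here z is the lower-tail quantile of the standard normal distribution (about −1.645 for a 5% tail probability), and the adjusted return quantile is µ + z_cf σ, which is an assumption of this sketch and not necessarily the convention used in the formula above. The function names are ours.

```python
from statistics import NormalDist


def cornish_fisher_z(z, skew, excess_kurtosis):
    """Cornish-Fisher adjustment of a standard normal quantile z."""
    return (z
            + (z ** 2 - 1.0) * skew / 6.0
            + (z ** 3 - 3.0 * z) * excess_kurtosis / 24.0
            - (2.0 * z ** 3 - 5.0 * z) * skew ** 2 / 36.0)


def modified_return_quantile(mu, sigma, tail_prob, skew, excess_kurtosis):
    """Return level that is undershot with probability tail_prob,
    using the lower-tail (negative) normal quantile as the starting point."""
    z = NormalDist().inv_cdf(tail_prob)
    return mu + cornish_fisher_z(z, skew, excess_kurtosis) * sigma


# With zero skew and zero excess kurtosis the normal quantile is recovered.
normal_q = modified_return_quantile(0.0, 1.0, 0.05, 0.0, 0.0)
# Negative skew pushes the 5% quantile further into the loss tail.
skewed_q = modified_return_quantile(0.0, 1.0, 0.05, -0.5, 1.0)
```

Multiplying the (negative) return quantile by minus the initial wealth gives a loss amount. The expansion is an approximation and can behave poorly for large skewness or kurtosis.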
2.4.4.2 The reward to VaR

We saw earlier that when the risk is only measured with the volatility it is often underestimated, because asset
returns are negatively skewed and have fat tails. One solution is to use the value-at-risk as a measure of risk, and
consider Sharpe type measures using VaR. For instance, replacing the standard deviation in the denominator with the
VaR ratio (VaR expressed as a percentage of portfolio value rather than an amount), Dowd [2000] got the Reward to
VaR
$$M_{RVaR} = \frac{r_P - r_F}{\text{VaR ratio}}$$
Note, the VaR measure does not provide any information about the shape of the tail or the expected size of loss beyond
the confidence level, making it an unsatisfactory risk measure.

2.4.4.3 The conditional Sharpe ratio

Tail risk is the possibility that investment losses will exceed expectations implied by a normal distribution. One attempt
at trying to anticipate non-normality is the modified Sharpe ratio, which incorporates skewness and kurtosis into the
calculation. Another possibility is the so-called conditional Sharpe ratio (CSR), based on the expected shortfall, which attempts
to quantify the risk that an asset or portfolio will experience extreme losses. VaR tries to tell us what the possible
loss is up to some confidence level, usually 95%. So, for instance, one might say that a certain portfolio is at risk
of losing no more than X% for 95% of the time. What about the remaining 5%? Conditional VaR, or CVaR, dares to tread into this
black hole of fat taildom (by accounting for the shape of the tail). For the conditional Sharpe ratio, CVaR replaces
standard deviation in the metric’s denominator
$$M_{CVaR} = \frac{r_P - r_F}{CVaR(p)}$$
The basic message in conditional Sharpe ratio, like that of its modified counterpart, is that investors underestimate risk
by roughly a third or more when looking only at standard deviation and related metrics (see Agarwal et al. [2004]).
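To make the distinction between VaR and CVaR concrete, the sketch below estimates both from a historical sample and plugs the CVaR into the conditional Sharpe ratio. It uses one simple empirical convention among several, which is our assumption: the tail is made of the worst ⌊n × tail probability⌋ observations, the VaR is the smallest loss in that tail and the CVaR is the average loss in that tail, both expressed as positive fractions of portfolio value. Function names and data are hypothetical.

```python
def historical_var_cvar(returns, tail_prob=0.05):
    """Empirical VaR and CVaR as positive loss fractions."""
    ordered = sorted(returns)                       # worst returns first
    n_tail = max(1, int(len(ordered) * tail_prob))  # at least one observation
    tail = ordered[:n_tail]
    var = -tail[-1]                # loss at the edge of the tail
    cvar = -sum(tail) / n_tail     # average loss inside the tail
    return var, cvar


def conditional_sharpe_ratio(returns, risk_free, tail_prob=0.05):
    """(mean return - risk-free rate) / CVaR."""
    mean_r = sum(returns) / len(returns)
    _, cvar = historical_var_cvar(returns, tail_prob)
    return (mean_r - risk_free) / cvar


sample = [-0.10, -0.06, -0.03, -0.02, -0.01, 0.00, 0.00, 0.01, 0.01, 0.01,
          0.02, 0.02, 0.02, 0.03, 0.03, 0.03, 0.04, 0.04, 0.05, 0.05]
var_10, cvar_10 = historical_var_cvar(sample, tail_prob=0.10)
```

On this sample the 10% VaR is a 6% loss while the CVaR is an 8% loss: the VaR says nothing about the 10% loss sitting beyond it, which is precisely the information the CVaR adds.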
2.4.4.4 The modified Sharpe ratio

The modified Sharpe ratio (MSR) is one of several attempts at addressing the limitations of standard deviation. MSR
is far from a complete solution; still, it factors in two aspects of non-normal distributions, skewness and kurtosis. It
does so through the use of what is known as a modified Value at Risk measure (MVaR) as the denominator. The MVaR
follows the Cornish-Fisher expansion, which can adjust the VaR in terms of asymmetric distribution (skewness) and
above-average frequency of earnings at both ends of the distribution (kurtosis). The modified Sharpe ratio is
$$M_{MSR} = \frac{r_P - r_F}{MVaR(p)}$$
Similarly to the Adjusted Sharpe ratio, the Modified Sharpe ratio uses modified VaR adjusted for skewness and kurtosis
(see Gregoriou et al. [2003]). In a ten-year example, in all cases the modified Sharpe ratio was lower than its
traditional Sharpe ratio counterpart. Hence, for the past decade, risk-adjusted returns were lower than expected after
adjusting for skewness and kurtosis. Note, depending on the rolling period, MSR has higher sensitivity to changes in
non-normal distributions whereas the standard SR is immune to those influences.

2.4.4.5 The constant adjusted Sharpe ratio

Eling et al. [2006] showed that even though hedge fund returns are not normally distributed, the first two moments
describe the return distribution sufficiently well. Furthermore, on a theoretical basis, the Sharpe ratio is consistent with
expected utility maximisation under the assumption of elliptically distributed returns. Taking all the previous remarks
into consideration, we propose a new very simple Sharpe ratio called the Constant Adjusted Sharpe ratio and defined
as
$$M_{CASR} = \frac{r_P - r_F}{\sigma(1 + \epsilon_S)}$$
where $\epsilon_S > 0$ ($\epsilon_S = \frac{1}{3}$ to recover the conditional Sharpe ratio) is the constant adjusting the volatility defined in Section (??). In that
measure, the volatility is simply modified by a constant.

2.4.5 Considering drawdown measures

For an investor wishing to avoid losses, any continuous losing return period or drawdown constitutes a simple measure
of risk. The drawdown measures the decline from a historical peak in some variable (see Magdon-Ismail et al. [2004]).
It is the pain period experienced by an investor between peak (new highs) and subsequent valley (a low point before
moving higher). If $(X_t)_{t \ge 0}$ is a random process with $X_0 = 0$, the drawdown $D(T)$ at time T is defined as
$$D(T) = \max\left\{0, \max_{t \in (0,T)} (X_t - X_T)\right\}$$
One can count the total number of drawdowns $n_d$ in the entire period [0, T ] and compute the average drawdown as
$$\bar{D}(T) = \frac{1}{n_d} \sum_{i=1}^{n_d} D_i$$
where $D_i$ is the ith drawdown over the entire period. The maximum drawdown (MDD) up to time T is the maximum
of the drawdown over the history of the variable (typically the Net Asset Value of an investment)
$$MDD(\tau) = \max_{\tau \in (0,T)} D(\tau)$$

In a long-short portfolio, the maximum drawdown is the maximum loss an investor can suffer in the fund buying at the
highest point and selling at the lowest. We can also define the drawdown duration as the length of any peak to peak period, or the time between new equity highs. Hence, the maximum drawdown duration is the worst (maximum/longest)
amount of time an investment has seen between peaks. Martin [1989] developed the Ulcer index where the impact of
the duration of drawdowns is incorporated by selecting the negative return for each period below the previous peak or
high water mark
$$\text{Ulcer Index} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (D_i')^2}$$
where $D_i'$ is the drawdown since the previous peak in the ith period. This way, deep, long drawdowns will have a significant impact as the underperformance since the last peak is squared. Being sensitive to the frequency of time period,
this index penalises managers taking time to recover from a previous high. If the drawdowns are not squared, we get
the Pain index
$$\text{Pain Index} = \frac{1}{n} \sum_{i=1}^{n} |D_i'|$$
which is similar to the Zephyr Pain index in discrete form proposed by Becker.
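The drawdown statistics above can be computed in a single pass over a net asset value series, as in the following Python sketch. As an assumption of the example, the drawdown in each period is measured as the fractional decline from the running peak of the NAV, rather than as an absolute difference on a process started at zero; the function names and the NAV series are hypothetical.

```python
import math


def drawdown_series(nav):
    """Fractional decline from the running peak, one value per period."""
    peak = nav[0]
    drawdowns = []
    for value in nav:
        peak = max(peak, value)            # high water mark so far
        drawdowns.append((peak - value) / peak)
    return drawdowns


def max_drawdown(nav):
    return max(drawdown_series(nav))


def ulcer_index(nav):
    """Root mean square of the per-period drawdowns."""
    dd = drawdown_series(nav)
    return math.sqrt(sum(d * d for d in dd) / len(dd))


def pain_index(nav):
    """Mean of the per-period drawdowns."""
    dd = drawdown_series(nav)
    return sum(dd) / len(dd)


nav = [100.0, 110.0, 99.0, 104.5, 121.0, 108.9]   # hypothetical NAV path
mdd = max_drawdown(nav)
```

Because every period spent below the high water mark contributes to the sums, both indices grow with the duration of a drawdown as well as with its depth, and squaring makes the Ulcer index more sensitive to the deepest periods.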
We are considering measures which are modifications of the Sharpe ratio in the sense that the numerator is always
the excess of mean returns over the risk-free rate, but the standard deviation of returns in the denominator is replaced by
some function of the drawdown.
The Calmar ratio (or drawdown ratio) is a performance measurement used to evaluate hedge funds which was
created by T.W. Young [1991]. Originally, it is a modification of the Sterling ratio where the average annual rate of
return for the last 36 months is divided by the maximum drawdown for the same period. It is computed on a monthly
basis as opposed to other measures computed on a yearly basis. Note, the MAR ratio, discussed in Managed Account
Reports, is equal to the compound annual return from inception divided by the maximum drawdown over the same
period of time. As discussed by Bacon [2008], later versions of the Calmar ratio introduce the risk-free rate into the
numerator to create a Sharpe type ratio
$$M_{CR} = \frac{r_P - r_F}{MDD(\tau)}$$

The Sterling ratio replaces the maximum drawdown in the Calmar ratio with the average drawdown. According to
Bacon, the original definition of the Sterling ratio is
$$M_{SterR} = \frac{r_P}{\bar{D}_{lar} + 10\%}$$
where $\bar{D}_{lar}$ is the average largest drawdown, and the 10% is an arbitrary compensation for the fact that $\bar{D}_{lar}$ is
inevitably smaller than the maximum drawdown. In view of generalising the measure, Bacon rewrote it as a Sharpe
type ratio given by
$$M_{SterR} = \frac{r_P - r_F}{\bar{D}(T)}$$
where the number of observations $n_d$ is fixed by the investor’s preference. Another variation of the Sterling ratio uses
the average annual maximum drawdown $\overline{MDD}(\tau)$ in the denominator over three years. Combining the Sterling and
Calmar ratio, Bacon proposed the Sterling-Calmar ratio as
$$M_{SCR} = \frac{r_P - r_F}{\overline{MDD}(\tau)}$$

In order to penalise major drawdowns as opposed to many mild ones, Burke [1994] used the concept of the square
root of the sum of the squares of each drawdown, getting
$$M_{BR} = \frac{r_P - r_F}{\sqrt{\sum_{i=1}^{n_d} D_i^2}}$$
where the number of drawdowns $n_d$ used can be restricted to a set number of the largest drawdowns. In the case
where the investor is more concerned by the duration of the drawdowns, the Martin ratio or Ulcer performance index
is similar to the Burke ratio but with the Ulcer index in the denominator
$$M_{MR} = \frac{r_P - r_F}{\sqrt{\sum_{i=1}^{n} \frac{1}{n} (D_i')^2}}$$
and the equivalent to the Martin ratio but using the Pain index is the Pain ratio
$$M_{PR} = \frac{r_P - r_F}{\sum_{i=1}^{n} \frac{1}{n} |D_i'|}$$
In view of assessing the best measure to use, Eling et al. [2006] concluded that most of these measures are
highly correlated and do not lead to significantly different rankings. For Bacon, the investor must decide ex-ante which
measures of return and risk best describe his preference, and choose accordingly.

2.4.6 Some limitations

2.4.6.1 Dividing by zero

Statistical inference with measures based on ratios, such as the Treynor performance measure, is delicate when the
denominator tends to zero, as the ratio then goes to infinity. Hence, this measure provides unstable performance measures
for non-directional portfolios such as market neutral hedge funds. When the denominator is not bounded away from
zero, the expectation of the ratio is infinite. Further, when the denominator is negative, the ratio would assign positive
performance to portfolios with negative abnormal returns. As suggested by Hubner [2007], one way around this when
assessing the quality of performance measures is to consider only directional managed portfolios. However, hedge
funds favour market neutral portfolios. We present two artifacts capable of handling the beta in the denominator of a
ratio. We let $\beta_i^a(j\delta)$, taking values in $\mathbb{R}$, be the statistical Beta for the stock $S_i(j\delta)$ at time $t = j\delta$. We want to define
a mapping $\beta_i(j\delta)$ such that the ratio $\frac{1}{\beta_i(j\delta)}$ allocates maximum weight to stocks with β ≈ 0, and decreasing weight as
the β moves away from zero. One possibility is to set
$$\beta_i(j\delta) = a + b \beta_i^a(j\delta) , \quad i = 1, .., N$$
with $a = \frac{1}{3}$ and $b = \frac{2}{3}$, but it does not stop the ratio from being negative. An alternative approach is to consider the
inverse bell shape for the distribution of the Beta
$$\beta_i(j\delta) = a \left(1 - e^{-b (\beta_i^a(j\delta))^2}\right) + c$$
such that for $\beta_i^a(j\delta) = 0$ we get $\beta_i(j\delta) = c$. In that setting $\beta_i(j\delta) \in [c, a + c)$ and a good calibration gives a = 1.7,
b = 0.58, and c = 0.25. Modifying the bell shape, we can directly define the ratio as
$$\frac{1}{\beta_i(j\delta)} = a e^{-b (\beta_i^a(j\delta))^2}$$
with a = 3 and b = 0.25. In that setting $\frac{1}{\beta_i(j\delta)} \in (0, a]$ with the property that when β = 0 we get the maximum value a.

2.4.6.2 Anomaly in the Sharpe ratio

The (ex post) Sharpe ratio of a sequence of returns $x_1, ..., x_N \in [-1, \infty)$ is $M(N) = \frac{\mu_N}{\sigma_N}$ where $\mu_N$ is the sample
mean and $\sigma_N^2$ is the sample variance. Note, the returns are bounded from below by −1. Intuitively, the Sharpe ratio
is the return per unit of risk. Another way of measuring the performance of a portfolio with the above sequence of
returns is to see how this sequence of returns would have affected an initial investment of $C_A = 1$ assuming no capital
inflows and outflows after the initial investment. The final capital resulting from this sequence of returns is
$$P_N = C_A \prod_{i=1}^{N} (1 + x_i)$$

We are interested in conditions under which the following anomaly is possible: the Sharpe ratio $M(N)$ is large while
$P_N < 1$. We could also consider the condition that in the absence of capital inflows and outflows the returns $x_1, ..., x_N$
underperform the benchmark portfolio. Vovk [2011] showed that if the return is 5% over k − 1 periods, and then it is
−100% in the kth period then as k → ∞ we get $\mu_k \to 0.05$ and $\sigma_k \to 0$. Therefore, making k large enough, we can
make the Sharpe ratio $M(k)$ as large as we want, despite losing all the money over the k periods. In this example the
returns are far from being Gaussian (strictly speaking, returns cannot be Gaussian unless they are constant, since they
are bounded from below by −1). Note, this example leads to the same conclusions when the Sharpe ratio is replaced
by the Sortino ratio. However, this example is somewhat unrealistic in that there is a period in which the portfolio
loses almost all its money. Fortunately, Vovk [2011] showed that it is the only way a high Sharpe ratio can become
compatible with losing money. That is, in the case of the Sharpe ratio, such an abnormal behaviour can happen only
when some one-period returns are very close to −1. In the case of the Sortino ratio, such an abnormal behaviour can
happen only when some one-period returns are very close to −1 or when some one-period returns are huge.
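The anomaly is easy to reproduce numerically. The following Python sketch builds the sequence of k − 1 returns of 5% followed by a single return of −100%, and computes the ex post Sharpe ratio together with the final capital from an initial investment of 1. The function names are ours, and the variance is computed with a 1/N normalisation, which is an assumption (using 1/(N − 1) instead would not change the conclusion).

```python
import math


def losing_sequence(k, gain=0.05):
    """k - 1 periods of a constant gain followed by a total loss."""
    return [gain] * (k - 1) + [-1.0]


def sharpe_and_final_capital(returns):
    """Ex post Sharpe ratio (mean / standard deviation) and final capital."""
    n = len(returns)
    mean = sum(returns) / n
    variance = sum((x - mean) ** 2 for x in returns) / n
    capital = 1.0
    for x in returns:
        capital *= 1.0 + x      # compound the initial investment of 1
    return mean / math.sqrt(variance), capital


sharpe_100, capital_100 = sharpe_and_final_capital(losing_sequence(100))
sharpe_10000, capital_10000 = sharpe_and_final_capital(losing_sequence(10000))
```

The Sharpe ratio rises from below 0.4 with k = 100 to above 4.7 with k = 10000, while the final capital is zero in both cases.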
2.4.6.3 The weak stochastic dominance

The stochastic dominance axiom of utility implies that if exactly the same returns can be obtained with two different
investments A and B, but the probability of a return exceeding any threshold τ is always greater with investment A,
then A should be preferred to B. That is, investment A strictly dominates investment B if and only if
$$P_A(R > \tau) > P_B(R > \tau) \quad \forall \tau$$
and A weakly dominates B if and only if
$$P_A(R > \tau) \ge P_B(R > \tau) \quad \forall \tau$$
Hence, no rational investor should choose an investment which is weakly dominated by another one. With the help of
an example, Alexander showed that the SR can fail to rank investments according to the weak stochastic dominance.
We consider two portfolios A and B with the distribution of their returns in excess of the risk-free rate given in Table
(2.2).
Table 2.2: Distribution of returns
Probability    Excess return A    Excess return B
0.1            20%                40%
0.8            10%                10%
0.1            −20%               −20%

The highest excess return from portfolio A is only 20%, whereas the highest excess return from portfolio B is
40%. We show the result of the SR of the two investments in Table (2.3). The mean is given by $E[R] = \sum_i P_i R_i$ and
the variance satisfies $Var(R) = \sum_i P_i R_i^2 - (E[R])^2$.
Table 2.3: Sharpe ratios
Portfolio    Expected excess return    Standard deviation    Sharpe ratio
A            8.0%                      9.798%                0.8165
B            10.0%                     13.416%               0.7453

Following the SR, an investor would choose portfolio A, whereas the weak stochastic dominance indicates that any
rational investor should prefer B to A. As a result, one can conclude that the SRs are not good metrics to use in the
decision process on uncertain investments.
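The figures of this example can be reproduced with a few lines of Python, using the mean and variance formulas for a discrete distribution given above. The function names and the grid of thresholds are ours; the grid is only a finite check of the dominance condition, which is sufficient here because both distributions take three values.

```python
import math

PROBS = [0.1, 0.8, 0.1]
EXCESS_A = [0.20, 0.10, -0.20]
EXCESS_B = [0.40, 0.10, -0.20]


def discrete_sharpe(probs, excess_returns):
    """Mean, standard deviation and Sharpe ratio of a discrete distribution."""
    mean = sum(p * r for p, r in zip(probs, excess_returns))
    variance = sum(p * r * r for p, r in zip(probs, excess_returns)) - mean ** 2
    std = math.sqrt(variance)
    return mean, std, mean / std


def prob_exceeding(probs, excess_returns, tau):
    """P(R > tau) for a discrete distribution."""
    return sum(p for p, r in zip(probs, excess_returns) if r > tau)


mean_a, std_a, sharpe_a = discrete_sharpe(PROBS, EXCESS_A)
mean_b, std_b, sharpe_b = discrete_sharpe(PROBS, EXCESS_B)

# B weakly dominates A if P_B(R > tau) >= P_A(R > tau) at every threshold.
thresholds = [-0.3, -0.2, 0.0, 0.1, 0.15, 0.2, 0.3, 0.4]
b_dominates_a = all(
    prob_exceeding(PROBS, EXCESS_B, t) >= prob_exceeding(PROBS, EXCESS_A, t)
    for t in thresholds
)
```

Portfolio B weakly dominates A, yet its Sharpe ratio is lower, because the extra upside of B is counted as additional risk by the standard deviation.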


Chapter 3

Introduction to financial time series analysis
For details see text books by Makridakis et al. [1989], Brockwell et al. [1991] and Tsay [2002].

3.1 Prologue

A time series is a set of measurements recorded on a single unit over multiple time periods. More generally, a time
series is a set of statistics, usually collected at regular intervals, and occurring naturally in many application areas
such as economics, finance, environmental science, medicine, etc. In order to analyse and model price series to develop
efficient quantitative trading, we define returns as the differences of the logarithms of the closing price series, and we
fit models to these returns. Further, to construct efficient security portfolios matching the risk profile and needs of
individual investors we need to estimate the various properties of the securities constituting such a portfolio. Hence,
modelling and forecasting price return and volatility is the main task of financial research. Focusing on closing prices
recorded at the end of each trading day, we argue that it is the trading day rather than the chronological day which
is relevant so that constructing the series from available data, we obtain a process equally spaced in the relevant time
unit. We saw in Section (2.1.5) that a first step towards forecasting financial time series was to consider some type
of technical indicators or mathematical statistics with price forecasting capability, hoping that historical trends would
repeat themselves. However, following this approach we cannot assess the uncertainty inherent in the forecast, and as
such, we cannot measure the error of forecast. An alternative is to consider financial time series analysis which is
concerned with theory and practice of asset valuation over time. While the methods of time series analysis pre-date
those for general stochastic processes and Markov Chains, their aims are to describe and summarise time series data,
fit low-dimensional models, and make forecasts. Even though it is a highly empirical discipline, theory forms the
foundation for making inference. However, both financial theory and its empirical time series contain some elements
of uncertainty. For instance, there are various definitions of asset volatility, and in addition, volatility is not directly
observable. Consequently, statistical theory and methods play an important role in financial time series analysis. One
must therefore use one’s knowledge of financial time series in order to use the appropriate statistical tools to analyse the
series. In the rest of this section we are going to describe financial time series analysis, and we will introduce statistical
theory and methods in the following sections.


3.2 An overview of data analysis

3.2.1 Presenting the data

3.2.1.1 Data description

The data may consist of equity stocks, equity indices, futures, FX rates, commodities, and interest rates (Eurodollar
and 10-year US Treasury Note) spanning a period from years to decades with frequency of intraday quotes, close-to-close, weeks, or months. As the contracts are traded in various exchanges, each with different trading hours and
holidays, the data series should be appropriately aligned to avoid potential lead-lag effects by filling forward any
missing asset prices (see Pesaran et al. [2009]). Daily, weekly or monthly return series are constructed for each contract
by computing the percentage change in the closing end of day, week or month asset price level. The mechanics of
opening and maintaining a position on a futures contract involves features like initial margins, potential margin calls,
interest accrued on the margin account, and no initial cash payment at the initiation of the contract (see Miffre et al.
[2007]). As a result, the construction of a return data series for a futures contract does not have an objective nature
and various methodologies have been used in the literature. Pesaran et al. [2009] and Fuertes et al. [2010] compute
returns similarly as the percentage change in the price level, whereas Pirrong [2005] and Gorton et al. [2006] also
take into account interest rate accruals on a fully collateralised basis, and Miffre et al. [2007] use the change in the
logarithms of the price level. Lastly, Moskowitz et al. [2012] use the percentage change in the price level in excess
of the risk-free rate. Knowing the percentage returns of the time series, we can compute the annualised mean return,
volatility, and Sharpe ratios.
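As an illustration of the return constructions mentioned above, the following Python sketch computes percentage returns and log returns from a closing price series, and then the annualised mean return, volatility and Sharpe ratio. The function names and the default of 252 periods per year are our assumptions. The annualisation uses the square-root-of-time rule, and is therefore only appropriate when the returns are not autocorrelated.

```python
import math


def simple_returns(prices):
    """Percentage change in the closing price level."""
    return [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]


def log_returns(prices):
    """Change in the logarithm of the closing price level."""
    return [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]


def annualised_stats(returns, periods_per_year=252, risk_free=0.0):
    """Annualised mean, volatility and Sharpe ratio of periodic returns.

    risk_free is an annualised rate; the volatility is scaled with the
    square-root-of-time rule, which assumes uncorrelated returns.
    """
    n = len(returns)
    mean = sum(returns) / n
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    ann_mean = mean * periods_per_year
    ann_vol = math.sqrt(variance * periods_per_year)
    return ann_mean, ann_vol, (ann_mean - risk_free) / ann_vol


closes = [100.0, 102.0, 101.0, 103.0]    # hypothetical closing prices
pct = simple_returns(closes)
logs = log_returns(closes)
```

Log returns add up over time, so their sum over the sample is the logarithm of the ratio of the last price to the first price, which is one reason for preferring them when fitting time series models.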
3.2.1.2 Analysing the data

We apply standard econometric theory described in Section (5.1) to test for the presence of heteroskedasticity and
autocorrelation, and we adjust the models accordingly when needed. In general, there exists a great amount of cross-sectional variation in mean returns and volatilities with the commodities being historically the most volatile contracts
(see Pesaran et al. [2009]). Further, the distribution of buy-and-hold or buy-and-sell return series exhibits fat tails
as deduced by the kurtosis and the maximum likelihood estimated degrees of freedom for a Student t-distribution. A
normal distribution is almost universally rejected by the Jarque and Bera [1987] and the Lilliefors [1967] tests of
normality (see Section (3.3.4.2)). It is more difficult to conclude about potential first-order time-series autocorrelation
using tools such as the Ljung and Box [1978] test. However, very strong evidence of heteroskedasticity is apparent
across all frequencies, as deduced by the ARCH test of Engle [1982]. Baltas et al. [2012b] found that this latter
effect of time variation in the second moment of the return series was also apparent in the volatility. We also perform a
regression analysis with ARMA (autoregressive moving average) modelling of the serial correlation in the disturbance.
In addition we can perform several robustness checks. First, we check the robustness of the model through time by
using a Chow [1960] test to test for stability of regression coefficients between two periods. When we find significant
evidence of parameter instability, we use a Kalman filter analysis described in Section (3.2.5.2), which is a general
form of a linear model with dynamic parameters, where priors on model parameters are recursively updated in reaction
to new information (see Hamilton [1994]).
3.2.1.3

Removing outliers

We follow an approach described by Zhu [2005], which consists in finding the general trend curve for the time series,
and then calculating the spread, that is, the distance between each point and the trend curve. The idea is to replace
each data point by some kind of local average of surrounding data points, such that averaging reduces the level of noise
without biasing the value obtained too much. To find the trend, we consider the Savitzky-Golay low-pass smoothing
filter described in Section (4.3.3). After some experiments, Zhu [2005] found that the filter should be of degree 1 with
span size 3. Given the smoothed data $\bar{f}_i$ representing the trend of the market data $f_i$, we get the spread
$f_i - \bar{f}_i$ of each market data point from the trend. The search for outliers uses the histogram of $(f_i - \bar{f}_i)$ with $M = 10$ bins of
equal width. We choose a threshold $T$ and define all $f_i$ with $|f_i - \bar{f}_i| > T$ to be outliers. The next question is how to
select the number of bins $M$ and the threshold $T$. Suppose we are given a set of market data containing previously known
errors. We adjust $M$ and $T$ until we find pairs $(M, T)$ that successfully detect all the errors. We can then
tune the parameters with more historical data from the same market. Provided there is enough training data, the
values of $M$ and $T$ are over-determined. Outliers are replaced by interpolation. On the market, one
common way to deal with erroneous data is to replace it with the previous data point, that is, zeroth-order interpolation. This
method neglects the trend, whereas we usually expect movements in a liquid market. Instead of using a large amount of training
data, an alternative way to search for $T$ is to smooth the data points iteratively. In step one, we choose a starting value for $T$, say $T_0$, and
smooth the data according to $M_0$ and $T_0$. In step two, we stop the iteration if the histogram has a short tail, since we
believe all the outliers have been removed. Otherwise, we replace the outliers by interpolated values and repeat step one.
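The iterative procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a Savitzky-Golay filter of degree 1 and span 3 reduces, at interior points, to a 3-point moving average; the end points are left unsmoothed; outliers are replaced by linear interpolation between the nearest unflagged neighbours; and the stopping rule "the histogram has a short tail" is simplified to "no spread exceeds T". The function names, the threshold and the sample prices are hypothetical.

```python
def smooth_degree1_span3(values):
    """Degree-1, span-3 Savitzky-Golay smoothing (a 3-point moving average).

    End points are returned unchanged, which is an assumption of this sketch.
    """
    trend = list(values)
    for i in range(1, len(values) - 1):
        trend[i] = (values[i - 1] + values[i] + values[i + 1]) / 3.0
    return trend


def find_outliers(values, threshold):
    """Indices i where the spread |f_i - trend_i| exceeds the threshold T."""
    trend = smooth_degree1_span3(values)
    return [i for i, (f, f_bar) in enumerate(zip(values, trend))
            if abs(f - f_bar) > threshold]


def interpolate(values, outliers):
    """Replace flagged points by linear interpolation between good neighbours."""
    bad = set(outliers)
    good = [i for i in range(len(values)) if i not in bad]
    cleaned = list(values)
    for i in outliers:
        left = max((g for g in good if g < i), default=None)
        right = min((g for g in good if g > i), default=None)
        if left is not None and right is not None:
            w = (i - left) / (right - left)
            cleaned[i] = (1 - w) * values[left] + w * values[right]
        elif left is not None:
            cleaned[i] = values[left]    # zeroth-order fallback at the end
        elif right is not None:
            cleaned[i] = values[right]   # zeroth-order fallback at the start
    return cleaned


def remove_outliers(values, threshold, max_iter=10):
    """Smooth, flag, interpolate, and repeat until nothing is flagged."""
    cleaned = list(values)
    for _ in range(max_iter):
        outliers = find_outliers(cleaned, threshold)
        if not outliers:
            break
        cleaned = interpolate(cleaned, outliers)
    return cleaned


prices = [100.0, 101.0, 102.0, 150.0, 104.0, 105.0, 106.0]  # 150 is a bad tick
print(find_outliers(prices, threshold=20.0))
print(remove_outliers(prices, threshold=20.0))
```

Note that a single spike also inflates the smoothed trend at its two neighbours, so the threshold has to be large enough not to flag them as well.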

3.2.2

Basic tools for summarising and forecasting data

We assume that we have available a database from which to filter data and build numerical forecasts, that is, a table
with multiple dimensions. Cross sectional data refer to measurements on multiple units, recorded in a single time
period. Although forecasting practice involves multiple series, the methods we are going to examine use data from the
past and present to predict future outcomes. Hence, we will first focus on the use of time series data.
3.2.2.1

Presenting forecasting methods

Forecasting is about making statements on events whose actual outcomes have not yet been observed. We distinguish
two main types of forecasting methods:
1. Qualitative forecasting techniques are subjective, based on opinions and judgements, and are appropriate when
past data are not available. For example, one tries to verify whether there is some causal relationship between
some variables and the demand. If this is the case, and if the variable is known in advance, it can be used to
make a forecast.
2. On the contrary, quantitative forecasting models are used to forecast future data as a function of past data,
and as such, are appropriate when past data are available. The main idea is that the evolution in the past
will continue into the future. If we observe some correlations between some variables, then we can use these
correlations to make some forecast. A dynamic model incorporating all the important internal and external
variables is implemented and used to test different alternatives. For instance, to estimate the future demand
accurately, we need to take into account facts influencing the demand.
Subjective forecasts are often time-consuming to generate and may be subject to a variety of conscious or unconscious biases. In general, simple analysis of available data can perform as well as judgement procedures, and are much
quicker and less expensive to produce. The effective possible choices are judgement only, quantitative method only
and quantitative method with results adjusted by user judgement. All three options have their place in the forecasting
lexicon, depending upon costs, available data and the importance of the task in hand. Careful subjective adjustment of
quantitative forecasts may often be the best combination, but we first need to develop an effective arsenal of quantitative methods. To do so, we need to distinguish between methods and models.
• A forecasting method is a (numerical) procedure for generating a forecast. When such methods are not based
upon an underlying statistical model, they are termed heuristic.
• A statistical (forecasting) model is a statistical description of the data generating process from which a forecasting method may be derived. Forecasts are made by using a forecast function that is derived from the model.
For example, we can specify a forecasting method as
$$F_t = b_0 + b_1 t$$
where $F_t$ is the forecast for time period $t$, $b_0$ is the intercept representing the value at time zero, and $b_1$ is the slope
representing the increase in forecast values from one period to the next. All we need to do to obtain a forecast is
to calibrate the model. However, we lack a basis for choosing values for the parameters, and we cannot assess the
uncertainty inherent in the forecasts. Alternatively, we may formulate a forecasting model as
$$Y_t = \beta_0 + \beta_1 t + \epsilon$$
where $Y$ denotes the time series being studied, $\beta_0$ and $\beta_1$ are the level and slope parameters, and $\epsilon$ denotes a random
error term corresponding to that part of the series that cannot be fitted by the trend line. Once we make appropriate
assumptions about the nature of the error term, we can estimate the unknown parameters, $\beta_0$ and $\beta_1$. These estimates
are typically written as $b_0$ and $b_1$. Thus the forecasting model gives rise to the forecast function
$$F_t = b_0 + b_1 t$$
where the underlying model enables us to make statements about the uncertainty in the forecast, something that the
heuristic method does not provide. As a result, risk and uncertainty are central to forecasting, as one must indicate
the degree of uncertainty attached to forecasts. Hence, some idea about the probability distribution of the forecast is necessary. For
example, assuming a forecast for some demand follows a Gaussian (normal) distribution with mean $\mu$ and standard
deviation $\sigma$, the coefficient of variation of the prediction is $\frac{\sigma}{\mu}$.
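As a minimal sketch of how the forecasting model leads to a forecast function, the following Python code estimates $b_0$ and $b_1$ by ordinary least squares on the time index $t = 1, ..., n$. The choice of least squares, the use of the residual standard deviation as a stand-in for $\sigma$, and the demand figures are assumptions made for illustration only.

```python
from math import sqrt


def fit_linear_trend(y):
    """Least-squares estimates (b0, b1) of Y_t = beta0 + beta1 * t + eps,
    with t = 1, ..., n, plus the residual standard deviation s."""
    n = len(y)
    t = list(range(1, n + 1))
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * t_bar
    residuals = [yi - (b0 + b1 * ti) for ti, yi in zip(t, y)]
    s = sqrt(sum(e * e for e in residuals) / (n - 2))  # two fitted parameters
    return b0, b1, s


def forecast(b0, b1, t):
    """Forecast function F_t = b0 + b1 * t."""
    return b0 + b1 * t


demand = [12.0, 15.0, 14.0, 18.0, 19.0, 23.0]   # hypothetical observations
b0, b1, s = fit_linear_trend(demand)
next_value = forecast(b0, b1, len(demand) + 1)
print(b0, b1, s)
print(next_value, s / next_value)   # point forecast and coefficient of variation
```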
3.2.2.2

Summarising the data

Following Brockwell et al. [1991], we write the real-valued series of observations as $..., Y_{-2}, Y_{-1}, Y_0, Y_1, Y_2, ...$, a
doubly infinite sequence of real-valued random variables indexed by $\mathbb{Z}$. Given a set of $n$ values $Y_1, Y_2, .., Y_n$, we
place these values in ascending order written as $Y_{(1)} \leq Y_{(2)} \leq ... \leq Y_{(n)}$. The median is the middle observation.
When $n$ is odd it can be written $n = 2m + 1$ and the median is $Y_{(m+1)}$, and when $n$ is even we get $n = 2m$ and the
median is $\frac{1}{2}(Y_{(m)} + Y_{(m+1)})$. It is possible to have two very different datasets with the same means and medians. For
that reason, measures of the middle are useful but limited. Another important attribute of a dataset is its dispersion
or variability about its middle. The most useful measures of dispersion (or variability) are the range, the percentiles,
the mean absolute deviation, and the standard deviation. The range denotes the difference between the largest and
smallest values in the sample
Range = Y(n) − Y(1)
Therefore, the more spread out the data values are, the larger the range will be. However, if a few observations are
relatively far from the middle but the rest are relatively close to the middle, the range can give a distorted measure
of dispersion. Percentiles are positional measures for a dataset that enable one to determine the relative standing of a
single measurement within the dataset. In particular, the pth percentile is defined to be a number such that p% of the
observations are less than or equal to that number and (100 − p)% are greater than that number. So, for example, an
observation that is at the 75th percentile is less than only 25% of the data. In practice, we often cannot satisfy the
definition exactly. However, the steps outlined below at least satisfy the spirit of the definition:
1. Order the data values from smallest to largest, including ties.
2. Determine the position
$$k.ddd = 1 + \frac{p(n - 1)}{100}$$
3. The pth percentile is located between the kth and the (k + 1)th ordered value. Use the fractional part of the
position, .ddd as an interpolation factor between these values. If k = 0, then take the smallest observation as the
percentile and if k = n, then take the largest observation as the percentile.
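The three steps above can be sketched as follows. The function name and the sample data are hypothetical; the code simply applies the position formula $1 + p(n-1)/100$ and interpolates with the fractional part.

```python
import math


def percentile(data, p):
    """p-th percentile via the position k.ddd = 1 + p (n - 1) / 100."""
    ordered = sorted(data)                 # step 1: order the values
    n = len(ordered)
    position = 1 + p * (n - 1) / 100       # step 2: 1-based position k.ddd
    k = math.floor(position)
    frac = position - k
    if k < 1:                              # guard: smallest observation
        return ordered[0]
    if k >= n:                             # k = n: largest observation
        return ordered[-1]
    # step 3: interpolate between the k-th and (k + 1)-th ordered values
    return ordered[k - 1] + frac * (ordered[k] - ordered[k - 1])


sample = [40, 15, 50, 20, 35]
print(percentile(sample, 25), percentile(sample, 50), percentile(sample, 75))
```

The inter-quartile range is then `percentile(sample, 75) - percentile(sample, 25)`.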

The 50th percentile is the median and partitions the data into a lower half (below median) and upper half (above
median). The 25th, 50th, 75th percentiles are referred to as quartiles. They partition the data into 4 groups with
25% of the values below the 25th percentile (lower quartile), 25% between the lower quartile and the median, 25%
between the median and the 75th percentile (upper quartile), and 25% above the upper quartile. The difference between
the upper and lower quartiles is referred to as the inter-quartile range. This is the range of the middle 50% of the
data. Given $d_i = Y_i - \bar{Y}$ where $\bar{Y}$ is the arithmetic mean, the Mean Absolute Deviation (MAD) is the average of the
absolute deviations about the mean, that is, ignoring the sign
$$MAD = \frac{1}{n} \sum_i |d_i|$$
The sample variance is an average of the squared deviations about the mean
$$S^2 = \frac{1}{n - 1} \sum_i d_i^2$$
The population variance is given by
$$\sigma_p^2 = \frac{1}{n} \sum_i (Y_i - \mu_p)^2$$
where µp is the population mean. Note that the unit of measure for the variance is the square of the unit of measure for
the data. For that reason (and others), the square root of the variance, called the standard deviation, is more commonly
used as a measure of dispersion. Note that datasets in which the values tend to be far away from the middle have a
large variance (and hence large standard deviation), and datasets in which the values cluster closely around the middle
have small variance. Unfortunately, it is also the case that a dataset with one value very far from the middle and the
rest very close to the middle also will have a large variance. Comparing the standard deviation $S$ with the MAD, $S$ gives greater
weight to the more extreme observations by squaring them, and it may be shown that $S > MAD$ whenever MAD is
greater than zero. A rough relationship between the two is
$$S \approx 1.25\, MAD$$
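A short numerical check of these dispersion measures is given below. The small dataset is hypothetical, and the simulated sample assumes normally distributed data, for which the ratio $S / MAD$ is close to $\sqrt{\pi/2} \approx 1.25$; for other distributions the ratio can differ.

```python
import random
from math import sqrt


def mean_absolute_deviation(values):
    """MAD = (1/n) * sum |Y_i - mean|."""
    centre = sum(values) / len(values)
    return sum(abs(v - centre) for v in values) / len(values)


def sample_std(values):
    """S = sqrt( (1/(n-1)) * sum (Y_i - mean)^2 )."""
    centre = sum(values) / len(values)
    return sqrt(sum((v - centre) ** 2 for v in values) / (len(values) - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean_absolute_deviation(data), sample_std(data))

random.seed(0)                      # fixed seed so the run is reproducible
normal_sample = [random.gauss(0.0, 1.0) for _ in range(20000)]
ratio = sample_std(normal_sample) / mean_absolute_deviation(normal_sample)
print(ratio)                        # expected to be near 1.25 for normal data
```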
The standard deviation of a dataset can be interpreted by Chebychev’s Theorem (see Tchebichef [1867]):
Theorem 3.2.1 For any $k > 1$, the proportion of observations within the interval $\mu_p \pm k\sigma_p$ is at least $(1 - \frac{1}{k^2})$.

Hence, knowing just the mean and standard deviation of a dataset allows us to obtain a rough picture of the distribution
of the data values. Note that the smaller the standard deviation, the smaller is the interval $\mu_p \pm 2\sigma_p$ that is guaranteed to contain
at least 75% of the observations (taking $k = 2$). Conversely, the larger the standard deviation, the more likely it is that an observation
will not be close to the mean. Note that Chebychev’s Theorem applies to all data and therefore must be conservative. In
many situations the actual percentages contained within these intervals are much higher than the minimums specified
by this theorem. If the shape of the data histogram is known, then better results can be given. In particular, if it is
known that the data histogram is approximately bell-shaped, then we can say
• µp ± σp contains approximately 68%,
• µp ± 2σp contains approximately 95%,
• µp ± 3σp contains essentially all
of the data values. This set of results is called the empirical rule. Several extensions of Chebyshev’s inequality have
been developed, among which is the asymmetric two-sided version given by
$$P(k_1 < Y < k_2) \geq \frac{4\left[(\mu_p - k_1)(k_2 - \mu_p) - \sigma_p^2\right]}{(k_2 - k_1)^2}$$
In mathematical statistics, a random variable Y is standardized by subtracting its expected value E[Y ] and dividing
the difference by its standard deviation σ(Y )
$$Z = \frac{Y - E[Y]}{\sigma(Y)}$$
The Z-score is a dimensionless quantity obtained by subtracting the population mean µp from an individual raw score
Yi and then dividing the difference by the population standard deviation σp . That is,
$$Z = \frac{Y_i - \mu_p}{\sigma_p}$$
From Chebychev's theorem, at least 75% of observations in any dataset will have Z-scores in the range $[-2, 2]$.
The standard score is the (signed) number of standard deviations an observation or datum is above the mean, and it
provides an assessment of how off-target a process is operating. The use of the term Z is due to the fact that the
standard normal distribution is also known as the Z distribution. Z-scores are most frequently used to compare a sample to a
standard normal deviate, though they can be defined without assumptions of normality. Note, considering the Z-score,
Cantelli obtained sharpened one-sided bounds given, for $k > 0$, by
$$P(Z \geq k) \leq \frac{1}{1 + k^2}$$
The Z-score is only defined if one knows the population parameters, but knowing the true standard deviation of a
population is often unrealistic except in cases such as standardized testing, where the entire population is measured.
If one only has a sample set, then the analogous computation with sample mean and sample standard deviation yields
the Student's t-statistic. Given a sample mean $\bar{Y}$ and sample standard deviation $S$, we define the standardised scores
for the observations, also known as Z-scores, as
$$Z = \frac{Y_i - \bar{Y}}{S}$$
The Z-score is used to examine forecast errors, and one proceeds in three steps:
• Check that the observed distribution of the errors is approximately normal
• If the assumption is satisfied, relate the Z-score to the normal tables
– The probability that |Z| > 1 is about 0.32
– The probability that |Z| > 2 is about 0.046
– The probability that |Z| > 3 is about 0.0027
• Create a time series plot of the residuals (and/or Z-scores) when appropriate to determine which observations
appear to be extreme
Hence, whenever you see a Z-score greater than 3 in absolute value, the observation is very atypical, and we refer to
such observations as outliers.
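The screening rule can be illustrated with the Python standard library. The error values below are made up, and the cutoff of 3 only makes sense once the normality check in the first step has been passed; the tail probabilities are computed from the standard normal distribution.

```python
from statistics import NormalDist, mean, stdev


def z_scores(values):
    """Standardised scores (Y_i - sample mean) / S, with S using n - 1."""
    centre = mean(values)
    spread = stdev(values)
    return [(v - centre) / spread for v in values]


def flag_outliers(values, cutoff=3.0):
    """Indices of observations whose |Z| exceeds the cutoff."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > cutoff]


def two_sided_tail(k):
    """P(|Z| > k) for a standard normal variable."""
    return 2.0 * (1.0 - NormalDist().cdf(k))


errors = [1.0, -1.0] * 10 + [10.0]          # hypothetical forecast errors
print(flag_outliers(errors))                 # index of the extreme error
print([round(two_sided_tail(k), 4) for k in (1, 2, 3)])
```

With very few observations the largest attainable $|Z|$ is bounded by $(n-1)/\sqrt{n}$, so the rule cannot flag anything in tiny samples.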
The change in the absolute level of the series from one period to the next is called the first difference of the series,
given by
$$DY_t = Y_t - Y_{t-1}$$
where $Y_{t-1}$ is known at time $t$. Letting $\hat{D}_t$ be the forecast for the difference, the forecast for $Y_t$ becomes
$$F_t = \hat{Y}_t = Y_{t-1} + \hat{D}_t$$
The growth rate for $Y_t$ is, with $D_t = DY_t$,
$$G_t = GY_t = 100\, \frac{D_t}{Y_{t-1}}$$
so that the forecast for $Y_t$ can be written as
$$F_t = \hat{Y}_t = Y_{t-1} \left(1 + \frac{\hat{G}_t}{100}\right)$$
If we think of changes in the time series in absolute terms we should use $DY$, and if we think of them in relative terms we
should use $GY$. Note, reducing a chocolate ration by 50% and then increasing it by 50% does not give you as much
chocolate as before since
$$\left(1 - \frac{50}{100}\right)\left(1 + \frac{50}{100}\right) = 0.75$$
To avoid this asymmetry we can use the logarithm transform $L_t = \ln Y_t$ with first difference in logarithm being
$$DL_t = \ln Y_t - \ln Y_{t-1}$$
converting the exponential (or proportional) growth into linear growth. If we generate a forecast of the log-difference,
the forecast for the original series, given the previous value $Y_{t-1}$, becomes
$$\hat{Y}_t = Y_{t-1} e^{D\hat{L}_t}$$
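The three transformations, and the chocolate ration example, can be reproduced with a few helper functions. The function names are hypothetical and the series is the ration from the text (100, halved, then increased by 50%).

```python
import math


def first_differences(y):
    """DY_t = Y_t - Y_{t-1}."""
    return [y[t] - y[t - 1] for t in range(1, len(y))]


def growth_rates(y):
    """GY_t = 100 * DY_t / Y_{t-1}, in percent."""
    return [100.0 * (y[t] - y[t - 1]) / y[t - 1] for t in range(1, len(y))]


def log_differences(y):
    """DL_t = ln Y_t - ln Y_{t-1}."""
    return [math.log(y[t]) - math.log(y[t - 1]) for t in range(1, len(y))]


def forecast_from_growth(y_prev, g_hat):
    """Y_hat_t = Y_{t-1} * (1 + G_hat_t / 100)."""
    return y_prev * (1.0 + g_hat / 100.0)


def forecast_from_log_difference(y_prev, dl_hat):
    """Y_hat_t = Y_{t-1} * exp(DL_hat_t)."""
    return y_prev * math.exp(dl_hat)


ration = [100.0, 50.0, 75.0]
print(growth_rates(ration))          # -50% then +50%, yet the level ends at 75
print(log_differences(ration))       # the two log changes are not symmetric
print(sum(log_differences([100.0, 50.0, 100.0])))  # symmetric in logs: about 0
```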
3.2.2.3

Measuring the forecasting accuracy

When selecting a forecasting procedure, a key question is how to measure performance. A natural approach would
be to look at the differences between the observed values and the forecasts, and to use their average as a performance
measure. Suppose that we start from forecast origin t so that the forecasts are made successively (one-step-ahead) at
times t + 1, t + 2, ..., t + h, there being h such forecasts in all. The one-step-ahead forecast error at time t + i may be
denoted by
$$e_{t+i} = Y_{t+i} - F_{t+i}$$
The Mean Error (ME) is given by
$$ME = \frac{1}{h} \sum_{i=1}^{h} (Y_{t+i} - F_{t+i}) = \frac{1}{h} \sum_{i=1}^{h} e_{t+i}$$
ME will be large and positive (negative) when the actual value is consistently greater (less) than the forecast. However, this measure does not reflect variability, as positive and negative errors could virtually cancel each other out, yet
substantial forecasting errors could remain. Hence, we need measures that take account of the magnitude of an error
regardless of the sign. The simplest way to gauge the variability in forecasting performance is to examine the absolute
errors, defined as the value of the error ignoring its sign and expressed as
|ei | = |Yi − Fi |
We now present various averages, based upon the errors or the absolute errors.

• the Mean Absolute Error
$$MAE = \frac{1}{h} \sum_{i=1}^{h} |Y_{t+i} - F_{t+i}| = \frac{1}{h} \sum_{i=1}^{h} |e_{t+i}|$$
• the Mean Absolute Percentage Error
$$MAPE = \frac{100}{h} \sum_{i=1}^{h} \frac{|Y_{t+i} - F_{t+i}|}{Y_{t+i}} = \frac{100}{h} \sum_{i=1}^{h} \frac{|e_{t+i}|}{Y_{t+i}}$$
• the Mean Square Error
$$MSE = \frac{1}{h} \sum_{i=1}^{h} (Y_{t+i} - F_{t+i})^2 = \frac{1}{h} \sum_{i=1}^{h} e_{t+i}^2$$
• the Normalised Mean Square Error
$$NMSE = \frac{1}{\sigma^2 h} \sum_{i=1}^{h} (Y_{t+i} - F_{t+i})^2 = \frac{1}{\sigma^2 h} \sum_{i=1}^{h} e_{t+i}^2$$
where $\sigma^2$ is the variance of the true sequence over the prediction period (validation set).
• the Root Mean Square Error
$$RMSE = \sqrt{MSE}$$
• the Mean Absolute Scaled Error
$$MASE = \frac{\sum_{i=1}^{h} |Y_{t+i} - F_{t+i}|}{\sum_{i=1}^{h} |Y_{t+i} - Y_{t+i-1}|}$$
• the Directional Symmetry
$$DS = \frac{1}{h - 1} \sum_{i=1}^{h} H(Y_{t+i} \cdot F_{t+i})$$
where $H(x) = 1$ if $x > 0$ and $H(x) = 0$ otherwise (Heaviside function).
• the Direction Variation Symmetry
$$DVS = \frac{1}{h - 1} \sum_{i=2}^{h} H\big((Y_{t+i} - Y_{t+i-1}) \cdot (F_{t+i} - F_{t+i-1})\big)$$

Note, the Mean Absolute Scaled Error was introduced by Hyndman et al. [2006]. It is the ratio of the MAE for the
current set of forecasts relative to the MAE for forecasts made using the random walk. Hence, for M ASE > 1 the
random walk forecasts are superior, otherwise the method under consideration is superior to the random walk. We
now give some general comments on these measures.

• MAPE should only be used when Y > 0, MASE is not so restricted.
• MAPE is the most commonly used error measure in practice, but it is sensitive to values of Y close to zero.
• MSE is measured in terms of (dollars)$^2$, and taking the square root to obtain the RMSE restores the original units.
• A value of the N M SE = 1 corresponds to predicting the unconditional mean.
• The RMSE gives greater weight to large (absolute) errors. It is therefore sensitive to extreme errors.
• The measure using absolute values always equals or exceeds the absolute value of the measure based on the
errors, so that M AE ≥ |M E| and M AP E ≥ |M P E|. If the values are close in magnitude that suggests a
systematic bias in the forecasts.
• Both MAPE and MASE are scale-free and so can be used to make comparisons across multiple series. The
other measures need additional scaling.
• DS is the percentage of correctly predicted directions with respect to the target variable. It provides a measure
of the number of times the sign of the target was correctly forecast.
• DVS is the percentage of correctly predicted direction variations with respect to the target variable.
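As an illustration, most of the measures above can be computed in a single pass over the forecast errors. The sketch assumes that the list of actual values starts with $Y_t$ (the last value before the first forecast), which MASE and DVS need for the naive random-walk comparison; the numbers are hypothetical, and DS and NMSE are left out for brevity.

```python
from math import sqrt


def heaviside(x):
    """H(x) = 1 if x > 0, else 0."""
    return 1 if x > 0 else 0


def accuracy_measures(actual, forecasts):
    """actual = [Y_t, Y_{t+1}, ..., Y_{t+h}], forecasts = [F_{t+1}, ..., F_{t+h}]."""
    h = len(forecasts)
    if len(actual) != h + 1:
        raise ValueError("actual must hold one more value than forecasts")
    observed = actual[1:]
    errors = [y - f for y, f in zip(observed, forecasts)]
    abs_sum = sum(abs(e) for e in errors)
    mse = sum(e * e for e in errors) / h
    naive_abs_sum = sum(abs(actual[i] - actual[i - 1]) for i in range(1, h + 1))
    same_direction = sum(
        heaviside((observed[i] - observed[i - 1]) * (forecasts[i] - forecasts[i - 1]))
        for i in range(1, h)
    )
    return {
        "ME": sum(errors) / h,
        "MAE": abs_sum / h,
        "MAPE": 100.0 / h * sum(abs(e) / y for e, y in zip(errors, observed)),
        "MSE": mse,
        "RMSE": sqrt(mse),
        "MASE": abs_sum / naive_abs_sum,   # below 1: better than the random walk
        "DVS": same_direction / (h - 1),
    }


actual = [100.0, 102.0, 101.0, 105.0, 107.0]
forecasts = [101.0, 102.0, 103.0, 106.0]
measures = accuracy_measures(actual, forecasts)
for name, value in measures.items():
    print(name, round(value, 4))
```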
3.2.2.4

Prediction intervals

So far we have considered point forecasts which are future observations for which we report a single forecast value.
However, confidence in a single number is often misplaced. We assume that the predictive distribution for the series
Y follows the normal law (although such an assumption is at best an approximation and needs to be checked). If we
assume that the standard deviation (SD) of the distribution is known, we may use the upper 95% point of the standard
normal distribution (this value is 1.645) so that the one-sided prediction interval is
Ŷ + 1.645 × SD
where Ŷ is the point forecast. The normal distribution being the most widely used in the construction of prediction
intervals, it is critical to check that the forecast errors are approximately normally distributed. Typically the SD is
unknown and must be estimated from the sample that was used to generate the point forecast, meaning that we use the
RMSE to estimate the SD. We can also use the two-sided $100(1 - \alpha)\%$ prediction intervals given by
$$\hat{Y} \pm z_{1 - \frac{\alpha}{2}} \times RMSE$$
where $z_{1 - \frac{\alpha}{2}}$ denotes the upper $100(1 - \frac{\alpha}{2})$ percentage point of the normal distribution. In the case of a 95% one-step-ahead
prediction interval we set $\alpha = 5\%$ and get $z_{1 - \frac{0.05}{2}} = 1.96$. The general purpose of such intervals is to provide
an indication of the reliability of the point forecasts.
An alternative approach to using theoretical formulae when calculating prediction intervals is to use the observed
errors to show the range of variation expected in the forecasts. For instance, we can calculate the 1-step ahead errors
made using the random walk forecasts which form a histogram. We can also fit a theoretical probability density to
the observed errors. Other distributions (than normal) are possible since in many applications more extreme errors are
observed than those suggested by a normal distribution. Fitting a distribution gives us more precise estimates of the
prediction intervals which are called empirical prediction intervals. To be useful, these empirical prediction intervals
need to be based on a large sample of errors.
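Both kinds of interval can be sketched as follows. The normal interval uses the quantile function of the standard normal distribution from the Python standard library. The empirical interval is one possible reading of the approach: it adds the percentiles of the observed one-step errors to the point forecast, without fitting any density. The function names and the error sample are hypothetical.

```python
from statistics import NormalDist


def normal_interval(point_forecast, rmse, alpha=0.05):
    """Two-sided 100(1 - alpha)% interval: forecast +/- z * RMSE."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return point_forecast - z * rmse, point_forecast + z * rmse


def quantile(sorted_values, p):
    """p-th percentile using the position 1 + p (n - 1) / 100."""
    n = len(sorted_values)
    position = 1.0 + p * (n - 1) / 100.0
    k = int(position)
    if k >= n:
        return sorted_values[-1]
    frac = position - k
    return sorted_values[k - 1] + frac * (sorted_values[k] - sorted_values[k - 1])


def empirical_interval(point_forecast, errors, alpha=0.05):
    """Interval built from percentiles of past errors e = Y - F."""
    ordered = sorted(errors)
    lower = quantile(ordered, 100.0 * alpha / 2.0)
    upper = quantile(ordered, 100.0 * (1.0 - alpha / 2.0))
    return point_forecast + lower, point_forecast + upper


print(normal_interval(100.0, 2.0))                     # roughly 100 +/- 3.92
past_errors = [float(e) for e in range(-10, 11)]       # hypothetical errors
print(empirical_interval(100.0, past_errors))
print(100.0 + NormalDist().inv_cdf(0.95) * 2.0)        # one-sided upper 95% bound
```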

3.2.2.5

Estimating model parameters

Usually we partition the original series into two parts and refer to the first part (containing 75−80% of the observations)
as the estimation sample, which is used to estimate the starting values and the smoothing parameters. The parameters
are commonly estimated by minimizing the mean squared error (MSE), although the mean absolute error (MAE)
or mean absolute percentage error (MAPE) are also used. The second part called hold-out sample represents the
remaining 20 − 25% of the observations and is used to check forecasting performance. Some programs allow repeated
estimation and forecast error evaluation by advancing the estimation sample one observation at a time and repeating
the error calculations.
When forecasting data, we should not rely on an arbitrary pre-set smoothing parameter. Most computer programs
nowadays provide efficient estimates of the smoothing constant, based upon minimizing some measure of risk such as
the mean squared error (MSE) for the one-step-ahead forecasts
$$MSE = \frac{1}{N} \sum_{i=1}^{N} (Y_i - F_i)^2$$
where $Y_i = Y_{t_i}$ and $F_i = F_{t_i | t_{i-1}}$. More formally, we let $\hat{Y}_{T+\tau}(T)$ (or $F_{T+\tau|T}$) denote the forecast of a given time
series $\{Y_t\}_{t \in \mathbb{Z}^+}$ at time $T + \tau$, where $T$ is a specified origin and $\tau \in \mathbb{Z}^+$. In that setting, the MSE becomes
$$MSE(T) = \frac{1}{T} \sum_{t=1}^{T} \big(Y_t - \hat{Y}_{(t-1)+1}(t - 1)\big)^2$$
Similarly, we define the MAD as
$$MAD(T) = \frac{1}{T} \sum_{t=1}^{T} \big|Y_t - \hat{Y}_{(t-1)+1}(t - 1)\big|$$
where $\hat{Y}_{(t-1)+1}(t - 1)$ is the forecast $\hat{Y}_{T+\tau}(T)$ with $T = t - 1$ and $\tau = 1$. An approximate 95% prediction interval for $\hat{Y}_{T+\tau}(T)$ is
given by
$$\hat{Y}_{T+\tau}(T) \pm z_{0.025}\, 1.25\, MAD(T)$$
where $z_{0.025} \approx 1.96$ (see Appendix (5.7.5)).
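To make the estimation and hold-out steps concrete, the sketch below uses simple exponential smoothing, $F_{t+1} = \alpha Y_t + (1 - \alpha) F_t$, as the forecasting method. That choice, the initial level $F_1 = Y_1$, the grid search over $\alpha$, the 80/20 split and the simulated series are all assumptions made for illustration; any other one-step-ahead method could be plugged in.

```python
import random


def ses_forecasts(y, alpha):
    """One-step-ahead forecasts of y[1:], starting from the level y[0]."""
    level = y[0]
    forecasts = []
    for obs in y[:-1]:
        level = alpha * obs + (1.0 - alpha) * level
        forecasts.append(level)
    return forecasts


def one_step_errors(y, alpha):
    """Errors Y_t - F_t for t = 2, ..., n."""
    return [obs - f for obs, f in zip(y[1:], ses_forecasts(y, alpha))]


def mse(errors):
    return sum(e * e for e in errors) / len(errors)


def mad(errors):
    return sum(abs(e) for e in errors) / len(errors)


rng = random.Random(1)                       # reproducible hypothetical data
series = [50.0 + 0.2 * t + rng.gauss(0.0, 2.0) for t in range(100)]

split = int(0.8 * len(series))               # 80% estimation, 20% hold-out
estimation = series[:split]

grid = [i / 20 for i in range(1, 20)]        # candidate smoothing constants
best_alpha = min(grid, key=lambda a: mse(one_step_errors(estimation, a)))

# Evaluate on the hold-out sample with the smoothing constant held fixed.
holdout_errors = one_step_errors(series, best_alpha)[split - 1:]
half_width = 1.96 * 1.25 * mad(holdout_errors)
print(best_alpha, mse(holdout_errors), half_width)
```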

3.2.3

Modelling time series

Given the time series $(X_t)_{t \in \mathbb{Z}}$ at time $t$, we can either decompose it into elements and estimate each component
separately, or we can directly model the series with a model such as an autoregressive integrated moving average
(ARIMA).
3.2.3.1

The structural time series

When analysing a time series, the classical approach is to decompose it into components: the trend, the seasonal
component, and the irregular term. More generally, the structural time series model proposed by Majani [1987]
decomposes the time series (Xt )t∈Z at time t into four elements
1. the trend (Tt ) : long term movements in the mean
2. the seasonal effects (It ) : cyclical fluctuations related to the calendar
3. the cycles Ct : other cyclical fluctuations (such as business cycles)
4. the residuals Et : other random or systematic fluctuations
The idea is to create separate models for these four elements and to combine them either additively
$$X_t = T_t + I_t + C_t + E_t$$
or multiplicatively
$$X_t = T_t \cdot I_t \cdot C_t \cdot E_t$$
where the multiplicative form reduces to the additive one by applying the logarithm. Forecasting is done by extrapolating $T_t$, $I_t$ and $C_t$, and expecting
$E[E_t] = c \in \mathbb{R}$. One can therefore either model each separate element and try to recombine
them, or directly model the process $X_t$. However, the decomposition is not unique, and the components are
interrelated, making identification difficult. Several methods have been proposed to extract the components in a time
series, ranging from simple weighted averages to more sophisticated methods, such as the Kalman filter or exponential
smoothing, Fourier transform, spectral analysis, and more recently wavelet analysis (see Kendall [1976b], Brockwell
et al. [1991], Arino et al. [1995]). In economic time series, the seasonal component usually has a constant period of
12 months, and to assess it one uses some underlying assumptions or theory about the nature of the series. Longer-term trends, defined as fluctuations of a series on time scales of more than one year, are more difficult to estimate.
These business cycles are found by elimination of the seasonal component and the irregular term. Further, forecasting
is another reason for decomposing a series, as it is generally easier to forecast components of a time series than the
whole series itself. One approach for decomposing a continuous, or discrete, time series into components is through
spectral analysis. Fourier analysis uses sums of sines and cosines at different wavelengths to express almost any given
periodic function, and therefore any function with a compact support. However, the non-local characteristic of sine
and cosine implies that we can only consider stationary signals along the time axis. Even though various methods for
time-localising a Fourier transform have been proposed to avoid this problem, such as the windowed Fourier transform,
the real improvement comes with the development of wavelet theory. In the rest of this guide, we are going to describe
various techniques to model the residual components ($E_t$), the trend $T_t$, and the business cycles $C_t$, and we will also
consider different models to forecast the process $X_t$ directly.
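As an example of the "simple weighted averages" end of the spectrum, the following sketch performs a classical additive decomposition with a centred moving average. It is a simplification of the four-element model: the cycle $C_t$ is not separated and stays inside the estimated trend, the period is assumed even and known, and the series is synthetic.

```python
def centred_moving_average(x, period):
    """2 x period centred moving average (even period); None at both ends."""
    half = period // 2
    trend = [None] * len(x)
    for t in range(half, len(x) - half):
        window = x[t - half:t + half + 1]
        trend[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / period
    return trend


def additive_decomposition(x, period):
    """Split x into trend, seasonal effects (one per phase) and residuals."""
    trend = centred_moving_average(x, period)
    seasonal = []
    for phase in range(period):
        detrended = [x[t] - trend[t] for t in range(phase, len(x), period)
                     if trend[t] is not None]
        seasonal.append(sum(detrended) / len(detrended))
    centre = sum(seasonal) / period            # force the effects to sum to zero
    seasonal = [s - centre for s in seasonal]
    residual = [None if trend[t] is None else x[t] - trend[t] - seasonal[t % period]
                for t in range(len(x))]
    return trend, seasonal, residual


pattern = [3.0, -1.0, -4.0, 2.0]               # hypothetical quarterly effects
series = [10.0 + 0.5 * t + pattern[t % 4] for t in range(24)]
trend, seasonal, residual = additive_decomposition(series, period=4)
print([round(s, 6) for s in seasonal])
```

For a multiplicative series, the same function can be applied to the logarithm of the data.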
3.2.3.2

Some simple statistical models

Rather than modelling the elements of the time series (Xt )t∈Z we can directly model the series. For illustration
purpose we present a few basic statistical models describing the data which will be used and detailed in Chapter (5).
Note, each of these models has a number of variants, which are refinements of the basic models.
AR process An autoregressive (AR) process is one in which the change in the variable at a point in time is linearly
correlated with the previous change. In general, the correlation declines exponentially with time and disappears in a
relatively short period of time. Letting $Y_n$ be the change in $Y$ at time $n$, with $0 \leq Y \leq 1$, then we get
$$Y_n = c_1 Y_{n-1} + ... + c_p Y_{n-p} + e_n$$
where $|c_l| \leq 1$ for $l = 1, .., p$, and $e$ is a white noise series with mean 0 and variance $\sigma_e^2$. The restrictions on the
coefficients $c_l$ ensure that the process is stationary, that is, there is no long-term trend, up or down, in the mean or
variance. This is an $AR(p)$ process where the change in $Y$ at time $n$ is dependent on the previous $p$ periods. To test
for the possibility of an AR process, a regression is run where the change at time $n$ is the dependent variable, and the
changes in the previous $q$ periods (the lags) are used as independent variables. Evaluating the t-statistic for each lag,
if any of them are significant at the 5% level, we can form the hypothesis that an AR process is at work.
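The regression test can be sketched for a single lag. The code below simulates an AR(1) series with a hypothetical coefficient $c_1 = 0.6$ and Gaussian white noise, regresses $Y_n$ on $Y_{n-1}$ without intercept, and reports the t-statistic of the lag coefficient. Restricting the test to one lag and omitting the intercept are simplifying assumptions.

```python
import random
from math import sqrt


def simulate_ar1(c1, n, sigma=1.0, seed=0):
    """Simulate Y_n = c1 * Y_{n-1} + e_n with Gaussian white noise e_n."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n - 1):
        y.append(c1 * y[-1] + rng.gauss(0.0, sigma))
    return y


def lag_regression(y):
    """OLS of Y_n on Y_{n-1} (no intercept): returns (estimate, t-statistic)."""
    lagged, current = y[:-1], y[1:]
    sxx = sum(v * v for v in lagged)
    c_hat = sum(a * b for a, b in zip(lagged, current)) / sxx
    residuals = [b - c_hat * a for a, b in zip(lagged, current)]
    s2 = sum(e * e for e in residuals) / (len(current) - 1)
    standard_error = sqrt(s2 / sxx)
    return c_hat, c_hat / standard_error


y = simulate_ar1(c1=0.6, n=2000)
c_hat, t_stat = lag_regression(y)
print(c_hat, t_stat)
# |t| above roughly 1.96 is significant at the 5% level for a large sample
```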

MA process In a moving average (MA) process, the time series is the result of the moving average of an unobserved
time series
$$Y_n = d_1 e_{n-1} + ... + d_q e_{n-q} + e_n$$
where $|d_l| < 1$ for $l = 1, .., q$. The restrictions on the coefficients $d_l$ ensure that the process is invertible. In the case
where $d_l > 1$, future events would affect the present, and the process is stationary. Because of the moving average
process, there is a linear dependence on the past and a short-term memory effect.
ARMA process In an autoregressive moving average (ARMA) model, we have both some autoregressive terms and
some moving average terms which are unobserved random series. We get the general $ARMA(p, q)$ form
$$Y_n = c_0 + c_1 Y_{n-1} + ... + c_p Y_{n-p} - d_1 e_{n-1} - ... - d_q e_{n-q} + e_n$$
where $p$ is the number of autoregressive terms, $q$ is the number of moving average terms, $e_n$ is a random variable
with a given distribution $F$, and $c_0 \in \mathbb{R}$ is the drift.
ARIMA process Both AR and ARMA models can be absorbed into a more general class of processes called
autoregressive integrated moving average (ARIMA) models which are specifically applied to nonstationary time series.
While these series have an underlying trend in their mean and variance, by taking successive differences of the data, these
processes become stationary. For instance, a price series is not stationary merely because it has a long-term growth
component. That is, the price will not tend towards an average value as it can grow without bound. Fortunately, in
the efficient market hypothesis (EMH), it is assumed that the changes in price (or returns) are stationary. Typically,
price changes are specified as percentage changes, or log differences, which are first differences. However, in some
series, higher order differences may be needed to make the data stationary. Hence, the difference of the differences is
a second-order ARIMA process. In general, we say that $Y_t$ is a homogeneous nonstationary process of order $d$ if
$$Z_t = \Delta^d Y_t$$
is stationary, where $\Delta$ represents differencing, and $d$ represents the level of differencing. If $Z_t$ is an $ARMA(p, q)$
process, then $Y_t$ is considered an $ARIMA(p, d, q)$ process. The process does not have to be mixed: if $Y_t$ is an
$ARIMA(p, d, 0)$ process, then $Z_t$ is an $AR(p)$ process.
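The differencing operator $\Delta^d$ is easy to illustrate. In the sketch below, the function name is hypothetical and the quadratic series is a toy example chosen because its second difference is constant.

```python
def difference(y, d=1):
    """Apply the differencing operator d times: Z_t = Delta^d Y_t."""
    z = list(y)
    for _ in range(d):
        z = [z[t] - z[t - 1] for t in range(1, len(z))]
    return z


quadratic = [t * t for t in range(8)]      # 0, 1, 4, 9, ... grows without bound
print(difference(quadratic, d=1))          # still trending: 1, 3, 5, ...
print(difference(quadratic, d=2))          # constant: the trend has been removed
```

Each pass shortens the series by one observation, so $d$ differences leave $n - d$ values.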
ARCH process We now introduce popular models to describe the conditional variance of market returns. The basic
autoregressive conditional heteroskedasticity (ARCH) models developed by Engle [1982] became famous because
• they are a family of nonlinear stochastic processes (as opposed to ARMA models)
• their frequency distribution is a high-peaked, fat-tailed one
• empirical studies showed that financial time series exhibit statistically significant ARCH.
In the ARCH model, time series are defined by normal probability distributions but time-dependent variances. That
is, the expected variance of a process is conditional on its previous value. The process is also autoregressive in that
it has a time dependence. A sample frequency distribution is the average of these expanding and contracting normal
distributions, leading to fat-tailed, high-peaked distribution at any point in time. The basic model follows
$$Y_n = S_n e_n$$
$$S_n^2 = \alpha_0 + \alpha_1 e_{n-1}^2$$
where $e$ is a standard normal random variable, and $\alpha_1$ is a constant. Typical values are $\alpha_0 = 1$ and $\alpha_1 = \frac{1}{2}$. Once
again, the observed value $Y$ is the result of an unobserved series, $e$, depending on past realisations of itself. The
nonlinearity of the model implies that small changes will likely be followed by other small changes, and large changes
by other large changes, but the sign will be unpredictable. Further, large changes will amplify, and small changes will
contract, resulting in a fat-tailed, high-peaked distribution.
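A small simulation gives a feel for the fat tails. The sketch follows the two equations exactly as written above, with the variance driven by the previous squared standard normal innovation $e_{n-1}^2$, and uses the typical values $\alpha_0 = 1$ and $\alpha_1 = \frac{1}{2}$. The sample size and seed are arbitrary, and the kurtosis (equal to 3 for a normal distribution) is used here as a rough indicator of tail weight.

```python
import random


def simulate_arch(n, alpha0=1.0, alpha1=0.5, seed=0):
    """Simulate Y_n = S_n * e_n with S_n^2 = alpha0 + alpha1 * e_{n-1}^2."""
    rng = random.Random(seed)
    e_prev = rng.gauss(0.0, 1.0)
    y = []
    for _ in range(n):
        s2 = alpha0 + alpha1 * e_prev ** 2     # conditional variance
        e = rng.gauss(0.0, 1.0)                # standard normal innovation
        y.append(s2 ** 0.5 * e)
        e_prev = e
    return y


def kurtosis(x):
    """Fourth central moment divided by the squared variance."""
    n = len(x)
    centre = sum(x) / n
    m2 = sum((v - centre) ** 2 for v in x) / n
    m4 = sum((v - centre) ** 4 for v in x) / n
    return m4 / m2 ** 2


sample = simulate_arch(100000)
print(kurtosis(sample))     # noticeably above the normal value of 3
```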
GARCH process Bollerslev [1986] formalised the generalised ARCH (or GARCH) by making the S variable
dependent on the past as well,
$$Y_n = S_n e_n$$
$$S_n^2 = \alpha_0 + \alpha_1 e_{n-1}^2 + \beta_1 S_{n-1}^2$$
where the three values range from 0 to 1, but α0 = 1, α1 = 0.1, and β1 = 0.8 are typical values. GARCH also
creates a fat-tailed high-peaked distribution.
Example of a financial model The main idea behind (G)ARCH models is that the conditional standard deviations
of a data series are a function of their past values. A very common model in financial econometrics is the $AR(1)-GARCH(1, 1)$ process given by
$$r_n = c_0 + c_1 r_{n-1} + a_n$$
$$v_n = \alpha_0 + \alpha_1 e_{n-1}^2 + \beta_1 v_{n-1}$$
where $r_n$ is the log-return of the data series for each $n$, $v_n$ is the conditional variance of the residuals for the mean
equation (see footnote 1) for each $n$, and $c_0$, $c_1$, $\alpha_0$, $\alpha_1$ and $\beta_1$ are parameters that need to be estimated. The GARCH process
is well defined as long as the condition $\alpha_1 + \beta_1 < 1$ is satisfied. If this is not the case, the variance process is
non-stationary and we have to fit other processes for conditional variance such as Integrated GARCH (IGARCH)
models.

3.2.4

Introducing parametric regression

Given a set of observations, we want to summarise the data by fitting it to a model that depends on adjustable parameters. To do so we design a merit function measuring the agreement between the data and the model with a particular
choice of parameters. We can design the merit function such that either small values represent close agreement (the frequentist convention), or, by considering probabilities, larger values represent closer agreement (the Bayesian convention). In either case, the
parameters of the model are adjusted to find the corresponding extremum in the merit function, providing best-fit
parameters. The adjustment process is an optimisation problem which we will treat in Chapter (14). However, in
some special cases, specific methods exist, providing an alternative solution. In any case, a fitting procedure should
provide
1. some parameters
2. error estimates on the parameters, or a way to sample from their probability distribution
3. a statistical measure of goodness of fit
In the event where the third item suggests that the model is an unlikely match to the data, then the first two items are
probably worthless.
Footnote 1: $a_n = r_n - c_0 - c_1 r_{n-1}$, $a_n = \sqrt{v_n}\, e_n$.

3.2.4.1

Some rules for conducting inference

The central frequentist idea postulates that given the details of a null hypothesis, there is an implied population (probability distribution) of possible data sets. If the assumed null hypothesis is correct, the actual, measured, data set is
drawn from that population. When the measured data occurs very infrequently in the population, then the hypothesis is rejected. Focusing on the distribution of the data sets, frequentists neglect the concept of a probability distribution
of the hypothesis. That is, for frequentists, there is no statistical universe of models from which the parameters are
drawn. Instead, they identify the probability of the data given the parameters as the likelihood of the parameters given the
data. Parameters derived by maximising this likelihood are called maximum likelihood estimators (MLE). An alternative approach is to
consider Bayes' theorem relating the conditional probabilities of two events, A and B,
$$
P(A|B) = P(A)\, \frac{P(B|A)}{P(B)} \tag{3.2.1}
$$

where P(A|B) is the probability of A given that B has occurred. A and B need not be repeatable events, and can
be propositions or hypotheses, which yields a set of consistent rules for conducting inference. All Bayesian probabilities
are viewed as conditional on some collective background information I. Assuming some hypothesis H, even before
any explicit data exist, we can assign some degree of plausibility P (H|I) called the Bayesian prior. When some data
D1 comes along, using Equation (3.2.1), we reassess the plausibility of H as
$$
P(H|D_1 I) = P(H|I)\, \frac{P(D_1|H I)}{P(D_1|I)}
$$

where the numerator is calculable as the probability of a data set given the hypothesis, and the denominator is the prior
predictive probability of the data. The latter is a normalisation constant ensuring that the probability of all hypotheses
sums to unity. When some additional data D2 come along, we can further refine the estimate of the probability of H
$$
P(H|D_2 D_1 I) = P(H|D_1 I)\, \frac{P(D_2|H D_1 I)}{P(D_2|D_1 I)}
$$

and so on. From the product rule for probabilities P (AB|C) = P (A|C)P (B|AC), we get
$$
P(H|D_2 D_1 I) = P(H|I)\, \frac{P(D_2 D_1|H I)}{P(D_2 D_1|I)}
$$

obtaining the same answer as if all the data D1 D2 had been taken together.
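The equivalence between sequential and batch updating can be illustrated with a small discrete example. This is a sketch under stated assumptions: three hypothetical hypotheses about the probability of an up move, a flat prior, and outcomes that are independent given the hypothesis, so that P(D2|H D1 I) = P(D2|H I). All numbers and names are illustrative.

```python
import numpy as np

# Three candidate hypotheses H: probability that an outcome equals 1 (illustrative values).
p_up = np.array([0.3, 0.5, 0.7])
prior = np.full(3, 1.0 / 3.0)  # flat prior P(H | I)

def update(belief, data):
    """Return P(H | data, I) from a current belief and a list of 0/1 outcomes."""
    ups = sum(data)
    downs = len(data) - ups
    likelihood = p_up ** ups * (1.0 - p_up) ** downs  # P(data | H, I)
    unnormalised = belief * likelihood
    return unnormalised / unnormalised.sum()  # division by P(data | I)

d1 = [1, 1, 0]
d2 = [1, 0, 1, 1]

sequential = update(update(prior, d1), d2)  # first D1, then D2
batch = update(prior, d1 + d2)              # D1 and D2 taken together
print(sequential, batch)
```

The two posterior vectors coincide up to rounding, and each sums to one because of the normalisation by the predictive probability of the data.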
3.2.4.2

The least squares estimator

Maximum likelihood estimator Given N data points (Xi , Yi ) for i = 0, .., N − 1, we want to fit a model having
M adjustable parameters aj , j = 0, .., M − 1, predicting a functional relationship between the measured independent
and dependent variables, defined as
$$
Y(X) = Y(X \mid a_0, \ldots, a_{M-1})
$$
Following the frequentist approach, given a set of parameters, if the probability of obtaining the data set is too small, then we
can conclude that the parameters are unlikely to be right. Assuming that each data point Yi has a measurement error
that is independently random and distributed as a Gaussian distribution around the true model Y (X), and assuming
that the standard deviations σ of these normal distributions are the same for all points, then the probability of obtaining
the data set is the product of the probabilities of each point
$$
P(\text{data} \mid \text{model}) \propto \prod_{i=0}^{N-1} \left\{ \exp\left[ -\frac{1}{2} \left( \frac{Y_i - Y(X_i)}{\sigma} \right)^2 \right] \Delta Y \right\}
$$

Alternatively, invoking Bayes' theorem in Equation (3.2.1), we get
$$
P(\text{model} \mid \text{data}) \propto P(\text{data} \mid \text{model})\, P(\text{model})
$$
where P(model) = P(a0, .., aM−1) is the prior probability distribution on all models. The most probable model is
obtained by maximising the probability of obtaining the data set above, or equivalently, by minimising the negative of its logarithm
$$
\sum_{i=0}^{N-1} \frac{1}{2} \left( \frac{Y_i - Y(X_i)}{\sigma} \right)^2 - N \log \Delta Y
$$
Since N, σ, and ∆Y are all constants, this is equivalent to minimising the sum of squared residuals. In that setting, we
recover the least squares fit
$$
\min_{a_0, \ldots, a_{M-1}} \; \sum_{i=0}^{N-1} \left( Y_i - Y(X_i \mid a_0, \ldots, a_{M-1}) \right)^2
$$

Under specific assumptions on measurement errors (see above), the least-squares fitting is the most probable parameter
set in the Bayesian sense (assuming flat prior), and it is the maximum likelihood estimate of the fitted parameters.
Relaxing the assumption of constant standard deviations, by assuming a known standard deviation σi for each data
point (Xi , Yi ), then the MLE of the model parameters, and the Bayesian most probable parameter set, is given by
minimising the quantity
$$
\chi^2 = \sum_{i=0}^{N-1} \left( \frac{Y_i - Y(X_i)}{\sigma_i} \right)^2 \tag{3.2.2}
$$

called the chi-square, which is a sum of N squares of normally distributed quantities, each normalised to unit variance.
Note, in practice measurement errors are far from Gaussian, and the central limit theorem does not apply, leading to
fat tail events skewing the least-squares fit. In some cases, the effect of nonnormal errors is to create an abundance of
outlier points decreasing the probability Q that the chi-square should exceed a particular value χ2 by chance.
Linear models So far we have made no assumption about the linearity or nonlinearity of the model Y (X|a0 , .., aM −1 )
in its parameters a0 , .., aM −1 . The simplest model is a straight line
Y (X) = Y (X|a, b) = a + bX
called linear regression. Assuming that the uncertainty σi associated with each measurement Yi is known, and that the
independent variables Xi are known exactly, we can minimise Equation (3.2.2) to determine a and b. At its minimum,
derivatives of χ2(a, b) with respect to a and b vanish. See Press et al. [1992] for the explicit solution for a and b, the covariance
of a and b characterising the uncertainty of the parameter estimation, and an estimate of the goodness of fit of the data.
We can also consider the general linear combination
$$
Y(X) = a_0 + a_1 X + a_2 X^2 + \ldots + a_{M-1} X^{M-1}
$$
which is a polynomial of degree M − 1. Further, a linear combination of sines and cosines is a Fourier series. More
generally, we have models of the form
$$
Y(X) = \sum_{k=0}^{M-1} a_k \phi_k(X)
$$

where the quantities φ0 (X), .., φM −1 (X) are arbitrary fixed functions of X called basis functions which can be nonlinear (linear refers only to the model’s dependence on its parameters ak ). In that setting, the chi-square merit function
becomes

$$
\chi^2 = \sum_{i=0}^{N-1} \left( \frac{Y_i - \sum_{k=0}^{M-1} a_k \phi_k(X_i)}{\sigma_i} \right)^2
$$

where σi is the measurement error of the ith data point. We can use optimisation to minimise χ2 , or in special cases
we can use specific techniques. We let A be an N × M matrix constructed from the M basis functions evaluated at
the N abscissas Xi , and from the N measurement errors σi with element
$$
A_{ij} = \frac{\phi_j(X_i)}{\sigma_i}
$$
This matrix is called the design matrix, and in general N ≥ M. We also define the vector b of length N with element
$b_i = Y_i / \sigma_i$, and denote the M vector whose components are the parameters to be fitted a0, .., aM−1 by a. The minimum
of the merit function occurs where the derivative of χ2 with respect to all M parameters ak vanishes. It yields M
equations
$$
\sum_{i=0}^{N-1} \frac{1}{\sigma_i^2} \left[ Y_i - \sum_{j=0}^{M-1} a_j \phi_j(X_i) \right] \phi_k(X_i) = 0 , \quad k = 0, \ldots, M-1
$$

Interchanging the order of summations, we get the normal equations of the least-squares problem
$$
\sum_{j=0}^{M-1} \alpha_{kj} a_j = \beta_k
$$

where $\alpha = A^\top A$ is an M × M matrix, and $\beta = A^\top b$ is a vector of length M. In matrix form, the normal equations
become
$$
(A^\top A)\, a = A^\top b
$$
which can be solved for the vector a by LU decomposition, Cholesky decomposition, or Gauss-Jordan elimination.
The inverse matrix $C = \alpha^{-1}$ is called the covariance matrix, and is closely related to the uncertainties of the estimated
parameters a. These uncertainties are estimated as
$$
\sigma^2(a_j) = C_{jj}
$$
the diagonal elements of C, being the variances of the fitted parameters a. The off-diagonal elements Cjk are the
covariances between aj and ak.
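The construction of the design matrix and the solution of the normal equations can be sketched as follows. The example assumes a polynomial basis φ_j(X) = X^j with M = 3, synthetic data with known measurement errors, and a direct linear solve. The data, the seed and the variable names are illustrative and not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: quadratic model with known measurement errors sigma_i.
x = np.linspace(0.0, 1.0, 50)
sigma = np.full_like(x, 0.1)
true_a = np.array([1.0, -2.0, 0.5])
y = true_a[0] + true_a[1] * x + true_a[2] * x ** 2 + sigma * rng.standard_normal(x.size)

# Design matrix A_ij = phi_j(X_i) / sigma_i and vector b_i = Y_i / sigma_i.
basis = np.vander(x, N=3, increasing=True)  # columns 1, X, X^2
A = basis / sigma[:, None]
b = y / sigma

# Normal equations (A^T A) a = A^T b.
alpha = A.T @ A
beta = A.T @ b
a_hat = np.linalg.solve(alpha, beta)

# Covariance matrix C = alpha^{-1}; its diagonal holds the parameter variances.
C = np.linalg.inv(alpha)
std_err = np.sqrt(np.diag(C))
chi2 = np.sum((b - A @ a_hat) ** 2)

print("fitted parameters:", a_hat)
print("standard errors:  ", std_err)
print("chi-square:       ", chi2)
```

For badly conditioned design matrices, forming the product of the transposed design matrix with itself squares the condition number, so a QR or SVD based solver such as `numpy.linalg.lstsq` is usually the safer choice in practice.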
Nonlinear models In the case where the model depends nonlinearly on the set of M unknown parameters ak for
k = 0, .., M − 1, we use the same method as above where we define a χ2 merit function and determine best-fit
parameters by its minimisation. This is similar to the general nonlinear function minimisation problem. If we are
sufficiently close to the minimum, we expect the χ2 function to be well approximated by a quadratic form
$$
\chi^2(a) \approx \gamma - d \cdot a + \frac{1}{2}\, a \cdot D \cdot a
$$
where d is an M-vector and D is an M × M matrix. If the approximation is a good one, we can jump from the current
trial parameters $a_{cur}$ to the minimising ones $a_{min}$ in a single leap
$$
a_{min} = a_{cur} + D^{-1} \cdot \left[ -\nabla \chi^2(a_{cur}) \right]
$$
However, in the case where the approximation is a poor one, we can take a step down the gradient, as in the steepest descent
method, getting
$$
a_{next} = a_{cur} - cst \times \nabla \chi^2(a_{cur})
$$
for a small constant cst. In both cases we need to compute the gradient of the χ2 function at any set of parameters a. For
$a_{min}$, we also need the matrix D, which is the second derivative matrix (Hessian matrix) of the χ2 merit function, at
any a. In this particular case, we know exactly the form of χ2, since it is based on a model function that we specified,
so that the Hessian matrix is known to us.
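Both update rules can be tried on a small problem. The sketch below assumes a hypothetical two-parameter model Y(X) = a0 exp(−a1 X) and noise-free synthetic data. It replaces the exact Hessian by the Gauss-Newton approximation built from first derivatives only, and it adds a step-halving safeguard that is not part of the text. All names and constants are illustrative.

```python
import numpy as np

def model(x, a):
    """Hypothetical nonlinear model Y(X) = a0 * exp(-a1 * X)."""
    return a[0] * np.exp(-a[1] * x)

def chi2(a, x, y, sigma):
    return np.sum(((y - model(x, a)) / sigma) ** 2)

def gradient_and_curvature(a, x, y, sigma):
    """Gradient of chi^2 and Gauss-Newton approximation D of its Hessian."""
    e = np.exp(-a[1] * x)
    J = np.column_stack([e, -a[0] * x * e]) / sigma[:, None]  # scaled model derivatives
    r = (y - model(x, a)) / sigma                             # scaled residuals
    grad = -2.0 * J.T @ r
    D = 2.0 * J.T @ J  # assumption: second derivatives of the model are dropped
    return grad, D

x = np.linspace(0.0, 4.0, 40)
sigma = np.full_like(x, 0.1)
a_true = np.array([2.0, 0.7])
y = model(x, a_true)  # noise-free data, so the minimum of chi^2 is exactly a_true

a_cur = np.array([1.8, 0.8])

# Steepest descent: one small step down the gradient.
grad, _ = gradient_and_curvature(a_cur, x, y, sigma)
cst = 1e-5
a_next = a_cur - cst * grad

# Repeated leaps a <- a + D^{-1} (-grad chi^2).
a = a_cur.copy()
for _ in range(50):
    grad, D = gradient_and_curvature(a, x, y, sigma)
    step = np.linalg.solve(D, -grad)
    # Safeguard (not in the text): halve the step while chi^2 would increase.
    while chi2(a + step, x, y, sigma) > chi2(a, x, y, sigma) and np.linalg.norm(step) > 1e-12:
        step *= 0.5
    a = a + step

print("after one steepest descent step:", a_next)
print("after repeated leaps:           ", a)
```

The single gradient step only lowers χ2 slightly, whereas the repeated leaps recover the generating parameters, which illustrates why the quadratic approximation is preferred close to the minimum.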

3.2.5

Introducing state-space models

3.2.5.1

The state-space form

The state-space form of a time series model represents the actual dynamics of a data generation process. We let Yt
denote the observation from a time series at time t, related to a vector αt, called the state vector, which is possibly
unobserved and whose dimension m is independent of the dimension n of Yt. The general form of a linear state-space
model is given by the following two equations
$$
\begin{aligned}
Y_t &= Z_t \alpha_t + d_t + G_t \epsilon_t , \quad t = 1, \ldots, T \\
\alpha_{t+1} &= T_t \alpha_t + c_t + H_t \epsilon_t
\end{aligned}
\tag{3.2.3}
$$
where Zt is an (n × m) matrix, dt is an (n × 1) vector, Gt is an (n × (n + m)) matrix, Tt is an (m × m) matrix, and Ht
is an (m × (n + m)) matrix. The process $\epsilon_t$ is an ((n + m) × 1) vector of serially independent, identically distributed
disturbances with $E[\epsilon_t] = 0$ and $Var(\epsilon_t) = I$, the identity matrix. We let the initial state vector α1 be independent of
$\epsilon_t$ at all times t. The first equation is the observation or measurement equation, and the second equation is the transition
equation. The general (first-order Markov) state equation takes the form
$$
\alpha_t = f(\alpha_{t-1}, \theta_{t-1}) + \eta_{t-1}
$$
and the general observation equation takes the form
$$
Y_t = h(\alpha_t, \theta_t) + \epsilon_t
$$
with independent error processes $\{\eta_t\}$ and $\{\epsilon_t\}$. If the system matrices do not evolve with time, the state-space model
is called time-invariant or time-homogeneous. If the disturbances $\epsilon_t$ and initial state vector α1 are assumed to have
a normal distribution, then the model is termed Gaussian. Further, if $G_t H_t^\top = 0$ for all t then the disturbances of the measurement and
transition equations are uncorrelated. The fundamental inference mechanism is Bayesian and consists in computing
the posterior quantities of interest sequentially in the following recursive calculation:
1. Letting ψt = {Y1, .., Yt} be the information set up to time t, we get the prior distribution
$$
p(\alpha_t|\psi_{t-1}) = \int p(\alpha_t|\alpha_{t-1})\, p(\alpha_{t-1}|\psi_{t-1})\, d\alpha_{t-1}
$$
corresponding to the distribution of the state αt before the new observation Yt is taken into account.
2. Then, the updating equation becomes
$$
p(\alpha_t|\psi_t) = \frac{p(Y_t|\alpha_t)\, p(\alpha_t|\psi_{t-1})}{p(Y_t|\psi_{t-1})}
$$
where the sampling distribution p(Yt|αt) is the distribution of the observed data conditional on the state.


The update provides an analytical solution if all densities in the state and observation equations are Gaussian, and
both the state and the observation equation are linear. If these conditions are met, the Kalman filter (see Section
(3.2.5.2)) provides the optimal Bayesian solution to the tracking problem. Otherwise we require approximations such
as the Extended Kalman filter (EKF), or Particle filter (PF) which approximates non-Gaussian densities and nonlinear equations. The particle filter uses Monte Carlo methods, in particular Importance sampling, to construct the
approximations.
3.2.5.2

The Kalman filter

The Kalman filter is used for prediction, filtering and smoothing. If we let ψt = {Y1 , .., Yt } denote the information
set up to time t, then the problem of prediction is to compute E[αt+1 |ψt ]. Filtering is concerned with calculating
E[αt |ψt ], while smoothing is concerned with estimating E[αt |ψT ] for all t < T . In the Linear Gaussian State-Space
Model we assume $G_t H_t^\top = 0$ and drop the terms dt and ct from the observation and transition equations (3.2.3).
Further, we let
$$
G_t G_t^\top = \Sigma_t , \quad H_t H_t^\top = \Omega_t
$$

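and the Kalman filter recursively computes the quantities
$$
\begin{aligned}
a_{t|t} &= E[\alpha_t|\psi_t] \quad \text{(filtering)} \\
a_{t+1|t} &= E[\alpha_{t+1}|\psi_t] \quad \text{(prediction)} \\
P_{t|t} &= MSE(\alpha_t|\psi_t) \\
P_{t+1|t} &= MSE(\alpha_{t+1}|\psi_t)
\end{aligned}
$$
where MSE is the mean-square error (for $P_{t+1|t}$, the one-step-ahead prediction variance). Starting with $a_{1|0}$ and $P_{1|0}$, the quantities $a_{t|t}$
and $a_{t+1|t}$ are obtained by running for t = 1, .., T the recursions
$$
\begin{aligned}
V_t &= Y_t - Z_t a_{t|t-1} , \quad F_t = Z_t P_{t|t-1} Z_t^\top + \Sigma_t \\
a_{t|t} &= a_{t|t-1} + P_{t|t-1} Z_t^\top F_t^{-1} V_t \\
P_{t|t} &= P_{t|t-1} - P_{t|t-1} Z_t^\top F_t^{-1} Z_t P_{t|t-1} \\
a_{t+1|t} &= T_t a_{t|t} \\
P_{t+1|t} &= T_t P_{t|t} T_t^\top + \Omega_t
\end{aligned}
$$
where Vt denotes the one-step-ahead error in forecasting Yt conditional on the information set at time (t − 1) and Ft
is its MSE. The quantities $a_{t|t}$ and $a_{t|t-1}$ are optimal estimators of αt conditional on the available information. Eliminating $a_{t|t}$ and $P_{t|t}$, the
resulting recursions for t = 1, .., T − 1 are
$$
\begin{aligned}
a_{t+1|t} &= T_t a_{t|t-1} + K_t V_t \\
K_t &= T_t P_{t|t-1} Z_t^\top F_t^{-1} \\
P_{t+1|t} &= T_t P_{t|t-1} L_t^\top + \Omega_t \\
L_t &= T_t - K_t Z_t
\end{aligned}
$$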
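The filtering and prediction recursions translate almost line by line into code. The sketch below assumes time-invariant system matrices and applies the filter to a simulated local level model (scalar random walk state observed with noise). The function name `kalman_filter`, the simulated data and the initial values are illustrative assumptions.

```python
import numpy as np

def kalman_filter(Y, Z, T, Sigma, Omega, a1, P1):
    """Kalman filter for Y_t = Z a_t + noise(Sigma), a_{t+1} = T a_t + noise(Omega).

    Assumes time-invariant system matrices. Y has shape (n_obs, n).
    Returns filtered and predicted state means and MSE matrices.
    """
    n_obs, m = Y.shape[0], a1.shape[0]
    a_pred = np.empty((n_obs + 1, m))
    P_pred = np.empty((n_obs + 1, m, m))
    a_filt = np.empty((n_obs, m))
    P_filt = np.empty((n_obs, m, m))
    a_pred[0], P_pred[0] = a1, P1
    for t in range(n_obs):
        V = Y[t] - Z @ a_pred[t]                         # one-step-ahead forecast error
        F = Z @ P_pred[t] @ Z.T + Sigma                  # its MSE
        gain = P_pred[t] @ Z.T @ np.linalg.inv(F)
        a_filt[t] = a_pred[t] + gain @ V                 # filtering step
        P_filt[t] = P_pred[t] - gain @ Z @ P_pred[t]
        a_pred[t + 1] = T @ a_filt[t]                    # prediction step
        P_pred[t + 1] = T @ P_filt[t] @ T.T + Omega
    return a_filt, P_filt, a_pred, P_pred

# Simulated local level model: random walk level plus observation noise.
rng = np.random.default_rng(0)
n_obs = 200
level = np.cumsum(np.sqrt(0.1) * rng.standard_normal(n_obs))
Y = (level + rng.standard_normal(n_obs)).reshape(-1, 1)

Z = np.array([[1.0]])
T = np.array([[1.0]])
Sigma = np.array([[1.0]])
Omega = np.array([[0.1]])
a_filt, P_filt, a_pred, P_pred = kalman_filter(
    Y, Z, T, Sigma, Omega, a1=np.zeros(1), P1=np.array([[10.0]])
)
print("last filtered level:", a_filt[-1, 0], "true level:", level[-1])
print("steady-state prediction variance:", P_pred[-1, 0, 0])
```

For this scalar model the prediction variance converges to the positive root of P² − 0.1 P − 0.1 = 0, which gives a quick sanity check on the recursions.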
Parameter estimation Another application of the Kalman filter is the estimation of any unknown parameters θ that
appear in the system matrices. The likelihood for data Y = (Y1, .., YT) can be constructed as
$$
p(Y_1, \ldots, Y_T) = p(Y_T|\psi_{T-1}) \cdots p(Y_2|\psi_1)\, p(Y_1) = \prod_{t=1}^{T} p(Y_t|\psi_{t-1})
$$

Assuming that the state-space model is Gaussian, by taking conditional expectations on both sides of the observation
equation, with dt = 0 we deduce that for t = 1, .., T
$$
\begin{aligned}
E[Y_t|\psi_{t-1}] &= Z_t a_{t|t-1} \\
Var(Y_t|\psi_{t-1}) &= F_t
\end{aligned}
$$
so that the one-step-ahead prediction density p(Yt|ψt−1) is the density of a multivariate normal random variable with mean
$Z_t a_{t|t-1}$ and covariance matrix Ft. Thus, the log-likelihood function is given by
$$
\log L = -\frac{nT}{2} \log 2\pi - \frac{1}{2} \sum_{t=1}^{T} \log \det F_t - \frac{1}{2} \sum_{t=1}^{T} V_t^\top F_t^{-1} V_t
$$

where Vt = Yt − Zt at|t−1 . Numerical procedures are used in order to maximise the log-likelihood to obtain the ML
estimates of the parameters θ which are consistent and asymptotically normal. If the state-space model is not Gaussian,
the likelihood can still be constructed in the same way using the minimum mean square linear estimators of the state
vector. However, the estimators θ̂ maximising the likelihood are the quasi-maximum likelihood (QML) estimators of
the parameters. They are also consistent and asymptotically normal.
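The prediction error decomposition can be sketched for the scalar local level model (Z = T = 1, n = 1), where the only unknowns are the observation variance and the state variance. The function name `local_level_loglik`, the simulated data, the diffuse initial variance and the coarse grid search are illustrative assumptions; a real application would typically use a numerical optimiser.

```python
import numpy as np

def local_level_loglik(y, sigma2, omega2, a1=0.0, p1=1e6):
    """Gaussian log-likelihood of a local level model via the Kalman filter.

    sigma2: observation noise variance, omega2: state noise variance.
    a1, p1: mean and variance of the initial state (large p1 is a rough diffuse prior).
    """
    a, p = a1, p1
    loglik = 0.0
    for obs in y:
        v = obs - a          # V_t: one-step-ahead forecast error
        f = p + sigma2       # F_t: its variance
        loglik += -0.5 * (np.log(2.0 * np.pi) + np.log(f) + v * v / f)
        a_filt = a + p / f * v
        p_filt = p - p * p / f
        a, p = a_filt, p_filt + omega2   # prediction for the next period
    return loglik

# Simulated data with sigma2 = 1 and omega2 = 0.1.
rng = np.random.default_rng(1)
n_obs = 500
level = np.cumsum(np.sqrt(0.1) * rng.standard_normal(n_obs))
y = level + rng.standard_normal(n_obs)

# Coarse grid search over the state variance, keeping sigma2 fixed at its true value.
grid = [0.001, 0.01, 0.1, 1.0, 10.0]
logliks = [local_level_loglik(y, 1.0, w) for w in grid]
best = grid[int(np.argmax(logliks))]
print("log-likelihoods:", dict(zip(grid, logliks)))
print("best grid value for omega2:", best)
```

Each pass through the loop adds one term of the sum, so a single run of the filter delivers the full log-likelihood for a given parameter value.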
Smoothing Smoothing is another application of the Kalman filter where, given a fixed set of data, estimates of the state
vector are computed at each time t in the sample period taking into account the full information set available. The
algorithm computes $a_{t|T} = E[\alpha_t|\psi_T]$ along with its MSE, $P_{t|T}$, via a set of backward recursions for all
t = 1, .., T − 1. To obtain $a_{t|T}$ and $P_{t|T}$ we start with $a_{T|T}$ and $P_{T|T}$ and run backwards for t = T − 1, .., 1
$$
\begin{aligned}
a_{t|T} &= a_{t|t} + P_t^* (a_{t+1|T} - a_{t+1|t}) \\
P_{t|T} &= P_{t|t} + P_t^* (P_{t+1|T} - P_{t+1|t}) P_t^{*\top} , \quad P_t^* = P_{t|t} T_t^\top P_{t+1|t}^{-1}
\end{aligned}
$$
The extensive use of Markov chain Monte Carlo (MCMC), in particular the Gibbs sampler, has given rise to another
smoothing algorithm called the simulation smoother, which is also closely related to the Kalman filter. In contrast to the
fixed interval smoother, which computes the conditional mean and variance of the state vector at each time t in the
sample, a simulation smoother is used for drawing samples from the density p(α0, ..., αT|YT). The first simulation
smoother is based on the identity
$$
p(\alpha_0, \ldots, \alpha_T|Y_T) = p(\alpha_T|Y_T) \prod_{t=0}^{T-1} p(\alpha_t|\psi_t, \alpha_{t+1})
$$
and a draw from p(α0, ..., αT|YT) is recursively constructed in terms of αt. Starting with a draw $\hat{\alpha}_T \sim N(a_{T|T}, P_{T|T})$,
the main idea is that for a Gaussian state space model p(αt|ψt, αt+1) is a multivariate normal density and hence it is
completely characterised by its first and second moments. The usual Kalman filter recursions are run, so that $a_{t|t}$ is
initially obtained. Then, the draw $\hat{\alpha}_{t+1} \sim p(\alpha_{t+1}|\psi_t, \alpha_{t+2})$ is treated as m additional observations and a second set
of m Kalman filter recursions is run for each element of the state vector $\hat{\alpha}_{t+1}$. However, the latter procedure involves
the inversion of system matrices, which are not necessarily non-singular.


3.2.5.3

Model specification

While state space models are widely used in time series analysis to deal with processes gradually changing over time,
model specification is a challenge for these models as one has to specify which components to include and to decide
whether they are fixed or time-varying. It leads to testing problems which are non-regular from the view-point of
classical statistics. Thus, a classical approach toward model selection which is based on hypothesis testing such as a
likelihood ratio test or information criteria such as AIC or BIC cannot be easily applied, because it relies on asymptotic
arguments based on regularity conditions that are violated in this context. For example, we consider the time series
Y = (Y1 , ..., YT ) for t = 1, .., T modelled with the linear trend model
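$$
Y_t = \mu_t + \epsilon_t , \quad \epsilon_t \sim N(0, \sigma_\epsilon^2)
$$
where µt is a random walk with a random drift starting from unknown initial values µ0 and a0
$$
\begin{aligned}
\mu_t &= \mu_{t-1} + a_{t-1} + \omega_{1t} , \quad \omega_{1t} \sim N(0, \theta_1) \\
a_t &= a_{t-1} + \omega_{2t} , \quad \omega_{2t} \sim N(0, \theta_2)
\end{aligned}
$$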
In order to decide whether the drift at is time-varying or fixed we could test θ2 = 0 versus θ2 > 0. However,
it is a nonregular testing problem since the null hypothesis lies on the boundary of the parameter space. Testing
the null hypothesis a0 = a1 = ... = aT versus the alternative that at follows a random walk is, again, non-regular
because the size of the hypothesis increases with the number of observations. One possibility is to consider the
Bayesian approach when dealing with such non-regular testing problems. We assume that there are K different
candidate models M1, .., MK for generating the time series Y. In a Bayesian setting each of these models is assigned
a prior probability p(Mk) with the goal of deriving the posterior model probability p(Mk|Y) (the probability of a
hypothesis Mk given the observed evidence Y) for each model Mk for k = 1, .., K. One strategy for computing
the posterior model probabilities is to treat each model separately, using
Bayes' rule p(Mk|Y) ∝ p(Y|Mk)p(Mk) where p(Y|Mk) is the marginal likelihood for model Mk (it is the probability
of observing Y given Mk). An explicit expression for the marginal likelihood exists only for conjugate problems like
linear regression models with normally distributed errors, whereas for more complex models numerical techniques
are required. For Gaussian state space models, marginal likelihoods have been estimated using methods such as
importance sampling, Chib’s estimator, numerical integration and bridge sampling. The modern approach to Bayesian
model selection is to apply model space MCMC methods by sampling jointly model indicators and parameters, using
for instance the reversible jump MCMC algorithm (see Green [1995]) or the stochastic variable selection approach
(see George and McCulloch [1993] [1997]). The stochastic variable selection approach applied to model selection
for regression models aims at identifying non-zero regression effects and allows parsimonious covariance modelling
for longitudinal data. Fruhwirth-Schnatter et al. [2010a] considered the variable selection approach for many model
selection problems occurring in state space modelling. In the above example, they used binary stochastic indicators in
such a way that the unconstrained model corresponds to setting all indicators equal to 1. Reduced model specifications
result by setting certain indicators equal to 0.

3.3

Asset returns and their characteristics

3.3.1

Defining financial returns

We consider the probability space (Ω, F, P) where Ft is a right continuous filtration including all P negligible sets in
F. For simplicity, we let the market be complete and assume that there exists an equivalent martingale measure Q as
defined in a mixed diffusion model by Bellamy and Jeanblanc [2000]. In the presence of continuous dividend yield,
that unique probability measure equivalent to P is such that the discounted stock price plus the cumulated dividends
is a martingale when the riskless asset is the numeraire. In a general setting, we let the underlying process (St)t≥0 be
a one-dimensional Ito process valued in the open subset D.


3.3.1.1

Asset returns

Return series are easier to handle than price series due to their more attractive statistical properties and to the fact
that they represent a complete and scale-free summary of the investment opportunity (see Campbell et al. [1997]).
Expected returns need to be viewed over some time horizon, in some base currency, and using one of many possible
averaging and compounding methods. Holding the asset for one period from date t to date (t + 1) would result in a
simple gross return
$$
1 + R_{t,t+1} = \frac{S_{t+1}}{S_t} \tag{3.3.4}
$$
where the corresponding one-period simple net return, or simple return, $R_{t,t+1}$ is given by
$$
R_{t,t+1} = \frac{S_{t+1}}{S_t} - 1 = \frac{S_{t+1} - S_t}{S_t}
$$
More generally, we let
$$
R_{t-d,t} = \frac{\nabla_d S_t + D_{t-d,t}}{S_{t-d}}
$$

be the discrete return of the underlying process where $\nabla_d S_t = S_t - S_{t-d}$ with period d and where $D_{t-d,t}$ is the
dividend over the period [t−d, t]. For simplicity we will only consider dividend-adjusted prices with discrete dividend-adjusted
returns $R_{t-d,t} = \nabla_d S_t / S_{t-d}$. Hence, holding the asset for d periods between dates t − d and t gives a d-period
simple gross return
$$
\begin{aligned}
1 + R_{t-d,t} &= \frac{S_t}{S_{t-d}} = \frac{S_t}{S_{t-1}} \times \frac{S_{t-1}}{S_{t-2}} \times \cdots \times \frac{S_{t-d+1}}{S_{t-d}} \\
&= (1 + R_{t-1,t})(1 + R_{t-2,t-1}) \cdots (1 + R_{t-d,t-d+1}) = \prod_{j=1}^{d} (1 + R_{t-j,t-j+1})
\end{aligned}
\tag{3.3.5}
$$

so that the d-period simple gross return is just the product of the d one-period simple gross returns which is called a
compound return. Holding the asset for d years, then the annualised (average) return is defined as
$$
R^A_{t-d,t} = \left[ \prod_{j=1}^{d} (1 + R_{t-j,t-j+1}) \right]^{\frac{1}{d}} - 1
$$

which is the geometric mean of the d one-period simple gross returns involved, minus one, and can be computed (see Appendix
(B.9.4)) by
$$
R^A_{t-d,t} = \exp\left( \frac{1}{d} \sum_{j=1}^{d} \ln (1 + R_{t-j,t-j+1}) \right) - 1
$$
It is simply the arithmetic mean of the logarithmic returns $\ln(1 + R_{t-j,t-j+1})$ for j = 1, .., d, which is then exponentiated
to return the computation to the original scale. As it is easier to compute an arithmetic average than a geometric mean,
and since the one-period returns tend to be small, one can use a first-order Taylor expansion (see footnote 2) to approximate the
annualised (average) return
$$
R^A_{t-d,t} \approx \frac{1}{d} \sum_{j=1}^{d} R_{t-j,t-j+1}
$$
Footnote 2: since $\log(1 + x) \approx x$ for $|x| \ll 1$.

Note, the arithmetic mean of two successive returns of +50% and −50% is 0%, but the geometric mean is about −13%
since $[(1 + 0.5)(1 - 0.5)]^{1/d} = 0.87$ with d = 2 periods. While some financial theory requires arithmetic means as
inputs (single-period Markowitz or mean-variance optimisation, single-period CAPM), most investors are interested
in wealth compounding which is better captured by geometric means.
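The +50% and −50% example can be reproduced in a few lines. This is a small sketch with illustrative variable names, computing the geometric mean both directly and through the mean of log gross returns.

```python
import numpy as np

returns = np.array([0.5, -0.5])  # two successive one-period simple returns
d = len(returns)

arithmetic = returns.mean()
compound = np.prod(1.0 + returns) - 1.0                      # d-period simple return
geometric = np.prod(1.0 + returns) ** (1.0 / d) - 1.0        # geometric mean return
geometric_via_logs = np.exp(np.mean(np.log1p(returns))) - 1.0

print("arithmetic mean:      ", arithmetic)
print("compound return:      ", compound)
print("geometric mean:       ", geometric)
print("geometric mean (logs):", geometric_via_logs)
```

An investor earning these two returns in sequence ends with 75% of the initial wealth, which the geometric mean reflects and the arithmetic mean does not.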
In general, the net asset value A of continuous compounding is
$$
A = C e^{r \times n}
$$
where r is the interest rate per annum, C is the initial capital, and n is the number of years. Similarly,
$$
C = A e^{-r \times n}
$$
is referred to as the present value of an asset that is worth A dollars n years from now, assuming that the continuously
compounded interest rate is r per annum. If the gross return on a security is just 1 + Rt−d,t , then the continuously
compounded return or logarithmic return is
$$
r_L(t-d, t) = \ln (1 + R_{t-d,t}) = L_t - L_{t-d} \tag{3.3.6}
$$

where Lt = ln St. Note, on a daily basis we get Rt = Rt−1,t and rL(t) = ln(1 + Rt). The change in log price is the
yield or return, with continuous compounding, from holding the security from trading day t − 1 to trading day t. As a
result, the price becomes
$$
S_t = S_{t-1} e^{r_L(t)}
$$
Further, the return rL(t) has the property that the log return between the price at time t1 and at time tn is given by the
sum of the rL(t) between t1 and tn
$$
\log \frac{S_n}{S_1} = \sum_{i=2}^{n} r_L(t_i)
$$
which implies that
$$
S_n = S_1 e^{\sum_{i=2}^{n} r_L(t_i)}
$$

so that if the rL(t) are independent random variables with finite mean and variance, the central limit theorem implies
that for very large n, the sum in the above equation is approximately normally distributed. Hence, we would get an approximately log-normal
distribution for Sn given S1. In addition, the variability of simple price changes for a given security is an increasing
function of the price level of the security, whereas this is not necessarily the case with the change in log price. Given
Equation (3.3.5), then Equation (3.3.6) becomes
rL (t − d, t) = rL (t) + rL (t − 1) + rL (t − 2) + ... + rL (t − d + 1)

(3.3.7)

so that the continuously compounded multiperiod return is simply the sum of continuously compounded one-period
returns. Note, statistical properties of log returns are more tractable. Moreover, in the cross section approach aggregation is done across the individual returns.
Remark 3.3.1 That is, simple returns Rt−d,t are additive across assets but not over time (see Equation (3.3.5)),
whereas continuously compounded returns rL(t − d, t) are additive over time but not across assets (see Equation
(3.3.7)).
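The time-aggregation part of the remark can be verified on a short, made-up price series. The prices below are illustrative only.

```python
import numpy as np

prices = np.array([100.0, 102.0, 99.0, 105.0, 104.0])  # illustrative price path

simple = prices[1:] / prices[:-1] - 1.0   # one-period simple returns
log_ret = np.diff(np.log(prices))         # one-period log returns

total_simple = prices[-1] / prices[0] - 1.0
total_log = np.log(prices[-1] / prices[0])

# Log returns add up over time; simple returns compound instead.
print("sum of log returns:       ", log_ret.sum(), "vs", total_log)
print("compounded simple returns:", np.prod(1.0 + simple) - 1.0, "vs", total_simple)
print("sum of simple returns:    ", simple.sum(), "(does not match)")
```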


3.3.1.2

The percent returns versus the logarithm returns

In the financial industry, most measures of returns and indices use change returns Rt = Rt−1,t defined as $\frac{S_t - S_{t-1}}{S_{t-1}}$
where St is the price of a series at time t. However, some investors and researchers prefer to use returns based
on logarithms of prices, $r_t = \log \frac{S_t}{S_{t-1}}$, or compound returns. As discussed in Section (3.3.1.1), continuous time
generalisations of discrete time results are easier, and returns over more than one day are simple functions of a single
day return. In order to compare change and compound returns, Longerstaey et al. [1995a] compared kernel estimates
of the probability density function for both returns. As opposed to the histogram of the data, this approach spreads
the frequency represented by each observation along the horizontal axis according to a chosen distribution function,
called the kernel, and chosen to be the normal distribution. They also compared daily volatility estimates for both
types of returns based on an exponential weighting scheme. They concluded that the volatility forecasts were very
similar. They used that methodology to compute the volatility for change returns and then replaced the change returns
with logarithm returns. The same analysis was repeated on correlation by changing the inputs from change returns
to logarithm returns. They also used monthly time series and found little difference between the two volatility and
correlation series. Note, while the one month volatility and correlation estimators based on change and logarithm
returns do not coincide, the difference between their point estimates is negligible.

3.3.1.3

Portfolio returns

We consider a portfolio consisting of N instruments, and let $r_L^i(t)$ and $R_i(t)$ for i = 1, 2, .., N be respectively the
continuously compounded and percent returns. We assign weights ωi to the ith instrument in the portfolio together
with a condition of no short sales $\sum_{i=1}^{N} \omega_i = 1$ (ωi is the percentage of the portfolio's value invested in that asset). We
let P0 be the initial value of the portfolio, and P1 be the price after one period, then by using discrete compounding,
we derive the usual expression for a portfolio return as
$$
P_1 = \omega_1 P_0 (1 + R_1) + \omega_2 P_0 (1 + R_2) + \ldots + \omega_N P_0 (1 + R_N) = \sum_{i=1}^{N} \omega_i P_0 (1 + R_i)
$$
We let $R_p(1) = \frac{P_1 - P_0}{P_0}$ be the return of the portfolio for the first period and replace P1 with its value. We repeat the
process at periods t = 2, 3, ... to get the portfolio return at time t as
$$
R_p(t) = \omega_1 R_1(t) + \omega_2 R_2(t) + \ldots + \omega_N R_N(t) = \sum_{i=1}^{N} \omega_i R_i(t)
$$

Hence, the simple net return of a portfolio consisting of N assets is a weighted average of the simple net returns of
the assets involved. However, the continuously compounded returns of a portfolio do not have the above convenient
property. The portfolio of returns satisfies
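$$
P_1 = \omega_1 P_0 e^{r_1} + \omega_2 P_0 e^{r_2} + \ldots + \omega_N P_0 e^{r_N}
$$
and setting $r_p = \log \frac{P_1}{P_0}$, we get
$$
r_p = \log \left( \omega_1 e^{r_1} + \omega_2 e^{r_2} + \ldots + \omega_N e^{r_N} \right)
$$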



Nonetheless, RiskMetrics (see Longerstaey et al. [1996]) uses logarithmic returns as the basis in all computations, together with
the assumption that simple returns Ri are all small in magnitude. As a result, the portfolio return becomes a weighted
average of logarithmic returns
$$
r_p(t) \approx \sum_{i=1}^{N} \omega_i r_L^i(t)
$$
since $\log(1 + x) \approx x$ for $|x| \ll 1$.
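The size of the approximation error can be gauged on a small example. The weights, returns and initial portfolio value below are made up for illustration. The sketch compares the exact portfolio log return with the weighted average of the individual log returns.

```python
import numpy as np

w = np.array([0.5, 0.3, 0.2])           # portfolio weights, summing to one
R = np.array([0.010, -0.004, 0.007])    # one-period simple returns of the assets
P0 = 1_000_000.0                        # initial portfolio value

# Simple returns aggregate exactly across assets.
P1 = np.sum(w * P0 * (1.0 + R))
R_p = P1 / P0 - 1.0

# Log returns do not: the exact value is the log of a weighted sum of exponentials.
r = np.log1p(R)
r_p_exact = np.log(np.sum(w * np.exp(r)))
r_p_approx = np.sum(w * r)

print("portfolio simple return:      ", R_p)
print("exact portfolio log return:   ", r_p_exact)
print("weighted average log return:  ", r_p_approx)
print("approximation error:          ", r_p_exact - r_p_approx)
```

For daily returns of this magnitude the error is of the order of a few thousandths of a percent. It grows with the size and dispersion of the individual returns.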

3.3.1.4

Modelling returns: The random walk

We are interested in characterising the future changes in the portfolio of returns described in Section (3.3.1.3), by
forecasting each component of the portfolio using only past changes of market prices. To do so, we need to model
1. the temporal dynamics of returns
2. the distribution of returns at any point in time
Traditionally, to get tractable statistical properties of asset returns, financial markets assume that simple returns
{Rit |t = 1, .., T } are independently and identically distributed as normal with fixed mean and variance. However,
while the lower bound of a simple return is −1, normal distribution may assume any value in the real line (no lower
bound). Further, assuming that Rit is normally distributed, then the multi-period simple return Rit (k) is not normally
distributed. Finally, the normality assumption is not supported by many empirical asset returns which tend to have
positive excess kurtosis. Still, the random walk is one of the most widely used classes of models to characterise the development of price returns. In order to guarantee non-negativity of prices, we model the log price Lt as a random walk with
independent and identically distributed (iid) normally distributed changes with mean µ and variance σ². It is given by
$$
L_t = \mu + L_{t-1} + \sigma \epsilon_t , \quad \epsilon_t \sim iid\, N(0, 1)
$$
The use of log prices implies that the model has continuously compounded returns, that is, $r_t = L_t - L_{t-1} = \mu + \sigma \epsilon_t$
with mean and variance
$$
E[r_t] = \mu , \quad Var(r_t) = \sigma^2
$$
Hence, an expression for prices can be derived as
St = St−1 eµ+σt
and St follows the lognormal distribution. Hence, the mean and variance of simple returns become
1

2

2

2

E[Rt ] = eµ+ 2 σ − 1 , V ar(Rt ) = e2µ+σ (eσ − 1)
which can be used in forecasting asset returns. There is no lower bound for rt , and the lower bound for Rt is satisfied
using 1 + Rt = ert . Assuming that logarithmic price changes are i.i.d. implies that
• at each point in time t the log price changes are identically distributed with mean µ and variance σ 2 implying
homoskedasticity (unchanging prices over time).
• log price changes are statistically independent of each other over time (the values of returns sampled at different
points are completely unrelated).
However, the lognormal assumption is not consistent with all the properties of historical stock returns. The above
models assume a constant variance in price changes, which in practice is flawed in most financial time series data. We
can relax this assumption to let the variance vary with time in the modified model
Lt = µ + Lt−1 + σt t , t ∼ N (0, 1)
These random walk models imply certain movement of financial prices over time.
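As an illustration of the constant-variance random walk, the following sketch simulates the log price, recovers log and simple returns, and compares the sample moments of the simple returns with the lognormal expressions for $E[R_t]$ and $Var(R_t)$. The drift, volatility, seed and function name are hypothetical choices made for this example only.

```python
import math
import random
import statistics

def simulate_log_prices(n_steps, mu, sigma, l0=0.0, seed=0):
    """Simulate L_t = mu + L_{t-1} + sigma * eps_t with eps_t ~ iid N(0, 1)."""
    rng = random.Random(seed)
    path = [l0]
    for _ in range(n_steps):
        path.append(mu + path[-1] + sigma * rng.gauss(0.0, 1.0))
    return path

mu, sigma = 0.0005, 0.01   # hypothetical daily drift and volatility
log_prices = simulate_log_prices(100_000, mu, sigma)

# Log returns r_t = L_t - L_{t-1} and simple returns R_t = exp(r_t) - 1.
log_returns = [b - a for a, b in zip(log_prices, log_prices[1:])]
simple_returns = [math.exp(r) - 1.0 for r in log_returns]

# Moments of the simple return implied by the lognormal model.
mean_theory = math.exp(mu + 0.5 * sigma ** 2) - 1.0
var_theory = math.exp(2.0 * mu + sigma ** 2) * (math.exp(sigma ** 2) - 1.0)

mean_sample = statistics.fmean(simple_returns)
var_sample = statistics.pvariance(simple_returns)
print(mean_theory, mean_sample, var_theory, var_sample)
```

By construction every simulated simple return stays above $-1$, which is the property the log price specification is meant to guarantee.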


3.3.2 The properties of returns

3.3.2.1 The distribution of returns

As explained by Tsay [2002] when studying the distributional properties of asset returns, the objective is to understand
the behaviour of the returns across assets and over time, in order to characterise the portfolio of returns described in
Section (3.3.1.3). We consider a collection of N assets held for T time periods t = 1, .., T . The most general model
for the log returns {rit ; i = 1, .., N ; t = 1, .., T } is its joint distribution function
$$F_r\left(r_{11}, ..., r_{N1}; r_{12}, ..., r_{N2}; ...; r_{1T}, ..., r_{NT}; Y; \theta\right) \tag{3.3.8}$$

where Y is a state vector consisting of variables that summarise the environment in which asset returns are determined
and θ is a vector of parameters that uniquely determine the distribution function Fr (.). The probability distribution
Fr (.) governs the stochastic behavior of the returns rit and the state vector Y . In general Y is treated as given and
the main concern is defining the conditional distribution of {rit } given Y . Empirical analysis of asset returns is then
to estimate the unknown parameter θ and to draw statistical inference about behavior of {rit } given some past log
returns. Consequently, Equation (3.3.8) provides a general framework with respect to which an econometric model for
asset returns rit can be put in a proper perspective. For instance, financial theories such as the Capital Asset Pricing
Model (CAPM) of Sharpe focus on the joint distribution of N returns at a single time index t, that is, {r1t , ..., rN t },
while other theories emphasise the dynamic structure of individual asset returns, that is, $\{r_{i1}, ..., r_{iT}\}$ for a given asset $i$.
When dealing with the joint distribution of $\{r_{it}\}_{t=1}^{T}$ for asset $i$, it is useful to partition the joint distribution as
$$\begin{aligned} F_r(r_{i1}, ..., r_{iT}; \theta) &= F(r_{i1}) F(r_{i2} | r_{i1}) ... F(r_{iT} | r_{i,T-1}, ..., r_{i,1}) \\ &= F(r_{i1}) \prod_{t=2}^{T} F(r_{it} | r_{i,t-1}, ..., r_{i,1}) \end{aligned} \tag{3.3.9}$$
highlighting the temporal dependencies of the log return. As a result, one is left to specify the conditional distribution $F(r_{it} | r_{i,t-1}, ..., r_{i,1})$ and the way it evolves over time. Different distributional specifications lead to different theories. In one version of the random walk, the hypothesis is that the conditional distribution is equal to the marginal distribution $F(r_{it})$ so that returns are temporally independent and, hence, not predictable. In general, asset returns are assumed to be continuous random variables so that one needs to know their probability density functions to perform some analysis. Using the relation among joint, marginal, and conditional distributions we can write the partition as
$$f_r(r_{i1}, ..., r_{iT}; \theta) = f(r_{i1}; \theta) \prod_{t=2}^{T} f(r_{it} | r_{i,t-1}, ..., r_{i,1}; \theta)$$

In general, it is easier to estimate marginal distributions than conditional distributions using past returns. Several
statistical distributions have been proposed in the literature for the marginal distributions of asset returns (see Tsay
[2002]). A traditional assumption made in financial studies is that the simple returns $\{R_{it}\}_{t=1}^{T}$ are independently and identically distributed as normal with fixed mean and variance. However, the lower bound of a simple return is $-1$, whereas the normal distribution may assume any value on the real line and has no lower bound. Further, the normality assumption is not supported by many empirical asset returns, which tend to have positive excess kurtosis. To overcome the first problem, a common assumption is that the log returns $r_t$ of an asset are independent and identically distributed (iid) as normal with mean $\mu$ and variance $\sigma^2$.
The multivariate analyses are concerned with the joint distribution of $\{r_t\}_{t=1}^{T}$ where $r_t = (r_{1t}, .., r_{Nt})^\top$ is the vector of log returns of $N$ assets at time $t$. This joint distribution can be partitioned in the same way as above so that the analysis focusses on the specification of the conditional distribution function
$$F(r_t | r_{t-1}, .., r_1; \theta)$$
in particular, how the conditional expectation and conditional covariance matrix of $r_t$ evolve over time. The mean vector and covariance matrix of a random vector $X = (X_1, .., X_p)$ are defined as
$$\begin{aligned} E[X] &= \mu_X = (E[X_1], .., E[X_p])^\top \\ Cov(X) &= \Sigma_X = E[(X - \mu_X)(X - \mu_X)^\top] \end{aligned}$$
provided that the expectations involved exist. When the data $\{x_1, .., x_T\}$ of $X$ are available, the sample mean and covariance matrix are defined as
$$\hat{\mu}_X = \frac{1}{T} \sum_{t=1}^{T} x_t , \quad \hat{\Sigma}_X = \frac{1}{T} \sum_{t=1}^{T} (x_t - \hat{\mu}_X)(x_t - \hat{\mu}_X)^\top$$
These sample statistics are consistent estimates of their theoretical counterparts provided that the covariance matrix of $X$ exists. In the finance literature, the multivariate normal distribution is often used for the log return $r_t$.
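A minimal sketch of the sample mean vector and covariance matrix, using the $\frac{1}{T}$ normalisation of the definitions above, is given below. The two-asset data set and the function names are hypothetical and serve only to make the arithmetic concrete.

```python
def sample_mean(data):
    """Column-wise mean of T observations, each a list of p values."""
    t = len(data)
    return [sum(row[k] for row in data) / t for k in range(len(data[0]))]

def sample_covariance(data):
    """Covariance matrix with the 1/T normalisation (not 1/(T-1))."""
    t, p = len(data), len(data[0])
    m = sample_mean(data)
    return [[sum((row[a] - m[a]) * (row[b] - m[b]) for row in data) / t
             for b in range(p)]
            for a in range(p)]

# Hypothetical log returns of two assets over four periods.
returns = [[0.01, 0.02], [-0.02, -0.01], [0.03, 0.01], [0.00, 0.00]]
mu_hat = sample_mean(returns)
sigma_hat = sample_covariance(returns)
print(mu_hat)
print(sigma_hat)
```

Many libraries default to the unbiased $\frac{1}{T-1}$ normalisation, so results may differ slightly from this sketch for small $T$.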
3.3.2.2 The likelihood function

One can use the partition in Equation (3.3.9) to obtain the likelihood function of the log returns $\{r_1, ..., r_T\}$ for the $i$th asset. Assuming that the conditional distribution $f(r_t | r_{t-1}, ..., r_1; \Theta)$ is normal with mean $\mu_t$ and variance $\sigma_t^2$, then $\Theta$ consists of the parameters $\mu_t$ and $\sigma_t^2$ and the likelihood function of the data is
$$f(r_1, ..., r_T; \Theta) = f(r_1; \Theta) \prod_{t=2}^{T} \frac{1}{\sqrt{2\pi}\sigma_t} e^{-\frac{(r_t - \mu_t)^2}{2\sigma_t^2}}$$
where $f(r_1; \Theta)$ is the marginal density function of the first observation $r_1$. The value $\Theta^*$ maximising this likelihood function is the maximum likelihood estimate (MLE) of $\Theta$. The log function being monotone, the MLE can be obtained by maximising the log likelihood function
$$\ln f(r_1, ..., r_T; \Theta) = \ln f(r_1; \Theta) - \frac{1}{2} \sum_{t=2}^{T} \left[ \ln 2\pi + \ln(\sigma_t^2) + \frac{(r_t - \mu_t)^2}{\sigma_t^2} \right] \tag{3.3.10}$$

Note, even if the conditional distribution f (rt |rt−1 , ..., r1 ; Θ) is not normal, one can still compute the log likelihood
function of the data.
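The conditional part of the Gaussian log likelihood can be evaluated directly. The sketch below is a simplified illustration under stated assumptions: the term $\ln f(r_1; \Theta)$ of the first observation is left out, the mean and variance are taken to be constant over time, and the return values are hypothetical. In that special case the maximiser is the sample mean and the $\frac{1}{n}$ sample variance, which the example checks by perturbing the parameters.

```python
import math

def conditional_log_likelihood(returns, means, variances):
    """Sum over t = 2..T of the normal log density of r_t given the past.

    The three lists are aligned for t = 2..T; the term ln f(r_1) of the
    first observation is deliberately left out of this sketch.
    """
    total = 0.0
    for r, m, v in zip(returns, means, variances):
        total += math.log(2.0 * math.pi) + math.log(v) + (r - m) ** 2 / v
    return -0.5 * total

# Hypothetical returns r_2, ..., r_T with constant mean and variance.
r = [0.012, -0.008, 0.004, 0.015, -0.011, 0.002]
n = len(r)
mu_mle = sum(r) / n
var_mle = sum((x - mu_mle) ** 2 for x in r) / n

ll_at_mle = conditional_log_likelihood(r, [mu_mle] * n, [var_mle] * n)
ll_shifted = conditional_log_likelihood(r, [mu_mle + 0.005] * n, [var_mle] * n)
ll_scaled = conditional_log_likelihood(r, [mu_mle] * n, [2.0 * var_mle] * n)
print(ll_at_mle, ll_shifted, ll_scaled)
```

With time-varying $\mu_t$ and $\sigma_t^2$ (for instance from a conditional heteroscedastic model) the same function applies, but the maximisation then requires a numerical optimiser.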

3.3.3 Testing the series against trend

We have made the assumption of independently distributed returns in Section (3.3.1.4) which is at the heart of the
efficient market hypothesis (EMH), but we saw in Section (1.7.6) that the technical community totally rejected the
idea of purely random prices. In fact, portfolio returns are significantly autocorrelated leading to contrarian and
momentum strategies. One must therefore test time series for trends. When testing against trend we are testing the
hypothesis that the members of a sequence of random variables x1 , .., xn are distributed independently of each other,
each with the same distribution. Following the definition of trend given by Mann [1945], a sequence of random
variables x1 , .., xn is said to have a downward trend if the variables are independently distributed so that xi has the
cumulative distribution fi and fi (x) < fj (x) for every i < j and every x. Similarly, an upward trend is defined with
fi (x) > fj (x) for every i < j.
Since all statistical tests involve the type I error (rejecting the null hypothesis when it is true), and the type II
error (not rejecting the null hypothesis when it is false), it is important to consider the power of a test, defined as one
minus the probability of type II error. A powerful test will reject a false hypothesis with a high probability. Studying


the existence of trends in hydrological time series, Onoz et al. [2002] compared the power of parametric and non-parametric tests for trend detection for various probability distributions, with the power estimated by Monte-Carlo simulation. The parametric test considers the linear regression of the random variable $Y$ on time $X$ with the regression coefficient $b$ (or the Pearson correlation coefficient $r$) computed from the data. The statistic
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{b}{s / \sqrt{S_{qx}}}$$
follows the Student's t distribution with $(n - 2)$ degrees of freedom, where $n$ is the sample size, $s$ is the standard deviation of the residuals, and $S_{qx}$ is the sum of squares of the independent variable (time in trend analysis). For
non-parametric tests, Yue et al. [2002] showed that the Spearman’s rho test provided results almost identical to those
obtained with the Mann-Kendall test. Hence, we consider only the non-parametric Mann-Kendall test for analysing
trends. Kendall [1938] introduced the T-test for testing the independence in a bivariate distribution by counting the
number of inequalities xi < xj for i < j and computing the distribution of T via a recursive equation. Mann [1945]
introduced lower and upper bounds for the power of the T-test. He proposed a trend detection by considering the statistic
$$S_M^n(t) = \sum_{i=0}^{n-2} \sum_{j=i+1}^{n-1} \text{sign}(y_{t-i} - y_{t-j})$$
where each pair of observed values $(y_i, y_j)$ for $i > j$ of the random variable is inspected to find out whether $y_i > y_j$ or $y_i < y_j$. If $P(t)$ is the number of the former type of pairs, and $M(t)$ is the number of the latter type of pairs, the statistic becomes $S_M^n(t) = P(t) - M(t)$. The variance of the statistic is
$$Var(S_M^n(t)) = \frac{1}{18} n(n-1)(2n+5)$$
Since the double sum runs over $\frac{1}{2}n(n-1)$ pairs, the statistic is bounded by
$$-\frac{1}{2} n(n-1) \leq S_M^n(t) \leq \frac{1}{2} n(n-1)$$
The bounds are reached when $y_t < y_{t-i}$ (negative trend) or $y_t > y_{t-i}$ (positive trend) for $i \in \mathbb{N}^*$. Hence, we obtain the normalised score
$$\overline{S}_M^n(t) = \frac{2}{n(n-1)} S_M^n(t)$$
where $\overline{S}_M^n(t)$ takes the value 1 (or $-1$) if we have a perfect positive (or negative) trend. In absence of trend we get $\overline{S}_M^n(t) \approx 0$. Letting that statement be the null hypothesis (no trend), we get
$$Z^n(t) \to N(0, 1) \text{ as } n \to \infty$$
with
$$Z^n(t) = \frac{S_M^n(t)}{\sqrt{Var(S_M^n(t))}}$$
The null hypothesis that there is no trend is rejected when $Z^n(t)$ is greater than $z_{\alpha/2}$ in absolute value. Note, parametric tests assume that the random variable is normally distributed and homoscedastic (homogeneous variance), while non-parametric tests make no assumption on the probability distribution. The t-test for trend detection is based on linear regression and thus checks only for a linear trend. There is no such restriction for the Mann-Kendall (MK) test. Further, the MK test is expected to be less affected by outliers as its statistic is based on the sign of the differences, and not directly on the values of the random variable. Plotting the ratio of the power of the t-test to that of the Mann-Kendall test as a function of the slope of the trend for a large number of simulated time series, Yue et al. [2002] showed that the power of the Mann-Kendall trend test was dependent on the distribution type, and was increasing with the coefficient of skewness. Onoz et al. [2002] repeated the experiment on various distributions obtaining a ratio slightly above one for the normal distribution, while for all other (non-normal) distributions the ratio was significantly less than one. For skewed distributions, the Mann-Kendall test was more powerful, especially for a high coefficient of skewness.
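The Mann-Kendall procedure described in this section can be sketched in a few lines. The code below is an illustration rather than a reference implementation: it assumes a series ordered from oldest to newest, ignores ties in the variance, applies no continuity correction, normalises the score by the number of pairs $\frac{1}{2}n(n-1)$, and uses the two-sided 5% normal quantile 1.96 as a default. The function names are hypothetical.

```python
import math

def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def mann_kendall(y):
    """Return (S, normalised score, Z) for a series ordered oldest to newest."""
    n = len(y)
    # S sums sign(later - earlier) over all n(n-1)/2 pairs of observations.
    s = sum(sign(y[j] - y[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0   # variance under no trend, no ties
    s_bar = 2.0 * s / (n * (n - 1))            # +1 or -1 for a monotone series
    z = s / math.sqrt(var_s)
    return s, s_bar, z

def has_trend(y, z_crit=1.96):
    """Two-sided test: reject 'no trend' when |Z| exceeds the normal quantile."""
    return abs(mann_kendall(y)[2]) > z_crit

print(mann_kendall(list(range(10))))
print(mann_kendall([3, 1, 2]))
```

Because only the signs of the differences enter the statistic, replacing one observation by an extreme outlier changes at most $n - 1$ of the terms, which is the robustness property noted in the text.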

3.3.4 Testing the assumption of normally distributed returns

We saw in Section (2.3.1.1) that the mean-variance efficient portfolios introduced by Markowitz [1952] require some
fairly restrictive assumptions on the class of return distribution, such as the assumption of normally distributed returns.
Further, the Sharpe type metrics for performance measures described in Section (??) depend on returns of individual assets that are jointly normally distributed. Hence, one must be able to assess the suitability of the normal assumption, and to quantify the deviations from normality.
3.3.4.1 Testing for the fitness of the Normal distribution

While changes in financial asset prices are known to be non-normally distributed, practitioners still assume that they
are normally distributed because they can make predictions on their conditional mean and variance. There has been
much discussion about the usefulness of the underlying assumption of normality for return series. One way forward is to compare directly the predictions made by the Normal model to what we observe. In general, practitioners use simple heuristics to assess model performances such as measuring
• the difference between observed and predicted frequencies of observations in the tail of the normal distribution.
• the difference between observed and predicted values of these tail observations.
In the case of univariate tail probabilities, we let $R_t = R_{t-1,t} = \frac{S_t - S_{t-1}}{S_{t-1}}$ be the percent return at time $t$ and $\sigma_t = \sigma_{t|t-1}$ be the one day forecast standard deviation of $R_t$. The theoretical tail probabilities corresponding to the lower and upper tail areas are given by
$$P(R_t < -1.65\sigma_t) = 5\% \text{ and } P(R_t > 1.65\sigma_t) = 5\%$$
Letting $T$ be the total number of returns observed over a given sample period, the observed tail probabilities are given by
$$\frac{1}{T} \sum_{t=1}^{T} I_{\{R_t < -1.65\hat{\sigma}_t\}} \text{ and } \frac{1}{T} \sum_{t=1}^{T} I_{\{R_t > 1.65\hat{\sigma}_t\}}$$

where σ̂t is the estimated standard deviation using a particular model. Having estimated the tail probabilities, we
are now interested in the value of the observations falling in the tail area, called tail points. We need to check the
predictable quality of the Normal model by comparing the observed tail points and the predicted values. In the case
of the lower tail, we first record the value of the observations Rt < −1.65σ̂t and then find the average value of these
returns. We want to derive forecasts of these tail points called the predicted values. Since we assumed normally
distributed returns, the best guess of any return is simply its expected value. Therefore, the predicted value for an
observation falling in the lower tail at time t is
$$E[R_t | R_t < -1.65\sigma_t] = -\sigma_t \lambda(-1.65) \text{ and } \lambda(x) = \frac{\phi(x)}{N(x)}$$

where $\phi(.)$ is the standard normal density, and $N(.)$ is the standard normal cumulative distribution function. The same heuristic test must be performed on correlation. Assuming a portfolio made of two underlyings $R_1$ and $R_2$, its daily earnings at risk (DEaR) is given by
$$DEaR(R_1, R_2) = \sqrt{V^\top C V}$$
where $V = (DEaR(R_1), DEaR(R_2))^\top$ is a column vector, and $C$ is the $2 \times 2$ correlation matrix. In the bivariate case, we want to analyse the probabilities associated with the joint distribution of the two return series to assess the performance of the model. We consider the event $P\left(\frac{R_1(t)}{\sigma_1(t)} < 0 \text{ and } \frac{R_2(t)}{\sigma_2(t)} < -1.65\right)$ where the choice for $R_1(t)$ being less than zero is arbitrary. The observed values are given by
$$\frac{1}{T} \sum_{t=1}^{T} I_{\left\{\frac{R_1(t)}{\sigma_1(t)} < 0 \text{ and } \frac{R_2(t)}{\sigma_2(t)} < -1.65\right\}} \times 100$$
and the predicted probability is obtained by integrating over the bivariate density function
$$B(0, -1.65, \rho) = \int_{-\infty}^{0} \int_{-\infty}^{-1.65} \phi(x_1, x_2, \rho) \, dx_2 \, dx_1$$
where $\phi(x_1, x_2, \rho)$ is the standard normal bivariate density function, and $\rho$ is the correlation between $R_1$ and $R_2$. For any pair of returns, we are now interested in the value of one return when the other is a tail point. The observed value for return $R_1$ is the average of the $R_1(t)$ when $R_2(t) < -1.65\sigma_2(t)$. Hence, we first record the value of the observations $R_1(t)$ corresponding to $R_2(t) < -1.65\sigma_2(t)$ and then find the average value of these returns. Based on the assumption of normality for the returns we can derive the forecasts of these tail points, called the predicted values. Again, the best guess is the expected value of $R_1(t) | R_2(t) < -1.65\sigma_2(t)$ given by
$$E[R_1(t) | R_2(t) < -1.65\sigma_2(t)] = -\sigma_1(t) \rho \lambda(-1.65)$$
In the case of individual returns, assuming an exponentially weighted moving average (EWMA) with decay factor 0.94 for the estimated volatility, Longerstaey et al. [1995a] concluded that the observed tail frequencies and tail points match the predictions of the Normal model quite well. Similarly, in the bivariate case (except for money market rates), the Normal model's predictions of frequencies and tail points coincided with the observed ones.
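The univariate heuristic can be sketched as follows. This is an illustrative example, not the RiskMetrics implementation: it assumes zero-mean returns, seeds the EWMA recursion with the first squared return, and runs on simulated i.i.d. normal returns rather than market data. All function names are hypothetical.

```python
import math
import random

def ewma_volatility(returns, decay=0.94):
    """One-step-ahead EWMA volatility forecasts, assuming zero-mean returns."""
    var = returns[0] ** 2                 # crude seed for the recursion (an assumption)
    forecasts = []
    for r in returns[1:]:
        forecasts.append(math.sqrt(var))  # forecast made before observing r
        var = decay * var + (1.0 - decay) * r ** 2
    return forecasts

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lower_tail_check(returns, decay=0.94, k=1.65):
    """Observed tail frequency, mean observed tail point, mean predicted tail point."""
    sigmas = ewma_volatility(returns, decay)
    pairs = list(zip(returns[1:], sigmas))
    tail = [(r, s) for r, s in pairs if r < -k * s]
    freq = len(tail) / len(pairs)
    lam = normal_pdf(-k) / normal_cdf(-k)      # lambda(-1.65) in the text
    observed = sum(r for r, _ in tail) / len(tail)
    predicted = sum(-s * lam for _, s in tail) / len(tail)
    return freq, observed, predicted

rng = random.Random(1)
simulated = [rng.gauss(0.0, 0.01) for _ in range(20_000)]   # hypothetical returns
freq, observed, predicted = lower_tail_check(simulated)
print(freq, observed, predicted)
```

On real return series the same comparison shows how far the observed tail frequency and the average tail point drift from the 5% and $-\sigma_t \lambda(-1.65)$ predicted under normality.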
3.3.4.2 Quantifying deviations from a Normal distribution

For more than a century, the problem of testing whether a sample is from a normal population has attracted the
attention of leading figures in statistics. The absence of exact solutions for the sampling distributions generated a
large number of simulation studies exploring the power of these statistics as both directional and omnibus tests (see
D’Agostino et al. [1973]). A wide variety of tests are available for testing goodness of fit to the normal distribution.
If the data is grouped into bins, with several counts in each bin, Pearson's chi-square test for goodness of fit may be applied. In order for the limiting distribution to be chi-square, the parameters must be estimated from the grouped data. On the other hand, departures from normality often take the form of asymmetry, or skewness. It follows that mean and variance are no longer sufficient to give a complete description of the returns distribution. In fact, skewness and kurtosis (based on the third and fourth central moments) have to be taken into account to describe a stock (index) returns' probability distribution more fully. To check whether the skewness and kurtosis statistics of an asset can still be regarded as normal, the Jarque-Bera statistic (see Jarque et al. [1987]) can be applied. We are going to briefly describe the Omnibus test for normality (two sided) where the information in $b_1$ and $b_2$ is indicated for a general non-normal alternative. Note, when a specific alternative distribution is indicated, one can use a specific likelihood ratio test, increasing power. For instance, when information on the alternative distribution exists, a directional test using only $b_1$ and $b_2$ is preferable. However, the number of cases where such directional tests are available is limited for practical application. Let $X_1, ..., X_n$ be independent random variables with absolutely continuous distribution function $F$. We wish to test
$$H_0 : F(x) = N\left(\frac{x - \mu}{\sigma}\right) , \quad \forall x \in \mathbb{R}$$
versus the two sided alternative
$$H_1 : F(x) \neq N\left(\frac{x - \mu}{\sigma}\right) \text{ for at least one } x \in \mathbb{R}$$
where $N(.)$ is the cdf of the standard normal distribution and $\sigma$ ($\sigma > 0$) may be known or unknown. In practice, the null hypothesis of normality is usually specified in composite form where $\mu$ and $\sigma$ are unknown. When performing a hypothesis test, the p-value is found by using the distribution assumed for the test statistic under $H_0$. However, the accuracy of the p-value depends on how close the assumed distribution is to the true distribution of the test statistic under the null hypothesis.

Suppose that we want to test the null hypothesis that the returns $R_i(1), R_i(2), ..., R_i(n)$ for the $i$th asset are independent normally distributed random variables with the same mean and variance. A goodness-of-fit test can be based on the coefficient of skewness for the sample of size $n$
$$b_1 = \frac{\hat{m}_3^c}{(\hat{m}_2^c)^{\frac{3}{2}}} = \frac{\frac{1}{n} \sum_{j=1}^{n} (R_i(j) - \overline{R}_i)^3}{S^3}$$
where $m_2^c$ and $m_3^c$ are the theoretical second and third central moments, respectively, with sample estimates
$$\hat{m}_j^c = \frac{1}{n} \sum_{i=1}^{n} (X_i - \overline{X})^j , \quad j = 2, 3, 4$$
The test rejects for large values of $|b_1|$.
Remark 3.3.2 In some articles, the notation is slightly different, with $b_1$ replaced with $\sqrt{b_1} = \frac{\hat{m}_3^c}{(\hat{m}_2^c)^{\frac{3}{2}}}$.
Note, skewness is a non-dimensional quantity characterising only the shape of the distribution. For the idealised case of a normal distribution, the standard deviation of the skew coefficient $b_1$ is approximately $\sqrt{\frac{15}{n}}$. Hence, it is good practice to believe in skewness only when it is several or many times as large as this. Departure of returns from the mean may be detected by the coefficient of kurtosis for the sample
$$b_2 = \frac{\hat{m}_4^c}{(\hat{m}_2^c)^2} = \frac{\frac{1}{n} \sum_{j=1}^{n} (R_i(j) - \overline{R}_i)^4}{S^4}$$
The kurtosis is also a non-dimensional quantity. To test kurtosis we can compute
$$kurt = b_2 - 3$$
to recover the zero-value of a normal distribution. The standard deviation of $kurt$ as an estimator of the kurtosis of an underlying normal distribution is $\sqrt{\frac{96}{n}}$.
The estimates of skewness and kurtosis are used in the Jarque-Bera (JB) statistic (see Jarque et al. [1987]) to analyse time series of returns for the assumption of normality. It is a goodness-of-fit measure with an asymptotic $\chi^2$-distribution with two degrees of freedom (because JB is just the sum of squares of two asymptotically independent standardised normals)
$$JB \sim \chi_2^2 , \quad n \to \infty \text{ under } H_0$$
However, in general the $\chi^2$ approximation does not work well due to the slow convergence to the asymptotic results. Jarque et al. showed, with convincing evidence, that convergence of the sampling distributions to asymptotic results was very slow, especially for $b_2$. Nonetheless, the JB test can be used to test the null hypothesis that the data are from a normal distribution. That means that $H_0$ has to be rejected at level $\alpha$ if
$$JB \geq \chi_{1-\alpha,2}^2$$
At the 5% significance level the critical value is equal to 5.9915. The null hypothesis is a joint hypothesis of the skewness being zero and the excess kurtosis being zero, as samples from a normal distribution have an expected skewness of 0 and an expected excess kurtosis of 0. As the definition of JB shows, any deviation from this increases the JB statistic
$$JB = \frac{n}{6} \left( b_1^2 + \frac{(b_2 - 3)^2}{4} \right)$$
where $b_1$ is the coefficient of skewness and $b_2$ is the coefficient of kurtosis. Note, Urzua [1996] introduced a modification to the JB test by standardising the skewness $b_1$ and the kurtosis $b_2$ in the JB formula, getting

$$JBU = \frac{b_1^2}{v_S} + \frac{(b_2 - e_K)^2}{v_K}$$
with
$$v_S = \frac{6(n-2)}{(n+1)(n+3)} , \quad e_K = \frac{3(n-1)}{(n+1)} , \quad v_K = \frac{24n(n-2)(n-3)}{(n+1)^2(n+3)(n+5)}$$
Note, JB and JBU are asymptotically equivalent, that is, $H_0$ has to be rejected at level $\alpha$ if $JBU \geq \chi_{1-\alpha,2}^2$. Critical values of the JB test for various sample sizes $n$ with $\alpha = 0.05$ are
n = 50: JB = 5.0037 , n = 100: JB = 5.4479 , n = 200: JB = 5.7275 , n = 500: JB = 5.8246
See Thadewald et al. [2004] for tables. Testing for normality, they investigated the power of several tests by considering independent random variables (model I), and the residual in the classical linear regression (model II). The power comparison was carried out via Monte Carlo simulation with a model of contaminated normal distributions (mixture of normal distributions) with varying parameters $\mu$ and $\sigma$ as well as different proportions of contamination. They found that for the JB test, the approximation of critical values by the chi-square distribution did not work well. The test was superior in power to its competitors for symmetric distributions with medium up to long tails and for slightly skewed distributions with long tails. The power of the JB test was poor for distributions with short tails, especially those with a bimodal shape. Further, testing for normality is problematic in the case of autocorrelated error terms and in the case of heteroscedastic error terms.
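The two statistics can be computed from the sample central moments as in the sketch below. It is an illustration under stated assumptions: the function names are hypothetical, the asymptotic 5% critical value 5.9915 is used even though the text notes that it is inaccurate in small samples, and the fat-tailed sample is a simulated contaminated normal (90% $N(0,1)$ mixed with 10% $N(0,9)$) chosen only for demonstration.

```python
import random

def central_moment(x, j):
    """Sample central moment of order j with the 1/n normalisation."""
    m = sum(x) / len(x)
    return sum((v - m) ** j for v in x) / len(x)

def skewness_kurtosis(x):
    m2 = central_moment(x, 2)
    b1 = central_moment(x, 3) / m2 ** 1.5   # coefficient of skewness
    b2 = central_moment(x, 4) / m2 ** 2     # coefficient of kurtosis
    return b1, b2

def jarque_bera(x):
    n = len(x)
    b1, b2 = skewness_kurtosis(x)
    return n / 6.0 * (b1 ** 2 + (b2 - 3.0) ** 2 / 4.0)

def jarque_bera_urzua(x):
    """Urzua's modification, standardising b1 and b2 with exact small-sample moments."""
    n = len(x)
    b1, b2 = skewness_kurtosis(x)
    v_s = 6.0 * (n - 2) / ((n + 1) * (n + 3))
    e_k = 3.0 * (n - 1) / (n + 1)
    v_k = 24.0 * n * (n - 2) * (n - 3) / ((n + 1) ** 2 * (n + 3) * (n + 5))
    return b1 ** 2 / v_s + (b2 - e_k) ** 2 / v_k

CHI2_2_CRIT_5PCT = 5.9915   # asymptotic 5% critical value quoted in the text

rng = random.Random(7)
fat_tailed = [rng.gauss(0.0, 3.0 if rng.random() < 0.1 else 1.0) for _ in range(2000)]
jb_fat = jarque_bera(fat_tailed)
print(jb_fat, jb_fat >= CHI2_2_CRIT_5PCT)
```

For small samples, the simulated critical values listed above (for example 5.0037 for $n = 50$) would be a more appropriate threshold than the asymptotic one.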

3.3.5 The sample moments

We assume that the population is of size N and that associated with each member of the population is a numerical value
of interest denoted by x1 , x2 , .., xN . We take a sample with replacement of n values X1 , ..., Xn from the population,
where n < N and such that Xi is a random variable. That is, Xi is the value of the ith member of the sample, and xi
is that of the ith member of the population. The population moments and the sample moments are given in Appendix
(B.9.1).
3.3.5.1 The population mean and volatility

While volatility is a parameter measuring the risk associated with the returns of the underlying price, local volatility is a
parameter measuring the risk associated with the instantaneous variation of the underlying price. It can be deterministic
or stochastic. On the other hand, historical volatility is computed by using historical data on observed values of the
underlying price (opening, closing, highest value, lowest value etc...). In general, one uses standard estimators of
variance per unit of time of the logarithm of the underlying price which is assumed to follow a non-centred Brownian
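motion (see Section (3.3.1.4)). Considering observed values uniformly spaced in time with time difference $\delta$, the stock price at time $t + 1 = (j + 1)\delta$ is given by
$$S_{(j+1)\delta} = S_{j\delta} e^{\mu\delta - \sigma(W_{(j+1)\delta} - W_{j\delta})}$$
with
$$\ln \frac{S_{(j+1)\delta}}{S_{j\delta}} = \mu\delta - \sigma(W_{(j+1)\delta} - W_{j\delta})$$
Given $N$ periods of time, the first two sample moments are
$$\begin{aligned} \hat{\mu}_N &= \frac{1}{N} \sum_{j=0}^{N-1} \ln \frac{S_{(j+1)\delta}}{S_{j\delta}} \\ \hat{\sigma}_N^2 &= \frac{1}{N} \sum_{j=0}^{N-1} \left( \ln \frac{S_{(j+1)\delta}}{S_{j\delta}} - \hat{\mu}_N \right)^2 \end{aligned}$$
Alternatively, we can consider the return of the underlying price given by
$$R_{(j+1)\delta} = \frac{S_{(j+1)\delta} - S_{j\delta}}{S_{j\delta}} = e^{\mu\delta - \sigma(W_{(j+1)\delta} - W_{j\delta})} - 1$$
Considering the expansion $e^x \approx 1 + x$ for $|x| \ll 1$ and assuming $\mu\delta - \sigma(W_{(j+1)\delta} - W_{j\delta})$ to be small, we get
$$R_{(j+1)\delta} = \frac{S_{(j+1)\delta} - S_{j\delta}}{S_{j\delta}} \approx \mu\delta - \sigma(W_{(j+1)\delta} - W_{j\delta})$$
and we set
$$\begin{aligned} \tilde{\mu}_N &= \frac{1}{N} \sum_{j=0}^{N-1} \frac{S_{(j+1)\delta} - S_{j\delta}}{S_{j\delta}} \\ \tilde{\sigma}_N^2 &= \frac{1}{N} \sum_{j=0}^{N-1} \left( \frac{S_{(j+1)\delta} - S_{j\delta}}{S_{j\delta}} - \tilde{\mu}_N \right)^2 \end{aligned}$$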

In order to use statistical techniques on market data, one must make sure that the data is stationary, and test the
assumption of log-normality on the observed values of the underlying price. Note, the underlying prices rarely satisfy
the Black-Scholes assumption. However, we are not trying to compute option prices in the risk-neutral measure, but
we are estimating the moments of the stock returns under the historical measure.
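The two pairs of estimators can be compared on a short price series. The sketch below is illustrative: the closing prices are hypothetical, the function names are not from the text, and the $\frac{1}{N}$ normalisation of the definitions above is used.

```python
import math

def log_return_moments(prices):
    """Sample mean and variance (1/N normalisation) of log price changes."""
    r = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    n = len(r)
    mean = sum(r) / n
    var = sum((x - mean) ** 2 for x in r) / n
    return mean, var

def simple_return_moments(prices):
    """Same estimators applied to simple returns (S_next - S) / S."""
    r = [(b - a) / a for a, b in zip(prices, prices[1:])]
    n = len(r)
    mean = sum(r) / n
    var = sum((x - mean) ** 2 for x in r) / n
    return mean, var

prices = [100.0, 101.0, 100.5, 102.0, 101.2, 103.0]   # hypothetical closes
mu_hat, var_hat = log_return_moments(prices)
mu_tilde, var_tilde = simple_return_moments(prices)
print(mu_hat, var_hat)
print(mu_tilde, var_tilde)
```

The mean log return telescopes to $\frac{1}{N} \ln \frac{S_{N\delta}}{S_0}$, so it depends only on the first and last prices, whereas the mean simple return is slightly larger since $R \geq \ln(1 + R)$.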
3.3.5.2 The population skewness and kurtosis

As defined in Section (B.4), skewness is a statistical measure of the asymmetry of the probability distribution of a
random variable, in this case the return of a stock. Since a normal distribution is symmetrical, it exhibits exactly zero
skewness. The more asymmetric and thus unlike a normal distribution, the larger the figure gets in absolute terms.
Given $N$ periods of time, the sample skewness is
$$\hat{S}_N = \frac{1}{N \hat{\sigma}_N^3} \sum_{j=0}^{N-1} \left( \ln \frac{S_{(j+1)\delta}}{S_{j\delta}} - \hat{\mu}_N \right)^3$$
Kurtosis is a measure of the peakedness of the probability distribution of a random variable. As such, it discloses how concentrated a return distribution is around the mean. Higher kurtosis means more of the variance is due to infrequent extreme deviations (fat tails), while lower kurtosis implies a variance composed of frequent modestly-sized deviations. Normally distributed asset returns exhibit a kurtosis of 3. To measure the excess kurtosis with regard to a normal distribution, a value of 3 is hence subtracted from the kurtosis value. A distribution with positive excess kurtosis is called leptokurtic. We can calculate the sample kurtosis of a single asset class with
$$\hat{K}_N = \frac{1}{N \hat{\sigma}_N^4} \sum_{j=0}^{N-1} \left( \ln \frac{S_{(j+1)\delta}}{S_{j\delta}} - \hat{\mu}_N \right)^4$$
Under the normality assumption, skewness and excess kurtosis are distributed asymptotically as normal with mean equal to zero and variance being $\frac{6}{N}$ and $\frac{24}{N}$ respectively (see Snedecor et al. [1980]).
3.3.5.3 Annualisation of the first two moments

The sample estimates of the first two moments are often based on monthly, weekly or daily data, but all quantities
are usually quoted in annualised terms. Annualisation is often performed on the sample estimates under the assumption that the random variables (returns) are i.i.d. We let the annualised volatility $\sigma$ be the standard deviation of the instrument's yearly logarithmic returns. The generalised volatility $\sigma_T$ for time horizon $T$ in years is expressed as $\sigma_T = \sigma\sqrt{T}$. Therefore, if the daily logarithmic returns of a stock have a standard deviation of $\sigma_{SD}$ and the time period of returns is $P$ (or $\Delta t$), the annualised volatility is
$$\sigma = \frac{\sigma_{SD}}{\sqrt{P}}$$
A common assumption for daily returns is that $P = 1/252$ (there are 252 trading days in any given year). Then, if $\sigma_{SD} = 0.01$ the annualised volatility is
$$\sigma = \frac{0.01}{\sqrt{\frac{1}{252}}} = 0.01\sqrt{252}$$

More generally, we have
$$X \times \frac{F}{N}$$
where $X$ is the sum of all values referenced, $F$ is the base rate of return (time period frequency) with 12 monthly, 252 daily, 52 weekly, 4 quarterly, and $N$ is the total number of periods. For example, setting $P = \frac{1}{F}$ the annualised mean and variance become
$$\begin{aligned} \mu &= \frac{\mu_{SD}}{P} = \mu_{SD} \times 252 \\ \sigma^2 &= \frac{\sigma_{SD}^2}{P} = \sigma_{SD}^2 \times 252 \end{aligned}$$

These formulas to convert returns or volatility measures from one time period to another assume a particular underlying model or process. They are accurate extrapolations of a random walk, or Wiener process, whose steps have finite variance. However, if portfolio returns are autocorrelated, the standard deviation does not obey the square-root-of-time rule (see Section (10.1.1)). If $F$ is again the time period frequency (number of returns per year), then the annualised mean return is still $F$ times the mean return, but the standard deviation of returns should be calculated using the scaling factor
$$\sqrt{F + 2\frac{Q}{(1-Q)^2}\left[(F-1)(1-Q) - Q(1-Q^{F-1})\right]} \tag{3.3.11}$$
where $Q$ is the first order autocorrelation of the returns (see Alexander [2008]). If the autocorrelation of the returns is positive, then the scaling factor is greater than the square root of $F$. More generally, for natural stochastic processes, the precise relationship between volatility measures for different time periods is more complicated. Some researchers
use the Levy stability exponent $\alpha$ (linked to the Hurst exponent) to extrapolate natural processes
$$\sigma_T = T^{\frac{1}{\alpha}} \sigma$$
If $\alpha = 2$ we get the Wiener process scaling relation, but some people believe $\alpha < 2$ for financial activities such as stocks, indexes and so on. Mandelbrot [1967] considered a Levy alpha-stable distribution with $\alpha = 1.7$. Given our previous example with $P = 1/252$ we get
$$\sigma = 0.01 (252)^{\frac{1}{\alpha}}$$
We let $\alpha_W = 2$ for the Wiener process and $\alpha_M = 1.7$ for the Levy alpha-stable distribution. Since $\frac{1}{\alpha_M} > \frac{1}{\alpha_W}$, we can write $\hat{\alpha}_M = \frac{1}{\alpha_M} = \hat{\alpha}_W + \xi$, with $\hat{\alpha}_W = \frac{1}{\alpha_W}$, and we get $(252)^{\hat{\alpha}_M} = (252)^{\hat{\alpha}_W} (252)^{\xi}$ so that
$$\sigma_M = 0.01 (252)^{\hat{\alpha}_W} (252)^{\xi}$$
with $\xi \approx 0.09$ and $(252)^{\xi} \approx 1.64$. Hence, we get the Mandelbrot annualised volatility
$$\sigma_M = \sigma_{SD} \sqrt{252} (1 + \epsilon_S) = \sigma_W (1 + \epsilon_S)$$
where $\epsilon_S \geq 0$ is the adjusted volatility.
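The three annualisation rules of this section can be compared numerically. The sketch below is illustrative only: the function names are hypothetical, the daily standard deviation of 0.01 is the example value used above, and the autocorrelation of 0.1 is an arbitrary choice.

```python
import math

def annualise_iid(sd_per_period, periods_per_year=252):
    """Square-root-of-time rule, valid for i.i.d. returns."""
    return sd_per_period * math.sqrt(periods_per_year)

def autocorr_scaling_factor(q, periods_per_year=252):
    """Scaling factor of Equation (3.3.11) for first order autocorrelation q."""
    f = periods_per_year
    bracket = (f - 1) * (1.0 - q) - q * (1.0 - q ** (f - 1))
    return math.sqrt(f + 2.0 * q / (1.0 - q) ** 2 * bracket)

def annualise_levy(sd_per_period, alpha, periods_per_year=252):
    """Scaling with a stability exponent alpha: sigma = sd * F**(1/alpha)."""
    return sd_per_period * periods_per_year ** (1.0 / alpha)

sigma_wiener = annualise_iid(0.01)                     # 0.01 * sqrt(252)
sigma_autocorr = 0.01 * autocorr_scaling_factor(0.1)   # positive autocorrelation
sigma_mandelbrot = annualise_levy(0.01, alpha=1.7)
print(sigma_wiener, sigma_autocorr, sigma_mandelbrot)
print(sigma_mandelbrot / sigma_wiener)                 # the factor (252)**xi
```

With zero autocorrelation the factor of Equation (3.3.11) reduces to $\sqrt{F}$, and with $\alpha = 2$ the Levy scaling reduces to the square-root-of-time rule, so both generalisations nest the i.i.d. case.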

3.4 Introducing the volatility process

3.4.1 An overview of risk and volatility

3.4.1.1 The need to forecast volatility

When visualising financial time series, one can observe heteroskedasticity 3 , with periods of high volatility and periods
of low volatility, corresponding to periods of high and low risks, respectively. We also observe returns having very
high absolute value compared with their mean, suggesting fat tail distribution for returns, with large events having a
larger probability to appear when compared to returns drawn from a Gaussian distribution. Hence, besides the return
series introduced in Section (3.3.1), we must also consider the volatility process and the behaviour of extreme returns
of an asset (the large positive or negative returns). The negative extreme returns are important in risk management,
whereas positive extreme returns are critical to holding a short position. Volatility is important in risk management
as it provides a simple approach to calculating the value at risk (VaR) of a financial position. Further, modelling the
volatility of a time series can improve the efficiency in parameter estimation and the accuracy in interval forecast. As
returns may vary substantially over time and appear in clusters, the volatility process is concerned with the evolution of
conditional variance of the return over time. When using risk management models and measures of preference, users
must make sure that volatilities and correlations are predictable and that their forecasts incorporate the most useful
information available. As the forecasts are based on historical data, the estimators must be flexible enough to account
for changing market conditions. One simple approach is to assume that returns are governed by the random walk
model described in Section (3.3.1.4), and that the sample standard deviation $\hat{\sigma}_N$ or the sample variance $\hat{\sigma}_N^2$
of returns for $N$ periods of time can be used as a simple forecast of volatility of returns, $r_t$, over the future period $[t + 1, t + h]$
for some positive integer $h$. However, volatility has some specific characteristics such as
• volatility clusters: volatility may be high for certain time periods and low for other periods.
• continuity: volatility jumps are rare.
• mean-reversion: volatility does not diverge to infinity, it varies within some fixed range so that it is often stationary.
• volatility reacts differently to a big price increase or a big price drop.

³ Heteroskedastic means that a time series has a non-constant variance through time.

These properties play an important role in the development of volatility models. As a result, there is a large literature
on econometric models available for modelling the volatility of an asset return, called the conditional heteroscedastic
(CH) models. Some univariate volatility models include the autoregressive conditional heteroscedastic (ARCH) model
of Engle [1982], the generalised ARCH (GARCH) model of Bollerslev [1986], the exponential GARCH (EGARCH)
of Nelson [1991], the stochastic volatility (SV) models and many more. Tsay [2002] discussed the advantages and
weaknesses of each volatility model and showed some applications of the models. Following his approach, we will
describe some of these models in Section (5.6). Unfortunately, stock volatility is not directly observable from returns
as in the case of daily volatility where there is only one observation in a trading day. Even though one can use intraday
data to estimate daily volatility, accuracy is difficult to obtain. The unobservability of volatility makes it difficult to
evaluate the forecasting performance of CH models and heuristics must be developed to estimate volatility on small
samples.
3.4.1.2 A first decomposition

As risk is mainly given by the probability of large negative returns in the forthcoming period, risk evaluation is closely
related to time series forecasts. The desired quantity is a forecast for the probability density function (pdf) $\tilde{p}(r)$ of the
possible returns $r$ over the risk horizon $\Delta T$. This problem is generally decomposed into forecasts for the mean and
variance of the return probability distribution
$$r_{\Delta T} = \mu_{\Delta T} + a_{\Delta T}$$
with
$$a_{\Delta T} = \sigma_{\Delta T} \epsilon$$
where the return $r_{\Delta T}$ over the period $\Delta T$ is a random variable, $\mu_{\Delta T}$ is the forecast for the mean return, and $\sigma_{\Delta T}$ is the
volatility forecast. The term $a_{\Delta T} = r_{\Delta T} - \mu_{\Delta T}$ is the mean-corrected asset return. The residual $\epsilon$, which corresponds
to the unpredictable part, is a random variable distributed according to a pdf $p_{\Delta T}(\epsilon)$. The standard assumption is to
let $\epsilon(t)$ be an independent and identically distributed (iid) random variable. In general, a risk methodology will set the
mean $\mu$ to zero and concentrate on $\sigma$ and $p(\epsilon)$. To validate the methodology, we set
$$\epsilon = \frac{r - \mu}{\sigma}$$
compute the right hand side on historical data, and obtain a time series for the residual. We can then check that $\epsilon$
is independent and distributed according to $p(\epsilon)$. For instance, we can test that $\epsilon$ is uncorrelated, and that given a
risk threshold $\alpha$ (say, 95%), the number of exceedances behaves as expected. However, when the horizon period $\Delta T$
increases, it becomes very difficult to perform back testing due to the lack of data. Alternatively, we can consider a
process to model the returns with a time increment $\delta t$ of one day, computing the forecasts using conditional averages.
We can then relate daily data with forecasts at any time horizon, and the forecasts depend only on the process parameters, which are independent of $\Delta T$ and are consistent across risk horizons. The quality of the volatility forecasts is the
major determinant factor for a risk methodology. The residuals can then be computed and their properties studied.
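The validation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the data are synthetic (returns with a known, time-varying volatility), the mean forecast is set to zero, and the residual pdf is assumed to be standard normal so that the 95% threshold corresponds to the lower quantile -1.6449.

```python
import random


def standardised_residuals(returns, vol_forecasts, mean_forecasts=None):
    """Compute epsilon_t = (r_t - mu_t) / sigma_t; mu_t defaults to zero."""
    if mean_forecasts is None:
        mean_forecasts = [0.0] * len(returns)
    return [(r - m) / s for r, m, s in zip(returns, mean_forecasts, vol_forecasts)]


def lag1_autocorrelation(x):
    """Sample autocorrelation of x at lag one."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i - 1] - mean) for i in range(1, n))
    return cov / var


def exceedance_rate(residuals, lower_quantile):
    """Fraction of residuals falling below the (negative) lower quantile."""
    return sum(1 for e in residuals if e < lower_quantile) / len(residuals)


# Synthetic example: volatility alternates between 1% and 3% every 250 days.
random.seed(0)
sigma = [0.01 if (t // 250) % 2 == 0 else 0.03 for t in range(5000)]
returns = [s * random.gauss(0.0, 1.0) for s in sigma]

eps = standardised_residuals(returns, sigma)
z_05 = -1.6449  # 5% quantile of N(0, 1), assumed residual distribution
print("lag-1 autocorrelation of residuals:", lag1_autocorrelation(eps))
print("exceedance rate at the 95% threshold:", exceedance_rate(eps, z_05))
```

With a correct volatility forecast, the residual autocorrelation should be close to zero and the exceedance rate close to 5%. A poor volatility forecast would typically show up as clustered exceedances.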

3.4.2 The structure of volatility models

The above heuristics being poor estimates of the future volatility, one must rely on proper volatility models such as
the conditional heteroscedastic (CH) models. Since the early 80s, volatility clustering has spawned a large literature on
a new class of stochastic processes capturing the dependency of second moments in a phenomenological way. As the
lognormal assumption is not consistent with all the properties of historical stock returns, Engle [1982] first introduced
the autoregressive conditional heteroscedasticity (ARCH) model, which was generalised to GARCH by Bollerslev
[1986]. We let $r_t$ be the log return of an asset at time $t$, and assume that $\{r_t\}$ is either serially uncorrelated or has
minor lower order serial correlations, but is dependent. Volatility models attempt to capture such dependence in
the return series. We consider the conditional mean and conditional variance of $r_t$ given the filtration $F_{t-1}$, defined by
$$\mu_t = E[r_t | F_{t-1}] , \quad \sigma_t^2 = Var(r_t | F_{t-1}) = E[(r_t - \mu_t)^2 | F_{t-1}]$$
Since we assumed that the serial dependence of a stock return series was weak, if it exists at all, $\mu_t$ should be simple
and we can assume that $r_t$ follows a simple time series model such as a stationary $ARMA(p, q)$ model. That is
$$r_t = \mu_t + a_t , \quad \mu_t = \phi_0 + \sum_{i=1}^{p} \phi_i r_{t-i} - \sum_{i=1}^{q} \theta_i a_{t-i} \tag{3.4.12}$$
where $a_t$ is the shock or mean-corrected return of an asset return⁴. The model for $\mu_t$ is the mean equation for $r_t$, and
the model for $\sigma_t^2$ is the volatility equation for $r_t$.
Remark 3.4.1 Some authors use $h_t$ to denote the conditional variance of $r_t$, in which case the shock becomes $a_t = \sqrt{h_t} \epsilon_t$.
The parameters $p$ and $q$ are non-negative integers, and the order $(p, q)$ of the ARMA model may depend on the
frequency of the return series. The excess kurtosis values, measuring deviation from the normality of the returns, are
indicative of the long-tailed nature of the process. One can then compute and plot the autocorrelation functions
for the returns process $r_t$ as well as the autocorrelation functions for the squared returns $r_t^2$.
1. If the securities exhibit a significant positive autocorrelation at lag one and at higher lags as well, then large (small)
returns tend to be followed by large (small) returns of the same sign. That is, there are trends in the return series.
This is evidence against the weakly efficient market hypothesis, which asserts that all historical information
is fully reflected in prices. The hypothesis implies that historical prices contain no information that could be used to earn a
trading profit above that attainable with a naive buy-and-hold strategy, which further implies that
returns should be uncorrelated. In this case, the autocorrelation function would suggest that an autoregressive
model should capture much of the behaviour of the returns.
2. The autocorrelation in the squared returns process would suggest that large (small) absolute returns tend to
follow each other. That is, large (small) returns are followed by large (small) returns of unpredictable sign. It
implies that the returns series exhibits volatility clustering where large (small) returns form clusters throughout
the series. As a result, the variance of a return conditioned on past returns is a monotonic function of the past
returns, and hence the conditional variance is heteroskedastic and should be properly modelled.
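The two diagnostics in the list above can be illustrated with a short sketch. It is a minimal example, not a full test procedure: the series is simulated from an ARCH(1)-type recursion with illustrative parameters chosen here (`a0`, `a1` are assumptions), and the band `1.96 / sqrt(n)` is the usual approximate 95% band for white noise.

```python
import random


def autocorrelation(x, lag):
    """Sample autocorrelation of the series x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return num / denom


# Simulate returns that are uncorrelated but exhibit volatility clustering.
random.seed(1)
a0, a1 = 1e-4, 0.3  # illustrative ARCH(1) parameters (assumed)
returns, prev = [], 0.0
for _ in range(5000):
    h = a0 + a1 * prev ** 2          # conditional variance
    prev = h ** 0.5 * random.gauss(0.0, 1.0)
    returns.append(prev)

squared = [r * r for r in returns]
acf_r = [autocorrelation(returns, k) for k in range(1, 6)]
acf_r2 = [autocorrelation(squared, k) for k in range(1, 6)]
band = 1.96 / len(returns) ** 0.5    # approximate 95% band for white noise

print("ACF of returns        :", [round(v, 3) for v in acf_r])
print("ACF of squared returns:", [round(v, 3) for v in acf_r2])
print("approximate band      :", round(band, 3))
```

In such a series the autocorrelations of the returns stay small, while the first autocorrelation of the squared returns lies clearly outside the band, which is the signature of volatility clustering described in the second item.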
The conditional heteroscedastic (CH) models are capable of dealing with this conditional heteroskedasticity. The
variance in the model described in Equation (3.4.12) becomes
$$\sigma_t^2 = Var(r_t | F_{t-1}) = Var(a_t | F_{t-1})$$
Since the way in which $\sigma_t^2$ evolves over time differentiates one volatility model from another, the CH models are
concerned with the evolution of the volatility. Hence, modelling conditional heteroscedasticity (CH) amounts to adding a dynamic equation to a time series model to govern the time evolution of the conditional variance of the
shock. We distinguish two types or groups of CH models, the first one using an exact function to govern the evolution
of $\sigma_t^2$, and the second one using a stochastic equation to describe $\sigma_t^2$. For instance, the (G)ARCH model belongs to
the former, and the stochastic volatility (SV) model belongs to the latter. In general, we estimate the conditional mean
and variance equations jointly in empirical studies.
⁴ since $a_t = r_t - \mu_t$

3.4.2.1 Benchmark volatility models

The ARCH model, which is the first model providing a systematic framework for volatility modelling, states that
1. the mean-corrected asset return at is serially uncorrelated, but dependent
2. the dependence of at can be described by a simple quadratic function of its lagged values.
Specifically, setting $\mu_t = 0$ for simplicity, an ARCH(p) model assumes that
$$r_t = h_t^{1/2} \epsilon_t , \quad h_t = \alpha_0 + \alpha_1 r_{t-1}^2 + ... + \alpha_p r_{t-p}^2$$
where $\{\epsilon_t\}$ is a sequence of i.i.d. random variables with mean zero and variance 1, $\alpha_0 > 0$, and $\alpha_i \geq 0$ for $i > 0$. In
practice, $\epsilon_t$ is assumed to follow the standard normal or a standardised Student-t distribution. Generalising the ARCH model, the
main idea behind (G)ARCH models is to consider asset returns as a mixture of normal distributions with the current
variance being driven by a deterministic difference equation
$$r_t = h_t^{1/2} \epsilon_t \text{ with } \epsilon_t \sim N(0, 1) \tag{3.4.13}$$

and
$$h_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i r_{t-i}^2 + \sum_{j=1}^{q} \beta_j h_{t-j} , \quad \alpha_0 > 0 , \; \alpha_i , \beta_j > 0 \tag{3.4.14}$$
where $\alpha_0 > 0$, $\alpha_i \geq 0$, $\beta_j \geq 0$, and $\sum_{i=1}^{\max(p,q)} (\alpha_i + \beta_i) < 1$. The latter constraint on $\alpha_i + \beta_i$ implies that the
unconditional variance of $r_t$ is finite, whereas its conditional variance $h_t$ evolves over time. In general, empirical
applications find the GARCH(1, 1) model with $p = q = 1$ to be sufficient to model financial time series
$$h_t = \alpha_0 + \alpha_1 r_{t-1}^2 + \beta_1 h_{t-1} , \quad \alpha_0 > 0 , \; \alpha_1 , \beta_1 > 0 \tag{3.4.15}$$
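As an illustration of Equations (3.4.13) and (3.4.15), the following sketch simulates a GARCH(1, 1) path. The function name and the parameter values are hypothetical and chosen for the example only. The recursion is started at $\alpha_0 / (1 - \alpha_1 - \beta_1)$, which is the unconditional variance of the process when $\alpha_1 + \beta_1 < 1$.

```python
import random


def simulate_garch11(n, alpha0, alpha1, beta1, seed=0):
    """Simulate n returns r_t = sqrt(h_t) * eps_t with
    h_t = alpha0 + alpha1 * r_{t-1}^2 + beta1 * h_{t-1} and eps_t ~ N(0, 1)."""
    if not (alpha0 > 0 and alpha1 >= 0 and beta1 >= 0 and alpha1 + beta1 < 1):
        raise ValueError("need alpha0 > 0, alpha1, beta1 >= 0 and alpha1 + beta1 < 1")
    rng = random.Random(seed)
    h = alpha0 / (1.0 - alpha1 - beta1)  # start at the unconditional variance
    returns, variances = [], []
    for _ in range(n):
        r = h ** 0.5 * rng.gauss(0.0, 1.0)
        returns.append(r)
        variances.append(h)
        h = alpha0 + alpha1 * r * r + beta1 * h  # next conditional variance
    return returns, variances


# Illustrative parameters (assumed): persistence alpha1 + beta1 = 0.9.
returns, variances = simulate_garch11(20000, 1e-5, 0.1, 0.8)
unconditional = 1e-5 / (1.0 - 0.1 - 0.8)
sample_variance = sum(r * r for r in returns) / len(returns)
print("unconditional variance:", unconditional)
print("sample variance       :", sample_variance)
```

The conditional variance path in `variances` moves over time, while the sample variance of the simulated returns stays close to the finite unconditional value.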

When estimated, the sum of the parameters $\alpha_1 + \beta_1$ turns out to be close to the non-stationary case. That is, it is often
only the constraint imposed on the parameters that prevents their sum from exceeding 1, which would lead to non-stationary
behaviour. Different extensions of GARCH were developed in the literature with the objective of better capturing the
financial stylised facts. Among them are the Exponential GARCH (EGARCH) model proposed by Nelson [1991]
accounting for asymmetric behaviour of returns, the Threshold GARCH (TGARCH) model of Rabemananjara et al.
[1993] taking into account the leverage effects, the regime switching GARCH (RS-GARCH) developed by Cai [1994],
and the Integrated GARCH (IGARCH) introduced by Engle et al., which captures the high persistence observed in
returns time series. Nelson [1990] showed that Ito diffusion or jump-diffusion processes could be obtained as a
continuous time limit of discrete GARCH sequences. In order to capture stochastic shocks to the variance process,
Taylor [1986] introduced the class of stochastic volatility (SV) models whose instantaneous variance is driven by
$$\ln(h_t) = k + \phi \ln(h_{t-1}) + \tau \xi_t \text{ with } \xi_t \sim N(0, 1) \tag{3.4.16}$$

This approach has been refined and extended in many ways. The SV process is more flexible than the GARCH model,
providing more mixing due to the co-existence of shocks to volatility and return innovations. However, one drawback
of the GARCH models and of extensions of Equation (3.4.16) is their implied exponential decay of the autocorrelations
of measures of volatility, which is in contrast to the very long autocorrelation discussed in Section (10.4). Both the
GARCH and the baseline SV model are only characterised by short-term rather than long-term dependence. In order to
capture long memory effects, the GARCH and SV models were expanded by allowing for an infinite number of lagged
volatility terms instead of the limited number of lags present in Equations (3.4.14) and (3.4.16). To obtain a compact
characterisation of the long memory feature, a fractional differencing operator was used in both extensions, leading
to the fractionally integrated GARCH (FIGARCH) model introduced by Baillie et al. [1996], and the long-memory
stochastic volatility model of Breidt et al. [1998]. As an intermediate approach, Dacorogna et al. [1998] proposed
the heterogeneous ARCH (HARCH) model, considering returns at different time aggregation levels as determinants
of the dynamic law governing current volatility. In this model, we need to replace Equation (3.4.14) with
$$h_t = c_0 + \sum_{j=1}^{n} c_j r_{t,t-\Delta t_j}^2$$
where $r_{t,t-\Delta t_j} = \ln(p_t) - \ln(p_{t-\Delta t_j})$ are returns computed over different frequencies. This model was motivated
by the finding that volatility on fine time scales can be explained to a larger extent by coarse-grained volatility than
vice versa (see Muller et al. [1997]). Thus, the right hand side of the above equation covers local volatility at various
lower frequencies than the time step of the underlying data ($\Delta t_j = 2, 3, ..$). Note, multifractal models have a closely
related structure but model the hierarchy of volatility components in a multiplicative rather than additive format.
3.4.2.2 Some practical considerations

As an example, we briefly discuss the (G)ARCH models, which assume that the series being modelled is serially uncorrelated. Hence, before these models are used one must remove the autocorrelation present in the returns process. One can remove
the autocorrelation structure from the returns process by fitting autoregressive models. For instance, we can consider
the AR(p) model
$$r_t - \mu = \phi_1 (r_{t-1} - \mu) + \phi_2 (r_{t-2} - \mu) + ... + \phi_p (r_{t-p} - \mu) + u_t$$
where $\mu$ is the sample mean of the series $\{r_t\}$ and the errors $\{u_t\}$ are assumed to be from an i.i.d. process. One can
use the Yule-Walker method to estimate the model parameters. To discern between models we use the AIC criterion
which, while rewarding a model fitting well (large maximum likelihood), also penalises the inclusion of too many
parameters, that is, overfitting. We then obtain models of order $p$ with associated standard errors of the parameter
values. Again one has to study the normality of the errors by computing kurtosis statistics as well as the autocorrelation
function. One expects the residuals to be uncorrelated, not contradicting the Gaussian white noise hypothesis. However,
it may happen that the squares of the residuals are autocorrelated. That is, the AR(p) model did not account for
the volatility clustering present in the returns process. If the residuals from returns are uncorrelated, but still
show some evidence of volatility clustering, we can attempt to model the residuals with (G)ARCH processes. Hence,
we can remove the autocorrelation by modelling the first order moment with AR(p) models, and we can model the
second order moments using (G)ARCH models. (G)ARCH models can explain the features of small autocorrelations,
positive and statistically significant autocorrelations in the squares, and excess kurtosis present in the residuals of the
AR model. For an adequate fit, the residuals from the (G)ARCH models should have white noise properties. Hence,
we fit (G)ARCH models to the residuals of the AR models $\{u_t\}$ getting
$$u_t - \nu = h_t^{1/2} \epsilon_t$$
where $\{\epsilon_t\}$ is Gaussian with zero mean and constant variance. Letting $\nu$ be the mean, we get
$$h_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i (u_{t-i} - \nu)^2 + \sum_{j=1}^{q} \beta_j h_{t-j}$$

For $q = 0$ the process reduces to the ARCH(p) process and for $p = q = 0$ the process is a Gaussian white noise. In
the ARCH(p) process the conditional variance is specified as a linear function of past sample variances only, whereas
the GARCH(p, q) process allows lagged conditional variances to enter as well. Once the (G)ARCH models have
been fitted, we can discern between (nested) competing models using the Likelihood Ratio Test. To discern between
two competing models where neither is nested in the other, we resort to examining the residuals of the models. For an
adequate model, the residuals should resemble a white noise process, being uncorrelated with constant variance. In
addition, the autocorrelation of the squared residuals should also be zero. We should favour simpler models when two
models appear adequate from an examination of the residuals.
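The first stage of this procedure, fitting an AR(p) model by the Yule-Walker method and comparing orders with the AIC, can be sketched as follows. The sketch solves the Yule-Walker equations with the Levinson-Durbin recursion, and uses the form $n \ln \hat{\sigma}_u^2 + 2p$ for the AIC, which is one common variant and an assumption here. The function names and the simulated AR(1) data are illustrative only.

```python
import math
import random


def autocovariances(x, max_lag):
    """Biased sample autocovariances gamma_0, ..., gamma_max_lag."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    return [sum(d[t] * d[t - k] for t in range(k, n)) / n for k in range(max_lag + 1)]


def yule_walker(x, p):
    """Return (phi, noise_variance) of an AR(p) fit via Levinson-Durbin."""
    g = autocovariances(x, p)
    phi, err = [], g[0]
    for k in range(1, p + 1):
        acc = g[k] - sum(phi[j] * g[k - 1 - j] for j in range(k - 1))
        kappa = acc / err  # reflection coefficient of order k
        phi = [phi[j] - kappa * phi[k - 2 - j] for j in range(k - 1)] + [kappa]
        err *= 1.0 - kappa * kappa
    return phi, err


def ar_residuals(x, phi):
    """Residuals u_t of the fitted AR model on the demeaned series."""
    mean = sum(x) / len(x)
    d = [v - mean for v in x]
    p = len(phi)
    return [d[t] - sum(phi[j] * d[t - 1 - j] for j in range(p)) for t in range(p, len(d))]


def aic(n, noise_variance, p):
    """One common form of the AIC for an AR(p) fit (assumed here)."""
    return n * math.log(noise_variance) + 2 * p


# Simulated AR(1) returns with phi_1 = 0.5 (illustrative data).
random.seed(2)
x, prev = [], 0.0
for _ in range(5000):
    prev = 0.5 * prev + random.gauss(0.0, 0.01)
    x.append(prev)

for p in (1, 2, 3):
    phi, s2 = yule_walker(x, p)
    print(p, [round(c, 3) for c in phi], round(aic(len(x), s2, p), 1))
```

The residuals returned by `ar_residuals` are the series $\{u_t\}$ whose squares would then be inspected for autocorrelation before fitting a (G)ARCH model.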

3.4.3 Forecasting volatility with RiskMetrics methodology

3.4.3.1 The exponential weighted moving average

As volatility of financial markets changes over time, we saw that forecasting volatility could be of practical importance
when computing the conditional variance of the log return of the underlying asset. The historical sample variance
computed in Section (3.3.5) assigns weights of zero to squared deviations prior to a chosen cut-off date and an equal
weight to observations after the cut-off date. In order to benefit from closing auction prices, one should consider
close-to-close volatility, with a large number of samples to get a good estimate of historical volatility. Letting $D$
be the number of past trading days used to estimate the volatility, the daily log-return at time $t$ is $r(t) = r_{t-1,t} = C(t) - C(t-1)$ where $C(t)$ is the closing log-price at the end of day $t$, and the annualised $D$-day variance of return
is given by
$$\sigma_{STDEV}^2(t; D) = C_{sf} \frac{1}{D} \sum_{i=0}^{D-1} (r(t-i) - \bar{r}(t))^2$$
where the scalar $C_{sf}$ scales the variance to be annual, and where $\bar{r}(t) = \frac{1}{D} \sum_{i=0}^{D-1} r(t-i)$ is the sample mean. Some
authors set $C_{sf} = 252$ while others set it to $C_{sf} = 261$.
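A direct transcription of the close-to-close estimator is given below as a minimal sketch. The function name is hypothetical, the input is assumed to be a list of closing log-prices with the latest observation last, and the default scale of 252 is one of the two conventions mentioned above.

```python
def close_to_close_variance(log_closes, window, scale=252):
    """Annualised D-day variance from closing log-prices (latest last).

    Uses the last `window` daily log-returns, the 1/D normalisation of the
    formula above, and `scale` as the annualisation factor C_sf.
    """
    if window < 1 or len(log_closes) < window + 1:
        raise ValueError("need at least window + 1 closing log-prices")
    start = len(log_closes) - window
    returns = [log_closes[i] - log_closes[i - 1] for i in range(start, len(log_closes))]
    mean = sum(returns) / window
    return scale * sum((r - mean) ** 2 for r in returns) / window


# Three daily log-returns: +1%, -1%, +2%.
variance = close_to_close_variance([0.0, 0.01, 0.0, 0.02], window=3)
print("annualised variance  :", variance)
print("annualised volatility:", variance ** 0.5)
```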

While for large sample sizes the standard close-to-close estimator is best, it can obscure short-term changes in
volatility. With exponentially declining weights, volatility reacts faster to shocks in the market as recent data carry more weight than data in the
distant past. Further, following a shock (a large return), the volatility declines exponentially as the weight of the shock
observation falls. In contrast, the equally weighted moving average leads to relatively abrupt changes in the standard
deviation once the shock falls out of the measurement sample, which can be several months after it occurs. The
weighting schemes of GARCH(1,1) in Equation (3.4.15) and of the Exponential Weighted Moving Average
(EWMA) (see details in Section (5.7.1)) are such that weights decline exponentially in both models. Therefore,
RiskMetrics (see Longerstaey et al. [1996]) let the ex-ante volatility $\sigma_t$ be estimated with the exponentially weighted
lagged squared daily returns (similar to a simple univariate GARCH model). It assigns the highest weight to the latest
observations and the least to the oldest observations in the volatility estimate. The assignment of these weights enables
volatility to react to large returns (jumps) in the market, so that following a jump the volatility declines exponentially
as the weight of the jump falls. Specifically, the ex-ante annualised variance $\sigma_t^2$ is calculated as follows:
$$\sigma_t^2(t; D) = C_{sf} \sum_{i=0}^{\infty} \omega_i (r_{t-1-i} - \bar{r}_t)^2$$
where the weights $\omega_i = (1 - \delta)\delta^i$ add up to one, and where
$$\bar{r}_t = \sum_{i=0}^{\infty} \omega_i r(t-i)$$
is the exponentially weighted average return computed similarly. The parameter $\delta$ with $0 < \delta < 1$ is called the decay
factor and determines the relative weights that are assigned to returns (observations) and the effective amount of data
used in estimating volatility. It is chosen so that the center of mass of the weights $\sum_{i=0}^{\infty} (1 - \delta)\delta^i i = \frac{\delta}{(1 - \delta)}$ is equal
to 30 or 60 days (since $\sum_{i=0}^{\infty} (1 - \delta)\delta^i = 1$). The volatility model is the same for all assets at all times. To ensure
no look-ahead bias contaminates our results, we use the volatility estimates at time $t - 1$ applied to time $t$ returns
throughout the analysis, that is, $\sigma_t = \sigma_{t-1|t}$.
The formula above being an infinite sum, in practice we estimate the volatility $\sigma$ with the Exponential Weighted
Moving Average model (EWMA) for a given sequence of $k$ returns as
$$\sigma_{EWMA}^2(t; D) = C_{sf} \sum_{i=0}^{k-1} \omega_i (r_{t-i} - \bar{r})^2$$
The latest return has weight $(1 - \delta)$ and the second latest $(1 - \delta)\delta$ and so on. The oldest return appears with weight
$(1 - \delta)\delta^{k-1}$. The decay factor $\delta$ is chosen to minimise the error between observed volatility and its forecast over some
sample period.
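The finite-sample EWMA estimator and the center-of-mass rule for the decay factor can be sketched as follows. The function names are hypothetical, and $\bar{r}$ is taken here to be the equally weighted sample mean of the $k$ returns, which is an assumption of this sketch. Note that with a finite window the weights sum to $1 - \delta^k$ rather than to one, and the sketch does not renormalise them.

```python
def decay_from_center_of_mass(days):
    """Solve delta / (1 - delta) = days for the decay factor delta."""
    return days / (1.0 + days)


def ewma_variance(returns, delta, scale=252):
    """Finite-sample EWMA variance; `returns` is ordered oldest to latest.

    The latest return gets weight (1 - delta), the one before (1 - delta) * delta,
    and so on down to (1 - delta) * delta ** (k - 1) for the oldest.
    """
    k = len(returns)
    mean = sum(returns) / k              # assumed: equally weighted sample mean
    latest_first = returns[::-1]
    return scale * sum(
        (1.0 - delta) * delta ** i * (r - mean) ** 2
        for i, r in enumerate(latest_first)
    )


delta_30 = decay_from_center_of_mass(30)   # center of mass of 30 days
delta_60 = decay_from_center_of_mass(60)   # center of mass of 60 days
print("decay factors:", delta_30, delta_60)
print("EWMA variance:", ewma_variance([0.01, -0.02], delta=0.5, scale=1))
```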
3.4.3.2 Forecasting volatility

One advantage of the exponentially weighted estimator is that it can be written in recursive form, allowing for volatility
forecasts. If we assume the sample mean $\bar{r}$ is zero and that infinite amounts of data are available, then by using
the recursive feature of the exponential weighted moving average (EWMA) estimator, the one-day variance forecast
satisfies
$$\sigma_{1,t+1|t}^2 = \delta \sigma_{1,t|t-1}^2 + (1 - \delta) r_{1,t}^2$$
where $\sigma_{1,t+1|t}^2$ denotes the 1-day time $t + 1$ forecast given information up to time $t$. It is derived as follows
$$\begin{aligned}
\sigma_{1,t+1|t}^2 &= (1 - \delta) \sum_{i=0}^{\infty} \delta^i r_{1,t-i}^2 \\
&= (1 - \delta) \left( r_{1,t}^2 + \delta r_{1,t-1}^2 + \delta^2 r_{1,t-2}^2 + ... \right) \\
&= (1 - \delta) r_{1,t}^2 + \delta (1 - \delta) \left( r_{1,t-1}^2 + \delta r_{1,t-2}^2 + \delta^2 r_{1,t-3}^2 + ... \right) \\
&= \delta \sigma_{1,t|t-1}^2 + (1 - \delta) r_{1,t}^2
\end{aligned}$$
Taking the square root of both sides of the equation we get the one day volatility forecast $\sigma_{1,t+1|t}$. For two return
series, the EWMA estimate of covariance for a given sequence of $k$ returns is given by
$$\sigma_{1,2}^2(t; D) = C_{sf} \sum_{i=0}^{k-1} \omega_i (r_{1,t-i} - \bar{r}_1)(r_{2,t-i} - \bar{r}_2)$$
Similarly to the variance forecast, the covariance forecast can also be written in recursive form. The 1-day covariance
forecast between two return series $r_{1,t}$ and $r_{2,t}$ is
$$\sigma_{12,t+1|t}^2 = \delta \sigma_{12,t|t-1}^2 + (1 - \delta) r_{1,t} r_{2,t}$$
In order to get the correlation forecast, we apply the corresponding covariance and volatility forecast. The 1-day
correlation forecast is given by
$$\rho_{12,t+1|t} = \frac{\sigma_{12,t+1|t}^2}{\sigma_{1,t+1|t} \sigma_{2,t+1|t}}$$
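The recursive forecasts of variance, covariance and correlation can be sketched as follows. The function names are hypothetical, the sample means are set to zero as in the derivation, and the recursions are seeded with the first squared returns and cross-product, which is a simple initialisation assumed for the example (other choices, such as a sample variance over an initial window, are equally possible).

```python
import math


def ewma_forecasts(r1, r2, delta=0.94):
    """One-day EWMA forecasts (var1, var2, cov) from two return series.

    Applies sigma2_{t+1|t} = delta * sigma2_{t|t-1} + (1 - delta) * r_t^2 and
    the analogous covariance recursion, seeded with the first observation.
    """
    var1, var2, cov = r1[0] ** 2, r2[0] ** 2, r1[0] * r2[0]
    for x, y in zip(r1[1:], r2[1:]):
        var1 = delta * var1 + (1.0 - delta) * x * x
        var2 = delta * var2 + (1.0 - delta) * y * y
        cov = delta * cov + (1.0 - delta) * x * y
    return var1, var2, cov


def correlation(var1, var2, cov):
    """Correlation forecast from variance and covariance forecasts."""
    return cov / math.sqrt(var1 * var2)


var1, var2, cov = ewma_forecasts([0.01, 0.02], [0.01, -0.02], delta=0.5)
print("one-day volatilities:", math.sqrt(var1), math.sqrt(var2))
print("one-day correlation :", correlation(var1, var2, cov))
```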

Using the EWMA model, we can also construct variance and covariance forecasts over longer time horizons. The
T-period forecasts of the variance and covariance are, respectively,
$$\sigma_{1,t+T|t}^2 = T \sigma_{1,t+1|t}^2$$
and
$$\sigma_{12,t+T|t}^2 = T \sigma_{12,t+1|t}^2$$
implying that the correlation forecasts remain unchanged irrespective of the forecast horizon, that is, $\rho_{t+T|t} = \rho_{t+1|t}$.
We observe that in the EWMA model, multiple day forecasts are simple multiples of one-day forecasts. Note, the
square root of time rule results from the assumption that variances are constant. However, in the above derivation,
volatilities and covariances vary with time. In fact, the EWMA model implicitly assumes that the variance process is
non-stationary. In the literature, this model is a special case of the integrated GARCH model (IGARCH) described in
Section (5.6.3). In practice, scaling up volatility estimates proves problematic when
• rates/prices are mean-reverting.
• boundaries limit the potential movements in rates and prices.
• estimates of volatilities optimised to forecast changes over a particular horizon are used for another horizon.
The cluster properties are measured by the lagged correlation of volatility, and the decay of that correlation quantifies
the memory shape and magnitude. Based on statistical analysis, Zumbach [2006a] [2006b] found the lagged correlation
of volatility to decay logarithmically as $1 - \frac{\log \Delta T}{\log \Delta T_0}$ in the range from 1 day to 1 year for all assets. That is, the
memory of the volatility decays very slowly, so that a volatility model must capture its long memory. He considered
a multi-scale long memory extension of IGARCH called the Long-Memory ARCH (LMARCH) process. The main
idea is to measure the historical volatilities with a set of exponential moving averages (EMA) on a set of time horizons chosen according to a geometric series. The feed-back loop of the historical returns on the next random return is
similar to a GARCH(1, 1) process, but the volatilities are measured at multiple time scales. Computing analytically
the conditional expectations related to the volatility forecasts, we get
$$\sigma_{\Delta T|t}^2 = \frac{\Delta T}{\delta t} \sum_i \lambda(\frac{\Delta T}{\delta t}, i) r_{t-i\delta t}^2$$
with weights $\lambda(\frac{\Delta T}{\delta t}, i)$ derived from the process equations and satisfying $\sum_i \lambda(\frac{\Delta T}{\delta t}, i) = 1$. We see that the leading
term of the forecast is given by $\sigma_{\Delta T|t} \approx \sqrt{\frac{\Delta T}{\delta t}}$, which is the Normal square root scaling, and $\sum_i \lambda(i) r^2(i)$ is a measure
of the past volatility constructed as a weighted sum of the past squared returns. We saw in Section (3.4.2) that more
complicated nonlinear modelling of volatility exists such as GARCH, stochastic volatility, applications of chaotic
dynamics etc. We will detail all these models in Section (5.6).

3.4.3.3 Assuming zero-drift in volatility calculation

Given the logarithmic return in Equation (3.3.6) and the $k$-days period, the sample mean return $\bar{r} = \frac{1}{k} \sum_{i=0}^{k-1} r_{t-i}$ is
the estimate of the mean $\mu$. An important issue arising in the estimation of the historical variance is the noisy estimate
of the mean return. This is due to the fact that the mean logarithmic return depends on the range (length) of the return
series in the sense that
$$\bar{r} = \frac{1}{k} \sum_{i=0}^{k-1} r_{t-i} = \frac{1}{k} \sum_{i=0}^{k-1} (L_{t-i} - L_{t-i-1}) = \frac{1}{k} (L_t - L_{t-k})$$
Thus, the mean return depends only on the end points of the series and does not take into account the price movements or the number of prices within the period.
Therefore, while a standard deviation measures the dispersion of the observations around its mean, in practice, it may
be difficult to obtain a good estimate of the mean. As a result, some authors suggested measuring volatility around zero
rather than the sample mean. Assuming an EWMA model to estimate volatility, Longerstaey et al. [1995a] proposed
to study the difference between results given by the sample mean and zero-mean centred estimators. They considered
the one-day volatility forecast referred to as the estimated mean estimator, and given by
$$\hat{\sigma}_t^2 = \delta \hat{\sigma}_{t-1}^2 + \delta (1 - \delta)(R_t - \bar{R}_{t-1})^2$$

where $R_t$ is the percentage change return and $\bar{R}_{t-1}$ is an exponentially weighted mean. The zero-mean estimator
derived in Section (3.4.3.2) is given by
$$\tilde{\sigma}_t^2 = (1 - \delta) R_t^2 + \delta \tilde{\sigma}_{t-1}^2$$
Setting up a Monte Carlo experiment, they studied the forecast difference of the two models $\hat{\sigma}_t^2$ and $\tilde{\sigma}_t^2$ at any time $t$.
Defining the arithmetic difference $\Delta_t$ as
$$\Delta_t = \tilde{\sigma}_t^2 - \hat{\sigma}_t^2 \quad \text{with} \quad \bar{R}_i = 0 \text{ for } i = t, t-1, ...$$
we get
$$\Delta_t = R_t^2 (1 - \delta)^2$$
so that the one-day forecast difference for $\delta = 0.94$ becomes $\Delta_t = 0.0036 R_t^2$. Assuming zero sample mean, for
sufficiently small percentage return $R_t$, the difference $\Delta_t$ is negligible, and one should not expect significant differences between the two models. Considering a database consisting of eleven time series, Longerstaey et al. [1995a]
concluded that the relative differences between the two models are quite small. Further, investigating the differences
of the one-day forecasted correlation between 1990 and 1995, they found very small deviations. They extended the
analysis of the difference between the estimated mean and zero-mean estimators to one month horizons and obtained
relatively small differences between the two estimates. As a result, the zero-mean estimator is a viable alternative to
the estimated mean estimator, which is simpler to compute and not sensitive to short-term trends.
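To make the step leading to the difference between the two estimators explicit, the two recursions can be subtracted term by term. The sketch below assumes, in addition to a zero exponentially weighted mean $\bar{R}_{t-1} = 0$, that both estimators start from the same previous variance, $\hat{\sigma}_{t-1}^2 = \tilde{\sigma}_{t-1}^2$. This second condition is an assumption made here for the illustration.
$$\begin{aligned}
\tilde{\sigma}_t^2 - \hat{\sigma}_t^2 &= (1 - \delta) R_t^2 + \delta \tilde{\sigma}_{t-1}^2 - \delta \hat{\sigma}_{t-1}^2 - \delta (1 - \delta) R_t^2 \\
&= (1 - \delta)(1 - \delta) R_t^2 \\
&= (1 - \delta)^2 R_t^2
\end{aligned}$$
For $\delta = 0.94$ we have $(1 - \delta)^2 = 0.06^2 = 0.0036$, so that a daily return of $R_t = 1\%$ gives a difference of $0.0036 \times 10^{-4} = 3.6 \times 10^{-7}$ in variance terms.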
Note, assuming a conditional zero mean of returns is consistent with the financial theory of the efficient market
hypothesis (EMH). We will see in Chapter (10) that financial markets are multifractal and that the conditional mean of
returns experiences long term trends. In fact, Zumbach [2006a] [2006b] showed that neglecting the mean return
forecast was not correct, particularly for interest rates and stock indexes. For the former, the yields can follow a
downward or upward trend for very long periods, of the order of a year or more, and the latter can follow an overall
upward trend related to interest rates. These long trends introduce correlations, equivalent to some predictability in the
rates themselves. Even though these effects are quantitatively small, they introduce clear deviations from the random
walk with ARCH effect on the volatility.
3.4.3.4 Estimating the decay factor

We now need to estimate the sample mean and the exponential decay factor $\delta$. The largest sample size available
should be used to reduce the standard error. Choosing a suitable decay factor is a necessity in forecasting volatility
and correlations. One important issue in this estimation is the determination of an effective number of days ($k$) used
in forecasting. It is postulated in RiskMetrics that the volatility model should be determined by using the metric
$$\Omega_k = (1 - \delta) \sum_{t=k}^{\infty} \delta^t$$
where $\Omega_k$ is set to the tolerance level $\alpha$. We can now solve for $k$. Expanding the summation we get
$$\alpha = \delta^k (1 - \delta)[1 + \delta + \delta^2 + ...]$$
and taking the natural logarithms on both sides we get
$$\ln \alpha = k \ln \delta + \ln(1 - \delta) + \ln[1 + \delta + \delta^2 + ...]$$
Since $\log(1 \pm x) \approx \pm x - \frac{1}{2} x^2$ for $|x| < 1$ we get
$$k \approx \frac{\ln \alpha}{\ln \delta}$$
In principle, a set of optimal decay factors, one for each covariance, can be determined such that the
estimated covariance matrix is symmetric and positive definite. RiskMetrics presented a method for choosing one
optimal decay factor to be used in estimation of the entire covariance matrix. They found $\delta = 0.94$ to be the optimal
decay factor for one-day forecast and $\delta = 0.97$ to be optimal for one month (25 trading days) forecast.

3.4.4 Computing historical volatility

Computing historical volatility is not an easy task as it depends on two parameters, the length of time and the frequency
of measurement. As volatility mean reverts over a period of months, it is difficult to define the best period of time to
obtain a fair value of realised volatility. Further, in the presence of rare events, the best estimate of future volatility is not
necessarily the current historical volatility. While historical volatility can be measured monthly, quarterly or yearly, it
is usually measured daily or weekly. If stock price returns are independent, then daily and weekly historical
volatility should on average be the same. However, when stock price returns are not independent there is a difference
due to autocorrelation. If we assume that daily volatility should be preferable to weekly volatility because there are five
times as many data points available, then intraday volatility should always be preferred. However, intraday volatility
is not constant as it is usually greatest just after the market open and just before the market close, and falls in the
middle of the day, leading to noisy time series. Hence, traders taking into account intraday prices should depend on
advanced volatility measures. One approach is to use the exponential weighted moving average (EWMA) model described
in Section (3.4.3), which avoids the volatility collapse of historic volatility. It has the advantage over standard historical
volatility (STDEV) of gradually reducing the effect of a spike on volatility. However, EWMA models are rarely used,
partly due to the fact that they do not properly handle regular volatility driving events such as earnings. That is,
previous earnings jumps will have the least weight just before an earnings date (when future volatility is most likely to be
high), and the most weight just after earnings (when future volatility is most likely to be low).
The availability of intraday data allows one to consider volatility estimators making use of intraday information
for more efficient volatility estimates. Using the theory of quadratic variation, Andersen et al. [1998] and Barndorff-Nielsen et al. [2002] introduced the concept of integrated variance and showed that the sum of squared high-frequency
intraday log-returns is an efficient estimator of daily variance in the absence of price jumps and serial correlation in the
return series. However, market effects such as lack of continuous trading, bid/ask spread and price discretisation swamp
the estimation procedure, and in the limit, microstructure noise dominates the result. Since the research on high-frequency
volatility estimation and the effects of microstructure noise is extremely active, we will just say that sampling intervals
between 5 and 30 minutes tend to give satisfactory volatility estimates. We let $N_{day}(t)$ denote the number of active
price quotes during the trading day $t$, so that $S_1(t), .., S_{N_{day}}(t)$ denote the intraday quotes. Assuming 30-minute
quotes, the realised variance (RV) model is
$$\hat{\sigma}^2_{RV}(t) = \sum_{j=2}^{N_{day}} \left( \log S_j(t) - \log S_{j-1}(t) \right)^2$$
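As a minimal sketch of the realised variance, the snippet below sums the squared log-returns between consecutive intraday quotes of one trading day. The function name `realised_variance` and the quote values are hypothetical and only serve as an illustration.

```python
import numpy as np

def realised_variance(intraday_prices):
    """Sum of squared log-returns between consecutive intraday quotes of one day."""
    log_p = np.log(np.asarray(intraday_prices, dtype=float))
    return float(np.sum(np.diff(log_p) ** 2))

# Hypothetical 30-minute quotes for one trading day (illustrative numbers only).
quotes = [100.0, 100.4, 99.8, 100.1, 100.9, 100.6]
rv = realised_variance(quotes)   # daily variance estimate
rv_vol = np.sqrt(rv)             # corresponding daily volatility
```

A flat price path gives a realised variance of zero, and every additional intraday move can only increase the estimate.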

One simple heuristic to define advanced volatility measures making use of intraday information is to consider range
estimators that use some or all of the open (O), high (H), low (L) and close (C). In that setting we define
• opening price: $O(t) = \log S_1(t)$
• closing price: $C(t) = \log S_{N_{day}}(t)$
• high price: $H(t) = \log \left( \max_{j=1,..,N_{day}} S_j(t) \right)$
• low price: $L(t) = \log \left( \min_{j=1,..,N_{day}} S_j(t) \right)$
• normalised opening price: $o(t) = O(t) - C(t-1) = \log \frac{S_1(t)}{S_{N_{day}}(t-1)}$
• normalised closing price: $c(t) = C(t) - O(t) = \log \frac{S_{N_{day}}(t)}{S_1(t)}$
• normalised high price: $h(t) = H(t) - O(t) = \log \left( \max_{j=1,..,N_{day}} \frac{S_j(t)}{S_1(t)} \right)$
• normalised low price: $l(t) = L(t) - O(t) = \log \left( \min_{j=1,..,N_{day}} \frac{S_j(t)}{S_1(t)} \right)$

We are going to discuss a few of them below, introduced by Parkinson [1980], Garman et al. [1980], Rogers et al.
[1994], Yang et al. [2000]. We refer the readers to Baltas et al. [2012b] and Bennett [2012] for a more detailed list
with formulas.
• close to close (C): the most common type of calculation benefiting only from reliable prices at closing auctions.
• Parkinson (HL): it is the first to propose the use of intraday high and low prices to estimate daily volatility
$$\hat{\sigma}^2_{PK}(t) = \frac{1}{4 \log 2} \left( h(t) - l(t) \right)^2$$
The model assumes that the asset price follows a driftless diffusion process and it is about 5 times more efficient
than STDEV.
• Garman-Klass (OHLC): it is an extension of the PK model including opening and closing prices. It is the most
powerful estimate for stocks with Brownian motion, zero drift and no opening jumps. The GK estimator is given
by
$$\hat{\sigma}^2_{GK}(t) = \frac{1}{2} \left( h(t) - l(t) \right)^2 - (2 \log 2 - 1) c^2(t)$$
and it is about 7.4 times more efficient than STDEV.

• Rogers-Satchell (OHLC): being similar to the GK estimate, it benefits from handling non-zero drift in the price
process. However, opening jumps are not well handled. The estimator is given by
$$\hat{\sigma}^2_{RS}(t) = h(t) \left( h(t) - c(t) \right) + l(t) \left( l(t) - c(t) \right)$$
Rogers-Satchell showed that GK is just 1.2 times more efficient than RS.
• Garman-Klass Yang-Zhang extension (OHLC): Yang-Zhang extended the GK method by incorporating the difference between the current opening log-price and the previous day's closing log-price. The estimator becomes
robust to the opening jumps, but still assumes zero drift. The estimator is given by
$$\hat{\sigma}^2_{GKYZ}(t) = \hat{\sigma}^2_{GK}(t) + \left( O(t) - C(t-1) \right)^2$$

• Yang-Zhang (OHLC): Having a multi-period specification, it is an unbiased volatility estimator that is independent of both the opening jump and the drift of the price process. It is the most powerful volatility estimator with
minimum estimation error. It is a linear combination of the RS estimator, the standard deviation of past daily
log-returns (STDEV), and a similar estimator using the normalised opening prices instead of the close-to-close
log-returns.
Since these estimators provide daily estimates of variance/volatility, an annualised D-day estimator is therefore given
by the average estimate over the past D days
$$\sigma_l^2(t; D) = \frac{C_{sf}}{D} \sum_{i=0}^{D-1} \hat{\sigma}_l^2(t - i)$$
for $l = \{RV, PK, GK, GKYZ, RS\}$. The Yang-Zhang estimator is given by
$$\sigma_{YZ}^2(t; D) = \sigma_{open}^2(t; D) + k \sigma_{STDEV}^2(t; D) + (1 - k) \sigma_{RS}^2(t; D)$$
where $\sigma_{open}^2(t; D) = \frac{C_{sf}}{D} \sum_{i=0}^{D-1} \left( o(t - i) - \bar{o}(t) \right)^2$, with $\bar{o}(t)$ the average of the normalised opening prices over the window. The parameter $k$ is chosen so that the variance of the estimator is
minimised. YZ showed that for the value
$$k = \frac{0.34}{1.34 + \frac{D+1}{D-1}}$$
their estimator is $1 + \frac{1}{k}$ times more efficient than the ordinary STDEV estimator. Brandt et al. [2005] showed that the
range-based volatility estimates are approximately Gaussian, whereas return-based volatility estimates are far from
Gaussian, which is an advantage when calibrating stochastic volatility models with a likelihood procedure. Bennett
[2012] defined two measures to determine the quality of a volatility measure, namely, the efficiency and the bias. The
former is defined by
$$\mathrm{Eff}(\sigma_l^2) = \frac{\sigma_{STDEV}^2}{\sigma_l^2}$$

where σl is the volatility of the estimate and σST DEV is the volatility of the standard close to close estimate. It
describes the volatility of the estimate, and decreases as the number of samples increases. The latter is the difference
between the estimated variance and the average volatility. It depends on the sample size and the type of distribution of
the underlying. Generally, for small sample sizes the Yang-Zhang estimator is best overall, and for large sample sizes
the standard close to close estimator is best. Setting aside the realised variance, the Yang-Zhang estimator is the most
efficient estimator, exhibits the smallest bias when compared to the realised variance, and generates the lowest
turnover. While the optimal choice for volatility estimation is the realised variance (RV) estimator, the Yang-Zhang
estimator constitutes an optimal tradeoff between efficiency, turnover, and the necessity of high frequency data, as it
only requires daily information on opening, closing, high and low prices.
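The range estimators above can be sketched in a few lines. The snippet below is an illustration under stated assumptions rather than a reference implementation: the function names, the sample OHLC values, and the annualisation factor of 252 (standing in for $C_{sf}$) are hypothetical, and the full Yang-Zhang combination is left out, with only its weight $k$ being computed.

```python
import numpy as np

def daily_range_variances(O, H, L, C):
    """PK, GK, RS and GKYZ daily variance estimates from raw OHLC prices.

    Each input holds one price per day; the first day only supplies the
    previous close needed for the normalised opening price.
    """
    O, H, L, C = (np.log(np.asarray(x, dtype=float)) for x in (O, H, L, C))
    o = O[1:] - C[:-1]                      # normalised open (overnight jump)
    h = H[1:] - O[1:]                       # normalised high
    l = L[1:] - O[1:]                       # normalised low
    c = C[1:] - O[1:]                       # normalised close
    pk = (h - l) ** 2 / (4.0 * np.log(2.0))
    gk = 0.5 * (h - l) ** 2 - (2.0 * np.log(2.0) - 1.0) * c ** 2
    rs = h * (h - c) + l * (l - c)
    gkyz = gk + o ** 2
    return {"PK": pk, "GK": gk, "RS": rs, "GKYZ": gkyz}

def annualised(daily_var, D, scale=252.0):
    """Average of the last D daily estimates times an assumed annualisation factor."""
    return scale * float(np.mean(np.asarray(daily_var)[-D:]))

def yang_zhang_k(D):
    """Weight k of the Yang-Zhang estimator for a D-day window (D > 1)."""
    return 0.34 / (1.34 + (D + 1.0) / (D - 1.0))

# Hypothetical OHLC prices for four days (illustrative numbers only).
O = [100.0, 101.0, 100.5, 102.0]
H = [101.5, 102.0, 102.5, 103.0]
L = [99.5, 100.0, 100.0, 101.0]
C = [101.0, 100.5, 102.0, 102.5]
est = daily_range_variances(O, H, L, C)
vol_pk = np.sqrt(annualised(est["PK"], D=3))   # annualised Parkinson volatility
```

Since the GKYZ estimate adds the squared overnight jump to the GK estimate, it is never smaller than GK on the same day.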

Part II

Statistical tools applied to finance

Chapter 4

Filtering and smoothing techniques
4.1

Presenting the challenge

4.1.1

Describing the problem

Dynamical systems are characterised by two types of noise, where the first one is called observational or additive
noise, and the second one is called dynamical noise. In the former, the system is unaffected by this noise, instead the
noise is a measurement problem. The observer has trouble precisely measuring the output of the system, leading to
recorded values with added noise increment. This additive noise is external to the process. In the latter, the system
interprets the noisy output as an input, leading to dynamical noise because the noise invades the system. Dynamical
noise being inherent to financial time series, we are now going to summarise some of the tools proposed in statistical
analysis and signal processing to filter it out.
In financial time series analysis, the trend is the component containing the global change, while the local changes
are represented by noise. In general, the trend is characterised by a smooth function representing long-term movement.
Hence, trends should exhibit slow changes, while noise is assumed to be highly volatile. Trend filtering (TF) attempts
at differentiating meaningful information from exogenous noise. The separation between trend and noise lies at the
core of modern statistics and time series analysis. TF is generally used to analyse the past by transforming any noisy
signal into a smoother one. It can also be used as a predictive tool, but it cannot be applied to every time series. For
instance, trend following predictions suppose that the last observed trend influences future returns, but the trend may
not persist in the future.
A physical process can be described either in the time domain, by the values of some quantity h as a function
of time h(t), or in the frequency domain where the process is specified by giving its amplitude H as a function of
frequency f , that is, H(f ) with −∞ < f < ∞. One can think of h(t) and H(f ) as being two different representations of the same function, and the Fourier transform equations are a tool to go back and forth between these two
representations. There are several reasons to filter digitally a signal, such as applying a high-pass or low-pass filtering
to eliminate noise at low or high frequencies, or requiring a bandpass filter if the interesting part of the signal lies only
in a certain frequency band. One can either filter data in the frequency domain or in the time domain. While it is
very convenient to filter data in the former, in the case where we have a real-time application the latter may be more
appropriate.
In the time domain, the main idea behind TF is to replace each data point by some kind of local average of
surrounding data points such that averaging reduces the level of noise without biasing too much the value obtained.
Observations can be averaged using many different types of weightings: some trend following methods are referred
to as linear filtering, while others are classified as nonlinear. Depending on whether trend filtering is performed to
explain past behaviour of asset prices or to forecast future returns, one will consider different estimators and calibration techniques. In the former, the model and parameters can be selected by minimising past prediction error or by
considering a benchmark estimator and calibrating another model to be as close as possible to the benchmark. In the
latter, trend following predictions assume that positive (or negative) trends are more likely to be followed by positive
(or negative) returns. That is, trend filtering solves the problem of denoising while taking into account the dynamics of
the underlying process.
Bruder et al. [2011] tested the persistence of trends for major financial indices over a period ranging from January
1995 to October 2011, where the one-month returns of each index are separated into a set including one-month returns immediately following a positive three-month return and another set for negative three-month returns.
The results showed that on average, higher returns can be expected after a positive three-month return than after a
negative three-month period so that observation of the current trend may have a predictive value for the indices under
consideration. Note, on other time scales or for other assets, one may obtain opposite results supporting contrarian
strategies.
The main goal of trend filtering in finance is to design portfolio strategies benefiting from these trends. However,
before computing an estimate of the trend, one must decide if there is a trend or not in the series. We discussed in
Section (3.3.3) the power of statistical tests for trend detection, and concluded that the Mann-Kendall test was more
powerful than the parametric t-test for high coefficient of skewness. Given an established trend, one approach is to use
the resulting trend indicator to forecast future asset returns for a given horizon and allocate accordingly the portfolio.
For instance, an investor could buy assets with positive return forecasts and sell them when the forecasts are negative.
The size of each long or short position is a quantitative problem requiring a clear investment process. As explained
in Section (), the portfolio allocation should take into account the individual risks, their correlations and the expected
return of each asset.

4.1.2

Regression smoothing

A regression curve describes a general relationship between an explanatory variable $X$, which may be a vector in $\mathbb{R}^d$,
and a response variable $Y$, and the knowledge of this relation is of great interest. Given $n$ data points, a regression
curve fitting a relationship between variables $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$ is commonly modelled as
$$Y_i = m(X_i) + \epsilon_i , \quad i = 1, .., n$$
where $\epsilon$ is a random variable denoting the variation of $Y$ around $m(X)$, the mean regression curve $E[Y|X = x]$ when
we try to approximate the mean response function $m$. By reducing the observational errors, we can concentrate on
important details of the mean dependence of Y on X. This curve approximation is called smoothing. Approximating
the mean function can be done in two ways. On one hand the parametric approach assumes that the mean curve m has
some prespecified functional form (for example a line with unknown slope and intercept). On the other hand we can try to
estimate m nonparametrically without reference to a specific form. In the former, the functional form is fully described
by a finite set of parameters, which is not the case in the latter, offering more flexibility for analysing unknown
regression relationships. The nonparametric approach can be used to predict observations without referencing to a fixed parametric model, to
find spurious observations by studying the influence of isolated points, and to substitute missing values or interpolate
between adjacent points. As an example, Engle et al. [1986] considered a nonlinear relationship between electricity
sales and temperature using a parametric-nonparametric estimation approach. The prediction of new observations is of
particular interest to time series analysis. In general, classical parametric models are too restrictive to give reasonable
explanations of observed phenomena. For instance, Ullah [1987] applied kernel smoothing to a time series of stock
market prices and estimated certain risk indexes. Deaton [1988] used smoothing methods to examine demand patterns
in Thailand and investigated how the knowledge of those patterns affects the assessment of pricing policies.
Smoothing of a data set $\{(X_i, Y_i)\}_{i=1}^n$ involves the approximation of the mean response curve $m(x)$ by an average of the
responses $Y_i$ whose $X_i$ lie close to the point $x$. This local averaging procedure, which is the basic idea of smoothing,
can be defined as
$$\hat{m}(x) = n^{-1} \sum_{i=1}^n W_{ni}(x) Y_i$$

where $\{W_{ni}(x)\}_{i=1}^n$ denotes a sequence of weights which may depend on the whole vector $\{X_i\}_{i=1}^n$. The amount of
averaging is controlled by the weight sequence which is tuned by a smoothing parameter regulating the size of the
neighborhood around $x$. In the special case where the weights $\{W_{ni}(x)\}_{i=1}^n$ are positive and sum to one for all $x$, that
is,
$$n^{-1} \sum_{i=1}^n W_{ni}(x) = 1$$
then $\hat{m}(x)$ is a least squares estimate (LSE) at point $x$ since it is the solution of the minimisation problem
$$\min_{\theta} n^{-1} \sum_{i=1}^n W_{ni}(x) (Y_i - \theta)^2 = n^{-1} \sum_{i=1}^n W_{ni}(x) (Y_i - \hat{m}(x))^2$$

where the residuals are weighted quadratically. Thus, the basic idea of local averaging is equivalent to the procedure
of finding a local weighted least squares estimate. In the random design model, we let $\{(X_i, Y_i)\}_{i=1}^n$ be independent,
identically distributed variables, and we concentrate on the average dependence of $Y$ on $X = x$, that is, we try to
estimate the conditional mean curve
$$m(x) = E[Y|X = x] = \frac{\int y f(x, y) dy}{f(x)}$$
where $f(x, y)$ is the joint density of $(X, Y)$, and $f(x) = \int f(x, y) dy$ is the marginal density of $X$. Note that for a
normal joint distribution with mean zero and unit variances, the regression curve is linear and $m(x) = \rho x$ with $\rho = Corr(X, Y)$.
By contrast, the fixed design model is concerned with controlled, non-stochastic X-variables, so that
$$Y_i = m(X_i) + \epsilon_i , \quad 1 \le i \le n$$
where $\{\epsilon_i\}_{i=1}^n$ denotes zero-mean random variables with variance $\sigma^2$. Although the stochastic mechanism is different,
the basic idea of smoothing is the same for both random and nonrandom X-variables.
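The equivalence between local averaging and local weighted least squares can be checked numerically. The sketch below assumes Gaussian-shaped weights controlled by a smoothing parameter `h` (a choice made here for illustration, since the text leaves the weight sequence general) and simulated data; all names are hypothetical.

```python
import numpy as np

def local_average(x, X, Y, h):
    """Local average m_hat(x) = n^{-1} sum_i W_ni(x) Y_i with Gaussian-shaped weights."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    k = np.exp(-0.5 * ((x - X) / h) ** 2)
    W = k / k.mean()                 # normalise so that n^{-1} sum_i W_ni(x) = 1
    return float(np.mean(W * Y)), W

# Simulated random design: Y = sin(2 pi X) + noise (illustrative only).
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 1.0, 200))
Y = np.sin(2.0 * np.pi * X) + 0.3 * rng.standard_normal(200)
m_hat, W = local_average(0.25, X, Y, h=0.05)

def weighted_sse(theta):
    """Weighted least squares criterion n^{-1} sum_i W_ni(x) (Y_i - theta)^2."""
    return float(np.mean(W * (Y - theta) ** 2))
```

Evaluating `weighted_sse` slightly above and below `m_hat` gives larger values than at `m_hat` itself, which is the least squares property stated above.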

4.1.3

Introducing trend filtering

4.1.3.1

Filtering in frequency

We consider the removal of noise from a corrupted signal by assuming that we want to measure the uncorrupted signal
u(t), but that the measurement process is imperfect, leading to the corrupted signal c(t). On one hand the true signal
u(t) may be convolved with some known response function r(t) to give a smeared signal s(t)
$$s(t) = \int_{-\infty}^{\infty} r(t - \tau) u(\tau) d\tau \quad \text{or} \quad S(f) = R(f) U(f)$$
where S, R, U are the Fourier transforms of s, r, u respectively. On the other hand the measured signal c(t) may
contain an additional component of noise n(t) (dynamical noise)
$$c(t) = s(t) + n(t) \qquad (4.1.1)$$

While in the first case we can divide $C(f)$ by $R(f)$ to get a deconvolved signal, in presence of noise we need to find
the optimal filter $\phi(t)$ or $\Phi(f)$ producing the signal $\tilde{u}(t)$ or $\tilde{U}(f)$ as close as possible to the uncorrupted signal $u(t)$ or
$U(f)$. That is, we want to estimate
$$\tilde{U}(f) = \frac{C(f) \Phi(f)}{R(f)}$$
in the least-square sense, such that
$$\int_{-\infty}^{\infty} |\tilde{u}(t) - u(t)|^2 dt = \int_{-\infty}^{\infty} |\tilde{U}(f) - U(f)|^2 df$$
is minimised. In the frequency domain we get
$$\int_{-\infty}^{\infty} \left| \frac{(S(f) + N(f)) \Phi(f)}{R(f)} - \frac{S(f)}{R(f)} \right|^2 df = \int_{-\infty}^{\infty} \frac{|S(f)|^2 |1 - \Phi(f)|^2 + |N(f)|^2 |\Phi(f)|^2}{|R(f)|^2} df$$
if S and N are uncorrelated. We get a minimum if and only if the integrand is minimised with respect to $\Phi(f)$ at every
value of $f$. Differentiating with respect to $\Phi$ and setting the result to zero, we get
$$\Phi(f) = \frac{|S(f)|^2}{|S(f)|^2 + |N(f)|^2} \qquad (4.1.2)$$

involving the smeared signal $S$ and the noise $N$ but not the true signal $U$. Since we cannot estimate $S$
and $N$ separately from $C$, we need extra information or assumptions. One way forward is to sample a long stretch of data $c(t)$ and
plot its power spectral density, as it is proportional to $|S(f)|^2 + |N(f)|^2$
$$|S(f)|^2 + |N(f)|^2 \approx P_c(f) = |C(f)|^2 , \quad 0 \le f < f_c$$
which is the modulus squared of the discrete Fourier transform of some finite sample. In general, the resulting plot
shows the spectral signature of a signal sticking up above a continuous noise spectrum. Drawing one smooth curve
through the noise spectrum, extrapolated into the region dominated by the signal, and a second smooth curve through the signal plus noise power, the difference between the two curves is the smooth model of the signal power.
After designing a filter with response $\Phi(f)$ and using it to make a respectable guess at the signal $\tilde{U}(f)$ we might regard
$\tilde{U}(f)$ as a new signal to improve even further with the same filtering technique. However, the scheme converges to a
signal of $S(f) = 0$. Alternatively, we take the whole data record, FFT it, multiply the FFT output by a filter function
$H(f)$ (constructed in the frequency domain), and then do an inverse FFT to get back a filtered data set in the time
domain.
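The optimal filter of Equation (4.1.2) can be illustrated on synthetic data. The sketch below makes several simplifying assumptions that are not part of the text: there is no smearing ($R(f) = 1$), the signal is a single sine wave whose power spectrum is known by construction, and the noise is white with a known standard deviation. With real data, $|S(f)|^2$ and $|N(f)|^2$ would have to be modelled from the measured power spectrum.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1024
t = np.arange(n)
u = np.sin(2.0 * np.pi * 8 * t / n)                 # uncorrupted signal (hypothetical)
noise_sd = 0.5
c = u + noise_sd * rng.standard_normal(n)           # measured signal, R(f) = 1 assumed

C = np.fft.rfft(c)                                  # FFT of the whole data record
S_power = np.abs(np.fft.rfft(u)) ** 2               # |S(f)|^2, known here by construction
N_power = np.full_like(S_power, n * noise_sd ** 2)  # flat white-noise power model
phi = S_power / (S_power + N_power)                 # optimal filter, Equation (4.1.2)
u_tilde = np.fft.irfft(C * phi, n)                  # inverse FFT gives the filtered data

mse_raw = np.mean((c - u) ** 2)                     # error of the unfiltered measurement
mse_filtered = np.mean((u_tilde - u) ** 2)          # error after filtering
```

The filter takes values between 0 and 1: it is close to 1 at frequencies where the signal dominates and close to 0 where only noise is present.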
4.1.3.2

Filtering in the time domain

As discussed above, even though it is very convenient to filter data in the frequency domain, in the case where we have
a real-time application the time domain may be more appropriate. A general linear filter takes a sequence $y_k$ of input
points and produces a sequence $x_n$ of output points by the formula
$$x_n = \sum_{k=0}^{M} c_k y_{n-k} + \sum_{j=1}^{N} d_j x_{n-j}$$

where the $M + 1$ coefficients $c_k$ and the $N$ coefficients $d_j$ are fixed and define the filter response. This filter produces
each new output value from the current and $M$ previous input values, and from its own $N$ previous output values.
In the case where $N = 0$ the filter is called nonrecursive or finite impulse response (FIR), and if $N \neq 0$ it is called
recursive or infinite impulse response (IIR). The relation between the $c_k$'s and $d_j$'s and the filter response function
$H(f)$ is
$$H(f) = \frac{\sum_{k=0}^{M} c_k e^{-2\pi i k (f\Delta)}}{1 - \sum_{j=1}^{N} d_j e^{-2\pi i j (f\Delta)}}$$

where $\Delta$ is the sampling interval and $f\Delta$ ranges over the Nyquist interval $-\frac{1}{2} \le f\Delta \le \frac{1}{2}$. To determine a filter we need to find a suitable set
of c's and d's from a desired $H(f)$, but like many inverse problems it has no all-purpose solution since $H(f)$ is
a continuous function while the short list of the c's and d's represents only a few adjustable parameters. When the
denominator in the filter $H(f)$ is unity, we recover a discrete Fourier transform. Nonrecursive filters have a frequency
response that is a polynomial in the variable $\frac{1}{z}$ where
$$z = e^{2\pi i (f\Delta)}$$
while the recursive filter's frequency response is a rational function in $\frac{1}{z}$. Nonrecursive filters are always
stable, but recursive filters are not necessarily stable. Hence, the problem of designing recursive filters is an inverse
problem with an additional stability constraint. See Press et al. [1992] for a sketch of basic techniques.
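A direct, unoptimised sketch of the general linear filter and of its response function is given below. The function names are hypothetical, zero initial conditions are assumed for samples before the start of the record, and the two example filters (a four-point moving average and an exponential smoother) are chosen here only for illustration.

```python
import numpy as np

def linear_filter(y, c, d=()):
    """x_n = sum_k c_k y_{n-k} + sum_j d_j x_{n-j}, assuming zero initial conditions."""
    y = np.asarray(y, dtype=float)
    x = np.zeros_like(y)
    for n in range(len(y)):
        acc = sum(ck * y[n - k] for k, ck in enumerate(c) if n - k >= 0)
        acc += sum(dj * x[n - j] for j, dj in enumerate(d, start=1) if n - j >= 0)
        x[n] = acc
    return x

def response(f_delta, c, d=()):
    """Filter response H(f) evaluated at the product f * Delta."""
    z_inv = np.exp(-2j * np.pi * f_delta)
    num = sum(ck * z_inv ** k for k, ck in enumerate(c))
    den = 1.0 - sum(dj * z_inv ** j for j, dj in enumerate(d, start=1))
    return num / den

y = np.ones(50)                                       # constant input
fir = linear_filter(y, c=[0.25, 0.25, 0.25, 0.25])    # nonrecursive (FIR): moving average
iir = linear_filter(y, c=[0.1], d=[0.9])              # recursive (IIR): exponential smoother
gain_at_zero = abs(response(0.0, [0.1], [0.9]))       # both filters pass a constant unchanged
```

The moving average reaches the input level after four samples, while the exponential smoother approaches it geometrically; the latter is stable because its single feedback coefficient is smaller than one in absolute value.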

4.2

Smoothing techniques and nonparametric regression

As opposed to pure parametric curve estimations, smoothing techniques provide flexibility in data analysis. In this
section, we are going to consider the statistical aspects of nonparametric regression smoothing by considering the
choice of smoothing parameters and the construction of confidence bands. While various smoothing methods exist,
all smoothing methods are in an asymptotic sense equivalent to kernel smoothing.

4.2.1

Histogram

The density function f tells us where observations cluster and occur more frequently. The nonparametric approach does
not restrict the possible form of the density function by assuming f to belong to a prespecified family of functions.
The estimation of the unknown density function f provides a way of understanding and representing the behaviour of
a random variable.
4.2.1.1

Definition of the Histogram

The histogram combines neighbouring needles by counting how many fall into a small interval of length $h$ called a bin.
The probability for observations of $x$ to fall into the interval $[-\frac{h}{2}, \frac{h}{2})$ equals the shaded area under the density
$$P(X \in [-\frac{h}{2}, \frac{h}{2})) = \int_{-\frac{h}{2}}^{\frac{h}{2}} f(x) dx \qquad (4.2.3)$$

Histogram as a frequency counting curve
The histogram counts the relative frequency of observations falling into a
prescribed mesh, normalised so that the resulting function is a density. The relative frequency of observations in
this interval, obtained by dividing the count by the number of observations, is a good estimate of the probability in Equation (4.2.3), so that
$$P(X \in [-\frac{h}{2}, \frac{h}{2})) \simeq \frac{1}{n} \#(X_i \in [-\frac{h}{2}, \frac{h}{2}))$$
where $n$ is the sample size. Applying the mean value theorem to Equation (4.2.3), we obtain
$$\int_{-\frac{h}{2}}^{\frac{h}{2}} f(x) dx = f(\xi) h , \quad \xi \in [-\frac{h}{2}, \frac{h}{2})$$
so that
$$P(X \in [-\frac{h}{2}, \frac{h}{2})) = \int_{-\frac{h}{2}}^{\frac{h}{2}} f(x) dx = f(\xi) h$$
$$P(X \in [-\frac{h}{2}, \frac{h}{2})) \simeq \frac{1}{n} \#(X_i \in [-\frac{h}{2}, \frac{h}{2}))$$
Equating the two equations, we arrive at the density estimate
$$\hat{f}_h(x) = \frac{1}{nh} \#(X_i \in [-\frac{h}{2}, \frac{h}{2})) , \quad x \in [-\frac{h}{2}, \frac{h}{2})$$
The calculation of the histogram is characterised by the following two steps
1. divide the real line into bins
$$B_j = [x_0 + (j - 1)h, x_0 + jh) , \quad j \in \mathbb{Z}$$
2. count how many data fall into each bin
$$\hat{f}_h(x) = (nh)^{-1} \sum_{i=1}^{n} \sum_{j} I_{\{X_i \in B_j\}} I_{\{x \in B_j\}}$$

The histogram as a Maximum Likelihood Estimate
We want to find a density $\hat{f}$ maximising the likelihood in the observations
$$\prod_{i=1}^{n} \hat{f}(X_i)$$
However, this task is ill-posed, and we must restrict the class of densities in order to obtain a well-defined solution.
Varying the binwidth By varying the binwidth h we can get different shapes for the density fˆh (x)
• h → 0 : needleplot, very noisy representation of the data
• h = : smoother less noisy density estimate
• h → ∞ : box-shaped, overly smooth
We now need to choose the binwidth h in practice. It can be done by working out the statistics of the histogram.
Statistics of the histogram
We need to find out if the estimate $\hat{f}_h$ is unbiased, that is, if it matches on average the
unknown density. If $x \in B_j = [(j - 1)h, jh)$, the histogram for i.i.d. observations is given by
$$\hat{f}_h(x) = (nh)^{-1} \sum_{i=1}^{n} I_{\{X_i \in B_j\}}$$
where $n$ is the sample size.

Bias of the histogram
Since the $X_i$ are identically distributed, the expected value of the estimate is given by
$$E[\hat{f}_h(x)] = (nh)^{-1} \sum_{i=1}^{n} E[I_{\{X_i \in B_j\}}] = \frac{1}{h} \int_{(j-1)h}^{jh} f(u) du$$
We define the bias as
$$Bias(\hat{f}_h(x)) = E[\hat{f}_h(x)] - f(x) \qquad (4.2.4)$$
which becomes in our setting
$$Bias(\hat{f}_h(x)) = \frac{1}{h} \int_{B_j} f(u) du - f(x)$$
Rewriting the bias of the histogram in terms of the binwidth $h$ and the density $f$, we get
$$Bias(\hat{f}_h(x)) = \left( (j - \frac{1}{2})h - x \right) f'\left( (j - \frac{1}{2})h \right) + o(h) , \quad h \to 0$$
The stability of the estimate is measured by the variance which we are now going to calculate.

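Variance of the histogram
Since the indicator function $I_{\{X_i \in B_j\}}$ is a Bernoulli variable, the variance is given by
$$\begin{aligned}
Var(\hat{f}_h(x)) &= Var\left( (nh)^{-1} \sum_{i=1}^{n} I_{B_j}(X_i) \right) \\
&= (nh)^{-1} f(x) + o((nh)^{-1}) , \quad nh \to \infty
\end{aligned}$$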

Clearly, the variance decreases when $nh$ increases, while the bias decreases when $h$ decreases, so that we have a dilemma between the bias and the variance. Hence,
we are going to consider the Mean Squared Error (MSE) as a measure of accuracy of an estimator.
The Mean Squared Error of the histogram
The Mean Squared Error of the estimate is given by
$$\begin{aligned}
MSE(\hat{f}_h(x)) &= E[(\hat{f}_h(x) - f(x))^2] \\
&= Var(\hat{f}_h(x)) + Bias^2(\hat{f}_h(x)) \qquad (4.2.5) \\
&= \frac{1}{nh} f(x) + \left( (j - \frac{1}{2})h - x \right)^2 f'\left( (j - \frac{1}{2})h \right)^2 + o(h) + o(\frac{1}{nh})
\end{aligned}$$

The minimisation of the MSE with respect to $h$ defines a compromise between the problem of oversmoothing (if we
choose a large binwidth $h$ to reduce the variance) and undersmoothing (for the reduction of the bias by decreasing the
binwidth $h$). We can conclude that the histogram $\hat{f}_h(x)$ is a consistent estimator for $f(x)$ since
$$h \to 0 , \; nh \to \infty \quad \text{implies} \quad MSE(\hat{f}_h(x)) \to 0$$
and
$$\hat{f}_h(x) \xrightarrow{p} f(x)$$
However, the application of the MSE formula is difficult in practice, since it contains the unknown density function $f$
both in the variance and the squared bias. Moreover, the MSE only measures the accuracy of the estimate at one particular point $x$.

The Mean Integrated Squared Error of the histogram
Another measure we can consider is the Mean Integrated
Squared Error (MISE). It is a measure of the goodness of fit for the whole histogram defined as
$$MISE(\hat{f}_h) = \int_{-\infty}^{\infty} MSE(\hat{f}_h(x)) dx$$
measuring the average MSE deviation. The speed of convergence of the MISE in the asymptotic sense is given by
$$MISE(\hat{f}_h) = (nh)^{-1} + \frac{h^2}{12} \|f'\|_2^2 + o(h^2) + o(\frac{1}{nh})$$
where $\|f\|_2^2 = \int_{-\infty}^{\infty} |f(x)|^2 dx$. The leading term of the MISE is the asymptotic MISE given by
$$A\text{-}MISE(\hat{f}_h) = (nh)^{-1} + \frac{h^2}{12} \|f'\|_2^2$$
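We can minimise the A-MISE by differentiating it with respect to $h$ and equating the result to zero
$$\frac{\partial}{\partial h} A\text{-}MISE(\hat{f}_h) = 0$$
getting
$$h_0 = \left( \frac{6}{n \|f'\|_2^2} \right)^{\frac{1}{3}} \qquad (4.2.6)$$
Therefore, choosing theoretically
$$h_0 \sim n^{-\frac{1}{3}}$$
we obtain
$$MISE \sim n^{-\frac{2}{3}} >> n^{-1} , \quad \text{for } n \text{ sufficiently large}$$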
The calculation of $h_0$ requires the knowledge of the unknown parameter $\|f'\|_2^2$. One approach is to use the plug-in
method, which means that we take an estimate of $\|f'\|_2^2$ and plug it into the asymptotic formula in Equation (4.2.6).
However, it is difficult to estimate this functional form. A practical solution is to take a reference distribution for $f$
and to calculate $h_0$ by using this reference distribution.
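A minimal sketch of the histogram estimate and of a reference-distribution binwidth follows. The choice of a normal reference density is an assumption made here for illustration: for a normal density with standard deviation $\sigma$ one has $\|f'\|_2^2 = 1/(4\sqrt{\pi}\sigma^3)$, which is plugged into Equation (4.2.6). The function names and the simulated sample are hypothetical.

```python
import numpy as np

def histogram_density(x, data, h, x0=0.0):
    """Histogram estimate with bins B_j = [x0 + (j-1)h, x0 + jh)."""
    data = np.asarray(data, dtype=float)
    j_x = np.floor((np.atleast_1d(x) - x0) / h)        # bin of each evaluation point
    j_data = np.floor((data - x0) / h)                 # bin of each observation
    counts = (j_x[:, None] == j_data[None, :]).sum(axis=1)
    return counts / (len(data) * h)

def normal_reference_binwidth(data):
    """h0 from Equation (4.2.6) with ||f'||_2^2 taken from a normal density (assumption)."""
    data = np.asarray(data, dtype=float)
    sigma = data.std(ddof=1)
    norm_fprime_sq = 1.0 / (4.0 * np.sqrt(np.pi) * sigma ** 3)
    return (6.0 / (len(data) * norm_fprime_sq)) ** (1.0 / 3.0)

# Simulated sample (illustrative only).
rng = np.random.default_rng(2)
sample = rng.standard_normal(500)
h0 = normal_reference_binwidth(sample)                 # roughly 3.49 * sigma * n^(-1/3)
f_hat_at_zero = histogram_density(0.0, sample, h0)[0]
```

Changing `x0` moves the bin edges without changing the binwidth, which makes it easy to see how the estimate depends on the choice of origin.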
4.2.1.2

Smoothing the histogram by WARPing

One of the main criticisms of the histogram is its dependence on the choice of the origin. Histograms of the same data built with the same binwidth but different origins can look quite different, and if we use one
of these histograms for density estimation, the choice is arbitrary. To get rid of this shortcoming, we can average these
histograms, which is known in a generalised form as Weighted Averaging of Rounded Points (WARPing). It is based
on a smaller bin mesh, by discretising the data first into a finite grid of bins and then smoothing the binned data
$$B_{j,l} = \left[ (j - 1 + \frac{l}{M})h, (j + \frac{l}{M})h \right) , \quad l \in \{0, ..., M - 1\}$$
The bins $B_{j,l}$ are generated by shifting each $B_j$ by the amount of $\frac{lh}{M}$ to the right. We now have $M$ histograms based
on these shifted bins
$$\hat{f}_{h,l}(x) = (nh)^{-1} \sum_{i=1}^{n} \sum_{j} I_{\{x \in B_{j,l}\}} I_{\{X_i \in B_{j,l}\}}$$
The WARPing idea is to calculate the average over all these histograms
$$\hat{f}_h(x) = M^{-1} \sum_{l=0}^{M-1} (nh)^{-1} \sum_{i=1}^{n} \left\{ \sum_{j} I_{\{x \in B_{j,l}\}} I_{\{X_i \in B_{j,l}\}} \right\}$$
We can inspect the double sum over the two index functions more closely by assuming for a moment that $x$ and $X_i$
are given and fixed. In that case, we get
$$x \in \left[ (j - 1 + \frac{l}{M})h, (j - 1 + \frac{l+1}{M})h \right) = B^*_{j,l}$$
$$X_i \in \left[ (j - 1 + \frac{l+K}{M})h, (j - 1 + \frac{l+K+1}{M})h \right) = B^*_{j,l+K}$$
for $j, K \in \mathbb{Z}$, $l \in \{0, ..., M - 1\}$, where $B^*_{j,l}$ is generated by dividing each bin $B_j$ of binwidth $h$ into $M$ subbins of binwidth $\frac{h}{M} = \delta$. Hence, we get
$$\sum_{l=0}^{M-1} \sum_{j} I_{B_{j,l}}(X_i) I_{B_{j,l}}(x) = \sum_{z} I_{B^*_z}(x) \sum_{K=1-M}^{M-1} I_{B^*_{z+K}}(X_i) (M - |K|)$$
where $B^*_z = [\frac{z}{M}h, \frac{z+1}{M}h)$. This leads to the WARPed histogram

$$\begin{aligned}
\hat{f}_h(x) &= (nMh)^{-1} \sum_{i=1}^{n} \sum_{j} I_{B^*_j}(x) \sum_{K=1-M}^{M-1} I_{B^*_{j+K}}(X_i) (M - |K|) \\
&= (nh)^{-1} \sum_{j} I_{B^*_j}(x) \sum_{K=1-M}^{M-1} W_M(K) n_{j+K}
\end{aligned}$$
where $n_K = \sum_{i=1}^{n} I_{B^*_K}(X_i)$ and $W_M(K) = 1 - \frac{|K|}{M}$. The explicit specification of the weighting function $W_M(\bullet)$
allows us to approximate a bigger class of density estimators such as the Kernel.
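The identity behind WARPing can be verified numerically. The sketch below computes the estimate in two ways: from small-bin counts smoothed with the weights $W_M(K)$, and as the direct average of the $M$ shifted histograms. The function names, the simulated sample and the evaluation point are hypothetical, and the grid origin is taken to be zero.

```python
import numpy as np

def warp_density(x, data, h, M):
    """WARPed histogram: small-bin counts n_{z+K} smoothed with W_M(K) = 1 - |K|/M."""
    delta = h / M                                       # width of the small bins
    z_data = np.floor(np.asarray(data, dtype=float) / delta).astype(int)
    z_x = int(np.floor(x / delta))                      # small bin containing x
    K = np.arange(1 - M, M)
    counts = np.array([(z_data == z_x + k).sum() for k in K])
    weights = 1.0 - np.abs(K) / M
    return float((weights * counts).sum() / (len(data) * h))

def averaged_shifted_density(x, data, h, M):
    """Direct average of the M histograms whose origins are shifted by l*h/M."""
    data = np.asarray(data, dtype=float)
    total = 0.0
    for l in range(M):
        shift = l * h / M
        same_bin = np.floor((data - shift) / h) == np.floor((x - shift) / h)
        total += same_bin.sum() / (len(data) * h)
    return total / M

# Simulated sample (illustrative only).
rng = np.random.default_rng(3)
sample = rng.standard_normal(400)
f_warp = warp_density(0.37, sample, h=0.5, M=5)
f_avg = averaged_shifted_density(0.37, sample, h=0.5, M=5)
```

The weighted version only needs the small-bin counts, so the data are binned once and the smoothing is then done on the grid.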

4.2.2

Kernel density estimation

Rosenblatt [1956] proposed putting smooth Kernel weights in each of the observations. That is, around each observation $X_i$ a Kernel function $K_h(\bullet - X_i)$ is centred. The approach is similar to the histogram but, instead of averaging needles, one averages Kernel
functions, and we have a smoothing parameter, called the bandwidth $h$, regulating the degree of smoothness for Kernel
smoothers.
4.2.2.1

Definition of the Kernel estimate

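Given a general Kernel $K$, which is a real function, the rescaled Kernel $K_h$ is defined as
$$K_h(x) = \frac{1}{h} K(\frac{x}{h})$$
Averaging over these Kernel functions in the observations leads to the Kernel density estimator
$$\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i) = \frac{1}{nh} \sum_{i=1}^{n} K(\frac{x - X_i}{h})$$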
See Hardle [1991] for a short survey of some other Kernel functions. Some of the properties of the Kernel functions
are
• Kernel functions are symmetric around 0 and integrate to 1
• since the Kernel is a density function, the Kernel estimate is a density too
$$\int K(x) dx = 1 \quad \text{implies} \quad \int \hat{f}_h(x) dx = 1$$
Varying the Kernel
The property of smoothness of the Kernel $K$ is inherited by the corresponding estimate $\hat{f}_h(x)$

• the Uniform Kernel is not continuous at −1 and 1, so that $\hat{f}_h(x)$ is discontinuous and not differentiable at
$X_i - h$ and $X_i + h$.
• the Triangle Kernel is continuous but not differentiable in −1, 0, and 1, so that fˆh (x) is continuous but not
differentiable in Xi − h, Xi , and Xi + h.
• the Quartic Kernel is continuous and differentiable everywhere.
Hence, we can approximate f by using different Kernels which gives qualitatively different estimates fˆh .
Varying the bandwidth We consider the bandwidth h in Kernel smoothing such that
• h → 0 : needleplot, very noisy representation of the data
• h small : a smoother, less noisy density estimate
• h big : a very smooth density estimate
• h → ∞ : a very flat estimate of roughly the shape of the chosen Kernel
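The effect of the Kernel and of the bandwidth can be explored with a short sketch of the Kernel density estimator. The Quartic Kernel is written in its usual form $\frac{15}{16}(1 - u^2)^2$ on $[-1, 1]$, a Gaussian Kernel is added for comparison, and the function names and the simulated sample are hypothetical.

```python
import numpy as np

def quartic(u):
    """Quartic Kernel, 15/16 (1 - u^2)^2 on [-1, 1] and zero elsewhere."""
    return np.where(np.abs(u) > 1.0, 0.0, 15.0 / 16.0 * (1.0 - u ** 2) ** 2)

def gaussian(u):
    """Gaussian Kernel (standard normal density)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde(x, data, h, kernel=quartic):
    """Kernel density estimate f_hat_h(x) = (nh)^{-1} sum_i K((x - X_i) / h)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    u = (x[:, None] - data[None, :]) / h
    return kernel(u).sum(axis=1) / (len(data) * h)

# Simulated sample (illustrative only).
rng = np.random.default_rng(4)
sample = rng.standard_normal(300)
grid = np.linspace(-8.0, 8.0, 3201)
f_rough = kde(grid, sample, h=0.05)     # small bandwidth: noisy estimate
f_smooth = kde(grid, sample, h=0.5)     # larger bandwidth: smoother estimate
```

Whatever the bandwidth, the estimate integrates to one because the Kernel does; only its roughness changes with $h$.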
4.2.2.2

Statistics of the Kernel density

Kernel density estimates are based on two parameters
1. the bandwidth h
2. the Kernel density function
We now provide some guidelines on how to choose h and K in order to obtain a good estimate with respect to a given
goodness of fit criterion. We are interested in the extent of the uncertainty or at what speed the convergence of the
smoother actually happens. In general, the extent of this uncertainty is expressed in terms of the sampling variance of
the estimator, but in nonparametric smoothing situation it is not enough as there is also a bias to consider. Hence, we
should consider the pointwise mean squared error (MSE), the sum of variance and squared bias. A variety of distance
measures exist, both uniform and pointwise, but we will only describe the mean squared error and the mean integrated
squared error.
Bias of the Kernel density
We first check the asymptotic unbiasedness of $\hat{f}_h(x)$ using Equation (4.2.4). The expected value of the estimate is
$$E[\hat{f}_h(x)] = \frac{1}{n} \sum_{i=1}^{n} E[K_h(x - X_i)]$$
with the property
$$\text{if } h \to 0 , \quad E[\hat{f}_h(x)] \to f(x)$$
The estimate is thus asymptotically unbiased, when the bandwidth $h$ converges to zero. We can look at bias analysis
by using a Taylor expansion of $f(x + sh)$ in $x$, getting
$$Bias(\hat{f}_h(x)) = \frac{h^2}{2} f''(x) \mu_2(K) + o(h^2) , \quad h \to 0$$
where $\mu_2(K) = \int_{-\infty}^{\infty} u^2 K(u) du$. The bias is quadratic in $h$. Hence, we have to choose $h$ small enough to reduce the
bias. The size of the bias depends on the curvature of $f$ in $x$, that is, on the absolute value of $f''(x)$.
Variance of the Kernel density
We compute the variance of the Kernel density estimation to get insight into the stability of such estimates
$$\begin{aligned}
Var(\hat{f}_h(x)) &= n^{-2} Var\left( \sum_{i=1}^{n} K_h(x - X_i) \right) \\
&= (nh)^{-1} \|K\|_2^2 f(x) + o((nh)^{-1}) , \quad nh \to \infty
\end{aligned}$$

The variance being proportional to (nh)−1 , we want to choose h large. This contradicts the aim of decreasing the
bias by decreasing the bandwidth h. Therefore we must consider h as a compromise of both effects, namely the
M SE(fˆh (x)) or M ISE(fˆh (x)).
The Mean Squared Error of the Kernel density
We are now looking at the Mean Squared Error (MSE) of the Kernel density defined in Equation (4.2.5)
$$MSE(\hat{f}_h(x)) = \frac{1}{nh} f(x) \|K\|_2^2 + \frac{h^4}{4} (f''(x) \mu_2(K))^2 + o((nh)^{-1}) + o(h^4) , \quad h \to 0 , \; nh \to \infty$$
The MSE converges to zero if $h \to 0$, $nh \to \infty$. Thus, the Kernel density estimate is consistent
$$\hat{f}_h(x) \xrightarrow{p} f(x)$$

We define the optimal bandwidth $h_0$ for estimating $f(x)$ by
$$h_0 = \arg \min_{h} MSE(\hat{f}_h(x))$$
We can rewrite the MSE as
$$MSE(\hat{f}_h(x)) = (nh)^{-1} c_1 + \frac{1}{4} h^4 c_2$$
with
$$c_1 = f(x) \|K\|_2^2 , \quad c_2 = (f''(x))^2 (\mu_2(K))^2$$
Setting the derivative $\frac{\partial}{\partial h} MSE(\hat{f}_h(x))$ to zero, yields the optimum $h_0$ as
$$h_0 = \left( \frac{c_1}{c_2 n} \right)^{\frac{1}{5}} = \left( \frac{f(x) \|K\|_2^2}{(f''(x))^2 (\mu_2(K))^2 n} \right)^{\frac{1}{5}}$$

Thus, the optimal rate of convergence of the MSE is given by
$$MSE(\hat{f}_{h_0}(x)) = \frac{5}{4} \Big(\frac{f(x) \|K\|_2^2}{n}\Big)^{4/5} \big((f''(x) \mu_2(K))^2\big)^{1/5}$$
Again, the formula includes the unknown functions $f(\bullet)$ and $f''(\bullet)$.
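As an illustrative sketch only, the pointwise optimum can be checked numerically in a case where $f$ is known. The example assumes a standard normal density and a Gaussian kernel, for which $\|K\|_2^2 = 1/(2\sqrt{\pi})$ and $\mu_2(K) = 1$; the function names are hypothetical and the grid search is just a sanity check of the closed form.

```python
import numpy as np

# Assumptions for this sketch: f is the standard normal density and K is the
# Gaussian kernel, so that ||K||_2^2 = 1/(2*sqrt(pi)) and mu_2(K) = 1.
K_NORM_SQ = 1.0 / (2.0 * np.sqrt(np.pi))
MU2_K = 1.0


def normal_pdf(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)


def normal_pdf_second_derivative(x):
    # Second derivative of the standard normal density.
    return (x**2 - 1.0) * normal_pdf(x)


def amse(h, x, n):
    """Leading terms of MSE(f_hat_h(x)): variance plus squared bias."""
    variance = normal_pdf(x) * K_NORM_SQ / (n * h)
    squared_bias = 0.25 * h**4 * (normal_pdf_second_derivative(x) * MU2_K) ** 2
    return variance + squared_bias


def optimal_bandwidth(x, n):
    """Closed-form minimiser h_0 = (c_1 / (c_2 n))^(1/5)."""
    c1 = normal_pdf(x) * K_NORM_SQ
    c2 = (normal_pdf_second_derivative(x) * MU2_K) ** 2
    return (c1 / (c2 * n)) ** 0.2


x0, n = 0.0, 1000
h0 = optimal_bandwidth(x0, n)

# Brute-force check: minimise the same expression on a fine grid of bandwidths.
grid = np.linspace(0.01, 1.0, 100_000)
h_grid = grid[np.argmin(amse(grid, x0, n))]
```

The closed form breaks down where $f''(x) = 0$ (here at $x = \pm 1$), since the leading bias term vanishes and $c_2 = 0$.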
The Mean Integrated Squared Error of the Kernel density We are now looking at the Mean Integrated Squared Error (MISE) of the Kernel density
$$MISE(\hat{f}_h) = \frac{1}{nh} \|K\|_2^2 + \frac{h^4}{4} (\mu_2(K))^2 \|f''\|_2^2 + o((nh)^{-1}) + o(h^4) , \quad h \to 0 , \; nh \to \infty$$
The optimal bandwidth $h_0$ which minimises the A-MISE with respect to the parameter $h$ is given by
$$h_0 = \Big(\frac{\|K\|_2^2}{\|f''\|_2^2 (\mu_2(K))^2 n}\Big)^{1/5}$$
The optimal rate of convergence of the MISE is given by
$$A\text{-}MISE(\hat{f}_{h_0}) = \frac{5}{4} (\|K\|_2^2)^{4/5} (\mu_2(K) \|f''\|_2)^{2/5} n^{-4/5}$$
We have not escaped from the circulus vitiosus of estimating $f$, since we now need knowledge of a function of $f$, here $f''$. Fortunately, there exist ways of computing good bandwidths $h$, even if we have no knowledge of $f$. A comparison of the speed of convergence of the MISE for histogram and Kernel density estimation is given by Hardle. The speed of convergence is faster for the Kernel density estimate than for the histogram.

4.2.2.3

Confidence intervals and confidence bands

To obtain confidence intervals, we derive the asymptotic distribution of the kernel smoothers and use either their asymptotic quantiles or bootstrap approximations for these quantiles. The estimate $\hat{f}_h(x)$ is asymptotically normally distributed as $n$ increases and the bandwidth $h$ decreases in the order of $n^{-1/5}$. Using the bias and the variance of the Kernel, we can derive the following theorem.
Theorem 4.2.1 Suppose that $f''(x)$ exists and $h_n = c n^{-1/5}$. Then the Kernel density estimate $\hat{f}_{h_n}(x)$ is asymptotically normally distributed
$$n^{2/5} (\hat{f}_{h_n}(x) - f(x)) \to N\Big(\frac{c^2}{2} f''(x) \mu_2(K) , \; c^{-1} f(x) \|K\|_2^2\Big) , \quad n \to \infty$$
This theorem enables us to compute a confidence interval for $f(x)$. An asymptotic $(1 - a)$ confidence interval for $f(x)$ is given by
$$\Big[ \hat{f}_h(x) - n^{-2/5} \Big(\frac{c^2}{2} f''(x) \mu_2(K) + d_a\Big) , \; \hat{f}_h(x) - n^{-2/5} \Big(\frac{c^2}{2} f''(x) \mu_2(K) - d_a\Big) \Big]$$
with $d_a = u_{1 - \frac{a}{2}} \sqrt{c^{-1} f(x) \|K\|_2^2}$, where $u_{1 - \frac{a}{2}}$ is the $(1 - \frac{a}{2})$ quantile of a standard normal distribution. Again, it includes the functions $f(x)$ and $f''(x)$. A more practical way is to replace $f(x)$ and $f''(x)$ with estimates $\hat{f}_h(x)$ and $\hat{f}''_g(x)$ or corresponding values of reference distributions. The estimate $\hat{f}''_g(x)$ can be defined as $[\hat{f}_g(x)]''$, where the bandwidth $g$ is not the same as $h$. If we use a value of two as quantile for an asymptotic 95% confidence interval
$$\Big[ \hat{f}_h(x) - n^{-2/5} \Big(\frac{c^2}{2} \hat{f}''_g(x) \mu_2(K) + 2 \sqrt{c^{-1} \hat{f}_h(x) \|K\|_2^2}\Big) , \; \hat{f}_h(x) - n^{-2/5} \Big(\frac{c^2}{2} \hat{f}''_g(x) \mu_2(K) - 2 \sqrt{c^{-1} \hat{f}_h(x) \|K\|_2^2}\Big) \Big]$$

this technique yields a 95% confidence interval for $f(x)$ and not for the whole function. In order to get a confidence band for the whole function, Bickel et al. suggested choosing a bandwidth smaller than the order $n^{-1/5}$ to reduce the bias, such that the limiting distribution of the estimate has an expectation equal to $f(x)$.

4.2.3

Bandwidth selection in practice

The choice of the bandwidth $h$ is the main problem of the Kernel density estimation. So far we have derived formulas for optimal bandwidths that minimise the MSE or MISE, but they employ the unknown functions $f(\bullet)$ and $f''(\bullet)$. We now consider how to obtain a reasonable choice of $h$ when we do not know $f(\bullet)$.
4.2.3.1

Kernel estimation using reference distribution

We can adopt the technique using a reference distribution, described above for the choice of the binwidth of the histogram. We try to estimate $\|f''\|_2^2$, assuming $f$ to belong to a prespecified class of density functions. For example, we can choose the normal distribution with parameters $\mu$ and $\sigma$
$$\|f''\|_2^2 = \sigma^{-5} \int (Q''(x))^2 dx = \sigma^{-5} \frac{3}{8\sqrt{\pi}} \approx 0.212 \sigma^{-5}$$
where $Q$ denotes the standard normal density. We can then estimate $\|f''\|_2^2$ through an estimator $\hat{\sigma}$ for $\sigma$. For instance, if we take the Gaussian Kernel we obtain the following rule of thumb
$$\hat{h}_0 = \Big(\frac{\|Q\|_2^2}{\|\hat{f}''\|_2^2 \mu_2^2(Q) n}\Big)^{1/5} = \Big(\frac{4 \hat{\sigma}^5}{3n}\Big)^{1/5} \approx 1.06 \hat{\sigma} n^{-1/5}$$
Note, we can use instead of $\hat{\sigma}$ a more robust estimate for the scale parameter of the distribution. For example, we can take the interquartile range $\hat{R}$ defined as
$$\hat{R} = X_{[0.75n]} - X_{[0.25n]}$$
The rule of thumb is then modified to
$$\hat{h}_0 = 0.79 \hat{R} n^{-1/5}$$
We can combine both rules to get the better rule
$$\hat{h}_0 = 1.06 \min\Big(\hat{\sigma}, \frac{\hat{R}}{1.34}\Big) n^{-1/5}$$
since for Gaussian data $\hat{R} \approx 1.34 \hat{\sigma}$.
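The combined rule of thumb is simple to compute. The following is a minimal sketch, assuming a Gaussian kernel and simulated normal data; the function names are hypothetical, and `np.percentile` interpolates between order statistics, which differs slightly from the $X_{[0.75n]} - X_{[0.25n]}$ definition above.

```python
import numpy as np


def rule_of_thumb_bandwidth(sample):
    """h = 1.06 * min(sigma_hat, R_hat / 1.34) * n^(-1/5) for a Gaussian kernel."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    sigma_hat = sample.std(ddof=1)
    # Interquartile range (interpolated percentiles, an assumption of this sketch).
    q75, q25 = np.percentile(sample, [75, 25])
    r_hat = q75 - q25
    return 1.06 * min(sigma_hat, r_hat / 1.34) * n ** (-0.2)


def gaussian_kde(x_eval, sample, h):
    """Kernel density estimate with Gaussian kernel and bandwidth h."""
    u = (np.asarray(x_eval, dtype=float)[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (sample.size * h * np.sqrt(2.0 * np.pi))


rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=1000)
h_rot = rule_of_thumb_bandwidth(data)
density_at_zero = gaussian_kde([0.0], data, h_rot)[0]
```

For standard normal data of this size the bandwidth should come out near $1.06 \cdot 1000^{-1/5} \approx 0.27$, and the estimated density at zero should be in the vicinity of $1/\sqrt{2\pi} \approx 0.40$.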
4.2.3.2

Plug-in methods

Another approach is to directly estimate $\|f''\|_2^2$, but doing so we move the problem from the estimation of $f$ to the estimation of $f''$. The plug-in procedure is based on the asymptotic expansion of the squared error of kernel smoothers.


4.2.3.3

Cross-validation

Maximum likelihood cross-validation We want to test for a specific $h$ the hypothesis
$$\hat{f}_h(x) = f(x) \quad \text{vs.} \quad \hat{f}_h(x) \neq f(x)$$
The likelihood ratio test would be based on the test statistic $\frac{f(x)}{\hat{f}_h(x)}$, which should be close to 1, or on its average over $X$, $E_X[\log(\frac{f}{\hat{f}_h})(X)]$, which should be close to 0. Thus, a good bandwidth minimising this measure of accuracy is in effect optimising the Kullback-Leibler information
$$d_{KL}(f, \hat{f}_h) = \int \log\Big(\frac{f}{\hat{f}_h}\Big)(x) f(x) dx$$
We are not able to compute $d_{KL}(f, \hat{f}_h)$ from the data, since it requires the knowledge of $f$. However, from a theoretical point of view, we can investigate this distance for the choice of an appropriate bandwidth $h$ minimising $d_{KL}(f, \hat{f}_h)$. If we are given additional observations $X_i$, the likelihood for these observations, $\prod_i \hat{f}_h(X_i)$, for different $h$ would indicate which value of $h$ is preferable, since the normalised logarithm of this statistic is, up to a term independent of $h$, close to $-d_{KL}(f, \hat{f}_h)$. In the case where we do not have additional observations, we can base the estimate $\hat{f}_h$ on the subset $\{X_j\}_{j \neq i}$ and calculate the likelihood for $X_i$. Denote the Leave-One-Out estimate by
$$\hat{f}_{h,i}(X_i) = (n - 1)^{-1} h^{-1} \sum_{j \neq i} K\Big(\frac{X_i - X_j}{h}\Big)$$
The likelihood is
$$\prod_{i=1}^{n} \hat{f}_{h,i}(X_i) = (n - 1)^{-n} h^{-n} \prod_{i=1}^{n} \sum_{j \neq i} K\Big(\frac{X_i - X_j}{h}\Big)$$
We take the logarithm of this statistic normalised with the factor $n^{-1}$ to get the maximum likelihood CV
$$CV_{KL}(h) = n^{-1} \sum_{i=1}^{n} \log[\hat{f}_{h,i}(X_i)] = n^{-1} \sum_{i=1}^{n} \log\Big[\sum_{j \neq i} K\Big(\frac{X_i - X_j}{h}\Big)\Big] - \log[(n - 1)h]$$
so that
$$\hat{h}_{KL} = \arg\max CV_{KL}(h)$$
and
$$E[CV_{KL}] \approx -E[d_{KL}(f, \hat{f}_h)] + \int \log[f(x)] f(x) dx$$
For more details see Hall [1982].
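The score $CV_{KL}(h)$ can be evaluated directly from the pairwise kernel values. Below is a minimal sketch, assuming a Gaussian kernel, simulated normal data and a plain grid search over $h$; the function names and the grid are illustrative choices, not part of the original text.

```python
import numpy as np


def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)


def cv_kl(h, sample):
    """Maximum likelihood cross-validation score CV_KL(h)."""
    n = sample.size
    u = (sample[:, None] - sample[None, :]) / h
    k = gaussian_kernel(u)
    np.fill_diagonal(k, 0.0)  # leave observation i out of its own estimate
    loo_sums = k.sum(axis=1)  # sum over j != i of K((X_i - X_j) / h)
    return np.mean(np.log(loo_sums)) - np.log((n - 1) * h)


rng = np.random.default_rng(1)
sample = rng.normal(size=200)
bandwidths = np.linspace(0.05, 2.0, 80)
scores = np.array([cv_kl(h, sample) for h in bandwidths])
h_kl = bandwidths[np.argmax(scores)]
```

With a compactly supported kernel an isolated observation can make a leave-one-out sum equal to zero, so the logarithm would need guarding; the Gaussian kernel avoids this in the sketch.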
Least-squares cross-validation We consider an alternative distance measure between $\hat{f}$ and $f$ called the Integrated Squared Error (ISE) defined as
$$d_I(h) = \int (\hat{f}_h - f)^2(x) dx$$
which is a quadratic measure of accuracy. Hence, we get
$$d_I(h) - \int f^2(x) dx = \int \hat{f}_h^2(x) dx - 2 \int (\hat{f}_h f)(x) dx$$


and
$$\int (\hat{f}_h f)(x) dx = E_X[\hat{f}_h(X)]$$
The leave-one-out estimate of this expectation is
$$\hat{E}_X[\hat{f}_h(X)] = n^{-1} \sum_{i=1}^{n} \hat{f}_{h,i}(X_i)$$
A good bandwidth $h$ is determined by minimising the right hand side of the above equation using the leave-one-out estimate. This leads to the least-squares cross-validation
$$CV(h) = \int \hat{f}_h^2(x) dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{h,i}(X_i)$$
The bandwidth minimising this function is
$$\hat{h}_{CV} = \arg\min CV(h)$$
Scott et al. [1987] call the function $CV$ an unbiased cross-validation criterion, since
$$E[CV(h)] = MISE(\hat{f}_h) - \|f\|_2^2$$
A sequence of bandwidths $\hat{h}_n = h(X_1, .., X_n)$ is said to be asymptotically optimal if
$$\frac{d_I(\hat{h}_n)}{\inf_{h \geq 0} d_I(h)} \to 1 , \quad n \to \infty$$
If the density $f$ is bounded, then $\hat{h}_{CV}$ is asymptotically optimal. Note, the minimisation of $CV(h)$ is independent from the order of differentiability $p$ of $f$. So, this technique is more general to apply than the plug-in method, which requires $f$ to be exactly of the same order of differentiability $p$. For the computation of the score function, note that
$$\int \hat{f}_h^2(x) dx = n^{-2} h^{-1} \sum_{i=1}^{n} \sum_{j=1}^{n} K * K\Big(\frac{X_j - X_i}{h}\Big)$$
where $K * K(u)$ is the convolution of the Kernel function $K$ with itself. As a result, for a symmetric Kernel we get
$$CV(h) = n^{-2} h^{-1} \sum_{i=1}^{n} \sum_{j=1}^{n} K * K\Big(\frac{X_j - X_i}{h}\Big) - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{h,i}(X_i)$$
$$= \frac{2}{n^2 h} \Big[ \frac{n}{2} K * K(0) + \sum_{i < j} \Big( K * K\Big(\frac{X_j - X_i}{h}\Big) - \frac{2n}{n - 1} K\Big(\frac{X_j - X_i}{h}\Big) \Big) \Big]$$
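The score function above can be sketched in a few lines. The example assumes a Gaussian kernel, for which $K * K$ is the $N(0, 2)$ density, and simulated normal data; the function names are hypothetical. A Riemann sum on a fine grid is included only to check the closed form of $\int \hat{f}_h^2(x) dx$.

```python
import numpy as np


def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)


def gaussian_kernel_convolution(u):
    # K*K for the Gaussian kernel is the N(0, 2) density.
    return np.exp(-0.25 * u**2) / np.sqrt(4.0 * np.pi)


def integral_of_squared_estimate(h, sample):
    """Closed form of the integral of f_hat_h^2 via the convolution K*K."""
    n = sample.size
    u = (sample[:, None] - sample[None, :]) / h
    return gaussian_kernel_convolution(u).sum() / (n**2 * h)


def cv_least_squares(h, sample):
    """CV(h) = int f_hat_h^2 - (2/n) sum_i f_hat_{h,i}(X_i)."""
    n = sample.size
    u = (sample[:, None] - sample[None, :]) / h
    k = gaussian_kernel(u)
    np.fill_diagonal(k, 0.0)  # leave-one-out: drop the term j = i
    loo = k.sum(axis=1) / ((n - 1) * h)
    return integral_of_squared_estimate(h, sample) - 2.0 * loo.mean()


rng = np.random.default_rng(2)
sample = rng.normal(size=200)
bandwidths = np.linspace(0.05, 2.0, 80)
scores = np.array([cv_least_squares(h, sample) for h in bandwidths])
h_cv = bandwidths[np.argmin(scores)]

# Numerical check of the closed form on a fine grid (simple Riemann sum).
h_check = 0.3
x_grid = np.linspace(-10.0, 10.0, 20001)
f_hat = gaussian_kernel((x_grid[:, None] - sample[None, :]) / h_check).sum(axis=1)
f_hat /= sample.size * h_check
numeric_integral = np.sum(f_hat**2) * (x_grid[1] - x_grid[0])
```

The pairwise matrices cost $O(n^2)$ memory and time per bandwidth, which is acceptable for a few hundred observations but would need binning or an FFT-based approach for larger samples.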

Biased cross-validation The biased cross-validation introduced by Scott et al. [1987] is based on the idea of a direct estimate of $A\text{-}MISE(\hat{f}_h)$ given by
$$A\text{-}MISE(\hat{f}_h) = (nh)^{-1} \|K\|_2^2 + \frac{h^4}{4} \mu_2^2(K) \|f''\|_2^2$$
where we have to estimate $\|f''\|_2^2$. So, we get


$$BCV_1(h) = (nh)^{-1} \|K\|_2^2 + \frac{h^4}{4} \mu_2^2(K) \|\hat{f}''_h\|_2^2$$
The minimisation of the $MISE(\hat{f}_h)$ requires a sequence of bandwidths proportional to $n^{-1/5}$. We can use a bandwidth of this order for the optimisation of $BCV_1(h)$
$$Var(\hat{f}''_h(x)) = Var\Big(n^{-1} h^{-3} \sum_{i=1}^{n} K''\Big(\frac{x - X_i}{h}\Big)\Big) \sim n^{-1} h^{-5} \|K''\|_2^2$$
Hence, the variance of $\hat{f}''_h$ does not converge to zero for this choice of $h \sim n^{-1/5}$, so that $BCV_1(h)$ can not approximate the $MISE(\hat{f}_h)$. This is because the same bandwidth $h$ is used for the estimation of $\|f''\|_2^2$ and that of $f$. Hence, we have to employ different bandwidths. For this bias in the estimation of the $L_2$ norm, the method is called biased CV. We have a formula for the expectation of $\|\hat{f}''_h\|_2^2$ given by Scott et al. [1987]
$$E[\|\hat{f}''_h\|_2^2] = \|f''\|_2^2 + \frac{1}{nh^5} \|K''\|_2^2 + o(h^2)$$
Therefore, we correct the above bias by
$$\widetilde{\|f''\|_2^2} = \|\hat{f}''_h\|_2^2 - \frac{1}{nh^5} \|K''\|_2^2$$
It is asymptotically unbiased when we let $h \sim n^{-1/5} \to 0$. The biased cross-validation is given by
$$BCV(h) = \frac{1}{nh} \|K\|_2^2 + \frac{h^4}{4} \mu_2^2(K) \Big( \|\hat{f}''_h\|_2^2 - \frac{1}{nh^5} \|K''\|_2^2 \Big)$$
where
$$\hat{h}_{BCV} = \arg\min BCV(h)$$
is an estimate for the optimal bandwidth $\hat{h}_0$ minimising $d_I(h)$. Scott et al. [1987] showed that $\hat{h}_{BCV}$ is asymptotically optimal. The bandwidth $\hat{h}_{BCV}$ has a smaller standard deviation than $\hat{h}_{CV}$ and hence gives satisfying results for the estimation of the A-MISE. On the other hand, for some skewed distributions biased cross-validation tends to oversmooth, whereas the minimiser of $CV(h)$ is still quite close to the A-MISE optimal bandwidth.

4.2.4

Nonparametric regression

A regression curve fitting a relationship between variables $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$, where the former is the explanatory variable and the latter is the response variable, is commonly modelled as
$$Y_i = m(X_i) + \epsilon_i , \quad i = 1, .., n$$
where $\epsilon$ is a random variable denoting the variation of $Y$ around $m(X)$, the mean regression curve $E[Y|X = x]$, when we try to approximate the mean response function $m$. By reducing the observational errors, we can concentrate on important details of the mean dependence of $Y$ on $X$. This curve approximation is called smoothing. Approximating the mean function can be done in two ways. On one hand, the parametric approach assumes that the mean curve $m$ has some prespecified functional form (for example, a line with unknown slope and intercept). On the other hand, we can try to estimate $m$ nonparametrically without reference to a specific form. In the former, the functional form is fully described by a finite set of parameters, which is not the case in the latter, offering more flexibility for analysing unknown regression relationships. For regression curve fitting we are interested in weighting the response variable $Y$ in a certain neighbourhood of $x$. Hence, we weight the observations $Y_i$ depending on the distance of $X_i$ to $x$ using the estimator


$$\hat{m}(x) = n^{-1} \sum_{i=1}^{n} W_n(x; X_1, .., X_n) Y_i$$
Since in general most of the weight $W_n(x; X_1, .., X_n)$ is given to the observation $X_i$, we can abbreviate it to $W_{ni}(x)$, so that
$$\hat{m}(x) = n^{-1} \sum_{i=1}^{n} W_{ni}(x) Y_i$$
where $\{W_{ni}(x)\}_{i=1}^n$ denotes a sequence of weights which may depend on the whole vector $\{X_i\}_{i=1}^n$. We call smoother the regression estimator $\hat{m}_h(x)$, and smooth the outcome of the smoothing procedure. We consider the random design model, where the $X$-variables have been randomly generated, and we let $\{(X_i, Y_i)\}_{i=1}^n$ be independent, identically distributed variables. We concentrate on the average dependence of $Y$ on $X = x$, that is, we try to estimate the conditional mean curve
$$m(x) = E[Y|X = x] = \frac{\int y f(x, y) dy}{f(x)}$$
where $f(x, y)$ is the joint density of $(X, Y)$, and $f(x) = \int f(x, y) dy$ is the marginal density of $X$. Following Hardle [1990], we now present some common choices for the weights $W_{ni}(\bullet)$.
4.2.4.1

The Nadaraya-Watson estimator

We want to find an estimate of the conditional expectation $m(x) = E[Y|X = x]$ where
$$m(x) = \frac{\int y f(x, y) dy}{\int f(x, y) dy} = \int y f(y|x) dy$$
since $f(x) = \int f(x, y) dy$ and $f(y|x) = \frac{f(x, y)}{f(x)}$ is the conditional density of $Y$ given $X = x$. While various smoothing methods exist, all smoothing methods are in an asymptotic sense equivalent to kernel smoothing. We therefore choose the Kernel density $K$ to represent the weight sequence $\{W_{ni}(x)\}_{i=1}^n$. We saw in Section (4.2.2) how to estimate the denominator using the Kernel density estimate. For the numerator, we could estimate the joint density $f(x, y)$ by using the multiplicative Kernel
$$\hat{f}_{h_1, h_2}(x, y) = n^{-1} \sum_{i=1}^{n} K_{h_1}(x - X_i) K_{h_2}(y - Y_i)$$
We can work out an estimate of the numerator as
$$\int y \hat{f}_{h_1, h_2}(x, y) dy = n^{-1} \sum_{i=1}^{n} K_{h_1}(x - X_i) Y_i$$
Hence, employing the same bandwidth $h$ for both estimates, we can estimate the conditional expectation $m(x)$ by combining the estimates of the numerator and the denominator. This method was proposed by Nadaraya [1964] and Watson [1964] and gave the Nadaraya-Watson estimator
$$\hat{m}_h(x) = \frac{n^{-1} \sum_{i=1}^{n} K_h(x - X_i) Y_i}{n^{-1} \sum_{j=1}^{n} K_h(x - X_j)}$$
In terms of the general nonparametric regression curve estimate, the weights have the form


$$W_{hi}(x) = \frac{h^{-1} K(\frac{x - X_i}{h})}{\hat{f}_h(x)}$$
where the shape of the Kernel weights is determined by $K$, and the size of the weights is parametrised by $h$.
A variety of Kernel functions exist, but both practical and theoretical considerations limit the choice. A commonly used Kernel function is of the parabolic shape with support $[-1, 1]$ (Epanechnikov)
$$K(u) = 0.75 (1 - u^2) I_{\{|u| \leq 1\}}$$
but it is not differentiable at $u = \pm 1$. The Kernel smoother is not defined for a bandwidth with $\hat{f}_h(x) = 0$. If such a case occurs, one defines $\hat{m}_h(x)$ as being zero. Assuming that the kernel estimator is only evaluated at the observations $\{X_i\}_{i=1}^n$, then as $h \to 0$, we get
$$\hat{m}_h(X_i) \to \frac{K(0) Y_i}{K(0)} = Y_i$$
so that small bandwidths reproduce the data. In the case where $h \to \infty$, and $K$ has support $[-1, 1]$, then $K(\frac{x - X_i}{h}) \to K(0)$, and
$$\hat{m}_h(x) \to \frac{n^{-1} \sum_{i=1}^{n} K(0) Y_i}{n^{-1} \sum_{i=1}^{n} K(0)} = n^{-1} \sum_{i=1}^{n} Y_i$$
resulting in an oversmooth curve, the average of the response variables.
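The estimator and its two limiting cases can be illustrated with a short sketch. It assumes the Epanechnikov kernel given above and a small deterministic data set; the function names are hypothetical, and the convention of returning zero where $\hat{f}_h(x) = 0$ follows the text.

```python
import numpy as np


def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)


def nadaraya_watson(x_eval, x_obs, y_obs, h, kernel=epanechnikov):
    """Nadaraya-Watson estimate m_hat_h at the points x_eval."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    u = (x_eval[:, None] - x_obs[None, :]) / h
    k = kernel(u)
    numerator = k @ y_obs
    denominator = k.sum(axis=1)
    # Convention from the text: the smoother is set to zero where f_hat_h(x) = 0.
    safe = np.where(denominator > 0.0, denominator, 1.0)
    return np.where(denominator > 0.0, numerator / safe, 0.0)


x_obs = np.linspace(0.0, 1.0, 11)
y_obs = np.sin(2.0 * np.pi * x_obs)

m_small_h = nadaraya_watson(x_obs, x_obs, y_obs, h=0.01)  # reproduces the data
m_large_h = nadaraya_watson(x_obs, x_obs, y_obs, h=1e6)   # tends to the mean of Y
m_outside = nadaraya_watson([5.0], x_obs, y_obs, h=0.01)  # no data nearby
```

The common factors $n^{-1}$ and $h^{-1}$ cancel between numerator and denominator, which is why the sketch works with the unscaled kernel values.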
Statistics of the Nadaraya-Watson estimator The numerator and denominator of this statistic are both random variables so that their analysis is done separately. We first define
$$r(x) = \int y f(x, y) dy = m(x) f(x) \qquad (4.2.7)$$
The estimate is
$$\hat{r}_h(x) = n^{-1} \sum_{i=1}^{n} K_h(x - X_i) Y_i$$
The regression curve estimate is thus given by
$$\hat{m}_h(x) = \frac{\hat{r}_h(x)}{\hat{f}_h(x)}$$
We already analysed the properties of $\hat{f}_h(x)$, and can work out the expectation and variance of $\hat{r}_h(x)$
$$E[\hat{r}_h(x)] = E\Big[n^{-1} \sum_{i=1}^{n} K_h(x - X_i) Y_i\Big] = \int K_h(x - u) r(u) du$$
Similarly to the density estimation with Kernels, we get
$$E[\hat{r}_h(x)] = r(x) + \frac{h^2}{2} r''(x) \mu_2(K) + o(h^2) , \quad h \to 0$$


To compute the variance of $\hat{r}_h(x)$ we let $s^2(x) = E[Y^2|X = x]$, so that
$$Var(\hat{r}_h(x)) = Var\Big(n^{-1} \sum_{i=1}^{n} K_h(x - X_i) Y_i\Big) = n^{-2} \sum_{i=1}^{n} Var(K_h(x - X_i) Y_i)$$
$$= n^{-1} \Big\{ \int K_h^2(x - u) s^2(u) f(u) du - \Big(\int K_h(x - u) r(u) du\Big)^2 \Big\}$$
$$\approx n^{-1} h^{-1} \int K^2(u) s^2(x + uh) f(x + uh) du$$
Using the techniques of splitting up integrals, the variance is asymptotically given by
$$Var(\hat{r}_h(x)) = n^{-1} h^{-1} f(x) s^2(x) \|K\|_2^2 + o((nh)^{-1}) , \quad nh \to \infty$$
and the variance tends to zero as $nh \to \infty$. Thus, the MSE is given by
$$MSE(\hat{r}_h(x)) = \frac{1}{nh} f(x) s^2(x) \|K\|_2^2 + \frac{h^4}{4} (r''(x) \mu_2(K))^2 + o(h^4) + o((nh)^{-1}) , \quad h \to 0 , \; nh \to \infty \qquad (4.2.8)$$
Hence, if we let $h \to 0$ such that $nh \to \infty$, we have
$$MSE(\hat{r}_h(x)) \to 0$$
so that the estimate is consistent
$$\hat{r}_h(x) \xrightarrow{p} m(x) f(x) = r(x)$$
The denominator of $\hat{m}_h(x)$, the Kernel density estimate $\hat{f}_h(x)$, is also consistent for the same asymptotics of $h$. Hence, using Slutzky's theorem (see Schonfeld [1969]) we obtain
$$\hat{m}_h(x) = \frac{\hat{r}_h(x)}{\hat{f}_h(x)} \xrightarrow{p} \frac{r(x)}{f(x)} = \frac{m(x) f(x)}{f(x)} = m(x) , \quad h \to 0 , \; nh \to \infty$$
and $\hat{m}_h(x)$ is a consistent estimate of the regression curve $m(x)$, if $h \to 0$ and $nh \to \infty$. In order to get more insight into how $\hat{m}_h(x)$ behaves, such as its speed of convergence, we can study the mean squared error
$$d_M(x, h) = E[(\hat{m}_h(x) - m(x))^2]$$
at a point $x$.
Theorem 4.2.2 Assume the fixed design model with a one-dimensional predictor variable $X$, and define
$$c_K = \int K^2(u) du , \quad d_K = \int u^2 K(u) du$$
Further, assume $K$ has support $[-1, 1]$ with $K(-1) = K(1) = 0$, $m \in C^2$, $\max_i |X_i - X_{i-1}| = O(n^{-1})$, and $var(\epsilon_i) = \sigma^2$ for $i = 1, .., n$. Then
$$d_M(x, h) \approx (nh)^{-1} \sigma^2 c_K + \frac{h^4}{4} d_K^2 (m''(x))^2 , \quad h \to 0 , \; nh \to \infty$$


which says that the bias, as a function of $h$, is increasing whereas the variance is decreasing. To understand this result, we note that the estimator $\hat{m}_h(x)$ is a ratio of random variables, such that the central limit theorem can not directly be applied. Thus, we linearise the estimator as follows
$$\hat{m}_h(x) - m(x) = \Big(\frac{\hat{r}_h(x)}{\hat{f}_h(x)} - m(x)\Big) \Big[\frac{\hat{f}_h(x)}{f(x)} + \Big(1 - \frac{\hat{f}_h(x)}{f(x)}\Big)\Big]$$
$$= \frac{\hat{r}_h(x) - m(x) \hat{f}_h(x)}{f(x)} + (\hat{m}_h(x) - m(x)) \frac{f(x) - \hat{f}_h(x)}{f(x)}$$
By the above consistency property of $\hat{m}_h(x)$ we can choose $h \sim n^{-1/5}$. Using this bandwidth we can state
$$\hat{r}_h(x) - m(x) \hat{f}_h(x) = (\hat{r}_h(x) - r(x)) - m(x) (\hat{f}_h(x) - f(x)) = O_p(n^{-2/5}) + m(x) O_p(n^{-2/5}) = O_p(n^{-2/5})$$
such that
$$(\hat{m}_h(x) - m(x)) (f(x) - \hat{f}_h(x)) = o_p(1) O_p(n^{-2/5}) = o_p(n^{-2/5})$$
The leading term in the distribution of $\hat{m}_h(x) - m(x)$ is
$$(f(x))^{-1} (\hat{r}_h(x) - m(x) \hat{f}_h(x))$$
and the MSE of this leading term is
$$(f(x))^{-2} E[(\hat{r}_h(x) - m(x) \hat{f}_h(x))^2]$$
leading to the approximate mean squared error
$$MSE(\hat{m}_h(x)) = \frac{1}{nh} \frac{\sigma^2(x)}{f(x)} \|K\|_2^2 + \frac{h^4}{4} \Big(m''(x) + 2 \frac{m'(x) f'(x)}{f(x)}\Big)^2 \mu_2^2(K) + o((nh)^{-1}) + o(h^4) , \quad h \to 0 , \; nh \to \infty$$
The MSE is of order $O(n^{-4/5})$ when we choose $h \sim n^{-1/5}$. The second summand corresponds to the squared bias of $\hat{m}_h(x)$ and is either dominated by the second derivative $m''(x)$ when we are near a local extremum of $m(x)$, or by the first derivative $m'(x)$ when we are near an inflection point of $m(x)$.
Confidence intervals The asymptotic confidence intervals for $m(x)$ are computed using the formulas of the asymptotic variance and bias of $\hat{m}_h(x)$. The asymptotic distribution of $\hat{m}_h(x)$ is given by the following theorem
Theorem 4.2.3 The Nadaraya-Watson Kernel smoother $\hat{m}_h(x_j)$ at the $K$ different locations $x_1, .., x_K$ converges in distribution to a multivariate normal random vector with mean $B$ and identity covariance matrix
$$\Big\{ (nh)^{1/2} \frac{\hat{m}_h(x_j) - m(x_j)}{(\sigma^2(x_j) \|K\|_2^2 / f(x_j))^{1/2}} \Big\}_{j=1}^{K} \to N(B, I)$$
where


$$B = \Big\{ \mu_2(K) \Big(m''(x_j) + 2 \frac{m'(x_j) f'(x_j)}{f(x_j)}\Big) \Big\}_{j=1}^{K}$$

We can use this theorem to compute an asymptotic $(1 - a)$ confidence interval for $m(x)$, when we employ estimates of the unknown functions $\sigma(x)$, $f(x)$, $f'(x)$, $m(x)$, and $m'(x)$. One way forward is to assume that the bias of $\hat{m}_h(x)$ is of negligible size compared with the variance, so that $B$ is set equal to the zero vector, leaving only $\sigma(x)$ and $f(x)$ to be estimated. Note, $f(x)$ can be estimated with the Kernel density estimator $\hat{f}_h(x)$, and the conditional variance $\sigma^2(x)$ can be estimated by
$$\hat{\sigma}^2(x) = n^{-1} \sum_{i=1}^{n} W_{hi}(x) (Y_i - \hat{m}_h(x))^2$$
We then compute the interval $[clo, cup]$ around $\hat{m}_h(x)$ at the $K$ distinct points $x_1, .., x_K$ with
$$clo = \hat{m}_h(x) - c_a c_K^{1/2} \frac{\hat{\sigma}(x)}{(nh \hat{f}_h(x))^{1/2}}$$
$$cup = \hat{m}_h(x) + c_a c_K^{1/2} \frac{\hat{\sigma}(x)}{(nh \hat{f}_h(x))^{1/2}}$$
where $c_a$ is the $(1 - \frac{a}{2})$ quantile of the standard normal distribution and $c_K = \int K^2(u) du$. If the bandwidth is $h \sim n^{-1/5}$, then the computed interval does not lead asymptotically to an exact confidence interval for $m(x)$. A bandwidth sequence of order less than $n^{-1/5}$ must be chosen such that the bias vanishes asymptotically. For simultaneous error bars, we can use the technique based on the golden section bootstrap which is a delicate resampling technique used to approximate the joint-distribution of $\hat{m}_h(x) - m(x)$ at different points $x$.
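The pointwise interval $[clo, cup]$ with the bias set to zero can be sketched as follows. The example assumes a Gaussian kernel (so $c_K = 1/(2\sqrt{\pi})$), simulated data with a sine-shaped mean curve and a hand-picked bandwidth; the function name and the data-generating process are illustrative assumptions.

```python
import numpy as np
from statistics import NormalDist


def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)


C_K = 1.0 / (2.0 * np.sqrt(np.pi))  # integral of K^2 for the Gaussian kernel


def nw_confidence_interval(x_eval, x_obs, y_obs, h, a=0.05):
    """Pointwise (1 - a) interval [clo, cup] around m_hat_h, ignoring the bias."""
    x_eval = np.asarray(x_eval, dtype=float)
    n = x_obs.size
    k = gaussian_kernel((x_eval[:, None] - x_obs[None, :]) / h) / h  # K_h(x - X_i)
    f_hat = k.mean(axis=1)                     # kernel density estimate of f(x)
    weights = k / f_hat[:, None]               # W_hi(x)
    m_hat = (weights @ y_obs) / n
    resid_sq = (y_obs[None, :] - m_hat[:, None]) ** 2
    sigma_hat = np.sqrt((weights * resid_sq).mean(axis=1))
    c_a = NormalDist().inv_cdf(1.0 - a / 2.0)  # standard normal quantile
    half_width = c_a * np.sqrt(C_K) * sigma_hat / np.sqrt(n * h * f_hat)
    return m_hat - half_width, m_hat, m_hat + half_width


rng = np.random.default_rng(3)
x_obs = rng.uniform(0.0, 1.0, size=400)
y_obs = np.sin(2.0 * np.pi * x_obs) + 0.3 * rng.normal(size=400)
x_eval = np.linspace(0.1, 0.9, 9)
clo, m_hat, cup = nw_confidence_interval(x_eval, x_obs, y_obs, h=0.05)
```

Because the bias term $B$ is dropped, these bars describe the variability of the smoother rather than exact coverage of $m(x)$, in line with the remark above on undersmoothing.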

Fixed design model This is the case where the density $f(x) = F'(x)$ of the predictor variable is known, so that the Kernel weights become
$$W_{hi}(x) = \frac{K_h(x - X_i)}{f(x)}$$
and the estimate can be written as
$$\hat{m}_h(x) = \frac{\hat{r}_h(x)}{f(x)}$$
We can employ the previous results concerning $\hat{r}_h(x)$ to derive the statistical properties of this smoother. If the $X$ observations are taken at regular distances, we may assume that they are uniformly $U(0, 1)$ distributed. In the fixed design model of nearly equispaced, nonrandom $\{X_i\}_{i=1}^n$ on $[0, 1]$, Priestley et al. [1972] and Benedetti [1977] introduced the weight sequence
$$W_{hi}(x) = n (X_i - X_{i-1}) K_h(x - X_i) , \quad X_0 = 0$$
The spacing $(X_i - X_{i-1})$ can be interpreted as an estimate of $n^{-1} f^{-1}$ from the Kernel weight above. Gasser et al. [1979] considered the weight sequence
$$W_{hi}(x) = n \int_{S_{i-1}}^{S_i} K_h(x - u) du$$
where $X_{i-1} \leq S_{i-1} \leq X_i$ is chosen between the ordered $X$-data. It is related to the convolution smoothing proposed by Clark [1980].


4.2.4.2

Kernel smoothing algorithm

Computing kernel smoothing at $N$ distinct points for a kernel with unbounded support would result in $O(Nn)$ operations. However, using kernels with bounded support, say $[-1, 1]$, would result in $O(Nnh)$ operations since about $2nh$ points fall into an interval of length $2h$. One computational approach consists in using the WARPing defined in Section (4.2.1.2). Another one uses the Fourier transform
$$\tilde{g}(t) = \int g(x) e^{-itx} dx$$
where for $g(x) = n^{-1} \sum_{i=1}^{n} K_h(x - X_i) Y_i$, the Fourier transform becomes
$$\tilde{g}(t) = \tilde{K}(th) \, n^{-1} \sum_{i=1}^{n} e^{-itX_i} Y_i$$
Using the Gaussian kernel
$$K(u) = \frac{1}{\sqrt{2\pi}} e^{-\frac{u^2}{2}}$$
we get $\tilde{K}(th) = e^{-\frac{(th)^2}{2}}$. Decoupling the smoothing operation from the Fourier transform of the data $\sum_{i=1}^{n} e^{-itX_i} Y_i$, we can use the Fast Fourier Transform described in Appendix () with $O(N \log N)$ operations.

4.2.4.3

The K-nearest neighbour

Definition of the K-NN estimate Regression by Kernels is based on local averaging of observations Yi in a fixed
neighbourhood around x. Rather than considering this fixed neighbourhood, the K-NN employs varying neighbourhood in the X-variables which are among the K-nearest neighbours of x in Euclidean distance. Introduced by Loftsgaarden et al. [1965], it is defined in the form of the general nonparametric regression estimate
$$\hat{m}_K(x) = n^{-1} \sum_{i=1}^{n} W_{Ki}(x) Y_i$$
where the weight sequence $\{W_{Ki}(x)\}_{i=1}^n$ is defined through the set of indices
$$J_x = \{i : X_i \text{ is one of the } K \text{ nearest observations to } x\}$$
This set of neighbouring observations defines the K-NN weight sequence
$$W_{Ki}(x) = \begin{cases} \frac{n}{K} & \text{if } i \in J_x \\ 0 & \text{otherwise} \end{cases}$$
where the smoothing parameter $K$ regulates the degree of smoothness of the estimated curve. Assuming fixed $n$, in the case where $K$ reaches $n$, the K-NN smoother is equal to the average of the response variables. In the case $K = 1$ we obtain a step function with a jump in the middle between two observations.
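A minimal sketch of the K-NN smoother with uniform weights follows, using a tiny hand-made data set so that the two limiting cases can be read off directly. The function name is hypothetical, and ties in distance are broken by the order of the observations, which is an arbitrary choice of this sketch.

```python
import numpy as np


def knn_smoother(x_eval, x_obs, y_obs, k):
    """K-NN regression estimate with uniform weights n/K on the K nearest X_i."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    distances = np.abs(x_eval[:, None] - x_obs[None, :])
    # Indices J_x of the K nearest observations to each evaluation point.
    nearest = np.argsort(distances, axis=1, kind="stable")[:, :k]
    # n^{-1} * sum_i (n/K) * Y_i over J_x is the plain average of the K responses.
    return y_obs[nearest].mean(axis=1)


x_obs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_obs = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

m_k1 = knn_smoother([0.4, 0.6, 3.9], x_obs, y_obs, k=1)  # step function
m_k2 = knn_smoother([0.4], x_obs, y_obs, k=2)            # average of two nearest
m_kn = knn_smoother([0.4, 3.9], x_obs, y_obs, k=5)       # overall mean of Y
```

With $K = 1$ the estimate jumps from 1 to 3 between $x = 0.4$ and $x = 0.6$, that is, at the midpoint between the first two observations.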
Statistics of the K-NN estimate Again, we face a trade-off between a good approximation to the regression function and a good reduction of observational noise. It can be expressed formally by an expansion of the mean squared error of the K-NN estimate. Lai (1977) proposed the following theorem (see Hardle [1990] for references and proofs).
Theorem 4.2.4 Let $K \to \infty$, $\frac{K}{n} \to 0$, $n \to \infty$. Bias and variance of the K-NN estimate $\hat{m}_K$, with weights as in the K-NN weight sequence, are given by


$$E[\hat{m}_K(x)] - m(x) \approx \frac{1}{24 f(x)^3} \big(m'' f + 2 m' f'\big)(x) \Big(\frac{K}{n}\Big)^2$$
and
$$Var(\hat{m}_K(x)) \approx \frac{\sigma^2(x)}{K}$$
We observe that the bias is increasing and the variance is decreasing in the smoothing parameter $K$. To balance this trade-off in an asymptotic sense, we should choose $K \sim n^{4/5}$. We then obtain for the mean squared error (MSE) a rate of convergence to zero of the order $K^{-1} = n^{-4/5}$. Hence, for this choice of $K$, the MSE is of the same order as for the Kernel regression. In addition to the uniform weights above, Stone [1977] defined triangular and quadratic K-NN weights. In general, the weights can be thought of as being generated by a Kernel function
$$W_{Ri}(x) = \frac{K_R(x - X_i)}{\hat{f}_R(x)}$$
where
$$\hat{f}_R(x) = n^{-1} \sum_{i=1}^{n} K_R(x - X_i)$$
is a Kernel density estimate of $f(x)$ with Kernel sequence
$$K_R(u) = R^{-1} K\Big(\frac{u}{R}\Big)$$
and $R$ is the distance between $x$ and its $K$th nearest neighbour.

4.2.5

Bandwidth selection

While the accuracy of kernel smoothers, as estimators of m or of derivatives of m, is a function of the kernel K and
the bandwidth h, it mainly depends on the smoothing parameter h. So far in choosing the smoothing parameter h
we tried to compute sequences of bandwidths which approximate the A-MISE minimising bandwidth. However, the
A-MISE of the Kernel regression smoother m̂h (x) is not the only candidate for a reasonable measure of discrepancy
between the unknown curve m(x) and the approximation m̂h (x). A list of distance measurements is given in Hardle
[1991] together with the Hardle et al. [1986] theorem showing that they all lead asymptotically to the same level of
smoothing. For convenience, we will only consider the distance that can be most easily computed, namely the average
squared error (ASE).
4.2.5.1

Estimation of the average squared error

A typical representative quadratic measure of accuracy is the Integrated Squared Error (ISE) defined as
$$d_I(m, \hat{m}) = \int (m(x) - \hat{m}(x))^2 f(x) W(x) dx$$
where $W$ denotes a nonnegative weight function. Taking the expectation of $d_I$ with respect to $X$ yields the MISE
$$d_M(m, \hat{m}) = E[d_I(m, \hat{m})]$$
A discrete approximation to $d_I$ is the averaged squared error (ASE) defined as
$$d_A(m, \hat{m}) = ASE(h) = n^{-1} \sum_{i=1}^{n} (m(X_i) - \hat{m}_h(X_i))^2 W(X_i) \qquad (4.2.9)$$

To illustrate the distribution of $ASE(h)$ we look at the $MASE(h)$, the conditioned squared error of $\hat{m}_h(x)$, conditioned on the given set of predictor variables $X_1, .., X_n$, and express it in terms of a variance component and a bias component.
$$d_C(m, \hat{m}) = MASE(h) = E[ASE(h)|X_1, .., X_n]$$
$$= n^{-1} \sum_{i=1}^{n} \Big\{ Var(\hat{m}_h(X_i)|X_1, .., X_n) + Bias^2(\hat{m}_h(X_i)|X_1, .., X_n) \Big\} W(X_i)$$
where the distance $d_C$ is a random distance through the distribution of the $X$s. The expectation of $ASE(h)$, $d_C$, contains a variance component $\nu(h)$
$$\nu(h) = n^{-1} \sum_{i=1}^{n} \Big[ n^{-2} \sum_{j=1}^{n} W_{hj}^2(X_i) \sigma^2(X_j) \Big] W(X_i)$$
and a squared bias component $b^2(h)$
$$b^2(h) = n^{-1} \sum_{i=1}^{n} \Big[ n^{-1} \sum_{j=1}^{n} W_{hj}(X_i) m(X_j) - m(X_i) \Big]^2 W(X_i)$$
The squared bias $b^2(h)$ increases with $h$, while $\nu(h)$, being proportional to $h^{-1}$, decreases. The sum of both components is $MASE(h)$, which shows a clear minimum. We therefore need to approximate the bandwidth minimising ASE. To do this we consider the averaged squared error (ASE) defined in Equation (4.2.9), and expand it
$$d_A(h) = n^{-1} \sum_{i=1}^{n} m^2(X_i) W(X_i) + n^{-1} \sum_{i=1}^{n} \hat{m}_h^2(X_i) W(X_i) - 2 n^{-1} \sum_{i=1}^{n} m(X_i) \hat{m}_h(X_i) W(X_i)$$
where the first term is independent of $h$, the second term can be entirely computed from the data, and the third term could be estimated if it was vanishing faster than $d_A$ tends to zero. A naive estimate of this distance can be based on the replacement of the unknown value $m(X_i)$ by $Y_i$, leading to the so-called Resubstitution estimate
$$P(h) = n^{-1} \sum_{i=1}^{n} (Y_i - \hat{m}_h(X_i))^2 W(X_i)$$

Unfortunately, $P(h)$ is a biased estimate of $ASE(h)$. The intuitive reason for this bias is that the observation $Y_i$ is used in $\hat{m}_h(X_i)$ to predict itself. To get a deeper insight, denote by $\epsilon_i = Y_i - m(X_i)$ the $i$th error term and consider the expansion
$$P(h) = n^{-1} \sum_{i=1}^{n} \epsilon_i^2 W(X_i) + ASE(h) - 2 n^{-1} \sum_{i=1}^{n} \epsilon_i (\hat{m}_h(X_i) - m(X_i)) W(X_i)$$
Thus, the approximation of $ASE(h)$ by $P(h)$ would be fine if the last term had an expectation which is asymptotically of negligible size in comparison with the expectation of $ASE(h)$. This is unfortunately not the case as
$$E\Big[-2 n^{-1} \sum_{i=1}^{n} \epsilon_i (\hat{m}_h(X_i) - m(X_i)) W(X_i) \Big| X_1, .., X_n\Big] = -2 n^{-1} \sum_{i=1}^{n} E[\epsilon_i | X_1, .., X_n] \Big( n^{-1} \sum_{j=1}^{n} W_{hj}(X_i) m(X_j) - m(X_i) \Big) W(X_i)$$
$$- 2 n^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n} W_{hj}(X_i) E[\epsilon_i \epsilon_j | X_1, .., X_n] W(X_i)$$
The errors $\epsilon_i$ are independent random variables with expectation zero and variance $\sigma^2(X_i)$. Hence,
$$E\Big[-2 n^{-1} \sum_{i=1}^{n} \epsilon_i (\hat{m}_h(X_i) - m(X_i)) W(X_i) \Big| X_1, .., X_n\Big] = -2 n^{-2} \sum_{i=1}^{n} W_{hi}(X_i) \sigma^2(X_i) W(X_i)$$
This quantity tends to zero at the same rate as the variance component $\nu(h)$ of $ASE(h)$. Thus, $P(h)$ is biased by this additional variance component. We can use this naive estimate $P(h)$ to construct an asymptotically unbiased estimate of $ASE(h)$. We shall discuss two techniques, namely the concept of penalising functions, which improve the estimate $P(h)$ by introducing a correcting term for this estimate, and the cross-validation, where the computation is based on the leave-one-out estimate $\hat{m}_{hi}(X_i)$, the Kernel smoother without $(X_i, Y_i)$.
4.2.5.2

Penalising functions

With the goal of asymptotically cancelling the bias, the prediction error $P(h)$ is adjusted by the correction term $\Xi(n^{-1} W_{hi}(X_i))$ to give the penalising function selector
$$G(h) = n^{-1} \sum_{i=1}^{n} (Y_i - \hat{m}_h(X_i))^2 \, \Xi(n^{-1} W_{hi}(X_i)) W(X_i)$$
The form of the correction term $\Xi(n^{-1} W_{hi}(X_i))$ is restricted by the first order Taylor expansion of $\Xi$
$$\Xi(u) = 1 + 2u + O(u^2) , \quad u \to 0$$
Hence, the correcting term can be written as
$$\Xi(n^{-1} W_{hi}(X_i)) = 1 + 2 n^{-1} W_{hi}(X_i) + O((nh)^{-2}) = 1 + 2 (nh)^{-1} \frac{K(0)}{\hat{f}_h(X_i)} + O((nh)^{-2}) , \quad nh \to \infty$$
penalising values of $h$ that are too low. We can work out the leading terms of $G(h)$, ignoring terms of lower order
$$G(h) = n^{-1} \sum_{i=1}^{n} \epsilon_i^2 W(X_i) + ASE(h) + 2 n^{-1} \sum_{i=1}^{n} \epsilon_i (m(X_i) - \hat{m}_h(X_i)) W(X_i) + 2 n^{-2} \sum_{i=1}^{n} \epsilon_i^2 W_{hi}(X_i) W(X_i)$$
The first term is independent of $h$, and the expectation of the third summand is the negative expected value of the last term in the leading terms of $G(h)$. Hence, the last two terms cancel asymptotically so that $G(h)$ is, up to the first term which does not depend on $h$, roughly equal to $ASE(h)$, and as a result minimising $G(h)$ is asymptotically equivalent to minimising $ASE(h)$. This gives rise to a lot of penalising functions which lead to asymptotically unbiased estimates of the ASE minimising bandwidth. The simplest function is of the form
$$\Xi(u) = 1 + 2u$$
The objective of these selector functions is to penalise too small bandwidths. Any sequence of optimising bandwidths of one of these penalising functions is asymptotically optimal, that is, the ratio of the expected loss to the minimum loss tends to one. Denote $\hat{h}$ as the minimising bandwidth of $G(h)$ and $\hat{h}_0$ as the ASE optimal bandwidth. Then
$$\frac{ASE(\hat{h})}{ASE(\hat{h}_0)} \xrightarrow{p} 1 , \quad \frac{\hat{h}}{\hat{h}_0} \xrightarrow{p} 1$$
However, the speed of convergence is slow. The relative difference between the estimate $\hat{h}$ and $\hat{h}_0$ is of rate $n^{-1/10}$ and we can not hope to derive a better one, since the relative difference between $\hat{h}_0$ and the MISE optimal bandwidth $h_0$ is of the same size.
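As a rough illustration, the resubstitution estimate $P(h)$ and the penalised selector $G(h)$ with $\Xi(u) = 1 + 2u$ can be computed from the smoother matrix. The sketch assumes a Gaussian kernel, a weight function $W \equiv 1$ and simulated data; the function names and the bandwidth grid are hypothetical. The grid is deliberately restricted to bandwidths for which $nh$ is reasonably large, since the expansion above is only valid in that regime.

```python
import numpy as np


def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)


def smoother_matrix(x_obs, h):
    """Matrix with entries n^{-1} W_hj(X_i), so that m_hat_h(X) = S @ Y."""
    k = gaussian_kernel((x_obs[:, None] - x_obs[None, :]) / h)
    return k / k.sum(axis=1, keepdims=True)


def resubstitution_and_penalised(x_obs, y_obs, h):
    """Return P(h) and G(h) with Xi(u) = 1 + 2u and weight function W = 1."""
    s = smoother_matrix(x_obs, h)
    residuals_sq = (y_obs - s @ y_obs) ** 2
    p = residuals_sq.mean()
    # The diagonal of S holds n^{-1} W_hi(X_i), the argument of Xi.
    g = (residuals_sq * (1.0 + 2.0 * np.diag(s))).mean()
    return p, g


rng = np.random.default_rng(4)
x_obs = np.sort(rng.uniform(0.0, 1.0, size=200))
y_obs = np.sin(2.0 * np.pi * x_obs) + 0.3 * rng.normal(size=200)

bandwidths = np.linspace(0.02, 0.3, 57)
scores = np.array([resubstitution_and_penalised(x_obs, y_obs, h) for h in bandwidths])
p_scores, g_scores = scores[:, 0], scores[:, 1]
h_p = bandwidths[np.argmin(p_scores)]
h_g = bandwidths[np.argmin(g_scores)]
```

Since the penalty is never below one, $G(h) \geq P(h)$ for every $h$, and the gap widens as $h$ shrinks, which is what discourages the very small bandwidths that $P(h)$ tends to favour.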
4.2.5.3

Cross-validation

Cross-validation employs the leave-one-out estimates $\hat{m}_{hi}(X_i)$ in the formula of the prediction error instead of the original estimates. That is, one observation, say the $i$th one, is left out
$$\hat{m}_{hi}(X_i) = n^{-1} \sum_{j \neq i} W_{hj}(X_i) Y_j$$
This leads to the score function of cross-validation
$$CV(h) = n^{-1} \sum_{i=1}^{n} (Y_i - \hat{m}_{hi}(X_i))^2 W(X_i)$$
The equation of the leading terms of $G(h)$ showed that $P(h)$ contains a component of roughly the same size as the variance of $ASE(h)$, but with a negative sign, so that the effect of the variance cancels. When we use the leave-one-out estimates of $\hat{m}_h(X_i)$ we arrive at
$$E\Big[-2 n^{-1} \sum_{i=1}^{n} \epsilon_i (\hat{m}_{hi}(X_i) - m(X_i)) W(X_i) \Big| X_1, .., X_n\Big] = -2 n^{-1} (n - 1)^{-1} \sum_{i=1}^{n} \sum_{j \neq i} W_{hj}(X_i) E[\epsilon_i \epsilon_j | X_1, .., X_n] W(X_i) = 0$$
The cross-validation can also be understood in terms of penalising functions. Assume that $\hat{f}_{hi}(X_i) \neq 0$ and $\hat{m}_{hi}(X_i) \neq Y_i$ for all $i$, then we note that
$$CV(h) = n^{-1} \sum_{i=1}^{n} (Y_i - \hat{m}_h(X_i))^2 \Big(\frac{\hat{m}_h(X_i) - Y_i}{\hat{m}_{hi}(X_i) - Y_i}\Big)^{-2} W(X_i)$$
$$= n^{-1} \sum_{i=1}^{n} (Y_i - \hat{m}_h(X_i))^2 (1 - n^{-1} W_{hi}(X_i))^{-2} W(X_i)$$
Thus, the score function for CV can be rewritten as a penalising function with the selector of generalised cross-validation. Hence, a sequence of bandwidths on $CV(h)$ is asymptotically optimal and yields the same speed of convergence as the other techniques.
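The equivalence between the leave-one-out score and its penalised form can be checked numerically. The sketch below assumes a Gaussian kernel, $W \equiv 1$, simulated data, and a leave-one-out Nadaraya-Watson fit whose weights are renormalised over the remaining $n - 1$ observations; that renormalisation and the function names are assumptions of the example.

```python
import numpy as np


def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)


def cv_direct(x_obs, y_obs, h):
    """CV(h) from explicit leave-one-out Nadaraya-Watson fits (W = 1)."""
    k = gaussian_kernel((x_obs[:, None] - x_obs[None, :]) / h)
    np.fill_diagonal(k, 0.0)                  # drop (X_i, Y_i) from its own fit
    m_loo = (k @ y_obs) / k.sum(axis=1)
    return np.mean((y_obs - m_loo) ** 2)


def cv_penalised(x_obs, y_obs, h):
    """Same score via the penalising form with Xi(u) = (1 - u)^(-2)."""
    k = gaussian_kernel((x_obs[:, None] - x_obs[None, :]) / h)
    s = k / k.sum(axis=1, keepdims=True)      # entries n^{-1} W_hj(X_i)
    residuals = y_obs - s @ y_obs
    return np.mean((residuals / (1.0 - np.diag(s))) ** 2)


rng = np.random.default_rng(5)
x_obs = rng.uniform(0.0, 1.0, size=150)
y_obs = np.sin(2.0 * np.pi * x_obs) + 0.3 * rng.normal(size=150)

bandwidths = np.linspace(0.02, 0.3, 29)
cv_a = np.array([cv_direct(x_obs, y_obs, h) for h in bandwidths])
cv_b = np.array([cv_penalised(x_obs, y_obs, h) for h in bandwidths])
h_cv = bandwidths[np.argmin(cv_a)]
```

The penalised form needs only one fit per bandwidth instead of $n$ leave-one-out fits, which is the practical reason for rewriting the score this way.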

4.3

Trend filtering in the time domain

4.3.1

Some basic principles

We are now going to filter, in the time domain, the corrupted signal defined in Equation (4.1.1) in the case of financial time series. Slightly modifying notation, we let $y_t$ be a stochastic process made of two different unobservable components, and assume that in its simplest form its dynamics are
$$y_t = x_t + \epsilon_t \qquad (4.3.10)$$

where xt is the trend, and the noise t is a stochastic process. We are now concerned with estimating the trend xt . We
let y = {..., y−1 , y0 , y1 , ...} be the ordered sequence of observations of the process yt and x̂t be the estimator of the
unobservable underlying trend xt . A filtering procedure consists in applying a filter L to the data y
x̂ = L(y)
with x̂ = {..., x̂−1 , x̂0 , x̂1 , ..}. In the case of linear filter we have x̂ = Ly with the normalisation condition 1 = L1.
Further, if the signal yt is observed at regular dates, we get
x̂t =

∞
X

Lt,t−i yt−i

i=−∞

and the linear filter may be viewed as a convolution. Imposing some restriction on the coefficients $L_{t,t-i}$, to use only past and present values, we get a causal filter. Further, considering only time invariant filters, we get a simple convolution of the observed signal $y_t$ with a window function $L_i$
$$\hat{x}_t = \sum_{i=0}^{n-1} L_i \, y_{t-i} \tag{4.3.11}$$

corresponding to the nonrecursive filter (or FIR) in Section (4.1.3.2). That is, a linear filter is characterised by a window kernel $L_i$ and its support $n$, where the former defines the type of filtering and the latter defines the range of the filter. When it is not possible to express the trend as a linear convolution of the signal and a window function, the filters are called nonlinear filters. Bruder et al. [2011] provide a detailed description of linear and nonlinear filters. As an example of a linear filter, in the well-known moving average (MA), we take a square window on a compact support $[0, T]$ with $T = n\Delta$, the width of the averaging window, and get the kernel
$$L_i = \frac{1}{n} \mathbb{1}\{i < n\}$$
[...] $n_1 > n_2$, the trend is approximated by
$$\hat{\mu}_t \approx \frac{2}{(n_1 - n_2)\Delta} \left( \hat{y}_t^{n_2} - \hat{y}_t^{n_1} \right)$$

which is positive when the short-term moving average is higher than the long-term moving average. Hence, the sign of the approximated trend changes when the short-term MA crosses the long-term one. Note, this estimator may be viewed as a weighted moving average of asset returns. Inverting the derivative window $l_i$ we recover the operator $L_i$ as
$$
L_i = \begin{cases}
l_0 & \text{if } i = 0 \\
l_i + L_{i-1} & \text{if } i = 1, ..., n-1 \\
-l_{n-1} & \text{if } i = n
\end{cases}
$$
and one can then interpret the estimator in terms of asset returns. The weighting of each return in the estimator forms a triangle where the biggest weighting is given at the horizon of the smallest MA. Hence, the indicator can be focused towards the current trend (if $n_2$ is small) or towards past trends (if $n_2$ is as large as $n_1/2$).
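The moving-average crossover estimator can be sketched in Python as follows. This is only an illustration: the causal uniform window, the handling of the first observations (set to NaN until both windows are full) and the function names are assumptions, with `n1` the long window and `n2` the short one.

```python
import numpy as np


def moving_average(y, n):
    """Causal uniform MA: mean of the last n observations (NaN until n exist)."""
    y = np.asarray(y, dtype=float)
    out = np.full(len(y), np.nan)
    csum = np.cumsum(np.insert(y, 0, 0.0))
    out[n - 1:] = (csum[n:] - csum[:-n]) / n
    return out


def crossover_trend(y, n1, n2, dt=1.0):
    """Trend estimate 2 / ((n1 - n2) dt) * (MA_n2 - MA_n1), with n1 > n2."""
    if n1 <= n2:
        raise ValueError("n1 must be the long window: n1 > n2")
    short_ma = moving_average(y, n2)
    long_ma = moving_average(y, n1)
    return 2.0 / ((n1 - n2) * dt) * (short_ma - long_ma)
```

For a noise-free linear signal the two averages differ by exactly half the slope times the difference in window lengths, so the estimator recovers the slope; this gives a simple sanity check of the scaling factor.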
In order to improve the uniform MA estimator, we can consider the kernel function
$$l_i = \frac{4}{n^2} \, \mathrm{sign}\left( \frac{n}{2} - i \right)$$
where the estimator $\hat{\mu}_t$ takes into account all the dates of the window period. Taking the primitive of the function $l_i$, the filter becomes
$$L_i = \frac{4}{n^2} \left( \frac{n}{2} - \left| i - \frac{n}{2} \right| \right)$$
Other types of MA filter exist which are characterised by an asymmetric form of the convolution kernel. For instance, one can take an asymmetric window function with a triangular form
$$L_i = \frac{2}{n^2} (n - i) \mathbb{1}\{i < n\}$$
[...]
$$(A^\top . A) . a = A^\top . f \quad \text{or} \quad a = (A^\top . A)^{-1} . (A^\top . f)$$
where
$$A_{ij} = i^j, \quad i = -n_L, .., n_R, \quad j = 0, .., M$$
We also have the specific forms
$$\{A^\top . A\}_{ij} = \sum_{k=-n_L}^{n_R} A_{ki} A_{kj} = \sum_{k=-n_L}^{n_R} k^{i+j}$$
$$\{A^\top . f\}_j = \sum_{k=-n_L}^{n_R} A_{kj} f_k = \sum_{k=-n_L}^{n_R} k^j f_k$$
Since the coefficient $c_n$ is the component $a_0$ when $f$ is replaced by the unit vector $e_n$ for $-n_L \le n \le n_R$, we have
$$c_n = \{(A^\top . A)^{-1} . (A^\top . e_n)\}_0 = \sum_{m=0}^{M} \{(A^\top . A)^{-1}\}_{0m} \, n^m$$

meaning that we only need one row of the inverse matrix (numerically we can get this by LU decomposition with only
a single back-substitution). A higher degree polynomial makes it possible to achieve a high level of smoothing without
attenuation of real data features. Hence, within limits, the Savitzky-Golay filtering manages to provide smoothing
without loss of resolution. When dealing with irregularly sampled data, where the values fi are not uniformly spaced
in time, there is no way to obtain universal filter coefficients applicable to more than one data point. Note, the
Savitzky-Golay technique can also be used to compute numerical derivatives. In that case, the desired order is usually
m = M = 4 or larger where m is the order of the smoothing polynomial, also equal to the highest conserved moment.
Numerical experiments are usually done with a 33 point smoothing filter, that is nL = nR = 16.
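The computation of the Savitzky-Golay coefficients $c_n$ from one row of $(A^\top . A)^{-1}$ can be sketched in Python as below. The function name `savgol_coefficients` and the use of a dense linear solve in place of an explicit LU decomposition are illustrative assumptions.

```python
import numpy as np


def savgol_coefficients(n_left, n_right, order):
    """Smoothing coefficients c_n for n = -n_left, .., n_right."""
    k = np.arange(-n_left, n_right + 1)
    # Design matrix A[k, j] = k**j for j = 0, .., order
    A = np.vander(k, order + 1, increasing=True).astype(float)
    ata = A.T @ A
    # Row 0 of (A^T A)^{-1}: since A^T A is symmetric, solve (A^T A) z = e_0
    e0 = np.zeros(order + 1)
    e0[0] = 1.0
    row0 = np.linalg.solve(ata, e0)
    # c_n = sum_m {(A^T A)^{-1}}_{0m} n^m
    return A @ row0
```

For example, `savgol_coefficients(16, 16, 4)` would give the 33 point filter mentioned above; the smoothed value at a point is the dot product of these coefficients with the surrounding window of data.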

4.3.4 The least squares filters

4.3.4.1 The L2 filtering

An alternative to averaging observations is to impose a model on the process yt and its trend xt (see Section (4.1.2)).
For instance, the Lanczos filter in the previous section may be considered as a local linear regression. Given a model
for the process yt , least squares methods are often used to define trend estimators
$$\{\hat{x}_1, ..., \hat{x}_n\} = \arg\min \frac{1}{2} \sum_{t=1}^{n} (y_t - \hat{x}_t)^2$$

but the problem is ill-posed and one must impose some restrictions on the underlying process $y_t$ or on the filtered trend $\hat{x}_t$ to obtain a solution. For instance, we can consider the deterministic constant trend
$$x_t = x_{t-1} + \mu$$
such that the process $y_t$ becomes $y_t = x_{t-1} + \mu + \epsilon_t$. Iterating with $x_0 = 0$ (see footnote 2), we get the process
$$y_t = \mu t + \epsilon_t \tag{4.3.13}$$
and estimating the filtered trend $\hat{x}_t$ is equivalent to estimating the coefficient $\mu$
$$\hat{\mu} = \frac{\sum_{t=1}^{n} t \, y_t}{\sum_{t=1}^{n} t^2}$$
In the case where the trend is not constant one can consider the Hodrick-Prescott filter (or L2 filter) where the objective function is
$$\frac{1}{2} \sum_{t=1}^{n} (y_t - \hat{x}_t)^2 + \lambda \sum_{t=2}^{n-1} (\hat{x}_{t-1} - 2\hat{x}_t + \hat{x}_{t+1})^2$$
where $\lambda > 0$ is a regularisation parameter controlling the trade-off between the smoothness of $\hat{x}_t$ and the noise $(y_t - \hat{x}_t)$. Rewriting the objective function in vectorial form, we get
$$\frac{1}{2} \| y - \hat{x} \|_2^2 + \lambda \| D\hat{x} \|_2^2$$
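where the operator $D$ is the $(n-2) \times n$ matrix
$$
D = \begin{pmatrix}
1 & -2 & 1 & 0 & \dots & 0 & 0 & 0 & 0 \\
0 & 1 & -2 & 1 & \dots & 0 & 0 & 0 & 0 \\
\vdots & & & & \ddots & & & & \vdots \\
0 & 0 & 0 & 0 & \dots & 1 & -2 & 1 & 0 \\
0 & 0 & 0 & 0 & \dots & 0 & 1 & -2 & 1
\end{pmatrix}
$$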

so that the estimator becomes
$$\hat{x} = \left( I + 2\lambda D^\top D \right)^{-1} y$$

4.3.4.2 The L1 filtering

One can generalise the Hodrick-Prescott filter to a larger class of filters by using the $L_p$ penalty condition instead of the $L_2$ one (see Daubechies et al. [2004]). If we consider an $L_1$ filter, the objective function becomes
$$\frac{1}{2} \sum_{t=1}^{n} (y_t - \hat{x}_t)^2 + \lambda \sum_{t=2}^{n-1} |\hat{x}_{t-1} - 2\hat{x}_t + \hat{x}_{t+1}|$$
which is expressed in vectorial form as
$$\frac{1}{2} \| y - \hat{x} \|_2^2 + \lambda \| D\hat{x} \|_1$$
Kim et al. [2009] showed that the dual problem of the L1 filter scheme is a quadratic program with some boundary constraints. Since the L1 penalty favours solutions in which the second difference of the filtered signal is exactly zero at most dates, we obtain a piecewise linear trend, that is, a set of straight trends and breaks. Hence, the smoothing parameter λ plays an important role in detecting the number of breaks.
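To make the role of the regularisation parameter concrete, the closed-form L2 (Hodrick-Prescott) estimator of Section 4.3.4.1, $\hat{x} = (I + 2\lambda D^\top D)^{-1} y$, can be sketched in a few lines of Python; it gives a smooth benchmark against which the piecewise linear L1 trend can be compared. This is an illustrative sketch only: the dense matrix solve and the function names are assumptions, and a sparse solver would be preferable for long series. The L1 filter itself requires a quadratic programming solver and is not implemented here.

```python
import numpy as np


def second_difference_matrix(n):
    """(n - 2) x n matrix D with rows (1, -2, 1) on the shifted diagonal."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D


def l2_filter(y, lam):
    """L2 trend estimate: solve (I + 2 lam D^T D) x = y."""
    y = np.asarray(y, dtype=float)
    D = second_difference_matrix(len(y))
    lhs = np.eye(len(y)) + 2.0 * lam * (D.T @ D)
    return np.linalg.solve(lhs, y)
```

A straight line has zero second differences, so it passes through the filter unchanged for any λ, while increasing λ reduces the roughness $\| D\hat{x} \|_2^2$ of the filtered series.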
Footnote 2: $y_t = x_{t-n} + n\mu + \epsilon_t$, with $x_{t-n} = x_0$.


4.3.4.3 The Kalman filters

Another approach to estimating the trend is to consider the Kalman filter where the trend µt is a hidden process
following a given dynamics (see details in Appendix (C.4)). For instance, we can assume it follows the dynamics
$$
\begin{aligned}
R_t &= \mu_t + \sigma_R \epsilon_R(t) \\
\mu_t &= \mu_{t-1} + \sigma_\mu \epsilon_\mu(t)
\end{aligned}
\tag{4.3.14}
$$

where $R_t$ is the observable signal of realised returns and the hidden process $\mu_t$ follows a random walk. Hence, it follows a Markov model. If we let the conditional trend be $\hat{\mu}_{t|t-1} = E_{t-1}[\mu_t]$ and the estimation error be $P_{t|t-1} = E_{t-1}[(\hat{\mu}_{t|t-1} - \mu_t)^2]$, then we get the forecast estimator
$$\hat{\mu}_{t+1|t} = (1 - K_t)\hat{\mu}_{t|t-1} + K_t R_t$$
where
$$K_t = \frac{P_{t|t-1}}{P_{t|t-1} + \sigma_R^2}$$
is the Kalman gain. The estimation error is given by the Riccati equation
$$P_{t+1|t} = P_{t|t-1} + \sigma_\mu^2 - P_{t|t-1} K_t$$
with stationary solution
$$P^* = \frac{1}{2} \sigma_\mu \left( \sigma_\mu + \sqrt{\sigma_\mu^2 + 4\sigma_R^2} \right)$$
and the filter equation becomes
$$\hat{\mu}_{t+1|t} = (1 - \kappa)\hat{\mu}_{t|t-1} + \kappa R_t$$
with
$$\kappa = \frac{2\sigma_\mu}{\sigma_\mu + \sqrt{\sigma_\mu^2 + 4\sigma_R^2}}$$

Note, the Kalman filter in Equation (4.3.14) can be rewritten as an exponential moving average (EMA) filter with parameter $\lambda = -\ln(1 - \kappa)$ for $0 < \kappa < 1$ and $\lambda > 0$. In this setting, the estimator is given by
$$\hat{\mu}_t = (1 - e^{-\lambda}) \sum_{i=0}^{\infty} e^{-\lambda i} R_{t-i}$$
with $\hat{\mu}_t = E_t[\mu_t]$ so that the 1-day forecast estimator is $\hat{\mu}_{t+1|t} = \hat{\mu}_t$. From our discussion above, the filter of the trend $\hat{x}_t$ is given by the equation
$$\hat{x}_t = (1 - e^{-\lambda}) \sum_{i=0}^{\infty} e^{-\lambda i} y_{t-i}$$
and the derivative of the trend may be related to the signal $y_t$ by the following equation
$$\hat{\mu}_t = (1 - e^{-\lambda}) y_t - (1 - e^{-\lambda})(e^{\lambda} - 1) \sum_{i=1}^{\infty} e^{-\lambda i} y_{t-i}$$

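The recursion for the Markov model in Equation (4.3.14) can be sketched in Python as follows, to illustrate how the time-varying gain $K_t$ settles at the stationary gain $\kappa$, at which point the filter is an EMA with $\lambda = -\ln(1 - \kappa)$. The initial values `mu0` and `p0` and the function names are assumptions made for the illustration.

```python
import numpy as np


def kalman_trend(returns, sigma_r, sigma_mu, mu0=0.0, p0=1.0):
    """One-step-ahead trend forecasts and Kalman gains for Equation (4.3.14)."""
    mu_pred, p_pred = mu0, p0
    forecasts, gains = [], []
    for r in returns:
        k = p_pred / (p_pred + sigma_r ** 2)          # Kalman gain K_t
        mu_pred = (1.0 - k) * mu_pred + k * r         # forecast for t + 1
        p_pred = p_pred + sigma_mu ** 2 - p_pred * k  # Riccati equation
        forecasts.append(mu_pred)
        gains.append(k)
    return np.array(forecasts), np.array(gains)


def stationary_gain(sigma_r, sigma_mu):
    """Stationary gain kappa of the filter."""
    return 2.0 * sigma_mu / (sigma_mu + np.sqrt(sigma_mu ** 2 + 4.0 * sigma_r ** 2))


def ema_decay(kappa):
    """EMA parameter lambda = -ln(1 - kappa), for 0 < kappa < 1."""
    return -np.log(1.0 - kappa)
```

The gain sequence depends only on the two volatilities and on `p0`, not on the observed returns, so the convergence of $K_t$ to $\kappa$ can be checked on any input series.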

One can relate the regression model in Equation (4.3.13) with the Markov model in Equation (4.3.14) by noting that they are special cases of the structural models described in Appendix (C.4.12). More precisely, the regression model in Equation (4.3.13) is equivalent to the state space model
$$
\begin{aligned}
y_t &= x_t + \sigma_y \epsilon_y(t) \\
x_t &= x_{t-1} + \mu
\end{aligned}
$$
If we let the trend be stochastic we get the local level model
$$
\begin{aligned}
y_t &= x_t + \sigma_y \epsilon_y(t) \\
x_t &= x_{t-1} + \mu + \sigma_x \epsilon_x(t)
\end{aligned}
$$
Further, assuming that the slope of the trend is stochastic, we obtain the local linear trend model
$$
\begin{aligned}
y_t &= x_t + \sigma_y \epsilon_y(t) \\
x_t &= x_{t-1} + \mu_{t-1} + \sigma_x \epsilon_x(t) \\
\mu_t &= \mu_{t-1} + \sigma_\mu \epsilon_\mu(t)
\end{aligned}
\tag{4.3.15}
$$

and setting $\sigma_y = 0$ we recover the Markov model in Equation (4.3.14). These examples are special cases of structural models which can be solved by using the Kalman filter. Note, the Kalman filter is optimal in the case of the linear Gaussian model in Equation (), and it can be regarded as an efficient computational solution of the least squares method (see Sorenson [1970]).
Remark 4.3.2 The Kalman filter can be used to solve more sophisticated processes than the Markov model, but some nonlinear or non-Gaussian models may be too complex for Kalman filtering.
To conclude, the Kalman filter can be used to derive an optimal smoother as it improves the estimate of $\hat{x}_{t-i}$ by using all the information between $t - i$ and $t$.

4.3.5 Calibration

When filtering trends from time series, one must consider the calibration of the filtering parameters. We briefly discuss
two possible calibration schemes, one where the calibrated parameters incorporate our prediction requirement and the
other one where they can be mapped to a known benchmark estimator.
To illustrate the approach of statistical inference we consider the local linear trend model in Equation (4.3.15). We
estimate the set of parameters (σy , σx , σµ ) by maximising the log-likelihood function
$$l = -\frac{1}{2} \sum_{t=1}^{n} \left( \ln 2\pi + \ln F_t + \frac{v_t^2}{F_t} \right)$$
where $v_t = y_t - E_{t-1}[y_t]$ is the (one-day) innovation process and $F_t = E_{t-1}[v_t^2]$ is the variance of $v_t$.
In order to look at longer trends, the innovation process becomes $v_t = y_t - E_{t-h}[y_t]$ where $h$ is the horizon time. In that setting, we calibrate the parameters $\theta$ by using a cross-validation technique. We divide our historical data into an in-sample set and an out-sample set characterised by two time parameters $T_1$ and $T_2$, where the size of the former controls the precision of the calibration of the parameter $\theta$. We compute the values of the expectation $E_{t-h}[y_t]$ in the in-sample set, which are used in the out-sample set to estimate the prediction error
$$e(\theta; h) = \sum_{t=1}^{n-h} \left( y_t - E_{t-h}[y_t] \right)^2$$
which is directly related to the prediction horizon $h = T_2$ for a given strategy. Minimising the prediction error, we get the optimal value $\theta^*$ of the filter parameter used to predict the trend for the test set.
The estimator of the slope of the trend $\hat{\mu}_t$ is a random variable defined by a probability distribution function and, based on the sample data, takes a value called the estimate of the slope. If we let $\mu_t^0$ be the true value of the slope, the quality of the estimator is defined by the mean squared error (MSE)
$$MSE(\hat{\mu}_t) = E[(\hat{\mu}_t - \mu_t^0)^2]$$
The estimator $\hat{\mu}_t^{(1)}$ is more efficient than the estimator $\hat{\mu}_t^{(2)}$ if its MSE is lower
$$\hat{\mu}_t^{(1)} \succeq \hat{\mu}_t^{(2)} \Leftrightarrow MSE(\hat{\mu}_t^{(1)}) \le MSE(\hat{\mu}_t^{(2)})$$
Decomposing the MSE into two components we get
$$MSE(\hat{\mu}_t) = E[(\hat{\mu}_t - E[\hat{\mu}_t])^2] + E[(E[\hat{\mu}_t] - \mu_t^0)^2]$$
where the first component is the variance of the estimator $Var(\hat{\mu}_t)$ while the second one is the square of the bias $B(\hat{\mu}_t)$. When comparing unbiased estimators, we are left with comparing their variances. Hence, the estimate of a trend may not be significant when the variance of the estimator is too large.
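The decomposition of the MSE can be illustrated with a small Monte Carlo experiment in Python. The sketch below simulates the process $y_t = \mu t + \epsilon_t$ of Equation (4.3.13) and compares the least squares slope with a naive estimator using only the last observation. The Gaussian noise, the parameter values, the naive estimator and the function names are all assumptions chosen for illustration.

```python
import numpy as np


def mse_decomposition(estimates, true_value):
    """Return (MSE, variance, squared bias) computed from simulated estimates."""
    estimates = np.asarray(estimates, dtype=float)
    variance = estimates.var()                       # E[(mu_hat - E[mu_hat])^2]
    bias_sq = (estimates.mean() - true_value) ** 2   # (E[mu_hat] - mu0)^2
    mse = np.mean((estimates - true_value) ** 2)
    return mse, variance, bias_sq


def simulate_slope_estimators(mu=0.05, sigma=1.0, n=50, trials=2000, seed=0):
    """Simulate y_t = mu t + eps_t and return two slope estimators per trial."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, n + 1)
    y = mu * t + sigma * rng.standard_normal((trials, n))
    ols = (y @ t) / (t @ t)      # least squares slope sum(t y_t) / sum(t^2)
    endpoint = y[:, -1] / n      # naive slope from the last observation only
    return ols, endpoint
```

Both estimators are unbiased in this setting, so the comparison reduces to their variances, and the least squares slope is the more efficient of the two.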

4.3.6 Introducing linear prediction

We let $\{y'_\alpha\}$ be a set of measured values for some underlying set of true values of a quantity $y$, denoted $\{y_\alpha\}$, related to these true values by the addition of random noise
$$y'_\alpha = y_\alpha + \eta_\alpha$$
The Greek subscript indexing the values indicates that the data points are not necessarily equally spaced along a line, or even ordered. We want to construct the best estimate of the true value of some particular point $y_*$ as a linear combination of the known, noisy values. That is, given
$$y_* = \sum_{\alpha} d_{*\alpha} \, y'_\alpha + x_* \tag{4.3.16}$$
we want to find coefficients $d_{*\alpha}$ minimising the discrepancy $x_*$.
Remark 4.3.3 In the case where we let y∗ be one of the existing yα ’s the problem becomes one of optimal filtering or
estimation. On the other hand, if y∗ is a completely new point then the problem is that of a linear prediction.
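One way forward is to minimise the discrepancy $x_*$ in the statistical mean square sense. That is, assuming that the noise is uncorrelated with the signal ($\langle \eta_\alpha y_\beta \rangle = 0$), we seek $d_{*\alpha}$ minimising
$$
\begin{aligned}
\langle x_*^2 \rangle &= \left\langle \left[ \sum_{\alpha} d_{*\alpha} (y_\alpha + \eta_\alpha) - y_* \right]^2 \right\rangle \\
&= \sum_{\alpha\beta} \left( \langle y_\alpha y_\beta \rangle + \langle \eta_\alpha \eta_\beta \rangle \right) d_{*\alpha} d_{*\beta} - 2 \sum_{\alpha} \langle y_* y_\alpha \rangle d_{*\alpha} + \langle y_*^2 \rangle
\end{aligned}
$$
where $\langle \cdot \rangle$ is the statistical average, and $\beta$ is a subscript to index another member of the set. Note, $\langle y_\alpha y_\beta \rangle$ and $\langle y_* y_\alpha \rangle$ describe the autocorrelation structure of the underlying data. For point to point uncorrelated noise we get $\langle \eta_\alpha \eta_\beta \rangle = \langle \eta_\alpha^2 \rangle \delta_{\alpha\beta}$ where $\delta_{\alpha\beta}$ is the Kronecker delta. One can think of the various correlation quantities as comprising matrices and vectors
$$\phi_{\alpha\beta} = \langle y_\alpha y_\beta \rangle, \quad \phi_{*\alpha} = \langle y_* y_\alpha \rangle, \quad \eta_{\alpha\beta} = \langle \eta_\alpha \eta_\beta \rangle \text{ or } \langle \eta_\alpha^2 \rangle \delta_{\alpha\beta}$$
Setting the derivative with respect to the $d_{*\alpha}$'s equal to zero in the above equation, we get the set of linear equations
$$\sum_{\beta} [\phi_{\alpha\beta} + \eta_{\alpha\beta}] \, d_{*\beta} = \phi_{*\alpha}$$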

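Writing the solution as a matrix inverse and omitting the minimised discrepancy $x_*$, the estimation of Equation (4.3.16) becomes
$$y_* \approx \sum_{\alpha\beta} \phi_{*\alpha} [\phi_{\mu\nu} + \eta_{\mu\nu}]^{-1}_{\alpha\beta} \, y'_\beta \tag{4.3.17}$$
We can also calculate the expected mean square value of the discrepancy at its minimum
$$\langle x_*^2 \rangle_0 = \langle y_*^2 \rangle - \sum_{\beta} d_{*\beta} \phi_{*\beta} = \langle y_*^2 \rangle - \sum_{\alpha\beta} \phi_{*\alpha} [\phi_{\mu\nu} + \eta_{\mu\nu}]^{-1}_{\alpha\beta} \, \phi_{*\beta}$$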

Replacing the star with the Greek index $\gamma$, the above formulas describe optimal filtering. In the case where the noise amplitudes $\eta_\alpha$ go to zero, so do the noise autocorrelations $\eta_{\alpha\beta}$ and, cancelling a matrix times its inverse, Equation (4.3.17) simply becomes $y_\gamma = y'_\gamma$. In the case where the matrices $\phi_{\alpha\beta}$ and $\eta_{\alpha\beta}$ are diagonal, Equation (4.3.17) becomes
$$y_\gamma = \frac{\phi_{\gamma\gamma}}{\phi_{\gamma\gamma} + \eta_{\gamma\gamma}} \, y'_\gamma \tag{4.3.18}$$

which is Equation (4.1.2) with S 2 → φγγ and N 2 → ηγγ . For the case of equally spaced data points, and in the
Fourier domain, autocorrelations simply become squares of Fourier amplitudes (Wiener-Khinchin theorem), and the
optimal filter can be constructed algebraically, as Equation (4.3.18), without inverting any matrix. In the time domain,
or any other domain, an optimal filter (minimising the square of the discrepancy from the underlying true value in
the presence of measurement noise) can be constructed by estimating the autocorrelation matrices φαβ and ηαβ , and
applying Equation (4.3.17) with ∗ → γ. Classical linear prediction (LP) specialises to the case where the data points
$y_\beta$ are equally spaced along a line, $y_i$ for $i = 1, .., N$, and we want to use $M$ consecutive values of $y_i$ to predict the $(M+1)$th value. Note, stationarity is assumed, that is, the autocorrelation $\langle y_j y_k \rangle$ is assumed to depend only on the difference $|j - k|$, and not on $j$ or $k$ individually, so that the autocorrelation $\phi$ has only a single index
$$\phi_j = \langle y_i y_{i+j} \rangle \approx \frac{1}{N - j} \sum_{i=1}^{N-j} y_i \, y_{i+j}$$

However, there is a better way to estimate the autocorrelation. In that setting the estimation Equation (4.3.16) is
$$y_n = \sum_{j=1}^{M} d_j \, y_{n-j} + x_n \tag{4.3.19}$$
so that the set of linear equations above becomes the set of $M$ equations for the $M$ unknown $d_j$'s, called the linear prediction (LP) coefficients
$$\sum_{j=1}^{M} \phi_{|j-k|} \, d_j = \phi_k, \quad k = 1, .., M$$

Note, results obtained from linear prediction are remarkably sensitive to exactly how the $\phi_k$'s are estimated. Even though the noise is not explicitly included in the equations, it is properly accounted for, if it is point-to-point uncorrelated. Note, $\phi_0$ above estimates the diagonal part of $\phi_{\alpha\alpha} + \eta_{\alpha\alpha}$, and the mean square discrepancy $\langle x_n^2 \rangle$ is given by
$$\langle x_n^2 \rangle = \phi_0 - \phi_1 d_1 - ... - \phi_M d_M$$
Hence, we first compute the $d_j$'s with the equations above, then calculate the mean square discrepancy $\langle x_n^2 \rangle$. If the discrepancies are small, we continue applying Equation (4.3.19) right on into the future, assuming future discrepancies $x_i$ to be zero. This is a kind of extrapolation formula. Note, Equation (4.3.19) being a special case of the general linear filter, the condition for stability is that the characteristic polynomial
$$z^N - \sum_{j=1}^{N} d_j \, z^{N-j} = 0$$
has all $N$ of its roots inside the unit circle
$$|z| \le 1$$
If the data contain many oscillations without any particular trend towards increasing or decreasing amplitude, then the
complex roots of the polynomial will generally all be rather close to the unit circle. When the instability is a problem,
one should massage the LP coefficients by
1. solving numerically the polynomial for its N complex roots
2. moving the roots to where we think they should be inside or on the unit circle
3. reconstructing the modified LP coefficients
Assuming that the signal is truly a sum of undamped sine and cosine waves, one can simply move each root $z_i$ onto the unit circle
$$z_i \to \frac{z_i}{|z_i|}$$
Alternatively, one can reflect a bad root across the unit circle
$$z_i \to \frac{1}{z_i^*}$$

preserving the amplitude of the output of Equation (4.3.19) when it is driven by a sinusoidal set of xi ’s. Note, the
choice of M , the number of LP coefficients to use, is an open problem. Linear prediction is successful at extrapolating
signals that are smooth and oscillatory, not necessarily periodic.
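The steps above (estimate the autocorrelations, solve for the LP coefficients, check the roots, extrapolate) can be sketched in Python as follows. This is a minimal illustration using the simple autocorrelation estimate $\phi_j \approx \frac{1}{N-j}\sum_i y_i y_{i+j}$ and a dense linear solve; the function names are assumptions, and the sketch only implements the first of the two root corrections (moving a root onto the unit circle).

```python
import numpy as np


def autocorrelation(y, max_lag):
    """phi_j ~ 1/(N - j) sum_i y_i y_{i+j} for j = 0, .., max_lag."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    return np.array([np.dot(y[:n - j], y[j:]) / (n - j) for j in range(max_lag + 1)])


def lp_coefficients(y, m):
    """Solve sum_j phi_|j-k| d_j = phi_k, k = 1, .., m; return (d, <x_n^2>)."""
    phi = autocorrelation(y, m)
    lags = np.abs(np.arange(m)[:, None] - np.arange(m)[None, :])
    d = np.linalg.solve(phi[lags], phi[1:])
    mean_sq_discrepancy = phi[0] - phi[1:] @ d
    return d, mean_sq_discrepancy


def stabilise(d):
    """Move roots of z^m - sum_j d_j z^(m-j) lying outside the unit circle onto it."""
    d = np.asarray(d, dtype=float)
    roots = np.roots(np.concatenate(([1.0], -d)))
    outside = np.abs(roots) > 1.0
    roots[outside] = roots[outside] / np.abs(roots[outside])
    # Rebuild the polynomial and read off the modified coefficients
    return -np.real(np.poly(roots))[1:]


def predict(y, d, steps):
    """Extrapolate with Equation (4.3.19), assuming zero future discrepancies."""
    history = list(np.asarray(y, dtype=float)[-len(d):])
    out = []
    for _ in range(steps):
        nxt = float(np.dot(d, history[::-1]))  # d_1 multiplies the latest value
        out.append(nxt)
        history = history[1:] + [nxt]
    return np.array(out)
```

For an undamped cosine of angular frequency $\omega$ the exact coefficients are $d_1 = 2\cos\omega$ and $d_2 = -1$, with both roots on the unit circle, which gives a convenient check of the estimated coefficients.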

Chapter 5

Presenting time series analysis

5.1 Basic principles of linear time series

We consider the asset returns to be a collection of random variables over time, obtaining the time series {rt } in the case
of log returns. Linear time series analysis is a first step to understanding the dynamic structure of such a series (see
Box et al. [1994]). That is, for an asset return $r_t$, simple models attempt to capture the linear relationship between $r_t$ and some information available prior to time $t$. For instance, the information may contain the historical values of $r_t$ and the random vector $Y$ that describes the economic environment under which the asset price is determined. As a result, correlations between the variable of interest and its past values become the focus of linear time series analysis, and are referred to as serial correlations or autocorrelations. Hence, linear models can be used to analyse
the dynamic structure of such a series with the help of autocorrelation function, and forecasting can then be performed
(see Brockwell et al. [1996]).

5.1.1 Stationarity

While the foundation of time series analysis is stationarity, autocorrelations are basic tools for studying this stationarity. A time series $\{x_t, t \in \mathbb{Z}\}$ is said to be strongly stationary, or strictly stationary, if the joint distribution of $(x_{t_1}, .., x_{t_k})$ is identical to that of $(x_{t_1+h}, .., x_{t_k+h})$ for all $h$
$$(x_{t_1}, .., x_{t_k}) \overset{d}{=} (x_{t_1+h}, .., x_{t_k+h})$$
where $k$ is an arbitrary positive integer and $(t_1, .., t_k)$ is a collection of $k$ positive integers. Thus, strict stationarity requires that the joint distribution of $(x_{t_1}, .., x_{t_k})$ is invariant under time shift. Since this condition is difficult to verify empirically, a weaker version of stationarity is often assumed. The time series $\{x_t, t \in \mathbb{Z}\}$ is weakly stationary if both the mean of $x_t$ and the covariance between $x_t$ and $x_{t-k}$ are time-invariant, where $k$ is an arbitrary integer. That is, $\{x_t\}$ is weakly stationary if
$$E[x_t] = \mu \quad \text{and} \quad Cov(x_t, x_{t-k}) = \gamma_k$$
where $\mu$ is constant and $\gamma_k$ is independent of $t$. That is, we assume that the first two moments of $x_t$ are finite. In the special case where $x_t$ is normally distributed, weak stationarity is equivalent to strict stationarity. The covariance $\gamma_k$ is called the lag-$k$ autocovariance of $x_t$ and has the following properties:
• $\gamma_0 = Var(x_t)$
• $\gamma_{-k} = \gamma_k$

The latter holds because $Cov(x_t, x_{t-(-k)}) = Cov(x_{t-(-k)}, x_t) = Cov(x_{t+k}, x_t) = Cov(x_{t_1}, x_{t_1-k})$, where $t_1 = t + k$.
In the finance literature, it is common to assume that an asset return series is weakly stationary since a stationary
time series is easy to predict as its statistical properties are constant. However, financial time series such as rates, FX,
and equity are non-stationary. The non-stationarity of a price series is mainly due to the fact that there is no fixed level
for the price; such a series is called a unit-root non-stationary time series. It is well known that these underlyings are prone to
different external shocks. In a stationary time series, these shocks should eventually die away, meaning that a shock
occurring at time t will have a smaller effect at time t + 1, and an even smaller effect at time t + 2 gradually dying
out. However, if the data is non-stationary, the persistence of shocks will always be infinite, meaning that a shock at
time t will not have a smaller effect at time t + 1, t + 2 and so on.

5.1.2 The autocorrelation function

Studies often mention the problem of time dependence in the return series of stocks or indices. Typically, estimates for non-existent return figures are then set equal to the last reported transaction price. This results in serial correlation for stock prices, which further causes distortions in the parameter estimates, especially the standard deviation. When the linear dependence between $x_t$ and $x_{t-i}$ is of interest, we consider a generalisation of the correlation called the autocorrelation.
Definition 5.1.1 ACF
The autocorrelation function (ACF), $\rho(k)$, for a weakly stationary time series $\{x_t : t \in \mathbb{N}\}$ is given by
$$\rho(k) = \frac{E[(x_t - \mu)(x_{t+k} - \mu)]}{\sigma^2}$$
where $E[x_t]$ is the expectation of $x_t$, $\mu$ is the mean and $\sigma^2$ is the variance.
Following Eling [2006], we compute the first order autocorrelation value for all stocks and then use the Ljung-Box statistic (see Ljung et al. [1978]) to check whether this value is statistically significant. It tests for high order serial correlation in the residuals. Given two random variables $X$ and $Y$, the correlation coefficient between these two variables is
$$\rho_{x,y} = \frac{Cov(X, Y)}{\sqrt{Var(X) Var(Y)}} = \frac{E[(X - \mu_x)(Y - \mu_y)]}{\sqrt{E[(X - \mu_x)^2] E[(Y - \mu_y)^2]}}$$
where $\mu_x$ and $\mu_y$ are the means of $X$ and $Y$, and with $-1 \le \rho_{x,y} \le 1$ and $\rho_{x,y} = \rho_{y,x}$. Given the sample $\{(x_t, y_t)\}_{t=1}^{T}$, the sample correlation can be consistently estimated by
$$\hat{\rho}_{x,y} = \frac{\sum_{t=1}^{T} (x_t - \bar{x})(y_t - \bar{y})}{\sqrt{\sum_{t=1}^{T} (x_t - \bar{x})^2 \sum_{t=1}^{T} (y_t - \bar{y})^2}}$$
where $\bar{x} = \frac{1}{T} \sum_{t=1}^{T} x_t$ and $\bar{y} = \frac{1}{T} \sum_{t=1}^{T} y_t$ are respectively the sample means of $X$ and $Y$. Similarly, given the weakly stationary time series $\{x_t\}$, the lag-$k$ autocorrelation of $x_t$ is defined by
$$\rho_k = \frac{Cov(x_t, x_{t-k})}{\sqrt{Var(x_t) Var(x_{t-k})}} = \frac{Cov(x_t, x_{t-k})}{Var(x_t)} = \frac{\gamma_k}{\gamma_0}$$
since V ar(xt ) = V ar(xt−k ) for a weakly stationary series. We have ρ0 = 1, ρk = ρ−k , and −1 ≤ ρk ≤ 1. Further,
a weakly stationary series xt is not serially correlated if and only if ρk = 0 for all k > 0. Again, we let {xt }Tt=1 be a
given sample of X, and estimate the autocorrelation coefficient at lag k with

$$\hat{\rho}(k) = \frac{\frac{1}{T-k-1} \sum_{t=k+1}^{T} (x_t - \bar{x})(x_{t-k} - \bar{x})}{\frac{1}{T-1} \sum_{t=1}^{T} (x_t - \bar{x})^2}, \quad 0 \le k$$
[...] $0$, then we get the decomposition
$$x_t = g(F_{t-1}) + \sqrt{h(F_{t-1})} \, \epsilon_t$$
where $\epsilon_t = \frac{a_t}{\sigma_t}$ is a standardised shock. In the linear model, $g(.)$ is a linear function of elements of $F_{t-1}$ and $h(.) = \sigma^2$. Nonlinear models involve making extensions such that if $g(.)$ is nonlinear, $x_t$ is nonlinear in mean, and if $h(.)$ is time-variant, then $x_t$ is nonlinear in variance.

5.2.2 The autoregressive models

5.2.2.1 Definition

Letting $x_t = r_t$, the observation that monthly returns of an equity index have a statistically significant lag-1 autocorrelation indicates that the lagged return $r_{t-1}$ may be useful in predicting $r_t$. A simple autoregressive (AR) model designed to use such a predictive power is
$$r_t = \phi_0 + \phi_1 r_{t-1} + a_t$$
where $\{a_t\}$ is a white noise series with mean zero and variance $\sigma^2$. This is an AR(1) model where $r_t$ is the dependent variable and $r_{t-1}$ is the explanatory variable. In this model, conditional on the past return $r_{t-1}$, we have
$$E[r_t | r_{t-1}] = \phi_0 + \phi_1 r_{t-1}, \quad Var(r_t | r_{t-1}) = Var(a_t) = \sigma^2$$
This is a Markov property such that conditional on $r_{t-1}$, the return $r_t$ is not correlated with $r_{t-i}$ for $i > 1$. In the case where $r_{t-1}$ alone cannot determine the conditional expectation of $r_t$, we can use a generalisation of the AR(1) model called the AR($p$) model, defined as
$$r_t = \phi_0 + \phi_1 r_{t-1} + ... + \phi_p r_{t-p} + a_t$$
where $p$ is a non-negative integer. In that model, the past $p$ values $\{r_{t-i}\}_{i=1,..,p}$ jointly determine the conditional expectation of $r_t$ given the past data. The AR($p$) model is in the same form as a multiple linear regression model with lagged values serving as explanatory variables.
5.2.2.2 Some properties

Given the conditional mean of the AR(1) model in Section (5.2.2.1), under the stationarity condition we get $E[r_t] = E[r_{t-1}] = \mu$, so that
$$\mu = \phi_0 + \phi_1 \mu \quad \text{or} \quad E[r_t] = \frac{\phi_0}{1 - \phi_1}$$
As a result, the mean of $r_t$ exists if $\phi_1 \neq 1$, and it is zero if and only if $\phi_0 = 0$, implying that the term $\phi_0$ is related to the mean of $r_t$. Further, using $\phi_0 = (1 - \phi_1)\mu$, we can rewrite the AR(1) model as
$$r_t - \mu = \phi_1 (r_{t-1} - \mu) + a_t$$
By repeated substitutions, the prior equation implies
$$r_t - \mu = a_t + \phi_1 a_{t-1} + \phi_1^2 a_{t-2} + ... = \sum_{i=0}^{\infty} \phi_1^i a_{t-i}$$
such that $r_t - \mu$ is a linear function of $a_{t-i}$ for $i \ge 0$. From independence of the series $\{a_t\}$, we get $E[(r_t - \mu) a_{t+1}] = 0$, and by the stationarity assumption we get $Cov(r_{t-1}, a_t) = E[(r_{t-1} - \mu) a_t] = 0$. Taking the square and then the expectation of the equation for $r_t - \mu$, we obtain
$$Var(r_t) = \phi_1^2 Var(r_{t-1}) + \sigma^2$$
since the covariance between $r_{t-1}$ and $a_t$ is zero. Under the stationarity assumption $Var(r_t) = Var(r_{t-1})$, we get
$$Var(r_t) = \frac{\sigma^2}{1 - \phi_1^2}$$
provided that $\phi_1^2 < 1$, which results from the fact that the variance of a random variable is bounded and non-negative. One can show that the AR(1) model is weakly stationary if $|\phi_1| < 1$. Multiplying the equation for $r_t - \mu$ above by $a_t$, and taking the expectation we get
$$E[a_t (r_t - \mu)] = \phi_1 E[a_t (r_{t-1} - \mu)] + E[a_t^2] = E[a_t^2] = \sigma^2$$
where $\sigma^2$ is the variance of $a_t$. Repeating the process for $(r_{t-k} - \mu)$ and using the prior result, we get
$$
\gamma_k = \begin{cases}
\phi_1 \gamma_1 + \sigma^2 & \text{if } k = 0 \\
\phi_1 \gamma_{k-1} & \text{if } k > 0
\end{cases}
$$
where we use $\gamma_k = \gamma_{-k}$. As a result, we get
$$Var(r_t) = \gamma_0 = \frac{\sigma^2}{1 - \phi_1^2} \quad \text{and} \quad \gamma_k = \phi_1 \gamma_{k-1} \text{ for } k > 0$$
Consequently, the ACF of $r_t$ satisfies
$$\rho_k = \phi_1 \rho_{k-1} \text{ for } k > 0$$
Since $\rho_0 = 1$, we have $\rho_k = \phi_1^k$, stating that the ACF of a weakly stationary AR(1) series decays exponentially with rate $\phi_1$ and starting value $\rho_0 = 1$. Setting $p = 2$ in the AR($p$) model and repeating the same technique as that of the AR(1) model, we get
$$E[r_t] = \mu = \frac{\phi_0}{1 - \phi_1 - \phi_2}$$
provided $\phi_1 + \phi_2 \neq 1$. Further, we get
$$\gamma_k = \phi_1 \gamma_{k-1} + \phi_2 \gamma_{k-2} \text{ for } k > 0$$
called the moment equation of a stationary AR(2) model. Dividing this equation by $\gamma_0$, we get
$$\rho_k = \phi_1 \rho_{k-1} + \phi_2 \rho_{k-2} \text{ for } k > 0$$
for the ACF of $r_t$. It satisfies the second order difference equation
$$\left( 1 - \phi_1 B - \phi_2 B^2 \right) \rho_k = 0$$
where $B$ is the back-shift operator $B\rho_k = \rho_{k-1}$. Corresponding to the prior difference equation, we can solve a second order polynomial equation leading to the characteristic roots $\omega_i$ for $i = 1, 2$. In the case of an AR($p$) model, the mean of a stationary series satisfies
$$E[r_t] = \frac{\phi_0}{1 - \phi_1 - ... - \phi_p}$$
provided that the denominator is not zero. The associated polynomial equation of the model is
$$x^p - \phi_1 x^{p-1} - ... - \phi_p = 0$$
so that the series $r_t$ is stationary if all the characteristic roots of this equation are less than one in modulus. For a stationary AR($p$) series, the ACF satisfies the difference equation
$$\left( 1 - \phi_1 B - \phi_2 B^2 - ... - \phi_p B^p \right) \rho_k = 0$$
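The stationarity condition and the ACF recursions above can be checked numerically with the following Python sketch. The function names are illustrative assumptions. For the AR(2) case the starting value $\rho_1 = \phi_1 / (1 - \phi_2)$ is not stated in the text; it follows from the moment equation at $k = 1$ together with $\rho_{-1} = \rho_1$.

```python
import numpy as np


def ar_is_stationary(phi):
    """True if all roots of x^p - phi_1 x^(p-1) - ... - phi_p have modulus < 1."""
    phi = np.asarray(phi, dtype=float)
    roots = np.roots(np.concatenate(([1.0], -phi)))
    return bool(np.all(np.abs(roots) < 1.0))


def ar_mean(phi0, phi):
    """Mean phi_0 / (1 - phi_1 - ... - phi_p) of a stationary AR(p) series."""
    return phi0 / (1.0 - np.sum(phi))


def ar1_acf(phi1, max_lag):
    """ACF of a stationary AR(1): rho_k = phi_1 ** k."""
    return phi1 ** np.arange(max_lag + 1)


def ar2_acf(phi1, phi2, max_lag):
    """ACF of a stationary AR(2) from rho_k = phi_1 rho_{k-1} + phi_2 rho_{k-2}."""
    rho = np.empty(max_lag + 1)
    rho[0] = 1.0
    if max_lag >= 1:
        rho[1] = phi1 / (1.0 - phi2)  # moment equation at k = 1 with rho_{-1} = rho_1
    for k in range(2, max_lag + 1):
        rho[k] = phi1 * rho[k - 1] + phi2 * rho[k - 2]
    return rho
```

For instance, the coefficients $(\phi_1, \phi_2) = (0.5, 0.6)$ sum to more than one and give a characteristic root outside the unit circle, so the corresponding AR(2) series is not stationary.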
5.2.2.3 Identifying and estimating AR models

The order determination of AR models consists in specifying empirically the unknown order p of the time series. We
briefly discuss two approaches, one using the partial autocorrelation function (PACF), and the other one using some
information criterion function. Different approaches to order determination may result in different choices for p, and
there is no evidence suggesting that one approach is better than another one in real application.
The partial autocorrelation function

One can introduce the PACF by considering AR models expressed in the form of a multiple linear regression and arranged in a sequential order, which enables us to apply the idea of the partial F test in multiple linear regression analysis
$$
\begin{aligned}
r_t &= \phi_{0,1} + \phi_{1,1} r_{t-1} + \epsilon_{1t} \\
r_t &= \phi_{0,2} + \phi_{1,2} r_{t-1} + \phi_{2,2} r_{t-2} + \epsilon_{2t} \\
r_t &= \phi_{0,3} + \phi_{1,3} r_{t-1} + \phi_{2,3} r_{t-2} + \phi_{3,3} r_{t-3} + \epsilon_{3t} \\
... &= ...
\end{aligned}
$$
where $\phi_{0,j}$, $\phi_{i,j}$, and $\{\epsilon_{jt}\}$ are the constant term, the coefficient of $r_{t-i}$, and the error term of an AR($j$) model. The estimate $\hat{\phi}_{j,j}$ of the $j$th equation is called the lag-$j$ sample PACF of $r_t$. The lag-$j$ PACF shows the added contribution of $r_{t-j}$ to $r_t$ over an AR($j-1$) model, and so on. Therefore, for an AR($p$) model, the lag-$p$ sample PACF should not be zero, but $\hat{\phi}_{j,j}$ should be close to zero for all $j > p$. Under some regularity conditions, it can be shown that the sample PACF of an AR($p$) model has the following properties
• $\hat{\phi}_{p,p}$ converges to $\phi_p$ as the sample size $T$ goes to infinity.
• $\hat{\phi}_{k,k}$ converges to zero for all $k > p$.
• the asymptotic variance of $\hat{\phi}_{k,k}$ is $\frac{1}{T}$ for $k > p$.

That is, for an AR($p$) series, the sample PACF cuts off at lag $p$.
The information criteria

All information criteria available to determine the order $p$ of an AR process are likelihood based. For example, the Akaike Information Criterion (AIC), proposed by Akaike [1973], is defined as
$$AIC = -\frac{2}{T} \ln(\text{likelihood}) + \frac{2}{T} (\text{number of parameters})$$
where the likelihood function is evaluated at the maximum likelihood estimates, and $T$ is the sample size. The second term of the equation above is called the penalty function of the criterion because it penalises a candidate model by the number of parameters used. Different penalty functions result in different information criteria. For a Gaussian AR($k$) model, the AIC simplifies to
$$AIC(k) = \ln(\hat{\sigma}_k^2) + \frac{2k}{T}$$
where $\hat{\sigma}_k^2$ is the maximum likelihood estimate of $\sigma^2$, the variance of $a_t$. In practice, one computes $AIC(k)$ for $k = 0, ..., P$ where $P$ is a prespecified positive integer, and then selects the order $p^*$ that has the minimum AIC value.

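The order selection by AIC can be sketched in Python as follows. This is an illustration under stated assumptions: each candidate AR($k$) model is fitted by conditional least squares rather than exact maximum likelihood, all candidates are fitted on the same observations (those after the first `max_order` points) so that their residual variances are comparable, and $\hat{\sigma}_k^2$ is taken as the mean squared residual. The function names are hypothetical.

```python
import numpy as np


def fit_ar_least_squares(r, p, start=None):
    """Conditional least squares fit of AR(p); returns (coefficients, residuals)."""
    r = np.asarray(r, dtype=float)
    start = p if start is None else start
    target = r[start:]
    # Regressors: constant, then lags 1..p aligned with the target
    cols = [np.ones(len(target))]
    cols += [r[start - i:len(r) - i] for i in range(1, p + 1)]
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef, target - X @ coef


def aic_order(r, max_order):
    """Return the order minimising AIC(k) = ln(sigma_k^2) + 2k/T, and all AIC values."""
    T = len(r)
    aic = []
    for k in range(max_order + 1):
        _, resid = fit_ar_least_squares(r, k, start=max_order)
        sigma2 = np.mean(resid ** 2)
        aic.append(np.log(sigma2) + 2.0 * k / T)
    aic = np.array(aic)
    return int(np.argmin(aic)), aic
```

Because the AIC penalty is mild, the selected order can exceed the true order on a given sample; comparing the result with the sample PACF is a useful cross-check.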

5.2.2.4 Parameter estimation

One usually uses the conditional least squares method when estimating the parameters of an AR($p$) model, which starts with the $(p+1)$th observation. Conditioning on the first $p$ observations, we have
$$r_t = \phi_0 + \phi_1 r_{t-1} + ... + \phi_p r_{t-p} + a_t, \quad t = p+1, .., T$$
which can be estimated by the least squares method (see details in Section (3.2.4.2)). Denoting by $\hat{\phi}_i$ the estimate of $\phi_i$, the fitted model is
$$\hat{r}_t = \hat{\phi}_0 + \hat{\phi}_1 r_{t-1} + ... + \hat{\phi}_p r_{t-p}$$
and the associated residual is
$$\hat{a}_t = r_t - \hat{r}_t$$
The series $\{\hat{a}_t\}$ is called the residual series, from which we obtain
$$\hat{\sigma}^2 = \frac{1}{T - 2p - 1} \sum_{t=p+1}^{T} \hat{a}_t^2$$
If the model is adequate, the residual series should behave as a white noise. If a fitted model is inadequate, it must be refined. The ACF and the Ljung-Box statistics of the residuals can be used to check the closeness of $\hat{a}_t$ to a white noise. In the case of an AR($p$) model, the Ljung-Box statistic $Q(h)$ follows asymptotically a chi-squared distribution with $(h - p)$ degrees of freedom.
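The estimation and residual check can be sketched in Python as below. The least squares fit and the residual variance follow the formulas above. The Ljung-Box statistic is computed with the usual formula $Q(h) = T(T+2) \sum_{l=1}^{h} \hat{\rho}_l^2 / (T - l)$, which is not spelled out in this section and is stated here as an assumption, with $T$ taken as the number of residuals. The function names are illustrative.

```python
import numpy as np


def fit_ar(r, p):
    """Conditional least squares AR(p) fit: returns (phi_hat, residuals, sigma2_hat)."""
    r = np.asarray(r, dtype=float)
    T = len(r)
    target = r[p:]
    # Regressors: constant, then r_{t-1}, .., r_{t-p} for t = p+1, .., T
    cols = [np.ones(T - p)] + [r[p - i:T - i] for i in range(1, p + 1)]
    X = np.column_stack(cols)
    phi, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ phi
    sigma2 = resid @ resid / (T - 2 * p - 1)
    return phi, resid, sigma2


def ljung_box(resid, h):
    """Ljung-Box statistic Q(h) of a residual series."""
    a = np.asarray(resid, dtype=float)
    a = a - a.mean()
    n = len(a)
    denom = a @ a
    rho = np.array([a[k:] @ a[:-k] / denom for k in range(1, h + 1)])
    return n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, h + 1)))
```

The value returned by `ljung_box` would then be compared with a chi-squared quantile with $h - p$ degrees of freedom, for instance using a statistics library of the reader's choice.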

5.2.3 The moving-average models

One can consider another simple linear model called moving average (MA) which can be treated either as a simple
extension of white noise series, or as an infinite order AR model with some parameter constraints. In the latter, we can
write
$$r_t = \phi_0 - \theta_1 r_{t-1} - \theta_1^2 r_{t-2} - \dots + a_t$$
where the coefficients depend on a single parameter θ1 via $\phi_i = -\theta_1^i$ for i ≥ 1. To get stationarity, we must have
|θ1 | < 1 so that $\theta_1^i \to 0$ as i → ∞. Thus, the contribution of rt−i to rt decays exponentially as i increases. Writing the
above equation in compact form we get
$$r_t + \theta_1 r_{t-1} + \theta_1^2 r_{t-2} + \dots = \phi_0 + a_t$$
repeating the process for rt−1 , multiplying by θ1 and subtracting the result from the equation for rt , we obtain
rt = φ0 (1 − θ1 ) + at − θ1 at−1
which is a weighted average of shocks at and at−1 . This is the M A(1) model with c0 = φ0 (1 − θ1 ). The M A(q)
model is
rt = c0 + at − θ1 at−1 − ... − θq at−q
where q > 0. MA models are always weakly stationary since they are finite linear combinations of a white noise
sequence for which the first two moments are time-invariant. For example, in the M A(1) model, E[rt ] = c0 and
$$Var(r_t) = \sigma^2 + \theta_1^2 \sigma^2 = (1 + \theta_1^2)\sigma^2$$
where at and at−1 are uncorrelated. In the M A(q) model, E[rt ] = c0 and
$$Var(r_t) = (1 + \theta_1^2 + \dots + \theta_q^2)\sigma^2$$
Assuming c0 = 0 for an M A(1) model, multiplying the model by rt−k we get
$$r_{t-k} r_t = r_{t-k} a_t - \theta_1 r_{t-k} a_{t-1}$$
and taking expectation
$$\gamma_1 = -\theta_1 \sigma^2 \quad \text{and} \quad \gamma_k = 0 \text{ for } k > 1$$
Given the variance above, we have
$$\rho_0 = 1 , \quad \rho_1 = \frac{-\theta_1}{1 + \theta_1^2} , \quad \rho_k = 0 \text{ for } k > 1$$

Thus, the ACF of an M A(1) model cuts off at lag 1. This property generalises to other MA models, so that an M A(q)
series is only linearly related to its first q lagged values. Hence, it is a finite-memory model. The maximum likelihood
estimation is commonly used to estimate MA models.
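The cut-off of the MA(1) ACF at lag 1 can be checked numerically. The following is a small sketch, assuming NumPy; the names `ma1_acf` and `sample_acf` and the value θ1 = 0.5 are illustrative choices, not part of the original text.

```python
import numpy as np

def ma1_acf(theta1, max_lag):
    """Theoretical ACF of r_t = c0 + a_t - theta1 * a_{t-1}."""
    rho = np.zeros(max_lag + 1)
    rho[0] = 1.0
    if max_lag >= 1:
        rho[1] = -theta1 / (1.0 + theta1 ** 2)
    return rho                     # rho_k = 0 for k > 1

def sample_acf(x, max_lag):
    """Sample autocorrelations at lags 0, ..., max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.sum(x ** 2)
    return np.array([1.0] + [np.sum(x[k:] * x[:-k]) / denom
                             for k in range(1, max_lag + 1)])

rng = np.random.default_rng(2)
theta1, c0, T = 0.5, 0.0, 20000
a = rng.standard_normal(T + 1)
r = c0 + a[1:] - theta1 * a[:-1]   # simulated MA(1) series

rho_theory = ma1_acf(theta1, 3)    # lag-1 value is -0.5 / 1.25 = -0.4
rho_sample = sample_acf(r, 3)
```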

5.2.4

The simple ARMA model

We described in Section (5.2.2) the autoregressive models and in Section (5.2.3) the moving-average models. We can
then combine the AR and MA models in a compact form so that the number of parameters used is kept small. The
general ARM A(p, q) model (see Box et al. [1994]) is defined as
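$$y_t = \phi_0 + \sum_{i=1}^{p} \phi_i y_{t-i} + \epsilon_t - \sum_{i=1}^{q} \theta_i \epsilon_{t-i}$$
where $\{\epsilon_t\}$ is a white noise series and p and q are non-negative integers. Using the back-shift operator, we can rewrite
the model as
$$(1 - \phi_1 B - \dots - \phi_p B^p) y_t = \phi_0 + (1 - \theta_1 B - \dots - \theta_q B^q) \epsilon_t$$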
and we require that there are no common factors between the AR and MA polynomials, so that the order (p, q) of the
model cannot be reduced. Note, the AR polynomial introduces the characteristic equation of an ARMA model. Hence,
if all the characteristic roots, that is, the inverses of the zeros of the AR polynomial, are less than 1 in absolute value, then the ARMA model is weakly
stationary. In this case, the unconditional mean of the model is
$$E[y_t] = \frac{\phi_0}{1 - \phi_1 - \dots - \phi_p}$$

Neither the ACF nor the PACF is informative in determining the order of an ARMA model, but one can use the extended
autocorrelation function (EACF) to specify the order of an ARMA process (see Tsay et al. [1984]). It states that if we
can obtain a consistent estimate of the AR component of an ARMA model, then we can derive the MA component
and use ACF to identify its order. The output of EACF is a two-way table where the rows correspond to AR order p
and the columns to MA order q. Once an ARM A(p, q) model is specified, its parameters can be estimated by either
the conditional or exact likelihood method. Then the Ljung-Box statistics of the residuals can be used to check the
adequacy of the fitted model. In the case where the model is correctly specified, then Q(h) follows asymptotically a
chi-squared distribution with (h − g) degrees of freedom, where g is the number of parameters used.
The representation of the ARM A(p, q) model using the back-shift operator is compact and useful in parameter
estimation. However, other representations exist using the long division of two polynomials. That is, given two
polynomials $\phi(B) = 1 - \sum_{i=1}^{p} \phi_i B^i$ and $\theta(B) = 1 - \sum_{i=1}^{q} \theta_i B^i$, we get by long division
$$\frac{\theta(B)}{\phi(B)} = 1 + \psi_1 B + \psi_2 B^2 + \dots = \psi(B)$$
$$\frac{\phi(B)}{\theta(B)} = 1 - \pi_1 B - \pi_2 B^2 - \dots = \pi(B)$$

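From the definition, ψ(B)π(B) = 1, and making use of the fact that Bc = c for any constant c, we have
$$\frac{\phi_0}{\theta(1)} = \frac{\phi_0}{1 - \theta_1 - \dots - \theta_q} \quad \text{and} \quad \frac{\phi_0}{\phi(1)} = \frac{\phi_0}{1 - \phi_1 - \dots - \phi_p}$$
Using the results above, the ARM A(p, q) model can be written as an AR model
$$y_t = \frac{\phi_0}{1 - \theta_1 - \dots - \theta_q} + \pi_1 y_{t-1} + \pi_2 y_{t-2} + \dots + \epsilon_t$$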
showing the dependence of the current value yt on the past values yt−i for i > 0. For the contribution of
the lagged value yt−i to yt to diminish as i increases, the πi coefficients should decay to zero as i increases. An
ARM A(p, q) model having this property is invertible. A sufficient condition for invertibility is that all the zeros of the
polynomial θ(B) are greater than unity in modulus. Using the AR representation, an invertible ARM A(p, q) series
yt is a linear combination of the current shock $\epsilon_t$ and a weighted average of the past values, with weights decaying
exponentially. Similarly, the ARM A(p, q) model can also be written as an MA model
$$y_t = \mu + \epsilon_t + \psi_1 \epsilon_{t-1} + \psi_2 \epsilon_{t-2} + \dots = \mu + \psi(B) \epsilon_t$$
where $\mu = E[y_t] = \frac{\phi_0}{1 - \phi_1 - \dots - \phi_p}$. It shows explicitly the impact of the past shock $\epsilon_{t-i}$, i > 0, on the current value yt .
The coefficients {ψi } are called the impulse response function of the ARMA model. For a weakly stationary series,
the ψi decay exponentially as i increases. The MA representation provides a simple proof of mean reversion of a
stationary time series, since the speed at which ŷt+k|t approaches µ determines the speed of mean reversion.
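The long division of the two polynomials amounts to matching coefficients of powers of B, which gives simple recursions for the ψ and π weights. The sketch below assumes NumPy and the sign convention used above, that is, φ(B) = 1 − Σ φi B^i and θ(B) = 1 − Σ θi B^i; the function names and the ARMA(1,1) values are hypothetical.

```python
import numpy as np

def psi_weights(phi, theta, n):
    """psi_1..psi_n with theta(B)/phi(B) = 1 + psi_1 B + psi_2 B^2 + ..."""
    phi, theta = np.asarray(phi, float), np.asarray(theta, float)
    psi = np.zeros(n + 1)
    psi[0] = 1.0
    for j in range(1, n + 1):
        acc = -theta[j - 1] if j <= len(theta) else 0.0
        for i in range(1, min(j, len(phi)) + 1):
            acc += phi[i - 1] * psi[j - i]
        psi[j] = acc
    return psi[1:]

def pi_weights(phi, theta, n):
    """pi_1..pi_n with phi(B)/theta(B) = 1 - pi_1 B - pi_2 B^2 - ..."""
    phi, theta = np.asarray(phi, float), np.asarray(theta, float)
    pi = np.zeros(n + 1)                      # pi[0] is unused
    for j in range(1, n + 1):
        acc = phi[j - 1] if j <= len(phi) else 0.0
        acc -= theta[j - 1] if j <= len(theta) else 0.0
        for i in range(1, min(j - 1, len(theta)) + 1):
            acc += theta[i - 1] * pi[j - i]
        pi[j] = acc
    return pi[1:]

# ARMA(1,1) with phi_1 = 0.5 and theta_1 = 0.2 (illustrative values).
psi = psi_weights([0.5], [0.2], 3)            # phi_1^(j-1) * (phi_1 - theta_1)
pi = pi_weights([0.5], [0.2], 3)              # theta_1^(j-1) * (phi_1 - theta_1)

# psi(B) * pi(B) should equal 1 up to the truncation order.
product = np.convolve(np.r_[1.0, psi], np.r_[1.0, -pi])[:4]
```

For a stationary and invertible model both sequences decay geometrically, which is what the impulse response interpretation relies on.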

5.3

Forecasting

The ultimate objective of using a stochastic process to model time series is forecasting. In the construction of
forecasts the idea of conditional expectations is very important. The conditional expectation has the property of being
the minimum mean square error (MMSE) forecast. This means that if the model specification is correct, there is no
other forecast which will have errors whose squares have a lower expected value than
$$\hat{y}_{t+k|t} = E[y_{t+k} | \mathcal{F}_t]$$
with forecast horizon k. Therefore, given a set of observations, the optimal predictor k steps ahead is the expected
value of yt+k conditional on the information at time t. This predictor is said to be optimal because it has minimum
mean square error. To prove this statement, the forecasting error of any candidate predictor ŷt+k|t can be split as
$$y_{t+k} - \hat{y}_{t+k|t} = \left( y_{t+k} - E[y_{t+k} | \mathcal{F}_t] \right) + \left( E[y_{t+k} | \mathcal{F}_t] - \hat{y}_{t+k|t} \right)$$
and, since the cross term has zero conditional expectation, it follows that
$$MSE(\hat{y}_{t+k|t}) = Var(y_{t+k} | \mathcal{F}_t) + \left( \hat{y}_{t+k|t} - E[y_{t+k} | \mathcal{F}_t] \right)^2$$
The conditional variance of yt+k does not depend on ŷt+k|t , and therefore, the MMSE of yt+k is given by the conditional mean.


5.3.1

Forecasting with the AR models

For the AR(p) model, with forecast origin t, and forecast horizon k, we let ŷt+k|t be the forecast of yt+k using the
minimum squared error loss function. That is, ŷt+k|t is chosen such that
$$E[(y_{t+k} - \hat{y}_{t+k|t})^2] \leq \min_g E[(y_{t+k} - g)^2]$$

where g is a function of the information available at time t (inclusive). In the AR(p) model, for the 1-step ahead
forecast, we have
$$y_{t+1} = \phi_0 + \phi_1 y_t + \dots + \phi_p y_{t+1-p} + \epsilon_{t+1}$$
where yt corresponds to rt and $\epsilon_t$ to at when considering returns. Under the minimum squared error loss function, the
point forecast of yt+1 , given the model and observations up to time t, is the conditional expectation
$$\hat{y}_{t+1|t} = E[y_{t+1} | y_t , y_{t-1} , \dots] = \phi_0 + \sum_{i=1}^{p} \phi_i y_{t+1-i}$$
and the associated forecast error is
$$e_{t+1|t} = y_{t+1} - \hat{y}_{t+1|t} = \epsilon_{t+1}$$
The variance of the 1-step ahead forecast error is $Var(e_{t+1|t}) = Var(\epsilon_{t+1}) = \sigma^2$. Hence, if $\epsilon_t$ is normally distributed,
then a 95% 1-step ahead interval forecast of yt+1 is
$$\hat{y}_{t+1|t} \pm 1.96 \times \sigma$$
Note, $\epsilon_{t+1}$ is the 1-step ahead forecast error at the forecast origin t, and it is referred to as the shock of the series at
time t + 1. Further, in practice, estimated parameters are often used to compute point and interval forecast, resulting in
a Conditional Forecast since it does not take into consideration the uncertainty in the parameter estimates. Considering
parameter uncertainty is a much more involved process. When the sample size used in estimation is sufficiently large,
then the conditional forecast is close to the unconditional one. In the general case, we have
$$y_{t+k} = \phi_0 + \phi_1 y_{t+k-1} + \dots + \phi_p y_{t+k-p} + \epsilon_{t+k}$$
The k-step ahead forecast based on the minimum squared error loss function is the conditional expectation of yt+k
given $\{y_{t-i}\}_{i=0}^{\infty}$ obtained as
$$\hat{y}_{t+k|t} = \phi_0 + \sum_{i=1}^{p} \phi_i \hat{y}_{t+k-i|t}$$
where ŷt+i|t = yt+i for i ≤ 0. This forecast can be computed recursively using forecast ŷt+i|t for i = 1, .., k − 1.
The k-step ahead forecast error is et+k|t = yt+k − ŷt+k|t . It can be shown that for a stationary AR(p) model, ŷt+k|t
converges to E[yt ] as k → ∞, meaning that for such a series, long-term point forecast approaches its unconditional
mean. This property is called mean-reversion in the finance literature. The variance of the forecast error approaches
the unconditional variance of yt .
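The recursion for the k-step ahead AR(p) forecast, and its convergence to the unconditional mean, can be illustrated as follows. This is a sketch assuming NumPy; the function name `ar_forecast` and the AR(1) values are hypothetical, and the parameters are treated as known, so this is the conditional forecast in the sense described above.

```python
import numpy as np

def ar_forecast(phi0, phi, history, k):
    """Recursive forecasts y_hat_{t+1|t}, ..., y_hat_{t+k|t} for an AR(p) model.

    `history` holds the observed series up to the forecast origin t
    (oldest first) and must contain at least p values.
    """
    phi = np.asarray(phi, dtype=float)
    p = len(phi)
    path = list(np.asarray(history, dtype=float)[-p:])
    forecasts = []
    for _ in range(k):
        recent = path[::-1][:p]               # most recent value first
        y_hat = phi0 + float(np.dot(phi, recent))
        forecasts.append(y_hat)
        path.append(y_hat)                    # forecasts replace unknown future values
    return np.array(forecasts)

# AR(1) with phi_0 = 1 and phi_1 = 0.5, last observation y_t = 4.
forecasts = ar_forecast(phi0=1.0, phi=[0.5], history=[4.0], k=20)
uncond_mean = 1.0 / (1.0 - 0.5)               # phi_0 / (1 - phi_1) = 2
```

The forecasts start at 3.0 and 2.5 and then approach the unconditional mean 2 geometrically, which is the mean-reversion property described in the text.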

5.3.2

Forecasting with the MA models

Since the MA model has finite memory, its point forecasts go to the mean of the series very quickly. For the 1-step
ahead forecast of an M A(1) process at the forecast origin t, we get
$$y_{t+1} = c_0 + \epsilon_{t+1} - \theta_1 \epsilon_t$$
and taking conditional expectation, we have
$$\hat{y}_{t+1|t} = E[y_{t+1} | y_t , y_{t-1} , \dots] = c_0 - \theta_1 \epsilon_t$$
$$e_{t+1|t} = y_{t+1} - \hat{y}_{t+1|t} = \epsilon_{t+1}$$
The variance of the 1-step ahead forecast error is $Var(e_{t+1|t}) = \sigma^2$. To compute $\epsilon_t$ one can assume that $\epsilon_0 = 0$ and
get $\epsilon_1 = y_1 - c_0$, and then compute $\epsilon_h$ for 2 ≤ h ≤ t recursively by using $\epsilon_h = y_h - c_0 + \theta_1 \epsilon_{h-1}$. For the 2-step ahead
forecast we get ŷt+2|t = c0 and the variance of the forecast error is $Var(e_{t+2|t}) = (1 + \theta_1^2) \sigma^2$, so that the 2-step
ahead forecast of the series is simply the unconditional mean of the model. More generally ŷt+k|t = c0 for k ≥ 2.
Hence, the forecast ŷt+k|t versus k form a horizontal line on a plot after one step. In general, for an M A(q) model,
multistep ahead forecasts go to the mean after the first q steps.
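The shock recursion and the resulting MA(1) forecasts can be written in a few lines. This is a plain Python sketch with hypothetical function names and toy inputs; it assumes the starting value of the shock is zero, as in the text, and treats c0 and θ1 as known.

```python
def ma1_shocks(y, c0, theta1):
    """Shocks of an MA(1) model recovered recursively, starting from a zero shock."""
    eps_prev = 0.0
    shocks = []
    for y_h in y:
        eps_h = y_h - c0 + theta1 * eps_prev
        shocks.append(eps_h)
        eps_prev = eps_h
    return shocks

def ma1_forecast(y, c0, theta1, k):
    """Forecasts for horizons 1..k: c0 - theta1 * eps_t at step 1, then c0."""
    eps_t = ma1_shocks(y, c0, theta1)[-1]
    return [c0 - theta1 * eps_t if step == 1 else c0 for step in range(1, k + 1)]

# Toy data: two observations, c0 = 0.5, theta_1 = 0.5.
shocks = ma1_shocks([1.0, 2.0], c0=0.5, theta1=0.5)
forecasts = ma1_forecast([1.0, 2.0], c0=0.5, theta1=0.5, k=3)
```

Here the recovered shocks are 0.5 and 1.75, so the 1-step forecast is 0.5 − 0.5 × 1.75 = −0.375 and all later forecasts equal c0.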

5.3.3

Forecasting with the ARMA models

The forecasts of an ARM A(p, q) model have similar characteristics to those of an AR(p) model, after adjusting for
the impacts of the MA component on the lower horizon forecasts. For the 1-step ahead forecast of yt+1 , with forecast
origin t, we have
$$\hat{y}_{t+1|t} = E[y_{t+1} | y_t , y_{t-1} , \dots] = \phi_0 + \sum_{i=1}^{p} \phi_i y_{t+1-i} - \sum_{i=1}^{q} \theta_i \epsilon_{t+1-i}$$
and the associated forecast error is $e_{t+1|t} = y_{t+1} - \hat{y}_{t+1|t} = \epsilon_{t+1}$. The variance of the 1-step ahead error is
$Var(e_{t+1|t}) = \sigma^2$. For the k-step ahead forecast of yt+k , with forecast origin t, we have
$$\hat{y}_{t+k|t} = E[y_{t+k} | y_t , y_{t-1} , \dots] = \phi_0 + \sum_{i=1}^{p} \phi_i \hat{y}_{t+k-i|t} - \sum_{i=1}^{q} \theta_i \hat{\epsilon}_{t+k-i|t}$$
where ŷt+k−i|t = yt+k−i if k − i ≤ 0 and
$$\hat{\epsilon}_{t+k-i|t} = \begin{cases} 0 & \text{if } k - i > 0 \\ \epsilon_{t+k-i} & \text{if } k - i \leq 0 \end{cases}$$
Thus, the multi-step ahead forecasts of an ARMA model can be computed recursively, and the associated forecast
error is
$$e_{t+k|t} = y_{t+k} - \hat{y}_{t+k|t}$$
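The recursive scheme for ARMA(p, q) forecasts can be sketched as below. It is a plain Python illustration with a hypothetical function name; it assumes known parameters and that the past shocks have already been recovered (for example by a recursion with zero starting values), and it needs at least p past observations and q past shocks.

```python
def arma_forecast(phi0, phi, theta, y_hist, eps_hist, k):
    """Forecasts y_hat_{t+1|t}, ..., y_hat_{t+k|t} of an ARMA(p, q) model.

    y_hist and eps_hist hold observations and shocks up to time t, oldest first.
    """
    p, q = len(phi), len(theta)
    y_path = list(y_hist)       # observed values, then forecasts
    eps_path = list(eps_hist)   # known shocks, then zeros for future shocks
    t = len(y_path)
    for _ in range(k):
        ar_part = sum(phi[i - 1] * y_path[-i] for i in range(1, p + 1))
        ma_part = sum(theta[i - 1] * eps_path[-i] for i in range(1, q + 1))
        y_path.append(phi0 + ar_part - ma_part)
        eps_path.append(0.0)    # the expected future shock is zero
    return y_path[t:]

# ARMA(1,1) with phi_0 = 0, phi_1 = 0.5, theta_1 = 0.2, y_t = 2 and eps_t = 1.
forecasts = arma_forecast(0.0, [0.5], [0.2], [2.0], [1.0], 3)
```

The MA term only affects the first q horizons (here the first forecast, 0.5 × 2 − 0.2 × 1 = 0.8); afterwards the forecasts follow the AR recursion, giving 0.4 and 0.2.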

5.4

Nonstationarity and serial correlation

5.4.1

Unit-root nonstationarity

When modelling equity stocks, interest rates, or foreign exchange rates, the two models most often considered for characterising their non-stationarity have been the random walk model with a drift
$$y_t = \mu + \phi y_{t-1} + \epsilon_t \qquad (5.4.1)$$
and the trend-stationary process
$$y_t = \alpha + \beta t + \epsilon_t \qquad (5.4.2)$$
where $\{\epsilon_t\}$ is a white noise series. In the random walk model the value of φ can have different effects on the stock
process


1. φ < 1 → $\phi^T \to 0$ as T → ∞
2. φ = 1 → $\phi^T = 1$ for all T
3. φ > 1
In case (1) the shocks in the system gradually die away, which is called the stationary case. In case (2) the shocks
persist in the system and do not die away, leading to
$$y_t = y_0 + \sum_{t=0}^{\infty} \epsilon_t \quad \text{as } T \to \infty$$
That is, the current value of y is an infinite sum of the past shocks added to the starting value of y. This case is known
as the unit root case. In case (3) the shocks become more influential as time goes on.
5.4.1.1

The random walk

A time series {yt } is a random walk if it satisfies
$$y_t = y_{t-1} + \epsilon_t$$
where the real number y0 is the starting value of the process and $\{\epsilon_t\}$ is a white noise series. The random walk is a
special AR(1) model with coefficient φ1 of yt−1 being equal to unity, so that it does not satisfy the weak stationarity
condition of an AR(1) model. Hence, we call it a unit-root nonstationary time series. Under such a model, the stock price is
not predictable or mean reverting. The 1-step ahead forecast of the random walk at the origin t is
ŷt+1|t = E[yt+1 |yt , yt−1 , ...] = yt
which is the value at the forecast origin, and thus, has no practical value. For any forecast horizon k > 0 we have
ŷt+k|t = yt
so that point forecasts of a random walk model are simply the value of the series at the forecast origin, and the process
is not mean-reverting. The MA representation of the random walk is
$$y_t = \sum_{i=0}^{\infty} \epsilon_{t-i}$$
and the k-step ahead forecast error is
$$e_{t+k|t} = \epsilon_{t+k} + \dots + \epsilon_{t+1}$$
with $Var(e_{t+k|t}) = k\sigma^2$, which diverges to infinity as k → ∞. Hence, the usefulness of point forecast ŷt+k|t
diminishes as k increases, implying that the model is not predictable. Further, as the variance of the forecast error
approaches infinity when k increases, the unconditional variance of yt is unbounded, meaning that it can take any real
value for sufficiently large t, which is questionable for indexes. Finally, since ψi = 1 for all i, the impact of any
past shock $\epsilon_{t-i}$ on yt does not decay over time, and the series has a strong memory as it remembers all of the past
shocks. That is, the shocks have a permanent effect on the series.
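The linear growth of the forecast error variance of a random walk can be checked by simulation. The following sketch assumes NumPy and Gaussian shocks with σ = 1; the number of paths and the horizon are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, k, n_paths = 1.0, 10, 20000

# Each row holds the future shocks eps_{t+1}, ..., eps_{t+k} of one path.
shocks = sigma * rng.standard_normal((n_paths, k))

# The j-step ahead forecast error is the cumulative sum of the first j shocks.
errors = np.cumsum(shocks, axis=1)

sample_var = errors.var(axis=0)                    # empirical Var(e_{t+j|t}), j = 1..k
theory_var = sigma ** 2 * np.arange(1, k + 1)      # j * sigma^2
```

Comparing `sample_var` with `theory_var` shows the variance increasing by roughly σ² per step, in contrast to a stationary AR model where it levels off at the unconditional variance.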


5.4.1.2

The random walk with drift

When a time series experiences a small and positive mean, we can consider a random walk with drift
$$y_t = \mu + y_{t-1} + \epsilon_t$$
where µ = E[yt − yt−1 ] is the time-trend of yt , or drift of the model, and $\{\epsilon_t\}$ is a white noise series. Assuming initial
value y0 , we can rewrite the model as
$$y_t = t\mu + y_0 + \epsilon_t + \epsilon_{t-1} + \dots + \epsilon_1$$
consisting of the time-trend tµ and a pure random walk process $\sum_{i=1}^{t} \epsilon_i$. Also, since $Var(\sum_{i=1}^{t} \epsilon_i) = t\sigma^2$, the
conditional standard deviation of yt is $\sqrt{t}\sigma$ which grows at a slower rate than the conditional expectation of yt .
Plotting yt against the time index t, we get a time-trend with slope equal to µ. We can analyse the constant term in
the series by noting that for an M A(q) model the constant term is the mean of the series. In the case of a stationary
AR(p) model or ARM A(p, q) model, the constant term is related to the mean via
$$\mu = \frac{\phi_0}{1 - \sum_{i=1}^{p} \phi_i}$$
These differences in interpreting the constant term reflect the difference between the dynamic and linear regression
models. In the general case, allowing the AR polynomial to have 1 as a characteristic root, we get the autoregressive
integrated moving average ARIM A model which is unit-root nonstationary because its AR polynomial has a unit
root. An ARIM A model has a strong memory because the ψi coefficients in its M A representation do not decay over
time to zero, so that the past shocks $\epsilon_{t-i}$ of the model have a permanent effect on the series. A conventional approach
for handling unit root nonstationarity is to use differencing.
5.4.1.3

The unit-root test

To test whether the value yt follows a random walk or a random walk with a drift, the unit-root testing problem (see
Dickey et al. [1979]) employs the models
$$y_t = \phi_1 y_{t-1} + \epsilon_t$$
$$y_t = \phi_0 + \phi_1 y_{t-1} + \epsilon_t$$
where $\epsilon_t$ denotes the error term, and considers the null hypothesis
$$H_0 : \phi_1 = 1$$
versus the alternative hypothesis
$$H_1 : \phi_1 < 1$$
A convenient test statistic is the t ratio of the least squares (LS) estimate of φ1 under the null hypothesis. The LS
method for the first equation above gives
$$\hat{\phi}_1 = \frac{\sum_{t=1}^{T} y_{t-1} y_t}{\sum_{t=1}^{T} y_{t-1}^2} , \quad \hat{\sigma}_\epsilon^2 = \frac{1}{T - 1} \sum_{t=1}^{T} (y_t - \hat{\phi}_1 y_{t-1})^2$$
where y0 = 0 and T is the sample size. The t ratio is
$$DF = \text{t-ratio} = \frac{\hat{\phi}_1 - 1}{std(\hat{\phi}_1)} = \frac{\sum_{t=1}^{T} y_{t-1} \epsilon_t}{\hat{\sigma}_\epsilon \sqrt{\sum_{t=1}^{T} y_{t-1}^2}}$$


which is referred to as the Dickey-Fuller test. If $\{\epsilon_t\}$ is a white noise series with finite moments of order slightly
greater than 2, then the DF-statistic converges to a function of the standard Brownian motion as T → ∞ (see Chan et
al. [1988]). If φ0 = 0, but the second model (with a constant term) is still used, the resulting t ratio for testing φ1 = 1 will converge to another nonstandard asymptotic distribution. If φ0 ≠ 0, and the second model is used, the t ratio for testing φ1 = 1 is asymptotically
normal, but a large sample size is required.
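The Dickey-Fuller statistic for the model without a constant can be computed directly from the least squares formulas above. The sketch below assumes NumPy; the function name `dickey_fuller_stat` and the simulated series are hypothetical. It only computes the statistic: the critical values come from the nonstandard limiting distribution and are not derived here.

```python
import numpy as np

def dickey_fuller_stat(y):
    """t ratio of phi_1_hat for H0: phi_1 = 1 in y_t = phi_1 * y_{t-1} + eps_t."""
    y = np.asarray(y, dtype=float)
    y_lag = np.concatenate(([0.0], y[:-1]))        # convention y_0 = 0
    T = len(y)
    sum_sq = np.sum(y_lag ** 2)
    phi1_hat = np.sum(y_lag * y) / sum_sq
    sigma2_hat = np.sum((y - phi1_hat * y_lag) ** 2) / (T - 1)
    std_phi1 = np.sqrt(sigma2_hat / sum_sq)
    return (phi1_hat - 1.0) / std_phi1

rng = np.random.default_rng(4)
eps = rng.standard_normal(500)

random_walk = np.cumsum(eps)                       # phi_1 = 1
stationary = np.zeros(500)                         # phi_1 = 0.5
for t in range(1, 500):
    stationary[t] = 0.5 * stationary[t - 1] + eps[t]

df_rw = dickey_fuller_stat(random_walk)
df_ar = dickey_fuller_stat(stationary)
```

For the stationary series the statistic is strongly negative, which leads to rejecting the unit-root null, while for the random walk it stays much closer to zero.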

5.4.2

Regression models with time series

The relationship between two time series is of major interest, for instance the Market model in finance relates the
return of an individual stock to the return of a market index. In general, we consider the linear regression
$$r_{1t} = \alpha + \beta r_{2t} + \epsilon_t$$
where rit for i = 1, 2 are two time series and $\epsilon_t$ is the error term. The least squares (LS) method is often used to
estimate the model parameters (see details in Section (3.2.4.2)). If $\{\epsilon_t\}$ is a white noise series, then the LS
method produces consistent estimates. However, in practice the error term is often serially correlated, so
that we get a regression model with time series errors. Even though this approach is widely used in finance, it is a
misused econometric model when the serial dependence in $\epsilon_t$ is overlooked. One can look at the time plot and ACF
of the residuals to detect patterns of a unit-root nonstationary time series. When the two time series are unit-root
nonstationary and the residuals also behave like a unit-root nonstationary series, the two series are not co-integrated. In that case the data
fail to support the hypothesis that there exists a long-term equilibrium between the two series. One way forward for
building a linear regression model with time series errors is to use a simple time series model for the residual series
and estimate the whole model jointly. For example, considering the modified series
$$c_{1t} = r_{1t} - r_{1,t-1} = (1 - B) r_{1t} \quad \text{for } t \geq 2$$
$$c_{2t} = r_{2t} - r_{2,t-1} = (1 - B) r_{2t} \quad \text{for } t \geq 2$$
we can specify an M A(1) model for the residuals and modify the linear regression model to get
$$c_{2t} = \alpha + \beta c_{1t} + \epsilon_t , \quad \epsilon_t = a_t - \theta_1 a_{t-1}$$
where {at } is assumed to be a white noise series. The M A(1) model is used to capture the serial dependence in the
error term. More complex time series models can be added to a linear regression equation to form a general regression
model with time series error. One can consider the Cochrane-Orcutt estimator to handle the serial dependence in the
residuals (see Greene [2000]). When the time series model used is stationary and invertible, one can estimate the
model jointly by using the maximum likelihood method (MLM). Note, one can use the Durbin-Watson (DW) statistic
to check residuals for serial correlation, but it only considers the lag-1 serial correlation. For a residual series $\epsilon_t$ with T
observations, the Durbin-Watson statistic is
$$DW = \frac{\sum_{t=2}^{T} (\epsilon_t - \epsilon_{t-1})^2}{\sum_{t=1}^{T} \epsilon_t^2}$$
which is approximated by
$$DW \approx 2(1 - \hat{\rho}_1)$$
where ρ̂1 is the lag-1 ACF of $\{\epsilon_t\}$. When residual serial dependence appears at higher order lags (seasonal behaviour),
one can use the Ljung-Box statistics.
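The Durbin-Watson statistic and its approximation by the lag-1 autocorrelation can be illustrated as follows. This sketch assumes NumPy; the function names and the MA(1) residuals with θ1 = 0.5 are illustrative assumptions.

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum_{t=2}^T (e_t - e_{t-1})^2 / sum_{t=1}^T e_t^2."""
    e = np.asarray(resid, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def lag1_acf(resid):
    """Sample lag-1 autocorrelation of the residuals."""
    e = np.asarray(resid, dtype=float)
    e = e - e.mean()
    return np.sum(e[1:] * e[:-1]) / np.sum(e ** 2)

# A perfectly alternating series: numerator 3 * 4 = 12, denominator 4.
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])

# Residuals following an MA(1) model with theta_1 = 0.5, so rho_1 = -0.4.
rng = np.random.default_rng(5)
a = rng.standard_normal(5001)
resid = a[1:] - 0.5 * a[:-1]
dw = durbin_watson(resid)
approx = 2.0 * (1.0 - lag1_acf(resid))
```

A value of DW near 2 indicates no lag-1 serial correlation; here the negative lag-1 correlation pushes DW towards 2.8.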


5.4.3

Long-memory models

Even though for a stationary time series the ACF decays exponentially to zero as lag increases, for a unit-root nonstationary time series, Tiao et al. [1983] showed that the sample ACF converges to 1 for all fixed lags as the sample
size increases. There exist some time series, called long-memory time series, whose ACF decays slowly to zero at
a polynomial rate as the lag increases. For instance, Hosking [1981] proposed the fractionally differenced process
defined by
$$(1 - B)^d x_t = a_t , \quad -\tfrac{1}{2} < d < \tfrac{1}{2}$$
where {at } is a white noise series. Among its properties:
• if d > −1/2, then xt is invertible and has the infinite AR representation
$$x_t = -\sum_{i=1}^{\infty} \pi_i x_{t-i} + a_t$$
with
$$\pi_k = \frac{-d(1 - d) \dots (k - 1 - d)}{k!} = \frac{(k - d - 1)!}{k!(-d - 1)!}$$
• for −1/2 < d < 1/2, the ACF of xt is
$$\rho_k = \frac{d(1 + d) \dots (k - 1 + d)}{(1 - d)(2 - d) \dots (k - d)} , \quad k = 1, 2, \dots$$
in particular, $\rho_1 = \frac{d}{1 - d}$ and
$$\rho_k \approx \frac{(-d)!}{(d - 1)!} k^{2d-1} \quad \text{as } k \to \infty$$
• for −1/2 < d < 1/2, the PACF of xt is $\phi_{k,k} = \frac{d}{k - d}$ for k = 1, 2, ..
• for −1/2 < d < 1/2, the spectral density function f (w) of xt , which is the Fourier transform of the ACF of xt ,
satisfies
$$f(w) \sim w^{-2d} \quad \text{as } w \to 0$$
where w ∈ [0, 2π] denotes the frequency.
In the case where d < 1/2, the property of the ACF of xt says that $\rho_k \sim k^{2d-1}$, which decays at a polynomial rate
rather than an exponential one. Note, in the spectral density above, the spectrum diverges to infinity as w → 0 when d > 0, whereas
it is bounded for all w ∈ [0, 2π] in the case of a stationary ARMA process. If the fractionally differenced series
$(1 - B)^d x_t$ follows an ARM A(p, q) model, then xt is called an ARF IM A(p, d, q) process, which generalises the
ARIMA model by allowing for noninteger d. One can estimate d by using either a maximum likelihood method or a
regression method with the logged periodogram at the lower frequencies.
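The product formulas above lend themselves to simple recursions. The sketch below assumes NumPy; the function names are hypothetical and d = 0.4 is an arbitrary illustrative value. It computes the coefficients of the expansion of (1 − B)^d, which follow the product formula given for πk, and the ACF ρk, whose slow decay can be compared with the polynomial rate k^(2d−1).

```python
import numpy as np

def frac_diff_weights(d, n):
    """Coefficients w_0, ..., w_n in (1 - B)^d = sum_k w_k B^k.

    For k >= 1, w_k = -d(1 - d)...(k - 1 - d) / k!.
    """
    w = np.empty(n + 1)
    w[0] = 1.0
    for k in range(1, n + 1):
        w[k] = w[k - 1] * (k - 1 - d) / k
    return w

def frac_diff_acf(d, max_lag):
    """rho_k = d(1 + d)...(k - 1 + d) / ((1 - d)(2 - d)...(k - d))."""
    rho = np.empty(max_lag + 1)
    rho[0] = 1.0
    for k in range(1, max_lag + 1):
        rho[k] = rho[k - 1] * (k - 1 + d) / (k - d)
    return rho

d = 0.4
w = frac_diff_weights(d, 3)        # 1, -0.4, -0.12, -0.064
rho = frac_diff_acf(d, 200)        # rho_1 = d / (1 - d)

# Polynomial decay: doubling the lag scales rho_k by about 2^(2d - 1).
decay_ratio = rho[200] / rho[100]
```

With d = 0.4 the autocorrelation at lag 200 is still a sizeable fraction of its lag-100 value, whereas an exponentially decaying ACF would be negligible at such lags.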


5.5

Multivariate time series

5.5.1

Characteristics

For an investor holding multiple assets, the dynamic relationships between returns of the assets play an important role
in the decision process. Vector or multivariate time series analysis provides methods to study multiple return
series jointly. As multivariate time series are made of multiple single series referred to as components, vectors and matrices are
the necessary tools. We let $r_t = (r_{1t}, r_{2t}, .., r_{Nt})^\top$ be the log returns of N assets at time t. The series rt is weakly
stationary if its first two moments are time-invariant. Hence, the mean vector and covariance matrix of a weakly
stationary series are constant over time. Assuming weak stationarity of rt , the mean vector and covariance matrix
are given by
$$\mu = E[r_t] , \quad \Gamma_0 = E[(r_t - \mu)(r_t - \mu)^\top]$$
where the expectation is taken element by element over the joint distribution of rt . The mean µ is an N-dimensional
vector, and the covariance matrix Γ0 is an N × N matrix. The ith diagonal element of Γ0 is the variance of rit , and
the (i, j)th element of Γ0 for i ≠ j is the covariance between rit and rjt . We then write $\mu = (\mu_1, .., \mu_N)^\top$ and
Γ0 = [Γij (0)] when describing the elements.
We let $D = diag[\sqrt{\Gamma_{11}(0)}, ..., \sqrt{\Gamma_{NN}(0)}]$ be an N × N diagonal matrix consisting of the standard deviations of
rit for i = 1, .., N . The lag-zero cross-correlation matrix of rt is
$$\rho_0 = [\rho_{ij}(0)] = D^{-1} \Gamma_0 D^{-1}$$
with (i, j)th element being
$$\rho_{ij}(0) = \frac{\Gamma_{ij}(0)}{\sqrt{\Gamma_{ii}(0)\Gamma_{jj}(0)}} = \frac{Cov(r_{it}, r_{jt})}{\sigma(r_{it})\sigma(r_{jt})}$$
and corresponding to the correlation between rit and rjt where ρij (0) = ρji (0), −1 ≤ ρij (0) ≤ 1, and ρii (0) = 1
for 1 ≤ i, j ≤ N . In order to understand the lead-lag relationships between component series, the cross-correlation
matrices are used to measure the strength of linear dependence between time series. The lag-k cross-covariance matrix
of rt is defined as
$$\Gamma_k = [\Gamma_{ij}(k)] = E[(r_t - \mu)(r_{t-k} - \mu)^\top]$$
where µ is the mean vector of rt . For a weakly stationary series, the cross-covariance matrix Γk is a function of k, but
not the time t. The lag-k cross-correlation matrix (CCM) of rt is defined as
$$\rho_k = [\rho_{ij}(k)] = D^{-1} \Gamma_k D^{-1}$$
with (i, j)th element being
$$\rho_{ij}(k) = \frac{\Gamma_{ij}(k)}{\sqrt{\Gamma_{ii}(0)\Gamma_{jj}(0)}} = \frac{Cov(r_{it}, r_{j,t-k})}{\sigma(r_{it})\sigma(r_{jt})}$$
and corresponding to the correlation between rit and rj,t−k . When k > 0 it measures the linear dependence of
rit on rj,t−k , which occurred prior to time t, such that when ρij (k) ≠ 0 the series rjt leads the series rit at lag k. This
result is reversed for ρji (k), and the diagonal element ρii (k) is the lag-k autocorrelation of rit . In general, when
k > 0 we get ρij (k) ≠ ρji (k) for i ≠ j because the two correlation coefficients measure different linear relationships
between {rit } and {rjt }, and Γk and ρk are not symmetric. Further, from Cov(rit , rj,t−k ) = Cov(rj,t−k , rit ) and by
the weak stationarity assumption
$$Cov(r_{j,t-k}, r_{it}) = Cov(r_{jt}, r_{i,t+k}) = Cov(r_{jt}, r_{i,t-(-k)})$$
we have Γij (k) = Γji (−k). Since Γji (−k) is the (j, i)th element of the matrix Γ−k , and since the equality holds
for 1 ≤ i, j ≤ N , we have $\Gamma_k = \Gamma_{-k}^\top$ and $\rho_k = \rho_{-k}^\top$. Hence, unlike the univariate case, ρk ≠ ρ−k for a general
vector time series when k > 0. As a result, it suffices to consider the cross-correlation matrices ρk for k ≥ 0. Given
the information contained in the cross-correlation matrices {ρk }k=0,1,2,.. of a weakly stationary vector time series, if
ρij (k) = 0 for all k > 0, then rit does not depend linearly on any past value rj,t−k of the rjt series.
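Given the data $\{r_t\}_{t=1}^{T}$, the cross-covariance matrix Γk is estimated by
$$\hat{\Gamma}_k = \frac{1}{T} \sum_{t=k+1}^{T} (r_t - \bar{r})(r_{t-k} - \bar{r})^\top , \quad k \geq 0$$
where $\bar{r} = \frac{1}{T} \sum_{t=1}^{T} r_t$ is the vector sample mean. The cross-correlation matrix ρk is estimated by
$$\hat{\rho}_k = \hat{D}^{-1} \hat{\Gamma}_k \hat{D}^{-1} , \quad k \geq 0$$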

where D̂ is the N × N diagonal matrix of the sample standard deviation of the component series. The asymptotic
properties of the sample cross-correlation matrix ρ̂k have been investigated under various assumptions (see Fuller
[1976]). For asset return series, the presence of conditional heteroscedasticity and high kurtosis complicates the finite
sample distribution of ρ̂k , and proper bootstrap resampling methods should be used to get an approximate estimate of
the distribution. The univariate Ljung-Box statistic Q(m) has been generalised to the multivariate case by Hosking
[1980] and Li et al. [1981].
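The sample cross-covariance and cross-correlation matrices can be computed as in the following sketch. It assumes NumPy and a T × N array of returns; the function names and the simulated lead-lag pair are hypothetical. The second series is built as the first series lagged by one period plus noise, so the first series leads the second at lag 1.

```python
import numpy as np

def cross_cov(r, k):
    """Gamma_hat_k = (1/T) * sum_{t=k+1}^T (r_t - r_bar)(r_{t-k} - r_bar)^T."""
    r = np.asarray(r, dtype=float)
    T = r.shape[0]
    dev = r - r.mean(axis=0)
    return dev[k:].T @ dev[:T - k] / T

def cross_corr(r, k):
    """rho_hat_k = D_hat^{-1} Gamma_hat_k D_hat^{-1}."""
    d = np.sqrt(np.diag(cross_cov(r, 0)))      # sample standard deviations
    return cross_cov(r, k) / np.outer(d, d)

rng = np.random.default_rng(6)
T = 5000
x = rng.standard_normal(T + 1)
r = np.column_stack([x[1:], x[:-1] + 0.5 * rng.standard_normal(T)])

rho0 = cross_corr(r, 0)
rho1 = cross_corr(r, 1)   # rho1[1, 0]: series 2 at t versus series 1 at t - 1
```

The matrix `rho1` is not symmetric: the (2, 1) element is large because series 1 leads series 2, while the (1, 2) element is close to zero.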

5.5.2

Introduction to a few models

The vector autoregressive (VAR) model is considered a simple vector model when modelling asset returns. A multivariate time series rt is a VAR process of order 1 or V AR(1) if it follows the model
rt = φ0 + Φrt−1 + at
where φ0 is an N-dimensional vector, Φ is an N × N matrix, and {at } is a sequence of serially uncorrelated random
vectors with mean zero and positive definite covariance matrix Σ. For example, in the bivariate case with N = 2,
$r_t = (r_{1t}, r_{2t})^\top$ and $a_t = (a_{1t}, a_{2t})^\top$, we get the equations
$$r_{1t} = \phi_{10} + \Phi_{11} r_{1,t-1} + \Phi_{12} r_{2,t-1} + a_{1t}$$
$$r_{2t} = \phi_{20} + \Phi_{21} r_{1,t-1} + \Phi_{22} r_{2,t-1} + a_{2t}$$
where Φij is the (i, j)th element of Φ and φi0 is the ith element of φ0 . The coefficient matrix Φ measures the dynamic
dependence of rt , and the concurrent relationship between r1t and r2t is shown by the off-diagonal element σ12 of the
covariance matrix Σ of at . The V AR(1) model is called a reduced-form model because it does not show explicitly the
concurrent dependence between the component series. Assuming weak stationarity, and using E[at ] = 0, we get
$$E[r_t] = \phi_0 + \Phi E[r_{t-1}]$$
Since E[rt ] is time-invariant, we get
$$\mu = E[r_t] = (I - \Phi)^{-1} \phi_0$$
provided that I − Φ is non-singular, where I is the N × N identity matrix. Hence, using φ0 = (I − Φ)µ, we can
rewrite the V AR(1) model as
$$(r_t - \mu) = \Phi(r_{t-1} - \mu) + a_t$$

Letting r̃t = rt − µ be the mean-corrected time series, the V AR(1) model becomes
$$\tilde{r}_t = \Phi \tilde{r}_{t-1} + a_t$$
By repeated substitutions, the V AR(1) model becomes
$$\tilde{r}_t = a_t + \Phi a_{t-1} + \Phi^2 a_{t-2} + \dots$$
characterising the V AR(1) process. We can generalise the V AR(1) model to V AR(p) models and get their characteristics.
Another approach is to generalise univariate ARMA models to handle vector time series and obtain VARMA
models. However, these models suffer from an identifiability problem as they are not uniquely defined. Hence, building
a VARMA model for a given data set requires some attention.
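The mean of a V AR(1) model and the weights of its MA representation can be computed with a few matrix operations. The sketch below assumes NumPy; the coefficient values are arbitrary illustrative choices, and the eigenvalue check reflects the standard stationarity condition for a V AR(1) model (all eigenvalues of Φ inside the unit circle), which is an addition not spelled out in the text above.

```python
import numpy as np

# Illustrative bivariate VAR(1) coefficients (hypothetical values).
Phi = np.array([[0.5, 0.1],
                [0.0, 0.3]])
phi0 = np.array([1.0, 0.7])

# Unconditional mean: mu = (I - Phi)^{-1} phi0, solved without forming the inverse.
mu = np.linalg.solve(np.eye(2) - Phi, phi0)

# Stationarity check: eigenvalues of Phi should be less than 1 in modulus.
eig_moduli = np.abs(np.linalg.eigvals(Phi))

def var1_ma_weights(Phi, n):
    """Matrices I, Phi, Phi^2, ..., Phi^n of the MA representation."""
    mats = [np.eye(Phi.shape[0])]
    for _ in range(n):
        mats.append(Phi @ mats[-1])
    return mats

psi = var1_ma_weights(Phi, 3)
```

Here µ = (2.2, 1.0), and the powers of Φ shrink towards zero, so the effect of a past shock on the mean-corrected series dies out.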

5.5.3

Principal component analysis

Another important statistic in multivariate time series analysis is the covariance (or correlation) structure of the series.
Given a N-dimensional random variable r = (r1 , .., rN )> with covariance matrix Σr , a principal component analysis
(PCA) is concerned with using a few linear combinations of ri to explain the structure of Σr . PCA applies to either
the covariance matrix Σr or the correlation matrix ρr of r. The correlation matrix being the covariance matrix of the
standardised random vector r∗ = D−1 r where D is the diagonal matrix of standard deviations of the components of
r, we apply PCA to the covariance matrix. Let $c_i = (c_{i1}, ..., c_{iN})^\top$ be an N-dimensional vector, where i = 1, .., N such
that
$$y_i = c_i^\top r = \sum_{j=1}^{N} c_{ij} r_j$$
is a linear combination of the random vector r. In the case where r consists of the simple returns of N stocks, then yi
is the return of a portfolio assigning weight cij to the jth stock. Without modifying the proportional allocation of the
portfolio, we can standardise the vector ci so that $c_i^\top c_i = \sum_{j=1}^{N} c_{ij}^2 = 1$. Using the properties of a linear combination
of random variables, we get
$$Var(y_i) = c_i^\top \Sigma_r c_i , \quad i = 1, .., N$$
$$Cov(y_i, y_j) = c_i^\top \Sigma_r c_j , \quad i, j = 1, .., N$$

The idea of PCA is to find linear combinations ci such that yi and yj are uncorrelated for i ≠ j and the variances of
yi are as large as possible. Specifically
1. the first principal component of r is the linear combination $y_1 = c_1^\top r$ maximising V ar(y1 ) under the constraint $c_1^\top c_1 = 1$.
2. the second principal component of r is the linear combination $y_2 = c_2^\top r$ maximising V ar(y2 ) under the constraints $c_2^\top c_2 = 1$ and $Cov(y_1, y_2) = 0$.
3. the ith principal component of r is the linear combination $y_i = c_i^\top r$ maximising V ar(yi ) under the constraints $c_i^\top c_i = 1$ and $Cov(y_i, y_j) = 0$ for j = 1, ..., i − 1.
Since the covariance matrix Σr is non-negative definite, it has a spectral decomposition (see Appendix (A.6)). Hence,
letting (λ1 , e1 ), ..., (λN , eN ) be the eigenvalue-eigenvector pairs of Σr , where λ1 ≥ λ2 ≥ ... ≥ λN ≥ 0, then the ith
principal component of r is $y_i = e_i^\top r = \sum_{j=1}^{N} e_{ij} r_j$ for i = 1, .., N . Moreover, we get

$$Var(y_i) = e_i^\top \Sigma_r e_i = \lambda_i , \quad i = 1, .., N$$
$$Cov(y_i, y_j) = e_i^\top \Sigma_r e_j = 0 , \quad i \neq j$$
In the case where some eigenvalues λi are equal, the choices of the corresponding eigenvectors ei and hence yi are
not unique. Further, we have
$$\sum_{i=1}^{N} Var(r_i) = tr(\Sigma_r) = \sum_{i=1}^{N} \lambda_i = \sum_{i=1}^{N} Var(y_i)$$
This result says that
$$\frac{Var(y_i)}{\sum_{i=1}^{N} Var(r_i)} = \frac{\lambda_i}{\lambda_1 + .. + \lambda_N}$$
so that the proportion of total variance in r explained by the ith principal component is simply the ratio between the
ith eigenvalue and the sum of all eigenvalues of Σr . Since tr(ρr ) = N , the proportion of variance explained by the ith
principal component becomes λi /N when the correlation matrix is used to perform the PCA. A byproduct of the PCA
is that a zero eigenvalue of Σr or ρr indicates the existence of an exact linear relationship between the components
of r. For instance, if the smallest eigenvalue λN = 0, then from the previous result V ar(yN ) = 0, and therefore
$y_N = \sum_{j=1}^{N} e_{Nj} r_j$ is a constant and there are only N − 1 random quantities in r, reducing the dimension of r.
Hence, PCA has been used as a tool for dimension reduction. In practice, the covariance matrix Σr and the correlation
matrix ρr of the return vector r are unknown, but they can be estimated consistently by the sample covariance and
correlation matrices under some regularity conditions. Assuming that the returns $\{r_t\}_{t=1}^{T}$ are weakly stationary, we
get the estimates
$$\hat{\Sigma}_r = [\hat{\sigma}_{ij,r}] = \frac{1}{T - 1} \sum_{t=1}^{T} (r_t - \bar{r})(r_t - \bar{r})^\top , \quad \bar{r} = \frac{1}{T} \sum_{t=1}^{T} r_t$$
and
$$\hat{\rho}_r = \hat{D}^{-1} \hat{\Sigma}_r \hat{D}^{-1}$$
where $\hat{D} = diag\{\sqrt{\hat{\sigma}_{11,r}}, .., \sqrt{\hat{\sigma}_{NN,r}}\}$ is the diagonal matrix of sample standard errors of rt . Methods to compute
eigenvalues and eigenvectors of a symmetric matrix can then be used to perform PCA. An informal technique to
determine the number of principal components needed in an application is to examine the scree plot, which is the
plot of the eigenvalues λ̂i ordered from the largest to the smallest, that is, a plot of λ̂i versus i. By looking for an elbow
in the scree plot, indicating that the remaining eigenvalues are relatively small and all about the same size, one can
determine the appropriate number of components. Note, except for the case in which λj = 0 for j > i, selecting the
first i principal components only provides an approximation to the total variance of the data. If a small i can provide
a good approximation, then the simplification becomes valuable.
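A PCA of the sample covariance matrix can be sketched with a symmetric eigendecomposition. The example assumes NumPy; the function name `pca` and the simulated one-factor returns are hypothetical. Scores are computed on mean-centred returns, which changes their level but not their variances or covariances.

```python
import numpy as np

def pca(returns):
    """PCA of the sample covariance matrix of a T x N array of returns."""
    r = np.asarray(returns, dtype=float)
    cov = np.cov(r, rowvar=False)             # sample covariance, divides by T - 1
    eigval, eigvec = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(eigval)[::-1]          # reorder from largest to smallest
    eigval, eigvec = eigval[order], eigvec[:, order]
    explained = eigval / eigval.sum()         # proportion of total variance
    scores = (r - r.mean(axis=0)) @ eigvec    # principal components y_i
    return cov, eigval, eigvec, explained, scores

# Simulated returns driven by one common factor plus idiosyncratic noise.
rng = np.random.default_rng(7)
factor = rng.standard_normal((1000, 1))
loadings = np.array([[1.0, 0.8, 0.6, 0.4]])
r = factor @ loadings + 0.3 * rng.standard_normal((1000, 4))

cov, eigval, eigvec, explained, scores = pca(r)
```

Plotting `eigval` against its index would give the scree plot; with a single dominant factor the elbow appears right after the first eigenvalue.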

5.6

Some conditional heteroscedastic models

Following the notation in Section (3.4.2) we are going to describe a few conditional heteroscedastic (CH) models.

5.6.1

The ARCH model

Starting with the ARCH model proposed by Engle, the main idea is that

221

Quantitative Analytics

1. the mean-corrected asset return at is serially uncorrelated, but dependent, and
2. the dependence of at can be described by a simple quadratic function of its lagged values.
Formally, an ARCH(m) model is given by
$$a_t = \sigma_t \epsilon_t , \quad \sigma_t^2 = \alpha_0 + \alpha_1 a_{t-1}^2 + ... + \alpha_m a_{t-m}^2 \qquad (5.6.3)$$
where $\{\epsilon_t\}$ is a sequence of i.i.d. random variables with mean zero and variance 1, $\alpha_0 > 0$, and $\alpha_i \geq 0$ for $i > 0$.
The coefficients $\alpha_i$ must satisfy some regularity conditions to ensure that the unconditional variance of $a_t$ is finite. In
practice, $\epsilon_t$ is often assumed to follow the standard normal or a standardised Student-t distribution. From the structure
of the model, one can see that large past squared shocks $\{a_{t-i}^2\}_{i=1}^{m}$ imply a large conditional variance $\sigma_t^2$ for the
mean-corrected return $a_t$, so that $a_t$ tends to assume a large value (in modulus). Hence, in the ARCH model large
shocks tend to be followed by another large shock, similarly to the volatility clustering observed in asset returns. For
simplicity of exposition, we consider the ARCH(1) model given by
$$a_t = \sigma_t \epsilon_t , \quad \sigma_t^2 = \alpha_0 + \alpha_1 a_{t-1}^2$$
where $\alpha_0 > 0$ and $\alpha_1 \geq 0$. The unconditional mean of $a_t$ remains zero because
$$E[a_t] = E[E[a_t | \mathcal{F}_{t-1}]] = E[\sigma_t E[\epsilon_t]] = 0$$
Further, the unconditional variance of $a_t$ can be obtained as
$$Var(a_t) = E[a_t^2] = E[E[a_t^2 | \mathcal{F}_{t-1}]] = E[\alpha_0 + \alpha_1 a_{t-1}^2] = \alpha_0 + \alpha_1 E[a_{t-1}^2]$$
Since $a_t$ is a stationary process with $E[a_t] = 0$, $Var(a_t) = Var(a_{t-1}) = E[a_{t-1}^2]$, so that $Var(a_t) = \alpha_0 + \alpha_1 Var(a_t)$ and
$$Var(a_t) = \frac{\alpha_0}{1 - \alpha_1}$$
As the variance of $a_t$ must be positive, we need $0 \leq \alpha_1 < 1$. Note, in some
applications, we need higher order moments of $a_t$ to exist, and $\alpha_1$ must also satisfy additional constraints. For example,
when studying the tail behaviour we need the fourth moment of $a_t$ to be finite. Under the normality assumption of $\epsilon_t$
we have
$$E[a_t^4 | \mathcal{F}_{t-1}] = 3 \left( E[a_t^2 | \mathcal{F}_{t-1}] \right)^2 = 3 \left( \alpha_0 + \alpha_1 a_{t-1}^2 \right)^2$$
Therefore, we get
$$E[a_t^4] = E[E[a_t^4 | \mathcal{F}_{t-1}]] = 3E[(\alpha_0 + \alpha_1 a_{t-1}^2)^2] = 3E[\alpha_0^2 + 2\alpha_0 \alpha_1 a_{t-1}^2 + \alpha_1^2 a_{t-1}^4]$$
If $a_t$ is fourth-order stationary with $m_4 = E[a_t^4]$, then we have
$$m_4 = 3 \left( \alpha_0^2 + 2\alpha_0 \alpha_1 Var(a_t) + \alpha_1^2 m_4 \right) = 3\alpha_0^2 \left( 1 + 2\frac{\alpha_1}{1 - \alpha_1} \right) + 3\alpha_1^2 m_4$$
so that we get
$$m_4 = \frac{3\alpha_0^2 (1 + \alpha_1)}{(1 - \alpha_1)(1 - 3\alpha_1^2)}$$
As a result,
1. since $m_4$ is positive, then $\alpha_1$ must also satisfy the condition $1 - 3\alpha_1^2 > 0$, that is, $0 \leq \alpha_1^2 < \frac{1}{3}$
2. the unconditional kurtosis of $a_t$ is
$$\frac{E[a_t^4]}{[Var(a_t)]^2} = \frac{3\alpha_0^2 (1 + \alpha_1)}{(1 - \alpha_1)(1 - 3\alpha_1^2)} \times \frac{(1 - \alpha_1)^2}{\alpha_0^2} = 3\frac{1 - \alpha_1^2}{1 - 3\alpha_1^2} > 3$$


Thus, the excess kurtosis of $a_t$ is positive and its tail distribution is heavier than that of a normal distribution. That
is, the shock $a_t$ of a conditional Gaussian ARCH(1) model is more likely than a Gaussian white noise to produce
outliers, in agreement with the empirical findings on asset returns. These properties continue to hold for general
ARCH models. Note, a natural way of achieving positiveness of the conditional variance is to rewrite an ARCH(m)
model as
$$a_t = \sigma_t \epsilon_t , \quad \sigma_t^2 = \alpha_0 + A_{m,t-1}^\top \Omega A_{m,t-1}$$
where $A_{m,t-1} = (a_{t-1}, .., a_{t-m})^\top$ and $\Omega$ is an $m \times m$ non-negative definite matrix. Hence, we see that the ARCH(m)
model requires $\Omega$ to be diagonal, and that Engle's model uses a parsimonious approach to approximate a
quadratic function. A simple way to achieve the diagonality constraint on the matrix $\Omega$ is to employ a random coefficient model for $a_t$ as done in the CHARMA and RCA models. Further, ARCH models also have some weaknesses
• the model assumes that positive and negative shocks have the same effects on volatility because it depends on
the square of the previous shocks.
• the model is rather restrictive, since in the case of an ARCH(1) the parameter $\alpha_1^2$ is constrained to be in the
interval $[0, \frac{1}{3})$ for the series to have a finite fourth moment.
• the model is likely to overpredict the volatility because it slowly responds to large isolated shocks to the return
series.
A simple way for building an ARCH model consists of three steps
1. build an econometric model (for example an ARMA model) for the return series to remove any linear dependence in the data, and use the residual series to test for ARCH effects
2. specify the ARCH order and perform estimation
3. check carefully the fitted ARCH model and refine it if necessary
To determine the ARCH order, we define $\eta_t = a_t^2 - \sigma_t^2$ since it was shown that $\{\eta_t\}$ is an uncorrelated series with
zero mean. The ARCH model then becomes
$$a_t^2 = \alpha_0 + \alpha_1 a_{t-1}^2 + ... + \alpha_m a_{t-m}^2 + \eta_t$$
which is the form of an AR(m) model for $a_t^2$, except that $\{\eta_t\}$ is not an i.i.d. series. As a result, the least squares
estimates of the prior model are consistent, but not efficient. The PACF of $a_t^2$, which is a useful tool for determining
the order m, may not be effective in the case of small sample size.
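
The AR(m) form of the squared series can be illustrated on simulated data. The sketch below assumes NumPy, Gaussian innovations and an ARCH(1) specification; the function names `simulate_arch1` and `fit_arch1_ols`, the parameter values and the seed are illustrative choices, not part of the text. It regresses $a_t^2$ on $a_{t-1}^2$ by least squares and compares the sample variance with $\alpha_0 / (1 - \alpha_1)$.

```python
import numpy as np

def simulate_arch1(alpha0, alpha1, n, seed=0, burn=500):
    """Simulate a_t = sigma_t * eps_t with sigma_t^2 = alpha0 + alpha1 * a_{t-1}^2."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n + burn)
    a = np.zeros(n + burn)
    for t in range(1, n + burn):
        sigma2 = alpha0 + alpha1 * a[t - 1] ** 2
        a[t] = np.sqrt(sigma2) * eps[t]
    return a[burn:]                      # drop the burn-in period

def fit_arch1_ols(a):
    """Least-squares fit of a_t^2 = alpha0 + alpha1 * a_{t-1}^2 + eta_t."""
    y = a[1:] ** 2
    X = np.column_stack([np.ones(len(y)), a[:-1] ** 2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1]

alpha0, alpha1 = 0.7, 0.3               # alpha1^2 < 1/3, so the fourth moment is finite
a = simulate_arch1(alpha0, alpha1, n=100_000)
a0_hat, a1_hat = fit_arch1_ols(a)
var_theory = alpha0 / (1.0 - alpha1)
kurt_theory = 3.0 * (1.0 - alpha1 ** 2) / (1.0 - 3.0 * alpha1 ** 2)
```

The least squares estimates are consistent but not efficient, so in practice they would typically serve only as starting values for a likelihood-based fit.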
Forecasts of the ARCH model in Equation (5.6.3) are obtained recursively just like those of an AR model. Given
the ARCH(m) model, at the forecast origin h, the 1-step ahead forecast of $\sigma_{h+1}^2$ is
$$\sigma_h^2(1) = \alpha_0 + \alpha_1 a_h^2 + ... + \alpha_m a_{h+1-m}^2$$
and the l-step ahead forecast for $\sigma_{h+l}^2$ is
$$\sigma_h^2(l) = \alpha_0 + \sum_{i=1}^{m} \alpha_i \sigma_h^2(l - i)$$
where $\sigma_h^2(l - i) = a_{h+l-i}^2$ if $l - i \leq 0$.
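
The forecast recursion for the ARCH(m) model can be written compactly. The following is a minimal sketch in plain Python; the function name `arch_forecast` and its argument layout are assumptions made for illustration.

```python
def arch_forecast(alpha0, alphas, past_shocks, horizon):
    """l-step ahead ARCH(m) variance forecasts for l = 1..horizon.

    alphas: [alpha_1, ..., alpha_m]
    past_shocks: [..., a_{h-1}, a_h], most recent shock last.
    """
    m = len(alphas)
    if len(past_shocks) < m:
        raise ValueError("need at least m past shocks")
    # values holds a^2 for known dates, then the forecasts as they are produced
    values = [a * a for a in past_shocks[-m:]]
    forecasts = []
    for _ in range(horizon):
        f = alpha0 + sum(alphas[i] * values[-1 - i] for i in range(m))
        forecasts.append(f)
        values.append(f)
    return forecasts

fc = arch_forecast(0.7, [0.3], past_shocks=[2.0], horizon=50)
fc2 = arch_forecast(0.7, [0.2, 0.1], past_shocks=[1.0, 2.0], horizon=2)
```

For the ARCH(1) example the forecasts decay from the 1-step value towards $\alpha_0 / (1 - \alpha_1)$ as the horizon grows.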

5.6.2

The GARCH model

In the GARCH model, given a log return series rt , we assume that the mean equation of the process can be adequately
described by an ARMA model. Then, the mean-corrected log return at follows a GARCH(m, s) model if
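$$a_t = \sigma_t \epsilon_t , \quad \sigma_t^2 = \alpha_0 + \sum_{i=1}^{m} \alpha_i a_{t-i}^2 + \sum_{j=1}^{s} \beta_j \sigma_{t-j}^2 \qquad (5.6.4)$$
where $\{\epsilon_t\}$ is a sequence of i.i.d. random variables with zero mean and variance 1, $\alpha_0 > 0$, $\alpha_i \geq 0$, $\beta_j \geq 0$, and
$\sum_{i=1}^{\max(m,s)} (\alpha_i + \beta_i) < 1$ (with $\alpha_i = 0$ for $i > m$ and $\beta_i = 0$ for $i > s$), implying that the unconditional variance of $a_t$ is finite, whereas its conditional variance
$\sigma_t^2$ evolves over time. In general, $\epsilon_t$ is assumed to follow a standard normal or standardised Student-t distribution. To
understand the properties of GARCH models we let $\eta_t = a_t^2 - \sigma_t^2$ so that $\sigma_t^2 = a_t^2 - \eta_t$. By plugging $\sigma_{t-i}^2 = a_{t-i}^2 - \eta_{t-i}$
for i = 0, .., s into Equation (5.6.4), we get
$$a_t^2 = \alpha_0 + \sum_{i=1}^{\max(m,s)} (\alpha_i + \beta_i) a_{t-i}^2 + \eta_t - \sum_{j=1}^{s} \beta_j \eta_{t-j} \qquad (5.6.5)$$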

Note, while {ηt } is a martingale difference series (E[ηt ] = 0 and Cov(ηt , ηt−j ) = 0 for j ≥ 1), in general it is not
an i.i.d. sequence. Since the above equation is an ARMA form for the squared series a2t , a GARCH model can be
regarded as an application of the ARMA idea to the squared series a2t . Hence, using the unconditional mean of an
ARMA model, we have
$$E[a_t^2] = \frac{\alpha_0}{1 - \sum_{i=1}^{\max(m,s)} (\alpha_i + \beta_i)}$$

provided that the denominator of the prior fraction is positive. For simplicity of exposition we now consider the
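GARCH(1, 1) model given by
$$\sigma_t^2 = \alpha_0 + \alpha_1 a_{t-1}^2 + \beta_1 \sigma_{t-1}^2 , \quad 0 \leq \alpha_1 , \beta_1 \leq 1 , \quad (\alpha_1 + \beta_1) < 1$$
We see that a large $a_{t-1}^2$ or $\sigma_{t-1}^2$ gives rise to a large $\sigma_t^2$, meaning that a large $a_{t-1}^2$ tends to be followed by another
large $a_t^2$. It can also be shown that if $1 - 2\alpha_1^2 - (\alpha_1 + \beta_1)^2 > 0$, then
$$\frac{E[a_t^4]}{(E[a_t^2])^2} = \frac{3 [1 - (\alpha_1 + \beta_1)^2]}{1 - (\alpha_1 + \beta_1)^2 - 2\alpha_1^2} > 3$$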

so that, similarly to ARCH models, the tail distribution of a GARCH(1, 1) process is heavier than that of a normal
distribution. Further, the model provides a simple parametric function that can be used for describing the volatility
evolution. Forecasts of a GARCH model can be obtained in a similar way to those of the ARMA model. For example,
in the GARCH(1, 1) model, with forecast origin h, for a 1-step ahead forecast we have
$$\sigma_{h+1}^2 = \alpha_0 + \alpha_1 a_h^2 + \beta_1 \sigma_h^2$$
where $a_h$ and $\sigma_h^2$ are known at the time index h. Hence, the 1-step ahead forecast is given by
$$\sigma_h^2(1) = \alpha_0 + \alpha_1 a_h^2 + \beta_1 \sigma_h^2$$
In the case of multiple steps ahead, we use $a_t^2 = \sigma_t^2 \epsilon_t^2$ and rewrite the volatility equation as
$$\sigma_{t+1}^2 = \alpha_0 + (\alpha_1 + \beta_1)\sigma_t^2 + \alpha_1 \sigma_t^2 (\epsilon_t^2 - 1)$$
Setting t = h + 1, since $E[\epsilon_{h+1}^2 - 1 | \mathcal{F}_h] = 0$, the 2-step ahead volatility forecast becomes
$$\sigma_h^2(2) = \alpha_0 + (\alpha_1 + \beta_1)\sigma_h^2(1) \qquad (5.6.6)$$
and the l-step ahead volatility forecast satisfies the equation
$$\sigma_h^2(l) = \alpha_0 + (\alpha_1 + \beta_1)\sigma_h^2(l - 1) , \quad l > 1 \qquad (5.6.7)$$
which is the same result as that of an ARMA(1, 1) model with AR polynomial $1 - (\alpha_1 + \beta_1)B$. By repeated
substitution of the above equation (5.6.7), the l-step ahead forecast can be rewritten as
$$\sigma_h^2(l) = \frac{\alpha_0 [1 - (\alpha_1 + \beta_1)^{l-1}]}{1 - \alpha_1 - \beta_1} + (\alpha_1 + \beta_1)^{l-1} \sigma_h^2(1)$$
such that
$$\sigma_h^2(l) \to \frac{\alpha_0}{1 - \alpha_1 - \beta_1} \quad \text{as } l \to \infty$$

provided that α1 + β1 < 1. As a result, the multistep ahead volatility forecasts of a GARCH(1, 1) model converge
to the unconditional variance of at as the forecast horizon increases to infinity, provided that V ar(at ) exists. Note,
the GARCH models encounter the same weaknesses as the ARCH models, such as responding equally to positive
and negative shocks. While the approach used to build ARCH models can be used for building GARCH models, it is
difficult to specify the order of the latter. Fortunately, only lower order GARCH models are used in most applications.
The conditional maximum likelihood method still applies provided that the starting values of the volatility {σt2 } are
assumed to be known. In some applications, the sample variance of at is used as a starting value.
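
The GARCH(1, 1) forecast recursion and its closed form can be compared numerically. This is a minimal sketch in plain Python; the function names and the parameter values are illustrative assumptions, and the parameters are taken as known rather than estimated.

```python
def garch11_forecasts(alpha0, alpha1, beta1, a_h, sigma2_h, horizon):
    """Recursive l-step ahead variance forecasts, l = 1..horizon."""
    persistence = alpha1 + beta1
    forecasts = [alpha0 + alpha1 * a_h ** 2 + beta1 * sigma2_h]   # 1-step forecast
    for _ in range(1, horizon):
        forecasts.append(alpha0 + persistence * forecasts[-1])   # l-step recursion
    return forecasts

def garch11_forecast_closed_form(alpha0, alpha1, beta1, sigma2_1, l):
    """Closed-form l-step forecast given the 1-step forecast sigma2_1."""
    p = alpha1 + beta1
    return alpha0 * (1.0 - p ** (l - 1)) / (1.0 - p) + p ** (l - 1) * sigma2_1

alpha0, alpha1, beta1 = 0.05, 0.10, 0.85
fc = garch11_forecasts(alpha0, alpha1, beta1, a_h=2.0, sigma2_h=1.5, horizon=500)
long_run = alpha0 / (1.0 - alpha1 - beta1)   # unconditional variance
closed_10 = garch11_forecast_closed_form(alpha0, alpha1, beta1, fc[0], 10)
```

With $\alpha_1 + \beta_1 = 0.95$ the forecasts revert slowly, and only at long horizons do they become indistinguishable from the unconditional variance.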

5.6.3

The integrated GARCH model

If the AR polynomial of the GARCH representation in Equation (5.6.5) has a unit root, then we have an IGARCH
model. That is, IGARCH models are unit-root GARCH models. As for ARIMA models, a key feature of IGARCH
models is that the impact of past squared shocks $\eta_{t-i} = a_{t-i}^2 - \sigma_{t-i}^2$ for i > 0 on $a_t^2$ is persistent. For example, the
IGARCH(1, 1) model is given by
$$a_t = \sigma_t \epsilon_t , \quad \sigma_t^2 = \alpha_0 + \beta_1 \sigma_{t-1}^2 + (1 - \beta_1) a_{t-1}^2$$
where $\{\epsilon_t\}$ is defined as before, and $1 > \beta_1 > 0$. The unconditional variance of $a_t$, hence that
of $r_t$, is not defined in this model. From a theoretical point of view, the IGARCH phenomenon might be caused by
occasional level shifts in volatility. When $\alpha_1 + \beta_1 = 1$ in Equation (5.6.7), repeated substitutions in the l-step ahead
volatility forecast equation of GARCH models give
$$\sigma_h^2(l) = \sigma_h^2(1) + (l - 1)\alpha_0 , \quad l \geq 1$$
such that the effect of $\sigma_h^2(1)$ on future volatility is also persistent, and the volatility forecasts form a straight line with
slope $\alpha_0$. Note, the process $\sigma_t^2$ is a martingale for which some nice results are available (see Nelson [1990]). Under
certain conditions, the volatility process is strictly stationary, but not weakly stationary as it does not have the first two
moments. Further, in the special case $\alpha_0 = 0$ in the IGARCH(1, 1) model, the volatility forecasts are simply $\sigma_h^2(1)$
for all forecast horizons (see Equation (5.6.6)). This is the volatility model used in RiskMetrics (see details in Section
(3.4.3)), which is an approach for calculating Value at Risk (VaR) (see details in Section (9.5.2)).
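
The IGARCH(1, 1) update and its forecasts are simple enough to state directly in code. The sketch below is illustrative only: the function names are hypothetical, and the decay factor 0.94 is an assumed value chosen for the example rather than one given in this section.

```python
def igarch11_update(sigma2_prev, a_prev, beta1, alpha0=0.0):
    """One step of sigma_t^2 = alpha0 + beta1*sigma_{t-1}^2 + (1 - beta1)*a_{t-1}^2."""
    return alpha0 + beta1 * sigma2_prev + (1.0 - beta1) * a_prev ** 2

def igarch11_forecasts(sigma2_1, alpha0, horizon):
    """Forecasts sigma_h^2(l) = sigma_h^2(1) + (l - 1)*alpha0 for l = 1..horizon."""
    return [sigma2_1 + (l - 1) * alpha0 for l in range(1, horizon + 1)]

beta1 = 0.94   # assumed decay factor, for illustration only
sigma2_1 = igarch11_update(sigma2_prev=1.0, a_prev=2.0, beta1=beta1)
flat = igarch11_forecasts(sigma2_1, alpha0=0.0, horizon=5)    # alpha0 = 0: constant forecasts
line = igarch11_forecasts(sigma2_1, alpha0=0.01, horizon=5)   # alpha0 > 0: straight line
```

With $\alpha_0 = 0$ every horizon receives the same forecast, while a positive $\alpha_0$ adds a constant increment per step.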

5.6.4

The GARCH-M model

In general, asset return should depend on its volatility, and one way forward is to consider the GARCH-M model,
which is a GARCH in mean. A simple GARCH(1, 1)-M model is given by
$$r_t = \mu + c\sigma_t^2 + a_t , \quad a_t = \sigma_t \epsilon_t$$
$$\sigma_t^2 = \alpha_0 + \alpha_1 a_{t-1}^2 + \beta_1 \sigma_{t-1}^2$$
where µ and c are constants. The parameter c is called the risk premium, with a positive value indicating that the return
is positively related to its past volatility. The formulation of the above model implies serial correlations in the return
series $r_t$ introduced by those in the volatility process $\sigma_t^2$. Thus, the existence of a risk premium is another
reason for historical stock returns to have serial correlations.

5.6.5

The exponential GARCH model

Nelson [1991] proposed the exponential GARCH (EGARCH) model allowing for asymmetric effects between positive
and negative asset returns. He considered a weighted innovation which can be written as
$$g(\epsilon_t) = \begin{cases} (\theta + \gamma)\epsilon_t - \gamma E[|\epsilon_t|] & \text{if } \epsilon_t \geq 0 \\ (\theta - \gamma)\epsilon_t - \gamma E[|\epsilon_t|] & \text{if } \epsilon_t < 0 \end{cases}$$
where θ and γ are real constants. Both $\epsilon_t$ and $|\epsilon_t| - E[|\epsilon_t|]$ are zero-mean i.i.d. sequences with continuous distributions,
so that $E[g(\epsilon_t)] = 0$. For the standard Gaussian random variable $\epsilon_t$, we have $E[|\epsilon_t|] = \sqrt{2/\pi}$. An EGARCH(m, s)
model can be written as
$$a_t = \sigma_t \epsilon_t , \quad \ln(\sigma_t^2) = \alpha_0 + \frac{1 + \beta_1 B + ... + \beta_s B^s}{1 - \alpha_1 B - ... - \alpha_m B^m} g(\epsilon_{t-1})$$
where $\alpha_0$ is a constant, B is the back-shift (or lag) operator such that $Bg(\epsilon_t) = g(\epsilon_{t-1})$, and both the numerator and
denominator above are polynomials with zeros outside the unit circle (absolute values of the zeros are greater than one)
and have no common factors. Since the EGARCH model uses the ARMA parametrisation to describe the evolution
of the conditional variance of $a_t$, some properties of the model can be obtained in a similar manner as those of the
GARCH model. For instance, the unconditional mean of $\ln(\sigma_t^2)$ is $\alpha_0$. However, the EGARCH model uses logged
conditional variance to relax the positiveness constraint of model coefficients, and the use of $g(\epsilon_t)$ enables the model
to respond asymmetrically to positive and negative lagged values of $a_t$. For example, in the simple EGARCH(1, 0)
we get
$$a_t = \sigma_t \epsilon_t , \quad (1 - \alpha B) \ln(\sigma_t^2) = (1 - \alpha)\alpha_0 + g(\epsilon_{t-1})$$
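where $\{\epsilon_t\}$ are i.i.d. standard normal and the subscript of $\alpha_1$ is omitted. In this case, $E[|\epsilon_t|] = \sqrt{2/\pi}$ and the model for
$\ln(\sigma_t^2)$ becomes
$$(1 - \alpha B) \ln(\sigma_t^2) = \begin{cases} \alpha_* + (\theta + \gamma)\epsilon_{t-1} & \text{if } \epsilon_{t-1} \geq 0 \\ \alpha_* + (\theta - \gamma)\epsilon_{t-1} & \text{if } \epsilon_{t-1} < 0 \end{cases}$$
where $\alpha_* = (1 - \alpha)\alpha_0 - \sqrt{2/\pi}\,\gamma$. Note, this is a nonlinear function similar to that of the threshold autoregressive (TAR)
model of Tong [1990]. In this model, the conditional variance evolves in a nonlinear manner depending on the sign
of $a_{t-1}$, that is,
$$\sigma_t^2 = \sigma_{t-1}^{2\alpha} e^{\alpha_*} \begin{cases} \exp\left( (\theta + \gamma) \frac{a_{t-1}}{\sqrt{\sigma_{t-1}^2}} \right) & \text{if } a_{t-1} \geq 0 \\ \exp\left( (\theta - \gamma) \frac{a_{t-1}}{\sqrt{\sigma_{t-1}^2}} \right) & \text{if } a_{t-1} < 0 \end{cases}$$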
The coefficients (θ ± γ) show the asymmetry in response to positive and negative $a_{t-1}$, and the model is nonlinear
when $\gamma \neq 0$. In presence of higher orders, the nonlinearity becomes much more complicated. Now, given the
EGARCH(1, 0) model, assuming known model parameters and that the innovations are standard Gaussian, we have
$$\ln(\sigma_t^2) = (1 - \alpha_1)\alpha_0 + \alpha_1 \ln(\sigma_{t-1}^2) + g(\epsilon_{t-1})$$
$$g(\epsilon_{t-1}) = \theta\epsilon_{t-1} + \gamma \left( |\epsilon_{t-1}| - \sqrt{2/\pi} \right)$$
Taking exponential, the model becomes
$$\sigma_t^2 = \sigma_{t-1}^{2\alpha_1} e^{(1 - \alpha_1)\alpha_0} e^{g(\epsilon_{t-1})}$$
and the 1-step ahead forecast, with forecast origin h, satisfies
$$\sigma_{h+1}^2 = \sigma_h^{2\alpha_1} e^{(1 - \alpha_1)\alpha_0} e^{g(\epsilon_h)}$$
where all the quantities on the right-hand side are known. Thus, the 1-step ahead volatility forecast at the forecast
origin h is simply $\hat{\sigma}_h^2(1) = \sigma_{h+1}^2$. Repeating for the 2-step ahead forecast, and taking conditional expectation at time
h, we get
$$\hat{\sigma}_h^2(2) = \hat{\sigma}_h^{2\alpha_1}(1) e^{(1 - \alpha_1)\alpha_0} E_h[e^{g(\epsilon_{h+1})}]$$
where $E_h[\bullet]$ denotes the conditional expectation at the time origin h. After some calculation, the prior expectation is
given by
$$E[e^{g(\epsilon)}] = e^{-\gamma\sqrt{2/\pi}} \left[ e^{\frac{1}{2}(\theta + \gamma)^2} N(\theta + \gamma) + e^{\frac{1}{2}(\theta - \gamma)^2} N(\gamma - \theta) \right]$$
where $N(\bullet)$ is the CDF of the standard normal distribution; the argument $\gamma - \theta$ in the second term comes from integrating over $\epsilon < 0$.
As a result, the 2-step ahead volatility forecast becomes
$$\hat{\sigma}_h^2(2) = \hat{\sigma}_h^{2\alpha_1}(1) e^{(1 - \alpha_1)\alpha_0} e^{-\gamma\sqrt{2/\pi}} \left[ e^{\frac{1}{2}(\theta + \gamma)^2} N(\theta + \gamma) + e^{\frac{1}{2}(\theta - \gamma)^2} N(\gamma - \theta) \right]$$
Repeating this procedure, we obtain a recursive formula for the j-step ahead forecast
$$\hat{\sigma}_h^2(j) = \hat{\sigma}_h^{2\alpha_1}(j - 1) e^{w} \left[ e^{\frac{1}{2}(\theta + \gamma)^2} N(\theta + \gamma) + e^{\frac{1}{2}(\theta - \gamma)^2} N(\gamma - \theta) \right]$$
where $w = (1 - \alpha_1)\alpha_0 - \gamma\sqrt{2/\pi}$.
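
The expectation $E[e^{g(\epsilon)}]$ for Gaussian innovations can be checked numerically. The sketch below assumes NumPy; the function names are hypothetical. It evaluates the closed form, in which the part of the integral over $\epsilon < 0$ contributes $e^{\frac{1}{2}(\theta - \gamma)^2} N(\gamma - \theta)$, compares it with a direct trapezoidal integration against the normal density, and then applies the j-step recursion.

```python
import math
import numpy as np

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def g(eps, theta, gamma):
    """Weighted innovation g(eps) = theta*eps + gamma*(|eps| - sqrt(2/pi))."""
    return theta * eps + gamma * (np.abs(eps) - math.sqrt(2.0 / math.pi))

def expected_exp_g(theta, gamma):
    """Closed form of E[exp(g(eps))] for standard Gaussian eps."""
    up = math.exp(0.5 * (theta + gamma) ** 2) * norm_cdf(theta + gamma)
    down = math.exp(0.5 * (theta - gamma) ** 2) * norm_cdf(gamma - theta)
    return math.exp(-gamma * math.sqrt(2.0 / math.pi)) * (up + down)

def expected_exp_g_numeric(theta, gamma, width=12.0, n=200_001):
    """Same expectation by trapezoidal integration against the normal density."""
    x = np.linspace(-width, width, n)
    pdf = np.exp(-0.5 * x ** 2) / math.sqrt(2.0 * math.pi)
    y = np.exp(g(x, theta, gamma)) * pdf
    dx = x[1] - x[0]
    return float(np.sum(0.5 * (y[1:] + y[:-1])) * dx)

def egarch10_forecasts(sigma2_1, alpha0, alpha1, theta, gamma, horizon):
    """Recursive j-step forecasts starting from the known 1-step value sigma2_1."""
    scale = math.exp((1.0 - alpha1) * alpha0) * expected_exp_g(theta, gamma)
    out = [sigma2_1]
    for _ in range(1, horizon):
        out.append(out[-1] ** alpha1 * scale)   # (previous forecast)^alpha1 times constant
    return out

closed = expected_exp_g(-0.1, 0.3)
numeric = expected_exp_g_numeric(-0.1, 0.3)
fc = egarch10_forecasts(1.0, 0.0, 0.9, 0.0, 0.0, 3)
```

The parameter values are arbitrary; with θ = γ = 0 the weighted innovation vanishes and the expectation reduces to one.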

5.6.6

The stochastic volatility model

An alternative approach for describing the dynamics of volatility is to introduce the innovation vt to the conditional
variance equation of at obtaining a stochastic volatility (SV) model (see Melino et al. [1990], Harvey et al. [1994]).
Using ln (σt2 ) to ensure positivity of the conditional variance, a SV model is defined as

$$a_t = \sigma_t \epsilon_t , \quad (1 - \alpha_1 B - ... - \alpha_m B^m) \ln(\sigma_t^2) = \alpha_0 + v_t$$
where $\epsilon_t$ are i.i.d. $N(0, 1)$, $v_t$ are i.i.d. $N(0, \sigma_v^2)$, $\{\epsilon_t\}$ and $\{v_t\}$ are independent, $\alpha_0$ is a constant, and all zeros of
the polynomial $1 - \sum_{i=1}^{m} \alpha_i B^i$ are greater than 1 in modulus. While the innovation $v_t$ substantially increases the
flexibility of the model in describing the evolution of $\sigma_t^2$, it also increases the difficulty in parameter estimation since
for each shock $a_t$ the model uses two innovations $\epsilon_t$ and $v_t$. To estimate an SV model, one needs a quasi-likelihood
method via Kalman filtering or a Monte Carlo method. Details on SV models and their parameter estimation can be
found in Taylor [1994]. Properties of the model can be found in Jacquier et al. [1994] when m = 1. In that setting,
we have
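$$\ln(\sigma_t^2) \sim N\left( \frac{\alpha_0}{1 - \alpha_1} , \frac{\sigma_v^2}{1 - \alpha_1^2} \right) = N(\mu_h, \sigma_h^2)$$
and
$$E[a_t^2] = e^{\mu_h + \frac{1}{2}\sigma_h^2} , \quad E[a_t^4] = 3e^{2\mu_h + 2\sigma_h^2} , \quad Corr(a_t^2, a_{t-i}^2) = \frac{e^{\sigma_h^2 \alpha_1^i} - 1}{3e^{\sigma_h^2} - 1}$$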
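
The moment formulas of the SV model with m = 1 follow from the log-normality of $\sigma_t^2$ and can be checked by simulation. The sketch below assumes NumPy; the function name `sv1_moments`, the parameter values and the seed are illustrative. It draws $\ln(\sigma_t^2)$ from its stationary distribution rather than simulating the full AR(1) path, which is enough to check the second moment.

```python
import math
import numpy as np

def sv1_moments(alpha0, alpha1, sigma_v):
    """Stationary moments of the SV model with one AR lag in ln(sigma_t^2)."""
    mu_h = alpha0 / (1.0 - alpha1)
    s2_h = sigma_v ** 2 / (1.0 - alpha1 ** 2)
    m2 = math.exp(mu_h + 0.5 * s2_h)              # E[a_t^2]
    m4 = 3.0 * math.exp(2.0 * mu_h + 2.0 * s2_h)  # E[a_t^4]

    def corr(i):
        """Lag-i autocorrelation of the squared series."""
        return (math.exp(s2_h * alpha1 ** i) - 1.0) / (3.0 * math.exp(s2_h) - 1.0)

    return mu_h, s2_h, m2, m4, corr

mu_h, s2_h, m2, m4, corr = sv1_moments(alpha0=-0.1, alpha1=0.9, sigma_v=0.3)

# Monte Carlo check of E[a_t^2] using draws from the stationary distribution.
rng = np.random.default_rng(1)
n = 400_000
log_var = mu_h + math.sqrt(s2_h) * rng.standard_normal(n)
a = np.exp(0.5 * log_var) * rng.standard_normal(n)
m2_mc = float(np.mean(a ** 2))
```

The ratio `m4 / m2**2` equals $3e^{\sigma_h^2}$, so the model produces excess kurtosis whenever $\sigma_v > 0$.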

While SV models often provide improvements in model fitting, their contributions to out-of-sample volatility forecasts
received mixed results. Note, using the idea of fractional difference (see Section (5.4.3)), SV models have been
extended to allow for long memory in volatility. This extension has been motivated by the fact that the autocorrelation
function of the squared or absolute-valued series of asset returns often decays slowly, even though the return series
has no serial correlation (see Ding et al. [1993]). A simple long-memory stochastic volatility (LMSV) model can be
defined as
$$a_t = \sigma_t \epsilon_t , \quad \sigma_t = \sigma e^{\frac{1}{2} u_t} , \quad (1 - B)^d u_t = \eta_t$$
where $\sigma > 0$, $\epsilon_t$ are i.i.d. $N(0, 1)$, $\eta_t$ are i.i.d. $N(0, \sigma_\eta^2)$ and independent of $\epsilon_t$, and $0 < d < \frac{1}{2}$. The feature of long
memory stems from the fractional difference $(1 - B)^d$, implying that the ACF of $u_t$ decays slowly at a hyperbolic,
instead of an exponential, rate as the lag increases. Using these settings we have
$$\ln(a_t^2) = \ln(\sigma^2) + u_t + \ln(\epsilon_t^2) = \left( \ln(\sigma^2) + E[\ln(\epsilon_t^2)] \right) + u_t + \left( \ln(\epsilon_t^2) - E[\ln(\epsilon_t^2)] \right) = \mu + u_t + e_t$$

so that the ln (a2t ) series is a Gaussian long-memory signal plus a non-Gaussian white noise (see Breidt et al. [1998]).
Estimation of the long-memory stochastic volatility model is difficult, but the fractional difference parameter d can
be estimated by using either a quasi-maximum likelihood method or a regression method. Using the log series of
squared daily returns for companies in S&P 500 index, Bollerslev et al. [1999] and Ray et al. [2000] found the
median estimate of d to be about 0.38. Further, Ray et al. studied common long-memory components in daily stock
volatilities of groups of companies classified according to various characteristics, and found that companies in the
same industrial or business sector tend to have more common long-memory components.

5.6.7

Another approach: high-frequency data

Due to the availability of high-frequency financial data, especially in the foreign exchange markets, an alternative approach to volatility estimation has developed, which uses high-frequency data to calculate the volatility of low-frequency returns
(see French et al. [1987]). For example, considering the monthly volatility of an asset for which daily returns are
available, we let $r_t^m$ be the monthly log return of the asset at month t. Assuming n trading days in the month t, the
daily log returns of the asset in the month are $\{r_{t,i}\}_{i=1}^{n}$. Using properties of log returns, we have
$$r_t^m = \sum_{i=1}^{n} r_{t,i}$$

Assuming that the conditional variance and covariance exist, we have
$$Var(r_t^m | \mathcal{F}_{t-1}) = \sum_{i=1}^{n} Var(r_{t,i} | \mathcal{F}_{t-1}) + 2 \sum_{i<j} Cov(r_{t,i}, r_{t,j} | \mathcal{F}_{t-1})$$
[...]
with $E_1 = P_1$ and where $P_t$ is the value at a time period t, and $E_t$ is the value of the EMA at any time period t.
The coefficient $\alpha \in [0, 1]$ represents the degree of weighting decrease. A higher α discounts older observations faster.
Alternatively, α may be expressed in terms of N time periods, where $\alpha = \frac{2}{N+1}$. For example, if N = 19 it is
equivalent to α = 0.1, and the half-life of the weights is approximately $\frac{N}{2.8854}$. Note, $E_1$ is not determined by the recursion, and it may be
initialised in a number of different ways, most commonly by setting $E_1$ to $P_1$, though other techniques exist, such
as setting $E_1$ to an average of the first 4 or 5 observations. The prominence of the $E_1$ initialisation's effect on the
resultant moving average depends on α; smaller α values make the choice of $E_1$ relatively more important than larger
α values, since a higher α discounts older observations faster. By repeated application of this formula for different
times, we can eventually write $E_t$ as a weighted sum of the datum points $P_t$ as:
$$E_t = \alpha \left[ P_{t-1} + (1 - \alpha)P_{t-2} + (1 - \alpha)^2 P_{t-3} + .. + (1 - \alpha)^k P_{t-(k+1)} \right] + (1 - \alpha)^{k+1} E_{t-(k+1)}$$
for any suitable k = 0, 1, 2, ... The weight of the general datum point $P_{t-i}$ is $\alpha(1 - \alpha)^{i-1}$. We can show how the
EMA steps towards the latest datum point, but only by a proportion of the difference (each time)
$$E_t = E_{t-1} + \alpha(P_t - E_{t-1})$$
Expanding out $E_{t-1}$ each time results in the following power series, showing how the weighting factor on each datum
point $p_1, p_2, ...$ decreases exponentially:
$$E_t = \alpha \left[ p_1 + (1 - \alpha)p_2 + (1 - \alpha)^2 p_3 + ... \right]$$



where $p_1 = P_t$, $p_2 = P_{t-1}$, ... Note, the weights $\alpha(1 - \alpha)^{i-1}$ decrease geometrically, and their sum is unity. Using a
property of geometric series, we get
$$\alpha \sum_{i=0}^{t-1} (1 - \alpha)^i = \alpha \sum_{i=1}^{t} (1 - \alpha)^{i-1} = \alpha \left[ \frac{1 - (1 - \alpha)^t}{1 - (1 - \alpha)} \right] = 1 - (1 - \alpha)^t$$
and $\lim_{t \to \infty} (1 - \alpha)^t = 0$. As a result, we get
$$\alpha \sum_{i=1}^{\infty} (1 - \alpha)^{i-1} = 1$$
Focusing on the term α, we get
$$E_t = \frac{p_1 + (1 - \alpha)p_2 + (1 - \alpha)^2 p_3 + ...}{norm} = \frac{1}{norm} \sum_{i=1}^{\infty} \omega_i p_i$$
where $\omega_i = (1 - \alpha)^{i-1}$ and
$$\frac{1}{\alpha} = norm = 1 + (1 - \alpha) + (1 - \alpha)^2 + ... = \sum_{i=1}^{\infty} (1 - \alpha)^{i-1}$$
This is an infinite sum with decreasing terms. The N periods in an N-day EMA only specify the α factor. N is not
a stopping point for the calculation in the way it is in an SMA or WMA. For sufficiently large N, the first N datum
points in an EMA represent about 86% of the total weight in the calculation
$$\frac{\alpha \left[ 1 + (1 - \alpha) + (1 - \alpha)^2 + ... + (1 - \alpha)^N \right]}{\alpha \left[ 1 + (1 - \alpha) + (1 - \alpha)^2 + ... + (1 - \alpha)^\infty \right]} = 1 - \left( 1 - \frac{2}{N+1} \right)^{N+1}$$
and
$$\lim_{N \to \infty} \left[ 1 - \left( 1 - \frac{2}{N+1} \right)^{N+1} \right] = 1 - e^{-2} \approx 0.8647$$
since $\lim_{n \to \infty} (1 + \frac{x}{n})^n = e^x$. This power formula can give a starting value for a particular day, after which the
successive days formula above can be applied. The question of how far back to go for an initial value depends, in the
worst case, on the data. Large price values in old data will affect the total even if their weighting is very small. The
weight omitted by stopping after k terms can be used in the fraction
$$\frac{\text{weight omitted by stopping after k terms}}{\text{total weight}} = (1 - \alpha)^k$$
For example, to have 99.9% of the weight, set the ratio equal to 0.1% and solve for k:
$$k = \frac{\log(0.001)}{\log(1 - \alpha)}$$
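
The EMA recursion and the weight calculations above can be sketched in plain Python. The function names are hypothetical, and the recursion used is $E_t = E_{t-1} + \alpha(P_t - E_{t-1})$ with $E_1 = P_1$, which is only one of the initialisations mentioned in the text.

```python
import math

def ema(prices, alpha):
    """E_1 = P_1, then E_t = E_{t-1} + alpha * (P_t - E_{t-1})."""
    out = [prices[0]]
    for p in prices[1:]:
        out.append(out[-1] + alpha * (p - out[-1]))
    return out

def alpha_from_periods(n):
    """alpha = 2 / (N + 1)."""
    return 2.0 / (n + 1)

def weight_of_first_terms(alpha, n_terms):
    """Total weight carried by the n_terms most recent data points."""
    return 1.0 - (1.0 - alpha) ** n_terms

def terms_for_weight(alpha, coverage=0.999):
    """Number of terms k such that the omitted weight (1 - alpha)^k equals 1 - coverage."""
    return math.log(1.0 - coverage) / math.log(1.0 - alpha)

alpha = alpha_from_periods(19)                  # N = 19 gives alpha = 0.1
share = weight_of_first_terms(alpha, 19 + 1)    # 1 - (1 - 2/(N+1))^(N+1)
k = terms_for_weight(alpha)                     # terms needed for 99.9% of the weight
smoothed = ema([10.0, 11.0, 12.0, 11.0], alpha=0.5)
```

For N = 19 the share is already close to the limiting value $1 - e^{-2}$, and k is a little over 65 terms.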

As the Taylor series of $\log(1 - \alpha) = -\alpha - \frac{1}{2}\alpha^2 - ...$ tends to $-\alpha$ for small α, we get the approximation $\log(1 - \alpha) \approx -\frac{2}{N+1}$ for large N,
and the computation simplifies to
$$k = -\log(0.001) \frac{(N + 1)}{2}$$
for this example, and since $\frac{1}{2}\log(0.001) = -3.45$ we get $k \approx 3.45 (N + 1)$.

5.7.2

Introducing exponential smoothing models

Given the recorded observations Y1 , Y2 , ..., Yt over t time periods, which represent all the data currently available,
our interest lies in forecasting the series Yt+1 , .., Yt+h over the next h periods, known as the forecasting horizon. The
(point) forecasts for future series are all made at time t, known as the forecast origin, so the first forecast will be made
one step ahead, the second two steps ahead, and so on. We let Ft+h|t be the forecast for Yt+h made at time t. The
subscripts always tell us which time period is being forecast and when the forecast was made. When no ambiguity
arises, we will use Ft+1 to represent the one-step-ahead forecast Ft+1|t .


5.7.2.1

Linear exponential smoothing

Following Brown [1959] and Holt [1957] on exponential smoothing, to avoid the continuation of global patterns, we
are now considering tools that project trends more locally. Consider the straight-line equation
$$Y_t = L_0 + Bt$$
where $L_t$ is the level of the series at time t and $B_t$ is the slope of the series at time t. In our example, as we have a constant
slope $B_t = B$, the value of the series starts out at the value $L_0$ at time zero and increases by an amount B in each
time period. Another way of writing the right-hand side of this expression is to state directly that the new level $L_t$ is
obtained from the previous level by adding one unit of slope B
$$L_t = L_0 + Bt = L_{t-1} + B$$
We may then define the variable $Y_t$ in terms of the level and the slope as
$$Y_t = L_{t-1} + B$$
Further, we can see that if we go h periods ahead, we can define the variable at time (t + h) in terms of the level at
time t and the appropriate number of slope increments
$$Y_{t+h} = L_0 + B(t + h) = L_t + Bh$$
There is a lot of redundancy in these expressions, since we are considering the error-free case. When we turn back to
the real problem with random errors and changes over time in the level and the slope, these equations suggest that we
consider forecasts of the form
$$F_{t+h|t} = L_t + hB_t \qquad (5.7.10)$$
which is a straight line. Thus, the one-step-ahead forecast made at time (t − 1) is
$$F_{t|t-1} = F_t = L_{t-1} + B_{t-1}$$
We can now consider updating the level and the slope using equations like those we used for SES. We define the
observed error $\epsilon_t$ as the difference between the newly observed value of the series and its previous one-step-ahead
forecast
$$\epsilon_t = Y_t - F_t = Y_t - (L_{t-1} + B_{t-1})$$
Given the latest observation $Y_t$ we update the expressions for the level and the slope by making partial adjustments
that depend upon the error
$$L_t = L_{t-1} + B_{t-1} + \alpha\epsilon_t = \alpha Y_t + (1 - \alpha)(L_{t-1} + B_{t-1})$$
$$B_t = B_{t-1} + \alpha\beta\epsilon_t \qquad (5.7.11)$$
The new slope is the old slope plus a partial adjustment (weight αβ) for the error. These equations are known as
the error correction form of the updating equations. As may be checked by substitution, the slope update can also be
expressed as
$$B_t = B_{t-1} + \beta(L_t - L_{t-1} - B_{t-1})$$
where $L_t - L_{t-1}$ represents the latest estimate of the slope such that $(L_t - L_{t-1} - B_{t-1})$ is the latest error made if the
smoothed slope is used as an estimate instead. Hence, a second round of smoothing is applied to estimate the slope
which has led some authors to describe the method as double exponential smoothing.


To start the forecast process we need starting values for the level and slope, and values for the two smoothing
constants α and β. The smoothing constants may be specified by the user, and conventional wisdom decrees using
0.05 < α < 0.3 and 0.05 < β < 0.15. These values are initial values in a procedure to select optimal coefficients
by minimising the MSE over some initial sample. As for SES, different programs use a variety of procedures to set
starting values (see Gardner [1985]). One can use
$$B_3 = \frac{Y_3 - Y_1}{2}$$
for the slope and
$$L_3 = \frac{Y_1 + Y_2 + Y_3}{3} + \frac{Y_3 - Y_1}{2}$$
for the level, corresponding to fitting a straight line to the first three observations. Once the initial values are set, equations (5.7.11) are used to
update the level and slope as each new observation becomes available. When time series show a very strong trend, we
would expect LES to perform much better than SES.
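
The LES updating scheme in error-correction form can be sketched as follows. This is a minimal illustration in plain Python: the function names `les_fit` and `les_forecast` are hypothetical, the starting values are those obtained from a straight line through the first three observations, and the smoothing constants are fixed by hand rather than optimised.

```python
def les_fit(y, alpha, beta):
    """Linear exponential smoothing in error-correction form.

    Starting values come from a straight line through the first three
    observations; updating starts with the fourth observation.
    Returns the final level, final slope and the one-step-ahead errors.
    """
    if len(y) < 3:
        raise ValueError("need at least three observations")
    slope = (y[2] - y[0]) / 2.0
    level = (y[0] + y[1] + y[2]) / 3.0 + slope
    errors = []
    for obs in y[3:]:
        err = obs - (level + slope)            # observed error against F_t
        level = level + slope + alpha * err    # level update (uses the old slope)
        slope = slope + alpha * beta * err     # slope update
        errors.append(err)
    return level, slope, errors

def les_forecast(level, slope, h):
    """h-step ahead forecast L_t + h * B_t."""
    return level + h * slope

# An exactly linear series is tracked without error.
y = [2.0 + 0.5 * t for t in range(1, 11)]
level, slope, errors = les_fit(y, alpha=0.2, beta=0.1)
two_ahead = les_forecast(level, slope, 2)
```

On noisy data the errors are non-zero, and α and β would typically be chosen by minimising the MSE of these one-step-ahead errors over an initial sample.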

The LES method requires that the series is locally linear. That is, if the trend for the last few time periods in the
series appears to be close to a straight line, the method should work well. However, in many cases this assumption
is not realistic. For example, in the case of exponential growth, any linear approximation will undershoot the true
function sooner or later. A first approach would be to use a logarithmic transformation. Given
Yt = γYt−1
where γ is some constant, the log transform becomes
ln Yt = ln γ + ln Yt−1
and the log-transform produces a linear trend to which we can apply LES. We must then transform back to the original
series to obtain the forecasts of interest. Writing Zt = ln Yt the reverse transformation is
Yt = eZt
In general, SES should not be used for strongly trending series; whether to use LES on the original or transformed
series, or to use SES on growth rates, remains a question for further examination in any particular study.
5.7.2.2

The damped trend model

Some time series have a history of growth, possibly later followed by a decline, other series may have a strong tendency
to increase over time, while other time series relate to the returns on an investment. Based on the findings of the M
competition, Makridakis et al. [1982] showed that the practice of projecting a straight line trend indefinitely into
the future was often too optimistic (or pessimistic). Hence, one can either convert the series to growth over time and
forecast the growth rate, or develop forecasting methods that account for trends. When the growth rate slows
down and then declines, we can accommodate such effects by modifying the updating equations for the level and slope.
Assuming that the series flattens out unless the process encounters some new stimulus, the slope should approach zero.
We consider the damped trend model introduced by Gardner and McKenzie [1985] which proved to be very effective
(see Makridakis and Hibon [2000]). This is achieved by introducing a dampening factor φ in equations (5.7.11),
getting
$$L_t = L_{t-1} + \phi B_{t-1} + \alpha\epsilon_t$$
$$B_t = \phi B_{t-1} + \alpha\beta\epsilon_t$$
where $\phi \in [0, 1]$ multiplies each slope term $B_{t-1}$, shifting that term towards zero, or dampening it. Computing the
forecast function for h-steps ahead, we get
$$F_{t+h|t} = L_t + (\phi + \phi^2 + .. + \phi^h)B_t$$
This forecast levels out over time, approaching the limiting value $L_t + \frac{\phi}{1 - \phi}B_t$ (since $\phi + \phi^2 + ... = \frac{\phi}{1 - \phi}$) provided the dampening factor is less than
one. This is in contrast with the case φ = 1, when the forecast keeps increasing so that $F_{t+h|t} = L_t + hB_t$.
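
The damped forecast function is easy to evaluate directly. The sketch below is illustrative, with hypothetical function names and arbitrary values for the level, slope and dampening factor. The limit is computed from the geometric series $\phi + \phi^2 + ... = \phi / (1 - \phi)$.

```python
def damped_forecast(level, slope, phi, h):
    """F_{t+h|t} = L_t + (phi + phi^2 + ... + phi^h) * B_t."""
    return level + sum(phi ** i for i in range(1, h + 1)) * slope

def damped_limit(level, slope, phi):
    """Limit of the forecast as h grows, valid for 0 <= phi < 1."""
    return level + phi / (1.0 - phi) * slope

level, slope = 100.0, 2.0
path = [damped_forecast(level, slope, 0.8, h) for h in (1, 2, 5, 200)]
limit = damped_limit(level, slope, 0.8)
undamped = damped_forecast(level, slope, 1.0, 7)   # phi = 1 recovers L_t + h * B_t
```

With φ = 0.8 the forecast flattens out near 108, whereas the undamped line keeps rising by the full slope each period.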

Robert Brown [1959] was the original developer of exponential smoothing methods where his initial derivation
of exponential smoothing used a least squares argument which, for a local linear trend reduces to the use of LES with
α = β. In general, there is no particular benefit to imposing this restriction. However, the discounted least squares
approach is particularly useful when complex non-linear functions are involved and updating equations are not readily
available. If we set β = 0 the updating equations (5.7.11) become
$$L_t = L_{t-1} + B + \alpha\epsilon_t$$
$$B_t = B_{t-1} = B$$
which may be referred to as SES with drift, since the level increases by a fixed amount each period (see SES in
Equation (5.7.10)). This method being just a special case of LES, the simpler structure makes it easier to derive an
optimal value for B using the estimation sample (see Hyndman and Billah [2003]).
Trigg and Leach [1967] introduced the concept of a tracking signal, whereby not only are the level and slope
updated each time, but also the smoothing parameters. In the case of SES, we can use the updated value for α given
by
$$\hat{\alpha}_t = \left| \frac{E_t}{M_t} \right| , \quad \alpha_t = \left| \frac{E_{t-1}}{M_{t-1}} \right|$$
where $E_t$ and $M_t$ are smoothed values of the error and the absolute error respectively given by
$$E_t = \delta\epsilon_t + (1 - \delta)E_{t-1}$$
$$M_t = \delta|\epsilon_t| + (1 - \delta)M_{t-1}$$

where δ ∈ [0.1, 0.2]. If a string of positive errors occurs, the value of αt increases, to speed up the adjustment process,
the reverse occurs for negative errors. A generally preferred approach is to update the parameter estimate regularly,
which is no longer much of a computational problem even for large numbers of series.
One alternative to SES is to look at the successive differences in the series $Y_t - Y_{t-1}$ and take a moving average of
these values to estimate the slope. The net effect is to estimate the slope by $\frac{Y_t - Y_{t-n}}{n}$ for an n-term moving average.
Again, LES usually provides better forecasts.

5.7.3

A summary

Gardner [2006] reviewed the state of the art in exponential smoothing (ES) to date. He classified and gave
formulations for the standard methods of ES which can be modified to create state-space models. For each type of
trend, and for each type of seasonality, there are two sections of equations. We first consider recurrence forms (used
in the original work by Brown [1959] and Holt [1957]) and then we give error-correction forms (notation follows
Gardner [1985]) which are simpler and give equivalent forecasts. Note, there is still no agreement on notation for ES.
The notation by Hyndman et al. [2002] and extended by Taylor [2003] is helpful in describing the methods. Each
method is denoted by one or two letters for the trend and one letter for seasonality. Method (N-N) denotes no trend
with no seasonality, or simple exponential smoothing (SES). The other nonseasonal methods are additive trend (A-N),
damped additive trend (DA-N), multiplicative trend (M-N), and damped multiplicative trend (DM-N).

Table 5.1: List of ES models

Trend Component               Seasonal Component
                              N (none)    A (additive)    M (multiplicative)
N (none)                      N-N         N-A             N-M
A (additive)                  A-N         A-A             A-M
DA (damped-additive)          DA-N        DA-A            DA-M
M (multiplicative)            M-N         M-A             M-M
DM (damped-multiplicative)    DM-N        DM-A            DM-M

All seasonal methods are formulated by extending the methods in Winters [1960]. Note that the forecast equations
for the seasonal methods are valid only for a forecast horizon (h) less than or equal to the length of the seasonal cycle
(p). Given the smoothing parameter for the level of the series α ∈ [0, 1], the smoothing parameter for the trend
γ ∈ [0, 1], the smoothing parameter for seasonal indices δ, the autoregressive or damping parameter φ ∈ [0, 1], we let
$\epsilon_t = Y_t - F_t$ be the one-step-ahead forecast error, where $F_t = F_{t|t-1} = \hat{Y}_{t-1}(1)$ is the one-step-ahead forecast, and we
get
1. (N-N)
$$S_t = \alpha Y_t + (1 - \alpha) S_{t-1}$$
$$\hat{Y}_t(h) = F_{t+h|t} = S_t$$
and, in error-correction form,
$$S_t = S_{t-1} + \alpha \epsilon_t$$
$$F_{t+h|t} = S_t$$
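2. (A-N)
$$S_t = \alpha Y_t + (1 - \alpha)(S_{t-1} + T_{t-1})$$
$$T_t = \gamma (S_t - S_{t-1}) + (1 - \gamma) T_{t-1}$$
$$F_{t+h|t} = S_t + h T_t$$
and, in error-correction form,
$$S_t = S_{t-1} + T_{t-1} + \alpha \epsilon_t$$
$$T_t = T_{t-1} + \alpha \gamma \epsilon_t$$
$$F_{t+h|t} = S_t + h T_t$$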
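3. (DA-N)
$$S_t = \alpha Y_t + (1 - \alpha)(S_{t-1} + \phi T_{t-1})$$
$$T_t = \gamma (S_t - S_{t-1}) + (1 - \gamma) \phi T_{t-1}$$
$$F_{t+h|t} = S_t + \sum_{i=1}^{h} \phi^i T_t$$
and, in error-correction form,
$$S_t = S_{t-1} + \phi T_{t-1} + \alpha \epsilon_t$$
$$T_t = \phi T_{t-1} + \alpha \gamma \epsilon_t$$
$$F_{t+h|t} = S_t + \sum_{i=1}^{h} \phi^i T_t$$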

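4. (M-N)
$$S_t = \alpha Y_t + (1 - \alpha) S_{t-1} R_{t-1}$$
$$R_t = \gamma \frac{S_t}{S_{t-1}} + (1 - \gamma) R_{t-1}$$
$$F_{t+h|t} = S_t R_t^h$$
and, in error-correction form,
$$S_t = S_{t-1} R_{t-1} + \alpha \epsilon_t$$
$$R_t = R_{t-1} + \alpha \gamma \frac{\epsilon_t}{S_{t-1}}$$
$$F_{t+h|t} = S_t R_t^h$$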
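5. (DM-N)
$$S_t = \alpha Y_t + (1 - \alpha) S_{t-1} R_{t-1}^{\phi}$$
$$R_t = \gamma \frac{S_t}{S_{t-1}} + (1 - \gamma) R_{t-1}^{\phi}$$
$$F_{t+h|t} = S_t R_t^{\sum_{i=1}^{h} \phi^i}$$
and, in error-correction form,
$$S_t = S_{t-1} R_{t-1}^{\phi} + \alpha \epsilon_t$$
$$R_t = R_{t-1}^{\phi} + \alpha \gamma \frac{\epsilon_t}{S_{t-1}}$$
$$F_{t+h|t} = S_t R_t^{\sum_{i=1}^{h} \phi^i}$$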

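6. (N-A)
$$S_t = \alpha (Y_t - I_{t-p}) + (1 - \alpha) S_{t-1}$$
$$I_t = \delta (Y_t - S_t) + (1 - \delta) I_{t-p}$$
$$F_{t+h|t} = S_t + I_{t-p+h}$$
and, in error-correction form,
$$S_t = S_{t-1} + \alpha \epsilon_t$$
$$I_t = I_{t-p} + \delta (1 - \alpha) \epsilon_t$$
$$F_{t+h|t} = S_t + I_{t-p+h}$$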


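7. (A-A)
$$S_t = \alpha (Y_t - I_{t-p}) + (1 - \alpha)(S_{t-1} + T_{t-1})$$
$$T_t = \gamma (S_t - S_{t-1}) + (1 - \gamma) T_{t-1}$$
$$I_t = \delta (Y_t - S_t) + (1 - \delta) I_{t-p}$$
$$F_{t+h|t} = S_t + h T_t + I_{t-p+h}$$
and, in error-correction form,
$$S_t = S_{t-1} + T_{t-1} + \alpha \epsilon_t$$
$$T_t = T_{t-1} + \alpha \gamma \epsilon_t$$
$$I_t = I_{t-p} + \delta (1 - \alpha) \epsilon_t$$
$$F_{t+h|t} = S_t + h T_t + I_{t-p+h}$$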
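8. (N-M)
$$S_t = \alpha \frac{Y_t}{I_{t-p}} + (1 - \alpha) S_{t-1}$$
$$I_t = \delta \frac{Y_t}{S_t} + (1 - \delta) I_{t-p}$$
$$F_{t+h|t} = S_t I_{t-p+h}$$
and, in error-correction form,
$$S_t = S_{t-1} + \alpha \frac{\epsilon_t}{I_{t-p}}$$
$$I_t = I_{t-p} + \delta (1 - \alpha) \frac{\epsilon_t}{S_t}$$
$$F_{t+h|t} = S_t I_{t-p+h}$$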
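9. (A-M)
$$S_t = \alpha \frac{Y_t}{I_{t-p}} + (1 - \alpha)(S_{t-1} + T_{t-1})$$
$$T_t = \gamma (S_t - S_{t-1}) + (1 - \gamma) T_{t-1}$$
$$I_t = \delta \frac{Y_t}{S_t} + (1 - \delta) I_{t-p}$$
$$F_{t+h|t} = (S_t + h T_t) I_{t-p+h}$$
and, in error-correction form,
$$S_t = S_{t-1} + T_{t-1} + \alpha \frac{\epsilon_t}{I_{t-p}}$$
$$T_t = T_{t-1} + \alpha \gamma \frac{\epsilon_t}{I_{t-p}}$$
$$I_t = I_{t-p} + \delta (1 - \alpha) \frac{\epsilon_t}{S_t}$$
$$F_{t+h|t} = (S_t + h T_t) I_{t-p+h}$$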
where St is the smoothed level of the series, Tt is the smoothed additive trend at the end of period t, Rt is the smoothed
multiplicative trend, It is the smoothed seasonal index at time t, h is the number of periods in the forecast lead-time,
and p is the number of periods in the seasonal cycle.
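
To illustrate that the recurrence and error-correction forms give the same forecasts, here is a minimal Python sketch for the damped additive trend method (DA-N). The function names, the initial level and trend, and the parameter values used in the usage note are illustrative assumptions only.

```python
def damped_trend_recurrence(y, alpha, gamma, phi, level0, trend0):
    """DA-N in recurrence form; returns the final level S and trend T."""
    S, T = level0, trend0
    for obs in y:
        S_prev = S
        S = alpha * obs + (1 - alpha) * (S_prev + phi * T)
        T = gamma * (S - S_prev) + (1 - gamma) * phi * T
    return S, T


def damped_trend_error_correction(y, alpha, gamma, phi, level0, trend0):
    """DA-N in error-correction form; returns the final level S and trend T."""
    S, T = level0, trend0
    for obs in y:
        err = obs - (S + phi * T)   # one-step-ahead forecast error
        S, T = S + phi * T + alpha * err, phi * T + alpha * gamma * err
    return S, T


def damped_forecast(S, T, phi, h):
    """h-step-ahead forecast: S + (phi + phi^2 + ... + phi^h) * T."""
    return S + sum(phi ** i for i in range(1, h + 1)) * T
```

For instance, running both functions on the same short series with alpha = 0.4, gamma = 0.3 and phi = 0.9 returns the same level and trend up to rounding error, and hence the same forecasts at every horizon.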


Remark 5.7.3 When forecasting time series with ES, it is generally assumed that the most common time series in
business are inherently non-negative. Therefore, it is of interest to consider the properties of the potential stochastic
models underlying ES when applied to non-negative data. It is clearly a problem when forecasting financial returns
as the multiplicative error models are not well defined if there are zeros or negative values in the data.
The (DA-N) method can be used to forecast multiplicative trends with the autoregressive or damping parameter φ
restricted to the range 1 < φ < 2, a method sometimes called generalised Holt. In hopes of producing more robust
forecasts, Taylor’s [2003] methods (DM-N, DM-A, and DM-M) add a damping parameter φ < 1 to Pegels’ [1969]
multiplicative trends. Each exponential smoothing method above is equivalent to one or more stochastic models.
The possibilities include regression, ARIMA, and state-space models. The most important property of exponential
smoothing is robustness. Note, the damped multiplicative trends are the only new methods creating new forecast
profiles since 1985. The forecast profiles for Taylor’s methods will eventually approach a horizontal nonseasonal or
seasonally adjusted asymptote, but in the near term, different values of φ can produce forecast profiles that are convex,
nearly linear, or even concave.
There are many equivalent state-space models for each of the methods described in the above table. In the framework
of Hyndman et al. [2002] each ES method in the table (except the DM methods) has two corresponding
state-space models, each with a single source of error (SSOE), one with an additive error and the other with a
multiplicative error. The methods corresponding to the framework of Hyndman et al. are the same as the ones in the
table apart from two exceptions: one has to modify all multiplicative seasonal methods and all damped additive
trend methods. Each ES method in the table is equivalent to one or more stochastic models, including regression,
ARIMA, and state-space models. In large samples, ES is equivalent to an exponentially-weighted or discounted least
squares (DLS) regression model. General exponential smoothing (GES) also relies on DLS regression with one or two
discount factors to fit a variety of functions of time to the data, including polynomials, exponentials, sinusoids, and
their sums and products (see Gardner [1985]). Gijbels et al. [1999] and Taylor [2004c] showed that GES can be viewed
in a kernel regression framework. For instance, simple smoothing (N-N) is a zero-degree local polynomial kernel model.
They showed that
choosing the minimum-MSE parameter in simple smoothing is equivalent to choosing the regression bandwidth by
cross-validation, a procedure that divides the data into two disjoint sets, with the model fitted in one set and validated
in another.
All linear exponential smoothing methods have equivalent ARIMA models, which can be easily shown through the
DA-N method, which contains at least six ARIMA models as special cases (see Gardner et al. [1988]). If 0 < φ < 1 then
the DA-N method is equivalent to the ARIMA(1, 1, 2) model, which can be written as
$$(1 - B)(1 - \phi B) Y_t = \left[ 1 - (1 + \phi - \alpha - \phi \alpha \gamma) B - \phi (\alpha - 1) B^2 \right] \epsilon_t$$
We obtain an ARIMA(1, 1, 1) model by setting α = 1. When α = γ = 1 the model is ARIMA(1, 1, 0). When
φ = 1 we have a linear trend (A-N) and the model is ARIMA(0, 2, 2)
$$(1 - B)^2 Y_t = \left[ 1 - (2 - \alpha - \alpha \gamma) B - (\alpha - 1) B^2 \right] \epsilon_t$$
When φ = 0 we have simple smoothing (N-N) and the equivalent ARIMA(0, 1, 1) model
$$(1 - B) Y_t = \left[ 1 - (1 - \alpha) B \right] \epsilon_t$$
The ARIMA(0, 1, 0) random walk model can be obtained from the above equation by choosing α = 1. Note,
ARIMA-equivalent seasonal models for the linear exponential smoothing methods exist.
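
The link between simple smoothing and the ARIMA(0, 1, 1) form can be checked numerically. The sketch below, with hypothetical function names and an arbitrary initial level, computes the one-step errors of the (N-N) method and verifies that the first difference of the series equals ε_t − (1 − α)ε_{t−1}, which follows from Y_t = S_{t−1} + ε_t and S_t = S_{t−1} + αε_t.

```python
def ses_errors(y, alpha, level0):
    """One-step-ahead forecast errors of simple exponential smoothing."""
    level, errors = level0, []
    for obs in y:
        err = obs - level
        errors.append(err)
        level += alpha * err
    return errors


def max_identity_gap(y, alpha, level0):
    """Largest absolute gap between Y_t - Y_{t-1} and eps_t - (1 - alpha) * eps_{t-1}."""
    eps = ses_errors(y, alpha, level0)
    return max(
        abs((y[t] - y[t - 1]) - (eps[t] - (1 - alpha) * eps[t - 1]))
        for t in range(1, len(y))
    )
```

The gap is zero up to rounding error for any series and any α, since the identity is purely algebraic; what makes the ARIMA(0, 1, 1) model the equivalent model is the additional assumption that the errors are white noise.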
The equivalent ARIMA models do not extend to the nonlinear exponential smoothing methods. Prior to the work
by Ord et al. [1997] (OKS), state-space models for ES were formulated using multiple sources of error (MSOE). For
instance, the exponential smoothing (N-N) is optimal for a model with two sources of error (see Muth [1960]) where
observation and state equations are given by


Yt = Lt + νt
Lt = Lt−1 + ηt
so that the unobserved state variable Lt denotes the local level at time t, and the error terms νt and ηt are generated by
independent white noise processes. Various authors showed that simple smoothing SES is optimal with α determined
by the ratio of the variances of the noise processes (see Chatfield [1996]). Harvey [1984] also showed that the Kalman
filter for the above equations reduces to simple smoothing in the steady state.
Due to the limitation of the MSOE, Ord et al. [1997] created a general, yet simple class of state-space models
with a single source of error (SSOE). For example, the SSOE model with additive errors for the (N-N) model is given
by
$$Y_t = L_{t-1} + \epsilon_t$$
$$L_t = L_{t-1} + \alpha \epsilon_t$$
where the error term $\epsilon_t$ in the observation equation is the one-step ahead forecast error assuming knowledge of the
level at time t − 1. For the multiplicative error (N-N) model, we alter the additive-error SSOE model and get
$$Y_t = L_{t-1} + L_{t-1} \epsilon_t$$
$$L_t = L_{t-1} (1 + \alpha \epsilon_t) = L_{t-1} + \alpha L_{t-1} \epsilon_t$$
where the one-step ahead forecast error is still $Y_t - L_{t-1}$, which is no longer the same as $\epsilon_t$. Hence, the above state
equation becomes
$$L_t = L_{t-1} + \alpha L_{t-1} \frac{Y_t - L_{t-1}}{L_{t-1}} = L_{t-1} + \alpha (Y_t - L_{t-1})$$

where the multiplicative error state equation can be written in the error correction form of simple smoothing. As a
result, the state equations are the same in the additive and multiplicative error cases, and this is true for all SSOE
models. Hyndman et al. [2002] extended the class of SSOE models by Ord et al. [1997] to include all the methods
of ES in the above table except for the DM methods. The theoretical advantage of the SSOE approach to ES is that
the errors can depend on the other components of the time series. That is, each of the linear exponential smoothing
(LES) models with additive errors has an ARIMA equivalent, but the linear models with multiplicative errors and
the nonlinear models are beyond the scope of the ARIMA class. The equivalent models help explain the general
robustness of exponential smoothing. Simple smoothing (N-N) is certainly the most robust forecasting method and has
performed well in many types of series not generated by the equivalent ARIMA(0, 1, 1) process. Such series include
the common first-order autoregressive processes and a number of lower-order ARIMA processes. Bossons [1966]
showed that simple smoothing is generally insensitive to specification error, especially when the misspecification
arises from an incorrect belief in the stationarity of the generating process. Similarly, Hyndman [2001] showed that
ARIMA model selection errors can inflate MSEs compared to simple smoothing. Using AIC to select the best model,
the ARIMA forecast MSEs were significantly larger than those of simple smoothing due to incorrect model selections,
and becoming worse when the errors were non-normal.
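
The claim that the additive and multiplicative SSOE models share the same state equation can be illustrated with a short Python sketch for the (N-N) case. The function names and the initial level are assumptions made for illustration; the multiplicative version requires strictly positive levels.

```python
def level_additive(y, alpha, level0):
    """Final level under the additive-error SSOE (N-N) model."""
    L = level0
    for obs in y:
        eps = obs - L               # additive error: Y_t = L_{t-1} + eps_t
        L = L + alpha * eps
    return L


def level_multiplicative(y, alpha, level0):
    """Final level under the multiplicative-error SSOE (N-N) model."""
    L = level0
    for obs in y:
        eps = (obs - L) / L         # relative error: Y_t = L_{t-1} * (1 + eps_t)
        L = L * (1 + alpha * eps)
    return L
```

Both functions return the same level, and therefore the same point forecast, for a positive series; the two models differ only in the error distribution and hence in the prediction intervals.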

5.7.4 Model fitting

When considering method selection, the definitions of aggregate and individual method selection in the work of Fildes
[1992] are useful in exponential smoothing. Aggregate selection is the choice of a single method for all time series
in a population, while individual selection is the choice of a method for each series. While in aggregate selection it is
difficult to beat the damped-trend version of exponential smoothing, in individual selection it may be possible to beat
the damped trend, but it is not clear how one should proceed. Even though individual method selection can be done in
a variety of ways, such as time series characteristics, the most sophisticated approach to method selection is through
information criteria.
Various expert systems for individual selection have been proposed, among them the system of Collopy et al. [1992],
which includes 99 rules constructed from time series characteristics and domain knowledge and combines the forecasts
from four methods: a random walk, time series regression, double exponential smoothing, and the (A-N) method. Since
this approach requires considerable human intervention in identifying features of time series, Vokurka et al. [1996]
developed a completely automatic expert system selecting from a different set of candidate methods: the (N-N) and
(DA-N) methods, classical decomposition, and a combination of all candidates. Testing their system using 126 annual
time series from the M1 competition, they concluded that it was more accurate than various alternatives. Gardner
[1999] considered the aggregate selection of the (DA-N) method and showed that it was more accurate at all forecast
horizons than either version of rule-based forecasting.
Numerous information criteria that can distinguish between additive and multiplicative seasonality are available
for selection of an ES method, but the computational burden can be significant. For instance, Hyndman et al. [2002]
recommended fitting all models (from their set of 24 alternatives) to time series, then selecting the one minimising
the AIC. In the 1,001 series (M1 and M3 data), for the average of all forecast horizons, the (DA-N) method was
better than individual selection using the AIC. Later work by Billah et al. [2005] compared eight information criteria
used to select from four ES methods, including AIC, BIC, and other standards, as well as two Empirical Information
Criteria (EIC) (a linear and a non-linear function) penalising the likelihood of the data by a function of the number of
parameters in the model.
Although state-space models for exponential smoothing dominate the recent literature, very little has been done on
the identification of such models as opposed to selection using information criteria. Koehler et al. [1988] identified
and fitted MSOE state-space models to 60 time series from the 111 series in the M1 competition with a semi-automatic
fitting routine. In general, the identification process was disappointing. Rather than attempt to identify a model, we
could attempt to identify the best exponential smoothing method directly. Chatfield et al. [1988] call this a thoughtful
use of exponential smoothing methods that are usually regarded as automatic. They gave a common-sense strategy for
identifying the most appropriate method for the Holt-Winters class (see also Chatfield [2002]). Gardner [2006] gave
the strategy in a nutshell.
1. We plot the series and look for trend, seasonal variation, outliers, and changes in structure that may be slow or
sudden and may indicate that ES is not appropriate in the first place. We should examine any outliers, consider
making adjustments, and then decide on the form of the trend and seasonal variation. At this point, we should
also consider the possibility of transforming the data, either to stabilise the variance or to make the seasonal
effect additive.
2. We fit an appropriate method, produce forecasts, and check the adequacy of the method by examining the
one-step-ahead forecast errors, particularly their autocorrelation function.
3. The findings may lead to a different method or a modification of the selected method.
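
The diagnostic check in step 2 can be sketched as follows. The code computes the sample autocorrelation function of the one-step-ahead forecast errors and flags lags falling outside the rough band ±1.96/√n, a common rule of thumb that is assumed here and not taken from the original text; the function names are illustrative.

```python
def autocorrelation(x, max_lag):
    """Sample autocorrelations of x at lags 1..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]


def flag_lags(errors, max_lag=10):
    """Lags whose autocorrelation lies outside +/- 1.96 / sqrt(n) (rule of thumb)."""
    bound = 1.96 / len(errors) ** 0.5
    acf = autocorrelation(errors, max_lag)
    return [k + 1 for k, r in enumerate(acf) if abs(r) > bound]
```

A non-empty list of flagged lags suggests that the errors still contain structure, which, as in step 3, may point to a different method or a modification of the selected one.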

In order to implement an ES method, the user must choose parameters, either fixed or adaptive, as well as initial
values and loss functions. Parameter selection is not independent of initial values and loss functions. Note, in the trend
and seasonal models, the response surface is not necessarily convex so that one needs to start any search routine from
several different points to evaluate local minima. We hope that our search routine comes to rest at a set of invertible
parameters, but this may not happen. Invertible parameters create a model in which each forecast can be written as a
linear combination of all past observations, with the absolute value of the weight on each observation less than one,
and with recent observations weighted more heavily than older ones. If we view an ES method as a system of linear
difference equations, a stable system has an impulse response that decays to zero over time. The stability region for
parameters in control theory is the same as the invertibility region in time series analysis. In the linear non-seasonal
methods, the parameters are always invertible if they are chosen in the interval [0, 1]. The same conclusion holds for
quarterly seasonal methods, but not for monthly seasonal methods, whose invertibility regions are complex (see Sweet
[1985]). Non-invertibility usually occurs when one or more parameters fall near boundaries, or when trend and/or
seasonal parameters are greater than the level parameter. For all seasonal ES methods, we can test parameters for
invertibility using an algorithm by Gardner et al. [1989] assuming that additive and multiplicative invertible regions
are identical, but the test may fail to eliminate some troublesome parameters. Archibald [1990] found that some
combinations of [0, 1] parameters near boundaries fall within the ARIMA invertible region, but the weights on past data
diverge. Hence, the author concluded that one should be skeptical of parameters near boundaries in all seasonal models.
Once the parameters have been selected, another problem is deciding how frequently they should be updated.
Fildes et al. [1998] compared three options for choosing parameters in the (N-N), (A-N), and (DA-N) methods
1. arbitrarily
2. optimise once at the first time origin
3. optimise each time forecasts are made
and found that the best option was to optimise each time forecasts were made. The term adaptive smoothing means
that the parameters are allowed to change automatically in a controlled manner as the characteristics of the time series
change. For instance, the Kalman filter can be used to compute the parameter in the (N-N) method. The only adaptive
method that has demonstrated significant improvement in forecast accuracy compared to the fixed-parameter (N-N)
method is Taylor’s [2004a] [2004b] smooth transition exponential smoothing (STES). Smooth transition models are
differentiated by at least one parameter that is a continuous function of a transition variable Vt . The formula for the
adaptive parameter αt is a logistic function (see details in Appendix (A.2))
$$\alpha_t = \frac{1}{1 + e^{a + b V_t}}$$

with several possibilities for Vt including $\epsilon_t$, $|\epsilon_t|$, and $\epsilon_t^2$. Whatever the transition variable, the logistic function restricts
αt to [0, 1]. The drawback to STES is that model-fitting is required to estimate a and b; thereafter, the method adapts
to the data through Vt . In Taylor [2004a], STES was arguably the best method overall in volatility forecasting of stock
index data compared to the fixed-parameter version of (N-N) and a range of GARCH and autoregressive models. Note,
with financial returns, the mean is often assumed to be zero or a small constant value, and attention turns to predicting
the variance. Following the advice of Fildes [1998], Taylor evaluated forecast performance across time. Using the
last 18 observations of each series, he computed successive one-step ahead monthly forecasts, for a total of 25,704
forecasts. Judged by MAPE and median APE, STES was the most accurate method tested, with best results for
the MAPE.
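
A minimal Python sketch of smooth transition exponential smoothing is given below. It assumes that the transition variable is a function of the current one-step error (the absolute error by default) and that a and b have already been estimated; the function name and argument names are hypothetical.

```python
import math


def stes_forecasts(y, a, b, level0, transition=abs):
    """One-step-ahead forecasts with a logistic, error-driven smoothing parameter.

    alpha_t = 1 / (1 + exp(a + b * V_t)), with V_t = transition(eps_t).
    """
    level = level0
    forecasts, alphas = [], []
    for obs in y:
        forecasts.append(level)
        err = obs - level                                   # one-step-ahead error
        alpha = 1.0 / (1.0 + math.exp(a + b * transition(err)))
        alphas.append(alpha)
        level += alpha * err                                # N-N update with adaptive alpha
    return forecasts, alphas
```

With b = 0 the method reduces to fixed-parameter simple smoothing with α = 1/(1 + e^a); with b < 0, larger errors push α towards one so that the level adapts faster.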
Standard ES methods are usually fitted in two steps, by choosing fixed initial values, followed by an independent
search for parameters. In contrast, the new state-space methods are usually fitted using maximum likelihood where
initial values are less of a concern because they are refined simultaneously with the smoothing parameters during the
optimisation process. However, this approach requires significant computation times. The nonlinear programming
model introduced by Sergua et al. [2001] optimises initial values and parameters simultaneously. Examining the M1
series, Makridakis et al. [1991] measured the effect of different initial values and loss functions in fitting (N-N),
(A-N), and (DA-N) methods, using seasonally adjusted data where appropriate. Initial values were computed by least
squares, backcasting, and several simple methods. Loss functions included the MAD, MAPE, median APE, MSE, the
sum of cubed errors, and a variety of non-symmetric functions computed by weighting the errors in different ways.
They concluded that initialising by least squares, choosing parameters from the [0, 1] interval, and fitting models to
minimise the MSE provided satisfactory results.
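
As a simple illustration of fitting by minimising the MSE over the [0, 1] interval, the sketch below performs a grid search for the smoothing parameter of the (N-N) method. The initial level is set to the first observation, which is a simplifying assumption standing in for the least squares or backcasting initialisations mentioned above; the function names and grid size are also illustrative.

```python
def ses_mse(y, alpha, level0):
    """Mean squared one-step-ahead error of simple smoothing."""
    level, sq = level0, 0.0
    for obs in y:
        err = obs - level
        sq += err * err
        level += alpha * err
    return sq / len(y)


def fit_ses_grid(y, steps=100):
    """Grid search for alpha in [0, 1] minimising the one-step MSE."""
    level0 = y[0]                                   # simple initial value (assumption)
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda a: ses_mse(y, a, level0))
```

A grid search evaluates the whole interval and so does not depend on a starting point; for the trend and seasonal methods, where several parameters are involved, a local optimiser started from several points would play the same role.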

5.7.5 Prediction intervals and random simulation

Hyndman et al. [2008] [2008b] provided a taxonomy of ES methods with forecasts equivalent to the ones from a state
space model. This equivalence allows
1. easy calculation of the likelihood, the AIC and other model selection criteria
2. computation of prediction intervals for each method
3. random simulation from the state space model
Following their notation, the ES point forecast equations become
$$l_t = \alpha P_t + (1 - \alpha) Q_t$$
$$b_t = \beta R_t + (\phi - \beta) b_{t-1}$$
$$s_t = \gamma T_t + (1 - \gamma) s_{t-m}$$

where lt is the series level at time t, bt is the slope at time t, st is the seasonal component of the series and m is the
number of seasons in a year. The values of Pt , Qt , Rt and Tt vary according to which of the cells the method belongs,
and α, β, γ and φ are constants. For example, for the (N-N) method we get Pt = Yt , Qt = lt−1 , φ = 1 and Ft+h = lt .
Rewriting these equations in their error-correction form, we get
$$l_t = Q_t + \alpha (P_t - Q_t)$$
$$b_t = \phi b_{t-1} + \beta (R_t - b_{t-1})$$
$$s_t = s_{t-m} + \gamma (T_t - s_{t-m})$$

Setting α = 0 we get the method with fixed level (constant over time), setting β = 0 we get the method with fixed
trend, and the method with fixed seasonal pattern is obtained by setting γ = 0. Hyndman et al. [2008] extended the
work of Ord et al. [1997] (OKS) in SSOE to cover all the methods in the classification of the ES and obtained two
models, one with additive error and the other one with multiplicative errors giving the same forecasts but different
prediction intervals. The general OKS framework involves a state vector Xt and state space equations of the form
$$Y_t = h(X_{t-1}) + k(X_{t-1}) \epsilon_t$$
$$X_t = f(X_{t-1}) + g(X_{t-1}) \epsilon_t$$

where $\{\epsilon_t\}$ is a Gaussian white noise process with mean zero and variance $\sigma^2$. Defining
$X_t = (l_t, b_t, s_t, s_{t-1}, .., s_{t-(m-1)})$, $e_t = k(X_{t-1}) \epsilon_t$ and $\mu_t = h(X_{t-1})$, we get
$$Y_t = \mu_t + e_t$$
For example, in the (N-N) method we get $\mu_t = l_{t-1}$ and $l_t = l_{t-1} + \alpha \epsilon_t$. Note, the model with additive errors is
written $Y_t = \mu_t + \epsilon_t$ where $\mu_t = F_{(t-1)+1}$ is the one-step ahead forecast at time t − 1, so that $k(X_{t-1}) = 1$. The
model with multiplicative errors is written as $Y_t = \mu_t (1 + \epsilon_t)$ with $k(X_{t-1}) = \mu_t$ and
$\epsilon_t = e_t / \mu_t = (Y_t - \mu_t) / \mu_t$, which is a relative error. Note, the multiplicative error models are not well defined
if there are zeros or negative values in the data. Further, we should not consider seasonal methods if the data are not
quarterly or monthly (or do not have some other seasonal period).


Model parameters are usually estimated with the maximum likelihood function (see Section (3.3.2.2)) while the
Akaike Information Criterion (AIC) (Akaike [1973]) and the bias-corrected version (AICC) (Hurvich et al. [1989])
are standard procedures for model selection (see Appendix (??)). The use of the OKS enables easy calculation of the
likelihood as well as model selection criteria such as the AIC. We let L∗ be equal to twice the negative logarithm of
the conditional likelihood function
$$L^*(\theta, X_0) = n \log \left( \sum_{t=1}^{n} \frac{e_t^2}{k^2(x_{t-1})} \right) + 2 \sum_{t=1}^{n} \log |k(x_{t-1})|$$

with parameters θ = (α, β, γ, φ) and initial states X0 = (l0, b0, s0, s−1, .., s−m+1). They can be estimated by
minimising L∗. Estimates can also be obtained by minimising the one-step MSE, the one-step MAPE, the residual
variance σ² or by using other criteria measuring forecast error. Models are selected by minimising the AIC among
all the ES methods
$$AIC = L^*(\hat{\theta}, \hat{X}_0) + 2p$$
where p is the number of parameters in θ, and θ̂ and X̂0 are the estimates of θ and X0. Note, the AIC penalises
models containing too many parameters, and also provides a method for selecting between the additive and
multiplicative error models because it is based on the likelihood and not on one-step forecasts. Given initial values for the
parameters θ and following a heuristic scheme for the initial state X0, the authors obtained a robust automatic forecasting
algorithm (AFA)
• For each series, apply the appropriate models and optimise the parameters in each case.
• Select the best model according to the AIC.
• Produce forecasts using the best model for a given number of steps ahead.
• To obtain prediction intervals, use a bootstrap method by simulating 5000 future sample paths for {Yn+1, .., Yn+h}
and finding the α/2 and 1 − α/2 percentiles of the simulated data at each forecasting horizon. The sample paths
are generated by using the normal distribution for errors (parametric bootstrap) or by using the resampled errors
(ordinary bootstrap).
Application of the AFA to the M and M3 competition data showed that the methodology was very good at short term
forecasts (up to about 6 periods ahead).
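
The last step of the algorithm can be sketched for the additive-error (N-N) model. The code below simulates future sample paths by resampling fitted errors (ordinary bootstrap) and reads off empirical percentiles at each horizon. The function name, the argument names, the percentile indexing rule and the fixed random seed are assumptions made for illustration.

```python
import random


def simulate_intervals(level, alpha, residuals, h, coverage=0.95, n_paths=5000, seed=0):
    """Bootstrap prediction intervals for horizons 1..h under the additive (N-N) model."""
    rng = random.Random(seed)
    sims_by_horizon = [[] for _ in range(h)]
    for _ in range(n_paths):
        L = level
        for step in range(h):
            eps = rng.choice(residuals)     # ordinary bootstrap: resample fitted errors
            sims_by_horizon[step].append(L + eps)   # Y_t = l_{t-1} + eps_t
            L = L + alpha * eps                     # l_t = l_{t-1} + alpha * eps_t
    tail = (1 - coverage) / 2
    intervals = []
    for sims in sims_by_horizon:
        sims.sort()
        lower = sims[int(tail * (n_paths - 1))]
        upper = sims[int((1 - tail) * (n_paths - 1))]
        intervals.append((lower, upper))
    return intervals
```

Replacing rng.choice(residuals) with a draw such as rng.gauss(0.0, sigma) would give the parametric version; in both cases the intervals widen with the horizon because the simulated level accumulates past shocks.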

5.7.6 Random coefficient state space model

Gardner [2009] considered a damped linear trend model with additive errors with ES point equations given by
$$y_t = l_{t-1} + \phi b_{t-1} + \epsilon_t$$
$$l_t = l_{t-1} + \phi b_{t-1} + (1 - \alpha) \epsilon_t$$
$$b_t = \phi b_{t-1} + (1 - \beta) \epsilon_t \qquad (5.7.12)$$

where {yt} is the observed series, {lt} is its level, {bt} is the gradient of the linear trend, and $\{\epsilon_t\}$ is the single source
of error. Note, these notations are slightly different from the ones in Section (5.7.5) to simplify some of the results. In
the special case where φ = 1 we recover the linear trend model corresponding to the ARIMA(0, 2, 2) model
$$(1 - B)^2 y_t = \epsilon_t - (\alpha + \beta) \epsilon_{t-1} + \alpha \epsilon_{t-2}$$
where the gradient of the trend is a random walk. Otherwise for φ ≠ 1 we get the ARIMA(1, 1, 2) model
$$(1 - \phi B)(1 - B) y_t = \epsilon_t - (\alpha + \phi \beta) \epsilon_{t-1} + \phi \alpha \epsilon_{t-2}$$
where the gradient of the trend follows an AR(1) process, and as a result, changes in a stationary way. With φ close to
1 the linear trend is highly persistent, but as φ moves away from 1 towards zero the trend becomes weakly persistent,
and for φ = 0 there is absence of a linear trend. Consequently, we can interpret φ as a direct measure of the persistence
of the linear trend. In order to assume a locally constant model (Brown), Gardner postulated that we can consider the
linear trend model (φ = 1) with a revised gradient each time the local segment of the series changes in a sudden way.
He modelled the revised gradient as
$$b_t = A_t b_{t-1} + (1 - \beta) \epsilon_t$$
where {At} is a sequence of i.i.d. binary random variates with P(At = 1) = φ and P(At = 0) = (1 − φ). In the
case of a strongly persistent trend the sequence {At} will consist of long runs of 1s interrupted by occasional 0s, and
vice versa if not persistent. We get a mixture when φ is between 0 and 1, with mean length of such runs given by
$\phi / (1 - \phi)$. We obtain the random coefficient state space model by replacing φ with At in the above Equation (5.7.12)
and using (α∗, β∗) to distinguish between the two models. We get a stochastic mixture of two well known forms, the
ARIMA(0, 2, 2) with probability φ and the ARIMA(0, 1, 1) with probability (1 − φ). Gardner [2009] proved that
the forecasts in the standard damped trend model as well as the ones in the random coefficient state-space model are
optimal, with the same parameter value φ, but with different values of α and β. That is, the damped trend forecasts
are also optimal for such a more general and broader class of models. This reasoning can be applied to similar models
with linear trend component such as the additive seasonal model or linear trend models with multiplicative errors.
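
To make the random coefficient model concrete, the following Python sketch simulates a series in which φ is replaced by a Bernoulli variate A_t in each of the three equations of (5.7.12). Gaussian errors, zero initial states, the function name and the fixed seed are assumptions made for illustration.

```python
import random


def simulate_random_coefficient(n, alpha, beta, phi, sigma=1.0, seed=0):
    """Simulate y_t with the damping parameter replaced by A_t ~ Bernoulli(phi)."""
    rng = random.Random(seed)
    level, slope = 0.0, 0.0              # initial states (assumption)
    series, switches = [], []
    for _ in range(n):
        A = 1 if rng.random() < phi else 0          # P(A_t = 1) = phi
        eps = rng.gauss(0.0, sigma)
        y = level + A * slope + eps                 # observation equation
        level = level + A * slope + (1 - alpha) * eps
        slope = A * slope + (1 - beta) * eps        # gradient is reset when A_t = 0
        series.append(y)
        switches.append(A)
    return series, switches
```

With phi = 1 every A_t equals one and the simulation is the linear trend model; with phi = 0 the gradient is discarded at each step and the series follows a local level model.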


Chapter 6

Filtering and forecasting with wavelet analysis

6.1 Introducing wavelet analysis

6.1.1 From spectral analysis to wavelet analysis

6.1.1.1 Spectral analysis

We presented in Section (4.3) the basic principles of trend filtering in the time domain and argued that filtering in
the frequency domain was more appropriate. That is, another way of estimating the trend xt in Equation (4.3.10) is
to denoise the signal yt by using spectral analysis. Fourier analysis (see details in Appendix (F.1)) uses sums of sines
and cosines at different wavelengths to express almost any given periodic function, and therefore any function with a
compact support. We can use the Fourier transform, which is an alternative representation of the original signal yt ,
expressed by the frequency function
$$y(w) = \sum_{t=1}^{n} y_t e^{-iwt}$$

where y(w) = F(y) with w a frequency, and such that y = F −1 (y) with F −1 the inverse Fourier transform. Given
the sample {y0, .., yn−1} of a time series, and assuming that the mean has been removed before the analysis, it follows
from Parseval's theorem for the discrete Fourier transform that the sample variance of {yt} is
$$s^2 = \frac{1}{n} \sum_{t=0}^{n-1} y_t^2 = \frac{1}{n} \sum_{j=0}^{n-1} |y(w_j)|^2$$

where $y(w)$ is the discrete Fourier transform of {yt} and $w_j = 2\pi j / n$ for $j = 0, 1, .., [n/2]$ are the Fourier
frequencies, with [x] denoting the integer part of x.
Hence, the variance of the series can be decomposed into contributions given by a set of frequencies. The expression
$\frac{1}{n} |y(w_j)|^2$, as a function of $w_j$, is the periodogram of the series, which is an estimator of the true spectrum f(w)
of the process, providing an alternative way of looking at the series in the frequency domain rather than in the time
domain. If the spectrum of the series peaks at the frequency w0, it can be concluded that in its Fourier decomposition
the component with the frequency w0 accounts for a large part of the variance of the series. Hence, denoising in
spectral analysis consists in setting some coefficients y(w) to zero before reconstructing the signal. Selected parts of
the frequency spectrum can be manipulated by filtering tools, some can be attenuated and others may be completely
removed. Hence, a smoothed signal can be generated by applying a low-pass filter, that is, by removing the higher
frequencies. However, while the Fourier analysis remains an important mathematical tool in many fields of science,
the decomposition of a function into simple harmonics of the form $A e^{inw}$ has some drawbacks. One problem with
the Fourier transform is the poor time localisation for low frequency signals and the poor frequency localisation for high
frequency signals, making it difficult to localise when the trend (located in low frequencies) reverses. That is, the
Fourier representation of local events related to a series requires many terms of the form $A e^{inw}$. Hence, the
non-local characteristic of sine and cosine implies that we can only consider stationary signals along the time axis (see
Oppenheim et al. [2009]). Even though various methods for time-localising a Fourier transform have been proposed
to avoid this problem, such as windowed Fourier transform, the real improvement comes with the development of
wavelet theory.
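
The low-pass denoising described above can be sketched in a few lines of Python. The code uses a naive discrete Fourier transform over t = 0, .., n − 1 without normalisation, so the energy identity holds with an extra factor 1/n on the frequency side; this normalisation, the cutoff rule and the function names are assumptions made for illustration.

```python
import cmath


def dft(y):
    """Naive discrete Fourier transform (unnormalised), O(n^2)."""
    n = len(y)
    return [
        sum(y[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(coeffs):
    """Inverse transform; returns the real part of the reconstructed signal."""
    n = len(coeffs)
    return [
        sum(coeffs[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def low_pass(y, cutoff):
    """Set to zero every coefficient whose frequency index exceeds the cutoff."""
    n = len(y)
    coeffs = dft(y)
    # index k and n - k describe the same frequency, so both are kept or removed together
    kept = [c if min(k, n - k) <= cutoff else 0.0 for k, c in enumerate(coeffs)]
    return inverse_dft(kept)
```

Applied to a slow cosine contaminated by a fast oscillation, low_pass with a small cutoff returns the slow component only. Because the filter acts on the whole sample at once, it cannot indicate when a change occurs, which is the limitation that motivates wavelets.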
6.1.1.2 Wavelet analysis

Wavelets are the building blocks of wavelet transformations (WT) in the same way that the functions $e^{inx}$ are the
building blocks of the ordinary Fourier transformation. Haar [1910] constructed the first known wavelet basis by
showing that any continuous function f(y) on [0, 1] could be approximated by a series of step functions. Later,
Grossmann et al. [1984] introduced the wavelet transform in seismic data analysis as a solution for analysing time
series in terms of the time-frequency dimension. Wavelets are defined over a finite domain and localised both in time and
in scale, allowing the data to be decomposed into different frequency components for individual analysis. Wavelets
can be (or almost can be) supported on an arbitrarily small closed time interval, making them a very powerful tool
in dealing with phenomena rapidly changing in time. A wavelet basis is made of a father wavelet representing the
smooth baseline trend and a mother wavelet that is dilated and shifted to construct different level of detail. At high
scales, the wavelets have small time support, enabling them to zoom on details and short-lived phenomena. Their
abilities to switch between time and scale allow them to escape the Heisenberg’s curse stating that one can not analyse
both time and frequency with high accuracy. One can then separate signal trends and details using different levels of
resolution or different sizes/scales of detail. That is, the transform generates a phase space decomposition defined by
two parameters, the scale and location, as opposed to the Fourier decomposition. Several methods exist to compute
the wavelet coefficients such as the cascade algorithm of Mallat [1989] and the low-pass and high-pass filters of order
6 proposed by Daubechies [1992]. In digital signal processing, Mallat [1989] discovered the relationship between
quadrature mirror filters (QMF) and orthonormal wavelet bases leading to multiresolution analysis which builds on
an iterative filter algorithm (pyramid algorithm). It is the cornerstone of the fast wavelet transform (FWT). The last
important step in the evolution of wavelet theory is due to Daubechies [1988] who constructed consumer-ready
wavelets with a preassigned degree of smoothness. Shensa [1992] clarified the relationship between discrete and
continuous wavelet transforms, bringing together two separately motivated implementations of the wavelet transform,
namely the algorithme a trous for non-orthogonal wavelets (see Holschneider et al. [1989] and Dutilleux [1989]) and
the multiresolution approach of Mallat employing orthonormal wavelets.

6.1.2

The a trous wavelet decomposition

A naive approach to obtaining a detailed picture of the underlying process would be to apply to the data a bank of
filters with varying frequencies and widths. However, choosing the proper number and type of filters for this is a very
difficult task. Wavelet transforms (WT) provide sound mathematical principles for designing and spacing filters,
while retaining the original relationships in the time series. While the continuous wavelet transform (CWT) of a
continuous function produces a continuum of scales as output, the output of a discrete wavelet transform (DWT) can
take various forms. For instance, a triangular output results from decimation, that is, the retaining of one sample
out of every two, so that just enough information is kept to allow exact reconstruction of the input data (see details
in Appendix (F.3)). Even though this approach is ideal for data compression, it cannot simply relate information at
a given time point at the different scales. Further, it is not possible to have shift invariance. We can get around this
problem by means of a redundant, or non-decimated, wavelet transform, such as the a trous algorithm (see details in
Appendix (F.4)). Since translation invariant wavelet transforms can produce a good local representation of the signal
both in the time domain and frequency domain, they have been used to preprocess the data (see Aussem et al. [1998],
Gonghui et al. [1999]).
We assume that a function $f$ is known only through the time series $\{x_t\}$ consisting of discrete measurements
at fixed intervals. We define the signal $S_0(t)$ as the scalar product at samples $t$ of the function $f(x)$ with a scaling
function $\phi(x)$
$$S_0(t) = \langle f(x), \phi(x - t) \rangle$$
where the scaling function satisfies the dilation equation (F.3.17). There exist several ways of constructing a redundant
discrete wavelet transform. For instance, we can consider that the successive resolution levels are
• formed by convolving with an increasingly dilated wavelet function which looks like a Mexican hat (central
bump, symmetric, two negative side lobes).
• constructed by smoothing with an increasingly dilated scaling function looking like a Gaussian function defined
on a fixed support (a B3 spline).
• constructed by taking the difference between successive versions of the data which are smoothed in this way.
In the a trous wavelet transform (see Shensa [1992]), the input data is decomposed into a set of band-pass filtered
components, the wavelet details (or coefficients) plus a low-pass filtered version of the data called residual (or smooth).
The smoothed data $S_j(t)$, at a given resolution $j$ and position $t$, is the scalar product
$$S_j(t) = \frac{1}{2^j} \left\langle f(x), \phi\left(\frac{x - t}{2^j}\right) \right\rangle$$
which corresponds to Equation (F.3.13) with $m = j$ and $n = t 2^j$. It is equivalent to performing successive convolutions with the discrete lowpass filter $h$
$$S_{j+1}(t) = \sum_{l=-\infty}^{\infty} h(l) S_j(t + 2^j l)$$

where the finest scale is the original series $S_0(t) = x_t$. The distance between levels increases by a factor 2 from one
scale to the next. The name a trous (with holes) results from the increase in the distances between the sampled points
($2^j l$). Hence, we can think of the successive convolutions as a moving average of increasingly distant points, spaced $2^j l$ apart.
Here, smoothing with a B3 spline, the lowpass filter $h$ is defined as $(\frac{1}{16}, \frac{1}{4}, \frac{3}{8}, \frac{1}{4}, \frac{1}{16})$, which has compact support
and is point symmetric. From the sequence of smoothed representations of the signal, we take the difference between
successive smoothed versions, obtaining the wavelet details (or wavelet coefficients)
$$d_j(t) = S_{j-1}(t) - S_j(t)$$
which we can also, independently, express as
$$d_j(t) = \frac{1}{2^j} \left\langle f(x), \psi\left(\frac{x - t}{2^j}\right) \right\rangle$$
corresponding to the discrete wavelet transform for the resolution level $j$. The original data is then expanded (reconstructed) as
$$x(t) = S_p(t) + \sum_{j=1}^{p} d_j(t) \qquad (6.1.1)$$

for a fixed number of scales p. At each scale j, we obtain a set called wavelet scale, having the same number of
samples as the original signal.
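
The decomposition above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `a_trous` and `mirror_index` are hypothetical, and the mirror handling of the borders is an assumption made here so that the filter can be applied near the ends of the series.

```python
import numpy as np

# B3 spline low-pass filter from the text, taps at offsets l = -2, ..., 2.
H_B3 = np.array([1 / 16, 1 / 4, 3 / 8, 1 / 4, 1 / 16])


def mirror_index(idx, n):
    """Reflect out-of-range indices back into [0, n-1] (assumed border rule, n >= 2)."""
    period = 2 * (n - 1)
    idx = np.abs(idx) % period
    return np.where(idx > n - 1, period - idx, idx)


def a_trous(x, p):
    """Return (details, smooth): details[j-1] holds d_j(t), smooth holds S_p(t)."""
    s_prev = np.asarray(x, dtype=float)
    n = s_prev.size
    t = np.arange(n)
    details = []
    for j in range(p):
        s_next = np.zeros(n)
        # S_{j+1}(t) = sum_l h(l) S_j(t + 2^j l): the filter taps are 2^j apart.
        for l, h_l in zip(range(-2, 3), H_B3):
            s_next += h_l * s_prev[mirror_index(t + (2 ** j) * l, n)]
        details.append(s_prev - s_next)  # d_{j+1}(t) = S_j(t) - S_{j+1}(t)
        s_prev = s_next
    return np.array(details), s_prev


x = np.sin(np.linspace(0.0, 6.0, 64)) + 0.1 * np.arange(64)
details, smooth = a_trous(x, p=3)
reconstructed = smooth + details.sum(axis=0)  # Equation (6.1.1)
```

Because each detail is a difference of two successive smooths, the sum in Equation (6.1.1) telescopes and the reconstruction is exact whatever border rule is chosen.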

Figure 6.1: Decomposing the original data with the a trous wavelet decomposition algorithm.

6.2

Some applications

6.2.1

A brief review

Wavelets are ideal for frequency domain analysis in time-series econometrics as their capability to simultaneously capture long-term movements and high frequency details is very useful when dealing with non-stationary and complex
functions. They can also be used in connection with fractionally integrated processes having long-memory properties. It
was shown that when decomposing time series with long-term memory, the processes of wavelet coefficients at each
scale lack this feature (see Soltani et al. [2000]), enhancing forecasting. Further, decomposing a time series into
different scales may reveal details that can be interpreted on theoretical grounds and can be used to improve forecast
accuracy. In the former, economic actions and decision making take place at different scales, and in the latter, forecasting seems to improve at the scale level as models like autoregressive moving average (ARMA) or neural networks
can extract information from the different scales that are hidden in the aggregate.
Using wavelet transform (WT) we can decompose a time series into a linear combination of different frequencies
and then hopefully quantify the influence of patterns with certain frequencies at a certain time, thus, improving the
quality of forecasting. For instance, Conejo et al. [2005] decomposed the time series into a sum of processes
with different frequencies, and forecasted the individual time series before adding up the results. It is assumed that the
motions on different frequencies follow different underlying processes and that treating them separately could increase
the forecasting quality. As a result, several approaches have been proposed for time-series filtering and prediction by
the wavelet transform (WT), based on neural networks (see Aussem et al. [1998], Zheng et al. [1999]), Kalman
filtering (see Cristi et al. [2000]), AR and GARCH models (see Soltani et al. [2000], Renaud et al. [2002]).

There exist different wavelet-based forecasting methods, such as using WT to eliminate noise in the data or to
estimate the components in a structural time series model (STSM) (see Section (3.2.3.1)). Alternatively, we can
perform the forecasting directly on the wavelet generated time series decomposition, or we can use locally stationary
wavelet processes. We are going to discuss three wavelet-based forecasting methods
1. Wavelet denoising: it is based on the assumption that a data set $(X_t)_{t=1,..,T}$ can be written as
$$X_t = Y_t + \epsilon_t$$
where $Y_t$ is a deterministic function and $\epsilon_t \sim N(0, \sigma^2)$ is a white noise component. Reducing the noise via
thresholding yields a modified $X_t$ on which the standard forecasting methods can be applied (see Alrumaih
et al. [2002]). More recently, tree-based wavelet denoising methods were developed in the context of image
denoising, which exploit tree structures of wavelet coefficients and parent-child correlations.
2. Decomposition tool: we can also decompose the process $X_t$ into components (STSM), such as
$$X_t = T_t + I_t + \epsilon_t$$
where $T_t$ is a trend and $I_t$ is a seasonality component, and we can do the forecasting by extrapolating from
polynomial functions. For example, Wong et al. [2003] used the hidden periodicity analysis to estimate the
trend and fitted an ARIMA(1, 0) to forecast the noise.
3. Forecasting wavelet coefficients: we can decompose the time series $X_t$ with the wavelet coefficients $T(a, b)$
with $a \in A$, $b = 1, .., T$ where $A$ denotes a scale discretisation. For each $a$, the corresponding vector $T(a) =
(T(a, 1), .., T(a, T))$ is treated as a time series, and standard techniques like ARMA-based forecasting are
applied to obtain wavelet coefficient forecasts, which are subsequently added to the matrix $T(a, b)$, giving the extended matrix $T'$ (see Conejo
et al. [2005]). Note, Renaud et al. [2002] [2005] only used specific coefficients for this forecast which is more
efficient but increases the forecasting error. The extended matrix $T'$ is then inverted, yielding a forecast
$\hat{X}_{t+1}$ for the next value of the series.

6.2.2

Filtering with wavelets

When filtering out white noise in spectral analysis, hard thresholding consists in setting equal to zero all coefficients in
the frequency domain below a certain bandwidth. The filtered series are then transformed back into the time domain.
While the method of denoising is the same as for the Fourier analysis, the estimation of the trend $x_t$ can be done in
three steps
1. compute the wavelet transform $T$ of the original signal $y_t$ to obtain the wavelet coefficients $w = T(y)$
2. modify the wavelet coefficients according to the denoising rule $D$, that is,
$$w^* = D(w)$$
3. convert the modified wavelet coefficients into a new signal using the inverse wavelet transform
$$x = T^{-1}(w^*)$$
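
As an illustration of the three steps, the sketch below uses an orthonormal decimated Haar transform for $T$ and a hard threshold on the detail coefficients for $D$. Both choices, the function names, and the decision to leave the coarsest smooth coefficients untouched are assumptions made for this example; the signal length is assumed to be a multiple of $2^{levels}$.

```python
import numpy as np


def haar_dwt(y, levels):
    """Orthonormal decimated Haar transform T: returns [d_1, ..., d_levels, s_levels]."""
    s = np.asarray(y, dtype=float)
    coeffs = []
    for _ in range(levels):
        even, odd = s[0::2], s[1::2]
        coeffs.append((even - odd) / np.sqrt(2.0))  # detail coefficients
        s = (even + odd) / np.sqrt(2.0)             # smooth coefficients
    coeffs.append(s)
    return coeffs


def haar_idwt(coeffs):
    """Inverse transform T^{-1} of haar_dwt."""
    s = coeffs[-1]
    for d in reversed(coeffs[:-1]):
        out = np.empty(2 * s.size)
        out[0::2] = (s + d) / np.sqrt(2.0)
        out[1::2] = (s - d) / np.sqrt(2.0)
        s = out
    return s


def denoise(y, levels, threshold):
    w = haar_dwt(y, levels)  # step 1: w = T(y)
    # step 2: w* = D(w), here a hard threshold applied to the details only
    w_star = [np.where(np.abs(d) > threshold, d, 0.0) for d in w[:-1]] + [w[-1]]
    return haar_idwt(w_star)  # step 3: x = T^{-1}(w*)


trend = np.repeat([1.0, 5.0], 8)                 # piecewise-constant trend
y = trend + 0.1 * (-1.0) ** np.arange(16)        # small alternating disturbance
x_hat = denoise(y, levels=3, threshold=0.5)
```

In this toy case the disturbance lives entirely in the finest detail level, so thresholding removes it and the jump in the trend is preserved.
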
Hence, to perform this method we first need to specify the mother wavelet, and then we must define the denoising rule.
Wavelet shrinkage in statistics was introduced and explored in a series of papers by Donoho et al. [1995] [1995b]. It
consists in shrinking the wavelet image of the original data set and returning the shrunk version to the data domain by
the inverse wavelet transformation. This results in the original data being denoised or compressed. More specifically,
given the thresholds w− and w+ two scalars with 0 < w− < w+ acting as tuning parameters of the wavelet shrinkage,
Donoho et al. [1995] defined several shrinkage methods

• Hard shrinkage: consists in setting to 0 all wavelet coefficients having an absolute value lower than a threshold $w^+$
$$w_i^* = w_i I_{\{|w_i| > w^+\}}$$
• Soft shrinkage: consists in replacing each wavelet coefficient by the value $w^*$ where
$$w_i^* = \mathrm{sign}(w_i)(|w_i| - w^+)_+$$
where $(x)_+ = \max(x, 0)$.
• Semi-soft shrinkage
$$w_i^* = \begin{cases} 0 & \text{if } |w_i| \le w^- \\ \mathrm{sign}(w_i)(w^+ - w^-)^{-1} w^+ (|w_i| - w^-) & \text{if } w^- < |w_i| \le w^+ \\ w_i & \text{if } |w_i| > w^+ \end{cases}$$
• Quantile shrinkage is a hard shrinkage method where $w^+$ is the $q$th quantile of the coefficients $|w_i|$
Several thresholds as well as several thresholding policies have been proposed (see Vidakovic [1999]). The main
advantage of wavelet shrinkage is that denoising is carried out without smoothing out sharp structures such as spikes
and cusps.
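
The four shrinkage rules can be written directly as vectorised functions. The sketch below is a minimal illustration; the function names are hypothetical, and the quantile rule relies on the default linear interpolation of `numpy.quantile`, which is an assumption about how the quantile is computed.

```python
import numpy as np


def hard_shrink(w, w_plus):
    """Keep coefficients with |w_i| > w_plus, set the others to zero."""
    w = np.asarray(w, dtype=float)
    return np.where(np.abs(w) > w_plus, w, 0.0)


def soft_shrink(w, w_plus):
    """sign(w_i) * max(|w_i| - w_plus, 0)."""
    w = np.asarray(w, dtype=float)
    return np.sign(w) * np.maximum(np.abs(w) - w_plus, 0.0)


def semi_soft_shrink(w, w_minus, w_plus):
    """Zero below w_minus, identity above w_plus, linear ramp in between."""
    w = np.asarray(w, dtype=float)
    a = np.abs(w)
    ramp = np.sign(w) * w_plus * (a - w_minus) / (w_plus - w_minus)
    return np.where(a <= w_minus, 0.0, np.where(a <= w_plus, ramp, w))


def quantile_shrink(w, q):
    """Hard shrinkage with the threshold set to the q-th quantile of |w_i|."""
    w = np.asarray(w, dtype=float)
    return hard_shrink(w, np.quantile(np.abs(w), q))


w = np.array([-3.0, -1.5, -0.5, 0.2, 1.0, 2.5])
hard = hard_shrink(w, 1.0)
soft = soft_shrink(w, 1.0)
semi = semi_soft_shrink(w, 0.5, 2.0)
quant = quantile_shrink(w, 0.5)
```

Note that the semi-soft rule is continuous at $w^+$, where the ramp reaches the value $w^+$ itself, whereas the hard rule jumps there.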

6.2.3

Non-stationarity

The hierarchical construction of wavelets means that non-stationary components of time series are absorbed by the
lower scales, while non-lasting disturbances are captured by the higher scales, leading to the whitening property (see
Vidakovic [1999]). For example, when generating sample data from an ARIMA(2, 1, 1) process, the autocorrelation
function shows almost no decay over 20 periods, since the process contains a unit root. However, the autocorrelation
of the differenced ARMA(2, 1) series shows a drop after two periods. Computing the autocorrelation functions of the six
highest layers of the original series decomposed with the wavelet transform, the ninth scale resembles white noise, while
scales six and four show autocorrelation at all included lags.

6.2.4

Decomposition tool for seasonality extraction

One of the fundamental advantages of wavelet analysis is its capacity to decompose time series into different components. Forecasting is one of the reasons for decomposing a series, since it can be easier to forecast the components
of the series than the whole series itself. Arino et al. [1995] used the scalogram to decompose the energy into level
components in order to detect and separate periodic components in time series. To illustrate their approach, they considered two perfect periodic functions with different frequencies added up together. To filter each component of the
combined signal, they used the scalogram and looked at how much energy was contained in each scale of the wavelet
transform. Since two peaks were observed, they split the wavelet decomposition $d$ of the time series $\{y_t\}$ into two
new wavelet decompositions, $d^{(1)}$ and $d^{(2)}$, such that the coefficients $d_{j,k}$ of $d$ which are in a level $j$ close to the first
peak are assigned to $d^{(1)}$ ($d^{(1)}_{j,k} = d_{j,k}$). The corresponding coefficients in $d^{(2)}$ are set to zero ($d^{(2)}_{j,k} = 0$). The same is
done for the coefficients close to the second peak. When a level occurs between the two peaks, the coefficients of that
level are split in two according to one of two methods. In the first one, the split is additive with respect to energies but
not with respect to wavelet coefficients
$$(d_{j,k})^2 = (d^{(1)}_{j,k})^2 + (d^{(2)}_{j,k})^2$$
It is not additive in the scale domain. In the second one, the split is additive with respect to wavelet coefficients but
does not preserve energies
$$d_{j,k} = d^{(1)}_{j,k} + d^{(2)}_{j,k}$$

Following the same approach, Schleicher [2002] illustrated the separation of frequency levels on two sine functions
with different frequencies added up. It was done by adding up the squared coefficients in each scale to get a scalogram.
Doing so, he observed two spikes, one at the third level and one at the sixth level. Since slow-moving, low frequency
components are represented with larger support, he conjectured that level three represented the wavelet transform for
function one, and that level six represented that for function two. To filter out function one, he kept only the first four
levels and padded the rest of the wavelet transform with zeros, and then took the inverse transform. Conversely, for the
second function, levels five to nine were kept and the first four levels were padded with zeros. As many economic data are likely
generated as aggregates of different scales, separating these scales and analysing them individually provides interesting
insights and can improve the forecasting accuracy of the aggregate series. Renaud et al. [2003] divided the original
time series into multiresolution crystals and forecasted these crystals separately. The forecasts are then combined to achieve
an aggregate forecast for the original time series. Genacy et al. [2001a] investigated the scaling properties of foreign
exchange rates using wavelet methods. They decomposed the variance of the process and found that FX volatilities
can be described by different scaling laws on different horizons. Using the maximal overlap discrete wavelet transform
(MODWT), Genacy et al. [2001b] constructed a method for seasonality extraction from a time series, which is free
of model selection parameters, translationally invariant, and associated with a zero-phase filter. Following the same
ideas, Genacy et al. [2003] [2005] proposed a new approach for estimating the systematic risk of an asset and found
that the estimation of CAPM could be flawed due to the multiscale nature of risk and return.
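
The scalogram-based separation of two periodic components can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of the cited authors: it uses an orthonormal decimated Haar transform, numbers the levels from finest (level 1) to coarsest, and the cut `split = 4` as well as the two test periods are hypothetical choices.

```python
import numpy as np


def haar_dwt(y, levels):
    """Orthonormal decimated Haar transform: [d_1, ..., d_levels, s_levels], level 1 finest."""
    s = np.asarray(y, dtype=float)
    coeffs = []
    for _ in range(levels):
        even, odd = s[0::2], s[1::2]
        coeffs.append((even - odd) / np.sqrt(2.0))
        s = (even + odd) / np.sqrt(2.0)
    coeffs.append(s)
    return coeffs


def haar_idwt(coeffs):
    """Inverse of haar_dwt."""
    s = coeffs[-1]
    for d in reversed(coeffs[:-1]):
        out = np.empty(2 * s.size)
        out[0::2] = (s + d) / np.sqrt(2.0)
        out[1::2] = (s - d) / np.sqrt(2.0)
        s = out
    return s


n = 512
t = np.arange(n)
y = np.sin(2 * np.pi * t / 8) + np.sin(2 * np.pi * t / 128)  # fast plus slow sine

coeffs = haar_dwt(y, levels=9)
# Scalogram: sum of squared detail coefficients within each level.
scalogram = np.array([np.sum(d ** 2) for d in coeffs[:-1]])

split = 4  # hypothetical cut between the two energy peaks
zeros = [np.zeros_like(c) for c in coeffs]
fast = haar_idwt(coeffs[:split] + zeros[split:])  # keep the fine levels, zero the rest
slow = haar_idwt(zeros[:split] + coeffs[split:])  # keep the coarse levels and the smooth
```

Since the transform is linear and orthonormal, the two filtered series add back to the original signal and the level energies add up to the energy of the signal.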

6.2.5

Interdependence between variables

Wavelets have been widely used to study interdependence of economic and financial time series. Genacy et al. [2001a]
analysed the dependencies between foreign exchange markets and found an increase of correlation from the intra-day scale
towards the daily timescale, stabilising for longer time scales. Fernandez [2005] studied the return spillovers in major
stock markets on different time scales and concluded that G7 countries significantly affect global markets but that
the reverse reaction is much weaker. Kim et al. [2005] [2006] have conducted many studies in finance using the
wavelet variance, wavelet correlation and cross-correlation and found a positive relationship between stock returns
and inflation on a scale of one month and 128 months, and a negative relationship between these scales. Studying the
relationship between stock and futures markets with the MODWT based estimator of wavelet cross-correlation, In et
al. [2006] found a feedback relationship between them on every scale, and correlation increasing with increasing time
scale.

6.2.6

Introducing long memory processes

We have seen earlier that it was important to differentiate between stationary I(0) and non-stationary I(1) processes.
However, there exists another type of process, the fractionally integrated I(d) processes, lying between the two sharp-edged alternatives of I(0) and I(1). Long-memory processes, corresponding to $d \in [0, 0.5]$, are processes with finite
variance but autocovariance function decaying at a much slower rate than that of a stationary ARMA process. When
$d \in [0.5, 1]$, the variance becomes infinite, but the processes still return to their long-run equilibrium. A fractionally
integrated process, I(d), can be defined as
$$(1 - L)^d y(t) = \epsilon(t)$$
where $L$ is the lag operator and $\epsilon(t)$ is white noise or follows an ARMA process. Since long-memory processes have a very dense covariance
matrix, direct maximum likelihood estimation is not feasible for large data sets, and one generally uses a nonparametric
approach, which regresses the log values of the periodogram on the log Fourier frequencies to estimate $d$ (see Geweke
et al. [1983] (GPH)). Alternatively, McCoy et al. [1996] found a log-linear relationship between the variance of
the wavelet coefficients and its scale, and developed a maximum likelihood estimator. Jensen [1999] developed an
ordinary least square (OLS) estimator based on the observation that for a mean zero I(d) process, $|d| < 0.5$, the
wavelet coefficients, $w_{jk}$ (for scale $j$ and translation $k$), are asymptotically normally distributed with mean zero and
variance $\sigma^2 2^{-2jd}$ as $j$ goes to zero. That is, the wavelet transform of this kind of process has a sparse covariance
matrix which can be approximated at high precision with a diagonal matrix, such that the calculation of the likelihood
function is of an order smaller than calculations with the exact MLE methods. Taking logs, we can estimate $d$ using
the linear relationship
$$\ln R(j) = \ln \sigma^2 - d \ln 2^{2j}$$
where $R(j)$ is the sample estimate of the covariance in each scale. The wavelet estimators have a higher small-sample
bias than the GPH estimator, but they have a mean-squared error about six times lower.
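
The log-linear relationship lends itself to a simple least-squares fit. The sketch below only illustrates the regression step: the function name `estimate_d` is hypothetical, and the inputs are synthetic variances generated exactly from the relationship rather than sample estimates obtained from a wavelet transform.

```python
import numpy as np


def estimate_d(scales, variances):
    """OLS fit of ln R(j) = ln sigma^2 - d ln 2^{2j}; returns (d, sigma^2)."""
    x = np.log(2.0 ** (2.0 * np.asarray(scales, dtype=float)))
    y = np.log(np.asarray(variances, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)  # coefficients, highest degree first
    return -slope, np.exp(intercept)


# Synthetic check: variances built exactly from sigma^2 * 2^{-2jd}.
scales = np.arange(1, 7)
true_d, true_sigma2 = 0.3, 2.0
variances = true_sigma2 * 2.0 ** (-2.0 * scales * true_d)
d_hat, sigma2_hat = estimate_d(scales, variances)
```

With real data, `variances` would be replaced by the per-scale sample estimates $R(j)$, and the fit would of course no longer be exact.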

6.3

Presenting wavelet-based forecasting methods

6.3.1

Forecasting with the a trous wavelet transform

Knowing the individual time series resulting from the decomposition in Equation (6.1.1), several approaches exist to
estimate xt+k , where k is a look-ahead period, from the observations xt , xt−1 , .., x1 . For instance,
• if the residual vector Sp is sufficiently smooth, we can use a linear approximation of the data, or a carbon copy
(xt → xt+k ) of it.
• we can make independent predictions for the resolution scales $d_i$ and $S_p$ and use the additive property of the
reconstruction equation to fuse predictions in an additive manner.
• we can also test a number of short-memory and long-memory predictions at each resolution level, and retain the
method performing best.
Note, the symmetric property of the filter function does not support the fact that time is a fundamentally asymmetric
variable. In prediction studies, very careful attention must be given to the boundaries of the signal. Assuming a time
series of size N, values at times N, N − 1, N − 2, ... are of great importance. When handling the boundary (or edge), any
symmetric wavelet function is problematic, as we cannot use wavelet coefficients estimated from unknown future data
values. One way around this is to hypothesise future data based on values in the nearest past. Further, for both symmetric
and asymmetric functions, we have to use some variant of the transform to deal with the problem of edges. We can
use the mirror or periodic border handling, or even the transformation of the border wavelets and scaling functions
(see Cohen et al. [1992]). For instance, Aussem et al. [1998] chose the boundary condition
S(N + k) = S(N − k)

(6.3.2)

and described a novel approach for time-varying data. Two types of feature were considered
1. Decomposition-based approach: wavelet coefficients at a particular time point were taken as a feature vector.
2. Scale-based approach: modelling and prediction were run independently at each resolution level, and the results
were combined.
They performed the feature selection with feature vector $x_t = \{d_1(t), d_2(t), .., d_p(t), S_p(t)\}$. Since they used
a wrap-around approach to defining the WT at the boundary region of the data, they considered data up to point
$t = t_0$ and used $x_{t_0}$ as a feature vector. The succession of feature vectors, $x_{t_0}$, corresponding to successive values
of $t_0$, is not the same as a single WT of the input data. While these special coefficients better represent the true
wavelet coefficients, they do not sum to zero at each level. However, they do retain the additive decomposition
property of the reconstruction equation, and they do not use unknown future data. The scale-based approach used
the dynamic recurrent neural network (DRNN) which is endowed with internal memory using additional information
on the past time series. They found that the wavelet coefficients at higher frequency levels (lower scales) provided
some benefit for estimating variation at lower frequency levels. For instance, to model and predict at scale 2,
the target value is $d_2(t-15), d_2(t-14), .., d_2(t)$ combined with the input vector $d_1(t-15), d_1(t-14), .., d_1(t)$.
The use of $d_1$ for prediction of $d_2$ is of benefit, since the more noisy and irregular the data, the more demanding
the prediction task, and the more useful the neural network. On the final smooth trend curve $S_p(t)$, they found that
the linear extrapolation $\hat{S}_p(t+5) = S_p(t) + \alpha(S_p(t) - S_p(t-1))$ with $\alpha = 5$ performed better than the NN
solution. They considered three performance criteria to test the forecasting method, the normalised mean squared
error (NMSE), the directional symmetry (DS), and the direction variation symmetry (DVS) (see details in Section
(3.2.2.3)). Recombining all wavelet estimates, they obtained 0.72 for NMSE, 0.73% for the DS, and 0.6% for the
DVS.
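
The linear extrapolation of the smooth trend is simple enough to state in code. In this sketch the function name is hypothetical, and setting $\alpha$ equal to the forecast horizon is an assumption suggested by the five-step example above.

```python
import numpy as np


def extrapolate_smooth(smooth, horizon=5):
    """S_hat(t + horizon) = S(t) + alpha * (S(t) - S(t-1)), with alpha = horizon (assumed)."""
    smooth = np.asarray(smooth, dtype=float)
    alpha = float(horizon)
    return smooth[-1] + alpha * (smooth[-1] - smooth[-2])


trend = 2.0 + 0.5 * np.arange(20)           # a perfectly linear smooth component
forecast = extrapolate_smooth(trend, horizon=5)
```

On a perfectly linear trend the extrapolation reproduces the true future value, which is the situation the rule is designed for.
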
In conclusion, when forecasting, the features revealed by the individual wavelet coefficient series are meaningful
when considered individually. However, when using wavelet coefficients at a given level to forecast the coefficient
at the next level, we introduce positive correlation between the prediction residuals, and a single forecast error at a
particular level can propagate, impacting the other predictors. Consequently, individual forecasts should be provided
by different forecasting models to avoid output discrepancy correlation resulting from model misspecification. Further,
the way to deal with the edge is problematic in prediction applications as it adds artifacts in the most important part
of the signal, namely, its right border values. It is interesting to note that years later, Murtagh et al. [2003] did not
recommend the use of this algorithm.

6.3.2

The redundant Haar wavelet transform for time-varying data

Smoothing with a B3 spline to construct the a trous wavelet transform, as described above, is not appropriate for a
directed (time-varying) data stream since future data values cannot be used in the calculation of the wavelet transform.
We could use the Haar wavelet transform due to the asymmetry of the wavelet function, but it is a decimated one.
Alternatively, Zheng et al. [1999] developed a non-decimated, or redundant, version of this transform corresponding
to the a trous algorithm discussed above, but with a different pair of scaling and wavelet functions. The non-decimated
Haar algorithm uses the simple filter $h = (\frac{1}{2}, \frac{1}{2})$, which is non-symmetric, with $l = -1, 0$ in the a trous algorithm. We
can then derive the $(j+1)$ wavelet resolution level from the $(j)$ level by convolving the latter with $h$, getting
$$S_{j+1}(t) = \frac{1}{2}\left( S_j(t - 2^j) + S_j(t) \right)$$
and
$$d_{j+1}(t) = S_j(t) - S_{j+1}(t)$$
Hence, at any point, $t$, we never use information after $t$ when computing the wavelet coefficients (see Percival et al.
[2000]), obtaining a computationally straightforward solution to the problem of boundary conditions at time point $t$.
Since at a given time $t$ and scale $(j+1)$ we need two values from the previous scale $(j)$, namely $S_j(t)$ and $S_j(t - 2^j)$,
the window length must be equal to $2^j$ for scale $(j)$. Further, the smooth data $S_j(t)$ can be written as a moving average
of the original signal as follows
$$S_j(t) = \frac{1}{2^j} \sum_{l=0}^{2^j - 1} S_0(t - l)$$

This method has the following advantages
• The computational requirement is O(N ) per scale, and in practice the number of scales is set as a constant.
• Since we do not shift the signal, the wavelet coefficients at any scale j of the signal (X1 , .., Xt ) are strictly equal
to the first t wavelet coefficients at scale j of the signal (X1 , .., XN ) for N > t.

As a result, we get linearity in terms of the mapping of inputs defined by wavelet coefficients vis a vis the output target
value. Note the following properties of the multiresolution transform
• all wavelet scales are of zero mean
• the smooth trend is generally much larger-valued than the max-min ranges of the wavelet coefficients
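
A minimal sketch of the redundant Haar transform follows. The function name is hypothetical, and the start-up rule (indices before the first observation are clamped to the first observation) is an assumption, since the text does not specify how the first $2^j$ points are treated.

```python
import numpy as np


def redundant_haar(x, levels):
    """Causal non-decimated Haar transform; returns (details, smooth)."""
    s_prev = np.asarray(x, dtype=float)
    t = np.arange(s_prev.size)
    details = []
    for j in range(levels):
        # Assumed start-up rule: indices before the first sample are clamped to 0.
        lagged = s_prev[np.maximum(t - 2 ** j, 0)]
        s_next = 0.5 * (lagged + s_prev)  # S_{j+1}(t) = (S_j(t - 2^j) + S_j(t)) / 2
        details.append(s_prev - s_next)   # d_{j+1}(t) = S_j(t) - S_{j+1}(t)
        s_prev = s_next
    return np.array(details), s_prev


x = np.cumsum(np.cos(0.3 * np.arange(40)))
details, smooth = redundant_haar(x, levels=3)
details_prefix, smooth_prefix = redundant_haar(x[:25], levels=3)
moving_average = np.convolve(x, np.ones(8) / 8, mode="valid")  # window 2^3
```

The prefix computation illustrates the second advantage listed above: the coefficients obtained from the first 25 observations coincide with the first 25 coefficients obtained from the full series, so nothing has to be recomputed when new data arrive.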

6.3.3

The multiresolution autoregressive model

In order to capture short-range and long-range dependencies of time series, Renaud et al. [2002] proposed a multiresolution autoregressive (MAR) model to forecast time-series. They considered the non-decimated Haar a trous wavelet
transform described in Section (6.1.2), and used a linear prediction based on some coefficients of the decomposition
of the past values. We saw in the previous section that the window length in the Haar WT must be equal to $2^j$ for
scale $(j)$, introducing redundancy. Hence, when selecting the number of coefficients at different scales, we should
exclude these redundant points. After some investigation, Renaud et al. found that the wavelet and scaling function
coefficients that should be used for the prediction at time $N + 1$ should have the form
$$d_{j,N-2^j(k-1)} \quad \text{and} \quad S_{J,N-2^J(k-1)}$$
for positive values of $k$. For each $N$, this subgroup of coefficients is part of an orthogonal transform. We now want
to allow for adaptivity in the numbers of wavelet coefficients selected from different resolution scales and used in the
prediction.
6.3.3.1

Linear model

Stationary process. We let the window size at scale $(j)$ be denoted $A_j$. Assuming a stationary signal $X = (X_1, .., X_N)$, the one-step forward prediction of an AR(p) process is $\hat{X}_{N+1} = \sum_{k=1}^{p} \hat{\phi}_k X_{N-(k-1)}$ (see details in
Section (5.3.1)). In order to use the wavelet decomposition in Equation (6.1.1), Renaud et al. modified the prediction
to the AR multiscale prediction as
$$\hat{X}_{N+1} = \sum_{j=1}^{J} \sum_{k=1}^{A_j} \hat{a}_{j,k} d_{j,N-2^j(k-1)} + \sum_{k=1}^{A_{J+1}} \hat{a}_{J+1,k} S_{J,N-2^J(k-1)}$$
where $D = d_1, .., d_J, S_J$ represents the Haar a trous wavelet transform of $X$. Note, we have one AR(p) process
per scale $j = 1, .., J$ with $\{\hat{a}_{j,k}\}_{1 \le j \le J, 1 \le k \le A_j}$ parameters for the details and $\{\hat{a}_{J+1,k}\}_{1 \le k \le A_{J+1}}$ parameters for the
residual. Put another way, if on each scale the lagged coefficients follow an AR($A_j$) process, the addition of the
predictions on each level would lead to the same prediction formula as the one above. That is, the MAR prediction
model is linear. In the special case where $A_j = 1$ for all resolution levels $j$, the prediction simplifies to
$$\hat{X}_{N+1} = \sum_{j=1}^{J} \hat{a}_j d_{j,N} + \hat{a}_{J+1} S_{J,N}$$

In this model, we need to estimate the $Q = \sum_{j=1}^{J+1} A_j$ unknown parameters which we grouped in the vector $\alpha$, so that
we can solve the equation $A' A \alpha = A' S$ where
$$A' = (L_{N-1}, .., L_{N-M})$$
$$L_t' = (d_{1,t}, .., d_{1,t-2A_1}, .., d_{2,t}, .., d_{2,t-2^2 A_2}, ..., d_{J,t}, .., d_{J,t-2^J A_J}, S_{J,t}, ..., S_{J,t-2^J A_{J+1}})$$
$$\alpha' = (a_{1,1}, .., a_{1,A_1}, a_{2,1}, .., a_{2,A_2}, .., a_{J,1}, .., a_{J,A_J}, .., a_{J+1,1}, .., a_{J+1,A_{J+1}})$$
$$S' = (X_N, .., X_{t+1}, .., X_{N-M+1})$$
where $A$ is an $M \times Q$ matrix ($M$ rows $L_t'$, each with $Q$ elements), $\alpha$ and $S$ are respectively $Q$- and $M$-size vectors, and
$M$ is larger than $Q$.
Non-stationary process. When a trend is present in the time-series, we use the fact that the multiscale decomposition
automatically separates the trend from the signal. As a result, we can predict both the trend and the stochastic part
within the multiscale decomposition. In general, the trend affects the low frequency components, while the high
frequencies are purely stochastic. Hence, we separate the signal $X$ into low ($L$) and high ($H$) frequencies, getting
$$L = S_J \quad \text{and} \quad H = X - L = \sum_{j=1}^{J} d_j$$
$$X_{N+1} = L_{N+1} + H_{N+1}$$
In that setting, the signal $H$ has zero-mean, and the MAR model gives
$$\hat{H}_{N+1} = \sum_{j=1}^{J} \sum_{k=1}^{A_j} a_{j,k} d_{j,N-2^j(k-1)}$$
and the estimation of the $Q$ unknown parameters is as before, except that the smooth coefficients $S_J$ are not used in the rows $L_t$ and that the target vector $S$
is based on $H_{t+1}$. Since $L$ is very smooth, a polynomial fitting can be used for prediction, or one can use an AR process
as for $H$. Since the frequencies are non-overlapping in each scale, one can select the parameters $A_j$ independently on
each scale with AIC, AICC, or BIC methods. Hence, the general method consists in fitting an AR model to each scale
of the multiresolution transform. In the case where the true process is AR, this forecasting procedure will converge to
the optimal procedure, since it is asymptotically equivalent to the best forecast.
6.3.3.2

Non-linear model

Note, Murtagh et al. [2003] generalised the MAR formula to the nonlinear case leading to a learning algorithm.
Assuming a standard multilayer perceptron with a linear transfer function at the output node, and L hidden layer
neurons, they obtained the following
$$\hat{X}_{N+1} = \sum_{l=1}^{L} \hat{a}_l \, g\left( \sum_{j=1}^{J} \sum_{k=1}^{A_j} \hat{a}_{j,k} d_{j,N-2^j(k-1)} + \sum_{k=1}^{A_{J+1}} \hat{a}_{J+1,k} S_{J,N-2^J(k-1)} \right)$$
where the sigmoidal function $g(\bullet)$ is used from the feedforward multilayer perceptron (see details in Section (13.3.1)).

6.3.4

The neuro-wavelet hybrid model

Zhang et al. [2001] developed a neuro-wavelet hybrid system incorporating multiscale wavelet analysis into a set
of neural networks for a multistage time series prediction. Their approach consists of a three-stage prediction
scheme. Considering a shift invariant WT, they used the autocorrelation shell representation (ASR) introduced by
Beylkin et al. [1992] which we described in Appendix (F.4.3). Then, they performed the prediction of each scale of
the wavelet coefficients with the help of a separate feedforward neural network. Finally, the prediction results for the
wavelet coefficients can either be directly combined from the linear additive reconstruction property of ASR, or
be combined by another neural network (NN), the main goal of the latter being to adaptively choose the weight of
each scale in the final prediction. For the prediction of different scale wavelet coefficients, they applied the Bayesian
method of automatic relevance determination (ARD) to learn the different significance of a specific length of past
window and wavelet scale.

The additive form of reconstruction of the ASR allows one to combine the predictions in a simple additive manner.
In order to deal with the boundary condition of using the most recent data to make predictions, Zhang et al. followed
the approach proposed by Aussem et al. [1998] described in Section (6.3.1), and used a time-based a trous filters
algorithm on the signal x1 , x2 , .., xN where N is the present time-point. The steps are as follows:
1. For index k sufficiently large, carry out the a trous transform in Equation (F.4.32) on the signal using a mirror
extension of the signal when the filter extends beyond k (see Equation (6.3.2)).
2. Retain the coefficient values (details) as well as the residual values for the kth time point only, that is,
Dk1 , Dk2 , .., Dkp , Skp . The summation of these values gives xk .
3. If k is less than N , set k to k + 1 and return to step 1).
This process produces an additive decomposition of the signal xk , xk+1 , ..., xN , which is similar to the a trous wavelet
transform decomposition on x1 , x2 , .., xN . However, as discussed in Section (6.3.1), the boundary condition in Equation (6.3.2) is not appropriate when forecasting financial data, as we can not use future data in the calculation of the
wavelet transform.
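The three steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: Equation (F.4.32) is not reproduced in this section, so the sketch assumes the B3-spline low-pass filter (1/16, 1/4, 3/8, 1/4, 1/16), and the names `atrous` and `time_based_atrous` are hypothetical. Mirror extension is delegated to NumPy's reflect padding, applied at both ends of the signal, and indices are zero-based.

```python
import numpy as np

# Assumed low-pass filter (B3 spline); Equation (F.4.32) may use a different one.
H = np.array([1 / 16, 1 / 4, 3 / 8, 1 / 4, 1 / 16])


def atrous(x, levels, h=H):
    """A trous decomposition of x with mirror (reflect) extension at both ends.

    Returns (details, smooth), where details has shape (levels, len(x)) and
    x equals details.sum(axis=0) + smooth up to rounding error.
    """
    c = np.asarray(x, dtype=float)
    n, half = len(c), len(h) // 2
    details = []
    for j in range(levels):
        step = 2 ** j                       # spacing between filter taps at scale j
        pad = half * step
        ext = np.pad(c, pad, mode="reflect")
        smooth = np.zeros(n)
        for i, w in enumerate(h):
            start = pad + (i - half) * step
            smooth += w * ext[start:start + n]
        details.append(c - smooth)          # detail = difference of successive smooths
        c = smooth
    return np.array(details), c


def time_based_atrous(x, levels, k0):
    """For each k >= k0, transform x[0..k] only and keep the values at time k."""
    rows = []
    for k in range(k0, len(x)):
        d, s = atrous(x[:k + 1], levels)
        rows.append(np.append(d[:, -1], s[-1]))   # details and residual at time k
    return np.array(rows)


rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=200))         # synthetic random-walk series
coeffs = time_based_atrous(x, levels=3, k0=32)
print(np.allclose(coeffs.sum(axis=1), x[32:]))   # each row sums back to x_k
```

Because each window ends at k, no observation after k enters the coefficients retained for time k; the mirror extension only reuses past values in place of the missing future ones.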
In general, choosing the appropriate time-window size for the inputs of a regression problem is difficult, as
there are many possible input variables, some of which may be irrelevant. This problem also applies to time series
forecasting with neural networks (NN). Zhang et al. used the ARD method for choosing the length of past windows
to train the NN.

6.4

Some wavelets applications to finance

6.4.1

Deriving strategies from wavelet analysis

Given the properties of time series in the frequency and scale domain, we can apply Fourier and wavelet analysis
to perform pattern recognition and denoising of time series. As explained above, we want to statistically study
wavelet coefficients in order to compare them. We choose to concentrate the information (energy) of the market signal
in a small number of coefficients at a certain scale. We therefore need to use a mother wavelet minimising the number
of levels and coefficients containing significant information.
We can apply wavelet analysis to forecast time series. By decomposing time series into different scales and adding
up the squared coefficients within each level, we can measure the energy content of each scale (the analogue of the power spectral density
in Fourier analysis). Using the properties of the multiscale analysis, we then decompose the time series into two
series, each of which we forecast using an ARIMA model. The forecast of the original series is obtained by aggregating
the forecasts of the individual series.
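As an illustration of the energy measure, the sketch below decomposes a series with an orthonormal Haar transform and sums the squared coefficients within each level. The Haar wavelet, the function names and the synthetic input are assumptions chosen for brevity; the text does not prescribe a particular mother wavelet.

```python
import numpy as np


def haar_dwt(x, levels):
    """Orthonormal Haar DWT; len(x) must be divisible by 2**levels."""
    a = np.asarray(x, dtype=float)
    if len(a) % (2 ** levels) != 0:
        raise ValueError("length must be divisible by 2**levels")
    details = []
    for _ in range(levels):
        even, odd = a[0::2], a[1::2]
        details.append((even - odd) / np.sqrt(2))   # detail coefficients
        a = (even + odd) / np.sqrt(2)               # approximation coefficients
    return a, details


def energy_by_scale(x, levels):
    """Energy per detail level (finest first) and energy of the final approximation."""
    approx, details = haar_dwt(x, levels)
    return [float(np.sum(d ** 2)) for d in details], float(np.sum(approx ** 2))


rng = np.random.default_rng(0)
returns = rng.normal(scale=0.01, size=256)          # synthetic return series
detail_energy, approx_energy = energy_by_scale(returns, levels=4)
total = sum(detail_energy) + approx_energy
print([e / total for e in detail_energy])           # energy share of each detail level
```

With an orthonormal transform the energies add up to the energy of the original series, so the shares can be compared directly across scales.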
Combining wavelet analysis with other tools, such as neural networks, has now gained wide acceptance in major
scientific fields. To apply a wavelet network, we first decompose the time series into different scales, and each scale
is then used to train a recurrent neural network that provides a forecast for that scale. Aggregating the results, we recover a forecast of the
original signal. In both cases, wavelets are used to extract periodic information within individual scales, which is then used
by other techniques.

6.4.2

Literature review

Wavelet analysis can be used on financial time series which are typically highly nonstationary, exhibit high complexity
and involve both (pseudo) random processes and intermittent deterministic processes. A good overview of the application of wavelets in economics and finance is given by Ramsey [1999]. Davidson et al. [1998] used the orthogonal
dyadic Haar transform to perform semi-nonparametric regression analysis of commodity price behaviour. Ramsey
et al. [1995] searched for evidence of self-similarity in the US stock market price index. Investigating the power
law scaling relationship between the wavelet coefficients and scale, they found some evidence of quasi-periodicity
in the occurrence of some large amplitude shocks to the system, concluding that there may be a modest amount of
predictability in the data. Further, Ramsey et al. [1998] highlighted the importance of timescale decomposition in
analysing economic relationships. Wavelet-based methods to remove hidden cycles from within financial time series
have been developed by Arino et al. [1995] where they first decompose the signal into its wavelet coefficients and
then compute the energy associated with each scale. Defining dominant scales as those with the highest energies, new
coefficient sets are produced related to each of the dominant scales by either one of two methods developed by the
authors. For the signal containing two dominant scales, two new complete sets of wavelet coefficients are computed
which are used to reconstruct two separate signals, corresponding to each dominant scale. Arneodo et al. [1998] found
evidence for a cascade mechanism in market dynamics attributed to the heterogeneity of traders and their different time
horizons causing an information cascade from long to short timescales, the lag between stock market fluctuations and
long-run movements in dividends, and the effect of the release (monthly, quarterly) of major economic indicators which
cascades to fine timescales. Aussem et al. [1998] used wavelet transformed financial data as the input to a neural
network which was trained to provide five-days ahead forecasts for the S&P 500 closing prices. They examined each
wavelet series individually to provide separate forecasts for each timescale and recombined these forecasts to form
an overall forecast. Ramsey et al. [1996] decomposed the S&P 500 index using matching pursuits and found that the data
are characterised by periods of quiet interspersed with intense activity over short periods of time. They found that
fewer coefficients were required to specify the data than for a purely random signal, signifying some form of deterministic structure to the signal. Ramsey et al. [1997] also applied matching pursuits to foreign exchange rate data
sets, the Deutschmark-US dollar, yen-US dollar and yen-Deutschmark, and found underlying traits of the signal. Even
though most of the energy of the system occurred in localised bursts of activity, they could not predict their occurrence
and could not improve forecasting. Combining wavelet transforms, genetic algorithms and artificial neural networks,
Shin et al. [2000] forecasted daily Korean-US dollar returns one-day ahead of time and showed that the genetic-based
wavelet thresholder outperformed cross-validation, best level and best bias. Gencay et al. [2001b] filtered out intraday
periodicities in exchange rate time series using the maximal overlap discrete wavelet transform.

Part III

Quantitative trading in inefficient markets

Chapter 7

Introduction to quantitative strategies
The market inefficiency and the asymmetrical inefficiency of the long-only constraint in portfolio construction led
some authors to revise the modern portfolio theory developed by Markowitz, Sharpe, Tobin and others. The notion
of active, or benchmark-relative, performance and risk was introduced by Grinold [1989] [1994] and the source of
excess risk-adjusted return for an investment portfolio was examined.

7.1

Presenting hedge funds

7.1.1

Classifying hedge funds

Due to the various ways of selecting and risk managing a portfolio (see details in Section (2.1)), there is a large
number of Hedge Funds in the financial industry classified by the type of strategies used to manage their portfolios.
Considering single strategies, we are going to list a few of them.
• Macro: Macro strategies concentrate on forecasting how global macroeconomic and political events affect the
valuations of financial instruments. The strategy has a broad investment mandate. With the ability to hold positions in practically any market with any instrument, profits are made by correctly anticipating price movements
in global markets.
• Equity Hedge: Also known as long/short equity, this strategy combines core long holdings of equities with short sales of stock
or stock index options, and may be anywhere from net long to net short depending on market conditions. The
source of return is similar to that of traditional stock picking on the upside, but the use of short selling and
hedging attempts to outperform the market on the downside.
• Equity Market Neutral: Using complex valuation models, equity market neutral fund managers strive to identify under/overvalued securities. Accordingly, they are long in undervalued positions while selling overvalued
securities short. In contrast to equity hedge, equity neutral has a total net exposure of zero. The strategy intends
to neutralise the effect that a systematic change will have on values of the stock market as a whole.
• Relative Value: Generally, a relative value strategy makes Spread Trades in similar or related securities when
their values, which are mathematically or historically interrelated, are temporarily distorted. Profits are derived
when the skewed relationship between the securities returns to normal.
• Statistical Arbitrage: As a trading strategy, statistical arbitrage is a heavily quantitative and computational
approach to equity trading involving data mining and statistical methods, as well as automated trading systems.
StatArb evolved out of the simpler pairs trade strategy, but it considers a portfolio of a hundred or more stocks,
some long and some short, that are carefully matched by sector and region to eliminate exposure to beta and
other risk factors.

7.1.2

Some facts about leverage

7.1.2.1

Defining leverage

There are numerous ways leverage is defined in the investment industry, and there is no consensus on exactly how
to measure it. Leverage can be defined as the creation of exposure greater in magnitude than the initial cash amount
posted to an investment, where leverage is created through borrowing, investing the proceeds from short sales, or
through the use of derivatives. Thus, leverage may be broadly defined as any means of increasing expected return or
value without increasing out-of-pocket investment. There are three primary types of leverage:
1. Financial Leverage: This is created through borrowing leverage and/or notional leverage, both of which allow
investors to gain cash-equivalent risk exposures greater than those that could be funded only by investing the
capital in cash instruments.
2. Construction Leverage: This is created by combining securities in a portfolio in a certain manner. The way
one constructs a portfolio will have a significant effect on overall portfolio risk, depending on the amount and
type of diversification in the portfolio, and the type of hedging applied (e.g., offsetting some or all of the long
positions with short positions).
3. Instrument Leverage: This reflects the intrinsic risk of the specific securities selected, as different instruments
have different levels of internal leverage.
Leverage allows hedge funds to magnify their exposures and thus magnify their risks and returns. However, a hedge
fund’s use of leverage must consider margin and collateral requirements at the transaction level, and any credit limits
imposed by trading counterparties such as prime brokers. Therefore, hedge funds are often limited in their use of
leverage by the willingness of creditors and counterparties to provide the leverage.
7.1.2.2

Different measures of leverage

Leverage may be quoted as a ratio of assets to capital or equity (e.g., 4 to 1), as a percentage (e.g., 400%), or as an
incremental percentage (e.g., 300%). The Gross Market Exposure is defined as

\[ \text{Gross Market Exposure} = \frac{\text{Long} + \text{Short}}{\text{Capital or Equity}} \times 100\% \]

For example, Hedge Fund A has $1 million of capital, borrows $250,000 and invests the full $1,250,000 in a portfolio
of stocks (i.e., the Fund is long $1.25 million). At the same time, Hedge Fund A sells short $750,000 of stocks. Then

\[ \text{Gross Market Exposure} = \frac{1.25 + 0.75}{1} \times 100\% = \frac{2}{1} \times 100\% = 200\% \]

Many investors do not consider the 200% Gross Market Exposure in the above example to be leverage per se. For
example, assume Hedge Fund A has capital of $1 million and is $1 million long and $1 million short. This results
in Gross Market Exposure of 200%, but Net Market Exposure of zero, which is the typical exposure of an equity
market-neutral fund. That is, the Net Market Exposure is defined as

\[ \text{Net Market Exposure} = \frac{\text{Long} - \text{Short}}{\text{Capital or Equity}} \times 100\% \]

Given the previous example, the Net Market Exposure is

\[ \text{Net Market Exposure} = \frac{1.25 - 0.75}{1} \times 100\% = \frac{1/2}{1} \times 100\% = 50\% \]

One should ask hedge fund managers the following types of questions regarding leverage:

• Does the manager have prescribed limits for net market exposure and for gross market exposure?
• What drives the decision to go long or short, and to use more or less leverage?
• What leverage and net market exposure was used to generate the manager’s track record?
• What is the attribution of return between security selection, market timing and the use of leverage?
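
The gross and net market exposure measures defined in this section can be checked numerically. The following is a minimal sketch; the function names are hypothetical, and amounts are expressed in millions of dollars as in the Hedge Fund A examples.

```python
def gross_market_exposure(long_value, short_value, capital):
    """Gross market exposure in percent: (long + short) / capital * 100."""
    return (long_value + short_value) / capital * 100.0


def net_market_exposure(long_value, short_value, capital):
    """Net market exposure in percent: (long - short) / capital * 100."""
    return (long_value - short_value) / capital * 100.0


# Hedge Fund A: $1m capital, long $1.25m, short $0.75m
print(gross_market_exposure(1.25, 0.75, 1.0))   # 200.0
print(net_market_exposure(1.25, 0.75, 1.0))     # 50.0

# Market-neutral variant: $1m long and $1m short on $1m of capital
print(gross_market_exposure(1.0, 1.0, 1.0))     # 200.0
print(net_market_exposure(1.0, 1.0, 1.0))       # 0.0
```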
7.1.2.3

Leverage and risk

We must distinguish between the concepts of leverage and risk, as there is a common misconception that a levered
asset is always riskier than an unlevered asset. In general, risk is defined by the portfolio’s stock market risk (beta),
and when investors are confronted with several equity portfolios they have to identify the one with the greatest risk. Even
though equity portfolios may have the same market risk, as each portfolio has the same aggregate beta, the key point is
that the same risk level is achieved through different types of leverage. The relationship between risk and leverage is
complex. In particular, when comparing different investments, a higher degree of leverage does not necessarily imply
a higher degree of risk. Leverage is the link between the underlying or inherent risk of an asset and the actual risk of
the investor’s exposure to that asset. Thus, the investor’s actual risk has two components:
1. The market risk (beta) of the asset being purchased
2. The leverage that is applied to the investment
For example, which is more risky: a fund with low net market exposure and borrowing leverage of 1.5 times capital,
or a fund with 100% market exposure and a beta of 1.5 but no borrowing leverage?
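One rough way to make this comparison concrete is to aggregate position betas weighted by exposure relative to capital. The sketch below is only an illustration under strong assumptions: the position sizes and betas of the first fund are invented for the example (the text gives only "low net market exposure" and leverage of 1.5 times capital), market risk is taken to be fully summarised by beta, and the function name is hypothetical.

```python
def effective_beta(positions):
    """Aggregate market beta of a fund.

    positions: iterable of (exposure as a fraction of capital, beta of the position),
    with short positions entered as negative exposures.
    """
    return sum(weight * beta for weight, beta in positions)


# Assumed fund 1: long 150% of capital, short 130% of capital, all in beta-1 stocks
fund_levered_low_net = [(1.5, 1.0), (-1.3, 1.0)]

# Fund 2 from the text: 100% market exposure, beta 1.5, no borrowing
fund_unlevered_high_beta = [(1.0, 1.5)]

print(effective_beta(fund_levered_low_net))       # close to 0.2
print(effective_beta(fund_unlevered_high_beta))   # 1.5
```

Under these assumptions the levered fund carries far less market risk than the unlevered one, although it may carry more residual (stock-specific) risk, which this measure ignores.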
For a given capital base, leverage allows investors to build up a larger investment position and thus a higher
exposure to specific risks. Buying riskier assets or increasing the leverage ratio applied to a given set of assets increases
the risk of the overall investment, and hence the risk to the capital base. Therefore, higher leverage may be more acceptable
for strategies with very low market risk than for strategies that have greater market exposure,
such as long-short equity or global macro. In fact, a levered portfolio of low-risk assets may well carry less risk than
an unlevered portfolio of high-risk assets. Therefore, investors should not concern themselves with leverage per se,
but rather focus on the risk/return relationship that is associated with a particular portfolio construction. In this way,
investors can determine the optimal allocation to a specific strategy in a diversified portfolio.

7.2

Different types of strategies

7.2.1

Long-short portfolio

7.2.1.1

The problem with long-only portfolio

Some form of risk constraint is generally placed on fund managers by investors or fund administrators, such as size-neutrality, sector neutrality, value-growth neutrality, maximum total number of positions and long-only constraints.
Clarke et al. [2002] found that the long-only constraint is the most significant restriction placed on portfolio managers. While most investors focus on the management of long portfolios and the selection of winning securities, the
identification of winning securities ignores by definition a whole class of losing securities. As explained in Section
(2.1.1), excess returns come from active security weights, that is, portfolio weights differing from benchmark weights.
An active long-only portfolio holds securities expected to perform above average at higher-than-benchmark weights
and those expected to perform below average at lower-than-benchmark weights. Without short-selling, the manager cannot
underweight many securities by enough to achieve significant negative active weights. Hence, restricting short sales
prevents managers from fully implementing their complete information set when constructing their portfolios. As
explained by Jacobs et al. [1999], the ability to sell short frees the investor to take advantage of the full array
of securities and the full complement of investment insights by holding expected winners long and selling expected
losers short. Active fund managers express their investment views on the assets in their investment universe by holding an over-, neutral or under-weight position ($w_{a,i} > 0$, $w_{a,i} = 0$, $w_{a,i} < 0$) in these assets relative to their assigned
benchmark

\[ w_{a,i} = w_{f,i} - w_{b,i} \]

where $w_{a,i}$ is the active weight in asset $i$, $w_{f,i}$ is the fund’s weight in asset $i$, and $w_{b,i}$ is the weight of the benchmark in asset $i$.
Since the conventional long-only fund manager can only expand a negative position to the point of excluding the asset
from the fund ($w_{f,i} \geq 0$), the most negative active weight possible in any particular asset, in a long-only fund,
is the negative of the asset’s benchmark weight ($w_{a,i} \geq -w_{b,i}$). This results in a greater scope for expressing positive
investment views in each asset than negative views, leading to the asymmetry in the long-only active manager’s
opportunity set. On the other hand, the unconstrained investor benefits from a symmetrical investment opportunity set
with respect to implementing negative active investment weights. The extent to which the fund manager can expand
a positive active weight is limited only by any particular mandate restrictions and the ability to finance the total positive
active positions in the portfolio with sufficient negative positions in other assets.
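
The asymmetry can be seen on a small example. The sketch below computes active weights and the long-only lower bound on each of them; the four-asset fund and benchmark weights are made up for illustration and the function name is hypothetical.

```python
import numpy as np


def active_weights(w_fund, w_bench):
    """Active weight of each asset: fund weight minus benchmark weight."""
    return np.asarray(w_fund, dtype=float) - np.asarray(w_bench, dtype=float)


# Illustrative (assumed) weights for a four-asset universe
w_bench = np.array([0.40, 0.30, 0.25, 0.05])
w_fund = np.array([0.50, 0.30, 0.20, 0.00])    # long-only: all weights >= 0

w_active = active_weights(w_fund, w_bench)
lower_bound = -w_bench                          # most negative active weight if long-only

print(w_active)      # over-, neutral, under-, under-weight
print(lower_bound)   # the 5% asset can be underweighted by at most 5%
```

The smallest benchmark constituent can only be underweighted by its 5% benchmark weight, while nothing in the long-only constraint itself caps the overweight on it.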
7.2.1.2

The benefits of long-short portfolio

The benefits of long-short portfolios are to a large extent dependent on proper portfolio construction, and only an
integrated portfolio can maximise the value of investors’ insights. Much of the incremental cost associated with a
given long-short portfolio reflects the strategy’s degree of leverage. Although most existing long-short portfolios are
constructed to be neutral to systematic risk, neutrality is neither necessary nor optimal. Further, long-short portfolios do
not constitute a separate asset class and can be constructed to include a desired exposure to the return of any existing
asset class.
A long-short portfolio should not be considered as a two-portfolio strategy but as a one-portfolio strategy in which
the long and short positions are determined jointly within an optimisation that takes into account the expected returns
of the individual securities, the standard deviations of those returns, and the correlations between them, as well as the
investor’s tolerance for risk (see Jacobs et al. [1995] and [1998]). Within integrated optimisation, there is no need
to converge to securities’ benchmark weights in order to control risk. Rather, offsetting long and short positions can
be used to control portfolio risk. For example, if an investor has some strong insight about oil stocks, some of which
are expected to do very well and others very poorly, he does not need to restrict weights to index-like weights
and can allocate much of the portfolio to oil stocks. The offsetting long and short positions control the portfolio’s
exposure to the oil factor. On the other hand, if he has no insights into oil stock behaviour, the long-short investor can
totally exclude oil stocks from the portfolio. The risk is not increased because in that setting it is independent of any
security’s benchmark weight. The absence of restrictions imposed by securities’ benchmark weights enhances the
long-short investor’s ability to implement investment insights.
An integrated optimisation that considers both long and short positions simultaneously not only frees the investor
from the non-negativity constraint imposed on long-only portfolios, but also frees the long-short portfolio from the
restrictions imposed by securities’ benchmark weights. To see this we follow Jacobs et al. [1999] and consider an
obvious (suboptimal) way of constructing a long-short portfolio. To do so, we combine a long-only portfolio with a
short-only portfolio, resulting in a long-plus-short portfolio and not a true long-short portfolio. The long side of this
portfolio being identical to a long-only portfolio, it offers no benefits in terms of incremental return or reduced risk.
Further, the short side is statistically equivalent to the long side, hence to the long-only portfolio. In effect, assuming
symmetry of inefficiencies across attractive and unattractive stocks, and assuming identical and separate portfolio
construction for the long and short sides, we get
\[ \alpha_L = \alpha_S = \alpha_{LO} \]
\[ \sigma_{e,L} = \sigma_{e,S} = \sigma_{e,LO} \]

where $\alpha_l$ for $l = L, S, LO$ is the alpha of the long, short, and long-only portfolio, and $\sigma_{e,l}$ for $l = L, S, LO$ is the
residual risk of the respective portfolios. The excess return, or alpha, of the long side of the long-plus-short portfolio
will equal the alpha of the short side, which will equal the alpha of the long-only portfolio. This is also true of the
residual risk $\sigma_e$. It means that all three portfolios are constructed relative to a benchmark index. Each portfolio is
active in pursuing excess return relative to the underlying index only insofar as it holds securities in weights that depart
from their index weights. This portfolio construction is index-constrained. Assuming that the beta of the short side
equals the beta of the long side, the ratio of the performance of the long-plus-short portfolio to that of the long-only
portfolio can be expressed as

\[ \frac{IR_{L+S}}{IR_{LO}} = \sqrt{\frac{2}{1 + \rho_{L+S}}} \]

where the information ratio IR is a measure of risk-adjusted outperformance. It is the excess return over the
benchmark divided by the residual risk (tracking error)¹

\[ IR = \frac{\alpha}{\sigma_e} \tag{7.2.1} \]

and $\rho_{L+S}$ is the correlation between the alphas of the long and short sides of the long-plus-short portfolio. Hence,
the advantage of a long-plus-short portfolio is curtailed by the need to control risk by holding or shorting securities
in index-like weights. Benefits only apply if there is a less-than-one correlation between the alphas of its long and
short sides. In that case, the long-plus-short portfolio will enjoy greater diversification and reduced risk relative to the
long-only portfolio.
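
The ratio of information ratios can be evaluated for a few correlations to see how quickly the benefit vanishes as the two sides become correlated. The sketch below also rebuilds the ratio from its components, under the assumption (consistent with the formula, but not spelled out in the text) that the two alphas add and that the two residual risks combine with correlation ρ. Function names are hypothetical.

```python
import math


def ir_ratio(rho):
    """IR of the long-plus-short portfolio relative to the long-only one: sqrt(2 / (1 + rho))."""
    if not -1.0 < rho <= 1.0:
        raise ValueError("rho must lie in (-1, 1]")
    return math.sqrt(2.0 / (1.0 + rho))


def ir_ratio_from_components(alpha, sigma_e, rho):
    """Same ratio from alpha_L = alpha_S = alpha and sigma_{e,L} = sigma_{e,S} = sigma_e.

    Assumption: the combined alpha is 2 * alpha and the combined residual
    variance is 2 * sigma_e**2 * (1 + rho).
    """
    alpha_ls = 2.0 * alpha
    sigma_ls = math.sqrt(2.0 * sigma_e ** 2 * (1.0 + rho))
    return (alpha_ls / sigma_ls) / (alpha / sigma_e)


for rho in (1.0, 0.5, 0.0):
    print(rho, ir_ratio(rho), ir_ratio_from_components(0.02, 0.04, rho))
```

With ρ = 1 the ratio is one, so the long-plus-short portfolio offers no improvement over the long-only portfolio; with uncorrelated sides the ratio rises to the square root of two.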

Advocates of long-short portfolios also point to the diversification benefits provided by the short side. According
to them, a long-short strategy includes a long and a short portfolio; if the two portfolios are uncorrelated, the combined
strategy would have a higher information ratio than the two separate portfolios as a result of diversification. Jacobs et
al. [1995] addressed the diversification argument by observing that long and short alphas are not separately measurable
in an integrated long-short optimisation framework. They suggested that the correlation between the separate long and
short portfolios is not relevant. More recently, the centre stage of the long-short debate has focused on whether
efficiency gains result from relaxing the long-only constraint. Grinold et al. [2000] showed that information ratios
decline when one moves from a long-short to a long-only strategy.

7.2.2

Equity market neutral

An investment strategy or portfolio is considered market-neutral if it seeks to entirely avoid some form of market risk,
typically by hedging. A portfolio is truly market-neutral if it exhibits zero correlation with the unwanted source of risk,
which is seldom possible in practice. Equity market-neutral is a hedge fund strategy that seeks to exploit investment
opportunities unique to some specific group of stocks while maintaining a neutral exposure to broad groups of stocks
defined, for example, by sector, industry, market capitalisation, country, or region. The strategy holds long-short
equity positions, with long positions hedged with short positions in the same and related sectors, so that the equity
market-neutral investor should be little affected by sector-wide events. For example, a hedge fund manager will go
long in the 10 biotech stocks that should outperform and short the 10 biotech stocks that will underperform. Therefore,
what the actual market does will not matter (much) because the gains and losses will offset each other. Equivalently,
the process of stock picking can be realised with complex valuation models. This places, in essence, a bet that the
long positions will outperform their sectors (or the short positions will underperform) regardless of the strength of the
sectors.
1 The tracking error refers to the standard deviation of portfolio returns against the benchmark return. Hence, risk refers to the deviation of the
portfolio returns from the benchmark returns.

As an example, a delta neutral strategy describes a portfolio of related financial securities, in which the portfolio
value remains unchanged due to small changes in the value of the underlying security. The term delta hedging is the
process of setting or keeping the delta of a portfolio as close to zero as possible. It may be accomplished by buying or
selling an amount of the underlier that corresponds to the delta of the portfolio. By adjusting the amount bought or sold
on new positions, the portfolio delta can be made to sum to zero, and the portfolio is then delta neutral (see Wilmott
et al. [2005]). Another example is the pairs trade or pair trading which corresponds to a market neutral trading
strategy enabling traders to profit from virtually any market conditions: uptrend, downtrend, or sideways movement.
This strategy is categorised as a statistical arbitrage and convergence trading strategy. The pair trading was pioneered
by Gerry Bamberger and later led by Nunzio Tartaglia’s quantitative group at Morgan Stanley in the early to mid
1980s (see Gatev et al. [2006], Bookstaber [2007]). The idea was to challenge the Efficient Market Hypothesis and
exploit the discrepancies in the stock prices to generate abnormal profits. The strategy monitors performance of two
historically correlated securities. When the correlation between the two securities temporarily weakens, that is, one
stock moves up while the other moves down, the pairs trade would be to short the outperforming stock and to long the
underperforming one, betting that the spread between the two would eventually converge.
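The delta-hedging step described above amounts to offsetting the aggregate delta of the portfolio with a position in the underlier. A minimal sketch, with made-up positions and hypothetical function names; each position is given as a signed quantity together with its delta per unit:

```python
def portfolio_delta(positions):
    """Total delta of a list of (signed quantity, delta per unit) positions."""
    return sum(quantity * delta for quantity, delta in positions)


def hedge_quantity(positions):
    """Units of the underlier to trade so that the total delta is zero (negative means sell)."""
    return -portfolio_delta(positions)


# Assumed book: long 1000 calls with delta 0.6, short 500 puts with delta -0.4
book = [(1000, 0.6), (-500, -0.4)]
hedge = hedge_quantity(book)
print(portfolio_delta(book), hedge)                  # delta of about 800, hedge of about -800

# The underlier itself has delta 1, so adding the hedge brings the total to zero
print(portfolio_delta(book + [(hedge, 1.0)]))
```

Since option deltas change with the underlying price and with time, the hedge quantity has to be recomputed as new positions are added and as the market moves.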
There are many ways in which to invest in market neutral strategies, all of which seek to take systematic risk out of
the investment equation. Among the most common market neutral approaches is long-short equity. Long-short equity
investing has several benefits. The strategy is uncorrelated to other asset classes. The alpha generated by long-short
managers is uncorrelated to the alpha generated by index equity managers. Moreover, the alphas generated by different long-short managers have low correlation to one another, providing an excellent diversifying strategy. Long-short equity also
provides flexibility in asset allocation and rebalancing due to the portability of the alpha generated by long-short equity
and, more generally, by market neutral strategies. Over the longer term, long-short equity investing should provide attractive
risk-adjusted returns as well as greater diversification and flexibility within investment programs.
A portfolio which appears to be market-neutral may exhibit unexpected correlations as market conditions change,
leading to basis risk. Equity market-neutral managers recognise that the markets are dynamic and take advantage of
sophisticated mathematical techniques to explore new opportunities and improve their methodology. The fact that there
are many different investment universes globally makes this strategy less susceptible to alpha decay. The abundance
of data lends itself well to rigorous back-testing and the development of new algorithms.

7.2.3

Pairs trading

We saw in Section (7.2.2) that Equity Market Neutral is not a single trading strategy but an umbrella term
for a broad range of quantitative trading strategies such as pairs trading. Pairs trading is one of Wall Street’s
quantitative methods of speculation, dating back to the mid-1980s (see Vidyamurthy [2004]). Market neutral
strategies are generally known for attractive investment properties, such as low exposure to the equity markets and
relatively low volatility. The industry practice for market neutral hedge funds is to use a daily sampling frequency and
standard cointegration techniques² to find matching pairs (see Gatev et al. [2006]). The general description of the
technique is that a pair of shares is formed, where the investor is long one share and short another share. The rationale is
that there is a long-term equilibrium (spread) between the share prices, and thus the share prices fluctuate around that
equilibrium level (the spread has a constant mean). The investor evaluates the current position of the spread based on
its historical fluctuations and, when the current spread deviates from its historical mean by a pre-determined significant
amount (measured in standard deviations), the position in the spread is subsequently altered and the legs are adjusted accordingly.
Studying the effectiveness of this type of strategy, Gatev et al. [2006] conducted empirical tests on pairs trading using
common stocks and found that the strategy was profitable even after taking the transaction costs into account. Jurek et
al. [2007] improved performance by deriving a mean reversion strategy. Investigating the usefulness of pairs trading
applied to the energy futures market, Kanamura et al. [2008] obtained high total profits due to strong mean reversion
and high volatility in the energy markets.
2 Cointegration is a quantitative technique based on finding long-term relations between asset prices introduced in a seminal paper by Engle and
Granger [1987]. Another approach was developed by Johansen [1988], which can be applied to more than two assets at the same time. The result
is a set of cointegrating vectors that can be found in the system. If one only deals with pairs of shares, it is preferable to use the simpler Engle and
Granger [1987] methodology.

In practice, the investor bets on the reversion of the current spread to its historical mean by shorting/going long an
appropriate amount of each share in the pair. That amount is expressed by the variable beta, which tells the investor the
number of shares X he has to short/go long for each share Y. There are various ways of calculating beta: it can
either be fixed or time-varying. In the latter case, one can use a rolling ordinary least squares (OLS) regression, the
double exponential smoothing prediction (DESP) model or the Kalman filter. As an example, Dunis and Shannon
[2005] use time adaptive betas with the Kalman filter methodology (see Hamilton [1994] or Harvey [1989] for a
detailed description of the Kalman filter implementation). It is a forward-looking methodology, as it tries to predict
the future position of the parameters as opposed to using a rolling OLS regression (see Bentz [2003]). Later, Dunis
et al. [2010] applied a long-short strategy to compare the profit potential of shares sampled at 6 different frequencies,
namely 5-minute, 10-minute, 20-minute, 30-minute, 60-minute and daily sampling intervals. They considered an
approach enhancing the performance of the basic trading strategy by selecting the pairs for trading based on the best
in-sample information ratios and the highest in-sample t-stat of the Augmented Dickey-Fuller (ADF) unit root test
of the residuals of the cointegrating regression sampled at a daily frequency. As described by Aldridge [2009], one
advantage of using high-frequency data is the potentially higher achievable information ratio compared to the use of
daily closing prices.
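
Of the time-varying estimators listed above, the rolling OLS regression is the simplest to write down. The following is a minimal sketch: the function name, the window length and the inclusion of an intercept in the regression are assumptions, and the prices are synthetic.

```python
import numpy as np


def rolling_beta(p_y, p_x, window):
    """Rolling OLS slope of the price of Y on the price of X (with intercept).

    Entries before the first full window are NaN.
    """
    p_y = np.asarray(p_y, dtype=float)
    p_x = np.asarray(p_x, dtype=float)
    beta = np.full(len(p_y), np.nan)
    for t in range(window - 1, len(p_y)):
        xs = p_x[t - window + 1:t + 1]
        ys = p_y[t - window + 1:t + 1]
        xc = xs - xs.mean()
        beta[t] = np.dot(xc, ys - ys.mean()) / np.dot(xc, xc)   # cov(x, y) / var(x)
    return beta


rng = np.random.default_rng(0)
p_x = 50.0 + np.cumsum(rng.normal(size=300))            # synthetic price of share X
p_y = 10.0 + 1.8 * p_x + rng.normal(scale=0.5, size=300)  # share Y tracks X with noise

beta = rolling_beta(p_y, p_x, window=60)
print(beta[-1])    # should be near the true value of 1.8
```

The estimate at time t uses only prices up to and including t, so it can be used to size the legs of a trade opened at t without look-ahead.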
Assuming the pairs belong to the same industry, we follow the description given by Dunis et al. [2010] and
calculate the spread between two shares as

\[ Z_t = P_t^Y - \beta_t P_t^X \]
where Z_t is the value of the spread at time t, P_t^Y is the price of share Y at time t, P_t^X is the price of share X at
time t, and β_t is the adaptive coefficient beta at time t. In general, the spread is normalised by subtracting its mean and
dividing by its standard deviation. The mean and the standard deviation are calculated from the in-sample period and
are then used to normalise the spread in both the in-sample and out-of-sample periods. Dunis et al. sell (buy) the spread
when it is 2 standard deviations above (below) its mean value, and liquidate the position when the spread is closer than
0.5 standard deviations to its mean. They chose the investment to be money-neutral, so that the amounts of euros
invested on the long and short sides of the trade are the same. They did not rebalance once they had entered into
a position: after an initial entry with equal amounts of euros on both sides of the trade, the position was left
unchanged even when price movements meant that it was no longer money-neutral. Only two types of transactions
were allowed: entry into a new position, and total liquidation of the position held previously.
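The entry and exit rules above can be sketched as follows. This is only an illustration under our own assumptions: the function names are hypothetical, the sample standard deviation is used for the normalisation, and positions are expressed in units of the spread (+1 long, -1 short, 0 flat).

```python
from statistics import mean, stdev

def normalise_spread(spread, in_sample_len):
    """Z-score the whole spread using the in-sample mean and standard deviation only."""
    mu = mean(spread[:in_sample_len])
    sd = stdev(spread[:in_sample_len])
    return [(z - mu) / sd for z in spread]

def spread_positions(zscores, entry=2.0, exit_band=0.5):
    """Position in the spread per period: -1 short, +1 long, 0 flat."""
    positions = []
    current = 0
    for z in zscores:
        if current == 0:
            if z > entry:
                current = -1      # spread is rich: sell the spread
            elif z < -entry:
                current = 1       # spread is cheap: buy the spread
        elif abs(z) < exit_band:
            current = 0           # total liquidation close to the mean
        positions.append(current)
    return positions

example = spread_positions([0.0, 2.5, 1.0, 0.4, -2.1, -0.6, -0.3])
```

No rebalancing takes place while a position is open, which mirrors the two permitted transaction types: entry and total liquidation.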
Dunis et al. [2010] explained the different indicators calculated in the in-sample period, trying to find a link
with the out-of-sample information ratio, and as a consequence proposed a methodology for evaluating
the suitability of a given pair for arbitrage trading. All the indicators are calculated in the in-sample period, the
objective being to find the indicators with high predictive power for the profitability of the pair in the out-of-sample
period. According to Do et al. [2006], the success of pairs trading depends heavily on the modelling and forecasting
of the spread time series. For instance, the Ornstein-Uhlenbeck (OU) equation can be used to calculate the speed and
strength of mean reversion

\[ dZ_t = k(\mu - Z_t)\,dt + \sigma\,dW_t \]
where µ is the long-term mean of the spread, Z_t is the value of the spread at a particular point in time, k is the strength
of mean reversion, and σ is the standard deviation. The parameters of the process are estimated on the in-sample spread.
This SDE is just the supplementary equation from which we calculate the half-life of mean reversion of the pairs. The
half-life of mean reversion in number of periods can be calculated as

\[ k_{1/2} = -\frac{\ln 2}{k} \]

The minus sign corresponds to the convention in which k is the (negative) slope obtained when regressing the change in
the spread on its lagged level; with k > 0 as written in the SDE above, the half-life is simply ln 2 / k. Intuitively
speaking, it is the time the spread is expected to take to close half of its deviation from its mean. Thus, traders should
prefer pairs with a short half-life to pairs with a long one. The information ratio (IR) gives us an idea of the quality
of the strategy. An annualised information ratio of 2 means that the strategy is profitable almost every month, while
strategies with an information ratio around 3 are profitable almost every day (see Chan [2009]). In the case of intraday
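trading, the annualised information ratio is

\[ MIR = \frac{R}{\sigma} \sqrt{h_d \times 252} \]

where R and σ are the average return and the standard deviation of returns per period (here, per hour), and h_d is the
number of hours traded per day (for a trading day, h_d ≠ 24). However, it overestimates the true information
ratio if returns are autocorrelated (see Alexander [2008]).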
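As a rough illustration of the two indicators above, the sketch below estimates the half-life from a discrete-time regression of the change in the spread on its lagged level, and annualises an hourly information ratio. The function names are hypothetical, and the mapping b = exp(-k) - 1 assumes an OU process with k > 0 observed once per period.

```python
import math
from statistics import mean, stdev

def ou_half_life(spread):
    """Half-life in periods from the regression dZ_t = a + b * Z_{t-1} + e_t.

    For an OU process sampled once per period, b = exp(-k) - 1,
    hence k = -ln(1 + b) and the half-life is ln 2 / k.
    """
    lagged = spread[:-1]
    changes = [z1 - z0 for z0, z1 in zip(spread[:-1], spread[1:])]
    m_x, m_y = mean(lagged), mean(changes)
    b = (sum((x - m_x) * (y - m_y) for x, y in zip(lagged, changes))
         / sum((x - m_x) ** 2 for x in lagged))
    if not -1.0 < b < 0.0:
        return math.inf            # no measurable mean reversion
    k = -math.log(1.0 + b)
    return math.log(2.0) / k

def annualised_information_ratio(hourly_returns, hours_per_day):
    """Mean over standard deviation of hourly returns, scaled by sqrt(h_d * 252)."""
    ratio = mean(hourly_returns) / stdev(hourly_returns)
    return ratio * math.sqrt(hours_per_day * 252)
```

A spread that halves its distance to the mean every period, such as 8, 4, 2, 1, gives a half-life of one period, while a trending spread gives an infinite half-life under this sketch.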

Pairs trading is not a risk-free strategy as the difficulty comes when prices of the two securities begin to drift
apart, that is, the spread begins to trend instead of reverting to the original mean. Dealing with such adverse situations
requires strict risk management rules, which have the trader exit an unprofitable trade as soon as the original setup,
a bet for reversion to the mean, has been invalidated. This can be achieved by forecasting the spread and exiting at
forecast error bounds. Further, the market-neutral strategies assume that the CAPM model is valid and that beta is
a correct estimate of systematic risk. If this is not the case, the hedge may not properly protect us in the event of a
shift in the markets. In addition, measures of market risk, such as beta, are historical and could become very
different from their past behaviour in the future. Hence, in a mean reversion strategy where the mean is assumed
to remain constant, a change of mean is referred to as drift.

7.2.4 Statistical arbitrage

Statistical arbitrage (abbreviated as StatArb) refers to a particular category of hedge funds based on highly technical
short-term mean-reversion strategies involving large numbers of securities (hundreds to thousands, depending on the
amount of risk capital), very short holding periods (measured in days to seconds), and substantial computational, trading, and information technology (IT) infrastructure. As a trading strategy, statistical arbitrage is a heavily quantitative
and computational approach to equity trading involving data mining and sophisticated statistical methods and mathematical models, as well as automated trading systems to generate a higher than average profit for the traders. StatArb
evolved out of the simpler pairs trade strategy (see Section (7.2.3)), but it considers a portfolio of a hundred or more
stocks, some long and some short, that are carefully matched by sector and region to eliminate exposure to beta and
other risk factors.
Broadly speaking, StatArb is actually any strategy that is bottom-up, beta-neutral in approach and uses statistical/econometric techniques in order to provide signals for execution. The mathematical concepts used in Statistical
Arbitrage range from Time Series Analysis, Principal Components Analysis (PCA), Co-integration, neural networks
and pattern recognition, covariance matrices and efficient frontier analysis to advanced concepts in particle physics
such as free energy and energy minimisation. Signals are often generated through a contrarian mean-reversion principle, but they can also be designed using such factors as lead/lag effects, corporate activity, short-term momentum, etc.
This is usually referred to as a multi-factor approach to StatArb. Because of the large number of stocks involved, the
high portfolio turnover and the fairly small size of the effects one is trying to capture, the strategy is often implemented
in an automated fashion and great attention is placed on reducing trading costs.
As an example, an automated portfolio may consist of two phases. In the scoring phase, each stock in the market
is assigned a numeric score or rank reflecting its desirability; high scores indicate stocks that should be held long
and low scores indicate stocks that are candidates for shorting. The details of the scoring formula vary and are highly
proprietary, but, generally (as in pairs trading), they involve a short term mean reversion principle so that stocks having
done unusually well in the past week receive low scores and stocks having underperformed receive high scores. In
the second or risk reduction phase, the stocks are combined into a portfolio in carefully matched proportions so as to
eliminate, or at least greatly reduce, market and factor risk.
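The two phases can be sketched in a deliberately simplified form. Since actual scoring formulas are proprietary, the score used here (the negative of the trailing one-week return) and the equal-weighted, money-neutral construction are our assumptions; real implementations would also neutralise beta and other factor exposures.

```python
def reversal_scores(past_week_returns):
    """Scoring phase: recent losers get high scores, recent winners low scores."""
    return {ticker: -ret for ticker, ret in past_week_returns.items()}

def dollar_neutral_weights(scores, n_each_side):
    """Risk reduction phase (crude): equal-weight top scores long, bottom scores short.

    Long weights sum to +1 and short weights to -1, so the net
    exposure in money terms is zero.
    """
    if 2 * n_each_side > len(scores):
        raise ValueError("not enough stocks for disjoint long and short books")
    ranked = sorted(scores, key=scores.get, reverse=True)
    longs, shorts = ranked[:n_each_side], ranked[-n_each_side:]
    weights = {ticker: 0.0 for ticker in scores}
    for ticker in longs:
        weights[ticker] += 1.0 / n_each_side
    for ticker in shorts:
        weights[ticker] -= 1.0 / n_each_side
    return weights

# Hypothetical one-week returns for four stocks.
weekly = {"A": 0.05, "B": -0.04, "C": 0.01, "D": -0.02}
weights = dollar_neutral_weights(reversal_scores(weekly), 1)
```

In this toy example the worst performer of the week is held long and the best performer is sold short.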
Statistical arbitrage is subject to model weakness as well as stock or security-specific risk. The statistical
relationship on which the model is based may be spurious, or may break down due to changes in the distribution of returns
on the underlying assets. Factors to which the model is unknowingly exposed could become significant drivers of price
action in the markets, and the inverse also applies. The existence of investments based upon the model may itself
change the underlying relationship, particularly if enough entrants invest with similar principles. The
exploitation of arbitrage opportunities itself increases the efficiency of the market, thereby reducing the scope
for arbitrage, so continual updating of models is necessary. Further, StatArb has developed to a point where it is a
significant factor in the marketplace, so that existing funds have similar positions and are in effect competing for the
same returns.

7.2.5 Mean-reversion strategies

Mean reversion strategies have been very popular since 2009. They have performed exceptionally well for the past
10 years, performing well even during the 2008-09 bear market. Different versions have been popularised, notably by
Larry Connors and Cesar Alvarez (David Varadi, Michael Stokes). Some of the indicators used are
• the RSI indicator (Relative Strength Index)
• a short term simple moving average
• the Bollinger bands
The concept is the same: if the price moved up today, it will tend to revert (come down) tomorrow.
Example on RSI (on the GSPC index):
Analysing data since 1960, we find that over the 10 years from 2000 to 2010 the market changed and became mean
reverting: buy on oversold and sell on overbought. Mean-reverting strategies have not performed as well since 2010.
Suppose instead that we traded the opposite strategy: buy if the short-term RSI is high, sell if it is low (trend
following). As expected, it does well up to 2000, then it is a disaster.
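To make the RSI rule concrete, here is a minimal sketch of a short-term RSI and the associated mean-reversion signal. The use of Wilder's smoothing and the thresholds of 10 and 90 are our assumptions for illustration; the text does not specify them.

```python
def rsi(closes, period):
    """Relative Strength Index with Wilder's smoothing; None until defined."""
    values = [None] * len(closes)
    if len(closes) <= period:
        return values
    changes = [b - a for a, b in zip(closes[:-1], closes[1:])]
    avg_gain = sum(max(c, 0.0) for c in changes[:period]) / period
    avg_loss = sum(max(-c, 0.0) for c in changes[:period]) / period
    for i in range(period, len(closes)):
        if i > period:
            change = changes[i - 1]
            avg_gain = (avg_gain * (period - 1) + max(change, 0.0)) / period
            avg_loss = (avg_loss * (period - 1) + max(-change, 0.0)) / period
        if avg_gain + avg_loss == 0.0:
            values[i] = 50.0          # flat prices: no up or down pressure
        else:
            values[i] = 100.0 * avg_gain / (avg_gain + avg_loss)
    return values

def mean_reversion_signal(rsi_value, oversold=10.0, overbought=90.0):
    """+1 (buy) when oversold, -1 (sell) when overbought, 0 otherwise."""
    if rsi_value is None:
        return 0
    if rsi_value < oversold:
        return 1
    if rsi_value > overbought:
        return -1
    return 0
```

The opposite, trend-following strategy discussed above simply flips the sign of this signal.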

7.2.6 Adaptive strategies

An adaptive strategy depends on a Master Strategy and some Allocation Rules:
1. Master Strategy
• Instead of deciding on which RSI period and thresholds (sample sizes) to use, we use 6 different versions
(RSI(2), RSI(3) and RSI(4), each with different thresholds).
• One non-mean-reverting strategy: if RSI(2) crosses above 50, buy; if it crosses below 50, sell.
2. Allocation Rules
• We measure the risk-adjusted performance over the last 600 bars for each of the 7 strategies.
• The top 5 are allocated capital: the best gets 50% of the account to trade with, the 2nd gets 40%, the 3rd gets
30%, etc.
Total allocation is 150%, meaning that if all strategies were trading we would have to use 1.5x leverage.
Based on the previous example, up to 2002 the system takes positions mostly in the trend following strategy, while
mean-reverting strategies start increasing positions as early as 1996 and eventually take over by 2004. There
is a 3 year period (2000-2003) of continuous draw-down as the environment changes and the strategy tries to adapt.
Notice that the trend-following RSI strategy (buy on up, sell on down) briefly started trading in August 2011, after
being inactive for 9 years.
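The allocation rule can be sketched as a simple ranking. The tiers of 20% and 10% for the 4th and 5th strategies are inferred from the stated total of 150%; the function name, the strategy names and the performance figures are hypothetical.

```python
def allocate_capital(performance, tiers=(0.5, 0.4, 0.3, 0.2, 0.1)):
    """Map each strategy to a fraction of the account based on its performance rank.

    `performance` holds a risk-adjusted performance measure per strategy,
    for instance computed over the last 600 bars. Unranked strategies get 0.
    """
    ranked = sorted(performance, key=performance.get, reverse=True)
    allocation = {name: 0.0 for name in performance}
    for name, fraction in zip(ranked, tiers):
        allocation[name] = fraction
    return allocation

# Six mean-reverting RSI variants plus one trend-following variant.
performance = {
    "rsi2_a": 1.2, "rsi2_b": 0.4, "rsi3_a": 0.9, "rsi3_b": -0.1,
    "rsi4_a": 0.7, "rsi4_b": 0.2, "trend": -0.5,
}
allocation = allocate_capital(performance)
```

With these inputs the trend-following strategy receives no capital, which is the kind of inactivity described above for the years after the regime change.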

7.2.7 Constraints and fees on short-selling

Many complications are related to the use of short-selling in the form of constraints and fees. For instance, a pair
trading strategy requiring one to be long one share and short another is called a self-financing strategy (see Alexander
et al. [2002]). That is, an investor can borrow the amount he wants to invest, say from a bank, then to be able to
short a share, he deposits the borrowed amount with the financial institution as collateral and obtains borrowed shares.
Thus, the only cost he has to pay is the difference between borrowing interest rates paid by the investor and lending
interest rates paid by the financial institution to the investor. Subsequently, to go short a given share, the investor sells
the borrowed share and obtains cash in return. From the cash he finances his long position. On the whole, the only
cost is the difference between both interest rates (paid vs. received). A more realistic situation is one where
the investor does not have to borrow capital from a bank in the beginning (e.g. the case of a hedge fund that disposes of
capital from investors), which allows us to drop the difference in interest rates. Therefore, a short position would be wholly
financed by an investor. However, in that case the investor must establish an account with a prime broker who arranges
to borrow stocks for short-selling. The investor may be subject to buy-in and have to cover the short positions. The
financial intermediation cost of borrowing including the costs associated with securing and administrating lendable
stocks averages 25 to 30 basis points. This cost is incurred as a hair-cut on the short rebate received from the interest
earned on the short sale proceeds. Short-sellers may also incur trading opportunity costs because exchange rules delay
or prevent short sales. For example, dealing with the 50 most liquid European shares, we can consider conservative
total transaction costs of 0.3% one-way for both shares (see Alexander et al. [2002]), consisting of a brokerage fee
of 0.1% for each share (thus 0.2% for both shares) plus a bid-ask spread for each share (long and short), which we
assume to be 0.05% (thus 0.1% for both shares, and 0.3% in total). A long-short portfolio can take advantage
of the leverage allowed by regulations (two-to-one leverage) by engaging in about twice as much trading activity
as a comparable unlevered long-only strategy. The differential is largely a function of the portfolio’s leverage. For
example, given a capital of $10 million the investor can choose to invest $5 million long and sell $5 million short.
Trading activity for the resulting long-short portfolio will be roughly equivalent to that for a $10 million long-only
portfolio. If one considers management fees per dollar of securities positions, rather than per dollar of capital, there
should not be much difference between long-short and long-only portfolios. In general, investors should consider
the amount of active management provided per dollar of fees. Long-only portfolios have a sizable hidden passive
component, as only their overweights and underweights relative to the benchmark are truly active. On the other hand,
a long-short portfolio is entirely active, so that in terms of management fees per active dollar, long-short may be
substantially less costly than a long-only portfolio. Moreover, long-short management is almost always offered on a
performance-fee basis. Long-short is viewed as riskier than a long-only portfolio due to potentially unlimited losses on
short positions. In practice, long-short will incur more risk than a long-only portfolio to the extent that it engages in
leverage and/or takes more active positions. A portfolio taking full advantage of the available leverage will have at risk
roughly double the amount of assets invested in a comparable unlevered long-only strategy. Note, both the portfolio’s degree
of leverage and its activeness are within the explicit control of the investor.
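The transaction cost arithmetic quoted above can be checked with a few lines of code. The function names are hypothetical, and treating a round trip as twice the one-way cost (one entry and one total liquidation) is our assumption.

```python
def one_way_pair_cost(brokerage_fee=0.001, bid_ask_spread=0.0005, n_legs=2):
    """One-way cost of trading all legs of a pair, as a fraction of the amount traded."""
    return n_legs * (brokerage_fee + bid_ask_spread)

def round_trip_pair_cost(brokerage_fee=0.001, bid_ask_spread=0.0005, n_legs=2):
    """Entering and later liquidating the pair pays the one-way cost twice."""
    return 2 * one_way_pair_cost(brokerage_fee, bid_ask_spread, n_legs)

one_way = one_way_pair_cost()        # 0.1% fee + 0.05% spread, for both shares
round_trip = round_trip_pair_cost()
```

With the default values the one-way figure reproduces the 0.3% quoted in the text.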

7.3 Enhanced active strategies

7.3.1 Definition

Enhanced active equity portfolios (EAEP) seek to improve upon the performance of actively managed long-only
portfolios by allowing for short-selling and reinvestment of the entire short sales proceeds in incremental long positions.
This style advances the pursuit of active equity returns by relaxing the long-only constraint while maintaining full
portfolio exposure to market return and risk. An enhanced active equity strategy has short positions equal to some
percentage X% of capital (generally 20% or 30% and possibly 100% or more) and leveraged long positions equal to
(100 + X)% of capital. On a net basis, the portfolio has a 100% exposure to the market and it often has a target beta
of one. For example, in a 130-30 portfolio with initial capital of $100, an investor can sell short $30 of securities and
use the $30 proceeds along with the $100 of capital to purchase $130 of long positions. This way, the 130-30, or active
extension, portfolio structure provides fund managers with exposure to market returns unavailable to market neutral
long-short portfolios. A 130-30 strategy has two basic components: forecasts of expected returns, or alphas, for each
stock in the portfolio universe, and an estimate of the covariance matrix used to construct an efficient portfolio. With
modern prime brokerage structures (called enhanced prime brokerage), the additional long purchases can be
accomplished without borrowing on margin, allowing for the management style called enhanced active equity. As a result,
the 130-30 products were expected to reach $2 trillion by 2010 (see Tabb et al. [2007]).
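The exposures of a (100 + X)-X portfolio follow directly from the definition. The helper below is a minimal sketch with a hypothetical name; it assumes that all short sale proceeds are reinvested in long positions and that no cash buffer is held.

```python
def enhanced_active_exposures(capital, short_fraction):
    """Long, short, gross and net exposure of a (100 + X)-X portfolio.

    `short_fraction` is X expressed as a fraction of capital (0.30 for 130-30).
    """
    short = short_fraction * capital
    long = capital + short            # capital plus reinvested short proceeds
    return {"long": long, "short": short, "gross": long + short, "net": long - short}

exposures_130_30 = enhanced_active_exposures(100.0, 0.30)
```

For $100 of capital this gives $130 long and $30 short, a gross exposure of $160 and a net market exposure of $100.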

7.3.2 Some misconceptions

Since enhanced active strategies differ in some fundamental ways from other active equity strategies, both long-only
and long-short, some misconceptions about these strategies formed, which Jacobs et al. [2007a] showed not to survive
objective scrutiny. For instance, a portfolio that can sell short can underweight in larger amounts, so that meaningful
underweights of most securities can only be achieved if short selling is allowed. Hence, a 120-20 portfolio can take
more and/or larger active overweight positions than a long-only portfolio with the same amount of capital. Further,
it is not optimal to split a 120-20 equity portfolio into a long-only 100-0 portfolio and a 20-20 long-short portfolio
because the real benefits of any long-short portfolio emerge only with an integrated optimisation that considers all
long and short positions simultaneously. In that setting, Jacobs et al. [2005] developed a theoretical framework and
algorithms for integrated portfolio optimisation and showed that the portfolio must satisfy two constraints
1. the sum of the long position weights is (100 + X)%.
2. the sum of the short position weights is X%.
Short-selling, even in limited amounts, can extend portfolio underweights substantially. Opportunities for shorting
are not necessarily mirror images of the ones for buying long. It is assumed that overvaluation is more common
and larger in magnitude than undervaluation (non-linear relation). Also, price reactions to good and bad news may
not be symmetrical. An enhanced active portfolio can take short positions as large as the prime broker’s policies on
leverage allow. For example, the portfolio could short securities equal to 100% of capital and use the proceeds plus the
capital to purchase long positions, resulting in a 200-100 portfolio. Comparing an enhanced active 200-100 portfolio
with an equitized market-neutral long-short portfolio with 100% of capital in short positions, 100% in long positions,
and 100% in an equity market overlay (stock index futures, swaps, exchange traded funds (ETFs)), we see that they
are equivalent with identical active weights and identical market exposures. However, the equity overlay is passive,
whereas with an enhanced active equity portfolio, market exposure is established with individual security positions.
For each $100 of capital, the investor has $300 in stock positions to use in pursuing return and controlling risk. Further,
the cost of both strategies is about the same.
While all enhanced portfolios carry risk in terms of potential value added or lost relative to the
benchmark index return, losses on unleveraged long positions are limited because a stock price cannot drop below
zero, whereas losses on short positions are theoretically unlimited as a stock price can rise without bound. However,
this risk can be minimised by diversification and rebalancing so that losses in some positions can be mitigated by gains
in others. A 120-20 portfolio is leveraged, in that it has $140 at risk for every $100 of capital invested. The market
exposure created by the 20% in leveraged long positions is offset, however, by the 20% sold short. The portfolio has
a 100% net exposure to the market, and with appropriate risk control, a market-like level of systematic risk (a beta
of 1). The leverage and added flexibility can be expected to increase excess return and residual risk relative to the
benchmark. If the manager is skilled at security selection and portfolio construction, any incremental risk borne by the
investor should be compensated for by incremental excess return. Since EAEP have a net market exposure of 100%,
any pressures put on individual security prices should net out at the aggregate market level. Turnover in an enhanced
active equity portfolio should be roughly proportional to the leverage in the portfolio. With $140 in positions in a
120-20 portfolio, versus $100 in a long-only portfolio, turnover can be expected to be about 40% higher in the 120-20
portfolio. The portfolio optimisation process should account for expected trading costs so that a trade does not occur
unless the expected benefit in terms of risk-adjusted return outweighs the expected cost of trading.

Michaud [1993] argued that costs related to short sales are an impediment to efficiency. No investment strategy
provides a free lunch. An enhanced active equity strategy has an explicit cost, namely a stock loan fee paid to the
prime broker. The prime broker arranges for the investor to borrow the securities that are sold short and handles
the collateral for the securities’ lenders. The stock loan fee amounts to about 0.5% annually of the market value of
the shares shorted (about 10 bps of capital for a 120-20 portfolio). An enhanced active portfolio will usually incur a higher management fee
than a long-only portfolio and higher transaction costs, but it offers a more efficient way of managing equities than a
long-only strategy allows. The incremental underweights and overweights can lead to better diversification than in a
long-only portfolio. Moreover, the enhanced active portfolio may incur more trading costs than a long-only portfolio
because, as security prices change, it needs to trade to maintain the balance between its short and long positions
relative to the benchmark. For example, assume that a 120-20 portfolio experiences adverse stock price moves so
that its long positions lose $2 (prices drop) and its short positions lose $3 (prices rise), causing capital to decline
from $100 to $95. The portfolio now has long positions of $118 and short positions of $23, not the desired portfolio
proportions (120% of $95 is $114 and 20% is $19). To reestablish portfolio exposures of 120% of capital as long
positions and 20% of capital as short positions, the manager needs to rebalance by selling $4 of long positions and
using the proceeds to cover $4 of short positions. The resulting portfolio restores the 120-20 proportions because the
$114 long and $19 short are respectively 120% and 20% of the $95 capital. If an EAEP is properly constructed with
the use of integrated optimisation, the performance of the long and short positions cannot be meaningfully separated.
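The rebalancing example above can be reproduced with a short calculation. The function name is hypothetical, and the sketch assumes that capital equals the value of the long positions minus the value of the short positions (no cash buffer).

```python
def rebalance_to_target(long_value, short_value, short_fraction):
    """Trades needed to restore a (100 + X)-X portfolio after price moves.

    Returns the current capital, the amount of long positions to sell
    and the amount of short positions to cover (negative values mean
    the opposite trade is required).
    """
    capital = long_value - short_value
    target_long = (1.0 + short_fraction) * capital
    target_short = short_fraction * capital
    return {
        "capital": capital,
        "sell_long": long_value - target_long,
        "cover_short": short_value - target_short,
    }

# 120-20 portfolio after longs fall to $118 and shorts rise to $23.
trades = rebalance_to_target(118.0, 23.0, 0.20)
```

The two amounts are always equal under this assumption, so the sale of long positions exactly funds the covering of short positions.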
The unique characteristics of the 130-30 portfolio strategy suggest that existing indexes such as the S&P 500 are
inappropriate benchmarks for leveraged dynamic portfolios. Lo et al. [2008] provided a new benchmark incorporating
the same leverage constraints and the same portfolio construction, but which is otherwise transparent, investable, and
passive. They used only information available prior to each rebalancing date to formulate the portfolio weights and
obtained a dynamic trading portfolio requiring monthly rebalancing. The introduction of short sales and leverage into
the investment process led to dynamic indexes capable of capturing time-varying characteristics.

7.3.3 Some benefits

Recently, the long-short debate has focused on whether efficiency gains result from relaxing the long-only constraint. Brush [1997] showed that adding a long-short strategy to a long strategy expands the mean-variance
efficient frontier, provided that long-short strategies have positive expected alphas. Grinold et al. [2000b] showed
that information ratios (IR) decline when one moves from a long-short to a long-only strategy. Jacobs et al. [1998]
[1999] further elaborated on the loss in efficiency occurring as a result of the long-only constraint. Martielli [2005]
and Jacobs and Levy [2006] provided an excellent practical perspective on the mechanics of enhanced active equity
portfolio construction and a number of operational considerations. They compared the enhanced active equity portfolio
(EAEP) with traditional long-only passive and active approaches to portfolio management as well as other long-short
approaches including market-neutral and equitized long-short. EAEPs are expected to outperform long-only portfolios
based on comparable insights. They afford managers greater flexibility in portfolio construction, allowing for fuller
exploitation of investment insights. They also provide managers and investors with a wider choice of risk-return tradeoffs. The advantages of enhanced active equity over equitized long-short strategies are summarised in Jacobs et al.
[2007b].
Clarke et al. [2002] developed a framework for measuring the impact of constraints on the value added by and
the performance analysis of constrained portfolios. Further, Clarke et al. [2004] found that short sale constraints
in a long-only portfolio cause the most significant reduction in portfolio efficiency. They showed that lifting this
constraint is critical for improving the information transferred from stock selection models to active portfolio weights.
Sorensen et al. [2007] used numerical simulations of long-short portfolios to demonstrate the net benefits of shorting
and to compute the optimal degree of shorting as a function of alpha (manager skill), desired tracking error (risk
target), turnover, leverage, and trading costs. They also found that there was no universal optimal level of short selling
in an active extension portfolio, but that level varied according to different factors and market conditions. Johnson
et al. [2007] further emphasised the loss in efficiency from the long-only constraint as well as the importance of
the concerted selection of gearing and risk in the execution of long-short portfolios. Adopting several simplifying
assumptions regarding the security covariance matrix and the concentration profile of the benchmark, Clarke et al.
[2008] derived an equation that shows how the expected short weight for a security depends on the relative size of the
security’s benchmark weight and its assigned active weight in the absence of constraints. They argue that to maintain
a constant level of active risk, the long-short ratio should be allowed to vary over time to accommodate changes in
individual security risk, security correlation, and benchmark weight concentration.

7.3.4 The enhanced prime brokerage structures

As explained by Jacobs et al. [2006], with a traditional margin account, the lenders of any securities sold short must
be provided with collateral at least equal to the current value of the securities (see details in Section 7.2.7). When the
securities are first borrowed, the proceeds from the short sale usually serve as this collateral. As the short positions
subsequently rise or fall in value, the investor’s account provides to or receives from the securities’ lenders cash equal
to the change in value. To avoid the need to borrow money from the broker to meet these collateral demands, the
account usually maintains a cash buffer. Market-neutral long-short portfolios have traditionally been managed in a
margin account, with a cash buffer of 10% typically maintained to meet the daily marks on the short positions. Long
positions may sometimes need to be sold to replenish the cash buffer (without earning investment profits).
With the enhanced brokerage structures available today, the investor’s account must have sufficient equity to meet
the broker’s maintenance margin requirements, generally 100% of the value of the shares sold short plus some additional percentage determined by the broker. This collateral requirement is usually covered by the long positions.
The investor does not have to meet cash marks to market on the short positions. The broker covers those needs and is
compensated by the stock loan fee. Also, dividends received on long positions can be expected to more than offset the
amount the account has to pay to reimburse the securities’ lenders for dividends on the short positions. The investor
thus has little need for a cash buffer in the account. An enhanced active portfolio will generally retain only a small
amount of cash, similar to the frictional cash retained in a long-only portfolio.
More formally, the enhanced prime brokerage structures allow investors to establish a stock loan account with a
broker where the investor is not a customer of the prime broker, as would be the case with a regular margin account,
but rather a counterparty in the stock loan transaction³. This is an important distinction for at least four reasons:
1. Investors can use the stock loan account to borrow directly the shares they want to sell short. The shares
the investor holds long serve as collateral for the shares borrowed. The broker arranges the collateral for the
securities’ lenders, providing cash, cash equivalents, securities, or letters of credit. Hence, the proceeds from
the short sales are available to the investor to purchase securities long.
2. The shares borrowed are collateralized by securities the investor holds long, rather than by the short sale proceeds, eliminating the need for a cash buffer. All the proceeds of short sale and any other available cash can thus
be redirected toward long purchases.
3. A stock loan account in contrast to a margin account provides critical benefits for a tax-exempt investor. The
long positions established in excess of the investor’s capital are financed by the proceeds from the investor’s sale
of short positions. The longs are not purchased with borrowed funds.
4. The investor being a counterparty in a stock loan account, the investor’s borrowing of shares to sell short is not
subject to Federal Reserve Board Regulation T (limits on leverage). Instead, the investor’s leverage is limited
by the broker’s own internal lending policies.
³ To establish a stock loan account with a prime broker, the manager must meet the criteria for a Qualified Professional Asset Manager. For a
registered investment advisor, it means more than $85 million of client assets under management and $1 million of shareholders’ equity.

In exchange for its lending services (arranging for the shares to borrow and handling the collateral), the prime broker
charges an annual fee equal to about 0.50% of the market value of the shares shorted (fees may be higher for
harder-to-borrow shares or smaller accounts). For a 120-20 portfolio with 20% of capital shorted, the fee as a percentage of
capital is thus about 0.10%. Generally, the broker also obtains access to the shares the investor holds long, up to the
dollar amount the investor has sold short, without paying a lending fee to the investor. Hence, the broker can lend
these shares to other investors to sell short, and in turn, the investor can borrow the shares the broker can hypothecate
from other investors, as well as the shares the broker holds in its own accounts and the shares it can borrow from other
lenders.

7.4 Measuring the efficiency of portfolio implementation

7.4.1 Measures of efficiency

There are a number of studies measuring the efficiency of portfolio implementation, such as Grinold [1989], who
introduced the Fundamental Law (FL) of active management, given by the equation
\[ IR = IC \cdot \sqrt{N} \]
where IR is the observed information ratio given in Equation (7.2.1), IC is the information coefficient (a measure of
manager skill) given by the correlation of forecast security returns with the subsequent realised security returns, and
N is the number of securities in the investment universe. Even though the FL is an approximation, the main intuition
is that returns are a function of information level, breadth of investment universe and portfolio risk. The law was
extended by Clarke et al. [2002] who introduced the idea of transfer coefficient (TC) to measure the efficiency of
portfolio implementation. It is a measure of how effectively manager information is transferred into portfolio weights.
The transfer coefficient is defined as the cross-sectional correlation of the risk-adjusted forecasts across assets and the
risk-adjusted active portfolio weights in the same assets
\[ TC = \rho\left(w_a \sigma_e, \frac{\alpha}{\sigma_e}\right) \]
where w_a σ_e is a vector of risk adjusted active weights, σ_{e,i} is the residual risk for each asset (the risk of each
asset not explained by the benchmark portfolio), and α is a vector of forecast active returns (forecast returns in excess
of benchmark related return). Hence, the TC measures the manager’s ability to invest in a way consistent with their
relative views on the assets in their investment universe. While a perfectly consistent investment portfolio has a TC of
one, any inconsistency in implementation will reduce the TC below one. Whereas the original law assumes that managers
have no restrictions on the construction of a portfolio from the information set they possess, with constraints the
equation becomes

\[ IR = TC \cdot IC \cdot \sqrt{N} \]
where T C acts as a scaling factor on the level of information. In absence of any constraints T C = 1, otherwise
it is below 1 since the constraints place limits on how efficiently managers can construct portfolios reflecting their
forecasts. This result infers that portfolio outperformance is not only driven by the the ability to forecast security
returns, but also by the ability to frame those security returns in the form of an efficient portfolio.
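As a rough illustration of the two formulas above, the following sketch computes a transfer coefficient and the implied information ratio. The function names, the simulated inputs and the crude cap on underweights are assumptions made for this example only. The unconstrained weights are taken proportional to $\alpha / \sigma_e^2$, which is the choice that makes the two risk-adjusted vectors perfectly correlated.

```python
import numpy as np

def transfer_coefficient(active_weights, alphas, residual_risk):
    """Correlation of risk-adjusted active weights with risk-adjusted forecasts."""
    w = np.asarray(active_weights, dtype=float)
    a = np.asarray(alphas, dtype=float)
    s = np.asarray(residual_risk, dtype=float)
    return float(np.corrcoef(w * s, a / s)[0, 1])

def information_ratio(tc, ic, n):
    """Generalised fundamental law: IR = TC * IC * sqrt(N)."""
    return tc * ic * np.sqrt(n)

# Hypothetical inputs, for illustration only.
rng = np.random.default_rng(0)
n_assets = 50
sigma_e = rng.uniform(0.15, 0.35, n_assets)              # residual risk per asset
alpha = 0.05 * sigma_e * rng.standard_normal(n_assets)   # forecast active returns

# Assumption: unconstrained active weights proportional to alpha / sigma_e^2.
w_free = alpha / sigma_e**2
# Assumption: a crude limit on underweights stands in for a long-only style constraint.
w_capped = np.maximum(w_free, -0.1 * np.abs(w_free).max())

tc_free = transfer_coefficient(w_free, alpha, sigma_e)
tc_capped = transfer_coefficient(w_capped, alpha, sigma_e)
ir_free = information_ratio(tc_free, ic=0.05, n=n_assets)
ir_capped = information_ratio(tc_capped, ic=0.05, n=n_assets)
```

In this sketch the unconstrained portfolio has a TC of one, while the capped portfolio has a TC below one and therefore a lower implied IR for the same IC and N.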
Note, we need to know the forecasts and model estimates of residual risk σe,i for every asset i to accurately
measure the TC of a fund at a particular time. However, only the portfolio managers themselves should have access to
this kind of information. Raubenheimer [2011] proposed a simple metric, called implied transfer coefficient (ITC),
requiring only the weights of the benchmark assets and the investment weight constraints. Nonetheless, we need an
understanding of the distribution of likely security weightings in the portfolio. Grinold [1994] proposed the alpha
generation formula which was generalised by Clarke et al. [2006]
$$\alpha = IC \, \sigma^{\frac{1}{2}} S_N$$
where σ is an N × N estimated covariance matrix of the returns of the securities, and SN is an N × 1 vector
of randomised standard normal scores. This equation presents forecasted excess returns as generated by a random
normal process, scaled by skill and risk. Clarke et al. [2008] and Sorensen et al. [2007] argued that, if forecast excess
returns follow a random process, the distribution of optimal active weights resulting from these forecasts could be
derived accordingly. These simulated distributions of asset weights can provide sound justification for various weight
constraints which are appropriate for each asset and across changing investment views. They considered a simplified
two-parameter variance-covariance matrix to simulate the active weight distributions by setting all individual asset
volatilities to a single value σ, and all pairwise correlations to the same value ρ. Under these assumptions, they showed
that the unconstrained optimal active weights are normally distributed with a mean of zero and a variance proportional
to the active risk
$$w_a \sim N\left(0, \frac{\sigma_A}{\sqrt{N}} \frac{1}{\sigma \sqrt{1-\rho}}\right)$$
where $\sigma_A$ is the target active risk of the portfolio. Increasing volatility and decreasing correlation (increasing
cross-sectional variation) result in a narrower distribution of active weights and a lower probability of needing short positions
to optimally achieve a particular active risk target. All things being equal, a wider distribution of active weights in
each security is required with greater active risk targets. This increase in active weight spread is amplified
by a reduced investment universe (smaller N), since the spread scales with $1/\sqrt{N}$. Hence, given the same active risk targets, funds managed on a smaller
asset universe or benchmark will likely be more aggressive in their individual active weights per asset than
funds managed in a more diverse universe. Further, the wider the distribution of active weights, the more likely short
positions in the smaller stocks will be required in the optimal portfolio construction.

7.4.2

Factors affecting performances

Following Segara et al. [2012] we present some hypotheses relating the performance of active extension portfolios to
unique factors:
• Skill levels: Managers with higher skill levels have a greater increase in performance from relaxing the
long-only constraint. In the case where a manager has some predictive skill (IC > 0), they will be able to
transform larger active weights into greater outperformance, leading to a higher level of performance from active
extension strategies. However, short selling can only increase up to the point where the additional transaction
and financing costs outweigh the marginal benefits.
• Skew in predictive ability: Managers with a higher skew towards picking underperforming stocks can construct
active extension portfolios with higher levels of performance (see Gastineau [2008]).
• Risk constraints: Portfolios with higher tracking error targets experience greater performance increase from
relaxing the long-only constraint. Portfolio managers must face limits to the size of a tracking error which is a
function of portfolio active weights and the variance-covariance matrix. Clarke et al. [2004] found a trade-off
between the maximum TC, the target tracking error, and the level of shorting.
• Costs: An increase in costs relative to the skill the manager possesses will at some point lower the performance
of active extension strategies. Transaction, financing and stock borrowing costs increase proportionally to the
gross exposure of the fund, which is driven by the level of short selling in the portfolio.
• Volatility: Higher market volatility will increase the performance of active extension strategies. In volatile
markets, as greater portfolio concentration may expose a portfolio to higher risk due to lower diversification, an
active extension strategy allows for a lower risk target for the same return by using short-side information in a
portfolio with added diversification.
• Cross-sectional spread of returns: Active extension portfolios perform better in comparison to long-only
portfolios in periods where individual stock returns are more highly correlated. According to Clarke et al. [2008],
in environments of higher correlation between individual security returns, larger active positions are needed to
achieve the same target level of outperformance. Hence, a higher level of short selling will allow managers to
distribute more efficiently their higher active weights over both long and short positions in the portfolio.
• Market Conditions: The level of outperformance or underperformance of active extension portfolios is equivalent
across periods of positive or negative market returns. Having a constant 100% net market exposure and a beta of about
1 when well diversified, an active extension portfolio on average will perform in line with the broader market.

Chapter 8

Describing quantitative strategies
8.1

Time series momentum strategies

Trend-following or momentum investing is about buying assets whose price is rising and selling assets whose price
is falling. Momentum strategies in three dimensions (time-series, cross-section, trading frequency),
which are the main driver of commodity trading advisors (CTAs) (see Hurst et al. [2010]), were extensively studied
and reported to present strong return continuation patterns across different portfolio rebalancing frequencies with high
Sharpe ratio (see Jegadeesh et al. [2001], Moskowitz et al. [2012], Baltas et al. [2012a]). Time-series momentum
refers to the trading strategy that results from the aggregation of a number of univariate momentum strategies on a
volatility-adjusted basis. As opposed to the cross-sectional momentum strategy which is constructed as a long-short
zero-cost portfolio of securities with the best and worst relative performance during the lookback period, the univariate
time-series momentum strategy (UTMS) relies heavily on the serial correlation/predictability of the asset’s return
series. Moskowitz et al. [2012] found strong positive predictability from a security’s own past returns across the
nearly five dozen futures contracts and several major asset classes studied over the last 25 years. They found that the
past 12-month excess return of each instrument is a positive predictor of its future return. This time series momentum
effect persists for about a year before partially reversing. Baltas et al. [2012a] showed that time-series momentum
strategies have high explanatory power in the time-series of CTA returns. They further documented the existence
of strong time-series momentum effects across monthly, weekly and daily frequencies, and confirmed that strategies
at different frequencies have low correlation between each other, capturing distinct patterns. This dependence on
strong autocorrelation in the individual return series of the contracts poses a substantial challenge to the random walk
hypothesis and to market efficiency, which was explained by rational and behavioural finance. Using intraday data,
Baltas et al. [2012b] explored the profitability of time-series momentum strategies focusing on the momentum trading
signals and on the volatility estimation. Results showed that the information content of the price path throughout the
lookback period can be used to provide more descriptive indicators of the intertemporal price trends and avoid imminent
price reversals. They showed empirically that the volatility adjustment of the constituents of the time-series momentum
is critical for the resulting portfolio turnover.

8.1.1

The univariate time-series strategy

Assuming predictability of some market time-series, the univariate time-series momentum strategy (UTMS) is defined
as the trading strategy that takes a long/short position on a single asset based on the trading signal ψi (., .) of the
recent asset return over a particular lookback period. We let J denote the lookback period over which the asset’s past
performance is measured and K denote the holding period. In general, both J and K are measured in months, weeks
or days depending on the rebalancing frequency of interest. We use the notation $M_J^K$ to denote monthly strategies
with a lookback and holding period of J and K months respectively. The notations $W_J^K$ and $D_J^K$ follow similarly for
weekly and daily strategies. Following Moskowitz et al. [2012] (MOP), we construct the return $Y^i_{J,K}(t)$ at time t for
the series of the ith available individual strategy as
$$Y^i_{J,K}(t) = \psi_i(t-J, t) \, R_i(t, t+K) \tag{8.1.1}$$

where ψi (t − J, t) is the particular trading signal for the ith asset which is determined during the lookback period
and in general takes values in the set {−1, 0, 1} which in turn translates to {short, inactive, long}. Note, to evaluate
the abnormal performance of these strategies, we can compute their alphas from a linear regression of returns (see
Equation (8.1.3)) where we control for passive exposures to the three major asset classes on stocks, commodities and
bonds.
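Equation (8.1.1) can be sketched in a few lines. The function names below are hypothetical, simple (not log) returns are assumed, and the sign of the past J-period return is used purely as a placeholder for the signal $\psi_i$.

```python
import numpy as np

def utms_returns(prices, J, K, signal_fn):
    """Y(t) = psi(t-J, t) * R(t, t+K) for every t where both windows fit."""
    p = np.asarray(prices, dtype=float)
    out = {}
    for t in range(J, len(p) - K):
        psi = signal_fn(p[t - J:t + 1])      # signal from the lookback window [t-J, t]
        fwd = p[t + K] / p[t] - 1.0          # holding-period simple return R(t, t+K)
        out[t] = float(psi * fwd)
    return out

def sign_of_past_return(window):
    """Placeholder signal: +1, 0 or -1 from the sign of the lookback return."""
    return float(np.sign(window[-1] / window[0] - 1.0))

# Hypothetical price path, for illustration only.
prices = [100, 102, 104, 106, 108, 110]
y = utms_returns(prices, J=2, K=1, signal_fn=sign_of_past_return)
y_down = utms_returns(prices[::-1], J=2, K=1, signal_fn=sign_of_past_return)
```

On a steadily rising or steadily falling path the strategy return is positive in both cases, since the position is taken in the direction of the past move.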

8.1.2

The momentum signals

Given the return $Y^i_{J,K}(t)$ at time t for the series of the ith available individual strategy in Equation (8.1.1), we consider
five different methodologies in order to generate momentum trading signals, all focusing on the asset performance
during the lookback period [t − J, t] (see Moskowitz et al. [2012], Baltas et al. [2012b]).

8.1.2.1

Return sign

In that setting, the time-series momentum strategy is defined as the trading strategy that takes a long/short position on
a single asset based on the sign of the recent asset return over a particular lookback period. The trading signal is given
by
ψi (t − J, t) = signi (t − J, t)
where signi (t − J, t) is the sign of the J-period past return of the ith asset. That is, a positive (negative) past return
dictates a long (short) position. The return of the time-series momentum strategy becomes
$$Y^i_{J,K}(t) = \text{sign}_i(t-J, t) \, R_i(t, t+K) = \text{sign}(R_i(t-J, t)) \, R_i(t, t+K) \tag{8.1.2}$$

8.1.2.2

Moving Average

A long (short) position is determined when the J-(month/week) lagging moving average of the price series lies below
(above) a past (month/week)’s leading moving average of the price series. Given the price level $S^i(t)$ of an instrument
at time t, we let $N_J(t)$ be the number of trading days in the period [t − J, t] and define $A_J(t)$, the average price level
during the same time period, as
$$A_J(t) = \frac{1}{N_J(t)} \sum_{j=1}^{N_J(t)} S^i(t - N_J(t) + j)$$

The trading signal at time t is determined as

$$MA(t-J, t) = \begin{cases} 1 & \text{if } A_J(t) < A_1(t) \\ -1 & \text{otherwise} \end{cases}$$

Hence, the trading strategy that takes a long/short position on a single asset based on the moving average of the price
series over a particular lookback period is
$$Y^i_{J,K}(t) = MA_i(t-J, t) \, R_i(t, t+K)$$

The idea behind the MA methodology is that when a short-term moving average of the price process lies above a
longer-term average then the asset price exhibits an upward trend and therefore a momentum investor should take a
long position. The reverse holds when the relationship between the averages changes. The comparison of the
long-term lagging MA with a short-term leading MA gives the MA methodology a market timing feature. The choice of
the past month for the short-term horizon is justified, because it captures the most recent trend breaks.
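The moving-average rule above can be illustrated with a minimal sketch. The function name and the window lengths (60 and 20 observations) are assumptions chosen for the example, not values prescribed by the text.

```python
import numpy as np

def ma_signal(prices, n_long, n_short):
    """+1 when the long-window average lies below the short-window average, else -1."""
    p = np.asarray(prices, dtype=float)
    a_long = p[-n_long:].mean()     # A_J(t): average over the last n_long observations
    a_short = p[-n_short:].mean()   # A_1(t): average over the last n_short observations
    return 1 if a_long < a_short else -1

# Hypothetical price paths, for illustration only.
rising = np.arange(1.0, 101.0)
falling = rising[::-1]
sig_up = ma_signal(rising, n_long=60, n_short=20)
sig_down = ma_signal(falling, n_long=60, n_short=20)
```

Note that, as in the definition, ties fall into the "otherwise" branch, so a perfectly flat price path produces a short signal.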
8.1.2.3

EEMD Trend Extraction

This trading signal relies on some extraction of the price trend during the lookback period. We choose to use a
recent data-driven signal processing technique, known as the Ensemble Empirical Mode Decomposition (EEMD),
which is introduced by Wu et al. [2009] and constitutes an extension of the Empirical Mode Decomposition. The
EEMD methodology decomposes a time-series of observations into a finite number of oscillating components and a
residual non-cyclical long-term trend of the original series, virtually without imposing any restrictions of stationarity
or linearity upon application. That is, the stock price process can be written as the complete summation of an arbitrary
number n of oscillating components $c_k(t)$ for k = 1, .., n and a residual long-term trend p(t)
$$S(t) = \sum_{k=1}^{n} c_k(t) + p(t)$$

The focus is on the extracted trend p(t) and therefore an upward (downward) trend during the lookback period determines a long (short) position

$$EEMD(t-J, t) = \begin{cases} 1 & \text{if } p(t) > p(t-J) \\ -1 & \text{otherwise} \end{cases}$$
8.1.2.4

Time-Trend t-statistic

Another way of capturing the trend of a price series is through fitting a linear trend on the J-month price series using
least squares. The momentum signal can then be determined based on the significance of the slope coefficient of the
fit
$$\frac{S(j)}{S(t - N_J(t))} = \alpha + \beta j + \epsilon(j), \quad j = 1, 2, .., N_J(t)$$
Estimating this model for the asset using all $N_J(t)$ trading days of the lookback period yields an estimate of the
time-trend, given by the slope coefficient β. The significance of the trend is determined by the t-statistic of β, denoted as
t(β), and the cutoff points for the long/short position of the trading signal are chosen to be +2/−2 respectively
$$TREND(t-J, t) = \begin{cases} 1 & \text{if } t(\beta) > 2 \\ -1 & \text{if } t(\beta) < -2 \\ 0 & \text{otherwise} \end{cases}$$
In order to account for potential autocorrelation and heteroskedasticity in the price process, Newey et al. [1987]
t-statistics are used. Note, the normalisation of the regressand in the above equation is done for convenience, since it allows
for cross-sectional comparison of the slope coefficient, when necessary. The t-statistic of β is of course unaffected by
such scalings.
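A minimal sketch of the TREND signal follows. As a simplifying assumption it uses the plain OLS t-statistic rather than the Newey et al. [1987] adjustment mentioned above, and it treats the first element of the price window as the normalising price $S(t - N_J(t))$. The function names and test series are hypothetical.

```python
import numpy as np

def ols_slope_tstat(y):
    """Plain OLS t-statistic of the slope in y_j = alpha + beta * j + eps_j."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    X = np.column_stack([np.ones(n), np.arange(1, n + 1, dtype=float)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    s2 = resid @ resid / (n - 2)                        # residual variance
    se_beta = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return coef[1] / se_beta

def trend_signal(prices, cutoff=2.0):
    """+1 / -1 when the slope t-statistic exceeds +cutoff / -cutoff, else 0."""
    p = np.asarray(prices, dtype=float)
    t_beta = ols_slope_tstat(p[1:] / p[0])              # normalise by the first price
    if t_beta > cutoff:
        return 1
    if t_beta < -cutoff:
        return -1
    return 0

# Hypothetical price paths: a deterministic wiggle with and without a trend.
k = np.arange(61)
wiggle = np.where(k % 2 == 0, 1.0, -1.0)
sig_up = trend_signal(100 + 0.5 * k + wiggle)
sig_down = trend_signal(100 - 0.5 * k + wiggle)
sig_flat = trend_signal(100 + wiggle)
```

A production version would replace the OLS standard error with a heteroskedasticity and autocorrelation consistent estimate, which generally widens the standard error on persistent price series.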
8.1.2.5

Statistically Meaningful Trend

Bryhn et al. [2011] study the statistical significance of a linear trend and claim that if the number of data points
is large, then a trend may be statistically significant even if the data points are very erratically scattered around the
trend line. They introduced the term of statistical meaningfulness in order to describe a trend that not only exhibits
statistical significance, but also describes the behaviour of the data to a certain degree. They showed that a trend is
informative and strong if, in addition to a significant t-statistic (or equivalently a small p-value where p is the p-value of
the slope coefficient), the $R^2$ of the linear regression exceeds 65%. Further, they consider a method providing some
sort of pre-smoothing in the data before the extraction of the trend. Hence, we split the lookback period into 4 to 10
intervals (i.e. 7 regressions per lookback period per asset) and decide upon a long/short position only if at least one of
the regressions satisfies the above criteria
$$SMT(t-J, t) = \begin{cases} 1 & \text{if } t_k(\beta) > 2 \text{ and } R_k^2 \geq 65\% \text{ for some } k \\ -1 & \text{if } t_k(\beta) < -2 \text{ and } R_k^2 \geq 65\% \text{ for some } k \\ 0 & \text{otherwise} \end{cases}$$
where k denotes the kth regression with k = 1, 2, ..., 7. Clearly, SMT is a stricter signal than TREND and therefore
would lead to more periods of inactivity.

8.1.3

The signal speed

Since the sparse activity could potentially limit the ex-post portfolio mean return, Baltas et al. [2012b] chose to
estimate for each contract and for each signal an activity-to-turnover ratio, which is called Signal Speed. It is computed
as the square root of the ratio between the time series average of the squared signal value and the time-series average
of the squared first-order difference in the signal value
$$\left(Speed(\psi)\right)^2 = \frac{E[\psi^2]}{E[(\Delta\psi)^2]} = \frac{\frac{1}{T-J} \sum_{t=1}^{T} \psi^2(t-J, t)}{\frac{1}{T-J-1} \sum_{t=1}^{T} \left(\psi(t-J, t) - \psi(t-1-J, t-1)\right)^2}$$

Clearly, the larger the signal activity and the smaller the average difference between consecutive signal values (in other
words the smoother the transition between long and short positions), the larger the signal speed. When the signals
constantly jump between long (+1) and short (−1) positions the numerator is always equal to 1. Inactive trading allows
for smoother transition between long and short positions.
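The signal speed can be computed directly from a series of signal values. The sketch below is an illustration with a hypothetical function name, and it returns infinity for a signal that never changes, which is a convention assumed here rather than one stated in the text.

```python
import numpy as np

def signal_speed(signal):
    """Square root of mean(psi^2) over mean((delta psi)^2)."""
    psi = np.asarray(signal, dtype=float)
    num = np.mean(psi ** 2)              # average squared signal value (activity)
    den = np.mean(np.diff(psi) ** 2)     # average squared first difference (turnover)
    if den == 0.0:
        return float("inf")              # assumption: a constant signal has unbounded speed
    return float(np.sqrt(num / den))

# A signal flipping every period versus one flipping once.
fast_flipping = signal_speed([1, -1, 1, -1, 1, -1])
single_switch = signal_speed([1] * 5 + [-1] * 5)
```

The constantly flipping signal has a speed of 0.5, while the signal that switches only once over ten periods has a speed of 1.5, matching the intuition that smoother signals score higher.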

8.1.4

The relative strength index

The relative strength index (RSI) is a technical indicator intended to chart the current and historical strength or weakness of a stock or market based on the closing prices of a recent trading period. The RSI is classified as a momentum
oscillator, measuring the velocity and magnitude of directional price movements. Momentum is the rate of the rise
or fall in price. The RSI computes momentum as the ratio of higher closes to lower closes: stocks which have had
more or stronger positive changes have a higher RSI than stocks which have had more or stronger negative changes.
The RSI is most typically used on a 14 day time frame, measured on a scale from 0 to 100, with high and low levels
marked at 70 and 30, respectively. For each trading period an upward change U is defined by the close being higher
than the previous close

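$$U(j\delta) = \begin{cases} S_C(j\delta) - S_C((j-1)\delta) & \text{if } S_C(j\delta) > S_C((j-1)\delta) \\ 0 & \text{otherwise} \end{cases}$$
Similarly, a downward change D is given by
$$D(j\delta) = \begin{cases} S_C((j-1)\delta) - S_C(j\delta) & \text{if } S_C((j-1)\delta) > S_C(j\delta) \\ 0 & \text{otherwise} \end{cases}$$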
The averages of U and D are calculated by using an n-period Exponential Moving Average (EMA) in the AIQ version,
whereas Wilder’s original version uses a smoothed moving average (an exponential average with smoothing factor 1/n).
The ratio of these averages is the Relative Strength Factor
$$RS = \frac{EMA(U, n)}{EMA(D, n)}$$
The EMA should be appropriately initialised with a simple average using the first n values in the price series. When
the average of D values is zero, the RS value is defined as 100. The RS factor is then converted to a Relative Strength Index
in the range [0, 100] as
$$RSI = 100 - \frac{100}{1 + RS}$$
so that when RS = 100 then RSI is close to 100 and when RS = 0 then RSI = 0. The RSI is presented on a
graph above or below the price chart. The indicator has an upper line, typically at 70, a lower line at 30 and a dashed
mid-line at 50. The in-between level is considered neutral with the 50 level being a sign of no trend. Wilder posited
that when price moves up very rapidly, at some point it is considered overbought, while when price falls very rapidly,
at some point it is considered oversold. Failure swings above 70 and below 30 on the RSI are strong indications of
market reversals. The slope of the RSI is directly proportional to the velocity of a change in the trend. Cardwell
noticed that uptrends generally traded between RSI of 40 and 80 while downtrends traded with an RSI between 60 and
20. When securities change from uptrend to downtrend and vice versa, the RSI will undergo a range shift. Bearish
divergence (between stock price and RSI) is a sign confirming an uptrend while bullish divergence is a sign confirming
a downtrend. Further, he noted that reversals are the opposite of divergence.

A variation called Cutler’s RSI is based on a simple moving average (SMA) of U and D
$$RS = \frac{SMA(U, n)}{SMA(D, n)}$$
When the EMA is used, the RSI value depends upon where in the data file its calculation is started, which is called
Data Length Dependency. In contrast, Cutler’s RSI is not data length dependent, and it returns consistent results regardless
of the length of, or the starting point within, a data file. The two measures are similar since SMA and EMA are also
similar.
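Cutler’s variant is the simplest to sketch since it needs no initialisation. The function below is a hypothetical illustration that uses only the last n price changes, and it follows the convention stated above of setting RS to 100 when the average of D is zero.

```python
import numpy as np

def cutler_rsi(closes, n=14):
    """Cutler's RSI from the last n changes in closing prices (SMA of U and D)."""
    c = np.asarray(closes, dtype=float)
    if len(c) < n + 1:
        raise ValueError("need at least n + 1 closing prices")
    delta = np.diff(c)[-n:]
    up = np.where(delta > 0, delta, 0.0)       # upward changes U
    down = np.where(delta < 0, -delta, 0.0)    # downward changes D
    avg_up, avg_down = up.mean(), down.mean()
    rs = 100.0 if avg_down == 0.0 else avg_up / avg_down
    return float(100.0 - 100.0 / (1.0 + rs))

# Hypothetical closing prices, for illustration only.
rsi_balanced = cutler_rsi([10, 11, 10, 11, 10], n=4)
rsi_falling = cutler_rsi([5, 4, 3, 2, 1], n=4)
rsi_rising = cutler_rsi([1, 2, 3, 4, 5], n=4)
```

With equal up and down moves the index sits at 50, a monotonically falling series gives 0, and a monotonically rising series gives a value just below 100 because RS is capped at 100.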

8.1.5

Regression analysis

Before constructing momentum strategies, following Moskowitz et al. [2012], we first assess the amount of return
predictability that is inherent in a series of predictors by running a pooled time-series cross-sectional regression of the
contemporaneous standardised return on a lagged return predictor. We regress the excess return $r_t^i$ for instrument i in
month t on its return lagged h months/weeks/days, where both returns are scaled by their ex-ante volatilities $\sigma_t^i$
$$\frac{r_t^i}{\sigma_{t-1}^i} = \alpha + \beta_h Z(t-h) + \epsilon_t^i \tag{8.1.3}$$

where the regressor Z(t−h) is chosen from a broad collection of momentum-related quantities. Note that all regressor
choices are normalised, in order to allow for the pooling across the instruments. For example, we can consider the
regression
$$\frac{r_t^i}{\sigma_{t-1}^i} = \alpha + \beta_h \frac{r_{t-h}^i}{\sigma_{t-h-1}^i} + \epsilon_t^i$$

Given the vast differences in volatilities we divide all returns by their volatility to put them on the same scale. This
is similar to using Generalized Least Squares instead of Ordinary Least Squares (OLS). The regressions are run using
lags of h = 1, 2, .., 60 months/weeks/days. Another way of looking at time series predictability is to simply focus
only on the sign of the past excess return underlying our trading strategies. In a regression setting, this strategy can be
captured using the following specification:
$$\frac{r_t^i}{\sigma_{t-1}^i} = \alpha + \beta_h \, \text{sign}(r_{t-h}^i) + \epsilon_t^i \tag{8.1.4}$$
where
$$\text{sign}(a) = \begin{cases} 1 & \text{if } a \geq 0 \\ -1 & \text{otherwise} \end{cases}$$

All possible choices are comparable across the various contracts, and refer to a single period (J = 1) to avoid serial
autocorrelation in the error term $\epsilon_t^i$. Equation (8.1.4) is estimated for each lag h and regressor Z by pooling all the
underlyings together. To allow for the pooling across instruments, all regressor choices are normalised, and the asset
returns are normalised. The quantity of interest in these regressions is the t-statistic of the coefficient βh for each lag
h. Large and significant t-statistics essentially support the hypothesis of time-series return predictability. The results
are similar across the two regression specifications: strong return continuation for the first year and weaker reversals
for the next 4 years (see Moskowitz et al. [2012], Baltas et al. [2012b]).
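The pooled regressions can be sketched as follows. This is a simplified illustration: the input is assumed to be a matrix of returns already scaled by their ex-ante volatilities, the t-statistic uses plain OLS standard errors (the cited studies may use more robust ones), and the simulated AR(1) data are hypothetical.

```python
import numpy as np

def pooled_lag_tstat(scaled_returns, h, use_sign=False):
    """Pooled OLS of scaled returns on their own lag h; returns (beta_h, t-stat)."""
    z = np.asarray(scaled_returns, dtype=float)   # shape (T, M): r_t^i / sigma_{t-1}^i
    y = z[h:].ravel()                             # left-hand side, pooled over t and i
    x = z[:-h].ravel()                            # the same series lagged by h periods
    if use_sign:
        x = np.where(x >= 0, 1.0, -1.0)           # sign regressor of Equation (8.1.4)
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    s2 = resid @ resid / (len(y) - 2)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return float(coef[1]), float(coef[1] / se)

# Hypothetical data: five instruments with positive first-order autocorrelation.
rng = np.random.default_rng(1)
T, M, phi = 500, 5, 0.3
z = np.zeros((T, M))
for t in range(1, T):
    z[t] = phi * z[t - 1] + rng.standard_normal(M)

beta_1, t_1 = pooled_lag_tstat(z, h=1)
beta_sign, t_sign = pooled_lag_tstat(z, h=1, use_sign=True)
```

Looping h over 1, .., 60 and plotting the t-statistics against the lag gives the usual picture of continuation followed by reversal, if the data exhibit it.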

8.1.6

The momentum profitability

Subsequently, we construct the return series of the aggregate time-series momentum strategy over the investment
horizon as the inverse-volatility weighted average return of all available individual momentum strategies
$$R_J^K(t) = \frac{1}{M_t} \sum_{i=1}^{M_t} \frac{C_{sf}}{\sigma_i(t, D)} Y^i_{J,K}(t) \tag{8.1.5}$$

where Mt is the number of available assets at time t, and where Csf is a scaling factor and σi (t, D) is an estimate at
time t of the realised volatility of the ith asset computed using a window of the past D trading days. See Section (3.4)
for the description of a family of volatility estimators. This risk-adjustment (use of standardised returns) across instruments allows for a direct comparison and combination of various asset classes with very different return distributions
in a single portfolio.
Assuming that the individual time series strategies are mutually independent and that the volatility process is
persistent, then the conditional variance of the return can be approximated by
$$Var_t(Y^i_{J,K}(t)) \approx \sigma_i^2(t, D)$$
since $\psi_i^2(t-J, t) = 1$ if we further assume that the frequency of the trading periods when $\psi_i(t-J, t) = 0$ is relatively
small. As a result, the conditional variance of the portfolio is approximated by
$$Var_t(R_J^K(t)) \approx \frac{1}{M_t^2} \sum_{i=1}^{M_t} C_{sf}^2 = \frac{C_{sf}^2}{M_t}$$

This approximation ignores any covariation among the individual momentum strategies as well as any potential
changes in the individual volatility processes but it can be used to define the scaling factor $C_{sf}$. For example, we
can consider D = 60 trading days. The scaling factor $C_{sf}$ = 40% is used by MOP in order to achieve an ex-ante
volatility equal to 40% for each individual strategy. This is because it results in an ex-post annualised volatility of 12%
for their $M_{12}^{1}$ strategy, roughly matching the level of volatility of several risk factors in their sample period. Baltas et al.
[2012b] considered a rolling window of D = 30 days and a scaling factor $C_{sf} = 10\% \times \sqrt{M_t}$ to achieve an ex-ante
volatility equal to 10%. Regarding the ex-ante volatility adjustment (risk-adjustment), it must be noted that it is necessary
in order to allow us to combine in a single portfolio various contracts of different asset classes with different
volatility profiles. Recently, Barroso and Santa-Clara [2012] revisited the equity cross-sectional momentum strategy
and scaled similarly the winners-minus-losers portfolio in order to form what they call a risk-managed momentum
strategy.
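Equation (8.1.5) and the variance approximation can be illustrated with a short sketch. The function names and the three-asset inputs are hypothetical, and the approximate volatility relies on the independence assumptions stated above.

```python
import numpy as np

def tsmom_portfolio_return(signals, holding_returns, vols, c_sf=0.40):
    """Inverse-volatility weighted average of the individual strategy returns."""
    psi = np.asarray(signals, dtype=float)
    r = np.asarray(holding_returns, dtype=float)
    sigma = np.asarray(vols, dtype=float)
    y = psi * r                                   # individual strategy returns Y^i
    return float(np.mean(c_sf / sigma * y))

def approx_ex_ante_vol(c_sf, n_assets):
    """Approximate portfolio volatility C_sf / sqrt(M_t) under independence."""
    return float(c_sf / np.sqrt(n_assets))

# Hypothetical inputs: three assets with very different volatilities.
port = tsmom_portfolio_return([1, -1, 1], [0.02, -0.01, 0.03], [0.20, 0.10, 0.40])
vol_scaled = approx_ex_ante_vol(0.10 * np.sqrt(9), 9)   # C_sf = 10% x sqrt(M_t)
```

With $C_{sf} = 10\% \times \sqrt{M_t}$ the approximate portfolio volatility is 10% whatever the number of assets, which is the point of that choice of scaling factor.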
Instead of forming a new momentum portfolio every K periods, when the previous portfolio is unwound, we can
follow the overlapping methodology of Jegadeesh et al. [2001], and perform portfolio rebalancing at the end of each
month/week/day. The respective monthly/weekly/daily return is then computed as the equally-weighted average across
the K active portfolios during the period of interest. Based on this technique, only $\frac{1}{K}$-th of the portfolio is rebalanced
every month/week/day. In order to assess the profitability of the portfolio, we consider, for each momentum trading
signal, various out-of-sample performance statistics for the (J, K) time-series momentum strategy. The statistics
are all annualised and include the mean portfolio return along with the respective Newey et al. [1987] t-statistic,
the portfolio volatility, the dollar growth, the Sharpe ratio and the downside risk Sharpe ratio (see Ziemba [2005]).
To analyse the performance of the strategies, we can then plot the annualised Sharpe ratios of these strategies for
each stock/futures contract. Most studies showed that every single stock/futures contract exhibits positive
predictability from past one-year returns. These studies also regressed the strategy for each security on the strategy of
always being long, and they got a positive alpha in 90% of the cases. Thus, a time series strategy provides additional
returns over and above a passive long position for most instruments.

8.2

Factors analysis

Some theoretical models have been proposed as a framework for researching connections between asset returns and
macroeconomic factors. The Arbitrage Pricing Theory (APT) developed by Ross [1976] is one of the theories that
relate stock returns to macroeconomic state variables. It argues that the expected future return of a stock can be
modelled as a linear function of a variety of economic state variables or some theoretical market indices, where
sensitivity of stock returns to changes in every variable is indicated by an economic state variable-specific coefficient.
The rate of return provided by the model can then be used to price the stock correctly. Thus, the current price of a
stock should be equal to the expected price at the end of the period discounted by the discount rate that is suggested
by the model. The theory suggests that if the current price of a stock diverges from the theoretical price then arbitrage
should bring it back into equilibrium. Roll and Ross [1980] argued that the Arbitrage Pricing Theory is an attractive
pricing model to researchers because of its modest assumptions and pleasing implications in comparison with the
Capital Asset Pricing Model. The vast majority of papers that used APT as a framework attempted to model a
short-run relation between the prices of equities and financial and economic variables. They used differenced variables and
presumed that the variables were stationary. However, evidence suggests that in the short-run equity prices deviate
from their fundamental values and are also driven by non-fundamentals. In this section, we are going to model these
non-fundamental values by assuming they are mean-reverting.
For simplicity, we assume the market is composed of two types of agents, namely, the indexers (mutual fund
managers and long-only managers) and the market-neutral agents. The former seek exposure to the entire market or to
specific industry sectors with the goal of being generally long the market or sector with appropriate weightings in each
stock, whereas the latter seek uncorrelated returns with the market (alpha). In this market, we are going to present a
systematic approach to statistical arbitrage defined in Section (7.2.4), and construct market-neutral portfolio strategies
based on mean-reversion. This is done by decomposing stock returns into systematic and idiosyncratic components
using different definitions of risk factors.

8.2.1

Presenting the factor model

One of the main difficulties in multivariate analysis is the problem of dimensionality, forcing practitioners to use
simplifying methods. From an empirical viewpoint, multivariate data often exhibit similar patterns indicating the
existence of common structure hidden in the data. Factor analysis is one of those simplifying methods available to the
portfolio manager. It aims at identifying a few factors that can account for most of the variations in the covariance
or correlation of the data. Traditional factor analysis assumes that the data have no serial correlations. While this
assumption is often violated by financial data taken with frequency less than or equal to a week, it becomes more
reasonable for asset returns with lower frequencies (monthly returns of stocks or market indexes). If the assumption
is violated, one can use parametric models introduced in Section (5.5.2) to remove the linear dynamic dependence of
the data and apply factor analysis to the residual series.

Considering factor analysis based on the orthogonal factor model, we let $r = (r_1, ..., r_N)^\top$ be the N-dimensional log
returns, and assume that the mean and covariance matrix of r are µ and Σ. For a return series, it is equivalent to
requiring that r is weakly stationary. The factor model postulates that r is linearly dependent on a few unobservable
random variables $F = (f_1, f_2, ..., f_m)^\top$ and N additional noises $\epsilon = (\epsilon_1, .., \epsilon_N)^\top$ where m < N, $f_i$ are the common
factors, and $\epsilon_i$ are the errors. The factor model is given by
$$\begin{aligned}
r_1 - \mu_1 &= l_{11} f_1 + ... + l_{1m} f_m + \epsilon_1 \\
r_2 - \mu_2 &= l_{21} f_1 + ... + l_{2m} f_m + \epsilon_2 \\
... &= ... \\
r_N - \mu_N &= l_{N1} f_1 + ... + l_{Nm} f_m + \epsilon_N
\end{aligned}$$
where $r_i - \mu_i$ is the ith mean-corrected value. Equivalently in matrix notation, we get
$$r - \mu = LF + \epsilon \tag{8.2.6}$$

where $L = [l_{ij}]_{N \times m}$ is the matrix of factor loadings, $l_{ij}$ is the loading of the ith variable on the jth factor, and $\epsilon_i$ is
the specific error of $r_i$. The above equation is not a multivariate linear regression model as in Section (5.5.2), even
though it has a similar appearance, since the m factors $f_i$ and the N errors $\epsilon_i$ are unobservable. The factor model is
an orthogonal factor model if it satisfies the following assumptions
1. $E[F] = 0$ and $Cov(F) = I_m$, the m × m identity matrix
2. $E[\epsilon] = 0$ and $Cov(\epsilon) = \Psi = diag\{\psi_1, .., \psi_N\}$, that is, Ψ is an N × N diagonal matrix
3. F and ε are independent so that $Cov(F, \epsilon) = E[F \epsilon^\top] = 0_{m \times N}$
Under these assumptions, we get
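$$\begin{aligned}
\Sigma &= Cov(r) = E[(r - \mu)(r - \mu)^\top] = E[(LF + \epsilon)(LF + \epsilon)^\top] \\
&= L E[F F^\top] L^\top + E[\epsilon F^\top] L^\top + L E[F \epsilon^\top] + E[\epsilon \epsilon^\top] \\
&= L L^\top + \Psi
\end{aligned}$$
and
$$Cov(r, F) = E[(r - \mu) F^\top] = L E[F F^\top] + E[\epsilon F^\top] = L$$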
Using these two equations, we get
$$\begin{aligned}
Var(r_i) &= l_{i1}^2 + ... + l_{im}^2 + \psi_i \\
Cov(r_i, r_j) &= l_{i1} l_{j1} + ... + l_{im} l_{jm} \\
Cov(r_i, f_j) &= l_{ij}
\end{aligned}$$
The quantity $l_{i1}^2 + ... + l_{im}^2$, called the communality, is the portion of the variance of $r_i$ contributed by the m common
factors, while the remaining portion $\psi_i$ of the variance of $r_i$ is called the uniqueness or specific variance. The orthogonal
factor representation of a random variable r is not unique, and in some cases does not exist. For any m × m
orthogonal matrix P satisfying $P P^\top = P^\top P = I$, we let $L^* = LP$ and $F^* = P^\top F$ and get

r − µ = LF +  = LP P > F +  = L∗ F ∗ + 

285

(8.2.7)

Quantitative Analytics

with $E[F^*] = 0$ and $Cov(F^*) = P^\top Cov(F) P = P^\top P = I$. Thus, $L^*$ and $F^*$ form another orthogonal factor
model for $r$. As a result, the meaning of the factor loadings is arbitrary, but one can perform rotations to find common
factors with nice interpretations. Since $P$ is an orthogonal matrix, the transformation $F^* = P^\top F$ is a rotation in the
m-dimensional space.
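The rotation invariance can be checked numerically. The following is a minimal sketch in plain Python, assuming a toy model with N = 3 assets and m = 2 factors; the loading values, the specific variances and the helper names (`implied_covariance`, `communalities`) are illustrative assumptions and do not come from the text.

```python
import math

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def implied_covariance(L, psi):
    """Covariance implied by the orthogonal factor model: L L^T + diag(psi)."""
    LLt = matmul(L, transpose(L))
    n = len(L)
    return [[LLt[i][j] + (psi[i] if i == j else 0.0) for j in range(n)]
            for i in range(n)]

def communalities(L):
    """Communality of each asset: sum of its squared loadings."""
    return [sum(l * l for l in row) for row in L]

# Hypothetical loadings (3 assets, 2 factors) and specific variances
L = [[0.8, 0.3], [0.6, 0.5], [0.4, 0.7]]
psi = [0.2, 0.3, 0.25]

# An arbitrary 2 x 2 rotation matrix P (orthogonal by construction)
theta = 0.7
P = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
L_star = matmul(L, P)  # rotated loadings L* = L P

sigma = implied_covariance(L, psi)
sigma_star = implied_covariance(L_star, psi)
max_diff = max(abs(sigma[i][j] - sigma_star[i][j])
               for i in range(3) for j in range(3))
print("largest change in implied covariance:", max_diff)
print("communalities before rotation:", communalities(L))
print("communalities after rotation: ", communalities(L_star))
```

The individual loadings change under the rotation, while the implied covariance and the communalities agree up to floating-point error.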
One can estimate the orthogonal factor model with maximum likelihood methods under the assumption of normal density and a prespecified number of common factors. If the common factors $F$ and the specific factors $\epsilon$ are jointly normal and the number of common factors is given a priori, then $r$ is multivariate normal with mean $\mu$ and covariance matrix $\Sigma_r = L L^\top + \Psi$. One can then use the MLM to get estimates of $L$ and $\Psi$ under the constraint $L^\top \Psi^{-1} L = \Delta$, which is a diagonal matrix.
Alternatively, one can use PCA, which requires neither the normality assumption on the data nor the prespecification of the number of common factors, but the solution is often an approximation. Following the description of PCA in Section (5.5.3), we let $(\hat{\lambda}_1, \hat{e}_1), .., (\hat{\lambda}_N, \hat{e}_N)$ be pairs of the eigenvalues and eigenvectors of the sample covariance matrix $\hat{\Sigma}_r$, where $\hat{\lambda}_1 \geq \hat{\lambda}_2 \geq ... \geq \hat{\lambda}_N$. Letting $m < N$ be the number of common factors, the matrix of factor loadings is given by
$$\hat{L} = [\hat{l}_{ij}] = \left[ \sqrt{\hat{\lambda}_1}\, \hat{e}_1 \,\Big|\, \sqrt{\hat{\lambda}_2}\, \hat{e}_2 \,\Big|\, ... \,\Big|\, \sqrt{\hat{\lambda}_m}\, \hat{e}_m \right]$$
The estimated specific variances are the diagonal elements of the matrix $\hat{\Sigma}_r - \hat{L}\hat{L}^\top$, that is $\hat{\Psi} = diag\{\hat{\psi}_1, .., \hat{\psi}_N\}$, where $\hat{\psi}_i = \hat{\sigma}_{ii,r} - \sum_{j=1}^m \hat{l}_{ij}^2$, and $\hat{\sigma}_{ii,r}$ is the (i, i)th element of $\hat{\Sigma}_r$. The communalities are estimated by
$$\hat{c}_i^2 = \hat{l}_{i1}^2 + ... + \hat{l}_{im}^2$$
and the error matrix due to approximation is
$$\hat{\Sigma}_r - (\hat{L}\hat{L}^\top + \hat{\Psi})$$
which we would like to be close to zero. One can show that the sum of squared elements of the above error matrix is less than or equal to $\hat{\lambda}_{m+1}^2 + ... + \hat{\lambda}_N^2$ so that the approximation error is bounded by the sum of squares of the neglected eigenvalues. Further, the estimated factor loadings based on PCA do not change as the number of common factors m is increased.
For any $m \times m$ orthogonal matrix $P$, the random variable $r$ can be represented with Equation (8.2.7) and we get
$$L L^\top + \Psi = L P P^\top L^\top + \Psi = L^* (L^*)^\top + \Psi$$
so that the communalities and specific variances remain unchanged under an orthogonal transformation. One would like to find an orthogonal matrix $P$ to transform the factor model so that the common factors have some interpretations. There are infinitely many possible factor rotations available and some authors proposed criteria to select the best possible rotation (see Kaiser [1958]).
Factor analysis searches for common factors to explain the variabilities of the returns. One must make sure that the assumption of no serial correlations in the data is satisfied, which can be done with the multivariate Portmanteau statistics. If serial correlations are found, one can build a VARMA model to remove the dynamic dependence in the data and apply the factor analysis to the residual series.


8.2.2

Some trading applications

8.2.2.1

Pairs-trading

Following the description of pairs trading in Section (7.2.3) we let stocks P and Q be in the same industry or have similar characteristics, and expect the returns of the two stocks to track each other after controlling for beta. Accordingly,
if Pt and Qt denote the corresponding price time series, then we can model the system as
$$\ln \frac{P_t}{P_{t_0}} = \alpha (t - t_0) + \beta \ln \frac{Q_t}{Q_{t_0}} + X_t$$
or, in its differential version
$$\frac{dP_t}{P_t} = \alpha dt + \beta \frac{dQ_t}{Q_t} + dX_t \tag{8.2.8}$$

where Xt is a stationary, or mean-reverting process called the cointegration residual, or simply the residual (see
Pole [2007]). In many cases of interest, the drift α is small compared to the fluctuations of Xt and can therefore
be neglected. This means that, after controlling for beta, the long-short portfolio oscillates near some statistical
equilibrium (see Avellaneda et al. [2008]). The model in Equation (8.2.8) suggests a contrarian investment strategy in
which we go long 1 dollar of stock P and short β dollars of stock Q if Xt is small and, conversely, go short P and long
Q if Xt is large. The portfolio is expected to produce a positive return as valuations converge. The mean-reversion
paradigm is typically associated with market over-reaction: assets are temporarily under or over-priced with respect
to one or several reference securities (see Lo et al. [1990]).
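As an illustration, a discrete-time version of the pair model can be estimated by regressing the returns of P on the returns of Q and cumulating the regression residuals into a proxy for $X_t$. The sketch below assumes plain ordinary least squares on toy return series; the function names and the data are hypothetical.

```python
def ols_alpha_beta(y, x):
    """Ordinary least squares fit of y = alpha + beta * x + residual."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta

def cumulative_residual(y, x):
    """Cumulate the regression residuals into a discrete proxy for X_t."""
    alpha, beta = ols_alpha_beta(y, x)
    level, path = 0.0, []
    for a, b in zip(x, y):
        level += b - alpha - beta * a
        path.append(level)
    return alpha, beta, path

# Toy daily returns (assumed values): Q is the reference stock, P the traded one
returns_q = [0.010, -0.020, 0.015, 0.005, -0.010, 0.012]
returns_p = [0.017, -0.028, 0.020, 0.010, -0.017, 0.016]
alpha, beta, X = cumulative_residual(returns_p, returns_q)
print("alpha per period:", alpha, "beta:", beta)
print("residual path X:", X)
```

Because the regression includes an intercept, the residuals sum to zero over the estimation window, so the last value of the cumulated path is zero by construction; the trading signal therefore has to come from where the path sits relative to its fitted equilibrium, not from its terminal value alone.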
Generalised pairs-trading, or trading groups of stocks against other groups of stocks, is a natural extension of
pairs-trading. The role of the stock Q would be played by an index or exchange-traded fund (ETF) and P would be
an arbitrary stock in the portfolio or sector of activity. The analysis of the residuals, based on the magnitude of $X_t$,
suggests typically that some stocks are cheap with respect to the index or sector, others expensive and others fairly
priced. A generalised pairs trading book, or statistical arbitrage book, consists of a collection of pair trades of stocks
relative to the ETF (or, more generally, factors that explain the systematic stock returns). In some cases, an individual
stock may be held long against a short position in ETF, and in others we would short the stock and go long the ETF.
Remark 8.2.1 Due to netting of long and short positions, we expect that the net position in ETFs will represent a
small fraction of the total holdings. The trading book will look therefore like a long-short portfolio of single stocks.
That is, given a set of stocks S = {S1 , S2 , .., SN } we can go long a subset of stocks Lo with hi dollars invested per
ith stock for i ∈ Lo and short the subset So with hi βi dollars of index or ETF for each stock. Conversely, we can go
short a subset of stocks So with hj dollars invested per jth stock for j ∈ So and long the subset Lo with hj βj dollars
of index or ETF for each stock. We can construct the portfolios such that the net position in ETFs will represent a small fraction of the total holdings, or even zero by setting $\sum_{k=1}^N h_k \beta_k = 0$.
8.2.2.2

Decomposing stock returns

Following the concept of pairs-trading in Section (8.2.2.1), the analysis of residuals will be our starting point. Signals
will be based on relative-value pricing within a sector or a group of peers, by decomposing stock returns into systematic
and idiosyncratic components and statistically modelling the idiosyncratic part. Here, the emphasis is on the residual
that remains after the decomposition is done and not the choice of a set of risk-factors. Given the d-day period discrete return $R_{t-d,t} = \frac{\nabla_d S_t}{S_{t-d}}$ of the underlying process where $\nabla_d S_t = S_t - S_{t-d}$ with period d, we want to explain or predict stock returns. We saw in Section (8.2.1) that one approach is to explain the returns/prices based on some statistical factors
$$R = \sum_{j=1}^m \beta_j F_j + \epsilon , \quad Corr(F_j, \epsilon) = 0 , \quad j = 1, .., m$$
where $F_j$ is the explanatory factor, $\beta_j$ is the factor loading such that $\sum_{j=1}^m \beta_j F_j$ is the explained or systematic portion and $\epsilon$ is the residual, or idiosyncratic portion. For example, the CAPM described in Section (2.3.1.2) considers a single explanatory factor called the market portfolio
$$R = \beta F + \epsilon , \quad Cov(R, \epsilon) = 0 , \quad \langle \epsilon \rangle = 0$$
where F is the returns of a broad-market index (market portfolio). The model implies that if the market is efficient, or
in equilibrium, investors will not make money (systematically) by picking individual stocks and shorting the index or
vice-versa (assuming uncorrelated residuals) (Sharpe [1964], Lintner [1965]). However, markets may not be efficient,
and the residuals may be correlated. In that case, we need additional explanatory factors F to model stock returns (see
Ross [1976]). In the multi-factor models above (APT), the factors represent industry returns so that
$$\langle R \rangle = \sum_{j=1}^m \beta_j \langle F_j \rangle$$

where brackets denote averaging over different stocks. Thus, the problem of correlations of residuals (idiosyncratic
risks) will closely depend on the number of explanatory factors in the model.

8.2.3

A systematic approach

8.2.3.1

Modelling returns

A systematic approach in equity when looking at mean-reversion is to look for stock returns devoid of explanatory
factors (see Section (1.5.3)), and analyse the corresponding residuals as stochastic processes. Avellaneda et al. [2008]
proposed a quantitative approach to stock pricing based on relative performance within industry sectors or PCA factors.
They studied how different sets of risk-factors lead to different residuals producing different profit and loss (PnL) for
statistical arbitrage strategies. Following their settings, we let $\{R_i\}_{i=1}^N$ be the returns of the different stocks in the trading universe over an arbitrary one-day period (from close to close). Considering the econometric factor model in Equation (8.2.1) with the continuous-time model for the evolution of stock prices defined in Equation (1.5.20), we let the return of the kth risky factor be $F_{kt} = \frac{dP_k(t)}{P_k(t)}$, so that
$$\frac{dS_i(t)}{S_i(t)} = \alpha_i dt + \sum_{k=1}^m \beta_{ik} \frac{dP_k(t)}{P_k(t)} + dX_i(t)$$
where we let the systematic component of returns $\sum_{k=1}^m \beta_{ik} \frac{dP_k(t)}{P_k(t)}$ be driven by the returns of the eigenportfolios or ETFs. Therefore, the factors are either

• eigenportfolios corresponding to significant eigenvalues of the market
• industry ETF, or portfolios of ETFs
The term dXi (t) is assumed to be the increment of a stationary stochastic process which models price fluctuations
corresponding to over-reactions or other idiosyncratic fluctuations in the stock price which are not reflected in the
industry sector. Therefore, in the approach followed by Avellaneda et al. [2008], the model assumes
• a drift which measures systematic deviations from the sector
• a price fluctuation that is mean-reverting to the overall industry level


Focusing on the residual, they studied how different sets of risk-factors lead to different residuals, and hence, different
profit and loss. Market neutrality is achieved via two different approaches, either by extracting risk factors using
Principal Component Analysis, or by using industry-sector ETFs as proxies for risk factors.
8.2.3.2

The market neutral portfolio

Following the notation in Section (1.5.2), we consider h1 , h2 , .., hN to be the dollars invested in different stocks (long
or short) and let S1 , S2 , .., SN be the dividend-adjusted prices. Neglecting transaction costs, we consider the trading
portfolio returns given by
$$\sum_{i=1}^N h_{i,t} R_{i,t}$$
where $R_{i,t}$ is the expected return on the ith risky security over one period of time. Assuming the stock returns to follow the factor model in Equation (8.2.1) with $\alpha_{i,t} = 0$, the portfolio returns become
$$\sum_{i=1}^N h_{i,t} \Big( \sum_{k=1}^m \beta_{ik} F_{kt} \Big) + \sum_{i=1}^N h_{i,t} X_{i,t}$$
which becomes
$$\sum_{k=1}^m \Big( \sum_{i=1}^N h_{i,t} \beta_{ik} \Big) F_{kt} + \sum_{i=1}^N h_{i,t} X_{i,t}$$
where $\sum_{i=1}^N h_{i,t} \beta_{ik}$ is the net dollar-beta exposure along factor k and $\sum_{i=1}^N h_{i,t}$ is the net dollar exposure of the portfolio.

Definition 8.2.1 A trading portfolio is said to be market-neutral if the dollar amounts $\{h_i\}_{i=1}^N$ invested in each of the stocks are such that
$$\bar{\beta}_k = \sum_{i=1}^N h_i \beta_{ik} = 0 , \quad k = 1, 2, .., m$$
The coefficients $\bar{\beta}_k$ correspond to the portfolio betas, or projections of the portfolio returns on the different factors. A market-neutral portfolio has vanishing portfolio betas; it is uncorrelated with the market portfolio or factors driving the market returns. As a result, cancelling the net dollar-beta exposure $\sum_{i=1}^N h_i \beta_{ik}$ for each factor, the portfolio returns become
$$\sum_{i=1}^N h_i X_i$$

Thus, a market-neutral portfolio is affected only by idiosyncratic returns (residuals). In G8 economies, stock returns
are explained by approximately m = 15 factors (or between 10 and 20 factors), and the systematic component of
stock returns explains approximately 50% of the variance (see Plerou et al. [2002] and Laloux et al. [2000]).
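The market-neutrality condition can be illustrated with a short sketch that computes the portfolio betas of a stock book and the factor positions needed to cancel them. It assumes that each factor is tradable (for instance through an ETF) with unit loading on itself and zero loading on the other factors; the positions, loadings and function names below are hypothetical.

```python
def net_dollar_beta(h, beta):
    """Portfolio betas: sum_i h_i * beta_ik for each factor k."""
    m = len(beta[0])
    return [sum(h[i] * beta[i][k] for i in range(len(h))) for k in range(m)]

def factor_hedge(h, beta):
    """Dollar position in each factor instrument that cancels the net dollar-beta.

    Assumes factor k can be traded directly with loading 1 on itself.
    """
    return [-b for b in net_dollar_beta(h, beta)]

# Hypothetical book: dollars held in 3 stocks and their loadings on 2 factors
h = [100.0, -50.0, 80.0]
beta = [[1.2, 0.0],
        [0.8, 0.3],
        [0.0, 1.1]]

exposure = net_dollar_beta(h, beta)
hedge = factor_hedge(h, beta)
residual_exposure = [e + q for e, q in zip(exposure, hedge)]
print("net dollar-beta per factor:", exposure)
print("hedge per factor:", hedge)
print("exposure after hedging:", residual_exposure)
```

After hedging, the portfolio return reduces to the weighted sum of the residuals $X_i$, which is the quantity the mean-reversion signals act on.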
Further, in this setting, we define the Leverage Ratio as
$$\Lambda = \frac{\sum_{i=1}^N |h_{i,t}|}{V_t} \tag{8.2.9}$$
which is also written
$$\Lambda = \frac{\text{Long market value} + |\text{Short market value}|}{\text{Equity}}$$
Some examples of leverage, where $L$ is the long market value, $S$ the short market value and $V$ the equity, are
• long-only: $\Lambda = \frac{L}{V}$
• long-only, Reg T: $L \leq 2V$ therefore $\Lambda \leq 2$
• 130-30 investment fund: $L = 1.3V$, $|S| = 0.3V$ therefore $\Lambda = 1.6$
• long-short dollar-neutral, Reg T: $L + |S| \leq 2V$ therefore $\Lambda \leq 2$
• long-short equal target position in each stock: $|h_i| \leq \frac{\Lambda_{max} V}{N}$ therefore $\sum_i |h_i| \leq \Lambda_{max} V$

8.2.4

Estimating the factor model

8.2.4.1

The PCA approach


Although this is very simplistic, the model can be tested on cross-sectional data. Using statistical testing, we can accept or reject the model for each stock in a given list and then construct a trading strategy for those stocks that appear to follow the model and yet for which significant deviations from equilibrium are observed. One of the problems is to find out whether the residuals can be fitted to (increments of) OU processes or some other mean-reverting processes. If it is the case, we need to estimate the typical correlation time-scale.
In risk-management, factor analysis is used to measure the exposure of a portfolio to a particular industry or market feature. One relies on dimension-reduction techniques for the study of systems with a large number of degrees of freedom, making the portfolio theory viable in practice. Hence, one can consider PCA for extracting factors from data by using historical stock price data on a cross-section of N stocks going back M days in the past. Considering the time window $t = 0, 1, 2, .., T$ (days) where $\Delta t = \frac{1}{252}$ and a universe of N stocks, we let $\{R_i\}_{i=1}^N$ be the returns of the different stocks in the trading universe over an arbitrary one-day period (from close to close). The returns data is represented by a $T \times N$ matrix $R(i, t)$ for $i = 1, .., N$ with covariance matrix $\Sigma_R$ and elements
$$\sigma_i^2 = \frac{1}{T - 1} \sum_{t=1}^T (R(i, t) - \bar{R}_i)^2 , \quad \bar{R}_i = \frac{1}{T} \sum_{t=1}^T R(i, t)$$
Data centring being an important element of PCA, as it helps minimise the mean squared deviation error, the standardised returns are given by
$$Y(i, t) = \frac{R(i, t) - \bar{R}_i}{\sigma_i} \quad \text{or} \quad Y(i, t) = \frac{R(i, t)}{\sigma_i}$$
such that the empirical correlation matrix of the data is defined by
$$\Gamma(i, j) = \frac{1}{T - 1} \sum_{t=1}^T Y(i, t) Y(j, t)$$
where $Rank(\Gamma) \leq \min(N, T)$, and for any index i we get
$$\Gamma(i, i) = \frac{1}{T - 1} \sum_{t=1}^T (Y(i, t))^2 = 1$$

One can regularise the correlation matrix as follows
$$C(i, j) = \frac{1}{T - 1} \sum_{t=1}^T (R(i, t) - \bar{R}_i)(R(j, t) - \bar{R}_j) + \gamma \delta_{ij} , \quad \gamma = 10^{-9}$$
where $\delta_{ij}$ is the Kronecker delta, and $C(i, i) \approx \sigma_i^2$. We can obtain the matrix as
$$\Gamma_{reg}(i, j) = \frac{C(i, j)}{\sqrt{C(i, i) C(j, j)}}$$
This is a positive definite correlation matrix. It is equivalent for all practical purposes to the original one but is
numerically stable for inversion and eigenvector analysis. This is especially useful when $T << N$. If we consider
daily returns, we are faced with the problem that very long estimation windows T >> N do not make sense because
they take into account the distant past which is economically irrelevant. On the other hand, if we just consider the
behaviour of the market over the past year, for example, then we are faced with the fact that there are considerably
more entries in the correlation matrix than data points. The commonly used solution to extract meaningful information
from the data is Principal Components Analysis.
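The regularised correlation matrix described above can be sketched in a few lines of plain Python. The layout `returns[i][t]` (one list per stock), the function name and the toy data are assumptions made for illustration; the third stock is given a flat return series on purpose, to show that the regularisation keeps the normalisation well defined where the unregularised formula would divide by zero.

```python
import math

def regularised_correlation(returns, gamma=1e-9):
    """Correlation matrix built from the covariance plus gamma on the diagonal.

    returns[i][t] is the return of stock i at date t.
    """
    n, T = len(returns), len(returns[0])
    means = [sum(r) / T for r in returns]
    C = [[sum((returns[i][t] - means[i]) * (returns[j][t] - means[j])
              for t in range(T)) / (T - 1) + (gamma if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return [[C[i][j] / math.sqrt(C[i][i] * C[j][j]) for j in range(n)]
            for i in range(n)]

# Toy data (assumed): three stocks, three dates, the last stock never moves
returns = [[0.01, -0.02, 0.03],
           [0.02, -0.01, 0.01],
           [0.00, 0.00, 0.00]]
G = regularised_correlation(returns)
for row in G:
    print([round(v, 6) for v in row])
```

For stocks with ordinary volatility the shift of $10^{-9}$ is negligible compared with the variance, so the off-diagonal correlations are essentially unchanged.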
8.2.4.2

The selection of the eigenportfolios

Following Section (5.5.3), we can now let $\lambda_1 > \lambda_2 \geq .. \geq \lambda_N \geq 0$ be the eigenvalues ranked in decreasing order and $V^{(j)} = (V_1^{(j)}, V_2^{(j)}, .., V_N^{(j)})$ for $j = 1, 2, .., N$ be the corresponding eigenvectors. We now need to estimate the significant eigenportfolios which can be used as factors. Analysing the density of states of the eigenvalues, we let m be a fixed number of eigenvalues, close to the number of industry sectors, used to extract the factors. In that setting, the eigenportfolio for each index j is
$$F_{jt} = \sum_{i=1}^N V_i^{(j)} Y(i, t) = \sum_{i=1}^N \frac{V_i^{(j)}}{\sigma_i} R(i, t) , \quad j = 1, 2, .., m$$
where $Q_i^{(j)} = \frac{V_i^{(j)}}{\sigma_i}$ is the respective amount invested in each of the stocks. We use the coefficients of the eigenvectors and the volatilities of the stocks to build portfolio weights. It corresponds to the returns of the eigenportfolios which are uncorrelated in the sense that the empirical correlation of $F_j$ and $F_{j'}$ vanishes for $j \neq j'$. These random variables span the same linear space as the original returns. As each stock return in the investment universe can be decomposed into its projection on the m factors and a residual, the PCA approach delivers a natural set of risk-factors that can be used to decompose our returns. Assuming that the correlation matrix is invertible, we get
$$\langle R_i, R_j \rangle = C(i, j) = \sum_{k=1}^m \lambda_k V_i^{(k)} V_j^{(k)}$$
with the factors
$$F_k = \sum_{i=1}^N \frac{V_i^{(k)}}{\sigma_i} R(i) , \quad \tilde{F}_k = \frac{1}{\sqrt{\lambda_k}} \sum_{i=1}^N \frac{V_i^{(k)}}{\sigma_i} R(i)$$
and the norms
$$\langle F_k^2 \rangle = \lambda_k , \quad \langle \tilde{F}_k^2 \rangle = 1 , \quad \langle \tilde{F}_k \tilde{F}_{k'} \rangle = \delta_{kk'}$$

so that given the return $R_i = \sum_{k=1}^m \beta_{ik} F_k$ we get the coefficient
$$\beta_{ik} = \sigma_i \sqrt{\lambda_k} V_i^{(k)}$$
It is not difficult to verify that this approach corresponds to modelling the correlation matrix of stock returns as a sum of a rank-m matrix corresponding to the significant spectrum and a diagonal matrix of full rank
$$C(i, j) = \sum_{k=1}^m \lambda_k V_i^{(k)} V_j^{(k)} + \epsilon_{ii}^2 \delta_{ij}$$
where $\delta_{ij}$ is the Kronecker delta and $\epsilon_{ii}^2$ is given by
$$\epsilon_{ii}^2 = 1 - \sum_{k=1}^m \lambda_k V_i^{(k)} V_i^{(k)}$$

so that C(i, i) = 1. This means that we keep only the significant eigenvalues/eigenvectors of the correlation matrix
and add a diagonal noise matrix for the purposes of conserving the total variance of the system.
Laloux et al. [2000] pointed out that the dominant eigenvector is associated with the market portfolio, in the sense that all the coefficients $V_i^{(1)}$ for $i = 1, .., N$ are positive. Thus, the eigenportfolio has positive weights $Q_i^{(1)} = \frac{V_i^{(1)}}{\sigma_i}$ which are inversely proportional to the stock's volatility. It is consistent with the capitalisation-weighting, since larger capitalisation companies tend to have smaller volatilities. Note, the remaining eigenvectors must have components that are negative, in order to be orthogonal to $V^{(1)}$. However, contrary to the interest-rate curve analysis, one cannot
apply the shape analysis to interpret the PCA.
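The positivity of the dominant eigenvector and the resulting eigenportfolio weights can be illustrated as follows. This sketch uses power iteration, a technique chosen here only to keep the example dependency-free (the text does not prescribe any eigen-solver); the correlation matrix, the volatilities and the function names are assumed values.

```python
import math

def dominant_eigenpair(G, iters=500):
    """Power iteration: approximates the largest eigenvalue and its eigenvector."""
    n = len(G)
    v = [1.0 / math.sqrt(n)] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
        # Rayleigh quotient of the normalised vector
        lam = sum(v[i] * sum(G[i][j] * v[j] for j in range(n)) for i in range(n))
    return lam, v

def eigenportfolio_weights(v, vols):
    """Weights Q_i = V_i / sigma_i of the eigenportfolio."""
    return [vi / s for vi, s in zip(v, vols)]

# Hypothetical correlation matrix and volatilities of three stocks
G = [[1.0, 0.6, 0.5],
     [0.6, 1.0, 0.4],
     [0.5, 0.4, 1.0]]
vols = [0.2, 0.3, 0.4]

lam, v = dominant_eigenpair(G)
Q = eigenportfolio_weights(v, vols)
print("dominant eigenvalue:", lam)
print("eigenvector:", v)
print("eigenportfolio weights:", Q)
```

With positive pairwise correlations every component of the dominant eigenvector is positive, and dividing by the volatilities tilts the weights towards the less volatile stocks.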
Another method consists in using the returns of sector ETFs as factors. In this approach, we select a sufficiently
diverse set of ETFs and perform multiple regression analysis of stock returns on these factors. Unlike the case of
eigenportfolios, ETF returns are not uncorrelated, so there can be redundancies: strongly correlated ETFs may lead
to large factor loadings with opposing signs for stocks that belong to or are strongly correlated to different ETFs. To
remedy this, we can perform a robust version of multiple regression analysis to obtain the coefficients βij such as the
matching pursuit algorithm or the ridge regression. Avellaneda et al. [2008] associated to each stock a sector ETF
and performed a regression of the stock returns on the corresponding ETF returns. Letting I1 , I2 , .., Im be the class of
ETFs spanning the main sectors in the economy, and $R_{I_j}$ be the corresponding returns, they decomposed the stock returns as
$$R_i = \sum_{j=1}^m \beta_{ij} R_{I_j} + \epsilon_i$$
While we need some prior knowledge of the economy to identify the right ETFs to explain returns, the interpretation of the factor loadings is more intuitive than for PCA. Note, ETF holdings give more weight to large capitalisation companies, whereas PCA has no a priori capitalisation bias.

8.2.5

Strategies based on mean-reversion

8.2.5.1

The mean-reverting model

We consider the evolution of stock prices in Equation (1.5.20) and test the model on cross-sectional data. For instance,
in the ETF framework, Pk (t) represents the mid-market price of the k-th ETF used to span the market. In practice,
only ETFs that are in the same industry as the stock in question will have significant loadings, so we could also work
with the simplified model
$$\beta_{ik} = \begin{cases} \frac{Cov(R_i, R_{P_k})}{Var(R_{P_k})} & \text{if stock } i \text{ is in industry } k \\ 0 & \text{otherwise} \end{cases}$$

where each stock is regressed to a single ETF representing its peers. In order to get a market-neutral portfolio, we introduce a parametric model for $X_i(t)$, namely, the Ornstein-Uhlenbeck (OU) process with SDE
$$dX_i(t) = k_i (m_i - X_i(t)) dt + \sigma_i dW_i , \quad k_i > 0$$
where $k_i > 0$ and $\{dW_i\}_{i=1}^N$ are uncorrelated. This process is stationary and auto-regressive with lag 1 (AR(1) model). See Appendix (C.6.3) for properties of the OU process, details on its discretisation, and the calibration of the AR(1) model. In particular, the increment $dX_i(t)$ has unconditional mean zero and conditional mean equal to
$$E[dX_i(t) | X_i(s), s \leq t] = k_i (m_i - X_i(t)) dt$$
The conditional mean, or forecast of expected daily returns, is positive or negative according to the sign of (mi −
Xi (t)). In general, assuming that the model parameters vary slowly with respect to the Brownian motion, we estimate
the statistics for the residual process on a window of length 60 days, letting the model parameters be constant over the
window. This hypothesis is tested for each stock in the universe, by goodness-of-fit of the model and, in particular, by
analysing the speed of mean-reversion.
As described in Appendix (C.6.3), if we assume that the parameters of the model are constant, we get the solution
$$X_i(t_0 + \Delta t) = e^{-k_i \Delta t} X_i(t_0) + (1 - e^{-k_i \Delta t}) m_i + \sigma_i \int_{t_0}^{t_0 + \Delta t} e^{-k_i (t_0 + \Delta t - s)} dW_i(s)$$
which is the linear regression
$$X_{n+1} = a X_n + b + \nu_{n+1} , \quad \{\nu_n\} \text{ iid } N\Big(0, \sigma_i^2 \frac{1 - e^{-2 k_i \Delta t}}{2 k_i}\Big)$$
where $a = e^{-k_i \Delta t}$ is the slope and $b = (1 - e^{-k_i \Delta t}) m_i$ is the intercept. Letting $\Delta t$ tend to infinity, we see that the equilibrium probability distribution for the process $X_i(t)$ is normal with
$$E[X_i(t)] = m_i \quad \text{and} \quad Var(X_i(t)) = \frac{\sigma_i^2}{2 k_i} \tag{8.2.10}$$

According to Equation (1.5.20), investment in a market-neutral long-short portfolio in which the agent is long $1 in
the stock and short βik dollars in the kth ETF has an expected 1-day return
αi dt + ki (mi − Xi (t))dt
The second term corresponds to the model’s prediction for the return based on the position of the stationary process
Xi (t). It forecasts a negative return if Xi (t) is sufficiently high and a positive return if Xi (t) is sufficiently low. The
parameter $k_i$ is called the speed of mean-reversion and
$$\tau_i = \frac{1}{k_i}$$

represents the characteristic time-scale for mean reversion. If k >> 1 the stock reverts quickly to its mean and the
effect of the drift is negligible. Hence, we are interested in stocks with fast mean-reversion such that τi << T1 where
T1 is the estimation window.
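The AR(1) regression of the residual can be inverted for the OU parameters through $k_i = -\frac{1}{\Delta t} \log a$, $m_i = \frac{b}{1 - a}$ and $\sigma_i^2 = \frac{2 k_i Var(\nu)}{1 - a^2}$. The sketch below fits the regression by ordinary least squares on a single residual series and applies the half-window speed filter; the function names, the synthetic series and the residual-variance estimator (divided by n − 2) are assumptions for illustration.

```python
import math

def fit_ou(X, dt=1.0 / 252):
    """Estimate OU parameters from the AR(1) regression X[n+1] = a X[n] + b + nu."""
    x, y = X[:-1], X[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((u - mean_x) ** 2 for u in x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    a = sxy / sxx                      # slope, a = exp(-k dt)
    b = mean_y - a * mean_x            # intercept, b = (1 - a) m
    if not 0.0 < a < 1.0:
        raise ValueError("slope outside (0, 1): no mean reversion detected")
    resid = [v - a * u - b for u, v in zip(x, y)]
    var_nu = sum(e * e for e in resid) / (n - 2)
    k = -math.log(a) / dt
    m = b / (1.0 - a)
    sigma = math.sqrt(2.0 * k * var_nu / (1.0 - a * a))
    return {"k": k, "m": m, "sigma": sigma, "tau": 1.0 / k}

def is_fast(params, window_days=60, dt=1.0 / 252):
    """True when the mean-reversion time is shorter than half the window."""
    return params["tau"] < 0.5 * window_days * dt

# Synthetic residual (assumed): AR(1) path with a small alternating disturbance
X = [1.0]
for n in range(59):
    X.append(0.9 * X[-1] + 0.01 + 0.002 * (-1) ** n)
params = fit_ou(X)
print(params, "fast mean reversion:", is_fast(params))
```

With a 60-day window and daily data, the filter `is_fast` is equivalent to requiring $k > \frac{252}{30}$.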
Based on this simple model, Avellaneda et al. [2008] defined several trading signals. First they considered an estimation window of 60 business days ($T_1 = \frac{60}{252} = 0.24$) incorporating at least one earnings cycle for the company and they selected stocks with mean-reversion times less than $\frac{1}{2}$ period ($k > \frac{252}{30} = 8.4$) with $\tau = 0.12$. That is, $\frac{1}{2}$ period is $\tau = \frac{T_1}{2} = \frac{30}{252}$ and $k = \frac{1}{\tau} = \frac{252}{30}$. To calibrate the model we consider the linear regression above, and deduce that
$$m_i = \frac{b}{1 - a} , \quad k_i = -\frac{1}{\Delta t} \log a , \quad \sigma_i^2 = \frac{2 k_i Var(\nu)}{1 - a^2}$$
A fast mean-reversion (compared to the 60-day estimation window) requires that $k > \frac{252}{30}$, corresponding to a mean-reversion of the order of 1.5 months at most.

8.2.5.2

Pure mean-reversion

In this section we focus only on the process $X_i(t)$, neglecting the drift $\alpha_i$. Given Equation (8.2.10) we know that the equilibrium volatility is
$$\sigma_{eq,i} = \frac{\sigma_i}{\sqrt{2 k_i}} = \sigma_i \sqrt{\frac{\tau_i}{2}}$$
We can then define the dimensionless variable
$$s_i = \frac{X_i(t) - m_i}{\sigma_{eq,i}}$$
called the s-score. The s-score measures the distance to equilibrium of the cointegrated residual in units of standard deviation, that is, how far away a given stock is from the theoretical equilibrium value associated with our model. As a result, one can define a basic trading signal based on mean-reversion as
• buy to open if si < −sbo
• sell to open if si > +sso
• close short position if si < +sbc
• close long position if si > −ssc
where the cutoff values sl for l = bo, so, bc, sc are determined empirically. Entering a trade, that is buy to open,
means buying $1 of the corresponding stock and selling βi dollars of its sector ETF (pair trading). Similarly, in the
case of using multiple factors, we buy βi1 dollars of ETF #1, βi2 dollars of ETF #2 up to βim dollars of ETF #m.
The opposite trade consisting in closing a long position means selling stock and buying ETFs. Since we expressed
all quantities in dimensionless variables, we expect the cutoffs sl to be valid across the different stocks. Based on
simulating strategies from 2000 to 2004 in the case of ETF factors, Avellaneda et al. [2008] found that a good
choice of cutoffs was sbo = sso = 1.25, sbc = 0.75, and ssc = 0.5. The rationale for opening trades only when
the s-score si is far from equilibrium is to trade only when we think that we detected an anomalous excursion of the
co-integration residual. Closing trades when the s-score is near zero also makes sense, since we expect most stocks to
be near equilibrium most of the time. The trading rule detects stocks with large excursions and trades assuming these
excursions will revert to the mean in a period of the order of the mean-reversion time τi .
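The four rules above amount to a small state machine per stock. The following sketch assumes a position coded as +1 (long stock, short ETF), -1 (short stock, long ETF) or 0 (flat), and uses the cutoffs quoted from Avellaneda et al. [2008] as defaults; the function names and the sample path of s-scores are hypothetical.

```python
def s_score(x, m, sigma_eq):
    """Distance of the residual from equilibrium in units of standard deviation."""
    return (x - m) / sigma_eq

def next_position(position, s, s_bo=1.25, s_so=1.25, s_bc=0.75, s_sc=0.5):
    """Apply the open/close rules to the current position given the s-score."""
    if position == 0:
        if s < -s_bo:
            return 1       # buy to open
        if s > s_so:
            return -1      # sell to open
        return 0
    if position == 1:
        return 0 if s > -s_sc else 1    # close long position
    return 0 if s < s_bc else -1        # close short position

# Hypothetical path of s-scores for one stock
path = [-1.5, -1.0, -0.4, 0.3, 1.4, 0.9, 0.6]
position, history = 0, []
for s in path:
    position = next_position(position, s)
    history.append(position)
print(history)
```

On this path the rule opens a long at -1.5, holds it at -1.0, closes it at -0.4, opens a short at 1.4 and closes it once the score falls below 0.75.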
8.2.5.3

Mean-reversion with drift

When ignoring the presence of the drift, we implicitly assume that the effect of the drift is irrelevant in comparison
with mean-reversion. Incorporating the drift, the conditional expectation of the residual return over a period of time
∆t becomes
$$\alpha_i dt + k_i (m_i - X_i) dt = k_i \Big( \frac{\alpha_i}{k_i} + m_i - X_i \Big) dt = k_i \Big( \frac{\alpha_i}{k_i} - \sigma_{eq,i} s_i \Big) dt$$
This suggests that the dimensionless decision variable is the modified s-score
$$s_{mod,i} = s_i - \frac{\alpha_i}{k_i \sigma_{eq,i}} = s_i - \frac{\alpha_i \tau_i}{\sigma_{eq,i}}$$

In the previous framework, we short stock if the s-score is large enough. The modified s-score is larger if αi is negative,
and smaller if αi is positive. Therefore, it will be harder to generate a short signal if we think that the residual has an
upward drift and easier to short if we think that the residual has a downward drift. Since the drift can be interpreted
as the slope of a 60-day moving average, we have therefore a built-in momentum strategy in this second signal. A
calibration exercise using the training period 2000-2004 showed that the cutoffs defined in the previous strategy are
also acceptable for this one. However, Avellaneda et al. [2008] found that the drift parameter had values of the order
of 15 basis points and the average expected reversion time was 7 days, whereas the equilibrium volatility of residuals was on the order of 300 bps. The expected average shift for the modified s-score was of the order of $15 \times \frac{7}{300} \approx 0.3$ (with the drift and the volatility both expressed in basis points).
Hence, in practice, the effect of incorporating a drift in these time-scales was minor.
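The size of the drift correction can be reproduced with a one-line function. This is a sketch assuming daily units (drift per day, reversion time in days, so that $k = 1/\tau$ is per day) and the orders of magnitude quoted above; the function name is hypothetical.

```python
def modified_s_score(s, alpha, k, sigma_eq):
    """Modified s-score: s minus the drift shift alpha / (k * sigma_eq)."""
    return s - alpha / (k * sigma_eq)

alpha = 0.0015       # drift of 15 basis points per day
tau = 7.0            # expected reversion time in days
sigma_eq = 0.03      # equilibrium volatility of 300 basis points
shift = alpha * tau / sigma_eq
s_mod = modified_s_score(1.0, alpha, 1.0 / tau, sigma_eq)
print("shift:", shift, "modified s-score for s = 1:", s_mod)
```

A positive drift lowers the modified score, so a short signal requires a larger excursion of the residual, in line with the discussion above.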

8.2.6

Portfolio optimisation

Following the general portfolio valuation in Section (1.5.2), we consider h0 , h1 , .., hN to be the dollars invested in
different stocks (long or short) and S0 , S1 , .., SN to be the dividend-adjusted prices. Neglecting transaction costs, the
change in portfolio returns becomes
$$dV_t = \sum_{i=1}^N h_i \frac{dS_i(t)}{S_i(t)} - \Big( \sum_{i=1}^N h_i \Big) r dt + V_t r dt$$
where $R_i(t) = \frac{dS_i(t)}{S_i(t)}$ is the expected return on the ith risky security. Then, given the evolution of stock prices in Equation (1.5.20), the change in portfolio becomes
$$dV_t = \sum_{i=1}^N h_i \Big( \sum_{k=1}^m \beta_{ik} \frac{dP_k}{P_k} + dX_i \Big) - \Big( \sum_{i=1}^N h_i \Big) r dt + V_t r dt$$
which becomes
$$dV_t = \sum_{i=1}^N h_i dX_i + \sum_{k=1}^m \Big( \sum_{i=1}^N h_i \beta_{ik} \Big) \frac{dP_k}{P_k} - \Big( \sum_{i=1}^N h_i \Big) r dt + V_t r dt$$
where $\sum_{i=1}^N h_i \beta_{ik}$ is the net dollar-beta exposure along factor k and $\sum_{i=1}^N h_i$ the net dollar exposure of the portfolio. Cancelling the net dollar-beta exposure $\sum_{i=1}^N h_i \beta_{ik}$ for each factor, the change in portfolio becomes
$$dV_t = \sum_{i=1}^N h_i dX_i - \Big( \sum_{i=1}^N h_i \Big) r dt + V_t r dt$$
Thus, a market-neutral portfolio is affected only by idiosyncratic returns. Replacing with the residual process, we get
$$dV_t = \sum_{i=1}^N h_i \big( k_i (m - X_i) dt + \sigma_i dW_i \big) - \Big( \sum_{i=1}^N h_i \Big) r dt + V_t r dt$$
which gives
$$dV_t = \sum_{i=1}^N h_i \big( k_i (m - X_i) - r \big) dt + \sum_{i=1}^N h_i \sigma_i dW_i + V_t r dt$$
Ignoring the term $V_t r dt$, and taking the conditional expectation, we get
$$E[dV_t | X] = \sum_{i=1}^N h_i \big( k_i (m - X_i) - r \big) dt = \sum_{i=1}^N h_i \mu_i dt$$
where $\mu_i = k_i (m - X_i) - r$, and the conditional variance
$$Var(dV_t | X) = \sum_{i=1}^N h_i^2 \sigma_i^2 dt$$

Following the mean-variance approach detailed in Section (??), we can therefore build the Mean-Variance optimal portfolio. We consider the mean-variance utility function given in Equation () where the expected return of the portfolio is $r_P = \sum_{i=1}^N h_i \mu_i$ and the variance of the portfolio's return is $\sigma_P^2 = \sum_{i=1}^N h_i^2 \sigma_i^2$. In that setting, the optimisation problem is given by
$$\max_h \sum_{i=1}^N h_i \mu_i - \frac{1}{2 \tau} \sum_{i=1}^N h_i^2 \sigma_i^2$$
where $\tau$ is the investor's risk tolerance. The optimal risky portfolio must satisfy Equation (9.2.4). However, in the case of a beta-neutral portfolio, the constraint in Equation (9.2.1) must be satisfied. In our setting, the optimal weight becomes
$$h_i = \tau \frac{\mu_i}{\sigma_i^2}$$
Replacing the optimal weight $h_i$ in the change of portfolio, we get
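$$dV_t = \tau \sum_{i=1}^N \frac{1}{\sigma_i^2} \big( k_i (m - X_i) - r \big)^2 dt + \tau \sum_{i=1}^N \frac{1}{\sigma_i} \big( k_i (m - X_i) - r \big) dW_i$$
Setting $\xi_i = \frac{(m - X_i) \sqrt{2 k_i}}{\sigma_i}$ and $r = 0$ we get
$$dV_t = \tau \sum_{i=1}^N \frac{k_i}{2} \xi_i^2 dt + \tau \sum_{i=1}^N \sqrt{\frac{k_i}{2}} \xi_i dW_i$$
As a result, we get the norms
$$\langle dV_t \rangle = \frac{\tau N}{2} \Big( \frac{\sum_{i=1}^N k_i}{N} \Big) dt$$
$$\langle (dV_t)^2 \rangle - \langle dV_t \rangle^2 = \frac{\tau^2 N}{2} \Big( \frac{\sum_{i=1}^N k_i}{N} \Big) dt$$
and the annualised Sharpe ratio becomes
$$M = \frac{\frac{\tau N}{2} \Big( \frac{\sum_{i=1}^N k_i}{N} \Big)}{\sqrt{\frac{\tau^2 N}{2} \Big( \frac{\sum_{i=1}^N k_i}{N} \Big)}} = \sqrt{\frac{N}{2}} \sqrt{\frac{\sum_{i=1}^N k_i}{N}} = \sqrt{\frac{N \bar{k}}{2}}$$
since $\frac{a}{\sqrt{a}} = \sqrt{a}$, where $\bar{k} = \frac{1}{N} \sum_{i=1}^N k_i$ is the average speed of mean-reversion.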
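The optimal weights and the resulting Sharpe ratio can be checked numerically. The sketch below assumes the diagonal mean-variance problem above with $r = 0$; the function names and the inputs are hypothetical, and the equilibrium value $\sqrt{N \bar{k} / 2}$ ignores transaction costs and estimation error.

```python
import math

def optimal_weights(mu, sigma, tau):
    """Mean-variance optimal dollar weights h_i = tau * mu_i / sigma_i^2."""
    return [tau * m / (s * s) for m, s in zip(mu, sigma)]

def conditional_sharpe(mu, sigma, tau):
    """Ratio of expected PnL rate to PnL volatility rate at the optimal weights."""
    h = optimal_weights(mu, sigma, tau)
    mean_rate = sum(hi * m for hi, m in zip(h, mu))
    var_rate = sum((hi * s) ** 2 for hi, s in zip(h, sigma))
    return mean_rate / math.sqrt(var_rate)

def equilibrium_sharpe(ks):
    """Annualised Sharpe ratio sqrt(N * k_bar / 2) for annualised speeds k_i."""
    n = len(ks)
    k_bar = sum(ks) / n
    return math.sqrt(n * k_bar / 2.0)

# Hypothetical conditional drifts and volatilities of two residuals
mu = [0.1, -0.2]
sigma = [0.2, 0.4]
print("Sharpe with tau = 1:", conditional_sharpe(mu, sigma, 1.0))
print("Sharpe with tau = 5:", conditional_sharpe(mu, sigma, 5.0))
print("equilibrium Sharpe for k = (2, 6):", equilibrium_sharpe([2.0, 6.0]))
```

The risk tolerance $\tau$ scales the mean and the standard deviation of the PnL by the same factor, so it drops out of the ratio; only the number of residuals and their speeds of mean-reversion matter.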


8.2.7

Back-testing

The back-testing experiments consisted in running the signals through historical data, with the estimation of parameters
(betas, residuals), signal evaluations and portfolio re-balancing performed daily. That is, we assumed that all trades are
done at the closing price of that day. Further, we assume a round-trip transaction cost per trade of 10 basis points, to
incorporate an estimate of price slippage and other costs as a single friction coefficient. Given the portfolio dynamics
in Equation (1.5.17) where Vt is the portfolio equity at time t, the basic PnL equation for the strategy has the following
form

$$V_{t+\Delta t} = V_t + r \Delta t V_t + \sum_{i=1}^N h_{i,t} R_{i,t} - r \Delta t \Big( \sum_{i=1}^N h_{i,t} \Big) + \sum_{i=1}^N h_{i,t} \frac{D_{i,t}}{S_{i,t}} - \sum_{i=1}^N |h_{i,t+\Delta t} - h_{i,t}| \epsilon$$
$$h_{i,t} = V_t \Lambda_t$$
where $R_{i,t}$ is the stock return on the period $(t, t + \Delta t)$, $r$ represents the interest rate (assuming, for simplicity, no spread between long and short rates), $\Delta t = \frac{1}{252}$, $D_{i,t}$ is the dividend payable to holders of stock i over the period $(t, t + \Delta t)$, $S_{i,t}$ is the price of stock i at time t, and $\epsilon = 0.0005$ is the slippage term alluded to above. Finally, $h_{i,t}$ is the dollar investment in stock i at time t, which is proportional to the total equity in the portfolio. The proportionality factor $\Lambda_t$ is stock-independent and chosen so that the portfolio has a desired level of leverage on average. As a result,
$$\sum_{i=1}^N |h_{i,t}| = N V_t \Lambda_t$$
Given the definition of the leverage ratio $\Lambda$ in Equation (8.2.9) we get
$$\Lambda = N \Lambda_t$$
so that $\Lambda_t = \frac{\Lambda}{N}$. That is, the weights $\Lambda_t$ are uniformly distributed. For example, given $N = 200$, if we have 100 stocks long and 100 short and we wish to have a (2 + 2) leverage (2 dollars long and 2 dollars short for 1 dollar of capital), then $\Lambda_t = \frac{2}{100}$ and we get $\sum_{i=1}^N |h_{i,t}| = V_t \frac{2}{100} 200 = 4 V_t$.
Remark 8.2.2 In practice this number is adjusted only for new positions, so as not to incur transaction costs for stocks
that are already held in the portfolio.
Hence, it controls the maximum fraction of the equity that can be invested in any stock, and we take this bound to be
equal for all stocks.
Given the discrete nature of the signals, the strategy is such that there is no continuous trading. Instead, the full
amount is invested on the stock once the signal is active (buy-to-open, short-to-open) and the position is unwound
when the s-score indicates a closing signal. This all-or-nothing strategy, which might seem inefficient at first glance,
turns out to outperform making continuous portfolio adjustments.
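As an illustration of the PnL equation of this section, the following is a minimal sketch of a single daily equity update. The function name `pnl_step`, its argument names and the default interest rate `r=0.02` are assumptions made for this example only; the slippage coefficient and the time step follow the values quoted in the text.

```python
import numpy as np

def pnl_step(v_t, h_t, h_next, ret, div_yield, r=0.02, dt=1.0 / 252, eps=0.0005):
    """One step of the back-testing equity update (illustrative sketch).

    v_t       : portfolio equity V_t
    h_t       : dollar positions h_{i,t} (negative for shorts)
    h_next    : dollar positions h_{i,t+dt} after re-balancing
    ret       : stock returns R_{i,t} over (t, t+dt)
    div_yield : dividend yields D_{i,t} / S_{i,t} over (t, t+dt)
    r         : interest rate (hypothetical default value)
    """
    h_t = np.asarray(h_t, dtype=float)
    h_next = np.asarray(h_next, dtype=float)
    ret = np.asarray(ret, dtype=float)
    div_yield = np.asarray(div_yield, dtype=float)

    interest = r * dt * v_t                        # interest earned on equity
    stock_pnl = np.sum(h_t * ret)                  # gains and losses on positions
    financing = r * dt * np.sum(h_t)               # cost of financing the net position
    dividends = np.sum(h_t * div_yield)            # dividends received (paid on shorts)
    slippage = eps * np.sum(np.abs(h_next - h_t))  # friction on traded notional
    return v_t + interest + stock_pnl - financing + dividends - slippage

# One long and one short position of 50 each, closed at the end of the day.
v_next = pnl_step(100.0, [50.0, -50.0], [0.0, 0.0], [0.01, -0.01], [0.0, 0.0], r=0.0)
```

In this toy case both legs earn 0.5 and closing 100 of notional costs 0.05 in slippage.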

8.3

The meta strategies

8.3.1

Presentation

8.3.1.1

The trading signal

Given the time-series momentum strategy described in Section (8.1), we consider the return $Y^i_{J,K}(t)$ at time $t$ for the
series of the ith available individual strategy which is given by Equation (8.1.1). We are now going to consider several
possible trading signals $\psi_i(., .)$ defined in Section (8.1) and characterising the strategies based on the returns of the


underlying process for the period $[t - J, t]$. Contrary to the time-series momentum strategy described in Section (8.1),
we do not let the Return Sign or Moving Average be the trading signal $\psi_i(., t)$ and do not directly follow the strategy
characterised with the above return. Instead, we first perform some risk analysis of that strategy based on different risk
measures. That is, we no longer consider the time-series of the stock returns, but instead, we consider the time-series
of some associated risk measures and use them to infer some trading signals.
8.3.1.2

The strategies

Return sign Following the return sign strategy discussed in Section (8.1.2.1), we consider the random vector Y0i
characterising the benchmark strategy of the ith asset where we always follow the previous move of the underlying
price returns. For risk management purposes, we set J = 1 and K = 1 in Equation (8.1.1) getting the return
$$Y^i_{1,1}(j\delta) = \mathrm{sign}_i((j-1)\delta, j\delta) R^i(j\delta, (j+1)\delta) = \mathrm{sign}(R^i((j-1)\delta, j\delta)) R^i(j\delta, (j+1)\delta) = Y^i_0(j\delta)$$

where the random variables Y0i (jδ) take values in R.
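A minimal sketch of the return sign strategy with J = 1 and K = 1 is given below. The function name `return_sign_strategy` is hypothetical, and treating a zero return as a long signal is an assumption made here for definiteness.

```python
import numpy as np

def return_sign_strategy(returns):
    """Strategy returns Y0(j) = sign(R(j-1)) * R(j) for a series of one-period returns.

    A zero previous return is treated as a long signal (an assumption of this sketch).
    """
    r = np.asarray(returns, dtype=float)
    signal = np.where(r[:-1] >= 0.0, 1.0, -1.0)  # sign of the previous move
    return signal * r[1:]                        # applied to the next-period return

y0 = return_sign_strategy([0.01, -0.02, 0.03, 0.01])
```

The output has one element fewer than the input, since the first return is only used to form a signal.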
Moving average Following the moving average strategy discussed in Section (8.1.2.2), we consider the random
vector Y0i characterising the strategy of the ith asset where a long (short) position is determined by a lagging moving
average of a price series lying below (above) a past leading moving average. For J lookback periods and K holding
periods, the return in Equation (8.1.1) becomes
$$Y^i_{J,K}(j\delta) = MA_i((j-J)\delta, j\delta) R^i(j\delta, (j+K)\delta) = Y^i_0(j\delta)$$

where the random variables Y0i (jδ) take values in R.

8.3.2

The risk measures

8.3.2.1

Conditional expectations

Focusing on the ith asset, we then let $X^i(j\delta)$ for $i = 1, .., N$ be a random variable taking on only countably many
values and possibly correlated with the random variable $Y^i_0(j\delta)$ defined in Section (14.3.5). One of the main advantages
of the theory of conditional expectation (described in Appendix (B.5.1)) is that if we already know the value of $X^i(j\delta)$
we can use this information to calculate the expected value of $Y^i_0(j\delta)$ taking into account the knowledge of $X^i(j\delta)$.
That is, suppose we know that the event $\{X^i(j\delta) = k\}$ for some value $k$ has occurred, then the expectation of $Y^i_0(j\delta)$
may change given this knowledge. As a result, the conditional expectation of $Y^i_0(j\delta)$ given the event $\{X^i(j\delta) = k\}$
is defined to be
$$E[Y^i_0(j\delta) | X^i(j\delta) = k] = E^Q[Y^i_0(j\delta)]$$
where $Q$ is the probability given by $Q(\Lambda) = P(\Lambda | X^i(j\delta) = k)$. Further, if the r.v. $Y^i_0(j\delta)$ is countably valued then
the conditional expectation becomes
$$E[Y^i_0(j\delta) | X^i(j\delta) = k] = \sum_{l=1}^{\infty} y^i_0(l) P\big(Y^i_0(j\delta) = y^i_0(l) | X^i(j\delta) = k\big)$$
Note, if we denote by $B_k$ the event $\{X^i(j\delta) = k\}$, we make sure that the family $B_1, B_2, .., B_n$ is a partition of the sample
space $\Omega$ (see Appendix (B.5.1)). As a result, we can express the conditional expectation as
$$E[Y^i_0(j\delta) | B_k] = \frac{E[Y^i_0(j\delta) I_{B_k}]}{P(B_k)}$$

For practicality we define the random variable
$$Y^i_k(j\delta) = Y^i_0(j\delta) I_{\{B_k\}} , \quad k = 1, 2, ..., n \qquad (8.3.11)$$

Using Equation (B.5.2) we can express the expectation of the random variable $Y^i_0(j\delta)$ characterising the original
strategy as a weighted sum of conditional expectations
$$E[Y^i_0(j\delta)] = \sum_{k=1}^n E[Y^i_0(j\delta) | B_k] P(B_k) = \sum_{k=1}^n E[Y^i_0(j\delta) I_{B_k}] = \sum_{k=1}^n E[Y^i_k(j\delta)]$$

8.3.2.2

Some examples

Example 1 For instance, we can define the random variable $X^i(j\delta)$ to characterise the strategy consisting in following the previous move of returns only when that return is either positive or equal to zero, or when that return is
negative. That is, the random variable $X^i(j\delta) = \mathrm{sign}(R^i((j-1)\delta, j\delta))$ has only two events (or two points), and the
state space is given by
$$\Omega = \{R^i((j-1)\delta, j\delta) \geq 0 , R^i((j-1)\delta, j\delta) < 0\}$$
with $B_1 = \{R^i((j-1)\delta, j\delta) \geq 0\}$ and $B_2 = \{R^i((j-1)\delta, j\delta) < 0\}$, so that
$$Y^i_1(j\delta) + Y^i_2(j\delta) = R^i(j\delta, (j+1)\delta) \big( I_{\{R^i((j-1)\delta, j\delta) \geq 0\}} - I_{\{R^i((j-1)\delta, j\delta) < 0\}} \big) = Y^i_0(j\delta)$$
Since $I_{\{R^i((j-1)\delta, j\delta) \geq 0\}} + I_{\{R^i((j-1)\delta, j\delta) < 0\}} = 1$, we get
$$Y^i_1(j\delta) + Y^i_2(j\delta) = R^i(j\delta, (j+1)\delta) \big( 2 I_{\{R^i((j-1)\delta, j\delta) \geq 0\}} - 1 \big) = Y^i_0(j\delta)$$
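The decomposition of Example 1 and the weighted sum of conditional expectations can be checked numerically. The sketch below uses simulated Gaussian returns, which are purely hypothetical data, and replaces expectations by sample averages; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(0.0, 0.01, size=1000)   # hypothetical one-period returns

prev, nxt = R[:-1], R[1:]              # R((j-1)d, jd) and R(jd, (j+1)d)
B1 = prev >= 0.0                       # event: previous return non-negative
B2 = ~B1                               # event: previous return negative

Y0 = np.where(B1, 1.0, -1.0) * nxt     # return sign strategy
Y1 = Y0 * B1                           # Y0 restricted to B1
Y2 = Y0 * B2                           # Y0 restricted to B2

# Sample analogue of E[Y0] = sum_k E[Y0 | B_k] P(B_k)
total = Y0.mean()
weighted = Y0[B1].mean() * B1.mean() + Y0[B2].mean() * B2.mean()
```

Here `Y1 + Y2` reproduces `Y0` element by element, and `total` agrees with `weighted` up to floating-point error.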
Example 2 Similarly to the previous example, we can define the random variable $X^i(j\delta)$ to characterise the strategy
consisting in following the product of the two previous moves of returns only when that product is either positive or
equal to zero, or when that product is negative. That is, the random variable
$$X^i(j\delta) = \mathrm{sign}\big(R^i((j-2)\delta, (j-1)\delta) R^i((j-1)\delta, j\delta)\big)$$
has only two events (or two points), and the state space is given by
$$\Omega = \{R^i((j-2)\delta, (j-1)\delta) R^i((j-1)\delta, j\delta) \geq 0 , R^i((j-2)\delta, (j-1)\delta) R^i((j-1)\delta, j\delta) < 0\}$$
with $B_1 = \{R^i((j-2)\delta, (j-1)\delta) R^i((j-1)\delta, j\delta) \geq 0\}$ and $B_2 = \{R^i((j-2)\delta, (j-1)\delta) R^i((j-1)\delta, j\delta) < 0\}$, so that
$$Y^i_1(j\delta) + Y^i_2(j\delta) = R^i(j\delta, (j+1)\delta) \big( I_{\{R^i((j-2)\delta, (j-1)\delta) R^i((j-1)\delta, j\delta) \geq 0\}} - I_{\{R^i((j-2)\delta, (j-1)\delta) R^i((j-1)\delta, j\delta) < 0\}} \big) = Y^i_0(j\delta)$$
Example 3 In this example, we define the random variable $X^i(j\delta)$ to characterise the strategy consisting in following
the previous move of returns only when that return is either big, or when that return is small. That is, the random
variable $X^i(j\delta) = I_{\{|R^i((j-1)\delta, j\delta)| \geq c_B\}}$ has only two events (or two points), and the state space is given by
$$\Omega = \{|R^i((j-1)\delta, j\delta)| \geq c_B , |R^i((j-1)\delta, j\delta)| < c_B\}$$
with $B_1 = \{|R^i((j-1)\delta, j\delta)| \geq c_B\}$ and $B_2 = \{|R^i((j-1)\delta, j\delta)| < c_B\}$, so that
$$Y^i_1(j\delta) + Y^i_2(j\delta) = R^i(j\delta, (j+1)\delta) \big( I_{\{|R^i((j-1)\delta, j\delta)| \geq c_B\}} - I_{\{|R^i((j-1)\delta, j\delta)| < c_B\}} \big)$$

[...] $> Var(Y_i(t))$ for $i = 1, 2$, when $E[Y_i(t)] > 0$ for $i = 1, 2$ we get
$$M(Y_0(t)) < \frac{E[Y_1(t)]}{\sqrt{Var(Y_1(t))}} + \frac{E[Y_2(t)]}{\sqrt{Var(Y_2(t))}} = M(Y_1) + M(Y_2)$$
and when $E[Y_i(t)] < 0$ for $i = 1, 2$ we get
$$M(Y_0(t)) > \frac{E[Y_1(t)]}{\sqrt{Var(Y_1(t))}} + \frac{E[Y_2(t)]}{\sqrt{Var(Y_2(t))}} = M(Y_1) + M(Y_2)$$
while for $E[Y_i(t)] > 0$ and $E[Y_j(t)] < 0$ for $i = 1, 2$ and for $j = 1, 2$ with $i \neq j$, if $|E[Y_i(t)]| > |E[Y_j(t)]|$ we get
$$M(Y_0(t)) < \frac{E[Y_i(t)]}{\sqrt{Var(Y_i(t))}} + \frac{E[Y_j(t)]}{\sqrt{Var(Y_j(t))}} = M(Y_i) + M(Y_j)$$
Given the definition of the conditional expectation and conditional variance in Appendix (B.5.1), we can consider the
Information ratio or conditional Sharpe ratio
$$M(Y_0(t) | B_k) = \frac{E[Y_0(t) I_{\{B_k\}}] / P(B_k)}{\sigma_{Y_0 | B_k}} = \frac{E[Y_0(t) | B_k]}{\sigma_{Y_0 | B_k}} \qquad (8.3.12)$$
where $\sigma_{Y_0 | B_k} = \sqrt{Var(Y_0(t) | B_k)}$. Further, given the Definition (B.5.2) of the conditional expectation, we can
rewrite the conditional Sharpe ratio as
$$M(Y_0(t) | B_k) = \frac{E^{Q_k}[Y_0(t)]}{\sigma_{Y_0 | B_k}}$$
where $Q_k$ is the probability given by $Q_k(\Lambda) = P(\Lambda | X = k)$.

In order to select the appropriate strategy at time $t$, we need to consider the measure for which the event $B_k(t)$
for $k = 1, 2, ..$ is satisfied. Calling this event $B_k^*(t)$ and its associated measure $M^*(Y_0 | B_k)$, if $M^*(Y_0 | B_k) > \alpha_k$ we
follow the strategy $Y_k(t)$ while if $M^*(Y_0 | B_k) < \beta_k$ we follow the strategy $-Y_k(t)$, where $\alpha_k$ is a positive constant
and $\beta_k$ is a negative constant.
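The selection rule can be sketched as a small function. The name `select_position` is hypothetical, and returning 0 (no position) when the measure lies between the two thresholds is an assumption of this sketch, since the text does not specify that case.

```python
def select_position(cond_sharpe, alpha_k, beta_k):
    """Return +1 to follow Y_k, -1 to follow -Y_k, 0 otherwise (assumed: stay flat).

    cond_sharpe : conditional Sharpe ratio M*(Y0 | B_k) of the realised event
    alpha_k     : positive threshold
    beta_k      : negative threshold
    """
    if alpha_k <= 0.0 or beta_k >= 0.0:
        raise ValueError("alpha_k must be positive and beta_k negative")
    if cond_sharpe > alpha_k:
        return 1
    if cond_sharpe < beta_k:
        return -1
    return 0

positions = [select_position(m, 0.5, -0.5) for m in (0.8, -0.9, 0.1)]
```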

8.4

Random sampling measures of risk

We assume that the population is of size N and that associated with each member of the population is a numerical value
of interest denoted by x1 , x2 , .., xN . We take a sample with replacement of n values X1 , ..., Xn from the population,
where n < N and such that Xi is a random variable. That is, Xi is the value of the ith member of the sample, and xi
is that of the ith member of the population. The population moments and the sample moments are given in Appendix
(B.9.1).

8.4.1

The sample Sharpe ratio

From an investor’s perspective, volatility per se is not a bad feature of a trading strategy. In fact, increases in volatility
generated by positive returns are desired. Instead, it is only the part of volatility that is generated by negative returns
that is clearly unwanted. There exist different methodologies for describing what is called the downside risk of an
investment. Sortino and Van Der Meer [1991] suggested the use of the Sortino ratio as a performance evaluation metric
in place of the ordinary Sharpe ratio. The Sharpe ratio treats positive and negative returns equally
$$M = \frac{\mu - R_f}{\sigma}$$
where the mean $\mu$ is estimated with the sample mean $\bar{X}$, and the variance $\sigma^2$ is estimated with the sample variance
$$S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2$$
whereas the Sortino ratio normalises the average excess returns with the square root of the semi-variance of returns (variance
generated by negative returns)
$$M_S = \frac{\mu - R_f}{\sigma^-}$$
where the semi-variance $(\sigma^-)^2$ is estimated with the sample semi-variance
$$(S^-)^2 = \frac{1}{n^- - 1} \sum_{i=1}^n (X_i I_{\{X_i < 0\}})^2$$
with $n^-$ being the number of periods with a negative return. It is therefore expected that the Sortino ratio will be
relatively larger than the ordinary Sharpe ratio for positively skewed distributions.
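The two sample estimators above can be sketched as follows. The function names `sample_sharpe` and `sample_sortino`, the default `rf=0.0` and the five return values are hypothetical; raising an error when fewer than two negative returns are available is an assumption of this sketch, made so that the divisor $n^- - 1$ stays positive.

```python
import numpy as np

def sample_sharpe(x, rf=0.0):
    """Sample Sharpe ratio (mean - rf) / S with the (n - 1) variance estimator."""
    x = np.asarray(x, dtype=float)
    return (x.mean() - rf) / x.std(ddof=1)

def sample_sortino(x, rf=0.0):
    """Sample Sortino ratio using the semi-variance estimator given in the text."""
    x = np.asarray(x, dtype=float)
    negative = x[x < 0.0]
    n_minus = negative.size
    if n_minus < 2:
        raise ValueError("need at least two negative returns")
    semi_var = np.sum(negative ** 2) / (n_minus - 1)
    return (x.mean() - rf) / np.sqrt(semi_var)

x = [0.02, -0.01, 0.03, -0.02, 0.01]   # hypothetical returns
sharpe = sample_sharpe(x)
sortino = sample_sortino(x)
```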

8.4.2

The sample conditional Sharpe ratio

Given the definition of the conditional Sharpe ratio in Equation (8.3.12), we let the probability of the event be $P(B_k) = \frac{n_k}{n}$
where $n_k$ is the number of counts in the partition $B_k$ and define the conditional sample mean as
$$\bar{X}_{B_k} = \frac{n}{n_k} \frac{1}{n} \sum_{i=1}^n X_i I_{\{B_k\}} = \frac{1}{n_k} \sum_{i=1}^n X_i I_{\{B_k\}}$$


which corresponds to the sample mean of the stratum $k$. Given the definition of the local averaging estimates in Equation
(9.3.19) we get $\omega_{n,i} = I_{\{B_k\}}$ with $norm = n_k$ and we recover the partitioning estimate in Equation (9.3.20)
$$\bar{X}_{B_k} = \frac{1}{norm} \sum_{i=1}^n \omega_{n,i} X_i$$

Given the definition of the conditional variance in Equation (B.5.3) we get the conditional sample variance as
$$S^2_{B_k} = \frac{n}{n_k} \frac{1}{n} \sum_{i=1}^n X_i^2 I_{\{B_k\}} - (\bar{X}_{B_k})^2 = \frac{1}{n_k} \sum_{i=1}^n X_i^2 I_{\{B_k\}} - (\bar{X}_{B_k})^2$$

Putting terms together, the sample conditional Sharpe ratio is
$$M(Y_0(t) | B_k) = \frac{\bar{X}_{B_k}}{S_{B_k}}$$
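The sample conditional Sharpe ratio can be computed directly from the conditional sample mean and variance above. In this sketch the function name `conditional_sharpe` and the toy data are hypothetical, and the event $B_k$ is passed as a boolean mask marking the observations that fall in the partition.

```python
import numpy as np

def conditional_sharpe(x, in_bk):
    """Sample conditional Sharpe ratio: conditional mean over conditional std."""
    x = np.asarray(x, dtype=float)
    mask = np.asarray(in_bk, dtype=bool)
    n_k = mask.sum()                               # number of counts in B_k
    mean_k = np.sum(x * mask) / n_k                # conditional sample mean
    var_k = np.sum(x ** 2 * mask) / n_k - mean_k ** 2  # conditional sample variance
    return mean_k / np.sqrt(var_k)

m_k = conditional_sharpe([0.01, -0.02, 0.03, 0.02], [True, False, True, True])
```

Only the three observations inside the partition contribute, so the second return is ignored.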

Chapter 9

Portfolio management under constraints
9.1

Introduction

We discussed in Section (1.7.6) the existence of excess returns and showed that the market is inefficient, that is, substantial gain is achievable by rebalancing and predicting market returns based on market’s history. As a consequence,
we showed in Section (2.1.1) that market inefficiency leads to active equity management where enhanced indexed portfolios were designed to generate attractive risk-adjusted returns through active weights giving rise to active returns.
Earlier results in the non-parametric statistics, information theory and economics literature (such as Kelly [1956],
Markowitz [1952]) established optimality criteria for long-only, non-leveraged investment. We saw in Section (2.3)
that in view of solving the problem of portfolio selection Markowitz [1952] introduced the mean-variance approach
which is a simple trade-off between return and uncertainty, where one is left with the choice of one free parameter,
the amount of variance acceptable to the individual investor. Similarly, Kelly [1956] introduced an investment theory
based on growth by using the role of time in multiplicative processes to solve the problem of portfolio selection. We
presented in Section (2.3.1.2) the capital asset pricing model (CAPM) and described in Appendix (9.3.16) the growth
optimum theory (GOT) as an alternative to the expected utility theory and the mean-variance approaches to asset pricing. As an example, we presented in Section (2.3.2) the growth optimal portfolio (GOP) as a portfolio having maximal
expected growth rate over any time horizon.
We saw in Section (7) that in view of taking advantage of market excess returns, a large number of hedge funds
flourished, using complex valuation models to predict market returns. Specialised quantitative strategies developed
along with specific prime brokerage structures. We defined leverage in Section (7.1.2) as any means of increasing
expected return or value without increasing out-of-pocket investment. We are now going to introduce portfolio construction in presence of financial leverage and construction leverage such as short selling which is the process of
borrowing assets and selling them immediately, with the obligation to rebuy them later. Portfolio optimisation in the
long-short context does not differ much from optimisation in the long-only context. Jacobs et al. [1999] showed
how short positions could be added to a long portfolio, by removing the greater than zero constraint from the model.
In order to optimise a true long-short portfolio, constraints should be added to the model in order to ensure equal
(in terms of total exposure) long and short legs. It was shown in Section (2.3.2) that the optimal asymptotic growth
rate on a non-leveraged, long-only memoryless market (independent identically distributed, i.i.d.) coincides with that
of the best constantly rebalanced portfolio (BCRP). In the special case of memoryless assumption on returns, adding
leverage through margin buying and short selling, Horvath et al. [2011] derived optimality conditions and generalised
the BCRP by establishing no-ruin conditions.


9.2

Robust portfolio allocation

Several methods of portfolio optimisation are available. We described in Section (2.2.1) the classic Mean-Variance portfolio based on algorithms working with point-estimates of expected returns, variances and covariances.

9.2.1

Long-short mean-variance approach under constraints

Considering proper portfolio optimisation, Jacobs et al. [1999] took a rigorous look at long-short optimality and called
into question the goals of dollar and beta neutrality, which are common practice in traditional long-short management.
Following their approach we consider the mean-variance$^1$ utility function (see details in Appendix (A.7.2))
$$U = r_P - \frac{1}{2} \frac{\sigma_P^2}{\tau} \qquad (9.2.1)$$

where $r_P$ is the expected return of the portfolio during the investor’s horizon, $\sigma_P^2$ is the variance of the portfolio’s
return, and $\tau$ is the investor’s risk tolerance. We need to choose how to allocate the investor’s wealth between a risk-free security and a set of $N$ securities, and we need to choose how to distribute wealth among the $N$ risky securities.
We let $h_R$ be the fraction of wealth allocated to the risky portfolio (total wealth is 1), and we let $h_i$ be the fraction of
wealth invested in the ith risky security. The three components of capital earning interest at the risk-free rate are
• the wealth allocated to the risk-free security with magnitude of $1 - h_R$
• the balance of the deposit made with the broker after paying for the purchase of shares long with magnitude
$h_R - \sum_{i \in L} h_i$ where $L$ is the set of securities held long
• the proceeds of the short sales with magnitude of $\sum_{i \in S} |h_i| = - \sum_{i \in S} h_i$ where $S$ is the set of securities sold
short
The last equality holds because $h_i$ is negative for $i \in S$, so that $|h_i| = -h_i$. Summing these three
components gives the total amount of capital $h_F$ earning interest at the risk-free rate
$$h_F = 1 - \sum_{i=1}^N h_i \qquad (9.2.2)$$
which is independent of $h_R$. We can then make the following observations:
• In case of long-only management where everything is invested in risky assets, the portfolio satisfies $\sum_{i=1}^N h_i = 1$
and we get $h_F = 0$,
• while in case of short-only management in which $\sum_{i=1}^N h_i = -1$, the quantity $h_F$ is equal to 2 and the investor
earns the risk-free rate twice.
• In the case of a dollar-balanced long-short management in which $\sum_{i=1}^N h_i = 0$, we get $h_F = 1$ and the investor earns the risk-free
rate only once.
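The three observations can be checked with a few lines of code implementing Equation (9.2.2). The function name `capital_at_risk_free_rate` and the holdings used below are hypothetical.

```python
def capital_at_risk_free_rate(h):
    """Total capital h_F earning the risk-free rate, h_F = 1 - sum_i h_i."""
    return 1.0 - sum(h)

h_f_long_only = capital_at_risk_free_rate([0.6, 0.4])     # holdings sum to 1
h_f_short_only = capital_at_risk_free_rate([-0.5, -0.5])  # holdings sum to -1
h_f_balanced = capital_at_risk_free_rate([0.5, -0.5])     # holdings sum to 0
```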
Note, if we do not allocate wealth to the risk-free security, the fraction of wealth allocated to the risky portfolio is
$h_R = 1$, and the total amount of capital $h_F$ earning interest at the risk-free rate becomes
$$h_F = h_R - \sum_{i=1}^N h_i$$

$^1$ It is only a single-period formulation which is not sensitive to investor wealth.


We now let $r_F$ be the return on the risk-free security, and $R_i$ be the expected return on the ith risky security. As
explained in Section (9.2.2.2), in the case of short-selling, when the price of an asset drops, the return becomes $-R_i$.
Since $h_i < 0$, when the asset price rises we get $R_i > 0$ and we lose money, and when the asset price drops we get $R_i < 0$
and we make money. Putting long and short returns together, the expected return on the investor’s total portfolio is
$$r_P = h_F r_F + \sum_{i=1}^N h_i R_i$$

Substituting $h_F$ from Equation (9.2.2) into this equation, the total portfolio return is the sum of a risk-free return and a risky return component
$$r_P = \Big(1 - \sum_{i=1}^N h_i\Big) r_F + \sum_{i=1}^N h_i R_i = r_F + r_R \qquad (9.2.3)$$
where the risky return component is
$$r_R = \sum_{i=1}^N h_i r_i$$

where $r_i = R_i - r_F$ is the expected return on the ith risky security in excess of the risk-free rate. It can be expressed
in matrix notation as
$$r_R = h^\top r$$
where $h = [h_1, .., h_N]^\top$ and $r = [r_1, .., r_N]^\top$. Similarly to the long-only portfolio in Section (2.2.1), given $r_R$, the
variance of the risky component is
$$\sigma_R^2 = h^\top Q h$$
where $Q$ is the covariance matrix of the risky securities’ returns. It is also the variance of the entire portfolio, $\sigma_P^2 = \sigma_R^2$.
Using these expressions, the utility function in Equation (9.2.1) can be rewritten in terms of controllable variables. We
determine the optimal portfolio by maximising the utility function through appropriate choice of these variables under
constraints. One example is the requirement that all the wealth allocated to the risky securities is fully utilised. Some of
the most common portfolio constraints are
• beta-neutrality constraint $\sum_{i=1}^N h_i \beta_i = 0$
• portfolio constraint $\sum_{i=1}^N h_i = 1$ for long-only portfolio and $\sum_{i=1}^N h_i = -1$ for short-only portfolio
• leverage constraint $\sum_{i=1}^N |h_i| = l$ where $l$ is the leverage or we can set $\sum_{i \in L} |h_i| = l_L$ and $\sum_{i \in S} |h_i| = l_S$
with $l = l_L + l_S$

Most strategies will be dollar and beta neutral, but fewer will be sector/industry, capitalisation and factor neutral.
Arguably, the more neutral a long-short portfolio the better, as systematic risk diminishes (as does the residual return
correlation of the long and short portfolios) and stock specific risk, which is the object of long-short, increases. In the
absence of constraints, the utility in Equation (9.2.1) reads $U = r_F + h^\top r - \frac{1}{2\tau} h^\top Q h$, and the first-order condition
$r - \frac{1}{\tau} Q h = 0$ (provided $Q$ is non-singular) gives the optimal risky portfolio
$$h = \tau Q^{-1} r \qquad (9.2.4)$$
corresponding to the minimally constrained portfolio. The expected returns and their covariances must be quantities
that the investor expects to be realised over the portfolio’s holding period.
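A minimal numerical sketch of Equation (9.2.4) is given below. The function name `optimal_risky_portfolio`, the excess returns, the covariance matrix and the risk tolerance are all hypothetical inputs chosen for illustration; the linear system is solved directly rather than forming the inverse of $Q$.

```python
import numpy as np

def optimal_risky_portfolio(r, Q, tau):
    """Minimally constrained optimal holdings h = tau * Q^{-1} r."""
    r = np.asarray(r, dtype=float)
    Q = np.asarray(Q, dtype=float)
    return tau * np.linalg.solve(Q, r)   # solves Q h / tau = r

r = np.array([0.04, -0.02])              # hypothetical expected excess returns
Q = np.array([[0.04, 0.01],
              [0.01, 0.09]])             # hypothetical covariance matrix
tau = 0.5                                # hypothetical risk tolerance

h = optimal_risky_portfolio(r, Q, tau)
gradient = r - Q @ h / tau               # first-order condition, should vanish
```

With these inputs the second security has a negative expected excess return and ends up with a negative holding, that is, it is sold short.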


Remark 9.2.1 Since the true statistical distribution of returns is unknown, different assumptions about it lead to different
results. Optimal portfolio holdings will thus differ for each investor, even though investors use the same utility
function.
The optimal holdings in Equation (9.2.4) define a portfolio allowing for short positions because no non-negativity
constraints are imposed. The single portfolio exploits the characteristics of individual securities in a single integrated
optimisation even though the portfolio can be partitioned artificially into one sub-portfolio of long stocks and the other
one of stocks sold short (there is no benefit in doing so). Further, the holdings need not satisfy any arbitrary balance
conditions, that is dollar or beta neutrality is not required. The portfolio has no inherent benchmark so that there is no
residual risk. The portfolio will exhibit an absolute return and an absolute variance of return. This return is calculated
as the weighted spread between the returns to the securities held long and the ones sold short.
Performance attribution cannot distinguish between the contribution of the stocks held long and those sold short.
Remark 9.2.2 Separate long and short alpha (and their correlation) are meaningless.
A long-short portfolio allows investors to be insensitive to chosen exogenous factors such as the return of the equity
market. This is done by constructing a portfolio so that the beta of the short positions equals and offsets the beta
of the long positions, or (more problematically) the dollar amount of securities sold short equals the dollar amount
of securities held long. However, market neutrality may exact costs in terms of forgone utility. This is the case if
more opportunities exist on the short side than on the long side of the market. One might expect some return sacrifice
from a portfolio that is required to hold equal-dollar or equal-beta positions long and short. Market neutrality can
be achieved by using the appropriate amount of stock index futures, without requiring that long and short security
positions be balanced. Nevertheless, investors may prefer long-short balances for mental accounting reasons. Making
the portfolio insensitive to equity market return (or to any other factor) constitutes an additional constraint on the
portfolio. The optimal neutral portfolio maximises the investor’s utility subject to all constraints, including neutrality.
However, the optimal solution is no longer given by Equation (9.2.4).
Definition 9.2.1 By definition, the risky portfolio is dollar-neutral if the net holding $H$ of risky securities is zero,
meaning that
$$H = \sum_{i=1}^N h_i = 0 \qquad (9.2.5)$$
This condition is independent of $h_R$. Applying the condition in Equation (9.2.5) to the optimal weights in Equation
(9.2.4) it can be shown that the dollar-neutral portfolio is equal to the minimally constrained optimal portfolio when
$$H \sim \sum_{i=1}^N \frac{r_i}{\sigma_i} (\xi_i - \bar{\xi}) = 0$$
where $\xi_i = \frac{1}{\sigma_i}$ is a measure of stability of the return of the stock $i$ and $\bar{\xi}$ is the average return stability of all stocks in
the investor’s universe. The term $\frac{r_i}{\sigma_i}$ is a risk adjusted return, and $(\xi_i - \bar{\xi})$ can be seen as an excess stability. Highly
volatile stocks will have low stabilities and their excess stability will be negative. If the above quantity is positive, the
net holding should be long, and if it is negative it should be short.
Once the investor has chosen a benchmark, each security can be modelled in terms of its expected excess return $\alpha_i$
and its beta $\beta_i$ with respect to that benchmark. If $r_B$ is the expected return of the benchmark, then the expected return
of the ith security is
$$r_i = \alpha_i + \beta_i r_B$$


Similarly, the expected return of the portfolio can be modelled in terms of its expected excess return $\alpha_P$ and beta $\beta_P$
with respect to the benchmark
$$r_P = \alpha_P + \beta_P r_B \qquad (9.2.6)$$
where the beta of the portfolio is expressed as a linear combination of the betas of the individual securities
$$\beta_P = \sum_{i=1}^N h_i \beta_i$$

This is obtained by replacing the expected return $r_i$ of the ith security in the portfolio return $r_P$ given in Equation
(9.2.3). From Equation (9.2.6) it is clear that any portfolio that is insensitive to changes in the expected benchmark
return must satisfy the condition
$$\beta_P = 0$$
Applying this condition to the optimal weights in Equation (9.2.4), together with the model for the expected return
of the ith security above, it can be shown that the beta-neutral portfolio is equal to the optimal minimally constrained
portfolio when
$$\sum_{i=1}^N \beta_i \phi_i = 0$$
with weights
$$\phi_i = \frac{r_i}{\sigma_{e,i}^2} \qquad (9.2.7)$$
where $r_i$ is the excess return and $\sigma_{e,i}^2$ is the variance of the excess return of the ith security. The sum $\sum_{i=1}^N \beta_i \phi_i$ is the portfolio net
beta-weighted risk-adjusted expected return. When this condition is satisfied the constructed portfolio is unaffected
by the return of the chosen benchmark (it is beta-neutral).
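The beta-neutrality condition can be illustrated in a deliberately simplified setting. The sketch below assumes, for illustration only, a diagonal covariance matrix, so that the optimal holdings of Equation (9.2.4) reduce to $h_i = \tau r_i / \sigma_{e,i}^2 = \tau \phi_i$; this diagonal assumption, the numbers and the variable names are not taken from the text.

```python
import numpy as np

tau = 0.5                                  # hypothetical risk tolerance
r = np.array([0.04, 0.02, -0.03])          # hypothetical expected excess returns
sigma2 = np.array([0.04, 0.02, 0.03])      # hypothetical variances (diagonal Q assumed)
beta = np.array([1.0, 0.5, 1.5])           # hypothetical betas

phi = r / sigma2                           # weights of Equation (9.2.7)
h = tau * phi                              # Equation (9.2.4) under the diagonal assumption

portfolio_beta = h @ beta                  # beta_P = sum_i h_i beta_i
net_holding = h.sum()                      # H = sum_i h_i
```

With these inputs the beta-weighted sum of the $\phi_i$ vanishes, so the unconstrained optimum is already beta-neutral, while its net holding is positive and the portfolio is therefore not dollar-neutral.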

9.2.2

Portfolio selection

As an example of long-only, non-leveraged active equity management, the Constantly Rebalanced Portfolio (CRP)
is a self-financing portfolio strategy, rebalancing to the same proportional portfolio in each investment period. This
means that the investor neither consumes from, nor deposits new cash into his account, but reinvests his capital in each
trading period. Using this strategy the investor chooses a proportional portfolio vector πv = (πv1 , ..., πvN ), where πvi
is defined in Equation (1.5.18), and rebalances his portfolio after each period to correct the price shifts in the market.
The idea being that on a frictionless market the investor can rebalance his portfolio for free at each trading period.
Hence, asymptotic optimisation on a memoryless market means that the growth optimal (GO) strategy will pick the
same portfolio vector at each trading period, leading to CRP strategies. Thus, the CRP with the highest asymptotic
average growth rate is referred to as the Best Constantly Rebalanced Portfolio (BCRP).
In a memoryless market, leverage is anticipated to have substantial merit in terms of growth rate, while short
selling is not expected to yield much better results since companies worth shorting in a testing period might already
have defaulted. That is, in case of margin buying (the act of borrowing money and increasing market exposure) and
short selling, it is easy to default on total initial investment. In this case the asymptotic growth rate becomes minus
infinity. Horvath et al. [2011] showed that using leverage through margin buying yields substantially higher growth
rate in the case of memoryless assumption on returns. They also established a mathematical basis for short selling,
that is, creating negative exposure to asset prices. Adding leverage and short selling to the framework, Horvath et al.
derived optimality conditions and generalised the BCRP by establishing no-ruin conditions. They further showed that
short selling might yield increased profits in case of markets with memory, since such a market is inefficient.


9.2.2.1

Long only investment: non-leveraged

We consider a market consisting of $N$ assets and let the evolution of prices be represented by a sequence of price
vectors $S_1, S_2, .. \in \mathbb{R}^N_+$ where
$$S_n = (S_n^1, S_n^2, ..., S_n^N)$$
and $S_n^i$ denotes the price of the ith asset at the end of the nth trading period. We transform the sequence of price vectors
$\{S_n\}$ into return vectors
$$x_n = (x_n^1, x_n^2, ..., x_n^N)$$
where
$$x_n^i = \frac{S_n^i}{S_{n-1}^i} \qquad (9.2.8)$$
Note, we usually denote the return as
$$R_n^i = x_n^i - 1$$
Note also that expected excess return is the expected absolute return minus the expected risk free rate. A representative example of the dynamic portfolio selection in the long-only case is the constantly rebalanced portfolio (CRP),
introduced and studied by Kelly [1956], Latane [1959], and presented in Section (2.3.2). As defined in Equation
(1.5.18), we let $\pi_v^i(n) = \frac{\delta^i(n) S^i(n)}{V(n)}$ be the ith component of the vector $\pi_v$ representing the proportion of the investor’s
capital invested in the ith asset in the nth trading period.
Remark 9.2.3 In that setting, the proportion of the investor’s wealth invested in each asset at the beginning of trading
periods is constant.
The portfolio vector has non-negative components that sum up to 1, and the set of portfolio vectors is denoted by
$$\Delta_N = \Big\{ \pi_v = (\pi_v^1, ..., \pi_v^N) , \pi_v^i \geq 0 , \sum_{i=1}^N \pi_v^i = 1 \Big\}$$

Note, in this example nothing is invested into cash (risk-free rate). Let $V(0)$ denote the investor’s initial capital. At
the beginning of the first trading period, $n = 1$, $V(0) \pi_v^i$ is invested into asset $i$, and it results in the position size
$$V(0) \pi_v^i + V(0) \pi_v^i \frac{S_1^i - S_0^i}{S_0^i} = V(0) \pi_v^i x_1^i$$
after changes in market prices, or equivalently $V(0) \pi_v^i (1 + R_1^i)$. That is, we hold $V(0) \pi_v^i$ of asset $i$ and we
earn $V(0) \pi_v^i R_1^i$ of interest. Therefore, at the end of the first trading period the investor’s wealth becomes
$$V(1) = V(0) \sum_{i=1}^N \pi_v^i x_1^i = V(0) \langle \pi_v , x_1 \rangle$$
where $\langle ., . \rangle$ is the inner product. Equivalently, we get
$$V(1) = V(0) \sum_{i=1}^N \pi_v^i (1 + R_1^i) = V(0) \langle \pi_v , (1 + R_1) \rangle$$

For the second trading period, $n = 2$, $V(1)$ is the new initial capital, and we get
$$V(2) = V(1) \langle \pi_v , x_2 \rangle = V(0) \langle \pi_v , x_1 \rangle \langle \pi_v , x_2 \rangle$$
Note, taking the difference between period $n = 2$ and period $n = 1$ and setting $dS_1^i = S_2^i - S_1^i$, we get
$$V(2) - V(1) = V(1) \sum_{i=1}^N \pi_v^i R_2^i = V(1) \sum_{i=1}^N \frac{\pi_v^i}{S_1^i} dS_1^i$$
since $\sum_{i=1}^N \pi_v^i = 1$. For $\pi_v^i(1) = \frac{\delta^i S_1^i}{V(1)}$ we recover the dynamics of the portfolio. By induction, after $n$ trading periods,
the investor’s wealth becomes
$$V(n) = V(n-1) \langle \pi_v , x_n \rangle = V(0) \prod_{j=1}^n \langle \pi_v , x_j \rangle$$
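The wealth recursion of a constantly rebalanced portfolio can be sketched in a few lines. The function name `crp_wealth` and the two-asset market below (a cash-like asset with $x = 1$ and an asset that alternately doubles and halves) are hypothetical and chosen only because the numbers are easy to follow.

```python
import numpy as np

def crp_wealth(v0, pi, x):
    """Wealth path V(1), ..., V(n) of a constantly rebalanced portfolio.

    v0 : initial capital V(0)
    pi : portfolio vector (non-negative, sums to 1)
    x  : array of shape (n, N) of return vectors x_j = S_j / S_{j-1}
    """
    pi = np.asarray(pi, dtype=float)
    x = np.asarray(x, dtype=float)
    growth = x @ pi                  # <pi, x_j> for each period j
    return v0 * np.cumprod(growth)   # V(n) = V(0) * prod_j <pi, x_j>

x = [[1.0, 2.0], [1.0, 0.5], [1.0, 2.0], [1.0, 0.5]]
wealth = crp_wealth(1.0, [0.5, 0.5], x)
```

In this toy market buying and holding either asset leaves wealth unchanged after every two periods, whereas the rebalanced half-and-half portfolio grows by a factor 1.125 over the same two periods.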

Including a cash account into the framework is straightforward by assuming $x_n^i = 1$ (or $R_n^i = 0$) for some $i$ and for all
$n$. The asymptotic average growth rate of the portfolio satisfies
$$W(\pi_v) = \lim_{n \to \infty} \frac{1}{n} \ln V(n) = \lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^n \ln \langle \pi_v , X_j \rangle$$
if the limit exists. If the market process $\{X_i\}$ is memoryless (it is a sequence of independent and identically distributed
(i.i.d.) random return vectors) then the asymptotic rate of growth exists almost surely (a.s.), where, with random vector
$X$ being distributed as $X_i$, we get
$$W(\pi_v) = \lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^n \ln \langle \pi_v , X_j \rangle = E[\ln \langle \pi_v , X \rangle] = E[\ln \langle \pi_v , (1 + R) \rangle] \qquad (9.2.9)$$

given that $E[\ln \langle \pi_v , X \rangle]$ is finite, due to the strong law of large numbers. We can ensure this property by assuming
finiteness of $E[\ln X^i]$, that is, $E[|\ln X^i|] < \infty$ for each $i \in \{1, 2, .., N\}$. Because $\pi_v^i > 0$ for some $i$, we have
$$E[\ln \langle \pi_v , X \rangle] \geq E[\ln(\pi_v^i X^i)] = \ln \pi_v^i + E[\ln X^i] > -\infty$$
and because $\pi_v^i \leq 1$ for all $i$, we have
$$E[\ln \langle \pi_v , X \rangle] \leq \ln N + \sum_j E[|\ln X^j|] < \infty$$

From Equation (9.2.9), it follows that rebalancing according to the best log-optimal strategy
$$\pi_v^* \in \arg\max_{\pi_v \in \Delta_N} E[\ln \langle \pi_v , X \rangle]$$
is also an asymptotically optimal trading strategy, that is, a strategy with a.s. optimum asymptotic growth
$$W(\pi_v^*) \geq W(\pi_v)$$
for any $\pi_v \in \Delta_N$. The strategy of rebalancing according to $\pi_v^*$ at the beginning of each trading period is called the best
constantly rebalanced portfolio (BCRP). For details on the maximisation of the asymptotic average rate of growth see
Bell et al. [1980].
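For a two-asset market the log-optimal portfolio can be approximated by a simple grid search over the simplex, replacing the expectation in Equation (9.2.9) by a sample average. This is only a sketch: the function name `bcrp_two_assets`, the grid resolution and the toy return vectors are assumptions, and a grid search is used here in place of a proper optimiser.

```python
import numpy as np

def bcrp_two_assets(x, grid_size=101):
    """Grid-search the weight b of asset 2 maximising the average log growth.

    x : array of shape (n, 2) of return vectors; the portfolio is (1 - b, b).
    Returns the best weight b and the corresponding average log growth rate.
    """
    x = np.asarray(x, dtype=float)
    weights = np.linspace(0.0, 1.0, grid_size)
    growth = np.array([np.mean(np.log((1.0 - b) * x[:, 0] + b * x[:, 1]))
                       for b in weights])
    best = int(np.argmax(growth))
    return weights[best], growth[best]

# Cash-like asset against an asset that doubles or halves with equal frequency.
b_star, w_star = bcrp_two_assets([[1.0, 2.0], [1.0, 0.5]])
```

For this toy market the maximiser is the half-and-half portfolio, with a growth rate of $\frac{1}{2} \ln(1.125)$ per period.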


9.2.2.2

Short selling: No ruin constraints

Short selling an asset is usually done by borrowing the asset under consideration and selling it. As collateral the
investor has to provide securities of the same value to the lender of the shorted asset. This ensures that if anything
goes wrong, the lender still has a high recovery rate. While the investor has to provide collateral, after selling the
borrowed assets he obtains the price of the shorted asset again. This means that short selling is virtually free
$$V' = V - C + P$$
where $V'$ is wealth after opening the short position, $V$ is wealth before, $C$ is collateral for borrowing and $P$ is price
income of selling the asset being shorted. For simplicity we assume
$$C = P \qquad (9.2.10)$$
so that $V' = V$ and short selling is free. Note, in the case of a naked short transaction, selling an asset short yields
immediate cash (see Cover et al. [1998]).
For example, assume an investor wants to short sell 10 shares of IBM at $100, and he has $1000 in cash. First he
has to find a lender, the short provider, who is willing to lend the shares. After exchanging the shares and the $1000
collateral, the investor sells the borrowed shares. After selling the investor has $1000 in cash again, and the obligation
to cover the shorted assets later. If the price drops $10, he has to cover the short position at $90, thus he gains 10 × $10
(weight is 10 and return is $10). If the price rises $10, he has to cover at $110, losing 10 × $10. In this example we
assume that our only investment is in asset i and our initial wealth is V (0). We invest a proportion of πvi ∈ (−1, 1) of
our wealth in the ith risky asset and πv0 = 1 − πvi is invested in cash.
If the position is long ($\pi_v^i > 0$), it results in the wealth in period $n = 1$ as
$$V(1) = V(0)(1 - \pi_v^i) + V(0) \pi_v^i X_1^i = V(0) + V(0) \pi_v^i (X_1^i - 1) = V(0) + \delta^i (S_1^i - S_0^i)$$
where $V(0) \pi_v^0$ is invested in cash. Equivalently, we get
$$V(1) = V(0)(1 - \pi_v^i) + V(0) \pi_v^i (1 + R_1^i) = V(0) + V(0) \pi_v^i R_1^i$$
Again, we hold $V(0) \pi_v^i$ of asset $i$ and we earn $V(0) \pi_v^i R_1^i$ of interest.
While if the position is short (πvi < 0), we win as much money as price drop of the asset.
Remark 9.2.4 As we are short selling a stock, we do not hold the asset and we do not earn interest out of it. From
Equation (9.2.10), we only make a profit when the value of the asset drops.
Given the return in Equation (9.2.8), when the price of the asset drops we have
\[ 1 - X_n^i = \frac{S_{n-1}^i - S_n^i}{S_{n-1}^i} = -R_n^i \ge 0 \]
and the wealth becomes
\[ V(1) = V(0) + V(0)|\pi_v^i|(1 - X_1^i) = V(0) - V(0)\pi_v^i (1 - X_1^i) = V(0) + V(0)\pi_v^i (X_1^i - 1) \]
since from Equation (9.2.10) short selling is free. That is, \(V\) is used as collateral to borrow stocks, and once they
are sold we receive income (cash) from them. Equivalently, we get
\[ V(1) = V(0) + V(0)|\pi_v^i|(-R_1^i) = V(0) + V(0)\pi_v^i R_1^i \]
Remark 9.2.5 Since \(\pi_v^i < 0\), when the price rises we get a positive return \(R_1^i > 0\) and we lose money, and when
the price drops we get a negative return \(R_1^i < 0\) and we make money.
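
The single-asset case can be checked numerically. The following is a minimal sketch, not part of the original text: the
function name and its arguments are illustrative, cash is assumed to earn no interest, and short selling is assumed to be
free as in Equation (9.2.10). It evaluates \(V(1) = V(0) + V(0)\pi_v^i R_1^i\) for a signed weight and reproduces the IBM
example.

```python
def wealth_after_one_period(v0, weight, price_start, price_end):
    """One-period wealth for a single risky asset plus cash.

    weight is the signed proportion pi of initial wealth: positive for a
    long position, negative for a short one. Cash earns no interest and
    opening the short is assumed to cost nothing (C = P).
    """
    simple_return = (price_end - price_start) / price_start  # R = X - 1
    return v0 + v0 * weight * simple_return


# Short 10 IBM shares at $100 with $1000 of wealth, i.e. weight = -1000/1000.
# The text takes weights in the open interval (-1, 1); -1.0 is used here
# only to reproduce the 10-share example.
v0 = 1000.0
print(wealth_after_one_period(v0, -1.0, 100.0, 90.0))   # price drops by $10
print(wealth_after_one_period(v0, -1.0, 100.0, 110.0))  # price rises by $10
```

The same function covers long positions: a positive weight gains when the price rises and loses when it falls.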

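Let's consider the general case where \(\pi_v = (\pi_v^0, \pi_v^1, ..., \pi_v^N)\) is the portfolio vector such that the 0th
component corresponds to cash. From Remark (9.2.4), we can conclude that at the end of the first trading period, \(n = 1\),
the investor's wealth becomes
\[ V(1) = V(0)\Big( \pi_v^0 + \sum_{i=1}^N \big[ \pi_v^{i+} X_1^i + \pi_v^{i-} (X_1^i - 1) \big] \Big)^+ \]
where \((\cdot)^+\) and \((\cdot)^-\) denote the positive part and negative part operations, with the signed convention
\(a^- = \min(a, 0)\) so that \(a = a^+ + a^-\). Equivalently, we get
\[ V(1) = V(0)\Big( \pi_v^0 + \sum_{i=1}^N \big[ \pi_v^{i+} (1 + R_1^i) + \pi_v^{i-} R_1^i \big] \Big)^+ \]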
If the investor's net wealth falls to zero or below, he defaults. Since negative wealth is not allowed in our
framework, we apply the outer positive part operation. Since only long positions cost money in this setup, we restrict
ourselves to portfolios such that
\[ \sum_{i=0}^N \pi_v^{i+} = 1 \]
leading to \(\pi_v^0 = 1 - \sum_{i=1}^N \pi_v^{i+}\). Hence, if the portfolio has no long position, that is, only short
positions, then from Equation (9.2.10) all the capital is in cash, \(\pi_v^0 = 1\), since short selling is free. Considering
this, we can rewrite the wealth in period \(n = 1\) as

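\[
\begin{aligned}
V(1) &= V(0)\Big( \sum_{i=0}^N \pi_v^{i+} + \sum_{i=1}^N \big[ \pi_v^{i+} (X_1^i - 1) + \pi_v^{i-} (X_1^i - 1) \big] \Big)^+ \\
     &= V(0)\Big( 1 + \sum_{i=1}^N \pi_v^i (X_1^i - 1) \Big)^+
\end{aligned}
\]
Equivalently, we get
\[ V(1) = V(0)\Big( 1 + \sum_{i=1}^N \pi_v^i R_1^i \Big)^+ \]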

This shows that we gain as much as long positions rise and short positions fall (see Remark (9.2.5)). Hence, short
selling is a risky investment, because it is possible to default on the total initial wealth without the default of
any of the assets in the portfolio. This possibility would lead to a growth rate of minus infinity, so we restrict
our market according to
\[ 1 - B + \delta < X_n^i < 1 + B - \delta , \quad i = 1, ..., N \tag{9.2.11} \]
or equivalently
\[ -B + \delta < R_n^i < B - \delta , \quad i = 1, ..., N \]
Besides aiming at no-ruin, the role of \(\delta > 0\) is to ensure that the rate of growth is finite for any portfolio
vector. For the usual stock market daily data, there exist \(0 < a_1 < 1 < a_2 < \infty\) such that
\[ a_1 \le X_n^i \le a_2 \quad \text{or} \quad a_1 - 1 \le R_n^i \le a_2 - 1 \]


for all \(i = 1, ..., N\), for example with \(a_1 = 0.7\) and \(a_2 = 1.2\). Thus, we can choose \(B = 0.3\). As a result, the
maximal loss that we could suffer is
\[ B \sum_{i=1}^N |\pi_v^i| \]

This value has to be constrained to ensure no-ruin. We denote the set of possible portfolio vectors by
\[ \Delta_N^{-B} = \Big\{ \pi_v = (\pi_v^0, \pi_v^1, ..., \pi_v^N) , \; \pi_v^0 \ge 0 , \; \sum_{i=0}^N \pi_v^{i+} = 1 , \; B \sum_{i=1}^N |\pi_v^i| \le 1 \Big\} \]
where \(\sum_{i=0}^N \pi_v^{i+} = 1\) means that we invest all of our initial wealth into some assets - buying long - or cash.
By \(B \sum_{i=1}^N |\pi_v^i| \le 1\) the maximal exposure is limited such that ruin is not possible, and the rate of growth
is finite. When no cash is held (\(\pi_v^0 = 0\)), this is equivalent to
\[ \sum_{i=1}^N |\pi_v^{i-}| \le \frac{1 - B}{B} \]
which is approximately 2.33 for \(B = 0.3\).

Since the set of possible portfolio vectors \(\Delta_N^{-B}\) is not convex, we cannot apply the Kuhn-Tucker theorem to get
the maximum asymptotic average rate of growth. Horvath et al. [2011] proposed to transform the non-convex set
\(\Delta_N^{-B}\) to a convex region \(\tilde{\Delta}_N^{-B}\) and showed that in that setting \(\tilde{\pi}_v^*\) had the same
market exposure as \(\pi_v^*\).
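
To illustrate the no-ruin constraint, the sketch below checks whether a portfolio vector satisfies the three conditions
defining the set of admissible portfolios (non-negative cash, positive parts summing to one, and \(B\) times the risky
exposure at most one) and evaluates the one-period wealth \(V(1) = V(0)(1 + \sum_i \pi_v^i R_1^i)^+\). The function names,
the tolerance and the two example portfolios are assumptions made for illustration; they do not come from the original
text or from Horvath et al. [2011].

```python
import numpy as np


def in_no_ruin_set(pi, B, tol=1e-12):
    """Check pi = (pi0, pi1, ..., piN) against the no-ruin conditions.

    Conditions: pi0 >= 0, the positive parts (cash included) sum to 1,
    and B * sum_{i>=1} |pi_i| <= 1.
    """
    pi = np.asarray(pi, dtype=float)
    long_budget = np.clip(pi, 0.0, None).sum()
    exposure = np.abs(pi[1:]).sum()
    return bool(pi[0] >= 0.0
                and abs(long_budget - 1.0) <= tol
                and B * exposure <= 1.0 + tol)


def wealth_short_selling(v0, pi, R):
    """V(1) = V(0) * (1 + sum_i pi_i * R_i)^+ over the risky weights pi[1:]."""
    pi = np.asarray(pi, dtype=float)
    R = np.asarray(R, dtype=float)
    return v0 * max(1.0 + float(pi[1:] @ R), 0.0)


B = 0.3                       # from a1 = 0.7 and a2 = 1.2 in the text
pi_ok = [0.2, 0.8, -1.5]      # cash, one long, one short (hypothetical)
pi_bad = [0.0, 1.0, -3.0]     # short exposure too large (hypothetical)
worst = [-B, B]               # limiting case: the long falls, the short rises

print(in_no_ruin_set(pi_ok, B), wealth_short_selling(1.0, pi_ok, worst))
print(in_no_ruin_set(pi_bad, B), wealth_short_selling(1.0, pi_bad, worst))
```

In the limiting worst case the first portfolio keeps about 31% of its wealth, which is one minus the maximal loss
\(B \sum_{i=1}^N |\pi_v^i| = 0.69\), while the second portfolio violates the exposure constraint and is wiped out.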
9.2.2.3 Long only investment: leveraged

Assuming the condition in Equation (9.2.11), market exposure can be increased above one without the possibility of
ruin. Given the portfolio vector \(\pi_v = (\pi_v^0, \pi_v^1, ..., \pi_v^N)\) where \(\pi_v^0\) is the cash component, the no short
selling condition implies that \(\pi_v^i \ge 0\) for \(i = 1, ..., N\). We assume the investor can borrow and lend money at the
same rate \(r\), and that the maximal investable amount of cash \(L_{B,r}\) (relative to the initial wealth \(V(0)\)) is always
available to the investor. That is, \(L_{B,r} \ge 1\), sometimes called the buying power, is chosen to be the maximal amount
that can be invested without the possibility of ruin, given Equation (9.2.11). Because our investor decides over the
distribution of his buying power, we get the constraint
\[ \sum_{i=0}^N \pi_v^i = L_{B,r} \]
so that \(\pi_v^0 = L_{B,r} - \sum_{i=1}^N \pi_v^i\). Unspent cash earns the same interest \(r\) as the rate of lending. The market
vector is defined as
\[ X_r = (X^0, X^1, ..., X^N) = (1 + r, X^1, ..., X^N) \]
where \(X^0 = 1 + r\). The feasible set of portfolio vectors is
\[ \Delta_N^{+B,r} = \Big\{ \pi_v = (\pi_v^0, \pi_v^1, ..., \pi_v^N) \in \mathbb{R}_0^{+\,N+1} , \; \sum_{i=0}^N \pi_v^i = L_{B,r} \Big\} \]
where \(\pi_v^0\) denotes unspent buying power. Hence, the investor's wealth evolves according to
\[ V(1) = V(0)\big( \langle \pi_v , X_r \rangle - (L_{B,r} - 1)(1 + r) \big)^+ \]
where \(V(0) r (L_{B,r} - 1)\) is the interest on borrowing \((L_{B,r} - 1)\) times the initial wealth \(V(0)\). Equivalently, we get


\[ V(1) = V(0)\Big( \sum_{i=0}^N \pi_v^i (1 + R^i) - (L_{B,r} - 1)(1 + r) \Big)^+ \]
where \(\pi_v^0 X^0 = \pi_v^0 (1 + r)\) so that \(R^0 = r\). Given the maximal investable amount of cash \(L_{B,r} \ge 1\) relative to
the initial wealth \(V(0)\), we borrow the quantity \((L_{B,r} - 1)\), which is invested in the long-only portfolio. However,
we do not keep borrowing that quantity on the rolling portfolio, so we must subtract it from our rolling portfolio. We
can visualise this by expanding the previous equation
\[
\begin{aligned}
V(1) &= V(0)\Big( L_{B,r} + \sum_{i=0}^N \pi_v^i R^i - (L_{B,r} - 1)(1 + r) \Big)^+ \\
     &= V(0)\Big( 1 + \sum_{i=0}^N \pi_v^i R^i - r (L_{B,r} - 1) \Big)^+
\end{aligned}
\]
Further, as we borrow the amount \(V(0)(L_{B,r} - 1)\) from the broker, we must subtract the interest \(r V(0)(L_{B,r} - 1)\)
from our portfolio. To ensure no-ruin and finiteness of the growth rate, choose
\[ L_{B,r} = \frac{1 + r}{B + r} \]

This ensures that ruin is not possible:
\[
\begin{aligned}
\langle \pi_v , X_r \rangle - (L_{B,r} - 1)(1 + r) &= \sum_{i=0}^N \pi_v^i X^i - (L_{B,r} - 1)(1 + r) \\
 &= \pi_v^0 (1 + r) + \sum_{i=1}^N \pi_v^i X^i - (L_{B,r} - 1)(1 + r) \\
 &> \pi_v^0 (1 + r) + \sum_{i=1}^N \pi_v^i (1 - B + \delta) - (L_{B,r} - 1)(1 + r)
\end{aligned}
\]
and after simplification
\[ \langle \pi_v , X_r \rangle - (L_{B,r} - 1)(1 + r) > \delta \frac{1 + r}{B + r} \]
For details on maximising the asymptotic average rate of growth see Horvath et al. [2011].
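
The buying power and the no-ruin bound can be checked with a few lines of Python. This is a minimal sketch under stated
assumptions: the function names are illustrative, and the values of \(B\), \(r\) and \(\delta\) are hypothetical rather than
taken from the text. It computes \(L_{B,r} = (1 + r)/(B + r)\), evaluates the leveraged long-only wealth, and compares it
with the lower bound \(\delta (1 + r)/(B + r)\).

```python
def buying_power(B, r):
    """L_{B,r} = (1 + r) / (B + r)."""
    return (1.0 + r) / (B + r)


def wealth_leveraged_long(v0, pi, X, B, r):
    """V(1) = V(0) * (<pi, X_r> - (L - 1)(1 + r))^+.

    pi = (pi0, ..., piN) are non-negative weights summing to L_{B,r};
    X = (X1, ..., XN) are the price relatives of the risky assets.
    """
    L = buying_power(B, r)
    if min(pi) < 0 or abs(sum(pi) - L) > 1e-9:
        raise ValueError("weights must be non-negative and sum to L_{B,r}")
    gross = pi[0] * (1.0 + r) + sum(w * x for w, x in zip(pi[1:], X))
    return v0 * max(gross - (L - 1.0) * (1.0 + r), 0.0)


B, r, delta = 0.3, 0.0001, 0.01      # hypothetical values
L = buying_power(B, r)
# Whole buying power in one asset that falls close to the lower bound 1 - B + delta.
v1 = wealth_leveraged_long(1.0, [0.0, L], [1.0 - B + 0.02], B, r)
print(L, v1, delta * L)              # v1 stays above delta * L
```

Putting all the buying power into a single asset is the most exposed choice, so it is a natural case on which to test
the bound.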

9.2.2.4 Short selling and leverage

Assuming short selling and leverage, we combine the results of the previous two sections, so that the investor's wealth
evolves according to
\[ V(1) = V(0)\Big( \pi_v^0 (1 + r) + \sum_{i=1}^N \big[ \pi_v^{i+} X_1^i + \pi_v^{i-} (X_1^i - 1 - r) \big] - (L_{B,r} - 1)(1 + r) \Big)^+ \]
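where \((\cdot)^-\) denotes the negative part operation. Equivalently, we get
\[ V(1) = V(0)\Big( \pi_v^0 (1 + r) + \sum_{i=1}^N \big[ \pi_v^{i+} (1 + r_1^i) + \pi_v^{i-} r_1^i \big] - (L_{B,r} - 1)(1 + r) \Big)^+ \]
where, for consistency with the previous equation, \(r_1^i\) stands for \(R_1^i\) on a long position and for \(R_1^i - r\) on a
short position. Expanding the equation, the investor's wealth becomes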
\[ V(1) = V(0)\Big( \pi_v^0 (1 + r) + \sum_{i=1}^N \pi_v^{i+} + \sum_{i=1}^N \big[ \pi_v^{i+} r_1^i + \pi_v^{i-} r_1^i \big] - (L_{B,r} - 1)(1 + r) \Big)^+ \tag{9.2.12} \]
with the buying power constraint

\[ \sum_{i=0}^N \pi_v^{i+} = L_{B,r} \]
so that \(\pi_v^0 = L_{B,r} - \sum_{i=1}^N \pi_v^{i+}\). Expanding the previous equation, and putting terms together, we get

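\[
\begin{aligned}
V(1) &= V(0)\Big( r \pi_v^0 + 1 + \sum_{i=1}^N \pi_v^i r_1^i - r (L_{B,r} - 1) \Big)^+ \\
     &= V(0)\Big( 1 + \sum_{i=0}^N \pi_v^i r_1^i - r (L_{B,r} - 1) \Big)^+
\end{aligned}
\]
with \(r_1^0 = r\). The feasible set corresponding to the non-convex region is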
\[ \Delta_N^{\pm B,r} = \Big\{ \pi_v = (\pi_v^0, \pi_v^1, ..., \pi_v^N) , \; \sum_{i=0}^N \pi_v^{i+} = L_{B,r} , \; B \sum_{i=0}^N |\pi_v^i| \le 1 \Big\} \tag{9.2.13} \]
Again, using the transformation for short selling, Horvath et al. [2011] showed that in that setting \(\tilde{\pi}_v^*\) had
the same market exposure as \(\pi_v^*\).
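
As a rough numerical illustration of the combined case, the sketch below evaluates the first wealth equation of this
section directly in terms of the price relatives \(X_1^i\). The function name, the choice of passing \(L_{B,r}\) as a plain
argument, and the example numbers are assumptions made for illustration; membership in the feasible set is not checked.

```python
import numpy as np


def wealth_short_and_leverage(v0, pi, X, L, r):
    """One-period wealth with short selling and leverage.

    V(1) = V(0) * ( pi0 (1 + r)
                    + sum_i [ pi_i^+ X_i + pi_i^- (X_i - 1 - r) ]
                    - (L - 1)(1 + r) )^+

    pi = (pi0, pi1, ..., piN); pi_i^- = min(pi_i, 0) is the signed negative part.
    """
    pi = np.asarray(pi, dtype=float)
    X = np.asarray(X, dtype=float)
    risky = pi[1:]
    pos = np.clip(risky, 0.0, None)    # long legs
    neg = np.clip(risky, None, 0.0)    # short legs (non-positive)
    inside = (pi[0] * (1.0 + r) + pos @ X + neg @ (X - 1.0 - r)
              - (L - 1.0) * (1.0 + r))
    return v0 * max(float(inside), 0.0)


# Hypothetical numbers: L = 2, r = 1%, cash 0.5, long 1.5, short -0.5.
# The positive parts 0.5 + 1.5 add up to the buying power L.
v1 = wealth_short_and_leverage(1.0, [0.5, 1.5, -0.5], [1.1, 0.9], L=2.0, r=0.01)
print(v1)
```

Here the long asset rises by 10% and the shorted asset falls by 10%, so both legs contribute positively to the wealth.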

9.3 Empirical log-optimal portfolio selections

We presented in Section (2.3.2) the growth optimal portfolio (GOP) as a portfolio having maximal expected growth
rate over any time horizon. Following Gyorfi et al. [2011], we are now going to introduce the growth optimum theory
(GOT) as an alternative to the expected utility theory and the mean-variance approaches to asset pricing. Investment
strategies are allowed to use information collected from the past of the market, and determine a portfolio at the
beginning of a trading period, that is, a way of distributing their current capital among the available assets. The goal of
the investor is to maximise his wealth in the long run without knowing the underlying distribution generating the stock
prices. Under this assumption the asymptotic rate of growth has a well-defined maximum which can be achieved in
full knowledge of the underlying distribution generating the stock prices. In this section, both static (buy and hold)
and dynamic (daily rebalancing) portfolio selections are considered under various assumptions on the behaviour of the
market process. While every static portfolio asymptotically approximates the growth rate of the best asset in the study,
one can achieve a larger growth rate with daily rebalancing. Under a memoryless assumption on the underlying process
generating the asset prices, the best rebalancing is the log-optimal portfolio, which achieves the maximal asymptotic
average growth rate. After presenting the log-optimal portfolio, we will then briefly present the semi-log optimal
portfolio selection as an alternative to the log-optimal portfolio.

9.3.1 Static portfolio selection

Keeping the same notation as in Section (2.3.2), the market consists of \(N\) assets, represented by an \(N\)-dimensional
vector process \(S\) where
\[ S_n = (S_n^0, S_n^1, .., S_n^N) \]
with \(S_n^0 = 1\), and such that the \(i\)th component \(S_n^i\) of \(S_n\) denotes the price of the \(i\)th asset on the \(n\)th
trading period. Further, we put \(S_0^i = 1\). We assume that \(\{S_n\}\) has exponential trend:
\[ S_n^i = e^{n W_n^i} \approx e^{n W^i} \]
with average growth rate (average yield)
\[ W_n^i = \frac{1}{n} \ln S_n^i \]
and with asymptotic average growth rate
\[ W^i = \lim_{n \to \infty} \frac{1}{n} \ln S_n^i \]

A static portfolio selection is a single period investment strategy. A portfolio vector is denoted by
\(\pi_v = (\pi_v^1, .., \pi_v^N)\) where the \(i\)th component \(\pi_v^i\) of \(\pi_v\) denotes the proportion of the investor's capital
invested in asset \(i\). We assume that the portfolio vector \(\pi_v\) has non-negative components which sum up to 1, meaning
that short selling is not permitted. The set of portfolio vectors is denoted by
\[ \Delta_N = \Big\{ \pi_v = (\pi_v^1, ..., \pi_v^N) , \; \pi_v^i \ge 0 , \; \sum_{i=1}^N \pi_v^i = 1 \Big\} \]
The aim of static portfolio selection is to achieve \(\max_{1 \le i \le N} W^i\). The static portfolio is an index, for
example the S&P 500, such that at time \(n = 0\) we distribute the initial capital \(V(0)\) according to a fixed portfolio
vector \(\pi_v\), that is, if \(V(n)\) denotes the wealth at the trading period \(n\), then
\[ V(n) = V(0) \sum_{i=1}^N \pi_v^i S_n^i \]
We apply the following simple bounds
\[ V(0) \max_i \pi_v^i S_n^i \le V(n) \le N V(0) \max_i \pi_v^i S_n^i \]
If \(\pi_v^i > 0\) for all \(i = 1, .., N\) then these bounds imply that
\[ W = \lim_{n \to \infty} \frac{1}{n} \ln V(n) = \lim_{n \to \infty} \max_i \frac{1}{n} \ln S_n^i = \max_i W^i \]
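
The bounds and the resulting limit can be observed on simulated data. The following is a minimal sketch, not taken from
the original text: the prices are generated from hypothetical log drifts with Gaussian noise and a fixed random seed, the
portfolio is uniform, and \(V(0) = 1\). It verifies the two bounds on \(V(n)\) in every period and compares the average
growth rate of the buy-and-hold portfolio with the growth rates \(W_n^i\) of the individual assets.

```python
import numpy as np

rng = np.random.default_rng(0)                 # fixed seed, illustrative only
n_periods, n_assets = 5000, 3
drift = np.array([0.0002, 0.0005, 0.0010])     # hypothetical per-period log drifts
log_returns = drift + 0.01 * rng.standard_normal((n_periods, n_assets))
S = np.exp(np.cumsum(log_returns, axis=0))     # prices S_n^i, normalised so S_0^i = 1

pi = np.full(n_assets, 1.0 / n_assets)         # uniform static portfolio
V = S @ pi                                     # buy-and-hold wealth V(n) with V(0) = 1

# Simple bounds: max_i pi_i S_n^i <= V(n) <= N * max_i pi_i S_n^i
best = (pi * S).max(axis=1)
assert np.all(best <= V * (1 + 1e-12))
assert np.all(V <= n_assets * best * (1 + 1e-12))

W_assets = np.log(S[-1]) / n_periods           # W_n^i = (1/n) ln S_n^i
W_portfolio = np.log(V[-1]) / n_periods        # (1/n) ln V(n)
print(W_assets, W_portfolio)
```

With uniform weights the portfolio growth rate lies between \(\max_i W_n^i - \ln(N)/n\) and \(\max_i W_n^i\), so the gap to
the best asset shrinks at rate \(1/n\).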

Thus, any static portfolio selection achieves the growth rate of the best asset in the study, \(\max_i W^i\), and so the
limit does not depend on the portfolio \(\pi_v\). In the case of the uniform portfolio (uniform index), \(\pi_v^i = \frac{1}{N}\),
and the convergence above is from below:
\[ V(0) \max_i \frac{1}{N} S_n^i \le V(n) \le V(0) \max_i S_n^i \]

9.3.2 Constantly rebalanced portfolio selection

In order to apply the usual prediction techniques for time series analysis one has to transform the