The GPML Toolbox version 4.2

Carl Edward Rasmussen & Hannes Nickisch

June 15, 2018

Abstract

The GPML toolbox is an Octave 3.2.x and Matlab 7.x implementation of inference and prediction in Gaussian process (GP) models. It implements algorithms discussed in Rasmussen & Williams: Gaussian Processes for Machine Learning, the MIT Press, 2006 and Nickisch & Rasmussen: Approximations for Binary Gaussian Process Classification, JMLR, 2008.

The strength of the toolbox lies in its flexibility, simplicity and extensibility. It is flexible because, first, it allows the properties of the GP to be specified through the definition of mean and covariance functions; second, it allows a choice of inference procedure, e.g. exact inference or Expectation Propagation (EP); and third, it allows a choice of likelihood function, e.g. Gaussian or Laplace (for regression) and cumulative Logistic (for classification). Simplicity is achieved through a single user function and compact code. Extensibility is ensured by a modular design which makes it easy to add to the already fairly extensive libraries of inference methods, mean functions, covariance functions and likelihood functions.

This document is a technical manual for developers and contains many details. If you are not yet familiar with the GPML toolbox, the user documentation and the examples therein are a better way to get started.

Contents

1 Gaussian Process Training and Prediction
2 The gp Function
3 Inference Methods
3.1 Exact Inference with Gaussian likelihood
3.2 Laplace's Approximation
3.3 Expectation Propagation
3.4 Kullback Leibler Divergence Minimisation
3.5 Variational Bayes
3.6 Compatibility Between Inference Methods and Covariance Approximations
3.7 Sparse Covariance Approximations
3.8 Grid-Based Covariance Approximations
3.9 State Space Representation of GPs
4 Likelihood Functions
4.1 Prediction
4.2 Interface
4.3 Implemented Likelihood Functions
4.4 Usage of Implemented Likelihood Functions
4.5 Compatibility Between Likelihoods and Inference Methods
4.6 Gaussian Likelihood
4.6.1 Exact Inference
4.6.2 Laplace's Approximation
4.6.3 Expectation Propagation
4.6.4 Variational Bayes
4.7 Warped Gaussian Likelihood
4.8 Gumbel Likelihood
4.9 Laplace Likelihood
4.10 Student's t Likelihood
4.11 Cumulative Logistic Likelihood
4.12 GLM Likelihoods: Poisson, Negative Binomial, Weibull, Gamma, Exponential, Inverse Gaussian and Beta
4.12.1 Inverse Link Functions
4.12.2 Poisson Likelihood
4.12.3 Weibull Likelihood
4.12.4 Gamma Likelihood
4.12.5 Exponential Likelihood
4.12.6 Inverse Gaussian Likelihood
4.12.7 Beta Likelihood
5 Mean Functions
5.1 Interface
5.2 Implemented Mean Functions
5.3 Usage of Implemented Mean Functions
6 Covariance Functions
6.1 Interface
6.2 Implemented Covariance Functions
6.3 Usage of Implemented Covariance Functions
7 Hyperpriors
7.1 Interface
7.2 Implemented Hyperpriors
7.3 Usage of Implemented Hyperpriors

1 Gaussian Process Training and Prediction

The gpml toolbox contains a single user function gp described in section 2. In addition there are a number of supporting structures and functions which the user needs to know about, as well as an internal convention for representing the posterior distribution, which may not be of direct interest to the casual user.

Inference Methods: An inference method is a function which computes the (approximate) posterior, the (approximate) negative log marginal likelihood and its partial derivatives w.r.t. the hyperparameters, given a model specification (i.e., GP mean and covariance functions and a likelihood function) and a data set. Inference methods are discussed in section 3. A new inference method requires a function providing the desired inference functionality and possibly extra functionality in the likelihood functions it applies to.

Hyperparameters: The hyperparameters are collected in a struct that controls the properties of the model, i.e. the GP mean and covariance functions and the likelihood function. The struct has the three fields mean, cov and lik, each of which is a vector. The number of elements in each field must agree with the number of hyperparameters in the specification of the corresponding function (below). If a field is either empty or non-existent it represents zero hyperparameters.
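As a concrete illustration, the sketch below sets up a hyperparameter struct for a model with a constant mean, an isotropic squared exponential covariance and Gaussian observation noise (the numerical values are arbitrary; meanConst, covSEiso and likGauss expect one, two and one hyperparameter(s), respectively):

  % minimal sketch: a hyperparameter struct matching a concrete model choice
  meanfunc = {@meanConst};          % 1 hyperparameter: the constant c
  covfunc  = {@covSEiso};           % 2 hyperparameters: log(ell), log(sf)
  likfunc  = {@likGauss};           % 1 hyperparameter: log(sn)
  hyp.mean = 0.5;                   % constant mean value
  hyp.cov  = log([1.0; 1.0]);       % [log(ell); log(sf)]
  hyp.lik  = log(0.1);              % log noise standard deviation

If, say, hyp.cov held three elements instead of two, gp would raise an error when checking the hyperparameter specification (section 2).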
When working with FITC approximate inference, the inducing inputs xu can also be treated as hyperparameters for some common stationary covariances. Hyperparameter Prior Distributions: When optimising the marginal likelihood w.r.t. hyperparameters, it is sometimes useful to softly constrain the hyperparameters by means of prior knowledge. A prior is a probability distribution over individual or a group of hyperparameters, section 7. Likelihood Functions: The likelihood function specifies the form of the likelihood of the GP model and computes terms needed for prediction and inference. For inference, the required properties of the likelihood depend on the inference method, including properties necessary for hyperparameter learning, section 4. Mean Functions: The mean function is a cell array specifying the GP mean. It computes the mean and its derivatives w.r.t.. the part of the hyperparameters pertaining to the mean. The cell array allows flexible specification and composition of mean functions, discussed in section 5. The default is the zero function. Covariance Functions: The covariance function is a cell array specifying the GP covariance function. It computes the covariance and its derivatives w.r.t.. the part of the hyperparameters pertaining to the covariance function. The cell array allows flexible specification and composition of covariance functions, discussed in section 6. Inference methods, see section 3, compute (among other things) an approximation to the posterior distribution of the latent variables fi associated with the training cases, i = 1, . . . , n. This approximate posterior is assumed to be Gaussian, and is communicated via a struct post with the fields post.alpha, post.sW and post.L. Often, starting from the Gaussian prior p(f) = N(f|m, K) the approximate posterior admits the form q(f|D) = N f|µ = m + Kα, V = (K−1 + W)−1 , where W diagonal with Wii = s2i . (1) In such cases, the entire posterior can be computed from the two vectors post.alpha and post.sW; the inference method may optionally also return L = chol(diag(s)K diag(s) + I). If on the other hand the posterior doesn’t admit the above form, then post.L returns the matrix L = −(K + W −1 )−1 (and post.sW is unused). In addition, a sparse representation of the posterior may be used, in which case the non-zero elements of the post.alpha vector indicate the active entries. 4 2 The gp Function The gp function is typically the only function the user would directly call. 4a hgp.m 4ai≡ 1 2 3 4 5 6 7 8 9 function [varargout] = gp(hyp, inf, mean, cov, lik, x, y, xs, ys) hgp function help 4bi hinitializations 5bi hinference 6ci if nargin==7 % if no test cases are provided varargout = {nlZ, dnlZ, post}; % report -log marg lik, derivatives and post else hcompute test predictions 7i end It offers facilities for training the hyperparameters of a GP model as well as predictions at unseen inputs as detailed in the following help. 4b hgp function help 4bi≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % (4a) Gaussian Process inference and prediction. The gp function provides a flexible framework for Bayesian inference and prediction with Gaussian processes for scalar targets , i.e. both regression and binary classification. The prior is Gaussian process , defined through specification of its mean and covariance function. The likelihood function is also specified. 
Both the prior and the likelihood may have hyperparameters associated with them. Two modes are possible: training or prediction: if no test cases are supplied , then the negative log marginal likelihood and its partial derivatives w.r.t. the hyperparameters is computed; this mode is used to fit the hyperparameters. If test cases are given , then the test set predictive probabilities are returned. Usage: training: [nlZ dnlZ ] = gp(hyp, inf, mean, cov, lik, x, y); prediction: [ymu ys2 fmu fs2 ] = gp(hyp, inf, mean, cov, lik, x, y, xs); or: [ymu ys2 fmu fs2 lp] = gp(hyp, inf, mean, cov, lik, x, y, xs, ys); where: hyp inf mean cov lik x y xs ys struct of column vectors of mean/cov/lik hyperparameters function specifying the inference method prior mean function prior covariance function likelihood function n by D matrix of training inputs column vector of length n of training targets ns by D matrix of test inputs column vector of length nn of test targets nlZ dnlZ returned value of the negative log marginal likelihood struct of column vectors of partial derivatives of the negative log marginal likelihood w.r.t. mean/cov/lik hyperparameters column vector (of length ns) of predictive output means column vector (of length ns) of predictive output variances column vector (of length ns) of predictive latent means column vector (of length ns) of predictive latent variances column vector (of length ns) of log predictive probabilities ymu ys2 fmu fs2 lp 5 40 41 42 43 44 45 46 5a % post struct representation of the (approximate) posterior % 3rd output in training mode or 6th output in prediction mode % can be reused in prediction mode gp(.., cov, lik, x, post, xs,..) % % See also infMethods.m, meanFunctions.m, covFunctions.m, likFunctions.m. % hgpml copyright 5ai hgpml copyright 5ai≡ (4b 9 10 19 22 24a 36 38 39 44 46 49) 1 % Copyright (c) by Carl Edward Rasmussen and Hannes Nickisch , 2018-06-15. 2 % File automatically generated using noweb. Depending on the number of input parameters, gp knows whether it is operated in training or in prediction mode. The highlevel structure of the code is as follows. After some initialisations, we perform inference and decide whether test set predictions are needed or only the result of the inference is demanded. 5b hinitializations 5bi≡ 1 2 3 4 (4a) hminimalist usage 5ci hmultivariate output by recursion 5di hprocess input arguments 6ai hcheck hyperparameters 6bi If the number of input arguments is incorrect, we echo a minimalist usage and return. 5c hminimalist usage 5ci≡ (5b) 1 if nargin <7 || nargin >9 2 disp(’Usage: [nlZ dnlZ ] = gp(hyp, inf, mean, cov, lik, x, y);’) 3 disp(’ or: [ymu ys2 fmu fs2 ] = gp(hyp, inf, mean, cov, lik, x, y, xs);’) 4 disp(’ or: [ymu ys2 fmu fs2 lp] = gp(hyp, inf, mean, cov, lik, x, y, xs, ys);’) 5 return 6 end If there is more than a single output dimension in y, we call multiple instances of gp with shared hyperparameters and settings each computing the corresponding result with scalar output. 5d hmultivariate output by recursion 5di≡ (5b) 1 if size(y,2)>1 % deal with (independent) multivariate output y by recursing 2 d = size(y,2); varargout = cell(nargout ,1); out = cell(nargout ,1); % allocate 3 for i=1:d 4 in = {hyp, inf, mean, cov, lik, x, y(:,i)}; 5 if nargin >7, in = {in{:}, xs}; end 6 if nargin >8, in = {in{:}, ys(:,i)}; end 7 if i==1, [varargout{:}] = gp(in{:}); % perform inference for dimension .. 8 else [out{:}] = gp(in{:}); % .. 
number i in the output 9 if nargin==7, no = 2; 10 varargout{1} = varargout{1} + out{1}; % sum nlZ 11 if nargout >1 % sum dnlZ 12 varargout{2} = vec2any(hyp,any2vec(varargout{2})+any2vec(out{2})); 13 end 14 else no = 5; % concatenate ymu ys2 fmu fs2 lp 15 for j=1:min(nargout ,no), varargout{j} = [varargout{j},out{j}]; end 16 end 17 if nargout >no % concatenate post 18 if i==2, varargout{no+1} = {varargout{no+1},out{no+1}}; 19 else varargout{no+1} = {varargout{no+1}{:},out{no+1}}; end 20 end 21 end 22 end, return % return to end the recursion 23 end 6 Next, we set some useful default values for empty arguments, and convert inf and lik to function handles and mean and cov to cell arrays if necessary. Initialize variables. 6a hprocess input arguments 6ai≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 (5b) if isempty(mean), mean = {@meanZero}; end % set default mean if ischar(mean) || isa(mean, ’function_handle’), mean = {mean}; end % make cell if isempty(cov), error(’Covariance function cannot be empty’); end % no default if ischar(cov) || isa(cov,’function_handle’), cov = {cov}; end % make cell cstr = cov{1}; if isa(cstr,’function_handle’), cstr = func2str(cstr); end if (strcmp(cstr,’covFITC’) || strcmp(cstr,’apxSparse’)) && isfield(hyp,’xu’) cov{3} = hyp.xu; %use hyp.xu end if isempty(inf), inf = {@infGaussLik}; end % set default inference method if ischar(inf), inf = str2func(inf); end % convert into function handle if ischar(inf) || isa(inf,’function_handle’), inf = {inf}; end % make cell istr = inf{1}; if isa(istr,’function_handle’), istr = func2str(istr); end if strcmp(istr,’infPrior’) istr = inf{2}; if isa(istr,’function_handle’), istr = func2str(istr); end end if isempty(lik), lik = {@likGauss}; end % set default lik if ischar(lik) || isa(lik,’function_handle’), lik = {lik}; end % make cell lstr = lik{1}; if isa(lstr,’function_handle’), lstr = func2str(lstr); end D = size(x,2); if strncmp(cstr,’covGrid’,7) || strcmp(cstr,’apxGrid’) % only some inf* possible D = 0; xg = cov{3}; p = numel(xg); for i=1:p, D = D+size(xg{i},2); end % dims end Check that the sizes of the hyperparameters supplied in hyp match the sizes expected. The three parts hyp.mean, hyp.cov and hyp.lik are checked separately, and define empty entries if they don’t exist. 6b hcheck hyperparameters 6bi≡ 1 2 3 4 5 6 7 8 9 10 11 12 (5b) if ~isfield(hyp,’mean’), hyp.mean = []; end % check the hyp specification if eval(feval(mean{:})) ~= numel(hyp.mean) error(’Number of mean function hyperparameters disagree with mean function’) end if ~isfield(hyp,’cov’), hyp.cov = []; end if eval(feval(cov{:})) ~= numel(hyp.cov) error(’Number of cov function hyperparameters disagree with cov function’) end if ~isfield(hyp,’lik’), hyp.lik = []; end if eval(feval(lik{:})) ~= numel(hyp.lik) error(’Number of lik function hyperparameters disagree with lik function’) end Inference is performed by calling the desired inference method inf. In training mode, we accept a failure of the inference method (and issue a warning), since during hyperparameter learning, hyperparameters causing a numerical failure may be attempted, but the minimize function may gracefully recover from this. During prediction, failure of the inference method is an error. 
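For orientation, a typical call sequence as seen from the user's side might look like the following sketch (it assumes the minimize.m optimiser that ships with the toolbox and the variables meanfunc, covfunc, likfunc and hyp from the sketch in section 1; x, y and xs denote training inputs, training targets and test inputs):

  % sketch: fit the hyperparameters, then inspect the evidence and predict
  hyp2 = minimize(hyp, @gp, -100, @infGaussLik, meanfunc, covfunc, likfunc, x, y);
  [nlZ, dnlZ]          = gp(hyp2, @infGaussLik, meanfunc, covfunc, likfunc, x, y);     % training mode
  [ymu, ys2, fmu, fs2] = gp(hyp2, @infGaussLik, meanfunc, covfunc, likfunc, x, y, xs); % prediction mode

If the inference method fails during the minimize run, gp issues a warning and returns NaN for the negative log marginal likelihood together with zero derivatives (see the inference chunk below), so that the optimiser can recover.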
6c hinference 6ci≡ (4a) 1 try % call the inference method 2 % issue a warning if a classification likelihood is used in conjunction with 3 % labels different from +1 and -1 4 if strcmp(lstr,’likErf’) || strcmp(lstr,’likLogistic’) 5 if ~isstruct(y) 6 uy = unique(y); 7 if any( uy~=+1 & uy~=-1 ) 8 warning(’You try classification with labels different from {+1,-1}’) 7 9 end 10 end 11 end 12 if nargin >7 % compute marginal likelihood and its derivatives only if needed 13 if isstruct(y) 14 post = y; % reuse a previously computed posterior approximation 15 else 16 post = feval(inf{:}, hyp, mean, cov, lik, x, y); 17 end 18 else 19 if nargout <=1 20 [post nlZ] = feval(inf{:}, hyp, mean, cov, lik, x, y); dnlZ = {}; 21 else 22 [post nlZ dnlZ] = feval(inf{:}, hyp, mean, cov, lik, x, y); 23 end 24 end 25 catch 26 msgstr = lasterr; 27 if nargin >7, error(’Inference method failed [%s]’, msgstr); else 28 warning(’Inference method failed [%s] .. attempting to continue’,msgstr) 29 varargout = {NaN, vec2any(hyp,zeros(numel(any2vec(hyp)),1))}; return % go on 30 end 31 end We copy the already computed negative log marginal likelihood to the first output argument, and if desired report its partial derivatives w.r.t. the hyperparameters if running in inference mode. Predictions are computed in a loop over small batches to avoid memory problems for very large test sets. 7 hcompute test predictions 7i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 (4a) alpha = post.alpha; L = post.L; sW = post.sW; if issparse(alpha) % handle things for sparse representations nz = alpha ~= 0; % determine nonzero indices if issparse(L), L = full(L(nz,nz)); end % convert L and sW if necessary if issparse(sW), sW = full(sW(nz)); end else nz = true(size(alpha ,1),1); end % non-sparse representation if isempty(L) % in case L is not provided , we compute it K = feval(cov{:}, hyp.cov, x(nz,:)); L = chol(eye(sum(nz))+sW*sW’.*K); end %verify whether L contains valid Cholesky decomposition or something different Lchol = isnumeric(L) && all(all(tril(L,-1)==0)&diag(L)’>0&isreal(diag(L))’); ns = size(xs,1); % number of data points if strncmp(cstr,’apxGrid’,7), xs = apxGrid(’idx2dat’,cov{3},xs); end % expand nperbatch = 1000; % number of data points per mini batch nact = 0; % number of already processed test data points ymu = zeros(ns,1); ys2 = ymu; fmu = ymu; fs2 = ymu; lp = ymu; % allocate mem while nactuse Cholesky parameters (alpha ,sW,L) V = L’\(repmat(sW,1,length(id)).*Ks); fs2(id) = kss - sum(V.*V,1)’; % predictive variances else % L is not triangular => use alternative parametrisation if isnumeric(L), LKs = L*Ks; else LKs = L(Ks); end % matrix or callback fs2(id) = kss + sum(Ks.*LKs ,1)’; % predictive variances end fs2(id) = max(fs2(id),0); % remove numerical noise i.e. negative variances Fs2 = repmat(fs2(id),1,N); % we have multiple values in case of sampling if nargin <9 [Lp, Ymu, Ys2] = feval(lik{:},hyp.lik,[], Fmu(:),Fs2(:)); else Ys = repmat(ys(id),1,N); [Lp, Ymu, Ys2] = feval(lik{:},hyp.lik,Ys(:),Fmu(:),Fs2(:)); end lp(id) = sum(reshape(Lp, [],N),2)/N; % log probability; sample averaging ymu(id) = sum(reshape(Ymu,[],N),2)/N; % predictive mean ys|y and .. ys2(id) = sum(reshape(Ys2,[],N),2)/N; % .. variance 9 3 Inference Methods Inference methods are responsible for computing the (approximate) posterior post, the (approximate) negative log marginal likelihood nlZ and its partial derivatives dnlZ w.r.t. the hyperparameters hyp. 
The arguments to the function are hyperparameters hyp, mean function mean, covariance function cov, likelihood function lik and training data x and y. Several inference methods are implemented and described this section. 9 hinfMethods.m 9i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % Inference methods: Compute the (approximate) posterior for a Gaussian process. Methods currently implemented include: infGaussLik infLaplace infEP infVB infKL Exact inference (only possible with Gaussian likelihood) Laplace ’s Approximation Expectation Propagation Variational Bayes Approximation Kullback -Leibler optimal Approximation infMCMC Markov Chain Monte Carlo and Annealed Importance Sampling We offer two samplers. - hmc: Hybrid Monte Carlo - ess: Elliptical Slice Sampling No derivatives w.r.t. to hyperparameters are provided. infLOO infPrior Leave -One-Out predictive probability and Least -Squares Approxim. Perform inference with hyperparameter prior. The interface to the approximation methods is the following: function [post nlZ dnlZ] = inf..(hyp, mean, cov, lik, x, y) where: hyp mean cov lik x y is is is is is is nlZ dnlZ is the returned value of the negative log marginal likelihood is a (column) vector of partial derivatives of the negative log marginal likelihood w.r.t. each hyperparameter struct representation of the (approximate) posterior containing is a (sparse or full column vector) containing inv(K)*(mu-m), where K is the prior covariance matrix , m the prior mean, and mu the approx posterior mean is a (sparse or full column) vector containing diagonal of sqrt(W) the approximate posterior covariance matrix is inv(inv(K)+W) is a (sparse or full) triangular matrix , L = chol(sW*K*sW+eye(n)), or a full matrix , L = -inv(K+inv(W)), or a function L(A) of a matrix A such that -(K+inv(W))*L(A) = A post alpha sW L a struct of hyperparameters the name of the mean function (see meanFunctions.m) the name of the covariance function (see covFunctions.m) the name of the likelihood function (see likFunctions.m) a n by D matrix of training inputs a (column) vector (of size n) of targets Usually , the approximate posterior to be returned admits the form N(mu=m+K*alpha , V=inv(inv(K)+W)), where alpha is a vector and W is diagonal. For more information on the individual approximation methods and their 10 49 % implementations , see the separate inf??.m files. See also gp.m. 50 % 51 hgpml copyright 5ai Not all inference methods are compatible with all likelihood functions, e.g.. exact inference is only possible with Gaussian likelihood. In order to perform inference, each method needs various properties of the likelihood functions, section 4. 3.1 Exact Inference with Gaussian likelihood For Gaussian likelihoods, GP inference reduces to computing mean and covariance of a multivariate Gaussian which can be done exactly by simple matrix algebra. The program inf/infExact.m does exactly this. If it is called with a likelihood function other than the Gaussian, it issues an error. The Gaussian posterior q(f|D) = N f|µ, V is exact. 10 hinf/infGaussLik.m 10i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 function [post nlZ dnlZ] = infGaussLik(hyp, mean, cov, lik, x, y, opt) % Exact inference for a GP with Gaussian likelihood. 
% % Compute a parametrization of the posterior , the negative log marginal % likelihood and its derivatives w.r.t. the hyperparameters. The function takes % a specified covariance function (see covFunctions.m) and likelihood function % (see likFunctions.m), and is designed to be used with gp.m. % hgpml copyright 5ai % % See also INFMETHODS.M, APX.M. if if if if nargin <7, opt = []; end % make sure parameter exists iscell(lik), likstr = lik{1}; else likstr = lik; end ~ischar(likstr), likstr = func2str(likstr); end ~strcmp(likstr ,’likGauss’) % NOTE: no explicit call to likGauss error(’Exact inference only possible with Gaussian likelihood’); end [n, D] = size(x); [m,dm] = feval(mean{:}, hyp.mean, x); % evaluate mean vector and deriv sn2 = exp(2*hyp.lik); W = ones(n,1)/sn2; % noise variance of likGauss K = apx(hyp,cov,x,opt); % set up covariance approximation [ldB2,solveKiW ,dW,dhyp,post.L] = K.fun(W); % obtain functionality depending on W alpha = solveKiW(y-m); post.alpha = K.P(alpha); % return the posterior parameters post.sW = sqrt(W); % sqrt of noise precision vector if nargout >1 % do we want the marginal likelihood? nlZ = (y-m)’*alpha/2 + ldB2 + n*log(2*pi*sn2)/2; % -log marginal likelihood if nargout >2 % do we want derivatives? dnlZ = dhyp(alpha); dnlZ.mean = -dm(alpha); dnlZ.lik = -sn2*(alpha ’*alpha) - 2*sum(dW)/sn2 + n; end end 11 3.2 Laplace’s Approximation For differentiable likelihoods, Laplace’s approximation, approximates the posterior by a Gaussian centered at its mode and matching its curvature inf/infLaplace.m. More concretely, the mean of the posterior q(f|D) = N f|µ, V is – defining `i (fi ) = ln p(yi |fi ) and P `(f) = n ` (f ) – given by i i i=1 µ = arg min φ(f), where φ(f) = f 1 c (f − m)> K−1 (f − m) − `(f) = − ln[p(f)p(y|f)], 2 which we abbreviate by µ ← L(`). The curvature ∂2 φ ∂ff> (2) 2 ∂ = K−1 + W with Wii = − ∂f 2 ln p(yi |fi ) i −1 −1 serves as precision R for the Gaussian posterior approximation V R = (K + W) and the marginal likelihood Z = p(f)p(y|f)df is approximated by Z ≈ ZLA = φ̃(f)df where we use the 2nd order Taylor expansion at the mode µ given by φ̃(f) = φ(µ) + 21 (f − µ)> V−1 (f − µ) ≈ φ(f). Laplace’s approximation needs derivatives up to third order for the mode fitting procedure (Newton method) ∂k dk = log p(y|f), k = 0, 1, 2, 3 ∂fk and ∂ ∂k log p(y|f), k = 0, 1, 2 dk = ∂ρi ∂fk evaluated at the latent location f and observed value y. The likelihood calls (see section 4) • [d0, d1, d2, d3] = lik(hyp, y, f, [], ’infLaplace’) and • [d0, d1, d2] = lik(hyp, y, f, [], ’infLaplace’, i) return exactly these values. 3.3 Expectation Propagation The basic idea of Expectation Propagation (EP) as implemented in inf/infEP.m. is to replace the non-Gaussian likelihood terms p(yi |fi ) by Gaussian functions t(fi ; νi , τi ) = exp(νi fi − 12 τi f2i ) and to adjust the natural parameters νi , τi such that the following identity holds: Z Z 1 1 k f q−i (f) · t(f; νi , τi )df = fk q−i (f) · p(yi |f)df, k = 1, 2 Zt,i Zp,i Q with the so-called cavity distributions q−i (f) = N(f|m, K) j6=i t(fj ; νj , τj ) ∝ N(f|µ, V)/t(fi ; νi , τi ) equal to approximation function and the two normalisers R the posterior divided by the ith Gaussian R Zt,i = q−i (f) · t(fi ; νi , τi )df and Zp,i = q−i (f) · p(yi |fi )df. The moment matching corresponds to minimising the following local KL-divergence νi , τi = arg min KL[q−i (f)p(yi |fi )/Zp,i ||q−i (f)t(fi ; ν, τ)/Zt,i ]. 
ν,τ In order to apply the moment matching steps in a numerically safe way, EP requires the deriviatives of the expectations w.r.t. the Gaussian mean parameter µ Z ∂k dk = log p(y|f)N(f|µ, σ2 )df, k = 0, 1, 2 ∂µk 12 and the ith likelihood hyperparameter ρi Z ∂ d = log p(y|f)N(f|µ, σ2 )df ∂ρi which can be obtained by the likelihood calls (see section 4) • [d0, d1, d2] = lik(hyp, y, mu, s2, ’infEP’) and • d = lik(hyp, y, mu, s2, ’infEP’, i). 3.4 Kullback Leibler Divergence Minimisation Another well known approach to approximate inference implemented inf/infKL.m in attempts to directly find the closest Gaussian q(f|D) = N(f|µ, V) to the exact posterior p(f|D) w.r.t. to some proximity measure or equivalently to maximise a lower bound Z(µ, V) to the marginal likelihood Z as described in Nickisch & Rasmussen Approximations for Binary Gaussian Process Classification, JMLR, 2008. In particular, one minimises KL (N(f|µ, V)||p(f|D)) which amounts to minimising − ln Z(µ, V) as defined by: Z Z p(f) − ln Z = − ln p(f)p(y|f)df = − ln q(f|D) p(y|f)df q(f|D) Z Z Jensen q(f|D) 6 q(f|D) ln df − q(f|D) ln p(y|f)df =: − ln Z(µ, V) p(f) n Z X = KL (N(f|µ, V)||N(f|m, K)) − N(fi |µi , vii ) ln p(yi |fi )dfi , vii = [V]ii i=1 n 1 X 1 `KL (µi , vii ) tr(VK−1 − I) − ln |VK−1 | + (µ − m)> K−1 (µ − m) − 2 2 = i=1 R where = N(fi |µi , vii )`i (fi )dfi is the convolution of the log likelihood `i with the Gaussian N and v = dg(V). Equivalently, one can view `KL as a smoothed version of ` with univariate smoothing kernel N. `KL v (µi ) From Challis & Barber Concave Gaussian Variational Approximations for Inference in Large Scale Bayesian Linear Models, AISTATS, 2011 we know that the mapping (µ, L) 7→ − ln Z(µ, L> L) is jointly convex whenever the likelihoods fi 7→ P(yi |fi ) are log concave. In particular, this implies that every (µi , si ) 7→ −`KL (µi , s2i ) is jointly convex. We use an optimisation algorithm similar to EP (section 3.3) where we minimise the local KLdivergence the other way round µi , si = arg minµ,s KL[N(f|µ, s2 )||q−i (f)p(yi |fi )/Zp,i ]. This view was brought forward by Tom Minka Convex Divergence measures and message passing, MSRTR, 2005. The KL minimisation constitutes a jointly convex 2d optimisation problem solved by klmin using a scaled Newton approach which is included as a sub function in inf/infKL.m. The smoothed likelihood `KL (µi , vii ) is implemented as a meta likelihood in likKL; it uses GaussianHermite quadrature to compute the required integrals. Note that – as opposed to EP – GaussianHermite quadrature is appropriate since we integrate against the ln P(yi |fi ) (which can be well approximated by a polynomial) instead of P(yi |fi ) itself. The algorithm is – again unlike EP – provably convergent for log-concave likelihoods (e.g. likGauss, likLaplace, likSech2, likLogistic, likPoisson) since it can be regarded as coordinate descent with guaranteed decrease in the objective in every step. Due to the complex update computations, infKL can be quite slow although it has the same O(n3 ) asymptotic complexity as EP and Laplace. 13 3.5 Variational Bayes One can drive the bounding even further by means of local quadratic lower bounds to the log likelihood `(f) = ln p(y|f). Suppose that we use a super-Gaussian likelihood p(y|f) i.e. likelihoods that can be lower bounded by Gaussians of any width w (e.g. likLaplace, likT, likLogistic, likSech2). Formally, that means that there are b, z ∈ R such that ρ(f) = ln p(y|f − z) − bf √ is symmetric and f 7→ ρ(f) is a convex function for all f > 0. 
As a result, we obtain the following exact representation of the likelihood wf2 1 `(f) = ln p(y|f) = max (b + wz)f − − h(γ) , 2 2 w>0 which can be derived by convex duality and assuming the likelihoods to be super-Gaussian. Details can be found in papers by Palmer et al. Variational EM Algorithms for Non-Gaussian Latent Variable Models, NIPS, 2006 and Nickisch & Seeger Convex Variational Bayesian Inference for Large Scale Generalized Linear Models, ICML, 2009. The bottom line is that we can treat the variational bounding as a sequence of Laplace approximations with the “variational Bayes” log likelihood q VB ` (fi ) = `(gi ) + bi (fi − gi ), g = sgn(f − z) (f − z)2 + v + z instead of the usual likelihood `(fi ) = ln p(yi |fi ) i.e. we solve µ ← L(`VB v ) instead of µ ← L(`). See section 3.2. In the code of inf/infVB.m, the likelihood is implemented in the function likVB. At the end, the optimal value of W can be obtained analytically via wi = |bi − ` 0 (gi )|/|gi − zi |. For the minimisation in inf/infVB.m, we use a provably convergent double loop algorithm, where in the inner loop a nonlinear least squares problem (convex for log-concave likelihoods) is solved using −1 + W)−1 ). inf/infLaplace.m such that µ ← L(`VB v ) and in the outer loop, we compute v ← dg((K The only requirement to the likelihood function is that it returns the values z and b required by the bound which are delivered by the call (see section 4) • [b,z] = lik(hyp, y, [], ga, ’infVB’) The negative marginal likelihood upper bound − ln ZVB is obtained by integrating the prior times the exact representation of the likelihood h(γ) ν2 p 1 p(y|f) = max q(y|f, γ), q(y|f, γ) = N(f|ν, γ) exp − − 2πγ, γ = , ν = bγ + z 2 2γ w γ>0 w.r.t. the latent variables f yielding Z − ln ZVB = − ln N(f|m, K) n Y qi (yi |fi , γi )df i=1 = − ln N(m|ν, K + Γ) + 3.6 1 h(γ) − w> ν2 − 1> ln 2πγ . 2 Compatibility Between Inference Methods and Covariance Approximations Another kind of approximation is needed to render an inference method scalable. We have two approximation schemes which in fact approximate the covariance to make it amenable to large number of training data points. The following table shows the compatibility between some inference methods and two major groups of covariance approximations we will discuss in the next two sections. 14 Approximation \ Inference variational free energy (VFE) fully independent training conditionals (FITC) hybrid between VFE and FITC, s0 ∈ [0, 1] Kronecker/Toeplitz/BTTB grid (KISS-GP) State space representation 3.7 Exact Laplace infGaussLik infLaplace X X X X X X X X X X VB infVB X X X X X EP infEP KL, MCMC, LOO inf{KL,MCMC,LOO} X (X, ADF) Implementation apxSparse, opt.s=0.0 apxSparse, opt.s=1.0 apxSparse, opt.s=s0 apxGrid apxState Sparse Covariance Approximations One of the main problems with GP models is the high computational load for inference computations. In a setting with n training points x, exact inference with Gaussian likelihood requires O(n3 ) effort; approximations like Laplace or EP consist of a sequence of O(n3 ) operations. There is a line of research with the goal to alleviate this burden by using approximate covariance functions k̃ instead of k. A review is given by Candela and Rasmussen1 . One basic idea in those approximations is to work with a set of m inducing inputs u with a reduced computational load of O(nm2 ). In the following, we will provide a rough idea of the FITC approximation used in the toolbox. 
Let K denote the n × n covariance matrix between the training points x, Ku the m × n covariance matrix between the n training points and the m inducing points, and Kuu the m × m covariance matrix between the m inducing points. The FITC approximation to the covariance is given by −1 2 K ≈ K̃ = Q + G, G = diag(g), g = diag(K − Q), Q = K> u Quu Ku , Quu = Kuu + σnu I, where σnu is the noise from the inducing inputs. Note that K̃ and K have the same diagonal elements diag(K̃) = diag(K); all off-diagonal elements are the same as for Q. Internally, the necessary covariance evaluations are performed by a meta covariance function cov/apxSparse.m. The toolbox offers FITC versions for regression with Gaussian likelihood inf/infGaussLik.m, as well as for Laplace’s approximation inf/infLaplace.m. The user can decide whether to treat the inducing inputs u as fixed or as hyperparameters. The latter allows to adjust the inducing inputs u w.r.t. the marginal likelihood. As detailed in the documentation of inf/apx.m, u is treated as fixed if it is passed as the 2nd parameter of apxSparse(cov,xu,..). If the hyperparameter structure hyp contains a field hyp.xu in inference method calls such as infGaussLik(hyp,..) or inference/prediction calls like gp(hyp,@infGaussLik,..) the inducing inputs u are treated as hyperparameters and can be optimised. See doc/demoSparse.m for an illustration. 3.8 Grid-Based Covariance Approximations Another way to bring down computational costs is to take advantage of grid structure x. For example, in geostatistics or image processing, the training data x ∈ Rn×D could be a complete 2d lattice of size n1 × n2 as given by the axes g1 ∈ Rn1 , g2 ∈ Rn2 so that n = N = n1 · n2 , D = 2 and x = [vec(g1 1> ), vec(1g> grid U ∈ RN×D is specified by a set of axis 2 )]. In general, a p-dimensional Q Pp p matrices {gi ∈ Rni ×Di }i=1..p so that N = i=1 ni and D = i=1 Di where the axes do not need to be 1d nor do their components need to be sorted. As a consequence, U represents a Cartesian product of its axes U = g1 × g2 × .. × gp . The cov/apxGrid.m covariance function represents a Kronecker product covariance matrix KU,U = Kp ⊗ .. ⊗ K2 ⊗ K1 whose factorisation structure is given by the grid xg . The gain in computationial efficiency is due to the fact that matrix-vector product, determinant, inverse and eigenvalue decompose Pcomputations p 3 3 so that many operations with an overall cost of O(N ) now only cost O( i=1 ni ). 1 A Unifying View of Sparse Approximate Gaussian Process Regression, JMLR, 2005 15 For off-grid data points, we can still take advantage of the computational properties of a grid-based covariance matrix KU,U via the structured kernel interpolation (SKI) framework aka KISS-GP by Wilson and Nickisch2 with extensions3 . Here, the n × n covariance K is obtained from the N × N grid covariance KU,U by interpolation K ≈ WX KU,U W> X , where KU,U is a covariance matrix formed by evaluating the user-specified kernel over a set of latent inducing inputs U, with locations that have been chosen to create algebraic structure in KU,U that we can exploit for efficiency. Here, the interpolation matrix WX ∈ Rn×N is extremely sparse; i.e., for local cubic interpolation WX contains only 4D nonzeros per row, where D is the diata dimension. In addition WX is row-normalised 1n = WX 1N . The structure in KU,U alongside the sparsity of WX , allows for very fast MVMs with the SKI approximate covariance matrix K over the inputs x enabling fast inference and prediction. 
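To make the ingredients concrete, a grid-based model could be assembled along the following lines. This is a sketch only: the grid axes xg and the hyperparameter values are made up, and the opt fields steering the iterative solver are assumptions based on inf/apx.m rather than a verbatim copy of the toolbox interface.

  % sketch: Kronecker/grid (KISS-GP) covariance over a 2d lattice
  xg   = {linspace(-2,2,100)', linspace(-2,2,100)'};    % grid axes g1 and g2
  covg = {@apxGrid, {{@covSEiso}, {@covSEiso}}, xg};    % one covariance factor per axis
  hypg.mean = []; hypg.lik = log(0.1);                  % zero mean hyps, log noise std. dev.
  hypg.cov  = log([1; 1; 1; 1]);                        % hyps of the two axis covariances
  opt.cg_maxit = 500; opt.cg_tol = 1e-5;                % assumed options for the CG solver
  infg = @(varargin) infGaussLik(varargin{:}, opt);     % hand the options to the inference method
  % [nlZ, dnlZ, post] = gp(hypg, infg, [], covg, @likGauss, x, y);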
Internally, we use a meta covariance function cov/apxGrid.m to represent the Kronecker covariance matrix and a Gaussian regression inference method inf/infGaussLik.m. We also support incomplete grids where n < N. A good starting point is Yunus Saatçi’s PhD thesis4 . For incomplete grids, we use the interpolation-based extensions by Wilson et al.5 where conjugate gradients and a determinant approximations are used. See doc/demoGrid1d.m and doc/demoGrid2d.m for an illustration. We also offer non-Gaussian likelihoods as described by Seth Flaxman6 so that inf/infLaplace.m can be used. 3.9 State Space Representation of GPs GP models with covariance functions with a Markovian structure can be transformed into equivalent discrete state space models where inference can be done in linear time O(n). Exact models can be derived for sum, product, linear, noise, constant, Matérn (half-integer), Ornstein–Uhlenbeck, and Wiener covariance functions. Other common covariance functions can be approximated by their Markovian counterparts, including squared exponential, rational quadratic, and periodic covariance functions. A state space model describes the evolution of a dynamical system at different time instances ti , i = 1, 2, . . . by fi ∼ P(fi |fi−1 ), yi ∼ P(yi |fi ), where fi := f(ti ) ∈ Rd and f0 ∼ P(f0 ) with fi being the latent (hidden/unobserved) variable and yi being the observed variable. In continuous time, a simple dynamical system able to represent many covariance functions is given by the following linear time-invariant stochastic differential equation: ḟ(t) = F f(t) + L w(t), yi = H f(ti ) + i , where w(t) is an s-dimensional white noise process, the measurement noise i ∼ N(0, σ2n ) is Gaussian, and F ∈ Rd×d , L ∈ Rd×s , H ∈ R1×d are the feedback, noise effect, and measurement matrices, respectively. The initial state is distributed according to f0 ∼ N(0, P0 ). The latent GP is recovered by f(t) = Hf(t) and w(t) ∈ Rs is a multivariate white noise process with spectral density matrix Qc ∈ Rs×s . For discrete values, this translates into fi ∼ N(Ai−1 fi−1 , Qi−1 ), 2 yi ∼ P(yi |H Lfi ), Kernel Interpolation for Scalable Structured Gaussian Processes, ICML, 2015 Thoughts on Massively Scalable Gaussian Processes, TR, 2015. 4 Scalable Inference for Structured Gaussian Process Models, University of Cambridge, 2011 5 Fast Kernel Learning for Multidimensional Pattern Extrapolation, NIPS, 2014 6 Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods, ICML, 2015 3 16 Algorithm 1 Kalman (forward) filtering. Input: {ti } , y {Ai }, {Qi }, H, P0 W,b for i = 1 to n do if i == 1 then mi ← 0; Pi ← P0 else mi ← Ai mi−1 ; Pi ← Ai Pi−1 A> i +Qi end if if has label yi then µf ← Hmi ; u ← Pi H> ; σ2f ← Hu zi ← Wii σ2f + 1 ki ← Wii u/zi ; Pi ← Pi − ki u> ci ← Wii µf − bi ; mi ← mi − uci /zi end if end for P 1 1 log det(I + W 2 KW 2 ) ← i log zi # training inputs and targets # state space model # likelihood eff. precision and location # init # predict # latent # variance # mean Algorithm 2 Rauch–Tung–Striebel (backward) smoothing. Input: {mi }, {Pi } {Ai }, {Qi } for i = n down to 2 do m ← Ai mi−1 ; P ← Ai Pi−1 A> i + Qi > −1 Gi ← Pi−1 Ai P ; ∆mi−1 ← Gi (mi − m) Pi−1 ← Pi−1 + Gi (Pi − P)G> i mi−1 ← mi−1 + ∆mi−1 end for # Kalman filter output # state space model # predict # variance # mean with f0 ∼ N(0, P0 ). The discrete-time matrices are Ai = A[∆ti ] = e∆ti F , Z ∆ti > Qi = e(∆tk −τ)F L Qc L> e(∆ti −τ)F dτ, 0 where ∆ti = ti+1 − ti > 0. 
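As a concrete illustration of these quantities, the sketch below builds the discrete-time matrices for a Matérn-3/2 covariance with lengthscale ell and signal standard deviation sf. The state space matrices are the standard ones from the state space GP literature, stated here for illustration rather than copied from cov/apxState.m; the noise matrix uses the stationary identity Qi = P∞ − Ai P∞ Ai' derived just below.

  % sketch: discrete-time state space quantities for a Matern-3/2 GP
  ell = 1.3; sf = 0.8; lam = sqrt(3)/ell;   % lengthscale, magnitude, rate
  F    = [0, 1; -lam^2, -2*lam];            % feedback matrix
  H    = [1, 0];                            % measurement matrix, f(t) = H*fvec(t)
  Pinf = diag([sf^2, lam^2*sf^2]);          % stationary state covariance P0
  dt   = 0.25;                              % time step t_{i+1} - t_i
  A    = expm(dt*F);                        % transition matrix A_i = e^(dt*F)
  Q    = Pinf - A*Pinf*A';                  % process noise via the stationary identity

These are the quantities {Ai}, {Qi}, H and P0 consumed by Algorithms 1 and 2 above.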
For stationary covariances k(t, t 0 ) = k(t − t 0 ), the stationary state is distributed by f∞ ∼ N(0, P∞ ) and the stationary covariance can be found by solving the Lyapunov equation Ṗ∞ = F P∞ + P∞ F> + L Qc L> = 0, which leads to the identity Qi = P∞ − Ai P∞ A> i . In practice, the evaluation of the n discrete-time transition matrices Ai = e∆ti F and the noise covariance matrices Qi (in the stationary case) for different values of ∆ti is a computational challenge. Since the matrix exponential ψ : s 7→ esX is smooth, its evaluation can be accurately approximated by convolution interpolation. From the discrete set of matrices, all the necessary computations can be done using Kalman filtering and Kalman smoothing as detailed in Algorithms 1 and 2. Internally, we use a meta covariance function cov/apxState.m to represent the state space represen17 tation. A good starting point is Arno Solin’s PhD thesis7 . See doc/demoState.m for an illustration. We also offer non-Gaussian likelihoods 8 so that inf/infLaplace.m and inf/infVB.m can be used. EP is not fully functional; we offer single-sweep EP aka assumed density filtering (ADF). 7 8 Stochastic Differential Equation Methods for Spatio-Temporal Gaussian Process Regression, Aalto University, 2016 State Space Gaussian Processes with Non-Gaussian Likelihood, ICML, 2018 18 4 Likelihood Functions R A likelihood function pρ (y|f) (with hyperparameters ρ) is a conditional density pρ (y|f)dy = 1 defined forQscalar latent function values f and outputs y. In the GPML toolbox, we use iid. likelihoods pρ (y|f) = n i=1 pρ (yi |fi ). The approximate inference engine does not explicitly distinguish between classification and regression likelihoods: it is fully generic in the likelihood allowing to use a single code in the inference step. Likelihood functionality is needed both during inference and while predicting. 4.1 Prediction A prediction at x∗ conditioned on the data D = (X, y) (as implemented in gp.m) consists of the predictive mean µy∗ and variance σ2y∗ which are computed from the the latent marginal moments µf∗ , σ2f∗ i.e. the Gaussian marginal approximation N(f∗ |µf∗ , σ2f∗ ) via Z Z p(y∗ |D, x∗ ) = p(y∗ |f∗ )p(f∗ |D, x∗ )df∗ ≈ p(y∗ |f∗ )N(f∗ |µf∗ , σ2f∗ )df∗ . (3) The moments are given by µy∗ = likelihood call R R y∗ p(y∗ |D, x∗ )dy∗ and σ2y∗ = (y∗ − µy∗ )2 p(y∗ |D, x∗ )dy∗ . The • [lp,ymu,ys2] = lik(hyp, [], fmu, fs2) does exactly this. Evaluation of the logarithm of py∗ = p(y∗ |D, x∗ ) for values y∗ can be done via • [lp,ymu,ys2] = lik(hyp, y, fmu, fs2) where lp contains the number ln py∗ . R R Using the moments of the likelihood µ(f∗ ) = y∗ p(y∗ |f∗ )dy∗ and σ2 (f∗ ) = (y∗ −µ(f∗ ))2 p(y∗ |f∗ )dy∗ we obtain for the predictive moments the following (exact) expressions Z µy∗ = µ(f∗ )p(f∗ |D, x∗ )df∗ , and Zh i σ2y∗ = σ2 (f∗ ) + (µ(f∗ ) − µy∗ )2 p(f∗ |D, x∗ )df∗ . 1. The binary case is simple since y∗ ∈ {−1, +1} and 1 = py∗ + p−y∗ . Using π∗ = p+1 , we find π∗ y∗ = +1 p y∗ = 1 − π∗ y∗ = −1 X µ y∗ = y∗ p(y∗ |D, x∗ ) = 2 · π∗ − 1 ∈ [−1, 1], and y∗ =±1 σ2y∗ = X (y∗ − µy∗ )2 p(y∗ |D, x∗ ) = 4 · π∗ (1 − π∗ ) ∈ [0, 1]. y∗ =±1 2. The continuous case for homoscedastic likelihoods depending on r∗ = y∗ − f∗ only and having noise variance σ2 (f∗ ) = σ2n is also simple since R the identity p(y∗ |f∗ ) = p(yR∗ − f∗ |0) allows to substitute y∗ ← y∗ +f∗ yielding µ(f∗ ) = f∗ + y∗ p(y∗ |0)dy∗ and assuming y∗ p(y∗ |0)dy∗ = 0 we arrive at µy∗ = µf∗ , and σ2y∗ = σ2f∗ + σ2n . 19 3. The generalised linear model (GLM) case is also feasible. 
Evaluation of the predictive distribution is done by quadrature Z Z py∗ = p(y∗ |f∗ )p(f∗ |D, x∗ )df∗ ≈ p(y∗ |f∗ )N(f∗ |µf∗ , σ2f∗ )df∗ . For GLMs the mean is given by µ(f∗ ) = g(f∗ ) and the variance is usually given by a simple function of the mean σ2 (f∗ ) = v(g(f∗ )), hence we use Gaussian-Hermite quadrature with N(f∗ |µf∗ , σ2f∗ ) ≈ p(f∗ |D, x∗ ) to compute Z µy∗ = g(f∗ )p(f∗ |D, x∗ )df∗ , and Zh i σ2y∗ = v(g(f∗ )) + (g(f∗ ) − µy∗ )2 p(f∗ |D, x∗ )df∗ 6= v(µy∗ ). 4. Finally the warped Gaussian likelihood predicitive distribution with strictly monotonically increasing warping function g is given by the expression p(y∗ |D, x∗ ) = g 0 (y∗ )N g(y∗ )|µf∗ , σ2n + σ2f∗ so that the predictive moments can be computed by Gaussian-Hermite quadrature. In the following, we will detail how and which likelihood functions are implemented in the GPML toolbox. Further, we will mention dependencies between likelihoods and inference methods and provide some analytical expressions in addition to some likelihood implementations. 4.2 Interface The likelihoods are in fact the most challenging object in our implementation. Different inference algorithms require different aspects of the likelihood to be computed, therefore the interface is rather involved as detailed below. 19 hlikFunctions.m 19i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 % likelihood functions are provided to be used by the gp.m function: % % likErf (Error function , classification , probit regression) % likLogistic (Logistic , classification , logit regression) % likUni (Uniform likelihood , classification) % % likGauss (Gaussian , regression) % likGaussWarp (Warped Gaussian , regression) % likGumbel (Gumbel likelihood for extremal values) % likLaplace (Laplacian or double exponential , regression) % likSech2 (Sech-square , regression) % likT (Student ’s t, regression) % % likPoisson (Poisson regression , count data) % likNegBinom (Negativ binomial regression , count data) % likGamma (Nonnegative regression , positive data) % likExp (Nonnegative regression , positive data) % likInvGauss (Nonnegative regression , positive data) % likBeta (Beta regression , interval data) % % likMix (Mixture of individual likelihood functions) % 20 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % The likelihood functions have three possible modes , the mode being selected as follows (where "lik" stands for any likelihood function in "lik/lik*.m".): 1) With one or no input arguments: [REPORT NUMBER OF HYPERPARAMETERS] s = lik OR s = lik(hyp) The likelihood function returns a string telling how many hyperparameters it expects , using the convention that "D" is the dimension of the input space. For example , calling "likLogistic" returns the string ’0’. 2) With three or four input arguments: [PREDICTION MODE] lp = lik(hyp, y, mu) OR [lp, ymu, ys2] = lik(hyp, y, mu, s2) This allows to evaluate the predictive distribution. Let p(y_*|f_*) be the likelihood of a test point and N(f_*|mu,s2) an approximation to the posterior marginal p(f_*|x_*,x,y) as returned by an inference method. The predictive distribution p(y_*|x_*,x,y) is approximated by. 
q(y_*) = \int N(f_*|mu,s2) p(y_*|f_*) df_* lp = log( q(y) ) for a particular value of y, if s2 is [] or 0, this corresponds to log( p(y|mu) ) ymu and ys2 the mean and variance of the predictive marginal q(y) note that these two numbers do not depend on a particular value of y All vectors have the same size. 3) With five or six input arguments , the fifth being a string [INFERENCE MODE] [varargout] = lik(hyp, y, mu, s2, inf) OR [varargout] = lik(hyp, y, mu, s2, inf, i) There are three cases for inf, namely a) infLaplace , b) infEP and c) infVB. The last input i, refers to derivatives w.r.t. the ith hyperparameter. a1) [lp,dlp,d2lp,d3lp] = lik(hyp, y, f, [], ’infLaplace ’) lp, dlp, d2lp and d3lp correspond to derivatives of the log likelihood log(p(y|f)) w.r.t. to the latent location f. lp = log( p(y|f) ) dlp = d log( p(y|f) ) / df d2lp = d^2 log( p(y|f) ) / df^2 d3lp = d^3 log( p(y|f) ) / df^3 a2) [lp_dhyp ,dlp_dhyp ,d2lp_dhyp] = lik(hyp, y, f, [], ’infLaplace ’, i) returns derivatives w.r.t. to the ith hyperparameter lp_dhyp = d log( p(y|f) ) / ( dhyp_i) dlp_dhyp = d^2 log( p(y|f) ) / (df dhyp_i) d2lp_dhyp = d^3 log( p(y|f) ) / (df^2 dhyp_i) b1) [lZ,dlZ,d2lZ] = let Z = \int p(y|f) lZ = log(Z) dlZ = d log(Z) / d2lZ = d^2 log(Z) / lik(hyp, y, mu, s2, ’infEP ’) N(f|mu,s2) df then dmu dmu^2 21 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 % % b2) [dlZhyp] = lik(hyp, y, mu, s2, ’infEP ’, i) % returns derivatives w.r.t. to the ith hyperparameter % dlZhyp = d log(Z) / dhyp_i % % % c1) [b,z] = lik(hyp, y, [], ga, ’infVB ’) % ga is the variance of a Gaussian lower bound to the likelihood p(y|f). % p(y|f) \ge exp( b*(f+z) - (f+z).^2/(2*ga) - h(ga)/2 ) \propto N(f|b*ga-z,ga) % The function returns the linear part b and z. % % Cumulative likelihoods are designed for binary classification. Therefore , they % only look at the sign of the targets y; zero values are treated as +1. % % Some examples for valid likelihood functions: % lik = @likLogistic; % lik = {’likMix ’,{’likUni ’,@likErf}} % lik = {@likPoisson ,’logistic ’}; % % See the help for the individual likelihood for the computations specific to % each likelihood function. % hgpml copyright 5ai 4.3 Implemented Likelihood Functions The following table enumerates all (currently) implemented likelihood functions that can be found at lik/lik .m and their respective set of hyperparameters ρ. lik regression yi ∈ R Gauss Gaussian GaussWarp Gumbel Sech2 Warped Gaussian Gumbel Sech-squared Laplace Laplacian T Student’s t lik Erf Logistic Uni lik Poisson NegBinom classification yi ∈ {±1} Error function Logistic function Label noise count data yi ∈ N Poisson Negative Binomial lik Weibull nonnegative data yi ∈ R+ \{0} Weibull, γ1 = Γ (1 + 1/κ) Gamma Gamma Exp Exponential InvGauss Inverse Gaussian pρ (yi |fi ) = interval data yi ∈ [0, 1] Beta pρ (yi |fi ) = Ryi fi −∞ N(t)dt 1 1+exp(−yi fi ) 1 2 pρ (yi |fi ) = where µ = g(fi ) µyi e−µ /yi ! 
yi +r−1 r yi r µ /(r + µ)r+yi yi {ln σ} |s| = 1 {θ1 , .., θng , ln σ} {ln σ} {ln σ} {ln σ} {ln(ν − 1), ln σ} ρ= ∅ ∅ ∅ ρ= ∅ {ln r} pρ (yi |fi ) = where µ = g(fi ) κγ1 /µ (yi γ1 /µ)κ−1 exp (−(yi γ1 /µ)κ ) αα yα−1 −α exp − yi α i µ µ Γ (α) µ−1 exp − yµi q 2 λ i −µ) exp − λ(y 2µ2 y 2πy3 ρ= {ln κ} pρ (yi |fi ) = ρ= {ln φ} i i lik Beta ρ= −fi )2 √1 exp − (yi2σ 2 2πσ N(gθ, (yi )|fi , σ2 )gθ0 (yi ) s·π(yi −fi ) π √ exp (−zi − e−zi ), zi = γ + √ , σ 6 σ 6 π√ τ , τ= 2 cosh2 (τ(y i −fi )) 2σ 3 |yi −fi | 1 , b = √σ2 2b exp − b − ν+1 2 Γ ( ν+1 −fi )2 2 ) √ 1 1 + (yiνσ 2 Γ ( ν2 ) νπσ N(yi |fi , σ2 ) = Γ (φ) µφ−1 (1 Γ (µφ)Γ ((1−µ)φ) yi Composite likelihood functions [p1 (yi |fi ), p1 (yi |fi ), ..] 7→ pρ (yi |fi ) P Mix Mixture j αj pj (yi |fi ) 22 where µ = g(fi ) − yi )(1−µ)φ−1 {ln α} ∅ {ln λ} {ln α1 , ln α2 , ..} 4.4 Usage of Implemented Likelihood Functions Some code examples taken from doc/usageLik.m illustrate how to use simple and composite likelihood functions to specify a GP model. Syntactically, a likelihood function lf is defined by lk := ’func’ | @func // simple lf := {lk} | {param, lk} | {lk, {lk, .., lk}} // composite i.e., it is either a string containing the name of a likelihood function, a pointer to a likelihood function or one of the former in combination with a cell array of likelihood functions and an additional list of parameters. 22 hdoc/usageLik.m 22i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 % demonstrate usage of likelihood functions % % See also likFunctions.m. % hgpml copyright 5ai clear all, close all n = 5; f = randn(n,1); % create random latent function values % set up simple classification likelihood functions yc = sign(f); lc0 = {’likErf’}; hypc0 = []; % no hyperparameters are needed lc1 = {@likLogistic}; hypc1 = []; % also function handles are OK lc2 = {’likUni’}; hypc2 = []; lc3 = {’likMix’,{’likUni’,@likErf}}; hypc3 = log([1;2]); %mixture % set up simple regression likelihood functions yr = f + randn(n,1)/20; sn = 0.1; % noise standard deviation lr0 = {’likGauss’}; hypr0 = log(sn); lr1 = {’likLaplace’}; hypr1 = log(sn); lr2 = {’likSech2’}; hypr2 = log(sn); nu = 4; % number of degrees of freedom lr3 = {’likT’}; hypr3 = [log(nu-1); log(sn)]; lr4 = {’likMix’,{lr0,lr1}}; hypr4 = [log([1,2]);hypr0;hypr1]; a = 1; % set up warped Gaussian with g(y) = y + a*sign(y).*y.^2 lr5 = {’likGaussWarp’,[’poly2’]}; hypr5 = log([a;sn]); lr6 = {’likGumbel’,’+’}; hypr6 = log(sn); % set up Poisson/negative binomial regression yp = fix(abs(f)) + 1; lp0 = {@likPoisson ,’logistic’}; hypp0 = lp1 = {@likPoisson ,{’logistic2’ ,0.1}}; hypp1 = lp2 = {@likPoisson ,’exp’}; hypp2 = ln1 = {@likNegBinom ,’logistic2’}; hypn1 = % set lg1 = lg2 = lg3 = lg4 = lg5 = []; []; []; []; up other GLM likelihoods for positive or interval regression {@likGamma ,’logistic’}; al = 2; hyp.lik = log(al); {@likInvGauss ,’exp’}; lam = 1.1; hyp.lik = log(lam); {@likBeta ,’expexp’}; phi = 2.1; hyp.lik = log(phi); {@likBeta ,’logit’}; phi = 4.7; hyp.lik = log(phi); {@likWeibull ,{’logistic2’ ,0.01}}; ka = 0.5; hyp.lik = log(ka); % 0) specify the likelihood function 23 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 lik = lc0; hyp = hypc0; y = yc; % lik = lr4; hyp = hypr4; y = yr; % lik = lp1; hyp = hypp1; y = yp; % 1) query the number of parameters feval(lik{:}) % 2) evaluate the likelihood function on f exp(feval(lik{:},hyp,y,f)) % 3a) evaluate derivatives of the likelihood [lp,dlp,d2lp,d3lp] = feval(lik{:}, hyp, y, f, [], 
’infLaplace’); % 3b) compute Gaussian integrals w.r.t. likelihood mu = f; s2 = rand(n,1); [lZ,dlZ,d2lZ] = feval(lik{:}, hyp, y, mu, s2, ’infEP’); % 3c) obtain lower bound on likelihood ga = rand(n,1); [b,z] = feval(lik{:}, hyp, y, [], ga, ’infVB’); 4.5 Compatibility Between Likelihoods and Inference Methods The following table lists all possible combinations of likelihood function and inference methods. Likelihood \ Inference Gaussian Warped Gaussian Gumbel Sech-squared Laplacian Student’s t Mixture Error function Logistic function Uniform Weibull Gamma Exp Inverse Gaussian Poisson Negative Binomial Beta Gaussian Laplace VB Likelihood approx. cov. possible, Table3.6 X X X X X X X X X X X X X X X X X X X X X X X X X EP KL MCMC LOO no approx. cov. possible, Table3.6 X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X (X)∗ X X X X X X X Type, Output Domain regression, R regression, R regression, R regression, R regression, R regression, R classification, {±1} classification, {±1} classification, {±1} positive data, R+ \{0} positive data, R+ \{0} positive data, R+ \{0} positive data, R+ \{0} count data, N count data, N interval data, [0, 1] Alternative Names logistic distribution double exponential mixing meta likelihood probit regression logit regression label noise nonnegative regression nonnegative regression nonnegative regression nonnegative regression Poisson regression negative binomial regression beta regression (X)∗ EP might not converge in some cases since quadrature is used. Exact inference is only tractable for Gaussian likelihoods. Expectation propagation together with Student’s t likelihood is inherently unstable due to non-log-concavity. Laplace’s approximation for Laplace likelihoods is not sensible because at the mode the curvature and the gradient is undefined due to the non-differentiable peak of the Laplace distribution. Special care has been taken for the non-convex optimisation problem imposed by the combination Student’s t likelihood and Laplace’s approximation. 4.6 Gaussian Likelihood The Gaussian likelihood is the simplest likelihood because the posterior distribution is not only Gaussian but can be computed analytically. In principle, the Gaussian likelihood would only be 24 operated in conjunction with the exact inference method but we chose to provide compatibility with all other inference algorithms as well because it enables code testing and allows to switch between different regression likelihoods very easily. 24a hlik/likGauss.m 24ai≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 24b function [varargout] = likGauss(hyp, y, mu, s2, inf, i) % likGauss - Gaussian likelihood function for regression. The expression for the % likelihood is % likGauss(t) = exp(-(t-y)^2/2*sn^2) / sqrt(2*pi*sn^2), % where y is the mean and sn is the standard deviation. % % The hyperparameters are: % % hyp = [ log(sn) ] % % Several modes are provided , for computing likelihoods , derivatives and moments % respectively , see likFunctions.m for the details. In general , care is taken % to avoid numerical issues when the arguments are extreme. % hgpml copyright 5ai % % See also LIKFUNCTIONS.M. 
if nargin <3, varargout = {’1’}; return; end sn2 = exp(2*hyp); if nargin <5 % prediction mode if inf is not present hPrediction with Gaussian likelihood 24bi else switch inf case ’infLaplace’ hLaplace’s method with Gaussian likelihood 25ai case ’infEP’ hEP inference with Gaussian likelihood 25bi case ’infVB’ hVariational Bayes inference with Gaussian likelihood 25ci end end hPrediction with Gaussian likelihood 24bi≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 % report number of hyperparameters (24a) if isempty(y), y = zeros(size(mu)); end s2zero = 1; if nargin >3&&numel(s2)>0&&norm(s2)>eps, s2zero = 0; end % s2==0 ? if s2zero % log probability lp = -(y-mu).^2./sn2/2-log(2*pi*sn2)/2; s2 = 0; else lp = likGauss(hyp, y, mu, s2, ’infEP’); % prediction end ymu = {}; ys2 = {}; if nargout >1 ymu = mu; % first y moment if nargout >2 ys2 = s2 + sn2; % second y moment end end varargout = {lp,ymu,ys2}; 25 The Gaussian likelihood function has a single hyperparameter ρ, the log of the noise standard deviation σn . 4.6.1 Exact Inference Exact inference doesn’t require any specific likelihood related code; all computations are done directly by the inference method, section 3.1. 4.6.2 25a Laplace’s Approximation hLaplace’s method with Gaussian likelihood 25ai≡ (24a) 1 if nargin <6 % no derivative mode 2 if isempty(y), y=0; end 3 ymmu = y-mu; dlp = {}; d2lp = {}; d3lp = {}; 4 lp = -ymmu.^2/(2*sn2) - log(2*pi*sn2)/2; 5 if nargout >1 6 dlp = ymmu/sn2; % dlp, derivative of log likelihood 7 if nargout >2 % d2lp, 2nd derivative of log likelihood 8 d2lp = -ones(size(ymmu))/sn2; 9 if nargout >3 % d3lp, 3rd derivative of log likelihood 10 d3lp = zeros(size(ymmu)); 11 end 12 end 13 end 14 varargout = {lp,dlp,d2lp,d3lp}; 15 else % derivative mode 16 lp_dhyp = (y-mu).^2/sn2 - 1; % derivative of log likelihood w.r.t. hypers 17 dlp_dhyp = 2*(mu-y)/sn2; % first derivative , 18 d2lp_dhyp = 2*ones(size(mu))/sn2; % and also of the second mu derivative 19 varargout = {lp_dhyp ,dlp_dhyp ,d2lp_dhyp}; 20 end 4.6.3 25b Expectation Propagation hEP inference with Gaussian likelihood 25bi≡ (24a) 1 if nargin <6 % no derivative mode 2 lZ = -(y-mu).^2./(sn2+s2)/2 - log(2*pi*(sn2+s2))/2; % log part function 3 dlZ = {}; d2lZ = {}; 4 if nargout >1 5 dlZ = (y-mu)./(sn2+s2); % 1st derivative w.r.t. mean 6 if nargout >2 7 d2lZ = -1./(sn2+s2); % 2nd derivative w.r.t. mean 8 end 9 end 10 varargout = {lZ,dlZ,d2lZ}; 11 else % derivative mode 12 dlZhyp = ((y-mu).^2./(sn2+s2)-1) ./ (1+s2./sn2); % deriv. w.r.t. hyp.lik 13 varargout = {dlZhyp}; 14 end 4.6.4 25c Variational Bayes hVariational Bayes inference with Gaussian likelihood 25ci≡ 26 (24a) 1 2 3 4 5 % variational lower site bound % t(s) = exp(-(y-s)^2/2sn2)/sqrt(2*pi*sn2) % the bound has the form: (b+z/ga)*f - f.^2/(2*ga) - h(ga)/2 n = numel(s2); b = zeros(n,1); y = y.*ones(n,1); z = y; varargout = {b,z}; 4.7 Warped Gaussian Likelihood Starting from the likelihood p(y|f) we are sometimes facing the situation where the data y ∈ Y ⊆ R is not distributed according to p(y|f) but some nonlinear transformation of the data g(y) = z so that z ∼ p(z|f). Here, the warping function g : Y → R needs to be strictly monotonically increasing i.e. g 0 (y) > 0. Formally, we start from the fact that p(z|f) integrates to one and use the derivative dz = g 0 (y)dy to substitute Z Z p(z|f)dz = 1 = pg (y|f)dy, pg (y|f) = p(g(y)|f)g 0 (y) where we have defined the log warped likelihood ln pg (y|f) = ln p(g(y)|f) + ln g 0 (y). 
The interesting bit is that approximate inference methods such as infExact, infLaplace, infEP, infVB, infKL remain fully feasible; only prediction and derivatives become more involved. The usual GP inference is recovered by using the identity warping function g : y 7→ y. The construction works in princple for any likelihood but our implementation in likGaussWarp is limited to the Gaussian likelihood. Hyperparameter derivatives Hyperparameter derivatives for infLaplace are obtained as follows ∂k ∂ ln k pg (y|f) = ∂θ ∂f ∂ ∂k ∂ ∂k ln p(g(y)|f) + ln g 0 (y), k = 0, 1, 2 ∂θ ∂f ∂θ ∂fk ∂ ∂k+1 ∂ ∂k = − k+1 ln p(g(y)|f) g(y) + ln g 0 (y). ∂θ ∂θ ∂fk ∂f Similarly for infEP the derivatives are given by Z Z ∂ ∂ ∂ 2 ln pg (y|f)N(f|µ, σ )df = ln p(g(y)|f)N(f|µ, σ2 )df + ln g 0 (y) ∂θ ∂θ ∂θ Z ∂ ∂ ∂ ln p(g(y)|f)N(f|µ, σ2 )df g(y) + ln g 0 (y). = − ∂µ ∂θ ∂θ This trick above works for any homoscedastic likelihood where p(y|f) = p(y + β|f + β) such as likGauss, likLaplace, likSech2 and likT. Predictive moments As detailed in 4, the predictive distribution is – for Gaussian likelihood – given by Z Z p(z∗ |D, x∗ ) = p(z∗ |f∗ )p(f∗ |D, x∗ )df∗ = N(z∗ |f∗ , σ2n )N(f∗ |µf∗ , σ2f∗ )df∗ = N(z∗ |µf∗ , σ2n + σ2f∗ ), where z∗ = g(y∗ ) p(y∗ |D, x∗ ) = g 0 (y∗ )N(g(y∗ )|µf∗ , σ2n + σ2f∗ ). 27 Hence, the predictive moments are obtained by the 1d integrals Z µy∗ = y∗ g 0 (y∗ )N(g(y∗ )|µf∗ , σ2n + σ2f∗ )dy∗ Z = g−1 (z∗ )N(z∗ |µf∗ , σ2n + σ2f∗ )dz∗ , and Z 2 σy∗ = (y∗ − µy∗ )2 g 0 (y∗ )N(g(y∗ )|µf∗ , σ2n + σ2f∗ )dy∗ Z = (g−1 (z∗ ) − µy∗ )2 N(z∗ |µf∗ , σ2n + σ2f∗ )dz∗ . 4.8 Gumbel Likelihood Distributions of extrema are well captured by the Gumbel distribution p(y) = y−η 1 exp −z − e−z , z = s , s ∈ {±1} β β with mean µ = η + βγ and variance σ2 = π2 β2 /6 where γ = 0.57721566490153 denotes Euler–Mascheroni’s constant. Skewness is approximately given by 1.1395s where s is a sign switching between left and right skewness and kurtosis is 12/5. The final expression for the Gumbel likelihood is π π p(y|f) = √ exp −z − e−z , z = γ + s √ (y − f), s ∈ {±1}. σ 6 σ 6 4.9 Laplace Likelihood Laplace’s Approximation The following derivatives are needed: ln p(y|f) = − ln(2b) − ∂ ln p ∂f 2 ∂ ln p (∂f)2 ∂ ln p ∂ ln σn = = = |f − y| b sign(f − y) b 3 ∂ ln p ∂3 ln p = =0 (∂f)3 (∂ ln σn )(∂f)2 |f − y| −1 b Expectation Propagation Expectation propagation requires integration against a Gaussian measure for moment matching. R 2 ln Z ln Z We need to evaluate ln Z = ln L(y|f, σ2n )N(f|µ, σ2 )df as well as the derivatives ∂∂µ and ∂∂µ 2 2 |y−f| (f−µ) σ 1 1 where N(f|µ, σ2 ) = √ 2 exp − 2σ² , L(y|f, σ2n ) = 2b exp − b , and b = √n2 . As a first 2πσ 28 step, we reduce the number of parameters by means of the substitution f̃ = f−y σn yielding Z Z = = = = = ln Z = with µ̃ = µ−y σn and σ̃ = σ σn . L(y|f, σ2n )N(f|µ, σ2 )df √ Z √ |f − y| (f − µ)2 1 2 √ exp − exp − 2 df 2σ² σn 2πσ 2σn √ Z √ 2 (σn f̃ + y − µ)2 √ exp − 2|f̃| df̃ exp − 2σ² 2σ 2π 2 µ−y 2 Z σn σn f̃ − σn √ exp − L(f̃|0, 1)df̃ 2σ² σσn 2π Z 1 L(f|0, 1)N(f|µ̃, σ̃2 )df σn Z ln Z̃ − ln σn = ln L(f|0, 1)N(f|µ̃, σ̃2 )df − ln σn Thus, we concentrate on the simpler quantity ln Z̃. 
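To make ln Z̃ concrete, the following standalone numerical check (not toolbox code; the values chosen for µ̃ and σ̃ are arbitrary) evaluates the one-dimensional integral by adaptive quadrature and can be compared against the closed-form expression derived next:

mu_t = 0.3; st = 1.7;                                      % arbitrary values for mu~ and sigma~
lapl  = @(f) exp(-sqrt(2)*abs(f))/sqrt(2);                 % Laplace density L(f|0,1), b = 1/sqrt(2)
gauss = @(f) exp(-(f-mu_t).^2/(2*st^2))/sqrt(2*pi*st^2);   % Gaussian N(f|mu~,sigma~^2)
lnZt  = log(quadgk(@(f) lapl(f).*gauss(f), -inf, inf))     % ln Ztilde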
C z √ }| √ { (f − µ̃)2 √ ln Z = ln exp − − 2|f| df − ln σ̃ 2π − ln 2σn 2σ̃² "Z # Z∞ 0 (f − µ̃)2 √ (f − µ̃)2 √ = ln + 2f df + exp − − 2f df + C exp − 2σ̃² 2σ̃² 0 −∞ m− m+ z }| √ { z }| √ { Z∞ Z 0 f2 − 2(µ̃ + σ̃2 2)f + µ̃2 f2 − 2(µ̃ − σ̃2 2)f + µ̃2 = ln exp − exp − df + df + C −∞ 2σ̃² 2 σ̃² 0 Z 2 Z∞ # m+ µ̃2 (f − m− )2 (f − m+ )2 = ln exp exp − df + exp exp − df − +C 2σ̃² 2σ̃² 0 2σ̃² 2σ̃² −∞ " !# 2 Z0 2 Z0 √ m− m µ̃2 + 1− = ln exp N(f|m+ , σ̃2 )df − − ln 2σn N(f|m− , σ̃2 )df + exp 2σ̃² −∞ 2σ̃² 2σ̃² −∞ 2 2 2 √ m− m+ m+ µ̃2 m− m+ − exp + exp = ln exp Φ Φ − − ln 2σn 2σ̃² σ̃ 2σ̃² σ̃ 2σ̃² 2σ̃² Rz Here, Φ(z) = −∞ N(f|0, 1)df denotes the cumulative Gaussian distribution. Finally, we have " m2− 2σ̃² Z0 √ m i h √ m √ − + + exp 2µ̃ Φ − + σ̃2 − ln 2σn ln Z = ln exp − 2µ̃ Φ σ̃ σ̃ √ √ √ = ln exp ln Φ(−z+ ) + 2µ̃ + exp ln Φ(z− ) − 2µ̃ + σ̃2 − ln 2σn | {z } | {z } a+ √ = ln(e + e ) + σ̃ − ln 2σn √ √ √ µ−y µ̃ σ where z+ = µ̃ σ̃ + σ̃ 2 = σ + σn 2, z− = σ̃ − σ̃ 2 = a+ Now, using d dθ ln Φ(z) = a− a− 2 1 d Φ(z) dθ Φ(z) = N(z) dz Φ(z) dθ µ−y σ − σ σn √ 2 and µ̃ = we tackle first derivative 29 µ−y σn , σ̃ = σ σn . ∂ ln Z ∂µ ∂a+ ∂µ a− ∂a− + ea+ ∂a ∂µ + e ∂µ = ea+ + ea− √ ∂ 2 ln Φ(−z+ ) + ∂µ σn √ √ N(−z+ ) 2 q+ 2 − =− + + σΦ(−z+ ) σn σ σn √ ∂ 2 ln Φ(z− ) − ∂µ σn √ √ N(z− ) 2 q− 2 − = − σΦ(z− ) σn σ σn √ q± 2 . ∓ ± σ σn = = ∂a− ∂µ = = ∂a± ∂µ = as well as the second derivative ∂2 ln Z ∂µ2 ∂ ∂µ = ∂ ∂µ a± ∂a± e = ea± ∂µ ∂2 a+ ∂µ2 = − 1 σ + ea+ ∂a + ∂µ " − ea− ∂a ∂µ ∂ ∂µ ea− ea+ + # ∂a± 2 ∂2 a± + ∂µ ∂µ2 − ∂ ln Z ∂µ 2 ∂ ∂ ∂µ N(−z+ )Φ(−z+ ) − ∂µ Φ(−z+ )N(−z+ ) Φ2 (−z+ ) ∂−z2 /2 ∂−z+ 2 + 1 N(−z+ )Φ(−z+ ) ∂µ − N (−z+ ) ∂µ = − σ Φ2 (−z+ ) ∂2 a− ∂µ2 = q2+ − q+ z+ N(−z+ ) Φ(−z+ )z+ − N(−z+ ) · = − σ2 Φ2 (−z+ ) σ2 = 1 σ = ∂z− 2 − 1 N(z− )Φ(z− ) ∂µ − N (z− ) ∂µ σ Φ2 (z− ) ∂ ∂ ∂µ N(z− )Φ(z− ) − ∂µ Φ(z− )N(z− ) Φ2 (z− ) ∂−z2 /2 q2− + q− z− N(z− ) −Φ(z− )z− − N(z− ) · = − σ2 Φ2 (z− ) σ2 q 2 ∓ q ± z± = − ± 2 σ = ∂2 a± ∂µ2 which can be simplified to ∂2 ln Z ∂µ2 = ea+ b+ + ea− b− − ea+ + ea− 30 ∂ ln Z ∂µ 2 using b± = ∂a± ∂µ 2 ∂2 a± + ∂µ2 √ !2 q 2 ∓ q ± z± q± 2 − ± 2 ∓ ± σ σn σ √ !2 q2 q ± z± q± 2 − − ± ± σ σn σ2 σ2 ! √ 2 8 z± ∓ 2 q± . − 2 σσn σn σ = = = We also need ∂ ln Z ∂ ln σn = a− ∂a− + ea+ ∂∂a ln σn + e ∂ ln σn ea+ + ea− − 2σ2 − 1. σ2n Variational Bayes We need h(γ) and its derivatives as well as β(γ): h(γ) = h 0 (γ) = 2 γ + ln(2σ2n ) + y2 γ−1 σ2n 2 − y2 γ−2 σ2n h 00 (γ) = 2y2 γ−3 β(γ) = yγ−1 4.10 Student’s t Likelihood The likelihood has two hyperparameters (both represented in the log domain to ensure positivity): ν the degrees of freedom ν and the scale σn with mean y (for ν > 1) and variance ν−2 σn ² (for ν > 2). ν+1 (f − y)² − 2 p(y|f) = Z · 1 + , νσ2n 31 Z= Γ Γ ν 2 ν+1 p2 νπσ2n Laplace’s Approximation For the mode fitting procedure, we need derivatives up to third order; the hyperparameter derivatives at the mode require some mixed derivatives. 
All in all, using r = y − f, we have ν 1 ν+1 r² ν+1 2 − ln Γ ln p(y|f) = ln Γ − ln νπσn − ln 1 + 2 2 2 2 νσ2n ∂ ln p r = (ν + 1) ∂f r² + νσ2n r2 − νσ2n ∂2 ln p = (ν + 1) (∂f)2 (r² + νσ2n )2 ∂3 ln p r3 − 3rνσ2n = 2(ν + 1) 3 (∂f) (r² + νσ2n )3 ∂ ln p ν+1 ∂Z ν r2 r2 + = − ln 1 + · ∂ ln ν ∂ ln ν 2 2 νσ2n r² + νσ2n ∂Z ν d ln Γ ν+1 ν d ln Γ ν2 1 2 = − − ∂ ln ν 2 d ln ν 2 d ln ν 2 2 2 2 3 r (r − 3(ν + 1)σn ²) + νσn ∂ ln p = ν (∂ ln ν)(∂f)² (r² + νσ2n )3 ∂ ln p r2 −1 = (ν + 1) ∂ ln σn r² + νσ2n ∂3 ln p νσ2n − 3r2 = 2νσ2n (ν + 1) (∂ ln σn )(∂f)² (r² + νσ2n )3 4.11 Cumulative Logistic Likelihood The likelihood has one hyperparameter (represented in the log domain), namely the standard deviation σn π π √ , Z= √ p(y|f) = Z · cosh−2 (τ(f − y)) , τ = 2σn 3 4σn 3 Laplace’s Approximation The following derivatives are needed where φ(x) ≡ ln(cosh(x)) √ ln p(y|f) = ln(π) − ln(4σn 3) − 2φ (τ(f − y)) ∂ ln p = 2τφ 0 (τ(f − y)) ∂f ∂2 ln p = −2τ2 φ 00 (τ(f − y)) (∂f)2 ∂3 ln p = 2τ3 φ 000 (τ(f − y)) (∂f)3 ∂3 ln p 2 00 000 (τ(f (τ(f = 2τ 2φ − y)) + τ(f − y)φ − y)) (∂ ln σn )(∂f)2 ∂ ln p = 2τ(f − y)φ 0 (τ(f − y)) − 1 ∂ ln σn 32 4.12 GLM Likelihoods: Poisson, Negative Binomial, Weibull, Gamma, Exponential, Inverse Gaussian and Beta Data y from a space other than R e.g. N, R+ or [0, 1] can be modeled using generalised linear model likelihoods p(y|f) where the expected value E[y] = µ is related to the underlying Gaussian process f by means of an inverse link function µ = g(f). Typically, the likelihoods are from an exponential family, hence the variance V[y] = v(µ), is a simple function of the mean µ as well as higher order moments such as skewness S[y] = s(µ) and kurtosis K[y] = k(µ). Here, we directly specify the inverse link function µ = g(f) defining the mapping from the GP f to the mean intensity µ. For numerical reasons, we work with the log of the inverse link function h(f) = ln g(f) and use its derivatives h 0 , h 00 and h 000 for subsequent computations. In the table below, we have summarised the GLM likelihood expressions, the moments, the range of their variables and the applicable inverse link functions. Likelihood Poisson ρ= ∅ v(µ) = µ Neg. Binomial {ln r} µ(µ/r + 1) Weibull {ln κ} µ2 (γ2 /γ21 − 1) Gamma {ln α} µ2 /α Exponential ∅ µ2 Inv. Gauss {ln λ} µ3 /λ Beta {ln φ} µ(1 − µ)/(1 + φ) 4.12.1 s(µ) = √ 1/ µ q 4(r+µ) rµ k(µ) = 1/µ − q γ3 −3γ1 γ2 +2γ31 (γ2 −γ21 )3/2 √ 2/ α r µ(µ+r) 6 r + r µ(µ+r) γ4 −4γ1 γ3 +12γ21 γ2 −3γ22 −6γ41 (γ2 −γ21 )2 6/α 2 p 3 µ/λ 6 (2−4µ)(1+φ) √ v(µ)(2+φ) −v(µ)(5φ+6) 6 (φ+1) v(µ)(φ+2)(φ+3) 15µ/λ 2 p(y|f) = µy exp(−µ)/y! y+r−1 r y r µ /(r + µ)r+y y y∈ N µ∈ R+ Inverse Links exp, logistic* N R+ exp, logistic* κγ1 /µ (yγ1 /µ)κ−1 exp (−(yγ1 /µ)κ ) αα yα−1 −α exp − yα µ Γ (α) µ y µ−1 exp − µ q 2 λ exp − λ(y−µ) 2πy3 2µ2 y R+ \{0} R+ \{0} exp, logistic* R+ \{0} R+ \{0} exp, logistic* R+ \{0} R+ \{0} exp, logistic* R+ \{0} R+ \{0} exp, logistic* Γ (φ) µφ−1 (1 Γ (µφ)Γ ((1−µ)φ) y [0, 1] [0, 1] expexp, logit − y)(1−µ)φ−1 Inverse Link Functions Possible inverse link functions and their properties (∪ convex, ∩ concave, ↑ monotone) are summarised below: util/glm_invlink_* exp logistic logistic2 expexp logit g(f) = µ = ef `(f) = ln(1 + ef ) ` (f + af`(f)) exp(−e−f ) 1/(1 + e−f ) g:R→ R+ R+ R+ [0, 1] [0, 1] g is ∪,↑ ∪,↑ ↑ ↑ ↑ h(f) = ln µ = f ln(ln(1 + ef )) ln(` (f + af`(f))) −e−f − ln(1 + e−f ) h is ∪,∩,↑ ∩,↑ ∩,↑ ∪,↑ ∪,↑ Please see doc/usageLik.m for how to specify the pair likelihood function and link function in GPML. 
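As a minimal sketch (mirroring doc/usageLik.m; the training inputs x and count targets y are assumed to be given), a Poisson likelihood paired with the logistic inverse link can be used together with Laplace's approximation as follows:

lik = {@likPoisson ,'logistic'};        % likelihood / inverse link pair
hyp.mean = []; hyp.cov = log([1;1]);    % covSEiso hyperparameters [log(ell); log(sf)]
hyp.lik = [];                           % likPoisson has no hyperparameters
[nlZ,dnlZ] = gp(hyp, @infLaplace, @meanZero, @covSEiso, lik, x, y);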
Exponential inverse link: exp For g(f) = ef things are simple since h(f) = f, h 0 (f) = 1 and h 00 (f) = h 000 (f) = 0. 33 Logistic inverse link: logistic For g(f) = ln(1 + ef ) the derivatives of h(f) are given by h(f) = ln(ln(1 + ef )) 1 −ef 1 s(−f), s(f) = , s 0 (f) = = −s(−f)s(f) h 0 (f) = f f ln(1 + e ) 1+e (1 + ef )2 1 e−f 1 ef 1 h 00 (f) = − f f −f 2 2 ln(1 + e ) (1 + e ) ln (1 + ef ) 1 + e 1 + e−f = h 0 (f) s(f) − h 0 (f) −ef 00 000 00 0 0 h (f) = h (f) s(f) − h (f) + h (f) − h (f) (1 + ef )2 = h 00 (f) s(f) − 2h 0 (f) − h 0 (f)s(f)s(−f). Note that g(f) = eh(f) = ln(1 + ef ) is convex and h(f) = ln(ln(1 + ef )) with 1 ef 1 1 00 h (f) = 60 1− f f f ln(1 + e ) ln(1 + e ) 1 + e 1 + e−f is concave since ef > ln(1 + ef ) for all f ∈ R. Twice logistic inverse link: logistic2 Note that h(f) = ln(` (f + af`(f))) is – according to Seeger et al.9 – concave. Double negative exponential inverse link: expexp For g(f) = exp(−e−f ) the derivatives of h(f) are given by h(f) = −e−f h 0 (f) = −h(f) h 00 (f) = h(f) h 000 (f) = −h(f) Logit regression inverse link: logit For g(f) = 1/(1+e−f ) the derivatives of h(f) can be computed using the logistic inverse link function h` (f) since h(f) = f − exp(h` (f)) h(f) = f − eh` (f) h 0 (f) = 1 − eh` (f) h`0 (f) h 00 (f) = −eh` (f) [h`0 (f)2 + h`00 (f)] = eh` (f) s` (−f)s2` (f) h 000 (f) = −eh` (f) [h`0 (f)3 + 3h`00 (f)h`0 (f) + h`000 (f)] 9 Bayesian Intermittent Demand Forecasting, NIPS, 2016 34 4.12.2 Poisson Likelihood Count data y ∈ Nn can be modeled in the GP framework using the Poisson distribution p(y) = √ µy e−µ /y! with mean/variance E[y] = V[y] = µ, skewness S[y] = 1/ µ and kurtosis K[y] = 1/µ leading to the likelihood p(y|f) = µy exp(−µ)/y!, µ = g(f) ⇔ ln p(y|f) = y · ln g(f) − g(f) − ln Γ (y + 1). For Laplace’s method to work, we need the first three derivatives of the log likelihood ln p(y|f), where h(f) = ln g(f) ln p(y|f) ∂ ln p(y|f) ∂f ∂2 ln p(y|f) ∂f2 ∂3 ln p(y|f) ∂f3 = y · h(f) − exp(h(f)) − ln Γ (y + 1) = h 0 (f) [y − exp(h(f))] = h 00 (f) [y − exp(h(f))] − [h 0 (f)]2 exp(h(f)) = h 000 (f) [y − exp(h(f))] − 3h 0 (f) · h 00 (f) exp(h(f)) − [h 0 (f)]3 exp(h(f)) h 000 (f) [y − exp(h(f))] − h 0 (f)[h 0 (f)2 + 3h 00 (f)] exp(h(f)). Note that if ln µ = h(f) is concave and µ = g(f) is convex then the Poisson likelihood p(y|f) is log-concave in f which is the case for both exp and logistic. 4.12.3 Weibull Likelihood Nonnegative data y ∈ R+ such as time-to-failure can be modeled in the GP framework using the κ Weibull distribution p(y) = κ/λ(y/λ)κ−1 e−(y/λ) with shape parameter κ > 0, scale parameter λ > 0, mean E[y] = λγ1 = µ where γj = Γ (1 + j/κ), variance V[y] = λ2 γ2 − µ2 = µ2 (γ2 /γ21 − 1), skewness S[y] = (γ3 − 3γ1 γ2 + 2γ31 )/(γ2 − γ21 )3/2 and kurtosis K[y] = (γ4 − 4γ1 γ3 + 12γ21 γ2 − 3γ22 − 6γ41 )/(γ2 − γ21 )2 . Using the substitution µ = λγ1 ⇔ 1/λ = γ1 /µ, we obtain y κ y κ−1 κ exp − γ1 γ1 , µ = g(f) > 0 p(y|f) = γ1 µ µ µ κ y y κ . ⇔ ln p(y|f) = ln γ1 + (κ − 1) ln γ1 − γ1 µ µ µ Note that the Weibull likelihood p(y|f) is log-concave in f neither for the exp nor for the logistic inverse link. 4.12.4 Gamma Likelihood Nonnegative data y ∈ R+ can be modeled in the GP framework using the Gamma distribution p(y) = θ−α /Γ (α)yα−1 e−y/θ with shape parameter α > 0, scale parameter θ > 0, mean E[y] = √ αθ = µ, variance V[y] = αθ2 = µ2 /α, skewness S[y] = 2/ α and kurtosis K[y] = 6/α. 
Using the substitution µ = αθ ⇔ α/µ = 1/θ, we obtain αα yα−1 −α yα p(y|f) = µ exp − , µ = g(f) > 0 Γ (α) µ y ⇔ ln p(y|f) = −α ln µ + − ln Zα (y), ln Zα (y) = ln Γ (α) − α ln α + (1 − α) ln y. µ 35 Note that if ln µ = h(f) was convex and µ = g(f) was concave then the Gamma likelihood p(y|f) would be log-concave in f which is not the case for both exp and logistic. 4.12.5 Exponential Likelihood Nonnegative data y ∈ R+ can be modeled in the GP framework using the Exponential distribution p(y) = θ−1 e−y/θ with scale parameter θ > 0, mean E[y] = θ = µ, variance V[y] = µ2 , skewness S[y] = 2 and kurtosis K[y] = 6. We obtain y −1 p(y|f) = µ exp − , µ = g(f) > 0 µ y ⇔ ln p(y|f) = − ln µ − . µ Note that for exp (but not for logistic) the likelihood is log-concave. The exponential distribution corresponds to the Gamma distribution with α = 1 and the Weibull distribution with κ = 1. 4.12.6 Inverse Gaussian Likelihood n Nonnegative data p y ∈ R+ can be modeled in the GP framework using the Inverse Gaussian distriparameter λ > 0, mean parameter bution p(y) = λ/(2πy3 ) exp(−λ(y − µ)2 /(2µ2 y)) with shape p µ > 0, mean E[y] = µ, variance V[y] = µ3 /λ, skewness S[y] = 3 µ/λ and kurtosis K[y] = 15µ/λ. We obtain s λ λ(y − µ)2 p(y|f) = exp − , µ = g(f) > 0 2πy3 2µ2 y ⇔ ln p(y|f) = − λ(y − µ)2 1 − ln Zα (y), ln Zα (y) = − (ln λ − ln 2πy3 ). 2 2 2µ y The inverse Gaussian likelihood is in general not log-concace in f for both exp and logistic. 4.12.7 Beta Likelihood Interval data y ∈ [0, 1]n can be modeled in the GP framework using the Beta distribution p(y) = yα−1 (1 − y)β−1 /B(α, β) with shape parameters α, β > 0, mean E[y] = α/(α + β) and variance V[y] = αβ/[(α + β)2 (α + β + 1)] and 1/B(α, β) = Γ (α + β)/[Γ (α)Γ (β)]. Reparametrising using the mean parameter µ = E[y] = α/(α + β) , the shape parameter φ = α + β, the variance V[y] = µ(1 − µ)/(1 + φ) and hence Γ (φ) yµφ−1 (1 − y)(1−µ)φ−1 , µ = g(f) > 0 Γ (µφ)Γ ((1 − µ)φ) ⇔ ln p(y|f) = ln Γ (φ) − ln Γ (µφ) − ln Γ ((1 − µ)φ) + (µφ − 1) ln y + ((1 − µ)φ − 1) ln(1 − y). p(y|f) = The Beta likelihood is in general not log-concace in f for both exp and logistic. 36 5 Mean Functions A mean function mφ : X → R (with hyperparameters φ) of a GP f is a scalar function defined over the whole domain X that computes the expected value m(x) = E[f(x)] of f for the input x. 5.1 Interface In the GPML toolbox, a mean function m : X → R needs to implement evaluation m = mφ (X) and ∂ first derivatives mi = ∂φ m with respect to the components i of the parameter φ ∈ Φ as detailed i below. 36 hmeanFunctions.m 36i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % Mean functions to be use by Gaussian process functions. 
There are two different kinds of mean functions: simple and composite: Simple mean functions: meanZero meanOne meanConst meanLinear meanPoly meanDiscrete meanGP meanGPexact meanNN meanWSPC - zero mean function one mean function constant mean function linear mean function polynomial mean function precomputed mean for discrete data predictive mean of another GP predictive mean of a regression GP nearest neighbor mean function weighted sum of projected cosines Composite mean functions (see explanation at the bottom): meanScale meanSum meanProd meanPow meanMask meanPref meanWarp - scaled version of a mean function sum of mean functions product of mean functions power of a mean function mask some dimensions of the data difference mean for preference learning warped mean function Naming convention: all mean functions are named "mean/mean*.m". 1) With no or only a single input argument: s = meanNAME or s = meanNAME(hyp) The mean function returns a string s telling how many hyperparameters hyp it expects , using the convention that "D" is the dimension of the input space. For example , calling "meanLinear" returns the string ’D’. 2) With two input arguments and one output argument: m = meanNAME(hyp, x) The function computes and returns the mean vector m with components m(i) = m(x(i,:)) where hyp are the hyperparameters and x is an n by D matrix of data, where D is the dimension of the input space. The returned mean 37 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 % vector m is of size n by 1. % % 3) With two input arguments and two output arguments: % % [m,dm] = meanNAME(hyp, x) % % The function computes and returns the mean vector m as in 2) above. % In addition to that, the (linear) directional derivative function dm is % returned. The call dhyp = dm(q) for a direction vector q of size n by 1 % returns a vector of directional derivatives dhyp = d (q’*m(x)) / d hyp of % the same size as the hyperparameter vector hyp. The components of dhyp are % defined as follows: dhyp(i) = q’*( d m(x) / d hyp(i) ). % % See also doc/usageMean.m. % hgpml copyright 5ai 5.2 Implemented Mean Functions We offer simple and composite mean functions producing new mean functions m(x) from existing mean functions µj (x). All code files are named according to the pattern mean/mean .m for simple identification. This modular specification allows to define affine mean functions m(x) = c + a> x or polynomial mean functions m(x) = (c + a> x)2 . All currently available mean functions are summarised in the following table. Simple mean functions m(x) Mmeaning Zero mean vanishes always One mean equals 1 Const mean equals a constant Linear mean linearly depends on x ∈ X ⊆ RD Poly mean polynomially depends on x ∈ X ⊆ RD Discrete precomputed mean for discrete data x ∈ X ⊆ N GP predictive mean of another GP GPexact predictive mean of a regression GP NN nearest neighbor for a set (zj , mj ) ∈ X × R WSPC weighted sum of d projected cosines x ∈ X ⊆ RD Composite mean functions [µ1 (x), µ2 (x), ..] 
7→ m(x) meaning Scale scale a mean Sum add up mean functions Prod multiply mean functions Pow raise a mean to a power Mask act on components I ⊆ [1, 2, .., D] of x ∈ X ⊆ RD only Pref preference learning mean x = [x1 ; x2 ], xi ⊆ RD/2 Warp warped mean 5.3 m(x) = 0 1 c a> x P > d d ad x µ x R R y · p(y|D, x)dy y · p(y|D, x)dy mi , i = arg minj d(x, zj ) a> cos(Wx + b) φ ∅ ∅ c∈R a ∈ RD a ∈ RD×d µ ∈ Rs ∅ ρ, ψ, σn ∅ W ∈ Rd×D , a, b ∈ Rd m(x) = αµ(x) P µ (x) Qj j j µj (x) µ(x)d µ(xI ) µ(x1 ) − µ(x2 ) g[µ(x)] φ α∈R ∅ ∅ ∅ ∅ ∅ ∅ Usage of Implemented Mean Functions Some code examples taken from doc/usageMean.m illustrate how to use simple and composite mean functions to specify a GP model. Syntactically, a mean function mf is defined by mn := ’func’ | @func // simple 38 mf := {mn} | {mn, {param, mf}} | {mn, {mf, .., mf}} // composite i.e., it is either a string containing the name of a mean function, a pointer to a mean function or one of the former in combination with a cell array of mean functions and an additional list of parameters. 38 hdoc/usageMean.m 38i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 % demonstrate usage of mean functions % % See also meanFunctions.m. % hgpml copyright 5ai clear all, close all n = 5; D = 2; x = randn(n,D); % create a random data set % set up simple mean functions m0 = {’meanZero’}; hyp0 = []; % no hyperparameters are needed m1 = {’meanOne’}; hyp1 = []; % no hyperparameters are needed mc = {@meanConst}; hypc = 2; % also function handles are possible ml = {@meanLinear}; hypl = [2;3]; % m(x) = 2*x1 + 3*x2 mp = {@meanPoly ,2}; hypp = [1;1;2;3]; % m(x) = x1+x2+2*x1^2+3*x2^2 mn = {@meanNN ,[1,0; 0,1],[0.9,0.5]}; hypn = []; % nearest neighbor s = 12; hypd = randn(s,1); % discrete mean with 12 hypers md = {’meanDiscrete’,s}; hyp.cov = [0;0]; hypg = []; % GP predictive mean xt = randn(2*n,D); yt = sign(xt(:,1)-xt(:,2)); % training data mg = {@meanGP ,hyp,@infEP ,@meanZero ,@covSEiso ,@likErf ,xt,yt}; hype = [0;0; log(0.1)]; % regression GP predictive mean xt = randn(2*n,D); yt = xt(:,1).*xt(:,2); % training data me = {@meanGPexact ,@meanZero ,@covSEiso ,xt,yt}; % set up composite mean functions msc = {’meanScale’,{m1}}; hypsc = [3; hyp1]; % scale by 3 msu = {’meanSum’,{m0,mc,ml}}; hypsu = [hyp0; hypc; hypl]; % sum mpr = {@meanProd ,{mc,ml}}; hyppr = [hypc; hypl]; % product mpo = {’meanPow’,3,msu}; hyppo = hypsu; % third power mask = [false ,true]; % mask excluding all but the 2nd component mma = {’meanMask’,mask,ml}; hypma = hypl(mask); mpf = {@meanPref ,ml}; hyppf = 2; % linear pref with slope mwp = {@meanWarp ,ml,@sin,@cos};hypwp = 2; % sin of linear % 0) specify mean function % mean = md; hyp = hypd; x = randi([1,s],n,1); % mean = mn; hyp = hypn; % mean = mg; hyp = hypg; mean = me; hyp = hype; % mean = m0; hyp = hyp0; % mean = msu; hyp = hypsu; % mean = mpr; hyp = hyppr; % mean = mpo; hyp = hyppo; % mean = mpf; hyp = hyppf; % 1) query the number of parameters feval(mean{:}) % 2) evaluate the function on x feval(mean{:},hyp,x) % 3) compute the derivatives w.r.t. to hyperparameter i i = 2; feval(mean{:},hyp,x,i) 39 6 Covariance Functions A covariance function kψ : X × X → R (with hyperparameters ψ) of a GP f is a scalar function defined over the whole domain X2 that computes the covariance k(x, z) = V[f(x), f(z)] = E[(f(x) − m(x))(f(z) − m(z))] of f between the inputs x and z. 
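As a small standalone illustration (with arbitrary hyperparameter values; the evaluation interface itself is detailed next), an isotropic squared exponential covariance can be evaluated on a random data set as follows:

n = 5; D = 3; x = randn(n,D);            % random data set
hyp = log([0.9; 2]);                     % covSEiso hyperparameters [log(ell); log(sf)]
K   = feval(@covSEiso, hyp, x);          % n-by-n covariance matrix k(x_i,x_j)
kss = feval(@covSEiso, hyp, x, 'diag');  % n-by-1 vector of self covariances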
6.1 Interface Again, the interface is simple since only evaluation of the full covariance matrix K = kψ (X) and its ∂ derivatives Ki = ∂ψ K as well as cross terms k∗ = kψ (X, x∗ ) and k∗∗ = kψ (x∗ , x∗ ) for prediction i are required. 39 hcovFunctions.m 39i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % Covariance functions to be use by Gaussian process functions. There are two different kinds of covariance functions: simple and composite: 1) Elementary and standalone covariance functions: covZero - zero covariance function covEye - unit covariance function covOne - unit constant covariance function covDiscrete - precomputed covariance for discrete data 2) Composite covariance functions: covScale - scaled version of a covariance function covSum - sums of covariance functions covProd - products of covariance functions covMask - mask some dimensions of the data covPref - difference covariance for preference learning covPER - make stationary covariance periodic covADD - additive covariance function covWarp - warp input to a covariance function 3) Mahalanobis distance based covariances and their modes covMaha - generic "mother" covariance * eye - unit length scale * iso - isotropic length scale * ard - automatic relevance determination * prot - (low-rank) projection in input space * fact - factor analysis covariance * vlen - spatially varying length scale covGE - Gamma exponential covariance covMatern - Matern covariance function with nu=1/2, 3/2 or 5/2 covPP - piecewise polynomial covariance function (compact support) covRQ - rational quadratic covariance function covSE - squared exponential covariance function 4) Dot product based covariances and their modes covDot - generic "mother" covariance * eye - unit length scale * iso - isotropic length scale * ard - automatic relevance determination * pro - (low-rank) projection in input space * fac - factor analysis covariance covLIN - linear covariance function covPoly - polynomial covariance function 40 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % 5) Time series covariance functions on the positive real line covFBM - fractional Brownian motion covariance covULL - underdamped linear Langevin process covariance covW - i-times integrated Wiener process covariance covOU - i-times integrated Ornstein -Uhlenbeck process covariance 6) Standalone covariances covNNone - neural network covariance function covLINone - linear covariance function with bias covPeriodic - smooth periodic covariance function (1d) covPeriodicNoDC - as above but with zero DC component and properly scaled covCos - sine periodic covariance function (1d) with unit period covGabor - Gabor covariance function 7) Shortcut covariances assembled from library covConst - covariance for constant functions covNoise - independent covariance function (i.e. 
white noise) covPERiso - make isotropic stationary covariance periodic covPERard - make ARD stationary covariance periodic covMaterniso - Matern covariance function with nu=1/2, 3/2 or 5/2 covMaternard - Matern covariance function with nu=1/2, 3/2 or 5/2 with ARD covPPiso - piecewise polynomial covariance function (compact support) covPPard - piecewise polynomial covariance function (compact support) covRQiso - isotropic rational quadratic covariance function covRQard - rational quadratic covariance function with ARD covSEiso - isotropic squared exponential covariance function covSEisoU - same as above but without latent scale covSEard - squared exponential covariance function with ARD covSEvlen - spatially varying lengthscale squared exponential covSEproj - projection squared exponential covariance function covLINiso - linear covariance function covLINard - linear covariance function with ARD covGaborard - Gabor covariance function with ARD covGaborsio - isotropic Gabor covariance function covSM - spectral mixture covariance function 8) Special purpose (approximation) covariance functions apxSparse - sparse approximation: to be used for large scale inference problems with inducing points aka FITC apxGrid - grid interpolation: to be used for large scale inference problems with Kronecker/Toeplitz/BTTB covariance matrix The covariance functions are written according to a special convention where the exact behaviour depends on the number of input and output arguments passed to the function. If you want to add new covariance functions , you should follow this convention if you want them to work with the function gp. There are four different ways of calling the covariance functions: 1) With no (or one) input argument(s): s = cov The covariance function returns a string s telling how many hyperparameters it expects , using the convention that "D" is the dimension of the input space. For example , calling "covRQard" returns the string ’(D+2)’. 2) With two input arguments and one output argument: 41 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % K = cov(hyp, x) equivalent to K = cov(hyp, x, []) The function computes and returns the covariance matrix where hyp are the hyperparameters and x is an n by D matrix of cases , where D is the dimension of the input space. The returned covariance matrix is of size n by n. 3) With three input arguments and one output argument: Kz = cov(hyp, x, z) kx = cov(hyp, x, ’diag ’) The function computes test set covariances; kx is a vector of self covariances for the test cases in x (of length n) and Kz is an (n by nz) matrix of cross covariances between training cases x and test cases z. 4) With two output arguments: [K,dK] = cov(hyp, x) equivalent to [K,dK] = cov(hyp, x, []) [K,dK] = cov(hyp, x, z) [K,dK] = cov(hyp, x, ’diag ’) The function computes and returns the covariances K as in 3) above. In addition to that, the (linear) directional derivative function dK is returned. The two possible calls dhyp = dK(Q) and [dhyp,dx] = dK(Q) for a direction Q of the same size as K are possible. 
The return arguments dhyp and dx are the directional derivatives dhyp = d trace(Q’*K) / d hyp and dx = d trace(Q’*K) / d x are of the same size as the hyperparameter vector hyp and the input data x, respectively. The components of dhyp and dx are defined as follows: dhyp(i) = trace(Q’*( d K / d hyp(i) )) and dx(i,j) = trace(Q’*( d K / d x(i,j) )). Covariance functions can be specified in two ways: either as a string containing the name of the covariance function or using a cell array. For example: cov = ’covRQard ’; cov = {’covRQard ’}; cov = {@covRQard}; are supported. Only the second and third form using the cell array can be used for specifying composite covariance functions , made up of several contributions. For example: cov cov cov cov q=1; cov d=3; cov cov cov = = = = = = = = {’covScale ’, {’covRQiso ’}}; {’covSum ’, {’covRQiso ’,’covSEard ’,’covNoise ’}}; {’covProd ’,{’covRQiso ’,’covSEard ’,’covNoise ’}}; {’covMask ’,{mask,’covSEiso ’}} {’covPPiso ’,q}; {’covPoly ’,d}; {’covADD ’,{[1,2],’covSEiso ’}}; {@apxSparse , {@covSEiso}, u}; where u are the inducing inputs specifies a covariance function which is the sum of three contributions. To find out how many hyperparameters this covariance function requires , we do: feval(cov{:}) 42 160 161 162 163 164 165 % which returns the string ’3+(D+1)+1’ (i.e. the ’covRQiso ’ contribution uses % 3 parameters , the ’covSEard ’ uses D+1 and ’covNoise ’ a single parameter). % % See also doc/usageCov.m. % hgpml copyright 5ai 43 6.2 Implemented Covariance Functions Similarly to the mean functions, we provide a whole algebra of covariance functions k : X × X → R with the same generic name pattern cov/cov .m as before. Besides a long list of simple covariance functions, we also offer a variety of composite covariance functions as shown in the following table. 1) Elementary and standalone covariance functions k(x, x 0 ) meaning Zero covariance vanishes always Eye unit additive measurement noise One unit constant Discrete precomputed covariance for discrete data x ∈ X ⊆ N 2) Composite covariance functions [κ1 (x, z), κ2 (x, z), ..] 
7→ k(x, z) meaning Scale modulate covariance function by a scalar Scale modulate covariance function depending on input Sum add up covariance functions Prod multiply covariance functions Mask Pref act on components I ⊆ [1, 2, .., D] of x ∈ X ⊆ RD only preference learning covariance x = [x1 ; x2 ], xi ⊆ RD/2 PER turn covariance into a periodic, X ⊆ RD ADD Warp additive, X ⊆ RD , index degree set D = {1, .., D} warp inputs to a covariance function 3) Covariance functions based on Mahalanobis distances k(x, z) = k(r), Maha shared generic “mother” covariance function GE gamma exponential 2 Matern Matérn, f1 (t) = 1, f3 (t) = 1 + t, f5 (t) = f3 (t) + t3 PP compact support, piecewise polynomial fv (r) RQ rational quadratic SE squared exponential Allowable mode parameter values for Mahalanibis distance covariances ’eye’ unit lengthscale ’iso’ isotropic lengthscale ’ard’ automatic relevance determination ’proj’ (low-rank) projection in input space ’fact’ factor analysis ’vlen’ r2 = (x − x> P−1 z, underdamped linear Langevin process covariance W OU i-times integrated Wiener process covariance, i ∈ {−1, .., 3} i-times integrated Ornstein-Uhlenbeck process covariance, i ∈ {0, 1} NNone neural net, X ⊆ RD LINone linear with bias, X ⊆ RD Periodic kxx 0 where K = L> L is Cholesky decomposition, L ∈ Rs×s L k(x, z) = σ2f κ(x, z) σf (x)κ(x, z)σf (z) P κ (x, z) Qj j j κj (x, z) ψ ln σf φ` ∅ ∅ κ(xI , zI ) κ(x1 , z1 ) + κ(x2 , z2 ) − κ(x1 , z2 ) − κ(x2 , z1 ) x sin 2πxp κ(u(x), u(z)) , u(x) = , xp = x/p cos 2πxp x/p P P Q 2 d∈D σfd |I|=d i∈I κ(xi , zi ; ψi ) D D p κ(p(x), p(z)), where p : R → R ∅ ∅ − z), where x, z ∈ X ⊆ k(r) = exp (−|r|γ ) , γ ∈ (0, 2] √ √ fd ( dr) exp(− dr) max(0, 1 − r)j+v · fv (r) −α 1 + 12 r2 /α exp −r2 /2 where x, z ∈ X ⊆ k(s) = s σ2f (s + c)d ∅ ln(γ/(2 − γ)) ∅ ∅ ln α ∅ ∅ ln ` ln Λ L {L, ln f} `2 (x)+`2 (z) 2 ∅ ∅ ln c k(x, z) = 2h 1 2 2h 2h − |x − 2 σf |x| z| + |z| ψ {ln σf , − ln(1/h − 1)} sin(ωt) + cos(ωt) , t = |x − z| ω µ Rx Rz κ0 (x, z) = σ2f min(x, z), κi (x, z) = 0 0 κi (x̃, z̃)dx̃dz̃ |x+z| 2 2 κ0 (x, z) = σ2f exp(− |x−z| ` ) + (σf0 − σf ) exp(− ` ) {ln µ, ln ω, ln a} k(x, z) = ψ k(t) = a2 e−µt p (`2 + x> x)(`2 + z> z) ln t {ln `, ln p, ln σf } Rπ PeriodicNoDC periodic, rescaled, DC component removed, X ⊆ R Cos periodic cosine, X ⊆ R σ2f cos (π(x − z)/p) Gabor Gabor function, X ⊆ π 0 t 1 > −2 > exp − 2 t Λ t cos 2πtp 1 , tp = t/p t/p 44 ln σf {ln `, ln σf , ln σf0 } {ln `, ln σf } κ(x−z)− 1 κ(t)dt σ2f κ(0)− 1 πRπ 0κ(t)dt , κ(x − z) = exp − `22 sin2 [π(x − z)/p] λ, p ∈ φ` ∅ ln ` ln Λ L {L, ln f} (x> z +1)/t2 RD + {ψ1 , .., ψD , ln σf1 , .., ln σf|D| } ∅ RD σ2f exp − `22 sin2 [π(x − z)/p] RD , ψu P=I P = `2 I, ` ∈ R+ P = Λ2 , Λ = diag(λ), λ ∈ RD + P−1 = L> L, L ∈ Rd×D −1 > D P = L L + diag(f), f ∈ R , L ∈ Rd×D σ2f sin−1 x> z/ periodic, X ⊆ R eye iso ard RD P=I P = `2 I, ` ∈ R+ P = Λ2 , Λ = diag(λ), λ ∈ RD + P−1 = L> L, L ∈ Rd×D P−1 = L> L + diag(f), f ∈ RD , L ∈ Rd×D D/2 2 r k` (x, z) = `(x)`(z) k l2 (x,z) , l2 (x, z) = l2 (x,z) ULL 6) Standalone covariance functions meaning ψ ∅ ∅ ∅ z)> P−1 (x spatially varying lengthscale, 4) Covariance functions based on Euclidean dot products k(x, z) = k(s), s = Dot shared generic “mother” covariance function LIN linear covariance function Poly polynomial covariance Allowable mode parameter values for Euclidean dot product covariances ’eye’ unit lengthscale ’iso’ isotropic lengthscale ’ard’ automatic relevance determination ’proj’ (low-rank) projection in input space ’fact’ factor analysis 5) Time series covariance functions k(x, z) = k(x, z), 
where x, z ∈ X = R+ meaning FBM fractional Brownian motion covariance with Hurst index h k(x, z) = 0 δ(x − z) 1 eye iso , t = x − z ard {ln `, ln p, ln σf } {ln p, ln σf } {ln λ, ψp } 7) Shortcut covariance functions assembled from library k(x, z) Meaning Const covariance equals a constant Noise additive measurement noise PERiso turn covariance into a periodic, X ⊆ RD PERard turn covariance into a periodic, X ⊆ RD Materniso Matérn, X ⊆ RD , f1 (t) = 1, f3 (t) = 1 + t, f5 (t) = f3 (t) + Maternard PPiso PPard RQiso RQard SEiso SEisoU SEard Matérn, X ⊆ f1 (t) = 1, f3 (t) = 1 + t, f5 (t) = f3 (t) + compact support, piecewise polynomial fv (r) compact support, piecewise polynomial fv (r) rational quadratic, X ⊆ RD rational quadratic, X ⊆ RD diagonal squared exponential, X ⊆ RD squared exponential, X ⊆ RD automatic relevence determination squared exponential SEvlen spatially varying lengthscale squared exponential SEproj factor analysis squared exponential LINard LINiso linear with diagonal weighting linear with isotropic weighting RD , t2 3 t2 3 Gaborard anisotropic Gabor function, X ⊆ RD , λ, p ∈ RD + Gaboriso isotropic Gabor function, X ⊆ SM spectral mixture, X ⊆ RD , w ∈ RQ M, V ∈ RD×Q +, + spectral mixture product, X ⊆ RD , W ∈ RD×Q , M, V ∈ RD×Q + + RD , `, p ∈ R+ k(x, z) = σ2f σ2f δ(x − z) κ(u(x), u(z)) , u(x) = [sin xp , cos xp ], xp = 2πx/p κ(u(x), u(z)) , u(x) = [sin xp , cos xp ], xp = 2πx/p q σ2f fd (rd ) exp(−rd ), rd = `d2 (x − z)> (x − z) p σ2f fd (rd ) exp(−rd ), rd = d(x − z)> Λ−2 (x − z) D j+v 2 · fv (r), r = kx−zk σf max(0, 1 − r) ` ,j= 2 +v+1 σ2f max(0, 1 − r)j+v · fv (r), r2 = (x − z)> Λ−2 (x − z) −α 1 > σ2f 1 + 2α` 2 (x − z) (x − z) −α 1 2 > −2 σf 1 + 2α (x − z) Λ (x − z) σ2f exp − 2`12 (x − z)> (x − z) exp − 2`12 (x − z)> (x − z) σ2f exp − 12 (x − z)> Λ−2 (x − z) D kx−zk2 a 2 2 σf b , a = 2`(x)`(z), b = `2 (x) + `2 (z) exp − b σ2f exp − 12 (x − z)> L> L(x − z) , L ∈ Rd×D ψ ln σf ln σf ln p {ln p1 , .., ln pD } x> Λ−2 z x> z/`2 P exp − D d=1 {ln λ1 , .., ln λD } ln ` P cos 2π D d=1 td /pd , td = xd − zd t2d 2λ2d t> t exp(− 2`2 ) cos(2πt> 1/p), t = x − z w> exp(− 21 V> t2 ) cos(M> t) , t = 2π(x − z) QD 1 > 2 cos(md td ) , td = 2π(xd d=1 wd exp(− 2 vd td ) {ln `, ln σf } {ln λ1 , .., ln λD , ln σf } {ln `, ln σf } {ln λ1 , .., ln λD , ln σf } {ln `, ln σf , ln α} {ln λ1 , .., ln λD , ln σf , ln α} {ln `, ln σf } ln ` {ln λ1 , .., ln λD , ln σf } {φ` , ln σf } {L, ln σf } {ln λ, ln p} {ln `, ln p} − zd ) {ln w, ln M, ln V} {ln W, ln M, ln V} The spectral mixture covariance covSM was introduced by Wilson & Adams Gaussian Process Kernels for Pattern Discovery and Extrapolation, ICML, 2013. The periodic covariance function covPER starts from a stationary covariance function that depends on the data only through a distance r2 = (x − x 0 )> Λ−2 (x − x 0 ) such as covMatern, covPP, covRQ, covSE and turns them into a periodic covariance function by embedding the data x ∈ RD into a periodic high-dimensional space xp = u(x) ∈ R2D by a function u(x) = 2πdiag(p−1 )x. The additive covariance function covADD starts from a one-dimensional covariance function κ(xi , xi0 , ψi ) acting on a single component i ∈ [1, .., D] of x. From that, we define covariance functions κI (xI , xI ) = Q 0 i∈I κ(xi , xi , ψi ) acting on vector-valued inputs xI . The sums of exponential size can efficiently be computed using the Newton-Girard formulae. Samples functions drawn from a GP with additive covariance are additive functions. 
The number of interacting variables |I| is a measure of how complex the additive functions are. 6.3 Usage of Implemented Covariance Functions Some code examples taken from doc/usageCov.m illustrate how to use simple and composite covariance functions to specify a GP model. Syntactically, a covariance function cf is defined by cv := ’func’ | @func // simple cf := {cv} | {cv, {param, cf}} | {cv, {cf, .., cf}} // composite i.e., it is either a string containing the name of a covariance function, a pointer to a covariance function or one of the former in combination with a cell array of covariance functions and an additional list of parameters. 44 hdoc/usageCov.m 44i≡ 1 2 3 4 5 6 7 8 % demonstrate usage of covariance functions % % See also covFunctions.m. % hgpml copyright 5ai clear all, close all n = 5; D = 3; x = randn(n,D); xs = randn(3,D); 45 % create a data set 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 % set up simple covariance functions cn = {’covNoise’}; sn = .1; hypn = log(sn); % one hyperparameter cc = {@covConst}; sf = 2; hypc = log(sf); % function handles OK ce = {@covEye}; hype = []; % identity cl = {@covLIN}; hypl = []; % linear is parameter -free cla = {’covLINard’}; L = rand(D,1); hypla = log(L); % linear (ARD) cli = {’covLINiso’}; l = rand(1); hypli = log(l); % linear iso clo = {@covLINone}; ell = .9; hyplo = log(ell); % linear with bias cp = {@covPoly ,3}; c = 2; hypp = log([c;sf]); % third order poly cga = {@covSEard}; hypga = log([L;sf]); % Gaussian with ARD cgi = {’covSEiso’}; hypgi = log([ell;sf]); % isotropic Gaussian cgu = {’covSEisoU’}; hypgu = log(ell); % isotropic Gauss no scale cra = {’covRQard’}; al = 2; hypra = log([L;sf;al]); % ration. quad. cri = {@covRQiso}; hypri = log([ell;sf;al]); % isotropic cma = {@covMaternard ,5}; hypma = log([ell;sf]); % Matern class d=5 cmi = {’covMaterniso’,3}; hypmi = log([ell;sf]); % Matern class d=3 cnn = {’covNNone’}; hypnn = log([L;sf]); % neural network cpe = {’covPeriodic’}; p = 2; hyppe = log([ell;p;sf]); % periodic cpn = {’covPeriodicNoDC’}; p = 2; hyppe = log([ell;p;sf]); % w/o DC cpc = {’covCos’}; p = 2; hypcpc = log([p;sf]); % cosine cov cca = {’covPPard’,3}; hypcc = hypgu;% compact support poly degree 3 cci = {’covPPiso’,2}; hypcc = hypgi;% compact support poly degree 2 cgb = {@covGaboriso}; ell = 1; p = 1.2; hypgb=log([ell;p]); % Gabor Q = 2; w = ones(Q,1)/Q; m = rand(D,Q); v = rand(D,Q); csm = {@covSM ,Q}; hypsm = log([w;m(:);v(:)]); % Spectral Mixture cvl = {@covSEvlen ,{@meanLinear}}; hypvl = [1;2;1; 0]; % var lenscal s = 12; cds = {@covDiscrete ,s}; % discrete covariance function L = randn(s); L = chol(L’*L); L(1:(s+1):end) = log(diag(L)); hypds = L(triu(true(s))); xd = randi([1,s],[n,1]); xsd = [1;3;6]; cfa = {@covSEfact ,2}; hypfa = randn(D*2,1); % factor analysis % set up composite i.e. 
meta covariance functions csc = {’covScale’,{cgu}}; hypsc = [log(3); hypgu]; % scale by 9 csu = {’covSum’,{cn,cc,cl}}; hypsu = [hypn; hypc; hypl]; % sum cpr = {@covProd ,{cc,cci}}; hyppr = [hypc; hypcc]; % product mask = [0,1,0]; % binary mask excluding all but the 2nd component cma = {’covMask’,{mask,cgi{:}}}; hypma = hypgi; % isotropic periodic rational quadratic cpi = {’covPERiso’,{@covRQiso}}; % periodic Matern with ARD cpa = {’covPERard’,{@covMaternard ,3}}; % additive based on SEiso using unary and pairwise interactions cad = {’covADD’,{[1,2],’covSEiso’}}; % preference covariance with squared exponential base covariance cpr = {’covPref’,{’covSEiso’}}; hyppr = [0;0]; xp = randn(n,2*D); xsp = randn(3,2*D); % 0) specify a covariance function % cov = cma; hyp = hypma; % cov = cci; hyp = hypcc; % cov = csm; hyp = hypsm; cov = cds; hyp = hypds; x = xd; xs = xsd; % cov = cfa; hyp = hypfa; % cov = cvl; hyp = hypvl; % cov = cpr; hyp = hyppr; x = xp; xs = xsp; % 1) query the number of parameters feval(cov{:}) 46 67 68 69 70 71 72 73 % 2) evaluate the function on x [K,dK] = feval(cov{:},hyp,x) % 3) evaluate the function on x and xs to get cross -terms [kss,dkss] = feval(cov{:},hyp,xs,’diag’) [Ks, dKs ] = feval(cov{:},hyp,x,xs) 7 Hyperpriors A hyperprior p(θ) with θ = [ρ, φ, ψ] is a joint probability distribution over the likelihood hyperparameters ρ, the mean hyperparameters φ and the covariance hyperparameters ψ. We concentrate Q on factorial priors p(θ) = j pj (θj ). Hyperpriors can be used to regularise the optimisation of the hyperparameters via the marginal likelihood Z(θ) so that p(θ)Z(θ) is maximised instead. As we wish to perform unconstrained optimisation, we require (mainly) smooth hyperpriors with infinite support. 7.1 Interface In the GPML toolbox, a prior distribution p(θ) needs to implement the evaluation of the log density ∂ ln p(θ). In addition, we require sampling capabilities i.e. the ln p(θ) and its first derivative ∂θ generation of θ ∼ p(θ). 46 hpriorDistributions.m 46i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % prior distributions to be used for hyperparameters of Gaussian processes using infPrior. There are two different kinds of prior distributions: simple and composite: simple prior distributions: priorGauss priorLaplace priorT - univariate Gaussian - univariate Laplace - univariate Student ’s t priorSmoothBox1 priorSmoothBox2 - univariate interval (linear decay in log domain) - univariate interval (quadr. 
decay in log domain) priorGamma priorWeibull priorInvGauss priorLogNormal - priorClamped or priorDelta priorGaussMulti priorLaplaceMulti priorTMulti univariate univariate univariate univariate Gamma , IR+ Weibull , IR+ Inverse Gaussian , IR+ Log-normal , IR+ - fix hyperparameter to its current value by setting derivatives to zero, no effect on marginal likelihood - multivariate Gauss - multivariate Laplace - multivariate Student ’s t priorClampedMulti or - fix hyperparameter to its current value by setting priorDeltaMulti derivatives to zero, no effect on marginal likelihood priorEqualMulti or priorSameMulti - make several hyperparameters have the same value by same derivative , no effect on marginal likelihood 47 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 % % composite prior distributions (see explanation at the bottom): % % priorMix - nonnegative mixture of priors % priorTransform - prior on g(t) rather than t % % Naming convention: all prior distributions are named "prior/prior*.m". % % % 1) With only a fixed input arguments: % % r = priorNAME(par1,par2,parN) % % The function returns a random sample from the distribution for e.g. % random restarts , simulations or optimisation initialisation. % % 2) With one additional input arguments: % % [lp,dlp] = priorNAME(par1,par2,parN, t) % % The function returns the log density at location t along with its first % derivative. % % See also doc/usagePrior.m, inf/infPrior.m. % hgpml copyright 5ai 48 7.2 Implemented Hyperpriors All code files are named according to the pattern prior/prior .m for simple identification. All currently available hyperpriors are summarised in the following table. Simple hyperpriors p(θ) Univariate hyperpriors defined over the whole reals with mean µ and variace σ2 Meaning p(θ) = normally distributed hyperparameter θ ∈ R Gauss Laplace double exponentially hyperparameter θ ∈ R Student’s t distributed hyperparameter θ ∈ R T τ 2 √1 exp − (θ−µ) 2 2σ σ 2π (θ−µ) 1 ,b 2b exp − b Γ ( ν+1 2 ) Γ ( ν2 ) 1 (ν−2)πσ √ µ ∈ R, σ2 ∈ R+ √ = σ/ 2 1+ (θ−µ)2 (ν−2)σ2 µ ∈ R, σ2 ∈ R+ − ν+1 2 µ ∈ R, σ2 , ν ∈ R+ Univariate hyperpriors with effective bounded support but defined over the whole real line 1−exp(η(a−b)) 1 1 · 1+exp(−η(θ−a)) · 1+exp(η(θ−b)) ≈ b−a SmoothBox1 interval hyperparameter θ ∈ R i.e. θ ∈ [a, b] 1+π2 /g2 a+b w2 2 µ = 2 , σ = 1−exp(−2g) 12 , g = wη 2 , w = |a − b| 2 N(θ|a, σab ) t < a 1 √ 1 t ∈ [a, b] , σab = ηb−a ≈ 2π SmoothBox2 localised hyperparameter θ ∈ R i.e. 
θ ∈ [a, b] (1/η+1)(b−a) N(θ|a, σ2ab ) b < t µ= Univariate hyperpriors supported only over the positive reals Gamma Gamma hyperparameter θ ∈ R+ Weibull Weibull hyperparameter θ ∈ R+ InverseGauss inverse Gaussian hyperparameter θ ∈ R+ LogNormal log-normal hyperparameter θ ∈ R+ a+b 2 , σ2 = 3 2 w2 η /3+η +4η/π+2/π η3 +η2 4 = 1 √ θσ 2π a 6 b ∈ R, η ∈ R+ , w = |a − b| 1 exp( θt )θk−1 Γ (k)tk k−1 k θ exp −( θλ )k λ λ λ(θ−µ)2 1 √ 3 exp − 2µ2 θ 2πθ /λ N(θ|µ, σ2 ) a 6 b ∈ R, η ∈ R+ 2 exp − (ln θ−µ) 2σ2 k ∈ R+ , t ∈ R+ k ∈ R+ , λ ∈ R+ k ∈ R+ , λ ∈ R+ µ ∈ R, σ2 ∈ R+ RD Multivariate hyperpriors supported all over with mean µ and covariance Σ 1 GaussMulti multivariate normal distribution θ ∈ RD |2πΣ|− 2 exp − 12 (θ − µ)> Σ−1 (θ − µ) √ −1 √ D −1 LaplaceMulti multivariate Laplace distribution θ ∈ R | 2Σ| 2 exp − 2 L (θ − µ) 1 , L> L = Σ ν+D > Σ−1 (θ−µ) − 2 1 Γ ( ν+D ) TMulti multivariate Student’s t distribution θ ∈ RD |(ν − 2)πΣ|− 2 Γ ( ν2 ) 1 + (θ−µ)(ν−2) 2 Improper hyperpriors used to fix the value of a particular hyperparameter Delta clamped hyperparameter θ = θ0 ∈ R δ(θ − θ0 ) Clamped DeltaMulti clamped hyperparameter θ = θ0 ∈ RD δ(θ − θ0 ) ClampedMulti Q SameMulti D same hyperparameter θ = θ0 1 ∈ RD δ i=1 (θ0 − θi ) EqualMulti Composite hyperpriors [π1 (θ), π2 (θ), ..] 7→ p(θ) Transform prior distribution on g(θ) instead of θ π(g(θ)) P Mix mixture distribution i wi πi (θ) µ ∈ RD , Σ ∈ RD×D µ ∈ RD , Σ ∈ RD×D µ ∈ RD , Σ ∈ RD×D , ν ∈ R ∅ ∅ ∅ {g} {w} The priorSmoothBox2 is a Gauss-uniform sandwich obtained by complementing a uniform distribution on [a, b] with two Gaussian halves at each side. The parameter η balances the probability mass between the constituents so that η/(η + 1) is used for the box and 1/(η + 1) for the Gaussian sides. Its brother priorSmoothBox1 is the product of two sigmoidal functions. The priorDelta or equivalently priorClamped can be used to exclude some hyperparameters from the optimisation. Their values are clamped to θ0 and the derivative vanishes. There are also multivariate counterparts priorDeltaMulti and priorClampedMulti. 7.3 Usage of Implemented Hyperpriors Some code examples taken from doc/usagePrior.m illustrate how to use univariate, multivariate and composite priors on hyperparameters. Syntactically, a hyperprior hp is defined by func := Dist // prior distributions in prior/ | Clamped | Delta // predefined for fixing the hyperparameter pr := ’func’ | @func // univariate hyperprior 49 | hp ’funcMulti’ | @funcMulti // multivariate hyperprior := {pr} | {pr, {param, hp}} | {pr, {hp, .., hp}} // composite i.e., it is either a string containing the name of a hyperprior function, a pointer to a hyperprior function or one of the former in combination with a cell array of hyperprior functions and an additional list of parameters. Furthermore, we have multivariate hyperprior variants and 2 (equivalent) predefined hyperpriors allowing to exclude variables from optimisation. 49 hdoc/usagePrior.m 49i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 % demonstrate usage of prior distributions % % See also priorDistributions.m. 
% hgpml copyright 5ai clear all, close all % 1) specify some priors % a) univariate priors mu = 1.0; s2 = 0.01^2; nu = 3; pg = {@priorGauss ,mu,s2}; % Gaussian prior pl = {’priorLaplace’,mu,s2}; % Laplace prior pt = {@priorT ,mu,s2,nu}; % Student ’s t prior p1 = {@priorSmoothBox1 ,0,3,15}; % smooth box constraints lin decay p2 = {@priorSmoothBox2 ,0,2,15}; % smooth box constraints qua decay pd = {’priorDelta’}; % fix value of prior exclude from optimisation pc = {@priorClamped}; % equivalent to above lam = 1.05; k = 2.5; pw = {@priorWeibull ,lam,k}; % Weibull prior % b) meta priors pmx = {@priorMix ,[0.5,0.5],{pg,pl}}; g = @exp; dg = @exp; ig = @log; ptr = {@priorTransform ,g,dg,ig,pg}; % mixture of two priors % Gaussian in the exp domain % c) multivariate priors m = [1;2]; V = [2,1;1,2]; pG = {@priorGaussMulti ,m,V}; % 2d Gaussian prior pD = {’priorDeltaMulti’}; % fix value of prior exclude from optim pC = {@priorClampedMulti}; % equivalent to above % 2) evaluation % pri = pt; hp % pri = pmx; hp % pri = ptr; hp pri = pG; hp = = randn(1,3); = randn(1,3); = randn(1,3); randn(2,3); % a) draw a sample from the prior feval(pri{:}) % b) evaluate prior and derivative if requires [lp,dlp] = feval(pri{:},hp) % 3) comprehensive example x = (0:0.1:10)’; y = 2*x+randn(size(x)); % generate training data mean = {@meanSum ,{@meanConst ,@meanLinear}}; % specify mean function cov = {@covSEiso}; lik = {@likGauss}; % specify covariance and lik hyp.cov = [log(1);log(1.2)]; hyp.lik = log(0.9); hyp.mean = [2;3]; 50 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 par = {mean,cov,lik,x,y}; mfun = @minimize; % input for GP function % a) plain marginal likelihood optimisation (maximum likelihood) im = @infExact; % inference method hyp_plain = feval(mfun, hyp, @gp, -10, im, par{:}); % optimise % b) regularised optimisation (maximum a posteriori) with 1d priors prior.mean = {pg;pc}; % Gaussian prior for first , clamp second par prior.cov = {p1;[]}; % box prior for first , nothing for second par im = {@infPrior ,@infExact ,prior}; % inference method hyp_p1 = feval(mfun, hyp, @gp, -10, im, par{:}); % optimise % c) regularised optimisation (maximum a posteriori) with Nd priors prior = []; % clear the structure % multivariate Student ’s t prior on the first and second mean hyper prior.multi{1} = {@priorTMulti ,[mu;mu],diag([s2,s2]),nu,... struct(’mean’,[1,2])}; % use hyper struct % Equivalent shortcut (same mu and s2 for all dimensions) prior.multi{1} = {@priorTMulti ,mu,s2,nu,struct(’mean’,[1,2])}; % multivariate Gaussian prior jointly on 1st and 3rd hyper prior.multi{2} = {@priorGaussMulti ,[mu;mu],diag([s2,s2]),... [1,3]}; % use unwrapped hyper vector % Equivalent shortcut (same mu and s2 for all dimensions) prior.multi{2} = {@priorGaussMulti ,mu,s2,[1,3]}; im = {@infPrior ,@infExact ,prior}; % inference method hyp_pN = feval(mfun, hyp, @gp, -10, im, par{:}); % optimise [any2vec(hyp), any2vec(hyp_plain), any2vec(hyp_p1), any2vec(hyp_pN)] 51