The GPML Toolbox version 4.2
Carl Edward Rasmussen & Hannes Nickisch
June 15, 2018
Abstract
The GPML toolbox is an Octave 3.2.x and Matlab 7.x implementation of inference and prediction in Gaussian process (GP) models. It implements algorithms discussed in Rasmussen & Williams: Gaussian Processes for Machine Learning, The MIT Press, 2006 and Nickisch & Rasmussen: Approximations for Binary Gaussian Process Classification, JMLR, 2008.
The strength of the function lies in its flexibility, simplicity and extensibility. The function is flexible in that, firstly, it allows the properties of the GP to be specified through the definition of mean and covariance functions. Secondly, it allows specification of different inference procedures, e.g. exact inference and Expectation Propagation (EP). Thirdly, it allows specification of likelihood functions, e.g. Gaussian or Laplace (for regression) and cumulative Logistic (for classification). Simplicity is achieved through a single function and compact code. Extensibility is ensured by a modular design allowing easy addition of extensions to the already fairly extensive libraries of inference methods, mean functions, covariance functions and likelihood functions.
This document is a technical manual for developers and contains many details. If you are not yet familiar with the GPML toolbox, the user documentation and the examples therein are a better way to get started.
Contents
1 Gaussian Process Training and Prediction  3
2 The gp Function  4
3 Inference Methods  9
  3.1 Exact Inference with Gaussian likelihood  10
  3.2 Laplace's Approximation  11
  3.3 Expectation Propagation  11
  3.4 Kullback Leibler Divergence Minimisation  12
  3.5 Variational Bayes  13
  3.6 Compatibility Between Inference Methods and Covariance Approximations  13
  3.7 Sparse Covariance Approximations  14
  3.8 Grid-Based Covariance Approximations  14
  3.9 State Space Representation of GPs  15
4 Likelihood Functions  18
  4.1 Prediction  18
  4.2 Interface  19
  4.3 Implemented Likelihood Functions  21
  4.4 Usage of Implemented Likelihood Functions  22
  4.5 Compatibility Between Likelihoods and Inference Methods  23
  4.6 Gaussian Likelihood  23
    4.6.1 Exact Inference  25
    4.6.2 Laplace's Approximation  25
    4.6.3 Expectation Propagation  25
    4.6.4 Variational Bayes  25
  4.7 Warped Gaussian Likelihood  26
  4.8 Gumbel Likelihood  27
  4.9 Laplace Likelihood  27
  4.10 Student's t Likelihood  30
  4.11 Cumulative Logistic Likelihood  31
  4.12 GLM Likelihoods: Poisson, Negative Binomial, Weibull, Gamma, Exponential, Inverse Gaussian and Beta  32
    4.12.1 Inverse Link Functions  32
    4.12.2 Poisson Likelihood  34
    4.12.3 Weibull Likelihood  34
    4.12.4 Gamma Likelihood  34
    4.12.5 Exponential Likelihood  35
    4.12.6 Inverse Gaussian Likelihood  35
    4.12.7 Beta Likelihood  35
5 Mean Functions  36
  5.1 Interface  36
  5.2 Implemented Mean Functions  37
  5.3 Usage of Implemented Mean Functions  37
6 Covariance Functions  39
  6.1 Interface  39
  6.2 Implemented Covariance Functions  43
  6.3 Usage of Implemented Covariance Functions  44
1 Gaussian Process Training and Prediction
The gpml toolbox contains a single user function gp described in section 2. In addition there are a
number of supporting structures and functions which the user needs to know about, as well as an
internal convention for representing the posterior distribution, which may not be of direct interest to
the casual user.
Inference Methods: An inference method is a function which computes the (approximate) posterior, the (approximate) negative log marginal likelihood and its partial derivatives w.r.t. the hyperparameters, given a model specification (i.e., GP mean and covariance functions and a likelihood function) and a data set. Inference methods are discussed in section 3. A new inference method requires a function providing the desired inference functionality and possibly extra functionality in the applicable likelihood functions.
Hyperparameters: The hyperparameters are collected in a struct controlling the properties of the model, i.e. the GP mean and covariance function and the likelihood function. The struct has the three fields mean, cov and lik, each of which is a vector. The number of elements in each field must agree with the number of hyperparameters in the specification of the three functions they control (see below). If a field is either empty or non-existent it represents zero hyperparameters. When working with FITC approximate inference, the inducing inputs xu can also be treated as hyperparameters for some common stationary covariances. An example hyperparameter struct is sketched below.
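As a concrete illustration, a minimal sketch of such a struct (the particular functions and values are illustrative assumptions, not defaults of the toolbox):

  % hyperparameters matching {@meanZero}, {@covSEiso} and {@likGauss}
  hyp = struct();
  hyp.mean = [];                      % meanZero has no hyperparameters
  hyp.cov  = [log(1.0); log(1.0)];    % covSEiso: log length-scale, log signal std dev
  hyp.lik  = log(0.1);                % likGauss: log noise standard deviation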
Hyperparameter Prior Distributions: When optimising the marginal likelihood w.r.t. the hyperparameters, it is sometimes useful to softly constrain the hyperparameters by means of prior knowledge. A prior is a probability distribution over an individual hyperparameter or over a group of hyperparameters; see section 7.
Likelihood Functions: The likelihood function specifies the form of the likelihood of the GP model
and computes terms needed for prediction and inference. For inference, the required properties
of the likelihood depend on the inference method, including properties necessary for hyperpa-
rameter learning, section 4.
Mean Functions: The mean function is a cell array specifying the GP mean. It computes the mean and its derivatives w.r.t. the part of the hyperparameters pertaining to the mean. The cell array allows flexible specification and composition of mean functions, discussed in section 5. The default is the zero function.
Covariance Functions: The covariance function is a cell array specifying the GP covariance function. It computes the covariance and its derivatives w.r.t. the part of the hyperparameters pertaining to the covariance function. The cell array allows flexible specification and composition of covariance functions, discussed in section 6; a small example of such a composite specification is sketched below.
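For example, composite functions are built by nesting cell arrays. A small sketch (the particular components chosen here are only illustrative):

  % sum of a linear and a constant mean; Matern covariance plus independent noise
  mean = {@meanSum, {@meanLinear, @meanConst}};      % hyp.mean = [a_1;..;a_D; c]
  cov  = {@covSum, {{@covMaterniso,3}, @covNoise}};  % hyp.cov = [log(ell); log(sf); log(s)]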
Inference methods, see section 3, compute (among other things) an approximation to the posterior distribution of the latent variables $f_i$ associated with the training cases, $i = 1, \dots, n$. This approximate posterior is assumed to be Gaussian, and is communicated via a struct post with the fields post.alpha, post.sW and post.L. Often, starting from the Gaussian prior $p(\mathbf{f}) = \mathcal{N}(\mathbf{f}|\mathbf{m}, K)$, the approximate posterior admits the form
$$ q(\mathbf{f}|\mathcal{D}) = \mathcal{N}\big(\mathbf{f}\,\big|\,\boldsymbol{\mu} = \mathbf{m} + K\boldsymbol{\alpha},\; V = (K^{-1} + W)^{-1}\big), \quad \text{where } W \text{ is diagonal with } W_{ii} = s_i^2. \qquad (1) $$
In such cases, the entire posterior can be computed from the two vectors post.alpha and post.sW; the inference method may optionally also return $L = \operatorname{chol}(\operatorname{diag}(\mathbf{s})\,K\,\operatorname{diag}(\mathbf{s}) + I)$.
If, on the other hand, the posterior does not admit the above form, then post.L returns the matrix $L = -(K + W^{-1})^{-1}$ (and post.sW is unused). In addition, a sparse representation of the posterior may be used, in which case the non-zero elements of the post.alpha vector indicate the active entries.
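For orientation, a minimal sketch of how the latent predictive moments follow from this representation in the Cholesky case (Ks, kss and ms denote hypothetical test cross-covariances, test self-variances and test mean values; the actual code is part of the gp function below):

  % latent test moments from post.alpha, post.sW and triangular post.L (sketch)
  fmu = ms + Ks'*post.alpha;                         % predictive latent means
  V   = post.L'\(bsxfun(@times, post.sW, Ks));
  fs2 = kss - sum(V.*V,1)';                          % predictive latent variances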
2 The gp Function
The gp function is typically the only function the user would directly call.
⟨gp.m 4a⟩≡
function [varargout] = gp(hyp, inf, mean, cov, lik, x, y, xs, ys)
  ⟨gp function help 4b⟩
  ⟨initializations 5b⟩
  ⟨inference 6c⟩
  if nargin==7                               % if no test cases are provided
    varargout = {nlZ, dnlZ, post};  % report -log marg lik, derivatives and post
  else
    ⟨compute test predictions 7⟩
  end
It offers facilities for training the hyperparameters of a GP model as well as for making predictions at unseen inputs, as detailed in the following help.
⟨gp function help 4b⟩≡ (4a)
1% Gaussian Process inference and prediction. The gp function provides a
2% flexible framework for Bayesian inference and prediction with Gaussian
3% processes for scalar targets , i.e. both regression and binary
4% classification. The prior is Gaussian process , defined through specification
5% of its mean and covariance function. The likelihood function is also
6% specified. Both the prior and the likelihood may have hyperparameters
7% associated with them.
8%
9% Two modes are possible: training or prediction: if no test cases are
10 % supplied , then the negative log marginal likelihood and its partial
11 % derivatives w.r.t. the hyperparameters is computed; this mode is used to fit
12 % the hyperparameters. If test cases are given, then the test set predictive
13 % probabilities are returned. Usage:
14 %
15 % training: [nlZ dnlZ ] = gp(hyp, inf, mean, cov, lik, x, y);
16 % prediction: [ymu ys2 fmu fs2 ] = gp(hyp, inf, mean , cov, lik, x, y, xs);
17 % or: [ymu ys2 fmu fs2 lp] = gp(hyp, inf, mean , cov, lik, x, y, xs, ys);
18 %
19 % where:
20 %
21 % hyp struct of column vectors of mean/cov/lik hyperparameters
22 % inf function specifying the inference method
23 % mean prior mean function
24 % cov prior covariance function
25 % lik likelihood function
26 % x n by D matrix of training inputs
27 % y column vector of length n of training targets
28 % xs ns by D matrix of test inputs
29 % ys column vector of length nn of test targets
30 %
31 % nlZ returned value of the negative log marginal likelihood
32 % dnlZ struct of column vectors of partial derivatives of the negative
33 % log marginal likelihood w.r.t. mean/cov/lik hyperparameters
34 % ymu column vector (of length ns) of predictive output means
35 % ys2 column vector (of length ns) of predictive output variances
36 % fmu column vector (of length ns) of predictive latent means
37 % fs2 column vector (of length ns) of predictive latent variances
38 % lp column vector (of length ns) of log predictive probabilities
39 %
40 % post struct representation of the (approximate) posterior
41 % 3rd output in training mode or 6th output in prediction mode
42 % can be reused in prediction mode gp(.., cov, lik, x, post, xs,..)
43 %
44 % See also infMethods.m, meanFunctions.m, covFunctions.m, likFunctions.m.
45 %
46 ⟨gpml copyright 5a⟩
⟨gpml copyright 5a⟩≡ (4b 9 10 19 22 24a 36 38 39 44 46 49)
% Copyright (c) by Carl Edward Rasmussen and Hannes Nickisch, 2018-06-15.
% File automatically generated using noweb.
Depending on the number of input parameters, gp knows whether it is operated in training or in prediction mode. The high-level structure of the code is as follows: after some initialisations, we perform inference and decide whether test set predictions are needed or only the result of the inference is required.
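A typical call sequence might look as follows (a minimal sketch; the toy data and the use of the bundled minimize function are illustrative assumptions):

  % train hyperparameters by minimising the negative log marginal likelihood,
  % then predict at test inputs xs
  x = linspace(-3,3,40)'; y = sin(x) + 0.1*randn(40,1); xs = linspace(-4,4,101)';
  hyp0 = struct('mean',[], 'cov',[0;0], 'lik',log(0.1));
  hyp = minimize(hyp0, @gp, -100, @infGaussLik, [], @covSEiso, @likGauss, x, y);
  [ymu, ys2, fmu, fs2] = gp(hyp, @infGaussLik, [], @covSEiso, @likGauss, x, y, xs);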
⟨initializations 5b⟩≡ (4a)
  ⟨minimalist usage 5c⟩
  ⟨multivariate output by recursion 5d⟩
  ⟨process input arguments 6a⟩
  ⟨check hyperparameters 6b⟩
If the number of input arguments is incorrect, we echo a minimalist usage and return.
⟨minimalist usage 5c⟩≡ (5b)
if nargin<7 || nargin>9
  disp('Usage: [nlZ dnlZ          ] = gp(hyp, inf, mean, cov, lik, x, y);')
  disp('   or: [ymu ys2 fmu fs2   ] = gp(hyp, inf, mean, cov, lik, x, y, xs);')
  disp('   or: [ymu ys2 fmu fs2 lp] = gp(hyp, inf, mean, cov, lik, x, y, xs, ys);')
  return
end
If there is more than a single output dimension in y, we call multiple instances of gp with shared
hyperparameters and settings each computing the corresponding result with scalar output.
⟨multivariate output by recursion 5d⟩≡ (5b)
1if size(y,2)>1 % deal with (independent) multivariate output y by recursing
2d = size(y,2); varargout = cell(nargout,1); out = cell(nargout,1); % allocate
3for i=1:d
4in = {hyp, inf, mean, cov, lik, x, y(:,i)};
5if nargin>7, in = {in{:}, xs}; end
6if nargin>8, in = {in{:}, ys(:,i)}; end
7if i==1, [varargout{:}] = gp(in{:}); % perform inference for dimension ..
8else [out{:}] = gp(in{:}); % .. number i in the output
9if nargin==7, no = 2;
10 varargout{1} = varargout{1} + out{1}; % sum nlZ
11 if nargout>1 % sum dnlZ
12 varargout{2} = vec2any(hyp,any2vec(varargout{2})+any2vec(out{2}));
13 end
14 else no = 5; % concatenate ymu ys2 fmu fs2 lp
15 for j=1:min(nargout,no), varargout{j} = [varargout{j},out{j}]; end
16 end
17 if nargout>no % concatenate post
18 if i==2, varargout{no+1} = {varargout{no+1},out{no+1}};
19 else varargout{no+1} = {varargout{no+1}{:},out{no+1}}; end
20 end
21 end
22 end,return % return to end the recursion
23 end
Next, we set some useful default values for empty arguments, and convert inf and lik to function
handles and mean and cov to cell arrays if necessary. Initialize variables.
⟨process input arguments 6a⟩≡ (5b)
1if isempty(mean), mean = {@meanZero}; end % set default mean
2if ischar(mean) || isa(mean, ’function_handle’), mean = {mean}; end % make cell
3if isempty(cov), error(’Covariance function cannot be empty’); end % no default
4if ischar(cov) || isa(cov,’function_handle’), cov = {cov}; end % make cell
5cstr = cov{1}; if isa(cstr ,’function_handle’), cstr = func2str(cstr); end
6if (strcmp(cstr,’covFITC’) || strcmp(cstr,’apxSparse’)) && isfield(hyp,’xu’)
7cov{3} = hyp.xu; %use hyp.xu
8end
9if isempty(inf), inf = {@infGaussLik}; end % set default inference method
10 if ischar(inf), inf = str2func(inf); end % convert into function handle
11 if ischar(inf) || isa(inf,’function_handle’), inf = {inf}; end % make cell
12 istr = inf{1}; if isa(istr ,’function_handle’), istr = func2str(istr); end
13 if strcmp(istr,’infPrior’)
14 istr = inf{2}; if isa(istr ,’function_handle’), istr = func2str(istr); end
15 end
16 if isempty(lik), lik = {@likGauss}; end % set default lik
17 if ischar(lik) || isa(lik,’function_handle’), lik = {lik}; end % make cell
18 lstr = lik{1}; if isa(lstr ,’function_handle’), lstr = func2str(lstr); end
19
20 D = size(x,2);
21 if strncmp(cstr,’covGrid’,7) || strcmp(cstr,’apxGrid’)% only some inf* possible
22 D = 0; xg = cov{3}; p = numel(xg); for i=1:p, D = D+size(xg{i},2); end % dims
23 end
Check that the sizes of the hyperparameters supplied in hyp match the sizes expected. The three parts hyp.mean, hyp.cov and hyp.lik are checked separately, and empty entries are defined if they do not exist.
⟨check hyperparameters 6b⟩≡ (5b)
1if ~isfield(hyp,’mean’), hyp.mean = []; end % check the hyp specification
2if eval(feval(mean{:})) ~= numel(hyp.mean)
3error(’Number of mean function hyperparameters disagree with mean function’)
4end
5if ~isfield(hyp,’cov’), hyp.cov = []; end
6if eval(feval(cov{:})) ~= numel(hyp.cov)
7error(’Number of cov function hyperparameters disagree with cov function’)
8end
9if ~isfield(hyp,’lik’), hyp.lik = []; end
10 if eval(feval(lik{:})) ~= numel(hyp.lik)
11 error(’Number of lik function hyperparameters disagree with lik function’)
12 end
Inference is performed by calling the desired inference method inf. In training mode, we accept a
failure of the inference method (and issue a warning), since during hyperparameter learning, hyper-
parameters causing a numerical failure may be attempted, but the minimize function may gracefully
recover from this. During prediction, failure of the inference method is an error.
⟨inference 6c⟩≡ (4a)
1try % call the inference method
2% issue a warning if a classification likelihood is used in conjunction with
3% labels different from +1 and -1
4if strcmp(lstr,’likErf’) || strcmp(lstr,’likLogistic’)
5if ~isstruct(y)
6uy = unique(y);
7if any( uy~=+1 & uy~=-1 )
8warning(’You try classification with labels different from {+1,-1}’)
9end
10 end
11 end
12 if nargin>7 % compute marginal likelihood and its derivatives only if needed
13 if isstruct(y)
14 post = y; % reuse a previously computed posterior approximation
15 else
16 post = feval(inf{:}, hyp, mean, cov, lik, x, y);
17 end
18 else
19 if nargout<=1
20 [post nlZ] = feval(inf{:}, hyp, mean, cov, lik, x, y); dnlZ = {};
21 else
22 [post nlZ dnlZ] = feval(inf{:}, hyp, mean, cov, lik, x, y);
23 end
24 end
25 catch
26 msgstr = lasterr;
27 if nargin>7, error(’Inference method failed [%s]’, msgstr); else
28 warning(’Inference method failed [%s] .. attempting to continue’,msgstr)
29 varargout = {NaN, vec2any(hyp,zeros(numel(any2vec(hyp)),1))}; return % go on
30 end
31 end
We copy the already computed negative log marginal likelihood to the first output argument, and if
desired report its partial derivatives w.r.t. the hyperparameters if running in inference mode.
Predictions are computed in a loop over small batches to avoid memory problems for very large test
sets.
⟨compute test predictions 7⟩≡ (4a)
1alpha = post.alpha; L = post.L; sW = post.sW;
2if issparse(alpha) % handle things for sparse representations
3nz = alpha ~= 0; % determine nonzero indices
4if issparse(L), L = full(L(nz,nz)); end % convert L and sW if necessary
5if issparse(sW), sW = full(sW(nz)); end
6else nz = true(size(alpha ,1) ,1); end % non-sparse representation
7if isempty(L) % in case L is not provided, we compute it
8K = feval(cov{:}, hyp.cov, x(nz,:));
9L = chol(eye(sum(nz))+sW*sW’.*K);
10 end
11 %verify whether L contains valid Cholesky decomposition or something different
12 Lchol = isnumeric(L) && all(all(tril(L,-1)==0)&diag(L)’>0&isreal(diag(L))’);
13 ns = size(xs,1); % number of data points
14 if strncmp(cstr,’apxGrid’,7), xs = apxGrid(’idx2dat’,cov{3},xs); end % expand
15 nperbatch = 1000; % number of data points per mini batch
16 nact = 0; % number of already processed test data points
17 ymu = zeros(ns,1); ys2 = ymu; fmu = ymu; fs2 = ymu; lp = ymu; % allocate mem
18 while nact<ns % process minibatches of test cases to save memory
19 id = (nact+1):min(nact+nperbatch ,ns); % data points to process
20 ⟨make predictions 8⟩
21 nact = id(end); % set counter to index of last processed data point
22 end
23 if nargin<9
24 varargout = {ymu, ys2, fmu, fs2, [], post}; % assign output arguments
25 else
26 varargout = {ymu, ys2, fmu, fs2, lp, post};
27 end
In every iteration of the above loop, we compute the predictions for all test points of the batch.
⟨make predictions 8⟩≡ (7)
1kss = feval(cov{:}, hyp.cov, xs(id,:), ’diag’); % self-variance
2if strcmp(cstr,’covFITC’) || strcmp(cstr,’apxSparse’)% cross-covariances
3Ks = feval(cov{:}, hyp.cov, x, xs(id,:)); Ks = Ks(nz,:); % res indep. of x
4else
5Ks = feval(cov{:}, hyp.cov, x(nz,:), xs(id,:)); % avoid computation
6end
7ms = feval(mean{:}, hyp.mean, xs(id,:));
8N = size(alpha ,2); % number of alphas (usually 1; more in case of sampling)
9Fmu = repmat(ms,1,N) + Ks’*full(alpha(nz,:)); % conditional mean fs|f
10 fmu(id) = sum(Fmu ,2)/N; % predictive means
11 if Lchol % L contains chol decomp => use Cholesky parameters (alpha,sW,L)
12 V = L’\(repmat(sW,1,length(id)).*Ks);
13 fs2(id) = kss - sum(V.*V,1)’; % predictive variances
14 else % L is not triangular => use alternative parametrisation
15 if isnumeric(L), LKs = L*Ks; else LKs = L(Ks); end % matrix or callback
16 fs2(id) = kss + sum(Ks.*LKs ,1)’; % predictive variances
17 end
18 fs2(id) = max(fs2(id),0); % remove numerical noise i.e. negative variances
19 Fs2 = repmat(fs2(id),1,N); % we have multiple values in case of sampling
20 if nargin<9
21 [Lp, Ymu, Ys2] = feval(lik{:},hyp.lik ,[], Fmu(:),Fs2(:));
22 else
23 Ys = repmat(ys(id),1,N);
24 [Lp, Ymu, Ys2] = feval(lik{:},hyp.lik,Ys(:),Fmu(:),Fs2(:));
25 end
26 lp(id) = sum(reshape(Lp, [],N),2)/N; % log probability; sample averaging
27 ymu(id) = sum(reshape(Ymu,[],N),2)/N; % predictive mean ys|y and ..
28 ys2(id) = sum(reshape(Ys2,[],N),2)/N; % .. variance
3 Inference Methods
Inference methods are responsible for computing the (approximate) posterior post, the (approximate) negative log marginal likelihood nlZ and its partial derivatives dnlZ w.r.t. the hyperparameters hyp. The arguments to the function are the hyperparameters hyp, mean function mean, covariance function cov, likelihood function lik and training data x and y. Several inference methods are implemented and described in this section.
⟨infMethods.m 9⟩≡
1% Inference methods: Compute the (approximate) posterior for a Gaussian process.
2% Methods currently implemented include:
3%
4% infGaussLik Exact inference (only possible with Gaussian likelihood)
5% infLaplace Laplace’s Approximation
6% infEP Expectation Propagation
7% infVB Variational Bayes Approximation
8% infKL Kullback -Leibler optimal Approximation
9%
10 % infMCMC Markov Chain Monte Carlo and Annealed Importance Sampling
11 % We offer two samplers.
12 % - hmc: Hybrid Monte Carlo
13 % - ess: Elliptical Slice Sampling
14 % No derivatives w.r.t. to hyperparameters are provided.
15 %
16 % infLOO Leave -One-Out predictive probability and Least -Squares Approxim.
17 % infPrior Perform inference with hyperparameter prior.
18 %
19 % The interface to the approximation methods is the following:
20 %
21 % function [post nlZ dnlZ] = inf..(hyp, mean, cov, lik, x, y)
22 %
23 % where:
24 %
25 % hyp is a struct of hyperparameters
26 % mean is the name of the mean function (see meanFunctions.m)
27 % cov is the name of the covariance function (see covFunctions.m)
28 % lik is the name of the likelihood function (see likFunctions.m)
29 % x is a n by D matrix of training inputs
30 % y is a (column) vector (of size n) of targets
31 %
32 % nlZ is the returned value of the negative log marginal likelihood
33 % dnlZ is a (column) vector of partial derivatives of the negative
34 % log marginal likelihood w.r.t. each hyperparameter
35 % post struct representation of the (approximate) posterior containing
36 % alpha is a (sparse or full column vector) containing inv(K)*(mu-m),
37 % where K is the prior covariance matrix, m the prior mean,
38 % and mu the approx posterior mean
39 % sW is a (sparse or full column) vector containing diagonal of sqrt(W)
40 % the approximate posterior covariance matrix is inv(inv(K)+W)
41 % L is a (sparse or full) triangular matrix , L = chol(sW*K*sW+eye(n)),
42 % or a full matrix , L = -inv(K+inv(W)),
43 % or a function L(A) of a matrix A such that -(K+inv(W))*L(A) = A
44 %
45 % Usually, the approximate posterior to be returned admits the form
46 % N(mu=m+K*alpha, V=inv(inv(K)+W)), where alpha is a vector and W is diagonal.
47 %
48 % For more information on the individual approximation methods and their
49 % implementations , see the separate inf??.m files. See also gp.m.
50 %
51 ⟨gpml copyright 5a⟩
Not all inference methods are compatible with all likelihood functions; e.g. exact inference is only possible with a Gaussian likelihood. In order to perform inference, each method needs various properties of the likelihood functions, see section 4.
3.1 Exact Inference with Gaussian likelihood
For Gaussian likelihoods, GP inference reduces to computing the mean and covariance of a multivariate Gaussian, which can be done exactly by simple matrix algebra. The program inf/infGaussLik.m does exactly this. If it is called with a likelihood function other than the Gaussian, it issues an error. The Gaussian posterior $q(\mathbf{f}|\mathcal{D}) = \mathcal{N}(\mathbf{f}|\boldsymbol{\mu}, V)$ is exact.
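For reference, the quantities computed here are the standard expressions (Rasmussen & Williams, ch. 2); with likelihood $\mathcal{N}(y_i|f_i,\sigma_n^2)$,
$$ \boldsymbol{\alpha} = (K+\sigma_n^2 I)^{-1}(\mathbf{y}-\mathbf{m}), \qquad -\log Z = \tfrac{1}{2}(\mathbf{y}-\mathbf{m})^\top\boldsymbol{\alpha} + \tfrac{1}{2}\log\big|K+\sigma_n^2 I\big| + \tfrac{n}{2}\log 2\pi, $$
and $W = \sigma_n^{-2} I$, so that post.sW is the constant vector $\sigma_n^{-1}\mathbf{1}$.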
⟨inf/infGaussLik.m 10⟩≡
1function [post nlZ dnlZ] = infGaussLik(hyp, mean , cov, lik, x, y, opt)
2
3% Exact inference for a GP with Gaussian likelihood.
4%
5% Compute a parametrization of the posterior , the negative log marginal
6% likelihood and its derivatives w.r.t. the hyperparameters. The function takes
7% a specified covariance function (see covFunctions.m) and likelihood function
8% (see likFunctions.m), and is designed to be used with gp.m.
9%
10 ⟨gpml copyright 5a⟩
11 %
12 % See also INFMETHODS.M, APX.M.
13
14 if nargin<7, opt = []; end % make sure parameter exists
15 if iscell(lik), likstr = lik{1}; else likstr = lik; end
16 if ~ischar(likstr), likstr = func2str(likstr); end
17 if ~strcmp(likstr ,’likGauss’)% NOTE: no explicit call to likGauss
18 error(’Exact inference only possible with Gaussian likelihood’);
19 end
20
21 [n, D] = size(x);
22 [m,dm] = feval(mean{:}, hyp.mean , x); % evaluate mean vector and deriv
23 sn2 = exp(2*hyp.lik); W = ones(n,1)/sn2; % noise variance of likGauss
24 K = apx(hyp,cov,x,opt); % set up covariance approximation
25 [ldB2,solveKiW,dW,dhyp ,post.L] = K.fun(W); % obtain functionality depending on W
26
27 alpha = solveKiW(y-m);
28 post.alpha = K.P(alpha); % return the posterior parameters
29 post.sW = sqrt(W); % sqrt of noise precision vector
30 if nargout>1 % do we want the marginal likelihood?
31 nlZ = (y-m)’*alpha/2 + ldB2 + n*log(2*pi*sn2)/2; % -log marginal likelihood
32 if nargout>2 % do we want derivatives?
33 dnlZ = dhyp(alpha); dnlZ.mean = -dm(alpha);
34 dnlZ.lik = -sn2*(alpha’*alpha) - 2*sum(dW)/sn2 + n;
35 end
36 end
3.2 Laplace's Approximation
For differentiable likelihoods, Laplace's approximation (inf/infLaplace.m) approximates the posterior by a Gaussian centered at its mode and matching its curvature.
More concretely, defining $\ell_i(f_i) = \ln p(y_i|f_i)$ and $\ell(\mathbf{f}) = \sum_{i=1}^{n}\ell_i(f_i)$, the mean of the posterior $q(\mathbf{f}|\mathcal{D}) = \mathcal{N}(\mathbf{f}|\boldsymbol{\mu}, V)$ is given by
$$ \boldsymbol{\mu} = \arg\min_{\mathbf{f}} \phi(\mathbf{f}), \quad \text{where}\quad \phi(\mathbf{f}) = \frac{1}{2}(\mathbf{f}-\mathbf{m})^\top K^{-1}(\mathbf{f}-\mathbf{m}) - \ell(\mathbf{f}) \stackrel{c}{=} -\ln[p(\mathbf{f})\,p(\mathbf{y}|\mathbf{f})], \qquad (2) $$
which we abbreviate by $\boldsymbol{\mu} \leftarrow \mathfrak{L}(\ell)$. The curvature $\frac{\partial^2\phi}{\partial\mathbf{f}\,\partial\mathbf{f}^\top} = K^{-1} + W$ with $W_{ii} = -\frac{\partial^2}{\partial f_i^2}\ln p(y_i|f_i)$ serves as precision of the Gaussian posterior approximation $V = (K^{-1}+W)^{-1}$, and the marginal likelihood $Z = \int p(\mathbf{f})\,p(\mathbf{y}|\mathbf{f})\,\mathrm{d}\mathbf{f}$ is approximated by $Z \approx Z_{LA}$, obtained by replacing $\phi$ with its second order Taylor expansion at the mode $\boldsymbol{\mu}$, i.e. $\tilde{\phi}(\mathbf{f}) = \phi(\boldsymbol{\mu}) + \frac{1}{2}(\mathbf{f}-\boldsymbol{\mu})^\top V^{-1}(\mathbf{f}-\boldsymbol{\mu}) \approx \phi(\mathbf{f})$.
Laplace's approximation needs derivatives up to third order for the mode fitting procedure (Newton's method),
$$ d_k = \frac{\partial^k}{\partial f^k}\log p(y|f), \quad k = 0, 1, 2, 3, $$
and, for hyperparameter learning, the derivatives w.r.t. the $i$th likelihood hyperparameter $\rho_i$,
$$ d_k = \frac{\partial}{\partial\rho_i}\frac{\partial^k}{\partial f^k}\log p(y|f), \quad k = 0, 1, 2, $$
evaluated at the latent location $f$ and observed value $y$. The likelihood calls (see section 4)
• [d0, d1, d2, d3] = lik(hyp, y, f, [], 'infLaplace')
and
• [d0, d1, d2] = lik(hyp, y, f, [], 'infLaplace', i)
return exactly these values.
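To make the mode fitting procedure concrete, here is a sketch of a single Newton step in the spirit of Rasmussen & Williams, Algorithm 3.1 (not the actual inf/infLaplace.m implementation; K, m, f, y, lik and hyp are assumed to be in scope and the likelihood is assumed log-concave so that W is nonnegative):

  % one Newton step towards the posterior mode (sketch)
  [lp, dlp, d2lp] = feval(lik{:}, hyp.lik, y, f, [], 'infLaplace');
  W = -d2lp; sW = sqrt(W);                       % curvature and its square root
  L = chol(eye(numel(f)) + sW*sW'.*K);           % Cholesky of B = I + sW*K*sW
  b = W.*(f-m) + dlp;
  a = b - sW.*(L\(L'\(sW.*(K*b))));
  f = K*a + m;                                   % updated mode estimate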
3.3 Expectation Propagation
The basic idea of Expectation Propagation (EP), as implemented in inf/infEP.m, is to replace the non-Gaussian likelihood terms $p(y_i|f_i)$ by Gaussian functions $t(f_i;\nu_i,\tau_i) = \exp(\nu_i f_i - \frac{1}{2}\tau_i f_i^2)$ and to adjust the natural parameters $\nu_i,\tau_i$ such that the following identity holds:
$$ \frac{1}{Z_{t,i}}\int f^k\,q_{-i}(f)\,t(f;\nu_i,\tau_i)\,\mathrm{d}f = \frac{1}{Z_{p,i}}\int f^k\,q_{-i}(f)\,p(y_i|f)\,\mathrm{d}f, \qquad k = 1, 2, $$
with the so-called cavity distributions $q_{-i}(f) = \mathcal{N}(\mathbf{f}|\mathbf{m},K)\prod_{j\neq i} t(f_j;\nu_j,\tau_j) \propto \mathcal{N}(\mathbf{f}|\boldsymbol{\mu},V)/t(f_i;\nu_i,\tau_i)$, equal to the posterior divided by the $i$th Gaussian approximation function, and the two normalisers $Z_{t,i} = \int q_{-i}(f)\,t(f_i;\nu_i,\tau_i)\,\mathrm{d}f$ and $Z_{p,i} = \int q_{-i}(f)\,p(y_i|f_i)\,\mathrm{d}f$. The moment matching corresponds to minimising the following local KL-divergence:
$$ \nu_i,\tau_i = \arg\min_{\nu,\tau}\mathrm{KL}\big[\,q_{-i}(f)\,p(y_i|f_i)/Z_{p,i}\,\big\|\,q_{-i}(f)\,t(f_i;\nu,\tau)/Z_{t,i}\,\big]. $$
In order to apply the moment matching steps in a numerically safe way, EP requires the derivatives of the expectations w.r.t. the Gaussian mean parameter $\mu$,
$$ d_k = \frac{\partial^k}{\partial\mu^k}\log\int p(y|f)\,\mathcal{N}(f|\mu,\sigma^2)\,\mathrm{d}f, \qquad k = 0, 1, 2, $$
and w.r.t. the $i$th likelihood hyperparameter $\rho_i$,
$$ d = \frac{\partial}{\partial\rho_i}\log\int p(y|f)\,\mathcal{N}(f|\mu,\sigma^2)\,\mathrm{d}f, $$
which can be obtained by the likelihood calls (see section 4)
• [d0, d1, d2] = lik(hyp, y, mu, s2, 'infEP')
and
• d = lik(hyp, y, mu, s2, 'infEP', i).
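A step that the text leaves implicit (a standard Gaussian identity, stated here for clarity): the moments of the tilted distribution follow directly from these derivatives,
$$ \hat{\mu} = \mu + \sigma^2\,\frac{\partial\log Z}{\partial\mu}, \qquad \hat{\sigma}^2 = \sigma^2 + \sigma^4\,\frac{\partial^2\log Z}{\partial\mu^2}, \qquad Z = \int p(y|f)\,\mathcal{N}(f|\mu,\sigma^2)\,\mathrm{d}f, $$
and the site parameters $\nu_i, \tau_i$ are then chosen so that the new approximate marginal matches $\hat{\mu}$ and $\hat{\sigma}^2$.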
3.4 Kullback Leibler Divergence Minimisation
Another well known approach to approximate inference, implemented in inf/infKL.m, attempts to directly find the closest Gaussian $q(\mathbf{f}|\mathcal{D}) = \mathcal{N}(\mathbf{f}|\boldsymbol{\mu},V)$ to the exact posterior $p(\mathbf{f}|\mathcal{D})$ w.r.t. some proximity measure, or equivalently to maximise a lower bound $Z(\boldsymbol{\mu},V)$ to the marginal likelihood $Z$, as described in Nickisch & Rasmussen: Approximations for Binary Gaussian Process Classification, JMLR, 2008. In particular, one minimises $\mathrm{KL}(\mathcal{N}(\mathbf{f}|\boldsymbol{\mu},V)\,\|\,p(\mathbf{f}|\mathcal{D}))$, which amounts to minimising $-\ln Z(\boldsymbol{\mu},V)$ as defined by:
$$ -\ln Z = -\ln\int p(\mathbf{f})\,p(\mathbf{y}|\mathbf{f})\,\mathrm{d}\mathbf{f} = -\ln\int q(\mathbf{f}|\mathcal{D})\,\frac{p(\mathbf{f})}{q(\mathbf{f}|\mathcal{D})}\,p(\mathbf{y}|\mathbf{f})\,\mathrm{d}\mathbf{f} $$
$$ \stackrel{\text{Jensen}}{\leqslant} \int q(\mathbf{f}|\mathcal{D})\ln\frac{q(\mathbf{f}|\mathcal{D})}{p(\mathbf{f})}\,\mathrm{d}\mathbf{f} - \int q(\mathbf{f}|\mathcal{D})\ln p(\mathbf{y}|\mathbf{f})\,\mathrm{d}\mathbf{f} =: -\ln Z(\boldsymbol{\mu},V) $$
$$ = \mathrm{KL}\big(\mathcal{N}(\mathbf{f}|\boldsymbol{\mu},V)\,\big\|\,\mathcal{N}(\mathbf{f}|\mathbf{m},K)\big) - \sum_{i=1}^{n}\int\mathcal{N}(f_i|\mu_i,v_{ii})\ln p(y_i|f_i)\,\mathrm{d}f_i, \qquad v_{ii} = [V]_{ii} $$
$$ = \tfrac{1}{2}\mathrm{tr}(VK^{-1}-I) - \tfrac{1}{2}\ln|VK^{-1}| + \tfrac{1}{2}(\boldsymbol{\mu}-\mathbf{m})^\top K^{-1}(\boldsymbol{\mu}-\mathbf{m}) - \sum_{i=1}^{n}\ell^{\mathrm{KL}}(\mu_i,v_{ii}), $$
where $\ell^{\mathrm{KL}}(\mu_i,v_{ii}) = \int\mathcal{N}(f_i|\mu_i,v_{ii})\,\ell_i(f_i)\,\mathrm{d}f_i$ is the convolution of the log likelihood $\ell_i$ with the Gaussian $\mathcal{N}$ and $\mathbf{v} = \mathrm{dg}(V)$. Equivalently, one can view $\ell^{\mathrm{KL}}$ as a smoothed version of $\ell$ with univariate smoothing kernel $\mathcal{N}$.
From Challis & Barber: Concave Gaussian Variational Approximations for Inference in Large Scale Bayesian Linear Models, AISTATS, 2011, we know that the mapping $(\boldsymbol{\mu},L)\mapsto -\ln Z(\boldsymbol{\mu},L^\top L)$ is jointly convex whenever the likelihoods $f_i\mapsto p(y_i|f_i)$ are log concave. In particular, this implies that every $(\mu_i,s_i)\mapsto -\ell^{\mathrm{KL}}(\mu_i,s_i^2)$ is jointly convex.
We use an optimisation algorithm similar to EP (section 3.3) where we minimise the local KL-divergence the other way round: $\mu_i,s_i = \arg\min_{\mu,s}\mathrm{KL}[\mathcal{N}(f|\mu,s^2)\,\|\,q_{-i}(f)\,p(y_i|f_i)/Z_{p,i}]$. This view was brought forward by Tom Minka: Divergence Measures and Message Passing, MSR-TR, 2005. The KL minimisation constitutes a jointly convex 2d optimisation problem, solved by klmin using a scaled Newton approach, which is included as a sub-function in inf/infKL.m. The smoothed likelihood $\ell^{\mathrm{KL}}(\mu_i,v_{ii})$ is implemented as a meta likelihood in likKL; it uses Gaussian-Hermite quadrature to compute the required integrals. Note that, as opposed to EP, Gaussian-Hermite quadrature is appropriate here since we integrate against $\ln p(y_i|f_i)$ (which can be well approximated by a polynomial) instead of $p(y_i|f_i)$ itself. The algorithm is, again unlike EP, provably convergent for log-concave likelihoods (e.g. likGauss, likLaplace, likSech2, likLogistic, likPoisson) since it can be regarded as coordinate descent with a guaranteed decrease of the objective in every step. Due to the complex update computations, infKL can be quite slow although it has the same $O(n^3)$ asymptotic complexity as EP and Laplace.
3.5 Variational Bayes
One can drive the bounding even further by means of local quadratic lower bounds to the log likelihood $\ell(f) = \ln p(y|f)$. Suppose that we use a super-Gaussian likelihood $p(y|f)$, i.e. a likelihood that can be lower bounded by Gaussians of any width $w$ (e.g. likLaplace, likT, likLogistic, likSech2). Formally, that means that there are $b, z \in \mathbb{R}$ such that
$$ \rho(f) = \ln p(y|f-z) - bf $$
is symmetric and $\sqrt{f}\mapsto\rho(f)$ is a convex function for all $f > 0$. As a result, we obtain the following exact representation of the likelihood:
$$ \ell(f) = \ln p(y|f) = \max_{w>0}\;(b+wz)f - \frac{wf^2}{2} - \frac{1}{2}h(\gamma), $$
which can be derived by convex duality and the super-Gaussianity of the likelihood. Details can be found in the papers by Palmer et al.: Variational EM Algorithms for Non-Gaussian Latent Variable Models, NIPS, 2006 and Nickisch & Seeger: Convex Variational Bayesian Inference for Large Scale Generalized Linear Models, ICML, 2009.
The bottom line is that we can treat the variational bounding as a sequence of Laplace approximations with the "variational Bayes" log likelihood
$$ \ell^{\mathrm{VB}}(f_i) = \ell(g_i) + b_i(f_i-g_i), \qquad \mathbf{g} = \mathrm{sgn}(\mathbf{f}-\mathbf{z})\odot\sqrt{(\mathbf{f}-\mathbf{z})^2 + \mathbf{v}} + \mathbf{z}, $$
instead of the usual likelihood $\ell(f_i) = \ln p(y_i|f_i)$, i.e. we solve $\boldsymbol{\mu}\leftarrow\mathfrak{L}(\ell^{\mathrm{VB}}_{\mathbf{v}})$ instead of $\boldsymbol{\mu}\leftarrow\mathfrak{L}(\ell)$; see section 3.2. In the code of inf/infVB.m, the likelihood is implemented in the function likVB.
At the end, the optimal value of $W$ can be obtained analytically via $w_i = |b_i - \ell'(g_i)|/|g_i - z_i|$.
For the minimisation in inf/infVB.m, we use a provably convergent double loop algorithm, where in the inner loop a nonlinear least squares problem (convex for log-concave likelihoods) is solved using inf/infLaplace.m such that $\boldsymbol{\mu}\leftarrow\mathfrak{L}(\ell^{\mathrm{VB}}_{\mathbf{v}})$, and in the outer loop we compute $\mathbf{v}\leftarrow\mathrm{dg}((K^{-1}+W)^{-1})$.
The only requirement on the likelihood function is that it returns the values $z$ and $b$ needed by the bound, which are delivered by the call (see section 4)
• [b,z] = lik(hyp, y, [], ga, 'infVB')
The negative marginal likelihood upper bound $-\ln Z_{\mathrm{VB}}$ is obtained by integrating the prior times the exact representation of the likelihood,
$$ p(y|f) = \max_{\gamma>0} q(y|f,\gamma), \qquad q(y|f,\gamma) = \mathcal{N}(f|\nu,\gamma)\exp\Big(\frac{\nu^2}{2\gamma} - \frac{h(\gamma)}{2}\Big)\sqrt{2\pi\gamma}, \qquad \gamma = \frac{1}{w}, \quad \nu = b\gamma + z, $$
w.r.t. the latent variables $\mathbf{f}$, yielding
$$ -\ln Z_{\mathrm{VB}} = -\ln\int\mathcal{N}(\mathbf{f}|\mathbf{m},K)\prod_{i=1}^{n}q_i(y_i|f_i,\gamma_i)\,\mathrm{d}\mathbf{f} = -\ln\mathcal{N}(\mathbf{m}|\boldsymbol{\nu},K+\Gamma) + \frac{1}{2}\Big(\mathbf{1}^\top\mathbf{h}(\boldsymbol{\gamma}) - \mathbf{w}^\top\boldsymbol{\nu}^2 - \mathbf{1}^\top\ln(2\pi\boldsymbol{\gamma})\Big). $$
3.6 Compatibility Between Inference Methods and Covariance Approximations
Another kind of approximation is needed to render an inference method scalable. We have two approximation schemes which approximate the covariance itself to make it amenable to large numbers of training data points. The following table shows the compatibility between some inference methods and the two major groups of covariance approximations discussed in the next two sections.
Approximation \ Inference                       | Exact       | Laplace    | VB    | EP       | KL, MCMC, LOO    | Implementation
                                                | infGaussLik | infLaplace | infVB | infEP    | inf{KL,MCMC,LOO} |
variational free energy (VFE)                   | X           | X          | X     |          |                  | apxSparse, opt.s = 0.0
fully independent training conditionals (FITC)  | X           | X          | X     | X        |                  | apxSparse, opt.s = 1.0
hybrid between VFE and FITC, s0 in [0, 1]       | X           | X          | X     |          |                  | apxSparse, opt.s = s0
Kronecker/Toeplitz/BTTB grid (KISS-GP)          | X           | X          | X     |          |                  | apxGrid
state space representation                      | X           | X          | X     | (X, ADF) |                  | apxState
3.7 Sparse Covariance Approximations
One of the main problems with GP models is the high computational load of inference. In a setting with $n$ training points $\mathbf{x}$, exact inference with Gaussian likelihood requires $O(n^3)$ effort; approximations like Laplace or EP consist of a sequence of $O(n^3)$ operations.
There is a line of research with the goal of alleviating this burden by using an approximate covariance function $\tilde{k}$ instead of $k$. A review is given by Candela and Rasmussen¹. One basic idea in these approximations is to work with a set of $m$ inducing inputs $\mathbf{u}$ and a reduced computational load of $O(nm^2)$. In the following, we provide a rough idea of the FITC approximation used in the toolbox. Let $K$ denote the $n\times n$ covariance matrix between the training points $\mathbf{x}$, $K_{\mathbf{u}}$ the $m\times n$ covariance matrix between the $n$ training points and the $m$ inducing points, and $K_{\mathbf{uu}}$ the $m\times m$ covariance matrix between the $m$ inducing points. The FITC approximation to the covariance is given by
$$ K \approx \tilde{K} = Q + G, \quad G = \mathrm{diag}(\mathbf{g}), \quad \mathbf{g} = \mathrm{diag}(K-Q), \quad Q = K_{\mathbf{u}}^\top Q_{\mathbf{uu}}^{-1} K_{\mathbf{u}}, \quad Q_{\mathbf{uu}} = K_{\mathbf{uu}} + \sigma_{n\mathbf{u}}^2 I, $$
where $\sigma_{n\mathbf{u}}$ is the noise of the inducing inputs. Note that $\tilde{K}$ and $K$ have the same diagonal elements, $\mathrm{diag}(\tilde{K}) = \mathrm{diag}(K)$; all off-diagonal elements are the same as those of $Q$. Internally, the necessary covariance evaluations are performed by the meta covariance function cov/apxSparse.m. The toolbox offers FITC versions for regression with Gaussian likelihood, inf/infGaussLik.m, as well as for Laplace's approximation, inf/infLaplace.m.
The user can decide whether to treat the inducing inputs $\mathbf{u}$ as fixed or as hyperparameters. The latter allows the inducing inputs $\mathbf{u}$ to be adjusted w.r.t. the marginal likelihood. As detailed in the documentation of inf/apx.m, $\mathbf{u}$ is treated as fixed if it is passed as the 2nd parameter of apxSparse(cov,xu,..). If the hyperparameter structure hyp contains a field hyp.xu, then in inference method calls such as infGaussLik(hyp,..) or inference/prediction calls like gp(hyp,@infGaussLik,..) the inducing inputs $\mathbf{u}$ are treated as hyperparameters and can be optimised. See doc/demoSparse.m for an illustration.
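A minimal sketch of such a sparse setup (the toy data are illustrative; the opt.s mechanism follows the table in section 3.6 and the infGaussLik signature above, but doc/demoSparse.m is the authoritative example):

  % sparse GP regression with inducing inputs xu (FITC for opt.s = 1.0)
  xu  = linspace(-3,3,25)';                           % inducing inputs
  cov = {'apxSparse', {@covSEiso}, xu};
  hyp = struct('mean',[], 'cov',[0;0], 'lik',log(0.1)); hyp.xu = xu;  % optimise xu too
  inff = @(varargin) infGaussLik(varargin{:}, struct('s',1.0));
  x = randn(200,1); y = sin(x) + 0.1*randn(200,1);
  [nlZ, dnlZ] = gp(hyp, inff, [], cov, @likGauss, x, y);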
3.8 Grid-Based Covariance Approximations
Another way to bring down computational costs is to take advantage of grid structure in $\mathbf{x}$. For example, in geostatistics or image processing, the training data $\mathbf{x}\in\mathbb{R}^{n\times D}$ could be a complete 2d lattice of size $n_1\times n_2$ given by the axes $\mathbf{g}_1\in\mathbb{R}^{n_1}$, $\mathbf{g}_2\in\mathbb{R}^{n_2}$, so that $n = N = n_1\cdot n_2$, $D = 2$ and $\mathbf{x} = [\mathrm{vec}(\mathbf{g}_1\mathbf{1}^\top), \mathrm{vec}(\mathbf{1}\mathbf{g}_2^\top)]$. In general, a $p$-dimensional grid $U\in\mathbb{R}^{N\times D}$ is specified by a set of axis matrices $\{\mathbf{g}_i\in\mathbb{R}^{n_i\times D_i}\}_{i=1..p}$, so that $N = \prod_{i=1}^{p}n_i$ and $D = \sum_{i=1}^{p}D_i$, where the axes do not need to be 1d nor do their components need to be sorted. As a consequence, $U$ represents a Cartesian product of its axes, $U = \mathbf{g}_1\times\mathbf{g}_2\times\dots\times\mathbf{g}_p$. The cov/apxGrid.m covariance function represents a Kronecker product covariance matrix
$$ K_{U,U} = K_p\otimes\dots\otimes K_2\otimes K_1 $$
whose factorisation structure is given by the grid xg. The gain in computational efficiency is due to the fact that matrix-vector product, determinant, inverse and eigenvalue computations decompose, so that many operations with an overall cost of $O(N^3)$ now only cost $O(\sum_{i=1}^{p}n_i^3)$.
1A Unifying View of Sparse Approximate Gaussian Process Regression, JMLR, 2005
For off-grid data points, we can still take advantage of the computational properties of a grid-based covariance matrix $K_{U,U}$ via the structured kernel interpolation (SKI) framework, aka KISS-GP, by Wilson and Nickisch² with extensions³. Here, the $n\times n$ covariance $K$ is obtained from the $N\times N$ grid covariance $K_{U,U}$ by interpolation, $K \approx W_X K_{U,U} W_X^\top$, where $K_{U,U}$ is a covariance matrix formed by evaluating the user-specified kernel over a set of latent inducing inputs $U$, with locations that have been chosen to create algebraic structure in $K_{U,U}$ that we can exploit for efficiency. The interpolation matrix $W_X\in\mathbb{R}^{n\times N}$ is extremely sparse; for local cubic interpolation $W_X$ contains only $4^D$ nonzeros per row, where $D$ is the data dimension. In addition, $W_X$ is row-normalised, $\mathbf{1}_n = W_X\mathbf{1}_N$. The structure in $K_{U,U}$, alongside the sparsity of $W_X$, allows for very fast MVMs with the SKI approximate covariance matrix $K$ over the inputs $\mathbf{x}$, enabling fast inference and prediction.
Internally, we use a meta covariance function cov/apxGrid.m to represent the Kronecker covariance matrix and a Gaussian regression inference method inf/infGaussLik.m. We also support incomplete grids where $n < N$. A good starting point is Yunus Saatçi's PhD thesis⁴. For incomplete grids, we use the interpolation-based extensions by Wilson et al.⁵ where conjugate gradients and a determinant approximation are used. See doc/demoGrid1d.m and doc/demoGrid2d.m for an illustration. We also offer non-Gaussian likelihoods as described by Seth Flaxman⁶ so that inf/infLaplace.m can be used.
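A minimal sketch of a grid-based setup (the composite covariance syntax and the grid construction used here are assumptions; doc/demoGrid2d.m is the authoritative example):

  % KISS-GP style regression with a 2d grid of inducing inputs
  xg  = {linspace(-2,2,50)', linspace(-2,2,50)'};      % grid axes, N = 50*50
  cov = {@apxGrid, {{@covSEiso},{@covSEiso}}, xg};     % Kronecker covariance
  hyp = struct('mean',[], 'cov',zeros(4,1), 'lik',log(0.1));
  x = 4*rand(500,2)-2; y = sin(x(:,1)).*cos(x(:,2)) + 0.1*randn(500,1);
  [nlZ, dnlZ] = gp(hyp, @infGaussLik, [], cov, @likGauss, x, y);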
3.9 State Space Representation of GPs
GP models with covariance functions with a Markovian structure can be transformed into equivalent
discrete state space models where inference can be done in linear time O(n). Exact models can be
derived for sum, product, linear, noise, constant, Matérn (half-integer), Ornstein–Uhlenbeck, and
Wiener covariance functions. Other common covariance functions can be approximated by their
Markovian counterparts, including squared exponential, rational quadratic, and periodic covariance
functions.
A state space model describes the evolution of a dynamical system at different time instances $t_i$, $i = 1, 2, \dots$, by
$$ \mathbf{f}_i \sim \mathbb{P}(\mathbf{f}_i|\mathbf{f}_{i-1}), \qquad y_i \sim \mathbb{P}(y_i|\mathbf{f}_i), $$
where $\mathbf{f}_i := \mathbf{f}(t_i)\in\mathbb{R}^d$ and $\mathbf{f}_0\sim\mathbb{P}(\mathbf{f}_0)$, with $\mathbf{f}_i$ being the latent (hidden/unobserved) variable and $y_i$ being the observed variable. In continuous time, a simple dynamical system able to represent many covariance functions is given by the following linear time-invariant stochastic differential equation:
$$ \dot{\mathbf{f}}(t) = \mathbf{F}\,\mathbf{f}(t) + \mathbf{L}\,\mathbf{w}(t), \qquad y_i = \mathbf{H}\,\mathbf{f}(t_i) + \epsilon_i, $$
where $\mathbf{w}(t)$ is an $s$-dimensional white noise process, the measurement noise $\epsilon_i\sim\mathcal{N}(0,\sigma_n^2)$ is Gaussian, and $\mathbf{F}\in\mathbb{R}^{d\times d}$, $\mathbf{L}\in\mathbb{R}^{d\times s}$, $\mathbf{H}\in\mathbb{R}^{1\times d}$ are the feedback, noise effect, and measurement matrices, respectively. The initial state is distributed according to $\mathbf{f}_0\sim\mathcal{N}(\mathbf{0},\mathbf{P}_0)$.
The latent GP is recovered by $f(t) = \mathbf{H}\,\mathbf{f}(t)$, and $\mathbf{w}(t)\in\mathbb{R}^s$ is a multivariate white noise process with spectral density matrix $\mathbf{Q}_c\in\mathbb{R}^{s\times s}$. For discrete values, this translates into
$$ \mathbf{f}_i \sim \mathcal{N}(\mathbf{A}_{i-1}\mathbf{f}_{i-1},\mathbf{Q}_{i-1}), \qquad y_i \sim \mathbb{P}(y_i|\mathbf{H}\,\mathbf{f}_i), $$
2Kernel Interpolation for Scalable Structured Gaussian Processes, ICML, 2015
3Thoughts on Massively Scalable Gaussian Processes, TR, 2015.
4Scalable Inference for Structured Gaussian Process Models, University of Cambridge, 2011
5Fast Kernel Learning for Multidimensional Pattern Extrapolation, NIPS, 2014
6Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods, ICML, 2015
Algorithm 1: Kalman (forward) filtering.
Input: {t_i}, y                       # training inputs and targets
       {A_i}, {Q_i}, H, P_0           # state space model
       W, b                           # likelihood effective precision and location
for i = 1 to n do
  if i == 1 then
    m_i <- 0;  P_i <- P_0                                        # init
  else
    m_i <- A_i m_{i-1};  P_i <- A_i P_{i-1} A_i' + Q_i           # predict
  end if
  if has label y_i then
    mu_f <- H m_i;  u <- P_i H';  sigma_f^2 <- H u               # latent
    z_i <- W_ii sigma_f^2 + 1
    k_i <- W_ii u / z_i;  P_i <- P_i - k_i u'                    # variance
    c_i <- W_ii mu_f - b_i;  m_i <- m_i - u c_i / z_i            # mean
  end if
end for
log det(I + W^(1/2) K W^(1/2)) <- sum_i log z_i

Algorithm 2: Rauch-Tung-Striebel (backward) smoothing.
Input: {m_i}, {P_i}                   # Kalman filter output
       {A_i}, {Q_i}                   # state space model
for i = n down to 2 do
  m <- A_i m_{i-1};  P <- A_i P_{i-1} A_i' + Q_i                 # predict
  G_i <- P_{i-1} A_i' P^{-1};  Delta m_{i-1} <- G_i (m_i - m)
  P_{i-1} <- P_{i-1} + G_i (P_i - P) G_i'                        # variance
  m_{i-1} <- m_{i-1} + Delta m_{i-1}                             # mean
end for
with $\mathbf{f}_0\sim\mathcal{N}(\mathbf{0},\mathbf{P}_0)$. The discrete-time matrices are
$$ \mathbf{A}_i = \mathbf{A}[\Delta t_i] = e^{\Delta t_i\mathbf{F}}, \qquad \mathbf{Q}_i = \int_0^{\Delta t_i} e^{(\Delta t_i-\tau)\mathbf{F}}\,\mathbf{L}\,\mathbf{Q}_c\,\mathbf{L}^\top e^{(\Delta t_i-\tau)\mathbf{F}^\top}\,\mathrm{d}\tau, $$
where $\Delta t_i = t_{i+1} - t_i > 0$.
For stationary covariances $k(t,t') = k(t-t')$, the stationary state is distributed as $\mathbf{f}_\infty\sim\mathcal{N}(\mathbf{0},\mathbf{P}_\infty)$ and the stationary covariance can be found by solving the Lyapunov equation
$$ \dot{\mathbf{P}}_\infty = \mathbf{F}\,\mathbf{P}_\infty + \mathbf{P}_\infty\mathbf{F}^\top + \mathbf{L}\,\mathbf{Q}_c\,\mathbf{L}^\top = \mathbf{0}, $$
which leads to the identity $\mathbf{Q}_i = \mathbf{P}_\infty - \mathbf{A}_i\mathbf{P}_\infty\mathbf{A}_i^\top$.
In practice, the evaluation of the $n$ discrete-time transition matrices $\mathbf{A}_i = e^{\Delta t_i\mathbf{F}}$ and the noise covariance matrices $\mathbf{Q}_i$ (in the stationary case) for different values of $\Delta t_i$ is a computational challenge. Since the matrix exponential $\psi: s\mapsto e^{s\mathbf{X}}$ is smooth, its evaluation can be accurately approximated by convolution interpolation.
From the discrete set of matrices, all the necessary computations can be done using Kalman filtering and Kalman smoothing as detailed in Algorithms 1 and 2.
Internally, we use a meta covariance function cov/apxState.m to represent the state space representation. A good starting point is Arno Solin's PhD thesis⁷. See doc/demoState.m for an illustration. We also offer non-Gaussian likelihoods⁸ so that inf/infLaplace.m and inf/infVB.m can be used. EP is not fully functional; we offer single-sweep EP, aka assumed density filtering (ADF).
7Stochastic Differential Equation Methods for Spatio-Temporal Gaussian Process Regression, Aalto University, 2016
8State Space Gaussian Processes with Non-Gaussian Likelihood, ICML, 2018
4 Likelihood Functions
A likelihood function $p_\rho(y|f)$ (with hyperparameters $\rho$) is a conditional density, $\int p_\rho(y|f)\,\mathrm{d}y = 1$, defined for scalar latent function values $f$ and outputs $y$. In the GPML toolbox, we use i.i.d. likelihoods $p_\rho(\mathbf{y}|\mathbf{f}) = \prod_{i=1}^{n}p_\rho(y_i|f_i)$. The approximate inference engine does not explicitly distinguish between classification and regression likelihoods: it is fully generic in the likelihood, allowing a single piece of code to be used in the inference step.
Likelihood functionality is needed both during inference and while predicting.
4.1 Prediction
A prediction at $x_*$ conditioned on the data $\mathcal{D} = (X,\mathbf{y})$ (as implemented in gp.m) consists of the predictive mean $\mu_{y_*}$ and variance $\sigma^2_{y_*}$, which are computed from the latent marginal moments $\mu_{f_*},\sigma^2_{f_*}$, i.e. the Gaussian marginal approximation $\mathcal{N}(f_*|\mu_{f_*},\sigma^2_{f_*})$, via
$$ p(y_*|\mathcal{D},x_*) = \int p(y_*|f_*)\,p(f_*|\mathcal{D},x_*)\,\mathrm{d}f_* \approx \int p(y_*|f_*)\,\mathcal{N}(f_*|\mu_{f_*},\sigma^2_{f_*})\,\mathrm{d}f_*. \qquad (3) $$
The moments are given by $\mu_{y_*} = \int y_*\,p(y_*|\mathcal{D},x_*)\,\mathrm{d}y_*$ and $\sigma^2_{y_*} = \int (y_*-\mu_{y_*})^2\,p(y_*|\mathcal{D},x_*)\,\mathrm{d}y_*$. The likelihood call
• [lp,ymu,ys2] = lik(hyp, [], fmu, fs2)
does exactly this. Evaluation of the logarithm of $p_{y_*} = p(y_*|\mathcal{D},x_*)$ for values $y_*$ can be done via
• [lp,ymu,ys2] = lik(hyp, y, fmu, fs2)
where lp contains the number $\ln p_{y_*}$.
Using the moments of the likelihood, $\mu(f_*) = \int y_*\,p(y_*|f_*)\,\mathrm{d}y_*$ and $\sigma^2(f_*) = \int (y_*-\mu(f_*))^2\,p(y_*|f_*)\,\mathrm{d}y_*$, we obtain the following (exact) expressions for the predictive moments:
$$ \mu_{y_*} = \int\mu(f_*)\,p(f_*|\mathcal{D},x_*)\,\mathrm{d}f_*, \qquad \sigma^2_{y_*} = \int\big[\sigma^2(f_*) + (\mu(f_*)-\mu_{y_*})^2\big]\,p(f_*|\mathcal{D},x_*)\,\mathrm{d}f_*. $$
1. The binary case is simple since $y_*\in\{-1,+1\}$ and $1 = p_{y_*} + p_{-y_*}$. Using $\pi_* = p_{+1}$, we find
$$ p_{y_*} = \begin{cases}\pi_* & y_* = +1\\ 1-\pi_* & y_* = -1\end{cases}, \qquad \mu_{y_*} = \sum_{y_*=\pm1}y_*\,p(y_*|\mathcal{D},x_*) = 2\pi_*-1\in[-1,1], \qquad \sigma^2_{y_*} = \sum_{y_*=\pm1}(y_*-\mu_{y_*})^2\,p(y_*|\mathcal{D},x_*) = 4\pi_*(1-\pi_*)\in[0,1]. $$
2. The continuous case for homoscedastic likelihoods depending only on $r_* = y_*-f_*$ and having noise variance $\sigma^2(f_*) = \sigma^2_n$ is also simple, since the identity $p(y_*|f_*) = p(y_*-f_*|0)$ allows the substitution $y_*\leftarrow y_*+f_*$, yielding $\mu(f_*) = f_* + \int y_*\,p(y_*|0)\,\mathrm{d}y_*$; assuming $\int y_*\,p(y_*|0)\,\mathrm{d}y_* = 0$, we arrive at
$$ \mu_{y_*} = \mu_{f_*}, \qquad \sigma^2_{y_*} = \sigma^2_{f_*} + \sigma^2_n. $$
3. The generalised linear model (GLM) case is also feasible. Evaluation of the predictive distribution is done by quadrature,
$$ p_{y_*} = \int p(y_*|f_*)\,p(f_*|\mathcal{D},x_*)\,\mathrm{d}f_* \approx \int p(y_*|f_*)\,\mathcal{N}(f_*|\mu_{f_*},\sigma^2_{f_*})\,\mathrm{d}f_*. $$
For GLMs the mean is given by $\mu(f_*) = g(f_*)$ and the variance is usually a simple function of the mean, $\sigma^2(f_*) = v(g(f_*))$; hence we use Gaussian-Hermite quadrature with $\mathcal{N}(f_*|\mu_{f_*},\sigma^2_{f_*}) \approx p(f_*|\mathcal{D},x_*)$ to compute
$$ \mu_{y_*} = \int g(f_*)\,p(f_*|\mathcal{D},x_*)\,\mathrm{d}f_*, \qquad \sigma^2_{y_*} = \int\big[v(g(f_*)) + (g(f_*)-\mu_{y_*})^2\big]\,p(f_*|\mathcal{D},x_*)\,\mathrm{d}f_* \neq v(\mu_{y_*}). $$
4. Finally, the warped Gaussian likelihood predictive distribution with strictly monotonically increasing warping function $g$ is given by the expression
$$ p(y_*|\mathcal{D},x_*) = g'(y_*)\,\mathcal{N}\big(g(y_*)\,\big|\,\mu_{f_*},\,\sigma^2_n + \sigma^2_{f_*}\big), $$
so that the predictive moments can be computed by Gaussian-Hermite quadrature.
In the following, we will detail how and which likelihood functions are implemented in the GPML
toolbox. Further, we will mention dependencies between likelihoods and inference methods and
provide some analytical expressions in addition to some likelihood implementations.
4.2 Interface
The likelihoods are in fact the most challenging object in our implementation. Different inference
algorithms require different aspects of the likelihood to be computed, therefore the interface is rather
involved as detailed below.
⟨likFunctions.m 19⟩≡
1% likelihood functions are provided to be used by the gp.m function:
2%
3% likErf (Error function , classification , probit regression)
4% likLogistic (Logistic , classification , logit regression)
5% likUni (Uniform likelihood, classification)
6%
7% likGauss (Gaussian , regression)
8% likGaussWarp (Warped Gaussian , regression)
9% likGumbel (Gumbel likelihood for extremal values)
10 % likLaplace (Laplacian or double exponential , regression)
11 % likSech2 (Sech -square, regression)
12 % likT (Student ’s t, regression)
13 %
14 % likPoisson (Poisson regression , count data)
15 % likNegBinom (Negativ binomial regression , count data)
16 % likGamma (Nonnegative regression , positive data)
17 % likExp (Nonnegative regression , positive data)
18 % likInvGauss (Nonnegative regression , positive data)
19 % likBeta (Beta regression , interval data)
20 %
21 % likMix (Mixture of individual likelihood functions)
22 %
23 % The likelihood functions have three possible modes, the mode being selected
24 % as follows (where "lik" stands for any likelihood function in "lik/lik*.m".):
25 %
26 % 1) With one or no input arguments: [REPORT NUMBER OF HYPERPARAMETERS]
27 %
28 % s = lik OR s = lik(hyp)
29 %
30 % The likelihood function returns a string telling how many hyperparameters it
31 % expects, using the convention that "D" is the dimension of the input space.
32 % For example , calling "likLogistic" returns the string ’0’.
33 %
34 %
35 % 2) With three or four input arguments: [PREDICTION MODE]
36 %
37 % lp = lik(hyp, y, mu) OR [lp, ymu, ys2] = lik(hyp, y, mu, s2)
38 %
39 % This allows to evaluate the predictive distribution. Let p(y_*|f_*) be the
40 % likelihood of a test point and N(f_*|mu,s2) an approximation to the posterior
41 % marginal p(f_*|x_*,x,y) as returned by an inference method. The predictive
42 % distribution p(y_*|x_*,x,y) is approximated by.
43 % q(y_*) = \int N(f_*|mu,s2) p(y_*|f_*) df_*
44 %
45 % lp = log( q(y) ) for a particular value of y, if s2 is [] or 0, this
46 % corresponds to log( p(y|mu) )
47 % ymu and ys2 the mean and variance of the predictive marginal q(y)
48 % note that these two numbers do not depend on a particular
49 % value of y
50 % All vectors have the same size.
51 %
52 %
53 % 3) With five or six input arguments, the fifth being a string [INFERENCE MODE]
54 %
55 % [varargout] = lik(hyp, y, mu, s2, inf) OR
56 % [varargout] = lik(hyp, y, mu, s2, inf, i)
57 %
58 % There are three cases for inf, namely a) infLaplace , b) infEP and c) infVB.
59 % The last input i, refers to derivatives w.r.t. the ith hyperparameter.
60 %
61 % a1) [lp,dlp,d2lp,d3lp] = lik(hyp, y, f, [], ’infLaplace ’)
62 % lp, dlp, d2lp and d3lp correspond to derivatives of the log likelihood
63 % log(p(y|f)) w.r.t. to the latent location f.
64 % lp = log( p(y|f) )
65 % dlp = d log( p(y|f) ) / df
66 % d2lp = d^2 log( p(y|f) ) / df^2
67 % d3lp = d^3 log( p(y|f) ) / df^3
68 %
69 % a2) [lp_dhyp ,dlp_dhyp ,d2lp_dhyp] = lik(hyp, y, f, [], ’infLaplace ’, i)
70 % returns derivatives w.r.t. to the ith hyperparameter
71 % lp_dhyp = d log( p(y|f) ) / ( dhyp_i)
72 % dlp_dhyp = d^2 log( p(y|f) ) / (df dhyp_i)
73 % d2lp_dhyp = d^3 log( p(y|f) ) / (df^2 dhyp_i)
74 %
75 %
76 % b1) [lZ,dlZ,d2lZ] = lik(hyp, y, mu, s2, ’infEP ’)
77 % let Z = \int p(y|f) N(f|mu,s2) df then
78 % lZ = log(Z)
79 % dlZ = d log(Z) / dmu
80 % d2lZ = d^2 log(Z) / dmu^2
81 %
82 % b2) [dlZhyp] = lik(hyp, y, mu, s2, ’infEP’, i)
83 % returns derivatives w.r.t. to the ith hyperparameter
84 % dlZhyp = d log(Z) / dhyp_i
85 %
86 %
87 % c1) [b,z] = lik(hyp, y, [], ga, ’infVB’)
88 % ga is the variance of a Gaussian lower bound to the likelihood p(y|f).
89 % p(y|f) \ge exp( b*(f+z) - (f+z).^2/(2*ga) - h(ga)/2 ) \propto N(f|b*ga-z,ga)
90 % The function returns the linear part b and z.
91 %
92 % Cumulative likelihoods are designed for binary classification. Therefore , they
93 % only look at the sign of the targets y; zero values are treated as +1.
94 %
95 % Some examples for valid likelihood functions:
96 % lik = @likLogistic;
97 % lik = {’likMix ’,{’likUni ’,@likErf}}
98 % lik = {@likPoisson ,’logistic ’};
99 %
100 % See the help for the individual likelihood for the computations specific to
101 % each likelihood function.
102 %
103 ⟨gpml copyright 5a⟩
4.3 Implemented Likelihood Functions
The following table enumerates all (currently) implemented likelihood functions that can be found
at lik/lik<NAME>.m and their respective set of hyperparameters ρ.
For each lik<NAME> we list a short description, the density $p_\rho(y_i|f_i)$ and the hyperparameters $\rho$; for the GLM-type likelihoods, $\mu = g(f_i)$ denotes the mean obtained through an inverse link function $g$.

Regression, $y_i\in\mathbb{R}$:
  Gauss      Gaussian           $\mathcal{N}(y_i|f_i,\sigma^2) = \frac{1}{\sqrt{2\pi}\sigma}\exp\big(-\frac{(y_i-f_i)^2}{2\sigma^2}\big)$,  $\rho = \{\ln\sigma\}$
  GaussWarp  Warped Gaussian    $\mathcal{N}(g_\theta(y_i)|f_i,\sigma^2)\,g'_\theta(y_i)$,  $\rho = \{\theta_1,\dots,\theta_{n_g},\ln\sigma\}$
  Gumbel     Gumbel             $\frac{\pi}{\sigma\sqrt{6}}\exp(-z_i - e^{-z_i})$, $z_i = \gamma + s\,\frac{\pi(y_i-f_i)}{\sigma\sqrt{6}}$, $|s| = 1$,  $\rho = \{\ln\sigma\}$
  Sech2      Sech-squared       $\frac{\tau}{2\cosh^2(\tau(y_i-f_i))}$, $\tau = \frac{\pi}{2\sigma\sqrt{3}}$,  $\rho = \{\ln\sigma\}$
  Laplace    Laplacian          $\frac{1}{2b}\exp\big(-\frac{|y_i-f_i|}{b}\big)$, $b = \frac{\sigma}{\sqrt{2}}$,  $\rho = \{\ln\sigma\}$
  T          Student's t        $\frac{\Gamma(\frac{\nu+1}{2})}{\Gamma(\frac{\nu}{2})}\frac{1}{\sqrt{\nu\pi}\,\sigma}\big(1 + \frac{(y_i-f_i)^2}{\nu\sigma^2}\big)^{-\frac{\nu+1}{2}}$,  $\rho = \{\ln(\nu-1),\ln\sigma\}$

Classification, $y_i\in\{\pm1\}$:
  Erf        Error function     $\int_{-\infty}^{y_i f_i}\mathcal{N}(t)\,\mathrm{d}t$,  $\rho = \emptyset$
  Logistic   Logistic function  $\frac{1}{1+\exp(-y_i f_i)}$,  $\rho = \emptyset$
  Uni        Label noise        $\frac{1}{2}$,  $\rho = \emptyset$

Count data, $y_i\in\mathbb{N}$, with $\mu = g(f_i)$:
  Poisson    Poisson            $\mu^{y_i}e^{-\mu}/y_i!$,  $\rho = \emptyset$
  NegBinom   Negative Binomial  $\binom{y_i+r-1}{y_i}\,r^r\mu^{y_i}/(r+\mu)^{r+y_i}$,  $\rho = \{\ln r\}$

Nonnegative data, $y_i\in\mathbb{R}_+\setminus\{0\}$, with $\mu = g(f_i)$:
  Weibull    Weibull, $\gamma_1 = \Gamma(1+1/\kappa)$:  $\kappa\,\frac{\gamma_1}{\mu}\big(\frac{y_i\gamma_1}{\mu}\big)^{\kappa-1}\exp\big(-\big(\frac{y_i\gamma_1}{\mu}\big)^\kappa\big)$,  $\rho = \{\ln\kappa\}$
  Gamma      Gamma              $\frac{\alpha^\alpha y_i^{\alpha-1}}{\Gamma(\alpha)\mu^\alpha}\exp\big(-\frac{y_i\alpha}{\mu}\big)$,  $\rho = \{\ln\alpha\}$
  Exp        Exponential        $\mu^{-1}\exp\big(-\frac{y_i}{\mu}\big)$,  $\rho = \emptyset$
  InvGauss   Inverse Gaussian   $\sqrt{\frac{\lambda}{2\pi y_i^3}}\exp\big(-\frac{\lambda(y_i-\mu)^2}{2\mu^2 y_i}\big)$,  $\rho = \{\ln\lambda\}$

Interval data, $y_i\in[0,1]$, with $\mu = g(f_i)$:
  Beta       Beta               $\frac{\Gamma(\phi)}{\Gamma(\mu\phi)\Gamma((1-\mu)\phi)}\,y_i^{\mu\phi-1}(1-y_i)^{(1-\mu)\phi-1}$,  $\rho = \{\ln\phi\}$

Composite likelihood functions, $[p_1(y_i|f_i), p_2(y_i|f_i), \dots]\mapsto p_\rho(y_i|f_i)$:
  Mix        Mixture            $\sum_j \alpha_j\,p_j(y_i|f_i)$,  $\rho = \{\ln\alpha_1,\ln\alpha_2,\dots\}$
4.4 Usage of Implemented Likelihood Functions
Some code examples taken from doc/usageLik.m illustrate how to use simple and composite likeli-
hood functions to specify a GP model.
Syntactically, a likelihood function lf is defined by
lk := ’func’ | @func // simple
lf := {lk} | {param, lk} | {lk, {lk, .., lk}} // composite
i.e., it is either a string containing the name of a likelihood function, a pointer to a likelihood function
or one of the former in combination with a cell array of likelihood functions and an additional list
of parameters.
⟨doc/usageLik.m 22⟩≡
1% demonstrate usage of likelihood functions
2%
3% See also likFunctions.m.
4%
5 ⟨gpml copyright 5a⟩
6clear all,close all
7n = 5; f = randn(n,1); % create random latent function values
8
9% set up simple classification likelihood functions
10 yc = sign(f);
11 lc0 = {’likErf’}; hypc0 = []; % no hyperparameters are needed
12 lc1 = {@likLogistic}; hypc1 = []; % also function handles are OK
13 lc2 = {’likUni’}; hypc2 = [];
14 lc3 = {’likMix’,{’likUni’,@likErf}}; hypc3 = log([1;2]); %mixture
15
16 % set up simple regression likelihood functions
17 yr = f + randn(n,1)/20;
18 sn = 0.1; % noise standard deviation
19 lr0 = {’likGauss’}; hypr0 = log(sn);
20 lr1 = {’likLaplace’}; hypr1 = log(sn);
21 lr2 = {’likSech2’}; hypr2 = log(sn);
22 nu = 4; % number of degrees of freedom
23 lr3 = {’likT’}; hypr3 = [log(nu-1); log(sn)];
24 lr4 = {’likMix’,{lr0,lr1}}; hypr4 = [log([1,2]);hypr0;hypr1];
25
26 a = 1; % set up warped Gaussian with g(y) = y + a*sign(y).*y.^2
27 lr5 = {’likGaussWarp’,[’poly2’]}; hypr5 = log([a;sn]);
28 lr6 = {’likGumbel’,’+’}; hypr6 = log(sn);
29
30 % set up Poisson/negative binomial regression
31 yp = fix(abs(f)) + 1;
32 lp0 = {@likPoisson ,’logistic’}; hypp0 = [];
33 lp1 = {@likPoisson ,{’logistic2’,0.1}}; hypp1 = [];
34 lp2 = {@likPoisson ,’exp’}; hypp2 = [];
35 ln1 = {@likNegBinom,’logistic2’}; hypn1 = [];
36
37 % set up other GLM likelihoods for positive or interval regression
38 lg1 = {@likGamma,’logistic’}; al = 2; hyp.lik = log(al);
39 lg2 = {@likInvGauss,’exp’}; lam = 1.1; hyp.lik = log(lam);
40 lg3 = {@likBeta ,’expexp’}; phi = 2.1; hyp.lik = log(phi);
41 lg4 = {@likBeta ,’logit’}; phi = 4.7; hyp.lik = log(phi);
42 lg5 = {@likWeibull ,{’logistic2’,0.01}}; ka = 0.5; hyp.lik = log(ka);
43
44 % 0) specify the likelihood function
45 lik = lc0; hyp = hypc0; y = yc;
46 % lik = lr4; hyp = hypr4; y = yr;
47 % lik = lp1; hyp = hypp1; y = yp;
48
49 % 1) query the number of parameters
50 feval(lik{:})
51
52 % 2) evaluate the likelihood function on f
53 exp(feval(lik{:},hyp,y,f))
54
55 % 3a) evaluate derivatives of the likelihood
56 [lp,dlp,d2lp,d3lp] = feval(lik{:}, hyp, y, f, [], ’infLaplace’);
57
58 % 3b) compute Gaussian integrals w.r.t. likelihood
59 mu = f; s2 = rand(n,1);
60 [lZ,dlZ,d2lZ] = feval(lik{:}, hyp, y, mu, s2, ’infEP’);
61
62 % 3c) obtain lower bound on likelihood
63 ga = rand(n,1);
64 [b,z] = feval(lik{:}, hyp, y, [], ga, ’infVB’);
4.5 Compatibility Between Likelihoods and Inference Methods
The following table lists all possible combinations of likelihood function and inference methods.
Likelihood \ Inference: Exact | Laplace | VB | EP | KL | MCMC | LOO, followed by type and output domain, and alternative names. Approximate covariances (section 3.6) are possible for Exact, Laplace, VB and EP, but not for KL, MCMC and LOO.
Gaussian            X X X X X X X    regression, R
Warped Gaussian     X X X X X X      regression, R
Gumbel              X X X X          regression, R
Sech-squared        X X X X X X      regression, R            logistic distribution
Laplacian           X X X X X X      regression, R            double exponential
Student's t         X X X X X        regression, R
Mixture             X X X X X        mixing meta likelihood
Error function      X X X X X        classification, {±1}     probit regression
Logistic function   X X X X X X      classification, {±1}     logit regression
Uniform             X X X X X X      classification, {±1}     label noise
Weibull             X X X            positive data, R+\{0}    nonnegative regression
Gamma               X X X            positive data, R+\{0}    nonnegative regression
Exp                 X X X            positive data, R+\{0}    nonnegative regression
Inverse Gaussian    X X X            positive data, R+\{0}    nonnegative regression
Poisson             X (X)* X X X     count data, N            Poisson regression
Negative Binomial   X X X            count data, N            negative binomial regression
Beta                X X X            interval data, [0, 1]    beta regression
(X)* EP might not converge in some cases since quadrature is used.
Exact inference is only tractable for the Gaussian likelihood. Expectation propagation together with the
Student's t likelihood is inherently unstable due to non-log-concavity. Laplace's approximation for the
Laplace likelihood is not sensible because at the mode the curvature and the gradient are undefined,
owing to the non-differentiable peak of the Laplace distribution. Special care has been taken for the
non-convex optimisation problem posed by the combination of the Student's t likelihood and Laplace's
approximation.
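As a quick orientation, the sketch below is not part of the toolbox documentation; it assumes the toolbox is on the Matlab/Octave path and simply picks one compatible pair from the table, the Student's t likelihood with Laplace's approximation, and evaluates the negative log marginal likelihood with gp.

  % illustrative sketch: a compatible likelihood/inference pair from the table
  x = (1:10)'; y = sin(x) + 0.1*randn(10,1);      % toy 1d regression data
  nu = 4; sn = 0.1;                               % Student's t parameters
  hyp.mean = []; hyp.cov = [0;0]; hyp.lik = [log(nu-1); log(sn)];
  nlZ = gp(hyp, @infLaplace, [], @covSEiso, @likT, x, y)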
4.6 Gaussian Likelihood
The Gaussian likelihood is the simplest likelihood because the posterior distribution is not only
Gaussian but can be computed analytically. In principle, the Gaussian likelihood would only be
used in conjunction with the exact inference method, but we chose to provide compatibility with
all other inference algorithms as well because it enables code testing and makes it easy to switch between
different regression likelihoods very easily.
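For instance, in the following minimal sketch (not part of likGauss.m; toy data, toolbox assumed on the path) swapping the regression likelihood only requires changing lik, hyp.lik and the inference method:

  x = (1:10)'; y = x/5 + 0.1*randn(10,1);                        % toy data
  hyp.mean = []; hyp.cov = [0;0]; hyp.lik = log(0.1);            % ln(sn)
  nlZ_gauss   = gp(hyp, @infExact, [], @covSEiso, @likGauss,   x, y);
  nlZ_laplace = gp(hyp, @infEP,    [], @covSEiso, @likLaplace, x, y);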
24a ⟨lik/likGauss.m 24a⟩≡
1function [varargout] = likGauss(hyp, y, mu, s2, inf, i)
2
3% likGauss - Gaussian likelihood function for regression. The expression for the
4% likelihood is
5% likGauss(t) = exp(-(t-y)^2/2*sn^2) / sqrt(2*pi*sn^2),
6% where y is the mean and sn is the standard deviation.
7%
8% The hyperparameters are:
9%
10 % hyp = [ log(sn) ]
11 %
12 % Several modes are provided , for computing likelihoods , derivatives and moments
13 % respectively , see likFunctions.m for the details. In general , care is taken
14 % to avoid numerical issues when the arguments are extreme.
15 %
16 ⟨gpml copyright 5a⟩
17 %
18 % See also LIKFUNCTIONS.M.
19
20 if nargin<3, varargout = {’1’}; return;end % report number of hyperparameters
21
22 sn2 = exp(2*hyp);
23
24 if nargin<5 % prediction mode if inf is not present
25 ⟨Prediction with Gaussian likelihood 24b⟩
26 else
27 switch inf
28 case ’infLaplace’
29 ⟨Laplace’s method with Gaussian likelihood 25a⟩
30 case ’infEP’
31 ⟨EP inference with Gaussian likelihood 25b⟩
32 case ’infVB’
33 ⟨Variational Bayes inference with Gaussian likelihood 25c⟩
34 end
35 end
24b ⟨Prediction with Gaussian likelihood 24b⟩≡ (24a)
1if isempty(y), y = zeros(size(mu)); end
2s2zero = 1; if nargin >3&&numel(s2)>0&&norm(s2)>eps, s2zero = 0; end % s2==0 ?
3if s2zero % log probability
4lp = -(y-mu).^2./sn2/2-log(2*pi*sn2)/2; s2 = 0;
5else
6lp = likGauss(hyp, y, mu, s2, ’infEP’); % prediction
7end
8ymu = {}; ys2 = {};
9if nargout>1
10 ymu = mu; % first y moment
11 if nargout>2
12 ys2 = s2 + sn2; % second y moment
13 end
14 end
15 varargout = {lp,ymu,ys2};
The Gaussian likelihood function has a single hyperparameter ρ, the log of the noise standard deviation σn.
4.6.1 Exact Inference
Exact inference doesn’t require any specific likelihood-related code; all computations are done directly
by the inference method, see Section 3.1.
4.6.2 Laplace’s Approximation
25a ⟨Laplace’s method with Gaussian likelihood 25a⟩≡ (24a)
1if nargin<6 % no derivative mode
2if isempty(y), y=0; end
3ymmu = y-mu; dlp = {}; d2lp = {}; d3lp = {};
4lp = -ymmu.^2/(2*sn2) - log(2*pi*sn2)/2;
5if nargout>1
6dlp = ymmu/sn2; % dlp, derivative of log likelihood
7if nargout>2 % d2lp, 2nd derivative of log likelihood
8d2lp = -ones(size(ymmu))/sn2;
9if nargout>3 % d3lp, 3rd derivative of log likelihood
10 d3lp = zeros(size(ymmu));
11 end
12 end
13 end
14 varargout = {lp,dlp,d2lp,d3lp};
15 else % derivative mode
16 lp_dhyp = (y-mu).^2/sn2 - 1; % derivative of log likelihood w.r.t. hypers
17 dlp_dhyp = 2*(mu-y)/sn2; % first derivative,
18 d2lp_dhyp = 2*ones(size(mu))/sn2; % and also of the second mu derivative
19 varargout = {lp_dhyp,dlp_dhyp ,d2lp_dhyp};
20 end
4.6.3 Expectation Propagation
25b ⟨EP inference with Gaussian likelihood 25b⟩≡ (24a)
1if nargin<6 % no derivative mode
2lZ = -(y-mu).^2./(sn2+s2)/2 - log(2*pi*(sn2+s2))/2; % log part function
3dlZ = {}; d2lZ = {};
4if nargout>1
5dlZ = (y-mu)./(sn2+s2); % 1st derivative w.r.t. mean
6if nargout>2
7d2lZ = -1./(sn2+s2); % 2nd derivative w.r.t. mean
8end
9end
10 varargout = {lZ,dlZ,d2lZ};
11 else % derivative mode
12 dlZhyp = ((y-mu).^2./(sn2+s2)-1) ./ (1+s2./sn2); % deriv. w.r.t. hyp.lik
13 varargout = {dlZhyp};
14 end
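The infEP branch above simply evaluates the analytic Gaussian convolution. A small numerical check (an aside, not part of the toolbox) confirms that lZ equals the log density of y under N(µ, σn² + s²):

  sn = 0.3; hyp = log(sn); y = 0.7; mu = 0.2; s2 = 0.5^2;   % arbitrary test values
  lZ = likGauss(hyp, y, mu, s2, 'infEP');
  lZ_ref = -(y-mu)^2/(sn^2+s2)/2 - log(2*pi*(sn^2+s2))/2;   % same expression, written out
  abs(lZ - lZ_ref)                                          % zero up to rounding errors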
4.6.4 Variational Bayes
25c ⟨Variational Bayes inference with Gaussian likelihood 25c⟩≡ (24a)
1% variational lower site bound
2% t(s) = exp(-(y-s)^2/2sn2)/sqrt(2*pi*sn2)
3% the bound has the form: (b+z/ga)*f - f.^2/(2*ga) - h(ga)/2
4n = numel(s2); b = zeros(n,1); y = y.*ones(n,1); z = y;
5varargout = {b,z};
4.7 Warped Gaussian Likelihood
Starting from the likelihood p(y|f) we are sometimes facing the situation where the data y ∈ Y ⊆ R
is not distributed according to p(y|f) but according to some nonlinear transformation of the data g(y) = z, so
that z ∼ p(z|f). Here, the warping function g : Y → R needs to be strictly monotonically increasing,
i.e. g'(y) > 0. Formally, we start from the fact that p(z|f) integrates to one and use the derivative
dz = g'(y) dy to substitute

  ∫ p(z|f) dz = 1 = ∫ pg(y|f) dy,   pg(y|f) = p(g(y)|f) g'(y),

where we have defined the log warped likelihood ln pg(y|f) = ln p(g(y)|f) + ln g'(y). The interesting
bit is that approximate inference methods such as infExact, infLaplace, infEP, infVB, infKL remain
fully feasible; only prediction and derivatives become more involved. The usual GP inference is
recovered by using the identity warping function g : y ↦ y. The construction works in principle for
any likelihood but our implementation in likGaussWarp is limited to the Gaussian likelihood.
Hyperparameter derivatives
Hyperparameter derivatives for infLaplace are obtained as follows

  ∂/∂θ ∂^k/∂f^k ln pg(y|f) = ∂/∂θ ∂^k/∂f^k ln p(g(y)|f) + ∂/∂θ ∂^k/∂f^k ln g'(y),   k = 0, 1, 2
                           = −∂^(k+1)/∂f^(k+1) ln p(g(y)|f) · ∂g(y)/∂θ + ∂/∂θ ∂^k/∂f^k ln g'(y).

Similarly, for infEP the derivatives are given by

  ∂/∂θ ln ∫ pg(y|f) N(f|µ,σ²) df = ∂/∂θ ln ∫ p(g(y)|f) N(f|µ,σ²) df + ∂/∂θ ln g'(y)
                                 = −∂/∂µ ln ∫ p(g(y)|f) N(f|µ,σ²) df · ∂g(y)/∂θ + ∂/∂θ ln g'(y).

This trick works for any homoscedastic likelihood with p(y|f) = p(y+β|f+β), such as
likGauss, likLaplace, likSech2 and likT.
Predictive moments
As detailed in Section 4.1, the predictive distribution is – for the Gaussian likelihood – given by

  p(z*|D,x*) = ∫ p(z*|f*) p(f*|D,x*) df* = ∫ N(z*|f*, σn²) N(f*|µf*, σf*²) df*
             = N(z*|µf*, σn² + σf*²),   where z* = g(y*),
  p(y*|D,x*) = g'(y*) N(g(y*)|µf*, σn² + σf*²).

Hence, the predictive moments are obtained by the 1d integrals

  µy*  = ∫ y* g'(y*) N(g(y*)|µf*, σn² + σf*²) dy*        = ∫ g⁻¹(z*) N(z*|µf*, σn² + σf*²) dz*,   and
  σ²y* = ∫ (y* − µy*)² g'(y*) N(g(y*)|µf*, σn² + σf*²) dy* = ∫ (g⁻¹(z*) − µy*)² N(z*|µf*, σn² + σf*²) dz*.
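These 1d integrals can be approximated by any quadrature rule. The sketch below is only an illustration, not the likGaussWarp implementation: it uses Matlab's generic integral routine and a stand-in warping g(y) = sinh(y) with a closed-form inverse, plus assumed predictive moments.

  mu_f = 0.5; s2_f = 0.4; sn2 = 0.01;    % assumed latent moments and noise variance
  v = sn2 + s2_f;                        % variance of z* = g(y*)
  ginv = @(z) asinh(z);                  % inverse of the stand-in warping
  Nz = @(z) exp(-(z-mu_f).^2/(2*v))/sqrt(2*pi*v);
  mu_y = integral(@(z) ginv(z).*Nz(z), -inf, inf);
  s2_y = integral(@(z) (ginv(z)-mu_y).^2.*Nz(z), -inf, inf);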
4.8 Gumbel Likelihood
Distributions of extrema are well captured by the Gumbel distribution

  p(y) = (1/β) exp(−z − e^(−z)),   z = s(y − η)/β,   s ∈ {±1},

with mean µ = η + βγ and variance σ² = π²β²/6, where γ = 0.57721566490153 denotes the
Euler–Mascheroni constant. The skewness is approximately 1.1395·s, where s is a sign switching
between left and right skewness, and the kurtosis is 12/5. The final expression for the Gumbel likelihood
is

  p(y|f) = π/(σ√6) exp(−z − e^(−z)),   z = γ + sπ(y − f)/(σ√6),   s ∈ {±1}.
4.9 Laplace Likelihood
Laplace’s Approximation
The following derivatives are needed:
  ln p(y|f) = −ln(2b) − |f − y|/b,   where b = σn/√2
  ∂ln p/∂f = sign(y − f)/b
  ∂²ln p/(∂f)² = ∂³ln p/(∂f)³ = ∂³ln p/(∂ln σn)(∂f)² = 0
  ∂ln p/∂ln σn = |f − y|/b − 1
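The derivatives can be checked numerically away from the non-differentiable point f = y; the snippet below is an aside (not from the manual) comparing dlp against central finite differences:

  sn = 0.5; hyp = log(sn); y = 1.2; f = 0.3; e = 1e-5;   % arbitrary test values, f ~= y
  [lp,dlp] = likLaplace(hyp, y, f, [], 'infLaplace');
  lp_p = likLaplace(hyp, y, f+e, [], 'infLaplace');
  lp_m = likLaplace(hyp, y, f-e, [], 'infLaplace');
  abs(dlp - (lp_p-lp_m)/(2*e))                           % should be close to zero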
Expectation Propagation
Expectation propagation requires integration against a Gaussian measure for moment matching.
We need to evaluate ln Z = ln ∫ L(y|f,σn²) N(f|µ,σ²) df as well as the derivatives ∂ln Z/∂µ and
∂²ln Z/∂µ², where N(f|µ,σ²) = exp(−(f−µ)²/(2σ²))/√(2πσ²), L(y|f,σn²) = exp(−|y−f|/b)/(2b) and
b = σn/√2. As a first step, we reduce the number of parameters by means of the substitution
f̃ = (f−y)/σn, yielding

  Z = ∫ L(y|f,σn²) N(f|µ,σ²) df = (1/σn) ∫ L(f|0,1) N(f|µ̃,σ̃²) df,
  ln Z = ln Z̃ − ln σn = ln ∫ L(f|0,1) N(f|µ̃,σ̃²) df − ln σn,

with µ̃ = (µ−y)/σn and σ̃ = σ/σn. Thus, we concentrate on the simpler quantity ln Z̃. Splitting the
integral at f = 0 and completing the square in each half gives

  ln Z = ln ∫ exp(−(f−µ̃)²/(2σ̃²) − √2|f|) df + C,   C = −ln(σ̃√2π) − ln(√2 σn)
       = ln [ ∫_{−∞}^{0} exp(−(f² − 2m−·f + µ̃²)/(2σ̃²)) df + ∫_{0}^{∞} exp(−(f² − 2m+·f + µ̃²)/(2σ̃²)) df ] + C,   m± = µ̃ ∓ √2σ̃²,
       = ln [ exp(m−²/(2σ̃²)) ∫_{−∞}^{0} N(f|m−,σ̃²) df + exp(m+²/(2σ̃²)) ( 1 − ∫_{−∞}^{0} N(f|m+,σ̃²) df ) ] − µ̃²/(2σ̃²) − ln(√2 σn)
       = ln [ exp(m−²/(2σ̃²)) Φ(−m−/σ̃) + exp(m+²/(2σ̃²)) Φ(m+/σ̃) ] − µ̃²/(2σ̃²) − ln(√2 σn).

Here, Φ(z) = ∫_{−∞}^{z} N(f|0,1) df denotes the cumulative Gaussian distribution. Finally, we have

  ln Z = ln [ exp(√2µ̃) Φ(−z+) + exp(−√2µ̃) Φ(z−) ] + σ̃² − ln(√2 σn)
       = ln( e^{a+} + e^{a−} ) + σ̃² − ln(√2 σn),   a± = ln Φ(∓z±) ± √2µ̃,

where z± = µ̃/σ̃ ± √2σ̃ = (µ−y)/σ ± √2σ/σn and µ̃ = (µ−y)/σn, σ̃ = σ/σn.
Now, using d ln Φ(z)/dθ = (1/Φ(z)) dΦ(z)/dθ = (N(z)/Φ(z)) dz/dθ, we tackle the first derivative

  ∂ln Z/∂µ = ( e^{a+} ∂a+/∂µ + e^{a−} ∂a−/∂µ ) / ( e^{a+} + e^{a−} ),
  ∂a+/∂µ = ∂ln Φ(−z+)/∂µ + √2/σn = −N(−z+)/(σ Φ(−z+)) + √2/σn = −q+/σ + √2/σn,
  ∂a−/∂µ = ∂ln Φ(z−)/∂µ − √2/σn =  N(z−)/(σ Φ(z−)) − √2/σn  =  q−/σ − √2/σn,

i.e. ∂a±/∂µ = ∓q±/σ ± √2/σn with q+ = N(−z+)/Φ(−z+) and q− = N(z−)/Φ(z−), as well as the second derivative

  ∂²ln Z/∂µ² = ( ∂[e^{a+} ∂a+/∂µ]/∂µ + ∂[e^{a−} ∂a−/∂µ]/∂µ ) / ( e^{a+} + e^{a−} ) − (∂ln Z/∂µ)²,
  ∂[e^{a±} ∂a±/∂µ]/∂µ = e^{a±} [ (∂a±/∂µ)² + ∂²a±/∂µ² ],
  ∂²a+/∂µ² = (N(−z+)/σ²) · (Φ(−z+) z+ − N(−z+)) / Φ²(−z+) = −(q+² − q+ z+)/σ²,
  ∂²a−/∂µ² = (N(z−)/σ²) · (−Φ(z−) z− − N(z−)) / Φ²(z−)   = −(q−² + q− z−)/σ²,

i.e. ∂²a±/∂µ² = −(q±² ∓ q± z±)/σ², which can be simplified to

  ∂²ln Z/∂µ² = ( e^{a+} b+ + e^{a−} b− ) / ( e^{a+} + e^{a−} ) − (∂ln Z/∂µ)²

using

  b± = (∂a±/∂µ)² + ∂²a±/∂µ² = ( ∓q±/σ ± √2/σn )² − (q±² ∓ q± z±)/σ² = 2/σn² − ( √8/(σσn) ∓ z±/σ² ) q±.

We also need

  ∂ln Z/∂ln σn = ( e^{a+} ∂a+/∂ln σn + e^{a−} ∂a−/∂ln σn ) / ( e^{a+} + e^{a−} ) − 2σ²/σn² − 1.
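The closed-form expression can be validated against brute-force quadrature; the check below is an aside (not part of the toolbox) assuming Matlab's integral function:

  sn = 0.4; hyp = log(sn); b = sn/sqrt(2);            % arbitrary test values
  y = 0.3; mu = -0.5; s2 = 0.8;
  lik = @(f) exp(-abs(y-f)/b)/(2*b);                  % Laplace likelihood
  gau = @(f) exp(-(f-mu).^2/(2*s2))/sqrt(2*pi*s2);    % Gaussian measure
  lZ_num = log(integral(@(f) lik(f).*gau(f), -inf, inf));
  lZ = likLaplace(hyp, y, mu, s2, 'infEP');
  abs(lZ - lZ_num)                                    % should be small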
Variational Bayes
We need h(γ) and its derivatives as well as β(γ):

  h(γ) = 2γ/σn² + ln(2σn²) + y²/γ
  h'(γ) = 2/σn² − y²/γ²
  h''(γ) = 2y²/γ³
  β(γ) = y/γ
4.10 Student’s t Likelihood
The likelihood has two hyperparameters (both represented in the log domain to ensure positivity):
the degrees of freedom ν and the scale σn, with mean y (for ν > 1) and variance νσn²/(ν−2) (for ν > 2):

  p(y|f) = Z · (1 + (f−y)²/(νσn²))^(−(ν+1)/2),   Z = Γ((ν+1)/2) / ( Γ(ν/2) √(νπσn²) ).
Laplace’s Approximation
For the mode fitting procedure, we need derivatives up to third order; the hyperparameter derivatives
at the mode require some mixed derivatives. All in all, using r = y − f, we have

  ln p(y|f) = ln Γ((ν+1)/2) − ln Γ(ν/2) − ½ ln(νπσn²) − ((ν+1)/2) ln(1 + r²/(νσn²))
  ∂ln p/∂f = (ν+1) r / (r² + νσn²)
  ∂²ln p/(∂f)² = (ν+1) (r² − νσn²) / (r² + νσn²)²
  ∂³ln p/(∂f)³ = 2(ν+1) (r³ − 3rνσn²) / (r² + νσn²)³
  ∂ln p/∂ln ν = ∂Z/∂ln ν − (ν/2) ln(1 + r²/(νσn²)) + ((ν+1)/2) · r² / (r² + νσn²)
  ∂Z/∂ln ν = (ν/2) · d ln Γ((ν+1)/2)/d((ν+1)/2) − (ν/2) · d ln Γ(ν/2)/d(ν/2) − 1/2
  ∂³ln p/(∂ln ν)(∂f)² = ( νr²(r² − 3(ν+1)σn²) + ν²σn⁴ ) / (r² + νσn²)³
  ∂ln p/∂ln σn = (ν+1) r² / (r² + νσn²) − 1
  ∂³ln p/(∂ln σn)(∂f)² = 2νσn² (ν+1)(νσn² − 3r²) / (r² + νσn²)³
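In practice this combination is used for robust regression; a minimal sketch (toy data with one outlier, toolbox assumed on the path) is:

  x = linspace(-3,3,30)'; y = sin(x) + 0.1*randn(30,1); y(5) = 4;   % toy data, one outlier
  nu = 4; sn = 0.2;
  hyp.mean = []; hyp.cov = [0;0]; hyp.lik = [log(nu-1); log(sn)];
  [ymu,ys2] = gp(hyp, @infLaplace, [], @covSEiso, @likT, x, y, x);  % predictions at x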
4.11 Cumulative Logistic Likelihood
The likelihood has one hyperparameter (represented in the log domain), namely the standard deviation σn:

  p(y|f) = Z · cosh⁻²(τ(f−y)),   τ = π/(2σn√3),   Z = π/(4σn√3).

Laplace's Approximation
The following derivatives are needed, where φ(x) ≡ ln(cosh(x)):

  ln p(y|f) = ln(π) − ln(4σn√3) − 2φ(τ(f−y))
  ∂ln p/∂f = −2τ φ'(τ(f−y))
  ∂²ln p/(∂f)² = −2τ² φ''(τ(f−y))
  ∂³ln p/(∂f)³ = −2τ³ φ'''(τ(f−y))
  ∂³ln p/(∂ln σn)(∂f)² = 2τ² [ 2φ''(τ(f−y)) + τ(f−y) φ'''(τ(f−y)) ]
  ∂ln p/∂ln σn = 2τ(f−y) φ'(τ(f−y)) − 1
4.12 GLM Likelihoods: Poisson, Negative Binomial, Weibull, Gamma, Exponential, Inverse Gaussian and Beta
Data y from a space other than R, e.g. N, R+ or [0, 1], can be modeled using generalised linear model
likelihoods p(y|f), where the expected value E[y] = µ is related to the underlying Gaussian process
f by means of an inverse link function µ = g(f). Typically, the likelihoods are from an exponential
family; hence the variance V[y] = v(µ) is a simple function of the mean µ, as are higher order
moments such as the skewness S[y] = s(µ) and the kurtosis K[y] = k(µ).
Here, we directly specify the inverse link function µ = g(f) defining the mapping from the GP f
to the mean intensity µ. For numerical reasons, we work with the log of the inverse link function
h(f) = ln g(f) and use its derivatives h', h'' and h''' for subsequent computations. In the table below,
we have summarised the GLM likelihood expressions, the moments, the range of their variables and
the applicable inverse link functions.
Likelihood      ρ          v(µ) =             s(µ) =                                   k(µ) =
Poisson         ∅          µ                  1/√µ                                     1/µ
Neg. Binomial   {ln r}     µ(µ/r + 1)         √(4(r+µ)/(rµ)) − √(r/(µ(µ+r)))           6/r + r/(µ(µ+r))
Weibull         {ln κ}     µ²(γ2/γ1² − 1)     (γ3−3γ1γ2+2γ1³)/(γ2−γ1²)^(3/2)           (γ4−4γ1γ3+12γ1²γ2−3γ2²−6γ1⁴)/(γ2−γ1²)²
Gamma           {ln α}     µ²/α               2/√α                                     6/α
Exponential     ∅          µ²                 2                                        6
Inv. Gauss      {ln λ}     µ³/λ               3√(µ/λ)                                  15µ/λ
Beta            {ln φ}     µ(1−µ)/(1+φ)       (2−4µ)/((2+φ)√v(µ))                      6(1 − v(µ)(5φ+6))/(v(µ)(φ+2)(φ+3))

Likelihood      p(y|f) =                                                  y ∈        µ ∈       Inverse Links
Poisson         µ^y exp(−µ)/y!                                            N          R+        exp, logistic*
Neg. Binomial   (y+r−1 choose y) r^r µ^y/(r+µ)^(r+y)                      N          R+        exp, logistic*
Weibull         (κγ1/µ) (yγ1/µ)^(κ−1) exp(−(yγ1/µ)^κ)                     R+\{0}     R+\{0}    exp, logistic*
Gamma           α^α y^(α−1)/(Γ(α)µ^α) exp(−yα/µ)                          R+\{0}     R+\{0}    exp, logistic*
Exponential     µ^(−1) exp(−y/µ)                                          R+\{0}     R+\{0}    exp, logistic*
Inv. Gauss      √(λ/(2πy³)) exp(−λ(y−µ)²/(2µ²y))                          R+\{0}     R+\{0}    exp, logistic*
Beta            Γ(φ)/(Γ(µφ)Γ((1−µ)φ)) y^(µφ−1)(1−y)^((1−µ)φ−1)            [0, 1]     [0, 1]    expexp, logit
4.12.1 Inverse Link Functions
Possible inverse link functions and their properties (∪ convex, ∩ concave, ↑ monotone) are summarised below:

util/glm_invlink_*   g(f) = µ =            g: R →    g is      h(f) = ln µ =          h is
exp                  e^f                   R+        ∪, ↑      f                      ∪, ∩, ↑
logistic             ℓ(f) = ln(1+e^f)      R+        ∪, ↑      ln(ln(1+e^f))          ∩, ↑
logistic2            ℓ(f + a·f·ℓ(f))       R+        ↑         ln(ℓ(f + a·f·ℓ(f)))    ∩, ↑
expexp               exp(−e^(−f))          [0, 1]    ↑         −e^(−f)                ∪, ↑
logit                1/(1+e^(−f))          [0, 1]    ↑         −ln(1+e^(−f))          ∪, ↑
Please see doc/usageLik.m for how to specify a pair of likelihood function and link function in
GPML.
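For instance, the following lines (a condensed variant of what doc/usageLik.m does) build two such pairs and query the number of hyperparameters:

  lik = {@likPoisson,'logistic'};          hyp = [];         % Poisson likelihood, logistic link
  feval(lik{:})                            % report the number of hyperparameters
  lik = {@likWeibull,{'logistic2',0.01}};  hyp = log(0.5);   % Weibull likelihood, kappa = 0.5
  feval(lik{:})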
Exponential inverse link: exp
For g(f) = e^f things are simple since h(f) = f, h'(f) = 1 and h''(f) = h'''(f) = 0.
Logistic inverse link: logistic
For g(f) = ln(1+e^f) the derivatives of h(f) are given by

  h(f) = ln(ln(1+e^f))
  h'(f) = s(−f)/ln(1+e^f),   s(f) = 1/(1+e^f),   s'(f) = −e^f/(1+e^f)² = −s(−f) s(f)
  h''(f) = (1/ln(1+e^f)) · e^(−f)/(1+e^(−f))² − (1/ln²(1+e^f)) · (e^f/(1+e^f)) · (1/(1+e^(−f)))
         = h'(f) (s(f) − h'(f))
  h'''(f) = h''(f) (s(f) − h'(f)) + h'(f) ( −e^f/(1+e^f)² − h''(f) )
          = h''(f) (s(f) − 2h'(f)) − h'(f) s(f) s(−f).

Note that g(f) = e^(h(f)) = ln(1+e^f) is convex, and h(f) = ln(ln(1+e^f)) with

  h''(f) = (1/ln(1+e^f)) · (1 − e^f/ln(1+e^f)) · (1/(1+e^f)) · (1/(1+e^(−f))) ≤ 0

is concave since e^f > ln(1+e^f) for all f ∈ R.
Twice logistic inverse link: logistic2
Note that h(f) = ln(ℓ(f + a·f·ℓ(f))) is – according to Seeger et al. [9] – concave.
Double negative exponential inverse link: expexp
For g(f) = exp(−e^(−f)) the derivatives of h(f) are given by

  h(f) = −e^(−f),   h'(f) = −h(f),   h''(f) = h(f),   h'''(f) = −h(f).
Logit regression inverse link: logit
For g(f) = 1/(1+e^(−f)) the derivatives of h(f) can be computed using the logistic inverse link function
h_ℓ(f), since h(f) = f − exp(h_ℓ(f)):

  h(f) = f − e^(h_ℓ(f))
  h'(f) = 1 − e^(h_ℓ(f)) h_ℓ'(f)
  h''(f) = −e^(h_ℓ(f)) [ h_ℓ'(f)² + h_ℓ''(f) ] = −s(f) s(−f)
  h'''(f) = −e^(h_ℓ(f)) [ h_ℓ'(f)³ + 3 h_ℓ''(f) h_ℓ'(f) + h_ℓ'''(f) ]
[9] Bayesian Intermittent Demand Forecasting, NIPS, 2016
4.12.2 Poisson Likelihood
Count data y ∈ N^n can be modeled in the GP framework using the Poisson distribution p(y) =
µ^y e^(−µ)/y! with mean/variance E[y] = V[y] = µ, skewness S[y] = 1/√µ and kurtosis K[y] = 1/µ,
leading to the likelihood

  p(y|f) = µ^y exp(−µ)/y!,   µ = g(f)
  ⇔ ln p(y|f) = y·ln g(f) − g(f) − ln Γ(y+1).

For Laplace's method to work, we need the first three derivatives of the log likelihood ln p(y|f), where
h(f) = ln g(f):

  ln p(y|f) = y·h(f) − exp(h(f)) − ln Γ(y+1)
  ∂/∂f ln p(y|f) = h'(f) [y − exp(h(f))]
  ∂²/∂f² ln p(y|f) = h''(f) [y − exp(h(f))] − h'(f)² exp(h(f))
  ∂³/∂f³ ln p(y|f) = h'''(f) [y − exp(h(f))] − 3h'(f) h''(f) exp(h(f)) − h'(f)³ exp(h(f))
                   = h'''(f) [y − exp(h(f))] − h'(f) [h'(f)² + 3h''(f)] exp(h(f)).

Note that if ln µ = h(f) is concave and µ = g(f) is convex then the Poisson likelihood p(y|f) is
log-concave in f, which is the case for both exp and logistic.
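A minimal Poisson regression sketch (toy count data, toolbox assumed on the path, exp inverse link) looks as follows:

  x = linspace(0,5,40)'; y = fix(abs(3*sin(x))) + 1;            % toy count data
  hyp.mean = []; hyp.cov = [0;0]; hyp.lik = [];                 % likPoisson has no hyperparameters
  lik = {@likPoisson,'exp'};
  [ymu,ys2] = gp(hyp, @infLaplace, [], @covSEiso, lik, x, y, x);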
4.12.3 Weibull Likelihood
Nonnegative data y ∈ R+ such as time-to-failure can be modeled in the GP framework using the
Weibull distribution p(y) = (κ/λ)(y/λ)^(κ−1) e^(−(y/λ)^κ) with shape parameter κ > 0, scale parameter
λ > 0, mean E[y] = λγ1 = µ where γj = Γ(1 + j/κ), variance V[y] = λ²γ2 − µ² = µ²(γ2/γ1² − 1),
skewness S[y] = (γ3 − 3γ1γ2 + 2γ1³)/(γ2 − γ1²)^(3/2) and kurtosis K[y] = (γ4 − 4γ1γ3 + 12γ1²γ2 − 3γ2² − 6γ1⁴)/(γ2 − γ1²)².
Using the substitution µ = λγ1 ⇔ 1/λ = γ1/µ, we obtain

  p(y|f) = (κγ1/µ) (yγ1/µ)^(κ−1) exp(−(yγ1/µ)^κ),   µ = g(f) > 0
  ⇔ ln p(y|f) = ln(κγ1/µ) + (κ−1) ln(yγ1/µ) − (yγ1/µ)^κ.

Note that the Weibull likelihood p(y|f) is log-concave in f neither for the exp nor for the logistic
inverse link.
4.12.4 Gamma Likelihood
Nonnegative data y ∈ R+ can be modeled in the GP framework using the Gamma distribution
p(y) = θ^(−α)/Γ(α) y^(α−1) e^(−y/θ) with shape parameter α > 0, scale parameter θ > 0, mean E[y] =
αθ = µ, variance V[y] = αθ² = µ²/α, skewness S[y] = 2/√α and kurtosis K[y] = 6/α. Using the
substitution µ = αθ ⇔ α/µ = 1/θ, we obtain

  p(y|f) = α^α y^(α−1)/Γ(α) · µ^(−α) exp(−yα/µ),   µ = g(f) > 0
  ⇔ ln p(y|f) = −α (ln µ + y/µ) − ln Zα(y),   ln Zα(y) = ln Γ(α) − α ln α + (1−α) ln y.

Note that if ln µ = h(f) were convex and µ = g(f) were concave then the Gamma likelihood p(y|f)
would be log-concave in f, which is not the case for either exp or logistic.
4.12.5 Exponential Likelihood
Nonnegative data y ∈ R+ can be modeled in the GP framework using the Exponential distribution
p(y) = θ^(−1) e^(−y/θ) with scale parameter θ > 0, mean E[y] = θ = µ, variance V[y] = µ², skewness
S[y] = 2 and kurtosis K[y] = 6. We obtain

  p(y|f) = µ^(−1) exp(−y/µ),   µ = g(f) > 0
  ⇔ ln p(y|f) = −ln µ − y/µ.

Note that for exp (but not for logistic) the likelihood is log-concave. The exponential distribution
corresponds to the Gamma distribution with α = 1 and to the Weibull distribution with κ = 1.
4.12.6 Inverse Gaussian Likelihood
Nonnegative data y ∈ R+^n can be modeled in the GP framework using the Inverse Gaussian distribution
p(y) = √(λ/(2πy³)) exp(−λ(y−µ)²/(2µ²y)) with shape parameter λ > 0, mean parameter
µ > 0, mean E[y] = µ, variance V[y] = µ³/λ, skewness S[y] = 3√(µ/λ) and kurtosis K[y] = 15µ/λ.
We obtain

  p(y|f) = √(λ/(2πy³)) exp(−λ(y−µ)²/(2µ²y)),   µ = g(f) > 0
  ⇔ ln p(y|f) = −λ(y−µ)²/(2µ²y) − ln Zλ(y),   ln Zλ(y) = −½ (ln λ − ln(2πy³)).

The inverse Gaussian likelihood is in general not log-concave in f for either exp or logistic.
4.12.7 Beta Likelihood
Interval data y ∈ [0, 1]^n can be modeled in the GP framework using the Beta distribution p(y) =
y^(α−1)(1−y)^(β−1)/B(α,β) with shape parameters α, β > 0, mean E[y] = α/(α+β), variance
V[y] = αβ/[(α+β)²(α+β+1)] and 1/B(α,β) = Γ(α+β)/[Γ(α)Γ(β)]. Reparametrising using
the mean parameter µ = E[y] = α/(α+β) and the shape parameter φ = α+β, the variance becomes V[y] =
µ(1−µ)/(1+φ), and hence

  p(y|f) = Γ(φ)/(Γ(µφ)Γ((1−µ)φ)) y^(µφ−1)(1−y)^((1−µ)φ−1),   µ = g(f) > 0
  ⇔ ln p(y|f) = ln Γ(φ) − ln Γ(µφ) − ln Γ((1−µ)φ) + (µφ−1) ln y + ((1−µ)φ−1) ln(1−y).

The Beta likelihood is in general not log-concave in f for either exp or logistic.
5 Mean Functions
A mean function mφ : X → R (with hyperparameters φ) of a GP f is a scalar function defined over
the whole domain X that computes the expected value m(x) = E[f(x)] of f for the input x.
5.1 Interface
In the GPML toolbox, a mean function m : X → R needs to implement evaluation m = mφ(X) and
first derivatives mi = ∂m/∂φi with respect to the components i of the parameter φ ∈ Φ, as detailed
below.
36 ⟨meanFunctions.m 36⟩≡
1% Mean functions to be use by Gaussian process functions. There are two
2% different kinds of mean functions: simple and composite:
3%
4% Simple mean functions:
5%
6% meanZero - zero mean function
7% meanOne - one mean function
8% meanConst - constant mean function
9% meanLinear - linear mean function
10 % meanPoly - polynomial mean function
11 % meanDiscrete - precomputed mean for discrete data
12 % meanGP - predictive mean of another GP
13 % meanGPexact - predictive mean of a regression GP
14 % meanNN - nearest neighbor mean function
15 % meanWSPC - weighted sum of projected cosines
16 %
17 % Composite mean functions (see explanation at the bottom):
18 %
19 % meanScale - scaled version of a mean function
20 % meanSum - sum of mean functions
21 % meanProd - product of mean functions
22 % meanPow - power of a mean function
23 % meanMask - mask some dimensions of the data
24 % meanPref - difference mean for preference learning
25 % meanWarp - warped mean function
26 %
27 % Naming convention: all mean functions are named "mean/mean*.m".
28 %
29 %
30 % 1) With no or only a single input argument:
31 %
32 % s = meanNAME or s = meanNAME(hyp)
33 %
34 % The mean function returns a string s telling how many hyperparameters hyp it
35 % expects, using the convention that "D" is the dimension of the input space.
36 % For example , calling "meanLinear" returns the string ’D’.
37 %
38 % 2) With two input arguments and one output argument:
39 %
40 % m = meanNAME(hyp, x)
41 %
42 % The function computes and returns the mean vector m with components
43 % m(i) = m(x(i,:)) where hyp are the hyperparameters and x is an n by D matrix
44 % of data, where D is the dimension of the input space. The returned mean
45 % vector m is of size n by 1.
46 %
47 % 3) With two input arguments and two output arguments:
48 %
49 % [m,dm] = meanNAME(hyp, x)
50 %
51 % The function computes and returns the mean vector m as in 2) above.
52 % In addition to that, the (linear) directional derivative function dm is
53 % returned. The call dhyp = dm(q) for a direction vector q of size n by 1
54 % returns a vector of directional derivatives dhyp = d (q’*m(x)) / d hyp of
55 % the same size as the hyperparameter vector hyp. The components of dhyp are
56 % defined as follows: dhyp(i) = q’*( d m(x) / d hyp(i) ).
57 %
58 % See also doc/usageMean.m.
59 %
60 ⟨gpml copyright 5a⟩
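A short sketch of the directional derivative interface described above (an aside, toolbox assumed on the path):

  n = 5; D = 2; x = randn(n,D);         % toy inputs
  hyp = [2;3];                          % meanLinear expects D hyperparameters
  [m,dm] = meanLinear(hyp, x);          % mean vector and derivative function
  q = randn(n,1);                       % an arbitrary direction
  dhyp = dm(q)                          % dhyp(i) = q'*( d m(x) / d hyp(i) )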
5.2 Implemented Mean Functions
We offer simple and composite mean functions producing new mean functions m(x) from existing
mean functions µj(x). All code files are named according to the pattern mean/mean<NAME>.m for
simple identification. This modular specification makes it possible to define affine mean functions m(x) =
c + aᵀx or polynomial mean functions m(x) = (c + aᵀx)², as illustrated by the sketch after the table.
All currently available mean functions are summarised in the following table.
Simple mean functions m(x)
  Zero      mean vanishes always;                                 m(x) = 0;                             φ = ∅
  One       mean equals 1;                                        m(x) = 1;                             φ = ∅
  Const     mean equals a constant;                               m(x) = c;                             φ = c ∈ R
  Linear    mean depends linearly on x ∈ X ⊆ R^D;                 m(x) = aᵀx;                           φ = a ∈ R^D
  Poly      mean depends polynomially on x ∈ X ⊆ R^D;             m(x) = Σd adᵀ x^d;                    φ = a ∈ R^(D×d)
  Discrete  precomputed mean for discrete data x ∈ X ⊆ N;         m(x) = µx;                            φ = µ ∈ R^s
  GP        predictive mean of another GP;                        m(x) = ∫ y·p(y|D,x) dy;               φ = ∅
  GPexact   predictive mean of a regression GP;                   m(x) = ∫ y·p(y|D,x) dy;               φ = {ρ, ψ, σn}
  NN        nearest neighbor for a set (zj,mj) ∈ X×R;             m(x) = mi, i = arg minj d(x,zj);      φ = ∅
  WSPC      weighted sum of d projected cosines, x ∈ X ⊆ R^D;     m(x) = aᵀcos(Wx + b);                 φ = {W ∈ R^(d×D), a,b ∈ R^d}

Composite mean functions [µ1(x), µ2(x), ..] ↦ m(x)
  Scale     scale a mean;                                         m(x) = α µ(x);                        φ = α ∈ R
  Sum       add up mean functions;                                m(x) = Σj µj(x);                      φ = ∅
  Prod      multiply mean functions;                              m(x) = Πj µj(x);                      φ = ∅
  Pow       raise a mean to a power;                              m(x) = µ(x)^d;                        φ = ∅
  Mask      act on components I ⊆ [1, 2, .., D] of x ∈ X ⊆ R^D only;   m(x) = µ(xI);                    φ = ∅
  Pref      preference learning mean, x = [x1; x2], xi ∈ R^(D/2);      m(x) = µ(x1) − µ(x2);            φ = ∅
  Warp      warped mean;                                          m(x) = g[µ(x)];                       φ = ∅
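As an example of the affine and polynomial constructions mentioned above, the following sketch (in the spirit of doc/usageMean.m) builds m(x) = c + aᵀx and m(x) = (c + aᵀx)²:

  c = 2; a = [1;-1]; n = 5; x = randn(n,2);           % toy setup
  maff = {'meanSum',{@meanConst,@meanLinear}};  hypaff = [c; a];
  mpol = {'meanPow',2,maff};                    hyppol = hypaff;
  feval(maff{:}, hypaff, x)             % evaluates c + x*a
  feval(mpol{:}, hyppol, x)             % evaluates (c + x*a).^2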
5.3 Usage of Implemented Mean Functions
Some code examples taken from doc/usageMean.m illustrate how to use simple and composite mean
functions to specify a GP model.
Syntactically, a mean function mf is defined by
mn := ’func’ | @func // simple
mf := {mn} | {mn, {param, mf}} | {mn, {mf, .., mf}} // composite
i.e., it is either a string containing the name of a mean function, a pointer to a mean function, or one
of the former in combination with a cell array of mean functions and an additional list of parameters.
38 ⟨doc/usageMean.m 38⟩≡
1% demonstrate usage of mean functions
2%
3% See also meanFunctions.m.
4%
5 ⟨gpml copyright 5a⟩
6clear all,close all
7n = 5; D = 2; x = randn(n,D); % create a random data set
8
9% set up simple mean functions
10 m0 = {’meanZero’}; hyp0 = []; % no hyperparameters are needed
11 m1 = {’meanOne’}; hyp1 = []; % no hyperparameters are needed
12 mc = {@meanConst}; hypc = 2; % also function handles are possible
13 ml = {@meanLinear}; hypl = [2;3]; % m(x) = 2*x1 + 3*x2
14 mp = {@meanPoly ,2}; hypp = [1;1;2;3]; % m(x) = x1+x2+2*x1^2+3*x2^2
15 mn = {@meanNN ,[1,0; 0 ,1],[0.9,0.5]}; hypn = []; % nearest neighbor
16 s = 12; hypd = randn(s,1); % discrete mean with 12 hypers
17 md = {’meanDiscrete’,s};
18 hyp.cov = [0;0]; hypg = []; % GP predictive mean
19 xt = randn(2*n,D); yt = sign(xt(:,1)-xt(:,2)); % training data
20 mg = {@meanGP ,hyp,@infEP ,@meanZero ,@covSEiso ,@likErf,xt,yt};
21 hype = [0;0; log(0.1)]; % regression GP predictive mean
22 xt = randn(2*n,D); yt = xt(:,1).*xt(:,2); % training data
23 me = {@meanGPexact ,@meanZero ,@covSEiso ,xt,yt};
24
25 % set up composite mean functions
26 msc = {’meanScale’,{m1}}; hypsc = [3; hyp1]; % scale by 3
27 msu = {’meanSum’,{m0,mc,ml}}; hypsu = [hyp0; hypc; hypl]; % sum
28 mpr = {@meanProd ,{mc,ml}}; hyppr = [hypc; hypl]; % product
29 mpo = {’meanPow’,3,msu}; hyppo = hypsu; % third power
30 mask = [false ,true]; % mask excluding all but the 2nd component
31 mma = {’meanMask’,mask,ml}; hypma = hypl(mask);
32 mpf = {@meanPref,ml}; hyppf = 2; % linear pref with slope
33 mwp = {@meanWarp,ml,@sin,@cos};hypwp = 2; % sin of linear
34
35 % 0) specify mean function
36 % mean = md; hyp = hypd; x = randi([1,s],n,1);
37 % mean = mn; hyp = hypn;
38 % mean = mg; hyp = hypg;
39 mean = me; hyp = hype;
40 % mean = m0; hyp = hyp0;
41 % mean = msu; hyp = hypsu;
42 % mean = mpr; hyp = hyppr;
43 % mean = mpo; hyp = hyppo;
44 % mean = mpf; hyp = hyppf;
45
46 % 1) query the number of parameters
47 feval(mean{:})
48
49 % 2) evaluate the function on x
50 feval(mean{:},hyp,x)
51
52 % 3) compute the derivatives w.r.t. to hyperparameter i
53 i = 2; feval(mean{:},hyp,x,i)
6 Covariance Functions
A covariance function kψ : X×X → R (with hyperparameters ψ) of a GP f is a scalar function
defined over the whole domain X² that computes the covariance k(x,z) = V[f(x), f(z)] = E[(f(x) −
m(x))(f(z) − m(z))] of f between the inputs x and z.
6.1 Interface
Again, the interface is simple since only evaluation of the full covariance matrix K = kψ(X) and its
derivatives Ki = ∂K/∂ψi, as well as the cross terms k∗ = kψ(X, x∗) and k∗∗ = kψ(x∗, x∗) for prediction,
are required.
39 ⟨covFunctions.m 39⟩≡
1% Covariance functions to be use by Gaussian process functions. There are two
2% different kinds of covariance functions: simple and composite:
3%
4% 1) Elementary and standalone covariance functions:
5% covZero - zero covariance function
6% covEye - unit covariance function
7% covOne - unit constant covariance function
8% covDiscrete - precomputed covariance for discrete data
9%
10 % 2) Composite covariance functions:
11 % covScale - scaled version of a covariance function
12 % covSum - sums of covariance functions
13 % covProd - products of covariance functions
14 % covMask - mask some dimensions of the data
15 % covPref - difference covariance for preference learning
16 % covPER - make stationary covariance periodic
17 % covADD - additive covariance function
18 % covWarp - warp input to a covariance function
19 %
20 % 3) Mahalanobis distance based covariances and their modes
21 % covMaha - generic "mother" covariance
22 % * eye - unit length scale
23 % * iso - isotropic length scale
24 % * ard - automatic relevance determination
25 % * prot - (low-rank) projection in input space
26 % * fact - factor analysis covariance
27 % * vlen - spatially varying length scale
28 % covGE - Gamma exponential covariance
29 % covMatern - Matern covariance function with nu=1/2, 3/2 or 5/2
30 % covPP - piecewise polynomial covariance function (compact support)
31 % covRQ - rational quadratic covariance function
32 % covSE - squared exponential covariance function
33 %
34 % 4) Dot product based covariances and their modes
35 % covDot - generic "mother" covariance
36 % * eye - unit length scale
37 % * iso - isotropic length scale
38 % * ard - automatic relevance determination
39 % * pro - (low-rank) projection in input space
40 % * fac - factor analysis covariance
41 % covLIN - linear covariance function
42 % covPoly - polynomial covariance function
43 %
44 % 5) Time series covariance functions on the positive real line
45 % covFBM - fractional Brownian motion covariance
46 % covULL - underdamped linear Langevin process covariance
47 % covW - i-times integrated Wiener process covariance
48 % covOU - i-times integrated Ornstein -Uhlenbeck process covariance
49 %
50 % 6) Standalone covariances
51 % covNNone - neural network covariance function
52 % covLINone - linear covariance function with bias
53 % covPeriodic - smooth periodic covariance function (1d)
54 % covPeriodicNoDC - as above but with zero DC component and properly scaled
55 % covCos - sine periodic covariance function (1d) with unit period
56 % covGabor - Gabor covariance function
57 %
58 % 7) Shortcut covariances assembled from library
59 % covConst - covariance for constant functions
60 % covNoise - independent covariance function (i.e. white noise)
61 % covPERiso - make isotropic stationary covariance periodic
62 % covPERard - make ARD stationary covariance periodic
63 % covMaterniso - Matern covariance function with nu=1/2, 3/2 or 5/2
64 % covMaternard - Matern covariance function with nu=1/2, 3/2 or 5/2 with ARD
65 % covPPiso - piecewise polynomial covariance function (compact support)
66 % covPPard - piecewise polynomial covariance function (compact support)
67 % covRQiso - isotropic rational quadratic covariance function
68 % covRQard - rational quadratic covariance function with ARD
69 % covSEiso - isotropic squared exponential covariance function
70 % covSEisoU - same as above but without latent scale
71 % covSEard - squared exponential covariance function with ARD
72 % covSEvlen - spatially varying lengthscale squared exponential
73 % covSEproj - projection squared exponential covariance function
74 % covLINiso - linear covariance function
75 % covLINard - linear covariance function with ARD
76 % covGaborard - Gabor covariance function with ARD
77 % covGaboriso - isotropic Gabor covariance function
78 % covSM - spectral mixture covariance function
79 %
80 % 8) Special purpose (approximation) covariance functions
81 % apxSparse - sparse approximation: to be used for large scale inference
82 % problems with inducing points aka FITC
83 % apxGrid - grid interpolation: to be used for large scale inference
84 % problems with Kronecker/Toeplitz/BTTB covariance matrix
85 %
86 % The covariance functions are written according to a special convention where
87 % the exact behaviour depends on the number of input and output arguments
88 % passed to the function. If you want to add new covariance functions, you
89 % should follow this convention if you want them to work with the function gp.
90 % There are four different ways of calling the covariance functions:
91 %
92 % 1) With no (or one) input argument(s):
93 %
94 % s = cov
95 %
96 % The covariance function returns a string s telling how many hyperparameters it
97 % expects, using the convention that "D" is the dimension of the input space.
98 % For example , calling "covRQard" returns the string ’(D+2)’.
99 %
100 % 2) With two input arguments and one output argument:
101 %
102 % K = cov(hyp, x) equivalent to K = cov(hyp, x, [])
103 %
104 % The function computes and returns the covariance matrix where hyp are
105 % the hyperparameters and x is an n by D matrix of cases, where
106 % D is the dimension of the input space. The returned covariance matrix is of
107 % size n by n.
108 %
109 % 3) With three input arguments and one output argument:
110 %
111 % Kz = cov(hyp, x, z)
112 % kx = cov(hyp, x, ’diag’)
113 %
114 % The function computes test set covariances; kx is a vector of self covariances
115 % for the test cases in x (of length n) and Kz is an (n by nz) matrix of cross
116 % covariances between training cases x and test cases z.
117 %
118 % 4) With two output arguments:
119 %
120 % [K,dK] = cov(hyp, x) equivalent to [K,dK] = cov(hyp, x, [])
121 % [K,dK] = cov(hyp, x, z)
122 % [K,dK] = cov(hyp, x, ’diag ’)
123 %
124 % The function computes and returns the covariances K as in 3) above.
125 % In addition to that, the (linear) directional derivative function dK is
126 % returned. The two possible calls dhyp = dK(Q) and [dhyp,dx] = dK(Q) for a
127 % direction Q of the same size as K are possible. The return arguments dhyp
128 % and dx are the directional derivatives dhyp = d trace(Q’*K) / d hyp and
129 % dx = d trace(Q’*K) / d x are of the same size as the hyperparameter
130 % vector hyp and the input data x, respectively. The components of dhyp and
131 % dx are defined as follows: dhyp(i) = trace(Q’*( d K / d hyp(i) ))
132 % and dx(i,j) = trace(Q’*( d K / d x(i,j) )).
133 %
134 % Covariance functions can be specified in two ways: either as a string
135 % containing the name of the covariance function or using a cell array. For
136 % example:
137 %
138 % cov = ’covRQard ’;
139 % cov = {’covRQard ’};
140 % cov = {@covRQard};
141 %
142 % are supported. Only the second and third form using the cell array can be used
143 % for specifying composite covariance functions , made up of several
144 % contributions. For example:
145 %
146 % cov = {’covScale ’, {’covRQiso ’}};
147 % cov = {’covSum ’, {’covRQiso ’,’covSEard ’,’covNoise ’}};
148 % cov = {’covProd ’,{’covRQiso ’,’covSEard ’,’covNoise ’}};
149 % cov = {’covMask ’,{mask ,’covSEiso ’}}
150 % q=1; cov = {’covPPiso ’,q};
151 % d=3; cov = {’covPoly ’,d};
152 % cov = {’covADD ’,{[1,2],’covSEiso ’}};
153 % cov = {@apxSparse, {@covSEiso}, u}; where u are the inducing inputs
154 %
155 % specifies a covariance function which is the sum of three contributions. To
156 % find out how many hyperparameters this covariance function requires , we do:
157 %
158 % feval(cov{:})
159 %
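A short sketch of the derivative interface described above (an aside, toolbox assumed on the path):

  n = 5; D = 3; x = randn(n,D);         % toy inputs
  hyp = [0;0];                          % covSEiso: [log(ell); log(sf)]
  [K,dK] = covSEiso(hyp, x);            % covariance matrix and derivative function
  Q = randn(n);                         % a direction of the same size as K
  dhyp = dK(Q)                          % dhyp(i) = trace(Q'*( d K / d hyp(i) ))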
6.2 Implemented Covariance Functions
Similarly to the mean functions, we provide a whole algebra of covariance functions k : X×X → R
with the same generic name pattern cov/cov<NAME>.m as before.
Besides a long list of simple covariance functions, we also offer a variety of composite covariance
functions as shown in the following table.
1) Elementary and standalone covariance functions k(x,z)
  Zero      covariance vanishes always;                          k(x,z) = 0;                       ψ = ∅
  Eye       unit additive measurement noise;                     k(x,z) = δ(x−z);                  ψ = ∅
  One       unit constant;                                       k(x,z) = 1;                       ψ = ∅
  Discrete  precomputed covariance for discrete data x ∈ X ⊆ N;  k(x,z) = K_xz where K = LᵀL is a Cholesky decomposition, L ∈ R^(s×s);   ψ = L

2) Composite covariance functions [κ1(x,z), κ2(x,z), ..] ↦ k(x,z)
  Scale     modulate covariance function by a scalar;            k(x,z) = σf² κ(x,z);              ψ = ln σf
  Scale     modulate covariance function depending on input;     k(x,z) = σf(x) κ(x,z) σf(z);      ψ = φℓ
  Sum       add up covariance functions;                         k(x,z) = Σj κj(x,z);              ψ = ∅
  Prod      multiply covariance functions;                       k(x,z) = Πj κj(x,z);              ψ = ∅
  Mask      act on components I ⊆ [1, 2, .., D] of x ∈ X ⊆ R^D only;   k(x,z) = κ(xI, zI);          ψ = ∅
  Pref      preference learning covariance, x = [x1; x2], xi ∈ R^(D/2);   k(x,z) = κ(x1,z1) + κ(x2,z2) − κ(x1,z2) − κ(x2,z1);   ψ = ∅
  PER       turn a stationary covariance into a periodic one, X ⊆ R^D;   k(x,z) = κ(u(x),u(z)), u(x) = [sin(2πxp); cos(2πxp)], xp = x ('eye'), x/p ('iso'), x./p ('ard');   ψ = ψu
  ADD       additive, X ⊆ R^D, index degree set D = {1, .., D};  k(x,z) = Σ_{d∈D} σfd² Σ_{|I|=d} Π_{i∈I} κ(xi, zi; ψi);   ψ = {ψ1, .., ψD, ln σf1, .., ln σf|D|}
  Warp      warp inputs to a covariance function;                k(x,z) = κ(p(x), p(z)), where p : R^D → R^D;   ψ = ∅
3) Covariance functions based on Mahalanobis distances, k(x,z) = k(r), r² = (x−z)ᵀP⁻¹(x−z), where x,z ∈ X ⊆ R^D
  Maha      shared generic "mother" covariance function;         k(r) as below;                    ψ = ∅
  GE        gamma exponential;                                   k(r) = exp(−|r|^γ), γ ∈ (0, 2];   ψ = ln(γ/(2−γ))
  Matern    Matérn with f1(t) = 1, f3(t) = 1+t, f5(t) = f3(t) + t²/3;   k(r) = fd(√d·r) exp(−√d·r);   ψ = ∅
  PP        compact support, piecewise polynomial fv(r);         k(r) = max(0, 1−r)^(j+v) · fv(r); ψ = ∅
  RQ        rational quadratic;                                  k(r) = (1 + r²/(2α))^(−α);        ψ = ln α
  SE        squared exponential;                                 k(r) = exp(−r²/2);                ψ = ∅
Allowable mode parameter values for Mahalanobis distance covariances
  'eye'     unit lengthscale;                                    P = I;                                        ψ = ∅
  'iso'     isotropic lengthscale;                               P = ℓ²I, ℓ ∈ R+;                              ψ = ln ℓ
  'ard'     automatic relevance determination;                   P = Λ², Λ = diag(λ), λ ∈ R+^D;                ψ = ln λ
  'proj'    (low-rank) projection in input space;                P⁻¹ = LᵀL, L ∈ R^(d×D);                       ψ = L
  'fact'    factor analysis covariance;                          P⁻¹ = LᵀL + diag(f), f ∈ R^D, L ∈ R^(d×D);    ψ = {L, ln f}
  'vlen'    spatially varying lengthscale;                       kℓ(x,z) = (ℓ(x)ℓ(z)/l²(x,z))^(D/2) k(r²/l²(x,z)), l²(x,z) = (ℓ²(x)+ℓ²(z))/2;   ψ = φℓ
4) Covariance functions based on Euclidean dot products, k(x,z) = k(s), s = xᵀP⁻¹z, where x,z ∈ X ⊆ R^D
  Dot       shared generic "mother" covariance function;         k(s) as below;                    ψ = ∅
  LIN       linear covariance function;                          k(s) = s;                         ψ = ∅
  Poly      polynomial covariance;                               k(s) = σf² (s + c)^d;             ψ = {ln c, ln σf}
Allowable mode parameter values for Euclidean dot product covariances
  'eye'     unit lengthscale;                                    P = I;                                        ψ = ∅
  'iso'     isotropic lengthscale;                               P = ℓ²I, ℓ ∈ R+;                              ψ = ln ℓ
  'ard'     automatic relevance determination;                   P = Λ², Λ = diag(λ), λ ∈ R+^D;                ψ = ln λ
  'proj'    (low-rank) projection in input space;                P⁻¹ = LᵀL, L ∈ R^(d×D);                       ψ = L
  'fact'    factor analysis covariance;                          P⁻¹ = LᵀL + diag(f), f ∈ R^D, L ∈ R^(d×D);    ψ = {L, ln f}
5) Time series covariance functions on the positive real line, x,z ∈ X = R+
  FBM    fractional Brownian motion covariance with Hurst index h;   k(x,z) = ½σf² ( |x|^(2h) − |x−z|^(2h) + |z|^(2h) );   ψ = {ln σf, −ln(1/h − 1)}
  ULL    underdamped linear Langevin process covariance;             k(t) = a² e^(−µt) ( sin(ωt)/ω + cos(ωt)/µ ), t = |x−z|;   ψ = {ln µ, ln ω, ln a}
  W      i-times integrated Wiener process covariance, i ∈ {−1, .., 3};   κ0(x,z) = σf² min(x,z), κi(x,z) = ∫0^x ∫0^z κ(i−1)(x̃,z̃) dx̃ dz̃;   ψ = ln σf
  OU     i-times integrated Ornstein-Uhlenbeck process covariance, i ∈ {0, 1};   κ0(x,z) = σf² exp(−|x−z|/ℓ) + (σf0² − σf²) exp(−(x+z)/ℓ);   ψ = {ln ℓ, ln σf, ln σf0}
6) Standalone covariance functions
  NNone         neural network, X ⊆ R^D;        k(x,z) = σf² sin⁻¹( xᵀz / √((ℓ²+xᵀx)(ℓ²+zᵀz)) );   ψ = {ln ℓ, ln σf}
  LINone        linear with bias, X ⊆ R^D;      k(x,z) = (xᵀz + 1)/t²;                             ψ = ln t
  Periodic      smooth periodic, X ⊆ R;         k(x,z) = σf² exp( −(2/ℓ²) sin²[π(x−z)/p] );        ψ = {ln ℓ, ln p, ln σf}
  PeriodicNoDC  periodic, rescaled, DC component removed, X ⊆ R;   k(x,z) = σf² [ κ(x−z) − (1/π)∫0^π κ(t)dt ] / [ κ(0) − (1/π)∫0^π κ(t)dt ], κ(x−z) = exp( −(2/ℓ²) sin²[π(x−z)/p] );   ψ = {ln ℓ, ln p, ln σf}
  Cos           periodic cosine, X ⊆ R;         k(x,z) = σf² cos( 2π(x−z)/p );                     ψ = {ln p, ln σf}
  Gabor         Gabor function, X ⊆ R^D, λ,p ∈ R+^D;   k(x,z) = exp(−½ tᵀΛ⁻²t) cos(2π tpᵀ1), t = x−z, tp = t ('eye'), t/p ('iso'), t./p ('ard');   ψ = {ln λ, ψp}
7) Shortcut covariance functions assembled from the library k(x,z)
  Const        covariance equals a constant;                         k(x,z) = σf²;                                  ψ = ln σf
  Noise        additive measurement noise;                           k(x,z) = σf² δ(x−z);                           ψ = ln σf
  PERiso       turn a stationary covariance into a periodic one, X ⊆ R^D;   k(x,z) = κ(u(x),u(z)), u(x) = [sin xp, cos xp], xp = 2πx/p;   ψ = ln p
  PERard       turn an ARD stationary covariance into a periodic one, X ⊆ R^D;   k(x,z) = κ(u(x),u(z)), u(x) = [sin xp, cos xp], xp = 2πx./p;   ψ = {ln p1, .., ln pD}
  Materniso    Matérn, X ⊆ R^D, f1(t) = 1, f3(t) = 1+t, f5(t) = f3(t) + t²/3;   k(x,z) = σf² fd(rd) exp(−rd), rd = √( d (x−z)ᵀ(x−z)/ℓ² );   ψ = {ln ℓ, ln σf}
  Maternard    Matérn with ARD, X ⊆ R^D;                             k(x,z) = σf² fd(rd) exp(−rd), rd = √( d (x−z)ᵀΛ⁻²(x−z) );   ψ = {ln λ1, .., ln λD, ln σf}
  PPiso        compact support, piecewise polynomial fv(r);          k(x,z) = σf² max(0, 1−r)^(j+v) fv(r), r = ‖x−z‖/ℓ, j = ⌊D/2⌋ + v + 1;   ψ = {ln ℓ, ln σf}
  PPard        compact support, piecewise polynomial fv(r), ARD;     k(x,z) = σf² max(0, 1−r)^(j+v) fv(r), r² = (x−z)ᵀΛ⁻²(x−z);   ψ = {ln λ1, .., ln λD, ln σf}
  RQiso        isotropic rational quadratic, X ⊆ R^D;                k(x,z) = σf² ( 1 + (x−z)ᵀ(x−z)/(2αℓ²) )^(−α);  ψ = {ln ℓ, ln σf, ln α}
  RQard        rational quadratic with ARD, X ⊆ R^D;                 k(x,z) = σf² ( 1 + (x−z)ᵀΛ⁻²(x−z)/(2α) )^(−α); ψ = {ln λ1, .., ln λD, ln σf, ln α}
  SEiso        isotropic squared exponential, X ⊆ R^D;               k(x,z) = σf² exp( −(x−z)ᵀ(x−z)/(2ℓ²) );        ψ = {ln ℓ, ln σf}
  SEisoU       as above but without latent scale;                    k(x,z) = exp( −(x−z)ᵀ(x−z)/(2ℓ²) );            ψ = ln ℓ
  SEard        automatic relevance determination squared exponential;   k(x,z) = σf² exp( −½ (x−z)ᵀΛ⁻²(x−z) );      ψ = {ln λ1, .., ln λD, ln σf}
  SEvlen       spatially varying lengthscale squared exponential;    k(x,z) = σf² (a/b)^(D/2) exp( −‖x−z‖²/b ), a = 2ℓ(x)ℓ(z), b = ℓ²(x) + ℓ²(z);   ψ = {φℓ, ln σf}
  SEproj       projection (factor analysis) squared exponential;     k(x,z) = σf² exp( −½ (x−z)ᵀLᵀL(x−z) ), L ∈ R^(d×D);   ψ = {L, ln σf}
  LINard       linear with diagonal weighting;                       k(x,z) = xᵀΛ⁻²z;                               ψ = {ln λ1, .., ln λD}
  LINiso       linear with isotropic weighting;                      k(x,z) = xᵀz/ℓ²;                               ψ = ln ℓ
  Gaborard     anisotropic Gabor function, X ⊆ R^D, λ,p ∈ R+^D;      k(x,z) = exp( −Σd td²/(2λd²) ) cos( 2π Σd td/pd ), td = xd − zd;   ψ = {ln λ, ln p}
  Gaboriso     isotropic Gabor function, X ⊆ R^D, ℓ,p ∈ R+;          k(x,z) = exp( −tᵀt/(2ℓ²) ) cos( 2π tᵀ1/p ), t = x−z;   ψ = {ln ℓ, ln p}
  SM           spectral mixture, X ⊆ R^D, w ∈ R+^Q, M,V ∈ R+^(D×Q);  k(x,z) = wᵀ( exp(−½ Vᵀt²) ⊙ cos(Mᵀt) ), t = 2π(x−z);   ψ = {ln w, ln M, ln V}
  SM           spectral mixture product, X ⊆ R^D, W ∈ R+^(D×Q), M,V ∈ R+^(D×Q);   k(x,z) = Πd wdᵀ( exp(−½ vd td²) ⊙ cos(md td) ), td = 2π(xd − zd);   ψ = {ln W, ln M, ln V}
(⊙ denotes elementwise multiplication.)
The spectral mixture covariance covSM was introduced by Wilson & Adams, Gaussian Process Kernels
for Pattern Discovery and Extrapolation, ICML, 2013.
The periodic covariance function covPER starts from a stationary covariance function that depends
on the data only through a distance r² = (x−x')ᵀΛ⁻²(x−x'), such as covMatern, covPP, covRQ,
covSE, and turns it into a periodic covariance function by embedding the data x ∈ R^D into a
periodic high-dimensional space xp = u(x) ∈ R^(2D).
The additive covariance function covADD starts from a one-dimensional covariance function κ(xi, x'i; ψi)
acting on a single component i ∈ [1, .., D] of x. From that, we define covariance functions κI(xI, x'I) =
Π_{i∈I} κ(xi, x'i; ψi) acting on vector-valued inputs xI. The sums of exponential size can efficiently be
computed using the Newton-Girard formulae. Sample functions drawn from a GP with additive
covariance are additive functions. The number of interacting variables |I| is a measure of how complex
the additive functions are.
6.3 Usage of Implemented Covariance Functions
Some code examples taken from doc/usageCov.m illustrate how to use simple and composite covariance
functions to specify a GP model.
Syntactically, a covariance function cf is defined by
cv := ’func’ | @func // simple
cf := {cv} | {cv, {param, cf}} | {cv, {cf, .., cf}} // composite
i.e., it is either a string containing the name of a covariance function, a pointer to a covariance func-
tion or one of the former in combination with a cell array of covariance functions and an additional
list of parameters.
44 ⟨doc/usageCov.m 44⟩≡
1% demonstrate usage of covariance functions
2%
3% See also covFunctions.m.
4%
5 ⟨gpml copyright 5a⟩
6clear all,close all
7n = 5; D = 3; x = randn(n,D); xs = randn(3,D); % create a data set
8
9% set up simple covariance functions
10 cn = {’covNoise’}; sn = .1; hypn = log(sn); % one hyperparameter
11 cc = {@covConst}; sf = 2; hypc = log(sf); % function handles OK
12 ce = {@covEye}; hype = []; % identity
13 cl = {@covLIN}; hypl = []; % linear is parameter -free
14 cla = {’covLINard’}; L = rand(D,1); hypla = log(L); % linear (ARD)
15 cli = {’covLINiso’}; l = rand(1); hypli = log(l); % linear iso
16 clo = {@covLINone}; ell = .9; hyplo = log(ell); % linear with bias
17 cp = {@covPoly,3}; c = 2; hypp = log([c;sf]); % third order poly
18 cga = {@covSEard}; hypga = log([L;sf]); % Gaussian with ARD
19 cgi = {’covSEiso’}; hypgi = log([ell;sf]); % isotropic Gaussian
20 cgu = {’covSEisoU’}; hypgu = log(ell); % isotropic Gauss no scale
21 cra = {’covRQard’}; al = 2; hypra = log([L;sf;al]); % ration. quad.
22 cri = {@covRQiso}; hypri = log([ell;sf;al]); % isotropic
23 cma = {@covMaternard ,5}; hypma = log([ell;sf]); % Matern class d=5
24 cmi = {’covMaterniso’,3}; hypmi = log([ell;sf]); % Matern class d=3
25 cnn = {’covNNone’}; hypnn = log([L;sf]); % neural network
26 cpe = {’covPeriodic’}; p = 2; hyppe = log([ell;p;sf]); % periodic
27 cpn = {’covPeriodicNoDC’}; p = 2; hyppe = log([ell;p;sf]); % w/o DC
28 cpc = {’covCos’}; p = 2; hypcpc = log([p;sf]); % cosine cov
29 cca = {’covPPard’,3}; hypcc = hypgu;% compact support poly degree 3
30 cci = {’covPPiso’,2}; hypcc = hypgi;% compact support poly degree 2
31 cgb = {@covGaboriso}; ell = 1; p = 1.2; hypgb=log([ell;p]); % Gabor
32 Q = 2; w = ones(Q,1)/Q; m = rand(D,Q); v = rand(D,Q);
33 csm = {@covSM ,Q}; hypsm = log([w;m(:);v(:)]); % Spectral Mixture
34 cvl = {@covSEvlen ,{@meanLinear}}; hypvl = [1;2;1; 0]; % var lenscal
35 s = 12; cds = {@covDiscrete ,s}; % discrete covariance function
36 L = randn(s); L = chol(L’*L); L(1:(s+1):end) = log(diag(L));
37 hypds = L(triu(true(s))); xd = randi([1,s],[n,1]); xsd = [1;3;6];
38 cfa = {@covSEfact ,2}; hypfa = randn(D*2,1); % factor analysis
39
40 % set up composite i.e. meta covariance functions
41 csc = {’covScale’,{cgu}}; hypsc = [log(3); hypgu]; % scale by 9
42 csu = {’covSum’,{cn,cc,cl}}; hypsu = [hypn; hypc; hypl]; % sum
43 cpr = {@covProd ,{cc,cci}}; hyppr = [hypc; hypcc]; % product
44 mask = [0,1,0]; % binary mask excluding all but the 2nd component
45 cma = {’covMask’,{mask,cgi{:}}}; hypma = hypgi;
46 % isotropic periodic rational quadratic
47 cpi = {’covPERiso’,{@covRQiso}};
48 % periodic Matern with ARD
49 cpa = {’covPERard’,{@covMaternard ,3}};
50 % additive based on SEiso using unary and pairwise interactions
51 cad = {’covADD’,{[1,2],’covSEiso’}};
52 % preference covariance with squared exponential base covariance
53 cpr = {’covPref’,{’covSEiso’}}; hyppr = [0;0];
54 xp = randn(n,2*D); xsp = randn(3,2*D);
55
56 % 0) specify a covariance function
57 % cov = cma; hyp = hypma;
58 % cov = cci; hyp = hypcc;
59 % cov = csm; hyp = hypsm;
60 cov = cds; hyp = hypds; x = xd; xs = xsd;
61 % cov = cfa; hyp = hypfa;
62 % cov = cvl; hyp = hypvl;
63 % cov = cpr; hyp = hyppr; x = xp; xs = xsp;
64
65 % 1) query the number of parameters
66 feval(cov{:})
67
68 % 2) evaluate the function on x
69 [K,dK] = feval(cov{:},hyp,x)
70
71 % 3) evaluate the function on x and xs to get cross -terms
72 [kss,dkss] = feval(cov{:},hyp,xs,’diag’)
73 [Ks, dKs ] = feval(cov{:},hyp,x,xs)
7 Hyperpriors
A hyperprior p(θ) with θ = [ρ, φ, ψ] is a joint probability distribution over the likelihood hyperparameters
ρ, the mean hyperparameters φ and the covariance hyperparameters ψ. We concentrate
on factorial priors p(θ) = Πj pj(θj). Hyperpriors can be used to regularise the optimisation of the
hyperparameters via the marginal likelihood Z(θ) so that p(θ)Z(θ) is maximised instead. As we
wish to perform unconstrained optimisation, we require (mainly) smooth hyperpriors with infinite
support.
7.1 Interface
In the GPML toolbox, a prior distribution p(θ) needs to implement the evaluation of the log density
ln p(θ) and its first derivative ∂ln p(θ)/∂θ. In addition, we require sampling capabilities, i.e. the
generation of θ ∼ p(θ).
46 ⟨priorDistributions.m 46⟩≡
1% prior distributions to be used for hyperparameters of Gaussian processes
2% using infPrior.
3% There are two different kinds of prior distributions: simple and composite:
4%
5% simple prior distributions:
6%
7% priorGauss - univariate Gaussian
8% priorLaplace - univariate Laplace
9% priorT - univariate Student ’s t
10 %
11 % priorSmoothBox1 - univariate interval (linear decay in log domain)
12 % priorSmoothBox2 - univariate interval (quadr. decay in log domain)
13 %
14 % priorGamma - univariate Gamma, IR+
15 % priorWeibull - univariate Weibull , IR+
16 % priorInvGauss - univariate Inverse Gaussian , IR+
17 % priorLogNormal - univariate Log-normal, IR+
18 %
19 % priorClamped or - fix hyperparameter to its current value by setting
20 % priorDelta derivatives to zero, no effect on marginal likelihood
21 %
22 % priorGaussMulti - multivariate Gauss
23 % priorLaplaceMulti - multivariate Laplace
24 % priorTMulti - multivariate Student ’s t
25 %
26 % priorClampedMulti or - fix hyperparameter to its current value by setting
27 % priorDeltaMulti derivatives to zero, no effect on marginal likelihood
28 %
29 % priorEqualMulti or - make several hyperparameters have the same value by
30 % priorSameMulti same derivative , no effect on marginal likelihood
31 %
32 % composite prior distributions (see explanation at the bottom):
33 %
34 % priorMix - nonnegative mixture of priors
35 % priorTransform - prior on g(t) rather than t
36 %
37 % Naming convention: all prior distributions are named "prior/prior*.m".
38 %
39 %
40 % 1) With only a fixed input arguments:
41 %
42 % r = priorNAME(par1,par2,parN)
43 %
44 % The function returns a random sample from the distribution for e.g.
45 % random restarts, simulations or optimisation initialisation.
46 %
47 % 2) With one additional input arguments:
48 %
49 % [lp,dlp] = priorNAME(par1,par2,parN, t)
50 %
51 % The function returns the log density at location t along with its first
52 % derivative.
53 %
54 % See also doc/usagePrior.m, inf/infPrior.m.
55 %
56 ⟨gpml copyright 5a⟩
7.2 Implemented Hyperpriors
All code files are named according to the pattern prior/prior<NAME>.m for simple identification. All
currently available hyperpriors are summarised in the following table.
Simple hyperpriors p(θ)

Univariate hyperpriors defined over the whole real line with mean µ and variance σ²
  Gauss       normally distributed hyperparameter θ ∈ R;     p(θ) = 1/(σ√2π) exp( −(θ−µ)²/(2σ²) );   τ = {µ ∈ R, σ² ∈ R+}
  Laplace     double exponentially distributed hyperparameter θ ∈ R;   p(θ) = 1/(2b) exp( −|θ−µ|/b ), b = σ/√2;   τ = {µ ∈ R, σ² ∈ R+}
  T           Student's t distributed hyperparameter θ ∈ R;  p(θ) = Γ((ν+1)/2)/Γ(ν/2) · 1/(σ√((ν−2)π)) · ( 1 + (θ−µ)²/((ν−2)σ²) )^(−(ν+1)/2);   τ = {µ ∈ R, σ², ν ∈ R+}

Univariate hyperpriors with effectively bounded support but defined over the whole real line
  SmoothBox1  interval hyperparameter θ ∈ R, i.e. θ approximately in [a,b];
              p(θ) = (1 − exp(η(a−b)))/(b−a) · 1/(1 + exp(−η(θ−a))) · 1/(1 + exp(η(θ−b)));   τ = {a ≤ b ∈ R, η ∈ R+};
              µ = (a+b)/2,  σ² = (w²/12) (1 − exp(−2g))/(1 + π²/g²),  g = wη/2,  w = |a−b|
  SmoothBox2  localised hyperparameter θ ∈ R, i.e. θ approximately in [a,b];
              p(θ) = 1/((1/η + 1)(b−a)) · { N(θ|a,σab²) for θ < a;  1 for θ ∈ [a,b];  N(θ|b,σab²) for θ > b },  σab = (b−a)/(η√2π);   τ = {a ≤ b ∈ R, η ∈ R+};
              µ = (a+b)/2,  σ² = (w²/4) (η³/3 + η² + 4η/π + 2/π)/(η³ + η²),  w = |a−b|

Univariate hyperpriors supported only over the positive reals
  Gamma        Gamma hyperparameter θ ∈ R+;             p(θ) = 1/(Γ(k) t^k) θ^(k−1) exp(−θ/t);     τ = {k ∈ R+, t ∈ R+}
  Weibull      Weibull hyperparameter θ ∈ R+;           p(θ) = (k/λ) (θ/λ)^(k−1) exp(−(θ/λ)^k);    τ = {k ∈ R+, λ ∈ R+}
  InverseGauss inverse Gaussian hyperparameter θ ∈ R+;  p(θ) = 1/√(2πθ³/λ) exp( −λ(θ−µ)²/(2µ²θ) ); τ = {µ ∈ R+, λ ∈ R+}
  LogNormal    log-normal hyperparameter θ ∈ R+;        p(θ) = 1/(θσ√2π) exp( −(ln θ − µ)²/(2σ²) );τ = {µ ∈ R, σ² ∈ R+}

Multivariate hyperpriors supported over the whole R^D with mean µ and covariance Σ
  GaussMulti   multivariate normal distribution θ ∈ R^D;       p(θ) = |2πΣ|^(−1/2) exp( −½ (θ−µ)ᵀΣ⁻¹(θ−µ) );   τ = {µ ∈ R^D, Σ ∈ R^(D×D)}
  LaplaceMulti multivariate Laplace distribution θ ∈ R^D;      p(θ) = |√2 Σ|^(−1/2) exp( −√2 ‖L⁻¹(θ−µ)‖1 ), LᵀL = Σ;   τ = {µ ∈ R^D, Σ ∈ R^(D×D)}
  TMulti       multivariate Student's t distribution θ ∈ R^D;  p(θ) = |(ν−2)πΣ|^(−1/2) Γ((ν+D)/2)/Γ(ν/2) ( 1 + (θ−µ)ᵀΣ⁻¹(θ−µ)/(ν−2) )^(−(ν+D)/2);   τ = {µ ∈ R^D, Σ ∈ R^(D×D), ν ∈ R}

Improper hyperpriors used to fix the value of a particular hyperparameter
  Delta / Clamped            clamped hyperparameter θ = θ0 ∈ R;     p(θ) = δ(θ − θ0);              τ = ∅
  DeltaMulti / ClampedMulti  clamped hyperparameter θ = θ0 ∈ R^D;   p(θ) = δ(θ − θ0);              τ = ∅
  SameMulti / EqualMulti     same hyperparameter θ = θ0·1 ∈ R^D;    p(θ) = δ( Π_{i=1}^D (θ0 − θi) );   τ = ∅

Composite hyperpriors [π1(θ), π2(θ), ..] ↦ p(θ)
  Transform   prior distribution on g(θ) rather than θ;     p(θ) = π(g(θ));       τ = {g}
  Mix         mixture distribution;                         p(θ) = Σi wi πi(θ);   τ = {w}
The priorSmoothBox2 is a Gauss-uniform sandwich obtained by complementing a uniform distribution
on [a,b] with two Gaussian halves at each side. The parameter η balances the probability mass
between the constituents so that η/(η+1) is used for the box and 1/(η+1) for the Gaussian sides.
Its brother priorSmoothBox1 is the product of two sigmoidal functions.
The priorDelta or equivalently priorClamped can be used to exclude some hyperparameters from
the optimisation. Their values are clamped to θ0 and the derivative vanishes. There are also multivariate
counterparts priorDeltaMulti and priorClampedMulti.
7.3 Usage of Implemented Hyperpriors
Some code examples taken from doc/usagePrior.m illustrate how to use univariate, multivariate
and composite priors on hyperparameters. Syntactically, a hyperprior hp is defined by
func := Dist // prior distributions in prior/
| Clamped | Delta // predefined for fixing the hyperparameter
pr := ’func’ | @func // univariate hyperprior
| ’funcMulti’ | @funcMulti // multivariate hyperprior
hp := {pr} | {pr, {param, hp}} | {pr, {hp, .., hp}} // composite
i.e., it is either a string containing the name of a hyperprior function, a pointer to a hyperprior
function, or one of the former in combination with a cell array of hyperprior functions and an additional
list of parameters. Furthermore, we have multivariate hyperprior variants and two (equivalent)
predefined hyperpriors that allow excluding variables from the optimisation.
49 ⟨doc/usagePrior.m 49⟩≡
1% demonstrate usage of prior distributions
2%
3% See also priorDistributions.m.
4%
5 ⟨gpml copyright 5a⟩
6clear all,close all
7
8% 1) specify some priors
9% a) univariate priors
10 mu = 1.0; s2 = 0.01^2; nu = 3;
11 pg = {@priorGauss ,mu,s2}; % Gaussian prior
12 pl = {’priorLaplace’,mu,s2}; % Laplace prior
13 pt = {@priorT ,mu,s2,nu}; % Student’s t prior
14 p1 = {@priorSmoothBox1 ,0,3,15}; % smooth box constraints lin decay
15 p2 = {@priorSmoothBox2 ,0,2,15}; % smooth box constraints qua decay
16 pd = {’priorDelta’}; % fix value of prior exclude from optimisation
17 pc = {@priorClamped}; % equivalent to above
18 lam = 1.05; k = 2.5;
19 pw = {@priorWeibull ,lam ,k}; % Weibull prior
20
21 % b) meta priors
22 pmx = {@priorMix ,[0.5,0.5],{pg,pl}}; % mixture of two priors
23 g = @exp; dg = @exp; ig = @log;
24 ptr = {@priorTransform ,g,dg ,ig,pg}; % Gaussian in the exp domain
25
26 % c) multivariate priors
27 m = [1;2]; V = [2,1;1,2];
28 pG = {@priorGaussMulti ,m,V}; % 2d Gaussian prior
29 pD = {’priorDeltaMulti’}; % fix value of prior exclude from optim
30 pC = {@priorClampedMulti}; % equivalent to above
31
32 % 2) evaluation
33 % pri = pt; hp = randn(1,3);
34 % pri = pmx; hp = randn(1,3);
35 % pri = ptr; hp = randn(1,3);
36 pri = pG; hp = randn(2,3);
37
38 % a) draw a sample from the prior
39 feval(pri{:})
40
41 % b) evaluate prior and derivative if requires
42 [lp,dlp] = feval(pri{:},hp)
43
44 % 3) comprehensive example
45 x = (0:0.1:10)’; y = 2*x+randn(size(x)); % generate training data
46 mean = {@meanSum ,{@meanConst ,@meanLinear }}; % specify mean function
47 cov = {@covSEiso}; lik = {@likGauss}; % specify covariance and lik
48 hyp.cov = [log(1);log(1.2)]; hyp.lik = log(0.9); hyp.mean = [2;3];
49 par = {mean,cov,lik,x,y}; mfun = @minimize; % input for GP function
50
51 % a) plain marginal likelihood optimisation (maximum likelihood)
52 im = @infExact; % inference method
53 hyp_plain = feval(mfun, hyp, @gp, -10, im, par{:}); % optimise
54
55 % b) regularised optimisation (maximum a posteriori) with 1d priors
56 prior.mean = {pg;pc}; % Gaussian prior for first, clamp second par
57 prior.cov = {p1;[]}; % box prior for first , nothing for second par
58 im = {@infPrior,@infExact,prior}; % inference method
59 hyp_p1 = feval(mfun, hyp, @gp, -10, im, par{:}); % optimise
60
61 % c) regularised optimisation (maximum a posteriori) with Nd priors
62 prior = []; % clear the structure
63 % multivariate Student ’s t prior on the first and second mean hyper
64 prior.multi{1} = {@priorTMulti ,[mu;mu],diag([s2,s2]),nu,...
65 struct(’mean’ ,[1,2])}; % use hyper struct
66 % Equivalent shortcut (same mu and s2 for all dimensions)
67 prior.multi {1} = {@priorTMulti ,mu ,s2,nu ,struct(’mean’,[1,2])};
68 % multivariate Gaussian prior jointly on 1st and 3rd hyper
69 prior.multi{2} = {@priorGaussMulti ,[mu;mu],diag([s2,s2]),...
70 [1,3]}; % use unwrapped hyper vector
71 % Equivalent shortcut (same mu and s2 for all dimensions)
72 prior.multi{2} = { @priorGaussMulti ,mu,s2 ,[1 ,3]};
73 im = {@infPrior,@infExact,prior}; % inference method
74 hyp_pN = feval(mfun, hyp, @gp, -10, im, par{:}); % optimise
75
76 [any2vec(hyp), any2vec(hyp_plain), any2vec(hyp_p1), any2vec(hyp_pN)]