The GPML Toolbox version 4.2
Carl Edward Rasmussen & Hannes Nickisch
June 15, 2018
Abstract
The GPML toolbox is an Octave 3.2.x and Matlab 7.x implementation of inference and prediction in Gaussian process (GP) models. It implements algorithms discussed in Rasmussen & Williams: Gaussian Processes for Machine Learning, The MIT Press, 2006 and Nickisch & Rasmussen: Approximations for Binary Gaussian Process Classification, JMLR, 2008.
The strength of the function lies in its flexibility, simplicity and extensibility. The function is flexible in that, firstly, it allows the properties of the GP to be specified through the definition of mean and covariance functions. Secondly, it allows specification of different inference procedures, e.g. exact inference and Expectation Propagation (EP). Thirdly, it allows specification of likelihood functions, e.g. Gaussian or Laplace (for regression) and cumulative Logistic (for classification). Simplicity is achieved through a single function and compact code. Extensibility is ensured by a modular design allowing easy addition of extensions to the already fairly extensive libraries of inference methods, mean functions, covariance functions and likelihood functions.
This document is a technical manual for developers and contains many details. If you are not yet familiar with the GPML toolbox, the user documentation and the examples therein are a better way to get started.
Contents
1 Gaussian Process Training and Prediction  3
2 The gp Function  4
3 Inference Methods  9
  3.1 Exact Inference with Gaussian likelihood  10
  3.2 Laplace's Approximation  11
  3.3 Expectation Propagation  11
  3.4 Kullback Leibler Divergence Minimisation  12
  3.5 Variational Bayes  13
  3.6 Compatibility Between Inference Methods and Covariance Approximations  13
  3.7 Sparse Covariance Approximations  14
  3.8 Grid-Based Covariance Approximations  14
  3.9 State Space Representation of GPs  15
4 Likelihood Functions  18
  4.1 Prediction  18
  4.2 Interface  19
  4.3 Implemented Likelihood Functions  21
  4.4 Usage of Implemented Likelihood Functions  22
  4.5 Compatibility Between Likelihoods and Inference Methods  23
  4.6 Gaussian Likelihood  23
    4.6.1 Exact Inference  25
    4.6.2 Laplace's Approximation  25
    4.6.3 Expectation Propagation  25
    4.6.4 Variational Bayes  25
  4.7 Warped Gaussian Likelihood  26
  4.8 Gumbel Likelihood  27
  4.9 Laplace Likelihood  27
  4.10 Student's t Likelihood  30
  4.11 Cumulative Logistic Likelihood  31
  4.12 GLM Likelihoods: Poisson, Negative Binomial, Weibull, Gamma, Exponential, Inverse Gaussian and Beta  32
    4.12.1 Inverse Link Functions  32
    4.12.2 Poisson Likelihood  34
    4.12.3 Weibull Likelihood  34
    4.12.4 Gamma Likelihood  34
    4.12.5 Exponential Likelihood  35
    4.12.6 Inverse Gaussian Likelihood  35
    4.12.7 Beta Likelihood  35
5 Mean Functions  36
  5.1 Interface  36
  5.2 Implemented Mean Functions  37
  5.3 Usage of Implemented Mean Functions  37
6 Covariance Functions  39
  6.1 Interface  39
  6.2 Implemented Covariance Functions  43
  6.3 Usage of Implemented Covariance Functions  44
1 Gaussian Process Training and Prediction
The gpml toolbox contains a single user function gp described in section 2. In addition there are a
number of supporting structures and functions which the user needs to know about, as well as an
internal convention for representing the posterior distribution, which may not be of direct interest to
the casual user.
Inference Methods: An inference method is a function which computes the (approximate) posterior, the (approximate) negative log marginal likelihood and its partial derivatives w.r.t. the hyperparameters, given a model specification (i.e., GP mean and covariance functions and a likelihood function) and a data set. Inference methods are discussed in section 3. A new inference method requires a function providing the desired inference functionality and possibly extra functionality in the applicable likelihood functions.
Hyperparameters: The hyperparameters are collected in a struct controlling the properties of the model, i.e. the GP mean and covariance function and the likelihood function. The struct has the three fields mean, cov and lik, each of which is a vector. The number of elements in each field must agree with the number of hyperparameters in the specification of the three functions they control (see below). If a field is either empty or non-existent it represents zero hyperparameters. When working with FITC approximate inference, the inducing inputs xu can also be treated as hyperparameters for some common stationary covariances. An example hyperparameter struct is sketched below.
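As a concrete illustration, a minimal sketch of such a struct (the particular functions and values are illustrative assumptions, not defaults of the toolbox):

  % hyperparameters matching {@meanZero}, {@covSEiso} and {@likGauss}
  hyp = struct();
  hyp.mean = [];                      % meanZero has no hyperparameters
  hyp.cov  = [log(1.0); log(1.0)];    % covSEiso: log length-scale, log signal std dev
  hyp.lik  = log(0.1);                % likGauss: log noise standard deviation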
Hyperparameter Prior Distributions: When optimising the marginal likelihood w.r.t. the hyperparameters, it is sometimes useful to softly constrain the hyperparameters by means of prior knowledge. A prior is a probability distribution over an individual hyperparameter or over a group of hyperparameters; see section 7.
Likelihood Functions: The likelihood function specifies the form of the likelihood of the GP model
and computes terms needed for prediction and inference. For inference, the required properties
of the likelihood depend on the inference method, including properties necessary for hyperpa-
rameter learning, section 4.
Mean Functions: The mean function is a cell array specifying the GP mean. It computes the mean and its derivatives w.r.t. the part of the hyperparameters pertaining to the mean. The cell array allows flexible specification and composition of mean functions, discussed in section 5. The default is the zero function.
Covariance Functions: The covariance function is a cell array specifying the GP covariance function. It computes the covariance and its derivatives w.r.t. the part of the hyperparameters pertaining to the covariance function. The cell array allows flexible specification and composition of covariance functions, discussed in section 6; a small example of such a composite specification is sketched below.
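For example, composite functions are built by nesting cell arrays. A small sketch (the particular components chosen here are only illustrative):

  % sum of a linear and a constant mean; Matern covariance plus independent noise
  mean = {@meanSum, {@meanLinear, @meanConst}};      % hyp.mean = [a_1;..;a_D; c]
  cov  = {@covSum, {{@covMaterniso,3}, @covNoise}};  % hyp.cov = [log(ell); log(sf); log(s)]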
Inference methods, see section 3, compute (among other things) an approximation to the posterior distribution of the latent variables $f_i$ associated with the training cases, $i = 1, \dots, n$. This approximate posterior is assumed to be Gaussian, and is communicated via a struct post with the fields post.alpha, post.sW and post.L. Often, starting from the Gaussian prior $p(\mathbf{f}) = \mathcal{N}(\mathbf{f}|\mathbf{m}, K)$, the approximate posterior admits the form
$$ q(\mathbf{f}|\mathcal{D}) = \mathcal{N}\big(\mathbf{f}\,\big|\,\boldsymbol{\mu} = \mathbf{m} + K\boldsymbol{\alpha},\; V = (K^{-1} + W)^{-1}\big), \quad \text{where } W \text{ is diagonal with } W_{ii} = s_i^2. \qquad (1) $$
In such cases, the entire posterior can be computed from the two vectors post.alpha and post.sW; the inference method may optionally also return $L = \operatorname{chol}(\operatorname{diag}(\mathbf{s})\,K\,\operatorname{diag}(\mathbf{s}) + I)$.
If, on the other hand, the posterior does not admit the above form, then post.L returns the matrix $L = -(K + W^{-1})^{-1}$ (and post.sW is unused). In addition, a sparse representation of the posterior may be used, in which case the non-zero elements of the post.alpha vector indicate the active entries.
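For orientation, a minimal sketch of how the latent predictive moments follow from this representation in the Cholesky case (Ks, kss and ms denote hypothetical test cross-covariances, test self-variances and test mean values; the actual code is part of the gp function below):

  % latent test moments from post.alpha, post.sW and triangular post.L (sketch)
  fmu = ms + Ks'*post.alpha;                         % predictive latent means
  V   = post.L'\(bsxfun(@times, post.sW, Ks));
  fs2 = kss - sum(V.*V,1)';                          % predictive latent variances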
2 The gp Function
The gp function is typically the only function the user would directly call.
⟨gp.m 4a⟩≡
function [varargout] = gp(hyp, inf, mean, cov, lik, x, y, xs, ys)
  ⟨gp function help 4b⟩
  ⟨initializations 5b⟩
  ⟨inference 6c⟩
  if nargin==7                               % if no test cases are provided
    varargout = {nlZ, dnlZ, post};  % report -log marg lik, derivatives and post
  else
    ⟨compute test predictions 7⟩
  end
It offers facilities for training the hyperparameters of a GP model as well as for making predictions at unseen inputs, as detailed in the following help.
⟨gp function help 4b⟩≡ (4a)
1% Gaussian Process inference and prediction. The gp function provides a
2% flexible framework for Bayesian inference and prediction with Gaussian
3% processes for scalar targets , i.e. both regression and binary
4% classification. The prior is Gaussian process , defined through specification
5% of its mean and covariance function. The likelihood function is also
6% specified. Both the prior and the likelihood may have hyperparameters
7% associated with them.
8%
9% Two modes are possible: training or prediction: if no test cases are
10 % supplied , then the negative log marginal likelihood and its partial
11 % derivatives w.r.t. the hyperparameters is computed; this mode is used to fit
12 % the hyperparameters. If test cases are given, then the test set predictive
13 % probabilities are returned. Usage:
14 %
15 % training: [nlZ dnlZ ] = gp(hyp, inf, mean, cov, lik, x, y);
16 % prediction: [ymu ys2 fmu fs2 ] = gp(hyp, inf, mean , cov, lik, x, y, xs);
17 % or: [ymu ys2 fmu fs2 lp] = gp(hyp, inf, mean , cov, lik, x, y, xs, ys);
18 %
19 % where:
20 %
21 % hyp struct of column vectors of mean/cov/lik hyperparameters
22 % inf function specifying the inference method
23 % mean prior mean function
24 % cov prior covariance function
25 % lik likelihood function
26 % x n by D matrix of training inputs
27 % y column vector of length n of training targets
28 % xs ns by D matrix of test inputs
29 % ys column vector of length nn of test targets
30 %
31 % nlZ returned value of the negative log marginal likelihood
32 % dnlZ struct of column vectors of partial derivatives of the negative
33 % log marginal likelihood w.r.t. mean/cov/lik hyperparameters
34 % ymu column vector (of length ns) of predictive output means
35 % ys2 column vector (of length ns) of predictive output variances
36 % fmu column vector (of length ns) of predictive latent means
37 % fs2 column vector (of length ns) of predictive latent variances
38 % lp column vector (of length ns) of log predictive probabilities
39 %
40 % post struct representation of the (approximate) posterior
41 % 3rd output in training mode or 6th output in prediction mode
42 % can be reused in prediction mode gp(.., cov, lik, x, post, xs,..)
43 %
44 % See also infMethods.m, meanFunctions.m, covFunctions.m, likFunctions.m.
45 %
46 ⟨gpml copyright 5a⟩
⟨gpml copyright 5a⟩≡ (4b 9 10 19 22 24a 36 38 39 44 46 49)
% Copyright (c) by Carl Edward Rasmussen and Hannes Nickisch, 2018-06-15.
% File automatically generated using noweb.
Depending on the number of input parameters, gp knows whether it is operated in training or in prediction mode. The high-level structure of the code is as follows: after some initialisations, we perform inference and decide whether test set predictions are needed or only the result of the inference is required.
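A typical call sequence might look as follows (a minimal sketch; the toy data and the use of the bundled minimize function are illustrative assumptions):

  % train hyperparameters by minimising the negative log marginal likelihood,
  % then predict at test inputs xs
  x = linspace(-3,3,40)'; y = sin(x) + 0.1*randn(40,1); xs = linspace(-4,4,101)';
  hyp0 = struct('mean',[], 'cov',[0;0], 'lik',log(0.1));
  hyp = minimize(hyp0, @gp, -100, @infGaussLik, [], @covSEiso, @likGauss, x, y);
  [ymu, ys2, fmu, fs2] = gp(hyp, @infGaussLik, [], @covSEiso, @likGauss, x, y, xs);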
⟨initializations 5b⟩≡ (4a)
  ⟨minimalist usage 5c⟩
  ⟨multivariate output by recursion 5d⟩
  ⟨process input arguments 6a⟩
  ⟨check hyperparameters 6b⟩
If the number of input arguments is incorrect, we echo a minimalist usage and return.
⟨minimalist usage 5c⟩≡ (5b)
if nargin<7 || nargin>9
  disp('Usage: [nlZ dnlZ          ] = gp(hyp, inf, mean, cov, lik, x, y);')
  disp('   or: [ymu ys2 fmu fs2   ] = gp(hyp, inf, mean, cov, lik, x, y, xs);')
  disp('   or: [ymu ys2 fmu fs2 lp] = gp(hyp, inf, mean, cov, lik, x, y, xs, ys);')
  return
end
If there is more than a single output dimension in y, we call multiple instances of gp with shared
hyperparameters and settings each computing the corresponding result with scalar output.
⟨multivariate output by recursion 5d⟩≡ (5b)
1if size(y,2)>1 % deal with (independent) multivariate output y by recursing
2d = size(y,2); varargout = cell(nargout,1); out = cell(nargout,1); % allocate
3for i=1:d
4in = {hyp, inf, mean, cov, lik, x, y(:,i)};
5if nargin>7, in = {in{:}, xs}; end
6if nargin>8, in = {in{:}, ys(:,i)}; end
7if i==1, [varargout{:}] = gp(in{:}); % perform inference for dimension ..
8else [out{:}] = gp(in{:}); % .. number i in the output
9if nargin==7, no = 2;
10 varargout{1} = varargout{1} + out{1}; % sum nlZ
11 if nargout>1 % sum dnlZ
12 varargout{2} = vec2any(hyp,any2vec(varargout{2})+any2vec(out{2}));
13 end
14 else no = 5; % concatenate ymu ys2 fmu fs2 lp
15 for j=1:min(nargout,no), varargout{j} = [varargout{j},out{j}]; end
16 end
17 if nargout>no % concatenate post
18 if i==2, varargout{no+1} = {varargout{no+1},out{no+1}};
19 else varargout{no+1} = {varargout{no+1}{:},out{no+1}}; end
20 end
21 end
22 end,return % return to end the recursion
23 end
Next, we set some useful default values for empty arguments, and convert inf and lik to function
handles and mean and cov to cell arrays if necessary. Initialize variables.
⟨process input arguments 6a⟩≡ (5b)
1if isempty(mean), mean = {@meanZero}; end % set default mean
2if ischar(mean) || isa(mean, ’function_handle’), mean = {mean}; end % make cell
3if isempty(cov), error(’Covariance function cannot be empty’); end % no default
4if ischar(cov) || isa(cov,’function_handle’), cov = {cov}; end % make cell
5cstr = cov{1}; if isa(cstr ,’function_handle’), cstr = func2str(cstr); end
6if (strcmp(cstr,’covFITC’) || strcmp(cstr,’apxSparse’)) && isfield(hyp,’xu’)
7cov{3} = hyp.xu; %use hyp.xu
8end
9if isempty(inf), inf = {@infGaussLik}; end % set default inference method
10 if ischar(inf), inf = str2func(inf); end % convert into function handle
11 if ischar(inf) || isa(inf,’function_handle’), inf = {inf}; end % make cell
12 istr = inf{1}; if isa(istr ,’function_handle’), istr = func2str(istr); end
13 if strcmp(istr,’infPrior’)
14 istr = inf{2}; if isa(istr ,’function_handle’), istr = func2str(istr); end
15 end
16 if isempty(lik), lik = {@likGauss}; end % set default lik
17 if ischar(lik) || isa(lik,’function_handle’), lik = {lik}; end % make cell
18 lstr = lik{1}; if isa(lstr ,’function_handle’), lstr = func2str(lstr); end
19
20 D = size(x,2);
21 if strncmp(cstr,’covGrid’,7) || strcmp(cstr,’apxGrid’)% only some inf* possible
22 D = 0; xg = cov{3}; p = numel(xg); for i=1:p, D = D+size(xg{i},2); end % dims
23 end
Check that the sizes of the hyperparameters supplied in hyp match the sizes expected. The three parts hyp.mean, hyp.cov and hyp.lik are checked separately, and empty entries are defined if they do not exist.
⟨check hyperparameters 6b⟩≡ (5b)
1if ~isfield(hyp,’mean’), hyp.mean = []; end % check the hyp specification
2if eval(feval(mean{:})) ~= numel(hyp.mean)
3error(’Number of mean function hyperparameters disagree with mean function’)
4end
5if ~isfield(hyp,’cov’), hyp.cov = []; end
6if eval(feval(cov{:})) ~= numel(hyp.cov)
7error(’Number of cov function hyperparameters disagree with cov function’)
8end
9if ~isfield(hyp,’lik’), hyp.lik = []; end
10 if eval(feval(lik{:})) ~= numel(hyp.lik)
11 error(’Number of lik function hyperparameters disagree with lik function’)
12 end
Inference is performed by calling the desired inference method inf. In training mode, we accept a
failure of the inference method (and issue a warning), since during hyperparameter learning, hyper-
parameters causing a numerical failure may be attempted, but the minimize function may gracefully
recover from this. During prediction, failure of the inference method is an error.
⟨inference 6c⟩≡ (4a)
1try % call the inference method
2% issue a warning if a classification likelihood is used in conjunction with
3% labels different from +1 and -1
4if strcmp(lstr,’likErf’) || strcmp(lstr,’likLogistic’)
5if ~isstruct(y)
6uy = unique(y);
7if any( uy~=+1 & uy~=-1 )
8warning(’You try classification with labels different from {+1,-1}’)
9end
10 end
11 end
12 if nargin>7 % compute marginal likelihood and its derivatives only if needed
13 if isstruct(y)
14 post = y; % reuse a previously computed posterior approximation
15 else
16 post = feval(inf{:}, hyp, mean, cov, lik, x, y);
17 end
18 else
19 if nargout<=1
20 [post nlZ] = feval(inf{:}, hyp, mean, cov, lik, x, y); dnlZ = {};
21 else
22 [post nlZ dnlZ] = feval(inf{:}, hyp, mean, cov, lik, x, y);
23 end
24 end
25 catch
26 msgstr = lasterr;
27 if nargin>7, error(’Inference method failed [%s]’, msgstr); else
28 warning(’Inference method failed [%s] .. attempting to continue’,msgstr)
29 varargout = {NaN, vec2any(hyp,zeros(numel(any2vec(hyp)),1))}; return % go on
30 end
31 end
We copy the already computed negative log marginal likelihood to the first output argument, and if
desired report its partial derivatives w.r.t. the hyperparameters if running in inference mode.
Predictions are computed in a loop over small batches to avoid memory problems for very large test
sets.
⟨compute test predictions 7⟩≡ (4a)
1alpha = post.alpha; L = post.L; sW = post.sW;
2if issparse(alpha) % handle things for sparse representations
3nz = alpha ~= 0; % determine nonzero indices
4if issparse(L), L = full(L(nz,nz)); end % convert L and sW if necessary
5if issparse(sW), sW = full(sW(nz)); end
6else nz = true(size(alpha ,1) ,1); end % non-sparse representation
7if isempty(L) % in case L is not provided, we compute it
8K = feval(cov{:}, hyp.cov, x(nz,:));
9L = chol(eye(sum(nz))+sW*sW’.*K);
10 end
11 %verify whether L contains valid Cholesky decomposition or something different
12 Lchol = isnumeric(L) && all(all(tril(L,-1)==0)&diag(L)’>0&isreal(diag(L))’);
13 ns = size(xs,1); % number of data points
14 if strncmp(cstr,’apxGrid’,7), xs = apxGrid(’idx2dat’,cov{3},xs); end % expand
15 nperbatch = 1000; % number of data points per mini batch
16 nact = 0; % number of already processed test data points
17 ymu = zeros(ns,1); ys2 = ymu; fmu = ymu; fs2 = ymu; lp = ymu; % allocate mem
18 while nact<ns % process minibatches of test cases to save memory
19 id = (nact+1):min(nact+nperbatch ,ns); % data points to process
20 ⟨make predictions 8⟩
21 nact = id(end); % set counter to index of last processed data point
22 end
23 if nargin<9
24 varargout = {ymu, ys2, fmu, fs2, [], post}; % assign output arguments
25 else
26 varargout = {ymu, ys2, fmu, fs2, lp, post};
27 end
In every iteration of the above loop, we compute the predictions for all test points of the batch.
⟨make predictions 8⟩≡ (7)
1kss = feval(cov{:}, hyp.cov, xs(id,:), ’diag’); % self-variance
2if strcmp(cstr,’covFITC’) || strcmp(cstr,’apxSparse’)% cross-covariances
3Ks = feval(cov{:}, hyp.cov, x, xs(id,:)); Ks = Ks(nz,:); % res indep. of x
4else
5Ks = feval(cov{:}, hyp.cov, x(nz,:), xs(id,:)); % avoid computation
6end
7ms = feval(mean{:}, hyp.mean, xs(id,:));
8N = size(alpha ,2); % number of alphas (usually 1; more in case of sampling)
9Fmu = repmat(ms,1,N) + Ks’*full(alpha(nz,:)); % conditional mean fs|f
10 fmu(id) = sum(Fmu ,2)/N; % predictive means
11 if Lchol % L contains chol decomp => use Cholesky parameters (alpha,sW,L)
12 V = L’\(repmat(sW,1,length(id)).*Ks);
13 fs2(id) = kss - sum(V.*V,1)’; % predictive variances
14 else % L is not triangular => use alternative parametrisation
15 if isnumeric(L), LKs = L*Ks; else LKs = L(Ks); end % matrix or callback
16 fs2(id) = kss + sum(Ks.*LKs ,1)’; % predictive variances
17 end
18 fs2(id) = max(fs2(id),0); % remove numerical noise i.e. negative variances
19 Fs2 = repmat(fs2(id),1,N); % we have multiple values in case of sampling
20 if nargin<9
21 [Lp, Ymu, Ys2] = feval(lik{:},hyp.lik ,[], Fmu(:),Fs2(:));
22 else
23 Ys = repmat(ys(id),1,N);
24 [Lp, Ymu, Ys2] = feval(lik{:},hyp.lik,Ys(:),Fmu(:),Fs2(:));
25 end
26 lp(id) = sum(reshape(Lp, [],N),2)/N; % log probability; sample averaging
27 ymu(id) = sum(reshape(Ymu,[],N),2)/N; % predictive mean ys|y and ..
28 ys2(id) = sum(reshape(Ys2,[],N),2)/N; % .. variance
3 Inference Methods
Inference methods are responsible for computing the (approximate) posterior post, the (approximate) negative log marginal likelihood nlZ and its partial derivatives dnlZ w.r.t. the hyperparameters hyp. The arguments to the function are the hyperparameters hyp, mean function mean, covariance function cov, likelihood function lik and training data x and y. Several inference methods are implemented and described in this section.
⟨infMethods.m 9⟩≡
1% Inference methods: Compute the (approximate) posterior for a Gaussian process.
2% Methods currently implemented include:
3%
4% infGaussLik Exact inference (only possible with Gaussian likelihood)
5% infLaplace Laplace’s Approximation
6% infEP Expectation Propagation
7% infVB Variational Bayes Approximation
8% infKL Kullback -Leibler optimal Approximation
9%
10 % infMCMC Markov Chain Monte Carlo and Annealed Importance Sampling
11 % We offer two samplers.
12 % - hmc: Hybrid Monte Carlo
13 % - ess: Elliptical Slice Sampling
14 % No derivatives w.r.t. to hyperparameters are provided.
15 %
16 % infLOO Leave -One-Out predictive probability and Least -Squares Approxim.
17 % infPrior Perform inference with hyperparameter prior.
18 %
19 % The interface to the approximation methods is the following:
20 %
21 % function [post nlZ dnlZ] = inf..(hyp, mean, cov, lik, x, y)
22 %
23 % where:
24 %
25 % hyp is a struct of hyperparameters
26 % mean is the name of the mean function (see meanFunctions.m)
27 % cov is the name of the covariance function (see covFunctions.m)
28 % lik is the name of the likelihood function (see likFunctions.m)
29 % x is a n by D matrix of training inputs
30 % y is a (column) vector (of size n) of targets
31 %
32 % nlZ is the returned value of the negative log marginal likelihood
33 % dnlZ is a (column) vector of partial derivatives of the negative
34 % log marginal likelihood w.r.t. each hyperparameter
35 % post struct representation of the (approximate) posterior containing
36 % alpha is a (sparse or full column vector) containing inv(K)*(mu-m),
37 % where K is the prior covariance matrix, m the prior mean,
38 % and mu the approx posterior mean
39 % sW is a (sparse or full column) vector containing diagonal of sqrt(W)
40 % the approximate posterior covariance matrix is inv(inv(K)+W)
41 % L is a (sparse or full) triangular matrix , L = chol(sW*K*sW+eye(n)),
42 % or a full matrix , L = -inv(K+inv(W)),
43 % or a function L(A) of a matrix A such that -(K+inv(W))*L(A) = A
44 %
45 % Usually, the approximate posterior to be returned admits the form
46 % N(mu=m+K*alpha, V=inv(inv(K)+W)), where alpha is a vector and W is diagonal.
47 %
48 % For more information on the individual approximation methods and their
49 % implementations , see the separate inf??.m files. See also gp.m.
50 %
51 ⟨gpml copyright 5a⟩
Not all inference methods are compatible with all likelihood functions; e.g. exact inference is only possible with a Gaussian likelihood. In order to perform inference, each method needs various properties of the likelihood functions, see section 4.
3.1 Exact Inference with Gaussian likelihood
For Gaussian likelihoods, GP inference reduces to computing the mean and covariance of a multivariate Gaussian, which can be done exactly by simple matrix algebra. The program inf/infGaussLik.m does exactly this. If it is called with a likelihood function other than the Gaussian, it issues an error. The Gaussian posterior $q(\mathbf{f}|\mathcal{D}) = \mathcal{N}(\mathbf{f}|\boldsymbol{\mu}, V)$ is exact.
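For reference, the quantities computed here are the standard expressions (Rasmussen & Williams, ch. 2); with likelihood $\mathcal{N}(y_i|f_i,\sigma_n^2)$,
$$ \boldsymbol{\alpha} = (K+\sigma_n^2 I)^{-1}(\mathbf{y}-\mathbf{m}), \qquad -\log Z = \tfrac{1}{2}(\mathbf{y}-\mathbf{m})^\top\boldsymbol{\alpha} + \tfrac{1}{2}\log\big|K+\sigma_n^2 I\big| + \tfrac{n}{2}\log 2\pi, $$
and $W = \sigma_n^{-2} I$, so that post.sW is the constant vector $\sigma_n^{-1}\mathbf{1}$.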
⟨inf/infGaussLik.m 10⟩≡
1function [post nlZ dnlZ] = infGaussLik(hyp, mean , cov, lik, x, y, opt)
2
3% Exact inference for a GP with Gaussian likelihood.
4%
5% Compute a parametrization of the posterior , the negative log marginal
6% likelihood and its derivatives w.r.t. the hyperparameters. The function takes
7% a specified covariance function (see covFunctions.m) and likelihood function
8% (see likFunctions.m), and is designed to be used with gp.m.
9%
10 ⟨gpml copyright 5a⟩
11 %
12 % See also INFMETHODS.M, APX.M.
13
14 if nargin<7, opt = []; end % make sure parameter exists
15 if iscell(lik), likstr = lik{1}; else likstr = lik; end
16 if ~ischar(likstr), likstr = func2str(likstr); end
17 if ~strcmp(likstr ,’likGauss’)% NOTE: no explicit call to likGauss
18 error(’Exact inference only possible with Gaussian likelihood’);
19 end
20
21 [n, D] = size(x);
22 [m,dm] = feval(mean{:}, hyp.mean , x); % evaluate mean vector and deriv
23 sn2 = exp(2*hyp.lik); W = ones(n,1)/sn2; % noise variance of likGauss
24 K = apx(hyp,cov,x,opt); % set up covariance approximation
25 [ldB2,solveKiW,dW,dhyp ,post.L] = K.fun(W); % obtain functionality depending on W
26
27 alpha = solveKiW(y-m);
28 post.alpha = K.P(alpha); % return the posterior parameters
29 post.sW = sqrt(W); % sqrt of noise precision vector
30 if nargout>1 % do we want the marginal likelihood?
31 nlZ = (y-m)’*alpha/2 + ldB2 + n*log(2*pi*sn2)/2; % -log marginal likelihood
32 if nargout>2 % do we want derivatives?
33 dnlZ = dhyp(alpha); dnlZ.mean = -dm(alpha);
34 dnlZ.lik = -sn2*(alpha’*alpha) - 2*sum(dW)/sn2 + n;
35 end
36 end
3.2 Laplace's Approximation
For differentiable likelihoods, Laplace's approximation (inf/infLaplace.m) approximates the posterior by a Gaussian centered at its mode and matching its curvature.
More concretely, defining $\ell_i(f_i) = \ln p(y_i|f_i)$ and $\ell(\mathbf{f}) = \sum_{i=1}^{n}\ell_i(f_i)$, the mean of the posterior $q(\mathbf{f}|\mathcal{D}) = \mathcal{N}(\mathbf{f}|\boldsymbol{\mu}, V)$ is given by
$$ \boldsymbol{\mu} = \arg\min_{\mathbf{f}} \phi(\mathbf{f}), \quad \text{where}\quad \phi(\mathbf{f}) = \frac{1}{2}(\mathbf{f}-\mathbf{m})^\top K^{-1}(\mathbf{f}-\mathbf{m}) - \ell(\mathbf{f}) \stackrel{c}{=} -\ln[p(\mathbf{f})\,p(\mathbf{y}|\mathbf{f})], \qquad (2) $$
which we abbreviate by $\boldsymbol{\mu} \leftarrow \mathfrak{L}(\ell)$. The curvature $\frac{\partial^2\phi}{\partial\mathbf{f}\,\partial\mathbf{f}^\top} = K^{-1} + W$ with $W_{ii} = -\frac{\partial^2}{\partial f_i^2}\ln p(y_i|f_i)$ serves as precision of the Gaussian posterior approximation $V = (K^{-1}+W)^{-1}$, and the marginal likelihood $Z = \int p(\mathbf{f})\,p(\mathbf{y}|\mathbf{f})\,\mathrm{d}\mathbf{f}$ is approximated by $Z \approx Z_{LA}$, obtained by replacing $\phi$ with its second order Taylor expansion at the mode $\boldsymbol{\mu}$, i.e. $\tilde{\phi}(\mathbf{f}) = \phi(\boldsymbol{\mu}) + \frac{1}{2}(\mathbf{f}-\boldsymbol{\mu})^\top V^{-1}(\mathbf{f}-\boldsymbol{\mu}) \approx \phi(\mathbf{f})$.
Laplace's approximation needs derivatives up to third order for the mode fitting procedure (Newton's method),
$$ d_k = \frac{\partial^k}{\partial f^k}\log p(y|f), \quad k = 0, 1, 2, 3, $$
and, for hyperparameter learning, the derivatives w.r.t. the $i$th likelihood hyperparameter $\rho_i$,
$$ d_k = \frac{\partial}{\partial\rho_i}\frac{\partial^k}{\partial f^k}\log p(y|f), \quad k = 0, 1, 2, $$
evaluated at the latent location $f$ and observed value $y$. The likelihood calls (see section 4)
• [d0, d1, d2, d3] = lik(hyp, y, f, [], 'infLaplace')
and
• [d0, d1, d2] = lik(hyp, y, f, [], 'infLaplace', i)
return exactly these values.
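To make the mode fitting procedure concrete, here is a sketch of a single Newton step in the spirit of Rasmussen & Williams, Algorithm 3.1 (not the actual inf/infLaplace.m implementation; K, m, f, y, lik and hyp are assumed to be in scope and the likelihood is assumed log-concave so that W is nonnegative):

  % one Newton step towards the posterior mode (sketch)
  [lp, dlp, d2lp] = feval(lik{:}, hyp.lik, y, f, [], 'infLaplace');
  W = -d2lp; sW = sqrt(W);                       % curvature and its square root
  L = chol(eye(numel(f)) + sW*sW'.*K);           % Cholesky of B = I + sW*K*sW
  b = W.*(f-m) + dlp;
  a = b - sW.*(L\(L'\(sW.*(K*b))));
  f = K*a + m;                                   % updated mode estimate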
3.3 Expectation Propagation
The basic idea of Expectation Propagation (EP), as implemented in inf/infEP.m, is to replace the non-Gaussian likelihood terms $p(y_i|f_i)$ by Gaussian functions $t(f_i;\nu_i,\tau_i) = \exp(\nu_i f_i - \frac{1}{2}\tau_i f_i^2)$ and to adjust the natural parameters $\nu_i,\tau_i$ such that the following identity holds:
$$ \frac{1}{Z_{t,i}}\int f^k\,q_{-i}(f)\,t(f;\nu_i,\tau_i)\,\mathrm{d}f = \frac{1}{Z_{p,i}}\int f^k\,q_{-i}(f)\,p(y_i|f)\,\mathrm{d}f, \qquad k = 1, 2, $$
with the so-called cavity distributions $q_{-i}(f) = \mathcal{N}(\mathbf{f}|\mathbf{m},K)\prod_{j\neq i} t(f_j;\nu_j,\tau_j) \propto \mathcal{N}(\mathbf{f}|\boldsymbol{\mu},V)/t(f_i;\nu_i,\tau_i)$, equal to the posterior divided by the $i$th Gaussian approximation function, and the two normalisers $Z_{t,i} = \int q_{-i}(f)\,t(f_i;\nu_i,\tau_i)\,\mathrm{d}f$ and $Z_{p,i} = \int q_{-i}(f)\,p(y_i|f_i)\,\mathrm{d}f$. The moment matching corresponds to minimising the following local KL-divergence:
$$ \nu_i,\tau_i = \arg\min_{\nu,\tau}\mathrm{KL}\big[\,q_{-i}(f)\,p(y_i|f_i)/Z_{p,i}\,\big\|\,q_{-i}(f)\,t(f_i;\nu,\tau)/Z_{t,i}\,\big]. $$
In order to apply the moment matching steps in a numerically safe way, EP requires the derivatives of the expectations w.r.t. the Gaussian mean parameter $\mu$,
$$ d_k = \frac{\partial^k}{\partial\mu^k}\log\int p(y|f)\,\mathcal{N}(f|\mu,\sigma^2)\,\mathrm{d}f, \qquad k = 0, 1, 2, $$
and w.r.t. the $i$th likelihood hyperparameter $\rho_i$,
$$ d = \frac{\partial}{\partial\rho_i}\log\int p(y|f)\,\mathcal{N}(f|\mu,\sigma^2)\,\mathrm{d}f, $$
which can be obtained by the likelihood calls (see section 4)
• [d0, d1, d2] = lik(hyp, y, mu, s2, 'infEP')
and
• d = lik(hyp, y, mu, s2, 'infEP', i).
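A step that the text leaves implicit (a standard Gaussian identity, stated here for clarity): the moments of the tilted distribution follow directly from these derivatives,
$$ \hat{\mu} = \mu + \sigma^2\,\frac{\partial\log Z}{\partial\mu}, \qquad \hat{\sigma}^2 = \sigma^2 + \sigma^4\,\frac{\partial^2\log Z}{\partial\mu^2}, \qquad Z = \int p(y|f)\,\mathcal{N}(f|\mu,\sigma^2)\,\mathrm{d}f, $$
and the site parameters $\nu_i, \tau_i$ are then chosen so that the new approximate marginal matches $\hat{\mu}$ and $\hat{\sigma}^2$.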
3.4 Kullback Leibler Divergence Minimisation
Another well known approach to approximate inference, implemented in inf/infKL.m, attempts to directly find the closest Gaussian $q(\mathbf{f}|\mathcal{D}) = \mathcal{N}(\mathbf{f}|\boldsymbol{\mu},V)$ to the exact posterior $p(\mathbf{f}|\mathcal{D})$ w.r.t. some proximity measure, or equivalently to maximise a lower bound $Z(\boldsymbol{\mu},V)$ to the marginal likelihood $Z$, as described in Nickisch & Rasmussen: Approximations for Binary Gaussian Process Classification, JMLR, 2008. In particular, one minimises $\mathrm{KL}(\mathcal{N}(\mathbf{f}|\boldsymbol{\mu},V)\,\|\,p(\mathbf{f}|\mathcal{D}))$, which amounts to minimising $-\ln Z(\boldsymbol{\mu},V)$ as defined by:
$$ -\ln Z = -\ln\int p(\mathbf{f})\,p(\mathbf{y}|\mathbf{f})\,\mathrm{d}\mathbf{f} = -\ln\int q(\mathbf{f}|\mathcal{D})\,\frac{p(\mathbf{f})}{q(\mathbf{f}|\mathcal{D})}\,p(\mathbf{y}|\mathbf{f})\,\mathrm{d}\mathbf{f} $$
$$ \stackrel{\text{Jensen}}{\leqslant} \int q(\mathbf{f}|\mathcal{D})\ln\frac{q(\mathbf{f}|\mathcal{D})}{p(\mathbf{f})}\,\mathrm{d}\mathbf{f} - \int q(\mathbf{f}|\mathcal{D})\ln p(\mathbf{y}|\mathbf{f})\,\mathrm{d}\mathbf{f} =: -\ln Z(\boldsymbol{\mu},V) $$
$$ = \mathrm{KL}\big(\mathcal{N}(\mathbf{f}|\boldsymbol{\mu},V)\,\big\|\,\mathcal{N}(\mathbf{f}|\mathbf{m},K)\big) - \sum_{i=1}^{n}\int\mathcal{N}(f_i|\mu_i,v_{ii})\ln p(y_i|f_i)\,\mathrm{d}f_i, \qquad v_{ii} = [V]_{ii} $$
$$ = \tfrac{1}{2}\mathrm{tr}(VK^{-1}-I) - \tfrac{1}{2}\ln|VK^{-1}| + \tfrac{1}{2}(\boldsymbol{\mu}-\mathbf{m})^\top K^{-1}(\boldsymbol{\mu}-\mathbf{m}) - \sum_{i=1}^{n}\ell^{\mathrm{KL}}(\mu_i,v_{ii}), $$
where $\ell^{\mathrm{KL}}(\mu_i,v_{ii}) = \int\mathcal{N}(f_i|\mu_i,v_{ii})\,\ell_i(f_i)\,\mathrm{d}f_i$ is the convolution of the log likelihood $\ell_i$ with the Gaussian $\mathcal{N}$ and $\mathbf{v} = \mathrm{dg}(V)$. Equivalently, one can view $\ell^{\mathrm{KL}}$ as a smoothed version of $\ell$ with univariate smoothing kernel $\mathcal{N}$.
From Challis & Barber: Concave Gaussian Variational Approximations for Inference in Large Scale Bayesian Linear Models, AISTATS, 2011, we know that the mapping $(\boldsymbol{\mu},L)\mapsto -\ln Z(\boldsymbol{\mu},L^\top L)$ is jointly convex whenever the likelihoods $f_i\mapsto p(y_i|f_i)$ are log concave. In particular, this implies that every $(\mu_i,s_i)\mapsto -\ell^{\mathrm{KL}}(\mu_i,s_i^2)$ is jointly convex.
We use an optimisation algorithm similar to EP (section 3.3) where we minimise the local KL-divergence the other way round: $\mu_i,s_i = \arg\min_{\mu,s}\mathrm{KL}[\mathcal{N}(f|\mu,s^2)\,\|\,q_{-i}(f)\,p(y_i|f_i)/Z_{p,i}]$. This view was brought forward by Tom Minka: Divergence Measures and Message Passing, MSR-TR, 2005. The KL minimisation constitutes a jointly convex 2d optimisation problem, solved by klmin using a scaled Newton approach, which is included as a sub-function in inf/infKL.m. The smoothed likelihood $\ell^{\mathrm{KL}}(\mu_i,v_{ii})$ is implemented as a meta likelihood in likKL; it uses Gaussian-Hermite quadrature to compute the required integrals. Note that, as opposed to EP, Gaussian-Hermite quadrature is appropriate here since we integrate against $\ln p(y_i|f_i)$ (which can be well approximated by a polynomial) instead of $p(y_i|f_i)$ itself. The algorithm is, again unlike EP, provably convergent for log-concave likelihoods (e.g. likGauss, likLaplace, likSech2, likLogistic, likPoisson) since it can be regarded as coordinate descent with a guaranteed decrease of the objective in every step. Due to the complex update computations, infKL can be quite slow although it has the same $O(n^3)$ asymptotic complexity as EP and Laplace.
3.5 Variational Bayes
One can drive the bounding even further by means of local quadratic lower bounds to the log likelihood $\ell(f) = \ln p(y|f)$. Suppose that we use a super-Gaussian likelihood $p(y|f)$, i.e. a likelihood that can be lower bounded by Gaussians of any width $w$ (e.g. likLaplace, likT, likLogistic, likSech2). Formally, that means that there are $b, z \in \mathbb{R}$ such that
$$ \rho(f) = \ln p(y|f-z) - bf $$
is symmetric and $\sqrt{f}\mapsto\rho(f)$ is a convex function for all $f > 0$. As a result, we obtain the following exact representation of the likelihood:
$$ \ell(f) = \ln p(y|f) = \max_{w>0}\;(b+wz)f - \frac{wf^2}{2} - \frac{1}{2}h(\gamma), $$
which can be derived by convex duality and the super-Gaussianity of the likelihood. Details can be found in the papers by Palmer et al.: Variational EM Algorithms for Non-Gaussian Latent Variable Models, NIPS, 2006 and Nickisch & Seeger: Convex Variational Bayesian Inference for Large Scale Generalized Linear Models, ICML, 2009.
The bottom line is that we can treat the variational bounding as a sequence of Laplace approximations with the "variational Bayes" log likelihood
$$ \ell^{\mathrm{VB}}(f_i) = \ell(g_i) + b_i(f_i-g_i), \qquad \mathbf{g} = \mathrm{sgn}(\mathbf{f}-\mathbf{z})\odot\sqrt{(\mathbf{f}-\mathbf{z})^2 + \mathbf{v}} + \mathbf{z}, $$
instead of the usual likelihood $\ell(f_i) = \ln p(y_i|f_i)$, i.e. we solve $\boldsymbol{\mu}\leftarrow\mathfrak{L}(\ell^{\mathrm{VB}}_{\mathbf{v}})$ instead of $\boldsymbol{\mu}\leftarrow\mathfrak{L}(\ell)$; see section 3.2. In the code of inf/infVB.m, the likelihood is implemented in the function likVB.
At the end, the optimal value of $W$ can be obtained analytically via $w_i = |b_i - \ell'(g_i)|/|g_i - z_i|$.
For the minimisation in inf/infVB.m, we use a provably convergent double loop algorithm, where in the inner loop a nonlinear least squares problem (convex for log-concave likelihoods) is solved using inf/infLaplace.m such that $\boldsymbol{\mu}\leftarrow\mathfrak{L}(\ell^{\mathrm{VB}}_{\mathbf{v}})$, and in the outer loop we compute $\mathbf{v}\leftarrow\mathrm{dg}((K^{-1}+W)^{-1})$.
The only requirement on the likelihood function is that it returns the values $z$ and $b$ needed by the bound, which are delivered by the call (see section 4)
• [b,z] = lik(hyp, y, [], ga, 'infVB')
The negative marginal likelihood upper bound $-\ln Z_{\mathrm{VB}}$ is obtained by integrating the prior times the exact representation of the likelihood,
$$ p(y|f) = \max_{\gamma>0} q(y|f,\gamma), \qquad q(y|f,\gamma) = \mathcal{N}(f|\nu,\gamma)\exp\Big(\frac{\nu^2}{2\gamma} - \frac{h(\gamma)}{2}\Big)\sqrt{2\pi\gamma}, \qquad \gamma = \frac{1}{w}, \quad \nu = b\gamma + z, $$
w.r.t. the latent variables $\mathbf{f}$, yielding
$$ -\ln Z_{\mathrm{VB}} = -\ln\int\mathcal{N}(\mathbf{f}|\mathbf{m},K)\prod_{i=1}^{n}q_i(y_i|f_i,\gamma_i)\,\mathrm{d}\mathbf{f} = -\ln\mathcal{N}(\mathbf{m}|\boldsymbol{\nu},K+\Gamma) + \frac{1}{2}\Big(\mathbf{1}^\top\mathbf{h}(\boldsymbol{\gamma}) - \mathbf{w}^\top\boldsymbol{\nu}^2 - \mathbf{1}^\top\ln(2\pi\boldsymbol{\gamma})\Big). $$
3.6 Compatibility Between Inference Methods and Covariance Approximations
Another kind of approximation is needed to render an inference method scalable. We have two approximation schemes which approximate the covariance itself to make it amenable to large numbers of training data points. The following table shows the compatibility between some inference methods and the two major groups of covariance approximations discussed in the next two sections.
Approximation \ Inference                       | Exact       | Laplace    | VB    | EP       | KL, MCMC, LOO    | Implementation
                                                | infGaussLik | infLaplace | infVB | infEP    | inf{KL,MCMC,LOO} |
variational free energy (VFE)                   | X           | X          | X     |          |                  | apxSparse, opt.s = 0.0
fully independent training conditionals (FITC)  | X           | X          | X     | X        |                  | apxSparse, opt.s = 1.0
hybrid between VFE and FITC, s0 in [0, 1]       | X           | X          | X     |          |                  | apxSparse, opt.s = s0
Kronecker/Toeplitz/BTTB grid (KISS-GP)          | X           | X          | X     |          |                  | apxGrid
state space representation                      | X           | X          | X     | (X, ADF) |                  | apxState
3.7 Sparse Covariance Approximations
One of the main problems with GP models is the high computational load of inference. In a setting with $n$ training points $\mathbf{x}$, exact inference with Gaussian likelihood requires $O(n^3)$ effort; approximations like Laplace or EP consist of a sequence of $O(n^3)$ operations.
There is a line of research with the goal of alleviating this burden by using an approximate covariance function $\tilde{k}$ instead of $k$. A review is given by Candela and Rasmussen¹. One basic idea in these approximations is to work with a set of $m$ inducing inputs $\mathbf{u}$ and a reduced computational load of $O(nm^2)$. In the following, we provide a rough idea of the FITC approximation used in the toolbox. Let $K$ denote the $n\times n$ covariance matrix between the training points $\mathbf{x}$, $K_{\mathbf{u}}$ the $m\times n$ covariance matrix between the $n$ training points and the $m$ inducing points, and $K_{\mathbf{uu}}$ the $m\times m$ covariance matrix between the $m$ inducing points. The FITC approximation to the covariance is given by
$$ K \approx \tilde{K} = Q + G, \quad G = \mathrm{diag}(\mathbf{g}), \quad \mathbf{g} = \mathrm{diag}(K-Q), \quad Q = K_{\mathbf{u}}^\top Q_{\mathbf{uu}}^{-1} K_{\mathbf{u}}, \quad Q_{\mathbf{uu}} = K_{\mathbf{uu}} + \sigma_{n\mathbf{u}}^2 I, $$
where $\sigma_{n\mathbf{u}}$ is the noise of the inducing inputs. Note that $\tilde{K}$ and $K$ have the same diagonal elements, $\mathrm{diag}(\tilde{K}) = \mathrm{diag}(K)$; all off-diagonal elements are the same as those of $Q$. Internally, the necessary covariance evaluations are performed by the meta covariance function cov/apxSparse.m. The toolbox offers FITC versions for regression with Gaussian likelihood, inf/infGaussLik.m, as well as for Laplace's approximation, inf/infLaplace.m.
The user can decide whether to treat the inducing inputs $\mathbf{u}$ as fixed or as hyperparameters. The latter allows the inducing inputs $\mathbf{u}$ to be adjusted w.r.t. the marginal likelihood. As detailed in the documentation of inf/apx.m, $\mathbf{u}$ is treated as fixed if it is passed as the 2nd parameter of apxSparse(cov,xu,..). If the hyperparameter structure hyp contains a field hyp.xu, then in inference method calls such as infGaussLik(hyp,..) or inference/prediction calls like gp(hyp,@infGaussLik,..) the inducing inputs $\mathbf{u}$ are treated as hyperparameters and can be optimised. See doc/demoSparse.m for an illustration.
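A minimal sketch of such a sparse setup (the toy data are illustrative; the opt.s mechanism follows the table in section 3.6 and the infGaussLik signature above, but doc/demoSparse.m is the authoritative example):

  % sparse GP regression with inducing inputs xu (FITC for opt.s = 1.0)
  xu  = linspace(-3,3,25)';                           % inducing inputs
  cov = {'apxSparse', {@covSEiso}, xu};
  hyp = struct('mean',[], 'cov',[0;0], 'lik',log(0.1)); hyp.xu = xu;  % optimise xu too
  inff = @(varargin) infGaussLik(varargin{:}, struct('s',1.0));
  x = randn(200,1); y = sin(x) + 0.1*randn(200,1);
  [nlZ, dnlZ] = gp(hyp, inff, [], cov, @likGauss, x, y);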
3.8 Grid-Based Covariance Approximations
Another way to bring down computational costs is to take advantage of grid structure in $\mathbf{x}$. For example, in geostatistics or image processing, the training data $\mathbf{x}\in\mathbb{R}^{n\times D}$ could be a complete 2d lattice of size $n_1\times n_2$ given by the axes $\mathbf{g}_1\in\mathbb{R}^{n_1}$, $\mathbf{g}_2\in\mathbb{R}^{n_2}$, so that $n = N = n_1\cdot n_2$, $D = 2$ and $\mathbf{x} = [\mathrm{vec}(\mathbf{g}_1\mathbf{1}^\top), \mathrm{vec}(\mathbf{1}\mathbf{g}_2^\top)]$. In general, a $p$-dimensional grid $U\in\mathbb{R}^{N\times D}$ is specified by a set of axis matrices $\{\mathbf{g}_i\in\mathbb{R}^{n_i\times D_i}\}_{i=1..p}$, so that $N = \prod_{i=1}^{p}n_i$ and $D = \sum_{i=1}^{p}D_i$, where the axes do not need to be 1d nor do their components need to be sorted. As a consequence, $U$ represents a Cartesian product of its axes, $U = \mathbf{g}_1\times\mathbf{g}_2\times\dots\times\mathbf{g}_p$. The cov/apxGrid.m covariance function represents a Kronecker product covariance matrix
$$ K_{U,U} = K_p\otimes\dots\otimes K_2\otimes K_1 $$
whose factorisation structure is given by the grid xg. The gain in computational efficiency is due to the fact that matrix-vector product, determinant, inverse and eigenvalue computations decompose, so that many operations with an overall cost of $O(N^3)$ now only cost $O(\sum_{i=1}^{p}n_i^3)$.
1A Unifying View of Sparse Approximate Gaussian Process Regression, JMLR, 2005
For off-grid data points, we can still take advantage of the computational properties of a grid-based covariance matrix $K_{U,U}$ via the structured kernel interpolation (SKI) framework, aka KISS-GP, by Wilson and Nickisch² with extensions³. Here, the $n\times n$ covariance $K$ is obtained from the $N\times N$ grid covariance $K_{U,U}$ by interpolation, $K \approx W_X K_{U,U} W_X^\top$, where $K_{U,U}$ is a covariance matrix formed by evaluating the user-specified kernel over a set of latent inducing inputs $U$, with locations that have been chosen to create algebraic structure in $K_{U,U}$ that we can exploit for efficiency. The interpolation matrix $W_X\in\mathbb{R}^{n\times N}$ is extremely sparse; for local cubic interpolation $W_X$ contains only $4^D$ nonzeros per row, where $D$ is the data dimension. In addition, $W_X$ is row-normalised, $\mathbf{1}_n = W_X\mathbf{1}_N$. The structure in $K_{U,U}$, alongside the sparsity of $W_X$, allows for very fast MVMs with the SKI approximate covariance matrix $K$ over the inputs $\mathbf{x}$, enabling fast inference and prediction.
Internally, we use a meta covariance function cov/apxGrid.m to represent the Kronecker covariance matrix and a Gaussian regression inference method inf/infGaussLik.m. We also support incomplete grids where $n < N$. A good starting point is Yunus Saatçi's PhD thesis⁴. For incomplete grids, we use the interpolation-based extensions by Wilson et al.⁵ where conjugate gradients and a determinant approximation are used. See doc/demoGrid1d.m and doc/demoGrid2d.m for an illustration. We also offer non-Gaussian likelihoods as described by Seth Flaxman⁶ so that inf/infLaplace.m can be used.
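A minimal sketch of a grid-based setup (the composite covariance syntax and the grid construction used here are assumptions; doc/demoGrid2d.m is the authoritative example):

  % KISS-GP style regression with a 2d grid of inducing inputs
  xg  = {linspace(-2,2,50)', linspace(-2,2,50)'};      % grid axes, N = 50*50
  cov = {@apxGrid, {{@covSEiso},{@covSEiso}}, xg};     % Kronecker covariance
  hyp = struct('mean',[], 'cov',zeros(4,1), 'lik',log(0.1));
  x = 4*rand(500,2)-2; y = sin(x(:,1)).*cos(x(:,2)) + 0.1*randn(500,1);
  [nlZ, dnlZ] = gp(hyp, @infGaussLik, [], cov, @likGauss, x, y);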
3.9 State Space Representation of GPs
GP models with covariance functions with a Markovian structure can be transformed into equivalent
discrete state space models where inference can be done in linear time O(n). Exact models can be
derived for sum, product, linear, noise, constant, Matérn (half-integer), Ornstein–Uhlenbeck, and
Wiener covariance functions. Other common covariance functions can be approximated by their
Markovian counterparts, including squared exponential, rational quadratic, and periodic covariance
functions.
A state space model describes the evolution of a dynamical system at different time instances $t_i$, $i = 1, 2, \dots$, by
$$ \mathbf{f}_i \sim \mathbb{P}(\mathbf{f}_i|\mathbf{f}_{i-1}), \qquad y_i \sim \mathbb{P}(y_i|\mathbf{f}_i), $$
where $\mathbf{f}_i := \mathbf{f}(t_i)\in\mathbb{R}^d$ and $\mathbf{f}_0\sim\mathbb{P}(\mathbf{f}_0)$, with $\mathbf{f}_i$ being the latent (hidden/unobserved) variable and $y_i$ being the observed variable. In continuous time, a simple dynamical system able to represent many covariance functions is given by the following linear time-invariant stochastic differential equation:
$$ \dot{\mathbf{f}}(t) = \mathbf{F}\,\mathbf{f}(t) + \mathbf{L}\,\mathbf{w}(t), \qquad y_i = \mathbf{H}\,\mathbf{f}(t_i) + \epsilon_i, $$
where $\mathbf{w}(t)$ is an $s$-dimensional white noise process, the measurement noise $\epsilon_i\sim\mathcal{N}(0,\sigma_n^2)$ is Gaussian, and $\mathbf{F}\in\mathbb{R}^{d\times d}$, $\mathbf{L}\in\mathbb{R}^{d\times s}$, $\mathbf{H}\in\mathbb{R}^{1\times d}$ are the feedback, noise effect, and measurement matrices, respectively. The initial state is distributed according to $\mathbf{f}_0\sim\mathcal{N}(\mathbf{0},\mathbf{P}_0)$.
The latent GP is recovered by $f(t) = \mathbf{H}\,\mathbf{f}(t)$, and $\mathbf{w}(t)\in\mathbb{R}^s$ is a multivariate white noise process with spectral density matrix $\mathbf{Q}_c\in\mathbb{R}^{s\times s}$. For discrete values, this translates into
$$ \mathbf{f}_i \sim \mathcal{N}(\mathbf{A}_{i-1}\mathbf{f}_{i-1},\mathbf{Q}_{i-1}), \qquad y_i \sim \mathbb{P}(y_i|\mathbf{H}\,\mathbf{f}_i), $$
2Kernel Interpolation for Scalable Structured Gaussian Processes, ICML, 2015
3Thoughts on Massively Scalable Gaussian Processes, TR, 2015.
4Scalable Inference for Structured Gaussian Process Models, University of Cambridge, 2011
5Fast Kernel Learning for Multidimensional Pattern Extrapolation, NIPS, 2014
6Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods, ICML, 2015
Algorithm 1: Kalman (forward) filtering.
Input: {t_i}, y                       # training inputs and targets
       {A_i}, {Q_i}, H, P_0           # state space model
       W, b                           # likelihood effective precision and location
for i = 1 to n do
  if i == 1 then
    m_i <- 0;  P_i <- P_0                                        # init
  else
    m_i <- A_i m_{i-1};  P_i <- A_i P_{i-1} A_i' + Q_i           # predict
  end if
  if has label y_i then
    mu_f <- H m_i;  u <- P_i H';  sigma_f^2 <- H u               # latent
    z_i <- W_ii sigma_f^2 + 1
    k_i <- W_ii u / z_i;  P_i <- P_i - k_i u'                    # variance
    c_i <- W_ii mu_f - b_i;  m_i <- m_i - u c_i / z_i            # mean
  end if
end for
log det(I + W^(1/2) K W^(1/2)) <- sum_i log z_i

Algorithm 2: Rauch-Tung-Striebel (backward) smoothing.
Input: {m_i}, {P_i}                   # Kalman filter output
       {A_i}, {Q_i}                   # state space model
for i = n down to 2 do
  m <- A_i m_{i-1};  P <- A_i P_{i-1} A_i' + Q_i                 # predict
  G_i <- P_{i-1} A_i' P^{-1};  Delta m_{i-1} <- G_i (m_i - m)
  P_{i-1} <- P_{i-1} + G_i (P_i - P) G_i'                        # variance
  m_{i-1} <- m_{i-1} + Delta m_{i-1}                             # mean
end for
with $\mathbf{f}_0\sim\mathcal{N}(\mathbf{0},\mathbf{P}_0)$. The discrete-time matrices are
$$ \mathbf{A}_i = \mathbf{A}[\Delta t_i] = e^{\Delta t_i\mathbf{F}}, \qquad \mathbf{Q}_i = \int_0^{\Delta t_i} e^{(\Delta t_i-\tau)\mathbf{F}}\,\mathbf{L}\,\mathbf{Q}_c\,\mathbf{L}^\top e^{(\Delta t_i-\tau)\mathbf{F}^\top}\,\mathrm{d}\tau, $$
where $\Delta t_i = t_{i+1} - t_i > 0$.
For stationary covariances $k(t,t') = k(t-t')$, the stationary state is distributed as $\mathbf{f}_\infty\sim\mathcal{N}(\mathbf{0},\mathbf{P}_\infty)$ and the stationary covariance can be found by solving the Lyapunov equation
$$ \dot{\mathbf{P}}_\infty = \mathbf{F}\,\mathbf{P}_\infty + \mathbf{P}_\infty\mathbf{F}^\top + \mathbf{L}\,\mathbf{Q}_c\,\mathbf{L}^\top = \mathbf{0}, $$
which leads to the identity $\mathbf{Q}_i = \mathbf{P}_\infty - \mathbf{A}_i\mathbf{P}_\infty\mathbf{A}_i^\top$.
In practice, the evaluation of the $n$ discrete-time transition matrices $\mathbf{A}_i = e^{\Delta t_i\mathbf{F}}$ and the noise covariance matrices $\mathbf{Q}_i$ (in the stationary case) for different values of $\Delta t_i$ is a computational challenge. Since the matrix exponential $\psi: s\mapsto e^{s\mathbf{X}}$ is smooth, its evaluation can be accurately approximated by convolution interpolation.
From the discrete set of matrices, all the necessary computations can be done using Kalman filtering and Kalman smoothing as detailed in Algorithms 1 and 2.
Internally, we use a meta covariance function cov/apxState.m to represent the state space representation. A good starting point is Arno Solin's PhD thesis⁷. See doc/demoState.m for an illustration. We also offer non-Gaussian likelihoods⁸ so that inf/infLaplace.m and inf/infVB.m can be used. EP is not fully functional; we offer single-sweep EP, aka assumed density filtering (ADF).
7Stochastic Differential Equation Methods for Spatio-Temporal Gaussian Process Regression, Aalto University, 2016
8State Space Gaussian Processes with Non-Gaussian Likelihood, ICML, 2018
4 Likelihood Functions
A likelihood function $p_\rho(y|f)$ (with hyperparameters $\rho$) is a conditional density, $\int p_\rho(y|f)\,\mathrm{d}y = 1$, defined for scalar latent function values $f$ and outputs $y$. In the GPML toolbox, we use i.i.d. likelihoods $p_\rho(\mathbf{y}|\mathbf{f}) = \prod_{i=1}^{n}p_\rho(y_i|f_i)$. The approximate inference engine does not explicitly distinguish between classification and regression likelihoods: it is fully generic in the likelihood, allowing a single piece of code to be used in the inference step.
Likelihood functionality is needed both during inference and while predicting.
4.1 Prediction
A prediction at $x_*$ conditioned on the data $\mathcal{D} = (X,\mathbf{y})$ (as implemented in gp.m) consists of the predictive mean $\mu_{y_*}$ and variance $\sigma^2_{y_*}$, which are computed from the latent marginal moments $\mu_{f_*},\sigma^2_{f_*}$, i.e. the Gaussian marginal approximation $\mathcal{N}(f_*|\mu_{f_*},\sigma^2_{f_*})$, via
$$ p(y_*|\mathcal{D},x_*) = \int p(y_*|f_*)\,p(f_*|\mathcal{D},x_*)\,\mathrm{d}f_* \approx \int p(y_*|f_*)\,\mathcal{N}(f_*|\mu_{f_*},\sigma^2_{f_*})\,\mathrm{d}f_*. \qquad (3) $$
The moments are given by $\mu_{y_*} = \int y_*\,p(y_*|\mathcal{D},x_*)\,\mathrm{d}y_*$ and $\sigma^2_{y_*} = \int (y_*-\mu_{y_*})^2\,p(y_*|\mathcal{D},x_*)\,\mathrm{d}y_*$. The likelihood call
• [lp,ymu,ys2] = lik(hyp, [], fmu, fs2)
does exactly this. Evaluation of the logarithm of $p_{y_*} = p(y_*|\mathcal{D},x_*)$ for values $y_*$ can be done via
• [lp,ymu,ys2] = lik(hyp, y, fmu, fs2)
where lp contains the number $\ln p_{y_*}$.
Using the moments of the likelihood, $\mu(f_*) = \int y_*\,p(y_*|f_*)\,\mathrm{d}y_*$ and $\sigma^2(f_*) = \int (y_*-\mu(f_*))^2\,p(y_*|f_*)\,\mathrm{d}y_*$, we obtain the following (exact) expressions for the predictive moments:
$$ \mu_{y_*} = \int\mu(f_*)\,p(f_*|\mathcal{D},x_*)\,\mathrm{d}f_*, \qquad \sigma^2_{y_*} = \int\big[\sigma^2(f_*) + (\mu(f_*)-\mu_{y_*})^2\big]\,p(f_*|\mathcal{D},x_*)\,\mathrm{d}f_*. $$
1. The binary case is simple since $y_*\in\{-1,+1\}$ and $1 = p_{y_*} + p_{-y_*}$. Using $\pi_* = p_{+1}$, we find
$$ p_{y_*} = \begin{cases}\pi_* & y_* = +1\\ 1-\pi_* & y_* = -1\end{cases}, \qquad \mu_{y_*} = \sum_{y_*=\pm1}y_*\,p(y_*|\mathcal{D},x_*) = 2\pi_*-1\in[-1,1], \qquad \sigma^2_{y_*} = \sum_{y_*=\pm1}(y_*-\mu_{y_*})^2\,p(y_*|\mathcal{D},x_*) = 4\pi_*(1-\pi_*)\in[0,1]. $$
2. The continuous case for homoscedastic likelihoods depending only on $r_* = y_*-f_*$ and having noise variance $\sigma^2(f_*) = \sigma^2_n$ is also simple, since the identity $p(y_*|f_*) = p(y_*-f_*|0)$ allows the substitution $y_*\leftarrow y_*+f_*$, yielding $\mu(f_*) = f_* + \int y_*\,p(y_*|0)\,\mathrm{d}y_*$; assuming $\int y_*\,p(y_*|0)\,\mathrm{d}y_* = 0$, we arrive at
$$ \mu_{y_*} = \mu_{f_*}, \qquad \sigma^2_{y_*} = \sigma^2_{f_*} + \sigma^2_n. $$
3. The generalised linear model (GLM) case is also feasible. Evaluation of the predictive distribution is done by quadrature,
$$ p_{y_*} = \int p(y_*|f_*)\,p(f_*|\mathcal{D},x_*)\,\mathrm{d}f_* \approx \int p(y_*|f_*)\,\mathcal{N}(f_*|\mu_{f_*},\sigma^2_{f_*})\,\mathrm{d}f_*. $$
For GLMs the mean is given by $\mu(f_*) = g(f_*)$ and the variance is usually a simple function of the mean, $\sigma^2(f_*) = v(g(f_*))$; hence we use Gaussian-Hermite quadrature with $\mathcal{N}(f_*|\mu_{f_*},\sigma^2_{f_*}) \approx p(f_*|\mathcal{D},x_*)$ to compute
$$ \mu_{y_*} = \int g(f_*)\,p(f_*|\mathcal{D},x_*)\,\mathrm{d}f_*, \qquad \sigma^2_{y_*} = \int\big[v(g(f_*)) + (g(f_*)-\mu_{y_*})^2\big]\,p(f_*|\mathcal{D},x_*)\,\mathrm{d}f_* \neq v(\mu_{y_*}). $$
4. Finally, the warped Gaussian likelihood predictive distribution with strictly monotonically increasing warping function $g$ is given by the expression
$$ p(y_*|\mathcal{D},x_*) = g'(y_*)\,\mathcal{N}\big(g(y_*)\,\big|\,\mu_{f_*},\,\sigma^2_n + \sigma^2_{f_*}\big), $$
so that the predictive moments can be computed by Gaussian-Hermite quadrature.
In the following, we will detail how and which likelihood functions are implemented in the GPML
toolbox. Further, we will mention dependencies between likelihoods and inference methods and
provide some analytical expressions in addition to some likelihood implementations.
4.2 Interface
The likelihoods are in fact the most challenging object in our implementation. Different inference
algorithms require different aspects of the likelihood to be computed, therefore the interface is rather
involved as detailed below.
⟨likFunctions.m 19⟩≡
1% likelihood functions are provided to be used by the gp.m function:
2%
3% likErf (Error function , classification , probit regression)
4% likLogistic (Logistic , classification , logit regression)
5% likUni (Uniform likelihood, classification)
6%
7% likGauss (Gaussian , regression)
8% likGaussWarp (Warped Gaussian , regression)
9% likGumbel (Gumbel likelihood for extremal values)
10 % likLaplace (Laplacian or double exponential , regression)
11 % likSech2 (Sech -square, regression)
12 % likT (Student ’s t, regression)
13 %
14 % likPoisson (Poisson regression , count data)
15 % likNegBinom (Negativ binomial regression , count data)
16 % likGamma (Nonnegative regression , positive data)
17 % likExp (Nonnegative regression , positive data)
18 % likInvGauss (Nonnegative regression , positive data)
19 % likBeta (Beta regression , interval data)
20 %
21 % likMix (Mixture of individual likelihood functions)
22 %
23 % The likelihood functions have three possible modes, the mode being selected
24 % as follows (where "lik" stands for any likelihood function in "lik/lik*.m".):
25 %
26 % 1) With one or no input arguments: [REPORT NUMBER OF HYPERPARAMETERS]
27 %
28 % s = lik OR s = lik(hyp)
29 %
30 % The likelihood function returns a string telling how many hyperparameters it
31 % expects, using the convention that "D" is the dimension of the input space.
32 % For example , calling "likLogistic" returns the string ’0’.
33 %
34 %
35 % 2) With three or four input arguments: [PREDICTION MODE]
36 %
37 % lp = lik(hyp, y, mu) OR [lp, ymu, ys2] = lik(hyp, y, mu, s2)
38 %
39 % This allows to evaluate the predictive distribution. Let p(y_*|f_*) be the
40 % likelihood of a test point and N(f_*|mu,s2) an approximation to the posterior
41 % marginal p(f_*|x_*,x,y) as returned by an inference method. The predictive
42 % distribution p(y_*|x_*,x,y) is approximated by.
43 % q(y_*) = \int N(f_*|mu,s2) p(y_*|f_*) df_*
44 %
45 % lp = log( q(y) ) for a particular value of y, if s2 is [] or 0, this
46 % corresponds to log( p(y|mu) )
47 % ymu and ys2 the mean and variance of the predictive marginal q(y)
48 % note that these two numbers do not depend on a particular
49 % value of y
50 % All vectors have the same size.
51 %
52 %
53 % 3) With five or six input arguments, the fifth being a string [INFERENCE MODE]
54 %
55 % [varargout] = lik(hyp, y, mu, s2, inf) OR
56 % [varargout] = lik(hyp, y, mu, s2, inf, i)
57 %
58 % There are three cases for inf, namely a) infLaplace , b) infEP and c) infVB.
59 % The last input i, refers to derivatives w.r.t. the ith hyperparameter.
60 %
61 % a1) [lp,dlp,d2lp,d3lp] = lik(hyp, y, f, [], ’infLaplace ’)
62 % lp, dlp, d2lp and d3lp correspond to derivatives of the log likelihood
63 % log(p(y|f)) w.r.t. to the latent location f.
64 % lp = log( p(y|f) )
65 % dlp = d log( p(y|f) ) / df
66 % d2lp = d^2 log( p(y|f) ) / df^2
67 % d3lp = d^3 log( p(y|f) ) / df^3
68 %
69 % a2) [lp_dhyp ,dlp_dhyp ,d2lp_dhyp] = lik(hyp, y, f, [], ’infLaplace ’, i)
70 % returns derivatives w.r.t. to the ith hyperparameter
71 % lp_dhyp = d log( p(y|f) ) / ( dhyp_i)
72 % dlp_dhyp = d^2 log( p(y|f) ) / (df dhyp_i)
73 % d2lp_dhyp = d^3 log( p(y|f) ) / (df^2 dhyp_i)
74 %
75 %
76 % b1) [lZ,dlZ,d2lZ] = lik(hyp, y, mu, s2, ’infEP ’)
77 % let Z = \int p(y|f) N(f|mu,s2) df then
78 % lZ = log(Z)
79 % dlZ = d log(Z) / dmu
80 % d2lZ = d^2 log(Z) / dmu^2
81 %
82 % b2) [dlZhyp] = lik(hyp, y, mu, s2, ’infEP’, i)
83 % returns derivatives w.r.t. to the ith hyperparameter
84 % dlZhyp = d log(Z) / dhyp_i
85 %
86 %
87 % c1) [b,z] = lik(hyp, y, [], ga, ’infVB’)
88 % ga is the variance of a Gaussian lower bound to the likelihood p(y|f).
89 % p(y|f) \ge exp( b*(f+z) - (f+z).^2/(2*ga) - h(ga)/2 ) \propto N(f|b*ga-z,ga)
90 % The function returns the linear part b and z.
91 %
92 % Cumulative likelihoods are designed for binary classification. Therefore , they
93 % only look at the sign of the targets y; zero values are treated as +1.
94 %
95 % Some examples for valid likelihood functions:
96 % lik = @likLogistic;
97 % lik = {’likMix ’,{’likUni ’,@likErf}}
98 % lik = {@likPoisson ,’logistic ’};
99 %
100 % See the help for the individual likelihood for the computations specific to
101 % each likelihood function.
102 %
103 ⟨gpml copyright 5a⟩
4.3 Implemented Likelihood Functions
The following table enumerates all (currently) implemented likelihood functions that can be found
at lik/lik<NAME>.m and their respective set of hyperparameters ρ.
For each lik<NAME> we list a short description, the density $p_\rho(y_i|f_i)$ and the hyperparameters $\rho$; for the GLM-type likelihoods, $\mu = g(f_i)$ denotes the mean obtained through an inverse link function $g$.

Regression, $y_i\in\mathbb{R}$:
  Gauss      Gaussian           $\mathcal{N}(y_i|f_i,\sigma^2) = \frac{1}{\sqrt{2\pi}\sigma}\exp\big(-\frac{(y_i-f_i)^2}{2\sigma^2}\big)$,  $\rho = \{\ln\sigma\}$
  GaussWarp  Warped Gaussian    $\mathcal{N}(g_\theta(y_i)|f_i,\sigma^2)\,g'_\theta(y_i)$,  $\rho = \{\theta_1,\dots,\theta_{n_g},\ln\sigma\}$
  Gumbel     Gumbel             $\frac{\pi}{\sigma\sqrt{6}}\exp(-z_i - e^{-z_i})$, $z_i = \gamma + s\,\frac{\pi(y_i-f_i)}{\sigma\sqrt{6}}$, $|s| = 1$,  $\rho = \{\ln\sigma\}$
  Sech2      Sech-squared       $\frac{\tau}{2\cosh^2(\tau(y_i-f_i))}$, $\tau = \frac{\pi}{2\sigma\sqrt{3}}$,  $\rho = \{\ln\sigma\}$
  Laplace    Laplacian          $\frac{1}{2b}\exp\big(-\frac{|y_i-f_i|}{b}\big)$, $b = \frac{\sigma}{\sqrt{2}}$,  $\rho = \{\ln\sigma\}$
  T          Student's t        $\frac{\Gamma(\frac{\nu+1}{2})}{\Gamma(\frac{\nu}{2})}\frac{1}{\sqrt{\nu\pi}\,\sigma}\big(1 + \frac{(y_i-f_i)^2}{\nu\sigma^2}\big)^{-\frac{\nu+1}{2}}$,  $\rho = \{\ln(\nu-1),\ln\sigma\}$

Classification, $y_i\in\{\pm1\}$:
  Erf        Error function     $\int_{-\infty}^{y_i f_i}\mathcal{N}(t)\,\mathrm{d}t$,  $\rho = \emptyset$
  Logistic   Logistic function  $\frac{1}{1+\exp(-y_i f_i)}$,  $\rho = \emptyset$
  Uni        Label noise        $\frac{1}{2}$,  $\rho = \emptyset$

Count data, $y_i\in\mathbb{N}$, with $\mu = g(f_i)$:
  Poisson    Poisson            $\mu^{y_i}e^{-\mu}/y_i!$,  $\rho = \emptyset$
  NegBinom   Negative Binomial  $\binom{y_i+r-1}{y_i}\,r^r\mu^{y_i}/(r+\mu)^{r+y_i}$,  $\rho = \{\ln r\}$

Nonnegative data, $y_i\in\mathbb{R}_+\setminus\{0\}$, with $\mu = g(f_i)$:
  Weibull    Weibull, $\gamma_1 = \Gamma(1+1/\kappa)$:  $\kappa\,\frac{\gamma_1}{\mu}\big(\frac{y_i\gamma_1}{\mu}\big)^{\kappa-1}\exp\big(-\big(\frac{y_i\gamma_1}{\mu}\big)^\kappa\big)$,  $\rho = \{\ln\kappa\}$
  Gamma      Gamma              $\frac{\alpha^\alpha y_i^{\alpha-1}}{\Gamma(\alpha)\mu^\alpha}\exp\big(-\frac{y_i\alpha}{\mu}\big)$,  $\rho = \{\ln\alpha\}$
  Exp        Exponential        $\mu^{-1}\exp\big(-\frac{y_i}{\mu}\big)$,  $\rho = \emptyset$
  InvGauss   Inverse Gaussian   $\sqrt{\frac{\lambda}{2\pi y_i^3}}\exp\big(-\frac{\lambda(y_i-\mu)^2}{2\mu^2 y_i}\big)$,  $\rho = \{\ln\lambda\}$

Interval data, $y_i\in[0,1]$, with $\mu = g(f_i)$:
  Beta       Beta               $\frac{\Gamma(\phi)}{\Gamma(\mu\phi)\Gamma((1-\mu)\phi)}\,y_i^{\mu\phi-1}(1-y_i)^{(1-\mu)\phi-1}$,  $\rho = \{\ln\phi\}$

Composite likelihood functions, $[p_1(y_i|f_i), p_2(y_i|f_i), \dots]\mapsto p_\rho(y_i|f_i)$:
  Mix        Mixture            $\sum_j \alpha_j\,p_j(y_i|f_i)$,  $\rho = \{\ln\alpha_1,\ln\alpha_2,\dots\}$
4.4 Usage of Implemented Likelihood Functions
Some code examples taken from doc/usageLik.m illustrate how to use simple and composite likeli-
hood functions to specify a GP model.
Syntactically, a likelihood function lf is defined by
lk := ’func’ | @func // simple
lf := {lk} | {param, lk} | {lk, {lk, .., lk}} // composite
i.e., it is either a string containing the name of a likelihood function, a pointer to a likelihood function
or one of the former in combination with a cell array of likelihood functions and an additional list
of parameters.
⟨doc/usageLik.m 22⟩≡
1% demonstrate usage of likelihood functions
2%
3% See also likFunctions.m.
4%
5 ⟨gpml copyright 5a⟩
6clear all,close all
7n = 5; f = randn(n,1); % create random latent function values
8
9% set up simple classification likelihood functions
10 yc = sign(f);
11 lc0 = {’likErf’}; hypc0 = []; % no hyperparameters are needed
12 lc1 = {@likLogistic}; hypc1 = []; % also function handles are OK
13 lc2 = {’likUni’}; hypc2 = [];
14 lc3 = {’likMix’,{’likUni’,@likErf}}; hypc3 = log([1;2]); %mixture
15
16 % set up simple regression likelihood functions
17 yr = f + randn(n,1)/20;
18 sn = 0.1; % noise standard deviation
19 lr0 = {’likGauss’}; hypr0 = log(sn);
20 lr1 = {’likLaplace’}; hypr1 = log(sn);
21 lr2 = {’likSech2’}; hypr2 = log(sn);
22 nu = 4; % number of degrees of freedom
23 lr3 = {’likT’}; hypr3 = [log(nu-1); log(sn)];
24 lr4 = {’likMix’,{lr0,lr1}}; hypr4 = [log([1,2]);hypr0;hypr1];
25
26 a = 1; % set up warped Gaussian with g(y) = y + a*sign(y).*y.^2
27 lr5 = {’likGaussWarp’,[’poly2’]}; hypr5 = log([a;sn]);
28 lr6 = {’likGumbel’,’+’}; hypr6 = log(sn);
29
30 % set up Poisson/negative binomial regression
31 yp = fix(abs(f)) + 1;
32 lp0 = {@likPoisson ,’logistic’}; hypp0 = [];
33 lp1 = {@likPoisson ,{’logistic2’,0.1}}; hypp1 = [];
34 lp2 = {@likPoisson ,’exp’}; hypp2 = [];
35 ln1 = {@likNegBinom,’logistic2’}; hypn1 = [];
36
37 % set up other GLM likelihoods for positive or interval regression
38 lg1 = {@likGamma,’logistic’}; al = 2; hyp.lik = log(al);
39 lg2 = {@likInvGauss,’exp’}; lam = 1.1; hyp.lik = log(lam);
40 lg3 = {@likBeta ,’expexp’}; phi = 2.1; hyp.lik = log(phi);
41 lg4 = {@likBeta ,’logit’}; phi = 4.7; hyp.lik = log(phi);
42 lg5 = {@likWeibull ,{’logistic2’,0.01}}; ka = 0.5; hyp.lik = log(ka);
43
44 % 0) specify the likelihood function
45 lik = lc0; hyp = hypc0; y = yc;
46 % lik = lr4; hyp = hypr4; y = yr;
47 % lik = lp1; hyp = hypp1; y = yp;
48
49 % 1) query the number of parameters
50 feval(lik{:})
51
52 % 2) evaluate the likelihood function on f
53 exp(feval(lik{:},hyp,y,f))
54
55 % 3a) evaluate derivatives of the likelihood
56 [lp,dlp,d2lp,d3lp] = feval(lik{:}, hyp, y, f, [], ’infLaplace’);
57
58 % 3b) compute Gaussian integrals w.r.t. likelihood
59 mu = f; s2 = rand(n,1);
60 [lZ,dlZ,d2lZ] = feval(lik{:}, hyp, y, mu, s2, ’infEP’);
61
62 % 3c) obtain lower bound on likelihood
63 ga = rand(n,1);
64 [b,z] = feval(lik{:}, hyp, y, [], ga, ’infVB’);
4.5 Compatibility Between Likelihoods and Inference Methods
The following table lists all possible combinations of likelihood function and inference methods.
Likelihood \ Inference: Exact | Laplace | VB | EP | KL | MCMC | LOO, followed by type and output domain, and alternative names. Approximate covariances (section 3.6) are possible for Exact, Laplace, VB and EP, but not for KL, MCMC and LOO.
Gaussian            X X X X X X X    regression, R
Warped Gaussian     X X X X X X      regression, R
Gumbel              X X X X          regression, R
Sech-squared        X X X X X X      regression, R            logistic distribution
Laplacian           X X X X X X      regression, R            double exponential
Student's t         X X X X X        regression, R
Mixture             X X X X X        mixing meta likelihood
Error function      X X X X X        classification, {±1}     probit regression
Logistic function   X X X X X X      classification, {±1}     logit regression
Uniform             X X X X X X      classification, {±1}     label noise
Weibull             X X X            positive data, R+\{0}    nonnegative regression
Gamma               X X X            positive data, R+\{0}    nonnegative regression
Exp                 X X X            positive data, R+\{0}    nonnegative regression
Inverse Gaussian    X X X            positive data, R+\{0}    nonnegative regression
Poisson             X (X)* X X X     count data, N            Poisson regression
Negative Binomial   X X X            count data, N            negative binomial regression
Beta                X X X            interval data, [0, 1]    beta regression
(X)* EP might not converge in some cases since quadrature is used.
Exact inference is only tractable for the Gaussian likelihood. Expectation propagation together with the
Student's t likelihood is inherently unstable due to non-log-concavity. Laplace's approximation for the
Laplace likelihood is not sensible because at the mode the curvature and the gradient are undefined,
owing to the non-differentiable peak of the Laplace distribution. Special care has been taken for the
non-convex optimisation problem posed by the combination of the Student's t likelihood and Laplace's
approximation.
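As a quick orientation, the sketch below is not part of the toolbox documentation; it assumes the toolbox is on the Matlab/Octave path and simply picks one compatible pair from the table, the Student's t likelihood with Laplace's approximation, and evaluates the negative log marginal likelihood with gp.

  % illustrative sketch: a compatible likelihood/inference pair from the table
  x = (1:10)'; y = sin(x) + 0.1*randn(10,1);      % toy 1d regression data
  nu = 4; sn = 0.1;                               % Student's t parameters
  hyp.mean = []; hyp.cov = [0;0]; hyp.lik = [log(nu-1); log(sn)];
  nlZ = gp(hyp, @infLaplace, [], @covSEiso, @likT, x, y)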
4.6 Gaussian Likelihood
The Gaussian likelihood is the simplest likelihood because the posterior distribution is not only
Gaussian but can be computed analytically. In principle, the Gaussian likelihood would only be
used in conjunction with the exact inference method, but we chose to provide compatibility with
all other inference algorithms as well because it enables code testing and makes it easy to switch between
different regression likelihoods very easily.
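For instance, in the following minimal sketch (not part of likGauss.m; toy data, toolbox assumed on the path) swapping the regression likelihood only requires changing lik, hyp.lik and the inference method:

  x = (1:10)'; y = x/5 + 0.1*randn(10,1);                        % toy data
  hyp.mean = []; hyp.cov = [0;0]; hyp.lik = log(0.1);            % ln(sn)
  nlZ_gauss   = gp(hyp, @infExact, [], @covSEiso, @likGauss,   x, y);
  nlZ_laplace = gp(hyp, @infEP,    [], @covSEiso, @likLaplace, x, y);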
24a ⟨lik/likGauss.m 24a⟩≡
1function [varargout] = likGauss(hyp, y, mu, s2, inf, i)
2
3% likGauss - Gaussian likelihood function for regression. The expression for the
4% likelihood is
5% likGauss(t) = exp(-(t-y)^2/2*sn^2) / sqrt(2*pi*sn^2),
6% where y is the mean and sn is the standard deviation.
7%
8% The hyperparameters are:
9%
10 % hyp = [ log(sn) ]
11 %
12 % Several modes are provided , for computing likelihoods , derivatives and moments
13 % respectively , see likFunctions.m for the details. In general , care is taken
14 % to avoid numerical issues when the arguments are extreme.
15 %
16 ⟨gpml copyright 5a⟩
17 %
18 % See also LIKFUNCTIONS.M.
19
20 if nargin<3, varargout = {’1’}; return;end % report number of hyperparameters
21
22 sn2 = exp(2*hyp);
23
24 if nargin<5 % prediction mode if inf is not present
25 ⟨Prediction with Gaussian likelihood 24b⟩
26 else
27 switch inf
28 case ’infLaplace’
29 ⟨Laplace’s method with Gaussian likelihood 25a⟩
30 case ’infEP’
31 ⟨EP inference with Gaussian likelihood 25b⟩
32 case ’infVB’
33 ⟨Variational Bayes inference with Gaussian likelihood 25c⟩
34 end
35 end
24b ⟨Prediction with Gaussian likelihood 24b⟩≡ (24a)
1if isempty(y), y = zeros(size(mu)); end
2s2zero = 1; if nargin >3&&numel(s2)>0&&norm(s2)>eps, s2zero = 0; end % s2==0 ?
3if s2zero % log probability
4lp = -(y-mu).^2./sn2/2-log(2*pi*sn2)/2; s2 = 0;
5else
6lp = likGauss(hyp, y, mu, s2, ’infEP’); % prediction
7end
8ymu = {}; ys2 = {};
9if nargout>1
10 ymu = mu; % first y moment
11 if nargout>2
12 ys2 = s2 + sn2; % second y moment
13 end
14 end
15 varargout = {lp,ymu,ys2};
The Gaussian likelihood function has a single hyperparameter ρ, the log of the noise standard deviation σn.
4.6.1 Exact Inference
Exact inference doesn’t require any specific likelihood-related code; all computations are done directly
by the inference method, see Section 3.1.
4.6.2 Laplace’s Approximation
25a ⟨Laplace’s method with Gaussian likelihood 25a⟩≡ (24a)
1if nargin<6 % no derivative mode
2if isempty(y), y=0; end
3ymmu = y-mu; dlp = {}; d2lp = {}; d3lp = {};
4lp = -ymmu.^2/(2*sn2) - log(2*pi*sn2)/2;
5if nargout>1
6dlp = ymmu/sn2; % dlp, derivative of log likelihood
7if nargout>2 % d2lp, 2nd derivative of log likelihood
8d2lp = -ones(size(ymmu))/sn2;
9if nargout>3 % d3lp, 3rd derivative of log likelihood
10 d3lp = zeros(size(ymmu));
11 end
12 end
13 end
14 varargout = {lp,dlp,d2lp,d3lp};
15 else % derivative mode
16 lp_dhyp = (y-mu).^2/sn2 - 1; % derivative of log likelihood w.r.t. hypers
17 dlp_dhyp = 2*(mu-y)/sn2; % first derivative,
18 d2lp_dhyp = 2*ones(size(mu))/sn2; % and also of the second mu derivative
19 varargout = {lp_dhyp,dlp_dhyp ,d2lp_dhyp};
20 end
4.6.3 Expectation Propagation
25b ⟨EP inference with Gaussian likelihood 25b⟩≡ (24a)
1if nargin<6 % no derivative mode
2lZ = -(y-mu).^2./(sn2+s2)/2 - log(2*pi*(sn2+s2))/2; % log part function
3dlZ = {}; d2lZ = {};
4if nargout>1
5dlZ = (y-mu)./(sn2+s2); % 1st derivative w.r.t. mean
6if nargout>2
7d2lZ = -1./(sn2+s2); % 2nd derivative w.r.t. mean
8end
9end
10 varargout = {lZ,dlZ,d2lZ};
11 else % derivative mode
12 dlZhyp = ((y-mu).^2./(sn2+s2)-1) ./ (1+s2./sn2); % deriv. w.r.t. hyp.lik
13 varargout = {dlZhyp};
14 end
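The infEP branch above simply evaluates the analytic Gaussian convolution. A small numerical check (an aside, not part of the toolbox) confirms that lZ equals the log density of y under N(µ, σn² + s²):

  sn = 0.3; hyp = log(sn); y = 0.7; mu = 0.2; s2 = 0.5^2;   % arbitrary test values
  lZ = likGauss(hyp, y, mu, s2, 'infEP');
  lZ_ref = -(y-mu)^2/(sn^2+s2)/2 - log(2*pi*(sn^2+s2))/2;   % same expression, written out
  abs(lZ - lZ_ref)                                          % zero up to rounding errors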
4.6.4 Variational Bayes
25c ⟨Variational Bayes inference with Gaussian likelihood 25c⟩≡ (24a)
1% variational lower site bound
2% t(s) = exp(-(y-s)^2/2sn2)/sqrt(2*pi*sn2)
3% the bound has the form: (b+z/ga)*f - f.^2/(2*ga) - h(ga)/2
4n = numel(s2); b = zeros(n,1); y = y.*ones(n,1); z = y;
5varargout = {b,z};
4.7 Warped Gaussian Likelihood
Starting from the likelihood p(y|f) we are sometimes facing the situation where the data y ∈ Y ⊆ R
is not distributed according to p(y|f) but according to some nonlinear transformation of the data g(y) = z, so
that z ∼ p(z|f). Here, the warping function g : Y → R needs to be strictly monotonically increasing,
i.e. g'(y) > 0. Formally, we start from the fact that p(z|f) integrates to one and use the derivative
dz = g'(y) dy to substitute

  ∫ p(z|f) dz = 1 = ∫ pg(y|f) dy,   pg(y|f) = p(g(y)|f) g'(y),

where we have defined the log warped likelihood ln pg(y|f) = ln p(g(y)|f) + ln g'(y). The interesting
bit is that approximate inference methods such as infExact, infLaplace, infEP, infVB, infKL remain
fully feasible; only prediction and derivatives become more involved. The usual GP inference is
recovered by using the identity warping function g : y ↦ y. The construction works in principle for
any likelihood but our implementation in likGaussWarp is limited to the Gaussian likelihood.
Hyperparameter derivatives
Hyperparameter derivatives for infLaplace are obtained as follows

  ∂/∂θ ∂^k/∂f^k ln pg(y|f) = ∂/∂θ ∂^k/∂f^k ln p(g(y)|f) + ∂/∂θ ∂^k/∂f^k ln g'(y),   k = 0, 1, 2
                           = −∂^(k+1)/∂f^(k+1) ln p(g(y)|f) · ∂g(y)/∂θ + ∂/∂θ ∂^k/∂f^k ln g'(y).

Similarly, for infEP the derivatives are given by

  ∂/∂θ ln ∫ pg(y|f) N(f|µ,σ²) df = ∂/∂θ ln ∫ p(g(y)|f) N(f|µ,σ²) df + ∂/∂θ ln g'(y)
                                 = −∂/∂µ ln ∫ p(g(y)|f) N(f|µ,σ²) df · ∂g(y)/∂θ + ∂/∂θ ln g'(y).

This trick works for any homoscedastic likelihood with p(y|f) = p(y+β|f+β), such as
likGauss, likLaplace, likSech2 and likT.
Predictive moments
As detailed in Section 4.1, the predictive distribution is – for the Gaussian likelihood – given by

  p(z*|D,x*) = ∫ p(z*|f*) p(f*|D,x*) df* = ∫ N(z*|f*, σn²) N(f*|µf*, σf*²) df*
             = N(z*|µf*, σn² + σf*²),   where z* = g(y*),
  p(y*|D,x*) = g'(y*) N(g(y*)|µf*, σn² + σf*²).

Hence, the predictive moments are obtained by the 1d integrals

  µy*  = ∫ y* g'(y*) N(g(y*)|µf*, σn² + σf*²) dy*        = ∫ g⁻¹(z*) N(z*|µf*, σn² + σf*²) dz*,   and
  σ²y* = ∫ (y* − µy*)² g'(y*) N(g(y*)|µf*, σn² + σf*²) dy* = ∫ (g⁻¹(z*) − µy*)² N(z*|µf*, σn² + σf*²) dz*.
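These 1d integrals can be approximated by any quadrature rule. The sketch below is only an illustration, not the likGaussWarp implementation: it uses Matlab's generic integral routine and a stand-in warping g(y) = sinh(y) with a closed-form inverse, plus assumed predictive moments.

  mu_f = 0.5; s2_f = 0.4; sn2 = 0.01;    % assumed latent moments and noise variance
  v = sn2 + s2_f;                        % variance of z* = g(y*)
  ginv = @(z) asinh(z);                  % inverse of the stand-in warping
  Nz = @(z) exp(-(z-mu_f).^2/(2*v))/sqrt(2*pi*v);
  mu_y = integral(@(z) ginv(z).*Nz(z), -inf, inf);
  s2_y = integral(@(z) (ginv(z)-mu_y).^2.*Nz(z), -inf, inf);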
4.8 Gumbel Likelihood
Distributions of extrema are well captured by the Gumbel distribution

  p(y) = (1/β) exp(−z − e^(−z)),   z = s(y − η)/β,   s ∈ {±1},

with mean µ = η + βγ and variance σ² = π²β²/6, where γ = 0.57721566490153 denotes the
Euler–Mascheroni constant. The skewness is approximately 1.1395·s, where s is a sign switching
between left and right skewness, and the kurtosis is 12/5. The final expression for the Gumbel likelihood
is

  p(y|f) = π/(σ√6) exp(−z − e^(−z)),   z = γ + sπ(y − f)/(σ√6),   s ∈ {±1}.
4.9 Laplace Likelihood
Laplace’s Approximation
The following derivatives are needed:
  ln p(y|f) = −ln(2b) − |f − y|/b,   where b = σn/√2
  ∂ln p/∂f = sign(y − f)/b
  ∂²ln p/(∂f)² = ∂³ln p/(∂f)³ = ∂³ln p/(∂ln σn)(∂f)² = 0
  ∂ln p/∂ln σn = |f − y|/b − 1
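The derivatives can be checked numerically away from the non-differentiable point f = y; the snippet below is an aside (not from the manual) comparing dlp against central finite differences:

  sn = 0.5; hyp = log(sn); y = 1.2; f = 0.3; e = 1e-5;   % arbitrary test values, f ~= y
  [lp,dlp] = likLaplace(hyp, y, f, [], 'infLaplace');
  lp_p = likLaplace(hyp, y, f+e, [], 'infLaplace');
  lp_m = likLaplace(hyp, y, f-e, [], 'infLaplace');
  abs(dlp - (lp_p-lp_m)/(2*e))                           % should be close to zero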
Expectation Propagation
Expectation propagation requires integration against a Gaussian measure for moment matching.
We need to evaluate ln Z = ln ∫ L(y|f,σn²) N(f|µ,σ²) df as well as the derivatives ∂ln Z/∂µ and
∂²ln Z/∂µ², where N(f|µ,σ²) = exp(−(f−µ)²/(2σ²))/√(2πσ²), L(y|f,σn²) = exp(−|y−f|/b)/(2b) and
b = σn/√2. As a first step, we reduce the number of parameters by means of the substitution
f̃ = (f−y)/σn, yielding

  Z = ∫ L(y|f,σn²) N(f|µ,σ²) df = (1/σn) ∫ L(f|0,1) N(f|µ̃,σ̃²) df,
  ln Z = ln Z̃ − ln σn = ln ∫ L(f|0,1) N(f|µ̃,σ̃²) df − ln σn,

with µ̃ = (µ−y)/σn and σ̃ = σ/σn. Thus, we concentrate on the simpler quantity ln Z̃. Splitting the
integral at f = 0 and completing the square in each half gives

  ln Z = ln ∫ exp(−(f−µ̃)²/(2σ̃²) − √2|f|) df + C,   C = −ln(σ̃√2π) − ln(√2 σn)
       = ln [ ∫_{−∞}^{0} exp(−(f² − 2m−·f + µ̃²)/(2σ̃²)) df + ∫_{0}^{∞} exp(−(f² − 2m+·f + µ̃²)/(2σ̃²)) df ] + C,   m± = µ̃ ∓ √2σ̃²,
       = ln [ exp(m−²/(2σ̃²)) ∫_{−∞}^{0} N(f|m−,σ̃²) df + exp(m+²/(2σ̃²)) ( 1 − ∫_{−∞}^{0} N(f|m+,σ̃²) df ) ] − µ̃²/(2σ̃²) − ln(√2 σn)
       = ln [ exp(m−²/(2σ̃²)) Φ(−m−/σ̃) + exp(m+²/(2σ̃²)) Φ(m+/σ̃) ] − µ̃²/(2σ̃²) − ln(√2 σn).

Here, Φ(z) = ∫_{−∞}^{z} N(f|0,1) df denotes the cumulative Gaussian distribution. Finally, we have

  ln Z = ln [ exp(√2µ̃) Φ(−z+) + exp(−√2µ̃) Φ(z−) ] + σ̃² − ln(√2 σn)
       = ln( e^{a+} + e^{a−} ) + σ̃² − ln(√2 σn),   a± = ln Φ(∓z±) ± √2µ̃,

where z± = µ̃/σ̃ ± √2σ̃ = (µ−y)/σ ± √2σ/σn and µ̃ = (µ−y)/σn, σ̃ = σ/σn.
Now, using d ln Φ(z)/dθ = (1/Φ(z)) dΦ(z)/dθ = (N(z)/Φ(z)) dz/dθ, we tackle the first derivative

  ∂ln Z/∂µ = ( e^{a+} ∂a+/∂µ + e^{a−} ∂a−/∂µ ) / ( e^{a+} + e^{a−} ),
  ∂a+/∂µ = ∂ln Φ(−z+)/∂µ + √2/σn = −N(−z+)/(σ Φ(−z+)) + √2/σn = −q+/σ + √2/σn,
  ∂a−/∂µ = ∂ln Φ(z−)/∂µ − √2/σn =  N(z−)/(σ Φ(z−)) − √2/σn  =  q−/σ − √2/σn,

i.e. ∂a±/∂µ = ∓q±/σ ± √2/σn with q+ = N(−z+)/Φ(−z+) and q− = N(z−)/Φ(z−), as well as the second derivative

  ∂²ln Z/∂µ² = ( ∂[e^{a+} ∂a+/∂µ]/∂µ + ∂[e^{a−} ∂a−/∂µ]/∂µ ) / ( e^{a+} + e^{a−} ) − (∂ln Z/∂µ)²,
  ∂[e^{a±} ∂a±/∂µ]/∂µ = e^{a±} [ (∂a±/∂µ)² + ∂²a±/∂µ² ],
  ∂²a+/∂µ² = (N(−z+)/σ²) · (Φ(−z+) z+ − N(−z+)) / Φ²(−z+) = −(q+² − q+ z+)/σ²,
  ∂²a−/∂µ² = (N(z−)/σ²) · (−Φ(z−) z− − N(z−)) / Φ²(z−)   = −(q−² + q− z−)/σ²,

i.e. ∂²a±/∂µ² = −(q±² ∓ q± z±)/σ², which can be simplified to

  ∂²ln Z/∂µ² = ( e^{a+} b+ + e^{a−} b− ) / ( e^{a+} + e^{a−} ) − (∂ln Z/∂µ)²

using

  b± = (∂a±/∂µ)² + ∂²a±/∂µ² = ( ∓q±/σ ± √2/σn )² − (q±² ∓ q± z±)/σ² = 2/σn² − ( √8/(σσn) ∓ z±/σ² ) q±.

We also need

  ∂ln Z/∂ln σn = ( e^{a+} ∂a+/∂ln σn + e^{a−} ∂a−/∂ln σn ) / ( e^{a+} + e^{a−} ) − 2σ²/σn² − 1.
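The closed-form expression can be validated against brute-force quadrature; the check below is an aside (not part of the toolbox) assuming Matlab's integral function:

  sn = 0.4; hyp = log(sn); b = sn/sqrt(2);            % arbitrary test values
  y = 0.3; mu = -0.5; s2 = 0.8;
  lik = @(f) exp(-abs(y-f)/b)/(2*b);                  % Laplace likelihood
  gau = @(f) exp(-(f-mu).^2/(2*s2))/sqrt(2*pi*s2);    % Gaussian measure
  lZ_num = log(integral(@(f) lik(f).*gau(f), -inf, inf));
  lZ = likLaplace(hyp, y, mu, s2, 'infEP');
  abs(lZ - lZ_num)                                    % should be small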
Variational Bayes
We need h(γ) and its derivatives as well as β(γ):

  h(γ) = 2γ/σn² + ln(2σn²) + y²/γ
  h'(γ) = 2/σn² − y²/γ²
  h''(γ) = 2y²/γ³
  β(γ) = y/γ
4.10 Student’s t Likelihood
The likelihood has two hyperparameters (both represented in the log domain to ensure positivity):
the degrees of freedom ν and the scale σn, with mean y (for ν > 1) and variance νσn²/(ν−2) (for ν > 2):

  p(y|f) = Z · (1 + (f−y)²/(νσn²))^(−(ν+1)/2),   Z = Γ((ν+1)/2) / ( Γ(ν/2) √(νπσn²) ).
Laplace’s Approximation
For the mode fitting procedure, we need derivatives up to third order; the hyperparameter derivatives
at the mode require some mixed derivatives. All in all, using r = y − f, we have

  ln p(y|f) = ln Γ((ν+1)/2) − ln Γ(ν/2) − ½ ln(νπσn²) − ((ν+1)/2) ln(1 + r²/(νσn²))
  ∂ln p/∂f = (ν+1) r / (r² + νσn²)
  ∂²ln p/(∂f)² = (ν+1) (r² − νσn²) / (r² + νσn²)²
  ∂³ln p/(∂f)³ = 2(ν+1) (r³ − 3rνσn²) / (r² + νσn²)³
  ∂ln p/∂ln ν = ∂Z/∂ln ν − (ν/2) ln(1 + r²/(νσn²)) + ((ν+1)/2) · r² / (r² + νσn²)
  ∂Z/∂ln ν = (ν/2) · d ln Γ((ν+1)/2)/d((ν+1)/2) − (ν/2) · d ln Γ(ν/2)/d(ν/2) − 1/2
  ∂³ln p/(∂ln ν)(∂f)² = ( νr²(r² − 3(ν+1)σn²) + ν²σn⁴ ) / (r² + νσn²)³
  ∂ln p/∂ln σn = (ν+1) r² / (r² + νσn²) − 1
  ∂³ln p/(∂ln σn)(∂f)² = 2νσn² (ν+1)(νσn² − 3r²) / (r² + νσn²)³
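In practice this combination is used for robust regression; a minimal sketch (toy data with one outlier, toolbox assumed on the path) is:

  x = linspace(-3,3,30)'; y = sin(x) + 0.1*randn(30,1); y(5) = 4;   % toy data, one outlier
  nu = 4; sn = 0.2;
  hyp.mean = []; hyp.cov = [0;0]; hyp.lik = [log(nu-1); log(sn)];
  [ymu,ys2] = gp(hyp, @infLaplace, [], @covSEiso, @likT, x, y, x);  % predictions at x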
4.11 Cumulative Logistic Likelihood
The likelihood has one hyperparameter (represented in the log domain), namely the standard deviation σn:

  p(y|f) = Z · cosh⁻²(τ(f−y)),   τ = π/(2σn√3),   Z = π/(4σn√3).

Laplace's Approximation
The following derivatives are needed, where φ(x) ≡ ln(cosh(x)):

  ln p(y|f) = ln(π) − ln(4σn√3) − 2φ(τ(f−y))
  ∂ln p/∂f = −2τ φ'(τ(f−y))
  ∂²ln p/(∂f)² = −2τ² φ''(τ(f−y))
  ∂³ln p/(∂f)³ = −2τ³ φ'''(τ(f−y))
  ∂³ln p/(∂ln σn)(∂f)² = 2τ² [ 2φ''(τ(f−y)) + τ(f−y) φ'''(τ(f−y)) ]
  ∂ln p/∂ln σn = 2τ(f−y) φ'(τ(f−y)) − 1
4.12 GLM Likelihoods: Poisson, Negative Binomial, Weibull, Gamma, Exponential, Inverse Gaussian and Beta
Data y from a space other than R, e.g. N, R+ or [0, 1], can be modeled using generalised linear model
likelihoods p(y|f), where the expected value E[y] = µ is related to the underlying Gaussian process
f by means of an inverse link function µ = g(f). Typically, the likelihoods are from an exponential
family; hence the variance V[y] = v(µ) is a simple function of the mean µ, as are higher order
moments such as the skewness S[y] = s(µ) and the kurtosis K[y] = k(µ).
Here, we directly specify the inverse link function µ = g(f) defining the mapping from the GP f
to the mean intensity µ. For numerical reasons, we work with the log of the inverse link function
h(f) = ln g(f) and use its derivatives h', h'' and h''' for subsequent computations. In the table below,
we have summarised the GLM likelihood expressions, the moments, the range of their variables and
the applicable inverse link functions.
Likelihood      ρ          v(µ) =             s(µ) =                                   k(µ) =
Poisson         ∅          µ                  1/√µ                                     1/µ
Neg. Binomial   {ln r}     µ(µ/r + 1)         √(4(r+µ)/(rµ)) − √(r/(µ(µ+r)))           6/r + r/(µ(µ+r))
Weibull         {ln κ}     µ²(γ2/γ1² − 1)     (γ3−3γ1γ2+2γ1³)/(γ2−γ1²)^(3/2)           (γ4−4γ1γ3+12γ1²γ2−3γ2²−6γ1⁴)/(γ2−γ1²)²
Gamma           {ln α}     µ²/α               2/√α                                     6/α
Exponential     ∅          µ²                 2                                        6
Inv. Gauss      {ln λ}     µ³/λ               3√(µ/λ)                                  15µ/λ
Beta            {ln φ}     µ(1−µ)/(1+φ)       (2−4µ)/((2+φ)√v(µ))                      6(1 − v(µ)(5φ+6))/(v(µ)(φ+2)(φ+3))

Likelihood      p(y|f) =                                                  y ∈        µ ∈       Inverse Links
Poisson         µ^y exp(−µ)/y!                                            N          R+        exp, logistic*
Neg. Binomial   (y+r−1 choose y) r^r µ^y/(r+µ)^(r+y)                      N          R+        exp, logistic*
Weibull         (κγ1/µ) (yγ1/µ)^(κ−1) exp(−(yγ1/µ)^κ)                     R+\{0}     R+\{0}    exp, logistic*
Gamma           α^α y^(α−1)/(Γ(α)µ^α) exp(−yα/µ)                          R+\{0}     R+\{0}    exp, logistic*
Exponential     µ^(−1) exp(−y/µ)                                          R+\{0}     R+\{0}    exp, logistic*
Inv. Gauss      √(λ/(2πy³)) exp(−λ(y−µ)²/(2µ²y))                          R+\{0}     R+\{0}    exp, logistic*
Beta            Γ(φ)/(Γ(µφ)Γ((1−µ)φ)) y^(µφ−1)(1−y)^((1−µ)φ−1)            [0, 1]     [0, 1]    expexp, logit
4.12.1 Inverse Link Functions
Possible inverse link functions and their properties (∪ convex, ∩ concave, ↑ monotone) are summarised below:

util/glm_invlink_*   g(f) = µ =            g: R →    g is      h(f) = ln µ =          h is
exp                  e^f                   R+        ∪, ↑      f                      ∪, ∩, ↑
logistic             ℓ(f) = ln(1+e^f)      R+        ∪, ↑      ln(ln(1+e^f))          ∩, ↑
logistic2            ℓ(f + a·f·ℓ(f))       R+        ↑         ln(ℓ(f + a·f·ℓ(f)))    ∩, ↑
expexp               exp(−e^(−f))          [0, 1]    ↑         −e^(−f)                ∪, ↑
logit                1/(1+e^(−f))          [0, 1]    ↑         −ln(1+e^(−f))          ∪, ↑
Please see doc/usageLik.m for how to specify a pair of likelihood function and link function in
GPML.
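For instance, the following lines (a condensed variant of what doc/usageLik.m does) build two such pairs and query the number of hyperparameters:

  lik = {@likPoisson,'logistic'};          hyp = [];         % Poisson likelihood, logistic link
  feval(lik{:})                            % report the number of hyperparameters
  lik = {@likWeibull,{'logistic2',0.01}};  hyp = log(0.5);   % Weibull likelihood, kappa = 0.5
  feval(lik{:})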
Exponential inverse link: exp
For g(f) = e^f things are simple since h(f) = f, h'(f) = 1 and h''(f) = h'''(f) = 0.
Logistic inverse link: logistic
For g(f) = ln(1+e^f) the derivatives of h(f) are given by

  h(f) = ln(ln(1+e^f))
  h'(f) = s(−f)/ln(1+e^f),   s(f) = 1/(1+e^f),   s'(f) = −e^f/(1+e^f)² = −s(−f) s(f)
  h''(f) = (1/ln(1+e^f)) · e^(−f)/(1+e^(−f))² − (1/ln²(1+e^f)) · (e^f/(1+e^f)) · (1/(1+e^(−f)))
         = h'(f) (s(f) − h'(f))
  h'''(f) = h''(f) (s(f) − h'(f)) + h'(f) ( −e^f/(1+e^f)² − h''(f) )
          = h''(f) (s(f) − 2h'(f)) − h'(f) s(f) s(−f).

Note that g(f) = e^(h(f)) = ln(1+e^f) is convex, and h(f) = ln(ln(1+e^f)) with

  h''(f) = (1/ln(1+e^f)) · (1 − e^f/ln(1+e^f)) · (1/(1+e^f)) · (1/(1+e^(−f))) ≤ 0

is concave since e^f > ln(1+e^f) for all f ∈ R.
Twice logistic inverse link: logistic2
Note that h(f) = ln(ℓ(f + a·f·ℓ(f))) is – according to Seeger et al. [9] – concave.
Double negative exponential inverse link: expexp
For g(f) = exp(−e^(−f)) the derivatives of h(f) are given by

  h(f) = −e^(−f),   h'(f) = −h(f),   h''(f) = h(f),   h'''(f) = −h(f).
Logit regression inverse link: logit
For g(f) = 1/(1+e^(−f)) the derivatives of h(f) can be computed using the logistic inverse link function
h_ℓ(f), since h(f) = f − exp(h_ℓ(f)):

  h(f) = f − e^(h_ℓ(f))
  h'(f) = 1 − e^(h_ℓ(f)) h_ℓ'(f)
  h''(f) = −e^(h_ℓ(f)) [ h_ℓ'(f)² + h_ℓ''(f) ] = −s(f) s(−f)
  h'''(f) = −e^(h_ℓ(f)) [ h_ℓ'(f)³ + 3 h_ℓ''(f) h_ℓ'(f) + h_ℓ'''(f) ]
[9] Bayesian Intermittent Demand Forecasting, NIPS, 2016
4.12.2 Poisson Likelihood
Count data y ∈ N^n can be modeled in the GP framework using the Poisson distribution p(y) =
µ^y e^(−µ)/y! with mean/variance E[y] = V[y] = µ, skewness S[y] = 1/√µ and kurtosis K[y] = 1/µ,
leading to the likelihood

  p(y|f) = µ^y exp(−µ)/y!,   µ = g(f)
  ⇔ ln p(y|f) = y·ln g(f) − g(f) − ln Γ(y+1).

For Laplace's method to work, we need the first three derivatives of the log likelihood ln p(y|f), where
h(f) = ln g(f):

  ln p(y|f) = y·h(f) − exp(h(f)) − ln Γ(y+1)
  ∂/∂f ln p(y|f) = h'(f) [y − exp(h(f))]
  ∂²/∂f² ln p(y|f) = h''(f) [y − exp(h(f))] − h'(f)² exp(h(f))
  ∂³/∂f³ ln p(y|f) = h'''(f) [y − exp(h(f))] − 3h'(f) h''(f) exp(h(f)) − h'(f)³ exp(h(f))
                   = h'''(f) [y − exp(h(f))] − h'(f) [h'(f)² + 3h''(f)] exp(h(f)).

Note that if ln µ = h(f) is concave and µ = g(f) is convex then the Poisson likelihood p(y|f) is
log-concave in f, which is the case for both exp and logistic.
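A minimal Poisson regression sketch (toy count data, toolbox assumed on the path, exp inverse link) looks as follows:

  x = linspace(0,5,40)'; y = fix(abs(3*sin(x))) + 1;            % toy count data
  hyp.mean = []; hyp.cov = [0;0]; hyp.lik = [];                 % likPoisson has no hyperparameters
  lik = {@likPoisson,'exp'};
  [ymu,ys2] = gp(hyp, @infLaplace, [], @covSEiso, lik, x, y, x);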
4.12.3 Weibull Likelihood
Nonnegative data y ∈ R+ such as time-to-failure can be modeled in the GP framework using the
Weibull distribution p(y) = (κ/λ)(y/λ)^(κ−1) e^(−(y/λ)^κ) with shape parameter κ > 0, scale parameter
λ > 0, mean E[y] = λγ1 = µ where γj = Γ(1 + j/κ), variance V[y] = λ²γ2 − µ² = µ²(γ2/γ1² − 1),
skewness S[y] = (γ3 − 3γ1γ2 + 2γ1³)/(γ2 − γ1²)^(3/2) and kurtosis K[y] = (γ4 − 4γ1γ3 + 12γ1²γ2 − 3γ2² − 6γ1⁴)/(γ2 − γ1²)².
Using the substitution µ = λγ1 ⇔ 1/λ = γ1/µ, we obtain

  p(y|f) = (κγ1/µ) (yγ1/µ)^(κ−1) exp(−(yγ1/µ)^κ),   µ = g(f) > 0
  ⇔ ln p(y|f) = ln(κγ1/µ) + (κ−1) ln(yγ1/µ) − (yγ1/µ)^κ.

Note that the Weibull likelihood p(y|f) is log-concave in f neither for the exp nor for the logistic
inverse link.
4.12.4 Gamma Likelihood
Nonnegative data y ∈ R+ can be modeled in the GP framework using the Gamma distribution
p(y) = θ^(−α)/Γ(α) y^(α−1) e^(−y/θ) with shape parameter α > 0, scale parameter θ > 0, mean E[y] =
αθ = µ, variance V[y] = αθ² = µ²/α, skewness S[y] = 2/√α and kurtosis K[y] = 6/α. Using the
substitution µ = αθ ⇔ α/µ = 1/θ, we obtain

  p(y|f) = α^α y^(α−1)/Γ(α) · µ^(−α) exp(−yα/µ),   µ = g(f) > 0
  ⇔ ln p(y|f) = −α (ln µ + y/µ) − ln Zα(y),   ln Zα(y) = ln Γ(α) − α ln α + (1−α) ln y.

Note that if ln µ = h(f) were convex and µ = g(f) were concave then the Gamma likelihood p(y|f)
would be log-concave in f, which is not the case for either exp or logistic.
4.12.5 Exponential Likelihood
Nonnegative data y ∈ R+ can be modeled in the GP framework using the Exponential distribution
p(y) = θ^(−1) e^(−y/θ) with scale parameter θ > 0, mean E[y] = θ = µ, variance V[y] = µ², skewness
S[y] = 2 and kurtosis K[y] = 6. We obtain

  p(y|f) = µ^(−1) exp(−y/µ),   µ = g(f) > 0
  ⇔ ln p(y|f) = −ln µ − y/µ.

Note that for exp (but not for logistic) the likelihood is log-concave. The exponential distribution
corresponds to the Gamma distribution with α = 1 and to the Weibull distribution with κ = 1.
4.12.6 Inverse Gaussian Likelihood
Nonnegative data y ∈ R+^n can be modeled in the GP framework using the Inverse Gaussian distribution
p(y) = √(λ/(2πy³)) exp(−λ(y−µ)²/(2µ²y)) with shape parameter λ > 0, mean parameter
µ > 0, mean E[y] = µ, variance V[y] = µ³/λ, skewness S[y] = 3√(µ/λ) and kurtosis K[y] = 15µ/λ.
We obtain

  p(y|f) = √(λ/(2πy³)) exp(−λ(y−µ)²/(2µ²y)),   µ = g(f) > 0
  ⇔ ln p(y|f) = −λ(y−µ)²/(2µ²y) − ln Zλ(y),   ln Zλ(y) = −½ (ln λ − ln(2πy³)).

The inverse Gaussian likelihood is in general not log-concave in f for either exp or logistic.
4.12.7 Beta Likelihood
Interval data y ∈ [0, 1]^n can be modeled in the GP framework using the Beta distribution p(y) =
y^(α−1)(1−y)^(β−1)/B(α,β) with shape parameters α, β > 0, mean E[y] = α/(α+β), variance
V[y] = αβ/[(α+β)²(α+β+1)] and 1/B(α,β) = Γ(α+β)/[Γ(α)Γ(β)]. Reparametrising using
the mean parameter µ = E[y] = α/(α+β) and the shape parameter φ = α+β, the variance becomes V[y] =
µ(1−µ)/(1+φ), and hence

  p(y|f) = Γ(φ)/(Γ(µφ)Γ((1−µ)φ)) y^(µφ−1)(1−y)^((1−µ)φ−1),   µ = g(f) > 0
  ⇔ ln p(y|f) = ln Γ(φ) − ln Γ(µφ) − ln Γ((1−µ)φ) + (µφ−1) ln y + ((1−µ)φ−1) ln(1−y).

The Beta likelihood is in general not log-concave in f for either exp or logistic.
5 Mean Functions
A mean function mφ : X → R (with hyperparameters φ) of a GP f is a scalar function defined over
the whole domain X that computes the expected value m(x) = E[f(x)] of f for the input x.
5.1 Interface
In the GPML toolbox, a mean function m : X → R needs to implement evaluation m = mφ(X) and
first derivatives mi = ∂m/∂φi with respect to the components i of the parameter φ ∈ Φ, as detailed
below.
36 ⟨meanFunctions.m 36⟩≡
1% Mean functions to be use by Gaussian process functions. There are two
2% different kinds of mean functions: simple and composite:
3%
4% Simple mean functions:
5%
6% meanZero - zero mean function
7% meanOne - one mean function
8% meanConst - constant mean function
9% meanLinear - linear mean function
10 % meanPoly - polynomial mean function
11 % meanDiscrete - precomputed mean for discrete data
12 % meanGP - predictive mean of another GP
13 % meanGPexact - predictive mean of a regression GP
14 % meanNN - nearest neighbor mean function
15 % meanWSPC - weighted sum of projected cosines
16 %
17 % Composite mean functions (see explanation at the bottom):
18 %
19 % meanScale - scaled version of a mean function
20 % meanSum - sum of mean functions
21 % meanProd - product of mean functions
22 % meanPow - power of a mean function
23 % meanMask - mask some dimensions of the data
24 % meanPref - difference mean for preference learning
25 % meanWarp - warped mean function
26 %
27 % Naming convention: all mean functions are named "mean/mean*.m".
28 %
29 %
30 % 1) With no or only a single input argument:
31 %
32 % s = meanNAME or s = meanNAME(hyp)
33 %
34 % The mean function returns a string s telling how many hyperparameters hyp it
35 % expects, using the convention that "D" is the dimension of the input space.
36 % For example , calling "meanLinear" returns the string ’D’.
37 %
38 % 2) With two input arguments and one output argument:
39 %
40 % m = meanNAME(hyp, x)
41 %
42 % The function computes and returns the mean vector m with components
43 % m(i) = m(x(i,:)) where hyp are the hyperparameters and x is an n by D matrix
44 % of data, where D is the dimension of the input space. The returned mean
45 % vector m is of size n by 1.
46 %
47 % 3) With two input arguments and two output arguments:
48 %
49 % [m,dm] = meanNAME(hyp, x)
50 %
51 % The function computes and returns the mean vector m as in 2) above.
52 % In addition to that, the (linear) directional derivative function dm is
53 % returned. The call dhyp = dm(q) for a direction vector q of size n by 1
54 % returns a vector of directional derivatives dhyp = d (q’*m(x)) / d hyp of
55 % the same size as the hyperparameter vector hyp. The components of dhyp are
56 % defined as follows: dhyp(i) = q’*( d m(x) / d hyp(i) ).
57 %
58 % See also doc/usageMean.m.
59 %
60 ⟨gpml copyright 5a⟩
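A short sketch of the directional derivative interface described above (an aside, toolbox assumed on the path):

  n = 5; D = 2; x = randn(n,D);         % toy inputs
  hyp = [2;3];                          % meanLinear expects D hyperparameters
  [m,dm] = meanLinear(hyp, x);          % mean vector and derivative function
  q = randn(n,1);                       % an arbitrary direction
  dhyp = dm(q)                          % dhyp(i) = q'*( d m(x) / d hyp(i) )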
5.2 Implemented Mean Functions
We offer simple and composite mean functions producing new mean functions m(x) from existing
mean functions µj(x). All code files are named according to the pattern mean/mean<NAME>.m for
simple identification. This modular specification makes it possible to define affine mean functions m(x) =
c + aᵀx or polynomial mean functions m(x) = (c + aᵀx)², as illustrated by the sketch after the table.
All currently available mean functions are summarised in the following table.
Simple mean functions m(x)
  Zero      mean vanishes always;                                 m(x) = 0;                             φ = ∅
  One       mean equals 1;                                        m(x) = 1;                             φ = ∅
  Const     mean equals a constant;                               m(x) = c;                             φ = c ∈ R
  Linear    mean depends linearly on x ∈ X ⊆ R^D;                 m(x) = aᵀx;                           φ = a ∈ R^D
  Poly      mean depends polynomially on x ∈ X ⊆ R^D;             m(x) = Σd adᵀ x^d;                    φ = a ∈ R^(D×d)
  Discrete  precomputed mean for discrete data x ∈ X ⊆ N;         m(x) = µx;                            φ = µ ∈ R^s
  GP        predictive mean of another GP;                        m(x) = ∫ y·p(y|D,x) dy;               φ = ∅
  GPexact   predictive mean of a regression GP;                   m(x) = ∫ y·p(y|D,x) dy;               φ = {ρ, ψ, σn}
  NN        nearest neighbor for a set (zj,mj) ∈ X×R;             m(x) = mi, i = arg minj d(x,zj);      φ = ∅
  WSPC      weighted sum of d projected cosines, x ∈ X ⊆ R^D;     m(x) = aᵀcos(Wx + b);                 φ = {W ∈ R^(d×D), a,b ∈ R^d}

Composite mean functions [µ1(x), µ2(x), ..] ↦ m(x)
  Scale     scale a mean;                                         m(x) = α µ(x);                        φ = α ∈ R
  Sum       add up mean functions;                                m(x) = Σj µj(x);                      φ = ∅
  Prod      multiply mean functions;                              m(x) = Πj µj(x);                      φ = ∅
  Pow       raise a mean to a power;                              m(x) = µ(x)^d;                        φ = ∅
  Mask      act on components I ⊆ [1, 2, .., D] of x ∈ X ⊆ R^D only;   m(x) = µ(xI);                    φ = ∅
  Pref      preference learning mean, x = [x1; x2], xi ∈ R^(D/2);      m(x) = µ(x1) − µ(x2);            φ = ∅
  Warp      warped mean;                                          m(x) = g[µ(x)];                       φ = ∅
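As an example of the affine and polynomial constructions mentioned above, the following sketch (in the spirit of doc/usageMean.m) builds m(x) = c + aᵀx and m(x) = (c + aᵀx)²:

  c = 2; a = [1;-1]; n = 5; x = randn(n,2);           % toy setup
  maff = {'meanSum',{@meanConst,@meanLinear}};  hypaff = [c; a];
  mpol = {'meanPow',2,maff};                    hyppol = hypaff;
  feval(maff{:}, hypaff, x)             % evaluates c + x*a
  feval(mpol{:}, hyppol, x)             % evaluates (c + x*a).^2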
5.3 Usage of Implemented Mean Functions
Some code examples taken from doc/usageMean.m illustrate how to use simple and composite mean
functions to specify a GP model.
Syntactically, a mean function mf is defined by
mn := ’func’ | @func // simple
mf := {mn} | {mn, {param, mf}} | {mn, {mf, .., mf}} // composite
i.e., it is either a string containing the name of a mean function, a pointer to a mean function, or one
of the former in combination with a cell array of mean functions and an additional list of parameters.
38 ⟨doc/usageMean.m 38⟩≡
1% demonstrate usage of mean functions
2%
3% See also meanFunctions.m.
4%
5 ⟨gpml copyright 5a⟩
6clear all,close all
7n = 5; D = 2; x = randn(n,D); % create a random data set
8
9% set up simple mean functions
10 m0 = {’meanZero’}; hyp0 = []; % no hyperparameters are needed
11 m1 = {’meanOne’}; hyp1 = []; % no hyperparameters are needed
12 mc = {@meanConst}; hypc = 2; % also function handles are possible
13 ml = {@meanLinear}; hypl = [2;3]; % m(x) = 2*x1 + 3*x2
14 mp = {@meanPoly ,2}; hypp = [1;1;2;3]; % m(x) = x1+x2+2*x1^2+3*x2^2
15 mn = {@meanNN ,[1,0; 0 ,1],[0.9,0.5]}; hypn = []; % nearest neighbor
16 s = 12; hypd = randn(s,1); % discrete mean with 12 hypers
17 md = {’meanDiscrete’,s};
18 hyp.cov = [0;0]; hypg = []; % GP predictive mean
19 xt = randn(2*n,D); yt = sign(xt(:,1)-xt(:,2)); % training data
20 mg = {@meanGP ,hyp,@infEP ,@meanZero ,@covSEiso ,@likErf,xt,yt};
21 hype = [0;0; log(0.1)]; % regression GP predictive mean
22 xt = randn(2*n,D); yt = xt(:,1).*xt(:,2); % training data
23 me = {@meanGPexact ,@meanZero ,@covSEiso ,xt,yt};
24
25 % set up composite mean functions
26 msc = {’meanScale’,{m1}}; hypsc = [3; hyp1]; % scale by 3
27 msu = {’meanSum’,{m0,mc,ml}}; hypsu = [hyp0; hypc; hypl]; % sum
28 mpr = {@meanProd ,{mc,ml}}; hyppr = [hypc; hypl]; % product
29 mpo = {’meanPow’,3,msu}; hyppo = hypsu; % third power
30 mask = [false ,true]; % mask excluding all but the 2nd component
31 mma = {’meanMask’,mask,ml}; hypma = hypl(mask);
32 mpf = {@meanPref,ml}; hyppf = 2; % linear pref with slope
33 mwp = {@meanWarp,ml,@sin,@cos};hypwp = 2; % sin of linear
34
35 % 0) specify mean function
36 % mean = md; hyp = hypd; x = randi([1,s],n,1);
37 % mean = mn; hyp = hypn;
38 % mean = mg; hyp = hypg;
39 mean = me; hyp = hype;
40 % mean = m0; hyp = hyp0;
41 % mean = msu; hyp = hypsu;
42 % mean = mpr; hyp = hyppr;
43 % mean = mpo; hyp = hyppo;
44 % mean = mpf; hyp = hyppf;
45
46 % 1) query the number of parameters
47 feval(mean{:})
48
49 % 2) evaluate the function on x
50 feval(mean{:},hyp,x)
51
52 % 3) compute the derivatives w.r.t. to hyperparameter i
53 i = 2; feval(mean{:},hyp,x,i)
6 Covariance Functions
A covariance function kψ : X×X → R (with hyperparameters ψ) of a GP f is a scalar function
defined over the whole domain X² that computes the covariance k(x,z) = V[f(x), f(z)] = E[(f(x) −
m(x))(f(z) − m(z))] of f between the inputs x and z.
6.1 Interface
Again, the interface is simple since only evaluation of the full covariance matrix K = kψ(X) and its
derivatives Ki = ∂K/∂ψi, as well as the cross terms k∗ = kψ(X, x∗) and k∗∗ = kψ(x∗, x∗) for prediction,
are required.
39 ⟨covFunctions.m 39⟩≡
1% Covariance functions to be use by Gaussian process functions. There are two
2% different kinds of covariance functions: simple and composite:
3%
4% 1) Elementary and standalone covariance functions:
5% covZero - zero covariance function
6% covEye - unit covariance function
7% covOne - unit constant covariance function
8% covDiscrete - precomputed covariance for discrete data
9%
10 % 2) Composite covariance functions:
11 % covScale - scaled version of a covariance function
12 % covSum - sums of covariance functions
13 % covProd - products of covariance functions
14 % covMask - mask some dimensions of the data
15 % covPref - difference covariance for preference learning
16 % covPER - make stationary covariance periodic
17 % covADD - additive covariance function
18 % covWarp - warp input to a covariance function
19 %
20 % 3) Mahalanobis distance based covariances and their modes
21 % covMaha - generic "mother" covariance
22 % * eye - unit length scale
23 % * iso - isotropic length scale
24 % * ard - automatic relevance determination
25 % * prot - (low-rank) projection in input space
26 % * fact - factor analysis covariance
27 % * vlen - spatially varying length scale
28 % covGE - Gamma exponential covariance
29 % covMatern - Matern covariance function with nu=1/2, 3/2 or 5/2
30 % covPP - piecewise polynomial covariance function (compact support)
31 % covRQ - rational quadratic covariance function
32 % covSE - squared exponential covariance function
33 %
34 % 4) Dot product based covariances and their modes
35 % covDot - generic "mother" covariance
36 % * eye - unit length scale
37 % * iso - isotropic length scale
38 % * ard - automatic relevance determination
39 % * pro - (low-rank) projection in input space
40 % * fac - factor analysis covariance
41 % covLIN - linear covariance function
42 % covPoly - polynomial covariance function
43 %
44 % 5) Time series covariance functions on the positive real line
45 % covFBM - fractional Brownian motion covariance
46 % covULL - underdamped linear Langevin process covariance
47 % covW - i-times integrated Wiener process covariance
48 % covOU - i-times integrated Ornstein -Uhlenbeck process covariance
49 %
50 % 6) Standalone covariances
51 % covNNone - neural network covariance function
52 % covLINone - linear covariance function with bias
53 % covPeriodic - smooth periodic covariance function (1d)
54 % covPeriodicNoDC - as above but with zero DC component and properly scaled
55 % covCos - sine periodic covariance function (1d) with unit period
56 % covGabor - Gabor covariance function
57 %
58 % 7) Shortcut covariances assembled from library
59 % covConst - covariance for constant functions
60 % covNoise - independent covariance function (i.e. white noise)
61 % covPERiso - make isotropic stationary covariance periodic
62 % covPERard - make ARD stationary covariance periodic
63 % covMaterniso - Matern covariance function with nu=1/2, 3/2 or 5/2
64 % covMaternard - Matern covariance function with nu=1/2, 3/2 or 5/2 with ARD
65 % covPPiso - piecewise polynomial covariance function (compact support)
66 % covPPard - piecewise polynomial covariance function (compact support)
67 % covRQiso - isotropic rational quadratic covariance function
68 % covRQard - rational quadratic covariance function with ARD
69 % covSEiso - isotropic squared exponential covariance function
70 % covSEisoU - same as above but without latent scale
71 % covSEard - squared exponential covariance function with ARD
72 % covSEvlen - spatially varying lengthscale squared exponential
73 % covSEproj - projection squared exponential covariance function
74 % covLINiso - linear covariance function
75 % covLINard - linear covariance function with ARD
76 % covGaborard - Gabor covariance function with ARD
77 % covGaboriso - isotropic Gabor covariance function
78 % covSM - spectral mixture covariance function
79 %
80 % 8) Special purpose (approximation) covariance functions
81 % apxSparse - sparse approximation: to be used for large scale inference
82 % problems with inducing points aka FITC
83 % apxGrid - grid interpolation: to be used for large scale inference
84 % problems with Kronecker/Toeplitz/BTTB covariance matrix
85 %
86 % The covariance functions are written according to a special convention where
87 % the exact behaviour depends on the number of input and output arguments
88 % passed to the function. If you want to add new covariance functions, you
89 % should follow this convention if you want them to work with the function gp.
90 % There are four different ways of calling the covariance functions:
91 %
92 % 1) With no (or one) input argument(s):
93 %
94 % s = cov
95 %
96 % The covariance function returns a string s telling how many hyperparameters it
97 % expects, using the convention that "D" is the dimension of the input space.
98 % For example , calling "covRQard" returns the string ’(D+2)’.
99 %
100 % 2) With two input arguments and one output argument:
101 %
102 % K = cov(hyp, x) equivalent to K = cov(hyp, x, [])
103 %
104 % The function computes and returns the covariance matrix where hyp are
105 % the hyperparameters and x is an n by D matrix of cases, where
106 % D is the dimension of the input space. The returned covariance matrix is of
107 % size n by n.
108 %
109 % 3) With three input arguments and one output argument:
110 %
111 % Kz = cov(hyp, x, z)
112 % kx = cov(hyp, x, ’diag’)
113 %
114 % The function computes test set covariances; kx is a vector of self covariances
115 % for the test cases in x (of length n) and Kz is an (n by nz) matrix of cross
116 % covariances between training cases x and test cases z.
117 %
118 % 4) With two output arguments:
119 %
120 % [K,dK] = cov(hyp, x) equivalent to [K,dK] = cov(hyp, x, [])
121 % [K,dK] = cov(hyp, x, z)
122 % [K,dK] = cov(hyp, x, ’diag ’)
123 %
124 % The function computes and returns the covariances K as in 3) above.
125 % In addition to that, the (linear) directional derivative function dK is
126 % returned. The two possible calls dhyp = dK(Q) and [dhyp,dx] = dK(Q) for a
127 % direction Q of the same size as K are possible. The return arguments dhyp
128 % and dx are the directional derivatives dhyp = d trace(Q’*K) / d hyp and
129 % dx = d trace(Q’*K) / d x are of the same size as the hyperparameter
130 % vector hyp and the input data x, respectively. The components of dhyp and
131 % dx are defined as follows: dhyp(i) = trace(Q’*( d K / d hyp(i) ))
132 % and dx(i,j) = trace(Q’*( d K / d x(i,j) )).
133 %
134 % Covariance functions can be specified in two ways: either as a string
135 % containing the name of the covariance function or using a cell array. For
136 % example:
137 %
138 % cov = ’covRQard ’;
139 % cov = {’covRQard ’};
140 % cov = {@covRQard};
141 %
142 % are supported. Only the second and third form using the cell array can be used
143 % for specifying composite covariance functions , made up of several
144 % contributions. For example:
145 %
146 % cov = {’covScale ’, {’covRQiso ’}};
147 % cov = {’covSum ’, {’covRQiso ’,’covSEard ’,’covNoise ’}};
148 % cov = {’covProd ’,{’covRQiso ’,’covSEard ’,’covNoise ’}};
149 % cov = {’covMask ’,{mask ,’covSEiso ’}}
150 % q=1; cov = {’covPPiso ’,q};
151 % d=3; cov = {’covPoly ’,d};
152 % cov = {’covADD ’,{[1,2],’covSEiso ’}};
153 % cov = {@apxSparse, {@covSEiso}, u}; where u are the inducing inputs
154 %
155 % specifies a covariance function which is the sum of three contributions. To
156 % find out how many hyperparameters this covariance function requires , we do:
157 %
158 % feval(cov{:})
159 %
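A short sketch of the derivative interface described above (an aside, toolbox assumed on the path):

  n = 5; D = 3; x = randn(n,D);         % toy inputs
  hyp = [0;0];                          % covSEiso: [log(ell); log(sf)]
  [K,dK] = covSEiso(hyp, x);            % covariance matrix and derivative function
  Q = randn(n);                         % a direction of the same size as K
  dhyp = dK(Q)                          % dhyp(i) = trace(Q'*( d K / d hyp(i) ))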
6.2 Implemented Covariance Functions
Similarly to the mean functions, we provide a whole algebra of covariance functions k : X×X → R
with the same generic name pattern cov/cov<NAME>.m as before.
Besides a long list of simple covariance functions, we also offer a variety of composite covariance
functions as shown in the following table.
1) Elementary and standalone covariance functions k(x,z)
  Zero      covariance vanishes always;                          k(x,z) = 0;                       ψ = ∅
  Eye       unit additive measurement noise;                     k(x,z) = δ(x−z);                  ψ = ∅
  One       unit constant;                                       k(x,z) = 1;                       ψ = ∅
  Discrete  precomputed covariance for discrete data x ∈ X ⊆ N;  k(x,z) = K_xz where K = LᵀL is a Cholesky decomposition, L ∈ R^(s×s);   ψ = L

2) Composite covariance functions [κ1(x,z), κ2(x,z), ..] ↦ k(x,z)
  Scale     modulate covariance function by a scalar;            k(x,z) = σf² κ(x,z);              ψ = ln σf
  Scale     modulate covariance function depending on input;     k(x,z) = σf(x) κ(x,z) σf(z);      ψ = φℓ
  Sum       add up covariance functions;                         k(x,z) = Σj κj(x,z);              ψ = ∅
  Prod      multiply covariance functions;                       k(x,z) = Πj κj(x,z);              ψ = ∅
  Mask      act on components I ⊆ [1, 2, .., D] of x ∈ X ⊆ R^D only;   k(x,z) = κ(xI, zI);          ψ = ∅
  Pref      preference learning covariance, x = [x1; x2], xi ∈ R^(D/2);   k(x,z) = κ(x1,z1) + κ(x2,z2) − κ(x1,z2) − κ(x2,z1);   ψ = ∅
  PER       turn a stationary covariance into a periodic one, X ⊆ R^D;   k(x,z) = κ(u(x),u(z)), u(x) = [sin(2πxp); cos(2πxp)], xp = x ('eye'), x/p ('iso'), x./p ('ard');   ψ = ψu
  ADD       additive, X ⊆ R^D, index degree set D = {1, .., D};  k(x,z) = Σ_{d∈D} σfd² Σ_{|I|=d} Π_{i∈I} κ(xi, zi; ψi);   ψ = {ψ1, .., ψD, ln σf1, .., ln σf|D|}
  Warp      warp inputs to a covariance function;                k(x,z) = κ(p(x), p(z)), where p : R^D → R^D;   ψ = ∅
3) Covariance functions based on Mahalanobis distances, k(x,z) = k(r), r² = (x−z)ᵀP⁻¹(x−z), where x,z ∈ X ⊆ R^D
  Maha      shared generic "mother" covariance function;         k(r) as below;                    ψ = ∅
  GE        gamma exponential;                                   k(r) = exp(−|r|^γ), γ ∈ (0, 2];   ψ = ln(γ/(2−γ))
  Matern    Matérn with f1(t) = 1, f3(t) = 1+t, f5(t) = f3(t) + t²/3;   k(r) = fd(√d·r) exp(−√d·r);   ψ = ∅
  PP        compact support, piecewise polynomial fv(r);         k(r) = max(0, 1−r)^(j+v) · fv(r); ψ = ∅
  RQ        rational quadratic;                                  k(r) = (1 + r²/(2α))^(−α);        ψ = ln α
  SE        squared exponential;                                 k(r) = exp(−r²/2);                ψ = ∅
Allowable mode parameter values for Mahalanobis distance covariances
  'eye'     unit lengthscale;                                    P = I;                                        ψ = ∅
  'iso'     isotropic lengthscale;                               P = ℓ²I, ℓ ∈ R+;                              ψ = ln ℓ
  'ard'     automatic relevance determination;                   P = Λ², Λ = diag(λ), λ ∈ R+^D;                ψ = ln λ
  'proj'    (low-rank) projection in input space;                P⁻¹ = LᵀL, L ∈ R^(d×D);                       ψ = L
  'fact'    factor analysis covariance;                          P⁻¹ = LᵀL + diag(f), f ∈ R^D, L ∈ R^(d×D);    ψ = {L, ln f}
  'vlen'    spatially varying lengthscale;                       kℓ(x,z) = (ℓ(x)ℓ(z)/l²(x,z))^(D/2) k(r²/l²(x,z)), l²(x,z) = (ℓ²(x)+ℓ²(z))/2;   ψ = φℓ
4) Covariance functions based on Euclidean dot products, k(x,z) = k(s), s = xᵀP⁻¹z, where x,z ∈ X ⊆ R^D
  Dot       shared generic "mother" covariance function;         k(s) as below;                    ψ = ∅
  LIN       linear covariance function;                          k(s) = s;                         ψ = ∅
  Poly      polynomial covariance;                               k(s) = σf² (s + c)^d;             ψ = {ln c, ln σf}
Allowable mode parameter values for Euclidean dot product covariances
  'eye'     unit lengthscale;                                    P = I;                                        ψ = ∅
  'iso'     isotropic lengthscale;                               P = ℓ²I, ℓ ∈ R+;                              ψ = ln ℓ
  'ard'     automatic relevance determination;                   P = Λ², Λ = diag(λ), λ ∈ R+^D;                ψ = ln λ
  'proj'    (low-rank) projection in input space;                P⁻¹ = LᵀL, L ∈ R^(d×D);                       ψ = L
  'fact'    factor analysis covariance;                          P⁻¹ = LᵀL + diag(f), f ∈ R^D, L ∈ R^(d×D);    ψ = {L, ln f}
5) Time series covariance functions on the positive real line, x,z ∈ X = R+
  FBM    fractional Brownian motion covariance with Hurst index h;   k(x,z) = ½σf² ( |x|^(2h) − |x−z|^(2h) + |z|^(2h) );   ψ = {ln σf, −ln(1/h − 1)}
  ULL    underdamped linear Langevin process covariance;             k(t) = a² e^(−µt) ( sin(ωt)/ω + cos(ωt)/µ ), t = |x−z|;   ψ = {ln µ, ln ω, ln a}
  W      i-times integrated Wiener process covariance, i ∈ {−1, .., 3};   κ0(x,z) = σf² min(x,z), κi(x,z) = ∫0^x ∫0^z κ(i−1)(x̃,z̃) dx̃ dz̃;   ψ = ln σf
  OU     i-times integrated Ornstein-Uhlenbeck process covariance, i ∈ {0, 1};   κ0(x,z) = σf² exp(−|x−z|/ℓ) + (σf0² − σf²) exp(−(x+z)/ℓ);   ψ = {ln ℓ, ln σf, ln σf0}
6) Standalone covariance functions
  NNone         neural network, X ⊆ R^D;        k(x,z) = σf² sin⁻¹( xᵀz / √((ℓ²+xᵀx)(ℓ²+zᵀz)) );   ψ = {ln ℓ, ln σf}
  LINone        linear with bias, X ⊆ R^D;      k(x,z) = (xᵀz + 1)/t²;                             ψ = ln t
  Periodic      smooth periodic, X ⊆ R;         k(x,z) = σf² exp( −(2/ℓ²) sin²[π(x−z)/p] );        ψ = {ln ℓ, ln p, ln σf}
  PeriodicNoDC  periodic, rescaled, DC component removed, X ⊆ R;   k(x,z) = σf² [ κ(x−z) − (1/π)∫0^π κ(t)dt ] / [ κ(0) − (1/π)∫0^π κ(t)dt ], κ(x−z) = exp( −(2/ℓ²) sin²[π(x−z)/p] );   ψ = {ln ℓ, ln p, ln σf}
  Cos           periodic cosine, X ⊆ R;         k(x,z) = σf² cos( 2π(x−z)/p );                     ψ = {ln p, ln σf}
  Gabor         Gabor function, X ⊆ R^D, λ,p ∈ R+^D;   k(x,z) = exp(−½ tᵀΛ⁻²t) cos(2π tpᵀ1), t = x−z, tp = t ('eye'), t/p ('iso'), t./p ('ard');   ψ = {ln λ, ψp}
7) Shortcut covariance functions assembled from the library k(x,z)
  Const        covariance equals a constant;                         k(x,z) = σf²;                                  ψ = ln σf
  Noise        additive measurement noise;                           k(x,z) = σf² δ(x−z);                           ψ = ln σf
  PERiso       turn a stationary covariance into a periodic one, X ⊆ R^D;   k(x,z) = κ(u(x),u(z)), u(x) = [sin xp, cos xp], xp = 2πx/p;   ψ = ln p
  PERard       turn an ARD stationary covariance into a periodic one, X ⊆ R^D;   k(x,z) = κ(u(x),u(z)), u(x) = [sin xp, cos xp], xp = 2πx./p;   ψ = {ln p1, .., ln pD}
  Materniso    Matérn, X ⊆ R^D, f1(t) = 1, f3(t) = 1+t, f5(t) = f3(t) + t²/3;   k(x,z) = σf² fd(rd) exp(−rd), rd = √( d (x−z)ᵀ(x−z)/ℓ² );   ψ = {ln ℓ, ln σf}
  Maternard    Matérn with ARD, X ⊆ R^D;                             k(x,z) = σf² fd(rd) exp(−rd), rd = √( d (x−z)ᵀΛ⁻²(x−z) );   ψ = {ln λ1, .., ln λD, ln σf}
  PPiso        compact support, piecewise polynomial fv(r);          k(x,z) = σf² max(0, 1−r)^(j+v) fv(r), r = ‖x−z‖/ℓ, j = ⌊D/2⌋ + v + 1;   ψ = {ln ℓ, ln σf}
  PPard        compact support, piecewise polynomial fv(r), ARD;     k(x,z) = σf² max(0, 1−r)^(j+v) fv(r), r² = (x−z)ᵀΛ⁻²(x−z);   ψ = {ln λ1, .., ln λD, ln σf}
  RQiso        isotropic rational quadratic, X ⊆ R^D;                k(x,z) = σf² ( 1 + (x−z)ᵀ(x−z)/(2αℓ²) )^(−α);  ψ = {ln ℓ, ln σf, ln α}
  RQard        rational quadratic with ARD, X ⊆ R^D;                 k(x,z) = σf² ( 1 + (x−z)ᵀΛ⁻²(x−z)/(2α) )^(−α); ψ = {ln λ1, .., ln λD, ln σf, ln α}
  SEiso        isotropic squared exponential, X ⊆ R^D;               k(x,z) = σf² exp( −(x−z)ᵀ(x−z)/(2ℓ²) );        ψ = {ln ℓ, ln σf}
  SEisoU       as above but without latent scale;                    k(x,z) = exp( −(x−z)ᵀ(x−z)/(2ℓ²) );            ψ = ln ℓ
  SEard        automatic relevance determination squared exponential;   k(x,z) = σf² exp( −½ (x−z)ᵀΛ⁻²(x−z) );      ψ = {ln λ1, .., ln λD, ln σf}
  SEvlen       spatially varying lengthscale squared exponential;    k(x,z) = σf² (a/b)^(D/2) exp( −‖x−z‖²/b ), a = 2ℓ(x)ℓ(z), b = ℓ²(x) + ℓ²(z);   ψ = {φℓ, ln σf}
  SEproj       projection (factor analysis) squared exponential;     k(x,z) = σf² exp( −½ (x−z)ᵀLᵀL(x−z) ), L ∈ R^(d×D);   ψ = {L, ln σf}
  LINard       linear with diagonal weighting;                       k(x,z) = xᵀΛ⁻²z;                               ψ = {ln λ1, .., ln λD}
  LINiso       linear with isotropic weighting;                      k(x,z) = xᵀz/ℓ²;                               ψ = ln ℓ
  Gaborard     anisotropic Gabor function, X ⊆ R^D, λ,p ∈ R+^D;      k(x,z) = exp( −Σd td²/(2λd²) ) cos( 2π Σd td/pd ), td = xd − zd;   ψ = {ln λ, ln p}
  Gaboriso     isotropic Gabor function, X ⊆ R^D, ℓ,p ∈ R+;          k(x,z) = exp( −tᵀt/(2ℓ²) ) cos( 2π tᵀ1/p ), t = x−z;   ψ = {ln ℓ, ln p}
  SM           spectral mixture, X ⊆ R^D, w ∈ R+^Q, M,V ∈ R+^(D×Q);  k(x,z) = wᵀ( exp(−½ Vᵀt²) ⊙ cos(Mᵀt) ), t = 2π(x−z);   ψ = {ln w, ln M, ln V}
  SM           spectral mixture product, X ⊆ R^D, W ∈ R+^(D×Q), M,V ∈ R+^(D×Q);   k(x,z) = Πd wdᵀ( exp(−½ vd td²) ⊙ cos(md td) ), td = 2π(xd − zd);   ψ = {ln W, ln M, ln V}
(⊙ denotes elementwise multiplication.)
The spectral mixture covariance covSM was introduced by Wilson & Adams, Gaussian Process Kernels
for Pattern Discovery and Extrapolation, ICML, 2013.
The periodic covariance function covPER starts from a stationary covariance function that depends
on the data only through a distance r² = (x−x')ᵀΛ⁻²(x−x'), such as covMatern, covPP, covRQ,
covSE, and turns it into a periodic covariance function by embedding the data x ∈ R^D into a
periodic high-dimensional space xp = u(x) ∈ R^(2D).
The additive covariance function covADD starts from a one-dimensional covariance function κ(xi, x'i; ψi)
acting on a single component i ∈ [1, .., D] of x. From that, we define covariance functions κI(xI, x'I) =
Π_{i∈I} κ(xi, x'i; ψi) acting on vector-valued inputs xI. The sums of exponential size can efficiently be
computed using the Newton-Girard formulae. Sample functions drawn from a GP with additive
covariance are additive functions. The number of interacting variables |I| is a measure of how complex
the additive functions are.
6.3 Usage of Implemented Covariance Functions
Some code examples taken from doc/usageCov.m illustrate how to use simple and composite covariance
functions to specify a GP model.
Syntactically, a covariance function cf is defined by
cv := ’func’ | @func // simple
cf := {cv} | {cv, {param, cf}} | {cv, {cf, .., cf}} // composite
i.e., it is either a string containing the name of a covariance function, a pointer to a covariance func-
tion or one of the former in combination with a cell array of covariance functions and an additional
list of parameters.
44 ⟨doc/usageCov.m 44⟩≡
1% demonstrate usage of covariance functions
2%
3% See also covFunctions.m.
4%
5 ⟨gpml copyright 5a⟩
6clear all,close all
7n = 5; D = 3; x = randn(n,D); xs = randn(3,D); % create a data set
8
9% set up simple covariance functions
10 cn = {’covNoise’}; sn = .1; hypn = log(sn); % one hyperparameter
11 cc = {@covConst}; sf = 2; hypc = log(sf); % function handles OK
12 ce = {@covEye}; hype = []; % identity
13 cl = {@covLIN}; hypl = []; % linear is parameter -free
14 cla = {’covLINard’}; L = rand(D,1); hypla = log(L); % linear (ARD)
15 cli = {’covLINiso’}; l = rand(1); hypli = log(l); % linear iso
16 clo = {@covLINone}; ell = .9; hyplo = log(ell); % linear with bias
17 cp = {@covPoly,3}; c = 2; hypp = log([c;sf]); % third order poly
18 cga = {@covSEard}; hypga = log([L;sf]); % Gaussian with ARD
19 cgi = {’covSEiso’}; hypgi = log([ell;sf]); % isotropic Gaussian
20 cgu = {’covSEisoU’}; hypgu = log(ell); % isotropic Gauss no scale
21 cra = {’covRQard’}; al = 2; hypra = log([L;sf;al]); % ration. quad.
22 cri = {@covRQiso}; hypri = log([ell;sf;al]); % isotropic
23 cma = {@covMaternard ,5}; hypma = log([ell;sf]); % Matern class d=5
24 cmi = {’covMaterniso’,3}; hypmi = log([ell;sf]); % Matern class d=3
25 cnn = {’covNNone’}; hypnn = log([L;sf]); % neural network
26 cpe = {’covPeriodic’}; p = 2; hyppe = log([ell;p;sf]); % periodic
27 cpn = {’covPeriodicNoDC’}; p = 2; hyppe = log([ell;p;sf]); % w/o DC
28 cpc = {’covCos’}; p = 2; hypcpc = log([p;sf]); % cosine cov
29 cca = {’covPPard’,3}; hypcc = hypgu;% compact support poly degree 3
30 cci = {’covPPiso’,2}; hypcc = hypgi;% compact support poly degree 2
31 cgb = {@covGaboriso}; ell = 1; p = 1.2; hypgb=log([ell;p]); % Gabor
32 Q = 2; w = ones(Q,1)/Q; m = rand(D,Q); v = rand(D,Q);
33 csm = {@covSM ,Q}; hypsm = log([w;m(:);v(:)]); % Spectral Mixture
34 cvl = {@covSEvlen ,{@meanLinear}}; hypvl = [1;2;1; 0]; % var lenscal
35 s = 12; cds = {@covDiscrete ,s}; % discrete covariance function
36 L = randn(s); L = chol(L’*L); L(1:(s+1):end) = log(diag(L));
37 hypds = L(triu(true(s))); xd = randi([1,s],[n,1]); xsd = [1;3;6];
38 cfa = {@covSEfact ,2}; hypfa = randn(D*2,1); % factor analysis
39
40 % set up composite i.e. meta covariance functions
41 csc = {’covScale’,{cgu}}; hypsc = [log(3); hypgu]; % scale by 9
42 csu = {’covSum’,{cn,cc,cl}}; hypsu = [hypn; hypc; hypl]; % sum
43 cpr = {@covProd ,{cc,cci}}; hyppr = [hypc; hypcc]; % product
44 mask = [0,1,0]; % binary mask excluding all but the 2nd component
45 cma = {’covMask’,{mask,cgi{:}}}; hypma = hypgi;
46 % isotropic periodic rational quadratic
47 cpi = {’covPERiso’,{@covRQiso}};
48 % periodic Matern with ARD
49 cpa = {’covPERard’,{@covMaternard ,3}};
50 % additive based on SEiso using unary and pairwise interactions
51 cad = {’covADD’,{[1,2],’covSEiso’}};
52 % preference covariance with squared exponential base covariance
53 cpr = {’covPref’,{’covSEiso’}}; hyppr = [0;0];
54 xp = randn(n,2*D); xsp = randn(3,2*D);
55
56 % 0) specify a covariance function
57 % cov = cma; hyp = hypma;
58 % cov = cci; hyp = hypcc;
59 % cov = csm; hyp = hypsm;
60 cov = cds; hyp = hypds; x = xd; xs = xsd;
61 % cov = cfa; hyp = hypfa;
62 % cov = cvl; hyp = hypvl;
63 % cov = cpr; hyp = hyppr; x = xp; xs = xsp;
64
65 % 1) query the number of parameters
66 feval(cov{:})
67
68 % 2) evaluate the function on x
69 [K,dK] = feval(cov{:},hyp,x)
70
71 % 3) evaluate the function on x and xs to get cross -terms
72 [kss,dkss] = feval(cov{:},hyp,xs,’diag’)
73 [Ks, dKs ] = feval(cov{:},hyp,x,xs)
7 Hyperpriors
A hyperprior p(θ) with θ = [ρ, φ, ψ] is a joint probability distribution over the likelihood hyperparameters
ρ, the mean hyperparameters φ and the covariance hyperparameters ψ. We concentrate
on factorial priors p(θ) = Πj pj(θj). Hyperpriors can be used to regularise the optimisation of the
hyperparameters via the marginal likelihood Z(θ) so that p(θ)Z(θ) is maximised instead. As we
wish to perform unconstrained optimisation, we require (mainly) smooth hyperpriors with infinite
support.
7.1 Interface
In the GPML toolbox, a prior distribution p(θ) needs to implement the evaluation of the log density
ln p(θ) and its first derivative ∂ln p(θ)/∂θ. In addition, we require sampling capabilities, i.e. the
generation of θ ∼ p(θ).
46 ⟨priorDistributions.m 46⟩≡
1% prior distributions to be used for hyperparameters of Gaussian processes
2% using infPrior.
3% There are two different kinds of prior distributions: simple and composite:
4%
5% simple prior distributions:
6%
7% priorGauss - univariate Gaussian
8% priorLaplace - univariate Laplace
9% priorT - univariate Student ’s t
10 %
11 % priorSmoothBox1 - univariate interval (linear decay in log domain)
12 % priorSmoothBox2 - univariate interval (quadr. decay in log domain)
13 %
14 % priorGamma - univariate Gamma, IR+
15 % priorWeibull - univariate Weibull , IR+
16 % priorInvGauss - univariate Inverse Gaussian , IR+
17 % priorLogNormal - univariate Log-normal, IR+
18 %
19 % priorClamped or - fix hyperparameter to its current value by setting
20 % priorDelta derivatives to zero, no effect on marginal likelihood
21 %
22 % priorGaussMulti - multivariate Gauss
23 % priorLaplaceMulti - multivariate Laplace
24 % priorTMulti - multivariate Student ’s t
25 %
26 % priorClampedMulti or - fix hyperparameter to its current value by setting
27 % priorDeltaMulti derivatives to zero, no effect on marginal likelihood
28 %
29 % priorEqualMulti or - make several hyperparameters have the same value by
30 % priorSameMulti same derivative , no effect on marginal likelihood
31 %
32 % composite prior distributions (see explanation at the bottom):
33 %
34 % priorMix - nonnegative mixture of priors
35 % priorTransform - prior on g(t) rather than t
36 %
37 % Naming convention: all prior distributions are named "prior/prior*.m".
38 %
39 %
40 % 1) With only a fixed input arguments:
41 %
42 % r = priorNAME(par1,par2,parN)
43 %
44 % The function returns a random sample from the distribution for e.g.
45 % random restarts, simulations or optimisation initialisation.
46 %
47 % 2) With one additional input arguments:
48 %
49 % [lp,dlp] = priorNAME(par1,par2,parN, t)
50 %
51 % The function returns the log density at location t along with its first
52 % derivative.
53 %
54 % See also doc/usagePrior.m, inf/infPrior.m.
55 %
56 ⟨gpml copyright 5a⟩
7.2 Implemented Hyperpriors
All code files are named according to the pattern prior/prior<NAME>.m for simple identification. All
currently available hyperpriors are summarised in the following table.
Simple hyperpriors p(θ)

Univariate hyperpriors defined over the whole real line with mean µ and variance σ²
  Gauss       normally distributed hyperparameter θ ∈ R;     p(θ) = 1/(σ√2π) exp( −(θ−µ)²/(2σ²) );   τ = {µ ∈ R, σ² ∈ R+}
  Laplace     double exponentially distributed hyperparameter θ ∈ R;   p(θ) = 1/(2b) exp( −|θ−µ|/b ), b = σ/√2;   τ = {µ ∈ R, σ² ∈ R+}
  T           Student's t distributed hyperparameter θ ∈ R;  p(θ) = Γ((ν+1)/2)/Γ(ν/2) · 1/(σ√((ν−2)π)) · ( 1 + (θ−µ)²/((ν−2)σ²) )^(−(ν+1)/2);   τ = {µ ∈ R, σ², ν ∈ R+}

Univariate hyperpriors with effectively bounded support but defined over the whole real line
  SmoothBox1  interval hyperparameter θ ∈ R, i.e. θ approximately in [a,b];
              p(θ) = (1 − exp(η(a−b)))/(b−a) · 1/(1 + exp(−η(θ−a))) · 1/(1 + exp(η(θ−b)));   τ = {a ≤ b ∈ R, η ∈ R+};
              µ = (a+b)/2,  σ² = (w²/12) (1 − exp(−2g))/(1 + π²/g²),  g = wη/2,  w = |a−b|
  SmoothBox2  localised hyperparameter θ ∈ R, i.e. θ approximately in [a,b];
              p(θ) = 1/((1/η + 1)(b−a)) · { N(θ|a,σab²) for θ < a;  1 for θ ∈ [a,b];  N(θ|b,σab²) for θ > b },  σab = (b−a)/(η√2π);   τ = {a ≤ b ∈ R, η ∈ R+};
              µ = (a+b)/2,  σ² = (w²/4) (η³/3 + η² + 4η/π + 2/π)/(η³ + η²),  w = |a−b|

Univariate hyperpriors supported only over the positive reals
  Gamma        Gamma hyperparameter θ ∈ R+;             p(θ) = 1/(Γ(k) t^k) θ^(k−1) exp(−θ/t);     τ = {k ∈ R+, t ∈ R+}
  Weibull      Weibull hyperparameter θ ∈ R+;           p(θ) = (k/λ) (θ/λ)^(k−1) exp(−(θ/λ)^k);    τ = {k ∈ R+, λ ∈ R+}
  InverseGauss inverse Gaussian hyperparameter θ ∈ R+;  p(θ) = 1/√(2πθ³/λ) exp( −λ(θ−µ)²/(2µ²θ) ); τ = {µ ∈ R+, λ ∈ R+}
  LogNormal    log-normal hyperparameter θ ∈ R+;        p(θ) = 1/(θσ√2π) exp( −(ln θ − µ)²/(2σ²) );τ = {µ ∈ R, σ² ∈ R+}

Multivariate hyperpriors supported over the whole R^D with mean µ and covariance Σ
  GaussMulti   multivariate normal distribution θ ∈ R^D;       p(θ) = |2πΣ|^(−1/2) exp( −½ (θ−µ)ᵀΣ⁻¹(θ−µ) );   τ = {µ ∈ R^D, Σ ∈ R^(D×D)}
  LaplaceMulti multivariate Laplace distribution θ ∈ R^D;      p(θ) = |√2 Σ|^(−1/2) exp( −√2 ‖L⁻¹(θ−µ)‖1 ), LᵀL = Σ;   τ = {µ ∈ R^D, Σ ∈ R^(D×D)}
  TMulti       multivariate Student's t distribution θ ∈ R^D;  p(θ) = |(ν−2)πΣ|^(−1/2) Γ((ν+D)/2)/Γ(ν/2) ( 1 + (θ−µ)ᵀΣ⁻¹(θ−µ)/(ν−2) )^(−(ν+D)/2);   τ = {µ ∈ R^D, Σ ∈ R^(D×D), ν ∈ R}

Improper hyperpriors used to fix the value of a particular hyperparameter
  Delta / Clamped            clamped hyperparameter θ = θ0 ∈ R;     p(θ) = δ(θ − θ0);              τ = ∅
  DeltaMulti / ClampedMulti  clamped hyperparameter θ = θ0 ∈ R^D;   p(θ) = δ(θ − θ0);              τ = ∅
  SameMulti / EqualMulti     same hyperparameter θ = θ0·1 ∈ R^D;    p(θ) = δ( Π_{i=1}^D (θ0 − θi) );   τ = ∅

Composite hyperpriors [π1(θ), π2(θ), ..] ↦ p(θ)
  Transform   prior distribution on g(θ) rather than θ;     p(θ) = π(g(θ));       τ = {g}
  Mix         mixture distribution;                         p(θ) = Σi wi πi(θ);   τ = {w}
The priorSmoothBox2 is a Gauss-uniform sandwich obtained by complementing a uniform distribution
on [a,b] with two Gaussian halves at each side. The parameter η balances the probability mass
between the constituents so that η/(η+1) is used for the box and 1/(η+1) for the Gaussian sides.
Its brother priorSmoothBox1 is the product of two sigmoidal functions.
The priorDelta or equivalently priorClamped can be used to exclude some hyperparameters from
the optimisation. Their values are clamped to θ0 and the derivative vanishes. There are also multivariate
counterparts priorDeltaMulti and priorClampedMulti.
7.3 Usage of Implemented Hyperpriors
Some code examples taken from doc/usagePrior.m illustrate how to use univariate, multivariate
and composite priors on hyperparameters. Syntactically, a hyperprior hp is defined by
func := Dist // prior distributions in prior/
| Clamped | Delta // predefined for fixing the hyperparameter
pr := ’func’ | @func // univariate hyperprior
| ’funcMulti’ | @funcMulti // multivariate hyperprior
hp := {pr} | {pr, {param, hp}} | {pr, {hp, .., hp}} // composite
i.e., it is either a string containing the name of a hyperprior function, a pointer to a hyperprior
function, or one of the former in combination with a cell array of hyperprior functions and an additional
list of parameters. Furthermore, we have multivariate hyperprior variants and two (equivalent)
predefined hyperpriors that allow excluding variables from the optimisation.
49 ⟨doc/usagePrior.m 49⟩≡
1% demonstrate usage of prior distributions
2%
3% See also priorDistributions.m.
4%
5 ⟨gpml copyright 5a⟩
6clear all,close all
7
8% 1) specify some priors
9% a) univariate priors
10 mu = 1.0; s2 = 0.01^2; nu = 3;
11 pg = {@priorGauss ,mu,s2}; % Gaussian prior
12 pl = {’priorLaplace’,mu,s2}; % Laplace prior
13 pt = {@priorT ,mu,s2,nu}; % Student’s t prior
14 p1 = {@priorSmoothBox1 ,0,3,15}; % smooth box constraints lin decay
15 p2 = {@priorSmoothBox2 ,0,2,15}; % smooth box constraints qua decay
16 pd = {’priorDelta’}; % fix value of prior exclude from optimisation
17 pc = {@priorClamped}; % equivalent to above
18 lam = 1.05; k = 2.5;
19 pw = {@priorWeibull ,lam ,k}; % Weibull prior
20
21 % b) meta priors
22 pmx = {@priorMix ,[0.5,0.5],{pg,pl}}; % mixture of two priors
23 g = @exp; dg = @exp; ig = @log;
24 ptr = {@priorTransform ,g,dg ,ig,pg}; % Gaussian in the exp domain
25
26 % c) multivariate priors
27 m = [1;2]; V = [2,1;1,2];
28 pG = {@priorGaussMulti ,m,V}; % 2d Gaussian prior
29 pD = {’priorDeltaMulti’}; % fix value of prior exclude from optim
30 pC = {@priorClampedMulti}; % equivalent to above
31
32 % 2) evaluation
33 % pri = pt; hp = randn(1,3);
34 % pri = pmx; hp = randn(1,3);
35 % pri = ptr; hp = randn(1,3);
36 pri = pG; hp = randn(2,3);
37
38 % a) draw a sample from the prior
39 feval(pri{:})
40
41 % b) evaluate prior and derivative if requires
42 [lp,dlp] = feval(pri{:},hp)
43
44 % 3) comprehensive example
45 x = (0:0.1:10)’; y = 2*x+randn(size(x)); % generate training data
46 mean = {@meanSum ,{@meanConst ,@meanLinear }}; % specify mean function
47 cov = {@covSEiso}; lik = {@likGauss}; % specify covariance and lik
48 hyp.cov = [log(1);log(1.2)]; hyp.lik = log(0.9); hyp.mean = [2;3];
49 par = {mean,cov,lik,x,y}; mfun = @minimize; % input for GP function
50
51 % a) plain marginal likelihood optimisation (maximum likelihood)
52 im = @infExact; % inference method
53 hyp_plain = feval(mfun, hyp, @gp, -10, im, par{:}); % optimise
54
55 % b) regularised optimisation (maximum a posteriori) with 1d priors
56 prior.mean = {pg;pc}; % Gaussian prior for first, clamp second par
57 prior.cov = {p1;[]}; % box prior for first , nothing for second par
58 im = {@infPrior,@infExact,prior}; % inference method
59 hyp_p1 = feval(mfun, hyp, @gp, -10, im, par{:}); % optimise
60
61 % c) regularised optimisation (maximum a posteriori) with Nd priors
62 prior = []; % clear the structure
63 % multivariate Student ’s t prior on the first and second mean hyper
64 prior.multi{1} = {@priorTMulti ,[mu;mu],diag([s2,s2]),nu,...
65 struct(’mean’ ,[1,2])}; % use hyper struct
66 % Equivalent shortcut (same mu and s2 for all dimensions)
67 prior.multi {1} = {@priorTMulti ,mu ,s2,nu ,struct(’mean’,[1,2])};
68 % multivariate Gaussian prior jointly on 1st and 3rd hyper
69 prior.multi{2} = {@priorGaussMulti ,[mu;mu],diag([s2,s2]),...
70 [1,3]}; % use unwrapped hyper vector
71 % Equivalent shortcut (same mu and s2 for all dimensions)
72 prior.multi{2} = { @priorGaussMulti ,mu,s2 ,[1 ,3]};
73 im = {@infPrior,@infExact,prior}; % inference method
74 hyp_pN = feval(mfun, hyp, @gp, -10, im, par{:}); % optimise
75
76 [any2vec(hyp), any2vec(hyp_plain), any2vec(hyp_p1), any2vec(hyp_pN)]