The GPML Toolbox version 4.2

Carl Edward Rasmussen & Hannes Nickisch

June 15, 2018

Abstract

The GPML toolbox is an Octave 3.2.x and Matlab 7.x implementation of inference and prediction in Gaussian process (GP) models. It implements algorithms discussed in Rasmussen & Williams: Gaussian Processes for Machine Learning, the MIT Press, 2006 and Nickisch & Rasmussen: Approximations for Binary Gaussian Process Classification, JMLR, 2008.

The strength of the toolbox lies in its flexibility, simplicity and extensibility. It is flexible because, first, it allows the properties of the GP to be specified through the definition of mean and covariance functions; second, it allows a choice of inference procedure, e.g. exact inference or Expectation Propagation (EP); and third, it allows a choice of likelihood function, e.g. Gaussian or Laplace (for regression) and cumulative Logistic (for classification). Simplicity is achieved through a single user function and compact code. Extensibility is ensured by a modular design which makes it easy to add to the already fairly extensive libraries of inference methods, mean functions, covariance functions and likelihood functions.

This document is a technical manual for developers and contains many details. If you are not yet familiar with the GPML toolbox, the user documentation and the examples therein are a better way to get started.

Contents

1 Gaussian Process Training and Prediction
2 The gp Function
3 Inference Methods
3.1 Exact Inference with Gaussian likelihood
3.2 Laplace's Approximation
3.3 Expectation Propagation
3.4 Kullback Leibler Divergence Minimisation
3.5 Variational Bayes
3.6 Compatibility Between Inference Methods and Covariance Approximations
3.7 Sparse Covariance Approximations
3.8 Grid-Based Covariance Approximations
3.9 State Space Representation of GPs
4 Likelihood Functions
4.1 Prediction
4.2 Interface
4.3 Implemented Likelihood Functions
4.4 Usage of Implemented Likelihood Functions
4.5 Compatibility Between Likelihoods and Inference Methods
4.6 Gaussian Likelihood
4.6.1 Exact Inference
4.6.2 Laplace's Approximation
4.6.3 Expectation Propagation
4.6.4 Variational Bayes
4.7 Warped Gaussian Likelihood
4.8 Gumbel Likelihood
4.9 Laplace Likelihood
4.10 Student's t Likelihood
4.11 Cumulative Logistic Likelihood
4.12 GLM Likelihoods: Poisson, Negative Binomial, Weibull, Gamma, Exponential, Inverse Gaussian and Beta
4.12.1 Inverse Link Functions
4.12.2 Poisson Likelihood
4.12.3 Weibull Likelihood
4.12.4 Gamma Likelihood
4.12.5 Exponential Likelihood
4.12.6 Inverse Gaussian Likelihood
4.12.7 Beta Likelihood
5 Mean Functions
5.1 Interface
5.2 Implemented Mean Functions
5.3 Usage of Implemented Mean Functions
6 Covariance Functions
6.1 Interface
6.2 Implemented Covariance Functions
6.3 Usage of Implemented Covariance Functions
7 Hyperpriors
7.1 Interface
7.2 Implemented Hyperpriors
7.3 Usage of Implemented Hyperpriors

1 Gaussian Process Training and Prediction

The gpml toolbox contains a single user function gp described in section 2. In addition there are a number of supporting structures and functions which the user needs to know about, as well as an internal convention for representing the posterior distribution, which may not be of direct interest to the casual user.

Inference Methods: An inference method is a function which computes the (approximate) posterior, the (approximate) negative log marginal likelihood and its partial derivatives w.r.t. the hyperparameters, given a model specification (i.e., GP mean and covariance functions and a likelihood function) and a data set. Inference methods are discussed in section 3. A new inference method requires a function providing the desired inference functionality and possibly extra functionality in the likelihood functions it applies to.

Hyperparameters: The hyperparameters are collected in a struct that controls the properties of the model, i.e. the GP mean and covariance functions and the likelihood function. The struct has the three fields mean, cov and lik, each of which is a vector. The number of elements in each field must agree with the number of hyperparameters in the specification of the corresponding function (below). If a field is either empty or non-existent it represents zero hyperparameters.
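As a concrete illustration, the sketch below sets up a hyperparameter struct for a model with a constant mean, an isotropic squared exponential covariance and Gaussian observation noise (the numerical values are arbitrary; meanConst, covSEiso and likGauss expect one, two and one hyperparameter(s), respectively):

  % minimal sketch: a hyperparameter struct matching a concrete model choice
  meanfunc = {@meanConst};          % 1 hyperparameter: the constant c
  covfunc  = {@covSEiso};           % 2 hyperparameters: log(ell), log(sf)
  likfunc  = {@likGauss};           % 1 hyperparameter: log(sn)
  hyp.mean = 0.5;                   % constant mean value
  hyp.cov  = log([1.0; 1.0]);       % [log(ell); log(sf)]
  hyp.lik  = log(0.1);              % log noise standard deviation

If, say, hyp.cov held three elements instead of two, gp would raise an error when checking the hyperparameter specification (section 2).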
When working with FITC approximate inference, the inducing inputs xu can also be treated as hyperparameters for some common stationary covariances. Hyperparameter Prior Distributions: When optimising the marginal likelihood w.r.t. hyperparameters, it is sometimes useful to softly constrain the hyperparameters by means of prior knowledge. A prior is a probability distribution over individual or a group of hyperparameters, section 7. Likelihood Functions: The likelihood function specifies the form of the likelihood of the GP model and computes terms needed for prediction and inference. For inference, the required properties of the likelihood depend on the inference method, including properties necessary for hyperparameter learning, section 4. Mean Functions: The mean function is a cell array specifying the GP mean. It computes the mean and its derivatives w.r.t.. the part of the hyperparameters pertaining to the mean. The cell array allows flexible specification and composition of mean functions, discussed in section 5. The default is the zero function. Covariance Functions: The covariance function is a cell array specifying the GP covariance function. It computes the covariance and its derivatives w.r.t.. the part of the hyperparameters pertaining to the covariance function. The cell array allows flexible specification and composition of covariance functions, discussed in section 6. Inference methods, see section 3, compute (among other things) an approximation to the posterior distribution of the latent variables fi associated with the training cases, i = 1, . . . , n. This approximate posterior is assumed to be Gaussian, and is communicated via a struct post with the fields post.alpha, post.sW and post.L. Often, starting from the Gaussian prior p(f) = N(f|m, K) the approximate posterior admits the form q(f|D) = N f|µ = m + Kα, V = (K−1 + W)−1 , where W diagonal with Wii = s2i . (1) In such cases, the entire posterior can be computed from the two vectors post.alpha and post.sW; the inference method may optionally also return L = chol(diag(s)K diag(s) + I). If on the other hand the posterior doesn’t admit the above form, then post.L returns the matrix L = −(K + W −1 )−1 (and post.sW is unused). In addition, a sparse representation of the posterior may be used, in which case the non-zero elements of the post.alpha vector indicate the active entries. 4 2 The gp Function The gp function is typically the only function the user would directly call. 4a hgp.m 4ai≡ 1 2 3 4 5 6 7 8 9 function [varargout] = gp(hyp, inf, mean, cov, lik, x, y, xs, ys) hgp function help 4bi hinitializations 5bi hinference 6ci if nargin==7 % if no test cases are provided varargout = {nlZ, dnlZ, post}; % report -log marg lik, derivatives and post else hcompute test predictions 7i end It offers facilities for training the hyperparameters of a GP model as well as predictions at unseen inputs as detailed in the following help. 4b hgp function help 4bi≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % (4a) Gaussian Process inference and prediction. The gp function provides a flexible framework for Bayesian inference and prediction with Gaussian processes for scalar targets , i.e. both regression and binary classification. The prior is Gaussian process , defined through specification of its mean and covariance function. The likelihood function is also specified. 
Both the prior and the likelihood may have hyperparameters associated with them. Two modes are possible: training or prediction: if no test cases are supplied , then the negative log marginal likelihood and its partial derivatives w.r.t. the hyperparameters is computed; this mode is used to fit the hyperparameters. If test cases are given , then the test set predictive probabilities are returned. Usage: training: [nlZ dnlZ ] = gp(hyp, inf, mean, cov, lik, x, y); prediction: [ymu ys2 fmu fs2 ] = gp(hyp, inf, mean, cov, lik, x, y, xs); or: [ymu ys2 fmu fs2 lp] = gp(hyp, inf, mean, cov, lik, x, y, xs, ys); where: hyp inf mean cov lik x y xs ys struct of column vectors of mean/cov/lik hyperparameters function specifying the inference method prior mean function prior covariance function likelihood function n by D matrix of training inputs column vector of length n of training targets ns by D matrix of test inputs column vector of length nn of test targets nlZ dnlZ returned value of the negative log marginal likelihood struct of column vectors of partial derivatives of the negative log marginal likelihood w.r.t. mean/cov/lik hyperparameters column vector (of length ns) of predictive output means column vector (of length ns) of predictive output variances column vector (of length ns) of predictive latent means column vector (of length ns) of predictive latent variances column vector (of length ns) of log predictive probabilities ymu ys2 fmu fs2 lp 5 40 41 42 43 44 45 46 5a % post struct representation of the (approximate) posterior % 3rd output in training mode or 6th output in prediction mode % can be reused in prediction mode gp(.., cov, lik, x, post, xs,..) % % See also infMethods.m, meanFunctions.m, covFunctions.m, likFunctions.m. % hgpml copyright 5ai hgpml copyright 5ai≡ (4b 9 10 19 22 24a 36 38 39 44 46 49) 1 % Copyright (c) by Carl Edward Rasmussen and Hannes Nickisch , 2018-06-15. 2 % File automatically generated using noweb. Depending on the number of input parameters, gp knows whether it is operated in training or in prediction mode. The highlevel structure of the code is as follows. After some initialisations, we perform inference and decide whether test set predictions are needed or only the result of the inference is demanded. 5b hinitializations 5bi≡ 1 2 3 4 (4a) hminimalist usage 5ci hmultivariate output by recursion 5di hprocess input arguments 6ai hcheck hyperparameters 6bi If the number of input arguments is incorrect, we echo a minimalist usage and return. 5c hminimalist usage 5ci≡ (5b) 1 if nargin <7 || nargin >9 2 disp(’Usage: [nlZ dnlZ ] = gp(hyp, inf, mean, cov, lik, x, y);’) 3 disp(’ or: [ymu ys2 fmu fs2 ] = gp(hyp, inf, mean, cov, lik, x, y, xs);’) 4 disp(’ or: [ymu ys2 fmu fs2 lp] = gp(hyp, inf, mean, cov, lik, x, y, xs, ys);’) 5 return 6 end If there is more than a single output dimension in y, we call multiple instances of gp with shared hyperparameters and settings each computing the corresponding result with scalar output. 5d hmultivariate output by recursion 5di≡ (5b) 1 if size(y,2)>1 % deal with (independent) multivariate output y by recursing 2 d = size(y,2); varargout = cell(nargout ,1); out = cell(nargout ,1); % allocate 3 for i=1:d 4 in = {hyp, inf, mean, cov, lik, x, y(:,i)}; 5 if nargin >7, in = {in{:}, xs}; end 6 if nargin >8, in = {in{:}, ys(:,i)}; end 7 if i==1, [varargout{:}] = gp(in{:}); % perform inference for dimension .. 8 else [out{:}] = gp(in{:}); % .. 
number i in the output 9 if nargin==7, no = 2; 10 varargout{1} = varargout{1} + out{1}; % sum nlZ 11 if nargout >1 % sum dnlZ 12 varargout{2} = vec2any(hyp,any2vec(varargout{2})+any2vec(out{2})); 13 end 14 else no = 5; % concatenate ymu ys2 fmu fs2 lp 15 for j=1:min(nargout ,no), varargout{j} = [varargout{j},out{j}]; end 16 end 17 if nargout >no % concatenate post 18 if i==2, varargout{no+1} = {varargout{no+1},out{no+1}}; 19 else varargout{no+1} = {varargout{no+1}{:},out{no+1}}; end 20 end 21 end 22 end, return % return to end the recursion 23 end 6 Next, we set some useful default values for empty arguments, and convert inf and lik to function handles and mean and cov to cell arrays if necessary. Initialize variables. 6a hprocess input arguments 6ai≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 (5b) if isempty(mean), mean = {@meanZero}; end % set default mean if ischar(mean) || isa(mean, ’function_handle’), mean = {mean}; end % make cell if isempty(cov), error(’Covariance function cannot be empty’); end % no default if ischar(cov) || isa(cov,’function_handle’), cov = {cov}; end % make cell cstr = cov{1}; if isa(cstr,’function_handle’), cstr = func2str(cstr); end if (strcmp(cstr,’covFITC’) || strcmp(cstr,’apxSparse’)) && isfield(hyp,’xu’) cov{3} = hyp.xu; %use hyp.xu end if isempty(inf), inf = {@infGaussLik}; end % set default inference method if ischar(inf), inf = str2func(inf); end % convert into function handle if ischar(inf) || isa(inf,’function_handle’), inf = {inf}; end % make cell istr = inf{1}; if isa(istr,’function_handle’), istr = func2str(istr); end if strcmp(istr,’infPrior’) istr = inf{2}; if isa(istr,’function_handle’), istr = func2str(istr); end end if isempty(lik), lik = {@likGauss}; end % set default lik if ischar(lik) || isa(lik,’function_handle’), lik = {lik}; end % make cell lstr = lik{1}; if isa(lstr,’function_handle’), lstr = func2str(lstr); end D = size(x,2); if strncmp(cstr,’covGrid’,7) || strcmp(cstr,’apxGrid’) % only some inf* possible D = 0; xg = cov{3}; p = numel(xg); for i=1:p, D = D+size(xg{i},2); end % dims end Check that the sizes of the hyperparameters supplied in hyp match the sizes expected. The three parts hyp.mean, hyp.cov and hyp.lik are checked separately, and define empty entries if they don’t exist. 6b hcheck hyperparameters 6bi≡ 1 2 3 4 5 6 7 8 9 10 11 12 (5b) if ~isfield(hyp,’mean’), hyp.mean = []; end % check the hyp specification if eval(feval(mean{:})) ~= numel(hyp.mean) error(’Number of mean function hyperparameters disagree with mean function’) end if ~isfield(hyp,’cov’), hyp.cov = []; end if eval(feval(cov{:})) ~= numel(hyp.cov) error(’Number of cov function hyperparameters disagree with cov function’) end if ~isfield(hyp,’lik’), hyp.lik = []; end if eval(feval(lik{:})) ~= numel(hyp.lik) error(’Number of lik function hyperparameters disagree with lik function’) end Inference is performed by calling the desired inference method inf. In training mode, we accept a failure of the inference method (and issue a warning), since during hyperparameter learning, hyperparameters causing a numerical failure may be attempted, but the minimize function may gracefully recover from this. During prediction, failure of the inference method is an error. 
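For orientation, a typical call sequence as seen from the user's side might look like the following sketch (it assumes the minimize.m optimiser that ships with the toolbox and the variables meanfunc, covfunc, likfunc and hyp from the sketch in section 1; x, y and xs denote training inputs, training targets and test inputs):

  % sketch: fit the hyperparameters, then inspect the evidence and predict
  hyp2 = minimize(hyp, @gp, -100, @infGaussLik, meanfunc, covfunc, likfunc, x, y);
  [nlZ, dnlZ]          = gp(hyp2, @infGaussLik, meanfunc, covfunc, likfunc, x, y);     % training mode
  [ymu, ys2, fmu, fs2] = gp(hyp2, @infGaussLik, meanfunc, covfunc, likfunc, x, y, xs); % prediction mode

If the inference method fails during the minimize run, gp issues a warning and returns NaN for the negative log marginal likelihood together with zero derivatives (see the inference chunk below), so that the optimiser can recover.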
6c hinference 6ci≡ (4a) 1 try % call the inference method 2 % issue a warning if a classification likelihood is used in conjunction with 3 % labels different from +1 and -1 4 if strcmp(lstr,’likErf’) || strcmp(lstr,’likLogistic’) 5 if ~isstruct(y) 6 uy = unique(y); 7 if any( uy~=+1 & uy~=-1 ) 8 warning(’You try classification with labels different from {+1,-1}’) 7 9 end 10 end 11 end 12 if nargin >7 % compute marginal likelihood and its derivatives only if needed 13 if isstruct(y) 14 post = y; % reuse a previously computed posterior approximation 15 else 16 post = feval(inf{:}, hyp, mean, cov, lik, x, y); 17 end 18 else 19 if nargout <=1 20 [post nlZ] = feval(inf{:}, hyp, mean, cov, lik, x, y); dnlZ = {}; 21 else 22 [post nlZ dnlZ] = feval(inf{:}, hyp, mean, cov, lik, x, y); 23 end 24 end 25 catch 26 msgstr = lasterr; 27 if nargin >7, error(’Inference method failed [%s]’, msgstr); else 28 warning(’Inference method failed [%s] .. attempting to continue’,msgstr) 29 varargout = {NaN, vec2any(hyp,zeros(numel(any2vec(hyp)),1))}; return % go on 30 end 31 end We copy the already computed negative log marginal likelihood to the first output argument, and if desired report its partial derivatives w.r.t. the hyperparameters if running in inference mode. Predictions are computed in a loop over small batches to avoid memory problems for very large test sets. 7 hcompute test predictions 7i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 (4a) alpha = post.alpha; L = post.L; sW = post.sW; if issparse(alpha) % handle things for sparse representations nz = alpha ~= 0; % determine nonzero indices if issparse(L), L = full(L(nz,nz)); end % convert L and sW if necessary if issparse(sW), sW = full(sW(nz)); end else nz = true(size(alpha ,1),1); end % non-sparse representation if isempty(L) % in case L is not provided , we compute it K = feval(cov{:}, hyp.cov, x(nz,:)); L = chol(eye(sum(nz))+sW*sW’.*K); end %verify whether L contains valid Cholesky decomposition or something different Lchol = isnumeric(L) && all(all(tril(L,-1)==0)&diag(L)’>0&isreal(diag(L))’); ns = size(xs,1); % number of data points if strncmp(cstr,’apxGrid’,7), xs = apxGrid(’idx2dat’,cov{3},xs); end % expand nperbatch = 1000; % number of data points per mini batch nact = 0; % number of already processed test data points ymu = zeros(ns,1); ys2 = ymu; fmu = ymu; fs2 = ymu; lp = ymu; % allocate mem while nactuse Cholesky parameters (alpha ,sW,L) V = L’\(repmat(sW,1,length(id)).*Ks); fs2(id) = kss - sum(V.*V,1)’; % predictive variances else % L is not triangular => use alternative parametrisation if isnumeric(L), LKs = L*Ks; else LKs = L(Ks); end % matrix or callback fs2(id) = kss + sum(Ks.*LKs ,1)’; % predictive variances end fs2(id) = max(fs2(id),0); % remove numerical noise i.e. negative variances Fs2 = repmat(fs2(id),1,N); % we have multiple values in case of sampling if nargin <9 [Lp, Ymu, Ys2] = feval(lik{:},hyp.lik,[], Fmu(:),Fs2(:)); else Ys = repmat(ys(id),1,N); [Lp, Ymu, Ys2] = feval(lik{:},hyp.lik,Ys(:),Fmu(:),Fs2(:)); end lp(id) = sum(reshape(Lp, [],N),2)/N; % log probability; sample averaging ymu(id) = sum(reshape(Ymu,[],N),2)/N; % predictive mean ys|y and .. ys2(id) = sum(reshape(Ys2,[],N),2)/N; % .. variance 9 3 Inference Methods Inference methods are responsible for computing the (approximate) posterior post, the (approximate) negative log marginal likelihood nlZ and its partial derivatives dnlZ w.r.t. the hyperparameters hyp. 
The arguments to the function are hyperparameters hyp, mean function mean, covariance function cov, likelihood function lik and training data x and y. Several inference methods are implemented and described this section. 9 hinfMethods.m 9i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % Inference methods: Compute the (approximate) posterior for a Gaussian process. Methods currently implemented include: infGaussLik infLaplace infEP infVB infKL Exact inference (only possible with Gaussian likelihood) Laplace ’s Approximation Expectation Propagation Variational Bayes Approximation Kullback -Leibler optimal Approximation infMCMC Markov Chain Monte Carlo and Annealed Importance Sampling We offer two samplers. - hmc: Hybrid Monte Carlo - ess: Elliptical Slice Sampling No derivatives w.r.t. to hyperparameters are provided. infLOO infPrior Leave -One-Out predictive probability and Least -Squares Approxim. Perform inference with hyperparameter prior. The interface to the approximation methods is the following: function [post nlZ dnlZ] = inf..(hyp, mean, cov, lik, x, y) where: hyp mean cov lik x y is is is is is is nlZ dnlZ is the returned value of the negative log marginal likelihood is a (column) vector of partial derivatives of the negative log marginal likelihood w.r.t. each hyperparameter struct representation of the (approximate) posterior containing is a (sparse or full column vector) containing inv(K)*(mu-m), where K is the prior covariance matrix , m the prior mean, and mu the approx posterior mean is a (sparse or full column) vector containing diagonal of sqrt(W) the approximate posterior covariance matrix is inv(inv(K)+W) is a (sparse or full) triangular matrix , L = chol(sW*K*sW+eye(n)), or a full matrix , L = -inv(K+inv(W)), or a function L(A) of a matrix A such that -(K+inv(W))*L(A) = A post alpha sW L a struct of hyperparameters the name of the mean function (see meanFunctions.m) the name of the covariance function (see covFunctions.m) the name of the likelihood function (see likFunctions.m) a n by D matrix of training inputs a (column) vector (of size n) of targets Usually , the approximate posterior to be returned admits the form N(mu=m+K*alpha , V=inv(inv(K)+W)), where alpha is a vector and W is diagonal. For more information on the individual approximation methods and their 10 49 % implementations , see the separate inf??.m files. See also gp.m. 50 % 51 hgpml copyright 5ai Not all inference methods are compatible with all likelihood functions, e.g.. exact inference is only possible with Gaussian likelihood. In order to perform inference, each method needs various properties of the likelihood functions, section 4. 3.1 Exact Inference with Gaussian likelihood For Gaussian likelihoods, GP inference reduces to computing mean and covariance of a multivariate Gaussian which can be done exactly by simple matrix algebra. The program inf/infExact.m does exactly this. If it is called with a likelihood function other than the Gaussian, it issues an error. The Gaussian posterior q(f|D) = N f|µ, V is exact. 10 hinf/infGaussLik.m 10i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 function [post nlZ dnlZ] = infGaussLik(hyp, mean, cov, lik, x, y, opt) % Exact inference for a GP with Gaussian likelihood. 
% % Compute a parametrization of the posterior , the negative log marginal % likelihood and its derivatives w.r.t. the hyperparameters. The function takes % a specified covariance function (see covFunctions.m) and likelihood function % (see likFunctions.m), and is designed to be used with gp.m. % hgpml copyright 5ai % % See also INFMETHODS.M, APX.M. if if if if nargin <7, opt = []; end % make sure parameter exists iscell(lik), likstr = lik{1}; else likstr = lik; end ~ischar(likstr), likstr = func2str(likstr); end ~strcmp(likstr ,’likGauss’) % NOTE: no explicit call to likGauss error(’Exact inference only possible with Gaussian likelihood’); end [n, D] = size(x); [m,dm] = feval(mean{:}, hyp.mean, x); % evaluate mean vector and deriv sn2 = exp(2*hyp.lik); W = ones(n,1)/sn2; % noise variance of likGauss K = apx(hyp,cov,x,opt); % set up covariance approximation [ldB2,solveKiW ,dW,dhyp,post.L] = K.fun(W); % obtain functionality depending on W alpha = solveKiW(y-m); post.alpha = K.P(alpha); % return the posterior parameters post.sW = sqrt(W); % sqrt of noise precision vector if nargout >1 % do we want the marginal likelihood? nlZ = (y-m)’*alpha/2 + ldB2 + n*log(2*pi*sn2)/2; % -log marginal likelihood if nargout >2 % do we want derivatives? dnlZ = dhyp(alpha); dnlZ.mean = -dm(alpha); dnlZ.lik = -sn2*(alpha ’*alpha) - 2*sum(dW)/sn2 + n; end end 11 3.2 Laplace’s Approximation For differentiable likelihoods, Laplace’s approximation, approximates the posterior by a Gaussian centered at its mode and matching its curvature inf/infLaplace.m. More concretely, the mean of the posterior q(f|D) = N f|µ, V is – defining `i (fi ) = ln p(yi |fi ) and P `(f) = n ` (f ) – given by i i i=1 µ = arg min φ(f), where φ(f) = f 1 c (f − m)> K−1 (f − m) − `(f) = − ln[p(f)p(y|f)], 2 which we abbreviate by µ ← L(`). The curvature ∂2 φ ∂ff> (2) 2 ∂ = K−1 + W with Wii = − ∂f 2 ln p(yi |fi ) i −1 −1 serves as precision R for the Gaussian posterior approximation V R = (K + W) and the marginal likelihood Z = p(f)p(y|f)df is approximated by Z ≈ ZLA = φ̃(f)df where we use the 2nd order Taylor expansion at the mode µ given by φ̃(f) = φ(µ) + 21 (f − µ)> V−1 (f − µ) ≈ φ(f). Laplace’s approximation needs derivatives up to third order for the mode fitting procedure (Newton method) ∂k dk = log p(y|f), k = 0, 1, 2, 3 ∂fk and ∂ ∂k log p(y|f), k = 0, 1, 2 dk = ∂ρi ∂fk evaluated at the latent location f and observed value y. The likelihood calls (see section 4) • [d0, d1, d2, d3] = lik(hyp, y, f, [], ’infLaplace’) and • [d0, d1, d2] = lik(hyp, y, f, [], ’infLaplace’, i) return exactly these values. 3.3 Expectation Propagation The basic idea of Expectation Propagation (EP) as implemented in inf/infEP.m. is to replace the non-Gaussian likelihood terms p(yi |fi ) by Gaussian functions t(fi ; νi , τi ) = exp(νi fi − 12 τi f2i ) and to adjust the natural parameters νi , τi such that the following identity holds: Z Z 1 1 k f q−i (f) · t(f; νi , τi )df = fk q−i (f) · p(yi |f)df, k = 1, 2 Zt,i Zp,i Q with the so-called cavity distributions q−i (f) = N(f|m, K) j6=i t(fj ; νj , τj ) ∝ N(f|µ, V)/t(fi ; νi , τi ) equal to approximation function and the two normalisers R the posterior divided by the ith Gaussian R Zt,i = q−i (f) · t(fi ; νi , τi )df and Zp,i = q−i (f) · p(yi |fi )df. The moment matching corresponds to minimising the following local KL-divergence νi , τi = arg min KL[q−i (f)p(yi |fi )/Zp,i ||q−i (f)t(fi ; ν, τ)/Zt,i ]. 
ν,τ In order to apply the moment matching steps in a numerically safe way, EP requires the deriviatives of the expectations w.r.t. the Gaussian mean parameter µ Z ∂k dk = log p(y|f)N(f|µ, σ2 )df, k = 0, 1, 2 ∂µk 12 and the ith likelihood hyperparameter ρi Z ∂ d = log p(y|f)N(f|µ, σ2 )df ∂ρi which can be obtained by the likelihood calls (see section 4) • [d0, d1, d2] = lik(hyp, y, mu, s2, ’infEP’) and • d = lik(hyp, y, mu, s2, ’infEP’, i). 3.4 Kullback Leibler Divergence Minimisation Another well known approach to approximate inference implemented inf/infKL.m in attempts to directly find the closest Gaussian q(f|D) = N(f|µ, V) to the exact posterior p(f|D) w.r.t. to some proximity measure or equivalently to maximise a lower bound Z(µ, V) to the marginal likelihood Z as described in Nickisch & Rasmussen Approximations for Binary Gaussian Process Classification, JMLR, 2008. In particular, one minimises KL (N(f|µ, V)||p(f|D)) which amounts to minimising − ln Z(µ, V) as defined by: Z Z p(f) − ln Z = − ln p(f)p(y|f)df = − ln q(f|D) p(y|f)df q(f|D) Z Z Jensen q(f|D) 6 q(f|D) ln df − q(f|D) ln p(y|f)df =: − ln Z(µ, V) p(f) n Z X = KL (N(f|µ, V)||N(f|m, K)) − N(fi |µi , vii ) ln p(yi |fi )dfi , vii = [V]ii i=1 n 1 X 1 `KL (µi , vii ) tr(VK−1 − I) − ln |VK−1 | + (µ − m)> K−1 (µ − m) − 2 2 = i=1 R where = N(fi |µi , vii )`i (fi )dfi is the convolution of the log likelihood `i with the Gaussian N and v = dg(V). Equivalently, one can view `KL as a smoothed version of ` with univariate smoothing kernel N. `KL v (µi ) From Challis & Barber Concave Gaussian Variational Approximations for Inference in Large Scale Bayesian Linear Models, AISTATS, 2011 we know that the mapping (µ, L) 7→ − ln Z(µ, L> L) is jointly convex whenever the likelihoods fi 7→ P(yi |fi ) are log concave. In particular, this implies that every (µi , si ) 7→ −`KL (µi , s2i ) is jointly convex. We use an optimisation algorithm similar to EP (section 3.3) where we minimise the local KLdivergence the other way round µi , si = arg minµ,s KL[N(f|µ, s2 )||q−i (f)p(yi |fi )/Zp,i ]. This view was brought forward by Tom Minka Convex Divergence measures and message passing, MSRTR, 2005. The KL minimisation constitutes a jointly convex 2d optimisation problem solved by klmin using a scaled Newton approach which is included as a sub function in inf/infKL.m. The smoothed likelihood `KL (µi , vii ) is implemented as a meta likelihood in likKL; it uses GaussianHermite quadrature to compute the required integrals. Note that – as opposed to EP – GaussianHermite quadrature is appropriate since we integrate against the ln P(yi |fi ) (which can be well approximated by a polynomial) instead of P(yi |fi ) itself. The algorithm is – again unlike EP – provably convergent for log-concave likelihoods (e.g. likGauss, likLaplace, likSech2, likLogistic, likPoisson) since it can be regarded as coordinate descent with guaranteed decrease in the objective in every step. Due to the complex update computations, infKL can be quite slow although it has the same O(n3 ) asymptotic complexity as EP and Laplace. 13 3.5 Variational Bayes One can drive the bounding even further by means of local quadratic lower bounds to the log likelihood `(f) = ln p(y|f). Suppose that we use a super-Gaussian likelihood p(y|f) i.e. likelihoods that can be lower bounded by Gaussians of any width w (e.g. likLaplace, likT, likLogistic, likSech2). Formally, that means that there are b, z ∈ R such that ρ(f) = ln p(y|f − z) − bf √ is symmetric and f 7→ ρ(f) is a convex function for all f > 0. 
As a result, we obtain the following exact representation of the likelihood wf2 1 `(f) = ln p(y|f) = max (b + wz)f − − h(γ) , 2 2 w>0 which can be derived by convex duality and assuming the likelihoods to be super-Gaussian. Details can be found in papers by Palmer et al. Variational EM Algorithms for Non-Gaussian Latent Variable Models, NIPS, 2006 and Nickisch & Seeger Convex Variational Bayesian Inference for Large Scale Generalized Linear Models, ICML, 2009. The bottom line is that we can treat the variational bounding as a sequence of Laplace approximations with the “variational Bayes” log likelihood q VB ` (fi ) = `(gi ) + bi (fi − gi ), g = sgn(f − z) (f − z)2 + v + z instead of the usual likelihood `(fi ) = ln p(yi |fi ) i.e. we solve µ ← L(`VB v ) instead of µ ← L(`). See section 3.2. In the code of inf/infVB.m, the likelihood is implemented in the function likVB. At the end, the optimal value of W can be obtained analytically via wi = |bi − ` 0 (gi )|/|gi − zi |. For the minimisation in inf/infVB.m, we use a provably convergent double loop algorithm, where in the inner loop a nonlinear least squares problem (convex for log-concave likelihoods) is solved using −1 + W)−1 ). inf/infLaplace.m such that µ ← L(`VB v ) and in the outer loop, we compute v ← dg((K The only requirement to the likelihood function is that it returns the values z and b required by the bound which are delivered by the call (see section 4) • [b,z] = lik(hyp, y, [], ga, ’infVB’) The negative marginal likelihood upper bound − ln ZVB is obtained by integrating the prior times the exact representation of the likelihood h(γ) ν2 p 1 p(y|f) = max q(y|f, γ), q(y|f, γ) = N(f|ν, γ) exp − − 2πγ, γ = , ν = bγ + z 2 2γ w γ>0 w.r.t. the latent variables f yielding Z − ln ZVB = − ln N(f|m, K) n Y qi (yi |fi , γi )df i=1 = − ln N(m|ν, K + Γ) + 3.6 1 h(γ) − w> ν2 − 1> ln 2πγ . 2 Compatibility Between Inference Methods and Covariance Approximations Another kind of approximation is needed to render an inference method scalable. We have two approximation schemes which in fact approximate the covariance to make it amenable to large number of training data points. The following table shows the compatibility between some inference methods and two major groups of covariance approximations we will discuss in the next two sections. 14 Approximation \ Inference variational free energy (VFE) fully independent training conditionals (FITC) hybrid between VFE and FITC, s0 ∈ [0, 1] Kronecker/Toeplitz/BTTB grid (KISS-GP) State space representation 3.7 Exact Laplace infGaussLik infLaplace X X X X X X X X X X VB infVB X X X X X EP infEP KL, MCMC, LOO inf{KL,MCMC,LOO} X (X, ADF) Implementation apxSparse, opt.s=0.0 apxSparse, opt.s=1.0 apxSparse, opt.s=s0 apxGrid apxState Sparse Covariance Approximations One of the main problems with GP models is the high computational load for inference computations. In a setting with n training points x, exact inference with Gaussian likelihood requires O(n3 ) effort; approximations like Laplace or EP consist of a sequence of O(n3 ) operations. There is a line of research with the goal to alleviate this burden by using approximate covariance functions k̃ instead of k. A review is given by Candela and Rasmussen1 . One basic idea in those approximations is to work with a set of m inducing inputs u with a reduced computational load of O(nm2 ). In the following, we will provide a rough idea of the FITC approximation used in the toolbox. 
Let K denote the n × n covariance matrix between the training points x, Ku the m × n covariance matrix between the n training points and the m inducing points, and Kuu the m × m covariance matrix between the m inducing points. The FITC approximation to the covariance is given by −1 2 K ≈ K̃ = Q + G, G = diag(g), g = diag(K − Q), Q = K> u Quu Ku , Quu = Kuu + σnu I, where σnu is the noise from the inducing inputs. Note that K̃ and K have the same diagonal elements diag(K̃) = diag(K); all off-diagonal elements are the same as for Q. Internally, the necessary covariance evaluations are performed by a meta covariance function cov/apxSparse.m. The toolbox offers FITC versions for regression with Gaussian likelihood inf/infGaussLik.m, as well as for Laplace’s approximation inf/infLaplace.m. The user can decide whether to treat the inducing inputs u as fixed or as hyperparameters. The latter allows to adjust the inducing inputs u w.r.t. the marginal likelihood. As detailed in the documentation of inf/apx.m, u is treated as fixed if it is passed as the 2nd parameter of apxSparse(cov,xu,..). If the hyperparameter structure hyp contains a field hyp.xu in inference method calls such as infGaussLik(hyp,..) or inference/prediction calls like gp(hyp,@infGaussLik,..) the inducing inputs u are treated as hyperparameters and can be optimised. See doc/demoSparse.m for an illustration. 3.8 Grid-Based Covariance Approximations Another way to bring down computational costs is to take advantage of grid structure x. For example, in geostatistics or image processing, the training data x ∈ Rn×D could be a complete 2d lattice of size n1 × n2 as given by the axes g1 ∈ Rn1 , g2 ∈ Rn2 so that n = N = n1 · n2 , D = 2 and x = [vec(g1 1> ), vec(1g> grid U ∈ RN×D is specified by a set of axis 2 )]. In general, a p-dimensional Q Pp p matrices {gi ∈ Rni ×Di }i=1..p so that N = i=1 ni and D = i=1 Di where the axes do not need to be 1d nor do their components need to be sorted. As a consequence, U represents a Cartesian product of its axes U = g1 × g2 × .. × gp . The cov/apxGrid.m covariance function represents a Kronecker product covariance matrix KU,U = Kp ⊗ .. ⊗ K2 ⊗ K1 whose factorisation structure is given by the grid xg . The gain in computationial efficiency is due to the fact that matrix-vector product, determinant, inverse and eigenvalue decompose Pcomputations p 3 3 so that many operations with an overall cost of O(N ) now only cost O( i=1 ni ). 1 A Unifying View of Sparse Approximate Gaussian Process Regression, JMLR, 2005 15 For off-grid data points, we can still take advantage of the computational properties of a grid-based covariance matrix KU,U via the structured kernel interpolation (SKI) framework aka KISS-GP by Wilson and Nickisch2 with extensions3 . Here, the n × n covariance K is obtained from the N × N grid covariance KU,U by interpolation K ≈ WX KU,U W> X , where KU,U is a covariance matrix formed by evaluating the user-specified kernel over a set of latent inducing inputs U, with locations that have been chosen to create algebraic structure in KU,U that we can exploit for efficiency. Here, the interpolation matrix WX ∈ Rn×N is extremely sparse; i.e., for local cubic interpolation WX contains only 4D nonzeros per row, where D is the diata dimension. In addition WX is row-normalised 1n = WX 1N . The structure in KU,U alongside the sparsity of WX , allows for very fast MVMs with the SKI approximate covariance matrix K over the inputs x enabling fast inference and prediction. 
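To make the ingredients concrete, a grid-based model could be assembled along the following lines. This is a sketch only: the grid axes xg and the hyperparameter values are made up, and the opt fields steering the iterative solver are assumptions based on inf/apx.m rather than a verbatim copy of the toolbox interface.

  % sketch: Kronecker/grid (KISS-GP) covariance over a 2d lattice
  xg   = {linspace(-2,2,100)', linspace(-2,2,100)'};    % grid axes g1 and g2
  covg = {@apxGrid, {{@covSEiso}, {@covSEiso}}, xg};    % one covariance factor per axis
  hypg.mean = []; hypg.lik = log(0.1);                  % zero mean hyps, log noise std. dev.
  hypg.cov  = log([1; 1; 1; 1]);                        % hyps of the two axis covariances
  opt.cg_maxit = 500; opt.cg_tol = 1e-5;                % assumed options for the CG solver
  infg = @(varargin) infGaussLik(varargin{:}, opt);     % hand the options to the inference method
  % [nlZ, dnlZ, post] = gp(hypg, infg, [], covg, @likGauss, x, y);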
Internally, we use a meta covariance function cov/apxGrid.m to represent the Kronecker covariance matrix and a Gaussian regression inference method inf/infGaussLik.m. We also support incomplete grids where n < N. A good starting point is Yunus Saatçi’s PhD thesis4 . For incomplete grids, we use the interpolation-based extensions by Wilson et al.5 where conjugate gradients and a determinant approximations are used. See doc/demoGrid1d.m and doc/demoGrid2d.m for an illustration. We also offer non-Gaussian likelihoods as described by Seth Flaxman6 so that inf/infLaplace.m can be used. 3.9 State Space Representation of GPs GP models with covariance functions with a Markovian structure can be transformed into equivalent discrete state space models where inference can be done in linear time O(n). Exact models can be derived for sum, product, linear, noise, constant, Matérn (half-integer), Ornstein–Uhlenbeck, and Wiener covariance functions. Other common covariance functions can be approximated by their Markovian counterparts, including squared exponential, rational quadratic, and periodic covariance functions. A state space model describes the evolution of a dynamical system at different time instances ti , i = 1, 2, . . . by fi ∼ P(fi |fi−1 ), yi ∼ P(yi |fi ), where fi := f(ti ) ∈ Rd and f0 ∼ P(f0 ) with fi being the latent (hidden/unobserved) variable and yi being the observed variable. In continuous time, a simple dynamical system able to represent many covariance functions is given by the following linear time-invariant stochastic differential equation: ḟ(t) = F f(t) + L w(t), yi = H f(ti ) + i , where w(t) is an s-dimensional white noise process, the measurement noise i ∼ N(0, σ2n ) is Gaussian, and F ∈ Rd×d , L ∈ Rd×s , H ∈ R1×d are the feedback, noise effect, and measurement matrices, respectively. The initial state is distributed according to f0 ∼ N(0, P0 ). The latent GP is recovered by f(t) = Hf(t) and w(t) ∈ Rs is a multivariate white noise process with spectral density matrix Qc ∈ Rs×s . For discrete values, this translates into fi ∼ N(Ai−1 fi−1 , Qi−1 ), 2 yi ∼ P(yi |H Lfi ), Kernel Interpolation for Scalable Structured Gaussian Processes, ICML, 2015 Thoughts on Massively Scalable Gaussian Processes, TR, 2015. 4 Scalable Inference for Structured Gaussian Process Models, University of Cambridge, 2011 5 Fast Kernel Learning for Multidimensional Pattern Extrapolation, NIPS, 2014 6 Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods, ICML, 2015 3 16 Algorithm 1 Kalman (forward) filtering. Input: {ti } , y {Ai }, {Qi }, H, P0 W,b for i = 1 to n do if i == 1 then mi ← 0; Pi ← P0 else mi ← Ai mi−1 ; Pi ← Ai Pi−1 A> i +Qi end if if has label yi then µf ← Hmi ; u ← Pi H> ; σ2f ← Hu zi ← Wii σ2f + 1 ki ← Wii u/zi ; Pi ← Pi − ki u> ci ← Wii µf − bi ; mi ← mi − uci /zi end if end for P 1 1 log det(I + W 2 KW 2 ) ← i log zi # training inputs and targets # state space model # likelihood eff. precision and location # init # predict # latent # variance # mean Algorithm 2 Rauch–Tung–Striebel (backward) smoothing. Input: {mi }, {Pi } {Ai }, {Qi } for i = n down to 2 do m ← Ai mi−1 ; P ← Ai Pi−1 A> i + Qi > −1 Gi ← Pi−1 Ai P ; ∆mi−1 ← Gi (mi − m) Pi−1 ← Pi−1 + Gi (Pi − P)G> i mi−1 ← mi−1 + ∆mi−1 end for # Kalman filter output # state space model # predict # variance # mean with f0 ∼ N(0, P0 ). The discrete-time matrices are Ai = A[∆ti ] = e∆ti F , Z ∆ti > Qi = e(∆tk −τ)F L Qc L> e(∆ti −τ)F dτ, 0 where ∆ti = ti+1 − ti > 0. 
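As a concrete illustration of these quantities, the sketch below builds the discrete-time matrices for a Matérn-3/2 covariance with lengthscale ell and signal standard deviation sf. The state space matrices are the standard ones from the state space GP literature, stated here for illustration rather than copied from cov/apxState.m; the noise matrix uses the stationary identity Qi = P∞ − Ai P∞ Ai' derived just below.

  % sketch: discrete-time state space quantities for a Matern-3/2 GP
  ell = 1.3; sf = 0.8; lam = sqrt(3)/ell;   % lengthscale, magnitude, rate
  F    = [0, 1; -lam^2, -2*lam];            % feedback matrix
  H    = [1, 0];                            % measurement matrix, f(t) = H*fvec(t)
  Pinf = diag([sf^2, lam^2*sf^2]);          % stationary state covariance P0
  dt   = 0.25;                              % time step t_{i+1} - t_i
  A    = expm(dt*F);                        % transition matrix A_i = e^(dt*F)
  Q    = Pinf - A*Pinf*A';                  % process noise via the stationary identity

These are the quantities {Ai}, {Qi}, H and P0 consumed by Algorithms 1 and 2 above.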
For stationary covariances k(t, t 0 ) = k(t − t 0 ), the stationary state is distributed by f∞ ∼ N(0, P∞ ) and the stationary covariance can be found by solving the Lyapunov equation Ṗ∞ = F P∞ + P∞ F> + L Qc L> = 0, which leads to the identity Qi = P∞ − Ai P∞ A> i . In practice, the evaluation of the n discrete-time transition matrices Ai = e∆ti F and the noise covariance matrices Qi (in the stationary case) for different values of ∆ti is a computational challenge. Since the matrix exponential ψ : s 7→ esX is smooth, its evaluation can be accurately approximated by convolution interpolation. From the discrete set of matrices, all the necessary computations can be done using Kalman filtering and Kalman smoothing as detailed in Algorithms 1 and 2. Internally, we use a meta covariance function cov/apxState.m to represent the state space represen17 tation. A good starting point is Arno Solin’s PhD thesis7 . See doc/demoState.m for an illustration. We also offer non-Gaussian likelihoods 8 so that inf/infLaplace.m and inf/infVB.m can be used. EP is not fully functional; we offer single-sweep EP aka assumed density filtering (ADF). 7 8 Stochastic Differential Equation Methods for Spatio-Temporal Gaussian Process Regression, Aalto University, 2016 State Space Gaussian Processes with Non-Gaussian Likelihood, ICML, 2018 18 4 Likelihood Functions R A likelihood function pρ (y|f) (with hyperparameters ρ) is a conditional density pρ (y|f)dy = 1 defined forQscalar latent function values f and outputs y. In the GPML toolbox, we use iid. likelihoods pρ (y|f) = n i=1 pρ (yi |fi ). The approximate inference engine does not explicitly distinguish between classification and regression likelihoods: it is fully generic in the likelihood allowing to use a single code in the inference step. Likelihood functionality is needed both during inference and while predicting. 4.1 Prediction A prediction at x∗ conditioned on the data D = (X, y) (as implemented in gp.m) consists of the predictive mean µy∗ and variance σ2y∗ which are computed from the the latent marginal moments µf∗ , σ2f∗ i.e. the Gaussian marginal approximation N(f∗ |µf∗ , σ2f∗ ) via Z Z p(y∗ |D, x∗ ) = p(y∗ |f∗ )p(f∗ |D, x∗ )df∗ ≈ p(y∗ |f∗ )N(f∗ |µf∗ , σ2f∗ )df∗ . (3) The moments are given by µy∗ = likelihood call R R y∗ p(y∗ |D, x∗ )dy∗ and σ2y∗ = (y∗ − µy∗ )2 p(y∗ |D, x∗ )dy∗ . The • [lp,ymu,ys2] = lik(hyp, [], fmu, fs2) does exactly this. Evaluation of the logarithm of py∗ = p(y∗ |D, x∗ ) for values y∗ can be done via • [lp,ymu,ys2] = lik(hyp, y, fmu, fs2) where lp contains the number ln py∗ . R R Using the moments of the likelihood µ(f∗ ) = y∗ p(y∗ |f∗ )dy∗ and σ2 (f∗ ) = (y∗ −µ(f∗ ))2 p(y∗ |f∗ )dy∗ we obtain for the predictive moments the following (exact) expressions Z µy∗ = µ(f∗ )p(f∗ |D, x∗ )df∗ , and Zh i σ2y∗ = σ2 (f∗ ) + (µ(f∗ ) − µy∗ )2 p(f∗ |D, x∗ )df∗ . 1. The binary case is simple since y∗ ∈ {−1, +1} and 1 = py∗ + p−y∗ . Using π∗ = p+1 , we find π∗ y∗ = +1 p y∗ = 1 − π∗ y∗ = −1 X µ y∗ = y∗ p(y∗ |D, x∗ ) = 2 · π∗ − 1 ∈ [−1, 1], and y∗ =±1 σ2y∗ = X (y∗ − µy∗ )2 p(y∗ |D, x∗ ) = 4 · π∗ (1 − π∗ ) ∈ [0, 1]. y∗ =±1 2. The continuous case for homoscedastic likelihoods depending on r∗ = y∗ − f∗ only and having noise variance σ2 (f∗ ) = σ2n is also simple since R the identity p(y∗ |f∗ ) = p(yR∗ − f∗ |0) allows to substitute y∗ ← y∗ +f∗ yielding µ(f∗ ) = f∗ + y∗ p(y∗ |0)dy∗ and assuming y∗ p(y∗ |0)dy∗ = 0 we arrive at µy∗ = µf∗ , and σ2y∗ = σ2f∗ + σ2n . 19 3. The generalised linear model (GLM) case is also feasible. 
Evaluation of the predictive distribution is done by quadrature Z Z py∗ = p(y∗ |f∗ )p(f∗ |D, x∗ )df∗ ≈ p(y∗ |f∗ )N(f∗ |µf∗ , σ2f∗ )df∗ . For GLMs the mean is given by µ(f∗ ) = g(f∗ ) and the variance is usually given by a simple function of the mean σ2 (f∗ ) = v(g(f∗ )), hence we use Gaussian-Hermite quadrature with N(f∗ |µf∗ , σ2f∗ ) ≈ p(f∗ |D, x∗ ) to compute Z µy∗ = g(f∗ )p(f∗ |D, x∗ )df∗ , and Zh i σ2y∗ = v(g(f∗ )) + (g(f∗ ) − µy∗ )2 p(f∗ |D, x∗ )df∗ 6= v(µy∗ ). 4. Finally the warped Gaussian likelihood predicitive distribution with strictly monotonically increasing warping function g is given by the expression p(y∗ |D, x∗ ) = g 0 (y∗ )N g(y∗ )|µf∗ , σ2n + σ2f∗ so that the predictive moments can be computed by Gaussian-Hermite quadrature. In the following, we will detail how and which likelihood functions are implemented in the GPML toolbox. Further, we will mention dependencies between likelihoods and inference methods and provide some analytical expressions in addition to some likelihood implementations. 4.2 Interface The likelihoods are in fact the most challenging object in our implementation. Different inference algorithms require different aspects of the likelihood to be computed, therefore the interface is rather involved as detailed below. 19 hlikFunctions.m 19i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 % likelihood functions are provided to be used by the gp.m function: % % likErf (Error function , classification , probit regression) % likLogistic (Logistic , classification , logit regression) % likUni (Uniform likelihood , classification) % % likGauss (Gaussian , regression) % likGaussWarp (Warped Gaussian , regression) % likGumbel (Gumbel likelihood for extremal values) % likLaplace (Laplacian or double exponential , regression) % likSech2 (Sech-square , regression) % likT (Student ’s t, regression) % % likPoisson (Poisson regression , count data) % likNegBinom (Negativ binomial regression , count data) % likGamma (Nonnegative regression , positive data) % likExp (Nonnegative regression , positive data) % likInvGauss (Nonnegative regression , positive data) % likBeta (Beta regression , interval data) % % likMix (Mixture of individual likelihood functions) % 20 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % The likelihood functions have three possible modes , the mode being selected as follows (where "lik" stands for any likelihood function in "lik/lik*.m".): 1) With one or no input arguments: [REPORT NUMBER OF HYPERPARAMETERS] s = lik OR s = lik(hyp) The likelihood function returns a string telling how many hyperparameters it expects , using the convention that "D" is the dimension of the input space. For example , calling "likLogistic" returns the string ’0’. 2) With three or four input arguments: [PREDICTION MODE] lp = lik(hyp, y, mu) OR [lp, ymu, ys2] = lik(hyp, y, mu, s2) This allows to evaluate the predictive distribution. Let p(y_*|f_*) be the likelihood of a test point and N(f_*|mu,s2) an approximation to the posterior marginal p(f_*|x_*,x,y) as returned by an inference method. The predictive distribution p(y_*|x_*,x,y) is approximated by. 
q(y_*) = \int N(f_*|mu,s2) p(y_*|f_*) df_* lp = log( q(y) ) for a particular value of y, if s2 is [] or 0, this corresponds to log( p(y|mu) ) ymu and ys2 the mean and variance of the predictive marginal q(y) note that these two numbers do not depend on a particular value of y All vectors have the same size. 3) With five or six input arguments , the fifth being a string [INFERENCE MODE] [varargout] = lik(hyp, y, mu, s2, inf) OR [varargout] = lik(hyp, y, mu, s2, inf, i) There are three cases for inf, namely a) infLaplace , b) infEP and c) infVB. The last input i, refers to derivatives w.r.t. the ith hyperparameter. a1) [lp,dlp,d2lp,d3lp] = lik(hyp, y, f, [], ’infLaplace ’) lp, dlp, d2lp and d3lp correspond to derivatives of the log likelihood log(p(y|f)) w.r.t. to the latent location f. lp = log( p(y|f) ) dlp = d log( p(y|f) ) / df d2lp = d^2 log( p(y|f) ) / df^2 d3lp = d^3 log( p(y|f) ) / df^3 a2) [lp_dhyp ,dlp_dhyp ,d2lp_dhyp] = lik(hyp, y, f, [], ’infLaplace ’, i) returns derivatives w.r.t. to the ith hyperparameter lp_dhyp = d log( p(y|f) ) / ( dhyp_i) dlp_dhyp = d^2 log( p(y|f) ) / (df dhyp_i) d2lp_dhyp = d^3 log( p(y|f) ) / (df^2 dhyp_i) b1) [lZ,dlZ,d2lZ] = let Z = \int p(y|f) lZ = log(Z) dlZ = d log(Z) / d2lZ = d^2 log(Z) / lik(hyp, y, mu, s2, ’infEP ’) N(f|mu,s2) df then dmu dmu^2 21 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 % % b2) [dlZhyp] = lik(hyp, y, mu, s2, ’infEP ’, i) % returns derivatives w.r.t. to the ith hyperparameter % dlZhyp = d log(Z) / dhyp_i % % % c1) [b,z] = lik(hyp, y, [], ga, ’infVB ’) % ga is the variance of a Gaussian lower bound to the likelihood p(y|f). % p(y|f) \ge exp( b*(f+z) - (f+z).^2/(2*ga) - h(ga)/2 ) \propto N(f|b*ga-z,ga) % The function returns the linear part b and z. % % Cumulative likelihoods are designed for binary classification. Therefore , they % only look at the sign of the targets y; zero values are treated as +1. % % Some examples for valid likelihood functions: % lik = @likLogistic; % lik = {’likMix ’,{’likUni ’,@likErf}} % lik = {@likPoisson ,’logistic ’}; % % See the help for the individual likelihood for the computations specific to % each likelihood function. % hgpml copyright 5ai 4.3 Implemented Likelihood Functions The following table enumerates all (currently) implemented likelihood functions that can be found at lik/lik .m and their respective set of hyperparameters ρ. lik regression yi ∈ R Gauss Gaussian GaussWarp Gumbel Sech2 Warped Gaussian Gumbel Sech-squared Laplace Laplacian T Student’s t lik Erf Logistic Uni lik Poisson NegBinom classification yi ∈ {±1} Error function Logistic function Label noise count data yi ∈ N Poisson Negative Binomial lik Weibull nonnegative data yi ∈ R+ \{0} Weibull, γ1 = Γ (1 + 1/κ) Gamma Gamma Exp Exponential InvGauss Inverse Gaussian pρ (yi |fi ) = interval data yi ∈ [0, 1] Beta pρ (yi |fi ) = Ryi fi −∞ N(t)dt 1 1+exp(−yi fi ) 1 2 pρ (yi |fi ) = where µ = g(fi ) µyi e−µ /yi ! 
yi +r−1 r yi r µ /(r + µ)r+yi yi {ln σ} |s| = 1 {θ1 , .., θng , ln σ} {ln σ} {ln σ} {ln σ} {ln(ν − 1), ln σ} ρ= ∅ ∅ ∅ ρ= ∅ {ln r} pρ (yi |fi ) = where µ = g(fi ) κγ1 /µ (yi γ1 /µ)κ−1 exp (−(yi γ1 /µ)κ ) αα yα−1 −α exp − yi α i µ µ Γ (α) µ−1 exp − yµi q 2 λ i −µ) exp − λ(y 2µ2 y 2πy3 ρ= {ln κ} pρ (yi |fi ) = ρ= {ln φ} i i lik Beta ρ= −fi )2 √1 exp − (yi2σ 2 2πσ N(gθ, (yi )|fi , σ2 )gθ0 (yi ) s·π(yi −fi ) π √ exp (−zi − e−zi ), zi = γ + √ , σ 6 σ 6 π√ τ , τ= 2 cosh2 (τ(y i −fi )) 2σ 3 |yi −fi | 1 , b = √σ2 2b exp − b − ν+1 2 Γ ( ν+1 −fi )2 2 ) √ 1 1 + (yiνσ 2 Γ ( ν2 ) νπσ N(yi |fi , σ2 ) = Γ (φ) µφ−1 (1 Γ (µφ)Γ ((1−µ)φ) yi Composite likelihood functions [p1 (yi |fi ), p1 (yi |fi ), ..] 7→ pρ (yi |fi ) P Mix Mixture j αj pj (yi |fi ) 22 where µ = g(fi ) − yi )(1−µ)φ−1 {ln α} ∅ {ln λ} {ln α1 , ln α2 , ..} 4.4 Usage of Implemented Likelihood Functions Some code examples taken from doc/usageLik.m illustrate how to use simple and composite likelihood functions to specify a GP model. Syntactically, a likelihood function lf is defined by lk := ’func’ | @func // simple lf := {lk} | {param, lk} | {lk, {lk, .., lk}} // composite i.e., it is either a string containing the name of a likelihood function, a pointer to a likelihood function or one of the former in combination with a cell array of likelihood functions and an additional list of parameters. 22 hdoc/usageLik.m 22i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 % demonstrate usage of likelihood functions % % See also likFunctions.m. % hgpml copyright 5ai clear all, close all n = 5; f = randn(n,1); % create random latent function values % set up simple classification likelihood functions yc = sign(f); lc0 = {’likErf’}; hypc0 = []; % no hyperparameters are needed lc1 = {@likLogistic}; hypc1 = []; % also function handles are OK lc2 = {’likUni’}; hypc2 = []; lc3 = {’likMix’,{’likUni’,@likErf}}; hypc3 = log([1;2]); %mixture % set up simple regression likelihood functions yr = f + randn(n,1)/20; sn = 0.1; % noise standard deviation lr0 = {’likGauss’}; hypr0 = log(sn); lr1 = {’likLaplace’}; hypr1 = log(sn); lr2 = {’likSech2’}; hypr2 = log(sn); nu = 4; % number of degrees of freedom lr3 = {’likT’}; hypr3 = [log(nu-1); log(sn)]; lr4 = {’likMix’,{lr0,lr1}}; hypr4 = [log([1,2]);hypr0;hypr1]; a = 1; % set up warped Gaussian with g(y) = y + a*sign(y).*y.^2 lr5 = {’likGaussWarp’,[’poly2’]}; hypr5 = log([a;sn]); lr6 = {’likGumbel’,’+’}; hypr6 = log(sn); % set up Poisson/negative binomial regression yp = fix(abs(f)) + 1; lp0 = {@likPoisson ,’logistic’}; hypp0 = lp1 = {@likPoisson ,{’logistic2’ ,0.1}}; hypp1 = lp2 = {@likPoisson ,’exp’}; hypp2 = ln1 = {@likNegBinom ,’logistic2’}; hypn1 = % set lg1 = lg2 = lg3 = lg4 = lg5 = []; []; []; []; up other GLM likelihoods for positive or interval regression {@likGamma ,’logistic’}; al = 2; hyp.lik = log(al); {@likInvGauss ,’exp’}; lam = 1.1; hyp.lik = log(lam); {@likBeta ,’expexp’}; phi = 2.1; hyp.lik = log(phi); {@likBeta ,’logit’}; phi = 4.7; hyp.lik = log(phi); {@likWeibull ,{’logistic2’ ,0.01}}; ka = 0.5; hyp.lik = log(ka); % 0) specify the likelihood function 23 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 lik = lc0; hyp = hypc0; y = yc; % lik = lr4; hyp = hypr4; y = yr; % lik = lp1; hyp = hypp1; y = yp; % 1) query the number of parameters feval(lik{:}) % 2) evaluate the likelihood function on f exp(feval(lik{:},hyp,y,f)) % 3a) evaluate derivatives of the likelihood [lp,dlp,d2lp,d3lp] = feval(lik{:}, hyp, y, f, [], 
’infLaplace’); % 3b) compute Gaussian integrals w.r.t. likelihood mu = f; s2 = rand(n,1); [lZ,dlZ,d2lZ] = feval(lik{:}, hyp, y, mu, s2, ’infEP’); % 3c) obtain lower bound on likelihood ga = rand(n,1); [b,z] = feval(lik{:}, hyp, y, [], ga, ’infVB’); 4.5 Compatibility Between Likelihoods and Inference Methods The following table lists all possible combinations of likelihood function and inference methods. Likelihood \ Inference Gaussian Warped Gaussian Gumbel Sech-squared Laplacian Student’s t Mixture Error function Logistic function Uniform Weibull Gamma Exp Inverse Gaussian Poisson Negative Binomial Beta Gaussian Laplace VB Likelihood approx. cov. possible, Table3.6 X X X X X X X X X X X X X X X X X X X X X X X X X EP KL MCMC LOO no approx. cov. possible, Table3.6 X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X (X)∗ X X X X X X X Type, Output Domain regression, R regression, R regression, R regression, R regression, R regression, R classification, {±1} classification, {±1} classification, {±1} positive data, R+ \{0} positive data, R+ \{0} positive data, R+ \{0} positive data, R+ \{0} count data, N count data, N interval data, [0, 1] Alternative Names logistic distribution double exponential mixing meta likelihood probit regression logit regression label noise nonnegative regression nonnegative regression nonnegative regression nonnegative regression Poisson regression negative binomial regression beta regression (X)∗ EP might not converge in some cases since quadrature is used. Exact inference is only tractable for Gaussian likelihoods. Expectation propagation together with Student’s t likelihood is inherently unstable due to non-log-concavity. Laplace’s approximation for Laplace likelihoods is not sensible because at the mode the curvature and the gradient is undefined due to the non-differentiable peak of the Laplace distribution. Special care has been taken for the non-convex optimisation problem imposed by the combination Student’s t likelihood and Laplace’s approximation. 4.6 Gaussian Likelihood The Gaussian likelihood is the simplest likelihood because the posterior distribution is not only Gaussian but can be computed analytically. In principle, the Gaussian likelihood would only be 24 operated in conjunction with the exact inference method but we chose to provide compatibility with all other inference algorithms as well because it enables code testing and allows to switch between different regression likelihoods very easily. 24a hlik/likGauss.m 24ai≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 24b function [varargout] = likGauss(hyp, y, mu, s2, inf, i) % likGauss - Gaussian likelihood function for regression. The expression for the % likelihood is % likGauss(t) = exp(-(t-y)^2/2*sn^2) / sqrt(2*pi*sn^2), % where y is the mean and sn is the standard deviation. % % The hyperparameters are: % % hyp = [ log(sn) ] % % Several modes are provided , for computing likelihoods , derivatives and moments % respectively , see likFunctions.m for the details. In general , care is taken % to avoid numerical issues when the arguments are extreme. % hgpml copyright 5ai % % See also LIKFUNCTIONS.M. 
if nargin <3, varargout = {’1’}; return; end sn2 = exp(2*hyp); if nargin <5 % prediction mode if inf is not present hPrediction with Gaussian likelihood 24bi else switch inf case ’infLaplace’ hLaplace’s method with Gaussian likelihood 25ai case ’infEP’ hEP inference with Gaussian likelihood 25bi case ’infVB’ hVariational Bayes inference with Gaussian likelihood 25ci end end hPrediction with Gaussian likelihood 24bi≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 % report number of hyperparameters (24a) if isempty(y), y = zeros(size(mu)); end s2zero = 1; if nargin >3&&numel(s2)>0&&norm(s2)>eps, s2zero = 0; end % s2==0 ? if s2zero % log probability lp = -(y-mu).^2./sn2/2-log(2*pi*sn2)/2; s2 = 0; else lp = likGauss(hyp, y, mu, s2, ’infEP’); % prediction end ymu = {}; ys2 = {}; if nargout >1 ymu = mu; % first y moment if nargout >2 ys2 = s2 + sn2; % second y moment end end varargout = {lp,ymu,ys2}; 25 The Gaussian likelihood function has a single hyperparameter ρ, the log of the noise standard deviation σn . 4.6.1 Exact Inference Exact inference doesn’t require any specific likelihood related code; all computations are done directly by the inference method, section 3.1. 4.6.2 25a Laplace’s Approximation hLaplace’s method with Gaussian likelihood 25ai≡ (24a) 1 if nargin <6 % no derivative mode 2 if isempty(y), y=0; end 3 ymmu = y-mu; dlp = {}; d2lp = {}; d3lp = {}; 4 lp = -ymmu.^2/(2*sn2) - log(2*pi*sn2)/2; 5 if nargout >1 6 dlp = ymmu/sn2; % dlp, derivative of log likelihood 7 if nargout >2 % d2lp, 2nd derivative of log likelihood 8 d2lp = -ones(size(ymmu))/sn2; 9 if nargout >3 % d3lp, 3rd derivative of log likelihood 10 d3lp = zeros(size(ymmu)); 11 end 12 end 13 end 14 varargout = {lp,dlp,d2lp,d3lp}; 15 else % derivative mode 16 lp_dhyp = (y-mu).^2/sn2 - 1; % derivative of log likelihood w.r.t. hypers 17 dlp_dhyp = 2*(mu-y)/sn2; % first derivative , 18 d2lp_dhyp = 2*ones(size(mu))/sn2; % and also of the second mu derivative 19 varargout = {lp_dhyp ,dlp_dhyp ,d2lp_dhyp}; 20 end 4.6.3 25b Expectation Propagation hEP inference with Gaussian likelihood 25bi≡ (24a) 1 if nargin <6 % no derivative mode 2 lZ = -(y-mu).^2./(sn2+s2)/2 - log(2*pi*(sn2+s2))/2; % log part function 3 dlZ = {}; d2lZ = {}; 4 if nargout >1 5 dlZ = (y-mu)./(sn2+s2); % 1st derivative w.r.t. mean 6 if nargout >2 7 d2lZ = -1./(sn2+s2); % 2nd derivative w.r.t. mean 8 end 9 end 10 varargout = {lZ,dlZ,d2lZ}; 11 else % derivative mode 12 dlZhyp = ((y-mu).^2./(sn2+s2)-1) ./ (1+s2./sn2); % deriv. w.r.t. hyp.lik 13 varargout = {dlZhyp}; 14 end 4.6.4 25c Variational Bayes hVariational Bayes inference with Gaussian likelihood 25ci≡ 26 (24a) 1 2 3 4 5 % variational lower site bound % t(s) = exp(-(y-s)^2/2sn2)/sqrt(2*pi*sn2) % the bound has the form: (b+z/ga)*f - f.^2/(2*ga) - h(ga)/2 n = numel(s2); b = zeros(n,1); y = y.*ones(n,1); z = y; varargout = {b,z}; 4.7 Warped Gaussian Likelihood Starting from the likelihood p(y|f) we are sometimes facing the situation where the data y ∈ Y ⊆ R is not distributed according to p(y|f) but some nonlinear transformation of the data g(y) = z so that z ∼ p(z|f). Here, the warping function g : Y → R needs to be strictly monotonically increasing i.e. g 0 (y) > 0. Formally, we start from the fact that p(z|f) integrates to one and use the derivative dz = g 0 (y)dy to substitute Z Z p(z|f)dz = 1 = pg (y|f)dy, pg (y|f) = p(g(y)|f)g 0 (y) where we have defined the log warped likelihood ln pg (y|f) = ln p(g(y)|f) + ln g 0 (y). 
The interesting bit is that approximate inference methods such as infExact, infLaplace, infEP, infVB, infKL remain fully feasible; only prediction and derivatives become more involved. The usual GP inference is recovered by using the identity warping function g : y 7→ y. The construction works in princple for any likelihood but our implementation in likGaussWarp is limited to the Gaussian likelihood. Hyperparameter derivatives Hyperparameter derivatives for infLaplace are obtained as follows ∂k ∂ ln k pg (y|f) = ∂θ ∂f ∂ ∂k ∂ ∂k ln p(g(y)|f) + ln g 0 (y), k = 0, 1, 2 ∂θ ∂f ∂θ ∂fk ∂ ∂k+1 ∂ ∂k = − k+1 ln p(g(y)|f) g(y) + ln g 0 (y). ∂θ ∂θ ∂fk ∂f Similarly for infEP the derivatives are given by Z Z ∂ ∂ ∂ 2 ln pg (y|f)N(f|µ, σ )df = ln p(g(y)|f)N(f|µ, σ2 )df + ln g 0 (y) ∂θ ∂θ ∂θ Z ∂ ∂ ∂ ln p(g(y)|f)N(f|µ, σ2 )df g(y) + ln g 0 (y). = − ∂µ ∂θ ∂θ This trick above works for any homoscedastic likelihood where p(y|f) = p(y + β|f + β) such as likGauss, likLaplace, likSech2 and likT. Predictive moments As detailed in 4, the predictive distribution is – for Gaussian likelihood – given by Z Z p(z∗ |D, x∗ ) = p(z∗ |f∗ )p(f∗ |D, x∗ )df∗ = N(z∗ |f∗ , σ2n )N(f∗ |µf∗ , σ2f∗ )df∗ = N(z∗ |µf∗ , σ2n + σ2f∗ ), where z∗ = g(y∗ ) p(y∗ |D, x∗ ) = g 0 (y∗ )N(g(y∗ )|µf∗ , σ2n + σ2f∗ ). 27 Hence, the predictive moments are obtained by the 1d integrals Z µy∗ = y∗ g 0 (y∗ )N(g(y∗ )|µf∗ , σ2n + σ2f∗ )dy∗ Z = g−1 (z∗ )N(z∗ |µf∗ , σ2n + σ2f∗ )dz∗ , and Z 2 σy∗ = (y∗ − µy∗ )2 g 0 (y∗ )N(g(y∗ )|µf∗ , σ2n + σ2f∗ )dy∗ Z = (g−1 (z∗ ) − µy∗ )2 N(z∗ |µf∗ , σ2n + σ2f∗ )dz∗ . 4.8 Gumbel Likelihood Distributions of extrema are well captured by the Gumbel distribution p(y) = y−η 1 exp −z − e−z , z = s , s ∈ {±1} β β with mean µ = η + βγ and variance σ2 = π2 β2 /6 where γ = 0.57721566490153 denotes Euler–Mascheroni’s constant. Skewness is approximately given by 1.1395s where s is a sign switching between left and right skewness and kurtosis is 12/5. The final expression for the Gumbel likelihood is π π p(y|f) = √ exp −z − e−z , z = γ + s √ (y − f), s ∈ {±1}. σ 6 σ 6 4.9 Laplace Likelihood Laplace’s Approximation The following derivatives are needed: ln p(y|f) = − ln(2b) − ∂ ln p ∂f 2 ∂ ln p (∂f)2 ∂ ln p ∂ ln σn = = = |f − y| b sign(f − y) b 3 ∂ ln p ∂3 ln p = =0 (∂f)3 (∂ ln σn )(∂f)2 |f − y| −1 b Expectation Propagation Expectation propagation requires integration against a Gaussian measure for moment matching. R 2 ln Z ln Z We need to evaluate ln Z = ln L(y|f, σ2n )N(f|µ, σ2 )df as well as the derivatives ∂∂µ and ∂∂µ 2 2 |y−f| (f−µ) σ 1 1 where N(f|µ, σ2 ) = √ 2 exp − 2σ² , L(y|f, σ2n ) = 2b exp − b , and b = √n2 . As a first 2πσ 28 step, we reduce the number of parameters by means of the substitution f̃ = f−y σn yielding Z Z = = = = = ln Z = with µ̃ = µ−y σn and σ̃ = σ σn . L(y|f, σ2n )N(f|µ, σ2 )df √ Z √ |f − y| (f − µ)2 1 2 √ exp − exp − 2 df 2σ² σn 2πσ 2σn √ Z √ 2 (σn f̃ + y − µ)2 √ exp − 2|f̃| df̃ exp − 2σ² 2σ 2π 2 µ−y 2 Z σn σn f̃ − σn √ exp − L(f̃|0, 1)df̃ 2σ² σσn 2π Z 1 L(f|0, 1)N(f|µ̃, σ̃2 )df σn Z ln Z̃ − ln σn = ln L(f|0, 1)N(f|µ̃, σ̃2 )df − ln σn Thus, we concentrate on the simpler quantity ln Z̃. 
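To make ln Z̃ concrete, the following standalone numerical check (not toolbox code; the values chosen for µ̃ and σ̃ are arbitrary) evaluates the one-dimensional integral by adaptive quadrature and can be compared against the closed-form expression derived next:

mu_t = 0.3; st = 1.7;                                      % arbitrary values for mu~ and sigma~
lapl  = @(f) exp(-sqrt(2)*abs(f))/sqrt(2);                 % Laplace density L(f|0,1), b = 1/sqrt(2)
gauss = @(f) exp(-(f-mu_t).^2/(2*st^2))/sqrt(2*pi*st^2);   % Gaussian N(f|mu~,sigma~^2)
lnZt  = log(quadgk(@(f) lapl(f).*gauss(f), -inf, inf))     % ln Ztilde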
C z √ }| √ { (f − µ̃)2 √ ln Z = ln exp − − 2|f| df − ln σ̃ 2π − ln 2σn 2σ̃² "Z # Z∞ 0 (f − µ̃)2 √ (f − µ̃)2 √ = ln + 2f df + exp − − 2f df + C exp − 2σ̃² 2σ̃² 0 −∞ m− m+ z }| √ { z }| √ { Z∞ Z 0 f2 − 2(µ̃ + σ̃2 2)f + µ̃2 f2 − 2(µ̃ − σ̃2 2)f + µ̃2 = ln exp − exp − df + df + C −∞ 2σ̃² 2 σ̃² 0 Z 2 Z∞ # m+ µ̃2 (f − m− )2 (f − m+ )2 = ln exp exp − df + exp exp − df − +C 2σ̃² 2σ̃² 0 2σ̃² 2σ̃² −∞ " !# 2 Z0 2 Z0 √ m− m µ̃2 + 1− = ln exp N(f|m+ , σ̃2 )df − − ln 2σn N(f|m− , σ̃2 )df + exp 2σ̃² −∞ 2σ̃² 2σ̃² −∞ 2 2 2 √ m− m+ m+ µ̃2 m− m+ − exp + exp = ln exp Φ Φ − − ln 2σn 2σ̃² σ̃ 2σ̃² σ̃ 2σ̃² 2σ̃² Rz Here, Φ(z) = −∞ N(f|0, 1)df denotes the cumulative Gaussian distribution. Finally, we have " m2− 2σ̃² Z0 √ m i h √ m √ − + + exp 2µ̃ Φ − + σ̃2 − ln 2σn ln Z = ln exp − 2µ̃ Φ σ̃ σ̃ √ √ √ = ln exp ln Φ(−z+ ) + 2µ̃ + exp ln Φ(z− ) − 2µ̃ + σ̃2 − ln 2σn | {z } | {z } a+ √ = ln(e + e ) + σ̃ − ln 2σn √ √ √ µ−y µ̃ σ where z+ = µ̃ σ̃ + σ̃ 2 = σ + σn 2, z− = σ̃ − σ̃ 2 = a+ Now, using d dθ ln Φ(z) = a− a− 2 1 d Φ(z) dθ Φ(z) = N(z) dz Φ(z) dθ µ−y σ − σ σn √ 2 and µ̃ = we tackle first derivative 29 µ−y σn , σ̃ = σ σn . ∂ ln Z ∂µ ∂a+ ∂µ a− ∂a− + ea+ ∂a ∂µ + e ∂µ = ea+ + ea− √ ∂ 2 ln Φ(−z+ ) + ∂µ σn √ √ N(−z+ ) 2 q+ 2 − =− + + σΦ(−z+ ) σn σ σn √ ∂ 2 ln Φ(z− ) − ∂µ σn √ √ N(z− ) 2 q− 2 − = − σΦ(z− ) σn σ σn √ q± 2 . ∓ ± σ σn = = ∂a− ∂µ = = ∂a± ∂µ = as well as the second derivative ∂2 ln Z ∂µ2 ∂ ∂µ = ∂ ∂µ a± ∂a± e = ea± ∂µ ∂2 a+ ∂µ2 = − 1 σ + ea+ ∂a + ∂µ " − ea− ∂a ∂µ ∂ ∂µ ea− ea+ + # ∂a± 2 ∂2 a± + ∂µ ∂µ2 − ∂ ln Z ∂µ 2 ∂ ∂ ∂µ N(−z+ )Φ(−z+ ) − ∂µ Φ(−z+ )N(−z+ ) Φ2 (−z+ ) ∂−z2 /2 ∂−z+ 2 + 1 N(−z+ )Φ(−z+ ) ∂µ − N (−z+ ) ∂µ = − σ Φ2 (−z+ ) ∂2 a− ∂µ2 = q2+ − q+ z+ N(−z+ ) Φ(−z+ )z+ − N(−z+ ) · = − σ2 Φ2 (−z+ ) σ2 = 1 σ = ∂z− 2 − 1 N(z− )Φ(z− ) ∂µ − N (z− ) ∂µ σ Φ2 (z− ) ∂ ∂ ∂µ N(z− )Φ(z− ) − ∂µ Φ(z− )N(z− ) Φ2 (z− ) ∂−z2 /2 q2− + q− z− N(z− ) −Φ(z− )z− − N(z− ) · = − σ2 Φ2 (z− ) σ2 q 2 ∓ q ± z± = − ± 2 σ = ∂2 a± ∂µ2 which can be simplified to ∂2 ln Z ∂µ2 = ea+ b+ + ea− b− − ea+ + ea− 30 ∂ ln Z ∂µ 2 using b± = ∂a± ∂µ 2 ∂2 a± + ∂µ2 √ !2 q 2 ∓ q ± z± q± 2 − ± 2 ∓ ± σ σn σ √ !2 q2 q ± z± q± 2 − − ± ± σ σn σ2 σ2 ! √ 2 8 z± ∓ 2 q± . − 2 σσn σn σ = = = We also need ∂ ln Z ∂ ln σn = a− ∂a− + ea+ ∂∂a ln σn + e ∂ ln σn ea+ + ea− − 2σ2 − 1. σ2n Variational Bayes We need h(γ) and its derivatives as well as β(γ): h(γ) = h 0 (γ) = 2 γ + ln(2σ2n ) + y2 γ−1 σ2n 2 − y2 γ−2 σ2n h 00 (γ) = 2y2 γ−3 β(γ) = yγ−1 4.10 Student’s t Likelihood The likelihood has two hyperparameters (both represented in the log domain to ensure positivity): ν the degrees of freedom ν and the scale σn with mean y (for ν > 1) and variance ν−2 σn ² (for ν > 2). ν+1 (f − y)² − 2 p(y|f) = Z · 1 + , νσ2n 31 Z= Γ Γ ν 2 ν+1 p2 νπσ2n Laplace’s Approximation For the mode fitting procedure, we need derivatives up to third order; the hyperparameter derivatives at the mode require some mixed derivatives. 
All in all, using r = y − f, we have ν 1 ν+1 r² ν+1 2 − ln Γ ln p(y|f) = ln Γ − ln νπσn − ln 1 + 2 2 2 2 νσ2n ∂ ln p r = (ν + 1) ∂f r² + νσ2n r2 − νσ2n ∂2 ln p = (ν + 1) (∂f)2 (r² + νσ2n )2 ∂3 ln p r3 − 3rνσ2n = 2(ν + 1) 3 (∂f) (r² + νσ2n )3 ∂ ln p ν+1 ∂Z ν r2 r2 + = − ln 1 + · ∂ ln ν ∂ ln ν 2 2 νσ2n r² + νσ2n ∂Z ν d ln Γ ν+1 ν d ln Γ ν2 1 2 = − − ∂ ln ν 2 d ln ν 2 d ln ν 2 2 2 2 3 r (r − 3(ν + 1)σn ²) + νσn ∂ ln p = ν (∂ ln ν)(∂f)² (r² + νσ2n )3 ∂ ln p r2 −1 = (ν + 1) ∂ ln σn r² + νσ2n ∂3 ln p νσ2n − 3r2 = 2νσ2n (ν + 1) (∂ ln σn )(∂f)² (r² + νσ2n )3 4.11 Cumulative Logistic Likelihood The likelihood has one hyperparameter (represented in the log domain), namely the standard deviation σn π π √ , Z= √ p(y|f) = Z · cosh−2 (τ(f − y)) , τ = 2σn 3 4σn 3 Laplace’s Approximation The following derivatives are needed where φ(x) ≡ ln(cosh(x)) √ ln p(y|f) = ln(π) − ln(4σn 3) − 2φ (τ(f − y)) ∂ ln p = 2τφ 0 (τ(f − y)) ∂f ∂2 ln p = −2τ2 φ 00 (τ(f − y)) (∂f)2 ∂3 ln p = 2τ3 φ 000 (τ(f − y)) (∂f)3 ∂3 ln p 2 00 000 (τ(f (τ(f = 2τ 2φ − y)) + τ(f − y)φ − y)) (∂ ln σn )(∂f)2 ∂ ln p = 2τ(f − y)φ 0 (τ(f − y)) − 1 ∂ ln σn 32 4.12 GLM Likelihoods: Poisson, Negative Binomial, Weibull, Gamma, Exponential, Inverse Gaussian and Beta Data y from a space other than R e.g. N, R+ or [0, 1] can be modeled using generalised linear model likelihoods p(y|f) where the expected value E[y] = µ is related to the underlying Gaussian process f by means of an inverse link function µ = g(f). Typically, the likelihoods are from an exponential family, hence the variance V[y] = v(µ), is a simple function of the mean µ as well as higher order moments such as skewness S[y] = s(µ) and kurtosis K[y] = k(µ). Here, we directly specify the inverse link function µ = g(f) defining the mapping from the GP f to the mean intensity µ. For numerical reasons, we work with the log of the inverse link function h(f) = ln g(f) and use its derivatives h 0 , h 00 and h 000 for subsequent computations. In the table below, we have summarised the GLM likelihood expressions, the moments, the range of their variables and the applicable inverse link functions. Likelihood Poisson ρ= ∅ v(µ) = µ Neg. Binomial {ln r} µ(µ/r + 1) Weibull {ln κ} µ2 (γ2 /γ21 − 1) Gamma {ln α} µ2 /α Exponential ∅ µ2 Inv. Gauss {ln λ} µ3 /λ Beta {ln φ} µ(1 − µ)/(1 + φ) 4.12.1 s(µ) = √ 1/ µ q 4(r+µ) rµ k(µ) = 1/µ − q γ3 −3γ1 γ2 +2γ31 (γ2 −γ21 )3/2 √ 2/ α r µ(µ+r) 6 r + r µ(µ+r) γ4 −4γ1 γ3 +12γ21 γ2 −3γ22 −6γ41 (γ2 −γ21 )2 6/α 2 p 3 µ/λ 6 (2−4µ)(1+φ) √ v(µ)(2+φ) −v(µ)(5φ+6) 6 (φ+1) v(µ)(φ+2)(φ+3) 15µ/λ 2 p(y|f) = µy exp(−µ)/y! y+r−1 r y r µ /(r + µ)r+y y y∈ N µ∈ R+ Inverse Links exp, logistic* N R+ exp, logistic* κγ1 /µ (yγ1 /µ)κ−1 exp (−(yγ1 /µ)κ ) αα yα−1 −α exp − yα µ Γ (α) µ y µ−1 exp − µ q 2 λ exp − λ(y−µ) 2πy3 2µ2 y R+ \{0} R+ \{0} exp, logistic* R+ \{0} R+ \{0} exp, logistic* R+ \{0} R+ \{0} exp, logistic* R+ \{0} R+ \{0} exp, logistic* Γ (φ) µφ−1 (1 Γ (µφ)Γ ((1−µ)φ) y [0, 1] [0, 1] expexp, logit − y)(1−µ)φ−1 Inverse Link Functions Possible inverse link functions and their properties (∪ convex, ∩ concave, ↑ monotone) are summarised below: util/glm_invlink_* exp logistic logistic2 expexp logit g(f) = µ = ef `(f) = ln(1 + ef ) ` (f + af`(f)) exp(−e−f ) 1/(1 + e−f ) g:R→ R+ R+ R+ [0, 1] [0, 1] g is ∪,↑ ∪,↑ ↑ ↑ ↑ h(f) = ln µ = f ln(ln(1 + ef )) ln(` (f + af`(f))) −e−f − ln(1 + e−f ) h is ∪,∩,↑ ∩,↑ ∩,↑ ∪,↑ ∪,↑ Please see doc/usageLik.m for how to specify the pair likelihood function and link function in GPML. 
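As a minimal sketch (mirroring doc/usageLik.m; the training inputs x and count targets y are assumed to be given), a Poisson likelihood paired with the logistic inverse link can be used together with Laplace's approximation as follows:

lik = {@likPoisson ,'logistic'};        % likelihood / inverse link pair
hyp.mean = []; hyp.cov = log([1;1]);    % covSEiso hyperparameters [log(ell); log(sf)]
hyp.lik = [];                           % likPoisson has no hyperparameters
[nlZ,dnlZ] = gp(hyp, @infLaplace, @meanZero, @covSEiso, lik, x, y);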
Exponential inverse link: exp For g(f) = ef things are simple since h(f) = f, h 0 (f) = 1 and h 00 (f) = h 000 (f) = 0. 33 Logistic inverse link: logistic For g(f) = ln(1 + ef ) the derivatives of h(f) are given by h(f) = ln(ln(1 + ef )) 1 −ef 1 s(−f), s(f) = , s 0 (f) = = −s(−f)s(f) h 0 (f) = f f ln(1 + e ) 1+e (1 + ef )2 1 e−f 1 ef 1 h 00 (f) = − f f −f 2 2 ln(1 + e ) (1 + e ) ln (1 + ef ) 1 + e 1 + e−f = h 0 (f) s(f) − h 0 (f) −ef 00 000 00 0 0 h (f) = h (f) s(f) − h (f) + h (f) − h (f) (1 + ef )2 = h 00 (f) s(f) − 2h 0 (f) − h 0 (f)s(f)s(−f). Note that g(f) = eh(f) = ln(1 + ef ) is convex and h(f) = ln(ln(1 + ef )) with 1 ef 1 1 00 h (f) = 60 1− f f f ln(1 + e ) ln(1 + e ) 1 + e 1 + e−f is concave since ef > ln(1 + ef ) for all f ∈ R. Twice logistic inverse link: logistic2 Note that h(f) = ln(` (f + af`(f))) is – according to Seeger et al.9 – concave. Double negative exponential inverse link: expexp For g(f) = exp(−e−f ) the derivatives of h(f) are given by h(f) = −e−f h 0 (f) = −h(f) h 00 (f) = h(f) h 000 (f) = −h(f) Logit regression inverse link: logit For g(f) = 1/(1+e−f ) the derivatives of h(f) can be computed using the logistic inverse link function h` (f) since h(f) = f − exp(h` (f)) h(f) = f − eh` (f) h 0 (f) = 1 − eh` (f) h`0 (f) h 00 (f) = −eh` (f) [h`0 (f)2 + h`00 (f)] = eh` (f) s` (−f)s2` (f) h 000 (f) = −eh` (f) [h`0 (f)3 + 3h`00 (f)h`0 (f) + h`000 (f)] 9 Bayesian Intermittent Demand Forecasting, NIPS, 2016 34 4.12.2 Poisson Likelihood Count data y ∈ Nn can be modeled in the GP framework using the Poisson distribution p(y) = √ µy e−µ /y! with mean/variance E[y] = V[y] = µ, skewness S[y] = 1/ µ and kurtosis K[y] = 1/µ leading to the likelihood p(y|f) = µy exp(−µ)/y!, µ = g(f) ⇔ ln p(y|f) = y · ln g(f) − g(f) − ln Γ (y + 1). For Laplace’s method to work, we need the first three derivatives of the log likelihood ln p(y|f), where h(f) = ln g(f) ln p(y|f) ∂ ln p(y|f) ∂f ∂2 ln p(y|f) ∂f2 ∂3 ln p(y|f) ∂f3 = y · h(f) − exp(h(f)) − ln Γ (y + 1) = h 0 (f) [y − exp(h(f))] = h 00 (f) [y − exp(h(f))] − [h 0 (f)]2 exp(h(f)) = h 000 (f) [y − exp(h(f))] − 3h 0 (f) · h 00 (f) exp(h(f)) − [h 0 (f)]3 exp(h(f)) h 000 (f) [y − exp(h(f))] − h 0 (f)[h 0 (f)2 + 3h 00 (f)] exp(h(f)). Note that if ln µ = h(f) is concave and µ = g(f) is convex then the Poisson likelihood p(y|f) is log-concave in f which is the case for both exp and logistic. 4.12.3 Weibull Likelihood Nonnegative data y ∈ R+ such as time-to-failure can be modeled in the GP framework using the κ Weibull distribution p(y) = κ/λ(y/λ)κ−1 e−(y/λ) with shape parameter κ > 0, scale parameter λ > 0, mean E[y] = λγ1 = µ where γj = Γ (1 + j/κ), variance V[y] = λ2 γ2 − µ2 = µ2 (γ2 /γ21 − 1), skewness S[y] = (γ3 − 3γ1 γ2 + 2γ31 )/(γ2 − γ21 )3/2 and kurtosis K[y] = (γ4 − 4γ1 γ3 + 12γ21 γ2 − 3γ22 − 6γ41 )/(γ2 − γ21 )2 . Using the substitution µ = λγ1 ⇔ 1/λ = γ1 /µ, we obtain y κ y κ−1 κ exp − γ1 γ1 , µ = g(f) > 0 p(y|f) = γ1 µ µ µ κ y y κ . ⇔ ln p(y|f) = ln γ1 + (κ − 1) ln γ1 − γ1 µ µ µ Note that the Weibull likelihood p(y|f) is log-concave in f neither for the exp nor for the logistic inverse link. 4.12.4 Gamma Likelihood Nonnegative data y ∈ R+ can be modeled in the GP framework using the Gamma distribution p(y) = θ−α /Γ (α)yα−1 e−y/θ with shape parameter α > 0, scale parameter θ > 0, mean E[y] = √ αθ = µ, variance V[y] = αθ2 = µ2 /α, skewness S[y] = 2/ α and kurtosis K[y] = 6/α. 
Using the substitution µ = αθ ⇔ α/µ = 1/θ, we obtain αα yα−1 −α yα p(y|f) = µ exp − , µ = g(f) > 0 Γ (α) µ y ⇔ ln p(y|f) = −α ln µ + − ln Zα (y), ln Zα (y) = ln Γ (α) − α ln α + (1 − α) ln y. µ 35 Note that if ln µ = h(f) was convex and µ = g(f) was concave then the Gamma likelihood p(y|f) would be log-concave in f which is not the case for both exp and logistic. 4.12.5 Exponential Likelihood Nonnegative data y ∈ R+ can be modeled in the GP framework using the Exponential distribution p(y) = θ−1 e−y/θ with scale parameter θ > 0, mean E[y] = θ = µ, variance V[y] = µ2 , skewness S[y] = 2 and kurtosis K[y] = 6. We obtain y −1 p(y|f) = µ exp − , µ = g(f) > 0 µ y ⇔ ln p(y|f) = − ln µ − . µ Note that for exp (but not for logistic) the likelihood is log-concave. The exponential distribution corresponds to the Gamma distribution with α = 1 and the Weibull distribution with κ = 1. 4.12.6 Inverse Gaussian Likelihood n Nonnegative data p y ∈ R+ can be modeled in the GP framework using the Inverse Gaussian distriparameter λ > 0, mean parameter bution p(y) = λ/(2πy3 ) exp(−λ(y − µ)2 /(2µ2 y)) with shape p µ > 0, mean E[y] = µ, variance V[y] = µ3 /λ, skewness S[y] = 3 µ/λ and kurtosis K[y] = 15µ/λ. We obtain s λ λ(y − µ)2 p(y|f) = exp − , µ = g(f) > 0 2πy3 2µ2 y ⇔ ln p(y|f) = − λ(y − µ)2 1 − ln Zα (y), ln Zα (y) = − (ln λ − ln 2πy3 ). 2 2 2µ y The inverse Gaussian likelihood is in general not log-concace in f for both exp and logistic. 4.12.7 Beta Likelihood Interval data y ∈ [0, 1]n can be modeled in the GP framework using the Beta distribution p(y) = yα−1 (1 − y)β−1 /B(α, β) with shape parameters α, β > 0, mean E[y] = α/(α + β) and variance V[y] = αβ/[(α + β)2 (α + β + 1)] and 1/B(α, β) = Γ (α + β)/[Γ (α)Γ (β)]. Reparametrising using the mean parameter µ = E[y] = α/(α + β) , the shape parameter φ = α + β, the variance V[y] = µ(1 − µ)/(1 + φ) and hence Γ (φ) yµφ−1 (1 − y)(1−µ)φ−1 , µ = g(f) > 0 Γ (µφ)Γ ((1 − µ)φ) ⇔ ln p(y|f) = ln Γ (φ) − ln Γ (µφ) − ln Γ ((1 − µ)φ) + (µφ − 1) ln y + ((1 − µ)φ − 1) ln(1 − y). p(y|f) = The Beta likelihood is in general not log-concace in f for both exp and logistic. 36 5 Mean Functions A mean function mφ : X → R (with hyperparameters φ) of a GP f is a scalar function defined over the whole domain X that computes the expected value m(x) = E[f(x)] of f for the input x. 5.1 Interface In the GPML toolbox, a mean function m : X → R needs to implement evaluation m = mφ (X) and ∂ first derivatives mi = ∂φ m with respect to the components i of the parameter φ ∈ Φ as detailed i below. 36 hmeanFunctions.m 36i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % Mean functions to be use by Gaussian process functions. 
There are two different kinds of mean functions: simple and composite: Simple mean functions: meanZero meanOne meanConst meanLinear meanPoly meanDiscrete meanGP meanGPexact meanNN meanWSPC - zero mean function one mean function constant mean function linear mean function polynomial mean function precomputed mean for discrete data predictive mean of another GP predictive mean of a regression GP nearest neighbor mean function weighted sum of projected cosines Composite mean functions (see explanation at the bottom): meanScale meanSum meanProd meanPow meanMask meanPref meanWarp - scaled version of a mean function sum of mean functions product of mean functions power of a mean function mask some dimensions of the data difference mean for preference learning warped mean function Naming convention: all mean functions are named "mean/mean*.m". 1) With no or only a single input argument: s = meanNAME or s = meanNAME(hyp) The mean function returns a string s telling how many hyperparameters hyp it expects , using the convention that "D" is the dimension of the input space. For example , calling "meanLinear" returns the string ’D’. 2) With two input arguments and one output argument: m = meanNAME(hyp, x) The function computes and returns the mean vector m with components m(i) = m(x(i,:)) where hyp are the hyperparameters and x is an n by D matrix of data, where D is the dimension of the input space. The returned mean 37 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 % vector m is of size n by 1. % % 3) With two input arguments and two output arguments: % % [m,dm] = meanNAME(hyp, x) % % The function computes and returns the mean vector m as in 2) above. % In addition to that, the (linear) directional derivative function dm is % returned. The call dhyp = dm(q) for a direction vector q of size n by 1 % returns a vector of directional derivatives dhyp = d (q’*m(x)) / d hyp of % the same size as the hyperparameter vector hyp. The components of dhyp are % defined as follows: dhyp(i) = q’*( d m(x) / d hyp(i) ). % % See also doc/usageMean.m. % hgpml copyright 5ai 5.2 Implemented Mean Functions We offer simple and composite mean functions producing new mean functions m(x) from existing mean functions µj (x). All code files are named according to the pattern mean/mean .m for simple identification. This modular specification allows to define affine mean functions m(x) = c + a> x or polynomial mean functions m(x) = (c + a> x)2 . All currently available mean functions are summarised in the following table. Simple mean functions m(x) Mmeaning Zero mean vanishes always One mean equals 1 Const mean equals a constant Linear mean linearly depends on x ∈ X ⊆ RD Poly mean polynomially depends on x ∈ X ⊆ RD Discrete precomputed mean for discrete data x ∈ X ⊆ N GP predictive mean of another GP GPexact predictive mean of a regression GP NN nearest neighbor for a set (zj , mj ) ∈ X × R WSPC weighted sum of d projected cosines x ∈ X ⊆ RD Composite mean functions [µ1 (x), µ2 (x), ..] 
7→ m(x) meaning Scale scale a mean Sum add up mean functions Prod multiply mean functions Pow raise a mean to a power Mask act on components I ⊆ [1, 2, .., D] of x ∈ X ⊆ RD only Pref preference learning mean x = [x1 ; x2 ], xi ⊆ RD/2 Warp warped mean 5.3 m(x) = 0 1 c a> x P > d d ad x µ x R R y · p(y|D, x)dy y · p(y|D, x)dy mi , i = arg minj d(x, zj ) a> cos(Wx + b) φ ∅ ∅ c∈R a ∈ RD a ∈ RD×d µ ∈ Rs ∅ ρ, ψ, σn ∅ W ∈ Rd×D , a, b ∈ Rd m(x) = αµ(x) P µ (x) Qj j j µj (x) µ(x)d µ(xI ) µ(x1 ) − µ(x2 ) g[µ(x)] φ α∈R ∅ ∅ ∅ ∅ ∅ ∅ Usage of Implemented Mean Functions Some code examples taken from doc/usageMean.m illustrate how to use simple and composite mean functions to specify a GP model. Syntactically, a mean function mf is defined by mn := ’func’ | @func // simple 38 mf := {mn} | {mn, {param, mf}} | {mn, {mf, .., mf}} // composite i.e., it is either a string containing the name of a mean function, a pointer to a mean function or one of the former in combination with a cell array of mean functions and an additional list of parameters. 38 hdoc/usageMean.m 38i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 % demonstrate usage of mean functions % % See also meanFunctions.m. % hgpml copyright 5ai clear all, close all n = 5; D = 2; x = randn(n,D); % create a random data set % set up simple mean functions m0 = {’meanZero’}; hyp0 = []; % no hyperparameters are needed m1 = {’meanOne’}; hyp1 = []; % no hyperparameters are needed mc = {@meanConst}; hypc = 2; % also function handles are possible ml = {@meanLinear}; hypl = [2;3]; % m(x) = 2*x1 + 3*x2 mp = {@meanPoly ,2}; hypp = [1;1;2;3]; % m(x) = x1+x2+2*x1^2+3*x2^2 mn = {@meanNN ,[1,0; 0,1],[0.9,0.5]}; hypn = []; % nearest neighbor s = 12; hypd = randn(s,1); % discrete mean with 12 hypers md = {’meanDiscrete’,s}; hyp.cov = [0;0]; hypg = []; % GP predictive mean xt = randn(2*n,D); yt = sign(xt(:,1)-xt(:,2)); % training data mg = {@meanGP ,hyp,@infEP ,@meanZero ,@covSEiso ,@likErf ,xt,yt}; hype = [0;0; log(0.1)]; % regression GP predictive mean xt = randn(2*n,D); yt = xt(:,1).*xt(:,2); % training data me = {@meanGPexact ,@meanZero ,@covSEiso ,xt,yt}; % set up composite mean functions msc = {’meanScale’,{m1}}; hypsc = [3; hyp1]; % scale by 3 msu = {’meanSum’,{m0,mc,ml}}; hypsu = [hyp0; hypc; hypl]; % sum mpr = {@meanProd ,{mc,ml}}; hyppr = [hypc; hypl]; % product mpo = {’meanPow’,3,msu}; hyppo = hypsu; % third power mask = [false ,true]; % mask excluding all but the 2nd component mma = {’meanMask’,mask,ml}; hypma = hypl(mask); mpf = {@meanPref ,ml}; hyppf = 2; % linear pref with slope mwp = {@meanWarp ,ml,@sin,@cos};hypwp = 2; % sin of linear % 0) specify mean function % mean = md; hyp = hypd; x = randi([1,s],n,1); % mean = mn; hyp = hypn; % mean = mg; hyp = hypg; mean = me; hyp = hype; % mean = m0; hyp = hyp0; % mean = msu; hyp = hypsu; % mean = mpr; hyp = hyppr; % mean = mpo; hyp = hyppo; % mean = mpf; hyp = hyppf; % 1) query the number of parameters feval(mean{:}) % 2) evaluate the function on x feval(mean{:},hyp,x) % 3) compute the derivatives w.r.t. to hyperparameter i i = 2; feval(mean{:},hyp,x,i) 39 6 Covariance Functions A covariance function kψ : X × X → R (with hyperparameters ψ) of a GP f is a scalar function defined over the whole domain X2 that computes the covariance k(x, z) = V[f(x), f(z)] = E[(f(x) − m(x))(f(z) − m(z))] of f between the inputs x and z. 
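As a small standalone illustration (with arbitrary hyperparameter values; the evaluation interface itself is detailed next), an isotropic squared exponential covariance can be evaluated on a random data set as follows:

n = 5; D = 3; x = randn(n,D);            % random data set
hyp = log([0.9; 2]);                     % covSEiso hyperparameters [log(ell); log(sf)]
K   = feval(@covSEiso, hyp, x);          % n-by-n covariance matrix k(x_i,x_j)
kss = feval(@covSEiso, hyp, x, 'diag');  % n-by-1 vector of self covariances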
6.1 Interface Again, the interface is simple since only evaluation of the full covariance matrix K = kψ (X) and its ∂ derivatives Ki = ∂ψ K as well as cross terms k∗ = kψ (X, x∗ ) and k∗∗ = kψ (x∗ , x∗ ) for prediction i are required. 39 hcovFunctions.m 39i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % Covariance functions to be use by Gaussian process functions. There are two different kinds of covariance functions: simple and composite: 1) Elementary and standalone covariance functions: covZero - zero covariance function covEye - unit covariance function covOne - unit constant covariance function covDiscrete - precomputed covariance for discrete data 2) Composite covariance functions: covScale - scaled version of a covariance function covSum - sums of covariance functions covProd - products of covariance functions covMask - mask some dimensions of the data covPref - difference covariance for preference learning covPER - make stationary covariance periodic covADD - additive covariance function covWarp - warp input to a covariance function 3) Mahalanobis distance based covariances and their modes covMaha - generic "mother" covariance * eye - unit length scale * iso - isotropic length scale * ard - automatic relevance determination * prot - (low-rank) projection in input space * fact - factor analysis covariance * vlen - spatially varying length scale covGE - Gamma exponential covariance covMatern - Matern covariance function with nu=1/2, 3/2 or 5/2 covPP - piecewise polynomial covariance function (compact support) covRQ - rational quadratic covariance function covSE - squared exponential covariance function 4) Dot product based covariances and their modes covDot - generic "mother" covariance * eye - unit length scale * iso - isotropic length scale * ard - automatic relevance determination * pro - (low-rank) projection in input space * fac - factor analysis covariance covLIN - linear covariance function covPoly - polynomial covariance function 40 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % 5) Time series covariance functions on the positive real line covFBM - fractional Brownian motion covariance covULL - underdamped linear Langevin process covariance covW - i-times integrated Wiener process covariance covOU - i-times integrated Ornstein -Uhlenbeck process covariance 6) Standalone covariances covNNone - neural network covariance function covLINone - linear covariance function with bias covPeriodic - smooth periodic covariance function (1d) covPeriodicNoDC - as above but with zero DC component and properly scaled covCos - sine periodic covariance function (1d) with unit period covGabor - Gabor covariance function 7) Shortcut covariances assembled from library covConst - covariance for constant functions covNoise - independent covariance function (i.e. 
white noise) covPERiso - make isotropic stationary covariance periodic covPERard - make ARD stationary covariance periodic covMaterniso - Matern covariance function with nu=1/2, 3/2 or 5/2 covMaternard - Matern covariance function with nu=1/2, 3/2 or 5/2 with ARD covPPiso - piecewise polynomial covariance function (compact support) covPPard - piecewise polynomial covariance function (compact support) covRQiso - isotropic rational quadratic covariance function covRQard - rational quadratic covariance function with ARD covSEiso - isotropic squared exponential covariance function covSEisoU - same as above but without latent scale covSEard - squared exponential covariance function with ARD covSEvlen - spatially varying lengthscale squared exponential covSEproj - projection squared exponential covariance function covLINiso - linear covariance function covLINard - linear covariance function with ARD covGaborard - Gabor covariance function with ARD covGaborsio - isotropic Gabor covariance function covSM - spectral mixture covariance function 8) Special purpose (approximation) covariance functions apxSparse - sparse approximation: to be used for large scale inference problems with inducing points aka FITC apxGrid - grid interpolation: to be used for large scale inference problems with Kronecker/Toeplitz/BTTB covariance matrix The covariance functions are written according to a special convention where the exact behaviour depends on the number of input and output arguments passed to the function. If you want to add new covariance functions , you should follow this convention if you want them to work with the function gp. There are four different ways of calling the covariance functions: 1) With no (or one) input argument(s): s = cov The covariance function returns a string s telling how many hyperparameters it expects , using the convention that "D" is the dimension of the input space. For example , calling "covRQard" returns the string ’(D+2)’. 2) With two input arguments and one output argument: 41 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % K = cov(hyp, x) equivalent to K = cov(hyp, x, []) The function computes and returns the covariance matrix where hyp are the hyperparameters and x is an n by D matrix of cases , where D is the dimension of the input space. The returned covariance matrix is of size n by n. 3) With three input arguments and one output argument: Kz = cov(hyp, x, z) kx = cov(hyp, x, ’diag ’) The function computes test set covariances; kx is a vector of self covariances for the test cases in x (of length n) and Kz is an (n by nz) matrix of cross covariances between training cases x and test cases z. 4) With two output arguments: [K,dK] = cov(hyp, x) equivalent to [K,dK] = cov(hyp, x, []) [K,dK] = cov(hyp, x, z) [K,dK] = cov(hyp, x, ’diag ’) The function computes and returns the covariances K as in 3) above. In addition to that, the (linear) directional derivative function dK is returned. The two possible calls dhyp = dK(Q) and [dhyp,dx] = dK(Q) for a direction Q of the same size as K are possible. 
The return arguments dhyp and dx are the directional derivatives dhyp = d trace(Q’*K) / d hyp and dx = d trace(Q’*K) / d x are of the same size as the hyperparameter vector hyp and the input data x, respectively. The components of dhyp and dx are defined as follows: dhyp(i) = trace(Q’*( d K / d hyp(i) )) and dx(i,j) = trace(Q’*( d K / d x(i,j) )). Covariance functions can be specified in two ways: either as a string containing the name of the covariance function or using a cell array. For example: cov = ’covRQard ’; cov = {’covRQard ’}; cov = {@covRQard}; are supported. Only the second and third form using the cell array can be used for specifying composite covariance functions , made up of several contributions. For example: cov cov cov cov q=1; cov d=3; cov cov cov = = = = = = = = {’covScale ’, {’covRQiso ’}}; {’covSum ’, {’covRQiso ’,’covSEard ’,’covNoise ’}}; {’covProd ’,{’covRQiso ’,’covSEard ’,’covNoise ’}}; {’covMask ’,{mask,’covSEiso ’}} {’covPPiso ’,q}; {’covPoly ’,d}; {’covADD ’,{[1,2],’covSEiso ’}}; {@apxSparse , {@covSEiso}, u}; where u are the inducing inputs specifies a covariance function which is the sum of three contributions. To find out how many hyperparameters this covariance function requires , we do: feval(cov{:}) 42 160 161 162 163 164 165 % which returns the string ’3+(D+1)+1’ (i.e. the ’covRQiso ’ contribution uses % 3 parameters , the ’covSEard ’ uses D+1 and ’covNoise ’ a single parameter). % % See also doc/usageCov.m. % hgpml copyright 5ai 43 6.2 Implemented Covariance Functions Similarly to the mean functions, we provide a whole algebra of covariance functions k : X × X → R with the same generic name pattern cov/cov .m as before. Besides a long list of simple covariance functions, we also offer a variety of composite covariance functions as shown in the following table. 1) Elementary and standalone covariance functions k(x, x 0 ) meaning Zero covariance vanishes always Eye unit additive measurement noise One unit constant Discrete precomputed covariance for discrete data x ∈ X ⊆ N 2) Composite covariance functions [κ1 (x, z), κ2 (x, z), ..] 
7→ k(x, z) meaning Scale modulate covariance function by a scalar Scale modulate covariance function depending on input Sum add up covariance functions Prod multiply covariance functions Mask Pref act on components I ⊆ [1, 2, .., D] of x ∈ X ⊆ RD only preference learning covariance x = [x1 ; x2 ], xi ⊆ RD/2 PER turn covariance into a periodic, X ⊆ RD ADD Warp additive, X ⊆ RD , index degree set D = {1, .., D} warp inputs to a covariance function 3) Covariance functions based on Mahalanobis distances k(x, z) = k(r), Maha shared generic “mother” covariance function GE gamma exponential 2 Matern Matérn, f1 (t) = 1, f3 (t) = 1 + t, f5 (t) = f3 (t) + t3 PP compact support, piecewise polynomial fv (r) RQ rational quadratic SE squared exponential Allowable mode parameter values for Mahalanibis distance covariances ’eye’ unit lengthscale ’iso’ isotropic lengthscale ’ard’ automatic relevance determination ’proj’ (low-rank) projection in input space ’fact’ factor analysis ’vlen’ r2 = (x − x> P−1 z, underdamped linear Langevin process covariance W OU i-times integrated Wiener process covariance, i ∈ {−1, .., 3} i-times integrated Ornstein-Uhlenbeck process covariance, i ∈ {0, 1} NNone neural net, X ⊆ RD LINone linear with bias, X ⊆ RD Periodic kxx 0 where K = L> L is Cholesky decomposition, L ∈ Rs×s L k(x, z) = σ2f κ(x, z) σf (x)κ(x, z)σf (z) P κ (x, z) Qj j j κj (x, z) ψ ln σf φ` ∅ ∅ κ(xI , zI ) κ(x1 , z1 ) + κ(x2 , z2 ) − κ(x1 , z2 ) − κ(x2 , z1 ) x sin 2πxp κ(u(x), u(z)) , u(x) = , xp = x/p cos 2πxp x/p P P Q 2 d∈D σfd |I|=d i∈I κ(xi , zi ; ψi ) D D p κ(p(x), p(z)), where p : R → R ∅ ∅ − z), where x, z ∈ X ⊆ k(r) = exp (−|r|γ ) , γ ∈ (0, 2] √ √ fd ( dr) exp(− dr) max(0, 1 − r)j+v · fv (r) −α 1 + 12 r2 /α exp −r2 /2 where x, z ∈ X ⊆ k(s) = s σ2f (s + c)d ∅ ln(γ/(2 − γ)) ∅ ∅ ln α ∅ ∅ ln ` ln Λ L {L, ln f} `2 (x)+`2 (z) 2 ∅ ∅ ln c k(x, z) = 2h 1 2 2h 2h − |x − 2 σf |x| z| + |z| ψ {ln σf , − ln(1/h − 1)} sin(ωt) + cos(ωt) , t = |x − z| ω µ Rx Rz κ0 (x, z) = σ2f min(x, z), κi (x, z) = 0 0 κi (x̃, z̃)dx̃dz̃ |x+z| 2 2 κ0 (x, z) = σ2f exp(− |x−z| ` ) + (σf0 − σf ) exp(− ` ) {ln µ, ln ω, ln a} k(x, z) = ψ k(t) = a2 e−µt p (`2 + x> x)(`2 + z> z) ln t {ln `, ln p, ln σf } Rπ PeriodicNoDC periodic, rescaled, DC component removed, X ⊆ R Cos periodic cosine, X ⊆ R σ2f cos (π(x − z)/p) Gabor Gabor function, X ⊆ π 0 t 1 > −2 > exp − 2 t Λ t cos 2πtp 1 , tp = t/p t/p 44 ln σf {ln `, ln σf , ln σf0 } {ln `, ln σf } κ(x−z)− 1 κ(t)dt σ2f κ(0)− 1 πRπ 0κ(t)dt , κ(x − z) = exp − `22 sin2 [π(x − z)/p] λ, p ∈ φ` ∅ ln ` ln Λ L {L, ln f} (x> z +1)/t2 RD + {ψ1 , .., ψD , ln σf1 , .., ln σf|D| } ∅ RD σ2f exp − `22 sin2 [π(x − z)/p] RD , ψu P=I P = `2 I, ` ∈ R+ P = Λ2 , Λ = diag(λ), λ ∈ RD + P−1 = L> L, L ∈ Rd×D −1 > D P = L L + diag(f), f ∈ R , L ∈ Rd×D σ2f sin−1 x> z/ periodic, X ⊆ R eye iso ard RD P=I P = `2 I, ` ∈ R+ P = Λ2 , Λ = diag(λ), λ ∈ RD + P−1 = L> L, L ∈ Rd×D P−1 = L> L + diag(f), f ∈ RD , L ∈ Rd×D D/2 2 r k` (x, z) = `(x)`(z) k l2 (x,z) , l2 (x, z) = l2 (x,z) ULL 6) Standalone covariance functions meaning ψ ∅ ∅ ∅ z)> P−1 (x spatially varying lengthscale, 4) Covariance functions based on Euclidean dot products k(x, z) = k(s), s = Dot shared generic “mother” covariance function LIN linear covariance function Poly polynomial covariance Allowable mode parameter values for Euclidean dot product covariances ’eye’ unit lengthscale ’iso’ isotropic lengthscale ’ard’ automatic relevance determination ’proj’ (low-rank) projection in input space ’fact’ factor analysis 5) Time series covariance functions k(x, z) = k(x, z), 
where x, z ∈ X = R+ meaning FBM fractional Brownian motion covariance with Hurst index h k(x, z) = 0 δ(x − z) 1 eye iso , t = x − z ard {ln `, ln p, ln σf } {ln p, ln σf } {ln λ, ψp } 7) Shortcut covariance functions assembled from library k(x, z) Meaning Const covariance equals a constant Noise additive measurement noise PERiso turn covariance into a periodic, X ⊆ RD PERard turn covariance into a periodic, X ⊆ RD Materniso Matérn, X ⊆ RD , f1 (t) = 1, f3 (t) = 1 + t, f5 (t) = f3 (t) + Maternard PPiso PPard RQiso RQard SEiso SEisoU SEard Matérn, X ⊆ f1 (t) = 1, f3 (t) = 1 + t, f5 (t) = f3 (t) + compact support, piecewise polynomial fv (r) compact support, piecewise polynomial fv (r) rational quadratic, X ⊆ RD rational quadratic, X ⊆ RD diagonal squared exponential, X ⊆ RD squared exponential, X ⊆ RD automatic relevence determination squared exponential SEvlen spatially varying lengthscale squared exponential SEproj factor analysis squared exponential LINard LINiso linear with diagonal weighting linear with isotropic weighting RD , t2 3 t2 3 Gaborard anisotropic Gabor function, X ⊆ RD , λ, p ∈ RD + Gaboriso isotropic Gabor function, X ⊆ SM spectral mixture, X ⊆ RD , w ∈ RQ M, V ∈ RD×Q +, + spectral mixture product, X ⊆ RD , W ∈ RD×Q , M, V ∈ RD×Q + + RD , `, p ∈ R+ k(x, z) = σ2f σ2f δ(x − z) κ(u(x), u(z)) , u(x) = [sin xp , cos xp ], xp = 2πx/p κ(u(x), u(z)) , u(x) = [sin xp , cos xp ], xp = 2πx/p q σ2f fd (rd ) exp(−rd ), rd = `d2 (x − z)> (x − z) p σ2f fd (rd ) exp(−rd ), rd = d(x − z)> Λ−2 (x − z) D j+v 2 · fv (r), r = kx−zk σf max(0, 1 − r) ` ,j= 2 +v+1 σ2f max(0, 1 − r)j+v · fv (r), r2 = (x − z)> Λ−2 (x − z) −α 1 > σ2f 1 + 2α` 2 (x − z) (x − z) −α 1 2 > −2 σf 1 + 2α (x − z) Λ (x − z) σ2f exp − 2`12 (x − z)> (x − z) exp − 2`12 (x − z)> (x − z) σ2f exp − 12 (x − z)> Λ−2 (x − z) D kx−zk2 a 2 2 σf b , a = 2`(x)`(z), b = `2 (x) + `2 (z) exp − b σ2f exp − 12 (x − z)> L> L(x − z) , L ∈ Rd×D ψ ln σf ln σf ln p {ln p1 , .., ln pD } x> Λ−2 z x> z/`2 P exp − D d=1 {ln λ1 , .., ln λD } ln ` P cos 2π D d=1 td /pd , td = xd − zd t2d 2λ2d t> t exp(− 2`2 ) cos(2πt> 1/p), t = x − z w> exp(− 21 V> t2 ) cos(M> t) , t = 2π(x − z) QD 1 > 2 cos(md td ) , td = 2π(xd d=1 wd exp(− 2 vd td ) {ln `, ln σf } {ln λ1 , .., ln λD , ln σf } {ln `, ln σf } {ln λ1 , .., ln λD , ln σf } {ln `, ln σf , ln α} {ln λ1 , .., ln λD , ln σf , ln α} {ln `, ln σf } ln ` {ln λ1 , .., ln λD , ln σf } {φ` , ln σf } {L, ln σf } {ln λ, ln p} {ln `, ln p} − zd ) {ln w, ln M, ln V} {ln W, ln M, ln V} The spectral mixture covariance covSM was introduced by Wilson & Adams Gaussian Process Kernels for Pattern Discovery and Extrapolation, ICML, 2013. The periodic covariance function covPER starts from a stationary covariance function that depends on the data only through a distance r2 = (x − x 0 )> Λ−2 (x − x 0 ) such as covMatern, covPP, covRQ, covSE and turns them into a periodic covariance function by embedding the data x ∈ RD into a periodic high-dimensional space xp = u(x) ∈ R2D by a function u(x) = 2πdiag(p−1 )x. The additive covariance function covADD starts from a one-dimensional covariance function κ(xi , xi0 , ψi ) acting on a single component i ∈ [1, .., D] of x. From that, we define covariance functions κI (xI , xI ) = Q 0 i∈I κ(xi , xi , ψi ) acting on vector-valued inputs xI . The sums of exponential size can efficiently be computed using the Newton-Girard formulae. Samples functions drawn from a GP with additive covariance are additive functions. 
The number of interacting variables |I| is a measure of how complex the additive functions are. 6.3 Usage of Implemented Covariance Functions Some code examples taken from doc/usageCov.m illustrate how to use simple and composite covariance functions to specify a GP model. Syntactically, a covariance function cf is defined by cv := ’func’ | @func // simple cf := {cv} | {cv, {param, cf}} | {cv, {cf, .., cf}} // composite i.e., it is either a string containing the name of a covariance function, a pointer to a covariance function or one of the former in combination with a cell array of covariance functions and an additional list of parameters. 44 hdoc/usageCov.m 44i≡ 1 2 3 4 5 6 7 8 % demonstrate usage of covariance functions % % See also covFunctions.m. % hgpml copyright 5ai clear all, close all n = 5; D = 3; x = randn(n,D); xs = randn(3,D); 45 % create a data set 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 % set up simple covariance functions cn = {’covNoise’}; sn = .1; hypn = log(sn); % one hyperparameter cc = {@covConst}; sf = 2; hypc = log(sf); % function handles OK ce = {@covEye}; hype = []; % identity cl = {@covLIN}; hypl = []; % linear is parameter -free cla = {’covLINard’}; L = rand(D,1); hypla = log(L); % linear (ARD) cli = {’covLINiso’}; l = rand(1); hypli = log(l); % linear iso clo = {@covLINone}; ell = .9; hyplo = log(ell); % linear with bias cp = {@covPoly ,3}; c = 2; hypp = log([c;sf]); % third order poly cga = {@covSEard}; hypga = log([L;sf]); % Gaussian with ARD cgi = {’covSEiso’}; hypgi = log([ell;sf]); % isotropic Gaussian cgu = {’covSEisoU’}; hypgu = log(ell); % isotropic Gauss no scale cra = {’covRQard’}; al = 2; hypra = log([L;sf;al]); % ration. quad. cri = {@covRQiso}; hypri = log([ell;sf;al]); % isotropic cma = {@covMaternard ,5}; hypma = log([ell;sf]); % Matern class d=5 cmi = {’covMaterniso’,3}; hypmi = log([ell;sf]); % Matern class d=3 cnn = {’covNNone’}; hypnn = log([L;sf]); % neural network cpe = {’covPeriodic’}; p = 2; hyppe = log([ell;p;sf]); % periodic cpn = {’covPeriodicNoDC’}; p = 2; hyppe = log([ell;p;sf]); % w/o DC cpc = {’covCos’}; p = 2; hypcpc = log([p;sf]); % cosine cov cca = {’covPPard’,3}; hypcc = hypgu;% compact support poly degree 3 cci = {’covPPiso’,2}; hypcc = hypgi;% compact support poly degree 2 cgb = {@covGaboriso}; ell = 1; p = 1.2; hypgb=log([ell;p]); % Gabor Q = 2; w = ones(Q,1)/Q; m = rand(D,Q); v = rand(D,Q); csm = {@covSM ,Q}; hypsm = log([w;m(:);v(:)]); % Spectral Mixture cvl = {@covSEvlen ,{@meanLinear}}; hypvl = [1;2;1; 0]; % var lenscal s = 12; cds = {@covDiscrete ,s}; % discrete covariance function L = randn(s); L = chol(L’*L); L(1:(s+1):end) = log(diag(L)); hypds = L(triu(true(s))); xd = randi([1,s],[n,1]); xsd = [1;3;6]; cfa = {@covSEfact ,2}; hypfa = randn(D*2,1); % factor analysis % set up composite i.e. 
meta covariance functions csc = {’covScale’,{cgu}}; hypsc = [log(3); hypgu]; % scale by 9 csu = {’covSum’,{cn,cc,cl}}; hypsu = [hypn; hypc; hypl]; % sum cpr = {@covProd ,{cc,cci}}; hyppr = [hypc; hypcc]; % product mask = [0,1,0]; % binary mask excluding all but the 2nd component cma = {’covMask’,{mask,cgi{:}}}; hypma = hypgi; % isotropic periodic rational quadratic cpi = {’covPERiso’,{@covRQiso}}; % periodic Matern with ARD cpa = {’covPERard’,{@covMaternard ,3}}; % additive based on SEiso using unary and pairwise interactions cad = {’covADD’,{[1,2],’covSEiso’}}; % preference covariance with squared exponential base covariance cpr = {’covPref’,{’covSEiso’}}; hyppr = [0;0]; xp = randn(n,2*D); xsp = randn(3,2*D); % 0) specify a covariance function % cov = cma; hyp = hypma; % cov = cci; hyp = hypcc; % cov = csm; hyp = hypsm; cov = cds; hyp = hypds; x = xd; xs = xsd; % cov = cfa; hyp = hypfa; % cov = cvl; hyp = hypvl; % cov = cpr; hyp = hyppr; x = xp; xs = xsp; % 1) query the number of parameters feval(cov{:}) 46 67 68 69 70 71 72 73 % 2) evaluate the function on x [K,dK] = feval(cov{:},hyp,x) % 3) evaluate the function on x and xs to get cross -terms [kss,dkss] = feval(cov{:},hyp,xs,’diag’) [Ks, dKs ] = feval(cov{:},hyp,x,xs) 7 Hyperpriors A hyperprior p(θ) with θ = [ρ, φ, ψ] is a joint probability distribution over the likelihood hyperparameters ρ, the mean hyperparameters φ and the covariance hyperparameters ψ. We concentrate Q on factorial priors p(θ) = j pj (θj ). Hyperpriors can be used to regularise the optimisation of the hyperparameters via the marginal likelihood Z(θ) so that p(θ)Z(θ) is maximised instead. As we wish to perform unconstrained optimisation, we require (mainly) smooth hyperpriors with infinite support. 7.1 Interface In the GPML toolbox, a prior distribution p(θ) needs to implement the evaluation of the log density ∂ ln p(θ). In addition, we require sampling capabilities i.e. the ln p(θ) and its first derivative ∂θ generation of θ ∼ p(θ). 46 hpriorDistributions.m 46i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 % % % % % % % % % % % % % % % % % % % % % % % % % % % % % % prior distributions to be used for hyperparameters of Gaussian processes using infPrior. There are two different kinds of prior distributions: simple and composite: simple prior distributions: priorGauss priorLaplace priorT - univariate Gaussian - univariate Laplace - univariate Student ’s t priorSmoothBox1 priorSmoothBox2 - univariate interval (linear decay in log domain) - univariate interval (quadr. 
decay in log domain) priorGamma priorWeibull priorInvGauss priorLogNormal - priorClamped or priorDelta priorGaussMulti priorLaplaceMulti priorTMulti univariate univariate univariate univariate Gamma , IR+ Weibull , IR+ Inverse Gaussian , IR+ Log-normal , IR+ - fix hyperparameter to its current value by setting derivatives to zero, no effect on marginal likelihood - multivariate Gauss - multivariate Laplace - multivariate Student ’s t priorClampedMulti or - fix hyperparameter to its current value by setting priorDeltaMulti derivatives to zero, no effect on marginal likelihood priorEqualMulti or priorSameMulti - make several hyperparameters have the same value by same derivative , no effect on marginal likelihood 47 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 % % composite prior distributions (see explanation at the bottom): % % priorMix - nonnegative mixture of priors % priorTransform - prior on g(t) rather than t % % Naming convention: all prior distributions are named "prior/prior*.m". % % % 1) With only a fixed input arguments: % % r = priorNAME(par1,par2,parN) % % The function returns a random sample from the distribution for e.g. % random restarts , simulations or optimisation initialisation. % % 2) With one additional input arguments: % % [lp,dlp] = priorNAME(par1,par2,parN, t) % % The function returns the log density at location t along with its first % derivative. % % See also doc/usagePrior.m, inf/infPrior.m. % hgpml copyright 5ai 48 7.2 Implemented Hyperpriors All code files are named according to the pattern prior/prior .m for simple identification. All currently available hyperpriors are summarised in the following table. Simple hyperpriors p(θ) Univariate hyperpriors defined over the whole reals with mean µ and variace σ2 Meaning p(θ) = normally distributed hyperparameter θ ∈ R Gauss Laplace double exponentially hyperparameter θ ∈ R Student’s t distributed hyperparameter θ ∈ R T τ 2 √1 exp − (θ−µ) 2 2σ σ 2π (θ−µ) 1 ,b 2b exp − b Γ ( ν+1 2 ) Γ ( ν2 ) 1 (ν−2)πσ √ µ ∈ R, σ2 ∈ R+ √ = σ/ 2 1+ (θ−µ)2 (ν−2)σ2 µ ∈ R, σ2 ∈ R+ − ν+1 2 µ ∈ R, σ2 , ν ∈ R+ Univariate hyperpriors with effective bounded support but defined over the whole real line 1−exp(η(a−b)) 1 1 · 1+exp(−η(θ−a)) · 1+exp(η(θ−b)) ≈ b−a SmoothBox1 interval hyperparameter θ ∈ R i.e. θ ∈ [a, b] 1+π2 /g2 a+b w2 2 µ = 2 , σ = 1−exp(−2g) 12 , g = wη 2 , w = |a − b| 2 N(θ|a, σab ) t < a 1 √ 1 t ∈ [a, b] , σab = ηb−a ≈ 2π SmoothBox2 localised hyperparameter θ ∈ R i.e. 
θ ∈ [a, b] (1/η+1)(b−a) N(θ|a, σ2ab ) b < t µ= Univariate hyperpriors supported only over the positive reals Gamma Gamma hyperparameter θ ∈ R+ Weibull Weibull hyperparameter θ ∈ R+ InverseGauss inverse Gaussian hyperparameter θ ∈ R+ LogNormal log-normal hyperparameter θ ∈ R+ a+b 2 , σ2 = 3 2 w2 η /3+η +4η/π+2/π η3 +η2 4 = 1 √ θσ 2π a 6 b ∈ R, η ∈ R+ , w = |a − b| 1 exp( θt )θk−1 Γ (k)tk k−1 k θ exp −( θλ )k λ λ λ(θ−µ)2 1 √ 3 exp − 2µ2 θ 2πθ /λ N(θ|µ, σ2 ) a 6 b ∈ R, η ∈ R+ 2 exp − (ln θ−µ) 2σ2 k ∈ R+ , t ∈ R+ k ∈ R+ , λ ∈ R+ k ∈ R+ , λ ∈ R+ µ ∈ R, σ2 ∈ R+ RD Multivariate hyperpriors supported all over with mean µ and covariance Σ 1 GaussMulti multivariate normal distribution θ ∈ RD |2πΣ|− 2 exp − 12 (θ − µ)> Σ−1 (θ − µ) √ −1 √ D −1 LaplaceMulti multivariate Laplace distribution θ ∈ R | 2Σ| 2 exp − 2 L (θ − µ) 1 , L> L = Σ ν+D > Σ−1 (θ−µ) − 2 1 Γ ( ν+D ) TMulti multivariate Student’s t distribution θ ∈ RD |(ν − 2)πΣ|− 2 Γ ( ν2 ) 1 + (θ−µ)(ν−2) 2 Improper hyperpriors used to fix the value of a particular hyperparameter Delta clamped hyperparameter θ = θ0 ∈ R δ(θ − θ0 ) Clamped DeltaMulti clamped hyperparameter θ = θ0 ∈ RD δ(θ − θ0 ) ClampedMulti Q SameMulti D same hyperparameter θ = θ0 1 ∈ RD δ i=1 (θ0 − θi ) EqualMulti Composite hyperpriors [π1 (θ), π2 (θ), ..] 7→ p(θ) Transform prior distribution on g(θ) instead of θ π(g(θ)) P Mix mixture distribution i wi πi (θ) µ ∈ RD , Σ ∈ RD×D µ ∈ RD , Σ ∈ RD×D µ ∈ RD , Σ ∈ RD×D , ν ∈ R ∅ ∅ ∅ {g} {w} The priorSmoothBox2 is a Gauss-uniform sandwich obtained by complementing a uniform distribution on [a, b] with two Gaussian halves at each side. The parameter η balances the probability mass between the constituents so that η/(η + 1) is used for the box and 1/(η + 1) for the Gaussian sides. Its brother priorSmoothBox1 is the product of two sigmoidal functions. The priorDelta or equivalently priorClamped can be used to exclude some hyperparameters from the optimisation. Their values are clamped to θ0 and the derivative vanishes. There are also multivariate counterparts priorDeltaMulti and priorClampedMulti. 7.3 Usage of Implemented Hyperpriors Some code examples taken from doc/usagePrior.m illustrate how to use univariate, multivariate and composite priors on hyperparameters. Syntactically, a hyperprior hp is defined by func := Dist // prior distributions in prior/ | Clamped | Delta // predefined for fixing the hyperparameter pr := ’func’ | @func // univariate hyperprior 49 | hp ’funcMulti’ | @funcMulti // multivariate hyperprior := {pr} | {pr, {param, hp}} | {pr, {hp, .., hp}} // composite i.e., it is either a string containing the name of a hyperprior function, a pointer to a hyperprior function or one of the former in combination with a cell array of hyperprior functions and an additional list of parameters. Furthermore, we have multivariate hyperprior variants and 2 (equivalent) predefined hyperpriors allowing to exclude variables from optimisation. 49 hdoc/usagePrior.m 49i≡ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 % demonstrate usage of prior distributions % % See also priorDistributions.m. 
% hgpml copyright 5ai clear all, close all % 1) specify some priors % a) univariate priors mu = 1.0; s2 = 0.01^2; nu = 3; pg = {@priorGauss ,mu,s2}; % Gaussian prior pl = {’priorLaplace’,mu,s2}; % Laplace prior pt = {@priorT ,mu,s2,nu}; % Student ’s t prior p1 = {@priorSmoothBox1 ,0,3,15}; % smooth box constraints lin decay p2 = {@priorSmoothBox2 ,0,2,15}; % smooth box constraints qua decay pd = {’priorDelta’}; % fix value of prior exclude from optimisation pc = {@priorClamped}; % equivalent to above lam = 1.05; k = 2.5; pw = {@priorWeibull ,lam,k}; % Weibull prior % b) meta priors pmx = {@priorMix ,[0.5,0.5],{pg,pl}}; g = @exp; dg = @exp; ig = @log; ptr = {@priorTransform ,g,dg,ig,pg}; % mixture of two priors % Gaussian in the exp domain % c) multivariate priors m = [1;2]; V = [2,1;1,2]; pG = {@priorGaussMulti ,m,V}; % 2d Gaussian prior pD = {’priorDeltaMulti’}; % fix value of prior exclude from optim pC = {@priorClampedMulti}; % equivalent to above % 2) evaluation % pri = pt; hp % pri = pmx; hp % pri = ptr; hp pri = pG; hp = = randn(1,3); = randn(1,3); = randn(1,3); randn(2,3); % a) draw a sample from the prior feval(pri{:}) % b) evaluate prior and derivative if requires [lp,dlp] = feval(pri{:},hp) % 3) comprehensive example x = (0:0.1:10)’; y = 2*x+randn(size(x)); % generate training data mean = {@meanSum ,{@meanConst ,@meanLinear}}; % specify mean function cov = {@covSEiso}; lik = {@likGauss}; % specify covariance and lik hyp.cov = [log(1);log(1.2)]; hyp.lik = log(0.9); hyp.mean = [2;3]; 50 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 par = {mean,cov,lik,x,y}; mfun = @minimize; % input for GP function % a) plain marginal likelihood optimisation (maximum likelihood) im = @infExact; % inference method hyp_plain = feval(mfun, hyp, @gp, -10, im, par{:}); % optimise % b) regularised optimisation (maximum a posteriori) with 1d priors prior.mean = {pg;pc}; % Gaussian prior for first , clamp second par prior.cov = {p1;[]}; % box prior for first , nothing for second par im = {@infPrior ,@infExact ,prior}; % inference method hyp_p1 = feval(mfun, hyp, @gp, -10, im, par{:}); % optimise % c) regularised optimisation (maximum a posteriori) with Nd priors prior = []; % clear the structure % multivariate Student ’s t prior on the first and second mean hyper prior.multi{1} = {@priorTMulti ,[mu;mu],diag([s2,s2]),nu,... struct(’mean’,[1,2])}; % use hyper struct % Equivalent shortcut (same mu and s2 for all dimensions) prior.multi{1} = {@priorTMulti ,mu,s2,nu,struct(’mean’,[1,2])}; % multivariate Gaussian prior jointly on 1st and 3rd hyper prior.multi{2} = {@priorGaussMulti ,[mu;mu],diag([s2,s2]),... [1,3]}; % use unwrapped hyper vector % Equivalent shortcut (same mu and s2 for all dimensions) prior.multi{2} = {@priorGaussMulti ,mu,s2,[1,3]}; im = {@infPrior ,@infExact ,prior}; % inference method hyp_pN = feval(mfun, hyp, @gp, -10, im, par{:}); % optimise [any2vec(hyp), any2vec(hyp_plain), any2vec(hyp_p1), any2vec(hyp_pN)] 51