Bayesian Modeling with Gaussian Processes using the GPstuff
Toolbox
Jarno Vanhatalo∗

jarno.vanhatalo@helsinki.fi

Department of Environmental Sciences
University of Helsinki
P.O. Box 65, FI-00014 Helsinki, Finland

Jaakko Riihimäki†
Jouni Hartikainen†
Pasi Jylänki†
Ville Tolvanen†
Aki Vehtari‡

aki.vehtari@aalto.fi
Helsinki Institute for Information Technology HIIT, Department of Computer Science
Aalto University School of Science
P.O. Box 15400, FI-00076 Aalto, Finland

Abstract
Gaussian processes (GPs) are powerful tools for probabilistic modeling. They
can be used to define prior distributions over latent functions in hierarchical Bayesian
models. The prior over functions is defined implicitly by the mean and covariance function,
which determine the smoothness and variability of the function. The inference can then
be conducted directly in the function space by evaluating or approximating the posterior
process. Despite their attractive theoretical properties GPs provide practical challenges
in their implementation. GPstuff is a versatile collection of computational tools for GP
models compatible with Linux and Windows MATLAB and Octave. It includes, among
others, various inference methods, sparse approximations and tools for model assessment.
In this work, we review these tools and demonstrate the use of GPstuff in several models.
Last updated 2015-10-12.
Keywords: Gaussian process, Bayesian hierarchical model, nonparametric Bayes

∗. Work done mainly while at the Department of Biomedical Engineering and Computational Science, Aalto
University
†. No longer at Aalto University
‡. Corresponding author

Contents
1 Introduction
2 Gaussian process models
   2.1 Gaussian process prior
   2.2 Conditioning on the observations
3 Conditional posterior and predictive distributions
   3.1 Gaussian observation model: the analytically tractable case
   3.2 Laplace approximation
   3.3 Expectation propagation algorithm
   3.4 Markov chain Monte Carlo
4 Marginal likelihood given parameters
5 Marginalization over parameters
   5.1 Maximum a posteriori estimate of parameters
   5.2 Grid integration
   5.3 Monte Carlo integration
   5.4 Central composite design
6 Getting started with GPstuff: regression and classification
   6.1 Gaussian process regression
      6.1.1 Constructing the model
      6.1.2 MAP estimate for the parameters
      6.1.3 Marginalization over parameters with grid integration
      6.1.4 Marginalization over parameters with MCMC
   6.2 Gaussian process classification
      6.2.1 Constructing the model
      6.2.2 Inference with Laplace approximation
      6.2.3 Inference with expectation propagation
      6.2.4 Inference with MCMC
      6.2.5 Marginal posterior corrections
7 Other single latent models
   7.1 Robust regression
      7.1.1 Student-t observation model
      7.1.2 Laplace observation model
   7.2 Count data
      7.2.1 Poisson
      7.2.2 Negative binomial
      7.2.3 Binomial
      7.2.4 Hurdle model
   7.3 Log-Gaussian Cox process
   7.4 Accelerated failure time survival models
      7.4.1 Weibull
      7.4.2 Log-Gaussian
      7.4.3 Log-logistic
   7.5 Derivative observations in GP regression
   7.6 Quantile Regression
8 Multilatent models
   8.1 Multiclass classification
   8.2 Multinomial
   8.3 Cox proportional hazard model
   8.4 Zero-inflated negative binomial
   8.5 Density estimation and regression
   8.6 Input dependent models
      8.6.1 Input-dependent Noise
      8.6.2 Input-dependent overdispersed Weibull
   8.7 Monotonicity constraint
9 Mean functions
   9.1 Explicit basis functions
10 Sparse Gaussian processes
   10.1 Compactly supported covariance functions
      10.1.1 Compactly supported covariance functions in GPstuff
   10.2 FIC and PIC sparse approximations
      10.2.1 FIC sparse approximation in GPstuff
      10.2.2 PIC sparse approximation in GPstuff
   10.3 Deterministic training conditional, subset of regressors and variational sparse approximation
      10.3.1 Variational, DTC and SOR sparse approximation in GPstuff
   10.4 Sparse GP models with non-Gaussian likelihoods
   10.5 Predictive active set selection for classification
11 Modifying the covariance functions
   11.1 Additive models
      11.1.1 Additive models in GPstuff
   11.2 Additive covariance functions with selected variables
   11.3 Product of covariance functions
12 State space inference
13 Model assessment and comparison
   13.1 Marginal likelihood
   13.2 Cross-validation
   13.3 DIC
   13.4 WAIC
   13.5 Model assessment demos
14 Bayesian optimization
15 Adding new features in the toolbox
16 Discussion
A Comparison of features in GPstuff, GPML and FBM
B Covariance functions
C Observation models
D Priors
E Transformation of hyperparameters
F Developer appendix
   F.1 lik functions
      F.1.1 lik_negbin
      F.1.2 lik_t
   F.2 gpcf functions
      F.2.1 gpcf_sexp
   F.3 prior functions
   F.4 Optimisation functions
   F.5 Other inference methods

1. Introduction
This work reviews GPstuff (version 4.6), a free open-source toolbox that implements a
collection of inference methods and tools for various Gaussian process (GP) models.
The toolbox is compatible with Unix and Windows MATLAB (Mathworks, 2010)
(r2009b or later; earlier versions may work but have not been tested extensively) and Octave
(Octave community, 2012) (3.6.4 or later; compactly supported covariance functions are not
currently working in Octave). It is available from http://becs.aalto.fi/en/research/
bayes/gpstuff/. If you find GPstuff useful, please use the reference (Vanhatalo et al., 2013)
Jarno Vanhatalo, Jaakko Riihimäki, Jouni Hartikainen, Pasi Jylänki, Ville Tolvanen
and Aki Vehtari (2013). GPstuff: Bayesian Modeling with Gaussian Processes. Journal of Machine Learning Research, 14:1175-1179
and appropriate references mentioned in this text or in the GPstuff functions you are using.
A GP is a stochastic process which provides a powerful tool for probabilistic inference
directly on distributions over functions (e.g. O’Hagan, 1978) and which has gained much
attention in recent years (Rasmussen and Williams, 2006). In many practical GP models,
the observations y = [y_1, ..., y_n]^T related to inputs (covariates) X = {x_i = [x_{i,1}, ..., x_{i,d}]^T}_{i=1}^n
are assumed to be conditionally independent given a latent function (or predictor) f(x), so
that the likelihood p(y | f) = \prod_{i=1}^{n} p(y_i | f_i), where f = [f(x_1), ..., f(x_n)]^T, factorizes over
cases. A GP prior is set for the latent function, after which the posterior p(f | y, X) is solved

and used for prediction. The GP defines the prior over the latent function implicitly through the
mean and covariance function, which determine, for example, the smoothness and variability
of the latent function. Usually, the model hierarchy is extended to a third level by also
giving priors for the parameters of the covariance function and the observation model.
Most of the models in GPstuff follow the above definition and can be summarized as:
Observation model:   y | f, φ ∼ \prod_{i=1}^{n} p(y_i | f_i, φ)      (1)
GP prior:            f(x) | θ ∼ GP( m(x), k(x, x' | θ) )      (2)
hyperprior:          θ, φ ∼ p(θ) p(φ).      (3)

Here θ denotes the parameters in the covariance function k(x, x'|θ), and φ the parameters in
the observation model. We will call the function value f(x) at a fixed x a latent variable, and for
brevity we sometimes denote all the covariance function and observation model parameters
by ϑ = [θ, φ]. For simplicity of presentation the mean function is considered zero,
m(x) ≡ 0, throughout this paper. We will also denote both the observation model and the
likelihood by p(y | f, φ) and assume the reader is able to distinguish between the two from
the context. The likelihood naming convention is used in the toolbox for both likelihood
and observation model related functionality; this follows the naming convention used, for
example, in the INLA (Rue et al., 2009) and GPML (Rasmussen and Nickisch, 2010) software
packages. There are also models with multiple latent processes and likelihoods where each
factor depends on multiple latent values. These are discussed in Section 8.
Early examples of GP models can be found, for example, in time series analysis and
filtering (Wiener, 1949) and in geostatistics (e.g. Matheron, 1973). GPs are still widely used
in these fields and useful overviews are provided in (Cressie, 1993; Grewal and Andrews,
2001; Diggle and Ribeiro, 2007; Gelfand et al., 2010). O’Hagan (1978) was one of the first
to consider GPs in a general probabilistic modeling context and provided a general theory of
GP prediction. This general regression framework was later rediscovered as an alternative
to neural network models (Williams and Rasmussen, 1996; Rasmussen, 1996) and extended
to problems other than regression (Neal, 1997; Williams and Barber, 1998). This machine
learning perspective is comprehensively summarized by Rasmussen and Williams (2006).
Nowadays, GPs are used, for example, in weather forecasting (Fuentes and Raftery,
2005; Berrocal et al., 2010), spatial statistics (Best et al., 2005; Kaufman et al., 2008;
Banerjee et al., 2008; Myllymäki et al., 2014), computational biology (Honkela et al., 2011),
reinforcement learning (Deisenroth et al., 2009, 2011), healthcare applications (Stegle et al.,
2008; Vanhatalo et al., 2010; Rantonen et al., 2012, 2014), survival analysis (Joensuu et al.,
2012, 2014), industrial applications (Kalliomäki et al., 2005), computer model calibration
and emulation (Kennedy and O’Hagan, 2001; Conti et al., 2009), prior elicitation (Moala
and O’Hagan, 2010) and density estimation (Tokdar and Ghosh, 2007; Tokdar et al., 2010;
Riihimäki and Vehtari, 2014) to name a few. Despite their attractive theoretical properties
and a wide range of applications, GPs pose practical challenges in implementation.
GPstuff provides several state-of-the-art inference algorithms and approximate methods that make inference easier and faster for many practically interesting models. GPstuff is a modular toolbox which combines inference algorithms and model structures in an
easily extensible format. It also provides various tools for model checking and comparison.

These are essential in making model assessment and criticism an integral part of the data
analysis. Many algorithms and models in GPstuff were proposed by others but have been
reimplemented for GPstuff. In each case, we provide a reference to the original work, but the
implementation of the algorithm in GPstuff is always unique. The basic implementations of
two important algorithms, the Laplace approximation and the expectation propagation
algorithm (discussed in Section 3), follow the pseudocode from Rasmussen and Williams (2006).
However, these have later been generalized and improved as described in (Vanhatalo et al., 2009, 2010; Vanhatalo
and Vehtari, 2010; Jylänki et al., 2011; Riihimäki et al., 2013; Riihimäki and Vehtari, 2014).
Many other toolboxes for GP modelling besides GPstuff are also freely available.
Perhaps the best known packages are nowadays the Gaussian processes for Machine Learning
GPML toolbox (Rasmussen and Nickisch, 2010) and the flexible Bayesian modelling (FBM)
toolbox by Radford Neal. A good overview of other packages can be obtained from the
Gaussian processes website http://www.gaussianprocess.org/ and the R Archive Network
http://cran.r-project.org/. Other GP software packages have some overlap with GPstuff and
some of them include models that are not implemented in GPstuff. The main advantages
of GPstuff over other GP software are its versatile collection of models and computational
tools as well as modularity which allows easy extensions. A comparison between GPstuff,
GPML and FBM is provided in Appendix A.
Three earlier GP and Bayesian modelling packages have influenced our work. The structure
of GPstuff owes most to the Netlab toolbox (Nabney, 2001), although it is far from being compatible with it. The GPstuff project was started in 2006 based on the MCMCStuff toolbox
(1998-2006) (http://becs.aalto.fi/en/research/bayes/mcmcstuff/). MCMCStuff, for
its part, was based on Netlab and was also influenced by FBM. The INLA software
package by Rue et al. (2009) has also motivated some of the technical details in the toolbox. In addition to these, some technical implementations of GPstuff rely on the sparse
matrix toolbox SuiteSparse (Davis, 2005) (http://www.cise.ufl.edu/research/sparse/
SuiteSparse/).
This work concentrates on discussing the essential theory and methods behind the implementation of GPstuff. We explain important parts of the code, but the full code demonstrating the important features of the package (including data creation and the like) is
included in the demonstration files demo_*, to which we refer in the text.

2. Gaussian process models
2.1 Gaussian process prior
A GP prior over the function f(x) implies that any set of function values f, indexed by the input
coordinates X, has a multivariate Gaussian distribution
p(f |X, θ) = N(f |0, Kf,f ),

(4)

where Kf,f is the covariance matrix. Notice that the distribution over functions will
be denoted by GP(·, ·), whereas the distribution over a finite set of latent variables will
be denoted by N(·, ·). The covariance matrix is constructed by a covariance function,
[Kf,f ]i,j = k(xi , xj |θ), which characterizes the correlation between different points in the
process. The covariance function can be chosen freely as long as the covariance matrices it produces are symmetric and positive semi-definite (v^T K_{f,f} v ≥ 0 for all v ∈ ℝ^n).

7.4.1 Weibull

The baseline hazard h_0(y) is assumed to follow the Weibull distribution, parametrized in GPstuff as

h_0(y) = r y^{r−1},      (42)
where r > 0 is the shape parameter. This follows the parametrization used in Martino et al.
(2011). The likelihood is defined as
L = \prod_{i=1}^{n} r^{1−z_i} e^{−(1−z_i) f(x_i)}\, y_i^{(1−z_i)(r−1)}\, e^{−y_i^{r} / e^{f(x_i)}},      (43)

where z is a vector of censoring indicators, with z_i = 0 for an uncensored event and z_i = 1 for
a right censored event for observation i. In the case of censoring, y_i denotes the time of censoring.
Here we present only the likelihood function because we do not have an observation model for
the censoring. Another common parameterization of the Weibull model is through the shape
parameter r and the scale parameter λ. The parameterization in GPstuff corresponds to a Weibull model with
input-dependent scale λ(x_i) = e^{f(x_i)/r}, so that λ(x_i)^r = e^{f(x_i)}. Hence the Weibull model
for an uncensored observation y_i is

p(y_i | f(x_i), r) = \frac{r}{λ(x_i)} \left( \frac{y_i}{λ(x_i)} \right)^{r−1} e^{−(y_i/λ(x_i))^r}.      (44)
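As a usage sketch only (not taken from the manual's demos; the variable names are assumptions and the exact option names should be checked against lik_weibull and the survival demos), a Weibull model can be constructed along these lines, with the censoring indicators passed to the inference functions through the likelihood's extra input 'z':

% Sketch: Weibull accelerated failure time GP (assumed variables:
% x = n x d covariates, y = n x 1 observed times, z = n x 1 censoring
% indicators with z_i = 1 for right censored, xt = test inputs).
lik = lik_weibull();
gpcf = gpcf_sexp('lengthScale', 1, 'magnSigma2', 1);
gp = gp_set('lik', lik, 'cf', gpcf, 'latent_method', 'Laplace', 'jitterSigma2', 1e-6);
opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
gp = gp_optim(gp, x, y, 'z', z, 'opt', opt);   % MAP estimate of r and the covariance parameters
[Ef, Varf] = gp_pred(gp, x, y, xt, 'z', z);    % posterior of the latent function at xt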
7.4.2 Log-Gaussian
With the Log-Gaussian survival model, the logarithms of the survival times are assumed to be normally distributed.
The likelihood is defined as

L = \prod_{i=1}^{n} (2πσ^2)^{−(1−z_i)/2}\, y_i^{−(1−z_i)} \exp\!\left( −\frac{1}{2σ^2}(1−z_i)(\log(y_i) − f(x_i))^2 \right) \left[ 1 − Φ\!\left( \frac{\log(y_i) − f(x_i)}{σ} \right) \right]^{z_i},      (45)

where σ is the scale parameter.
7.4.3 Log-logistic
The log-logistic likelihood is defined as
L = \prod_{i=1}^{n} \left( \frac{r y_i^{r−1}}{\exp(f(x_i))} \right)^{1−z_i} \left[ 1 + \left( \frac{y_i}{\exp(f(x_i))} \right)^{r} \right]^{z_i − 2},      (46)

where r is the shape parameter and z_i are the censoring indicators.

7.5 Derivative observations in GP regression
Incorporating derivative observations in GP regression is fairly straightforward, because the
derivative of a Gaussian process is also a Gaussian process. In short, derivative observations are
taken into account by extending the covariance matrices to include derivative observations. This
is done by forming joint covariance matrices of function values and derivatives. The following
equations (Rasmussen and Williams, 2006) state how the covariances between function values
and derivatives, and between derivatives, are computed:
Cov\!\left( f_i, \frac{∂f_j}{∂x_{d,j}} \right) = \frac{∂k(x_i, x_j)}{∂x_{d,j}},   \qquad
Cov\!\left( \frac{∂f_i}{∂x_{d,i}}, \frac{∂f_j}{∂x_{e,j}} \right) = \frac{∂^2 k(x_i, x_j)}{∂x_{d,i}\, ∂x_{e,j}}.

The joint covariance matrix for function values and derivatives is of the following form:

K = [ K_{ff}  K_{fD} ;  K_{Df}  K_{DD} ],

K_{ff}^{ij} = k(x_i, x_j),   \quad
K_{Df}^{ij} = \frac{∂k(x_i, x_j)}{∂x_{d,i}},   \quad
K_{fD} = (K_{Df})^T,   \quad
K_{DD}^{ij} = \frac{∂^2 k(x_i, x_j)}{∂x_{d,i}\, ∂x_{e,j}}.

Prediction is done as usual, but with derivative observations the joint covariance matrices are
used instead of the standard ones.
Using derivative observations in GPstuff requires two steps: when initializing the GP
structure, one must set the option ’derivobs’ to ’on’. The second step is to form an observation
vector of the right size: with inputs of size n × m, the observation vector with derivatives should
have length n + m · n. The observation vector is constructed by appending the partial derivative
observations after the function value observations:


y_obs = [ y(x) ;  ∂y(x)/∂x_1 ;  ... ;  ∂y(x)/∂x_m ].      (47)

A different noise level could be assumed for the function values and the derivative observations, but
at the moment the implementation allows only the same noise level for all observations. The use
of derivative observations is demonstrated in demo_derivativeobs.
The derivative process can also be used to construct a monotonicity constraint, as described
in Section 8.7.
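A minimal sketch of the two steps above (the variable names and the exact call pattern are assumptions; demo_derivativeobs contains the working version):

% Sketch: GP regression with derivative observations (assumed variables:
% x = n x m inputs, y = n x 1 function observations, dy = n x m partial
% derivative observations, xt = test inputs).
lik = lik_gaussian('sigma2', 0.1);
gpcf = gpcf_sexp('lengthScale', 0.5, 'magnSigma2', 1);
gp = gp_set('lik', lik, 'cf', gpcf, 'derivobs', 'on');   % step 1: turn derivative observations on
yobs = [y; dy(:)];   % step 2: observation vector of length n + m*n, stacked as in (47)
opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
gp = gp_optim(gp, x, yobs, 'opt', opt);
[Ef, Varf] = gp_pred(gp, x, yobs, xt);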
7.6 Quantile Regression
Quantile regression is used for estimating quantiles of the response variable as a function of
input variables (Boukouvalas et al., 2012).

The likelihood is defined as

p(y | f, σ, τ) = \frac{τ(1 − τ)}{σ} \exp\!\left( −\frac{y − f}{σ}\, (τ − I(y ≤ f)) \right),      (48)
where τ is the quantile of interest and σ is the standard deviation. I(y ≤ f ) is 1 if the
condition inside brackets is true and 0 otherwise. Because the logarithm of the likelihood
is not twice differentiable at the mode, Laplace approximation cannot be used for inference
with Quantile Regression. Quantile Regression with EP and MCMC is demonstrated in
demo_qgp with toy data.
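As a sketch only (the likelihood constructor name lik_qgp and its argument names are assumptions not confirmed by this text; demo_qgp has the actual calls), EP inference for a single quantile might be set up as:

% Sketch: GP quantile regression for the tau = 0.9 quantile with EP inference.
% The constructor name lik_qgp and its arguments are assumptions; see demo_qgp.
lik = lik_qgp('quantile', 0.9);
gpcf = gpcf_sexp('lengthScale', 1, 'magnSigma2', 1);
gp = gp_set('lik', lik, 'cf', gpcf, 'latent_method', 'EP');   % Laplace is not applicable here (see text)
opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
gp = gp_optim(gp, x, y, 'opt', opt);
[Ef, Varf] = gp_pred(gp, x, y, xt);   % posterior of the latent 0.9-quantile function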

8. Multilatent models
Multilatent models are models in which individual likelihood factors depend on
multiple latent variables. In this section we briefly summarize such models in GPstuff.
8.1 Multiclass classification
In multiclass classification problems the target variables have more than two possible class
labels, y_i ∈ {1, ..., c}, where c > 2 is the number of classes. In GPstuff, multiclass classification can be done either using the softmax likelihood (lik_softmax),

p(y_i | f_i) = \frac{\exp(f_i^{y_i})}{\sum_{j=1}^{c} \exp(f_i^j)},      (49)


where f_i = [f_i^1, ..., f_i^c]^T, or the multinomial probit likelihood (lik_multinomialprobit),

p(y_i | f_i) = E_{p(u_i)}\!\left[ \prod_{j=1, j \neq y_i}^{c} Φ(u_i + f_i^{y_i} − f_i^j) \right],      (50)

where the auxiliary variable u_i is distributed as p(u_i) = N(u_i | 0, 1), and Φ(x) denotes the
cumulative distribution function of the standard normal distribution. In the Gaussian process
literature on multiclass classification, a common assumption is to introduce c independent prior
processes that are associated with the c classes (see, e.g., Rasmussen and Williams, 2006). By
assuming zero-mean Gaussian processes for the latent functions associated with different classes,
we obtain a zero-mean Gaussian prior

p(f | X) = N(f | 0, K),      (51)

where f = [f_1^1, ..., f_n^1, f_1^2, ..., f_n^2, ..., f_1^c, ..., f_n^c]^T and K is a cn × cn block-diagonal covariance matrix with the matrices K^1, K^2, ..., K^c (each of size n × n) on its diagonal.
Inference for softmax can be made with MCMC or Laplace approximation. As described by Williams and Barber (1998) and Rasmussen and Williams (2006), using the
Laplace approximation for the softmax likelihood with the uncorrelated prior processes,
the posterior computations can be done efficiently in a way that scales linearly in c, which
is also implemented in GPstuff. Multiclass classification with softmax is demonstrated in
demo_multiclass.

Inference for multinomial probit can be made with expectation propagation by using a
nested EP approach that does not require numerical quadratures or sampling for estimation
of the tilted moments and predictive probabilities, as proposed by Riihimäki et al. (2013).
Similarly to softmax with Laplace’s method, the nested EP approach leads to low-rank
site approximations which retain all posterior couplings but result in linear computational
scaling with respect to c. Multiclass classification with the multinomial probit likelihood is
demonstrated in demo_multiclass_nested_ep.
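A sketch of the softmax route (the one-of-c coding of the class labels is an assumption here; demo_multiclass shows the exact data format and priors):

% Sketch: multiclass classification with the softmax likelihood and the Laplace
% approximation. y is assumed to be an n x c matrix of 0/1 class indicators.
lik = lik_softmax();
gpcf = gpcf_sexp('lengthScale', 1, 'magnSigma2', 1);
gp = gp_set('lik', lik, 'cf', gpcf, 'latent_method', 'Laplace', 'jitterSigma2', 1e-6);
opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
gp = gp_optim(gp, x, y, 'opt', opt);
[Ef, Varf] = gp_pred(gp, x, y, xt);   % latent values for each of the c classes at the test inputs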
8.2 Multinomial
The multinomial model (lik_multinomial) is an extension of multiclass classification to the
situation where each observation consists of counts of class observations. Let y_i =
[y_{i,1}, ..., y_{i,c}], where c > 2, be a vector of counts of class observations related to inputs x_i,
so that

y_i ∼ Multinomial([p_{i,1}, ..., p_{i,c}], n_i),      (52)

where n_i = \sum_{j=1}^{c} y_{i,j}. The probability of observing class j is

p_{i,j} = \frac{\exp(f_i^j)}{\sum_{j=1}^{c} \exp(f_i^j)},      (53)

where f_i = [f_i^1, ..., f_i^c]^T. As in multiclass classification, we can introduce c independent prior
processes that are associated with the c classes. By assuming zero-mean Gaussian processes for
the latent functions associated with different classes, we obtain a zero-mean Gaussian prior

p(f | X) = N(f | 0, K),      (54)

where f = [f_1^1, ..., f_n^1, f_1^2, ..., f_n^2, ..., f_1^c, ..., f_n^c]^T and K is a cn × cn block-diagonal covariance matrix with the matrices K^1, K^2, ..., K^c (each of size n × n) on its diagonal. This model is
demonstrated in demo_multinomial and it has been used, for example, in Juntunen et al. (2012).
8.3 Cox proportional hazard model
For the individual i, where i = 1, . . . , n, we have observed survival time yi (possibly right
censored) with censoring indicator δi , where δi = 0 if the ith observation is uncensored and
δ_i = 1 if the observation is right censored. The traditional approach to analyzing continuous
time-to-event data is to assume the Cox proportional hazard model (Cox, 1972),

h_i(t) = h_0(t) \exp(x_i^T β),      (55)

where h0 is the unspecified baseline hazard rate, xi is the d × 1 vector of covariates for the
ith patient and β is the vector of regression coefficients. The matrix X = [x1 , . . . , xn ]T of
size n × d includes all covariate observations.
The Cox model with linear predictor can be extended to more general form to enable,
for example, additive and non-linear effects of covariates (Kneib, 2006; Martino et al., 2011).
We extend the proportional hazard model by
hi (t) = exp(log(h0 (t)) + ηi (xi )),

(56)

where the linear predictor is replaced with the latent predictor ηi depending on the covariates
xi . By assuming a Gaussian process prior over η = (η1 , . . . , ηn )T , smooth nonlinear effects
of continuous covariates are possible, and if there are dependencies between covariates, GP
can model these interactions implicitly.
A piecewise log-constant baseline hazard (see, e.g. Ibrahim et al., 2001; Martino et al.,
2011) is assumed by partitioning the time axis into K intervals with equal lengths: 0 =
s0 < s1 < s2 < . . . < sK , where sK > yi for all i = 1, . . . , n. In the interval k (where
k = 1, . . . , K), hazard is assumed to be constant:
h0 (t) = λk for t ∈ (sk−1 , sk ].

(57)

For the ith individual the hazard rate in the kth time interval is then
hi (t) = exp(fk + ηi (xi )), t ∈ (sk−1 , sk ],

(58)

where fk = log(λk ). To assume smooth hazard rate functions, we place another Gaussian
process prior for f = (f1 , . . . , fK )T . We define a vector containing the mean locations of K
time intervals as τ = (τ1 , . . . , τK )T .
The likelihood contribution for the possibly right censored ith observation (yi , δi ) is
assumed to be

l_i = h_i(y_i)^{(1−δ_i)} \exp\!\left( −\int_0^{y_i} h_i(t)\, dt \right).      (59)

Using the piecewise log-constant assumption for the hazard rate function, the contribution
of observation i to the likelihood is

l_i = [λ_k \exp(η_i)]^{(1−δ_i)} \exp\!\left( −\Big[ (y_i − s_{k−1}) λ_k + \sum_{g=1}^{k−1} (s_g − s_{g−1}) λ_g \Big] \exp(η_i) \right),      (60)

where y_i ∈ (s_{k−1}, s_k] (Ibrahim et al., 2001; Martino et al., 2011). By applying Bayes'
theorem, the prior information and likelihood contributions are combined, and the posterior
distribution of the latent variables can be computed. Due to the form of the likelihood
function, the resulting posterior becomes non-Gaussian and analytically exact inference is
intractable. GPstuff supports MCMC and Laplace approximation to integrate over the latent
variables. The use of the Gaussian process Cox proportional hazard model is demonstrated in
demo_survival_coxph. The Gaussian process Cox proportional hazard model implemented
with GPstuff was used in Joensuu et al. (2012) to model the risk of gastrointestinal
stromal tumour recurrence after surgery.
8.4 Zero-inflated negative binomial
The zero-inflated negative binomial model is suitable for modelling count variables with an excessive
number of zero observations compared to the usual Poisson and negative binomial count models.
Zero-inflated negative binomial models assume a two-stage process, where the first process
determines whether the count is larger than zero, and the second process determines the
non-zero count (Mullahy, 1986). In GPstuff, these processes are assumed independent a

priori, but they become a posteriori dependent through the likelihood function

p + (1 − p) NegBin(y | y = 0),   when y = 0,      (61)
(1 − p) NegBin(y | y > 0),       when y > 0,      (62)

where the probability p is given by a binary classifier with logit likelihood. NegBin is the
Negative-binomial distribution parametrized for the i’th observation as
y | f, e, r ∼ \prod_{i=1}^{n} \frac{Γ(r + y_i)}{y_i!\, Γ(r)} \left( \frac{r}{r + μ_i} \right)^{r} \left( \frac{μ_i}{r + μ_i} \right)^{y_i},      (63)

where µi = ei exp(f (xi )) and r is the dispersion parameter governing the variance. In
GPstuff, the latent value vector f = [f_1^T f_2^T]^T has length 2N, where N is the number of
observations. The latents f 1 are associated with the classification process and the latents f 2
with the negative-binomial count process.
8.5 Density estimation and regression
The logistic Gaussian process can be used for flexible density estimation and density regression.
GPstuff includes an implementation based on the Laplace approximation (and MCMC) as described
in Riihimäki and Vehtari (2012).
The likelihood is defined as
p(x | f) = \prod_{i=1}^{n} \frac{\exp(f_i)}{\sum_{j=1}^{n} \exp(f_j)}.      (64)

For the latent function f , we assume the model f (x) = g(x) + h(x)T β, where the GP prior
g(x) is combined with the explicit basis functions h(x). Regression coefficients are denoted
with β, and by placing a Gaussian prior β ∼ N (b, B) with mean b and covariance B, the
parameters β can be integrated out from the model, which results in the following GP prior
for f :

f(x) ∼ GP( h(x)^T b, κ(x, x') + h(x)^T B h(x') ).      (65)
For the explicit basis functions, we use second-order polynomials, which leads to a GP
prior that can favour density estimates where the tails of the distribution eventually go to
zero.
We use a finite-dimensional approximation and evaluate the integral and the Gaussian
process on a grid. We approximate the inference for logistic Gaussian process density estimation
on a grid using Laplace’s method or MCMC to integrate over the non-Gaussian posterior
distribution of the latent values (see details in Riihimäki and Vehtari, 2012).
Logistic Gaussian processes are also suitable for estimating conditional densities p(t | x),
where t is a response variable. We discretize both the input and the target space in a finite region
to model the conditional densities with the logistic GP. To approximate the resulting non-Gaussian posterior distribution, we again use Laplace’s method.
1D and 2D density estimation and density regression are implemented in the function
lgpdens. The usage of the function is demonstrated in demo_lgpdens.
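For example, with default options the calls could look like this (the output argument names are assumptions; see the help of lgpdens):

% Sketch: logistic GP density estimation with lgpdens (x, x1, x2 are assumed sample vectors).
lgpdens(x);                 % 1D estimate from an n x 1 sample; without outputs the estimate is plotted
lgpdens([x1 x2]);           % 2D estimate from a two-column sample
[p, pq, xt] = lgpdens(x);   % returned density p with quantiles pq on the grid xt (assumed outputs)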

Figure 7: Density estimates for two different distributions. The figures are from
demo_lgpdens. On the left (panel (a), Galaxy data), the blue line is the density estimate
while the gray area is the 95% mean region; on the right (panel (b), Old faithful data) is a
2D density estimate with contours.

8.6 Input dependent models
In input-dependent models, more than one latent variable is used for modelling different
parameters of the distribution. The additional latent variables in turn enable modelling dependencies between the inputs and the parameters.
8.6.1 Input-dependent Noise
Input-dependent noise model (Goldberg et al., 1997) can be used to model regression data.
The model assumes Gaussian distributed data with latent variables defining both mean and
variance of the Gaussian distribution instead of just the mean like in the standard Gaussian
likelihood. By setting independent GP priors on the two latent variables, it is possible to infer
non-constant, that is input-dependent, noise in the data.
The likelihood is defined as
p(y | f^{(1)}, f^{(2)}, σ^2) = \prod_{i=1}^{n} N( y_i | f_i^{(1)}, σ^2 \exp(f_i^{(2)}) ),      (66)

with the latent function f^{(1)} defining the mean and f^{(2)} defining the variance. Input-dependent
noise is demonstrated in demo_inputdependentnoise with both heteroscedastic and constant noise.
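A sketch of the two-latent setup (the 'comp_cf' mapping of covariance functions to the two latent processes and the default constructor arguments are assumptions; demo_inputdependentnoise contains the working version):

% Sketch: heteroscedastic regression with the input-dependent noise model.
% One covariance function per latent process: f^(1) for the mean, f^(2) for the
% log noise scale. The 'comp_cf' mapping is an assumption; see demo_inputdependentnoise.
lik = lik_inputdependentnoise();
gpcf1 = gpcf_sexp('lengthScale', 1, 'magnSigma2', 1);     % prior for the mean process
gpcf2 = gpcf_sexp('lengthScale', 2, 'magnSigma2', 0.5);   % prior for the noise process
gp = gp_set('lik', lik, 'cf', {gpcf1 gpcf2}, 'comp_cf', {1 2}, ...
            'latent_method', 'Laplace', 'jitterSigma2', 1e-6);
opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
gp = gp_optim(gp, x, y, 'opt', opt);
[Ef, Varf] = gp_pred(gp, x, y, xt);   % Ef stacks the two latent processes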
8.6.2 Input-dependent overdispersed Weibull
In the input-dependent Weibull model, an additional latent variable is used for modelling the
shape parameter r of the standard Weibull model.

The likelihood is defined as
p(y | f^{(1)}, f^{(2)}, z) = \prod_{i=1}^{n} r_i^{1−z_i}\, e^{−(1−z_i) f_i^{(1)}}\, y_i^{(1−z_i)(r_i − 1)}\, e^{−y_i^{r_i} / e^{f_i^{(1)}}},      (67)

where z is a vector of censoring indicators, with z_i = 0 for an uncensored event and z_i = 1 for
a right censored event for observation i. The latent variable f^{(2)} is used for modelling the shape
parameter here. Because the dispersion parameter is constrained to be greater than 0,
it is computed as r_i = exp(f_i^{(2)}) to ensure positivity. The input-dependent Weibull model is
demonstrated in demo_inputdependentweibull.
8.7 Monotonicity constraint
Riihimäki and Vehtari (2010) presented how a monotonicity constraint for GP can be
introduced with virtual derivative observations, and the resulting posterior can be approximated with expectation propagation.
Currently GPstuff supports monotonicity constraint for models which have Gaussian
likelihood or fully factoring likelihoods with EP inference and which use covariance functions
gpcf_constant, gpcf_linear and gpcf_sexp. Riihimäki and Vehtari (2010) used type
II MAP estimate for the hyperparameters, but GPstuff allows easy integration over the
hyperparameters with any method described in Section 5.
A GP model with monotonicity constraints on one or more covariates can be constructed easily with the function gp_monotonic. The monotonicity constraint is demonstrated in
demo_monotonic and demo_monotonic2.
Figure 8 displays the predictions of models with and without the monotonicity constraint. The monotonicity constraint allows the inclusion of the sensible prior information that the death
rate increases as age increases. The effect is shown most clearly at higher ages, where there
are fewer observations, but an overall reduction in the uncertainty about the latent function
can also be seen.
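A sketch of the construction (only a call with default options is shown, since the option names of gp_monotonic are not spelled out in this text; demo_monotonic covers selecting the constrained covariates and the direction):

% Sketch: adding a monotonicity constraint via virtual derivative observations.
lik = lik_gaussian('sigma2', 0.2);
gpcf = gpcf_sexp('lengthScale', 1, 'magnSigma2', 1);
gp = gp_set('lik', lik, 'cf', gpcf);
gp = gp_monotonic(gp, x, y);          % default options; EP handles the virtual observations
[Ef, Varf] = gp_pred(gp, x, y, xt);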

9. Mean functions
In the standard GP regression a zero mean function is assumed for the prior process. This
is convenient but there are nonetheless some advantages in using a specified mean function.
The basic principle in doing GP regression with a mean function is to apply a zero mean
GP for the difference between observations and the mean function.
9.1 Explicit basis functions
Here we follow closely the presentation of Rasmussen and Williams (2006) on the subject
and briefly present the main results. A mean function can be specified as a weighted sum
of some basis functions h,

m = h(x)^T β,

with weights β. The target of modelling, the underlying process g, is assumed to be a sum
of the mean function and a zero-mean GP,

g = h(x)^T β + GP(0, K).

Figure 8: Illustration of how the monotonicity constraint on the latent function affects
the predictions (four panels: sexp; sexp + monot.; lin + sexp; lin + sexp + monot.;
deaths per 1000 as a function of age). Mortality rate data from Broffitt (1988).

By assuming a Gaussian prior for the weights, β ∼ N(b, B), the weight parameters can be
integrated out and the prior for g is another GP,

prior   g ∼ GP( h(x)^T b, K + h(x)^T B h(x) ).      (68)

The predictive equations are obtained by using the mean and covariance of the prior (68)
in the zero mean GP predictive equations (15):

E(g_*) = E(f_*) + R^T \bar{β},
Cov(g_*) = Cov(f_*) + R^T ( B^{−1} + H K_y^{−1} H^T )^{−1} R,      (69)
\bar{β} = ( B^{−1} + H K_y^{−1} H^T )^{−1} ( B^{−1} b + H K_y^{−1} y ),
R = H_* − H K_y^{−1} K_*,
H = [ h_1(x) ; h_2(x) ; ... ; h_k(x) ],   where x is a row vector.

If the prior assumptions about the weights are vague, then B^{−1} → O (O is a matrix of zeros)
and equations (68) and (69) no longer depend on b or B:

E(g_*) = E(f_*) + R^T β_v,      (70)
Cov(g_*) = Cov(f_*) + R^T ( H K_y^{−1} H^T )^{−1} R,      (71)
β_v = ( H K_y^{−1} H^T )^{−1} H K_y^{−1} y.
Corresponding to the exact and the vague prior for the basis functions' weights, there are two
versions of the marginal likelihood. With the exact prior the marginal likelihood is

log p(y | X, b, B) = −\frac{1}{2} M^T N^{−1} M − \frac{1}{2} log|K_y| − \frac{1}{2} log|B| − \frac{1}{2} log|A| − \frac{n}{2} log 2π,
M = H^T b − y,
N = K_y + H^T B H,
A = B^{−1} + H K_y^{−1} H^T,

where n is the number of observations. Its derivative with respect to the hyperparameters is

\frac{∂}{∂θ_i} log p(y | X, b, B) = \frac{1}{2} M^T N^{−1} \frac{∂K_y}{∂θ_i} N^{−1} M − \frac{1}{2} tr\!\left( K_y^{−1} \frac{∂K_y}{∂θ_i} \right) − \frac{1}{2} tr\!\left( A^{−1} \frac{∂A}{∂θ_i} \right),
\frac{∂A}{∂θ_i} = −H K_y^{−1} \frac{∂K_y}{∂θ_i} K_y^{−1} H^T.

With a vague prior the marginal likelihood is

log p(y | X) = −\frac{1}{2} y^T K_y^{−1} y + \frac{1}{2} y^T C y − \frac{1}{2} log|K_y| − \frac{1}{2} log|A_v| − \frac{n − m}{2} log 2π,
A_v = H K_y^{−1} H^T,
C = K_y^{−1} H^T A_v^{−1} H K_y^{−1},

where m is the rank of H^T. Its derivative is

\frac{∂}{∂θ_i} log p(y | X) = \frac{1}{2} y^T P y + \frac{1}{2} ( −y^T P G − G^T P y + G^T P G ) − \frac{1}{2} tr\!\left( K_y^{−1} \frac{∂K_y}{∂θ_i} \right) − \frac{1}{2} tr\!\left( A_v^{−1} \frac{∂A_v}{∂θ_i} \right),      (72)
P = K_y^{−1} \frac{∂K_y}{∂θ_i} K_y^{−1},
G = H^T A_v^{−1} H K_y^{−1} y,

where we have used the fact that the matrices K_y^{−1}, ∂K_y/∂θ_i and A_v are symmetric. The above
expression (72) could be simplified a little further because y^T P G + G^T P y = 2 y^T P G.
The use of mean functions is demonstrated in demo_regression_meanf.
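A sketch of specifying explicit basis functions (the gpmf_* constructors and gp_set's 'meanf' option are assumed here; demo_regression_meanf has the working version, and the weight priors are left at their defaults):

% Sketch: GP regression with an explicit mean function built from constant,
% linear and squared basis functions.
gpmfs = {gpmf_constant() gpmf_linear() gpmf_squared()};
lik = lik_gaussian('sigma2', 0.2);
gpcf = gpcf_sexp('lengthScale', 1, 'magnSigma2', 1);
gp = gp_set('lik', lik, 'cf', gpcf, 'meanf', gpmfs);
opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
gp = gp_optim(gp, x, y, 'opt', opt);
[Ef, Varf] = gp_pred(gp, x, y, xt);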

10. Sparse Gaussian processes
The evaluation of the inverse and determinant of the covariance matrix in the log marginal
likelihood (or its approximation) and its gradient scale as O(n^3) in time. This restricts the
implementation of GP models to moderately sized data sets. However, the unfavorable scaling
in computational time can be alleviated with sparse approximations to GPs and with compactly
supported (CS) covariance functions. The sparse approximations scale as O(nm^2), where
m < n, and the CS covariance functions lead to computations which scale, in general,
as O(n^3) but with a smaller constant than with traditional globally supported covariance
functions. Many sparse GPs have been proposed in the literature and some of them are built into
GPstuff.
10.1 Compactly supported covariance functions
A CS covariance function gives zero correlation between data points whose distance exceeds
a certain threshold. This leads to a sparse covariance matrix. The challenge in constructing CS covariance functions is to guarantee their positive definiteness, and much literature
has been devoted to the subject (see e.g. Sansò and Schuh, 1987; Wu, 1995; Wendland,
1995; Gaspari and Cohn, 1999; Gneiting, 1999, 2002; Buhmann, 2001). The CS functions implemented in GPstuff are Wendland’s piecewise polynomials k_{pp,q} (Wendland, 2005), such
as

k_{pp,2} = \frac{σ_{pp}^2}{3} (1 − r)_+^{j+2} \big( (j^2 + 4j + 3) r^2 + (3j + 6) r + 3 \big),      (73)

where j = ⌊d/2⌋ + 3. These functions correspond to processes that are q times mean square
differentiable and are positive definite up to the input dimension d. Thus, the degree of
the polynomial has to be increased alongside the input dimension. The dependence of CS
covariance functions on the input dimension is fundamental: there are no radial CS
functions that are positive definite on every ℝ^d.
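As a sketch, a compactly supported covariance function is used just like a globally supported one; only the covariance constructor changes (whether the constructor requires the input dimension, e.g. via an 'nin' argument, should be checked against the GPstuff help):

% Sketch: GP regression with Wendland's piecewise polynomial covariance (q = 2),
% which produces a sparse covariance matrix. The 'nin' argument is an assumption.
gpcf = gpcf_ppcs2('nin', size(x,2), 'lengthScale', 2, 'magnSigma2', 1);
lik = lik_gaussian('sigma2', 0.2);
gp = gp_set('lik', lik, 'cf', gpcf);
opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
gp = gp_optim(gp, x, y, 'opt', opt);
[Ef, Varf] = gp_pred(gp, x, y, xt);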