Bayesian Modeling with Gaussian Processes using the GPstuff
Toolbox
Jarno Vanhatalo∗

jarno.vanhatalo@helsinki.fi

Department of Environmental Sciences
University of Helsinki
P.O. Box 65, FI-00014 Helsinki, Finland

Jaakko Riihimäki†
Jouni Hartikainen†
Pasi Jylänki†
Ville Tolvanen†
Aki Vehtari‡

aki.vehtari@aalto.fi
Helsinki Institute for Information Technology HIIT, Department of Computer Science
Aalto University School of Science
P.O. Box 15400, FI-00076 Aalto, Finland

Abstract
Gaussian processes (GPs) are powerful tools for probabilistic modeling. They
can be used to define prior distributions over latent functions in hierarchical Bayesian
models. The prior over functions is defined implicitly by the mean and covariance function,
which determine the smoothness and variability of the function. The inference can then
be conducted directly in the function space by evaluating or approximating the posterior
process. Despite their attractive theoretical properties GPs provide practical challenges
in their implementation. GPstuff is a versatile collection of computational tools for GP
models compatible with Linux and Windows MATLAB and Octave. It includes, among
others, various inference methods, sparse approximations and tools for model assessment.
In this work, we review these tools and demonstrate the use of GPstuff in several models.
Last updated 2015-10-12.
Keywords: Gaussian process, Bayesian hierarchical model, nonparametric Bayes

∗. Work done mainly while at the Department of Biomedical Engineering and Computational Science, Aalto
University
†. No longer at Aalto University
‡. Corresponding author

Contents
1 Introduction
2 Gaussian process models
   2.1 Gaussian process prior
   2.2 Conditioning on the observations
3 Conditional posterior and predictive distributions
   3.1 Gaussian observation model: the analytically tractable case
   3.2 Laplace approximation
   3.3 Expectation propagation algorithm
   3.4 Markov chain Monte Carlo
4 Marginal likelihood given parameters
5 Marginalization over parameters
   5.1 Maximum a posteriori estimate of parameters
   5.2 Grid integration
   5.3 Monte Carlo integration
   5.4 Central composite design
6 Getting started with GPstuff: regression and classification
   6.1 Gaussian process regression
      6.1.1 Constructing the model
      6.1.2 MAP estimate for the parameters
      6.1.3 Marginalization over parameters with grid integration
      6.1.4 Marginalization over parameters with MCMC
   6.2 Gaussian process classification
      6.2.1 Constructing the model
      6.2.2 Inference with Laplace approximation
      6.2.3 Inference with expectation propagation
      6.2.4 Inference with MCMC
      6.2.5 Marginal posterior corrections
7 Other single latent models
   7.1 Robust regression
      7.1.1 Student-t observation model
      7.1.2 Laplace observation model
   7.2 Count data
      7.2.1 Poisson
      7.2.2 Negative binomial
      7.2.3 Binomial
      7.2.4 Hurdle model
   7.3 Log-Gaussian Cox process
   7.4 Accelerated failure time survival models
      7.4.1 Weibull
      7.4.2 Log-Gaussian
      7.4.3 Log-logistic
   7.5 Derivative observations in GP regression
   7.6 Quantile Regression
8 Multilatent models
   8.1 Multiclass classification
   8.2 Multinomial
   8.3 Cox proportional hazard model
   8.4 Zero-inflated negative binomial
   8.5 Density estimation and regression
   8.6 Input dependent models
      8.6.1 Input-dependent Noise
      8.6.2 Input-dependent overdispersed Weibull
   8.7 Monotonicity constraint
9 Mean functions
   9.1 Explicit basis functions
10 Sparse Gaussian processes
   10.1 Compactly supported covariance functions
      10.1.1 Compactly supported covariance functions in GPstuff
   10.2 FIC and PIC sparse approximations
      10.2.1 FIC sparse approximation in GPstuff
      10.2.2 PIC sparse approximation in GPstuff
   10.3 Deterministic training conditional, subset of regressors and variational sparse approximation
      10.3.1 Variational, DTC and SOR sparse approximation in GPstuff
   10.4 Sparse GP models with non-Gaussian likelihoods
   10.5 Predictive active set selection for classification
11 Modifying the covariance functions
   11.1 Additive models
      11.1.1 Additive models in GPstuff
   11.2 Additive covariance functions with selected variables
   11.3 Product of covariance functions
12 State space inference
13 Model assessment and comparison
   13.1 Marginal likelihood
   13.2 Cross-validation
   13.3 DIC
   13.4 WAIC
   13.5 Model assessment demos
14 Bayesian optimization
15 Adding new features in the toolbox
16 Discussion
A Comparison of features in GPstuff, GPML and FBM
B Covariance functions
C Observation models
D Priors
E Transformation of hyperparameters
F Developer appendix
   F.1 lik functions
      F.1.1 lik_negbin
      F.1.2 lik_t
   F.2 gpcf functions
      F.2.1 gpcf_sexp
   F.3 prior functions
   F.4 Optimisation functions
   F.5 Other inference methods

1. Introduction
This work reviews GPstuff (version 4.6), a free open-source toolbox that implements a
collection of inference methods and tools for various Gaussian process (GP) models.
The toolbox is compatible with Unix and Windows MATLAB (Mathworks, 2010)
(r2009b or later; earlier versions may work but have not been tested extensively) and Octave
(Octave community, 2012) (3.6.4 or later; compactly supported covariance functions are not
currently working in Octave). It is available from http://becs.aalto.fi/en/research/
bayes/gpstuff/. If you find GPstuff useful, please use the reference (Vanhatalo et al., 2013)
Jarno Vanhatalo, Jaakko Riihimäki, Jouni Hartikainen, Pasi Jylänki, Ville Tolvanen
and Aki Vehtari (2013). GPstuff: Bayesian Modeling with Gaussian Processes. Journal of Machine Learning Research, 14:1175-1179
and appropriate references mentioned in this text or in the GPstuff functions you are using.
A GP is a stochastic process which provides a powerful tool for probabilistic inference
directly on distributions over functions (e.g. O’Hagan, 1978) and which has gained much
attention in recent years (Rasmussen and Williams, 2006). In many practical GP models,
the observations y = [y_1, ..., y_n]^T related to inputs (covariates) X = {x_i = [x_{i,1}, ..., x_{i,d}]^T}_{i=1}^n
are assumed to be conditionally independent given a latent function (or predictor) f(x), so
that the likelihood p(y | f) = \prod_{i=1}^{n} p(y_i | f_i), where f = [f(x_1), ..., f(x_n)]^T, factorizes over
cases. A GP prior is set for the latent function, after which the posterior p(f | y, X) is solved

and used for prediction. The GP defines the prior over the latent function implicitly through the
mean and covariance function, which determine, for example, the smoothness and variability
of the latent function. Usually, the model hierarchy is extended to a third level by also
giving priors for the parameters of the covariance function and the observation model.
Most of the models in GPstuff follow the above definition and can be summarized as:
Observation model:   y | f, φ ∼ \prod_{i=1}^{n} p(y_i | f_i, φ)      (1)
GP prior:            f(x) | θ ∼ GP( m(x), k(x, x' | θ) )      (2)
hyperprior:          θ, φ ∼ p(θ) p(φ).      (3)

Here θ denotes the parameters in the covariance function k(x, x'|θ), and φ the parameters in
the observation model. We will call the function value f(x) at a fixed x a latent variable, and for
brevity we sometimes denote all the covariance function and observation model parameters
by ϑ = [θ, φ]. For simplicity of presentation the mean function is considered zero,
m(x) ≡ 0, throughout this paper. We will also denote both the observation model and the
likelihood by p(y | f, φ) and assume the reader is able to distinguish between the two from
the context. The likelihood naming convention is used in the toolbox for both likelihood
and observation model related functionality; this follows the naming convention used, for
example, in the INLA (Rue et al., 2009) and GPML (Rasmussen and Nickisch, 2010) software
packages. There are also models with multiple latent processes and likelihoods where each
factor depends on multiple latent values. These are discussed in Section 8.
Early examples of GP models can be found, for example, in time series analysis and
filtering (Wiener, 1949) and in geostatistics (e.g. Matheron, 1973). GPs are still widely used
in these fields and useful overviews are provided in (Cressie, 1993; Grewal and Andrews,
2001; Diggle and Ribeiro, 2007; Gelfand et al., 2010). O’Hagan (1978) was one of the first
to consider GPs in a general probabilistic modeling context and provided a general theory of
GP prediction. This general regression framework was later rediscovered as an alternative
to neural network models (Williams and Rasmussen, 1996; Rasmussen, 1996) and extended
to problems other than regression (Neal, 1997; Williams and Barber, 1998). This machine
learning perspective is comprehensively summarized by Rasmussen and Williams (2006).
Nowadays, GPs are used, for example, in weather forecasting (Fuentes and Raftery,
2005; Berrocal et al., 2010), spatial statistics (Best et al., 2005; Kaufman et al., 2008;
Banerjee et al., 2008; Myllymäki et al., 2014), computational biology (Honkela et al., 2011),
reinforcement learning (Deisenroth et al., 2009, 2011), healthcare applications (Stegle et al.,
2008; Vanhatalo et al., 2010; Rantonen et al., 2012, 2014), survival analysis (Joensuu et al.,
2012, 2014), industrial applications (Kalliomäki et al., 2005), computer model calibration
and emulation (Kennedy and O’Hagan, 2001; Conti et al., 2009), prior elicitation (Moala
and O’Hagan, 2010) and density estimation (Tokdar and Ghosh, 2007; Tokdar et al., 2010;
Riihimäki and Vehtari, 2014) to name a few. Despite their attractive theoretical properties
and a wide range of applications, GPs pose practical challenges in implementation.
GPstuff provides several state-of-the-art inference algorithms and approximate methods that make inference easier and faster for many practically interesting models. GPstuff is a modular toolbox which combines inference algorithms and model structures in an
easily extensible format. It also provides various tools for model checking and comparison.

These are essential in making model assessment and criticism an integral part of the data
analysis. Many algorithms and models in GPstuff were proposed by others but have been
reimplemented for GPstuff. In each case, we provide a reference to the original work, but the
implementation of the algorithm in GPstuff is always unique. The basic implementations of
two important algorithms, the Laplace approximation and the expectation propagation
algorithm (discussed in Section 3), follow the pseudocode from Rasmussen and Williams (2006).
However, these have later been generalized and improved as described in (Vanhatalo et al., 2009, 2010; Vanhatalo
and Vehtari, 2010; Jylänki et al., 2011; Riihimäki et al., 2013; Riihimäki and Vehtari, 2014).
Many other toolboxes for GP modelling besides GPstuff are also freely available.
Perhaps the best known packages are nowadays the Gaussian processes for Machine Learning
GPML toolbox (Rasmussen and Nickisch, 2010) and the flexible Bayesian modelling (FBM)
toolbox by Radford Neal. A good overview of other packages can be obtained from the
Gaussian processes website http://www.gaussianprocess.org/ and the R Archive Network
http://cran.r-project.org/. Other GP software packages have some overlap with GPstuff and
some of them include models that are not implemented in GPstuff. The main advantages
of GPstuff over other GP software are its versatile collection of models and computational
tools as well as modularity which allows easy extensions. A comparison between GPstuff,
GPML and FBM is provided in Appendix A.
Three earlier GP and Bayesian modelling packages have influenced our work. The structure
of GPstuff owes most to the Netlab toolbox (Nabney, 2001), although it is far from being compatible with it. The GPstuff project was started in 2006 based on the MCMCStuff toolbox
(1998-2006) (http://becs.aalto.fi/en/research/bayes/mcmcstuff/). MCMCStuff, for
its part, was based on Netlab and was also influenced by FBM. The INLA software
package by Rue et al. (2009) has also motivated some of the technical details in the toolbox. In addition to these, some technical implementations of GPstuff rely on the sparse
matrix toolbox SuiteSparse (Davis, 2005) (http://www.cise.ufl.edu/research/sparse/
SuiteSparse/).
This work concentrates on discussing the essential theory and methods behind the implementation of GPstuff. We explain important parts of the code, but the full code demonstrating the important features of the package (including data creation and the like) is
included in the demonstration files demo_*, to which we refer in the text.

2. Gaussian process models
2.1 Gaussian process prior
A GP prior over the function f(x) implies that any set of function values f, indexed by the input
coordinates X, has a multivariate Gaussian distribution
p(f |X, θ) = N(f |0, Kf,f ),

(4)

where Kf,f is the covariance matrix. Notice that the distribution over functions will
be denoted by GP(·, ·), whereas the distribution over a finite set of latent variables will
be denoted by N(·, ·). The covariance matrix is constructed by a covariance function,
[Kf,f ]i,j = k(xi , xj |θ), which characterizes the correlation between different points in the
process. The covariance function can be chosen freely as long as the covariance matrices it produces are symmetric and positive semi-definite (v^T K_{f,f} v ≥ 0 for all v ∈ ℝ^n).

7.4.1 Weibull

The baseline hazard h_0(y) is assumed to follow the Weibull distribution, parametrized in GPstuff as

h_0(y) = r y^{r−1},      (42)
where r > 0 is the shape parameter. This follows the parametrization used in Martino et al.
(2011). The likelihood is defined as
L = \prod_{i=1}^{n} r^{1−z_i} e^{−(1−z_i) f(x_i)}\, y_i^{(1−z_i)(r−1)}\, e^{−y_i^{r} / e^{f(x_i)}},      (43)

where z is a vector of censoring indicators, with z_i = 0 for an uncensored event and z_i = 1 for
a right censored event for observation i. In the case of censoring, y_i denotes the time of censoring.
Here we present only the likelihood function because we do not have an observation model for
the censoring. Another common parameterization of the Weibull model is through the shape
parameter r and the scale parameter λ. The parameterization in GPstuff corresponds to a Weibull model with
input-dependent scale λ(x_i) = e^{f(x_i)/r}, so that λ(x_i)^r = e^{f(x_i)}. Hence the Weibull model
for an uncensored observation y_i is

p(y_i | f(x_i), r) = \frac{r}{λ(x_i)} \left( \frac{y_i}{λ(x_i)} \right)^{r−1} e^{−(y_i/λ(x_i))^r}.      (44)
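As a usage sketch only (not taken from the manual's demos; the variable names are assumptions and the exact option names should be checked against lik_weibull and the survival demos), a Weibull model can be constructed along these lines, with the censoring indicators passed to the inference functions through the likelihood's extra input 'z':

% Sketch: Weibull accelerated failure time GP (assumed variables:
% x = n x d covariates, y = n x 1 observed times, z = n x 1 censoring
% indicators with z_i = 1 for right censored, xt = test inputs).
lik = lik_weibull();
gpcf = gpcf_sexp('lengthScale', 1, 'magnSigma2', 1);
gp = gp_set('lik', lik, 'cf', gpcf, 'latent_method', 'Laplace', 'jitterSigma2', 1e-6);
opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
gp = gp_optim(gp, x, y, 'z', z, 'opt', opt);   % MAP estimate of r and the covariance parameters
[Ef, Varf] = gp_pred(gp, x, y, xt, 'z', z);    % posterior of the latent function at xt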
7.4.2 Log-Gaussian
With the Log-Gaussian survival model, the logarithms of the survival times are assumed to be normally distributed.
The likelihood is defined as

L = \prod_{i=1}^{n} (2πσ^2)^{−(1−z_i)/2}\, y_i^{−(1−z_i)} \exp\!\left( −\frac{1}{2σ^2}(1−z_i)(\log(y_i) − f(x_i))^2 \right) \left[ 1 − Φ\!\left( \frac{\log(y_i) − f(x_i)}{σ} \right) \right]^{z_i},      (45)

where σ is the scale parameter.
7.4.3 Log-logistic
The log-logistic likelihood is defined as
L = \prod_{i=1}^{n} \left( \frac{r y_i^{r−1}}{\exp(f(x_i))} \right)^{1−z_i} \left[ 1 + \left( \frac{y_i}{\exp(f(x_i))} \right)^{r} \right]^{z_i − 2},      (46)

where r is the shape parameter and z_i are the censoring indicators.

7.5 Derivative observations in GP regression
Incorporating derivative observations in GP regression is fairly straightforward, because the
derivative of a Gaussian process is also a Gaussian process. In short, derivative observations are
taken into account by extending the covariance matrices to include derivative observations. This
is done by forming joint covariance matrices of function values and derivatives. The following
equations (Rasmussen and Williams, 2006) state how the covariances between function values
and derivatives, and between derivatives, are computed:
Cov\!\left( f_i, \frac{∂f_j}{∂x_{d,j}} \right) = \frac{∂k(x_i, x_j)}{∂x_{d,j}},   \qquad
Cov\!\left( \frac{∂f_i}{∂x_{d,i}}, \frac{∂f_j}{∂x_{e,j}} \right) = \frac{∂^2 k(x_i, x_j)}{∂x_{d,i}\, ∂x_{e,j}}.

The joint covariance matrix for function values and derivatives is of the following form:

K = [ K_{ff}  K_{fD} ;  K_{Df}  K_{DD} ],

K_{ff}^{ij} = k(x_i, x_j),   \quad
K_{Df}^{ij} = \frac{∂k(x_i, x_j)}{∂x_{d,i}},   \quad
K_{fD} = (K_{Df})^T,   \quad
K_{DD}^{ij} = \frac{∂^2 k(x_i, x_j)}{∂x_{d,i}\, ∂x_{e,j}}.

Prediction is done as usual, but with derivative observations the joint covariance matrices are
used instead of the standard ones.
Using derivative observations in GPstuff requires two steps: when initializing the GP
structure, one must set the option ’derivobs’ to ’on’. The second step is to form an observation
vector of the right size: with inputs of size n × m, the observation vector with derivatives should
have length n + m · n. The observation vector is constructed by appending the partial derivative
observations after the function value observations:


y_obs = [ y(x) ;  ∂y(x)/∂x_1 ;  ... ;  ∂y(x)/∂x_m ].      (47)

A different noise level could be assumed for the function values and the derivative observations, but
at the moment the implementation allows only the same noise level for all observations. The use
of derivative observations is demonstrated in demo_derivativeobs.
The derivative process can also be used to construct a monotonicity constraint, as described
in Section 8.7.
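A minimal sketch of the two steps above (the variable names and the exact call pattern are assumptions; demo_derivativeobs contains the working version):

% Sketch: GP regression with derivative observations (assumed variables:
% x = n x m inputs, y = n x 1 function observations, dy = n x m partial
% derivative observations, xt = test inputs).
lik = lik_gaussian('sigma2', 0.1);
gpcf = gpcf_sexp('lengthScale', 0.5, 'magnSigma2', 1);
gp = gp_set('lik', lik, 'cf', gpcf, 'derivobs', 'on');   % step 1: turn derivative observations on
yobs = [y; dy(:)];   % step 2: observation vector of length n + m*n, stacked as in (47)
opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
gp = gp_optim(gp, x, yobs, 'opt', opt);
[Ef, Varf] = gp_pred(gp, x, yobs, xt);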
7.6 Quantile Regression
Quantile regression is used for estimating quantiles of the response variable as a function of
input variables (Boukouvalas et al., 2012).

The likelihood is defined as

p(y | f, σ, τ) = \frac{τ(1 − τ)}{σ} \exp\!\left( −\frac{y − f}{σ}\, (τ − I(y ≤ f)) \right),      (48)
where τ is the quantile of interest and σ is the standard deviation. I(y ≤ f ) is 1 if the
condition inside brackets is true and 0 otherwise. Because the logarithm of the likelihood
is not twice differentiable at the mode, Laplace approximation cannot be used for inference
with Quantile Regression. Quantile Regression with EP and MCMC is demonstrated in
demo_qgp with toy data.
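As a sketch only (the likelihood constructor name lik_qgp and its argument names are assumptions not confirmed by this text; demo_qgp has the actual calls), EP inference for a single quantile might be set up as:

% Sketch: GP quantile regression for the tau = 0.9 quantile with EP inference.
% The constructor name lik_qgp and its arguments are assumptions; see demo_qgp.
lik = lik_qgp('quantile', 0.9);
gpcf = gpcf_sexp('lengthScale', 1, 'magnSigma2', 1);
gp = gp_set('lik', lik, 'cf', gpcf, 'latent_method', 'EP');   % Laplace is not applicable here (see text)
opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
gp = gp_optim(gp, x, y, 'opt', opt);
[Ef, Varf] = gp_pred(gp, x, y, xt);   % posterior of the latent 0.9-quantile function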

8. Multilatent models
Multilatent models are models in which individual likelihood factors depend on
multiple latent variables. In this section we briefly summarize such models in GPstuff.
8.1 Multiclass classification
In multiclass classification problems the target variables have more than two possible class
labels, y_i ∈ {1, ..., c}, where c > 2 is the number of classes. In GPstuff, multiclass classification can be done either using the softmax likelihood (lik_softmax),

p(y_i | f_i) = \frac{\exp(f_i^{y_i})}{\sum_{j=1}^{c} \exp(f_i^j)},      (49)


where f_i = [f_i^1, ..., f_i^c]^T, or the multinomial probit likelihood (lik_multinomialprobit),

p(y_i | f_i) = E_{p(u_i)}\!\left[ \prod_{j=1, j \neq y_i}^{c} Φ(u_i + f_i^{y_i} − f_i^j) \right],      (50)

where the auxiliary variable u_i is distributed as p(u_i) = N(u_i | 0, 1), and Φ(x) denotes the
cumulative distribution function of the standard normal distribution. In the Gaussian process
literature on multiclass classification, a common assumption is to introduce c independent prior
processes that are associated with the c classes (see, e.g., Rasmussen and Williams, 2006). By
assuming zero-mean Gaussian processes for the latent functions associated with different classes,
we obtain a zero-mean Gaussian prior

p(f | X) = N(f | 0, K),      (51)

where f = [f_1^1, ..., f_n^1, f_1^2, ..., f_n^2, ..., f_1^c, ..., f_n^c]^T and K is a cn × cn block-diagonal covariance matrix with the matrices K^1, K^2, ..., K^c (each of size n × n) on its diagonal.
Inference for softmax can be made with MCMC or Laplace approximation. As described by Williams and Barber (1998) and Rasmussen and Williams (2006), using the
Laplace approximation for the softmax likelihood with the uncorrelated prior processes,
the posterior computations can be done efficiently in a way that scales linearly in c, which
is also implemented in GPstuff. Multiclass classification with softmax is demonstrated in
demo_multiclass.

Inference for multinomial probit can be made with expectation propagation by using a
nested EP approach that does not require numerical quadratures or sampling for estimation
of the tilted moments and predictive probabilities, as proposed by Riihimäki et al. (2013).
Similarly to softmax with Laplace’s method, the nested EP approach leads to low-rank
site approximations which retain all posterior couplings but result in linear computational
scaling with respect to c. Multiclass classification with the multinomial probit likelihood is
demonstrated in demo_multiclass_nested_ep.
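A sketch of the softmax route (the one-of-c coding of the class labels is an assumption here; demo_multiclass shows the exact data format and priors):

% Sketch: multiclass classification with the softmax likelihood and the Laplace
% approximation. y is assumed to be an n x c matrix of 0/1 class indicators.
lik = lik_softmax();
gpcf = gpcf_sexp('lengthScale', 1, 'magnSigma2', 1);
gp = gp_set('lik', lik, 'cf', gpcf, 'latent_method', 'Laplace', 'jitterSigma2', 1e-6);
opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
gp = gp_optim(gp, x, y, 'opt', opt);
[Ef, Varf] = gp_pred(gp, x, y, xt);   % latent values for each of the c classes at the test inputs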
8.2 Multinomial
The multinomial model (lik_multinomial) is an extension of multiclass classification to the
situation where each observation consists of counts of class observations. Let y_i =
[y_{i,1}, ..., y_{i,c}], where c > 2, be a vector of counts of class observations related to inputs x_i,
so that

y_i ∼ Multinomial([p_{i,1}, ..., p_{i,c}], n_i),      (52)

where n_i = \sum_{j=1}^{c} y_{i,j}. The probability of observing class j is

p_{i,j} = \frac{\exp(f_i^j)}{\sum_{j=1}^{c} \exp(f_i^j)},      (53)

where f_i = [f_i^1, ..., f_i^c]^T. As in multiclass classification, we can introduce c independent prior
processes that are associated with the c classes. By assuming zero-mean Gaussian processes for
the latent functions associated with different classes, we obtain a zero-mean Gaussian prior

p(f | X) = N(f | 0, K),      (54)

where f = [f_1^1, ..., f_n^1, f_1^2, ..., f_n^2, ..., f_1^c, ..., f_n^c]^T and K is a cn × cn block-diagonal covariance matrix with the matrices K^1, K^2, ..., K^c (each of size n × n) on its diagonal. This model is
demonstrated in demo_multinomial and it has been used, for example, in Juntunen et al. (2012).
8.3 Cox proportional hazard model
For the individual i, where i = 1, . . . , n, we have observed survival time yi (possibly right
censored) with censoring indicator δi , where δi = 0 if the ith observation is uncensored and
δ_i = 1 if the observation is right censored. The traditional approach to analyzing continuous
time-to-event data is to assume the Cox proportional hazard model (Cox, 1972),

h_i(t) = h_0(t) \exp(x_i^T β),      (55)

where h0 is the unspecified baseline hazard rate, xi is the d × 1 vector of covariates for the
ith patient and β is the vector of regression coefficients. The matrix X = [x1 , . . . , xn ]T of
size n × d includes all covariate observations.
The Cox model with linear predictor can be extended to more general form to enable,
for example, additive and non-linear effects of covariates (Kneib, 2006; Martino et al., 2011).
We extend the proportional hazard model by
hi (t) = exp(log(h0 (t)) + ηi (xi )),

(56)

where the linear predictor is replaced with the latent predictor ηi depending on the covariates
xi . By assuming a Gaussian process prior over η = (η1 , . . . , ηn )T , smooth nonlinear effects
of continuous covariates are possible, and if there are dependencies between covariates, GP
can model these interactions implicitly.
A piecewise log-constant baseline hazard (see, e.g. Ibrahim et al., 2001; Martino et al.,
2011) is assumed by partitioning the time axis into K intervals with equal lengths: 0 =
s0 < s1 < s2 < . . . < sK , where sK > yi for all i = 1, . . . , n. In the interval k (where
k = 1, . . . , K), hazard is assumed to be constant:
h0 (t) = λk for t ∈ (sk−1 , sk ].

(57)

For the ith individual the hazard rate in the kth time interval is then
hi (t) = exp(fk + ηi (xi )), t ∈ (sk−1 , sk ],

(58)

where fk = log(λk ). To assume smooth hazard rate functions, we place another Gaussian
process prior for f = (f1 , . . . , fK )T . We define a vector containing the mean locations of K
time intervals as τ = (τ1 , . . . , τK )T .
The likelihood contribution for the possibly right censored ith observation (yi , δi ) is
assumed to be

l_i = h_i(y_i)^{(1−δ_i)} \exp\!\left( −\int_0^{y_i} h_i(t)\, dt \right).      (59)

Using the piecewise log-constant assumption for the hazard rate function, the contribution
of observation i to the likelihood is

l_i = [λ_k \exp(η_i)]^{(1−δ_i)} \exp\!\left( −\Big[ (y_i − s_{k−1}) λ_k + \sum_{g=1}^{k−1} (s_g − s_{g−1}) λ_g \Big] \exp(η_i) \right),      (60)

where y_i ∈ (s_{k−1}, s_k] (Ibrahim et al., 2001; Martino et al., 2011). By applying Bayes'
theorem, the prior information and likelihood contributions are combined, and the posterior
distribution of the latent variables can be computed. Due to the form of the likelihood
function, the resulting posterior becomes non-Gaussian and analytically exact inference is
intractable. GPstuff supports MCMC and Laplace approximation to integrate over the latent
variables. The use of the Gaussian process Cox proportional hazard model is demonstrated in
demo_survival_coxph. The Gaussian process Cox proportional hazard model implemented
with GPstuff was used in Joensuu et al. (2012) to model the risk of gastrointestinal
stromal tumour recurrence after surgery.
8.4 Zero-inflated negative binomial
The zero-inflated negative binomial model is suitable for modelling count variables with an excessive
number of zero observations compared to the usual Poisson and negative binomial count models.
Zero-inflated negative binomial models assume a two-stage process, where the first process
determines whether the count is larger than zero, and the second process determines the
non-zero count (Mullahy, 1986). In GPstuff, these processes are assumed independent a

priori, but they become a posteriori dependent through the likelihood function

p + (1 − p) NegBin(y | y = 0),   when y = 0,      (61)
(1 − p) NegBin(y | y > 0),       when y > 0,      (62)

where the probability p is given by a binary classifier with logit likelihood. NegBin is the
Negative-binomial distribution parametrized for the i’th observation as
y | f, e, r ∼ \prod_{i=1}^{n} \frac{Γ(r + y_i)}{y_i!\, Γ(r)} \left( \frac{r}{r + μ_i} \right)^{r} \left( \frac{μ_i}{r + μ_i} \right)^{y_i},      (63)

where µi = ei exp(f (xi )) and r is the dispersion parameter governing the variance. In
GPstuff, the latent value vector f = [f_1^T f_2^T]^T has length 2N, where N is the number of
observations. The latents f 1 are associated with the classification process and the latents f 2
with the negative-binomial count process.
8.5 Density estimation and regression
The logistic Gaussian process can be used for flexible density estimation and density regression.
GPstuff includes an implementation based on the Laplace approximation (and MCMC) as described
in Riihimäki and Vehtari (2012).
The likelihood is defined as
p(x | f) = \prod_{i=1}^{n} \frac{\exp(f_i)}{\sum_{j=1}^{n} \exp(f_j)}.      (64)

For the latent function f , we assume the model f (x) = g(x) + h(x)T β, where the GP prior
g(x) is combined with the explicit basis functions h(x). Regression coefficients are denoted
with β, and by placing a Gaussian prior β ∼ N (b, B) with mean b and covariance B, the
parameters β can be integrated out from the model, which results in the following GP prior
for f :

f(x) ∼ GP( h(x)^T b, κ(x, x') + h(x)^T B h(x') ).      (65)
For the explicit basis functions, we use second-order polynomials, which leads to a GP
prior that can favour density estimates where the tails of the distribution eventually go to
zero.
We use a finite-dimensional approximation and evaluate the integral and the Gaussian
process on a grid. We approximate the inference for logistic Gaussian process density estimation
on a grid using Laplace’s method or MCMC to integrate over the non-Gaussian posterior
distribution of the latent values (see details in Riihimäki and Vehtari, 2012).
Logistic Gaussian processes are also suitable for estimating conditional densities p(t | x),
where t is a response variable. We discretize both the input and the target space in a finite region
to model the conditional densities with the logistic GP. To approximate the resulting non-Gaussian posterior distribution, we again use Laplace’s method.
1D and 2D density estimation and density regression are implemented in the function
lgpdens. The usage of the function is demonstrated in demo_lgpdens.
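For example, with default options the calls could look like this (the output argument names are assumptions; see the help of lgpdens):

% Sketch: logistic GP density estimation with lgpdens (x, x1, x2 are assumed sample vectors).
lgpdens(x);                 % 1D estimate from an n x 1 sample; without outputs the estimate is plotted
lgpdens([x1 x2]);           % 2D estimate from a two-column sample
[p, pq, xt] = lgpdens(x);   % returned density p with quantiles pq on the grid xt (assumed outputs)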

Figure 7: Density estimates for two different distributions. The figures are from
demo_lgpdens. On the left (panel (a), Galaxy data), the blue line is the density estimate
while the gray area is the 95% mean region; on the right (panel (b), Old faithful data) is a
2D density estimate with contours.

8.6 Input dependent models
In input-dependent models, more than one latent variable is used for modelling different
parameters of the distribution. The additional latent variables in turn enable modelling dependencies between the inputs and the parameters.
8.6.1 Input-dependent Noise
Input-dependent noise model (Goldberg et al., 1997) can be used to model regression data.
The model assumes Gaussian distributed data with latent variables defining both mean and
variance of the Gaussian distribution instead of just the mean like in the standard Gaussian
likelihood. By setting independent GP priors on the two latent variables, it is possible to infer
non-constant, that is input-dependent, noise in the data.
The likelihood is defined as
p(y | f^{(1)}, f^{(2)}, σ^2) = \prod_{i=1}^{n} N( y_i | f_i^{(1)}, σ^2 \exp(f_i^{(2)}) ),      (66)

with the latent function f^{(1)} defining the mean and f^{(2)} defining the variance. Input-dependent
noise is demonstrated in demo_inputdependentnoise with both heteroscedastic and constant noise.
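A sketch of the two-latent setup (the 'comp_cf' mapping of covariance functions to the two latent processes and the default constructor arguments are assumptions; demo_inputdependentnoise contains the working version):

% Sketch: heteroscedastic regression with the input-dependent noise model.
% One covariance function per latent process: f^(1) for the mean, f^(2) for the
% log noise scale. The 'comp_cf' mapping is an assumption; see demo_inputdependentnoise.
lik = lik_inputdependentnoise();
gpcf1 = gpcf_sexp('lengthScale', 1, 'magnSigma2', 1);     % prior for the mean process
gpcf2 = gpcf_sexp('lengthScale', 2, 'magnSigma2', 0.5);   % prior for the noise process
gp = gp_set('lik', lik, 'cf', {gpcf1 gpcf2}, 'comp_cf', {1 2}, ...
            'latent_method', 'Laplace', 'jitterSigma2', 1e-6);
opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
gp = gp_optim(gp, x, y, 'opt', opt);
[Ef, Varf] = gp_pred(gp, x, y, xt);   % Ef stacks the two latent processes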
8.6.2 Input-dependent overdispersed Weibull
In the input-dependent Weibull model, an additional latent variable is used for modelling the
shape parameter r of the standard Weibull model.

The likelihood is defined as
p(y | f^{(1)}, f^{(2)}, z) = \prod_{i=1}^{n} r_i^{1−z_i}\, e^{−(1−z_i) f_i^{(1)}}\, y_i^{(1−z_i)(r_i − 1)}\, e^{−y_i^{r_i} / e^{f_i^{(1)}}},      (67)

where z is a vector of censoring indicators, with z_i = 0 for an uncensored event and z_i = 1 for
a right censored event for observation i. The latent variable f^{(2)} is used for modelling the shape
parameter here. Because the dispersion parameter is constrained to be greater than 0,
it is computed as r_i = exp(f_i^{(2)}) to ensure positivity. The input-dependent Weibull model is
demonstrated in demo_inputdependentweibull.
8.7 Monotonicity constraint
Riihimäki and Vehtari (2010) presented how a monotonicity constraint for GP can be
introduced with virtual derivative observations, and the resulting posterior can be approximated with expectation propagation.
Currently GPstuff supports monotonicity constraint for models which have Gaussian
likelihood or fully factoring likelihoods with EP inference and which use covariance functions
gpcf_constant, gpcf_linear and gpcf_sexp. Riihimäki and Vehtari (2010) used type
II MAP estimate for the hyperparameters, but GPstuff allows easy integration over the
hyperparameters with any method described in Section 5.
A GP model with monotonicity constraints on one or more covariates can be constructed easily with the function gp_monotonic. The monotonicity constraint is demonstrated in
demo_monotonic and demo_monotonic2.
Figure 8 displays the predictions of models with and without the monotonicity constraint. The monotonicity constraint allows the inclusion of the sensible prior information that the death
rate increases as age increases. The effect is shown most clearly at higher ages, where there
are fewer observations, but an overall reduction in the uncertainty about the latent function
can also be seen.
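A sketch of the construction (only a call with default options is shown, since the option names of gp_monotonic are not spelled out in this text; demo_monotonic covers selecting the constrained covariates and the direction):

% Sketch: adding a monotonicity constraint via virtual derivative observations.
lik = lik_gaussian('sigma2', 0.2);
gpcf = gpcf_sexp('lengthScale', 1, 'magnSigma2', 1);
gp = gp_set('lik', lik, 'cf', gpcf);
gp = gp_monotonic(gp, x, y);          % default options; EP handles the virtual observations
[Ef, Varf] = gp_pred(gp, x, y, xt);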

9. Mean functions
In the standard GP regression a zero mean function is assumed for the prior process. This
is convenient but there are nonetheless some advantages in using a specified mean function.
The basic principle in doing GP regression with a mean function is to apply a zero mean
GP for the difference between observations and the mean function.
9.1 Explicit basis functions
Here we follow closely the presentation of Rasmussen and Williams (2006) on the subject
and briefly present the main results. A mean function can be specified as a weighted sum
of some basis functions h,

m = h(x)^T β,

with weights β. The target of modelling, the underlying process g, is assumed to be a sum
of the mean function and a zero-mean GP,

g = h(x)^T β + GP(0, K).

Figure 8: Illustration of how the monotonicity constraint on the latent function affects
the predictions (four panels: sexp; sexp + monot.; lin + sexp; lin + sexp + monot.;
deaths per 1000 as a function of age). Mortality rate data from Broffitt (1988).

By assuming a Gaussian prior for the weights, β ∼ N(b, B), the weight parameters can be
integrated out and the prior for g is another GP,

prior   g ∼ GP( h(x)^T b, K + h(x)^T B h(x) ).      (68)

The predictive equations are obtained by using the mean and covariance of the prior (68)
in the zero mean GP predictive equations (15):

E(g_*) = E(f_*) + R^T \bar{β},
Cov(g_*) = Cov(f_*) + R^T ( B^{−1} + H K_y^{−1} H^T )^{−1} R,      (69)
\bar{β} = ( B^{−1} + H K_y^{−1} H^T )^{−1} ( B^{−1} b + H K_y^{−1} y ),
R = H_* − H K_y^{−1} K_*,
H = [ h_1(x) ; h_2(x) ; ... ; h_k(x) ],   where x is a row vector.

If the prior assumptions about the weights are vague, then B^{−1} → O (O is a matrix of zeros)
and equations (68) and (69) no longer depend on b or B:

E(g_*) = E(f_*) + R^T β_v,      (70)
Cov(g_*) = Cov(f_*) + R^T ( H K_y^{−1} H^T )^{−1} R,      (71)
β_v = ( H K_y^{−1} H^T )^{−1} H K_y^{−1} y.
Corresponding to the exact and the vague prior for the basis functions' weights, there are two
versions of the marginal likelihood. With the exact prior the marginal likelihood is

log p(y | X, b, B) = −\frac{1}{2} M^T N^{−1} M − \frac{1}{2} log|K_y| − \frac{1}{2} log|B| − \frac{1}{2} log|A| − \frac{n}{2} log 2π,
M = H^T b − y,
N = K_y + H^T B H,
A = B^{−1} + H K_y^{−1} H^T,

where n is the number of observations. Its derivative with respect to the hyperparameters is

\frac{∂}{∂θ_i} log p(y | X, b, B) = \frac{1}{2} M^T N^{−1} \frac{∂K_y}{∂θ_i} N^{−1} M − \frac{1}{2} tr\!\left( K_y^{−1} \frac{∂K_y}{∂θ_i} \right) − \frac{1}{2} tr\!\left( A^{−1} \frac{∂A}{∂θ_i} \right),
\frac{∂A}{∂θ_i} = −H K_y^{−1} \frac{∂K_y}{∂θ_i} K_y^{−1} H^T.

With a vague prior the marginal likelihood is

log p(y | X) = −\frac{1}{2} y^T K_y^{−1} y + \frac{1}{2} y^T C y − \frac{1}{2} log|K_y| − \frac{1}{2} log|A_v| − \frac{n − m}{2} log 2π,
A_v = H K_y^{−1} H^T,
C = K_y^{−1} H^T A_v^{−1} H K_y^{−1},

where m is the rank of H^T. Its derivative is

\frac{∂}{∂θ_i} log p(y | X) = \frac{1}{2} y^T P y + \frac{1}{2} ( −y^T P G − G^T P y + G^T P G ) − \frac{1}{2} tr\!\left( K_y^{−1} \frac{∂K_y}{∂θ_i} \right) − \frac{1}{2} tr\!\left( A_v^{−1} \frac{∂A_v}{∂θ_i} \right),      (72)
P = K_y^{−1} \frac{∂K_y}{∂θ_i} K_y^{−1},
G = H^T A_v^{−1} H K_y^{−1} y,

where we have used the fact that the matrices K_y^{−1}, ∂K_y/∂θ_i and A_v are symmetric. The above
expression (72) could be simplified a little further because y^T P G + G^T P y = 2 y^T P G.
The use of mean functions is demonstrated in demo_regression_meanf.
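A sketch of specifying explicit basis functions (the gpmf_* constructors and gp_set's 'meanf' option are assumed here; demo_regression_meanf has the working version, and the weight priors are left at their defaults):

% Sketch: GP regression with an explicit mean function built from constant,
% linear and squared basis functions.
gpmfs = {gpmf_constant() gpmf_linear() gpmf_squared()};
lik = lik_gaussian('sigma2', 0.2);
gpcf = gpcf_sexp('lengthScale', 1, 'magnSigma2', 1);
gp = gp_set('lik', lik, 'cf', gpcf, 'meanf', gpmfs);
opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
gp = gp_optim(gp, x, y, 'opt', opt);
[Ef, Varf] = gp_pred(gp, x, y, xt);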

10. Sparse Gaussian processes
The evaluation of the inverse and determinant of the covariance matrix in the log marginal
likelihood (or its approximation) and its gradient scale as O(n^3) in time. This restricts the
implementation of GP models to moderately sized data sets. However, the unfavorable scaling
in computational time can be alleviated with sparse approximations to GPs and with compactly
supported (CS) covariance functions. The sparse approximations scale as O(nm^2), where
m < n, and the CS covariance functions lead to computations which scale, in general,
as O(n^3) but with a smaller constant than with traditional globally supported covariance
functions. Many sparse GPs have been proposed in the literature and some of them are built into
GPstuff.
10.1 Compactly supported covariance functions
A CS covariance function gives zero correlation between data points whose distance exceeds
a certain threshold. This leads to a sparse covariance matrix. The challenge in constructing CS covariance functions is to guarantee their positive definiteness, and much literature
has been devoted to the subject (see e.g. Sansò and Schuh, 1987; Wu, 1995; Wendland,
1995; Gaspari and Cohn, 1999; Gneiting, 1999, 2002; Buhmann, 2001). The CS functions implemented in GPstuff are Wendland’s piecewise polynomials k_{pp,q} (Wendland, 2005), such
as

k_{pp,2} = \frac{σ_{pp}^2}{3} (1 − r)_+^{j+2} \big( (j^2 + 4j + 3) r^2 + (3j + 6) r + 3 \big),      (73)

where j = ⌊d/2⌋ + 3. These functions correspond to processes that are q times mean square
differentiable and are positive definite up to the input dimension d. Thus, the degree of
the polynomial has to be increased alongside the input dimension. The dependence of CS
covariance functions on the input dimension is fundamental: there are no radial CS
functions that are positive definite on every ℝ^d.
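As a sketch, a compactly supported covariance function is used just like a globally supported one; only the covariance constructor changes (whether the constructor requires the input dimension, e.g. via an 'nin' argument, should be checked against the GPstuff help):

% Sketch: GP regression with Wendland's piecewise polynomial covariance (q = 2),
% which produces a sparse covariance matrix. The 'nin' argument is an assumption.
gpcf = gpcf_ppcs2('nin', size(x,2), 'lengthScale', 2, 'magnSigma2', 1);
lik = lik_gaussian('sigma2', 0.2);
gp = gp_set('lik', lik, 'cf', gpcf);
opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
gp = gp_optim(gp, x, y, 'opt', opt);
[Ef, Varf] = gp_pred(gp, x, y, xt);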