GPstuff Manual
Bayesian Modeling with Gaussian Processes using the GPstuff Toolbox

Jarno Vanhatalo∗ (jarno.vanhatalo@helsinki.fi)
Department of Environmental Sciences, University of Helsinki, P.O. Box 65, FI-00014 Helsinki, Finland

Jaakko Riihimäki†, Jouni Hartikainen†, Pasi Jylänki†, Ville Tolvanen†, Aki Vehtari‡ (aki.vehtari@aalto.fi)
Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University School of Science, P.O. Box 15400, FI-00076 Aalto, Finland

Abstract

Gaussian processes (GP) are powerful tools for probabilistic modeling purposes. They can be used to define prior distributions over latent functions in hierarchical Bayesian models. The prior over functions is defined implicitly by the mean and covariance function, which determine the smoothness and variability of the function. The inference can then be conducted directly in the function space by evaluating or approximating the posterior process. Despite their attractive theoretical properties, GPs pose practical challenges in their implementation. GPstuff is a versatile collection of computational tools for GP models compatible with Linux and Windows MATLAB and Octave. It includes, among others, various inference methods, sparse approximations and tools for model assessment. In this work, we review these tools and demonstrate the use of GPstuff in several models. Last updated 2015-10-12.

Keywords: Gaussian process, Bayesian hierarchical model, nonparametric Bayes

∗. Work done mainly while at the Department of Biomedical Engineering and Computational Science, Aalto University
†. No longer at Aalto University
‡. Corresponding author

Contents

1 Introduction
2 Gaussian process models
  2.1 Gaussian process prior
  2.2 Conditioning on the observations
3 Conditional posterior and predictive distributions
  3.1 Gaussian observation model: the analytically tractable case
  3.2 Laplace approximation
  3.3 Expectation propagation algorithm
  3.4 Markov chain Monte Carlo
4 Marginal likelihood given parameters
5 Marginalization over parameters
  5.1 Maximum a posteriori estimate of parameters
  5.2 Grid integration
  5.3 Monte Carlo integration
  5.4 Central composite design
6 Getting started with GPstuff: regression and classification
  6.1 Gaussian process regression
    6.1.1 Constructing the model
    6.1.2 MAP estimate for the parameters
    6.1.3 Marginalization over parameters with grid integration
    6.1.4 Marginalization over parameters with MCMC
  6.2 Gaussian process classification
    6.2.1 Constructing the model
    6.2.2 Inference with Laplace approximation
    6.2.3 Inference with expectation propagation
    6.2.4 Inference with MCMC
    6.2.5 Marginal posterior corrections
7 Other single latent models
  7.1 Robust regression
    7.1.1 Student-t observation model
    7.1.2 Laplace observation model
  7.2 Count data
    7.2.1 Poisson
    7.2.2 Negative binomial
    7.2.3 Binomial
    7.2.4 Hurdle model
  7.3 Log-Gaussian Cox process
  7.4 Accelerated failure time survival models
    7.4.1 Weibull
    7.4.2 Log-Gaussian
    7.4.3 Log-logistic
  7.5 Derivative observations in GP regression
  7.6 Quantile Regression
8 Multilatent models
  8.1 Multiclass classification
  8.2 Multinomial
  8.3 Cox proportional hazard model
  8.4 Zero-inflated negative binomial
  8.5 Density estimation and regression
  8.6 Input dependent models
    8.6.1 Input-dependent Noise
    8.6.2 Input-dependent overdispersed Weibull
  8.7 Monotonicity constraint
9 Mean functions
  9.1 Explicit basis functions
10 Sparse Gaussian processes
  10.1 Compactly supported covariance functions
    10.1.1 Compactly supported covariance functions in GPstuff
  10.2 FIC and PIC sparse approximations
    10.2.1 FIC sparse approximation in GPstuff
    10.2.2 PIC sparse approximation in GPstuff
  10.3 Deterministic training conditional, subset of regressors and variational sparse approximation
    10.3.1 Variational, DTC and SOR sparse approximation in GPstuff
  10.4 Sparse GP models with non-Gaussian likelihoods
  10.5 Predictive active set selection for classification
11 Modifying the covariance functions
  11.1 Additive models
    11.1.1 Additive models in GPstuff
  11.2 Additive covariance functions with selected variables
  11.3 Product of covariance functions
12 State space inference
13 Model assessment and comparison
  13.1 Marginal likelihood
  13.2 Cross-validation
  13.3 DIC
  13.4 WAIC
  13.5 Model assessment demos
14 Bayesian optimization
15 Adding new features in the toolbox
16 Discussion
A Comparison of features in GPstuff, GPML and FBM
B Covariance functions
C Observation models
D Priors
E Transformation of hyperparameters
F Developer appendix
  F.1 lik functions
    F.1.1 lik_negbin
    F.1.2 lik_t
  F.2 gpcf functions
    F.2.1 gpcf_sexp
  F.3 prior functions
  F.4 Optimisation functions
  F.5 Other inference methods

1. Introduction

This work reviews the free open source toolbox GPstuff (version 4.6), which implements a collection of inference methods and tools for various Gaussian process (GP) models. The toolbox is compatible with Unix and Windows MATLAB (Mathworks, 2010) (r2009b or later; earlier versions may work but have not been tested extensively) and Octave (Octave community, 2012) (3.6.4 or later; compactly supported covariance functions are not currently working in Octave). It is available from http://becs.aalto.fi/en/research/bayes/gpstuff/. If you find GPstuff useful, please use the reference (Vanhatalo et al., 2013)

  Jarno Vanhatalo, Jaakko Riihimäki, Jouni Hartikainen, Pasi Jylänki, Ville Tolvanen and Aki Vehtari (2013). GPstuff: Bayesian Modeling with Gaussian Processes. Journal of Machine Learning Research, 14:1175-1179

and the appropriate references mentioned in this text or in the GPstuff functions you are using.

A GP is a stochastic process which provides a powerful tool for probabilistic inference directly on distributions over functions (e.g. O'Hagan, 1978) and which has gained much attention in recent years (Rasmussen and Williams, 2006). In many practical GP models, the observations y = [y_1, ..., y_n]^T related to the inputs (covariates) X = {x_i = [x_{i,1}, ..., x_{i,d}]^T}_{i=1}^n are assumed to be conditionally independent given a latent function (or predictor) f(x), so that the likelihood p(y | f) = ∏_{i=1}^{n} p(y_i | f_i), where f = [f(x_1), ..., f(x_n)]^T, factorizes over cases. A GP prior is set for the latent function, after which the posterior p(f | y, X) is solved and used for prediction.
The GP defines the prior over the latent function implicitly through the mean and covariance function, which determine, for example, the smoothness and variability of the latent function. Usually, the model hierarchy is extended to a third level by also giving priors for the parameters of the covariance function and the observation model. Most of the models in GPstuff follow the above definition and can be summarized as:

    Observation model:  y | f, φ ∼ ∏_{i=1}^{n} p(y_i | f_i, φ)        (1)
    GP prior:           f(x) | θ ∼ GP( m(x), k(x, x' | θ) )           (2)
    Hyperprior:         θ, φ ∼ p(θ) p(φ).                              (3)

Here θ denotes the parameters in the covariance function k(x, x' | θ), and φ the parameters in the observation model. We will call the function value f(x) at a fixed x a latent variable, and for brevity we sometimes denote all the covariance function and observation model parameters by ϑ = [θ, φ]. For simplicity of presentation the mean function is considered zero, m(x) ≡ 0, throughout this paper. We will also denote both the observation model and the likelihood by p(y | f, φ) and assume the reader is able to distinguish between the two from the context. The likelihood naming convention is used in the toolbox for both likelihood and observation model related functionalities, which follows the naming convention used, for example, in the INLA (Rue et al., 2009) and GPML (Rasmussen and Nickisch, 2010) software packages. There are also models with multiple latent processes and likelihoods where each factor depends on multiple latent values. These are discussed in Section 8.

Early examples of GP models can be found, for example, in time series analysis and filtering (Wiener, 1949) and in geostatistics (e.g. Matheron, 1973). GPs are still widely used in these fields and useful overviews are provided in (Cressie, 1993; Grewal and Andrews, 2001; Diggle and Ribeiro, 2007; Gelfand et al., 2010). O'Hagan (1978) was one of the first to consider GPs in a general probabilistic modeling context and provided a general theory of GP prediction. This general regression framework was later rediscovered as an alternative to neural network models (Williams and Rasmussen, 1996; Rasmussen, 1996) and extended to problems other than regression (Neal, 1997; Williams and Barber, 1998). This machine learning perspective is comprehensively summarized by Rasmussen and Williams (2006). Nowadays, GPs are used, for example, in weather forecasting (Fuentes and Raftery, 2005; Berrocal et al., 2010), spatial statistics (Best et al., 2005; Kaufman et al., 2008; Banerjee et al., 2008; Myllymäki et al., 2014), computational biology (Honkela et al., 2011), reinforcement learning (Deisenroth et al., 2009, 2011), healthcare applications (Stegle et al., 2008; Vanhatalo et al., 2010; Rantonen et al., 2012, 2014), survival analysis (Joensuu et al., 2012, 2014), industrial applications (Kalliomäki et al., 2005), computer model calibration and emulation (Kennedy and O'Hagan, 2001; Conti et al., 2009), prior elicitation (Moala and O'Hagan, 2010) and density estimation (Tokdar and Ghosh, 2007; Tokdar et al., 2010; Riihimäki and Vehtari, 2014), to name a few.

Despite their attractive theoretical properties and wide range of applications, GPs pose practical challenges in implementation. GPstuff provides several state-of-the-art inference algorithms and approximative methods that make the inference easier and faster for many practically interesting models. GPstuff is a modular toolbox which combines inference algorithms and model structures in an easily extensible format.
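To make the mapping from the model hierarchy (1)-(3) to the toolbox concrete, the following minimal sketch builds a simple regression model in which the observation model, the covariance function and the hyperpriors are each represented by their own structure. It is an illustration only: the gpcf_sexp, lik_gaussian, prior_* and gp_optim calls are documented later in this text, but the exact option names passed to gp_set ('lik', 'cf') and the prediction call gp_pred are assumptions here; Section 6 documents the actual getting-started workflow.

    % A hedged sketch: observation model (1), GP prior (2) and hyperpriors (3)
    % expressed with GPstuff constructors. Some option names are assumptions;
    % see the demo_* files for the documented usage.
    x = linspace(0, 10, 50)';            % toy inputs
    y = sin(x) + 0.2*randn(size(x));     % toy observations

    % Covariance function k(x,x'|theta) with hyperpriors p(theta)
    gpcf = gpcf_sexp('lengthScale', 1, 'magnSigma2', 0.1, ...
                     'lengthScale_prior', prior_t(), ...
                     'magnSigma2_prior', prior_logunif());

    % Gaussian observation model p(y|f,phi) with a prior on the noise variance
    lik = lik_gaussian('sigma2', 0.1, 'sigma2_prior', prior_logunif());

    % Combine the three levels into one GP structure
    gp = gp_set('lik', lik, 'cf', gpcf);

    % MAP estimate of the parameters and prediction at test inputs
    opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
    gp = gp_optim(gp, x, y, 'opt', opt, 'optimf', @fminscg);
    [Ef, Varf] = gp_pred(gp, x, y, linspace(0, 10, 200)');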
GPstuff also provides various tools for model checking and comparison. These are essential in making model assessment and criticism an integral part of the data analysis. Many of the algorithms and models in GPstuff were proposed by others but have been reimplemented for GPstuff. In each case, we provide a reference to the original work, but the implementation of the algorithm in GPstuff is always unique. The basic implementation of two important algorithms, the Laplace approximation and the expectation propagation algorithm (discussed in Section 3), follows the pseudocode from (Rasmussen and Williams, 2006). However, these have later been generalized and improved as described in (Vanhatalo et al., 2009, 2010; Vanhatalo and Vehtari, 2010; Jylänki et al., 2011; Riihimäki et al., 2013; Riihimäki and Vehtari, 2014).

There are also many other freely available toolboxes for GP modelling besides GPstuff. Perhaps the best known packages nowadays are the Gaussian Processes for Machine Learning (GPML) toolbox (Rasmussen and Nickisch, 2010) and the Flexible Bayesian Modelling (FBM) toolbox by Radford Neal. A good overview of other packages can be obtained from the Gaussian processes website http://www.gaussianprocess.org/ and the R Archive Network http://cran.r-project.org/. Other GP software packages have some overlap with GPstuff, and some of them include models that are not implemented in GPstuff. The main advantages of GPstuff over other GP software are its versatile collection of models and computational tools, as well as its modularity, which allows easy extensions. A comparison between GPstuff, GPML and FBM is provided in Appendix A.

Three earlier GP and Bayesian modelling packages have influenced our work. The structure of GPstuff is mostly indebted to the Netlab toolbox (Nabney, 2001), although it is far from being compatible with it. The GPstuff project was started in 2006 based on the MCMCStuff toolbox (1998-2006) (http://becs.aalto.fi/en/research/bayes/mcmcstuff/). MCMCStuff for its part was based on Netlab, and it was also influenced by FBM. The INLA software package by Rue et al. (2009) has also motivated some of the technical details in the toolbox. In addition to these, some technical implementations of GPstuff rely on the sparse matrix toolbox SuiteSparse (Davis, 2005) (http://www.cise.ufl.edu/research/sparse/SuiteSparse/).

This work concentrates on discussing the essential theory and methods behind the implementation of GPstuff. We explain important parts of the code, but the full code demonstrating the important features of the package (including also data creation and such) is included in the demonstration files demo_* to which we refer in the text.

2. Gaussian process models

2.1 Gaussian process prior

A GP prior over the function f(x) implies that any set of function values f, indexed by the input coordinates X, has a multivariate Gaussian distribution

    p(f | X, θ) = N(f | 0, K_{f,f}),    (4)

where K_{f,f} is the covariance matrix. Notice that the distribution over functions will be denoted by GP(·, ·), whereas the distribution over a finite set of latent variables will be denoted by N(·, ·). The covariance matrix is constructed by a covariance function, [K_{f,f}]_{i,j} = k(x_i, x_j | θ), which characterizes the correlation between different points in the process.
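To illustrate equation (4) in code, the sketch below builds the covariance matrix for a small set of inputs and draws one latent function from the prior. It uses the fh.trcov subfunction interface described in Appendix F.2.1; calling the subfunction directly like this, and the exact call signature, are assumptions made for illustration only, since user code normally never needs the matrix explicitly.

    % A small sketch of equation (4): building K_{f,f} for a set of inputs
    % with the squared exponential covariance function. The direct fh.trcov
    % call is for illustration; see Appendix F.2.1.
    x = (0:0.5:5)';                                   % 11 one-dimensional inputs
    gpcf = gpcf_sexp('lengthScale', 1.2, 'magnSigma2', 0.5);
    K = gpcf.fh.trcov(gpcf, x);                       % [K]_{ij} = k(x_i, x_j | theta)

    % A draw from the zero-mean GP prior evaluated at x, f ~ N(0, K).
    % A small jitter is added for numerical stability of the Cholesky factor.
    L = chol(K + 1e-8*eye(size(K)), 'lower');
    f = L * randn(size(x));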
The covariance function can be chosen freely as long as the covariance matrices it produces are symmetric and positive semi-definite (v^T K_{f,f} v ≥ 0 for all v ∈ ℝ^n).

7.4.1 Weibull

The baseline hazard h_0(y) is assumed to follow the Weibull distribution, parametrized in GPstuff as

    h_0(y) = r y^{r−1},    (42)

where r > 0 is the shape parameter. This follows the parametrization used in Martino et al. (2011). The likelihood is defined as

    L = ∏_{i=1}^{n} r^{1−z_i} e^{−(1−z_i) f(x_i)} y_i^{(1−z_i)(r−1)} e^{−y_i^r / e^{f(x_i)}},    (43)

where z is a vector of censoring indicators, with z_i = 0 for an uncensored event and z_i = 1 for a right censored event for observation i. In case of censoring, y_i denotes the time of censoring. Here we present only the likelihood function because we do not have an observation model for the censoring. Another common parameterization of the Weibull model is through the shape, r, and scale, λ, parameters. The parameterization in GPstuff corresponds to a Weibull model with input dependent scale λ(x_i) = e^{f(x_i)/r}, so that λ(x_i)^r = e^{f(x_i)}. Hence the Weibull model for an uncensored observation y_i is

    p(y_i | f(x_i), r) = (r / λ(x_i)) (y_i / λ(x_i))^{r−1} e^{−(y_i / λ(x_i))^r}.    (44)

7.4.2 Log-Gaussian

With the log-Gaussian survival model the logarithms of the survival times are assumed to be normally distributed. The likelihood is defined as

    L = ∏_{i=1}^{n} (2πσ²)^{−(1−z_i)/2} y_i^{−(1−z_i)} exp( −(1/(2σ²)) (1 − z_i)(log(y_i) − f(x_i))² ) × [ 1 − Φ( (log(y_i) − f(x_i)) / σ ) ]^{z_i},    (45)

where σ is the scale parameter.

7.4.3 Log-logistic

The log-logistic likelihood is defined as

    L = ∏_{i=1}^{n} [ r y_i^{r−1} exp(f(x_i)) ]^{1−z_i} [ 1 + y_i^r exp(f(x_i)) ]^{z_i − 2},    (46)

where r is the shape parameter and z_i are the censoring indicators.

7.5 Derivative observations in GP regression

Incorporating derivative observations in GP regression is fairly straightforward, because the derivative of a Gaussian process is a Gaussian process. In short, derivative observations are taken into account by extending the covariance matrices to include derivative observations. This is done by forming joint covariance matrices of function values and derivatives. The following equations (Rasmussen and Williams, 2006) state how the covariances between function values and derivatives, and between derivatives, are calculated:

    Cov( f_i, ∂f_j/∂x_{d,j} ) = ∂k(x_i, x_j)/∂x_{d,j},
    Cov( ∂f_i/∂x_{d,i}, ∂f_j/∂x_{e,j} ) = ∂²k(x_i, x_j) / (∂x_{d,i} ∂x_{e,j}).

The joint covariance matrix for function values and derivatives is of the following form:

    K = [ K_{ff}  K_{fD};  K_{Df}  K_{DD} ],
    K_{ff}^{ij} = k(x_i, x_j),
    K_{Df}^{ij} = ∂k(x_i, x_j)/∂x_{d,i},
    K_{fD} = (K_{Df})^T,
    K_{DD}^{ij} = ∂²k(x_i, x_j) / (∂x_{d,i} ∂x_{e,j}).

Prediction is done as usual, but with derivative observations the joint covariance matrices are used instead of the normal ones. Using derivative observations in GPstuff requires two steps: when initializing the GP structure one must set the option 'derivobs' to 'on', and the observation vector must be formed with the right size. With input size n × m the observation vector with derivatives should be of size n + m·n. The observation vector is constructed by appending the partial derivative observations after the function value observations:

    y_obs = [ y(x); ∂y(x)/∂x_1; ... ; ∂y(x)/∂x_m ].    (47)

A different noise level could be assumed for function values and derivative observations, but at the moment the implementation allows only the same noise for all the observations. The use of derivative observations is demonstrated in demo_derivativeobs. The derivative process can also be used to construct a monotonicity constraint, as described in Section 8.7.
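A minimal sketch of the two steps described above is given below. The 'derivobs' option comes from the text; the remaining constructor and option names (lik_gaussian, the 'lik' and 'cf' arguments of gp_set) are assumed to match the regression workflow of Section 6.

    % A hedged sketch of using derivative observations (Section 7.5).
    % Only the 'derivobs' option is documented above; other option names are
    % assumptions, see demo_derivativeobs for the documented usage.
    x  = linspace(0, 2*pi, 15)';          % n x 1 inputs (m = 1 here)
    y  = sin(x);                          % function value observations
    dy = cos(x);                          % partial derivative observations

    gp = gp_set('lik', lik_gaussian('sigma2', 0.01), ...
                'cf', gpcf_sexp(), ...
                'derivobs', 'on');

    % Observation vector of size n + m*n, equation (47):
    % function values first, then derivatives for each input dimension.
    yobs = [y; dy];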
7.6 Quantile Regression

Quantile regression is used for estimating quantiles of the response variable as a function of the input variables (Boukouvalas et al., 2012). The likelihood is defined as

    p(y | f, σ, τ) = (τ(1 − τ)/σ) exp( −((y − f)/σ) (τ − I(y ≤ f)) ),    (48)

where τ is the quantile of interest and σ is the standard deviation. I(y ≤ f) is 1 if the condition inside the brackets is true and 0 otherwise. Because the logarithm of the likelihood is not twice differentiable at the mode, the Laplace approximation cannot be used for inference with quantile regression. Quantile regression with EP and MCMC is demonstrated in demo_qgp with toy data.

8. Multilatent models

The multilatent models consist of models where individual likelihood factors depend on multiple latent variables. In this section we shortly summarize such models in GPstuff.

8.1 Multiclass classification

In multiclass classification problems the target variables have more than two possible class labels, y_i ∈ {1, ..., c}, where c > 2 is the number of classes. In GPstuff, multiclass classification can be made either using the softmax likelihood (lik_softmax)

    p(y_i | f_i) = exp(f_i^{y_i}) / ∑_{j=1}^{c} exp(f_i^j),    (49)

where f_i = [f_i^1, ..., f_i^c]^T, or the multinomial probit likelihood (lik_multinomialprobit)

    p(y_i | f_i) = E_{p(u_i)} [ ∏_{j=1, j≠y_i}^{c} Φ(u_i + f_i^{y_i} − f_i^j) ],    (50)

where the auxiliary variable u_i is distributed as p(u_i) = N(u_i | 0, 1), and Φ(x) denotes the cumulative density function of the standard normal distribution. In the Gaussian process literature for multiclass classification, a common assumption is to introduce c independent prior processes that are associated with the c classes (see, e.g., Rasmussen and Williams, 2006). By assuming zero-mean Gaussian processes for the latent functions associated with the different classes, we obtain a zero-mean Gaussian prior

    p(f | X) = N(f | 0, K),    (51)

where f = [f_1^1, ..., f_n^1, f_1^2, ..., f_n^2, ..., f_1^c, ..., f_n^c]^T and K is a cn × cn block-diagonal covariance matrix with the matrices K^1, K^2, ..., K^c (each of size n × n) on its diagonal.

Inference for softmax can be made with MCMC or the Laplace approximation. As described by Williams and Barber (1998) and Rasmussen and Williams (2006), using the Laplace approximation for the softmax likelihood with the uncorrelated prior processes, the posterior computations can be done efficiently in a way that scales linearly in c, which is also implemented in GPstuff. Multiclass classification with softmax is demonstrated in demo_multiclass. Inference for the multinomial probit can be made with expectation propagation by using a nested EP approach that does not require numerical quadratures or sampling for estimation of the tilted moments and predictive probabilities, as proposed by Riihimäki et al. (2013). Similarly to softmax with Laplace's method, the nested EP approach leads to low-rank site approximations which retain all posterior couplings but result in linear computational scaling with respect to c. Multiclass classification with the multinomial probit likelihood is demonstrated in demo_multiclass_nested_ep.
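As a rough illustration of how the c independent prior processes of equation (51) could be set up, the sketch below uses one covariance function per class together with the softmax likelihood. The way the latent processes are tied to the covariance functions here (a cell array of covariance functions and a 'comp_cf' option), and the 'jitterSigma2' option, are assumptions on our part; demo_multiclass shows the documented construction.

    % A hedged sketch of a three-class classification model (Section 8.1).
    % The 'comp_cf' and 'jitterSigma2' option names are assumptions;
    % see demo_multiclass for the documented construction.
    c = 3;                                  % number of classes
    cfs = cell(1, c);
    for j = 1:c
      cfs{j} = gpcf_sexp('lengthScale', 1, 'magnSigma2', 1);  % one prior process per class
    end
    lik = lik_softmax();                    % softmax likelihood, equation (49)
    gp = gp_set('lik', lik, 'cf', cfs, 'comp_cf', {1, 2, 3}, ...
                'jitterSigma2', 1e-6);      % jitter for numerical stability
    % Inference can then be carried out with the Laplace approximation or MCMC,
    % as described in the text above.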
8.2 Multinomial

The multinomial model (lik_multinomial) is an extension of multiclass classification to the situation where each observation consists of counts of class observations. Let y_i = [y_{i,1}, ..., y_{i,c}], where c > 2, be a vector of counts of class observations related to the inputs x_i, so that

    y_i ∼ Multinomial([p_{i,1}, ..., p_{i,c}], n_i),    (52)

where n_i = ∑_{j=1}^{c} y_{i,j}. The probability of observing class j is

    p_{i,j} = exp(f_i^j) / ∑_{j=1}^{c} exp(f_i^j),    (53)

where f_i = [f_i^1, ..., f_i^c]^T. As in multiclass classification, we can introduce c independent prior processes associated with the c classes. By assuming zero-mean Gaussian processes for the latent functions associated with the different classes, we obtain a zero-mean Gaussian prior

    p(f | X) = N(f | 0, K),    (54)

where f = [f_1^1, ..., f_n^1, f_1^2, ..., f_n^2, ..., f_1^c, ..., f_n^c]^T and K is a cn × cn block-diagonal covariance matrix with the matrices K^1, K^2, ..., K^c (each of size n × n) on its diagonal. This model is demonstrated in demo_multinomial and it has been used, for example, in (Juntunen et al., 2012).

8.3 Cox proportional hazard model

For the individual i, where i = 1, ..., n, we have an observed survival time y_i (possibly right censored) with censoring indicator δ_i, where δ_i = 0 if the ith observation is uncensored and δ_i = 1 if the observation is right censored. The traditional approach to analyzing continuous time-to-event data is to assume the Cox proportional hazard model (Cox, 1972)

    h_i(t) = h_0(t) exp(x_i^T β),    (55)

where h_0 is the unspecified baseline hazard rate, x_i is the d × 1 vector of covariates for the ith patient and β is the vector of regression coefficients. The matrix X = [x_1, ..., x_n]^T of size n × d includes all covariate observations. The Cox model with a linear predictor can be extended to a more general form to enable, for example, additive and non-linear effects of covariates (Kneib, 2006; Martino et al., 2011). We extend the proportional hazard model by

    h_i(t) = exp( log(h_0(t)) + η_i(x_i) ),    (56)

where the linear predictor is replaced with the latent predictor η_i depending on the covariates x_i. By assuming a Gaussian process prior over η = (η_1, ..., η_n)^T, smooth nonlinear effects of continuous covariates are possible, and if there are dependencies between covariates, the GP can model these interactions implicitly.

A piecewise log-constant baseline hazard (see, e.g., Ibrahim et al., 2001; Martino et al., 2011) is assumed by partitioning the time axis into K intervals with equal lengths: 0 = s_0 < s_1 < s_2 < ... < s_K, where s_K > y_i for all i = 1, ..., n. In the interval k (where k = 1, ..., K), the hazard is assumed to be constant:

    h_0(t) = λ_k  for  t ∈ (s_{k−1}, s_k].    (57)

For the ith individual the hazard rate in the kth time interval is then

    h_i(t) = exp(f_k + η_i(x_i)),  t ∈ (s_{k−1}, s_k],    (58)

where f_k = log(λ_k). To obtain smooth hazard rate functions, we place another Gaussian process prior on f = (f_1, ..., f_K)^T. We define a vector containing the mean locations of the K time intervals as τ = (τ_1, ..., τ_K)^T. The likelihood contribution for the possibly right censored ith observation (y_i, δ_i) is assumed to be

    l_i = h_i(y_i)^{(1−δ_i)} exp( −∫_0^{y_i} h_i(t) dt ).    (59)

Using the piecewise log-constant assumption for the hazard rate function, the contribution of observation i to the likelihood results in

    l_i = [λ_k exp(η_i)]^{(1−δ_i)} exp( −[ (y_i − s_{k−1}) λ_k + ∑_{g=1}^{k−1} (s_g − s_{g−1}) λ_g ] exp(η_i) ),    (60)

where y_i ∈ (s_{k−1}, s_k] (Ibrahim et al., 2001; Martino et al., 2011). By applying Bayes' theorem, the prior information and likelihood contributions are combined, and the posterior distribution of the latent variables can be computed. Due to the form of the likelihood function, the resulting posterior becomes non-Gaussian and analytically exact inference is intractable. GPstuff supports MCMC and the Laplace approximation to integrate over the latent variables.
The use of the Gaussian process Cox proportional hazard model is demonstrated in demo_survival_coxph. The Gaussian process Cox proportional hazard model implemented with GPstuff was used in reference (Joensuu et al., 2012) to model the risk of gastrointestinal stromal tumour recurrence after surgery.

8.4 Zero-inflated negative binomial

The zero-inflated negative binomial model is suitable for modelling count variables with an excessive number of zero observations compared to the usual Poisson and negative binomial count models. Zero-inflated negative binomial models assume a two-stage process, where the first process determines whether the count is larger than zero, and the second process determines the non-zero count (Mullahy, 1986). In GPstuff, these processes are assumed independent a priori, but they become a posteriori dependent through the likelihood function

    p + (1 − p) NegBin(y | y = 0),  when y = 0,    (61)
    (1 − p) NegBin(y | y > 0),      when y > 0,    (62)

where the probability p is given by a binary classifier with the logit likelihood. NegBin is the negative binomial distribution parametrized for the ith observation as

    y | f, e, r ∼ ∏_{i=1}^{n} ( Γ(r + y_i) / (y_i! Γ(r)) ) ( r / (r + µ_i) )^r ( µ_i / (r + µ_i) )^{y_i},    (63)

where µ_i = e_i exp(f(x_i)) and r is the dispersion parameter governing the variance. In GPstuff, the latent value vector f = [f_1^T f_2^T]^T has length 2N, where N is the number of observations. The latents f_1 are associated with the classification process and the latents f_2 with the negative binomial count process.

8.5 Density estimation and regression

The logistic Gaussian process can be used for flexible density estimation and density regression. GPstuff includes an implementation based on the Laplace (and MCMC) approximation as described in (Riihimäki and Vehtari, 2012). The likelihood is defined as

    p(x | f) = ∏_{i=1}^{n} exp(f_i) / ∑_{j=1}^{n} exp(f_j).    (64)

For the latent function f, we assume the model f(x) = g(x) + h(x)^T β, where the GP prior g(x) is combined with the explicit basis functions h(x). The regression coefficients are denoted with β, and by placing a Gaussian prior β ∼ N(b, B) with mean b and covariance B, the parameters β can be integrated out from the model, which results in the following GP prior for f:

    f(x) ∼ GP( h(x)^T b, κ(x, x') + h(x)^T B h(x') ).    (65)

For the explicit basis functions, we use second-order polynomials, which leads to a GP prior that can favour density estimates where the tails of the distribution eventually go to zero. We use a finite-dimensional approximation and evaluate the integral and the Gaussian process in a grid. We approximate inference for logistic Gaussian process density estimation in a grid using Laplace's method or MCMC to integrate over the non-Gaussian posterior distribution of the latent values (see details in Riihimäki and Vehtari, 2012). Logistic Gaussian processes are also suitable for estimating conditional densities p(t | x), where t is a response variable. We discretize both the input and target space in a finite region to model the conditional densities with the logistic GP. To approximate the resulting non-Gaussian posterior distribution, we again use Laplace's method. 1D and 2D density estimation and density regression are implemented in the function lgpdens. The usage of the function is demonstrated in demo_lgpdens.
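A minimal sketch of calling lgpdens is given below. The calling conventions shown here (a single data vector for 1D estimation, a two-column matrix for 2D estimation) are assumptions based on the description above; demo_lgpdens documents the actual interface and options.

    % A hedged sketch of logistic GP density estimation with lgpdens (Section 8.5).
    % Argument conventions are assumptions; see demo_lgpdens for the documented usage.
    x = [randn(100,1); 3 + 0.5*randn(50,1)];   % 1D data with two modes
    lgpdens(x);                                % 1D density estimate

    xy = mvnrnd([0 0], [1 0.6; 0.6 1], 200);   % 2D data (mvnrnd needs the Statistics Toolbox)
    lgpdens(xy);                               % 2D density estimate with contours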
Figure 7: Density estimates for two different distributions. On the left, the blue line is the density estimate for the Galaxy data and the gray area is the 95% mean region; on the right is a 2D density estimate with contours based on the Old Faithful data. Figures are from demo_lgpdens.

8.6 Input dependent models

In input-dependent models more than one latent variable is used for modeling different parameters of the distribution. The additional latent variables in turn enable modeling dependencies between the inputs and the parameters.

8.6.1 Input-dependent Noise

The input-dependent noise model (Goldberg et al., 1997) can be used to model regression data. The model assumes Gaussian distributed data with latent variables defining both the mean and the variance of the Gaussian distribution, instead of just the mean as in the standard Gaussian likelihood. By setting independent GP priors for the two latent variables, it is possible to infer non-constant, i.e. input-dependent, noise in the data. The likelihood is defined as

    p(y | f^{(1)}, f^{(2)}, σ²) = ∏_{i=1}^{n} N( y_i | f_i^{(1)}, σ² exp(f_i^{(2)}) ),    (66)

with the latent function f^{(1)} defining the mean and f^{(2)} defining the variance. Input-dependent noise is demonstrated in demo_inputdependentnoise with both heteroscedastic and constant noise.

8.6.2 Input-dependent overdispersed Weibull

In the input-dependent Weibull model an additional latent variable is used for modeling the shape parameter r of the standard Weibull model. The likelihood is defined as

    p(y | f^{(1)}, f^{(2)}, z) = ∏_{i=1}^{n} r_i^{1−z_i} e^{−(1−z_i) f_i^{(1)}} y_i^{(1−z_i)(r_i−1)} e^{−y_i^{r_i} / e^{f_i^{(1)}}},    (67)

where z is a vector of censoring indicators, with z_i = 0 for an uncensored event and z_i = 1 for a right censored event for observation i. The latent variable f^{(2)} is used for modeling the shape parameter here. Because the dispersion parameter is constrained to be greater than 0, it is computed through exp(f^{(2)}) to ensure positiveness. The input-dependent Weibull model is demonstrated in demo_inputdependentweibull.

8.7 Monotonicity constraint

Riihimäki and Vehtari (2010) presented how a monotonicity constraint for a GP can be introduced with virtual derivative observations, and how the resulting posterior can be approximated with expectation propagation. Currently GPstuff supports the monotonicity constraint for models which have a Gaussian likelihood or fully factoring likelihoods with EP inference, and which use the covariance functions gpcf_constant, gpcf_linear and gpcf_sexp. Riihimäki and Vehtari (2010) used a type II MAP estimate for the hyperparameters, but GPstuff allows easy integration over the hyperparameters with any method described in Section 5. A GP model with monotonicity constraints on one or many covariates can be easily constructed with the function gp_monotonic, as sketched below. The monotonicity constraint is demonstrated in demo_monotonic and demo_monotonic2. Figure 8 displays the predictions with models with and without the monotonicity constraint. The monotonicity constraint allows the inclusion of sensible prior information that the death rate increases as age increases. The effect is shown most clearly for larger ages where there are fewer observations, but an overall reduction in the uncertainty related to the latent function can also be seen.
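The sketch below shows roughly how such a model could be set up. The gp_monotonic function name comes from the text above, but the option name used here ('nvd' for the covariates that are constrained) and the convention for the direction of monotonicity are assumptions; demo_monotonic and demo_monotonic2 show the documented usage.

    % A hedged sketch of adding a monotonicity constraint with gp_monotonic
    % (Section 8.7). The 'nvd' option name is an assumption; see demo_monotonic
    % and demo_monotonic2 for the documented interface.
    x = (30:65)';                                      % e.g. age
    y = 0.03*exp(0.08*(x - 30)) + 0.2*randn(size(x));  % toy, increasing response

    gpcf = gpcf_sexp('lengthScale', 10, 'magnSigma2', 1);
    lik  = lik_gaussian('sigma2', 0.1);
    gp   = gp_set('lik', lik, 'cf', gpcf);

    % Constrain the latent function to be monotonic with respect to covariate 1
    % using virtual derivative observations and EP.
    gp = gp_monotonic(gp, x, y, 'nvd', 1);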
9. Mean functions

In standard GP regression a zero mean function is assumed for the prior process. This is convenient, but there are nonetheless some advantages in using a specified mean function. The basic principle in doing GP regression with a mean function is to apply a zero mean GP to the difference between the observations and the mean function.

9.1 Explicit basis functions

Here we follow closely the presentation of (Rasmussen and Williams, 2006) on the subject and briefly present the main results. A mean function can be specified as a weighted sum of some basis functions h,

    m = h(x)^T β,

with weights β. The target of modeling, the underlying process g, is assumed to be a sum of the mean function and a zero mean GP,

    g = h(x)^T β + GP(0, K).

Figure 8: Illustration of how the monotonicity constraint of the latent function affects the predictions (deaths/1000 as a function of age; panels: sexp, sexp + monot., lin + sexp, lin + sexp + monot.). Mortality rate data from Broffitt (1988).

By assuming a Gaussian prior for the weights, β ∼ N(b, B), the weight parameters can be integrated out, and the prior for g is another GP prior,

    g ∼ GP( h(x)^T b, K + h(x)^T B h(x') ).

The predictive equations are obtained by using the mean and covariance of this prior in the zero mean GP predictive equations (15):

    E(g_*) = E(f_*) + R^T β̄,    (68)
    Cov(g_*) = Cov(f_*) + R^T ( B^{−1} + H K_y^{−1} H^T )^{−1} R,    (69)
    β̄ = ( B^{−1} + H K_y^{−1} H^T )^{−1} ( B^{−1} b + H K_y^{−1} y ),
    R = H_* − H K_y^{−1} K_*,
    H = [ h_1(x); h_2(x); ... ; h_k(x) ],  where x is a row vector.

If the prior assumptions about the weights are vague, then B^{−1} → O (O is a zero matrix) and the predictive equations (68) and (69) do not depend on b or B:

    E(g_*) = E(f_*) + R^T β_v,    (70)
    Cov(g_*) = Cov(f_*) + R^T ( H K_y^{−1} H^T )^{−1} R,    (71)
    β_v = ( H K_y^{−1} H^T )^{−1} H K_y^{−1} y.

Corresponding to the exact and vague priors for the basis functions' weights there are two versions of the marginal likelihood. With the exact prior the marginal likelihood is

    log p(y | X, b, B) = −(1/2) M^T N^{−1} M − (1/2) log|K_y| − (1/2) log|B| − (1/2) log|A| − (n/2) log 2π,
    M = H^T b − y,
    N = K_y + H^T B H,
    A = B^{−1} + H K_y^{−1} H^T,

where n is the number of observations. Its derivative with respect to the hyperparameters is

    ∂ log p(y | X, b, B) / ∂θ_i = (1/2) M^T N^{−1} (∂K_y/∂θ_i) N^{−1} M − (1/2) tr( K_y^{−1} ∂K_y/∂θ_i ) − (1/2) tr( A^{−1} ∂A/∂θ_i ),
    ∂A/∂θ_i = −H K_y^{−1} (∂K_y/∂θ_i) K_y^{−1} H^T.

With a vague prior the marginal likelihood is

    log p(y | X) = −(1/2) y^T K_y^{−1} y + (1/2) y^T C y − (1/2) log|K_y| − (1/2) log|A_v| − ((n − m)/2) log 2π,
    A_v = H K_y^{−1} H^T,
    C = K_y^{−1} H^T A_v^{−1} H K_y^{−1},

where m is the rank of H^T. Its derivative is

    ∂ log p(y | X) / ∂θ_i = (1/2) y^T P y + (1/2) ( −y^T P G − G^T P y + G^T P G ) − (1/2) tr( K_y^{−1} ∂K_y/∂θ_i ) − (1/2) tr( A_v^{−1} ∂A_v/∂θ_i ),    (72)
    P = K_y^{−1} (∂K_y/∂θ_i) K_y^{−1},
    G = H^T A_v^{−1} H K_y^{−1} y,

where we have used the fact that the matrices K_y^{−1}, ∂K_y/∂θ_i and A_v are symmetric. The above expression (72) could be simplified a little further because y^T P G + G^T P y = 2 · y^T P G. The use of mean functions is demonstrated in demo_regression_meanf.

10. Sparse Gaussian processes

The evaluation of the inverse and determinant of the covariance matrix in the log marginal likelihood (or its approximation) and its gradient scales as O(n³) in time. This restricts the implementation of GP models to moderately sized data sets. However, the unfavorable scaling in computational time can be alleviated with sparse approximations to the GP and with compactly supported (CS) covariance functions. The sparse approximations scale as O(nm²), where m < n, and the CS covariance functions lead to computations which scale, in general, as O(n³) but with a smaller constant than with traditional globally supported covariance functions. Many sparse GPs have been proposed in the literature and some of them are built into GPstuff.
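As a preview of the sparse approximations discussed in Section 10.2, the sketch below selects the FIC approximation when constructing the GP structure. The option names used for the approximation type and the inducing inputs ('type', 'X_u') are assumptions here; the demos referenced in Section 10.2.1 show the documented usage, and the inducing inputs can also be optimised together with the parameters (see Appendix F.4).

    % A hedged sketch of a FIC sparse approximation (Section 10.2).
    % The 'type' and 'X_u' option names are assumptions; see the Section 10.2.1
    % demos for the documented construction.
    n  = 1000;
    x  = linspace(0, 10, n)';
    y  = sin(x) + 0.3*randn(n, 1);
    Xu = linspace(0, 10, 20)';            % m = 20 inducing inputs

    gp = gp_set('type', 'FIC', ...
                'lik', lik_gaussian(), ...
                'cf', gpcf_sexp(), ...
                'X_u', Xu);

    % Parameters (and optionally the inducing inputs) are then optimised as usual:
    opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
    gp  = gp_optim(gp, x, y, 'opt', opt);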
10.1 Compactly supported covariance functions A CS covariance function gives zero correlation between data points whose distance exceeds a certain threshold. This leads to a sparse covariance matrix. The challenge with constructing CS covariance functions is to guarantee their positive definiteness and much literature has been devoted on the subject (see e.g. Sansò and Schuh, 1987; Wu, 1995; Wendland, 1995; Gaspari and Cohn, 1999; Gneiting, 1999, 2002; Buhmann, 2001) The CS functions implemented in GPstuff are Wendland’s piecewise polynomials kpp,q (Wendland, 2005), such as 2 σpp kpp,2 = (1 − r)j+2 (j 2 + 4j + 3)r2 + (3j + 6)r + 3 , (73) + 3 where j = bd/2c + 3. These functions correspond to processes that are q times mean square differentiable and are positive definite up to an input dimension d. Thus, the degree of the polynomial has to be increased alongside the input dimension. The dependence of CS covariance functions to the input dimension is very fundamental. There are no radial CS functions that are positive definite on every 0 is the more diffusive the length-scales of the mixing components are. The parameter lk > 0 characterizes the typical length-scale of the individual components in input dimension k. Periodic covariance function (gpcf_periodic) Many real world systems exhibit periodic phenomena, which can be modelled with a periodic covariance function. One possible construction (Rasmussen and Williams, 2006) is ! d X 2 sin2 (π(xi,k − xj,k )/γ) k(xi , xj ) = exp − , (121) lk2 k=1 where the parameter γ controls the inverse length of the periodicity and lk the smoothness of the process in dimension k. Product covariance function (gpcf_product) A product of two or more covariance functions, k1 (x, x0 ) · k2 (x, x0 )..., is a valid covariance function as well. Combining covariance functions in a product form can be done with gpcf_prod, for which the user can freely specify the covariance functions to be multiplied with each other from the collection of covariance functions implemented in GPstuff. Categorical covariance function (gpcf_cat) Categorical covariance function gpcf_cat returns correlation 1 if input values are equal and 0 otherwise. ( 1 if xi − xj = 0 k(xi , xj ) = (122) 0 otherwise Categorical covariance function can be combined with other covariance functions using gpcf_prod, for example, to produce hierarchical models. Appendix C. Observation models Here, we summarize all the observation models in GPstuff. Most of them are implemented in files lik_* which reminds that at the inference step they are considered likelihood functions. Gaussian (lik_gaussian) The i.i.d Gaussian noise with variance σ 2 is y | f , σ 2 ∼ N (f , σ 2 I). (123) Student-t (lik_t, lik_gaussiansmt) The Student-t observation model (implemented in lik_t) is −(ν+1)/2 n Y Γ((ν + 1)/2) (yi − fi )2 √ y | f , ν, σt ∼ 1+ , Γ(ν/2) νπσt νσt2 i=1 (124) where ν is the degrees of freedom and σ the scale parameter. The scale mixture version of the Student-t distribution is implemented in lik_gaussiansmt and it is parametrized as yi |fi , α, Ui ∼ N (fi , αUi ) 2 (125) 2 Ui ∼ Inv-χ (ν, τ ), (126) where each observation has its own noise variance αUi (Neal, 1997; Gelman et al., 2013). Logit (lik_logit) The logit transformation gives the probability for yi of being 1 or −1 as plogit (yi |fi ) = 1 . 1 + exp(−yi fi ) (127) Probit (lik_probit) The probit transformation gives the probability for yi of being 1 or −1 as Z yi f (xi ) pprobit (yi |fi ) = Φ(yi f (xi )) = N (z|0, 1)dz. 
−∞ (128) Poisson (lik_poisson) The Poisson observation model with expected number of cases e is y | f, e ∼ n Y Poisson(yi | exp(fi )ei ). (129) i=1 Negative-Binomial (lik_negbin) The negative-binomial is a robustified version of the Poisson distribution. It is parametrized y | f , e, r ∼ n Y Γ(r + yi ) i=1 yi !Γ(r) r r + µi r µi r + µi yi (130) where µi = e exp(fi ), r is the dispersion parameter governing the variance, ei is the expected number of cases and yi is positive integer telling the observed count. Binomial (lik_binomial) The binomial observation model with the probability of success pi = exp(fi )/(1 + exp(fi )) is N Y zi ! (131) y | f, z ∼ pyi (1 − pi )(zi −yi ) . yi !(zi − yi )! i i=1 Here, zi denotes the number of trials and yi is the number of successes. Weibull (lik_weibull) The Weibull likelihood is defined as L= n Y r1−zi exp ((1 − zi )f (xi ) + (1 − zi )(r − 1) log(yi ) − exp(f (xi ))yir ) , (132) i=1 where z is a vector of censoring indicators with zi = 0 for uncensored event and zi = 1 for right censored event for observation i and r is the shape parameter. Here we present only the likelihood function because we don’t have observation model for the censoring. Log-Gaussian (lik_loggaussian) The Log-Gaussian likelihood is defined as n Y 1 2 2 −(1−zi )/2 1−zi L= (2πσ ) yi exp − 2 (1 − zi )(log(yi ) − f (xi )) 2σ i=1 zi log(yi ) − f (xi ) × 1−Φ σ where σ is the scale parameter. (133) Log-logistic (lik_loglogistic) The log-logistic likelihood is defined as 1−zi r zi −2 n Y ry r−1 y L= 1+ exp(f (xi )) exp(f (xi )) (134) i=1 Cox proportional hazard model (lik_coxph) The likelihood contribution for the possibly right censored ith observation (yi , δi ) is assumed to be Z y i li = hi (yi )(1−δi ) exp − hi (t)dt . (135) 0 Using the piecewise log-constant assumption for the hazard rate function, the contribution of the observation i for the likelihood results in k−1 X li = [λk exp(ηi )](1−δi ) exp −[(yi − sk−1 )λk + (sg − sg−1 )λg ] exp(ηi ) , (136) g=1 where yi ∈ (sk−1 , sk ] Input-dependent noise (lik_inputdependentnoise) The input-dependent noise observation model is defined as y|f (1) ,f (2) 2 ,σ ∼ n Y (1) (2) N (yi |fi , σ 2 exp(fi )), (137) i=1 with latent function f (1) defining the mean and f (2) defining the variance. Input-dependent Weibull (lik_inputdependentweibull) The input-dependent Weibull observation model is defined as y|f (1) ,f (2) ,z ∼ n Y (2) (1) exp(fi )1−zi exp (1 − zi )fi (138) i=1 (2) (1) (2) exp(fi + (1 − zi )(exp(fi ) − 1) log(yi ) exp(fi )yi ) , where z is a vector of censoring indicators with zi = 0 for uncensored event and zi = 1 for right censored event for observation i. Zero-inflated Negative-Binomial (lik_zinegbin) The Zero-Inflated Negative-Binomial observation model is defined as r yi n Y Γ(r + yi ) r µi y | f , e, r ∼ , yi !Γ(r) r + µi r + µi i=1 where µi = ei exp(f (xi )) and r is the dispersion parameter governing the variance. (139) Appendix D. Priors This appendix lists all the priors implemented in the GPstuff package. Gaussian prior (prior_gaussian) The Gaussian distribution is parametrized as 1 1 p(θ) = √ exp − 2 (θ − µ)2 2σ 2πσ 2 (140) where µ is a location parameter and σ 2 is a scale parameter. Log-Gaussian prior (prior_loggaussian) The log-Gaussian distribution is parametrized as 1 2 p(θ) = √ exp − 2 (log(θ) − µ) 2σ θ 2πσ 2 1 (141) where µ is a location parameter and σ 2 is a scale parameter. 
Laplace prior (prior_laplace) The Laplace distribution is parametrized as |θ − µ| 1 exp − p(θ) = 2σ σ (142) where µ is a location parameter and σ > 0 is a scale parameter. Student-t prior (prior_t) The Student-t distribution is parametrized as Γ((ν + 1)/2) √ p(θ) = Γ(ν/2) νπσ 2 −(ν+1)/2 (θ − µ)2 1+ νσ 2 (143) where µ is a location parameter, σ 2 is a scale parameter and ν > 0 is the degrees of freedom. Square root Student-t prior (prior_sqrtt) The square root Student-t distribution is parametrized as p(θ1/2 ) = Γ((ν + 1)/2) √ Γ(ν/2) νπσ 2 −(ν+1)/2 (θ − µ)2 1+ νσ 2 (144) where µ is a location parameter, σ 2 is a scale parameter and ν > 0 is the degrees of freedom. Scaled inverse-χ2 prior (prior_sinvchi2) The scaled inverse-χ2 distribution is parametrized as p(θ) = (ν/2)ν/2 2 ν/2 −(ν/2+1) −νs2 /(2θ) (s ) θ e Γ(ν/2) (145) where s2 is a scale parameter and ν > 0 is the degrees of freedom parameter. Gamma prior (prior_gamma) The gamma distribution is parametrized as p(θ) = β α α−1 −βθ θ e Γ(α) (146) where α > 0 is a shape parameter and β > 0 is an inverse scale parameter. Inverse-gamma prior (prior_invgamma) The inverse-gamma distribution is parametrized as p(θ) = β α −(α+1) −β/θ θ e Γ(α) (147) where α > 0 is a shape parameter and β > 0 is a scale parameter. Uniform prior (prior_unif) The uniform prior is parametrized as p(θ) ∝ 1. (148) Square root uniform prior (prior_sqrtunif) The square root uniform prior is parametrized as p(θ1/2 ) ∝ 1. (149) Log-uniform prior (prior_logunif) The log-uniform prior is parametrized as p(log(θ)) ∝ 1. (150) Log-log-uniform prior (prior_loglogunif) The log-log-uniform prior is parametrized as p(log(log(θ))) ∝ 1. (151) Appendix E. Transformation of hyperparameters The inference on the parameters of covariance functions is conducted mainly transformed space. Most of ten used transformation is log-transformation, which has the advantage that the parameter space (0, ∞) is transformed into (−∞, ∞). The change of parametrization has to be taken into account in the evaluation of the probability densities of the model. If parameter θ with probability density pθ (θ) is transformed into the parameter w = f (θ) with equal number of components, the probability density of w is given by pw (w) = |J|pθ (f −1 (w)), (152) where J is the Jacobian of the transformation θ = f −1 (w). The parameter transformations are discussed shortly, for example, in Gelman et al. (2013)[p. 21]. Due to the log transformation w = log(θ) transformation the probability densities pθ (θ) are changed to the densities pw (w) = |J|pθ (exp(w)) = |J|pθ (θ), (153) where the Jacobian is J = ∂ exp(w) = exp(w) = θ. Now, given Gaussian observation model ∂w (see Section 3.1) the posterior of w can be written as pw (w|D) ∝ p(y|X, θ)p(θ|γ)θ, (154) which leads to energy function E(w) = − log p(y|X, θ) − log p(θ|γ) − log(|θ|). = E(θ) − log(θ), where the absolute value signs are not shown explicitly around θ because it is strictly positive. Thus, the log transformation just adds term − log θ in the energy function. The inference on w requires also the gradients of an energy function E(w). These can be obtained easily with the chain rule ∂E(w) ∂ ∂θ = [E(θ) − log(|J|)] ∂w ∂w ∂θ ∂E(θ) ∂ log(|J|) ∂θ = − ∂θ ∂θ ∂w ∂E(θ) 1 ∂|J| = − J. ∂θ |J| ∂θ (155) Here we have used the fact that the last term, derivative of θ with respect to w, is the same −1 ∂θ as the Jacobian J = ∂w = ∂f∂w . Now in the case of log transformation the Jacobian can be replaced by θ and the gradient is gotten an easy expression ∂E(w) ∂E(θ) = θ − 1. 
∂w ∂θ (156) Appendix F. Developer appendix This section provides additional description of lik, gpcf and prior functions, and argument interface for optimisation and sampling functions. F.1 lik functions New likelihoods can be added by copying and modifying one of the existing lik functions. We use lik_negbin as an example of log-concave likelihood and lik_t as an example of non-log-concave likelihood. Note that adding a new non-log-concave likelihood is more challenging as the posterior of the latent values may be multimodal, which makes it more difficult to implement stable Laplace and EP methods (see, e.g., Vanhatalo et al., 2009; Jylänki et al., 2011). F.1.1 lik_negbin inputParser is used to make argument checks and set some default values. If the new likelihood does not have parameters, see for example, lik_poisson. ip=inputParser; ip.FunctionName = ’LIK_NEGBIN’; ip.addOptional(’lik’, [], @isstruct); ip.addParamValue(’disper’,10, @(x) isscalar(x) && x>0); ip.addParamValue(’disper_prior’,prior_logunif(), @(x) isstruct(x) || isempty(x)); ip.parse(varargin{:}); lik=ip.Results.lik; ... Function handles to the subfunctions provide object-oriented behaviour (at time when GPstuff project was started, real classes in Matlab were too inefficient and Octave still lacks proper support). If some of the subfunctions have not been implemented, corresponding function handles should not be defined. if init % Set the function handles to the subfunctions lik.fh.pak = @lik_negbin_pak; lik.fh.unpak = @lik_negbin_unpak; lik.fh.lp = @lik_negbin_lp; lik.fh.lpg = @lik_negbin_lpg; lik.fh.ll = @lik_negbin_ll; lik.fh.llg = @lik_negbin_llg; lik.fh.llg2 = @lik_negbin_llg2; lik.fh.llg3 = @lik_negbin_llg3; lik.fh.tiltedMoments = @lik_negbin_tiltedMoments; lik.fh.siteDeriv = @lik_negbin_siteDeriv; lik.fh.predy = @lik_negbin_predy; lik.fh.predprcty = @lik_negbin_predprcty; lik.fh.invlink = @lik_negbin_invlink; lik.fh.recappend = @lik_negbin_recappend; end end lik_negbin_pak and lik_negbin_unpak functions are used to support generic optimization and sampling functions, which assume that the parameters are presented as a vector. Parameters which have empty prior ([]) are ignored. These subfunctions are mandatory (even if there are no parameters). lik_negbin_lp and lik_negbin_lpg compute the log prior density of the parameters and its gradient with respect to parameters. Parameters which have empty prior ([]) are ignored. These subfunctions are needed if there are likelihood parameters. lik_negbin_ll and lik_negbin_llg compute the log likelihood and its gradient with respect to parameters and latents. Parameters which have empty prior ([]) are ignored. These subfunctions are mandatory. lik_negbin_llg2 computes second gradients of the log likelihood. This subfunction is needed when using Laplace approximation or EP for inference with non-Gaussian likelihoods. lik_negbin_llg3 computes third gradients of the log likelihood. This subfunction is needed when using Laplace approximation for inference with non-Gaussian likelihoods. lik_negbin_tiltedMoments returns the marginal moments for the EP algorithm. This subfunction is needed when using EP for inference with non-Gaussian likelihoods. lik_negbin_siteDeriv evaluates the expectation of the gradient of the log likelihood term with respect to the likelihood parameters for EP. This subfunction is needed when using EP for inference with non-Gaussian likelihoods and there are likelihood parameters. lik_negbin_predy returns the predictive mean, variance and density of y. 
This subfunction is needed when computing posterior predictive distributions for future observations. lik_negbin_predprcty returns the percentiles of predictive density of y. This subfunction is needed when using function gp_predprcty. lik_negbin_init_negbin_norm is a private function for lik_negbin. It returns function handle to a function evaluating Negative-Binomial * Gaussian which is used for evaluating (likelihood * cavity) or (likelihood * posterior). This subfunction is needed by subfunctions lik_negbin_tiltedMoments, lik_negbin_siteDeriv and lik_negbin_predy. Note that this is not needed for those likelihoods for which integral over the likelihood times Gaussian is computed using other than quadrature integration. lik_negbin_invlink returns values of inverse link function. This subfunction is needed when using function gp_predprctmu. lik_negbin_recappend This subfunction is needed when using MCMC sampling (gp_mc). F.1.2 lik_t lik_t includes some additional subfunctions useful for non-log-concave likelihoods. lik_t_tiltedMoments2 returns the marginal moments for the EP algorithm. This subfunction is needed when using robust-EP for inference with non-Gaussian likelihoods. lik_t_siteDeriv2 evaluates the expectation of the gradient of the log likelihood term with respect to the likelihood parameters for EP. This subfunction is needed when using robust-EP for inference with non-Gaussian likelihoods and there are likelihood parameters. lik_t_optimizef function to optimize the latent variables with EM algorithm. This subfunction is needed when using likelihood specific optimization method for mode finding in the Laplace algorithm. F.2 gpcf functions New covariance functions can be added by copying and modifying one of the existing gpcf functions. We use gpcf_sexp as an example. F.2.1 gpcf_sexp inputParser is used to make argument checks and set some default values. If the new covariance function does not have parameters, corresponding lines can be removed (see, e.g., lik_cat). ip=inputParser; ip.FunctionName = ’GPCF_SEXP’; ip.addOptional(’gpcf’, [], @isstruct); ip.addParamValue(’magnSigma2’,0.1, @(x) isscalar(x) && x>0); ip.addParamValue(’lengthScale’,1, @(x) isvector(x) && all(x>0)); ip.addParamValue(’metric’,[], @isstruct); ip.addParamValue(’magnSigma2_prior’, prior_logunif(), ... @(x) isstruct(x) || isempty(x)); ip.addParamValue(’lengthScale_prior’,prior_t(), ... @(x) isstruct(x) || isempty(x)); ip.addParamValue(’selectedVariables’,[], @(x) isempty(x) || ... (isvector(x) && all(x>0))); ip.parse(varargin{:}); gpcf=ip.Results.gpcf; ... Optional ’metric’ can be used to replace default simple euclidean metric. Currently it is possible to use delta distance or group covariates to have common lengthScale parameters. Other potential metrics could be, for example, Manhattan or distance matrix based metrics. The metric function is called only if it has been explicitly set. This adds some additional code branches to the code, but avoids extra overhead in computation when using simple euclidean metric case. Function handles to the subfunctions provide object-oriented behaviour (at time when GPstuff project was started, real classes in Matlab were too inefficient and Octave still lacks proper support). If some of the subfunctions have not been implemented, corresponding function handles should not be defined. 
if init % Set the function handles to the subfunctions gpcf.fh.pak = @gpcf_sexp_pak; gpcf.fh.unpak = @gpcf_sexp_unpak; gpcf.fh.lp = @gpcf_sexp_lp; gpcf.fh.lpg= @gpcf_sexp_lpg; gpcf.fh.cfg = @gpcf_sexp_cfg; gpcf.fh.cfdg = @gpcf_sexp_cfdg; gpcf.fh.cfdg2 = @gpcf_sexp_cfdg2; gpcf.fh.ginput = @gpcf_sexp_ginput; gpcf.fh.ginput2 = @gpcf_sexp_ginput2; gpcf.fh.ginput3 = @gpcf_sexp_ginput3; gpcf.fh.ginput4 = @gpcf_sexp_ginput4; gpcf.fh.cov = @gpcf_sexp_cov; gpcf.fh.trcov = @gpcf_sexp_trcov; gpcf.fh.trvar = @gpcf_sexp_trvar; gpcf.fh.recappend = @gpcf_sexp_recappend; end gpcf_sexp_pak and gpcf_sexp_unpak functions are used to support generic optimization and sampling functions, which assume that parameters are presented as a vector. Parameters which have empty prior ([]) are ignored. These subfunctions are mandatory (even if there are no parameters). gpcf_sexp_lp and gpcf_sexp_lpg compute the log prior density of the parameters and its gradient with respect to parameters. Parameters which have empty prior ([]) are ignored. These subfunctions are mandatory. gpcf_sexp_cov evaluates covariance matrix between two input vectors. This is a mandatory subfunction used for example in prediction and energy computations. gpcf_sexp_trcov evaluates training covariance matrix of inputs. This is a mandatory subfunction used for example in prediction and energy computations. If available, gpcf_sexp_trcov uses faster C-implementation trcov for covariance computation (see linuxCsource/trcov.c). trcov includes C-code implementation for all currently available covariance functions. If trcov is not available, gpcf_sexp_trcov uses M-code computation. gpcf_sexp_trvar evaluates training variance vector. This is a mandatory subfunction used for example in prediction and energy computations. gpcf_sexp_cfg evaluates gradient of covariance function with respect to the parameters. gpcf_sexp_cfg has four different calling syntaxes. First one is a mandatory used in gradient computations. Second and third are needed when using sparse approximations (e.g. FIC). Fourth one is needed when using memory save option in gp_set (without memory save option, all matrix derivatives with respect to all covariance parameters are computed at once, which may take lot of memory if n is large and there are many parameters). gpcf_sexp_cfdg evaluates gradient of covariance function, of which has been taken partial derivative with respect to x, with respect to parameters. This subfunction is needed when using derivative observations. gpcf_sexp_cfdg2 evaluates gradient of covariance function, of which has been taken partial derivative with respect to both input variables x, with respect to parameters. This subfunction is needed when using derivative observations. gpcf_sexp_ginput evaluates gradient of covariance function with respect to x. This subfunction is needed when computing gradients with respect to inducing inputs in sparse approximations. gpcf_sexp_ginput2 evaluates gradient of covariance function with respect to both input variables x and x2. This subfunction is needed when computing gradients with respect to both input variables x and x2 (in same dimension). This subfunction is needed when using derivative observations. gpcf_sexp_ginput3 evaluates gradient of covariance function with respect to both input variables x and x2. This subfunction is needed when computing gradients with respect to both input variables x and x2 (in different dimensions). This subfunction is needed when using derivative observations. 
gpcf_sexp_ginput4 evaluates gradient of covariance function with respect to x. Simplified and faster version of sexp_ginput, returns full matrices. This subfunction is needed when using derivative observations. gpcf_sexp_recappend is needed when using MCMC sampling (gp_mc). F.3 prior functions New priors can be added by copying and modifying one of the existing prior functions. We use prior_t as an example. inputParser is used to make argument checks and set some default values. If the new prior does not have parameters, corresponding lines can be removed (see, e.g., prior_unif). ip=inputParser; ip.FunctionName = ’PRIOR_T’; ip.addOptional(’p’, [], @isstruct); ip.addParamValue(’mu’,0, @(x) isscalar(x)); ip.addParamValue(’mu_prior’,[], @(x) isstruct(x) || isempty(x)); ip.addParamValue(’s2’,1, @(x) isscalar(x) && x>0); ip.addParamValue(’s2_prior’,[], @(x) isstruct(x) || isempty(x)); ip.addParamValue(’nu’,4, @(x) isscalar(x) && x>0); ip.addParamValue(’nu_prior’,[], @(x) isstruct(x) || isempty(x)); ip.parse(varargin{:}); p=ip.Results.p; ... Function handles to the subfunctions provide object-oriented behaviour (at time when GPstuff project was started, real classes in Matlab were too inefficient and Octave still lacks proper support). If some of the subfunctions have not been implemented, corresponding function handles should not be defined. if init % set functions p.fh.pak = @prior_t_pak; p.fh.unpak = @prior_t_unpak; p.fh.lp = @prior_t_lp; p.fh.lpg = @prior_t_lpg; p.fh.recappend = @prior_t_recappend; end prior_t_pak and prior_t_unpak functions are used to support generic optimization and sampling functions, which assume that parameters are presented as a vector. Parameters which have empty prior ([]) are ignored. These subfunctions are mandatory (even if there are no parameters). prior_t_lp and prior_t_lpg compute the log prior density of the parameters and its gradient with respect to parameters. Parameters which have empty prior ([]) are ignored. These subfunctions are mandatory. prior_t_recappend This subfunction is needed when using MCMC sampling (gp_mc). F.4 Optimisation functions Adding new optimisation functions is quite easy as covariance function and likelihood parameters and inducing inputs can be optimised using any optimisation function using same syntax as, for example, fminunc by Mathworks or fminscg and fminlbfgs included in GPstuff. Syntax is X = FMINX(FUN, X0, OPTIONS) where FUN accepts input X and returns a scalar function value F and its scalar or vector gradient G evaluated at X, X0 is initial value and OPTIONS is option structure. Then new optimisation algorithm can be used with code gp=gp_optim(gp,x,y,’opt’,opt,’optimf’,@fminx); F.5 Other inference methods Adding new Monte Carlo sampling functions is more complicated. gp_mc function is used to allow the use of different sampling methods for covariance function parameters, likelihood parameters and latent values with one function call. Thus adding new sampling method requires modification of gp_mc. Note also the subfunctions gpmc_e and gpmc_g, which return energy (negative log posterior density) and its gradient while handling infer_params option properly. See gp_mc for further information. Adding new analytic approximation, for example, variational Bayes approach, would require major effort implementing many inference related functions. To start see functions with name starting gpla_ or gpep_ and gp_set.
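To make the optimiser interface of Section F.4 concrete, the sketch below wraps a plain gradient-descent step in a function with the required X = FMINX(FUN, X0, OPTIONS) syntax. It is a toy illustration only (fixed step size, crude stopping rule), not a recommended optimiser; fminscg and fminlbfgs shipped with GPstuff should be preferred in practice.

    function x = fmingd(fun, x0, options)
    % FMINGD  Toy gradient-descent optimiser with the interface required by
    % gp_optim (Section F.4): FUN returns the objective value and its gradient,
    % X0 is the initial parameter vector, OPTIONS is an optimset-style struct.
    % Illustrative sketch only; use fminscg or fminlbfgs in practice.
      stepsize = 1e-3;
      maxiter  = 1000;
      tolx     = 1e-6;
      if nargin > 2 && ~isempty(options) && isfield(options, 'TolX') && ~isempty(options.TolX)
        tolx = options.TolX;
      end
      x = x0;
      for iter = 1:maxiter
        [f, g] = fun(x);          % negative log posterior and its gradient
        xnew = x - stepsize*g;    % plain gradient step
        if norm(xnew - x) < tolx
          x = xnew;
          break
        end
        x = xnew;
      end
    end

It could then be used in place of the default optimiser with

    gp = gp_optim(gp, x, y, 'opt', opt, 'optimf', @fmingd);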