Stan User Manual, Version 2.17.0
Stan Modeling Language: User's Guide and Reference Manual

Stan Development Team
Stan Version 2.17.0
Tuesday 5th September, 2017
mc-stan.org

Cite as: Stan Development Team (2017) Stan Modeling Language: User's Guide and Reference Manual. Version 2.17.0.

Copyright © 2011–2017, Stan Development Team. This document is distributed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). For full details, see https://creativecommons.org/licenses/by/4.0/legalcode

The Stan logo is distributed under the Creative Commons Attribution-NoDerivatives 4.0 International License (CC BY-ND 4.0). For full details, see https://creativecommons.org/licenses/by-nd/4.0/legalcode

Stan Development Team

Currently Active Developers

This is the list of current developers in order of joining the development team (see the next section for former development team members).

• Andrew Gelman (Columbia University): Stan, RStan, RStanArm
• Bob Carpenter (Columbia University): Stan Math, Stan, CmdStan
• Daniel Lee (Stan Group, Inc.): Stan Math, Stan, CmdStan, RStan, PyStan, dev ops
• Ben Goodrich (Columbia University): Stan Math, Stan, RStan, RStanArm
• Michael Betancourt (University of Warwick): Stan Math, Stan, CmdStan
• Marcus Brubaker (York University): Stan Math, Stan
• Jiqiang Guo (NPD Group): Stan Math, Stan, RStan
• Allen Riddell (Indiana University): PyStan, dev ops
• Marco Inacio (University of São Paulo/UFSCar): Stan Math, Stan
• Jeffrey Arnold (University of Washington): Emacs mode, Pygments mode
• Mitzi Morris (Consultant, New York): Stan Math, Stan, CmdStan, dev ops
• Rob J. Goedman (Consultant, La Jolla, California): Stan, Stan.jl
• Brian Lau (CNRS, Paris): MatlabStan
• Rob Trangucci (Columbia University): Stan Math, Stan, RStan
• Jonah Sol Gabry (Columbia University): RStan, RStanArm, ShinyStan, Loo, BayesPlot
• Robert L. Grant (Consultant, London): StataStan
• Krzysztof Sakrejda (University of Massachusetts, Amherst): Stan Math, Stan
• Aki Vehtari (Aalto University): Stan Math, Stan, MatlabStan, RStan, Loo, BayesPlot
• Rayleigh Lei (University of Michigan): Stan Math, Stan, RStan
• Sebastian Weber (Novartis Pharma): Stan Math, Stan, RStan, RStanArm, BayesPlot
• Charles Margossian (Metrum LLC): Stan Math, Stan
• Thel Seraphim (Columbia University): Stan Math, Stan
• Vincent Picaud (CEA, France): MathematicaStan
• Imad Ali (Columbia University): RStan, RStanArm
• Sean Talts (Columbia University): Stan Math, Stan, dev ops
• Ben Bales (University of California, Santa Barbara): Stan Math
• Ari Hartikainen (Aalto University): PyStan

Development Team Alumni

These are developers who have made important contributions in the past, but are no longer contributing actively.

• Matt Hoffman (while at Columbia University): Stan Math, Stan, CmdStan
• Michael Malecki (while at Columbia University): software and graphical design
• Peter Li (while at Columbia University): Stan Math, Stan
• Yuanjun Guo (while at Columbia University): Stan
• Alp Kucukelbir (while at Columbia University): Stan, CmdStan
• Dustin Tran (while at Columbia University): Stan, CmdStan

Contents

Preface
Acknowledgements

Part I: Introduction
1. Overview

Part II: Stan Modeling Language
2. Encodings, Includes, and Comments
3. Data Types and Variable Declarations
4. Expressions
5. Statements
6. Program Blocks
7. User-Defined Functions
8. Execution of a Stan Program

Part III: Example Models
9. Regression Models
10. Time-Series Models
11. Missing Data & Partially Known Parameters
12. Truncated or Censored Data
13. Finite Mixtures
14. Measurement Error and Meta-Analysis
15. Latent Discrete Parameters
16. Sparse and Ragged Data Structures
17. Clustering Models
18. Gaussian Processes
19. Directions, Rotations, and Hyperspheres
20. Solving Algebraic Equations
21. Solving Differential Equations

Part IV: Programming Techniques
22. Reparameterization & Change of Variables
23. Custom Probability Functions
24. User-Defined Functions
25. Problematic Posteriors
26. Matrices, Vectors, and Arrays
27. Multiple Indexing and Range Indexing
28. Optimizing Stan Code for Efficiency

Part V: Inference
29. Bayesian Data Analysis
30. Markov Chain Monte Carlo Sampling
31. Penalized Maximum Likelihood Point Estimation
32. Bayesian Point Estimation
33. Variational Inference

Part VI: Algorithms & Implementations
34. Hamiltonian Monte Carlo Sampling
35. Transformations of Constrained Variables
36. Optimization Algorithms
37. Variational Inference
38. Diagnostic Mode

Part VII: Built-In Functions
39. Void Functions
40. Integer-Valued Basic Functions
41. Real-Valued Basic Functions
42. Array Operations
43. Matrix Operations
44. Sparse Matrix Operations
45. Mixed Operations
46. Compound Arithmetic and Assignment
47. Algebraic Equation Solver
48. Ordinary Differential Equation Solvers

Part VIII: Discrete Distributions
49. Conventions for Probability Functions
50. Binary Distributions
51. Bounded Discrete Distributions
52. Unbounded Discrete Distributions
53. Multivariate Discrete Distributions

Part IX: Continuous Distributions
54. Unbounded Continuous Distributions
55. Positive Continuous Distributions
56. Non-negative Continuous Distributions
57. Positive Lower-Bounded Probabilities
58. Continuous Distributions on [0, 1]
59. Circular Distributions
60. Bounded Continuous Probabilities
61. Distributions over Unbounded Vectors
62. Simplex Distributions
63. Correlation Matrix Distributions
64. Covariance Matrix Distributions

Part X: Software Development
65. Model Building as Software Development
66. Software Development Lifecycle
67. Reproducibility
68. Contributed Modules
69. Stan Program Style Guide

Appendices
A. Licensing
B. Stan for Users of BUGS
C. Modeling Language Syntax
D. Warning and Error Messages
E. Deprecated Features
F. Mathematical Functions

Bibliography
Index

Preface

Why Stan?

We did not set out to build Stan as it currently exists. We set out to apply full Bayesian inference to the sort of multilevel generalized linear models discussed in Part II of (Gelman and Hill, 2007). These models are structured with grouped and interacted predictors at multiple levels, hierarchical covariance priors, nonconjugate coefficient priors, latent effects as in item-response models, and varying output link functions and distributions.

The models we wanted to fit turned out to be a challenge for current general-purpose software. A direct encoding in BUGS or JAGS can grind these tools to a halt. Matt Schofield found his multilevel time-series regression of climate on tree-ring measurements wasn't converging after hundreds of thousands of iterations.

Initially, Aleks Jakulin spent some time working on extending the Gibbs sampler in the Hierarchical Bayesian Compiler (Daumé, 2007), which, as its name suggests, is compiled rather than interpreted. But even an efficient and scalable implementation does not solve the underlying problem that Gibbs sampling does not fare well with highly correlated posteriors. We finally realized we needed a better sampler, not a more efficient implementation.

We briefly considered trying to tune proposals for a random-walk Metropolis-Hastings sampler, but that seemed too problem specific and not even necessarily possible without some kind of adaptation rather than tuning of the proposals.

The Path to Stan

We were at the same time starting to hear more and more about Hamiltonian Monte Carlo (HMC) and its ability to overcome some of the problems inherent in Gibbs sampling.
Matt Schofield managed to fit the tree-ring data using a hand-coded implementation of HMC, finding it converged in a few hundred iterations.

HMC appeared promising but was also problematic in that the Hamiltonian dynamics simulation requires the gradient of the log posterior. Although it's possible to do this by hand, it is very tedious and error prone. That's when we discovered reverse-mode algorithmic differentiation, which lets you write down a templated C++ function for the log posterior and automatically compute a proper analytic gradient up to machine precision accuracy in only a few multiples of the cost to evaluate the log probability function itself.

We explored existing algorithmic differentiation packages with open licenses such as RAD (Gay, 2005) and its repackaging in the Sacado module of the Trilinos toolkit, and the CppAD package in the COIN-OR toolkit. But neither package supported very many special functions (e.g., probability functions, log gamma, inverse logit) or linear algebra operations (e.g., Cholesky decomposition), nor was either easily and modularly extensible. So we built our own reverse-mode algorithmic differentiation package.

But once we'd built our own reverse-mode algorithmic differentiation package, the problem was that we could not just plug in the probability functions from a package like Boost because they weren't templated on all the arguments. We only needed algorithmic differentiation variables for parameters, not data or transformed data, and promotion is very inefficient in both time and memory. So we wrote our own fully templated probability functions.

Next, we integrated the Eigen C++ package for matrix operations and linear algebra functions. Eigen makes extensive use of expression templates for lazy evaluation and the curiously recurring template pattern to implement concepts without virtual function calls.
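The reverse-mode idea can be seen in miniature. The following sketch is illustrative only (written in Python rather than the C++ of Stan's math library, and vastly simpler): each operation records its partial derivatives, and a single reverse sweep accumulates the gradient of a normal log density at a small multiple of the cost of evaluating it.

```python
import math

# Minimal reverse-mode automatic differentiation sketch (illustrative only).
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # (input Var, local partial derivative) pairs
        self.grad = 0.0

    def __add__(self, o):
        o = o if isinstance(o, Var) else Var(o)
        return Var(self.value + o.value, [(self, 1.0), (o, 1.0)])
    __radd__ = __add__

    def __sub__(self, o):
        o = o if isinstance(o, Var) else Var(o)
        return Var(self.value - o.value, [(self, 1.0), (o, -1.0)])

    def __rsub__(self, o):
        return Var(o) - self

    def __mul__(self, o):
        o = o if isinstance(o, Var) else Var(o)
        return Var(self.value * o.value, [(self, o.value), (o, self.value)])
    __rmul__ = __mul__

    def __truediv__(self, o):
        o = o if isinstance(o, Var) else Var(o)
        return Var(self.value / o.value,
                   [(self, 1.0 / o.value), (o, -self.value / o.value ** 2)])

def log(x):
    return Var(math.log(x.value), [(x, 1.0 / x.value)])

def backward(out):
    # Reverse topological sweep accumulating adjoints via the chain rule.
    out.grad = 1.0
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p, _ in v.parents:
                visit(p)
            order.append(v)
    visit(out)
    for v in reversed(order):
        for p, d in v.parents:
            p.grad += v.grad * d

# Gradient of a normal log density (constant term omitted) w.r.t. mu, sigma.
y, mu, sigma = 1.3, Var(0.5), Var(2.0)
z = (y - mu) / sigma
lp = -0.5 * z * z - log(sigma)
backward(lp)

# Check against the analytic gradient.
assert abs(mu.grad - (y - mu.value) / sigma.value ** 2) < 1e-12
assert abs(sigma.grad -
           ((y - mu.value) ** 2 / sigma.value ** 3 - 1.0 / sigma.value)) < 1e-12
```

A single forward evaluation builds the expression graph; the backward pass then yields all partial derivatives at once, which is what makes the approach attractive for log posteriors with many parameters.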
But we ran into the same problem with Eigen as with the existing probability libraries: it doesn't support mixed operations of algorithmic differentiation variables and primitives like double.

At this point (Spring 2011), we were happily fitting models coded directly in C++ on top of the pre-release versions of the Stan API. Seeing how well this all worked, we set our sights on the generality and ease of use of BUGS. So we designed a modeling language in which statisticians could write their models in familiar notation that could be transformed to efficient C++ code and then compiled into an efficient executable program. It turned out that our modeling language was a bit more general than we'd anticipated, and we had an imperative probabilistic programming language on our hands.1

The next problem we ran into as we started implementing richer models was variables with constrained support (e.g., simplexes and covariance matrices). Although it is possible to implement HMC with bouncing for simple boundary constraints (e.g., positive scale or precision parameters), it's not so easy with more complex multivariate constraints. To get around this problem, we introduced typed variables and automatically transformed them to unconstrained support, with suitable adjustments to the log probability from the log absolute Jacobian determinant of the inverse transforms.

Even with the prototype compiler generating models, we still faced a major hurdle to ease of use. HMC requires a step size (discretization time) and number of steps (for total simulation time), and is very sensitive to how they are set. The step size parameter could be tuned during warmup based on Metropolis rejection rates, but the number of steps was not so easy to tune while maintaining detailed balance in the sampler.
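The change-of-variables adjustment can be illustrated with the simplest constrained type, a lower-bounded scalar. This is a Python sketch under an assumed exponential(1) target, not Stan's implementation: a positive parameter sigma is represented on the unconstrained scale y = log(sigma), and the log density picks up log |d sigma / d y| = y.

```python
import math

# Illustrative sketch of Stan's transform for a positive-constrained scalar.

def target_lpdf(sigma):
    # Assumed example target: exponential(1) density on sigma > 0.
    return -sigma

def unconstrained_lpdf(y):
    sigma = math.exp(y)   # inverse transform back to the constrained scale
    log_jacobian = y      # log |d sigma / d y| = log(exp(y)) = y
    return target_lpdf(sigma) + log_jacobian

# The transformed density must still integrate to 1 over the whole real
# line; check by simple trapezoidal quadrature.
lo, hi, n = -30.0, 5.0, 20000
h = (hi - lo) / n
total = sum(math.exp(unconstrained_lpdf(lo + i * h)) for i in range(n + 1))
total -= 0.5 * (math.exp(unconstrained_lpdf(lo)) + math.exp(unconstrained_lpdf(hi)))
total *= h
assert abs(total - 1.0) < 1e-4
```

Without the Jacobian term, the sampler would target the wrong distribution on the unconstrained scale; with it, any proposal for y maps back to a valid positive sigma, so no boundary handling is needed.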
This led to the development of the No-U-Turn sampler (NUTS) (Hoffman and Gelman, 2011, 2014), which takes an exponentially increasing number of steps (structured as a binary tree) forward and backward in time until the direction of the simulation turns around, then uses slice sampling to select a point on the simulated trajectory.

Although not part of the original Stan prototype, which used a unit mass matrix, Stan now allows a diagonal or dense mass matrix to be estimated during warmup. This allows adjustment for globally scaled or correlated parameters. Without this adjustment, models with differently scaled parameters could only mix as quickly as their most constrained parameter allowed.

We thought we were home free at this point. But when we measured the speed of some BUGS examples versus Stan, we were very disappointed. The very first example model, Rats, ran more than an order of magnitude faster in JAGS than in Stan. Rats is a tough test case because the conjugate priors and lack of posterior correlations make it an ideal candidate for efficient Gibbs sampling. But we thought the efficiency of compilation might compensate for the lack of ideal fit to the problem.

We realized we were doing redundant calculations, so we wrote a vectorized form of the normal distribution for multiple variates with the same mean and scale, which sped things up a bit. At the same time, we introduced some simple template metaprograms to remove the calculation of constant terms in the log probability. These both improved speed, but not enough. Finally, we figured out how to both vectorize and partially evaluate the gradients of the densities using a combination of expression templates and metaprogramming.

1 In contrast, BUGS and JAGS can be viewed as declarative probabilistic programming languages for specifying a directed graphical model. In these languages, stochastic and deterministic (poor choice of name) nodes may represent random quantities.
At this point, we are within a small multiple of a hand-coded gradient function.

Later, when we were trying to fit a time-series model, we found that normalizing the data to unit sample mean and variance sped up the fits by an order of magnitude. Although HMC and NUTS are rotation invariant (explaining why they can sample effectively from multivariate densities with high correlations), they are not scale invariant. Gibbs sampling, on the other hand, is scale invariant, but not rotation invariant.

We were still using a unit mass matrix in the simulated Hamiltonian dynamics. The last tweak to Stan before version 1.0 was to estimate a diagonal mass matrix during warmup; this has since been upgraded to a full mass matrix in version 1.2. Both these extensions go a bit beyond the NUTS paper on arXiv. Using a mass matrix sped up the unscaled data models by an order of magnitude, though it breaks the nice theoretical property of rotation invariance. The full mass matrix estimation has rotational invariance as well, but scales less well because of the need to invert the mass matrix at the end of adaptation blocks and then perform matrix multiplications every leapfrog step.

Stan 2

It's been over a year since the initial release of Stan, and we have been overjoyed by the quantity and quality of models people are building with Stan. We've also been a bit overwhelmed by the volume of traffic on our user's list and issue tracker. We've been particularly happy about all the feedback we've gotten about installation issues as well as bugs in the code and documentation. We've been pleasantly surprised at the number of such requests which have come with solutions in the form of a GitHub pull request. That certainly makes our life easy.

As the code base grew and as we became more familiar with it, we came to realize that it required a major refactoring (see, for example, (Fowler et al., 1999) for a nice discussion of refactoring).
So while the outside hasn't changed dramatically in Stan 2, the inside is almost totally different in terms of how the HMC samplers are organized, how the output is analyzed, how the mathematics library is organized, etc.

We've also improved our original simple optimization algorithms and now use L-BFGS (a limited-memory quasi-Newton method that uses gradients and a short history of the gradients to make a rolling estimate of the Hessian). We've added more compile-time and run-time error checking for models. We've added many new functions, including new matrix functions and new distributions. We've added some new parameterizations and managed to vectorize all the univariate distributions. We've increased compatibility with a range of C++ compilers.

We've also tried to fill out the manual to clarify things like array and vector indexing, programming style, and the I/O and command-line formats. Most of these changes are direct results of user-reported confusions. So please let us know where we can be clearer or more fully explain something.

Finally, we've fixed all the bugs that we know about. It was keeping up with the latter that really set the development time back, including bugs that resulted in our having to add more error checking. Perhaps most importantly, we've developed a much stricter process for unit testing, code review, and automated integration testing (see Chapter 66).

As fast and scalable as Stan's MCMC sampling is, for large data sets it can still be prohibitively slow. Stan 2.7 introduced variational inference for arbitrary Stan models. In contrast to penalized maximum likelihood, which finds the posterior mode, variational inference finds an approximation to the posterior mean (both methods use curvature to estimate a multivariate normal approximation to posterior covariance). This promises Bayesian inference at much larger scale than is possible with MCMC methods.
In examples we've run, problems that take days with MCMC complete in half an hour with variational inference. There is still a long road ahead in understanding these variational approximations, both in how good the multivariate approximation is to the true posterior and which forms of models can be fit efficiently, scalably, and reliably.

Stan's Future

We're not done. There's still an enormous amount of work to do to improve Stan. Some older, higher-level goals are in a standalone to-do list:

https://github.com/stan-dev/stan/wiki/Longer-Term-To-Do-List

We are gradually weaning ourselves off of the to-do list in favor of the GitHub issue tracker (see the next section for a link).

Some major features are on our short-term horizon: Riemannian manifold Hamiltonian Monte Carlo (RHMC), transformed Laplace approximations with uncertainty quantification for maximum likelihood estimation, marginal maximum likelihood estimation, data-parallel expectation propagation, and streaming (stochastic) variational inference. The latter has been prototyped and described in papers. We will also continue to work on improving numerical stability and efficiency throughout. In addition, we plan to revise the interfaces to make them easier to understand and more flexible to use (a difficult pair of goals to balance).

Later in the Stan 2 release cycle (Stan 2.7), we added variational inference to Stan's sampling and optimization routines, with the promise of approximate Bayesian inference at much larger scales than is possible with Monte Carlo methods. The future plans involve extending to a stochastic data-streaming implementation for very large-scale data problems.

You Can Help

Please let us know if you have comments about this manual or suggestions for Stan. We're especially interested in hearing about models you've fit or had problems fitting with Stan. The best way to communicate with the Stan team about user issues is through the following user's group.
http://groups.google.com/group/stan-users

For reporting bugs or requesting features, Stan's issue tracker is at the following location.

https://github.com/stan-dev/stan/issues

One of the main reasons Stan is freedom-respecting, open-source software2 is that we love to collaborate. We're interested in hearing from you if you'd like to volunteer to get involved on the development side. We have all kinds of projects big and small that we haven't had time to code ourselves. For developer's issues, we have a separate group.

http://groups.google.com/group/stan-dev

To contact the project developers off the mailing lists, send email to mc.stanislaw@gmail.com

The Stan Development Team
Tuesday 5th September, 2017

2 See Appendix A for more information on Stan's licenses and the licenses of the software on which it depends.

Acknowledgements

Institutions

We thank Columbia University along with the Departments of Statistics and Political Science, the Applied Statistics Center, the Institute for Social and Economic Research and Policy (ISERP), and the Core Research Computing Facility.

Grants and Corporate Support

Without the following grant and consulting support, Stan would not exist.

Current Grants

• U. S. Department of Education Institute of Education Sciences
  – Statistical and Research Methodology: Solving Difficult Bayesian Computation Problems in Education Research Using Stan
• Alfred P. Sloan Foundation
  – G-2015-13987: Stan Community and Continuity (non-research)
• U. S. Office of Naval Research (ONR)
  – Informative Priors for Bayesian Inference and Regularization

Previous Grants

Stan was supported in part by

• U. S. Department of Energy
  – DE-SC0002099: Petascale Computing
• U. S. National Science Foundation
  – ATM-0934516: Reconstructing Climate from Tree Ring Data
  – CNS-1205516: Stan: Scalable Software for Bayesian Modeling
• U. S. Department of Education Institute of Education Sciences
  – ED-GRANTS-032309-005: Practical Tools for Multilevel Hierarchical Modeling in Education Research
  – R305D090006-09A: Practical Solutions for Missing Data
• U. S. National Institutes of Health
  – 1G20RR030893-01: Research Facility Improvement Grant

Stan Logo

The original Stan logo was designed by Michael Malecki. The current logo is designed by Michael Betancourt, with special thanks to Stephanie Mannheim (http://www.stephaniemannheim.com/) for critical refinements. The Stan logo is copyright 2015 Michael Betancourt and released for use under the CC BY-ND 4.0 license (i.e., no derivative works allowed).

Individuals

We thank John Salvatier for pointing us to automatic differentiation and HMC in the first place. And a special thanks to Kristen van Leuven (formerly of Columbia's ISERP) for help preparing our initial grant proposals.

Code and Doc Patches

Thanks for bug reports, code patches, pull requests, and diagnostics to: Ethan Adams, Avraham Adler, Jarret Barber, David R. Blair, Miguel de Val-Borro, Ross Boylan, Eric N. Brown, Devin Caughey, Emmanuel Charpentier, Daniel Chen, Jacob Egner, Ashley Ford, Jan Gläscher, Robert J. Goedman, Danny Goldstein, Tom Haber, B. Harris, Kevin Van Horn, Stephen Hoover, Andrew Hunter, Bobby Jacob, Bruno Jacobs, Filip Krynicki, Dan Lakeland, Devin Leopold, Nathanael I. Lichti, Jussi Määttä, Titus van der Malsburg, P. D. Metcalfe, Kyle Meyer, Linas Mockus, Jeffrey Oldham, Tomi Peltola, Joerg Rings, Cody T. Ross, Patrick Snape, Matthew Spencer, Wiktor Soral, Alexey Stukalov, Fernando H. Toledo, Arseniy Tsipenyuk, Zhenming Su, Matius Simkovic, Matthew Zeigenfuse, and Alex Zvoleff.

Thanks for documentation bug reports and patches to: alvaro1101 (GitHub handle), Avraham Adler, Chris Anderson, Asim, Jarret Barber, Ryan Batt, Frederik Beaujean, Guido Biele, Luca Billi, Chris Black, botanize (GitHub handle), Portia Brat, Arthur Breitman, Eric C. Brown, Juan Sebastián Casallas, Alex Chase, Daniel Chen, Roman Cheplyaka, Andy Choi, David Chudzicki, Michael Clerx, Andria Dawson, daydreamt (GitHub handle), Conner DiPaolo, Eric Innocents Eboulet, José Rojas Echenique, Andrew Ellis, Gökçen Eraslan, Rick Farouni, Avi Feller, Seth Flaxman, Wayne Folta, Ashley Ford, Kyle Foreman, Mauricio Garnier-Villarreal, Christopher Gandrud, Jonathan Gilligan, John Hall, David Hallvig, David Harris, C. Hoeppler, Cody James Horst, Herra Huu, Bobby Jacob, Max Joseph, Julian King, Fränzi Korner-Nievergelt, Juho Kokkala, Takahiro Kubo, Mike Lawrence, Louis Luangkesorn, Tobias Madsen, Stefano Mangiola, David Manheim, Stephen Martin, Sean Matthews, David Mawdsley, Dieter Menne, Evelyn Mitchell, Javier Moreno, Robert Myles, Sunil Nandihalli, Eric Novik, Julia Palacios, Tamas Papp, Anders Gorm Pedersen, Tomi Peltola, Andre Pfeuffer, Sergio Polini, Joerg Rings, Sean O'Riordain, Brendan Rocks, Cody Ross, Mike Ross, Tony Rossini, Nathan Sanders, James Savage, Terrance Savitsky, Dan Schrage, Gary Schulz, seldomworks (GitHub handle), Janne Sinkkonen, skanskan (GitHub handle), Yannick Spill, sskates (GitHub handle), Martin Stjernman, Dan Stowell, Alexey Stukalov, Dougal Sutherland, John Sutton, Maciej Swat, J. Takoua, Andrew J. Tanentzap, Shravan Vashisth, Aki Vehtari, Damjan Vukcevic, Matt Wand, Amos Waterland, Sebastian Weber, Sam Weiss, Luke Wiklendt, wrobell (GitHub handle), Howard Zail, Jon Zelner, and Xiubo Zhang.

Thanks to Kevin van Horn for install instructions for Cygwin and to Kyle Foreman for instructions on using the MKL compiler.

Bug Reports

We're really thankful to everyone who's had the patience to try to get Stan working and reported bugs. All the gory details are available from Stan's issue tracker at the following URL.
https://github.com/stan-dev/stan/issues

[Figure: Stanislaw Ulam, namesake of Stan and co-inventor of Monte Carlo methods (Metropolis and Ulam, 1949), shown here holding the Fermiac, Enrico Fermi's physical Monte Carlo simulator for neutron diffusion. Image from (Giesler, 2000).]

Part I: Introduction

1. Overview

This document is both a user's guide and a reference manual for Stan's probabilistic modeling language. This introductory chapter provides a high-level overview of Stan.
The remaining parts of this document include a practically-oriented user's guide for programming models and a detailed reference manual for Stan's modeling language and associated programs and data formats.

1.1. Stan Home Page

For links to up-to-date code, examples, manuals, bug reports, feature requests, and everything else Stan related, see the Stan home page:

http://mc-stan.org/

1.2. Stan Interfaces

There are three interfaces for Stan that are supported as part of the Stan project. Models and their use are the same across the three interfaces, and this manual is the modeling language manual for all three interfaces. All of the interfaces share initialization, sampling and tuning controls, and roughly share posterior analysis functionality. The interfaces all provide getting-started guides, documentation, and full source code.

CmdStan

CmdStan allows Stan to be run from the command line. In some sense, CmdStan is the reference implementation of Stan. The CmdStan documentation used to be part of this document, but is now its own standalone document. The CmdStan home page is

http://mc-stan.org/cmdstan.html

RStan

RStan is the R interface to Stan. RStan interfaces to Stan through R's memory rather than just calling Stan from the outside, as in the R2WinBUGS and R2jags interfaces on which it was modeled. The RStan home page is

http://mc-stan.org/rstan.html

PyStan

PyStan is the Python interface to Stan. Like RStan, it interfaces at the Python memory level rather than calling Stan from the outside. The PyStan home page is

http://mc-stan.org/pystan.html

MatlabStan

MatlabStan is the MATLAB interface to Stan. Unlike RStan and PyStan, MatlabStan currently wraps a CmdStan process. The MatlabStan home page is

http://mc-stan.org/matlab-stan.html

Stan.jl

Stan.jl is the Julia interface to Stan. Like MatlabStan, Stan.jl wraps a CmdStan process. The Stan.jl home page is

http://mc-stan.org/julia-stan.html

StataStan

StataStan is the Stata interface to Stan.
Like MatlabStan, StataStan wraps a CmdStan process. The StataStan home page is

http://mc-stan.org/stata-stan.html

MathematicaStan

MathematicaStan is the Mathematica interface to Stan. Like MatlabStan, MathematicaStan wraps a CmdStan process. The MathematicaStan home page is

http://mc-stan.org/mathematica-stan.html

1.3. Stan Programs

A Stan program defines a statistical model through a conditional probability function p(θ|y, x), where θ is a sequence of modeled unknown values (e.g., model parameters, latent variables, missing data, future predictions), y is a sequence of modeled known values, and x is a sequence of unmodeled predictors and constants (e.g., sizes, hyperparameters).

Stan programs consist of variable type declarations and statements. Variable types include constrained and unconstrained integer, scalar, vector, and matrix types, as well as (multidimensional) arrays of other types. Variables are declared in blocks corresponding to the variable's use: data, transformed data, parameter, transformed parameter, or generated quantity. Unconstrained local variables may be declared within statement blocks.

The transformed data, transformed parameter, and generated quantities blocks contain statements defining the variables declared in their blocks. A special model block consists of statements defining the log probability for the model.

Within the model block, BUGS-style sampling notation may be used as shorthand for incrementing an underlying log probability variable, the value of which defines the log probability function. The log probability variable may also be accessed directly, allowing user-defined probability functions and Jacobians of transforms.

Variable Constraints

Variable constraints are very important in Stan, particularly for parameters. For Stan to sample efficiently, any parameter values that satisfy the constraints declared for the parameters must have support in the model block (i.e., must have non-zero posterior density).
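The relationship between sampling statements, the log probability accumulator, and parameter support can be sketched as follows. This is an illustrative Python analogue, not Stan code; the model, prior scale, and data are hypothetical.

```python
import math

# Illustrative sketch of what a model block computes: each sampling
# statement adds a log density term to an accumulator, and a parameter
# value outside its declared constraint has no support (log prob -inf).

def normal_lpdf(y, mu, sigma):
    return (-0.5 * ((y - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2 * math.pi))

def log_prob(theta, y):
    mu, sigma = theta                 # parameters
    if sigma <= 0:                    # analogue of real<lower=0> sigma;
        return -math.inf              # outside support
    lp = 0.0                          # log probability accumulator
    lp += normal_lpdf(mu, 0, 10)      # mu ~ normal(0, 10);
    for y_n in y:                     # for (n in 1:N)
        lp += normal_lpdf(y_n, mu, sigma)  # y[n] ~ normal(mu, sigma);
    return lp

data = [1.0, 2.0, 0.5]
lp = log_prob((1.0, 1.0), data)
assert math.isfinite(lp)
```

The accumulated value defines log p(θ|y, x) up to a constant, which is all that MCMC and optimization need.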
Constraints in the data and transformed data block are only used for error checking data input and transforms. Constraints in the transformed parameters block must be satisfied the same way as parameter constraints or sampling will devolve to a random walk or fail. Constraints in the generated quantities block must succeed or sampling will be halted altogether, because it is too late to reject a draw at the point the generated quantities block is evaluated.

Execution Order

Statements in Stan are interpreted imperatively, so their order matters. Atomic statements involve the assignment of a value to a variable. Sequences of statements (and optionally local variable declarations) may be organized into a block. Stan also provides bounded for-each loops of the sort used in R and BUGS.

Probabilistic Programming Language

Stan is an imperative probabilistic programming language. It is an instance of a domain-specific language, meaning that it was developed for a specific domain, namely statistical inference.

Stan is a probabilistic programming language in the sense that a random variable is a bona fide first-class object. In Stan, variables may be treated as random, and among the random variables, some are observed and some are unknown and need to be estimated or used for posterior predictive inference. Observed random variables are declared as data and unobserved random variables are declared as parameters (including transformed parameters, generated quantities, and local variables depending on them). For the unobserved random variables, it is possible to sample them either marginally or jointly, estimate their means and variance, or plug them in for downstream posterior predictive inference.

Stan is an imperative language, like C or Fortran (and parts of C++, R, Python, and Java), in the sense that it is based on assignment, loops, conditionals, local variables, object-level function application, and array-like data structures.
In contrast and/or complement, functional languages typically allow higher-order functions and often allow reflection of programming language features into the object language, whereas pure functional languages remove assignment altogether. Object-oriented languages introduce more general data types with dynamic function dispatch.

Stan's language is Church-Turing complete (Church, 1936; Turing, 1936; Hopcroft and Motwani, 2006), in the same way that C or R is. That means any program that is computable on a Turing machine (or in C) can be implemented in Stan (not necessarily easily, of course). All that is required for Turing completeness is loops, conditionals, and arrays that can be dynamically (re)sized in a loop.

1.4. Compiling and Running Stan Programs

A Stan program is first translated to a C++ program by the Stan compiler stanc, then the C++ program is compiled to a self-contained platform-specific executable. Stan can generate executables for various flavors of Windows, Mac OS X, and Linux.¹ Running the Stan executable for a model first reads in and validates the known values y and x, then generates a sequence of (non-independent) identically distributed samples θ^(1), θ^(2), ..., each of which has the marginal distribution p(θ|y, x).

1.5. Sampling

For continuous parameters, Stan uses Hamiltonian Monte Carlo (HMC) sampling (Duane et al., 1987; Neal, 1994, 2011), a form of Markov chain Monte Carlo (MCMC) sampling (Metropolis et al., 1953). Stan does not provide discrete sampling for parameters. Discrete observations can be handled directly, but discrete parameters must be marginalized out of the model. Chapter 13 and Chapter 15 discuss how finite discrete parameters can be summed out of models, leading to large efficiency gains versus discrete parameter sampling.

¹ A Stan program may also be compiled to a dynamically linkable object file for use in a higher-level scripting language such as R or Python.
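A minimal sketch of such marginalization (the two-component normal mixture shown here is hypothetical; Chapter 13 develops the topic fully): instead of sampling a discrete component indicator for each observation, its two possible values are summed out with log_mix.

```stan
data {
  int<lower=0> N;
  real y[N];
}
parameters {
  real<lower=0, upper=1> theta;  // mixing proportion
  ordered[2] mu;                 // component means
}
model {
  for (n in 1:N)
    // sum out the discrete component indicator for observation n
    target += log_mix(theta,
                      normal_lpdf(y[n] | mu[1], 1),
                      normal_lpdf(y[n] | mu[2], 1));
}
```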
HMC accelerates both convergence to the stationary distribution and subsequent parameter exploration by using the gradient of the log probability function. The unknown quantity vector θ is interpreted as the position of a fictional particle. Each iteration generates a random momentum and simulates the path of the particle with potential energy determined by the (negative) log probability function. Hamilton's decomposition shows that the gradient of this potential determines change in momentum and the momentum determines the change in position. These continuous changes over time are approximated using the leapfrog algorithm, which breaks the time into discrete steps which are easily simulated. A Metropolis reject step is then applied to correct for any simulation error and ensure detailed balance of the resulting Markov chain transitions (Metropolis et al., 1953; Hastings, 1970).

Basic Euclidean Hamiltonian Monte Carlo involves three "tuning" parameters to which its behavior is quite sensitive. Stan's samplers allow these parameters to be set by hand or set automatically without user intervention.

The first tuning parameter is the step size, measured in temporal units (i.e., the discretization interval) of the Hamiltonian. Stan can be configured with a user-specified step size or it can estimate an optimal step size during warmup using dual averaging (Nesterov, 2009; Hoffman and Gelman, 2011, 2014). In either case, additional randomization may be applied to draw the step size from an interval of possible step sizes (Neal, 2011).

The second tuning parameter is the number of steps taken per iteration, the product of which with the temporal step size determines the total Hamiltonian simulation time. Stan can be set to use a specified number of steps, or it can automatically adapt the number of steps during sampling using the No-U-Turn (NUTS) sampler (Hoffman and Gelman, 2011, 2014).

The third tuning parameter is a mass matrix for the fictional particle.
Stan can be configured to estimate a diagonal mass matrix or a full mass matrix during warmup; Stan will support user-specified mass matrices in the future. Estimating a diagonal mass matrix normalizes the scale of each element θ_k of the unknown variable sequence θ, whereas estimating a full mass matrix accounts for both scaling and rotation,² but is more memory and computation intensive per leapfrog step due to the underlying matrix operations.

Convergence Monitoring and Effective Sample Size

Samples in a Markov chain are only drawn with the marginal distribution p(θ|y, x) after the chain has converged to its equilibrium distribution. There are several methods to test whether an MCMC method has failed to converge; unfortunately, passing the tests does not guarantee convergence. The recommended method for Stan is to run multiple Markov chains, initialized randomly with a diffuse set of initial parameter values, discard the warmup/adaptation samples, then split the remainder of each chain in half and compute the potential scale reduction statistic, R̂ (Gelman and Rubin, 1992). If the result is not enough effective samples, double the number of iterations and start again, including rerunning warmup and everything.³

When estimating a mean based on a sample of M independent draws, the estimation error is proportional to 1/√M. If the draws are positively correlated, as they typically are when drawn using MCMC methods, the error is proportional to 1/√n_eff, where n_eff is the effective sample size. Thus it is standard practice to also monitor (an estimate of) the effective sample size until it is large enough for the estimation or inference task at hand.

² These estimated mass matrices are global, meaning they are applied to every point in the parameter space being sampled. Riemann-manifold HMC generalizes this to allow the curvature implied by the mass matrix to vary by position.
Bayesian Inference and Monte Carlo Methods

Stan was developed to support full Bayesian inference. Bayesian inference is based in part on Bayes's rule,

p(θ|y, x) ∝ p(y|θ, x) p(θ, x),

which, in this unnormalized form, states that the posterior probability p(θ|y, x) of parameters θ given data y (and constants x) is proportional (for fixed y and x) to the product of the likelihood function p(y|θ, x) and prior p(θ, x).

For Stan, Bayesian modeling involves coding the posterior probability function up to a proportion, which Bayes's rule shows is equivalent to modeling the product of the likelihood function and prior up to a proportion.

Full Bayesian inference involves propagating the uncertainty in the value of parameters θ modeled by the posterior p(θ|y, x). This can be accomplished by basing inference on a sequence of samples from the posterior using plug-in estimates for quantities of interest such as posterior means, posterior intervals, predictions based on the posterior such as event outcomes or the values of as yet unobserved data.

1.6. Optimization

Stan also supports optimization-based inference for models. Given a posterior p(θ|y), Stan can find the posterior mode θ*, which is defined by

θ* = argmax_θ p(θ|y).

Here the notation argmax_v f(v) is used to pick out the value of v at which f(v) is maximized.

If the prior is uniform, the posterior mode corresponds to the maximum likelihood estimate (MLE) of the parameters. If the prior is not uniform, the posterior mode is sometimes called the maximum a posteriori (MAP) estimate. For optimization, the Jacobians of any transforms induced by constraints on variables are ignored.

³ Often a lack of effective samples is a result of not enough warmup iterations. At most this rerunning strategy will consume about 50% more cycles than guessing the correct number of iterations at the outset.
It is more efficient in many optimization problems to remove lower and upper bound constraints in variable declarations and instead rely on rejection in the model block to disallow out-of-support solutions.

Inference with Point Estimates

The estimate θ* is a so-called "point estimate," meaning that it summarizes the posterior distribution by a single point, rather than with a distribution. Of course, a point estimate does not, in and of itself, take into account estimation variance. Posterior predictive inferences p(ỹ | y) can be made using the posterior mode given data y as p(ỹ | θ*), but they are not Bayesian inferences, even if the model involves a prior, because they do not take posterior uncertainty into account. If the posterior variance is low and the posterior mean is near the posterior mode, inference with point estimates can be very similar to full Bayesian inference.

1.7. Variational Inference

Stan also supports variational inference, an approximate Bayesian inference technique (Jordan et al., 1999; Wainwright and Jordan, 2008). Variational inference provides estimates of posterior means and uncertainty through a parametric approximation of a posterior that is optimized for its fit to the true posterior. Variational inference has had a tremendous impact on Bayesian computation, especially in the machine learning community; it is typically faster than sampling techniques and can scale to massive datasets (Hoffman et al., 2013).

Variational inference approximates the posterior p(θ | y) with a simple, parameterized distribution q(θ | φ). It matches the approximation to the true posterior by minimizing the Kullback-Leibler divergence,

φ* = argmin_φ KL[q(θ | φ) ‖ p(θ | y)].

This converts Bayesian inference into an optimization problem with a well-defined metric for convergence. Variational inference can provide orders of magnitude faster convergence than sampling; the quality of the approximation will vary from model to model.
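Full Bayesian posterior predictive inference, by contrast, propagates posterior uncertainty automatically when predictions are drawn per-iteration. A hypothetical sketch using a generated quantities block (the names mu, sigma, and y_tilde are illustrative):

```stan
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  // ... likelihood and priors ...
}
generated quantities {
  // one posterior predictive draw per iteration; summarizing these
  // draws across iterations integrates over posterior uncertainty
  real y_tilde = normal_rng(mu, sigma);
}
```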
Note that variational inference is not a point estimation technique; the result is a distribution that approximates the posterior.

Stan implements Automatic Differentiation Variational Inference (ADVI), an algorithm designed to leverage Stan's library of transformations and automatic differentiation toolbox (Kucukelbir et al., 2015). ADVI circumvents all of the mathematics typically required to derive variational inference algorithms; it works with any Stan model.

Part II

Stan Modeling Language

2. Encodings, Includes, and Comments

This quick chapter covers the character encoding, include mechanism, and comment syntax for the Stan language.

2.1. Character Encoding

Stan Program

The content of a Stan program must be coded in ASCII. Extended character sets such as UTF-8 encoded Unicode may not be used for identifiers or other text in a program.

Comments

The content of comments is ignored by the language compiler and may be written using any character encoding (e.g., ASCII, UTF-8, Latin1, Big5). The comment delimiters themselves must be coded in ASCII.

2.2. Includes

Stan allows one file to be included within another file with the following syntax. For example, suppose the file std-normal.stan defines the standard normal log probability density function (up to an additive constant).

functions {
  real std_normal_lpdf(vector y) {
    return -0.5 * y' * y;
  }
}

Suppose we also have a file containing a Stan program with an include statement.

#include std-normal.stan
parameters {
  real y;
}
model {
  y ~ std_normal();
}

This Stan program behaves as if the contents of the file std-normal.stan replace the line with the #include statement, behaving as if a single Stan program were provided.

functions {
  real std_normal_lpdf(vector y) {
    return -0.5 * y' * y;
  }
}
parameters {
  real y;
}
model {
  y ~ std_normal();
}

There are no restrictions on where include statements may be placed within a file or what the contents are of the replaced file.
No additional whitespace is included beyond what is in the included file.

Recursive Includes

Recursive includes will be ignored. For example, suppose a.stan contains

#include b.stan

and b.stan contains

#include a.stan

The result of processing this file will be empty, because a.stan will include b.stan, from which the include of a.stan is ignored and a warning printed.

Include Paths

The Stan interfaces provide a mechanism for specifying a sequence of system paths in which to search for include files. The file included is the first one that is found in the sequence.

2.3. Comments

Stan supports C++-style line-based and bracketed comments. Comments may be used anywhere whitespace is allowed in a Stan program.

Line-Based Comments

Any characters on a line following two forward slashes (//) are ignored along with the slashes. These may be used, for example, to document variables,

data {
  int<lower=0> N;  // number of observations
  real y[N];       // observations
}

Bracketed Comments

For bracketed comments, any text between a forward-slash and asterisk pair (/*) and an asterisk and forward-slash pair (*/) is ignored.

2.4. Whitespace

Whitespace Characters

The whitespace characters (and their ASCII code points) are the space (0x20), tab (0x09), carriage return (0x0D), and line feed (0x0A).

Whitespace Neutrality

Stan treats these whitespace characters identically. Specifically, there is no significance to indentation, to tabs, to carriage returns or line feeds, or to any vertical alignment of text.

Whitespace Location

Zero or more whitespace characters may be placed between symbols in a Stan program. For example, zero or more whitespace characters of any variety may be included before and after a binary operation such as a * b, before a statement-ending semicolon, around parentheses or brackets, before or after commas separating function arguments, etc.

Identifiers and literals may not be separated by whitespace.
Thus it is not legal to write the number 10000 as 10 000 or to write the identifier normal_lpdf as normal _ lpdf.

3. Data Types and Variable Declarations

This chapter covers the data types for expressions in Stan. Every variable used in a Stan program must have a declared data type. Only values of that type will be assignable to the variable (except for temporary states of transformed data and transformed parameter values). This follows the convention of programming languages like C++, not the conventions of scripting languages like Python or statistical languages such as R or BUGS.

The motivation for strong, static typing is threefold.

• Strong typing forces the programmer's intent to be declared with the variable, making programs easier to comprehend and hence easier to debug and maintain.

• Strong typing allows programming errors relative to the declared intent to be caught sooner (at compile time) rather than later (at run time). The Stan compiler (called through an interface such as CmdStan, RStan, or PyStan) will flag any type errors and indicate the offending expressions quickly when the program is compiled.

• Constrained types will catch runtime data, initialization, and intermediate value errors as soon as they occur rather than allowing them to propagate and potentially pollute final results.

Strong typing disallows assigning the same variable to objects of different types at different points in the program or in different invocations of the program.

3.1. Overview of Data Types

Arguments for built-in and user-defined functions and local variables are required to be basic data types, meaning an unconstrained primitive, vector, or matrix type or an array of such. Passing arguments to functions in Stan works just like assignment to basic types. Stan functions are only specified for the basic data types of their arguments, including array dimensionality, but not for sizes or constraints.
Of course, functions often check constraints as part of their behavior.

Primitive Types

Stan provides two primitive data types, real for continuous values and int for integer values.

Vector and Matrix Types

Stan provides three matrix-based data types, vector for column vectors, row_vector for row vectors, and matrix for matrices.

Array Types

Any type (including the constrained types discussed in the next section) can be made into an array type by declaring array arguments. For example,

real x[10];
matrix[3, 3] m[6, 7];

declares x to be a one-dimensional array of size 10 containing real values, and declares m to be a two-dimensional array of size 6 × 7 containing values that are 3 × 3 matrices.

Constrained Data Types

Declarations of variables other than local variables may be provided with constraints. These constraints are not part of the underlying data type for a variable, but determine error checking in the transformed data, transformed parameter, and generated quantities block, and the transform from unconstrained to constrained space in the parameters block.

All of the basic data types may be given lower and upper bounds using syntax such as

int<lower=1> N;
real<upper=0> log_p;
vector<lower=-1, upper=1>[3] rho;

There are also special data types for structured vectors and matrices. There are four constrained vector data types, simplex for unit simplexes, unit_vector for unit-length vectors, ordered for ordered vectors of scalars and positive_ordered for vectors of positive ordered scalars. There are specialized matrix data types corr_matrix and cov_matrix for correlation matrices (symmetric, positive definite, unit diagonal) and covariance matrices (symmetric, positive definite). The type cholesky_factor_cov is for Cholesky factors of covariance matrices (lower triangular, positive diagonal, product with own transpose is a covariance matrix). The type cholesky_factor_corr is for Cholesky factors of correlation matrices (lower triangular, positive diagonal, unit-length rows).
Constraints provide error checking for variables defined in the data, transformed data, transformed parameters, and generated quantities blocks. Constraints are critical for variables declared in the parameters block, where they determine the transformation from constrained variables (those satisfying the declared constraint) to unconstrained variables (those ranging over all of Rⁿ).

It is worth calling out the most important aspect of constrained data types: The model must have support (non-zero density, equivalently finite log density) at every value of the parameters that meets their declared constraints.

If this condition is violated with parameter values that satisfy declared constraints but do not have finite log density, then the samplers and optimizers may have any of a number of pathologies including just getting stuck, failure to initialize, excessive Metropolis rejection, or biased samples due to inability to explore the tails of the distribution.

3.2. Primitive Numerical Data Types

Unfortunately, the lovely mathematical abstraction of integers and real numbers is only partially supported by finite-precision computer arithmetic.

Integers

Stan uses 32-bit (4-byte) integers for all of its integer representations. The maximum value that can be represented as an integer is 2³¹ − 1; the minimum value is −(2³¹).

When integers overflow, their values wrap. Thus it is up to the Stan programmer to make sure the integer values in their programs stay in range. In particular, every intermediate expression must have an integer value that is in range.

Integer arithmetic works in the expected way for addition, subtraction, and multiplication, but rounds the result of division (see Section 40.1 for more information).

Reals

Stan uses 64-bit (8-byte) floating point representations of real numbers. Stan roughly¹ follows the IEEE 754 standard for floating-point computation. The range of a 64-bit number is roughly ±2¹⁰²², which is slightly larger than ±10³⁰⁷.
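The integer division behavior can be sketched as follows (the variable names are illustrative): the quotient of two int values is itself an int, with the fractional part dropped, while promoting either operand to real gives exact division.

```stan
transformed data {
  int a = 7 / 2;     // integer division drops the fraction: a == 3
  int b = -7 / 2;    // rounds toward zero: b == -3
  real c = 7 / 2.0;  // real division after promotion: c == 3.5
  print("a = ", a, ", b = ", b, ", c = ", c);
}
```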
It is a good idea to stay well away from such extreme values in Stan models as they are prone to cause overflow.

64-bit floating point representations have roughly 15 decimal digits of accuracy. But when they are combined, the result often has less accuracy. In some cases, the difference in accuracy between two operands and their result is large.

There are three special real values used to represent (1) not-a-number value for error conditions, (2) positive infinity for overflow, and (3) negative infinity for overflow. The behavior of these special numbers follows standard IEEE 754 behavior.

Not-a-number

The not-a-number value propagates. If an argument to a real-valued function is not-a-number, it either rejects (an exception in the underlying C++) or returns not-a-number itself. For boolean-valued comparison operators, if one of the arguments is not-a-number, the return value is always zero (i.e., false).

Infinite values

Positive infinity is greater than all numbers other than itself and not-a-number; negative infinity is similarly smaller. Adding an infinite value to a finite value returns the infinite value. Dividing a finite number by an infinite value returns zero; dividing an infinite number by a finite number returns the infinite number of appropriate sign. Dividing a positive finite number by zero returns positive infinity. Dividing two infinite numbers produces a not-a-number value as does subtracting two infinite numbers. Some functions are sensitive to infinite values; for example, the exponential function returns zero if given negative infinity and positive infinity if given positive infinity. Often the gradients will break down when values are infinite, making these boundary conditions less useful than they may appear at first.

¹ Stan compiles integers to int and reals to double types in C++. Precise details of rounding will depend on the compiler and hardware architecture on which the code is run.
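A small sketch of the special-value behavior using Stan's built-in constants (the printed comments state the expected IEEE 754 results, not guaranteed output formatting):

```stan
transformed data {
  real inf = positive_infinity();
  real nan = not_a_number();
  print(1 / inf);                   // finite / infinite is 0
  print(nan == nan);                // 0 (false): comparisons with NaN fail
  print(exp(negative_infinity()));  // exp(-inf) is 0
}
```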
Promoting Integers to Reals

Stan automatically promotes integer values to real values if necessary, but does not automatically demote real values to integers. For very large integers, this will cause a rounding error to fewer significant digits in the floating point representation than in the integer representation.

Unlike in C++, real values are never demoted to integers. Therefore, real values may only be assigned to real variables. Integer values may be assigned to either integer variables or real variables. Internally, the integer representation is cast to a floating-point representation. This operation is not without overhead and should thus be avoided where possible.

3.3. Univariate Data Types and Variable Declarations

All variables used in a Stan program must have an explicitly declared data type. The form of a declaration includes the type and the name of a variable. This section covers univariate types, the next section vector and matrix types, and the following section array types.

Unconstrained Integer

Unconstrained integers are declared using the int keyword. For example, the variable N is declared to be an integer as follows.

int N;

Constrained Integer

Integer data types may be constrained to allow values only in a specified interval by providing a lower bound, an upper bound, or both. For instance, to declare N to be a positive integer, use the following.

int<lower=1> N;

This illustrates that the bounds are inclusive for integers. To declare an integer variable cond to take only binary values, that is zero or one, a lower and upper bound must be provided, as in the following example.

int<lower=0, upper=1> cond;

Unconstrained Real

Unconstrained real variables are declared using the keyword real. The following example declares theta to be an unconstrained continuous value.

real theta;

Constrained Real

Real variables may be bounded using the same syntax as integers. In theory (that is, with arbitrary-precision arithmetic), the bounds on real values would be exclusive.
Unfortunately, finite-precision arithmetic rounding errors will often lead to values on the boundaries, so they are allowed in Stan.

The variable sigma may be declared to be non-negative as follows.

real<lower=0> sigma;

The following declares the variable x to be less than or equal to −1.

real<upper=-1> x;

To ensure rho takes on values between −1 and 1, use the following declaration.

real<lower=-1, upper=1> rho;

Infinite Constraints

Lower bounds that are negative infinity or upper bounds that are positive infinity are ignored. Stan provides constants positive_infinity() and negative_infinity() which may be used for this purpose, or they may be read as data in the dump format.

Expressions as Bounds

Bounds for integer or real variables may be arbitrary expressions. The only requirement is that they only include variables that have been defined before the declaration. If the bounds themselves are parameters, the behind-the-scenes variable transform accounts for them in the log Jacobian.

For example, it is acceptable to have the following declarations.

data {
  real lb;
}
parameters {
  real<lower=lb> phi;
}

This declares a real-valued parameter phi to take values greater than the value of the real-valued data variable lb.

Constraints may be complex expressions, but must be of type int for integer variables and of type real for real variables (including constraints on vectors, row vectors, and matrices). Variables used in constraints can be any variable that has been defined at the point the constraint is used. For instance,

data {
  int<lower=1> N;
  real y[N];
}
parameters {
  real<lower=min(y), upper=max(y)> phi;
}

This declares a positive integer data variable N, an array y of real-valued data of length N, and then a parameter ranging between the minimum and maximum value of y. As shown in the example code, the functions min() and max() may be applied to containers such as arrays.

3.4. Vector and Matrix Data Types

Stan provides three types of container objects: arrays, vectors, and matrices.
Vectors and matrices are more limited kinds of data structures than arrays. Vectors are intrinsically one-dimensional collections of reals, whereas matrices are intrinsically two dimensional.

Vectors, matrices, and arrays are not assignable to one another, even if their dimensions are identical. A 3 × 4 matrix is a different kind of object in Stan than a 3 × 4 array.

The intention of using matrix types is to call out their usage in the code. There are three situations in Stan where only vectors and matrices may be used,

• matrix arithmetic operations (e.g., matrix multiplication)

• linear algebra functions (e.g., eigenvalues and determinants), and

• multivariate function parameters and outcomes (e.g., multivariate normal distribution arguments).

Vectors and matrices cannot be typed to return integer values. They are restricted to real values.²

Indexing from 1

Vectors and matrices, as well as arrays, are indexed starting from one in Stan. This follows the convention in statistics and linear algebra as well as their implementations in the statistical software packages R, MATLAB, BUGS, and JAGS. General computer programming languages, on the other hand, such as C++ and Python, index arrays starting from zero.

Vectors

Vectors in Stan are column vectors; see the next subsection for information on row vectors. Vectors are declared with a size (i.e., a dimensionality). For example, a 3-dimensional vector is declared with the keyword vector, as follows.

vector[3] u;

Vectors may also be declared with constraints, as in the following declaration of a 3-vector of non-negative values.

vector<lower=0>[3] u;

Unit Simplexes

A unit simplex is a vector with non-negative values whose entries sum to 1. For instance, (0.2, 0.3, 0.4, 0.1)ᵀ is a unit 4-simplex.

² This may change if Stan is called upon to do complicated integer matrix operations or boolean matrix operations. Integers are not appropriate inputs for linear algebra functions.
Unit simplexes are most often used as parameters in categorical or multinomial distributions, and they are also the sampled variate in a Dirichlet distribution. Simplexes are declared with their full dimensionality. For instance, theta is declared to be a unit 5-simplex by

simplex[5] theta;

Unit simplexes are implemented as vectors and may be assigned to other vectors and vice-versa. Simplex variables, like other constrained variables, are validated to ensure they contain simplex values; for simplexes, this is only done up to a statically specified accuracy threshold to account for errors arising from floating-point imprecision.

In high dimensional problems, simplexes may require smaller step sizes in the inference algorithms in order to remain stable; this can be achieved through higher target acceptance rates for samplers and longer warmup periods, tighter tolerances for optimization with more iterations, and in either case, with less dispersed parameter initialization or custom initialization if there are informative priors for some parameters.

Unit Vectors

A unit vector is a vector with a norm of one. For instance, (0.5, 0.5, 0.5, 0.5)ᵀ is a unit 4-vector. Unit vectors are sometimes used in directional statistics. Unit vectors are declared with their full dimensionality. For instance, theta is declared to be a unit 5-vector by

unit_vector[5] theta;

Unit vectors are implemented as vectors and may be assigned to other vectors and vice-versa. Unit vector variables, like other constrained variables, are validated to ensure that they are indeed unit length; for unit vectors, this is only done up to a statically specified accuracy threshold to account for errors arising from floating-point imprecision.

Ordered Vectors

An ordered vector type in Stan represents a vector whose entries are sorted in ascending order. For instance, (−1.3, 2.7, 2.71)ᵀ is an ordered 3-vector.
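A minimal sketch of a simplex used as a categorical parameter (the uniform Dirichlet prior and the names theta and z are illustrative, not prescribed):

```stan
data {
  int<lower=1> N;
  int<lower=1, upper=5> z[N];  // observed categories
}
parameters {
  simplex[5] theta;  // category probabilities; entries sum to 1
}
model {
  theta ~ dirichlet(rep_vector(1.0, 5));  // uniform prior on the simplex
  for (n in 1:N)
    z[n] ~ categorical(theta);
}
```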
Ordered vectors are most often employed as cut points in ordered logistic regression models (see Section 9.8). The variable c is declared as an ordered 5-vector by

ordered[5] c;

After their declaration, ordered vectors, like unit simplexes, may be assigned to other vectors and other vectors may be assigned to them. Constraints will be checked after executing the block in which the variables were declared.

Positive, Ordered Vectors

There is also a positive, ordered vector type which operates similarly to ordered vectors, but all entries are constrained to be positive. For instance, (2, 3.7, 4, 12.9)ᵀ is a positive, ordered 4-vector. The variable d is declared as a positive, ordered 5-vector by

positive_ordered[5] d;

Like ordered vectors, after their declaration positive ordered vectors may be assigned to other vectors and other vectors may be assigned to them. Constraints will be checked after executing the block in which the variables were declared.

Row Vectors

Row vectors are declared with the keyword row_vector. Like (column) vectors, they are declared with a size. For example, a 1093-dimensional row vector u would be declared as

row_vector[1093] u;

Constraints are declared as for vectors, as in the following example of a 10-vector with values between -1 and 1.

row_vector<lower=-1, upper=1>[10] u;

Row vectors may not be assigned to column vectors, nor may column vectors be assigned to row vectors. If assignments are required, they may be accommodated through the transposition operator.

Matrices

Matrices are declared with the keyword matrix along with a number of rows and number of columns. For example,

matrix[3, 3] A;
matrix[M, N] B;

declares A to be a 3 × 3 matrix and B to be a M × N matrix. For the second declaration to be well formed, the variables M and N must be declared as integers in either the data or transformed data block and before the matrix declaration.

Matrices may also be declared with constraints, as in this (3 × 4) matrix of nonpositive values.
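A hypothetical sketch of ordered cut points in an ordered logistic regression (the names eta, c, and y are illustrative; Section 9.8 gives the full treatment). The ordered type keeps the cut points ascending automatically, which the likelihood requires.

```stan
data {
  int<lower=1> N;
  int<lower=1, upper=5> y[N];  // ordinal outcomes in 5 categories
  real eta[N];                 // linear predictor, precomputed as data here
}
parameters {
  ordered[4] c;  // K - 1 = 4 cut points, kept in ascending order
}
model {
  for (n in 1:N)
    y[n] ~ ordered_logistic(eta[n], c);
}
```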
matrix<upper=0>[3, 4] B;

Assigning to Rows of a Matrix

Rows of a matrix can be assigned by indexing the left-hand side of an assignment statement. For example, this is possible.

matrix[M, N] a;
row_vector[N] b;
// ...
a[1] = b;

This copies the values from row vector b to a[1], which is the first row of the matrix a. If the number of columns in a is not the same as the size of b, a run-time error is raised; the number of columns of a is N, which is also the size of b.

Assignment works by copying values in Stan. That means any subsequent assignment to a[1] does not affect b, nor does an assignment to b affect a.

Correlation Matrices

Matrix variables may be constrained to represent correlation matrices. A matrix is a correlation matrix if it is symmetric and positive definite, has entries between −1 and 1, and has a unit diagonal. Because correlation matrices are square, only one dimension needs to be declared. For example,

corr_matrix[3] Sigma;

declares Sigma to be a 3 × 3 correlation matrix.

Correlation matrices may be assigned to other matrices, including unconstrained matrices, if their dimensions match, and vice-versa.

Cholesky Factors of Correlation Matrices

Matrix variables may be constrained to represent the Cholesky factors of a correlation matrix. A Cholesky factor for a correlation matrix L is a K × K lower-triangular matrix with positive diagonal entries and rows that are of length 1 (i.e., the sum of L[m, n]² over n = 1, ..., K equals 1 for each row m). If L is a Cholesky factor for a correlation matrix, then L Lᵀ is a correlation matrix (i.e., symmetric positive definite with a unit diagonal). A declaration such as the following

cholesky_factor_corr[K] L;

declares L to be a Cholesky factor for a K by K correlation matrix.

Covariance Matrices

Matrix variables may be constrained to represent covariance matrices. A matrix is a covariance matrix if it is symmetric and positive definite. Like correlation matrices, covariance matrices only need a single dimension in their declaration.
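A common use of the Cholesky factor type, sketched hypothetically here (the LKJ shape value 2 and the scale vector tau are illustrative choices), is to build a covariance Cholesky factor from a correlation Cholesky factor and per-dimension scales:

```stan
parameters {
  cholesky_factor_corr[3] L_Omega;  // Cholesky factor of a correlation matrix
  vector<lower=0>[3] tau;           // per-dimension scales
}
transformed parameters {
  // diag_pre_multiply(tau, L_Omega) is the Cholesky factor of the
  // covariance matrix diag(tau) * L_Omega * L_Omega' * diag(tau)
  matrix[3, 3] L_Sigma = diag_pre_multiply(tau, L_Omega);
}
model {
  L_Omega ~ lkj_corr_cholesky(2);
  tau ~ cauchy(0, 2.5);
}
```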
For instance,

    cov_matrix[K] Omega;

declares Omega to be a K × K covariance matrix, where K is the value of the data variable K.

Cholesky Factors of Covariance Matrices

Matrix variables may be constrained to represent the Cholesky factors of a covariance matrix. This is often more convenient or more efficient than representing covariance matrices directly. A Cholesky factor L is an M × N lower-triangular matrix (if m < n then L[m, n] = 0) with a strictly positive diagonal (L[k, k] > 0) and M ≥ N. If L is a Cholesky factor, then Σ = L L⊤ is a covariance matrix. Furthermore, every covariance matrix has a Cholesky factorization.

The typical case of a square Cholesky factor may be declared with a single dimension,

    cholesky_factor_cov[4] L;

In general, two dimensions may be declared, with the above being equal to cholesky_factor_cov[4, 4]. The type cholesky_factor_cov[M, N] may be used for the general M × N case.

Assigning Constrained Variables

Constrained variables of all types may be assigned to other variables of the same unconstrained type and vice-versa. Matching is interpreted strictly as having the same basic type and number of array dimensions. Constraints are not considered, but basic data types are. For instance, a variable declared to be real<lower=0, upper=1> could be assigned to a variable declared as real and vice-versa. Similarly, a variable declared as matrix[3, 3] may be assigned to a variable declared as cov_matrix[3] or cholesky_factor_cov[3], and vice-versa.

Checks are carried out at the end of each relevant block of statements to ensure constraints are enforced. This includes run-time size checks. The Stan compiler isn't able to catch the fact that an attempt may be made to assign a matrix of one dimensionality to a matrix of mismatching dimensionality.

Expressions as Size Declarations

Variables may be declared with sizes given by expressions. Such expressions are constrained to only contain data or transformed data variables.
This ensures that all sizes are determined once the data is read in and the transformed data variables are defined by their statements. For example, the following is legal.

    data {
      int N_observed;
      int N_missing;
      // ...
    }
    transformed parameters {
      vector[N_observed + N_missing] y;
      // ...
    }

Accessing Vector and Matrix Elements

If v is a column vector or row vector, then v[2] is the second element in the vector. If m is a matrix, then m[2, 3] is the value in the second row and third column.

Providing a matrix with a single index returns the specified row. For instance, if m is a matrix, then m[2] is the second row. This allows Stan blocks such as

    matrix[M, N] m;
    row_vector[N] v;
    real x;
    // ...
    v = m[2];
    x = v[3];   // x == m[2][3] == m[2, 3]

The type of m[2] is row_vector because it is the second row of m. Thus it is possible to write m[2][3] instead of m[2, 3] to access the third element in the second row. When given a choice, the form m[2, 3] is preferred.³

³ As of Stan version 1.0, the form m[2, 3] is more efficient because it does not require the creation and use of an intermediate expression template for m[2]. In later versions, explicit calls to m[2][3] may be optimized to be as efficient as m[2, 3] by the Stan compiler.

Size Declaration Restrictions

An integer expression is used to pick out the sizes of vectors, matrices, and arrays. For instance, we can declare a vector of size M + N using

    vector[M + N] y;

Any integer-denoting expression may be used for the size declaration, providing all variables involved are either data, transformed data, or local variables. That is, expressions used for size declarations may not include parameters, transformed parameters, or generated quantities.

3.5. Array Data Types

Stan supports arrays of arbitrary dimension. The values in an array can be any type, so that arrays may contain values that are simple reals or integers, vectors, matrices, or other arrays.
Arrays are the only way to store sequences of integers, and some functions in Stan, such as discrete distributions, require integer arguments.

A two-dimensional array is just an array of arrays, both conceptually and in terms of current implementation. When an index is supplied to an array, it returns the value at that index. When more than one index is supplied, this indexing operation is chained. For example, if a is a two-dimensional array, then a[m, n] is just a convenient shorthand for a[m][n].

Vectors, matrices, and arrays are not assignable to one another, even if their dimensions are identical.

Declaring Array Variables

Arrays are declared by enclosing the dimensions in square brackets following the name of the variable. The variable n is declared as an array of five integers as follows.

    int n[5];

A two-dimensional array of real values with three rows and four columns is declared with the following.

    real a[3, 4];

A three-dimensional array z of positive reals with five rows, four columns, and two shelves can be declared as follows.

    real<lower=0> z[5, 4, 2];

Arrays may also be declared to contain vectors. For example,

    vector[7] mu[3];

declares mu to be an array of size 3 containing vectors with 7 elements. Arrays may also contain matrices. The example

    matrix[7, 2] mu[15, 12];

declares a 15 by 12 array of 7 × 2 matrices. Any of the constrained types may also be used in arrays, as in the declaration

    cholesky_factor_cov[5, 6] mu[2, 3, 4];

of a 2 × 3 × 4 array of 5 × 6 Cholesky factors of covariance matrices.

Accessing Array Elements and Subarrays

If x is a 1-dimensional array of length 5, then x[1] is the first element in the array and x[5] is the last. For a 3 × 4 array y of two dimensions, y[1, 1] is the first element and y[3, 4] the last element. For a three-dimensional array z, the first element is z[1, 1, 1], and so on.

Subarrays of arrays may be accessed by providing fewer than the full number of indexes.
For example, suppose y is a two-dimensional array with three rows and four columns. Then y[3] is a one-dimensional array of length four. This means that y[3][1] may be used instead of y[3, 1] to access the value of the first column of the third row of y. The form y[3, 1] is the preferred form (see Footnote 3 in this chapter).

Assigning

Subarrays may be manipulated and assigned just like any other variables. Similar to the behavior of matrices, Stan allows blocks such as

    real w[9, 10, 11];
    real x[10, 11];
    real y[11];
    real z;
    // ...
    x = w[5];
    y = x[4];   // y == w[5][4]    == w[5, 4]
    z = y[3];   // z == w[5][4][3] == w[5, 4, 3]

Arrays of Matrices and Vectors

Arrays of vectors and matrices are accessed in the same way as arrays of doubles. Consider the following vector and scalar declarations.

    vector[5] a[3, 4];
    vector[5] b[4];
    vector[5] c;
    real x;

With these declarations, the following assignments are legal.

    b = a[1];        // result is array of vectors
    c = a[1, 3];     // result is vector
    c = b[3];        // same result as above
    x = a[1, 3, 5];  // result is scalar
    x = b[3, 5];     // same result as above
    x = c[5];        // same result as above

Row vectors and other derived vector types (simplex and ordered) behave the same way in terms of indexing.

Consider the following matrix, vector and scalar declarations.

    matrix[6, 5] d[3, 4];
    matrix[6, 5] e[4];
    matrix[6, 5] f;
    row_vector[5] g;
    real x;

With these declarations, the following definitions are legal.

    e = d[1];          // result is array of matrices
    f = d[1, 3];       // result is matrix
    f = e[3];          // same result as above
    g = d[1, 3, 2];    // result is row vector
    g = e[3, 2];       // same result as above
    g = f[2];          // same result as above
    x = d[1, 3, 5, 2]; // result is scalar
    x = e[3, 5, 2];    // same result as above
    x = f[5, 2];       // same result as above
    x = g[2];          // same result as above

As shown, the result f[2] of supplying a single index to a matrix is the indexed row, here row 2 of matrix f.
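Arrays of vectors are the usual way to represent a collection of multivariate observations. The following is a minimal sketch of that pattern (the names y, mu, and Sigma are our own, not taken from the examples above); each element y[n] is itself a vector that may be indexed out and used directly.

```stan
data {
  int<lower=1> N;    // number of observations
  int<lower=1> K;    // dimension of each observation
  vector[K] y[N];    // array of N vectors, each of size K
}
parameters {
  vector[K] mu;
  cov_matrix[K] Sigma;
}
model {
  for (n in 1:N)
    y[n] ~ multi_normal(mu, Sigma);  // y[n] is a vector of size K
}
```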
Partial Array Assignment

Subarrays of arrays may be assigned by indexing on the left-hand side of an assignment statement. For example, the following is legal.

    real x[I, J, K];
    real y[J, K];
    real z[K];
    // ...
    x[1] = y;
    x[1, 1] = z;

The sizes must match. Here, x[1] is a J by K array, as is y. Partial array assignment also works for arrays of matrices, vectors, and row vectors.

Mixing Array, Vector, and Matrix Types

Arrays, row vectors, column vectors and matrices are not interchangeable in Stan. Thus a variable of any one of these fundamental types is not assignable to any of the others, nor may it be used as an argument where another is required (use as arguments follows the assignment rules).

Mixing Vectors and Arrays

For example, vectors cannot be assigned to arrays or vice-versa.

    real a[4];
    vector[4] b;
    row_vector[4] c;
    // ...
    a = b;  // illegal assignment of vector to array
    b = a;  // illegal assignment of array to vector
    a = c;  // illegal assignment of row vector to array
    c = a;  // illegal assignment of array to row vector

Mixing Row and Column Vectors

It is not even legal to assign row vectors to column vectors or vice versa.

    vector[4] b;
    row_vector[4] c;
    // ...
    b = c;  // illegal assignment of row vector to column vector
    c = b;  // illegal assignment of column vector to row vector

Mixing Matrices and Arrays

The same holds for matrices, where 2-dimensional arrays may not be assigned to matrices or vice-versa.

    real a[3, 4];
    matrix[3, 4] b;
    // ...
    a = b;  // illegal assignment of matrix to array
    b = a;  // illegal assignment of array to matrix

Mixing Matrices and Vectors

A 1 × N matrix cannot be assigned a row vector or vice versa.

    matrix[1, 4] a;
    row_vector[4] b;
    // ...
    a = b;  // illegal assignment of row vector to matrix
    b = a;  // illegal assignment of matrix to row vector

Similarly, an M × 1 matrix may not be assigned to a column vector.

    matrix[4, 1] a;
    vector[4] b;
    // ...
    a = b;  // illegal assignment of column vector to matrix
    b = a;  // illegal assignment of matrix to column vector

Size Declaration Restrictions

An integer expression is used to pick out the sizes of arrays. The same restrictions as for vector and matrix sizes apply, namely that the size is declared with an integer-denoting expression that does not contain any parameters, transformed parameters, or generated quantities.

Size Zero Arrays

If any of an array's dimensions is size zero, the entire array will be of size zero. That is, if we declare

    real a[3, 0];

then the resulting size of a is zero and querying any of its dimensions at run time will result in the value zero. Declared as above, a[1] will be a size-zero one-dimensional array. For comparison, declaring

    real b[0, 3];

also produces an array with an overall size of zero, but in this case, there is no way to index legally into b, because b[0] is undefined. The array will behave at run time as if it's a 0 × 0 array. For example, the result of to_matrix(b) will be a 0 × 0 matrix, not a 0 × 3 matrix.

3.6. Variable Types vs. Constraints and Sizes

The type information associated with a variable only contains the underlying type and dimensionality of the variable.

Type Information Excludes Sizes

The size associated with a given variable is not part of its data type. For example, declaring a variable using

    real a[3];

declares the variable a to be an array. The fact that it was declared to have size 3 is part of its declaration, but not part of its underlying type.

When are Sizes Checked?

Sizes are determined dynamically (at run time) and thus cannot be type-checked statically when the program is compiled. As a result, any conformance error on size will raise a run-time error. For example, trying to assign an array of size 5 to an array of size 6 will cause a run-time error. Similarly, multiplying an N × M matrix by a J × K matrix will raise a run-time error if M ≠ J.
Type Information Excludes Constraints

Like sizes, constraints are not treated as part of a variable's type in Stan when it comes to the compile-time check of operations in which it may participate. Anywhere Stan accepts a matrix as an argument, it will syntactically accept a correlation matrix, covariance matrix, or Cholesky factor. Thus a covariance matrix may be assigned to a matrix and vice-versa. Similarly, a bounded real may be assigned to an unconstrained real and vice-versa.

When are Function Argument Constraints Checked?

For arguments to functions, constraints are sometimes, but not always, checked when the function is called. Exclusions include C++ standard library functions. All probability functions and cumulative distribution functions check that their arguments are appropriate at run time as the function is called.

When are Declared Variable Constraints Checked?

For data variables, constraints are checked after the variable is read from a data file or other source. For transformed data variables, the check is done after the statements in the transformed data block have executed. Thus it is legal for intermediate values of variables to not satisfy declared constraints.

For parameters, constraints are enforced by the transform applied and do not need to be checked. For transformed parameters, the check is done after the statements in the transformed parameters block have executed. For generated quantities, constraints are enforced after the statements in the generated quantities block have executed.

For all blocks defining variables (transformed data, transformed parameters, generated quantities), real values are initialized to NaN and integer values are initialized to the smallest legal integer (i.e., a negative number of large absolute value).

Type Naming Notation

When discussing data types, it is convenient to have a concise notation for referring to them.
The type naming notation outlined in this section is not part of the Stan programming language, but rather a convention adopted in this document to enable a concise description of a type. Because size information is not part of a data type, data types will be written without size information. For instance, real[] is the type of a one-dimensional array of reals and matrix is the type of matrices. The three-dimensional integer array type is written as int[ , , ], indicating the number of slots available for indexing. Similarly, vector[ , ] is the type of a two-dimensional array of vectors.

3.7. Compound Variable Declaration and Definition

Stan allows assignable variables to be declared and defined in a single statement. Assignable variables are

• local variables, and
• variables declared in the transformed data, transformed parameters, or generated quantities blocks.

For example, the statement

    int N = 5;

declares the variable N to be an integer scalar type and at the same time defines it to be the value of the expression 5.

Assignment Typing

The type of the expression on the right-hand side of the assignment must be assignable to the type of the variable being declared. For example, it is legal to have

    real sum = 0;

even though 0 is of type int and sum is of type real, because integer-typed scalar expressions can be assigned to real-valued scalar variables. In all other cases, the type of the expression on the right-hand side of the assignment must be identical to the type of the variable being declared.

Any type may be assigned. For example,

    matrix[3, 2] a = b;

declares a matrix variable a and assigns it the value of b, which must be of type matrix for the compound statement to be well formed. The sizes of matrices are not part of their static typing and cannot be validated until run time.

Right-Hand Side Expressions

The right-hand side may be any expression which has a type that is assignable to the variable being declared.
For example,

    matrix[3, 2] a = 0.5 * (b + c);

assigns the matrix variable a to half of the sum of b and c. The only requirement on b and c is that the expression b + c be of type matrix. For example, b could be of type matrix and c of type real, because adding a matrix to a scalar produces a matrix, and multiplying a matrix by a scalar produces another matrix.

The right-hand side expression can be a call to a user-defined function, allowing general algorithms to be applied that might not otherwise be expressible as simple expressions (e.g., iterative or recursive algorithms).

Scope within Expressions

Any variable that is in scope and any function that is available in the block in which the compound declaration and definition appears may be used in the expression on the right-hand side of the compound declaration and definition statement.

4. Expressions

An expression is the basic syntactic unit in a Stan program that denotes a value. Every expression in a well-formed Stan program has a type that is determined statically (at compile time). If an expression's type cannot be determined statically, the Stan compiler will report the location of the problem. This chapter covers the syntax, typing, and usage of the various forms of expressions in Stan.

4.1. Numeric Literals

The simplest form of expression is a literal that denotes a primitive numerical value.

Integer Literals

Integer literals represent integers of type int. Integer literals are written in base 10 without any separators. Integer literals may contain a single negative sign. (The expression --1 is interpreted as the negation of the literal -1.) The following list contains well-formed integer literals.

    0, 1, -1, 256, -127098, 24567898765

Integer literals must have values that fall within the bounds for integer values (see Section 3.2). Integer literals may not contain decimal points (.). Thus the expressions 1. and 1.0 are of type real and may not be used where a value of type int is required.
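The int/real distinction matters anywhere a value of type int is required, such as in size declarations. The following is a minimal sketch of where the literal forms diverge (the variable names are our own).

```stan
transformed data {
  int n = 5;       // integer literal: legal
  // int m = 5.0;  // illegal: 5.0 is a real literal, not an int
  real x = 5;      // legal: integer expressions promote to real
  vector[n] v;     // size expressions require type int
}
```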
Real Literals

A number written with a period or with scientific notation is assigned the continuous numeric type real. Real literals are written in base 10 with a period (.) as a separator. Examples of well-formed real literals include the following.

    0.0, 1.0, 3.14, -217.9387, 2.7e3, -2E-5

The notation e or E followed by a positive or negative integer denotes a power of 10 by which to multiply. For instance, 2.7e3 denotes 2.7 × 10³ and -2E-5 denotes −2 × 10⁻⁵.

4.2. Variables

A variable by itself is a well-formed expression of the same type as the variable. Variables in Stan consist of ASCII strings containing only the basic lower-case and upper-case Roman letters, digits, and the underscore (_) character. Variables must start with a letter (a-z and A-Z) and may not end with two underscores (__).

Examples of legal variable identifiers are as follows.

    a, a3, a_3, Sigma, my_cpp_style_variable, myCamelCaseVariable

Unlike in R and BUGS, variable identifiers in Stan may not contain a period character.

Reserved Names

Stan reserves many strings for internal use and these may not be used as the name of a variable. An attempt to name a variable after an internal string results in the stanc translator halting with an error message indicating which reserved name was used and its location in the model code.

Model Name

The name of the model cannot be used as a variable within the model. This is usually not a problem because the default in bin/stanc is to append _model to the name of the file containing the model specification. For example, if the model is in file foo.stan, it would not be legal to have a variable named foo_model when using the default model name through bin/stanc. With user-specified model names, variables cannot match the model name.

User-Defined Function Names

User-defined function names cannot be used as a variable within the model.

Reserved Words from Stan Language

The following list contains reserved words for Stan's programming language.
Not all of these features are implemented in Stan yet, but the tokens are reserved for future use.

    for, in, while, repeat, until, if, then, else, true, false

Variables should not be named after types, either, and thus may not be any of the following.

    int, real, vector, simplex, unit_vector, ordered, positive_ordered,
    row_vector, matrix, cholesky_factor_corr, cholesky_factor_cov,
    corr_matrix, cov_matrix

Variable names will also conflict with the following block identifiers.

    functions, model, data, parameters, quantities, transformed, generated

Reserved Names from Stan Implementation

Some variable names are reserved because they are used within Stan's C++ implementation. These are

    var, fvar, STAN_MAJOR, STAN_MINOR, STAN_PATCH,
    STAN_MATH_MAJOR, STAN_MATH_MINOR, STAN_MATH_PATCH

Reserved Function and Distribution Names

Variable names will conflict with the names of predefined functions other than constants. Thus a variable may not be named logit or add, but it may be named pi or e.

Variable names will also conflict with the names of distributions suffixed with _lpdf, _lpmf, _lcdf, _lccdf, _cdf, and _ccdf, such as normal_lcdf; this also holds for the deprecated forms _log, _cdf_log, and _ccdf_log. Using any of these variable names causes the stanc translator to halt and report the name and location of the variable causing the conflict.

Reserved Names from C++

Finally, variable names, including the names of models, should not conflict with any of the C++ keywords.
    alignas, alignof, and, and_eq, asm, auto, bitand, bitor, bool, break,
    case, catch, char, char16_t, char32_t, class, compl, const, constexpr,
    const_cast, continue, decltype, default, delete, do, double,
    dynamic_cast, else, enum, explicit, export, extern, false, float, for,
    friend, goto, if, inline, int, long, mutable, namespace, new, noexcept,
    not, not_eq, nullptr, operator, or, or_eq, private, protected, public,
    register, reinterpret_cast, return, short, signed, sizeof, static,
    static_assert, static_cast, struct, switch, template, this,
    thread_local, throw, true, try, typedef, typeid, typename, union,
    unsigned, using, virtual, void, volatile, wchar_t, while, xor, xor_eq

Legal Characters

The legal variable characters have the same ASCII code points in the range 0-127 as in Unicode.

    Characters    ASCII (Unicode) Code Points
    a - z         97 - 122
    A - Z         65 - 90
    0 - 9         48 - 57
    _             95

Although not the most expressive character set, ASCII is the most portable and least prone to corruption through improper character encodings or decodings.

Comments Allow ASCII-Compatible Encoding

Within comments, Stan can work with any ASCII-compatible character encoding, such as ASCII itself, UTF-8, or Latin1. It is up to user shells and editors to display them properly.

4.3. Vector, Matrix, and Array Expressions

Expressions for the Stan container objects (arrays, vectors, and matrices) can be constructed via a sequence of expressions enclosed in either curly braces for arrays, or square brackets for vectors and matrices.

Vector Expressions

Square brackets may be wrapped around a sequence of comma-separated primitive expressions to produce a row vector expression. For example, the expression

    [ 1, 10, 100 ]

denotes a row vector of three elements with real values 1.0, 10.0, and 100.0. Applying the transpose operator to a row vector expression produces a vector expression. This syntax provides a way to declare and define small vectors in a single line, as follows.
    row_vector[2] rv2 = [ 1, 2 ];
    vector[3] v3 = [ 3, 4, 5 ]';

The vector expression values may be compound expressions or variable names, so it is legal to write [ 2 * 3, 1 + 4 ] or [ x, y ], providing that x and y are primitive variables.

Matrix Expressions

A matrix expression consists of square brackets wrapped around a sequence of comma-separated row vector expressions. This syntax provides a way to declare and define a matrix in a single line, as follows.

    matrix[3, 2] m1 = [ [ 1, 2 ], [ 3, 4 ], [ 5, 6 ] ];

Any expression denoting a row vector can be used in a matrix expression. For example, the following code is valid:

    vector[2] vX = [ 1, 10 ]';
    row_vector[2] vY = [ 100, 1000 ];
    matrix[3, 2] m2 = [ vX', vY, [ 1, 2 ] ];

No empty vector or matrix expressions

The empty expression [ ] is ambiguous and therefore is not allowed, and similarly expressions such as [ [ ] ] or [ [ ], [ ] ] are not allowed.

Array Expressions

Curly braces may be wrapped around a sequence of expressions to produce an array expression. For example, the expression

    { 1, 10, 100 }

denotes an integer array of three elements with values 1, 10, and 100. This syntax is particularly convenient to define small arrays in a single line, as follows.

    int a[3] = { 1, 10, 100 };

The values may be compound expressions, so it is legal to write { 2 * 3, 1 + 4 }. It is also possible to write two-dimensional arrays directly, as in the following example.

    int b[2, 3] = { { 1, 2, 3 }, { 4, 5, 6 } };

This way, b[1] is { 1, 2, 3 } and b[2] is { 4, 5, 6 }. Whitespace is always interchangeable in Stan, so the above can be laid out as follows to more clearly indicate the row and column structure of the resulting two-dimensional array.

    int b[2, 3] = { { 1, 2, 3 },
                    { 4, 5, 6 } };

Array Expression Types

Any type of expression may be used within braces to form an array expression. In the simplest case, all of the elements will be of the same type and the result will be an array of elements of that type.
For example, the elements of the array can be vectors, in which case the result is an array of vectors.

    vector[3] b;
    vector[3] c;
    // ...
    vector[3] d[2] = { b, c };

The elements may also be a mixture of int and real typed expressions, in which case the result is an array of real values.

    real b[2] = { 1, 1.9 };

Restrictions on Values

There are some restrictions on how array expressions may be used that arise from their types being calculated bottom up and the basic data type and assignment rules of Stan.

Rectangular array expressions only

Although it is tempting to try to define a ragged array expression, all Stan data types are rectangular (or boxes or other higher-dimensional generalizations). Thus the following nested array expression will cause an error when it tries to create a non-rectangular array.

    { { 1, 2, 3 }, { 4, 5 } }  // compile time error: size mismatch

This may appear to be OK, because it is creating a two-dimensional integer array (int[ , ]) out of two one-dimensional integer arrays (int[ ]). But it is not allowed because the two one-dimensional arrays are not the same size. If the elements are array expressions, this can be diagnosed at compile time. If one or both expressions is a variable, then the mismatch won't be caught until runtime.

    { { 1, 2, 3 }, m }  // runtime error if m not size 3

No empty array expressions

Because there is no way to infer the type of the result, the empty array expression ({ }) is not allowed. This does not sacrifice expressive power, because a declaration is sufficient to initialize a zero-element array.

    int a[0];  // a is fully defined as zero element array

Integer only array expressions

If an array expression contains only integer elements, such as { 1, 2, 3 }, then the result type will be an integer array, int[]. This means that the following will not be legal.

    real a[2] = { -3, 12 };  // error: int[] can't be assigned to real[]

Integer arrays may not be assigned to real arrays.
However, this problem is easily sidestepped by using real literal expressions.

    real a[2] = { -3.0, 12.0 };

Now the types match and the assignment is allowed.

4.4. Parentheses for Grouping

Any expression wrapped in parentheses is also an expression. Like in C++, but unlike in R, only the round parentheses, ( and ), are allowed. The square brackets [ and ] are reserved for array indexing and the curly braces { and } for grouping statements.

With parentheses it is possible to explicitly group subexpressions with operators. Without parentheses, the expression 1 + 2 * 3 has a subexpression 2 * 3 and evaluates to 7. With parentheses, this grouping may be made explicit with the expression 1 + (2 * 3). More importantly, the expression (1 + 2) * 3 has 1 + 2 as a subexpression and evaluates to 9.

4.5. Arithmetic and Matrix Operations on Expressions

For integer and real-valued expressions, Stan supports the basic binary arithmetic operations of addition (+), subtraction (-), multiplication (*) and division (/) in the usual ways. For integer expressions, Stan supports the modulus (%) binary arithmetic operation. Stan also supports the unary operation of negation for integer and real-valued expressions. For example, assuming n and m are integer variables and x and y real variables, the following expressions are legal.

    3.0 + 0.14, -15, 2 * 3 + 1, (x - y) / 2.0,
    (n * (n + 1)) / 2, x / n, m % n

The negation, addition, subtraction, and multiplication operations are extended to matrices, vectors, and row vectors. The transpose operation, written using an apostrophe ('), is also supported for vectors, row vectors, and matrices. Return types for matrix operations are the smallest types that can be statically guaranteed to contain the result. The full set of allowable input types and corresponding return types is detailed in Chapter 43.
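The scalar operations above can be sketched in a transformed data block as follows (the variable names are our own); note that / applied to two integers is integer division, while % is integer modulus.

```stan
transformed data {
  int n = 7;
  int m = 2;
  int q = n / m;     // integer division: 3
  int r = n % m;     // modulus: 1
  real x = 7 / 2.0;  // real division: 3.5
  real y = -x;       // unary negation
}
```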
For example, if y and mu are variables of type vector and Sigma is a variable of type matrix, then

    (y - mu)' * Sigma * (y - mu)

is a well-formed expression of type real. The type of the complete expression is inferred working outward from the subexpressions. The subexpression y - mu is of type vector because the variables y and mu are of type vector. The transpose of this expression, the subexpression (y - mu)', is of type row_vector. Multiplication is left associative and transpose has higher precedence than multiplication, so the above expression is equivalent to the following well-formed, fully specified form.

    (((y - mu)') * Sigma) * (y - mu)

The type of subexpression (y - mu)' * Sigma is inferred to be row_vector, being the result of multiplying a row vector by a matrix. The whole expression's type is thus the type of a row vector multiplied by a (column) vector, which produces a real value.

Stan provides elementwise matrix multiplication and division operations, a .* b and a ./ b. These provide a shorthand to replace loops, but are not intrinsically more efficient than a version programmed with elementwise calculations and assignments in a loop. For example, given the declarations,

    vector[N] a;
    vector[N] b;
    vector[N] c;

the assignment

    c = a .* b;

produces the same result with roughly the same efficiency as the loop

    for (n in 1:N)
      c[n] = a[n] * b[n];

Stan supports exponentiation (^) of integer and real-valued expressions. The return type of exponentiation is always real. For example, assuming n and m are integer variables and x and y real variables, the following expressions are legal.

    3 ^ 2, 3.0 ^ -2, 3.0 ^ 0.14, x ^ n, n ^ x, n ^ m, x ^ y

Exponentiation is right associative, so the expression

    2 ^ 3 ^ 4

is equivalent to the following well-formed, fully specified form.
    2 ^ (3 ^ 4)

Operator Precedence and Associativity

The precedence and associativity of operators, as well as built-in syntax such as array indexing and function application, are given in tabular form in Figure 4.1. Other expression-forming operations, such as function application and subscripting, bind more tightly than any of the arithmetic operations.

    Op.   Prec.  Assoc.  Placement       Description
    ? :   10     right   ternary infix   conditional
    ||    9      left    binary infix    logical or
    &&    8      left    binary infix    logical and
    ==    7      left    binary infix    equality
    !=    7      left    binary infix    inequality
    <     6      left    binary infix    less than
    <=    6      left    binary infix    less than or equal
    >     6      left    binary infix    greater than
    >=    6      left    binary infix    greater than or equal
    +     5      left    binary infix    addition
    -     5      left    binary infix    subtraction
    *     4      left    binary infix    multiplication
    /     4      left    binary infix    (right) division
    %     4      left    binary infix    modulus
    \     3      left    binary infix    left division
    .*    2      left    binary infix    elementwise multiplication
    ./    2      left    binary infix    elementwise division
    !     1      n/a     unary prefix    logical negation
    -     1      n/a     unary prefix    negation
    +     1      n/a     unary prefix    promotion (no-op in Stan)
    ^     0.5    right   binary infix    exponentiation
    '     0      n/a     unary postfix   transposition
    ()    0      n/a     prefix, wrap    function application
    []    0      left    prefix, wrap    array, matrix indexing

Figure 4.1: Stan's unary, binary, and ternary operators, with their precedences, associativities, place in an expression, and a description. The last two lines list the precedence of function application and array, matrix, and vector indexing. The operators are listed in order of precedence, from least tightly binding to most tightly binding.

The full set of legal arguments and corresponding result types are provided in the function documentation in Part VII prefaced with operator (i.e., operator*(int,int):int indicates the application of the multiplication operator to two integers, which returns an integer). Parentheses may be used to group expressions explicitly rather than relying on precedence and associativity.

The precedence and associativity determine how expressions are interpreted. Because addition is left associative, the expression a+b+c is interpreted as (a+b)+c. Similarly, a/b*c is interpreted as (a/b)*c. Because multiplication has higher precedence than addition, the expression a*b+c is interpreted as (a*b)+c and the expression a+b*c is interpreted as a+(b*c). Similarly, 2*x+3*-y is interpreted as (2*x)+(3*(-y)).

Transposition and exponentiation bind more tightly than any other arithmetic or logical operation. For vectors, row vectors, and matrices, -u' is interpreted as -(u'), u*v' as u*(v'), and u'*v as (u')*v. For integers and reals, -n ^ 3 is interpreted as -(n ^ 3).

4.6. Conditional Operator

Conditional Operator Syntax

The ternary conditional operator is unique in that it takes three arguments and uses a mixed syntax. If a is an expression of type int and b and c are expressions that can be converted to one another (e.g., compared with ==), then

    a ? b : c

is an expression of the promoted type of b and c. The only promotion allowed in Stan is from integer to real; if one argument is of type int and the other of type real, the conditional expression as a whole is of type real. In all other cases, the arguments have to be of the same underlying Stan type (i.e., constraints don't count, only the shape) and the conditional expression is of that type.

Conditional Operator Precedence

The conditional operator is the most loosely binding operator, so its arguments rarely require parentheses for disambiguation. For example,

    a > 0 || b < 0 ? c + d : e - f

is equivalent to the explicitly grouped version

    (a > 0 || b < 0) ? (c + d) : (e - f)

The latter is easier to read even if the parentheses are not strictly necessary.

Conditional Operator Associativity

The conditional operator is right associative, so that

    a ? b : c ? d : e

parses as if explicitly grouped as

    a ? b : (c ? d : e)

Again, the explicitly grouped version is easier to read.

Conditional Operator Semantics

Stan's conditional operator works very much like its C++ analogue. The first argument must be an expression denoting an integer. Typically this is a variable or a relational expression, as with the variable a in the example above. Then there are two result arguments, the first being the result returned if the condition evaluates to true (i.e., non-zero) and the second if the condition evaluates to false (i.e., zero). In the example above, the value b is returned if the condition evaluates to a non-zero value and c is returned if the condition evaluates to zero.

Lazy Evaluation of Results

The key property of the conditional operator that makes it so useful in high-performance computing is that it only evaluates the returned subexpression, not the alternative expression. In other words, it is not like a typical function that evaluates its argument expressions eagerly in order to pass their values to the function. As usual, the saving is mostly in the derivatives that do not get computed rather than in the unnecessary function evaluation itself.

Promotion to Parameter

If one return expression is a data value (an expression involving only constants and variables defined in the data or transformed data block) and the other is not, then the ternary operator will promote the data value to a parameter value. This can cause needless work calculating derivatives in some cases and be less efficient than a full if-then conditional statement. For example,

    data {
      real x[10];
      ...
    parameters {
      real z[10];
      ...
    model {
      y ~ normal(cond ? x : z, sigma);
      ...

would be more efficiently (if not more transparently) coded as

    if (cond)
      y ~ normal(x, sigma);
    else
      y ~ normal(z, sigma);

The conditional statement, like the conditional operator, only evaluates one of the result statements.
In this case, the variable x will not be promoted to a parameter and thus will not cause any needless work to be carried out when propagating the chain rule during derivative calculations.

4.7. Indexing

Stan arrays, matrices, vectors, and row vectors are all accessed using the same array-like notation. For instance, if x is a variable of type real[] (a one-dimensional array of reals), then x[1] is the value of the first element of the array.

Subscripting has higher precedence than any of the arithmetic operations. For example, alpha*x[1] is equivalent to alpha*(x[1]).

Multiple subscripts may be provided within a single pair of square brackets. If x is of type real[ , ], a two-dimensional array, then x[2,501] is of type real.

Accessing Subarrays

The subscripting operator also returns subarrays of arrays. For example, if x is of type real[ , , ], then x[2] is of type real[ , ], and x[2,3] is of type real[]. As a result, the expressions x[2,3] and x[2][3] have the same meaning.

Accessing Matrix Rows

If Sigma is a variable of type matrix, then Sigma[1] denotes the first row of Sigma and has type row_vector.

    index type     example  value
    integer        a[11]    value of a at index 11
    integer array  a[ii]    a[ii[1]], ..., a[ii[K]]
    lower bound    a[3:]    a[3], ..., a[N]
    upper bound    a[:5]    a[1], ..., a[5]
    range          a[2:7]   a[2], ..., a[7]
    all            a[:]     a[1], ..., a[N]
    all            a[]      a[1], ..., a[N]

Figure 4.2: Types of indexes and examples with one-dimensional containers of size N and an integer array ii of type int[] of size K.

Mixing Array and Vector/Matrix Indexes

Stan supports mixed indexing of arrays and their vector, row vector, or matrix values. For example, if m is of type matrix[ , ], a two-dimensional array of matrices, then m[1] refers to the first row of the array, which is a one-dimensional array of matrices. More than one index may be used, so that m[1,2] is of type matrix and denotes the matrix in the first row and second column of the array.
Continuing to add indices, m[1,2,3] is of type row_vector and denotes the third row of the matrix denoted by m[1,2]. Finally, m[1,2,3,4] is of type real and denotes the value in the third row and fourth column of the matrix that is found at the first row and second column of the array m.

4.8. Multiple Indexing and Range Indexing

In addition to single integer indexes, as described in Section 4.7, Stan supports multiple indexing. Multiple indexes can be integer arrays of indexes, lower bounds, upper bounds, lower and upper bounds, or simply shorthand for all of the indexes. A complete table of index types is given in Figure 4.2.

Multiple Index Semantics

The fundamental semantic rule for dealing with multiple indexes is the following. If idxs is a multiple index, then it produces an indexable position in the result. To evaluate that index position in the result, the index is first passed to the multiple index, and the resulting index is used.

    a[idxs, ...][i, ...] = a[idxs[i], ...][...]

On the other hand, if idx is a single index, it reduces the dimensionality of the output, so that

    a[idx, ...] = a[idx][...]

The only issue is what happens with matrices and vectors. Vectors work just like arrays. Matrices with multiple row indexes and multiple column indexes produce matrices. Matrices with multiple row indexes and a single column index become (column) vectors. Matrices with a single row index and multiple column indexes become row vectors. The types are summarized in Figure 4.3.

    example    row index  column index  result type
    a[i]       single     n/a           row vector
    a[is]      multiple   n/a           matrix
    a[i, j]    single     single        real
    a[i, js]   single     multiple      row vector
    a[is, j]   multiple   single        vector
    a[is, js]  multiple   multiple      matrix

Figure 4.3: Special rules for reducing matrices based on whether the argument is a single or multiple index. Examples are for a matrix a, with integer single indexes i and j and integer array multiple indexes is and js. The same typing rules apply for all multiple indexes.
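The typing rules summarized in Figure 4.3 can be written out as declarations with comments; the following is a minimal sketch (the sizes and names a, is, and js are illustrative, not prescribed).

```stan
// Sketch of single vs. multiple index results for a matrix;
// the sizes here are illustrative.
transformed data {
  matrix[3, 4] a;
  int is[2];
  int js[3];
  // a[1]      : row_vector[4]  (single index reduces dimensionality)
  // a[is]     : matrix[2, 4]   (multiple row index)
  // a[is, 2]  : vector[2]      (multiple rows, single column)
  // a[1, js]  : row_vector[3]  (single row, multiple columns)
  // a[is, js] : matrix[2, 3]   (multiple rows and columns)
}
```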
Evaluation of matrices with multiple indexes is defined to respect the following distributivity conditions.

    m[idxs1, idxs2][i, j] = m[idxs1[i], idxs2[j]]
    m[idxs, idx][j] = m[idxs[j], idx]
    m[idx, idxs][j] = m[idx, idxs[j]]

Evaluation of arrays of matrices and arrays of vectors or row vectors is defined recursively, beginning with the array dimensions.

4.9. Function Application

Stan provides a range of built-in mathematical and statistical functions, which are documented in Part VII.

Expressions in Stan may consist of the name of a function followed by a sequence of zero or more argument expressions. For instance, log(2.0) is the expression of type real denoting the result of applying the natural logarithm to the value of the real literal 2.0.

Syntactically, function application has higher precedence than any of the other operators, so that y + log(x) is interpreted as y + (log(x)).

Type Signatures and Result Type Inference

Each function has a type signature which determines the allowable types of its arguments and its return type. For instance, the function signature for the logarithm function can be expressed as

    real log(real);

and the signature for the lmultiply function is

    real lmultiply(real, real);

A function is uniquely determined by its name and its sequence of argument types. For instance, the following two functions are different functions.

    real mean(real[]);
    real mean(vector);

The first applies to a one-dimensional array of real values and the second to a vector.

The identity conditions for functions explicitly forbid having two functions with the same name and argument types but different return types. This restriction also makes it possible to infer the type of a function expression compositionally by only examining the types of its subexpressions.

Constants

Constants in Stan are nothing more than nullary (no-argument) functions. For instance, the mathematical constants π and e are represented as nullary functions named pi() and e().
See Section 41.2 for a list of built-in constants.

Type Promotion and Function Resolution

Because of integer to real type promotion, rules must be established to determine which function is called given a sequence of argument types. The scheme employed by Stan is the same as that used by C++, which resolves a function call to the function requiring the minimum number of type promotions.

For example, consider a situation in which the following two function signatures have been registered for foo.

    real foo(real, real);
    int foo(int, int);

The use of foo in the expression foo(1.0, 1.0) resolves to foo(real,real), and thus the expression foo(1.0, 1.0) itself is assigned a type of real.

Because integers may be promoted to real values, the expression foo(1, 1) could potentially match either foo(real,real) or foo(int,int). The former requires two type promotions and the latter requires none, so foo(1, 1) is resolved to the function foo(int,int) and is thus assigned the type int.

The expression foo(1, 1.0) has argument types (int,real) and thus does not explicitly match either function signature. By promoting the integer expression 1 to type real, it is able to match foo(real,real), and hence the type of the function expression foo(1, 1.0) is real.

In some cases (though not for any built-in Stan functions), a situation may arise in which the function referred to by an expression remains ambiguous. For example, consider a situation in which there are exactly two functions named bar with the following signatures.

    real bar(real, int);
    real bar(int, real);

With these signatures, the expressions bar(1.0, 1) and bar(1, 1.0) resolve to the first and second of the above functions, respectively. The expression bar(1.0, 1.0) is illegal because real values may not be demoted to integers. The expression bar(1, 1) is illegal for a different reason.
If the first argument is promoted to a real value, it matches the first signature, whereas if the second argument is promoted to a real value, it matches the second signature. The problem is that these both require one promotion, so the function name bar is ambiguous. If there is not a unique function requiring fewer promotions than all others, as with bar(1, 1) given the two declarations above, the Stan compiler will flag the expression as illegal.

Random-Number Generating Functions

For most of the distributions supported by Stan, there is a corresponding random-number generating function. These random number generators are named by the distribution with the suffix _rng. For example, a univariate normal random number can be generated by normal_rng(0, 1); only the parameters of the distribution, here a location (0) and scale (1), are specified because the variate is generated.

Random-Number Generator Locations

The use of random-number generating functions is restricted to the transformed data and generated quantities blocks; attempts to use them elsewhere will result in a parsing error with a diagnostic message. They may also be used in the bodies of user-defined functions whose names end in _rng. This allows the random number generating functions to be used for simulation in general, and for Bayesian posterior predictive checking in particular.

Posterior Predictive Checking

Posterior predictive checks typically use the parameters of the model to generate simulated data (at the individual and optionally at the group level for hierarchical models), which can then be compared informally using plots and formally by means of test statistics to the actual data in order to assess the suitability of the model; see (Gelman et al., 2013, Chapter 6) for more information on posterior predictive checks.

4.10. Type Inference

Stan is strongly statically typed, meaning that the implementation type of an expression can be resolved at compile time.
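The posterior predictive use of _rng functions described above can be sketched as a complete program; this is a minimal illustration in which the normal model and the names y_rep, mu, and sigma are illustrative, not prescribed.

```stan
// Minimal sketch of posterior predictive simulation using an _rng
// function in the generated quantities block; the normal model and
// all variable names here are illustrative.
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
}
generated quantities {
  vector[N] y_rep;  // replicated data for posterior predictive checks
  for (n in 1:N)
    y_rep[n] = normal_rng(mu, sigma);
}
```

Each posterior draw then carries a simulated data set y_rep that can be compared to the observed y with plots or test statistics.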
Implementation Types

The primitive implementation types for Stan are int, real, vector, row_vector, and matrix. Every basic declared type corresponds to a primitive type; see Figure 4.4 for the mapping from types to their primitive types.

    Type                  Primitive Type
    int                   int
    real                  real
    matrix                matrix
    cov_matrix            matrix
    corr_matrix           matrix
    cholesky_factor_cov   matrix
    cholesky_factor_corr  matrix
    vector                vector
    simplex               vector
    unit_vector           vector
    ordered               vector
    positive_ordered      vector
    row_vector            row_vector

Figure 4.4: The table shows the variable declaration types of Stan and their corresponding primitive implementation types. Stan functions, operators, and probability functions have argument and result types declared in terms of primitive types plus array dimensionality.

A full implementation type consists of a primitive implementation type and an integer array dimensionality greater than or equal to zero. These will be written to emphasize their array-like nature. For example, int[] has an array dimensionality of 1, int an array dimensionality of 0, and int[ , , ] an array dimensionality of 3. The implementation type matrix[ , , ] has a total of five dimensions and takes up to five indices, three from the array and two from the matrix.

Recall that the array dimensions come before the matrix or vector dimensions in an expression such as the following declaration of a three-dimensional array of matrices.

    matrix[M, N] a[I, J, K];

The matrix a is indexed as a[i, j, k, m, n] with the array indices first, followed by the matrix indices, with a[i, j, k] being a matrix and a[i, j, k, m] being a row vector.

Type Inference Rules

Stan's type inference rules define the implementation type of an expression based on a background set of variable declarations. The rules work bottom up from primitive literal and variable expressions to complex expressions.

Literals

An integer literal expression such as 42 is of type int. Real literals such as 42.0 are of type real.
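The array-before-matrix indexing order described above can be spelled out with declarations and comments; the following is a minimal sketch in which all sizes are illustrative.

```stan
// Sketch: array dimensions precede matrix dimensions when indexing;
// the sizes here are illustrative.
transformed data {
  matrix[4, 5] a[2, 3, 6];  // 2 x 3 x 6 array of 4 x 5 matrices
  // a[1, 2, 3]       has type matrix[4, 5]
  // a[1, 2, 3, 4]    has type row_vector[5]  (fourth row)
  // a[1, 2, 3, 4, 5] has type real
}
```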
Variables

The type of a variable declared locally or in a previous block is determined by its declaration. The type of a loop variable is int. There is always a unique declaration for each variable in each scope because Stan prohibits the redeclaration of an already-declared variable.[1]

[1] Languages such as C++ and R allow the declaration of a variable of a given name in a narrower scope to hide (take precedence over for evaluation) a variable defined in a containing scope.

Indexing

If e is an expression of total dimensionality greater than or equal to N, then the type of the expression e[i1, ..., iN] is the same as that of e[i1]...[iN], so it suffices to define the type of a singly-indexed expression. Suppose e is an expression and i is an expression of primitive type int. Then

• if e is an expression of array dimensionality K > 0, then e[i] has array dimensionality K − 1 and the same primitive implementation type as e,

• if e has implementation type vector or row_vector of array dimensionality 0, then e[i] has implementation type real, and

• if e has implementation type matrix, then e[i] has type row_vector.

Function Application

If f is the name of a function and e1, ..., eN are expressions for N ≥ 0, then f(e1, ..., eN) is an expression whose type is determined by the return type in the function signature for f given e1 through eN. Recall that a function signature is a declaration of the argument types and the result type.

In looking up functions, binary operators like real * real are defined as operator*(real,real) in the documentation and index. In matching a function definition, arguments of type int may be promoted to type real if necessary (see the subsection on type promotion in Section 4.9 for an exact specification of Stan's integer-to-real type-promotion rule). In general, matrix operations return the lowest inferable type.
For example, row_vector * vector returns a value of type real, which is declared in the function documentation and index as real operator*(row_vector,vector).

4.11. Chain Rule and Derivatives

Derivatives of the log probability function defined by a model are used in several ways by Stan. The Hamiltonian Monte Carlo samplers, including NUTS, use gradients to guide updates. The BFGS optimizers also use gradients to guide their search for posterior modes.

Errors Due to Chain Rule

Unlike evaluations in pure mathematics, evaluation of derivatives in Stan is done by applying the chain rule on an expression-by-expression basis, evaluating using floating-point arithmetic. As a result, models such as the following are problematic for inference involving derivatives.

    parameters {
      real x;
    }
    model {
      x ~ normal(sqrt(x - x), 1);
    }

Algebraically, the sampling statement in the model could be reduced to

    x ~ normal(0, 1);

and it would seem the model should produce unit normal samples for x. But rather than canceling, the expression sqrt(x - x) causes a problem for derivatives. The cause is the mechanistic evaluation of the chain rule,

\[
\frac{d}{dx} \sqrt{x - x}
  = \frac{1}{2\sqrt{x - x}} \times \frac{d}{dx}(x - x)
  = \frac{1}{0} \times (1 - 1)
  = \infty \times 0
  = \textrm{NaN}.
\]

Rather than the x − x canceling out, it introduces a 0 into the numerator and denominator of the chain-rule evaluation. The only way to avoid this kind of problem is to be careful to do the necessary algebraic reductions as part of the model and not introduce expressions like sqrt(x - x) for which the chain rule produces not-a-number values.

Diagnosing Problems with Derivatives

The best way to diagnose whether something is going wrong with the derivatives is to use the test-gradient option of the sampler or optimizer inputs; this option is available in both Stan and RStan (though it may be slow, because it relies on finite differences to make a comparison to the built-in automatic differentiation).
For example, compiling the above model to an executable sqrt-x-minus-x, the test can be run as

    > ./sqrt-x-minus-x diagnose test=gradient
    ...
    TEST GRADIENT MODE

     Log probability=-0.393734

     param idx       value       model   finite diff       error
             0   -0.887393         nan             0         nan

Even though finite differences calculates the right gradient of 0, automatic differentiation follows the chain rule and produces a not-a-number output.

5. Statements

The blocks of a Stan program (see Chapter 6) are made up of variable declarations and statements. Unlike programs in BUGS, the declarations and statements making up a Stan program are executed in the order in which they are written. Variables must be defined to have some value (as well as declared to have some type) before they are used — if they are not, the behavior is undefined.

The basis of Stan's execution is the evaluation of a log probability function (specifically, a probability density function) for a given set of (real-valued) parameters. The log probability function can be constructed using assignment statements. Statements may be grouped into sequences and into for-each loops. In addition, Stan allows local variables to be declared in blocks and also allows an empty statement consisting only of a semicolon.

5.1. Assignment Statement

An assignment statement consists of a variable (possibly multivariate with indexing information) and an expression. Executing an assignment statement evaluates the expression on the right-hand side and assigns it to the (indexed) variable on the left-hand side. An example of a simple assignment is as follows.[1]

    n = 0;

Executing this statement assigns the value of the expression 0, which is the integer zero, to the variable n. For an assignment to be well formed, the type of the expression on the right-hand side should be compatible with the type of the (indexed) variable on the left-hand side.
For the above example, because 0 is an expression of type int, the variable n must be declared as being of type int or of type real. If the variable is of type real, the integer zero is promoted to a floating-point zero and assigned to the variable. After the assignment statement executes, the variable n will have the value zero (either as an integer or a floating-point value, depending on its type).

Syntactically, every assignment statement must be followed by a semicolon. Otherwise, whitespace between the tokens does not matter (the tokens here being the left-hand-side (indexed) variable, the assignment operator, the right-hand-side expression, and the semicolon).

Because the right-hand side is evaluated first, it is possible to increment a variable in Stan just as in C++ and other programming languages by writing

    n = n + 1;

Such self assignments are not allowed in BUGS, because they induce a cycle into the directed graphical model.

[1] In versions of Stan before 2.17.0, the operator <- was used for assignment rather than the equal sign =. The old operator <- is now deprecated and will print a warning. In the future, it will be removed.

The left-hand side of an assignment may contain indices for array, matrix, or vector data structures. For instance, if Sigma is of type matrix, then

    Sigma[1, 1] = 1.0;

sets the value in the first column of the first row of Sigma to one.

Assignments can involve complex objects of any type. If Sigma and Omega are matrices and sigma is a vector, then the following assignment statement, in which the expression and variable are both of type matrix, is well formed.

    Sigma
      = diag_matrix(sigma)
        * Omega
        * diag_matrix(sigma);

This example also illustrates the preferred form of splitting a complex assignment statement and its expression across lines.

Assignments to subcomponents of larger multivariate data structures are supported by Stan.
For example, if a is an array of type real[ , ] and b is an array of type real[], then the following two statements are both well formed.

    a[3] = b;
    b = a[4];

Similarly, if x is a variable declared to have type row_vector and Y is a variable declared as type matrix, then the following sequence of statements to swap the first two rows of Y is well formed.

    x = Y[1];
    Y[1] = Y[2];
    Y[2] = x;

Lvalue Summary

The expressions that are legal left-hand sides of assignment statements are known as "lvalues." In Stan, there are only two kinds of legal lvalues,

• a variable, or

• a variable with one or more indices.

To be used as an lvalue, an indexed variable must have at least as many dimensions as the number of indices provided. An array of real or integer types has as many dimensions as it is declared for. A matrix has two dimensions and a vector or row vector one dimension; this also holds for the constrained types, covariance and correlation matrices and their Cholesky factors and ordered, positive ordered, and simplex vectors. An array of matrices has two more dimensions than the array and an array of vectors or row vectors has one more dimension than the array. Note that the number of indices can be less than the number of dimensions of the variable, meaning that the right-hand side must itself be multidimensional to match the remaining dimensions.

Multiple Indexes

Multiple indexes, as described in Section 4.8, are also permitted on the left-hand side of assignments. Indexing on the left side works exactly as it does for expressions, with multiple indexes preserving index positions and single indexes reducing them. The type on the left side must still match the type on the right side.

Aliasing

All assignment is carried out as if the right-hand side is copied before the assignment. This resolves any potential aliasing issues arising from the right-hand side changing in the middle of an assignment statement's execution.
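The copy-before-assign semantics just described can be seen with overlapping multiple indexes on the two sides of an assignment; the following is a minimal sketch with illustrative values.

```stan
// Sketch: the right-hand side is copied before assignment, so an
// overlapping shift like this is safe; the values are illustrative.
transformed data {
  int a[3];
  a[1] = 1;
  a[2] = 2;
  a[3] = 3;
  a[2:3] = a[1:2];  // right-hand side copied first; a is now {1, 1, 2}
}
```

Without the copy, a naive left-to-right evaluation would write a[2] before reading it for a[3], yielding {1, 1, 1} instead.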
Compound Arithmetic and Assignment Statement

Stan's arithmetic operators may be used in compound arithmetic and assignment operations. For example, consider the following example of compound addition and assignment.

    real x = 5;
    x += 7;   // value of x is now 12

The compound arithmetic and assignment statement above is equivalent to the following long form.

    x = x + 7;

In general, the compound form

    x op= y

will be equivalent to

    x = x op y;

    Operation                   Compound  Long
    addition                    x += y    x = x + y
    subtraction                 x -= y    x = x - y
    multiplication              x *= y    x = x * y
    division                    x /= y    x = x / y
    elementwise multiplication  x .*= y   x = x .* y
    elementwise division        x ./= y   x = x ./ y

Figure 5.1: Stan allows compound arithmetic and assignment statements of the forms listed in the table above. The compound form is legal whenever the corresponding long form would be legal and it has the same effect.

The compound statement will be legal whenever the long form is legal. This requires that the operation x op y must itself be well formed and that the result of the operation be assignable to x. For the expression x to be assignable, it must be an indexed variable where the variable is defined in the current block. For example, the following compound addition and assignment statement will increment a single element of a vector by two.

    vector[N] x;
    x[3] += 2;

As a further example, consider

    matrix[M, M] x;
    vector[M] y;
    real z;
    x *= x;   // OK, (x * x) is a matrix
    x *= z;   // OK, (x * z) is a matrix
    x *= y;   // BAD, (x * y) is a vector

The supported compound arithmetic and assignment operations are listed in Figure 5.1; they are also listed in the index prefaced by operator, e.g., operator+=.

5.2. Increment Log Density

The basis of Stan's execution is the evaluation of a log probability function (specifically, a probability density function) for a given set of (real-valued) parameters; this function returns the log density of the posterior up to an additive constant.
Data and transformed data are fixed before the log density is evaluated. The total log probability is initialized to zero. Next, any log Jacobian adjustments accrued by the variable constraints are added to the log density (the Jacobian adjustment may be skipped for optimization). Sampling and log probability increment statements may add to the log density in the model block. A log probability increment statement directly increments the log density with the value of an expression as follows.[2]

    target += -0.5 * y * y;

The keyword target here is actually not a variable, and may not be accessed as such (though see below on how to access the value of target through a special function). In this example, the unnormalized log probability of a unit normal variable y is added to the total log probability. In the general case, the argument can be any expression.[3]

An entire Stan model can be implemented this way. For instance, the following model will draw a single variable according to a unit normal probability.

    parameters {
      real y;
    }
    model {
      target += -0.5 * y * y;
    }

This model defines a log probability function

\[
\log p(y) = -\frac{y^2}{2} - \log Z
\]

where Z is a normalizing constant that does not depend on y. The constant Z is conventionally written this way because on the linear scale,

\[
p(y) = \frac{1}{Z} \exp\!\left( -\frac{y^2}{2} \right),
\]

which is typically written without reference to Z as

\[
p(y) \propto \exp\!\left( -\frac{y^2}{2} \right).
\]

Stan only requires models to be defined up to a constant that does not depend on the parameters. This is convenient because often the normalizing constant Z is either time-consuming to compute or intractable to evaluate.

[2] The current notation replaces two previous versions. Originally, a variable lp__ was directly exposed and manipulated; this is no longer allowed. The original statement syntax for target += u was increment_log_prob(u), but this form has been deprecated and will be removed in Stan 3.
[3] Writing this model with the expression -0.5 * y * y is more efficient than with the equivalent expression y * y / -2 because multiplication is more efficient than division; in both cases, the negation is rolled into the numeric literal (-0.5 and -2). Writing square(y) instead of y * y would be even more efficient because the derivatives can be precomputed, reducing the memory and number of operations required for automatic differentiation.

Relation to Compound Addition and Assignment

Although the increment log density statement looks syntactically like compound addition and assignment (see Section 5.1.3), it is treated as a primitive statement because target is not itself a variable. So, even though

    target += lp;

is a legal statement, the corresponding long form is not legal.

    target = target + lp;   // BAD, target is not a variable

Vectorization

The target += ... statement accepts an argument in place of ... for any expression type, including integers, reals, vectors, row vectors, matrices, and arrays of any dimensionality, including arrays of vectors and matrices. For container arguments, their sum will be added to the total log density.

Accessing the Log Density

To access the accumulated log density up to the current execution point, the function target() may be used.

5.3. Sampling Statements

Stan supports writing probability statements also in sampling notation, such as

    y ~ normal(mu, sigma);

The name "sampling statement" is meant to be suggestive, not interpreted literally. Conceptually, the variable y, which may be an unknown parameter or known, modeled data, is being declared to have the distribution indicated by the right-hand side of the sampling statement. Executing such a statement does not perform any sampling. In Stan, a sampling statement is merely a notational convenience.
The above sampling statement could be expressed as a direct increment on the total log probability as

    target += normal_lpdf(y | mu, sigma);

In general, a sampling statement of the form

    y ~ dist(theta1, ..., thetaN);

involving subexpressions y and theta1 through thetaN (including the case where N is zero) will be well formed if and only if the corresponding assignment statement is well formed. For densities allowing real y values, the log probability density function is used,

    target += dist_lpdf(y | theta1, ..., thetaN);

For those restricted to integer y values, the log probability mass function is used,

    target += dist_lpmf(y | theta1, ..., thetaN);

This will be well formed if and only if dist_lpdf(y | theta1, ..., thetaN) or dist_lpmf(y | theta1, ..., thetaN) is a well-formed expression of type real.

Log Probability Increment vs. Sampling Statement

Although both lead to the same sampling behavior in Stan, there is one critical difference between using the sampling statement, as in

    y ~ normal(mu, sigma);

and explicitly incrementing the log probability function, as in

    target += normal_lpdf(y | mu, sigma);

The sampling statement drops all the terms in the log probability function that are constant, whereas the explicit call to normal_lpdf adds all of the terms in the definition of the log normal probability function, including all of the constant normalizing terms. Therefore, the explicit increment form can be used to recreate the exact log probability values for the model. Otherwise, the sampling statement form will be faster if any of the input expressions, y, mu, or sigma, involve only constants, data variables, and transformed data variables.

User-Transformed Variables

The left-hand side of a sampling statement may be a complex expression. For instance, it is legal syntactically to write

    parameters {
      real beta;
    }
    // ...
    model {
      log(beta) ~ normal(mu, sigma);
    }

Unfortunately, this is not enough to properly model beta as having a lognormal distribution.
Whenever a nonlinear transform is applied to a parameter, such as the logarithm function being applied to beta here, and then used on the left-hand side of a sampling statement or on the left of a vertical bar in a log pdf function, an adjustment must be made to account for the differential change in scale and ensure beta gets the correct distribution. The correction required is to add the log Jacobian of the transform to the target log density (see Section 35.1 for full definitions). For the case above, the following adjustment will account for the log transform.4

  target += -log(fabs(beta));

4. Because log |d/dy log y| = log |1/y| = −log |y|; see Section 35.1.

Truncated Distributions

Stan supports truncating distributions with lower bounds, upper bounds, or both.

Truncating with lower and upper bounds

A probability density function p(x) for a continuous distribution may be truncated to an interval [a, b] to define a new density p_[a,b](x) with support [a, b] by setting

  p_[a,b](x) = p(x) / ∫_a^b p(u) du.

A probability mass function p(x) for a discrete distribution may be truncated to the closed interval [a, b] by

  p_[a,b](x) = p(x) / Σ_{u=a}^{b} p(u).

Truncating with a lower bound

A probability density function p(x) can be truncated to [a, ∞] by defining

  p_[a,∞](x) = p(x) / ∫_a^∞ p(u) du.

A probability mass function p(x) is truncated to [a, ∞] by defining

  p_[a,∞](x) = p(x) / Σ_{a ≤ u} p(u).

Truncating with an upper bound

A probability density function p(x) can be truncated to [−∞, b] by defining

  p_[−∞,b](x) = p(x) / ∫_{−∞}^b p(u) du.

A probability mass function p(x) is truncated to [−∞, b] by defining

  p_[−∞,b](x) = p(x) / Σ_{u ≤ b} p(u).

Cumulative distribution functions

Given a probability function p_X(x) for a random variable X, its cumulative distribution function (cdf) F_X(x) is defined to be the probability that X ≤ x,

  F_X(x) = Pr[X ≤ x].

The upper-case variable X is the random variable whereas the lower-case variable x is just an ordinary bound variable.
For continuous random variables, the definition of the cdf works out to

  F_X(x) = ∫_{−∞}^x p_X(u) du.

For discrete variables, the cdf is defined to include the upper bound given by the argument,

  F_X(x) = Σ_{u ≤ x} p_X(u).

Complementary cumulative distribution functions

The complementary cumulative distribution function (ccdf) in both the continuous and discrete cases is given by

  F_X^C(x) = Pr[X > x] = 1 − F_X(x).

Unlike the cdf, the ccdf is exclusive of the bound, hence the event X > x rather than the cdf's event X ≤ x. For continuous distributions, the ccdf works out to

  F_X^C(x) = 1 − ∫_{−∞}^x p_X(u) du = ∫_x^∞ p_X(u) du.

The lower boundary can be included in the integration bounds because it is a single point on a line and hence has no probability mass. For the discrete case, the lower bound must be excluded in the summation explicitly by summing over u > x,

  F_X^C(x) = 1 − Σ_{u ≤ x} p_X(u) = Σ_{u > x} p_X(u).

Cumulative distribution functions provide the necessary integral calculations to define truncated distributions. For truncation with lower and upper bounds, the denominator is defined by

  ∫_a^b p(u) du = F_X(b) − F_X(a).

This allows truncated distributions to be defined as

  p_[a,b](x) = p_X(x) / (F_X(b) − F_X(a)).

For discrete distributions, a slightly more complicated form is required to explicitly insert the lower truncation point, which is otherwise excluded from F_X(b) − F_X(a),

  p_[a,b](x) = p_X(x) / (F_X(b) − F_X(a) + p_X(a)).

Truncation with lower and upper bounds in Stan

Stan allows probability functions to be truncated. For example, a truncated unit normal distribution restricted to [−0.5, 2.1] can be coded with the following sampling statement.

  y ~ normal(0, 1) T[-0.5, 2.1];

Truncated distributions are translated as an additional term in the accumulated log density function plus error checking to make sure the variate in the sampling statement is within the bounds of the truncation. In general, the truncation bounds and parameters may be parameters or local variables.
Because the example above involves a continuous distribution, it behaves the same way as the following more verbose form.

  y ~ normal(0, 1);
  if (y < -0.5 || y > 2.1)
    target += negative_infinity();
  else
    target += -log_diff_exp(normal_lcdf(2.1 | 0, 1),
                            normal_lcdf(-0.5 | 0, 1));

Because a Stan program defines a log density function, all calculations are on the log scale. The function normal_lcdf is the log of the cumulative normal distribution function and the function log_diff_exp(a, b) is a more arithmetically stable form of log(exp(a) - exp(b)).

For a discrete distribution, another term is necessary in the denominator to account for the excluded boundary. The truncated discrete distribution

  y ~ poisson(3.7) T[2, 10];

behaves in the same way as the following code.

  y ~ poisson(3.7);
  if (y < 2 || y > 10)
    target += negative_infinity();
  else
    target += -log_sum_exp(poisson_lpmf(2 | 3.7),
                           log_diff_exp(poisson_lcdf(10 | 3.7),
                                        poisson_lcdf(2 | 3.7)));

Recall that log_sum_exp(a, b) is just the arithmetically stable form of log(exp(a) + exp(b)).

Truncation with lower bounds in Stan

For truncating with only a lower bound, the upper limit is left blank.

  y ~ normal(0, 1) T[-0.5, ];

This truncated sampling statement has the same behavior as the following code.

  y ~ normal(0, 1);
  if (y < -0.5)
    target += negative_infinity();
  else
    target += -normal_lccdf(-0.5 | 0, 1);

The normal_lccdf function is the normal complementary cumulative distribution function. As with lower and upper truncation, the discrete case requires a more complicated denominator to add back in the probability mass for the lower bound. Thus

  y ~ poisson(3.7) T[2, ];

behaves the same way as

  y ~ poisson(3.7);
  if (y < 2)
    target += negative_infinity();
  else
    target += -log_sum_exp(poisson_lpmf(2 | 3.7),
                           poisson_lccdf(2 | 3.7));

Truncation with upper bounds in Stan

To truncate with only an upper bound, the lower bound is left blank.
The upper-truncated sampling statement

  y ~ normal(0, 1) T[ , 2.1];

produces the same result as the following code.

  target += normal_lpdf(y | 0, 1);
  if (y > 2.1)
    target += negative_infinity();
  else
    target += -normal_lcdf(2.1 | 0, 1);

With only an upper bound, the discrete case does not need a boundary adjustment. The upper-truncated sampling statement

  y ~ poisson(3.7) T[ , 10];

behaves the same way as the following code.

  y ~ poisson(3.7);
  if (y > 10)
    target += negative_infinity();
  else
    target += -poisson_lcdf(10 | 3.7);

Cumulative distributions must be defined

In all cases, the truncation is only well formed if the appropriate log density or mass function and the necessary log cumulative distribution functions are defined. Not every distribution built into Stan has log cdf and log ccdf functions defined, nor will every user-defined distribution. Part VIII and Part IX document the available discrete and continuous cumulative distribution functions; most univariate distributions have log cdf and log ccdf functions.

Type constraints on bounds

For continuous distributions, truncation points must be expressions of type int or real. For discrete distributions, truncation points must be expressions of type int.

Variates outside of truncation bounds

For a truncated sampling statement, if the value sampled is not within the bounds specified by the truncation expression, the result is zero probability and the entire statement adds −∞ to the total log probability, which in turn results in the sample being rejected; see the subsection of Section 12.2 discussing constraints and out-of-bounds returns for programming strategies to keep all values within bounds.

Vectorizing Truncated Distributions

Stan does not (yet) support vectorization of distribution functions with truncation.

5.4. For Loops

Suppose N is a variable of type int, y is a one-dimensional array of type real[], and mu and sigma are variables of type real. Furthermore, suppose that n has not been defined as a variable.
Then the following is a well-formed for-loop statement.

  for (n in 1:N) {
    y[n] ~ normal(mu, sigma);
  }

The loop variable is n, the loop bounds are the values in the range 1:N, and the body is the statement following the loop bounds.

Loop Variable Typing and Scope

The bounds in a for loop must be integers. Unlike in R, the loop is always interpreted as an upward counting loop. The range L:H will cause the loop to execute with the loop variable taking on all integer values greater than or equal to L and less than or equal to H. For example, the loop for (n in 2:5) will cause the body of the for loop to be executed with n equal to 2, 3, 4, and 5, in order. The loop for (n in 5:2) will not execute anything because there are no integers greater than or equal to 5 and less than or equal to 2.

Order Sensitivity and Repeated Variables

Unlike in BUGS, Stan allows variables to be reassigned. For example, the variable theta in the following program is reassigned in each iteration of the loop.

  for (n in 1:N) {
    theta = inv_logit(alpha + x[n] * beta);
    y[n] ~ bernoulli(theta);
  }

Such reassignment is not permitted in BUGS. In BUGS, for loops are declarative, defining plates in directed graphical model notation, which can be thought of as repeated substructures in the graphical model. Therefore, it is illegal in BUGS or JAGS to have a for loop that repeatedly reassigns a value to a variable.5

In Stan, assignments are executed in the order they are encountered. As a consequence, the following Stan program has a very different interpretation than the previous one.

  for (n in 1:N) {
    y[n] ~ bernoulli(theta);
    theta = inv_logit(alpha + x[n] * beta);
  }

In this program, theta is assigned after it is used in the probability statement. This presupposes it was defined before the first loop iteration (otherwise behavior is undefined), and then each loop iteration uses the assignment from the previous iteration. Stan loops may be used to accumulate values.
Thus it is possible to sum the values of an array directly using code such as the following.

  total = 0.0;
  for (n in 1:N)
    total = total + x[n];

After the for loop is executed, the variable total will hold the sum of the elements in the array x. This example was purely pedagogical; it is easier and more efficient to write

  total = sum(x);

A variable inside (or outside) a loop may even be reassigned multiple times, as in the following legal code.

  for (n in 1:100) {
    y += y * epsilon;
    epsilon = 0.5 * epsilon;
    y += y * epsilon;
  }

5. A programming idiom in BUGS code simulates a local variable by replacing theta in the above example with theta[n], effectively creating N different variables, theta[1], ..., theta[N]. Of course, this is not a hack if the value of theta[n] is required for all n.

5.5. Conditional Statements

Stan supports full conditional statements using the same if-then-else syntax as C++. The general format is

  if (condition1)
    statement1
  else if (condition2)
    statement2
  // ...
  else if (conditionN-1)
    statementN-1
  else
    statementN

There must be a single leading if clause, which may be followed by any number of else if clauses, all of which may be optionally followed by an else clause. Each condition must be a real or integer value, with non-zero values interpreted as true and the zero value as false. The entire sequence of if-then-else clauses forms a single conditional statement for evaluation. The conditions are evaluated in order until one of the conditions evaluates to a non-zero value, at which point its corresponding statement is executed and the conditional statement finishes execution. If none of the conditions evaluates to a non-zero value and there is a final else clause, its statement is executed.

5.6. While Statements

Stan supports standard while loops using the same syntax as C++. The general format is as follows.
  while (condition)
    body

The condition must be an integer or real expression and the body can be any statement (or sequence of statements in curly braces). Evaluation of a while loop starts by evaluating the condition. If the condition evaluates to a false (zero) value, the execution of the loop terminates and control moves to the position after the loop. If the loop's condition evaluates to a true (non-zero) value, the body statement is executed, then the whole loop is executed again. Thus the loop is continually executed as long as the condition evaluates to a true value.

5.7. Statement Blocks and Local Variable Declarations

Just as parentheses may be used to group expressions, curly brackets may be used to group a sequence of zero or more statements into a statement block. At the beginning of each block, local variables may be declared that are scoped over the rest of the statements in the block.

Blocks in For Loops

Blocks are often used to group a sequence of statements together to be used in the body of a for loop. Because the body of a for loop can be any statement, for loops with bodies consisting of a single statement can be written as follows.

  for (n in 1:N)
    y[n] ~ normal(mu, sigma);

To put multiple statements inside the body of a for loop, a block is used, as in the following example.

  for (n in 1:N) {
    lambda[n] ~ gamma(alpha, beta);
    y[n] ~ poisson(lambda[n]);
  }

The open curly bracket ({) is the first character of the block and the close curly bracket (}) is the last character. Because whitespace is ignored in Stan, the following program will not compile.

  for (n in 1:N)
    y[n] ~ normal(mu, sigma);
    z[n] ~ normal(mu, sigma); // ERROR!

The problem is that the body of the for loop is taken to be the statement directly following it, which is y[n] ~ normal(mu, sigma). This leaves the probability statement for z[n] hanging, as is clear from the following equivalent program.

  for (n in 1:N) {
    y[n] ~ normal(mu, sigma);
  }
  z[n] ~ normal(mu, sigma); // ERROR!
Neither of these programs will compile. If the loop variable n was defined before the for loop, the for-loop declaration will raise an error. If the loop variable n was not defined before the for loop, then the use of the expression z[n] will raise an error.

Local Variable Declarations

A for loop has a statement as a body. It is often convenient in writing programs to be able to define a local variable that will be used temporarily and then forgotten. For instance, the for loop example of repeated assignment should use a local variable for maximum clarity and efficiency, as in the following example.

  for (n in 1:N) {
    real theta;
    theta = inv_logit(alpha + x[n] * beta);
    y[n] ~ bernoulli(theta);
  }

The local variable theta is declared here inside the for loop. The scope of a local variable is just the block in which it is defined. Thus theta is available for use inside the for loop, but not outside of it. As in other situations, Stan does not allow variable hiding. So it is illegal to declare a local variable theta if the variable theta is already defined in the scope of the for loop. For instance, the following is not legal.

  for (m in 1:M) {
    real theta;
    for (n in 1:N) {
      real theta; // ERROR!
      theta = inv_logit(alpha + x[m, n] * beta);
      y[m, n] ~ bernoulli(theta);
      // ...

The compiler will flag the second declaration of theta with a message that it is already defined.

No Constraints on Local Variables

Local variables may not have constraints on their declaration. The only types that may be used are int, real, vector[K], row_vector[K], and matrix[M, N].

Blocks within Blocks

A block is itself a statement, so anywhere a sequence of statements is allowed, one or more of the statements may be a block. For instance, in a for loop, it is legal to have the following.

  for (m in 1:M) {
    {
      int n = 2 * m;
      sum += n;
    }
    for (n in 1:N)
      sum += x[m, n];
  }

The variable declaration int n; is the first element of an embedded block and so has scope within that block.
The for loop defines its own local block implicitly over the statement following it in which the loop variable is defined. As far as Stan is concerned, these two uses of n are unrelated.

5.8. Break and Continue Statements

The one-token statements continue and break may be used within loops to alter control flow; continue causes the next iteration of the loop to run immediately, whereas break terminates the loop and causes execution to resume after the loop. Both control structures must appear in loops. Both break and continue scope to the most deeply nested loop, but pass through non-loop statements.

Although these control statements may seem undesirable because of their goto-like behavior, their judicious use can greatly improve readability by reducing the level of nesting or eliminating bookkeeping inside loops.

Break Statements

When a break statement is executed, the most deeply nested loop currently being executed is ended and execution picks up with the next statement after the loop. For example, consider the following program:

  while (1) {
    if (n < 0)
      break;
    foo(n);
    n = n - 1;
  }

The while (1) loop is a “forever” loop, because 1 is the true value, so the test always succeeds. Within the loop, if the value of n is less than 0, the loop terminates; otherwise it executes foo(n) and then decrements n. The statement above does exactly the same thing as

  while (n >= 0) {
    foo(n);
    n = n - 1;
  }

This case is simply illustrative of the behavior; it is not a case where a break simplifies the loop.

Continue Statements

The continue statement ends the current operation of the loop and returns to the condition at the top of the loop. Such loops are typically used to exclude some values from calculations. For example, we could use the following loop to sum the positive values in the array x.

  real sum;
  sum = 0;
  for (n in 1:size(x)) {
    if (x[n] <= 0)
      continue;
    sum += x[n];
  }

When the continue statement is executed, control jumps back to the conditional part of the loop.
With while loops, this causes control to return to the condition of the loop. With for loops, this advances the loop variable, so the above program will not go into an infinite loop when faced with an x[n] less than zero. Thus the above program could be rewritten with deeper nesting by reversing the conditional.

  real sum;
  sum = 0;
  for (n in 1:size(x)) {
    if (x[n] > 0)
      sum += x[n];
  }

While the latter form may seem more readable in this simple case, the former has the main line of execution nested one level less deep. Instead, the conditional at the top finds cases to exclude and doesn't require the same level of nesting for code that's not excluded. When there are several such exclusion conditions, the break or continue versions tend to be much easier to read.

Breaking and Continuing Nested Loops

If there is a loop nested within a loop, a break or continue statement only breaks out of the inner loop. So

  while (cond1) {
    ...
    while (cond2) {
      ...
      if (cond3)
        break;
      ...
    }
    // execution continues here after break
    ...
  }

If the break is triggered by cond3 being true, execution will continue after the nested loop. As with break statements, continue statements go back to the top of the most deeply nested loop in which the continue appears.

Although break and continue must appear within loops, they may appear in nested statements within loops, such as within the conditionals shown above or within nested statements. The break and continue statements jump past any control structure other than while loops and for loops.

5.9. Print Statements

Stan provides print statements that can print literal strings and the values of expressions. Print statements accept any number of arguments. Consider the following for-each statement with a print statement in its body.

  for (n in 1:N) {
    print("loop iteration: ", n);
    ...
  }

The print statement will execute every time the body of the loop does.
Each time the loop body is executed, it will print the string “loop iteration: ” (with the trailing space), followed by the value of the expression n, followed by a new line.

Print Content

The text printed by a print statement varies based on its content. A literal (i.e., quoted) string in a print statement always prints exactly that string (without the quotes). Expressions in print statements result in the value of the expression being printed. But how the value of the expression is formatted will depend on its type.

Printing a simple real or int typed variable always prints the variable's value.6 For array, vector, and matrix variables, the print format uses brackets. For example, a 3-vector will print as

  [1, 2, 3]

and a 2 × 3 matrix as

  [[1, 2, 3], [4, 5, 6]]

Printing a more readable version of arrays or matrices can be done with loops. An example is the print statement in the following transformed data block.

  transformed data {
    matrix[2, 2] u;
    u[1, 1] = 1.0;
    u[1, 2] = 4.0;
    u[2, 1] = 9.0;
    u[2, 2] = 16.0;
    for (n in 1:2)
      print("u[", n, "] = ", u[n]);
  }

This print statement executes twice, printing the following two lines of output.

  u[1] = [1, 4]
  u[2] = [9, 16]

Non-void Input

The input type to a print function cannot be void. In particular, it can't be the result of a user-defined void function. All other types are allowed as arguments to the print function.

Print Frequency

Printing for a print statement happens every time it is executed. The transformed data block is executed once per chain, the transformed parameter and model blocks once per leapfrog step, and the generated quantities block once per iteration.

6. The adjoint component is always zero during execution for the algorithmic differentiation variables used to implement parameters, transformed parameters, and local variables in the model.

String Literals

String literals begin and end with a double quote character (").
The characters between the double quote characters may be the space character or any visible ASCII character, with the exception of the backslash character (\) and double quote character ("). The visible ASCII characters are the letters a–z and A–Z, the digits 0–9, and the following symbols.

  ~ @ # $ % ^ & * _ ' ` - + = { } [ ] ( ) < > | / ! ? . , ; :

Debug by print

Because Stan is an imperative language, print statements can be very useful for debugging. They can be used to display the values of variables or expressions at various points in the execution of a program. They are particularly useful for spotting problematic not-a-number or infinite values, both of which will be printed.

It is particularly useful to print the value of the log probability accumulator (see Section 41.4), as in the following example.

  vector[2] y;
  y[1] = 1;
  print("lp before =", target());
  y ~ normal(0, 1); // bug! y[2] not defined
  print("lp after =", target());

The example has a bug in that y[2] is not defined before the vector y is used in the sampling statement. By printing the value of the log probability accumulator before and after each sampling statement, it's possible to isolate where the log probability becomes ill-defined (i.e., becomes not-a-number).

5.10. Reject Statements

The Stan reject statement provides a mechanism to report errors or problematic values encountered during program execution and either halt processing or reject samples or optimization iterations. Like the print statement, the reject statement accepts any number of quoted string literals or Stan expressions as arguments. Reject statements are typically embedded in a conditional statement in order to detect variables in illegal states. For example, the following code handles the case where a variable x's value is negative.
  if (x < 0)
    reject("x must not be negative; found x=", x);

Behavior of Reject Statements

Reject statements have the same behavior as exceptions thrown by built-in Stan functions. For example, the normal_lpdf function raises an exception if the input scale is not positive and finite. The effect of a reject statement depends on the program block in which the rejection occurs. In all cases of rejection, the interface accessing the Stan program should print the arguments to the reject statement.

Rejections in Functions

Rejections in user-defined functions are just passed on to the calling function or program block. Reject statements can be used in functions to validate the function arguments, allowing user-defined functions to fully emulate built-in function behavior. It is better to find out earlier rather than later when there is a problem.

Fatal Exception Contexts

In both the transformed data block and generated quantities block, rejections are fatal. This is because if initialization fails or if generating output fails, there is no way to recover values. Reject statements placed in the transformed data block can be used to validate both the data and transformed data (if any). This allows more complicated constraints to be enforced than can be specified with Stan's constrained variable declarations.

Recoverable Rejection Contexts

Rejections in the transformed parameters and model blocks are not in and of themselves instantly fatal. The result has the same effect as assigning a −∞ log probability, which causes rejection of the current proposal in MCMC samplers and adjustment of search parameters in optimization. If the log probability function results in a rejection every time it is called, the containing application (MCMC sampler or optimization) should diagnose this problem and terminate with an appropriate error message. To aid in diagnosing problems, the message for each reject statement will be printed as a result of executing it.
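The argument-validation use of reject described above under Rejections in Functions can be sketched with a user-defined function; the function name and checks here are hypothetical, invented for illustration.

```stan
functions {
  // Hypothetical helper: rate-scaled logarithm that validates
  // its arguments the way built-in functions do.
  real scaled_log(real x, real rate) {
    if (rate <= 0)
      reject("scaled_log: rate must be positive; found rate=", rate);
    if (x <= 0)
      reject("scaled_log: x must be positive; found x=", x);
    return rate * log(x);
  }
}
```

A rejection raised here propagates to the block from which the function was called, so the overall effect (fatal or recoverable) follows the rules for that block.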
Rejection is not for Constraints

Rejection should be used for error handling, not defining arbitrary constraints. Consider the following errorful Stan program.

  parameters {
    real a;
    real<lower=a> b;
    real<lower=a, upper=b> theta;
    ...
  model {
    // **wrong** needs explicit truncation
    theta ~ normal(0, 1);
    ...

This program is wrong because its truncation bounds on theta depend on parameters, and thus need to be accounted for using an explicit truncation on the distribution. This is the right way to do it.

  theta ~ normal(0, 1) T[a, b];

The conceptual issue is that the prior does not integrate to one over the admissible parameter space; it integrates to one over all real numbers and integrates to something less than one over [a, b]; in these simple univariate cases, we can overcome that with the T[ , ] notation, which essentially divides by whatever the prior integrates to over [a, b].

This problem is exactly the same problem as you would get using reject statements to enforce complicated inequalities on multivariate functions. In this case, it is wrong to try to deal with truncation through constraints.

  if (theta < a || theta > b)
    reject("theta not in (a, b)");
  // still **wrong**, needs T[a, b]
  theta ~ normal(0, 1);

In this case, the prior integrates to something less than one over the region of the parameter space where the complicated inequalities are satisfied. But we don't generally know what value the prior integrates to, so we can't increment the log probability function to compensate. Even if this adjustment to a proper probability model may seem like “no big deal” in particular models where the amount of truncated posterior density is negligible or constant, we can't sample from that truncated posterior efficiently. Programs need to use one-to-one mappings that guarantee the constraints are satisfied and only use reject statements to raise errors or help with debugging.

6.
Program Blocks

A Stan program is organized into a sequence of named blocks, the bodies of which consist of variable declarations, followed in the case of some blocks by statements.

6.1. Overview of Stan's Program Blocks

The full set of named program blocks is exemplified in the following skeletal Stan program.

  functions {
    // ... function declarations and definitions ...
  }
  data {
    // ... declarations ...
  }
  transformed data {
    // ... declarations ... statements ...
  }
  parameters {
    // ... declarations ...
  }
  transformed parameters {
    // ... declarations ... statements ...
  }
  model {
    // ... declarations ... statements ...
  }
  generated quantities {
    // ... declarations ... statements ...
  }

The function-definition block contains user-defined functions. The data block declares the required data for the model. The transformed data block allows the definition of constants and transforms of the data. The parameters block declares the model's parameters — the unconstrained version of the parameters is what's sampled or optimized. The transformed parameters block allows variables to be defined in terms of data and parameters that may be used later and will be saved. The model block is where the log probability function is defined. The generated quantities block allows derived quantities based on parameters, data, and optionally (pseudo) random number generation.

Optionality and Ordering

All of the blocks are optional. A consequence of this is that the empty string is a valid Stan program, although it will trigger a warning message from the Stan compiler. The Stan program blocks that occur must occur in the order presented in the skeletal program above. Within each block, both declarations and statements are optional, subject to the restriction that the declarations come before the statements.

Variable Scope

The variables declared in each block have scope over all subsequent statements. Thus a variable declared in the transformed data block may be used in the model block.
But a variable declared in the generated quantities block may not be used in any earlier block, including the model block. The exception to this rule is that variables declared in the model block are always local to the model block and may not be accessed in the generated quantities block; to make a variable accessible in the model and generated quantities blocks, it must be declared as a transformed parameter.

Variables declared as function parameters have scope only within that function definition's body, and may not be assigned to (they are constant).

Function Scope

Functions defined in the function block may be used in any appropriate block. Most functions can be used in any block and applied to a mixture of parameters and data (including constants or program literals). Random-number-generating functions are restricted to the generated quantities block; such functions are suffixed with _rng. Log-probability modifying functions are restricted to blocks where the log probability accumulator is in scope (transformed parameters and model); such functions are suffixed with _lp. Density functions defined in the program may be used in sampling statements.

Automatic Variable Definitions

The variables declared in the data and parameters block are treated differently than other variables in that they are automatically defined by the context in which they are used. This is why there are no statements allowed in the data or parameters block. The variables in the data block are read from an external input source such as a file or a designated R data structure. The variables in the parameters block are read from the sampler's current parameter values (either standard HMC or NUTS). The initial values may be provided through an external input source, which is also typically a file or a designated R data structure. In each case, the parameters are instantiated to the values for which the model defines a log probability function.
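The function-suffix conventions described under Function Scope above can be illustrated with a sketch; the function names here are invented for the example.

```stan
functions {
  // _rng suffix: may call built-in RNGs; callable only where
  // random number generation is allowed (generated quantities).
  real noisy_mean_rng(real mu, real sigma) {
    return normal_rng(mu, sigma);
  }
  // _lp suffix: may modify the log probability accumulator;
  // callable only in transformed parameters and model.
  real standard_normal_lp(real x) {
    target += -0.5 * x * x;
    return x;
  }
}
```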
Transformed Variables

The transformed data and transformed parameters blocks behave similarly to each other. Both allow new variables to be declared and then defined through a sequence of statements. Because variables scope over every statement that follows them, transformed data variables may be defined in terms of the data variables.

Before generating any samples, data variables are read in, then the transformed data variables are declared and the associated statements executed to define them. This means the statements in the transformed data block are only ever evaluated once.1 Transformed parameters work the same way, being defined in terms of the parameters, transformed data, and data variables. The difference is the frequency of evaluation. Parameters are read in and (inverse) transformed to constrained representations on their natural scales once per log probability and gradient evaluation. This means the inverse transforms and their log absolute Jacobian determinants are evaluated once per leapfrog step. Transformed parameters are then declared and their defining statements executed once per leapfrog step.

Generated Quantities

The generated quantity variables are defined once per sample after all the leapfrog steps have been completed. These may be random quantities, so the block must be rerun even if the Metropolis adjustment of HMC or NUTS rejects the update proposal.

Variable Read, Write, and Definition Summary

A table summarizing the point at which variables are read, written, and defined is given in Figure 6.1. Another way to look at the variables is in terms of their function. To decide which variable to use, consult the charts in Figure 6.2. The last line has no corresponding location, as there is no need to print a variable every iteration that does not depend on parameters.2

The rest of this chapter provides full details on when and how the variables and statements in each block are executed.

6.2.
Statistical Variable Taxonomy

Gelman and Hill (2007, p. 366) provide a taxonomy of the kinds of variables used in Bayesian models. Figure 6.3 contains Gelman and Hill's taxonomy, along with a missing-data kind, and the corresponding locations of declarations and definitions in Stan.

[1] If the C++ code is configured for concurrent threads, the data and transformed data blocks can be executed once and reused for multiple chains.

[2] It is possible to print a variable every iteration that does not depend on parameters: just define it (or redefine it if it is transformed data) in the generated quantities block.

    block                    stmt   action / period
    ----------------------   ----   -----------------------------------
    data                     no     read / chain
    transformed data         yes    evaluate / chain
    parameters               no     inv. transform, Jacobian / leapfrog
                                    inv. transform, write / sample
    transformed parameters   yes    evaluate / leapfrog
                                    write / sample
    model                    yes    evaluate / leapfrog step
    generated quantities     yes    eval / sample
                                    write / sample
    (initialization)         n/a    read, transform / chain

Figure 6.1: The read, write, transform, and evaluate actions and periodicities listed in the last column correspond to the Stan program blocks in the first column. The middle column indicates whether the block allows statements. The last row indicates that parameter initialization requires a read and transform operation applied once per chain.

    params   log prob   print   declare in
    ------   --------   -----   ----------------------------------------
      +         +         +     transformed parameters
      +         +         −     local in model
      +         −         −     local in generated quantities
      +         −         +     generated quantities
      −         −         +     generated quantities∗
      −         ±         −     local in transformed data
      −         +         +     transformed data and generated quantities∗

Figure 6.2: This table indicates where variables that are not basic data or parameters should be declared, based on whether the variable is defined in terms of parameters, whether it is used in the log probability function defined in the model block, and whether it is printed. The two lines marked with asterisks (∗) should not be used, as there is no need to print a variable every iteration that does not depend on the value of any parameters (for information on how to print these if necessary, see Footnote 2 in this chapter).

Constants can be built into a model as literals, data variables, or transformed data variables. If specified as data variables, their definitions must be included in data files. If specified as transformed data variables, they cannot be used to specify the sizes of elements in the data block.

    variable kind          declaration block
    --------------------   --------------------------------------------
    unmodeled data         data, transformed data
    modeled data           data, transformed data
    missing data           parameters, transformed parameters
    modeled parameters     parameters, transformed parameters
    unmodeled parameters   data, transformed data
    generated quantities   transformed data, transformed parameters,
                           generated quantities
    loop indices           loop statement

Figure 6.3: Variables of the kind indicated in the left column must be declared in one of the blocks indicated in the right column.

The following program illustrates the various kinds of variables, listing the kind of each variable next to its declaration.

    data {
      int<lower=0> N;            // unmodeled data
      real y[N];                 // modeled data
      real mu_mu;                // config. unmodeled param
      real<lower=0> sigma_mu;    // config. unmodeled param
    }
    transformed data {
      real<lower=0> alpha;       // const. unmodeled param
      real<lower=0> beta;        // const. unmodeled param
      alpha = 0.1;
      beta = 0.1;
    }
    parameters {
      real mu_y;                 // modeled param
      real<lower=0> tau_y;       // modeled param
    }
    transformed parameters {
      real<lower=0> sigma_y;     // derived quantity (param)
      sigma_y = pow(tau_y, -0.5);
    }
    model {
      tau_y ~ gamma(alpha, beta);
      mu_y ~ normal(mu_mu, sigma_mu);
      for (n in 1:N)
        y[n] ~ normal(mu_y, sigma_y);
    }
    generated quantities {
      real variance_y;           // derived quantity (transform)
      variance_y = sigma_y * sigma_y;
    }

In this example, y[N] is a modeled data vector.
Although it is specified in the data block, and thus must have a known value before the program may be run, it is modeled as if it were generated randomly as described by the model. The variable N is a typical example of unmodeled data. It is used to indicate a size that is not part of the model itself. The other variables declared in the data and transformed data blocks are examples of unmodeled parameters, also known as hyperparameters. Unmodeled parameters are parameters to probability densities that are not themselves modeled probabilistically. In Stan, unmodeled parameters that appear in the data block may be specified on a per-model-execution basis as part of the data read. In the above model, mu_mu and sigma_mu are configurable unmodeled parameters. Unmodeled parameters that are hard coded in the model must be declared in the transformed data block. For example, the unmodeled parameters alpha and beta are both hard coded to the value 0.1. To allow such variables to be configurable based on data supplied to the program at run time, they must be declared in the data block, like the variables mu_mu and sigma_mu. This program declares two modeled parameters, mu_y and tau_y. These are the location and precision used in the normal model of the values in y. The heart of the model will be sampling the values of these parameters from their posterior distribution. The modeled parameter tau_y is transformed from a precision to a scale parameter and assigned to the variable sigma_y in the transformed parameters block. Thus the variable sigma_y is considered a derived quantity; its value is entirely determined by the values of other variables. The generated quantities block defines the variable variance_y, which is a transform of the scale or deviation parameter sigma_y. It is defined in the generated quantities block because it is not used in the model.
Making it a generated quantity allows it to be monitored for convergence (being a nonlinear transform, it will have different autocorrelation and hence convergence properties than the deviation itself). Because Stan provides random number generators for its distributions (see Section 7.5), the generated quantities block can also be used to generate replicated data for model checking. Finally, the variable n is used as a loop index in the model block.

6.3. Program Block: data

The rest of this chapter will lay out the details of each block in order, starting with the data block in this section.

Variable Reads and Transformations

The data block is for the declaration of variables that are read in as data. With the current model executable, each Markov chain of samples will be executed in a different process, and each such process will read the data exactly once.[3] Data variables are not transformed in any way. The format for data files or data in memory depends on the interface; see the user's guides and interface documentation for PyStan, RStan, and CmdStan for details.

Statements

The data block does not allow statements.

Variable Constraint Checking

Each variable's value is validated against its declaration as it is read. For example, if a variable sigma is declared as real<lower=0>, then trying to assign it a negative value will raise an error. As a result, data type errors will be caught as early as possible. Similarly, attempts to provide data of the wrong size for a compound data structure will also raise an error.

6.4. Program Block: transformed data

The transformed data block is for declaring and defining variables that do not need to be changed when running the program.

Variable Reads and Transformations

For the transformed data block, variables are all declared in the variable declarations and defined in the statements. There is no reading from external sources and no transformations are performed.
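For example (an invented fragment), transformed data statements can standardize a predictor; because they execute once, right after the data is read, this work costs nothing during sampling:

```stan
data {
  int<lower=0> N;
  vector[N] x;
}
transformed data {
  // evaluated once per chain, not once per leapfrog step
  vector[N] x_std = (x - mean(x)) / sd(x);
}
```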
[3] With multiple threads, or even running chains sequentially in a single thread, data could be read only once per set of chains. Stan was designed to be thread safe and future versions will provide a multithreading option for Markov chains.

Variables declared in the data block may be used to declare transformed variables.

Statements

The statements in the transformed data block are used to define (provide values for) variables declared in the transformed data block. Assignments are only allowed to variables declared in the transformed data block. These statements are executed once, in order, right after the data is read into the data variables. This means they are executed once per chain (though see Footnote 3 in this chapter). Variables declared in the data block may be used in statements in the transformed data block.

Restrictions on Operations in transformed data

The statements in the transformed data block are designed to be executed once, before sampling begins. Therefore, log probability is not accumulated and sampling statements may not be used. Random number generating functions (those suffixed with _rng) may be used here and in the generated quantities block, but nowhere else; see Section 7.5.

Variable Constraint Checking

Any constraints on variables declared in the transformed data block are checked after the statements are executed. If any defined variable violates its constraints, Stan will halt with a diagnostic error message.

6.5. Program Block: parameters

The variables declared in the parameters program block correspond directly to the variables being sampled by Stan's samplers (HMC and NUTS). From a user's perspective, the parameters in the program block are the parameters being sampled by Stan. Variables declared as parameters cannot be directly assigned values, so there is no block of statements in the parameters program block.
Variable quantities derived from parameters may be declared in the transformed parameters or generated quantities blocks, or may be defined as local variables in any statement blocks following their declaration. There is a substantial amount of computation involved for parameter variables in a Stan program at each leapfrog step within the HMC or NUTS samplers, and a bit more computation along with writes involved for saving the parameter values corresponding to a sample.

Constraining Inverse Transform

Stan's two samplers, standard Hamiltonian Monte Carlo (HMC) and the adaptive No-U-Turn sampler (NUTS), are most easily (and often most effectively) implemented over a multivariate probability density that has support on all of R^n. To do this, the parameters defined in the parameters block must be transformed so they are unconstrained. In practice, the samplers keep an unconstrained parameter vector in memory representing the current state of the sampler. The model defined by the compiled Stan program defines an (unnormalized) log probability function over the unconstrained parameters. In order to do this, the log probability function must apply the inverse transform to the unconstrained parameters to calculate the constrained parameters defined in Stan's parameters program block. The log Jacobian of the inverse transform is then added to the accumulated log probability function. This then allows the Stan model to be defined in terms of the constrained parameters. In some cases, the number of parameters is reduced in the unconstrained space. For instance, a K-simplex only requires K − 1 unconstrained parameters, and a K × K correlation matrix only requires K(K − 1)/2 unconstrained parameters. This means that the probability function defined by the compiled Stan program may have fewer parameters than it would appear from looking at the declarations in the parameters program block.
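For a lower bound of zero, the built-in transform is the exponential, whose log Jacobian is the unconstrained value itself. As an illustrative sketch (not recommended in practice), declaring real<lower=0> sigma; behaves like the following manual reparameterization, in which the Jacobian adjustment Stan normally adds automatically must be written by hand:

```stan
parameters {
  real log_sigma;    // the unconstrained value the sampler actually moves
}
transformed parameters {
  real<lower=0> sigma = exp(log_sigma);   // inverse transform
}
model {
  // log |d sigma / d log_sigma| = log(exp(log_sigma)) = log_sigma
  target += log_sigma;
  sigma ~ normal(0, 1);   // half-normal prior on the constrained scale
}
```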
The probability function on the unconstrained parameters is defined in such a way that the order of the parameters in the vector corresponds to the order of the variables defined in the parameters program block. The details of the specific transformations are provided in Chapter 35.

Gradient Calculation

Hamiltonian Monte Carlo requires the gradient of the (unnormalized) log probability function with respect to the unconstrained parameters to be evaluated during every leapfrog step. There may be one leapfrog step per sample or hundreds, with more being required for models with complex posterior distribution geometries. Gradients are calculated behind the scenes using Stan's algorithmic differentiation library. The time to compute the gradient does not depend directly on the number of parameters, only on the number of subexpressions in the calculation of the log probability. This includes the expressions added from the transforms' Jacobians. The amount of work done by the sampler does depend on the number of unconstrained parameters, but this is usually dwarfed by the gradient calculations.

Writing Samples

In the basic Stan compiled program, the values of variables are written to a file for each sample. The constrained versions of the variables are written, again in the order they are defined in the parameters block. In order to do this, the transformed parameter, model, and generated quantities statements must be executed.

6.6. Program Block: transformed parameters

The transformed parameters program block consists of optional variable declarations followed by statements. After the statements are executed, the constraints on the transformed parameters are validated. Any variable declared as a transformed parameter is part of the output produced for samples. Any variable that is defined wholly in terms of data or transformed data should be declared and defined in the transformed data block.
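For example (a made-up fragment), the declared bounds on a transformed parameter act as a runtime check on its definition; if the statements ever produced a value of q outside [0, 1], the current draw would be rejected:

```stan
parameters {
  real<lower=0, upper=1> p;
}
transformed parameters {
  // validated against its bounds after this block's statements run
  real<lower=0, upper=1> q = 1 - p;
}
model {
  p ~ beta(2, 2);
}
```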
Defining such quantities in the transformed parameters block is legal, but much less efficient than defining them as transformed data.

Constraints are for Error Checking

Like the constraints on data, the constraints on transformed parameters are meant to catch programming errors as well as convey programmer intent. They are not automatically transformed in such a way as to be satisfied. If a transformed parameter does not match its constraint, the current parameter values will be rejected. This can cause Stan's algorithms to hang or to devolve to random walks. Constraints are not intended to be a way to enforce ad hoc restrictions in Stan programs. See Section 5.10 for further discussion of the behavior of reject statements.

6.7. Program Block: model

The model program block consists of optional variable declarations followed by statements. The variables in the model block are local variables and are not written as part of the output. Local variables may not be declared with constraints because there is no well-defined way to have them be both flexible and easy to validate. The statements in the model block typically define the model. This is the block in which probability (sampling notation) statements are allowed. These are typically used when programming in the BUGS idiom to define the probability model.

6.8. Program Block: generated quantities

The generated quantities program block is rather different from the other blocks. Nothing in the generated quantities block affects the sampled parameter values. The block is executed only after a sample has been generated.
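As a sketch of typical uses of this block (assuming a simple normal model with a data vector y of size N and parameters mu and sigma declared in earlier blocks; the variable names are invented), pointwise log likelihoods for model comparison and replicated data for posterior predictive checks can be computed as follows:

```stan
generated quantities {
  vector[N] log_lik;   // pointwise log likelihood, for model comparison
  vector[N] y_rep;     // replicated data, for posterior predictive checks
  for (n in 1:N) {
    log_lik[n] = normal_lpdf(y[n] | mu, sigma);
    y_rep[n] = normal_rng(mu, sigma);
  }
}
```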
Among the applications of posterior inference that can be coded in the generated quantities block are

• forward sampling to generate simulated data for model testing,
• generating predictions for new data,
• calculating posterior event probabilities, including multiple comparisons, sign tests, etc.,
• calculating posterior expectations,
• transforming parameters for reporting,
• applying full Bayesian decision theory, and
• calculating log likelihoods, deviances, etc., for model comparison.

Forward samples, event probabilities, and statistics may all be calculated directly using plug-in estimates. Stan automatically provides full Bayesian inference by producing samples from the posterior distribution of any calculated event probabilities, predictions, or statistics. See Chapter 29 for more information on Bayesian inference. Within the generated quantities block, the values of all other variables declared in earlier program blocks (other than local variables) are available for use. It is more efficient to define a variable in the generated quantities block than in the transformed parameters block. Therefore, if a quantity does not play a role in the model, it should be defined in the generated quantities block. After the generated quantities statements are executed, the constraints on the declared generated quantity variables are validated. All variables declared as generated quantities are printed as part of the output.

7. User-Defined Functions

Stan allows users to define their own functions. The basic syntax is a simplified version of that used in C and C++. This chapter specifies how functions are declared, defined, and used in Stan; see Chapter 24 for a more programming-oriented perspective.

7.1. Function-Definition Block

User-defined functions appear in a special function-definition block before all of the other program blocks.

    functions {
      // ... function declarations and definitions ...
    }
    data {
      // ...
Function definitions and declarations may appear in any order, subject to the condition that a function must be declared before it is used. Forward declarations are allowed in order to support recursive functions.

7.2. Function Names

The rules for function naming and function-argument naming are the same as for other variables; see Section 4.2 for more information on valid identifiers. For example,

    real foo(real mu, real sigma);

declares a function named foo with two argument variables of types real and real. The arguments are named mu and sigma, but that is not part of the declaration. Two user-defined functions may not have the same name, even if they have different sequences of argument types.

7.3. Calling Functions

All function arguments are mandatory; there are no default values.

Functions as Expressions

Functions with non-void return types are called just like any other built-in function in Stan: they are applied to appropriately typed arguments to produce an expression, which has a value when executed.

Functions as Statements

Functions with void return types may be applied to arguments and used as statements. These act like sampling statements or print statements. Such uses are only appropriate for functions that act through side effects, such as incrementing the log probability accumulator, printing, or raising exceptions.

Probability Functions in Sampling Statements

Functions whose names end in _lpdf or _lpmf (density and mass functions) may be used as probability functions and may be used in place of parameterized distributions on the right-hand side of sampling statements. There is no restriction on where such functions may be used.

Restrictions on Placement

Certain types of functions are restricted in where they may be used. Functions whose names end in _lp assume access to the log probability accumulator and are only available in the transformed parameters and model blocks.
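For instance (an invented helper), a function suffixed _lp may contain sampling statements and so can package a reusable piece of the log density; it may be called only from the transformed parameters or model block:

```stan
functions {
  // _lp suffix required because the body contains sampling statements
  void hierarchical_prior_lp(vector beta, real tau) {
    tau ~ cauchy(0, 2.5);
    beta ~ normal(0, tau);
  }
}
```

In the model block this would be invoked as a statement, hierarchical_prior_lp(beta, tau);, contributing to the log probability accumulator just as the two sampling statements would inline.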
Functions whose names end in _rng assume access to the random number generator and may only be used within the generated quantities block, the transformed data block, and the bodies of user-defined functions ending in _rng. See Section 7.5 for more information on these two special types of function.

7.4. Unsized Argument Types

Stan's functions all have declared types for both arguments and returned value. As with built-in functions, user-defined functions are only declared for base argument type and dimensionality. This requires a different syntax than for declaring other variables. The choice of syntax was made so that return types and argument types could use the same declaration syntax. The type void may not be used as an argument type, only as a return type for a function with side effects.

Base Variable Type Declaration

The base variable types are integer, real, vector, row_vector, and matrix. No lower-bound or upper-bound constraints are allowed (e.g., real<lower=0> is illegal). Specialized types are also not allowed (e.g., simplex is illegal).

Dimensionality Declaration

Arguments and return types may be arrays, and these are indicated with optional brackets and commas as would be used for indexing. For example, int denotes a single integer argument or return, whereas real[] indicates a one-dimensional array of reals, real[,] a two-dimensional array, and real[,,] a three-dimensional array; whitespace is optional, as usual. The dimensions for vectors and matrices are not included, so that matrix is the type of a single matrix argument or return type. Thus if a variable is declared as matrix a, then a has two indexing dimensions, so that a[1] is a row vector and a[1, 1] a real value. Matrices implicitly have two indexing dimensions. The type declaration matrix[,] b specifies that b is a two-dimensional array of matrices, for a total of four indexing dimensions, with b[1, 1, 1, 1] picking out a real value.
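The following invented example shows the unsized style: the vector argument carries no size, and the result's size is only checked when it is assigned to a declared variable:

```stan
functions {
  vector duplicate(vector v) {
    return append_row(v, v);   // result has twice as many rows as v
  }
}
transformed data {
  vector[3] a = rep_vector(1.0, 3);
  vector[6] b = duplicate(a);    // OK: sizes match
  // vector[5] c = duplicate(a); // would raise a runtime size-mismatch error
}
```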
Dimensionality Checks and Exceptions

Function argument and return types are not themselves checked for dimensionality. A matrix of any size may be passed in as a matrix argument. Nevertheless, a user-defined function might call a function (such as a multivariate normal density) that itself does dimensionality checks. Dimensions of function return values will be checked if they are assigned to a previously declared variable. They may also be checked if they are used as the argument to a function. Any errors raised by calls to functions inside user functions, or return type mismatches, are simply passed on; this typically results in a warning message and rejection of a proposal during sampling or optimization.

7.5. Function Bodies

The body of a function is bounded by curly braces ({ and }). The body may contain local variable declarations at the top of the function body's block, and these scope the same way as local variables used in any other statement block. The only restrictions on statements in function bodies are external, and determine whether the log probability accumulator or random number generators are available; see the rest of this section for details.

Random Number Generating Functions

Functions that call random number generating functions in their bodies must have a name that ends in _rng; attempting to use random number generators in other functions leads to a compile-time error. Like other random number generating functions, user-defined functions with names that end in _rng may be used only in the generated quantities block and transformed data block, or within the bodies of user-defined functions ending in _rng. An attempt to use such a function elsewhere results in a compile-time error.

Log Probability Access in Functions

Functions that include sampling statements or log probability increment statements must have a name that ends in _lp.
Attempts to use sampling statements or increment log probability statements in other functions lead to a compile-time error. Like the target log density increment statement and sampling statements, user-defined functions with names that end in _lp may only be used in blocks where the log probability accumulator is accessible, namely the transformed parameters and model blocks. An attempt to use such a function elsewhere results in a compile-time error.

Defining Probability Functions for Sampling Statements

Functions whose names end in _lpdf and _lpmf (density and mass functions) can be used as probability functions in sampling statements. As with the built-in functions, the first argument will appear on the left of the sampling statement operator (~) and the other arguments follow. For example, suppose a function returning the log of the density of y given parameter theta is defined as follows.

    real foo_lpdf(real y, vector theta) {
      ...
    }

Note that for function definitions, the comma is used rather than the vertical bar. Then the shorthand

    z ~ foo(phi);

will have exactly the same effect as

    target += foo_lpdf(z | phi);

Unlike built-in probability functions, user-defined probability functions like the example foo above will not automatically drop constant terms. The same syntax and shorthand work for log probability mass functions with the suffix _lpmf. A function that is going to be used as a distribution must return the log of the density or mass function it defines.

7.6. Parameters are Constant

Within function definition bodies, the parameters may be used like any other variable. But the parameters are constant in the sense that they can't be assigned to (i.e., they can't appear on the left side of an assignment (=) statement). In other words, their value remains constant throughout the function body.
Attempting to assign a value to a function parameter will raise a compile-time error.[1] Local variables may be declared at the top of the function block and scope as usual.

7.7. Return Value

Non-void functions must have a return statement that returns an appropriately typed expression. If the expression in a return statement does not have the same type as the return type declared for the function, a compile-time error is raised. Void functions may only use return without an argument, and return statements are not mandatory for them.

Return Guarantee Required

Unlike C++, Stan enforces a syntactic guarantee for non-void functions that ensures control will leave a non-void function through an appropriately typed return statement or because an exception is raised in the execution of the function. To enforce this condition, functions must have a return statement as the last statement in their body. This notion of last is defined recursively in terms of statements that qualify as bodies for functions. The base case is that

• a return statement qualifies,

and the recursive cases are that

• a sequence of statements qualifies if its last statement qualifies,
• a for loop or while loop qualifies if its body qualifies, and
• a conditional statement qualifies if it has a default else clause and all of its body statements qualify.

These rules disqualify

    real foo(real x) {
      if (x > 2) return 1.0;
      else if (x <= 2) return -1.0;
    }

because there is no default else clause, and disqualify

    real foo(real x) {
      real y;
      y = x;
      while (x < 10) {
        if (x > 0) return x;
        y = x / 2;
      }
    }

because the return statement is not the last statement in the while loop. A bogus dummy return could be placed after the while loop in this case.

[1] Despite being declared constant and appearing to have pass-by-value syntax in Stan, the implementation of the language passes function arguments by constant reference in C++.
The rules for returns allow

    real log_fancy(real x) {
      if (x < 1e-30) return x;
      else if (x < 1e-14) return x * x;
      else return log(x);
    }

because there is a default else clause and each condition body has a return as its final statement.

7.8. Void Functions as Statements

Void Functions

A function can be declared without a return value by using void in place of a return type. Note that the type void may only be used as a return type; arguments may not be declared to be of type void.

Usage as Statement

A void function may be used as a statement after the function is declared; see Section 7.9 for rules on declaration. Because there is no return value, such a usage is only for side effects, such as incrementing the log probability function, printing, or raising an error.

Special Return Statements

In a return statement within a void function's definition, the return keyword is followed immediately by a semicolon (;) rather than by the expression whose value is returned.

7.9. Declarations

In general, functions must be declared before they are used. Stan supports forward declarations, which look like function definitions without bodies. For example,

    real unit_normal_lpdf(real y);

declares a function named unit_normal_lpdf that consumes a single real-valued input and produces a real-valued output. A function definition with a body simultaneously declares and defines the named function, as in

    real unit_normal_lpdf(real y) {
      return -0.5 * square(y);
    }

A user-defined Stan function may be declared and then later defined, or just defined without being declared. No other combination of declaration and definition is legal, so that, for instance, a function may not be declared more than once, nor may it be defined more than once. If there is a declaration, there must be a definition. These rules together ensure that all declared functions are eventually defined.

Recursive Functions

Forward declarations allow the definition of self-recursive or mutually recursive functions.
For instance, consider the following code to compute Fibonacci numbers.

    int fib(int n);

    int fib(int n) {
      if (n < 2) return n;
      else return fib(n-1) + fib(n-2);
    }

Without the forward declaration in the first line, the body of the definition would not compile.

8. Execution of a Stan Program

This chapter provides a sketch of how a compiled Stan model is executed using sampling. Optimization shares the same data reading and initialization steps, but then does optimization rather than sampling. This sketch is elaborated in the following chapters of this part, which cover variable declarations, expressions, statements, and blocks in more detail.

8.1. Reading and Transforming Data

The reading and transforming data steps are the same for sampling, optimization, and diagnostics.

Read Data

The first step of execution is to read data into memory. Data may be read in from a file (in CmdStan) or from memory (in RStan and PyStan); see their respective manuals for details.[1] All of the variables declared in the data block will be read. If a variable cannot be read, the program will halt with a message indicating which data variable is missing. After each variable is read, if it has a declared constraint, the constraint is validated. For example, if a variable N is declared as int<lower=0>, after N is read, it will be tested to make sure it is greater than or equal to zero. If a variable violates its declared constraint, the program will halt with a warning message indicating which variable contains an illegal value, the value that was read, and the constraint that was declared.

Define Transformed Data

After data is read into the model, the transformed data variable statements are executed in order to define the transformed data variables. As the statements execute, declared constraints on variables are not enforced. Transformed data variables are initialized with real values set to NaN and integer values set to the smallest integer (a negative number with large absolute value).
After the statements are executed, all declared constraints on transformed data variables are validated. If the validation fails, execution halts and the variable's name, value, and constraints are displayed.

[1] The C++ code underlying Stan is flexible enough to allow data to be read from memory or file. Calls from R, for instance, can be configured to read data from file or directly from R's memory.

8.2. Initialization

Initialization is the same for sampling, optimization, and diagnosis.

User-Supplied Initial Values

If there are user-supplied initial values for parameters, these are read using the same input mechanism and same file format as data reads. Any constraints declared on the parameters are validated for the initial values. If a variable's value violates its declared constraint, the program halts and a diagnostic message is printed. After being read, initial values are transformed to unconstrained values that will be used to initialize the sampler.

Boundary Values are Problematic

Because of the way Stan defines its transforms from the constrained to the unconstrained space, initializing parameters on the boundaries of their constraints is usually problematic. For instance, with a constraint

    parameters {
      real<lower=0, upper=1> theta;
      // ...
    }

an initial value of 0 for theta leads to an unconstrained value of −∞, whereas a value of 1 leads to an unconstrained value of +∞. While this will be inverse transformed back correctly given the behavior of floating point arithmetic, the Jacobian will be infinite and the log probability function will fail and raise an exception.

Random Initial Values

If there are no user-supplied initial values, the default initialization strategy is to initialize the unconstrained parameters directly with values drawn uniformly from the interval (−2, 2). The bounds of this initialization can be changed, but it is always symmetric around 0. The value of 0 is special in that it represents the median of the initialization.
An unconstrained value of 0 corresponds to different parameter values depending on the constraints declared on the parameters. An unconstrained real does not involve any transform, so an initial value of 0 for the unconstrained parameters is also a value of 0 for the constrained parameters.

For parameters that are bounded below at 0, the initial value of 0 on the unconstrained scale corresponds to exp(0) = 1 on the constrained scale. A value of −2 corresponds to exp(−2) = 0.13 and a value of 2 corresponds to exp(2) = 7.4.

For parameters bounded above and below, the initial value of 0 on the unconstrained scale corresponds to a value at the midpoint of the constraint interval. For probability parameters, bounded below by 0 and above by 1, the transform is the inverse logit, so that an initial unconstrained value of 0 corresponds to a constrained value of 0.5, −2 corresponds to 0.12, and 2 to 0.88. Bounds other than 0 and 1 are just scaled and translated.

Simplexes with initial values of 0 on the unconstrained scale correspond to symmetric values on the constrained scale (i.e., each value is 1/K in a K-simplex).

Cholesky factors for positive-definite matrices are initialized to 1 on the diagonal and 0 elsewhere; this is because the diagonal is log transformed and the below-diagonal values are unconstrained.

The initial values for other parameters can be determined from the transform that is applied. The transforms are all described in full detail in Chapter 35.

Zero Initial Values

The initial values may all be set to 0 on the unconstrained scale. This can be helpful for diagnosis, and may also be a good starting point for sampling. Once a model is running, multiple chains with more diffuse starting points can help diagnose problems with convergence; see Section 30.3 for more information on convergence monitoring.

8.3.
Sampling

Sampling is based on simulating the Hamiltonian of a particle with a starting position equal to the current parameter values and an initial momentum (kinetic energy) generated randomly. The potential energy at work on the particle is taken to be the negative log (unnormalized) total probability function defined by the model. In the usual approach to implementing HMC, the Hamiltonian dynamics of the particle is simulated using the leapfrog integrator, which discretizes the smooth path of the particle into a number of small time steps called leapfrog steps.

Leapfrog Steps

For each leapfrog step, the negative log probability function and its gradient need to be evaluated at the position corresponding to the current parameter values (a more detailed sketch is provided in the next section). These are used to update the momentum based on the gradient and the position based on the momentum.

For simple models, only a few leapfrog steps with large step sizes are needed. For models with complex posterior geometries, many small leapfrog steps may be needed to accurately model the path of the parameters.

If the user specifies the number of leapfrog steps (i.e., chooses to use standard HMC), that number of leapfrog steps are simulated. If the user has not specified the number of leapfrog steps, the No-U-Turn sampler (NUTS) will determine the number of leapfrog steps adaptively (Hoffman and Gelman, 2011, 2014).

Log Probability and Gradient Calculation

During each leapfrog step, the log probability function and its gradient must be calculated. This is where most of the time in the Stan algorithm is spent. This log probability function, which is used by the sampling algorithm, is defined over the unconstrained parameters.

The first step of the calculation requires the inverse transform of the unconstrained parameter values back to the constrained parameters in terms of which the model is defined.
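As a sketch of what such an inverse transform looks like for a single parameter declared with a lower bound of zero, the following pure-Python fragment (illustrative only, not Stan's internal code) maps an unconstrained value back to the constrained scale and reports the log Jacobian term that must be added to the log probability:

```python
import math

# Sketch of the inverse transform for a parameter declared real<lower=0>.
# Stan samples an unconstrained value u and maps it back via exp; the log
# Jacobian of that inverse transform, log|d exp(u)/du| = u, is added to the
# accumulated log probability. (Illustrative code, not Stan internals.)

def inverse_transform_lower_zero(u):
    constrained = math.exp(u)   # defined for every real u; result is > 0
    log_jacobian = u            # log |d/du exp(u)| = log exp(u) = u
    return constrained, log_jacobian

theta, log_jac = inverse_transform_lower_zero(-2.0)
# theta satisfies the lower=0 constraint for every real input u
```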
There is no error checking required because the inverse transform is a total function defined at every point of the unconstrained space, and its range satisfies the declared constraints. Because the probability statements in the model are defined in terms of constrained parameters, the log Jacobian of the inverse transform must be added to the accumulated log probability.

Next, the transformed parameter statements are executed. After they complete, any constraints declared for the transformed parameters are checked. If the constraints are violated, the model will halt with a diagnostic error message.

The final step in the log probability function calculation is to execute the statements defined in the model block.

As the log probability function executes, it accumulates an in-memory representation of the expression tree used to calculate the log probability. This includes all of the transformed parameter operations and all of the Jacobian adjustments. This tree is then used to evaluate the gradients by propagating partial derivatives backward along the expression graph. The gradient calculations account for the majority of the cycles consumed by a Stan program.

Metropolis Accept/Reject

A standard Metropolis accept/reject step is required to retain detailed balance and ensure samples are marginally distributed according to the probability function defined by the model. This Metropolis adjustment is based on comparing log probabilities, here defined by the Hamiltonian, which is the sum of the potential (negative log probability) and kinetic (squared momentum) energies. In theory, the Hamiltonian is invariant over the path of the particle and rejection should never occur. In practice, the probability of rejection is determined by the accuracy of the leapfrog approximation to the true trajectory of the parameters.

If step sizes are small, very few updates will be rejected, but many steps will be required to move the same distance.
If step sizes are large, more updates will be rejected, but fewer steps will be required to move the same distance. Thus a balance between effort and rejection rate is required. If the user has not specified a step size, Stan will tune the step size during warmup sampling to achieve a desired rejection rate (thus balancing rejection versus number of steps).

If the proposal is accepted, the parameters are updated to their new values. Otherwise, the sample is the current set of parameter values.

8.4. Optimization

Optimization runs very much like sampling in that it starts by reading the data and then initializing parameters. Unlike sampling, it produces a deterministic output which requires no further analysis other than to verify that the optimizer itself converged to a posterior mode. The output for optimization is also similar to that for sampling.

8.5. Variational Inference

Variational inference also runs similarly to sampling. It begins by reading the data and initializing the algorithm. The initial variational approximation is a random draw from the standard normal distribution in the unconstrained (real-coordinate) space. Again, similarly to sampling, it outputs samples from the approximate posterior once the algorithm has decided that it has converged. Thus, the tools we use for analyzing the results of Stan's sampling routines can also be used for variational inference.

8.6. Model Diagnostics

Model diagnostics are like sampling and optimization in that they depend on a model's data being read and its parameters being initialized. The user's guides for the interfaces (RStan, PyStan, CmdStan) provide more details on the diagnostics available; as of Stan 2.0, that's just gradients on the unconstrained scale and log probabilities.

8.7. Output

For each final sample (not counting samples during warmup or samples that are thinned), there is an output stage of writing the samples.
Generated Quantities

Before generating any output, the statements in the generated quantities block are executed. This can be used for any forward simulation based on parameters of the model. Or it may be used to transform parameters to an appropriate form for output.

After the generated quantities statements execute, the constraints declared on generated quantities variables are validated. If these constraints are violated, the program will terminate with a diagnostic message.

Write

The final step is to write the actual values. The values of all variables declared as parameters, transformed parameters, or generated quantities are written. Local variables are not written, nor is the data or transformed data. All values are written in their constrained forms, that is, the form that is used in the model definitions.

In the executable form of a Stan model, parameters, transformed parameters, and generated quantities are written to a file in comma-separated value (csv) notation with a header defining the names of the parameters (including indices for multivariate parameters).²

² In the R version of Stan, the values may either be written to a csv file or directly back to R's memory.

Part III
Example Models

9. Regression Models

Stan supports regression models from simple linear regressions to multilevel generalized linear models.

9.1. Linear Regression

The simplest linear regression model is the following, with a single predictor and a slope and intercept coefficient, and normally distributed noise. This model can be written using standard regression notation as

    y_n = α + β x_n + ε_n    where    ε_n ∼ Normal(0, σ).

This is equivalent to the following sampling statement involving the residual,

    y_n − (α + β x_n) ∼ Normal(0, σ),

and reducing still further, to

    y_n ∼ Normal(α + β x_n, σ).

This latter form of the model is coded in Stan as follows.
    data {
      int<lower=0> N;
      vector[N] x;
      vector[N] y;
    }
    parameters {
      real alpha;
      real beta;
      real<lower=0> sigma;
    }
    model {
      y ~ normal(alpha + beta * x, sigma);
    }

There are N observations, each with predictor x[n] and outcome y[n]. The intercept and slope parameters are alpha and beta. The model assumes a normally distributed noise term with scale sigma. This model has improper priors for the two regression coefficients.

Matrix Notation and Vectorization

The sampling statement in the previous model is vectorized, with

    y ~ normal(alpha + beta * x, sigma);

providing the same model as the unvectorized version,

    for (n in 1:N)
      y[n] ~ normal(alpha + beta * x[n], sigma);

In addition to being more concise, the vectorized form is much faster.¹

In general, Stan allows the arguments to distributions such as normal to be vectors. If any of the other arguments are vectors or arrays, they have to be the same size. If any of the other arguments is a scalar, it is reused for each vector entry. See Section 49.8 for more information on vectorization of probability functions.

The other reason this works is that Stan's arithmetic operators are overloaded to perform matrix arithmetic on matrices. In this case, because x is of type vector and beta of type real, the expression beta * x is of type vector. Because Stan supports vectorization, a regression model with more than one predictor can be written directly using matrix notation.

    data {
      int<lower=0> N;   // number of data items
      int<lower=0> K;   // number of predictors
      matrix[N, K] x;   // predictor matrix
      vector[N] y;      // outcome vector
    }
    parameters {
      real alpha;           // intercept
      vector[K] beta;       // coefficients for predictors
      real<lower=0> sigma;  // error scale
    }
    model {
      y ~ normal(x * beta + alpha, sigma);  // likelihood
    }

The constraint lower=0 in the declaration of sigma constrains the value to be greater than or equal to 0.
¹ Unlike in Python and R, which are interpreted, Stan is translated to C++ and compiled, so loops and assignment statements are fast. Vectorized code is faster in Stan because (a) the expression tree used to compute derivatives can be simplified, leading to fewer virtual function calls, and (b) computations that would be repeated in the looping version, such as log(sigma) in the above model, will be computed once and reused.

With no prior in the model block, the effect is an improper prior on non-negative real numbers. Although a more informative prior may be added, improper priors are acceptable as long as they lead to proper posteriors.

In the model above, x is an N × K matrix of predictors and beta a K-vector of coefficients, so x * beta is an N-vector of predictions, one for each of the N data items. These predictions line up with the outcomes in the N-vector y, so the entire model may be written using matrix arithmetic as shown. It would be possible to include a column of 1 values in x and remove the alpha parameter.

The sampling statement in the model above is just a more efficient, vector-based approach to coding the model with a loop, as in the following statistically equivalent model.

    model {
      for (n in 1:N)
        y[n] ~ normal(x[n] * beta, sigma);
    }

With Stan's matrix indexing scheme, x[n] picks out row n of the matrix x; because beta is a column vector, the product x[n] * beta is a scalar of type real.

Intercepts as Inputs

In the model formulation

    y ~ normal(x * beta, sigma);

there is no longer an intercept coefficient alpha. Instead, we have assumed that the first column of the input matrix x is a column of 1 values. This way, beta[1] plays the role of the intercept. If the intercept gets a different prior than the slope terms, then it would be clearer to break it out.
It is also slightly more efficient in its explicit form with the intercept variable singled out because there's one fewer multiplication; it should not make that much of a difference to speed, though, so the choice should be based on clarity.

9.2. The QR Reparameterization

In the previous example, the linear predictor can be written as η = xβ, where η is an N-vector of predictions, x is an N × K matrix, and β is a K-vector of coefficients. Presuming N ≥ K, we can exploit the fact that any design matrix x can be decomposed using the thin QR decomposition into an orthogonal matrix Q and an upper-triangular matrix R, i.e., x = QR. See Section 43.13.4 for more information on the QR decomposition, but note that qr_Q and qr_R implement the fat QR decomposition, so here we thin it by including only K columns in Q and K rows in R. Also, in practice, it is best to write

    x = Q* R*,   where   Q* = Q √(N − 1)   and   R* = (1 / √(N − 1)) R.

Thus, we can equivalently write

    η = xβ = QRβ = Q* R* β.

If we let θ = R* β, then we have η = Q* θ and β = (R*)⁻¹ θ.
In that case, the previous Stan program becomes

    data {
      int<lower=0> N;   // number of data items
      int<lower=0> K;   // number of predictors
      matrix[N, K] x;   // predictor matrix
      vector[N] y;      // outcome vector
    }
    transformed data {
      matrix[N, K] Q_ast;
      matrix[K, K] R_ast;
      matrix[K, K] R_ast_inverse;
      // thin and scale the QR decomposition
      Q_ast = qr_Q(x)[, 1:K] * sqrt(N - 1);
      R_ast = qr_R(x)[1:K, ] / sqrt(N - 1);
      R_ast_inverse = inverse(R_ast);
    }
    parameters {
      real alpha;           // intercept
      vector[K] theta;      // coefficients on Q_ast
      real<lower=0> sigma;  // error scale
    }
    model {
      y ~ normal(Q_ast * theta + alpha, sigma);  // likelihood
    }
    generated quantities {
      vector[K] beta;
      beta = R_ast_inverse * theta;  // coefficients on x
    }

Since this Stan program generates equivalent predictions for y and the same posterior distribution for α, β, and σ as the previous Stan program, many wonder why the version with this QR reparameterization performs so much better in practice, often both in terms of wall time and in terms of effective sample size. The reasoning is threefold:

1. The columns of Q* are orthogonal whereas the columns of x generally are not. Thus, it is easier for a Markov chain to move around in θ-space than in β-space.

2. The columns of Q* have the same scale whereas the columns of x generally do not. Thus, a Hamiltonian Monte Carlo algorithm can move around the parameter space with a smaller number of larger steps.

3. Since the covariance matrix for the columns of Q* is an identity matrix, θ typically has a reasonable scale if the units of y are also reasonable. This also helps HMC move efficiently without compromising numerical accuracy.

Consequently, this QR reparameterization is recommended for linear and generalized linear models in Stan whenever K > 1 and you do not have an informative prior on the location of β.
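The algebra behind the reparameterization can be checked numerically. The following pure-Python sketch uses a hand-rolled Gram–Schmidt decomposition in place of qr_Q and qr_R, on a small made-up design matrix with K = 2, and verifies both that x = Q*R* and that θ = R*β yields the same linear predictor through Q*θ:

```python
import math

# Numerical check of the thin QR reparameterization for K = 2 predictors,
# using hand-rolled Gram-Schmidt in place of Stan's qr_Q/qr_R (pure-Python
# sketch; the data below are made up for illustration).

x = [[1.0, 2.0], [1.0, 0.5], [1.0, -1.0], [1.0, 3.0]]  # N = 4, K = 2
N = len(x)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def col(j):
    return [row[j] for row in x]

# Thin QR via Gram-Schmidt: x = Q R, Q orthogonal, R upper triangular.
r11 = math.sqrt(dot(col(0), col(0)))
q1 = [a / r11 for a in col(0)]
r12 = dot(q1, col(1))
v = [a - r12 * b for a, b in zip(col(1), q1)]
r22 = math.sqrt(dot(v, v))
q2 = [a / r22 for a in v]

s = math.sqrt(N - 1)
Q_ast = [[a * s for a in q1], [a * s for a in q2]]  # two columns of Q*
R_ast = [[r11 / s, r12 / s], [0.0, r22 / s]]        # R* = R / sqrt(N - 1)

# x = Q* R* holds entry by entry.
recon = [[Q_ast[0][i] * R_ast[0][j] + Q_ast[1][i] * R_ast[1][j]
          for j in range(2)] for i in range(N)]

# theta = R* beta reproduces eta = x beta as Q* theta.
beta = [0.7, -0.3]
theta = [R_ast[0][0] * beta[0] + R_ast[0][1] * beta[1],
         R_ast[1][1] * beta[1]]
eta_x = [dot(row, beta) for row in x]
eta_q = [Q_ast[0][i] * theta[0] + Q_ast[1][i] * theta[1] for i in range(N)]
```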
It can also be worthwhile to subtract the mean from each column of x before obtaining the QR decomposition, which does not affect the posterior distribution of θ or β but does affect α and allows you to interpret α as the expectation of y in a linear model.

9.3. Priors for Coefficients and Scales

This section describes the choices available for modeling priors for regression coefficients and scales. Priors for univariate parameters in hierarchical models are discussed in Section 9.10 and multivariate parameters in Section 9.13. There is also a discussion of priors used to identify models in Section 9.12.

However, as described in Section 9.2, if you do not have an informative prior on the location of the regression coefficients, then you are better off reparameterizing your model so that the regression coefficients are a generated quantity. In that case, it usually does not matter very much what prior is used on the reparameterized regression coefficients, and almost any weakly informative prior that scales with the outcome will do.

Background Reading

See (Gelman, 2006) for an overview of choices for priors for scale parameters, (Chung et al., 2013) for an overview of choices for scale priors in penalized maximum likelihood estimates, and Gelman et al. (2008) for a discussion of prior choice for regression coefficients.

Improper Uniform Priors

The default in Stan is to provide uniform (or "flat") priors on parameters over their legal values as determined by their declared constraints. A parameter declared without constraints is thus given a uniform prior on (−∞, ∞) by default, whereas a scale parameter declared with a lower bound of zero gets an improper uniform prior on (0, ∞). Both of these priors are improper in the sense that there is no way to formulate a density function for them that integrates to 1 over its support.
Stan allows models to be formulated with improper priors, but in order for sampling or optimization to work, the data provided must ensure a proper posterior. This usually requires a minimum quantity of data, but an improper prior can be useful as a starting point for inference and as a baseline for sensitivity analysis (i.e., considering the effect the prior has on the posterior).

Uniform priors are specific to the scale on which they are formulated. For instance, we could give a scale parameter σ > 0 a uniform prior on (0, ∞),

    q(σ) = c

(we use q because the "density" is not only unnormalized, but unnormalizable), or we could work on the log scale and give log σ a uniform prior on (−∞, ∞),

    q(log σ) = c.

These work out to be different priors on σ due to the Jacobian adjustment necessary for the log transform; see Section 35.1 for more information on changes of variables and their requisite Jacobian adjustments.

Stan automatically applies the necessary Jacobian adjustment for variables declared with constraints to ensure a uniform density on the legal constrained values. This Jacobian adjustment is turned off when optimization is being applied in order to produce appropriate maximum likelihood estimates.

Proper Uniform Priors: Interval Constraints

It is possible to declare a variable with a proper uniform prior by imposing both an upper and lower bound on it, for example,

    real<lower=0.1, upper=2.7> sigma;

This will implicitly give sigma a Uniform(0.1, 2.7) prior.

Matching Support to Constraints

As with all constraints, it is important that the model provide support for all legal values of sigma. For example, the following code constrains sigma to be positive, but then imposes a bounded uniform prior on it.

    parameters {
      real<lower=0> sigma;
      ...
    model {
      // *** bad *** : support narrower than constraint
      sigma ~ uniform(0.1, 2.7);

The sampling statement imposes a limited support for sigma in (0.1, 2.7), which is narrower than the support declared in the constraint, namely (0, ∞).
This can cause the Stan program to be difficult to initialize, hang during sampling, or devolve to a random walk.

Boundary Estimates

Estimates near boundaries for interval-constrained parameters typically signal that the prior is not appropriate for the model. Such estimates can also cause numerical problems with underflow and overflow when sampling or optimizing.

"Uninformative" Proper Priors

It is not uncommon to see models with priors on regression coefficients such as Normal(0, 1000).² If the prior scale, such as 1000, is several orders of magnitude larger than the estimated coefficients, then such a prior is effectively providing no effect whatsoever.

We actively discourage users from using the default scale priors suggested through the BUGS examples (Lunn et al., 2012), such as

    σ² ∼ InvGamma(0.001, 0.001).

Such priors concentrate too much probability mass outside of reasonable posterior values, and unlike the symmetric wide normal priors, can have the profound effect of skewing posteriors; see (Gelman, 2006) for examples and discussion.

Truncated Priors

If a variable is declared with a lower bound of zero, then assigning it a normal prior in a Stan model produces the same effect as providing a properly truncated half-normal prior. The truncation at zero need not be specified, as Stan only requires the density up to a proportion. So a variable declared with

    real<lower=0> sigma;

and given a prior

    sigma ~ normal(0, 1000);

gives sigma a half-normal prior, technically

    p(σ) = Normal(σ | 0, 1000) / (1 − NormalCDF(0 | 0, 1000)) ∝ Normal(σ | 0, 1000),

but Stan is able to avoid the calculation of the normal cumulative distribution function (CDF) required to normalize the half-normal density. If either the prior location or scale is a parameter, or if the truncation point is a parameter, the truncation cannot be dropped, because the normal CDF term will not be a constant.

² The practice was common in BUGS and can be seen in most of their examples (Lunn et al., 2012).
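The proportionality claim can be checked numerically: with location 0 and truncation point 0, the normalizing term 1 − NormalCDF(0 | 0, 1000) equals 1/2 regardless of σ, so the truncated density is just a constant multiple of the untruncated one. A pure-Python sketch using math.erf:

```python
import math

# Check that the half-normal density differs from the untruncated normal
# density only by the constant 1 / (1 - NormalCDF(0 | 0, 1000)) = 2, so
# the normalization can be dropped (pure-Python sketch).

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 0.0, 1000.0
Z = 1 - normal_cdf(0, mu, sigma)      # exactly 0.5 when the location is 0
ratios = [(normal_pdf(s, mu, sigma) / Z) / normal_pdf(s, mu, sigma)
          for s in (0.1, 1.0, 50.0, 2500.0)]
# every ratio is the same constant 1/Z = 2: truncation only rescales
```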
Weakly Informative Priors

Typically a researcher will have some knowledge of the scale of the variables being estimated. For instance, if we're estimating an intercept-only model for the mean population height for adult women, then we know the answer is going to be somewhere in the one to three meter range. That gives us information around which to form a weakly informative prior.

Similarly, a logistic regression with predictors on the standard scale (roughly zero mean, unit variance) is unlikely to have a coefficient that's larger than five in absolute value. In these cases, it makes sense to provide a weakly informative prior such as Normal(0, 5) for such a coefficient.

Weakly informative priors help control inference computationally and statistically. Computationally, a prior increases the curvature around the volume where the solution is expected to lie, which in turn guides both gradient-based optimizers like L-BFGS and Hamiltonian Monte Carlo sampling by not allowing them to stray too far from the location of the expected solution. Statistically, a weakly informative prior is more sensible for a problem like women's mean height, because a very diffuse prior like Normal(0, 1000) will ensure that the vast majority of the prior probability mass is outside the range of the expected answer, which can overwhelm the inferences available from a small data set.

Bounded Priors

Consider the women's height example again. One way to formulate a proper prior is to impose a uniform prior on a bounded scale. For example, we could declare the parameter for mean women's height to have a lower bound of one meter and an upper bound of three meters. Surely the answer has to lie in that range.

Similarly, it is not uncommon to see priors for scale parameters that impose lower bounds of zero and upper bounds of very large numbers, such as 10,000.³ This provides roughly the same problem for estimation as a very diffuse inverse gamma prior on variance.
We prefer to leave parameters which are not absolutely physically constrained to float and to provide them with informative priors. In the case of women's height, such a prior might be Normal(2, 0.5) on the scale of meters; it concentrates 95% of its mass in the interval (1, 3), but still allows values outside of that region.

In cases where bounded priors are used, the posterior fits should be checked to make sure the parameter is not estimated at or very close to a boundary. This will not only cause computational problems, it also indicates a problem with the way the model is formulated. In such cases, the interval should be widened to see where the parameter fits without such constraints, or boundary-avoiding priors should be used (see Section 9.10).

³ This was also a popular strategy in the BUGS example models (Lunn et al., 2012), which often went one step further and set the lower bounds to a small number like 0.001 to discourage numerical underflow to zero.

Fat-Tailed Priors and "Default" Priors

A reasonable alternative if we want to accommodate outliers is to use a prior that concentrates most of its mass around the area where values are expected to be, but still leaves a lot of mass in its tails. The usual choice in such a situation is to use a Cauchy distribution for a prior, which can concentrate its mass around its median, but has tails that are so fat that the variance is infinite.

Without specific information, the Cauchy prior is a very good default parameter choice for regression coefficients (Gelman et al., 2008) and the half-Cauchy (coded implicitly in Stan) a good default choice for scale parameters (Gelman, 2006).

Informative Priors

Ideally, there will be substantive information about a problem that can be included in an even tighter prior than a weakly informative prior. This may come from actual prior experiments and thus be the posterior of other data, it may come from meta-analysis, or it may come simply by soliciting it from domain experts.
All the goodness of weakly informative priors applies, only with more strength.

Conjugacy

Unlike in Gibbs sampling, there is no computational advantage to providing conjugate priors (i.e., priors that produce posteriors in the same family) in a Stan program.⁴ Neither the Hamiltonian Monte Carlo samplers nor the optimizers make use of conjugacy, working only on the log density and its derivatives.

⁴ BUGS and JAGS both support conjugate sampling through Gibbs sampling. JAGS extended the range of conjugacy that could be exploited with its GLM module. Unlike Stan, both BUGS and JAGS are restricted to conjugate priors for constrained multivariate quantities such as covariance matrices or simplexes.

9.4. Robust Noise Models

The standard approach to linear regression is to model the noise term as having a normal distribution. From Stan's perspective, there is nothing special about normally distributed noise. For instance, robust regression can be accommodated by giving the noise term a Student-t distribution. To code this in Stan, the sampling distribution is changed to the following.

    data {
      ...
      real<lower=0> nu;
    }
    ...
    model {
      y ~ student_t(nu, alpha + beta * x, sigma);
    }

The degrees of freedom constant nu is specified as data.

9.5. Logistic and Probit Regression

For binary outcomes, either of the closely related logistic or probit regression models may be used. These generalized linear models vary only in the link function they use to map linear predictions in (−∞, ∞) to probability values in (0, 1). Their respective link functions, the logistic function and the unit normal cumulative distribution function, are both sigmoid functions (i.e., they are both S-shaped).

A logistic regression model with one predictor and an intercept is coded as follows.

    data {
      int<lower=0> N;
      vector[N] x;
      int<lower=0, upper=1> y[N];
    }
    parameters {
      real alpha;
      real beta;
    }
    model {
      y ~ bernoulli_logit(alpha + beta * x);
    }

The noise parameter is built into the Bernoulli formulation here rather than specified directly.
Logistic regression is a kind of generalized linear model with binary outcomes and the log odds (logit) link function, defined by

    logit(v) = log(v / (1 − v)).

The inverse of the link function appears in the model:

    logit⁻¹(u) = 1 / (1 + exp(−u)).

The model formulation above uses the logit-parameterized version of the Bernoulli distribution, which is defined by

    BernoulliLogit(y | α) = Bernoulli(y | logit⁻¹(α)).

The formulation is also vectorized in the sense that alpha and beta are scalars and x is a vector, so that alpha + beta * x is a vector. The vectorized formulation is equivalent to the less efficient version

    for (n in 1:N)
      y[n] ~ bernoulli_logit(alpha + beta * x[n]);

Expanding out the Bernoulli logit, the model is equivalent to the more explicit, but less efficient and less arithmetically stable

    for (n in 1:N)
      y[n] ~ bernoulli(inv_logit(alpha + beta * x[n]));

Other link functions may be used in the same way. For example, probit regression uses the cumulative normal distribution function, which is typically written as

    Φ(x) = ∫_{−∞}^{x} Normal(y | 0, 1) dy.

The cumulative unit normal distribution function Φ is implemented in Stan as the function Phi. The probit regression model may be coded in Stan by replacing the logistic model's sampling statement with the following.

    y[n] ~ bernoulli(Phi(alpha + beta * x[n]));

A fast approximation to the cumulative unit normal distribution function Φ is implemented in Stan as the function Phi_approx. The approximate probit regression model may be coded with the following.

    y[n] ~ bernoulli(Phi_approx(alpha + beta * x[n]));

9.6. Multi-Logit Regression

Multiple-outcome forms of logistic regression can be coded directly in Stan. For instance, suppose there are K possible outcomes for each output variable y_n. Also suppose that there is a D-dimensional vector x_n of predictors for y_n. The multi-logit model with Normal(0, 5) priors on the coefficients is coded as follows.
    data {
      int K;
      int N;
      int D;
      int y[N];
      vector[D] x[N];
    }
    parameters {
      matrix[K, D] beta;
    }
    model {
      for (k in 1:K)
        beta[k] ~ normal(0, 5);
      for (n in 1:N)
        y[n] ~ categorical(softmax(beta * x[n]));
    }

See Section 43.11 for a definition of the softmax function. A more efficient way to write the final line is

    y[n] ~ categorical_logit(beta * x[n]);

The categorical_logit distribution is like the categorical distribution, with the parameters on the logit scale (see Section 51.5 for a full definition of categorical_logit).

The first loop may be made more efficient by vectorizing it, converting the matrix beta to a vector:

    to_vector(beta) ~ normal(0, 5);

Constraints on Data Declarations

The data block in the above model is defined without constraints on sizes K, N, and D or on the outcome array y. Constraints on data declarations provide error checking at the point data is read (or transformed data is defined), which is before sampling begins. Constraints on data declarations also make the model author's intentions more explicit, which can help with readability. The above model's declarations could be tightened to

    int<lower=2> K;
    int<lower=0> N;
    int<lower=1> D;
    int<lower=1, upper=K> y[N];

These constraints arise because the number of categories, K, must be at least two in order for a categorical model to be useful. The number of data items, N, can be zero, but not negative; unlike R, Stan's for-loops always move forward, so that a loop extent of 1:N when N is equal to zero ensures the loop's body will not be executed. The number of predictors, D, must be at least one in order for beta * x[n] to produce an appropriate argument for softmax(). The categorical outcomes y[n] must be between 1 and K in order for the discrete sampling to be well defined.

Constraints on data declarations are optional. Constraints on parameters declared in the parameters block, on the other hand, are not optional; they are required to ensure support for all parameter values satisfying their constraints.
Constraints on transformed data, transformed parameters, and generated quantities are also optional.

Identifiability

Because softmax is invariant under adding a constant to each component of its input, the model is typically only identified if there is a suitable prior on the coefficients.

An alternative is to use (K − 1)-vectors by fixing one of them to be zero. Section 11.2 discusses how to mix constants and parameters in a vector. In the multi-logit case, the parameter block would be redefined to use (K − 1)-vectors

    parameters {
      matrix[K - 1, D] beta_raw;
    }

and then these are transformed to parameters to use in the model. First, a transformed data block is added before the parameters block to define a row vector of zero values,

    transformed data {
      row_vector[D] zeros;
      zeros = rep_row_vector(0, D);
    }

which can then be appended to beta_raw to produce the coefficient matrix beta,

    transformed parameters {
      matrix[K, D] beta;
      beta = append_row(beta_raw, zeros);
    }

See Section 43.7 for a definition of rep_row_vector and Section 43.10 for a definition of append_row.

This is not quite the same model as using K-vectors as parameters, because now the prior only applies to (K − 1)-vectors. In practice, this will cause the maximum likelihood solutions to be different and also the posteriors to be slightly different when taking priors centered around zero, as is typical for regression coefficients.

9.7. Parameterizing Centered Vectors

It is often convenient to define a parameter vector β that is centered in the sense of satisfying the sum-to-zero constraint

    Σ_{k=1}^{K} β_k = 0.

Such a parameter vector may be used to identify a multi-logit regression parameter vector (see Section 9.6), or may be used for ability or difficulty parameters (but not both) in an IRT model (see Section 9.11).
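The softmax invariance that motivates both the identifiability discussion and the sum-to-zero constraint is easy to see numerically. In the following pure-Python sketch (with made-up linear predictor values), adding a constant to every component leaves the category probabilities unchanged:

```python
import math

# Demonstration of the identifiability problem: softmax is invariant under
# adding the same constant to every component, so beta * x and beta * x + c
# give identical category probabilities (pure-Python sketch).

def softmax(v):
    m = max(v)                           # subtract the max for stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

eta = [1.0, -0.5, 2.0]                   # made-up linear predictor
shifted = [a + 7.3 for a in eta]         # same predictor plus a constant
p1, p2 = softmax(eta), softmax(shifted)
# p1 and p2 agree to machine precision, and each sums to 1
```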
K − 1 Degrees of Freedom

There is more than one way to enforce a sum-to-zero constraint on a parameter vector, the most efficient of which is to define the K-th element as the negation of the sum of the elements 1 through K − 1.

parameters {
  vector[K - 1] beta_raw;
  ...
transformed parameters {
  vector[K] beta;  // centered
  for (k in 1:(K - 1))
    beta[k] = beta_raw[k];
  beta[K] = -sum(beta_raw);
  ...

Placing a prior on beta_raw in this parameterization leads to a subtly different posterior than that resulting from the same prior on beta in the original parameterization without the sum-to-zero constraint. Most notably, a simple prior on each component of beta_raw produces different results than putting the same prior on each component of an unconstrained K-vector beta. For example, providing a Normal(0, 5) prior on beta will produce a different posterior mode than placing the same prior on beta_raw.

Translated and Scaled Simplex

An alternative approach that's less efficient, but amenable to a symmetric prior, is to offset and scale a simplex.

parameters {
  simplex[K] beta_raw;
  real beta_scale;
  ...
transformed parameters {
  vector[K] beta;
  beta = beta_scale * (beta_raw - 1.0 / K);
  ...

Given that beta_raw sums to 1 because it is a simplex, the elementwise subtraction of 1/K is guaranteed to sum to zero (note that the expression 1.0 / K is used rather than 1 / K to prevent integer arithmetic from rounding down to zero). Because the magnitude of the elements of the simplex is bounded, a scaling factor is required to provide beta with the K degrees of freedom necessary to take on every possible value that sums to zero. With this parameterization, a Dirichlet prior can be placed on beta_raw, perhaps uniform, and another prior put on beta_scale, typically for "shrinkage."

Soft Centering

Adding a prior such as β ∼ Normal(0, σ) will provide a kind of soft centering of a parameter vector β by preferring, all else being equal, that

  β_1 + β_2 + ··· + β_K = 0.
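Both hard-centering constructions described above can be verified with a few lines of plain Python (not Stan; the values are hypothetical illustrations):

```python
K = 4

# negation construction: the K-th element is minus the sum of the first K - 1
beta_raw = [0.3, -1.1, 0.45]           # hypothetical unconstrained (K - 1)-vector
beta = beta_raw + [-sum(beta_raw)]
assert len(beta) == K
assert abs(sum(beta)) < 1e-12          # sums to zero by construction

# translated and scaled simplex: subtracting 1/K from a simplex, then scaling
simplex = [0.1, 0.2, 0.3, 0.4]         # sums to 1
beta_scale = 2.5
beta2 = [beta_scale * (s - 1.0 / K) for s in simplex]
assert abs(sum(beta2)) < 1e-12         # sum is scale * (1 - K * (1/K)) = 0
```

Either construction guarantees the constraint exactly, whereas soft centering below only encourages it through the prior.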
Soft centering is only guaranteed to roughly center if β and the elementwise addition β + c for a scalar constant c produce the same likelihood (perhaps by another vector α being transformed to α − c, as in the IRT models). This is another way of achieving a symmetric prior.

9.8. Ordered Logistic and Probit Regression

Ordered regression for an outcome y_n ∈ {1, ..., K} with predictors x_n ∈ ℝ^D is determined by a single coefficient vector β ∈ ℝ^D along with a sequence of cutpoints c ∈ ℝ^{K−1} sorted so that c_d < c_{d+1}. The discrete output is k if the linear predictor x_n β falls between c_{k−1} and c_k, assuming c_0 = −∞ and c_K = ∞. The noise term is fixed by the form of regression, with examples for ordered logistic and ordered probit models.

Ordered Logistic Regression

The ordered logistic model can be coded in Stan using the ordered data type for the cutpoints and the built-in ordered_logistic distribution.

data {
  int<lower=2> K;
  int<lower=0> N;
  int<lower=1> D;
  int<lower=1,upper=K> y[N];
  row_vector[D] x[N];
}
parameters {
  vector[D] beta;
  ordered[K-1] c;
}
model {
  for (n in 1:N)
    y[n] ~ ordered_logistic(x[n] * beta, c);
}

The vector of cutpoints c is declared as ordered[K-1], which guarantees that c[k] is less than c[k+1]. If the cutpoints were assigned independent priors, the constraint would effectively truncate the joint prior to support over points satisfying the ordering constraint. Luckily, Stan does not need to compute the effect of the constraint on the normalizing term because the probability is needed only up to a proportion.

Ordered Probit

An ordered probit model could be coded in exactly the same way by swapping the cumulative logistic (inv_logit) for the cumulative normal (Phi).
data {
  int<lower=2> K;
  int<lower=0> N;
  int<lower=1> D;
  int<lower=1,upper=K> y[N];
  row_vector[D] x[N];
}
parameters {
  vector[D] beta;
  ordered[K-1] c;
}
model {
  vector[K] theta;
  for (n in 1:N) {
    real eta;
    eta = x[n] * beta;
    theta[1] = 1 - Phi(eta - c[1]);
    for (k in 2:(K-1))
      theta[k] = Phi(eta - c[k-1]) - Phi(eta - c[k]);
    theta[K] = Phi(eta - c[K-1]);
    y[n] ~ categorical(theta);
  }
}

The logistic model could also be coded this way by replacing Phi with inv_logit, though the built-in ordered_logistic encoding is more efficient and more numerically stable. A small efficiency gain could be achieved by computing the values Phi(eta - c[k]) once and storing them for re-use.

9.9. Hierarchical Logistic Regression

The simplest multilevel model is a hierarchical model in which the data is grouped into L distinct categories (or levels). An extreme approach would be to completely pool all the data and estimate a common vector of regression coefficients β. At the other extreme, an approach with no pooling assigns each level l its own coefficient vector β_l that is estimated separately from the other levels. A hierarchical model is an intermediate solution in which the degree of pooling is determined by the data and a prior on the amount of pooling.

Suppose each binary outcome y_n ∈ {0, 1} has an associated level, ll_n ∈ {1, ..., L}. Each outcome will also have an associated predictor vector x_n ∈ ℝ^D. Each level l gets its own coefficient vector β_l ∈ ℝ^D. The hierarchical structure involves drawing the coefficients β_{l,d} ∈ ℝ from a prior that is also estimated with the data. This hierarchically estimated prior determines the amount of pooling. If the data in each level are very similar, strong pooling will be reflected in low hierarchical variance. If the data in the levels are dissimilar, weaker pooling will be reflected in higher hierarchical variance.

The following model encodes a hierarchical logistic regression model with a hierarchical prior on the regression coefficients.
data {
  int<lower=1> D;
  int<lower=0> N;
  int<lower=1> L;
  int<lower=0,upper=1> y[N];
  int<lower=1,upper=L> ll[N];
  row_vector[D] x[N];
}
parameters {
  real mu[D];
  real<lower=0> sigma[D];
  vector[D] beta[L];
}
model {
  for (d in 1:D) {
    mu[d] ~ normal(0, 100);
    for (l in 1:L)
      beta[l,d] ~ normal(mu[d], sigma[d]);
  }
  for (n in 1:N)
    y[n] ~ bernoulli(inv_logit(x[n] * beta[ll[n]]));
}

The standard deviation parameter sigma gets an implicit uniform prior on (0, ∞) because of its declaration with a lower-bound constraint of zero. Stan allows improper priors as long as the posterior is proper. Nevertheless, it is usually helpful to have informative, or at least weakly informative, priors for all parameters; see Section 9.3 for recommendations on priors for regression coefficients and scales.

Optimizing the Model

Where possible, vectorizing sampling statements leads to faster log probability and derivative evaluations. The speed boost is not because loops are eliminated, but because vectorization allows sharing subcomputations in the log probability and gradient calculations and because it reduces the size of the expression tree required for gradient calculations.

The first optimization vectorizes the for-loop over D as

  mu ~ normal(0, 100);
  for (l in 1:L)
    beta[l] ~ normal(mu, sigma);

The declaration of beta as an array of vectors means that the expression beta[l] denotes a vector. Although beta could have been declared as a matrix, an array of vectors (or a two-dimensional array) is more efficient for accessing rows; see Section 26.3 for more information on the efficiency tradeoffs among arrays, vectors, and matrices.

This model can be further sped up, and at the same time made more arithmetically stable, by replacing the application of inverse-logit inside the Bernoulli distribution with the logit-parameterized Bernoulli,

  for (n in 1:N)
    y[n] ~ bernoulli_logit(x[n] * beta[ll[n]]);

See Section 50.2 for a definition of bernoulli_logit. Unlike in R or BUGS, loops, array access, and assignments are fast in Stan because they are translated directly to C++.
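The equivalence bernoulli_logit(y | α) = bernoulli(y | inv_logit(α)), and the numerical advantage of staying on the log-odds scale, can be illustrated with a quick check in plain Python (not Stan; the helper functions are ours):

```python
import math

def inv_logit(u):
    return 1.0 / (1.0 + math.exp(-u))

def bernoulli_lpmf(y, theta):
    # log Bernoulli mass on the probability scale
    return math.log(theta if y == 1 else 1.0 - theta)

def bernoulli_logit_lpmf(y, alpha):
    # stable log mass on the log-odds scale: -log1p(exp(-alpha)) for y = 1
    return -math.log1p(math.exp(-alpha if y == 1 else alpha))

# the two parameterizations agree for moderate linear predictors
for alpha in (-2.0, 0.0, 1.5):
    for y in (0, 1):
        diff = bernoulli_lpmf(y, inv_logit(alpha)) - bernoulli_logit_lpmf(y, alpha)
        assert abs(diff) < 1e-12

# for an extreme predictor, the probability rounds to exactly 1 in double
# precision, so log(1 - theta) would be log(0); the logit form stays finite
alpha = 40.0
theta = inv_logit(alpha)
assert theta == 1.0
lp = bernoulli_logit_lpmf(0, alpha)   # about -40, computed without underflow
assert -41 < lp < -39
```

This is the arithmetic-stability argument behind preferring bernoulli_logit over composing bernoulli with inv_logit by hand.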
In most cases, the cost of allocating and assigning to a container is more than made up for by the increased efficiency due to vectorizing the log probability and gradient calculations. Thus the following version is faster than the original formulation as a loop over a sampling statement.

{
  vector[N] x_beta_ll;
  for (n in 1:N)
    x_beta_ll[n] = x[n] * beta[ll[n]];
  y ~ bernoulli_logit(x_beta_ll);
}

The brackets introduce a new scope for the local variable x_beta_ll; alternatively, the variable may be declared at the top of the model block.

In some cases, such as the above, the local variable assignment leads to models that are less readable. The recommended practice in such cases is to first develop and debug the more transparent version of the model and only work on optimizations when the simpler formulation has been debugged.

9.10. Hierarchical Priors

Priors on priors, also known as "hyperpriors," should be treated the same way as priors on lower-level parameters in that as much prior information as is available should be brought to bear. Because hyperpriors often apply to only a handful of lower-level parameters, care must be taken to ensure the posterior is both proper and not overly sensitive, either statistically or computationally, to wide tails in the priors.

Boundary-Avoiding Priors for MLE in Hierarchical Models

The fundamental problem with maximum likelihood estimation (MLE) in the hierarchical model setting is that as the hierarchical variance drops and the values cluster around the hierarchical mean, the overall density grows without bound. As an illustration, consider a simple hierarchical linear regression (with fixed prior mean) of y_n ∈ ℝ on x_n ∈ ℝ^K, formulated as

  y_n ∼ Normal(x_n β, σ)
  β_k ∼ Normal(0, τ)
  τ ∼ Cauchy(0, 2.5)

In this case, as τ → 0 and β_k → 0, the posterior density

  p(β, τ, σ | y, x) ∝ p(y | x, β, σ) p(β | τ) p(τ)

grows without bound. There is a plot of Neal's funnel density in Figure 28.1, which has similar behavior.
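The unbounded growth is easy to see numerically: with β held at the hierarchical mean, the prior term log Normal(β | 0, τ) contains a −log(τ) that diverges as τ → 0. A minimal check in plain Python (not Stan; the helper function is ours):

```python
import math

def normal_lpdf(y, mu, sigma):
    # log density of Normal(mu, sigma) at y
    return (-0.5 * ((y - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2 * math.pi))

# with beta fixed at the hierarchical mean (0), the prior term
# log Normal(beta | 0, tau) = -log(tau) - 0.5 * log(2 * pi)
# diverges to +infinity as tau -> 0
vals = [normal_lpdf(0.0, 0.0, tau) for tau in (1.0, 0.1, 0.01, 0.001)]
assert vals == sorted(vals)       # the log density keeps growing as tau shrinks
assert vals[-1] > vals[0] + 6.0   # each factor-of-10 shrink adds log(10) ~ 2.3
```

The likelihood terms stay bounded, so they cannot counteract this divergence, which is why the joint mode sits on the τ = 0 boundary.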
There is obviously no maximum likelihood estimate for β, τ, σ in such a case, and therefore the model must be modified if posterior modes are to be used for inference. The approach recommended by Chung et al. (2013) is to use a gamma distribution as a prior, such as

  σ ∼ Gamma(2, 1/A),

for a reasonably large value of A, such as A = 10.

9.11. Item-Response Theory Models

Item-response theory (IRT) models the situation in which a number of students each answer one or more of a group of test questions. The model is based on parameters for the ability of the students, the difficulty of the questions, and in more articulated models, the discriminativeness of the questions and the probability of guessing correctly; see (Gelman and Hill, 2007, pp. 314–320) for a textbook introduction to hierarchical IRT models and (Curtis, 2010) for encodings of a range of IRT models in BUGS.

Data Declaration with Missingness

The data provided for an IRT model may be declared as follows to account for the fact that not every student is required to answer every question.

data {
  int<lower=1> J;              // number of students
  int<lower=1> K;              // number of questions
  int<lower=1> N;              // number of observations
  int<lower=1,upper=J> jj[N];  // student for observation n
  int<lower=1,upper=K> kk[N];  // question for observation n
  int<lower=0,upper=1> y[N];   // correctness for observation n
}

This declares a total of N student-question pairs in the data set, where each n in 1:N indexes a binary observation y[n] of the correctness of the answer of student jj[n] on question kk[n]. The prior hyperparameters will be hard coded in the rest of this section for simplicity, though they could be coded as data in Stan for more flexibility.

1PL (Rasch) Model

The 1PL item-response model, also known as the Rasch model, has one parameter (1P) for questions and uses the logistic link function (L). The model parameters are declared as follows.
parameters {
  real delta;      // mean student ability
  real alpha[J];   // ability of student j - mean ability
  real beta[K];    // difficulty of question k
}

The parameter alpha[j] is the ability coefficient for student j and beta[k] is the difficulty coefficient for question k. The non-standard parameterization used here also includes an intercept term delta, which represents the average student's response to the average question.⁵ The model itself is as follows.

model {
  alpha ~ normal(0, 1);     // informative true prior
  beta ~ normal(0, 1);      // informative true prior
  delta ~ normal(0.75, 1);  // informative true prior
  for (n in 1:N)
    y[n] ~ bernoulli_logit(alpha[jj[n]] - beta[kk[n]] + delta);
}

This model uses the logit-parameterized Bernoulli distribution, where

  bernoulli_logit(y | α) = bernoulli(y | logit⁻¹(α)).

The key to understanding it is the term inside the bernoulli_logit distribution, from which it follows that

  Pr[y_n = 1] = logit⁻¹(α_{jj[n]} − β_{kk[n]} + δ).

The model suffers from additive identifiability issues without the priors. For example, adding a term ξ to each α_j and β_k results in the same predictions. The use of priors for α and β located at 0 identifies the parameters; see (Gelman and Hill, 2007) for a discussion of identifiability issues and alternative approaches to identification.

For testing purposes, the IRT 1PL model distributed with Stan uses informative priors that match the actual data generation process used to simulate the data in R (the simulation code is supplied in the same directory as the models). This is unrealistic for most practical applications, but allows Stan's inferences to be validated. A simple sensitivity analysis with fatter priors shows that the posterior is fairly sensitive to the prior even with 400 students and 100 questions and only 25% missingness at random.

⁵ (Gelman and Hill, 2007) treat the δ term equivalently as the location parameter in the distribution of student abilities.
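The additive non-identifiability described above is exact, not approximate: shifting every ability and every difficulty by the same constant ξ leaves Pr[y_n = 1] unchanged. A quick check in plain Python (not Stan; the parameter values are hypothetical):

```python
import math

def inv_logit(u):
    return 1.0 / (1.0 + math.exp(-u))

alpha, beta, delta = 0.8, -0.3, 0.75   # hypothetical ability, difficulty, intercept
p = inv_logit(alpha - beta + delta)    # Pr[y = 1] under the 1PL model

# shifting both ability and difficulty by the same xi cancels in the difference,
# which is why priors locating alpha and beta at 0 are needed for identification
xi = 5.0
p_shifted = inv_logit((alpha + xi) - (beta + xi) + delta)
assert abs(p - p_shifted) < 1e-12
```

Without the priors, the likelihood is constant along this shift direction, so neither a posterior mode nor well-behaved sampling is available for the individual α and β components.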
For real applications, the priors should be fit hierarchically along with the other parameters, as described in the next section.

Multilevel 2PL Model

The simple 1PL model described in the previous section is generalized in this section with the addition of a discrimination parameter to model how noisy a question is, and by adding multilevel priors for the question difficulty and discrimination parameters. The model parameters are declared as follows.

parameters {
  real mu_beta;               // mean question difficulty
  real alpha[J];              // ability for j - mean
  real beta[K];               // difficulty for k
  real<lower=0> gamma[K];     // discrimination of k
  real<lower=0> sigma_beta;   // scale of difficulties
  real<lower=0> sigma_gamma;  // scale of log discrimination
}

The parameters should be clearer after the model definition.

model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, sigma_beta);
  gamma ~ lognormal(0, sigma_gamma);
  mu_beta ~ cauchy(0, 5);
  sigma_beta ~ cauchy(0, 5);
  sigma_gamma ~ cauchy(0, 5);
  for (n in 1:N)
    y[n] ~ bernoulli_logit(gamma[kk[n]]
                           * (alpha[jj[n]] - (beta[kk[n]] + mu_beta)));
}

This is similar to the 1PL model, with the additional parameter gamma[k] modeling how discriminative question k is. If gamma[k] is greater than 1, responses are more attenuated, with less chance of getting a question right at random. The parameter gamma[k] is constrained to be positive, which prohibits there being questions that are easier for students of lesser ability; such questions are not unheard of, but they tend to be eliminated from most testing situations where an IRT model would be applied.

The model is parameterized here with the student abilities alpha being given a unit normal prior. This is to identify both the scale and the location of the parameters, both of which would be unidentified otherwise; see Chapter 25 for further discussion of identifiability. The difficulty and discrimination parameters beta and gamma then have varying scales given hierarchically in this model.
They could also be given weakly informative non-hierarchical priors, such as

  beta ~ normal(0, 5);
  gamma ~ lognormal(0, 2);

The point is that alpha determines the scale and location, and beta and gamma are allowed to float.

The beta parameter is here given a non-centered parameterization, with the parameter mu_beta serving as the mean beta location. An alternative would have been to take

  beta ~ normal(mu_beta, sigma_beta);

and

  y[n] ~ bernoulli_logit(gamma[kk[n]] * (alpha[jj[n]] - beta[kk[n]]));

Non-centered parameterizations tend to be more efficient in hierarchical models; see Section 28.6 for more information on non-centered reparameterizations.

The intercept term mu_beta cannot itself be modeled hierarchically, so it is given a weakly informative Cauchy(0, 5) prior. Similarly, the scale terms sigma_beta and sigma_gamma are given half-Cauchy priors. The truncation in the half-Cauchy prior is implicit; explicit truncation is not necessary because the log probability need only be calculated up to a proportion and the scale variables are constrained to (0, ∞) by their declarations.

9.12. Priors for Identifiability

Location and Scale Invariance

One application of (hierarchical) priors is to identify the scale and/or location of a group of parameters. For example, in the IRT models discussed in the previous section, there is both a location and a scale non-identifiability. With uniform priors, the posteriors will float in terms of both scale and location. See Section 25.1 for a simple example of the problems this poses for estimation.

The non-identifiability is resolved by providing a unit normal (i.e., Normal(0, 1)) prior on one group of coefficients, such as the student abilities. With a unit normal prior on the student abilities, the IRT model is identified in that the posterior will produce a group of estimates for the student ability parameters that have a sample mean close to zero and a sample variance close to one.
The difficulty and discrimination parameters for the questions should then be given a diffuse, or ideally a hierarchical, prior, which will identify these parameters by scaling and locating them relative to the student ability parameters.

Collinearity

Another case in which priors can help provide identifiability is collinearity in a linear regression. In linear regression, if two predictors are collinear (i.e., one is a linear function of the other), then their coefficients will have a correlation of 1 (or −1) in the posterior. This leads to non-identifiability. By placing normal priors on the coefficients, the maximum likelihood solution for two duplicated predictors (trivially collinear) will be half the value that would be obtained by including only one of them.

Separability

In a logistic regression, if a predictor is positive in cases of 1 outcomes and negative in cases of 0 outcomes, then the maximum likelihood estimate for the coefficient of that predictor diverges to infinity. This divergence can be controlled by providing a prior for the coefficient, which will "shrink" the estimate back toward zero and thus identify the model in the posterior.

Similar problems arise for sampling with improper flat priors: the sampler will try to draw very large values. By providing a prior, the posterior will be concentrated around finite values, leading to well-behaved sampling.

9.13. Multivariate Priors for Hierarchical Models

In hierarchical regression models (and other situations), several individual-level variables may be assigned hierarchical priors. For example, a model with multiple varying intercepts and slopes within groups might assign them a multivariate prior. As an example, the individuals might be people and the outcome income, with predictors such as education level and age, and the groups might be states or other geographic divisions. The effect of education level and age, as well as an intercept, might be allowed to vary by state.
Furthermore, there might be state-level predictors, such as average state income and unemployment level.

Multivariate Regression Example

(Gelman and Hill, 2007, Chapters 13 and 17) discuss a hierarchical model with N individuals organized into J groups. Each individual has a predictor row vector x_n of size K; to unify the notation, they assume that x_{n,1} = 1 is a fixed "intercept" predictor. To encode group membership, they assume individual n belongs to group jj[n] ∈ 1:J. Each individual n also has an observed outcome y_n taking on real values.

Likelihood

The model is a linear regression with slope and intercept coefficients varying by group, so that β_j is the coefficient K-vector for group j. The likelihood function for individual n is then just

  y_n ∼ Normal(x_n β_{jj[n]}, σ)   for n ∈ 1:N.

Coefficient Prior

Gelman and Hill model the coefficient vectors β_j as being drawn from a multivariate distribution with mean vector µ and covariance matrix Σ,

  β_j ∼ MultiNormal(µ, Σ)   for j ∈ 1:J.

Below, we discuss the full model of Gelman and Hill, which uses group-level predictors to model µ; for now, we assume µ is a simple vector parameter.

Hyperpriors

For hierarchical modeling, the group-level mean vector µ and covariance matrix Σ must themselves be given priors. The group-level mean vector can be given a reasonable weakly informative prior for independent coefficients, such as

  µ_k ∼ Normal(0, 5).

Of course, if more is known about the expected coefficient values β_{j,k}, this information can be incorporated into the prior for µ_k.

For the prior on the covariance matrix, Gelman and Hill suggest using a scaled inverse Wishart. That choice was motivated primarily by convenience, as it is conjugate to the multivariate likelihood function and thus simplifies Gibbs sampling. In Stan, there is no restriction to conjugacy for multivariate priors, and we in fact recommend a slightly different approach.
Like Gelman and Hill, we decompose our prior into a scale and a matrix, but are able to do so in a more natural way based on the actual variable scales and a correlation matrix. Specifically, we define

  Σ = diag_matrix(τ) Ω diag_matrix(τ),

where Ω is a correlation matrix and τ is the vector of coefficient scales. This mapping from scale vector τ and correlation matrix Ω can be inverted, using

  τ_k = sqrt(Σ_{k,k})   and   Ω_{i,j} = Σ_{i,j} / (τ_i τ_j).

The components of the scale vector τ can be given any reasonable prior for scales, but we recommend something weakly informative like a half-Cauchy distribution with a small scale, such as

  τ_k ∼ Cauchy(0, 2.5)   for k ∈ 1:K, constrained by τ_k > 0.

As for the prior means, if there is information about the scale of variation of coefficients across groups, it should be incorporated into the prior for τ. For large numbers of exchangeable coefficients, the components of τ (perhaps excluding the intercept) may themselves be given a hierarchical prior.

Our final recommendation is to give the correlation matrix Ω an LKJ prior with shape ν ≥ 1,

  Ω ∼ LKJCorr(ν).

The LKJ correlation distribution is defined in Section 63.1, but the basic idea for modeling is that as ν increases, the prior increasingly concentrates around the unit correlation matrix (i.e., favors less correlation among the components of β_j). At ν = 1, the LKJ correlation distribution reduces to the uniform distribution over correlation matrices. The LKJ prior may thus be used to control the expected amount of correlation among the parameters β_j.

Group-Level Predictors for Prior Mean

To complete Gelman and Hill's model, suppose each group j ∈ 1:J is supplied with an L-dimensional row vector of group-level predictors u_j. The prior mean for the β_j can then itself be modeled as a regression, using an L-dimensional coefficient vector γ.
The prior for the group-level coefficients then becomes

  β_j ∼ MultiNormal(u_j γ, Σ).

The group-level coefficients γ may themselves be given independent weakly informative priors, such as

  γ_l ∼ Normal(0, 5).

As usual, information about the group-level means should be incorporated into this prior.

Coding the Model in Stan

The Stan code for the full hierarchical model with multivariate priors on the group-level coefficients and group-level prior means follows its definition.

data {
  int<lower=0> N;              // num individuals
  int<lower=1> K;              // num ind predictors
  int<lower=1> J;              // num groups
  int<lower=1> L;              // num group predictors
  int<lower=1,upper=J> jj[N];  // group for individual
  matrix[N, K] x;              // individual predictors
  row_vector[L] u[J];          // group predictors
  vector[N] y;                 // outcomes
}
parameters {
  corr_matrix[K] Omega;        // prior correlation
  vector<lower=0>[K] tau;      // prior scale
  matrix[L, K] gamma;          // group coeffs
  vector[K] beta[J];           // indiv coeffs by group
  real<lower=0> sigma;         // prediction error scale
}
model {
  tau ~ cauchy(0, 2.5);
  Omega ~ lkj_corr(2);
  to_vector(gamma) ~ normal(0, 5);
  {
    row_vector[K] u_gamma[J];
    for (j in 1:J)
      u_gamma[j] = u[j] * gamma;
    beta ~ multi_normal(u_gamma, quad_form_diag(Omega, tau));
  }
  for (n in 1:N)
    y[n] ~ normal(x[n] * beta[jj[n]], sigma);
}

The hyperprior covariance matrix is defined implicitly through a quadratic form in the code, because the correlation matrix Omega and scale vector tau are more natural to inspect in the output; to output Sigma, define it as a transformed parameter. The function quad_form_diag is defined so that quad_form_diag(Sigma, tau) is equivalent to diag_matrix(tau) * Sigma * diag_matrix(tau), where diag_matrix(tau) returns the matrix with tau on the diagonal and zeroes off diagonal; the version using quad_form_diag should be faster. See Section 43.2 for more information on specialized matrix operations.
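The scale/correlation decomposition of Σ and its inverse mapping, along with the quad_form_diag identity, can be checked numerically on a small 2 × 2 case. The following is plain Python (not Stan; the values are hypothetical):

```python
import math

tau = [1.5, 0.5]                   # hypothetical coefficient scales
rho = 0.3                          # hypothetical correlation
Omega = [[1.0, rho], [rho, 1.0]]   # 2 x 2 correlation matrix

# Sigma = diag_matrix(tau) * Omega * diag_matrix(tau),
# elementwise: Sigma_ij = tau_i * Omega_ij * tau_j (the quad_form_diag identity)
Sigma = [[tau[i] * Omega[i][j] * tau[j] for j in range(2)] for i in range(2)]

# invert the mapping: tau_k = sqrt(Sigma_kk), Omega_ij = Sigma_ij / (tau_i * tau_j)
tau_back = [math.sqrt(Sigma[k][k]) for k in range(2)]
Omega_back = [[Sigma[i][j] / (tau_back[i] * tau_back[j]) for j in range(2)]
              for i in range(2)]

assert all(abs(a - b) < 1e-12 for a, b in zip(tau, tau_back))
assert all(abs(Omega[i][j] - Omega_back[i][j]) < 1e-12
           for i in range(2) for j in range(2))
```

The round trip confirms that scales and correlations can be separated losslessly, which is what makes it natural to place independent priors on τ and Ω rather than directly on Σ.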
Optimization through Vectorization

The code in the Stan program above can be sped up dramatically by replacing

  for (n in 1:N)
    y[n] ~ normal(x[n] * beta[jj[n]], sigma);

with the vectorized form

  {
    vector[N] x_beta_jj;
    for (n in 1:N)
      x_beta_jj[n] = x[n] * beta[jj[n]];
    y ~ normal(x_beta_jj, sigma);
  }

The outer brackets create a local scope in which to define the variable x_beta_jj, which is then filled in a loop and used to define a vectorized sampling statement. The reason this is such a big win is that it allows us to take the log of sigma only once, and it greatly reduces the size of the resulting expression graph by packing all of the work into a single density function.

Although it is tempting to redeclare beta and include a revised model block sampling statement,

  parameters {
    matrix[J, K] beta;
  ...
  model {
    y ~ normal(rows_dot_product(x, beta[jj]), sigma);
  ...

this fails because it breaks the vectorization of sampling for beta,⁶

  beta ~ multi_normal(...);

which requires beta to be an array of vectors. Both vectorizations are important, so the best solution is to just use the loop above; rows_dot_product cannot do much optimization in and of itself because there are no shared computations.

The code in the Stan program above also builds up an array of vectors for the outcomes and for the multivariate normal, which provides a very significant speedup by reducing the number of linear systems that need to be solved and differentiated.

  {
    matrix[K, K] Sigma_beta;
    Sigma_beta = quad_form_diag(Omega, tau);
    for (j in 1:J)
      beta[j] ~ multi_normal((u[j] * gamma)', Sigma_beta);
  }

In this example, the covariance matrix Sigma_beta is defined as a local variable so as not to have to repeat the quadratic form computation J times. This vectorization can be combined with the Cholesky-factor optimization in the next section.

⁶ Thanks to Mike Lawrence for pointing this out in the GitHub issue for the manual.
Optimization through Cholesky Factorization

The multivariate normal density and LKJ prior on correlation matrices both require their matrix parameters to be factored. Vectorizing, as in the previous section, ensures this is only done once for each density. An even better solution, both in terms of efficiency and numerical stability, is to parameterize the model directly in terms of Cholesky factors of correlation matrices using the multivariate version of the non-centered parameterization. For the model in the previous section, the program fragment to replace the full matrix prior with an equivalent Cholesky factorized prior is as follows.

data {
  matrix[J, L] u;
  ...
parameters {
  matrix[K, J] z;
  cholesky_factor_corr[K] L_Omega;
  ...
transformed parameters {
  matrix[J, K] beta;
  beta = u * gamma + (diag_pre_multiply(tau, L_Omega) * z)';
}
model {
  to_vector(z) ~ normal(0, 1);
  L_Omega ~ lkj_corr_cholesky(2);
  ...

The data variable u was originally an array of vectors, which is efficient for access; here it is redeclared as a matrix in order to use it in matrix arithmetic. The new parameter L_Omega is the Cholesky factor of the original correlation matrix Omega, so that

  Omega = L_Omega * L_Omega'.

The prior scale vector tau is unchanged; furthermore, pre-multiplying the Cholesky factor by the scale produces the Cholesky factor of the final covariance matrix,

  Sigma_beta
    = quad_form_diag(Omega, tau)
    = diag_pre_multiply(tau, L_Omega) * diag_pre_multiply(tau, L_Omega)',

where the diagonal pre-multiply compound operation is defined by

  diag_pre_multiply(a, b) = diag_matrix(a) * b.

The new variable z is declared as a matrix, the entries of which are given independent unit normal priors; the to_vector operation turns the matrix into a vector so that it can be used as a vectorized argument to the univariate normal density.
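The two Cholesky identities just stated (Omega = L_Omega * L_Omega', and that the scaled Cholesky factor reproduces quad_form_diag(Omega, tau)) can be verified on a 2 × 2 case in plain Python (not Stan; values are hypothetical and the tiny matrix helpers are ours):

```python
import math

rho, tau = 0.4, [2.0, 0.5]   # hypothetical correlation and scales

# Cholesky factor of the 2 x 2 correlation matrix Omega = [[1, rho], [rho, 1]]
L = [[1.0, 0.0], [rho, math.sqrt(1 - rho * rho)]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

# Omega = L * L'
Omega = matmul(L, transpose(L))
assert abs(Omega[0][0] - 1.0) < 1e-12
assert abs(Omega[0][1] - rho) < 1e-12

# diag_pre_multiply(tau, L) scales row i of L by tau[i];
# its product with its own transpose recovers quad_form_diag(Omega, tau)
DL = [[tau[i] * L[i][j] for j in range(2)] for i in range(2)]
Sigma = matmul(DL, transpose(DL))
expected = [[tau[i] * Omega[i][j] * tau[j] for j in range(2)] for i in range(2)]
assert all(abs(Sigma[i][j] - expected[i][j]) < 1e-12
           for i in range(2) for j in range(2))
```

This is why passing diag_pre_multiply(tau, L_Omega) to a Cholesky-parameterized density is equivalent to, and cheaper than, forming the full covariance matrix and factoring it inside the density.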
Multiplying the Cholesky factor of the covariance matrix by z and adding the mean (u * gamma)' produces a beta distributed as in the original model. Omitting the data declarations, which are the same as before, the optimized model is as follows.

parameters {
  matrix[K, J] z;
  cholesky_factor_corr[K] L_Omega;
  vector<lower=0,upper=pi()/2>[K] tau_unif;
  matrix[L, K] gamma;   // group coeffs
  real<lower=0> sigma;  // prediction error scale
}
transformed parameters {
  matrix[J, K] beta;
  vector<lower=0>[K] tau;  // prior scale
  for (k in 1:K)
    tau[k] = 2.5 * tan(tau_unif[k]);
  beta = u * gamma + (diag_pre_multiply(tau, L_Omega) * z)';
}
model {
  to_vector(z) ~ normal(0, 1);
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(gamma) ~ normal(0, 5);
  y ~ normal(rows_dot_product(beta[jj], x), sigma);
}

This model also reparameterizes the prior scale tau to avoid potential problems with the heavy tails of the Cauchy distribution: if tau_unif is uniform on (0, pi()/2), then 2.5 * tan(tau_unif) has a half-Cauchy(0, 2.5) distribution. The statement tau_unif ~ uniform(0, pi() / 2) can be omitted from the model block because Stan increments the log posterior for parameters with uniform priors without it.

9.14. Prediction, Forecasting, and Backcasting

Stan models can be used for "predicting" the values of arbitrary model unknowns. When predictions are about the future, they're called "forecasts"; when they are predictions about the past, as in climate reconstruction or cosmology, they are sometimes called "backcasts" (or "aftcasts" or "hindcasts" or "antecasts," depending on the author's feelings about the opposite of "fore").

Programming Predictions

As a simple example, the following linear regression provides the same setup for estimating the coefficients beta as in our very first example above, using y for the N observations and x for the N predictor vectors. The model parameters and model for observations are exactly the same as before. To make predictions, we need to be given the number of predictions, N_new, and their predictor matrix, x_new. The predictions themselves are modeled as a parameter y_new.
The model statement for the predictions is exactly the same as for the observations, with the new outcome vector y_new and prediction matrix x_new.

data {
  int<lower=1> K;
  int<lower=0> N;
  matrix[N, K] x;
  vector[N] y;
  int<lower=0> N_new;
  matrix[N_new, K] x_new;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
  vector[N_new] y_new;                    // predictions
}
model {
  y ~ normal(x * beta, sigma);            // observed model
  y_new ~ normal(x_new * beta, sigma);    // prediction model
}

Predictions as Generated Quantities

Where possible, the most efficient way to generate predictions is to use the generated quantities block. This provides proper Monte Carlo (not Markov chain Monte Carlo) inference, which can have a much higher effective sample size per iteration.

...data as above...
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(x * beta, sigma);
}
generated quantities {
  vector[N_new] y_new;
  for (n in 1:N_new)
    y_new[n] = normal_rng(x_new[n] * beta, sigma);
}

Now the data is just as before, but y_new is declared as a generated quantity rather than as a parameter, and the prediction model is removed from the model block and replaced by a pseudo-random draw from a normal distribution.

Overflow in Generated Quantities

It is possible for values to overflow or underflow in generated quantities. The problem is that if the result is NaN, then any constraints placed on the variables will be violated. It is possible to check a value assigned by an RNG and reject it if it overflows, but this is both inefficient and leads to biased posterior estimates. Instead, the conditions causing overflow, such as trying to generate a negative binomial random variate with a mean of 2^31, must be intercepted and dealt with, typically by reparameterizing or reimplementing the random number generator using real values rather than integers, which are upper-bounded by 2^31 − 1 in Stan.

9.15. Multivariate Outcomes

Most regressions are set up to model univariate observations (be they scalar, boolean, categorical, ordinal, or count).
Even multinomial regressions are just repeated categorical regressions. In contrast, this section discusses regression when each observed value is multivariate. To relate multiple outcomes in a regression setting, their error terms are provided with covariance structure. This section considers two cases, seemingly unrelated regressions for continuous multivariate quantities and multivariate probit regression for boolean multivariate quantities.

Seemingly Unrelated Regressions

The first model considered is the "seemingly unrelated" regressions (SUR) of econometrics, where several linear regressions share predictors and use a covariance error structure rather than independent errors (Zellner, 1962; Greene, 2011). The model is easy to write down as a regression,

    y_n = x_n β + ε_n
    ε_n ∼ MultiNormal(0, Σ)

where x_n is a J-row-vector of predictors (x is an (N × J)-matrix), y_n is a K-vector of observations, β is a (K × J)-matrix of regression coefficients (the row vector β_k holds the coefficients for outcome k), and Σ is the covariance matrix governing the error. As usual, the intercept can be rolled into x as a column of ones. The basic Stan code is straightforward (though see below for more optimized code for use with LKJ priors on correlation).

data {
  int<lower=1> K;
  int<lower=1> J;
  int<lower=0> N;
  vector[J] x[N];
  vector[K] y[N];
}
parameters {
  matrix[K, J] beta;
  cov_matrix[K] Sigma;
}
model {
  vector[K] mu[N];
  for (n in 1:N)
    mu[n] = beta * x[n];
  y ~ multi_normal(mu, Sigma);
}

For efficiency, the multivariate normal is vectorized by precomputing the array of mean vectors and sharing the same covariance matrix. Following the advice in Section 9.13, we will place a weakly informative normal prior on the regression coefficients, an LKJ prior on the correlations, and a half-Cauchy prior on the standard deviations. The covariance structure is parameterized in terms of Cholesky factors for efficiency and arithmetic stability.

...
parameters {
  matrix[K, J] beta;
  cholesky_factor_corr[K] L_Omega;
  vector<lower=0>[K] L_sigma;
}
model {
  vector[K] mu[N];
  matrix[K, K] L_Sigma;
  for (n in 1:N)
    mu[n] = beta * x[n];
  L_Sigma = diag_pre_multiply(L_sigma, L_Omega);
  to_vector(beta) ~ normal(0, 5);
  L_Omega ~ lkj_corr_cholesky(4);
  L_sigma ~ cauchy(0, 2.5);
  y ~ multi_normal_cholesky(mu, L_Sigma);
}

The Cholesky factor of the covariance matrix is reconstructed as a local variable and used in the model by scaling the Cholesky factor of the correlation matrix. The regression coefficients get a prior all at once by converting the matrix beta to a vector. If required, the full correlation or covariance matrices may be reconstructed from their Cholesky factors in the generated quantities block.

Multivariate Probit Regression

The multivariate probit model generates sequences of boolean variables by applying a step function to the output of a seemingly unrelated regression. The observations y_n are D-vectors of boolean values (coded 0 for false, 1 for true). The values for the observations y_n are based on latent values z_n drawn from a seemingly unrelated regression model (see the previous section),

    z_n = x_n β + ε_n
    ε_n ∼ MultiNormal(0, Σ)

These are then put through the step function to produce the boolean vector y_n, with elements defined by

    y_{n,d} = I(z_{n,d} > 0),

where I() is the indicator function taking the value 1 if its argument is true and 0 otherwise. Unlike in the seemingly unrelated regressions case, here the covariance matrix Σ has unit standard deviations (i.e., it is a correlation matrix). As with ordinary probit and logistic regressions, letting the scale vary causes the model (which is defined only by a cutpoint at 0, not a scale) to be unidentified (see Greene (2011)). Multivariate probit regression can be coded in Stan using the trick introduced by Albert and Chib (1993), where the underlying continuous value vectors z_n are coded as truncated parameters.
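The generative process just described can be simulated directly, which is a useful check on one's understanding before coding the Stan model. A minimal Python sketch, with all sizes and parameter values purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: N observations, K predictors, D correlated outcomes.
N, K, D = 10, 2, 3
beta = rng.normal(size=(D, K))
x = rng.normal(size=(N, K))
Omega = np.eye(D)    # unit-scale covariance (a correlation matrix); identity for simplicity

# z_n = beta * x_n + eps_n,  eps_n ~ MultiNormal(0, Omega)
z = x @ beta.T + rng.multivariate_normal(np.zeros(D), Omega, size=N)

# y_{n,d} = I(z_{n,d} > 0): the step function producing boolean outcomes
y = (z > 0).astype(int)

assert y.shape == (N, D)
assert set(y.ravel()) <= {0, 1}
```

Note that only the sign of each z component survives into y, which is why the scale of Omega is not identified and it must be fixed as a correlation matrix.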
The key to coding the model in Stan is declaring the latent vector z in two parts, based on whether the corresponding value of y is 0 or 1. Otherwise, the model is identical to the seemingly unrelated regression model in the previous section. First, we introduce a sum function for two-dimensional arrays of integers; this is going to help us calculate how many total 1 values there are in y.

functions {
  int sum(int[,] a) {
    int s = 0;
    for (i in 1:size(a))
      s += sum(a[i]);
    return s;
  }
}

The function is trivial, but it's not a built-in for Stan and it's easier to understand the rest of the model if it's pulled into its own function so as not to create a distraction. The data declaration block is much like for the seemingly unrelated regressions, but the observations y are now integers constrained to be 0 or 1.

data {
  int<lower=1> K;
  int<lower=1> D;
  int<lower=0> N;
  int<lower=0, upper=1> y[N, D];
  vector[K] x[N];
}

After declaring the data, there is a rather involved transformed data block whose sole purpose is to sort the data array y into positive and negative components, keeping track of indexes so that z can be easily reassembled in the transformed parameters block.

transformed data {
  int<lower=0> N_pos;
  int<lower=1, upper=N> n_pos[sum(y)];
  int<lower=1, upper=D> d_pos[size(n_pos)];
  int<lower=0> N_neg;
  int<lower=1, upper=N> n_neg[(N * D) - size(n_pos)];
  int<lower=1, upper=D> d_neg[size(n_neg)];

  N_pos = size(n_pos);
  N_neg = size(n_neg);
  {
    int i;
    int j;
    i = 1;
    j = 1;
    for (n in 1:N) {
      for (d in 1:D) {
        if (y[n, d] == 1) {
          n_pos[i] = n;
          d_pos[i] = d;
          i += 1;
        } else {
          n_neg[j] = n;
          d_neg[j] = d;
          j += 1;
        }
      }
    }
  }
}

The variables N_pos and N_neg are set to the number of true (1) and number of false (0) observations in y. The loop then fills in the sequence of indexes for the positive and negative values in four arrays. The parameters are declared as follows.

parameters {
  matrix[D, K] beta;
  cholesky_factor_corr[D] L_Omega;
  vector<lower=0>[N_pos] z_pos;
  vector<upper=0>[N_neg] z_neg;
}

These include the regression coefficients beta and the Cholesky factor of the correlation matrix, L_Omega.
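Stepping back to the transformed data block: it is nothing more than a stable partition of the (n, d) index pairs by the value of y[n, d]. The same bookkeeping in Python, which makes the declared sizes easy to verify; a sketch with toy data, not code generated by Stan:

```python
# Toy 0/1 observation matrix y with N = 3 rows and D = 2 columns.
y = [[1, 0],
     [0, 0],
     [1, 1]]
N, D = len(y), len(y[0])

n_pos, d_pos, n_neg, d_neg = [], [], [], []
for n in range(N):          # same traversal order as the Stan loops
    for d in range(D):
        if y[n][d] == 1:
            n_pos.append(n + 1)   # 1-based indexes, as in Stan
            d_pos.append(d + 1)
        else:
            n_neg.append(n + 1)
            d_neg.append(d + 1)

# The declared sizes: sum(y) positives, N * D - sum(y) negatives.
assert len(n_pos) == sum(map(sum, y)) == 3
assert len(n_neg) == N * D - len(n_pos) == 3
```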
This time there is no scaling because the covariance matrix has unit scale (i.e., it is a correlation matrix; see above). The critical part of the parameter declaration is that the latent real value z is broken into positive-constrained and negative-constrained components, whose sizes were conveniently calculated in the transformed data block. The transformed data block's real work was to allow the transformed parameters block to reconstruct z.

transformed parameters {
  vector[D] z[N];
  for (n in 1:N_pos)
    z[n_pos[n], d_pos[n]] = z_pos[n];
  for (n in 1:N_neg)
    z[n_neg[n], d_neg[n]] = z_neg[n];
}

At this point, the model is simple, pretty much recreating the seemingly unrelated regression.

model {
  L_Omega ~ lkj_corr_cholesky(4);
  to_vector(beta) ~ normal(0, 5);
  {
    vector[D] beta_x[N];
    for (n in 1:N)
      beta_x[n] = beta * x[n];
    z ~ multi_normal_cholesky(beta_x, L_Omega);
  }
}

This simple form of model is made possible by the Albert and Chib-style constraints on z. Finally, the correlation matrix itself can be put back together in the generated quantities block if desired.

generated quantities {
  corr_matrix[D] Omega;
  Omega = multiply_lower_tri_self_transpose(L_Omega);
}

Of course, the same could be done for the seemingly unrelated regressions in the previous section.

9.16. Applications of Pseudorandom Number Generation

The main application of pseudorandom number generators (PRNGs) is for posterior inference, including prediction and posterior predictive checks. They can also be used for pure data simulation, which is like a posterior predictive check with no conditioning. See Section 49.6 for a description of their syntax and the scope of their usage.

Prediction

Consider predicting unobserved outcomes using linear regression. Given predictors x_1, . . . , x_N and observed outcomes y_1, . . .
, y_N, and assuming a standard linear regression with intercept α, slope β, and error scale σ, along with improper uniform priors, the posterior over the parameters given x and y is

    p(α, β, σ | x, y) ∝ ∏_{n=1}^{N} Normal(y_n | α + β x_n, σ).

For this model, the posterior predictive inference for a new outcome ỹ_n given a predictor x̃_n, conditioned on the observed data x and y, is

    p(ỹ_n | x̃_n, x, y) = ∫ Normal(ỹ_n | α + β x̃_n, σ) × p(α, β, σ | x, y) d(α, β, σ),

with the integral taken over (α, β, σ). To code the posterior predictive inference in Stan, a standard linear regression is combined with a random number generator call in the generated quantities block.

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x;
  int<lower=0> N_tilde;
  vector[N_tilde] x_tilde;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
generated quantities {
  vector[N_tilde] y_tilde;
  for (n in 1:N_tilde)
    y_tilde[n] = normal_rng(alpha + beta * x_tilde[n], sigma);
}

Given observed predictors x and outcomes y, y_tilde will be drawn according to p(ỹ | x̃, y, x). This means that, for example, the posterior mean for y_tilde is the estimate of the outcome that minimizes expected square error (conditioned on the data and model, of course).

Posterior Predictive Checks

A good way to investigate the fit of a model to the data, a critical step in Bayesian data analysis, is to generate simulated data according to the parameters of the model. This is carried out with exactly the same procedure as before, only the observed data predictors x are used in place of new predictors x̃ for unobserved outcomes. If the model fits the data well, the predictions for ỹ based on x should match the observed data y. To code posterior predictive checks in Stan requires only a slight modification of the prediction code to use x and N in place of x̃ and Ñ:

generated quantities {
  vector[N] y_tilde;
  for (n in 1:N)
    y_tilde[n] = normal_rng(alpha + beta * x[n], sigma);
}

Gelman et al. (2013) recommend choosing several posterior draws ỹ^(1), . . .
, ỹ^(M) and plotting each of them alongside the data y that was actually observed. If the model fits well, the simulated ỹ will look like the actual data y.

10. Time-Series Models

Time series data come arranged in temporal order. This chapter presents two kinds of time series models: regression-like models, such as autoregressive and moving average models, and hidden Markov models. Chapter 18 presents Gaussian processes, which may also be used for time-series (and spatial) data.

10.1. Autoregressive Models

A first-order autoregressive model (AR(1)) with normal noise takes each point y_n in a sequence y to be generated according to

    y_n ∼ Normal(α + β y_{n−1}, σ).

That is, the expected value of y_n is α + β y_{n−1}, with noise scaled as σ.

AR(1) Models

With improper flat priors on the regression coefficients for slope (β), intercept (α), and noise scale (σ), the Stan program for the AR(1) model is as follows.

data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  for (n in 2:N)
    y[n] ~ normal(alpha + beta * y[n-1], sigma);
}

The first observed data point, y[1], is not modeled here because there is nothing to condition on; instead, it acts to condition y[2]. This model also uses an improper prior for sigma, but there is no obstacle to adding an informative prior if information is available on the scale of the changes in y over time, or a weakly informative prior to help guide inference if rough knowledge of the scale of y is available.

Slicing for Efficiency

Although perhaps a bit more difficult to read, a much more efficient way to write the above model is by slicing the vectors, with the model above being replaced with the one-liner

model {
  y[2:N] ~ normal(alpha + beta * y[1:(N - 1)], sigma);
}

The left-hand side slicing operation pulls out the last N − 1 elements and the right-hand side version pulls out the first N − 1.
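The sliced statement is just the loop written as one vectorized density, so both forms produce the same log likelihood. A quick Python check of that equivalence, with the normal log density written out by hand:

```python
import math

def normal_lpdf(y, mu, sigma):
    # log density of Normal(y | mu, sigma)
    return (-0.5 * ((y - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2 * math.pi))

# Toy series and illustrative parameter values (not fitted estimates).
y = [0.3, 0.5, 0.1, -0.2, 0.4]
alpha, beta, sigma = 0.1, 0.8, 0.5
N = len(y)

# Loop form: for (n in 2:N) y[n] ~ normal(alpha + beta * y[n-1], sigma);
loop_ll = sum(normal_lpdf(y[n], alpha + beta * y[n - 1], sigma)
              for n in range(1, N))

# Sliced form: y[2:N] ~ normal(alpha + beta * y[1:(N - 1)], sigma);
sliced_ll = sum(normal_lpdf(yn, alpha + beta * yprev, sigma)
                for yn, yprev in zip(y[1:], y[:-1]))

assert abs(loop_ll - sliced_ll) < 1e-12
```

The speedup in Stan comes not from a different likelihood but from evaluating it as one vectorized expression.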
Extensions to the AR(1) Model

Proper priors from a range of different families may be added for the regression coefficients and noise scale. The normal noise model can be changed to a Student-t distribution or any other distribution with unbounded support. The model could also be made hierarchical if multiple series of observations are available. To enforce the estimation of a stationary AR(1) process, the slope coefficient beta may be constrained with bounds as follows.

  real<lower=-1, upper=1> beta;

In practice, such a constraint is not recommended. If the data is not stationary, it is best to discover this while fitting the model. Stationary parameter estimates can be encouraged with a prior favoring values of beta near zero.

AR(2) Models

Extending the order of the model is also straightforward. For example, an AR(2) model could be coded with the second-order coefficient gamma and the following model statement.

  for (n in 3:N)
    y[n] ~ normal(alpha + beta * y[n-1] + gamma * y[n-2], sigma);

AR(K) Models

A general model where the order is itself given as data can be coded by putting the coefficients in an array and computing the linear predictor in a loop.

data {
  int<lower=0> K;
  int<lower=0> N;
  real y[N];
}
parameters {
  real alpha;
  real beta[K];
  real<lower=0> sigma;
}
model {
  for (n in (K+1):N) {
    real mu = alpha;
    for (k in 1:K)
      mu += beta[k] * y[n-k];
    y[n] ~ normal(mu, sigma);
  }
}

ARCH(1) Models

Econometric and financial time-series models usually assume heteroscedasticity (i.e., they allow the scale of the noise terms defining the series to vary over time). The simplest such model is the autoregressive conditional heteroscedasticity (ARCH) model (Engle, 1982). Unlike the autoregressive model AR(1), which modeled the mean of the series as varying over time but left the noise term fixed, the ARCH(1) model takes the scale of the noise terms to vary over time but leaves the mean term fixed.
Of course, models could be defined where both the mean and scale vary over time; the econometrics literature presents a wide range of time-series modeling choices. The ARCH(1) model is typically presented as the following sequence of equations, where r_t is the observed return at time point t and μ, α_0, and α_1 are unknown regression coefficient parameters.

    r_t = μ + a_t
    a_t = σ_t ε_t
    ε_t ∼ Normal(0, 1)
    σ_t^2 = α_0 + α_1 a_{t−1}^2

In order to ensure the noise terms σ_t^2 are positive, the scale coefficients are constrained to be positive, α_0, α_1 > 0. To ensure stationarity of the time series, the slope is constrained to be less than one, α_1 < 1.¹ The ARCH(1) model may be coded directly in Stan as follows.

data {
  int<lower=0> T;    // number of time points
  real r[T];         // return at time t
}
parameters {
  real mu;                          // average return
  real<lower=0> alpha0;             // noise intercept
  real<lower=0, upper=1> alpha1;    // noise slope
}
model {
  for (t in 2:T)
    r[t] ~ normal(mu, sqrt(alpha0 + alpha1 * pow(r[t-1] - mu, 2)));
}

The loop in the model is defined so that the return at time t = 1 is not modeled; the model in the next section shows how to model the return at t = 1. The model can be vectorized to be more efficient; the model in the next section provides an example.

¹ In practice, it can be useful to remove the constraint to test whether a non-stationary set of coefficients provides a better fit to the data. It can also be useful to add a trend term to the model, because an unfitted trend will manifest as non-stationarity.

10.2. Modeling Temporal Heteroscedasticity

A set of variables is homoscedastic if their variances are all the same; the variables are heteroscedastic if they do not all have the same variance. Heteroscedastic time-series models allow the noise term to vary over time.
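Before generalizing the ARCH(1) model, note that its noise scale is a deterministic function of the data and parameters, so the recursion can be checked outside Stan. A Python sketch with purely illustrative (not fitted) coefficient values:

```python
import math

# Toy returns and illustrative ARCH(1) coefficients satisfying
# alpha0 > 0 and 0 < alpha1 < 1.
r = [0.02, -0.01, 0.03, 0.00]
mu, alpha0, alpha1 = 0.005, 0.0001, 0.3

# sigma_t^2 = alpha0 + alpha1 * a_{t-1}^2, with a_{t-1} = r_{t-1} - mu;
# one scale per modeled time point t = 2, ..., T.
sigma = [math.sqrt(alpha0 + alpha1 * (r[t - 1] - mu) ** 2)
         for t in range(1, len(r))]

assert all(s > 0 for s in sigma)   # positivity follows from the constraints
assert abs(sigma[0] - math.sqrt(alpha0 + alpha1 * (r[0] - mu) ** 2)) < 1e-15
```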
GARCH(1,1) Models

The basic generalized autoregressive conditional heteroscedasticity (GARCH) model, GARCH(1,1), extends the ARCH(1) model by including the squared previous difference in return from the mean at time t − 1 as a predictor of volatility at time t, defining

    σ_t^2 = α_0 + α_1 a_{t−1}^2 + β_1 σ_{t−1}^2.

To ensure the scale term is positive and the resulting time series stationary, the coefficients must all satisfy α_0, α_1, β_1 > 0 and the slopes must satisfy α_1 + β_1 < 1.

data {
  int<lower=0> T;
  real r[T];
  real<lower=0> sigma1;
}
parameters {
  real mu;
  real<lower=0> alpha0;
  real<lower=0, upper=1> alpha1;
  real<lower=0, upper=(1-alpha1)> beta1;
}
transformed parameters {
  real<lower=0> sigma[T];
  sigma[1] = sigma1;
  for (t in 2:T)
    sigma[t] = sqrt(alpha0
                    + alpha1 * pow(r[t-1] - mu, 2)
                    + beta1 * pow(sigma[t-1], 2));
}
model {
  r ~ normal(mu, sigma);
}

To get the recursive definition of the volatility regression off the ground, the data declaration includes a non-negative value sigma1 for the scale of the noise at t = 1. The constraints are coded directly on the parameter declarations. This declaration is order-specific in that the constraint on beta1 depends on the value of alpha1. A transformed parameter array of non-negative values sigma is used to store the scale values at each time point. The definition of these values in the transformed parameters block is where the regression is now defined. There is an intercept alpha0, a slope alpha1 for the squared difference in return from the mean at the previous time, and a slope beta1 for the previous noise scale squared. Finally, the whole regression is inside the sqrt function because Stan requires scale (deviation) parameters (not variance parameters) for the normal distribution. With the regression in the transformed parameters block, the model reduces to a single vectorized sampling statement. Because r and sigma are of length T, all of the data is modeled directly.

10.3. Moving Average Models

A moving average model uses previous errors as predictors for future outcomes.
For a moving average model of order Q, MA(Q), there is an overall mean parameter μ and regression coefficients θ_q for previous error terms. With ε_t being the noise at time t, the model for outcome y_t is defined by

    y_t = μ + θ_1 ε_{t−1} + · · · + θ_Q ε_{t−Q} + ε_t,

with the noise term ε_t for outcome y_t modeled as normal,

    ε_t ∼ Normal(0, σ).

In a proper Bayesian model, the parameters μ, θ, and σ must all be given priors.

MA(2) Example

An MA(2) model can be coded in Stan as follows.

data {
  int<lower=3> T;    // number of observations
  vector[T] y;       // observation at time T
}
parameters {
  real mu;              // mean
  real<lower=0> sigma;  // error scale
  vector[2] theta;      // lag coefficients
}
transformed parameters {
  vector[T] epsilon;    // error terms
  epsilon[1] = y[1] - mu;
  epsilon[2] = y[2] - mu - theta[1] * epsilon[1];
  for (t in 3:T)
    epsilon[t] = ( y[t] - mu
                   - theta[1] * epsilon[t - 1]
                   - theta[2] * epsilon[t - 2] );
}
model {
  mu ~ cauchy(0, 2.5);
  theta ~ cauchy(0, 2.5);
  sigma ~ cauchy(0, 2.5);
  for (t in 3:T)
    y[t] ~ normal(mu
                  + theta[1] * epsilon[t - 1]
                  + theta[2] * epsilon[t - 2],
                  sigma);
}

The error terms ε_t are defined as transformed parameters in terms of the observations and parameters. The sampling statement defining the likelihood follows that definition, and can only be applied to y_n for n > Q. In this example, the parameters are all given Cauchy (half-Cauchy for σ) priors, although other priors can be used just as easily. This model could be improved in terms of speed by vectorizing the sampling statement in the model block. The calculation of the ε_t could also be sped up by using a dot product instead of a loop.

Vectorized MA(Q) Model

A general MA(Q) model with a vectorized sampling probability may be defined as follows.
data {
  int<lower=0> Q;    // num previous noise terms
  int<lower=3> T;    // num observations
  vector[T] y;       // observation at time t
}
parameters {
  real mu;              // mean
  real<lower=0> sigma;  // error scale
  vector[Q] theta;      // error coeff, lag -t
}
transformed parameters {
  vector[T] epsilon;    // error term at time t
  for (t in 1:T) {
    epsilon[t] = y[t] - mu;
    for (q in 1:min(t - 1, Q))
      epsilon[t] = epsilon[t] - theta[q] * epsilon[t - q];
  }
}
model {
  vector[T] eta;
  mu ~ cauchy(0, 2.5);
  theta ~ cauchy(0, 2.5);
  sigma ~ cauchy(0, 2.5);
  for (t in 1:T) {
    eta[t] = mu;
    for (q in 1:min(t - 1, Q))
      eta[t] = eta[t] + theta[q] * epsilon[t - q];
  }
  y ~ normal(eta, sigma);
}

Here all of the data is modeled, with missing terms just dropped from the regressions as in the calculation of the error terms. Both models converge very quickly and mix very well at convergence, with the vectorized model being quite a bit faster per iteration (not to converge; they compute the same model).

10.4. Autoregressive Moving Average Models

Autoregressive moving-average models (ARMA) combine the predictors of the autoregressive model and the moving average model. An ARMA(1,1) model, with a single state of history, can be encoded in Stan as follows.

data {
  int<lower=1> T;    // num observations
  real y[T];         // observed outputs
}
parameters {
  real mu;              // mean coeff
  real phi;             // autoregression coeff
  real theta;           // moving avg coeff
  real<lower=0> sigma;  // noise scale
}
model {
  vector[T] nu;      // prediction for time t
  vector[T] err;     // error for time t
  nu[1] = mu + phi * mu;    // assume err[0] == 0
  err[1] = y[1] - nu[1];
  for (t in 2:T) {
    nu[t] = mu + phi * y[t-1] + theta * err[t-1];
    err[t] = y[t] - nu[t];
  }
  mu ~ normal(0, 10);       // priors
  phi ~ normal(0, 2);
  theta ~ normal(0, 2);
  sigma ~ cauchy(0, 5);
  err ~ normal(0, sigma);   // likelihood
}

The data is declared in the same way as the other time-series regressions and the parameters are documented in the code. In the model block, the local vector nu stores the predictions and err the errors.
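The nu and err recursion is deterministic given the data and parameters, so it can be checked outside Stan. A Python sketch of the bookkeeping with illustrative values:

```python
# Toy series and illustrative ARMA(1,1) parameter values (not fitted).
y = [1.0, 1.2, 0.9, 1.1]
mu, phi, theta = 1.0, 0.5, 0.3

nu = [mu + phi * mu]        # nu[1] = mu + phi * mu, assuming err[0] == 0
err = [y[0] - nu[0]]        # err[1] = y[1] - nu[1]
for t in range(1, len(y)):  # mirrors: for (t in 2:T)
    nu.append(mu + phi * y[t - 1] + theta * err[t - 1])
    err.append(y[t] - nu[t])

# The likelihood then treats each err[t] as a Normal(0, sigma) draw.
assert abs(err[0] - (y[0] - mu - phi * mu)) < 1e-12
assert abs(err[1] - (y[1] - mu - phi * y[0] - theta * err[0])) < 1e-12
```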
These are computed similarly to the errors in the moving average models described in the previous section. The priors are weakly informative for stationary processes. The likelihood only involves the error term, which is efficiently vectorized here. Often in models such as these, it is desirable to inspect the calculated error terms. This could easily be accomplished in Stan by declaring err as a transformed parameter, then defining it the same way as in the model above. The vector nu could still be a local variable, only now it will be in the transformed parameters block.

Wayne Folta suggested encoding the model without local vector variables as follows.

model {
  real err;
  mu ~ normal(0, 10);
  phi ~ normal(0, 2);
  theta ~ normal(0, 2);
  sigma ~ cauchy(0, 5);
  err = y[1] - (mu + phi * mu);
  err ~ normal(0, sigma);
  for (t in 2:T) {
    err = y[t] - (mu + phi * y[t-1] + theta * err);
    err ~ normal(0, sigma);
  }
}

This approach to ARMA models provides a nice example of how local variables, such as err in this case, can be reused in Stan. Folta's approach could be extended to higher-order moving-average models by storing more than one error term as a local variable and reassigning them in the loop. Both encodings are very fast. The original encoding has the advantage of vectorizing the normal distribution, but it uses a bit more memory. A halfway point would be to vectorize just err.

Identifiability and Stationarity

MA and ARMA models are not identifiable if the roots of the characteristic polynomial for the MA part lie inside the unit circle, so it's necessary to add the following constraint.²

  real<lower=-1, upper=1> theta;

When the model is run without the constraint, using synthetic data generated from the model, the simulation can sometimes find modes for (theta, phi) outside the [−1, 1] interval, which creates a multiple mode problem in the posterior and also causes the NUTS tree depth to get very large (often above 10).
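For the MA(1) special case, the characteristic polynomial is 1 + θz, with the single root z = −1/θ, so the root lies outside the unit circle (the identifiable, invertible case) exactly when |θ| < 1; that is the equivalence the interval constraint encodes. A small Python check of this correspondence:

```python
# Root of the MA(1) characteristic polynomial 1 + theta * z is z = -1 / theta.
def ma1_root_outside_unit_circle(theta):
    if theta == 0.0:
        return True    # degenerate MA part: no root at all, trivially fine
    return abs(-1.0 / theta) > 1.0

# |theta| < 1  <=>  root outside the unit circle.
assert ma1_root_outside_unit_circle(0.5)       # root at -2, outside
assert not ma1_root_outside_unit_circle(2.0)   # root at -0.5, inside
assert not ma1_root_outside_unit_circle(-1.5)  # root at 2/3, inside
```

For higher-order MA parts the same check applies to all roots of the order-Q polynomial, which is why the general condition is stated in terms of the characteristic polynomial rather than a simple interval.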
Adding the constraint both improves the accuracy of the posterior and dramatically reduces the tree depth, which speeds up the simulation considerably (typically by much more than an order of magnitude). Further, unless one thinks that the process is really non-stationary, it's worth adding the following constraint to ensure stationarity.

  real<lower=-1, upper=1> phi;

² This subsection is a lightly edited comment of Jonathan Gilligan's on GitHub; see https://github.com/stan-dev/stan/issues/1617#issuecomment-160249142.

10.5. Stochastic Volatility Models

Stochastic volatility models treat the volatility (i.e., variance) of a return on an asset, such as an option to buy a security, as following a latent stochastic process in discrete time (Kim et al., 1998). The data consist of mean-corrected (i.e., centered) returns y_t on an underlying asset at T equally spaced time points. Kim et al. formulate a typical stochastic volatility model using the following regression-like equations, with a latent parameter h_t for the log volatility, along with parameters μ for the mean log volatility and φ for the persistence of the volatility term. The variable ε_t represents the white-noise shock (i.e., multiplicative error) on the asset return at time t, whereas δ_t represents the shock on volatility at time t.

    y_t = ε_t exp(h_t / 2)
    h_{t+1} = μ + φ(h_t − μ) + δ_t σ
    h_1 ∼ Normal(μ, σ / sqrt(1 − φ^2))
    ε_t ∼ Normal(0, 1)
    δ_t ∼ Normal(0, 1)

Rearranging the first line gives ε_t = y_t exp(−h_t / 2), allowing the sampling distribution for y_t to be written as

    y_t ∼ Normal(0, exp(h_t / 2)).

The recurrence equation for h_{t+1} may be combined with the scaling and sampling of δ_t to yield the sampling distribution

    h_t ∼ Normal(μ + φ(h_{t−1} − μ), σ).

This formulation can be directly encoded, as shown in the following Stan model.
data {
  int<lower=0> T;   // # time points (equally spaced)
  vector[T] y;      // mean corrected return at time t
}
parameters {
  real mu;                        // mean log volatility
  real<lower=-1, upper=1> phi;    // persistence of volatility
  real<lower=0> sigma;            // white noise shock scale
  vector[T] h;                    // log volatility at time t
}
model {
  phi ~ uniform(-1, 1);
  sigma ~ cauchy(0, 5);
  mu ~ cauchy(0, 10);
  h[1] ~ normal(mu, sigma / sqrt(1 - phi * phi));
  for (t in 2:T)
    h[t] ~ normal(mu + phi * (h[t - 1] - mu), sigma);
  for (t in 1:T)
    y[t] ~ normal(0, exp(h[t] / 2));
}

Compared to the Kim et al. formulation, the Stan model adds priors for the parameters φ, σ, and μ. Note that the shock terms ε_t and δ_t do not appear explicitly in the model, although they could be calculated efficiently in a generated quantities block. The posterior of a stochastic volatility model such as this one typically has high posterior variance. For example, simulating 500 data points from the above model with μ = −1.02, φ = 0.95, and σ = 0.25 leads to 95% posterior intervals for μ of (−1.23, −0.54), for φ of (0.82, 0.98), and for σ of (0.16, 0.38). The samples using NUTS show a high degree of autocorrelation among the samples, both for this model and the stochastic volatility model evaluated in (Hoffman and Gelman, 2011, 2014). Using a non-diagonal mass matrix provides faster convergence and more effective samples than a diagonal mass matrix, but will not scale to large values of T.

It is relatively straightforward to speed up the effective samples per second generated by this model by one or more orders of magnitude. First, the sampling statement for the return y is easily vectorized to

  y ~ normal(0, exp(h / 2));

This speeds up the iterations, but does not change the effective sample size because the underlying parameterization and log probability function have not changed. Mixing is improved by reparameterizing in terms of a standardized volatility, then rescaling. This requires a standardized parameter h_std to be declared instead of h.

parameters {
  ...
  vector[T] h_std;    // std log volatility time t

The original value of h is then defined in a transformed parameters block.

transformed parameters {
  vector[T] h = h_std * sigma;    // now h ~ normal(0, sigma)
  h[1] /= sqrt(1 - phi * phi);    // rescale h[1]
  h += mu;
  for (t in 2:T)
    h[t] += phi * (h[t-1] - mu);
}

The first assignment rescales h_std to have a Normal(0, σ) distribution and temporarily assigns it to h. The second assignment rescales h[1] so that its prior differs from that of h[2] through h[T]. The next assignment supplies a mu offset, so that h[2] through h[T] are now distributed Normal(μ, σ); note that this shift must be done after the rescaling of h[1]. The final loop adds in the moving average so that h[2] through h[T] are appropriately modeled relative to phi and mu. As a final improvement, the sampling statement for h[1] and the loop for sampling h[2] to h[T] are replaced with a single vectorized unit normal sampling statement.

model {
  ...
  h_std ~ normal(0, 1);

Although the original model can take hundreds and sometimes thousands of iterations to converge, the reparameterized model reliably converges in tens of iterations. Mixing is also dramatically improved, which results in higher effective sample sizes per iteration. Finally, each iteration runs in roughly a quarter of the time of the original iterations.

10.6. Hidden Markov Models

A hidden Markov model (HMM) generates a sequence of T output variables y_t conditioned on a parallel sequence of latent categorical state variables z_t ∈ {1, . . . , K}. These "hidden" state variables are assumed to form a Markov chain so that z_t is conditionally independent of other variables given z_{t−1}. This Markov chain is parameterized by a transition matrix θ, where θ_k is a K-simplex for k ∈ {1, . . . , K}. The probability of transitioning to state z_t from state z_{t−1} is

    z_t ∼ Categorical(θ_{z[t−1]}).

The output y_t at time t is generated conditionally independently based on the latent state z_t.
This section describes HMMs with a simple categorical model for outputs y_t ∈ {1, . . . , V }. The categorical distribution for latent state k is parameterized by a V-simplex φ_k. The observed output y_t at time t is generated based on the hidden state indicator z_t at time t,

    y_t ∼ Categorical(φ_{z[t]}).

In short, HMMs form a discrete mixture model where the mixture component indicators form a latent Markov chain.

Supervised Parameter Estimation

In the situation where the hidden states are known, the following naive model can be used to fit the parameters θ and φ.

data {
  int<lower=1> K;                // num categories
  int<lower=1> V;                // num words
  int<lower=0> T;                // num instances
  int<lower=1, upper=V> w[T];    // words
  int<lower=1, upper=K> z[T];    // categories
  vector<lower=0>[K] alpha;      // transit prior
  vector<lower=0>[V] beta;       // emit prior
}
parameters {
  simplex[K] theta[K];    // transit probs
  simplex[V] phi[K];      // emit probs
}
model {
  for (k in 1:K)
    theta[k] ~ dirichlet(alpha);
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);
  for (t in 1:T)
    w[t] ~ categorical(phi[z[t]]);
  for (t in 2:T)
    z[t] ~ categorical(theta[z[t - 1]]);
}

Explicit Dirichlet priors have been provided for θ_k and φ_k; dropping these two statements would implicitly take the prior to be uniform over all valid simplexes.

Start-State and End-State Probabilities

Although workable, the above description of HMMs is incomplete because the start state z_1 is not modeled (the index runs from 2 to T). If the data are conceived as a subsequence of a long-running process, the probability of z_1 should be set to the stationary state probabilities in the Markov chain. In this case, there is no distinct end to the data, so there is no need to model the probability that the sequence ends at z_T. An alternative conception of HMMs is as models of finite-length sequences. For example, human language sentences have distinct starting distributions (usually a capital letter) and ending distributions (usually some kind of punctuation).
The simplest way to model the sequence boundaries is to add a new latent state K + 1, generate the first state from a categorical distribution with parameter vector θ_{K+1}, and restrict the transitions so that a transition to state K + 1 is forced to occur at the end of the sentence and is prohibited elsewhere.

Calculating Sufficient Statistics

The naive HMM estimation model presented above can be sped up dramatically by replacing the loops over categorical distributions with a single multinomial distribution.³ The data is declared as before, but now a transformed data block computes the sufficient statistics for estimating the transition and emission matrices.

transformed data {
  int trans[K, K];
  int emit[K, V];
  for (k1 in 1:K)
    for (k2 in 1:K)
      trans[k1, k2] = 0;
  for (t in 2:T)
    trans[z[t - 1], z[t]] += 1;
  for (k in 1:K)
    for (v in 1:V)
      emit[k, v] = 0;
  for (t in 1:T)
    emit[z[t], w[t]] += 1;
}

The likelihood component of the model based on looping over the input is replaced with multinomials as follows.

model {
  ...
  for (k in 1:K)
    trans[k] ~ multinomial(theta[k]);
  for (k in 1:K)
    emit[k] ~ multinomial(phi[k]);
}

A continuous HMM with normal emission probabilities could be sped up in the same way by computing sufficient statistics.

³ The program is available in the Stan example model repository; see http://mc-stan.org/documentation.

Analytic Posterior

With the Dirichlet-multinomial HMM, the posterior can be computed analytically because the Dirichlet is the conjugate prior to the multinomial. The following example⁴ illustrates how a Stan model can define the posterior analytically. This is possible in the Stan language because the model only needs to define the conditional probability of the parameters given the data up to a proportion, which can be done by defining the (unnormalized) joint probability or the (unnormalized) conditional posterior, or anything in between.

⁴ The program is available in the Stan example model repository; see http://mc-stan.org/documentation.
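The transition and emission tallies are exactly the sufficient statistics that drive the conjugate update: with a Dirichlet(alpha) prior on a transition row, the posterior is Dirichlet of the prior parameters plus that row's counts. A Python sketch of the counting and the update, on toy data:

```python
# Toy supervised HMM data: K = 2 states, V = 3 output symbols (1-based, as in Stan).
z = [1, 2, 2, 1, 2]   # state sequence
w = [3, 1, 2, 3, 1]   # output sequence
K, V = 2, 3

trans = [[0] * K for _ in range(K)]
for t in range(1, len(z)):            # mirrors: for (t in 2:T)
    trans[z[t - 1] - 1][z[t] - 1] += 1

emit = [[0] * V for _ in range(K)]
for t in range(len(z)):               # mirrors: for (t in 1:T)
    emit[z[t] - 1][w[t] - 1] += 1

assert trans == [[0, 2], [1, 1]]       # one count per adjacent state pair
assert emit == [[0, 0, 2], [2, 1, 0]]  # one count per time point

# Conjugate update: posterior for row k is Dirichlet(alpha + trans[k])
# (and similarly Dirichlet(beta + emit[k]) for emissions).
alpha = [1.0, 1.0]
alpha_post = [[a + c for a, c in zip(alpha, row)] for row in trans]
assert alpha_post == [[1.0, 3.0], [2.0, 2.0]]
```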
The model has the same data and parameters as the previous models, but now computes the posterior Dirichlet parameters in the transformed data block.

transformed data {
  vector[K] alpha_post[K];
  vector[V] beta_post[K];
  for (k in 1:K)
    alpha_post[k] = alpha;
  for (t in 2:T)
    alpha_post[z[t - 1], z[t]] += 1;
  for (k in 1:K)
    beta_post[k] = beta;
  for (t in 1:T)
    beta_post[z[t], w[t]] += 1;
}

The posterior can now be written analytically as follows.

model {
  for (k in 1:K)
    theta[k] ~ dirichlet(alpha_post[k]);
  for (k in 1:K)
    phi[k] ~ dirichlet(beta_post[k]);
}

Semisupervised Estimation

HMMs can be estimated in a fully unsupervised fashion without any data for which latent states are known. The resulting posteriors are typically extremely multimodal. An intermediate solution is to use semisupervised estimation, which is based on a combination of supervised and unsupervised data. Implementing this estimation strategy in Stan requires calculating the probability of an output sequence with an unknown state sequence. This is a marginalization problem, and for HMMs, it is computed with the so-called forward algorithm.

In Stan, the forward algorithm is coded as follows. (The program is available in the Stan example model repository; see http://mc-stan.org/documentation.) First, two additional data variables are declared for the unsupervised data.

data {
  ...
  int<lower=1> T_unsup;              // num unsupervised items
  int<lower=1, upper=V> u[T_unsup];  // unsup words
  ...

The model for the supervised data does not change; the unsupervised data is handled with the following Stan implementation of the forward algorithm.

model {
  ...
  {
    real acc[K];
    real gamma[T_unsup, K];
    for (k in 1:K)
      gamma[1, k] = log(phi[k, u[1]]);
    for (t in 2:T_unsup) {
      for (k in 1:K) {
        for (j in 1:K)
          acc[j] = gamma[t - 1, j] + log(theta[j, k]) + log(phi[k, u[t]]);
        gamma[t, k] = log_sum_exp(acc);
      }
    }
    target += log_sum_exp(gamma[T_unsup]);
  }
}

The forward values gamma[t, k] are defined to be the log marginal probability of the inputs u[1], ..., u[t] up to time t and the latent state being equal to k at time t; the previous latent states are marginalized out. The first row of gamma is initialized by setting gamma[1, k] equal to the log probability of latent state k generating the first output u[1]; as before, the probability of the first latent state is not itself modeled. For each subsequent time t and latent state k, the value acc[j] is set to gamma[t - 1, j], the log forward probability of state j at time t - 1, plus the log transition probability from state j at time t - 1 to state k at time t, plus the log probability of the output u[t] being generated by state k. The log_sum_exp operation then sums the probabilities for each prior state j on the linear scale in an arithmetically stable way. The brackets provide the scope for the local variables acc and gamma; these could have been declared earlier, but it is clearer to keep their declaration near their use.

Predictive Inference

Given the transition and emission parameters, θ_{k,k'} and φ_{k,v}, and an observation sequence u_1, ..., u_T ∈ {1, ..., V}, the Viterbi (dynamic programming) algorithm computes the state sequence which is most likely to have generated the observed output u.

The Viterbi algorithm can be coded in Stan in the generated quantities block as follows. The prediction here is the most likely state sequence y_star[1], ..., y_star[T_unsup] underlying the array of observations u[1], ..., u[T_unsup].
Because this sequence is determined from the transition probabilities theta and emission probabilities phi, it may be different from sample to sample in the posterior.

generated quantities {
  int<lower=1, upper=K> y_star[T_unsup];
  real log_p_y_star;
  {
    int back_ptr[T_unsup, K];
    real best_logp[T_unsup, K];
    real best_total_logp;
    for (k in 1:K)
      best_logp[1, k] = log(phi[k, u[1]]);
    for (t in 2:T_unsup) {
      for (k in 1:K) {
        best_logp[t, k] = negative_infinity();
        for (j in 1:K) {
          real logp;
          logp = best_logp[t - 1, j]
                 + log(theta[j, k]) + log(phi[k, u[t]]);
          if (logp > best_logp[t, k]) {
            back_ptr[t, k] = j;
            best_logp[t, k] = logp;
          }
        }
      }
    }
    log_p_y_star = max(best_logp[T_unsup]);
    for (k in 1:K)
      if (best_logp[T_unsup, k] == log_p_y_star)
        y_star[T_unsup] = k;
    for (t in 1:(T_unsup - 1))
      y_star[T_unsup - t]
        = back_ptr[T_unsup - t + 1, y_star[T_unsup - t + 1]];
  }
}

The bracketed block is used to make the three variables back_ptr, best_logp, and best_total_logp local so they will not be output. The variable y_star will hold the label sequence with the highest probability given the input sequence u. Unlike the forward algorithm, where the intermediate quantities were total probabilities, here they consist of the maximum probability best_logp[t, k] for the sequence up to time t with final output category k for time t, along with a backpointer to the source of the link. Following the backpointers from the best final log probability for the final time t yields the optimal state sequence.

This inference can be run for the same unsupervised outputs u as are used to fit the semisupervised model. The above code can be found in the same model file as the unsupervised fit. This is the Bayesian approach to inference, where the data being reasoned about is used in a semisupervised way to train the model. It is not "cheating" because the underlying states for u are never observed; they are just estimated along with all of the other parameters.
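The same Viterbi recursion can be sketched outside of Stan to see the mechanics. A minimal Python version with toy (hypothetical) transition and emission parameters, keeping the manual's convention that the first state's prior is left out of the score:

```python
import math

def viterbi(theta, phi, u):
    """Most likely state sequence for observations u (0-based states/outputs).

    theta[j][k] is the transition probability j -> k; phi[k][v] is the
    emission probability of output v from state k.
    """
    K, T = len(theta), len(u)
    best = [[-math.inf] * K for _ in range(T)]   # best log prob ending in k
    back = [[0] * K for _ in range(T)]           # backpointers
    for k in range(K):
        best[0][k] = math.log(phi[k][u[0]])
    for t in range(1, T):
        for k in range(K):
            for j in range(K):
                lp = best[t-1][j] + math.log(theta[j][k]) + math.log(phi[k][u[t]])
                if lp > best[t][k]:
                    best[t][k], back[t][k] = lp, j
    # Follow backpointers from the best final state.
    path = [max(range(K), key=lambda k: best[T-1][k])]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy two-state example: state 0 mostly emits output 0, state 1 output 1.
theta = [[0.9, 0.1], [0.1, 0.9]]
phi = [[0.8, 0.2], [0.2, 0.8]]
print(viterbi(theta, phi, [0, 0, 1, 1]))  # -> [0, 0, 1, 1]
```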
If the outputs u are not used for semisupervised estimation but simply as the basis for prediction, the result is equivalent to what is represented in the BUGS modeling language via the cut operation. That is, the model is fit independently of u, then those parameters are used to find the most likely state to have generated u.

11. Missing Data & Partially Known Parameters

Bayesian inference supports a very general approach to missing data in which any missing data item is represented as a parameter that is estimated in the posterior (Gelman et al., 2013). If the missing data is not explicitly modeled, as in the predictors for most regression models, then the result is an improper prior on the parameter representing the missing predictor.

Mixing arrays of observed and missing data can be difficult to include in Stan, partly because it can be tricky to model discrete unknowns in Stan and partly because, unlike some other statistical languages (for example, R and BUGS), Stan requires observed and unknown quantities to be defined in separate places in the model. Thus it can be necessary to include code in a Stan program to splice together observed and missing parts of a data structure. Examples are provided later in the chapter.

11.1. Missing Data

Stan treats variables declared in the data and transformed data blocks as known and the variables in the parameters block as unknown. An example involving missing normal observations could be coded as follows. (A more meaningful estimation example would involve a regression of the observed and missing observations using predictors that were known for each and specified in the data block.)

data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  real y_obs[N_obs];
}
parameters {
  real mu;
  real<lower=0> sigma;
  real y_mis[N_mis];
}
model {
  y_obs ~ normal(mu, sigma);
  y_mis ~ normal(mu, sigma);
}

The numbers of observed and missing data points are coded as data with non-negative integer variables N_obs and N_mis. The observed data is provided as an array data variable y_obs. The missing data is coded as an array parameter, y_mis. The ordinary parameters being estimated, the location mu and scale sigma, are also coded as parameters. The model is vectorized on the observed and missing data; combining them in this case would be less efficient because the data observations would be promoted and have needless derivatives calculated.

11.2. Partially Known Parameters

In some situations, such as when a multivariate probability function has partially observed outcomes or parameters, it will be necessary to create a vector mixing known (data) and unknown (parameter) values. This can be done in Stan by creating a vector or array in the transformed parameters block and assigning to it. The following example involves a bivariate covariance matrix in which the variances are known, but the covariance is not.

data {
  int N;
  vector[2] y[N];
  real<lower=0> var1;
  real<lower=0> var2;
}
transformed data {
  real<lower=0> max_cov = sqrt(var1 * var2);
  real<upper=0> min_cov = -max_cov;
}
parameters {
  vector[2] mu;
  real<lower=min_cov, upper=max_cov> cov;
}
transformed parameters {
  matrix[2, 2] Sigma;
  Sigma[1, 1] = var1;
  Sigma[1, 2] = cov;
  Sigma[2, 1] = cov;
  Sigma[2, 2] = var2;
}
model {
  y ~ multi_normal(mu, Sigma);
}

The variances are defined as data in variables var1 and var2, whereas the covariance is defined as a parameter in variable cov. The 2×2 covariance matrix Sigma is defined as a transformed parameter, with the variances assigned to the two diagonal elements and the covariance to the two off-diagonal elements.

The constraint on the covariance declaration ensures that the resulting covariance matrix Sigma is positive definite. The bound, plus or minus the square root of the product of the variances, is defined as transformed data so that it is only calculated once.

The vectorization of the multivariate normal is critical for efficiency here. The transformed parameter Sigma could be defined as a local variable within the model block if it were not of interest in the output.

11.3.
Sliced Missing Data

If the missing data is part of some larger data structure, then it can often be effectively reassembled using index arrays and slicing. Here's an example for time-series data, where only some entries in the series are observed.

data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  int<lower=1, upper=N_obs + N_mis> ii_obs[N_obs];
  int<lower=1, upper=N_obs + N_mis> ii_mis[N_mis];
  real y_obs[N_obs];
}
transformed data {
  int N = N_obs + N_mis;
}
parameters {
  real y_mis[N_mis];
  real<lower=0> sigma;
}
transformed parameters {
  real y[N];
  y[ii_obs] = y_obs;
  y[ii_mis] = y_mis;
}
model {
  sigma ~ gamma(1, 1);
  y[1] ~ normal(0, 100);
  y[2:N] ~ normal(y[1:(N - 1)], sigma);
}

The index arrays ii_obs and ii_mis contain the indexes into the final array y of the observed data (coded as a data vector y_obs) and the missing data (coded as a parameter vector y_mis). See Chapter 10 for further discussion of time-series models, and specifically Section 10.1 for an explanation of the vectorization for y as well as an explanation of how to convert this example to a full AR(1) model. To ensure y[1] has a proper posterior in case it is missing, we have given it an explicit, albeit broad, prior.

Another potential application would be filling the columns of a data matrix of predictors for which some predictors are missing; matrix columns can be accessed as vectors and assigned the same way, as in

x[ii_obs_2, 2] = x_obs_2;
x[ii_mis_2, 2] = x_mis_2;

where the relevant variables are all hard coded with index 2 because Stan doesn't support ragged arrays. These could all be packed into a single array with more fiddly indexing that slices out vectors from longer vectors (see Section 16.2 for a general discussion of coding ragged data structures in Stan).

11.4. Loading Matrix for Factor Analysis

Rick Farouni, on the Stan users group, inquired as to how to build a Cholesky factor for a covariance matrix with a unit diagonal, as used in Bayesian factor analysis (Aguilar and West, 2000).
This can be accomplished by declaring the below-diagonal elements as parameters, then filling the full matrix as a transformed parameter.

data {
  int K;
}
transformed data {
  int K_choose_2;
  K_choose_2 = (K * (K - 1)) / 2;
}
parameters {
  vector[K_choose_2] L_lower;
}
transformed parameters {
  cholesky_factor_cov[K] L;
  for (k in 1:K)
    L[k, k] = 1;
  {
    int i = 1;
    for (m in 2:K) {
      for (n in 1:(m - 1)) {
        L[m, n] = L_lower[i];
        L[n, m] = 0;
        i += 1;
      }
    }
  }
}

It is most convenient to place a prior directly on L_lower. An alternative would be a prior for the full Cholesky factor L, because the transform from L_lower to L is just the identity and thus does not require a Jacobian adjustment (despite the warning from the parser, which is not smart enough to do the code analysis to infer that the transform is linear). It would not be at all convenient to place a prior on the full covariance matrix L * L', because that would require a Jacobian adjustment; the exact adjustment is provided in the subsection of Section 35.1 devoted to covariance matrices.

11.5. Missing Multivariate Data

It's often the case that one or more components of a multivariate outcome are missing. (Note that this is not the same as missing components of a multivariate predictor in a regression problem; in that case, you will need to represent the missing data as a parameter and impute missing values in order to feed them into the regression.)

As an example, we'll consider the bivariate distribution, which is easily marginalized. The coding here is brute force, representing both an array of vector observations y and a boolean array y_observed to indicate which values were observed (others can have dummy values in the input).

vector[2] y[N];
int<lower=0, upper=1> y_observed[N, 2];

If both components are observed, we model them using the full multivariate normal; otherwise, we model the marginal distribution of the component that is observed.

for (n in 1:N) {
  if (y_observed[n, 1] && y_observed[n, 2])
    y[n] ~ multi_normal(mu, Sigma);
  else if (y_observed[n, 1])
    y[n, 1] ~ normal(mu[1], sqrt(Sigma[1, 1]));
  else if (y_observed[n, 2])
    y[n, 2] ~ normal(mu[2], sqrt(Sigma[2, 2]));
}

It's a bit more work, but much more efficient to vectorize these sampling statements.
In transformed data, build up three vectors of indices, for the three cases above:

transformed data {
  int ns12[observed_12(y_observed)];
  int ns1[observed_1(y_observed)];
  int ns2[observed_2(y_observed)];
}

You will need to write functions that pull out the count of observations in each of the three sampling situations. This must be done with functions because the result needs to go in a top-level block variable size declaration. Then the rest of transformed data just fills in the values using three counters.

int n12 = 1;
int n1 = 1;
int n2 = 1;
for (n in 1:N) {
  if (y_observed[n, 1] && y_observed[n, 2]) {
    ns12[n12] = n;
    n12 += 1;
  } else if (y_observed[n, 1]) {
    ns1[n1] = n;
    n1 += 1;
  } else if (y_observed[n, 2]) {
    ns2[n2] = n;
    n2 += 1;
  }
}

Then, in the model block, everything's nice and vectorizable using those indexes constructed once in transformed data:

y[ns12] ~ multi_normal(mu, Sigma);
y[ns1, 1] ~ normal(mu[1], sqrt(Sigma[1, 1]));
y[ns2, 2] ~ normal(mu[2], sqrt(Sigma[2, 2]));

The result will be much more efficient than using latent variables for the missing data, but it requires the multivariate distribution to be marginalized analytically. It'd be more efficient still to precompute the three arrays in the transformed data block, though the efficiency improvement will be relatively minor compared to vectorizing the probability functions.

This approach can easily be generalized with some index fiddling to the general multivariate case. The trick is to pull out entries in the covariance matrix for the missing components.
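The index-building pass above can be sketched in Python to check the partition logic; the observation flags here are hypothetical toy values, not from the manual:

```python
def partition_observations(y_observed):
    """Split 1-based observation indices into the three cases:
    both components observed, only the first, only the second.
    Rows with neither component observed are dropped, as in the Stan loop."""
    ns12, ns1, ns2 = [], [], []
    for n, (obs1, obs2) in enumerate(y_observed, start=1):
        if obs1 and obs2:
            ns12.append(n)
        elif obs1:
            ns1.append(n)
        elif obs2:
            ns2.append(n)
    return ns12, ns1, ns2

# Toy flags: rows are observations, columns the two outcome components.
flags = [(1, 1), (1, 0), (0, 1), (1, 1), (0, 0)]
print(partition_observations(flags))  # -> ([1, 4], [2], [3])
```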
It can also be used in situations such as multivariate differential equation solutions where only one component is observed, as in a phase-space experiment recording only time and position of a pendulum (and not recording momentum).

12. Truncated or Censored Data

Data in which measurements have been truncated or censored can be coded in Stan following their respective probability models.

12.1. Truncated Distributions

Truncation in Stan is restricted to univariate distributions for which the corresponding log cumulative distribution function (cdf) and log complementary cumulative distribution function (ccdf) are available. See the subsection on truncated distributions in Section 5.3 for more information on truncated distributions, cdfs, and ccdfs.

12.2. Truncated Data

Truncated data is data for which measurements are only reported if they fall above a lower bound, below an upper bound, or between a lower and upper bound.

Truncated data may be modeled in Stan using truncated distributions. For example, suppose the truncated data is y_n with an upper truncation point of U = 300 so that y_n < 300. In Stan, this data can be modeled as following a truncated normal distribution for the observations as follows.

data {
  int<lower=0> N;
  real U;
  real<upper=U> y[N];
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  for (n in 1:N)
    y[n] ~ normal(mu, sigma) T[,U];
}

The model declares an upper bound U as data and constrains the data for y to respect the constraint; this will be checked when the data is loaded into the model before sampling begins.

This model implicitly uses an improper flat prior on the scale and location parameters; these could be given priors in the model using sampling statements.

Constraints and Out-of-Bounds Returns

If the sampled variate in a truncated distribution lies outside of the truncation range, the probability is zero, so the log probability will evaluate to −∞. For instance, if variate y is sampled with the statement
for (n in 1:N)
  y[n] ~ normal(mu, sigma) T[L,U];

then if the value of y[n] is less than the value of L or greater than the value of U, the sampling statement produces a zero-probability estimate. For user-defined truncation, this zeroing outside of truncation bounds must be handled explicitly.

To avoid variables straying outside of truncation bounds, appropriate constraints are required. For example, if y is a parameter in the above model, the declaration should constrain it to fall between the values of L and U.

parameters {
  real<lower=L, upper=U> y[N];
  ...

If in the above model L or U is a parameter and y is data, then L and U must be appropriately constrained so that all data is in range and the value of L is less than that of U (if they are equal, the parameter range collapses to a single point and the Hamiltonian dynamics used by the sampler break down). The following declarations ensure the bounds are well behaved.

parameters {
  real<upper=min(y)> L;           // L < y[n]
  real<lower=fmax(L, max(y))> U;  // L < U; y[n] < U

Note that for pairs of real numbers, the function fmax is used rather than max.

Unknown Truncation Points

If the truncation points are unknown, they may be estimated as parameters. This can be done with a slight rearrangement of the variable declarations from the model in the previous section with known truncation points.

data {
  int<lower=1> N;
  real y[N];
}
parameters {
  real<upper=min(y)> L;
  real<lower=max(y)> U;
  real mu;
  real<lower=0> sigma;
}
model {
  L ~ ...;
  U ~ ...;
  for (n in 1:N)
    y[n] ~ normal(mu, sigma) T[L,U];
}

Here there is a lower truncation point L, which is declared to be less than or equal to the minimum value of y. The upper truncation point U is declared to be larger than the maximum value of y. This declaration, although dependent on the data, only enforces the constraint that the data fall within the truncation bounds. With N declared as type int<lower=1>, there must be at least one data point. The constraint that L is less than U is enforced indirectly, based on the non-empty data.
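For reference, the truncation T[L,U] renormalizes the density by the probability mass between L and U, and yields zero probability outside the bounds. A Python sketch of the resulting log density calculation (the helper names are hypothetical, not Stan functions):

```python
import math

def normal_cdf(x, mu, sigma):
    """Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def truncated_normal_lpdf(y, mu, sigma, L, U):
    """Log density of Normal(mu, sigma) truncated to [L, U]."""
    if y < L or y > U:
        return -math.inf  # zero probability outside the truncation bounds
    log_mass = math.log(normal_cdf(U, mu, sigma) - normal_cdf(L, mu, sigma))
    return normal_lpdf(y, mu, sigma) - log_mass

# Wide bounds change essentially nothing; tight bounds raise the density.
print(truncated_normal_lpdf(0.0, 0.0, 1.0, -10.0, 10.0))
print(truncated_normal_lpdf(0.0, 0.0, 1.0, -1.0, 1.0))
```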
The ellipses where the priors for the bounds L and U should go should be filled in with an informative prior in order for this model not to concentrate L strongly around min(y) and U strongly around max(y).

12.3. Censored Data

Censoring hides values from points that are too large, too small, or both. Unlike with truncated data, the number of data points that were censored is known. The textbook example is the household scale, which does not report values above 300 pounds.

Estimating Censored Values

One way to model censored data is to treat the censored data as missing data that is constrained to fall in the censored range of values. Since Stan does not allow unknown values in its arrays or matrices, the censored values must be represented explicitly, as in the following right-censored case.

data {
  int<lower=0> N_obs;
  int<lower=0> N_cens;
  real y_obs[N_obs];
  real<lower=max(y_obs)> U;
}
parameters {
  real<lower=U> y_cens[N_cens];
  real mu;
  real<lower=0> sigma;
}
model {
  y_obs ~ normal(mu, sigma);
  y_cens ~ normal(mu, sigma);
}

Because the censored data array y_cens is declared to be a parameter, it will be sampled along with the location and scale parameters mu and sigma. Because the censored data array y_cens is declared to have values of type real<lower=U>, all imputed values for censored data will be greater than U. The imputed censored data affects the location and scale parameters through the last sampling statement in the model.

Integrating out Censored Values

Although it is wrong to ignore the censored values in estimating location and scale, it is not necessary to impute values. Instead, the values can be integrated out. Each censored data point has a probability of

    Pr[y > U] = ∫_U^∞ Normal(y | µ, σ) dy = 1 − Φ((U − µ) / σ),

where Φ() is the unit normal cumulative distribution function.
With M censored observations, the total probability on the log scale is

    log ∏_{m=1}^{M} Pr[y_m > U] = log (1 − Φ((U − µ) / σ))^M = M × normal_lccdf(U | µ, σ),

where normal_lccdf is the log of the complementary CDF (Stan provides an _lccdf for each distribution implemented in Stan).

The following right-censored model assumes that the censoring point is known, so it is declared as data.

data {
  int<lower=0> N_obs;
  int<lower=0> N_cens;
  real y_obs[N_obs];
  real<lower=max(y_obs)> U;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y_obs ~ normal(mu, sigma);
  target += N_cens * normal_lccdf(U | mu, sigma);
}

For the observed values in y_obs, the normal sampling model is used without truncation. The log probability is directly incremented using the calculated log complementary cumulative normal probability of the censored data items.

For left-censored data, the CDF (normal_lcdf) has to be used instead of the complementary CDF. If the censoring point variable (L) is unknown, its declaration should be moved from the data to the parameters block.

data {
  int<lower=0> N_obs;
  int<lower=0> N_cens;
  real y_obs[N_obs];
}
parameters {
  real<upper=min(y_obs)> L;
  real mu;
  real<lower=0> sigma;
}
model {
  L ~ normal(mu, sigma);
  y_obs ~ normal(mu, sigma);
  target += N_cens * normal_lcdf(L | mu, sigma);
}

13. Finite Mixtures

Finite mixture models of an outcome assume that the outcome is drawn from one of several distributions, the identity of which is controlled by a categorical mixing distribution. Mixture models typically have multimodal densities with modes near the modes of the mixture components. Mixture models may be parameterized in several ways, as described in the following sections. Mixture models may be used directly for modeling data with multimodal distributions, or they may be used as priors for other parameters.

13.1. Relation to Clustering

Clustering models, as discussed in Chapter 17, are just a particular class of mixture models that have been widely applied to clustering in the engineering and machine-learning literature.
The normal mixture model discussed in this chapter reappears in multivariate form as the statistical basis for the K-means algorithm; the latent Dirichlet allocation model, usually applied to clustering problems, can be viewed as a mixed-membership multinomial mixture model.

13.2. Latent Discrete Parameterization

One way to parameterize a mixture model is with a latent categorical variable indicating which mixture component was responsible for the outcome. For example, consider K normal distributions with locations µ_k ∈ R and scales σ_k ∈ (0, ∞). Now consider mixing them in proportion λ, where λ_k ≥ 0 and ∑_{k=1}^{K} λ_k = 1 (i.e., λ lies in the unit K-simplex). For each outcome y_n there is a latent variable z_n in {1, ..., K} with a categorical distribution parameterized by λ,

    z_n ~ Categorical(λ).

The variable y_n is distributed according to the parameters of the mixture component z_n,

    y_n ~ Normal(µ_{z[n]}, σ_{z[n]}).

This model is not directly supported by Stan because it involves discrete parameters z_n, but Stan can sample µ and σ by summing out the z parameter as described in the next section.

13.3. Summing out the Responsibility Parameter

To implement the normal mixture model outlined in the previous section in Stan, the discrete parameters can be summed out of the model. If Y is a mixture of K normal distributions with locations µ_k and scales σ_k with mixing proportions λ in the unit K-simplex, then

    p_Y(y | λ, µ, σ) = ∑_{k=1}^{K} λ_k × Normal(y | µ_k, σ_k).

13.4. Log Sum of Exponentials: Linear Sums on the Log Scale

The log sum of exponentials function is used to define mixtures on the log scale. It is defined for two inputs by

    log_sum_exp(a, b) = log(exp(a) + exp(b)).

If a and b are probabilities on the log scale, then exp(a) + exp(b) is their sum on the linear scale, and the outer log converts the result back to the log scale; to summarize, log_sum_exp does linear addition on the log scale.
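A quick numerical check of this definition, as a Python sketch (not one of the manual's example programs):

```python
import math

def log_sum_exp(a, b):
    """Stable log(exp(a) + exp(b)) computed by factoring out the max."""
    c = max(a, b)
    return c + math.log(math.exp(a - c) + math.exp(b - c))

# Agrees with direct addition on the linear scale ...
print(log_sum_exp(math.log(0.3), math.log(0.7)))  # ~ log(0.3 + 0.7) = 0

# ... and still works where exp() alone would underflow to zero.
print(log_sum_exp(-1000.0, -1001.0))  # finite, roughly -999.687
```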
The reason to use Stan's built-in log_sum_exp function is that it can prevent underflow and overflow in the exponentiation, by calculating the result as

    log(exp(a) + exp(b)) = c + log(exp(a − c) + exp(b − c)),

where c = max(a, b). In this evaluation, one of the terms, a − c or b − c, is zero and the other is negative, thus eliminating the possibility of overflow or underflow in the leading term and eking the most arithmetic precision possible out of the operation.

For example, the mixture of Normal(−1, 2) and Normal(3, 1) with mixing proportion λ = (0.3, 0.7)ᵀ can be implemented in Stan as follows.

parameters {
  real y;
}
model {
  target += log_sum_exp(log(0.3) + normal_lpdf(y | -1, 2),
                        log(0.7) + normal_lpdf(y | 3, 1));
}

The log probability term is derived by taking

    log p_Y(y | λ, µ, σ)
      = log(0.3 × Normal(y | −1, 2) + 0.7 × Normal(y | 3, 1))
      = log(exp(log(0.3 × Normal(y | −1, 2))) + exp(log(0.7 × Normal(y | 3, 1))))
      = log_sum_exp(log(0.3) + log Normal(y | −1, 2),
                    log(0.7) + log Normal(y | 3, 1)).

Dropping Uniform Mixture Ratios

If a two-component mixture has a mixing ratio of 0.5, then the mixing ratios can be dropped. In

log_half = log(0.5);
for (n in 1:N)
  target += log_sum_exp(log_half + normal_lpdf(y[n] | mu[1], sigma[1]),
                        log_half + normal_lpdf(y[n] | mu[2], sigma[2]));

the log 0.5 term isn't contributing to the proportional density, and the above can be replaced with the more efficient version

for (n in 1:N)
  target += log_sum_exp(normal_lpdf(y[n] | mu[1], sigma[1]),
                        normal_lpdf(y[n] | mu[2], sigma[2]));

The same result holds if there are K components and the mixing simplex λ is symmetric, i.e.,

    λ = (1/K, ..., 1/K).

The result follows from the identity

    log_sum_exp(c + a, c + b) = c + log_sum_exp(a, b)

and the fact that adding a constant c to the log density accumulator has no effect because the log density is only specified up to an additive constant in the first place.
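This identity is easy to verify numerically; a small Python sketch with arbitrary toy values:

```python
import math

def log_sum_exp(a, b):
    """Stable log(exp(a) + exp(b))."""
    c = max(a, b)
    return c + math.log(math.exp(a - c) + math.exp(b - c))

# Pulling a shared constant out of log_sum_exp shifts the result by
# exactly that constant, so it cannot affect a density defined only up
# to an additive constant on the log scale.
a, b, c = -3.2, -1.7, math.log(0.5)
lhs = log_sum_exp(c + a, c + b)
rhs = c + log_sum_exp(a, b)
print(abs(lhs - rhs) < 1e-12)  # -> True
```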
There is nothing specific to the normal distribution here; constants may always be dropped from the target.

Estimating Parameters of a Mixture

Given the scheme for representing mixtures, it may be moved to an estimation setting, where the locations, scales, and mixture components are unknown. Further generalizing to a number of mixture components specified as data yields the following model.

data {
  int<lower=1> K;            // number of mixture components
  int<lower=1> N;            // number of data points
  real y[N];                 // observations
}
parameters {
  simplex[K] theta;          // mixing proportions
  ordered[K] mu;             // locations of mixture components
  vector<lower=0>[K] sigma;  // scales of mixture components
}
model {
  vector[K] log_theta = log(theta);  // cache log calculation
  sigma ~ lognormal(0, 2);
  mu ~ normal(0, 10);
  for (n in 1:N) {
    vector[K] lps = log_theta;
    for (k in 1:K)
      lps[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
    target += log_sum_exp(lps);
  }
}

The model involves K mixture components and N data points. The mixing proportion parameter theta is declared to be a unit K-simplex, whereas the component location parameter mu and scale parameter sigma are both defined to be K-vectors.

The location parameter mu is declared to be an ordered vector in order to identify the model. This will not affect inferences that do not depend on the ordering of the components as long as the prior for the components mu[k] is symmetric, as it is here (each component has an independent Normal(0, 10) prior). It would even be possible to include a hierarchical prior for the components.

The values in the scale array sigma are constrained to be non-negative, and have a weakly informative prior given in the model chosen to avoid zero values and thus collapsing components.

The model declares a local vector variable lps of size K and uses it to accumulate the log contributions from the mixture components. The main action is in the loop over data points n.
For each such point, the log of θ_k × Normal(y_n | µ_k, σ_k) is calculated and added to the vector lps. Then the log probability is incremented with the log sum of exponentials of those values.

13.5. Vectorizing Mixtures

There is (currently) no way to vectorize mixture models at the observation level in Stan. This section is to warn users away from attempting to vectorize naively, as it results in a different model. A proper mixture at the observation level is defined as follows, where we assume that lambda, y[n], mu[1], mu[2], and sigma[1], sigma[2] are all scalars and lambda is between 0 and 1.

for (n in 1:N) {
  target += log_sum_exp(log(lambda) + normal_lpdf(y[n] | mu[1], sigma[1]),
                        log1m(lambda) + normal_lpdf(y[n] | mu[2], sigma[2]));
}

or equivalently

for (n in 1:N)
  target += log_mix(lambda,
                    normal_lpdf(y[n] | mu[1], sigma[1]),
                    normal_lpdf(y[n] | mu[2], sigma[2]));

This definition assumes that each observation y_n may have arisen from either of the mixture components. The density is

    p(y | λ, µ, σ) = ∏_{n=1}^{N} (λ × Normal(y_n | µ_1, σ_1) + (1 − λ) × Normal(y_n | µ_2, σ_2)).

Contrast the previous model with the following (erroneous) attempt to vectorize the model.

target += log_sum_exp(log(lambda) + normal_lpdf(y | mu[1], sigma[1]),
                      log1m(lambda) + normal_lpdf(y | mu[2], sigma[2]));

or equivalently,

target += log_mix(lambda,
                  normal_lpdf(y | mu[1], sigma[1]),
                  normal_lpdf(y | mu[2], sigma[2]));

This second definition implies that the entire sequence y_1, ..., y_N of observations comes from one component or the other, defining a different density,

    p(y | λ, µ, σ) = λ × ∏_{n=1}^{N} Normal(y_n | µ_1, σ_1) + (1 − λ) × ∏_{n=1}^{N} Normal(y_n | µ_2, σ_2).

13.6. Inferences Supported by Mixtures

In many mixture models, the mixture components are underlyingly exchangeable in the model and thus not identifiable.
This arises if the parameters of the mixture components have exchangeable priors and the mixture ratio gets a uniform prior so that the parameters of the mixture components are also exchangeable in the likelihood.

We have finessed this basic problem by ordering the parameters. This will allow us in some cases to pick out mixture components either ahead of time or after fitting (e.g., male vs. female, or Democrat vs. Republican). In other cases, we do not care about the actual identities of the mixture components and want to consider inferences that are independent of indexes. For example, we might only be interested in posterior predictions for new observations.

Mixtures with Unidentifiable Components

As an example, consider the normal mixture from the previous section, which provides an exchangeable prior on the pairs of parameters (µ_1, σ_1) and (µ_2, σ_2),

    µ_1, µ_2 ~ Normal(0, 10)
    σ_1, σ_2 ~ HalfNormal(0, 10).

The prior on the mixture ratio is uniform,

    λ ~ Uniform(0, 1),

so that with the likelihood

    p(y_n | µ, σ) = λ × Normal(y_n | µ_1, σ_1) + (1 − λ) × Normal(y_n | µ_2, σ_2),

the joint distribution p(y, µ, σ, λ) is exchangeable in the parameters (µ_1, σ_1) and (µ_2, σ_2) with λ flipping to 1 − λ. (Imposing a constraint such as λ < 0.5 will resolve the symmetry, but fundamentally changes the model and its posterior inferences.)

Inference under Label Switching

In cases where the mixture components are not identifiable, it can be difficult to diagnose convergence of sampling or optimization algorithms because the labels will switch, or be permuted, in different MCMC chains or different optimization runs. Luckily, posterior inferences which do not refer to specific component labels are invariant under label switching and may be used directly. This subsection considers a pair of examples.

Predictive Likelihood

The predictive likelihood for a new observation ỹ given the complete parameter vector θ will be

    p(ỹ | y) = ∫ p(ỹ | θ) p(θ | y) dθ.
θ The normal mixture example from the previous section, with θ = (µ, σ , λ), shows that the likelihood returns the same density under label switching and thus the predictive inference is sound. In Stan, that predictive inference can be done either by computing p(ỹ | y), which is more efficient statistically in terms of effective sample size, or simulating draws of ỹ, which is easier to plug into other inferences. Both approaches can be coded directly in the generated quantities block of the program. Here’s an example of the direct (non-sampling) approach. data { int N_tilde; vector[N_tilde] y_tilde; ... generated quantities { vector[N_tilde] log_p_y_tilde; for (n in 1:N_tilde) log_p_y_tilde[n] = log_mix(lambda, normal_lpdf(y_tilde[n] | mu[1], sigma[1]) normal_lpdf(y_tilde[n] | mu[2], sigma[2])); } It is a bit of a bother afterwards, because the logarithm function isn’t linear and hence doesn’t distribute through averages (Jensen’s inequality shows which way the inequality goes). The right thing to do is to apply log_sum_exp of the posterior draws of log_p_y_tilde. The average log predictive density is then given by subtracting log(N_new). Clustering and similarity Often a mixture model will be applied to a clustering problem and there might be two data items yi and yj for which there is a question of whether they arose from the same mixture component. If we take zi and zj to be the component responsibility discrete variables, then the quantity of interest is zi = zj , which can be summarized as an event probability Z Pr[zi = zj | y] = θ P1 k=0 p(zi = P1 k=0 m=0 p(zi P1 k, zj = k, yi , yj | θ) = k, zj = m, yi , yj | θ) 197 p(θ | y) dθ. As with other event probabilities, this can be calculated in the generated quantities block either by sampling zi and zj and using the indicator function on their equality, or by computing the term inside the integral as a generated quantity. 
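The caution about Jensen's inequality can be illustrated with a quick computation. The Python sketch below, with hypothetical posterior draws of log_p_y_tilde for a single new observation, compares the correct log-mean-exp average against the naive average of log densities.

```python
import math

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Hypothetical posterior draws of log_p_y_tilde for one new observation.
log_p_draws = [-1.2, -0.8, -2.5, -1.0]
M = len(log_p_draws)

# Right: average the predictive densities, then take the log.
avg_log_pred = log_sum_exp(log_p_draws) - math.log(M)

# Wrong: averaging the log densities underestimates (Jensen's inequality).
naive = sum(log_p_draws) / M
```

The naive average is always less than or equal to the log of the averaged densities, with equality only when all draws agree.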
As with predictive likelihood, working in expectation is more statistically efficient than sampling.

13.7. Zero-Inflated and Hurdle Models

Zero-inflated and hurdle models both provide mixtures of a Poisson and a Bernoulli probability mass function to allow more flexibility in modeling the probability of a zero outcome. Zero-inflated models, as defined by Lambert (1992), add additional probability mass to the outcome of zero. Hurdle models, on the other hand, are formulated as pure mixtures of zero and non-zero outcomes.

Zero inflation and hurdle models can be formulated for discrete distributions other than the Poisson. Zero inflation does not work for continuous distributions in Stan because of issues with derivatives; in particular, there is no way to add a point mass to a continuous distribution, such as zero-inflating a normal as a regression coefficient prior.

Zero Inflation

Consider the following example of a zero-inflated Poisson distribution. It uses a parameter theta, where θ is the probability of drawing a zero and 1 − θ is the probability of drawing from Poisson(λ) (θ is now being used for the mixing proportion because λ is the traditional notation for a Poisson mean parameter). The probability function is thus

p(yn | θ, λ) = θ + (1 − θ) × Poisson(0 | λ)   if yn = 0, and
p(yn | θ, λ) = (1 − θ) × Poisson(yn | λ)     if yn > 0.

The log probability function can be implemented directly in Stan as follows.

data {
  int<lower=0> N;
  int<lower=0> y[N];
}
parameters {
  real<lower=0, upper=1> theta;
  real<lower=0> lambda;
}
model {
  for (n in 1:N) {
    if (y[n] == 0)
      target += log_sum_exp(bernoulli_lpmf(1 | theta),
                            bernoulli_lpmf(0 | theta)
                              + poisson_lpmf(y[n] | lambda));
    else
      target += bernoulli_lpmf(0 | theta)
                  + poisson_lpmf(y[n] | lambda);
  }
}

The log_sum_exp(lp1, lp2) function adds the log probabilities on the linear scale; it is defined to be equal to log(exp(lp1) + exp(lp2)), but is more arithmetically stable and faster. This could also be written using the conditional operator; see Section 4.6.
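A quick way to check the marginalized form is to verify that the zero-inflated pmf sums to one over its support. The Python sketch below does so for arbitrary values of θ and λ, using the same log-sum-exp trick for the zero case.

```python
import math

def poisson_lpmf(y, lam):
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def zip_lpmf(y, theta, lam):
    # zero-inflated Poisson log pmf with the indicator marginalized out
    if y == 0:
        a = math.log(theta)                             # inflated zero
        b = math.log1p(-theta) + poisson_lpmf(0, lam)   # Poisson zero
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))  # log_sum_exp
    return math.log1p(-theta) + poisson_lpmf(y, lam)

theta, lam = 0.3, 2.0  # made-up values
# total probability over the (effectively truncated) support
total = sum(math.exp(zip_lpmf(y, theta, lam)) for y in range(50))
p_zero = math.exp(zip_lpmf(0, theta, lam))
```

The zero outcome gets probability θ + (1 − θ) e^{−λ}, strictly more than the plain Poisson's e^{−λ}, which is exactly the "inflation."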
Hurdle Models

The hurdle model is similar to the zero-inflated model, but more flexible in that the zero outcomes can be deflated as well as inflated. The probability mass function for the hurdle likelihood is defined by

p(y | θ, λ) = θ   if y = 0, and
p(y | θ, λ) = (1 − θ) × Poisson(y | λ) / (1 − PoissonCDF(0 | λ))   if y > 0,

where PoissonCDF is the cumulative distribution function for the Poisson distribution. The hurdle model is even more straightforward to program in Stan, as it does not require an explicit mixture.

if (y[n] == 0)
  1 ~ bernoulli(theta);
else {
  0 ~ bernoulli(theta);
  y[n] ~ poisson(lambda) T[1, ];
}

The Bernoulli statements are just shorthand for adding log θ and log(1 − θ) to the log density. The T[1,] after the Poisson indicates that it is truncated below at 1; see Section 12.1 for more about truncation and Section 52.5 for the specifics of the Poisson CDF. The net effect is equivalent to the direct definition of the log likelihood.

if (y[n] == 0)
  target += log(theta);
else
  target += log1m(theta) + poisson_lpmf(y[n] | lambda)
              - poisson_lccdf(0 | lambda);

Julian King pointed out that because

log(1 − PoissonCDF(0 | λ)) = log(1 − Poisson(0 | λ)) = log(1 − exp(−λ)),

the CCDF in the else clause can be replaced with a simpler expression.

target += log1m(theta) + poisson_lpmf(y[n] | lambda)
            - log1m_exp(-lambda);

The resulting code is about 15% faster than the code with the CCDF.

This is an example where collecting counts ahead of time can also greatly speed up execution without changing the density. For data size N = 200 and parameters θ = 0.3 and λ = 8, the speedup is a factor of 10; it will be lower for smaller N and greater for larger N; it will also be greater for larger θ.
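King's simplification can be verified numerically. The Python sketch below compares the log CCDF at zero with the stable log1m_exp form for an arbitrary λ; expm1 plays the role of Stan's internal stable computation.

```python
import math

lam = 8.0  # arbitrary Poisson rate
# Poisson(0 | lambda) = exp(-lambda), so PoissonCDF(0 | lambda) = exp(-lambda).
lccdf_at_zero = math.log(1 - math.exp(-lam))   # log(1 - PoissonCDF(0 | lambda))
log1m_exp_val = math.log(-math.expm1(-lam))    # stable form of log1m_exp(-lambda)
```

Both expressions agree to machine precision; the expm1-based form remains accurate for small λ, where exp(−λ) is close to 1.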
To achieve this speedup, it helps to have a function to count the number of zero entries in an array of integers,

functions {
  int num_zero(int[] y) {
    int nz = 0;
    for (n in 1:size(y))
      if (y[n] == 0)
        nz += 1;
    return nz;
  }
}

Then a transformed data block can be used to store the sufficient statistics,

transformed data {
  int N0 = num_zero(y);
  int Ngt0 = N - N0;
  int y_nz[N - num_zero(y)];
  {
    int pos = 1;
    for (n in 1:N) {
      if (y[n] != 0) {
        y_nz[pos] = y[n];
        pos += 1;
      }
    }
  }
}

The model block can then be reduced to three statements.

model {
  N0 ~ binomial(N, theta);
  y_nz ~ poisson(lambda);
  target += -Ngt0 * log1m_exp(-lambda);
}

The first statement accounts for the Bernoulli contribution to both the zero and nonzero counts. The second line is the Poisson contribution from the nonzero counts, which is now vectorized. Finally, the normalization for the truncation is a single line, so that the expression for the log CCDF at 0 isn't repeated. Also note that the negation is applied to the constant Ngt0; whenever possible, leave subexpressions constant, because gradients then need not be propagated until a non-constant term is encountered.

13.8. Priors and Effective Data Size in Mixture Models

Suppose we have a two-component mixture model with mixing rate λ ∈ (0, 1). Because the likelihood for the mixture components is proportionally weighted by the mixture weights, the effective data size used to estimate each of the mixture components will also be weighted as a fraction of the overall data size. Thus although there are N observations, the mixture components will be estimated with effective data sizes of θN and (1 − θ)N for the two components, for some θ ∈ (0, 1). The effective weighting size is determined by posterior responsibility, not simply by the mixing rate λ.
Comparison to Model Averaging

In contrast to mixture models, which create mixtures at the observation level, model averaging creates mixtures over the posteriors of models separately fit to the entire data set. In this situation, the priors work as expected when fitting the models independently, with the posteriors being based on the complete observed data y.

If different models are expected to account for different observations, we recommend building mixture models directly. If the models being mixed are similar, often a single expanded model will capture the features of both and may be used on its own for inferential purposes (estimation, decision making, prediction, etc.). For example, rather than fitting an intercept-only regression and a slope-only regression and averaging their predictions, even as a mixture model, we would recommend building a single regression with both a slope and an intercept. Model complexity, such as having more predictors than data points, can be tamed using appropriately regularizing priors. If computation becomes a bottleneck, the only recourse may be model averaging, which can be calculated after fitting each model independently (see (Hoeting et al., 1999) and (Gelman et al., 2013) for theoretical and computational details).

14. Measurement Error and Meta-Analysis

Most quantities used in statistical models arise from measurements, and most measurements are taken with some error. When the measurement error is small relative to the quantity being measured, its effect on a model is usually small. When measurement error is large relative to the quantity being measured, or when very precise relations can be estimated between measured quantities, it is useful to introduce an explicit model of measurement error. One kind of measurement error is rounding.
Meta-analysis plays out statistically very much like measurement error models, where the inferences drawn from multiple data sets are combined to do inference over all of them. Inferences for each data set are treated as providing a kind of measurement error with respect to true parameter values.

14.1. Bayesian Measurement Error Model

A Bayesian approach to measurement error can be formulated directly by treating the true quantities being measured as missing data (Clayton, 1992; Richardson and Gilks, 1993). This requires a model of how the measurements are derived from the true values.

Regression with Measurement Error

Before considering regression with measurement error, first consider a linear regression model where the observed data for N cases include a predictor xn and outcome yn. In Stan, a linear regression for y based on x with a slope and intercept is modeled as follows.

data {
  int<lower=0> N;  // number of cases
  real x[N];       // predictor (covariate)
  real y[N];       // outcome (variate)
}
parameters {
  real alpha;           // intercept
  real beta;            // slope
  real<lower=0> sigma;  // outcome noise
}
model {
  y ~ normal(alpha + beta * x, sigma);
  alpha ~ normal(0, 10);
  beta ~ normal(0, 10);
  sigma ~ cauchy(0, 5);
}

Now suppose that the true values of the predictors xn are not known, but for each n, a measurement x_n^meas of xn is available. If the error in measurement can be modeled, the measured value x_n^meas can be modeled in terms of the true value xn plus measurement noise. The true value xn is treated as missing data and estimated along with the other quantities in the model. A very simple approach is to assume the measurement error is normal with known deviation τ. This leads to the following regression model with constant measurement error.

data {
  ...
  real x_meas[N];     // measurement of x
  real<lower=0> tau;  // measurement noise
}
parameters {
  real x[N];              // unknown true value
  real mu_x;              // prior location
  real<lower=0> sigma_x;  // prior scale
  ...
}
model {
  x ~ normal(mu_x, sigma_x);  // prior
  x_meas ~ normal(x, tau);    // measurement model
  y ~ normal(alpha + beta * x, sigma);
  ...
}

The regression coefficients alpha and beta and regression noise scale sigma are the same as before, but now x is declared as a parameter rather than as data. The data are now x_meas, which is a measurement of the true x value with noise scale tau. The model then specifies that the measurement error for x_meas[n] given true value x[n] is normal with deviation tau. Furthermore, the true values x are given a hierarchical prior here.

In cases where the measurement errors are not normal, richer measurement error models may be specified. The prior on the true values may also be enriched. For instance, Clayton (1992) introduces an exposure model for the unknown (but noisily measured) risk factors x in terms of known (without measurement error) risk factors c. A simple model would regress xn on the covariates cn with noise term υ,

xn ∼ Normal(γ⊤ cn, υ).

This can be coded in Stan just like any other regression. And, of course, other exposure models can be provided.

Rounding

A common form of measurement error arises from rounding measurements. Rounding may be done in many ways, such as rounding weights to the nearest milligram or to the nearest pound; rounding may even be done by rounding down to the nearest integer. Exercise 3.5(b) from Gelman et al. (2013) provides an example.

3.5. Suppose we weigh an object five times and measure weights, rounded to the nearest pound, of 10, 10, 12, 11, 9. Assume the unrounded measurements are normally distributed with a noninformative prior distribution on µ and σ².
(b) Give the correct posterior distribution for (µ, σ²), treating the measurements as rounded.

Letting zn be the unrounded measurement corresponding to yn, the problem as stated assumes the likelihood

zn ∼ Normal(µ, σ).

The rounding process entails that zn ∈ (yn − 0.5, yn + 0.5).
The probability mass function for the discrete observation yn is then given by marginalizing out the unrounded measurement, producing the likelihood

p(yn | µ, σ) = ∫_{yn − 0.5}^{yn + 0.5} Normal(zn | µ, σ) dzn
             = Φ((yn + 0.5 − µ) / σ) − Φ((yn − 0.5 − µ) / σ).

Gelman's answer for this problem took the noninformative prior to be uniform in the variance σ² on the log scale, which yields (due to the Jacobian adjustment) the prior density

p(µ, σ²) ∝ 1 / σ².

The posterior after observing y = (10, 10, 12, 11, 9) can be calculated by Bayes's rule as

p(µ, σ² | y) ∝ p(µ, σ²) p(y | µ, σ²)
             ∝ (1 / σ²) ∏_{n=1}^{5} [ Φ((yn + 0.5 − µ) / σ) − Φ((yn − 0.5 − µ) / σ) ].

The Stan code simply follows the mathematical definition, providing an example of the direct definition of a probability function up to a proportion.

data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma_sq;
}
transformed parameters {
  real<lower=0> sigma;
  sigma = sqrt(sigma_sq);
}
model {
  target += -2 * log(sigma);
  for (n in 1:N)
    target += log(Phi((y[n] + 0.5 - mu) / sigma)
                  - Phi((y[n] - 0.5 - mu) / sigma));
}

Alternatively, the model may be defined with latent parameters for the unrounded measurements zn. The Stan code in this case uses the likelihood for zn directly while respecting the constraint zn ∈ (yn − 0.5, yn + 0.5). Because Stan does not allow varying upper- and lower-bound constraints on the elements of a vector (or array), the parameters are declared to be the rounding errors y − z, and then z is defined as a transformed parameter.

data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma_sq;
  vector<lower=-0.5, upper=0.5>[N] y_err;
}
transformed parameters {
  real<lower=0> sigma;
  vector[N] z;
  sigma = sqrt(sigma_sq);
  z = y + y_err;
}
model {
  target += -2 * log(sigma);
  z ~ normal(mu, sigma);
}

This explicit model for the unrounded measurements z produces the same posterior for µ and σ as the previous model that marginalizes z out.
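The marginalized rounding likelihood is easy to evaluate directly. The Python sketch below implements the Φ-difference form for the example data with arbitrary test values of µ and σ, and checks that the rounded pmf is a proper pmf over the integers.

```python
import math

def Phi(x):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def rounded_lpmf(y, mu, sigma):
    # log p(y | mu, sigma) for a measurement rounded to the nearest integer
    return math.log(Phi((y + 0.5 - mu) / sigma) - Phi((y - 0.5 - mu) / sigma))

y = [10, 10, 12, 11, 9]
mu, sigma = 10.4, 1.0  # arbitrary test values, not posterior estimates
log_lik = sum(rounded_lpmf(yn, mu, sigma) for yn in y)

# summing the pmf over integers where it has non-negligible mass
total = sum(math.exp(rounded_lpmf(k, mu, sigma)) for k in range(4, 18))
```

Each term is the normal probability mass falling in the half-unit interval around the reported integer, so the terms over all integers sum to one.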
Both approaches mix well, but the latent parameter version is about twice as efficient in terms of effective samples per iteration, as well as providing a posterior for the unrounded parameters.

14.2. Meta-Analysis

Meta-analysis aims to pool the data from several studies, such as the application of a tutoring program in several schools or treatment using a drug in several clinical trials.

The Bayesian framework is particularly convenient for meta-analysis, because each previous study can be treated as providing a noisy measurement of some underlying quantity of interest. The model then follows directly from two components, a prior on the underlying quantities of interest and a measurement-error style model for each of the studies being analyzed.

Treatment Effects in Controlled Studies

Suppose the data in question arise from a total of M studies providing paired binomial data for a treatment and control group. For instance, the data might be post-surgical pain reduction under a treatment of ibuprofen (Warn et al., 2002) or mortality after myocardial infarction under a treatment of beta blockers (Gelman et al., 2013, Section 5.6).

Data

The clinical data consist of J trials, each with n^t treatment cases, n^c control cases, r^t successful outcomes among those treated, and r^c successful outcomes among those in the control group. This data can be declared in Stan as follows.¹

data {
  int<lower=0> J;
  int<lower=0> n_t[J];  // num cases, treatment
  int<lower=0> r_t[J];  // num successes, treatment
  int<lower=0> n_c[J];  // num cases, control
  int<lower=0> r_c[J];  // num successes, control
}

¹ Stan's integer constraints are not powerful enough to express the constraint that r_t[j] ≤ n_t[j], but this constraint could be checked in the transformed data block.

Converting to Log Odds and Standard Error

Although the clinical trial data are binomial in their raw format, they may be transformed to an unbounded scale by considering the log odds ratio

yj = log[ (rjt / (njt − rjt)) / (rjc / (njc − rjc)) ]
   = log(rjt / (njt − rjt)) − log(rjc / (njc − rjc))

and corresponding standard errors

σj = sqrt( 1/rjt + 1/(njt − rjt) + 1/rjc + 1/(njc − rjc) ).

The log odds and standard errors can be defined in a transformed data block, though care must be taken not to use integer division (see Section 40.1).

transformed data {
  real y[J];
  real<lower=0> sigma[J];
  for (j in 1:J)
    y[j] = log(r_t[j]) - log(n_t[j] - r_t[j])
           - (log(r_c[j]) - log(n_c[j] - r_c[j]));
  for (j in 1:J)
    sigma[j] = sqrt(1.0 / r_t[j] + 1.0 / (n_t[j] - r_t[j])
                    + 1.0 / r_c[j] + 1.0 / (n_c[j] - r_c[j]));
}

This definition will be problematic if any of the success counts is zero or equal to the number of trials. If that arises, a direct binomial model will be required, or other transforms than the unregularized sample log odds must be used.

Non-Hierarchical Model

With the transformed data in hand, two standard forms of meta-analysis can be applied. The first is a so-called "fixed effects" model, which assumes a single parameter for the global odds ratio. This model is coded in Stan as follows.

parameters {
  real theta;  // global treatment effect, log odds
}
model {
  y ~ normal(theta, sigma);
}

The sampling statement for y is vectorized; it has the same effect as the following.

for (j in 1:J)
  y[j] ~ normal(theta, sigma[j]);

It is common to include a prior for theta in this model, but it is not strictly necessary for the model to be proper because y is fixed and Normal(y | µ, σ) = Normal(µ | y, σ).

Hierarchical Model

To model so-called "random effects," where the treatment effect may vary by clinical trial, a hierarchical model can be used. The parameters include per-trial treatment effects and the hierarchical prior parameters, which will be estimated along with other unknown quantities.
parameters {
  real theta[J];      // per-trial treatment effect
  real mu;            // mean treatment effect
  real<lower=0> tau;  // deviation of treatment effects
}
model {
  y ~ normal(theta, sigma);
  theta ~ normal(mu, tau);
  mu ~ normal(0, 10);
  tau ~ cauchy(0, 5);
}

Although the vectorized sampling statement for y appears unchanged, the parameter theta is now a vector. The sampling statement for theta is also vectorized, with the hyperparameters mu and tau themselves being given wide priors compared to the scale of the data.

Rubin (1981) provided a hierarchical Bayesian meta-analysis of the treatment effect of Scholastic Aptitude Test (SAT) coaching in eight schools based on the sample treatment effect and standard error in each school.²

Extensions and Alternatives

Smith et al. (1995) and Gelman et al. (2013, Section 19.4) provide meta-analyses based directly on binomial data. Warn et al. (2002) consider the modeling implications of using alternatives to the log-odds ratio in transforming the binomial data. If trial-specific predictors are available, these can be included directly in a regression model for the per-trial treatment effects θj.

² The model provided for this data in (Gelman et al., 2013, Section 5.5) is included with the data in the Stan example model repository, http://mc-stan.org/documentation.

15. Latent Discrete Parameters

Stan does not support sampling discrete parameters, so it is not possible to directly translate BUGS or JAGS models with discrete parameters (i.e., discrete stochastic nodes). Nevertheless, it is possible to code many models that involve bounded discrete parameters by marginalizing out the discrete parameters.¹ This chapter shows how to code several widely used models involving latent discrete parameters. Chapter 17, on clustering models, considers further models involving latent discrete parameters.

15.1.
The Benefits of Marginalization

Although it requires some algebra on the joint probability function, a pleasant byproduct of the required calculations is the posterior expectation of the marginalized variable, which is often the quantity of interest for a model. Marginalization allows far greater exploration of the tails of the distribution, as well as more efficient sampling on an iteration-by-iteration basis, because the expectation over all possible values is used rather than being estimated by sampling a discrete parameter.

Standard optimization algorithms, including expectation maximization (EM), are often provided in applied statistics papers to describe maximum likelihood estimation algorithms. Such derivations provide exactly the marginalization needed for coding the model in Stan.¹

15.2. Change Point Models

The first example is a model of coal mining disasters in the U.K. for the years 1851–1962.²

Model with Latent Discrete Parameter

(Fonnesbeck et al., 2013, Section 3.1) provide a Poisson model of disaster rate Dt in year t with two rate parameters, an early rate (e) and a late rate (l), that change at a given point in time s. The full model expressed using a latent discrete parameter s is

e ∼ Exponential(re)
l ∼ Exponential(rl)
s ∼ Uniform(1, T)
Dt ∼ Poisson(t < s ? e : l)

The last line uses the conditional operator (also known as the ternary operator), which is borrowed from C and related languages. The conditional operator has the same behavior as the ifelse function in R, but uses a more compact notation, separating its three arguments by a question mark (?) and a colon (:). It is defined by

c ? x1 : x2 = x1 if c is true (i.e., non-zero), and
c ? x1 : x2 = x2 if c is false (i.e., zero).

¹ The computations are similar to those involved in expectation maximization (EM) algorithms (Dempster et al., 1977).
² The original source of the data is (Jarrett, 1979), which itself is a note correcting an earlier data collection.
As of version 2.10, Stan supports the conditional operator.

Marginalizing out the Discrete Parameter

To code this model in Stan, the discrete parameter s must be marginalized out to produce a model defining the log of the probability function p(e, l, D). The full joint probability factors as

p(e, l, s, D) = p(e) p(l) p(s) p(D | s, e, l)
              = Exponential(e | re) Exponential(l | rl) Uniform(s | 1, T) ∏_{t=1}^{T} Poisson(Dt | t < s ? e : l).

To marginalize, an alternative factorization into prior and likelihood is used,

p(e, l, D) = p(e, l) p(D | e, l),

where the likelihood is defined by marginalizing s out as

p(D | e, l) = Σ_{s=1}^{T} p(s, D | e, l)
            = Σ_{s=1}^{T} p(s) p(D | s, e, l)
            = Σ_{s=1}^{T} Uniform(s | 1, T) ∏_{t=1}^{T} Poisson(Dt | t < s ? e : l).

Stan operates on the log scale and thus requires the log likelihood,

log p(D | e, l) = log_sum_exp_{s=1}^{T} ( log Uniform(s | 1, T) + Σ_{t=1}^{T} log Poisson(Dt | t < s ? e : l) ),

where the log sum of exponentials function is defined by

log_sum_exp_{n=1}^{N} αn = log Σ_{n=1}^{N} exp(αn).

The log sum of exponentials function allows the model to be coded directly in Stan using the built-in function log_sum_exp, which provides both arithmetic stability and efficiency for mixture model calculations.

Coding the Model in Stan

The Stan program for the change point model is shown in Figure 15.1. The transformed parameter lp[s] stores the quantity log p(s, D | e, l).

Although the model in Figure 15.1 is easy to understand, the doubly nested loop used for s and t is quadratic in T. Luke Wiklendt pointed out that a linear alternative can be achieved by the use of dynamic programming similar to the forward-backward algorithm for hidden Markov models; he submitted a slight variant of the following code to replace the transformed parameters block of the above Stan program.
transformed parameters {
  vector[T] lp;
  {
    vector[T + 1] lp_e;
    vector[T + 1] lp_l;
    lp_e[1] = 0;
    lp_l[1] = 0;
    for (t in 1:T) {
      lp_e[t + 1] = lp_e[t] + poisson_lpmf(D[t] | e);
      lp_l[t + 1] = lp_l[t] + poisson_lpmf(D[t] | l);
    }
    lp = rep_vector(log_unif + lp_l[T + 1], T)
         + head(lp_e, T) - head(lp_l, T);
  }
}

As should be obvious from looking at it, this version has linear complexity in T rather than quadratic. The result for the mining-disaster data is about 20 times faster; the improvement will be greater for larger T.

data {
  real<lower=0> r_e;
  real<lower=0> r_l;
  int<lower=1> T;
  int<lower=0> D[T];
}
transformed data {
  real log_unif;
  log_unif = -log(T);
}
parameters {
  real<lower=0> e;
  real<lower=0> l;
}
transformed parameters {
  vector[T] lp;
  lp = rep_vector(log_unif, T);
  for (s in 1:T)
    for (t in 1:T)
      lp[s] = lp[s] + poisson_lpmf(D[t] | t < s ? e : l);
}
model {
  e ~ exponential(r_e);
  l ~ exponential(r_l);
  target += log_sum_exp(lp);
}

Figure 15.1: A change point model in which disaster rates D[t] have one rate, e, before the change point and a different rate, l, after the change point. The change point itself, s, is marginalized out as described in the text.

The key to understanding Wiklendt's dynamic programming version is to see that head(lp_e, T) holds the forward values, whereas lp_l[T + 1] - head(lp_l, T) holds the backward values; the clever use of subtraction allows lp_l to be accumulated naturally in the forward direction.

Fitting the Model with MCMC

This model is easy to fit using MCMC with NUTS in its default configuration. Convergence is very fast and sampling produces roughly one effective sample every two iterations. Because it is a relatively small model (the inner double loop over time is roughly 20,000 steps), it is very fast.

The value of lp for each iteration for each change point is available because it is declared as a transformed parameter. If the value of lp were not of interest, it could be coded as a local variable in the model block and thus avoid the I/O overhead of saving values every iteration.
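The equivalence of the quadratic and linear formulations can be checked outside of Stan. The Python sketch below computes lp both ways for a toy data set; the disaster counts and rates are made up for illustration.

```python
import math

def poisson_lpmf(y, lam):
    return y * math.log(lam) - lam - math.lgamma(y + 1)

# Toy disaster counts and rates (made-up values).
D = [4, 5, 4, 1, 0, 1, 2]
T = len(D)
e, l = 3.0, 1.0
log_unif = -math.log(T)

# Quadratic version: lp[s] = log p(s, D | e, l) via a double loop
# (0-based s, t here correspond to Stan's 1-based s, t).
lp_quad = [log_unif
           + sum(poisson_lpmf(D[t], e if t < s else l) for t in range(T))
           for s in range(T)]

# Linear dynamic-programming version: cumulative sums of log pmfs,
# mirroring the lp_e / lp_l accumulators in the Stan code.
lp_e = [0.0]
lp_l = [0.0]
for t in range(T):
    lp_e.append(lp_e[-1] + poisson_lpmf(D[t], e))
    lp_l.append(lp_l[-1] + poisson_lpmf(D[t], l))
lp_lin = [log_unif + lp_l[T] + lp_e[s] - lp_l[s] for s in range(T)]

max_diff = max(abs(a - b) for a, b in zip(lp_quad, lp_lin))
```

The subtraction lp_l[T] − lp_l[s] recovers the "backward" late-rate contribution from forward cumulative sums, which is why a single forward pass suffices.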
Posterior Distribution of the Discrete Change Point

The value of lp[s] in a given iteration is given by log p(s, D | e, l) for that iteration's values of the early and late rates, e and l. In each iteration after convergence, the early and late disaster rates, e and l, are drawn from the posterior p(e, l | D) by MCMC sampling and the associated lp calculated. The value of lp may be normalized to calculate p(s | e, l, D) in each iteration, based on the current values of e and l. Averaging over iterations provides an unnormalized probability estimate of the change point being s (see below for the normalizing constant),

p(s | D) ∝ q(s | D) = (1 / M) Σ_{m=1}^{M} exp(lp[m, s]),

where lp[m, s] represents the value of lp in posterior draw m for change point s. By averaging over draws, e and l are themselves marginalized out, and the result has no dependence on a given iteration's value for e and l. A final normalization then produces the quantity of interest, the posterior probability of the change point being s conditioned on the data D,

p(s | D) = q(s | D) / Σ_{s′=1}^{T} q(s′ | D).

A plot of the values of log p(s | D) computed using Stan 2.4's default MCMC implementation is shown in Figure 15.2.

Discrete Sampling

The generated quantities block may be used to draw discrete parameter values using the built-in pseudo-random number generators. For example, with lp defined as above, the following program draws a random value for s at every iteration.

generated quantities {
  int s;
  s = categorical_logit_rng(lp);
}

Figure 15.2: The posterior estimates for the change point.
Left) log probability of the change point being in each year, calculated analytically using lp; Right) frequency of change point draws in the posterior generated using lp. The plot on the left is on the log scale and the plot on the right on the linear scale; note the narrower range of years in the right-hand plot resulting from sampling.

The posterior mean of s is roughly 1891. A posterior histogram of draws for s is shown on the right side of Figure 15.2.

Compared to working in terms of expectations, discrete sampling is highly inefficient, especially for tails of distributions, so this approach should only be used if draws from a distribution are explicitly required. Otherwise, expectations should be computed in the generated quantities block based on the posterior distribution for s given by softmax(lp).

Posterior Covariance

The discrete sample generated for s can be used to calculate covariance with other parameters. Although the sampling approach is straightforward, it is more statistically efficient (in the sense of requiring far fewer iterations for the same degree of accuracy) to calculate these covariances in expectation using lp.

Multiple Change Points

There is no obstacle in principle to allowing multiple change points. The only issue is that computation increases from linear to quadratic in marginalizing out two change points, cubic for three change points, and so on. With two change points there are three rate parameters, e, m, and l, two loops for the change points and then one over time, with log densities being stored in a matrix.

matrix[T, T] lp;
lp = rep_matrix(log_unif, T, T);
for (s1 in 1:T)
  for (s2 in 1:T)
    for (t in 1:T)
      lp[s1, s2] = lp[s1, s2]
                   + poisson_lpmf(D[t] | t < s1 ? e : (t < s2 ? m : l));

The matrix can then be converted back to a vector using to_vector before being passed to log_sum_exp.

15.3. Mark-Recapture Models

A widely applied field method in ecology is to capture (or sight) animals, mark them (e.g., by tagging), then release them.
This process is then repeated one or more times, often on an ongoing basis for a given population. The resulting data may be used to estimate population size.

The first subsection describes a very simple mark-recapture model that does not involve any latent discrete parameters. The following subsections describe the Cormack-Jolly-Seber model, which involves latent discrete parameters for animal death.

Simple Mark-Recapture Model

In the simplest case, a one-stage mark-recapture study produces the following data:

• M : number of animals marked in the first capture,
• C : number of animals in the second capture, and
• R : number of marked animals in the second capture.

The estimand of interest is

• N : number of animals in the population.

Despite the notation, the model will take N to be a continuous parameter; just because the population must be finite doesn't mean the parameter representing it must be. The parameter will be used to produce a real-valued estimate of the population size.

The Lincoln-Petersen (Lincoln, 1930; Petersen, 1896) method estimates the population size as

N̂ = MC / R.

data {
  int<lower=0> M;
  int<lower=0> C;
  int<lower=0> R;
}
parameters {
  real<lower=(C - R + M)> N;
}
model {
  R ~ binomial(C, M / N);
}

Figure 15.3: A probabilistic formulation of the Lincoln-Petersen estimator for population size based on data from a one-stage mark-recapture study. The lower bound on N is necessary to efficiently eliminate impossible values.

This population estimate would arise from a probabilistic model in which the number of recaptured animals is distributed binomially,

R ∼ Binomial(C, M/N),

given the total number of animals captured in the second round (C) with a recapture probability of M/N, the fraction of the total population N marked in the first round.

The probabilistic variant of the Lincoln-Petersen estimator can be directly coded in Stan as shown in Figure 15.3. The Lincoln-Petersen estimate is the maximum likelihood estimate (MLE) for this model.
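The claim that the Lincoln-Petersen estimate maximizes this binomial likelihood can be checked numerically. The Python sketch below evaluates the log likelihood on an integer grid of population sizes for made-up counts.

```python
import math

def binom_lpmf(k, n, p):
    # binomial log pmf via log-gamma
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

# Made-up counts: M marked, C captured in round two, R recaptured.
M, C, R = 100, 80, 20
N_hat = M * C / R   # Lincoln-Petersen estimate

# The estimate should maximize the likelihood R ~ Binomial(C, M / N);
# search an integer grid above the lower bound C - R + M.
lower = C - R + M
grid = list(range(lower + 1, lower + 601))
log_liks = [binom_lpmf(R, C, M / N) for N in grid]
N_mle = grid[log_liks.index(max(log_liks))]
```

The binomial likelihood in the success probability p peaks at p = R/C, and M/N̂ = R/C exactly, so the grid search lands on the Lincoln-Petersen value.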
To ensure the MLE is the Lincoln-Petersen estimate, an improper uniform prior for N is used; this could (and should) be replaced with a more informative prior where knowledge of the population under study is available.

The one tricky part of the model is the lower bound C − R + M placed on the population size N. Values below this bound are impossible because the M marked animals plus the C − R unmarked animals in the second capture must all be members of the population. Implementing this lower bound is necessary so that sampling and optimization can be carried out in an unconstrained manner, with unbounded support for parameters on the transformed (unconstrained) space. The lower bound in the declaration of N implies a variable transform f : (C − R + M, ∞) → (−∞, +∞) defined by f(N) = log(N − (C − R + M)); see Section 35.2 for more information on the transform used for variables declared with a lower bound.

Cormack-Jolly-Seber with Discrete Parameter

The Cormack-Jolly-Seber (CJS) model (Cormack, 1964; Jolly, 1965; Seber, 1965) is an open-population model in which the population may change over time due to death; the presentation here draws heavily on Schofield (2007). The basic data are
• I : number of individuals,
• T : number of capture periods, and
• y_{i,t} : boolean indicating whether individual i was captured at time t.
Each individual is assumed to have been captured at least once, because an individual contributes information only after its first capture. There are two Bernoulli parameters in the model,
• φ_t : probability that an animal alive at time t survives until t + 1, and
• p_t : probability that an animal alive at time t is captured at time t.
These parameters will both be given uniform priors here, but information should be used to tighten these priors in practice. The CJS model also employs a latent discrete parameter z_{i,t} indicating for each individual i whether it is alive at time t, distributed as

  z_{i,t} ∼ Bernoulli(z_{i,t−1} ? φ_{t−1} : 0).
The conditional prevents the model from positing zombies; once an animal is dead, it stays dead. The data distribution is then simple to express conditional on z as

  y_{i,t} ∼ Bernoulli(z_{i,t} ? p_t : 0).

The conditional enforces the constraint that dead animals cannot be captured.

Collective Cormack-Jolly-Seber Model

This subsection presents an implementation of the model in terms of counts of the different capture-history profiles for individuals over three capture times. It assumes exchangeability of the animals, in that each is assigned the same capture and survival probabilities. To ease the marginalization of the latent discrete parameter z_{i,t}, the Stan models rely on a derived quantity χ_t, the probability that an individual alive at time t is never captured again (if it is dead, the recapture probability is zero). This quantity is defined recursively by

  χ_t = 1                                        if t = T
  χ_t = (1 − φ_t) + φ_t (1 − p_{t+1}) χ_{t+1}    if t < T

The base case arises because if an animal is alive in the last time period, the probability it is never captured again is 1, as there are no more capture periods. The recursive case defining χ_t in terms of χ_{t+1} involves two possibilities: (1) not surviving to the next time period, with probability (1 − φ_t); or (2) surviving to the next time period with probability φ_t, not being captured in the next time period with probability (1 − p_{t+1}), and not being captured again after being alive in period t + 1, with probability χ_{t+1}.

With three capture times, there are eight captured/not-captured profiles an individual may have. These may be naturally coded as binary numbers as follows.

  profile   captures (1 2 3)   probability
     0          – – –          n/a
     1          – – +          n/a
     2          – + –          χ_2
     3          – + +          φ_2 p_3
     4          + – –          χ_1
     5          + – +          φ_1 (1 − p_2) φ_2 p_3
     6          + + –          φ_1 p_2 χ_2
     7          + + +          φ_1 p_2 φ_2 p_3

History 0, for animals that are never captured, is unobservable, because only animals that are captured are observed.
History 1, for animals that are captured only in the last round, provides no information for the CJS model, because capture/non-capture status is only informative when conditioned on earlier captures. For the remaining cases, the contribution to the likelihood is provided in the final column. By defining these probabilities in terms of χ directly, there is no need for a latent binary parameter indicating whether an animal is alive at time t. This definition of χ is the one typically used to define the likelihood (i.e., to marginalize out the latent discrete parameter) for the CJS model (Schofield, 2007, page 9).

The Stan model defines χ as a transformed parameter based on the parameters φ and p. In the model block, the log probability is incremented for each history based on its count. This second step is similar to collecting Bernoulli observations into a binomial or categorical observations into a multinomial, only here it is coded directly in the Stan program using target += rather than being part of a built-in probability function.

  data {
    int<lower=0> history[7];
  }
  parameters {
    real<lower=0, upper=1> phi[2];
    real<lower=0, upper=1> p[3];
  }
  transformed parameters {
    real<lower=0, upper=1> chi[2];
    chi[2] = (1 - phi[2]) + phi[2] * (1 - p[3]);
    chi[1] = (1 - phi[1]) + phi[1] * (1 - p[2]) * chi[2];
  }
  model {
    target += history[2] * log(chi[2]);
    target += history[3] * (log(phi[2]) + log(p[3]));
    target += history[4] * log(chi[1]);
    target += history[5]
      * (log(phi[1]) + log1m(p[2]) + log(phi[2]) + log(p[3]));
    target += history[6] * (log(phi[1]) + log(p[2]) + log(chi[2]));
    target += history[7]
      * (log(phi[1]) + log(p[2]) + log(phi[2]) + log(p[3]));
  }
  generated quantities {
    real<lower=0, upper=1> beta3;
    beta3 = phi[2] * p[3];
  }

Figure 15.4: A Stan program for the Cormack-Jolly-Seber mark-recapture model based on counts of individuals with each possible observation history over three capture periods.
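As a sanity check on the profile probabilities (a Python sketch with arbitrary parameter values, not part of the manual), the χ recursion can be unrolled for any T, and conditional on an animal being alive at its first capture, the probabilities of its possible subsequent histories must sum to one:

```python
def prob_uncaptured(phi, p):
    """chi[t] (0-based): probability an animal alive at time t+1 is never
    captured again; phi has length T-1, p has length T."""
    T = len(p)
    chi = [0.0] * T
    chi[T - 1] = 1.0                    # base case: no capture periods remain
    for t in range(T - 2, -1, -1):      # recursive case, from T-1 down to 1
        chi[t] = (1 - phi[t]) + phi[t] * (1 - p[t + 1]) * chi[t + 1]
    return chi

phi = [0.9, 0.8]        # survival probabilities (arbitrary test values)
p = [0.5, 0.6, 0.7]     # capture probabilities
chi = prob_uncaptured(phi, p)

# Conditional on first capture at time 1, profiles 4-7 exhaust the
# possible continuations, so their probabilities must sum to one:
total = (chi[0]                                  # profile 4: + - -
         + phi[0] * (1 - p[1]) * phi[1] * p[2]   # profile 5: + - +
         + phi[0] * p[1] * chi[1]                # profile 6: + + -
         + phi[0] * p[1] * phi[1] * p[2])        # profile 7: + + +
```

The same check applies to profiles 2 and 3 for animals first captured at time 2, since χ_2 + φ_2 p_3 = 1.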
Identifiability

The parameters φ_2 and p_3, the probability of surviving from time 2 to time 3 and the probability of capture at time 3, are not identifiable, because both may be used to account for the lack of a capture at time 3. Their product, β_3 = φ_2 p_3, is identified. The Stan model defines beta3 as a generated quantity. Unidentified parameters pose a problem for the adaptation carried out by Stan's samplers. Although the problem posed for adaptation is mild here, because the parameters are bounded and thus have proper uniform priors, it would be better to formulate an identified parameterization. One way to do this would be to formulate a hierarchical model for the p and φ parameters.

Individual Cormack-Jolly-Seber Model

This section presents a version of the Cormack-Jolly-Seber (CJS) model cast at the individual level rather than collectively as in the previous subsection. It also extends the model to allow an arbitrary number of time periods. The data consist of the number T of capture events, the number I of individuals, and a boolean flag y_{i,t} indicating whether individual i was observed at time t. In Stan,

  data {
    int<lower=2> T;
    int<lower=0> I;
    int<lower=0, upper=1> y[I, T];
  }

The advantage of the individual-level model is that it becomes possible to add individual "random effects" affecting survival or capture probability, as well as to avoid the combinatorics involved in unfolding the 2^T observation histories for T capture times.

Utility Functions

The individual CJS model involves several function definitions. The first two are used in the transformed data block to compute the first and last time periods in which an animal was captured.3

  functions {
    int first_capture(int[] y_i) {
      for (k in 1:size(y_i))
        if (y_i[k])
          return k;
      return 0;
    }
    int last_capture(int[] y_i) {
      for (k_rev in 0:(size(y_i) - 1)) {
        int k;
        k = size(y_i) - k_rev;
        if (y_i[k])
          return k;
      }
      return 0;
    }

3 An alternative would be to compute this on the outside and feed it into the Stan model as preprocessed data.
Yet another alternative encoding would be a sparse one, recording only the capture events along with their times and the identity of the individual captured.

    ...
  }

These two functions are used to define the first and last capture time for each individual in the transformed data block.4

  transformed data {
    int<lower=0, upper=T> first[I];
    int<lower=0, upper=T> last[I];
    vector<lower=0, upper=I>[T] n_captured;
    for (i in 1:I)
      first[i] = first_capture(y[i]);
    for (i in 1:I)
      last[i] = last_capture(y[i]);
    n_captured = rep_vector(0, T);
    for (t in 1:T)
      for (i in 1:I)
        if (y[i, t])
          n_captured[t] = n_captured[t] + 1;
  }

The transformed data block also defines n_captured[t], the total number of captures at time t. The variable n_captured is declared as a vector instead of an integer array so that it can be used in an elementwise vector operation in the generated quantities block to model the population estimates at each time point.

The parameters and transformed parameters are as before, but now there is a function definition for computing the entire vector chi, the probability that an individual alive at time t is never captured again.

  parameters {
    vector<lower=0, upper=1>[T - 1] phi;
    vector<lower=0, upper=1>[T] p;
  }
  transformed parameters {
    vector<lower=0, upper=1>[T] chi;
    chi = prob_uncaptured(T, p, phi);
  }

The definition of prob_uncaptured, from the functions block, is

4 Both functions return 0 if the individual represented by the input array was never captured. Individuals with no captures are not relevant for estimating the model because all probability statements are conditional on earlier captures. Typically they would be removed from the data, but the program allows them to be included even though they make no contribution to the log probability function.
  functions {
    ...
    vector prob_uncaptured(int T, vector p, vector phi) {
      vector[T] chi;
      chi[T] = 1.0;
      for (t in 1:(T - 1)) {
        int t_curr;
        int t_next;
        t_curr = T - t;
        t_next = t_curr + 1;
        chi[t_curr] = (1 - phi[t_curr])
          + phi[t_curr] * (1 - p[t_next]) * chi[t_next];
      }
      return chi;
    }
  }

The function definition directly follows the mathematical definition of χ_t, unrolling the recursion into an iteration and defining the elements of chi from T down to 1.

The Model

Given the precomputed quantities, the model block directly encodes the CJS model's log likelihood function. All parameters are left with their default uniform priors, and the model simply encodes the log probability of the observations y given the parameters p and phi, as well as the transformed parameter chi defined in terms of p and phi.

  model {
    for (i in 1:I) {
      if (first[i] > 0) {
        for (t in (first[i] + 1):last[i]) {
          1 ~ bernoulli(phi[t - 1]);
          y[i, t] ~ bernoulli(p[t]);
        }
        1 ~ bernoulli(chi[last[i]]);
      }
    }
  }

The outer loop is over individuals, conditionally skipping individuals i that are never captured. The never-captured check depends on the convention of the first-capture and last-capture functions returning 0 if an individual is never captured.

The inner loop for individual i first increments the log probability based on the survival of the individual, with probability phi[t-1]. The outcome of 1 is fixed because the individual must survive between the first and last capture (i.e., no zombies). Note that the loop starts after the first capture, because all information in the CJS model is conditional on the first capture. The inner loop then treats the observed capture status y[i, t] for individual i at time t as a Bernoulli outcome with capture probability p[t]. After the inner loop, the probability of the animal never being seen again after being observed at time last[i] is included, because last[i] was defined to be the last time period in which animal i was observed.
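The contribution of a single capture history to the log likelihood can be sketched in Python (not part of the manual; indices follow the 1-based times of the Stan code, with 0-based Python lists, and arbitrary parameter values):

```python
import math

def cjs_individual_loglik(y_i, phi, p, chi, first, last):
    """Log-likelihood contribution of one capture history, mirroring
    the model block's inner loop (times 1-based, lists 0-based)."""
    if first == 0:                      # never captured: no contribution
        return 0.0
    ll = 0.0
    for t in range(first + 1, last + 1):
        ll += math.log(phi[t - 2])      # 1 ~ bernoulli(phi[t-1]): survival
        ll += math.log(p[t - 1]) if y_i[t - 1] else math.log(1 - p[t - 1])
    ll += math.log(chi[last - 1])       # 1 ~ bernoulli(chi[last])
    return ll

phi = [0.9, 0.8]
p = [0.5, 0.6, 0.7]
chi = [0.0, 0.0, 1.0]                   # chi computed by the recursion
chi[1] = (1 - phi[1]) + phi[1] * (1 - p[2])
chi[0] = (1 - phi[0]) + phi[0] * (1 - p[1]) * chi[1]

# Captured at times 1 and 3, missed at time 2:
ll = cjs_individual_loglik([1, 0, 1], phi, p, chi, first=1, last=3)
```

For this history the likelihood is φ_1 (1 − p_2) φ_2 p_3, matching profile 5 in the collective model's table.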
Identified Parameters

As with the collective model described in the previous subsection, this model does not identify phi[T-1] and p[T], but does identify their product, beta. Thus beta is defined as a generated quantity to monitor convergence and report.

  generated quantities {
    real beta;
    ...
    beta = phi[T - 1] * p[T];
    ...
  }

The parameter p[1] is also not modeled and will just be uniform between 0 and 1. A more finely articulated model might have a hierarchical or time-series component, in which case p[1] would be an unknown initial condition and both phi[T-1] and p[T] could be identified.

Population Size Estimates

The generated quantities block also calculates an estimate of the population size at each time t in the same way as in the simple mark-recapture model, as the number of individuals captured at time t divided by the probability of capture at time t. This is done with the elementwise division operation for vectors (./) in the generated quantities block. The first entry, pop[1], is set to -1 because p[1] is not identified, so the corresponding estimate is meaningless.

  generated quantities {
    ...
    vector[T] pop;
    ...
    pop = n_captured ./ p;
    pop[1] = -1;
  }

Generalizing to Individual Effects

All individuals are modeled as having the same capture probability, but this model could easily be generalized to a logistic regression based on individual-level predictors of survival or capture probability.

15.4. Data Coding and Diagnostic Accuracy Models

Although they are seemingly disparate tasks, the rating/coding/annotation of items with categories and diagnostic testing for disease or other conditions share several characteristics which allow their statistical properties to be modeled similarly.

Diagnostic Accuracy

Suppose you have diagnostic tests for a condition, each of varying sensitivity and specificity. Sensitivity is the probability a test returns positive when the patient has the condition, and specificity is the probability that a test returns negative when the patient does not have the condition. For example, mammograms and puncture biopsy tests both test for the presence of breast cancer.
Mammograms have high sensitivity and low specificity, meaning lots of false positives, whereas puncture biopsies are the opposite, with low sensitivity and high specificity, meaning lots of false negatives.

There are several estimands of interest in such studies. An epidemiological study may be interested in the prevalence of a kind of infection, such as malaria, in a population. A test-development study might be interested in the diagnostic accuracy of a new test. A health care worker performing tests might be interested in the disease status of a particular patient.

Data Coding

Humans are often given the task of coding (equivalently rating or annotating) data. For example, journal or grant reviewers rate submissions, a political study may code campaign commercials as to whether they are attack ads or not, a natural language processing study might annotate tweets as to whether they are positive or negative in overall sentiment, and a dentist looking at an X-ray may classify a patient as having a cavity or not. In all of these cases, the data coders play the role of the diagnostic tests, and all of the same estimands are in play: data coder accuracy and bias, the true categories of the items being coded, and the prevalence of the various categories of items in the data.

Noisy Categorical Measurement Model

In this section, only categorical ratings are considered, and the challenge in the modeling for Stan is to marginalize out the discrete parameters. Dawid and Skene (1979) introduced a noisy-measurement model for data coding and applied it in the epidemiological setting of coding what doctors' notes say about patient histories; the same model can be used for diagnostic procedures.

Data

The data for the model consist of J raters (diagnostic tests), I items (patients), and K categories (condition statuses) to annotate, with y_{i,j} ∈ 1:K being the rating provided by rater j for item i.
In a diagnostic test setting for a particular condition, the raters are diagnostic procedures and often K = 2, with the values signaling the presence or absence of the condition.5 It is relatively straightforward to extend Dawid and Skene's model to deal with the situation where not every rater rates each item exactly once.

Model Parameters

The model is based on three parameters, the first of which is discrete:
• z_i : a value in 1:K indicating the true category of item i,
• π : a K-simplex for the prevalence of the K categories in the population, and
• θ_{j,k} : a K-simplex for the response of annotator j to an item of true category k.

Noisy Measurement Model

The true category of an item is assumed to be generated by a simple categorical distribution based on the item prevalences,

  z_i ∼ Categorical(π).

The rating y_{i,j} provided for item i by rater j is modeled as a categorical response of rater j to an item of category z_i,6

  y_{i,j} ∼ Categorical(θ_{j,z[i]}).

5 Diagnostic procedures are often ordinal, as in stages of cancer in oncological diagnosis or the severity of a cavity in dental diagnosis. Dawid and Skene's model may be used as is, or naturally generalized for ordinal ratings using a latent continuous rating and cutpoints as in ordinal logistic regression.
6 In the subscript, z_i is written as z[i] to improve legibility.

Priors and Hierarchical Modeling

Dawid and Skene provided maximum likelihood estimates for θ and π, which allows them to generate probability estimates for each z_i. To mimic Dawid and Skene's maximum likelihood model, the parameters θ_{j,k} and π can be given uniform priors over K-simplexes. It is straightforward to generalize to Dirichlet priors,

  π ∼ Dirichlet(α)

and

  θ_{j,k} ∼ Dirichlet(β_k),

with fixed hyperparameters α (a vector) and β (a matrix or array of vectors). The prior for θ_{j,k} must be allowed to vary in k, so that, for instance, β_{k,k} is large enough to allow the prior to favor better-than-chance annotators over random or adversarial ones.
Because there are J coders, it would be natural to extend the model to include a hierarchical prior for β and to partially pool the estimates of coder accuracy and bias.

Marginalizing out the True Category

Because the true category parameter z is discrete, it must be marginalized out of the joint posterior in order to carry out sampling or maximum likelihood estimation in Stan. The joint posterior factors as

  p(y, θ, π) = p(y|θ, π) p(π) p(θ),

where p(y|θ, π) is derived by marginalizing z out of

  p(z, y|θ, π) = ∏_{i=1}^{I} ( Categorical(z_i|π) ∏_{j=1}^{J} Categorical(y_{i,j}|θ_{j,z[i]}) ).

This can be done item by item, with

  p(y|θ, π) = ∏_{i=1}^{I} ∑_{k=1}^{K} ( Categorical(k|π) ∏_{j=1}^{J} Categorical(y_{i,j}|θ_{j,k}) ).

In the missing-data model, only the observed labels would be used in the inner product. Dawid and Skene (1979) derive exactly the same equation in their Equation (2.7), which is required for the E-step of their expectation maximization (EM) algorithm. Stan requires the marginalized probability function on the log scale,

  log p(y|θ, π) = ∑_{i=1}^{I} log ∑_{k=1}^{K} exp( log Categorical(k|π) + ∑_{j=1}^{J} log Categorical(y_{i,j}|θ_{j,k}) ),

which can be coded directly using Stan's built-in log_sum_exp function.

  data {
    int<lower=2> K;
    int<lower=1> I;
    int<lower=1> J;
    int<lower=1, upper=K> y[I, J];
    vector<lower=0>[K] alpha;
    vector<lower=0>[K] beta[K];
  }
  parameters {
    simplex[K] pi;
    simplex[K] theta[J, K];
  }
  transformed parameters {
    vector[K] log_q_z[I];
    for (i in 1:I) {
      log_q_z[i] = log(pi);
      for (j in 1:J)
        for (k in 1:K)
          log_q_z[i, k] = log_q_z[i, k]
            + log(theta[j, k, y[i, j]]);
    }
  }
  model {
    pi ~ dirichlet(alpha);
    for (j in 1:J)
      for (k in 1:K)
        theta[j, k] ~ dirichlet(beta[k]);
    for (i in 1:I)
      target += log_sum_exp(log_q_z[i]);
  }

Figure 15.5: Stan program for the rating (or diagnostic accuracy) model of Dawid and Skene (1979). The model marginalizes out the discrete parameter z, storing the unnormalized conditional probability log q(z_i = k|θ, π) in log_q_z[i, k].

Stan Implementation

The Stan program for the Dawid and Skene model is provided in Figure 15.5.
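The per-item marginalization implemented via log_sum_exp can be sketched in Python (not part of the manual; 0-based indices and hypothetical rater accuracies):

```python
import math

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def item_log_marginal(y_i, log_pi, log_theta):
    """log p(y_i | theta, pi) for one item: log-sum-exp over the true
    category k of log pi[k] + sum_j log theta[j][k][y_ij]."""
    K = len(log_pi)
    log_q = [log_pi[k]
             + sum(log_theta[j][k][y_ij] for j, y_ij in enumerate(y_i))
             for k in range(K)]
    return log_sum_exp(log_q)

# Two categories, two raters, each 80% accurate (hypothetical values);
# log_theta[j][k][r] = log probability rater j gives rating r to category k.
acc = [[math.log(0.8), math.log(0.2)],
       [math.log(0.2), math.log(0.8)]]
log_theta = [acc, acc]
log_pi = [math.log(0.5), math.log(0.5)]

# Both raters assign category 0 to the item:
lm = item_log_marginal([0, 0], log_pi, log_theta)
```

Here the marginal probability is 0.5 · 0.8 · 0.8 + 0.5 · 0.2 · 0.2 = 0.34; the inner list log_q plays the role of log_q_z[i] in Figure 15.5.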
The Stan model converges quickly and mixes well using NUTS starting from diffuse initial points, unlike the equivalent model implemented with Gibbs sampling over the discrete parameter. Reasonable weakly informative priors are α_k = 3, β_{k,k} = 2.5 K, and β_{k,k′} = 1 for k ≠ k′. Taking α and β_k to be unit vectors and applying optimization will produce the same answer as the expectation maximization (EM) algorithm of Dawid and Skene (1979).

Inference for the True Category

The quantity log_q_z[i] is defined as a transformed parameter. It encodes the (unnormalized) log of p(z_i|θ, π). Each iteration provides a value conditioned on that iteration's values for θ and π. Applying the softmax function to log_q_z[i] provides a simplex corresponding to the probability mass function of z_i in the posterior. These may be averaged across the iterations to provide the posterior probability distribution over each z_i.

16. Sparse and Ragged Data Structures

Stan does not directly support either sparse or ragged data structures, though both can be accommodated with some programming effort. Chapter 44 introduces a special-purpose sparse matrix times dense vector multiplication, which should be used where applicable; this chapter covers more general data structures.

16.1. Sparse Data Structures

Coding sparse data structures is as easy as moving from a matrix-like data structure to a database-like data structure. For example, consider the coding of sparse data for the IRT models discussed in Section 9.11. There are J students and K questions, and if every student answers every question, then it is practical to declare the data as a J × K array of answers.

  data { int