User's Guide for norm2

Joseph L. Schafer
Office of the Associate Director for Research and Methodology
United States Census Bureau
Washington, DC 20233

May 5, 2016

Abstract

The R package norm2 provides functions for analyzing incomplete multivariate data under a normal model, including likelihood-based parameter estimation, Bayesian posterior simulation, and multiple imputation. This document gives detailed information about norm2's model and procedures, with step-by-step instructions and example analyses using datasets included with the package.

Part of this work was supported by the National Institute on Drug Abuse, 1-P50-DA-10075, while the author was employed at The Pennsylvania State University. The remainder of this work was produced at the U.S. Census Bureau in the course of official duties and, pursuant to Title 17, Section 105 of the United States Code, is not subject to copyright protection within the United States. Therefore, there is no copyright to assign or license and this work may be used, reproduced or distributed within the United States. This work may be modified provided that any derivative works bear notice that they are derived from it, and any modified versions bear some notice that they have been modified, as required by Title 17, Section 403 of the United States Code. The U.S. Census Bureau may assert copyright internationally. To this end, this work may be reproduced and disseminated outside of the United States, provided that the work distributed or published internationally carries the notice: "International copyright, 2016, U.S. Census Bureau, U.S. Government". The author and the Census Bureau assume no responsibility whatsoever for the use of this work by other parties, and make no guarantees, expressed or implied, about its quality, reliability, or any other characteristic. The author and the Census Bureau are not obligated to assist users or to fix reported problems with this work. For additional information, refer to GNU General Public License Version 3 (GPLv3).

1 Overview

The R package norm2 provides methods for analyzing numeric data with missing values. These methods are based on the multivariate normal model from which the package name is derived. Although real-world data rarely conform to assumptions of normality, these techniques have proven to be surprisingly useful and may often be adapted to non-normal situations (for example, by transforming variables prior to analysis).

Source code written by the author for a much older version of this package, called norm, was ported to R by Alvaro A. Novo and submitted to the Comprehensive R Archive Network (CRAN), the collection of web servers that disseminates R. The old package is still available on CRAN, but it has major limitations and the author does not recommend its use. The new package offers better syntax, greater numerical stability, improved exception and error handling, and many other enhancements.

• Variables that are completely observed can now be included as unmodeled predictors rather than responses, eliminating unnecessary parameters and making the computations more efficient.

• The two most important functions, emNorm and mcmcNorm, return objects of the new class "norm" which can be used as arguments to other function calls.

• Models may now be specified using the standard formula syntax of R's linear regression modeling procedure lm. Data frames and factors are handled seamlessly.
In Sections 2–4, we present the model used by norm2 and describe our algorithms for likelihood-based parameter estimation, Bayesian posterior simulation and multiple imputation. Section 5 reviews some general concepts from R that will help new users to understand package functions and documentation. Sections 6–8 illustrate the use of norm2 functions with three data examples taken from Schafer (1997) which are distributed with the package. We assume that readers have basic familiarity with the R programming environment. Overviews of R can be found in the manual An Introduction to R written by the R Core Team and distributed with the software, in numerous books and articles, and other resources available on the Internet.

2 The model

2.1 Model for the complete data

The norm2 model describes an n × r data matrix

    Y =
      [ y_1⊤ ]     [ y_11  y_12  ...  y_1r ]
      [ y_2⊤ ]  =  [ y_21  y_22  ...  y_2r ]
      [  ...  ]     [  ...   ...  ...   ...  ]
      [ y_n⊤ ]     [ y_n1  y_n2  ...  y_nr ],

where y_i = (y_i1, ..., y_ir)⊤ denotes a vector of responses for unit i, and the units i = 1, ..., n are randomly sampled from a population. Throughout this document, we follow the convention of displaying vectors and matrices in boldface type; all vectors are assumed to be columns; and the superscript "⊤" denotes a transpose. We allow arbitrary portions of Y to be missing, which is why the norm2 package was created.

In earlier releases of this software, we had assumed that y_1, y_2, ..., y_n were independent and identically distributed as N(µ, Σ), where µ = (µ_1, ..., µ_r)⊤ is a vector of means and Σ is a positive-definite covariance matrix. In that framework, if one wanted to include completely observed covariates in an analysis, they would have to be placed among the columns of Y. That approach was sometimes computationally inefficient, because the joint distribution of the covariates became part of the model formulation even if it was not needed, inflating the number of unknown parameters unnecessarily (Schafer, 1997, Sec. 2.6.2).

The new norm2 model still places incomplete variables into the columns of Y, but completely observed variables may now be included in a matrix of predictors,

    X =
      [ x_1⊤ ]     [ x_11  x_12  ...  x_1p ]
      [ x_2⊤ ]  =  [ x_21  x_22  ...  x_2p ]
      [  ...  ]     [  ...   ...  ...   ...  ]
      [ x_n⊤ ]     [ x_n1  x_n2  ...  x_np ],

which is not modeled. Responses are related to predictors by a standard multivariate regression,

    y_i | x_i ∼ N( β⊤ x_i, Σ ),        (1)

where β is a p × r matrix of regression coefficients. This model can also be written as Y = Xβ + ϵ, where ϵ is an n × r matrix distributed as vec(ϵ) ∼ N(0, Σ ⊗ I_n), and I_n is the n × n identity matrix. In most applications, the first column of X will be a constant, x_i1 ≡ 1. If X consists of a single column (1, 1, ..., 1)⊤, then this new model reduces to the old one with β = µ⊤. If desired, completely observed covariates may also be placed into the columns of Y, because a joint normal model for the augmented vector (y_i⊤, x_i⊤)⊤ implies a conditional model of the form (1) for y_i given x_i.
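To make the notation concrete, here is a minimal R sketch that simulates a small dataset from model (1). None of this is norm2 code, and all object names (X, beta, Sigma, Y) are our own; the only assumption is the multivariate regression structure described above.

set.seed(1)
n <- 100; p <- 2; r <- 3
X <- cbind(1, rnorm(n))                    # n x p predictor matrix, first column constant
beta <- matrix(c(10, 20, 30,               # p x r matrix of regression coefficients
                  1, -1,  2), nrow = p, byrow = TRUE)
Sigma <- matrix(c(4, 1, 1,
                  1, 3, 1,
                  1, 1, 2), nrow = r)      # r x r positive-definite covariance matrix
U <- chol(Sigma)                           # upper-triangular factor, Sigma = U'U
eps <- matrix(rnorm(n * r), n, r) %*% U    # rows of eps are N(0, Sigma) errors
Y <- X %*% beta + eps                      # n x r response matrix Y = X beta + eps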
2.2 Estimation with complete data

Before describing our missing-data routines, it is helpful to consider how the parameters of this model (1) could be estimated if there were no missing values. Given Y, the complete-data loglikelihood function for β and Σ is

    l(β, Σ | Y, X) = -(n/2) log|Σ| - (1/2) ∑_{i=1}^n (y_i - β⊤x_i)⊤ Σ^{-1} (y_i - β⊤x_i)
                   = -(n/2) log|Σ| - (1/2) tr{ Σ^{-1} ∑_{i=1}^n (y_i - β⊤x_i)(y_i - β⊤x_i)⊤ }
                   = -(n/2) log|Σ| - (1/2) tr{ Σ^{-1} (Y - Xβ)⊤ (Y - Xβ) }.        (2)

This loglikelihood is a linear function of the sufficient statistics T_1 = X⊤Y and T_2 = Y⊤Y, whose expectations are E(T_1 | X) = X⊤Xβ and E(T_2 | X) = β⊤X⊤Xβ + nΣ. Setting the realized values of T_1 and T_2 equal to their expectations and solving for β and Σ leads immediately to the maximum-likelihood (ML) estimates,

    β̂ = (X⊤X)^{-1} X⊤Y = (X⊤X)^{-1} T_1,        (3)

    Σ̂ = n^{-1} (Y - Xβ̂)⊤ (Y - Xβ̂)
       = n^{-1} ( T_2 - T_1⊤ (X⊤X)^{-1} T_1 ).        (4)

Another useful expression for the complete-data loglikelihood function is

    l(β, Σ | Y, X) = -(n/2) log|Σ| - (1/2) tr{ Σ^{-1} (Y - Xβ̂)⊤ (Y - Xβ̂) }
                     - (1/2) vec(β - β̂)⊤ [ Σ ⊗ (X⊤X)^{-1} ]^{-1} vec(β - β̂).        (5)
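Continuing the simulated example above, the complete-data ML estimates (3) and (4) can be computed directly from the sufficient statistics. This is only an illustration of the formulas, not norm2 code; the final line checks the coefficients against an ordinary least-squares fit.

T1 <- t(X) %*% Y                          # sufficient statistic T1 = X'Y
T2 <- t(Y) %*% Y                          # sufficient statistic T2 = Y'Y
XtX.inv <- solve(t(X) %*% X)
beta.hat  <- XtX.inv %*% T1               # equation (3)
Sigma.hat <- (T2 - t(T1) %*% XtX.inv %*% T1) / n   # equation (4)
all.equal(unname(beta.hat), unname(coef(lm(Y ~ X - 1))))   # should be TRUE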
3 ML estimation with ignorable nonresponse

3.1 An EM algorithm

Now suppose that some elements of Y are missing at random (MAR) in the sense defined by Rubin (1976). Let Y_! and Y_? denote the observed and missing parts of Y, respectively, and let y_i! and y_i? denote the observed and missing parts of y_i. Let β_i! and β_i? denote the submatrices of β corresponding to y_i! and y_i?. That is, β_i! contains the columns of β for predicting the observed elements of y_i, and β_i? contains the columns for predicting the missing elements. Similarly, suppose we partition Σ into submatrices corresponding to the observed and missing parts of y_i, calling the submatrices Σ_i!!, Σ_i!?, Σ_i?! and Σ_i??. For example, if the first r_1 elements of y_i are observed and the remaining r_2 = r - r_1 elements of y_i are missing, then

    [ y_i! ]  |  x_i  ∼  N( [ β_i!, β_i? ]⊤ x_i ,  [ Σ_i!!  Σ_i!? ] )
    [ y_i? ]                                        [ Σ_i?!  Σ_i?? ]  ,

where β_i! is p × r_1, β_i? is p × r_2, Σ_i!! is r_1 × r_1, Σ_i!? is r_1 × r_2, Σ_i?! is r_2 × r_1 and Σ_i?? is r_2 × r_2. The loglikelihood function ignoring the missing-data mechanism is

    l(β, Σ | Y_!, X) = ∑_{i=1}^n log ∫ |Σ|^{-1/2} exp{ -(1/2)(y_i - β⊤x_i)⊤ Σ^{-1} (y_i - β⊤x_i) } dy_i?
                     = ∑_{i=1}^n { -(1/2) log|Σ_i!!| - (1/2)(y_i! - β_i!⊤x_i)⊤ Σ_i!!^{-1} (y_i! - β_i!⊤x_i) }.        (6)

We maximize this function by an extension of the EM algorithm for multivariate normal data described by Schafer (1997), Little and Rubin (2002) and others. In the E-step, we accumulate the expectations of T_1 = ∑_i x_i y_i⊤ and T_2 = ∑_i y_i y_i⊤ with respect to the conditional distribution of Y_? given Y_!, fixing β and Σ equal to their current estimates. In the M-step, we update the estimates by substituting the expected values of T_1 and T_2 into (3)–(4). When computing the expectations, we use the formulas

    E( y_i! | y_i! )         = y_i!,
    E( y_i? | y_i! )         = β_i?⊤ x_i + Σ_i?! Σ_i!!^{-1} ( y_i! - β_i!⊤ x_i ),
    E( y_i! y_i!⊤ | y_i! )   = y_i! y_i!⊤,
    E( y_i! y_i?⊤ | y_i! )   = y_i! E( y_i? | y_i! )⊤,
    E( y_i? y_i?⊤ | y_i! )   = E( y_i? | y_i! ) E( y_i? | y_i! )⊤ + Σ_i?? - Σ_i?! Σ_i!!^{-1} Σ_i!?,

where conditioning on x_i, β and Σ is assumed but suppressed in the notation. We compute Σ_i?! Σ_i!!^{-1} and Σ_i?? - Σ_i?! Σ_i!!^{-1} Σ_i!? using the sweep operator (Goodnight, 1979), sweeping Σ on the positions corresponding to y_i!. At the outset, we group the sample units by their missingness patterns to avoid unnecessary sweeps.

This EM algorithm is implemented in norm2 in a function called emNorm, which will be described in Section 6. Each iteration of this algorithm increases the loglikelihood function (6) until it converges to a local maximum (or, rarely, a saddlepoint) or until it reaches a boundary of the parameter space where Σ is no longer positive definite. If no elements of Y are missing, the algorithm converges after a single iteration.

3.2 Starting values

The EM routine in emNorm allows the user to provide starting values for β and Σ. If none are given, the software chooses its own. The default starting values are obtained as follows. The observed values in each column of Y are regressed on the corresponding rows of X, and the ordinary least-squares (OLS) coefficients from this regression are placed in the appropriate column of β. The usual estimate of the residual variance, commonly known as the mean squared error or MSE, is placed in the appropriate diagonal position of Σ, and the off-diagonal elements are set to zero. If the predictor matrix happens to be rank-deficient, the OLS coefficients are not unique, and a convenient solution is chosen. If the model happens to have perfect fit (for example, because the number of predictors exceeds the number of available cases), the residual variance is arbitrarily set to one-half the value of the sample variance of the observed responses.

3.3 Detecting convergence

Suppose we arrange the parameters of our model into a vector θ = [ vec(β)⊤, vech(Σ)⊤ ]⊤, where "vec" stacks the columns of a matrix into a single column, and "vech" stacks the nonredundant elements (the lower triangle) of a symmetric matrix into a single column. Let θ_j denote an element of θ, and let θ_j^{(t)} denote the estimate of θ_j after t iterations of EM. We say that the algorithm has converged by iteration t if θ_j^{(t)} is sufficiently close to θ_j^{(t-1)} for all j. The convergence criterion used by emNorm requires

    | θ_j^{(t)} - θ_j^{(t-1)} |  ≤  δ | θ_j^{(t-1)} |

for all j, where the default value of δ is 1.0 × 10^{-5}. In the unlikely event that θ_j^{(t-1)} happens to be exactly zero, that parameter is ignored when assessing convergence at that iteration.
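As a minimal sketch of this criterion (our own helper function, not the internal emNorm code), given two successive parameter vectors the check could be written as:

rel.converged <- function(theta.new, theta.old, delta = 1e-5) {
  ok <- theta.old != 0        # parameters that are exactly zero are ignored
  all( abs(theta.new[ok] - theta.old[ok]) <= delta * abs(theta.old[ok]) )
}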
3.4 Elementwise rates of convergence

When EM has nearly converged, the rate of convergence for θ_j can be estimated by

    λ̂_j = ( θ_j^{(t+1)} - θ_j^{(t)} ) / ( θ_j^{(t)} - θ_j^{(t-1)} ).        (7)

If there happens to be no missing information regarding θ_j (e.g., if θ_j is the mean or variance of a variable that has no missing values), then EM converges for that parameter in a single step, and λ̂_j will be zero or undefined. In most other situations, λ̂_j estimates the "worst fraction of missing information" for the problem—not the rate of missing information for θ_j itself, but the supremum of the rate of missing information over all linear combinations of the elements of θ (Meng and Rubin, 1994; Schafer, 1997, Sec. 3.3).

The worst fraction of missing information can sometimes be approximated by the λ̂_j's over the final iterations of EM. Unfortunately, those quantities can be notoriously unstable, especially in problems where EM converges quickly. A better technique was developed by Fraley (1998). Suppose we view the E- and M-steps of EM as a function that updates the parameter vector at each iteration, θ^{(t+1)} = M( θ^{(t)} ), where M maps the parameter space onto itself. The algorithm stops at a fixed point θ̂ if θ̂ = M(θ̂). The convergence behavior of EM and rates of missing information are determined by the Jacobian matrix

    J(θ) = (∂ / ∂θ⊤) M(θ).

More specifically, the worst fraction of missing information λ̂ is the largest eigenvalue of Ĵ = J(θ̂), and the eigenvector v̂ corresponding to λ̂ approximates the trajectory of EM as it approaches θ̂. Fraley (1998) obtained a numerical estimate of λ̂ using power iteration, a well known technique for computing dominant eigenvalues that dates back to the 1920's. Fraley's method does not require the full Jacobian matrix, but only the derivative vector

    (d / dδ) M( θ̂ + δ v̂ )

in the neighborhood of δ = 0. In emNorm, we apply Fraley's power method with a numerical estimate of this derivative vector obtained from a second-order finite differencing sequence. In well behaved problems, the procedure gives reasonably accurate estimates of λ̂ and v̂. If the procedure fails, it suggests that the loglikelihood function (6) may be oddly shaped, with a maximum that is not unique or a solution on the boundary of the parameter space where Σ is singular.

3.5 Prior information for Σ

Depending on the rates and patterns of missing observations in Y, certain aspects of Σ may be poorly estimated. For example, if there are no cases having joint responses for variables j and j′, then functions of Σ pertaining to the partial correlation between y_ij and y_ij′ given x_i and the other variables will be inestimable (Rubin, 1974). In other cases, multicollinearity among the response variables may push the ML estimate of Σ toward a boundary of the parameter space. When this occurs, instability in the estimation of Σ may make it difficult to draw inferences based only on the likelihood function. One remedy for this problem is to introduce prior information to smooth the estimate of Σ toward a guess.

Suppose that we apply an improper uniform prior distribution to β and an inverted Wishart prior distribution to Σ, so that Σ^{-1} ∼ W(ξ, Λ), where ξ and Λ denote the degrees of freedom and the scale matrix, respectively. The prior density function is

    p(β, Σ) ∝ |Σ|^{-(ξ+r+1)/2} exp{ -(1/2) tr( Σ^{-1} Λ^{-1} ) },        (8)

which has the same functional form as the complete-data likelihood. If we add the logarithm of this density to the complete-data loglikelihood function, the joint mode for β and Σ is given by β̂ = (X⊤X)^{-1} X⊤Y = (X⊤X)^{-1} T_1 and

    Σ̂ = (n + ξ + r + 1)^{-1} [ (Y - Xβ̂)⊤ (Y - Xβ̂) + Λ^{-1} ]
       = (n + ξ + r + 1)^{-1} [ T_2 - T_1⊤ (X⊤X)^{-1} T_1 + Λ^{-1} ].        (9)

If we change the M-step by computing (9) rather than (4), with T_1 and T_2 replaced by their expectations, then the modified EM algorithm will maximize the observed-data log-posterior density, which is equal to (6) plus the logarithm of (8).

In choosing values for the hyperparameters, we may think of ξ^{-1}Λ^{-1} as a prior guess for Σ and ξ as the degrees of freedom on which this guess is based. In that case, Λ^{-1} can be regarded as the prior sums of squares and cross-products (SSCP) matrix. The function emNorm allows the user to supply values for ξ and Λ^{-1}. It also implements three special cases of this prior distribution:

• the uniform prior, which sets ξ = -(r + 1) and Λ^{-1} = 0;

• the Jeffreys prior, which sets ξ = 0 and Λ^{-1} = 0; and

• a data-dependent "ridge" prior, which smooths all of the estimated correlations toward zero (Schafer, 1997, pp. 155–156).

The ridge prior requires the user to specify ξ, which determines the amount of smoothing. The prior guess for Σ is set equal to the ad hoc diagonal estimate used for default starting values as described in Section 3.2. The default prior used by emNorm is the uniform prior, for which the posterior mode is the ML estimate.
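In the emNorm calling syntax shown later in Section 6, ξ and Λ^{-1} correspond to the arguments prior.df and prior.sscp. As a hedged illustration (the data object mydata and the value of prior.df are placeholders, not a recommendation), a ridge prior might be requested as follows:

library(norm2)
## data-dependent ridge prior with, for illustration, 3 prior degrees of freedom;
## the prior guess for Sigma is the diagonal estimate used for default starting values
emResult <- emNorm(mydata, prior = "ridge", prior.df = 3)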
4 Posterior simulation and multiple imputation

4.1 A Markov chain Monte Carlo procedure

With some modification, the EM algorithm described above can be turned into a Markov chain Monte Carlo (MCMC) procedure for simulating draws of Y_?, β and Σ from their joint posterior distribution given Y_! and X. The MCMC algorithm proceeds as follows. Given the current random draws Y_?^{(t)}, β^{(t)} and Σ^{(t)}, we first

    draw y_i?^{(t+1)} from P( y_i? | y_i!, β^{(t)}, Σ^{(t)}, x_i )        (10)

independently for i = 1, ..., n to create Y_?^{(t+1)}. We call this the Imputation or I-step; it is very closely related to the E-step of the previous section. Given y_i!, β and Σ, the distribution of y_i? is normal with mean vector β_i?⊤ x_i + Σ_i?! Σ_i!!^{-1} ( y_i! - β_i!⊤ x_i ) and covariance matrix Σ_i?? - Σ_i?! Σ_i!!^{-1} Σ_i!?. After the I-step has been completed, we

    draw Σ^{(t+1)} from P( Σ | Y_!, Y_?^{(t+1)}, X )        (11)

and then

    draw β^{(t+1)} from P( β | Y_!, Y_?^{(t+1)}, Σ^{(t+1)}, X ),        (12)

which is called the Posterior or P-step. Repeating (10)–(12) many times generates a sequence of random draws ( Y_?^{(t)}, β^{(t)}, Σ^{(t)} ), t = 1, 2, ..., whose limiting distribution is P( Y_?, β, Σ | Y_!, X ) regardless of the starting values.

Steps (11) and (12) require us to specify a joint prior distribution for β and Σ. Once again, we apply an improper uniform prior to β and an inverted Wishart prior to Σ. Combining the prior density (8) with the likelihood function implied by (5), and using the fact that

    | Σ ⊗ (X⊤X)^{-1} | = |Σ|^p |X⊤X|^{-r},

the joint posterior density for β and Σ given Y and X is

    p(β, Σ | Y, X) ∝ |Σ|^{-(n-p+ξ+r+1)/2} exp{ -(1/2) tr Σ^{-1} [ Λ^{-1} + (Y - Xβ̂)⊤ (Y - Xβ̂) ] }
                     × | Σ ⊗ (X⊤X)^{-1} |^{-1/2} exp{ -(1/2) vec(β - β̂)⊤ [ Σ ⊗ (X⊤X)^{-1} ]^{-1} vec(β - β̂) }.

From this it immediately follows that

    Σ^{-1} | Y, X ∼ W( ξ′, Λ′ ),
    vec(β) | Y, X, Σ ∼ N( vec(β̂), Σ ⊗ (X⊤X)^{-1} ),

where ξ′ = ξ + n - p and Λ′ = [ Λ^{-1} + (Y - Xβ̂)⊤ (Y - Xβ̂) ]^{-1}. Under the Jeffreys prior

    p(β, Σ) ∝ |Σ|^{-(r+1)/2},

which can be regarded as the limit of (8) as ξ → 0 and Λ^{-1} → 0, the posterior becomes

    Σ^{-1} | Y, X ∼ W( n - p, [ (Y - Xβ̂)⊤ (Y - Xβ̂) ]^{-1} ),
    vec(β) | Y, X, Σ ∼ N( vec(β̂), Σ ⊗ (X⊤X)^{-1} ).

To simulate β, we apply the Cholesky factorizations Σ = GG⊤ and (X⊤X)^{-1} = HH⊤, where G and H are lower-triangular. Using elementary properties of Kronecker products, we get

    Σ ⊗ (X⊤X)^{-1} = (GG⊤) ⊗ (HH⊤) = (G ⊗ H)(G⊤ ⊗ H⊤) = (G ⊗ H)(G ⊗ H)⊤,

where G ⊗ H is lower-triangular. A random draw of vec(β) is obtained as vec(β̂) + (G ⊗ H)z, where z is a vector of independent standard normal variates of length p × r.

We implemented this procedure in a function called mcmcNorm, which will be described in Section 6. When calling mcmcNorm, the user must supply starting values for β and Σ. A good choice for starting values is an ML estimate or posterior mode obtained from emNorm. If emNorm is run first, then the result from emNorm may be supplied as the main argument to mcmcNorm, in which case the EM estimates automatically become the starting values for MCMC. The result from mcmcNorm may also be supplied as the main argument in another call to mcmcNorm, so that the final simulated values of β and Σ from the first chain become the starting values for the second chain.
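To illustrate the P-step draw for β described above (this is our own sketch, not the mcmcNorm implementation; beta.hat, Sigma and X are assumed to be available, for example from the complete-data formulas earlier):

draw.beta <- function(beta.hat, Sigma, X) {
  p <- nrow(beta.hat); r <- ncol(beta.hat)
  G <- t(chol(Sigma))                    # lower-triangular factor, Sigma = G G'
  H <- t(chol(solve(crossprod(X))))      # lower-triangular factor of (X'X)^{-1}
  z <- rnorm(p * r)                      # independent standard normal variates
  b <- as.vector(beta.hat) + kronecker(G, H) %*% z
  matrix(b, nrow = p, ncol = r)          # reshape vec(beta) back into a p x r matrix
}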
4.2 Assessing convergence

Proper use of MCMC requires us to judge how many iterations are required before the simulated parameters can be regarded as draws from the observed-data posterior distribution. In general, we would like to know whether the algorithm achieves stationarity by k iterations, which means that θ^{(t+k)}, the simulated value after iteration t + k, is independent of θ^{(t)}. Like EM, the convergence of this MCMC procedure is related to missing information. If there were no missing values in Y, then the algorithm would converge in one iteration. When the rates of missing information are high, many iterations may be needed.

Users of MCMC typically assess convergence by examining time-series plots and autocorrelation functions (ACF's) for the series θ_j^{(t+1)}, θ_j^{(t+2)}, θ_j^{(t+3)}, ... for individual parameters j = 1, 2, .... By default, mcmcNorm saves the entire output stream consisting of the simulated elements of β and Σ from all iterations, returning them as multivariate time-series objects. In large problems with many parameters, saving the full output stream from all iterations may be impractical. The mcmcNorm function has two options that allow us to make the output more manageable. An option called multicycle instructs mcmcNorm to perform multiple cycles of MCMC per iteration (one cycle consists of an I-step followed by a P-step). By specifying multicycle = k for some k ≥ 2, mcmcNorm will save only the subsampled stream θ^{(k)}, θ^{(2k)}, θ^{(3k)}, ....

Another option for output analysis is to save and examine the simulated values of scalar functions of θ for which the serial dependence tends to be high. Schafer (1997, Section 4.4) recommends using the inner product v̂⊤θ = ∑_j v̂_j θ_j, where v̂ = (v̂_1, v̂_2, ...)⊤ is the eigenvector corresponding to the dominant eigenvalue of the Jacobian matrix in Section 3.4. This is the worst linear function of θ, the linear combination of the θ_j's in the vicinity of θ̂ for which the missing information is highest. For interpretability, we standardize this function as

    g(θ) = v̂⊤θ / ( √(v̂⊤v̂) √(θ⊤θ) ),        (13)

so that it becomes the cosine of the angle between v̂ and θ. In typical applications, if the serial correlation in this function has died down by k cycles, we can be reasonably sure that the correlations in other functions of θ have died down as well. If the result from emNorm is provided as input to mcmcNorm, the elements of v̂ (the so-called worst linear coefficients) are carried over to mcmcNorm automatically, and values of g(θ) are then saved for all iterations of MCMC.

4.3 Inferences about the normal model parameters

After a suitable burn-in period to achieve stationarity, the output stream from the MCMC algorithm may be used for direct Bayesian inferences about model parameters. For example, the parameter series θ_j^{(t+1)}, θ_j^{(t+2)}, θ_j^{(t+3)}, ... may be averaged to obtain a simulated posterior mean for θ_j. A simulated 95% Bayesian credible interval runs from the 2.5th to the 97.5th percentiles of the sampled values of θ_j. For more discussion on the use of simulated parameters from MCMC, see Section 4.2 of Schafer (1997).
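For instance, if mcmcResult denotes an object returned by mcmcNorm (as in the examples of Section 6), posterior summaries for a single coefficient might be computed as follows; the column index is purely illustrative.

b <- mcmcResult$series.beta[ , 1]        # simulated values of one coefficient
mean(b)                                  # simulated posterior mean
quantile(b, probs = c(0.025, 0.975))     # simulated 95% credible interval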
4.4 Imputation of missing values

Perhaps the most important feature of mcmcNorm is that it enables us to easily create multiple imputations for missing values. Multiple imputations, as described by Rubin (1987), Schafer (1997) and others, are independent draws from a Bayesian posterior predictive distribution for the missing values given the observed values under an assumed model,

    P( Y_? | Y_!, X ) = ∫ P( Y_? | Y_!, X, θ ) P( θ | Y_!, X ) dθ.

To create multiple imputations Y_?^{(1)}, Y_?^{(2)}, ..., Y_?^{(M)}, we first generate M independent draws of the parameters, θ^{(1)}, θ^{(2)}, ..., θ^{(M)}, from the posterior distribution P( θ | Y_!, X ). Then, given the simulated parameters, we draw Y_?^{(m)} from P( Y_? | Y_!, X, θ = θ^{(m)} ) for m = 1, ..., M.

Multiple imputations can be generated by a single chain or by multiple chains. In the single-chain method, we run the MCMC procedure for Mk cycles, where k is large enough to ensure that θ^{(k)}, θ^{(2k)}, ..., θ^{(Mk)} are essentially independent, and save the simulated values of Y_? from every kth I-step. This is accomplished by supplying an argument impute.every = k to the function mcmcNorm. In the multiple-chain method, we call the mcmcNorm function M times, running it for k cycles each time, to create M sets of simulated parameters. Each set of simulated parameters is then supplied to another function, impNorm, which draws the missing values from P( Y_? | Y_!, X, θ ). Whether using a single chain or multiple chains, it is advisable to first examine the output stream from an exploratory MCMC run to make sure that serial correlations in all parameters have died down by k cycles.

4.5 Prediction of missing values

In certain situations, it may be desirable to predict the missing values without random noise. That is, rather than simulating a random draw of Y_? from P( Y_? | Y_!, X, θ ), we may want to compute the expected value E( Y_? | Y_!, X, θ ) for a given θ using the formulas given in Section 3.1. This can be accomplished by calling the function impNorm with the option method = "predict". The default procedure for impNorm, which is method = "random", draws the missing values from P( Y_? | Y_!, X, θ ).

4.6 Combining results from analyses after multiple imputation

An attractive feature of multiple imputation is that, after the imputations have been created, the resulting datasets may be handled using routines designed for complete data. We may analyze each of the completed versions of (X, Y) separately, saving the M sets of estimates and standard errors, and then combine them by rules developed by Rubin (1987) and others.

In the basic procedure described by Rubin (1987, Chap. 3), there is a scalar quantity Q for which we need an estimate and measure of uncertainty. Suppose that, if there were no missing data, we could calculate a point estimate Q̂ = Q̂(X, Y) and a variance estimate U = U(X, Y). Suppose that we have imputed datasets (X, Y_!, Y_?^{(m)}) for m = 1, ..., M, and let Q̂^{(m)} and U^{(m)} denote the point and variance estimates from the mth imputed dataset. The overall point estimate for Q is simply the mean of the repeated estimates,

    Q̄ = M^{-1} ∑_{m=1}^M Q̂^{(m)},

and the overall variance estimate is

    T = ( 1 + M^{-1} ) B + Ū,

where

    B = (M - 1)^{-1} ∑_{m=1}^M ( Q̂^{(m)} - Q̄ )²

and

    Ū = M^{-1} ∑_{m=1}^M U^{(m)}

are the between- and within-imputation variances, respectively. The relative increase in variance due to nonresponse is

    r̂ = ( 1 + M^{-1} ) B / Ū,

and the approximate rate of missing information is

    γ̂ = ( 1 + M^{-1} ) B / T = r̂ / ( r̂ + 1 ).

Intervals and tests for Q are based on the result that (Q̄ - Q) / √T is approximately distributed as Student's t with degrees of freedom

    ν_M = ( M - 1 ) γ̂^{-2}.        (14)

Rubin (1987) assumed that the sample size was large enough that, if there were no missing data, it would be appropriate to base intervals and tests on the normal approximation

    ( Q̂ - Q ) / √U ∼ N(0, 1).        (15)

When the sample is too small for (15) to be plausible, better results are available from a modified procedure described by Barnard and Rubin (1999). If we suppose that ( Q̂ - Q ) / √U with complete data is distributed as Student's t with ν_com degrees of freedom, Barnard and Rubin (1999) show that ( Q̄ - Q ) / √T is approximately Student's t with degrees of freedom

    ν̃_M = ( 1/ν_M + 1/ν̂ )^{-1},        (16)

where

    ν̂ = ν_com ( ν_com + 1 )( 1 - γ̂ ) / ( ν_com + 3 ).

When ν_com = ∞, (16) reduces to (14), yielding Rubin's (1987) rules as a special case. The rules of Rubin (1987) and Barnard and Rubin (1999) are implemented in the norm2 package as a function called miInference.
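In practice miInference performs these computations (see Section 6.9), but as a sketch of the scalar combining rules themselves (the function name and inputs are ours, not part of norm2):

mi.combine <- function(Q.hat, U) {
  ## Q.hat: point estimates from the M imputed datasets
  ## U    : their squared standard errors (complete-data variances)
  M     <- length(Q.hat)
  Q.bar <- mean(Q.hat)                   # overall point estimate
  B     <- var(Q.hat)                    # between-imputation variance
  U.bar <- mean(U)                       # within-imputation variance
  T.var <- (1 + 1/M) * B + U.bar         # total variance T
  r.hat <- (1 + 1/M) * B / U.bar         # relative increase in variance
  gamma <- r.hat / (r.hat + 1)           # approximate rate of missing information
  df    <- (M - 1) / gamma^2             # Rubin's degrees of freedom, equation (14)
  c(est = Q.bar, std.err = sqrt(T.var), df = df)
}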
5 Preliminaries [In this section, we review some important concepts from R that are helpful for understanding norm2 functions and documentation. Experienced users of R may want to lightly skim this material or skip ahead to Section 6.] 5.1 Loading the package An R package is a collection of functions, data and documentation that is not part of the basic R distribution and needs to be installed separately. Once a package has been installed on an R user’s system, it becomes a library that can be attached and used in any R session. Procedures for downloading and installing packages from CRAN vary slightly depending on your version of R and your computer’s operating system, but they are quite simple. For example, in a typical Windows session of R, you would select the “Install package(s)” from the “Packages” menu at the top of the R user interface. For more details on package installation, refer to the documentation and help files for your particular version of R. Before you can use any of the norm2 functions or datasets within an R session, you will have to load it by issuing the command > library(norm2) 14 from the R console. 5.2 Documentation and data examples Once the library has been loaded, you can view its documentation files in the usual R fashion, as in the following examples: > help(norm2) > ?norm2 > help(emNorm) # overview of the package # same thing as above # documentation for the function emNorm The norm2 package also includes several datasets that were used by Schafer (1997) to illustrate techniques of estimation and imputation. If you issue the command > data(package="norm2") then a list of these datasets will appear. Each dataset has its own documentation; for example, > help(flas) will display the page for the flas dataset. To gain access to these datasets, use the data function with the name of the dataset as its argument. For example, if you type > data(flas) then a copy of the flas dataset will be loaded into your current workspace. 5.3 Objects, classes and generic methods Objects in R are self-documenting in the sense that each object carries important information about itself: what types of data it contains, how large it is, and so on. These pieces of information about an object are called its attributes. One of the most important attributes of an object is its class. The class of an object is a character string that tells us what kind of object it is. We can discover the class of an object by using the function class, as in the following examples. 15 > ivec <- 1:10 > converged <- T > y <- matrix(rnorm(10), 5, 2) > mylist <- list(ivec, converged, > class(ivec) [1] "integer" > class(y) [1] "matrix" > class(mylist) [1] "list" # a vector of integers 1, ..., 10 # a single logical value # a 5 x 2 matrix of random numbers y) # a list with 3 components The reason why classes are so important is that R has generic methods that do different things depending on what kind of arguments are supplied to it. More precisely, a generic method is a group of functions called by the same name but distinguished by the class of the first argument. To see how this works, consider the most widely used generic method in R, which is print. In an interactive R session, whenever you look at an object by typing its name, you are implicitly calling print. > ivec [1] 1 3 4 # look at ivec 5 6 7 8 9 10 > print(ivec) [1] 1 2 3 4 # exactly the same thing 5 6 7 8 9 10 2 Here is a part of the R documentation file for print. 
print package:base R Documentation Print Values Description: ’print’ prints its argument and returns it _invisibly_ (via ’invisible(x)’). It is a generic function which means that new printing methods can be easily added for new ’class’es. Usage: print(x, ...) ## S3 method for class ’factor’: print(x, quote = FALSE, max.levels = NULL, width = getOption("width"), ...) 16 ## S3 method for class ’table’: print(x, digits = getOption("digits"), quote = FALSE, na.print = "", zero.print = "0", justify = "none", ...) The first two arguments to print are: Arguments: x: an object used to select a method. ...: further arguments passed to or from other methods. When you invoke a generic method, R examines your first argument and decides what action to take. If the first argument to print is an object of class "factor", then print passes the first argument and all subsequent arguments to another function called print.factor. If the first argument to print is an object of class "table", then print passes the first argument and all subsequent arguments to another function called print.table. You may call the functions print.factor and print.table directly if you like, but users rarely do; it’s easier to just refer to all of these functions by their generic name print. What happens if you apply a generic method to an object for which no specific method exists? For example, suppose x is an object of class "boogeyman". If you invoke print(x), then R searches for a function named print.boogeyman. If none can be found, R tries to find another specific method that may be appropriate. As a last resort, it calls a default method which is called print.default. In R, every generic method is supposed to have a default method that is invoked as a last resort. 5.4 User-defined classes Experienced users of R, especially those who write their own packages, are fond of developing new object classes. R makes this very easy to do. For example, if you issue these commands, > x <- rnorm(100) > class(x) <- "randomVector" then you have just created a new object class "randomVector". If you then create a new function called "summary.randomVector", then > summary(x) 17 will invoke your new function automatically. In the norm2 package, we have defined a new object class called norm. A norm object is simply a list whose class attribute has been set to "norm". Most of the functions in norm2 can operate either on raw data or on a norm object, which makes them easier to remember and use. 5.5 A note on data frames and factors In R, a basic two-dimensional array is a matrix, an object whose class attribute is "matrix". All of the elements of a matrix are the same type of data (numeric values, character strings or logical values). A more general type of two-dimensional array is the data frame, whose class attribute is "data.frame". One column of a data frame may contain numbers, while another may contain logical values, another may contain character strings, and so on. A column of a data frame may also be a factor, which is R’s terminology for a categorical variable. In a factor, data values are stored as integer codes, and character labels corresponding to these codes (e.g., "Male" and "Female") are carried along via an attribute called levels. If a k-level factor appears in a as a predictor in a call to the regression modeling functions lm or glm, R automatically converts it to a set of k − 1 contrasts or dummy codes. 
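If you prefer to construct such dummy codes yourself before passing variables to a norm2 function (a strategy discussed below for an incomplete categorical variable), a minimal sketch with a hypothetical three-level factor looks like this; the variable names are only illustrative.

## hypothetical factor for race/ethnicity
race <- factor(c("Black", "Other", "Hispanic", "Black", "Other"))
black    <- as.numeric(race == "Black")      # 1 if Black, 0 otherwise
hispanic <- as.numeric(race == "Hispanic")   # 1 if non-Black Hispanic, 0 otherwise
dummies  <- cbind(black, hispanic)           # numeric columns suitable for X (or Y, if incomplete)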
An ordered is similar to a factor except that the levels are assumed to be ordered, and R converts it to a set of contrasts representing effects that are linear, quadratic, cubic, etc. We have been assuming that response variables are jointly normal. In practice, missing-data procedures designed for variables that are normal are sometimes applied to variables that are not. Binary and ordinal variables are sometimes imputed under a normal model, and the imputed values may be classified or rounded (Schafer, 1997, Chap. 6). If norm2 functions are applied to variables that are categorical, it is up to the user to decide how to do it. For example, consider a factor for race/ethnicity coded as 1=Black, 2=non-Black Hispanic and 3=Other. It would make little sense to enter this factor into a model as is, treating the internal codes 1, 2 and 3 as numeric values, because the categories are not intrinsically ordered. A better strategy would be to convert this variable into a pair of dummy variables, e.g. an indicator for Black (1 if Black, 0 otherwise) and an indicator for non-Black Hispanic (1 if non-Black Hispanic, 0 otherwise). If the race/ethnicity were completely observed, then the two dummy variables could enter the model as columns of X. If race/ethnicity had missing values, the dummy variables could be included as columns of Y and imputed, but then the user would have to decide how to convert the continuously distributed imputed values for these dummy codes back into categories of race/ethnicity. Various ad hoc procedures for converting continuous imputed values into categories have been tried, and no single method seems to dominate the others (Bernaards, Belin, and Schafer, 2007). Because we have no firmly recommended strategy for applying normal 18 distributions to categorical variables, norm2 makes no serious attempt to handle factors or ordered factors. If a data frame is supplied as an argument to a norm2 function, the data frame is converted to a numeric matrix using the R conversion function data.matrix. If any factors or ordered factors are present, they are replaced by their internal codes, which may or may not be sensible. 6 Using the package: Example 1 6.1 Cholesterol data In the remaining sections, we show by example how to use all the major functions in norm2. Our first dataset, which was originally published by Ryan and Joiner (1994), reports cholesterol levels for 28 patients treated for heart attack at a Pennsylvania medical center. These data, which were previously analyzed by Schafer (1997, Chap. 5), are distributed with the norm2 package and stored in a data frame called cholesterol: > data(cholesterol) > cholesterol Y1 Y2 Y3 1 270 218 156 2 236 234 NA 3 210 214 242 4 142 116 NA 5 280 200 NA 6 272 276 256 7 160 146 142 8 220 182 216 9 226 238 248 10 242 288 NA 11 186 190 168 12 266 236 236 13 206 244 NA 14 318 258 200 15 294 240 264 16 282 294 NA 17 234 220 264 18 224 200 NA 19 276 220 188 20 282 186 182 21 360 352 294 22 310 202 214 23 280 218 NA 19 24 25 26 27 28 278 288 288 244 236 248 278 248 270 242 198 NA 256 280 204 The three variables correspond to cholesterol levels observed 2 days (Y1), 4 days (Y2) and 14 days (Y3) after attack. Nine patients have missing values for Y3. 6.2 Listwise deletion The first step in our analysis is to estimate the means and the covariance matrix for Y1, Y2 and Y3. The R generic method summary, when applied to a data frame, displays univariate summary statistics for each variable, ignoring any missing values for that variable. 
> summary(cholesterol, na.rm=F) Y1 Y2 Min. :142.0 Min. :116.0 1st Qu.:225.5 1st Qu.:201.5 Median :268.0 Median :235.0 Mean :253.9 Mean :230.6 3rd Qu.:282.0 3rd Qu.:250.5 Max. :360.0 Max. :352.0 Y3 Min. :142.0 1st Qu.:193.0 Median :216.0 Mean :221.5 3rd Qu.:256.0 Max. :294.0 NA’s :9 The R function var computes a sample covariance matrix. If var is called with the argument use="complete.obs", it omits any row of the dataset that contains missing values. In the missing-data literature, this method is known as listwise deletion or complete-case analysis (Little and Rubin, 2002). > var(cholesterol, use="complete.obs") Y1 Y2 Y3 Y1 2299.0409 1448.912 813.2632 Y2 1448.9123 1924.585 1348.9123 Y3 813.2632 1348.912 1864.8187 Multivariate methods based on case deletion may be inefficient, because data are being discarded unnecessarily. In this example, listwise deletion omits the observed values of Y1 and Y2 for the nine patients with missing values for Y3. Estimates from listwise deletion may also be biased if the complete cases and incomplete cases are systematically different. A better alternative is to compute ML estimates for the means and covariances using the norm2 function emNorm. 20 6.3 Applying the EM algorithm One technique for calling emNorm, as explained in its documentation file, is shown below: ## Default S3 method: emNorm(obj, x = NULL, intercept = TRUE, iter.max = 1000, criterion = NULL, estimate.worst = TRUE, prior = "uniform", prior.df = NULL, prior.sscp = NULL, starting.values = NULL, ...) The only only required argument is: obj: an object used to select a method. It may be ’y’, a numeric matrix, vector or data frame containing response variables to be modeled as multivariate normal. Missing values (’NA’s) are allowed. If ’y’ is a data frame, any factors or ordered factors will be replaced by their internal codes, and a warning will be given. Alternatively, this first argument may be a ’formula’ as described below, or an object of class ’"norm"’ resulting from a call to ’emNorm’ or ’mcmcNorm’; see DETAILS. If we call emNorm with cholesterol as its first argument, the data frame is coerced into a numeric data matrix (via the function data.matrix) and becomes y. If no value is supplied for the argument x, the matrix of predictors defaults to a single column of 1’s, and the three variables in cholesterol will be regressed on a constant. > emResult <- emNorm(cholesterol) The result from emNorm is an object of class "norm". Important information in this object can be viewed with summary. > summary(emResult) Predictor (X) variables: Mean SD Observed Missing Pct.Missing CONST 1 0 28 0 0 Response (Y) variables: Mean SD Observed Missing Pct.Missing Y1 253.9286 47.71049 28 0 0.00000 Y2 230.6429 46.96745 28 0 0.00000 21 Y3 221.4737 43.18355 19 9 32.14286 Missingness patterns for response (Y) variables (. denotes observed value, m denotes missing value) (variable names are displayed vertically) (rightmost column is the frequency): YYY 123 ... 19 ..m 9 Method: Prior: Convergence criterion: Iterations: Converged: Max. rel. 
difference: -2 Loglikelihood: -2 Log-posterior density: Worst fraction missing information: EM "uniform" 1e-05 15 TRUE 8.5201e-06 615.9902 615.9902 0.4617 Estimated coefficients (beta): Y1 Y2 Y3 CONST 253.9286 230.6429 222.2371 Estimated covariance matrix (sigma): Y1 Y2 Y3 Y1 2194.9949 1454.617 835.3973 Y2 1454.6173 2127.158 1515.4584 Y3 835.3973 1515.458 1952.2182 The section above titled "Missingness patterns for response (Y) variables" has a matrix that displays in compact form the patterns of missing values that appear in the dataset and the frequency of each pattern. The next section below it provides basic information about the EM run. Using the default starting values and the default convergence criterion, the algorithm converged in just 15 iterations, suggesting that the rates of missing information are low. The rate of convergence, which estimates the worst fraction of missing information, is approximately 0.46. More results from EM are available by examining the components of the object directly. The names of the components are: > names(emResult) [1] "y" "x" "method" 22 [4] [7] [10] [13] [16] [19] [22] [25] "prior" "starting.values" "criterion" "logpost" "y.mean.imp" "n.obs" "ybar" "worst.frac" "prior.df" "iter" "estimate.worst" "param" "miss.patt" "which.patt" "ysdv" "worst.linear.coef" "prior.sscp" "converged" "loglik" "param.rate" "miss.patt.freq" "rel.diff" "em.worst.ok" "msg" Some of these components are described in the documentation for emNorm. The component iter is the number of iterations actually performed. > emResult$iter [1] 15 Perhaps the most important component is param, which contains the final parameter estimates. This is list of two components whose names are beta and sigma, corresponding to the matrices β and Σ. > emResult$param $beta Y1 Y2 Y3 CONST 253.9286 230.6429 222.2371 $sigma Y1 Y2 Y3 Y1 2194.9949 1454.617 835.3973 Y2 1454.6173 2127.158 1515.4584 Y3 835.3973 1515.458 1952.2182 The component loglik is a vector that reports, for each iteration, the value of the loglikelihood function at the start of that iteration. A property of EM is that the loglikelihood function (or log-posterior density, if a non-uniform prior is applied) will increase at each iteration. > emResult$loglik [1] -323.5527 -310.3163 -308.5807 -308.1113 -308.0162 -307.9990 -307.9958 [8] -307.9952 -307.9951 -307.9951 -307.9951 -307.9951 -307.9951 -307.9951 [15] -307.9951 The last element of loglik is not the loglikelihood evaluated at the final parameter estimates, but at the parameter estimates just before the start of the final iteration. If 23 the algorithm has converged, the two are essentially the same. If for some reason you need the precise value of the loglikelihood at the final parameter estimates, you can compute it using the function loglikNorm. One technique for calling loglikNorm, as shown in its documentation file, is ## Default S3 method: loglikNorm(obj, x = NULL, intercept = TRUE, param, ...) where the data are supplied through obj and x, and the parameters are supplied through param. Let’s compute the value of the loglikelihood at the final parameter estimates from EM and compare it to the value just before the final iteration. > llmax <- loglikNorm(cholesterol, param=emResult$param) > llmax - emResult$loglik[ emResult$iter ] [1] 2.375202e-09 The first argument to loglikNorm may also be an object of class "norm", as shown in this section of the documentation file: ## S3 method for class ’norm’ loglikNorm(obj, param = obj$param, ...) 
If we invoke loglikNorm with the object returned by emNorm (in this case, emResult) as its only argument, then the parameter estimates from EM (in this case, emResult$param) are automatically passed to loglikNorm as the parameter values at which the loglikelihood is to be evaluated. Therefore, we could have achieved the same effect with this easier syntax: > llmax <- loglikNorm(emResult) 6.4 Completely observed variables as predictors In this example, Y1 and Y2 have no missing values. Variables that are completely observed may be treated either as responses (columns of the matrix Y ) or as predictors (columns of the matrix X). If we treat Y1 and Y2 as predictors, then the model becomes a simple linear regression of Y3 on Y1, Y2 and a constant. 24 > x <- cholesterol[,1:2] > y <- cholesterol[,3] > emResult <- emNorm(y,x) > summary(emResult) Predictor (X) variables: Mean SD Observed Missing Pct.Missing CONST 1.0000 0.00000 28 0 0 Y1 253.9286 47.71049 28 0 0 Y2 230.6429 46.96745 28 0 0 Response (Y) variables: Mean SD Observed Missing Pct.Missing Y1 221.4737 43.18355 19 9 32.14286 Missingness patterns for response (Y) variables (. denotes observed value, m denotes missing value) (variable names are displayed vertically) (rightmost column is the frequency): Y 1 . 19 m 9 Method: Prior: Convergence criterion: Iterations: 10 Converged: Max. rel. difference: -2 Loglikelihood: -2 Log-posterior density: Worst fraction missing information: EM "uniform" 1e-05 TRUE 4.6597e-06 146.9102 146.9102 0.3214 Estimated coefficients (beta): Y1 CONST 74.0236337 Y1 -0.1673990 Y2 0.8269102 Estimated covariance matrix (sigma): Y1 Y1 838.9239 25 6.5 Other ways to invoke EM Many users of R are familiar with linear regression using the lm function and the syntax lm( formula, data ), where formula is a symbolic representation of the model to be fit, and data is a data frame where the model’s variables reside. For example, > lmResult <- lm( Y3 ~ Y1 + Y2, data=cholesterol ) will regress the variable Y3 on Y1, Y2 and a constant using the data contained in cholesterol. We can invoke emNorm in a similar fashion, by supplying a model formula and data frame. > emResult <- emNorm( Y3 ~ Y1 + Y2, data=cholesterol ) If there are multiple response variables, these variables should be placed on the left-hand side of the model formula and glued together using cbind. For example, the expressions emNorm( cholesterol ) and emNorm( cbind(Y1,Y2,Y3) ~ 1, data=cholesterol ) are equivalent. The documentation for emNorm descibes yet another technique for calling the function, in which the first argument is an object of class "norm". ## S3 method for class ’norm’ emNorm(obj, iter.max = 1000, criterion = obj$criterion, estimate.worst = obj$estimate.worst, prior = obj$prior, prior.df = obj$prior.df, prior.sscp = obj$prior.sscp, starting.values = obj$param, ...) This may be useful in a problem where EM did not converge by iter.max iterations. If we supply the result from emNorm in another call to emNorm, then the data, model and specification of the prior distribution from the first EM run are automatically carried over, and the final parameter estimates from the first run become the starting values for the second. See what happens when we do this with the cholesterol data. 26 > emResult <- emNorm(cholesterol) > emResult <- emNorm(emResult) > summary(emResult, show.variables=F, show.patterns=F) Method: Prior: Convergence criterion: Iterations: Converged: Max. rel. 
difference: -2 Loglikelihood: -2 Log-posterior density: Worst fraction missing information: EM "uniform" 1e-05 1 TRUE 3.9456e-06 615.9902 615.9902 0.4652 Estimated coefficients (beta): Y1 Y2 Y3 CONST 253.9286 230.6429 222.2371 Estimated covariance matrix (sigma): Y1 Y2 Y3 Y1 2194.9949 1454.617 835.3977 Y2 1454.6173 2127.158 1515.4631 Y3 835.3977 1515.463 1952.2259 Because the first run of EM had already converged according to the default convergence criterion, the second run stopped after a single iteration, and the parameter estimates are essentially unchanged. 6.6 Running MCMC and assessing convergence The function mcmcNorm is another generic method that calls different functions depending on the class of its first argument. The default method syntax is shown below. ## Default S3 method: mcmcNorm(obj, x = NULL, intercept = TRUE, starting.values, iter = 1000, multicycle = NULL, seeds = NULL, prior = "uniform", prior.df = NULL, prior.sscp = NULL, save.all.series = TRUE, save.worst.series = FALSE, worst.linear.coef = NULL, impute.every = NULL, ...) The only required inputs are a matrix of response variables, which may be provided as the first argument, and starting values for the parameters provided as starting.values. 27 Unlike emNorm, there is no default starting-value procedure; some starting values must be specified by the user. An easy way to get them is to run emNorm first and use the final parameter estimates from EM as the starting values for MCMC, as in the example below. > emResult <- emNorm( cholesterol ) > mcmcResult <- mcmcNorm( cholesterol, + starting.values = emResult$param ) A second way to call mcmcNorm is to provide a model formula, a data frame in which the variables reside, and starting values for the parameters. > mcmcResult <- mcmcNorm( cbind(Y1,Y2,Y3) ~ 1, + data = cholesterol, + starting.values = emResult$param ) The third and easiest way to call mcmcNorm is to supply an object of class "norm" produced by a previous call to emNorm or mcmcNorm. If we run emNorm first and then use its result as the first argument to mcmcNorm, the data, model, and prior distributions are carried over, and the final estimates from EM automatically become the starting values for MCMC. > > > > emResult <- emNorm(cholesterol) set.seed(92561) # so that you can reproduce these results mcmcResult <- mcmcNorm(emResult) summary(mcmcResult) Method: MCMC Prior: "uniform" Iterations: 1000 Cycles per iteration: 1 Impute every k iterations, k = NULL No. of imputations created: 0 series.worst present: TRUE series.beta present: TRUE series.sigma present: TRUE Using estimates from EM as the starting values for MCMC has an additional benefit. Notice that the output from summary tells us that mcmcResult contains an object called series.worst. This is a time series that records the value of the worst linear function of the parameters at every iteration. The coefficients that determine the worst linear function were automatically computed by emNorm after the EM algorithm converged. Plotting series.worst and its autocorrelation (ACF) function helps us to assess the convergence behavior of MCMC. 28 > par(mfrow=c(3,2)) # arrange plots on page in a 3x2 matrix > plot(mcmcResult$series.worst) > acf(mcmcResult$series.worst) These plots, which are shown in Figure 1, suggest that the algorithm approaches stationarity very quickly, with no appreciable serial dependence beyond lag 5. Checking the behavior of the worst linear function is a sensible first step in judging convergence. 
Our experience with many data examples has shown that this series often behaves like a worst-case scenario; when the serial dependence in this series has died down, the dependence in other parameters has also died down. But counterexamples to this rule do exist, so after examining the worst linear function, it is best to look at other parameters as well.

Figure 1: Time-series and ACF plots for the worst linear function of the parameters from 1,000 iterations of MCMC applied to the cholesterol data.

By default, the mcmcNorm function will save the full series for all nonredundant model parameters in objects called series.beta and series.sigma. Each of these is a multivariate time series with rows corresponding to the iterations of MCMC and columns corresponding to the individual parameters. In the example below, we plot the ACF's for all parameters, and the resulting plots are displayed in Figure 2.

> dim(mcmcResult$series.beta)    # check dimensions of beta series
[1] 1000    3
> dim(mcmcResult$series.sigma)   # check dimensions of sigma series
[1] 1000    6
> par(mfrow=c(3,3))              # nine plots per page
> for(i in 1:3)
+    acf(mcmcResult$series.beta[,i],
+        main=colnames(mcmcResult$series.beta)[i])
> for(i in 1:6)
+    acf(mcmcResult$series.sigma[,i],
+        main=colnames(mcmcResult$series.sigma)[i])

Figure 2: ACF plots for all parameters (means and covariances) from 1,000 iterations of MCMC applied to the cholesterol data.

6.7 Inferences from simulated parameters

In this data example, it may be of interest to draw inferences about changes in mean cholesterol over time. Let µj, j = 1, 2, 3, denote the mean cholesterol levels at the three occasions, and let δ31 = µ3 − µ1 be the average change from occasion 1 to occasion 3. The ML estimate of this parameter is

> emResult$param$beta[1,3] - emResult$param$beta[1,1]
[1] -31.69151

but because EM does not provide measures of uncertainty, we do not know whether this difference is statistically significant. The parameter series from MCMC, which contain simulated draws from their posterior distribution, can be used to compute simulated Bayesian estimates and intervals for δ31 or any other function of the parameters.

The default number of iterations (iter=1000) used by mcmcNorm is large enough to diagnose convergence in many problems, but it does not give very accurate estimates of posterior quantiles. So we will run mcmcNorm for an additional 10,000 iterations, using the simulated parameters at the end of the first run as starting values, and retain only the parameter series from the new iterations. In effect, the first 1,000 iterations are being used as a burn-in period.

> mcmcResult <- mcmcNorm(mcmcResult, iter=10000)

Now we simply compute the 10,000 simulated values of δ31 and summarize them.

> delta31 <- mcmcResult$series.beta[,3] - mcmcResult$series.beta[,1]
> mean(delta31)                           # simulated posterior mean
[1] -31.52694
> quantile(delta31, probs=c(.025,.975))   # simulated 95% interval
      2.5%      97.5%
-57.190033  -5.198703
> table( delta31 < 0 )

FALSE  TRUE
  108  9892

The simulated 95% Bayesian posterior interval lies entirely below zero, so the drop in mean cholesterol appears to be statistically significant.
About 99% of the simulated values of δ31 lie below zero, so we are about 99% sure, in the Bayesian sense, that the average level of cholesterol has declined between occasions 1 and 3.

Simulated parameters from successive iterations of MCMC tend to be positively correlated, so these 10,000 values of δ31 carry less information than they would if they were independent draws from the posterior distribution. If independent draws are needed, the optional argument multicycle may be helpful. Specifying multicycle=k for some k > 1 instructs mcmcNorm to perform the I- and P-step cycle k times within each iteration. Calling mcmcNorm with iter=10000 and multicycle=k has the same effect as running the algorithm for 10,000 × k iterations and saving every kth set of simulated parameters. In this example, it is probably conservative to believe that the algorithm achieves stationarity by 10 cycles, because no significant serial correlations were seen beyond lag 5. Therefore,

> mcmcResult <- mcmcNorm(mcmcResult, iter=10000, multicycle=10)

should produce series.beta and series.sigma objects that behave as 10,000 independent draws from the joint posterior distribution.

6.8 Creating multiple imputations

Another way to use MCMC for inference is to create multiple imputations (MI's). Simulated values of the missing observations from the I-steps of MCMC can be used as MI's, but they need to be independent of one another. The simplest way to achieve independence is to save the results from every kth I-step, where k is large enough to ensure stationarity. We do this by supplying the argument impute.every=k to mcmcNorm. In the example below, we start at the ML estimate and run the MCMC algorithm for 5,000 iterations, spacing the imputations k = 100 iterations apart to create M = 50 multiple imputations.

> emResult <- emNorm(cholesterol)
> set.seed(532)   # so you can reproduce these results
> mcmcResult <- mcmcNorm(emResult, iter=5000, impute.every=100)
> summary(mcmcResult)
Method:                          MCMC
Prior:                           "uniform"
Iterations:                      5000
Cycles per iteration:            1
Impute every k iterations, k =   100
No. of imputations created:      50
series.worst present:            TRUE
> cholesterol[1:5,]    # first 5 rows of original data
   Y1  Y2  Y3
1 270 218 156
2 236 234  NA
3 210 214 242
4 142 116  NA
5 280 200  NA
> mcmcResult$imp.list[[1]][1:5,]   # first 5 rows of imputed dataset 1
      Y1  Y2       Y3
[1,] 270 218 156.0000
[2,] 236 234 179.0431
[3,] 210 214 242.0000
[4,] 142 116  90.7577
[5,] 280 200 185.2721
> mcmcResult$imp.list[[2]][1:5,]   # first 5 rows of imputed dataset 2
      Y1  Y2       Y3
[1,] 270 218 156.0000
[2,] 236 234 204.7819
[3,] 210 214 242.0000
[4,] 142 116 215.8521
[5,] 280 200 224.6960

The method described above for creating MI's relies on a single chain of M × k iterations. A slightly different method is to run M chains of k iterations each. At the end of each chain, we call the function impNorm to generate a single random imputation based on the final simulated parameters. In the example below, we run M = 50 chains of 100 iterations each, starting each chain at the ML estimate, generate an imputed dataset at the end of each chain, and save the imputed datasets as a list.

> imp.list <- as.list(NULL)
> for(m in 1:50){
+    mcmcResult <- mcmcNorm(emResult, iter=100)
+    imp.list[[m]] <- impNorm(mcmcResult)
+    }

6.9 Combining the results from post-imputation analyses

Now let's use the imputed datasets to draw inferences about the mean difference δ31. With complete data, inferences about this parameter correspond to a paired t-test. That is, we estimate δ31 by the mean of the difference scores (occasion 3 minus occasion 1) among the N = 28 patients. The standard error is the square root of the variance of the difference scores divided by N. In the code below, we generate M = 50 imputed datasets by the single-chain method, compute the estimate and standard error from each one, and save the results as two lists.

> emResult <- emNorm(cholesterol)
> set.seed(532)    # so you can reproduce these results
> mcmcResult <- mcmcNorm(emResult, iter=5000, impute.every=100)
> est.list <- as.list(NULL)        # to hold the estimates
> std.err.list <- as.list(NULL)    # to hold the standard errors
> for(m in 1:50){
+    y.imp <- mcmcResult$imp.list[[m]]   # one imputed dataset
+    diff <- y.imp[,3] - y.imp[,1]       # difference scores
+    est.list[[m]] <- mean(diff)
+    std.err.list[[m]] <- sqrt( var(diff) / length(diff) )
+    }

Now we combine the results across imputations using the techniques described in Section 4.6, which are implemented in the function miInference. The syntax for calling this function, as provided in its documentation file, is shown below.

miInference( est.list, std.err.list, method = "scalar",
   df.complete = NULL )

Arguments:

est.list: a list of estimates to be combined. Each component of this list should be a scalar or vector containing point estimates from the analysis of an imputed dataset. This list should have _M_ components, where _M_ is the number of imputations, and all components should have the same length.

std.err.list: a list of standard errors to be combined. Each component of this list should be a scalar or vector of standard errors associated with the estimates in 'est.list'.

method: how are the estimates to be combined? At present, the only type allowed is '"scalar"', which means that estimands are treated as one-dimensional entities. If 'est.list' contains vectors, inference for each element of the vector is carried out separately; covariances among them are not considered.

df.complete: degrees of freedom assumed for the complete-data inference. This should be a scalar or a vector of the same length as the components of 'est.list' and 'std.err.list'.
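For readers who want to see what the combining step involves, the short sketch below applies the basic scalar combining rules of Rubin (1987) directly to est.list and std.err.list. It is an illustration only, not the code used inside miInference; in particular it uses the simple large-sample degrees-of-freedom formula, whereas miInference can also apply the small-sample adjustment of Barnard and Rubin (1999) when df.complete is supplied.

> est  <- unlist(est.list)         # M point estimates
> se   <- unlist(std.err.list)     # M standard errors
> M    <- length(est)
> Qbar <- mean(est)                # combined point estimate
> Ubar <- mean(se^2)               # within-imputation variance
> B    <- var(est)                 # between-imputation variance
> Tvar <- Ubar + (1 + 1/M)*B       # total variance
> df.approx <- (M - 1) * ( 1 + Ubar / ((1 + 1/M)*B) )^2   # large-sample df
> c(estimate = Qbar, std.error = sqrt(Tvar), df = df.approx)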
If there were no missing data in this example, tests and intervals for δ31 would be based on a t-distribution with N − 1 = 27 degrees of freedom. Therefore, when we call miInference, we need to supply the argument df.complete=27.

> miResult <- miInference(est.list, std.err.list, df.complete=27)
> miResult
           Est       SE    Est/SE   df     p Pct.mis
[1,] -31.84235 11.37477 -2.799383 18.9 0.011    23.3

The overall estimate of −31.8 is close to the ML estimate of −31.7 and the simulated posterior mean of −31.5 reported earlier. The endpoints of the approximate 95% confidence interval from MI,

> miResult$est - miResult$std.err * qt(.975, miResult$df)
[1] -55.65734
> miResult$est + miResult$std.err * qt(.975, miResult$df)
[1] -8.02737

are also close to the endpoints of the simulated 95% posterior interval (−57.2, −5.20) reported earlier.

7 Using the package: Example 2

7.1 Marijuana data

We now turn to an example that is less straightforward. Weil, Zinberg, and Nelson (1968) described an experiment to investigate the effects of marijuana. Nine male subjects received three treatments each (placebo, low dose and high dose). Under each treatment, the change in heart rate (beats per minute above baseline) was recorded 15 minutes after smoking and 90 minutes after smoking, producing six measures per subject, but five of the 54 data values are missing. These data are distributed with norm2 as a data frame called marijuana.

> data(marijuana)
> marijuana
  Plac.15 Low.15 High.15 Plac.90 Low.90 High.90
1      16     20      16      20     -6      -4
2      12     24      12      -6      4      -8
3       8      8      26      -4      4       8
4      20      8      NA      NA     20      -4
5       8      4      -8      NA     22      -8
6      10     20      28     -20     -4      -4
7       4     28      24      12      8      18
8      -8     20      24      -3      8     -24
9      NA     20      24       8     12      NA

Classical analysis-of-variance methods for repeated measures make strong assumptions about the covariance structure. Following the example analyses of Schafer (1997, Chap. 5), we will proceed without assuming any particular form for the within-person covariance matrix.

7.2 Difficulties with noninformative priors

As in the cholesterol example, we will run the EM algorithm and then use the resulting ML estimates as starting values for MCMC.

> emResult <- emNorm(marijuana)
Note: Finite-differencing procedure strayed outside parameter space;
solution at or near boundary.
OCCURRED IN: estimate_worst_frac in MOD norm_engine
> set.seed(543)
> mcmcResult <- mcmcNorm(emResult)
Note: Degrees of freedom not positive.
OCCURRED IN: ran_genchi in MOD random_generator
OCCURRED IN: run_pstep in MOD norm_engine
Posterior distribution may not be proper.
MCMC aborted at iteration 1, cycle 1
OCCURRED IN: run_norm_engine_mcmc in MOD norm_engine

What happened? The EM algorithm did converge, but the procedure for estimating the worst fraction of missing information and worst linear function failed. The warning message indicates that the EM solution lies at or near the boundary of the parameter space, meaning that the final estimate of the covariance matrix Σ is nearly singular. The MCMC procedure did not work and aborted at the first cycle. The problem lies with the prior distribution. The default prior density for Σ, which is prior="uniform", looks like an inverted Wishart density with ξ = −(r + 1) degrees of freedom, where r is the number of response variables (Section 3.5). It is not a real density function, however, because the inverted Wishart density does not exist unless the degrees of freedom exceed r (Schafer, 1997, Chap. 5).
When the sample size is large, this is not a problem, because the complete-data posterior distribution for Σ is inverted Wishart with degrees of freedom equal to ξ + N − p, where ξ is the prior degrees of freedom, N is the sample size, and p is the number of predictors. But in this particular example, we have r = 6, ξ = −7, N = 9, and p = 1 (the only predictor is a constant), so the posterior degrees of freedom are −(r + 1) + (N − p) = 1, which does not yield a proper posterior distribution even with complete data.

When (N − p) is much larger than r, the choice of prior distribution tends to have little impact on the results. But with small samples like this one, the uniform prior may cause difficulty in MCMC. A better choice is the Jeffreys noninformative prior, which resembles an inverted Wishart density with prior degrees of freedom equal to 0. Under the Jeffreys prior, the posterior degrees of freedom with complete data are (N − p), which is proper provided that (N − p) > r. Let's see what happens if we run MCMC under the Jeffreys prior starting at the posterior mode.

> emResult <- emNorm(marijuana, prior="jeffreys")
Note: Finite-differencing procedure strayed outside parameter space;
solution at or near boundary.
OCCURRED IN: estimate_worst_frac in MOD norm_engine
> set.seed(543)
> mcmcResult <- mcmcNorm(emResult)
Note: Matrix not positive definite
OCCURRED IN: cholesky_saxpy in MOD matrix_methods
OCCURRED IN: run_pstep in MOD norm_engine
Posterior distribution may not be proper.
MCMC aborted at iteration 106, cycle 1
OCCURRED IN: run_norm_engine_mcmc in MOD norm_engine

In this run, MCMC aborted at iteration 106. The error message again suggests that the posterior distribution may not be proper. This time, however, the problem is not that the degrees of freedom are negative, but that the estimated covariance matrix became singular or negative definite. This problem arises when the pattern of missing values causes some aspects of the covariance matrix to be poorly estimated or inestimable from the observed data. To get more insight into what is happening, let's run EM again under a uniform prior and look at the results.

> emResult <- emNorm(marijuana)
> summary(emResult, show.variables=F, show.patterns=F, show.params=F )
Method:                             EM
Prior:                              "uniform"
Convergence criterion:              1e-05
Iterations:                         288
Converged:                          TRUE
Max. rel. difference:               8.810968e-06
-2 Loglikelihood:                   114.3465
-2 Log-posterior density:           114.3465
Estimated rate of convergence:      0.47868

EM apparently converged, but the estimated covariance matrix is nearly singular, because its smallest eigenvalue is nearly zero.

> eigen(emResult$param$sigma, symmetric=T, only.values=T)
$values
[1] 7.510111e+02 1.483758e+02 8.574829e+01 4.563068e+01 2.783385e+01
[6] 6.523830e-10

$vectors
NULL

In fact, if we run EM again with a stricter convergence criterion, the procedure aborts, showing that the ML estimate does indeed lie on the boundary of the parameter space.

> emResult <- emNorm(marijuana, criterion=1e-06)
Note: Attempted logarithm of non-positive number
Cov. matrix became singular or negative definite.
OCCURRED IN: run_estep in MOD norm_engine
EM algorithm aborted at iteration 459
OCCURRED IN: run_norm_engine_em in MOD norm_engine
Warning message:
In emNorm.default(marijuana, criterion = 1e-06) :
  Algorithm did not converge by iteration 464

7.3 Using a ridge prior

When aspects of the covariance structure are poorly estimated or inestimable, one remedy is to introduce a small amount of prior information.
The ridge prior (Section 3.5) allows the variances to be estimated from the observed data, but smooths the estimated correlations toward zero. The degree of smoothing is controlled by the prior degrees of freedom ξ, which is specified via the argument prior.df. In many cases, we can stabilize the inference with very small amounts of prior information. Here we apply a ridge prior with ξ = 0.5 and run 5,000 iterations of MCMC starting from the posterior mode.

> emResult <- emNorm(marijuana, prior="ridge", prior.df=0.5)
> set.seed(543)
> mcmcResult <- mcmcNorm(emResult, iter=5000)

This time, the procedure ran without difficulty. Plots of the worst linear function (Figure 3) show that MCMC converges slowly because the rates of missing information are high.

Figure 3: Time-series and ACF plots for the worst linear function of the parameters from 5,000 iterations of MCMC applied to the marijuana data under a ridge prior.

7.4 Inferences by parameter simulation

The slow convergence of MCMC in this example indicates that many iterations will be required to get precise estimates for quantities of interest. Here we perform an analysis similar to that of Schafer (1997, Sec. 5.4), examining posterior distributions for contrasts among the means of the repeated measurements. Let µj, j = 1, . . . , 6 denote the means for the six measurements, and let δjk = µj − µk denote a contrast between two means. In the code below, we run 10,000 iterations of MCMC, cycling the procedure 100 times within each iteration, so that the simulated draws saved to the parameter series are approximately independent. From the series, we compute 10,000 simulated values for the six contrasts δ21, δ31, δ32, δ54, δ64, and δ65, and summarize them with a boxplot-like graph that, except for a minor amount of random noise, replicates the results shown in Schafer (1997, Fig. 5.9). The boxplot-like graph is shown in Figure 4.

Figure 4: Simulated posterior medians, quartiles and 95% equal-tailed posterior intervals for six contrasts from the marijuana dataset.

> set.seed(876)
> mcmcResult <- mcmcNorm(emResult, iter=10000, multicycle=100)
> diff.scores <- cbind(
+    d65 = mcmcResult$series.beta[,6] - mcmcResult$series.beta[,5],
+    d64 = mcmcResult$series.beta[,6] - mcmcResult$series.beta[,4],
+    d54 = mcmcResult$series.beta[,5] - mcmcResult$series.beta[,4],
+    d32 = mcmcResult$series.beta[,3] - mcmcResult$series.beta[,2],
+    d31 = mcmcResult$series.beta[,3] - mcmcResult$series.beta[,1],
+    d21 = mcmcResult$series.beta[,2] - mcmcResult$series.beta[,1] )
> # construct boxplot-like graph showing the posterior
> # median, quartiles and 95% interval for each contrast
> stats <- matrix( NA, 5, 6)
> for(i in 1:6) stats[,i] <-
+    quantile( diff.scores[,i], probs=c(.025,.25,.5,.75,.975))
> box.stats <- list( stats=stats, names=colnames(diff.scores) )
> bxp( box.stats, horizontal=T )
> abline(v=0)

8 Using the package: Example 3

8.1 FLAS data

Our third example uses the dataset flas. These data come from a pilot study to assess the reliability and validity of the Foreign Language Attitude Scale (FLAS), an instrument for predicting success in the study of foreign languages (Raymond and Roberts, 1983). The questionnaire was given to N = 279 students enrolled in four different language courses (French, German, Spanish, Russian) at Penn State University.
This dataset includes FLAS score, final grade in the course, and other variables relevant to academic achievement. The variables in the dataset, as shown in its documentation file, are:

'LAN2'  1=Spanish, 0=other
'LAN3'  1=German, 0=other
'LAN4'  1=Russian, 0=other
'AGE'   age group (1=less than 20, 2=20+)
'PRI'   number of prior foreign language courses (1=none, 2=1-2, 3=3+)
'SEX'   0=male, 1=female
'MLAT'  Modern Language Aptitude Test
'FLAS'  Foreign Language Attitude Scale
'SATV'  Scholastic Aptitude Test, verbal score
'SATM'  Scholastic Aptitude Test, math score
'ENG'   score on Penn State English placement exam
'HGPA'  high school grade point average
'CGPA'  current college grade point average
'GRD'   final grade in foreign language course (1=B or lower, 2=A)

Notice that the language variable, a nominal variable with four categories and no missing values, has been converted to three dummy codes. Including these dummy codes either as responses or predictors allows the means of the other variables to vary without constraints across the four language groups.

8.2 Difficulties with the default prior

When the EM algorithm is applied to the FLAS dataset, it converges quickly.

> data(flas)
> emResult <- emNorm(flas)
Note: Eigen power method failed to converge.
OCCURRED IN: estimate_worst_frac in MOD norm_engine
> summary(emResult, show.variables=F, show.patterns=F, show.params=F)
Method:                                  EM
Prior:                                   "uniform"
Convergence criterion:                   1e-05
Iterations:                              17
Converged:                               TRUE
Max. rel. difference:                    6.0021e-06
-2 Loglikelihood:                        7310.107
-2 Log-posterior density:                7310.107
Worst fraction missing information:      NA

The quick convergence of EM suggests that the rates of missing information are low. However, the procedure for estimating the worst fraction of missing information failed, suggesting that something has gone awry. If we try to run MCMC using the default uniform prior, we encounter difficulty. The procedure does not abort, but the time-series and ACF plots for the mean of the variable GRD, which are shown in Figure 5, show that this parameter exhibits long-term drift. (In this example, the worst linear function of the parameters is not available, so we need to diagnose convergence by examining time-series and ACF plots for individual elements or other functions of β and Σ.)

> set.seed(765)
> mcmcResult <- mcmcNorm(emResult)
> par(mfrow=c(3,2))
> plot(mcmcResult$series.beta[,14])
> acf( mcmcResult$series.beta[,14])

Figure 5: Time-series and ACF plots for the mean of variable GRD from 1,000 iterations of MCMC applied to the FLAS data under a uniform prior.

Long-term drifts in a parameter series suggest that the parameter is poorly estimated or inestimable. Estimability problems may arise when missing values fall in a malicious pattern. In this example, the problem is that the value of GRD happens to be missing for everyone who studied Russian.

> table( flas$LAN4, is.na(flas$GRD))
    FALSE TRUE
  0   232   27
  1     0   20

This pattern of missing values makes it impossible to estimate the mean of GRD among those who study Russian. With an improper prior distribution and a malicious pattern of missing values, it is possible to have a situation where, even though the I- and P-steps are well defined at each iteration, the posterior distribution does not exist, and the MCMC algorithm cannot converge (Schafer, 1997, Sec. 3.5).
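Cross-tabulations like the one above are an easy way to screen for such patterns. As a rough, ad hoc illustration (this is not a norm2 facility), one can tabulate the proportion of missing values for every variable within each language group and look for rates of exactly 1:

> # proportion of missing values in each variable, split by LAN4
> round( sapply( flas, function(v) tapply( is.na(v), flas$LAN4, mean ) ), 2 )

A rate of 1 in any cell, as occurs here for GRD among the students of Russian (LAN4 = 1), flags a group in which that variable is never observed; repeating the calculation with LAN2 and LAN3 as the grouping variable covers the remaining language groups.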
When some parameters are inestimable, different starting values for EM may lead to different parameter estimates with exactly the same loglikelihood. We can diagnose inestimability using a combination of EM and MCMC. In the code example below, we use MCMC to generate random starting values for the parameters and then run EM from these random starts. The three EM runs converge to different solutions (note the different estimates for the mean of GRD), but the loglikelihood at the three solutions looks exactly the same.

> data(flas)
> emResult <- emNorm(flas)
Note: Eigen power method failed to converge.
OCCURRED IN: estimate_worst_frac in MOD norm_engine
> summary(emResult, show.variables=F,
+    show.patterns=F, show.params=F)
Method:                                  EM
Prior:                                   "uniform"
Convergence criterion:                   1e-05
Iterations:                              17
Converged:                               TRUE
Max. rel. difference:                    6.0021e-06
-2 Loglikelihood:                        7310.107
-2 Log-posterior density:                7310.107
Worst fraction missing information:      NA
Eigen power method failed to converge.
OCCURRED IN: estimate_worst_frac in MOD norm_engine
> set.seed(765)
> mcmcResult <- mcmcNorm(emResult)
> par(mfrow=c(3,2))
> plot(mcmcResult$series.beta[,14])
> acf( mcmcResult$series.beta[,14])

> # generate three random starts by running MCMC
> # from the EM estimates
> set.seed(321)
> random.start.1 <- mcmcNorm(emResult, iter=100)
> random.start.2 <- mcmcNorm(emResult, iter=100)
> random.start.3 <- mcmcNorm(emResult, iter=100)

> # now run EM from the random starts
> emResult.1 <- emNorm( random.start.1 )
> emResult.2 <- emNorm( random.start.2 )
> emResult.3 <- emNorm( random.start.3 )

> # compare the loglikelihoods at the three EM solutions
> loglikNorm( emResult.1 )
[1] -3655.054
> loglikNorm( emResult.2 )
[1] -3655.054
> loglikNorm( emResult.3 )
[1] -3655.054

> # compare the estimates of beta
> emResult.1$param$beta
           LAN2      LAN3       LAN4      AGE      PRI
CONST 0.2795699 0.4086022 0.07168459 1.530772 2.196905
            SEX     MLAT     FLAS     SATV     SATM
CONST 0.4527892 82.48746 24.11213 498.3939 560.6742
           ENG     HGPA     CGPA     GRD
CONST 52.85448 2.749734 3.265421 1.39928
> emResult.2$param$beta
           LAN2      LAN3       LAN4      AGE      PRI
CONST 0.2795699 0.4086022 0.07168459 1.530772 2.196905
            SEX     MLAT     FLAS     SATV     SATM
CONST 0.4527892 82.48746 24.11213 498.3939 560.6742
           ENG     HGPA     CGPA      GRD
CONST 52.85448 2.749734 3.265421 1.537627
> emResult.3$param$beta
           LAN2      LAN3       LAN4      AGE      PRI
CONST 0.2795699 0.4086022 0.07168459 1.530772 2.196905
            SEX     MLAT     FLAS     SATV     SATM
CONST 0.4527892 82.48746 24.11213 498.3939 560.6742
           ENG     HGPA     CGPA      GRD
CONST 52.85448 2.749734 3.265421 1.728877

8.3 Applying a ridge prior

To overcome the problem of inestimability, we apply a ridge prior with ξ = 3 prior degrees of freedom. This small amount of information, which corresponds to about 1% of the sample size, is enough to identify the inestimable parameters. With this prior, MCMC does converge, but very slowly. Here we run MCMC for 100,000 iterations beginning from the posterior mode, and then we plot the time series and ACF for the mean of GRD. The plots are shown in Figure 6.

> emResult <- emNorm(flas, prior="ridge", prior.df=3)
> set.seed(765)
> # relax, this may take a little while
> mcmcResult <- mcmcNorm(emResult, iter=100000)
> par(mfrow=c(2,2))
> # plot first tenth of the series
> plot(mcmcResult$series.beta[1:10000,14], type="l")
> # estimate the acf from the whole series
> acf( mcmcResult$series.beta[,14], lag.max=600, col="grey")

(If you are unable to run this example because of the memory limitations of your system, try reducing the number of iterations to iter=1000 and using the option multicycle=100. MCMC will still run for 100,000 cycles, but mcmcNorm will save the simulated parameters only at every 100th cycle, and the ACF will die down by lag 6 rather than lag 600.)
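To put a rough number on how quickly the serial dependence dies down, the estimated autocorrelations can also be inspected directly rather than read off the plot. The snippet below is only a sketch; the 0.1 cutoff is an arbitrary choice, not a recommendation from the package.

> grd.acf <- acf( mcmcResult$series.beta[,14], lag.max=600, plot=FALSE )
> min( which( abs(grd.acf$acf) < 0.1 ) ) - 1    # first lag with |ACF| below 0.1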
Figure 6: Time-series and ACF plots for the mean of variable GRD from MCMC applied to the FLAS data under a ridge prior.

8.4 Analysis by multiple imputation

In earlier analyses of these data, Schafer (1997, Sec. 6.3) imputed missing values under the joint normal model and then analyzed the imputed data by fitting logistic regressions with GRD as a binary response. Now we will try to replicate those results. First, we run MCMC for 50,000 iterations using a ridge prior with ξ = 3. We save the imputed values at every 1,000th iteration, producing M = 50 multiple imputations. Because we no longer need to save the parameter series, we now call mcmcNorm with the argument save.all.series=FALSE.

> emResult <- emNorm(flas, prior="ridge", prior.df=3)
> set.seed(765)
> mcmcResult <- mcmcNorm(emResult, iter=50000,
+    save.all.series=FALSE, impute.every=1000)

After imputation, we round off the imputed values for the discrete variables AGE, PRI, SEX and GRD to the nearest categories. Then we recode AGE to a dummy indicator for age ≥ 20 and characterize the effect of PRI with linear and quadratic terms. Finally, we regress an indicator for GRD=2 on the entire set of recoded covariates, saving the coefficients and standard errors, and combine the results by Rubin's (1987) rules for scalar estimands.

> est.list <- as.list(NULL)        # to hold estimated coefficients
> std.err.list <- as.list(NULL)    # to hold their standard errors
> for(m in 1:50){
+    imp <- mcmcResult$imp.list[[m]]   # one imputed dataset
+    imp <- data.frame(imp)            # convert into a data frame
+    # round AGE and change to dummy indicator for age=20+
+    imp$AGE.2 <- 1*(imp$AGE>1.5)      # dummy indicator for age=20+
+    # round PRI and define linear, quadratic terms
+    imp$PRI <- round(imp$PRI)
+    imp$PRI[imp$PRI<1] <- 1
+    imp$PRI[imp$PRI>3] <- 3
+    imp$PRI.L <- -1*(imp$PRI==1) + 0*(imp$PRI==2) + 1*(imp$PRI==3)
+    imp$PRI.Q <-  1*(imp$PRI==1) - 2*(imp$PRI==2) + 1*(imp$PRI==3)
+    # round SEX and change to dummy indicator for female
+    imp$SEX <- 1*(imp$SEX>.5)
+    # round imputed GRD and recode to 1=A, 0=B or lower
+    imp$GRD <- 1*(imp$GRD > 1.5)
+    # fit logistic model
+    glmResult <- glm( GRD ~ LAN2 + LAN3 + LAN4 + AGE.2 + PRI.L + PRI.Q
+       + SEX + MLAT + FLAS + SATV + SATM + ENG + HGPA + CGPA,
+       data=imp, family=binomial)
+    # save estimates and SE's
+    est.list[[m]] <- glmResult$coefficients
+    std.err.list[[m]] <- sqrt( diag( vcov(glmResult) ) )   # standard errors from the fitted model
+    }

> # combine the results by Rubin's (1987) rules
> miResult <- miInference(est.list, std.err.list)
> miResult
                    Est         SE Est/SE          df     p Pct.mis
(Intercept) -1.5495e+01 2.9395e+00 -5.271       810.5 0.000    24.6
LAN2         3.6052e-01 5.0760e-01  0.710      2484.9 0.478    14.0
LAN3         1.1272e+00 4.4173e-01  2.552      7161.6 0.011     8.3
LAN4        -2.1253e-01 6.6213e+02  0.000 527837921.7 1.000     0.0
AGE.2        1.3719e+00 4.4020e-01  3.117      1014.3 0.002    22.0
PRI.L        2.9659e-01 2.4933e-01  1.190      1198.9 0.234    20.2
PRI.Q       -1.3100e-01 1.4701e-01 -0.891      1267.0 0.373    19.7
SEX          7.5617e-01 4.4557e-01  1.697      1272.6 0.090    19.6
MLAT         4.0815e-02 1.5886e-02  2.569       789.4 0.010    24.9
FLAS         1.0839e-01 5.0364e-02  2.152       366.9 0.032    36.5
SATV        -3.3027e-03 3.3211e-03 -0.994      1018.8 0.320    21.9
SATM         2.9059e-04 2.6788e-03  0.108      1975.8 0.914    15.7
ENG          9.5108e-03 2.2439e-02  0.424       919.0 0.672    23.1
HGPA         2.2366e+00 4.5447e-01  4.921      1678.3 0.000    17.1
CGPA         8.5284e-01 5.4255e-01  1.572       697.9 0.116    26.5
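Approximate 95% interval endpoints for the individual coefficients can be obtained from the components of miResult in the same way as in the cholesterol example. The lines below are a brief sketch; no additional package functionality is assumed beyond the components est, std.err and df used earlier.

> lower <- miResult$est - miResult$std.err * qt(.975, miResult$df)
> upper <- miResult$est + miResult$std.err * qt(.975, miResult$df)
> round( cbind(lower, upper), 4 )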
The results from this analysis are quite similar to those reported by Schafer (1997, Table 6.8). The differences could plausibly be due to random noise, except for one discrepancy: the standard error for the coefficient of LAN4 shown above is huge. This is the one parameter for which the observed data carry essentially no information. The logistic regression procedure converged for all M = 50 imputed datasets, but for some of these, the estimated standard error was very large, suggesting that the Hessian of the loglikelihood was nearly singular. These large standard errors inflated the value of the within-imputation variance Ū and caused the estimated rate of missing information for this parameter to be essentially zero.

9 References

Barnard, J. and Rubin, D.B. (1999) Small-sample degrees of freedom with multiple imputation. Biometrika, 86, 948–955.

Bernaards, C.A., Belin, T.R. and Schafer, J.L. (2007) Robustness of a multivariate normal approximation for imputation of incomplete binary data. Statistics in Medicine, 26, 1368–1382.

Fraley, C. (1998) On Computing the Largest Fraction of Missing Information for the EM Algorithm and the Worst Linear Function for Data Augmentation. Technical Report No. 332, Department of Statistics, University of Washington, Seattle.

Goodnight, J.H. (1979) A tutorial on the SWEEP operator. The American Statistician, 33, 149–158.

Little, R.J.A. and Rubin, D.B. (2002) Statistical Analysis with Missing Data (2nd Ed.). New York: Wiley.

Meng, X.-L. and Rubin, D.B. (1994) On the global and componentwise rates of convergence of the EM algorithm. Linear Algebra and its Applications, 199 (Supplement 1), 413–425.

Raymond, M.R. and Roberts, D.M. (1983) Development and validation of a foreign language attitude scale. Educational and Psychological Measurement, 43, 1239–1246.

Rubin, D.B. (1974) Characterizing the estimation of parameters in incomplete data problems. Journal of the American Statistical Association, 69, 467–474.

Rubin, D.B. (1976) Inference and missing data. Biometrika, 63, 581–592.

Rubin, D.B. (1987) Multiple Imputation for Nonresponse in Surveys. New York: Wiley.

Ryan, B.F. and Joiner, B.L. (1994) Minitab Handbook (Third edition). Belmont, CA: Wadsworth.

Schafer, J.L. (1997) Analysis of Incomplete Multivariate Data. London: Chapman & Hall.

Weil, A.T., Zinberg, N.E. and Nelson, J.M. (1968) Clinical and psychological effects of marihuana in man. Science, 162, 1234–1242.