[R] Stata Base Reference Manual, Release 13
STATA BASE REFERENCE MANUAL
RELEASE 13

A Stata Press Publication
StataCorp LP
College Station, Texas

Copyright © 1985–2013 StataCorp LP
All rights reserved
Version 13

Published by Stata Press, 4905 Lakeway Drive, College Station, Texas 77845
Typeset in TeX

ISBN-10: 1-59718-116-1
ISBN-13: 978-1-59718-116-7

This manual is protected by copyright. All rights are reserved. No part of this manual may be reproduced, stored in a retrieval system, or transcribed, in any form or by any means—electronic, mechanical, photocopy, recording, or otherwise—without the prior written permission of StataCorp LP unless permitted subject to the terms and conditions of a license granted to you by StataCorp LP to use the software and documentation. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document.

StataCorp provides this manual "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. StataCorp may make improvements and/or changes in the product(s) and the program(s) described in this manual at any time and without notice. The software described in this manual is furnished under a license agreement or nondisclosure agreement. The software may be copied only in accordance with the terms of the agreement. It is against the law to copy the software onto DVD, CD, disk, diskette, tape, or any other medium for any purpose other than backup or archival purposes.

The automobile dataset appearing on the accompanying media is Copyright © 1979 by Consumers Union of U.S., Inc., Yonkers, NY 10703-1057 and is reproduced by permission from CONSUMER REPORTS, April 1979.

Stata, Stata Press, Mata, and NetCourse are registered trademarks of StataCorp LP. Stata and Stata Press are registered trademarks with the World Intellectual Property Organization of the United Nations. NetCourseNow is a trademark of StataCorp LP. Other brand and product names are registered trademarks or trademarks of their respective companies. For copyright information about the software, type help copyright within Stata.

The suggested citation for this software is
StataCorp. 2013. Stata: Release 13. Statistical Software. College Station, TX: StataCorp LP.

Contents

intro . . . Introduction to base reference manual   1
about . . . Display information about your Stata   7
adoupdate . . . Update user-written ado-files   8
ameans . . . Arithmetic, geometric, and harmonic means   12
anova . . . Analysis of variance and covariance   16
anova postestimation . . . Postestimation tools for anova   57
areg . . . Linear regression with a large dummy-variable set   74
areg postestimation . . . Postestimation tools for areg   80
asclogit . . . Alternative-specific conditional logit (McFadden's choice) model   84
asclogit postestimation . . . Postestimation tools for asclogit   94
asmprobit . . . Alternative-specific multinomial probit regression   101
asmprobit postestimation . . . Postestimation tools for asmprobit   126
asroprobit . . . Alternative-specific rank-ordered probit regression   136
asroprobit postestimation . . . Postestimation tools for asroprobit   149
BIC note . . . Calculating and interpreting BIC   157
binreg . . . Generalized linear models: Extensions to the binomial family   162
binreg postestimation . . . Postestimation tools for binreg   175
biprobit . . . Bivariate probit regression   178
biprobit postestimation . . . Postestimation tools for biprobit   185
bitest . . . Binomial probability test   188
bootstrap . . . Bootstrap sampling and estimation   193
bootstrap postestimation . . . Postestimation tools for bootstrap   215
boxcox . . . Box–Cox regression models   219
boxcox postestimation . . . Postestimation tools for boxcox   230
brier . . . Brier score decomposition   235
bsample . . . Sampling with replacement   241
bstat . . . Report bootstrap results   249
centile . . . Report centile and confidence interval   256
ci . . . Confidence intervals for means, proportions, and counts   262
clogit . . . Conditional (fixed-effects) logistic regression   274
clogit postestimation . . . Postestimation tools for clogit   290
cloglog . . . Complementary log-log regression   295
cloglog postestimation . . . Postestimation tools for cloglog   304
cls . . . Clear Results window   307
cnsreg . . . Constrained linear regression   308
cnsreg postestimation . . . Postestimation tools for cnsreg   314
constraint . . . Define and list constraints   317
contrast . . . Contrasts and linear hypothesis tests after estimation   320
contrast postestimation . . . Postestimation tools for contrast   383
copyright . . . Display copyright information   385
copyright apache . . . Apache copyright notification   386
copyright boost . . . Boost copyright notification   390
copyright freetype . . . FreeType copyright notification   391
copyright icu . . . ICU copyright notification   394
copyright jagpdf . . . JagPDF copyright notification   395
copyright lapack . . . LAPACK copyright notification   396
copyright libpng . . . libpng copyright notification   397
copyright miglayout . . . MiG Layout copyright notification   399
copyright scintilla . . . Scintilla copyright notification   400
copyright ttf2pt1 . . . ttf2pt1 copyright notification   401
copyright zlib . . . zlib copyright notification   403
correlate . . . Correlations (covariances) of variables or coefficients   404
cumul . . . Cumulative distribution   412
cusum . . . Cusum plots and tests for binary variables   416
db . . . Launch dialog   420
diagnostic plots . . . Distributional diagnostic plots   422
display . . . Substitute for a hand calculator   434
do . . . Execute commands from a file   435
doedit . . . Edit do-files and other text files   436
dotplot . . . Comparative scatterplots   437
dstdize . . . Direct and indirect standardization   444
dydx . . . Calculate numeric derivatives and integrals   463
eform option . . . Displaying exponentiated coefficients   469
eivreg . . . Errors-in-variables regression   471
eivreg postestimation . . . Postestimation tools for eivreg   476
error messages . . . Error messages and return codes   478
esize . . . Effect size based on mean comparison   479
estat . . . Postestimation statistics   490
estat classification . . . Classification statistics and table   491
estat gof . . . Pearson or Hosmer–Lemeshow goodness-of-fit test   494
estat ic . . . Display information criteria   503
estat summarize . . . Summarize estimation sample   507
estat vce . . . Display covariance matrix estimates   510
estimates . . . Save and manipulate estimation results   513
estimates describe . . . Describe estimation results   517
estimates for . . . Repeat postestimation command across models   519
estimates notes . . . Add notes to estimation results   521
estimates replay . . . Redisplay estimation results   523
estimates save . . . Save and use estimation results   526
estimates stats . . . Model-selection statistics   530
estimates store . . . Store and restore estimation results   532
estimates table . . . Compare estimation results   535
estimates title . . . Set title for estimation results   541
estimation options . . . Estimation options   542
exit . . . Exit Stata   545
exlogistic . . . Exact logistic regression   546
exlogistic postestimation . . . Postestimation tools for exlogistic   564
expoisson . . . Exact Poisson regression   569
expoisson postestimation . . . Postestimation tools for expoisson   581
fp . . . Fractional polynomial regression   583
fp postestimation . . . Postestimation tools for fp   607
frontier . . . Stochastic frontier models   616
frontier postestimation . . . Postestimation tools for frontier   631
fvrevar . . . Factor-variables operator programming command   635
fvset . . . Declare factor-variable settings   638
gllamm . . . Generalized linear and latent mixed models   643
glm . . . Generalized linear models   645
glm postestimation . . . Postestimation tools for glm   679
glogit . . . Logit and probit regression for grouped data   685
glogit postestimation . . . Postestimation tools for glogit, gprobit, blogit, and bprobit   696
gmm . . . Generalized method of moments estimation   698
gmm postestimation . . . Postestimation tools for gmm   760
grmeanby . . . Graph means and medians by categorical variables   764
hausman . . . Hausman specification test   767
heckman . . . Heckman selection model   776
heckman postestimation . . . Postestimation tools for heckman   794
heckoprobit . . . Ordered probit model with sample selection   800
heckoprobit postestimation . . . Postestimation tools for heckoprobit   809
heckprobit . . . Probit model with sample selection   814
heckprobit postestimation . . . Postestimation tools for heckprobit   822
help . . . Display help in Stata   827
hetprobit . . . Heteroskedastic probit model   829
hetprobit postestimation . . . Postestimation tools for hetprobit   836
histogram . . . Histograms for continuous and categorical variables   839
icc . . . Intraclass correlation coefficients   850
inequality . . . Inequality measures   872
intreg . . . Interval regression   875
intreg postestimation . . . Postestimation tools for intreg   885
ivpoisson . . . Poisson regression with endogenous regressors   890
ivpoisson postestimation . . . Postestimation tools for ivpoisson   905
ivprobit . . . Probit model with continuous endogenous regressors   910
ivprobit postestimation . . . Postestimation tools for ivprobit   923
ivregress . . . Single-equation instrumental-variables regression   927
ivregress postestimation . . . Postestimation tools for ivregress   943
ivtobit . . . Tobit model with continuous endogenous regressors   961
ivtobit postestimation . . . Postestimation tools for ivtobit   971
jackknife . . . Jackknife estimation   975
jackknife postestimation . . . Postestimation tools for jackknife   987
kappa . . . Interrater agreement   988
kdensity . . . Univariate kernel density estimation   1002
ksmirnov . . . Kolmogorov–Smirnov equality-of-distributions test   1012
kwallis . . . Kruskal–Wallis equality-of-populations rank test   1016
ladder . . . Ladder of powers   1019
level . . . Set default confidence level   1026
limits . . . Quick reference for limits   1028
lincom . . . Linear combinations of estimators   1033
linktest . . . Specification link test for single-equation models   1041
lnskew0 . . . Find zero-skewness log or Box–Cox transform   1047
log . . . Echo copy of session to file   1051
logistic . . . Logistic regression, reporting odds ratios   1055
logistic postestimation . . . Postestimation tools for logistic   1067
logit . . . Logistic regression, reporting coefficients   1077
logit postestimation . . . Postestimation tools for logit   1090
loneway . . . Large one-way ANOVA, random effects, and reliability   1096
lowess . . . Lowess smoothing   1102
lpoly . . . Kernel-weighted local polynomial smoothing   1108
lroc . . . Compute area under ROC curve and graph the curve   1118
lrtest . . . Likelihood-ratio test after estimation   1124
lsens . . . Graph sensitivity and specificity versus probability cutoff   1134
lv . . . Letter-value displays   1139
margins . . . Marginal means, predictive margins, and marginal effects   1145
margins postestimation . . . Postestimation tools for margins   1200
margins, contrast . . . Contrasts of margins   1202
margins, pwcompare . . . Pairwise comparisons of margins   1219
marginsplot . . . Graph results from margins (profile plots, etc.)   1224
matsize . . . Set the maximum number of variables in a model   1259
maximize . . . Details of iterative maximization   1261
mean . . . Estimate means   1268
mean postestimation . . . Postestimation tools for mean   1279
meta . . . Meta-analysis   1281
mfp . . . Multivariable fractional polynomial models   1283
mfp postestimation . . . Postestimation tools for mfp   1295
misstable . . . Tabulate missing values   1300
mkspline . . . Linear and restricted cubic spline construction   1308
ml . . . Maximum likelihood estimation   1314
mlexp . . . Maximum likelihood estimation of user-specified expressions   1341
mlexp postestimation . . . Postestimation tools for mlexp   1353
mlogit . . . Multinomial (polytomous) logistic regression   1355
mlogit postestimation . . . Postestimation tools for mlogit   1369
more . . . The —more— message   1379
mprobit . . . Multinomial probit regression   1381
mprobit postestimation . . . Postestimation tools for mprobit   1388
nbreg . . . Negative binomial regression   1391
nbreg postestimation . . . Postestimation tools for nbreg and gnbreg   1403
nestreg . . . Nested model statistics   1407
net . . . Install and manage user-written additions from the Internet   1413
net search . . . Search the Internet for installable packages   1431
netio . . . Control Internet connections   1435
news . . . Report Stata news   1438
nl . . . Nonlinear least-squares estimation   1440
nl postestimation . . . Postestimation tools for nl   1460
nlcom . . . Nonlinear combinations of estimators   1464
nlogit . . . Nested logit regression   1475
nlogit postestimation . . . Postestimation tools for nlogit   1497
nlsur . . . Estimation of nonlinear systems of equations   1502
nlsur postestimation . . . Postestimation tools for nlsur   1524
nptrend . . . Test for trend across ordered groups   1527
ologit . . . Ordered logistic regression   1531
ologit postestimation . . . Postestimation tools for ologit   1540
oneway . . . One-way analysis of variance   1544
oprobit . . . Ordered probit regression   1555
oprobit postestimation . . . Postestimation tools for oprobit   1560
orthog . . . Orthogonalize variables and compute orthogonal polynomials   1564
pcorr . . . Partial and semipartial correlation coefficients   1570
permute . . . Monte Carlo permutation tests   1573
pk . . . Pharmacokinetic (biopharmaceutical) data   1583
pkcollapse . . . Generate pharmacokinetic measurement dataset   1591
pkcross . . . Analyze crossover experiments   1594
pkequiv . . . Perform bioequivalence tests   1603
pkexamine . . . Calculate pharmacokinetic measures   1610
pkshape . . . Reshape (pharmacokinetic) Latin-square data   1616
pksumm . . . Summarize pharmacokinetic data   1624
poisson . . . Poisson regression   1629
poisson postestimation . . . Postestimation tools for poisson   1639
predict . . . Obtain predictions, residuals, etc., after estimation   1645
predictnl . . . Obtain nonlinear predictions, standard errors, etc., after estimation   1656
probit . . . Probit regression   1668
probit postestimation . . . Postestimation tools for probit   1681
proportion . . . Estimate proportions   1685
proportion postestimation . . . Postestimation tools for proportion   1691
prtest . . . Tests of proportions   1693
pwcompare . . . Pairwise comparisons   1698
pwcompare postestimation . . . Postestimation tools for pwcompare   1730
pwmean . . . Pairwise comparisons of means   1732
pwmean postestimation . . . Postestimation tools for pwmean   1744
qc . . . Quality control charts   1746
qreg . . . Quantile regression   1761
qreg postestimation . . . Postestimation tools for qreg, iqreg, sqreg, and bsqreg   1791
query . . . Display system parameters   1795
ranksum . . . Equality tests on unmatched data   1802
ratio . . . Estimate ratios   1809
ratio postestimation . . . Postestimation tools for ratio   1818
reg3 . . . Three-stage estimation for systems of simultaneous equations   1819
reg3 postestimation . . . Postestimation tools for reg3   1840
regress . . . Linear regression   1845
regress postestimation . . . Postestimation tools for regress   1870
regress postestimation diagnostic plots . . . Postestimation plots for regress   1905
regress postestimation time series . . . Postestimation tools for regress with time series   1924
#review . . . Review previous commands   1934
roc . . . Receiver operating characteristic (ROC) analysis   1935
roccomp . . . Tests of equality of ROC areas   1937
rocfit . . . Parametric ROC models   1949
rocfit postestimation . . . Postestimation tools for rocfit   1956
rocreg . . . Receiver operating characteristic (ROC) regression   1960
rocreg postestimation . . . Postestimation tools for rocreg   2013
rocregplot . . . Plot marginal and covariate-specific ROC curves after rocreg   2028
roctab . . . Nonparametric ROC analysis   2048
rologit . . . Rank-ordered logistic regression   2058
rologit postestimation . . . Postestimation tools for rologit   2075
rreg . . . Robust regression   2077
rreg postestimation . . . Postestimation tools for rreg   2084
runtest . . . Test for random order   2086
scobit . . . Skewed logistic regression   2092
scobit postestimation . . . Postestimation tools for scobit   2101
sdtest . . . Variance-comparison tests   2104
search . . . Search Stata documentation and other resources   2110
serrbar . . . Graph standard error bar chart   2116
set . . . Overview of system parameters   2119
set cformat . . . Format settings for coefficient tables   2131
set defaults . . . Reset system parameters to original Stata defaults   2134
set emptycells . . . Set what to do with empty cells in interactions   2136
set seed . . . Specify initial value of random-number seed   2137
set showbaselevels . . . Display settings for coefficient tables   2142
signrank . . . Equality tests on matched data   2151
simulate . . . Monte Carlo simulations   2157
sj . . . Stata Journal and STB installation instructions   2164
sktest . . . Skewness and kurtosis test for normality   2167
slogit . . . Stereotype logistic regression   2172
slogit postestimation . . . Postestimation tools for slogit   2185
smooth . . . Robust nonlinear smoother   2189
spearman . . . Spearman's and Kendall's correlations   2197
spikeplot . . . Spike plots and rootograms   2206
ssc . . . Install and uninstall packages from SSC   2210
stem . . . Stem-and-leaf displays   2218
stepwise . . . Stepwise estimation   2222
stored results . . . Stored results   2232
suest . . . Seemingly unrelated estimation   2237
summarize . . . Summary statistics   2255
sunflower . . . Density-distribution sunflower plots   2265
sureg . . . Zellner's seemingly unrelated regression   2271
sureg postestimation . . . Postestimation tools for sureg   2279
swilk . . . Shapiro–Wilk and Shapiro–Francia tests for normality   2282
symmetry . . . Symmetry and marginal homogeneity tests   2286
table . . . Flexible table of summary statistics   2294
tabstat . . . Compact table of summary statistics   2305
tabulate oneway . . . One-way table of frequencies   2310
tabulate twoway . . . Two-way table of frequencies   2318
tabulate, summarize() . . . One- and two-way tables of summary statistics   2335
test . . . Test linear hypotheses after estimation   2340
testnl . . . Test nonlinear hypotheses after estimation   2359
tetrachoric . . . Tetrachoric correlations for binary variables   2368
tnbreg . . . Truncated negative binomial regression   2378
tnbreg postestimation . . . Postestimation tools for tnbreg   2387
tobit . . . Tobit regression   2391
tobit postestimation . . . Postestimation tools for tobit   2398
total . . . Estimate totals   2403
total postestimation . . . Postestimation tools for total   2409
tpoisson . . . Truncated Poisson regression   2410
tpoisson postestimation . . . Postestimation tools for tpoisson   2418
translate . . . Print and translate logs   2421
truncreg . . . Truncated regression   2431
truncreg postestimation . . . Postestimation tools for truncreg   2438
ttest . . . t tests (mean-comparison tests)   2441
update . . . Check for official updates   2451
vce option . . . Variance estimators   2454
view . . . View files and logs   2459
vwls . . . Variance-weighted least squares   2462
vwls postestimation . . . Postestimation tools for vwls   2468
which . . . Display location and version for an ado-file   2470
xi . . . Interaction expansion   2472
zinb . . . Zero-inflated negative binomial regression   2482
zinb postestimation . . . Postestimation tools for zinb   2489
zip . . . Zero-inflated Poisson regression   2492
zip postestimation . . . Postestimation tools for zip   2499
Author index   2503
Subject index   2519

Cross-referencing the documentation

When reading this manual, you will find references to other Stata manuals. For example,

    [U] 26 Overview of Stata estimation commands
    [XT] xtabond
    [D] reshape

The first example is a reference to chapter 26, Overview of Stata estimation commands, in the User's Guide; the second is a reference to the xtabond entry in the Longitudinal-Data/Panel-Data Reference Manual; and the third is a reference to the reshape entry in the Data Management Reference Manual.
All the manuals in the Stata Documentation have a shorthand notation:

    [GSM]   Getting Started with Stata for Mac
    [GSU]   Getting Started with Stata for Unix
    [GSW]   Getting Started with Stata for Windows
    [U]     Stata User's Guide
    [R]     Stata Base Reference Manual
    [D]     Stata Data Management Reference Manual
    [G]     Stata Graphics Reference Manual
    [XT]    Stata Longitudinal-Data/Panel-Data Reference Manual
    [ME]    Stata Multilevel Mixed-Effects Reference Manual
    [MI]    Stata Multiple-Imputation Reference Manual
    [MV]    Stata Multivariate Statistics Reference Manual
    [PSS]   Stata Power and Sample-Size Reference Manual
    [P]     Stata Programming Reference Manual
    [SEM]   Stata Structural Equation Modeling Reference Manual
    [SVY]   Stata Survey Data Reference Manual
    [ST]    Stata Survival Analysis and Epidemiological Tables Reference Manual
    [TS]    Stata Time-Series Reference Manual
    [TE]    Stata Treatment-Effects Reference Manual: Potential Outcomes/Counterfactual Outcomes
    [I]     Stata Glossary and Index
    [M]     Mata Reference Manual

Title

intro — Introduction to base reference manual

Description     Remarks and examples     Also see

Description

This entry describes the organization of the reference manuals.

Remarks and examples

The complete list of reference manuals is as follows:

    [R]     Stata Base Reference Manual
    [D]     Stata Data Management Reference Manual
    [G]     Stata Graphics Reference Manual
    [XT]    Stata Longitudinal-Data/Panel-Data Reference Manual
    [ME]    Stata Multilevel Mixed-Effects Reference Manual
    [MI]    Stata Multiple-Imputation Reference Manual
    [MV]    Stata Multivariate Statistics Reference Manual
    [PSS]   Stata Power and Sample-Size Reference Manual
    [P]     Stata Programming Reference Manual
    [SEM]   Stata Structural Equation Modeling Reference Manual
    [SVY]   Stata Survey Data Reference Manual
    [ST]    Stata Survival Analysis and Epidemiological Tables Reference Manual
    [TS]    Stata Time-Series Reference Manual
    [TE]    Stata Treatment-Effects Reference Manual: Potential Outcomes/Counterfactual Outcomes
    [I]     Stata Glossary and Index
    [M]     Mata Reference Manual

When we refer to "reference manuals", we mean all manuals listed above. When we refer to the specialty manuals, we mean all the manuals listed above except [R] and [I].

Arrangement of the reference manuals

Each manual contains the following sections:

• Contents. A table of contents can be found at the beginning of each manual.

• Cross-referencing the documentation. This entry lists all the manuals and explains how they are cross-referenced.

• Introduction. This entry—usually called intro—provides an overview of the manual. In the specialty manuals, this introduction suggests entries that you might want to read first and provides information about new features. Each specialty manual contains an overview of the commands described in it.

• Entries. Entries are arranged in alphabetical order. Most entries describe Stata commands, but some entries discuss concepts, and others provide overviews. Entries that describe estimation commands are followed by an entry discussing postestimation commands that are available for use after the estimation command. For example, the xtlogit entry in the [XT] manual is followed by the xtlogit postestimation entry.

• Index. An index can be found at the end of each manual. The Glossary and Index, [I], contains a subject table of contents for all the reference manuals and the User's Guide, a combined acronym glossary, a combined glossary, a vignette index, a combined author index, and a combined subject index for all the manuals.
To find information and commands quickly, use Stata's search command; see [R] search.

Arrangement of each entry

Entries in the Stata reference manuals, except the [M] and [SEM] manuals, generally contain the following sections, which are explained below:

    Syntax
    Menu
    Description
    Options
    Remarks and examples
    Stored results
    Methods and formulas
    References
    Also see

Syntax

A command's syntax diagram shows how to type the command, indicates all possible options, and gives the minimal allowed abbreviations for all the items in the command. For instance, the syntax diagram for the summarize command is

    summarize [varlist] [if] [in] [weight] [, options]

    options             Description
    -----------------------------------------------------------------------------
    Main
      detail            display additional statistics
      meanonly          suppress the display; calculate only the mean; programmer's option
      format            use variable's display format
      separator(#)      draw separator line after every # variables; default is separator(5)
      display_options   control spacing and base and empty cells
    -----------------------------------------------------------------------------
    varlist may contain factor variables; see [U] 11.4.3 Factor variables.
    varlist may contain time-series operators; see [U] 11.4.4 Time-series varlists.
    by is allowed; see [D] by.
    aweights, fweights, and iweights are allowed. However, iweights may not be used with the detail option; see [U] 11.1.6 weight.

Items in the typewriter-style font should be typed exactly as they appear in the diagram, although they may be abbreviated. Underlining indicates the shortest abbreviations where abbreviations are allowed. For instance, summarize may be abbreviated su, sum, summ, etc., or it may be spelled out completely. Items in the typewriter font that are not underlined may not be abbreviated.

Square brackets denote optional items. In the syntax diagram above, varlist, if, in, weight, and the options are optional. The options are listed in a table immediately following the diagram, along with a brief description of each.

Items typed in italics represent arguments for which you are to substitute variable names, observation numbers, and the like.
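As a quick illustration of this notation (the example is ours, not part of the original entry, and uses the auto dataset shipped with Stata only for concreteness), each of the following commands is a legal instance of the diagram above:

    . sysuse auto, clear
    . summarize mpg weight, detail
    . su mpg weight if foreign==1
    . sum mpg in 1/20, separator(0)

The last two commands abbreviate summarize to su and sum, add an if qualifier and an in range, and supply options after the comma, exactly as the bracketed items in the diagram allow.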
The diagrams use the following symbols:

    #            Indicates a literal number, for example, 5; see [U] 12.2 Numbers.
    [ ]          Anything enclosed in brackets is optional.
    { }          At least one of the items enclosed in braces must appear.
    |            The vertical bar separates alternatives.
    %fmt         Any Stata format, for example, %8.2f; see [U] 12.5 Formats: Controlling how data are displayed.
    depvar       The dependent variable in an estimation command; see [U] 20 Estimation and postestimation commands.
    exp          Any algebraic expression, for example, (5+myvar)/2; see [U] 13 Functions and expressions.
    filename     Any filename; see [U] 11.6 Filenaming conventions.
    indepvars    The independent variables in an estimation command; see [U] 20 Estimation and postestimation commands.
    newvar       A variable that will be created by the current command; see [U] 11.4.2 Lists of new variables.
    numlist      A list of numbers; see [U] 11.1.8 numlist.
    oldvar       A previously created variable; see [U] 11.4.1 Lists of existing variables.
    options      A list of options; see [U] 11.1.7 options.
    range        An observation range, for example, 5/20; see [U] 11.1.4 in range.
    "string"     Any string of characters enclosed in double quotes; see [U] 12.4 Strings.
    varlist      A list of variable names; see [U] 11.4 varlists. If varlist allows factor variables, a note to that effect will be shown below the syntax diagram; see [U] 11.4.3 Factor variables. If varlist allows time-series operators, a note to that effect will be shown below the syntax diagram; see [U] 11.4.4 Time-series varlists.
    varname      A variable name; see [U] 11.3 Naming conventions.
    weight       A [wgttype=exp] modifier; see [U] 11.1.6 weight and [U] 20.23 Weighted estimation.
    xvar         The variable to be displayed on the horizontal axis.
    yvar         The variable to be displayed on the vertical axis.

The Syntax section will indicate whether factor variables or time-series operators may be used with a command. summarize allows factor variables and time-series operators.

If a command allows prefix commands, this will be indicated immediately following the table of options. summarize allows by.

If a command allows weights, the types of weights allowed will be specified, with the default weight listed first. summarize allows aweights, fweights, and iweights, and if the type of weight is not specified, the default is aweights.

Menu

A menu indicates how the dialog box for the command may be accessed using the menu system.

Description

Following the syntax diagram is a brief description of the purpose of the command.

Options

If the command allows any options, they are explained here, and for dialog users the location of the options in the dialog is indicated. For instance, in the logistic entry in this manual, the Options section looks like this:

    Model          ...
    SE/Robust      ...
    Reporting      ...
    Maximization   ...

Remarks and examples

The explanations under Description and Options are exceedingly brief and technical; they are designed to provide a quick summary. The remarks explain in English what the preceding technical jargon means. Examples are used to illustrate the command.

Stored results

Commands are classified as e-class, r-class, s-class, or n-class, according to whether they store calculated results in e(), r(), s(), or not at all. These results can then be used in subroutines by other programs (ado-files). Such stored results are documented here; see [U] 18.8 Accessing results calculated by other programs and [U] 18.9 Accessing results calculated by estimation commands. (A brief illustration of retrieving stored results appears after the Also see description below.)

Methods and formulas

The techniques and formulas used in obtaining the results are described here as tersely and technically as possible.

References

Published sources are listed that either were directly referenced in the preceding text or might be of interest.

Also see

Other manual entries relating to this entry are listed that might also interest you.
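For instance (an illustration of ours, not part of the original entry), summarize is an r-class command, so its results can be retrieved immediately after it runs:

    . quietly summarize mpg
    . display r(mean)
    . return list

display r(mean) prints the mean that summarize just calculated, and return list shows every r() result the command left behind.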
Elizabeth L. ("Betty") Scott (1917–1988) was an astronomer and mathematician trained at the University of California at Berkeley. She published her first paper when she was just 22 years old, and her work was focused on comets for much of her early academic career. During World War II, Scott began working at the statistical laboratory at Berkeley, which had recently been established by Jerzy Neyman, sparking what would be a long and fruitful collaboration with him. After the war, she shifted her focus toward mathematics and statistics, partly because of limited career opportunities as an astronomer, though she still applied her research to astronomical topics. For example, in 1949 she published a paper using statistical techniques to analyze the distribution of binary star systems. She also published papers examining the distribution of galaxies, and she is the name behind the "Scott effect", which helps determine the distances to galaxies.

Later in her career, Scott applied her statistical knowledge to problems associated with ozone depletion and its effects on the incidence of skin cancer as well as weather modification. She was also a champion of equality for women graduate students and faculty. Among Scott's many awards and accomplishments, she was elected an honorary fellow of the Royal Statistical Society and was a fellow of the American Association for the Advancement of Science. In 1992, the Committee of Presidents of Statistical Societies established the Elizabeth L. Scott Award, a biannual award to recognize those who have strived to enhance the status of women within the statistics profession.

Also see

[U] 1.1 Getting Started with Stata

Title

about — Display information about your Stata

Syntax     Menu     Description     Remarks and examples     Also see

Syntax

    about

Menu

Help > About Stata

Description

about displays information about your version of Stata.

Remarks and examples

If you are running Stata for Windows, information about memory is also displayed:

    . about
    Stata/MP 13 for Windows (64-bit x86-64)
    Revision date
    Copyright 1985-2013 StataCorp LP

    Total physical memory:      8388608 KB
    Available physical memory:   937932 KB

    10-user 32-core Stata network perpetual license:
        Serial number:  5013041234
          Licensed to:  Alan R. Riley
                        StataCorp

Also see

[R] which — Display location and version for an ado-file
[U] 3 Resources for learning and using Stata
[U] 5 Flavors of Stata

Title

adoupdate — Update user-written ado-files

Syntax     Description     Options     Remarks and examples     Stored results     Also see

Syntax

    adoupdate [pkglist] [, options]

    options        Description
    -----------------------------------------------------------------------------
    update         perform update; default is to list packages that have updates, but not to update them
    all            include packages that might have updates; default is to list or update only packages that are known to have updates
    ssconly        check only packages obtained from SSC; default is to check all installed packages
    dir(dir)       check packages installed in dir; default is to check those installed in PLUS
    verbose        provide output to assist in debugging network problems
    -----------------------------------------------------------------------------

Description

User-written additions to Stata are called packages. These packages can add remarkable abilities to Stata. Packages are found and installed by using ssc, search, and net; see [R] ssc, [R] search, and [R] net.

User-written packages are updated by their developers, just as official Stata software is updated by StataCorp. To determine whether your official Stata software is up to date, and to update it if it is not, you use update; see [R] update. To determine whether your user-written additions are up to date, and to update them if they are not, you use adoupdate.

Options

update specifies that packages with updates be updated. The default is simply to list the packages that could be updated without actually performing the update.

The first time you adoupdate, do not specify this option. Once you see adoupdate work, you will be more comfortable with it. Then type

    . adoupdate, update

The packages that can be updated will be listed and updated.

all is rarely specified. Sometimes, adoupdate cannot determine whether a package you previously installed has been updated. adoupdate can determine that the package is still available over the web but is unsure whether the package has changed. Usually, the package has not changed, but if you want to be certain that you are using the latest version, reinstall from the source. Specifying all does this.
Typing

    . adoupdate, all

adds such packages to the displayed list as needing updating but does not update them. Typing

    . adoupdate, update all

lists such packages and updates them.

ssconly is a popular option. Many packages are available from the Statistical Software Components (SSC) archive—often called the Boston College Archive—which is provided at http://repec.org. Many users find most of what they want there. See [R] ssc for more information on the SSC.

ssconly specifies that adoupdate check only packages obtained from that source. Specifying this option is popular because SSC always provides distribution dates, and so adoupdate can be certain whether an update exists.

dir(dir) specifies which installed packages be checked. The default is dir(PLUS), and that is probably correct. If you are responsible for maintaining a large system, however, you may have previously installed packages in dir(SITE), where they are shared across users. See [P] sysdir for an explanation of these directory codewords. You may also specify an actual directory name, such as C:\mydir.

verbose is specified when you suspect network problems. It provides more detailed output that may help you diagnose the problem.

Remarks and examples

Do not confuse adoupdate with update. Use adoupdate to update user-written files. Use update to update the components (including ado-files) of the official Stata software. To use either command, you must be connected to the Internet.

Remarks are presented under the following headings:

    Using adoupdate
    Possible problem the first time you run adoupdate and the solution
    Notes for developers

Using adoupdate

The first time you try adoupdate, type

    . adoupdate

That is, do not specify the update option. adoupdate without update produces a report but does not update any files.

The first time you run adoupdate, you may see messages such as

    . adoupdate
    (note: package utx was installed more than once; older copy removed)
    (remaining output omitted)

Having the same packages installed multiple times is common; adoupdate cleans that up.

The second time you run adoupdate, pick one package to update. Suppose that the report indicates that package st0008 has an update available. Type

    . adoupdate st0008, update

You can specify one or many packages after the adoupdate command. You can even use wildcards such as st* to mean all packages that start with st or st*8 to mean all packages that start with st and end with 8. You can do that with or without the update option.

Finally, you can let adoupdate update all your user-written additions:

    . adoupdate, update
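The options described under Options above combine with these commands in the obvious way. For example (our illustration, not part of the original entry), to update only packages that came from SSC, or to check (without updating) packages installed in the site-wide directory, you could type

    . adoupdate, update ssconly
    . adoupdate, dir(SITE)

Both commands follow the syntax diagram shown at the start of this entry.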
adoupdate, however, needs the duplicates removed, mainly so that it does not keep checking the same files. The solution is to just let adoupdate run. adoupdate will run faster next time, when there are no (or just a few) duplicates. Notes for developers adoupdate reports whether an installed package is up to date by comparing its distribution date with that of the package available over the web. If you are distributing software, include the line d Distribution-Date: date somewhere in your .pkg file. The capitalization of Distribution-Date does not matter, but include the hyphen and the colon as shown. Code the date in either of two formats: all numeric: Stata standard: yyyymmdd, for example, 20120701 ddMONyyyy, for example, 01jul2012 Stored results adoupdate stores the following in r(): Macros r(pkglist) a space-separated list of package names that need updating (update not specified) or that were updated (update specified) adoupdate — Update user-written ado-files Also see [R] net — Install and manage user-written additions from the Internet [R] search — Search Stata documentation and other resources [R] ssc — Install and uninstall packages from SSC [R] update — Check for official updates 11 Title ameans — Arithmetic, geometric, and harmonic means Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Options Acknowledgments Syntax ameans varlist if in weight , options Description options Main add # to each variable in varlist add # only to variables with nonpositive values set confidence level; default is level(95) add(#) only level(#) by is allowed; see [D] by. aweights and fweights are allowed; see [U] 11.1.6 weight. Menu Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Arith./geometric/harmonic means Description ameans computes the arithmetic, geometric, and harmonic means, with their corresponding confidence intervals, for each variable in varlist or for all the variables in the data if varlist is not specified. gmeans and hmeans are synonyms for ameans. If you simply want arithmetic means and corresponding confidence intervals, see [R] ci. Options Main add(#) adds the value # to each variable in varlist before computing the means and confidence intervals. This option is useful when analyzing variables with nonpositive values. only modifies the action of the add(#) option so that it adds # only to variables with at least one nonpositive value. level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. 12 ameans — Arithmetic, geometric, and harmonic means 13 Remarks and examples Example 1 We have a dataset containing 8 observations on a variable named x. The eight values are 5, 4, −4, −5, 0, 0, missing, and 7. . ameans x Variable x Type Mean 7 3 3 1 5.192494 5.060241 Obs Mean 7 6 6 6 5.477226 3.540984 Arithmetic Geometric Harmonic . ameans x, add(5) Variable Type x Obs Arithmetic Geometric Harmonic [95% Conf. Interval] -3.204405 2.57899 3.023008 5.204405 10.45448 15.5179 [95% Conf. Interval] 1.795595 2.1096 . 10.2044 * 14.22071 * . * (*) 5 was added to the variables prior to calculating the results. Missing values in confidence intervals for harmonic mean indicate that confidence interval is undefined for corresponding variables. Consult Reference Manual for details. The number of observations displayed for the arithmetic mean is the number of nonmissing observations. 
The number of observations displayed for the geometric and harmonic means is the number of nonmissing, positive observations. Specifying the add(5) option produces 3 more positive observations. The confidence interval for the harmonic mean is not reported; see Methods and formulas below. Video example Descriptive statistics in Stata Stored results ameans stores the following in r(): Scalars r(N) r(N pos) r(mean) r(lb) r(ub) r(Var) r(mean g) r(lb g) r(ub g) r(Var g) r(mean h) r(lb h) r(ub h) r(Var h) number of nonmissing observations; used for arithmetic mean number of nonmissing positive observations; used for geometric and harmonic means arithmetic mean lower bound of confidence interval for arithmetic mean upper bound of confidence interval for arithmetic mean variance of untransformed data geometric mean lower bound of confidence interval for geometric mean upper bound of confidence interval for geometric mean variance of lnxi harmonic mean lower bound of confidence interval for harmonic mean upper bound of confidence interval for harmonic mean variance of 1/xi 14 ameans — Arithmetic, geometric, and harmonic means Methods and formulas See Armitage, Berry, and Matthews (2002) or Snedecor and Cochran (1989). For a history of the concept of the mean, see Plackett (1958). When restricted to the same set of values (that is, to positive values), the arithmetic mean (x) is greater than or equal to the geometric mean, which in turn is greater than or equal to the harmonic mean. Equality holds only if all values within a sample are equal to a positive constant. The arithmetic mean and its confidence interval are identical to those provided by ci; see [R] ci. To compute the geometric mean, ameans first creates uj = lnxj for all positive xj . The arithmetic mean of the uj and its confidence interval are then computed as in ci. Let u be the resulting mean, and let [ L, U ] be the corresponding confidence interval. The geometric mean is then exp(u), and its confidence interval is [ exp(L), exp(U ) ]. The same procedure is followed for the harmonic mean, except that then uj = 1/xj . The harmonic mean is then 1/u, and its confidence interval is [ 1/U, 1/L ] if L is greater than zero. If L is not greater than zero, this confidence interval is not defined, and missing values are reported. When weights are specified, ameans applies the weights to the transformed values, uj = lnxj and uj = 1/xj , respectively, when computing the geometric and harmonic means. For details on how the weights are used to compute the mean and variance of the uj , see [R] summarize. Without weights, the formula for the geometric mean reduces to n1 X o exp ln(xj ) n j Without weights, the formula for the harmonic mean is n X1 xj j Acknowledgments This improved version of ameans is based on the gmci command (Carlin, Vidmar, and Ramalheira 1998) and was written by John Carlin of the Murdoch Children’s Research Institute and the University of Melbourne; Suzanna Vidmar of the University of Melbourne; and Carlos Ramalheira of Coimbra University Hospital, Portugal. References Armitage, P., G. Berry, and J. N. S. Matthews. 2002. Statistical Methods in Medical Research. 4th ed. Oxford: Blackwell. Carlin, J. B., S. Vidmar, and C. Ramalheira. 1998. sg75: Geometric means and confidence intervals. Stata Technical Bulletin 41: 23–25. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 197–199. College Station, TX: Stata Press. Keynes, J. M. 1911. The principal averages and the laws of error which lead to them. 
Journal of the Royal Statistical Society 74: 322–331. Plackett, R. L. 1958. Studies in the history of probability and statistics: VII. The principle of the arithmetic mean. Biometrika 45: 130–135. Snedecor, G. W., and W. G. Cochran. 1989. Statistical Methods. 8th ed. Ames, IA: Iowa State University Press. Stigler, S. M. 1985. Arithmetric means. In Vol. 1 of Encyclopedia of Statistical Sciences, ed. S. Kotz and N. L. Johnson, 126–129. New York: Wiley. ameans — Arithmetic, geometric, and harmonic means Also see [R] ci — Confidence intervals for means, proportions, and counts [R] mean — Estimate means [R] summarize — Summary statistics [SVY] svy estimation — Estimation commands for survey data 15 Title anova — Analysis of variance and covariance Syntax Remarks and examples Menu Stored results Description References Options Also see Syntax anova varname termlist if in weight , options where termlist is a factor-variable list (see [U] 11.4.3 Factor variables) with the following additional features: • Variables are assumed to be categorical; use the c. factor-variable operator to override this. • The | symbol (indicating nesting) may be used in place of the # symbol (indicating interaction). • The / symbol is allowed after a term and indicates that the following term is the error term for the preceding terms. Description options Model repeated(varlist) partial sequential noconstant dropemptycells variables in terms that are repeated-measures variables use partial (or marginal) sums of squares use sequential sums of squares suppress constant term drop empty cells from the design matrix Adv. model bse(term) bseunit(varname) grouping(varname) between-subjects error term in repeated-measures ANOVA variable representing lowest unit in the between-subjects error term grouping variable for computing pooled covariance matrix bootstrap, by, fp, jackknife, and statsby are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. aweights and fweights are allowed; see [U] 11.1.6 weight. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Linear models and related > ANOVA/MANOVA > Analysis of variance and covariance Description The anova command fits analysis-of-variance (ANOVA) and analysis-of-covariance (ANCOVA) models for balanced and unbalanced designs, including designs with missing cells; for repeated-measures ANOVA; and for factorial, nested, or mixed designs. 16 anova — Analysis of variance and covariance 17 The regress command (see [R] regress) will display the coefficients, standard errors, etc., of the regression model underlying the last run of anova. If you want to fit one-way ANOVA models, you may find the oneway or loneway command more convenient; see [R] oneway and [R] loneway. If you are interested in MANOVA or MANCOVA, see [MV] manova. Options Model repeated(varlist) indicates the names of the categorical variables in the terms that are to be treated as repeated-measures variables in a repeated-measures ANOVA or ANCOVA. partial presents the ANOVA table using partial (or marginal) sums of squares. This setting is the default. Also see the sequential option. sequential presents the ANOVA table using sequential sums of squares. noconstant suppresses the constant term (intercept) from the ANOVA or regression model. dropemptycells drops empty cells from the design matrix. 
If c(emptycells) is set to keep (see [R] set emptycells), this option temporarily resets it to drop before running the ANOVA model. If c(emptycells) is already set to drop, this option does nothing. Adv. model bse(term) indicates the between-subjects error term in a repeated-measures ANOVA. This option is needed only in the rare case when the anova command cannot automatically determine the between-subjects error term. bseunit(varname) indicates the variable representing the lowest unit in the between-subjects error term in a repeated-measures ANOVA. This option is rarely needed because the anova command automatically selects the first variable listed in the between-subjects error term as the default for this option. grouping(varname) indicates a variable that determines which observations are grouped together in computing the covariance matrices that will be pooled and used in a repeated-measures ANOVA. This option is rarely needed because the anova command automatically selects the combination of all variables except the first (or as specified in the bseunit() option) in the between-subjects error term as the default for grouping observations. Remarks and examples Remarks are presented under the following headings: Introduction One-way ANOVA Two-way ANOVA N-way ANOVA Weighted data ANCOVA Nested designs Mixed designs Latin-square designs Repeated-measures ANOVA Video examples 18 anova — Analysis of variance and covariance Introduction anova uses least squares to fit the linear models known as ANOVA or ANCOVA (henceforth referred to simply as ANOVA models). If your interest is in one-way ANOVA, you may find the oneway command to be more convenient; see [R] oneway. Structural equation modeling provides a more general framework for fitting ANOVA models; see the Stata Structural Equation Modeling Reference Manual. ANOVA was pioneered by Fisher. It features prominently in his texts on statistical methods and his design of experiments (1925, 1935). Many books discuss ANOVA; see, for instance, Altman (1991); van Belle et al. (2004); Cobb (1998); Snedecor and Cochran (1989); or Winer, Brown, and Michels (1991). For a classic source, see Scheffé (1959). Kennedy and Gentle (1980) discuss ANOVA’s computing problems. Edwards (1985) is concerned primarily with the relationship between multiple regression and ANOVA. Acock (2014, chap. 9) illustrates his discussion with Stata output. Repeated-measures ANOVA is discussed in Winer, Brown, and Michels (1991); Kuehl (2000); and Milliken and Johnson (2009). Pioneering work in repeated-measures ANOVA can be found in Box (1954); Geisser and Greenhouse (1958); Huynh and Feldt (1976); and Huynh (1978). For a Stata-specific discussion of ANOVA contrasts, see Mitchell (2012, chap. 7–9). One-way ANOVA anova, entered without options, performs and reports standard ANOVA. For instance, to perform a one-way layout of a variable called endog on exog, you would type anova endog exog. Example 1: One-way ANOVA We run an experiment varying the amount of fertilizer used in growing apple trees. We test four concentrations, using each concentration in three groves of 12 trees each. Later in the year, we measure the average weight of the fruit. If all had gone well, we would have had 3 observations on the average weight for each of the four concentrations. Instead, two of the groves were mistakenly leveled by a confused man on a large bulldozer. We are left with the following data: . use http://www.stata-press.com/data/r13/apple (Apple trees) . 
list, abbrev(10) sepby(treatment) treatment weight 1. 2. 3. 1 1 1 117.5 113.8 104.4 4. 5. 6. 2 2 2 48.9 50.4 58.9 7. 8. 3 3 70.4 86.9 9. 10. 4 4 87.7 67.3 anova — Analysis of variance and covariance 19 To obtain one-way ANOVA results, we type . anova weight treatment Source Number of obs = 10 Root MSE = 9.07002 Partial SS df MS R-squared = 0.9147 Adj R-squared = 0.8721 F Prob > F Model 5295.54433 3 1765.18144 21.46 0.0013 treatment 5295.54433 3 1765.18144 21.46 0.0013 Residual 493.591667 6 82.2652778 Total 5789.136 9 643.237333 We find significant (at better than the 1% level) differences among the four concentrations. Although the output is a usual ANOVA table, let’s run through it anyway. Above the table is a summary of the underlying regression. The model was fit on 10 observations, and the root mean squared error (Root MSE) is 9.07. The R2 for the model is 0.9147, and the adjusted R2 is 0.8721. The first line of the table summarizes the model. The sum of squares (Partial SS) for the model is 5295.5 with 3 degrees of freedom (df). This line results in a mean square (MS) of 5295.5/3 ≈ 1765.2. The corresponding F statistic is 21.46 and has a significance level of 0.0013. Thus the model appears to be significant at the 0.13% level. The next line summarizes the first (and only) term in the model, treatment. Because there is only one term, the line is identical to that for the overall model. The third line summarizes the residual. The residual sum of squares is 493.59 with 6 degrees of freedom, resulting in a mean squared error of 82.27. The square root of this latter number is reported as the Root MSE. The model plus the residual sum of squares equals the total sum of squares, which is reported as 5789.1 in the last line of the table. This is the total sum of squares of weight after removal of the mean. Similarly, the model plus the residual degrees of freedom sum to the total degrees of freedom, 9. Remember that there are 10 observations. Subtracting 1 for the mean, we are left with 9 total degrees of freedom. Technical note Rather than using the anova command, we could have performed this analysis by using the oneway command. Example 1 in [R] oneway repeats this same analysis. You may wish to compare the output. Type regress to see the underlying regression model corresponding to an ANOVA model fit using the anova command. Example 2: Regression table from a one-way ANOVA Returning to the apple tree experiment, we found that the fertilizer concentration appears to significantly affect the average weight of the fruit. Although that finding is interesting, we next want to know which concentration appears to grow the heaviest fruit. One way to find out is by examining the underlying regression coefficients. 20 anova — Analysis of variance and covariance . regress, baselevels SS Source df MS Number of obs F( 3, 6) Prob > F R-squared Adj R-squared Root MSE = = = = = = 10 21.46 0.0013 0.9147 0.8721 9.07 Model Residual 5295.54433 493.591667 3 6 1765.18144 82.2652778 Total 5789.136 9 643.237333 weight Coef. treatment 1 2 3 4 0 -59.16667 -33.25 -34.4 (base) 7.405641 8.279758 8.279758 -7.99 -4.02 -4.15 0.000 0.007 0.006 -77.28762 -53.50984 -54.65984 -41.04572 -12.99016 -14.14016 _cons 111.9 5.236579 21.37 0.000 99.08655 124.7134 Std. Err. t P>|t| [95% Conf. Interval] See [R] regress for an explanation of how to read this table. The baselevels option of regress displays a row indicating the base category for our categorical variable, treatment. 
In summary, we find that concentration 1, the base (omitted) group, produces significantly heavier fruits than concentration 2, 3, and 4; concentration 2 produces the lightest fruits; and concentrations 3 and 4 appear to be roughly equivalent. Example 3: ANOVA replay We previously typed anova weight treatment to produce and display the ANOVA table for our apple tree experiment. Typing regress displays the regression coefficients. We can redisplay the ANOVA table by typing anova without arguments: . anova Source Number of obs = 10 Root MSE = 9.07002 Partial SS df MS R-squared = 0.9147 Adj R-squared = 0.8721 F Prob > F Model 5295.54433 3 1765.18144 21.46 0.0013 treatment 5295.54433 3 1765.18144 21.46 0.0013 Residual 493.591667 6 82.2652778 Total 5789.136 9 643.237333 Two-way ANOVA You can include multiple explanatory variables with the anova command, and you can specify interactions by placing ‘#’ between the variable names. For instance, typing anova y a b performs a two-way layout of y on a and b. Typing anova y a b a#b performs a full two-way factorial layout. The shorthand anova y a##b does the same. With the default partial sums of squares, when you specify interacted terms, the order of the terms does not matter. Typing anova y a b a#b is the same as typing anova y b a b#a. anova — Analysis of variance and covariance 21 Example 4: Two-way factorial ANOVA The classic two-way factorial ANOVA problem, at least as far as computer manuals are concerned, is a two-way ANOVA design from Afifi and Azen (1979). Fifty-eight patients, each suffering from one of three different diseases, were randomly assigned to one of four different drug treatments, and the change in their systolic blood pressure was recorded. Here are the data: Drug 1 Drug 2 Drug 3 Drug 4 Disease 1 42, 44, 36 13, 19, 22 28, 23, 34 42, 13 1, 29, 19 24, 9, 22 –2, 15 Disease 2 Disease 3 33, 26, 33 31, –3, 25 21 25, 24 34, 33, 31 3, 26, 28 36 32, 4, 16 11, 9, 7 21, 1, 9 1, –6 3 27, 12, 12 22, 7, 25 –5, 16, 15 5, 12 Let’s assume that we have entered these data into Stata and stored the data as systolic.dta. Below we use the data, list the first 10 observations, summarize the variables, and tabulate the control variables: . use http://www.stata-press.com/data/r13/systolic (Systolic Blood Pressure Data) . list in 1/10 drug disease systolic 1. 2. 3. 4. 5. 1 1 1 1 1 1 1 1 1 1 42 44 36 13 19 6. 7. 8. 9. 10. 1 1 1 1 1 1 2 2 2 2 22 33 26 33 21 . summarize Variable Obs Mean drug 58 2.5 disease 58 2.017241 systolic 58 18.87931 . tabulate drug disease Patient’s Disease Drug Used 1 2 Std. Dev. Min Max 1.158493 .8269873 12.80087 1 1 -6 4 3 44 3 Total 1 2 3 4 6 5 3 5 4 4 5 6 5 6 4 5 15 15 12 16 Total 19 19 20 58 22 anova — Analysis of variance and covariance Each observation in our data corresponds to one patient, and for each patient we record drug, disease, and the increase in the systolic blood pressure, systolic. The tabulation reveals that the data are not balanced — there are not equal numbers of patients in each drug – disease cell. Stata does not require that the data be balanced. We can perform a two-way factorial ANOVA by typing . 
anova systolic drug disease drug#disease Number of obs = 58 Root MSE = 10.5096 Source Partial SS df MS R-squared = 0.4560 Adj R-squared = 0.3259 F Prob > F Model 4259.33851 11 387.212591 3.51 0.0013 drug disease drug#disease 2997.47186 415.873046 707.266259 3 2 6 999.157287 207.936523 117.87771 9.05 1.88 1.07 0.0001 0.1637 0.3958 Residual 5080.81667 46 110.452536 Total 9340.15517 57 163.862371 Although Stata’s table command does not perform ANOVA, it can produce useful summary tables of your data (see [R] table): . table drug disease, c(mean systolic) row col f(%8.2f) Patient’s Disease 1 2 3 Total Drug Used 1 2 3 4 29.33 28.00 16.33 13.60 28.25 33.50 4.40 12.83 20.40 18.17 8.50 14.20 26.07 25.53 8.75 13.50 Total 22.79 18.21 15.80 18.88 These are simple means and are not influenced by our anova model. More useful is the margins command (see [R] margins) that provides marginal means and adjusted predictions. Because drug is the only significant factor in our ANOVA, we now examine the adjusted marginal means for drug. . margins drug, asbalanced Adjusted predictions Expression : Linear prediction, predict() at : drug (asbalanced) disease (asbalanced) Margin drug 1 2 3 4 25.99444 26.55556 9.744444 13.54444 Number of obs = 58 Delta-method Std. Err. t P>|t| [95% Conf. Interval] 2.751008 2.751008 3.100558 2.637123 9.45 9.65 3.14 5.14 0.000 0.000 0.003 0.000 20.45695 21.01806 3.503344 8.236191 31.53194 32.09305 15.98554 18.8527 These adjusted marginal predictions are not equal to the simple drug means (see the total column from the table command); they are based upon predictions from our ANOVA model. The asbalanced option of margins corresponds with the interpretation of the F statistic produced by ANOVA —each cell is given equal weight regardless of its sample size (see the following three technical notes). You anova — Analysis of variance and covariance 23 can omit the asbalanced option and obtain predictive margins that take into account the unequal sample sizes of the cells. . margins drug Predictive margins Expression : Linear prediction, predict() Margin drug 1 2 3 4 25.89799 26.41092 9.722989 13.55575 Number of obs = 58 Delta-method Std. Err. t P>|t| [95% Conf. Interval] 2.750533 2.742762 3.099185 2.640602 9.42 9.63 3.14 5.13 0.000 0.000 0.003 0.000 20.36145 20.89003 3.484652 8.24049 31.43452 31.93181 15.96132 18.871 Technical note How do you interpret the significance of terms like drug and disease in unbalanced data? If you are familiar with SAS, the sums of squares and the F statistic reported by Stata correspond to SAS type III sums of squares. (Stata can also calculate sequential sums of squares, but we will postpone that topic for now.) Let’s think in terms of the following table: Drug Drug Drug Drug 1 2 3 4 Disease 1 µ11 µ21 µ31 µ41 µ·1 Disease 2 µ12 µ22 µ32 µ42 µ·2 Disease 3 µ13 µ23 µ33 µ43 µ·3 µ1· µ2· µ3· µ4· µ·· In this table, µij is the mean increase in systolic blood pressure associated with drug i and disease j , while µi· is the mean for drug i, µ·j is the mean for disease j , and µ·· is the overall mean. If the data are balanced, meaning that there are equal numbers of observations going into the calculation of each mean µij , the row means, µi· , are given by µi· = µi1 + µi2 + µi3 3 In our case, the data are not balanced, but we define the µi· according to that formula anyway. 
The test for the main effect of drug is the test that µ1· = µ2· = µ3· = µ4· To be absolutely clear, the F test of the term drug, called the main effect of drug, is formally equivalent to the test of the three constraints: 24 anova — Analysis of variance and covariance µ11 + µ12 + µ13 µ21 + µ22 + µ23 = 3 3 µ11 + µ12 + µ13 µ31 + µ32 + µ33 = 3 3 µ11 + µ12 + µ13 µ41 + µ42 + µ43 = 3 3 In our data, we obtain a significant F statistic of 9.05 and thus reject those constraints. Technical note Stata can display the symbolic form underlying the test statistics it presents, as well as display other test statistics and their symbolic forms; see Obtaining symbolic forms in [R] anova postestimation. Here is the result of requesting the symbolic form for the main effect of drug in our data: . test drug, symbolic drug 1 -(r2+r3+r4) 2 r2 3 r3 4 r4 disease 1 0 2 0 3 0 drug#disease 1 1 -1/3 (r2+r3+r4) 1 2 -1/3 (r2+r3+r4) 1 3 -1/3 (r2+r3+r4) 2 1 1/3 r2 2 2 1/3 r2 2 3 1/3 r2 3 1 1/3 r3 3 2 1/3 r3 3 3 1/3 r3 4 1 1/3 r4 4 2 1/3 r4 4 3 1/3 r4 _cons 0 This says exactly what we said in the previous technical note. Technical note Saying that there is no main effect of a variable is not the same as saying that it has no effect at all. Stata’s ability to perform ANOVA on unbalanced data can easily be put to ill use. For example, consider the following table of the probability of surviving a bout with one of two diseases according to the drug administered to you: anova — Analysis of variance and covariance Drug 1 Drug 2 Disease 1 1 0 25 Disease 2 0 1 If you have disease 1 and are administered drug 1, you live. If you have disease 2 and are administered drug 2, you live. In all other cases, you die. This table has no main effects of either drug or disease, although there is a large interaction effect. You might now be tempted to reason that because there is only an interaction effect, you would be indifferent between the two drugs in the absence of knowledge about which disease infects you. Given an equal chance of having either disease, you reason that it does not matter which drug is administered to you — either way, your chances of surviving are 0.5. You may not, however, have an equal chance of having either disease. If you knew that disease 1 was 100 times more likely to occur in the population, and if you knew that you had one of the two diseases, you would express a strong preference for receiving drug 1. When you calculate the significance of main effects on unbalanced data, you must ask yourself why the data are unbalanced. If the data are unbalanced for random reasons and you are making predictions for a balanced population, the test of the main effect makes perfect sense. If, however, the data are unbalanced because the underlying populations are unbalanced and you are making predictions for such unbalanced populations, the test of the main effect may be practically — if not statistically — meaningless. Example 5: ANOVA with missing cells Stata can perform ANOVA not only on unbalanced populations, but also on populations that are so unbalanced that entire cells are missing. For instance, using our systolic blood pressure data, let’s refit the model eliminating the drug 1–disease 1 cell. Because anova follows the same syntax as all other Stata commands, we can explicitly specify the data to be used by typing the if qualifier at the end of the anova command. Here we want to use the data that are not for drug 1 and disease 1: . 
anova systolic drug##disease if !(drug==1 & disease==1) Number of obs = 52 R-squared = 0.4545 Root MSE = 10.1615 Adj R-squared = 0.3215 Partial SS df MS F Prob > F Source Model 3527.95897 10 352.795897 3.42 0.0025 drug disease drug#disease 2686.57832 327.792598 703.007602 3 2 5 895.526107 163.896299 140.60152 8.67 1.59 1.36 0.0001 0.2168 0.2586 Residual 4233.48333 41 103.255691 Total 7761.44231 51 152.185143 Here we used drug##disease as a shorthand for drug disease drug#disease. 26 anova — Analysis of variance and covariance Technical note The test of the main effect of drug in the presence of missing cells is more complicated than that for unbalanced data. Our underlying tableau now has the following form: Disease 1 Drug Drug Drug Drug 1 2 3 4 µ21 µ31 µ41 Disease 2 µ12 µ22 µ32 µ42 µ·2 Disease 3 µ13 µ23 µ33 µ43 µ·3 µ2· µ3· µ4· The hole in the drug 1–disease 1 cell indicates that the mean is unobserved. Considering the main effect of drug, the test is unchanged for the rows in which all the cells are defined: µ2· = µ3· = µ4· The first row, however, requires special attention. Here we want the average outcome for drug 1, which is averaged only over diseases 2 and 3, to be equal to the average values of all other drugs averaged over those same two diseases: µ22 + µ23 /2 + µ32 + µ33 /2 + µ42 + µ43 /2 µ12 + µ13 = 2 3 Thus the test contains three constraints: µ21 + µ22 + µ23 3 µ21 + µ22 + µ23 3 µ12 + µ13 2 = = = µ31 + µ32 + µ33 3 µ41 + µ42 + µ43 3 µ22 + µ23 + µ32 + µ33 + µ42 + µ43 6 Stata can calculate two types of sums of squares, partial and sequential. If you do not specify which sums of squares to calculate, Stata calculates partial sums of squares. The technical notes above have gone into great detail about the definition and use of partial sums of squares. Use the sequential option to obtain sequential sums of squares. Technical note Before we illustrate sequential sums of squares, consider one more feature of the partial sums. If you know how such things are calculated, you may worry that the terms must be specified in some particular order, that Stata would balk or, even worse, produce different results if you typed, say, anova drug#disease drug disease rather than anova drug disease drug#disease. We assure you that is not the case. When you type a model, Stata internally reorganizes the terms, forms the cross-product matrix, inverts it, converts the result to an upper-Hermite form, and then performs the hypothesis tests. As a final touch, Stata reports the results in the same order that you typed the terms. anova — Analysis of variance and covariance 27 Example 6: Sequential sums of squares We wish to estimate the effects on systolic blood pressure of drug and disease by using sequential sums of squares. We want to introduce disease first, then drug, and finally, the interaction of drug and disease: . anova systolic disease drug disease#drug, sequential Number of obs = 58 Root MSE = 10.5096 R-squared = Adj R-squared = 0.4560 0.3259 Source Seq. SS df Model 4259.33851 11 387.212591 3.51 0.0013 disease drug disease#drug 488.639383 3063.43286 707.266259 2 3 6 244.319691 1021.14429 117.87771 2.21 9.25 1.07 0.1210 0.0001 0.3958 Residual 5080.81667 46 110.452536 Total 9340.15517 57 163.862371 MS F Prob > F The F statistic on disease is now 2.21. When we fit this same model by using partial sums of squares, the statistic was 1.88. N-way ANOVA You may include high-order interaction terms, such as a third-order interaction between the variables A, B, and C, by typing A#B#C. 
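Equivalently, the ## factor-variable operator expands to all lower-order terms. As a minimal sketch with hypothetical variables y, A, B, and C, the following two commands fit the same model:

. anova y A B C A#B A#C B#C A#B#C
. anova y A##B##C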
Example 7: Three-way factorial ANOVA We wish to determine the operating conditions that maximize yield for a manufacturing process. There are three temperature settings, two chemical supply companies, and two mixing methods under investigation. Three observations are obtained for each combination of these three factors. . use http://www.stata-press.com/data/r13/manuf (manufacturing process data) . describe Contains data from http://www.stata-press.com/data/r13/manuf.dta obs: 36 manufacturing process data vars: 4 2 Jan 2013 13:28 size: 144 variable name temperature chemical method yield Sorted by: storage type byte byte byte byte display format value label %9.0g %9.0g %9.0g %9.0g temp supplier meth variable label machine temperature setting chemical supplier mixing method product yield 28 anova — Analysis of variance and covariance We wish to perform a three-way factorial ANOVA. We could type . anova yield temp chem temp#chem meth temp#meth chem#meth temp#chem#meth but prefer to use the ## factor-variable operator for brevity. . anova yield temp##chem##meth Number of obs = 36 Root MSE = 2.62996 Partial SS df MS Source R-squared = 0.5474 Adj R-squared = 0.3399 F Prob > F Model 200.75 11 18.25 2.64 0.0227 temperature chemical temperature#chemical method temperature#method chemical#method temperature#chemical# method 30.5 12.25 24.5 42.25 87.5 .25 2 1 2 1 2 1 15.25 12.25 12.25 42.25 43.75 .25 2.20 1.77 1.77 6.11 6.33 0.04 0.1321 0.1958 0.1917 0.0209 0.0062 0.8508 3.5 2 1.75 0.25 0.7785 Residual 166 24 6.91666667 Total 366.75 35 10.4785714 The interaction between temperature and method appears to be the important story in these data. A table of means for this interaction is given below. . table method temp, c(mean yield) row col f(%8.2f) mixing method machine temperature setting low medium high Total stir fold 7.50 5.50 6.00 9.00 6.00 11.50 6.50 8.67 Total 6.50 7.50 8.75 7.58 Here our ANOVA is balanced (each cell has the same number of observations), and we obtain the same values as in the table above (but with additional information such as confidence intervals) by using the margins command. Because our ANOVA is balanced, using the asbalanced option with margins would not produce different results. We request the predictive margins for the two terms that appear significant in our ANOVA: temperature#method and method. anova — Analysis of variance and covariance . margins temperature#method method Predictive margins Expression : Linear prediction, predict() Margin temperature# method low#stir low#fold medium#stir medium#fold high#stir high#fold method stir fold Delta-method Std. Err. t Number of obs = 29 36 P>|t| [95% Conf. Interval] 7.5 5.5 6 9 6 11.5 1.073675 1.073675 1.073675 1.073675 1.073675 1.073675 6.99 5.12 5.59 8.38 5.59 10.71 0.000 0.000 0.000 0.000 0.000 0.000 5.284044 3.284044 3.784044 6.784044 3.784044 9.284044 9.715956 7.715956 8.215956 11.21596 8.215956 13.71596 6.5 8.666667 .6198865 .6198865 10.49 13.98 0.000 0.000 5.220617 7.387284 7.779383 9.946049 We decide to use the folding method of mixing and a high temperature in our manufacturing process. Weighted data Like all estimation commands, anova can produce estimates on weighted data. See [U] 11.1.6 weight for details on specifying the weight. Example 8: Three-way factorial ANOVA on grouped data We wish to investigate the prevalence of byssinosis, a form of pneumoconiosis that can afflict workers exposed to cotton dust. We have data on 5,419 workers in a large cotton mill. 
We know whether each worker smokes, his or her race, and the dustiness of the work area. The variables are smokes smoker or nonsmoker in the last five years race white or other workplace 1 (most dusty), 2 (less dusty), 3 (least dusty) We wish to fit an ANOVA model explaining the prevalence of byssinosis according to a full factorial model of smokes, race, and workplace. The data are unbalanced. Moreover, although we have data on 5,419 workers, the data are grouped according to the explanatory variables, along with some other variables, resulting in 72 observations. For each observation, we know the number of workers in the group (pop), the prevalence of byssinosis (prob), and the values of the three explanatory variables. Thus we wish to fit a three-way factorial model on grouped data. We begin by showing a bit of the data, which are from Higgins and Koch (1977). 30 anova — Analysis of variance and covariance . use http://www.stata-press.com/data/r13/byssin (Byssinosis incidence) . describe Contains data from http://www.stata-press.com/data/r13/byssin.dta obs: 72 Byssinosis incidence vars: 5 19 Dec 2012 07:04 size: 864 variable name storage type display format value label smokes race workplace smokes race workplace int int int %8.0g %8.0g %8.0g pop prob int float %8.0g %9.0g variable label Smokes Race Dustiness of workplace Population size Prevalence of byssinosis Sorted by: . list in 1/5, abbrev(10) divider 1. 2. 3. 4. 5. smokes race workplace pop prob yes yes yes yes yes white white white other other most less least most less 40 74 260 164 88 .075 0 .0076923 .152439 0 The first observation in the data represents a group of 40 white workers who smoke and work in a “most” dusty work area. Of those 40 workers, 7.5% have byssinosis. The second observation represents a group of 74 white workers who also smoke but who work in a “less” dusty environment. None of those workers has byssinosis. Almost every Stata command allows weights. Here we want to weight the data by pop. We can, for instance, make a table of the number of workers by their smoking status and race: . tabulate smokes race [fw=pop] Race Smokes other white Total no yes 799 1,104 1,431 2,085 2,230 3,189 Total 1,903 3,516 5,419 The [fw=pop] at the end of the tabulate command tells Stata to count each observation as representing pop persons. When making the tally, tabulate treats the first observation as representing 40 workers, the second as representing 74 workers, and so on. Similarly, we can make a table of the dustiness of the workplace: anova — Analysis of variance and covariance . tabulate workplace [fw=pop] Dustiness of workplace Freq. Percent least less most 3,450 1,300 669 63.66 23.99 12.35 Total 5,419 100.00 31 Cum. 63.66 87.65 100.00 We can discover the average incidence of byssinosis among these workers by typing . summarize prob [fw=pop] Variable Obs prob 5419 Mean .0304484 Std. Dev. Min Max 0 .287037 .0567373 We discover that 3.04% of these workers have byssinosis. Across all cells, the byssinosis rates vary from 0 to 28.7%. Just to prove that there might be something here, let’s obtain the average incidence rates according to the dustiness of the workplace: . table workplace smokes race [fw=pop], c(mean prob) Dustiness of workplace least less most Race and Smokes other white no yes no .0107527 .02 .0820896 .0101523 .0081633 .1679105 .0081549 .0136612 .0833333 yes .0162774 .0143149 .2295082 Let’s now fit the ANOVA model. . 
anova prob workplace smokes race workplace#smokes workplace#race smokes#race > workplace#smokes#race [aweight=pop] (sum of wgt is 5.4190e+03) Number of obs = 65 R-squared = 0.8300 Root MSE = .025902 Adj R-squared = 0.7948 Source Partial SS df MS F Prob > F Model .173646538 11 .015786049 23.53 0.0000 workplace smokes race workplace#smokes workplace#race smokes#race workplace#smokes#race .097625175 .013030812 .001094723 .019690342 .001352516 .001662874 .000950841 2 1 1 2 2 1 2 .048812588 .013030812 .001094723 .009845171 .000676258 .001662874 .00047542 72.76 19.42 1.63 14.67 1.01 2.48 0.71 0.0000 0.0001 0.2070 0.0000 0.3718 0.1214 0.4969 Residual .035557766 53 .000670901 Total .209204304 64 .003268817 Of course, if we want to see the underlying regression, we could type regress. Above we examined simple means of the cells of workplace#smokes#race. Our ANOVA shows workplace, smokes, and their interaction as being the only significant factors in our model. We now examine the predictive marginal mean byssinosis rates for these terms. 32 anova — Analysis of variance and covariance . margins workplace#smokes workplace smokes Predictive margins Expression : Linear prediction, predict() Margin Delta-method Std. Err. t Number of obs P>|t| = 65 [95% Conf. Interval] workplace# smokes least#no least#yes less#no less#yes most#no most#yes .0090672 .0141264 .0158872 .0121546 .0828966 .2078768 .0062319 .0053231 .009941 .0087353 .0182151 .012426 1.45 2.65 1.60 1.39 4.55 16.73 0.152 0.010 0.116 0.170 0.000 0.000 -.0034323 .0034497 -.0040518 -.0053662 .0463617 .1829533 .0215667 .0248032 .0358263 .0296755 .1194314 .2328003 workplace least less most .0120701 .0137273 .1566225 .0040471 .0065685 .0104602 2.98 2.09 14.97 0.004 0.041 0.000 .0039526 .0005526 .1356419 .0201875 .0269019 .177603 smokes no yes .0196915 .0358626 .0050298 .0041949 3.91 8.55 0.000 0.000 .0096029 .0274488 .02978 .0442765 Smoking combined with the most dusty workplace produces the highest byssinosis rates. Ronald Aylmer Fisher (1890–1962) (Sir Ronald from 1952) studied mathematics at Cambridge. Even before he finished his studies, he had published on statistics. He worked as a statistician at Rothamsted Experimental Station (1919–1933), as professor of eugenics at University College London (1933–1943), as professor of genetics at Cambridge (1943–1957), and in retirement at the CSIRO Division of Mathematical Statistics in Adelaide. His many fundamental and applied contributions to statistics and genetics mark him as one of the greatest statisticians of all time, including original work on tests of significance, distribution theory, theory of estimation, fiducial inference, and design of experiments. ANCOVA You can include multiple explanatory variables with the anova command, but unless you explicitly state otherwise by using the c. factor-variable operator, all the variables are interpreted as categorical variables. Using the c. operator, you can designate variables as continuous and thus perform ANCOVA. Example 9: ANCOVA (ANOVA with a continuous covariate) We have census data recording the death rate (drate) and median age (age) for each state. The dataset also includes the region of the country in which each state is located (region): anova — Analysis of variance and covariance . use http://www.stata-press.com/data/r13/census2 (1980 Census data by state) . summarize drate age region Obs Mean Std. Dev. 
Variable drate age region 50 50 50 84.3 29.5 2.66 Min Max 40 24 1 107 35 4 13.07318 1.752549 1.061574 33 age is coded in integral years from 24 to 35, and region is coded from 1 to 4, with 1 standing for the Northeast, 2 for the North Central, 3 for the South, and 4 for the West. When we examine the data more closely, we discover large differences in the death rate across regions of the country: . tabulate region, summarize(drate) Census Summary of Death Rate region Mean Std. Dev. Freq. NE N Cntrl South West 93.444444 88.916667 88.3125 68.769231 7.0553368 5.5833899 8.5457104 13.342625 9 12 16 13 Total 84.3 13.073185 50 Naturally, we wonder if these differences might not be explained by differences in the median ages of the populations. To find out, we fit a regression model (via anova) of drate on region and age. In the anova example below, we treat age as a categorical variable. . anova drate region age Source Number of obs = 50 Root MSE = 6.7583 Partial SS df MS R-squared = 0.7927 Adj R-squared = 0.7328 F Prob > F Model 6638.86529 11 603.533208 13.21 0.0000 region age 1320.00973 2237.24937 3 8 440.003244 279.656171 9.63 6.12 0.0001 0.0000 Residual 1735.63471 38 45.6745977 Total 8374.5 49 170.908163 We have the answer to our question: differences in median ages do not eliminate the differences in death rates across the four regions. The ANOVA table summarizes the two terms in the model, region and age. The region term contains 3 degrees of freedom, and the age term contains 8 degrees of freedom. Both are significant at better than the 1% level. The age term contains 8 degrees of freedom. Because we did not explicitly indicate that age was to be treated as a continuous variable, it was treated as categorical, meaning that unique coefficients were estimated for each level of age. The only clue of this labeling is that the number of degrees of freedom associated with the age term exceeds 1. The labeling becomes more obvious if we review the regression coefficients: 34 anova — Analysis of variance and covariance . regress, baselevels SS Source df MS Number of obs F( 11, 38) Prob > F R-squared Adj R-squared Root MSE = = = = = = 50 13.21 0.0000 0.7927 0.7328 6.7583 Model Residual 6638.86529 1735.63471 11 38 603.533208 45.6745977 Total 8374.5 49 170.908163 drate Coef. region NE N Cntrl South West 0 .4428387 -.2964637 -13.37147 (base) 3.983664 3.934766 4.195344 0.11 -0.08 -3.19 0.912 0.940 0.003 -7.621668 -8.261981 -21.8645 8.507345 7.669054 -4.878439 age 24 26 27 28 29 30 31 32 35 0 -15 14.30833 12.66011 18.861 20.87003 29.91307 27.02853 38.925 (base) 9.557677 7.857378 7.495513 7.28918 7.210148 8.242741 8.509432 9.944825 -1.57 1.82 1.69 2.59 2.89 3.63 3.18 3.91 0.125 0.076 0.099 0.014 0.006 0.001 0.003 0.000 -34.34851 -1.598099 -2.51376 4.104825 6.273847 13.22652 9.802089 18.79275 4.348506 30.21476 27.83399 33.61717 35.46621 46.59963 44.25498 59.05724 _cons 68.37147 7.95459 8.60 0.000 52.26824 84.47469 Std. Err. t P>|t| [95% Conf. Interval] The regress command displayed the anova model as a regression table. We used the baselevels option to display the dropped level (or base) for each term. If we want to treat age as a continuous variable, we must prepend c. to age in our anova. . 
anova drate region c.age Source Number of obs = 50 Root MSE = 7.21483 Partial SS df MS R-squared = 0.7203 Adj R-squared = 0.6954 F Prob > F Model 6032.08254 4 1508.02064 28.97 0.0000 region age 1645.66228 1630.46662 3 1 548.554092 1630.46662 10.54 31.32 0.0000 0.0000 Residual 2342.41746 45 52.0537213 Total 8374.5 49 170.908163 The age term now has 1 degree of freedom. The regression coefficients are anova — Analysis of variance and covariance . regress, baselevels SS Source df MS Number of obs F( 4, 45) Prob > F R-squared Adj R-squared Root MSE = = = = = = 50 28.97 0.0000 0.7203 0.6954 7.2148 Model Residual 6032.08254 2342.41746 4 45 1508.02064 52.0537213 Total 8374.5 49 170.908163 drate Coef. region NE N Cntrl South West 0 1.792526 .6979912 -13.37578 (base) 3.375925 3.18154 3.723447 0.53 0.22 -3.59 0.598 0.827 0.001 -5.006935 -5.70996 -20.87519 8.591988 7.105942 -5.876377 age _cons 3.922947 -28.60281 .7009425 21.93931 5.60 -1.30 0.000 0.199 2.511177 -72.79085 5.334718 15.58524 Std. Err. t P>|t| 35 [95% Conf. Interval] Although we started analyzing these data to explain the regional differences in death rate, let’s focus on the effect of age for a moment. In our first model, each level of age had a unique death rate associated with it. For instance, the predicted death rate in a north central state with a median age of 28 was 0.44 + 12.66 + 68.37 ≈ 81.47 whereas the predicted death rate from our current model is 1.79 + 3.92 × 28 − 28.60 ≈ 82.95 Our previous model had an R2 of 0.7927, whereas our current model has an R2 of 0.7203. This “small” loss of predictive power accompanies a gain of 7 degrees of freedom, so we suspect that the continuous-age model is as good as the discrete-age model. Technical note There is enough information in the two ANOVA tables to attach a statistical significance to our suspicion that the loss of predictive power is offset by the savings in degrees of freedom. Because the continuous-age model is nested within the discrete-age model, we can perform a standard Chow test. For those of us who know such formulas off the top of our heads, the F statistic is (2342.41746 − 1735.63471)/7 = 1.90 45.6745977 There is, however, a better way. We can find out whether our continuous model is as good as our discrete model by putting age in the model twice: once as a continuous variable and once as a categorical variable. The categorical variable will then measure deviations around the straight line implied by the continuous variable, and the F test for the significance of the categorical variable will test whether those deviations are jointly zero. 36 anova — Analysis of variance and covariance . anova drate region c.age age Number of obs = 50 Root MSE = 6.7583 Source Partial SS df MS R-squared = 0.7927 Adj R-squared = 0.7328 F Prob > F Model 6638.86529 11 603.533208 13.21 0.0000 region age age 1320.00973 699.74137 606.782747 3 1 7 440.003244 699.74137 86.6832496 9.63 15.32 1.90 0.0001 0.0004 0.0970 Residual 1735.63471 38 45.6745977 Total 8374.5 49 170.908163 We find that the F test for the significance of the (categorical) age variable is 1.90, just as we calculated above. It is significant at the 9.7% level. If we hold to a 5% significance level, we cannot reject the null hypothesis that the effect of age is linear. Example 10: Interaction of continuous and categorical variables In our census data, we still find significant differences across the regions after controlling for the median age of the population. 
We might now wonder whether the regional differences are differences in level — independent of age — or are instead differences in the regional effects of age. Just as we can interact categorical variables with other categorical variables, we can interact categorical variables with continuous variables. . anova drate region c.age region#c.age Number of obs = 50 Root MSE = 7.24852 Source Partial SS df MS R-squared = 0.7365 Adj R-squared = 0.6926 F Prob > F Model 6167.7737 7 881.110529 16.77 0.0000 region age region#age 188.713602 873.425599 135.691162 3 1 3 62.9045339 873.425599 45.2303874 1.20 16.62 0.86 0.3225 0.0002 0.4689 Residual 2206.7263 42 52.5411023 Total 8374.5 49 170.908163 The region#c.age term in our model measures the differences in slopes across the regions. We cannot reject the null hypothesis that there are no such differences. The region effect is now “insignificant”. This status does not mean that there are no regional differences in death rates because each test is a marginal or partial test. Here, with region#c.age included in the model, region is being tested at the point where age is zero. Apart from this value not existing in the dataset, it is also a long way from the mean value of age, so the test of region at this point is meaningless (although it is valid if you acknowledge what is being tested). To obtain a more sensible test of region, we can subtract the mean from the age variable and use this in the model. . quietly summarize age . generate mage = age - r(mean) anova — Analysis of variance and covariance . anova drate region c.mage region#c.mage Number of obs = 50 Root MSE = 7.24852 Partial SS df MS Source 37 R-squared = 0.7365 Adj R-squared = 0.6926 F Prob > F Model 6167.7737 7 881.110529 16.77 0.0000 region mage region#mage 1166.14735 873.425599 135.691162 3 1 3 388.715783 873.425599 45.2303874 7.40 16.62 0.86 0.0004 0.0002 0.4689 Residual 2206.7263 42 52.5411023 Total 8374.5 49 170.908163 region is significant when tested at the mean of the age variable. Remember that we can specify interactions by typing varname#varname. We have seen examples of interacting categorical variables with categorical variables and, in the examples above, a categorical variable (region) with a continuous variable (age or mage). We can also interact continuous variables with continuous variables. To include an age2 term in our model, we could type c.age#c.age. If we also wanted to interact the categorical variable region with the age2 term, we could type region#c.age#c.age (or even c.age#region#c.age). Nested designs In addition to specifying interaction terms, nested terms can also be specified in an ANOVA. A vertical bar is used to indicate nesting: A|B is read as A nested within B. A|B|C is read as A nested within B, which is nested within C. A|B#C is read as A is nested within the interaction of B and C. A#B|C is read as the interaction of A and B, which is nested within C. Different error terms can be specified for different parts of the model. The forward slash is used to indicate that the next term in the model is the error term for what precedes it. For instance, anova y A / B|A indicates that the F test for A is to be tested by using the mean square from B|A in the denominator. Error terms (terms following the slash) are generally not tested unless they are themselves followed by a slash. Residual error is the default error term. For example, consider A / B / C, where A, B, and C may be arbitrarily complex terms. Then anova will report A tested by B and B tested by C. 
If we add one more slash on the end to form A / B / C /, then anova will also report C tested by the residual error. Example 11: Simple nested ANOVA We have collected data from a manufacturer that is evaluating which of five different brands of machinery to buy to perform a particular function in an assembly line. Twenty assembly-line employees were selected at random for training on these machines, with four employees assigned to learn a particular machine. The output from each employee (operator) on the brand of machine for which he trained was measured during four trial periods. In this example, the operator is nested within machine. Because of sickness and employee resignations, the final data are not balanced. The following table gives the mean output and sample size for each machine and operator combination. . use http://www.stata-press.com/data/r13/machine, clear (machine data) 38 anova — Analysis of variance and covariance . table machine operator, c(mean output n output) col f(%8.2f) five brands of machine operator nested in machine 1 2 3 4 Total 1 9.15 2 9.48 4 8.27 3 8.20 4 8.75 13 2 15.03 3 11.55 2 11.45 2 11.52 4 12.47 11 3 11.27 3 10.13 3 11.13 3 4 16.10 3 18.97 3 15.35 4 5 15.30 4 14.35 4 10.43 3 10.84 9 16.60 3 16.65 13 13.63 11 Assuming that operator is random (that is, we wish to infer to the larger population of possible operators) and machine is fixed (that is, only these five machines are of interest), the typical test for machine uses operator nested within machine as the error term. operator nested within machine can be tested by residual error. Our earlier warning concerning designs with either unplanned missing cells or unbalanced cell sizes, or both, also applies to interpreting the ANOVA results from this unbalanced nested example. . anova output machine / operator|machine / Number of obs = 57 Root MSE = 1.47089 Source Partial SS df MS R-squared = 0.8661 Adj R-squared = 0.8077 F Prob > F Model 545.822288 17 32.1071934 14.84 0.0000 machine operator|machine 430.980792 101.353804 4 13 107.745198 7.79644648 13.82 0.0001 operator|machine 101.353804 13 7.79644648 3.60 0.0009 Residual 84.3766582 39 2.16350406 Total 630.198947 56 11.2535526 operator|machine is preceded by a slash, indicating that it is the error term for the terms before it (here machine). operator|machine is also followed by a slash that indicates it should be tested with residual error. The output lists the operator|machine term twice, once as the error term for machine, and again as a term tested by residual error. A line is placed in the ANOVA table to separate the two. In general, a dividing line is placed in the output to separate the terms into groups that are tested with the same error term. The overall model is tested by residual error and is separated from the rest of the table by a blank line at the top of the table. The results indicate that the machines are not all equal and that there are significant differences between operators. anova — Analysis of variance and covariance 39 Example 12: ANOVA with multiple levels of nesting Your company builds and operates sewage treatment facilities. You want to compare two particulate solutions during the particulate reduction step of the sewage treatment process. For each solution, two area managers are randomly selected to implement and oversee the change to the new treatment process in two of their randomly chosen facilities. Two workers at each of these facilities are trained to operate the new process. 
A measure of particulate reduction is recorded at various times during the month at each facility for each worker. The data are described below. . use http://www.stata-press.com/data/r13/sewage (Sewage treatment) . describe Contains data from http://www.stata-press.com/data/r13/sewage.dta obs: 64 Sewage treatment vars: 5 9 May 2013 12:43 size: 320 variable name particulate solution manager facility worker Sorted by: storage type display format byte byte byte byte byte %9.0g %9.0g %9.0g %9.0g %9.0g solution manager value label variable label particulate reduction 2 particulate solutions 2 managers per solution 2 facilities per manager 2 workers per facility facility worker You want to determine if the two particulate solutions provide significantly different particulate reduction. You would also like to know if manager, facility, and worker are significant effects. solution is a fixed factor, whereas manager, facility, and worker are random factors. In the following anova command, we use abbreviations for the variable names, which can sometimes make long ANOVA model statements easier to read. . anova particulate s / m|s / f|m|s / w|f|m|s /, dropemptycells Number of obs = 64 Root MSE = 12.7445 R-squared = Adj R-squared = Source Partial SS df Model 13493.6094 15 899.573958 5.54 0.0000 solution manager|solution 7203.76563 838.28125 1 2 7203.76563 419.140625 17.19 0.0536 manager|solution facility|manager| solution 838.28125 2 419.140625 0.55 0.6166 3064.9375 4 766.234375 3064.9375 4 766.234375 2.57 0.1193 2386.625 8 298.328125 worker|facility| manager|solution 2386.625 8 298.328125 1.84 0.0931 Residual 7796.25 48 162.421875 Total 21289.8594 63 337.934276 facility|manager| solution worker|facility| manager|solution MS F 0.6338 0.5194 Prob > F 40 anova — Analysis of variance and covariance While solution is not declared significant at the 5% significance level, it is near enough to that threshold to warrant further investigation (see example 3 in [R] anova postestimation for a continuation of the analysis of these data). Technical note Why did we use the dropemptycells option with the previous anova? By default, Stata retains empty cells when building the design matrix and currently treats | and # the same in how it determines the possible number of cells. Retaining empty cells in an ANOVA with nested terms can cause your design matrix to become too large. In example 12, there are 1024 = 2 × 4 × 8 × 16 cells that are considered possible for the worker|facility|manager|solution term because the worker, facility, and manager variables are uniquely numbered. With the dropemptycells option, the worker|facility|manager|solution term requires just 16 columns in the design matrix (corresponding to the 16 unique workers). Why did we not use the dropemptycells option in example 11, where operator is nested in machine? If you look at the table presented at the beginning of that example, you will see that operator is compactly instead of uniquely numbered (you need both operator number and machine number to determine the operator). Here the dropemptycells option would have only reduced our design matrix from 26 columns down to 24 columns (because there were only 3 operators instead of 4 for machines 3 and 5). We suggest that you specify dropemptycells when there are nested terms in your ANOVA. You could also use the set emptycells drop command to accomplish the same thing; see [R] set. Mixed designs An ANOVA can consist of both nested and crossed terms. A split-plot ANOVA design provides an example. 
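In skeleton form, such a model might be specified as in the following minimal sketch, where y, a, plot, and b are hypothetical variables: a is the whole-plot factor, plot identifies the whole plots, and b is the split-plot factor. The whole-plot factor a is tested against plot nested within a, while b and the a#b interaction are tested against the residual error by default; dropemptycells is specified because the model contains a nested term.

. anova y a / plot|a b a#b, dropemptycells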
Example 13: Split-plot ANOVA Two reading programs and three skill-enhancement techniques are under investigation. Ten classes of first-grade students were randomly assigned so that five classes were taught with one reading program and another five classes were taught with the other. The 30 students in each class were divided into six groups with 5 students each. Within each class, the six groups were divided randomly so that each of the three skill-enhancement techniques was taught to two of the groups within each class. At the end of the school year, a reading assessment test was administered to all the students. In this split-plot ANOVA, the whole-plot treatment is the two reading programs, and the split-plot treatment is the three skill-enhancement techniques. . use http://www.stata-press.com/data/r13/reading (Reading experiment data) anova — Analysis of variance and covariance 41 . describe Contains data from http://www.stata-press.com/data/r13/reading.dta obs: 300 Reading experiment data vars: 5 9 Mar 2013 18:57 size: 1,500 (_dta has notes) variable name score program class skill group storage type byte byte byte byte byte display format value label variable label %9.0g %9.0g %9.0g %9.0g %9.0g reading score reading program class nested in program skill enhancement technique group nested in class and skill Sorted by: In this split-plot ANOVA, the error term for program is class nested within program. The error term for skill and the program by skill interaction is the class by skill interaction nested within program. Other terms are also involved in the model and can be seen below. Our anova command is too long to fit on one line of this manual. Where we have chosen to break the command into multiple lines is arbitrary. If we were typing this command into Stata, we would just type along and let Stata automatically wrap across lines, as necessary. . anova score prog / class|prog skill prog#skill / class#skill|prog / > group|class#skill|prog /, dropemptycells Number of obs = 300 R-squared = 0.3738 Root MSE = 14.6268 Adj R-squared = 0.2199 Source Partial SS df MS F Prob > F Model 30656.5167 59 519.601977 2.43 0.0000 program class|program 4493.07 4116.61333 1 8 4493.07 514.576667 8.73 0.0183 skill program#skill class#skill|program 1122.64667 5694.62 5841.46667 2 2 16 561.323333 2847.31 365.091667 1.54 7.80 0.2450 0.0043 class#skill|program group|class#skill| program 5841.46667 16 365.091667 1.17 0.3463 9388.1 30 312.936667 group|class#skill| program 9388.1 30 312.936667 1.46 0.0636 Residual 51346.4 240 213.943333 Total 82002.9167 299 274.257246 The program#skill term is significant, as is the program term. Let’s look at the predictive margins for these two terms and at a marginsplot for the first term. 42 anova — Analysis of variance and covariance . margins, within(program skill) Predictive margins Expression : Linear prediction, predict() within : program skill Empty cells : reweight program#skill 1 1 1 2 1 3 2 1 2 2 2 3 Margin Delta-method Std. Err. 68.16 52.86 61.54 50.7 56.54 52.1 2.068542 2.068542 2.068542 2.068542 2.068542 2.068542 Number of obs t 32.95 25.55 29.75 24.51 27.33 25.19 = 300 P>|t| [95% Conf. Interval] 0.000 0.000 0.000 0.000 0.000 0.000 64.08518 48.78518 57.46518 46.62518 52.46518 48.02518 72.23482 56.93482 65.61482 54.77482 60.61482 56.17482 . 
marginsplot, plot2opts(lp(dash) m(D)) plot3opts(lp(dot) m(T)) Variables that uniquely identify margins: program skill 45 50 Linear Prediction 55 60 65 70 Predictive Margins with 95% CIs 1 2 reading program skill=1 skill=3 skill=2 . margins, within(program) Predictive margins Expression : Linear prediction, predict() within : program Empty cells : reweight Margin program 1 2 60.85333 53.11333 Delta-method Std. Err. 1.194273 1.194273 t 50.95 44.47 Number of obs = 300 P>|t| [95% Conf. Interval] 0.000 0.000 58.50074 50.76074 63.20593 55.46593 Because our ANOVA involves nested terms, we used the within() option of margins; see [R] margins. skill 2 produces a low score when combined with program 1 and a high score when combined with program 2, demonstrating the interaction between the reading program and the skill-enhancement anova — Analysis of variance and covariance 43 technique. You might conclude that the first reading program and the first skill-enhancement technique perform best when combined. However, notice the overlapping confidence interval for the first reading program and the third skill-enhancement technique. Technical note There are several valid ways to write complicated anova terms. In the reading experiment example (example 13), we had a term group|class#skill|program. This term can be read as group nested within both class and skill and further nested within program. You can also write this term as group|class#skill#program or group|program#class#skill or group|skill#class|program, etc. All variations will produce the same result. Some people prefer having only one ‘|’ in a term and would use group|class#skill#program, which is read as group nested within class, skill, and program. Gertrude Mary Cox (1900–1978) was born on a farm near Dayton, Iowa. Initially intending to become superintendent of an orphanage, she enrolled at Iowa State College. There she majored in mathematics and attained the college’s first Master’s degree in statistics. After working on her PhD in psychological statistics for two years at the University of California–Berkeley, she decided to go back to Iowa State to work with George W. Snedecor. There she pursued her interest in and taught a course in design of experiments. That work led to her collaboration with W. G. Cochran, which produced a classic text. In 1940, when Snedecor shared with her his list of men he was nominating to head the statistics department at North Carolina State College, she wanted to know why she had not been included. He added her name, she won the position, and she built an outstanding department at North Carolina State. Cox retired early so she could work at the Research Triangle Institute in North Carolina. She consulted widely, served as editor of Biometrics, and was elected to the National Academy of Sciences. Latin-square designs You can use anova to analyze a Latin-square design. Consider the following example, published in Snedecor and Cochran (1989). Example 14: Latin-square ANOVA Data from a Latin-square design are as follows: Row 1 2 3 4 5 Column 1 257(B) 245(D) 182(E) 203(A) 231(C) Column 2 230(E) 283(A) 252(B) 204(C) 271(D) Column 3 279(A) 245(E) 280(C) 227(D) 266(B) Column 4 287(C) 280(B) 246(D) 193(E) 334(A) Column 5 202(D) 260(C) 250(A) 259(B) 338(E) 44 anova — Analysis of variance and covariance In Stata, the data might appear as follows: . use http://www.stata-press.com/data/r13/latinsq . list 1. 2. 3. 4. 5. 
row c1 c2 c3 c4 c5 1 2 3 4 5 257 245 182 203 231 230 283 252 204 271 279 245 280 227 266 287 280 246 193 334 202 260 250 259 338 Before anova can be used on these data, the data must be organized so that the outcome measurement is in one column. reshape is inadequate for this task because there is information about the treatments in the sequence of these observations. pkshape is designed to reshape this type of data; see [R] pkshape. . pkshape row row c1-c5, order(beacd daebc ebcda acdeb cdbae) . list sequence outcome treat carry period 1. 2. 3. 4. 5. 1 2 3 4 5 257 245 182 203 231 1 5 2 3 4 0 0 0 0 0 1 1 1 1 1 6. 7. 8. 9. 10. 1 2 3 4 5 230 283 252 204 271 2 3 1 4 5 1 5 2 3 4 2 2 2 2 2 11. 12. 13. 14. 15. 1 2 3 4 5 279 245 280 227 266 3 2 4 5 1 2 3 1 4 5 3 3 3 3 3 16. 17. 18. 19. 20. 1 2 3 4 5 287 280 246 193 334 4 1 5 2 3 3 2 4 5 1 4 4 4 4 4 21. 22. 23. 24. 25. 1 2 3 4 5 202 260 250 259 338 5 4 3 1 2 4 1 5 2 3 5 5 5 5 5 anova — Analysis of variance and covariance . anova outcome sequence period treat Number of obs = 25 Root MSE = 32.4901 Partial SS df MS Source 45 R-squared = 0.6536 Adj R-squared = 0.3073 F Prob > F Model 23904.08 12 1992.00667 1.89 0.1426 sequence period treat 13601.36 6146.16 4156.56 4 4 4 3400.34 1536.54 1039.14 3.22 1.46 0.98 0.0516 0.2758 0.4523 Residual 12667.28 12 1055.60667 Total 36571.36 24 1523.80667 These methods will work with any type of Latin-square design, including those with replicated measurements. For more information, see [R] pk, [R] pkcross, and [R] pkshape. Repeated-measures ANOVA One approach for analyzing repeated-measures data is to use multivariate ANOVA (MANOVA); see [MV] manova. In this approach, the data are placed in wide form (see [D] reshape), and the repeated measures enter the MANOVA as dependent variables. A second approach for analyzing repeated measures is to use anova. However, one of the underlying assumptions for the F tests in ANOVA is independence of observations. In a repeated-measures design, this assumption is almost certainly violated or is at least suspect. In a repeated-measures ANOVA, the subjects (or whatever the experimental units are called) are observed for each level of one or more of the other categorical variables in the model. These variables are called the repeated-measure variables. Observations from the same subject are likely to be correlated. The approach used in repeated-measures ANOVA to correct for this lack of independence is to apply a correction to the degrees of freedom of the F test for terms in the model that involve repeated measures. This correction factor, , lies between the reciprocal of the degrees of freedom for the repeated term and 1. Box (1954) provided the pioneering work in this area. Milliken and Johnson (2009) refer to the lower bound of this correction factor as Box’s conservative correction factor. Winer, Brown, and Michels (1991) call it simply the conservative correction factor. Geisser and Greenhouse (1958) provide an estimate for the correction factor called the Greenhouse– Geisser . This value is estimated from the data. Huynh and Feldt (1976) show that the Greenhouse– Geisser tends to be conservatively biased. They provide a revised correction factor called the Huynh–Feldt . When the Huynh–Feldt exceeds 1, it is set to 1. Thus there is a natural ordering for these correction factors: Box’s conservative ≤ Greenhouse–Geisser ≤ Huynh–Feldt ≤ 1 A correction factor of 1 is the same as no correction. 
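In other words (writing ε for any of these correction factors), the corrections shrink the degrees of freedom of the F test: if a term involving repeated measures would ordinarily be tested with df1 numerator and df2 denominator degrees of freedom, the corrected p-value is obtained by referring the same F statistic to an F distribution with ε·df1 and ε·df2 degrees of freedom, so a smaller ε yields a more conservative test.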
anova with the repeated() option computes these correction factors and displays the revised test results in a table that follows the standard ANOVA table. In the resulting table, H-F stands for Huynh–Feldt, G-G stands for Greenhouse–Geisser, and Box stands for Box’s conservative . 46 anova — Analysis of variance and covariance Example 15: Repeated-measures ANOVA This example is taken from table 4.3 of Winer, Brown, and Michels (1991). The reaction time for five subjects each tested with four drugs was recorded in the variable score. Here is a table of the data (see [P] tabdisp if you are unfamiliar with tabdisp): . use http://www.stata-press.com/data/r13/t43, clear (T4.3 -- Winer, Brown, Michels) . tabdisp person drug, cellvar(score) person 1 1 2 3 4 5 30 14 24 38 26 drug 2 28 18 20 34 28 3 4 16 10 18 20 14 34 22 30 44 30 drug is the repeated variable in this simple repeated-measures ANOVA example. The ANOVA is specified as follows: . anova score person drug, repeated(drug) Number of obs = 20 Root MSE = 3.06594 R-squared = Adj R-squared = Source Partial SS df Model 1379 7 197 20.96 0.0000 person drug 680.8 698.2 4 3 170.2 232.733333 18.11 24.76 0.0001 0.0000 Residual 112.8 12 9.4 Total 1491.8 19 78.5157895 Between-subjects error term: Levels: Lowest b.s.e. variable: person 5 person MS F 0.9244 0.8803 Prob > F (4 df) Repeated variable: drug Huynh-Feldt epsilon *Huynh-Feldt epsilon reset Greenhouse-Geisser epsilon Box’s conservative epsilon Source df F Regular drug Residual 3 12 24.76 0.0000 = to = = Prob > F H-F G-G 0.0000 0.0006 1.0789 1.0000 0.6049 0.3333 Box 0.0076 Here the Huynh–Feldt is 1.0789, which is larger than 1. It is reset to 1, which is the same as making no adjustment to the standard test computed in the main ANOVA table. The Greenhouse–Geisser is 0.6049, and its associated p-value is computed from an F ratio of 24.76 using 1.8147 (= 3) and 7.2588 (= 12) degrees of freedom. Box’s conservative is set equal to the reciprocal of the degrees of freedom for the repeated term. Here it is 1/3, so Box’s conservative test is computed using 1 and 4 degrees of freedom for the observed F ratio of 24.76. anova — Analysis of variance and covariance 47 Even for Box’s conservative , drug is significant with a p-value of 0.0076. The following table gives the predictive marginal mean score (that is, response time) for each of the four drugs: . margins drug Predictive margins Expression : Linear prediction, predict() Margin drug 1 2 3 4 Delta-method Std. Err. 26.4 25.6 15.6 32 1.371131 1.371131 1.371131 1.371131 t 19.25 18.67 11.38 23.34 Number of obs = 20 P>|t| [95% Conf. Interval] 0.000 0.000 0.000 0.000 23.41256 22.61256 12.61256 29.01256 29.38744 28.58744 18.58744 34.98744 The ANOVA table for this example provides an F test for person, but you should ignore it. An appropriate test for person would require replication (that is, multiple measurements for person and drug combinations). Also, without replication there is no test available for investigating the interaction between person and drug. Example 16: Repeated-measures ANOVA with nesting Table 7.7 of Winer, Brown, and Michels (1991) provides another repeated-measures ANOVA example. There are four dial shapes and two methods for calibrating dials. Subjects are nested within calibration method, and an accuracy score is obtained. The data are shown below. . use http://www.stata-press.com/data/r13/t77 (T7.7 -- Winer, Brown, Michels) . 
tabdisp shape subject calib, cell(score) 2 methods for calibrating dials and subject nested in calib 1 2 1 2 3 1 2 3 4 dial shapes 1 2 3 4 0 0 5 3 3 1 5 4 4 3 6 2 4 2 7 8 5 4 6 6 7 5 8 9 The calibration method and dial shapes are fixed factors, whereas subjects are random. The appropriate test for calibration method uses the nested subject term as the error term. Both the dial shape and the interaction between dial shape and calibration method are tested with the dial shape by subject interaction nested within calibration method. Here we drop this term from the anova command, and it becomes residual error. The dial shape is the repeated variable because each subject is tested with all four dial shapes. Here is the anova command that produces the desired results: 48 anova — Analysis of variance and covariance . anova score calib / subject|calib shape calib#shape, repeated(shape) Number of obs = 24 R-squared = 0.8925 Root MSE = 1.11181 Adj R-squared = 0.7939 Partial SS df MS F Prob > F Source Model 123.125 11 11.1931818 9.06 0.0003 calib subject|calib 51.0416667 17.1666667 1 4 51.0416667 4.29166667 11.89 0.0261 shape calib#shape 47.4583333 7.45833333 3 3 15.8194444 2.48611111 12.80 2.01 0.0005 0.1662 Residual 14.8333333 12 1.23611111 Total 137.958333 23 5.99818841 Between-subjects error term: Levels: Lowest b.s.e. variable: Covariance pooled over: Repeated variable: shape subject|calib 6 (4 df) subject calib (for repeated variable) Huynh-Feldt epsilon = Greenhouse-Geisser epsilon = Box’s conservative epsilon = Prob > F Regular H-F G-G Source df F shape calib#shape Residual 3 3 12 12.80 2.01 0.0005 0.1662 0.0011 0.1791 0.0099 0.2152 0.8483 0.4751 0.3333 Box 0.0232 0.2291 The repeated-measure corrections are applied to any terms that are tested in the main ANOVA table and have the repeated variable in the term. These corrections are given in a table below the main ANOVA table. Here the repeated-measures tests for shape and calib#shape are presented. Calibration method is significant, as is dial shape. The interaction between calibration method and dial shape is not significant. The repeated-measure corrections do not change these conclusions, but they do change the significance level for the tests on shape and calib#shape. Here, though, unlike in the example 15, the Huynh–Feldt is less than 1. Here are the predictive marginal mean scores for calibration method and dial shapes. Because the interaction was not significant, we request only the calib and shape predictive margins. . margins, within(calib) Predictive margins Expression : Linear prediction, predict() within : calib Empty cells : reweight Margin calib 1 2 3 5.916667 Delta-method Std. Err. .3209506 .3209506 t 9.35 18.43 Number of obs = 24 P>|t| [95% Conf. Interval] 0.000 0.000 2.300709 5.217375 3.699291 6.615958 anova — Analysis of variance and covariance . margins, within(shape) Predictive margins Expression : Linear prediction, predict() within : shape Empty cells : reweight Margin shape 1 2 3 4 3.833333 2.5 6.166667 5.333333 Delta-method Std. Err. .4538926 .4538926 .4538926 .4538926 t 8.45 5.51 13.59 11.75 Number of obs = 49 24 P>|t| [95% Conf. Interval] 0.000 0.000 0.000 0.000 2.844386 1.511053 5.17772 4.344386 4.82228 3.488947 7.155614 6.32228 Technical note The computation of the Greenhouse–Geisser and Huynh–Feldt epsilons in a repeated-measures ANOVA requires the number of levels and degrees of freedom for the between-subjects error term, as well as a value computed from a pooled covariance matrix. 
The observations are grouped based on all but the lowest-level variable in the between-subjects error term. The covariance over the repeated variables is computed for each resulting group, and then these covariance matrices are pooled. The dimension of the pooled covariance matrix is the number of levels of the repeated variable (or combination of levels for multiple repeated variables). In example 16, there are four levels of the repeated variable (shape), so the resulting covariance matrix is 4 × 4. The anova command automatically attempts to determine the between-subjects error term and the lowest-level variable in the between-subjects error term to group the observations for computation of the pooled covariance matrix. anova issues an error message indicating that the bse() or bseunit() option is required when anova cannot determine them. You may override the default selections of anova by specifying the bse(), bseunit(), or grouping() option. The term specified in the bse() option must be a term in the ANOVA model. The default selection for the between-subjects error term (the bse() option) is the interaction of the nonrepeated categorical variables in the ANOVA model. The first variable listed in the between-subjects error term is automatically selected as the lowest-level variable in the between-subjects error term but can be overridden with the bseunit(varname) option. varname is often a term, such as subject or subsample within subject, and is most often listed first in the term because of the nesting notation of ANOVA. This term makes sense in most repeated-measures ANOVA designs when the terms of the model are written in standard form. For instance, in example 16, there were three categorical variables (subject, calib, and shape), with shape being the repeated variable. Here anova looked for a term involving only subject and calib to determine the between-subjects error term. It found subject|calib as the term with six levels and 4 degrees of freedom. anova then picked subject as the default for the bseunit() option (the lowest variable in the between-subjects error term) because it was listed first in the term. The grouping of observations proceeds, based on the different combinations of values of the variables in the between-subjects error term, excluding the lowest level variable (as found by default or as specified with the bseunit() option). You may specify the grouping() option to change the default grouping used in computing the pooled covariance matrix. The between-subjects error term, number of levels, degrees of freedom, lowest variable in the term, and grouping information are presented after the main ANOVA table and before the rest of the repeated-measures output. 50 anova — Analysis of variance and covariance Example 17: Repeated-measures ANOVA with two repeated variables Data with two repeated variables are given in table 7.13 of Winer, Brown, and Michels (1991). The accuracy scores of subjects making adjustments to three dials during three different periods are recorded. Three subjects are exposed to a certain noise background level, whereas a different set of three subjects is exposed to a different noise background level. Here is a table of accuracy scores for the noise, subject, period, and dial variables: . use http://www.stata-press.com/data/r13/t713 (T7.13 -- Winer, Brown, Michels) . 
tabdisp subject dial period, by(noise) cell(score) stubwidth(11) noise background and subject nested in noise 10 minute time periods and dial 2 3 1 2 3 1 1 1 2 1 2 3 45 35 60 53 41 65 60 50 75 40 30 58 52 37 54 57 47 70 1 2 3 50 42 56 48 45 60 61 55 77 25 30 40 34 37 39 51 43 57 3 2 3 28 25 40 37 32 47 46 41 50 16 22 31 23 27 29 35 37 46 1 2 noise, period, and dial are fixed, whereas subject is random. Both period and dial are repeated variables. The ANOVA for this example is specified next. anova — Analysis of variance and covariance . anova score noise / subject|noise period noise#period / > period#subject|noise dial noise#dial / > dial#subject|noise period#dial noise#period#dial, repeated(period dial) Source Number of obs = 54 Root MSE = 2.81859 Partial SS df MS R-squared = 0.9872 Adj R-squared = 0.9576 F Prob > F Model 9797.72222 37 264.803303 33.33 0.0000 noise subject|noise 468.166667 2491.11111 1 4 468.166667 622.777778 0.75 0.4348 period noise#period period#subject|noise 3722.33333 333 234.888889 2 2 8 1861.16667 166.5 29.3611111 63.39 5.67 0.0000 0.0293 dial noise#dial dial#subject|noise 2370.33333 50.3333333 105.555556 2 2 8 1185.16667 25.1666667 13.1944444 89.82 1.91 0.0000 0.2102 period#dial noise#period#dial 10.6666667 11.3333333 4 4 2.66666667 2.83333333 0.34 0.36 0.8499 0.8357 Residual 127.111111 16 7.94444444 Total 9924.83333 53 187.261006 Between-subjects error term: Levels: Lowest b.s.e. variable: Covariance pooled over: Repeated variable: period subject|noise 6 (4 df) subject noise (for repeated variables) Huynh-Feldt epsilon *Huynh-Feldt epsilon reset Greenhouse-Geisser epsilon Box’s conservative epsilon Source df F Regular period noise#period period#subject|noise 2 2 8 63.39 5.67 0.0000 0.0293 = to = = Prob > F H-F G-G 0.0000 0.0293 1.0668 1.0000 0.6476 0.5000 Box 0.0003 0.0569 0.0013 0.0759 Repeated variable: dial Huynh-Feldt epsilon *Huynh-Feldt epsilon reset Greenhouse-Geisser epsilon Box’s conservative epsilon Prob > F Regular H-F G-G Source df F dial noise#dial dial#subject|noise 2 2 8 89.82 1.91 0.0000 0.2102 0.0000 0.2102 = to = = 0.0000 0.2152 2.0788 1.0000 0.9171 0.5000 Box 0.0007 0.2394 51 52 anova — Analysis of variance and covariance Repeated variables: period#dial Huynh-Feldt epsilon *Huynh-Feldt epsilon reset Greenhouse-Geisser epsilon Box’s conservative epsilon Prob > F Regular H-F G-G Source df F period#dial noise#period#dial Residual 4 4 16 0.34 0.36 0.8499 0.8357 0.8499 0.8357 = to = = 0.7295 0.7156 1.3258 1.0000 0.5134 0.2500 Box 0.5934 0.5825 For each repeated variable and for each combination of interactions of repeated variables, there are different correction values. The anova command produces tables for each applicable combination. The two most significant factors in this model appear to be dial and period. The noise by period interaction may also be significant, depending on the correction factor you use. Below is a table of predictive margins for the accuracy score for dial, period, and noise by period. . margins, within(dial) Predictive margins Expression : Linear prediction, predict() within : dial Empty cells : reweight Margin dial 1 2 3 37.38889 42.22222 53.22222 Delta-method Std. Err. .6643478 .6643478 .6643478 t 56.28 63.55 80.11 . margins, within(period) Predictive margins Expression : Linear prediction, predict() within : period Empty cells : reweight Margin period 1 2 3 54.33333 44.5 34 Delta-method Std. Err. .6643478 .6643478 .6643478 t 81.78 66.98 51.18 Number of obs = 54 P>|t| [95% Conf. 
Interval] 0.000 0.000 0.000 35.98053 40.81387 51.81387 Number of obs 38.79724 43.63058 54.63058 = 54 P>|t| [95% Conf. Interval] 0.000 0.000 0.000 52.92498 43.09165 32.59165 55.74169 45.90835 35.40835 anova — Analysis of variance and covariance . margins, within(noise period) Predictive margins Expression : Linear prediction, predict() within : noise period Empty cells : reweight Margin noise#period 1 1 1 2 1 3 2 1 2 2 2 3 53.77778 49.44444 38.44444 54.88889 39.55556 29.55556 Delta-method Std. Err. .9395297 .9395297 .9395297 .9395297 .9395297 .9395297 t 57.24 52.63 40.92 58.42 42.10 31.46 Number of obs = 53 54 P>|t| [95% Conf. Interval] 0.000 0.000 0.000 0.000 0.000 0.000 51.78606 47.45273 36.45273 52.89717 37.56384 27.56384 55.76949 51.43616 40.43616 56.8806 41.54727 31.54727 Dial shape 3 produces the highest score, and scores decrease over the periods. Example 17 had two repeated-measurement variables. Up to four repeated-measurement variables may be specified in the anova command. Video examples Analysis of covariance in Stata Two-way ANOVA in Stata 54 anova — Analysis of variance and covariance Stored results anova stores the following in e(): Scalars e(N) e(mss) e(df m) e(rss) e(df r) e(r2) e(r2 a) e(F) e(rmse) e(ll) e(ll 0) e(ss #) e(df #) e(ssdenom #) e(dfdenom #) e(F #) e(N bse) e(df bse) e(box#) e(gg#) e(hf#) e(rank) Macros e(cmd) e(cmdline) e(depvar) e(varnames) e(term #) e(errorterm #) e(sstype) e(repvars) e(repvar#) e(model) e(wtype) e(wexp) e(properties) e(estat cmd) e(predict) e(asbalanced) e(asobserved) Matrices e(b) e(V) e(Srep) Functions e(sample) number of observations model sum of squares model degrees of freedom residual sum of squares residual degrees of freedom R-squared adjusted R-squared F statistic root mean squared error log likelihood log likelihood, constant-only model sum of squares for term # numerator degrees of freedom for term # denominator sum of squares for term # (when using nonresidual error) denominator degrees of freedom for term # (when using nonresidual error) F statistic for term # (if computed) number of levels of the between-subjects error term degrees of freedom for the between-subjects error term Box’s conservative epsilon for a particular combination of repeated variables (repeated() only) Greenhouse–Geisser epsilon for a particular combination of repeated variables (repeated() only) Huynh–Feldt epsilon for a particular combination of repeated variables (repeated() only) rank of e(V) anova command as typed name of dependent variable names of the right-hand-side variables term # error term for term # (when using nonresidual error) type of sum of squares; sequential or partial names of repeated variables (repeated() only) names of repeated variables for a particular combination (repeated() only) ols weight type weight expression b V program used to implement estat program used to implement predict factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector variance–covariance matrix of the estimators covariance matrix based on repeated measures (repeated() only) marks estimation sample References Acock, A. C. 2014. A Gentle Introduction to Stata. 4th ed. College Station, TX: Stata Press. Afifi, A. A., and S. P. Azen. 1979. Statistical Analysis: A Computer Oriented Approach. 2nd ed. New York: Academic Press. anova — Analysis of variance and covariance 55 Altman, D. G. 1991. Practical Statistics for Medical Research. London: Chapman & Hall/CRC. Anderson, R. L. 1990. Gertrude Mary Cox 1900–1978. 
Biographical Memoirs, National Academy of Sciences 59: 116–132. Box, G. E. P. 1954. Some theorems on quadratic forms applied in the study of analysis of variance problems, I. Effect of inequality of variance in the one-way classification. Annals of Mathematical Statistics 25: 290–302. Box, J. F. 1978. R. A. Fisher: The Life of a Scientist. New York: Wiley. Chatfield, M., and A. P. Mander. 2009. The Skillings–Mack test (Friedman test when there are missing data). Stata Journal 9: 299–305. Cobb, G. W. 1998. Introduction to Design and Analysis of Experiments. New York: Springer. Edwards, A. L. 1985. Multiple Regression and the Analysis of Variance and Covariance. 2nd ed. New York: Freeman. Fisher, R. A. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd. . 1935. The Design of Experiments. Edinburgh: Oliver & Boyd. . 1990. Statistical Methods, Experimental Design, and Scientific Inference. Oxford: Oxford University Press. Geisser, S., and S. W. Greenhouse. 1958. An extension of Box’s results on the use of the F distribution in multivariate analysis. Annals of Mathematical Statistics 29: 885–891. Gleason, J. R. 1999. sg103: Within subjects (repeated measures) ANOVA, including between subjects factors. Stata Technical Bulletin 47: 40–45. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 236–243. College Station, TX: Stata Press. . 2000. sg132: Analysis of variance from summary statistics. Stata Technical Bulletin 54: 42–46. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 328–332. College Station, TX: Stata Press. Hall, N. S. 2010. Ronald Fisher and Gertrude Cox: Two statistical pioneers sometimes cooperate and sometimes collide. American Statistician 64: 212–220. Higgins, J. E., and G. G. Koch. 1977. Variable selection and generalized chi-square analysis of categorical data applied to a large cross-sectional occupational health survey. International Statistical Review 45: 51–62. Huynh, H. 1978. Some approximate tests for repeated measurement designs. Psychometrika 43: 161–175. Huynh, H., and L. S. Feldt. 1976. Estimation of the Box correction for degrees of freedom from sample data in randomized block and split-plot designs. Journal of Educational Statistics 1: 69–82. Kennedy, W. J., Jr., and J. E. Gentle. 1980. Statistical Computing. New York: Dekker. Kuehl, R. O. 2000. Design of Experiments: Statistical Principles of Research Design and Analysis. 2nd ed. Belmont, CA: Duxbury. Marchenko, Y. V. 2006. Estimating variance components in Stata. Stata Journal 6: 1–21. Milliken, G. A., and D. E. Johnson. 2009. Analysis of Messy Data, Volume 1: Designed Experiments. 2nd ed. Boca Raton, FL: CRC Press. Mitchell, M. N. 2012. Interpreting and Visualizing Regression Models Using Stata. College Station, TX: Stata Press. Scheffé, H. 1959. The Analysis of Variance. New York: Wiley. Snedecor, G. W., and W. G. Cochran. 1989. Statistical Methods. 8th ed. Ames, IA: Iowa State University Press. van Belle, G., L. D. Fisher, P. J. Heagerty, and T. S. Lumley. 2004. Biostatistics: A Methodology for the Health Sciences. 2nd ed. New York: Wiley. Winer, B. J., D. R. Brown, and K. M. Michels. 1991. Statistical Principles in Experimental Design. 3rd ed. New York: McGraw–Hill. 
56 anova — Analysis of variance and covariance Also see [R] anova postestimation — Postestimation tools for anova [R] contrast — Contrasts and linear hypothesis tests after estimation [R] icc — Intraclass correlation coefficients [R] loneway — Large one-way ANOVA, random effects, and reliability [R] oneway — One-way analysis of variance [R] regress — Linear regression [MV] manova — Multivariate analysis of variance and covariance [PSS] power oneway — Power analysis for one-way analysis of variance [PSS] power repeated — Power analysis for repeated-measures analysis of variance [PSS] power twoway — Power analysis for two-way analysis of variance Stata Structural Equation Modeling Reference Manual Title anova postestimation — Postestimation tools for anova Description Menu for test after anova References Syntax for predict Options for test after anova Also see Syntax for test after anova Remarks and examples Description The following postestimation commands are of special interest after anova: Command Description dfbeta estat hettest estat imtest estat ovtest estat szroeter estat vif estat esize rvfplot avplot avplots cprplot acprplot rvpplot lvr2plot DFBETA influence statistics tests for heteroskedasticity information matrix test Ramsey regression specification-error test for omitted variables Szroeter’s rank test for heteroskedasticity variance inflation factors for the independent variables η 2 and ω 2 effect sizes residual-versus-fitted plot added-variable plot all added-variables plots in one image component-plus-residual plot augmented component-plus-residual plot residual-versus-predictor plot leverage-versus-squared-residual plot 57 58 anova postestimation — Postestimation tools for anova The following standard postestimation commands are also available: Command Description contrast estat ic estat summarize estat vce estimates hausman lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results Hausman’s specification test point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest margins marginsplot nlcom predict predictnl pwcompare suest test testnl Special-interest postestimation commands In addition to the common estat commands (see [R] estat), estat hettest, estat imtest, estat ovtest, estat szroeter, and estat vif are also available. dfbeta is also available. The syntax for dfbeta and these estat commands is the same as after regress; see [R] regress postestimation. For information on the plot commands, see [R] regress postestimation diagnostic plots. In addition to the standard syntax of test (see [R] test), test after anova has three additionally allowed syntaxes; see below. 
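For example, after fitting an ANOVA model, you might type the following (the new variable names yhat, res, and cook are arbitrary):
. predict double yhat
. predict double res, residuals
. predict double cook, cooksd
to obtain the linear prediction, the residuals, and Cook's D, respectively.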
test performs Wald tests of expressions involving the coefficients of the underlying regression model. Simple and composite linear hypotheses are possible. Syntax for predict predict after anova follows the same syntax as predict after regress and can provide predictions, residuals, standardized residuals, Studentized residuals, the standard error of the residuals, the standard error of the prediction, the diagonal elements of the projection (hat) matrix, and Cook’s D. See [R] regress postestimation for details. anova postestimation — Postestimation tools for anova 59 Syntax for test after anova In addition to the standard syntax of test (see [R] test), test after anova also allows the following: test, test(matname) mtest (opt) matvlc(matname) syntax a test, showorder / term term . . . , symbolic test term term . . . syntax a syntax b syntax c syntax b syntax c test expression involving the coefficients of the underlying regression model; you provide information as a matrix show underlying order of design matrix, which is useful when constructing matname argument of the test() option test effects and show symbolic forms Menu for test after anova Statistics > Linear models and related > ANOVA/MANOVA > Test linear hypotheses after anova Options for test after anova test(matname) is required with syntax a of test. The rows of matname specify linear combinations of the underlying design matrix of the ANOVA that are to be jointly tested. The columns correspond to the underlying design matrix (including the constant if it has not been suppressed). The column and row names of matname are ignored. A listing of the constraints imposed by the test() option is presented before the table containing the tests. You should examine this table to verify that you have applied the linear combinations you desired. Typing test, showorder allows you to examine the ordering of the columns for the design matrix from the ANOVA. mtest (opt) specifies that tests are performed for each condition separately. opt specifies the method for adjusting p-values for multiple testing. Valid values for opt are bonferroni holm sidak noadjust Bonferroni’s method Holm’s method Šidák’s method no adjustment is to be made Specifying mtest with no argument is equivalent to mtest(noadjust). matvlc(matname), a programmer’s option, saves the variance–covariance matrix of the linear combinations involved in the suite of tests. For the test Lb = c, what is returned in matname is LV L0 , where V is the estimated variance–covariance matrix of b. showorder causes test to list the definition of each column in the design matrix. showorder is not allowed with any other option. symbolic requests the symbolic form of the test rather than the test statistic. When this option is specified with no terms (test, symbolic), the symbolic form of the estimable functions is displayed. 60 anova postestimation — Postestimation tools for anova Remarks and examples Remarks are presented under the following headings: Testing effects Obtaining symbolic forms Testing coefficients and contrasts of margins Video example See examples 4, 7, 8, 13, 15, 16, and 17 in [R] anova for examples that use the margins command. Testing effects After fitting a model using anova, you can test for the significance of effects in the ANOVA table, as well as for effects that are not reported in the ANOVA table, by using the test or contrast command. You follow test or contrast by the list of effects that you wish to test. 
By default, these commands use the residual mean squared error in the denominator of the F ratio. You can specify other error terms by using the slash notation, just as you would with anova. See [R] contrast for details on this command. Example 1: Testing effects Recall our byssinosis example (example 8) in [R] anova: . anova prob workplace smokes race workplace#smokes workplace#race smokes#race > workplace#smokes#race [aweight=pop] (sum of wgt is 5.4190e+03) Number of obs = 65 R-squared = 0.8300 Root MSE = .025902 Adj R-squared = 0.7948 Partial SS df MS F Prob > F Source Model .173646538 11 .015786049 23.53 0.0000 workplace smokes race workplace#smokes workplace#race smokes#race workplace#smokes#race .097625175 .013030812 .001094723 .019690342 .001352516 .001662874 .000950841 2 1 1 2 2 1 2 .048812588 .013030812 .001094723 .009845171 .000676258 .001662874 .00047542 72.76 19.42 1.63 14.67 1.01 2.48 0.71 0.0000 0.0001 0.2070 0.0000 0.3718 0.1214 0.4969 Residual .035557766 53 .000670901 Total .209204304 64 .003268817 We can easily obtain a test on a particular term from the ANOVA table. Here are two examples: . test smokes Source Partial SS df smokes Residual . test smokes#race Source .013030812 .035557766 1 53 Partial SS df smokes#race Residual .001662874 .035557766 1 53 MS .013030812 .000670901 MS .001662874 .000670901 F 19.42 F 2.48 Prob > F 0.0001 Prob > F 0.1214 Both of these tests use residual error by default and agree with the ANOVA table produced earlier. anova postestimation — Postestimation tools for anova 61 We could have performed these same tests with contrast: . contrast smokes Contrasts of marginal linear predictions Margins : asbalanced df F P>F smokes 1 19.42 0.0001 Denominator 53 . contrast smokes#race Contrasts of marginal linear predictions Margins : asbalanced df F P>F smokes#race 1 2.48 0.1214 Denominator 53 Technical note After anova, you can use the ‘/’ syntax in test or contrast to perform tests with a variety of non-σ 2 I error structures. However, in most unbalanced models, the mean squares are not independent and do not have equal expectations under the null hypothesis. Also, be warned that you assume responsibility for the validity of the test statistic. Example 2: Testing effects with different error terms We return to the nested ANOVA example (example 11) in [R] anova, where five brands of machinery were compared in an assembly line. We can obtain appropriate tests for the nested terms using test, even if we had run the anova command without initially indicating the proper error terms. . use http://www.stata-press.com/data/r13/machine (machine data) . anova output machine / operator|machine / Number of obs = 57 Root MSE = 1.47089 Source Partial SS df MS R-squared = 0.8661 Adj R-squared = 0.8077 F Prob > F Model 545.822288 17 32.1071934 14.84 0.0000 machine operator|machine 430.980792 101.353804 4 13 107.745198 7.79644648 13.82 0.0001 operator|machine 101.353804 13 7.79644648 3.60 0.0009 Residual 84.3766582 39 2.16350406 Total 630.198947 56 11.2535526 62 anova postestimation — Postestimation tools for anova In this ANOVA table, machine is tested with residual error. With this particular nested design, the appropriate error term for testing machine is operator nested within machine, which is easily obtained from test. . test machine / operator|machine Source Partial SS df machine operator|machine 430.980792 101.353804 4 13 MS F 107.745198 7.79644648 13.82 Prob > F 0.0001 This result from test matches what we obtained from our anova command. 
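Because contrast also accepts the slash notation described above, the same test could be requested by typing, for example,
. contrast machine / operator|machine
which should reproduce the F test of machine with operator|machine as the error term (output not shown).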
Example 3: Pooling terms when testing effects The other nested ANOVA example (example 12) in [R] anova was based on the sewage data. The ANOVA table is presented here again. As before, we will use abbreviations of variable names in typing the commands. . use http://www.stata-press.com/data/r13/sewage (Sewage treatment) . anova particulate s / m|s / f|m|s / w|f|m|s /, dropemptycells Number of obs = 64 Root MSE = 12.7445 R-squared = Adj R-squared = 0.6338 0.5194 Source Partial SS df Model 13493.6094 15 899.573958 5.54 0.0000 solution manager|solution 7203.76563 838.28125 1 2 7203.76563 419.140625 17.19 0.0536 manager|solution facility|manager| solution 838.28125 2 419.140625 0.55 0.6166 3064.9375 4 766.234375 3064.9375 4 766.234375 2.57 0.1193 2386.625 8 298.328125 worker|facility| manager|solution 2386.625 8 298.328125 1.84 0.0931 Residual 7796.25 48 162.421875 Total 21289.8594 63 337.934276 facility|manager| solution worker|facility| manager|solution MS F Prob > F In practice, it is often beneficial to pool nonsignificant nested terms to increase the power of tests on remaining terms. One rule of thumb is to allow the pooling of a term whose p-value is larger than 0.25. In this sewage example, the p-value for the test of manager is 0.6166. This value indicates that the manager effect is negligible and might be ignored. Currently, solution is tested by manager|solution, which has only 2 degrees of freedom. If we pool the manager and facility terms and use this pooled estimate as the error term for solution, we would have a term with 6 degrees of freedom. Below are two tests: a test of solution with the pooled manager and facility terms and a test of this pooled term by worker. anova postestimation — Postestimation tools for anova . test s / m|s f|m|s Source solution manager|solution facility|manager| solution . test m|s f|m|s / w|f|m|s Source manager|solution facility|manager| solution worker|facility|manager| solution Partial SS df MS F 7203.76563 1 7203.76563 3903.21875 6 650.536458 Partial SS df 3903.21875 6 650.536458 2386.625 8 298.328125 MS 11.07 F 2.18 63 Prob > F 0.0159 Prob > F 0.1520 In the first test, we included two terms after the forward slash (m|s and f|m|s). test after anova allows multiple terms both before and after the slash. The terms before the slash are combined and are then tested by the combined terms that follow the slash (or residual error if no slash is present). The p-value for solution using the pooled term is 0.0159. Originally, it was 0.0536. The increase in the power of the test is due to the increase in degrees of freedom for the pooled error term. We can get identical results if we drop manager from the anova model. (This dataset has unique numbers for each facility so that there is no confusion of facilities when manager is dropped.) . anova particulate s / f|s / w|f|s /, dropemptycells Number of obs = 64 Root MSE = 12.7445 Source Partial SS df MS R-squared = 0.6338 Adj R-squared = 0.5194 F Prob > F Model 13493.6094 15 899.573958 5.54 0.0000 solution facility|solution 7203.76563 3903.21875 1 6 7203.76563 650.536458 11.07 0.0159 facility|solution worker|facility| solution 3903.21875 6 650.536458 2.18 0.1520 2386.625 8 298.328125 worker|facility| solution 2386.625 8 298.328125 1.84 0.0931 Residual 7796.25 48 162.421875 Total 21289.8594 63 337.934276 This output agrees with our earlier test results. In the following example, two terms from the anova are jointly tested (pooled). 
64 anova postestimation — Postestimation tools for anova Example 4: Obtaining overall significance of a term using contrast In example 10 of [R] anova, we fit the model anova drate region c.mage region#c.mage. Now we use the contrast command to test for the overall significance of region. . contrast region region#c.mage, overall Contrasts of marginal linear predictions Margins : asbalanced df F P>F region 3 7.40 0.0004 region#c.mage 3 0.86 0.4689 Overall 6 5.65 0.0002 Denominator 42 The overall F statistic associated with the region and region#c.mage terms is 5.65, and it is significant at the 0.02% level. In the ANOVA output, the region term, by itself, had a sum of squares of 1166.15, which, based on 3 degrees of freedom, yielded an F statistic of 7.40 and a significance level of 0.0004. This is the same test that is reported by contrast in the row labeled region. Likewise, the test from the ANOVA output for the region#c.mage term is reproduced in the second row of the contrast output. Obtaining symbolic forms test can produce the symbolic form of the estimable functions and symbolic forms for particular tests. Example 5: Symbolic form of the estimable functions After fitting an ANOVA model, we type test, symbolic to obtain the symbolic form of the estimable functions. For instance, returning to our blood pressure data introduced in example 4 of [R] anova, let’s begin by reestimating systolic on drug, disease, and drug#disease: . use http://www.stata-press.com/data/r13/systolic, clear (Systolic Blood Pressure Data) . anova systolic drug disease drug#disease Number of obs = 58 R-squared = 0.4560 Root MSE = 10.5096 Adj R-squared = 0.3259 Source Partial SS df MS F Prob > F Model 4259.33851 11 387.212591 3.51 0.0013 drug disease drug#disease 2997.47186 415.873046 707.266259 3 2 6 999.157287 207.936523 117.87771 9.05 1.88 1.07 0.0001 0.1637 0.3958 Residual 5080.81667 46 110.452536 Total 9340.15517 57 163.862371 anova postestimation — Postestimation tools for anova 65 To obtain the symbolic form of the estimable functions, type . test, symbolic drug 1 -(r2+r3+r4-r0) 2 r2 3 r3 4 r4 disease 1 -(r6+r7-r0) 2 r6 3 r7 drug#disease 1 1 -(r2+r3+r4+r6+r7-r12-r13-r15-r16-r18-r19-r0) 1 2 r6 - (r12+r15+r18) 1 3 r7 - (r13+r16+r19) 2 1 r2 - (r12+r13) 2 2 r12 2 3 r13 3 1 r3 - (r15+r16) 3 2 r15 3 3 r16 4 1 r4 - (r18+r19) 4 2 r18 4 3 r19 _cons r0 Example 6: Symbolic form for a particular test To obtain the symbolic form for a particular test, we type test term [term . . . ], symbolic. For instance, the symbolic form for the test of the main effect of drug is . test drug, symbolic drug 1 -(r2+r3+r4) 2 r2 3 r3 4 r4 disease 1 0 2 0 3 0 drug#disease 1 1 -1/3 (r2+r3+r4) 1 2 -1/3 (r2+r3+r4) 1 3 -1/3 (r2+r3+r4) 2 1 1/3 r2 2 2 1/3 r2 2 3 1/3 r2 3 1 1/3 r3 3 2 1/3 r3 3 3 1/3 r3 4 1 1/3 r4 4 2 1/3 r4 4 3 1/3 r4 _cons 0 66 anova postestimation — Postestimation tools for anova If we omit the symbolic option, we instead see the result of the test: . test drug Source Partial SS df drug Residual 2997.47186 5080.81667 3 46 MS 999.157287 110.452536 F 9.05 Prob > F 0.0001 Testing coefficients and contrasts of margins The test command allows you to perform tests directly on the coefficients of the underlying regression model. For instance, the coefficient on the third drug and the second disease is referred to as 3.drug#2.disease. This could also be written as i3.drug#i2.disease, or b[3.drug#2.disease], or even coef[i3.drug#i2.disease]; see [U] 13.5 Accessing coefficients and standard errors. 
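As a quick illustration (not part of the examples that follow), after the anova fit above you could display that single coefficient and its standard error directly:
. display _b[3.drug#2.disease]
. display _se[3.drug#2.disease]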
Example 7: Testing linear combinations of coefficients Let’s begin by testing whether the coefficient on the third drug is equal to the coefficient on the fourth in our blood pressure data. We have already fit the model anova systolic drug##disease (equivalent to anova systolic drug disease drug#disease), and you can see the results of that estimation in example 5. Even though we have performed many tasks since we fit the model, Stata still remembers, and we can perform tests at any time. . test 3.drug = 4.drug ( 1) 3.drug - 4.drug = 0 F( 1, 46) = 0.13 Prob > F = 0.7234 We find that the two coefficients are not significantly different, at least at any significance level smaller than 73%. For more complex tests, the contrast command often provides a more concise way to specify the test we are interested in and prevents us from having to write the tests in terms of the regression coefficients. With contrast, we instead specify our tests in terms of differences in the marginal means for the levels of a particular factor. For example, if we want to compare the third and fourth drugs, we can test the difference in the mean impact on systolic blood pressure separately for each disease using the @ operator. We also use the reverse adjacent operator, ar., to compare the fourth level of drug with the previous level. anova postestimation — Postestimation tools for anova 67 . contrast ar4.drug@disease Contrasts of marginal linear predictions Margins : asbalanced df F P>F drug@disease (4 vs 3) 1 (4 vs 3) 2 (4 vs 3) 3 Joint 1 1 1 3 0.13 1.76 0.65 0.85 0.7234 0.1917 0.4230 0.4761 Denominator 46 drug@disease (4 vs 3) 1 (4 vs 3) 2 (4 vs 3) 3 Contrast Std. Err. -2.733333 8.433333 5.7 7.675156 6.363903 7.050081 [95% Conf. Interval] -18.18262 -4.376539 -8.491077 12.71595 21.24321 19.89108 None of the individual contrasts shows significant differences between the third drug and the fourth drug. Likewise, the overall F statistic is 0.85, which is hardly significant. We cannot reject the hypothesis that the third drug has the same effect as the fourth drug. Technical note Alternatively, we could have specified these tests based on the coefficients of the underlying regression model using the test command. We would have needed to perform tests on the coefficients for drug and for the coefficients on drug interacted with disease in order to test for differences in the means mentioned above. To do this, we start with our previous test command: . test 3.drug = 4.drug Notice that the F statistic for this test is equivalent to the test labeled (4 vs 3) 1 in the contrast output. Let’s now add the constraint that the coefficient on the third drug interacted with the third disease is equal to the coefficient on the fourth drug, again interacted with the third disease. We do that by typing the new constraint and adding the accumulate option: . test 3.drug#3.disease = 4.drug#3.disease, accumulate ( 1) 3.drug - 4.drug = 0 ( 2) 3.drug#3.disease - 4.drug#3.disease = 0 F( 2, 46) = 0.39 Prob > F = 0.6791 So far, our test includes the equality of the two drug coefficients, along with the equality of the two drug coefficients when interacted with the third disease. Now we add two more equations, one for each of the remaining two diseases: 68 anova postestimation — Postestimation tools for anova . test ( 1) ( 2) ( 3) 3.drug#2.disease = 4.drug#2.disease, accumulate 3.drug - 4.drug = 0 3.drug#3.disease - 4.drug#3.disease = 0 3.drug#2.disease - 4.drug#2.disease = 0 F( 3, 46) = 0.85 Prob > F = 0.4761 . 
test 3.drug#1.disease = 4.drug#1.disease, accumulate ( ( ( ( 1) 2) 3) 4) 3.drug - 4.drug = 0 3.drug#3.disease - 4.drug#3.disease = 0 3.drug#2.disease - 4.drug#2.disease = 0 3o.drug#1b.disease - 4o.drug#1b.disease = 0 Constraint 4 dropped F( 3, 46) = 0.85 Prob > F = 0.4761 The overall F statistic reproduces the one from the joint test in the contrast output. You may notice that we also got the message “Constraint 4 dropped”. For the technically inclined, this constraint was unnecessary, given the normalization of the model. If we specify all the constraints involved in our test or use contrast, we need not worry about the normalization because Stata handles this automatically. The test() option of test provides another alternative for testing coefficients. Instead of spelling out each coefficient involved in the test, a matrix representing the test provides the needed information. test, showorder shows the order of the terms in the ANOVA corresponding to the order of the columns for the matrix argument of test(). Example 8: Another way to test linear combinations of coefficients We repeat the last test of example 7 above with the test() option. First, we view the definition and order of the columns underlying the ANOVA performed on the systolic data. anova postestimation — Postestimation tools for anova 69 . test, showorder Order of columns in the design matrix 1: (drug==1) 2: (drug==2) 3: (drug==3) 4: (drug==4) 5: (disease==1) 6: (disease==2) 7: (disease==3) 8: (drug==1)*(disease==1) 9: (drug==1)*(disease==2) 10: (drug==1)*(disease==3) 11: (drug==2)*(disease==1) 12: (drug==2)*(disease==2) 13: (drug==2)*(disease==3) 14: (drug==3)*(disease==1) 15: (drug==3)*(disease==2) 16: (drug==3)*(disease==3) 17: (drug==4)*(disease==1) 18: (drug==4)*(disease==2) 19: (drug==4)*(disease==3) 20: _cons Columns 1–4 correspond to the four levels of drug. Columns 5–7 correspond to the three levels of disease. Columns 8–19 correspond to the interaction of drug and disease. The last column corresponds to cons, the constant in the model. We construct the matrix dr3vs4 with the same four constraints as the last test shown in example 7 and then use the test(dr3vs4) option to perform the test. . matrix dr3vs4 = (0,0,1,-1, > 0,0,0, 0, > 0,0,0, 0, > 0,0,0, 0, 0,0,0, 0,0,0, 0,0,0, 0,0,0, 0,0,0,0,0,0,0,0,0, 0, 0, 0, 0,0,0,0,0,0,0,0,1, 0, 0,-1, 0,0,0,0,0,0,0,1,0, 0,-1, 0, 0,0,0,0,0,0,1,0,0,-1, 0, 0, 0 \ 0 \ 0 \ 0) . test, test(dr3vs4) ( 1) 3.drug - 4.drug = 0 ( 2) 3.drug#3.disease - 4.drug#3.disease = 0 ( 3) 3.drug#2.disease - 4.drug#2.disease = 0 ( 4) 3o.drug#1b.disease - 4o.drug#1b.disease = 0 Constraint 4 dropped F( 3, 46) = 0.85 Prob > F = 0.4761 Here the effort involved with spelling out the coefficients is similar to that of constructing a matrix and using it in the test() option. When the test involving coefficients is more complicated, the test() option may be more convenient than specifying the coefficients directly in test. However, as previously demonstrated, contrast may provide an even simpler method for testing the same hypothesis. After fitting an ANOVA model, various contrasts (1-degree-of-freedom tests comparing different levels of a categorical variable) are often of interest. contrast can perform each 1-degree-of-freedom test in addition to the combined test, even in cases in which the contrasts do not correspond to one of the contrast operators. 
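As a small sketch (with a hypothetical five-level factor f, not one of the variables used below), a single user-defined contrast comparing the first level with the average of the last two levels could be specified as
. contrast {f 2 0 0 -1 -1}, noeffects
Example 9 applies user-defined contrasts of this kind to real data.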
70 anova postestimation — Postestimation tools for anova Example 9: Testing particular contrasts of interest Rencher and Schaalje (2008) illustrate 1-degree-of-freedom contrasts for an ANOVA comparing the net weight of cans filled by five machines (labeled A–E). The data were originally obtained from Ostle and Mensing (1975). Rencher and Schaalje use a cell-means ANOVA model approach for this problem. We could do the same by using the noconstant option of anova; see [R] anova. Instead, we obtain the same results by using the standard overparameterized ANOVA approach (that is, we keep the constant in the model). . use http://www.stata-press.com/data/r13/canfill (Can Fill Data) . list, sepby(machine) machine weight 1. 2. 3. 4. A A A A 11.95 12.00 12.25 12.10 5. 6. B B 12.18 12.11 7. 8. 9. C C C 12.16 12.15 12.08 10. 11. 12. D D D 12.25 12.30 12.10 13. 14. 15. 16. E E E E 12.10 12.04 12.02 12.02 . anova weight machine Number of obs = 16 Root MSE = .087758 R-squared = Adj R-squared = MS F 0.4123 0.1986 Source Partial SS df Prob > F Model .059426993 4 .014856748 1.93 0.1757 machine .059426993 4 .014856748 1.93 0.1757 Residual .084716701 11 .007701518 Total .144143694 15 .00960958 The four 1-degree-of-freedom tests of interest among the five machines are A and D versus B, C, and E; B and E versus C; A versus D; and B versus E. We can specify these tests as user-defined contrasts by placing the corresponding contrast coefficients into positions related to the five levels of machine as described in User-defined contrasts of [R] contrast. anova postestimation — Postestimation tools for anova 71 . contrast {machine 3 -2 -2 3 -2} > {machine 0 1 -2 0 1} > {machine 1 0 0 -1 0} > {machine 0 1 0 0 -1}, noeffects Contrasts of marginal linear predictions Margins : asbalanced df F P>F machine (1) (2) (3) (4) Joint 1 1 1 1 4 0.75 0.31 4.47 1.73 1.93 0.4055 0.5916 0.0582 0.2150 0.1757 Denominator 11 contrast produces a 1-degree-of-freedom test for each of the specified contrasts as well as a joint test. We included the noeffects option so that the table displaying the values of the individual contrasts with their confidence intervals was suppressed. The significance values above are not adjusted for multiple comparisons. We could have produced the Bonferroni-adjusted significance values by using the mcompare(bonferroni) option. . contrast {machine 3 -2 -2 3 -2} > {machine 0 1 -2 0 1} > {machine 1 0 0 -1 0} > {machine 0 1 0 0 -1}, mcompare(bonferroni) noeffects Contrasts of marginal linear predictions Margins : asbalanced df F P>F machine (1) (2) (3) (4) Joint 1 1 1 1 4 0.75 0.31 4.47 1.73 1.93 0.4055 0.5916 0.0582 0.2150 0.1757 Denominator 11 Bonferroni P>F 1.0000 1.0000 0.2329 0.8601 Note: Bonferroni-adjusted p-values are reported for tests on individual contrasts only. Example 10: Linear and quadratic contrasts Here there are two factors, A and B, each with three levels. The levels are quantitative so that linear and quadratic contrasts are of interest. 72 anova postestimation — Postestimation tools for anova . use http://www.stata-press.com/data/r13/twowaytrend . anova Y A B A#B Number of obs = 36 Root MSE = 2.6736 Partial SS df MS Source R-squared = 0.9304 Adj R-squared = 0.9097 F Prob > F Model 2578.55556 8 322.319444 45.09 0.0000 A B A#B 2026.72222 383.722222 168.111111 2 2 4 1013.36111 191.861111 42.0277778 141.77 26.84 5.88 0.0000 0.0000 0.0015 Residual 193 27 7.14814815 Total 2771.55556 35 79.1873016 We can use the p. 
contrast operator to obtain the 1-degree-of-freedom tests for the linear and quadratic effects of A and B. . contrast p.A p.B, noeffects Contrasts of marginal linear predictions Margins : asbalanced df F P>F A (linear) (quadratic) Joint 1 1 2 212.65 70.88 141.77 0.0000 0.0000 0.0000 B (linear) (quadratic) Joint 1 1 2 26.17 27.51 26.84 0.0000 0.0000 0.0000 Denominator 27 All the above tests appear to be significant. In addition to presenting the 1-degree-of-freedom tests, the combined tests for A and B are produced and agree with the original ANOVA results. Now we explore the interaction between A and B. . contrast p.A#p1.B, noeffects Contrasts of marginal linear predictions Margins : asbalanced df F P>F A#B (linear) (linear) (quadratic) (linear) Joint 1 1 2 17.71 0.07 8.89 0.0003 0.7893 0.0011 Denominator 27 The 2-degrees-of-freedom test of the interaction of A with the linear components of B is significant at the 0.0011 level. But, when we examine the two 1-degree-of-freedom tests that compose this result, anova postestimation — Postestimation tools for anova 73 the significance is due to the linear A by linear B contrast (significance level of 0.0003). A significance value of 0.7893 for the quadratic A by linear B indicates that this factor is not significant for these data. . contrast p.A#p2.B, noeffects Contrasts of marginal linear predictions Margins : asbalanced df F P>F A#B (linear) (quadratic) (quadratic) (quadratic) Joint 1 1 2 2.80 2.94 2.87 0.1058 0.0979 0.0741 Denominator 27 The test of A with the quadratic components of B does not fall below the 0.05 significance level. Video example Introduction to contrasts in Stata: One-way ANOVA References Mitchell, M. N. 2012. Interpreting and Visualizing Regression Models Using Stata. College Station, TX: Stata Press. Ostle, B., and R. W. Mensing. 1975. Statistics in Research. 3rd ed. Ames, IA: Iowa State University Press. Rencher, A. C., and G. B. Schaalje. 2008. Linear Models in Statistics. 2nd ed. New York: Wiley. Also see [R] anova — Analysis of variance and covariance [R] regress postestimation — Postestimation tools for regress [R] regress postestimation diagnostic plots — Postestimation plots for regress [U] 20 Estimation and postestimation commands Title areg — Linear regression with a large dummy-variable set Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax areg depvar indepvars if in weight , absorb(varname) options Description options Model ∗ absorb(varname) categorical variable to be absorbed SE/Robust vce(vcetype) vcetype may be ols, robust, cluster clustvar, bootstrap, or jackknife Reporting level(#) display options set confidence level; default is level(95) control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling coeflegend display legend instead of statistics ∗ absorb(varname) is required. indepvars may contain factor variables; see [U] 11.4.3 Factor variables. depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, mi estimate, rolling, and statsby are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. aweights, fweights, and pweights are allowed; see [U] 11.1.6 weight. 
coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Linear models and related > Other > Linear regression absorbing one cat. variable Description areg fits a linear regression absorbing one categorical factor. areg is designed for datasets with many groups, but not a number of groups that increases with the sample size. See the xtreg, fe command in [XT] xtreg for an estimator that handles the case in which the number of groups increases with the sample size. 74 areg — Linear regression with a large dummy-variable set 75 Options Model absorb(varname) specifies the categorical variable, which is to be included in the regression as if it were specified by dummy variables. absorb() is required. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (ols), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. vce(ols), the default, uses the standard variance estimator for ordinary least-squares regression. Exercise caution when using the vce(cluster clustvar) option with areg. The effective number of degrees of freedom for the robust variance estimator is ng − 1, where ng is the number of clusters. Thus the number of levels of the absorb() variable should not exceed the number of clusters. Reporting level(#); see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. The following option is available with areg but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Suppose that you have a regression model that includes among the explanatory variables a large number, k , of mutually exclusive and exhaustive dummies: y = Xβ + d1 γ1 + d2 γ2 + · · · + dk γk + For instance, the dummy variables, di , might indicate countries in the world or states of the United States. One solution would be to fit the model with regress, but this solution is possible only if k is small enough so that the total number of variables (the number of columns of X plus the number of di ’s plus one for y) is sufficiently small — meaning less than matsize (see [R] matsize). For problems with more variables than the largest possible value of matsize (100 for Small Stata, 800 for Stata/IC, and 11,000 for Stata/SE and Stata/MP), regress will not work. areg provides a way of obtaining estimates of β — but not the γi ’s — in these cases. The effects of the dummy variables are said to be absorbed. Example 1 So that we can compare the results produced by areg with Stata’s other regression commands, we will fit a model in which k is small. areg’s real use, however, is when k is large. In our automobile data, we have a variable called rep78 that is coded 1, 2, 3, 4, and 5, where 1 means poor and 5 means excellent. Let’s assume that we wish to fit a regression of mpg on weight, gear ratio, and rep78 (parameterized as a set of dummies). 76 areg — Linear regression with a large dummy-variable set . use http://www.stata-press.com/data/r13/auto2 (1978 Automobile Data) . 
regress mpg weight gear_ratio b5.rep78 SS df MS Source Model Residual 1575.97621 764.226686 6 62 262.662702 12.3262369 Total 2340.2029 68 34.4147485 Std. Err. t Number of obs F( 6, 62) Prob > F R-squared Adj R-squared Root MSE P>|t| = = = = = = 69 21.31 0.0000 0.6734 0.6418 3.5109 mpg Coef. [95% Conf. Interval] weight gear_ratio -.0051031 .901478 .0009206 1.565552 -5.54 0.58 0.000 0.567 -.0069433 -2.228015 -.003263 4.030971 rep78 Poor Fair Average Good -2.036937 -2.419822 -2.557432 -2.788389 2.740728 1.764338 1.370912 1.395259 -0.74 -1.37 -1.87 -2.00 0.460 0.175 0.067 0.050 -7.515574 -5.946682 -5.297846 -5.577473 3.4417 1.107039 .1829814 .0006939 _cons 36.23782 7.01057 5.17 0.000 22.22389 50.25175 To fit the areg equivalent, we type . areg mpg weight gear_ratio, absorb(rep78) Linear regression, absorbing indicators mpg Coef. weight gear_ratio _cons -.0051031 .901478 34.05889 rep78 Std. Err. .0009206 1.565552 7.056383 F(4, 62) = t Number of obs F( 2, 62) Prob > F R-squared Adj R-squared Root MSE P>|t| -5.54 0.58 4.83 0.000 0.567 0.000 1.117 0.356 = = = = = = 69 41.64 0.0000 0.6734 0.6418 3.5109 [95% Conf. Interval] -.0069433 -2.228015 19.95338 -.003263 4.030971 48.1644 (5 categories) Both regress and areg display the same R2 values, root mean squared error, and—for weight and gear ratio—the same parameter estimates, standard errors, t statistics, significance levels, and confidence intervals. areg, however, does not report the coefficients for rep78, and, in fact, they are not even calculated. This computational trick makes the problem manageable when k is large. areg reports a test that the coefficients associated with rep78 are jointly zero. Here this test has a significance level of 35.6%. This F test for rep78 is the same that we would obtain after regress if we were to specify test 1.rep78 2.rep78 3.rep78 4.rep78; see [R] test. The model F tests reported by regress and areg also differ. The regress command reports a test that all coefficients except that of the constant are equal to zero; thus, the dummies are included in this test. The areg output shows a test that all coefficients excluding the dummies and the constant are equal to zero. This is the same test that can be obtained after regress by typing test weight gear ratio. areg — Linear regression with a large dummy-variable set 77 Technical note areg is designed for datasets with many groups, but not a number that grows with the sample size. Consider two different samples from the U.S. population. In the first sample, we have 10,000 individuals and we want to include an indicator for each of the 50 states, whereas in the second sample we have 3 observations on each of 10,000 individuals and we want to include an indicator for each individual. areg was designed for datasets similar to the first sample in which we have a fixed number of groups, the 50 states. In the second sample, the number of groups, which is the number of individuals, grows as we include more individuals in the sample. For an estimator designed to handle the case in which the number of groups grows with the sample size, see the xtreg, fe command in [XT] xtreg. Although the point estimates produced by areg and xtreg, fe are the same, the estimated VCEs differ when vce(cluster clustvar) is specified because the commands make different assumptions about whether the number of groups increases with the sample size. Technical note The intercept reported by areg deserves some explanation because, given k mutually exclusive and exhaustive dummies, it is arbitrary. 
areg identifies the model by choosing the intercept that makes the prediction calculated at the means of the independent variables equal to the mean of the b dependent variable: y = x β. . predict yhat (option xb assumed; fitted values) . summarize mpg yhat if rep78 != . Variable Obs Mean mpg yhat 69 69 21.28986 21.28986 Std. Dev. 5.866408 4.383224 Min Max 12 11.58643 41 28.07367 We had to include if rep78 < . in our summarize command because we have missing values in our data. areg automatically dropped those missing values (as it should) in forming the estimates, but predict with the xb option will make predictions for cases with missing rep78 because it does not know that rep78 is really part of our model. These predicted values do not include the absorbed effects (that is, the di γi ). For predicted values that include these effects, use the xbd option of predict (see [R] areg postestimation) or see [XT] xtreg. Example 2 areg, vce(robust) is a Huberized version of areg; see [P] robust. Just as areg is equivalent to using regress with dummies, areg, vce(robust) is equivalent to using regress, vce(robust) with dummies. You can use areg, vce(robust) when you expect heteroskedastic or nonnormal errors. areg, vce(robust), like ordinary regression, assumes that the observations are independent, unless the vce(cluster clustvar) option is specified. If the vce(cluster clustvar) option is specified, this independence assumption is relaxed and only the clusters identified by equal values of clustvar are assumed to be independent. 78 areg — Linear regression with a large dummy-variable set Assume that we were to collect data by randomly sampling 10,000 doctors (from 100 hospitals) and then sampling 10 patients of each doctor, yielding a total dataset of 100,000 patients in a cluster sample. If in some regression we wished to include effects of the hospitals to which the doctors belonged, we would want to include a dummy variable for each hospital, adding 100 variables to our model. areg could fit this model by . areg depvar patient vars, absorb(hospital) vce(cluster doctor) Stored results areg stores the following in e(): Scalars e(N) e(tss) e(df m) e(rss) e(df r) e(r2) e(r2 a) e(df a) e(rmse) e(ll) e(ll 0) e(N clust) e(F) e(F absorb) e(rank) Macros e(cmd) e(cmdline) e(depvar) e(absvar) e(wtype) e(wexp) e(title) e(clustvar) e(vce) e(vcetype) e(datasignature) e(datasignaturevars) e(properties) e(predict) e(footnote) e(marginsnotok) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(V) e(V modelbased) Functions e(sample) number of observations total sum of squares model degrees of freedom residual sum of squares residual degrees of freedom R-squared adjusted R-squared degrees of freedom for absorbed effect root mean squared error log likelihood log likelihood, constant-only model number of clusters F statistic F statistic for absorbed effect (when vce(robust) is not specified) rank of e(V) areg command as typed name of dependent variable name of absorb variable weight type weight expression title in estimation output name of cluster variable vcetype specified in vce() title used to label Std. Err. 
the checksum variables used in calculation of checksum b V program used to implement predict program used to implement the footnote display predictions disallowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix variance–covariance matrix of the estimators model-based variance marks estimation sample areg — Linear regression with a large dummy-variable set 79 Methods and formulas areg begins by recalculating depvar and indepvars to have mean 0 within the groups specified by absorb(). The overall mean of each variable is then added back in. The adjusted depvar is then regressed on the adjusted indepvars with regress, yielding the coefficient estimates. The degrees of freedom of the variance–covariance matrix of the coefficients is then adjusted to account for the absorbed variables — this calculation yields the same results (up to numerical roundoff error) as if the matrix had been calculated directly by the formulas given in [R] regress. areg with vce(robust) or vce(cluster clustvar) works similarly, calling robust after regress to produce the Huber/White/sandwich estimator of the variance or its clustered version. See [P] robust, particularly Introduction and Methods and formulas. The model F test uses the robust variance estimates. There is, however, no simple computational means of obtaining a robust test of the absorbed dummies; thus this test is not displayed when the vce(robust) or vce(cluster clustvar) option is specified. The number of groups specified in absorb() are included in the degrees of freedom used in the finite-sample adjustment of the cluster–robust VCE estimator. This statement is only valid if the number of groups is small relative to the sample size. (Technically, the number of groups must remain fixed as the sample size grows.) For an estimator that allows the number of groups to grow with the sample size, see the xtreg, fe command in [XT] xtreg. References Blackwell, J. L., III. 2005. Estimation and testing of fixed-effect panel-data systems. Stata Journal 5: 202–207. McCaffrey, D. F., K. Mihaly, J. R. Lockwood, and T. R. Sass. 2012. A review of Stata commands for fixed-effects estimation in normal linear models. Stata Journal 12: 406–432. Also see [R] areg postestimation — Postestimation tools for areg [R] regress — Linear regression [MI] estimation — Estimation commands for use with mi estimate [XT] xtreg — Fixed-, between-, and random-effects and population-averaged linear models [U] 20 Estimation and postestimation commands Title areg postestimation — Postestimation tools for areg Description Remarks and examples Syntax for predict References Menu for predict Also see Options for predict Description The following postestimation commands are available after areg: Command Description contrast estat ic estat summarize estat vce estimates forecast1 lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) 
point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest margins marginsplot nlcom predict predictnl pwcompare test testnl 1 forecast is not appropriate with mi estimation results. 80 areg postestimation — Postestimation tools for areg 81 Syntax for predict predict type newvar if in , statistic where yj = xj b + dabsorbvar + ej and statistic is Description statistic Main xb stdp dresiduals ∗ xbd ∗ d ∗ residuals ∗ score xj b, fitted values; the default standard error of the prediction dabsorbvar + ej = yj − xj b xj b + dabsorbvar dabsorbvar residual score; equivalent to residuals Unstarred statistics are available both in and out of sample; type predict . . . if e(sample) . . . if wanted only for the estimation sample. Starred statistics are calculated only for the estimation sample, even when if e(sample) is not specified. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main xb, the default, calculates the prediction of xj b, the fitted values, by using the average effect of the absorbed variable. Also see xbd below. stdp calculates the standard error of xj b. dresiduals calculates yj − xj b, which are the residuals plus the effect of the absorbed variable. xbd calculates xj b + dabsorbvar , which are the fitted values including the individual effects of the absorbed variable. d calculates dabsorbvar , the individual coefficients for the absorbed variable. residuals calculates the residuals, that is, yj − (xj b + dabsorbvar ). score is a synonym for residuals. Remarks and examples Example 1 Continuing with example 1 of [R] areg, we refit the model with robust standard errors and then obtain linear predictions and standard errors for those linear predictions. . use http://www.stata-press.com/data/r13/auto2 (1978 Automobile Data) 82 areg postestimation — Postestimation tools for areg . areg mpg weight gear_ratio, absorb(rep78) vce(robust) (output omitted ) . predict xb_ar (option xb assumed; fitted values) . predict stdp_ar, stdp We can obtain the same linear predictions by fitting the model with xtreg, fe, but we would first need to specify the panel structure by using xtset. . xtset rep78 panel variable: rep78 (unbalanced) . xtreg mpg weight gear_ratio, fe vce(robust) (output omitted ) . predict xb_xt (option xb assumed; fitted values) . predict stdp_xt, stdp . summarize xb_ar xb_xt stdp* Variable Obs Mean Std. Dev. xb_ar xb_xt stdp_ar stdp_xt 74 74 74 74 21.36805 21.36805 .7105649 .8155919 4.286788 4.286788 .1933936 .4826332 Min Max 11.58643 11.58643 .4270821 .0826999 28.07367 28.07367 1.245179 1.709786 The predicted xb values above are the same for areg and xtreg, fe, but the standard errors for those linear predictions are different. The assumptions for these two estimators lead to different formulations for their standard errors. The robust variance estimates with areg are equivalent to the robust variance estimates using regress, including the panel dummies. The consistent robust variance estimates with xtreg are equivalent to those obtained by specifying vce(cluster panelvar) with that estimation command. 
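The first of these equivalences is easy to check directly: refitting the model with explicit rep78 indicators and robust standard errors should reproduce the areg standard errors for weight and gear_ratio. A minimal sketch (the stored-estimate names are illustrative):
. quietly areg mpg weight gear_ratio, absorb(rep78) vce(robust)
. estimates store areg_rob
. quietly regress mpg weight gear_ratio i.rep78, vce(robust)
. estimates store reg_rob
. estimates table areg_rob reg_rob, keep(weight gear_ratio) se
The coefficients and standard errors for weight and gear_ratio should agree across the two columns.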
For a theoretical discussion, see Wooldridge (2013), Stock and Watson (2008), and Arellano (2003); also see the technical note after example 3 of [XT] xtreg. Example 2 We would like to use linktest to check whether the dependent variable for our model is correctly specified: . use http://www.stata-press.com/data/r13/auto2, clear (1978 Automobile Data) . areg mpg weight gear_ratio, absorb(rep78) (output omitted ) . linktest, absorb(rep78) Linear regression, absorbing indicators Number of obs F( 2, 62) Prob > F R-squared Adj R-squared Root MSE mpg Coef. _hat _hatsq _cons -.9305602 .0462785 19.24899 rep78 Std. Err. .9537856 .0227219 9.725618 F(4, 62) = t P>|t| -0.98 2.04 1.98 0.333 0.046 0.052 1.278 0.288 = = = = = = 69 46.50 0.0000 0.6939 0.6643 3.3990 [95% Conf. Interval] -2.83715 .0008582 -.1922457 .9760302 .0916989 38.69022 (5 categories) areg postestimation — Postestimation tools for areg 83 The squared residuals are significant in the regression for mpg on the linear and squared residuals; therefore, the test indicates that our dependent variable does not seem to be well specified. Let’s transform the dependent variable into energy consumption, gallons per mile, fit the alternative model, and check the link test again. . generate gpm = 1/mpg . areg gpm weight gear_ratio, absorb(rep78) (output omitted ) . linktest, absorb(rep78) Linear regression, absorbing indicators gpm Coef. _hat _hatsq _cons .2842582 6.956965 .0175457 rep78 Number of obs F( 2, 62) Prob > F R-squared Adj R-squared Root MSE Std. Err. t P>|t| .7109124 6.862439 .0178251 0.40 1.01 0.98 0.691 0.315 0.329 0.065 0.992 F(4, 62) = = = = = = = 69 72.60 0.0000 0.7436 0.7187 0.0068 [95% Conf. Interval] -1.136835 -6.760855 -.0180862 1.705352 20.67478 .0531777 (5 categories) The link test supports the use of the transformed dependent variable. References Arellano, M. 2003. Panel Data Econometrics. Oxford: Oxford University Press. Stock, J. H., and M. W. Watson. 2008. Heteroskedasticity-robust standard errors for fixed effects panel data regression. Econometrica 76: 155–174. Wooldridge, J. M. 2013. Introductory Econometrics: A Modern Approach. 5th ed. Mason, OH: South-Western. 
Also see [R] areg — Linear regression with a large dummy-variable set [U] 20 Estimation and postestimation commands Title asclogit — Alternative-specific conditional logit (McFadden’s choice) model Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax asclogit depvar indepvars alternatives(varname) options Model ∗ ∗ case(varname) alternatives(varname) casevars(varlist) basealternative(# | lbl | str) noconstant altwise offset(varname) constraints(constraints) collinear if options in weight , case(varname) Description use varname to identify cases use varname to identify the alternatives available for each case case-specific variables alternative to normalize location suppress alternative-specific constant terms use alternativewise deletion instead of casewise deletion include varname in model with coefficient constrained to 1 apply specified linear constraints keep collinear variables SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, bootstrap, or jackknife Reporting level(#) or noheader nocnsreport display options set confidence level; default is level(95) report odds ratios do not display the header on the coefficient table do not display constraints control column formats and line width Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics ∗ case(varname) and alternatives(varname) are required. bootstrap, by, fp, jackknife, statsby, and xi are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. fweights, iweights, and pweights are allowed (see [U] 11.1.6 weight), but they are interpreted to apply to cases as a whole, not to individual observations. See Use of weights in [R] clogit. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 84 asclogit — Alternative-specific conditional logit (McFadden’s choice) model 85 Menu Statistics > Categorical outcomes > Alternative-specific conditional logit Description asclogit fits McFadden’s choice model, which is a specific case of the more general conditional logistic regression model (McFadden 1974). asclogit requires multiple observations for each case (individual or decision), where each observation represents an alternative that may be chosen. The cases are identified by the variable specified in the case() option, whereas the alternatives are identified by the variable specified in the alternatives() option. The outcome or chosen alternative is identified by a value of 1 in depvar, whereas zeros indicate the alternatives that were not chosen. There can be multiple alternatives chosen for each case. asclogit allows two types of independent variables: alternative-specific variables and case-specific variables. Alternative-specific variables vary across both cases and alternatives and are specified in indepvars. Case-specific variables vary only across cases and are specified in the casevars() option. See [R] clogit for a more general application of conditional logistic regression. For example, clogit would be used when you have grouped data where each observation in a group may be a different individual, but all individuals in a group have a common characteristic. 
You may use clogit to obtain the same estimates as asclogit by specifying the case() variable as the group() variable in clogit and generating variables that interact the casevars() in asclogit with each alternative (in the form of an indicator variable), excluding the interaction variable associated with the base alternative. asclogit takes care of this data management burden for you. Also, for clogit, each record (row in your data) is an observation, whereas in asclogit each case, consisting of several records (the alternatives) in your data, is an observation. This last point is important because asclogit will drop observations, by default, in a casewise fashion. That is, if there is at least one missing value in any of the variables for each record of a case, the entire case is dropped from estimation. To use alternativewise deletion, specify the altwise option and only the records with missing values will be dropped from estimation. Options Model case(varname) specifies the numeric variable that identifies each case. case() is required and must be integer valued. alternatives(varname) specifies the variable that identifies the alternatives for each case. The number of alternatives can vary with each case; the maximum number of alternatives cannot exceed the limits of tabulate oneway; see [R] tabulate oneway. alternatives() is required and may be a numeric or a string variable. casevars(varlist) specifies the case-specific numeric variables. These are variables that are constant for each case. If there are a maximum of J alternatives, there will be J − 1 sets of coefficients associated with the casevars(). basealternative(# | lbl | str) specifies the alternative that normalizes the latent-variable location (the level of utility). The base alternative may be specified as a number, label, or string depending on the storage type of the variable indicating alternatives. The default is the alternative with the highest frequency. If vce(bootstrap) or vce(jackknife) is specified, you must specify the base alternative. This is to ensure that the same model is fit with each call to asclogit. 86 asclogit — Alternative-specific conditional logit (McFadden’s choice) model noconstant suppresses the J − 1 alternative-specific constant terms. altwise specifies that alternativewise deletion be used when marking out observations due to missing values in your variables. The default is to use casewise deletion; that is, the entire group of observations making up a case is deleted if any missing values are encountered. This option does not apply to observations that are marked out by the if or in qualifier or the by prefix. offset(varname), constraints(numlist | matname), collinear; see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. or reports the estimated coefficients transformed to odds ratios, that is, eb rather than b. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated. or may be specified at estimation or when replaying previously estimated results. noheader prevents the coefficient table header from being displayed. 
nocnsreport; see [R] estimation options. display options: cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. technique(bhhh) is not allowed. The initial estimates must be specified as from(matname , copy ), where matname is the matrix containing the initial estimates and the copy option specifies that only the position of each element in matname is relevant. If copy is not specified, the column stripe of matname identifies the estimates. The following option is available with asclogit but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples asclogit fits McFadden’s choice model (McFadden [1974]; for a brief introduction, see Greene [2012, sec. 18.2] or Cameron and Trivedi [2010, sec. 15.5]). In this model, we have a set of unordered alternatives indexed by 1, 2, . . . , J . Let yij , j = 1, . . . , J , be an indicator variable for the alternative actually chosen by the ith individual (case). That is, yij = 1 if individual i chose alternative j and yij = 0 otherwise. The independent variables come in two forms: alternative specific and case asclogit — Alternative-specific conditional logit (McFadden’s choice) model 87 specific. Alternative-specific variables vary among the alternatives (as well as cases), and case-specific variables vary only among cases. Assume that we have p alternative-specific variables so that for case i we have a J × p matrix, Xi . Further, assume that we have q case-specific variables so that we have a 1 × q vector zi for case i. Our random-utility model can then be expressed as 0 ui = Xi β + (zi A) + i Here β is a p× 1 vector of alternative-specific regression coefficients and A = (α1 , . . . , αJ ) is a q ×J matrix of case-specific regression coefficients. The elements of the J × 1 vector i are independent Type I (Gumbel-type) extreme-value random variables with mean γ (the Euler–Mascheroni constant, approximately 0.577) and variance π 2 /6. We must fix one of the αj to the constant vector to normalize the location. We set αk = 0, where k is specified by the basealternative() option. The vector ui quantifies the utility that the individual gains from the J alternatives. The alternative chosen by individual i is the one that maximizes utility. Example 1 We have data on 295 consumers and their choice of automobile. Each consumer chose among an American, Japanese, or European car; the variable car indicates the nationality of the car for each alternative. We want to explore the relationship between the choice of car to the consumer’s sex (variable sex) and income (variable income in thousands of dollars). We also have information on the number of dealerships of each nationality in the consumer’s city in the variable dealer that we want to include as a regressor. We assume that consumers’ preferences are influenced by the number of dealerships in an area but that the number of dealerships is not influenced by consumer preferences (which we admit is a rather strong assumption). The variable dealer is an alternative-specific variable (Xi is a 3 × 1 vector in our previous notation), and sex and income are case-specific variables (zi is a 1 × 2 vector). Each consumer’s chosen car is indicated by the variable choice. 
Let’s list some of the data. . use http://www.stata-press.com/data/r13/choice . list id car choice dealer sex income in 1/12, sepby(id) id car choice dealer sex income 1. 2. 3. 1 1 1 American Japan Europe 0 0 1 18 8 5 male male male 46.7 46.7 46.7 4. 5. 6. 2 2 2 American Japan Europe 1 0 0 17 6 2 male male male 26.1 26.1 26.1 7. 8. 9. 3 3 3 American Japan Europe 1 0 0 12 6 2 male male male 32.7 32.7 32.7 10. 11. 12. 4 4 4 American Japan Europe 0 1 0 18 7 4 female female female 49.2 49.2 49.2 We see, for example, that the first consumer, a male earning $46,700 per year, chose to purchase a European car even though there are more American and Japanese car dealers in his area. The fourth consumer, a female earning $49,200 per year, purchased a Japanese car. 88 asclogit — Alternative-specific conditional logit (McFadden’s choice) model We now fit our model. . asclogit choice dealer, case(id) alternatives(car) casevars(sex income) Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: log log log log log likelihood likelihood likelihood likelihood likelihood = -273.55685 = -252.75109 = -250.78555 = -250.7794 = -250.7794 Alternative-specific conditional logit Case variable: id Number of obs Number of cases = = 885 295 Alternative variable: car Alts per case: min = avg = max = 3 3.0 3 Log likelihood = Wald chi2(5) Prob > chi2 -250.7794 choice Coef. dealer .0680938 = = 15.86 0.0072 Std. Err. z P>|z| [95% Conf. Interval] .0344465 1.98 0.048 .00058 .1356076 car American (base alternative) Japan sex income _cons -.5346039 .0325318 -1.352189 .3141564 .012824 .6911829 -1.70 2.54 -1.96 0.089 0.011 0.050 -1.150339 .0073973 -2.706882 .0811314 .0576663 .0025049 .5704109 .032042 -2.355249 .4540247 .0138676 .8526681 1.26 2.31 -2.76 0.209 0.021 0.006 -.3194612 .004862 -4.026448 1.460283 .0592219 -.6840501 Europe sex income _cons Displaying the results as odds ratios makes interpretation easier. . asclogit, or noheader choice Odds Ratio dealer 1.070466 Std. Err. z P>|z| [95% Conf. Interval] .0368737 1.98 0.048 1.00058 1.145232 car American (base alternative) Japan sex income _cons .5859013 1.033067 .2586735 .1840647 .013248 .1787907 -1.70 2.54 -1.96 0.089 0.011 0.050 .3165294 1.007425 .0667446 1.084513 1.059361 1.002508 1.768994 1.032561 .0948699 .8031669 .0143191 .0808925 1.26 2.31 -2.76 0.209 0.021 0.006 .7265404 1.004874 .0178376 4.307178 1.061011 .5045693 Europe sex income _cons These results indicate that men (sex = 1) are less likely to pick a Japanese car over an American car than women (odds ratio 0.59) but that men are more likely to choose a European car over an American car (odds ratio 1.77). Raising a person’s income increases the likelihood that he or she asclogit — Alternative-specific conditional logit (McFadden’s choice) model 89 purchases a Japanese or European car; interestingly, the effect of higher income is about the same for these two types of cars. Daniel Little McFadden was born in 1937 in North Carolina. He studied physics, psychology, and economics at the University of Minnesota and has taught economics at Pittsburgh, Berkeley, MIT, and the University of Southern California. His contributions to logit models were triggered by a student’s project on freeway routing decisions, and his work consistently links economic theory and applied problems. In 2000, he shared the Nobel Prize in Economics with James J. Heckman. Technical note McFadden’s choice model is related to multinomial logistic regression (see [R] mlogit). 
If all the independent variables are case specific, then the two models are identical. We verify this supposition by running the previous example without the alternative-specific variable, dealer. . asclogit choice, case(id) alternatives(car) casevars(sex income) nolog Alternative-specific conditional logit Number of obs = 885 Case variable: id Number of cases = 295 Alternative variable: car Alts per case: min = 3 avg = 3.0 max = 3 Wald chi2(4) = 12.53 Log likelihood = -252.72012 Prob > chi2 = 0.0138 choice American Coef. Std. Err. z P>|z| [95% Conf. Interval] (base alternative) Japan sex income _cons -.4694799 .0276854 -1.962652 .3114939 .0123666 .6216804 -1.51 2.24 -3.16 0.132 0.025 0.002 -1.079997 .0034472 -3.181123 .141037 .0519236 -.7441807 .5388441 .0273669 -3.180029 .4525279 .013787 .7546837 1.19 1.98 -4.21 0.234 0.047 0.000 -.3480942 .000345 -4.659182 1.425782 .0543889 -1.700876 Europe sex income _cons To run mlogit, we must rearrange the dataset. mlogit requires a dependent variable that indicates the choice—1, 2, or 3—for each individual. We will use car as our dependent variable for those observations that represent the choice actually chosen. 90 asclogit — Alternative-specific conditional logit (McFadden’s choice) model . keep if choice == 1 (590 observations deleted) . mlogit car sex income Iteration 0: log likelihood = -259.1712 Iteration 1: log likelihood = -252.81165 Iteration 2: log likelihood = -252.72014 Iteration 3: log likelihood = -252.72012 Multinomial logistic regression Number of obs LR chi2(4) Prob > chi2 Pseudo R2 Log likelihood = -252.72012 car American Coef. Std. Err. z P>|z| = = = = 295 12.90 0.0118 0.0249 [95% Conf. Interval] (base outcome) Japan sex income _cons -.4694798 .0276854 -1.962651 .3114939 .0123666 .6216803 -1.51 2.24 -3.16 0.132 0.025 0.002 -1.079997 .0034472 -3.181122 .1410371 .0519236 -.7441801 .5388443 .027367 -3.18003 .4525278 .013787 .7546837 1.19 1.98 -4.21 0.234 0.047 0.000 -.348094 .000345 -4.659182 1.425783 .0543889 -1.700877 Europe sex income _cons The results are the same except for the model statistic: asclogit uses a Wald test and mlogit uses a likelihood-ratio test. If you prefer the likelihood-ratio test, you can fit the constant-only model for asclogit followed by the full model and use [R] lrtest. The following example will carry this out. . use http://www.stata-press.com/data/r13/choice, clear . asclogit choice, case(id) alternatives(car) . estimates store null . asclogit choice, case(id) alternatives(car) casevars(sex income) . lrtest null . Technical note We force you to explicitly identify the case-specific variables in the casevars() option to ensure that the program behaves as you expect. For example, an if or in qualifier may drop observations in such a way that (what was expected to be) an alternative-specific variable turns into a case-specific variable. Here you would probably want asclogit to terminate instead of interacting the variable with the alternative indicators. This situation could also occur if asclogit drops cases, or observations if you use the altwise option, because of missing values. 
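As noted under Description, you can reproduce an asclogit fit with clogit by building the alternative indicators and the case-variable interactions yourself. A minimal sketch for the car-choice data, assuming American (car==1) remains the base alternative (the generated variable names are illustrative):
. use http://www.stata-press.com/data/r13/choice, clear
. generate japan = (car == 2)
. generate europe = (car == 3)
. generate sexXjapan = sex * japan
. generate incomeXjapan = income * japan
. generate sexXeurope = sex * europe
. generate incomeXeurope = income * europe
. clogit choice dealer japan europe sexXjapan incomeXjapan sexXeurope incomeXeurope, group(id)
Here japan and europe play the role of the alternative-specific constants, and the coefficients on dealer and on each interaction should match the asclogit results in example 1. asclogit simply performs this data management for you.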
asclogit — Alternative-specific conditional logit (McFadden’s choice) model Stored results asclogit stores the following in e(): Scalars e(N) e(N case) e(k) e(k alt) e(k indvars) e(k casevars) e(k eq) e(k eq model) e(df m) e(ll) e(N clust) e(const) e(i base) e(chi2) e(F) e(p) e(alt min) e(alt avg) e(alt max) e(rank) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(indvars) e(casevars) e(case) e(altvar) e(alteqs) e(alt#) e(wtype) e(wexp) e(title) e(clustvar) e(offset) e(chi2type) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(datasignature) e(datasignaturevars) e(properties) e(estat cmd) e(predict) e(marginsnotok) number of observations number of cases number of parameters number of alternatives number of alternative-specific variables number of case-specific variables number of equations in e(b) number of equations in overall model test model degrees of freedom log likelihood number of clusters constant indicator base alternative index χ2 F statistic significance minimum number of alternatives average number of alternatives maximum number of alternatives rank of e(V) number of iterations return code 1 if converged, 0 otherwise asclogit command as typed name of dependent variable alternative-specific independent variable case-specific variables variable defining cases variable defining alternatives alternative equation names alternative labels weight type weight expression title in estimation output name of cluster variable linear offset variable Wald, type of model χ2 test vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique the checksum variables used in calculation of checksum b V program used to implement estat program used to implement predict predictions disallowed by margins 91 92 asclogit — Alternative-specific conditional logit (McFadden’s choice) model Matrices e(b) e(stats) e(altvals) e(altfreq) e(alt casevars) e(ilog) e(gradient) e(V) e(V modelbased) Functions e(sample) coefficient vector alternative statistics alternative values alternative frequencies indicators for estimated case-specific coefficients—e(k alt)×e(k casevars) iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample Methods and formulas In this model, we have a set of unordered alternatives indexed by 1, 2, . . . , J . Let yij , j = 1, . . . , J , be an indicator variable for the alternative actually chosen by the ith individual (case). That is, yij = 1 if individual i chose alternative j and yij = 0 otherwise. The independent variables come in two forms: alternative specific and case specific. Alternative-specific variables vary among the alternatives (as well as cases), and case-specific variables vary only among cases. Assume that we have p alternative-specific variables so that for case i we have a J × p matrix, Xi . Further, assume that we have q case-specific variables so that we have a 1 × q vector zi for case i. The deterministic component of the random-utility model can then be expressed as 0 ηi = Xi β + (zi A) = Xi β + (zi ⊗ IJ ) vec(A0 ) β = (Xi , zi ⊗ IJ ) vec(A0 ) = X∗i β∗ As before, β is a p × 1 vector of alternative-specific regression coefficients, and A = (α1 , . . . 
, αJ ) is a q × J matrix of case-specific regression coefficients; remember that we must fix one of the αj to the constant vector to normalize the location. Here IJ is the J × J identity matrix, vec() is the vector function that creates a vector from a matrix by placing each column of the matrix on top of the other (see [M-5] vec( )), and ⊗ is the Kronecker product (see [M-2] op kronecker). We have rewritten the linear equation so that it is a form that can be used by clogit, namely, X∗i β∗ , where X∗i = (Xi , zi ⊗ IJ ) β ∗ β = vec(A0 ) With this in mind, see Methods and formulas in [R] clogit for the computational details of the conditional logit model. This command supports the clustered version of the Huber/White/sandwich estimator of the variance using vce(robust) and vce(cluster clustvar). See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. Specifying vce(robust) is equivalent to specifying vce(cluster casevar), where casevar is the variable that identifies the cases. asclogit — Alternative-specific conditional logit (McFadden’s choice) model 93 References Cameron, A. C., and P. K. Trivedi. 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press. Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. McFadden, D. L. 1974. Conditional logit analysis of qualitative choice behavior. In Frontiers in Econometrics, ed. P. Zarembka, 105–142. New York: Academic Press. Also see [R] asclogit postestimation — Postestimation tools for asclogit [R] asmprobit — Alternative-specific multinomial probit regression [R] asroprobit — Alternative-specific rank-ordered probit regression [R] clogit — Conditional (fixed-effects) logistic regression [R] logistic — Logistic regression, reporting odds ratios [R] logit — Logistic regression, reporting coefficients [R] nlogit — Nested logit regression [R] ologit — Ordered logistic regression [U] 20 Estimation and postestimation commands Title asclogit postestimation — Postestimation tools for asclogit Description Options for predict Options for estat mfx Methods and formulas Syntax for predict Syntax for estat Remarks and examples Also see Menu for predict Menu for estat Stored results Description The following postestimation commands are of special interest after asclogit: Commands Description estat alternatives estat mfx alternative summary statistics marginal effects The following standard postestimation commands are also available: Commands Description estat ic estat summarize estat vce estimates hausman lincom Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results Hausman’s specification test point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predicted probabilities, estimated linear predictor and its standard error point estimates, standard errors, testing, and inference for generalized predictions Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest nlcom predict predictnl test testnl Special-interest postestimation commands estat alternatives displays summary statistics about the alternatives in the estimation sample. estat mfx computes probability marginal effects. 
94 asclogit postestimation — Postestimation tools for asclogit 95 Syntax for predict predict type predict type newvar if in stub* | newvarlist , statistic options if in , scores Description statistic Main pr xb stdp probability that each alternative is chosen; the default linear prediction standard error of the linear prediction options Description Main ∗ k(# | observed) condition on # alternatives per case or on observed number of alternatives altwise use alternativewise deletion instead of casewise deletion when computing probabilities nooffset ignore the offset() variable specified in asclogit ∗ k(# | observed) may be used only with pr. These statistics are available both in and out of sample; type predict only for the estimation sample. . . . if e(sample) . . . if wanted Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main pr computes the probability of choosing each alternative conditioned on each case choosing k() alternatives. This is the default statistic with default k(1); one alternative per case is chosen. xb computes the linear prediction. stdp computes the standard error of the linear prediction. k(# | observed) conditions the probability on # alternatives per case or on the observed number of alternatives. The default is k(1). This option may be used only with the pr option. altwise specifies that alternativewise deletion be used when marking out observations due to missing values in your variables. The default is to use casewise deletion. The xb and stdp options always use alternativewise deletion. nooffset is relevant only if you specified offset(varname) for asclogit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xβ rather than as xβ + offset. scores calculates the scores for each coefficient in e(b). This option requires a new variable list of length equal to the number of columns in e(b). Otherwise, use the stub* option to have predict generate enumerated variables with prefix stub. 96 asclogit postestimation — Postestimation tools for asclogit Syntax for estat Alternative summary statistics estat alternatives Marginal effects estat mfx if in , options Description options Main varlist(varlist) display marginal effects for varlist at(mean atlist | median atlist ) calculate marginal effects at these values k(#) condition on the number of alternatives chosen to be # Options set confidence interval level; default is level(95) treat indicator variables as continuous do not restrict calculation of means and medians to the estimation sample ignore weights when calculating means and medians level(#) nodiscrete noesample nowght Menu for estat Statistics > Postestimation > Reports and statistics Options for estat mfx Main varlist(varlist) specifies the variables for which to display marginal effects. The default is all variables. at(mean atlist | median atlist ) specifies the values at which the marginal effects are to be calculated. atlist is alternative:variable = # variable = # alternative:offset = # ... The default is to calculate the marginal effects at the means of the independent variables by using the estimation sample, at(mean). If offset() is used during estimation, the means of the offsets (by alternative) are computed by default. After specifying the summary statistic, you can specify a series of specific values for variables. 
You can specify values for alternative-specific variables by alternative, or you can specify one value for all alternatives. You can specify only one value for case-specific variables. You specify values for the offset() variable (if present) the same way as for alternative-specific variables. For example, in the choice dataset (car choice), income is a case-specific variable, whereas dealer is an alternative-specific variable. The following would be a legal syntax for estat mfx: . estat mfx, at(mean American:dealer=18 income=40) asclogit postestimation — Postestimation tools for asclogit 97 When nodiscrete is not specified, at(mean atlist ) or at(median atlist ) has no effect on computing marginal effects for indicator variables, which are calculated as the discrete change in the simulated probability as the indicator variable changes from 0 to 1. The mean and median computations respect any if or in qualifiers, so you can restrict the data over which the statistic is computed. You can even restrict the values to a specific case, for example, . estat mfx if case==21 k(#) computes the probabilities conditioned on # alternatives chosen. The default is one alternative chosen. Options level(#) sets the confidence level; default is level(95). nodiscrete specifies that indicator variables be treated as continuous variables. An indicator variable is one that takes on the value 0 or 1 in the estimation sample. By default, the discrete change in the simulated probability is computed as the indicator variable changes from 0 to 1. noesample specifies that the whole dataset be considered instead of only those marked in the e(sample) defined by the asclogit command. nowght specifies that weights be ignored when calculating the medians. Remarks and examples Remarks are presented under the following headings: Predicted probabilities Obtaining estimation statistics Predicted probabilities After fitting a McFadden’s choice model with alternative-specific conditional logistic regression, you can use predict to obtain the estimated probability of alternative choices given case profiles. Example 1 In example 1 of [R] asclogit, we fit a model of consumer choice of automobile. The alternatives are nationality of the automobile manufacturer: American, Japanese, or European. There is one alternativespecific variable in the model, dealer, which contains the number of dealerships of each nationality in the consumer’s city. The case-specific variables are sex, the consumer’s sex, and income, the consumer’s income in thousands of dollars. . use http://www.stata-press.com/data/r13/choice . asclogit choice dealer, case(id) alternatives(car) casevars(sex income) (output omitted ) . predict p (option pr assumed; Pr(car)) . predict p2, k(2) (option pr assumed; Pr(car)) . format p p2 %6.4f 98 asclogit postestimation — Postestimation tools for asclogit . list car choice dealer sex income p p2 in 1/9, sepby(id) car choice dealer sex income p p2 1. 2. 3. American Japan Europe 0 0 1 18 8 5 male male male 46.7 46.7 46.7 0.6025 0.2112 0.1863 0.8589 0.5974 0.5437 4. 5. 6. American Japan Europe 1 0 0 17 6 2 male male male 26.1 26.1 26.1 0.7651 0.1282 0.1067 0.9293 0.5778 0.4929 7. 8. 9. American Japan Europe 1 0 0 12 6 2 male male male 32.7 32.7 32.7 0.6519 0.1902 0.1579 0.8831 0.5995 0.5174 Obtaining estimation statistics Here we will demonstrate the specialized estat subcommands after asclogit. Use estat alternatives to obtain a table of alternative statistics. 
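The other statistics listed under Syntax for predict are obtained in the same way. A minimal sketch continuing this example (the new variable names are illustrative):
. predict eta, xb
. predict se_eta, stdp
. predict pobs, k(observed)
Here eta holds the linear prediction, se_eta its standard error, and pobs the choice probabilities conditioned on the observed number of alternatives chosen in each case.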
The table will contain the alternative values, labels (if any), the number of cases in which each alternative is present, the frequency that the alternative is selected, and the percent selected. Use estat mfx to obtain marginal effects after asclogit. Example 2 We will continue with the automobile choice example, where we first list the alternative statistics and then compute the marginal effects at the mean income in our sample, assuming that there are five automobile dealers for each nationality. We will evaluate the probabilities for females because sex is coded 0 for females, and we will be obtaining the discrete change from 0 to 1. . estat alternatives Alternatives summary for car index 1 2 3 Alternative value 1 2 3 label Cases present Frequency selected Percent selected American Japan Europe 295 295 295 192 64 39 65.08 21.69 13.22 . estat mfx, at(dealer=0 sex=0) varlist(sex income) Pr(choice = American|1 selected) = .41964329 variable casevars sex* income dp/dx Std. Err. z P>|z| [ 95% C.I. .026238 -.007891 .068311 .002674 0.38 -2.95 0.701 0.003 -.107649 -.013132 ] .160124 -.00265 (*) dp/dx is for discrete change of indicator variable from 0 to 1 X 0 42.097 asclogit postestimation — Postestimation tools for asclogit 99 Pr(choice = Japan|1 selected) = .42696187 variable casevars sex* income dp/dx Std. Err. z P>|z| [ 95% C.I. -.161164 .005861 .079238 .002997 -2.03 1.96 0.042 0.051 -.316468 -.000014 ] -.005859 .011735 X 0 42.097 (*) dp/dx is for discrete change of indicator variable from 0 to 1 Pr(choice = Europe|1 selected) = .15339484 variable casevars sex* income dp/dx Std. Err. z P>|z| [ 95% C.I. .134926 .00203 .076556 .001785 1.76 1.14 0.078 0.255 -.015122 -.001469 ] .284973 .00553 X 0 42.097 (*) dp/dx is for discrete change of indicator variable from 0 to 1 The marginal effect of income indicates that there is a lower chance for a consumer to buy American automobiles with an increase in income. There is an indication that men have a higher preference for European automobiles than women but a lower preference for Japanese automobiles. We did not include the marginal effects for dealer because we view these as nuisance parameters, so we adjusted the probabilities by fixing dealer to a constant, 0. Stored results estat mfx stores the following in r(): Scalars r(pr alt) Matrices r(alt) scalars containing the computed probability of each alternative evaluated at the value that is labeled X in the table output. Here alt are the labels in the macro e(alteqs). matrices containing the computed marginal effects and associated statistics. There is one matrix for each alternative, where alt are the labels in the macro e(alteqs). Column 1 of each matrix contains the marginal effects; column 2, their standard errors; column 3, their z statistics; and columns 4 and 5, the confidence intervals. Column 6 contains the values of the independent variables used to compute the probabilities r(pr alt). Methods and formulas The deterministic component of the random-utility model can be expressed as 0 η = Xβ + (zA) = Xβ + (z ⊗ IJ ) vec(A0 ) β = (X, z ⊗ IJ ) vec(A0 ) = X∗ β∗ where X is the J × p matrix containing the alternative-specific covariates, z is a 1 × q vector of case-specific variables, β is a p × 1 vector of alternative-specific regression coefficients, and A = (α1 , . . . , αJ ) is a q × J matrix of case-specific regression coefficients (with one of the αj fixed to the constant). 
Here IJ is the J × J identity matrix, vec() is the vector function that creates a vector from a matrix by placing each column of the matrix on top of the other (see [M-5] vec( )), and ⊗ is the Kronecker product (see [M-2] op kronecker).

We have rewritten the linear equation so that it is in a form that we all recognize, namely, η = X∗β∗, where

    X∗ = (X, z ⊗ IJ)        β∗ = {β′, vec(A′)′}′

To compute the marginal effects, we use the derivative of the log likelihood ∂ℓ(y|η)/∂η, where ℓ(y|η) = log Pr(y|η) is the log of the probability of the choice indicator vector y given the linear predictor vector η. Namely,

    ∂ℓ(y|η)/∂vec(X∗)′ = {1/Pr(y|η)} ∂Pr(y|η)/∂η′ · ∂η/∂vec(X∗)′
                      = {1/Pr(y|η)} ∂Pr(y|η)/∂η′ (β∗′ ⊗ IJ)

The standard errors of the marginal effects are computed using the delta method.

Also see
[R] asclogit — Alternative-specific conditional logit (McFadden’s choice) model
[U] 20 Estimation and postestimation commands

Title
asmprobit — Alternative-specific multinomial probit regression

Syntax     Menu     Description     Options     Remarks and examples     Stored results     Methods and formulas     References     Also see

Syntax

    asmprobit depvar [indepvars] [if] [in] [weight], case(varname) alternatives(varname) [options]

options                            Description
Model
 ∗ case(varname)                   use varname to identify cases
 ∗ alternatives(varname)           use varname to identify the alternatives available for each case
   casevars(varlist)               case-specific variables
   constraints(constraints)        apply specified linear constraints
   collinear                       keep collinear variables
Model 2
   correlation(correlation)        correlation structure of the latent-variable errors
   stddev(stddev)                  variance structure of the latent-variable errors
   structural                      use the structural covariance parameterization; default is the differenced covariance parameterization
   factor(#)                       use the factor covariance structure with dimension #
   noconstant                      suppress the alternative-specific constant terms
   basealternative(# | lbl | str)  alternative used for normalizing location
   scalealternative(# | lbl | str) alternative used for normalizing scale
   altwise                         use alternativewise deletion instead of casewise deletion
SE/Robust
   vce(vcetype)                    vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife
Reporting
   level(#)                        set confidence level; default is level(95)
   notransform                     do not transform variance–covariance estimates to the standard deviation and correlation metric
   nocnsreport                     do not display constraints
   display options                 control column formats and line width
Integration
   intmethod(seqtype)              type of quasi- or pseudouniform point set
   intpoints(#)                    number of points in each sequence
   intburn(#)                      starting index in the Hammersley or Halton sequence
   intseed(code | #)               pseudouniform random-number seed
   antithetics                     use antithetic draws
   nopivot                         do not use integration interval pivoting
   initbhhh(#)                     use the BHHH optimization algorithm for the first # iterations
   favor(speed | space)            favor speed or space when generating integration points
Maximization
   maximize options                control the maximization process
   coeflegend                      display legend instead of statistics

correlation             Description
   unstructured         one correlation parameter for each pair of alternatives; correlations with the basealternative() are zero; the default
   exchangeable         one correlation parameter common to all pairs of alternatives; correlations with the basealternative() are zero
   independent          constrain all correlation parameters to zero
   pattern matname      user-specified matrix identifying the correlation pattern
   fixed matname        user-specified matrix identifying the fixed and free correlation parameters

stddev                  Description
   heteroskedastic      estimate standard deviation for each alternative; standard deviations for basealternative() and scalealternative() set to one
   homoskedastic        all standard deviations are one
   pattern matname      user-specified matrix identifying the standard deviation pattern
   fixed matname        user-specified matrix identifying the fixed and free standard deviations

seqtype                 Description
   hammersley           Hammersley point set
   halton               Halton point set
   random               uniform pseudorandom point set

∗ case(varname) and alternatives(varname) are required.
bootstrap, by, jackknife, statsby, and xi are allowed; see [U] 11.1.10 Prefix commands.
Weights are not allowed with the bootstrap prefix; see [R] bootstrap.
fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight.
coeflegend does not appear in the dialog box.
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.

Menu

Statistics > Categorical outcomes > Alternative-specific multinomial probit

Description

asmprobit fits multinomial probit (MNP) models by using maximum simulated likelihood (MSL) implemented by the Geweke–Hajivassiliou–Keane (GHK) algorithm. By estimating the variance–covariance parameters of the latent-variable errors, the model allows you to relax the independence of irrelevant alternatives (IIA) property that is characteristic of the multinomial logistic model.

asmprobit requires multiple observations for each case (decision), where each observation represents an alternative that may be chosen. The cases are identified by the variable specified in the case() option, whereas the alternatives are identified by the variable specified in the alternatives() option. The outcome (chosen alternative) is identified by a value of 1 in depvar, with 0 indicating the alternatives that were not chosen; only one alternative may be chosen for each case.

asmprobit allows two types of independent variables: alternative-specific variables and case-specific variables. Alternative-specific variables vary across both cases and alternatives and are specified in indepvars. Case-specific variables vary only across cases and are specified in the casevars() option.

Options

Model

case(varname) specifies the variable that identifies each case. This variable identifies the individuals or entities making a choice. case() is required.

alternatives(varname) specifies the variable that identifies the alternatives for each case. The number of alternatives can vary with each case; the maximum number of alternatives is 20. alternatives() is required.

casevars(varlist) specifies the case-specific variables that are constant for each case(). If there are a maximum of J alternatives, there will be J − 1 sets of coefficients associated with casevars().

constraints(constraints), collinear; see [R] estimation options.

Model 2

correlation(correlation) specifies the correlation structure of the latent-variable errors.

correlation(unstructured) is the most general and has J(J − 3)/2 + 1 unique correlation parameters. This is the default unless stddev() or structural are specified.

correlation(exchangeable) provides for one correlation coefficient common to all latent variables, except the latent variable associated with the basealternative() option.

correlation(independent) assumes that all correlations are zero.
correlation(pattern matname) and correlation(fixed matname) give you more flexibility in defining the correlation structure. See Variance structures later in this entry for more information. stddev(stddev) specifies the variance structure of the latent-variable errors. 104 asmprobit — Alternative-specific multinomial probit regression stddev(heteroskedastic) is the most general and has J − 2 estimable parameters. The standard deviations of the latent-variable errors for the alternatives specified in basealternative() and scalealternative() are fixed to one. stddev(homoskedastic) constrains all the standard deviations to equal one. stddev(pattern matname) and stddev(fixed matname) give you added flexibility in defining the standard deviation parameters. See Variance structures later in this entry for more information. structural requests the J ×J structural covariance parameterization instead of the default J −1×J −1 differenced covariance parameterization (the covariance of the latent errors differenced with that of the base alternative). The differenced covariance parameterization will achieve the same MSL regardless of the choice of basealternative() and scalealternative(). On the other hand, the structural covariance parameterization imposes more normalizations that may bound the model away from its maximum likelihood and thus prevent convergence with some datasets or choices of basealternative() and scalealternative(). factor(#) requests that the factor covariance structure of dimension # be used. The factor() option can be used with the structural option but cannot be used with stddev() or correlation(). A # × J (or # × J − 1) matrix, C, is used to factor the covariance matrix as I + C0 C, where I is the identity matrix of dimension J (or J − 1). The column dimension of C depends on whether the covariance is structural or differenced. The row dimension of C, #, must be less than or equal to floor((J(J − 1)/2 − 1)/(J − 2)), because there are only J(J − 1)/2 − 1 identifiable variance–covariance parameters. This covariance parameterization may be useful for reducing the number of covariance parameters that need to be estimated. If the covariance is structural, the column of C corresponding to the base alternative contains zeros. The column corresponding to the scale alternative has a one in the first row and zeros elsewhere. If the covariance is differenced, the column corresponding to the scale alternative (differenced with the base) has a one in the first row and zeros elsewhere. noconstant suppresses the J − 1 alternative-specific constant terms. basealternative(# | lbl | str) specifies the alternative used to normalize the latent-variable location (also referred to as the level of utility). The base alternative may be specified as a number, label, or string. The standard deviation for the latent-variable error associated with the base alternative is fixed to one, and its correlations with all other latent-variable errors are set to zero. The default is the first alternative when sorted. If a fixed or pattern matrix is given in the stddev() and correlation() options, the basealternative() will be implied by the fixed standard deviations and correlations in the matrix specifications. basealternative() cannot be equal to scalealternative(). scalealternative(# | lbl | str) specifies the alternative used to normalize the latent-variable scale (also referred to as the scale of utility). The scale alternative may be specified as a number, label, or string. 
The default is to use the second alternative when sorted. If a fixed or pattern matrix is given in the stddev() option, the scalealternative() will be implied by the fixed standard deviations in the matrix specification. scalealternative() cannot be equal to basealternative(). If a fixed or pattern matrix is given for the stddev() option, the base alternative and scale alternative are implied by the standard deviations and correlations in the matrix specifications, and they need not be specified in the basealternative() and scalealternative() options. altwise specifies that alternativewise deletion be used when marking out observations due to missing values in your variables. The default is to use casewise deletion; that is, the entire group of observations making up a case is deleted if any missing values are encountered. This option does not apply to observations that are marked out by the if or in qualifier or the by prefix. asmprobit — Alternative-specific multinomial probit regression 105 SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. If specifying vce(bootstrap) or vce(jackknife), you must also specify basealternative() and scalealternative(). Reporting level(#); see [R] estimation options. notransform prevents retransforming the Cholesky-factored variance–covariance estimates to the correlation and standard deviation metric. This option has no effect if structural is not specified because the default differenced variance– covariance estimates have no interesting interpretation as correlations and standard deviations. notransform also has no effect if the correlation() and stddev() options are specified with anything other than their default values. Here it is generally not possible to factor the variance– covariance matrix, so optimization is already performed using the standard deviation and correlation representations. nocnsreport; see [R] estimation options. display options: cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Integration intmethod(hammersley | halton | random) specifies the method of generating the point sets used in the quasi–Monte Carlo integration of the multivariate normal density. intmethod(hammersley), the default, uses the Hammersley sequence; intmethod(halton) uses the Halton sequence; and intmethod(random) uses a sequence of uniform random numbers. intpoints(#) specifies the number of points to use in the quasi–Monte Carlo integration. If this option is not specified, the number of points is 50 × J if intmethod(hammersley) or intmethod(halton) is used and 100 × J if intmethod(random) is used. Larger values of intpoints() provide better approximations of the log likelihood, but at the cost of added computation time. intburn(#) specifies where in the Hammersley or Halton sequence to start, which helps reduce the correlation between the sequences of each dimension. The default is 0. This option may not be specified with intmethod(random). intseed(code | #) specifies the seed to use for generating the uniform pseudorandom sequence. This option may be specified only with intmethod(random). code refers to a string that records the state of the random-number generator runiform(); see [R] set seed. 
An integer value # may be used also. The default is to use the current seed value from Stata’s uniform random-number generator, which can be obtained from c(seed). antithetics specifies that antithetic draws be used. The antithetic draw for the J − 1 vector uniform-random variables, x, is 1 − x. nopivot turns off integration interval pivoting. By default, asmprobit will pivot the wider intervals of integration to the interior of the multivariate integration. This improves the accuracy of the quadrature estimate. However, discontinuities may result in the computation of numerical secondorder derivatives using finite differencing (for the Newton–Raphson optimize technique, tech(nr)) 106 asmprobit — Alternative-specific multinomial probit regression when few simulation points are used, resulting in a non–positive-definite Hessian. asmprobit uses the Broyden–Fletcher–Goldfarb–Shanno optimization algorithm, by default, which does not require computing the Hessian numerically using finite differencing. initbhhh(#) specifies that the Berndt–Hall–Hall–Hausman (BHHH) algorithm be used for the initial # optimization steps. This option is the only way to use the BHHH algorithm along with other optimization techniques. The algorithm switching feature of ml’s technique() option cannot include bhhh. favor(speed | space) instructs asmprobit to favor either speed or space when generating the integration points. favor(speed) is the default. When favoring speed, the integration points are generated once and stored in memory, thus increasing the speed of evaluating the likelihood. This speed increase can be seen when there are many cases or when the user specifies a large number of integration points, intpoints(#). When favoring space, the integration points are generated repeatedly with each likelihood evaluation. For unbalanced data, where the number of alternatives varies with each case, the estimates computed using intmethod(random) will vary slightly between favor(speed) and favor(space). This is because the uniform sequences will not be identical, even when initiating the sequences using the same uniform seed, intseed(code | #). For favor(speed), ncase blocks of intpoints(#) × J − 2 uniform points are generated, where J is the maximum number of alternatives. For favor(space), the column dimension of the matrices of points varies with the number of alternatives that each case has. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. The following options may be particularly useful in obtaining convergence with asmprobit: difficult, technique(algorithm spec), nrtolerance(#), nonrtolerance, and from(init specs). If technique() contains more than one algorithm specification, bhhh cannot be one of them. To use the BHHH algorithm with another algorithm, use the initbhhh() option and specify the other algorithm in technique(). Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). The following option is available with asmprobit but is not shown in the dialog box: coeflegend; see [R] estimation options. 
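The Integration and Maximization options are often adjusted together when a model is difficult to fit. The following is one possible combination, using the travel-mode choice data introduced in the examples below; it is a sketch of how these options might be specified rather than a recommended default, and the particular numbers of points and iterations are arbitrary.

        . use http://www.stata-press.com/data/r13/travel
        . asmprobit choice travelcost termtime, case(id) alternatives(mode)
        >      casevars(income) intmethod(halton) intpoints(600) antithetics
        >      initbhhh(5) technique(nr) difficult

Here the likelihood is simulated from 600 Halton points with antithetic draws, the first 5 iterations use the BHHH algorithm, and the remaining iterations use the Newton–Raphson technique with the difficult option.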
Remarks and examples

Remarks are presented under the following headings:
    Introduction
    Variance structures

Introduction

The MNP model is used with discrete dependent variables that take on more than two outcomes that do not have a natural ordering. The stochastic error terms are assumed to have a multivariate normal distribution that is heteroskedastic and correlated. Say that you have a set of J unordered alternatives that are modeled by a regression of both case-specific and alternative-specific covariates. A “case” refers to the information on one decision maker. Underlying the model is the set of J latent variables (utilities),

    ηij = xij β + zi αj + ξij        (1)

where i denotes cases and j denotes alternatives. xij is a 1 × p vector of alternative-specific variables, β is a p × 1 vector of parameters, zi is a 1 × q vector of case-specific variables, αj is a q × 1 vector of parameters for the jth alternative, and ξ′i = (ξi1, . . . , ξiJ) is distributed multivariate normal with mean zero and covariance matrix Ω. The decision maker selects the alternative whose latent variable is highest.

Because the MNP model allows for a general covariance structure in ξij, it does not impose the IIA property inherent in multinomial logistic and conditional logistic models. That is, the MNP model permits the odds of choosing one alternative over another to depend on the remaining alternatives. For example, consider the choice of travel mode between two cities: air, train, bus, or car, as a function of the travel-mode cost, travel time (alternative-specific variables), and an individual’s income (a case-specific variable). The odds of choosing air travel over a bus may not be independent of the train alternative because both bus and train travel are public ground transportation. That is, the probability of choosing air travel is Pr(ηair > ηbus, ηair > ηtrain, ηair > ηcar), and the two events ηair > ηbus and ηair > ηtrain may be correlated.

An alternative to MNP that will allow a nested correlation structure in ξij is the nested logit model (see [R] nlogit).

The added flexibility of the MNP model does impose a significant computation burden because of the need to evaluate probabilities from the multivariate normal distribution. These probabilities are evaluated using simulation techniques because a closed-form solution does not exist. See Methods and formulas for more information.

Not all the J sets of regression coefficients αj are identifiable, nor are all J(J + 1)/2 elements of the variance–covariance matrix Ω. As described by Train (2009, sec. 2.5), the model requires normalization because both the location (level) and scale of the latent variable are irrelevant. Increasing the latent variables by a constant does not change which ηij is the maximum for decision maker i, nor does multiplying them by a constant. To normalize location, we choose an alternative, indexed by k, say, and take the difference between the latent variable k and the J − 1 others,

    vijk = ηij − ηik
         = (xij − xik)β + zi(αj − αk) + ξij − ξik
         = δij′ β + zi γj′ + εij′        (2)
         = λij′ + εij′

where j′ = j if j < k and j′ = j − 1 if j > k, so that j′ = 1, . . . , J − 1. One can now work with the (J − 1) × (J − 1) covariance matrix Σ(k) for ε′i = (εi1, . . . , εi,J−1). The kth alternative here is the basealternative() in asmprobit. From (2), the probability that decision maker i chooses alternative k, for example, is Pr(i chooses k) = Pr(vi1k ≤ 0, . . .
, vi,J−1,k ≤ 0) = Pr(i1 ≤ −λi1 , . . . , i,J−1 ≤ −λi,J−1 ) 108 asmprobit — Alternative-specific multinomial probit regression To normalize for scale, one of the diagonal elements of Σ(k) must be fixed to a constant. In asmprobit, this is the error variance for the alternative specified by scalealternative(). Thus there are a total of, at most, J(J − 1)/2 − 1 identifiable variance–covariance parameters. See Variance structures below for more on this issue. In fact, the model is slightly more general in that not all cases need to have faced all J alternatives. The model allows for situations in which some cases chose among all possible alternatives, whereas other cases were given a choice among a subset of them, and perhaps other cases were given a choice among a different subset. The number of observations for each case is equal to the number of alternatives faced. The MNP model is often motivated using a random-utility consumer-choice framework. Equation (1) represents the utility that consumer i receives from good j . The consumer purchases the good for which the utility is highest. Because utility is ordinal, all that matters is the ranking of the utilities from the alternatives. Thus one must normalize for location and scale. Example 1 Application of MNP models is common in the analysis of transportation data. Greene (2012, sec. 18.2.9) uses travel-mode choice data between Sydney and Melbourne to demonstrate estimating parameters of various discrete-choice models. The data contain information on 210 individuals’ choices of travel mode. The four alternatives are air, train, bus, and car, with indices 1, 2, 3, and 4, respectively. One alternative-specific variable is travelcost, a measure of generalized cost of travel that is equal to the sum of in-vehicle cost and a wagelike measure times the amount of time spent traveling. A second alternative-specific variable is the terminal time, termtime, which is zero for car transportation. Household income, income, is a case-specific variable. . use http://www.stata-press.com/data/r13/travel . list id mode choice travelcost termtime income in 1/12, sepby(id) id mode choice travel~t termtime income 1. 2. 3. 4. 1 1 1 1 air train bus car 0 0 0 1 70 71 70 30 69 34 35 0 35 35 35 35 5. 6. 7. 8. 2 2 2 2 air train bus car 0 0 0 1 68 84 85 50 64 44 53 0 30 30 30 30 9. 10. 11. 12. 3 3 3 3 air train bus car 0 0 0 1 129 195 149 101 69 34 35 0 40 40 40 40 The model of travel choice is ηij = β1 travelcostij + β2 termtimeij + α1j incomei + α0j + ξij The alternatives can be grouped as air and ground travel. With this in mind, we set the air alternative to be the basealternative() and choose train as the scaling alternative. Because these are the first and second alternatives in the mode variable, they are also the defaults. asmprobit — Alternative-specific multinomial probit regression . asmprobit choice travelcost termtime, case(id) alternatives(mode) > casevars(income) (output omitted ) Alternative-specific multinomial probit Number of obs = Case variable: id Number of cases = Alternative variable: mode Alts per case: min = avg = max = Integration sequence: Hammersley Integration points: 200 Wald chi2(5) = Log simulated-likelihood = -190.09418 Prob > chi2 = choice Coef. Std. Err. mode travelcost termtime -.00977 -.0377095 air (base alternative) .0027834 .0094088 z P>|z| 109 840 210 4 4.0 4 32.05 0.0000 [95% Conf. 
Interval] -3.51 -4.01 0.000 0.000 -.0152253 -.0561504 -.0043146 -.0192686 train income _cons -.0291971 .5616376 .0089246 .3946551 -3.27 1.42 0.001 0.155 -.046689 -.2118721 -.0117052 1.335147 income _cons -.0127503 -.0571364 .0079267 .4791861 -1.61 -0.12 0.108 0.905 -.0282863 -.9963239 .0027857 .882051 income _cons -.0049086 -1.833393 .0077486 .8186156 -0.63 -2.24 0.526 0.025 -.0200957 -3.43785 .0102784 -.2289357 /lnl2_2 /lnl3_3 -.5502039 -.6005552 .3905204 .3353292 -1.41 -1.79 0.159 0.073 -1.31561 -1.257788 .2152021 .0566779 /l2_1 /l3_1 /l3_2 1.131518 .9720669 .5197214 .2124817 .2352116 .2861552 5.33 4.13 1.82 0.000 0.000 0.069 .7150612 .5110606 -.0411325 1.547974 1.433073 1.080575 bus car (mode=air is the alternative normalizing location) (mode=train is the alternative normalizing scale) . estimates store full By default, the differenced covariance parameterization is used, so the covariance matrix for this model is 3 × 3. There are two free variances to estimate and three correlations. To help ensure that the covariance matrix remains positive definite, asmprobit uses the square root transformation, where it optimizes on the Cholesky-factored variance–covariance. To ensure that the diagonal elements of the Cholesky estimates remain positive, we use the log transformation. The estimates labeled /lnl2 2 and /lnl3 3 in the coefficient table are the log-transformed diagonal elements of the Cholesky matrix. The estimates labeled /l2 1, /l3 1, and /l3 2 are the off-diagonal entries for elements (2, 1), (3, 1), and (3, 2) of the Cholesky matrix. Although the transformed parameters of the differenced covariance parameterization are difficult to interpret, you can view them untransformed by using the estat command. Typing 110 asmprobit — Alternative-specific multinomial probit regression . estat correlation train bus car train bus car 1.0000 0.8909 0.7895 1.0000 0.8951 1.0000 Note: correlations are for alternatives differenced with air gives the correlations, and typing . estat covariance train bus car train bus car 2 1.600208 1.37471 1.613068 1.399703 1.515884 Note: covariances are for alternatives differenced with air gives the (co)variances. We can reduce the number of covariance parameters in the model by using the factor model by Cameron and Trivedi (2005). For large models with many alternatives, the parameter reduction can be dramatic, but for our example we will use factor(1), a one-dimension factor model, to reduce by 3 the number of parameters associated with the covariance matrix. asmprobit — Alternative-specific multinomial probit regression 111 . asmprobit choice travelcost termtime, case(id) alternatives(mode) > casevars(income) factor(1) (output omitted ) Alternative-specific multinomial probit Case variable: id Number of obs Number of cases = = 840 210 Alternative variable: mode Alts per case: min = avg = max = 4 4.0 4 Integration sequence: Hammersley Integration points: 200 Log simulated-likelihood = -196.85094 choice Coef. Std. Err. mode travelcost termtime -.0093696 -.0593173 air (base alternative) .0036329 .0064585 Wald chi2(5) Prob > chi2 z P>|z| = = 107.85 0.0000 [95% Conf. 
Interval] -2.58 -9.18 0.010 0.000 -.01649 -.0719757 -.0022492 -.0466589 train income _cons -.0373511 .1092322 .0098219 .3949529 -3.80 0.28 0.000 0.782 -.0566018 -.6648613 -.0181004 .8833257 income _cons -.0158793 -1.082181 .0112239 .4678732 -1.41 -2.31 0.157 0.021 -.0378777 -1.999196 .0061191 -.1651666 income _cons .0042677 -3.765445 .0092601 .5540636 0.46 -6.80 0.645 0.000 -.0138817 -4.851389 .0224171 -2.6795 /c1_2 /c1_3 1.182805 1.227705 .3060299 .3401237 3.86 3.61 0.000 0.000 .5829972 .5610747 1.782612 1.894335 bus car (mode=air is the alternative normalizing location) (mode=train is the alternative normalizing scale) The estimates labeled /c1 2 and /c1 3 in the coefficient table are the factor loadings. These factor loadings produce the following differenced covariance estimates: . estat covariance train bus car train bus car 2 1.182805 1.227705 2.399027 1.452135 2.507259 Note: covariances are for alternatives differenced with air Variance structures The matrix Ω has J(J + 1)/2 distinct elements because it is symmetric. Selecting a base alternative, normalizing its error variance to one, and constraining the correlations between its error and the other errors reduces the number of estimable parameters by J . Moreover, selecting a scale alternative and normalizing its error variance to one reduces the number by one, as well. Hence, there are at most m = J(J − 1)/2 − 1 estimable parameters in Ω. 112 asmprobit — Alternative-specific multinomial probit regression In practice, estimating all m parameters can be difficult, so one must often place more restrictions on the parameters. The asmprobit command provides the correlation() option to specify restrictions on the J(J − 3)/2 + 1 correlation parameters not already restricted as a result of choosing the base alternatives, and it provides stddev() to specify restrictions on the J − 2 standard deviations not already restricted as a result of choosing the base and scale alternatives. When the structural option is used, asmprobit fits the model by assuming that all m parameters can be estimated, which is equivalent to specifying correlation(unstructured) and stddev(heteroskedastic). The unstructured correlation structure means that all J(J − 3)/2 + 1 of the remaining correlation parameters will be estimated, and the heteroskedastic specification means that all J − 2 standard deviations will be estimated. With these default settings, the log likelihood is maximized with respect to the Cholesky decomposition of Ω, and then the parameters are transformed to the standard deviation and correlation form. The correlation(exchangeable) option forces the J(J − 3)/2 + 1 correlation parameters to be equal, and correlation(independent) forces all the correlations to be zero. Using the stddev(homoskedastic) option forces all J standard deviations to be one. These options may help in obtaining convergence for a model if the default options do not produce satisfactory results. In fact, when fitting a complex model, it may be advantageous to first fit a simple one and then proceed with removing the restrictions one at a time. Advanced users may wish to specify alternative variance structures of their own choosing, and the next few paragraphs explain how to do so. correlation(pattern matname) allows you to give the name of a J × J matrix that identifies a correlation structure. Sequential positive integers starting at 1 are used to identify each correlation parameter: if there are three correlation parameters, they are identified by 1, 2, and 3. 
The integers can be repeated to indicate that correlations with the same number should be constrained to be equal. A zero or a missing value (.) indicates that the correlation is to be set to zero. asmprobit considers only the elements of the matrix below the main diagonal. Suppose that you have a model with four alternatives, numbered 1–4, and alternative 1 is the base. The unstructured and exchangeable correlation structures identified in the 4 × 4 lower triangular matrices are unstructured exchangeable 1 2 3 4 1 2 3 4 1 · 1 · 20 · 20 · 30 1 · 30 1 · 4 0 2 3 · 4 0 1 1 · asmprobit labels these correlation structures unstructured and exchangeable, even though the correlations corresponding to the base alternative are set to zero. More formally: these terms are appropriate when considering the (J − 1) × (J − 1) submatrix Σ(k) defined in the Introduction above. You can also use the correlation(fixed matname) option to specify a matrix that specifies fixed and free parameters. Here the free parameters (those that are to be estimated) are identified by a missing value, and nonmissing values represent correlations that are to be taken as given. Below is a correlation structure that would set the correlations of alternative 1 to be 0.5: 1 1 · 2 0.5 3 0.5 4 0.5 2 3 4 · · · · · · asmprobit — Alternative-specific multinomial probit regression 113 The order of the elements of the pattern or fixed matrices must be the same as the numeric order of the alternative levels. To specify the structure of the standard deviations—the diagonal elements of Ω—you can use the stddev(pattern matname) option, where matname is a 1 × J matrix. Sequential positive integers starting at 1 are used to identify each standard deviation parameter. The integers can be repeated to indicate that standard deviations with the same number are to be constrained to be equal. A missing value indicates that the corresponding standard deviation is to be set to one. In the four-alternative example mentioned above, suppose that you wish to set the first and second standard deviations to one and that you wish to constrain the third and fourth standard deviations to be equal; the following pattern matrix will do that: 1 2 3 4 1 ( · · 1 1) Using the stddev(fixed matname) option allows you to identify the fixed and free standard deviations. Fixed standard deviations are entered as positive real numbers, and free parameters are identified with missing values. For example, to constrain the first and second standard deviations to equal one and to allow the third and fourth to be estimated, you would use this fixed matrix: 1 2 1 (1 1 3 · 4 ·) When supplying either the pattern or the fixed matrices, you must ensure that the model is properly scaled. At least two standard deviations must be constant for the model to be scaled. A warning is issued if asmprobit detects that the model is not scaled. The order of the elements of the pattern or fixed matrices must be the same as the numeric order of the alternative levels. Example 2 In example 1, we used the differenced covariance parameterization, the default. We now use the structural option to view the J − 2 standard deviation estimates and the (J − 1)(J − 2)/2 correlation estimates. Here we will fix the standard deviations for the air and train alternatives to 1 and the correlations between air and the rest of the alternatives to 0. 114 asmprobit — Alternative-specific multinomial probit regression . 
asmprobit choice travelcost termtime, case(id) alternatives(mode) > casevars(income) structural (output omitted ) Alternative-specific multinomial probit Number of obs = Case variable: id Number of cases = Alternative variable: mode Alts per case: min = avg = max = Integration sequence: Hammersley Integration points: 200 Wald chi2(5) = Log simulated-likelihood = -190.09418 Prob > chi2 = choice Coef. Std. Err. mode travelcost termtime -.0097703 -.0377103 air (base alternative) .0027834 .0094092 z P>|z| 840 210 4 4.0 4 32.05 0.0000 [95% Conf. Interval] -3.51 -4.01 0.000 0.000 -.0152257 -.056152 -.0043149 -.0192687 train income _cons -.0291975 .5616448 .0089246 .3946529 -3.27 1.42 0.001 0.155 -.0466895 -.2118607 -.0117055 1.33515 income _cons -.01275 -.0571664 .0079266 .4791996 -1.61 -0.12 0.108 0.905 -.0282858 -.9963803 .0027858 .8820476 income _cons -.0049085 -1.833444 .0077486 .8186343 -0.63 -2.24 0.526 0.025 -.0200955 -3.437938 .0102785 -.22895 /lnsigma3 /lnsigma4 -.2447428 -.3309429 .4953363 .6494493 -0.49 -0.51 0.621 0.610 -1.215584 -1.60384 .7260985 .9419543 /atanhr3_2 /atanhr4_2 /atanhr4_3 1.01193 .5786576 .8885204 .3890994 .3940461 .5600561 2.60 1.47 1.59 0.009 0.142 0.113 .249309 -.1936586 -.2091693 1.774551 1.350974 1.98621 sigma1 sigma2 sigma3 sigma4 1 1 .7829059 .7182462 .2965368 .2011227 2.067 2.564989 rho3_2 rho4_2 rho4_3 .766559 .5216891 .7106622 .244269 -.1912734 -.2061713 .9441061 .874283 .9630403 bus car (base alternative) (scale alternative) .3878017 .4664645 .1604596 .2868027 .277205 (mode=air is the alternative normalizing location) (mode=train is the alternative normalizing scale) When comparing this output to that of example 1, we see that we have achieved the same log likelihood. That is, the structural parameterization using air as the base alternative and train as the scale alternative applied no restrictions on the model. This will not always be the case. We leave it up to you to try different base and scale alternatives, and you will see that not all the different combinations will achieve the same log likelihood. This is not true for the differenced covariance parameterization: it will always achieve the same log likelihood (and the maximum possible likelihood) regardless of the base and scale alternatives. This is why it is the default parameterization. asmprobit — Alternative-specific multinomial probit regression 115 For an exercise, we can compute the differenced covariance displayed in example 1 by using the following ado-code. . estat covariance air train bus car air train bus car 1 0 0 0 1 .6001436 .3747012 .6129416 .399619 .5158776 . return list matrices: r(cov) : 4 x 4 . matrix cov = r(cov) . matrix M = (1,-1,0,0 \ 1,0,-1,0 \ 1,0,0,-1) . matrix cov1 = M*cov*M’ . matrix list cov1 symmetric cov1[3,3] r1 r2 r1 2 r2 1.6001436 1.6129416 r3 1.3747012 1.399619 r3 1.5158776 The slight difference in the regression coefficients between the example 1 and example 2 coefficient tables reflects the accuracy of the [M-5] ghk( ) algorithm using 200 points from the Hammersley sequence. We now fit the model using the exchangeable correlation matrix and compare the models with a likelihood-ratio test. 116 asmprobit — Alternative-specific multinomial probit regression . 
asmprobit choice travelcost termtime, case(id) alternatives(mode) > casevars(income) correlation(exchangeable) (output omitted ) Alternative-specific multinomial probit Number of obs = Case variable: id Number of cases = Alternative variable: mode Alts per case: min = avg = max = Integration sequence: Hammersley Integration points: 200 Wald chi2(5) = Log simulated-likelihood = -190.4679 Prob > chi2 = choice Coef. Std. Err. mode travelcost termtime -.0084636 -.0345394 air (base alternative) .0020452 .0072812 z P>|z| 840 210 4 4.0 4 53.60 0.0000 [95% Conf. Interval] -4.14 -4.74 0.000 0.000 -.012472 -.0488103 -.0044551 -.0202684 train income _cons -.0290357 .5517445 .0083226 .3719913 -3.49 1.48 0.000 0.138 -.0453477 -.177345 -.0127237 1.280834 income _cons -.0132562 -.0052517 .0074133 .4337932 -1.79 -0.01 0.074 0.990 -.0277859 -.8554708 .0012735 .8449673 income _cons -.0060878 -1.565918 .006638 .6633007 -0.92 -2.36 0.359 0.018 -.0190981 -2.865964 .0069224 -.265873 /lnsigmaP1 /lnsigmaP2 -.3557589 -1.308596 .1972809 .8872957 -1.80 -1.47 0.071 0.140 -.7424222 -3.047663 .0309045 .4304719 /atanhrP1 1.116589 .3765488 2.97 0.003 .3785667 1.854611 sigma1 sigma2 sigma3 sigma4 1 1 .7006416 .2701992 .4759596 .0474697 1.031387 1.537983 rho3_2 rho4_2 rho4_3 .8063791 .8063791 .8063791 .3614621 .3614621 .3614621 .9521783 .9521783 .9521783 bus car (base alternative) (scale alternative) .1382232 .2397466 .131699 .131699 .131699 (mode=air is the alternative normalizing location) (mode=train is the alternative normalizing scale) . lrtest full . Likelihood-ratio test (Assumption: . nested in full) LR chi2(2) = Prob > chi2 = 0.75 0.6882 The likelihood-ratio test suggests that a common correlation is a plausible hypothesis, but this could be an artifact of the small sample size. The labeling of the standard deviation and correlation estimates has changed from /lnsigma and /atanhr, in the previous example, to /lnsigmaP and /atanhrP. The “P” identifies the parameter’s index in the pattern matrices used by asmprobit. The pattern matrices are stored in e(stdpattern) and e(corpattern). asmprobit — Alternative-specific multinomial probit regression 117 Technical note Another way to fit the model with the exchangeable correlation structure in example 2 is to use the constraint command to define the constraints on the rho parameters manually and then apply those. . . . > constraint 1 [atanhr3_2]_cons = [atanhr4_2]_cons constraint 2 [atanhr3_2]_cons = [atanhr4_3]_cons asmprobit choice travelcost termtime, case(id) alternatives(mode) casevars(income) constraints(1 2) structural With this method, however, we must keep track of what parameterization of the rhos is used in estimation, and that depends on the options specified. Example 3 In the last example, we used the correlation(exchangeable) option, reducing the number of correlation parameters from three to one. We can explore a two–correlation parameter model by specifying a pattern matrix in the correlation() option. Suppose that we wish to have the correlation between train and bus be equal to the correlation between bus and car and to have the standard deviations for the bus and car equations be equal. We will use air as the base category and train as the scale category. 118 asmprobit — Alternative-specific multinomial probit regression . matrix define corpat = J(4, 4, .) . matrix corpat[3,2] = 1 . matrix corpat[4,3] = 1 . matrix corpat[4,2] = 2 . matrix define stdpat = J(1, 4, .) . matrix stdpat[1,3] = 1 . matrix stdpat[1,4] = 1 . 
asmprobit choice travelcost termtime, case(id) alternatives(mode) > casevars(income) correlation(pattern corpat) stddev(pattern stdpat) (output omitted ) Alternative-specific multinomial probit Case variable: id Number of obs Number of cases = = 840 210 Alternative variable: mode Alts per case: min = avg = max = 4 4.0 4 Integration sequence: Hammersley Integration points: 200 Log simulated-likelihood = -190.12871 choice Coef. Std. Err. mode travelcost termtime -.0100335 -.0385731 air (base alternative) .0026203 .008608 Wald chi2(5) Prob > chi2 z P>|z| = = 41.67 0.0000 [95% Conf. Interval] -3.83 -4.48 0.000 0.000 -.0151692 -.0554445 -.0048979 -.0217018 train income _cons -.029271 .56528 .0089739 .4008037 -3.26 1.41 0.001 0.158 -.0468595 -.2202809 -.0116824 1.350841 income _cons -.0124658 -.0741685 .0080043 .4763422 -1.56 -0.16 0.119 0.876 -.0281539 -1.007782 .0032223 .859445 income _cons -.0046905 -1.897931 .0079934 .7912106 -0.59 -2.40 0.557 0.016 -.0203573 -3.448675 .0109763 -.3471867 /lnsigmaP1 -.197697 .2751269 -0.72 0.472 -.7369359 .3415418 /atanhrP1 /atanhrP2 .9704403 .5830923 .3286981 .3690419 2.95 1.58 0.003 0.114 .3262038 -.1402165 1.614677 1.306401 sigma1 sigma2 sigma3 sigma4 1 1 .8206185 .8206185 .4785781 .4785781 1.407115 1.407115 rho3_2 rho4_2 rho4_3 .7488977 .5249094 .7488977 .3151056 -.1393048 .3151056 .9238482 .863362 .9238482 bus car (base alternative) (scale alternative) .2257742 .2257742 .1443485 .2673598 .1443485 (mode=air is the alternative normalizing location) (mode=train is the alternative normalizing scale) In the call to asmprobit, we did not need to specify the basealternative() and scalealternative() options because they are implied by the specifications of the pattern matrices. asmprobit — Alternative-specific multinomial probit regression 119 Technical note If you experience convergence problems, try specifying nopivot, increasing intpoints(), specifying antithetics, specifying technique(nr) with difficult, or specifying a switching algorithm in the technique() option. As a last resort, you can use the nrtolerance() and showtolerance options. Changing the base and scale alternative in the model specification can also affect convergence if the structural option is used. Because simulation methods are used to obtain multivariate normal probabilities, the estimates obtained have a limited degree of precision. Moreover, the solutions are particularly sensitive to the starting values used. Experimenting with different starting values may help in obtaining convergence, and doing so is a good way to verify previous results. If you wish to use the BHHH algorithm along with another maximization algorithm, you must specify the initbhhh(#) option, where # is the number of BHHH iterations to use before switching to the algorithm specified in technique(). The BHHH algorithm uses an outer-product-of-gradients approximation for the Hessian, and asmprobit must perform the gradient calculations differently than for the other algorithms. Technical note If there are no alternative-specific variables in your model, the variance–covariance matrix parameters are not identifiable. For such a model to converge, you would therefore need to use correlation(independent) and stddev(homoskedastic). A better alternative is to use mprobit, which is geared specifically toward models with only case-specific variables. See [R] mprobit. 
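Because the likelihood is simulated, a practical check on a fitted model (a sketch, not part of the examples above) is to refit it with a different point set or more integration points and confirm that the log simulated-likelihood and the coefficients are stable:

        . asmprobit choice travelcost termtime, case(id) alternatives(mode)
        >      casevars(income) intmethod(hammersley) intpoints(200)
        . estimates store ham200
        . scalar ll0 = e(ll)
        . asmprobit choice travelcost termtime, case(id) alternatives(mode)
        >      casevars(income) intmethod(halton) intpoints(600)
        . estimates store hal600
        . display "change in log simulated-likelihood = " e(ll) - ll0
        . estimates table ham200 hal600, se

Large changes across point sets suggest that more integration points are needed.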
120 asmprobit — Alternative-specific multinomial probit regression Stored results asmprobit stores the following in e(): Scalars e(N) e(N case) e(k) e(k alt) e(k indvars) e(k casevars) e(k sigma) e(k rho) e(k eq) e(k eq model) e(df m) e(ll) e(N clust) e(const) e(i base) e(i scale) e(mc points) e(mc burn) e(mc antithetics) e(chi2) e(p) e(fullcov) e(structcov) e(cholesky) e(alt min) e(alt avg) e(alt max) e(rank) e(ic) e(rc) e(converged) number of observations number of cases number of parameters number of alternatives number of alternative-specific variables number of case-specific variables number of variance estimates number of correlation estimates number of equations in e(b) number of equations in overall model test model degrees of freedom log simulated-likelihood number of clusters constant indicator base alternative index scale alternative index number of Monte Carlo replications starting sequence index antithetics indicator χ2 significance unstructured covariance indicator 1 if structured covariance; 0 otherwise Cholesky-factored covariance indicator minimum number of alternatives average number of alternatives maximum number of alternatives rank of e(V) number of iterations return code 1 if converged, 0 otherwise asmprobit — Alternative-specific multinomial probit regression 121 Macros e(cmd) e(cmdline) e(depvar) e(indvars) e(casevars) e(case) e(altvar) e(alteqs) e(alt#) e(wtype) e(wexp) e(title) e(clustvar) e(correlation) e(stddev) e(cov class) e(chi2type) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(mc method) e(mc seed) e(user) e(technique) e(datasignature) e(datasignaturevars) e(properties) e(estat cmd) e(mfx dlg) e(predict) e(marginsnotok) asmprobit command as typed name of dependent variable alternative-specific independent variable case-specific variables variable defining cases variable defining alternatives alternative equation names alternative labels weight type weight expression title in estimation output name of cluster variable correlation structure variance structure class of the covariance structure Wald, type of model χ2 test vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method technique used to generate sequences random-number generator seed name of likelihood-evaluator program maximization technique the checksum variables used in calculation of checksum b V program used to implement estat program used to implement estat mfx dialog program used to implement predict predictions disallowed by margins Matrices e(b) e(Cns) e(stats) e(stdpattern) e(stdfixed) e(altvals) e(altfreq) e(alt casevars) e(corpattern) e(corfixed) e(ilog) e(gradient) e(V) e(V modelbased) coefficient vector constraints matrix alternative statistics variance pattern fixed and free standard deviations alternative values alternative frequencies indicators for estimated case-specific coefficients—e(k alt)×e(k casevars) correlation structure fixed and free correlations iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance Functions e(sample) marks estimation sample Methods and formulas The simulated maximum likelihood estimates for the MNP are obtained using ml; see [R] ml. The likelihood evaluator implements the GHK algorithm to approximate the multivariate distribution function (Geweke 1989; Hajivassiliou and McFadden 1998; Keane and Wolpin 1994). 
The technique is also described in detail by Genz (1992), but Genz describes a more general algorithm where both lower and upper bounds of integration are finite. We briefly describe the GHK simulator and refer you to Bolduc (1999) for the score computations.

As discussed earlier, the latent variables for a J-alternative model are ηij = xij β + zi αj + ξij, for j = 1, . . . , J, i = 1, . . . , n, and ξ′i = (ξi1, . . . , ξiJ) ∼ MVN(0, Ω). The experimenter observes alternative k for the ith observation if k = arg max(ηij, j = 1, . . . , J). Let

    vij′ = ηij − ηik
         = (xij − xik)β + zi(αj − αk) + ξij − ξik
         = δij′ β + zi γj′ + εij′

where j′ = j if j < k and j′ = j − 1 if j > k, so that j′ = 1, . . . , J − 1. Further, εi = (εi1, . . . , εi,J−1) ∼ MVN(0, Σ(k)). Σ is indexed by k because it depends on the choice made. We denote the deterministic part of the model as λij′ = δij′ β + zi γj′, and the probability of this event is

    Pr(yi = k) = Pr(vi1 ≤ 0, . . . , vi,J−1 ≤ 0)
               = Pr(εi1 ≤ −λi1, . . . , εi,J−1 ≤ −λi,J−1)
               = (2π)^{−(J−1)/2} |Σ(k)|^{−1/2} ∫_{−∞}^{−λi1} · · · ∫_{−∞}^{−λi,J−1} exp(−½ z′ Σ(k)^{−1} z) dz        (3)

Simulated likelihood

For clarity in the discussion that follows, we drop the index denoting case so that for an arbitrary observation υ′ = (v1, . . . , vJ−1), λ′ = (λ1, . . . , λJ−1), and ε′ = (ε1, . . . , εJ−1). The Cholesky-factored variance–covariance, Σ = LL′, is lower triangular,

        [ l11        0         . . .   0          ]
    L = [ l21        l22       . . .   0          ]
        [  .          .          .     .          ]
        [ lJ−1,1     lJ−1,2    . . .   lJ−1,J−1   ]

and the correlated latent-variable errors can be expressed as linear functions of uncorrelated normal variates, ε = Lζ, where ζ′ = (ζ1, . . . , ζJ−1) and ζj ∼ iid N(0, 1). We now have υ = λ + Lζ, and by defining

    zj = −λ1/l11                                              for j = 1
    zj = −(λj + Σ_{i=1}^{j−1} lji ζi)/ljj                     for j = 2, . . . , J − 1        (4)

we can express the probability statement (3) as the product of conditional probabilities

    Pr(yi = k) = Pr(ζ1 ≤ z1) Pr(ζ2 ≤ z2 | ζ1 ≤ z1) · · · Pr(ζJ−1 ≤ zJ−1 | ζ1 ≤ z1, . . . , ζJ−2 ≤ zJ−2)

because

    Pr(v1 ≤ 0) = Pr(λ1 + l11 ζ1 ≤ 0) = Pr(ζ1 ≤ −λ1/l11)
    Pr(v2 ≤ 0) = Pr(λ2 + l21 ζ1 + l22 ζ2 ≤ 0) = Pr{ζ2 ≤ −(λ2 + l21 ζ1)/l22 | ζ1 ≤ −λ1/l11}
    . . .

The Monte Carlo algorithm then must make draws from the truncated standard normal distribution. It does so by generating J − 1 uniform variates, δj, j = 1, . . . , J − 1, and computing

    ζ̃1 = Φ^{−1}{δ1 Φ(−λ1/l11)}                                             for j = 1
    ζ̃j = Φ^{−1}[δj Φ{−(λj + Σ_{i=1}^{j−1} lji ζ̃i)/ljj}]                    for j = 2, . . . , J − 1

Define z̃j by replacing ζ̃i for ζi in (4) so that the simulated probability for the lth draw is

    pl = ∏_{j=1}^{J−1} Φ(z̃j)

To increase accuracy, the bounds of integration, λj, are ordered so that the largest integration intervals are on the inside. The rows and columns of the variance–covariance matrix are pivoted accordingly (Genz 1992). For a more detailed description of the GHK algorithm in Stata, see Gates (2006).

Repeated draws are made, say, N, and the simulated likelihood for the ith case, denoted L̂i, is computed as

    L̂i = (1/N) Σ_{l=1}^{N} pl

The overall simulated log likelihood is Σi log L̂i.

If the true likelihood is Li, the error bound on the approximation can be expressed as

    |L̂i − Li| ≤ V(Li) DN{(δi)}

where V(Li) is the total variation of Li and DN is the discrepancy, or nonuniformity, of the set of abscissas. For the uniform pseudorandom sequence, δi, the discrepancy is of order O{(log log N/N)^{1/2}}.
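To make the simulator concrete, the following Mata sketch computes pl for a single case and a single draw from (3) and (4). It is an illustration only, not asmprobit’s internal implementation (which uses the [M-5] ghk( ) routines); the function name ghk_draw and its arguments are made up for this example.

        mata:
        // pl for one case and one draw
        //   lambda : (J-1) x 1 vector of linear predictors
        //   L      : lower-triangular Cholesky factor of Sigma_(k)
        //   delta  : (J-1) x 1 vector of uniform deviates for this draw
        real scalar ghk_draw(real colvector lambda, real matrix L, real colvector delta)
        {
            real scalar J1, j, i, s, z, p
            real colvector zeta

            J1   = rows(lambda)
            zeta = J(J1, 1, 0)
            p    = 1
            for (j = 1; j <= J1; j++) {
                s = 0
                for (i = 1; i < j; i++) s = s + L[j, i]*zeta[i]
                z = -(lambda[j] + s)/L[j, j]             // upper limit z_j in (4)
                p = p*normal(z)                          // factor Phi(z_j) of p_l
                zeta[j] = invnormal(delta[j]*normal(z))  // truncated standard normal draw
            }
            return(p)
        }
        end

Averaging ghk_draw() over N vectors of uniform deviates gives the simulated likelihood L̂i for the case.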
The order of discrepancy can be improved by using quasirandom sequences. Quasi–Monte Carlo integration is carried out by asmprobit by replacing the uniform deviates with either the Halton or the Hammersley sequences. These sequences spread the points more evenly than the uniform random sequence and have a smaller order of discrepancy, O{(log N)^{J−1}/N} and O{(log N)^{J−2}/N}, respectively. The Halton sequence of dimension J − 1 is generated from the first J − 1 primes, pk, so that on draw l we have hl = {rp1(l), rp2(l), . . . , rpJ−1(l)}, where

    rpk(l) = Σ_{j=0}^{q} bjk(l) pk^{−j−1} ∈ (0, 1)

is the radical inverse function of l with base pk so that Σ_{j=0}^{q} bjk(l) pk^{j} = l, where pk^{q} ≤ l < pk^{q+1} (Fang and Wang 1994).

This function is demonstrated with base p3 = 5 and l = 33, which generates r5(33). Here q = 2, b0,5(33) = 3, b1,5(33) = 1, and b2,5(33) = 1, so that r5(33) = 3/5 + 1/25 + 1/125.

The Hammersley sequence uses an evenly spaced set of points with the first J − 2 components of the Halton sequence,

    hl = {(2l − 1)/(2N), rp1(l), rp2(l), . . . , rpJ−2(l)}

for l = 1, . . . , N. For a more detailed description of the Halton and Hammersley sequences, see Drukker and Gates (2006).

Computations for the derivatives of the simulated likelihood are taken from Bolduc (1999). Bolduc gives the analytical first-order derivatives for the log of the simulated likelihood with respect to the regression coefficients and the parameters of the Cholesky-factored variance–covariance matrix. asmprobit uses these analytical first-order derivatives and numerical second-order derivatives.

This command supports the clustered version of the Huber/White/sandwich estimator of the variance using vce(robust) and vce(cluster clustvar). See [P] _robust, particularly Maximum likelihood estimators and Methods and formulas. Specifying vce(robust) is equivalent to specifying vce(cluster casevar), where casevar is the variable that identifies the cases.

References

Bolduc, D. 1999. A practical technique to estimate multinomial probit models in transportation. Transportation Research Part B 33: 63–79.
Bunch, D. S. 1991. Estimability of the multinomial probit model. Transportation Research Part B 25: 1–12.
Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. New York: Cambridge University Press.
Cappellari, L., and S. P. Jenkins. 2003. Multivariate probit regression using simulated maximum likelihood. Stata Journal 3: 278–294.
Drukker, D. M., and R. Gates. 2006. Generating Halton sequences using Mata. Stata Journal 6: 214–228.
Fang, K.-T., and Y. Wang. 1994. Number-theoretic Methods in Statistics. London: Chapman & Hall.
Gates, R. 2006. A Mata Geweke–Hajivassiliou–Keane multivariate normal simulator. Stata Journal 6: 190–213.
Genz, A. 1992. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics 1: 141–149.
Geweke, J. 1989. Bayesian inference in econometric models using Monte Carlo integration. Econometrica 57: 1317–1339.
Geweke, J., and M. P. Keane. 2001. Computationally intensive methods for integration in econometrics. In Vol. 5 of Handbook of Econometrics, ed. J. Heckman and E. Leamer, 3463–3568. Amsterdam: Elsevier.
Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall.
Haan, P., and A. Uhlendorff. 2006. Estimation of multinomial logit models with unobserved heterogeneity using maximum simulated likelihood. Stata Journal 6: 229–245.
asmprobit — Alternative-specific multinomial probit regression 125 Hajivassiliou, V. A., and D. L. McFadden. 1998. The method of simulated scores for the estimation of LDV models. Econometrica 66: 863–896. Hole, A. R. 2007. Fitting mixed logit models by using maximum simulated likelihood. Stata Journal 7: 388–401. Keane, M. P., and K. I. Wolpin. 1994. The solution and estimation of discrete choice dynamic programming models by simulation and interpolation: Monte Carlo evidence. Review of Economics and Statistics 76: 648–672. Train, K. E. 2009. Discrete Choice Methods with Simulation. 2nd ed. New York: Cambridge University Press. Also see [R] asmprobit postestimation — Postestimation tools for asmprobit [R] asclogit — Alternative-specific conditional logit (McFadden’s choice) model [R] asroprobit — Alternative-specific rank-ordered probit regression [R] mlogit — Multinomial (polytomous) logistic regression [R] mprobit — Multinomial probit regression [U] 20 Estimation and postestimation commands Title asmprobit postestimation — Postestimation tools for asmprobit Description Syntax for estat Stored results Syntax for predict Menu for estat Methods and formulas Menu for predict Options for estat Also see Options for predict Remarks and examples Description The following postestimation commands are of special interest after asmprobit: Command Description estat estat estat estat estat alternative summary statistics covariance matrix of the latent-variable errors for the alternatives correlation matrix of the latent-variable errors for the alternatives covariance factor weights matrix marginal effects alternatives covariance correlation facweights mfx The following standard postestimation commands are also available: Command Description estat ic estat summarize estat vce estimates lincom Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predicted probabilities, estimated linear predictor and its standard error point estimates, standard errors, testing, and inference for generalized predictions Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest nlcom predict predictnl test testnl Special-interest postestimation commands estat alternatives displays summary statistics about the alternatives in the estimation sample and provides a mapping between the index numbers that label the covariance parameters of the model and their associated values and labels for the alternative variable. estat covariance computes the estimated variance–covariance matrix of the latent-variable errors for the alternatives. The estimates are displayed, and the variance–covariance matrix is stored in r(cov). 126 asmprobit postestimation — Postestimation tools for asmprobit 127 estat correlation computes the estimated correlation matrix of the latent-variable errors for the alternatives. The estimates are displayed, and the correlation matrix is stored in r(cor). estat facweights displays the covariance factor weights matrix and stores it in r(C). estat mfx computes the simulated probability marginal effects. 
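For example, after fitting the travel-mode model of [R] asmprobit, these commands might be combined as follows (a sketch; output omitted):

        . use http://www.stata-press.com/data/r13/travel
        . asmprobit choice travelcost termtime, case(id) alternatives(mode)
        >      casevars(income)
        . estat alternatives
        . estat covariance
        . matrix V = r(cov)
        . estat mfx, varlist(income) at(mean income=40)

The matrix V saved from r(cov) can then be used in later matrix calculations, as illustrated at the end of example 2 in [R] asmprobit.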
Syntax for predict

    predict [type] newvar [if] [in] [, statistic altwise]

    predict [type] {stub* | newvarlist} [if] [in], scores

statistic    Description
Main
  pr         probability alternative is chosen; the default
  xb         linear prediction
  stdp       standard error of the linear prediction

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.

Menu for predict

Statistics > Postestimation > Predictions, residuals, etc.

Options for predict

Main

pr, the default, calculates the probability that alternative j is chosen in case i.

xb calculates the linear prediction xij β + zi αj for alternative j and case i.

stdp calculates the standard error of the linear predictor.

altwise specifies that alternativewise deletion be used when marking out observations due to missing values in your variables. The default is to use casewise deletion. The xb and stdp options always use alternativewise deletion.

scores calculates the scores for each coefficient in e(b). This option requires a new variable list of length equal to the number of columns in e(b). Otherwise, use the stub* option to have predict generate enumerated variables with prefix stub.

Syntax for estat

Alternative summary statistics
    estat alternatives

Covariance matrix of the latent-variable errors for the alternatives
    estat covariance [, format(%fmt) border(bspec) left(#)]

Correlation matrix of the latent-variable errors for the alternatives
    estat correlation [, format(%fmt) border(bspec) left(#)]

Covariance factor weights matrix
    estat facweights [, format(%fmt) border(bspec) left(#)]

Marginal effects
    estat mfx [if] [in] [, estat_mfx_options]

estat_mfx_options                       Description
Main
  varlist(varlist)                      display marginal effects for varlist
  at(mean [atlist] | median [atlist])   calculate marginal effects at these values
Options
  level(#)                              set confidence interval level; default is level(95)
  nodiscrete                            treat indicator variables as continuous
  noesample                             do not restrict calculation of means and medians to the estimation sample
  nowght                                ignore weights when calculating means and medians

Menu for estat

Statistics > Postestimation > Reports and statistics

Options for estat

Options for estat are presented under the following headings:
    Options for estat covariance, estat correlation, and estat facweights
    Options for estat mfx

Options for estat covariance, estat correlation, and estat facweights

format(%fmt) sets the matrix display format. The default for estat covariance and estat facweights is format(%9.0g); the default for estat correlation is format(%9.4f).

border(bspec) sets the matrix display border style. The default is border(all). See [P] matlist.

left(#) sets the matrix display left indent. The default is left(2). See [P] matlist.

Options for estat mfx

Main

varlist(varlist) specifies the variables for which to display marginal effects. The default is all variables.

at(mean [atlist] | median [atlist]) specifies the values at which the marginal effects are to be calculated. atlist is

    [[alternative:variable = #] [variable = #] ...]

The default is to calculate the marginal effects at the means of the independent variables in the estimation sample, at(mean). After specifying the summary statistic, you can specify a series of specific values for variables. You can specify values for alternative-specific variables by alternative, or you can specify one value for all alternatives.
You can specify only one value for case-specific variables. For example, in travel.dta, income is a case-specific variable, whereas termtime and travelcost are alternative-specific variables. The following would be a legal syntax for estat mfx:

    . estat mfx, at(mean air:termtime=50 travelcost=100 income=60)

When nodiscrete is not specified, at(mean [atlist]) or at(median [atlist]) has no effect on computing marginal effects for indicator variables, which are calculated as the discrete change in the simulated probability as the indicator variable changes from 0 to 1.

The mean and median computations respect any if and in qualifiers, so you can restrict the data over which the means or medians are computed. You can even restrict the values to a specific case; for example,

    . estat mfx if case==21

Options

level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals.

nodiscrete specifies that indicator variables be treated as continuous variables. An indicator variable is one that takes on the value 0 or 1 in the estimation sample. By default, the discrete change in the simulated probability is computed as the indicator variable changes from 0 to 1.

noesample specifies that the whole dataset be considered instead of only those marked in the e(sample) defined by the asmprobit command.

nowght specifies that weights be ignored when calculating the means or medians.

Remarks and examples

Remarks are presented under the following headings:
    Predicted probabilities
    Obtaining estimation statistics
    Obtaining marginal effects

Predicted probabilities

After fitting an alternative-specific multinomial probit model, you can use predict to obtain the simulated probabilities that an individual will choose each of the alternatives. When evaluating the multivariate normal probabilities via Monte Carlo simulation, predict uses the same method to generate the random sequence of numbers as the previous call to asmprobit. For example, if you specified intmethod(Halton) when fitting the model, predict also uses the Halton sequence.

Example 1

In example 1 of [R] asmprobit, we fit a model of individuals' travel-mode choices. We can obtain the simulated probabilities that an individual chooses each alternative by using predict:

. use http://www.stata-press.com/data/r13/travel
. asmprobit choice travelcost termtime, case(id) alternatives(mode)
> casevars(income)
  (output omitted)
. predict prob
  (option pr assumed; Pr(mode))
. list id mode prob choice in 1/12, sepby(id)

         id   mode        prob   choice
  1.      1   air     .1494137        0
  2.      1   train    .329167        0
  3.      1   bus     .1320298        0
  4.      1   car     .3898562        1

  5.      2   air     .2565875        0
  6.      2   train   .2761054        0
  7.      2   bus     .0116135        0
  8.      2   car     .4556921        1

  9.      3   air     .2098406        0
 10.      3   train   .1081824        0
 11.      3   bus     .1671841        0
 12.      3   car     .5147822        1

Obtaining estimation statistics

Once you have fit a multinomial probit model, you can obtain the estimated variance or correlation matrices for the model alternatives by using the estat command.

Example 2

To display the correlations of the errors in the latent-variable equations, we type

. estat correlation

             train     bus     car
  train     1.0000
  bus       0.8909  1.0000
  car       0.7895  0.8951  1.0000

Note: correlations are for alternatives differenced with air

The covariance matrix can be displayed by typing
. estat covariance

               train        bus        car
  train            2
  bus       1.600208   1.613068
  car        1.37471   1.399703   1.515884

Note: covariances are for alternatives differenced with air

Obtaining marginal effects

The marginal effects are computed as the derivative of the simulated probability for an alternative with respect to an independent variable. A table of marginal effects is displayed for each alternative, with the table containing the marginal effect for each case-specific variable and one entry per alternative for each alternative-specific variable.

By default, the marginal effects are computed at the means of each continuous independent variable over the estimation sample. For indicator variables, the difference in the simulated probability evaluated at 0 and 1 is computed by default. Indicator variables will be treated as continuous variables if the nodiscrete option is used.

Example 3

Continuing with our model from example 1, we obtain the marginal effects for alternatives air, train, bus, and car evaluated at the mean values of each independent variable. Recall that the travelcost and termtime variables are alternative specific, taking on different values for each alternative, so they have a separate marginal effect for each alternative.

. estat mfx

Pr(choice = air) = .29434926

  variable       dp/dx   Std. Err.      z   P>|z|   [    95% C.I.   ]        X
  travelcost
    air        -.002688     .000677  -3.97   0.000   -.004015  -.001362   102.65
    train         .0009     .000436   2.07   0.039    .000046   .001755    130.2
    bus         .000376     .000271   1.39   0.166   -.000155   .000908   115.26
    car         .001412      .00051   2.77   0.006    .000412   .002412   95.414
  termtime
    air        -.010376     .002711  -3.83   0.000   -.015689  -.005063    61.01
    train       .003475     .001639   2.12   0.034    .000264   .006687    35.69
    bus         .001452     .001008   1.44   0.150   -.000523   .003427   41.657
    car         .005449     .002164   2.52   0.012    .001209    .00969        0
  casevars
    income      .003891     .001847   2.11   0.035    .000271   .007511   34.548

Pr(choice = train) = .29531182

  variable       dp/dx   Std. Err.      z   P>|z|   [    95% C.I.   ]        X
  travelcost
    air         .000899     .000436   2.06   0.039    .000045   .001753   102.65
    train      -.004081     .001466  -2.78   0.005   -.006953  -.001208    130.2
    bus         .001278      .00063   2.03   0.042    .000043   .002513   115.26
    car         .001904     .000887   2.15   0.032    .000166   .003641   95.414
  termtime
    air         .003469     .001638   2.12   0.034    .000258    .00668    61.01
    train       -.01575      .00247  -6.38   0.000   -.020591  -.010909    35.69
    bus         .004934     .001593   3.10   0.002    .001812   .008056   41.657
    car         .007348     .002228   3.30   0.001     .00298   .011715        0
  casevars
    income      -.00957     .002223  -4.31   0.000   -.013927  -.005214   34.548

Pr(choice = bus) = .08880039

  variable       dp/dx   Std. Err.      z   P>|z|   [    95% C.I.   ]        X
  travelcost
    air          .00038     .000274   1.39   0.165   -.000157   .000916   102.65
    train       .001279      .00063   2.03   0.042    .000044   .002514    130.2
    bus        -.003182     .001175  -2.71   0.007   -.005485   -.00088   115.26
    car         .001523     .000675   2.26   0.024      .0002   .002847   95.414
  termtime
    air         .001466     .001017   1.44   0.149   -.000526   .003459    61.01
    train       .004937     .001591   3.10   0.002    .001819   .008055    35.69
    bus        -.012283     .002804  -4.38   0.000   -.017778  -.006788   41.657
    car          .00588     .002255   2.61   0.009    .001461   .010299        0
  casevars
    income      .000435     .001461   0.30   0.766   -.002428   .003298   34.548

Pr(choice = car) = .32168607

  variable       dp/dx   Std. Err.      z   P>|z|   [    95% C.I.   ]        X
  travelcost
    air          .00141     .000509   2.77   0.006    .000411   .002408   102.65
    train       .001903     .000886   2.15   0.032    .000166   .003641    130.2
    bus         .001523     .000675   2.25   0.024    .000199   .002847   115.26
    car        -.004836     .001539  -3.14   0.002   -.007853  -.001819   95.414
  termtime
    air         .005441     .002161   2.52   0.012    .001205   .009677    61.01
    train       .007346     .002228   3.30   0.001     .00298   .011713    35.69
    bus         .005879     .002256   2.61   0.009    .001456   .010301   41.657
    car        -.018666     .003938  -4.74   0.000   -.026385  -.010948        0
  casevars
    income      .005246     .002166   2.42   0.015    .001002    .00949   34.548

First, we note that there is a separate marginal-effects table for each alternative and that each table begins by reporting the overall probability of choosing that alternative, for example, 0.2944 for air travel. We see in the first table that a unit increase in terminal time for air travel from 61.01 minutes will decrease the probability of choosing air travel (when the probability is evaluated at the mean of all variables) by approximately 0.01, with a 95% confidence interval of about -0.016 to -0.005. Travel cost has a smaller negative effect on the probability of choosing air travel (at the average cost of 102.65). Conversely, an increase in terminal time or travel cost for train, bus, or car from these mean values will increase the chance that air travel is chosen. Also, with an increase in income from 34.5, it would appear that an individual would be more likely to choose air or automobile travel over bus or train. (While the marginal effect for bus travel is positive, it is not significant.)

Example 4

Plotting the simulated-probability marginal effect evaluated over a range of values of an independent variable may be more revealing than a table of values. Below are the commands for generating the simulated-probability marginal effect of air travel for increasing air-travel terminal time. We fix all other independent variables at their medians.

. qui gen meff = .
. qui gen tt = .
. qui gen lb = .
. qui gen ub = .
. qui gen prob = .
. forvalues i=0/19 {
  2.   local termtime = 5+5*`i'
  3.   qui replace tt = `termtime' if _n == `i'+1
  4.   qui estat mfx, at(median air:termtime=`termtime') var(termtime)
  5.   mat air = r(air)
  6.   qui replace meff = air[1,1] if _n == `i'+1
  7.   qui replace lb = air[1,5] if _n == `i'+1
  8.   qui replace ub = air[1,6] if _n == `i'+1
  9.   qui replace prob = r(pr_air) if _n == `i'+1
 10. }
. label variable tt "terminal time"
. twoway (rarea lb ub tt, pstyle(ci)) (line meff tt, lpattern(solid)), name(meff)
> legend(off) title(" marginal effect of air travel" "terminal time and"
> "95% confidence interval", position(3))
. twoway line prob tt, name(prob) title(" probability of choosing" "air travel",
> position(3)) graphregion(margin(r+9)) ytitle("") xtitle("")
. graph combine prob meff, cols(1) graphregion(margin(l+5 r+5))

(figure omitted: combined graph plotting the probability of choosing air travel and the marginal effect of air-travel terminal time, with its 95% confidence interval, against terminal time)

From the graphs, we see that the simulated probability of choosing air travel decreases in a sigmoid fashion. The marginal effects display the rate of change in the simulated probability as a function of the air-travel terminal time. The rate of change in the probability of choosing air travel decreases until the air-travel terminal time reaches about 45; thereafter, it increases.
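The same r() results used inside the loop above can also be inspected directly after a single call. The following sketch (the terminal-time value 30 is arbitrary and only illustrative) displays the marginal-effects matrix and the simulated probability for the air alternative:

. qui estat mfx, at(median air:termtime=30) var(termtime)
. mat air = r(air)
. matrix list air
. display r(pr_air)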
Stored results

estat mfx stores the following in r():

Scalars
    r(pr_alt)    scalars containing the computed probability of each alternative evaluated at the value that is labeled X in the table output. Here alt are the labels in the macro e(alteqs).

Matrices
    r(alt)       matrices containing the computed marginal effects and associated statistics. There is one matrix for each alternative, where alt are the labels in the macro e(alteqs). Column 1 of each matrix contains the marginal effects; column 2, their standard errors; columns 3 and 4, their z statistics and the p-values for the z statistics; and columns 5 and 6, the confidence intervals. Column 7 contains the values of the independent variables used to compute the probabilities r(pr_alt).

Methods and formulas

Marginal effects

The marginal effects are computed as the derivative of the simulated probability with respect to each independent variable. A set of marginal effects is computed for each alternative; thus, for J alternatives, there will be J tables. Moreover, the alternative-specific variables will have J entries, one for each alternative in each table. The details of computing the effects are different for alternative-specific variables and case-specific variables, as well as for continuous and indicator variables.

We use the latent-variable notation of asmprobit (see [R] asmprobit) for a J-alternative model and, for notational convenience, we will drop any subscripts involving observations. We then have the following linear functions η_j = x_j β + z α_j, for j = 1, ..., J. Let k index the alternative of interest, and then

    v_j' = η_j - η_k = (x_j - x_k)β + z(α_j - α_k) + ε_j'

where j' = j if j < k and j' = j - 1 if j > k, so that j' = 1, ..., J - 1 and ε_j' ~ MVN(0, Σ). Denote p_k = Pr(v_1 <= 0, ..., v_(J-1) <= 0) as the simulated probability of choosing alternative k given profile x_k and z. The marginal effects are then ∂p_k/∂x_k, ∂p_k/∂x_j, and ∂p_k/∂z, where k = 1, ..., J, j ≠ k. asmprobit analytically computes the first-order derivatives of the simulated probability with respect to the v's, and the marginal effects for the x's and z are obtained via the chain rule. The standard errors for the marginal effects are computed using the delta method.
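As a generic, self-contained illustration of the delta method (this is not asmprobit's internal code, and the function chosen, the ratio of the first two coefficients, is purely for demonstration), the standard error of a smooth scalar function of the estimated parameters can be approximated in Mata from its gradient g and the estimated variance matrix of the coefficients:

. mata:
: b = st_matrix("e(b)")                           // 1 x k row vector of estimates
: V = st_matrix("e(V)")                           // k x k variance matrix
: g = (1/b[2], -b[1]/b[2]^2, J(1, cols(b)-2, 0))  // gradient of f(b) = b[1]/b[2]
: sqrt(g*V*g')                                    // delta-method standard error of f(b)
: end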
Also see
[R] asmprobit — Alternative-specific multinomial probit regression
[U] 20 Estimation and postestimation commands

Title
asroprobit — Alternative-specific rank-ordered probit regression

Syntax    Menu    Description    Options    Remarks and examples    Stored results    Methods and formulas    Reference    Also see

Syntax

    asroprobit depvar [indepvars] [if] [in] [weight], case(varname) alternatives(varname) [options]

options                           Description
Model
* case(varname)                   use varname to identify cases
* alternatives(varname)           use varname to identify the alternatives available for each case
  casevars(varlist)               case-specific variables
  constraints(constraints)        apply specified linear constraints
  collinear                       keep collinear variables
Model 2
  correlation(correlation)        correlation structure of the latent-variable errors
  stddev(stddev)                  variance structure of the latent-variable errors
  structural                      use the structural covariance parameterization; default is the differenced covariance parameterization
  factor(#)                       use the factor covariance structure with dimension #
  noconstant                      suppress the alternative-specific constant terms
  basealternative(# | lbl | str)  alternative used for normalizing location
  scalealternative(# | lbl | str) alternative used for normalizing scale
  altwise                         use alternativewise deletion instead of casewise deletion
  reverse                         interpret the lowest rank in depvar as the best; the default is the highest rank is the best
SE/Robust
  vce(vcetype)                    vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife
Reporting
  level(#)                        set confidence level; default is level(95)
  notransform                     do not transform variance–covariance estimates to the standard deviation and correlation metric
  nocnsreport                     do not display constraints
  display options                 control column formats and line width
Integration
  intmethod(seqtype)              type of quasi- or pseudouniform sequence
  intpoints(#)                    number of points in each sequence
  intburn(#)                      starting index in the Hammersley or Halton sequence
  intseed(code | #)               pseudouniform random-number seed
  antithetics                     use antithetic draws
  nopivot                         do not use integration interval pivoting
  initbhhh(#)                     use the BHHH optimization algorithm for the first # iterations
  favor(speed | space)            favor speed or space when generating integration points
Maximization
  maximize options                control the maximization process
  coeflegend                      display legend instead of statistics

correlation         Description
  unstructured      one correlation parameter for each pair of alternatives; correlations with the basealternative() are zero; the default
  exchangeable      one correlation parameter common to all pairs of alternatives; correlations with the basealternative() are zero
  independent       constrain all correlation parameters to zero
  pattern matname   user-specified matrix identifying the correlation pattern
  fixed matname     user-specified matrix identifying the fixed and free correlation parameters

stddev              Description
  heteroskedastic   estimate standard deviation for each alternative; standard deviations for basealternative() and scalealternative() set to one
  homoskedastic     all standard deviations are one
  pattern matname   user-specified matrix identifying the standard deviation pattern
  fixed matname     user-specified matrix identifying the fixed and free standard deviations

seqtype             Description
  hammersley        Hammersley point set
  halton            Halton point set
  random            uniform pseudorandom point set

* case(varname) and alternatives(varname) are required.
bootstrap, by, jackknife, statsby, and xi are allowed; see [U] 11.1.10 Prefix commands.
Weights are not allowed with the bootstrap prefix; see [R] bootstrap. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 138 asroprobit — Alternative-specific rank-ordered probit regression Menu Statistics > Ordinal outcomes > Rank-ordered probit regression Description asroprobit fits rank-ordered probit (ROP) models by using maximum simulated likelihood (MSL). The model allows you to relax the independence of irrelevant alternatives (IIA) property that is characteristic of the rank-ordered logistic model by estimating the variance–covariance parameters of the latent-variable errors. Each unique identifier in the case() variable has multiple alternatives identified in the alternatives() variable, and depvar contains the ranked alternatives made by each case. Only the order in the ranks, not the magnitude of their differences, is assumed to be relevant. By default, the largest rank indicates the more desirable alternative. Use the reverse option if the lowest rank should be interpreted as the more desirable alternative. Tied ranks are allowed, but they increase the computation time because all permutations of the tied ranks are used in computing the likelihood for each case. asroprobit allows two types of independent variables: alternative-specific variables, in which the values of each variable vary with each alternative, and case-specific variables, which vary with each case. The estimation technique of asroprobit is nearly identical to that of asmprobit, and the two routines share many of the same options; see [R] asmprobit. Options Model case(varname) specifies the variable that identifies each case. This variable identifies the individuals or entities making a choice. case() is required. alternatives(varname) specifies the variable that identifies the alternatives available for each case. The number of alternatives can vary with each case; the maximum number of alternatives is 20. alternatives() is required. casevars(varlist) specifies the case-specific variables that are constant for each case(). If there are a maximum of J alternatives, there will be J − 1 sets of coefficients associated with casevars(). constraints(constraints), collinear; see [R] estimation options. Model 2 correlation(correlation) specifies the correlation structure of the latent-variable errors. correlation(unstructured) is the most general and has J(J − 3)/2 + 1 unique correlation parameters. This is the default unless stddev() or structural are specified. correlation(exchangeable) provides for one correlation coefficient common to all latent variables, except the latent variable associated with the basealternative(). correlation(independent) assumes that all correlations are zero. correlation(pattern matname) and correlation(fixed matname) give you more flexibility in defining the correlation structure. See Variance structures in [R] asmprobit for more information. stddev(stddev) specifies the variance structure of the latent-variable errors. stddev(heteroskedastic) is the most general and has J − 2 estimable parameters. The standard deviations of the latent-variable errors for the alternatives specified in basealternative() and scalealternative() are fixed to one. asroprobit — Alternative-specific rank-ordered probit regression 139 stddev(homoskedastic) constrains all the standard deviations to equal one. 
stddev(pattern matname) and stddev(fixed matname) give you added flexibility in defining the standard deviation parameters. See Variance structures in [R] asmprobit for more information. structural requests the J ×J structural covariance parameterization instead of the default J −1×J −1 differenced covariance parameterization (the covariance of the latent errors differenced with that of the base alternative). The differenced covariance parameterization will achieve the same maximum simulated likelihood regardless of the choice of basealternative() and scalealternative(). On the other hand, the structural covariance parameterization imposes more normalizations that may bound the model away from its maximum likelihood and thus prevent convergence with some datasets or choices of basealternative() and scalealternative(). factor(#) requests that the factor covariance structure of dimension # be used. The factor() option can be used with the structural option but cannot be used with stddev() or correlation(). A # × J (or # × J − 1) matrix, C, is used to factor the covariance matrix as I + C0 C, where I is the identity matrix of dimension J (or J − 1). The column dimension of C depends on whether the covariance is structural or differenced. The row dimension of C, #, must be less than or equal to floor((J(J − 1)/2 − 1)/(J − 2)), because there are only J(J − 1)/2 − 1 identifiable variance–covariance parameters. This covariance parameterization may be useful for reducing the number of covariance parameters that need to be estimated. If the covariance is structural, the column of C corresponding to the base alternative contains zeros. The column corresponding to the scale alternative has a one in the first row and zeros elsewhere. If the covariance is differenced, the column corresponding to the scale alternative (differenced with the base) has a one in the first row and zeros elsewhere. noconstant suppresses the J − 1 alternative-specific constant terms. basealternative(# | lbl | str) specifies the alternative used to normalize the latent-variable location (also referred to as the level of utility). The base alternative may be specified as a number, label, or string. The standard deviation for the latent-variable error associated with the base alternative is fixed to one, and its correlations with all other latent-variable errors are set to zero. The default is the first alternative when sorted. If a fixed or pattern matrix is given in the stddev() and correlation() options, the basealternative() will be implied by the fixed standard deviations and correlations in the matrix specifications. basealternative() cannot be equal to scalealternative(). scalealternative(# | lbl | str) specifies the alternative used to normalize the latent-variable scale (also referred to as the scale of utility). The scale alternative may be specified as a number, label, or string. The default is to use the second alternative when sorted. If a fixed or pattern matrix is given in the stddev() option, the scalealternative() will be implied by the fixed standard deviations in the matrix specification. scalealternative() cannot be equal to basealternative(). If a fixed or pattern matrix is given for the stddev() option, the base alternative and scale alternative are implied by the standard deviations and correlations in the matrix specifications, and they need not be specified in the basealternative() and scalealternative() options. 
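To make the covariance-parameterization options above concrete, two illustrative calls follow (a sketch only; it borrows the wlsrank variables used in example 1 below, and the choices of exchangeable correlations and a one-dimensional factor structure are arbitrary):

. asroprobit rank high low if noties, case(id) alternatives(jobchar)
> casevars(female score) reverse correlation(exchangeable)
. asroprobit rank high low if noties, case(id) alternatives(jobchar)
> casevars(female score) reverse factor(1)

The first call estimates a single correlation parameter common to all pairs of differenced latent-variable errors; the second replaces the unstructured covariance with the one-dimensional factor structure described under factor(#).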
altwise specifies that alternativewise deletion be used when marking out observations due to missing values in your variables. The default is to use casewise deletion; that is, the entire group of observations making up a case is deleted if any missing values are encountered. This option does not apply to observations that are marked out by the if or in qualifier or the by prefix. reverse directs asroprobit to interpret the rank in depvar that is smallest in value as the preferred alternative. By default, the rank that is largest in value is the favored alternative. 140 asroprobit — Alternative-specific rank-ordered probit regression SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. If specifying vce(bootstrap) or vce(jackknife), you must also specify basealternative() and scalealternative(). Reporting level(#); see [R] estimation options. notransform prevents retransforming the Cholesky-factored variance–covariance estimates to the correlation and standard deviation metric. This option has no effect if structural is not specified because the default differenced variance– covariance estimates have no interesting interpretation as correlations and standard deviations. notransform also has no effect if the correlation() and stddev() options are specified with anything other than their default values. Here it is generally not possible to factor the variance– covariance matrix, so optimization is already performed using the standard deviation and correlation representations. nocnsreport; see [R] estimation options. display options: cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Integration intmethod(hammersley | halton | random) specifies the method of generating the point sets used in the quasi–Monte Carlo integration of the multivariate normal density. intmethod(hammersley), the default, uses the Hammersley sequence; intmethod(halton) uses the Halton sequence; and intmethod(random) uses a sequence of uniform random numbers. intpoints(#) specifies the number of points to use in the quasi–Monte Carlo integration. If this option is not specified, the number of points is 50 × J if intmethod(hammersley) or intmethod(halton) is used and 100 × J if intmethod(random) is used. Larger values of intpoints() provide better approximations of the log likelihood, but at the cost of added computation time. intburn(#) specifies where in the Hammersley or Halton sequence to start, which helps reduce the correlation between the sequences of each dimension. The default is 0. This option may not be specified with intmethod(random). intseed(code | #) specifies the seed to use for generating the uniform pseudorandom sequence. This option may be specified only with intmethod(random). code refers to a string that records the state of the random-number generator runiform(); see [R] set seed. An integer value # may be used also. The default is to use the current seed value from Stata’s uniform random-number generator, which can be obtained from c(seed). antithetics specifies that antithetic draws be used. The antithetic draw for the J − 1 vector uniform-random variables, x, is 1 − x. nopivot turns off integration interval pivoting. 
By default, asroprobit will pivot the wider intervals of integration to the interior of the multivariate integration. This improves the accuracy of the quadrature estimate. However, discontinuities may result in the computation of numerical secondorder derivatives using finite differencing (for the Newton–Raphson optimize technique, tech(nr)) asroprobit — Alternative-specific rank-ordered probit regression 141 when few simulation points are used, resulting in a non–positive-definite Hessian. asroprobit uses the Broyden–Fletcher–Goldfarb–Shanno optimization algorithm, by default, which does not require computing the Hessian numerically using finite differencing. initbhhh(#) specifies that the Berndt–Hall–Hall–Hausman (BHHH) algorithm be used for the initial # optimization steps. This option is the only way to use the BHHH algorithm along with other optimization techniques. The algorithm switching feature of ml’s technique() option cannot include bhhh. favor(speed | space) instructs asroprobit to favor either speed or space when generating the integration points. favor(speed) is the default. When favoring speed, the integration points are generated once and stored in memory, thus increasing the speed of evaluating the likelihood. This speed increase can be seen when there are many cases or when the user specifies a large number of integration points, intpoints(#). When favoring space, the integration points are generated repeatedly with each likelihood evaluation. For unbalanced data, where the number of alternatives varies with each case, the estimates computed using intmethod(random) will vary slightly between favor(speed) and favor(space). This is because the uniform sequences will not be identical, even when initiating the sequences using the same uniform seed, intseed(code | #). For favor(speed), ncase blocks of intpoints(#) × J − 2 uniform points are generated, where J is the maximum number of alternatives. For favor(space), the column dimension of the matrices of points varies with the number of alternatives that each case has. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. The following options may be particularly useful in obtaining convergence with asroprobit: difficult, technique(algorithm spec), nrtolerance(#), nonrtolerance, and from(init specs). If technique() contains more than one algorithm specification, bhhh cannot be one of them. To use the BHHH algorithm with another algorithm, use the initbhhh() option and specify the other algorithm in technique(). Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). When specifying from(matname , copy ), the values in matname associated with the latentvariable error variances must be for the log-transformed standard deviations and inverse-hyperbolic tangent-transformed correlations. This option makes using the coefficient vector from a previously fitted asroprobit model convenient as a starting point. The following option is available with asroprobit but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples The mathematical description and numerical computations of the rank-ordered probit model are similar to that of the multinomial probit model. 
The only difference is that the dependent variable of the rank-ordered probit model is ordinal, showing preferences among alternatives, as opposed to the binary dependent variable of the multinomial probit model, indicating a chosen alternative. We will describe how the likelihood of a ranking is computed using the latent-variable framework here, 142 asroprobit — Alternative-specific rank-ordered probit regression but for details of the latent-variable parameterization of these models and the method of maximum simulated likelihood, see [R] asmprobit. Consider the latent-variable parameterization of a J alternative rank-ordered probit model. Using the notation from asmprobit, we have variables ηij , j = 1, . . . , J , such that ηij = xij β + zi αj + ξij Here the xij are the alternative-specific independent variables, the zi are the case-specific variables, and the ξij are multivariate normal with mean zero and covariance Ω. Without loss of generality, assume that individual i ranks the alternatives in order of the alternative indices j = 1, 2, . . . , J , so the alternative J is the preferred alternative and alternative 1 is the least preferred alternative. The probability of this ranking given β and αj is the probability that ηi,J−1 − ηi,J ≤ 0 and ηi,J−2 − ηi,J−1 ≤ 0, . . . , and ηi,1 − ηi,2 ≤ 0. Example 1 Long and Freese (2014, 477) provide an example of a rank-ordered logit model with alternativespecific variables. We use this dataset to demonstrate asroprobit. The data come from the Wisconsin Longitudinal Study. This is a study of 1957 Wisconsin high school graduates that were asked to rate their relative preference of four job characteristics: esteem, a job other people regard highly; variety, a job that is not repetitive and allows you to do a variety of things; autonomy, a job where your supervisor does not check on you frequently; and security, a job with a low risk of being laid off. The case-specific covariates are gender, female, an indicator variable for females, and score, a score on a general mental ability test measured in standard deviations. The alternative-specific variables are high and low, which indicate whether the respondent’s current job is high or low in esteem, variety, autonomy, or security. This approach provides three states for a respondent’s current job status for each alternative, (1, 0), (0, 1), and (0, 0), using the notation (high, low). The score (1, 1) is omitted because the respondent’s current job cannot be considered both high and low in one of the job characteristics. The (0, 0) score would indicate that the respondent’s current job does not rank high or low (is neutral) in a job characteristic. The alternatives are ranked such that 1 is the preferred alternative and 4 is the least preferred. . use http://www.stata-press.com/data/r13/wlsrank (1992 Wisconsin Longitudinal Study data on job values) . list id jobchar rank female score high low in 1/12, sepby(id) id jobchar rank female score high low 1. 2. 3. 4. 1 1 1 1 security autonomy variety esteem 1 4 1 3 1 1 1 1 .0492111 .0492111 .0492111 .0492111 0 0 0 0 0 0 0 0 5. 6. 7. 8. 5 5 5 5 security variety esteem autonomy 2 2 2 1 1 1 1 1 2.115012 2.115012 2.115012 2.115012 1 1 1 0 0 0 0 0 9. 10. 11. 12. 7 7 7 7 autonomy variety esteem security 1 1 4 1 0 0 0 0 1.701852 1.701852 1.701852 1.701852 1 0 0 0 0 1 0 0 asroprobit — Alternative-specific rank-ordered probit regression 143 The three cases listed have tied ranks. asroprobit will allow ties, but at the cost of increased computation time. 
To evaluate the likelihood of the first observation, asroprobit must compute Pr(esteem = 3, variety = 1, autonomy = 4, security = 2)+ Pr(esteem = 3, variety = 2, autonomy = 4, security = 1) and both of these probabilities are estimated using simulation. In fact, the full dataset contains 7,237 tied ranks and asroprobit takes a great deal of time to estimate the parameters. For exposition, we estimate the rank-ordered probit model by using the cases without ties. These cases are marked in the variable noties. The model of job preference is ηij = β1 highij + β2 lowij + α1j femalei + α2j scorei + α0j + ξij for j = 1, 2, 3, 4. The base alternative will be esteem, so α01 = α11 = α21 = 0. 144 asroprobit — Alternative-specific rank-ordered probit regression . asroprobit rank high low if noties, case(id) alternatives(jobchar) > casevars(female score) reverse note: variable high has 107 cases that are not alternative-specific: there is no within-case variability note: variable low has 193 cases that are not alternative-specific: there is no within-case variability Iteration 0: log simulated-likelihood = -1103.2768 Iteration 1: log simulated-likelihood = -1089.3361 (backed up) (output omitted ) Alternative-specific rank-ordered probit Case variable: id Alternative variable: jobchar Integration sequence: Hammersley Integration points: 200 Log simulated-likelihood = -1080.2206 rank Coef. high low .3741029 -.0697443 Std. Err. z Number of obs Number of cases Alts per case: min avg max = = = = = 1660 415 4 4.0 4 Wald chi2(8) Prob > chi2 = = 34.01 0.0000 P>|z| [95% Conf. Interval] jobchar esteem .0925685 .1093317 4.04 -0.64 0.000 0.524 .192672 -.2840305 .5555337 .1445419 (base alternative) variety female score _cons .1351487 .1405482 1.735016 .1843088 .0977567 .1451343 0.73 1.44 11.95 0.463 0.151 0.000 -.2260899 -.0510515 1.450558 .4963873 .3321479 2.019474 autonomy female score _cons .2561828 .1898853 .7009797 .1679565 .0875668 .1227336 1.53 2.17 5.71 0.127 0.030 0.000 -.0730059 .0182575 .4604262 .5853715 .361513 .9415333 security female score _cons .232622 -.1780076 1.343766 .2057547 .1102115 .1600059 1.13 -1.62 8.40 0.258 0.106 0.000 -.1706497 -.3940181 1.030161 .6358938 .038003 1.657372 /lnl2_2 /lnl3_3 .1805151 .4843091 .0757296 .0793343 2.38 6.10 0.017 0.000 .0320878 .3288168 .3289424 .6398014 /l2_1 /l3_1 /l3_2 .6062037 .4509217 .2289447 .1169368 .1431183 .1226081 5.18 3.15 1.87 0.000 0.002 0.062 .3770117 .1704151 -.0113627 .8353957 .7314283 .4692521 (jobchar=esteem is the alternative normalizing location) (jobchar=variety is the alternative normalizing scale) We specified the reverse option because a rank of 1 is the highest preference. The variance– covariance estimates are for the Cholesky-factored variance–covariance for the latent-variable errors differenced with that of alternative esteem. We can view the estimated correlations by entering asroprobit — Alternative-specific rank-ordered probit regression 145 . estat correlation variety autonomy security variety autonomy security 1.0000 0.4516 0.2652 1.0000 0.2399 1.0000 Note: correlations are for alternatives differenced with esteem and typing . estat covariance variety autonomy security variety autonomy security 2 .8573015 .6376996 1.80229 .5475882 2.890048 Note: covariances are for alternatives differenced with esteem gives the (co)variances. 
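Both displays leave their matrices behind in r(), so they can be saved under any name for later use (the names below are arbitrary):

. qui estat correlation
. matrix R_diff = r(cor)
. qui estat covariance
. matrix V_diff = r(cov)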
[R] mprobit explains that if the latent-variable errors are independent, then the correlations in the differenced parameterization should be ∼0.5 and the variances should be ∼2.0, which seems to be the case here. The coefficient estimates for the probit models can be difficult to interpret because of the normalization for location and scale. The regression estimates for the case-specific variables will be relative to the base alternative and the regression estimates for both the case-specific and alternativespecific variables are affected by the scale normalization. The more pronounced the heteroskedasticity and correlations, the more pronounced the resulting estimate differences when choosing alternatives to normalize for location and scale. However, when using the differenced covariance structure, you will obtain the same model likelihood regardless of which alternatives you choose as the base and scale alternatives. For model interpretation, you can examine the estimated probabilities and marginal effects by using postestimation routines predict and estat mfx. See [R] asroprobit postestimation. 146 asroprobit — Alternative-specific rank-ordered probit regression Stored results asroprobit stores the following in e(): Scalars e(N) e(N case) e(N ties) e(k) e(k alt) e(k indvars) e(k casevars) e(k sigma) e(k rho) e(k eq) e(k eq model) e(df m) e(ll) e(N clust) e(const) e(i base) e(i scale) e(mc points) e(mc burn) e(mc antithetics) e(reverse) e(chi2) e(p) e(fullcov) e(structcov) e(cholesky) e(alt min) e(alt avg) e(alt max) e(rank) e(ic) e(rc) e(converged) number of observations number of cases number of ties number of parameters number of alternatives number of alternative-specific variables number of case-specific variables number of variance estimates number of correlation estimates number of equations in e(b) number of equations in overall model test model degrees of freedom log simulated-likelihood number of clusters constant indicator base alternative index scale alternative index number of Monte Carlo replications starting sequence index antithetics indicator 1 if minimum rank is best, 0 if maximum rank is best χ2 significance unstructured covariance indicator 1 if structured covariance; 0 otherwise Cholesky-factored covariance indicator minimum number of alternatives average number of alternatives maximum number of alternatives rank of e(V) number of iterations return code 1 if converged, 0 otherwise asroprobit — Alternative-specific rank-ordered probit regression 147 Macros e(cmd) e(cmdline) e(depvar) e(indvars) e(casevars) e(case) e(altvar) e(alteqs) e(alt#) e(wtype) e(wexp) e(title) e(clustvar) e(correlation) e(stddev) e(chi2type) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(mc method) e(mc seed) e(user) e(technique) e(datasignature) e(datasignaturevars) e(properties) e(estat cmd) e(mfx dlg) e(predict) e(marginsnotok) asroprobit command as typed name of dependent variable alternative-specific independent variable case-specific variables variable defining cases variable defining alternatives alternative equation names alternative labels weight type weight expression title in estimation output name of cluster variable correlation structure variance structure Wald, type of model χ2 test vcetype specified in vce() title used to label Std. Err. 
type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method Hammersley, Halton, or uniform random; technique to generate sequences random-number generator seed name of likelihood-evaluator program maximization technique the checksum variables used in calculation of checksum b V program used to implement estat program used to implement estat mfx dialog program used to implement predict predictions disallowed by margins Matrices e(b) e(Cns) e(stats) e(stdpattern) e(stdfixed) e(altvals) e(altfreq) e(alt casevars) e(corpattern) e(corfixed) e(ilog) e(gradient) e(V) e(V modelbased) coefficient vector constraints matrix alternative statistics variance pattern fixed and free standard deviations alternative values alternative frequencies indicators for estimated case-specific coefficients—e(k alt)×e(k casevars) correlation structure fixed and free correlations iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance Functions e(sample) marks estimation sample Methods and formulas From a computational perspective, asroprobit is similar to asmprobit and the two programs share many numerical tools. Therefore, we will use the notation from Methods and formulas in [R] asmprobit to discuss the rank-ordered probit probability model. The latent variables for a J -alternative model are ηij = xij β + zi αj + ξij , for j = 1, . . . , J , i = 1, . . . , n, and ξ0i = (ξi,1 , . . . , ξi,J ) ∼ MVN(0, Ω). Without loss of generality, assume for 148 asroprobit — Alternative-specific rank-ordered probit regression the ith observation that an individual ranks the alternatives in the order of their numeric indices, yi = (J, J − 1, . . . , 1), so the first alternative is the most preferred and the last alternative is the least preferred. We can then difference the latent variables such that vik = ηi,k+1 − ηi,k = (xi,k+1 − xi,k )β + zi (αk+1 − αk ) + ξi,k+1 − ξik = δik β + zi γk + ik for k = 1, . . . , J − 1 and where i = (i1 , . . . , i,J−1 ) ∼ MVN(0, Σ(i) ). Σ is indexed by i because it is specific to the ranking of individual i. We denote the deterministic part of the model as λik = δik β + zj γk , and the probability of this event is Pr(yi ) = Pr(vi1 ≤ 0, . . . , vi,J−1 ≤ 0) = Pr(i1 ≤ −λi1 , . . . , i,J−1 ≤ −λi,J−1 ) Z −λi1 Z −λi,J−1 −(J−1)/2 = (2π) |Σ(i) |−1/2 ··· exp − 12 z0 Σ−1 z dz (i) −∞ −∞ The integral has the same form as (3) of Methods and formulas in [R] asmprobit. See [R] asmprobit for details on evaluating this integral numerically by using simulation. asroprobit handles tied ranks by enumeration. For k tied ranks, it will generate k! rankings, where ! is the factorial operator k! = k(k − 1)(k − 2) · · · (2)(1). For two sets of tied ranks of size k1 and k2 , asroprobit will generate k1 !k2 ! rankings. The total probability is the sum of the probability of each ranking. For example, if there are two tied ranks such that yi = (J, J, J − 2, . . . , 1), then (1) (2) (1) asroprobit will evaluate Pr(yi ) = Pr(yi ) + Pr(yi ), where yi = (J, J − 1, J − 2, . . . , 1) (2) and yi = (J − 1, J, J − 2, . . . , 1). This command supports the clustered version of the Huber/White/sandwich estimator of the variance using vce(robust) and vce(cluster clustvar). See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. Specifying vce(robust) is equivalent to specifying vce(cluster casevar), where casevar is the variable that identifies the cases. Reference Long, J. S., and J. Freese. 2014. 
Regression Models for Categorical Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata Press. Also see [R] asroprobit postestimation — Postestimation tools for asroprobit [R] asmprobit — Alternative-specific multinomial probit regression [R] mlogit — Multinomial (polytomous) logistic regression [R] mprobit — Multinomial probit regression [R] oprobit — Ordered probit regression [U] 20 Estimation and postestimation commands Title asroprobit postestimation — Postestimation tools for asroprobit Description Syntax for estat Stored results Syntax for predict Menu for estat Also see Menu for predict Options for estat Options for predict Remarks and examples Description The following postestimation commands are of special interest after asroprobit: Command Description estat estat estat estat estat alternative summary statistics covariance matrix of the latent-variable errors for the alternatives correlation matrix of the latent-variable errors for the alternatives covariance factor weights matrix marginal effects alternatives covariance correlation facweights mfx The following standard postestimation commands are also available: Command Description estat ic estat summarize estat vce estimates lincom Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predicted probabilities, estimated linear predictor and its standard error point estimates, standard errors, testing, and inference for generalized predictions Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest nlcom predict predictnl test testnl Special-interest postestimation commands estat alternatives displays summary statistics about the alternatives in the estimation sample. The command also provides a mapping between the index numbers that label the covariance parameters of the model and their associated values and labels for the alternative variable. estat covariance computes the estimated variance–covariance matrix of the latent-variable errors for the alternatives. The estimates are displayed, and the variance–covariance matrix is stored in r(cov). 149 150 asroprobit postestimation — Postestimation tools for asroprobit estat correlation computes the estimated correlation matrix of the latent-variable errors for the alternatives. The estimates are displayed, and the correlation matrix is stored in r(cor). estat facweights displays the covariance factor weights matrix and stores it in r(C). estat mfx computes marginal effects of a simulated probability of a set of ranked alternatives. The probability is stored in r(pr), the matrix of rankings is stored in r(ranks), and the matrix of marginal-effect statistics is stored in r(mfx). Syntax for predict predict type predict type newvar if in stub* | newvarlist , statistic altwise if in , scores Description statistic Main probability of each ranking, by case; the default probability that each alternative is preferred linear prediction standard error of the linear prediction pr pr1 xb stdp These statistics are available both in and out of sample; type predict only for the estimation sample. . . . if e(sample) . . . 
if wanted Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main pr, the default, calculates the probability of each ranking. For each case, one probability is computed for the ranks in e(depvar). pr1 calculates the probability that each alternative is preferred. xb calculates the linear prediction xij β + zi αj for alternative j and case i. stdp calculates the standard error of the linear predictor. altwise specifies that alternativewise deletion be used when marking out observations due to missing values in your variables. The default is to use casewise deletion. The xb and stdp options always use alternativewise deletion. scores calculates the scores for each coefficient in e(b). This option requires a new variable list of length equal to the number of columns in e(b). Otherwise, use the stub* option to have predict generate enumerated variables with prefix stub. asroprobit postestimation — Postestimation tools for asroprobit Syntax for estat Alternative summary statistics estat alternatives Covariance matrix of the latent-variable errors for the alternatives estat covariance , format(% fmt) border(bspec) left(#) Correlation matrix of the latent-variable errors for the alternatives estat correlation , format(% fmt) border(bspec) left(#) Covariance factor weights matrix estat facweights , format(% fmt) border(bspec) left(#) Marginal effects estat mfx if estat mfx options in , estat mfx options Description Main varlist(varlist) display marginal effects for varlist at(median atlist ) calculate marginal effects at these values rank(ranklist) calculate marginal effects for the simulated probability of these ranked alternatives Options set confidence interval level; default is level(95) treat indicator variables as continuous do not restrict calculation of the medians to the estimation sample ignore weights when calculating medians level(#) nodiscrete noesample nowght Menu for estat Statistics > Postestimation > Reports and statistics Options for estat Options for estat are presented under the following headings: Options for estat covariance, estat correlation, and estat facweights Options for estat mfx 151 152 asroprobit postestimation — Postestimation tools for asroprobit Options for estat covariance, estat correlation, and estat facweights format(% fmt) sets the matrix display format. The default for estat covariance and estat facweights is format(%9.0g). The default for estat correlation is format(%9.4f). border(bspec) sets the matrix display border style. The default is border(all). See [P] matlist. left(#) sets the matrix display left indent. The default is left(2). See [P] matlist. Options for estat mfx Main varlist(varlist) specifies the variables for which to display marginal effects. The default is all variables. at(median atlist ) specifies the values at which the marginal effects are to be calculated. atlist is alternative:variable = # variable = # ... ) The marginal effects are calculated at the medians of the independent variables. After specifying the summary statistic, you can specify specific values for variables. You can specify values for alternative-specific variables by alternative, or you can specify one value for all alternatives. You can specify only one value for case-specific variables. For example, in the wlsrank dataset, female and score are case-specific variables, whereas high and low are alternative-specific variables. The following would be a legal syntax for estat mfx: . 
estat mfx, at(median high=0 esteem:high=1 low=0 security:low=1 female=1) When nodiscrete is not specified, at(median atlist ) has no effect on computing marginal effects for indicator variables, which are calculated as the discrete change in the simulated probability as the indicator variable changes from 0 to 1. The median computations respect any if or in qualifiers, so you can restrict the data over which the medians are computed. You can even restrict the values to a specific case, for example, . estat mfx if case==13 rank(ranklist) specifies the ranks for the alternatives. ranklist is alternative = # alternative = # . . . ) The default is to rank the calculated latent variables. Alternatives excluded from rank() are omitted from the analysis. You must therefore specify at least two alternatives in rank(). You may have tied ranks in the rank specification. Only the order in the ranks is relevant. Options level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. nodiscrete specifies that indicator variables be treated as continuous variables. An indicator variable is one that takes on the value 0 or 1 in the estimation sample. By default, the discrete change in the simulated probability is computed as the indicator variable changes from 0 to 1. noesample specifies that the whole dataset be considered instead of only those marked in the e(sample) defined by the asroprobit command. nowght specifies that weights be ignored when calculating the medians. asroprobit postestimation — Postestimation tools for asroprobit 153 Remarks and examples Remarks are presented under the following headings: Predicted probabilities Obtaining estimation statistics Predicted probabilities After fitting an alternative-specific rank-ordered probit model, you can use predict to obtain the probabilities of alternative rankings or the probabilities of each alternative being preferred. When evaluating the multivariate normal probabilities via (quasi) Monte Carlo, predict uses the same method to generate the (quasi) random sequence of numbers as the previous call to asroprobit. For example, if you specified intmethod(halton) when fitting the model, predict also uses Halton sequences. Example 1 In example 1 of [R] asroprobit, we fit a model of job characteristic preferences. This is a study of 1957 Wisconsin high school graduates that were asked to rate their relative preference of four job characteristics: esteem, a job other people regard highly; variety, a job that is not repetitive and allows you to do a variety of things; autonomy, a job where your supervisor does not check on you frequently; and security, a job with a low risk of being laid off. The case-specific covariates are gender, female, an indicator variable for females, and score, a score on a general mental ability test measured in standard deviations. The alternative-specific variables are high and low, which indicate whether the respondent’s current job is high or low in esteem, variety, autonomy, or security. This approach provides three states for a respondent’s current job status for each alternative, (1, 0), (0, 1), and (0, 0), using the notation (high, low). The score (1, 1) is omitted because the respondent’s current job cannot be considered both high and low in one of the job characteristics. 
The (0, 0) score would indicate that the respondent’s current job does not rank high or low (is neutral) in a job characteristic. The alternatives are ranked such that 1 is the preferred alternative and 4 is the least preferred. We can obtain the probabilities of the observed alternative rankings, the pr option, and the probability of each alternative being preferred, the pr1 option, by using predict: 154 asroprobit postestimation — Postestimation tools for asroprobit . use http://www.stata-press.com/data/r13/wlsrank (1992 Wisconsin Longitudinal Study data on job values) . asroprobit rank high low if noties, case(id) alternatives(jobchar) > casevars(female score) reverse (output omitted ) . keep if e(sample) (11244 observations deleted) . predict prob, pr . predict prob1, pr1 . list id jobchar prob prob1 rank female score high low in 1/12 id jobchar prob prob1 rank female score high low 1. 2. 3. 4. 5. 13 13 13 13 19 security autonomy variety esteem autonomy .0421807 .0421807 .0421807 .0421807 .0942025 .2784269 .1029036 .6026725 .0160111 .1232488 3 1 2 4 4 0 0 0 0 1 .3246512 .3246512 .3246512 .3246512 .0492111 0 0 1 0 0 1 0 0 1 0 6. 7. 8. 9. 10. 19 19 19 22 22 esteem security variety esteem variety .0942025 .0942025 .0942025 .1414177 .1414177 .0140261 .4601368 .4025715 .0255264 .4549441 3 1 2 4 1 1 1 1 1 1 .0492111 .0492111 .0492111 1.426412 1.426412 0 1 0 1 0 0 0 0 0 0 11. 12. 22 22 security autonomy .1414177 .1414177 .2629494 .2566032 3 2 1 1 1.426412 1.426412 0 1 0 0 The prob variable is constant for each case because it contains the probability of the ranking in the rank variable. On the other hand, the prob1 variable contains the estimated probability of each alternative being preferred. For each case, the sum of the values in prob1 will be approximately 1.0. They do not add up to exactly 1.0 because of approximations due to the GHK algorithm. Obtaining estimation statistics For examples of the specialized estat subcommands covariance and correlation, see [R] asmprobit postestimation. The entry also has a good example of computing marginal effects after asmprobit that is applicable to asroprobit. Below we will elaborate further on marginal effects after asroprobit where we manipulate the rank() option. Example 2 We will continue with the preferred job characteristics example where we first compute the marginal effects for case id = 13. asroprobit postestimation — Postestimation tools for asroprobit 155 . estat mfx if id==13, rank(security=3 autonomy=1 variety=2 esteem=4) Pr(esteem=4 variety=2 autonomy=1 security=3) = .04218068 variable dp/dx Std. Err. z P>|z| [ 95% C.I. ] X -.008713 -.009102 .025535 -.003745 .001964 .003127 .007029 .001394 -4.44 -2.91 3.63 -2.69 0.000 0.004 0.000 0.007 -.012562 -.015231 .011758 -.006477 -.004864 -.002973 .039313 -.001013 0 1 0 0 esteem variety autonomy security .001614 .001809 -.003849 .000582 .002646 .003012 .006104 .000985 0.61 0.60 -0.63 0.59 0.542 0.548 0.528 0.554 -.003572 -.004094 -.015813 -.001348 .0068 .007712 .008115 .002513 1 0 0 1 casevars female* score .009767 .008587 .009064 .004488 1.08 1.91 0.281 0.056 -.007998 -.00021 .027533 .017384 0 .32465 high* esteem variety autonomy security low* (*) dp/dx is for discrete change of indicator variable from 0 to 1 Next we compute the marginal effects for the probability that autonomy is preferred given the profile of case id = 13. . 
estat mfx if id==13, rank(security=2 autonomy=1 variety=2 esteem=2) Pr(esteem=3 Pr(esteem=4 Pr(esteem=2 Pr(esteem=4 Pr(esteem=2 Pr(esteem=3 variety=4 variety=3 variety=4 variety=2 variety=3 variety=2 variable autonomy=1 autonomy=1 autonomy=1 autonomy=1 autonomy=1 autonomy=1 security=2) security=2) security=3) security=3) security=4) security=4) + + + + + = .10276103 dp/dx Std. Err. z P>|z| [ 95% C.I. ] X -.003524 -.036203 .057279 -.0128 .001258 .00894 .013801 .002665 -2.80 -4.05 4.15 -4.80 0.005 0.000 0.000 0.000 -.005989 -.053724 .030231 -.018024 -.001059 -.018681 .084328 -.007576 0 1 0 0 esteem variety autonomy security .000518 .006409 -.008818 .002314 .000833 .010588 .013766 .003697 0.62 0.61 -0.64 0.63 0.534 0.545 0.522 0.531 -.001116 -.014343 -.035799 -.004932 .002151 .027161 .018163 .009561 1 0 0 1 casevars female* score .013839 .017917 .021607 .011062 0.64 1.62 0.522 0.105 -.028509 -.003764 .056188 .039598 0 .32465 high* esteem variety autonomy security low* (*) dp/dx is for discrete change of indicator variable from 0 to 1 The probability computed by estat mfx matches the probability computed by predict, pr1 only within three digits. This outcome is because of how the computation is carried out and the numeric inaccuracy of the GHK simulator using a Hammersley point set of length 200. The computation carried out by estat mfx literally computes all six probabilities listed in the header of the MFX table and sums them. The computation by predict, pr1 is the same as predict after asmprobit (multinomial probit): it computes the probability that autonomy is chosen, thus requiring only one 156 asroprobit postestimation — Postestimation tools for asroprobit call to the GHK simulator. Hence, there is a difference in the reported values even though the two probability statements are equivalent. Stored results estat mfx stores the following in r(): Scalars r(pr) Matrices r(ranks) r(mfx) scalar containing the computed probability of the ranked alternatives. column vector containing the alternative ranks. The rownames identify the alternatives. matrix containing the computed marginal effects and associated statistics. Column 1 of the matrix contains the marginal effects; column 2, their standard errors; column 3, their z statistics; and columns 4 and 5, the confidence intervals. Column 6 contains the values of the independent variables used to compute the probabilities r(pr). Also see [R] asroprobit — Alternative-specific rank-ordered probit regression [R] asmprobit — Alternative-specific multinomial probit regression [U] 20 Estimation and postestimation commands Title BIC note — Calculating and interpreting BIC Description Also see Remarks and examples Methods and formulas References Description This entry discusses a statistical issue that arises when using the Bayesian information criterion (BIC) to compare models. Stata calculates BIC, assuming N = e(N)—we will explain—but sometimes it would be better if a different N were used. Commands that calculate BIC have an n() option, allowing you to specify the N to be used. In summary, 1. If you are comparing results estimated by the same estimation command, using the default BIC calculation is probably fine. There is an issue, but most researchers would ignore it. 2. If you are comparing results estimated by different estimation commands, you need to be on your guard. a. If the different estimation commands share the same definitions of observations, independence, and the like, you are back in case 1. b. 
If they differ in these regards, you need to think about the value of N that should be used. For example, logit and xtlogit differ in that the former assumes independent observations and the latter, independent panels. c. If estimation commands differ in the events being used over which the likelihood function is calculated, the information criteria may not be comparable at all. We say information criteria because this would apply equally to the Akaike information criterion (AIC), as well as to BIC. For instance, streg and stcox produce such incomparable results. The events used by streg are the actual survival times, whereas the events used by stcox are failures within risk pools, conditional on the times at which failures occurred. Remarks and examples Remarks are presented under the following headings: Background The problem of determining N The problem of conformable likelihoods The first problem does not arise with AIC; the second problem does Calculating BIC correctly 157 158 BIC note — Calculating and interpreting BIC Background The AIC and the BIC are two popular measures for comparing maximum likelihood models. AIC and BIC are defined as AIC = −2 × ln(likelihood) + 2 × k BIC = −2 × ln(likelihood) + ln(N ) × k where k = number of parameters estimated N = number of observations We are going to discuss AIC along with BIC because AIC has some of the problems that BIC has, but not all. AIC and BIC can be viewed as measures that combine fit and complexity. Fit is measured negatively by −2 × ln(likelihood); the larger the value, the worse the fit. Complexity is measured positively, either by 2 × k (AIC) or ln(N ) × k (BIC). Given two models fit on the same data, the model with the smaller value of the information criterion is considered to be better. There is substantial literature on these measures: see Akaike (1974); Raftery (1995); Sakamoto, Ishiguro, and Kitagawa (1986); and Schwarz (1978). When Stata calculates the above measures, it uses the rank of e(V) for k and it uses e(N) for N . e(V) and e(N) are Stata notation for results stored by the estimation command. e(V) is the variance–covariance matrix of the estimated parameters, and e(N) is the number of observations in the dataset used in calculating the result. The problem of determining N The difference between AIC and BIC is that AIC uses the constant 2 to weight k , whereas BIC uses ln(N ). Determining what value of N should be used is problematic. Despite appearances, the definition “N is the number of observations” is not easy to make operational. N does not appear in the likelihood function itself, N is not the output of a standard statistical formula, and what is an observation is often subjective. Example 1 Often what is meant by N is obvious. Consider a simple logit model. What is meant by N is the number of observations that are statistically independent and that corresponds to M , the number of observations in the dataset used in the calculation. We will write N = M . But now assume that the same dataset has a grouping variable and the data are thought to be clustered within group. To keep the problem simple, let’s pretend that there are G groups and m observations within group, so that M = G×m. Because you are worried about intragroup correlation, you fit your model with xtlogit, grouping on the grouping variable. Now you wish to calculate BIC. What is the N that should be used? N = M or N = G? BIC note — Calculating and interpreting BIC 159 That is a deep question. 
If the observations really are independent, then you should use N = M . If the observations within group are not just correlated but are duplicates of one another, and they had to be so, then you should use M = G. Between those two extremes, you should probably use a number between N and G, but determining what that number should be from measured correlations is difficult. Using N = M is conservative in that, if anything, it overweights complexity. Conservativeness, however, is subjective, too: using N = G could be considered more conservative in that fewer constraints are being placed on the data. When the estimated correlation is high, our reaction would be that using N = G is probably more reasonable. Our first reaction, however, would be that using BIC to compare models is probably a misuse of the measure. Stata uses N = M . An informal survey of web-based literature suggests that N = M is the popular choice. There is another reason, not so good, to choose N = M . It makes across-model comparisons more likely to be valid when performed without thinking about the issue. Say that you wish to compare the logit and xtlogit results. Thus you need to calculate BICp = −2 × ln(likelihoodp ) + ln(Np ) × k BICx = −2 × ln(likelihoodx ) + ln(Nx ) × k Whatever N you use, you must use the same N in both formulas. Stata’s choice of N = M at least meets that test. Example 2 In the above example, using N = M is reasonable. Now let’s look at when using N = M is wrong, even if popular. Consider a model fit by stcox. Using N = M is certainly wrong if for no other reason than M is not even a well-defined number. The same data can be represented by different datasets with different numbers of observations. For example, in one dataset, there might be 1 observation per subject. In another, the same subjects could have two records each, the first recording the first half of the time at risk and the second recording the remaining part. All statistics calculated by Stata on either dataset would be the same, but M would be different. Deciding on the right definition, however, is difficult. Viewed one way, N in the Cox regression case should be the number of risk pools, R, because the Cox regression calculation is made on the basis of the independent risk pools. Viewed another way, N should be the number of subjects, Nsubj , because, even though the likelihood function is based on risk pools, the parameters estimated are at the subject level. You can decide which argument you prefer. For parametric survival models, in single-record data, N = M is unambiguously correct. For multirecord data, there is an argument for N = M and for N = Nsubj . 160 BIC note — Calculating and interpreting BIC The problem of conformable likelihoods The problem of conformable likelihoods does not concern N . Researchers sometimes use information criteria such as BIC and AIC to make comparisons across models. For that to be valid, the likelihoods must be conformable; that is, the likelihoods must all measure the same thing. It is common to think of the likelihood function as the Pr(data | parameters), but in fact, the likelihood is Pr(particular events in the data | parameters) You must ensure that the events are the same. For instance, they are not the same in the semiparametric Cox regression and the various parametric survival models. In Cox regression, the events are, at each failure time, that the subjects observed to fail in fact failed, given that failures occurred at those times. 
In the parametric models, the events are that each subject failed exactly when the subject was observed to fail. The formula for AIC and BIC is measure = −2 × ln(likelihood) + complexity When you are comparing models, if the likelihoods are measuring different events, even if the models obtain estimates of the same parameters, differences in the information measures are irrelevant. The first problem does not arise with AIC; the second problem does Regardless of model, the problem of defining N never arises with AIC because N is not used in the AIC calculation. AIC uses a constant 2 to weight complexity as measured by k , rather than ln(N ). For both AIC and BIC, however, the likelihood functions must be conformable; that is, they must be measuring the same event. Calculating BIC correctly When using BIC to compare results, and especially when using BIC to compare results from different models, you should think carefully about how N should be defined. Then specify that number by using the n() option: . estimates stats full sub, n(74) Akaike’s information criterion and Bayesian information criterion Model Obs ll(null) ll(model) df AIC BIC full sub 74 74 -45.03321 -45.03321 -20.59083 -27.17516 4 3 49.18167 60.35031 58.39793 67.26251 Note: N=74 used in calculating BIC Both estimates stats and estat ic allow the n() option; see [R] estimates stats and [R] estat ic. BIC note — Calculating and interpreting BIC 161 Methods and formulas AIC and BIC are defined as AIC = −2 × ln(likelihood) + 2 × k BIC = −2 × ln(likelihood) + ln(N ) × k where k is the model degrees of freedom calculated as the rank of variance–covariance matrix of the parameters e(V) and N is the number of observations used in estimation or, more precisely, the number of independent terms in the likelihood. Operationally, N is defined as e(N) unless the n() option is specified. References Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19: 716–723. Raftery, A. E. 1995. Bayesian model selection in social research. In Vol. 25 of Sociological Methodology, ed. P. V. Marsden, 111–163. Oxford: Blackwell. Sakamoto, Y., M. Ishiguro, and G. Kitagawa. 1986. Akaike Information Criterion Statistics. Dordrecht, The Netherlands: Reidel. Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics 6: 461–464. 
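To make the formulas above concrete, BIC with a user-chosen N can also be reproduced by hand from the stored results. The following is only an illustrative sketch with a hypothetical logit model and N = 74; the quantities used are e(ll), the log likelihood, and e(rank), the rank of e(V), that is, k.
. logit y x1 x2
. estat ic, n(74)
. display -2*e(ll) + ln(74)*e(rank)
The hand calculation on the last line should match the BIC column reported by estat ic, n(74).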
Also see [R] estat ic — Display information criteria [R] estimates stats — Model-selection statistics Title binreg — Generalized linear models: Extensions to the binomial family Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax binreg depvar indepvars options if in weight , options Description Model noconstant or rr hr rd n(# | varname) exposure(varname) offset(varname) constraints(constraints) collinear mu(varname) init(varname) suppress constant term use logit link and report odds ratios use log link and report risk ratios use log-complement link and report health ratios use identity link and report risk differences use # or varname for number of trials include ln(varname) in model with coefficient constrained to 1 include varname in model with coefficient constrained to 1 apply specified linear constraints keep collinear variables use varname as the initial estimate for the mean of depvar synonym for mu(varname) SE/Robust vce(vcetype) t(varname) vfactor(#) disp(#) scale(x2 | dev | #) vcetype may be eim, robust, cluster clustvar, oim, opg, bootstrap, jackknife, hac kernel, jackknife1, or unbiased variable name corresponding to time multiply variance matrix by scalar # quasi-likelihood multiplier set the scale parameter; default is scale(1) Reporting level(#) coefficients nocnsreport display options set confidence level; default is level(95) report nonexponentiated coefficients do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling 162 binreg — Generalized linear models: Extensions to the binomial family 163 Maximization irls ml maximize options fisher(#) search use iterated, reweighted least-squares optimization; the default use maximum likelihood optimization control the maximization process; seldom used Fisher scoring steps search for good starting values coeflegend display legend instead of statistics indepvars may contain factor variables; see [U] 11.4.3 Factor variables. depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, mi estimate, rolling, and statsby are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap), vce(jackknife), and vce(jackknife1) are not allowed with the mi estimate prefix; see [MI] mi estimate. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. fweights, aweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Generalized linear models > GLM for the binomial family Description binreg fits generalized linear models for the binomial family. It estimates odds ratios, risk ratios, health ratios, and risk differences. The available links are Option Implied link Parameter or rr hr rd logit log log complement identity odds ratios = exp(β ) risk ratios = exp(β ) health ratios = exp(β ) risk differences = β Estimates of odds, risk, and health ratios are obtained by exponentiating the appropriate coefficients. The or option produces the same results as Stata’s logistic command, and or coefficients yields the same results as the logit command. When no link is specified, or is assumed. Options Model noconstant; see [R] estimation options. 
or requests the logit link and results in odds ratios if coefficients is not specified. rr requests the log link and results in risk ratios if coefficients is not specified. hr requests the log-complement link and results in health ratios if coefficients is not specified. rd requests the identity link and results in risk differences. 164 binreg — Generalized linear models: Extensions to the binomial family n(# | varname) specifies either a constant integer to use as the denominator for the binomial family or a variable that holds the denominator for each observation. exposure(varname), offset(varname), constraints(constraints), collinear; see [R] estimation options. constraints(constraints) and collinear are not allowed with irls. mu(varname) specifies varname containing an initial estimate for the mean of depvar. This option can be useful if you encounter convergence difficulties. init(varname) is a synonym. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), that are derived from asymptotic theory (oim, opg), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. vce(eim), the default, uses the expected information matrix (EIM) for the variance estimator. binreg also allows the following: vce(hac kernel # ) specifies that a heteroskedasticity- and autocorrelation-consistent (HAC) variance estimate be used. HAC refers to the general form for combining weighted matrices to form the variance estimate. There are three kernels built into binreg. kernel is a user-written program or one of nwest | gallant | anderson If # is not specified, N − 2 is assumed. vce(jackknife1) specifies that the one-step jackknife estimate of variance be used. vce(unbiased) specifies that the unbiased sandwich estimate of variance be used. t(varname) specifies the variable name corresponding to time; see [TS] tsset. binreg does not always need to know t(), though it does if vce(hac . . . ) is specified. Then you can either specify the time variable with t(), or you can tsset your data before calling binreg. When the time variable is required, binreg assumes that the observations are spaced equally over time. vfactor(#) specifies a scalar by which to multiply the resulting variance matrix. This option allows users to match output with other packages, which may apply degrees of freedom or other small-sample corrections to estimates of variance. disp(#) multiplies the variance of depvar by # and divides the deviance by #. The resulting distributions are members of the quasilikelihood family. scale(x2 | dev | #) overrides the default scale parameter. This option is allowed only with Hessian (information matrix) variance estimates. By default, scale(1) is assumed for the discrete distributions (binomial, Poisson, and negative binomial), and scale(x2) is assumed for the continuous distributions (Gaussian, gamma, and inverse Gaussian). scale(x2) specifies that the scale parameter be set to the Pearson chi-squared (or generalized chi-squared) statistic divided by the residual degrees of freedom, which was recommended by McCullagh and Nelder (1989) as a good general choice for continuous distributions. scale(dev) sets the scale parameter to the deviance divided by the residual degrees of freedom. This option provides an alternative to scale(x2) for continuous distributions and overdispersed or underdispersed discrete distributions. 
scale(#) sets the scale parameter to #. binreg — Generalized linear models: Extensions to the binomial family 165 Reporting level(#), noconstant; see [R] estimation options. coefficients displays the nonexponentiated coefficients and corresponding standard errors and confidence intervals. This option has no effect when the rd option is specified, because it always presents the nonexponentiated coefficients. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization irls requests iterated, reweighted least-squares (IRLS) optimization of the deviance instead of Newton–Raphson optimization of the log likelihood. This option is the default. ml requests that optimization be carried out by using Stata’s ml command; see [R] ml. maximize options: technique(algorithm spec), no log, trace, gradient, showstep, hessian, showtolerance, difficult, iterate(#), tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization method to ml, with technique() set to something other than BHHH, changes the vcetype to vce(oim). Specifying technique(bhhh) changes vcetype to vce(opg). fisher(#) specifies the number of Newton–Raphson steps that should use the Fisher scoring Hessian or EIM before switching to the observed information matrix (OIM). This option is available only if ml is specified and is useful only for Newton–Raphson optimization. search specifies that the command search for good starting values. This option is available only if ml is specified and is useful only for Newton–Raphson optimization. The following option is available with binreg but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Wacholder (1986) suggests methods for estimating risk ratios and risk differences from prospective binomial data. These estimates are obtained by selecting the proper link functions in the generalized linear-model framework. (See Methods and formulas for details; also see [R] glm.) Example 1 Wacholder (1986) presents an example, using data from Wright et al. (1983), of an investigation of the relationship between alcohol consumption and the risk of a low-birthweight baby. Covariates examined included whether the mother smoked (yes or no), mother’s social class (three levels), and drinking frequency (light, moderate, or heavy). The data for the 18 possible categories determined by the covariates are illustrated below. Let’s first describe the data and list a few observations. 166 binreg — Generalized linear models: Extensions to the binomial family . use http://www.stata-press.com/data/r13/binreg . list category n_lbw_~s n_women alcohol smokes social 1. 2. 3. 4. 5. 1 2 3 4 5 11 5 11 6 3 84 79 169 28 13 heavy moderate light heavy moderate nonsmoker nonsmoker nonsmoker smoker smoker 1 1 1 1 1 6. 7. 8. 9. 10. 6 7 8 9 10 1 4 3 12 4 26 22 25 162 17 light heavy moderate light heavy smoker nonsmoker nonsmoker nonsmoker smoker 1 2 2 2 2 11. 12. 13. 14. 15. 11 12 13 14 15 2 6 0 1 12 7 38 14 18 91 moderate light heavy moderate light smoker smoker nonsmoker nonsmoker nonsmoker 2 2 3 3 3 16. 17. 18. 16 17 18 7 2 8 19 18 70 heavy moderate light smoker smoker smoker 3 3 3 Each observation corresponds to one of the 18 covariate structures. 
The number of low-birthweight babies from n women in each category is given by the n lbw babies variable. binreg — Generalized linear models: Extensions to the binomial family 167 We begin by estimating risk ratios: . binreg n_lbw_babies i.soc i.alc i.smo, n(n_women) rr Iteration 1: deviance = 14.2879 Iteration 2: deviance = 13.607 Iteration 3: deviance = 13.60503 Iteration 4: deviance = 13.60503 Generalized linear models No. of obs Optimization : MQL Fisher scoring Residual df (IRLS EIM) Scale parameter Deviance = 13.6050268 (1/df) Deviance Pearson = 11.51517095 (1/df) Pearson Variance function: V(u) = u*(1-u/n_women) [Binomial] Link function : g(u) = ln(u/n_women) [Log] BIC = = = = = 18 12 1 1.133752 .9595976 = -21.07943 EIM Std. Err. z P>|z| [95% Conf. Interval] 1.340001 1.349487 .3127382 .3291488 1.25 1.23 0.210 0.219 .848098 .8366715 2.11721 2.176619 alcohol moderate heavy 1.191157 1.974078 .3265354 .4261751 0.64 3.15 0.523 0.002 .6960276 1.293011 2.038503 3.013884 smokes smoker _cons 1.648444 .0630341 .332875 .0128061 2.48 -13.61 0.013 0.000 1.109657 .0423297 2.448836 .0938656 n_lbw_babies Risk Ratio social 2 3 By default, Stata reports the risk ratios (the exponentiated regression coefficients) estimated by the model. We can see that the risk ratio comparing heavy drinkers with light drinkers, after adjusting for smoking and social class, is 1.974078. That is, mothers who drink heavily during their pregnancy have approximately twice the risk of delivering low-birthweight babies as mothers who are light drinkers. 168 binreg — Generalized linear models: Extensions to the binomial family The nonexponentiated coefficients can be obtained with the coefficients option: . binreg n_lbw_babies i.soc i.alc i.smo, n(n_women) rr coefficients Iteration 1: deviance = 14.2879 Iteration 2: deviance = 13.607 Iteration 3: deviance = 13.60503 Iteration 4: deviance = 13.60503 Generalized linear models No. of obs = 18 Optimization : MQL Fisher scoring Residual df = 12 (IRLS EIM) Scale parameter = 1 Deviance = 13.6050268 (1/df) Deviance = 1.133752 Pearson = 11.51517095 (1/df) Pearson = .9595976 Variance function: V(u) = u*(1-u/n_women) [Binomial] Link function : g(u) = ln(u/n_women) [Log] BIC = -21.07943 EIM Std. Err. z P>|z| .2926702 .2997244 .2333866 .2439066 1.25 1.23 0.210 0.219 -.1647591 -.1783238 .7500994 .7777726 alcohol moderate heavy .1749248 .6801017 .274133 .2158856 0.64 3.15 0.523 0.002 -.362366 .2569737 .7122156 1.10323 smokes smoker _cons .4998317 -2.764079 .2019329 .2031606 2.48 -13.61 0.013 0.000 .1040505 -3.162266 .8956129 -2.365891 n_lbw_babies Coef. social 2 3 [95% Conf. Interval] binreg — Generalized linear models: Extensions to the binomial family 169 Risk differences are obtained with the rd option: . binreg n_lbw_babies i.soc i.alc i.smo, n(n_women) rd Iteration 1: deviance = 18.67277 Iteration 2: deviance = 14.94364 Iteration 3: deviance = 14.9185 Iteration 4: deviance = 14.91762 Iteration 5: deviance = 14.91758 Iteration 6: deviance = 14.91758 Iteration 7: deviance = 14.91758 Generalized linear models No. of obs Optimization : MQL Fisher scoring Residual df (IRLS EIM) Scale parameter Deviance = 14.91758277 (1/df) Deviance Pearson = 12.60353235 (1/df) Pearson Variance function: V(u) = u*(1-u/n_women) [Binomial] Link function : g(u) = u/n_women [Identity] BIC = = = = = 18 12 1 1.243132 1.050294 = -19.76688 EIM Std. Err. 
z P>|z| .0263817 .0365553 .0232124 .0268668 1.14 1.36 0.256 0.174 -.0191137 -.0161026 .0718771 .0892132 alcohol moderate heavy .0122539 .0801291 .0257713 .0302878 0.48 2.65 0.634 0.008 -.0382569 .020766 .0627647 .1394921 smokes smoker _cons .0542415 .059028 .0270838 .0160693 2.00 3.67 0.045 0.000 .0011582 .0275327 .1073248 .0905232 n_lbw_babies Risk Diff. social 2 3 [95% Conf. Interval] The risk difference between heavy drinkers and light drinkers is 0.0801291. Because the risk differences are obtained directly from the coefficients estimated by using the identity link, the coefficients option has no effect here. 170 binreg — Generalized linear models: Extensions to the binomial family Health ratios are obtained with the hr option. The health ratios (exponentiated coefficients for the log-complement link) are reported directly. . binreg n_lbw_babies i.soc i.alc i.smo, n(n_women) hr Iteration 1: deviance = 21.15233 Iteration 2: deviance = 15.16467 Iteration 3: deviance = 15.13205 Iteration 4: deviance = 15.13114 Iteration 5: deviance = 15.13111 Iteration 6: deviance = 15.13111 Iteration 7: deviance = 15.13111 Generalized linear models No. of obs = 18 Optimization : MQL Fisher scoring Residual df = 12 (IRLS EIM) Scale parameter = 1 Deviance = 15.13110545 (1/df) Deviance = 1.260925 Pearson = 12.84203917 (1/df) Pearson = 1.07017 Variance function: V(u) = u*(1-u/n_women) [Binomial] Link function : g(u) = ln(1-u/n_women) [Log complement] BIC = -19.55336 EIM Std. Err. n_lbw_babies HR social 2 3 .9720541 .9597182 .024858 .0290412 alcohol moderate heavy .9871517 .9134243 smokes smoker _cons .9409983 .9409945 z P>|z| [95% Conf. Interval] -1.11 -1.36 0.268 0.174 .9245342 .9044535 1.022017 1.01836 .0278852 .0325726 -0.46 -2.54 0.647 0.011 .9339831 .8517631 1.043347 .9795493 .0296125 .0163084 -1.93 -3.51 0.053 0.000 .8847125 .9095674 1.000865 .9735075 (HR) Health ratios To see the nonexponentiated coefficients, we can specify the coefficients option. binreg — Generalized linear models: Extensions to the binomial family Stored results binreg, irls stores the following in e(): Scalars e(N) e(k) e(k eq model) e(df m) e(df) e(phi) e(disp) e(bic) e(N clust) e(deviance) e(deviance s) e(deviance p) e(deviance ps) e(dispers) e(dispers s) e(dispers p) e(dispers ps) e(vf) e(rank) e(rc) Macros e(cmd) e(cmdline) e(depvar) e(eform) e(varfunc) e(varfunct) e(varfuncf) e(link) e(linkt) e(linkf) e(m) e(wtype) e(wexp) e(title fl) e(clustvar) e(offset) e(cons) e(hac kernel) e(hac lag) e(vce) e(vcetype) e(opt) e(opt1) e(opt2) e(properties) e(predict) e(marginsok) e(marginsnotok) e(asbalanced) e(asobserved) number of observations number of parameters number of equations in overall model test model degrees of freedom residual degrees of freedom model scale parameter dispersion parameter model BIC number of clusters deviance scaled deviance Pearson deviance scaled Pearson deviance dispersion scaled dispersion Pearson dispersion scaled Pearson dispersion factor set by vfactor(), 1 if not set rank of e(V) return code binreg command as typed name of dependent variable eform() option implied by or, rr, hr, or rd program to calculate variance function variance title variance function program to calculate link function link title link function number of binomial trials weight type weight expression family–link title name of cluster variable linear offset variable noconstant or not set HAC kernel HAC lag vcetype specified in vce() title used to label Std. Err. 
type of optimization optimization title, line 1 optimization title, line 2 b V program used to implement predict predictions allowed by margins predictions disallowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved 171 172 binreg — Generalized linear models: Extensions to the binomial family Matrices e(b) e(Cns) e(V) e(V modelbased) Functions e(sample) coefficient vector constraints matrix variance–covariance matrix of the estimators model-based variance marks estimation sample binreg, ml stores the following in e(): Scalars e(N) e(k) e(k eq) e(k eq model) e(k dv) e(df m) e(df) e(phi) e(aic) e(bic) e(ll) e(N clust) e(chi2) e(p) e(deviance) e(deviance s) e(deviance p) e(deviance ps) e(dispers) e(dispers s) e(dispers p) e(dispers ps) e(vf) e(rank) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(eform) e(varfunc) e(varfunct) e(varfuncf) e(link) e(linkt) e(linkf) e(m) e(wtype) e(wexp) e(title) e(title fl) e(clustvar) e(offset) e(cons) e(hac kernel) e(hac lag) e(chi2type) e(vce) number of observations number of parameters number of equations in e(b) number of equations in overall model test number of dependent variables model degrees of freedom residual degrees of freedom model scale parameter model AIC, if ml model BIC log likelihood, if ml number of clusters χ2 significance of model test deviance scaled deviance Pearson deviance scaled Pearson deviance dispersion scaled dispersion Pearson dispersion scaled Pearson dispersion factor set by vfactor(), 1 if not set rank of e(V) number of iterations return code 1 if converged, 0 otherwise binreg command as typed name of dependent variable eform() option implied by or, rr, hr, or rd program to calculate variance function variance title variance function program to calculate link function link title link function number of binomial trials weight type weight expression title in estimation output family–link title name of cluster variable linear offset variable noconstant or not set HAC kernel HAC lag Wald; type of model χ2 test vcetype specified in vce() binreg — Generalized linear models: Extensions to the binomial family e(vcetype) e(opt) e(opt1) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(marginsok) e(marginsnotok) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) Functions e(sample) 173 title used to label Std. Err. type of optimization optimization title, line 1 max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict predictions allowed by margins predictions disallowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample Methods and formulas Let πi be the probability of success for the ith observation, i = 1, . . . , N , and let Xβ be the linear predictor. The link function relates the covariates of each observation to its respective probability through the linear predictor. In logistic regression, the logit link is used: π ln 1−π = Xβ The regression coefficient βk represents the change in the logarithm of the odds associated with a one-unit change in the value of the Xk covariate; thus exp(βk ) is the ratio of the odds associated with a change of one unit in Xk . 
For risk differences, the identity link π = Xβ is used. The regression coefficient βk represents the risk difference associated with a change of one unit in Xk . When using the identity link, you can obtain fitted probabilities outside the interval (0, 1). As suggested by Wacholder, at each iteration, fitted probabilities are checked for range conditions (and put back in range if necessary). For example, if the identity link results in a fitted probability that is smaller than 1e–4, the probability is replaced with 1e–4 before the link function is calculated. A similar adjustment is made for the logarithmic link, which is used for estimating the risk ratio, ln(π) = Xβ , where exp(βk ) is the risk ratio associated with a change of one unit in Xk , and for the log-complement link used to estimate the probability of no disease or health, where exp(βk ) represents the “health ratio” associated with a change of one unit in Xk . This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. 174 binreg — Generalized linear models: Extensions to the binomial family References Cummings, P. 2009. Methods for estimating adjusted risk ratios. Stata Journal 9: 175–196. Hardin, J. W., and M. A. Cleves. 1999. sbe29: Generalized linear models: Extensions to the binomial family. Stata Technical Bulletin 50: 21–25. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 140–146. College Station, TX: Stata Press. Kleinbaum, D. G., and M. Klein. 2010. Logistic Regression: A Self-Learning Text. 3rd ed. New York: Springer. McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. 2nd ed. London: Chapman & Hall/CRC. Wacholder, S. 1986. Binomial regression in GLIM: Estimating risk ratios and risk differences. American Journal of Epidemiology 123: 174–184. Wright, J. T., I. G. Barrison, I. G. Lewis, K. D. MacRae, E. J. Waterson, P. J. Toplis, M. G. Gordon, N. F. Morris, and I. M. Murray-Lyon. 1983. Alcohol consumption, pregnancy and low birthweight. Lancet 1: 663–665. Also see [R] binreg postestimation — Postestimation tools for binreg [R] glm — Generalized linear models [ME] mecloglog — Multilevel mixed-effects complementary log-log regression [ME] meglm — Multilevel mixed-effects generalized linear model [ME] melogit — Multilevel mixed-effects logistic regression [ME] meprobit — Multilevel mixed-effects probit regression [MI] estimation — Estimation commands for use with mi estimate [U] 20 Estimation and postestimation commands Title binreg postestimation — Postestimation tools for binreg Description References Syntax for predict Also see Menu for predict Options for predict Description The following postestimation commands are available after binreg: Command Description contrast estat ic estat summarize estat vce estimates forecast1 lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) 
point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest margins marginsplot nlcom predict predictnl pwcompare test testnl 1 forecast may not be used with mi estimation results. 175 176 binreg postestimation — Postestimation tools for binreg Syntax for predict predict type newvar if in , statistic options Description statistic Main mu xb eta stdp anscombe cooksd deviance hat likelihood pearson response score working expected value of y ; the default b linear prediction η = xβ synonym for xb standard error of the linear prediction Anscombe (1953) residuals Cook’s distance deviance residuals diagonals of the “hat” matrix weighted average of the standardized deviance and standard Pearson residuals Pearson residuals differences between the observed and fitted outcomes first derivative of the log likelihood with respect to xj β working residuals options Description Options nooffset adjusted standardized studentized modified modify calculations to ignore the offset variable adjust deviance residual to speed up convergence multiply residual by the factor (1 − h)1/2 multiply residual by one over the square root of the estimated scale parameter modify denominator of residual to be a reasonable estimate of the variance of depvar These statistics are available both in and out of sample; type predict the estimation sample. . . . if e(sample) . . . if wanted only for Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main b ) [ng −1 (xβ b) mu, the default, specifies that predict calculate the expected value of y , equal to g −1 (xβ for the binomial family]. b xb calculates the linear prediction η = xβ. eta is a synonym for xb. stdp calculates the standard error of the linear prediction. binreg postestimation — Postestimation tools for binreg 177 anscombe calculates the Anscombe (1953) residuals to produce residuals that closely follow a normal distribution. cooksd calculates Cook’s distance, which measures the aggregate change in the estimated coefficients when each observation is left out of the estimation. deviance calculates the deviance residuals, which are recommended by McCullagh and Nelder (1989) and others as having the best properties for examining goodness of fit of a GLM. They are approximately normally distributed if the model is correct and may be plotted against the fitted values or against a covariate to inspect the model’s fit. Also see the pearson option below. hat calculates the diagonals of the “hat” matrix, analogous to linear regression. likelihood calculates a weighted average of the standardized deviance and standardized Pearson (described below) residuals. pearson calculates the Pearson residuals, which often have markedly skewed distributions for nonnormal family distributions. Also see the deviance option above. response calculates the differences between the observed and fitted outcomes. score calculates the equation-level score, ∂ ln L/∂(xj β). working calculates the working residuals, which are response residuals weighted according to the derivative of the link function. Options nooffset is relevant only if you specified offset(varname) for binreg. 
It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xj b rather than as xj b + offsetj . adjusted adjusts the deviance residual to make the convergence to the limiting normal distribution faster. The adjustment deals with adding to the deviance residual a higher-order term depending on the variance function family. This option is allowed only when deviance is specified. standardized requests that the residual be multiplied by the factor (1 − h)−1/2 , where h is the diagonal of the hat matrix. This step is done to take into account the correlation between depvar and its predicted value. studentized requests that the residual be multiplied by one over the square root of the estimated scale parameter. modified requests that the denominator of the residual be modified to be a reasonable estimate of the variance of depvar. The base residual is multiplied by the factor (k/w)−1/2 , where k is either one or the user-specified dispersion parameter and w is the specified weight (or one if left unspecified). References Anscombe, F. J. 1953. Contribution of discussion paper by H. Hotelling “New light on the correlation coefficient and its transforms”. Journal of the Royal Statistical Society, Series B 15: 229–230. McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. 2nd ed. London: Chapman & Hall/CRC. Also see [R] binreg — Generalized linear models: Extensions to the binomial family [U] 20 Estimation and postestimation commands Title biprobit — Bivariate probit regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax Bivariate probit regression biprobit depvar1 depvar2 indepvars if in weight , options Seemingly unrelated bivariate probit regression biprobit equation1 equation2 if in weight , su options where equation1 and equation2 are specified as ( eqname: depvar = indepvars , noconstant offset(varname) ) options Description Model noconstant partial offset1(varname) offset2(varname) constraints(constraints) collinear suppress constant term fit partial observability model offset variable for first equation offset variable for second equation apply specified linear constraints keep collinear variables SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife Reporting level(#) noskip nocnsreport display options set confidence level; default is level(95) perform likelihood-ratio test do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics 178 biprobit — Bivariate probit regression 179 Description su options Model partial constraints(constraints) collinear fit partial observability model apply specified linear constraints keep collinear variables SE/Robust vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife vce(vcetype) Reporting set confidence level; default is level(95) perform likelihood-ratio test do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling level(#) noskip nocnsreport display options Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics indepvars may contain factor variables; see 
[U] 11.4.3 Factor variables. depvar1 , depvar2 , indepvars, and depvar may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce(), noskip, and weights are not allowed with the svy prefix; see [SVY] svy. pweights, fweights, and iweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu biprobit Statistics > Binary outcomes > Bivariate probit regression seemingly unrelated biprobit Statistics > Binary outcomes > Seemingly unrelated bivariate probit regression Description biprobit fits maximum-likelihood two-equation probit models—either a bivariate probit or a seemingly unrelated probit (limited to two equations). 180 biprobit — Bivariate probit regression Options Model noconstant; see [R] estimation options. partial specifies that the partial observability model be fit. This particular model commonly has poor convergence properties, so we recommend that you use the difficult option if you want to fit the Poirier partial observability model; see [R] maximize. This model computes the product of the two dependent variables so that you do not have to replace each with the product. offset1(varname), offset2(varname), constraints(constraints), collinear; see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. noskip specifies that a full maximum-likelihood model with only a constant for the regression equation be fit. This model is not displayed but is used as the base model to compute a likelihood-ratio test for the model test statistic displayed in the estimation header. By default, the overall model test statistic is an asymptotically equivalent Wald test of all the parameters in the regression equation being zero (except the constant). For many models, this option can substantially increase estimation time. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). The following option is available with biprobit but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples For a good introduction to the bivariate probit models, see Greene (2012, 738–752) and Pindyck and Rubinfeld (1998). Poirier (1980) explains the partial observability model. Van de Ven and Van Pragg (1981) explain the probit model with sample selection; see [R] heckprobit for details. 
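Because the partial observability model is prone to convergence problems, a typical invocation combines the partial option with the difficult maximization option mentioned above. The following is only a schematic sketch with hypothetical variables y1, y2, x1, and x2:
. biprobit y1 y2 x1 x2, partial difficult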
biprobit — Bivariate probit regression 181 Example 1 We use the data from Pindyck and Rubinfeld (1998, 332). In this dataset, the variables are whether children attend private school (private), number of years the family has been at the present residence (years), log of property tax (logptax), log of income (loginc), and whether the head of the household voted for an increase in property taxes (vote). We wish to model the bivariate outcomes of whether children attend private school and whether the head of the household voted for an increase in property tax based on the other covariates. . use http://www.stata-press.com/data/r13/school . biprobit private vote years logptax loginc Fitting comparison equation 1: Iteration Iteration Iteration Iteration 0: 1: 2: 3: log log log log likelihood likelihood likelihood likelihood = = = = -31.967097 -31.452424 -31.448958 -31.448958 = = = = -63.036914 -58.534843 -58.497292 -58.497288 Fitting comparison equation 2: Iteration Iteration Iteration Iteration 0: 1: 2: 3: Comparison: log log log log likelihood likelihood likelihood likelihood log likelihood = -89.946246 Fitting full model: Iteration Iteration Iteration Iteration 0: 1: 2: 3: log log log log likelihood likelihood likelihood likelihood = = = = -89.946246 -89.258897 -89.254028 -89.254028 Bivariate probit regression Number of obs Wald chi2(6) Prob > chi2 Log likelihood = -89.254028 Coef. private years logptax loginc _cons Std. Err. z P>|z| = = = 95 9.59 0.1431 [95% Conf. Interval] -.0118884 -.1066962 .3762037 -4.184694 .0256778 .6669782 .5306484 4.837817 -0.46 -0.16 0.71 -0.86 0.643 0.873 0.478 0.387 -.0622159 -1.413949 -.663848 -13.66664 .0384391 1.200557 1.416255 5.297253 years logptax loginc _cons -.0168561 -1.288707 .998286 -.5360573 .0147834 .5752266 .4403565 4.068509 -1.14 -2.24 2.27 -0.13 0.254 0.025 0.023 0.895 -.0458309 -2.416131 .1352031 -8.510188 .0121188 -.1612839 1.861369 7.438073 /athrho -.2764525 .2412099 -1.15 0.252 -.7492153 .1963102 rho -.2696186 .2236753 -.6346806 .1938267 vote Likelihood-ratio test of rho=0: chi2(1) = 1.38444 Prob > chi2 = 0.2393 The output shows several iteration logs. The first iteration log corresponds to running the univariate probit model for the first equation, and the second log corresponds to running the univariate probit for the second model. If ρ = 0, the sum of the log likelihoods from these two models will equal the log likelihood of the bivariate probit model; this sum is printed in the iteration log as the comparison log likelihood. 182 biprobit — Bivariate probit regression The final iteration log is for fitting the full bivariate probit model. A likelihood-ratio test of the log likelihood for this model and the comparison log likelihood is presented at the end of the output. If we had specified the vce(robust) option, this test would be presented as a Wald test instead of as a likelihood-ratio test. We could have fit the same model by using the seemingly unrelated syntax as . 
biprobit (private=years logptax loginc) (vote=years logptax loginc) Stored results biprobit stores the following in e(): Scalars e(N) e(k) e(k eq) e(k aux) e(k eq model) e(k dv) e(df m) e(ll) e(ll 0) e(ll c) e(N clust) e(chi2) e(chi2 c) e(p) e(rho) e(rank) e(rank0) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(clustvar) e(offset1) e(offset2) e(chi2type) e(chi2 ct) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) d(predict) e(asbalanced) e(asobserved) number of observations number of parameters number of equations in e(b) number of auxiliary parameters number of equations in overall model test number of dependent variables model degrees of freedom log likelihood log likelihood, constant-only model (noskip only) log likelihood, comparison model number of clusters χ2 χ2 for comparison test significance ρ rank of e(V) rank of e(V) for constant-only model number of iterations return code 1 if converged, 0 otherwise biprobit command as typed names of dependent variables weight type weight expression title in estimation output name of cluster variable offset for first equation offset for second equation Wald or LR; type of model χ2 test Wald or LR; type of model χ2 test corresponding to e(chi2 c) vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict factor variables fvset as asbalanced factor variables fvset as asobserved biprobit — Bivariate probit regression Matrices e(b) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance Functions e(sample) marks estimation sample 183 Methods and formulas The log likelihood, lnL, is given by ξjβ = xj β + offsetβj ξjγ = zj γ + offsetγj n 1 if y1j 6= 0 q1j = −1 otherwise n 1 if y2j 6= 0 q2j = −1 otherwise ρ∗j = q1j q2j ρ n X lnL = wj lnΦ2 q1j ξjβ , q2j ξjγ , ρ∗j j=1 0 where Φ2 () is the cumulative bivariate normal distribution function (with mean [ 0 0 ] ) and wj is an optional weight for observation j . This derivation assumes that ∗ y1j = xj β + 1j + offsetβj ∗ y2j = zj γ + 2j + offsetγj E(1 ) = E(2 ) = 0 Var(1 ) = Var(2 ) = 1 Cov(1 , 2 ) = ρ ∗ ∗ ∗ where y1j and y2j are the unobserved latent variables; instead, we observe only yij = 1 if yij >0 and yij = 0 otherwise (for i = 1, 2). In the maximum likelihood estimation, ρ is not directly estimated, but atanh ρ is 1+ρ 1 atanh ρ = ln 2 1−ρ From the form of the likelihood, if ρ = 0, then the log likelihood for the bivariate probit models is equal to the sum of the log likelihoods of the two univariate probit models. A likelihood-ratio test may therefore be performed by comparing the likelihood of the full bivariate model with the sum of the log likelihoods for the univariate probit models. 184 biprobit — Bivariate probit regression This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. biprobit also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. References De Luca, G. 2008. 
SNP and SML estimation of univariate and bivariate binary-choice models. Stata Journal 8: 190–220.
Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall.
Hardin, J. W. 1996. sg61: Bivariate probit models. Stata Technical Bulletin 33: 15–20. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 152–158. College Station, TX: Stata Press.
Heckman, J. 1979. Sample selection bias as a specification error. Econometrica 47: 153–161.
Lokshin, M., and Z. Sajaia. 2011. Impact of interventions on discrete outcomes: Maximum likelihood estimation of the binary choice models with binary endogenous regressors. Stata Journal 11: 368–385.
Pindyck, R. S., and D. L. Rubinfeld. 1998. Econometric Models and Economic Forecasts. 4th ed. New York: McGraw–Hill.
Poirier, D. J. 1980. Partial observability in bivariate probit models. Journal of Econometrics 12: 209–217.
Van de Ven, W. P. M. M., and B. M. S. van Praag. 1981. The demand for deductibles in private health insurance: A probit model with sample selection. Journal of Econometrics 17: 229–252.

Also see
[R] biprobit postestimation — Postestimation tools for biprobit
[R] mprobit — Multinomial probit regression
[R] probit — Probit regression
[SVY] svy estimation — Estimation commands for survey data
[U] 20 Estimation and postestimation commands

Title
biprobit postestimation — Postestimation tools for biprobit

Description     Syntax for predict     Menu for predict     Options for predict     Also see

Description

The following postestimation commands are available after biprobit:

Command             Description
contrast            contrasts and ANOVA-style joint tests of estimates
estat ic            Akaike's and Schwarz's Bayesian information criteria (AIC and BIC)
estat summarize     summary statistics for the estimation sample
estat vce           variance–covariance matrix of the estimators (VCE)
estat (svy)         postestimation statistics for survey data
estimates           cataloging estimation results
lincom              point estimates, standard errors, testing, and inference for linear combinations of coefficients
lrtest (1)          likelihood-ratio test
margins             marginal means, predictive margins, marginal effects, and average marginal effects
marginsplot         graph the results from margins (profile plots, interaction plots, etc.)
nlcom               point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients
predict             predictions, residuals, influence statistics, and other diagnostic measures
predictnl           point estimates, standard errors, testing, and inference for generalized predictions
pwcompare           pairwise comparisons of estimates
suest               seemingly unrelated estimation
test                Wald tests of simple and composite linear hypotheses
testnl              Wald tests of nonlinear hypotheses

(1) lrtest is not appropriate with svy estimation results.
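For instance, continuing the school data example from [R] biprobit, a typical postestimation session might look like the following. This is a hypothetical sketch: the new variable name p11 and the particular test are our own choices for illustration, and we assume the equations are labeled private and vote, the names biprobit assigns from the dependent variables.

. use http://www.stata-press.com/data/r13/school
. biprobit private vote years logptax loginc
. estat ic
. predict p11 if e(sample)
. test ([private]logptax = 0) ([vote]logptax = 0)

Here predict stores the default statistic, the joint probability Pr(private = 1, vote = 1), and test performs a joint Wald test that logptax has no effect in either equation.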
Syntax for predict

    predict [type] newvar [if] [in] [, statistic nooffset]

    predict [type] {stub* | newvar_eq1 newvar_eq2 newvar_athrho} [if] [in], scores

statistic    Description
Main
p11          Φ2(xj b, zj g, ρ), predicted probability Pr(y1j = 1, y2j = 1); the default
p10          Φ2(xj b, −zj g, −ρ), predicted probability Pr(y1j = 1, y2j = 0)
p01          Φ2(−xj b, zj g, −ρ), predicted probability Pr(y1j = 0, y2j = 1)
p00          Φ2(−xj b, −zj g, ρ), predicted probability Pr(y1j = 0, y2j = 0)
pmarg1       Φ(xj b), marginal success probability for equation 1
pmarg2       Φ(zj g), marginal success probability for equation 2
pcond1       Φ2(xj b, zj g, ρ)/Φ(zj g), conditional probability of success for equation 1
pcond2       Φ2(xj b, zj g, ρ)/Φ(xj b), conditional probability of success for equation 2
xb1          xj b, linear prediction for equation 1
xb2          zj g, linear prediction for equation 2
stdp1        standard error of the linear prediction for equation 1
stdp2        standard error of the linear prediction for equation 2

where Φ() is the standard normal-distribution function and Φ2() is the bivariate standard normal-distribution function.

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.

Menu for predict

Statistics > Postestimation > Predictions, residuals, etc.

Options for predict

Main

p11, the default, calculates the bivariate predicted probability Pr(y1j = 1, y2j = 1).
p10 calculates the bivariate predicted probability Pr(y1j = 1, y2j = 0).
p01 calculates the bivariate predicted probability Pr(y1j = 0, y2j = 1).
p00 calculates the bivariate predicted probability Pr(y1j = 0, y2j = 0).
pmarg1 calculates the univariate (marginal) predicted probability of success Pr(y1j = 1).
pmarg2 calculates the univariate (marginal) predicted probability of success Pr(y2j = 1).
pcond1 calculates the conditional (on success in equation 2) predicted probability of success Pr(y1j = 1, y2j = 1)/Pr(y2j = 1).
pcond2 calculates the conditional (on success in equation 1) predicted probability of success Pr(y1j = 1, y2j = 1)/Pr(y1j = 1).
xb1 calculates the probit linear prediction xj b.
xb2 calculates the probit linear prediction zj g.
stdp1 calculates the standard error of the linear prediction for equation 1.
stdp2 calculates the standard error of the linear prediction for equation 2.
nooffset is relevant only if you specified offset1(varname) or offset2(varname) for biprobit. It modifies the calculations made by predict so that they ignore the offset variables; the linear predictions are treated as xj b rather than as xj b + offset1j and zj g rather than as zj g + offset2j.
scores calculates equation-level score variables. The first new variable will contain ∂lnL/∂(xj β). The second new variable will contain ∂lnL/∂(zj γ). The third new variable will contain ∂lnL/∂(atanh ρ).

Also see
[R] biprobit — Bivariate probit regression
[U] 20 Estimation and postestimation commands

Title
bitest — Binomial probability test

Syntax     Menu     Description     Option     Remarks and examples     Stored results     Methods and formulas     References     Also see

Syntax

Binomial probability test
    bitest varname == #p [if] [in] [weight] [, detail]

Immediate form of binomial probability test
    bitesti #N #succ #p [, detail]

by is allowed with bitest; see [D] by.
fweights are allowed with bitest; see [U] 11.1.6 weight.
Menu

bitest
Statistics > Summaries, tables, and tests > Classical tests of hypotheses > Binomial probability test

bitesti
Statistics > Summaries, tables, and tests > Classical tests of hypotheses > Binomial probability test calculator

Description

bitest performs exact hypothesis tests for binomial random variables. The null hypothesis is that the probability of a success on a trial is #p. The total number of trials is the number of nonmissing values of varname (in bitest) or #N (in bitesti). The number of observed successes is the number of 1s in varname (in bitest) or #succ (in bitesti). varname must contain only 0s, 1s, and missing.

bitesti is the immediate form of bitest; see [U] 19 Immediate commands for a general introduction to immediate commands.

Option

Advanced

detail shows the probability of the observed number of successes, kobs; the probability of the number of successes on the opposite tail of the distribution that is used to compute the two-sided p-value, kopp; and the probability of the point next to kopp. This information can be safely ignored. See the technical note below for details.

Remarks and examples

Remarks are presented under the following headings:
    bitest
    bitesti

bitest

Example 1

We test 15 university students for high levels of one measure of visual quickness, which, from other evidence, we believe is present in 30% of the nonuniversity population. Included in our data is quick, taking on the values 1 ("success") or 0 ("failure") depending on the outcome of the test.

. use http://www.stata-press.com/data/r13/quick
. bitest quick == 0.3

    Variable |        N   Observed k   Expected k   Assumed p   Observed p
    ---------+------------------------------------------------------------
       quick |       15            7          4.5     0.30000      0.46667

    Pr(k >= 7)           = 0.131143   (one-sided test)
    Pr(k <= 7)           = 0.949987   (one-sided test)
    Pr(k <= 1 or k >= 7) = 0.166410   (two-sided test)

The first part of the output reveals that, assuming a true probability of success of 0.3, the expected number of successes is 4.5 and that we observed seven. Said differently, the assumed frequency under the null hypothesis H0 is 0.3, and the observed frequency is 0.47.

The first line under the table is a one-sided test; it is the probability of observing seven or more successes conditional on p = 0.3. It is a test of H0: p = 0.3 versus the alternative hypothesis HA: p > 0.3. Said in English, the alternative hypothesis is that more than 30% of university students score at high levels on this test of visual quickness. The p-value for this hypothesis test is 0.13.

The second line under the table is a one-sided test of H0 versus the opposite alternative hypothesis HA: p < 0.3.

The third line is the two-sided test. It is a test of H0 versus the alternative hypothesis HA: p ≠ 0.3.

Technical note

The p-value of a hypothesis test is the probability (calculated assuming H0 is true) of observing any outcome as extreme or more extreme than the observed outcome, with extreme meaning in the direction of the alternative hypothesis. In example 1, the outcomes k = 8, 9, ..., 15 are clearly "more extreme" than the observed outcome kobs = 7 when considering the alternative hypothesis HA: p ≠ 0.3. However, outcomes with only a few successes are also in the direction of this alternative hypothesis. For two-sided hypotheses, outcomes with k successes are considered "as extreme or more extreme" than the observed outcome kobs if Pr(k) ≤ Pr(kobs). Here Pr(k = 0) and Pr(k = 1) are both less than Pr(k = 7), so they are included in the two-sided p-value.
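Because the test is exact, the probabilities reported by bitest can also be obtained directly from Stata's binomial distribution functions. The following sketch simply verifies the correspondence for example 1; bitest does not require you to do this.

. display binomialtail(15, 7, 0.3)
. display binomial(15, 7, 0.3)
. display binomial(15, 1, 0.3) + binomialtail(15, 7, 0.3)

binomialtail(n,k,p) returns the probability of k or more successes and binomial(n,k,p) returns the probability of k or fewer successes, so these three lines reproduce, up to rounding, the one-sided p-values 0.131143 and 0.949987 and the two-sided p-value 0.166410 shown above.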
The detail option allows you to see the probability (assuming that H0 is true) of the observed successes (k = 7) and the probability of the boundary point (k = 1) of the opposite tail used for the two-sided p-value. 190 bitest — Binomial probability test . bitest quick == 0.3, detail Variable N quick 15 Observed k Expected k 7 Observed p 0.30000 0.46667 4.5 Pr(k >= 7) = 0.131143 Pr(k <= 7) = 0.949987 Pr(k <= 1 or k >= 7) = 0.166410 (one-sided test) (one-sided test) (two-sided test) Pr(k == 7) Pr(k == 2) Pr(k == 1) (observed) = 0.081130 = 0.091560 = 0.030520 Assumed p (opposite extreme) Also shown is the probability of the point next to the boundary point. This probability, namely, Pr(k = 2) = 0.092, is certainly close to the probability of the observed outcome Pr(k = 7) = 0.081, so some people might argue that k = 2 should be included in the two-sided p-value. Statisticians (at least some we know) would reply that the p-value is a precisely defined concept and that this is an arbitrary “fuzzification” of its definition. When you compute exact p-values according to the precise definition of a p-value, your type I error is never more than what you say it is — so no one can criticize you for being anticonservative. Including the point k = 2 is being overly conservative because it makes the p-value larger yet. But it is your choice; being overly conservative, at least in statistics, is always safe. Know that bitest and bitesti always keep to the precise definition of a p-value, so if you wish to include this extra point, you must do so by hand or by using the r() stored results; see Stored results below. bitesti Example 2 The binomial test is a function of two statistics and one parameter: N , the number of observations; kobs , the number of observed successes; and p, the assumed probability of a success on a trial. For instance, in a city of N = 2,500,000, we observe kobs = 36 cases of a particular disease when the population rate for the disease is p = 0.00001. . bitesti 2500000 36 .00001 N 2500000 Observed k Expected k 36 25 Pr(k >= 36) = 0.022458 Pr(k <= 36) = 0.985448 Pr(k <= 14 or k >= 36) = 0.034859 Assumed p Observed p 0.00001 0.00001 (one-sided test) (one-sided test) (two-sided test) Example 3 Boice and Monson (1977) present data on breast cancer cases and person-years of observations for women with tuberculosis who were repeatedly exposed to multiple x-ray fluoroscopies and for women with tuberculosis who were not. The data are Breast cancer Person-years Exposed 41 28,010 Not exposed 15 19,017 Total 56 47,027 bitest — Binomial probability test 191 We can thus test whether x-ray fluoroscopic examinations are associated with breast cancer; the assumed rate of exposure is p = 28010/47027. . 
bitesti 56 41 28010/47027 N Observed k Expected k 56 41 33.35446 Pr(k >= 41) = 0.023830 Pr(k <= 41) = 0.988373 Pr(k <= 25 or k >= 41) = 0.040852 Assumed p Observed p 0.59562 0.73214 (one-sided test) (one-sided test) (two-sided test) Stored results bitest and bitesti store the following in r(): Scalars r(N) number N of trials r(P p) assumed probability p of success r(k) observed number k of successes opposite extreme k probability of observed k (detail only) probability of opposite extreme k (detail only) r(k nopp) k next to opposite extreme (detail only) r(P noppk) probability of k next to opposite extreme (detail only) r(k opp) r(P k) r(P oppk) r(p l) lower one-sided p-value r(p u) upper one-sided p-value r(p) two-sided p-value Methods and formulas Let N , kobs , and p be, respectively, the number of observations, the observed number of successes, and the assumed probability of success on a trial. The expected number of successes is N p, and the observed probability of success on a trial is kobs /N . bitest and bitesti compute exact p-values based on the binomial distribution. The upper one-sided p-value is N X N Pr(k ≥ kobs ) = pm (1 − p)N −m m m=kobs The lower one-sided p-value is k obs X N Pr(k ≤ kobs ) = pm (1 − p)N −m m m=0 If kobs ≥ N p, the two-sided p-value is Pr(k ≤ kopp or k ≥ kobs ) where kopp is the largest number ≤ N p such that Pr(k = kopp ) ≤ Pr(k = kobs ). If kobs < N p, the two-sided p-value is Pr(k ≤ kobs or k ≥ kopp ) where kopp is the smallest number ≥ N p such that Pr(k = kopp ) ≤ Pr(k = kobs ). 192 bitest — Binomial probability test References Boice, J. D., Jr., and R. R. Monson. 1977. Breast cancer in women after repeated fluoroscopic examinations of the chest. Journal of the National Cancer Institute 59: 823–832. Hoel, P. G. 1984. Introduction to Mathematical Statistics. 5th ed. New York: Wiley. Also see [R] ci — Confidence intervals for means, proportions, and counts [R] prtest — Tests of proportions Title bootstrap — Bootstrap sampling and estimation Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax bootstrap exp list , options eform option : command options Description Main reps(#) perform # bootstrap replications; default is reps(50) Options strata(varlist) size(#) cluster(varlist) idcluster(newvar) saving( filename, . . .) 
bca ties mse variables identifying strata draw samples of size #; default is N variables identifying resampling clusters create new cluster ID variable save results to filename; save statistics in double precision; save results to filename every # replications compute acceleration for BCa confidence intervals adjust BC/BCa confidence intervals for ties use MSE formula for variance estimation Reporting level(#) notable noheader nolegend verbose nodots noisily trace title(text) display options eform option set confidence level; default is level(95) suppress table of results suppress table header suppress table legend display the full table legend suppress replication dots display any output from command trace command use text as title for bootstrap results control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling display coefficient table in exponentiated form Advanced nodrop nowarn force do not drop observations do not warn when e(sample) is not set do not check for weights or svy commands; seldom used 193 194 bootstrap — Bootstrap sampling and estimation identify invalid results set random-number seed to # reject(exp) seed(#) group(varname) ID variable for groups within cluster() jackknifeopts(jkopts) options for jackknife; see [R] jackknife coeflegend display legend instead of statistics weights are not allowed in command. group(), jackknifeopts(), and coeflegend do not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. exp list contains elist contains eexp is specname is eqno is (name: elist) elist eexp newvar = (exp) (exp) specname [eqno]specname b b[] se se[] ## name exp is a standard Stata expression; see [U] 13 Functions and expressions. , which indicate optional arguments. Distinguish between [ ], which are to be typed, and Menu Statistics > Resampling > Bootstrap estimation Description bootstrap performs bootstrap estimation. Typing . bootstrap exp list, reps(#): command executes command multiple times, bootstrapping the statistics in exp list by resampling observations (with replacement) from the data in memory # times. This method is commonly referred to as the nonparametric bootstrap. command defines the statistical command to be executed. Most Stata commands and user-written programs can be used with bootstrap, as long as they follow standard Stata syntax; see [U] 11 Language syntax. If the bca option is supplied, command must also work with jackknife; see [R] jackknife. The by prefix may not be part of command. exp list specifies the statistics to be collected from the execution of command. If command changes the contents in e(b), exp list is optional and defaults to b. bootstrap — Bootstrap sampling and estimation 195 Because bootstrapping is a random process, if you want to be able to reproduce results, set the random-number seed by specifying the seed(#) option or by typing . set seed # where # is a seed of your choosing, before running bootstrap; see [R] set seed. Many estimation commands allow the vce(bootstrap) option. For those commands, we recommend using vce(bootstrap) over bootstrap because the estimation command already handles clustering and other model-specific details for you. The bootstrap prefix command is intended for use with nonestimation commands, such as summarize, user-written commands, or functions of coefficients. bs and bstrap are synonyms for bootstrap. 
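For instance, a minimal run that bootstraps a statistic returned by a nonestimation command might look like this (a sketch only; the statistic name median is our own choice, and any other returned result could be collected the same way):

. use http://www.stata-press.com/data/r13/auto
. set seed 12345
. bootstrap median=r(p50), reps(1000): summarize mpg, detail

summarize, detail leaves the sample median in r(p50); bootstrap then resamples the data 1,000 times, rerunning summarize on each resample and collecting r(p50) under the name median.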
Options Main reps(#) specifies the number of bootstrap replications to be performed. The default is 50. A total of 50 – 200 replications are generally adequate for estimates of standard error and thus are adequate for normal-approximation confidence intervals; see Mooney and Duval (1993, 11). Estimates of confidence intervals using the percentile or bias-corrected methods typically require 1,000 or more replications. Options strata(varlist) specifies the variables that identify strata. If this option is specified, bootstrap samples are taken independently within each stratum. size(#) specifies the size of the samples to be drawn. The default is N, meaning to draw samples of the same size as the data. If specified, # must be less than or equal to the number of observations within strata(). If cluster() is specified, the default size is the number of clusters in the original dataset. For unbalanced clusters, resulting sample sizes will differ from replication to replication. For cluster sampling, # must be less than or equal to the number of clusters within strata(). cluster(varlist) specifies the variables that identify resampling clusters. If this option is specified, the sample drawn during each replication is a bootstrap sample of clusters. idcluster(newvar) creates a new variable containing a unique identifier for each resampled cluster. This option requires that cluster() also be specified. saving( filename , suboptions ) creates a Stata data file (.dta file) consisting of (for each statistic in exp list) a variable containing the replicates. double specifies that the results for each replication be saved as doubles, meaning 8-byte reals. By default, they are saved as floats, meaning 4-byte reals. This option may be used without the saving() option to compute the variance estimates by using double precision. every(#) specifies that results be written to disk every #th replication. every() should be specified only in conjunction with saving() when command takes a long time for each replication. This option will allow recovery of partial results should some other software crash your computer. See [P] postfile. replace specifies that filename be overwritten if it exists. This option does not appear in the dialog box. 196 bootstrap — Bootstrap sampling and estimation bca specifies that bootstrap estimate the acceleration of each statistic in exp list. This estimate is used to construct BCa confidence intervals. Type estat bootstrap, bca to display the BCa confidence interval generated by the bootstrap command. ties specifies that bootstrap adjust for ties in the replicate values when computing the median bias used to construct BC and BCa confidence intervals. mse specifies that bootstrap compute the variance by using deviations of the replicates from the observed value of the statistics based on the entire dataset. By default, bootstrap computes the variance by using deviations from the average of the replicates. Reporting level(#); see [R] estimation options. notable suppresses the display of the table of results. noheader suppresses the display of the table header. This option implies nolegend. This option may also be specified when replaying estimation results. nolegend suppresses the display of the table legend. This option may also be specified when replaying estimation results. verbose specifies that the full table legend be displayed. By default, coefficients and standard errors are not displayed. This option may also be specified when replaying estimation results. 
nodots suppresses display of the replication dots. By default, one dot character is displayed for each successful replication. A red ‘x’ is displayed if command returns an error or if one of the values in exp list is missing. noisily specifies that any output from command be displayed. This option implies the nodots option. trace causes a trace of the execution of command to be displayed. This option implies the noisily option. title(text) specifies a title to be displayed above the table of bootstrap results. The default title is the title stored in e(title) by an estimation command, or if e(title) is not filled in, Bootstrap results is used. title() may also be specified when replaying estimation results. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. eform option causes the coefficient table to be displayed in exponentiated form; see [R] eform option. command determines which of the following are allowed (eform(string) and eform are always allowed): eform option Description eform(string) eform hr shr irr or rrr use string for the column title exponentiated coefficient, string is exp(b) hazard ratio, string is Haz. Ratio subhazard ratio, string is SHR incidence-rate ratio, string is IRR odds ratio, string is Odds Ratio relative-risk ratio, string is RRR bootstrap — Bootstrap sampling and estimation 197 Advanced nodrop prevents observations outside e(sample) and the if and in qualifiers from being dropped before the data are resampled. nowarn suppresses the display of a warning message when command does not set e(sample). force suppresses the restriction that command not specify weights or be a svy command. This is a rarely used option. Use it only if you know what you are doing. reject(exp) identifies an expression that indicates when results should be rejected. When exp is true, the resulting values are reset to missing values. seed(#) sets the random-number seed. Specifying this option is equivalent to typing the following command prior to calling bootstrap: . set seed # The following options are available with bootstrap but are not shown in the dialog box: group(varname) re-creates varname containing a unique identifier for each group across the resampled clusters. This option requires that idcluster() also be specified. This option is useful for maintaining unique group identifiers when sampling clusters with replacement. Suppose that cluster 1 contains 3 groups. If the idcluster(newclid) option is specified and cluster 1 is sampled multiple times, newclid uniquely identifies each copy of cluster 1. If group(newgroupid) is also specified, newgroupid uniquely identifies each copy of each group. jackknifeopts(jkopts) identifies options that are to be passed to jackknife when it computes the acceleration values for the BCa confidence intervals; see [R] jackknife. This option requires the bca option and is mostly used for passing the eclass, rclass, or n(#) option to jackknife. coeflegend; see [R] estimation options. 
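To give a flavor of how these options combine, the following is a hypothetical sketch (the model, the file name myreps, and the number of replications are arbitrary choices for illustration) that sets the seed, suppresses the replication dots, saves the replicates, and discards any replication in which the maximum likelihood estimator failed to converge:

. bootstrap _b, reps(500) seed(1) nodots saving(myreps, replace) reject(e(converged)==0): logit foreign price mpg

Because logit stores e(converged) after each fit, reject(e(converged)==0) sets the results from nonconverged replications to missing rather than letting them enter the bootstrap distribution.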
Remarks and examples Remarks are presented under the following headings: Introduction Regression coefficients Expressions Combining bootstrap datasets A note about macros Achieved significance level Bootstrapping a ratio Warning messages and e(sample) Bootstrapping statistics from data with a complex structure Introduction With few assumptions, bootstrapping provides a way of estimating standard errors and other measures of statistical precision (Efron 1979; Efron and Stein 1981; Efron 1982; Efron and Tibshirani 1986; Efron and Tibshirani 1993; also see Davison and Hinkley [1997]; Guan [2003]; Mooney and Duval [1993]; Poi [2004]; and Stine [1990]). It provides a way to obtain such measures when no formula is otherwise available or when available formulas make inappropriate assumptions. Cameron and Trivedi (2010, chap. 13) discuss many bootstrapping topics and demonstrate how to do them in Stata. 198 bootstrap — Bootstrap sampling and estimation To illustrate bootstrapping, suppose that you have a dataset containing N observations and an estimator that, when applied to the data, produces certain statistics. You draw, with replacement, N observations from the N -observation dataset. In this random drawing, some of the original observations will appear once, some more than once, and some not at all. Using the resampled dataset, you apply the estimator and collect the statistics. This process is repeated many times; each time, a new random sample is drawn and the statistics are recalculated. This process builds a dataset of replicated statistics. From these data, you can calculate the standard error by using the standard formula for the sample standard deviation se b = 1 X b (θi − θ)2 k−1 1/2 where θbi is the statistic calculated using the ith bootstrap sample and k is the number of replications. This formula gives an estimate of the standard error of the statistic, according to Hall and Wilson (1991). Although the average, θ, of the bootstrapped estimates is used in calculating the standard deviation, it is not used as the estimated value of the statistic itself. Instead, the original observed value of the statistic, θb, is used, meaning the value of the statistic computed using the original N observations. You might think that θ is a better estimate of the parameter than θb, but it is not. If the statistic is biased, bootstrapping exaggerates the bias. In fact, the bias can be estimated as θ − θb (Efron 1982, 33). Knowing this, you might be tempted to subtract this estimate of bias from θb to produce an unbiased statistic. The bootstrap bias estimate has an indeterminate amount of random error, so this unbiased estimator may have greater mean squared error than the biased estimator (Mooney and Duval 1993; Hinkley 1978). Thus θb is the best point estimate of the statistic. The logic behind the bootstrap is that all measures of precision come from a statistic’s sampling distribution. When the statistic is estimated on a sample of size N from some population, the sampling distribution tells you the relative frequencies of the values of the statistic. The sampling distribution, in turn, is determined by the distribution of the population and the formula used to estimate the statistic. Sometimes the sampling distribution can be derived analytically. For instance, if the underlying population is distributed normally and you calculate means, the sampling distribution for the mean is also normal but has a smaller variance than that of the population. 
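The resampling recipe just described is easy to carry out by hand, which can help make clear what bootstrap automates. The following sketch is illustrative only (the program name bsmean is our own, and bootstrap itself remains the recommended tool): it draws bootstrap samples with bsample, collects the mean of mpg from each, and then summarizes the replicates, whose standard deviation is the bootstrap standard-error estimate described above.

program bsmean, rclass
        version 13
        preserve                        // save the original data; restored below
        bsample                         // draw N observations with replacement
        summarize mpg, meanonly
        return scalar mean = r(mean)    // the statistic for this replicate
        restore
end

. use http://www.stata-press.com/data/r13/auto
. simulate mean=r(mean), reps(200) seed(1): bsmean
. summarize mean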
In other cases, deriving the sampling distribution is difficult, as when means are calculated from nonnormal populations. Sometimes, as in the case of means, it is not too difficult to derive the sampling distribution as the sample size goes to infinity (N → ∞). However, such asymptotic distributions may not perform well when applied to finite samples. If you knew the population distribution, you could obtain the sampling distribution by simulation: you could draw random samples of size N , calculate the statistic, and make a tally. Bootstrapping does precisely this, but it uses the observed distribution of the sample in place of the true population distribution. Thus the bootstrap procedure hinges on the assumption that the observed distribution is a good estimate of the underlying population distribution. In return, the bootstrap produces an estimate, called the bootstrap distribution, of the sampling distribution. From this, you can estimate the standard error of the statistic, produce confidence intervals, etc. The accuracy with which the bootstrap distribution estimates the sampling distribution depends on the number of observations in the original sample and the number of replications in the bootstrap. A crudely estimated sampling distribution is adequate if you are only going to extract, say, a standard error. A better estimate is needed if you want to use the 2.5th and 97.5th percentiles of the distribution to produce a 95% confidence interval. To extract many features simultaneously about the distribution, bootstrap — Bootstrap sampling and estimation 199 an even better estimate is needed. Generally, replications on the order of 1,000 produce very good estimates, but only 50 – 200 replications are needed for estimates of standard errors. See Poi (2004) for a method to choose the number of bootstrap replications. Regression coefficients Example 1 Let’s say that we wish to compute bootstrap estimates for the standard errors of the coefficients from the following regression: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress mpg weight gear foreign Source SS df MS Model Residual 1629.67805 813.781411 3 70 543.226016 11.6254487 Total 2443.45946 73 33.4720474 mpg Coef. weight gear_ratio foreign _cons -.006139 1.457113 -2.221682 36.10135 Std. Err. .0007949 1.541286 1.234961 6.285984 t -7.72 0.95 -1.80 5.74 Number of obs F( 3, 70) Prob > F R-squared Adj R-squared Root MSE P>|t| = = = = = = 74 46.73 0.0000 0.6670 0.6527 3.4096 [95% Conf. Interval] 0.000 0.348 0.076 0.000 -.0077245 -1.616884 -4.684735 23.56435 -.0045536 4.53111 .2413715 48.63835 To run the bootstrap, we simply prefix the above regression command with the bootstrap command (specifying its options before the colon separator). We must set the random-number seed before calling bootstrap. . bootstrap, reps(100) seed(1): regress mpg weight gear foreign (running regress on estimation sample) Bootstrap replications (100) 1 2 3 4 5 .................................................. .................................................. Linear regression 50 100 Number of obs Replications Wald chi2(3) Prob > chi2 R-squared Adj R-squared Root MSE mpg Observed Coef. Bootstrap Std. Err. weight gear_ratio foreign _cons -.006139 1.457113 -2.221682 36.10135 .0006498 1.297786 1.162728 4.71779 z -9.45 1.12 -1.91 7.65 P>|z| 0.000 0.262 0.056 0.000 = = = = = = = 74 100 111.96 0.0000 0.6670 0.6527 3.4096 Normal-based [95% Conf. 
Interval] -.0074127 -1.086501 -4.500587 26.85465 -.0048654 4.000727 .0572236 45.34805 200 bootstrap — Bootstrap sampling and estimation The displayed confidence interval is based on the assumption that the sampling (and hence bootstrap) distribution is approximately normal (see Methods and formulas below). Because this confidence interval is based on the standard error, it is a reasonable estimate if normality is approximately true, even for a few replications. Other types of confidence intervals are available after bootstrap; see [R] bootstrap postestimation. We could instead supply names to our expressions when we run bootstrap. For example, . bootstrap diff=(_b[weight]-_b[gear]): regress mpg weight gear foreign would bootstrap a statistic, named diff, equal to the difference between the coefficients on weight and gear ratio. Expressions Example 2 When we use bootstrap, the list of statistics can contain complex expressions, as long as each expression is enclosed in parentheses. For example, to bootstrap the range of a variable x, we could type . bootstrap range=(r(max)-r(min)), reps(1000): summarize x Of course, we could also bootstrap the minimum and maximum and later compute the range. . bootstrap max=r(max) min=r(min), reps(1000) saving(mybs): summarize x . use mybs, clear (bootstrap: summarize) . generate range = max - min . bstat range, stat(19.5637501) The difference between the maximum and minimum of x in the sample is 19.5637501. The stat() option to bstat specifies the observed value of the statistic (range) to be summarized. This option is useful when, as shown above, the statistic of ultimate interest is not specified directly to bootstrap but instead is calculated by other means. Here the observed values of r(max) and r(min) are stored as characteristics of the dataset created by bootstrap and are thus available for retrieval by bstat; see [R] bstat. The observed range, however, is unknown to bstat, so it must be specified. Combining bootstrap datasets You can combine two datasets from separate runs of bootstrap by using append (see [D] append) and then get the bootstrap statistics for the combined datasets by running bstat. The runs must have been performed independently (having different starting random-number seeds), and the original dataset, command, and bootstrap statistics must have been all the same. bootstrap — Bootstrap sampling and estimation 201 A note about macros In example 2, we executed the command . bootstrap max=r(max) min=r(min), reps(1000) saving(mybs): summarize x We did not enclose r(max) and r(min) in single quotes, as we would in most other contexts, because it would not produce what was intended: . bootstrap ‘r(max)’ ‘r(min)’, reps(1000) saving(mybs): summarize x To understand why, note that ‘r(max)’, like any reference to a local macro, will evaluate to a literal string containing the contents of r(max) before bootstrap is even executed. Typing the command above would appear to Stata as if we had typed . bootstrap 14.5441234 33.4393293, reps(1000) saving(mybs): summarize x Even worse, the current contents of r(min) and r(max) could be empty, producing an even more confusing result. To avoid this outcome, refer to statistics by name (for example, r(max)) and not by value (for example, ‘r(max)’). Achieved significance level Example 3 Suppose that we wish to estimate the achieved significance level (ASL) of a test statistic by using the bootstrap. ASL is another name for p-value. 
An example is b 0 ASL = Pr θb∗ ≥ θ|H for an upper-tailed, alternative hypothesis, where H0 denotes the null hypothesis, θb is the observed value of the test statistic, and θb∗ is the random variable corresponding to the test statistic, assuming that H0 is true. Here we will compare the mean miles per gallon (mpg) between foreign and domestic cars by using the two-sample t test with unequal variances. The following results indicate the p-value to be 0.0034 for the two-sided test using Satterthwaite’s approximation. Thus assuming that mean mpg is the same for foreign and domestic cars, we would expect to observe a t statistic more extreme (in absolute value) than 3.1797 in about 0.3% of all possible samples of the type that we observed. Thus we have evidence to reject the null hypothesis that the means are equal. 202 bootstrap — Bootstrap sampling and estimation . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . ttest mpg, by(foreign) unequal Two-sample t test with unequal variances Group Obs Mean Domestic Foreign 52 22 combined 74 diff Std. Err. Std. Dev. [95% Conf. Interval] 19.82692 24.77273 .657777 1.40951 4.743297 6.611187 18.50638 21.84149 21.14747 27.70396 21.2973 .6725511 5.785503 19.9569 22.63769 -4.945804 1.555438 -8.120053 -1.771556 diff = mean(Domestic) - mean(Foreign) t = -3.1797 Ho: diff = 0 Satterthwaite’s degrees of freedom = 30.5463 Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 Pr(T < t) = 0.0017 Pr(|T| > |t|) = 0.0034 Pr(T > t) = 0.9983 We also place the value of the test statistic in a scalar for later use. . scalar tobs = r(t) Efron and Tibshirani (1993, 224) describe an alternative to Satterthwaite’s approximation that estimates the ASL by bootstrapping the statistic from the test of equal means. Their idea is to recenter the two samples to the combined sample mean so that the data now conform to the null hypothesis but that the variances within the samples remain unchanged. . summarize mpg, meanonly . scalar omean = r(mean) . summarize mpg if foreign==0, meanonly . replace mpg = mpg - r(mean) + scalar(omean) if foreign==0 mpg was int now float (52 real changes made) . summarize mpg if foreign==1, meanonly . replace mpg = mpg - r(mean) + scalar(omean) if foreign==1 (22 real changes made) . sort foreign . by foreign: summarize mpg -> foreign = Domestic Obs Variable mpg Mean 52 21.2973 -> foreign = Foreign Variable Obs Mean mpg 22 21.2973 Std. Dev. 4.743297 Std. Dev. 6.611187 Min Max 13.47037 35.47038 Min Max 10.52457 37.52457 Each sample (foreign and domestic) is a stratum, so the bootstrapped samples must have the same number of foreign and domestic cars as the original dataset. This requirement is facilitated by the strata() option to bootstrap. By typing the following, we bootstrap the test statistic using the modified dataset and save the values in bsauto2.dta: bootstrap — Bootstrap sampling and estimation 203 . keep mpg foreign . set seed 1 . bootstrap t=r(t), rep(1000) strata(foreign) saving(bsauto2) nodots: ttest mpg, > by(foreign) unequal Warning: Because ttest is not an estimation command or does not set e(sample), bootstrap has no way to determine which observations are used in calculating the statistics and so assumes that all observations are used. This means that no observations will be excluded from the resampling because of missing values or other reasons. If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded. Be sure that the dataset in memory contains only the relevant data. 
Bootstrap results Number of strata command: t: t = 2 Number of obs Replications = = 74 1000 ttest mpg, by(foreign) unequal r(t) Observed Coef. Bootstrap Std. Err. z P>|z| 1.75e-07 1.036437 0.00 1.000 Normal-based [95% Conf. Interval] -2.031379 2.031379 We can use the data in bsauto2.dta to estimate ASL via the fraction of bootstrap test statistics that are more extreme than 3.1797. . use bsauto2, clear (bootstrap: ttest) . generate indicator = abs(t)>=abs(scalar(tobs)) . summarize indicator, meanonly . display "ASLboot = " r(mean) ASLboot = .005 The result is ASLboot = 0.005. Assuming that the mean mpg is the same between foreign and domestic cars, we would expect to observe a t statistic more extreme (in absolute value) than 3.1797 in about 0.5% of all possible samples of the type we observed. This finding is still strong evidence to reject the hypothesis that the means are equal. Bootstrapping a ratio Example 4 Suppose that we wish to produce a bootstrap estimate of the ratio of two means. Because summarize stores results for only one variable, we must call summarize twice to compute the means. Actually, we could use collapse to compute the means in one call, but calling summarize twice is much faster. Thus we will have to write a small program that will return the results we want. 204 bootstrap — Bootstrap sampling and estimation We write the program below and save it to a file called ratio.ado (see [U] 17 Ado-files). Our program takes two variable names as input and saves them in the local macros y (first variable) and x (second variable). It then computes one statistic: the mean of ‘y’ divided by the mean of ‘x’. This value is returned as a scalar in r(ratio). ratio also returns the ratio of the number of observations used to the mean for each variable. program myratio, rclass version 13 args y x confirm var ‘y’ confirm var ‘x’ tempname ymean yn summarize ‘y’, meanonly scalar ‘ymean’ = r(mean) return scalar n_‘y’ = r(N) summarize ‘x’, meanonly return scalar n_‘x’ = r(N) return scalar ratio = ‘ymean’/r(mean) end Remember to test any newly written commands before using them with bootstrap. . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . summarize price Variable Obs Mean Std. Dev. price 74 . scalar mean1=r(mean) . summarize weight Variable Obs 6165.257 weight 74 . scalar mean2=r(mean) 3019.459 Mean Min Max 2949.496 3291 15906 Std. Dev. Min Max 777.1936 1760 4840 . di scalar(mean1)/scalar(mean2) 2.0418412 . myratio price weight . return list scalars: r(ratio) = 2.041841210168278 r(n_weight) = 74 r(n_price) = 74 bootstrap — Bootstrap sampling and estimation 205 The results of running bootstrap on our program are . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . set seed 1 . bootstrap ratio=r(ratio), reps(1000) nowarn nodots: myratio price weight Bootstrap results Number of obs = 74 Replications = 1000 command: myratio price weight ratio: r(ratio) ratio Observed Coef. Bootstrap Std. Err. 2.041841 .0942932 z 21.65 P>|z| 0.000 Normal-based [95% Conf. Interval] 1.85703 2.226652 As mentioned previously, we should specify the saving() option if we wish to save the bootstrap dataset. Warning messages and e(sample) bootstrap is not meant to be used with weighted calculations. bootstrap determines the presence of weights by parsing the prefixed command with standard syntax. 
However, commands like stcox and streg require that weights be specified in stset, and some user commands may allow weights to be specified by using an option instead of the standard syntax. Both cases pose a problem for bootstrap because it cannot determine the presence of weights under these circumstances. In these cases, we can only assume that you know what you are doing. bootstrap does not know which variables of the dataset in memory matter to the calculation at hand. You can speed their execution by dropping unnecessary variables because, otherwise, they are included in each bootstrap sample. You should thus drop observations with missing values. Leaving in missing values causes no problem in one sense because all Stata commands deal with missing values gracefully. It does, however, cause a statistical problem. Bootstrap sampling is defined as drawing, with replacement, samples of size N from a set of N observations. bootstrap determines N by counting the number of observations in memory, not counting the number of nonmissing values on the relevant variables. The result is that too many observations are resampled; the resulting bootstrap samples, because they are drawn from a population with missing values, are of unequal sizes. If the number of missing values relative to the sample size is small, this will make little difference. If you have many missing values, however, you should first drop the observations that contain them. Example 5 To illustrate, we use the previous example but replace some of the values of price with missing values. The number of values of price used to compute the mean for each bootstrap is not constant. This is the purpose of the Warning message. 206 bootstrap — Bootstrap sampling and estimation . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . replace price = . if inlist(_n,1,3,5,7) (4 real changes made, 4 to missing) . set seed 1 . bootstrap ratio=r(ratio) np=r(n_price) nw=r(n_weight), reps(100) nodots: > myratio price weight Warning: Because myratio is not an estimation command or does not set e(sample), bootstrap has no way to determine which observations are used in calculating the statistics and so assumes that all observations are used. This means that no observations will be excluded from the resampling because of missing values or other reasons. If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded. Be sure that the dataset in memory contains only the relevant data. Bootstrap results Number of obs = 74 Replications = 100 command: myratio price weight ratio: r(ratio) np: r(n_price) nw: r(n_weight) ratio np nw Observed Coef. Bootstrap Std. Err. 2.063051 70 74 .0893669 1.872178 . z 23.09 37.39 . P>|z| Normal-based [95% Conf. Interval] 0.000 0.000 . 1.887896 66.3306 . 2.238207 73.6694 . Bootstrapping statistics from data with a complex structure Here we describe how to bootstrap statistics from data with a complex structure, for example, longitudinal or panel data, or matched data. bootstrap, however, is not designed to work with complex survey data. It is important to include all necessary information about the structure of the data in the bootstrap syntax to obtain correct bootstrap estimates for standard errors and confidence intervals. bootstrap offers several options identifying the specifics of the data. These options are strata(), cluster(), idcluster(), and group(). The usage of strata() was described in example 3 above. 
Below we demonstrate several examples that require specifying the other three options. Example 6 Suppose that the auto data in example 1 above are clustered by rep78. We want to obtain bootstrap estimates for the standard errors of the difference between the coefficients on weight and gear ratio, taking into account clustering. We supply the cluster(rep78) option to bootstrap to request resampling from clusters rather than from observations in the dataset. bootstrap — Bootstrap sampling and estimation 207 . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . keep if rep78<. (5 observations deleted) . bootstrap diff=(_b[weight]-_b[gear]), seed(1) cluster(rep78): regress mpg > weight gear foreign (running regress on estimation sample) Bootstrap replications (50) 1 2 3 4 5 .................................................. 50 Linear regression command: diff: diff Number of obs Replications = = 69 50 regress mpg weight gear foreign _b[weight]-_b[gear] (Replications based on 5 clusters in rep78) Observed Coef. Bootstrap Std. Err. -1.910396 1.876778 z -1.02 P>|z| 0.309 Normal-based [95% Conf. Interval] -5.588812 1.768021 We drop missing values in rep78 before issuing the command because bootstrap does not allow missing values in cluster(). See the section above about using bootstrap when variables contain missing values. We can also obtain these same results by using the following syntax: . bootstrap diff=(_b[weight]-_b[gear]), seed(1): regress mpg weight gear foreign, > vce(cluster rep78) When only clustered information is provided to the command, bootstrap can pick up the vce(cluster clustvar) option from the main command and use it to resample from clusters. Example 7 Suppose now that we have matched data and want to use bootstrap to obtain estimates of the standard errors of the exponentiated difference between two coefficients (or, equivalently, the ratio of two odds ratios) estimated by clogit. Consider the example of matched case–control data on birthweight of infants described in example 2 of [R] clogit. The infants are paired by being matched on mother’s age. All groups, defined by the pairid variable, have 1:1 matching. clogit requires that the matching information, pairid, be supplied to the group() (or, equivalently, strata()) option to be used in computing the parameter estimates. Because the data are matched, we need to resample from groups rather than from the whole dataset. However, simply supplying the grouping variable pairid in cluster() is not enough with bootstrap, as it is with clustered data. 208 bootstrap — Bootstrap sampling and estimation . use http://www.stata-press.com/data/r13/lowbirth2, clear (Applied Logistic Regression, Hosmer & Lemeshow) . bootstrap ratio=exp(_b[smoke]-_b[ptd]), seed(1) cluster(pairid): clogit low > lwt smoke ptd ht ui i.race, group(pairid) (running clogit on estimation sample) Bootstrap replications (50) 1 2 3 4 5 .................................................. 50 Bootstrap results Number of obs = 112 Replications = 50 command: ratio: ratio clogit low lwt smoke ptd ht ui i.race, group(pairid) exp(_b[smoke]-_b[ptd]) (Replications based on 56 clusters in pairid) Observed Coef. Bootstrap Std. Err. z P>|z| .6654095 17.71791 0.04 0.970 Normal-based [95% Conf. Interval] -34.06106 35.39187 For the syntax above, imagine that the first pair was sampled twice during a replication. Then the bootstrap sample has four subjects with pairid equal to one, which clearly violates the original 1:1 matching design. 
As a result, the estimates of the coefficients obtained from this bootstrap sample will be incorrect. Therefore, in addition to resampling from groups, we need to ensure that resampled groups are uniquely identified in each of the bootstrap samples. The idcluster(newcluster) option is designed for this. It requests that at each replication bootstrap create the new variable, newcluster, containing unique identifiers for all resampled groups. Thus, to make sure that the correct matching is preserved during each replication, we need to specify the grouping variable in cluster(), supply a variable name to idcluster(), and use this variable as the grouping variable with clogit, as we demonstrate below. . bootstrap ratio=exp(_b[smoke]-_b[ptd]), seed(1) cluster(pairid) > idcluster(newpairid): clogit low lwt smoke ptd ht ui i.race, group(newpairid) (running clogit on estimation sample) Bootstrap replications (50) 1 2 3 4 5 .................................................. 50 Bootstrap results Number of obs = 112 Replications = 50 command: clogit low lwt smoke ptd ht ui i.race, group(newpairid) ratio: exp(_b[smoke]-_b[ptd]) (Replications based on 56 clusters in pairid) ratio Observed Coef. Bootstrap Std. Err. z P>|z| .6654095 7.919441 0.08 0.933 Normal-based [95% Conf. Interval] -14.85641 16.18723 Note the difference between the estimates of the bootstrap standard error for the two specifications of the bootstrap syntax. bootstrap — Bootstrap sampling and estimation 209 Technical note Similarly, when you have panel (longitudinal) data, all resampled panels must be unique in each of the bootstrap samples to obtain correct bootstrap estimates of statistics. Therefore, both cluster(panelvar) and idcluster(newpanelvar) must be specified with bootstrap, and i(newpanelvar) must be used with the main command. Moreover, you must clear the current xtset settings by typing xtset, clear before calling bootstrap. Example 8 Continuing with our birthweight data, suppose that we have more information about doctors supervising women’s pregnancies. We believe that the data on the pairs of infants from the same doctor may be correlated and want to adjust standard errors for possible correlation among the pairs. clogit offers the vce(cluster clustvar) option to do this. Let’s add a cluster variable to our dataset. One thing to keep in mind is that to use vce(cluster clustvar), groups in group() must be nested within clusters. . use http://www.stata-press.com/data/r13/lowbirth2, clear (Applied Logistic Regression, Hosmer & Lemeshow) . set seed 12345 . by pairid, sort: egen byte doctor = total(int(2*runiform()+1)*(_n == 1)) . clogit low lwt smoke ptd ht ui i.race, group(pairid) vce(cluster doctor) Iteration 0: log pseudolikelihood = -26.768693 Iteration 1: log pseudolikelihood = -25.810476 Iteration 2: log pseudolikelihood = -25.794296 Iteration 3: log pseudolikelihood = -25.794271 Iteration 4: log pseudolikelihood = -25.794271 Conditional (fixed-effects) logistic regression Number of obs = 112 Wald chi2(1) = . Prob > chi2 = . Log pseudolikelihood = -25.794271 Pseudo R2 = 0.3355 (Std. Err. adjusted for 2 clusters in doctor) Robust Std. Err. low Coef. z P>|z| [95% Conf. 
Interval] lwt smoke ptd ht ui -.0183757 1.400656 1.808009 2.361152 1.401929 .0217802 .0085545 .938173 1.587013 .8568119 -0.84 163.73 1.93 1.49 1.64 0.399 0.000 0.054 0.137 0.102 -.0610641 1.38389 -.0307765 -.7493362 -.2773913 .0243128 1.417423 3.646794 5.47164 3.08125 race black other .5713643 -.0253148 .0672593 .9149785 8.49 -0.03 0.000 0.978 .4395385 -1.81864 .7031902 1.76801 To obtain correct bootstrap standard errors of the exponentiated difference between the two coefficients in this example, we need to make sure that both resampled clusters and groups within resampled clusters are unique in each of the bootstrap samples. To achieve this, bootstrap needs the information about clusters in cluster(), the variable name of the new identifier for clusters in idcluster(), and the information about groups in group(). We demonstrate the corresponding syntax of bootstrap below. 210 bootstrap — Bootstrap sampling and estimation . bootstrap ratio=exp(_b[smoke]-_b[ptd]), seed(1) cluster(doctor) > idcluster(uidoctor) group(pairid): clogit low lwt smoke ptd ht ui i.race, > group(pairid) (running clogit on estimation sample) Bootstrap replications (50) 1 2 3 4 5 .................................................. 50 Bootstrap results Number of obs = 112 Replications = 50 command: clogit low lwt smoke ptd ht ui i.race, group(pairid) ratio: exp(_b[smoke]-_b[ptd]) (Replications based on 2 clusters in doctor) ratio Observed Coef. Bootstrap Std. Err. z P>|z| Normal-based [95% Conf. Interval] .6654095 .3156251 2.11 0.035 .0467956 1.284023 In the above syntax, although we specify group(pairid) with clogit, it is not the group identifiers of the original pairid variable that are used to compute parameter estimates from bootstrap samples. The way bootstrap works is that, at each replication, the clusters defined by doctor are resampled and the new variable, uidoctor, uniquely identifying resampled clusters is created. After that, another new variable uniquely identifying the (uidoctor, group) combination is created and renamed to have the same name as the grouping variable, pairid. This newly defined grouping variable is then used by clogit to obtain the parameter estimates from this bootstrap sample of clusters. After all replications are performed, the original values of the grouping variable are restored. Technical note The same logic must be used when running bootstrap with commands designed for panel (longitudinal) data that allow specifying the cluster(clustervar) option. To ensure that the combination of (clustervar, panelvar) values are unique in each of the bootstrap samples, cluster(clustervar), idcluster(newclustervar), and group(panelvar) must be specified with bootstrap, and i(panelvar) must be used with the main command. Bradley Efron was born in 1938 in Minnesota and studied mathematics and statistics at Caltech and Stanford; he has lived in northern California since 1960. He has worked on empirical Bayes, survival analysis, exponential families, bootstrap and jackknife methods, and confidence intervals, in conjunction with applied work in biostatistics, astronomy, and physics. Efron is a member of the U.S. National Academy of Sciences and was awarded the U.S. National Medal of Science in 2005. He is by any standards one of the world’s leading statisticians: his work ranges from deep and elegant contributions in theoretical statistics to pathbreaking involvement in a variety of practical applications. 
bootstrap — Bootstrap sampling and estimation 211 Stored results bootstrap stores the following in e(): Scalars e(N) e(N reps) e(N misreps) e(N strata) e(N clust) e(k eq) e(k exp) e(k eexp) e(k extra) e(level) e(bs version) e(rank) sample size number of complete replications number of incomplete replications number of strata number of clusters number of equations in e(b) number of standard expressions number of extended expressions (i.e., b) number of extra equations beyond the original ones from e(b) confidence level for bootstrap CIs version for bootstrap results rank of e(V) Macros e(cmdname) e(cmd) e(command) e(cmdline) e(prefix) e(title) e(strata) e(cluster) e(seed) e(size) e(exp#) e(ties) e(mse) e(vce) e(vcetype) e(properties) command name from command same as e(cmdname) or bootstrap command command as typed bootstrap title in estimation output strata variables cluster variables initial random-number seed from the size(#) option expression for the #th statistic ties, if specified mse, if specified bootstrap title used to label Std. Err. b V Matrices e(b) e(b bs) e(reps) e(bias) e(se) e(z0) e(accel) e(ci normal) e(ci percentile) e(ci bc) e(ci bca) e(V) e(V modelbased) observed statistics bootstrap estimates number of nonmissing results estimated biases estimated standard errors median biases estimated accelerations normal-approximation CIs percentile CIs bias-corrected CIs bias-corrected and accelerated CIs bootstrap variance–covariance matrix model-based variance When exp list is command. b, bootstrap will also carry forward most of the results already in e() from Methods and formulas Let θb be the observed value of the statistic, that is, the value of the statistic calculated with the original dataset. Let i = 1, 2, . . . , k denote the bootstrap samples, and let θbi be the value of the statistic from the ith bootstrap sample. 212 bootstrap — Bootstrap sampling and estimation When the mse option is specified, the standard error is estimated as X 1/2 k 1 b2 se b MSE = (θbi − θ) k i=1 Otherwise, the standard error is estimated as 1/2 k 1 X b (θi − θ)2 se b = k−1 i=1 where θ= k 1 Xb θi k i=1 The variance–covariance matrix is similarly computed. The bias is estimated as d = θ − θb bias Confidence intervals with nominal coverage rates 1 − α are calculated according to the following formulas. The normal-approximation method yields the confidence intervals θb − z1−α/2 se, b θb + z1−α/2 se b where z1−α/2 is the (1 − α/2)th quantile of the standard normal distribution. If the mse option is specified, bootstrap will report the normal confidence interval using se b MSE instead of se b . estat bootstrap only uses se b in the normal confidence interval. The percentile method yields the confidence intervals ∗ ∗ θα/2 , θ1−α/2 where θp∗ is the pth quantile (the 100pth percentile) of the bootstrap distribution (θb1 , . . . , θbk ). Let b z0 = Φ−1 {#(θbi ≤ θ)/k} b is the number of elements of the bootstrap distribution that are less than or equal where #(θbi ≤ θ) to the observed statistic and Φ is the standard cumulative normal. z0 is known as the median bias of b + #(θbi = θ)/ b 2, which is the θb. When the ties option is specified, z0 is estimated as #(θbi < θ) number of elements of the bootstrap distribution that are less than the observed statistic plus half the number of elements that are equal to the observed statistic. Let Pn (θ(·) − θb(i) )3 a = P i=1 n b 2 3/2 6 i=1 (θ (·) − θ(i) ) where θb(i) are the leave-one-out (jackknife) estimates of θb and θ(·) is their mean. 
This expression is known as the jackknife estimate of acceleration for θb. Let z0 − z1−α/2 p1 = Φ z0 + 1 − a(z0 − z1−α/2 ) z0 + z1−α/2 p2 = Φ z0 + 1 − a(z0 + z1−α/2 ) bootstrap — Bootstrap sampling and estimation 213 where z1−α/2 is the (1 −α/2)th quantile of the normal distribution. The bias-corrected and accelerated (BCa ) method yields confidence intervals θp∗1 , θp∗2 where θp∗ is the pth quantile of the bootstrap distribution as defined previously. The bias-corrected (but not accelerated) method is a special case of BCa with a = 0. References Ängquist, L. 2010. Stata tip 92: Manual implementation of permutations and bootstraps. Stata Journal 10: 686–688. Cameron, A. C., and P. K. Trivedi. 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press. Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Application. Cambridge: Cambridge University Press. Efron, B. 1979. Bootstrap methods: Another look at the jackknife. Annals of Statistics 7: 1–26. . 1982. The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia: Society for Industrial and Applied Mathematics. Efron, B., and C. Stein. 1981. The jackknife estimate of variance. Annals of Statistics 9: 586–596. Efron, B., and R. J. Tibshirani. 1986. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science 1: 54–77. . 1993. An Introduction to the Bootstrap. New York: Chapman & Hall/CRC. Field, C. A., and A. H. Welsh. 2007. Bootstrapping clustered data. Journal of the Royal Statistical Society, Series B 69: 369–390. Gleason, J. R. 1997. ip18: A command for randomly resampling a dataset. Stata Technical Bulletin 37: 17–22. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 77–83. College Station, TX: Stata Press. . 1999. ip18.1: Update to resample. Stata Technical Bulletin 52: 9–10. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 119. College Station, TX: Stata Press. Gould, W. W. 1994. ssi6.2: Faster and easier bootstrap estimation. Stata Technical Bulletin 21: 24–33. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 211–223. College Station, TX: Stata Press. Guan, W. 2003. From the help desk: Bootstrapped standard errors. Stata Journal 3: 71–80. Hall, P., and S. R. Wilson. 1991. Two guidelines for bootstrap hypothesis testing. Biometrics 47: 757–762. Hamilton, L. C. 1991. ssi2: Bootstrap programming. Stata Technical Bulletin 4: 18–27. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 208–220. College Station, TX: Stata Press. . 1992. Regression with Graphics: A Second Course in Applied Statistics. Belmont, CA: Duxbury. . 2013. Statistics with Stata: Updated for Version 12. 8th ed. Boston: Brooks/Cole. Hinkley, D. V. 1978. Improving the jackknife with special reference to correlation estimation. Biometrika 65: 13–22. Holmes, S., C. Morris, and R. J. Tibshirani. 2003. Bradley Efron: A conversation with good friends. Statistical Science 18: 268–281. Mooney, C. Z., and R. D. Duval. 1993. Bootstrapping: A Nonparametric Approach to Statistical Inference. Newbury Park, CA: Sage. Ng, E. S.-W., R. Grieve, and J. R. Carpenter. 2013. Two-stage nonparametric bootstrap sampling with shrinkage correction for clustered data. Stata Journal 13: 141–164. Poi, B. P. 2004. From the help desk: Some bootstrapping techniques. Stata Journal 4: 312–328. Royston, P., and W. Sauerbrei. 2009. Bootstrap assessment of the stability of multivariable models. Stata Journal 9: 547–570. Stine, R. 1990. 
An introduction to bootstrap methods: Examples and ideas. In Modern Methods of Data Analysis, ed. J. Fox and J. S. Long, 353–373. Newbury Park, CA: Sage. 214 bootstrap — Bootstrap sampling and estimation Also see [R] bootstrap postestimation — Postestimation tools for bootstrap [R] jackknife — Jackknife estimation [R] permute — Monte Carlo permutation tests [R] simulate — Monte Carlo simulations [SVY] svy bootstrap — Bootstrap for survey data [U] 13.5 Accessing coefficients and standard errors [U] 13.6 Accessing results from Stata commands [U] 20 Estimation and postestimation commands Title bootstrap postestimation — Postestimation tools for bootstrap Description Menu for estat Also see Syntax for predict Options for estat bootstrap Syntax for estat bootstrap Remarks and examples Description The following postestimation command is of special interest after bootstrap: Command Description estat bootstrap percentile-based and bias-corrected CI tables The following standard postestimation commands are also available: Command ∗ contrast estat ic estat summarize estat vce estimates ∗ hausman ∗ lincom ∗ margins ∗ marginsplot nlcom ∗ ∗ ∗ predict predictnl ∗ pwcompare test ∗ testnl ∗ ∗ Description contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results Hausman’s specification test point estimates, standard errors, testing, and inference for linear combinations of coefficients marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses This postestimation command is allowed if it may be used after command. 215 216 bootstrap postestimation — Postestimation tools for bootstrap Special-interest postestimation command estat bootstrap displays a table of confidence intervals for each statistic from a bootstrap analysis. Syntax for predict The syntax of predict (and even if predict is allowed) following bootstrap depends upon the command used with bootstrap. If predict is not allowed, neither is predictnl. Syntax for estat bootstrap estat bootstrap , options options Description bc bca normal percentile all noheader nolegend verbose bias-corrected CIs; the default bias-corrected and accelerated (BCa ) CIs normal-based CIs percentile CIs all available CIs suppress table header suppress table legend display the full table legend bc, bca, normal, and percentile may be used together. Menu for estat Statistics > Postestimation > Reports and statistics Options for estat bootstrap bc is the default and displays bias-corrected confidence intervals. bca displays bias-corrected and accelerated confidence intervals. This option assumes that you also specified the bca option on the bootstrap prefix command. normal displays normal approximation confidence intervals. percentile displays percentile confidence intervals. all displays all available confidence intervals. noheader suppresses display of the table header. This option implies nolegend. 
nolegend suppresses display of the table legend, which identifies the rows of the table with the expressions they represent. verbose requests that the full table legend be displayed. bootstrap postestimation — Postestimation tools for bootstrap 217 Remarks and examples Example 1 The estat bootstrap postestimation command produces a table containing the observed value of the statistic, an estimate of its bias, the bootstrap standard error, and up to four different confidence intervals. If we were interested merely in getting bootstrap standard errors for the model coefficients, we could use the bootstrap prefix with our estimation command. If we were interested in performing a thorough bootstrap analysis of the model coefficients, we could use the estat bootstrap postestimation command after fitting the model with the bootstrap prefix. Using example 1 from [R] bootstrap, we need many more replications for the confidence interval types other than the normal based, so let’s rerun the estimation command. We will reset the randomnumber seed—in case we wish to reproduce the results—increase the number of replications, and save the bootstrap distribution as a dataset called bsauto.dta. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . set seed 1 . bootstrap _b, reps(1000) saving(bsauto) bca: regress mpg weight gear foreign (output omitted ) . estat bootstrap, all Linear regression Number of obs Replications mpg Observed Coef. Bias weight -.00613903 .0000567 .000628 gear_ratio 1.4571134 .1051696 1.4554785 foreign -2.2216815 -.0196361 1.2023286 _cons 36.101353 -.502281 5.4089441 (N) (P) (BC) (BCa) Bootstrap Std. Err. = = 74 1000 [95% Conf. Interval] -.0073699 -.0073044 -.0074355 -.0075282 -1.395572 -1.262111 -1.523927 -1.492223 -4.578202 -4.442199 -4.155504 -4.216531 25.50002 24.48569 25.59799 25.85658 -.0049082 -.0048548 -.004928 -.0050258 4.309799 4.585372 4.174376 4.231356 .1348393 .2677989 .6170642 .5743973 46.70269 46.07086 46.63227 47.02108 (N) (P) (BC) (BCa) (N) (P) (BC) (BCa) (N) (P) (BC) (BCa) (N) (P) (BC) (BCa) normal confidence interval percentile confidence interval bias-corrected confidence interval bias-corrected and accelerated confidence interval The estimated standard errors here differ from our previous estimates using only 100 replications by, respectively, 8%, 3%, 11%, and 6%; see example 1 of [R] bootstrap. So much for our advice that 50 – 200 replications are good enough to estimate standard errors. Well, the more replications the better — that advice you should believe. 218 bootstrap postestimation — Postestimation tools for bootstrap Which of the methods to compute confidence intervals should we use? If the statistic is unbiased, the percentile (P) and bias-corrected (BC) methods should give similar results. The bias-corrected confidence interval will be the same as the percentile confidence interval when the observed value of the statistic is equal to the median of the bootstrap distribution. Thus, for unbiased statistics, the two methods should give similar results as the number of replications becomes large. For biased statistics, the bias-corrected method should yield confidence intervals with better coverage probability (closer to the nominal value of 95% or whatever was specified) than the percentile method. For statistics with variances that vary as a function of the parameter of interest, the bias-corrected and accelerated method (BCa ) will typically have better coverage probability than the others. 
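After a run such as the one above, the tables for any subset of these intervals can also be requested together; the option table in the Syntax section allows, for example, the bias-corrected and percentile intervals side by side:

. estat bootstrap, bc percentile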
When the bootstrap distribution is approximately normal, all of these methods should give similar confidence intervals as the number of replications becomes large. If we examine the normality of these bootstrap distributions using, say, the pnorm command (see [R] diagnostic plots), we see that they closely follow a normal distribution. Thus here, the normal approximation would also be a valid choice. The chief advantage of the normal-approximation method is that it (supposedly) requires fewer replications than the other methods. Of course, it should be used only when the bootstrap distribution exhibits normality. We can load bsauto.dta containing the bootstrap distributions for these coefficients: . use bsauto (bootstrap: regress) . describe * storage variable name type display format _b_weight _b_gear_ratio _b_foreign _b_cons %9.0g %9.0g %9.0g %9.0g float float float float value label variable label _b[weight] _b[gear_ratio] _b[foreign] _b[_cons] We can now run other commands, such as pnorm, on the bootstrap distributions. As with all standard estimation commands, we can use the bootstrap command to replay its output table. The default variable names assigned to the statistics in exp list are bs 1, bs 2, . . . , and each variable is labeled with the associated expression. The naming convention for the extended expressions b and se is to prepend b and se , respectively, onto the name of each element of the coefficient vector. Here the first coefficient is b[weight], so bootstrap named it b weight. Also see [R] bootstrap — Bootstrap sampling and estimation [U] 20 Estimation and postestimation commands Title boxcox — Box–Cox regression models Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax boxcox depvar indepvars if in weight , options Description options Model noconstant model(lhsonly) model(rhsonly) model(lambda) model(theta) notrans(varlist) suppress constant term left-hand-side Box–Cox model; the default right-hand-side Box–Cox model both sides Box–Cox model with same parameter both sides Box–Cox model with different parameters nontransformed independent variables Reporting set confidence level; default is level(95) perform likelihood-ratio test level(#) lrtest Maximization nolog nologlr maximize options suppress full-model iteration log suppress restricted-model lrtest iteration log control the maximization process; seldom used depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, jackknife, rolling, statsby, and xi are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. fweights and iweights are allowed; see [U] 11.1.6 weight. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Linear models and related > Box-Cox regression Description boxcox finds the maximum likelihood estimates of the parameters of the Box–Cox transform, the coefficients on the independent variables, and the standard deviation of the normally distributed errors for a model in which depvar is regressed on indepvars. 
You can fit the following models: 219 220 boxcox — Box–Cox regression models Option Estimates lhsonly yj = β1 x1j + β2 x2j + · · · + βk xkj + j rhsonly yj = β1 x1j + β2 x2j + · · · + βk xkj + j (θ) (λ) (λ) (λ) (λ) (λ) (λ) (λ) β1 x1j (λ) β1 x1j (λ) β1 x1j (λ) β1 x1j (λ) β2 x2j (λ) β2 x2j (λ) β2 x2j (λ) β2 x2j (λ) βk xkj (λ) βk xkj (λ) βk xkj (λ) βk xkj rhsonly notrans() yj = β1 x1j + β2 x2j + · · · + βk xkj + γ1 z1j + · · · + γl zlj + j lambda lambda notrans() theta theta notrans() (λ) yj = (λ) yj = (θ) yj = (θ) yj = + + + + + ··· + + ··· + + ··· + + ··· + + j + γ1 z1j + · · · + γl zlj + j + j + γ1 z1j + · · · + γl zlj + j Any variable to be transformed must be strictly positive. Options Model noconstant; see [R] estimation options. model( lhsonly | rhsonly | lambda | theta ) specifies which of the four models to fit. model(lhsonly) applies the Box–Cox transform to depvar only. model(lhsonly) is the default. model(rhsonly) applies the transform to the indepvars only. model(lambda) applies the transform to both depvar and indepvars, and they are transformed by the same parameter. model(theta) applies the transform to both depvar and indepvars, but this time, each side is transformed by a separate parameter. notrans(varlist) specifies that the variables in varlist be included as nontransformed independent variables. Reporting level(#); see [R] estimation options. lrtest specifies that a likelihood-ratio test of significance be performed and reported for each independent variable. Maximization nolog suppresses the iteration log when fitting the full model. nologlr suppresses the iteration log when fitting the restricted models required by the lrtest option. maximize options: iterate(#) and from(init specs); see [R] maximize. Model Initial value specification lhsonly rhsonly lambda theta from(θ0 , copy) from(λ0 , copy) from(λ0 , copy) from(λ0 θ0 , copy) boxcox — Box–Cox regression models 221 Remarks and examples Remarks are presented under the following headings: Introduction Theta model Lambda model Left-hand-side-only model Right-hand-side-only model Introduction The Box–Cox transform y (λ) = yλ − 1 λ has been widely used in applied data analysis. Box and Cox (1964) developed the transformation and argued that the transformation could make the residuals more closely normal and less heteroskedastic. Cook and Weisberg (1982) discuss the transform in this light. Because the transform embeds several popular functional forms, it has received some attention as a method for testing functional forms, in particular, y (λ) if λ = 1 y − 1 ln(y) if λ = 0 = 1 − 1/y if λ = −1 Davidson and MacKinnon (1993) discuss this use of the transform. Atkinson (1985) also gives a good general treatment. Theta model boxcox obtains the maximum likelihood estimates of the parameters for four different models. The most general of the models, the theta model, is (θ) yj (λ) (λ) (λ) = β0 + β1 x1j + β2 x2j + · · · + βk xkj + γ1 z1j + γ2 z2j + · · · + γl zlj + j where ∼ N (0, σ 2 ). Here the dependent variable, y , is subject to a Box–Cox transform with parameter θ. Each of the indepvars, x1 , x2 , . . . , xk , is transformed by a Box–Cox transform with parameter λ. The z1 , z2 , . . . , zl specified in the notrans() option are independent variables that are not transformed. Box and Cox (1964) argued that this transformation would leave behind residuals that more closely follow a normal distribution than those produced by a simple linear regression model. 
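Written out explicitly, the transform and the theta model just described are (a restatement of the expressions above, with \(\epsilon_j\) denoting the error term)

\[
y^{(\lambda)} = \frac{y^{\lambda}-1}{\lambda},\qquad\text{which reduces to}\qquad
y^{(\lambda)}=\begin{cases} y-1 & \text{if } \lambda=1\\ \ln(y) & \text{if } \lambda=0\\ 1-1/y & \text{if } \lambda=-1\end{cases}
\]
\[
y_j^{(\theta)} = \beta_0 + \beta_1 x_{1j}^{(\lambda)} + \beta_2 x_{2j}^{(\lambda)} + \cdots + \beta_k x_{kj}^{(\lambda)}
 + \gamma_1 z_{1j} + \gamma_2 z_{2j} + \cdots + \gamma_l z_{lj} + \epsilon_j,
\qquad \epsilon_j \sim N(0,\sigma^2)
\]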
Bear in mind that the normality of is assumed and that boxcox obtains maximum likelihood estimates of the k + l + 4 parameters under this assumption. boxcox does not choose λ and θ so that the residuals are approximately normally distributed. If you are interested in this type of transformation to normality, see the official Stata commands lnskew0 and bcskew0 in [R] lnskew0. However, those commands work on a more restrictive model in which none of the independent variables is transformed. 222 boxcox — Box–Cox regression models Example 1 Below we fit a theta model to a nonrepresentative extract of the Second National Health and Nutrition Examination Survey (NHANES II) dataset discussed in McDowell et al. (1981). We model individual-level diastolic blood pressure (bpdiast) as a function of the transformed variables body mass index (bmi) and cholesterol level (tcresult) and of the untransformed variables age (age) and sex (sex). . use http://www.stata-press.com/data/r13/nhanes2 . boxcox bpdiast bmi tcresult, notrans(age sex) model(theta) lrtest Fitting comparison model Iteration 0: log likelihood = -41178.61 Iteration 1: log likelihood = -41032.51 Iteration 2: log likelihood = -41032.488 Iteration 3: log likelihood = -41032.488 Fitting full model Iteration 0: log Iteration 1: log Iteration 2: log Iteration 3: log Fitting comparison Iteration 0: log Iteration 1: log Iteration 2: log Iteration 3: log Iteration 0: log Iteration 1: log Iteration 2: log Iteration 0: log Iteration 1: log Iteration 2: log Iteration 3: log Iteration 0: log Iteration 1: log Iteration 2: log likelihood likelihood likelihood likelihood models for likelihood likelihood likelihood likelihood likelihood likelihood likelihood likelihood likelihood likelihood likelihood likelihood likelihood likelihood = -39928.606 = -39775.026 = -39774.987 = -39774.987 LR tests = -39947.144 = -39934.55 = -39934.516 = -39934.516 = -39906.96 = -39896.63 = -39896.629 = -40464.599 = -40459.752 = -40459.604 = -40459.604 = -39829.859 = -39815.576 = -39815.575 Number of obs LR chi2(5) Prob > chi2 Log likelihood = -39774.987 bpdiast Coef. /lambda /theta .6383286 .1988197 = = = 10351 2515.00 0.000 Std. Err. z P>|z| [95% Conf. Interval] .1577601 .0454088 4.05 4.38 0.000 0.000 .3291245 .1098201 .9475327 .2878193 boxcox — Box–Cox regression models 223 Estimates of scale-variant parameters Coef. chi2(df) P>chi2(df) age sex _cons .003811 -.1054887 5.835555 319.060 243.284 0.000 0.000 bmi tcresult .0872041 .004734 1369.235 81.177 /sigma .3348267 df of chi2 Notrans 1 1 Trans Test H0: theta=lambda = -1 theta=lambda = 0 theta=lambda = 1 Restricted log likelihood -40162.898 -39790.945 -39928.606 0.000 0.000 chi2 775.82 31.92 307.24 1 1 Prob > chi2 0.000 0.000 0.000 The output is composed of the iteration logs and three distinct tables. The first table contains a standard header for a maximum likelihood estimator and a standard output table for the Box– Cox transform parameters. The second table contains the estimates of the scale-variant parameters. The third table contains the output from likelihood-ratio tests on three standard functional form specifications. The right-hand-side and the left-hand-side transformations each add to the regression fit at the 1% significance level and are both positive but less than 1. All the variables have significant impacts on diastolic blood pressure, bpdiast. As expected, the transformed variables—the body mass index, bmi, and cholesterol level, tcresult—contribute to higher blood pressure. 
The last output table shows that the linear, multiplicative inverse, and log specifications are strongly rejected. Technical note Spitzer (1984) showed that the Wald tests of the joint significance of the coefficients of the right-hand-side variables, either transformed or untransformed, are not invariant to changes in the scale of the transformed dependent variable. Davidson and MacKinnon (1993) also discuss this point. This problem demonstrates that Wald statistics can be manipulated in nonlinear models. Lafontaine and White (1986) analyze this problem numerically, and Phillips and Park (1988) analyze it by using Edgeworth expansions. See Drukker (2000b) for a more detailed discussion of this issue. Because the parameter estimates and their Wald tests are not scale invariant, no Wald tests or confidence intervals are reported for these parameters. However, when the lrtest option is specified, likelihood-ratio tests are performed and reported. Schlesselman (1971) showed that, if a constant is included in the model, the parameter estimates of the Box–Cox transforms are scale invariant. For this reason, we strongly recommend that you not use the noconstant option. The lrtest option does not perform a likelihood-ratio test on the constant, so no value for this statistic is reported. Unless the data are properly scaled, the restricted model does not often converge. For this reason, no likelihood-ratio test on the constant is performed by the lrtest option. However, if you have a special interest in performing this test, you can do so by fitting the constrained model separately. If problems with convergence are encountered, rescaling the data by their means may help. 224 boxcox — Box–Cox regression models Lambda model A less general model than the one above is called the lambda model. It specifies that the same parameter be used in both the left-hand-side and right-hand-side transformations. Specifically, (λ) yj (λ) (λ) (λ) = β0 + β1 x1j + β2 x2j + · · · + βk xkj + γ1 z1j + γ2 z2j + · · · + γl zlj + j where ∼ N (0, σ 2 ). Here the depvar variable, y , and each of the indepvars, x1 , x2 , . . . , xk , is transformed by a Box–Cox transform with the common parameter λ. Again the z1 , z2 , . . . , zl are independent variables that are not transformed. Left-hand-side-only model Even more restrictive than a common transformation parameter is transforming the dependent variable only. Because the dependent variable is on the left-hand side of the equation, this model is known as the lhsonly model. Here you are estimating the parameters of the model (θ) yj = β0 + β1 x1j + β2 x2j + · · · + βk xkj + j where ∼ N (0, σ 2 ). Here only the depvar, y , is transformed by a Box–Cox transform with the parameter θ. boxcox — Box–Cox regression models 225 Example 2 In this example, we model the transform of diastolic blood pressure as a linear combination of the untransformed body mass index, cholesterol level, age, and sex. . boxcox bpdiast bmi tcresult age sex, model(lhsonly) lrtest nolog nologlr Fitting comparison model Fitting full model Fitting comparison models for LR tests Number of obs = 10351 LR chi2(4) = 2509.56 Log likelihood = -39777.709 Prob > chi2 = 0.000 bpdiast Coef. /theta .2073268 Std. Err. z P>|z| [95% Conf. Interval] .0452895 4.58 0.000 .1185611 .2960926 Estimates of scale-variant parameters Coef. 
chi2(df) P>chi2(df) bmi tcresult age sex _cons .0272628 .0006929 .0040141 -.1122274 6.302855 1375.841 82.380 334.117 263.219 0.000 0.000 0.000 0.000 /sigma .3476615 df of chi2 Notrans Test H0: Restricted log likelihood theta = -1 theta = 0 theta = 1 -40146.678 -39788.241 -39928.606 LR statistic chi2 737.94 21.06 301.79 1 1 1 1 P-value Prob > chi2 0.000 0.000 0.000 The maximum likelihood estimate of the transformation parameter for this model is positive and significant. Once again, all the scale-variant parameters are significant, and we find a positive impact of body mass index (bmi) and cholesterol levels (tcresult) on the transformed diastolic blood pressure (bpdiast). This model rejects the linear, multiplicative inverse, and log specifications. Right-hand-side-only model The fourth model leaves the depvar alone and transforms a subset of the indepvars using the parameter λ. This is the rhsonly model. In this model, the depvar, y , is given by (λ) (λ) (λ) yj = β0 + β1 x1j + β2 x2j + · · · + βk xkj + γ1 z1j + γ2 z2j + · · · + γl zlj + j where ∼ N (0, σ 2 ). Here each of the indepvars, x1 , x2 , . . . , xk , is transformed by a Box–Cox transform with the parameter λ. Again the z1 , z2 , . . . , zl are independent variables that are not transformed. 226 boxcox — Box–Cox regression models Example 3 Now we consider a rhsonly model in which the regressors sex and age are not transformed. . boxcox bpdiast bmi tcresult, notrans(sex age) model(rhsonly) lrtest nolog > nologlr Fitting full model Fitting comparison models for LR tests Number of obs LR chi2(5) Prob > chi2 Log likelihood = -39928.212 bpdiast Coef. /lambda .8658841 = = = 10351 2500.79 0.000 Std. Err. z P>|z| [95% Conf. Interval] .1522387 5.69 0.000 .5675018 1.164266 Estimates of scale-variant parameters Coef. chi2(df) P>chi2(df) sex age _cons -3.544042 .128809 50.01498 235.020 311.754 0.000 0.000 bmi tcresult 1.418215 .0462964 1396.709 78.500 /sigma 11.4557 df of chi2 Notrans 1 1 Trans Test H0: Restricted log likelihood lambda = -1 lambda = 0 lambda = 1 -39989.331 -39942.945 -39928.606 0.000 0.000 LR statistic chi2 122.24 29.47 0.79 1 1 P-value Prob > chi2 0.000 0.000 0.375 The maximum likelihood estimate of the transformation parameter in this model is positive and significant at the 1% level. The transformed bmi coefficient behaves as expected, and the remaining scale-variant parameters are significant at the 1% level. This model rejects the multiplicative inverse and log specifications strongly. However, we cannot reject the hypothesis that the model is linear. boxcox — Box–Cox regression models 227 Stored results boxcox stores the following in e(): Scalars e(N) e(ll) e(chi2) e(df m) e(ll0) e(df r) e(ll t1) e(chi2 t1) e(p t1) e(ll tm1) e(chi2 tm1) e(p tm1) e(ll t0) e(chi2 t0) e(p t0) e(rank) e(ic) e(rc) Macros e(cmd) e(cmdline) e(depvar) e(model) e(wtype) e(wexp) e(ntrans) e(chi2type) e(lrtest) e(properties) e(predict) e(marginsnotok) Matrices e(b) e(V) e(pm) e(df) e(chi2m) Functions e(sample) number of observations log likelihood LR statistic of full vs. comparison full model degrees of freedom log likelihood of the restricted model restricted model degrees of freedom log likelihood of model λ=θ=1 LR of λ=θ=1 vs. full model p-value of λ=θ=1 vs. full model log likelihood of model λ=θ=−1 LR of λ=θ=−1 vs. full model p-value of λ=θ=−1 vs. full model log likelihood of model λ=θ=0 LR of λ=θ=0 vs. full model p-value of λ=θ=0 vs. 
full model rank of e(V) number of iterations return code boxcox command as typed name of dependent variable lhsonly, rhsonly, lambda, or theta weight type weight expression yes if nontransformed indepvars LR; type of model χ2 test lrtest, if requested b V program used to implement predict predictions disallowed by margins coefficient vector variance–covariance matrix of the estimators (see note below) p-values for LR tests on indepvars degrees of freedom of LR tests on indepvars LR statistics for tests on indepvars marks estimation sample e(V) contains all zeros, except for the elements that correspond to the parameters of the Box–Cox transform. 228 boxcox — Box–Cox regression models Methods and formulas In the internal computations, y (λ) = λ y λ−1 if |λ| > 10−10 ln(y) otherwise The unconcentrated log likelihood for the theta model is lnL = −N 2 2 ln(2π) + ln(σ ) + (θ − 1) N X ln(yi ) − i=1 1 2σ 2 SSR where SSR = N X (θ) (λ) (λ) (λ) (yi − β0 + β1 xi1 + β2 xi2 + · · · + βk xik + γ1 zi1 + γ2 zi2 + · · · + γl zil )2 i=1 Writing the SSR in matrix form, SSR = (y(θ) − X(λ) b0 − Zg0 )0 (y(θ) − X(λ) b0 − Zg0 ) where y(θ) is an N × 1 vector of elementwise transformed data, X(λ) is an N × k matrix of elementwise transformed data, Z is an N × l matrix of untransformed data, b is a 1 × k vector of coefficients, and g is a 1 × l vector of coefficients. Letting Wλ = X(λ) Z be the horizontal concatenation of X(λ) and Z and 0 b 0 d = g0 be the vertical concatenation of the coefficients yields SSR = (y(θ) − Wλ d0 )0 (y(θ) − Wλ d0 ) For given values of λ and θ, the solutions for d0 and σ 2 are b 0 = (W 0 Wλ )−1 W 0 y (θ) d λ λ and 0 1 (θ) y − Wλ db0 y (θ) − Wλ db0 N Substituting these solutions into the log-likelihood function yields the concentrated log-likelihood function N X N lnLc = − ln(2π) + 1 + ln(b σ 2 ) + (θ − 1) ln(yi ) 2 i=1 σ b2 = boxcox — Box–Cox regression models 229 Similar calculations yield the concentrated log-likelihood function for the lambda model, N X N ln(2π) + 1 + ln(b σ 2 ) + (λ − 1) lnLc = − ln(yi ) 2 i=1 the lhsonly model, lnLc = N X N ln(2π) + 1 + ln(b σ 2 ) + (θ − 1) ln(yi ) − 2 i=1 and the rhsonly model, lnLc = N − 2 ln(2π) + 1 + ln(b σ 2) where σ b 2 is specific to each model and is defined analogously to that in the theta model. References Atkinson, A. C. 1985. Plots, Transformations, and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis. Oxford: Oxford University Press. Box, G. E. P., and D. R. Cox. 1964. An analysis of transformations. Journal of the Royal Statistical Society, Series B 26: 211–252. Carroll, R. J., and D. Ruppert. 1988. Transformation and Weighting in Regression. New York: Chapman & Hall. Cook, R. D., and S. Weisberg. 1982. Residuals and Influence in Regression. New York: Chapman & Hall/CRC. Davidson, R., and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press. Drukker, D. M. 2000a. sg130: Box–Cox regression models. Stata Technical Bulletin 54: 27–36. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 307–319. College Station, TX: Stata Press. . 2000b. sg131: On the manipulability of Wald tests in Box–Cox regression models. Stata Technical Bulletin 54: 36–42. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 319–327. College Station, TX: Stata Press. Lafontaine, F., and K. J. White. 1986. Obtaining any Wald statistic you want. Economics Letters 21: 35–40. Lindsey, C., and S. J. Sheather. 2010a. Power transformation via multivariate Box–Cox. 
Stata Journal 10: 69–81. . 2010b. Optimal power transformation via inverse response plots. Stata Journal 10: 200–214. McDowell, A., A. Engel, J. T. Massey, and K. Maurer. 1981. Plan and operation of the Second National Health and Nutrition Examination Survey, 1976–1980. Vital and Health Statistics 1(15): 1–144. Phillips, P. C. B., and J. Y. Park. 1988. On the formulation of Wald tests of nonlinear restrictions. Econometrica 56: 1065–1083. Schlesselman, J. J. 1971. Power families: A note on the Box and Cox transformation. Journal of the Royal Statistical Society, Series B 33: 307–311. Spitzer, J. J. 1984. Variance estimates in models with the Box–Cox transformation: Implications for estimation and hypothesis testing. Review of Economics and Statistics 66: 645–652. Also see [R] boxcox postestimation — Postestimation tools for boxcox [R] lnskew0 — Find zero-skewness log or Box – Cox transform [R] regress — Linear regression [U] 20 Estimation and postestimation commands Title boxcox postestimation — Postestimation tools for boxcox Description Remarks and examples Syntax for predict Methods and formulas Menu for predict References Options for predict Also see Description The following postestimation commands are available after boxcox: Command Description estat ic estat summarize estat vce estimates ∗ lincom ∗ nlcom predict test ∗ testnl ∗ ∗ Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses Inference is valid only for hypotheses concerning λ and θ. Syntax for predict predict statistic type newvar if in , statistic options Description Main yhat residuals predicted value of y ; the default residuals options Description Options smearing btransform compute statistic using smearing method; the default compute statistic using back-transform method These statistics are available both in and out of sample; type predict only for the estimation sample. 230 . . . if e(sample) . . . if wanted boxcox postestimation — Postestimation tools for boxcox 231 Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main yhat, the default, calculates the predicted value of the dependent variable. residuals calculates the residuals, that is, the observed value minus the predicted value. Options smearing calculates the statistics yhat and residuals using the smearing method proposed by Duan (1983) (see Methods and formulas for a description of this method). smearing is the default. btransform calculates the statistics yhat and residuals using the back-transform method (see Methods and formulas for a description of this method). Remarks and examples Below we present two examples that illustrate how to use the smearing and btransform options. Example 1: Predictions with the smearing option In this example, we calculate the predicted values of diastolic blood pressure, bpdiast, that arise from the theta model calculated in example 1 of [R] boxcox. . use http://www.stata-press.com/data/r13/nhanes2 . boxcox bpdiast bmi tcresult, notrans(age sex) model(theta) lrtest (output omitted ) . 
predict yhat (statistic yhat and option smearing are assumed) In the expression above, yhat is the name we gave to the estimates of the conditional expectation. Given that we did not specify any statistic or option, the corresponding defaults yhat and smearing were assumed. As the summary table below illustrates, the mean of the dependent variable is close to the mean of the predicted value yhat. This indicates that the theta model does a good job approximating the true value of diastolic blood pressure, bpdiast. . summarize bpdiast yhat Variable Obs bpdiast yhat 10351 10351 Mean 81.715 81.71406 Std. Dev. 12.92722 5.983486 Min Max 35 66.93709 150 110.5283 232 boxcox postestimation — Postestimation tools for boxcox Similarly, we could have asked that residuals be calculated. Here we again use the default smearing option: . predict resid, residuals (option smearing assumed to compute residuals) Example 2: Predictions with the btransform option In this example, we illustrate the tradeoffs involved by using the btransform option as opposed to the default smearing option. Continuing with example 1, we compute the predicted values using the back-transform method. . predict yhatb, btransform (statistic yhat assumed) We now compute the predicted values using the smearing option and summarize both computations. . predict yhats (statistic yhat and option smearing are assumed) . summarize bpdiast yhats yhatb Variable Obs Mean Std. Dev. bpdiast yhats yhatb 10351 10351 10351 81.715 81.71406 81.08018 12.92722 5.983486 5.95549 Min Max 35 66.93709 66.37479 150 110.5283 109.7671 As can be seen from the mean and the standard deviation of the summary table, the predicted values using the back-transform method give biased estimates but are less variable than those coming from the smearing method. However, the efficiency loss is small compared with the bias reduction. Technical note boxcox estimates variances only for the λ and θ parameters (see the technical note in [R] boxcox), so the extent to which postestimation commands can be used following boxcox is limited. Formulas used in lincom, nlcom, test, and testnl are dependent on the estimated variances. Therefore, the use of these commands is limited and generally applicable only to inferences on the λ and θ coefficients. Methods and formulas The computation of the expected value of the dependent variable conditional on the regressors for the Box–Cox model does not follow the logic of the standard linear regression model because the random disturbance does not vanish from the conditional expectation and must be accounted for. To show this, we will revisit the lhsonly model described by (λ) yj = β0 + β1 x1j + β2 x2j + · · · + β(k−1) x(k−1)j + j where y (λ) = yλ − 1 λ boxcox postestimation — Postestimation tools for boxcox and y (λ) 233 if λ = 1 y − 1 ln(y) if λ = 0 = 1 − 1/y if λ = −1 For the presentation below, let y(λ) be an N × 1 vector of elementwise transformed data, X be an N × k matrix of regressors, β be a k × 1 vector of parameters, and ι be an n × 1 vector of ones. If we were interested in E(y(λ) |X), then the conventional logic would follow, and we would λ) b where β b is the estimate of β. However, to estimate the conditional obtain predictions as y (b = Xβ, expectation of y, we need to isolate it on the left-hand side of the model. 
In the case of the lhsonly model, this yields n o1/λ b + ) + ι y = λ(Xβ The conditional expectation is then defined by Z E (y|X) = 1/λ {λ(Xβ + ) + ι} dF (|X) In the expression above, dF (|X) corresponds to the cdf of conditional on the regressors. It is also clear that the random disturbance does not vanish. To address this issue, the default methodology used by predict computes this integral using the smearing method proposed by Duan (1983) to implement a two-step estimator, as was suggested by Abrevaya (2002). In the first step, we get an estimate for defined as λ) b b = y(b − Xβ In the second step, for each j we compute our predicted values as the sum: ybj = N 1 X b b λ {λ(xj β + b i ) + 1}1/b N i=1 In the expression above, xj is the j th row of the matrix X (in other words, the values of the covariates for individual j ), and b i is the residual for individual i. The result of this summation gives us the conditional expectation of the dependent variable for individual j . Given that this operation is performed for each individual j , the methodology is computationally intensive. The back-transform method can be understood as a naı̈ve estimate that disregards the random disturbance. The predictions using this approach are given by 1/b λ b jβ b+1 ybj = λx 234 boxcox postestimation — Postestimation tools for boxcox References Abrevaya, J. 2002. Computing marginal effects in the Box–Cox model. Econometric Reviews 21: 383–393. Duan, N. 1983. Smearing estimate: A nonparametric retransformation method. Journal of the American Statistical Association 78: 605–610. Also see [R] boxcox — Box–Cox regression models [U] 20 Estimation and postestimation commands Title brier — Brier score decomposition Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Option Acknowledgment Syntax brier outcomevar forecastvar if in , group(#) by is allowed; see [D] by. Menu Statistics > Epidemiology and related > Other > Brier score decomposition Description brier computes the Yates, Sanders, and Murphy decompositions of the Brier Mean Probability Score. outcomevar contains 0/1 values reflecting the actual outcome of the experiment, and forecastvar contains the corresponding probabilities as predicted by, say, logit, probit, or a human forecaster. Option Main group(#) specifies the number of groups that will be used to compute the decomposition. group(10) is the default. Remarks and examples You have a binary (0/1) response and a formula that predicts the corresponding probabilities of having observed a positive outcome (1). If the probabilities were obtained from logistic regression, there are many methods that assess goodness of fit (see, for instance, [R] estat gof). However, the probabilities might be computed from a published formula or from a model fit on another sample, both completely unrelated to the data at hand, or perhaps the forecasts are not from a formula at all. In any case, you now have a test dataset consisting of the forecast probabilities and observed outcomes. Your test dataset might, for instance, record predictions made by a meteorologist on the probability of rain along with a variable recording whether it actually rained. The Brier score is an aggregate measure of disagreement between the observed outcome and a prediction — the average squared error difference. The Brier score decomposition is a partition of the Brier score into components that suggest reasons for discrepancy. 
These reasons fall roughly into three groups: 1) lack of overall calibration between the average predicted probability and the actual probability of the event in your data, 2) misfit of the data in groups defined within your sample, and 3) inability to match actual 0 and 1 responses. 235 236 brier — Brier score decomposition Problem 1 refers to simply overstating or understating the probabilities. Problem 2 refers to what is standardly called a goodness-of-fit test: the data are grouped, and the predictions for the group are compared with the outcomes. Problem 3 refers to an individual-level measure of fit. Imagine that the grouped outcomes are predicted on average correctly but that within the group, the outcomes are poorly predicted. Using logit or probit analysis to fit your data will guarantee that there is no lack of fit due to problem 1, and a good model fitter will be able to avoid problem 2. Problem 3 is inherent in any prediction exercise. Example 1 We have data on the outcomes of 20 basketball games (win) and the probability of victory predicted by a local pundit (for). . use http://www.stata-press.com/data/r13/bball . summarize win for Variable Obs Mean Std. Dev. Min Max win for 20 20 .65 .4785 .4893605 .2147526 0 .15 1 .9 . brier win for, group(5) Mean probability of outcome of forecast 0.6500 0.4785 Correlation ROC area 0.5907 0.8791 Brier score 0.1828 Spiegelhalter’s z-statistic -0.6339 Sanders-modified Brier score 0.1861 Sanders resolution 0.1400 Outcome index variance 0.2275 Murphy resolution 0.0875 Reliability-in-the-small 0.0461 Forecast variance 0.0438 Excess forecast variance 0.0285 Minimum forecast variance 0.0153 Reliability-in-the-large 0.0294 2*Forecast-Outcome-Covar 0.1179 p = 0.0030 p = 0.7369 The mean probabilities of forecast and outcome are simply the mean of the predicted probabilities and the actual outcomes (wins/losses). The correlation is the product-moment correlation between them. The Brier score measures the total difference between the event (winning) and the forecast probability of that event as an average squared difference. As a benchmark, a perfect forecaster would have a Brier score of 0, a perfect misforecaster (predicts probability of win is 1 when loses and 0 when wins) would have a Brier score of 1, and a fence-sitter (forecasts every game as 50/50) would have a Brier score of 0.25. Our pundit is doing reasonably well. Spiegelhalter’s z statistic is a standard normal test statistic for testing whether an individual Brier score is extreme. The ROC area is the area under the receiver operating curve, and the associated test is a test of whether it is greater than 0.5. The more accurate the forecast probabilities, the larger the ROC area. The Sanders-modified Brier score measures the difference between a grouped forecast measure and the event, where the data are grouped by sorting the sample on the forecast and dividing it into brier — Brier score decomposition 237 approximately equally sized groups. The difference between the modified and the unmodified score is typically minimal. For this and the other statistics that require grouping—the Sanders and Murphy resolutions and reliability-in-the-small—to be well-defined, group boundaries are chosen so as not to allocate observations with the same forecast probability to different groups. This task is done by grouping on the forecast using xtile, n(#), with # being the number of groups; see [D] pctile. 
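As an aside to the running example: when the forecast probabilities come from a fitted model rather than a human forecaster, they are generated first and then passed to brier. The sketch below is hypothetical; outcome, x1, x2, and phat are placeholder names, not variables in the basketball dataset.

. logit outcome x1 x2
. predict phat, pr
. brier outcome phat, group(10)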
The Sanders resolution measures error that arises from statistical considerations in evaluating the forecast for a group. A group with all positive or all negative outcomes would have a Sanders resolution of 0; it would most certainly be feasible to predict exactly what happened to each member of the group. If the group had 40% positive responses, on the other hand, a forecast that assigned p = 0.4 to each member of the group would be a good one, and yet, there would be “errors” in the squared difference sense. The “error” would be (1 − 0.4)2 or (0 − 0.4)2 for each member. The Sanders resolution is the average across groups of such “expected” errors. The 0.1400 value in our data from an overall Brier score of 0.1828 or 0.1861 suggests that a substantial portion of the “error” in our data is inherent. Outcome index variance is just the variance of the outcome variable. This is the expected value of the Brier score if all the forecast probabilities were merely the average observed outcome. Remember that a fence-sitter has an expected Brier score of 0.25; a smarter fence sitter (who would guess p = 0.65 for these data) would have a Brier score of 0.2275. The Murphy resolution measures the variation in the average outcomes across groups. If all groups have the same frequency of positive outcomes, little information in any forecast is possible, and the Murphy resolution is 0. If groups differ markedly, the Murphy resolution is as large as 0.25. The 0.0875 means that there is some variation but not a lot, and 0.0875 is probably higher than in most real cases. If you had groups in your data that varied between 40% and 60% positive outcomes, the Murphy resolution would be 0.01; between 30% and 70%, it would be 0.04. Reliability-in-the-small measures the error that comes from the average forecast within group not measuring the average outcome within group — a classical goodness-of-fit measure, with 0 meaning a perfect fit and 1 meaning a complete lack of fit. The calculated value of 0.0461 shows some amount of lack of fit. √ Remember, the number is squared, and we are saying that probabilities could be just more than 0.0461 = 0.215 or 21.5% off. Forecast variance measures the amount of discrimination being attempted — that is, the variation in the forecasted probabilities. A small number indicates a fence-sitter making constant predictions. If the forecasts were from a logistic regression model, forecast variance would tend to increase with the amount of√ information available. Our pundit shows considerable forecast variance of 0.0438 (standard deviation 0.0438 = 0.2093), which is in line with the reliability-in-the-small, suggesting that the forecaster is attempting as much variation as is available in these data. Excess forecast variance is the amount of actual forecast variance over a theoretical minimum. The theoretical minimum — called the minimum forecast variance — corresponds to forecasts of p0 for observations ultimately observed to be negative responses and p1 for observations ultimately observed to be positive outcomes. Moreover, p0 and p1 are set to the average forecasts made for the ultimate negative and positive outcomes. These predictions would be just as good as the predictions the forecaster did make, and any variation in the actual forecast probabilities above this is useless. If this number is large, above 1% – 2%, then the forecaster may be attempting more than is possible. The 0.0285 in our data suggests this possibility. 
Reliability-in-the-large measures the discrepancy between the mean forecast and the observed fraction of positive outcomes. This discrepancy will be 0 for forecasts made by most statistical models — at least when measured on the same sample used for estimation — because√they, by design, reproduce sample means. For our human pundit, the 0.0294 says that there is a 0.0294, or 17percentage-point, difference. (This difference can also be found by calculating the difference in the 238 brier — Brier score decomposition averages of the observed outcomes and forecast probabilities: 0.65 − 0.4785 = 0.17.) That difference, however, is not significant, as we would see if we typed ttest win=for; see [R] ttest. If these data were larger and the bias persisted, this difference would be a critical shortcoming of the forecast. Twice the forecast-outcome covariance is a measure of how accurately the forecast corresponds to the outcome. The concept is similar to that of R-squared in linear regression. Stored results brier stores the following in r(): Scalars r(p roc) r(roc area) r(z) r(p) r(brier) r(brier s) r(sanders) r(oiv) significance of ROC area ROC area Spiegelhalter’s z statistic significance of z statistic Brier score Sanders-modified Brier score Sanders resolution outcome index variance r(murphy) r(relinsm) r(Var f) r(Var fex) r(Var fmin) r(relinla) r(cov 2f) Murphy resolution reliability-in-the-small forecast variance excess forecast variance minimum forecast variance reliability-in-the-large 2×forecast-outcome-covariance Methods and formulas See Wilks (2011) or Schmidt and Griffith (2005) for a discussion of the Brier score. Let dj , j = 1, . . . , N , be the observed outcomes with dj = 0 or dj = 1, and let fj be the corresponding forecasted probabilities that dj is 1, 0 ≤ fj ≤ 1. Assume that the data are ordered so that fj+1 ≥ fj (brier sorts the data to obtain this order). Divide the data into K nearly equally sized groups, with group 1 containing observations 1 through j2 − 1, group 2 containing observations j2 through j3 − 1, and so on. Define f 0 = average fj among dj = 0 f 1 = average fj among dj = 1 f = average fj d = average dj fek = average fj in group k dek = average dj in group k n ek = number of observations in group k The Brier score is P j (dj − fj )2 /N . The Sanders-modified Brier score is P j (dj − fek(j) )2 /N . Let pj denote the true but unknown probability that dj = 1. Under the null hypothesis that pj = fj for all j , Spiegelhalter (1986) determined that the expectation and variance of the Brier score is given by the following: brier — Brier score decomposition 239 N E (Brier) = Var(Brier) = 1X fj (1 − fj ) N j=1 N 1 X 2 fj (1 − fj )(1 − 2fj ) N 2 j=1 Denoting the observed value of the Brier score by O(Brier), Spiegelhalter’s z statistic is given by Z= O(Brier) − E(Brier) p Var(Brier) The corresponding p-value is given by the upper-tail probability of Z under the standard normal distribution. The area under the ROC curve is estimated by applying the trapezoidal rule to the empirical ROC curve. This area is Wilcoxon’s test statistic, so the corresponding p-value is just that of a one-sided Wilcoxon test of the null hypothesis that the distribution of predictions is constant across the two outcomes. P The Sanders resolution is k n ek {dek (1 − dek )}/N . The outcome index variance is d(1 − d). P The Murphy resolution is k n ek (dek − d)2 /N . P Reliability-in-the-small is k n ek (dek − fek )2 /N . P The forecast variance is j (fj − f )2 /N . 
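Restated compactly, with \(\tilde f_k\), \(\tilde d_k\), and \(\tilde n_k\) the within-group averages and counts defined above, \(\bar f\) and \(\bar d\) the overall averages of the forecasts and outcomes, and \(\tilde f_{k(j)}\) the group average for the group containing observation j, the components given so far are

\[
\text{Brier score} = \frac{1}{N}\sum_{j=1}^{N}(d_j-f_j)^2,\qquad
\text{Sanders-modified Brier score} = \frac{1}{N}\sum_{j=1}^{N}\bigl(d_j-\tilde f_{k(j)}\bigr)^2
\]
\[
E(\text{Brier}) = \frac{1}{N}\sum_{j=1}^{N} f_j(1-f_j),\qquad
\mathrm{Var}(\text{Brier}) = \frac{1}{N^2}\sum_{j=1}^{N}(1-2f_j)^2 f_j(1-f_j),\qquad
Z = \frac{O(\text{Brier})-E(\text{Brier})}{\sqrt{\mathrm{Var}(\text{Brier})}}
\]
\[
\text{Sanders resolution} = \frac{1}{N}\sum_{k}\tilde n_k\,\tilde d_k(1-\tilde d_k),\qquad
\text{outcome index variance} = \bar d(1-\bar d),\qquad
\text{Murphy resolution} = \frac{1}{N}\sum_{k}\tilde n_k(\tilde d_k-\bar d)^2
\]
\[
\text{reliability-in-the-small} = \frac{1}{N}\sum_{k}\tilde n_k(\tilde d_k-\tilde f_k)^2,\qquad
\text{forecast variance} = \frac{1}{N}\sum_{j=1}^{N}(f_j-\bar f)^2
\]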
P P 2 2 The minimum forecast variance is j∈S (fj − f 1 ) /N , where F is the j∈F (fj − f 0 ) + set of observations for which dj = 0 and S is the complement. The excess forecast variance is the difference between the forecast variance and the minimum forecast variance. Reliability-in-the-large is (f − d)2 . Twice the outcome covariance is 2(f 1 − f 0 )d(1 − d). Glenn Wilson Brier (1913–1998) was an American meteorological statistician who, after obtaining degrees in physics and statistics, was for many years head of meteorological statistics at the U.S. Weather Bureau in Washington, DC. In the latter part of his career, he was associated with Colorado State University. Brier worked especially on verification and evaluation of predictions and forecasts, statistical decision making, the statistical theory of turbulence, the analysis of weather modification experiments, and the application of permutation techniques. 240 brier — Brier score decomposition Acknowledgment We thank Richard Goldstein for his contributions to this improved version of brier. References Brier, G. W. 1950. Verification of forecasts expressed in terms of probability. Monthly Weather Review 78: 1–3. Goldstein, R. 1996. sg55: Extensions to the brier command. Stata Technical Bulletin 32: 21–22. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 133–134. College Station, TX: Stata Press. Hadorn, D. C., E. B. Keeler, W. H. Rogers, and R. H. Brook. 1993. Assessing the Performance of Mortality Prediction Models. Santa Monica, CA: Rand. Holloway, L., and P. W. Mielke, Jr. 1998. Glenn Wilson Brier 1913–1998. Bulletin of the American Meteorological Society 79: 1438–1439. Jolliffe, I. T., and D. B. Stephenson, ed. 2012. Forecast Verification: A Practitioner’s Guide in Atmospheric Science. 2nd ed. Chichester, UK: Wiley. Murphy, A. H. 1973. A new vector partition of the probability score. Journal of Applied Meteorology 12: 595–600. . 1997. Forecast verification. In Economic Value of Weather and Climate Forecasts, ed. R. W. Katz and A. H. Murphy, 19–74. Cambridge: Cambridge University Press. Redelmeier, D. A., D. A. Bloch, and D. H. Hickam. 1991. Assessing predictive accuracy: How to compare Brier scores. Journal of Clinical Epidemiology 44: 1141–1146. Rogers, W. H. 1992. sbe9: Brier score decomposition. Stata Technical Bulletin 10: 20–22. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 92–94. College Station, TX: Stata Press. Sanders, F. 1963. On subjective probability forecasting. Journal of Applied Meteorology 2: 191–201. Schmidt, C. H., and J. L. Griffith. 2005. Multivariate classification rules: Calibration and discrimination. In Vol. 2 of Encyclopedia of Biostatistics, ed. P. Armitage and T. Colton, 3492–3494. Chichester, UK: Wiley. Spiegelhalter, D. J. 1986. Probabilistic prediction in patient management and clinical trials. Statistics in Medicine 5: 421–433. Von Storch, H., and F. W. Zwiers. 1999. Statistical Analysis in Climate Research. Cambridge: Cambridge University Press. Wilks, D. S. 2011. Statistical Methods in the Atmospheric Sciences. 3rd ed. Waltham, MA: Academic Press. Yates, J. F. 1982. External correspondence: Decompositions of the mean probability score. Organizational Behavior and Human Performance 30: 132–156. 
Also see [R] logistic — Logistic regression, reporting odds ratios [R] logit — Logistic regression, reporting coefficients [R] probit — Probit regression Title bsample — Sampling with replacement Syntax Remarks and examples Menu References Description Also see Options Syntax bsample exp if in , options where exp is a standard Stata expression; see [U] 13 Functions and expressions. options Description strata(varlist) cluster(varlist) idcluster(newvar) weight(varname) variables identifying strata variables identifying resampling clusters create new cluster ID variable replace varname with frequency weights Menu Statistics > Resampling > Draw bootstrap sample Description bsample draws bootstrap samples (random samples with replacement) from the data in memory. exp specifies the size of the sample, which must be less than or equal to the number of sampling units in the data. The observed number of units is the default when exp is not specified. For bootstrap sampling of the observations, exp must be less than or equal to N (the number of observations in the data; see [U] 13.4 System variables ( variables)). For stratified bootstrap sampling, exp must be less than or equal to N within the strata identified by the strata() option. For clustered bootstrap sampling, exp must be less than or equal to Nc (the number of clusters identified by the cluster() option). For stratified bootstrap sampling of clusters, exp must be less than or equal to Nc within the strata identified by the strata() option. Observations that do not meet the optional if and in criteria are dropped (not sampled). Options strata(varlist) specifies the variables identifying strata. If strata() is specified, bootstrap samples are selected within each stratum. cluster(varlist) specifies the variables identifying resampling clusters. If cluster() is specified, the sample drawn during each replication is a bootstrap sample of clusters. 241 242 bsample — Sampling with replacement idcluster(newvar) creates a new variable containing a unique identifier for each resampled cluster. weight(varname) specifies a variable in which the sampling frequencies will be placed. varname must be an existing variable, which will be replaced. After bsample, varname can be used as an fweight in any Stata command that accepts fweights, which can speed up resampling for commands like regress and summarize. This option cannot be combined with idcluster(). By default, bsample replaces the data in memory with the sampled observations; however, specifying the weight() option causes only the specified varname to be changed. Remarks and examples Below is a series of examples illustrating how bsample is used with various sampling schemes. Example 1: Bootstrap sampling We have data on the characteristics of hospital patients and wish to draw a bootstrap sample of 200 patients. We type . use http://www.stata-press.com/data/r13/bsample1 . bsample 200 . count 200 Example 2: Stratified samples with equal sizes Among the variables in our dataset is female, an indicator for the female patients. To get a bootstrap sample of 200 female patients and 200 male patients, we type . use http://www.stata-press.com/data/r13/bsample1, clear . bsample 200, strata(female) . tabulate female Freq. Percent Cum. 
female male female 200 200 50.00 50.00 Total 400 100.00 50.00 100.00 bsample — Sampling with replacement 243 Example 3: Stratified samples with unequal sizes To sample 300 females and 200 males, we must generate a variable that is 300 for females and 200 for males and then use this variable in exp when we call bsample. . . . . use http://www.stata-press.com/data/r13/bsample1, clear generate nsamp = cond(female,300,200) bsample nsamp, strata(female) tabulate female female Freq. Percent Cum. male female 200 300 40.00 60.00 Total 500 100.00 40.00 100.00 Example 4: Stratified samples with proportional sizes Our original dataset has 2,392 males and 3,418 females. . use http://www.stata-press.com/data/r13/bsample1, clear . tabulate female female Freq. Percent Cum. male female 2,392 3,418 41.17 58.83 Total 5,810 100.00 41.17 100.00 To sample 10% from females and males, we type . bsample round(0.1*_N), strata(female) bsample requires that the specified size of the sample be an integer, so we use the round() function to obtain the nearest integer to 0.1 × 2392 and 0.1 × 3418. Our sample now has 239 males and 342 females: . tabulate female female Freq. Percent Cum. male female 239 342 41.14 58.86 41.14 100.00 Total 581 100.00 Example 5: Samples satisfying a condition For a bootstrap sample of 200 female patients, we type . use http://www.stata-press.com/data/r13/bsample1, clear . bsample 200 if female . tabulate female female Freq. Percent Cum. female 200 100.00 Total 200 100.00 100.00 244 bsample — Sampling with replacement Example 6: Generating frequency weights To identify the sampled observations using frequency weights instead of dropping unsampled observations, we use the weight() option (we will need to supply it an existing variable name) and type . use http://www.stata-press.com/data/r13/bsample1, clear . set seed 1234 . generate fw = . (5810 missing values generated) . bsample 200 if female, weight(fw) . tabulate fw female female fw male female Total 0 1 2 2,392 0 0 3,221 194 3 5,613 194 3 Total 2,392 3,418 5,810 Note that (194 × 1) + (3 × 2) = 200. Example 7: Oversampling observations bsample requires the expression in exp to evaluate to a number that is less than or equal to the number of observations. To sample twice as many male and female patients as there are already in memory, we must expand the data before using bsample. For example, . use http://www.stata-press.com/data/r13/bsample1, clear . set seed 1234 . expand 2 (5810 observations created) . bsample, strata(female) . tabulate female Freq. female Percent Cum. 41.17 100.00 male female 4,784 6,836 41.17 58.83 Total 11,620 100.00 bsample — Sampling with replacement 245 Example 8: Stratified oversampling with unequal sizes To sample twice as many female patients as male patients, we must expand the records for the female patients because there are less than twice as many of them as there are male patients, but first put the number of observed male patients in a local macro. After expanding the female records, we generate a variable that contains the number of observations to sample within the two groups. . use http://www.stata-press.com/data/r13/bsample1, clear . set seed 1234 . count if !female 2392 . local nmale = r(N) . expand 2 if female (3418 observations created) . generate nsamp = cond(female,2*‘nmale’,‘nmale’) . bsample nsamp, strata(female) . tabulate female female Freq. Percent Cum. 
male female 2,392 4,784 33.33 66.67 33.33 100.00 Total 7,176 100.00 Example 9: Oversampling of clusters For clustered data, sampling more clusters than are present in the original dataset requires more than just expanding the data. To illustrate, suppose we wanted a bootstrap sample of eight clusters from a dataset consisting of five clusters of observations. . use http://www.stata-press.com/data/r13/bsample2, clear . tabstat x, stat(n mean) by(group) Summary for variables: x by categories of: group group N mean A B C D E 15 -.3073028 10 -.00984 11 .0810985 11 -.1989179 29 -.095203 Total 76 -.1153269 bsample will complain if we simply expand the dataset. . use http://www.stata-press.com/data/r13/bsample2 . expand 3 (152 observations created) . bsample 8, cluster(group) resample size must not be greater than number of clusters r(498); 246 bsample — Sampling with replacement Expanding the data will only partly solve the problem. We also need a new variable that uniquely identifies the copied clusters. We use the expandcl command to accomplish both these tasks; see [D] expandcl. . use http://www.stata-press.com/data/r13/bsample2, clear . set seed 1234 . expandcl 2, generate(expgroup) cluster(group) (76 observations created) . tabstat x, stat(n mean) by(expgroup) Summary for variables: x by categories of: expgroup expgroup N mean 1 2 3 4 5 6 7 8 9 10 15 15 10 10 11 11 11 11 29 29 Total -.3073028 -.3073028 -.00984 -.00984 .0810985 .0810985 -.1989179 -.1989179 -.095203 -.095203 152 -.1153269 . generate fw = . (152 missing values generated) . bsample 8, cluster(expgroup) weight(fw) . tabulate fw group group fw A B C D E Total 0 1 2 15 15 0 10 10 0 0 22 0 0 22 0 29 0 29 54 69 29 Total 30 20 22 22 58 152 The results from tabulate on the generated frequency weight variable versus the original cluster ID (group) show us that the bootstrap sample contains one copy of cluster A, one copy of cluster B, two copies of cluster C, two copies of cluster D, and two copies of cluster E (1 + 1 + 2 + 2 + 2 = 8). bsample — Sampling with replacement 247 Example 10: Stratified oversampling of clusters Suppose that we have a dataset containing two strata with five clusters in each stratum, but the cluster identifiers are not unique between the strata. To get a stratified bootstrap sample with eight clusters in each stratum, we first use expandcl to expand the data and get a new cluster ID variable. We use cluster(strid group) in the call to expandcl; this action will uniquely identify the 2 ∗ 5 = 10 clusters across the strata. . use http://www.stata-press.com/data/r13/bsample2, clear . set seed 1234 . tabulate group strid strid group 1 2 Total A B C D E 7 5 5 5 14 8 5 6 6 15 15 10 11 11 29 Total 36 40 76 . expandcl 2, generate(expgroup) cluster(strid group) (76 observations created) Now we can use bsample with the expanded data, stratum ID variable, and new cluster ID variable. . generate fw = . (152 missing values generated) . bsample 8, cluster(expgroup) str(strid) weight(fw) . by strid, sort: tabulate fw group -> strid = 1 group fw A B C D E Total 0 1 2 0 14 0 5 5 0 0 10 0 5 5 0 14 0 14 24 34 14 Total 14 10 10 10 28 72 fw A B C D E Total 0 1 2 8 8 0 10 0 0 0 6 6 6 6 0 0 15 15 24 35 21 Total 16 10 12 12 30 80 -> strid = 2 group The results from by strid: tabulate on the generated frequency weight variable versus the original cluster ID (group) show us how many times each cluster was sampled for each stratum. 
For stratum 1, the bootstrap sample contains two copies of cluster A, one copy of cluster B, two copies of cluster C, one copy of cluster D, and two copies of cluster E (2 + 1 + 2 + 1 + 2 = 8). For stratum 2, the bootstrap sample contains one copy of cluster A, zero copies of cluster B, three copies of cluster C, one copy of cluster D, and three copies of cluster E (1 + 0 + 3 + 1 + 3 = 8). 248 bsample — Sampling with replacement References Gould, W. W. 2012a. Using Stata’s random-number generators, part 2: Drawing without replacement. The Stata Blog: Not Elsewhere Classified. http://blog.stata.com/2012/08/03/using-statas-random-number-generators-part-2-drawing-without-replacement/. . 2012b. Using Stata’s random-number generators, part 3: Drawing with replacement. The Stata Blog: Not Elsewhere Classified. http://blog.stata.com/2012/08/29/using-statas-random-number-generators-part-3-drawingwith-replacement/. Also see [R] bootstrap — Bootstrap sampling and estimation [R] bstat — Report bootstrap results [R] simulate — Monte Carlo simulations [D] sample — Draw random sample Title bstat — Report bootstrap results Syntax Remarks and examples Menu Stored results Description References Options Also see Syntax Bootstrap statistics from variables bstat varlist if in , options Bootstrap statistics from file bstat namelist using filename if in , options Description options Main stat(vector) accel(vector) ties mse observed values for each statistic acceleration values for each statistic adjust BC/BCa confidence intervals for ties use MSE formula for variance estimation Reporting level(#) n(#) notable noheader nolegend verbose title(text) display options set confidence level; default is level(95) # of observations from which bootstrap samples were taken suppress table of results suppress table header suppress table legend display the full table legend use text as title for bootstrap results control column formats and line width See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Resampling > Report bootstrap results Description bstat is a programmer’s command that computes and displays estimation results from bootstrap statistics. For each variable in varlist (the default is all variables), then bstat computes a covariance matrix, estimates bias, and constructs several different confidence intervals (CIs). The following CIs are constructed by bstat: 249 250 bstat — Report bootstrap results 1. Normal CIs (using the normal approximation) 2. Percentile CIs 3. Bias-corrected (BC) CIs 4. Bias-corrected and accelerated (BCa ) CIs (optional) estat bootstrap displays a table of one or more of the above confidence intervals; see [R] bootstrap postestimation. If there are bootstrap estimation results in e(), bstat replays them. If given the using modifier, bstat uses the data in filename to compute the bootstrap statistics while preserving the data currently in memory. Otherwise, bstat uses the data in memory to compute the bootstrap statistics. The following options may be used to replay estimation results from bstat: level(#) notable noheader nolegend verbose title(text) For all other options and the qualifiers using, if, and in, bstat requires a bootstrap dataset. Options Main stat(vector) specifies the observed value of each statistic (that is, the value of the statistic using the original dataset). accel(vector) specifies the acceleration of each statistic, which is used to construct BCa CIs. 
ties specifies that bstat adjust for ties in the replicate values when computing the median bias used to construct BC and BCa CIs. mse specifies that bstat compute the variance by using deviations of the replicates from the observed value of the statistics. By default, bstat computes the variance by using deviations from the average of the replicates. Reporting level(#); see [R] estimation options. n(#) specifies the number of observations from which bootstrap samples were taken. This value is used in no calculations but improves the table header when this information is not saved in the bootstrap dataset. notable suppresses the display of the output table. noheader suppresses the display of the table header. This option implies nolegend. nolegend suppresses the display of the table legend. verbose specifies that the full table legend be displayed. By default, coefficients and standard errors are not displayed. title(text) specifies a title to be displayed above the table of bootstrap results; the default title is Bootstrap results. display options: cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. bstat — Report bootstrap results 251 Remarks and examples Remarks are presented under the following headings: Bootstrap datasets Creating a bootstrap dataset Bootstrap datasets Although bstat allows you to specify the observed value and acceleration of each bootstrap statistic via the stat() and accel() options, programmers may be interested in what bstat uses when these options are not supplied. When working from a bootstrap dataset, bstat first checks the data characteristics (see [P] char) that it understands: dta[bs version] identifies the version of the bootstrap dataset. This characteristic may be empty (not defined), 2, or 3; otherwise, bstat will quit and display an error message. This version tells bstat which other characteristics to look for in the bootstrap dataset. bstat uses the following characteristics from version 3 bootstrap datasets: dta[N] dta[N strata] dta[N cluster] dta[command] varname[observed] varname[acceleration] varname[expression] bstat uses the following characteristics from version 2 bootstrap datasets: dta[N] dta[N strata] dta[N cluster] varname[observed] varname[acceleration] An empty bootstrap dataset version implies that the dataset was created by the bstrap command in a version of Stata earlier than Stata 8. Here bstat expects varname[bstrap] to contain the observed value of the statistic identified by varname (varname[observed] in version 2). All other characteristics are ignored. dta[N] is the number of observations in the observed dataset. This characteristic may be overruled by specifying the n() option. dta[N strata] is the number of strata in the observed dataset. dta[N cluster] is the number of clusters in the observed dataset. dta[command] is the command used to compute the observed values of the statistics. varname[observed] is the observed value of the statistic identified by varname. To specify a different value, use the stat() option. varname[acceleration] is the estimate of acceleration for the statistic identified by varname. To specify a different value, use the accel() option. varname[expression] is the expression or label that describes the statistic identified by varname. 252 bstat — Report bootstrap results Creating a bootstrap dataset Suppose that we are interested in obtaining bootstrap statistics by resampling the residuals from a regression (which is not possible with the bootstrap command). 
After loading some data, we run a regression, save some results relevant to the bstat command, and save the residuals in a new variable, res. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress mpg weight length Source SS df MS Model Residual 1616.08062 827.378835 2 71 808.040312 11.653223 Total 2443.45946 73 33.4720474 mpg Coef. weight length _cons -.0038515 -.0795935 47.88487 Std. Err. .001586 .0553577 6.08787 t -2.43 -1.44 7.87 Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.018 0.155 0.000 = = = = = = 74 69.34 0.0000 0.6614 0.6519 3.4137 [95% Conf. Interval] -.0070138 -.1899736 35.746 -.0006891 .0307867 60.02374 . matrix b = e(b) . local n = e(N) . predict res, residuals We can resample the residual values in res by generating a random observation ID (rid), generate a new response variable (y), and run the original regression with the new response variables. . set seed 54321 . generate rid = int(_N*runiform())+1 . matrix score double y = b . replace y = y + res[rid] (74 real changes made) . regress y weight length Source SS df MS Model Residual 1773.23548 608.747732 2 71 886.617741 8.57391172 Total 2381.98321 73 32.629907 y Coef. weight length _cons -.0059938 -.0127875 42.23195 Std. Err. .0013604 .0474837 5.22194 t -4.41 -0.27 8.09 Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.788 0.000 = = = = = = 74 103.41 0.0000 0.7444 0.7372 2.9281 [95% Conf. Interval] -.0087064 -.1074673 31.8197 -.0032813 .0818924 52.6442 Instead of programming this resampling inside a loop, it is much more convenient to write a short program and use the simulate command; see [R] simulate. In the following, mysim r requires the user to specify a coefficient vector and a residual variable. mysim r then retrieves the list of predictor variables (removing cons from the list), generates a new temporary response variable with the resampled residuals, and regresses the new response variable on the predictors. bstat — Report bootstrap results 253 program mysim_r version 13 syntax name(name=bvector), res(varname) tempvar y rid local xvars : colnames ‘bvector’ local cons _cons local xvars : list xvars - cons matrix score double ‘y’ = ‘bvector’ gen long ‘rid’ = int(_N*runiform()) + 1 replace ‘y’ = ‘y’ + ‘res’[‘rid’] regress ‘y’ ‘xvars’ end We can now give mysim r a test run, but we first set the random-number seed (to reproduce results). . set seed 54321 . mysim_r b, res(res) (74 real changes made) Source SS df MS Model Residual 1773.23548 608.747732 2 71 886.617741 8.57391172 Total 2381.98321 73 32.629907 __000000 Coef. weight length _cons -.0059938 -.0127875 42.23195 Std. Err. .0013604 .0474837 5.22194 t -4.41 -0.27 8.09 Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.788 0.000 = = = = = = 74 103.41 0.0000 0.7444 0.7372 2.9281 [95% Conf. Interval] -.0087064 -.1074673 31.8197 -.0032813 .0818924 52.6442 Now that we have a program that will compute the results we want, we can use simulate to generate a bootstrap dataset and bstat to display the results. . set seed 54321 . simulate, reps(200) nodots: mysim_r b, res(res) command: mysim_r b, res(res) . bstat, stat(b) n(‘n’) Bootstrap results _b_weight _b_length _b_cons Number of obs Replications Observed Coef. Bootstrap Std. Err. -.0038515 -.0795935 47.88487 .0015715 .0552415 6.150069 z -2.45 -1.44 7.79 P>|z| 0.014 0.150 0.000 = = 74 200 Normal-based [95% Conf. 
Interval] -.0069316 -.1878649 35.83096 -.0007713 .0286779 59.93879 Finally, we see that simulate created some of the data characteristics recognized by bstat. All we need to do is correctly specify the version of the bootstrap dataset, and bstat will automatically use the relevant data characteristics. 254 bstat — Report bootstrap results . char list _dta[seed]: _dta[command]: _b_weight[is_eexp]: _b_weight[colname]: _b_weight[coleq]: _b_weight[expression]: _b_length[is_eexp]: _b_length[colname]: _b_length[coleq]: _b_length[expression]: _b_cons[is_eexp]: _b_cons[colname]: _b_cons[coleq]: _b_cons[expression]: . char _dta[bs_version] 3 . bstat, stat(b) n(‘n’) Bootstrap results command: weight length _cons X681014b5c43f462544a474abacbdd93d00042842 mysim_r b, res(res) 1 weight _ _b[weight] 1 length _ _b[length] 1 _cons _ _b[_cons] Number of obs Replications = = 74 200 mysim_r b, res(res) Observed Coef. Bootstrap Std. Err. -.0038515 -.0795935 47.88487 .0015715 .0552415 6.150069 z -2.45 -1.44 7.79 P>|z| 0.014 0.150 0.000 See Poi (2004) for another example of residual resampling. Normal-based [95% Conf. Interval] -.0069316 -.1878649 35.83096 -.0007713 .0286779 59.93879 bstat — Report bootstrap results 255 Stored results bstat stores the following in e(): Scalars e(N) e(N reps) e(N misreps) e(N strata) e(N clust) e(k aux) e(k eq) e(k exp) e(k eexp) e(k extra) e(level) e(bs version) e(rank) Macros e(cmd) e(command) e(cmdline) e(title) e(exp#) e(prefix) e(ties) e(mse) e(vce) e(vcetype) e(properties) Matrices e(b) e(b bs) e(reps) e(bias) e(se) e(z0) e(accel) e(ci normal) e(ci percentile) e(ci bc) e(ci bca) e(V) sample size number of complete replications number of incomplete replications number of strata number of clusters number of auxiliary parameters number of equations in e(b) number of standard expressions number of extended expressions (i.e., b) number of extra equations beyond the original ones from e(b) confidence level for bootstrap CIs version for bootstrap results rank of e(V) bstat from dta[command] command as typed title in estimation output expression for the #th statistic bootstrap ties, if specified mse, if specified bootstrap title used to label Std. Err. b V observed statistics bootstrap estimates number of nonmissing results estimated biases estimated standard errors median biases estimated accelerations normal-approximation CIs percentile CIs bias-corrected CIs bias-corrected and accelerated CIs bootstrap variance–covariance matrix References Ng, E. S.-W., R. Grieve, and J. R. Carpenter. 2013. Two-stage nonparametric bootstrap sampling with shrinkage correction for clustered data. Stata Journal 13: 141–164. Poi, B. P. 2004. From the help desk: Some bootstrapping techniques. Stata Journal 4: 312–328. Also see [R] bootstrap postestimation — Postestimation tools for bootstrap [R] bootstrap — Bootstrap sampling and estimation [R] bsample — Sampling with replacement Title centile — Report centile and confidence interval Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Options Acknowledgment Syntax centile varlist if in , options Description options Main centile(numlist) report specified centiles; default is centile(50) Options binomial exact; conservative confidence interval normal, based on observed centiles normal, based on mean and standard deviation set confidence level; default is level(95) cci normal meansd level(#) by is allowed; see [D] by. 
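For instance, a command such as the following (an illustrative invocation added here, not part of the original entry; it assumes the variables mpg and price and the 0/1 indicator foreign are in memory) requests the 10th, 50th, and 90th centiles of mpg and price among foreign cars:

. centile mpg price if foreign, centile(10 50 90)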
Menu Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Centiles with CIs Description centile estimates specified centiles and calculates confidence intervals. If no varlist is specified, centile calculates centiles for all the variables in the dataset. If centile() is not specified, medians (centile(50)) are reported. By default, centile uses a binomial method for obtaining confidence intervals that makes no assumptions about the underlying distribution of the variable. Options Main centile(numlist) specifies the centiles to be reported. The default is to display the 50th centile. Specifying centile(5) requests that the fifth centile be reported. Specifying centile(5 50 95) requests that the 5th, 50th, and 95th centiles be reported. Specifying centile(10(10)90) requests that the 10th, 20th, . . . , 90th centiles be reported; see [U] 11.1.8 numlist. Options cci (conservative confidence interval) forces the confidence limits to fall exactly on sample values. Confidence intervals displayed with the cci option are slightly wider than those with the default (nocci) option. 256 centile — Report centile and confidence interval 257 normal causes the confidence interval to be calculated by using a formula for the standard error of a normal-distribution quantile given by Kendall and Stuart (1969, 237). The normal option is useful when you want empirical centiles — that is, centiles based on sample order statistics rather than on the mean and standard deviation — and are willing to assume normality. meansd causes the centile and confidence interval to be calculated based on the sample mean and standard deviation, and it assumes normality. level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [R] level. Remarks and examples The q th centile of a continuous random variable, X , is defined as the value of Cq , which fulfills the condition Pr(X ≤ Cq ) = q/100. The value of q must be in the range 0 < q < 100, though q is not necessarily an integer. By default, centile estimates Cq for the variables in varlist and for the values of q given in centile(numlist). It makes no assumptions about the distribution of X , and, if necessary, uses linear interpolation between neighboring sample values. Extreme centiles (for example, the 99th centile in samples smaller than 100) are fixed at the minimum or maximum sample value. An “exact” confidence interval for Cq is also given, using the binomial-based method described below in Methods and formulas and in Conover (1999, 143–148). Again linear interpolation is used to improve the accuracy of the estimated confidence limits, but extremes are fixed at the minimum or maximum sample value. You can prevent centile from interpolating when calculating binomial-based confidence intervals by specifying cci. The resulting intervals are generally wider than with the default; that is, the coverage (confidence level) tends to be greater than the nominal value (given as usual by level(#), by default 95%). If the data are believed to be normally distributed (a common case), there are two alternative methods for estimating centiles. If normal is specified, Cq is calculated, as just described, but its confidence interval is based on a formula for the standard error (se) of a normal-distribution quantile given by Kendall and Stuart (1969, 237). 
If meansd is alternatively specified, Cq is estimated as x + zq × s, where x and s are the sample mean and standard deviation, and zq is the q th centile of the standard normal distribution (for example, z95 = 1.645). The confidence interval is derived from the se of the estimate of Cq . Example 1 Using auto.dta, we estimate the 5th, 50th, and 95th centiles of the price variable: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . format price %8.2fc . centile price, centile(5 50 95) Variable Obs price 74 Percentile 5 50 95 Centile 3,727.75 5,006.50 13,498.00 Binom. Interp. [95% Conf. Interval] 3,291.23 4,593.57 11,061.53 3,914.16 5,717.90 15,865.30 summarize produces somewhat different results from centile; see Methods and formulas. 258 centile — Report centile and confidence interval . summarize price, detail Price 1% 5% 10% 25% 50% 75% 90% 95% 99% Percentiles 3291 3748 3895 4195 Smallest 3291 3299 3667 3748 5006.5 Largest 13466 13594 14500 15906 6342 11385 13466 15906 Obs Sum of Wgt. 74 74 Mean Std. Dev. 6165.257 2949.496 Variance Skewness Kurtosis 8699526 1.653434 4.819188 The confidence limits produced by using the cci option are slightly wider than those produced without this option: . centile price, c(5 50 95) cci Variable Obs price 74 Percentile 5 50 95 Centile 3,727.75 5,006.50 13,498.00 Binomial Exact [95% Conf. Interval] 3,291.00 4,589.00 10,372.00 3,955.00 5,719.00 15,906.00 If we are willing to assume that price is normally distributed, we could include either the normal or the meansd option: . centile price, c(5 50 95) normal Variable Obs price 74 Percentile 5 50 95 Normal, based on observed centiles Centile [95% Conf. Interval] 3,727.75 5,006.50 13,498.00 3,211.19 4,096.68 5,426.81 4,244.31 5,916.32 21,569.19 . centile price, c(5 50 95) meansd Variable Obs price 74 Percentile 5 50 95 Normal, based on mean and std. dev. Centile [95% Conf. Interval] 1,313.77 6,165.26 11,016.75 278.93 5,493.24 9,981.90 2,348.61 6,837.27 12,051.59 With the normal option, the centile estimates are, by definition, the same as before. The confidence intervals for the 5th and 50th centiles are similar to the previous ones, but the interval for the 95th centile is different. The results using the meansd option also differ from both previous sets of estimates. We can use sktest (see [R] sktest) to check the correctness of the normality assumption: . sktest price Skewness/Kurtosis tests for Normality Variable Obs Pr(Skewness) Pr(Kurtosis) adj chi2(2) price 74 0.0000 0.0127 21.77 joint Prob>chi2 0.0000 centile — Report centile and confidence interval 259 sktest reveals that price is definitely not normally distributed, so the normal assumption is not reasonable, and the normal and meansd options are not appropriate for these data. We should rely on the results from the default choice, which does not assume normality. If the data are normally distributed, however, the precision of the estimated centiles and their confidence intervals will be ordered (best) meansd > normal > [default] (worst). The normal option is useful when we really do want empirical centiles (that is, centiles based on sample order statistics rather than on the mean and standard deviation) but are willing to assume normality. 
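Because the by prefix is allowed with centile, centiles can also be reported separately by group. As a brief sketch (added for illustration and not part of the original example; it assumes auto.dta is still in memory), the quartiles of price for domestic and foreign cars could be obtained with

. by foreign, sort: centile price, centile(25 50 75)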
Stored results centile stores the following in r(): Scalars r(N) r(n cent) r(c #) r(lb #) r(ub #) Macros r(centiles) number of observations number of centiles requested value of # centile #-requested centile lower confidence bound #-requested centile upper confidence bound centiles requested Methods and formulas Methods and formulas are presented under the following headings: Default case Normal case meansd case Default case The calculation is based on the method of Mood and Graybill (1963, 408). Let x1 ≤ x2 ≤ · · · ≤ xn be a sample of size n arranged in ascending order. Denote the estimated q th centile of the x’s as cq . We require that 0 < q < 100. Let R = (n + 1)q/100 have integer part r and fractional part f ; that is, r = int(R) and f = R − r. (If R is itself an integer, then r = R and f = 0.) Note that 0 ≤ r ≤ n. For convenience, define x0 = x1 and xn+1 = xn . Cq is estimated by cq = xr + f × (xr+1 − xr ) that is, cq is a weighted average of xr and xr+1 . Loosely speaking, a (conservative) p% confidence interval for Cq involves finding the observations ranked t and u, which correspond, respectively, to the α = (100 − p)/200 and 1 − α quantiles of a binomial distribution with parameters n and q/100, that is, B(n, q/100). More precisely, define the ith value (i = 0, . . . , n) of the cumulative binomial distribution function as Fi = Pr(S ≤ i), where S has distribution B(n, q/100). For convenience, let F−1 = 0 and Fn+1 = 1. t is found such that Ft ≤ α and Ft+1 > α, and u is found such that 1 − Fu ≤ α and 1 − Fu−1 > α. With the cci option in force, the (conservative) confidence interval is (xt+1 , xu+1 ), and its actual coverage probability is Fu − Ft . 260 centile — Report centile and confidence interval The default case uses linear interpolation on the Fi as follows. Let g = (α − Ft )/(Ft+1 − Ft ) h = {α − (1 − Fu )}/{(1 − Fu−1 ) − (1 − Fu )} = (α − 1 + Fu )/(Fu − Fu−1 ) The interpolated lower and upper confidence limits (cqL , cqU ) for Cq are cqL = xt+1 + g × (xt+2 − xt+1 ) cqU = xu+1 − h × (xu+1 − xu ) Suppose that we want a 95% confidence interval for the median of a sample of size 13. n = 13, q = 50, p = 95, α = 0.025, R = 14 × 50/100 = 7, and f = 0. Therefore, the median is the 7th observation. Some example data, xi , and the values of Fi are as follows: i 0 1 2 3 4 5 6 Fi 1 − Fi 0.0001 0.9999 0.0017 0.9983 0.0112 0.9888 0.0461 0.9539 0.1334 0.8666 0.2905 0.7095 0.5000 0.5000 xi – 5 7 10 15 23 28 i 7 8 9 10 11 12 13 F i 1 − Fi xi 0.7095 0.2905 33 0.8666 0.1334 37 0.9539 0.0461 45 0.9888 0.0112 59 0.9983 0.0017 77 0.9999 0.0001 104 1.0000 0.0000 211 The median is x7 = 33. Also, F2 ≤ 0.025 and F3 > 0.025, so t = 2; 1 − F10 ≤ 0.025 and 1 − F9 > 0.025, so u = 10. The conservative confidence interval is therefore (c50L , c50U ) = (xt+1 , xu+1 ) = (x3 , x11 ) = (10, 77) with actual coverage F10 − F2 = 0.9888 − 0.0112 = 0.9776 (97.8% confidence). For the interpolation calculation, we have g = (0.025 − 0.0112)/(0.0461 − 0.0112) = 0.395 h = (0.025 − 1 + 0.9888)/(0.9888 − 0.9539) = 0.395 So, c50L = x3 + 0.395 × (x4 − x3 ) = 10 + 0.395 × 5 = 11.98 c50U = x11 − 0.395 × (x11 − x10 ) = 77 − 0.395 × 18 = 69.89 Normal case The value of cq is as above. Its se is given by the formula n o p √ sq = q(100 − q) 100 nZ(cq ; x, s) where x and s are the mean and standard deviation of the xi , and √ 2 2 Z(Y ; µ, σ) = 1 2πσ 2 e−(Y −µ) /2σ is the density function of a normally distributed variable Y with mean µ and standard deviation σ . 
The confidence interval for Cq is (cq − z100(1−α) sq , cq + z100(1−α) sq ). centile — Report centile and confidence interval 261 meansd case The value of cq is x + zq × s. Its se is given by the formula q s?q = s 1/n + zq2 /(2n − 2) The confidence interval for Cq is (cq − z100(1−α) × s?q , cq + z100(1−α) × s?q ). Acknowledgment centile was written by Patrick Royston of the MRC Clinical Trials Unit, London, and coauthor of the Stata Press book Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model. References Conover, W. J. 1999. Practical Nonparametric Statistics. 3rd ed. New York: Wiley. Kendall, M. G., and A. Stuart. 1969. The Advanced Theory of Statistics, Vol. 1: Distribution Theory. 3rd ed. London: Griffin. Mood, A. M., and F. A. Graybill. 1963. Introduction to the Theory of Statistics. 2nd ed. New York: McGraw–Hill. Newson, R. B. 2000. snp16: Robust confidence intervals for median and other percentile differences between two groups. Stata Technical Bulletin 58: 30–35. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 324–331. College Station, TX: Stata Press. Royston, P. 1992. sg7: Centile estimation command. Stata Technical Bulletin 8: 12–15. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 122–125. College Station, TX: Stata Press. Stuart, A., and J. K. Ord. 1994. Kendall’s Advanced Theory of Statistics: Distribution Theory, Vol I. 6th ed. London: Arnold. Also see [R] ci — Confidence intervals for means, proportions, and counts [R] summarize — Summary statistics [D] pctile — Create variable containing percentiles Title ci — Confidence intervals for means, proportions, and counts Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Options Acknowledgment Syntax Syntax for ci ci varlist if in weight , options Immediate command for variable distributed as normal cii # obs # mean # sd , ciin option Immediate command for variable distributed as binomial cii # obs # succ , ciib options Immediate command for variable distributed as Poisson cii # exposure # events , poisson ciip options options Description Main binomial poisson exposure(varname) exact wald wilson agresti jeffreys total separator(#) level(#) binomial 0/1 variables; compute exact confidence intervals Poisson variables; compute exact confidence intervals exposure variable; implies poisson calculate exact confidence intervals; the default calculate Wald confidence intervals calculate Wilson confidence intervals calculate Agresti–Coull confidence intervals calculate Jeffreys confidence intervals add output for all groups combined (for use with by only) draw separator line after every # variables; default is separator(5) set confidence level; default is level(95) by is allowed with ci; see [D] by. aweights and fweights are allowed, but aweights may not be specified with the binomial or poisson options; see [U] 11.1.6 weight. ciin option Description level(#) set confidence level; default is level(95) 262 ci — Confidence intervals for means, proportions, and counts ∗ ciib options Description level(#) exact wald wilson agresti jeffreys set confidence level; default is level(95) calculate exact confidence intervals; the default calculate Wald confidence intervals calculate Wilson confidence intervals calculate Agresti–Coull confidence intervals calculate Jeffreys confidence intervals ciip options Description poisson level(#) numbers are Poisson-distributed counts set confidence level; default is level(95) ∗ 263 poisson is required. 
Menu ci Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Confidence intervals > Normal CI calculator > Binomial CI calculator > Poisson CI calculator cii for variable distributed as normal Statistics > Summaries, tables, and tests > Summary and descriptive statistics cii for variable distributed as binomial Statistics > Summaries, tables, and tests > Summary and descriptive statistics cii for variable distributed as Poisson Statistics > Summaries, tables, and tests > Summary and descriptive statistics Description ci computes standard errors and confidence intervals for each of the variables in varlist. cii is the immediate form of ci; see [U] 19 Immediate commands for a general discussion of immediate commands. In the binomial and Poisson variants of cii, the second number specified (#succ or #events ) must be an integer or between 0 and 1. If the number is between 0 and 1, Stata interprets it as the fraction of successes or events and converts it to an integer number representing the number of successes or events. The computation then proceeds as if two integers had been specified. Options Main binomial tells ci that the variables are 0/1 variables and that binomial confidence intervals will be calculated. (cii produces binomial confidence intervals when only two numbers are specified.) poisson specifies that the variables (or numbers for cii) are Poisson-distributed counts; exact Poisson confidence intervals will be calculated. 264 ci — Confidence intervals for means, proportions, and counts exposure(varname) is used only with poisson. You do not need to specify poisson if you specify exposure(); poisson is assumed. varname contains the total exposure (typically a time or an area) during which the number of events recorded in varlist were observed. exact, wald, wilson, agresti, and jeffreys specify that variables are 0/1 and specify how binomial confidence intervals are to be calculated. exact is the default and specifies exact (also known in the literature as Clopper–Pearson [1934]) binomial confidence intervals. wald specifies calculation of Wald confidence intervals. wilson specifies calculation of Wilson confidence intervals. agresti specifies calculation of Agresti–Coull confidence intervals. jeffreys specifies calculation of Jeffreys confidence intervals. See Brown, Cai, and DasGupta (2001) for a discussion and comparison of the different binomial confidence intervals. total is for use with the by prefix. It requests that, in addition to output for each by-group, output be added for all groups combined. separator(#) specifies how often separation lines should be inserted into the output. The default is separator(5), meaning that a line is drawn after every five variables. separator(10) would draw the line after every 10 variables. separator(0) suppresses the separation line. level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [R] level. Remarks and examples Remarks are presented under the following headings: Ordinary confidence intervals Binomial confidence intervals Poisson confidence intervals Immediate form Video examples Ordinary confidence intervals Example 1 Without the binomial or poisson options, ci produces “ordinary” confidence intervals, meaning those that are correct if the variable is distributed normally, and asymptotically correct for all other distributions satisfying the conditions of the central limit theorem. . 
use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . ci mpg price Variable Obs Mean mpg price 74 74 21.2973 6165.257 Std. Err. [95% Conf. Interval] .6725511 342.8719 19.9569 5481.914 22.63769 6848.6 ci — Confidence intervals for means, proportions, and counts 265 The standard error of the mean of mpg is 0.67, and the 95% confidence interval is [ 19.96, 22.64 ]. We can obtain wider confidence intervals, 99%, by typing . ci mpg price, level(99) Variable Obs Mean mpg price 74 74 21.2973 6165.257 Std. Err. [99% Conf. Interval] .6725511 342.8719 19.51849 5258.405 23.07611 7072.108 Example 2 by() breaks out the confidence intervals according to by-group; total adds an overall summary. For instance, . ci mpg, by(foreign) total -> foreign = Domestic Variable Obs Mean mpg 52 19.82692 -> foreign = Foreign Variable Obs Mean mpg 22 24.77273 -> Total Variable Obs Mean mpg 74 21.2973 Std. Err. .657777 Std. Err. 1.40951 Std. Err. .6725511 [95% Conf. Interval] 18.50638 21.14747 [95% Conf. Interval] 21.84149 27.70396 [95% Conf. Interval] 19.9569 22.63769 Technical note You can control the formatting of the numbers in the output by specifying a display format for the variable; see [U] 12.5 Formats: Controlling how data are displayed. For instance, . format mpg %9.2f . ci mpg Variable Obs Mean mpg 74 21.30 Std. Err. 0.67 [95% Conf. Interval] 19.96 22.64 Binomial confidence intervals Example 3 We have data on employees, including a variable marking whether the employee was promoted last year. 266 ci — Confidence intervals for means, proportions, and counts . use http://www.stata-press.com/data/r13/promo . ci promoted, binomial Variable Obs Mean promoted 20 .1 Std. Err. .067082 Binomial Exact [95% Conf. Interval] .0123485 .3169827 The above interval is the default for binomial data, known equivalently as both the exact binomial and the Clopper–Pearson interval. Nominally, the interpretation of a 95% confidence interval is that under repeated samples or experiments, 95% of the resultant intervals would contain the unknown parameter in question. However, for binomial data, the actual coverage probability, regardless of method, usually differs from that interpretation. This result occurs because of the discreteness of the binomial distribution, which produces only a finite set of outcomes, meaning that coverage probabilities are subject to discrete jumps and the exact nominal level cannot always be achieved. Therefore, the term exact confidence interval refers to its being derived from the binomial distribution, the distribution exactly generating the data, rather than resulting in exactly the nominal coverage. For the Clopper–Pearson interval, the actual coverage probability is guaranteed to be greater than or equal to the nominal confidence level, here 95%. Because of the way it is calculated—see Methods and formulas—it may also be interpreted as follows: If the true probability of being promoted were 0.012, the chances of observing a result as extreme or more extreme than the result observed (20 × 0.1 = 2 or more promotions) would be 2.5%. If the true probability of being promoted were 0.317, the chances of observing a result as extreme or more extreme than the result observed (two or fewer promotions) would be 2.5%. Example 4 The Clopper–Pearson interval is desirable because it guarantees nominal coverage; however, by dropping this restriction, you may obtain accurate intervals that are not as conservative. In this vein, you might opt for the Wilson (1927) interval, . 
ci promoted, binomial wilson Variable Obs Mean promoted 20 .1 Std. Err. .067082 Wilson [95% Conf. Interval] .0278665 .3010336 the Agresti–Coull (1998) interval, . ci promoted, binomial agresti Variable Obs Mean promoted 20 .1 Std. Err. .067082 Agresti-Coull [95% Conf. Interval] .0156562 .3132439 or the Bayesian-derived Jeffreys interval (Brown, Cai, and DasGupta 2001), . ci promoted, binomial jeffreys Variable Obs Mean promoted 20 .1 Std. Err. .067082 Jeffreys [95% Conf. Interval] .0213725 .2838533 ci — Confidence intervals for means, proportions, and counts 267 Picking the best interval is a matter of balancing accuracy (coverage) against precision (average interval length) and depends on sample size and success probability. Brown, Cai, and DasGupta (2001) recommend the Wilson or Jeffreys interval for small sample sizes (≤40) yet favor the Agresti–Coull interval for its simplicity, decent performance for sample sizes less than or equal to 40, and performance comparable to Wilson/Jeffreys for sample sizes greater than 40. They also deem the Clopper–Pearson interval to be “wastefully conservative and [. . . ] not a good choice for practical use”, unless of course one requires, at a minimum, the nominal coverage level. Finally, the binomial Wald confidence interval is obtained by specifying the binomial and wald options. The Wald interval is the one taught in most introductory statistics courses and for the above is simply, for level 1 − α, Mean±zα (Std. Err.), where zα is the 1 − α/2 quantile of the standard normal. Because its overall poor performance makes it impractical, the Wald interval is available mainly for pedagogical purposes. The binomial Wald interval is also similar to the interval produced by treating binary data as normal data and using ci without the binomial option, with two exceptions. First, when binomial is specified, the calculation of the standard error uses denominator n rather than n − 1, used for normal data. Second, confidence intervals for normal data are based on the t distribution rather than the standard normal. Of course, both discrepancies vanish as sample size increases. Technical note Let’s repeat example 3, but this time with data in which there are no promotions over the observed period: . use http://www.stata-press.com/data/r13/promonone . ci promoted, binomial Variable Obs Mean Std. Err. promoted 20 0 (*) one-sided, 97.5% confidence interval 0 Binomial Exact [95% Conf. Interval] 0 .1684335* The confidence interval is [ 0, 0.168 ], and this is the confidence interval that most books publish. It is not, however, a true 95% confidence interval because the lower tail has vanished. As Stata notes, it is a one-sided, 97.5% confidence interval. If you wanted to put 5% in the right tail, you could type ci promoted, binomial level(90). Technical note ci with the binomial option ignores any variables that do not take on the values 0 and 1 exclusively. For instance, with our automobile dataset, . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . ci mpg foreign, binomial Variable Obs Mean foreign 74 .2972973 Std. Err. .0531331 Binomial Exact [95% Conf. Interval] .196584 .4148353 We also requested the confidence interval for mpg, but Stata ignored us. It does that so you can type ci, binomial and obtain correct confidence intervals for all the variables that are 0/1 in your data. 268 ci — Confidence intervals for means, proportions, and counts Poisson confidence intervals Example 5 We have data on the number of bacterial colonies on a Petri dish. 
The dish has been divided into 36 small squares, and the number of colonies in each square has been counted. Each observation in our dataset represents a square on the dish. The variable count records the number of colonies in each square counted, which varies from 0 to 5. . use http://www.stata-press.com/data/r13/petri . ci count, poisson Variable Exposure Mean count 36 2.333333 Std. Err. Poisson Exact [95% Conf. Interval] .2545875 1.861158 2.888825 ci reports that the average number of colonies per square is 2.33. If the expected number of colonies per square were as low as 1.86, the probability of observing 2.33 or more colonies per square would be 2.5%. If the expected number were as large as 2.89, the probability of observing 2.33 or fewer colonies per square would be 2.5%. Technical note The number of “observations” — how finely the Petri dish is divided — makes no difference. The Poisson distribution is a function only of the count. In example 4, we observed a total of 2.33 × 36 = 84 colonies and a confidence interval of [ 1.86 × 36, 2.89 × 36 ] = [ 67, 104 ]. We would obtain the same [ 67, 104 ] confidence interval if our dish were divided into, say, 49 squares, rather than 36. For the counts, it is not even important that all the squares be of the same size. For rates, however, such differences do matter, but in an easy-to-calculate way. Rates are obtained from counts by dividing by exposure, which is typically a number multiplied by either time or an area. For our Petri dishes, we divide by an area to obtain a rate, but if our example were cast in terms of being infected by a disease, we might divide by person-years to obtain the rate. Rates are convenient because they are easier to compare: we might have 2.3 colonies per square inch or 0.0005 infections per person-year. So, let’s assume that we wish to obtain the number of colonies per square inch, and, moreover, that not all the “squares” on our dish are of equal size. We have a variable called area that records the area of each “square”: . ci count, exposure(area) Variable Exposure Mean count 3 28 Std. Err. 3.055051 Poisson Exact [95% Conf. Interval] 22.3339 34.66591 The rates are now in more familiar terms. In our sample, there are 28 colonies per square inch and the 95% confidence interval is [ 22.3, 34.7 ]. When we did not specify exposure(), ci assumed that each observation contributed 1 to exposure. ci — Confidence intervals for means, proportions, and counts 269 Technical note As with the binomial option, if there were no colonies on our dish, ci would calculate a one-sided confidence interval: . use http://www.stata-press.com/data/r13/petrinone . ci count, poisson Variable Exposure Mean count 36 0 (*) one-sided, 97.5% confidence interval Std. Err. 0 Poisson Exact [95% Conf. Interval] 0 .1024689* Immediate form Example 6 We are reading a soon-to-be-published paper by a colleague. In it is a table showing the number of observations, mean, and standard deviation of 1980 median family income for the Northeast and West. We correctly think that the paper would be much improved if it included the confidence intervals. The paper claims that for 166 cities in the Northeast, the average of median family income is $19,509 with a standard deviation of $4,379: For the Northeast: . cii 166 19509 4379 Variable Obs Mean Std. Err. [95% Conf. Interval] 166 19509 339.8763 18837.93 Obs Mean Std. Err. [95% Conf. Interval] 256 22557 312.6875 21941.22 20180.07 For the West: . 
cii 256 22557 5003 Variable 23172.78

Example 7
We flip a coin 10 times, and it comes up heads only once. We are shocked and decide to obtain a 99% confidence interval for this coin:
. cii 10 1, level(99)
    Variable |  Obs    Mean    Std. Err.    Binomial Exact [99% Conf. Interval]
             |   10      .1     .0948683        .0005011        .5442871

Example 8
The number of reported traffic accidents in Santa Monica over a 24-hour period is 27. We need know nothing else:
. cii 1 27, poisson
    Variable |  Exposure    Mean    Std. Err.    Poisson Exact [95% Conf. Interval]
             |         1      27     5.196152       17.79317       39.28358

Video examples
Immediate commands in Stata: Confidence intervals for Poisson data
Immediate commands in Stata: Confidence intervals for binomial data
Immediate commands in Stata: Confidence intervals for normal data

Stored results
ci and cii store the following in r():
Scalars
  r(N)      number of observations or exposure
  r(mean)   mean
  r(se)     estimate of standard error
  r(lb)     lower bound of confidence interval
  r(ub)     upper bound of confidence interval

Methods and formulas
Methods and formulas are presented under the following headings:
  Ordinary
  Binomial
  Poisson

Ordinary
Define $n$, $\bar{x}$, and $s^2$ as, respectively, the number of observations, (weighted) average, and (unbiased) estimated variance of the variable in question; see [R] summarize. The standard error of the mean, $s_\mu$, is defined as $\sqrt{s^2/n}$.
Let $\alpha$ be $1 - l/100$, where $l$ is the significance level specified by the user. Define $t_\alpha$ as the two-sided $t$ statistic corresponding to a significance level of $\alpha$ with $n - 1$ degrees of freedom; $t_\alpha$ is obtained from Stata as invttail(n-1,0.5*$\alpha$). The lower and upper confidence bounds are, respectively, $\bar{x} - s_\mu t_\alpha$ and $\bar{x} + s_\mu t_\alpha$.

Binomial
Given $k$ successes of $n$ trials, the estimated probability is $\hat{p} = k/n$ with standard error $\sqrt{\hat{p}(1-\hat{p})/n}$. ci calculates the exact (Clopper–Pearson) confidence interval $[\,p_1, p_2\,]$ such that
  $\Pr(K \geq k \mid p = p_1) = \alpha/2$  and  $\Pr(K \leq k \mid p = p_2) = \alpha/2$
where $K$ is distributed as binomial$(n, p)$. The endpoints may be obtained directly by using Stata's invbinomial() function. If $k = 0$ or $k = n$, the calculation of the appropriate tail is skipped.
The Wald interval is $\hat{p} \pm z_\alpha\sqrt{\hat{p}(1-\hat{p})/n}$, where $z_\alpha$ is the $1 - \alpha/2$ quantile of the standard normal. The interval is obtained by inverting the acceptance region of the large-sample Wald test of $H_0\colon p = p_0$ versus the two-sided alternative. That is, the confidence interval is the set of all $p_0$ such that
  $\dfrac{|\hat{p} - p_0|}{\sqrt{n^{-1}\hat{p}(1-\hat{p})}} \leq z_\alpha$
The Wilson interval is a variation on the Wald interval, using the null standard error $\sqrt{n^{-1}p_0(1-p_0)}$ in place of the estimated standard error $\sqrt{n^{-1}\hat{p}(1-\hat{p})}$ in the above expression. Inverting this acceptance region is more complicated yet results in the closed form
  $\dfrac{k + z_\alpha^2/2}{n + z_\alpha^2} \pm \dfrac{z_\alpha n^{1/2}}{n + z_\alpha^2}\left\{\hat{p}(1-\hat{p}) + \dfrac{z_\alpha^2}{4n}\right\}^{1/2}$
The Agresti–Coull interval is basically a Wald interval that borrows its center from the Wilson interval. Defining $\tilde{k} = k + z_\alpha^2/2$, $\tilde{n} = n + z_\alpha^2$, and (hence) $\tilde{p} = \tilde{k}/\tilde{n}$, the Agresti–Coull interval is
  $\tilde{p} \pm z_\alpha\sqrt{\tilde{p}(1-\tilde{p})/\tilde{n}}$
When $\alpha = 0.05$, $z_\alpha$ is near enough to 2 that $\tilde{p}$ can be thought of as a typical estimate of proportion where two successes and two failures have been added to the sample (Agresti and Coull 1998). This typical estimate of proportion makes the Agresti–Coull interval an easy-to-present alternative for introductory statistics students.
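To make the Agresti–Coull construction concrete, the following sketch (added here for illustration; it is not part of the original text) reproduces the limits reported for the promotion data in example 4, where k = 2 successes were observed in n = 20 trials. Only the built-in functions invnormal() and sqrt() are used:

. local k = 2
. local n = 20
. local z = invnormal(0.975)
. local ptilde = (`k' + `z'^2/2)/(`n' + `z'^2)
. local se = sqrt(`ptilde'*(1 - `ptilde')/(`n' + `z'^2))
. display "Agresti-Coull lower = " `ptilde' - `z'*`se'
. display "Agresti-Coull upper = " `ptilde' + `z'*`se'

Up to display precision, these commands return the limits .0156562 and .3132439 reported by ci, binomial agresti in example 4.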
The Jeffreys interval is a Bayesian interval and is based on the Jeffreys prior, which is the Beta(1/2, 1/2) distribution. Assigning this prior to p results in a posterior distribution for p that is Beta with parameters k + 1/2 and n−k + 1/2. The Jeffreys interval is then taken to be the 1 −α central posterior probability interval, namely, the α/2 and 1 −α/2 quantiles of the Beta(k + 1/2, n−k + 1/2) distribution. These quantiles may be obtained directly by using Stata’s invibeta() function. Poisson Given the total cases, k , the estimate of the expected count λ is k , and its standard error is ci calculates the exact confidence interval [ λ1 , λ2 ] such that Pr(K ≥ k|λ = λ1 ) = α/2 √ k. 272 ci — Confidence intervals for means, proportions, and counts and Pr(K ≤ k|λ = λ2 ) = α/2 where K is Poisson with mean λ. Solution is by Newton’s method. If k = 0, the calculation of λ1 is skipped. All values are then reported as rates, which are the above numbers divided by the total exposure. Harold Jeffreys (1891–1989) was born near Durham, England, and spent more than 75 years studying and working at the University of Cambridge, principally on theoretical and observational problems in geophysics, astronomy, mathematics, and statistics. He developed a systematic Bayesian approach to inference in his monograph Theory of Probability. Edwin Bidwell (E. B.) Wilson (1879–1964) majored in mathematics at Harvard and studied and taught at Yale and MIT before returning to Harvard in 1922. He worked in mathematics, physics, and statistics. His method for binomial intervals can be considered a precursor, for a particular problem, of Neyman’s concept of confidence intervals. Jerzy Neyman (1894–1981) was born in Bendery, Russia, now Moldavia. He studied and then taught at Kharkov University, moving from physics to mathematics. In 1921, Neyman moved to Poland, where he worked in statistics at Bydgoszcz and then Warsaw. Neyman received a Rockefeller Fellowship to work with Karl Pearson at University College London. There, he collaborated with Egon Pearson, Karl’s son, on the theory of hypothesis testing. Life in Poland became progressively more difficult, and Neyman returned to UCL to work there from 1934 to 1938. At this time, he published on the theory of confidence intervals. He then was offered a post in California at Berkeley, where he settled. Neyman established an outstanding statistics department and remained highly active in research, including applications in astronomy, meteorology, and medicine. He was one of the great statisticians of the 20th century. Acknowledgment We thank Nicholas J. Cox of the Department of Geography at Durham University, UK, and coeditor of the Stata Journal for his assistance with the jeffreys and wilson options. References Agresti, A., and B. A. Coull. 1998. Approximate is better than “exact” for interval estimation of binomial proportions. American Statistician 52: 119–126. Brown, L. D., T. T. Cai, and A. DasGupta. 2001. Interval estimation for a binomial proportion. Statistical Science 16: 101–133. Campbell, M. J., D. Machin, and S. J. Walters. 2007. Medical Statistics: A Textbook for the Health Sciences. 4th ed. Chichester, UK: Wiley. Clopper, C. J., and E. S. Pearson. 1934. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26: 404–413. Cook, A. 1990. Sir Harold Jeffreys. 2 April 1891–18 March 1989. Biographical Memoirs of Fellows of the Royal Society 36: 303–333. Gleason, J. R. 1999. 
sg119: Improved confidence intervals for binomial proportions. Stata Technical Bulletin 52: 16–18. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 208–211. College Station, TX: Stata Press. Jeffreys, H. 1946. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, Series A 186: 453–461. ci — Confidence intervals for means, proportions, and counts 273 Lindley, D. V. 2001. Harold Jeffreys. In Statisticians of the Centuries, ed. C. C. Heyde and E. Seneta, 402–405. New York: Springer. Reid, C. 1982. Neyman—from Life. New York: Springer. Rothman, K. J., S. Greenland, and T. L. Lash. 2008. Modern Epidemiology. 3rd ed. Philadelphia: Lippincott Williams & Wilkins. Seed, P. T. 2001. sg159: Confidence intervals for correlations. Stata Technical Bulletin 59: 27–28. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 267–269. College Station, TX: Stata Press. Stigler, S. M. 1997. Wilson, Edwin Bidwell. In Leading Personalities in Statistical Sciences: From the Seventeenth Century to the Present, ed. N. L. Johnson and S. Kotz, 344–346. New York: Wiley. Utts, J. M. 2005. Seeing Through Statistics. 3rd ed. Belmont, CA: Brooks/Cole. Wang, D. 2000. sg154: Confidence intervals for the ratio of two binomial proportions by Koopman’s method. Stata Technical Bulletin 58: 16–19. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 244–247. College Station, TX: Stata Press. Wilson, E. B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209–212. Also see [R] ameans — Arithmetic, geometric, and harmonic means [R] bitest — Binomial probability test [R] centile — Report centile and confidence interval [D] pctile — Create variable containing percentiles [R] prtest — Tests of proportions [R] summarize — Summary statistics [R] ttest — t tests (mean-comparison tests) Title clogit — Conditional (fixed-effects) logistic regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax clogit depvar indepvars options Model ∗ group(varname) offset(varname) constraints(constraints) collinear if in weight , group(varname) options Description matched group variable include varname in model with coefficient constrained to 1 apply specified linear constraints keep collinear variables SE/Robust vce(vcetype) nonest vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife do not check that panels are nested within clusters Reporting level(#) or nocnsreport display options set confidence level; default is level(95) report odds ratios do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics ∗ group(varname) is required. indepvars may contain factor variables; see [U] 11.4.3 Factor variables. bootstrap, by, fp, jackknife, mfp, mi estimate, nestreg, rolling, statsby, stepwise, and svy are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce(), nonest, and weights are not allowed with the svy prefix; see [SVY] svy. 
fweights, iweights, and pweights are allowed (see [U] 11.1.6 weight), but they are interpreted to apply to groups as a whole, not to individual observations. See Use of weights below. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 274 clogit — Conditional (fixed-effects) logistic regression 275 Menu Statistics > Categorical outcomes > Conditional logistic regression Description clogit fits what biostatisticians and epidemiologists call conditional logistic regression for matched case – control groups (see, for example, Hosmer, Lemeshow, and Sturdivant [2013, chap. 7]) and what economists and other social scientists call fixed-effects logit for panel data (see, for example, Chamberlain [1980]). Computationally, these models are the same. depvar equal to nonzero and nonmissing (typically depvar equal to one) indicates a positive outcome, whereas depvar equal to zero indicates a negative outcome. See [R] asclogit if you want to fit McFadden’s choice model (McFadden 1974). Also see [R] logistic for a list of related estimation commands. Options Model group(varname) is required; it specifies an identifier variable (numeric or string) for the matched groups. strata(varname) is a synonym for group(). offset(varname), constraints(constraints), collinear; see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. nonest, available only with vce(cluster clustvar), prevents checking that matched groups are nested within clusters. It is the user’s responsibility to verify that the standard errors are theoretically correct. Reporting level(#); see [R] estimation options. or reports the estimated coefficients transformed to odds ratios, that is, eb rather than b. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated. or may be specified at estimation or when replaying previously estimated results. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). 276 clogit — Conditional (fixed-effects) logistic regression The following option is available with clogit but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Remarks are presented under the following headings: Introduction Matched case–control data Use of weights Fixed-effects logit Introduction clogit fits maximum likelihood models with a dichotomous dependent variable coded as 0/1 (more precisely, clogit interprets 0 and not 0 to indicate the dichotomy). 
Conditional logistic analysis differs from regular logistic regression in that the data are grouped and the likelihood is calculated relative to each group; that is, a conditional likelihood is used. See Methods and formulas at the end of this entry. Biostatisticians and epidemiologists fit these models when analyzing matched case – control studies with 1 : 1 matching, 1 : k2i matching, or k1i : k2i matching, where i denotes the ith matched group for i = 1, 2, . . . , n, where n is the total number of groups. clogit fits a model appropriate for all of these matching schemes or for any mix of the schemes because the matching k1i : k2i can vary from group to group. clogit always uses the true conditional likelihood, not an approximation. Biostatisticians and epidemiologists sometimes refer to the matched groups as “strata”, but we will stick to the more generic term “group”. Economists and other social scientists fitting fixed-effects logit models have data that look exactly like the data biostatisticians and epidemiologists call k1i : k2i matched case – control data. In terms of how the data are arranged, k1i : k2i matching means that in the ith group, the dependent variable is 1 a total of k1i times and 0 a total of k2i times. There are a total of Ti = k1i + k2i observations for the ith group. This data arrangement is what economists and other social scientists call “panel data”, “longitudinal data”, or “cross-sectional time-series data”. So no matter what terminology you use, the computation and the use of the clogit command is the same. The following example shows how your data should be arranged to use clogit. Example 1 Suppose that we have grouped data with the variable id containing a unique identifier for each group. Our outcome variable, y, contains 0s and 1s. If we were biostatisticians, y = 1 would indicate a case, y = 0 would be a control, and id would be an identifier variable that indicates the groups of matched case – control subjects. If we were economists, y = 1 might indicate that a person was unemployed at any time during a year and y = 0, that a person was employed all year, and id would be an identifier variable for persons. clogit — Conditional (fixed-effects) logistic regression 277 If we list the first few observations of this dataset, it looks like . use http://www.stata-press.com/data/r13/clogitid . list y x1 x2 id in 1/11 y x1 x2 id 1. 2. 3. 4. 5. 0 0 0 1 0 0 1 1 1 0 4 4 6 8 1 1014 1014 1014 1014 1017 6. 7. 8. 9. 10. 0 1 0 0 1 0 1 0 1 1 7 10 1 7 7 1017 1017 1019 1019 1019 11. 1 1 9 1019 Pretending that we are biostatisticians, we describe our data as follows. The first group (id = 1014) consists of four matched persons: 1 case (y = 1) and three controls (y = 0), that is, 1 : 3 matching. The second group has 1 : 2 matching, and the third 2 : 2. Pretending that we are economists, we describe our data as follows. The first group consists of 4 observations (one per year) for person 1014. This person had a period of unemployment during 1 year of 4. The second person had a period of unemployment during 1 year of 3, and the third had a period of 2 years of 4. Our independent variables are x1 and x2. To fit the conditional (fixed-effects) logistic model, we type . clogit y x1 x2, group(id) note: multiple positive outcomes within groups encountered. 
Iteration 0: log likelihood = -123.42828 Iteration 1: log likelihood = -123.41386 Iteration 2: log likelihood = -123.41386 Conditional (fixed-effects) logistic regression Number of obs LR chi2(2) Prob > chi2 Log likelihood = -123.41386 Pseudo R2 y Coef. x1 x2 .653363 .0659169 Std. Err. z P>|z| .2875215 .0449555 2.27 1.47 0.023 0.143 = = = = 369 9.07 0.0107 0.0355 [95% Conf. Interval] .0898312 -.0221943 1.216895 .1540281 Technical note The message “note: multiple positive outcomes within groups encountered” at the top of the clogit output for the previous example merely informs us that we have k1i : k2i matching with k1i > 1 for at least one group. If your data should be 1 : k2i matched, this message tells you that there is an error in the data somewhere. We can see the distribution of k1i and Ti = k1i + k2i for the data of the example 1 by using the following steps: 278 clogit — Conditional (fixed-effects) logistic regression . by id, sort: gen k1 = sum(y) . by id: replace k1 = . if _n < _N (303 real changes made, 303 to missing) . by id: gen T = sum(y<.) . by id: replace T = . if _n < _N (303 real changes made, 303 to missing) . tabulate k1 k1 Freq. Percent 1 2 3 4 Cum. 48 12 4 2 72.73 18.18 6.06 3.03 66 100.00 Freq. Percent Cum. 2 3 4 5 6 7 8 9 10 5 5 12 11 13 8 3 7 2 7.58 7.58 18.18 16.67 19.70 12.12 4.55 10.61 3.03 7.58 15.15 33.33 50.00 69.70 81.82 86.36 96.97 100.00 Total 66 100.00 Total . tabulate T T 72.73 90.91 96.97 100.00 We see that k1i ranges from 1 to 4 and Ti ranges from 2 to 10 for these data. Technical note For k1i : k2i matching (and hence in the general case of fixed-effects logit), clogit uses a recursive algorithm to compute the likelihood,P which means that there are no limits on the size of Ti . However, computation time is proportional to Ti min(k1i , k2i ), so clogit will take roughly 10 times longer to fit a model with 10 : 10 matching than one with 1 : 10 matching. But clogit is fast, so computation time becomes an issue only when min(k1i , k2i ) is around 100 or more. See Methods and formulas for details. Matched case–control data Here we give a more detailed example of matched case – control data. Example 2 Hosmer, Lemeshow, and Sturdivant (2013, 24) present data on matched pairs of infants, each pair having one with low birthweight and another with regular birthweight. The data are matched on age of the mother. Several possible maternal exposures are considered: race (three categories), smoking status, presence of hypertension, presence of uterine irritability, previous preterm delivery, and weight at the last menstrual period. clogit — Conditional (fixed-effects) logistic regression 279 . use http://www.stata-press.com/data/r13/lowbirth2, clear (Applied Logistic Regression, Hosmer & Lemeshow) . 
describe Contains data from http://www.stata-press.com/data/r13/lowbirth2.dta obs: 112 Applied Logistic Regression, Hosmer & Lemeshow vars: 9 30 Jan 2013 08:46 size: 1,120 variable name pairid low age lwt smoke ptd ht ui race storage type byte byte byte int byte byte byte byte byte display format value label %8.0g %8.0g %8.0g %8.0g %8.0g %8.0g %8.0g %8.0g %9.0g race variable label Case-control pair ID Baby has low birthweight Age of mother Mother’s last menstrual weight Mother smoked during pregnancy Mother had previous preterm baby Mother has hypertension Uterine irritability race of mother: 1=white, 2=black, 3=other Sorted by: We list the case – control indicator variable, low; the match identifier variable, pairid; and two of the covariates, lwt and smoke, for the first 10 observations. . list low lwt smoke pairid in 1/10 low lwt smoke pairid 1. 2. 3. 4. 5. 0 1 0 1 0 135 101 98 115 95 0 1 0 0 0 1 1 2 2 3 6. 7. 8. 9. 10. 1 0 1 0 1 130 103 130 122 110 0 0 1 1 1 3 4 4 5 5 We fit a conditional logistic model of low birthweight on mother’s weight, race, smoking behavior, and history. 280 clogit — Conditional (fixed-effects) logistic regression . clogit low lwt smoke ptd ht ui i.race, group(pairid) nolog Conditional (fixed-effects) logistic regression Number of obs LR chi2(7) Prob > chi2 Log likelihood = -25.794271 Pseudo R2 Std. Err. z P>|z| = = = = 112 26.04 0.0005 0.3355 low Coef. [95% Conf. Interval] lwt smoke ptd ht ui -.0183757 1.400656 1.808009 2.361152 1.401929 .0100806 .6278396 .7886502 1.086128 .6961585 -1.82 2.23 2.29 2.17 2.01 0.068 0.026 0.022 0.030 0.044 -.0381333 .1701131 .2622828 .2323796 .0374836 .0013819 2.631199 3.353735 4.489924 2.766375 race black other .5713643 -.0253148 .689645 .6992044 0.83 -0.04 0.407 0.971 -.7803149 -1.39573 1.923044 1.345101 We might prefer to see results presented as odds ratios. We could have specified the or option when we first fit the model, or we can now redisplay results and specify or: . clogit, or Conditional (fixed-effects) logistic regression Log likelihood = -25.794271 low Odds Ratio Std. Err. lwt smoke ptd ht ui .9817921 4.057862 6.098293 10.60316 4.06303 .009897 2.547686 4.80942 11.51639 2.828513 race black other 1.770681 .975003 1.221141 .6817263 z Number of obs LR chi2(7) Prob > chi2 Pseudo R2 = = = = 112 26.04 0.0005 0.3355 P>|z| [95% Conf. Interval] -1.82 2.23 2.29 2.17 2.01 0.068 0.026 0.022 0.030 0.044 .9625847 1.185439 1.299894 1.261599 1.038195 1.001383 13.89042 28.60938 89.11467 15.90088 0.83 -0.04 0.407 0.971 .4582617 .2476522 6.84175 3.838573 Smoking, previous preterm delivery, hypertension, uterine irritability, and possibly the mother’s weight all contribute to low birthweight. Race of black and race of other are statistically insignificant when compared with the race of white omitted group, although the race of black effect is large. We can test the joint statistical significance of race being black (2.race) and race being other (3.race) by using test: . test 2.race 3.race ( 1) [low]2.race = ( 2) [low]3.race = chi2( 2) Prob > chi2 0 0 = = 0.88 0.6436 For a more complete description of test, see [R] test. test presents results in coefficients rather than odds ratios. Jointly testing that the coefficients on 2.race and 3.race are 0 is equivalent to jointly testing that the odds ratios are 1. Here one case was matched to one control, that is, 1 : 1 matching. 
From clogit’s point of view, that was not important — k1 cases could have been matched to k2 controls (k1 : k2 matching), and we would have fit the model in the same way. Furthermore, the matching can change from group clogit — Conditional (fixed-effects) logistic regression 281 to group, which we have denoted as k1i : k2i matching, where i denotes the group. clogit does not care. To fit the conditional logistic regression model, we specified the group(varname) option, group(pairid). The case and control are stored in separate observations. clogit knew that they were linked (in the same group) because the related observations share the same value of pairid. Technical note clogit provides a way to extend McNemar’s test to multiple controls per case (1 : k2i matching) and to multiple controls matched with multiple cases (k1i : k2i matching). In Stata, McNemar’s test is calculated by the mcc command; see [ST] epitab. The mcc command, however, requires that the matched case and control appear in one observation, so the data will need to be manipulated from 1 to 2 observations per stratum before using clogit. Alternatively, if you begin with clogit’s 2-observations-per-group organization, you will have to change it to 1 observation per group if you wish to use mcc. In either case, reshape provides an easy way to change the organization of the data. We will demonstrate its use below, but we direct you to [D] reshape for a more thorough discussion. In example 2, we used clogit to analyze the relationship between low birthweight and various characteristics of the mother. Assume that we now want to assess the relationship between low birthweight and smoking, ignoring the mother’s other characteristics. Using clogit, we obtain the following results: . clogit low smoke, group(pairid) or Iteration 0: Iteration 1: Iteration 2: log likelihood = -35.425931 log likelihood = -35.419283 log likelihood = -35.419282 Conditional (fixed-effects) logistic regression Log likelihood = -35.419282 low Odds Ratio smoke 2.75 Number of obs LR chi2(1) Prob > chi2 Pseudo R2 = = = = 112 6.79 0.0091 0.0875 Std. Err. z P>|z| [95% Conf. Interval] 1.135369 2.45 0.014 1.224347 6.176763 Let’s compare our estimated odds ratio and 95% confidence interval with that produced by mcc. We begin by reshaping the data: . keep low smoke pairid . reshape wide smoke, i(pairid) j(low 0 1) Data Number of obs. Number of variables j variable (2 values) xij variables: long -> wide 112 3 low -> -> -> 56 3 (dropped) smoke -> smoke0 smoke1 We now have the variables smoke0 (formed from smoke and low = 0), recording 1 if the control mother smoked and 0 otherwise; and smoke1 (formed from smoke and low = 1), recording 1 if the case mother smoked and 0 otherwise. We can now use mcc: 282 clogit — Conditional (fixed-effects) logistic regression . mcc smoke1 smoke0 Cases Exposed Unexposed Controls Exposed Unexposed Total 8 8 22 18 30 26 Total 16 40 56 McNemar’s chi2(1) = 6.53 Prob > chi2 = 0.0106 Exact McNemar significance probability = 0.0161 Proportion with factor Cases .5357143 Controls .2857143 [95% Conf. Interval] difference ratio rel. diff. odds ratio .25 1.875 .35 2.75 .0519726 1.148685 .1336258 1.179154 .4480274 3.060565 .5663742 7.143667 (exact) Both methods estimated the same odds ratio, and the 95% confidence intervals are similar. clogit produced a confidence interval of [ 1.22, 6.18 ], whereas mcc produced a confidence interval of [ 1.18, 7.14 ]. Use of weights With clogit, weights apply to groups as a whole, not to individual observations. 
For example, if there is a group in your dataset with a frequency weight of 3, there are a total of three groups in your sample with the same values of the dependent and independent variables as this one group. Weights must have the same value for all observations belonging to the same group; otherwise, an error message will be displayed. Example 3 We use the example from the above discussion of the mcc command. Here we have a total of 56 matched case – control groups, each with one case matched to one control. We had 8 matched pairs in which both the case and the control are exposed, 22 pairs in which the case is exposed and the control is unexposed, 8 pairs in which the case is unexposed and the control is exposed, and 18 pairs in which they are both unexposed. With weights, it is easy to enter these data into Stata and run clogit. clogit — Conditional (fixed-effects) logistic regression . clear . input id case exposed weight id case exposed 1. 1 1 1 8 2. 1 0 1 8 3. 2 1 1 22 4. 2 0 0 22 5. 3 1 0 8 6. 3 0 1 8 7. 4 1 0 18 8. 4 0 0 18 9. end weight . clogit case exposed [w=weight], group(id) or (frequency weights assumed) Iteration 0: log likelihood = -35.425931 Iteration 1: log likelihood = -35.419283 Iteration 2: log likelihood = -35.419282 Conditional (fixed-effects) logistic regression Log likelihood = -35.419282 case Odds Ratio exposed 2.75 283 Number of obs LR chi2(1) Prob > chi2 Pseudo R2 = = = = 112 6.79 0.0091 0.0875 Std. Err. z P>|z| [95% Conf. Interval] 1.135369 2.45 0.014 1.224347 6.176763 Fixed-effects logit The fixed-effects logit model can be written as Pr(yit = 1 | xit ) = F (αi + xit β) where F is the cumulative logistic distribution F (z) = exp(z) 1 + exp(z) i = 1, 2, . . . , n denotes the independent units (called “groups” by clogit), and t = 1, 2, . . . , Ti denotes the observations for the ith unit (group). Fitting this model by using a full maximum-likelihood approach leads to difficulties, however. When Ti is fixed, the maximum likelihood estimates for αi and β are inconsistent (Andersen 1970; Chamberlain 1980). This difficulty can be circumvented by looking at the probability of yi = PTi yit . This conditional probability does not involve the αi , so they (yi1 , . . . , yiTi ) conditional on t=1 are never estimated when the resulting conditional likelihood is used. See Hamerle and Ronning (1995) for a succinct and lucid development. See Methods and formulas for the estimation equation. 284 clogit — Conditional (fixed-effects) logistic regression Example 4 We are studying unionization of women in the United States by using the union dataset; see [XT] xt. We fit the fixed-effects logit model: . use http://www.stata-press.com/data/r13/union, clear (NLS Women 14-24 in 1968) . clogit union age grade not_smsa south black, group(idcode) note: multiple positive outcomes within groups encountered. note: 2744 groups (14165 obs) dropped because of all positive or all negative outcomes. note: black omitted because of no within-group variance. Iteration 0: log likelihood = -4521.3385 Iteration 1: log likelihood = -4516.1404 Iteration 2: log likelihood = -4516.1385 Iteration 3: log likelihood = -4516.1385 Conditional (fixed-effects) logistic regression Number of obs LR chi2(4) Prob > chi2 Log likelihood = -4516.1385 Pseudo R2 union Coef. age grade not_smsa south black .0170301 .0853572 .0083678 -.748023 0 Std. Err. .004146 .0418781 .1127963 .1251752 (omitted) z 4.11 2.04 0.07 -5.98 P>|z| 0.000 0.042 0.941 0.000 = = = = 12035 68.09 0.0000 0.0075 [95% Conf. 
Interval] .0089042 .0032777 -.2127088 -.9933619 .0251561 .1674368 .2294445 -.5026842 We received three messages at the top of the output. The first one, “multiple positive outcomes within groups encountered”, we expected. Our data do indeed have multiple positive outcomes (union = 1) in many groups. (Here a group consists of all the observations for a particular individual.) The second message tells us that 2,744 groups were “dropped” by clogit. When either union = 0 or union = 1 for all observations for an individual, this individual’s contribution to the log-likelihood is zero. Although these are perfectly valid observations in every sense, they have no effect on the estimation, so they are not included in the total “Number of obs”. Hence, the reported “Number of obs” gives the effective sample size of the estimation. Here it is 12,035 observations — only 46% of the total 26,200. We can easily check that there are indeed 2,744 groups with union either all 0 or all 1. We will generate a variable that contains the fraction of observations for each individual who has union = 1. clogit — Conditional (fixed-effects) logistic regression 285 . by idcode, sort: generate fraction = sum(union)/sum(union < .) . by idcode: replace fraction = . if _n < _N (21766 real changes made, 21766 to missing) . tabulate fraction fraction 0 .0833333 .0909091 .1 (output omitted ) .9 .9090909 .9166667 1 Total Freq. Percent Cum. 2,481 30 33 53 55.95 0.68 0.74 1.20 55.95 56.63 57.37 58.57 10 11 10 263 0.23 0.25 0.23 5.93 93.59 93.84 94.07 100.00 4,434 100.00 Because 2481 + 263 = 2744, we confirm what clogit did. The third warning message from clogit said “black omitted because of no within-group variance”. Obviously, race stays constant for an individual across time. Any such variables are collinear with the αi (that is, the fixed effects), and just as the αi drop out of the conditional likelihood, so do all variables that are unchanging within groups. Thus they cannot be estimated with the conditional fixed-effects model. There are several other estimators implemented in Stata that we could use with these data: cloglog . . . , vce(cluster idcode) logit . . . , vce(cluster idcode) probit . . . , vce(cluster idcode) scobit . . . , vce(cluster idcode) xtcloglog . . . xtgee . . . , family(binomial) link(logit) corr(exchangeable) xtlogit . . . xtprobit . . . See [R] cloglog, [R] logit, [R] probit, [R] scobit, [XT] xtcloglog, [XT] xtgee, [XT] xtlogit, and [XT] xtprobit for details. 
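To make the list above concrete, here is a minimal sketch of how two of those alternatives could be typed for the same data; it reloads the union data used in example 4 and declares idcode as the panel identifier. xtlogit with the fe option fits the same conditional likelihood as clogit, so its coefficient estimates should reproduce those shown above; logit with vce(cluster idcode) instead fits a pooled model with cluster-robust standard errors, so its estimates will generally differ.

. use http://www.stata-press.com/data/r13/union, clear
. xtset idcode
. xtlogit union age grade not_smsa south black, fe
. logit union age grade not_smsa south black, vce(cluster idcode)

As with clogit, time-invariant covariates such as black are omitted from the fixed-effects fit because they have no within-group variance.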
286 clogit — Conditional (fixed-effects) logistic regression Stored results clogit stores the following in e(): Scalars e(N) e(N drop) e(N group drop) e(k) e(k eq) e(k eq model) e(k dv) e(df m) e(r2 p) e(ll) e(ll 0) e(N clust) e(chi2) e(p) e(rank) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(group) e(multiple) e(wtype) e(wexp) e(title) e(clustvar) e(offset) e(chi2type) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(marginsok) e(marginsnotok) e(asbalanced) e(asobserved) number of observations number of observations dropped because of all positive or all negative outcomes number of groups dropped because of all positive or all negative outcomes number of parameters number of equations in e(b) number of equations in overall model test number of dependent variables model degrees of freedom pseudo-R-squared log likelihood log likelihood, constant-only model number of clusters χ2 significance rank of e(V) number of iterations return code 1 if converged, 0 otherwise clogit command as typed name of dependent variable name of group() variable multiple if multiple positive outcomes within group weight type weight expression title in estimation output name of cluster variable linear offset variable Wald or LR; type of model χ2 test vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict predictions allowed by margins predictions disallowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved clogit — Conditional (fixed-effects) logistic regression Matrices e(b) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) Functions e(sample) 287 coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample Methods and formulas Breslow and Day (1980, 247–279), Collett (2003, 251–267), and Hosmer, Lemeshow, and Sturdivant (2013, 243–268) provide a biostatistical point of view on conditional logistic regression. Hamerle and Ronning (1995) give a succinct and lucid review of fixed-effects logit; Chamberlain (1980) is a standard reference for this model. Greene (2012, chap. 17) provides a straightforward textbook description of conditional logistic regression from an economist’s point of view, as well as a brief description of choice models. Let i = 1, 2, . . . , n denote the groups and let t = 1, 2, . . . , Ti denote the observations for the ith group. Let yit be the dependent variable taking on values 0 or 1. Let yi = (yi1 , . . . , yiTi ) be the outcomes for the ith group as a whole. Let xit be a row vector of covariates. Let k1i = Ti X yit t=1 be the observed number of ones for the dependent variable in the ith group. Biostatisticians would say that there are k1i cases matched to k2i = Ti − k1i controls in the ith group. PTi We consider the probability of a possible value of yi conditional on t=1 yit = k1i (Hamerle and Ronning 1995, eq. 8.33; Hosmer, Lemeshow, and Sturdivant 2013, eq. 7.4), Pr yi | PTi t=1 yit = k1i PTi yit xit β =P PTi di ∈Si exp t=1 dit xit β exp t=1 PTi where dit is equal to 0 or 1 with t=1 dit = k1i , and Si is the set of all possible combinations of k1i ones and k2i zeros. 
Clearly, there are (Ti choose k1i) such combinations, but we need not count all of these combinations to compute the denominator of the above equation. It can be computed recursively. Denote the denominator by

fi(Ti, k1i) = Σ_{di ∈ Si} exp(Σ_{t=1}^{Ti} dit xit β)

Consider, computationally, how fi changes as we go from a total of 1 observation in the group to 2 observations to 3, etc. Doing this, we derive the recursive formula

fi(T, k) = fi(T − 1, k) + fi(T − 1, k − 1) exp(xiT β)

where we define fi(T, k) = 0 if T < k and fi(T, 0) = 1.

The conditional log-likelihood is

lnL = Σ_{i=1}^{n} { Σ_{t=1}^{Ti} yit xit β − log fi(Ti, k1i) }

The derivatives of the conditional log-likelihood can also be computed recursively by taking derivatives of the recursive formula for fi.

Computation time is roughly proportional to

p² Σ_{i=1}^{n} Ti min(k1i, k2i)

where p is the number of independent variables in the model. If min(k1i, k2i) is small, computation time is not an issue. But if it is large (say, 100 or more), patience may be required.

If Ti is large for all groups, the bias of the unconditional fixed-effects estimator is not a concern, and we can confidently use logit with an indicator variable for each group (provided, of course, that the number of groups does not exceed matsize; see [R] matsize).

This command supports the clustered version of the Huber/White/sandwich estimator of the variance using vce(robust) and vce(cluster clustvar). See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. Specifying vce(robust) is equivalent to specifying vce(cluster groupvar), where groupvar is the variable for the matched groups.

clogit also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation.

References

Andersen, E. B. 1970. Asymptotic properties of conditional maximum likelihood estimators. Journal of the Royal Statistical Society, Series B 32: 283–301.

Breslow, N. E., and N. E. Day. 1980. Statistical Methods in Cancer Research: Vol. 1—The Analysis of Case–Control Studies. Lyon: IARC.

Chamberlain, G. 1980. Analysis of covariance with qualitative data. Review of Economic Studies 47: 225–238.

Collett, D. 2003. Modelling Binary Data. 2nd ed. London: Chapman & Hall/CRC.

Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall.

Hamerle, A., and G. Ronning. 1995. Panel analysis for qualitative variables. In Handbook of Statistical Modeling for the Social and Behavioral Sciences, ed. G. Arminger, C. C. Clogg, and M. E. Sobel, 401–451. New York: Plenum.

Hole, A. R. 2007. Fitting mixed logit models by using maximum simulated likelihood. Stata Journal 7: 388–401.

Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley.

Kleinbaum, D. G., and M. Klein. 2010. Logistic Regression: A Self-Learning Text. 3rd ed. New York: Springer.

Long, J. S., and J. Freese. 2014. Regression Models for Categorical Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata Press.

McFadden, D. L. 1974. Conditional logit analysis of qualitative choice behavior. In Frontiers in Econometrics, ed. P. Zarembka, 105–142. New York: Academic Press.
clogit — Conditional (fixed-effects) logistic regression Also see [R] clogit postestimation — Postestimation tools for clogit [R] asclogit — Alternative-specific conditional logit (McFadden’s choice) model [R] logistic — Logistic regression, reporting odds ratios [R] mlogit — Multinomial (polytomous) logistic regression [R] nlogit — Nested logit regression [R] ologit — Ordered logistic regression [R] scobit — Skewed logistic regression [MI] estimation — Estimation commands for use with mi estimate [SVY] svy estimation — Estimation commands for survey data [XT] xtgee — Fit population-averaged panel-data models by using GEE [XT] xtlogit — Fixed-effects, random-effects, and population-averaged logit models [U] 20 Estimation and postestimation commands 289 Title clogit postestimation — Postestimation tools for clogit Description Remarks and examples Syntax for predict Methods and formulas Menu for predict Reference Options for predict Also see Description The following standard postestimation commands are available after clogit: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates hausman lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results Hausman’s specification test point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest1 margins2 marginsplot nlcom predict predictnl pwcompare suest test testnl 1 lrtest is not appropriate with svy estimation results. 2 The default prediction statistic pc1 cannot be correctly handled by margins; however, margins can be used after clogit with options predict(pu0) and predict(xb). 290 clogit postestimation — Postestimation tools for clogit 291 Syntax for predict predict type newvar if in , statistic nooffset Description statistic Main pc1 pu0 xb stdp ∗ dbeta ∗ dx2 ∗ gdbeta ∗ gdx2 ∗ hat ∗ residuals ∗ rstandard score probability of a positive outcome; the default probability of a positive outcome, assuming fixed effect is zero linear prediction standard error of the linear prediction Delta-β influence statistic Delta-χ2 lack-of-fit statistic Delta-β influence statistic for each group Delta-χ2 lack-of-fit statistic for each group Hosmer and Lemeshow leverage Pearson residuals standardized Pearson residuals first derivative of the log likelihood with respect to xj β Unstarred statistics are available both in and out of sample; type predict . . . if e(sample) . . . if wanted only for the estimation sample. Starred statistics are calculated only for the estimation sample, even when if e(sample) is not specified. Starred statistics are available for multiple controls per case-matching design only. 
They are not available if vce(robust), vce(cluster clustvar), or pweights were specified with clogit. dbeta, dx2, gdbeta, gdx2, hat, and rstandard are not available if constraints() was specified with clogit. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main pc1, the default, calculates the probability of a positive outcome conditional on one positive outcome within group. pu0 calculates the probability of a positive outcome, assuming that the fixed effect is zero. xb calculates the linear prediction. stdp calculates the standard error of the linear prediction. dbeta calculates the Delta-β influence statistic, a standardized measure of the difference in the coefficient vector that is due to deletion of the observation. dx2 calculates the Delta-χ2 influence statistic, reflecting the decrease in the Pearson chi-squared that is due to deletion of the observation. gdbeta calculates the approximation to the Pregibon stratum-specific Delta-β influence statistic, a standardized measure of the difference in the coefficient vector that is due to deletion of the entire stratum. 292 clogit postestimation — Postestimation tools for clogit gdx2 calculates the approximation to the Pregibon stratum-specific Delta-χ2 influence statistic, reflecting the decrease in the Pearson chi-squared that is due to deletion of the entire stratum. hat calculates the Hosmer and Lemeshow leverage or the diagonal element of the hat matrix. residuals calculates the Pearson residuals. rstandard calculates the standardized Pearson residuals. score calculates the equation-level score, ∂ ln L/∂(xit β). nooffset is relevant only if you specified offset(varname) for clogit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xj b rather than as xj b + offsetj . This option cannot be specified with dbeta, dx2, gdbeta, gdx2, hat, and rstandard. Remarks and examples predict may be used after clogit to obtain predicted values of the index xit β. Predicted probabilities for conditional logistic regression must be interpreted carefully. Probabilities are estimated for each group as a whole, not for individual observations. Furthermore, the probabilities are conditional on the number of positive outcomes in the group (that is, the number of cases and the number of controls), or it is assumed that the fixed effect is zero. predict may also be used to obtain influence and lack of fit statistics for an individual observation and for the whole group, to compute Pearson, standardized Pearson residuals, and leverage values. predict may be used for both within-sample and out-of-sample predictions. Example 1 Suppose that we have 1 : k2i matched data and that we have previously fit the following model: . use http://www.stata-press.com/data/r13/clogitid . clogit y x1 x2, group(id) (output omitted ) To obtain the predicted values of the index, we could type predict idx, xb to create a new variable called idx. From idx, we could then calculate the predicted probabilities. Easier, however, would be to type . predict phat (option pc1 assumed; probability of success given one success within group) phat would then contain the predicted probabilities. As noted previously, the predicted probabilities are really predicted probabilities for the group as a whole (that is, they are the predicted probability of observing yit = 1 and yit0 = 0 for all t0 6= t). 
Thus, if we want to obtain the predicted probabilities for the estimation sample, it is important that, when we make the calculation, predictions be restricted to the same sample on which we estimated the data. We cannot predict the probabilities and then just keep the relevant ones because the entire sample determines each probability. Thus, assuming that we are not attempting to make out-of-sample predictions, we type . predict phat2 if e(sample) (option pc1 assumed; probability of success given one success within group) clogit postestimation — Postestimation tools for clogit 293 Methods and formulas Recall that i = 1, . . . , n denote the groups and t = 1, . . . , Ti denote the observations for the ith group. predict produces probabilities of a positive outcome within group conditional on there being one positive outcome (pc1), Ti X Pr yit = 1 ! yit = 1 t=1 exp(xit β) = PTi t=1 exp(xit β) or predict calculates the unconditional pu0: Pr(yit = 1) = exp(xit β) 1 + exp(xit β) Pn Let N = j=1 Tj denote the total number of observations, p denote the number of covariates, and θbit denote the conditional predicted probabilities of a positive outcome (pc1). For the multiple control per case (1 : k2i ) matching, Hosmer, Lemeshow, and Sturdivant (2013, 248–251) propose the following diagnostics: The Pearson residual is rit = (yit − θbit ) q θbit The leverage (hat) value is defined as e T UX) e −1 x eTit (X eit hit = θbit x eit = xit − where x PTi xij θbij is the 1 × p row vector of centered by a weighted stratum-specific e N ×p are composed of x eit values. mean covariate values, UN = diag{θbit }, and the rows of X j=1 The standardized Pearson residual is rsit = √ rit 1 − hit The lack of fit and influence diagnostics for an individual observation are (respectively) computed as 2 ∆χ2it = rsit and ∆βbit = ∆χ2it hit 1 − hit The lack of fit and influence diagnostics for the groups are the group-specific totals of the respective individual diagnostics shown above. 294 clogit postestimation — Postestimation tools for clogit Reference Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley. Also see [R] clogit — Conditional (fixed-effects) logistic regression [U] 20 Estimation and postestimation commands Title cloglog — Complementary log-log regression Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Options Acknowledgment Syntax cloglog depvar indepvars options if in weight , options Description Model noconstant offset(varname) asis constraints(constraints) collinear suppress constant term include varname in model with coefficient constrained to 1 retain perfect predictor variables apply specified linear constraints keep collinear variables SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife Reporting level(#) eform nocnsreport display options set confidence level; default is level(95) report exponentiated coefficients do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics indepvars may contain factor variables; see [U] 11.4.3 Factor variables. depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. 
bootstrap, by, fp, jackknife, mi estimate, nestreg, rolling, statsby, stepwise, and svy are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce() and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 295 296 cloglog — Complementary log-log regression Menu Statistics > Binary outcomes > Complementary log-log regression Description cloglog fits maximum-likelihood complementary log-log models. See [R] logistic for a list of related estimation commands. Options Model noconstant, offset(varname); see [R] estimation options. asis forces retention of perfect predictor variables and their associated perfectly predicted observations and may produce instabilities in maximization; see [R] probit. constraints(constraints), collinear; see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. eform displays the exponentiated coefficients and corresponding standard errors and confidence intervals. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). The following option is available with cloglog but is not shown in the dialog box: coeflegend; see [R] estimation options. cloglog — Complementary log-log regression 297 Remarks and examples Remarks are presented under the following headings: Introduction to complementary log-log regression Robust standard errors Introduction to complementary log-log regression cloglog fits maximum likelihood models with dichotomous dependent variables coded as 0/1 (or, more precisely, coded as 0 and not 0). Example 1 We have data on the make, weight, and mileage rating of 22 foreign and 52 domestic automobiles. We wish to fit a model explaining whether a car is foreign based on its weight and mileage. Here is an overview of our data: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . keep make mpg weight foreign . describe Contains data from http://www.stata-press.com/data/r13/auto.dta obs: 74 1978 Automobile Data vars: 4 13 Apr 2013 17:45 size: 1,702 (_dta has notes) variable name make mpg weight foreign storage type display format str18 int int byte Sorted by: Note: %-18s %8.0g %8.0gc %8.0g value label variable label origin Make and Model Mileage (mpg) Weight (lbs.) Car type foreign dataset has changed since last saved . 
inspect foreign foreign: # # # # # # Car type Number of Observations Negative Zero Positive # # 0 Total Missing 1 Total 52 22 74 - Integers 52 22 Nonintegers - 74 - 74 (2 unique values) foreign is labeled and all values are documented in the label. The variable foreign takes on two unique values, 0 and 1. The value 0 denotes a domestic car, and 1 denotes a foreign car. 298 cloglog — Complementary log-log regression The model that we wish to fit is Pr(foreign = 1) = F (β0 + β1 weight + β2 mpg) where F (z) = 1 − exp − exp(z) . To fit this model, we type . cloglog foreign weight mpg Iteration 0: log likelihood = -34.054593 Iteration 1: log likelihood = -27.869915 Iteration 2: log likelihood = -27.742997 Iteration 3: log likelihood = -27.742769 Iteration 4: log likelihood = -27.742769 Complementary log-log regression Number of obs Zero outcomes Nonzero outcomes LR chi2(2) Prob > chi2 Log likelihood = -27.742769 foreign Coef. weight mpg _cons -.0029153 -.1422911 10.09694 Std. Err. .0006974 .076387 3.351841 z -4.18 -1.86 3.01 P>|z| 0.000 0.062 0.003 = = = = = 74 52 22 34.58 0.0000 [95% Conf. Interval] -.0042823 -.2920069 3.527448 -.0015483 .0074247 16.66642 We find that heavier cars are less likely to be foreign and that cars yielding better gas mileage are also less likely to be foreign, at least when holding the weight of the car constant. See [R] maximize for an explanation of the output. Technical note Stata interprets a value of 0 as a negative outcome (failure) and treats all other values (except missing) as positive outcomes (successes). Thus, if your dependent variable takes on the values 0 and 1, 0 is interpreted as failure and 1 as success. If your dependent variable takes on the values 0, 1, and 2, 0 is still interpreted as failure, but both 1 and 2 are treated as successes. If you prefer a more formal mathematical statement, when you type cloglog y x, Stata fits the model n o Pr(yj 6= 0 | xj ) = 1 − exp − exp(xj β) Robust standard errors If you specify the vce(robust) option, cloglog reports robust standard errors, as described in [U] 20.21 Obtaining robust variance estimates. For the model of foreign on weight and mpg, the robust calculation increases the standard error of the coefficient on mpg by 44%: cloglog — Complementary log-log regression 299 . cloglog foreign weight mpg, vce(robust) Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: log log log log log pseudolikelihood pseudolikelihood pseudolikelihood pseudolikelihood pseudolikelihood = = = = = -34.054593 -27.869915 -27.742997 -27.742769 -27.742769 Complementary log-log regression Log pseudolikelihood = -27.742769 foreign Coef. weight mpg _cons -.0029153 -.1422911 10.09694 Robust Std. Err. .0007484 .1102466 4.317305 z -3.90 -1.29 2.34 Number of obs Zero outcomes Nonzero outcomes = = = 74 52 22 Wald chi2(2) Prob > chi2 = = 29.74 0.0000 P>|z| 0.000 0.197 0.019 [95% Conf. Interval] -.0043822 -.3583704 1.635174 -.0014484 .0737882 18.5587 Without vce(robust), the standard error for the coefficient on mpg was reported to be 0.076, with a resulting confidence interval of [ −0.29, 0.01 ]. The vce(cluster clustvar) option can relax the independence assumption required by the complementary log-log estimator to being just independence between clusters. To demonstrate this ability, we will switch to a different dataset. We are studying unionization of women in the United States by using the union dataset; see [XT] xt. 
We fit the following model, ignoring that women are observed an average of 5.9 times each in this dataset: . use http://www.stata-press.com/data/r13/union, clear (NLS Women 14-24 in 1968) . cloglog union age grade not_smsa south##c.year Iteration Iteration Iteration Iteration 0: 1: 2: 3: log log log log likelihood likelihood likelihood likelihood = = = = -13606.373 -13540.726 -13540.607 -13540.607 Complementary log-log regression Log likelihood = -13540.607 Std. Err. z Number of obs Zero outcomes Nonzero outcomes = = = 26200 20389 5811 LR chi2(6) Prob > chi2 = = 647.24 0.0000 union Coef. P>|z| [95% Conf. Interval] age grade not_smsa 1.south year .0185346 .0452772 -.1886592 -1.422292 -.0133007 .0043616 .0057125 .0317801 .3949381 .0049576 4.25 7.93 -5.94 -3.60 -2.68 0.000 0.000 0.000 0.000 0.007 .009986 .0340809 -.2509471 -2.196356 -.0230174 .0270833 .0564736 -.1263712 -.648227 -.0035839 south#c.year 1 .0105659 .0049234 2.15 0.032 .0009161 .0202157 _cons -1.219801 .2952374 -4.13 0.000 -1.798455 -.6411462 300 cloglog — Complementary log-log regression The reported standard errors in this model are probably meaningless. Women are observed repeatedly, and so the observations are not independent. Looking at the coefficients, we find a large southern effect against unionization and a different time trend for the south. The vce(cluster clustvar) option provides a way to fit this model and obtains correct standard errors: . cloglog union age grade not_smsa south##c.year, vce(cluster id) nolog Complementary log-log regression Log pseudolikelihood = -13540.607 Number of obs Zero outcomes Nonzero outcomes = = = 26200 20389 5811 Wald chi2(6) Prob > chi2 = = 160.76 0.0000 (Std. Err. adjusted for 4434 clusters in idcode) Robust Std. Err. union Coef. z P>|z| [95% Conf. Interval] age grade not_smsa 1.south year .0185346 .0452772 -.1886592 -1.422292 -.0133007 .0084873 .0125776 .0642068 .506517 .0090628 2.18 3.60 -2.94 -2.81 -1.47 0.029 0.000 0.003 0.005 0.142 .0018999 .0206255 -.3145021 -2.415047 -.0310633 .0351694 .069929 -.0628162 -.4295365 .004462 south#c.year 1 .0105659 .0063175 1.67 0.094 -.0018162 .022948 _cons -1.219801 .5175129 -2.36 0.018 -2.234107 -.2054942 These standard errors are larger than those reported by the inappropriate conventional calculation. By comparison, another way we could fit this model is with an equal-correlation population-averaged complementary log-log model: . xtcloglog union age grade not_smsa south##c.year, pa nolog GEE population-averaged model Group variable: idcode Link: cloglog Family: binomial Correlation: exchangeable Scale parameter: Number of obs Number of groups Obs per group: min avg max Wald chi2(6) Prob > chi2 1 Std. Err. z P>|z| = = = = = = = 26200 4434 1 5.9 12 234.66 0.0000 union Coef. [95% Conf. Interval] age grade not_smsa 1.south year .0153737 .0549518 -.1045232 -1.714868 -.0115881 .0081156 .0095093 .0431082 .3384558 .0084125 1.89 5.78 -2.42 -5.07 -1.38 0.058 0.000 0.015 0.000 0.168 -.0005326 .0363139 -.1890138 -2.378229 -.0280763 .03128 .0735897 -.0200326 -1.051507 .0049001 south#c.year 1 .0149796 .0041687 3.59 0.000 .0068091 .0231501 _cons -1.488278 .4468005 -3.33 0.001 -2.363991 -.6125652 The coefficient estimates are similar, but these standard errors are smaller than those produced by cloglog, vce(cluster clustvar). This finding is as we would expect. If the within-panel correlation assumptions are valid, the population-averaged estimator should be more efficient. 
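The same population-averaged fit can also be requested through xtgee, which is discussed next. The following is a minimal sketch of that call, assuming idcode has already been declared as the panel variable (as the xtcloglog output above indicates):

. xtgee union age grade not_smsa south##c.year, family(binomial) link(cloglog) corr(exchangeable)

With these family(), link(), and corr() choices, xtgee fits the same exchangeable-correlation model as xtcloglog with the pa option.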
cloglog — Complementary log-log regression 301 In addition to this estimator, we may use the xtgee command to fit a panel estimator (with complementary log-log link) and any number of assumptions on the within-idcode correlation. cloglog, vce(cluster clustvar) is robust to assumptions about within-cluster correlation. That is, it inefficiently sums within cluster for the standard-error calculation rather than attempting to exploit what might be assumed about the within-cluster correlation (as do the xtgee population-averaged models). Stored results cloglog stores the following in e(): Scalars e(N) e(k) e(k eq) e(k eq model) e(k dv) e(N f) e(N s) e(df m) e(ll) e(ll 0) e(N clust) e(chi2) e(p) e(rank) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(clustvar) e(offset) e(chi2type) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) Functions e(sample) number of observations number of parameters number of equations in e(b) number of equations in overall model test number of dependent variables number of zero outcomes number of nonzero outcomes model degrees of freedom log likelihood log likelihood, constant-only model number of clusters χ2 significance rank of e(V) number of iterations return code 1 if converged, 0 otherwise cloglog command as typed name of dependent variable weight type weight expression title in estimation output name of cluster variable linear offset variable Wald or LR; type of model χ2 test vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample 302 cloglog — Complementary log-log regression Methods and formulas Complementary log-log analysis (related to the gompit model, so named because of its relationship to the Gompertz distribution) is an alternative to logit and probit analysis, but it is unlike these other estimators in that the transformation is not symmetric. Typically, this model is used when the positive (or negative) outcome is rare. The log-likelihood function for complementary log-log is lnL = X j∈S wj lnF (xj b) + X n o wj ln 1 − F (xj b) j6∈S where S is the set of all observations j such that yj 6= 0, F (z) = 1 − exp − exp(z) , and wj denotes the optional weights. lnL is maximized as described in [R] maximize. We can fit a gompit model by reversing the success–failure sense of the dependent variable and using cloglog. This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. The scores are calculated as uj = [exp(xj b) exp − exp(xj b) /F (xj b)]xj for the positive outcomes and {− exp(xj b)}xj for the negative outcomes. cloglog also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. 
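As a concrete illustration of the gompit remark above, here is a minimal sketch using the automobile data from example 1; the variable name domestic is ours and is not part of the dataset.

. use http://www.stata-press.com/data/r13/auto, clear
. generate byte domestic = (foreign == 0)
. cloglog domestic weight mpg

Because Pr(domestic = 1) = 1 − exp{−exp(xb)}, the fitted model implies Pr(foreign = 1) = exp{−exp(xb)}, which is the gompit form for the original outcome.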
Acknowledgment We thank Joseph Hilbe of Arizona State University for providing the inspiration for the cloglog command (Hilbe 1996, 1998). References Clayton, D. G., and M. Hills. 1993. Statistical Models in Epidemiology. Oxford: Oxford University Press. Hilbe, J. M. 1996. sg53: Maximum-likelihood complementary log-log regression. Stata Technical Bulletin 32: 19–20. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 129–131. College Station, TX: Stata Press. . 1998. sg53.2: Stata-like commands for complementary log-log regression. Stata Technical Bulletin 41: 23. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 166–167. College Station, TX: Stata Press. Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage. Long, J. S., and J. Freese. 2014. Regression Models for Categorical Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata Press. Xu, J., and J. S. Long. 2005. Confidence intervals for predicted outcomes in regression models for categorical outcomes. Stata Journal 5: 537–559. cloglog — Complementary log-log regression Also see [R] cloglog postestimation — Postestimation tools for cloglog [R] clogit — Conditional (fixed-effects) logistic regression [R] glm — Generalized linear models [R] logistic — Logistic regression, reporting odds ratios [R] scobit — Skewed logistic regression [ME] mecloglog — Multilevel mixed-effects complementary log-log regression [MI] estimation — Estimation commands for use with mi estimate [SVY] svy estimation — Estimation commands for survey data [XT] xtcloglog — Random-effects and population-averaged cloglog models [U] 20 Estimation and postestimation commands 303 Title cloglog postestimation — Postestimation tools for cloglog Description Remarks and examples Syntax for predict Also see Menu for predict Options for predict Description The following postestimation commands are available after cloglog: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 2 forecast is not appropriate with mi or svy estimation results. lrtest is not appropriate with svy estimation results. 
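As a quick illustration of two of the commands in this table, here is a minimal sketch that assumes the model from example 1 of [R] cloglog (cloglog foreign weight mpg) has just been fit; estat ic reports the information criteria, and margins reports average marginal effects of the covariates on the probability of a positive outcome.

. cloglog foreign weight mpg
. estat ic
. margins, dydx(weight mpg)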
304 cloglog postestimation — Postestimation tools for cloglog 305 Syntax for predict predict type newvar if in , statistic nooffset Description statistic Main probability of a positive outcome; the default linear prediction standard error of the linear prediction first derivative of the log likelihood with respect to xj β pr xb stdp score These statistics are available both in and out of sample; type predict the estimation sample. . . . if e(sample) . . . if wanted only for Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main pr, the default, calculates the probability of a positive outcome. xb calculates the linear prediction. stdp calculates the standard error of the linear prediction. score calculates the equation-level score, ∂ ln L/∂(xj β). nooffset is relevant only if you specified offset(varname) for cloglog. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xj b rather than as xj b + offsetj . Remarks and examples Once you have fit a model, you can obtain the predicted probabilities by using the predict command for both the estimation sample and other samples; see [U] 20 Estimation and postestimation commands and [R] predict. Here we will make only a few comments. predict without arguments calculates the predicted probability of a positive outcome. With the xb option, it calculates the linear combination xj b, where xj are the independent variables in the j th observation and b is the estimated parameter vector. With the stdp option, predict calculates the standard error of the linear prediction, which is not adjusted for replicated covariate patterns in the data. Example 1 In example 1 in [R] cloglog, we fit the complementary log-log model cloglog foreign weight mpg. To obtain predicted probabilities, 306 cloglog postestimation — Postestimation tools for cloglog . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . cloglog foreign weight mpg (output omitted ) . predict p (option pr assumed; Pr(foreign)) . summarize foreign p Obs Mean Std. Dev. Variable foreign p 74 74 .2972973 .2928348 .4601885 .29732 Also see [R] cloglog — Complementary log-log regression [U] 20 Estimation and postestimation commands Min Max 0 .0032726 1 .9446067 Title cls — Clear Results window Syntax Description Syntax cls Description cls clears the Results window, causing all text to be removed. This operation cannot be undone. 307 Title cnsreg — Constrained linear regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax cnsreg depvar indepvars if in weight , constraints(constraints) options options Model ∗ constraints(constraints) collinear noconstant Description apply specified linear constraints keep collinear variables suppress constant term SE/Robust vce(vcetype) vcetype may be ols, robust, cluster clustvar, bootstrap, or jackknife Reporting level(#) nocnsreport display options set confidence level; default is level(95) do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling mse1 coeflegend force MSE to be 1 display legend instead of statistics ∗ constraints(constraints) is required. indepvars may contain factor variables; see [U] 11.4.3 Factor variables. depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. 
bootstrap, by, fp, jackknife, mi estimate, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate. With the fp prefix (see [R] fp), constraints cannot be specified for the variable containing fractional polynomial terms. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. vce(), mse1, and weights are not allowed with the svy prefix; see [SVY] svy. aweights, fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. mse1 and coeflegend do not appear in the dialog. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 308 cnsreg — Constrained linear regression 309 Menu Statistics > Linear models and related > Constrained linear regression Description cnsreg fits constrained linear regression models. Options Model constraints(constraints), collinear, noconstant; see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (ols), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. vce(ols), the default, uses the standard variance estimator for ordinary least-squares regression. Reporting level(#); see [R] estimation options. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. The following options are available with cnsreg but are not shown in the dialog box: mse1 is used only in programs and ado-files that use cnsreg to fit models other than constrained linear regression. mse1 sets the mean squared error to 1, thus forcing the variance–covariance matrix of the estimators to be (X0 DX)−1 (see Methods and formulas in [R] regress) and affecting calculated standard errors. Degrees of freedom for t statistics are calculated as n rather than n − p + c, where p is the total number of parameters (prior to restrictions and including the constant) and c is the number of constraints. mse1 is not allowed with the svy prefix. coeflegend; see [R] estimation options. Remarks and examples For a discussion of constrained linear regression, see Greene (2012, 121–122); Hill, Griffiths, and Lim (2011, 231–233); or Davidson and MacKinnon (1993, 17). 310 cnsreg — Constrained linear regression Example 1: One constraint In principle, we can obtain constrained linear regression estimates by modifying the list of independent variables. For instance, if we wanted to fit the model mpg = β0 + β1 price + β2 weight + u and constrain β1 = β2 , we could write mpg = β0 + β1 (price + weight) + u and run a regression of mpg on price + weight. The estimated coefficient on the sum would be the constrained estimate of β1 and β2 . Using cnsreg, however, is easier: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . constraint 1 price = weight . cnsreg mpg price weight, constraint(1) Constrained linear regression ( 1) Number of obs F( 1, 72) Prob > F Root MSE = = = = 74 37.59 0.0000 4.7220 price - weight = 0 mpg Coef. price weight _cons -.0009875 -.0009875 30.36718 Std. Err. 
.0001611 .0001611 1.577958 t -6.13 -6.13 19.24 P>|t| 0.000 0.000 0.000 [95% Conf. Interval] -.0013086 -.0013086 27.22158 -.0006664 -.0006664 33.51278 We define constraints by using the constraint command; see [R] constraint. We fit the model with cnsreg and specify the constraint number or numbers in the constraints() option. Just to show that the results above are correct, here is the result of applying the constraint by hand: . generate x = price + weight . regress mpg x Source SS df MS Model Residual 838.065767 1605.39369 1 72 838.065767 22.2971346 Total 2443.45946 73 33.4720474 mpg Coef. x _cons -.0009875 30.36718 Std. Err. .0001611 1.577958 t -6.13 19.24 Number of obs F( 1, 72) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.000 = = = = = = 74 37.59 0.0000 0.3430 0.3339 4.722 [95% Conf. Interval] -.0013086 27.22158 -.0006664 33.51278 cnsreg — Constrained linear regression 311 Example 2: Multiple constraints Models can be fit subject to multiple simultaneous constraints. We simply define the constraints and then include the constraint numbers in the constraints() option. For instance, say that we wish to fit the model mpg = β0 + β1 price + β2 weight + β3 displ + β4 gear ratio + β5 foreign + β6 length + u subject to the constraints β1 = β2 = β3 = β6 β4 = −β5 = β0 /20 (This model, like the one in example 1, is admittedly senseless.) We fit the model by typing . constraint 1 price=weight . constraint 2 displ=weight . constraint 3 length=weight . constraint 5 gear_ratio = -foreign . constraint 6 gear_ratio = _cons/20 . cnsreg mpg price weight displ gear_ratio foreign length, c(1-3,5-6) Constrained linear regression Number of obs = F( 2, 72) = Prob > F = Root MSE = ( 1) price - weight = 0 ( 2) - weight + displacement = 0 ( 3) - weight + length = 0 ( 4) gear_ratio + foreign = 0 ( 5) gear_ratio - .05*_cons = 0 mpg Coef. price weight displacement gear_ratio foreign length _cons -.000923 -.000923 -.000923 1.326114 -1.326114 -.000923 26.52229 Std. Err. .0001534 .0001534 .0001534 .0687589 .0687589 .0001534 1.375178 t -6.02 -6.02 -6.02 19.29 -19.29 -6.02 19.29 P>|t| 0.000 0.000 0.000 0.000 0.000 0.000 0.000 74 785.20 0.0000 4.6823 [95% Conf. Interval] -.0012288 -.0012288 -.0012288 1.189046 -1.463183 -.0012288 23.78092 -.0006172 -.0006172 -.0006172 1.463183 -1.189046 -.0006172 29.26365 There are many ways we could have specified the constraints() option (which we abbreviated c() above). We typed c(1-3,5-6), meaning that we want constraints 1 through 3 and 5 and 6; those numbers correspond to the constraints we defined. The only reason we did not use the number 4 was to emphasize that constraints do not have to be consecutively numbered. We typed c(1-3,5-6), but we could have typed c(1,2,3,5,6) or c(1-3,5,6) or c(1-2,3,5,6) or even c(1-6), which would have worked as long as constraint 4 was not defined. If we had previously defined a constraint 4, then c(1-6) would have included it. 
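Before imposing constraints like these, it can be informative to ask whether the unconstrained estimates are compatible with them. One way to do that—a sketch that is not part of the original example—is to fit the unconstrained regression and then perform a joint Wald test of the linear constraints with test:

. regress mpg price weight displacement gear_ratio foreign length
. test (price = weight) (displacement = weight) (length = weight) (gear_ratio = -foreign) (20*gear_ratio = _cons)

A small p-value would suggest that the data are at odds with the constraints; see [R] test.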
Stored results

cnsreg stores the following in e():

Scalars
    e(N)                number of observations
    e(df_m)             model degrees of freedom
    e(df_r)             residual degrees of freedom
    e(F)                F statistic
    e(rmse)             root mean squared error
    e(ll)               log likelihood
    e(N_clust)          number of clusters
    e(rank)             rank of e(V)

Macros
    e(cmd)              cnsreg
    e(cmdline)          command as typed
    e(depvar)           name of dependent variable
    e(wtype)            weight type
    e(wexp)             weight expression
    e(title)            title in estimation output
    e(clustvar)         name of cluster variable
    e(vce)              vcetype specified in vce()
    e(vcetype)          title used to label Std. Err.
    e(properties)       b V
    e(predict)          program used to implement predict
    e(asbalanced)       factor variables fvset as asbalanced
    e(asobserved)       factor variables fvset as asobserved

Matrices
    e(b)                coefficient vector
    e(Cns)              constraints matrix
    e(V)                variance–covariance matrix of the estimators
    e(V_modelbased)     model-based variance

Functions
    e(sample)           marks estimation sample

Methods and formulas

Let n be the number of observations, p be the total number of parameters (prior to restrictions and including the constant), and c be the number of constraints. The coefficients are calculated as

    b′ = T(T′X′WXT)⁻¹(T′X′Wy − T′X′WXa′) + a′

where T and a are as defined in [P] makecns. W = I if no weights are specified. If weights are specified, let v: 1 × n be the specified weights. If fweight frequency weights are specified, W = diag(v). If aweight analytic weights are specified, then W = diag{v(1′1)/(1′v)}, meaning that the weights are normalized to sum to the number of observations.

The mean squared error is s² = (y′Wy − 2b′X′Wy + b′X′WXb)/(n − p + c). The variance–covariance matrix is s²T(T′X′WXT)⁻¹T′.

This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] _robust, particularly Introduction and Methods and formulas.

cnsreg also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation.

References

Davidson, R., and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press.

Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall.

Hill, R. C., W. E. Griffiths, and G. C. Lim. 2011. Principles of Econometrics. 4th ed. Hoboken, NJ: Wiley.
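As a quick illustration of the stored results described above—this is a sketch, not part of the original entry—the constraints matrix saved in e(Cns) can be listed after estimation, and the fitted coefficients can be checked against the constraint directly:

. use http://www.stata-press.com/data/r13/auto
. constraint 1 price = weight
. cnsreg mpg price weight, constraints(1)
. matrix list e(Cns)
. display _b[price] - _b[weight]

The last command should display zero (up to roundoff), confirming that the reported coefficients satisfy the constraint; the layout of e(Cns) is documented in [P] makecns.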
Also see [R] cnsreg postestimation — Postestimation tools for cnsreg [R] regress — Linear regression [MI] estimation — Estimation commands for use with mi estimate [SVY] svy estimation — Estimation commands for survey data [U] 20 Estimation and postestimation commands Title cnsreg postestimation — Postestimation tools for cnsreg Description Syntax for predict Menu for predict Options for predict Also see Description The following postestimation commands are available after cnsreg: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 2 forecast is not appropriate with mi or svy estimation results. lrtest is not appropriate with svy estimation results. 314 cnsreg postestimation — Postestimation tools for cnsreg 315 Syntax for predict predict type newvar if in , statistic Description statistic Main xb residuals stdp stdf pr(a,b) e(a,b) ystar(a,b) score linear prediction; the default residuals standard error of the prediction standard error of the forecast Pr(a < yj < b) E(yj | a < yj < b) E(yj∗ ), yj∗ = max{a, min(yj , b)} equivalent to residuals These statistics are available both in and out of sample; type predict the estimation sample. stdf is not allowed with svy estimation results. . . . if e(sample) . . . if wanted only for where a and b may be numbers or variables; a missing (a ≥ .) means −∞, and b missing (b ≥ .) means +∞; see [U] 12.2.1 Missing values. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main xb, the default, calculates the linear prediction. residuals calculates the residuals, that is, yj − xj b. stdp calculates the standard error of the prediction, which can be thought of as the standard error of the predicted expected value or mean for the observation’s covariate pattern. The standard error of the prediction is also referred to as the standard error of the fitted value. stdf calculates the standard error of the forecast, which is the standard error of the point prediction for 1 observation. It is commonly referred to as the standard error of the future or forecast value. By construction, the standard errors produced by stdf are always larger than those produced by stdp; see Methods and formulas in [R] regress postestimation. pr(a,b) calculates Pr(a < xj b + uj < b), the probability that yj |xj would be observed in the interval (a, b). 
a and b may be specified as numbers or variable names; lb and ub are variable names; pr(20,30) calculates Pr(20 < xj b + uj < 30); pr(lb,ub) calculates Pr(lb < xj b + uj < ub); and pr(20,ub) calculates Pr(20 < xj b + uj < ub). a missing (a ≥ .) means −∞; pr(.,30) calculates Pr(−∞ < xj b + uj < 30); pr(lb,30) calculates Pr(−∞ < xj b + uj < 30) in observations for which lb ≥ . and calculates Pr(lb < xj b + uj < 30) elsewhere. 316 cnsreg postestimation — Postestimation tools for cnsreg b missing (b ≥ .) means +∞; pr(20,.) calculates Pr(+∞ > xj b + uj > 20); pr(20,ub) calculates Pr(+∞ > xj b + uj > 20) in observations for which ub ≥ . and calculates Pr(20 < xj b + uj < ub) elsewhere. e(a,b) calculates E(xj b + uj | a < xj b + uj < b), the expected value of yj |xj conditional on yj |xj being in the interval (a, b), meaning that yj |xj is truncated. a and b are specified as they are for pr(). ystar(a,b) calculates E(yj∗ ), where yj∗ = a if xj b + uj ≤ a, yj∗ = b if xj b + uj ≥ b, and yj∗ = xj b + uj otherwise, meaning that yj∗ is censored. a and b are specified as they are for pr(). score is equivalent to residuals for linear regression models. Also see [R] cnsreg — Constrained linear regression [U] 20 Estimation and postestimation commands Title constraint — Define and list constraints Syntax References Menu Also see Description Remarks and examples Syntax Define constraints constraint define # exp=exp | coeflist List constraints constraint dir constraint list numlist | all numlist | all Drop constraints constraint drop numlist | all Programmer’s commands constraint get # constraint free where coeflist is as defined in [R] test and # is restricted to the range 1–1,999, inclusive. Menu Statistics > Other > Manage constraints Description constraint defines, lists, and drops linear constraints. Constraints are for use by models that allow constrained estimation. Constraints are defined by the constraint command. The currently defined constraints can be listed by either constraint list or constraint dir; both do the same thing. Existing constraints can be eliminated by constraint drop. constraint get and constraint free are programmer’s commands. constraint get returns the contents of the specified constraint in macro r(contents) and returns in scalar r(defined) 0 or 1—1 being returned if the constraint was defined. constraint free returns the number of a free (unused) constraint in macro r(free). 317 318 constraint — Define and list constraints Remarks and examples Using constraints is discussed in [R] cnsreg, [R] mlogit, and [R] reg3; this entry is concerned only with practical aspects of defining and manipulating constraints. Example 1 Constraints are numbered from 1 to 1,999, and we assign the number when we define the constraint: . use http://www.stata-press.com/data/r13/sysdsn1 (Health insurance data) . constraint 2 [indemnity]2.site = 0 The currently defined constraints can be listed by constraint list: . constraint list 2: [indemnity]2.site = 0 constraint drop drops constraints: . constraint drop 2 . constraint list The empty list after constraint list indicates that no constraints are defined. Below we demonstrate the various syntaxes allowed by constraint: . constraint 1 [Indemnity] . constraint 10 [Indemnity]: 1.site 2.site . constraint 11 [Indemnity]: 3.site . constraint 21 [Prepaid=Uninsure]: nonwhite . constraint 30 [Prepaid] . constraint 31 [Insure] . 
constraint list 1: [Indemnity] 10: [Indemnity]: 1.site 2.site 11: [Indemnity]: 3.site 21: [Prepaid=Uninsure]: nonwhite 30: [Prepaid] 31: [Insure] . constraint drop 21-25, 31 . constraint list 1: [Indemnity] 10: [Indemnity]: 1.site 2.site 11: [Indemnity]: 3.site 30: [Prepaid] . constraint drop _all . constraint list Technical note The constraint command does not check the syntax of the constraint itself because a constraint can be interpreted only in the context of a model. Thus constraint is willing to define constraints that later will not make sense. Any errors in the constraints will be detected and mentioned at the time of estimation. constraint — Define and list constraints 319 References Buis, M. L. 2012. Stata tip 108: On adding and constraining. Stata Journal 12: 342–344. Weesie, J. 1999. sg100: Two-stage linear constrained estimation. Stata Technical Bulletin 47: 24–30. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 217–225. College Station, TX: Stata Press. Also see [R] cnsreg — Constrained linear regression Title contrast — Contrasts and linear hypothesis tests after estimation Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax contrast termlist , options where termlist is a list of factor variables or interactions that appear in the current estimation results. The variables may be typed with or without contrast operators, and you may use any factor-variable syntax: See the operators (op.) table below for the list of contrast operators. options Description Main overall asobserved lincom add a joint hypothesis test for all specified contrasts treat all factor variables as observed treat user-defined contrasts as linear combinations Equations equation(eqspec) atequations perform contrasts in termlist for equation eqspec perform contrasts in termlist within each equation Advanced emptycells(empspec) treatment of empty cells for balanced factors noestimcheck suppress estimability checks Reporting level(#) mcompare(method) noeffects cieffects pveffects effects nowald noatlevels nosvyadjust sort post display options eform option df(#) confidence level; default is level(95) adjust for multiple comparisons; default is mcompare(noadjust) suppress table of individual contrasts show effects table with confidence intervals show effects table with p-values show effects table with confidence intervals and p-values suppress table of Wald tests report only the overall Wald test for terms that use the within @ or nested | operator compute unadjusted Wald tests for survey results sort the individual contrast values in each term post contrasts and their VCEs as estimation results control column formats, row spacing, line width, and factor-variable labeling report exponentiated contrasts use t distribution with # degrees of freedom for computing p-values and confidence intervals df(#) does not appear in the dialog box. 320 contrast — Contrasts and linear hypothesis tests after estimation Term 321 Description Main effects A r.A joint test of the main effects of A individual contrasts that decompose A using r. Interaction effects A#B A#B#C r.A#g.B joint test of the two-way interaction effects of A and B joint test of the three-way interaction effects of A, B, and C individual contrasts for each interaction of A and B defined by r. and g. 
Partial interaction effects r.A#B A#r.B joint tests of interactions of A and B within each contrast defined by r.A joint tests of interactions of A and B within each contrast defined by r.B Simple effects A@B A@B#C r.A@B r.A@B#C joint tests joint tests individual individual of the effects of A of the effects of A contrasts of A that contrasts of A that within each level of B within each combination of the levels of B and C decompose A@B using r. decompose A@B#C using r. Other conditional effects A#B@C A#B@C#D r.A#g.B@C joint tests of the interaction effects of A and B within each level of C joint tests of the interaction effects of A and B within each combination of the levels of C and D individual contrasts for each interaction of A and B that decompose A#B@C using r. and g. Nested effects A|B A|B#C A#B|C A#B|C#D r.A|B r.A|B#C r.A#g.B|C joint tests of the effects of A nested in each level of B joint tests of the effects of A nested in each combination of the levels of B and C joint tests of the interaction effects of A and B nested in each level of C joint tests of the interaction effects of A and B nested in each combination of the levels of C and D individual contrasts of A that decompose A|B using r. individual contrasts of A that decompose A|B#C using r. individual contrasts for each interaction of A and B defined by r. and g. nested in each level of C Slope effects A#c.x A#c.x#c.y A#B#c.x A#B#c.x#c.y r.A#c.x joint test of the effects of A on the slopes of x joint test of the effects of A on the slopes of the product (interaction) of x and y joint test of the interaction effects of A and B on the slopes of x joint test of the interaction effects of A and B on the slopes of the product (interaction) of x and y individual contrasts of A’s effects on the slopes of x using r. Denominators ... / term2 ... / use term2 as the denominator in the F tests of the preceding terms use the residual as the denominator in the F tests of the preceding terms (the default if no other /s are specified) 322 contrast — Contrasts and linear hypothesis tests after estimation A, B, C, and D represent any factor variable in the current estimation results. x and y represent any continuous variable in the current estimation results. r. and g. represent any contrast operator. See the table below. c. specifies that a variable be treated as continuous; see [U] 11.4.3 Factor variables. Operators are allowed on any factor variable that does not appear to the right of @ or |. Operators decompose the effects of the associated factor variable into one-degree-of-freedom effects (contrasts). Higher-level interactions are allowed anywhere an interaction operator (#) appears in the table. Time-series operators are allowed if they were used in the estimation. eqns designates the equations in manova, mlogit, mprobit, and mvreg and can be specified anywhere a factor variable appears. / is allowed only after anova, cnsreg, manova, mvreg, or regress. operators (op.) Description r. a. ar. differences from the reference (base) level; the default differences from the next level (adjacent contrasts) differences from the previous level (reverse adjacent contrasts) As-balanced operators g. h. j. p. q. differences from the balanced grand mean differences from the balanced mean of subsequent levels (Helmert contrasts) differences from the balanced mean of previous levels (reverse Helmert contrasts) orthogonal polynomial in the level values orthogonal polynomial in the level sequence As-observed operators gw. hw. jw. pw. qw. 
differences from the observation-weighted grand mean differences from the observation-weighted mean of subsequent levels differences from the observation-weighted mean of previous levels observation-weighted orthogonal polynomial in the level values observation-weighted orthogonal polynomial in the level sequence One or more individual contrasts may be selected by using the op#. or op(numlist). syntax. For example, a3.A selects the adjacent contrast for level 3 of A, and p(1/2).B selects the linear and quadratic effects of B. Also see Orthogonal polynomial contrasts and Beyond linear models. Custom contrasts Description {A numlist} user-defined contrast on the levels of factor A {A#B numlist} user-defined contrast on the levels of the interaction between A and B Custom contrasts may be part of a term, such as {A numlist}#B, {A numlist}@B, {A numlist}|B, {A#B numlist}, and {A numlist}#{B numlist}. The same is true of higher-order custom contrasts, such as {A#B numlist}@C, {A#B numlist}#r.C, and {A#B numlist}#c.x. Higher-order interactions with at most eight factor variables are allowed with custom contrasts. contrast — Contrasts and linear hypothesis tests after estimation method Description noadjust bonferroni adjustall sidak adjustall scheffe do not adjust for multiple comparisons; the default Bonferroni’s method; adjust across all terms Šidák’s method; adjust across all terms Scheffé’s method 323 Menu Statistics > Postestimation > Contrasts Description contrast tests linear hypotheses and forms contrasts involving factor variables and their interactions from the most recently fit model. The tests include ANOVA-style tests of main effects, simple effects, interactions, and nested effects. contrast can use named contrasts to decompose these effects into comparisons against reference categories, comparisons of adjacent levels, comparisons against the grand mean, orthogonal polynomials, and such. Custom contrasts may also be specified. contrast can be used with svy estimation results; see [SVY] svy postestimation. Contrasts can also be computed for margins of linear and nonlinear responses; see [R] margins, contrast. Options Main overall specifies that a joint hypothesis test over all terms be performed. asobserved specifies that factor covariates be evaluated using the cell frequencies observed in the estimation sample. The default is to treat all factor covariates as though there were an equal number of observations in each level. lincom specifies that user-defined contrasts be treated as linear combinations. The default is to require that all user-defined contrasts sum to zero. (Summing to zero is part of the definition of a contrast.) Equations equation(eqspec) specifies the equation from which contrasts are to be computed. The default is to compute contrasts from the first equation. atequations specifies that the contrasts be computed within each equation. Advanced emptycells(empspec) specifies how empty cells are handled in interactions involving factor variables that are being treated as balanced. emptycells(strict) is the default; it specifies that contrasts involving empty cells be treated as not estimable. emptycells(reweight) specifies that the effects of the observed cells be increased to accommodate any missing cells. This makes the contrast estimable but changes its interpretation. 324 contrast — Contrasts and linear hypothesis tests after estimation noestimcheck specifies that contrast not check for estimability. 
By default, the requested contrasts are checked and those found not estimable are reported as such. Nonestimability is usually caused by empty cells. If noestimcheck is specified, estimates are computed in the usual way and reported even though the resulting estimates are manipulable, which is to say they can differ across equivalent models having different parameterizations.

Reporting

level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals.

mcompare(method) specifies the method for computing p-values and confidence intervals that account for multiple comparisons within a factor-variable term. Most methods adjust the comparisonwise error rate, α_c, to achieve a prespecified experimentwise error rate, α_e.

mcompare(noadjust) is the default; it specifies no adjustment: α_c = α_e.

mcompare(bonferroni) adjusts the comparisonwise error rate based on the upper limit of the Bonferroni inequality, α_e ≤ m·α_c, where m is the number of comparisons within the term. The adjusted comparisonwise error rate is α_c = α_e/m.

mcompare(sidak) adjusts the comparisonwise error rate based on the upper limit of the probability inequality, α_e ≤ 1 − (1 − α_c)^m, where m is the number of comparisons within the term. The adjusted comparisonwise error rate is α_c = 1 − (1 − α_e)^(1/m). This adjustment is exact when the m comparisons are independent.

mcompare(scheffe) controls the experimentwise error rate using the F or χ² distribution with degrees of freedom equal to the rank of the term.

mcompare(method adjustall) specifies that the multiple-comparison adjustments count all comparisons across all terms rather than performing multiple comparisons term by term. This leads to more conservative adjustments when multiple variables or terms are specified in marginslist. This option is compatible only with the bonferroni and sidak methods.

noeffects suppresses the table of individual contrasts with confidence intervals. This table is produced by default when the mcompare() option is specified or when a term in termlist implies all individual contrasts.

cieffects specifies that a table containing a confidence interval for each individual contrast be reported.

pveffects specifies that a table containing a p-value for each individual contrast be reported.

effects specifies that a single table containing a confidence interval and p-value for each individual contrast be reported.
For example, you could use test to perform simultaneous tests of hypotheses on the contrasts, or you could use lincom to create linear combinations. display options: vsquish, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch. vsquish specifies that the blank space separating factor-variable terms or time-series–operated variables from other variables in the model be suppressed. nofvlabel displays factor-variable level values rather than attached value labels. This option overrides the fvlabel setting; see [R] set showbaselevels. fvwrap(#) specifies how many lines to allow when long value labels must be wrapped. Labels requiring more than # lines are truncated. This option overrides the fvwrap setting; see [R] set showbaselevels. fvwrapon(style) specifies whether value labels that wrap will break at word boundaries or break based on available space. fvwrapon(word), the default, specifies that value labels break at word boundaries. fvwrapon(width) specifies that value labels break based on available space. This option overrides the fvwrapon setting; see [R] set showbaselevels. cformat(% fmt) specifies how to format contrasts, standard errors, and confidence limits in the table of estimated contrasts. pformat(% fmt) specifies how to format p-values in the table of estimated contrasts. sformat(% fmt) specifies how to format test statistics in the table of estimated contrasts. nolstretch specifies that the width of the table of estimated contrasts not be automatically widened to accommodate longer variable names. The default, lstretch, is to automatically widen the table of estimated contrasts up to the width of the Results window. To change the default, use set lstretch off. nolstretch is not shown in the dialog box. eform option specifies that the contrasts table be displayed in exponentiated form. econtrast is displayed rather than contrast. Standard errors and confidence intervals are also transformed. See [R] eform option for the list of available options. The following option is available with contrast but is not shown in the dialog box: df(#) specifies that the t distribution with # degrees of freedom be used for computing p-values and confidence intervals. The default is to use e(df r) degrees of freedom or the standard normal distribution if e(df r) is missing. 326 contrast — Contrasts and linear hypothesis tests after estimation Remarks and examples Remarks are presented under the following headings: Introduction One-way models Estimated cell means Testing equality of cell means Reference category contrasts Reverse adjacent contrasts Orthogonal polynomial contrasts Two-way models Estimated interaction cell means Simple effects Interaction effects Main effects Partial interaction effects Three-way and higher-order models Contrast operators Differences from a reference level (r.) Differences from the next level (a.) Differences from the previous level (ar.) Differences from the grand mean (g.) Differences from the mean of subsequent levels (h.) Differences from the mean of previous levels (j.) Orthogonal polynomials (p. and q.) User-defined contrasts Empty cells Empty cells, ANOVA style Nested effects Multiple comparisons Unbalanced data Using observed cell frequencies Weighted contrast operators Testing factor effects on slopes Chow tests Beyond linear models Multiple equations Video example Introduction contrast performs ANOVA-style tests of main effects, interactions, simple effects, and nested effects. 
It can easily decompose these tests into constituent contrasts using either named contrasts (codings) or user-specified contrasts. Comparing levels of factor variables—whether as main effects, interactions, or simple effects—is as easy as adding a contrast operator to the variable. The operators can compare each level with the previous level, each level with a reference level, each level with the mean of previous levels, and more. contrastPtests and estimates contrasts. A contrast of the parameters µ1 , µ2 , . . . , µp is a linear combination i ci µi whose ci sum to zero. A difference of population means that µ1 −µ2 is a contrast, as are most other comparisons of population or model quantities (Coster 2005). Some contrasts may be estimated with lincom, but contrast is much more powerful. contrast can handle multiple contrasts simultaneously, and the command’s contrast operators make it easy to specify complicated linear combinations. Both the contrast operation and the creation of the margins for comparison can be performed as though the data were balanced (typical for experimental designs) or using the observed frequencies in the estimation sample (typical for observational studies). contrast can perform these analyses on the results of almost all of Stata’s estimators, not just the linear-models estimators. contrast — Contrasts and linear hypothesis tests after estimation 327 Most of contrast’s computations can be considered comparisons of estimated cell means from a model fit. Tests of interactions are tests of whether the cell means for the interaction are all equal. Tests of main effects are tests of whether the marginal cell means for the factor are all equal. More focused comparisons of cell means (for example, is level 2 equal to level 1) are specified using contrast operators. More formally, all of contrast’s computations are comparisons of conditional expectations; cell means are one type of conditional expectation. All contrasts can also easily be graphed; see [R] marginsplot. For a discussion of contrasts and testing for linear models, see Searle (1971) and Searle (1997). For discussions specifically related to experimental design, see Kuehl (2000), Winer, Brown, and Michels (1991), and Milliken and Johnson (2009). Rosenthal, Rosnow, and Rubin (2000) focus on contrasts with applications in behavioral sciences. Mitchell (2012) focuses on contrasts in Stata. contrast is a flexible tool for understanding the effects of categorical covariates. If your model contains categorical covariates, and especially if it contains interactions, you will want to use contrast. One-way models Suppose we have collected data on cholesterol levels for individuals from five age groups. To study the effect of age group on cholesterol, we can begin by fitting a one-way model using regress: . use http://www.stata-press.com/data/r13/cholesterol (Artificial cholesterol data) . label list ages ages: 1 10-19 2 20-29 3 30-39 4 40-59 5 60-79 . regress chol i.agegrp Source SS df MS Model Residual 14943.3997 7468.21971 4 70 3735.84993 106.688853 Total 22411.6194 74 302.859722 chol Coef. Std. Err. agegrp 20-29 30-39 40-59 60-79 8.203575 21.54105 30.15067 38.76221 3.771628 3.771628 3.771628 3.771628 _cons 180.5198 2.666944 t Number of obs F( 4, 70) Prob > F R-squared Adj R-squared Root MSE = = = = = = 75 35.02 0.0000 0.6668 0.6477 10.329 P>|t| [95% Conf. 
Interval] 2.18 5.71 7.99 10.28 0.033 0.000 0.000 0.000 .6812991 14.01878 22.6284 31.23993 15.72585 29.06333 37.67295 46.28448 67.69 0.000 175.2007 185.8388 328 contrast — Contrasts and linear hypothesis tests after estimation Estimated cell means margins will show us the estimated cell means for each age group based on our fitted model: . margins agegrp Adjusted predictions Model VCE : OLS Expression : Linear prediction, predict() Margin agegrp 10-19 20-29 30-39 40-59 60-79 180.5198 188.7233 202.0608 210.6704 219.282 Delta-method Std. Err. 2.666944 2.666944 2.666944 2.666944 2.666944 t 67.69 70.76 75.76 78.99 82.22 Number of obs = P>|t| [95% Conf. Interval] 0.000 0.000 0.000 0.000 0.000 175.2007 183.4043 196.7418 205.3514 213.9629 We can graph those means with marginsplot: . marginsplot Variables that uniquely identify margins: agegrp 180 Linear Prediction 200 220 Adjusted Predictions of agegrp with 95% CIs 10−19 20−29 30−39 agegrp 75 40−59 60−79 185.8388 194.0424 207.3799 215.9895 224.601 contrast — Contrasts and linear hypothesis tests after estimation 329 Testing equality of cell means Are all the means equal? That is to say is there an effect of age group on cholesterol level? We can answer that by asking contrast to test whether the means of the age groups are identical. . contrast agegrp Contrasts of marginal linear predictions Margins : asbalanced df F P>F agegrp 4 35.02 0.0000 Denominator 70 The means are clearly different. We could have obtained this same test directly had we fit our model using anova rather than regress. . anova chol agegrp Source Number of obs = 75 Root MSE = 10.329 Partial SS df MS R-squared = 0.6668 Adj R-squared = 0.6477 F Prob > F Model 14943.3997 4 3735.84993 35.02 0.0000 agegrp 14943.3997 4 3735.84993 35.02 0.0000 Residual 7468.21971 70 106.688853 Total 22411.6194 74 302.859722 Achieving a more direct test result is why we recommend using anova instead of regress for models where our focus is on the categorical covariates. The models fit by anova and regress are identical; they merely parameterize the effects differently. The results of contrast will be identical regardless of which command is used to fit the model. If, however, we were fitting models whose responses are nonlinear functions of the covariates, such as logistic regression, then there would be no analogue to anova, and we would appreciate contrast’s ability to quickly test main effects and interactions. 330 contrast — Contrasts and linear hypothesis tests after estimation Reference category contrasts Now that we know that the overall effect of age group is statistically significant, we can explore the effects of each age group. One way to do that is to use the reference category operator, r.: . contrast r.agegrp Contrasts of marginal linear predictions Margins : asbalanced df F P>F agegrp 10-19) 10-19) 10-19) 10-19) Joint 1 1 1 1 4 4.73 32.62 63.91 105.62 35.02 0.0330 0.0000 0.0000 0.0000 0.0000 Denominator 70 (20-29 (30-39 (40-59 (60-79 (20-29 (30-39 (40-59 (60-79 vs vs vs vs vs vs vs vs agegrp 10-19) 10-19) 10-19) 10-19) Contrast Std. Err. [95% Conf. Interval] 8.203575 21.54105 30.15067 38.76221 3.771628 3.771628 3.771628 3.771628 .6812991 14.01878 22.6284 31.23993 15.72585 29.06333 37.67295 46.28448 The cell mean of each age group is compared against the base age group (ages 10–19). The first table shows that each difference is significant. The second table gives an estimate and confidence interval for each contrast. 
These are the comparisons that linear regression gives with a factor covariate and no interactions. The contrasts are identical to the coefficients from our linear regression. Reverse adjacent contrasts We have far more flexibility with contrast. Age group is ordinal, so it is interesting to compare each age group with the preceding age group (rather than against one reference group). We specify that analysis by using the reverse adjacent operator, ar.: . contrast ar.agegrp Contrasts of marginal linear predictions Margins : asbalanced df F P>F agegrp 10-19) 20-29) 30-39) 40-59) Joint 1 1 1 1 4 4.73 12.51 5.21 5.21 35.02 0.0330 0.0007 0.0255 0.0255 0.0000 Denominator 70 (20-29 (30-39 (40-59 (60-79 vs vs vs vs contrast — Contrasts and linear hypothesis tests after estimation (20-29 (30-39 (40-59 (60-79 vs vs vs vs agegrp 10-19) 20-29) 30-39) 40-59) Contrast Std. Err. [95% Conf. Interval] 8.203575 13.33748 8.60962 8.611533 3.771628 3.771628 3.771628 3.771628 .6812991 5.815204 1.087345 1.089257 331 15.72585 20.85976 16.1319 16.13381 The 20–29 age group’s cholesterol level is 8.2 points higher than the 10–19 age group’s cholesterol level; the 30–39 age group’s level is 13.3 points higher than the 20–29 age group’s level; and so on. Each age group is statistically different from the preceding age group at the 5% level. Orthogonal polynomial contrasts The relationship between age group and cholesterol level looked almost linear in our graph. We can examine that relationship further by using the orthogonal polynomial operator, p.: . contrast p.agegrp, noeffects Contrasts of marginal linear predictions Margins : asbalanced df F P>F agegrp (linear) (quadratic) (cubic) (quartic) Joint 1 1 1 1 4 139.11 0.15 0.37 0.43 35.02 0.0000 0.6962 0.5448 0.5153 0.0000 Denominator 70 Only the linear effect is statistically significant. We can even perform the joint test that all effects beyond linear are zero. We do that by selecting all polynomial contrasts above linear—that is, polynomial contrasts 2, 3, and 4. . contrast p(2 3 4).agegrp, noeffects Contrasts of marginal linear predictions Margins : asbalanced df F P>F agegrp (quadratic) (cubic) (quartic) Joint 1 1 1 3 0.15 0.37 0.43 0.32 0.6962 0.5448 0.5153 0.8129 Denominator 70 The joint test has three degrees of freedom and is clearly insignificant. A linear effect of age group seems adequate for this model. 332 contrast — Contrasts and linear hypothesis tests after estimation Two-way models Suppose we are investigating the effects of different dosages of a blood pressure medication and believe that the effects may be different for men and women. We can fit the following ANOVA model for bpchange, the change in diastolic blood pressure. Change is defined as the after measurement minus the before measurement, so that negative values of bpchange correspond to decreases in blood pressure. . use http://www.stata-press.com/data/r13/bpchange (Artificial blood pressure data) . label list gender gender: 1 male 2 female . anova bpchange dose##gender Number of obs = 30 Root MSE = 1.4677 R-squared = Adj R-squared = 0.9647 0.9573 Source Partial SS df Model 1411.9087 5 282.381741 131.09 0.0000 dose gender dose#gender 963.481795 355.118817 93.3080926 2 1 2 481.740897 355.118817 46.6540463 223.64 164.85 21.66 0.0000 0.0000 0.0000 Residual 51.699253 24 2.15413554 Total 1463.60796 29 50.4692399 MS F Prob > F Estimated interaction cell means Everything is significant, including the interaction. So increasing dosage is effective and differs by gender. Let’s explore the effects. 
First, let’s look at the estimated cell mean of blood pressure change for each combination of gender and dosage. . margins dose#gender Adjusted predictions Expression : Linear prediction, predict() Margin dose#gender 250#male 250#female 500#male 500#female 750#male 750#female -7.35384 3.706567 -13.73386 -6.584167 -16.82108 -14.38795 Delta-method Std. Err. .6563742 .6563742 .6563742 .6563742 .6563742 .6563742 t -11.20 5.65 -20.92 -10.03 -25.63 -21.92 Number of obs P>|t| 0.000 0.000 0.000 0.000 0.000 0.000 = 30 [95% Conf. Interval] -8.708529 2.351877 -15.08855 -7.938857 -18.17576 -15.74264 -5.99915 5.061257 -12.37917 -5.229477 -15.46639 -13.03326 Our data are balanced, so these results will not be affected by the many different ways that margins can compute cell means. Moreover, because our model consists of only dose and gender, these are also the point estimates for each combination. contrast — Contrasts and linear hypothesis tests after estimation 333 We can graph the results: . marginsplot Variables that uniquely identify margins: dose gender −20 −15 Linear Prediction −10 −5 0 5 Adjusted Predictions of dose#gender with 95% CIs 250 500 dosage in milligrams per day male 750 female The lines are not parallel, which we expected because the interaction term is significant. Males experience a greater decline in blood pressure at every dosage level, but the effect of increasing dosage is greater for females. In fact, it is not clear if we can tell the difference between male and female response at the maximum dosage. Simple effects We can contrast the male and female responses within dosage to see the simple effects of gender. Because there are only two levels in gender, the choice of contrast operator is largely irrelevant. Aside from orthogonal polynomials, all operators produce the same estimates, although the effects can change signs. . contrast r.gender@dose Contrasts of marginal linear predictions Margins : asbalanced df F P>F gender@dose (female vs male) 250 (female vs male) 500 (female vs male) 750 Joint 1 1 1 3 141.97 59.33 6.87 69.39 0.0000 0.0000 0.0150 0.0000 Denominator 24 gender@dose (female vs male) 250 (female vs male) 500 (female vs male) 750 Contrast Std. Err. [95% Conf. Interval] 11.06041 7.149691 2.433124 .9282533 .9282533 .9282533 9.144586 5.23387 .5173031 12.97623 9.065512 4.348944 334 contrast — Contrasts and linear hypothesis tests after estimation The effect for males is about 11 points higher than for females at a dosage of 250, and that shrinks to 2.4 points higher at the maximum dosage of 750. We can form the simple effects the other way by contrasting the effect of dose at each level of gender: . contrast ar.dose@gender Contrasts of marginal linear predictions Margins : asbalanced df F P>F dose@gender (500 vs 250) male (500 vs 250) female (750 vs 500) male (750 vs 500) female Joint 1 1 1 1 4 47.24 122.90 11.06 70.68 122.65 0.0000 0.0000 0.0028 0.0000 0.0000 Denominator 24 dose@gender (500 vs 250) male (500 vs 250) female (750 vs 500) male (750 vs 500) female Contrast Std. Err. -6.380018 -10.29073 -3.087217 -7.803784 .9282533 .9282533 .9282533 .9282533 [95% Conf. Interval] -8.295839 -12.20655 -5.003038 -9.719605 -4.464198 -8.374914 -1.171396 -5.887963 Here we use the ar. reverse adjacent contrast operator so that first we are comparing a dosage of 500 with a dosage of 250, and then we are comparing 750 with 500. 
We see that increasing the dosage has a larger effect on females—10.3 points when going from 250 to 500 compared with 6.4 points for males, and 7.8 points when going from 500 to 750 versus 3.1 points for males. Interaction effects By specifying contrast operators on both factors, we can decompose the interaction effect into separate interaction contrasts. . contrast ar.dose#r.gender Contrasts of marginal linear predictions Margins : asbalanced df F P>F dose#gender (500 vs 250) (female vs male) (750 vs 500) (female vs male) Joint 1 1 2 8.87 12.91 21.66 0.0065 0.0015 0.0000 Denominator 24 contrast — Contrasts and linear hypothesis tests after estimation dose#gender (500 vs 250) (female vs male) (750 vs 500) (female vs male) Contrast Std. Err. -3.910716 1.312748 -6.620095 -1.201336 -4.716567 1.312748 -7.425947 -2.007187 335 [95% Conf. Interval] Look for departures from zero to indicate an interaction effect between dose and gender. Both contrasts are significantly different from zero. Of course, we already knew the overall interaction was significant from our ANOVA results. The effect of increasing dose from 250 to 500 is 3.9 points greater in females than in males, and the effect of increasing dose from 500 to 750 is 4.7 points greater in females than in males. The confidence intervals for both estimates easily exclude zero, meaning that there is an interaction effect. The joint test of these two interaction effects reproduces the test of interaction effects in the anova output. We can see that the F statistic of 21.66 matches the statistic from our original ANOVA results. Main effects We can perform tests of the main effects by listing each variable individually in contrast. . contrast dose gender Contrasts of marginal linear predictions Margins : asbalanced df F P>F dose 2 223.64 0.0000 gender 1 164.85 0.0000 Denominator 24 The F tests are equivalent to the tests of main effects in the anova output. This is true only for linear models. contrast provides an easy way to obtain main effects and other ANOVA-style tests for models whose responses are not linear in the parameters—logistic, probit, glm, etc. If we include contrast operators on the variables, we can also decompose the main effects into individual contrasts: . contrast ar.dose r.gender Contrasts of marginal linear predictions Margins : asbalanced df F P>F dose (500 vs 250) (750 vs 500) Joint 1 1 2 161.27 68.83 223.64 0.0000 0.0000 0.0000 gender 1 164.85 0.0000 Denominator 24 336 contrast — Contrasts and linear hypothesis tests after estimation Contrast Std. Err. [95% Conf. Interval] dose (500 vs 250) (750 vs 500) -8.335376 -5.4455 .6563742 .6563742 -9.690066 -6.80019 -6.980687 -4.090811 gender (female vs male) 6.881074 .5359273 5.774974 7.987173 By specifying the ar. operator on dose, we decompose the main effect for dose into two one-degreeof-freedom contrasts, comparing the marginal mean of blood pressure change for each dosage level with that of the previous level. Because gender has only two levels, we cannot decompose this main effect any further. However, specifying a contrast operator on gender allowed us to calculate the difference in the marginal means for women and men. Partial interaction effects At this point, we have looked at the total interaction effects and at the main effects of each variable. The partial interaction effects are a midpoint between these two types of effects where we collect the individual interaction effects along the levels of one of the variables and perform a joint test of those interactions. 
If we think of the interaction effects as forming a table, with the levels of one factor variable forming the rows and the levels of the other forming the columns, partial interaction effects are joint tests of the interactions in a row or a column. To perform these tests, we specify a contrast operator on only one of the variables in our interaction. For this particular model, these are not very interesting because our variables have only two and three levels. Therefore, the tests of the partial interaction effects reproduce the tests that we obtained for the total interaction effects. We specify a contrast operator only on dose to decompose the overall test for interaction effects into joint tests for each ar.dose contrast: . contrast ar.dose#gender Contrasts of marginal linear predictions Margins : asbalanced df F P>F dose#gender (500 vs 250) (joint) (750 vs 500) (joint) Joint 1 1 2 8.87 12.91 21.66 0.0065 0.0015 0.0000 Denominator 24 The first row is a joint test of all the interaction effects involving the (500 vs 250) comparison of dosages. The second row is a joint test of all the interaction effects involving the (750 vs 500) comparison. If we look back at our output in Interaction effects, we can see that there was only one of each of these interaction effects. Therefore, each test labeled (joint) has only one degree-of-freedom. We could have instead included a contrast operator on gender to compute the partial interaction effects along the other dimension: contrast — Contrasts and linear hypothesis tests after estimation 337 . contrast dose#r.gender Contrasts of marginal linear predictions Margins : asbalanced df F P>F dose#gender 2 21.66 0.0000 Denominator 24 Here we obtain a joint test of all the interaction effects involving the (female vs male) comparison for gender. Because gender has only two levels, the (female vs male) contrast is the only reference category contrast possible. Therefore, we obtain a single joint test of all the interaction effects. Clearly, the partial interaction effects are not interesting for this particular model. However, if our factors had more levels, the partial interaction effects would produce tests that are not available in the total interaction effects. For example, if our model included factors for four dosage levels and three races, then typing . contrast ar.dose#race would produce three joint tests, one for each of the reverse adjacent contrasts for dosage. Each of these tests would be a two-degree-of-freedom test because race has three levels. Three-way and higher-order models All the contrasts and tests that we reviewed above for two-way models can be used with models that have more terms. For instance, we could fit a three-way full factorial model by using the anova command: . use http://www.stata-press.com/data/r13/cont3way . anova y race##sex##group We could then test the simple effects of race within each level of the interaction between sex and group: . contrast race@sex#group To see the reference category contrasts that decompose these simple effects, type . contrast r.race@sex#group We could test the three-way interaction effects by typing . contrast race#sex#group or the interaction effects for the interaction of race and sex by typing . contrast race#sex To see the individual reference category contrasts that decompose this interaction effect, type . 
contrast r.race#r.sex 338 contrast — Contrasts and linear hypothesis tests after estimation We could even obtain joint tests for the interaction of race and sex within each level of group by typing . contrast race#sex@group For tests of the main effects of each factor, we can type . contrast race sex group We can calculate the individual reference category contrasts that decompose these main effects: . contrast r.race r.sex r.group For the partial interaction effects, we could type . contrast r.race#group to obtain a joint test of the two-way interaction effects of race and group for each of the individual r.race contrasts. We could type . contrast r.race#sex#group to obtain a joint test of all the three-way interaction terms for each of the individual r.race contrasts. Contrast operators contrast recognizes a set of contrast operators that are used to specify commonly used contrasts. When these operators are used, contrast will report a test for each individual contrast in addition to the joint test for the term. We have already seen a few of these, like r. and ar., in the previous examples. Here we will take a closer look at each of the unweighted operators. Here we use the cholesterol dataset and the one-way ANOVA model from the example in One-way models: . use http://www.stata-press.com/data/r13/cholesterol (Artificial cholesterol data) . anova chol agegrp (output omitted ) The margins command reports the estimated cell means, µ b1 , . . . , µ b5 , for each of the five age groups. . margins agegrp Adjusted predictions Expression : Linear prediction, predict() Margin agegrp 10-19 20-29 30-39 40-59 60-79 180.5198 188.7233 202.0608 210.6704 219.282 Delta-method Std. Err. 2.666944 2.666944 2.666944 2.666944 2.666944 t 67.69 70.76 75.76 78.99 82.22 Number of obs = 75 P>|t| [95% Conf. Interval] 0.000 0.000 0.000 0.000 0.000 175.2007 183.4043 196.7418 205.3514 213.9629 185.8388 194.0424 207.3799 215.9895 224.601 contrast — Contrasts and linear hypothesis tests after estimation 339 Contrast operators provide an easy way to make certain types of comparisons of these cell means. We use the ordinal factor agegrp to demonstrate these operators because some types of contrasts are only meaningful when the levels of the factor have a natural ordering. We demonstrate these contrast operators using a one-way model; however, they are equally applicable to main effects, simple effects, and interactions for more complicated models. Differences from a reference level (r.) The r. operator specifies that each level of the attached factor variable be compared with a reference level. These are referred to as reference-level or reference-category contrasts (or effects), and r. is the reference-level operator. In the following, we use the r. operator to test the effect of each category of age group when that category is compared with a reference category. . contrast r.agegrp Contrasts of marginal linear predictions Margins : asbalanced df F P>F agegrp 10-19) 10-19) 10-19) 10-19) Joint 1 1 1 1 4 4.73 32.62 63.91 105.62 35.02 0.0330 0.0000 0.0000 0.0000 0.0000 Denominator 70 (20-29 (30-39 (40-59 (60-79 (20-29 (30-39 (40-59 (60-79 vs vs vs vs vs vs vs vs agegrp 10-19) 10-19) 10-19) 10-19) Contrast Std. Err. [95% Conf. 
Interval] 8.203575 21.54105 30.15067 38.76221 3.771628 3.771628 3.771628 3.771628 .6812991 14.01878 22.6284 31.23993 15.72585 29.06333 37.67295 46.28448 In the first table, the row labeled (20-29 vs 10-19) is a test of µ2 = µ1 , a test that the mean cholesterol levels for the 10–19 age group and the 20–29 age group are equal. The tests in the next three rows are defined similarly. The row labeled Joint provides the joint test for these four hypotheses, which is just the test of the main effects of age group. The second table provides the contrasts of each category with the reference category along with confidence intervals. The contrast in the row labeled (20-29 vs 10-19) is the difference in the cell means of the second age group and the first age group, µ b2 − µ b1 . The first level of a factor is the default reference level, but we can specify a different reference level by using the b. operator; see [U] 11.4.3.2 Base levels. Here we use the last age group, (60-79), instead of the first as the reference category. We also include the nowald option so that only the table of contrasts and their confidence intervals is produced. 340 contrast — Contrasts and linear hypothesis tests after estimation . contrast rb5.agegrp, nowald Contrasts of marginal linear predictions Margins : asbalanced (10-19 (20-29 (30-39 (40-59 vs vs vs vs agegrp 60-79) 60-79) 60-79) 60-79) Contrast Std. Err. -38.76221 -30.55863 -17.22115 -8.611533 3.771628 3.771628 3.771628 3.771628 [95% Conf. Interval] -46.28448 -38.08091 -24.74343 -16.13381 -31.23993 -23.03636 -9.698877 -1.089257 Now the first row is labeled (10-19 vs 60-79) and is the difference in the cell means of the first and fifth age groups. Differences from the next level (a.) The a. operator specifies that each level of the attached factor variable be compared with the next level. These are referred to as adjacent contrasts (or effects), and a. is the adjacent operator. This operator is only meaningful with factor variables that have a natural ordering in the levels. We can use the a. operator to perform tests that each level of age group differs from the next adjacent level. . contrast a.agegrp Contrasts of marginal linear predictions Margins : asbalanced df F P>F agegrp 20-29) 30-39) 40-59) 60-79) Joint 1 1 1 1 4 4.73 12.51 5.21 5.21 35.02 0.0330 0.0007 0.0255 0.0255 0.0000 Denominator 70 (10-19 (20-29 (30-39 (40-59 (10-19 (20-29 (30-39 (40-59 vs vs vs vs vs vs vs vs agegrp 20-29) 30-39) 40-59) 60-79) Contrast Std. Err. -8.203575 -13.33748 -8.60962 -8.611533 3.771628 3.771628 3.771628 3.771628 [95% Conf. Interval] -15.72585 -20.85976 -16.1319 -16.13381 -.6812991 -5.815204 -1.087345 -1.089257 In the first table, the row labeled (10-19 vs 20-29) tests the effect of belonging to the 10–19 age group instead of the 20–29 age group. Likewise, the rows labeled (20-29 vs 30-39), (30-39 vs 40-59), and (40-59 vs 60-79) are tests for the effects of being in the younger of the two age groups instead of the older one. In the second table, the contrast in the row labeled (10-19 vs 20-29) is the difference in the cell means of the first and second age groups, µ b1 − µ b2 . The contrasts in the other rows are defined similarly. contrast — Contrasts and linear hypothesis tests after estimation 341 Differences from the previous level (ar.) The ar. operator specifies that each level of the attached factor variable be compared with the previous level. These are referred to as reverse adjacent contrasts (or effects), and ar. is the reverse adjacent operator. As with the a. 
operator, this operator is only meaningful with factor variables that have a natural ordering in the levels. In the following, we use the ar. operator to report tests for the individual reverse adjacent effects of agegrp. . contrast ar.agegrp Contrasts of marginal linear predictions Margins : asbalanced df F P>F agegrp 10-19) 20-29) 30-39) 40-59) Joint 1 1 1 1 4 4.73 12.51 5.21 5.21 35.02 0.0330 0.0007 0.0255 0.0255 0.0000 Denominator 70 (20-29 (30-39 (40-59 (60-79 (20-29 (30-39 (40-59 (60-79 vs vs vs vs vs vs vs vs agegrp 10-19) 20-29) 30-39) 40-59) Contrast Std. Err. [95% Conf. Interval] 8.203575 13.33748 8.60962 8.611533 3.771628 3.771628 3.771628 3.771628 .6812991 5.815204 1.087345 1.089257 15.72585 20.85976 16.1319 16.13381 Here the Wald tests in the first table for the individual reverse adjacent effects are equivalent to the tests for the adjacent effects in the previous example. However, if we compare values of the contrasts in the bottom tables, we see the difference between the r. and the ar. operators. This time, the contrast in the first row is labeled (20-29 vs 10-19) and is the difference in the cell means of the second and first age groups, µ b2 − µ b1 . This is the estimated effect of belonging to the 20–29 age group instead of the 10–19 age group. The remaining rows make similar comparisons to the previous level. Differences from the grand mean (g.) The g. operator specifies that each level of a factor variable be compared with the grand mean of all levels. For this operator, the grand mean is computed using a simple average of the cell means. 342 contrast — Contrasts and linear hypothesis tests after estimation Here are the grand mean effects of agegrp: . contrast g.agegrp Contrasts of marginal linear predictions Margins : asbalanced df F P>F agegrp mean) mean) mean) mean) mean) Joint 1 1 1 1 1 4 68.42 23.36 0.58 19.08 63.65 35.02 0.0000 0.0000 0.4506 0.0000 0.0000 0.0000 Denominator 70 (10-19 (20-29 (30-39 (40-59 (60-79 (10-19 (20-29 (30-39 (40-59 (60-79 vs vs vs vs vs vs vs vs vs vs agegrp mean) mean) mean) mean) mean) Contrast Std. Err. -19.7315 -11.52793 1.809552 10.41917 19.0307 2.385387 2.385387 2.385387 2.385387 2.385387 [95% Conf. Interval] -24.48901 -16.28543 -2.947953 5.661668 14.2732 -14.974 -6.770423 6.567057 15.17668 23.78821 There are five age groups in our estimation sample. Thus the row labeled (10-19 vs mean) tests µ1 = (µ1 +µ2 +µ3 +µ4 +µ5 )/5. The row labeled (20-29 vs mean) tests µ2 = (µ1 +µ2 +µ3 +µ4 +µ5 )/5. The remaining rows perform similar tests for the third, fourth, and fifth age groups. In our example, the means for all age groups except the 30–39 age group are statistically different from the grand mean. Differences from the mean of subsequent levels (h.) The h. operator specifies that each level of the attached factor variable be compared with the mean of subsequent levels. These are referred to as Helmert contrasts (or effects), and h. is the Helmert operator. For this operator, the mean is computed using a simple average of the cell means. This operator is only meaningful with factor variables that have a natural ordering in the levels. Here are the Helmert contrasts for agegrp: . 
contrast h.agegrp Contrasts of marginal linear predictions Margins : asbalanced (10-19 (20-29 (30-39 (40-59 df F P>F agegrp vs >10-19) vs >20-29) vs >30-39) vs 60-79) Joint 1 1 1 1 4 68.42 50.79 15.63 5.21 35.02 0.0000 0.0000 0.0002 0.0255 0.0000 Denominator 70 contrast — Contrasts and linear hypothesis tests after estimation (10-19 (20-29 (30-39 (40-59 agegrp vs >10-19) vs >20-29) vs >30-39) vs 60-79) Contrast Std. Err. -24.66438 -21.94774 -12.91539 -8.611533 2.981734 3.079522 3.266326 3.771628 343 [95% Conf. Interval] -30.61126 -28.08965 -19.42987 -16.13381 -18.7175 -15.80583 -6.400905 -1.089257 The row labeled (10-19 vs >10-19) tests µ1 = (µ2 + µ3 + µ4 + µ5 )/4, that is, that the cell mean for the youngest age group is equal to the average of the cell means for the older age groups. The row labeled (20-29 vs >20-29) tests µ2 = (µ3 + µ4 + µ5 )/3. The tests in the other rows are defined similarly. Differences from the mean of previous levels (j.) The j. operator specifies that each level of the attached factor variable be compared with the mean of the previous levels. These are referred to as reverse Helmert contrasts (or effects), and j. is the reverse Helmert operator. For this operator, the mean is computed using a simple average of the cell means. This operator is only meaningful with factor variables that have a natural ordering in the levels. Here are the reverse Helmert contrasts of agegrp: . contrast j.agegrp Contrasts of marginal linear predictions Margins : asbalanced (20-29 (30-39 (40-59 (60-79 (20-29 (30-39 (40-59 (60-79 df F P>F agegrp vs 10-19) vs <30-39) vs <40-59) vs <60-79) Joint 1 1 1 1 4 4.73 28.51 43.18 63.65 35.02 0.0330 0.0000 0.0000 0.0000 0.0000 Denominator 70 agegrp vs 10-19) vs <30-39) vs <40-59) vs <60-79) Contrast Std. Err. [95% Conf. Interval] 8.203575 17.43927 20.2358 23.78838 3.771628 3.266326 3.079522 2.981734 .6812991 10.92479 14.09389 17.8415 15.72585 23.95375 26.37771 29.73526 The row labeled (20-29 vs 10-19) tests µ2 = µ1 , that is, that the cell means for the 20–29 and the 10–19 age groups are equal. The row labeled (30-39 vs <30-29) tests µ3 = (µ1 + µ2 )/2, that is, that the cell mean for the 30–39 age group is equal to the average of the cell means for the 10–19 and 20–29 age groups. The tests in the remaining rows are defined similarly. 344 contrast — Contrasts and linear hypothesis tests after estimation Orthogonal polynomials (p. and q.) The p. and q. operators specify that orthogonal polynomials be applied to the attached factor variable. Orthogonal polynomial contrasts allow us to partition the effects of a factor variable into linear, quadratic, cubic, and higher-order polynomial components. The p. operator applies orthogonal polynomials using the values of the factor variable. The q. operator applies orthogonal polynomials using the level indices. If the level values of the factor variable are equally spaced, as with our agegrp variable, then the p. and q. operators yield the same result. These operators are only meaningful with factor variables that have a natural ordering in the levels. Because agegrp has five levels, contrast can test the linear, quadratic, cubic, and quartic effects of agegrp. . 
contrast p.agegrp, noeffects Contrasts of marginal linear predictions Margins : asbalanced df F P>F agegrp (linear) (quadratic) (cubic) (quartic) Joint 1 1 1 1 4 139.11 0.15 0.37 0.43 35.02 0.0000 0.6962 0.5448 0.5153 0.0000 Denominator 70 The row labeled (linear) tests the linear effect of agegrp, the only effect that appears to be significant in this case. The labels for our agegrp variable show the age ranges that correspond to each level. . label list ages: 1 2 3 4 5 ages 10-19 20-29 30-39 40-59 60-79 Notice that these groups do not have equal widths. Now let’s refit our model using the agemidpt variable. The values of agemidpt indicate the midpoint of each age group that was defined by the agegrp variable and are, therefore, not equally spaced. . anova chol agemidpt Number of obs = Root MSE = 75 10.329 R-squared = Adj R-squared = MS F 0.6668 0.6477 Source Partial SS df Prob > F Model 14943.3997 4 3735.84993 35.02 0.0000 agemidpt 14943.3997 4 3735.84993 35.02 0.0000 Residual 7468.21971 70 106.688853 Total 22411.6194 74 302.859722 contrast — Contrasts and linear hypothesis tests after estimation 345 Now if we use the q. operator, we will obtain the same results as above because the level indices of agemidpt are equivalent to the values of agegrp. . contrast q.agemidpt, noeffects Contrasts of marginal linear predictions Margins : asbalanced df F P>F agemidpt (linear) (quadratic) (cubic) (quartic) Joint 1 1 1 1 4 139.11 0.15 0.37 0.43 35.02 0.0000 0.6962 0.5448 0.5153 0.0000 Denominator 70 However, if we use the p. operator, we will instead fit an orthogonal polynomial to the midpoint values. . contrast p.agemidpt, noeffects Contrasts of marginal linear predictions Margins : asbalanced df F P>F agemidpt (linear) (quadratic) (cubic) (quartic) Joint 1 1 1 1 4 133.45 5.40 0.05 1.16 35.02 0.0000 0.0230 0.8198 0.2850 0.0000 Denominator 70 Using the values of the midpoints, the quadratic effect is also significant at the 5% level. Technical note We used the noeffects option when working with orthogonal polynomial contrasts. Apart from perhaps the sign of the contrast, the values of the individual contrasts are not meaningful for orthogonal polynomial contrasts. In addition, many textbooks provide tables with contrast coefficients that can be used to compute orthogonal polynomial contrasts where the levels of a factor are equally spaced. If we use these coefficients and calculate the contrasts manually with user-defined contrasts, as described below, the Wald tests for the polynomial terms will be equivalent, but the values of the individual contrasts will not necessarily match those that we obtain when using the polynomial contrast operator. When we use one of these contrast operators, an algorithm is used to calculate the coefficients of the polynomial contrast that will allow for unequal spacing in the levels of the factor as well as in the weights for the cell frequencies (when using pw. or qw.), as described in Methods and formulas. 346 contrast — Contrasts and linear hypothesis tests after estimation User-defined contrasts In the previous examples, we performed tests using contrast operators. When there is not a contrast operator available to calculate the contrast in which we are interested, we can specify custom contrasts. Here we fit a one-way model for cholesterol on the factor race, which has three levels: . label list race: 1 2 3 race black white other . 
anova chol race Number of obs = 75 Root MSE = 17.3775 R-squared = Adj R-squared = MS F 0.0299 0.0029 Source Partial SS df Prob > F Model 669.278235 2 334.639117 1.11 0.3357 race 669.278235 2 334.639117 1.11 0.3357 Residual 21742.3412 72 301.976961 Total 22411.6194 74 302.859722 margins calculates the estimated cell mean cholesterol level for each race: . margins race Adjusted predictions Expression Margin race black white other Number of obs = 75 : Linear prediction, predict() 204.4279 197.6132 198.7127 Delta-method Std. Err. 3.475497 3.475497 3.475497 t 58.82 56.86 57.18 P>|t| [95% Conf. Interval] 0.000 0.000 0.000 197.4996 190.6849 191.7844 211.3562 204.5415 205.6409 Suppose we want to test the following linear combination: 3 X ci µi i=1 where µi is the cell mean of chol when race is equal to its ith level (the means estimated using margins above). Assuming the ci elements sum to zero, this linear combination is a contrast. We can specify this type of custom contrast by using the following syntax: {race c1 c2 c3 } The null hypothesis for the test of the main effects of race is H0race : µ1 = µ2 = µ3 contrast — Contrasts and linear hypothesis tests after estimation 347 Although H0race can be tested using any of several different contrasts on the cell means, we will test it by comparing the second and third cell means with the first. To test that the cell means for blacks and whites are equal, µ1 = µ2 , we can specify the contrast {race -1 1 0} To test that the cell means for blacks and other races are equal, µ1 = µ3 , we can specify the contrast {race -1 0 1} We can use both in a single call to contrast. . contrast {race -1 1 0} {race -1 0 1} Contrasts of marginal linear predictions Margins : asbalanced df F P>F race (1) (2) Joint 1 1 2 1.92 1.35 1.11 0.1699 0.2488 0.3357 Denominator 72 race (1) (2) Contrast Std. Err. -6.814717 -5.715261 4.915095 4.915095 [95% Conf. Interval] -16.61278 -15.51332 2.983345 4.082801 The row labeled (1) is the test for µ1 = µ2 , the first specified contrast. The row labeled (2) is the test for µ1 = µ3 , the second specified contrast. The row labeled Joint is the overall test for the main effects of race. Now let’s fit a model with two factors, race and age group: . anova chol race##agegrp Source Number of obs = 75 Root MSE = 9.61785 Partial SS df MS R-squared = 0.7524 Adj R-squared = 0.6946 F Prob > F Model 16861.438 14 1204.38843 13.02 0.0000 race agegrp race#agegrp 669.278235 14943.3997 1248.76005 2 4 8 334.639117 3735.84993 156.095006 3.62 40.39 1.69 0.0329 0.0000 0.1201 Residual 5550.18143 60 92.5030238 Total 22411.6194 74 302.859722 The null hypothesis for the test of the main effects of race is now H0race : µ1· = µ2· = µ3· where µi· is the marginal mean of chol when race is equal to its ith level. 348 contrast — Contrasts and linear hypothesis tests after estimation We can use the same syntax as above to perform this test by specifying contrasts on the marginal means of race: . contrast {race -1 1 0} {race -1 0 1} Contrasts of marginal linear predictions Margins : asbalanced df F P>F race (1) (2) Joint 1 1 2 6.28 4.41 3.62 0.0150 0.0399 0.0329 Denominator 60 race (1) (2) Contrast Std. Err. -6.814717 -5.715261 2.720339 2.720339 [95% Conf. Interval] -12.2562 -11.15675 -1.37323 -.2737739 Custom contrasts may be specified on the cell means of interactions, too. Here we use margins to calculate the mean of chol for each cell in the interaction of race and agegrp: . 
margins race#agegrp Adjusted predictions Expression Margin race#agegrp black#10-19 black#20-29 black#30-39 black#40-59 black#60-79 white#10-19 white#20-29 white#30-39 white#40-59 white#60-79 other#10-19 other#20-29 other#30-39 other#40-59 other#60-79 Number of obs = 75 : Linear prediction, predict() 179.2309 196.4777 210.6694 214.097 221.6646 186.0727 184.6714 196.2633 209.9953 211.0633 176.2556 185.0209 199.2498 207.9189 225.118 Delta-method Std. Err. 4.301233 4.301233 4.301233 4.301233 4.301233 4.301233 4.301233 4.301233 4.301233 4.301233 4.301233 4.301233 4.301233 4.301233 4.301233 t 41.67 45.68 48.98 49.78 51.54 43.26 42.93 45.63 48.82 49.07 40.98 43.02 46.32 48.34 52.34 P>|t| [95% Conf. Interval] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 170.6271 187.874 202.0656 205.4933 213.0609 177.469 176.0676 187.6595 201.3916 202.4595 167.6519 176.4172 190.646 199.3152 216.5143 187.8346 205.0814 219.2731 222.7008 230.2684 194.6765 193.2751 204.867 218.5991 219.667 184.8594 193.6247 207.8535 216.5227 233.7218 Now we are interested in testing the following linear combination of these cell means: 3 X 5 X i=1 j=1 cij µij contrast — Contrasts and linear hypothesis tests after estimation 349 We can specify this type of custom contrast using the following syntax: {race#agegrp c11 c12 . . . c15 c21 c22 . . . c25 c31 c32 . . . c35 } Because the marginal means of chol for each level of race are linear combinations of the cell means, we can compose the test for the main effects of race in terms of the cell means directly. The constraint that the marginal means for blacks and whites are equal, µ1· = µ2· , translates to the following constraint on the cell means: 1 1 (µ11 + µ12 + µ13 + µ14 + µ15 ) = (µ21 + µ22 + µ23 + µ24 + µ25 ) 5 5 Ignoring the common factor, we can specify this contrast as {race#agegrp -1 -1 -1 -1 -1 1 1 1 1 1 0 0 0 0 0} contrast will fill in the trailing zeros for us if we neglect to specify them, so {race#agegrp -1 -1 -1 -1 -1 1 1 1 1 1} is also allowed. The other constraint, µ1· = µ3· , translates to 1 1 (µ11 + µ12 + µ13 + µ14 + µ15 ) = (µ31 + µ32 + µ33 + µ34 + µ35 ) 5 5 This can be specified to contrast as {race#agegrp -1 -1 -1 -1 -1 0 0 0 0 0 1 1 1 1 1} The following call to contrast yields the same test results as above. . contrast {race#agegrp -1 -1 -1 -1 -1 1 1 1 1 1} > {race#agegrp -1 -1 -1 -1 -1 0 0 0 0 0 1 1 1 1 1}, noeffects Contrasts of marginal linear predictions Margins : asbalanced df F P>F race#agegrp (1) (1) (2) (2) Joint 1 1 2 6.28 4.41 3.62 0.0150 0.0399 0.0329 Denominator 60 The row labeled (1) (1) is the test for µ11 + µ12 + µ13 + µ14 + µ15 = µ21 + µ22 + µ23 + µ24 + µ25 It was the first specified contrast. The row labeled (2) (2) is the test for µ11 + µ12 + µ13 + µ14 + µ15 = µ31 + µ32 + µ33 + µ34 + µ35 It was the second specified contrast. The row labeled Joint tests (1) (1) and (2) (2) simultaneously. 350 contrast — Contrasts and linear hypothesis tests after estimation We used the noeffects option above to suppress the table of contrasts. We can omit the 1/5 from the equations for µ1· = µ2· and µ1· = µ3· and still obtain the appropriate tests. However, if we want to calculate the differences in the marginal means, we must include the 1/5 = 0.2 on each of the contrast coefficients as follows: . 
contrast {race#agegrp -0.2 -0.2 -0.2 -0.2 -0.2 0.2 0.2 0.2 0.2 0.2} {race#agegrp -0.2 -0.2 -0.2 -0.2 -0.2 0 0 0 0 0 0.2 0.2 0.2 0.2 0.2} So far, we have reproduced the reference category contrasts by specifying user-defined contrasts on the marginal means and then on the cell means. For this test, it would have been easier to use the r. contrast operator: . contrast r.race, noeffects Contrasts of marginal linear predictions Margins : asbalanced df F P>F race (white vs black) (other vs black) Joint 1 1 2 6.28 4.41 3.62 0.0150 0.0399 0.0329 Denominator 60 In most cases, we can use contrast operators to perform tests. However, if we want to compare, for instance, the second and third age groups with the fourth and fifth age groups with the test 1 1 (µ·2 + µ·3 ) = (µ·4 + µ·5 ) 2 2 there is not a contrast operator that corresponds to this particular contrast. A custom contrast is necessary. . contrast {agegrp 0 -0.5 -0.5 0.5 0.5} Contrasts of marginal linear predictions Margins : asbalanced df F P>F agegrp 1 62.19 0.0000 Denominator 60 agegrp (1) Contrast Std. Err. [95% Conf. Interval] 19.58413 2.483318 14.61675 24.5515 contrast — Contrasts and linear hypothesis tests after estimation 351 Empty cells An empty cell is a combination of the levels of factor variables that is not observed in the estimation sample. In the previous examples, we have seen data with three levels of race, five levels of agegrp, and all level combinations of race and agegrp present. Suppose there are no observations for white individuals in the second age group (ages 20–29). . use http://www.stata-press.com/data/r13/cholesterol2 (Artificial cholesterol data, empty cells) . label list ages: 1 10-19 2 20-29 3 30-39 4 40-59 5 60-79 race: 1 black 2 white 3 other . regress chol race##agegrp note: 2.race#2.agegrp identifies no observations in the sample Source SS df MS Model Residual 15751.6113 5022.71559 13 56 1211.66241 89.6913498 Total 20774.3269 69 301.077201 Std. Err. t Number of obs F( 13, 56) Prob > F R-squared Adj R-squared Root MSE P>|t| = = = = = = 70 13.51 0.0000 0.7582 0.7021 9.4706 chol Coef. [95% Conf. Interval] race white other 12.84185 -.167627 5.989703 5.989703 2.14 -0.03 0.036 0.978 .8430383 -12.16644 24.84067 11.83119 agegrp 20-29 30-39 40-59 60-79 17.24681 31.43847 34.86613 44.43374 5.989703 5.989703 5.989703 5.989703 2.88 5.25 5.82 7.42 0.006 0.000 0.000 0.000 5.247991 19.43966 22.86732 32.43492 29.24562 43.43729 46.86495 56.43256 race#agegrp white#20-29 white#30-39 white#40-59 white#60-79 other#20-29 other#30-39 other#40-59 other#60-79 0 -22.83983 -14.67558 -10.51115 -6.054425 -11.48083 -.6796112 -1.578052 (empty) 8.470719 8.470719 8.470719 8.470719 8.470719 8.470719 8.470719 -2.70 -1.73 -1.24 -0.71 -1.36 -0.08 -0.19 0.009 0.089 0.220 0.478 0.181 0.936 0.853 -39.80872 -31.64447 -27.48004 -23.02331 -28.44971 -17.6485 -18.54694 -5.870939 2.293306 6.457735 10.91446 5.488063 16.28928 15.39084 _cons 175.2309 4.235359 41.37 0.000 166.7464 183.7153 352 contrast — Contrasts and linear hypothesis tests after estimation Now let’s use contrast to test the main effects of race: . contrast race Contrasts of marginal linear predictions Margins : asbalanced df race F P>F (not testable) Denominator 56 By “not testable”, contrast means that it cannot form a test for the main effects of race based on estimable functions of the model coefficients. 
agegrp has five levels, so contrast constructs an estimate of the ith margin for race as 5 µ bi· = 5 o 1X 1 X nb c ij µ bij = µ b0 + α bi + βj + (αβ) 5 j=1 5 j=1 c 22 was constrained to zero because of the empty cell, so µ but (αβ) b2· is not an estimable function of the model coefficients. See Estimable functions in Methods and formulas of [R] margins for a technical description of estimable functions. The emptycells(reweight) option causes contrast to estimate µ2· by µ b2· = µ b21 + µ b23 + µ b24 + µ b25 4 which is an estimable function of the model coefficients. . contrast race, emptycells(reweight) Contrasts of marginal linear predictions Margins : asbalanced Empty cells : reweight df F P>F race 2 3.17 0.0498 Denominator 56 contrast — Contrasts and linear hypothesis tests after estimation 353 We can reconstruct the effect of the emptycells(reweight) option by using custom contrasts. . contrast {race#agegrp -4 -4 -4 -4 -4 5 0 5 5 5} > {race#agegrp -1 -1 -1 -1 -1 0 0 0 0 0 1 1 1 1 1}, noeffects Contrasts of marginal linear predictions Margins : asbalanced df F P>F race#agegrp (1) (1) (2) (2) Joint 1 1 2 1.06 2.37 3.17 0.3080 0.1291 0.0498 Denominator 56 The row labeled (1) (1) is the test for 1 1 (µ11 + µ12 + µ13 + µ14 + µ15 ) = (µ21 + µ23 + µ24 + µ25 ) 5 4 It was the first specified contrast. The row labeled (2) (2) is the test for µ11 + µ12 + µ13 + µ14 + µ15 = µ31 + µ32 + µ33 + µ34 + µ35 It was the second specified contrast. The row labeled Joint is the overall test of the main effects of race. Empty cells, ANOVA style Let’s refit the linear model from the previous example with anova to compare with contrast’s test for the main effects of race. . anova chol race##agegrp Number of obs = 70 Root MSE = 9.47055 R-squared = Adj R-squared = MS F 0.7582 0.7021 Source Partial SS df Prob > F Model 15751.6113 13 1211.66241 13.51 0.0000 race agegrp race#agegrp 305.49046 14387.8559 795.807574 2 4 7 152.74523 3596.96397 113.686796 1.70 40.10 1.27 0.1914 0.0000 0.2831 Residual 5022.71559 56 89.6913498 Total 20774.3269 69 301.077201 contrast and anova handled the empty cell differently; the F statistic reported by contrast was 3.17, but anova reported 1.70. To see how they differ, consider the following table of the cell means and margins for our situation. 354 contrast — Contrasts and linear hypothesis tests after estimation 1 race 2 3 1 µ11 µ21 µ31 µ·1 2 µ12 µ32 agegrp 3 4 µ13 µ14 µ23 µ24 µ33 µ34 µ·3 µ·4 5 µ15 µ25 µ35 µ·5 µ1· µ3· For testing the main effects of race, we know that we will be testing the equality of the marginal means for rows 1 and 3, that is, µ1· = µ3· . This translates into the following constraint: µ11 + µ12 + µ13 + µ14 + µ15 = µ31 + µ32 + µ33 + µ34 + µ35 Because row 2 contains an empty cell in column 2, anova dropped column 2 and tested the equality of the marginal mean for row 2 with the average of the marginal means from rows 1 and 3, using only the remaining cell means. This translates into the following constraint: 2(µ21 + µ23 + µ24 + µ25 ) = µ11 + µ13 + µ14 + µ15 + µ31 + µ33 + µ34 + µ35 (1) Now that we know the constraints that anova used to test for the main effects of race, we can use custom contrasts to reproduce the anova test result. . 
contrast {race#agegrp -1 -1 -1 -1 -1 0 0 0 0 0 1 1 1 1 1} > {race#agegrp 1 0 1 1 1 -2 0 -2 -2 -2 1 0 1 1 1}, noeffects Contrasts of marginal linear predictions Margins : asbalanced df F P>F race#agegrp (1) (1) (2) (2) Joint 1 1 2 2.37 1.03 1.70 0.1291 0.3138 0.1914 Denominator 56 The row labeled (1) (1) is the test for µ1· = µ3· ; it was the first specified contrast. The row labeled (2) (2) is the test for the constraint in (1); it was the second specified contrast. The row labeled Joint is an overall test for the main effects of race. Nested effects contrast has the | operator for computing simple effects when the levels of one factor are nested within the levels of another. Here is a fictional example where we are interested in the effect of five methods of teaching algebra on students’ scores for the math portion of the SAT. Suppose three algebra classes are randomly sampled from classes using each of the five methods so that class is nested in method as demonstrated in the following tabulation. contrast — Contrasts and linear hypothesis tests after estimation . use http://www.stata-press.com/data/r13/sat (Artificial SAT data) . tabulate class method method 1 2 3 class 4 5 Total 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 5 5 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 5 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 5 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 5 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 Total 15 15 15 15 15 75 355 We will consider method as fixed and class nested in method as random. To use class nested in method as the error term for method, we can specify the following anova model: . anova score method / class|method / Number of obs = 75 Root MSE = 71.8517 Source Partial SS df MS R-squared = 0.7599 Adj R-squared = 0.7039 F Prob > F Model 980312 14 70022.2857 13.56 0.0000 method class|method 905872 74440 4 10 226468 7444 30.42 0.0000 class|method 74440 10 7444 1.44 0.1845 Residual 309760 60 5162.66667 Total 1290072 74 17433.4054 Like anova, contrast allows the | operator, which specifies that one variable is nested in the levels of another. We can use contrast to test the main effects of method and the simple effects of class within method. 356 contrast — Contrasts and linear hypothesis tests after estimation . contrast method class|method Contrasts of marginal linear predictions Margins : asbalanced df method F P>F 2.80 0.91 1.10 0.22 2.18 1.44 0.0687 0.4089 0.3390 0.8025 0.1221 0.1845 (not testable) class|method 1 2 3 4 5 Joint 2 2 2 2 2 10 Denominator 60 Although contrast was able to perform the individual tests for the simple effects of class within method, empty cells in the interaction between method and class prevented contrast from testing for a main effect of method. Here we add the emptycells(reweight) option so that contrast can take the empty cells into account when computing the marginal means for method. . contrast method class|method, emptycells(reweight) Contrasts of marginal linear predictions Margins : asbalanced Empty cells : reweight df F P>F method 4 43.87 0.0000 class|method 1 2 3 4 5 Joint 2 2 2 2 2 10 2.80 0.91 1.10 0.22 2.18 1.44 0.0687 0.4089 0.3390 0.8025 0.1221 0.1845 Denominator 60 Now contrast does report a test for the main effects of method. However, if we compare this with the anova results, we will see that the results are different. They are different because contrast uses the residual error term to compute the F test by default. Using notation similar to anova, we can use the / operator to specify a different error term for the test. 
Therefore, we can reproduce the test of main effects from our anova command by typing contrast — Contrasts and linear hypothesis tests after estimation 357 . contrast method / class|method /, emptycells(reweight) Contrasts of marginal linear predictions Margins : asbalanced Empty cells : reweight df F P>F method 4 30.42 0.0000 class|method 10 class|method 1 2 3 4 5 Joint 2 2 2 2 2 10 Denominator 60 (denominator) 2.80 0.91 1.10 0.22 2.18 1.44 0.0687 0.4089 0.3390 0.8025 0.1221 0.1845 Multiple comparisons We have seen that contrast can report the individual linear combinations that make up the requested effects. Depending upon the specified option, contrast will report confidence intervals, p-values, or both in the effects table. By default, the reported confidence intervals and p-values are not adjusted for multiple comparisons. Use the mcompare() option to adjust the confidence intervals and p-values for multiple comparisons of the individual effects. Let’s compute the grand mean effects of race using the g. operator. We also specify the mcompare(bonferroni) option to compute p-values and confidence intervals using Bonferroni’s adjustment. . use http://www.stata-press.com/data/r13/cholesterol (Artificial cholesterol data) . anova chol race##agegrp (output omitted ) . contrast g.race, mcompare(bonferroni) Contrasts of marginal linear predictions Margins : asbalanced df F P>F race (black vs mean) (white vs mean) (other vs mean) Joint 1 1 1 2 7.07 2.82 0.96 3.62 0.0100 0.0982 0.3312 0.0329 Denominator 60 Bonferroni P>F 0.0301 0.2947 0.9936 Note: Bonferroni-adjusted p-values are reported for tests on individual contrasts only. 358 contrast — Contrasts and linear hypothesis tests after estimation Number of Comparisons race 3 race (black vs mean) (white vs mean) (other vs mean) Contrast Std. Err. 4.17666 -2.638058 -1.538602 1.570588 1.570588 1.570588 Bonferroni [95% Conf. Interval] .3083743 -6.506343 -5.406887 8.044945 1.230227 2.329684 The last table reports a Bonferroni-adjusted confidence interval for each individual contrast. (Use the effects option to add p-values to the last table.) The first table includes a Bonferroni-adjusted p-value for each test that is not a joint test. Joint tests are never adjusted for multiple comparisons. For example, . contrast race@agegrp, mcompare(bonferroni) Contrasts of marginal linear predictions Margins : asbalanced df F P>F race@agegrp 10-19 20-29 30-39 40-59 60-79 Joint 2 2 2 2 2 10 1.37 2.44 3.12 0.53 2.90 2.07 0.2620 0.0958 0.0512 0.5889 0.0628 0.0409 Denominator 60 Note: Bonferroni-adjusted p-values are reported for tests on individual contrasts only. Number of Comparisons race@agegrp 10 contrast — Contrasts and linear hypothesis tests after estimation (white (white (white (white (white (other (other (other (other (other vs vs vs vs vs vs vs vs vs vs race@agegrp base) 10-19 base) 20-29 base) 30-39 base) 40-59 base) 60-79 base) 10-19 base) 20-29 base) 30-39 base) 40-59 base) 60-79 Contrast Std. Err. 6.841855 -11.80631 -14.40607 -4.101691 -10.60137 -2.975244 -11.45679 -11.41958 -6.17807 3.453375 6.082862 6.082862 6.082862 6.082862 6.082862 6.082862 6.082862 6.082862 6.082862 6.082862 359 Bonferroni [95% Conf. Interval] -10.88697 -29.53513 -32.13489 -21.83051 -28.33019 -20.70407 -29.18561 -29.1484 -23.90689 -14.27545 24.57068 5.922513 3.322751 13.62713 7.127448 14.75358 6.272031 6.309244 11.55075 21.1822 Here we have five tests of simple effects with two degrees of freedom each. 
No Bonferroni-adjusted p-values are available for these tests, but the confidence intervals for the individual contrasts are adjusted. Unbalanced data By default, contrast treats all factors as balanced when computing marginal means. By balanced, we mean that contrast assumes an equal number of observations in each level of each factor and an equal number of observations in each cell of each interaction. If our data are balanced, there is no issue. If, however, our data are not balanced, we might prefer that contrast use the actual cell frequencies from our data in computing marginal means. We instruct contrast to use observed frequencies by adding the asobserved option. Even if our data are unbalanced, we might still want contrast to compute balanced marginal means. It depends on what we want to test and what our data represent. If we have data from a designed experiment that started with an equal number of males and females but the data became unbalanced because the data from a few males were unusable, we might still want our margins computed as though the data were balanced. If, however, we have a representative sample of individuals from Los Angeles with 40% of European descent, 34% African-American, 25% Hispanic, and 1% Australian, we probably want our margins computed using these representative frequencies. We do not want Australians receiving the same weight as Europeans. The following examples will use an unbalanced version of our dataset. . use http://www.stata-press.com/data/r13/cholesterol3 (Artificial cholesterol data, unbalanced) . tab race agegrp agegrp race 10-19 20-29 30-39 40-59 60-79 Total black white other 1 4 3 5 5 7 5 7 6 4 4 5 3 4 4 18 24 25 Total 8 17 18 13 11 67 The row labeled Total gives observed cell frequencies for age group. These can be obtained by summing frequencies from the cells in the corresponding column. In this respect, we can also refer to them as marginal frequencies. We use the terms marginal frequencies and cell frequencies interchangeably below. 360 contrast — Contrasts and linear hypothesis tests after estimation We begin by fitting the two-factor model with an interaction. . anova chol race##agegrp Source Number of obs = 67 Root MSE = 8.37496 Partial SS df MS R-squared = 0.8179 Adj R-squared = 0.7689 F Prob > F Model 16379.9926 14 1169.99947 16.68 0.0000 race agegrp race#agegrp 230.754396 13857.9877 857.815209 2 4 8 115.377198 3464.49693 107.226901 1.64 49.39 1.53 0.2029 0.0000 0.1701 Residual 3647.2774 52 70.13995 Total 20027.27 66 303.443485 Using observed cell frequencies Recall that the marginal means are computed from the cell means. Treating the factors as balanced yields the following marginal means for race: η1· = 1 (µ11 + µ12 + µ13 + µ14 + µ15 ) 5 η2· = 1 (µ21 + µ22 + µ23 + µ24 + µ25 ) 5 η3· = 1 (µ31 + µ32 + µ33 + µ34 + µ35 ) 5 If we have a fixed population and unbalanced cells, then the ηi· do not represent population means. If, however, our data are representative of the population, we can use the frequencies from our estimation sample to estimate the population marginal means, denoted µi· . Here are the results of testing for a main effect of race, treating all the factors as balanced. . contrast r.race Contrasts of marginal linear predictions Margins : asbalanced df F P>F race (white vs black) (other vs black) Joint 1 1 2 3.28 1.50 1.64 0.0757 0.2263 0.2029 Denominator 52 race (white vs black) (other vs black) Contrast Std. Err. -5.324254 -3.596867 2.93778 2.93778 [95% Conf. 
Interval] -11.21934 -9.491955 .5708338 2.298221 contrast — Contrasts and linear hypothesis tests after estimation 361 The row labeled (white vs black) is the test for η2· = η1· . The row labeled (other vs black) is the test for η3· = η1· . If the observed marginal frequencies are representative of the distribution of the levels of agegrp, we can use them to form the marginal means of chol for each of the levels of race from the cell means. µ1· = 1 (8µ11 + 17µ12 + 18µ13 + 13µ14 + 11µ15 ) 67 µ2· = 1 (8µ21 + 17µ22 + 18µ23 + 13µ24 + 11µ25 ) 67 µ3· = 1 (8µ31 + 17µ32 + 18µ33 + 13µ34 + 11µ35 ) 67 Here are the results of testing for the main effects of race, using the observed marginal frequencies: . contrast r.race, asobserved Contrasts of marginal linear predictions Margins : asobserved df F P>F race (white vs black) (other vs black) Joint 1 1 2 7.25 3.89 3.74 0.0095 0.0538 0.0304 Denominator 52 race (white vs black) (other vs black) Contrast Std. Err. -7.232433 -5.231198 2.686089 2.651203 [95% Conf. Interval] -12.62246 -10.55123 -1.842402 .0888295 The row labeled (white vs black) is the test for µ2· = µ1· . The row labeled (other vs black) is the test for µ3· = µ1· . Both tests were insignificant when we tested the cell means resulting from balanced frequencies; however, when we tested the cell means from observed frequencies, the first test is significant beyond the 5% level (and the second test is nearly so). Here we reproduce the results of the asobserved option with custom contrasts. Because we are modifying the way that the marginal means are constructed from the cell means, we will specify the contrasts on the predicted cell means. We use macro expansion, =exp, to evaluate the fractions instead of approximating them with decimals. Macro expansion guarantees that the contrast coefficients sum to zero. For more information, see Macro expansion operators and function in [P] macro. 362 contrast — Contrasts and linear hypothesis tests after estimation . contrast {race#agegrp -‘=8/67’ -‘=17/67’ -‘=18/67’ -‘=13/67’ -‘=11/67’ > ‘=8/67’ ‘=17/67’ ‘=18/67’ ‘=13/67’ ‘=11/67’} > {race#agegrp -‘=8/67’ -‘=17/67’ -‘=18/67’ -‘=13/67’ -‘=11/67’ > 0 0 0 0 0 > ‘=8/67’ ‘=17/67’ ‘=18/67’ ‘=13/67’ ‘=11/67’} Contrasts of marginal linear predictions Margins : asbalanced df F P>F race#agegrp (1) (1) (2) (2) Joint 1 1 2 7.25 3.89 3.74 0.0095 0.0538 0.0304 Denominator 52 race#agegrp (1) (1) (2) (2) Contrast Std. Err. -7.232433 -5.231198 2.686089 2.651203 [95% Conf. Interval] -12.62246 -10.55123 -1.842402 .0888295 Weighted contrast operators contrast provides observation-weighted versions of five of the contrast operators—gw., hw., jw., pw., and qw.. The first three of these operators perform comparisons of means across cells, and like the marginal means just discussed, these means can be computed in two ways: 1) as though the cell frequencies were equal or 2) using the observed cell frequencies from the estimation sample. The weighted operators provide versions of the standard (as balanced) operators that weight these means by their cell frequencies. The two orthogonal polynomial operators involve similar adjustments for weighting. Let’s examine what this means by using the gw. operator. The gw. operator is a weighted version of the g. operator. The gw. operator computes the grand mean using the cell frequencies for race obtained from the model fit. Here we test the effects of race, comparing each level with the weighted grand mean but otherwise treating the factors as balanced in the marginal mean calculations. . 
contrast gw.race Contrasts of marginal linear predictions Margins : asbalanced df F P>F race (black vs mean) (white vs mean) (other vs mean) Joint 1 1 1 2 2.78 2.06 0.06 1.64 0.1014 0.1573 0.8068 0.2029 Denominator 52 contrast — Contrasts and linear hypothesis tests after estimation race (black vs mean) (white vs mean) (other vs mean) Contrast Std. Err. 3.24931 -2.074944 -.347557 1.948468 1.44618 1.414182 363 [95% Conf. Interval] -.6605779 -4.976915 -3.18532 7.159198 .8270276 2.490206 The observed marginal frequencies of race are 18, 24, and 25. Thus the row labeled (black vs mean) tests η1· = (18η1· + 24η2· + 25η3· )/67; the row labeled (white vs mean) tests η2· = (18η1· + 24η2· + 25η3· )/67; and the row labeled (other vs mean) tests η3· = (18η1· + 24η2· + 25η3· )/67. Now we reproduce the above results using custom contrasts. We are weighting the calculation of the grand mean from the marginal means for each of the races, but we are not weighting the calculation of the marginal means themselves. Therefore, we can specify the custom contrast on the marginal means for race instead of on the cell means. . contrast {race ‘=49/67’ -‘=24/67’ -‘=25/67’} > {race -‘=18/67’ ‘=43/67’ -‘=25/67’} > {race -‘=18/67’ -‘=24/67’ ‘=42/67’} Contrasts of marginal linear predictions Margins : asbalanced df F P>F race (1) (2) (3) Joint 1 1 1 2 2.78 2.06 0.06 1.64 0.1014 0.1573 0.8068 0.2029 Denominator 52 race (1) (2) (3) Contrast Std. Err. 3.24931 -2.074944 -.347557 1.948468 1.44618 1.414182 [95% Conf. Interval] -.6605779 -4.976915 -3.18532 7.159198 .8270276 2.490206 Now we will test for each race the difference between the marginal mean and the weighted grand mean, treating the factors as observed in the marginal mean calculations. 364 contrast — Contrasts and linear hypothesis tests after estimation . contrast gw.race, asobserved wald ci Contrasts of marginal linear predictions Margins : asobserved df F P>F race (black vs mean) (white vs mean) (other vs mean) Joint 1 1 1 2 6.81 3.74 0.26 3.74 0.0118 0.0587 0.6099 0.0304 Denominator 52 race (black vs mean) (white vs mean) (other vs mean) Contrast Std. Err. 4.542662 -2.689771 -.6885363 1.740331 1.39142 1.341261 [95% Conf. Interval] 1.050432 -5.481859 -3.379973 8.034891 .1023172 2.002901 The row labeled (black vs mean) tests µ1· = (18µ1· + 24µ2· + 25µ3· )/67; the row labeled (white vs mean) tests µ2· = (18µ1· + 24µ2· + 25µ3· )/67; and the row labeled (other vs mean) tests µ3· = (18µ1· + 24µ2· + 25µ3· )/67. Here we use a custom contrast to reproduce the above result testing µ1· = (18µ1· + 24µ2· + 25µ3· )/67. Because both the calculation of the marginal means and the calculation of the grand mean are adjusted, we specify the custom contrast on the cell means. . contrast {race#agegrp ‘=49/67*8/67’ ‘=49/67*17/67’ ‘=49/67*18/67’ > ‘=49/67*13/67’ ‘=49/67*11/67’ > -‘=24/67*8/67’ -‘=24/67*17/67’ -‘=24/67*18/67’ > -‘=24/67*13/67’ -‘=24/67*11/67’ > -‘=25/67*8/67’ -‘=25/67*17/67’ -‘=25/67*18/67’ > -‘=25/67*13/67’ -‘=25/67*11/67’}, nowald Contrasts of marginal linear predictions Margins : asbalanced race#agegrp (1) (1) Contrast Std. Err. [95% Conf. Interval] 4.542662 1.740331 1.050432 8.034891 The Helmert and reverse Helmert contrasts also involve calculating averages of the marginal means; therefore, weighted versions of these parameters are available as well. The hw. operator is a weighted version of the h. operator that computes the mean of the subsequent levels using the cell frequencies obtained from the model fit. The jw. operator is a weighted version of the j. 
operator that computes the mean of the previous levels using the cell frequencies obtained from the model fit. For orthogonal polynomials, we can use the pw. and qw. operators, which are the weighted versions of the p. and q. operators. In this case, the cell frequencies from the model fit are used in the calculation of the orthogonal polynomial contrast coefficients. contrast — Contrasts and linear hypothesis tests after estimation 365 Testing factor effects on slopes For linear models where the independent variables are all factor variables, the linear prediction at fixed levels of the factor variables turns out to be a cell mean. With these models, contrast computes and tests the effects of the factor variables on the expected mean of the dependent variable. When factor variables are interacted with continuous variables, contrast distinguishes factor effects on the intercept from factor effects on the slope. Here we have 1980 census data including information on the birth rate (brate), the median age (medage), and the region of the country (region) for each of the 50 states. We can fit an ANCOVA model for brate using main effects of the factor variable region and the continuous variable medage. . use http://www.stata-press.com/data/r13/census3 (1980 Census data by state) . label list cenreg: 1 2 3 4 cenreg NE NCentral South West . anova brate i.region c.medage Number of obs = 50 Root MSE = 12.7575 Source Partial SS df MS R-squared = 0.8264 Adj R-squared = 0.8110 F Prob > F Model 34872.8589 4 8718.21473 53.57 0.0000 region medage 2197.75453 15327.423 3 1 732.584844 15327.423 4.50 94.18 0.0076 0.0000 Residual 7323.96108 45 162.754691 Total 42196.82 49 861.159592 For those more comfortable with linear regression, this is equivalent to the regression model . regress brate i.region c.medage You may use either. We can use contrast to compute reference category effects for region. These contrasts compare the adjusted means of NCentral, South, and West regions with the adjusted mean of the NE region. . contrast r.region Contrasts of marginal linear predictions Margins : asbalanced df F P>F region (NCentral vs NE) (South vs NE) (West vs NE) Joint 1 1 1 3 2.24 0.78 10.33 4.50 0.1417 0.3805 0.0024 0.0076 Denominator 45 366 contrast — Contrasts and linear hypothesis tests after estimation region (NCentral vs NE) (South vs NE) (West vs NE) Contrast Std. Err. 9.061063 5.06991 21.71328 6.057484 5.72396 6.755616 [95% Conf. Interval] -3.139337 -6.458738 8.106774 21.26146 16.59856 35.31979 Let’s add the interaction between region and medage to the model. . anova brate region##c.medage Number of obs = 50 Root MSE = 10.0244 R-squared = Adj R-squared = MS F 0.9000 0.8833 Source Partial SS df Prob > F Model 37976.3149 7 5425.18784 53.99 0.0000 region medage region#medage 3405.07044 5279.71448 3103.45597 3 1 3 1135.02348 5279.71448 1034.48532 11.30 52.54 10.29 0.0000 0.0000 0.0000 Residual 4220.5051 42 100.488217 Total 42196.82 49 861.159592 The parameterization for the expected value of brate as a function of region and medage is given by E(brate|region = i, medage) = α0 + αi + β0 medage + βi medage where α0 is the intercept and β0 is the slope of medage. We are modeling the effects of region in two different ways. The αi parameters measure the effect of region on the intercept, and the βi parameters measure the effect of region on the slope of medage. contrast computes and tests effects on slopes separately from effects on intercepts. 
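Concretely, with NE (the first level of region) taken as the reference category, the reference category effects that we compute next correspond, in the notation of the parameterization above, to the hypotheses

\[
H_0^{\text{intercept}}\colon\ \alpha_i - \alpha_{\text{NE}} = 0
\qquad\text{and}\qquad
H_0^{\text{slope}}\colon\ \beta_i - \beta_{\text{NE}} = 0,
\qquad i \in \{\text{NCentral},\ \text{South},\ \text{West}\}
\]

The first set of hypotheses is tested by contrasting region itself; the second, by contrasting region#c.medage.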
First, we will compute the reference category effects of region on the intercept: . contrast r.region Contrasts of marginal linear predictions Margins : asbalanced df F P>F region (NCentral vs NE) (South vs NE) (West vs NE) Joint 1 1 1 3 0.09 0.01 8.50 11.30 0.7691 0.9389 0.0057 0.0000 Denominator 42 region (NCentral vs NE) (South vs NE) (West vs NE) Contrast Std. Err. -49.38396 -9.058983 343.0024 167.1281 117.424 117.6547 [95% Conf. Interval] -386.6622 -246.0302 105.5656 287.8942 227.9123 580.4393 contrast — Contrasts and linear hypothesis tests after estimation 367 Now we will compute the reference category effects of region on the slope of medage: . contrast r.region#c.medage Contrasts of marginal linear predictions Margins : asbalanced df F P>F region#c.medage (NCentral vs NE) (South vs NE) (West vs NE) Joint 1 1 1 3 0.16 0.03 8.18 10.29 0.6917 0.8558 0.0066 0.0000 Denominator 42 region#c.medage (NCentral vs NE) (South vs NE) (West vs NE) Contrast Std. Err. 2.208539 .6928008 -10.94649 5.530981 3.788735 3.827357 [95% Conf. Interval] -8.953432 -6.953175 -18.67041 13.37051 8.338777 -3.22257 At the 5% level, the slope of medage for the West region differs from that of the NE region, but at that level of significance, we cannot say that the slope for the NCentral or the South region differs from that of the NE region. This model is simple enough that the reference category contrasts reproduce the coefficients for region and for the interactions in an equivalent model fit by regress. . regress brate region##c.medage SS df Source MS Number of obs F( 7, 42) Prob > F R-squared Adj R-squared Root MSE = = = = = = 50 53.99 0.0000 0.9000 0.8833 10.024 Model Residual 37976.3149 4220.5051 7 42 5425.18784 100.488217 Total 42196.82 49 861.159592 brate Coef. region NCentral South West -49.38396 -9.058983 343.0024 167.1281 117.424 117.6547 -0.30 -0.08 2.92 0.769 0.939 0.006 -386.6622 -246.0302 105.5656 287.8942 227.9123 580.4393 medage -8.802707 3.462865 -2.54 0.015 -15.79105 -1.814362 2.208539 .6928008 -10.94649 5.530981 3.788735 3.827357 0.40 0.18 -2.86 0.692 0.856 0.007 -8.953432 -6.953175 -18.67041 13.37051 8.338777 -3.22257 411.8268 108.2084 3.81 0.000 193.4533 630.2002 region# c.medage NCentral South West _cons Std. Err. t P>|t| This will not be the case for models that are more complicated. [95% Conf. Interval] 368 contrast — Contrasts and linear hypothesis tests after estimation Chow tests Now let’s suppose we are fitting a model for birth rates on median age and marriage rate. We are also interested in whether the regression coefficients differ for states in the east versus states in the west. We use census divisions to create a new variable, west, that indicates which states are in the western half of the United States. . generate west = inlist(division, 4, 7, 8, 9) We fit a model that includes a separate intercept for west as well as an interaction between west and each of the other variables in our model. . regress brate i.west##c.medage i.west##c.mrgrate Source SS df MS Number of obs F( 5, 44) Prob > F R-squared Adj R-squared Root MSE = = = = = = 50 92.09 0.0000 0.9128 0.9029 9.146 Model Residual 38516.2172 3680.60281 5 44 7703.24344 83.6500639 Total 42196.82 49 861.159592 brate Coef. 
1.west medage 327.8733 -7.532304 58.71793 1.387624 5.58 -5.43 0.000 0.000 209.5351 -10.32888 446.2115 -4.735731 west# c.medage 1 -10.11443 1.849103 -5.47 0.000 -13.84105 -6.387808 mrgrate 828.6813 643.3443 1.29 0.204 -467.8939 2125.257 west# c.mrgrate 1 -800.8036 645.488 -1.24 0.221 -2101.699 500.092 366.5325 47.08904 7.78 0.000 271.6308 461.4343 _cons Std. Err. t P>|t| [95% Conf. Interval] We can test the effects of west on the intercept and on the slopes of medage and mrgrate. We will specify all of these effects in a single contrast command and include the overall option to obtain a joint test of effects, that is, a test that the coefficients for eastern states and for western states are equal. . contrast west west#c.medage west#c.mrgrate, overall Contrasts of marginal linear predictions Margins : asbalanced df F P>F west 1 31.18 0.0000 west#c.medage 1 29.92 0.0000 west#c.mrgrate 1 1.54 0.2213 Overall 3 22.82 0.0000 Denominator 44 This overall test is referred to as a Chow test in econometrics (Chow 1960). contrast — Contrasts and linear hypothesis tests after estimation 369 Beyond linear models contrast may be used after almost any estimation command, with the added benefit that contrast provides direct support for testing main and interaction effects that is not available in most estimation commands. To illustrate, we will use contrast with results from a logistic regression. Stata’s logit command fits logistic regression models, reporting the fitted regression coefficients. The logistic command fits the same models but reports odds ratios. Although contrast can report odds ratios for the computed effects, the tests are all computed from linear combinations of the model coefficients regardless of which estimation command we used. Suppose we have data on patient satisfaction for three hospitals in a city. Let’s begin by fitting a model for satisfied, whether the patient was satisfied with his or her treatment, using the main effects of hospital: . use http://www.stata-press.com/data/r13/hospital, clear (Artificial hospital satisfaction data) . logit satisfied i.hospital Iteration Iteration Iteration Iteration 0: 1: 2: 3: log log log log likelihood likelihood likelihood likelihood = -393.72216 = -387.55736 = -387.4768 = -387.47679 Logistic regression Number of obs LR chi2(2) Prob > chi2 Pseudo R2 Log likelihood = -387.47679 satisfied Coef. hospital 2 3 _cons = = = = 802 12.49 0.0019 0.0159 Std. Err. z P>|z| [95% Conf. Interval] .5348129 .7354519 .2136021 .2221929 2.50 3.31 0.012 0.001 .1161604 .2999618 .9534654 1.170942 1.034708 .1391469 7.44 0.000 .7619855 1.307431 Because there are no other independent variables in this model, the reference category effects of hospital computed by contrast will match the fitted model coefficients, assuming a common reference level. . contrast r.hospital Contrasts of marginal linear predictions Margins hospital (2 vs 1) (3 vs 1) Joint hospital (2 vs 1) (3 vs 1) : asbalanced df chi2 P>chi2 1 1 2 6.27 10.96 12.55 0.0123 0.0009 0.0019 Contrast Std. Err. [95% Conf. Interval] .5348129 .7354519 .2136021 .2221929 .1161604 .2999618 .9534654 1.170942 370 contrast — Contrasts and linear hypothesis tests after estimation We see that the reference category effects are equal to the fitted coefficients. They also have the same interpretation, the difference in log odds from the reference category. The top table also provides a joint test of these effects, a test of the main effects of hospital. 
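Because these effects are differences in log odds, we could instead ask contrast to report them as odds ratios by adding the or option (an additional illustration; output omitted):
. contrast r.hospital, or nowald
The reported odds ratios would simply be the exponentiated contrasts from the table above, roughly 1.71 for (2 vs 1) and 2.09 for (3 vs 1).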
We also have information on the condition for which each patient is being treated in the variable illness. Here we fit a logistic regression using a two-way crossed model of hospital and illness. . label list illness illness: 1 heart attack 2 stroke 3 pneumonia 4 lung disease 5 kidney failure . logistic satisfied hospital##illness Logistic regression Number of obs LR chi2(14) Prob > chi2 Pseudo R2 Log likelihood = -374.46865 satisfied Odds Ratio hospital 2 3 illness stroke pneumonia lung dise.. kidney fa.. hospital# illness 2#stroke 2#pneumonia 2 # lung dise.. 2 # kidney fa.. 3#stroke 3#pneumonia 3 # lung dise.. 3 # kidney fa.. _cons = = = = 802 38.51 0.0004 0.0489 Std. Err. z P>|z| [95% Conf. Interval] 1.226496 1.711111 .5492177 .8061016 0.46 1.14 0.648 0.254 .509921 .6796395 2.950049 4.308021 1.328704 .7993827 1.231481 1.25 .6044214 .3408305 .5627958 .5489438 0.62 -0.53 0.46 0.51 0.532 0.599 0.649 0.611 .544779 .3466015 .5028318 .5285676 3.240678 1.843653 3.016012 2.956102 2.434061 4.045805 1.768427 2.868559 1.22 1.97 0.221 0.049 .5860099 1.008058 10.11016 16.23769 .54713 .3469342 -0.95 0.342 .1578866 1.89599 1.594425 .5416535 1.579502 1.081104 .3590089 1.042504 0.69 -0.93 0.69 0.491 0.355 0.489 .4221288 .1477555 .4332209 6.022312 1.985635 5.758783 3.137388 2.595748 1.38 0.167 .6198955 15.87881 1.672727 1.226149 0.70 0.483 .3976256 7.036812 2.571429 .8099239 3.00 0.003 1.386983 4.767358 Using contrast, we can obtain an ANOVA-style table of tests for the main effects and interaction effects of hospital and illness. contrast — Contrasts and linear hypothesis tests after estimation 371 . contrast hospital##illness Contrasts of marginal linear predictions Margins : asbalanced df chi2 P>chi2 hospital 2 14.92 0.0006 illness 4 4.09 0.3937 hospital#illness 8 20.45 0.0088 Our interaction effect is significant, so we decide to evaluate the simple reference category effects of hospital within illness. We are particularly interested in patient satisfaction when being treated for a heart attack or stroke, so we will use the i. operator to limit our output to simple effects within the first two illnesses. . contrast r.hospital@i(1 2).illness, nowald Contrasts of marginal linear predictions Margins : asbalanced hospital@illness (2 vs 1) heart attack (2 vs 1) stroke (3 vs 1) heart attack (3 vs 1) stroke Contrast Std. Err. .2041611 1.093722 .5371429 -.0759859 .4477942 .5721288 .4710983 .4662325 [95% Conf. Interval] -.6734995 -.0276296 -.3861928 -.9897847 1.081822 2.215074 1.460479 .8378129 The row labeled (2 vs 1) heart attack estimates simple effects on the log odds when comparing hospital 2 with hospital 1 for patients having heart attacks. These effects are differences in the cell means of the linear predictions. We can add the or option to report an odds ratio for each of these simple effects: . contrast r.hospital@i(1 2).illness, nowald or Contrasts of marginal linear predictions Margins : asbalanced Odds Ratio hospital@illness (2 vs 1) heart attack (2 vs 1) stroke (3 vs 1) heart attack (3 vs 1) stroke 1.226496 2.985366 1.711111 .9268293 Std. Err. [95% Conf. Interval] .5492177 1.708014 .8061016 .4321179 .509921 .9727486 .6796395 .3716567 2.950049 9.162089 4.308021 2.311306 These odds ratios are just the exponentiated version of the contrasts in the previous table. For contrasts of the margins of nonlinear predictions, such as predicted probabilities, see [R] margins, contrast. 
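The contrast operators themselves may also be used with margins. For instance (an additional illustration; output omitted), typing
. margins r.hospital
after the logistic fit above would report contrasts of predicted probabilities of satisfaction rather than differences in log odds.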
372 contrast — Contrasts and linear hypothesis tests after estimation Multiple equations contrast works with models containing multiple equations. Commands such as intreg and gnbreg allow their ancillary parameters to be modeled as functions of independent variables, and contrast can compute and test effects within these equations. In addition, contrast allows a special pseudofactor for equation—called eqns—when working with results from manova, mvreg, mlogit, and mprobit. In example 4 of [MV] manova, we fit a two-way MANOVA model using data from Woodard (1931). Here we will fit this model using mvreg. The data represent patients with jaw fractures. y1 is the patient’s age, y2 is blood lymphocytes, and y3 is blood polymorphonuclears. Two factor variables, gender and fracture, are used as independent variables. . use http://www.stata-press.com/data/r13/jaw (Table 4.6 Two-Way Unbalanced Data for Fractures of the Jaw -- Rencher (1998)) . mvreg y1 y2 y3 = gender##fracture, vsquish nofvlabel Equation Obs Parms RMSE "R-sq" y1 y2 y3 27 27 27 6 6 6 10.21777 5.268768 4.993647 0.4086 0.4743 0.4518 t F P 2.902124 3.78967 3.460938 0.0382 0.0133 0.0195 Coef. Std. Err. P>|t| [95% Conf. Interval] -17.5 11.03645 -1.59 0.128 -40.45156 5.451555 -12.625 5.666667 5.518225 5.899231 -2.29 0.96 0.033 0.348 -24.10078 -6.601456 -1.149222 17.93479 21.375 8.833333 39.5 12.68678 13.83492 4.171386 1.68 0.64 9.47 0.107 0.530 0.000 -5.008595 -19.93796 30.82513 47.75859 37.60463 48.17487 20.5 5.69092 3.60 0.002 8.665083 32.33492 -3.125 .6666667 2.84546 3.041925 -1.10 0.22 0.285 0.829 -9.042458 -5.659362 2.792458 6.992696 -19.625 -23.66667 35.5 6.541907 7.133946 2.150966 -3.00 -3.32 16.50 0.007 0.003 0.000 -33.22964 -38.50252 31.02682 -6.02036 -8.830813 39.97318 -18.16667 5.393755 -3.37 0.003 -29.38359 -6.949739 1.083333 -3 2.696877 2.883083 0.40 -1.04 0.692 0.310 -4.52513 -8.9957 6.691797 2.9957 19.91667 23.5 61.16667 6.200305 6.76143 2.038648 3.21 3.48 30.00 0.004 0.002 0.000 7.022426 9.438837 56.92707 32.81091 37.56116 65.40627 y1 2.gender fracture 2 3 gender# fracture 2 2 2 3 _cons y2 2.gender fracture 2 3 gender# fracture 2 2 2 3 _cons y3 2.gender fracture 2 3 gender# fracture 2 2 2 3 _cons contrast computes Wald tests using the coefficients from the first equation by default. contrast — Contrasts and linear hypothesis tests after estimation 373 . contrast gender##fracture Contrasts of marginal linear predictions Margins : asbalanced df F P>F gender 1 2.16 0.1569 fracture 2 2.74 0.0880 gender#fracture 2 1.69 0.2085 Denominator 21 y1 Here we use the equation() option to compute the Wald tests in the y2 equation: . contrast gender##fracture, equation(y2) Contrasts of marginal linear predictions Margins : asbalanced df F P>F gender 1 5.41 0.0301 fracture 2 7.97 0.0027 gender#fracture 2 5.97 0.0088 Denominator 21 y2 Here we use the equation index to compute the Wald tests in the third equation: . contrast gender##fracture, equation(#3) Contrasts of marginal linear predictions Margins : asbalanced df F P>F gender 1 2.23 0.1502 fracture 2 6.36 0.0069 gender#fracture 2 6.66 0.0058 Denominator 21 y3 Here we use the atequations option to compute Wald tests for each equation in the model. We also use the vsquish option to suppress the extra blank lines between terms. 374 contrast — Contrasts and linear hypothesis tests after estimation . 
contrast gender##fracture, atequations vsquish Contrasts of marginal linear predictions Margins : asbalanced df F P>F 1 2 2 2.16 2.74 1.69 0.1569 0.0880 0.2085 1 2 2 5.41 7.97 5.97 0.0301 0.0027 0.0088 gender fracture gender#fracture 1 2 2 2.23 6.36 6.66 0.1502 0.0069 0.0058 Denominator 21 y1 gender fracture gender#fracture y2 gender fracture gender#fracture y3 Because we are investigating the results from mvreg, we can use the special eqns factor to test for a marginal effect on the means among the dependent variables: . contrast _eqns Contrasts of marginal linear predictions Margins : asbalanced df F P>F _eqns 2 49.19 0.0000 Denominator 21 Here we test whether the main effects of gender differ among the dependent variables: . contrast gender#_eqns Contrasts of marginal linear predictions Margins : asbalanced df F P>F gender#_eqns 2 3.61 0.0448 Denominator 21 contrast — Contrasts and linear hypothesis tests after estimation 375 Although it is not terribly interesting in this case, we can even calculate contrasts across equations: . contrast gender#r._eqns Contrasts of marginal linear predictions Margins : asbalanced df F P>F gender#_eqns (joint) (2 vs 1) (joint) (3 vs 1) Joint 1 1 2 5.82 0.40 3.61 0.0251 0.5352 0.0448 Denominator 21 Video example Introduction to contrasts in Stata: One-way ANOVA Stored results contrast stores the following in r(): Scalars r(df r) r(k terms) r(level) Macros r(cmd) r(cmdline) r(est cmd) r(est cmdline) r(title) r(overall) r(emptycells) r(mcmethod) r(mctitle) r(mcadjustall) r(margin method) Matrices r(b) r(V) r(error) r(L) r(table) r(F) r(chi2) r(p) r(df) r(df2) variance degrees of freedom number of terms in termlist confidence level of confidence intervals contrast command as typed e(cmd) from original estimation results e(cmdline) from original estimation results title in output overall or empty empspec from emptycells() method from mcompare() title for method from mcompare() adjustall or empty asbalanced or asobserved contrast estimates variance–covariance matrix of the contrast estimates contrast estimability codes; 0 means estimable, 8 means not estimable matrix of contrasts applied to the model coefficients matrix containing the contrasts with their standard errors, test statistics, p-values, and confidence intervals vector of F statistics; r(df r) present vector of χ2 statistics; r(df r) not present vector of p-values corresponding to r(F) or r(chi2) vector of degrees of freedom corresponding to r(p) vector of denominator degrees of freedom corresponding to r(F) 376 contrast — Contrasts and linear hypothesis tests after estimation contrast with the post option stores the following in e(): Scalars e(df r) e(k terms) variance degrees of freedom number of terms in termlist Macros e(cmd) e(cmdline) e(est cmd) e(est cmdline) e(title) e(overall) e(emptycells) e(margin method) e(properties) contrast command as typed e(cmd) from original estimation results e(cmdline) from original estimation results title in output overall or empty empspec from emptycells() asbalanced or asobserved b V Matrices e(b) e(V) e(error) contrast estimates variance–covariance matrix of the contrast estimates contrast estimability codes; 0 means estimable, 8 means not estimable matrix of contrasts applied to the model coefficients vector of unadjusted F statistics; e(df r) present vector of χ2 statistics; e(df r) not present vector of unadjusted p-values corresponding to e(F) or e(chi2) vector of degrees of freedom corresponding to e(p) vector of denominator degrees of freedom 
corresponding to e(F) e(L) e(F) e(chi2) e(p) e(df) e(df2) Methods and formulas Methods and formulas are presented under the following headings: Marginal linear predictions Contrast operators Reference level contrasts Adjacent contrasts Grand mean contrasts Helmert contrasts Reverse Helmert contrasts Orthogonal polynomial contrasts Contrasts within interactions Multiple comparisons Marginal linear predictions contrast treats intercept effects separately from slope effects. To illustrate, consider the following parameterization for a quadratic regression of y on x that also models the effects of two factor variables A and B , where the levels of A are indexed by i = 1, . . . , ka and the levels of B are indexed by j = 1, . . . , kb . E(y|A = i, B = j, x) = η0ij + η1ij x + η2ij x2 η0ij = η0 + α0i + β0j + (αβ)0ij contrast — Contrasts and linear hypothesis tests after estimation 377 η1ij = η1 + α1i + β1j + (αβ)1ij η2ij = η2 + α2i + β2j + (αβ)2ij We have partitioned the coefficients into three groups of parameters: η0ij is a cell prediction for the intercept, η1ij is a cell prediction for the slope on x, and η2ij is a cell prediction for the slope on x2 . For the intercept parameters, η0 is the intercept, α0i represents a main effect for factor A at its ith level, β0j represents a main effect for factor B at its j th level, and (αβ)0ij represents an effect for the interaction of A and B at the ij th level. The individual coefficients in η1ij and η2ij have similar interpretations, but the effects are on the slopes of x and x2 , respectively. The marginal intercepts for A are given by η0i. = kb X fij η0ij j=1 where fij is a marginal relative frequency of the j th level of B and is controlled by the asobserved and emptycells(reweight) options according to 1/kb , w.j /w.. , fij = 1/(kb − ei. ), wij /wi. , default asobserved emptycells(reweight) emptycells(reweight) and asobserved Above, wij is the number of individuals with A at its ith level and B at its j th, wi. = kb X wij j=1 w.j = ka X wij i=1 w.. = kb ka X X wij i=1 j=1 and ei. is the number of empty cells where A is at its ith level. The marginal intercepts for B and marginal slopes on x and x2 are similarly defined. Estimates for the cell intercepts and slopes are computed using the corresponding linear combination of the coefficients from the fitted model. For example, the estimated cell intercepts are computed using c 0ij ηb0ij = ηb0 + α b0i + βb0j + (αβ) and the estimated marginal intercepts for A are computed as 378 contrast — Contrasts and linear hypothesis tests after estimation ηb0i. = kb X fij ηb0ij j=1 Contrast operators contrast performs Wald tests using linear combinations of marginal linear predictions. For example, the following linear combination can be used to test for a specific effect of factor A on the marginal intercepts. ka X ci η0i. i=1 If the ci elements sum to zero, the linear combination is called a contrast. If the factor A is represented by a variable named A, then we specify this contrast using the following syntax: {A c1 c2 ... cka } Similarly, the following linear combination can be used to test for a specific interaction effect of factors A and B on the marginal slope of x. kb ka X X cij η1ij i=1 j=1 If the factor B is represented by a variable named B, then we specify this contrast using the following syntax: {A#B c11 c12 ... c1kb c21 ... cka kb } contrast has variable operators for several commonly used contrasts. 
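Before turning to the named operators, here is a small illustration of the user-specified contrast syntax just described. It is a hypothetical follow-up to the earlier hospital example, not a contrast shown in this entry; the coefficients sum to zero, so the specification is a valid contrast comparing hospital 1 with the average of hospitals 2 and 3:

. * custom contrast: hospital 1 versus the mean of hospitals 2 and 3
. contrast {hospital -1 .5 .5}
. * the estimates, standard errors, and tests are also saved in r(table)
. matrix list r(table)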
Each contrast operator specifies a matrix of linear combinations that yield the requested set of contrasts to be applied to the marginal linear predictions associated with the attached factor variable.

Reference level contrasts

The r. operator compares each level with a reference level. Let R be the corresponding contrast matrix for factor A; then R is a (ka − 1) × ka matrix with elements

\[
R_{ij} =
\begin{cases}
-1, & \text{if $j$ is the reference level} \\
1, & \text{if $i = j$ and $j$ is less than the reference level} \\
1, & \text{if $i + 1 = j$ and $j$ is greater than the reference level} \\
0, & \text{otherwise}
\end{cases}
\]

If ka = 5 and the reference level is the third level of A (specified as rb(#3).A), then

\[
R =
\begin{pmatrix}
1 & 0 & -1 & 0 & 0 \\
0 & 1 & -1 & 0 & 0 \\
0 & 0 & -1 & 1 & 0 \\
0 & 0 & -1 & 0 & 1
\end{pmatrix}
\]

Adjacent contrasts

The a. operator compares each level with the next level. Let A be the corresponding contrast matrix for factor A; then A is a (ka − 1) × ka matrix with elements

\[
A_{ij} =
\begin{cases}
1, & \text{if } i = j \\
-1, & \text{if } i + 1 = j \\
0, & \text{otherwise}
\end{cases}
\]

If ka = 5, then

\[
A =
\begin{pmatrix}
1 & -1 & 0 & 0 & 0 \\
0 & 1 & -1 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 \\
0 & 0 & 0 & 1 & -1
\end{pmatrix}
\]

The ar. operator compares each level with the previous level. If A is the contrast matrix for the a. operator, then −A is the corresponding contrast matrix for the ar. operator.

Grand mean contrasts

The g. operator compares each level with the mean of all the levels. Let G be the corresponding contrast matrix for factor A; then G is a ka × ka matrix with elements

\[
G_{ij} =
\begin{cases}
1 - 1/k_a, & \text{if } i = j \\
-1/k_a, & \text{if } i \neq j
\end{cases}
\]

If ka = 5, then

\[
G =
\begin{pmatrix}
4/5 & -1/5 & -1/5 & -1/5 & -1/5 \\
-1/5 & 4/5 & -1/5 & -1/5 & -1/5 \\
-1/5 & -1/5 & 4/5 & -1/5 & -1/5 \\
-1/5 & -1/5 & -1/5 & 4/5 & -1/5 \\
-1/5 & -1/5 & -1/5 & -1/5 & 4/5
\end{pmatrix}
\]

The gw. operator compares each level with the weighted mean of all the levels. The weights are taken from the observed weighted cell frequencies in the estimation sample of the fitted model. Let Gw be the corresponding contrast matrix for factor A; then Gw is a ka × ka matrix with elements

\[
Gw_{ij} =
\begin{cases}
1 - w_i/w_\cdot, & \text{if } i = j \\
-w_j/w_\cdot, & \text{if } i \neq j
\end{cases}
\]

where wi is a marginal weight representing the number of individuals with A at its ith level and \(w_\cdot = \sum_i w_i\).

Helmert contrasts

The h. operator compares each level with the mean of the subsequent levels. Let H be the corresponding contrast matrix for factor A; then H is a (ka − 1) × ka matrix with elements

\[
H_{ij} =
\begin{cases}
1, & \text{if } i = j \\
-1/(k_a - i), & \text{if } i < j \\
0, & \text{otherwise}
\end{cases}
\]

If ka = 5, then

\[
H =
\begin{pmatrix}
1 & -1/4 & -1/4 & -1/4 & -1/4 \\
0 & 1 & -1/3 & -1/3 & -1/3 \\
0 & 0 & 1 & -1/2 & -1/2 \\
0 & 0 & 0 & 1 & -1
\end{pmatrix}
\]

The hw. operator compares each level with the weighted mean of the subsequent levels. Let Hw be the corresponding contrast matrix for factor A; then Hw is a (ka − 1) × ka matrix with elements

\[
Hw_{ij} =
\begin{cases}
1, & \text{if } i = j \\
-w_j \big/ \sum_{l=i+1}^{k_a} w_l, & \text{if } i < j \\
0, & \text{otherwise}
\end{cases}
\]

Reverse Helmert contrasts

The j. operator compares each level with the mean of the previous levels. Let J be the corresponding contrast matrix for factor A; then J is a (ka − 1) × ka matrix with elements

\[
J_{ij} =
\begin{cases}
1, & \text{if } i + 1 = j \\
-1/i, & \text{if } j \leq i \\
0, & \text{otherwise}
\end{cases}
\]

If ka = 5, then

\[
J =
\begin{pmatrix}
-1 & 1 & 0 & 0 & 0 \\
-1/2 & -1/2 & 1 & 0 & 0 \\
-1/3 & -1/3 & -1/3 & 1 & 0 \\
-1/4 & -1/4 & -1/4 & -1/4 & 1
\end{pmatrix}
\]

The jw. operator compares each level with the weighted mean of the previous levels. Let Jw be the corresponding contrast matrix for factor A; then Jw is a (ka − 1) × ka matrix with elements

\[
Jw_{ij} =
\begin{cases}
1, & \text{if } i + 1 = j \\
-w_j \big/ \sum_{l=1}^{i} w_l, & \text{if } j \leq i \\
0, & \text{otherwise}
\end{cases}
\]

Orthogonal polynomial contrasts

The p. operator applies orthogonal polynomial contrasts using the level values of the attached factor variable. The q. operator applies orthogonal polynomial contrasts using the level indices of the attached factor variable. These two operators are equivalent when the level values of the attached factor are equally spaced. The pw. and qw. operators are weighted versions of p. and q., where the weights are taken from the observed weighted cell frequencies in the estimation sample of the fitted model.

contrast uses the Christoffel–Darboux recurrence formula for computing orthogonal polynomial contrasts (Abramowitz and Stegun 1972). The elements of the contrasts are normalized such that

\[
Q' W Q = \frac{1}{w_\cdot}\, I
\]

where W is a diagonal matrix of the marginal cell weights \(w_1, w_2, \ldots, w_k\) of the attached factor variable (all 1 for p. and q.), and \(w_\cdot\) is the sum of the weights (the number of levels k for p. and q.).

Contrasts within interactions

Contrast operators are allowed to be specified on factor variables participating in interactions. In such cases, contrast applies the proper matrix product of the contrast matrices to the cell margins of the interacted factor variables.

For example, consider the contrasts implied by specifying r.A#h.B. Let M be the matrix of estimated cell margins for the levels of A and B, where the rows of M are indexed by the levels of A and the columns are indexed by the levels of B. contrast puts the estimated cell margins in the following vector form:

\[
v = \mathrm{vec}(M') = \left( M_{11}, M_{12}, \ldots, M_{1k_b}, M_{21}, M_{22}, \ldots, M_{2k_b}, \ldots, M_{k_a k_b} \right)'
\]

The individual contrasts are then given by the elements of

\[
(R \otimes H)\, v
\]

where \(\otimes\) denotes the Kronecker direct product.

Multiple comparisons

See [R] pwcompare for details on the methods and formulas used to adjust p-values and confidence intervals for multiple comparisons. The formulas for Bonferroni's method and Šidák's method are presented with m = k(k − 1)/2, the number of pairwise comparisons for a factor term with k levels. For contrasts, m is instead the number of contrasts being performed on the factor term; often, m = k − 1 for a term with k levels.

References

Abramowitz, M., and I. A. Stegun, ed. 1972. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. 10th ed. Washington, DC: National Bureau of Standards.
Chow, G. C. 1960. Tests of equality between sets of coefficients in two linear regressions. Econometrica 28: 591–605.
Coster, D. 2005. Contrasts. In Vol. 2 of Encyclopedia of Biostatistics, ed. P. Armitage and T. Colton, 1153–1157. Chichester, UK: Wiley.
Kuehl, R. O. 2000. Design of Experiments: Statistical Principles of Research Design and Analysis. 2nd ed. Belmont, CA: Duxbury.
Milliken, G. A., and D. E. Johnson. 2009. Analysis of Messy Data, Volume 1: Designed Experiments. 2nd ed. Boca Raton, FL: CRC Press.
Mitchell, M. N. 2012. Interpreting and Visualizing Regression Models Using Stata. College Station, TX: Stata Press.
Rosenthal, R., R. L. Rosnow, and D. B. Rubin. 2000. Contrasts and Effect Sizes in Behavioral Research: A Correlational Approach. Cambridge: Cambridge University Press.
Searle, S. R. 1971. Linear Models. New York: Wiley.
Searle, S. R. 1997. Linear Models for Unbalanced Data. New York: Wiley.
Winer, B. J., D. R. Brown, and K. M. Michels. 1991. Statistical Principles in Experimental Design. 3rd ed. New York: McGraw–Hill.
Woodard, D. E. 1931. 
Healing time of fractures of the jaw in relation to delay before reduction, infection, syphilis and blood calcium and phosphorus content. Journal of the American Dental Association 18: 419–442. Also see [R] contrast postestimation — Postestimation tools for contrast [R] lincom — Linear combinations of estimators [R] margins — Marginal means, predictive margins, and marginal effects [R] margins, contrast — Contrasts of margins [R] pwcompare — Pairwise comparisons [R] test — Test linear hypotheses after estimation [U] 20 Estimation and postestimation commands Title contrast postestimation — Postestimation tools for contrast Description Remarks and examples Also see Description The following postestimation commands are available after contrast, post: Command Description estat vce estat (svy) estimates lincom variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses nlcom test testnl Remarks and examples In Orthogonal polynomial contrasts in [R] contrast, we used the p. operator to test the orthogonal polynomial effects of age group. . contrast p.agegrp, noeffects We then used a second contrast command, . contrast p(2 3 4).agegrp, noeffects selecting levels to test whether the quadratic, cubic, and quartic contrasts were jointly significant. We can perform the same joint test by using the test command after specifying the post option with our first contrast command. 383 384 contrast postestimation — Postestimation tools for contrast . use http://www.stata-press.com/data/r13/cholesterol (Artificial cholesterol data) . anova chol agegrp (output omitted ) . contrast p.agegrp, noeffects post Contrasts of marginal linear predictions Margins : asbalanced df F P>F agegrp (linear) (quadratic) (cubic) (quartic) Joint 1 1 1 1 4 139.11 0.15 0.37 0.43 35.02 0.0000 0.6962 0.5448 0.5153 0.0000 Denominator 70 . test ( 1) ( 2) ( 3) p2.agegrp p2.agegrp p3.agegrp p4.agegrp F( 3, Prob p3.agegrp p4.agegrp = 0 = 0 = 0 70) = 0.32 > F = 0.8129 Also see [R] contrast — Contrasts and linear hypothesis tests after estimation [U] 20 Estimation and postestimation commands Title copyright — Display copyright information Syntax Description Remarks and examples Also see Syntax copyright Description copyright presents copyright notifications concerning tools, libraries, etc., used in the construction of Stata. Remarks and examples The correct form for a copyright notice is Copyright dates by author/owner The word “Copyright” is spelled out. You can use the c symbol, but “(C)” has never been given legal recognition. The phrase “All Rights Reserved” was historically required but is no longer needed. Currently, most works are copyrighted from the moment they are written, and no copyright notice is required. Copyright concerns the protection of the expression and structure of facts and ideas, not the facts and ideas themselves. Copyright concerns the ownership of the expression and not the name given to the expression, which is covered under trademark law. 
Copyright law as it exists today began in England in 1710 with the Statute of Anne, An Act for the Encouragement of Learning, by Vesting the Copies of Printed Books in the Authors or Purchases of Such Copies, during the Times therein mentioned . In 1672, Massachusetts introduced the first copyright law in what was to become the United States. After the Revolutionary War, copyright was introduced into the U.S. Constitution in 1787 and went into effect on May 31, 1790. On June 9, 1790, the first copyright in the United States was registered for The Philadelphia Spelling Book by John Barry. There are significant differences in the understanding of copyright in the English- and non–Englishspeaking world. The Napoleonic or Civil Code, the dominant legal system in the non–English-speaking world, splits the rights into two classes: the author’s economic rights and the author’s moral rights. Moral rights are available only to “natural persons”. Legal persons (corporations) have economic rights but not moral rights. Also see Copyright page of this book 385 Title copyright apache — Apache copyright notification Description Also see Description Stata uses portions of the Apache Commons Java components library, Apache log4j Java library, and the docx4j Java library with the express permission of the authors under the Apache License, version 2.0, pursuant to the following notice: Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. “License” shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. “Licensor” shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. “Legal Entity” shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, “control” means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. “You” (or “Your”) shall mean an individual or Legal Entity exercising permissions granted by this License. “Source” form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. “Object” form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. “Work” shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). “Derivative Works” shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. 
386 copyright apache — Apache copyright notification “Contribution” shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, “submitted” means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as “Not a Contribution.” “Contributor” shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royaltyfree, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. 
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: You must give any other recipients of the Work or Derivative Works a copy of this License; and You must cause any modified files to carry prominent notices stating that You changed the files; and You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and If the Work includes a “NOTICE” text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may 387 388 copyright apache — Apache copyright notification add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. 
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets “[]” replaced with your own identifying information. (Don’t include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same “printed page” as the copyright notice for easier identification within third-party archives. copyright apache — Apache copyright notification 389 Copyright [yyyy] [name of copyright owner] Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
Also see [R] copyright — Display copyright information Title copyright boost — Boost copyright notification Description Also see Description Stata uses portions of Boost with the express permission of the authors pursuant to the following notice: Boost Software License - Version 1.0 - August 17, 2003 Permission is hereby granted, free of charge, to any person or organization obtaining a copy of the software and accompanying documentation covered by this license (the “Software”) to use, reproduce, display, distribute, execute, and transmit the Software, and to prepare derivative works of the Software, and to permit third-parties to whom the Software is furnished to do so, all subject to the following: The copyright notices in the Software and this entire statement, including the above license grant, this restriction and the following disclaimer, must be included in all copies of the Software, in whole or in part, and all derivative works of the Software, unless such copies or derivative works are solely in the form of machine-executable object code generated by a source language processor. THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. Also see [R] copyright — Display copyright information 390 Title copyright freetype — FreeType copyright notification Description Legal Terms Also see Description Stata uses portions of FreeType, a library used by JagPDF, which helps create PDF files, with the express permission of the authors. StataCorp thanks and acknowledges the authors of FreeType for producing FreeType and allowing its use in Stata and other software. For more information about FreeType, visit http://www.freetype.org/. The full FreeType copyright notice is Legal Terms 0. Definitions Throughout this license, the terms ‘package’, ‘FreeType Project’, and ‘FreeType archive’ refer to the set of files originally distributed by the authors (David Turner, Robert Wilhelm, and Werner Lemberg) as the ‘FreeType Project’, be they named as alpha, beta or final release. ‘You’ refers to the licensee, or person using the project, where ‘using’ is a generic term including compiling the project’s source code as well as linking it to form a ‘program’ or ‘executable’. This program is referred to as ‘a program using the FreeType engine’. This license applies to all files distributed in the original FreeType Project, including all source code, binaries and documentation, unless otherwise stated in the file in its original, unmodified form as distributed in the original archive. If you are unsure whether or not a particular file is covered by this license, you must contact us to verify this. This license applies to all files distributed in the original FreeType Project, including all source code, binaries and documentation, unless otherwise stated in the file in its original, unmodified form as distributed in the original archive. If you are unsure whether or not a particular file is covered by this license, you must contact us to verify this. The FreeType Project is copyright c 1996–2000 by David Turner, Robert Wilhelm, and Werner Lemberg. All rights reserved except as specified below. 1. 
No Warranty THE FREETYPE PROJECT IS PROVIDED ‘AS IS’ WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT WILL ANY OF THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY DAMAGES CAUSED BY THE USE OR THE INABILITY TO USE, OF THE FREETYPE PROJECT. 391 392 copyright freetype — FreeType copyright notification 2. Redistribution This license grants a worldwide, royalty-free, perpetual and irrevocable right and license to use, execute, perform, compile, display, copy, create derivative works of, distribute and sublicense the FreeType Project (in both source and object code forms) and derivative works thereof for any purpose; and to authorize others to exercise some or all of the rights granted herein, subject to the following conditions: • Redistribution of source code must retain this license file (‘FTL.TXT’) unaltered; any additions, deletions or changes to the original files must be clearly indicated in accompanying documentation. The copyright notices of the unaltered, original files must be preserved in all copies of source files. • Redistribution in binary form must provide a disclaimer that states that the software is based in part of the work of the FreeType Team, in the distribution documentation. We also encourage you to put an URL to the FreeType web page in your documentation, though this isn’t mandatory. These conditions apply to any software derived from or based on the FreeType Project, not just the unmodified files. If you use our work, you must acknowledge us. However, no fee need be paid to us. 3. Advertising Neither the FreeType authors and contributors nor you shall use the name of the other for commercial, advertising, or promotional purposes without specific prior written permission. We suggest, but do not require, that you use one or more of the following phrases to refer to this software in your documentation or advertising materials: ‘FreeType Project’, ‘FreeType Engine’, ‘FreeType library’, or ‘FreeType Distribution’. As you have not signed this license, you are not required to accept it. However, as the FreeType Project is copyrighted material, only this license, or another one contracted with the authors, grants you the right to use, distribute, and modify it. Therefore, by using, distributing, or modifying the FreeType Project, you indicate that you understand and accept all the terms of this license. 4. Contacts There are two mailing lists related to FreeType: • freetype@nongnu.org Discusses general use and applications of FreeType, as well as future and wanted additions to the library and distribution. If you are looking for support, start in this list if you haven’t found anything to help you in the documentation. • freetype-devel@nongnu.org Discusses bugs, as well as engine internals, design issues, specific licenses, porting, etc. 
Our home page can be found at http://www.freetype.org copyright freetype — FreeType copyright notification Also see [R] copyright — Display copyright information 393 Title copyright icu — ICU copyright notification Description Also see Description Stata uses portions of ICU, a library used by JagPDF, which helps create PDF files, with the express permission of the authors pursuant to the following notice: COPYRIGHT AND PERMISSION NOTICE Copyright c 1995–2011 International Business Machines Corporation and others All Rights Reserved Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, provided that the above copyright notice(s) and this permission notice appear in all copies of the Software and that both the above copyright notice(s) and this permission notice appear in supporting documentation. THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization of the copyright holder. All trademarks and registered trademarks mentioned herein are the property of their respective owners. Also see [R] copyright — Display copyright information 394 Title copyright jagpdf — JagPDF copyright notification Description Also see Description Stata uses portions of JagPDF, a library for creating PDF files, with the express permission of the author pursuant to the following notice: The JagPDF Library is Copyright c 2005–2009 Jaroslav Grešula Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the ”Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED ”AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
Also see [R] copyright — Display copyright information 395 Title copyright lapack — LAPACK copyright notification Description Also see Description Stata uses portions of LAPACK, a linear algebra package, with the express permission of the authors pursuant to the following notice: Copyright c 1992–2008 The University of Tennessee. All rights reserved. • Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer. • Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer, listed in this license in the documentation or other materials provided with the distribution or both. • Neither the names of the copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Also see [R] copyright — Display copyright information 396 Title copyright libpng — libpng copyright notification Description Also see Description Stata uses portions of libpng, a library used by JagPDF, which helps create PDF files, with the express permission of the authors. For the purposes of this acknowledgment, “Contributing Authors” is as defined by the copyright notice below. StataCorp thanks and acknowledges the Contributing Authors of libpng and Group 42, Inc. for producing libpng and allowing its use in Stata and other software. For more information about libpng, visit http://www.libpng.org/. The full libpng copyright notice is COPYRIGHT NOTICE, DISCLAIMER, and LICENSE: If you modify libpng you may insert additional notices immediately following this sentence. This code is released under the libpng license. libpng versions 1.2.6, August 15, 2004, through 1.5.2, March 31, 2011, are Copyright c 2004, 2006–2011 Glenn Randers-Pehrson, and are distributed according to the same disclaimer and license as libpng-1.2.5 with the following individual added to the list of Contributing Authors Cosmin Truta libpng versions 1.0.7, July 1, 2000, through 1.2.5 - October 3, 2002, are Copyright c 2000–2002 Glenn Randers-Pehrson, and are distributed according to the same disclaimer and license as libpng-1.0.6 with the following individuals added to the list of Contributing Authors Simon-Pierre Cadieux Eric S. Raymond Gilles Vollant and with the following additions to the disclaimer: There is no warranty against interference with your enjoyment of the library or against infringement. There is no warranty that our efforts or the library will fulfill any of your particular purposes or needs. This library is provided with all faults, and the entire risk of satisfactory quality, performance, accuracy, and effort is with the user. 
libpng versions 0.97, January 1998, through 1.0.6, March 20, 2000, are Copyright c 1998, 1999 Glenn Randers-Pehrson, and are distributed according to the same disclaimer and license as libpng-0.96, with the following individuals added to the list of Contributing Authors: 397 398 copyright libpng — libpng copyright notification Tom Lane Glenn Randers-Pehrson Willem van Schaik libpng versions 0.89, June 1996, through 0.96, May 1997, are Copyright c 1996, 1997 Andreas Dilger Distributed according to the same disclaimer and license as libpng-0.88, with the following individuals added to the list of Contributing Authors: John Bowler Kevin Bracey Sam Bushell Magnus Holmgren Greg Roelofs Tom Tanner libpng versions 0.5, May 1995, through 0.88, January 1996, are Copyright c 1995, 1996 Guy Eric Schalnat, Group 42, Inc. For the purposes of this copyright and license, “Contributing Authors” is defined as the following set of individuals: Andreas Dilger Dave Martindale Guy Eric Schalnat Paul Schmidt Tim Wegner The PNG Reference Library is supplied “AS IS”. The Contributing Authors and Group 42, Inc. disclaim all warranties, expressed or implied, including, without limitation, the warranties of merchantability and of fitness for any purpose. The Contributing Authors and Group 42, Inc. assume no liability for direct, indirect, incidental, special, exemplary, or consequential damages, which may result from the use of the PNG Reference Library, even if advised of the possibility of such damage. Permission is hereby granted to use, copy, modify, and distribute this source code, or portions hereof, for any purpose, without fee, subject to the following restrictions: 1. The origin of this source code must not be misrepresented. 2. Altered versions must be plainly marked as such and must not be misrepresented as being the original source. 3. This Copyright notice may not be removed or altered from any source or altered source distribution. The Contributing Authors and Group 42, Inc. specifically permit, without fee, and encourage the use of this source code as a component to supporting the PNG file format in commercial products. If you use this source code in a product, acknowledgment is not required but would be appreciated. Also see [R] copyright — Display copyright information Title copyright miglayout — MiG Layout copyright notification Description Also see Description Stata uses portions of MiG Layout with the express permission of the author, pursuant to the following notice: Copyright (c) 2004, Mikael Grev, MiG InfoCom AB. (miglayout (at) miginfocom (dot) com) All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of the MiG InfoCom AB nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Also see [R] copyright — Display copyright information 399 Title copyright scintilla — Scintilla copyright notification Description Also see Description Stata uses portions of Scintilla with the express permission of the author, pursuant to the following notice: Copyright c 1998–2002 by Neil HodgsonAll Rights Reserved Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation. NEIL HODGSON DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL NEIL HODGSON BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. Also see [R] copyright — Display copyright information 400 Title copyright ttf2pt1 — ttf2pt1 copyright notification Description Also see Description Stata uses portions of ttf2pt1 to convert TrueType fonts to PostScript fonts, with express permission of the authors, pursuant to the following notice: Copyright c 1997–2003 by the AUTHORS: Andrew Weeks Frank M. Siegert Mark Heath Thomas Henlich Sergey Babkin , Turgut Uyar Rihardas Hepas Szalay Tamas Johan Vromans Petr Titera Lei Wang Chen Xiangyang Zvezdan Petkovic Rigel All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. All advertising materials mentioning features or use of this software must display the following acknowledgment: This product includes software developed by the TTF2PT1 Project and its contributors. 401 402 copyright ttf2pt1 — ttf2pt1 copyright notification THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Also see [R] copyright — Display copyright information Title copyright zlib — zlib copyright notification Description Also see Description Stata uses portions of zlib, a library used by JagPDF, which helps create PDF files, with the express permission of the authors. StataCorp thanks and acknowledges the authors of zlib, Jean-loup Gailly and Mark Adler, for producing zlib and allowing its use in Stata and other software. For more information about zlib, visit http://www.zlib.net/. The full zlib copyright notice is Copyright c 1995–2013 Jean-loup Gailly and Mark Adler This software is provided ’as-is’, without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions: 1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required. 2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software. 3. This notice may not be removed or altered from any source distribution. Jean-loup Gailly Mark Adler Also see [R] copyright — Display copyright information 403 Title correlate — Correlations (covariances) of variables or coefficients Syntax Options for pwcorr References Menu Remarks and examples Also see Description Stored results Options for correlate Methods and formulas Syntax Display correlation matrix or covariance matrix if in weight , correlate options correlate varlist Display all pairwise correlation coefficients pwcorr varlist if in weight , pwcorr options correlate options Description Options means noformat covariance wrap display means, standard deviations, minimums, and maximums with matrix ignore display format associated with variables display covariances allow wide matrices to wrap pwcorr options Description Main obs sig listwise casewise print(#) star(#) bonferroni sidak print number of observations for each entry print significance level for each entry use listwise deletion to handle missing values synonym for listwise significance level for displaying coefficients significance level for displaying with a star use Bonferroni-adjusted significance level use Šidák-adjusted significance level varlist may contain time-series operators; see [U] 11.4.4 Time-series varlists. by is allowed with correlate and pwcorr; see [D] by. aweights and fweights are allowed; see [U] 11.1.6 weight. 
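As a brief illustration of the syntax (not an example from this entry), correlate may be combined with the by prefix to produce one correlation matrix per group; this sketch assumes the auto dataset shipped with Stata:

. sysuse auto, clear
. * a separate correlation matrix for domestic and for foreign cars
. bysort foreign: correlate mpg weight price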
404 correlate — Correlations (covariances) of variables or coefficients 405 Menu correlate Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Correlations and covariances > Summaries, tables, and tests > Summary and descriptive statistics > Pairwise correlations pwcorr Statistics Description The correlate command displays the correlation matrix or covariance matrix for a group of variables. If varlist is not specified, the matrix is displayed for all variables in the dataset. Also see the estat vce command in [R] estat vce. pwcorr displays all the pairwise correlation coefficients between the variables in varlist or, if varlist is not specified, all the variables in the dataset. Options for correlate Options means displays summary statistics (means, standard deviations, minimums, and maximums) with the matrix. noformat displays the summary statistics requested by the means option in g format, regardless of the display formats associated with the variables. covariance displays the covariances rather than the correlation coefficients. wrap requests that no action be taken on wide correlation matrices to make them readable. It prevents Stata from breaking wide matrices into pieces to enhance readability. You might want to specify this option if you are displaying results in a window wider than 80 characters. Then you may need to set linesize to however many characters you can display across a line; see [R] log. Options for pwcorr Main obs adds a line to each row of the matrix reporting the number of observations used to calculate the correlation coefficient. sig adds a line to each row of the matrix reporting the significance level of each correlation coefficient. listwise handles missing values through listwise deletion, meaning that the entire observation is omitted from the estimation sample if any of the variables in varlist is missing for that observation. By default, pwcorr handles missing values by pairwise deletion; all available observations are used to calculate each pairwise correlation without regard to whether variables outside that pair are missing. correlate uses listwise deletion. Thus listwise allows users of pwcorr to mimic correlate’s treatment of missing values while retaining access to pwcorr’s features. casewise is a synonym for listwise. 406 correlate — Correlations (covariances) of variables or coefficients print(#) specifies the significance level of correlation coefficients to be printed. Correlation coefficients with larger significance levels are left blank in the matrix. Typing pwcorr, print(.10) would list only correlation coefficients significant at the 10% level or better. star(#) specifies the significance level of correlation coefficients to be starred. Typing pwcorr, star(.05) would star all correlation coefficients significant at the 5% level or better. bonferroni makes the Bonferroni adjustment to calculated significance levels. This option affects printed significance levels and the print() and star() options. Thus pwcorr, print(.05) bonferroni prints coefficients with Bonferroni-adjusted significance levels of 0.05 or less. sidak makes the Šidák adjustment to calculated significance levels. This option affects printed significance levels and the print() and star() options. Thus pwcorr, print(.05) sidak prints coefficients with Šidák-adjusted significance levels of 0.05 or less. 
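To see the relationship between the two commands' treatment of missing values, consider this small sketch, which assumes the auto dataset, in which rep78 is missing for 5 of the 74 cars; with the listwise option, pwcorr should report the same coefficients as correlate:

. sysuse auto, clear
. correlate mpg rep78 price
. * listwise deletion mimics correlate, dropping the observations with missing rep78
. pwcorr mpg rep78 price, listwise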
Remarks and examples Remarks are presented under the following headings: correlate pwcorr Video example correlate Typing correlate by itself produces a correlation matrix for all variables in the dataset. If you specify the varlist, a correlation matrix for just those variables is displayed. Example 1 We have state data on demographic characteristics of the population. To obtain a correlation matrix, we type . use http://www.stata-press.com/data/r13/census13 (1980 Census data by state) . correlate (obs=50) state brate pop medage division region mrgrate dvcrate medagesq state brate pop 1.0000 0.0208 -0.0540 -0.0624 -0.1345 -0.1339 0.0509 -0.0655 -0.0621 1.0000 -0.2830 -0.8800 0.6356 0.6086 0.0677 0.3508 -0.8609 1.0000 0.3294 -0.1081 -0.1515 -0.1502 -0.2064 0.3324 medage division 1.0000 -0.5207 -0.5292 -0.0177 -0.2229 0.9984 1.0000 0.9688 0.2280 0.5522 -0.5162 region mrgrate 1.0000 0.2490 0.5682 -0.5239 1.0000 0.7700 -0.0202 dvcrate medagesq dvcrate medagesq 1.0000 -0.2192 1.0000 Because we did not specify the wrap option, Stata did its best to make the result readable by breaking the table into two parts. correlate — Correlations (covariances) of variables or coefficients 407 To obtain the correlations between mrgrate, dvcrate, and medage, we type . correlate mrgrate dvcrate medage (obs=50) mrgrate dvcrate medage mrgrate dvcrate medage 1.0000 0.7700 -0.0177 1.0000 -0.2229 1.0000 Example 2 The pop variable in example 1 represents the total population of the state. Thus, to obtain population-weighted correlations among mrgrate, dvcrate, and medage, we type . correlate mrgrate dvcrate medage [w=pop] (analytic weights assumed) (sum of wgt is 2.2591e+08) (obs=50) mrgrate dvcrate medage mrgrate dvcrate medage 1.0000 0.5854 -0.1316 1.0000 -0.2833 1.0000 With the covariance option, correlate can be used to obtain covariance matrices, as well as correlation matrices, for both weighted and unweighted data. Example 3 To obtain the matrix of covariances between mrgrate, dvcrate, and medage, we type correlate mrgrate dvcrate medage, covariance: . correlate mrgrate dvcrate medage, covariance (obs=50) mrgrate dvcrate medage mrgrate dvcrate medage .000662 .000063 1.0e-05 -.000769 -.001191 2.86775 We could have obtained the pop-weighted covariance matrix by typing correlate mrgrate dvcrate medage [w=pop], covariance. 408 correlate — Correlations (covariances) of variables or coefficients pwcorr correlate calculates correlation coefficients by using casewise deletion; when you request correlations of variables x1 , x2 , . . . , xk , any observation for which any of x1 , x2 , . . . , xk is missing is not used. Thus if x3 and x4 have no missing values, but x2 is missing for half the data, the correlation between x3 and x4 is calculated using only the half of the data for which x2 is not missing. Of course, you can obtain the correlation between x3 and x4 by using all the data by typing correlate x3 x4 . pwcorr makes obtaining such pairwise correlation coefficients easier. Example 4 Using auto.dta, we investigate the correlation between several of the variables. . use http://www.stata-press.com/data/r13/auto1 (Automobile Models) . pwcorr mpg price rep78 foreign, obs sig mpg price rep78 foreign mpg 1.0000 74 price rep78 foreign -0.4594 0.0000 74 1.0000 0.3739 0.0016 69 0.0066 0.9574 69 1.0000 0.3613 0.0016 74 0.0487 0.6802 74 0.5922 0.0000 69 74 69 1.0000 74 . 
pwcorr mpg price headroom rear_seat trunk rep78 foreign, print(.05) star(.01)

                 mpg     price  headroom  rear_s~t     trunk     rep78   foreign
      mpg     1.0000
    price    -0.4594*   1.0000
 headroom    -0.4220*             1.0000
rear_seat    -0.5213*   0.4194*   0.5238*   1.0000
    trunk    -0.5703*   0.3143*   0.6620*   0.6480*   1.0000
    rep78     0.3739*                                            1.0000
  foreign     0.3613*            -0.2939   -0.2409   -0.3594*    0.5922*   1.0000

. pwcorr mpg price headroom rear_seat trunk rep78 foreign, print(.05) bon

                 mpg     price  headroom  rear_s~t     trunk     rep78   foreign
      mpg     1.0000
    price    -0.4594    1.0000
 headroom    -0.4220              1.0000
rear_seat    -0.5213    0.4194    0.5238    1.0000
    trunk    -0.5703              0.6620    0.6480    1.0000
    rep78     0.3739                                             1.0000
  foreign     0.3613                                 -0.3594     0.5922    1.0000

Technical note

The correlate command will report the correlation matrix of the data, but there are occasions when you need the matrix stored as a Stata matrix so that you can further manipulate it. You can obtain the matrix by typing

. matrix accum R = varlist, noconstant deviations
. matrix R = corr(R)

The first line places the cross-product matrix of the data in matrix R. The second line converts that to a correlation matrix. Also see [P] matrix define and [P] matrix accum.

Video example

Pearson's correlation coefficient in Stata

Stored results

correlate stores the following in r():

Scalars
    r(N)          number of observations
    r(rho)        ρ (first and second variables)
    r(cov_12)     covariance (covariance only)
    r(Var_1)      variance of first variable (covariance only)
    r(Var_2)      variance of second variable (covariance only)
Matrices
    r(C)          correlation or covariance matrix

pwcorr will leave in its wake only the results of the last call that it makes internally to correlate for the correlation between the last variable and itself. Only rarely is this feature useful.

Methods and formulas

For a discussion of correlation, see, for instance, Snedecor and Cochran (1989, 177–195); for an introductory explanation using Stata examples, see Acock (2014, 200–206).

According to Snedecor and Cochran (1989, 180), the term "co-relation" was first proposed by Galton (1888). The product-moment correlation coefficient is often called the Pearson product-moment correlation coefficient because Pearson (1896) and Pearson and Filon (1898) were partially responsible for popularizing its use. See Stigler (1986) for information on the history of correlation.

The estimate of the product-moment correlation coefficient, ρ, is

    \hat{\rho} = \frac{\sum_{i=1}^{n} w_i (x_i - \bar{x})(y_i - \bar{y})}
                      {\sqrt{\sum_{i=1}^{n} w_i (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} w_i (y_i - \bar{y})^2}}

where w_i are the weights, if specified, or w_i = 1 if weights are not specified; \bar{x} = (\sum w_i x_i)/(\sum w_i) is the mean of x, and \bar{y} is similarly defined.

The unadjusted significance level is calculated by pwcorr as

    p = 2 * ttail(n - 2, |\hat{\rho}| \sqrt{n - 2} / \sqrt{1 - \hat{\rho}^2})

Let v be the number of variables specified so that k = v(v − 1)/2 correlation coefficients are to be estimated. If bonferroni is specified, the adjusted significance level is p' = min(1, kp). If sidak is specified, p' = min{1, 1 − (1 − p)^k}. In both cases, see Methods and formulas in [R] oneway for a more complete description of the logic behind these adjustments.

Carlo Emilio Bonferroni (1892–1960) studied in Turin and taught there and in Bari and Florence. He published on actuarial mathematics, probability, statistics, analysis, geometry, and mechanics.
His work on probability inequalities has been applied to simultaneous statistical inference, although the method known as Bonferroni adjustment usually relies only on an inequality established earlier by Boole. Florence Nightingale David (1909–1993) was born in Ivington, England, to parents who were friends with Florence Nightingale, David’s namesake. She began her studies in statistics under the direction of Karl Pearson at University College London and continued her studies under the direction of Jerzy Neyman. After receiving her doctorate in statistics in 1938, David became a senior statistician for various departments within the British military. She developed statistical models to forecast the toll on life and infrastructure that would occur if a large city were bombed. In 1938, she also published her book Tables of the Correlation Coefficient, dealing with the distributions of correlation coefficients. After the war, she returned to University College London, serving as a lecturer until her promotion to professor in 1962. In 1967, David joined the University of California–Riverside, eventually becoming chair of the Department of Statistics. One of her most well-known works is the book Games, Gods and Gambling: The Origins and History of Probability and Statistical Ideas from the Earliest Times to the Newtonian Era, a history of statistics. David published over 100 papers on topics including combinatorics, symmetric functions, the history of statistics, and applications of statistics, including ecological diversity. She published under the name F. N. David to avoid revealing her gender in a male-dominated profession. Karl Pearson (1857–1936) studied mathematics at Cambridge. He was professor of applied mathematics (1884–1911) and eugenics (1911–1933) at University College London. His publications include literary, historical, philosophical, and religious topics. Statistics became his main interest in the early 1890s after he learned about its application to biological problems. His work centered on distribution theory, the method of moments, correlation, and regression. Pearson introduced the chi-squared test and the terms coefficient of variation, contingency table, heteroskedastic, histogram, homoskedastic, kurtosis, mode, random sampling, random walk, skewness, standard deviation, and truncation. Despite many strong qualities, he also fell into prolonged disagreements with others, most notably, William Bateson and R. A. Fisher. Zbyněk Šidák (1933–1999) was a notable Czech statistician and probabilist. He worked on Markov chains, rank tests, multivariate distribution theory and multiple-comparison methods, and he served as the chief editor of Applications of Mathematics. correlate — Correlations (covariances) of variables or coefficients 411 References Acock, A. C. 2014. A Gentle Introduction to Stata. 4th ed. College Station, TX: Stata Press. Dewey, M. E., and E. Seneta. 2001. Carlo Emilio Bonferroni. In Statisticians of the Centuries, ed. C. C. Heyde and E. Seneta, 411–414. New York: Springer. Eisenhart, C. 1974. Pearson, Karl. In Vol. 10 of Dictionary of Scientific Biography, ed. C. C. Gillispie, 447–473. New York: Charles Scribner’s Sons. Galton, F. 1888. Co-relations and their measurement, chiefly from anthropometric data. Proceedings of the Royal Society of London 45: 135–145. Gleason, J. R. 1996. sg51: Inference about correlations using the Fisher z-transform. Stata Technical Bulletin 32: 13–18. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 121–128. 
College Station, TX: Stata Press. Goldstein, R. 1996. sg52: Testing dependent correlation coefficients. Stata Technical Bulletin 32: 18. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 128–129. College Station, TX: Stata Press. Pearson, K. 1896. Mathematical contributions to the theory of evolution—III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society of London, Series A 187: 253–318. Pearson, K., and L. N. G. Filon. 1898. Mathematical contributions to the theory of evolution. IV. On the probable errors of frequency constants and on the influence of random selection on variation and correlation. Philosophical Transactions of the Royal Society of London, Series A 191: 229–311. Porter, T. M. 2004. Karl Pearson: The Scientific Life in a Statistical Age. Princeton, NJ: Princeton University Press. Rodgers, J. L., and W. A. Nicewander. 1988. Thirteen ways to look at the correlation coefficient. American Statistician 42: 59–66. Rovine, M. J., and A. von Eye. 1997. A 14th way to look at the correlation coefficient: Correlation as the proportion of matches. American Statistician 51: 42–46. Seed, P. T. 2001. sg159: Confidence intervals for correlations. Stata Technical Bulletin 59: 27–28. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 267–269. College Station, TX: Stata Press. Seidler, J., J. Vondráček, and I. Saxl. 2000. The life and work of Zbyněk Šidák (1933–1999). Applications of Mathematics 45: 321–336. Snedecor, G. W., and W. G. Cochran. 1989. Statistical Methods. 8th ed. Ames, IA: Iowa State University Press. Stigler, S. M. 1986. The History of Statistics: The Measurement of Uncertainty before 1900. Cambridge, MA: Belknap Press. Verardi, V., and C. Dehon. 2010. Multivariate outlier detection in Stata. Stata Journal 10: 259–266. Weber, S. 2010. bacon: An effective way to detect outliers in multivariate data using Stata (and Mata). Stata Journal 10: 331–338. Wolfe, F. 1997. sg64: pwcorrs: Enhanced correlation display. Stata Technical Bulletin 35: 22–25. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 163–167. College Station, TX: Stata Press. . 1999. sg64.1: Update to pwcorrs. Stata Technical Bulletin 49: 17. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 159. College Station, TX: Stata Press. Also see [R] esize — Effect size based on mean comparison [R] icc — Intraclass correlation coefficients [R] pcorr — Partial and semipartial correlation coefficients [R] spearman — Spearman’s and Kendall’s correlations [R] summarize — Summary statistics [R] tetrachoric — Tetrachoric correlations for binary variables Title cumul — Cumulative distribution Syntax Remarks and examples Menu Acknowledgment Description References Options Also see Syntax cumul varname if in weight , generate(newvar) options Description options Main ∗ generate(newvar) freq equal create variable newvar use frequency units for cumulative generate equal cumulatives for tied values ∗ generate(newvar) is required. by is allowed; see [D] by. fweights and aweights are allowed; see [U] 11.1.6 weight. Menu Statistics > Summaries, tables, and tests > Distributional plots and tests > Generate cumulative distribution Description cumul creates newvar, defined as the empirical cumulative distribution function of varname. Options Main generate(newvar) is required. It specifies the name of the new variable to be created. freq specifies that the cumulative be in frequency units; otherwise, it is normalized so that newvar is 1 for the largest value of varname. 
equal requests that observations with equal values in varname get the same cumulative value in newvar. 412 cumul — Cumulative distribution 413 Remarks and examples Example 1 cumul is most often used with graph to graph the empirical cumulative distribution. For instance, we have data on the median family income of 957 U.S. cities: . use http://www.stata-press.com/data/r13/hsng (1980 Census housing data) . cumul faminc, gen(cum) . sort cum . line cum faminc, ylab(, grid) ytitle("") xlab(, grid) > title("Cumulative of median family income") > subtitle("1980 Census, 957 U.S. Cities") Cumulative of median family income 0 .2 .4 .6 .8 1 1980 Census, 957 U.S. Cities 15000.00 20000.00 25000.00 Median family inc., 1979 30000.00 It would have been enough to type line cum faminc, but we wanted to make the graph look better; see [G-2] graph twoway line. If we had wanted a weighted cumulative, we would have typed cumul faminc [w=pop] at the first step. Example 2 To graph two (or more) cumulatives on the same graph, use cumul and stack; see [D] stack. For instance, we have data on the average January and July temperatures of 956 U.S. cities: 414 cumul — Cumulative distribution . use http://www.stata-press.com/data/r13/citytemp, clear (City Temperature Data) . cumul tempjan, gen(cjan) . cumul tempjuly, gen(cjuly) . . > > > stack cjan tempjan cjuly tempjuly, into(c temp) wide clear line cjan cjuly temp, sort ylab(, grid) ytitle("") xlab(, grid) xtitle("Temperature (F)") title("Cumulatives:" "Average January and July Temperatures") subtitle("956 U.S. Cities") Cumulatives: Average January and July Temperatures 0 .2 .4 .6 .8 1 956 U.S. Cities 0 20 40 60 Temperature (F) cjan 80 100 cjuly As before, it would have been enough to type line cjan cjuly temp, sort. See [D] stack for an explanation of how the stack command works. Technical note According to Beniger and Robyn (1978), Fourier (1821) published the first graph of a cumulative frequency distribution, which was later given the name “ogive” by Galton (1875). Jean Baptiste Joseph Fourier (1768–1830) was born in Auxerre in France. As a young man, Fourier became entangled in the complications of the French Revolution. As a result, he was arrested and put into prison, where he feared he might meet his end at the guillotine. When he was not in prison, he was studying, researching, and teaching mathematics. Later, he served Napolean’s army in Egypt as a scientific adviser. Upon his return to France in 1801, he was appointed Prefect of the Department of Isère. While prefect, Fourier worked on the mathematical basis of the theory of heat, which is based on what are now called Fourier series. This work was published in 1822, despite the skepticism of Lagrange, Laplace, Legendre, and others—who found the work lacking in generality and even rigor—and disagreements of both priority and substance with Biot and Poisson. cumul — Cumulative distribution 415 Acknowledgment The equal option was added by Nicholas J. Cox of the Department of Geography at Durham University, UK, and coeditor of the Stata Journal. References Beniger, J. R., and D. L. Robyn. 1978. Quantitative graphics in statistics: A brief history. American Statistician 32: 1–11. Clayton, D. G., and M. Hills. 1999. gr37: Cumulative distribution function plots. Stata Technical Bulletin 49: 10–12. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 96–98. College Station, TX: Stata Press. Cox, N. J. 1999. gr41: Distribution function plots. Stata Technical Bulletin 51: 12–16. 
Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 108–112. College Station, TX: Stata Press.
Fourier, J. B. J. 1821. Notions générales, sur la population. Recherches Statistiques sur la Ville de Paris et le Département de la Seine 1: 1–70.
Galton, F. 1875. Statistics by intercomparison, with remarks on the law of frequency of error. Philosophical Magazine 49: 33–46.
Wilk, M. B., and R. Gnanadesikan. 1968. Probability plotting methods for the analysis of data. Biometrika 55: 1–17.

Also see
[R] diagnostic plots — Distributional diagnostic plots
[R] kdensity — Univariate kernel density estimation
[D] stack — Stack data

Title
cusum — Cusum plots and tests for binary variables

Syntax   Menu   Description   Options   Remarks and examples   Stored results   Acknowledgment   References   Also see

Syntax

    cusum yvar xvar [if] [in] [, options]

    options              Description
    Main
      generate(newvar)     save cumulative sum in newvar
      yfit(fitvar)         calculate cumulative sum against fitvar
      nograph              suppress the plot
      nocalc               suppress cusum test statistics
    Cusum plot
      connect_options      affect the rendition of the plotted line
    Add plots
      addplot(plot)        add plots to the generated graph
    Y axis, X axis, Titles, Legend, Overall
      twoway_options       any options other than by() documented in [G-3] twoway options

Menu

Statistics > Other > Quality control > Cusum plots and tests for binary variables

Description

cusum graphs the cumulative sum (cusum) of a binary (0/1) variable, yvar, against a (usually) continuous variable, xvar.

Options

Main

generate(newvar) saves the cusum in newvar.

yfit(fitvar) calculates a cusum against fitvar, that is, the running sums of the "residuals" fitvar minus yvar. Typically, fitvar is the predicted probability of a positive outcome obtained from a logistic regression analysis.

nograph suppresses the plot.

nocalc suppresses calculation of the cusum test statistics.

Cusum plot

connect_options affect the rendition of the plotted line; see [G-3] connect options.

Add plots

addplot(plot) provides a way to add other plots to the generated graph. See [G-3] addplot option.

Y axis, X axis, Titles, Legend, Overall

twoway_options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option).

Remarks and examples

The cusum is the running sum of the proportion of ones in the sample, a constant number, minus yvar,

    c_j = \sum_{k=1}^{j} \bigl( f - yvar_{(k)} \bigr), \qquad 1 \le j \le N

where f = (\sum yvar)/N and yvar_{(k)} refers to the corresponding value of yvar when xvar is placed in ascending order: xvar_{(k+1)} \ge xvar_{(k)}. Tied values of xvar are broken at random. If you want them broken the same way in two runs, you must set the random-number seed to the same value before giving the cusum command; see [R] set seed.

A U-shaped or inverted U-shaped cusum indicates, respectively, a negative or a positive trend of yvar with xvar. A sinusoidal shape is evidence of a nonmonotonic (for example, quadratic) trend. cusum displays the maximum absolute cusum for monotonic and nonmonotonic trends of yvar on xvar. These are nonparametric tests of departure from randomness of yvar with respect to xvar. Approximate values for the tests are given.

Example 1

For the automobile dataset, auto.dta, we wish to investigate the relationship between foreign (0 = domestic, 1 = foreign) and car weight as follows:

. use http://www.stata-press.com/data/r13/auto
(1978 Automobile Data)
. 
cusum foreign weight Obs Pr(1) CusumL zL Pr>zL Variable foreign 74 0.2973 10.30 3.963 0.000 CusumQ 3.32 zQ Pr>zQ 0.469 0.320 cusum — Cusum plots and tests for binary variables −10 −8 Cusum (Car type) −6 −4 −2 0 418 2000 3000 Weight (lbs.) 4000 5000 The resulting plot, which is U-shaped, suggests a negative monotonic relationship. The trend is confirmed by a highly significant linear cusum statistic, labeled CusumL in the output above. Some 29.73% of the cars are foreign (coded 1). The proportion of foreign cars diminishes with increasing weight. The domestic cars are crudely heavier than the foreign ones. We could have discovered that by typing table foreign, stats(mean weight), but such an approach does not give the full picture of the relationship. The quadratic cusum (CusumQ) is not significant, so we do not suspect any tendency for the very heavy cars to be foreign rather than domestic. A slightly enhanced version of the plot shows the preponderance of domestic (coded 0) cars at the heavy end of the weight axis: . label values foreign . cusum foreign weight, s(none) recast(scatter) mlabel(foreign) mlabp(0) Variable Obs Pr(1) CusumL zL Pr>zL CusumQ zQ Pr>zQ 0.2973 0 74 10.30 3.963 0.000 2.92 0 0 11 −2 1 1 1 1 10 0 11 10 0 11 −8 Cusum (Car type) −6 −4 1 1 1 1 −10 foreign 2000 0 0 00 0 10 0 0 1 1 0 1 0 00 00 01 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 3000 Weight (lbs.) 0 0 0 0 0 0 0 0 4000 0 0 0.064 0 0 5000 0.475 cusum — Cusum plots and tests for binary variables 419 The example is, of course, artificial, because we would not really try to model the probability of a car being foreign given its weight. Stored results cusum stores the following in r(): Scalars r(N) r(prop1) r(cusuml) r(zl) number of observations proportion of positive outcomes cusum test (linear) r(P zl) r(cusumq) r(zq) r(P zq) p-value for test (linear) quadratic cusum test (quadratic) p-value for test (quadratic) Acknowledgment cusum was written by Patrick Royston of the MRC Clinical Trials Unit, London, and coauthor of the Stata Press book Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model. References Royston, P. 1992. The use of cusums and other techniques in modelling continuous covariates in logistic regression. Statistics in Medicine 11: 1115–1129. . 1993. sqv7: Cusum plots and tests for binary variables. Stata Technical Bulletin 12: 16–17. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 175–177. College Station, TX: Stata Press. Also see [R] logistic — Logistic regression, reporting odds ratios [R] logit — Logistic regression, reporting coefficients [R] probit — Probit regression Title db — Launch dialog Syntax Description Options Remarks and examples Also see Syntax Syntax for db db commandname For programmers db commandname , message(string) debug dryrun Set system parameter set maxdb # , permanently where # must be between 5 and 1,000. Description db is the command-line way to launch a dialog for a Stata command. The second syntax (which is the same but includes options) is for use by programmers. If you wish to allow the launching of dialogs from a help file, see [P] smcl for information on the dialog SMCL directive. set maxdb sets the maximum number of dialog boxes whose contents are remembered from one invocation to the next during a session. The default value of maxdb is 50. Options message(string) specifies that string be passed to the dialog box, where it can be referred to from the MESSAGE STRING property. 
debug specifies that the underlying dialog box be loaded with debug messaging turned on. dryrun specifies that, rather than launching the dialog, db show the commands it would issue to launch the dialog. permanently specifies that, in addition to making the change right now, the maxdb setting be remembered and become the default setting when you invoke Stata. Remarks and examples The usual way to launch a dialog is to open the Data, Graphics, or Statistics menu and to make your selection from there. When you know the name of the command that you want to run, however, db provides a way to invoke the dialog from the command line. 420 db — Launch dialog 421 db follows the same abbreviation rules that Stata’s command-line interface follows. So, to launch the dialog for regress, you can type . db regress or . db reg Say that you use the dialog box for regress, either by selecting Statistics > Linear models and related > Linear regression or by typing . db regress You fit a regression. Much later during the session, you return to the regress dialog box. It will have the contents as you left them if 1) you have not typed clear all between the first and second invocations; 2) you have not typed discard between the two invocations; and 3) you have not used more than 50 different dialog boxes—regardless of how many times you have used each—between the first and second invocations of regress. If you use 51 or more, the contents of the regress dialog box will be forgotten. set maxdb determines how many different dialog boxes are remembered. A dialog box takes, on average, about 20 KB of memory, so the 50 default corresponds to allowing dialog boxes to consume about 1 MB of memory. Also see [R] query — Display system parameters Title diagnostic plots — Distributional diagnostic plots Syntax Description Options for qnorm and pnorm Remarks and examples Acknowledgments Also see Menu Options for symplot, quantile, and qqplot Options for qchi and pchi Methods and formulas References Syntax Symmetry plot symplot varname if in , options1 Ordered values of varname against quantiles of uniform distribution quantile varname if in , options1 Quantiles of varname1 against quantiles of varname2 qqplot varname1 varname2 if in , options1 Quantiles of varname against quantiles of normal distribution qnorm varname if in , options2 Standardized normal probability plot pnorm varname if in , options2 Quantiles of varname against quantiles of χ2 distribution qchi varname if in , options3 χ2 probability plot pchi varname if in , options3 422 diagnostic plots — Distributional diagnostic plots options1 Description Plot marker options marker label options change look of markers (color, size, etc.) add marker labels; change look or position Reference line rlopts(cline options) affect rendition of the reference line Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options options2 Description Main grid add grid lines Plot marker options marker label options change look of markers (color, size, etc.) 
add marker labels; change look or position Reference line rlopts(cline options) affect rendition of the reference line Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options options3 Description Main grid df(#) add grid lines degrees of freedom of χ2 distribution; default is df(1) Plot marker options marker label options change look of markers (color, size, etc.) add marker labels; change look or position Reference line rlopts(cline options) affect rendition of the reference line Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options 423 424 diagnostic plots — Distributional diagnostic plots Menu symplot Statistics > Summaries, tables, and tests > Distributional plots and tests > Symmetry plot > Summaries, tables, and tests > Distributional plots and tests > Quantiles plot > Summaries, tables, and tests > Distributional plots and tests > Quantile-quantile plot > Summaries, tables, and tests > Distributional plots and tests > Normal quantile plot > Summaries, tables, and tests > Distributional plots and tests > Normal probability plot, standardized > Summaries, tables, and tests > Distributional plots and tests > Chi-squared quantile plot > Summaries, tables, and tests > Distributional plots and tests > Chi-squared probability plot quantile Statistics qqplot Statistics qnorm Statistics pnorm Statistics qchi Statistics pchi Statistics Description symplot graphs a symmetry plot of varname. quantile plots the ordered values of varname against the quantiles of a uniform distribution. qqplot plots the quantiles of varname1 against the quantiles of varname2 (Q – Q plot). qnorm plots the quantiles of varname against the quantiles of the normal distribution (Q – Q plot). pnorm graphs a standardized normal probability plot (P – P plot). qchi plots the quantiles of varname against the quantiles of a χ2 distribution (Q – Q plot). pchi graphs a χ2 probability plot (P – P plot). See [R] regress postestimation diagnostic plots for regression diagnostic plots and [R] logistic postestimation for logistic regression diagnostic plots. Options for symplot, quantile, and qqplot Plot marker options affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G-3] marker options. marker label options specify if and how the markers are to be labeled; see [G-3] marker label options. Reference line rlopts(cline options) affect the rendition of the reference line; see [G-3] cline options. diagnostic plots — Distributional diagnostic plots 425 Add plots addplot(plot) provides a way to add other plots to the generated graph; see [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Options for qnorm and pnorm Main grid adds grid lines at the 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, and 0.95 quantiles when specified with qnorm. With pnorm, grid is equivalent to yline(.25,.5,.75) xline(.25,.5,.75). Plot marker options affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G-3] marker options. 
marker label options specify if and how the markers are to be labeled; see [G-3] marker label options. Reference line rlopts(cline options) affect the rendition of the reference line; see [G-3] cline options. Add plots addplot(plot) provides a way to add other plots to the generated graph; see [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Options for qchi and pchi Main grid adds grid lines at the 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, and .95 quantiles when specified with qchi. With pchi, grid is equivalent to yline(.25,.5,.75) xline(.25,.5,.75). df(#) specifies the degrees of freedom of the χ2 distribution. The default is df(1). Plot marker options affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G-3] marker options. marker label options specify if and how the markers are to be labeled; see [G-3] marker label options. 426 diagnostic plots — Distributional diagnostic plots Reference line rlopts(cline options) affect the rendition of the reference line; see [G-3] cline options. Add plots addplot(plot) provides a way to add other plots to the generated graph; see [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Remarks and examples Remarks are presented under the following headings: symplot quantile qqplot qnorm pnorm qchi pchi symplot Example 1 We have data on 74 automobiles. To make a symmetry plot of the variable price, we type . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . symplot price 0 2000 Distance above median 4000 6000 8000 10000 Price 0 500 1000 Distance below median 1500 2000 diagnostic plots — Distributional diagnostic plots 427 All points would lie along the reference line (defined as y = x) if car prices were symmetrically distributed. The points in this plot lie above the reference line, indicating that the distribution of car prices is skewed to the right — the most expensive cars are far more expensive than the least expensive cars are inexpensive. The logic works as follows: a variable, z , is distributed symmetrically if median − z(i) = z(N +1−i) − median where z(i) indicates the ith-order statistic of z . symplot graphs yi = median − z(i) versus xi = z(N +1−i) − median. For instance, consider the largest and smallest values of price in the example above. The most expensive car costs $15,906 and the least expensive, $3,291. Let’s compare these two cars with the typical car in the data and see how much more it costs to buy the most expensive car, and compare that with how much less it costs to buy the least expensive car. If the automobile price distribution is symmetric, the price differences would be the same. Before we can make this comparison, we must agree on a definition for the word “typical”. Let’s agree that “typical” means median. The price of the median car is $5,006.50, so the most expensive car costs $10,899.50 more than the median car, and the least expensive car costs $1,715.50 less than the median car. We now have one piece of evidence that the car price distribution is not symmetric. 
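The pairing just described is easy to verify by hand. A minimal sketch using the same auto data: summarize, detail leaves the median in r(p50) and the extremes in r(min) and r(max), so the two display commands reproduce the distances quoted above for the most and least expensive cars.

. use http://www.stata-press.com/data/r13/auto, clear
. summarize price, detail
. display r(max) - r(p50)     // distance above the median: 10899.5
. display r(p50) - r(min)     // distance below the median: 1715.5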
We can repeat the experiment for the second-most-expensive car and the second-least-expensive car. We find that the second-most-expensive car costs $9,494.50 more than the median car, and the second-least-expensive car costs $1,707.50 less than the median car. We now have more evidence. We can continue doing this with the third most expensive and the third least expensive, and so on. Once we have all of these numbers, we want to compare each pair and ask how similar, on average, they are. The easiest way to do that is to plot all the pairs. 428 diagnostic plots — Distributional diagnostic plots quantile Example 2 We have data on the prices of 74 automobiles. To make a quantile plot of price, we type . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) 0 Quantiles of Price 5000 10000 15000 . quantile price, rlopts(clpattern(dash)) 0 .25 .5 Fraction of the data .75 1 We changed the pattern of the reference line by specifying rlopts(clpattern(dash)). In a quantile plot, each value of the variable is plotted against the fraction of the data that have values less than that fraction. The diagonal line is a reference line. If automobile prices were rectangularly distributed, all the data would be plotted along the line. Because all the points are below the reference line, we know that the price distribution is skewed right. qqplot Example 3 We have data on the weight and country of manufacture of 74 automobiles. We wish to compare the distributions of weights for domestic and foreign automobiles: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . generate weightd=weight if !foreign (22 missing values generated) . generate weightf=weight if foreign (52 missing values generated) . qqplot weightd weightf diagnostic plots — Distributional diagnostic plots 429 2000 weightd 3000 4000 5000 Quantile−Quantile Plot 1500 2000 2500 weightf 3000 3500 qnorm Example 4 Continuing with our price data on 74 automobiles, we now wish to compare the distribution of price with the normal distribution: . qnorm price, grid ylabel(, angle(horizontal) axis(1)) > ylabel(, angle(horizontal) axis(2)) 1,313.8 6,165.3 11,017 15,000 13,466 Price 10,000 5,006.5 5,000 3,748 0 0 5,000 10,000 Inverse Normal Grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles The result shows that the distributions are different. 15,000 430 diagnostic plots — Distributional diagnostic plots Technical note The idea behind qnorm is recommended strongly by Miller (1997): he calls it probit plotting. His recommendations from much practical experience should interest many users. “My recommendation for detecting nonnormality is probit plotting” (Miller 1997, 10). “If a deviation from normality cannot be spotted by eye on probit paper, it is not worth worrying about. I never use the Kolmogorov–Smirnov test (or one of its cousins) or the χ2 test as a preliminary test of normality. They do not tell you how the sample is differing from normality, and I have a feeling they are more likely to detect irregularities in the middle of the distribution than in the tails” (Miller 1997, 13–14). pnorm Example 5 Quantile–normal plots emphasize the tails of the distribution. Normal probability plots put the focus on the center of the distribution: 0.00 Normal F[(price−m)/s] 0.25 0.50 0.75 1.00 . pnorm price, grid 0.00 0.25 0.50 Empirical P[i] = i/(N+1) 0.75 1.00 qchi Example 6 Suppose that we want to examine the distribution of the sum of squares of price and mpg, standardized for their variances. . . . . 
egen c1 = std(price)
egen c2 = std(mpg)
generate ch = c1^2 + c2^2
qchi ch, df(2) grid ylabel(, alt axis(2)) xlabel(, alt axis(2))

(figure: quantiles of ch plotted against the expected χ² quantiles, 2 d.f.; grid lines mark the 5, 10, 25, 50, 75, 90, and 95 percentiles)

The quadratic form is clearly not χ² with 2 degrees of freedom.

pchi

Example 7

We can focus on the center of the distribution by doing a probability plot:

. pchi ch, df(2) grid

(figure: χ²(ch), d.f. = 2, plotted against Empirical P[i] = i/(N+1))

Methods and formulas

Let x_{(1)}, x_{(2)}, ..., x_{(N)} be the data sorted in ascending order.

If a continuous variable, x, has a cumulative distribution function F(x) = P(X ≤ x) = p, the quantiles x_{p_i} are such that F(x_{p_i}) = p_i. For example, if p_i = 0.5, then x_{0.5} is the median. When we plot data, the probabilities, p_i, are often referred to as plotting positions. There are many different conventions for choice of plotting positions, given x_{(1)} ≤ ··· ≤ x_{(N)}. Most belong to the family (i − a)/(N − 2a + 1). a = 0.5 (suggested by Hazen) and a = 0 (suggested by Weibull) are popular choices. For a wider discussion of the calculation of plotting positions, see Cox (2002).

symplot plots median − x_{(i)} versus x_{(N+1−i)} − median.

quantile plots x_{(i)} versus (i − 0.5)/N (the Hazen position).

qnorm plots x_{(i)} against q_i, where q_i = Φ^{−1}(p_i), Φ is the cumulative normal distribution, and p_i = i/(N + 1) (the Weibull position).

pnorm plots Φ{(x_i − μ̂)/σ̂} versus p_i = i/(N + 1), where μ̂ is the mean of the data and σ̂ is the standard deviation.

qchi and pchi are similar to qnorm and pnorm; the cumulative χ² distribution is used in place of the cumulative normal distribution.

qqplot is just a two-way scatterplot of one variable against the other after both variables have been sorted into ascending order, and both variables have the same number of nonmissing observations. If the variables have unequal numbers of nonmissing observations, interpolated values of the variable with more data are plotted against the variable with fewer data.

Ramanathan Gnanadesikan (1932– ) was born in Madras. He obtained degrees from the Universities of Madras and North Carolina. He worked in industry at Procter & Gamble, Bell Labs, and Bellcore, as well as in universities, retiring from Rutgers in 1998. Among many contributions to statistics he is especially well known for work on probability plotting, robustness, outlier detection, clustering, classification, and pattern recognition.

Martin Bradbury Wilk (1922–2013) was born in Montreal. He obtained degrees in chemical engineering and statistics from McGill and Iowa State Universities. After holding several statistics-related posts in industry and at universities (including periods at Princeton, Bell Labs, and Rutgers), Wilk was appointed Chief Statistician at Statistics Canada (1980–1986). He is especially well known for his work with Gnanadesikan on probability plotting and with Shapiro on tests for normality.

Acknowledgments

We thank Peter A. Lachenbruch of the Department of Public Health at Oregon State University for writing the original versions of qchi and pchi. Patrick Royston of the MRC Clinical Trials Unit, London, and coauthor of the Stata Press book Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model also published a similar command in the Stata Technical Bulletin (Royston 1996).
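The plotting-position formulas given under Methods and formulas above can be reproduced directly. A minimal sketch (the variable names p and Phat are arbitrary) that rebuilds pnorm price by hand from the Weibull positions i/(N + 1):

. use http://www.stata-press.com/data/r13/auto, clear
. sort price
. generate p = _n/(_N + 1)                          // Weibull plotting positions
. quietly summarize price
. generate Phat = normal((price - r(mean))/r(sd))   // standardized normal probabilities
. scatter Phat p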
diagnostic plots — Distributional diagnostic plots 433 References Chambers, J. M., W. S. Cleveland, B. Kleiner, and P. A. Tukey. 1983. Graphical Methods for Data Analysis. Belmont, CA: Wadsworth. Cox, N. J. 1999. gr42: Quantile plots, generalized. Stata Technical Bulletin 51: 16–18. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 113–116. College Station, TX: Stata Press. . 2001. gr42.1: Quantile plots, generalized: Update to Stata 7. Stata Technical Bulletin 61: 10. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 55–56. College Station, TX: Stata Press. . 2002. Speaking Stata: On getting functions to do the work. Stata Journal 2: 411–427. . 2004a. Speaking Stata: Graphing distributions. Stata Journal 4: 66–88. . 2004b. gr42 2: Software update: Quantile plots, generalized. Stata Journal 4: 97. . 2005a. Speaking Stata: Density probability plots. Stata Journal 5: 259–273. . 2005b. Speaking Stata: The protean quantile plot. Stata Journal 5: 442–460. . 2005c. Speaking Stata: Smoothing in various directions. Stata Journal 5: 574–593. . 2007. Stata tip 47: Quantile–quantile plots without programming. Stata Journal 7: 275–279. . 2012. Speaking Stata: Axis practice, or what goes where on a graph. Stata Journal 12: 549–561. Daniel, C., and F. S. Wood. 1980. Fitting Equations to Data: Computer Analysis of Multifactor Data. 2nd ed. New York: Wiley. Gan, F. F., K. J. Koehler, and J. C. Thompson. 1991. Probability plots and distribution curves for assessing the fit of probability models. American Statistician 45: 14–21. Genest, C., and G. J. Brackstone. 2013. Obituary: Martin B. Wilk, 1922–2013. IMS Bulletin 42(4): 7–8. Hamilton, L. C. 1992. Regression with Graphics: A Second Course in Applied Statistics. Belmont, CA: Duxbury. . 2013. Statistics with Stata: Updated for Version 12. 8th ed. Boston: Brooks/Cole. Hoaglin, D. C. 1985. Using quantiles to study shape. In Exploring Data Tables, Trends, and Shapes, ed. D. C. Hoaglin, C. F. Mosteller, and J. W. Tukey, 417–460. New York: Wiley. Kettenring, J. R. 2001. A conversation with Ramanathan Gnanadesikan. Statistical Science 16: 295–309. Miller, R. G., Jr. 1997. Beyond ANOVA: Basics of Applied Statistics. London: Chapman & Hall. Nolan, D., and T. Speed. 2000. Stat Labs: Mathematical Statistics Through Applications. New York: Springer. Royston, P. 1996. sg47: A plot and a test for the χ2 distribution. Stata Technical Bulletin 29: 26–27. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 142–144. College Station, TX: Stata Press. Scotto, M. G. 2000. sg140: The Gumbel quantile plot and a test for choice of extreme models. Stata Technical Bulletin 55: 23–25. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 156–159. College Station, TX: Stata Press. Wilk, M. B., and R. Gnanadesikan. 1968. Probability plotting methods for the analysis of data. Biometrika 55: 1–17. Also see [R] cumul — Cumulative distribution [R] kdensity — Univariate kernel density estimation [R] logistic postestimation — Postestimation tools for logistic [R] lv — Letter-value displays [R] regress postestimation diagnostic plots — Postestimation plots for regress Title display — Substitute for a hand calculator Syntax Description Remarks and examples Also see Syntax display exp Description display displays strings and values of scalar expressions. display really has many more features and a more complex syntax diagram, but the diagram shown above is adequate for interactive use. For a full discussion of display’s capabilities, see [P] display. 
Remarks and examples display can be used as a substitute for a hand calculator. Example 1 display 2+2 produces the output 4. Stata variables may also appear in the expression, such as in display myvar/2. Because display works only with scalars, the resulting calculation is performed only for the first observation. You could type display myvar[10]/2 to display the calculation for the 10th observation. Here are more examples: . display sqrt(2)/2 .70710678 . display normal(-1.1) .13566606 . di (57.2-3)/(12-2) 5.42 . display myvar/10 7 . display myvar[10]/2 3.5 Also see [P] display — Display strings and values of scalar expressions [U] 13 Functions and expressions 434 Title do — Execute commands from a file Syntax Remarks and examples Menu Reference Description Also see Option Syntax do | run filename arguments , nostop Menu File > Do... Description do and run cause Stata to execute the commands stored in filename just as if they were entered from the keyboard. do echoes the commands as it executes them, whereas run is silent. If filename is specified without an extension, .do is assumed. Option nostop allows the do-file to continue executing even if an error occurs. Normally, Stata stops executing the do-file when it detects an error (nonzero return code). Remarks and examples You can create filename (called a do-file) using Stata’s Do-file Editor; see [R] doedit. This file will be a standard ASCII (text) file. A complete discussion of do-files can be found in [U] 16 Do-files. You can also create filename by using a non-Stata text editor; see [D] shell for a way to invoke your favorite editor from inside Stata. Make sure that you save the file in ASCII format. If the path or filename contains spaces, it should be enclosed in double quotes. Reference Jenkins, S. P. 2006. Stata tip 32: Do not stop. Stata Journal 6: 281. Also see [R] doedit — Edit do-files and other text files [P] include — Include commands from file [GSM] 13 Using the Do-file Editor—automating Stata [GSU] 13 Using the Do-file Editor—automating Stata [GSW] 13 Using the Do-file Editor—automating Stata [U] 15 Saving and printing output—log files [U] 16 Do-files 435 Title doedit — Edit do-files and other text files Syntax Menu Description Remarks and examples Also see Syntax doedit filename Menu Window > Do-file Editor Description doedit opens a text editor that lets you edit do-files and other text files. The Do-file Editor lets you submit several commands to Stata at once. Remarks and examples Clicking on the Do-file Editor button is equivalent to typing doedit. doedit, typed by itself, invokes the Editor with an empty document. If you specify filename, that file is displayed in the Editor. You may have more than one Do-file Editor open at once. Each time you submit the doedit command, a new window will be opened. A tutorial discussion of doedit can be found in the Getting Started with Stata manual. Read [U] 16 Do-files for an explanation of do-files, and then read [GSW] 13 Using the Do-file Editor— automating Stata to learn how to use the Do-file Editor to create and execute do-files. 
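A minimal sketch of the do-file workflow described above, assuming a hypothetical do-file named analyze.do saved in the current working directory:

. doedit analyze.do              // open (or create) the file in the Do-file Editor
. do analyze                     // echo and execute the commands in analyze.do
. run analyze                    // execute the same file silently
. do "my project/analyze.do"     // double-quote paths that contain spaces
. do analyze, nostop             // keep executing even if a command fails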
Also see [GSM] 13 Using the Do-file Editor—automating Stata [GSU] 13 Using the Do-file Editor—automating Stata [GSW] 13 Using the Do-file Editor—automating Stata [U] 16 Do-files 436 Title dotplot — Comparative scatterplots Syntax Remarks and examples Menu Stored results Description Acknowledgments Options References Syntax Dotplot of varname, with one column per value of groupvar dotplot varname if in , options Dotplot for each variable in varlist, with one column per variable dotplot varlist if in , options Description options Options display one columnar dotplot for each value of groupvar horizontal dot density; default is nx(0) vertical dot density; default is ny(35) label every # group; default is incr(1) plot a horizontal line of pluses at the mean or median use minimum and maximum as boundaries plot horizontal dashed lines at shoulders of each group use the actual values of yvar center the dot for each column over(groupvar) nx(#) ny(#) incr(#) mean | median bounded bar nogroup center Plot change look of markers (color, size, etc.) add marker labels; change look or position marker options marker label options Y axis, X axis, Titles, Legend, Overall any options other than by() documented in [G-3] twoway options twoway options Menu Graphics > Distributional graphs > Distribution dotplot Description A dotplot is a scatterplot with values grouped together vertically (“binning”, as in a histogram) and with plotted points separated horizontally. The aim is to display all the data for several variables or groups in one compact graphic. 437 438 dotplot — Comparative scatterplots In the first syntax, dotplot produces a columnar dotplot of varname, with one column per value of groupvar. In the second syntax, dotplot produces a columnar dotplot for each variable in varlist, with one column per variable; over(groupvar) is not allowed. In each case, the “dots” are plotted as small circles to increase readability. Options Options over(groupvar) identifies the variable for which dotplot will display one columnar dotplot for each value of groupvar. nx(#) sets the horizontal dot density. A larger value of # will increase the dot density, reducing the horizontal separation between dots. This option will increase the separation between columns if two or more groups or variables are used. ny(#) sets the vertical dot density (number of “bins” on the y axis). A larger value of # will result in more bins and a plot that is less spread out horizontally. # should be determined in conjunction with nx() to give the most pleasing appearance. incr(#) specifies how the x axis is to be labeled. incr(1), the default, labels all groups. incr(2) labels every second group. mean | median plots a horizontal line of pluses at the mean or median of each group. bounded forces the minimum and maximum of the variable to be used as boundaries of the smallest and largest bins. It should be used with one variable whose support is not the whole of the real line and whose density does not tend to zero at the ends of its support, for example, a uniform random variable or an exponential random variable. bar plots horizontal dashed lines at the “shoulders” of each group. The shoulders are taken to be the upper and lower quartiles unless mean has been specified; here they will be the mean plus or minus the standard deviation. nogroup uses the actual values of yvar rather than grouping them (the default). This option may be useful if yvar takes on only a few values. center centers the dots for each column on a hidden vertical line. 
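As a small illustration of the bounded, center, and ny() options described above, here is a minimal sketch with simulated uniform data (the seed and sample size are arbitrary):

. clear
. set seed 12345
. set obs 500
. generate u = runiform()
. dotplot u, bounded center ny(20)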
Plot marker options affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G-3] marker options. marker label options specify if and how the markers are to be labeled; see [G-3] marker label options. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). dotplot — Comparative scatterplots 439 Remarks and examples dotplot produces a figure that has elements of a boxplot, a histogram, and a scatterplot. Like a boxplot, it is most useful for comparing the distributions of several variables or the distribution of 1 variable in several groups. Like a histogram, the figure provides a crude estimate of the density, and, as with a scatterplot, each symbol (dot) represents 1 observation. Example 1 dotplot may be used as an alternative to Stata’s histogram graph for displaying the distribution of one variable. . set seed 123456789 . set obs 1000 . generate norm = rnormal() . dotplot norm, title("Normal distribution, sample size 1000") −4 −2 norm 0 2 4 Normal distribution, sample size 1000 0 20 40 Frequency 60 80 Example 2 The over() option lets us use dotplot to compare the distribution of one variable within different levels of a grouping variable. The center, median, and bar options create a graph that may be compared with Stata’s boxplot; see [G-2] graph box. The next graph illustrates this option with Stata’s automobile dataset. 440 dotplot — Comparative scatterplots Mileage (mpg) 30 40 . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . dotplot mpg, over(foreign) nx(25) ny(10) center median bar − − − −−− − − − − − − − − − −−−−−−−−− − − − − − − 20 − − − −−− − − − 10 − − − − − − −−−−−−−−− − − − − − − Domestic Foreign Car type Example 3 The second version of dotplot lets us compare the distribution of several variables. In the next graph, all 10 variables contain measurements on tumor volume. 0 200 Tumor volume, cu mm 400 600 800 1000 . use http://www.stata-press.com/data/r13/dotgr . dotplot g1r1-g1r10, ytitle("Tumor volume, cu mm") g1r1 g1r2 g1r3 g1r4 g1r5 g1r6 g1r7 g1r8 g1r9 g1r10 dotplot — Comparative scatterplots 441 Example 4 When using the first form with the over() option, we can encode a third dimension in a dotplot by using a different plotting symbol for different groups. The third dimension cannot be encoded with a varlist. The example is of a hypothetical matched case – control study. The next graph shows the exposure of each individual in each matched stratum. Cases are marked by the letter ‘x’, and controls are marked by the letter ‘o’. use http://www.stata-press.com/data/r13/dotdose label define symbol 0 "o" 1 "x" label values case symbol dotplot dose, over(strata) m(none) mlabel(case) mlabp(0) center 40 50 . . . . 30 dose 20 o o o o xxo o o oox oo oo o ooo x oo xo oo oo o o ox oo oooo o o oo ooo o ooo ox xooo o o xoo ooo o ooo o oo oo o o ox o o ooo oo o ooo o oo 11 12 o o o o o o o xo o oo o o oo oxo ox ox o x oo 0 10 o o o o o oo oo oo 0 1 2 3 4 5 6 7 strata 8 9 10 Example 5 dotplot can also be used with two virtually continuous variables as an alternative to jittering the data to distinguish ties. We must use the xlabel() option, because otherwise dotplot will attempt to label too many points on the x axis. 
It is often useful in such instances to use a value of nx that is smaller than the default. That was not necessary in this example, partly because of our choice of symbols. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . generate byte hi_price = (price>10000) if price < . . label define symbol 0 "|" 1 "o" . label values hi_price symbol 442 dotplot — Comparative scatterplots 5,000 . dotplot weight, over(gear_ratio) m(none) mlabel(hi_price) mlabp(0) center > xlabel(#5) o o Weight (lbs.) 3,000 4,000 oo | | o|| o o | | || | | o | ||| | || |||| | | o | | || | o | | | || | | | | | || | | | 2,000 | | | | 2 2.5 | 3 Gear Ratio || | | | 3.5 | | | | || || | | | 4 Example 6 The following figure is included mostly for aesthetic reasons. It also demonstrates dotplot’s ability to cope with even very large datasets. The sample size for each variable is 10,000, so it may take a long time to print. −2 0 2 4 6 clear all set seed 123456789 set obs 10000 gen norm0 = rnormal() gen norm1 = rnormal() + 1 gen norm2 = rnormal() + 2 label variable norm0 "N(0,1)" label variable norm1 "N(1,1)" label variable norm2 "N(2,1)" dotplot norm0 norm1 norm2 −4 . . . . . . . . . . N(0,1) N(1,1) N(2,1) dotplot — Comparative scatterplots 443 Stored results dotplot stores the following in r(): Scalars r(nx) r(ny) horizontal dot density vertical dot density Acknowledgments dotplot was written by Peter Sasieni of the Wolfson Institute of Preventive Medicine, London, and Patrick Royston of the MRC Clinical Trials Unit, London, and coauthor of the Stata Press book Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model. References Sasieni, P. D., and P. Royston. 1994. gr14: dotplot: Comparative scatterplots. Stata Technical Bulletin 19: 8–10. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 50–54. College Station, TX: Stata Press. . 1996. Dotplots. Applied Statistics 45: 219–234. Title dstdize — Direct and indirect standardization Syntax Options for istdize Acknowledgments Menu Remarks and examples References Description Stored results Also see Options for dstdize Methods and formulas Syntax Direct standardization dstdize charvar popvar stratavars if in , by(groupvars) dstdize options Indirect standardization istdize casevars popvars stratavars if in using filename, popvars(casevarp popvarp ) | rate(ratevarp # | crudevarp ) istdize options dstdize options Main ∗ by(groupvars) using( filename) base(# | string) level(#) Description study populations use standard population from Stata dataset use standard population from a value of grouping variable set confidence level; default is level(95) Options saving( filename) format(% fmt) print nores ∗ by(groupvars) is required. istdize options Main ∗ popvars(casevarp popvarp ) ∗ save computed standard population distribution as a Stata dataset final summary table display format; default is %10.0g include table summary of standard population in output suppress storing results in r() rate(ratevarp # | crudevarp level(#) Description for standard population, casevarp is number of cases and popvarp is number of individuals ) ratevarp is stratum-specific rates and # or crudevarp is the crude case rate value or variable set confidence level; default is level(95) Options by(groupvars) format(% fmt) print ∗ variables identifying study populations final summary table display format; default is %10.0g include table summary of standard population in output Either popvars(casevarp popvarp ) or rate(ratevarp {# | crudevarp }) must be specified. 
444 dstdize — Direct and indirect standardization 445 Menu dstdize Statistics > Epidemiology and related > Other > Direct standardization > Epidemiology and related > Other > Indirect standardization istdize Statistics Description dstdize produces standardized rates for charvar, which are defined as a weighted average of the stratum-specific rates. These rates can be used to compare the characteristic charvar across different populations identified by groupvars. Weights used in the standardization are given by popvar; the strata across which the weights are to be averaged are defined by stratavars. istdize produces indirectly standardized rates for a study population based on a standard population. This standardization method is appropriate when the stratum-specific rates for the population being studied are either unavailable or based on small samples and thus are unreliable. The standardization uses the stratum-specific rates of a standard population to calculate the expected number of cases in the study population(s), sums them, and then compares them with the actual number of cases observed. The standard population is in another Stata data file specified by using filename, and it must contain popvar and stratavars. In addition to calculating rates, the indirect standardization command produces point estimates and exact confidence intervals of the study population’s standardized mortality ratio (SMR), if death is the event of interest, or the standardized incidence ratio (SIR) for studies of incidence. Here we refer to both ratios as SMR. casevars is the variable name for the study population’s number of cases (usually deaths). It must contain integers, and for each group, defined by groupvar, each subpopulation identified by stratavars must have the same values or missing. popvars identifies the number of subjects represented by each observation in the study population. stratavars define the strata. Options for dstdize Main by(groupvars) is required for the dstdize command; it specifies the variables identifying the study populations. If base() is also specified, there must be only one variable in the by() group. If you do not have a variable for this option, you can generate one by using something like gen newvar=1 and then use newvar as the argument to this option. using(filename) or base(# | string) may be used to specify the standard population. You may not specify both options. using( filename) supplies the name of a .dta file containing the standard population. The standard population must contain the popvar and the stratavars. If using() is not specified, the standard population distribution will be obtained from the data. base(# | string) lets you specify one of the values of groupvar—either a numeric value or a string—to be used as the standard population. If neither base() nor using() is specified, the entire dataset is used to determine an estimate of the standard population. 446 dstdize — Direct and indirect standardization level(#) specifies the confidence level, as a percentage, for a confidence interval of the adjusted rate. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. Options saving( filename) saves the computed standard population distribution as a Stata dataset that can be used in further analyses. format(% fmt) specifies the format in which to display the final summary table. The default is %10.0g. print includes a table summary of the standard population before displaying the study population results. 
nores suppresses storing results in r(). This option is seldom specified. Some results are stored in matrices. If there are more groups than matsize, dstdize will report “matsize too small”. Then you can either increase matsize or specify nores. The nores option does not change how results are calculated but specifies that results need not be left behind for use by other programs. Options for istdize Main popvars(casevarp popvarp ) or rate(ratevarp # | ratevarp crudevarp ) must be specified with istdize. Only one of these two options is allowed. These options are used to describe the standard population’s data. With popvars(casevarp popvarp ), casevarp records the number of cases (deaths) for each stratum in the standard population, and popvarp records the total number of individuals in each stratum (individuals at risk). With rate(ratevarp # | crudevarp ), ratevarp contains the stratum-specific rates. # | crudevarp specifies the crude case rate either by a variable name or by the crude case rate value. If a crude rate variable is used, it must be the same for all observations, although it could be missing for some. level(#) specifies the confidence level, as a percentage, for a confidence interval of the adjusted rate. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. Options by(groupvars) specifies variables identifying study populations when more than one exists in the data. If this option is not specified, the entire study population is treated as one group. format(% fmt) specifies the format in which to display the final summary table. The default is %10.0g. print outputs a table summary of the standard population before displaying the study population results. dstdize — Direct and indirect standardization 447 Remarks and examples Remarks are presented under the following headings: Direct standardization Indirect standardization In epidemiology and other fields, you will often need to compare rates for some characteristic across different populations. These populations often differ on factors associated with the characteristic under study; thus directly comparing overall rates may be misleading. See van Belle et al. (2004, 642–684), Fleiss, Levin, and Paik (2003, chap. 19), or Kirkwood and Sterne (2003, chap. 25) for a discussion of direct and indirect standardization. Direct standardization The direct method of adjusting for differences among populations involves computing the overall rates that would result if, instead of having different distributions, all populations had the same standard distribution. The standardized rate is defined as a weighted average of the stratum-specific rates, with the weights taken from the standard distribution. Direct standardization may be applied only when the specific rates for a given population are available. dstdize generates adjusted summary measures of occurrence, which can be used to compare prevalence, incidence, or mortality rates between populations that may differ on certain characteristics (for example, age, gender, race). These underlying differences may affect the crude prevalence, mortality, or incidence rates. Example 1 We have data (Rothman 1986, 42) on mortality rates for Sweden and Panama for 1962, and we wish to compare mortality in these two countries: . use http://www.stata-press.com/data/r13/mortality (1962 Mortality, Sweden & Panama) . 
describe Contains data from http://www.stata-press.com/data/r13/mortality.dta obs: 6 1962 Mortality, Sweden & Panama vars: 4 14 Apr 2013 16:18 size: 90 variable name nation age_category population deaths Sorted by: storage type display format str6 byte float float %9s %9.0g %10.0gc %9.0gc value label age_lbl variable label Nation Age Category Population in Age Category Deaths in Age Category 448 dstdize — Direct and indirect standardization . list, sepby(nation) abbrev(12) divider nation age_category population deaths 1. 2. 3. Sweden Sweden Sweden 0 - 29 30 - 59 60+ 3145000 3057000 1294000 3,523 10,928 59,104 4. 5. 6. Panama Panama Panama 0 - 29 30 - 59 60+ 741,000 275,000 59,000 3,904 1,421 2,456 We divide the total number of cases in the population by the population to obtain the crude rate: . collapse (sum) pop deaths, by(nation) . list, abbrev(10) divider 1. 2. nation population deaths Panama Sweden 1075000 7496000 7,781 73,555 . generate crude = deaths/pop . list, abbrev(10) divider 1. 2. nation population deaths crude Panama Sweden 1075000 7496000 7,781 73,555 .0072381 .0098126 If we examine the total number of deaths in the two nations, the total crude mortality rate in Sweden is higher than that in Panama. From the original data, we see one possible explanation: Swedes are older than Panamanians, making direct comparison of the mortality rates difficult. Direct standardization lets us remove the distortion caused by the different age distributions. The adjusted rate is defined as the weighted sum of the crude rates, where the weights are given by the standard distribution. Suppose that we wish to standardize these mortality rates to the following age distribution: . use http://www.stata-press.com/data/r13/1962, clear (Standard Population Distribution) . list, abbrev(12) divider 1. 2. 3. age_category population 0 - 29 30 - 59 60+ .35 .35 .3 . save 1962 file 1962.dta saved If we multiply the above weights for the age strata by the crude rate for the corresponding age category, the sum gives us the standardized rate. dstdize — Direct and indirect standardization . use http://www.stata-press.com/data/r13/mortality (1962 Mortality, Sweden & Panama) . generate crude=deaths/pop . drop pop . merge m:1 age_cat using 1962 age_category was byte now float Result # of obs. not matched matched 0 6 (_merge==3) . list, sepby(age_category) abbrev(12) nation age_category deaths crude population _merge 1. 2. Sweden Panama 0 - 29 0 - 29 3,523 3,904 .0011202 .0052686 .35 .35 matched (3) matched (3) 3. 4. Panama Sweden 30 - 59 30 - 59 1,421 10,928 .0051673 .0035747 .35 .35 matched (3) matched (3) 5. 6. Panama Sweden 60+ 60+ 2,456 59,104 .0416271 .0456754 .3 .3 matched (3) matched (3) . generate product = crude*pop . by nation, sort: egen adj_rate = sum(product) . drop _merge . list, sepby(nation) nation age_ca~y deaths crude popula~n product adj_rate 1. 2. 3. Panama Panama Panama 0 - 29 30 - 59 60+ 3,904 1,421 2,456 .0052686 .0051673 .0416271 .35 .35 .3 .001844 .0018085 .0124881 .0161407 .0161407 .0161407 4. 5. 6. Sweden Sweden Sweden 60+ 30 - 59 0 - 29 59,104 10,928 3,523 .0456754 .0035747 .0011202 .3 .35 .35 .0137026 .0012512 .0003921 .0153459 .0153459 .0153459 Comparing the standardized rates indicates that the Swedes have a slightly lower mortality rate. 449 450 dstdize — Direct and indirect standardization To perform the above analysis with dstdize, type . use http://www.stata-press.com/data/r13/mortality, clear (1962 Mortality, Sweden & Panama) . 
dstdize deaths pop age_cat, by(nation) using(1962) -> nation= Panama Unadjusted Std. Pop. Stratum Pop. Cases Dist. Rate[s] Dst[P] Stratum Pop. 0 - 29 30 - 59 60+ 741000 275000 59000 3904 1421 2456 1075000 7781 Totals: 0.689 0.0053 0.256 0.0052 0.055 0.0416 s*P 0.350 0.0018 0.350 0.0018 0.300 0.0125 Adjusted Cases: 17351.2 Crude Rate: 0.0072 Adjusted Rate: 0.0161 95% Conf. Interval: [0.0156, 0.0166] -> nation= Sweden Unadjusted Std. Pop. Stratum Pop. Cases Dist. Rate[s] Dst[P] Stratum Pop. 0 - 29 30 - 59 60+ 3145000 3057000 1294000 3523 10928 59104 7496000 73555 Totals: 0.420 0.0011 0.408 0.0036 0.173 0.0457 s*P 0.350 0.0004 0.350 0.0013 0.300 0.0137 Adjusted Cases: 115032.5 Crude Rate: 0.0098 Adjusted Rate: 0.0153 95% Conf. Interval: [0.0152, 0.0155] Summary of Study Populations: nation N Crude Adj_Rate Confidence Interval Panama Sweden 1075000 7496000 0.007238 0.009813 0.016141 0.015346 [ [ 0.015645, 0.015235, 0.016637] 0.015457] The summary table above lets us make a quick inspection of the results within the study populations, and the detail tables give the behavior among the strata within the study populations. Example 2 We have individual-level data on persons in four cities over several years. Included in the data is a variable indicating whether the person has high blood pressure, together with information on the person’s age, sex, and race. We wish to obtain standardized high blood pressure rates for each city for 1990 and 1992, using, as the standard, the age, sex, and race distribution of the four cities and two years combined. dstdize — Direct and indirect standardization 451 Our dataset contains . use http://www.stata-press.com/data/r13/hbp . describe Contains data from http://www.stata-press.com/data/r13/hbp.dta obs: 1,130 vars: 7 21 Feb 2013 06:42 size: 19,210 variable name id city year sex age_group race hbp storage type str10 byte int byte byte byte byte display format %10s %8.0g %8.0g %8.0g %8.0g %8.0g %8.0g value label variable label Record identification number sexfmt agefmt racefmt yn high blood pressure Sorted by: The dstdize command is designed to work with aggregate data but will work with individuallevel data only if we create a variable recording the population represented by each observation. For individual-level data, this is one: . generate pop = 1 On the next page, we specify print to obtain a listing of the standard population and level(90) to request 90% rather than 95% confidence intervals. Typing if year==1990 | year==1992 restricts the data to the two years for both summary tables and the standard population. 452 dstdize — Direct and indirect standardization . dstdize hbp pop age race sex if year==1990 | year==1992, by(city year) print > level(90) Standard Population Stratum Pop. Dist. 
15 15 15 15 15 15 20 20 20 20 20 20 25 25 25 25 25 25 30 30 30 30 30 30 - 19 19 19 19 19 19 24 24 24 24 24 24 29 29 29 29 29 29 34 34 34 34 34 34 Black Black Hispanic Hispanic White White Black Black Hispanic Hispanic White White Black Black Hispanic Hispanic White White Black Black Hispanic Hispanic White White Female Male Female Male Female Male Female Male Female Male Female Male Female Male Female Male Female Male Female Male Female Male Female Male 35 44 5 10 7 5 43 67 14 13 4 21 17 44 7 13 9 16 16 32 2 3 5 23 0.077 0.097 0.011 0.022 0.015 0.011 0.095 0.147 0.031 0.029 0.009 0.046 0.037 0.097 0.015 0.029 0.020 0.035 0.035 0.070 0.004 0.007 0.011 0.051 Total: 455 (6 observations excluded because of missing values) -> city year= 1 1990 15 15 15 20 20 25 25 25 25 30 30 - 19 Black 19 Black 19 Hispanic 24 Black 24 Black 29 Black 29 Black 29 Hispanic 29 White 34 Black 34 Black Totals: Unadjusted Std. Pop. Stratum Pop. Cases Dist. Rate[s] Dst[P] Stratum Pop. Female Male Male Female Male Female Male Female Female Female Male 6 6 1 3 11 4 6 2 1 1 6 2 0 0 0 0 0 1 0 0 0 0 47 3 0.128 0.128 0.021 0.064 0.234 0.085 0.128 0.043 0.021 0.021 0.128 0.3333 0.0000 0.0000 0.0000 0.0000 0.0000 0.1667 0.0000 0.0000 0.0000 0.0000 0.077 0.097 0.022 0.095 0.147 0.037 0.097 0.015 0.020 0.035 0.070 s*P 0.0256 0.0000 0.0000 0.0000 0.0000 0.0000 0.0161 0.0000 0.0000 0.0000 0.0000 Adjusted Cases: 2.0 Crude Rate: 0.0638 Adjusted Rate: 0.0418 90% Conf. Interval: [0.0074, 0.0761] dstdize — Direct and indirect standardization -> city year= 1 1992 15 15 15 20 20 20 25 25 25 30 30 30 - 19 Black 19 Black 19 Hispanic 24 Black 24 Black 24 Hispanic 29 Black 29 Black 29 Hispanic 34 Black 34 Black 34 White Unadjusted Std. Pop. Stratum Pop. Cases Dist. Rate[s] Dst[P] Stratum Pop. Female Male Male Female Male Female Female Male Male Female Male Female 3 9 1 7 9 1 2 11 1 7 4 1 0 0 0 0 0 0 0 1 0 0 0 0 56 1 Totals: 0.054 0.161 0.018 0.125 0.161 0.018 0.036 0.196 0.018 0.125 0.071 0.018 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0909 0.0000 0.0000 0.0000 0.0000 0.077 0.097 0.022 0.095 0.147 0.031 0.037 0.097 0.029 0.035 0.070 0.011 s*P 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0088 0.0000 0.0000 0.0000 0.0000 Adjusted Cases: 0.5 Crude Rate: 0.0179 Adjusted Rate: 0.0088 90% Conf. Interval: [0.0000, 0.0226] -> city year= 2 1990 15 15 15 20 20 20 20 20 25 25 25 25 25 30 30 30 30 30 - 19 19 19 24 24 24 24 24 29 29 29 29 29 34 34 34 34 34 Totals: Black Black Hispanic Black Black Hispanic Hispanic White Black Black Hispanic White White Black Black Hispanic White White Unadjusted Std. Pop. Stratum Pop. Cases Dist. Rate[s] Dst[P] Stratum Pop. Female Male Male Female Male Female Male Male Female Male Female Female Male Female Male Female Female Male 5 7 1 7 8 5 2 2 3 9 2 1 2 1 5 2 1 1 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 64 3 0.078 0.109 0.016 0.109 0.125 0.078 0.031 0.031 0.047 0.141 0.031 0.016 0.031 0.016 0.078 0.031 0.016 0.016 0.0000 0.1429 0.0000 0.1429 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.5000 0.0000 0.0000 0.0000 0.0000 0.0000 0.077 0.097 0.022 0.095 0.147 0.031 0.029 0.046 0.037 0.097 0.015 0.020 0.035 0.035 0.070 0.004 0.011 0.051 s*P 0.0000 0.0138 0.0000 0.0135 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0176 0.0000 0.0000 0.0000 0.0000 0.0000 Adjusted Cases: 2.9 Crude Rate: 0.0469 Adjusted Rate: 0.0449 90% Conf. 
Interval: [0.0091, 0.0807] 453 454 dstdize — Direct and indirect standardization -> city year= 2 1992 15 15 15 15 15 20 20 20 20 20 20 25 25 25 25 25 25 30 30 30 30 30 - 19 19 19 19 19 24 24 24 24 24 24 29 29 29 29 29 29 34 34 34 34 34 Black Black Hispanic Hispanic White Black Black Hispanic Hispanic White White Black Black Hispanic Hispanic White White Black Black Hispanic White White Unadjusted Std. Pop. Stratum Pop. Cases Dist. Rate[s] Dst[P] Stratum Pop. Female Male Female Male Male Female Male Female Male Female Male Female Male Female Male Female Male Female Male Male Female Male 1 5 3 1 1 8 11 6 4 1 2 2 3 2 4 4 2 1 2 1 2 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 67 2 Totals: 0.015 0.075 0.045 0.015 0.015 0.119 0.164 0.090 0.060 0.015 0.030 0.030 0.045 0.030 0.060 0.060 0.030 0.015 0.030 0.015 0.030 0.015 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.5000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.077 0.097 0.011 0.022 0.011 0.095 0.147 0.031 0.029 0.009 0.046 0.037 0.097 0.015 0.029 0.020 0.035 0.035 0.070 0.007 0.011 0.051 s*P 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0143 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 Adjusted Cases: 1.0 Crude Rate: 0.0299 Adjusted Rate: 0.0143 90% Conf. Interval: [0.0025, 0.0260] -> city year= 3 1990 15 15 15 15 15 20 20 20 20 20 25 25 25 25 25 30 30 - 19 Black 19 Black 19 Hispanic 19 White 19 White 24 Black 24 Black 24 Hispanic 24 White 24 White 29 Black 29 Black 29 Hispanic 29 White 29 White 34 Black 34 White Totals: Unadjusted Std. Pop. Stratum Pop. Cases Dist. Rate[s] Dst[P] Stratum Pop. Female Male Female Female Male Female Male Male Female Male Female Male Male Female Male Male Male 3 1 1 3 1 1 9 3 2 8 1 8 4 1 6 6 11 0 0 0 0 0 0 0 0 0 1 0 3 0 0 0 2 5 69 11 0.043 0.014 0.014 0.043 0.014 0.014 0.130 0.043 0.029 0.116 0.014 0.116 0.058 0.014 0.087 0.087 0.159 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.1250 0.0000 0.3750 0.0000 0.0000 0.0000 0.3333 0.4545 0.077 0.097 0.011 0.015 0.011 0.095 0.147 0.029 0.009 0.046 0.037 0.097 0.029 0.020 0.035 0.070 0.051 s*P 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0058 0.0000 0.0363 0.0000 0.0000 0.0000 0.0234 0.0230 Adjusted Cases: 6.1 Crude Rate: 0.1594 Adjusted Rate: 0.0885 90% Conf. Interval: [0.0501, 0.1268] dstdize — Direct and indirect standardization -> city year= 3 1992 15 15 15 15 20 20 20 20 20 25 25 30 30 - 19 19 19 19 24 24 24 24 24 29 29 34 34 Black Hispanic White White Black Hispanic Hispanic White White Hispanic White Black White Unadjusted Std. Pop. Stratum Pop. Cases Dist. Rate[s] Dst[P] Stratum Pop. Female Male Female Male Male Female Male Female Male Male Male Male Male 2 3 2 1 3 1 3 1 6 1 5 1 8 0 0 0 0 0 0 0 0 1 0 1 0 5 37 7 Totals: 0.054 0.081 0.054 0.027 0.081 0.027 0.081 0.027 0.162 0.027 0.135 0.027 0.216 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.1667 0.0000 0.2000 0.0000 0.6250 0.077 0.022 0.015 0.011 0.147 0.031 0.029 0.009 0.046 0.029 0.035 0.070 0.051 s*P 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0077 0.0000 0.0070 0.0000 0.0316 Adjusted Cases: 1.7 Crude Rate: 0.1892 Adjusted Rate: 0.0463 90% Conf. Interval: [0.0253, 0.0674] -> city year= 5 1990 15 15 15 15 20 20 20 25 25 25 25 30 30 30 - 19 Black 19 Black 19 Hispanic 19 White 24 Black 24 Black 24 Hispanic 29 Black 29 Black 29 Hispanic 29 White 34 Black 34 Black 34 White Totals: Unadjusted Std. Pop. Stratum Pop. Cases Dist. 
Rate[s] Dst[P] Stratum Pop. Female Male Male Male Female Male Female Female Male Female Female Female Male Male 9 7 1 1 4 6 1 3 5 1 2 2 3 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 46 2 0.196 0.152 0.022 0.022 0.087 0.130 0.022 0.065 0.109 0.022 0.043 0.043 0.065 0.022 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.3333 0.0000 0.0000 0.5000 0.0000 0.0000 0.0000 0.077 0.097 0.022 0.011 0.095 0.147 0.031 0.037 0.097 0.015 0.020 0.035 0.070 0.051 s*P 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0125 0.0000 0.0000 0.0099 0.0000 0.0000 0.0000 Adjusted Cases: 1.0 Crude Rate: 0.0435 Adjusted Rate: 0.0223 90% Conf. Interval: [0.0020, 0.0426] 455 456 dstdize — Direct and indirect standardization -> city year= 5 1992 15 15 15 15 15 15 20 20 20 20 25 25 25 25 30 30 30 30 30 - 19 19 19 19 19 19 24 24 24 24 29 29 29 29 34 34 34 34 34 Black Black Hispanic Hispanic White White Black Black Hispanic White Black Black Hispanic White Black Black Hispanic White White Stratum Pop. Female Male Female Male Female Male Female Male Male Male Female Male Male Male Female Male Male Female Male 6 9 1 2 2 1 13 10 1 3 2 2 3 1 4 5 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 69 1 Totals: Summary of Study Populations: city year N Crude 1 1990 1 1992 2 1990 2 1992 3 1990 3 1992 5 1990 5 1992 Unadjusted Std. Pop. Stratum Pop. Cases Dist. Rate[s] Dst[P] 0.087 0.130 0.014 0.029 0.029 0.014 0.188 0.145 0.014 0.043 0.029 0.029 0.043 0.014 0.058 0.072 0.029 0.014 0.014 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 0.077 0.097 0.011 0.022 0.015 0.011 0.095 0.147 0.029 0.046 0.037 0.097 0.029 0.035 0.035 0.070 0.007 0.011 0.051 s*P 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0505 Adjusted Cases: 3.5 Crude Rate: 0.0145 Adjusted Rate: 0.0505 90% Conf. Interval: [0.0505, 0.0505] Adj_Rate Confidence Interval 47 0.063830 0.041758 [ 0.007427, 0.076089] 56 0.017857 0.008791 [ 0.000000, 0.022579] 64 0.046875 0.044898 [ 0.009072, 0.080724] 67 0.029851 0.014286 [ 0.002537, 0.026035] 69 0.159420 0.088453 [ 0.050093, 0.126813] 37 0.189189 0.046319 [ 0.025271, 0.067366] 46 0.043478 0.022344 [ 0.002044, 0.042644] 69 0.014493 0.050549 [ 0.050549, 0.050549] dstdize — Direct and indirect standardization 457 Indirect standardization Standardization of rates can be performed via the indirect method whenever the stratum-specific rates are either unknown or unreliable. If the stratum-specific rates are known, the direct standardization method is preferred. To apply the indirect method, you must have the following information: • The observed number of cases in each population to be standardized, O. For example, if death rates in two states are being standardized using the U.S. death rate for the same period, you must know the total number of deaths in each state. • The distribution across the various strata for the population being studied, n1 , . . . , nk . If you are standardizing the death rate in the two states, adjusting for age, you must know the number of individuals in each of the k age groups. • The stratum-specific rates for the standard population, p1 , . . . , pk . For example, you must have the U.S. death rate for each stratum (age group). • The crude rate of the standard population, C . For example, you must have the U.S. mortality rate for the year. The indirect adjusted rate is then Rindirect = C O E where E is the expected number of cases (deaths) in each population. 
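Written out in the notation of the preceding list, the expected number of cases and the indirectly adjusted rate are as follows; this is only a restatement of the quantities just defined, with no new symbols introduced.

% k strata; n_i and p_i are the study-population counts and the standard rates
\[
E \;=\; \sum_{i=1}^{k} n_i\,p_i ,
\qquad
R_{\mathrm{indirect}} \;=\; C\,\frac{O}{E}
\]
% O/E is the SMR (or SIR) that istdize reports alongside the adjusted rate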
See Methods and formulas for a more detailed description of calculations. Example 3 This example is borrowed from Kahn and Sempos (1989, 95–105). We want to compare 1970 mortality rates in California and Maine, adjusting for age. Although we have age-specific population counts for the two states, we lack age-specific death rates. Direct standardization is not feasible here. We can use the U.S. population census data for the same year to produce indirectly standardized rates for these two states. From the U.S. census, the standard population for this example was entered into Stata and saved in popkahn.dta. . use http://www.stata-press.com/data/r13/popkahn, clear . list age pop deaths rate, sep(4) age population deaths rate 1. 2. 3. 4. <15 15-24 25-34 35-44 57,900,000 35,441,000 24,907,000 23,088,000 103,062 45,261 39,193 72,617 .00178 .00128 .00157 .00315 5. 6. 7. 8. 45-54 55-64 65-74 75+ 23,220,000 18,590,000 12,436,000 7,630,000 169,517 308,373 445,531 736,758 .0073 .01659 .03583 .09656 458 dstdize — Direct and indirect standardization The standard population contains for each age stratum the total number of individuals (pop) and both the age-specific mortality rate (rate) and the number of deaths. The standard population need not contain all three. If we have only the age-specific mortality rate, we can use the rate(ratevarp crudevarp ) or rate(ratevarp #) option, where crudevarp refers to the variable containing the total population’s crude death rate or # is the total population’s crude death rate. Now let’s look at the states’ data (study population): . use http://www.stata-press.com/data/r13/kahn . list, sep(4) state age populat~n death st death_~e 1. 2. 3. 4. California California California California <15 15-24 25-34 35-44 5,524,000 3,558,000 2,677,000 2,359,000 166,285 166,285 166,285 166,285 1 1 1 1 .0016 .0013 .0015 .0028 5. 6. 7. 8. California California California California 45-54 55-64 65-74 75+ 2,330,000 1,704,000 1,105,000 696,000 166,285 166,285 166,285 166,285 1 1 1 1 .0067 .0154 .0328 .0917 9. 10. 11. 12. Maine Maine Maine Maine <15 15-24 25-34 35-44 286,000 168,000 110,000 109,000 11,051 . . . 2 2 2 2 .0019 .0011 .0014 .0029 13. 14. 15. 16. Maine Maine Maine Maine 45-54 55-64 65-74 75+ 110,000 94,000 69,000 46,000 . . . . 2 2 2 2 .0069 .0173 .039 .1041 For each state, the number of individuals in each stratum (age group) is contained in the pop variable. The death variable is the total number of deaths observed in the state during the year. It must have the same value for all observations in the group, as for California, or it could be missing in all but one observation per group, as for Maine. To match these two datasets, the strata variables must have the same name in both datasets and ideally the same levels. If a level is missing from either dataset, that level will not be included in the standardization. With kahn.dta in memory, we now execute the command. We will use the print option to obtain the standard population’s summary table, and because we have both the standard population’s age-specific count and deaths, we will specify the popvars(casevarp popvarp ) option. Or, we could specify the rate(rate 0.00945) option because we know that 0.00945 is the U.S. crude death rate for 1970. dstdize — Direct and indirect standardization . 
istdize death pop age using http://www.stata-press.com/data/r13/popkahn, > by(state) pop(deaths pop) print Standard Population Stratum Rate <15 15-24 25-34 35-44 45-54 55-64 65-74 75+ 0.00178 0.00128 0.00157 0.00315 0.00730 0.01659 0.03583 0.09656 Standard population’s crude rate: 0.00945 -> state= California Indirect Standardization Standard Population Observed Stratum Rate Population <15 15-24 25-34 35-44 45-54 55-64 65-74 75+ 0.0018 0.0013 0.0016 0.0031 0.0073 0.0166 0.0358 0.0966 Totals: 5524000 3558000 2677000 2359000 2330000 1704000 1105000 696000 Cases Expected 9832.72 4543.85 4212.46 7419.59 17010.10 28266.14 39587.63 67206.23 19953000 178078.73 Observed Cases: SMR (Obs/Exp): SMR exact 95% Conf. Interval: [0.9293, Crude Rate: Adjusted Rate: 95% Conf. Interval: [0.0088, 166285 0.93 0.9383] 0.0083 0.0088 0.0089] -> state= Maine Stratum <15 15-24 25-34 35-44 45-54 55-64 65-74 75+ Totals: Indirect Standardization Standard Population Observed Rate Population 0.0018 0.0013 0.0016 0.0031 0.0073 0.0166 0.0358 0.0966 286000 168000 110000 109000 110000 94000 69000 46000 Cases Expected 509.08 214.55 173.09 342.83 803.05 1559.28 2471.99 4441.79 992000 10515.67 Observed Cases: SMR (Obs/Exp): SMR exact 95% Conf. Interval: [1.0314, Crude Rate: Adjusted Rate: 95% Conf. Interval: [0.0097, 11051 1.05 1.0707] 0.0111 0.0099 0.0101] 459 460 dstdize — Direct and indirect standardization Summary of Study Populations (Rates): Cases state Observed Crude Adj_Rate California 166285 0.008334 0.008824 Maine 11051 0.011140 0.009931 Summary of Study Populations (SMR): Cases Cases state Observed Expected SMR California Maine 166285 11051 178078.73 10515.67 0.934 1.051 Confidence Interval [0.008782, 0.008866] [0.009747, 0.010118] Exact Confidence Interval [0.929290, 0.938271] [1.031405, 1.070688] Stored results dstdize stores the following in r(): Scalars r(k) Macros r(by) r(c#) Matrices r(se) r(ub) r(lb) r(Nobs) r(crude) r(adj) number of populations variable names specified in by() values of r(by) for #th group 1×k 1×k 1×k 1×k 1×k 1×k vector vector vector vector vector vector of of of of of of standard errors of adjusted rates upper bounds of confidence intervals for adjusted rates lower bounds of confidence intervals for adjusted rates number of observations crude rates (*) adjusted rates (*) (*) If, in a group, the number of observations is 0, then 9 is stored for the corresponding crude and adjusted rates. istdize stores the following in r(): Scalars r(k) Macros r(by) r(c#) Matrices r(cases obs) r(cases exp) r(ub adj) r(lb adj) r(crude) r(adj) r(smr) r(ub smr) r(lb smr) number of populations variable names specified in by() values of r(by) for #th group 1×k 1×k 1×k 1×k 1×k 1×k 1×k 1×k 1×k vector vector vector vector vector vector vector vector vector of of of of of of of of of number of observed cases number of expected cases upper bounds of confidence lower bounds of confidence crude rates adjusted rates SMRs upper bounds of confidence lower bounds of confidence intervals for adjusted rates intervals for adjusted rates intervals for SMRs intervals for SMRs dstdize — Direct and indirect standardization 461 Methods and formulas The directly standardized rate, SR , is defined by k X SR = wi Ri i=1 k X wi i=1 (Rothman 1986, 44), where Ri is the stratum-specific rate in stratum i and wi is the weight for stratum i derived from the standard population. 
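Written out, this definition is simply the weighted average of the stratum-specific rates, restated here with no new notation:

% S_R: directly standardized rate; R_i: stratum-specific rates; w_i: standard-population weights
\[
S_R \;=\; \frac{\sum_{i=1}^{k} w_i R_i}{\sum_{i=1}^{k} w_i}
\]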
If ni is the population of stratum i, the standard error, se(SR ), in stratified sampling for proportions (ignoring the finite population correction) is v u k X wi 2 Ri (1 − Ri ) 1 u se(SR ) = P t wi i=1 ni (Cochran 1977, 108), from which the confidence intervals are calculated. For indirect standardization, define O as the observed number of cases in each population to be standardized; n1 , . . . , nk as the distribution across the various strata for the population being studied; R1 , . . . , Rk as the stratum-specific rates for the standard population; and C as the crude rate of the standard population. The expected number of cases (deaths), E , in each population is obtained by applying the standard population stratum-specific rates, R1 , . . . , Rk , to the study populations: E= k X ni Ri i=1 The indirectly adjusted rate is then Rindirect = C O E and O/E is the study population’s SMR if death is the event of interest or the SIR for studies of disease (or other) incidence. The exact confidence interval is calculated for each estimated SMR by assuming a Poisson process as described in Breslow and Day (1987, 69–71). These intervals are obtained by first calculating the upper and lower bounds for the confidence interval of the Poisson-distributed observed events, O —say, L and U, respectively—and then computing SMRL = L/E and SMRU = U/E . Acknowledgments We gratefully acknowledge the collaboration of Dr. Joel A. Harrison, consultant; Dr. José Maria Pacheco of the Departamento de Epidemiologia, Faculdade de Saúde Pública/USP, Sao Paulo, Brazil; and Dr John L. Moran of the Queen Elizabeth Hospital, Woodville, Australia. 462 dstdize — Direct and indirect standardization References Breslow, N. E., and N. E. Day. 1987. Statistical Methods in Cancer Research: Vol. 2—The Design and Analysis of Cohort Studies. Lyon: IARC. Cleves, M. A. 1998. sg80: Indirect standardization. Stata Technical Bulletin 42: 43–47. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 224–228. College Station, TX: Stata Press. Cochran, W. G. 1977. Sampling Techniques. 3rd ed. New York: Wiley. Consonni, D. 2012. A command to calculate age-standardized rates with efficient interval estimation. Stata Journal 12: 688–701. Fleiss, J. L., B. Levin, and M. C. Paik. 2003. Statistical Methods for Rates and Proportions. 3rd ed. New York: Wiley. Forthofer, R. N., and E. S. Lee. 1995. Introduction to Biostatistics: A Guide to Design, Analysis, and Discovery. New York: Academic Press. Juul, S., and M. Frydenberg. 2014. An Introduction to Stata for Health Researchers. 4th ed. College Station, TX: Stata Press. Kahn, H. A., and C. T. Sempos. 1989. Statistical Methods in Epidemiology. New York: Oxford University Press. Kirkwood, B. R., and J. A. C. Sterne. 2003. Essential Medical Statistics. 2nd ed. Malden, MA: Blackwell. McGuire, T. J., and J. A. Harrison. 1994. sbe11: Direct standardization. Stata Technical Bulletin 21: 5–9. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 88–94. College Station, TX: Stata Press. Pagano, M., and K. Gauvreau. 2000. Principles of Biostatistics. 2nd ed. Belmont, CA: Duxbury. Rothman, K. J. 1986. Modern Epidemiology. Boston: Little, Brown. van Belle, G., L. D. Fisher, P. J. Heagerty, and T. S. Lumley. 2004. Biostatistics: A Methodology for the Health Sciences. 2nd ed. New York: Wiley. Wang, D. 2000. sbe40: Modeling mortality data using the Lee–Carter model. Stata Technical Bulletin 57: 15–17. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 118–121. 
College Station, TX: Stata Press. Also see [ST] epitab — Tables for epidemiologists [SVY] direct standardization — Direct standardization of means, proportions, and ratios Title dydx — Calculate numeric derivatives and integrals Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Options Acknowledgment Syntax Derivatives of numeric functions dydx yvar xvar if in , generate(newvar) dydx options Integrals of numeric functions integ yvar xvar if in , integ options dydx options Main ∗ generate(newvar) replace ∗ Description create variable named newvar overwrite the existing variable generate(newvar) is required. integ options Description Main generate(newvar) trapezoid initial(#) replace create variable named newvar use trapezoidal rule to compute integrals; default is cubic splines initial value of integral; default is initial(0) overwrite the existing variable by is allowed with dydx and integ; see [D] by. Menu dydx Data > Create or change data > Other variable-creation commands > Calculate numerical derivatives Create or change data > Other variable-creation commands > Calculate numeric integrals integ Data > 463 464 dydx — Calculate numeric derivatives and integrals Description dydx and integ calculate derivatives and integrals of numeric “functions”. Options Main generate(newvar) specifies the name of the new variable to be created. It must be specified with dydx. trapezoid requests that the trapezoidal rule [the sum of (xi − xi−1 )(yi + yi−1 )/2 be used to compute integrals. The default is cubic splines, which give superior results for most smooth functions; for irregular functions, trapezoid may give better results. initial(#) specifies the initial condition for calculating definite integrals; see Methods and formulas below. The default is initial(0). replace specifies that if an existing variable is specified for generate(), it should be overwritten. Remarks and examples dydx and integ lets you extend Stata’s graphics capabilities beyond data analysis and into mathematics. (See Gould [1993] for another command that draws functions.) Example 1 We graph y = e−x/6 sin(x) over the interval [ 0, 12.56 ]: . range x 0 12.56 100 obs was 0, now 100 . generate y = exp(-x/6)*sin(x) . label variable y "exp(-x/6)*sin(x)" −.5 exp(−x/6)*sin(x) 0 .5 1 . twoway connected y x, connect(i) yline(0) 0 5 10 x 15 dydx — Calculate numeric derivatives and integrals 465 We estimate the derivative by using dydx and compute the relative difference between this estimate and the true derivative. . dydx y x, gen(dy) . generate dytrue = exp(-x/6)*(cos(x) - sin(x)/6) . generate error = abs(dy - dytrue)/dytrue The error is greatest at the endpoints, as we would expect. The error is approximately 0.5% at each endpoint, but the error quickly falls to less than 0.01%. 0 Error in derivative estimate .002 .004 .006 . label variable error "Error in derivative estimate" . twoway line error x, ylabel(0(.002).006) 0 5 10 15 x We now estimate the integral by using integ: . integ y x, gen(iy) number of points = 100 integral = .85316396 . generate iytrue = (36/37)*(1 - exp(-x/6)*(cos(x) + sin(x)/6)) . display iytrue[_N] .85315901 . display abs(r(integral) - iytrue[_N])/iytrue[_N] 5.799e-06 . generate diff = iy - iytrue The relative difference between the estimate [stored in r(integral)] and the true value of the integral is about 6 × 10−6 . A graph of the absolute difference (diff) is shown below. Here error is cumulative. 
Again most of the error is due to a relatively poorer fit near the endpoints. . label variable diff "Error in integral estimate" . twoway line diff x, ylabel(0(5.00e-06).00001) dydx — Calculate numeric derivatives and integrals 0 Error in integral estimate 5.00e−06 .00001 466 0 5 10 15 x Stored results dydx stores the following in r(): Macros r(y) name of yvar integ stores the following in r(): Scalars r(N points) number of unique x points r(integral) estimate of the integral Methods and formulas Consider a set of data points, (x1 , y1 ), . . . , (xn , yn ), generated by a function y = f (x). dydx and integ first fit these points with a cubic spline, which is then analytically differentiated (integrated) to give an approximation for the derivative (integral) of f . The cubic spline (see, for example, Press et al. [2007]) consists of n − 1 cubic polynomials Pi (x), with the ith one defined on the interval [xi , xi+1 ], 00 Pi (x) = yi ai (x) + yi+1 bi (x) + yi00 ci (x) + yi+1 di (x) where ai (x) = xi+1 − x xi+1 − xi bi (x) = x − xi xi+1 − xi ci (x) = 1 (xi+1 − xi )2 ai (x)[{ai (x)}2 − 1] 6 di (x) = 1 (xi+1 − xi )2 bi (x)[{bi (x)}2 − 1] 6 00 and yi00 and yi+1 are constants whose values will be determined as described below. The notation for 00 these constants is justified because Pi00 (xi ) = yi00 and Pi00 (xi+1 ) = yi+1 . dydx — Calculate numeric derivatives and integrals 467 Because ai (xi ) = 1, ai (xi+1 ) = 0, bi (xi ) = 0, and bi (xi+1 ) = 1. Therefore, Pi (xi ) = yi , and Pi (xi+1 ) = yi+1 . Thus the Pi jointly define a function that is continuous at the interval boundaries. The first derivative should be continuous at the interval boundaries; that is, 0 Pi0 (xi+1 ) = Pi+1 (xi+1 ) The above n − 2 equations (one equation for each point except the two endpoints) and the values of 0 the first derivative at the endpoints, P10 (x1 ) and Pn−1 (xn ), determine the n constants yi00 . The value of the first derivative at an endpoint is set to the value of the derivative obtained by fitting a quadratic to the endpoint and the two adjacent points; namely, we use P10 (x1 ) = y1 − y2 y1 − y3 y2 − y3 + − x1 − x2 x1 − x3 x2 − x3 and a similar formula for the upper endpoint. dydx approximates f 0 (xi ) by using Pi0 (xi ). Rx integ approximates F (xi ) = F (x1 ) + x1i f (x) dx by using I0 + i−1 Z X k=1 xk+1 Pk (x) dx xk where I0 (an estimate of F (x1 )) is the value specified by the initial(#) option. If the trapezoid option is specified, integ approximates the integral by using the trapezoidal rule: I0 + i−1 X 1 k=1 2 (xk+1 − xk )(yk+1 + yk ) If there are ties among the xi , the mean of yi is computed at each set of ties and the cubic spline is fit to these values. Acknowledgment The present versions of dydx and integ were inspired by the dydx2 command written by Patrick Royston of the MRC Clinical Trials Unit, London, and coauthor of the Stata Press book Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model. References Gould, W. W. 1993. ssi5.1: Graphing functions. Stata Technical Bulletin 16: 23–26. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 188–193. College Station, TX: Stata Press. . 1997. crc46: Better numerical derivatives and integrals. Stata Technical Bulletin 35: 3–5. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 8–12. College Station, TX: Stata Press. Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. 2007. Numerical Recipes: The Art of Scientific Computing. 3rd ed. New York: Cambridge University Press. 
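As a small supplement to example 1, the following sketch contrasts the default cubic-spline quadrature with the trapezoid option on a deliberately coarse 20-point grid, where the difference between the two rules is easier to see. The function and the exact antiderivative are taken from example 1; the coarser grid is the only choice made here that is not from that example.

* same function as example 1, but on a coarse grid so quadrature error is visible
clear
range x 0 12.56 20
generate y = exp(-x/6)*sin(x)
integ y x
display "spline estimate    = " r(integral)
integ y x, trapezoid
display "trapezoid estimate = " r(integral)
display "exact value        = " (36/37)*(1 - exp(-12.56/6)*(cos(12.56) + sin(12.56)/6))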
468 dydx — Calculate numeric derivatives and integrals Also see [D] obs — Increase the number of observations in a dataset [D] range — Generate numerical range Title eform option — Displaying exponentiated coefficients Description Remarks and examples Reference Also see Description An eform option causes the coefficient table to be displayed in exponentiated form: for each coefficient, eb rather than b is displayed. Standard errors and confidence intervals (CIs) are also transformed. An eform option is one of the following: eform option Description eform(string) use string for the column title eform hr shr irr or rrr exponentiated coefficient, string is exp(b) hazard ratio, string is Haz. Ratio subhazard ratio, string is SHR incidence-rate ratio, string is IRR odds ratio, string is Odds Ratio relative-risk ratio, string is RRR Remarks and examples Example 1 Here is a simple example of the or option with svy: logit. The CI for the odds ratio is computed by transforming (by exponentiating) the endpoints of the CI for the corresponding coefficient. . use http://www.stata-press.com/data/r13/nhanes2d . svy, or: logit highbp female black (running logit on estimation sample) (output omitted ) highbp Odds Ratio female black _cons .6107011 1.384865 .7249332 Linearized Std. Err. .0326159 .1336054 .0551062 t -9.23 3.37 -4.23 P>|t| [95% Conf. Interval] 0.000 0.002 0.000 .5476753 1.137507 .6208222 .6809798 1.686011 .8465035 We also could have specified the following command and received the same results as above: . svy: logit highbp female black, or 469 470 eform option — Displaying exponentiated coefficients Reference Buis, M. L. 2012. Stata tip 107: The baseline is now reported. Stata Journal 12: 165–166. Also see [R] ml — Maximum likelihood estimation Title eivreg — Errors-in-variables regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax eivreg depvar indepvars if in weight , options Description options Model reliab(indepvar # indepvar # . . . ) specify measurement reliability for each indepvar measured with error Reporting level(#) display options coeflegend set confidence level; default is level(95) control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling display legend instead of statistics indepvars may contain factor variables; see [U] 11.4.3 Factor variables. bootstrap, by, jackknife, rolling, and statsby are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. aweights and fweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Linear models and related > Errors-in-variables regression Description eivreg fits errors-in-variables regression models. Options Model reliab(indepvar # indepvar # . . . ) specifies the measurement reliability for each independent variable measured with error. Reliabilities are specified as pairs consisting of an independent variable name (a name that appears in indepvars) and the corresponding reliability r, 0 < r ≤ 1. Independent variables for which no reliability is specified are assumed to have reliability 1. 
If the option is not specified, all variables are assumed to have reliability 1, and the result is thus the same as that produced by regress (the ordinary least-squares results). 471 472 eivreg — Errors-in-variables regression Reporting level(#); see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. The following option is available with eivreg but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples For an introduction to errors-in-variables regression, see Draper and Smith (1998, 89–91) or Kmenta (1997, 352–357). Treiman (2009, 258–261) compares the results of errors-in-variables regression with conventional regression. Errors-in-variables regression models are useful when one or more of the independent variables are measured with additive noise. Standard regression (as performed by regress) would underestimate the effect of the variable, and the other coefficients in the model can be biased to the extent that they are correlated with the poorly measured variable. You can adjust for the biases if you know the reliability: noise variance r =1− total variance That is, given the model y = Xβ + u, for some variable xi in X, the xi is observed with error, xi = x∗i + e, and the noise variance is the variance of e. The total variance is the variance of xi . Example 1 Say that in our automobile data, the weight of cars was measured with error, and the reliability of our measured weight is 0.85. The result of this would be to underestimate the effect of weight in a regression of, say, price on weight and foreign, and it would also bias the estimate of the coefficient on foreign (because being of foreign manufacture is correlated with the weight of cars). We would ignore all of this if we fit the model with regress: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress price weight foreign Source SS df MS Model Residual 316859273 318206123 2 71 158429637 4481776.38 Total 635065396 73 8699525.97 price Coef. weight foreign _cons 3.320737 3637.001 -4942.844 Std. Err. .3958784 668.583 1345.591 t 8.39 5.44 -3.67 Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.000 0.000 = = = = = = 74 35.35 0.0000 0.4989 0.4848 2117 [95% Conf. Interval] 2.531378 2303.885 -7625.876 4.110096 4970.118 -2259.812 eivreg — Errors-in-variables regression 473 With eivreg, we can account for our measurement error: . eivreg price weight foreign, r(weight .85) variable assumed reliability weight * 0.8500 1.0000 price Coef. weight foreign _cons 4.31985 4637.32 -8257.017 Errors-in-variables regression Number of obs = 74 F( 2, 71) = 50.37 Prob > F = 0.0000 R-squared = 0.6483 Root MSE = 1773.54 Std. Err. .431431 624.5362 1452.086 t 10.01 7.43 -5.69 P>|t| 0.000 0.000 0.000 [95% Conf. Interval] 3.459601 3392.03 -11152.39 5.180099 5882.609 -5361.639 The effect of weight is increased — as we knew it would be — and here the effect of foreign manufacture is also increased. A priori, we knew only that the estimate of foreign might be biased; we did not know the direction. Technical note Swept under the rug in our example is how we would determine the reliability, r. We can easily see that a variable is measured with error, but we may not know the reliability because the ingredients for calculating r depend on the unobserved noise. 
For our example, we made up a value for r, and in fact we do not believe that weight is measured with error at all, so the reported eivreg results have no validity. The regress results were the statistically correct results here. But let’s say that we do suspect that weight is measured with error and that we do not know r. We could then experiment with various values of r to describe the sensitivity of our estimates to possible error levels. We may not know r, but r does have a simple interpretation, and we could probably produce a sensible range for r by thinking about how the data were collected. If the reliability, r, is less than the R2 from a regression of the poorly measured variable on all the other variables, including the dependent variable, the information might as well not have been collected; no adjustment to the final results is possible. For our automobile data, running a regression of weight on foreign and price would result in an R2 of 0.6743. Thus the reliability must be at least 0.6743 here. If we specify a reliability that is too small, eivreg will inform us and refuse to fit the model: . eivreg price weight foreign, r(weight .6742) reliability r() too small r(399); Returning to our problem of how to estimate r, too small or not, if the measurements are summaries of scaled items, the reliability may be estimated using the alpha command; see [MV] alpha. If the score is computed from factor analysis and the data are scored using predict’s default options (see [MV] factor postestimation), the square of the standard deviation of the score is an estimate of the reliability. 474 eivreg — Errors-in-variables regression Technical note Consider a model with more than one variable measured with error. For instance, say that our model is that price is a function of weight, foreign, and mpg and that both weight and mpg are measured with error. . eivreg price weight foreign mpg, r(weight .85 mpg .9) assumed Errors-in-variables regression variable reliability Number of obs = 74 weight 0.8500 F( 3, 70) = 429.14 mpg 0.9000 Prob > F = 0.0000 * 1.0000 R-squared = 0.9728 Root MSE = 496.41 price Coef. weight foreign mpg _cons 12.88302 8268.951 999.2043 -56473.19 Std. Err. .6820532 352.8719 73.60037 3710.015 t 18.89 23.43 13.58 -15.22 P>|t| 0.000 0.000 0.000 0.000 Stored results eivreg stores the following in e(): Scalars e(N) e(df m) e(df r) e(r2) e(F) e(rmse) e(rank) Macros e(cmd) e(cmdline) e(depvar) e(rellist) e(wtype) e(wexp) e(properties) e(predict) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(V) Functions e(sample) number of observations model degrees of freedom residual degrees of freedom R-squared F statistic root mean squared error rank of e(V) eivreg command as typed name of dependent variable indepvars and associated reliabilities weight type weight expression b V program used to implement predict factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix variance–covariance matrix of the estimators marks estimation sample [95% Conf. Interval] 11.52271 7565.17 852.413 -63872.58 14.24333 8972.732 1145.996 -49073.8 eivreg — Errors-in-variables regression 475 Methods and formulas Let the model to be fit be y = X∗ β + e X = X∗ + U where X∗ are the true values and X are the observed values. Let W be the user-specified weights. If no weights are specified, W = I. If weights are specified, let v be the specified weights. If fweight frequency weights are specified, then W = diag(v). 
If aweight analytic weights are specified, then W = diag{v/(10 v)(10 1)}, meaning that the weights are normalized to sum to the number of observations. The estimates b of β are obtained as A−1 X0 Wy, where A = X0 WX − S. S is a diagonal matrix with elements N (1 − ri )s2i . N is the number of observations, ri is the user-specified reliability coefficient for the ith explanatory variable or 1 if not specified, and s2i is the (appropriately weighted) variance of the variable. The variance–covariance matrix of the estimators is obtained as s2 A−1 X0 WXA−1 , where the root mean squared error s2 = (y0 Wy − bAb0 )/(N − p), where p is the number of estimated parameters. References Draper, N., and H. Smith. 1998. Applied Regression Analysis. 3rd ed. New York: Wiley. Kmenta, J. 1997. Elements of Econometrics. 2nd ed. Ann Arbor: University of Michigan Press. Treiman, D. J. 2009. Quantitative Data Analysis: Doing Social Research to Test Ideas. San Francisco: Jossey-Bass. Also see [R] eivreg postestimation — Postestimation tools for eivreg [R] regress — Linear regression [SEM] example 24 — Reliability [U] 20 Estimation and postestimation commands Title eivreg postestimation — Postestimation tools for eivreg Description Also see Syntax for predict Menu for predict Options for predict Description The following postestimation commands are available after eivreg: Command Description contrast estat summarize estat vce estimates forecast lincom contrasts and ANOVA-style joint tests of estimates summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest margins marginsplot nlcom predict predictnl pwcompare test testnl Syntax for predict predict statistic type newvar if in , statistic Description Main xb residuals stdp stdf pr(a,b) e(a,b) ystar(a,b) linear prediction; the default residuals standard error of the prediction standard error of the forecast Pr(a < yj < b) E(yj | a < yj < b) E(yj∗ ), yj∗ = max{a, min(yj , b)} These statistics are available both in and out of sample; type predict the estimation sample. 476 . . . if e(sample) . . . if wanted only for eivreg postestimation — Postestimation tools for eivreg 477 where a and b may be numbers or variables; a missing (a ≥ .) means −∞, and b missing (b ≥ .) means +∞; see [U] 12.2.1 Missing values. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main xb, the default, calculates the linear prediction. residuals calculates the residuals, that is, yj − xj b. stdp calculates the standard error of the prediction, which can be thought of as the standard error of the predicted expected value or mean for the observation’s covariate pattern. The standard error of the prediction is also referred to as the standard error of the fitted value. 
stdf calculates the standard error of the forecast, which is the standard error of the point prediction for 1 observation and is commonly referred to as the standard error of the future or forecast value. By construction, the standard errors produced by stdf are always larger than those produced by stdp; see Methods and formulas in [R] regress postestimation. pr(a,b) calculates Pr(a < xj b + uj < b), the probability that yj |xj would be observed in the interval (a, b). a and b may be specified as numbers or variable names; lb and ub are variable names; pr(20,30) calculates Pr(20 < xj b + uj < 30); pr(lb,ub) calculates Pr(lb < xj b + uj < ub); and pr(20,ub) calculates Pr(20 < xj b + uj < ub). a missing (a ≥ .) means −∞; pr(.,30) calculates Pr(−∞ < xj b + uj < 30); pr(lb,30) calculates Pr(−∞ < xj b + uj < 30) in observations for which lb ≥ . and calculates Pr(lb < xj b + uj < 30) elsewhere. b missing (b ≥ .) means +∞; pr(20,.) calculates Pr(+∞ > xj b + uj > 20); pr(20,ub) calculates Pr(+∞ > xj b + uj > 20) in observations for which ub ≥ . and calculates Pr(20 < xj b + uj < ub) elsewhere. e(a,b) calculates E(xj b + uj | a < xj b + uj < b), the expected value of yj |xj conditional on yj |xj being in the interval (a, b), meaning that yj |xj is truncated. a and b are specified as they are for pr(). ystar(a,b) calculates E(yj∗ ), where yj∗ = a if xj b + uj ≤ a, yj∗ = b if xj b + uj ≥ b, and yj∗ = xj b + uj otherwise, meaning that yj∗ is censored. a and b are specified as they are for pr(). Also see [R] eivreg — Errors-in-variables regression [U] 20 Estimation and postestimation commands Title error messages — Error messages and return codes Description Also see Description Whenever Stata detects that something is wrong — that what you typed is uninterpretable, that you are trying to do something you should not be trying to do, or that you requested the impossible — Stata responds by typing a message describing the problem, together with a return code. For instance, . lsit unrecognized command: lsit r(199); . list myvar variable myvar not found r(111); . test a=b last estimates not found r(301); In each case, the message is probably sufficient to guide you to a solution. When we typed lsit, Stata responded with “unrecognized command”. We meant to type list. When we typed list myvar, Stata responded with “variable myvar not found”. There is no variable named myvar in our data. When we typed test a=b, Stata responded with “last estimates not found”. test tests hypotheses about previously fit models, and we have not yet fit a model. The numbers in parentheses in the r(199), r(111), and r(301) messages are called the return codes. To find out more about these messages, type search rc #, where # is the number returned in the parentheses. Example 1 . search rc 301 [P] error messages . . . . . . . . . . . . . . . . . . . . Return code 301 last estimates not found; You typed an estimation command such as regress without arguments or attempted to perform a test or typed predict, but there were no previous estimation results. Programmers should see [P] error for details on programming error messages. 
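When writing do-files or programs, you need not stop at looking up a return code; you can trap it and branch on it. The sketch below (the variable name myvar is hypothetical, echoing the earlier example) uses capture, which suppresses the error message and leaves the return code in _rc:

* minimal sketch: trap a "not found" error instead of stopping the do-file
capture confirm variable myvar
if _rc == 111 {
    display as error "myvar not found; skipping that step"
}
else if _rc {
    exit _rc    // some other problem: stop and pass along the original return code
}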
Also see [R] search — Search Stata documentation and other resources 478 Title esize — Effect size based on mean comparison Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax Effect sizes for two independent samples using groups in , by(groupvar) options esize twosample varname if Effect sizes for two independent samples using variables in , options esize unpaired varname1 == varname2 if Immediate form of effect sizes for two independent samples esizei # obs1 # mean1 # sd1 # obs2 # mean2 # sd2 , options Immediate form of effect sizes for F tests after an ANOVA esizei # df1 # df2 # F , level(#) Description options Main cohensd hedgesg glassdelta pbcorr all unequal welch level(#) report Cohen’s d (1988) report Hedges’s g (1981) report Glass’s ∆ (Smith and Glass 1977) using each group’s standard deviation report the point-biserial correlation coefficient (Pearson 1909) report all estimates of effect size use unequal variances use Welch’s (1947) approximation set confidence level; default is level(95) by is allowed with esize; see [D] by. Menu esize Statistics > Summaries, tables, and tests > Classical tests of hypotheses > Effect size based on mean comparison esizei Statistics > Summaries, tables, and tests > Classical tests of hypotheses 479 > Effect-size calculator 480 esize — Effect size based on mean comparison Description esize calculates effect sizes for comparing the difference between the means of a continuous variable for two groups. In the first form, esize calculates effect sizes for the difference between the mean of varname for two groups defined by groupvar. In the second form, esize calculates effect sizes for the difference between varname1 and varname2 , assuming unpaired data. esizei is the immediate form of esize; see [U] 19 Immediate commands. In the first form, esizei calculates the effect size for comparing the difference between the means of two groups. In the second form, esizei calculates the effect size for an F test after an ANOVA. Options Main by(groupvar) specifies the groupvar that defines the two groups that esize will use to estimate the effect sizes. Do not confuse the by() option with the by prefix; you can specify both. cohensd specifies that Cohen’s d (1988) be reported. hedgesg specifies that Hedges’s g (1981) be reported. glassdelta specifies that Glass’s ∆ (Smith and Glass 1977) be reported. pbcorr specifies that the point-biserial correlation coefficient (Pearson 1909) be reported. all specifies that all estimates of effect size be reported. The default is Cohen’s d and Hedges’s g . unequal specifies that the data not be assumed to have equal variances. welch specifies that the approximate degrees of freedom for the test be obtained from Welch’s formula (1947) rather than from Satterthwaite’s approximation formula (1946), which is the default when unequal is specified. Specifying welch implies unequal. level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. Remarks and examples Whereas p-values are used to assess the statistical significance of a result, measures of effect size are used to assess the practical significance of a result. Effect sizes can be broadly categorized as “measures of group differences” (the d family) and “measures of association” (the r family); see Ellis (2010, table 1.1). 
The d family includes estimators such as Cohen’s d, Hedges’s g , and Glass’s ∆. The r family includes estimators such as the point-biserial correlation coefficient, ω 2 and η 2 (also see estat esize in [R] regress postestimation). For an introduction to the concepts and calculation of effect sizes, see Kline (2013) and Thompson (2006). For a more detailed discussion, see Kirk (1996), Ellis (2010), Cumming (2012), Grissom and Kim (2012), and Kelley and Preacher (2012). It should be noted that there is much variation in the definitions of measures of effect size (Kline 2013). As Ellis (2010, 27) cautions, “However, beware the inconsistent terminology. What is labeled here as g was labeled by Hedges and Olkin as d and vice versa. For these authors writing in the early 1980s, g was the mainstream effect-size index developed by Cohen and refined by Glass (hence g for Glass). However, since then g has become synonymous with Hedges’s equation (not Glass’s) and the reason it is called Hedges’s g and not Hedges’s h is because it was originally named after Glass—even though it was developed by Larry Hedges. Confused?” esize — Effect size based on mean comparison 481 To avoid confusion, esize and esizei closely follow the notation of Hedges (1981), Smithson (2001), Kline (2013), and Ellis (2010). Example 1: Effect size for two independent samples using by() Suppose we are interested in question 1 from the fictitious depression.dta: “My statistical software makes me feel sad”. We might have conducted a t test to test the null hypothesis that there is no difference in response by sex. We could then compute various measures of effect size to describe the magnitude of the effect of sex. . use http://www.stata-press.com/data/r13/depression (Fictitious Depression Inventory data based on the Beck Depression Inventory) . esize twosample qu1, by(sex) all Effect size based on mean comparison Obs per group: Female = 712 Male = 288 Effect Size Estimate d g 1 2 r -.0512417 -.0512032 -.0517793 -.0499786 -.0232208 Cohen’s Hedges’s Glass’s Delta Glass’s Delta Point-Biserial [95% Conf. Interval] -.1881184 -.187977 -.1886587 -.1868673 -.0849629 .0856607 .0855963 .0851364 .086997 .0387995 Cohen’s d, Hedges’s g , and both estimates of Glass’s ∆ indicate that the score for females is 0.05 standard deviations lower than the score for males. The point-biserial correlation coefficient indicates that there is a small, negative correlation between the scores for females and males. Technical note Glass’s ∆ has traditionally been estimated for experimental studies using the control group standard deviation rather than the pooled standard deviation. Kline (2013) notes that the choice of group becomes arbitrary for data arising from observational studies and recommends the reporting of Glass’s ∆ using each group standard deviation. Example 2: Effect size for two independent samples by a third variable If we are interested in the same effect sizes from example 1 stratified by race, we could use the by prefix with the sort option to accomplish this task. 482 esize — Effect size based on mean comparison . by race, sort: esize twosample qu1, by(sex) -> race = Hispanic Effect size based on mean comparison Obs per group: Female = Male = Effect Size Estimate Cohen’s d Hedges’s g -.1042883 -.1036899 [95% Conf. 
Interval] -.463503 -.4608434 .2553235 .2538584 -> race = Black Effect size based on mean comparison Obs per group: Female = Male = Effect Size Estimate Cohen’s d Hedges’s g -.1720681 -.1717012 Estimate Cohen’s d Hedges’s g .0479511 .0478807 259 95 [95% Conf. Interval] -.4073814 -.4065128 .063489 .0633536 -> race = White Effect size based on mean comparison Obs per group: Female = Male = Effect Size 88 45 365 148 [95% Conf. Interval] -.1430932 -.1428831 .2389486 .2385977 Example 3: Bootstrap confidence intervals for effect sizes Simulation studies have shown that bootstrap confidence intervals may be preferable to confidence intervals based on the noncentral t distribution when the variable of interest does not have a normal distribution (Kelley 2005; Algina, Keselman, and Penfield 2006). Bootstrap confidence intervals can be easily estimated for effect sizes using the bootstrap prefix. esize — Effect size based on mean comparison 483 . use http://www.stata-press.com/data/r13/depression (Fictitious Depression Inventory data based on the Beck Depression Inventory) . set seed 12345 . bootstrap r(d) r(g), reps(1000) nodots nowarn: esize twosample qu1, by(sex) Bootstrap results Number of obs = 1000 Replications = 1000 command: esize twosample qu1, by(sex) _bs_1: r(d) _bs_2: r(g) _bs_1 _bs_2 Observed Coef. Bootstrap Std. Err. -.0512417 -.0512032 .07169 .0716361 z -0.71 -0.71 Normal-based [95% Conf. Interval] P>|z| 0.475 0.475 -.1917515 -.1916074 .0892682 .0892011 Example 4: Effect sizes for two independent samples using variables Sometimes, the data of interest are stored in two separate variables. We can calculate effect sizes for the two groups by using the unpaired version of esize. . use http://www.stata-press.com/data/r13/fuel . esize unpaired mpg1==mpg2 Effect size based on mean comparison Number of obs = Effect Size Estimate Cohen’s d Hedges’s g -.5829654 -.5628243 24 [95% Conf. Interval] -1.394934 -1.34674 .2416105 .2332631 Example 5: Immediate form for effect sizes for two means Often we do not have access to raw data, but we are given summary statistics in a report or manuscript. To calculate the effect sizes from summary statistics, we can use the immediate command esizei. For example, Kline (2013) in table 4.2 shows summary statistics for a hypothetical sample where mean1 = 13, sd1 = 2.74, mean2 = 11, and sd2 = 2.24; there are 30 people in each group. We can estimate the effect sizes from these summary data using esizei: . esizei 30 13 2.74 30 11 2.24 Effect size based on mean comparison Obs per group: Group 1 = Group 2 = 30 30 Effect Size Estimate [95% Conf. Interval] Cohen’s d Hedges’s g .7991948 .7888081 .2695509 .2660477 1.322465 1.305277 484 esize — Effect size based on mean comparison Example 6: Immediate form for effect sizes for F tests after an ANOVA esizei can also be used to compute η 2 and ω 2 for F tests after an ANOVA. The following example from Smithson (2001, 623) illustrates the use of esizei for dfnum = 4, dfden = 50, and F = 4.2317. . esizei 4 50 4.2317, level(90) Effect sizes for linear models Effect Size Estimate [90% Conf. 
Interval] Eta-Squared Omega-Squared .2529151 .1931483 .0521585 0 .3603621 .309191 Stored results esize and esizei for comparing two means store the following in r(): Scalars r(d) r(lb d) r(ub d) r(g) r(lb g) r(ub g) r(delta1) r(lb delta1) r(ub delta1) r(delta2) r(lb delta2) r(ub delta2) r(r pb) r(lb r pb) r(ub r pb) r(N 1) r(N 2) r(df t) r(level) Cohen’s d lower confidence bound for Cohen’s d upper confidence bound for Cohen’s d Hedges’s g lower confidence bound for Hedges’s g upper confidence bound for Hedges’s g Glass’s ∆ for group 1 lower confidence bound for Glass’s ∆ for group 1 upper confidence bound for Glass’s ∆ for group 1 Glass’s ∆ for group 2 lower confidence bound for Glass’s ∆ for group 2 upper confidence bound for Glass’s ∆ for group 2 point-biserial correlation coefficient lower confidence bound for the point-biserial correlation coefficient upper confidence bound for the point-biserial correlation coefficient sample size n1 sample size n2 degrees of freedom confidence level esizei for F tests after ANOVA stores the following in r(): Scalars r(eta2) r(lb eta2) r(ub eta2) r(omega2) r(lb omega2) r(ub omega2) r(level) η2 lower confidence bound for η 2 upper confidence bound for η 2 ω2 lower confidence bound for ω 2 upper confidence bound for ω 2 confidence level Methods and formulas For the d family, the effect-size parameter of interest is the scaled difference between the means given by (µ1 − µ2 ) δ= σ esize — Effect size based on mean comparison 485 One of the most popular estimators of effect size is Cohen’s d, given by Cohen’s d = (x1 − x2 ) s∗ where s s∗ = (n1 − 1)s21 + (n2 − 1)s22 n1 + n2 − 2 Hedges (1981) showed that Cohen’s d is biased and proposed the unbiased estimator Hedges’s g = Cohen’s d × c(m) where m = n1 + n2 − 2 and Γ m 2 c(m) = p m m−1 2Γ 2 Glass (Smith and Glass 1977) proposed an estimator for δ in the context of designed experiments, Glass’s ∆ = (xtreated − xcontrol ) scontrol where scontrol is the standard deviation for the control group. As noted above, esize and esizei report two estimates of Glass’s ∆: one using the standard deviation for group 1 and the other using the standard deviation for group 2: Glass’s ∆1 = (x1 − x2 ) s1 Glass’s ∆2 = (x1 − x2 ) s2 and For the r family, the effect-size parameter of interest is the ratio of the variance attributable to an effect and the total variance: σ2 η 2 = effect 2 σtotal A popular estimator of η when there are two groups is the point-biserial correlation coefficient, rPB = √ t2 t + df where t is the t statistic for the difference between the means of the two groups, and df is the corresponding degrees of freedom. Satterthwaite’s or Welch’s adjustment (see [R] ttest for details) to the degrees of freedom can be used to calculate rPB by specifying the unequal or welch option, respectively. 486 esize — Effect size based on mean comparison When more than two means are being compared, as in the case of an ANOVA with p groups, a popular estimator of effect size is the correlation ratio denoted η 2 (Fisher 1925; Kerlinger 1964). η 2 can be computed directly as the ratio of the SSeffect and the SStotal or as a function of the F statistic with numerator degrees of freedom equal to dfnum and denominator degrees of freedom equal to dfden . (F × dfnum ) ηb2 = (F × dfnum ) + dfden Like its equivalent estimator R2 , η 2 has an upward bias. 
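As a quick arithmetic check, this formula reproduces the point estimates that esizei reported in example 6 above, where dfnum = 4, dfden = 50, and F = 4.2317. Run from a do-file, the second line uses the expression for omega-squared in terms of eta-squared given just below:

    display (4.2317*4)/(4.2317*4 + 50)           // eta-squared, approximately .2529151
    display .2529151 - (4/50)*(1 - .2529151)     // omega-squared, approximately .1931483

Both values agree, up to rounding, with the Eta-Squared and Omega-Squared estimates shown in example 6.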
The less biased (though not unbiased) estimator ω 2 (Hays 1963) is equivalent to the adjusted R2 and can be estimated directly from the sums of squares, the F statistic, or as a function of η 2 ; that is, ω b2 = SSbetween − (p − 1)MSwithin + MSwithin SStotal or ω b2 = (p − 1)(F − 1) (p − 1)(F − 1) + (p)(n) or 2 2 ω b =η − dfnum dfden ! × (1 − η 2 ) To calculate ηb2 and ω b 2 directly after anova or regress, see estat esize in [R] regress postestimation. Cohen’s d, Hedges’s g , and Glass’s ∆ have been shown to have a noncentral t distribution (Hedges 1981) with noncentrality parameter equal to r λ=δ n1 n2 n1 + n2 Confidence intervals are calculated by finding the noncentrality parameters λlower and λupper that correspond to α Pr(df, δ, λlower ) = 1 − 2 and Pr(df, δ, λupper ) = α 2 using the function npnt(df ,t,p). The noncentrality parameters are then transformed back to the effect-size scale: r n1 + n2 δlower = λlower n1 n2 and r δupper = λupper n1 + n2 n1 n2 (see Venables [1975]; Steiger and Fouladi [1997]; Cumming and Finch [2001]; Smithson [2001]). esize — Effect size based on mean comparison 487 Confidence intervals for the point-biserial correlation coefficient are calculated similarly and transformed back to the effect-size scale as λlower rlower = p 2 λlower + df and λupper rupper = q λ2upper + df Following Smithson’s (2001) notation, the F statistic is written as Fdfnum ,dfden = f 2 (dfnum /dfden ) This equation has a noncentral F distribution with noncentrality parameter: λ = f 2 (dfnum + dfden + 1) where f 2 = R2 /(1 − R2 ). Confidence intervals for ηb2 and ω b 2 are calculated by finding the noncentrality parameters λlower and λupper for a noncentral F distribution that correspond to Pr(dfnum , dfden , F, λlower ) = 1 − α 2 and α 2 using the function npnF(df1 ,df2 ,f ,p). The noncentrality parameters are transformed back to the ηb2 scale as ! λlower 2 ηblower = max 0, λlower + dfnum + dfden + 1 Pr(dfnum , dfden , F, λupper ) = and 2 ηbupper λupper = min 1, λupper + dfnum + dfden + 1 ! The confidence limits for ω b 2 are then calculated as a function of ηb2 : 2 2 ω blower = ηblower − dfnum dfden ! dfnum dfden ! and 2 ω bupper = 2 ηbupper See Smithson (2001) for further details. − 2 × (1 − ηblower ) 2 × (1 − ηbupper ) 488 esize — Effect size based on mean comparison Fred Nichols Kerlinger (1910–1991) was born in New York City. He studied music at New York University and graduated magna cum laude with a degree in education and philosophy. After graduation, he joined the U.S. Army and served as a counterintelligence officer in Japan in 1946. Kerlinger earned an MA and a PhD in educational psychology from the University of Michigan and held faculty appointments at several universities, including New York University. He was president of the American Educational Research Association and is best known for his popular and influential book Foundations of Behavioral Research (1964), which introduced Fisher’s (1925) η 2 statistic to behavioral researchers. William Lee Hays (1926–1995) was born in Clarksville, Texas. He studied mathematics and psychology at Paris Junior College in Paris, Texas, and at East Texas State College. He earned BS and MS degrees from North Texas State University. Upon completion of his PhD in psychology at the University of Michigan, he joined the faculty, where he eventually became associate vice president for academic affairs. 
In 1977, Hays accepted an appointment as vice president for academic affairs at the University of Texas at Austin, where he remained until his death in 1995. Hays is best known for his book Statistics for Psychologists (1963), which introduced the ω 2 statistic. References Algina, J., H. J. Keselman, and R. D. Penfield. 2006. Confidence interval coverage for Cohen’s effect size statistic. Educational and Psychological Measurement 66: 945–960. Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum. Cumming, G. 2012. Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Taylor & Francis. Cumming, G., and S. Finch. 2001. A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement 61: 532–574. Ellis, P. D. 2010. The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results. Cambridge: Cambridge University Press. Fisher, R. A. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd. Grissom, R. J., and J. J. Kim. 2012. Effect Sizes for Research: Univariate and Multivariate Applications. 2nd ed. New York: Taylor & Francis. Hays, W. L. 1963. Statistics for Psychologists. New York: Holt, Rinehart & Winston. Hedges, L. V. 1981. Distribution theory for Glass’s estimator of effect size and related estimators. Journal of Educational Statistics 6: 107–128. Huber, C. 2013. Measures of effect size in Stata 13. The Stata Blog: Not Elsewhere Classified. http://blog.stata.com/2013/09/05/measures-of-effect-size-in-stata-13/. Kelley, K. 2005. The effects of nonnormal distributions on confidence intervals around the standardized mean difference: Bootstrap and parametric confidence intervals. Educational and Psychological Measurement 65: 51–69. Kelley, K., and K. J. Preacher. 2012. On effect size. Psychological Methods 17: 137–152. Kerlinger, F. N. 1964. Foundations of Behavioral Research. New York: Holt, Rinehart & Winston. Kirk, R. E. 1996. Practical significance: A concept whose time has come. Educational and Psychological Measurement 56: 746–759. Kline, R. B. 2013. Beyond Significance Testing: Statistics Reform in the Behavioral Sciences. 2nd ed. Washington, DC: American Psychological Association. Pearson, K. 1909. On a new method of determining correlation between a measured character A, and a character B, of which only the percentage of cases wherein B exceeds (or falls short of) a given intensity is recorded for each grade of A. Biometrika 7: 96–105. esize — Effect size based on mean comparison 489 Satterthwaite, F. E. 1946. An approximate distribution of estimates of variance components. Biometrics Bulletin 2: 110–114. Smith, M. L., and G. V. Glass. 1977. Meta-analysis of psychotherapy outcome studies. American Psychologist 32: 752–760. Smithson, M. 2001. Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational and Psychological Measurement 61: 605–632. Steiger, J. H., and R. T. Fouladi. 1997. Noncentrality interval estimation and the evaluation of statistical models. In What If There Were No Significance Tests?, ed. L. L. Harlow, S. A. Mulaik, and J. H. Steiger, 221–257. Mahwah, NJ: Erlbaum. Thompson, B. 2006. Foundations of Behavioral Statistics: An Insight-Based Approach. New York: Guilford Press. Venables, W. 1975. 
Calculation of confidence intervals for noncentrality parameters. Journal of the Royal Statistical Society, Series B 37: 406–412. Welch, B. L. 1947. The generalization of ‘student’s’ problem when several different population variances are involved. Biometrika 34: 28–35. Also see [R] bitest — Binomial probability test [R] ci — Confidence intervals for means, proportions, and counts [R] mean — Estimate means [R] oneway — One-way analysis of variance [R] prtest — Tests of proportions [R] sdtest — Variance-comparison tests [R] ttest — t tests (mean-comparison tests) Title estat — Postestimation statistics Syntax Description Syntax Command Reference Display information criteria estat ic , n(#) [R] estat ic Summarize estimation sample estat summarize eqlist , estat summ options [R] estat summarize Display covariance matrix estimates estat vce , estat vce options [R] estat vce Command-specific estat subcommand1 , options1 Description estat displays scalar- and matrix-valued statistics after estimation; it complements predict, which calculates variables after estimation. Exactly what statistics estat can calculate depends on the previous estimation command. Three sets of statistics are so commonly used that they are available after all estimation commands that store the model log likelihood. estat ic displays Akaike’s and Schwarz’s Bayesian information criteria. estat summarize summarizes the variables used by the command and automatically restricts the sample to e(sample); it also summarizes the weight variable and cluster structure, if specified. estat vce displays the covariance or correlation matrix of the parameter estimates of the previous model. 490 Title estat classification — Classification statistics and table Syntax Remarks and examples Also see Menu for estat Stored results Description Methods and formulas Options References Syntax estat classification if in weight , options Description options Main display summary statistics for all observations in the data positive outcome threshold; default is cutoff(0.5) all cutoff(#) fweights are allowed; see [U] 11.1.6 weight. estat classification is not appropriate after the svy prefix. Menu for estat Statistics > Postestimation > Reports and statistics Description estat classification reports various summary statistics, including the classification table. estat classification requires that the current estimation results be from logistic, logit, probit, or ivprobit; see [R] logistic, [R] logit, [R] probit, or [R] ivprobit. Options Main all requests that the statistic be computed for all observations in the data, ignoring any if or in restrictions specified by the estimation command. cutoff(#) specifies the value for determining whether an observation has a predicted positive outcome. An observation is classified as positive if its predicted probability is ≥ #. The default is 0.5. Remarks and examples estat classification presents the classification statistics and classification table after logistic, logit, probit, or ivprobit. Statistics are produced either for the estimation sample (the default) or for any set of observations. When weights, if, or in is used with the estimation command, it is not necessary to repeat the qualifier when you want statistics computed for the estimation sample. Specify if, in, or the all option only when you want statistics computed for a set of observations other than the estimation sample. Specify weights only when you want to use a different set of weights. 
491 492 estat classification — Classification statistics and table Example 1 We illustrate estat classification after logistic; see [R] logistic. . use http://www.stata-press.com/data/r13/lbw (Hosmer & Lemeshow data) . logistic low age lwt i.race smoke ptl ht ui (output omitted ) . estat classification Logistic model for low True Classified D ~D Total + - 21 38 12 118 33 156 Total 59 130 Classified + if predicted Pr(D) >= .5 True D defined as low != 0 189 Sensitivity Specificity Positive predictive value Negative predictive value Pr( +| D) Pr( -|~D) Pr( D| +) Pr(~D| -) 35.59% 90.77% 63.64% 75.64% False False False False Pr( +|~D) Pr( -| D) Pr(~D| +) Pr( D| -) 9.23% 64.41% 36.36% 24.36% + + - rate rate rate rate for for for for true ~D true D classified + classified - Correctly classified 73.54% The overall rate of correct classification is estimated to be 73.54, with 90.77% of the normal weight group correctly classified (specificity) and only 35.59% of the low weight group correctly classified (sensitivity). Classification is sensitive to the relative sizes of each component group, and always favors classification into the larger group. This phenomenon is evident here. By default, estat classification uses a cutoff of 0.5, although you can vary this with the cutoff() option. You can use the lsens command to review the potential cutoffs; see [R] lsens. Stored results estat classification stores the following in r(): Scalars r(P r(P r(P r(P r(P r(P r(P r(P r(P corr) p1) n0) p0) n1) 1p) 0n) 0p) 1n) percent correctly classified sensitivity specificity false-positive rate given true negative false-negative rate given true positive positive predictive value negative predictive value false-positive rate given classified positive false-negative rate given classified negative estat classification — Classification statistics and table 493 Methods and formulas Let j index observations. Define c as the cutoff() specified by the user or, if not specified, as 0.5. Let pj be the predicted probability of a positive outcome and yj be the actual outcome, which we will treat as 0 or 1, although Stata treats it as 0 and non-0, excluding missing observations. A prediction is classified as positive if pj ≥ c and otherwise is classified as negative. The classification is correct if it is positive and yj = 1 or if it is negative and yj = 0. Sensitivity is the fraction of yj = 1 observations that are correctly classified. Specificity is the percentage of yj = 0 observations that are correctly classified. References Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley. Kohler, U., and F. Kreuter. 2012. Data Analysis Using Stata. 3rd ed. College Station, TX: Stata Press. 
Also see [R] logistic — Logistic regression, reporting odds ratios [R] logit — Logistic regression, reporting coefficients [R] probit — Probit regression [R] ivprobit — Probit model with continuous endogenous regressors [R] lroc — Compute area under ROC curve and graph the curve [R] lsens — Graph sensitivity and specificity versus probability cutoff [R] estat gof — Pearson or Hosmer–Lemeshow goodness-of-fit test [R] roc — Receiver operating characteristic (ROC) analysis [U] 20 Estimation and postestimation commands Title estat gof — Pearson or Hosmer–Lemeshow goodness-of-fit test Syntax Remarks and examples Also see Menu for estat Stored results Description Methods and formulas Options References Syntax estat gof if in weight , options Description options Main perform Hosmer–Lemeshow goodness-of-fit test using # quantiles execute test for all observations in the data adjust degrees of freedom for samples outside estimation sample display table of groups used for test group(#) all outsample table fweights are allowed; see [U] 11.1.6 weight. For information on using estat gof with survey data, see [SVY] estat. Menu for estat Statistics > Postestimation > Reports and statistics Description estat gof reports the Pearson goodness-of-fit test or the Hosmer – Lemeshow goodness-of-fit test. estat gof requires that the current estimation results be from logistic, logit, or probit; see [R] logistic, [R] logit, or [R] probit. For estat gof after poisson, see [R] poisson postestimation. For estat gof after sem, see [SEM] estat gof. Options Main group(#) specifies the number of quantiles to be used to group the data for the Hosmer – Lemeshow goodness-of-fit test. group(10) is typically specified. If this option is not given, the Pearson goodness-of-fit test is computed using the covariate patterns in the data as groups. all requests that the statistic be computed for all observations in the data, ignoring any if or in restrictions specified by the estimation command. outsample adjusts the degrees of freedom for the Pearson and Hosmer – Lemeshow goodness-of-fit tests for samples outside the estimation sample. See Samples other than the estimation sample later in this entry. 494 estat gof — Pearson or Hosmer–Lemeshow goodness-of-fit test 495 table displays a table of the groups used for the Hosmer – Lemeshow or Pearson goodness-of-fit test with predicted probabilities, observed and expected counts for both outcomes, and totals for each group. Remarks and examples Remarks are presented under the following headings: Introduction Samples other than the estimation sample Introduction estat gof computes goodness-of-fit tests: either the Pearson χ2 test or the Hosmer – Lemeshow test. By default, estat gof computes statistics for the estimation sample by using the last model fit by logistic, logit, or probit. However, samples other than the estimation sample can be specified; see Samples other than the estimation sample later in this entry. Example 1 estat gof, typed without options, presents the Pearson χ2 goodness-of-fit test for the fitted model. The Pearson χ2 goodness-of-fit test is a test of the observed against expected number of responses using cells defined by the covariate patterns; see predict with the number option in [R] logistic postestimation for the definition of covariate patterns. . use http://www.stata-press.com/data/r13/lbw (Hosmer & Lemeshow data) . logistic low age lwt i.race smoke ptl ht ui (output omitted ) . 
estat gof Logistic model for low, goodness-of-fit test number of observations number of covariate patterns Pearson chi2(173) Prob > chi2 = = = = 189 182 179.24 0.3567 Our model fits reasonably well. However, the number of covariate patterns is close to the number of observations, making the applicability of the Pearson χ2 test questionable but not necessarily inappropriate. Hosmer, Lemeshow, and Sturdivant (2013, 157–160) suggest regrouping the data by ordering on the predicted probabilities and then forming, say, 10 nearly equal-sized groups. estat gof with the group() option does this: . estat gof, group(10) Logistic model for low, goodness-of-fit test (Table collapsed on quantiles of estimated probabilities) number of observations number of groups Hosmer-Lemeshow chi2(8) Prob > chi2 = = = = 189 10 9.65 0.2904 Again we cannot reject our model. If we specify the table option, estat gof displays the groups along with the expected and observed number of positive responses (low-birthweight babies): 496 estat gof — Pearson or Hosmer–Lemeshow goodness-of-fit test . estat gof, group(10) table Logistic model for low, goodness-of-fit test (Table collapsed on quantiles of estimated probabilities) Group Prob Obs_1 Exp_1 Obs_0 Exp_0 Total 1 2 3 4 5 0.0827 0.1276 0.2015 0.2432 0.2792 0 2 6 1 7 1.2 2.0 3.2 4.3 4.9 19 17 13 18 12 17.8 17.0 15.8 14.7 14.1 19 19 19 19 19 6 7 8 9 10 0.3138 0.3872 0.4828 0.5941 0.8391 7 6 7 10 13 5.6 6.5 8.2 10.3 12.8 12 13 12 9 5 13.4 12.5 10.8 8.7 5.2 19 19 19 19 18 number of observations number of groups Hosmer-Lemeshow chi2(8) Prob > chi2 = = = = 189 10 9.65 0.2904 Technical note estat gof with the group() option puts all observations with the same predicted probabilities into the same group. If, as in the previous example, we request 10 groups, the groups that estat gof makes are [ p0 , p10 ], (p10 , p20 ], (p20 , p30 ], . . . , (p90 , p100 ], where pk is the k th percentile of the predicted probabilities, with p0 the minimum and p100 the maximum. If there are many ties at the quantile boundaries, as will often happen if all independent variables are categorical and there are only a few of them, the sizes of the groups will be uneven. If the totals in some of the groups are small, the χ2 statistic for the Hosmer – Lemeshow test may be unreliable. In this case, fewer groups should be specified, or the Pearson goodness-of-fit test may be a better choice. Example 2 The table option can be used without the group() option. We would not want to specify this for our current model because there were 182 covariate patterns in the data, caused by including the two continuous variables, age and lwt, in the model. As an aside, we fit a simpler model and specify table with estat gof: estat gof — Pearson or Hosmer–Lemeshow goodness-of-fit test . logistic low i.race smoke ui Logistic regression Number of obs LR chi2(4) Prob > chi2 Pseudo R2 Log likelihood = -107.93404 low Odds Ratio race black other smoke ui _cons = = = = 189 18.80 0.0009 0.0801 Std. Err. z P>|z| [95% Conf. Interval] 3.052746 2.922593 1.498087 1.189229 2.27 2.64 0.023 0.008 1.166747 1.316457 7.987382 6.488285 2.945742 2.419131 .1402209 1.101838 1.047359 .0512295 2.89 2.04 -5.38 0.004 0.041 0.000 1.415167 1.035459 .0685216 6.131715 5.651788 .2869447 . 
estat gof, table Logistic model for low, goodness-of-fit test Group Prob Obs_1 Exp_1 Obs_0 Exp_0 Total 1 2 3 4 5 0.1230 0.2533 0.2907 0.2923 0.2997 3 1 16 15 3 4.9 1.0 13.7 12.6 3.9 37 3 31 28 10 35.1 3.0 33.3 30.4 9.1 40 4 47 43 13 6 7 8 9 10 0.4978 0.4998 0.5087 0.5469 0.5577 4 4 2 2 6 4.0 4.5 1.5 4.4 5.6 4 5 1 6 4 4.0 4.5 1.5 3.6 4.4 8 9 3 8 10 11 0.7449 3 3.0 1 1.0 4 Group Prob race smoke ui 1 2 3 4 5 0.1230 0.2533 0.2907 0.2923 0.2997 white white other white black nonsmoker nonsmoker nonsmoker smoker nonsmoker 0 1 0 0 0 6 7 8 9 10 0.4978 0.4998 0.5087 0.5469 0.5577 other white black other black nonsmoker smoker nonsmoker smoker smoker 1 1 1 0 0 11 0.7449 other smoker 1 number of observations number of covariate patterns Pearson chi2(6) Prob > chi2 = = = = 189 11 5.71 0.4569 497 498 estat gof — Pearson or Hosmer–Lemeshow goodness-of-fit test Technical note logistic, logit, or probit and estat gof keep track of the estimation sample. If you type, for instance, logistic . . . if x==1, then when you type estat gof, the statistics will be calculated on the x==1 subsample of the data automatically. You should specify if or in with estat gof only when you wish to calculate statistics for a set of observations other than the estimation sample. See Samples other than the estimation sample later in this entry. If the logistic model was fit with fweights, estat gof properly accounts for the weights in its calculations. (estat gof only allows fweights.) You do not have to specify the weights when you run estat gof. Weights should be specified with estat gof only when you wish to use a different set of weights. Samples other than the estimation sample estat gof can be used with samples other than the estimation sample. By default, estat gof remembers the estimation sample used with the last logistic, logit, or probit command. To override this, simply use an if or in restriction to select another set of observations, or specify the all option to force the command to use all the observations in the dataset. If you use estat gof with a sample that is completely different from the estimation sample (that is, no overlap), you should also specify the outsample option so that the χ2 statistic properly adjusts the degrees of freedom upward. For an overlapping sample, the conservative thing to do is to leave the degrees of freedom the same as they are for the estimation sample. Example 3 We want to develop a model for predicting low-birthweight babies. One approach would be to divide our data into two groups, a developmental sample and a validation sample. See Lemeshow and Gall (1994) and Tilford, Roberson, and Fiser (1995) for more information on developing prediction models and severity-scoring systems. We will do this with the low-birthweight data that we considered previously. First, we randomly divide the data into two samples. . use http://www.stata-press.com/data/r13/lbw, clear (Hosmer & Lemeshow data) . set seed 1 . generate r = runiform() . sort r . generate group = 1 if _n <= _N/2 (95 missing values generated) . replace group = 2 if group==. (95 real changes made) Then we fit a model using the first sample (group = 1), which is our developmental sample. estat gof — Pearson or Hosmer–Lemeshow goodness-of-fit test . logistic low age lwt i.race smoke ptl ht ui if group==1 Logistic regression Number of obs LR chi2(8) Prob > chi2 Log likelihood = -44.293342 Pseudo R2 low Odds Ratio Std. Err. 
age lwt .91542 .9744276 .0553937 .0112295 race black other 5.063678 2.606209 smoke ptl ht ui _cons .909912 3.033543 21.07656 .988479 30.73641 z = = = = 499 94 29.14 0.0003 0.2475 P>|z| [95% Conf. Interval] -1.46 -2.25 0.144 0.025 .8130414 .9526649 1.03069 .9966874 3.78442 1.657608 2.17 1.51 0.030 0.132 1.170327 .7492483 21.90913 9.065522 .5252898 1.507048 22.64788 .6699458 56.82168 -0.16 2.23 2.84 -0.02 1.85 0.870 0.025 0.005 0.986 0.064 .2934966 1.145718 2.565304 .2618557 .8204589 2.820953 8.03198 173.1652 3.731409 1151.462 To test calibration in the developmental sample, we calculate the Hosmer – Lemeshow goodness-of-fit test by using estat gof. . estat gof, group(10) Logistic model for low, goodness-of-fit test (Table collapsed on quantiles of estimated probabilities) number of observations = 94 number of groups = 10 Hosmer-Lemeshow chi2(8) = 6.67 Prob > chi2 = 0.5721 We did not specify an if statement with estat gof because we wanted to use the estimation sample. Because the test is not significant, we are satisfied with the fit of our model. Running lroc (see [R] lroc) gives a measure of the discrimination: . lroc, nograph Logistic model for low number of observations = area under ROC curve = 94 0.8156 Now we test the calibration of our model by performing a goodness-of-fit test on the validation sample. We specify the outsample option so that the number of degrees of freedom is 10 rather than 8. 500 estat gof — Pearson or Hosmer–Lemeshow goodness-of-fit test . estat gof if group==2, group(10) table outsample Logistic model for low, goodness-of-fit test (Table collapsed on quantiles of estimated probabilities) Group Prob Obs_1 Exp_1 Obs_0 Exp_0 Total 1 2 3 4 5 0.0725 0.1202 0.1549 0.1888 0.2609 1 4 3 1 3 0.4 0.8 1.3 1.5 2.2 9 5 7 8 7 9.6 8.2 8.7 7.5 7.8 10 9 10 9 10 6 7 8 9 10 0.3258 0.4217 0.4915 0.6265 0.9737 4 2 3 4 4 2.7 3.7 4.1 5.5 7.1 5 8 6 6 5 6.3 6.3 4.9 4.5 1.9 9 10 9 10 9 number of observations number of groups Hosmer-Lemeshow chi2(10) Prob > chi2 = = = = 95 10 28.03 0.0018 We must acknowledge that our model does not fit well on the validation sample. The model’s discrimination in the validation sample is appreciably lower, as well. . lroc if group==2, nograph Logistic model for low number of observations = 95 area under ROC curve = 0.5839 Stored results estat gof stores the following in r(): Scalars r(N) r(m) r(df) r(chi2) number of observations number of covariate patterns or groups degrees of freedom χ2 Methods and formulas Let M be the total number of covariate patterns among the N observations. View the data as collapsed on covariate patterns j = 1, 2, . . . , M , and define mj as the total number of observations having covariate pattern j and yj as the total number of positive responses among observations with covariate pattern j . Define pj as the predicted probability of a positive outcome in covariate pattern j. The Pearson χ2 goodness-of-fit statistic is χ2 = M X (yj − mj pj )2 mj pj (1 − pj ) j=1 estat gof — Pearson or Hosmer–Lemeshow goodness-of-fit test 501 This χ2 statistic has approximately M − k degrees of freedom for the estimation sample, where k is the number of independent variables, including the constant. For a sample outside the estimation sample, the statistic has M degrees of freedom. 
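For readers who want to see the pieces of this calculation, the statistic can be reproduced by hand after logistic by building the covariate patterns explicitly. The do-file sketch below is illustrative rather than a description of estat gof's internal code; it assumes the unweighted model from example 1 is the most recent fit, and the variable names p, pattern, m, ynum, and term are arbitrary.

    predict double p                            // p_j, the predicted probability
    predict pattern, number                     // covariate-pattern identifier
    egen m = count(low), by(pattern)            // m_j, observations in pattern j
    egen ynum = total(low != 0), by(pattern)    // y_j, positive responses in pattern j
    bysort pattern: generate double term = (ynum - m*p)^2/(m*p*(1 - p)) if _n == 1
    summarize term, meanonly
    display "Pearson chi-squared = " r(sum)

The displayed total should agree with the Pearson chi2(173) = 179.24 reported in example 1.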
The Hosmer – Lemeshow goodness-of-fit χ2 (Hosmer and Lemeshow 1980; Lemeshow and Hosmer 1982; Hosmer, Lemeshow, and Klar 1988) is calculated similarly, except that rather than using the M covariate patterns as the group definition, the quantiles of the predicted probabilities are used to form groups. Let G = # be the number of quantiles requested with group(#). The smallest index 1 ≤ q(i) ≤ M , such that q(i) X N Wq(i) = mj ≥ G j=1 gives pq(i) as the upper boundary of the ith quantile for i = 1, 2, . . . , G. Let q(0) = 1 denote the first index. The groups are then [ pq(0) , pq(1) ], ( pq(1) , pq(2) ], . . . , ( pq(G−1) , pq(G) ] If the table option is given, the upper boundaries pq(1) , . . . , pq(G) of the groups appear next to the group number on the output. The resulting χ2 statistic has approximately G − 2 degrees of freedom for the estimation sample. For a sample outside the estimation sample, the statistic has G degrees of freedom. References Archer, K. J., and S. A. Lemeshow. 2006. Goodness-of-fit test for a logistic regression model fitted using survey sample data. Stata Journal 6: 97–105. Fagerland, M. W., and D. W. Hosmer, Jr. 2012. A generalized HosmerLemeshow goodness-of-fit test for multinomial logistic regression models. Stata Journal 12: 447–453. Hosmer, D. W., Jr., and S. A. Lemeshow. 1980. Goodness of fit tests for the multiple logistic regression model. Communications in Statistics—Theory and Methods 9: 1043–1069. Hosmer, D. W., Jr., S. A. Lemeshow, and J. Klar. 1988. Goodness-of-fit testing for the logistic regression model when the estimated probabilities are small. Biometrical Journal 30: 911–924. Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley. Lemeshow, S. A., and J.-R. L. Gall. 1994. Modeling the severity of illness of ICU patients: A systems update. Journal of the American Medical Association 272: 1049–1055. Lemeshow, S. A., and D. W. Hosmer, Jr. 1982. A review of goodness of fit statistics for the use in the development of logistic regression models. American Journal of Epidemiology 115: 92–106. Tilford, J. M., P. K. Roberson, and D. H. Fiser. 1995. sbe12: Using lfit and lroc to evaluate mortality prediction models. Stata Technical Bulletin 28: 14–18. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 77–81. College Station, TX: Stata Press. 502 estat gof — Pearson or Hosmer–Lemeshow goodness-of-fit test Also see [R] logistic — Logistic regression, reporting odds ratios [R] logit — Logistic regression, reporting coefficients [R] probit — Probit regression [R] estat classification — Classification statistics and table [R] lroc — Compute area under ROC curve and graph the curve [R] lsens — Graph sensitivity and specificity versus probability cutoff [U] 20 Estimation and postestimation commands Title estat ic — Display information criteria Syntax Remarks and examples Also see Menu for estat Stored results Description Methods and formulas Option References Syntax estat ic , n(#) Menu for estat Statistics > Postestimation > Reports and statistics Description estat ic displays Akaike’s and Schwarz’s Bayesian information criteria. Option n(#) specifies the N to be used in calculating BIC; see [R] BIC note. Remarks and examples estat ic calculates two information criteria used to compare models. Unlike likelihood-ratio, Wald, and similar testing procedures, the models need not be nested to compare the information criteria. 
Because they are based on the log-likelihood function, information criteria are available only after commands that report the log likelihood. In general, “smaller is better”: given two models, the one with the smaller AIC fits the data better than the one with the larger AIC. As with the AIC, a smaller BIC indicates a better-fitting model. For AIC and BIC formulas, see Methods and formulas. Example 1 In [R] mlogit, we fit a model explaining the type of insurance a person has on the basis of age, gender, race, and site of study. Here we refit the model with and without the site dummies and compare the models. 503 504 estat ic — Display information criteria . use http://www.stata-press.com/data/r13/sysdsn1 (Health insurance data) . mlogit insure age male nonwhite (output omitted ) . estat ic Akaike’s information criterion and Bayesian information criterion Model Obs ll(null) ll(model) df AIC BIC . 615 -555.8545 -545.5833 8 1107.167 1142.54 Note: N=Obs used in calculating BIC; see [R] BIC note . mlogit insure age male nonwhite i.site (output omitted ) . estat ic Akaike’s information criterion and Bayesian information criterion Model Obs ll(null) ll(model) df AIC BIC . 615 -555.8545 -534.3616 12 1092.723 1145.783 Note: N=Obs used in calculating BIC; see [R] BIC note The AIC indicates that the model including the site dummies fits the data better, whereas the BIC indicates the opposite. As is often the case, different model-selection criteria have led to conflicting conclusions. Technical note glm and binreg, ml report a slightly different version of AIC and BIC; see [R] glm for the formulas used. That version is commonly used within the GLM literature; see, for example, Hardin and Hilbe (2012). The literature on information criteria is vast; see, among others, Akaike (1973), Sawa (1978), and Raftery (1995). Judge et al. (1985) contains a discussion of using information criteria in econometrics. Royston and Sauerbrei (2008, chap. 2) examine the use of information criteria as an alternative to stepwise procedures for selecting model variables. Stored results estat ic stores the following in r(): Matrices r(S) 1 × 1. 2. 3. 4. 5. 6. 6 matrix of results: sample size log likelihood of null model log likelihood of full model degrees of freedom AIC BIC estat ic — Display information criteria 505 Methods and formulas Akaike’s (1974) information criterion is defined as AIC = −2 lnL + 2k where lnL is the maximized log-likelihood of the model and k is the number of parameters estimated. Some authors define the AIC as the expression above divided by the sample size. Schwarz’s (1978) Bayesian information criterion is another measure of fit defined as BIC = −2 lnL + k lnN where N is the sample size. See [R] BIC note for additional information on calculating and interpreting BIC. Hirotugu Akaike (1927–2009) was born in Fujinomiya City, Shizuoka Prefecture, Japan. He was the son of a silkworm farmer. He gained BA and DSc degrees from the University of Tokyo. Akaike’s career from 1952 at the Institute of Statistical Mathematics in Japan culminated in service as Director General; after 1994, he was Professor Emeritus. His best known work in a prolific career is on what is now known as the Akaike information criterion (AIC), which was formulated to help selection of the most appropriate model from a number of candidates. Gideon E. Schwarz (1933–2007) was a professor of Statistics at the Hebrew University, Jerusalem. 
He was born in Salzburg, Austria, and obtained an MSc in 1956 from the Hebrew University and a PhD in 1961 from Columbia University. His interests included stochastic processes, sequential analysis, probability, and geometry. He is best known for the Bayesian information criterion (BIC). References Akaike, H. 1973. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, ed. B. N. Petrov and F. Csaki, 267–281. Budapest: Akailseoniai–Kiudo. . 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19: 716–723. Findley, D. F., and E. Parzen. 1995. A conversation with Hirotugu Akaike. Statistical Science 10: 104–117. Hardin, J. W., and J. M. Hilbe. 2012. Generalized Linear Models and Extensions. 3rd ed. College Station, TX: Stata Press. Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2nd ed. New York: Wiley. Raftery, A. E. 1995. Bayesian model selection in social research. In Vol. 25 of Sociological Methodology, ed. P. V. Marsden, 111–163. Oxford: Blackwell. Royston, P., and W. Sauerbrei. 2008. Multivariable Model-building: A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modelling Continuous Variables. Chichester, UK: Wiley. Sawa, T. 1978. Information criteria for discriminating among alternative regression models. Econometrica 46: 1273– 1291. Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics 6: 461–464. Tong, H. 2010. Professor Hirotugu Akaike, 1927–2009. Journal of the Royal Statistical Society, Series A 173: 451–454. 506 estat ic — Display information criteria Also see [R] estat — Postestimation statistics [R] estat summarize — Summarize estimation sample [R] estat vce — Display covariance matrix estimates Title estat summarize — Summarize estimation sample Syntax Remarks and examples Menu for estat Stored results Description Also see Options Syntax estat summarize eqlist , estat summ options estat summ options Description equation group labels noheader noweights display options display summary by equation display summary by group; only after sem display variable labels suppress the header ignore weights control row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling eqlist is rarely used and specifies the variables, with optional equation name, to be summarized. eqlist may be varlist or (eqname1 : varlist) (eqname2 : varlist) . . . . varlist may contain time-series operators; see [U] 11.4.4 Time-series varlists. Menu for estat Statistics > Postestimation > Reports and statistics Description estat summarize summarizes the variables used by the command and automatically restricts the sample to e(sample); it also summarizes the weight variable and cluster structure, if specified. Options equation requests that the dependent variables and the independent variables in the equations be displayed in the equation-style format of estimation commands, repeating the summary information about variables entered in more than one equation. group displays summary information separately for each group. group is only allowed after sem with a group() variable specified. labels displays variable labels. noheader suppresses the header. noweights ignores the weights, if any, from the previous estimation command. 
The default when weights are present is to perform a weighted summarize on all variables except the weight variable itself. An unweighted summarize is performed on the weight variable. 507 508 estat summarize — Summarize estimation sample display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), and fvwrapon(style); see [R] estimation options. Remarks and examples Often when fitting a model, you will also be interested in obtaining summary statistics, such as the sample means and standard deviations of the variables in the model. estat summarize makes this process simple. The output displayed is similar to that obtained by typing . summarize varlist if e(sample) without the need to type the varlist containing the dependent and independent variables. Example 1 Continuing with the example in [R] estat ic, here we summarize the variables by using estat summarize. . use http://www.stata-press.com/data/r13/sysdsn1 (Health insurance data) . mlogit insure age male nonwhite i.site (output omitted ) . estat summarize, noomitted Estimation sample mlogit Number of obs = Variable Mean insure 1.596748 age male nonwhite site 2 3 Std. Dev. 615 Min Max .6225846 1 3 44.46832 .2504065 .196748 14.18523 .4335998 .3978638 18.1109 0 0 86.0725 1 1 .3707317 .3138211 .4833939 .4644224 0 0 1 1 The output in the previous example contains all the variables in one table, though mlogit presents its results in a multiple-equation format. For models in which the same variables appear in all equations, that is fine; but for other multiple-equation models, we may prefer to have the variables separated by the equation in which they appear. The equation option makes this possible. Example 2 Systems of simultaneous equations typically have different variables in each equation, and the equation option of estat summarize is helpful in such situations. In example 2 of [R] reg3, we have a model of supply and demand. We first refit the model and then call estat summarize. estat summarize — Summarize estimation sample 509 . use http://www.stata-press.com/data/r13/supDem . reg3 (Demand:quantity price pcompete income) (Supply:quantity price praw), > endog(price) (output omitted ) . estat summarize, equation Estimation sample reg3 Variable Mean depvar quantity quantity 12.61818 12.61818 Number of obs = Std. Dev. 49 Min Max 2.774952 2.774952 7.71069 7.71069 20.0477 20.0477 32.70944 5.929975 7.811735 2.882684 3.508264 4.18859 26.3819 .207647 .570417 38.4769 11.5549 14.0077 32.70944 4.740891 2.882684 2.962565 26.3819 .151028 38.4769 9.79881 demale price pcompete income Supply price praw The first block of the table contains statistics on the dependent (or, more accurately, left-hand-side) variables, and because we specified quantity as the left-hand-side variable in both equations, it is listed twice. The second block refers to the variables in the first equation we specified, which we labeled “Demand” in our call to reg3; and the final block refers to the supply equation. 
Stored results estat summarize stores the following in r(): Scalars r(N groups) number of groups (group only) Matrices r(stats) k × 4 matrix of means, standard deviations, minimums, and maximums r(stats # ) k × 4 matrix of means, standard deviations, minimums, and maximums for group # (group only) Also see [R] estat — Postestimation statistics [R] estat ic — Display information criteria [R] estat vce — Display covariance matrix estimates Title estat vce — Display covariance matrix estimates Syntax Remarks and examples Menu for estat Stored results Description Reference Options Also see Syntax estat vce , estat vce options estat vce options Description covariance correlation equation(spec) block diag format(% fmt) nolines display options display as covariance matrix; the default display as correlation matrix display only specified equations display submatrices by equation display submatrices by equation; diagonal blocks only display format for covariances and correlations suppress lines between equations control display of omitted variables and base and empty cells Menu for estat Statistics > Postestimation > Reports and statistics Description estat vce displays the covariance or correlation matrix of the parameter estimates of the previous model. Options covariance displays the matrix as a variance–covariance matrix; this is the default. correlation displays the matrix as a correlation matrix rather than a variance–covariance matrix. rho is a synonym. equation(spec) selects part of the VCE to be displayed. If spec is eqlist, the VCE for the listed equations is displayed. If spec is eqlist1 \ eqlist2, the part of the VCE associated with the equations in eqlist1 (rowwise) and eqlist2 (columnwise) is displayed. If spec is *, all equations are displayed. equation() implies block if diag is not specified. block displays the submatrices pertaining to distinct equations separately. diag displays the diagonal submatrices pertaining to distinct equations separately. format(% fmt) specifies the number format for displaying the elements of the matrix. The default is format(%10.0g) for covariances and format(%8.4f) for correlations. See [U] 12.5 Formats: Controlling how data are displayed for more information. 510 estat vce — Display covariance matrix estimates 511 nolines suppresses lines between equations. display options: noomitted, noemptycells, baselevels, allbaselevels; see [R] estimation options. Remarks and examples estat vce allows you to display the VCE of the parameters of the previously fit model, as either a covariance matrix or a correlation matrix. Example 1 Returning to the example in [R] estat ic, here we display the covariance matrix of the parameters of the mlogit model by using estat vce. . use http://www.stata-press.com/data/r13/sysdsn1 (Health insurance data) . mlogit insure age male nonwhite (output omitted ) . estat vce, block Covariance matrix of coefficients of mlogit model covariances of equation Indemnity o. o. o. age male nonwhite o. _cons o.age 0 o.male 0 0 o.nonwhite 0 0 0 o._cons 0 0 0 0 covariances of equation Prepaid (row) by equation Indemnity (column) o. o. o. o. age male nonwhite _cons age 0 0 male nonwhite 0 _cons 0 covariances of equation Prepaid age male nonwhite _cons 0 0 0 0 0 0 age male nonwhite _cons .00003711 -.00015303 -.00008948 -.00159095 .0402091 .00470608 -.00398961 .04795135 -.00628886 .08000462 covariances of equation Uninsure (row) by equation Indemnity (column) o. o. o. o. 
age male nonwhite _cons age male nonwhite _cons 0 0 0 0 0 0 0 0 0 0 512 estat vce — Display covariance matrix estimates covariances of equation Uninsure (row) by equation Prepaid (column) age male nonwhite _cons age .00001753 -.00007926 -.00007544 .02188398 male nonwhite -.00004577 .00250588 _cons -.00077045 -.00130535 covariances of equation Uninsure age male age male nonwhite _cons .00013022 -.00050406 -.00026145 -.00562159 .13248095 .01505449 -.01686629 -.00004564 .0023186 .02813553 -.00257593 -.00076886 -.00145923 -.00263872 .03888032 nonwhite _cons .16861327 -.02474852 .28607591 The block option is particularly useful for multiple-equation estimators. The first block of output here corresponds to the VCE of the estimated parameters for the first equation—the square roots of the diagonal elements of this matrix are equal to the standard errors of the first equation’s parameters. Similarly, the final block corresponds to the VCE of the parameters for the second equation. The middle block shows the covariances between the estimated parameters of the first and second equations. Stored results estat vce stores the following in r(): Matrices r(V) VCE or correlation matrix Reference Hamilton, L. C. 2013. Statistics with Stata: Updated for Version 12. 8th ed. Boston: Brooks/Cole. Also see [R] estat — Postestimation statistics [R] estat ic — Display information criteria [R] estat summarize — Summarize estimation sample Title estimates — Save and manipulate estimation results Syntax Description Remarks and examples Also see Syntax Command Reference Save and use results from disk [R] estimates save [R] estimates save estimates save filename estimates use filename estimates describe using filename [R] estimates describe estimates esample: . . . [R] estimates save Store and restore estimates in memory estimates store name estimates restore name [R] estimates store [R] estimates store estimates query estimates dir [R] estimates store [R] estimates store estimates drop namelist estimates clear [R] estimates store [R] estimates store Set titles and notes [R] estimates title [R] estimates title estimates title: text estimates title estimates estimates estimates estimates [R] [R] [R] [R] notes: text notes notes list . . . notes drop . . . estimates estimates estimates estimates notes notes notes notes Report estimates describe name estimates replay namelist [R] estimates describe [R] estimates replay Tables and statistics estimates table namelist estimates stats namelist estimates for namelist: . . . 513 [R] estimates table [R] estimates stats [R] estimates for 514 estimates — Save and manipulate estimation results Description estimates allows you to store and manipulate estimation results: • You can save estimation results in a file for use in later sessions. • You can store estimation results in memory so that you can a. switch among separate estimation results and b. form tables combining separate estimation results. Remarks and examples estimates is for use after you have fit a model, be it with regress, logistic, etc. You can use estimates after any estimation command, whether it be an official estimation command of Stata or a user-written one. estimates has three separate but related capabilities: 1. You can save estimation results in a file on disk so that you can use them later, even in a different Stata session. 2. You can store up to 300 estimation results in memory so that they are at your fingertips. 3. You can make tables comparing any results you have stored in memory. 
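As a quick sketch of the first capability, several sets of results can also be accumulated in a single .ster file and then retrieved by position; the filename mymodels below is hypothetical, and the append and number() options used here are documented in [R] estimates save and [R] estimates describe:

. regress mpg weight displ
(output omitted)
. estimates save mymodels
. regress mpg weight displ foreign
(output omitted)
. estimates save mymodels, append
. estimates describe using mymodels, number(2)
(output omitted)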
Remarks are presented under the following headings:

    Saving and using estimation results
    Storing and restoring estimation results
    Comparing estimation results
    Jargon

Saving and using estimation results

After you have fit a model, say, with regress, type

. use http://www.stata-press.com/data/r13/auto
(1978 Automobile Data)
. regress mpg weight displ foreign
(output omitted)

You can save the results in a file:

. estimates save basemodel
(file basemodel.ster saved)

Later, say, in a different session, you can reload those results:

. estimates use basemodel

The situation is now nearly identical to what it was immediately after you fit the model. You can replay estimation results:

. regress
(output omitted)

You can perform tests:

. test foreign==0
(output omitted)

And you can use any postestimation command or postestimation capability of Stata. The only difference is that Stata no longer knows what the estimation sample, e(sample) in Stata jargon, was. When you reload the estimation results, you might not even have the original data in memory. That is okay. Stata will know to refuse to calculate anything that can be calculated only on the original estimation sample.

If it is important that you use a postestimation command that can be used only on the original estimation sample, there is a way you can do that. You use the original data and then use estimates esample: to tell Stata what the original sample was. See [R] estimates save for details.

Storing and restoring estimation results

Storing and restoring estimation results in memory is much like saving them to disk. You type

. estimates store base

to save the current estimation results under the name base, and you type

. estimates restore base

to get them back later. You can find out what you have stored by typing

. estimates dir

Saving estimation results to disk is more permanent than storing them in memory, so why would you want merely to store them? The answer is that, once they are stored, you can use other estimates commands to produce tables and reports from them. See [R] estimates store for details about the estimates store and restore commands.

Comparing estimation results

Let's say that you have done the following:

. use http://www.stata-press.com/data/r13/auto
(1978 Automobile Data)
. regress mpg weight displ
(output omitted)
. estimates store base
. regress mpg weight displ foreign
(output omitted)
. estimates store alt

You can now get a table comparing the coefficients:

. estimates table base alt

    Variable          base           alt

      weight     -.00656711    -.00677449
displacement      .00528078     .00192865
     foreign                   -1.6006312
       _cons      40.084522     41.847949

estimates table can do much more; see [R] estimates table. Also see [R] estimates stats. estimates stats works similarly to estimates table but produces model comparisons in terms of BIC and AIC.

Jargon

You know that if you fit a model, say, by typing

. regress mpg weight displacement

then you can later replay the results by typing

. regress

and you can do tests and calculate other postestimation statistics by typing

. test displacement==0
. estat vif
. predict mpghat

As a result, we often refer to the estimation results or the current estimation results or the most recent estimation results or the last estimation results or the estimation results in memory. With estimates store and estimates restore, you can have many estimation results in memory.
One set of those, the set most recently estimated, or the set most recently restored, are the current or active estimation results, which you can replay, which you can test, or from which you can calculate postestimation statistics. Current and active are the two words we will use interchangeably from now on. Also see [P] estimates — Manage estimation results Title estimates describe — Describe estimation results Syntax Remarks and examples Menu Stored results Description Also see Option Syntax estimates describe estimates describe name estimates describe using filename , number(#) Menu Statistics > Postestimation > Manage estimation results > Describe results Description estimates describe describes the current (active) estimates. Reported are the command line that produced the estimates, any title that was set by estimates title (see [R] estimates title), and any notes that were added by estimates notes (see [R] estimates notes). estimates describe name does the same but reports results for estimates stored by estimates store (see [R] estimates store). estimates describe using filename does the same but reports results for estimates saved by estimates save (see [R] estimates save). If filename contains multiple sets of estimates (saved in it by estimates save, append), the number of sets of estimates is also reported. If filename is specified without an extension, .ster is assumed. Option number(#) specifies that the #th set of estimation results from filename be described. This assumes that multiple sets of estimation results have been saved in filename by estimates save, append. The default is number(1). Remarks and examples estimates describe can be used to describe the estimation results currently in memory, . estimates describe Estimation results produced by . regress mpg weight displ if foreign 517 518 estimates describe — Describe estimation results or to describe results saved by estimates save in a .ster file: . estimates describe using final Estimation results "Final results" saved on 12apr2013 14:20, produced by . logistic myopic age sex drug1 drug2 if complete==1 Notes: 1. Used file patient.dta 2. "datasignature myopic age sex drug1 drug2 if complete==1" reports 148:5(58763):2252897466:3722318443 3. must be reviewed by rgg Example 1 . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress mpg weight displ if foreign (output omitted ) . estimates notes: file ‘c(filename)’ . datasignature 74:12(71728):3831085005:1395876116 . estimates notes: datasignature report ‘r(datasignature)’ . estimates save foreign file foreign.ster saved . regress mpg weight displ if !foreign (output omitted ) . estimates describe using foreign Estimation results saved on 02may2013 10:33, produced by . regress mpg weight displ if foreign Notes: 1. file http://www.stata-press.com/data/r13/auto.dta 2. 
datasignature report 74:12(71728):3831085005:1395876116 Stored results estimates describe and estimates describe name store the following in r(): Macros r(title) r(cmdline) title original command line estimates describe using filename stores the above and the following in r(): Scalars r(datetime) %tc value of date/time file saved r(nestresults) number of sets of estimation results in file Also see [R] estimates — Save and manipulate estimation results Title estimates for — Repeat postestimation command across models Syntax Description Options Remarks and examples Also see Syntax estimates for namelist , options : postestimation command where namelist is a name, a list of names, all, or *. A name may be ., meaning the current (active) estimates. all and * mean the same thing. options Description noheader nostop do not display title do not stop if command fails Description estimates for performs postestimation command on each estimation result specified. Options noheader suppresses the display of the header as postestimation command is executed each time. nostop specifies that execution of postestimation command is to be performed on the remaining models even if it fails on some. Remarks and examples In the example that follows, we fit a model two different ways, store the results, and then use estimates for to perform the same test on both of them: Example 1 . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . generate gpm = 1/mpg . regress gpm i.foreign i.foreign#c.weight displ (output omitted ) . estimates store reg . qreg gpm i.foreign i.foreign#c.weight displ (output omitted ) . estimates store qreg 519 520 estimates for — Repeat postestimation command across models . estimates for reg qreg: test 0.foreign#c.weight==1.foreign#c.weight Model reg ( 1) 0b.foreign#c.weight - 1.foreign#c.weight = 0 F( 1, 69) = 4.87 Prob > F = 0.0307 Model qreg ( 1) 0b.foreign#c.weight - 1.foreign#c.weight = 0 F( 1, 69) = Prob > F = 0.03 0.8554 Also see [R] estimates — Save and manipulate estimation results Title estimates notes — Add notes to estimation results Syntax Description Remarks and examples Also see Syntax estimates notes: text estimates notes estimates notes list in noterange estimates notes drop in noterange where noterange is # or #/# and where # may be a number, the letter f (meaning first), or the letter l (meaning last). Description estimates notes: text adds a note to the current (active) estimation results. estimates notes and estimates notes list list the current notes. estimates notes drop in noterange eliminates the specified notes. Remarks and examples After adding or removing notes, if estimates have been stored, do not forget to store them again. If estimates have been saved, do not forget to save them again. Notes are most useful when you intend to save estimation results in a file; see [R] estimates save. For instance, after fitting a model, you might type . estimates note: I think these are final . estimates save lock2 and then, later when going through your files, you could type . estimates use lock2 . estimates notes 1. I think these are final Up to 9,999 notes can be attached to estimation results. If estimation results are important, we recommend that you add a note identifying the .dta dataset you used. The best way to do that is to type . estimates notes: file ‘c(filename)’ 521 522 estimates notes — Add notes to estimation results because ‘c(filename)’ will expand to include not just the name of the file but also its full path; see [P] creturn. 
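The letters f and l in a noterange refer to the first and last notes, which is convenient once several notes have accumulated. The following sketch is purely illustrative, and the note text is made up:

. estimates notes: I think these are final
. estimates notes: file `c(filename)'
. estimates notes list in f/l
(output omitted)
. estimates notes drop in l
(1 note dropped)
. estimates notes
(output omitted)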
If estimation results took a long time to estimate—say, they were produced by asmprobit or gllamm (see [R] asmprobit and http://www.gllamm.org)—it is also a good idea to add a data signature. A data signature takes less time to compute than reestimation when you need proof that you really have the right dataset. The easy way to do that is to type . datasignature 74:12(71728):3831085005:1395876116 . estimates notes: datasignature reports ‘r(datasignature)’ Now when you ask to see the notes, you will see . estimates notes 1. I think these are final 2. file C:\project\one\pat4.dta 3. datasignature reports 74:12(71728):3831085005:1395876116 See [D] datasignature. Notes need not be positive. You might set a note to be, “I need to check that age is defined correctly.” Example 1 . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress mpg weight displ if foreign (output omitted ) . estimates notes: file ‘c(filename)’ . datasignature 74:12(71728):3831085005:1395876116 . estimates notes: datasignature report ‘r(datasignature)’ . estimates save foreign file foreign.ster saved . estimates notes list in 1/2 1. file http://www.stata-press.com/data/r13/auto.dta 2. datasignature report 74:12(71728):3831085005:1395876116 . estimates notes drop in 2 (1 note dropped) . estimates notes 1. file http://www.stata-press.com/data/r13/auto.dta Also see [R] estimates — Save and manipulate estimation results Title estimates replay — Redisplay estimation results Syntax Menu Description Remarks and examples Also see Syntax estimates replay estimates replay namelist where namelist is a name, a list of names, all, or *. A name may be ., meaning the current (active) estimates. all and * mean the same thing. Menu Statistics > Postestimation > Manage estimation results > Redisplay estimation output Description estimates replay redisplays the current (active) estimation results, just as typing the name of the estimation command would do. estimates replay namelist redisplays each specified estimation result. The active estimation results are left unchanged. Remarks and examples In the example that follows, we fit a model two different ways, store the results, use estimates for to perform the same test on both of them, and then replay the results: Example 1 . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . generate gpm = 1/mpg . regress gpm i.foreign i.foreign#c.weight displ (output omitted ) . estimates store reg . qreg gpm i.foreign i.foreign#c.weight displ (output omitted ) . estimates store qreg 523 524 estimates replay — Redisplay estimation results . estimates for reg qreg: test 0.foreign#c.weight==1.foreign#c.weight Model reg ( 1) 0b.foreign#c.weight - 1.foreign#c.weight = 0 F( 1, 69) = 4.87 Prob > F = 0.0307 Model qreg ( 1) 0b.foreign#c.weight - 1.foreign#c.weight = 0 F( 1, 69) = Prob > F = . estimates replay 0.03 0.8554 Model qreg Median regression Raw sum of deviations .7555689 (about .05) Min sum of deviations .3201479 gpm Coef. foreign Foreign foreign# c.weight Domestic Foreign displacement _cons Number of obs = Pseudo R2 = 74 0.5763 Std. Err. t P>|t| [95% Conf. Interval] .0065352 .0109777 0.60 0.554 -.0153647 .0284351 .0000147 .0000155 2.93e-06 4.17e-06 5.00 3.71 0.000 0.000 8.81e-06 7.16e-06 .0000205 .0000238 .0000179 .0003134 .0000239 .0059612 0.75 0.05 0.457 0.958 -.0000298 -.0115789 .0000656 .0122056 estimates replay — Redisplay estimation results . 
estimates replay reg Model reg Source SS df MS Model Residual .009342436 .002615192 4 69 .002335609 .000037901 Total .011957628 73 .000163803 t 74 61.62 0.0000 0.7813 0.7686 .00616 Coef. foreign Foreign -.0117756 .0086088 -1.37 0.176 -.0289497 .0053986 foreign# c.weight Domestic Foreign .0000123 .00002 2.30e-06 3.27e-06 5.36 6.12 0.000 0.000 7.75e-06 .0000135 .0000169 .0000265 .0000296 .0053352 .0000187 .0046748 1.58 1.14 0.119 0.258 -7.81e-06 -.0039909 .000067 .0146612 Also see [R] estimates — Save and manipulate estimation results P>|t| = = = = = = gpm displacement _cons Std. Err. Number of obs F( 4, 69) Prob > F R-squared Adj R-squared Root MSE [95% Conf. Interval] 525 Title estimates save — Save and use estimation results Syntax Remarks and examples Menu Stored results Description Also see Options Syntax estimates save filename estimates use filename , append replace , number(#) if in weight estimates esample: varlist , replace stringvars(varlist) zeroweight estimates esample Menu estimates save Statistics > Postestimation > Manage estimation results > Save to disk > Manage estimation results > Load from disk estimates use Statistics > Postestimation Description estimates save filename saves the current (active) estimation results in filename. estimates use filename loads the results saved in filename into the current (active) estimation results. In both cases, if filename is specified without an extension, .ster is assumed. estimates esample: (note the colon) resets e(sample). After estimates use filename, e(sample) is set to contain 0, meaning that none of the observations currently in memory was used in obtaining the estimates. estimates esample (without a colon) displays how e(sample) is currently set. Options append, used with estimates save, specifies that results be appended to an existing file. If the file does not already exist, a new file is created. replace, used with estimates save, specifies that filename can be replaced if it already exists. 526 estimates save — Save and use estimation results 527 number(#), used with estimates use, specifies that the #th set of estimation results from filename be loaded. This assumes that multiple sets of estimation results have been saved in filename by estimates save, append. The default is number(1). replace, used with estimates esample:, specifies that e(sample) can be replaced even if it is already set. stringvars(varlist), used with estimates esample:, specifies string variables. Observations containing variables that contain "" will be omitted from e(sample). zeroweight, used with estimates esample:, specifies that observations with zero weights are to be included in e(sample). Remarks and examples See [R] estimates for an overview of the estimates commands. For a description of estimates save and estimates use, see Saving and using estimation results in [R] estimates. The rest of this entry concerns e(sample). Remarks are presented under the following headings: Setting e(sample) Resetting e(sample) Determining who set e(sample) Setting e(sample) After estimates use filename, the situation is nearly identical to what it was immediately after you fit the model. The one difference is that e(sample) is set to 0. e(sample) is Stata’s function to mark which observations among those currently in memory were used in producing the estimates. For instance, you might type . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress mpg weight displ if foreign (output omitted ) . 
summarize mpg if e(sample) (output omitted ) and summarize would report the summary statistics for the observations regress in fact used, which would exclude not only observations for which foreign = 0 but also any observations for which mpg, weight, or displ was missing. If you saved the above estimation results and then reloaded them, however, summarize mpg if e(sample) would produce . summarize mpg if e(sample) Variable Obs mpg 0 Mean Std. Dev. Min Max Stata thinks that none of these observations was used in producing the estimates currently loaded. 528 estimates save — Save and use estimation results What else could Stata think? When you estimates use filename, you do not have to have the original data in memory. Even if you do have data in memory that look like the original data, they might not be. Setting e(sample) to 0 is the safe thing to do. There are some postestimation statistics, for instance, that are appropriate only when calculated on the estimation sample. Setting e(sample) to 0 ensures that if you ask for one of them, you will get back a null result. We recommend that you leave e(sample) set to 0. But what if you really need to calculate that postestimation statistic? Well, you can get it, but you are going to be responsible for setting e(sample) correctly. Here we just happen to know that all the observations with foreign = 1 were used, so we can type . estimates esample: if foreign If all the observations had been used, we could simply type . estimates esample: The safe thing to do, however, is to look at the estimation command—estimates describe will show it to you—and then type . estimates esample: mpg weight displ if foreign We include all observations with foreign = 1, excluding any with missing values in the mpg, weight, or displ variable, that are to be treated as the estimation sample. Resetting e(sample) estimates esample: will allow you to not only set but also reset e(sample). If e(sample) has already been set (say that you just fit the model) and you try to set it, you will see . estimates esample: mpg weight displ if foreign no; e(sample) already set r(322); Here you can specify the replace option: . estimates esample: mpg weight displ if foreign, replace We do not recommend resetting e(sample), but the situation can arise where you need to. Imagine that you estimates use filename, you set e(sample), and then you realize that you set it wrong. Here you would want to reset it. Determining who set e(sample) estimates esample without a colon will report whether and how e(sample) was set. You might see . estimates esample e(sample) set by estimation command or . estimates esample e(sample) set by user or . estimates esample e(sample) not set (0 assumed) estimates save — Save and use estimation results 529 Stored results estimates esample without the colon saves macro r(who), which will contain cmd, user, or zero’d. Also see [R] estimates — Save and manipulate estimation results Title estimates stats — Model-selection statistics Syntax Remarks and examples Menu Stored results Description Methods and formulas Option Also see Syntax estimates stats namelist , n(#) where namelist is a name, a list of names, all, or *. A name may be ., meaning the current (active) estimates. all and * mean the same thing. Menu Statistics > Postestimation > Manage estimation results > Table of fit statistics Description estimates stats reports model-selection statistics, including the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). 
These measures are appropriate for maximum likelihood models. If estimates stats is used for a non–likelihood-based model, such as qreg, missing values are reported. Option n(#) specifies the N to be used in calculating BIC; see [R] BIC note. Remarks and examples If you type estimates stats without arguments, a table for the most recent estimation results will be shown: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . logistic foreign mpg weight displ (output omitted ) . estimates stats Akaike’s information criterion and Bayesian information criterion Model Obs ll(null) ll(model) df AIC BIC . 74 -45.03321 -20.59083 4 49.18167 58.39793 Note: N=Obs used in calculating BIC; see [R] BIC note 530 estimates stats — Model-selection statistics 531 Regarding the note at the bottom of the table, N is an ingredient in the calculation of BIC; see [R] BIC note. The note changes if you specify the n() option, which tells estimates stats what N to use. N = Obs is the default. Regarding the table itself, ll(null) is the log likelihood for the constant-only model, ll(model) is the log likelihood for the model, df is the number of degrees of freedom, and AIC and BIC are the Akaike and Bayesian information criteria. Models with smaller values of an information criterion are considered preferable. estimates stats can compare estimation results: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . logistic foreign mpg weight displ (output omitted ) . estimates store full . logistic foreign mpg weight (output omitted ) . estimates store sub . estimates stats full sub Akaike’s information criterion and Bayesian information criterion Model Obs ll(null) ll(model) df AIC BIC full sub 74 74 -45.03321 -45.03321 -20.59083 -27.17516 4 3 49.18167 60.35031 58.39793 67.26251 Note: N=Obs used in calculating BIC; see [R] BIC note Stored results estimates stats stores the following in r(): Matrices r(S) matrix with 6 columns (N, ll0, ll, df, AIC, and BIC) and rows corresponding to models in table Methods and formulas See [R] BIC note. Also see [R] estimates — Save and manipulate estimation results Title estimates store — Store and restore estimation results Syntax Remarks and examples Menu Stored results Description References Option Also see Syntax estimates store name , nocopy estimates restore name estimates query estimates dir namelist estimates drop namelist estimates clear where namelist is a name, a list of names, all, or *. all and * mean the same thing. Menu estimates store Statistics > Postestimation > Manage estimation results > Store in memory > Manage estimation results > Restore from memory > Manage estimation results > List results stored in memory > Manage estimation results > Drop from memory estimates restore Statistics > Postestimation estimates dir Statistics > Postestimation estimates drop Statistics > Postestimation Description estimates store name stores the current (active) estimation results under the name name. estimates restore name loads the results stored under name into the current (active) estimation results. estimates query tells you whether the current (active) estimates have been stored and, if so, the name. 532 estimates store — Store and restore estimation results 533 estimates dir displays a list of the stored estimates. estimates drop namelist drops the specified stored estimation results. estimates clear drops all stored estimation results. estimates clear, estimates drop all, and estimates drop * do the same thing. 
estimates drop and estimates clear do not eliminate the current (active) estimation results. Option nocopy, used with estimates store, specifies that the current (active) estimation results are to be moved into name rather than copied. Typing . estimates store hold, nocopy is the same as typing . estimates store hold . ereturn clear except that the former is faster. The nocopy option is sometimes used by programmers. Remarks and examples estimates store stores estimation results in memory so that you can access them later. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress mpg weight displ (output omitted ) . estimates store myreg . ... you do other things, including fitting other models ... . estimates restore myreg . regress (same output shown again) After estimates restore myreg, things are once again just as they were, estimationwise, just after you typed regress mpg weight displ. estimates store stores results in memory. When you exit Stata, those stored results vanish. If you wish to make a permanent copy of your estimation results, see [R] estimates save. The purpose of making copies in memory is 1) so that you can quickly switch between them and 2) so that you can make tables comparing estimation results. Concerning the latter, see [R] estimates table and [R] estimates stats. Stored results estimates dir stores the following in r(): Macros r(names) names of stored results 534 estimates store — Store and restore estimation results References Jann, B. 2005. Making regression tables from stored estimates. Stata Journal 5: 288–308. . 2007. Making regression tables simplified. Stata Journal 7: 227–244. Also see [R] estimates — Save and manipulate estimation results Title estimates table — Compare estimation results Syntax Remarks and examples Menu Stored results Description References Options Also see Syntax estimates table namelist , options where namelist is a name, a list of names, all, or *. A name may be ., meaning the current (active) estimates. all and * mean the same thing. options Description Main stats(scalarlist) star (#1 #2 #3) report scalarlist in table use stars to denote significance levels Options keep(coeflist) drop(coeflist) equations(matchlist) report coefficients in order specified omit specified coefficients from table match equations of models as specified Numerical formats b (% fmt) se (% fmt) t (% fmt) p (% fmt) stfmt(% fmt) how to format coefficients, which are always reported report standard errors and use optional format report t or z and use optional format report p-values and use optional format how to format scalar statistics General format varwidth(#) modelwidth(#) use # characters to display variable names and statistics use # characters to display model names eform varlabel newpanel display coefficients in exponentiated form display variable labels rather than variable names display statistics in separate table from coefficients style(oneline) style(columns) style(noline) put vertical line after variable names; the default put vertical line separating every column suppress all vertical lines coded display compact table 535 536 estimates table — Compare estimation results Reporting display options control row spacing, line width, and display of omitted variables and base and empty cells title(string) title for table title() does not appear in the dialog box. where • A scalarlist is a list of any or all of the names of scalars stored in e(), plus aic, bic, and rank. • #1 #2 #3 are three numbers such as .05 .01 .001. 
• A coeflist is a list of coefficient names, each name of which may be simple (for example, price), an equation name followed by a colon (for example, mean:), or a full name (for example, mean:price). Names are separated by blanks. • A matchlist specifies how equations from different estimation results are to be matched. If you need to specify a matchlist, the solution is usually 1, as in equations(1). The full syntax is matchlist := term , term . . . term := eqname = #:# . . . :# eqname = # See equations() under Options below. Menu Statistics > Postestimation > Manage estimation results > Table of estimation results Description estimates table displays a table of coefficients and statistics for one or more sets of estimation results. Options Main stats(scalarlist) specifies one or more scalar statistics to be displayed in the table. scalarlist may contain aic bic rank Akaike’s information criterion Schwarz’s Bayesian information criterion rank of e(V) (# of free parameters in model) along with the names of any scalars stored in e(). The specified statistics do not have to be available for all estimation results being displayed. estimates table — Compare estimation results 537 For example, stats(N ll chi2 aic) specifies that e(N), e(ll), e(chi2), and AIC be included. In Stata, e(N) records the number of observations; e(ll), the log likelihood; and e(chi2), the chi-squared test that all coefficients in the first equation of the model are equal to zero. star and star(#1 #2 #3) specify that stars (asterisks) are to be used to mark significance. The second syntax specifies the significance levels for one, two, and three stars. If you specify simply star, that is equivalent to specifying star(.05 .01 .001), which means one star (*) if p < 0.05, two stars (**) if p < 0.01, and three stars (***) if p < 0.001. The star and star() options may not be combined with se, t, or p option. Options keep(coeflist) and drop(coeflist) are alternatives; they specify coefficients to be included or omitted from the table. The default is to display all coefficients. If keep() is specified, it specifies not only the coefficients to be included but also the order in which they appear. A coeflist is a list of coefficient names, each name of which may be simple (for example, price), an equation name followed by a colon (for example, mean:), or a full name (for example, mean:price). Names are separated from each other by blanks. When full names are not specified, all coefficients that match the partial specification are included. For instance, drop( cons) would omit cons for all equations. equations(matchlist) specifies how the equations of the models in namelist are to be matched. The default is to match equations by name. Matching by name usually works well when all results were fit by the same estimation command. When you are comparing results from different estimation commands, however, specifying equations() may be necessary. The most common usage is equations(1), which indicates that all first equations are to be matched into one equation named #1. matchlist has the syntax term , term . . . where term is eqname = #:# . . .:# eqname = # (syntax 1) (syntax 2) In syntax 1, each # is a number or a period (.). If a number, it specifies the position of the equation in the corresponding model; 1:3:1 would indicate that equation 1 in the first model matches equation 3 in the second, which matches equation 1 in the third. 
A period indicates that there is no corresponding equation in the model; 1:.:1 indicates that equation 1 in the first matches equation 1 in the third. In syntax 2, you specify just one number, say, 1 or 2, and that is shorthand for 1:1. . . :1 or 2:2. . . :2, meaning that equation 1 matches across all models specified or that equation 2 matches across all models specified. Now that you can specify a term, you can put that together into a matchlist by separating one term from the other by commas. In what follows, we will assume that three names were specified, . estimates table alpha beta gamma, ... equations(1) is equivalent to equations(1:1:1); we would be saying that the first equations match across the board. 538 estimates table — Compare estimation results equations(1:.:1) would specify that equation 1 matches in models alpha and gamma but that there is nothing corresponding in model beta. equations(1,2) is equivalent to equations(1:1:1, 2:2:2). We would be saying that the first equations match across the board and so do the second equations. equations(1, 2:.:2) would specify that the first equations match across the board, that the second equations match for models alpha and gamma, and that there is nothing equivalent to equation 2 in model beta. If equations() is specified, equations not matched by position are matched by name. Numerical formats b(% fmt) specifies how the coefficients are to be displayed. You might specify b(%9.2f) to make decimal points line up. There is also a b option, which specifies that coefficients are to be displayed, but that is just included for consistency with the se, t, and p options. Coefficients are always displayed. se, t, and p specify that standard errors, t or z statistics, and significance levels are to be displayed. The default is not to display them. se(% fmt), t(% fmt), and p(% fmt) specify that each is to be displayed and specifies the display format to be used to format them. stfmt(% fmt) specifies the format for displaying the scalar statistics included by the stats() option. General format varwidth(#) specifies the number of character positions used to display the names of the variables and statistics. The default is 12. modelwidth(#) specifies the number of character positions used to display the names of the models. The default is 12. eform displays coefficients in exponentiated form. For each coefficient, exp(β) rather than β is displayed, and standard errors are transformed appropriately. Display of the intercept, if any, is suppressed. varlabel specifies that variable labels be displayed instead of variable names. newpanel specifies that the statistics be displayed in a table separated by a blank line from the table with coefficients rather than in the style of another equation in the table of coefficients. style(stylespec) specifies the style of the coefficient table. style(oneline) specifies that a vertical line be displayed after the variables but not between the models. This is the default. style(columns) specifies that vertical lines be displayed after each column. style(noline) specifies that no vertical lines be displayed. coded specifies that a compact table be displayed. This format is especially useful for comparing variables that are included in a large collection of models. Reporting display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), and fvwrapon(style); see [R] estimation options. 
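Tying several of these options together, calls along the following lines would display coefficients and standard errors in a fixed format together with a few scalar statistics, or mark significance with stars instead (recall that star may not be combined with se). The model names base and alt refer to results stored as in the Remarks below; this is an illustrative sketch rather than reproduced output:

. estimates table base alt, b(%9.4f) se(%9.4f) stats(N r2 aic) varwidth(14)
(output omitted)
. estimates table base alt, star stats(N ll)
(output omitted)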
The following option is available with estimates table but is not shown in the dialog box: title(string) specifies the title to appear above the table. estimates table — Compare estimation results 539 Remarks and examples If you type estimates table without arguments, a table of the most recent estimation results will be shown: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress mpg weight displ (output omitted ) . estimates table Variable weight displacement _cons active -.00656711 .00528078 40.084522 The real use of estimates table, however, is for comparing estimation results, and that requires using it after estimates store: . regress mpg weight displ (output omitted ) . estimates store base . regress mpg weight displ foreign (output omitted ) . estimates store alt . qreg mpg weight displ foreign (output omitted ) . estimates store qreg . estimates table base alt qreg, stats(r2) Variable base alt qreg weight displacement foreign _cons -.00656711 .00528078 40.084522 -.00677449 .00192865 -1.6006312 41.847949 -.00595056 .00018552 -2.1326004 39.213348 r2 .6529307 .66287957 Stored results estimates table stores the following in r(): Macros r(names) Matrices r(coef) r(stats) names of results used matrix M : n × 2∗m M [i, 2j−1] = ith parameter estimate for model j ; M [i, 2j] = variance of M [i, 2j−1]; i=1,...,n; j=1,...,m matrix S : k×m (if option stats() specified) S[i, j] = ith statistic for model j ; i=1,...,k; j=1,...,m 540 estimates table — Compare estimation results References Gallup, J. L. 2012. A new system for formatting estimation tables. Stata Journal 12: 3–28. Weiss, M. 2010. Stata tip 90: Displaying partial results. Stata Journal 10: 500–502. Also see [R] estimates — Save and manipulate estimation results Title estimates title — Set title for estimation results Syntax Menu Description Remarks and examples Also see Syntax estimates title: text estimates title Menu Statistics > Postestimation > Manage estimation results > Title/retitle results Description estimates title: (note the colon) sets or clears the title for the current estimation results. The title is used by estimates table and estimates stats (see [R] estimates table and [R] estimates stats). estimates title without the colon displays the current title. Remarks and examples After setting the title, if estimates have been stored, do not forget to store them again: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress mpg gear turn (output omitted ) . estimates store reg Now let’s add a title: . estimates title: "My regression" . estimates store reg Also see [R] estimates — Save and manipulate estimation results 541 Title estimation options — Estimation options Syntax Description Options Also see Syntax estimation cmd . . . 
, options options Description Model noconstant offset(varnameo ) exposure(varnamee ) constraints(constraints) collinear suppress constant term include varnameo in model with coefficient constrained to 1 include ln(varnamee ) in model with coefficient constrained to 1 apply specified linear constraints keep collinear variables Reporting level(#) noskip nocnsreport noomitted vsquish noemptycells baselevels allbaselevels nofvlabel fvwrap(#) fvwrapon(style) cformat(% fmt) pformat(% fmt) sformat(% fmt) nolstretch set confidence level; default is level(95) perform overall model test as a likelihood-ratio test do not display constraints do not display omitted collinear variables suppress blank space separating factor variables or time-series variables do not display empty interaction cells of factor variables report base levels whose bases cannot be inferred display all base levels for factor variables and interactions display factor-variable level values rather than value labels allow # lines when wrapping long value labels apply style for wrapping long value labels; style may be word or width format for coefficients, standard errors, and confidence limits format for p-values format for test statistics do not automatically widen coefficient table for long variable names Integration intmethod(intmethod) intpoints(#) integration method for random-effects models use # integration (quadrature) points coeflegend display legend instead of statistics Description This entry describes the options common to many estimation commands. Not all the options documented here work with all estimation commands. See the documentation for the particular estimation command; if an option is listed there, it is applicable. 542 estimation options — Estimation options 543 Options Model noconstant suppresses the constant term (intercept) in the model. offset(varnameo ) specifies that varnameo be included in the model with the coefficient constrained to be 1. exposure(varnamee ) specifies a variable that reflects the amount of exposure over which the depvar events were observed for each observation; ln(varnamee ) with coefficient constrained to be 1 is entered into the log-link function. constraints(numlist | matname) specifies the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. See [R] reg3 for the use of constraints in multiple-equation contexts. constraints(numlist) specifies the constraints by number after they have been defined by using the constraint command; see [R] constraint. Some commands (for example, slogit) allow only constraints(numlist). constraints(matname) specifies a matrix containing the constraints; see [P] makecns. constraints(clist) is usedby some estimation commands, such as mlogit, where clist has the form # -# , # -# . . . . collinear specifies that the estimation command not omit collinear variables. Usually, there is no reason to leave collinear variables in place, and, in fact, doing so usually causes the estimation to fail because of the matrix singularity caused by the collinearity. However, with certain models, the variables may be collinear, yet the model is fully identified because of constraints or other features of the model. In such cases, using the collinear option allows the estimation to take place, leaving the equations with collinear variables intact. This option is seldom used. Reporting level(#) specifies the confidence level, as a percentage, for confidence intervals. 
The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. noskip specifies that a full maximum-likelihood model with only a constant for the regression equation be fit. This model is not displayed but is used as the base model to compute a likelihood-ratio test for the model test statistic displayed in the estimation header. By default, the overall model test statistic is an asymptotically equivalent Wald test of all the parameters in the regression equation being zero (except the constant). For many models, this option can substantially increase estimation time. nocnsreport specifies that no constraints be reported. The default is to display user-specified constraints above the coefficient table. noomitted specifies that variables that were omitted because of collinearity not be displayed. The default is to include in the table any variables omitted because of collinearity and to label them as “(omitted)”. vsquish specifies that the blank space separating factor-variable terms or time-series–operated variables from other variables in the model be suppressed. noemptycells specifies that empty cells for interactions of factor variables not be displayed. The default is to include in the table interaction cells that do not occur in the estimation sample and to label them as “(empty)”. 544 estimation options — Estimation options baselevels and allbaselevels control whether the base levels of factor variables and interactions are displayed. The default is to exclude from the table all base categories. baselevels specifies that base levels be reported for factor variables and for interactions whose bases cannot be inferred from their component factor variables. allbaselevels specifies that all base levels of factor variables and interactions be reported. nofvlabel displays factor-variable level values rather than attached value labels. This option overrides the fvlabel setting; see [R] set showbaselevels. fvwrap(#) specifies how many lines to allow when long value labels must be wrapped. Labels requiring more than # lines are truncated. This option overrides the fvwrap setting; see [R] set showbaselevels. fvwrapon(style) specifies whether value labels that wrap will break at word boundaries or break based on available space. fvwrapon(word), the default, specifies that value labels break at word boundaries. fvwrapon(width) specifies that value labels break based on available space. This option overrides the fvwrapon setting; see [R] set showbaselevels. cformat(% fmt) specifies how to format coefficients, standard errors, and confidence limits in the coefficient table. The maximum format width is 9. pformat(% fmt) specifies how to format p-values in the coefficient table. The maximum format width is 5. sformat(% fmt) specifies how to format test statistics in the coefficient table. The maximum format width is 8. nolstretch specifies that the width of the coefficient table not be automatically widened to accommodate longer variable names. The default, lstretch, is to automatically widen the coefficient table up to the width of the Results window. To change the default, use set lstretch off. nolstretch is not shown in the dialog box. Integration intmethod(intmethod) specifies the integration method to be used for the random-effects model. 
It accepts one of four arguments: mvaghermite, the default for all but a crossed random-effects model, performs mean and variance adaptive Gauss–Hermite quadrature; mcaghermite performs mode and curvature adaptive Gauss–Hermite quadrature; ghermite performs nonadaptive Gauss– Hermite quadrature; and laplace, the default for crossed random-effects models, performs the Laplacian approximation. intpoints(#) specifies the number of integration points to use for integration by quadrature. The default is intpoints(12); the maximum is intpoints(195). Increasing this value improves the accuracy but also increases computation time. Computation time is roughly proportional to its value. The following option is not shown in the dialog box: coeflegend specifies that the legend of the coefficients and how to specify them in an expression be displayed rather than displaying the statistics for the coefficients. Also see [U] 20 Estimation and postestimation commands Title exit — Exit Stata Syntax Description Option Remarks and examples Also see Syntax exit , clear Description Typing exit causes Stata to stop processing and return control to the operating system. If the dataset in memory has changed since the last save command, you must specify the clear option before Stata will let you exit. exit may also be used for exiting do-files or programs; see [P] exit. Stata for Windows users may also exit Stata by clicking on the Close button or by pressing Alt+F4. Stata for Mac users may also exit Stata by pressing Command+Q. Stata(GUI) users may also exit Stata by clicking on the Close button. Option clear permits you to exit, even if the current dataset has not been saved. Remarks and examples Type exit to leave Stata and return to the operating system. If the dataset in memory has changed since the last time it was saved, however, Stata will refuse. At that point, you can either save the dataset and then type exit, or type exit, clear: . exit no; data in memory would be lost r(4); . exit, clear Also see [P] exit — Exit from a program or do-file 545 Title exlogistic — Exact logistic regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax exlogistic depvar indepvars if in weight , options Description options Model condvars(varlist) condition on variables in varlist group(varname) groups/strata are stratified by unique values of varname binomial(varname | #) data are in binomial form and the number of trials is contained in varname or in # estconstant estimate constant term; do not condition on the number of successes noconstant suppress constant term Terms terms(termsdef ) terms definition Options memory(# b | k | m | g ) set limit on memory usage; default is memory(10m) saving(filename) save the joint conditional distribution to filename Reporting level(#) coef test(testopt) mue(varlist) midp nolog set confidence level; default is level(95) report estimated coefficients report significance of observed sufficient statistic, conditional scores test, or conditional probabilities test compute the median unbiased estimates for varlist use the mid-p-value rule do not display the enumeration log by, statsby, and xi are allowed; see [U] 11.1.10 Prefix commands. fweights are allowed; see [U] 11.1.6 weight. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 
Menu Statistics > Exact statistics > Exact logistic regression 546 exlogistic — Exact logistic regression 547 Description exlogistic fits an exact logistic regression model of depvar on indepvars. exlogistic is an alternative to logistic, the standard maximum-likelihood–based logistic regression estimator; see [R] logistic. exlogistic produces more-accurate inference in small samples because it does not depend on asymptotic results and exlogistic can better deal with one-way causation, such as the case where all females are observed to have a positive outcome. exlogistic with the group(varname) option is an alternative to clogit, the conditional logistic regression estimator; see [R] clogit. Like clogit, exlogistic conditions on the number of positive outcomes within stratum. depvar can be specified in two ways. It can be zero/nonzero, with zero indicating failure and nonzero representing positive outcomes (successes), or if you specify the binomial(varname | #) option, depvar may contain the number of positive outcomes within each trial. exlogistic is computationally intensive. Unlike most estimators, rather than calculating coefficients for all independent variables at once, results for each independent variable are calculated separately with the other independent variables temporarily conditioned out. You can save considerable computer time by skipping the parameter calculations for variables that are not of direct interest. Specify such variables in the condvars() option rather than among the indepvars; see condvars() below. Unlike Stata’s other estimation commands, you may not use test, lincom, or other postestimation commands after exlogistic. Given the method used to calculate estimates, hypothesis tests must be performed during estimation by using exlogistic’s terms() option; see terms() below. Options Model condvars(varlist) specifies variables whose parameter estimates are not of interest to you. You can save substantial computer time and memory moving such variables from indepvars to condvars(). Understand that you will get the same results for x1 and x3 whether you type . exlogistic y x1 x2 x3 x4 or . exlogistic y x1 x3, condvars(x2 x4) group(varname) specifies the variable defining the strata, if any. A constant term is assumed for each stratum identified in varname, and the sufficient statistics for indepvars are conditioned on the observed number of successes within each group. This makes the model estimated equivalent to that estimated by clogit, Stata’s conditional logistic regression command (see [R] clogit). group() may not be specified with noconstant or estconstant. binomial(varname | #) indicates that the data are in binomial form and depvar contains the number of successes. varname contains the number of trials for each observation. If all observations have the same number of trials, you can instead specify the number as an integer. The number of trials must be a positive integer at least as great as the number of successes. If binomial() is not specified, the data are assumed to be Bernoulli, meaning that depvar equaling zero or nonzero records one failure or success. estconstant estimates the constant term. By default, the models are assumed to have an intercept (constant), but the value of the intercept is not calculated. That is, the conditional distribution of 548 exlogistic — Exact logistic regression the sufficient statistics for the indepvars is computed given the number of successes in depvar, thus conditioning out the constant term of the model. 
Use estconstant if you want the estimate of the intercept reported. estconstant may not be specified with group(). noconstant; see [R] estimation options. noconstant may not be specified with group(). Terms terms(termname = variable . . . variable , termname = variable . . . variable . . . ) defines additional terms of the model on which you want exlogistic to perform joint-significance hypothesis tests. By default, exlogistic reports tests individually on each variable in indepvars. For instance, if variables x1 and x3 are in indepvars, and you want to jointly test their significance, specify terms(t1=x1 x3). To also test the joint significance of x2 and x4, specify terms(t1=x1 x3, t2=x2 x4). Each variable can be assigned to only one term. Joint tests are computed only for the conditional scores tests and the conditional probabilities tests. See the test() option below. Options memory(# b | k | m | g ) sets a limit on the amount of memory exlogistic can use when computing the conditional distribution of the parameter sufficient statistics. The default is memory(10m), where m stands for megabyte, or 1,048,576 bytes. The following are also available: b stands for byte; k stands for kilobyte, which is equal to 1,024 bytes; and g stands for gigabyte, which is equal to 1,024 megabytes. The minimum setting allowed is 1m and the maximum is 2048m or 2g, but do not attempt to use more memory than is available on your computer. Also see the first technical note under example 4 on counting the conditional distribution. saving(filename , replace ) saves the joint conditional distribution to filename. This distribution is conditioned on those variables specified in condvars(). Use replace to replace an existing file with filename. A Stata data file is created containing all the feasible values of the parameter sufficient statistics. The variable names are the same as those in indepvars, in addition to a variable named f containing the feasible value frequencies (sometimes referred to as the condition numbers). Reporting level(#); see [R] estimation options. The level(#) option will not work on replay because confidence intervals are based on estimator-specific enumerations. To change the confidence level, you must refit the model. coef reports the estimated coefficients rather than odds ratios (exponentiated coefficients). coef may be specified when the model is fit or upon replay. coef affects only how results are displayed and not how they are estimated. test(sufficient | score | probability) reports the significance level of the observed sufficient statistics, the conditional scores tests, or the conditional probabilities tests, respectively. The default is test(sufficient). If terms() is included in the specification, the conditional scores test and the conditional probabilities test are applied to each term providing conditional inference for several parameters simultaneously. All the statistics are computed at estimation time regardless of which is specified. Each statistic may thus also be displayed postestimation without having to refit the model; see [R] exlogistic postestimation. mue(varlist) specifies that median unbiased estimates (MUEs) be reported for the variables in varlist. By default, the conditional maximum likelihood estimates (CMLEs) are reported, except for those parameters for which the CMLEs are infinite. Specify mue( all) if you want MUEs for all the indepvars. 
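To make the interplay of condvars(), terms(), and test() concrete, an illustrative call might look like the following; y and x1 through x4 are hypothetical placeholder variables rather than variables from any dataset used in this entry:

. exlogistic y x1 x3, condvars(x2 x4) terms(t1=x1 x3) test(score) coef
(output omitted)

Here estimates are reported for x1 and x3 only, x2 and x4 are conditioned out, and a joint conditional scores test is reported for the term t1.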
exlogistic — Exact logistic regression 549 midp instructs exlogistic to use the mid-p-value rule when computing the MUEs, significance levels, and confidence intervals. This adjustment is for the discreteness of the distribution and halves the value of the discrete probability of the observed statistic before adding it to the p-value. The mid-p-value rule cannot be applied to MUEs whose corresponding parameter CMLE is infinite. nolog prevents the display of the enumeration log. By default, the enumeration log is displayed, showing the progress of computing the conditional distribution of the sufficient statistics. Remarks and examples Exact logistic regression is the estimation of the logistic model parameters by using the conditional distribution of the parameter sufficient statistics. The estimates are referred to as the conditional maximum likelihood estimates (CMLEs). This technique was first introduced by Cox and Snell (1989) as an alternative to using maximum likelihood estimation, which can perform poorly for small sample sizes. For stratified data, exact logistic regression is a small-sample alternative to conditional logistic regression. See [R] logit, [R] logistic, and [R] clogit to obtain maximum likelihood estimates (MLEs) for the logistic model and the conditional logistic model. For a comprehensive overview of exact logistic regression, see Mehta and Patel (1995). Let Yi denote a Bernoulli random variable where we observe the outcome Yi = yi , i = 1, . . . , n. Associated with each independent observation is a 1 × p vector of covariates, xi . We will denote πi = Pr (Yi | xi ) and let the logit function model the relationship between Yi and xi , log πi 1 − πi = θ + xi β where the constant term θ and the 1 × p vector of regression parameters β are unknown. The probability of observing Yi = yi , i = 1, . . . , n, is Pr(Y = y) = n Y πiyi (1 − πi ) 1−yi i=1 where Y = (Y1 , . . . , Yn ) and y = (y1 , . . . , yn ). The MLEs for θ and β maximize the log of this function. Pn Pn The sufficient statistics for θ and βj , j = 1, . . . , p, are M = i=1 Yi and Tj = i=1 Yi xij , respectively, and we observe M = m and Tj = tj . By default, exlogistic tallies the conditional n distribution of T = (T1 , . . . , Tp ) given M = m. This distribution will have a size of . (It m would have a size of 2n without conditioning on M = m.) Denote one of these vectors T(k) = PN (k) (k) n (t1 , . . . , tp ), k = 1, . . . , N , with combinatorial coefficient (frequency) ck , k=1 ck = m . For each independent variable xj , j = 1, . . . , p, we reduce the conditional distribution further by conditioning on all other observed sufficient statistics Tl = tl , l 6= j . The conditional probability of observing Tj = tj has the form Pr(Tj = tj | Tl = tl , l 6= j, M = m) = P c etj βj k ck e (k) tj βj 550 exlogistic — Exact logistic regression (k) (k) (k) (k) where the sum is over the subset of T vectors such that (T1 = t1 , . . . , Tj = tj , . . . , Tp = tp ) and c is the combinatorial coefficient associated with the observed t. The CMLE for βj maximizes the log of this function. Specifying nuisance variables in condvars() will reduce the size of the conditional distribution by conditioning on their observed sufficient statistics as well as conditioning on M = m. This reduces the amount of memory consumed at the cost of not obtaining regression estimates for those variables specified in condvars(). Inferences from MLEs rely on asymptotics, and if your sample size is small, these inferences may not be valid. 
On the other hand, inferences from the CMLEs are exact in the sense that they use the conditional distribution of the sufficient statistics outlined above. For small datasets, it is common for the dependent variable to be completely determined by the data. Here the MLEs and the CMLEs are unbounded. exlogistic will instead compute the MUE, the regression estimate that places the observed sufficient statistic at the median of the conditional distribution. Example 1 One example presented by Mehta and Patel (1995) is data from a prospective study of perinatal infection and human immunodeficiency virus type 1 (HIV-1). We use a variation of this dataset. There was an investigation Hutto et al. (1991) into whether the blood serum levels of glycoproteins CD4 and CD8 measured in infants at 6 months of age might predict their development of HIV infection. The blood serum levels are coded as ordinal values 0, 1, and 2. . use http://www.stata-press.com/data/r13/hiv1 (prospective study of perinatal infection of HIV-1) . list in 1/5 1. 2. 3. 4. 5. hiv cd4 cd8 1 0 1 1 0 0 0 0 1 1 0 0 2 0 0 We first obtain the MLEs from logistic so that we can compare the estimates and associated statistics with the CMLEs from exlogistic. . logistic hiv cd4 cd8, coef Logistic regression Number of obs LR chi2(2) Prob > chi2 Pseudo R2 Log likelihood = -20.751687 hiv Coef. cd4 cd8 _cons -2.541669 1.658586 .5132389 Std. Err. .8392231 .821113 .6809007 z -3.03 2.02 0.75 P>|z| 0.002 0.043 0.451 = = = = 47 15.75 0.0004 0.2751 [95% Conf. Interval] -4.186517 .0492344 -.8213019 -.8968223 3.267938 1.84778 exlogistic — Exact logistic regression . exlogistic hiv cd4 cd8, coef Enumerating sample-space combinations: observation 1: enumerations = 2 observation 2: enumerations = 3 (output omitted ) observation 46: enumerations = 601 observation 47: enumerations = 326 Exact logistic regression hiv Coef. Suff. cd4 cd8 -2.387632 1.592366 10 12 Number of obs = Model score = Pr >= score = 2*Pr(Suff.) 0.0004 0.0528 551 47 13.34655 0.0006 [95% Conf. Interval] -4.699633 -.0137905 -.8221807 3.907876 exlogistic produced a log showing how many records are generated as it processes each observation. The primary purpose of the log is to provide feedback because generating the distribution can be time consuming, but we also see from the last entry that the joint distribution for the sufficient statistics for cd4 and cd8 conditioned on the total number of successes has 326 unique values (but a size of 47 = 341,643,774,795). 14 The statistics for logistic are based on asymptotics: for a large sample size, each Z statistic will be approximately normally distributed (with a mean of zero and a standard deviation of one) if the associated regression parameter is zero. The question is whether a sample size of 47 is large enough. On the other hand, the p-values computed by exlogistic are from the conditional distributions of the sufficient statistics for each parameter given the sufficient statistics for all other parameters. In this sense, these p-values are exact. By default, exlogistic reports the sufficient statistics for the regression parameters and the probability of observing a more extreme value. These are singleparameter tests for H0: βcd4 = 0 and H0: βcd8 = 0 versus the two-sided alternatives. The conditional scores test, located in the coefficient table header, is testing that both H0: βcd4 = 0 and H0: βcd8 = 0. We find these p-values to be in fair agreement with the Wald and likelihood-ratio tests from logistic. 
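As a quick side check, the sample-space size quoted above can be computed directly with the comb() function. This is only a sketch, and it assumes the hiv1 data from this example are still in memory.
. count if hiv != 0             // 14 observations with a positive outcome; this is the observed M
. display %15.0f comb(47,14)    // equals 341,643,774,795, the size quoted above
As noted above, those arrangements collapse to just 326 distinct values of the joint sufficient statistic for cd4 and cd8.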
The confidence intervals for exlogistic are computed from the exact conditional distributions. The exact confidence intervals are asymmetrical about the estimate and are wider than the normal-based confidence intervals from logistic. Both estimation techniques indicate that the incidence of HIV infection decreases with increasing CD4 blood serum levels and increases with increasing CD8 blood serum levels. The constant term is missing from the exact logistic coefficient table because we conditioned out its observed sufficient statistic when tallying the joint distribution of the sufficient statistics for the cd4 and cd8 parameters. The test() option provides two other test statistics used in exact logistic: the conditional scores test, test(score), and the conditional probabilities test, test(probability). For comparison, we display the individual parameter conditional scores tests. 552 exlogistic — Exact logistic regression . exlogistic, test(score) coef Exact logistic regression Number of obs = Model score = Pr >= score = hiv Coef. Score cd4 cd8 -2.387632 1.592366 12.88022 4.604816 Pr>=Score 0.0003 0.0410 47 13.34655 0.0006 [95% Conf. Interval] -4.699633 -.0137905 -.8221807 3.907876 For the probabilities test, the probability statistic is computed from (1) in Methods and formulas with β = 0. For this example, the significance of the probabilities tests matches the scores tests so they are not displayed here. Technical note Typically, the value of θ, the constant term, is of little interest, as well as perhaps some of the parameters in β, but we need to include all parameters in the model to correctly specify it. By conditioning out the nuisance parameters, we can reduce the size of the joint conditional distribution that is used to estimate the regression parameters of interest. The condvars() option allows you to specify a varlist of nuisance variables. By default, exlogistic conditions on the sufficient statistic of θ, which is the number of successes. You can save computation time and computer memory by using the condvars() option because infeasible values of the sufficient statistics associated with the variables in condvars() can be dropped from consideration before all n observations are processed. Specifying some of your independent variables in condvars() will not change the estimated regression coefficients of the remaining independent variables. For instance, in example 1, if we instead type . exlogistic hiv cd4, condvars(cd8) coef the regression coefficient for cd4 (as well as all associated inference) will be identical. One reason to have multiple variables in indepvars is to make conditional inference of several parameters simultaneously by using the terms() option. If you do not wish to test several parameters simultaneously, it may be more efficient to obtain estimates for individual variables by calling exlogistic multiple times with one variable in indepvars and all other variables listed in condvars(). The estimates will be the same as those with all variables in indepvars. Technical note If you fit a clogit (see [R] clogit) model to the HIV data from example 1, you will find that the estimates differ from those with exlogistic. (To fit the clogit model, you will have to create a group variable that includes all observations.) The regression estimates will be different because clogit conditions on the constant term only, whereas the estimates from exlogistic condition on the sufficient statistic of the other regression parameter as well as the constant term. 
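Both technical notes are easy to check in a short sketch. It assumes the hiv1 data from example 1 are in memory; allobs is a scratch variable created only to give clogit a single stratum containing all 47 observations.
. generate byte allobs = 1                  // one stratum covering every observation
. clogit hiv cd4 cd8, group(allobs)         // estimates differ from exlogistic, as noted above
. exlogistic hiv cd4, condvars(cd8) coef    // cd4 results identical to example 1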
Example 2 The HIV data presented in table IV of Mehta and Patel (1995) are in a binomial form, where the variable hiv contains the HIV cases that tested positive and the variable n contains the number of individuals with the same CD4 and CD8 levels, the binomial number-of-trials parameter. Here depvar is hiv, and we use the binomial(n) option to identify the number-of-trials variable. exlogistic — Exact logistic regression 553 . use http://www.stata-press.com/data/r13/hiv_n (prospective study of perinatal infection of HIV-1; binomial form) . list cd4 cd8 hiv n 1. 2. 3. 4. 5. 0 1 0 1 2 2 2 0 1 2 1 2 4 4 1 1 2 7 12 3 6. 7. 8. 1 2 2 0 0 1 2 0 0 7 2 13 Further, the cd4 and cd8 variables of the hiv dataset are actually factor variables, where each has the ordered levels of (0, 1, 2). Another approach to the analysis is to use indicator variables, and following Mehta and Patel (1995), we used a 0–1 coding scheme that will give us the odds ratio of level 0 versus 2 and level 1 versus 2. . generate byte cd4_0 = (cd4==0) . generate byte cd4_1 = (cd4==1) . generate byte cd8_0 = (cd8==0) . generate byte cd8_1 = (cd8==1) . exlogistic hiv cd4_0 cd4_1 cd8_0 cd8_1, terms(cd4=cd4_0 cd4_1, > cd8=cd8_0 cd8_1) binomial(n) test(probability) saving(dist, replace) nolog note: saving distribution to file dist.dta note: CMLE estimate for cd4_0 is +inf; computing MUE note: CMLE estimate for cd4_1 is +inf; computing MUE note: CMLE estimate for cd8_0 is -inf; computing MUE note: CMLE estimate for cd8_1 is -inf; computing MUE Exact logistic regression Number of obs = 47 Binomial variable: n Model prob. = 3.19e-06 Pr <= prob. = 0.0011 hiv Odds Ratio cd4 Prob. Pr<=Prob. [95% Conf. Interval] cd4_0 cd4_1 18.82831* 11.53732* .0007183 .007238 .0063701 0.0055 0.0072 0.0105 1.714079 1.575285 cd8_0 cd8_1 .1056887* .0983388* .0053212 .0289948 .0241503 0.0323 0.0290 0.0242 0 0 cd8 +Inf +Inf 1.072531 .9837203 (*) median unbiased estimates (MUE) . matrix list e(sufficient) e(sufficient)[1,4] cd4_0 cd4_1 cd8_0 cd8_1 r1 5 8 6 4 . display e(n_possible) 1091475 Here we used terms() to specify two terms in the model, cd4 and cd8, that make up the cd4 and cd8 indicator variables. By doing so, we obtained a conditional probabilities test for cd4, simultaneously testing both cd4 0 and cd4 1, and for cd8, simultaneously testing both cd8 0 and cd8 1. The significance levels for the two terms are 0.0055 and 0.0323, respectively. 554 exlogistic — Exact logistic regression This example also illustrates instances where the dependent variable is completely determined by the independent variables and CMLEs are infinite. If we try to obtain MLEs, logistic will drop each variable and then terminate with a no-data error, error number 2000. . use http://www.stata-press.com/data/r13/hiv_n, clear (prospective study of perinatal infection of HIV-1; binomial form) . generate byte cd4_0 = (cd4==0) . generate byte cd4_1 = (cd4==1) . generate byte cd8_0 = (cd8==0) . generate byte cd8_1 = (cd8==1) . expand n (39 observations created) . 
logistic hiv cd4_0 cd4_1 cd8_0 cd8_1 note: cd4_0 != 0 predicts success perfectly cd4_0 dropped and 8 obs not used note: cd4_1 != 0 predicts success perfectly cd4_1 dropped and 21 obs not used note: cd8_0 != 0 predicts failure perfectly cd8_0 dropped and 2 obs not used outcome = cd8_1 <= 0 predicts data perfectly r(2000); In example 2, exlogistic generated the joint conditional distribution of Tcd4 0 , Tcd4 1 , Tcd8 0 , and Tcd8 1 given M = 14 (the number of individuals that tested positive), and for reference, we listed the observed sufficient statistics that are stored in the matrix e(sufficient). Below we take that distribution and further condition on Tcd4 1 = 8, Tcd8 0 = 6, and Tcd8 1 = 4, giving the conditional distribution of Tcd4 0 . Here we see that the observed sufficient statistic Tcd4 0 = 5 is last in the sorted listing or, equivalently, Tcd4 0 is at the domain boundary of the conditional probability distribution. When this occurs, the conditional probability distribution is monotonically increasing in βcd4 0 and a maximum does not exist. . use dist, clear . keep if cd4_1==8 & cd8_0==6 & cd8_1==4 (4139 observations deleted) . list, sep(0) 1. 2. 3. 4. 5. 6. _f_ cd4_0 cd4_1 cd8_0 cd8_1 1668667 18945542 55801053 55867350 17423175 1091475 0 1 2 3 4 5 8 8 8 8 8 8 6 6 6 6 6 6 4 4 4 4 4 4 When the CMLEs are infinite, the MUEs are computed (Hirji, Tsiatis, and Mehta 1989). For the cd4 0 estimate, we compute the value β cd4 0 such that Pr(Tcd4 0 ≥ 5 | βcd4 0 = β cd4 0 , Tcd4 using (1) in Methods and formulas. 1 = 8, Tcd8 0 = 6, Tcd8 1 = 4, M = 14) = 1/2 exlogistic — Exact logistic regression 555 The output is in agreement with example 1: there is an increase in risk of HIV infection for a CD4 blood serum level of 0 relative to a level of 2 and for a level of 1 relative to a level of 2; there is a decrease in risk of HIV infection for a CD8 blood serum level of 0 relative to a level of 2 and for a level of 1 relative to a level of 2. We also displayed e(n possible). This is the combinatorial coefficient associated with the observed sufficient statistics. The same value is found in the f variable of the conditional distribution 47 dataset listed above. The size of the distribution is = 341,643,774,795. This can be verified 14 by summing the f variable of the generated conditional distribution dataset. . use dist, clear . summarize _f_, meanonly . di %15.1f r(sum) 341643774795.0 Example 3 One can think of exact logistic regression as a covariate-adjusted exact binomial. To demonstrate this point, we will use exlogistic to compute a binomial confidence interval for m successes of n trials, by fitting the constant-only model, and we will compare it with the confidence interval computed by ci (see [R] ci). We will use the saving() option to retain the dataset containing the feasible values for the constant term sufficient statistic,namely, the number of successes, m, given n , m = 0, 1, . . . , n. n trials and their associated combinatorial coefficients m . input y 1. 2. 3. 4. 5. 6. 7. . ci y 1 0 1 0 1 1 end y, binomial Variable Obs Mean Std. Err. Binomial Exact [95% Conf. Interval] y 6 .6666667 .1924501 .2227781 . exlogistic y, estconstant nolog coef saving(binom) note: saving distribution to file binom.dta Exact logistic regression Number of obs = y Coef. Suff. _cons .6931472 4 2*Pr(Suff.) 0.6875 .9567281 6 [95% Conf. 
Interval] -1.24955 3.096017 We use the postestimation program estat predict to transform the estimated constant term and its confidence bounds by using the inverse logit function, invlogit() (see [D] functions). The standard error for the estimated probability is computed using the delta method. 556 exlogistic — Exact logistic regression . estat predict y Predicted Probability 0.6667 Std. Err. 0.1925 [95% Conf. Interval] 0.2228 0.9567 . use binom, replace . list, sep(0) 1. 2. 3. 4. 5. 6. 7. _f_ _cons_ 1 6 15 20 15 6 1 0 1 2 3 4 5 6 Examining the listing of the generated data, the values contained in the variable are the 6 feasible values of M , and the values contained in the variable f are the binomial coefficients m 6 X 6 with total = 26 = 64. In the coefficient table, the sufficient statistic for the constant term, m m=0 labeled Suff., is m = 4. This value is located at record 5 of the dataset. Therefore, the two-tailed probability of the sufficient statistic is computed as 0.6875 = 2(15 + 6 + 1)/64. cons The constant term is the value of θ that maximizes the probability of observing M = 4; see (1) of Methods and formulas: Pr(M = 4|θ) = 15e4α 1 + 6eα + 15e2α + 20e3α + 15e4α + 6e5α + e6α .4 The maximum is at the value θ = log 2, which is demonstrated in the figure below. 0 .1 prob .2 .3 (log(2),0.33) −2 0 2 constant term 4 exlogistic — Exact logistic regression 557 .2 cummlative probability .4 .6 .8 1 The lower and upper confidence bounds are the values of θ such that Pr(M ≥ 4|θ) = 0.025 and Pr(M ≤ 4|θ) = 0.025, respectively. These probabilities are plotted in the figure below for θ ∈ [−2, 4]. (3.1, .025) 0 (−1.25, .025) −2 0 2 4 constant term Pr(M >= 4) Pr(M <= 4) confidence bounds Example 4 This example demonstrates the group() option, which allows the analysis of stratified data. Here the logistic model is πik log = θk + xki β 1 − πik where k indexes the Psnkstrata, k = 1, . . . , s, and θk is the strata-specific constant term whose sufficient Yki . statistic is Mk = i=1 Mehta and Patel (1995) use a case–control study to demonstrate this model, which is useful in comparing the estimates from exlogistic and clogit. This study was intended to determine the role of birth complications in people with schizophrenia (Garsd 1988). Siblings from seven families took part in the study, and each individual was classified as normal or schizophrenic. A birth complication index is recorded for each individual that ranges from 0, an uncomplicated birth, to 15, a very complicated birth. Some of the frequencies contained in variable f are greater than 1, and these count different births at different times where the individual has the same birth complications index, found in variable BCindex. 558 exlogistic — Exact logistic regression . use http://www.stata-press.com/data/r13/schizophrenia, clear (case-control study on birth complications for people with schizophrenia) . list, sepby(family) family BCindex schizo f 1. 2. 3. 4. 5. 6. 7. 1 1 1 1 1 1 1 6 7 3 2 5 0 15 0 0 0 0 0 0 1 1 1 2 3 1 1 1 8. 9. 2 2 2 0 1 0 1 1 10. 11. 12. 3 3 3 2 9 1 0 1 0 1 1 1 13. 14. 4 4 2 0 1 0 1 4 15. 16. 17. 5 5 5 3 6 0 1 0 1 1 1 1 18. 19. 20. 6 6 6 3 0 0 0 1 0 1 1 2 21. 22. 7 7 2 6 0 1 1 1 . 
exlogistic schizo BCindex [fw=f], group(family) test(score) coef Enumerating sample-space combinations: observation 1: enumerations = 2 observation 2: enumerations = 3 observation 3: enumerations = 4 observation 4: enumerations = 5 observation 5: enumerations = 6 observation 6: enumerations = 7 (output omitted ) observation 21: enumerations = 72 observation 22: enumerations = 40 Exact logistic regression Number of obs = Group variable: family Number of groups = Obs per group: min = avg = max = Model score = Pr >= score = schizo Coef. Score BCindex .3251178 6.328033 Pr>=Score 0.0167 29 7 2 4.1 10 6.32803 0.0167 [95% Conf. Interval] .0223423 .7408832 exlogistic — Exact logistic regression 559 The asymptotic alternative for this model can be estimated using clogit (equivalently, xtlogit, fe) and is listed below for comparison. We must expand the data because clogit will not accept frequency weights if they are not constant within the groups. . expand f (7 observations created) . clogit schizo BCindex, group(family) nolog note: multiple positive outcomes within groups encountered. Conditional (fixed-effects) logistic regression Log likelihood = -6.2819819 schizo Coef. BCindex .3251178 Number of obs LR chi2(1) Prob > chi2 Pseudo R2 Std. Err. z P>|z| .1678981 1.94 0.053 = = = = 29 5.20 0.0226 0.2927 [95% Conf. Interval] -.0039565 .654192 Both techniques compute the same regression estimate for the BCindex, which might not be too surprising because both estimation techniques condition on the total number of successes in each group. The difference lies in the p-values and confidence intervals. The p-value testing H0 : βBCindex = 0 is approximately 0.0167 for the exact conditional scores test and 0.053 for the asymptotic Wald test. Moreover, the exact confidence interval is asymmetric about the estimate and does not contain zero. Technical note The memory(#) option limits the amount of memory that exlogistic will consume when computing the conditional distribution of the parameter sufficient statistics. memory() is independent of the data maximum memory setting (see set max memory in [D] memory), and it is possible for exlogistic to exceed the memory limit specified in set max memory without terminating. By default, a log is provided that displays the number of enumerations (the size of the conditional distribution) after processing each observation. Typically, you will see the number of enumerations increase, and then at some point they will decrease as the multivariate shift algorithm (Hirji, Mehta, and Patel 1987) determines that some of the enumerations cannot achieve the observed sufficient statistics of the conditioning variables. When the algorithm is complete, however, it is necessary to store the conditional distribution of the parameter sufficient statistics as a dataset. It is possible, therefore, to get a memory error when the algorithm has completed if there is not enough memory to store the conditional distribution. Technical note Computing the conditional distributions and reported statistics requires data sorting and numerical comparisons. If there is at least one single-precision variable specified in the model, exlogistic will make comparisons with a relative precision of 2−5 . Otherwise, a relative precision of 2−11 is used. Be careful if you use recast to promote a single-precision variable to double precision (see [D] recast). You might try listing the data in full precision (maybe %20.15g; see [D] format) to make sure that this is really what you want. 
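As a concrete illustration, suppose x is a hypothetical variable originally stored as float. Listing it in full precision before and after promotion shows that recast changes the storage type but cannot restore precision that was never recorded.
. format x %20.15g          // display the stored values in full precision
. list x in 1/5
. recast double x           // promotes the storage type; the stored values are unchanged
. list x in 1/5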
See [D] data types for information on precision of numeric storage types. 560 exlogistic — Exact logistic regression Stored results exlogistic stores the following in e(): Scalars e(N) e(k groups) e(n possible) e(n trials) e(sum y) e(k indvars) e(k terms) e(k condvars) e(condcons) e(midp) e(eps) Macros e(cmd) e(cmdline) e(title) e(depvar) e(indvars) e(condvars) e(groupvar) e(binomial) e(level) e(wtype) e(wexp) e(datasignature) e(datasignaturevars) e(properties) e(estat cmd) e(marginsnotok) Matrices e(b) e(mue indicators) e(se) e(ci) e(sum y groups) e(N g) e(sufficient) e(p sufficient) e(scoretest) e(p scoretest) e(probtest) e(p probtest) e(scoretest m) e(p scoretest m) e(probtest m) e(p probtest m) Functions e(sample) number of observations number of groups number of distinct possible outcomes where sum(sufficient) equals observed e(sufficient) binomial number-of-trials parameter sum of depvar number of independent variables number of model terms number of conditioning variables conditioned on the constant(s) indicator mid-p-value rule indicator relative difference tolerance exlogistic command as typed title in estimation output name of dependent variable independent variables conditional variables group variable binomial number-of-trials variable confidence level weight type weight expression the checksum variables used in calculation of checksum b program used to implement estat predictions disallowed by margins coefficient vector indicator for elements of e(b) estimated using MUE instead of CMLE e(b) standard errors (CMLEs only) matrix of e(level) confidence intervals for e(b) sum of e(depvar) for each group number of observations in each group sufficient statistics for e(b) p-value for e(sufficient) conditional scores tests for indepvars p-values for e(scoretest) conditional probabilities tests for indepvars p-value for e(probtest) conditional scores tests for model terms p-value for e(scoretest m) conditional probabilities tests for model terms p-value for e(probtest m) marks estimation sample exlogistic — Exact logistic regression 561 Methods and formulas Methods and formulas are presented under the following headings: Sufficient statistics Conditional distribution and CMLE Median unbiased estimates and exact CI Conditional hypothesis tests Sufficient-statistic p-value Sufficient statistics Let {Y1 , Y2 , . . . , Yn } be a set of n independent Bernoulli random variables, each of which can realize two outcomes, {0, 1}. For each i = 1, . . . , n, we observe Yi = yi , and associated with each observation is the covariate row vector of length p, xi = (xi1 , . . . , xip ). Denote β = (β1 , . . . , βp )T to be theP column vector of regression parameters and θ P to be the constant. The sufficient statistic Pn for βj is n n Tj = i=1 Yi xij , jP = 1, . . . , p, and for θ is M = i=1 Yi . We observe Tj = tj , tj = i=1 yi xij , n and M = m, m = i=1 yi . The probability of observing (Y1 = y1 , Y2 = y2 , . . . , Yn = yn ) is exp(mθ + tβ) i=1 {1 + exp(θ + xi β)} Pr(Y1 = y1 , . . . , Yn = yn | β, X) = Qn where t = (t1 , . . . , tp ) and X = (xT1 , . . . , xTn )T . The joint distribution of the sufficient statistics T is obtained by summing over all possible binary sequences Y1 , . . . , Yn such that T = t and M = m. This probability function is c(t, m) exp(mθ + tβ) Pr(T1 = t1 , . . . , Tp = tp , M = m | β, X) = Qn i=1 {1 + exp(θ + xi β)} where c(t, m) is the combinatorial coefficient of (t, m) or the number of distinct binary sequences Y1 , . . . , Yn such that T = t and M = m (Cox and Snell 1989). 
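For reference, the two probabilities above can be restated compactly in LaTeX notation.
\[
\Pr(Y_1=y_1,\ldots,Y_n=y_n \mid \boldsymbol\beta, X)
   = \frac{\exp(m\theta + \mathbf{t}\boldsymbol\beta)}
          {\prod_{i=1}^{n}\bigl\{1+\exp(\theta+\mathbf{x}_i\boldsymbol\beta)\bigr\}}
\]
\[
\Pr(T_1=t_1,\ldots,T_p=t_p,\,M=m \mid \boldsymbol\beta, X)
   = \frac{c(\mathbf{t},m)\,\exp(m\theta + \mathbf{t}\boldsymbol\beta)}
          {\prod_{i=1}^{n}\bigl\{1+\exp(\theta+\mathbf{x}_i\boldsymbol\beta)\bigr\}}
\]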
Conditional distribution and CMLE Without loss of generality, we will restrict our discussion to computing the CMLE of β1 . If we condition on observing M = m and T2 = t2 , . . . , Tp = tp , the probability function of (T1 | β1 , T2 = t2 , . . . , Tp = tp , M = m) is c(t, m)et1 β1 uβ1 u c(u, t2 , . . . , tp , m)e Pr(T1 = t1 | β1 , T2 = t2 , . . . , Tp = tp , M = m) = P (1) where the sum in the denominator is over all possible values of T1 such that M = m and T2 = t2 , . . . , Tp = tp and c(u, t2 , . . . , tp , m) is the combinatorial coefficient of (u, t2 , . . . , tp , m) (Cox and Snell 1989). The CMLE for β1 is the value βb1 that maximizes the log of (1). This optimization task is carried out by ml, using the conditional frequency distribution of (T1 | T2 = t2 , . . . , Tp = tp , M = m) as a dataset. Generating the joint conditional distribution is efficiently computed using the multivariate shift algorithm described by Hirji, Mehta, and Patel (1987). Difficulties in computing βb1 arise if the observed (T1 = t1 , . . . , Tp = tp , M = m) lies on the boundaries of the distribution of (T1 | T2 = t2 , . . . , Tp = tp , M = m), where the conditional probability function is monotonically increasing (or decreasing) in β1 . Here the CMLE is plus infinity if it is on the upper boundary, Pr(T1 ≤ t1 | T2 = t2 , . . . , Tp = tp , M = m) = 1, and is minus infinity if it is on the lower boundary of the distribution, Pr(T1 ≥ t1 | T2 = t2 , . . . , Tp = tp , M = m) = 1. This concept is demonstrated in example 2. When infinite CMLEs occur, the MUE is computed. 562 exlogistic — Exact logistic regression Median unbiased estimates and exact CI The MUE is computed using the technique outlined by Hirji, Tsiatis, and Mehta (1989). First, we (u) (l) find the values of β1 and β1 such that (u) Pr(T1 ≤ t1 | β1 = β1 , T2 = t2 , . . . , Tp = tp , M = m) = (2) (l) Pr(T1 ≥ t1 | β1 = β1 , T2 = t2 , . . . , Tp = tp , M = m) = 1/2 (l) (u) /2. However, if T1 is equal to the minimum of the domain of The MUE is then β 1 = β1 + β1 the conditional distribution, β (l) does not exist and β 1 = β (u) . If T1 is equal to the maximum of the domain of the conditional distribution, β (u) does not exist and β 1 = β (l) . Confidence bounds for β are computed similarly, except that we substitute α/2 for 1/2 in (2), (l) (u) where 1 − α is the confidence level. Here β1 would then be the lower confidence bound and β1 would be the upper confidence bound (see example 3). Conditional hypothesis tests P To test H0: β1 = 0 versus H1 : βP 1 6= 0, we obtain the exact p-value from u∈E f1 (u) − f1 (t1 )/2 if the mid-p-value rule is used and u∈E f1 (u) otherwise. Here E is a critical region, and we define f1 (u) = Pr(T1 = u | β1 = 0, T2 = t2 , . . . , Tp = tp , M = m) for ease of notation. There are two popular ways to define the critical region: the conditional probabilities test and the conditional scores test (Mehta and Patel 1995). The critical region when using the conditional probabilities test is all values of the sufficient statistic for β1 that have a probability less than or equal to that of the observed t1 , Ep = {u : f1 (u) ≤ f1 (t1 )}. The critical region of the conditional scores test is defined as all values of the sufficient statistic for β1 such that its score is greater than or equal to that of t1 , Es = u : (u − µ1 )2 /σ12 ≥ (t1 − µ1 )2 /σ12 ) Here µ1 and σ12 are the mean and variance of (T1 | β1 = 0, T2 = t2 , . . . , Tp = tp , M = m). 
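In compact notation, the two critical regions are
\[
E_p = \bigl\{\,u : f_1(u) \le f_1(t_1)\,\bigr\},
\qquad
E_s = \left\{\,u : \frac{(u-\mu_1)^2}{\sigma_1^2} \ge \frac{(t_1-\mu_1)^2}{\sigma_1^2}\right\}
\]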
The score statistic is defined as ∂`(β) ∂β 2 2 −1 ∂ `(β) −E ∂β 2 evaluated at H0: β = 0, where ` is the log of (1). The score test simplifies to (t−E [T |β])2 /var(T |β) (Hirji 2006), where the mean and variance are computed from the conditional distribution of the sufficient statistic with β = 0 and t is the observed sufficient statistic. Sufficient-statistic p-value The p-value for testing H0 : β1 = 0 versus the two-sided alternative when (T1 = t1 |T2 = t2 , . . . , Tp = tp ) is computed as 2×min(pl , pu ), where P u≤t c(u, t2 , . . . , tp , m) pl = P 1 c(u, t2 , . . . , tp , m) P u u≥t c(u, t2 , . . . , tp , m) pu = P 1 u c(u, t2 , . . . , tp , m) It is the probability of observing a more extreme T1 . exlogistic — Exact logistic regression 563 References Cox, D. R., and E. J. Snell. 1989. Analysis of Binary Data. 2nd ed. London: Chapman & Hall. Garsd, A. 1988. Schizophrenia and birth complications. Unpublished manuscript. Hirji, K. F. 2006. Exact Analysis of Discrete Data. Boca Raton: Chapman & Hall/CRC. Hirji, K. F., C. R. Mehta, and N. R. Patel. 1987. Computing distributions for exact logistic regression. Journal of the American Statistical Association 82: 1110–1117. Hirji, K. F., A. A. Tsiatis, and C. R. Mehta. 1989. Median unbiased estimation for binary data. American Statistician 43: 7–11. Hutto, C., W. P. Parks, S. Lai, M. T. Mastrucci, C. Mitchell, J. Muñoz, E. Trapido, I. M. Master, and G. B. Scott. 1991. A hospital-based prospective study of perinatal infection with human immunodeficiency virus type 1. Journal of Pediatrics 118: 347–353. Mehta, C. R., and N. R. Patel. 1995. Exact logistic regression: Theory and examples. Statistics in Medicine 14: 2143–2160. Also see [R] exlogistic postestimation — Postestimation tools for exlogistic [R] binreg — Generalized linear models: Extensions to the binomial family [R] clogit — Conditional (fixed-effects) logistic regression [R] expoisson — Exact Poisson regression [R] logistic — Logistic regression, reporting odds ratios [R] logit — Logistic regression, reporting coefficients [U] 20 Estimation and postestimation commands Title exlogistic postestimation — Postestimation tools for exlogistic Description Option for estat se Also see Syntax for estat Remarks and examples Menu for estat Stored results Options for estat predict Reference Description The following postestimation commands are of special interest after exlogistic: Command Description estat predict estat se single-observation prediction report ORs or coefficients and their asymptotic standard errors The following standard postestimation command is also available: Command Description estat summarize summary statistics for the estimation sample estat summarize is not allowed if the binomial() option was specified in exlogistic. See [R] estat summarize for details. Special-interest postestimation commands estat predict computes a predicted probability (or linear predictor), its asymptotic standard error, and its exact confidence interval for 1 observation. Predictions are carried out by estimating the constant coefficient after shifting the independent variables and conditioned variables by the values specified in the at() option or by their medians. Therefore, predictions must be done with the estimation sample in memory. If a different dataset is used or if the dataset is modified, then an error will result. estat se reports odds ratio or coefficients and their asymptotic standard errors. The estimates are stored in the matrix r(estimates). 
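As a brief sketch of estat se (the examples below illustrate estat predict), you might refit the model from example 1 of [R] exlogistic and then request the table; the reported estimates are left behind in r(estimates).
. use http://www.stata-press.com/data/r13/hiv1, clear
. exlogistic hiv cd4 cd8
. estat se                    // odds ratios with asymptotic standard errors
. estat se, coef              // the same table on the coefficient scale
. matrix list r(estimates)    // the reported estimates, saved as a matrix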
564 exlogistic postestimation — Postestimation tools for exlogistic 565 Syntax for estat Single-observation prediction estat predict , options Report ORs or coefficients and their asymptotic standard errors estat se , coef options Description pr xb at(atspec) level(#) memory(# b | k | m | g ) nolog probability; the default linear effect use the specified values for the indepvars and condvars() set confidence level for the predicted value; default is level(95) set limit on memory usage; default is memory(10m) do not display the enumeration log These statistics are available only for the estimation sample. Menu for estat Statistics > Postestimation > Reports and statistics Options for estat predict pr, the default, calculates the probability. xb calculates the linear effect. at(varname = # varname = # . . . ) specifies values to use in computing the predicted value. Here varname is one of the independent variables, indepvars, or the conditioned variables, condvars(). The default is to use the median of each independent and conditioned variable. level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. memory(# b | k | m | g ) sets a limit on the amount of memory estat predict can use when generating the conditional distribution of the constant parameter sufficient statistic. The default is memory(10m), where m stands for megabyte, or 1,048,576 bytes. The following are also available: b stands for byte; k stands for kilobyte, which is equal to 1,024 bytes; and g stands for gigabyte, which is equal to 1,024 megabytes. The minimum setting allowed is 1m and the maximum is 512m or 0.5g, but do not attempt to use more memory than is available on your computer. Also see Remarks and examples in [R] exlogistic for details on enumerating the conditional distribution. nolog prevents the display of the enumeration log. By default, the enumeration log is displayed showing the progress of enumerating the distribution of the observed successes conditioned on the independent variables shifted by the values specified in at() (or by their medians). See Methods and formulas in [R] exlogistic for details of the computations. 566 exlogistic postestimation — Postestimation tools for exlogistic Option for estat se coef requests that the estimated coefficients and their asymptotic standard errors be reported. The default is to report the odds ratios and their asymptotic standard errors. Remarks and examples Predictions must be done using the estimation sample. This is because the prediction is really an estimated constant coefficient (the intercept) after shifting the independent variables and conditioned variables by the values specified in at() or by their medians. The justification for this approach can be seen by rewriting the model as log πi 1 − πi = (α + x0 β) + (xi − x0 )β where x0 are the specified values for the indepvars (Mehta and Patel 1995). Because the estimation of the constant term is required, this technique is not appropriate for stratified models that used the group() option. Example 1 To demonstrate, we return to the example 2 in [R] exlogistic using data from a prospective study of perinatal infection and HIV-1. Here there was an investigation into whether the blood serum levels of CD4 and CD8 measured in infants at 6 months of age might predict their development of HIV infection. The blood serum levels are coded as ordinal values 0, 1, and 2. 
These data are used by Mehta and Patel (1995) as an exposition of exact logistic. . use http://www.stata-press.com/data/r13/hiv_n (prospective study of perinatal infection of HIV-1; binomial form) . generate byte cd4_0 = (cd4==0) . generate byte cd4_1 = (cd4==1) . generate byte cd8_0 = (cd8==0) . generate byte cd8_1 = (cd8==1) . exlogistic hiv cd4_0 cd4_1 cd8_0 cd8_1, terms(cd4=cd4_0 cd4_1, > cd8=cd8_0 cd8_1) binomial(n) test(probability) saving(dist) (output omitted ) exlogistic postestimation — Postestimation tools for exlogistic 567 . estat predict Enumerating sample-space combinations: observation 1: enumerations = 3 observation 2: enumerations = 12 observation 3: enumerations = 5 observation 4: enumerations = 5 observation 5: enumerations = 5 observation 6: enumerations = 35 observation 7: enumerations = 15 observation 8: enumerations = 15 observation 9: enumerations = 9 observation 10: enumerations = 9 observation 11: enumerations = 5 observation 12: enumerations = 18 note: CMLE estimate for _cons is -inf; computing MUE Predicted value at cd4_0 = 0, cd4_1 = 0, cd8_0 = 0, cd8_1 = 1 hiv Probability Predicted 0.0390* Std. Err. N/A [95% Conf. Interval] 0.0000 0.1962 (*) identifies median unbiased estimates (MUE); because an MUE is computed, there is no SE estimate Because we did not specify values by using the at() option, the median values of the indepvars are used for the prediction. By default, medians are used instead of means because we want to use values that are observed in the dataset. If the means of the binary variables cd4 0–cd8 1 were used, we would have created floating point variables in (0, 1) that not only do not properly represent the indicator variables but also would be a source of computational inefficiency in generating the conditional distribution. Because the MUE is computed for the predicted value, there is no standard-error estimate. From the example discussions in [R] exlogistic, the infants at highest risk are those with a CD4 level of 0 and a CD8 level of 2. Below we use the at() option to make a prediction at these blood serum levels. . estat predict, at(cd4_0=1 cd4_1=0 cd8_0=0 cd8_1=0) nolog note: CMLE estimate for _cons is +inf; computing MUE Predicted value at cd4_0 = 1, cd4_1 = 0, cd8_0 = 0, cd8_1 = 0 hiv Probability Predicted 0.9063* Std. Err. N/A [95% Conf. Interval] 0.4637 1.0000 (*) identifies median unbiased estimates (MUE); because an MUE is computed, there is no SE estimate 568 exlogistic postestimation — Postestimation tools for exlogistic Stored results estat predict stores the following in r(): Scalars r(imue) r(pred) r(se) Macros r(estimate) r(level) Matrices r(ci) r(x) 1 if r(pred) is an MUE and 0 if a CMLE estimated probability or the linear effect asymptotic standard error of r(pred) prediction type: pr or xb confidence level confidence interval indepvars and condvars() values Reference Mehta, C. R., and N. R. Patel. 1995. Exact logistic regression: Theory and examples. Statistics in Medicine 14: 2143–2160. 
Also see [R] exlogistic — Exact logistic regression [U] 20 Estimation and postestimation commands Title expoisson — Exact Poisson regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax expoisson depvar indepvars if in weight , options Description options Model condvars(varlist) group(varname) exposure(varnamee ) offset(varnameo ) condition on variables in varlist groups/strata are stratified by unique values of varname include ln(varnamee ) in model with coefficient constrained to 1 include varnameo in model with coefficient constrained to 1 Options memory(# b | k | m | g ) set limit on memory usage; default is memory(25m) saving(filename) save the joint conditional distribution to filename Reporting level(#) irr test(testopt) mue(varlist) midp nolog set confidence level; default is level(95) report incidence-rate ratios report significance of observed sufficient statistic, conditional scores test, or conditional probabilities test compute the median unbiased estimates for varlist use the mid-p-value rule do not display the enumeration log by, statsby, and xi are allowed; see [U] 11.1.10 Prefix commands. fweights are allowed; see [U] 11.1.6 weight. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Exact statistics > Exact Poisson regression Description expoisson fits an exact Poisson regression model of depvar on indepvars. Exact Poisson regression is an alternative to standard maximum-likelihood–based Poisson regression (see [R] poisson) that offers more accurate inference in small samples because it does not depend on asymptotic results. For stratified data, expoisson is an alternative to fixed-effects Poisson regression (see xtpoisson, fe in [XT] xtpoisson); like fixed-effects Poisson regression, exact Poisson regression conditions on the number of events in each stratum. 569 570 expoisson — Exact Poisson regression Exact Poisson regression is computationally intensive, so if you have regressors whose parameter estimates are not of interest (that is, nuisance parameters), you should specify those variables in the condvars() option instead of in indepvars. Options Model condvars(varlist) specifies variables whose parameter estimates are not of interest to you. You can save substantial computer time and memory by moving such variables from indepvars to condvars(). Understand that you will get the same results for x1 and x3 whether you type . expoisson y x1 x2 x3 x4 or . expoisson y x1 x3, condvars(x2 x4) group(varname) specifies the variable defining the strata, if any. A constant term is assumed for each stratum identified in varname, and the sufficient statistics for indepvars are conditioned on the observed number of successes within each group (as well as other variables in the model). The group variable must be integer valued. exposure(varnamee ), offset(varnameo ); see [R] estimation options. Options memory(# b | k | m | g ) sets a limit on the amount of memory expoisson can use when computing the conditional distribution of the parameter sufficient statistics. The default is memory(25m), where m stands for megabyte, or 1,048,576 bytes. The following are also available: b stands for byte; k stands for kilobyte, which is equal to 1,024 bytes; and g stands for gigabyte, which is equal to 1,024 megabytes. The minimum setting allowed is 1m and the maximum is 2048m or 2g, but do not attempt to use more memory than is available on your computer. 
Also see the first technical note under example 3 on counting the conditional distribution. saving(filename , replace ) saves the joint conditional distribution for each independent variable specified in indepvars. There is one file for each variable, and it is named using the prefix filename with the variable name appended. For example, saving(mydata) with an independent variable named X would generate a data file named mydata X.dta. Use replace to replace an existing file. Each file contains the conditional distribution for one of the independent variables specified in indepvars conditioned on all other indepvars and those variables specified in condvars(). There are two variables in each data file: the feasible sufficient statistics for the variable’s parameter and their associated weights. The weights variable is named w . Reporting level(#); see [R] estimation options. The level(#) option will not work on replay because confidence intervals are based on estimator-specific enumerations. To change the confidence level, you must refit the model. irr reports estimated coefficients transformed to incidence-rate ratios, that is, exp(β) rather than β . Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated or stored. irr may be specified at estimation or when replaying previously estimated results. test(sufficient | score | probability) reports the significance level of the observed sufficient statistic, the conditional scores test, or the conditional probabilities test. The default is test(sufficient). All the statistics are computed at estimation time, and each statistic may be displayed postestimation; see [R] expoisson postestimation. expoisson — Exact Poisson regression 571 mue(varlist) specifies that median unbiased estimates (MUEs) be reported for the variables in varlist. By default, the conditional maximum likelihood estimates (CMLEs) are reported, except for those parameters for which the CMLEs are infinite. Specify mue( all) if you want MUEs for all the indepvars. midp instructs expoisson to use the mid-p-value rule when computing the MUEs, significance levels, and confidence intervals. This adjustment is for the discreteness of the distribution by halving the value of the discrete probability of the observed statistic before adding it to the p-value. The mid-p-value rule cannot be MUEs whose corresponding parameter CMLE is infinite. nolog prevents the display of the enumeration log. By default, the enumeration log is displayed, showing the progress of computing the conditional distribution of the sufficient statistics. Remarks and examples Exact Poisson regression estimates the model parameters by using the conditional distributions of the parameters’ sufficient statistics, and the resulting parameter estimates are known as CMLEs. Exact Poisson regression is a small-sample alternative to the maximum-likelihood ML Poisson model. See [R] poisson and [XT] xtpoisson to obtain maximum likelihood estimates (MLEs) for the Poisson model and the fixed-effects Poisson model. Let Yi denote a Poisson random variable where we observe the outcome Yi = yi , i = 1, . . . , n. Associated with each independent observation is a 1 × p vector of covariates, xi . We will denote µi = E [Yi | xi ] and use the log linear model to model the relationship between Yi and xi , log (µi ) = θ + xi β where the constant term, θ, and the p × 1 vector of regression parameters, β, are unknown. 
The probability of observing Yi = yi , i = 1, . . . , n, is Pr(Y = y) = n Y µyi e−µi i i=1 yi ! where Y = (Y1 , . . . , Yn ) and y = (y1 , . . . , yn ). The MLEs for θ and β maximize the log of this function. Pn Pn The sufficient statistics for θ and βj , j = 1, . . . , p, are M = i=1 Yi and Tj = i=1 Yi xij , respectively, and we observe M = m and Tj = tj . expoisson tallies the conditional distribution for each Tj , given the other sufficient statistics Tl = tl , l 6= j and M = m. Denote one of these (k) values to be tj , k = 1, . . . , N , with weight wk that accounts for all the generated Y vectors that (k) give rise to tj . The conditional probability of observing Tj = tj has the form Pr(Tj = tj | Tl = tl , l 6= j, M = m) = P w etj βj k (k) (k) wk etj (1) βj (k) (k) (k) where the sum is over the subset of T vectors such that (T1 = t1 , . . . , Tj = tj , . . . , Tp = tp ) and w is the weight associated with the observed t. The CMLE for βj maximizes the log of this function. Specifying nuisance variables in condvars() prevents expoisson from estimating their associated regression coefficients. These variables are still conditional variables when tallying the conditional distribution for the variables in indepvars. 572 expoisson — Exact Poisson regression Inferences from MLEs rely on asymptotics, and if your sample size is small, these inferences may not be valid. On the other hand, inferences from the CMLEs are exact in that they use the conditional distribution of the sufficient statistics outlined above. For small datasets, the dependent variable can be completely determined by the data. Here the MLEs and the CMLEs are unbounded. When this occurs, expoisson will compute the MUE, the regression estimate that places the observed sufficient statistic at the median of the conditional distribution. See [R] exlogistic for a more thorough discussion of exact estimation and related statistics. Example 1 Armitage, Berry, and Matthews (2002, 499–501) fit a log-linear model to data containing the number of cerebrovascular accidents experienced by 41 men during a fixed period, each of whom had recovered from a previous cerebrovascular accident and was hypertensive. Sixteen men received treatment, and in the original data, there are three age groups (40–49, 50–59, ≥60), but we pool the first two age groups to simplify the example. Armitage, Berry, and Matthews point out that this was not a controlled trial, but the data are useful to inquire whether there is evidence of fewer accidents for the treatment group and if age may be an important factor. The dependent variable count contains the number of accidents, variable treat is an indicator for the treatment group (1 = treatment, 0 = control), and variable age is an indicator for the age group (0 = 40−59; 1 = ≥60). First, we load the data, list it, and tabulate the cerebrovascular accident counts by treatment and age group. . use http://www.stata-press.com/data/r13/cerebacc (cerebrovascular accidents in hypotensive-treated and control groups) . list 1. 2. 3. 4. 5. 6. 7. treat count age control control control control control 0 0 1 1 2 40/59 >=60 40/59 >=60 40/59 35. control 2 >=60 control 3 40/59 (output omitted ) treatment 0 40/59 36. 37. 38. 39. 40. treatment treatment treatment treatment treatment 0 0 0 1 1 40/59 40/59 40/59 40/59 40/59 41. treatment 1 40/59 . 
tabulate treat age [fw=count] hypotensiv e drug age group treatment 40/59 >=60 Total control treatment 15 4 10 0 25 4 Total 19 10 29 expoisson — Exact Poisson regression 573 Next we estimate the CMLE with expoisson and, for comparison, the MLE with poisson. . expoisson count treat age Estimating: treat Enumerating sample-space combinations: observation 1: enumerations = observation 2: enumerations = observation 3: enumerations = (output omitted ) observation 39: enumerations = observation 40: enumerations = observation 41: enumerations = Estimating: age Enumerating sample-space combinations: observation 1: enumerations = observation 2: enumerations = observation 3: enumerations = (output omitted ) observation 39: enumerations = observation 40: enumerations = observation 41: enumerations = Exact Poisson regression 11 11 11 410 410 30 5 15 15 455 455 30 Number of obs = count Coef. Suff. treat age -1.594306 -.5112067 4 10 2*Pr(Suff.) Log likelihood = Coef. treat age _cons -1.594306 -.5112067 .233344 Std. Err. .5573614 .4043525 .2556594 -3.005089 -1.416179 Number of obs LR chi2(2) Prob > chi2 Pseudo R2 -38.97981 count [95% Conf. Interval] 0.0026 0.2794 . poisson count treat age, nolog Poisson regression z -2.86 -1.26 0.91 41 P>|z| 0.004 0.206 0.361 -.4701708 .3429232 = = = = 41 10.64 0.0049 0.1201 [95% Conf. Interval] -2.686714 -1.303723 -.2677391 -.5018975 .2813096 .7344271 expoisson generates an enumeration log for each independent variable in indepvars. The conditional distribution of the parameter sufficient statistic is tallied for each independent variable. The conditional distribution for treat, for example, has 30 records containing the weights, wk , and (k) (k) feasible sufficient statistics, ttreat . In essence, the set of points (wk , ttreat ), k = 1, . . . , 30, tallied by expoisson now become the data to estimate the regression coefficient for treat, using (1) as the (k) likelihood. Remember that one of the 30 (wk , ttreat ) must contain the observed sufficient statistic, P41 ttreat = i=1 treati × counti = 4, and its relative position in the sorted set of points (sorted by (k) ttreat ) is how the sufficient-statistic significance is computed. This algorithm is repeated for the age variable. The regression coefficients for treat and age are numerically identical for both Poisson models. Both models indicate that the treatment is significant at reducing the rate of cerebrovascular accidents, ≈ e−1.59 ≈ 0.204, or a reduction of about 80%. There is no significant age effect. 574 expoisson — Exact Poisson regression The p-value for the treatment regression-coefficient sufficient statistic indicates that the treatment effect is a bit more significant than for the corresponding asymptotic Z statistic from poisson. However, the exact confidence intervals are wider than their asymptotic counterparts. Example 2 Agresti (2013, 129) used the data from Laird and Olivier (1981) to demonstrate the Poisson model for modeling rates. The data consist of patient survival after heart valve replacement operations. The sample consists of 109 patients that are classified by type of heart valve (aortic, mitral) and by age (<55, ≥55). Follow-up observations cover lengths from 3 to 97 months, and the time at risk, or exposure, is stored in the variable TAR. The response is whether the subject died. First, we take a look at the data and then estimate the incidence rates (IRs) with expoisson and poisson. . use http://www.stata-press.com/data/r13/heartvalve (heart valve replacement data) . list age 1. 2. 3. 
4. < < >= >= 55 55 55 55 valve deaths TAR aortic mitral aortic mitral 4 1 7 9 1259 2082 1417 1647 The age variable is coded 0 for age <55 and 1 for age ≥55, and the valve variable is coded 0 for the aortic valve and 1 for the mitral valve. The total number of deaths, M = 21, is small enough that enumerating the conditional distributions for age and valve type is feasible and asymptotic inferences associated with standard ML Poisson regression may be questionable. . expoisson deaths age valve, exposure(TAR) irr Estimating: age Enumerating sample-space combinations: observation 1: enumerations = 11 observation 2: enumerations = 11 observation 3: enumerations = 132 observation 4: enumerations = 22 Estimating: valve Enumerating sample-space combinations: observation 1: enumerations = 17 observation 2: enumerations = 17 observation 3: enumerations = 102 observation 4: enumerations = 22 Exact Poisson regression Number of obs = deaths IRR Suff. age valve ln(TAR) 3.390401 .7190197 1 16 10 (exposure) 2*Pr(Suff.) 0.0194 0.5889 4 [95% Conf. Interval] 1.182297 .2729881 11.86935 1.870068 expoisson — Exact Poisson regression 575 . poisson deaths age valve, exposure(TAR) irr nolog Poisson regression Number of obs LR chi2(2) Prob > chi2 Pseudo R2 Log likelihood = -8.1747285 deaths IRR Std. Err. age valve _cons ln(TAR) 3.390401 .7190197 .0018142 1 1.741967 .3150492 .0009191 (exposure) z 2.38 -0.75 -12.46 = = = = 4 7.62 0.0222 0.3178 P>|z| [95% Conf. Interval] 0.017 0.452 0.000 1.238537 .3046311 .0006722 9.280965 1.6971 .0048968 The CMLE and the MLE are numerically identical. The death rate for the older age group is about 3.4 times higher than the younger age group, and this difference is significant at the 5% level. This means that for every death in the younger group each month, we would expect about three deaths in the older group. The IR estimate for valve type is approximately 0.72, but it is not significantly different from one. The exact Poisson confidence intervals are a bit wider than the asymptotic CIs. You can use ir (see [ST] epitab) to estimate IRs and exact CIs for one covariate, and we compare these CIs with those from expoisson, where we estimate the incidence rate by using age only. . ir deaths age TAR age of patient Exposed Unexposed Total number of deaths time at risk 16 3064 5 3341 21 6405 Incidence rate .0052219 .0014966 .0032787 Point estimate Inc. rate diff. Inc. rate ratio Attr. frac. ex. Attr. frac. pop [95% Conf. Interval] .0037254 3.489295 .7134092 .5435498 .00085 1.221441 .1812948 (midp) Pr(k>=16) = (midp) 2*Pr(k>=16) = .0066007 12.17875 (exact) .9178898 (exact) 0.0049 (exact) 0.0099 (exact) . expoisson deaths age, exposure(TAR) irr midp nolog Exact Poisson regression Number of obs = deaths IRR Suff. age ln(TAR) 3.489295 1 16 (exposure) 2*Pr(Suff.) 0.0099 4 [95% Conf. Interval] 1.324926 10.64922 mid-p-value computed for the probabilities and CIs Both ir and expoisson give identical IRs and p-values. Both report the two-sided exact significance by using the mid-p-value rule that accounts for the discreteness in the distribution by subtracting p1/2 = Pr(T = t)/2 from pl = Pr(T ≤ t) and pg = Pr(T ≥ t), computing 2 × min(pl − p1/2 , pg − p1/2 ). By default, expoisson will not use the mid-p-value rule (when you exclude the midp option), and here the two-sided exact significance would be 2 × min(pl , pg ) = 0.0158. 
The confidence intervals differ because expoisson uses the mid-p-value rule when computing the confidence intervals, yet 576 expoisson — Exact Poisson regression ir does not. You can verify this by executing expoisson without the midp option for this example; you will get the same CIs as ir. You can replay expoisson to view the conditional scores test or the conditional probabilities test by using the test() option. . expoisson, test(score) irr Exact Poisson regression Number of obs = deaths IRR Score age ln(TAR) 3.489295 1 6.76528 (exposure) Pr>=Score 0.0113 4 [95% Conf. Interval] 1.324926 10.64922 mid-p-value computed for the probabilities and CIs All the statistics for expoisson are defined in Methods and formulas of [R] exlogistic. Apart from enumerating the conditional distributions for the logistic and Poisson sufficient statistics, computationally, the primary difference between exlogistic and expoisson is the weighting values in the likelihood for the parameter sufficient statistics. Example 3 In this example, we fabricate data that will demonstrate the difference between the CMLE and the MUE when the CMLE is not infinite. A difference in these estimates will be more pronounced when the probability of the coefficient sufficient statistic is skewed when plotted as a function of the regression coefficient. . clear . input y x y x 1. 0 2 2. 1 1 3. 1 0 4. 0 0 5. 0 .5 6. 1 .5 7. 2 .01 8. 3 .001 9. 4 .0001 10. end . expoisson y x, test(score) Enumerating sample-space combinations: observation 1: enumerations = 13 observation 2: enumerations = 91 observation 3: enumerations = 169 observation 4: enumerations = 169 observation 5: enumerations = 313 observation 6: enumerations = 313 observation 7: enumerations = 1469 observation 8: enumerations = 5525 observation 9: enumerations = 5479 Exact Poisson regression Number of obs = y Coef. Score x -1.534468 2.955316 Pr>=Score 0.0810 9 [95% Conf. Interval] -3.761718 .0485548 expoisson — Exact Poisson regression 577 . expoisson y x, test(score) mue(x) nolog Exact Poisson regression Number of obs = y x Coef. -1.309268* Score 2.955316 Pr>=Score 0.0810 9 [95% Conf. Interval] -3.761718 .0485548 (*) median unbiased estimates (MUE) P9 We observe (xi , yiP ), i = 1, . . . , 9. If we condition on m = i=1 yi = 12, the conditional distribution of Tx = Y x has a size of 5,479 elements. For each entry in this enumeration, i i i P (k) (k) a realization of Yi = yi , k = 1, . . . , 5,479, is generated such that = 12. One of these i yi P realizations produces the observed tx = i yi xi ≈1.5234. Below is a graphical display comparing the CMLE with the MUE. We plot Pr(Tx = tx | M = 12, βx ) versus βx , −6 ≤ βx ≤ 1, in the upper panel and the cumulative probabilities, Pr(Tx ≤ tx | M = 12, βx ) and Pr(Tx ≥ tx | M = 12, βx ), in the lower panel. MUE 0 cumulative probability .2 .4 .5 .6 .8 1 0 probability .0001 .0002 .0003 CMLE −6 −4 −2 x coefficient 0 1 The location of the CMLE, indicated by the dashed line, is at the mode of the probability profile, and (u) (l) the MUE, indicated by the dotted line, is to the right of the mode. If we solve for the βx and βx (u) (l) such that Pr(Tx ≤ tx | M = 12, βx ) = 1/2 and Pr(Tx ≥ tx | M = 12, βx ) = 1/2, the MUE is (u) (l) (βx + βx )/2. As you can see in the lower panel, the MUE cuts through the intersection of these cumulative probability profiles. 
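The same construction can be mimicked numerically. The Mata sketch below evaluates conditional probabilities of the form weight × exp(tβ), normalized over the support, for a small hypothetical set of weights wk and feasible sufficient statistics tk, bisects for the two half-probability points, and averages them. The toy numbers and function names are illustrative only; this is not expoisson's internal code, but with the enumerated support for a real covariate it is conceptually how the MUE arises.

mata:
// hypothetical toy conditional distribution: feasible sufficient statistics tk
// with weights wk; tobs is the observed value of the sufficient statistic
tk = (0 \ 1 \ 2 \ 3 \ 4)
wk = (1 \ 4 \ 6 \ 4 \ 1)
tobs = 1
// Pr(T <= tobs | b) when upper==0 and Pr(T >= tobs | b) when upper==1
real scalar cprob(real scalar b, real colvector tk, real colvector wk, real scalar tobs, real scalar upper)
{
    real colvector p
    p = wk :* exp(b :* tk)
    p = p :/ sum(p)
    return(upper ? sum((tk :>= tobs) :* p) : sum((tk :<= tobs) :* p))
}
// bisect for the root of cprob(.) = 1/2; Pr(T <= tobs | b) falls in b and
// Pr(T >= tobs | b) rises in b, so each root is bracketed on (-20, 20)
real scalar half(real colvector tk, real colvector wk, real scalar tobs, real scalar upper)
{
    real scalar lo, hi, mid, s, i
    s = (upper ? 1 : -1)               // makes the bracketed function increasing in b
    lo = -20
    hi = 20
    for (i = 1; i <= 60; i++) {
        mid = (lo + hi)/2
        if (s*(cprob(mid, tk, wk, tobs, upper) - .5) < 0) {
            lo = mid
        }
        else {
            hi = mid
        }
    }
    return((lo + hi)/2)
}
bl = half(tk, wk, tobs, 1)             // solves Pr(T >= tobs | b) = 1/2
bu = half(tk, wk, tobs, 0)             // solves Pr(T <= tobs | b) = 1/2
(bl + bu)/2                            // the MUE for this toy distribution
end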
578 expoisson — Exact Poisson regression Technical note The memory(#) option limits the amount of memory that expoisson will consume when computing the conditional distribution of the parameter sufficient statistics. memory() is independent of the data maximum memory setting (see set max memory in [D] memory), and it is possible for expoisson to exceed the memory limit specified in set max memory without terminating. By default, a log is provided that displays the number of enumerations (the size of the conditional distribution) after processing each observation. Typically, you will see the number of enumerations increase, and then at some point they will decrease as the multivariate shift algorithm (Hirji, Mehta, and Patel 1987) determines that some of the enumerations cannot achieve the observed sufficient statistics of the conditioning variables. When the algorithm is complete, however, it is necessary to store the conditional distribution of the parameter sufficient statistics as a dataset. It is possible, therefore, to get a memory error when the algorithm has completed if there is not enough memory to store the conditional distribution. Technical note Computing the conditional distributions and reported statistics requires data sorting and numerical comparisons. If there is at least one single-precision variable specified in the model, expoisson will make comparisons with a relative precision of 2−5 . Otherwise, a relative precision of 2−11 is used. Be careful if you use recast to promote a single-precision variable to double precision (see [D] recast). You might try listing the data in full precision (maybe %20.15g; see [D] format) to make sure that this is really what you want. See [D] data types for information on precision of numeric storage types. Stored results expoisson stores the following in e(): Scalars e(N) e(k groups) e(relative weight) e(sum y) e(k indvars) e(k condvars) e(midp) e(eps) Macros e(cmd) e(cmdline) e(title) e(depvar) e(indvars) e(condvars) e(groupvar) e(exposure) e(offset) e(level) e(wtype) e(wexp) e(datasignature) e(datasignaturevars) e(properties) e(estat cmd) e(marginsnotok) number of observations number of groups relative weight for the observed e(sufficient) and e(condvars) sum of depvar number of independent variables number of conditioning variables mid-p-value rule indicator relative difference tolerance expoisson command as typed title in estimation output name of dependent variable independent variables conditional variables group variable exposure variable linear offset variable confidence level weight type weight expression the checksum variables used in calculation of checksum b V program used to implement estat predictions disallowed by margins expoisson — Exact Poisson regression Matrices e(b) e(mue indicators) e(se) e(ci) e(sum y groups) e(N g) e(sufficient) e(p sufficient) e(scoretest) e(p scoretest) e(probtest) e(p probtest) coefficient vector indicator for elements of e(b) estimated using MUE instead of CMLE e(b) standard errors (CMLEs only) matrix of e(level) confidence intervals for e(b) sum of e(depvar) for each group number of observations in each group sufficient statistics for e(b) p-value for e(sufficient) conditional scores tests for indepvars p-values for e(scoretest) conditional probability tests for indepvars p-value for e(probtest) Functions e(sample) marks estimation sample 579 Methods and formulas Let {Y1 , Y2 , . . . , Yn } be a set of n independent Poisson random variables. For each i = 1, . . . 
, n, we observe Yi = yi ≥ 0, and associated with each observation is the covariate row vector of length p, xi = (xi1 , . . . , xip ). Denote β = (β1 , . . . , βp )T to be the column Pn vector of regression parameters and θ to be the constant. The sufficient statistic for β is T = = 1, . . . , p, and for θ is j j i=1 Yi xij , j P Pn Pn n M = i=1 Yi . We observe Tj = tj , tj = i=1 yi xij , and M = m, m = i=1 yi . Let κi be the exposure for the ith observation. Then the probability of observing (Y1 = y1 , Y2 = y2 , . . . , Yn = yn ) is n Y κyi i exp(mθ + tβ) Pn Pr(Y1 = y1 , . . . , Yn = yn | β, X, κ) = yi ! exp{ i=1 κi exp(θ + xi β)} i=1 where t = (t1 , . . . , tp ), X = (xT1 , . . . , xTn )T , and κ = (κ1 , . . . , κn )T . The joint distribution of the sufficient statistics (T, M ) is obtained by summing over all possible sequences Y1 ≥ 0, . . . , Yn ≥ 0 such that T = t and M = m. This probability function is exp(mθ + tβ) Pn Pr(T1 = t1 , . . . , Tp = tp , M = m | β, X, κ) = exp { i=1 κi exp(θ + xi β)} where Pn the sum i=1 ui xi = t. P u is over all nonnegative vectors u of length n such that n XY κui ! i u i=1 Pn i=1 ui ! ui = m and Conditional distribution Without loss of generality, we will restrict our discussion to the conditional distribution of the sufficient statistic for β1 , T1 . If we condition on observing M = m and T2 = t2 , . . . , Tp = tp , the probability function of (T1 | β1 , T2 = t2 , . . . , Tp = tp , M = m) is u κi i i=1 ui ! P Q n u Pr(T1 = t1 | β1 , T2 = t2 , . . . , Tp = tp , M = m) = P Q n v vi κi i=1 vi ! eβ1 et1 β1 P i vi xi1 (2) 580 expoisson — Exact Poisson regression P Pn where the sum all nonnegative vectors u of length n such that u is over P i=1 P Puni = m and n u x = t , and the sum is over all nonnegative vectors v of length n such that i=1 vi = m, Pni=1 i i Pn v v x = t , . . . , v x = t . The CMLE for β is the value that maximizes the log of 2 p 1 i=1 i i2 i=1 i ip (1). This optimization task is carried out by ml (see [R] ml), using the conditional distribution of (T1 | T2 = t2 , . . . , Tp = tp , M = m) as a dataset. This dataset consists of the feasible values and weights for T1 , ( s1 , n Y κvi i i=1 vi ! ! : n X vi = m, i=1 n X vi xi1 = s1 , i=1 n X vi xi2 = t2 , . . . , i=1 n X ) vi xip = tp i=1 Computing the CMLE, MUE, confidence intervals, conditional hypothesis tests, and sufficient statistic p-values is discussed in Methods and formulas of [R] exlogistic. The only difference between the two techniques is the use of the weights; that is, the weights for exact logistic are the combinatorial coefficients, c(t, m), in (1) of Methods and formulas in [R] exlogistic. expoisson and exlogistic use the same ml likelihood evaluator to compute the CMLEs as well as the same ado-programs and Mata functions to compute the MUEs and estimate statistics. References Agresti, A. 2013. Categorical Data Analysis. 3rd ed. Hoboken, NJ: Wiley. Armitage, P., G. Berry, and J. N. S. Matthews. 2002. Statistical Methods in Medical Research. 4th ed. Oxford: Blackwell. Cox, D. R., and E. J. Snell. 1989. Analysis of Binary Data. 2nd ed. London: Chapman & Hall. Hirji, K. F., C. R. Mehta, and N. R. Patel. 1987. Computing distributions for exact logistic regression. Journal of the American Statistical Association 82: 1110–1117. Laird, N. M., and D. Olivier. 1981. Covariance analysis of censored survival data using log-linear analysis techniques. Journal of the American Statistical Association 76: 231–240. 
Also see [R] expoisson postestimation — Postestimation tools for expoisson [R] poisson — Poisson regression [XT] xtpoisson — Fixed-effects, random-effects, and population-averaged Poisson models [U] 20 Estimation and postestimation commands Title expoisson postestimation — Postestimation tools for expoisson Description Remarks and examples Syntax for estat se Also see Menu for estat Option for estat se Description The following postestimation command is of special interest after expoisson: Command Description estat se report coefficients or IRRs and their asymptotic standard errors The following standard postestimation command is also available: Command Description estat summarize summary statistics for the estimation sample See [R] estat summarize for details. Special-interest postestimation command estat se reports regression coefficients or incidence-rate asymptotic standard errors. The estimates are stored in the matrix r(estimates). Syntax for estat se estat se , irr Menu for estat Statistics > Postestimation > Reports and statistics Option for estat se irr requests that the incidence-rate ratios and their asymptotic standard errors be reported. The default is to report the coefficients and their asymptotic standard errors. 581 582 expoisson postestimation — Postestimation tools for expoisson Remarks and examples Example 1 To demonstrate estat se after expoisson, we use the British physicians smoking data. . use http://www.stata-press.com/data/r13/smokes (cigarette smoking and lung cancer among British physicians (45-49 years)) . expoisson cases smokes, exposure(peryrs) irr nolog Exact Poisson regression Number of obs = 7 cases IRR Suff. smokes ln(peryrs) 1.077718 1 797.4 (exposure) cases IRR Std. Err. smokes 1.077718 2*Pr(Suff.) . estat se, irr .0168547 Also see [R] expoisson — Exact Poisson regression [U] 20 Estimation and postestimation commands 0.0000 [95% Conf. Interval] 1.04552 1.111866 Title fp — Fractional polynomial regression Syntax Options for fp Stored results References Menu Options for fp generate Methods and formulas Also see Description Remarks and examples Acknowledgment Syntax Estimation fp , est options : est cmd est cmd may be almost any estimation command that stores the e(ll) result. To confirm whether fp works with a specific est cmd, see the documentation for that est cmd. Instances of (with the angle brackets) that occur within est cmd are replaced in est cmd by a varlist containing the fractional powers of the variable term. These variables will be named term 1, term 2, . . . . fp performs est cmd with this substitution, fitting a fractional polynomial regression in term. est cmd in either this or the following syntax may not contain other prefix commands; see [U] 11.1.10 Prefix commands. Estimation (alternate syntax) fp (varname) , est options : est cmd Use this syntax to specify that fractional powers of varname are to be calculated. The fractional polynomial power variables will still be named term 1, term 2, . . . . Replay estimation results fp , replay options Create specified fractional polynomial power variables fp generate type newvar = varname^(numlist) if in , gen options 583 584 fp — Fractional polynomial regression est options Description Main Search powers(# # . . . #) dimension(#) powers to be searched; default is powers(-2 -1 -.5 0 .5 1 2 3) maximum degree of fractional polynomial; default is dimension(2) Or specify fp(# # . . . 
#) use specified fractional polynomial And then specify any of these options Options classic replace all scale(# a # b) scale center(# c) center zero catzero perform automatic scaling and centering and omit comparison table replace existing fractional polynomial power variables named term 1, term 2, . . . generate term 1, term 2, . . . in all observations; default is in observations if esample() use (term+a)/b; default is to use variable term as is specify a and b automatically report centered-on-c results; default is uncentered results specify c to be the mean of (scaled) term set term 1, term 2, . . . to zero if scaled term ≤ 0; default is to issue an error message same as zero and include term 0 = (term ≤ 0) among fractional polynomial power variables Reporting replay options specify how results are displayed replay options Description Reporting nocompare reporting options do not display model-comparison test results any options allowed by est cmd for replaying estimation results gen options Description Main replace scale(# a # b) scale center(# c) center zero catzero replace existing fractional polynomial power variables named term 1, term 2, . . . use (term+a)/b; default is to use variable term as is specify a and b automatically report centered-on-c results; default is uncentered results specify c to be the mean of (scaled) term set term 1, term 2, . . . to zero if scaled term ≤ 0; default is to issue an error message same as zero and include term 0 = (term ≤ 0) among fractional polynomial power variables fp — Fractional polynomial regression 585 Menu fp Statistics > Linear models and related > Fractional polynomials > Fractional polynomial regression > Fractional polynomials > Create fractional polynomial variables fp generate Statistics > Linear models and related Description fp : est cmd fits models with the “best”-fitting fractional polynomial substituted for wherever it appears in est cmd. fp : regress mpg foreign would fit a regression model of mpg on a fractional polynomial in weight and (linear) foreign. By specifying option fp(), you may set the exact powers to be used. Otherwise, a search through all possible fractional polynomials up to the degree set by dimension() with the powers set by powers() is performed. fp without arguments redisplays the previous estimation results, just as typing est cmd would. You can type either one. fp will include a fractional polynomial comparison table. fp generate creates fractional polynomial power variables for a given set of powers. For instance, fp : regress mpg foreign might produce the fractional polynomial weight(−2,−1) and store weight−2 in weight 1 and weight−1 in weight 2. Typing fp generate weight^(-2 -1) would allow you to create the same variables in another dataset. See [R] mfp for multivariable fractional polynomial models. Options for fp Main powers(# # . . . #) specifies that a search be performed and details about the search provided. powers() works with the dimension() option; see below. The default is powers(-2 -1 -.5 0 .5 1 2 3). dimension(#) specifies the maximum degree of the fractional polynomial to be searched. The default is dimension(2). If the defaults for both powers() and dimension() are used, then the fractional polynomial could be any of the following 44 possibilities: 586 fp — Fractional polynomial regression term(−2) term(−1) .. . term(3) term , term(−2) (−2) term , term(−1) .. . (−2) term(−2) , term(3) term(−1) , term(−2) .. . term(3) , term(3) fp(# # . . . 
#) specifies that no search be performed and that the fractional polynomial specified be used. fp() is an alternative to powers() and dimension(). Options classic performs automatic scaling and centering and omits the comparison table. Specifying classic is equivalent to specifying scale, center, and nocompare. replace replaces existing fractional polynomial power variables named term 1, term 2, . . . . all specifies that term 1, term 2, . . . be filled in for all observations in the dataset rather than just for those in e(sample). scale(# a # b) specifies that term be scaled in the way specified, namely, that (term+a)/b be calculated. All values of scaled term are required to be greater than zero unless you specify options zero or catzero. Values should not be too large or too close to zero, because by default, cubic powers and squared reciprocal powers will be considered. When scale(a b) is specified, values in the variable term are not modified; fp merely remembers to scale the values whenever powers are calculated. You will probably not use scale(a b) for values of a and b that you create yourself, although you could. It is usually easier just to generate a scaled variable. For instance, if term is age, and age in your data is required to be greater than or equal to 20, you might generate an age5 variable, for use as term: . generate age5 = (age-19)/5 scale(a b) is useful when you previously fit a model using automatic scaling (option scale) in one dataset and now want to create the fractional polynomials in another. In the first dataset, fp with scale added notes to the dataset concerning the values of a and b. You can see them by typing . notes You can then use fp generate, scale(a b) in the second dataset. The default is to use term as it is used in calculating fractional powers; thus term’s values are required to be greater than zero unless you specify options zero or catzero. Values should not be too large, because by default, cubic powers will be considered. scale specifies that term be scaled to be greater than zero and not too large in calculating fractional powers. See Scaling for more details. When scale is specified, values in the variable term are not modified; fp merely remembers to scale the values whenever powers are calculated. fp — Fractional polynomial regression 587 center(# c) reports results for the fractional polynomial in (scaled) term, centered on c. The default is to perform no centering. term(p1 ,p2 ,...,pm ) -c(p1 ,p2 ,...,pm ) is reported. This makes the constant coefficient (intercept) easier to interpret. See Centering for more details. center performs center(c), where c is the mean of (scaled) term. zero and catzero specify how nonpositive values of term are to be handled. By default, nonpositive values of term are not allowed, because we will be calculating natural logarithms and fractional powers of term. Thus an error message is issued. zero sets the fractional polynomial value to zero for nonpositive values of (scaled) term. catzero sets the fractional polynomial value to zero for nonpositive values of (scaled) term and includes a dummy variable indicating where nonpositive values of (scaled) term appear in the model. Reporting nocompare suppresses display of the comparison tests. reporting options are any options allowed by est cmd for replaying estimation results. Options for fp generate Main replace replaces existing fractional polynomial power variables named term 1, term 2, . . . . 
scale(# a # b) specifies that term be scaled in the way specified, namely, that (term+a)/b be calculated. All values of scaled term are required to be greater than zero unless you specify options zero or catzero. Values should not be too large or too close to zero, because by default, cubic powers and squared reciprocal powers will be considered. When scale(a b) is specified, values in the variable term are not modified; fp merely remembers to scale the values whenever powers are calculated. You will probably not use scale(a b) for values of a and b that you create yourself, although you could. It is usually easier just to generate a scaled variable. For instance, if term is age, and age in your data is required to be greater than or equal to 20, you might generate an age5 variable, for use as term: . generate age5 = (age-19)/5 scale(a b) is useful when you previously fit a model using automatic scaling (option scale) in one dataset and now want to create the fractional polynomials in another. In the first dataset, fp with scale added notes to the dataset concerning the values of a and b. You can see them by typing . notes You can then use fp generate, scale(a b) in the second dataset. The default is to use term as it is used in calculating fractional powers; thus term’s values are required to be greater than zero unless you specify options zero or catzero. Values should not be too large, because by default, cubic powers will be considered. scale specifies that term be scaled to be greater than zero and not too large in calculating fractional powers. See Scaling for more details. When scale is specified, values in the variable term are not modified; fp merely remembers to scale the values whenever powers are calculated. 588 fp — Fractional polynomial regression center(# c) reports results for the fractional polynomial in (scaled) term, centered on c. The default is to perform no centering. term(p1 ,p2 ,...,pm ) -c(p1 ,p2 ,...,pm ) is reported. This makes the constant coefficient (intercept) easier to interpret. See Centering for more details. center performs center(c), where c is the mean of (scaled) term. zero and catzero specify how nonpositive values of term are to be handled. By default, nonpositive values of term are not allowed, because we will be calculating natural logarithms and fractional powers of term. Thus an error message is issued. zero sets the fractional polynomial value to zero for nonpositive values of (scaled) term. catzero sets the fractional polynomial value to zero for nonpositive values of (scaled) term and includes a dummy variable indicating where nonpositive values of (scaled) term appear in the model. Remarks and examples Remarks are presented under the following headings: Fractional polynomial regression Scaling Centering Examples Fractional polynomial regression Regression models based on fractional polynomial functions of a continuous covariate are described by Royston and Altman (1994). Fractional polynomials increase the flexibility afforded by the family of conventional polynomial models. Although polynomials are popular in data analysis, linear and quadratic functions are limited in their range of curve shapes, whereas cubic and higher-order curves often produce undesirable artifacts such as edge effects and waves. Fractional polynomials differ from regular polynomials in that 1) they allow logarithms, 2) they allow noninteger powers, and 3) they allow powers to be repeated. 
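As a concrete do-file sketch of these three ingredients, the power terms that fp generate builds automatically (and names term_1, term_2, . . . ) can also be created by hand for a hypothetical positive variable x; the variable and names below are illustrative, and the repeated-power rule is formalized in the notation that follows.

generate double x_ln  = ln(x)          // power 0 is read as ln(x), not x^0 = 1
generate double x_m05 = x^(-0.5)       // a noninteger (and here negative) power
generate double x_rep = (x^2)*ln(x)    // a repeated power of 2 picks up an extra ln(x) factor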
We will write a fractional polynomial in x as x(p1 ,p2 ,...,pm )0 β We will write x(p) to mean a regular power except that x(0) is to be interpreted as meaning ln(x) rather than x(0) = 1. Then if there are no repeated powers in (p1 , p2 , . . . , pm ), x(p1 ,p2 ,...,pm )0 β = β0 + β1 x(p1 ) + β2 x(p2 ) + · · · + βm x(pm ) Powers are allowed to repeat in fractional polynomials. Each time a power repeats, it is multiplied by another ln(x). As an extreme case, consider the fractional polynomial with all-repeated powers, say, m of them, x(p,p,...,p)0 β = β0 + β1 x(p) + β2 x(p) ln(x) + · · · + βm x(p) {ln(x)}m−1 fp — Fractional polynomial regression 589 Thus the fractional polynomial x(0,0,2)0 β would be x(0,0,2)0 β = β0 + β1 x(0) + β2 x(0) ln(x) + β3 x(2) = β0 + β1 ln(x) + β2 {ln(x)}2 + β3 x2 With this definition, we can obtain a much wider range of shapes than can be obtained with regular polynomials. The following graphs appeared in Royston and Sauerbrei (2008, sec. 4.5). The first graph shows the shapes of differing fractional polynomials. −2 −1 −0.5 0 0.5 1 2 3 590 fp — Fractional polynomial regression The second graph shows some of the curve shapes available with different βs for the degree-2 fractional polynomial, x(−2,2) . In modeling a fractional polynomial, Royston and Sauerbrei (2008) recommend choosing powers from among {−2, −1, −0.5, 0, 0.5, 1, 2, 3}. By default, fp chooses powers from this set, but other powers can be explicitly specified in the powers() option. fp : est cmd fits models with the terms of the best-fitting fractional polynomial substituted for wherever it appears in est cmd. We will demonstrate with auto.dta, which contains repair records and other information about a variety of vehicles in 1978. We use fp to find the best fractional polynomial in automobile weight (lbs.) (weight) for the linear regression of miles per gallon (mpg) on weight and an indicator of whether the vehicle is foreign (foreign). By default, fp will fit degree-2 fractional polynomial (FP2) models and choose the fractional powers from the set {−2, −1, −0.5, 0, 0.5, 1, 2, 3}. Because car weight is measured in pounds and will have a cubic transformation applied to it, we shrink it to a smaller scale before estimation by dividing by 1,000. We modify the existing weight variable for conciseness and to facilitate the comparison of tables. When applying a data transformation in practice, rather than modifying the existing variables, you should create new variables that hold the transformed values. fp — Fractional polynomial regression 591 . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . replace weight = weight/1000 weight was int now float (74 real changes made) . fp : regress mpg foreign (fitting 44 models) (....10%....20%....30%....40%....50%....60%....70%....80%....90%....100%) Fractional polynomial comparisons: weight df Deviance Res. s.d. Dev. dif. P(*) Powers omitted linear m = 1 m = 2 0 1 2 4 456.347 388.366 381.806 381.131 5.356 3.407 3.259 3.268 75.216 7.236 0.675 0.000 0.000 0.082 0.733 -- 1 -.5 -2 -2 (*) P = sig. level of model with m = 2 based on F with 68 denominator dof. Source SS df MS Number of obs = F( 3, 70) = Model 1696.05949 3 565.353163 Prob > F = Residual 747.399969 70 10.6771424 R-squared = Adj R-squared = Total 2443.45946 73 33.4720474 Root MSE = mpg Coef. weight_1 weight_2 foreign _cons 15.88527 127.9349 -2.222515 3.705981 Std. Err. 
20.60329 47.53106 1.053782 3.367949 t 0.77 2.69 -2.11 1.10 P>|t| 0.443 0.009 0.039 0.275 74 52.95 0.0000 0.6941 0.6810 3.2676 [95% Conf. Interval] -25.20669 33.13723 -4.324218 -3.011182 56.97724 222.7326 -.1208131 10.42314 fp begins by showing the model-comparison table. This table shows the best models of each examined degree, obtained by searching through all possible power combinations. The fractional powers of the models are shown in the Powers column. A separate row is provided for the linear fractional polynomial because it is often the default used when including a predictor in the model. The null model does not include any fractional polynomial terms for weight. The df column shows the count of the additional parameters used in each model beyond the quantity of parameters used in the null model. The model deviance, which we define as twice the negative log likelihood, is given in the Deviance column. The difference of the model deviance from the deviance of the model with the lowest deviance is given in the Dev. dif. column. The p-value for the partial F test comparing the models and the lowest deviance model is given in the P(*) column. An estimate of the residual standard error is given in the Res. Sd. column. Under linear regression, a partial F test is used in the model-comparison table. In other settings, a likelihood-ratio test is performed. Then a χ2 statistic is reported. Under robust variance estimation and some other cases (see [R] lrtest), the likelihood-ratio test cannot be performed. When the likelihood-ratio test cannot be performed on the model specified in est cmd, fp still reports the model-comparison table, but the comparison tests are not performed. fp reports the “best” model as the model with the lowest deviance; however, users may choose a more efficient model based on the comparison table. They may choose the lowest degree model that the partial F test (or likelihood-ratio test) fails to reject in favor of the lowest deviance model. After the comparison table, the results of the estimation command for the lowest deviance model are shown. Here the best model has terms weight(−2,−2) . However, based on the model-comparison 592 fp — Fractional polynomial regression table, we can reject the model without weight and the linear model at the 0.1 significance level. We fail to reject the m = 1 model at any reasonable level. We will choose the FP1 model, which includes weight(−.5) . We use fp again to estimate the parameters for this model. We use the fp() option to specify what powers we want to use; this option specifies that we do not want to perform a search for the best powers. We also specify the replace option to overwrite the previously created fractional polynomial power variables. . fp , fp(-.5) replace: regress mpg foreign -> regress mpg weight_1 foreign Source SS df MS Number of obs F( 2, 71) 1689.20865 2 844.604325 Prob > F Model Residual 754.25081 71 10.6232508 R-squared Adj R-squared Total 2443.45946 73 33.4720474 Root MSE mpg Coef. weight_1 foreign _cons 66.89665 -2.095622 -17.58651 Std. Err. 6.021749 1.043513 3.397992 t 11.11 -2.01 -5.18 P>|t| 0.000 0.048 0.000 = = = = = = 74 79.51 0.0000 0.6913 0.6826 3.2593 [95% Conf. Interval] 54.88963 -4.176329 -24.36192 78.90368 -.0149157 -10.81111 Alternatively, we can use fp generate to create the fractional polynomial variable corresponding to weight(−.5) and then use regress. We store weight(−.5) in the new variable wgt nsqrt. . fp generate wgt_nsqrt=weight^(-.5) . 
regress mpg wgt_nsqrt foreign Source SS df MS Model Residual 1689.20874 754.250718 2 71 844.604371 10.6232495 Total 2443.45946 73 33.4720474 mpg Coef. wgt_nsqrt_1 foreign _cons 66.89665 -2.095622 -17.58651 Std. Err. 6.021748 1.043513 3.397991 t 11.11 -2.01 -5.18 Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.048 0.000 = = = = = = 74 79.51 0.0000 0.6913 0.6826 3.2593 [95% Conf. Interval] 54.88963 -4.176328 -24.36191 78.90368 -.0149155 -10.81111 Scaling Fractional polynomials are only defined for positive term variables. By default, fp will assume that the variable x is positive and attempt to compute fractional powers of x. If the positive value assumption is incorrect, an error will be reported and estimation will not be performed. If the values of the variable are too large or too small, the reported results of fp may be difficult to interpret. By default, cubic powers and squared reciprocal powers will be considered in the search for the best fractional polynomial in term. We can scale the variable x to 1) make it positive and 2) ensure its magnitude is not too large or too small. Suppose you have data on hospital patients with age as a fractional polynomial variable of interest. age is required to be greater than or equal to 20, so you might generate an age5 variable by typing fp — Fractional polynomial regression 593 . generate age5 = (age-19)/5 A unit change in age5 is equivalent to a five-year change in age, and the minimum value of age5 is 1/5 instead of 20. In the automobile example of Fractional polynomial regression, our term variable was automobile weight (lbs.). Cars weigh in the thousands of pounds, so cubing their weight figures results in large numbers. We prevented this from being a problem by shrinking the weight by 1,000; that is, we typed . replace weight = weight/1000 Calendar year is another type of variable that can have a problematically large magnitude. We can shrink this by dividing by 10, making a unit change correspond to a decade. . generate decade = calendar_year/10 You may also have a variable that measures deviation from zero. Perhaps x has already been demeaned and is symmetrically about zero. The fractional polynomial in x will be undefined for half of its domain. We can shift the location of x, making it positive by subtracting its minimum and adding a small number to it. Suppose x ranges from −4 to 4; we could use . generate newx = x+5 Rescaling ourselves provides easily communicated results. We can tell exactly how the scaling was performed and how it should be performed in similar applications. Alternatively, fp can scale the fractional polynomial variable so that its values are positive and the magnitude of the values are not too large. This can be done automatically or by directly specifying the scaling values. Scaling can be automatically performed with fp by specifying the scale option. If term has nonpositive values, the minimum value of term is subtracted from each observation of term. In this case, the counting interval, the minimum distance between the sorted values of term, is also added to each observation of term. After adjusting the location of term so that its minimum value is positive, creating term∗ , automatic scaling will divide each observation of term by a power of ten. 
The exponent of this scaling factor is given by p = log10 {max(term∗ ) − min(term∗ )} p∗ = sign(p)floor (|p|) Rather than letting fp automatically choose the scaling of term, you may specify adjustment and scale factors a and b by using the scale(a b) option. Fractional powers are then calculated using the (term+a)/b values. When scale or scale(a b) is specified, values in the variable term are not modified; fp merely remembers to scale the values whenever powers are calculated. In addition to fp, both scale and scale(a b) may be used with fp generate. You will probably not use scale(a b) with fp for values of a and b that you create yourself, although you could. As we demonstrated earlier, it is usually easier just to generate a scaled variable. scale(a b) is useful when you previously fit a model using scale in one dataset and now want to create the fractional polynomials in another. In the first dataset, fp with scale added notes to the dataset concerning the values of a and b. You can see them by typing . notes 594 fp — Fractional polynomial regression You can then use fp generate, scale(a b) in the second dataset. When you apply the scaling rules of a previously fit model to new data with the scale(a b) option, it is possible that the scaled term may have nonpositive values. fp will be unable to calculate the fractional powers of the term in this case and will issue an error. The options zero and catzero cause fp and fp generate to output zero values for each fractional polynomial variable when the input (scaled) fractional polynomial variable is nonpositive. Specifying catzero causes a dummy variable indicating nonpositive values of the (scaled) fractional polynomial variable to be included in the model. A detailed example of the use of catzero and zero is shown in example 3 below. Using the scaling options, we can fit our previous model again using the auto.dta. We specify scale(0 1000) so that fp will shrink the magnitude of weight in estimating the regression. This is done for demonstration purposes because our scaling rule is simple. As mentioned before, in practice, you would probably only use scale(a b) when applying the scaling rules from a previous analysis. Allowing fp to scale does have the advantage of not altering the original variable, weight. . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . fp , fp(-.5) scale(0 1000): regress mpg foreign -> regress mpg weight_1 foreign Source SS df MS Number of obs F( 2, 71) 1689.20861 2 844.604307 Prob > F Model Residual 754.250846 71 10.6232514 R-squared Adj R-squared Total 2443.45946 73 33.4720474 Root MSE mpg Coef. weight_1 foreign _cons 66.89665 -2.095622 -17.58651 Std. Err. 6.021749 1.043513 3.397992 t 11.11 -2.01 -5.18 P>|t| = = = = = = 74 79.51 0.0000 0.6913 0.6826 3.2593 [95% Conf. Interval] 0.000 0.048 0.000 54.88963 -4.176329 -24.36192 78.90368 -.0149159 -10.81111 The scaling is clearly indicated in the variable notes for the generated variable weight 1. . notes weight_1 weight_1: 1. fp term 1 of x^(-.5), where x is weight scaled. 2. Scaling was user specified: x = (weight+a)/b where a=0 and b=1000 3. Fractional polynomial variables created by fp , fp(-.5) scale(0 1000): regress mpg foreign 4. 
To re-create the fractional polynomial variables, for instance, in another dataset, type fp gen double weight^(-.5), scale(0 1000) Centering The fractional polynomial of term, centered on c is term(p1 ,...,pm ) − c(p1 ,...,pm ) 0 β The intercept of a centered fractional polynomial can be interpreted as the effect at zero for all the covariates. When we center the fractional polynomial terms using c, the intercept is now interpreted as the effect at term = c and zero values for the other covariates. fp — Fractional polynomial regression 595 Suppose we wanted to center the fractional polynomial of x with powers (0, 0, 2) at x = c. x(0,0,2) − c(0,0,2) 0 β n o = β0 + β1 x(0) − c(0) + β2 x(0) ln(x) − c(0) ln(c) + β3 x(2) − c(2) = β0 + β1 {ln(x) − ln(c)} + β2 {ln(x)}2 − {ln(c)}2 + β3 x2 − c2 When center is specified, fp centers based on the sample mean of (scaled) term. A previously chosen value for centering, c, may also be specified in center(c). This would be done when applying the results of a previous model fitting to a new dataset. The center and center(c) options may be used in fp or fp generate. Returning to the model of mileage per gallon based on automobile weight and foreign origin, we refit the model with the fractional polynomial of weight centered at its scaled mean. . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . fp , fp(-.5) scale(0 1000) center: regress mpg foreign -> regress mpg weight_1 foreign Source SS df MS Number of obs = 74 F( 2, 71) = 79.51 Model 1689.20861 2 844.604307 Prob > F = 0.0000 754.250846 71 10.6232514 R-squared = 0.6913 Residual Adj R-squared = 0.6826 2443.45946 73 33.4720474 Root MSE = 3.2593 Total mpg Coef. weight_1 foreign _cons 66.89665 -2.095622 20.91163 Std. Err. 6.021749 1.043513 .4624143 t 11.11 -2.01 45.22 P>|t| 0.000 0.048 0.000 [95% Conf. Interval] 54.88963 -4.176329 19.9896 78.90368 -.0149159 21.83366 Note that the coefficients for weight 1 and foreign do not change. Only the intercept cons changes. It can be interpreted as the estimated average miles per gallon of an American-made car of average weight. 596 fp — Fractional polynomial regression Examples Example 1: Linear regression Consider the serum immunoglobulin G (IgG) dataset from Isaacs et al. (1983), which consists of 298 independent observations in young children. The dependent variable sqrtigg is the square root of the IgG concentration, and the independent variable age is the age of each child. (Preliminary Box – Cox analysis shows that a square root transformation removes the skewness in IgG.) The aim is to find a model that accurately predicts the mean of sqrtigg given age. We use fp to find the best FP2 model (the default option). We specify center for automatic centering. The age of each child is small in magnitude and positive, so we do not use the scaling options of fp or scale ourselves. . use http://www.stata-press.com/data/r13/igg, clear (Immunoglobulin in children) . fp , scale center: regress sqrtigg (fitting 44 models) (....10%....20%....30%....40%....50%....60%....70%....80%....90%....100%) Fractional polynomial comparisons: age df Deviance Res. s.d. omitted linear m = 1 m = 2 0 1 2 4 427.539 337.561 327.436 319.448 0.497 0.428 0.421 0.416 Dev. dif. 108.090 18.113 7.987 0.000 P(*) 0.000 0.000 0.020 -- Powers 1 0 -2 2 (*) P = sig. level of model with m = 2 based on F with 293 denominator dof. 
Source SS df MS Number of obs F( 2, 295) 22.2846976 2 11.1423488 Prob > F Model Residual 50.9676492 295 .172771692 R-squared Adj R-squared Total 73.2523469 297 .246640898 Root MSE sqrtigg Coef. age_1 age_2 _cons -.1562156 .0148405 2.283145 Std. Err. .027416 .0027767 .0305739 t -5.70 5.34 74.68 P>|t| 0.000 0.000 0.000 = = = = = = 298 64.49 0.0000 0.3042 0.2995 .41566 [95% Conf. Interval] -.2101713 .0093757 2.222974 -.10226 .0203052 2.343315 The new variables created by fp contain the best-fitting fractional polynomial powers of age, as centered by fp. For example, age 1 is centered by subtracting the mean of age raised to the power −2. The variables created by fp and fp generate are centered or scaled as specified by the user, which is reflected in the estimated regression coefficients and intercept. Centering does have its advantages (see Centering earlier in this entry). By default, fp will not perform scaling or centering. For a more detailed discussion, see Royston and Sauerbrei (2008, sec. 4.11). The fitted curve has an asymmetric S shape. The best model has powers (−2, 2) and deviance 319.448. We reject lesser degree models: the null, linear, and natural log power models at the 0.05 level. As many as 44 models have been fit in the search for the best powers. Now let’s look at models of degree ≤ 4. The highest allowed degree is specified in dimension(). We overwrite the previously generated fractional polynomial power variables by including replace. fp — Fractional polynomial regression 597 . fp , dimension(4) center replace: regress sqrtigg (fitting 494 models) (....10%....20%....30%....40%....50%....60%....70%....80%....90%....100%) Fractional polynomial comparisons: age df Deviance Res. s.d. omitted linear m = 1 m = 2 m = 3 m = 4 0 1 2 4 6 8 427.539 337.561 327.436 319.448 319.275 317.744 0.497 0.428 0.421 0.416 0.416 0.416 Dev. dif. 109.795 19.818 9.692 1.705 1.532 0.000 P(*) Powers 0.000 0.007 0.149 0.798 0.476 -- 1 0 -2 2 -2 1 1 0 3 3 3 (*) P = sig. level of model with m = 4 based on F with 289 denominator dof. Source SS df MS Number of obs F( 4, 293) 22.5754541 4 5.64386353 Prob > F Model Residual 50.6768927 293 .172958678 R-squared Adj R-squared Total 73.2523469 297 .246640898 Root MSE sqrtigg Coef. age_1 age_2 age_3 age_4 _cons .8761824 -.1922029 .2043794 -.0560067 2.238735 Std. Err. .1898721 .0684934 .074947 .0212969 .0482705 t 4.61 -2.81 2.73 -2.63 46.38 P>|t| 0.000 0.005 0.007 0.009 0.000 = = = = = = 298 32.63 0.0000 0.3082 0.2987 .41588 [95% Conf. Interval] .5024962 -.3270044 .0568767 -.097921 2.143734 1.249869 -.0574015 .3518821 -.0140924 2.333736 It appears that the FP4 model is not significantly different from the other fractional polynomial models (at the 0.05 level). Let’s compare the curve shape from the m = 2 model with that from a conventional quartic polynomial whose fit turns out to be significantly better than a cubic (not shown). We use the ability of fp both to generate the required powers of age, namely, (1, 2, 3, 4) for the quartic and (−2, 2) for the second-degree fractional polynomial, and to fit the model. The fp() option is used to specify the powers. We use predict to obtain the fitted values of each regression. We fit both models with fp and graph the resulting curves with twoway scatter. 598 fp — Fractional polynomial regression . 
fp , center fp(1 2 3 4) replace: regress sqrtigg -> regress sqrtigg age_1 age_2 age_3 age_4 SS df MS Number of obs Source F( 4, 293) 22.5835458 4 5.64588646 Prob > F Model Residual 50.668801 293 .172931061 R-squared Adj R-squared 73.2523469 297 .246640898 Root MSE Total sqrtigg Coef. age_1 age_2 age_3 age_4 _cons 2.047831 -1.058902 .2284917 -.0168534 2.240012 Std. Err. .4595962 .2822803 .0667591 .0053321 .0480157 t 4.46 -3.75 3.42 -3.16 46.65 P>|t| 0.000 0.000 0.001 0.002 0.000 Coef. age_1 age_2 _cons -.1562156 .0148405 2.283145 Std. Err. .027416 .0027767 .0305739 t -5.70 5.34 74.68 P>|t| 0.000 0.000 0.000 298 32.65 0.0000 0.3083 0.2989 .41585 [95% Conf. Interval] 1.143302 -1.614456 .0971037 -.0273475 2.145512 . predict fit1 (option xb assumed; fitted values) . label variable fit1 "Quartic" . fp , center fp(-2 2) replace: regress sqrtigg -> regress sqrtigg age_1 age_2 Source SS df MS Number of obs F( 2, 295) Model 22.2846976 2 11.1423488 Prob > F Residual 50.9676492 295 .172771692 R-squared Adj R-squared Total 73.2523469 297 .246640898 Root MSE sqrtigg = = = = = = 2.952359 -.5033479 .3598798 -.0063594 2.334511 = = = = = = 298 64.49 0.0000 0.3042 0.2995 .41566 [95% Conf. Interval] -.2101713 .0093757 2.222974 . predict fit2 (option xb assumed; fitted values) . label variable fit2 "FP 2" . scatter sqrtigg fit1 fit2 age, c(. l l) m(o i i) msize(small) > lpattern(. -_.) ytitle("Square root of IgG") xtitle("Age, years") -.10226 .0203052 2.343315 599 1 Square root of IgG 2 3 4 fp — Fractional polynomial regression 0 2 4 6 Age, years Square root of IgG FP 2 Quartic The quartic curve has an unsatisfactory wavy appearance that is implausible for the known behavior of IgG, the serum level of which increases throughout early life. The fractional polynomial curve (FP2) increases monotonically and is therefore biologically the more plausible curve. The two models have approximately the same deviance. Example 2: Cox regression Data from Smith et al. (1992) contain times to complete healing of leg ulcers in a randomized, controlled clinical trial of two treatments in 192 elderly patients. Several covariates were available, of which an important one is mthson, the number of months since the recorded onset of the ulcer. This time is recorded in whole months, not fractions of a month; therefore, some zero values are recorded. Because the response variable is time to an event of interest and some (in fact, about one-half) of the times are censored, using Cox regression to analyze the data is appropriate. We consider fractional polynomials in mthson, adjusting for four other covariates: age; ulcarea, the area of tissue initially affected by the ulcer; deepppg, a binary variable indicating the presence or absence of deep vein involvement; and treat, a binary variable indicating treatment type. We fit fractional polynomials of degrees 1 and 2 with fp. We specify scale to perform automatic scaling on mthson. This makes it positive and ensures that its magnitude is not too large. (See Scaling for more details.) The display option nohr is specified before the colon so that the coefficients and not the hazard ratios are displayed. The center option is specified to obtain automatic centering. age and ulcarea are also demeaned by using summarize and then subtracting the returned result r(mean). In Cox regression, there is no constant term, so we cannot see the effects of centering in the table of regression estimates. 
The effects would be present if we were to graph the baseline hazard or survival function because these functions are defined with all predictors set equal to 0. In these graphs, we will see the estimated baseline hazard or survival function under no deep vein involvement or treatment and under mean age, ulcer area, and number of months since the recorded onset of the ulcer. 600 fp — Fractional polynomial regression . use http://www.stata-press.com/data/r13/legulcer1, clear (Leg ulcer clinical trial) . stset ttevent, fail(cens) failure event: censored != 0 & censored < . obs. time interval: (0, ttevent] exit on or before: failure 192 0 total observations exclusions 192 92 13825 observations remaining, representing failures in single-record/single-failure data total analysis time at risk and under observation at risk from t = 0 earliest observed entry t = 0 last observed exit t = 206 . qui sum age . replace age = age - r(mean) age was byte now float (192 real changes made) . qui sum ulcarea . replace ulcarea = ulcarea - r(mean) ulcarea was int now float (192 real changes made) . fp , center scale nohr: stcox age ulcarea deepppg treat (fitting 44 models) (....10%....20%....30%....40%....50%....60%....70%....80%....90%....100%) Fractional polynomial comparisons: mthson df Deviance Dev. dif. P(*) Powers omitted linear m = 1 m = 2 0 1 2 4 754.345 751.680 738.969 736.709 17.636 14.971 2.260 0.000 0.001 0.002 0.323 -- 1 -.5 .5 .5 (*) P = sig. level of model with m = 2 based on chi^2 of dev. dif. Cox regression -- Breslow method for ties No. of subjects = 192 No. of failures = 92 Time at risk = 13825 Log likelihood = -368.35446 _t Coef. mthson_1 mthson_2 age ulcarea deepppg treat -2.81425 1.541451 -.0261111 -.0017491 -.5850499 -.1624663 Std. Err. .6996385 .4703143 .0087983 .000359 .2163173 .2171048 z -4.02 3.28 -2.97 -4.87 -2.70 -0.75 Number of obs = 192 LR chi2(6) Prob > chi2 = = 108.59 0.0000 P>|z| 0.000 0.001 0.003 0.000 0.007 0.454 [95% Conf. Interval] -4.185516 .6196521 -.0433556 -.0024527 -1.009024 -.5879838 -1.442984 2.46325 -.0088667 -.0010455 -.1610758 .2630513 The best-fitting fractional polynomial of degree 2 has powers (0.5, 0.5) and deviance 736.709. However, this model does not fit significantly better than the fractional polynomial of degree 1 (at the 0.05 level), which has power −0.5 and deviance 738.969. We prefer the model with m = 1. fp — Fractional polynomial regression 601 . fp , replace center scale nohr fp(-.5): stcox age ulcarea > deepppg treat -> stcox mthson_1 age ulcarea deepppg treat Cox regression -- Breslow method for ties No. of subjects = 192 No. of failures = 92 Time at risk = 13825 Log likelihood = -369.48426 _t Coef. mthson_1 age ulcarea deepppg treat .1985592 -.02691 -.0017416 -.5740759 -.1798575 Std. Err. .0493922 .0087875 .0003482 .2185134 .2175726 z 4.02 -3.06 -5.00 -2.63 -0.83 Number of obs = 192 LR chi2(5) Prob > chi2 = = 106.33 0.0000 P>|z| 0.000 0.002 0.000 0.009 0.408 [95% Conf. Interval] .1017523 -.0441331 -.0024241 -1.002354 -.6062921 .2953662 -.0096868 -.0010591 -.1457975 .246577 The hazard for healing is much higher for patients whose ulcer is of recent onset than for those who have had an ulcer for many months. A more appropriate analysis of this dataset, if one wanted to model all the predictors, possibly with fractional polynomial functions, would be to use mfp; see [R] mfp. Example 3: Logistic regression The zero option permits fitting a fractional polynomial model to the positive values of a covariate, taking nonpositive values as zero. 
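Before turning to the application, here is a hedged do-file sketch, in terms of hand-built variables, of what zero and catzero amount to for a single fractional polynomial power of 0 (the natural logarithm) in the cigarette-consumption variable cigs used below. fp creates the corresponding cigs_1 and cigs_0 variables automatically; this is not its internal code.

generate double cigs_1 = ln(cigs) if cigs > 0 & !missing(cigs)
replace cigs_1 = 0 if cigs <= 0                         // zero: FP terms forced to 0 at nonpositive cigs
generate byte cigs_0 = (cigs <= 0) if !missing(cigs)    // catzero: indicator for nonpositive cigs

With zero, only the fractional polynomial terms such as cigs_1 enter the model; with catzero, the indicator cigs_0 enters as well.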
An application is the assessment of the effect of cigarette smoking as a risk factor. Whitehall 1 is an epidemiological study, which was examined in Royston and Sauerbrei (2008), of 18,403 male British Civil Servants employed in London. We examine the data collected in Whitehall 1 and use logistic regression to model the odds of death based on a fractional polynomial in the number of cigarettes smoked. Nonsmokers may be qualitatively different from smokers, so the effect of smoking (regarded as a continuous variable) may not be continuous between zero cigarettes and one cigarette. To allow for this possibility, we model the risk as a constant for the nonsmokers and as a fractional polynomial function of the number of cigarettes for the smokers, adjusted for age. The dependent variable all10 is an indicator of whether the individual passed away in the 10 years under study. cigs is the number of cigarettes consumed per day. After loading the data, we demean age and create a dummy variable, nonsmoker. We then use fp to fit the model. 602 fp — Fractional polynomial regression . use http://www.stata-press.com/data/r13/smoking, clear (Smoking and mortality data) . qui sum age . replace age = age - r(mean) age was byte now float (17260 real changes made) . generate byte nonsmoker = cond(cigs==0, 1, 0) if cigs < . . fp , zero: logit all10 nonsmoker age (fitting 44 models) (....10%....20%....30%....40%....50%....60%....70%....80%....90%....100%) Fractional polynomial comparisons: cigs df omitted linear m = 1 m = 2 0 1 2 4 Deviance 9990.804 9958.801 9946.603 9944.708 Dev. dif. P(*) Powers 46.096 14.093 1.895 0.000 0.000 0.003 0.388 -- 1 0 -1 -1 (*) P = sig. level of model with m = 2 based on chi^2 of dev. dif. Logistic regression Number of obs = LR chi2(4) = Prob > chi2 = Log likelihood = -4972.3539 Pseudo R2 = all10 Coef. cigs_1 cigs_2 nonsmoker age _cons -1.285867 -1.982424 -1.223749 .1194541 -1.591489 Std. Err. .3358483 .572109 .1119583 .0045818 .1052078 z -3.83 -3.47 -10.93 26.07 -15.13 P>|z| 0.000 0.001 0.000 0.000 0.000 17260 1029.03 0.0000 0.0938 [95% Conf. Interval] -1.944117 -3.103736 -1.443183 .1104739 -1.797693 -.6276162 -.8611106 -1.004315 .1284343 -1.385286 Omission of the zero option would cause fp to halt with an error message because nonpositive covariate values (for example, values of cigs) are invalid unless the scale option is specified. A closely related approach involves the catzero option. Here we no longer need to have nonsmoker in the model, because fp creates its own dummy variable cigs 0 to indicate whether the individual does not smoke on that day. fp — Fractional polynomial regression 603 . fp , catzero replace: logit all10 age (fitting 44 models) (....10%....20%....30%....40%....50%....60%....70%....80%....90%....100%) Fractional polynomial comparisons: cigs df omitted linear m = 1 m = 2 0 2 3 5 Deviance 10175.75 9958.80 9946.60 9944.71 Dev. dif. 231.047 14.093 1.895 0.000 P(*) Powers 0.000 0.003 0.388 -- 1 0 -1 -1 (*) P = sig. level of model with m = 2 based on chi^2 of dev. dif. Logistic regression Number of obs = LR chi2(4) = Prob > chi2 = Log likelihood = -4972.3539 Pseudo R2 = all10 Coef. cigs_0 cigs_1 cigs_2 age _cons -1.223749 -1.285867 -1.982424 .1194541 -1.591489 Std. Err. .1119583 .3358483 .572109 .0045818 .1052078 z -10.93 -3.83 -3.47 26.07 -15.13 P>|z| 0.000 0.000 0.001 0.000 0.000 17260 1029.03 0.0000 0.0938 [95% Conf. 
Interval] -1.443183 -1.944117 -3.103736 .1104739 -1.797693 -1.004315 -.6276162 -.8611106 .1284343 -1.385286 Under both approaches, the comparison table suggests that we can accept the FP1 model instead of the FP2 model. We estimate the parameters of the accepted model—that is, the one that uses the natural logarithm of cigs—with fp. . fp , catzero replace fp(0): logit all10 age -> logit all10 cigs_0 cigs_1 age Logistic regression Number of obs LR chi2(3) Prob > chi2 Log likelihood = -4973.3016 Pseudo R2 all10 Coef. cigs_0 cigs_1 age _cons .1883732 .3469842 .1194976 -3.003767 Std. Err. .1553093 .0543552 .0045818 .1514909 z 1.21 6.38 26.08 -19.83 P>|z| 0.225 0.000 0.000 0.000 = = = = 17260 1027.13 0.0000 0.0936 [95% Conf. Interval] -.1160274 .2404499 .1105174 -3.300683 .4927738 .4535185 .1284778 -2.70685 The high p-value for cigs 0 in the output indicates that we cannot reject that there is no extra effect at zero for nonsmokers. 604 fp — Fractional polynomial regression Stored results In addition to the results that est cmd stores, fp stores the following in e(): Scalars e(fp e(fp e(fp e(fp e(fp dimension) center mean) scale a) scale b) compare df2) degree of fractional polynomial value used for centering or . value used for scaling or . value used for scaling or . denominator degree of freedom in F test Macros e(fp e(fp e(fp e(fp e(fp e(fp e(fp e(fp cmd) cmdline) variable) terms) gen cmdline) catzero) zero) compare type) fp, search(): or fp, powers(): full fp command as typed fractional polynomial variable generated fp variables fp generate command to re-create e(fp terms) variables catzero, if specified zero, if specified F or chi2 Matrices e(fp e(fp e(fp e(fp e(fp e(fp e(fp fp) compare) compare stat) compare df1) compare fp) compare length) powers) powers used in fractional polynomial results of model comparisons F test statistics numerator degree of F test powers of comparison models encoded string for display of row titles powers that are searched fp generate stores the following in r(): Scalars r(fp center mean) r(fp scale a) r(fp scale b) value used for centering or . value used for scaling or . value used for scaling or . Macros r(fp r(fp r(fp r(fp r(fp full fp generate command as typed fractional polynomial variable generated fp variables catzero, if specified zero, if specified cmdline) variable) terms) catzero) zero) Matrices r(fp fp) powers used in fractional polynomial Methods and formulas The general definition of a fractional polynomial, accommodating possible repeated powers, may be written for functions H1 (x), . . . , Hm (x) of x > 0 as β0 + m X βj Hj (x) j=1 where H1 (x) = x(p1 ) and for j = 2, . . . , m, Hj (x) = x(pj ) if pj 6= pj−1 Hj−1 (x) ln(x) if pj = pj−1 fp — Fractional polynomial regression 605 For example, a fractional polynomial of degree 3 with powers (1, 3, 3) has H1 (x) = x, H2 (x) = x3 , and H3 (x) = x3 ln(x) and equals β0 + β1 x + β2 x3 + β3 x3 ln(x). We can express a fractional polynomial in vector notation by using H(x) = [H1 (x), . . . , Hd (x)]0 . We define x(p1 ,p2 ,...,pm ) = [H(x)0 , 1]0 . Under this notation, we can write x(1,3,3)0 β = β0 + β1 x + β2 x3 + β3 x3 ln(x) The fractional polynomial may be centered so that the intercept can be more easily interpreted. When centering the fractional polynomial of x at c, we subtract c(p1 ,p2 ,...,pm ) from x(p1 ,p2 ,...,pm ) , where c(p1 ,p2 ,...,pd ) = [H(x)0 , 0]0 . The centered fractional polynomial is x(p1 ,...,pd ) − c(p1 ,...,pd ) 0 β The definition may be extended to allow x ≤ 0 values. 
For these values, the fractional polynomial is equal to the intercept β0 or equal to a zero-offset term α0 plus the intercept β0.

A fractional polynomial model of degree m is taken to have 2m + 1 degrees of freedom (df): one for β0 and one for each βj and its associated power. Because the powers in a fractional polynomial are chosen from a finite set rather than from the entire real line, the df defined in this way are approximate.

The deviance D of a model is defined as −2 times its maximized log likelihood. For normal-errors models, we use the formula

$$ D = n\left\{ 1 - \bar{\ell} + \ln\!\left(\frac{2\pi\,\mathrm{RSS}}{n}\right) \right\} $$

where n is the sample size, $\bar{\ell}$ is the mean of the log normalized weights ($\bar{\ell} = 0$ if the weights are all equal), and RSS is the residual sum of squares as fit by regress.

fp reports a table comparing fractional polynomial models of degree k < m with the degree m fractional polynomial model, which will have the lowest deviance. The p-values reported by fp are calculated differently for normal and nonnormal regressions. Let Dk and Dm be the deviances of the models with degrees k and m, respectively. For normal-errors models, a variance ratio F is calculated as

$$ F = \frac{n_2}{n_1}\left\{ \exp\!\left(\frac{D_k - D_m}{n}\right) - 1 \right\} $$

where n1 is the numerator df, the number of additional parameters that the degree m model has over the degree k model, and n2 is the denominator df, which equals the residual degrees of freedom of the degree m model minus the number of powers estimated, m. The p-value is obtained by referring F to an F distribution on (n1, n2) df.

For nonnormal models, the p-value is obtained by referring Dk − Dm to a χ2 distribution on 2m − 2k df. These p-values for comparing models are approximate and are typically somewhat conservative (Royston and Altman 1994).

Acknowledgment

We thank Patrick Royston of the MRC Clinical Trials Unit, London, and coauthor of the Stata Press book Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model for writing fracpoly and fracgen, the commands on which fp and fp generate are based. We also thank Professor Royston for his advice on and review of the new fp commands.

References

Becketti, S. 1995. sg26.2: Calculating and graphing fractional polynomials. Stata Technical Bulletin 24: 14–16. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 129–132. College Station, TX: Stata Press.

Isaacs, D., D. G. Altman, C. E. Tidmarsh, H. B. Valman, and A. D. Webster. 1983. Serum immunoglobulin concentrations in preschool children measured by laser nephelometry: Reference ranges for IgG, IgA, IgM. Journal of Clinical Pathology 36: 1193–1196.

Libois, F., and V. Verardi. 2013. Semiparametric fixed-effects estimator. Stata Journal 13: 329–336.

Royston, P. 1995. sg26.3: Fractional polynomial utilities. Stata Technical Bulletin 25: 9–13. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 82–87. College Station, TX: Stata Press.

Royston, P., and D. G. Altman. 1994. Regression using fractional polynomials of continuous covariates: Parsimonious parametric modelling. Applied Statistics 43: 429–467.

Royston, P., and G. Ambler. 1999a. sg112: Nonlinear regression models involving power or exponential functions of covariates. Stata Technical Bulletin 49: 25–30. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 173–179. College Station, TX: Stata Press.

Royston, P., and G. Ambler. 1999b. sg81.1: Multivariable fractional polynomials: Update. Stata Technical Bulletin 49: 17–23. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 161–168. College Station, TX: Stata Press.

Royston, P., and G. Ambler.
1999c. sg112.1: Nonlinear regression models involving power or exponential functions of covariates: Update. Stata Technical Bulletin 50: 26. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 180. College Station, TX: Stata Press. . 1999d. sg81.2: Multivariable fractional polynomials: Update. Stata Technical Bulletin 50: 25. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 168. College Station, TX: Stata Press. Royston, P., and W. Sauerbrei. 2008. Multivariable Model-building: A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modelling Continuous Variables. Chichester, UK: Wiley. Smith, J. M., C. J. Dore, A. Charlett, and J. D. Lewis. 1992. A randomized trial of Biofilm dressing for venous leg ulcers. Phlebology 7: 108–113. Also see [R] fp postestimation — Postestimation tools for fp [R] mfp — Multivariable fractional polynomial models [U] 20 Estimation and postestimation commands Title fp postestimation — Postestimation tools for fp Description Menu for fp plot and fp predict Remarks and examples References Syntax for predict Options for fp plot Methods and formulas Also see Syntax for fp plot and fp predict Options for fp predict Acknowledgment Description The following postestimation commands are of special interest after fp: Command Description fp plot fp predict component-plus-residual plot from most recently fit fractional polynomial model create variable containing prediction or SEs of fractional polynomials The following standard postestimation commands are also available if available after est cmd: Command Description contrast estat ic estat summarize estat vce estimates hausman lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results Hausman’s specification test point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest margins marginsplot nlcom predict predictnl pwcompare suest test testnl 607 608 fp postestimation — Postestimation tools for fp Special-interest postestimation commands fp plot produces a component-plus-residual plot. The fractional polynomial comprises the component, and the residual is specified by the user in residuals(). The component-plus-residuals are plotted against the fractional polynomial variable. If you only want to plot the component fit, without residuals, you would specify residuals(none). fp predict generates the fractional polynomial or the standard error of the fractional polynomial. The fractional polynomial prediction is equivalent to the fitted values prediction given by predict, xb, with the covariates other than the fractional polynomial variable set to zero. 
The standard error may be quite large if the range of the other covariates is far from zero. In this situation, the covariates would be centered and their range would include, or come close to including, zero. These postestimation commands can be used only when the fractional polynomial variables do not interact with other variables in the specification of est cmd. See [U] 11.4.3 Factor variables for more information about interactions. Syntax for predict The behavior of predict following fp is determined by est cmd. See the corresponding est cmd postestimation entry for available predict options. Also see information on fp predict below. Syntax for fp plot and fp predict Component-plus-residual plot for most recently fit fractional polynomial model fp plot if in , residuals(res option) graph options Create variable containing the prediction or SEs of fractional polynomials fp predict type newvar if in , predict options fp postestimation — Postestimation tools for fp graph options Main ∗ residuals(res option) equation(eqno) level(#) 609 Description residual option name to use in predict after est cmd, or residuals(none) if residuals are not to be graphed specify equation set confidence level; default is level(95) Plot plotopts(scatter options) affect rendition of the component-plus-residual scatter points Fitted line lineopts(cline options) affect rendition of the fitted line CI plot ciopts(area options) affect rendition of the confidence bands Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options ∗ any options other than by() documented in [G-3] twoway options residuals(res option) is required. predict options Description Main fp stdp equation(eqno) calculate the fractional polynomial; the default calculate the standard error of the fractional polynomial specify equation Menu for fp plot and fp predict fp plot Statistics > Linear models and related > Fractional polynomials > Component-plus-residual plot > Fractional polynomials > Fractional polynomial prediction fp predict Statistics > Linear models and related Options for fp plot Main residuals(res option) specifies what type of residuals to plot in the component-plus-residual plot. res option is the same option that would be specified to predict after est cmd. Residuals can be omitted from the plot by specifying residuals(none). residuals() is required. equation(eqno) is relevant only when you have previously fit a multiple-equation model in est cmd. It specifies the equation to which you are referring. 610 fp postestimation — Postestimation tools for fp equation(#1) would mean that the calculation is to be made for the first equation, equation(#2) would mean the second, and so on. You could also refer to the equations by their names: equation(income) would refer to the equation name income, and equation(hours) would refer to the equation named hours. If you do not specify equation(), the results are the same as if you specified equation(#1). level(#); see [R] estimation options. Plot plotopts(scatter options) affects the rendition of the component-plus-residual scatter points; see [G-2] graph twoway scatter. Fitted line lineopts(cline options) affects the rendition of the fitted line; see [G-3] cline options. CI plot ciopts(area options) affects the rendition of the confidence bands; see [G-3] area options. Add plots addplot(plot) provides a way to add other plots to the generated graph. See [G-3] addplot option. 
Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Options for fp predict Main fp calculates the fractional polynomial, the linear prediction with other variables set to zero. This is the default. stdp calculates the standard error of the fractional polynomial. equation(eqno) is relevant only when you have previously fit a multiple-equation model in est cmd. It specifies the equation to which you are referring. equation(#1) would mean that the calculation is to be made for the first equation, equation(#2) would mean the second, and so on. You could also refer to the equations by their names: equation(income) would refer to the equation name income, and equation(hours) would refer to the equation named hours. If you do not specify equation(), the results are the same as if you specified equation(#1). fp postestimation — Postestimation tools for fp 611 Remarks and examples After a model is fit using fp, the estimated fractional polynomial may be of interest. This is the linear combination of the fractional polynomial terms and the constant intercept using the model coefficients estimated by fp. It is equivalent to the fitted values prediction given by predict,xb, with the covariates and the fractional polynomial variable set to zero. When these other covariates have been centered, the prediction is made at the centering values of the covariates. A component-plus-residual plot is generated by fp plot. The fractional polynomial comprises the component, and the residual is specified by the user in residuals(). The residuals() option takes the same argument that would be supplied to predict after est cmd to obtain the desired type of residuals. If you only want to plot the component fit, without residuals, you would specify residuals(none). fp predict generates the fractional polynomial. If the stdp option is specified, the standard error of the fractional polynomial is generated instead. This standard error may be quite large if the range of the other covariates is far from zero. In this situation, the covariates would be centered and their range would include, or come close to including, zero. These postestimation commands can be used only when the fractional polynomial terms do not interact with other variables in the specification of est cmd. See [U] 11.4.3 Factor variables for more information about interactions. Examples Example 1: fp plot after linear regression In example 1 of [R] fp, we modeled the mean of the square root of a child’s serum immunoglobulin G (IgG) level as a fractional polynomial function of the child’s age. An FP2 model with powers (−2, 2) is chosen. We load the data and then fit the model with fp. Then we use fp plot to draw the componentplus-residual plot. A 95% confidence interval is produced for the fractional polynomial in age (the component). The residuals prediction option for regress is specified in the residuals() option in fp plot so that the residuals are rendered. 612 fp postestimation — Postestimation tools for fp . use http://www.stata-press.com/data/r13/igg (Immunoglobulin in children) . fp , scale center: regress sqrtigg (output omitted ) 1 Component+residual of sqrtigg 2 3 4 . 
fp plot, residuals(residuals) 0 2 4 6 Age (years) Example 2: fp plot after Cox regression In example 2 of [R] fp, we modeled the time to complete healing of leg ulcers for 192 elderly patients using a Cox regression. A one-degree fractional polynomial in mthson, the number of months since the onset of the ulcer, is used as a predictor in the regression. The power −0.5 is used for mthson. Other covariates are age (age), ulcer area (ulcarea), treatment type, and a binary indicator of deep vein involvement (deepppg). We load the data and then demean ulcer area and age. Then we fit the model with fp and draw the component-plus-residual plot with fp plot. mgale is specified in the residuals() option to obtain martingale residuals. See [ST] stcox postestimation for more details. . use http://www.stata-press.com/data/r13/legulcer1, clear (Leg ulcer clinical trial) . quietly stset ttevent, failure(cens) . quietly summarize age . replace age = age - r(mean) age was byte now float (192 real changes made) . quietly summarize ulcarea . replace ulcarea = ulcarea - r(mean) ulcarea was int now float (192 real changes made) . fp , replace center scale nohr fp(-.5): stcox age ulcarea > deepppg treat (output omitted ) . fp plot, residuals(mgale) 613 −2 Component+residual of _t −1 0 1 2 3 fp postestimation — Postestimation tools for fp 0 100 200 Months since onset 300 400 Example 3: fp plot and fp predict after logistic regression In example 3 of [R] fp, we used logistic regression to model the odds of death for male civil servants in Britain conditional on cigarette consumption. The dependent variable all10 is an indicator of whether the individual passed away in the 10 years under study. Nonsmokers may be qualitatively different from smokers, so the effect of smoking (regarded as a continuous variable) may not be continuous between zero cigarettes and one cigarette. To allow for this possibility, we model the risk as constant intercept for the nonsmokers and as a fractional polynomial function of the number of cigarettes for the smokers, cigs, adjusted for age. An FP1 model with power 0 is chosen. We load the data and demean age. Then we fit the model using fp and graph the fit of the model and 95% confidence interval using fp plot. Only the component fit is graphed by specifying residuals(none). . use http://www.stata-press.com/data/r13/smoking, clear (Smoking and mortality data) . quietly summarize age . replace age = age - r(mean) age was byte now float (17260 real changes made) . fp , catzero replace fp(0): logit all10 age -> logit all10 cigs_0 cigs_1 age Logistic regression Number of obs LR chi2(3) Prob > chi2 Pseudo R2 Log likelihood = -4973.3016 all10 Coef. cigs_0 cigs_1 age _cons .1883732 .3469842 .1194976 -3.003767 Std. Err. .1553093 .0543552 .0045818 .1514909 z 1.21 6.38 26.08 -19.83 P>|z| 0.225 0.000 0.000 0.000 = = = = 17260 1027.13 0.0000 0.0936 [95% Conf. Interval] -.1160274 .2404499 .1105174 -3.300683 .4927738 .4535185 .1284778 -2.70685 614 fp postestimation — Postestimation tools for fp −3.5 −3 Component −2.5 −2 −1.5 . fp plot, residuals(none) 0 20 40 Daily cigarette consumption 60 We see a small spike at zero for cigs because of the effect of cigs 0 on the fractional polynomial; however, the high p-value for cigs 0 in the model output indicates that we cannot reject that there is no extra effect at zero for nonsmokers. We can also use fp predict to predict the fractional polynomial for nonsmokers and the mean of age. This is the value at the spike. We store the result in fp0. 
We see it is equivalent to the sum of the constant intercept estimate and the estimate of the cigs 0 coefficient. . fp predict fp0 if cigs == 0 (7157 missing values generated) . summarize fp0 Variable Obs Mean fp0 10103 -2.815393 . display _b[cigs_0]+_b[_cons] -2.8153935 Std. Dev. 0 Min Max -2.815393 -2.815393 Methods and formulas Let the data consist of triplets (yi , xi , zi ), i = 1, . . . , n, where zi is the vector of covariates for the ith observation and xi is the fractional polynomial variable. fp predict calculates the fractional polynomial at the centering value x0 , ηbi = (xi (p1 ,...,pd ) − x0 ) β. This is equivalent to the linear predictor of the model at zi = 0. The standard error b ignoring estimation of the powers. When is calculated from the variance–covariance matrix of β, (p1 ,...,pm ) xi ≤ 0, H(xi ), and thus xi , is either undefined or zero. A zero offset term, α0 , may be added to ηbi for these nonpositive xi values. (p1 ,...,pm ) 0 b The values ηbi represent the behavior of the fractional polynomial model for x at fixed values z = 0 of the (centered) covariates. The ith component-plus-residual is defined as ηbi + di , where di is the residual for the ith observation. The definition of di will change according to the type of model used and the preference of the user. fp plot plots ηbi + di versus xi , overlaying ηbi and its confidence interval. fp postestimation — Postestimation tools for fp 615 Acknowledgment We thank Patrick Royston of the MRC Clinical Trials Unit, London, and coauthor of the Stata Press book Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model for writing fracplot and fracpred, the commands on which fp plot and fp predict are based. We also thank Professor Royston for his advice on and review of fp plot and fp predict. References Becketti, S. 1995. sg26.2: Calculating and graphing fractional polynomials. Stata Technical Bulletin 24: 14–16. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 129–132. College Station, TX: Stata Press. Royston, P. 1995. sg26.3: Fractional polynomial utilities. Stata Technical Bulletin 25: 9–13. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 82–87. College Station, TX: Stata Press. Royston, P., and G. Ambler. 1999a. sg112: Nonlinear regression models involving power or exponential functions of covariates. Stata Technical Bulletin 49: 25–30. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 173–179. College Station, TX: Stata Press. . 1999b. sg81.1: Multivariable fractional polynomials: Update. Stata Technical Bulletin 49: 17–23. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 161–168. College Station, TX: Stata Press. . 1999c. sg112.1: Nonlinear regression models involving power or exponential functions of covariates: Update. Stata Technical Bulletin 50: 26. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 180. College Station, TX: Stata Press. . 1999d. sg81.2: Multivariable fractional polynomials: Update. Stata Technical Bulletin 50: 25. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 168. College Station, TX: Stata Press. 
Also see [R] fp — Fractional polynomial regression [U] 20 Estimation and postestimation commands Title frontier — Stochastic frontier models Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax frontier depvar indepvars options if in weight , options Description Model noconstant suppress constant term distribution(hnormal) half-normal distribution for the inefficiency term distribution(exponential) exponential distribution for the inefficiency term distribution(tnormal) truncated-normal distribution for the inefficiency term ufrom(matrix) specify untransformed log likelihood; only with d(tnormal) cm(varlist , noconstant ) fit conditional mean model; only with d(tnormal); use noconstant to suppress constant term Model 2 constraints(constraints) collinear uhet(varlist , noconstant ) vhet(varlist , noconstant ) cost apply specified linear constraints keep collinear variables explanatory variables for technical inefficiency variance function; use noconstant to suppress constant term explanatory variables for idiosyncratic error variance function; use noconstant to suppress constant term fit cost frontier model; default is production frontier model SE vce(vcetype) vcetype may be oim, opg, bootstrap, or jackknife Reporting level(#) nocnsreport display options set confidence level; default is level(95) do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics indepvars and varlist may contain factor variables; see [U] 11.4.3 Factor variables. bootstrap, by, fp, jackknife, rolling, and statsby are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 616 frontier — Stochastic frontier models 617 Menu Statistics > Linear models and related > Frontier models Description frontier fits stochastic production or cost frontier models; the default is a production frontier model. It provides estimators for the parameters of a linear model with a disturbance that is assumed to be a mixture of two components, which have a strictly nonnegative and symmetric distribution, respectively. frontier can fit models in which the nonnegative distribution component (a measurement of inefficiency) is assumed to be from a half-normal, exponential, or truncated-normal distribution. See Kumbhakar and Lovell (2000) for a detailed introduction to frontier analysis. Options Model noconstant; see [R] estimation options. distribution(distname) specifies the distribution for the inefficiency term as half-normal (hnormal), exponential, or truncated-normal (tnormal). The default is hnormal. ufrom(matrix) specifies a 1 × K matrix of untransformed starting values when the distribution is truncated-normal (tnormal). frontier can estimate the parameters of the model by maximizing either the log likelihood or a transformed log likelihood (see Methods and formulas). frontier automatically transforms the starting values before passing them on to the transformed log likelihood. The matrix must have the same number of columns as there are parameters to estimate. 
cm(varlist , noconstant ) may be used only with distribution(tnormal). Here frontier will fit a conditional mean model in which the mean of the truncated-normal distribution is modeled as a linear function of the set of covariates specified in varlist. Specifying noconstant suppresses the constant in the mean function. Model 2 constraints(constraints), collinear; see [R] estimation options. By default, when fitting the truncated-normal model or the conditional mean model, frontier maximizes a transformed log likelihood. When constraints are applied, frontier will maximize the untransformed log likelihood with constraints defined in the untransformed metric. uhet(varlist , noconstant ) specifies that the technical inefficiency component is heteroskedastic, with the variance function depending on a linear combination of varlistu . Specifying noconstant suppresses the constant term from the variance function. This option may not be specified with distribution(tnormal). vhet(varlist , noconstant ) specifies that the idiosyncratic error component is heteroskedastic, with the variance function depending on a linear combination of varlistv . Specifying noconstant suppresses the constant term from the variance function. This option may not be specified with distribution(tnormal). cost specifies that frontier fit a cost frontier model. 618 frontier — Stochastic frontier models SE vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg) and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). The following option is available with frontier but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Stochastic production frontier models were introduced by Aigner, Lovell, and Schmidt (1977) and Meeusen and van den Broeck (1977). Since then, stochastic frontier models have become a popular subfield in econometrics. Kumbhakar and Lovell (2000) provide a good introduction. frontier fits three stochastic frontier models with distinct parameterizations of the inefficiency term and can fit stochastic production or cost frontier models. Let’s review the nature of the stochastic frontier problem. Suppose that a producer has a production function f (zi , β). In a world without error or inefficiency, the ith firm would produce qi = f (zi , β) Stochastic frontier analysis assumes that each firm potentially produces less than it might due to a degree of inefficiency. Specifically, qi = f (zi , β)ξi where ξi is the level of efficiency for firm i; ξi must be in the interval (0, 1 ]. If ξi = 1, the firm is achieving the optimal output with the technology embodied in the production function f (zi , β). When ξi < 1, the firm is not making the most of the inputs zi given the technology embodied in the production function f (zi , β). 
Because the output is assumed to be strictly positive (that is, qi > 0), the degree of technical efficiency is assumed to be strictly positive (that is, ξi > 0). Output is also assumed to be subject to random shocks, implying that qi = f (zi , β)ξi exp(vi ) frontier — Stochastic frontier models 619 Taking the natural log of both sides yields ln(qi ) = ln f (zi , β) + ln(ξi ) + vi Assuming that there are k inputs and that the production function is linear in logs, defining ui = − ln(ξi ) yields k X ln(qi ) = β0 + βj ln(zji ) + vi − ui (1) j=1 Because ui is subtracted from ln(qi ), restricting ui ≥ 0 implies that 0 < ξi ≤ 1, as specified above. Kumbhakar and Lovell (2000) provide a detailed version of the above derivation, and they show that performing an analogous derivation in the dual cost function problem allows us to specify the problem as k X ln(ci ) = β0 + βq ln(qi ) + βj ln(pji ) + vi + ui (2) j=1 where qi is output, zji are input quantities, ci is cost, and the pji are input prices. Intuitively, the inefficiency effect is required to lower output or raise expenditure, depending on the specification. Technical note The model that frontier actually fits is of the form y i = β0 + k X βj xji + vi − sui j=1 where s= 1, for production functions −1, for cost functions so, in the context of the discussion above, yi = ln(qi ), and xji = ln(zji ) for a production function; and for a cost function, yi = ln(ci ), and the xji are the ln(pji ) and ln(qi ). You must take the natural logarithm of the data before fitting a stochastic frontier production or cost model. frontier performs no transformations on the data. Different specifications of the ui and the vi terms give rise to distinct models. frontier provides estimators for the parameters of three basic models in which the idiosyncratic component, vi , is assumed to be independently N (0, σv ) distributed over the observations. The basic models differ in their specification of the inefficiency term, ui , as follows: exponential: the ui are independently exponentially distributed with variance σu2 hnormal: the ui are independently half-normally N + (0, σu2 ) distributed tnormal: the ui are independently N + (µ, σu2 ) distributed with truncation point at 0 For half-normal or exponential distributions, frontier can fit models with heteroskedastic error components, conditional on a set of covariates. For a truncated-normal distribution, frontier can also fit a conditional mean model in which the mean is modeled as a linear function of a set of covariates. 620 frontier — Stochastic frontier models Example 1: The half-normal and the exponential models For our first example, we demonstrate the half-normal and exponential models by reproducing a study found in Greene (2003, 505), which uses data originally published in Zellner and Revankar (1969). In this study of the transportation-equipment manufacturing industry, observations on value added, capital, and labor are used to estimate a Cobb–Douglas production function. The variable lnv is the log-transformed value added, lnk is the log-transformed capital, and lnl is the log-transformed labor. OLS estimates are compared with those from stochastic frontier models using both the half-normal and exponential distribution for the inefficiency term. . use http://www.stata-press.com/data/r13/greene9 . regress lnv lnk lnl Source SS df MS Model Residual 44.1727741 1.22225984 2 22 22.086387 .055557265 Total 45.3950339 24 1.89145975 lnv Coef. lnk lnl _cons .2454281 .805183 1.844416 . 
frontier lnv Iteration 0: Iteration 1: Iteration 2: Iteration 3: Iteration 4: Stoc. frontier Coef. lnk lnl _cons 25 397.54 0.0000 0.9731 0.9706 .23571 t P>|t| [95% Conf. Interval] .1068574 .1263336 .2335928 2.30 6.37 7.90 0.032 0.000 0.000 .0238193 .5431831 1.359974 Number of obs Wald chi2(2) Prob > chi2 2.4695222 lnv = = = = = = Std. Err. lnk lnl log likelihood = 2.3357572 log likelihood = 2.4673009 log likelihood = 2.4695125 log likelihood = 2.4695222 log likelihood = 2.4695222 normal/half-normal model Log likelihood = Number of obs F( 2, 22) Prob > F R-squared Adj R-squared Root MSE = = = .4670368 1.067183 2.328858 25 743.71 0.0000 Std. Err. z P>|z| [95% Conf. Interval] .2585478 .7802451 2.081135 .098764 .1199399 .281641 2.62 6.51 7.39 0.009 0.000 0.000 .0649738 .5451672 1.529128 .4521218 1.015323 2.633141 /lnsig2v /lnsig2u -3.48401 -3.014599 .6195353 1.11694 -5.62 -2.70 0.000 0.007 -4.698277 -5.203761 -2.269743 -.8254368 sigma_v sigma_u sigma2 lambda .1751688 .2215073 .0797496 1.264536 .0542616 .1237052 .0426989 .1678684 .0954514 .074134 -.0039388 .9355204 .3214633 .6618486 .163438 1.593552 Likelihood-ratio test of sigma_u=0: chibar2(01) = 0.43 . predict double u_h, u Prob>=chibar2 = 0.256 frontier — Stochastic frontier models . frontier lnv lnk lnl, distribution(exponential) Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: log log log log log likelihood likelihood likelihood likelihood likelihood = = = = = 2.7270659 2.8551532 2.8604815 2.8604897 2.8604897 Stoc. frontier normal/exponential model Log likelihood = Number of obs Wald chi2(2) Prob > chi2 2.8604897 lnv Coef. lnk lnl _cons = = = 25 845.68 0.0000 Std. Err. z P>|z| [95% Conf. Interval] .2624859 .7703795 2.069242 .0919988 .1109569 .2356159 2.85 6.94 8.78 0.004 0.000 0.000 .0821717 .5529079 1.607444 .4428002 .9878511 2.531041 /lnsig2v /lnsig2u -3.527598 -4.002457 .4486176 .9274575 -7.86 -4.32 0.000 0.000 -4.406873 -5.820241 -2.648324 -2.184674 sigma_v sigma_u sigma2 lambda .1713925 .1351691 .0476461 .7886525 .0384448 .0626818 .0157921 .087684 .1104231 .0544692 .016694 .616795 .2660258 .3354317 .0785981 .9605101 Likelihood-ratio test of sigma_u=0: chibar2(01) = 1.21 . predict double u_e, u . list state u_h u_e state u_h u_e 1. 2. 3. 4. 5. Alabama California Connecticut Florida Georgia .2011338 .14480966 .1903485 .51753139 .10397912 .14592865 .0972165 .13478797 .5903303 .07140994 6. 7. 8. 9. 10. Illinois Indiana Iowa Kansas Kentucky .12126696 .21128212 .24933153 .10099517 .05626919 .0830415 .15450664 .20073081 .06857629 .04152443 11. 12. 13. 14. 15. Louisiana Maine Maryland Massachusetts Michigan .20332731 .22263164 .13534062 .15636999 .15809566 .15066405 .17245793 .09245501 .10932923 .10756915 16. 17. 18. 19. 20. Missouri NewJersey NewYork Ohio Pennsylvania .10288047 .09584337 .27787793 .22914231 .1500667 .0704146 .06587986 .22249416 .16981857 .10302905 21. 22. 23. 24. 25. Texas Virginia Washington WestVirginia Wisconsin .20297875 .14000132 .11047581 .15561392 .14067066 .14552218 .09676078 .07533251 .11236153 .0970861 Prob>=chibar2 = 0.135 621 622 frontier — Stochastic frontier models The parameter estimates and the estimates of the inefficiency terms closely match those published in Greene (2003, 505), but the standard errors of the parameter estimates are estimated differently (see the technical note below). The output from frontier includes estimates of the standard deviations of the two error components, σv and σu , which are labeled sigma v and sigma u, respectively. 
In the log likelihood, they are parameterized as lnσv2 and lnσu2 , and these estimates are labeled /lnsig2v and /lnsig2u in the output. frontier also reports two other useful parameterizations. The estimate of the total error variance, σS2 = σv2 + σu2 , is labeled sigma2, and the estimate of the ratio of the standard deviation of the inefficiency component to the standard deviation of the idiosyncratic component, λ = σu /σv , is labeled lambda. At the bottom of the output, frontier reports the results of a test that there is no technical inefficiency component in the model. This is a test of the null hypothesis H0 : σu2 = 0 against the alternative hypotheses H1 : σu2 > 0. If the null hypothesis is true, the stochastic frontier model reduces to an OLS model with normal errors. However, because the test lies on the boundary of the parameter space of σu2 , the standard likelihood-ratio test is not valid, and a one-sided generalized likelihood-ratio test must be constructed; see Gutierrez, Carter, and Drukker (2001). For this example, the output shows LR = 0.43 with a p-value of 0.256 for the half-normal model and LR = 1.21 with a p-value of 0.135 for the exponential model. There are several possible reasons for the failure to reject the null hypothesis, but the fact that the test is based on an asymptotic distribution and the sample size was 25 is certainly a leading candidate among those possibilities. Technical note frontier maximizes the log-likelihood function of a stochastic frontier model by using the Newton–Raphson method, and the estimated variance–covariance matrix is calculated as the inverse of the negative Hessian (matrix of second partial derivatives); see [R] ml. When comparing the results with those published using other software, be aware of the difference in the optimization methods, which may result in different, yet asymptotically equivalent, variance estimates. Example 2: Models with heteroskedasticity Often the error terms may not have constant variance. frontier allows you to model heteroskedasticity in either error term as a linear function of a set of covariates. The variance of either the technical inefficiency or the idiosyncratic component may be modeled as σi2 = exp(wi δ) The default constant included in wi may be suppressed by appending a noconstant option to the list of covariates. Also, you can simultaneously specify covariates for both σui and σvi . In this example, we use a sample of 756 observations of fictional firms producing a manufactured good by using capital and labor. The firms are hypothesized to use a constant returns-to-scale technology, but the sizes of the firms differ. Believing that this size variation will introduce heteroskedasticity into the idiosyncratic error term, we estimate the parameters of a Cobb–Douglas production function. To do this, we use a conditional heteroskedastic half-normal model, with the size of the firm as an explanatory variable in the variance function for the idiosyncratic error. We also perform a test of the hypothesis that the firms use a constant returns-to-scale technology. frontier — Stochastic frontier models . use http://www.stata-press.com/data/r13/frontier1, clear . frontier lnoutput lnlabor lncapital, vhet(size) Iteration 0: log likelihood = -1508.3692 Iteration 1: log likelihood = -1501.583 Iteration 2: log likelihood = -1500.3942 Iteration 3: log likelihood = -1500.3794 Iteration 4: log likelihood = -1500.3794 Stoc. 
frontier normal/half-normal model Number of obs Wald chi2(2) Log likelihood = -1500.3794 Prob > chi2 lnoutput Coef. lnoutput lnlabor lncapital _cons = = = 623 756 9.68 0.0079 Std. Err. z P>|z| [95% Conf. Interval] .7090933 .3931345 1.252199 .2349374 .5422173 3.14656 3.02 0.73 0.40 0.003 0.468 0.691 .2486244 -.6695919 -4.914946 1.169562 1.455861 7.419344 size _cons -.0016951 3.156091 .0004748 .9265826 -3.57 3.41 0.000 0.001 -.0026256 1.340023 -.0007645 4.97216 lnsig2u _cons 1.947487 .1017653 19.14 0.000 1.748031 2.146943 sigma_u 2.647838 .134729 2.396514 2.925518 lnsig2v . test _b[lnlabor] + _b[lncapital] = 1 ( 1) [lnoutput]lnlabor + [lnoutput]lncapital = 1 chi2( 1) = 0.03 Prob > chi2 = 0.8622 The output above indicates that the variance of the idiosyncratic error term is a function of firm size. Also, we failed to reject the hypothesis that the firms use a constant returns-to-scale technology. Technical note In small samples, the conditional heteroskedastic estimators will lack precision for the variance parameters and may fail to converge altogether. Example 3: The truncated-normal model Let’s turn our attention to the truncated-normal model. Once again, we will use fictional data. For this example, we have 1,231 observations on the quantity of output, the total cost of production for each firm, the prices that each firm paid for labor and capital services, and a categorical variable measuring the quality of each firm’s management. After taking the natural logarithm of the costs (lncost), prices (lnp k and lnp l), and output (lnout), we fit a stochastic cost frontier model and specify the distribution for the inefficiency term to be truncated normal. 624 frontier — Stochastic frontier models . use http://www.stata-press.com/data/r13/frontier2 . frontier lncost lnp_k lnp_l lnout, distribution(tnormal) cost Iteration Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: 5: log log log log log log likelihood likelihood likelihood likelihood likelihood likelihood = = = = = = -2386.9523 -2386.5146 -2386.2704 -2386.2504 -2386.2493 -2386.2493 Stoc. frontier normal/truncated-normal model Log likelihood = -2386.2493 lncost Coef. lnp_k lnp_l lnout _cons Number of obs Wald chi2(3) Prob > chi2 = = = 1231 8.82 0.0318 Std. Err. z P>|z| .3410717 .6608628 .7528653 2.602609 .2363861 .4951499 .3468968 1.083004 1.44 1.33 2.17 2.40 0.149 0.182 0.030 0.016 -.1222366 -.3096131 .0729601 .4799595 .80438 1.631339 1.432771 4.725259 /mu /lnsigma2 /ilgtgamma 1.095705 1.5534 1.257862 .881517 .1873464 .2589522 1.24 8.29 4.86 0.214 0.000 0.000 -.632037 1.186208 .7503255 2.823446 1.920592 1.765399 sigma2 gamma sigma_u2 sigma_v2 4.727518 .7786579 3.681119 1.046399 .8856833 .0446303 .7503408 .2660035 3.274641 .6792496 2.210478 .5250413 6.825001 .8538846 5.15176 1.567756 H0: No inefficiency component: z = 5.595 [95% Conf. Interval] Prob>=z = 0.000 In addition to the coefficients, the output reports estimates for several parameters. sigma v2 is the estimate of σv2 . sigma u2 is the estimate of σu2 . gamma is the estimate of γ = σu2 /σS2 . sigma2 is the estimate of σS2 = σv2 + σu2 . Because γ must be between 0 and 1, the optimization is parameterized in terms of the inverse logit of γ , and this estimate is reported as ilgtgamma. Because σS2 must be positive, the optimization is parameterized in terms of ln(σS2 ), whose estimate is reported as lnsigma2. Finally, mu is the estimate of µ, the mean of the truncated-normal distribution. 
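To see how these parameterizations fit together, we can reproduce the reported sigma2 and gamma by hand from the other displayed estimates. The commands below simply recombine numbers shown in the output above, so this is an arithmetic illustration we have added rather than additional estimation; each result should match the corresponding entry in the table up to rounding.

. display 3.681119 + 1.046399      // sigma_u2 + sigma_v2 recovers sigma2
. display 3.681119 / 4.727518      // sigma_u2/sigma2 recovers gamma
. display invlogit(1.257862)       // the inverse logit of ilgtgamma also recovers gamma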
In the output above, the generalized log-likelihood test for the presence of the inefficiency term has been replaced with a test based on the third moment of the OLS residuals. When µ = 0 and σu = 0, the truncated-normal model reduces to a linear regression model with normally distributed errors. However, the distribution of the test statistic under the null hypothesis is not well established, because it becomes impossible to evaluate the log likelihood as σu approaches zero, prohibiting the use of the likelihood-ratio test. However, Coelli (1995) noted that the presence of an inefficiency term would negatively skew the residuals from an OLS regression. By identifying negative skewness in the residuals with the presence of an inefficiency term, Coelli derived a one-sided test for the presence of the inefficiency term. The results of this test are given at the bottom of the output. For this example, the null hypothesis of no inefficiency component is rejected. In the example below, we fit a truncated model and detect a statistically significant inefficiency term in the model. We might question whether the inefficiency term is identically distributed over all firms or whether there might be heterogeneity across firms. frontier provides an extension to the truncated normal model by allowing the mean of the inefficiency term to be modeled as a linear function of a set of covariates. In our dataset, we have a categorical variable that measures the quality of a firm’s management. We refit the model, including the cm() option, specifying a set of frontier — Stochastic frontier models 625 binary indicator variables representing the different categories of the quality-measurement variable as covariates. . frontier lncost lnp_k lnp_l lnout, distribution(tnormal) cm(i.quality) cost Iteration Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: 5: log log log log log log likelihood likelihood likelihood likelihood likelihood likelihood = = = = = = -2386.9523 -2384.936 -2382.3942 -2382.324 -2382.3233 -2382.3233 Stoc. frontier normal/truncated-normal model Log likelihood = -2382.3233 lncost Coef. Number of obs Wald chi2(3) Prob > chi2 Std. Err. z P>|z| = = = 1231 9.31 0.0254 [95% Conf. Interval] lncost lnp_k lnp_l lnout _cons .3611204 .680446 .7605533 2.550769 .2359749 .4934935 .3466102 1.078911 1.53 1.38 2.19 2.36 0.126 0.168 0.028 0.018 -.1013819 -.2867835 .0812098 .4361417 .8236227 1.647675 1.439897 4.665396 quality 2 3 4 5 .5056067 .783223 .5577511 .6792882 .3382907 .376807 .3355061 .3428073 1.49 2.08 1.66 1.98 0.135 0.038 0.096 0.048 -.1574309 .0446947 -.0998288 .0073981 1.168644 1.521751 1.215331 1.351178 _cons .6014025 .990167 0.61 0.544 -1.339289 2.542094 /lnsigma2 /ilgtgamma 1.541784 1.242302 .1790926 .2588968 8.61 4.80 0.000 0.000 1.190769 .734874 1.892799 1.749731 sigma2 gamma sigma_u2 sigma_v2 4.67292 .7759645 3.62602 1.0469 .8368852 .0450075 .7139576 .2583469 3.289611 .6758739 2.226689 .5405491 6.637923 .8519189 5.025351 1.553251 mu The conditional mean model was developed in the context of panel-data estimators, and we can apply frontier’s conditional mean model to panel data. 
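As a follow-up that is not part of the original example, one way to see whether the quality categories translate into differences in estimated efficiency is to predict the efficiency measure E{exp(−sui) | εi} after the conditional-mean fit above and summarize it by management quality. The te option of predict is documented in [R] frontier postestimation; the variable name ce below is ours.

. predict double ce, te            // E{exp(-s*u_i) | e_i}, with s = -1 for this cost model
. tabstat ce, by(quality) statistics(mean sd min max)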
Stored results

frontier stores the following in e():

Scalars
  e(N)             number of observations
  e(df_m)          model degrees of freedom
  e(k)             number of parameters
  e(k_eq)          number of equations in e(b)
  e(k_eq_model)    number of equations in overall model test
  e(k_dv)          number of dependent variables
  e(chi2)          χ2
  e(ll)            log likelihood
  e(ll_c)          log likelihood for H0: σu = 0
  e(z)             test for negative skewness of OLS residuals
  e(sigma_u)       standard deviation of technical inefficiency
  e(sigma_v)       standard deviation of vi
  e(p)             significance
  e(chi2_c)        LR test statistic
  e(p_z)           p-value for z
  e(rank)          rank of e(V)
  e(ic)            number of iterations
  e(rc)            return code
  e(converged)     1 if converged, 0 otherwise

Macros
  e(cmd)           frontier
  e(cmdline)       command as typed
  e(depvar)        name of dependent variable
  e(function)      production or cost
  e(wtype)         weight type
  e(wexp)          weight expression
  e(title)         title in estimation output
  e(chi2type)      Wald; type of model χ2 test
  e(dist)          distribution assumption for ui
  e(het)           heteroskedastic components
  e(u_hetvar)      varlist in uhet()
  e(v_hetvar)      varlist in vhet()
  e(vce)           vcetype specified in vce()
  e(vcetype)       title used to label Std. Err.
  e(opt)           type of optimization
  e(which)         max or min; whether optimizer is to perform maximization or minimization
  e(ml_method)     type of ml method
  e(user)          name of likelihood-evaluator program
  e(technique)     maximization technique
  e(properties)    b V
  e(predict)       program used to implement predict
  e(asbalanced)    factor variables fvset as asbalanced
  e(asobserved)    factor variables fvset as asobserved

Matrices
  e(b)             coefficient vector
  e(Cns)           constraints matrix
  e(ilog)          iteration log (up to 20 iterations)
  e(gradient)      gradient vector
  e(V)             variance–covariance matrix of the estimators
  e(V_modelbased)  model-based variance

Functions
  e(sample)        marks estimation sample

Methods and formulas

Consider an equation of the form

$$ y_i = x_i\beta + v_i - s\,u_i $$

where yi is the dependent variable, xi is a 1 × k vector of observations on the independent variables, β is a k × 1 vector of coefficients, and

$$ s = \begin{cases} \phantom{-}1, & \text{for production functions} \\ -1, & \text{for cost functions} \end{cases} $$

The log-likelihood functions are as follows.

Normal/half-normal model:

$$ \ln L = \sum_{i=1}^{N} \left\{ \frac{1}{2}\ln\frac{2}{\pi} - \ln\sigma_S + \ln\Phi\!\left(\frac{-s\,\epsilon_i\lambda}{\sigma_S}\right) - \frac{\epsilon_i^2}{2\sigma_S^2} \right\} $$

Normal/exponential model:

$$ \ln L = \sum_{i=1}^{N} \left\{ -\ln\sigma_u + \frac{\sigma_v^2}{2\sigma_u^2} + \ln\Phi\!\left(\frac{-s\,\epsilon_i - \dfrac{\sigma_v^2}{\sigma_u}}{\sigma_v}\right) + \frac{s\,\epsilon_i}{\sigma_u} \right\} $$

Normal/truncated-normal model:

$$ \ln L = \sum_{i=1}^{N} \left\{ -\frac{1}{2}\ln(2\pi) - \ln\sigma_S - \ln\Phi\!\left(\frac{\mu}{\sigma_S\sqrt{\gamma}}\right) + \ln\Phi\!\left[\frac{(1-\gamma)\mu - s\gamma\epsilon_i}{\{\sigma_S^2\,\gamma(1-\gamma)\}^{1/2}}\right] - \frac{1}{2}\left(\frac{\epsilon_i + s\mu}{\sigma_S}\right)^2 \right\} $$

where $\sigma_S = (\sigma_u^2 + \sigma_v^2)^{1/2}$, $\lambda = \sigma_u/\sigma_v$, $\gamma = \sigma_u^2/\sigma_S^2$, $\epsilon_i = y_i - x_i\beta$, and $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution.

To estimate ui, you can use either the mean or the mode of the conditional distribution f(u | ε).
$$ E(u_i \mid \epsilon_i) = \mu_{*i} + \sigma_*\left\{\frac{\phi(-\mu_{*i}/\sigma_*)}{\Phi(\mu_{*i}/\sigma_*)}\right\} $$

$$ M(u_i \mid \epsilon_i) = \begin{cases} \mu_{*i}, & \text{if } \mu_{*i} \ge 0 \\ 0, & \text{otherwise} \end{cases} $$

Then the technical efficiency (s = 1) or cost efficiency (s = −1) will be estimated by

$$ E_i = E\{\exp(-s\,u_i) \mid \epsilon_i\} = \frac{1 - \Phi(s\sigma_* - \mu_{*i}/\sigma_*)}{1 - \Phi(-\mu_{*i}/\sigma_*)}\, \exp\!\left(-s\mu_{*i} + \frac{1}{2}\sigma_*^2\right) $$

where μ*i and σ* are defined for the normal/half-normal model as

$$ \mu_{*i} = \frac{-s\,\epsilon_i\,\sigma_u^2}{\sigma_S^2}, \qquad \sigma_* = \frac{\sigma_u\sigma_v}{\sigma_S} $$

for the normal/exponential model as

$$ \mu_{*i} = -s\,\epsilon_i - \frac{\sigma_v^2}{\sigma_u}, \qquad \sigma_* = \sigma_v $$

and for the normal/truncated-normal model as

$$ \mu_{*i} = \frac{-s\,\epsilon_i\,\sigma_u^2 + \mu\,\sigma_v^2}{\sigma_S^2}, \qquad \sigma_* = \frac{\sigma_u\sigma_v}{\sigma_S} $$

In the half-normal and exponential models, when heteroskedasticity is assumed, the standard deviations, σu or σv, will be replaced in the above equations by

$$ \sigma_i^2 = \exp(w_i\delta) $$

where w is the vector of explanatory variables in the variance function.

In the conditional mean model, the mean parameter of the truncated-normal distribution, μ, is modeled as a linear combination of the set of covariates, w:

$$ \mu = w_i\delta $$

Therefore, the log-likelihood function can be rewritten as

$$ \ln L = \sum_{i=1}^{N} \left\{ -\frac{1}{2}\ln(2\pi) - \ln\sigma_S - \ln\Phi\!\left(\frac{w_i\delta}{\sigma_S\sqrt{\gamma}}\right) + \ln\Phi\!\left[\frac{(1-\gamma)\,w_i\delta - s\gamma\epsilon_i}{\sqrt{\sigma_S^2\,\gamma(1-\gamma)}}\right] - \frac{1}{2}\left(\frac{\epsilon_i + s\,w_i\delta}{\sigma_S}\right)^2 \right\} $$

The z test reported in the output of the truncated-normal model is a third-moment test developed by Coelli (1995) as an extension of a test previously developed by Pagan and Hall (1983). Coelli shows that under the null of normally distributed errors, the statistic

$$ z = \frac{m_3}{\left(6\,m_2^3/N\right)^{1/2}} $$

has a standard normal distribution, where m3 is the third moment from the OLS regression. Because the residuals are either negatively skewed (production function) or positively skewed (cost function), a one-sided p-value is used.

References

Aigner, D. J., C. A. K. Lovell, and P. Schmidt. 1977. Formulation and estimation of stochastic frontier production function models. Journal of Econometrics 6: 21–37.

Belotti, F., S. Daidone, G. Ilardi, and V. Atella. 2013. Stochastic frontier analysis using Stata. Stata Journal 13: 719–758.

Caudill, S. B., J. M. Ford, and D. M. Gropper. 1995. Frontier estimation and firm-specific inefficiency measures in the presence of heteroscedasticity. Journal of Business and Economic Statistics 13: 105–111.

Coelli, T. J. 1995. Estimators and hypothesis tests for a stochastic frontier function: A Monte Carlo analysis. Journal of Productivity Analysis 6: 247–268.

Gould, W. W., J. S. Pitblado, and B. P. Poi. 2010. Maximum Likelihood Estimation with Stata. 4th ed. College Station, TX: Stata Press.

Greene, W. H. 2003. Econometric Analysis. 5th ed. Upper Saddle River, NJ: Prentice Hall.

Gutierrez, R. G., S. L. Carter, and D. M. Drukker. 2001. sg160: On boundary-value likelihood-ratio tests. Stata Technical Bulletin 60: 15–18. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 269–273. College Station, TX: Stata Press.

Kumbhakar, S. C., and C. A. K. Lovell. 2000. Stochastic Frontier Analysis. Cambridge: Cambridge University Press.

Meeusen, W., and J. van den Broeck. 1977. Efficiency estimation from Cobb–Douglas production functions with composed error. International Economic Review 18: 435–444.

Pagan, A. R., and A. D. Hall. 1983. Diagnostic tests as residual analysis. Econometric Reviews 2: 159–218.

Petrin, A. K., B. P. Poi, and J. A. Levinsohn. 2004. Production function estimation in Stata using inputs to control for unobservables. Stata Journal 4: 113–123.

Stevenson, R. E. 1980. Likelihood functions for generalized stochastic frontier estimation. Journal of Econometrics 13: 57–66.

Tauchmann, H. 2012. Partial frontier efficiency analysis.
Stata Journal 12: 461–478. Zellner, A., and N. S. Revankar. 1969. Generalized production functions. Review of Economic Studies 36: 241–250. 630 frontier — Stochastic frontier models Also see [R] frontier postestimation — Postestimation tools for frontier [R] regress — Linear regression [XT] xtfrontier — Stochastic frontier models for panel data [U] 20 Estimation and postestimation commands Title frontier postestimation — Postestimation tools for frontier Description Remarks and examples Syntax for predict Reference Menu for predict Also see Options for predict Description The following postestimation commands are available after frontier: Command Description contrast estat ic estat summarize estat vce estimates lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest margins marginsplot nlcom predict predictnl pwcompare test testnl Syntax for predict predict type predict type newvar if in , statistic stub* | newvarxb newvarv newvaru 631 if in , scores 632 frontier postestimation — Postestimation tools for frontier Description statistic Main linear prediction; the default standard error of the prediction estimates of minus the natural log of the technical efficiency via E (ui | i ) estimates of minus the natural log of the technical efficiency via M (ui | i ) estimates of the technical efficiency via E {exp(−sui ) | i } 1, for production functions s= −1, for cost functions xb stdp u m te These statistics are available both in and out of sample; type predict the estimation sample. . . . if e(sample) . . . if wanted only for Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main xb, the default, calculates the linear prediction. stdp calculates the standard error of the linear prediction. u produces estimates of minus the natural log of the technical efficiency via E (ui | i ). m produces estimates of minus the natural log of the technical efficiency via M (ui | i ). te produces estimates of the technical efficiency via E {exp(−sui ) | i }. scores calculates equation-level score variables. The first new variable will contain ∂ ln L/∂(xi β). The second new variable will contain ∂ ln L/∂(lnsig2v). The third new variable will contain ∂ ln L/∂(lnsig2u). Remarks and examples Example 1 In example 2 of [R] frontier, we modeled heteroskedasticity by specifying the vhet() option. We would like to compare the predicted efficiency in that case with respect to a model specification without accounting for the presence of heteroskedasticity in the error term. 
Kumbhakar and Lovell (2000, 117) show that failing to account for heteroskedasticity associated with firm size may lead to bias in the estimation of the technical efficiency. By incorrectly assuming homoskedasticity, the estimates for relatively small firms would be biased upward, while the estimates for relatively large firms would be biased downward. Let's refit the model and use the te option of predict:

. use http://www.stata-press.com/data/r13/frontier1
. frontier lnoutput lnlabor lncapital, vhet(size)
  (output omitted )
. predict te_vhet, te

Next we fit the model assuming homoskedasticity and then again predict the technical efficiency with the te option of predict:

. frontier lnoutput lnlabor lncapital
  (output omitted )
. predict te, te

The graph below shows the estimates of technical efficiency for the smaller and the larger firms. The predicted technical efficiency tends to be smaller for the smaller firms when the model specification accounts for the presence of heteroskedasticity, whereas the predictions tend to be smaller for the larger firms when homoskedasticity is assumed. These results agree with the theoretical statement in Kumbhakar and Lovell (2000) because firm size was actually relevant for modeling heteroskedasticity in the idiosyncratic component of the error term.

(figure omitted: "Predicted technical efficiency for smaller and larger firms", two scatterplots against firm size: te_vhet, "Modeling heteroskedasticity", and te, "Assuming homoskedasticity")

Example 2

We also test in example 2 of [R] frontier whether the firms use constant returns to scale. We can use lincom as an alternative way to perform an equivalent test based on the normal distribution.

. use http://www.stata-press.com/data/r13/frontier1, clear
. frontier lnoutput lnlabor lncapital, vhet(size)
  (output omitted )
. lincom _b[lnlabor] + _b[lncapital] - 1
 ( 1)  [lnoutput]lnlabor + [lnoutput]lncapital = 1

    lnoutput        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
         (1)     .1022278   .5888511     0.17   0.862    -1.051899    1.256355

The p-value is exactly the same as the one we obtained with the test command in example 2 of [R] frontier. However, notice that by using lincom, we obtained an estimate of the deviation from the constant returns-to-scale assumption, which is not significantly different from zero in this case.

Reference

Kumbhakar, S. C., and C. A. K. Lovell. 2000. Stochastic Frontier Analysis. Cambridge: Cambridge University Press.

Also see

[R] frontier — Stochastic frontier models
[U] 20 Estimation and postestimation commands

Title

fvrevar — Factor-variables operator programming command

Syntax     Description     Options     Remarks and examples     Stored results     Also see

Syntax

fvrevar varlist [if] [in] [, substitute tsonly list stub(stub)]

You must tsset your data before using fvrevar if varlist contains time-series operators; see [TS] tsset.

Description

fvrevar creates an equivalent, temporary variable list for a varlist that might contain factor variables, interactions, or time-series–operated variables so that the resulting variable list can be used by commands that do not otherwise support factor variables or time-series–operated variables. The resulting list can also be used in a program to speed execution at the cost of using more memory.
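For instance, here is a minimal sketch of that programming use; the program name mysum and its contents are ours, purely for illustration:

program mysum
    version 13
    syntax varlist(fv ts)
    // replace factor-variable and time-series terms with temporary variables
    fvrevar `varlist'
    // r(varlist) now names only plain numeric variables, one per term
    summarize `r(varlist)'
end

A call such as mysum i.rep78 mpg would then summarize the generated indicator variables along with mpg, even though summarize itself never sees the factor-variable notation.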
Options substitute specifies that equivalent, temporary variables be substituted for any factor variables, interactions, or time-series–operated variables in varlist. substitute is the default action taken by fvrevar; you do not need to specify the option. tsonly specifies that equivalent, temporary variables be substituted for only the time-series–operated variables in varlist. list specifies that all factor-variable operators and time-series operators be removed from varlist and the resulting list of base variables be returned in r(varlist). No new variables are created with this option. stub(stub) specifies that fvrevar generate named variables instead of temporary variables. The new variables will be named stub#. Remarks and examples fvrevar might create no new variables, one new variable, or many new variables, depending on the number of factor variables, interactions, and time-series operators appearing in varlist. Any new variables created are temporary. The new, equivalent varlist is returned in r(varlist). The new varlist corresponds one to one with the original varlist. Example 1 Typing . use http://www.stata-press.com/data/r13/auto2 . fvrevar i.rep78 mpg turn 635 636 fvrevar — Factor-variables operator programming command creates five temporary variables corresponding to the levels of rep78. No new variables are created for variables mpg and turn because they do not contain factor-variable or time-series operators. The resulting variable list is . display "‘r(varlist)’" 000000 000001 000002 000003 000004 mpg turn (Your temporary variable names may be different, but that is of no consequence.) Temporary variables automatically vanish when the program concludes. Example 2 Suppose we want to create temporary variables for specific levels of a factor variable. To do this, we can use the parenthesis notation of factor-variable syntax. . fvrevar i(2,3)bn.rep78 mpg creates two temporary variables corresponding to levels 2 and 3 of rep78. Notice that we specified that neither level 2 nor 3 be set as the base level by using the bn notation. If we did not specify bn, level 2 would have been treated as the base level. The resulting variable list is . display "‘r(varlist)’" 000005 000002 mpg We can see the results by listing the new variables alongside the original value of rep78. . list rep78 ‘r(varlist)’ in 1/5 1. 2. 3. 4. 5. rep78 __000005 __000002 mpg Average Average . Average Good 0 0 . 0 0 1 1 . 1 0 22 17 22 20 15 If we had needed only the base-variable names, we could have specified . fvrevar i(2,3)bn.rep78 mpg, list . display "‘r(varlist)’" mpg rep78 The order of the list will probably differ from that of the original list; base variables are listed only once. Example 3 Now let’s assume we have a varlist containing both an interaction and time-series–operated variables. If we want to create temporary variables for the entire equivalent varlist, we can specify fvrevar with no options. fvrevar — Factor-variables operator programming command 637 . generate t = _n . tsset t time variable: t, 1 to 74 delta: 1 unit . fvrevar c.turn#i(2,3).rep78 L.mpg The resulting variable list is . display "‘r(varlist)’" 000006 000007 000008 If we want to create temporary variables only for the time-series–operated variables, we can specify the tsonly option. . fvrevar c.turn#i(2,3).rep78 L.mpg, tsonly The resulting variable list is . display "‘r(varlist)’" c.turn#2b.rep78 c.turn#3.rep78 000008 Notice that fvrevar returned the expanded factor-variable list with the tsonly option. 
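If permanent, readable variable names are preferred to temporary ones, the stub() option described under Options may be used instead. The sketch below assumes the same auto2.dta setup as in the examples above; following the option's naming rule, the generated variables are named by appending a number to the stub (here rv1, rv2, and so on), and r(varlist) contains those names together with mpg.

. fvrevar i(2,3)bn.rep78 mpg, stub(rv)
. display "`r(varlist)'"

Because these variables are not temporary, they remain in the dataset after the program concludes and must be dropped explicitly when no longer needed.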
Technical note fvrevar, substitute avoids creating duplicate variables. Consider . fvrevar i.rep78 turn mpg i.rep78 i.rep78 appears twice in the varlist. fvrevar will create only one set of new variables for the five levels of rep78 and will use these new variables once in the resulting r(varlist). Moreover, fvrevar will do this even across multiple calls: . fvrevar i.rep78 turn mpg . fvrevar i.rep78 i.rep78 appears in two separate calls. At the first call, fvrevar creates five temporary variables corresponding to the five levels of rep78. At the second call, fvrevar remembers what it has done and uses the same temporary variables for i.rep78. Stored results fvrevar stores the following in r(): Macros r(varlist) the modified variable list or list of base-variable names Also see [TS] tsrevar — Time-series operator programming command [P] syntax — Parse Stata syntax [P] unab — Unabbreviate variable list [U] 11 Language syntax [U] 11.4.4 Time-series varlists [U] 18 Programming Stata Title fvset — Declare factor-variable settings Syntax Description Options Remarks and examples Stored results Syntax Declare base settings fvset base base spec varlist Declare design settings fvset design design spec varlist Clear the current settings fvset clear varlist Report the current settings fvset report varlist , base(base spec) design(design spec) base spec Description default first last frequent none # default base lowest level value; the default highest level value most frequent level value no base nonnegative integer value design spec Description default asbalanced asobserved default base accumulate using 1/k , k = number of levels accumulate using observed relative frequencies; the default Description fvset declares factor-variable settings. Factor-variable settings identify the base level and how to accumulate statistics over levels. fvset base specifies the base level for each variable in varlist. The default for factor variables without a declared base level is first. 638 fvset — Declare factor-variable settings 639 fvset design specifies how to accumulate over the levels of a factor variable. The margins command is the only command aware of this setting; see [R] margins. By default, margins assumes that factor variables are asobserved, meaning that they are accumulated by weighting by the number of observations or the sum of the weights if weights have been specified. fvset clear removes factor-variable settings for each variable in varlist. fvset clear removes all factor-variable settings from all variables. all fvset report reports the current factor-variable settings for each variable in varlist. fvset without arguments is a synonym for fvset report. Options base(base spec) restricts fvset report to report only the factor-variable settings for variables with the specified base spec. design(design spec) restricts fvset report to report only the factor-variable settings for variables with the specified design spec. Remarks and examples Example 1 Using auto2.dta, we include factor variable i.rep78 in a regression: . use http://www.stata-press.com/data/r13/auto2 (1978 Automobile Data) . regress mpg i.rep78, baselevels Source SS df MS Model Residual 549.415777 1790.78712 4 64 137.353944 27.9810488 Total 2340.2029 68 34.4147485 Std. Err. t Number of obs F( 4, 64) Prob > F R-squared Adj R-squared Root MSE P>|t| = = = = = = 69 4.91 0.0016 0.2348 0.1869 5.2897 mpg Coef. [95% Conf. 
Interval] rep78 Poor Fair Average Good Excellent 0 -1.875 -1.566667 .6666667 6.363636 (base) 4.181884 3.863059 3.942718 4.066234 -0.45 -0.41 0.17 1.56 0.655 0.686 0.866 0.123 -10.22927 -9.284014 -7.209818 -1.759599 6.479274 6.150681 8.543152 14.48687 _cons 21 3.740391 5.61 0.000 13.52771 28.47229 We specified the baselevels option so that the base level would be included in the output. By default, the first level is the base level. We can change the base level to 2: 640 fvset — Declare factor-variable settings . fvset base 2 rep78 . regress mpg i.rep78, baselevels SS df Source MS Model Residual 549.415777 1790.78712 4 64 137.353944 27.9810488 Total 2340.2029 68 34.4147485 mpg Coef. rep78 Poor Fair Average Good Excellent 1.875 0 .3083333 2.541667 8.238636 _cons 19.125 Std. Err. Number of obs F( 4, 64) Prob > F R-squared Adj R-squared Root MSE = = = = = = 69 4.91 0.0016 0.2348 0.1869 5.2897 t P>|t| [95% Conf. Interval] 4.181884 (base) 2.104836 2.247695 2.457918 0.45 0.655 -6.479274 10.22927 0.15 1.13 3.35 0.884 0.262 0.001 -3.896559 -1.948621 3.32838 4.513226 7.031954 13.14889 1.870195 10.23 0.000 15.38886 22.86114 Let’s set rep78 to have no base level and fit a cell-means regression: . fvset base none rep78 . regress mpg i.rep78, noconstant Source SS df MS Model Residual 31824.2129 1790.78712 5 64 6364.84258 27.9810488 Total 33615 69 487.173913 mpg Coef. rep78 Poor Fair Average Good Excellent 21 19.125 19.43333 21.66667 27.36364 Std. Err. 3.740391 1.870195 .9657648 1.246797 1.594908 t 5.61 10.23 20.12 17.38 17.16 Number of obs F( 5, 64) Prob > F R-squared Adj R-squared Root MSE = = = = = = 69 227.47 0.0000 0.9467 0.9426 5.2897 P>|t| [95% Conf. Interval] 0.000 0.000 0.000 0.000 0.000 13.52771 15.38886 17.504 19.1759 24.17744 28.47229 22.86114 21.36267 24.15743 30.54983 Example 2 By default, margins accumulates a margin by using the observed relative frequencies of the factor levels. . regress mpg i.foreign Source SS df MS Model Residual 378.153515 2065.30594 1 72 378.153515 28.6848048 Total 2443.45946 73 33.4720474 mpg Coef. foreign Foreign _cons 4.945804 19.82692 Std. Err. 1.362162 .7427186 t 3.63 26.70 Number of obs F( 1, 72) Prob > F R-squared Adj R-squared Root MSE = = = = = = 74 13.18 0.0005 0.1548 0.1430 5.3558 P>|t| [95% Conf. Interval] 0.001 0.000 2.230384 18.34634 7.661225 21.30751 fvset — Declare factor-variable settings . margins Predictive margins Model VCE : OLS Expression : Linear prediction, predict() _cons Margin Delta-method Std. Err. 21.2973 .6226014 Number of obs t 34.21 = 641 74 P>|t| [95% Conf. Interval] 0.000 20.05616 22.53843 Let’s set foreign to always accumulate using equal relative frequencies: . fvset design asbalanced foreign . regress mpg i.foreign Source SS df MS Model Residual 378.153515 2065.30594 1 72 378.153515 28.6848048 Total 2443.45946 73 33.4720474 mpg Coef. foreign Foreign _cons 4.945804 19.82692 Std. Err. 1.362162 .7427186 t 3.63 26.70 . margins Adjusted predictions Model VCE : OLS Expression : Linear prediction, predict() at : foreign (asbalanced) Margin _cons 22.29983 Delta-method Std. Err. .6810811 t 32.74 Number of obs F( 1, 72) Prob > F R-squared Adj R-squared Root MSE = = = = = = 74 13.18 0.0005 0.1548 0.1430 5.3558 P>|t| [95% Conf. Interval] 0.001 0.000 2.230384 18.34634 Number of obs = 7.661225 21.30751 74 P>|t| [95% Conf. Interval] 0.000 20.94211 23.65754 Suppose that we issued the fvset design command earlier in our session and that we cannot remember which variables we set as asbalanced. 
We can retrieve this information by using the fvset report command: . fvset report, design(asbalanced) Variable Base Design foreign asbalanced Technical note margins is aware of a factor variable’s design setting only through the estimation results it is working with. The design setting is stored by the estimation command; thus changing the design setting between the estimation command and margins will have no effect. For example, the output from the following two calls to margins yields the same results: 642 fvset — Declare factor-variable settings . fvset clear foreign . regress mpg i.foreign SS Source df MS Model Residual 378.153515 2065.30594 1 72 378.153515 28.6848048 Total 2443.45946 73 33.4720474 mpg Coef. foreign Foreign _cons 4.945804 19.82692 Std. Err. 1.362162 .7427186 t 3.63 26.70 Number of obs F( 1, 72) Prob > F R-squared Adj R-squared Root MSE _cons Delta-method Std. Err. 21.2973 .6226014 t 34.21 74 13.18 0.0005 0.1548 0.1430 5.3558 P>|t| [95% Conf. Interval] 0.001 0.000 2.230384 18.34634 . margins Predictive margins Model VCE : OLS Expression : Linear prediction, predict() Margin = = = = = = Number of obs 7.661225 21.30751 = 74 P>|t| [95% Conf. Interval] 0.000 20.05616 22.53843 . fvset design asbalanced foreign . margins Predictive margins Model VCE : OLS Expression : Linear prediction, predict() _cons Margin Delta-method Std. Err. 21.2973 .6226014 t 34.21 Number of obs = 74 P>|t| [95% Conf. Interval] 0.000 20.05616 Stored results fvset stores the following in r(): Macros r(varlist) varlist r(baselist) base setting for each variable in varlist r(designlist) design setting for each variable in varlist 22.53843 Title gllamm — Generalized linear and latent mixed models Description Remarks and examples References Also see Description GLLAMM stands for generalized linear latent and mixed models, and gllamm is a Stata command for fitting such models written by Sophia Rabe-Hesketh (University of California–Berkeley) as part of joint work with Anders Skrondal (Norwegian Institute of Public Health) and Andrew Pickles (King’s College London). Remarks and examples Generalized linear latent and mixed models are a class of multilevel latent variable models, where a latent variable is a factor or a random effect (intercept or coefficient), or a disturbance (residual). The gllamm command for fitting such models is not an official command of Stata; it has been independently developed by highly regarded authors and is itself highly regarded. You can learn more about gllamm by visiting http://www.gllamm.org. gllamm is available from the Statistical Software Components (SSC) archive. To install, type . ssc describe gllamm . ssc install gllamm If you later wish to uninstall gllamm, type ado uninstall gllamm. References Miranda, A., and S. Rabe-Hesketh. 2006. Maximum likelihood estimation of endogenous switching and sample selection models for binary, ordinal, and count variables. Stata Journal 6: 285–308. Rabe-Hesketh, S., and B. S. Everitt. 2007. A Handbook of Statistical Analyses Using Stata. 4th ed. Boca Raton, FL: Chapman & Hall/CRC. Rabe-Hesketh, S., A. Pickles, and C. Taylor. 2000. sg129: Generalized linear latent and mixed models. Stata Technical Bulletin 53: 47–57. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 293–307. College Station, TX: Stata Press. Rabe-Hesketh, S., and A. Skrondal. 2012. Multilevel and Longitudinal Modeling Using Stata. 3rd ed. College Station, TX: Stata Press. Rabe-Hesketh, S., A. Skrondal, and A. Pickles. 2002. 
Reliable estimation of generalized linear mixed models using adaptive quadrature. Stata Journal 2: 1–21. . 2003. Maximum likelihood estimation of generalized linear models with covariate measurement error. Stata Journal 3: 386–411. Skrondal, A., and S. Rabe-Hesketh. 2004. Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. Boca Raton, FL: Chapman & Hall/CRC. Zheng, X., and S. Rabe-Hesketh. 2007. Estimating parameters of dichotomous and ordinal item response models with gllamm. Stata Journal 7: 313–333. The references above are restricted to works by the primary authors of gllamm. There are many other books and articles that use or discuss gllamm; see http://www.gllamm.org/pub.html for a list. 643 644 gllamm — Generalized linear and latent mixed models Also see [ME] meglm — Multilevel mixed-effects generalized linear model [ME] mixed — Multilevel mixed-effects linear regression [SEM] intro 2 — Learning the language: Path diagrams and command language [SEM] intro 5 — Tour of models Title glm — Generalized linear models Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Options Acknowledgments Syntax glm depvar indepvars if in weight , options options Description Model family(familyname) link(linkname) distribution of depvar; default is family(gaussian) link function; default is canonical link for family() specified Model 2 noconstant exposure(varname) offset(varname) constraints(constraints) collinear asis mu(varname) init(varname) suppress constant term include ln(varname) in model with coefficient constrained to 1 include varname in model with coefficient constrained to 1 apply specified linear constraints keep collinear variables retain perfect predictor variables use varname as the initial estimate for the mean of depvar synonym for mu(varname) SE/Robust vce(vcetype) vfactor(#) disp(#) scale(x2 | dev | #) vcetype may be oim, robust, cluster clustvar, eim, opg, bootstrap, jackknife, hac kernel, jackknife1, or unbiased multiply variance matrix by scalar # quasilikelihood multiplier set the scale parameter Reporting level(#) eform nocnsreport display options set confidence level; default is level(95) report exponentiated coefficients do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization ml irls maximize options fisher(#) search use maximum likelihood optimization; the default use iterated, reweighted least-squares optimization of the deviance control the maximization process; seldom used use the Fisher scoring Hessian or expected information matrix (EIM) search for good starting values 645 646 glm — Generalized linear models noheader notable nodisplay coeflegend suppress header table from above coefficient table suppress coefficient table suppress the output; iteration log is still displayed display legend instead of statistics familyname Description gaussian igaussian binomial varnameN | #N poisson nbinomial #k | ml gamma Gaussian (normal) inverse Gaussian Bernoulli/binomial Poisson negative binomial gamma linkname Description identity log logit probit cloglog power # opower # nbinomial loglog logc identity log logit probit cloglog power odds power negative binomial log-log log-complement indepvars may contain factor variables; see [U] 11.4.3 Factor variables. depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. 
bootstrap, by, fp, jackknife, mfp, mi estimate, nestreg, rolling, statsby, stepwise, and svy are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap), vce(jackknife), and vce(jackknife1) are not allowed with the mi estimate prefix; see [MI] mi estimate. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. vce(), vfactor(), disp(), scale(), irls, fisher(), noheader, notable, nodisplay, and weights are not allowed with the svy prefix; see [SVY] svy. fweights, aweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. noheader, notable, nodisplay, and coeflegend do not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Generalized linear models > Generalized linear models (GLM) Description glm fits generalized linear models. It can fit models by using either IRLS (maximum quasilikelihood) or Newton–Raphson (maximum likelihood) optimization, which is the default. glm — Generalized linear models 647 See [U] 26 Overview of Stata estimation commands for a description of all of Stata’s estimation commands, several of which fit models that can also be fit using glm. Options Model family( familyname) specifies the distribution of depvar; family(gaussian) is the default. link(linkname) specifies the link function; the default is the canonical link for the family() specified (except for family(nbinomial)). Model 2 noconstant, exposure(varname), offset(varname), constraints(constraints), collinear; see [R] estimation options. constraints(constraints) and collinear are not allowed with irls. asis forces retention of perfect predictor variables and their associated, perfectly predicted observations and may produce instabilities in maximization; see [R] probit. This option is only allowed with option family(binomial) with a denominator of 1. mu(varname) specifies varname as the initial estimate for the mean of depvar. This option can be useful with models that experience convergence difficulties, such as family(binomial) models with power or odds-power links. init(varname) is a synonym. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. In addition to the standard vcetypes, glm allows the following alternatives: vce(eim) specifies that the EIM estimate of variance be used. vce(jackknife1) specifies that the one-step jackknife estimate of variance be used. vce(hac kernel # ) specifies that a heteroskedasticity- and autocorrelation-consistent (HAC) variance estimate be used. HAC refers to the general form for combining weighted matrices to form the variance estimate. There are three kernels built into glm. kernel is a user-written program or one of nwest | gallant | anderson # specifies the number of lags. If # is not specified, N − 2 is assumed. If you wish to specify vce(hac . . . ), you must tsset your data before calling glm. vce(unbiased) specifies that the unbiased sandwich estimate of variance be used. vfactor(#) specifies a scalar by which to multiply the resulting variance matrix. 
This option allows you to match output with other packages, which may apply degrees of freedom or other small-sample corrections to estimates of variance. disp(#) multiplies the variance of depvar by # and divides the deviance by #. The resulting distributions are members of the quasilikelihood family. scale(x2 | dev | #) overrides the default scale parameter. This option is allowed only with Hessian (information matrix) variance estimates. 648 glm — Generalized linear models By default, scale(1) is assumed for the discrete distributions (binomial, Poisson, and negative binomial), and scale(x2) is assumed for the continuous distributions (Gaussian, gamma, and inverse Gaussian). scale(x2) specifies that the scale parameter be set to the Pearson chi-squared (or generalized chisquared) statistic divided by the residual degrees of freedom, which is recommended by McCullagh and Nelder (1989) as a good general choice for continuous distributions. scale(dev) sets the scale parameter to the deviance divided by the residual degrees of freedom. This option provides an alternative to scale(x2) for continuous distributions and overdispersed or underdispersed discrete distributions. scale(#) sets the scale parameter to #. For example, using scale(1) in family(gamma) models results in exponential-errors regression. Additional use of link(log) rather than the default link(power -1) for family(gamma) essentially reproduces Stata’s streg, dist(exp) nohr command (see [ST] streg) if all the observations are uncensored. Reporting level(#); see [R] estimation options. eform displays the exponentiated coefficients and corresponding standard errors and confidence intervals. For family(binomial) link(logit) (that is, logistic regression), exponentiation results are odds ratios; for family(nbinomial) link(log) (that is, negative binomial regression) and for family(poisson) link(log) (that is, Poisson regression), exponentiated coefficients are incidence-rate ratios. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization ml requests that optimization be carried out using Stata’s ml commands and is the default. irls requests iterated, reweighted least-squares (IRLS) optimization of the deviance instead of Newton– Raphson optimization of the log likelihood. If the irls option is not specified, the optimization is carried out using Stata’s ml commands, in which case all options of ml maximize are also available. maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). fisher(#) specifies the number of Newton–Raphson steps that should use the Fisher scoring Hessian or EIM before switching to the observed information matrix (OIM). This option is useful only for Newton–Raphson optimization (and not when using irls). search specifies that the command search for good starting values. This option is useful only for Newton–Raphson optimization (and not when using irls). The following options are available with glm but are not shown in the dialog box: noheader suppresses the header information from the output. 
The coefficient table is still displayed. glm — Generalized linear models 649 notable suppresses the table of coefficients from the output. The header information is still displayed. nodisplay suppresses the output. The iteration log is still displayed. coeflegend; see [R] estimation options. Remarks and examples Remarks are presented under the following headings: General use Variance estimators User-defined functions General use glm fits generalized linear models of y with covariates x: g E(y) = xβ, y∼F g( ) is called the link function, and F is the distributional family. Substituting various definitions for g( ) and F results in a surprising array of models. For instance, if y is distributed as Gaussian (normal) and g( ) is the identity function, we have E(y) = xβ, y ∼ Normal or linear regression. If g( ) is the logit function and y is distributed as Bernoulli, we have logit E(y) = xβ, y ∼ Bernoulli or logistic regression. If g( ) is the natural log function and y is distributed as Poisson, we have ln E(y) = xβ, y ∼ Poisson or Poisson regression, also known as the log-linear model. Other combinations are possible. Although glm can be used to perform linear regression (and, in fact, does so by default), this regression should be viewed as an instructional feature; regress produces such estimates more quickly, and many postestimation commands are available to explore the adequacy of the fit; see [R] regress and [R] regress postestimation. In any case, you specify the link function by using the link() option and specify the distributional family by using family(). The available link functions are Link function glm option identity log logit probit complementary log-log odds power link(identity) link(log) link(logit) link(probit) link(cloglog) link(opower #) power negative binomial log-log log-complement link(power #) link(nbinomial) link(loglog) link(logc) Define µ = E(y) and η = g(µ), meaning that g(·) maps E(y) to η = xβ + offset. 650 glm — Generalized linear models Link functions are defined as follows: identity is defined as η = g(µ) = µ. log is defined as η = ln(µ). logit is defined as η = ln µ/(1 − µ) , the natural log of the odds. probit is defined as η = Φ−1 (µ), where Φ−1 ( ) is the inverse Gaussian cumulative. cloglog is defined as η = ln − ln(1 − µ) . n opower is defined as η = µ/(1 − µ) − 1 /n, the power of the odds. The function is generalized so that link(opower 0) is equivalent to link(logit), the natural log of the odds. power is defined as η = µn . Specifying link(power 1) is equivalent to specifying link(identity). The power function is generalized so that µ0 ≡ ln(µ). Thus link(power 0) is equivalent to link(log). Negative powers are, of course, allowed. nbinomial is defined as η = ln µ/(µ + k) , where k = 1 if family(nbinomial) is specified, k = # k if family(nbinomial # k ) is specified, and k is estimated via maximum likelihood if family(nbinomial ml) is specified. loglog is defined as η = −ln{−ln(µ)}. logc is defined as η = ln(1 − µ). The available distributional families are Family glm option Gaussian (normal) inverse Gaussian Bernoulli/binomial Poisson family(gaussian) family(igaussian) family(binomial) family(poisson) negative binomial gamma family(nbinomial) family(gamma) family(normal) is a synonym for family(gaussian). The binomial distribution can be specified as 1) family(binomial), 2) family(binomial #N ), or 3) family(binomial varnameN ). In case 2, #N is the value of the binomial denominator N , the number of trials. 
Specifying family(binomial 1) is the same as specifying family(binomial). In case 3, varnameN is the variable containing the binomial denominator, allowing the number of trials to vary across observations. The negative binomial distribution can be specified as 1) family(nbinomial), 2) family(nbinomial # k ), or 3) family(nbinomial ml). Omitting # k is equivalent to specifying family(nbinomial 1). In case 3, the value of # k is estimated via maximum likelihood. The value # k enters the variance and deviance functions. Typical values range between 0.01 and 2; see the technical note below. You do not have to specify both family() and link(); the default link() is the canonical link for the specified family() (except for nbinomial): glm — Generalized linear models Family Default link family(gaussian) family(igaussian) family(binomial) family(poisson) link(identity) link(power -2) link(logit) link(log) family(nbinomial) family(gamma) link(log) link(power -1) 651 If you specify both family() and link(), not all combinations make sense. You may choose from the following combinations: identity log Gaussian inverse Gaussian x x x x binomial Poisson negative binomial gamma x x x x x x x x logit probit cloglog power opower nbinomial loglog logc x x x x x x x x x x x x x Technical note Some family() and link() combinations result in models already fit by Stata. These are family() link() Options Equivalent Stata command gaussian gaussian identity identity regress newey, t(var) lag(#) (see note 1) binomial cloglog nothing | irls | irls vce(oim) t(var) vce(hac nwest #) vfactor(#v ) nothing | irls vce(oim) binomial probit nothing | irls vce(oim) probit (see note 2) cloglog (see note 2) binomial logit nothing | irls | irls vce(oim) logit or logistic (see note 3) poisson log nothing | irls | irls vce(oim) poisson (see note 3) nbinomial gamma log log nothing | irls vce(oim) scale(1) nbreg (see note 4) streg, dist(exp) nohr (see note 5) Notes: 1. The variance factor # v should be set to n/(n − k), where n is the number of observations and k the number of regressors. If the number of regressors is not specified, the estimated standard errors will, as a result, differ by this factor. 2. Because the link is not the canonical link for the binomial family, you must specify the vce(oim) option if using irls to get equivalent standard errors. If irls is used without vce(oim), the regression coefficients will be the same but the standard errors will be only asymptotically equivalent. If no options are specified (nothing), glm will optimize using Newton–Raphson, making it equivalent to the other Stata command. See [R] cloglog and [R] probit for more details about these commands. 3. Because the canonical link is being used, the standard errors will be equivalent whether the EIM or the OIM estimator of variance is used. 652 glm — Generalized linear models 4. Family negative binomial, log-link models — also known as negative binomial regression models — are used for data with an overdispersed Poisson distribution. Although glm can be used to fit such models, using Stata’s maximum likelihood nbreg command is probably better. In the GLM approach, you specify family(nbinomial # k ) and then search for a # k that results in the deviance-based dispersion being 1. You can also specify family(nbinomial ml) to estimate # k via maximum likelihood, which will report the same value returned from nbreg. However, nbreg also reports a confidence interval for it; see [R] nbreg and Rogers (1993). 
Of course, glm allows links other than log, and for those links, including the canonical nbinomial link, you will need to use glm. 5. glm can be used to estimate parameters from exponential regressions, but this method requires specifying scale(1). However, censoring is not available. Censored exponential regression may be modeled using glm with family(poisson). The log of the original response is entered into a Poisson model as an offset, whereas the new response is the censor variable. The result of such modeling is identical to the log relative hazard parameterization of streg, dist(exp) nohr. See [ST] streg for details about the streg command. In general, where there is overlap between a capability of glm and that of some other Stata command, we recommend using the other Stata command. Our recommendation is not because of some inferiority of the GLM approach. Rather, those other commands, by being specialized, provide options and ancillary commands that are missing in the broader glm framework. Nevertheless, glm does produce the same answers where it should. Special note. When equivalence is expected, for some datasets, you may still see very slight differences in the results, most often only in the later digits of the standard errors. When you compare glm output to an equivalent Stata command, these tiny discrepancies arise for many reasons: a. glm uses a general methodology for starting values, whereas the equivalent Stata command may be more specialized in its treatment of starting values. b. When using a canonical link, glm, irls should be equivalent to the maximum likelihood method of the equivalent Stata command, yet the convergence criterion is different (one is for deviance, the other for log likelihood). These discrepancies are easily resolved by adjusting one convergence criterion to correspond to the other. c. When both glm and the equivalent Stata command use Newton–Raphson, small differences may still occur if the Stata command has a different default convergence criterion from that of glm. Adjusting the convergence criterion will resolve the difference. See [R] ml and [R] maximize for more details. Example 1 In example 1 of [R] logistic, we fit a model based on data from a study of risk factors associated with low birthweight (Hosmer, Lemeshow, and Sturdivant 2013, 24). We can replicate the estimation by using glm: glm — Generalized linear models 653 . use http://www.stata-press.com/data/r13/lbw (Hosmer & Lemeshow data) . glm low age lwt i.race smoke ptl ht ui, family(binomial) link(logit) Iteration 0: log likelihood = -101.0213 Iteration 1: log likelihood = -100.72519 Iteration 2: log likelihood = -100.724 Iteration 3: log likelihood = -100.724 Generalized linear models No. of obs = 189 Optimization : ML Residual df = 180 Scale parameter = 1 Deviance = 201.4479911 (1/df) Deviance = 1.119156 Pearson = 182.0233425 (1/df) Pearson = 1.011241 Variance function: V(u) = u*(1-u) [Bernoulli] Link function : g(u) = ln(u/(1-u)) [Logit] AIC = 1.1611 Log likelihood = -100.7239956 BIC = -742.0665 OIM Std. Err. low Coef. z P>|z| [95% Conf. 
Interval] age lwt -.0271003 -.0151508 .0364504 .0069259 -0.74 -2.19 0.457 0.029 -.0985418 -.0287253 .0443412 -.0015763 race black other 1.262647 .8620792 .5264101 .4391532 2.40 1.96 0.016 0.050 .2309024 .0013548 2.294392 1.722804 smoke ptl ht ui _cons .9233448 .5418366 1.832518 .7585135 .4612239 .4008266 .346249 .6916292 .4593768 1.20459 2.30 1.56 2.65 1.65 0.38 0.021 0.118 0.008 0.099 0.702 .137739 -.136799 .4769494 -.1418484 -1.899729 1.708951 1.220472 3.188086 1.658875 2.822176 glm, by default, presents coefficient estimates, whereas logistic presents the exponentiated coefficients — the odds ratios. glm’s eform option reports exponentiated coefficients, and glm, like Stata’s other estimation commands, replays results. 654 glm — Generalized linear models . glm, eform Generalized linear models Optimization : ML Deviance Pearson = = No. of obs Residual df Scale parameter (1/df) Deviance (1/df) Pearson 201.4479911 182.0233425 Variance function: V(u) = u*(1-u) Link function : g(u) = ln(u/(1-u)) [Bernoulli] [Logit] Log likelihood AIC BIC = -100.7239956 OIM Std. Err. low Odds Ratio age lwt .9732636 .9849634 .0354759 .0068217 race black other 3.534767 2.368079 smoke ptl ht ui _cons 2.517698 1.719161 6.249602 2.1351 1.586014 z = = = = = 189 180 1 1.119156 1.011241 = 1.1611 = -742.0665 P>|z| [95% Conf. Interval] -0.74 -2.19 0.457 0.029 .9061578 .9716834 1.045339 .9984249 1.860737 1.039949 2.40 1.96 0.016 0.050 1.259736 1.001356 9.918406 5.600207 1.00916 .5952579 4.322408 .9808153 1.910496 2.30 1.56 2.65 1.65 0.38 0.021 0.118 0.008 0.099 0.702 1.147676 .8721455 1.611152 .8677528 .1496092 5.523162 3.388787 24.24199 5.2534 16.8134 These results are the same as those reported in example 1 of [R] logistic. Included in the output header are values for the Akaike (1973) information criterion (AIC) and the Bayesian information criterion (BIC) (Raftery 1995). Both are measures of model fit adjusted for the number of parameters that can be compared across models. In both cases, a smaller value generally indicates a better model fit. AIC is based on the log likelihood and thus is available only when Newton–Raphson optimization is used. BIC is based on the deviance and thus is always available. Technical note The values for AIC and BIC reported in the output after glm are different from those reported by estat ic: . estat ic Akaike’s information criterion and Bayesian information criterion Model Obs ll(null) ll(model) df AIC BIC . 189 . -100.724 9 219.448 248.6237 Note: N=Obs used in calculating BIC; see [R] BIC note There are various definitions of these information criteria (IC) in the literature; glm and estat ic use different definitions. glm bases its computation of the BIC on deviance, whereas estat ic uses the likelihood. Both glm and estat ic use the likelihood to compute the AIC; however, the AIC from estat ic is equal to N , the number of observations, times the AIC from glm. Refer to Methods and formulas in this entry and [R] estat ic for the references and formulas used by glm and estat ic, respectively, to compute AIC and BIC. Inferences based on comparison of IC values reported by glm glm — Generalized linear models 655 for different GLM models will be equivalent to those based on comparison of IC values reported by estat ic after glm. Example 2 We use data from an early insecticide experiment, given in Pregibon (1980). The variables are ldose, the log dose of insecticide; n, the number of flour beetles subjected to each dose; and r, the number killed. . 
use http://www.stata-press.com/data/r13/ldose . list, sep(4) ldose n r 1. 2. 3. 4. 1.6907 1.7242 1.7552 1.7842 59 60 62 56 6 13 18 28 5. 6. 7. 8. 1.8113 1.8369 1.861 1.8839 63 59 62 60 52 53 61 60 The aim of the analysis is to estimate a dose–response relationship between p, the proportion killed, and X , the log dose. As a first attempt, we will formulate the model as a linear logistic regression of p on ldose; that is, we will take the logit of p and represent the dose–response curve as a straight line in X : ln p/(1 − p) = β0 + β1 X Because the data are grouped, we cannot use Stata’s logistic command to fit the model. Stata does, however, already have a command for performing logistic regression on data organized in this way, so we could type . blogit r n ldose 656 glm — Generalized linear models Instead, we will fit the model by using glm: . glm r ldose, family(binomial n) link(logit) Iteration Iteration Iteration Iteration 0: 1: 2: 3: log log log log likelihood likelihood likelihood likelihood = = = = -18.824848 -18.715271 -18.715123 -18.715123 Generalized linear models Optimization : ML Deviance Pearson = = No. of obs Residual df Scale parameter (1/df) Deviance (1/df) Pearson 11.23220702 10.0267936 Variance function: V(u) = u*(1-u/n) Link function : g(u) = ln(u/(n-u)) [Binomial] [Logit] Log likelihood AIC BIC = -18.71512262 r Coef. ldose _cons 34.27034 -60.71747 OIM Std. Err. 2.912141 5.180713 z 11.77 -11.72 P>|z| 0.000 0.000 = = = = = 8 6 1 1.872035 1.671132 = 5.178781 = -1.244442 [95% Conf. Interval] 28.56265 -70.87149 39.97803 -50.56346 The only difference between blogit and glm here is how they obtain the answer. blogit expands the data to contain 481 observations (the sum of n) so that it can run Stata’s standard, individual-level logistic command. glm, on the other hand, uses the information on the binomial denominator directly. We specified family(binomial n), meaning that variable n contains the denominator. Parameter estimates and standard errors from the two approaches do not differ. An alternative model, which gives asymmetric sigmoid curves for p, involves the complementary log-log, or cloglog, function: ln − ln(1 − p) = β0 + β1 X We fit this model by using glm: . glm r ldose, family(binomial n) link(cloglog) Iteration Iteration Iteration Iteration 0: 1: 2: 3: log log log log likelihood likelihood likelihood likelihood = = = = -14.883594 -14.822264 -14.822228 -14.822228 Generalized linear models Optimization : ML Deviance Pearson = = No. of obs Residual df Scale parameter (1/df) Deviance (1/df) Pearson 3.446418004 3.294675153 = = = = = 8 6 1 .574403 .5491125 Variance function: V(u) = u*(1-u/n) Link function : g(u) = ln(-ln(1-u/n)) [Binomial] [Complementary log-log] Log likelihood AIC BIC = -14.82222811 r Coef. ldose _cons 22.04118 -39.57232 OIM Std. Err. 1.793089 3.229047 z 12.29 -12.26 P>|z| 0.000 0.000 = 4.205557 = -9.030231 [95% Conf. Interval] 18.52679 -45.90114 25.55557 -33.24351 glm — Generalized linear models 657 The complementary log-log model is preferred; the deviance for the logistic model, 11.23, is much higher than the deviance for the cloglog model, 3.45. This change also is evident by comparing log likelihoods, or equivalently, AIC values. This example also shows the advantage of the glm command — we can vary assumptions easily. Note the minor difference in what we typed to obtain the logistic and cloglog models: . glm r ldose, family(binomial n) link(logit) . 
glm r ldose, family(binomial n) link(cloglog) If we were performing this work for ourselves, we would have typed the commands in a more abbreviated form: . glm r ldose, f(b n) l(l) . glm r ldose, f(b n) l(cl) Technical note Factor variables may be used with glm. Say that, in the example above, we had ldose, the log dose of insecticide; n, the number of flour beetles subjected to each dose; and r, the number killed — all as before — except that now we have results for three different kinds of beetles. Our hypothetical data include beetle, which contains the values 1 (“Destructive flour”), 2 (“Red flour”), and 3 (“Mealworm”). . use http://www.stata-press.com/data/r13/beetle . list, sep(0) beetle 1. 2. 3. 4. 5. 23. 24. n r 1.6907 59 1.7242 60 1.7552 62 1.7842 56 1.8113 63 (output omitted ) 3 1.861 64 3 1.8839 58 6 13 18 28 52 1 1 1 1 1 ldose 23 22 Let’s assume that, at first, we wish merely to add a shift factor for the type of beetle. We could type . glm r i.beetle ldose, family(bin n) link(cloglog) Iteration 0: log likelihood = -79.012269 Iteration 1: log likelihood = -76.94951 Iteration 2: log likelihood = -76.945645 Iteration 3: log likelihood = -76.945645 Generalized linear models No. of obs = 24 Optimization : ML Residual df = 20 Scale parameter = 1 Deviance = 73.76505595 (1/df) Deviance = 3.688253 Pearson = 71.8901173 (1/df) Pearson = 3.594506 Variance function: V(u) = u*(1-u/n) [Binomial] Link function : g(u) = ln(-ln(1-u/n)) [Complementary log-log] AIC = 6.74547 Log likelihood = -76.94564525 BIC = 10.20398 658 glm — Generalized linear models OIM Std. Err. r Coef. z P>|z| [95% Conf. Interval] beetle Red flour Mealworm -.0910396 -1.836058 .1076132 .1307125 -0.85 -14.05 0.398 0.000 -.3019576 -2.09225 .1198783 -1.579867 ldose _cons 19.41558 -34.84602 .9954265 1.79333 19.50 -19.43 0.000 0.000 17.46458 -38.36089 21.36658 -31.33116 We find strong evidence that the insecticide works differently on the mealworm. We now check whether the curve is merely shifted or also differently sloped: . glm r beetle##c.ldose, family(bin n) link(cloglog) Iteration 0: log Iteration 1: log Iteration 2: log Iteration 3: log Generalized linear Optimization : Deviance Pearson = = likelihood likelihood likelihood likelihood models ML = = = = -67.270188 -65.149316 -65.147978 -65.147978 No. of obs Residual df Scale parameter (1/df) Deviance (1/df) Pearson 50.16972096 49.28422567 Variance function: V(u) = u*(1-u/n) Link function : g(u) = ln(-ln(1-u/n)) Log likelihood 24 18 1 2.787207 2.738013 [Binomial] [Complementary log-log] AIC = 5.928998 BIC = -7.035248 = -65.14797776 OIM Std. Err. = = = = = r Coef. beetle Red flour Mealworm -.79933 17.78741 4.470882 4.586429 -0.18 3.88 0.858 0.000 -9.562098 8.798172 7.963438 26.77664 ldose 22.04118 1.793089 12.29 0.000 18.52679 25.55557 .3838708 -10.726 2.478477 2.526412 0.15 -4.25 0.877 0.000 -4.473855 -15.67768 5.241596 -5.774321 -39.57232 3.229047 -12.26 0.000 -45.90114 -33.24351 beetle# c.ldose Red flour Mealworm _cons z P>|z| [95% Conf. Interval] We find that the (complementary log-log) dose–response curve for the mealworm has roughly half the slope of that for the destructive flour beetle. See [U] 25 Working with categorical data and factor variables; what is said there concerning linear regression is applicable to any GLM model. glm — Generalized linear models 659 Variance estimators glm offers many variance options and gives different types of standard errors when used in various combinations. 
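Before turning to the examples, we sketch the general pattern below; the outcome y, the covariate x, and the cluster identifier id are hypothetical placeholders, and output is omitted.

. glm y x, family(poisson)                            // ML with OIM standard errors (the default)
. glm y x, family(poisson) irls                       // IRLS with EIM standard errors
. glm y x, family(poisson) vce(robust)                // Huber/White/sandwich estimator
. glm y x, family(poisson) vce(cluster id)            // sandwich allowing intragroup correlation
. glm y x, family(poisson) vce(bootstrap, reps(100))  // bootstrap replications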
We highlight some of them here, but for a full explanation, see Hardin and Hilbe (2012). Example 3 Continuing with our flour beetle data, we rerun the most recently displayed model, this time requesting estimation via IRLS. . use http://www.stata-press.com/data/r13/beetle . glm r beetle##c.ldose, f(bin n) l(cloglog) ltol(1e-13) irls Iteration 1: deviance = 54.41414 Iteration 2: deviance = 50.19424 Iteration 3: deviance = 50.16973 (output omitted ) Generalized linear models No. of obs = 24 Optimization : MQL Fisher scoring Residual df = 18 (IRLS EIM) Scale parameter = 1 Deviance = 50.16972096 (1/df) Deviance = 2.787207 Pearson = 49.28422567 (1/df) Pearson = 2.738013 Variance function: V(u) = u*(1-u/n) [Binomial] Link function : g(u) = ln(-ln(1-u/n)) [Complementary log-log] BIC = -7.035248 EIM Std. Err. r Coef. beetle Red flour Mealworm -.79933 17.78741 4.586649 4.624834 -0.17 3.85 0.862 0.000 -9.788997 8.7229 8.190337 26.85192 ldose 22.04118 1.799356 12.25 0.000 18.5145 25.56785 .3838708 -10.726 2.544068 2.548176 0.15 -4.21 0.880 0.000 -4.602411 -15.72033 5.370152 -5.731665 -39.57232 3.240274 -12.21 0.000 -45.92314 -33.2215 beetle# c.ldose Red flour Mealworm _cons z P>|z| [95% Conf. Interval] Note our use of the ltol() option, which, although unrelated to our discussion on variance estimation, was used so that the regression coefficients would match those of the previous Newton–Raphson (NR) fit. Because IRLS uses the EIM for optimization, the variance estimate is also based on EIM. If we want optimization via IRLS but the variance estimate based on OIM, we specify glm, irls vce(oim): 660 glm — Generalized linear models . glm r beetle##c.ldose, f(b n) l(cl) ltol(1e-15) irls vce(oim) noheader nolog OIM Std. Err. r Coef. beetle Red flour Mealworm -.79933 17.78741 4.470882 4.586429 -0.18 3.88 0.858 0.000 -9.562098 8.798172 7.963438 26.77664 ldose 22.04118 1.793089 12.29 0.000 18.52679 25.55557 .3838708 -10.726 2.478477 2.526412 0.15 -4.25 0.877 0.000 -4.473855 -15.67768 5.241596 -5.774321 -39.57232 3.229047 -12.26 0.000 -45.90114 -33.24351 beetle# c.ldose Red flour Mealworm _cons z P>|z| [95% Conf. Interval] This approach is identical to NR except for the convergence path. Because the cloglog link is not the canonical link for the binomial family, EIM and OIM produce different results. Both estimators, however, are asymptotically equivalent. Going back to NR, we can also specify vce(robust) to get the Huber/White/sandwich estimator of variance: . glm r beetle##c.ldose, f(b n) l(cl) vce(robust) noheader nolog Robust Std. Err. r Coef. beetle Red flour Mealworm -.79933 17.78741 5.733049 5.158477 ldose 22.04118 beetle# c.ldose Red flour Mealworm _cons z P>|z| [95% Conf. Interval] -0.14 3.45 0.889 0.001 -12.0359 7.676977 10.43724 27.89784 .8998551 24.49 0.000 20.27749 23.80486 .3838708 -10.726 3.174427 2.800606 0.12 -3.83 0.904 0.000 -5.837892 -16.21508 6.605633 -5.236912 -39.57232 1.621306 -24.41 0.000 -42.75003 -36.39462 The sandwich estimator gets its name from the form of the calculation—it is the multiplication of three matrices, with the outer two matrices (the “bread”) set to the OIM variance matrix. When irls is used along with vce(robust), the EIM variance matrix is instead used as the bread. Using a result from McCullagh and Nelder (1989), Newson (1999) points out that the EIM and OIM variance matrices are equivalent under the canonical link. Thus if irls is specified with the canonical link, the resulting variance is labeled “Robust”. 
When the noncanonical link for the family is used, which is the case in the example below, the EIM and OIM variance matrices differ, so the resulting variance is labeled “Semirobust”. glm — Generalized linear models 661 . glm r beetle##c.ldose, f(b n) l(cl) irls ltol(1e-15) vce(robust) noheader > nolog Semirobust Std. Err. r Coef. beetle Red flour Mealworm -.79933 17.78741 6.288963 5.255307 -0.13 3.38 0.899 0.001 -13.12547 7.487194 11.52681 28.08762 ldose 22.04118 .9061566 24.32 0.000 20.26514 23.81721 .3838708 -10.726 3.489723 2.855897 0.11 -3.76 0.912 0.000 -6.455861 -16.32345 7.223603 -5.128542 -39.57232 1.632544 -24.24 0.000 -42.77205 -36.3726 beetle# c.ldose Red flour Mealworm _cons z P>|z| [95% Conf. Interval] The outer product of the gradient (OPG) estimate of variance is one that avoids the calculation of second derivatives. It is equivalent to the “middle” part of the sandwich estimate of variance and can be specified by using glm, vce(opg), regardless of whether NR or IRLS optimization is used. . glm r beetle##c.ldose, f(b n) l(cl) vce(opg) noheader nolog OPG Std. Err. r Coef. beetle Red flour Mealworm -.79933 17.78741 6.664045 6.838505 -0.12 2.60 0.905 0.009 -13.86062 4.384183 12.26196 31.19063 ldose 22.04118 3.572983 6.17 0.000 15.03826 29.0441 .3838708 -10.726 3.700192 3.796448 0.10 -2.83 0.917 0.005 -6.868372 -18.1669 7.636114 -3.285097 -39.57232 6.433101 -6.15 0.000 -52.18097 -26.96368 beetle# c.ldose Red flour Mealworm _cons z P>|z| [95% Conf. Interval] The OPG estimate of variance is a component of the BHHH (Berndt et al. 1974) optimization technique. This method of optimization is also available with glm with the technique() option; however, the technique() option is not allowed with the irls option. Example 4 The Newey–West (1987) estimator of variance is a sandwich estimator with the “middle” of the sandwich modified to take into account possible autocorrelation between the observations. These estimators are a generalization of those given by the Stata command newey for linear regression. See [TS] newey for more details. For example, consider the dataset given in [TS] newey, which has time-series measurements on usr and idle. We want to perform a linear regression with Newey–West standard errors. 662 glm — Generalized linear models . use http://www.stata-press.com/data/r13/idle2 . list usr idle time usr 1. 2. 3. 4. 5. 29. 30. idle time 0 100 1 0 100 2 0 97 3 1 98 4 2 94 5 (output omitted ) 1 98 29 1 98 30 Examining Methods and formulas of [TS] newey, we see that the variance estimate is multiplied by a correction factor of n/(n − k), where k is the number of regressors. glm, vce(hac . . . ) does not make this correction, so to get the same standard errors, we must use the vfactor() option within glm to make the correction manually. . display 30/28 1.0714286 . tsset time time variable: time, 1 to 30 delta: 1 unit . glm usr idle, vce(hac nwest 3) vfactor(1.0714286) Iteration 0: log likelihood = -71.743396 Generalized linear models No. of obs Optimization : ML Residual df Scale parameter Deviance = 209.8123165 (1/df) Deviance Pearson = 209.8123165 (1/df) Pearson Variance function: V(u) = 1 [Gaussian] Link function : g(u) = u [Identity] HAC kernel (lags): Newey-West (3) AIC Log likelihood = -71.74339627 BIC usr Coef. idle _cons -.2281501 23.13483 HAC Std. Err. .0690928 6.327033 z -3.30 3.66 P>|z| 0.001 0.000 = = = = = 30 28 7.493297 7.493297 7.493297 = = 4.916226 114.5788 [95% Conf. 
Interval] -.3635694 10.73407 -.0927307 35.53558 The glm command above reproduces the results given in [TS] newey. We may now generalize this output to models other than simple linear regression and to different kernel weights. glm — Generalized linear models . glm usr idle, fam(gamma) link(log) vce(hac gallant 3) Iteration 0: log likelihood = -61.76593 Iteration 1: log likelihood = -60.963233 Iteration 2: log likelihood = -60.95097 Iteration 3: log likelihood = -60.950965 Generalized linear models No. of obs Optimization : ML Residual df Scale parameter Deviance = 9.908506707 (1/df) Deviance Pearson = 12.07628677 (1/df) Pearson Variance function: V(u) = u^2 [Gamma] Link function : g(u) = ln(u) [Log] HAC kernel (lags): Gallant (3) AIC Log likelihood = -60.95096484 BIC usr Coef. idle _cons -.0796609 7.771011 HAC Std. Err. .0184647 1.510198 z -4.31 5.15 = = = = = 663 30 28 .431296 .3538752 .431296 = 4.196731 = -85.32502 P>|z| [95% Conf. Interval] 0.000 0.000 -.115851 4.811078 -.0434708 10.73094 glm also offers variance estimators based on the bootstrap (resampling your data with replacement) and the jackknife (refitting the model with each observation left out in succession). Also included is the one-step jackknife estimate, which, instead of performing full reestimation when each observation is omitted, calculates a one-step NR estimate, with the full data regression coefficients as starting values. . set seed 1 . glm usr idle, fam(gamma) link(log) vce(bootstrap, reps(100) nodots) Generalized linear models No. of obs = 30 Optimization : ML Residual df = 28 Scale parameter = .431296 Deviance = 9.908506707 (1/df) Deviance = .3538752 Pearson = 12.07628677 (1/df) Pearson = .431296 Variance function: V(u) = u^2 [Gamma] Link function : g(u) = ln(u) [Log] Log likelihood AIC BIC = -60.95096484 usr Observed Coef. Bootstrap Std. Err. idle _cons -.0796609 7.771011 .0216591 1.80278 z -3.68 4.31 P>|z| 0.000 0.000 = 4.196731 = -85.32502 Normal-based [95% Conf. Interval] -.1221119 4.237627 -.0372099 11.3044 See Hardin and Hilbe (2012) for a full discussion of the variance options that go with glm and, in particular, of how the different variance estimators are modified when vce(cluster clustvar) is specified. Finally, not all variance options are supported with all types of weights. See help glm for a current table of the variance options that are supported with the different weights. 664 glm — Generalized linear models User-defined functions glm may be called with a user-written link function, variance (family) function, Newey–West kernel-weight function, or any combination of the three. Syntax of link functions program progname version 13 args todo eta mu return if ‘todo’ == -1 { /* Set global macros for output */ global SGLM_lt "title for link function" global SGLM_lf "subtitle showing link definition" exit } if ‘todo’ == 0 { /* set η=g(µ) */ /* Intermediate calculations go here */ generate double ‘eta’ = . . . exit } if ‘todo’ == 1 { /* set µ=g −1 (η) */ /* Intermediate calculations go here */ generate double ‘mu’ = . . . exit } if ‘todo’ == 2 { /* set return = ∂µ/∂η */ /* Intermediate calculations go here */ generate double ‘return’ = . . . exit } if ‘todo’ == 3 { /* set return = ∂ 2 µ/∂η 2 */ /* Intermediate calculations go here */ generate double ‘return’ = . . . 
exit } display as error "Unknown call to glm link function" exit 198 end glm — Generalized linear models Syntax of variance functions program progname version 13 args todo eta mu return if ‘todo’ == -1 { /* Set global macros for output */ /* Also check that depvar is in proper range */ /* Note: For this call, eta contains indicator for whether each obs. is in est. sample */ global SGLM_vt "title for variance function" global SGLM_vf "subtitle showing function definition" global SGLM_mu "program to call to enforce boundary conditions on µ" exit } if ‘todo’ == 0 { /* set η to initial value. */ /* Intermediate calculations go here */ generate double ‘eta’ = . . . exit } if ‘todo’ == 1 { /* set return = V (µ) */ /* Intermediate calculations go here */ generate double ‘return’ = . . . exit } if ‘todo’ == 2 { /* set return = ∂V (µ)/∂µ */ /* Intermediate calculations go here */ generate double ‘return’ = . . . exit } if ‘todo’ == 3 { /* set return = squared deviance (per observation) */ /* Intermediate calculations go here */ generate double ‘return’ = . . . exit } if ‘todo’ == 4 { /* set return = Anscombe residual */ /* Intermediate calculations go here */ generate double ‘return’ = . . . exit } if ‘todo’ == 5 { /* set return = log likelihood */ /* Intermediate calculations go here */ generate double ‘return’ = . . . exit } if ‘todo’ == 6 { /* set return = adjustment for deviance residuals */ /* Intermediate calculations go here */ generate double ‘return’ = . . . exit } display as error "Unknown call to glm variance function" exit 198 end 665 666 glm — Generalized linear models Syntax of Newey–West kernel-weight functions program progname, rclass version 13 args G j /* G is the maximum lag */ /* j is the current lag */ /* Intermediate calculations go here */ return scalar wt = computed weight return local setype "Newey-West" return local sewtype "name of kernel" end Global macros available for user-written programs Global macro Description SGLM SGLM SGLM SGLM SGLM SGLM program name of variance (family) evaluator program name of link evaluator dependent variable name binomial denominator negative binomial k power if power() or opower() is used, or an argument from a user-specified link function indicator; set to one if scale is equal to one value of scale parameter V L y m a p SGLM s1 SGLM ph Example 5 Suppose that we wish to perform Poisson regression with a log-link function. Although this regression is already possible with standard glm, we will write our own version for illustrative purposes. Because we want a log link, η = g(µ) = ln(µ), and for a Poisson family the variance function is V (µ) = µ. The Poisson density is given by f (yi ) = e− exp(µi ) eµi yi yi ! resulting in a log likelihood of L= n X {−eµi + µi yi − ln(yi !)} i=1 The squared deviance of the ith observation for the Poisson family is given by d2i = 2µ bi 2 yi ln(yi /b µi ) − (yi − µ bi ) if yi = 0 otherwise glm — Generalized linear models 667 We now have enough information to write our own Poisson-log glm module. 
We create the file mylog.ado, which contains program mylog version 13 args todo eta mu return if ‘todo’ == -1 { global SGLM_lt "My Log" // Titles for output global SGLM_lf "ln(u)" exit } if ‘todo’ == 0 { gen double ‘eta’ = ln(‘mu’) // η = ln(µ) exit } if ‘todo’ == 1 { gen double ‘mu’ = exp(‘eta’) // µ = exp(η) exit } if ‘todo’ == 2 { gen double ‘return’ = ‘mu’ // ∂µ/∂η = exp(η) = µ exit } if ‘todo’ == 3 { gen double ‘return’ = ‘mu’ // ∂ 2 µ/∂η 2 = exp(η) = µ exit } di as error "Unknown call to glm link function" exit 198 end and we create the file mypois.ado, which contains program mypois version 13 args todo eta mu return if ‘todo’ == -1 { local y "$SGLM y" local touse "‘eta’" // ‘eta’ marks estimation sample here capture assert ‘y’>=0 if ‘touse’ // check range of y if _rc { di as error ‘"dependent variable ‘y’ has negative values"’ exit 499 } global SGLM vt "My Poisson" // Titles for output global SGLM vf "u" global SGLM mu "glim_mu 0 ." // see note 1 exit } if ‘todo’ == 0 { // Initialization of η ; see note 2 gen double ‘eta’ = ln(‘mu’) exit } 668 glm — Generalized linear models if ‘todo’ == 1 { gen double ‘return’ = ‘mu’ // V (µ) = µ exit } if ‘todo’ == 2 { // ∂ V (µ)/∂µ gen byte ‘return’ = 1 exit } if ‘todo’ == 3 { // squared deviance, defined above local y "$SGLM y" if "‘y’" == "" { local y "‘e(depvar)’" } gen double ‘return’ = cond(‘y’==0, 2*‘mu’, /* */ 2*(‘y’*ln(‘y’/‘mu’)-(‘y’-‘mu’))) exit } if ‘todo’ == 4 { // Anscombe residual; see note 3 local y "$SGLM y" if "‘y’" == "" { local y "‘e(depvar)’" } gen double ‘return’ = 1.5*(‘y’^(2/3)-‘mu’^(2/3)) / ‘mu’^(1/6) exit } if ‘todo’ == 5 { // log likelihood; see note 4 local y "$SGLM y" if "‘y’" == "" { local y "‘e(depvar)’" } gen double ‘return’ = -‘mu’+‘y’*ln(‘mu’)-lngamma(‘y’+1) exit } if ‘todo’ == 6 { // adjustment to residual; see note 5 gen double ‘return’ = 1/(6*sqrt(‘mu’)) exit } di as error "Unknown call to glm variance function" error 198 end Notes: 1. glim mu is a Stata program that will, at each iteration, bring µ b back into its plausible range, should it stray out of it. Here glim mu is called with the arguments zero and missing, meaning that zero is the lower bound of µ b and there exists no upper bound—such is the case for Poisson models. 2. Here the initial value of η is easy because we intend to fit this model with our user-defined log link. In general, however, the initialization may need to vary according to the link to obtain convergence. If so, the global macro SGLM L is used to determine which link is being utilized. 3. The Anscombe formula is given here because we know it. If we were not interested in Anscombe residuals, we could merely set ‘return’ to missing. Also, the local macro y is set either to SGLM y if it is in current estimation or to e(depvar) if this function is being accessed by predict. 4. If we were not interested in ML estimation, we could omit this code entirely and just leave an exit statement in its place. Similarly, if we were not interested in deviance or IRLS optimization, we could set ‘return’ in the deviance portion of the code (‘todo’==3) to missing. glm — Generalized linear models 669 5. This code defines the term to be added to the predicted residuals if the adjusted option is specified. Again, if we were not interested, we could set ‘return’ to missing. We can now test our Poisson-log module by running it on the airline data presented in [R] poisson. . use http://www.stata-press.com/data/r13/airline . list airline injuries n XYZowned airline injuries n XYZowned 1. 2. 3. 4. 5. 
1 2 3 4 5 11 7 7 19 9 0.0950 0.1920 0.0750 0.2078 0.1382 1 0 0 0 0 6. 7. 8. 9. 6 7 8 9 4 3 1 3 0.0540 0.1292 0.0503 0.0629 1 0 0 1 . generate lnN=ln(n) . glm injuries XYZowned lnN, f(mypois) l(mylog) scale(1) Iteration 0: log Iteration 1: log Iteration 2: log Iteration 3: log Generalized linear Optimization : Deviance Pearson = = likelihood likelihood likelihood likelihood models ML = = = = -22.557572 -22.332861 -22.332276 -22.332276 12.70432823 12.7695081 Variance function: V(u) = u Link function : g(u) = ln(u) Log likelihood = -22.33227605 injuries Coef. XYZowned lnN _cons .6840668 1.424169 4.863891 No. of obs Residual df Scale parameter (1/df) Deviance (1/df) Pearson = = = = = [My Poisson] [My Log] AIC BIC = 5.629395 = -.4790192 OIM Std. Err. z P>|z| .3895877 .3725155 .7090501 1.76 3.82 6.86 0.079 0.000 0.000 9 6 1 2.117388 2.128251 [95% Conf. Interval] -.0795111 .6940517 3.474178 1.447645 2.154286 6.253603 (Standard errors scaled using dispersion equal to square root of 1.) These are precisely the results given in [R] poisson and are those that would have been given had we run glm, family(poisson) link(log). The only minor adjustment we needed to make was to specify the scale(1) option. If scale() is left unspecified, glm assumes scale(1) for discrete distributions and scale(x2) for continuous ones. By default, glm assumes that any user-defined family is continuous because it has no way of checking. Thus we needed to specify scale(1) because our model is discrete. Because we were careful in defining the squared deviance, we could have fit this model with IRLS. Because log is the canonical link for the Poisson family, we would not only get the same regression coefficients but also the same standard errors. 670 glm — Generalized linear models Example 6 Suppose now that we wish to use our log link (mylog.ado) with glm’s binomial family. This task requires some modification because our current function is not equipped to deal with the binomial denominator, which we are allowed to specify. This denominator is accessible to our link function through the global macro SGLM m. We now make the modifications and store them in mylog2.ado. program mylog2 version 13 args todo eta mu return if ‘todo’ == -1 { global SGLM_lt "My Log, Version 2" if "$SGLM m" == "1" { global SGLM lf "ln(u)" } else global SGLM lf "ln(u/$SGLM m)" exit } if ‘todo’ == 0 { gen double ‘eta’ = ln(‘mu’/$SGLM m) exit } if ‘todo’ == 1 { gen double ‘mu’ = $SGLM m*exp(‘eta’) exit } if ‘todo’ == 2 { gen double ‘return’ = ‘mu’ exit } if ‘todo’ == 3 { gen double ‘return’ = ‘mu’ exit } di as error "Unknown call to glm link function" exit 198 end // <-- changed // // // // // <-<-<-<-<-- changed changed changed changed changed // <-- changed // <-- changed We can now run our new log link with glm’s binomial family. Using the flour beetle data from earlier, we have glm — Generalized linear models 671 . use http://www.stata-press.com/data/r13/beetle, clear . glm r ldose, f(bin n) l(mylog2) irls Iteration 1: deviance = 2212.108 Iteration 2: deviance = 452.9352 Iteration 3: deviance = 429.95 Iteration 4: deviance = 429.2745 Iteration 5: deviance = 429.2192 Iteration 6: deviance = 429.2082 Iteration 7: deviance = 429.2061 Iteration 8: deviance = 429.2057 Iteration 9: deviance = 429.2056 Iteration 10: deviance = 429.2056 Iteration 11: deviance = 429.2056 Iteration 12: deviance = 429.2056 Generalized linear models No. 
of obs = 24 Optimization : MQL Fisher scoring Residual df = 22 (IRLS EIM) Scale parameter = 1 Deviance = 429.205599 (1/df) Deviance = 19.50935 Pearson = 413.088142 (1/df) Pearson = 18.77673 Variance function: V(u) = u*(1-u/n) [Binomial] Link function : g(u) = ln(u/n) [My Log, Version 2] BIC = 359.2884 r Coef. ldose _cons 8.478908 -16.11006 EIM Std. Err. .4702808 .8723167 z 18.03 -18.47 P>|z| 0.000 0.000 [95% Conf. Interval] 7.557175 -17.81977 9.400642 -14.40035 For a more detailed discussion on user-defined functions, and for an example of a user-defined Newey–West kernel weight, see Hardin and Hilbe (2012). John Ashworth Nelder (1924–2010) was born in Somerset, England. He studied mathematics and statistics at Cambridge and worked as a statistician at the National Vegetable Research Station and then Rothamsted Experimental Station. In retirement, he was actively affiliated with Imperial College London. Nelder was especially well known for his contributions to the theory of linear models and to statistical computing. He was the principal architect of generalized and hierarchical generalized linear models and of the programs GenStat and GLIM. Robert William Maclagan Wedderburn (1947–1975) was born in Edinburgh and studied mathematics and statistics at Cambridge. At Rothamsted Experimental Station, he developed the theory of generalized linear models with Nelder and originated the concept of quasilikelihood. He died of anaphylactic shock from an insect bite on a canal holiday. 672 glm — Generalized linear models Stored results glm, ml stores the following in e(): Scalars e(N) e(k) e(k eq) e(k eq model) e(k dv) e(df m) e(df) e(phi) e(aic) e(bic) e(ll) e(N clust) e(chi2) e(p) e(deviance) e(deviance s) e(deviance p) e(deviance ps) e(dispers) e(dispers s) e(dispers p) e(dispers ps) e(nbml) e(vf) e(power) e(rank) e(ic) e(rc) e(converged) number of observations number of parameters number of equations in e(b) number of equations in overall model test number of dependent variables model degrees of freedom residual degrees of freedom scale parameter model AIC model BIC log likelihood, if NR number of clusters χ2 significance deviance scaled deviance Pearson deviance scaled Pearson deviance dispersion scaled dispersion Pearson dispersion scaled Pearson dispersion 1 if negative binomial parameter estimated via ML, 0 otherwise factor set by vfactor(), 1 if not set power set by power(), opower() rank of e(V) number of iterations return code 1 if converged, 0 otherwise glm — Generalized linear models Macros e(cmd) e(cmdline) e(depvar) e(varfunc) e(varfunct) e(varfuncf) e(link) e(linkt) e(linkf) e(m) e(wtype) e(wexp) e(title) e(clustvar) e(offset) e(chi2type) e(cons) e(hac kernel) e(hac lag) e(vce) e(vcetype) e(opt) e(opt1) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(marginsok) e(marginsnotok) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) Functions e(sample) glm command as typed name of dependent variable program to calculate variance function variance title variance function program to calculate link function link title link function number of binomial trials weight type weight expression title in estimation output name of cluster variable linear offset variable Wald; type of model χ2 test set if noconstant specified HAC kernel HAC lag vcetype specified in vce() title used to label Std. Err. 
ml or irls optimization title, line 1 max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict predictions allowed by margins predictions disallowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample 673 674 glm — Generalized linear models glm, irls stores the following in e(): Scalars e(N) e(k) e(k eq model) e(df m) e(df) e(phi) e(disp) e(bic) e(N clust) e(deviance) e(deviance s) e(deviance p) e(deviance ps) e(dispers) e(dispers s) e(dispers p) e(dispers ps) e(nbml) e(vf) e(power) e(rank) e(rc) Macros e(cmd) e(cmdline) e(depvar) e(varfunc) e(varfunct) e(varfuncf) e(link) e(linkt) e(linkf) e(m) e(wtype) e(wexp) e(clustvar) e(offset) e(cons) e(hac kernel) e(hac lag) e(vce) e(vcetype) e(opt) e(opt1) e(opt2) e(properties) e(predict) e(marginsok) e(marginsnotok) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(V) e(V modelbased) Functions e(sample) number of observations number of parameters number of equations in overall model test model degrees of freedom residual degrees of freedom scale parameter dispersion parameter model BIC number of clusters deviance scaled deviance Pearson deviance scaled Pearson deviance dispersion scaled dispersion Pearson dispersion scaled Pearson dispersion 1 if negative binomial parameter estimated via ML, 0 otherwise factor set by vfactor(), 1 if not set power set by power(), opower() rank of e(V) return code glm command as typed name of dependent variable program to calculate variance function variance title variance function program to calculate link function link title link function number of binomial trials weight type weight expression name of cluster variable linear offset variable set if noconstant specified HAC kernel HAC lag vcetype specified in vce() title used to label Std. Err. ml or irls optimization title, line 1 optimization title, line 2 b V program used to implement predict predictions allowed by margins predictions disallowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix variance–covariance matrix of the estimators model-based variance marks estimation sample glm — Generalized linear models 675 Methods and formulas The canonical reference on GLM is McCullagh and Nelder (1989). The term “generalized linear model” is from Nelder and Wedderburn (1972). Many people use the acronym GLIM for GLM models because of the classic GLM software tool GLIM, by Baker and Nelder (1985). See Dobson and Barnett (2008) for a concise introduction and overview. See Rabe-Hesketh and Everitt (2007) for more examples of GLM using Stata. Hoffmann (2004) focuses on applying generalized linear models, using real-world datasets, along with interpreting computer output, which for the most part is obtained using Stata. This discussion highlights the details of parameter estimation and predicted statistics. For a more detailed treatment, and for information on variance estimation, see Hardin and Hilbe (2012). glm supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. glm obtains results by IRLS, as described in McCullagh and Nelder (1989), or by maximum likelihood using Newton–Raphson. 
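For example, to see the two optimizers side by side (a minimal sketch using the automobile data shipped with Stata; the response, covariates, and family here are arbitrary and chosen only for illustration), one could type

. sysuse auto, clear
. glm price mpg weight, family(gamma) link(log)
. glm price mpg weight, family(gamma) link(log) irls

The first call maximizes the likelihood by Newton–Raphson (the default); the second requests IRLS (Fisher scoring). The two give essentially the same point estimates, while the default standard errors (observed versus expected information) coincide only when the link is canonical for the family.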
The implementation here, however, allows user-specified weights, which we denote as vj for the j th observation. Let M be the number of “observations” ignoring weights. Define if no weights are specified 1 if fweights or iweights are specified w j = vj M v /(P v ) if aweights or pweights are specified j k k P The number of observations is then N = j wj if fweights are specified and N = M otherwise. Each IRLS step is performed by regress using wj as the weights. Let d2j denote the squared deviance residual for the j th observation: For the Gaussian family, d2j = (yj − µ bj )2 . For the Bernoulli family (binomial with denominator 1), d2j = −2ln(1 − µ bj ) if yj = 0 −2ln(b µj ) otherwise For the binomial family with denominator mj , µj ) + 2(mj − yj )ln (mj − yj )/(mj − µ bj ) 2yj ln(yj /b 2 dj = 2mj ln mj /(mj − µ bj ) 2yj ln(yj /b µj ) For the Poisson family, d2j = 2µ bj 2 yj ln(yj /b µj ) − (yj − µ bj ) For the gamma family, d2j = −2 ln(yj /b µj ) − (yj − µ bj )/b µj . For the inverse Gaussian, d2j = (yj − µ bj )2 /(b µ2j yj ). if yj = 0 otherwise if 0 < yj < mj if yj = 0 if yj = mj 676 glm — Generalized linear models For the negative binomial, d2j = 2ln(1 + kb µj )/k if yj = 0 2yj ln(yj /b µj ) − 2{(1 + kyj )/k}ln{(1 + kyj )/(1 + kb µj )} otherwise b0 (n − k)/n, where φb0 is the Let φ = 1 if the scale parameter is set to one; otherwise, define φ = φ estimated scale parameter and k is the number of covariates in the model (including intercept). Let lnLj denote the log likelihood for the j th observation: For the Gaussian family, lnLj = − 1 2 (yj − µ bj )2 φ + ln(2πφ) For the binomial family with denominator mj (Bernoulli if all mj = 1), lnLj = φ × ln{Γ(mj + 1)} − ln{Γ(yj + 1)} + ln{Γ(mj − yj + 1)} if 0 < yj < mj +(mj − yj ) ln(1 − µ bj /mj ) + yj ln(b µj /mj ) mj ln(1 − µ bj /mj ) mj ln(b µj /mj ) if yj = 0 if yj = mj For the Poisson family, lnLj = φ [yj ln(b µj ) − µ bj − ln{Γ(yj + 1)}] For the gamma family, lnLj = −yj /b µj + ln(1/b µj ). For the inverse Gaussian, 1 lnLj = − 2 ( (yj − µ bj )2 + 3 ln(yj ) + ln(2π) yj µ b2j ) For the negative binomial (let m = 1/k ), lnLj =φ [ ln{Γ(m + yj )} − ln{Γ(yj + 1)} − ln{Γ(m)} −m ln(1 + µ bj /m) + yj ln{b µj /(b µj + m)}] The overall deviance reported by glm is D2 = divided by the residual degrees of freedom. P j wj d2j . The dispersion of the deviance is D2 The Akaike information criterion (AIC) and Bayesian information criterion (BIC) are given by −2 lnL + 2k N 2 BIC = D − (N − k) ln(N ) AIC where lnL = P j = wj lnLj is the overall log likelihood. glm — Generalized linear models 677 P The Pearson deviance reported by glm is j wj rj2 . The corresponding Pearson dispersion is the Pearson deviance divided by the residual degrees of freedom. glm also calculates the scaled versions of all these quantities by dividing by the estimated scale parameter. Acknowledgments glm was written by James Hardin of the Arnold School of Public Health at the University of South Carolina and Joseph Hilbe of Arizona State University, the coauthors of the Stata Press book Generalized Linear Models and Extensions. The previous version of this routine was written by Patrick Royston of the MRC Clinical Trials Unit, London, and coauthor of the Stata Press book Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model. The original version of this routine was published in Royston (1994). Royston’s work, in turn, was based on a prior implementation by Joseph Hilbe, first published in Hilbe (1993). 
Roger Newson wrote an early implementation (Newson 1999) of robust variance estimates for GLM. Parts of this entry are excerpts from Hardin and Hilbe (2012). References Akaike, H. 1973. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, ed. B. N. Petrov and F. Csaki, 267–281. Budapest: Akailseoniai–Kiudo. Anscombe, F. J. 1953. Contribution of discussion paper by H. Hotelling “New light on the correlation coefficient and its transforms”. Journal of the Royal Statistical Society, Series B 15: 229–230. Baker, R. J., and J. A. Nelder. 1985. The Generalized Linear Interactive Modelling System, Release 3.77. Oxford: Numerical Algorithms Group. Basu, A. 2005. Extended generalized linear models: Simultaneous estimation of flexible link and variance functions. Stata Journal 5: 501–516. Berndt, E. K., B. H. Hall, R. E. Hall, and J. A. Hausman. 1974. Estimation and inference in nonlinear structural models. Annals of Economic and Social Measurement 3/4: 653–665. Cummings, P. 2009. Methods for estimating adjusted risk ratios. Stata Journal 9: 175–196. Dobson, A. J., and A. G. Barnett. 2008. An Introduction to Generalized Linear Models. 3rd ed. Boca Raton, FL: Chapman & Hall/CRC. Hardin, J. W., and J. M. Hilbe. 2012. Generalized Linear Models and Extensions. 3rd ed. College Station, TX: Stata Press. Hilbe, J. M. 1993. sg16: Generalized linear models. Stata Technical Bulletin 11: 20–28. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 149–159. College Station, TX: Stata Press. . 2000. sg126: Two-parameter log-gamma and log-inverse Gaussian models. Stata Technical Bulletin 53: 31–32. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 273–275. College Station, TX: Stata Press. . 2009. Logistic Regression Models. Boca Raton, FL: Chapman & Hill/CRC. Hoffmann, J. P. 2004. Generalized Linear Models: An Applied Approach. Boston: Pearson. Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley. McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. 2nd ed. London: Chapman & Hall/CRC. Nelder, J. A. 1975. Robert William MacLagan Wedderburn, 1947–1975. Journal of the Royal Statistical Society, Series A 138: 587. Nelder, J. A., and R. W. M. Wedderburn. 1972. Generalized linear models. Journal of the Royal Statistical Society, Series A 135: 370–384. Newey, W. K., and K. D. West. 1987. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55: 703–708. Newson, R. B. 1999. sg114: rglm—Robust variance estimates for generalized linear models. Stata Technical Bulletin 50: 27–33. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 181–190. College Station, TX: Stata Press. 678 glm — Generalized linear models . 2004. Generalized power calculations for generalized linear models and more. Stata Journal 4: 379–401. Orsini, N., R. Bellocco, and P. C. Sjölander. 2013. Doubly robust estimation in generalized linear models. Stata Journal 13: 185–205. Parner, E. T., and P. K. Andersen. 2010. Regression analysis of censored data using pseudo-observations. Stata Journal 10: 408–422. Pregibon, D. 1980. Goodness of link tests for generalized linear models. Applied Statistics 29: 15–24. Rabe-Hesketh, S., and B. S. Everitt. 2007. A Handbook of Statistical Analyses Using Stata. 4th ed. Boca Raton, FL: Chapman & Hall/CRC. Rabe-Hesketh, S., A. Pickles, and C. Taylor. 2000. 
sg129: Generalized linear latent and mixed models. Stata Technical Bulletin 53: 47–57. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 293–307. College Station, TX: Stata Press. Rabe-Hesketh, S., A. Skrondal, and A. Pickles. 2002. Reliable estimation of generalized linear mixed models using adaptive quadrature. Stata Journal 2: 1–21. Raftery, A. E. 1995. Bayesian model selection in social research. In Vol. 25 of Sociological Methodology, ed. P. V. Marsden, 111–163. Oxford: Blackwell. Rogers, W. H. 1993. sg16.4: Comparison of nbreg and glm for negative binomial. Stata Technical Bulletin 16: 7. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 82–84. College Station, TX: Stata Press. Royston, P. 1994. sg22: Generalized linear models: Revision of glm. Stata Technical Bulletin 18: 6–11. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 112–121. College Station, TX: Stata Press. Sasieni, P. D. 2012. Age–period–cohort models in Stata. Stata Journal 12: 45–60. Schonlau, M. 2005. Boosted regression (boosting): An introductory tutorial and a Stata plugin. Stata Journal 5: 330–354. Senn, S. J. 2003. A conversation with John Nelder. Statistical Science 18: 118–131. Williams, R. 2010. Fitting heterogeneous choice models with oglm. Stata Journal 10: 540–567. Also see [R] glm postestimation — Postestimation tools for glm [R] cloglog — Complementary log-log regression [R] logistic — Logistic regression, reporting odds ratios [R] nbreg — Negative binomial regression [R] poisson — Poisson regression [R] regress — Linear regression [ME] meglm — Multilevel mixed-effects generalized linear model [MI] estimation — Estimation commands for use with mi estimate [SVY] svy estimation — Estimation commands for survey data [XT] xtgee — Fit population-averaged panel-data models by using GEE [U] 20 Estimation and postestimation commands Title glm postestimation — Postestimation tools for glm Description Remarks and examples Syntax for predict Methods and formulas Menu for predict References Options for predict Also see Description The following postestimation commands are available after glm: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 2 forecast is not appropriate with mi or svy estimation results. lrtest is not appropriate with svy estimation results. 
679 680 glm postestimation — Postestimation tools for glm Syntax for predict predict type newvar if in , statistic options Description statistic Main mu xb eta stdp anscombe cooksd deviance hat likelihood pearson response score working expected value of y ; the default b linear prediction η = xβ synonym of xb standard error of the linear prediction Anscombe (1953) residuals Cook’s distance deviance residuals diagonals of the “hat” matrix a weighted average of standardized deviance and standardized Pearson residuals Pearson residuals differences between the observed and fitted outcomes first derivative of the log likelihood with respect to xj β working residuals options Description Options nooffset adjusted standardized studentized modified modify calculations to ignore offset variable adjust deviance residual to speed up convergence multiply residual by the factor (1 − h)−1/2 multiply residual by one over the square root of the estimated scale parameter modify denominator of residual to be a reasonable estimate of the variance of depvar These statistics are available both in and out of sample; type predict the estimation sample. . . . if e(sample) . . . if wanted only for mu, xb, stdp, and score are the only statistics allowed with svy estimation results. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main b ) [ng −1 (xβ b) mu, the default, specifies that predict calculate the expected value of y , equal to g −1 (xβ for the binomial family]. b xb calculates the linear prediction η = xβ. eta is a synonym for xb. stdp calculates the standard error of the linear prediction. anscombe calculates the Anscombe (1953) residuals to produce residuals that closely follow a normal distribution. glm postestimation — Postestimation tools for glm 681 cooksd calculates Cook’s distance, which measures the aggregate change in the estimated coefficients when each observation is left out of the estimation. deviance calculates the deviance residuals. Deviance residuals are recommended by McCullagh and Nelder (1989) and by others as having the best properties for examining the goodness of fit of a GLM. They are approximately normally distributed if the model is correct. They may be plotted against the fitted values or against a covariate to inspect the model’s fit. Also see the pearson option below. hat calculates the diagonals of the “hat” matrix, analogous to linear regression. likelihood calculates a weighted average of standardized deviance and standardized Pearson residuals. pearson calculates the Pearson residuals. Pearson residuals often have markedly skewed distributions for nonnormal family distributions. Also see the deviance option above. response calculates the differences between the observed and fitted outcomes. score calculates the equation-level score, ∂ ln L/∂(xj β). working calculates the working residuals, which are response residuals weighted according to the derivative of the link function. Options nooffset is relevant only if you specified offset(varname) for glm. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xj b rather than as xj b + offsetj . adjusted adjusts the deviance residual to speed up the convergence to the limiting normal distribution. The adjustment deals with adding to the deviance residual a higher-order term that depends on the variance function family. This option is allowed only when deviance is specified. 
standardized requests that the residual be multiplied by the factor (1 − h)−1/2 , where h is the diagonal of the hat matrix. This operation is done to account for the correlation between depvar and its predicted value. studentized requests that the residual be multiplied by one over the square root of the estimated scale parameter. modified requests that the denominator of the residual be modified to be a reasonable estimate of the variance of depvar. The base residual is multiplied by the factor (k/w)−1/2 , where k is either one or the user-specified dispersion parameter and w is the specified weight (or one if left unspecified). Remarks and examples Remarks are presented under the following headings: Predictions Other postestimation commands 682 glm postestimation — Postestimation tools for glm Predictions Example 1 After glm estimation, predict may be used to obtain various predictions based on the model. In example 2 of [R] glm, we mentioned that the complementary log-log link seemed to fit the data better than the logit link. Now we go back and obtain the fitted values and deviance residuals: . use http://www.stata-press.com/data/r13/ldose . glm r ldose, family(binomial n) link(logit) (output omitted ) . predict mu_logit (option mu assumed; predicted mean r) . predict dr_logit, deviance . quietly glm r ldose, f(binomial n) l(cloglog) . predict mu_cl (option mu assumed; predicted mean r) . predict dr_cl, d . format mu_logit dr_logit mu_cl dr_cl %9.5f . list r mu_logit dr_logit mu_cl dr_cl, sep(4) r mu_logit dr_logit mu_cl dr_cl 1. 2. 3. 4. 6 13 18 28 3.45746 9.84167 22.45139 33.89761 1.28368 1.05969 -1.19611 -1.59412 5.58945 11.28067 20.95422 30.36942 0.18057 0.55773 -0.80330 -0.63439 5. 6. 7. 8. 52 53 61 60 50.09584 53.29092 59.22216 58.74297 0.60614 -0.12716 1.25107 1.59398 47.77644 54.14273 61.11331 59.94723 1.28883 -0.52366 -0.11878 0.32495 In six of the eight cases, |dr logit| > |dr cl|. The above represents only one of the many available options for predict. See Hardin and Hilbe (2012) for a more in-depth examination. Other postestimation commands Technical note After glm estimation, you may perform any of the postestimation commands that you would perform after any other kind of estimation in Stata; see [U] 20 Estimation and postestimation commands. Below we test the joint significance of all the interaction terms. glm postestimation — Postestimation tools for glm 683 . use http://www.stata-press.com/data/r13/beetle, clear . glm r beetle##c.ldose, family(binomial n) link(cloglog) (output omitted ) . testparm i.beetle beetle#c.ldose ( 1) [r]2.beetle = 0 ( 2) [r]3.beetle = 0 ( 3) [r]2.beetle#c.ldose = 0 ( 4) [r]3.beetle#c.ldose = 0 chi2( 4) = 249.69 Prob > chi2 = 0.0000 If you wanted to print the variance–covariance matrix of the estimators, you would type estat vce. If you use the linktest postestimation command, you must also specify the family() and link() options; see [R] linktest. Methods and formulas We follow the terminology used in Methods and formulas of [R] glm. q The deviance residual calculated by predict following glm is rjD = sign(yj − µ bj ) d2j . The Pearson residual calculated by predict following glm is yj − µ bj rjP = p V (b µj ) where V (b µj ) is the family-specific variance function. µ bj (1 − µ bj /mj ) 2 µ b 1j V (b µj ) = b3j µ b + kb µ2j µ j µ bj if if if if if if binomial or Bernoulli (mj = 1) gamma Gaussian inverse Gaussian negative binomial Poisson The response residuals are given by riR = yi − µi . 
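Collecting the family-specific variance functions listed above for V(µ̂_j) into a single display (this restates the cases just given; no new content):

V(\widehat{\mu}_j) =
\begin{cases}
\widehat{\mu}_j\,(1-\widehat{\mu}_j/m_j) & \text{binomial or Bernoulli } (m_j=1)\\
\widehat{\mu}_j^{\,2} & \text{gamma}\\
1 & \text{Gaussian}\\
\widehat{\mu}_j^{\,3} & \text{inverse Gaussian}\\
\widehat{\mu}_j + k\widehat{\mu}_j^{\,2} & \text{negative binomial}\\
\widehat{\mu}_j & \text{Poisson}
\end{cases}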
The working residuals are riW = (yi − µ bi ) ∂η ∂µ and the score residuals are riS = yi − µ bi V (b µi ) ∂η ∂µ i −1 i c = V (b Define W µ) and X to be the covariate matrix. hi , then, is the ith diagonal of the hat matrix given by b =W c 1/2 X(X T W c X)−1 X T W c 1/2 H 684 glm postestimation — Postestimation tools for glm As a result, the likelihood residuals are given by riL = sign(yi − µ bi ) hi (riP 0 )2 + (1 − hi )(riD 0 )2 1/2 where riP 0 and riD 0 are the standardized Pearson and standardized deviance residuals, respectively. By standardized, we mean that the residual is divided by {1 − hi }1/2 . Cook’s distance is an overall measure of the change in the regression coefficients caused by omitting the ith observation from the analysis. Computationally, Cook’s distance is obtained as Ci = (riP 0 )2 hi k(1 − hi ) where k is the number of regressors, including the constant. Anscombe residuals are given by riA = A(yi ) − A(b µi ) A0 (b µi ){V (b µi )}1/2 where Z A(·) = dµ V 1/3 (µ) Deviance residuals may be adjusted (predict, adjusted) to make the following correction: 1 riD a = riD + ρ3 (θ) 6 where ρ3 (θ) is a family-specific correction. See Hardin and Hilbe (2012) for the exact forms of ρ3 (θ) for each family. References Anscombe, F. J. 1953. Contribution of discussion paper by H. Hotelling “New light on the correlation coefficient and its transforms”. Journal of the Royal Statistical Society, Series B 15: 229–230. Hardin, J. W., and J. M. Hilbe. 2012. Generalized Linear Models and Extensions. 3rd ed. College Station, TX: Stata Press. McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. 2nd ed. London: Chapman & Hall/CRC. Newson, R. B. 2013. Attributable and unattributable risks and fractions and other scenario comparisons. Stata Journal 13: 672–698. 
Also see [R] glm — Generalized linear models [R] regress postestimation — Postestimation tools for regress [U] 20 Estimation and postestimation commands Title glogit — Logit and probit regression for grouped data Syntax Options for blogit and bprobit Stored results Also see Menu Options for glogit and gprobit Methods and formulas Description Remarks and examples References Syntax Logistic regression for grouped data blogit pos var pop var indepvars if in , blogit options Probit regression for grouped data bprobit pos var pop var indepvars if in , bprobit options Weighted least-squares logistic regression for grouped data if in , glogit options glogit pos var pop var indepvars Weighted least-squares probit regression for grouped data gprobit pos var pop var indepvars if in , gprobit options blogit options Description Model noconstant asis offset(varname) constraints(constraints) collinear suppress constant term retain perfect predictor variables include varname in model with coefficient constrained to 1 apply specified linear constraints keep collinear variables SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, bootstrap, or jackknife Reporting level(#) or nocnsreport display options set confidence level; default is level(95) report odds ratios do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used nocoef coeflegend do not display coefficient table; seldom used display legend instead of statistics 685 686 glogit — Logit and probit regression for grouped data bprobit options Description Model noconstant asis offset(varname) constraints(constraints) collinear suppress constant term retain perfect predictor variables include varname in model with coefficient constrained to 1 apply specified linear constraints keep collinear variables SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, bootstrap, or jackknife Reporting level(#) nocnsreport display options set confidence level; default is level(95) do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used nocoef coeflegend do not display coefficient table; seldom used display legend instead of statistics glogit options Description SE vce(vcetype) vcetype may be ols, bootstrap, or jackknife Reporting level(#) or display options set confidence level; default is level(95) report odds ratios control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling coeflegend display legend instead of statistics gprobit options Description SE vce(vcetype) vcetype may be ols, bootstrap, or jackknife Reporting level(#) display options set confidence level; default is level(95) control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling coeflegend display legend instead of statistics glogit — Logit and probit regression for grouped data 687 indepvars may contain factor variables; see [U] 11.4.3 Factor variables. bootstrap, by, jackknife, rolling, and statsby are allowed; see [U] 11.1.10 Prefix commands. fp is allowed with blogit and bprobit. nocoef and coeflegend do not appear in the dialog box. 
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu blogit Statistics > Binary outcomes > Grouped data > Logit regression for grouped data > Binary outcomes > Grouped data > Probit regression for grouped data > Binary outcomes > Grouped data > Weighted least-squares logit regression > Binary outcomes > Grouped data > Weighted least-squares probit regression bprobit Statistics glogit Statistics gprobit Statistics Description blogit and bprobit produce maximum-likelihood logit and probit estimates on grouped (“blocked”) data; glogit and gprobit produce weighted least-squares estimates. In the syntax diagrams above, pos var and pop var refer to variables containing the total number of positive responses and the total population. See [R] logistic for a list of related estimation commands. Options for blogit and bprobit Model noconstant; see [R] estimation options. asis forces retention of perfect predictor variables and their associated perfectly predicted observations and may produce instabilities in maximization; see [R] probit. offset(varname), constraints(constraints), collinear; see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. 688 glogit — Logit and probit regression for grouped data or (blogit only) reports the estimated coefficients transformed to odds ratios, that is, eb rather than b. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated. or may be specified at estimation or when replaying previously estimated results. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. The following options are available with blogit and bprobit but are not shown in the dialog box: nocoef specifies that the coefficient table not be displayed. This option is sometimes used by program writers but is useless interactively. coeflegend; see [R] estimation options. Options for glogit and gprobit SE vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (ols) and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. vce(ols), the default, uses the standard variance estimator for ordinary least-squares regression. Reporting level(#); see [R] estimation options. or (glogit only) reports the estimated coefficients transformed to odds ratios, that is, eb rather than b. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated. or may be specified at estimation or when replaying previously estimated results. 
display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. The following option is available with glogit and gprobit but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Remarks are presented under the following headings: Maximum likelihood estimates Weighted least-squares estimates glogit — Logit and probit regression for grouped data 689 Maximum likelihood estimates blogit produces the same results as logit and logistic, and bprobit produces the same results as probit, but the “blocked” commands accept data in a slightly different “shape”. Consider the following two datasets: . use http://www.stata-press.com/data/r13/xmpl1 . list, sepby(agecat) agecat exposed died pop 1. 2. 3. 4. 0 0 0 0 0 0 1 1 0 1 0 1 115 5 98 8 5. 6. 7. 8. 1 1 1 1 0 0 1 1 0 1 0 1 69 16 76 22 . use http://www.stata-press.com/data/r13/xmpl2 . list 1. 2. 3. 4. agecat exposed deaths pop 0 0 1 1 0 1 0 1 5 8 16 22 120 106 85 98 These two datasets contain the same information; observations 1 and 2 of xmpl1 correspond to observation 1 of xmpl2, observations 3 and 4 of xmpl1 correspond to observation 2 of xmpl2, and so on. The first observation of xmpl1 says that for agecat==0 and exposed==0, 115 subjects did not die (died==0). The second observation says that for the same agecat and exposed groups, five subjects did die (died==1). In xmpl2, the first observation says that there were five deaths of a population of 120 in agecat==0 and exposed==0. These are two different ways of saying the same thing. Both datasets are transcriptions from the following table, reprinted in Rothman, Greenland, and Lash (2008, 260), for age-specific deaths from all causes for tolbutamide and placebo treatment groups (University Group Diabetes Program 1970): Dead Surviving Age through 54 Tolbutamide Placebo 8 5 98 115 Age 55 and above Tolbutamide Placebo 22 16 76 79 The data in xmpl1 are said to be “fully relational”, which is computer jargon meaning that each observation corresponds to one cell of the table. Stata typically prefers data in this format. The second form of storing these data in xmpl2 is said to be “folded”, which is computer jargon for something less than fully relational. blogit and bprobit deal with “folded” data and produce the same results that logit and probit would have if the data had been stored in the “fully relational” representation. 690 glogit — Logit and probit regression for grouped data Example 1 For the tolbutamide data, the fully relational representation is preferred. We could then use logistic, logit, and any of the epidemiological table commands; see [R] logistic, [R] logit, and [ST] epitab. Nevertheless, there are occasions when the folded representation seems more natural. With blogit and bprobit, we avoid the tedium of having to unfold the data: . use http://www.stata-press.com/data/r13/xmpl2 . blogit deaths pop agecat exposed, or Logistic regression for grouped data Log likelihood = -142.6212 _outcome Odds Ratio agecat exposed _cons 4.216299 1.404674 .0513818 Std. Err. 1.431519 .4374454 .0170762 z 4.24 1.09 -8.93 Number of obs LR chi2(2) Prob > chi2 Pseudo R2 = = = = 409 22.47 0.0000 0.0730 P>|z| [95% Conf. Interval] 0.000 0.275 0.000 2.167361 .7629451 .0267868 8.202223 2.586175 .0985593 If we had not specified the or option, results would have been presented as coefficients instead of as odds ratios. 
The estimated odds ratio of death for tolbutamide exposure is 1.40, although the 95% confidence interval includes 1. (By comparison, these data, in fully relational form and analyzed using the cs command [see [ST] epitab], produce a Mantel – Haenszel weighted odds ratio of 1.40 with a 95% confidence interval of 0.76 to 2.59.) We can see the underlying coefficients by replaying the estimation results and not specifying the or option: . blogit Logistic regression for grouped data Log likelihood = -142.6212 _outcome Coef. agecat exposed _cons 1.438958 .3398053 -2.968471 Std. Err. .3395203 .3114213 .33234 Number of obs LR chi2(2) Prob > chi2 Pseudo R2 z 4.24 1.09 -8.93 P>|z| 0.000 0.275 0.000 = = = = 409 22.47 0.0000 0.0730 [95% Conf. Interval] .7735101 -.2705692 -3.619846 2.104405 .9501798 -2.317097 glogit — Logit and probit regression for grouped data 691 Example 2 bprobit works like blogit, substituting the probit for the logit-likelihood function. . bprobit deaths pop agecat exposed Probit regression for grouped data Number of obs LR chi2(2) Prob > chi2 Pseudo R2 Log likelihood = -142.56478 _outcome Coef. agecat exposed _cons .7542049 .1906236 -1.673973 Std. Err. .1709692 .1666059 .1619594 z 4.41 1.14 -10.34 P>|z| 0.000 0.253 0.000 = = = = 409 22.58 0.0000 0.0734 [95% Conf. Interval] .4191114 -.1359179 -1.991408 1.089298 .5171651 -1.356539 Weighted least-squares estimates Example 3 We have state data for the United States on the number of marriages (marriage), the total population aged 18 years or more (pop18p), and the median age (medage). The dataset excludes Nevada, so it has 49 observations. We now wish to estimate a logit equation for the marriage rate. We will include age squared by specifying the term c.medage#c.medage: . use http://www.stata-press.com/data/r13/census7 (1980 Census data by state) . glogit marriage pop18p medage c.medage#c.medage Weighted LS logistic regression for grouped data Source SS df MS Model Residual .71598314 1.27772858 2 46 .35799157 .027776708 Total 1.99371172 48 .041535661 Coef. Std. Err. t P>|t| Number of obs F( 2, 46) Prob > F R-squared Adj R-squared Root MSE = = = = = = 49 12.89 0.0000 0.3591 0.3313 .16666 [95% Conf. Interval] medage -.6459349 .2828381 -2.28 0.027 -1.215258 -.0766114 c.medage# c.medage .0095414 .0046608 2.05 0.046 .0001598 .0189231 6.503833 4.288977 1.52 0.136 -2.129431 15.1371 _cons 692 glogit — Logit and probit regression for grouped data Example 4 We could just as easily have fit a grouped-probit model by typing gprobit rather than glogit: . gprobit marriage pop18p medage c.medage#c.medage Weighted LS probit regression for grouped data SS df MS Source Model Residual .108222962 .192322476 2 46 .054111481 .004180923 Total .300545438 48 .006261363 Coef. Std. Err. t P>|t| Number of obs F( 2, 46) Prob > F R-squared Adj R-squared Root MSE = = = = = = 49 12.94 0.0000 0.3601 0.3323 .06466 [95% Conf. 
Interval] medage -.2755007 .1121042 -2.46 0.018 -.5011548 -.0498466 c.medage# c.medage .0041082 .0018422 2.23 0.031 .0004001 .0078163 2.357708 1.704446 1.38 0.173 -1.073164 5.788579 _cons Stored results blogit and bprobit store the following in e(): Scalars e(N) e(N cds) e(N cdf) e(k) e(k eq) e(k eq model) e(k dv) e(df m) e(r2 p) e(ll) e(ll 0) e(N clust) e(chi2) e(p) e(rank) e(ic) e(rc) e(converged) number of observations number of completely determined successes number of completely determined failures number of parameters number of equations in e(b) number of equations in overall model test number of dependent variables model degrees of freedom pseudo-R-squared log likelihood log likelihood, constant-only model number of clusters χ2 significance of model test rank of e(V) number of iterations return code 1 if converged, 0 otherwise glogit — Logit and probit regression for grouped data 693 Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(clustvar) e(offset) e(chi2type) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(marginsok) e(asbalanced) e(asobserved) blogit or bprobit command as typed variable containing number of positive responses and variable containing population size weight type weight expression title in estimation output name of cluster variable linear offset variable Wald or LR; type of model χ2 test vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict predictions allowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved Matrices e(b) e(Cns) e(ilog) e(gradient) e(mns) e(rules) e(V) e(V modelbased) coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector vector of means of the independent variables information about perfect predictors variance–covariance matrix of the estimators model-based variance Functions e(sample) marks estimation sample glogit and gprobit store the following in e(): Scalars e(N) e(mss) e(df m) e(rss) e(df r) e(r2) e(r2 a) e(F) e(rmse) e(rank) number of observations model sum of squares model degrees of freedom residual sum of squares residual degrees of freedom R-squared adjusted R-squared F statistic root mean squared error rank of e(V) Macros e(cmd) e(cmdline) e(depvar) e(model) e(title) e(vce) e(vcetype) e(properties) e(predict) e(marginsok) e(asbalanced) e(asobserved) glogit or gprobit command as typed variable containing number of positive responses and variable containing population size ols title in estimation output vcetype specified in vce() title used to label Std. Err. b V program used to implement predict predictions allowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved 694 glogit — Logit and probit regression for grouped data Matrices e(b) e(V) Functions e(sample) coefficient vector variance–covariance matrix of the estimators marks estimation sample Methods and formulas Methods and formulas are presented under the following headings: Maximum likelihood estimates Weighted least-squares estimates Maximum likelihood estimates The results reported by blogit and bprobit are obtained by maximizing a weighted logit- or probit-likelihood function. Let F ( ) denote the normal- or logistic-likelihood function. 
The likelihood of observing each observation in the data is then t−s F (βx)s 1 − F (βx) where s is the number of successes and t is the population. The term above is counted as contributing s + (t − s) = t degrees of freedom. All of this follows directly from the definitions of logit and probit. blogit and bprobit support the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. Weighted least-squares estimates The logit function is defined as the log of the odds ratio. If there is one explanatory variable, the model can be written as pj log = β0 + β1 xj + j (1) 1 − pj where pj represents successes divided by population for the j th observation. (If there is more than one explanatory variable, we simply interpret β1 as a row vector and xj as a column vector.) The large-sample expectation of j is zero, and its variance is σj2 = 1 nj pj (1 − pj ) where nj represents the population for observation j . We can thus apply weighted least squares to the observations, with weights proportional to nj pj (1 − pj ). As in any feasible generalized least-squares problem, estimation proceeds in two steps. First, we fit (1) by OLS and compute the predicted probabilities as pbj = c0 + β c1 xj ) exp(β c0 + β c1 xj ) 1 + exp(β In the second step, we fit (1) by using analytic weights equal to nj pbj (1 − pbj ). glogit — Logit and probit regression for grouped data 695 For gprobit, write Φ(·) for the cumulative normal distribution, and define zj implicitly by Φ(zj ) = pj , where pj is the fraction of successes for observation j . The probit model for one explanatory variable can be written as Φ−1 (pj ) = β0 + β1 xj + j (If there is more than one explanatory variable, we simply interpret β1 as a row vector and xj as a column vector.) The expectation of j is zero, and its variance is given by σj2 = pj (1 − pj ) nj φ2 Φ−1 (pj ) where φ(·) represents the normal density (Amemiya 1981, 1498). We can thus apply weighted least squares to the observations with weights proportional to 1/σj2 . As for grouped logit, we use a two-step estimator to obtain the weighted least-squares estimates. References Amemiya, T. 1981. Qualitative response models: A survey. Journal of Economic Literature 19: 1483–1536. Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley. Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2nd ed. New York: Wiley. Rothman, K. J., S. Greenland, and T. L. Lash. 2008. Modern Epidemiology. 3rd ed. Philadelphia: Lippincott Williams & Wilkins. University Group Diabetes Program. 1970. A study of the effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes, II: Mortality results. Diabetes 19, supplement 2: 789–830. 
Also see [R] glogit postestimation — Postestimation tools for glogit, gprobit, blogit, and bprobit [R] logistic — Logistic regression, reporting odds ratios [R] logit — Logistic regression, reporting coefficients [R] probit — Probit regression [R] scobit — Skewed logistic regression [U] 20 Estimation and postestimation commands Title glogit postestimation — Postestimation tools for glogit, gprobit, blogit, and bprobit Description Also see Syntax for predict Menu for predict Options for predict Description The following postestimation commands are available after glogit, gprobit, blogit, and bprobit: ∗ ∗ Command Description contrast estat ic estat summarize estat vce estimates lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest margins marginsplot nlcom predict predictnl pwcompare test testnl ∗ estat ic and lrtest are not appropriate after glogit and gprobit. Syntax for predict predict statistic type newvar if in , statistic Description Main n pr xb stdp predicted count; the default probability of a positive outcome linear prediction standard error of the linear prediction These statistics are available both in and out of sample; type predict the estimation sample. 696 . . . if e(sample) . . . if wanted only for glogit postestimation — Postestimation tools for glogit, gprobit, blogit, and bprobit 697 Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main n, the default, calculates the expected count, that is, the estimated probability times pop var, which is the total population. pr calculates the predicted probability of a positive outcome. xb calculates the linear prediction. stdp calculates the standard error of the linear prediction. Also see [R] glogit — Logit and probit regression for grouped data [U] 20 Estimation and postestimation commands Title gmm — Generalized method of moments estimation Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax Interactive version gmm ( eqname1 : ) ( eqname2 : ). . . if in weight , options Moment-evaluator program version gmm moment prog if in weight , equations(namelist) | nequations(#) parameters(namelist) | nparameters(#) options program options where mexpj is the substitutable expression for the j th moment equation and moment prog is a moment-evaluator program. 
options Description Model derivative( ) specify derivative of mexpm with respect to parameter n; can be specified more than once (interactive version only) ∗ twostep use two-step GMM estimator; the default ∗ onestep use one-step GMM estimator ∗ igmm use iterative GMM estimator Instruments instruments( : varlist , noconstant ) specify instruments; can be specified more than once xtinstruments( : varlist, lags(#1 /#2 )) specify panel-style instruments; can be specified more than once Weight matrix wmatrix(wmtype , independent ) specify weight matrix; wmtype may be robust, cluster clustvar, hac kernel lags , or unadjusted center center moments in weight-matrix computation winitial(iwtype , independent ) specify initial weight matrix; iwtype may be identity, unadjusted, xt xtspec, or the name of a Stata matrix 698 gmm — Generalized method of moments estimation Options variables(varlist) nocommonesample specify variables in model do not restrict estimation sample to be the same for all equations SE/Robust vce(vcetype , independent ) vcetype may be robust, cluster clustvar, bootstrap, jackknife, hac kernel lags, or unadjusted use alternative method of computing numerical derivatives quickderivatives for VCE Reporting level(#) title(string) title2(string) display options set confidence level; default is level(95) display string as title above the table of parameter estimates display string as subtitle control column formats and line width Optimization from(initial values) ‡ igmmiterate(#) ‡ igmmeps(#) specify initial values for parameters specify maximum number of iterations for iterated GMM estimator optimization options specify # for iterated GMM parameter convergence criterion; default is igmmeps(1e-6) specify # for iterated GMM weight-matrix convergence criterion; default is igmmweps(1e-6) control the optimization process; seldom used coeflegend display legend instead of statistics ‡ igmmweps(#) ∗ You can specify at most one of these options. ‡ These options may be specified only when igmm is specified. program options Description Model evaluator options hasderivatives ∗ haslfderivatives † equations(namelist) † nequations(#) additional options to be passed to the moment-evaluator program moment-evaluator program can calculate parameter-level derivatives moment-evaluator program can calculate linear-form derivatives specify moment-equation names ‡ parameters(namelist) ‡ nparameters(#) specify parameter names ∗ ∗ specify number of moment equations specify number of parameters You may not specify both hasderivatives and haslfderivatives. † You must specify equations(namelist) or nequations(#); you may specify both. ‡ You must specify parameters(namelist) or nparameters(#); you may specify both. 699 700 gmm — Generalized method of moments estimation bootstrap, by, jackknife, rolling, statsby, and xi are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. aweights, fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. and are extensions of valid Stata expressions that also contain parameters to be estimated. The parameters are enclosed in curly braces and must otherwise satisfy the naming requirements for variables; {beta} is an example of a parameter. 
Also allowed is a notation of the form { :varlist} for linear combinations of multiple covariates and their parameters. For example, {xb: mpg price turn} defines a linear combination of the variables mpg, price, and turn. See Substitutable expressions under Remarks and examples below. Menu Statistics > Endogenous covariates > Generalized method of moments estimation Description gmm performs generalized method of moments (GMM) estimation. With the interactive version of the command, you enter the moment equations directly into the dialog box or on the command line using substitutable expressions. The moment-evaluator program version gives you greater flexibility in exchange for increased complexity; with this version, you write a program in an ado-file that calculates the moments based on a vector of parameters passed to it. gmm can fit both single- and multiple-equation models, and it allows moment conditions of the form E{zi ui (β)} = 0, where zi is a vector of instruments and ui (β) is often an additive regression error term, as well as more general moment conditions of the form E{hi (zi ; β)} = 0. gmm works with cross-sectional, time-series, and longitudinal (panel) data. Options Model derivative( eqname | # /name = ) specifies the derivative of moment equation eqname or # with respect to parameter name. If eqname or # is not specified, gmm assumes that the derivative applies to the first moment equation. For a moment equation of the form E{zmi umi (β)} = 0, derivative(m/βj = ) is to contain a substitutable expression for ∂umi /∂βj . For a moment equation of the form E{hmi (zi ; β)} = 0, derivative(m/βj = ) is to contain a substitutable expression for ∂hmi /∂βj . uses the same substitutable expression syntax as is used to specify moment equations. If you declare a linear combination in a moment equation, you provide the derivative for the linear combination; gmm then applies the chain rule for you. See Specifying derivatives under Remarks and examples below for examples. If you do not specify the derivative() option, gmm calculates derivatives numerically. You must either specify no derivatives or specify all the derivatives that are not identically zero; you cannot specify some analytic derivatives and have gmm compute the rest numerically. gmm — Generalized method of moments estimation 701 twostep, onestep, and igmm specify which estimator is to be used. You can specify at most one of these options. twostep is the default. twostep requests the two-step GMM estimator. gmm obtains parameter estimates based on the initial weight matrix, computes a new weight matrix based on those estimates, and then reestimates the parameters based on that weight matrix. onestep requests the one-step GMM estimator. The parameters are estimated based on an initial weight matrix, and no updating of the weight matrix is performed except when calculating the appropriate variance–covariance (VCE) matrix. igmm requests the iterative GMM estimator. gmm obtains parameter estimates based on the initial weight matrix, computes a new weight matrix based on those estimates, reestimates the parameters based on that weight matrix, computes a new weight matrix, and so on, to convergence. Convergence is declared when the relative change in the parameter vector is less than igmmeps(), the relative change in the weight matrix is less than igmmweps(), or igmmiterate() iterations have been completed. Hall (2005, sec. 
2.4 and 3.6) mentions that there may be gains to finite-sample efficiency from using the iterative estimator. Instruments instruments( : varlist , noconstant ) specifies a list of instrumental variables to be used. If you specify a single moment equation, then you do not need to specify the equations to which the instruments apply; you can omit the eqlist and simply specify instruments(varlist). By default, a constant term is included in varlist; to omit the constant term, use the noconstant suboption: instruments(varlist, noconstant). If you specify a model with multiple moment conditions of the form z1i u1i (β) E ··· =0 zqi uqi (β) then you can specify the equations to indicate the moment equations for which the list of variables is to be used as instruments if you do not want that list applied to all the moment equations. For example, you might type gmm (main: ) ( ) ( ), instruments(z1 z2) /// instruments(2: z3) instruments(main 3: z4) Variables z1 and z2 will be used as instruments for all three equations, z3 will be used as an instrument for the second equation, and z4 will be used as an instrument for the first and third equations. Notice that we chose to supply a name for the first moment equation but not the second two. varlist may contain factor variables and time-series operators; see [U] 11.4.3 Factor variables and [U] 11.4.4 Time-series varlists, respectively. xtinstruments( : varlist, lags(#1 /#2 )) is for use with panel-data models in which the set of available instruments depends on the time period. As with instruments(), you can prefix the list of variables with equation names or numbers to target instruments to specific equations. Unlike with instruments(), a constant term is not included in varlist. You must xtset your data before using this option; see [XT] xtset. If you specify gmm . . ., xtinstruments(x, lags(1/.)) . . . 702 gmm — Generalized method of moments estimation then for panel i and period t, gmm uses as instruments xi,t−1 , xi,t−2 , . . . , xi1 . More generally, specifying xtinstruments(x, lags(#1 , #2 )) uses as instruments xi,t−#1 , . . . , xi,t−#2 ; setting #2 = . requests all available lags. #1 and #2 must be zero or positive integers. gmm automatically excludes observations for which no valid instruments are available. It does, however, include observations for which only a subset of the lags is available. For example, if you request that lags one through three be used, then gmm will include the observations for the second and third time periods even though fewer than three lags are available as instruments. Weight matrix wmatrix(wmtype , independent ) specifies the type of weight matrix to be used in conjunction with the two-step and iterated GMM estimators. Specifying wmatrix(robust) requests a weight matrix that is appropriate when the errors are independent but not necessarily identically distributed. wmatrix(robust) is the default. Specifying wmatrix(cluster clustvar) requests a weight matrix that accounts for arbitrary correlation among observations within clusters identified by clustvar. Specifying wmatrix(hac kernel #) requests a heteroskedasticity- and autocorrelation-consistent (HAC) weight matrix using the specified kernel (see below) with # lags. The bandwidth of a kernel is equal to the number of lags plus one. Specifying wmatrix(hac kernel opt) requests an HAC weight matrix using the specified kernel, and the lag order is selected using Newey and West’s (1994) optimal lag-selection algorithm. 
Specifying wmatrix(hac kernel) requests an HAC weight matrix using the specified kernel and N − 2 lags, where N is the sample size. There are three kernels available for HAC weight matrices, and you may request each one by using the name used by statisticians or the name perhaps more familiar to economists: bartlett or nwest requests the Bartlett (Newey–West) kernel; parzen or gallant requests the Parzen (Gallant) kernel; and quadraticspectral or andrews requests the quadratic spectral (Andrews) kernel. Specifying wmatrix(unadjusted) requests a weight matrix that is suitable when the errors are homoskedastic. In some applications, the GMM estimator so constructed is known as the (nonlinear) two-stage least-squares (2SLS) estimator. Including the independent suboption creates a weight matrix that assumes moment equations are independent. This suboption is often used to replicate other models that can be motivated outside the GMM framework, such as the estimation of a system of equations by system-wide 2SLS. This suboption has no effect if only one moment equation is specified. wmatrix() has no effect if onestep is also specified. center requests that the sample moments be centered (demeaned) when computing GMM weight matrices. By default, centering is not done. winitial(wmtype , independent ) specifies the weight matrix to use to obtain the first-step parameter estimates. Specifying winitial(unadjusted) requests a weight matrix that assumes the moment equations are independent and identically distributed. This matrix is of the form (Z0 Z)−1 , where Z represents all the instruments specified in the instruments() option. To avoid a singular weight matrix, you should specify at least q − 1 moment equations of the form E{zhi uhi (β)} = 0, where q is the number of moment equations, or you should specify the independent suboption. gmm — Generalized method of moments estimation 703 Including the independent suboption creates a weight matrix that assumes moment equations are independent. Elements of the weight matrix corresponding to covariances between two moment equations are set equal to zero. This suboption has no effect if only one moment equation is specified. winitial(unadjusted) is the default. winitial(xt xtspec) is for use with dynamic panel-data models in which one of the moment equations is specified in first-differences form. xtspec is a string consisting of the letters “L” and “D”, the length of which is equal to the number of moment equations in the model. You specify “L” for a moment equation if that moment equation is written in levels, and you specify “D” for a moment equation if it is written in first-differences; xtspec is not case sensitive. When you specify this option, you can specify at most one moment equation in levels and one moment equation in first-differences. See the examples listed in Dynamic panel-data models under Remarks and examples below. winitial(identity) requests that the identity matrix be used. winitial(matname) requests that Stata matrix matname be used. You cannot specify the independent suboption if you specify winitial(matname). Options variables(varlist) specifies the variables in the model. gmm ignores observations for which any of these variables has a missing value. If you do not specify variables(), then gmm assumes all the observations are valid and issues an error message with return code 480 if any moment equations evaluate to missing for any observations at the initial value of the parameter vector. 
nocommonesample requests that gmm not restrict the estimation sample to be the same for all equations. By default, gmm will restrict the estimation sample to observations that are available for all equations in the model, mirroring the behavior of other multiple-equation estimators such as nlsur, sureg, or reg3. For certain models, however, different equations can have different numbers of observations. For these models, you should specify nocommonesample. See Dynamic panel-data models below for one application of this option. You cannot specify weights if you specify nocommonesample. SE/Robust vce(vcetype , independent ) specifies the type of standard error reported, which includes types that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. vce(unadjusted) specifies that an unadjusted (nonrobust) VCE matrix be used; this, along with the twostep option, results in the “optimal two-step GMM” estimates often discussed in textbooks. The default vcetype is based on the wmtype specified in the wmatrix() option. If wmatrix() is specified but vce() is not, then vcetype is set equal to wmtype. To override this behavior and obtain an unadjusted (nonrobust) VCE matrix, specify vce(unadjusted). Specifying vce(bootstrap) or vce(jackknife) results in standard errors based on the bootstrap or jackknife, respectively. See [R] vce option, [R] bootstrap, and [R] jackknife for more information on these VCEs. The syntax for vcetypes other than bootstrap and jackknife is identical to those for wmatrix(). 704 gmm — Generalized method of moments estimation quickderivatives requests that an alternative method be used to compute the numerical derivatives for the VCE. This option has no effect if you specify the derivatives(), hasderivatives, or haslfderivatives option. The VCE depends on a matrix of partial derivatives that gmm must compute numerically unless you supply analytic derivatives. This Jacobian matrix will be especially large if your model has many instruments, moment equations, or parameters. By default, gmm computes each element of the Jacobian matrix individually, searching for an optimal step size each time. Although this procedure results in accurate derivatives, it is computationally taxing: gmm may have to evaluate the moments of your model five or more times for each element of the Jacobian matrix. When you specify the quickderivatives option, gmm computes all derivatives corresponding to a parameter at once, using a fixed step size proportional to the parameter’s value. This method requires just two evaluations of the model’s moments to compute an entire column of the Jacobian matrix and therefore has the most impact when you specify many instruments or moment equations. Most of the time, the two methods produce virtually identical results, but the quickderivatives method may fail if a moment equation is highly nonlinear or if instruments differ by orders of magnitude. In the rare case where you specify quickderivatives and obtain suspiciously large or small standard errors, try refitting your model without this option. Reporting level(#); see [R] estimation options. title(string) specifies an optional title that will be displayed just above the table of parameter estimates. title2(string) specifies an optional subtitle that will be displayed between the title specified in title() and the table of parameter estimates. 
If title2() is specified but title() is not, title2() has the same effect as title(). display options: cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Optimization from(initial values) specifies the initial values to begin the estimation. You can specify a 1 × k matrix, where k is the number of parameters in the model, or you can specify a parameter name, its initial value, another parameter name, its initial value, and so on. For example, to initialize alpha to 1.23 and delta to 4.57, you would type gmm ..., from(alpha 1.23 delta 4.57) ... Initial values declared using this option override any that are declared within substitutable expressions. If you specify a parameter that does not appear in your model, gmm exits with error code 480. If you specify a matrix, the values must be in the same order in which the parameters are declared in your model. gmm ignores the row and column names of the matrix. igmmiterate(#), igmmeps(#), and igmmweps(#) control the iterative process for the iterative GMM estimator. These options can be specified only if you also specify igmm. igmmiterate(#) specifies the maximum number of iterations to perform with the iterative GMM estimator. The default is the number set using set maxiter (set [R] maximize), which is 16,000 by default. gmm — Generalized method of moments estimation 705 igmmeps(#) specifies the convergence criterion used for successive parameter estimates when the iterative GMM estimator is used. The default is igmmeps(1e-6). Convergence is declared when the relative difference between successive parameter estimates is less than igmmeps() and the relative difference between successive estimates of the weight matrix is less than igmmweps(). igmmweps(#) specifies the convergence criterion used for successive estimates of the weight matrix when the iterative GMM estimator is used. The default is igmmweps(1e-6). Convergence is declared when the relative difference between successive parameter estimates is less than igmmeps() and the relative difference between successive estimates of the weight matrix is less than igmmweps(). optimization options: technique(), conv maxiter(), conv ptol(), conv vtol(), conv nrtol(), tracelevel(). technique() specifies the optimization technique to use; gn (the default), nr, dfp, and bfgs are allowed. conv maxiter() specifies the maximum number of iterations; conv ptol(), conv vtol(), and conv nrtol() specify the convergence criteria for the parameters, gradient, and scaled Hessian, respectively. tracelevel() allows you to obtain additional details during the iterative process. See [M-5] optimize( ). The following options pertain only to the moment-evaluator program version of gmm. Model evaluator options refer to any options allowed by your moment prog. hasderivatives and haslfderivatives indicate that you have written your moment-evaluator program to compute derivatives. You may specify one or the other but not both. If you do not specify either of these options, gmm computes the derivatives numerically. hasderivatives indicates that your moment-evaluator program computes parameter-level derivatives. haslfderivatives indicates that your moment-evaluator program computes equation-level derivatives and is useful only when you specify the parameters of your model using the : syntax of the parameters() option. See Details of moment-evaluator programs below for more information. equations(namelist) specifies the names of the moment equations in the model. 
If you specify both equations() and nequations(), the number of names in the former must match the number specified in the latter. nequations(#) specifies the number of moment equations in the model. If you do not specify names with the equations() option, gmm numbers the moment equations 1, 2, 3, . . . . If you specify both equations() and nequations(), the number of names in the former must match the number specified in the latter. parameters(namelist) specifies the names of the parameters in the model. The names of the parameters must adhere to the naming conventions of Stata’s variables; see [U] 11.3 Naming conventions. Alternatively, you may specify a list of names in which each item in the list is of the form : , where eqname is an equation name used to group parameters, and varname is the name of an existing variable or cons to indicate a constant term. When you use this syntax, gmm adorns the parameter vector passed to your evaluator program with these names so that you can use matrix score (see [P] matrix score) to compute linear combinations of parameters. These equation names are not related to the names you may give to the moment equations. 706 gmm — Generalized method of moments estimation If you specify both parameters() and nparameters(), the number of names in the former must match the number specified in the latter. nparameters(#) specifies the number of parameters in the model. If you do not specify names with the parameters() option, gmm names them b1, b2, . . . , b#. If you specify both parameters() and nparameters(), the number of names in the former must match the number specified in the latter. The following option is available with gmm but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Remarks are presented under the following headings: Introduction Substitutable expressions The weight matrix and two-step estimation Obtaining standard errors Exponential (Poisson) regression models Specifying derivatives Exponential regression models with panel data Rational-expectations models System estimators Dynamic panel-data models Details of moment-evaluator programs Introduction The generalized method of moments (GMM) estimator is a workhorse of modern econometrics and is discussed in all the leading textbooks, including Cameron and Trivedi (2005, 2010), Davidson and MacKinnon (1993, 2004), Greene (2012, 468–506), Ruud (2000), Hayashi (2000), Wooldridge (2010), Hamilton (1994), and Baum (2006). An excellent treatise on GMM with a focus on time-series applications is Hall (2005). The collection of papers by Mátyás (1999) provides both theoretical and applied aspects of GMM. Here we give a brief introduction to the methodology and emphasize how the various options of gmm are used. The starting point for the generalized method of moments (GMM) estimator is the analogy principle, which says we can estimate a parameter by replacing a population moment condition with its sample analogue. For example, the mean of an independent and identically distributed (i.i.d.) population is defined as the value µ such that the first (central) population moment is zero; that is, µ solves E(y − µ) = 0 where y is a random draw from the population. 
The analogy principle tells us that to obtain an estimate, µ b, of µ, we replace the population-expectations operator with its sample analogue (Manski 1988; Wooldridge 2010): E(y − µ) = 0 −→ N N 1 X 1 X (yi − µ b) = 0 −→ µ b= yi N i=1 N i=1 where N denotes sample size and yi represents the ith observation of y in our dataset. The estimator µ b is known as the method of moments (MM) estimator, because we started with a population moment condition and then applied the analogy principle to obtain an estimator that depends on the observed data. gmm — Generalized method of moments estimation 707 Ordinary least-squares (OLS) regression can also be viewed as an MM estimator. In the model y = x0 β + u we assume that u has mean zero conditional on x: E(u|x) = 0. This conditional expectation implies the unconditional expectation E(xu) = 0 because, using the law of iterated expectations, E(xu) = Ex {E(xu|x)} = Ex {x E(u|x)} = 0 (Using the law of iterated expectations to derive unconditional expectations based on conditional expectations, perhaps motivated by subject theory, is extremely common in GMM estimation.) Continuing, E(xu) = E {x(y − x0 β)} = 0 Applying the analogy principle, N 1 X E {x(y − x β)} −→ xi (yi − x0i β) = 0 N i=1 0 so that b= β X i xi x0i −1 X i xi yi b = (X0 X)−1 X0 y written using summation notation. which is just the more familiar formula β In both the previous examples, the number of parameters we were estimating equaled the number of moment conditions. In the first example, we estimated one parameter, µ, and had one moment condition E(y − µ) = 0. In the second example, the parameter vector β had k elements, as did the vector of regressors x, yielding k moment conditions. Ignoring peculiar cases, a model of m equations in m unknowns has a unique solution, and because the moment equations in these examples were linear, we were able to solve for the parameters analytically. Had the moment conditions been nonlinear, we would have had to use numerical techniques to solve for the parameters, but that is not a significant limitation with modern computers. What if we have more moment conditions than parameters? Say we have l moment conditions and k parameters. A model of l > k equations in k unknowns does not have a unique solution. Any size-k subset of the moment conditions would yield a consistent parameter estimate, though the parameter estimate so obtained would in general be different based on which k moment conditions we used. For concreteness, let’s return to our regression model, y = x0 β + u but we no longer wish to assume that E(xu) = 0; we suspect that the error term u affects one or more elements of x. As a result, we can no longer use the OLS estimator. Suppose we have a vector z with the properties that E(zu) = 0, that the rank of E(z0 z) equals l, and that the rank of E(z0 x) = k . The first assumption simply states that z is not correlated with the error term. The second assumption rules out perfect collinearity among the elements of z. The third assumption, known as the rank condition in econometrics, ensures that z is sufficiently correlated with x and that the estimator is feasible. If some elements of x are not correlated with u, then they should also appear in z. If l < k , then the rank of E(z0 x) < k , violating the rank condition. 
708 gmm — Generalized method of moments estimation If l = k , then we can use the simpler MM estimator we already discussed; we would obtain what P P 0 −1 b=( is sometimes called theP simple instrumental-variables estimator β i zi yi . The rank i zi xi ) 0 condition ensures that i zi xi is invertible, at least in the population. b that minimizes a quadratic function of the If l > k , the GMM estimator chooses the value, β, moment conditions. We could define b ≡ arg min β β 0 X 1 X 1 zi ui (β) zi ui (β) i i N N (1) where for our linear regression example ui (β) = yi − x0i β. This estimator tries to make the moment conditions as close to zero as possible. This simple estimator, however, applies equal weight to each of the moment conditions; and as we shall see later, we can obtain more efficient estimators by choosing to weight some moment conditions more highly than others. Consider the quadratic function Q(β) = 0 X 1 X 1 zi ui (β) W zi ui (β) i i N N where W is a symmetric positive-definite matrix known as a weight matrix. Then we define the GMM estimator as b ≡ arg min Q(β) β (2) β Continuing with our regression model example, if we choose W= −1 1 X zi z0i i N (3) then we obtain X −1 X )−1 1 1 1 X xi z0i zi z0i zi x0i × i i i N N N X X −1 X 1 1 1 0 0 x i zi zi zi zi yi i i i N N N ( b= β which is the well-known two-stage least-squares (2SLS) estimator. Our choice of weight matrix here was based on the assumption that u was homoskedastic. A feature of GMM estimation is that by selecting different weight matrices, we can obtain estimators that can tolerate heteroskedasticity, clustering, autocorrelation, and other features of u. See [R] ivregress for more information about the 2SLS and linear GMM estimators. Returning to the case where the model is “just identified”, meaning that l = k , if we apply the W. Because l = k , if a unique solution exists, it will set all the sample moment conditions to zero jointly, so W has no impact on the value of β that minimizes the objective function. b regardless of our choice of GMM estimator, we will obtain the same estimate, β, We will highlight other features of the GMM estimator and the gmm command as we proceed through examples. First, though, we discuss how to specify moment equations by using substitutable expressions. gmm — Generalized method of moments estimation 709 Substitutable expressions To use the interactive version of gmm, you define the moment equations by using substitutable expressions. In most applications, your moment conditions are of the form E {zi ui (β)}, where ui (β) is a residual term that depends on the parameter vector β as well as variables in your dataset, though we suppress expressing the variables for notational simplicity; we refer to ui (β) as the moment equation to differentiate it from the moment conditions E{zi0 ui (β)} = 0. Substitutable expressions in gmm work much like those used in nl and nlsur, though with one important difference. For the latter two commands, you type the name of the dependent variable, an equal sign, and then the regression function. For example, in nl, if you want to fit the function y = f (x; β) + u, you would type nl (y = ), ... On the other hand, gmm requires you to write a substitutable expression for u; in this example, u = y − f (x; β), so you would type gmm (y - ), ... The advantage of writing the substitutable expression directly in terms of u is that you are not restricted to fitting models with additive error terms as you are with nl and nlsur. 
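To make this difference concrete, consider a brief sketch (the variables y and x and the instrument list are hypothetical and used only for illustration; this is not one of the numbered examples). The same exponential-mean model is written as a regression function for nl but as a residual for gmm:

. nl (y = exp({b0} + {b1}*x))
. gmm (y - exp({b0} + {b1}*x)), instruments(x) onestep

With nl, the additive error term is implicit; with gmm, you write the residual yourself, which is what allows you to fit models whose error terms are not additive.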
You specify substitutable expressions just like any other mathematical expression involving scalars and variables, such as those you would use with Stata’s generate command, except that the parameters to be estimated are bound in braces. See [U] 13.2 Operators and [U] 13.3 Functions for more information on expressions. Parameter names must follow the same conventions as variable names. See [U] 11.3 Naming conventions. For example, say that the tth observation on a sample moment is ut = 1 − β (1 + rt+1 )(ct+1 /ct )−γ where t denotes time period, β and γ are the parameters to be estimated, and r and c are variables in your dataset. Then you would type gmm (1 - {beta}*((1 + F.r)*(F.c/c)^(-1*{gamma}))), ... Because β and γ are parameters, we enclose them in braces. Also notice our use of the forward operator to refer to the values of r and c one period ahead; time-series operators are allowed in substitutable expressions as long as you have previously tsset (see [TS] tsset) your data. See [U] 13.9 Time-series operators for more information on time-series operators. To specify initial values for some parameters, you can include an equal sign and the initial value after a parameter: gmm (1 - {beta}*((1 + F.r)*(F.c/c)^(-1*{gamma=1}))), ... would initialize γ to be one. If you do not specify an initial value for a parameter, it is initialized to zero. Frequently, even nonlinear functions contain linear combinations of variables. As an example, suppose you have this moment equation: u = {y − exp(β1 x1 + β2 x2 + β3 x3 )} /exp(β1 x1 + β2 x2 + β3 x3 ) Instead of typing gmm ((y - exp({beta1}*x1 + {beta2}*x2 + {beta3}*x3)) / exp({beta1}*x1 + {beta2}*x2 + {beta3}*x3)) ... /// 710 gmm — Generalized method of moments estimation you can type gmm ((y - exp({xb:x1 x2 x3})) / exp({xb:})) ..... The notation {xb:x1 x2 x3} tells gmm that you want a linear combination of the variables x1, x2, and x3. We named this linear combination xb, so gmm will name the three parameters corresponding to the three variables xb x1, xb x2, and xb x3. You can name the linear combination anything you wish (subject to Stata’s naming conventions for variable names); gmm then names the parameter corresponding to variable x lc x, where lc is the name of your linear combination. You cannot use the same name for both an individual parameter and a linear combination. You can, however, refer to one parameter in a linear combination after it has been declared as you would any other parameter by using the notation {lc x}. Linear combinations do not include a constant term. Once we have declared the variables in the linear combination xb, we can subsequently refer to the linear combination in our substitutable expression by using the notation xb:. The colon is not optional; it tells gmm that you are referring to a previously declared linear combination, not an individual parameter. This shorthand notation is also handy when specifying derivatives, as we will show later. In general, there are three rules to follow when defining substitutable expressions: 1. Parameters of the model are bound in braces: {b0}, {param}, etc. 2. Initial values for parameters are given by including an equal sign and the initial value inside the braces: {b0=1}, {param=3.571}, etc. 3. Linear combinations of variables can be included using the notation {eqname:varlist}: {xb: mpg price weight}, {score: w x z}, etc. Parameters of linear combinations are initialized to zero. 
If you specify initial values by using the from() option, they override whatever initial values are given within the substitutable expression. Substitutable expressions are so named because, once values are assigned to the parameters, the resulting expressions can be handled by generate and replace. Example 1: OLS regression In Introduction, we stated that OLS is an MM estimator. Say that we want to fit the model mpg = β0 + β1 weight + β2 length + u where u is an i.i.d. error term. We type gmm — Generalized method of moments estimation 711 . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . gmm (mpg - {b1}*weight - {b2}*length - {b0}), instruments(weight length) Step 1 Iteration 0: Iteration 1: Iteration 2: GMM criterion Q(b) = GMM criterion Q(b) = GMM criterion Q(b) = 475.4138 3.305e-20 3.795e-27 Step 2 Iteration 0: Iteration 1: GMM criterion Q(b) = GMM criterion Q(b) = 7.401e-28 3.771e-31 GMM estimation Number of parameters = 3 Number of moments = 3 Initial weight matrix: Unadjusted GMM weight matrix: Robust Coef. /b1 /b2 /b0 -.0038515 -.0795935 47.88487 Number of obs Robust Std. Err. .0019472 .0677528 7.505985 z -1.98 -1.17 6.38 P>|z| 0.048 0.240 0.000 = 74 [95% Conf. Interval] -.0076678 -.2123864 33.17341 -.0000351 .0531995 62.59633 Instruments for equation 1: weight length _cons Recall that the moment condition for OLS regression is E(xu) = 0, where x, the list of instruments, is the same as the list of regressors in the model. In our command, we defined the residual term, u, inside parentheses by using a substitutable expression; because linear combinations declared in substitutable expressions do not include a constant term, we included our own (b0). Inside the instruments() option, we listed our instruments; by default, gmm includes a constant term among the instrument list. Because the number of moments equals the number of parameters we are estimating, the model is said to be “just identified” or “exactly identified.” Therefore, the choice of weight matrix has no impact on the solution to (2), and the criterion function Q(β) achieves its minimum value at zero. The OLS estimator is a one-step GMM estimator, but we did not bother to specify the onestep option because the model is just identified. Doing a second step of GMM estimation affects neither the point estimates nor the standard errors, so to keep the syntax as simple as possible, we did not include the onestep option. The first step of estimation resulted in Q(β) = 0 as expected, and the second step of estimation did not change the minimized value of Q(β). (4 × 10−27 and 3 × 10−31 are both zero for all practical purposes.) When you do not specify either the wmatrix() or the vce() option, gmm reports heteroskedasticityrobust standard errors. The parameter estimates reported here match those that we would obtain from the command . regress mpg weight length, vce(robust) The standard errors reported by that regress command would be larger than those reported by gmm by a factor of sqrt(74/71) because regress makes a small-sample adjustment to the estimated variance matrix while gmm does not. Likewise, had we specified the vce(unadjusted) option with our gmm command, then our standard errors would differ by a factor of sqrt(74/71) from those reported by regress without the vce(robust) option. Using the notation for linear combinations of parameters, we could have typed . 
gmm (mpg - {xb: weight length} - {b0}), instruments(weight length) 712 gmm — Generalized method of moments estimation and obtained identical results. Instead of having parameters b1 and b2, with this syntax we would have parameters xb weight and xb length. Example 2: Instrumental-variables regression In Introduction, we mentioned that 2SLS can be viewed as a GMM estimator. In example 1 of [R] ivregress, we fit by 2SLS a model of rental rates (rent) as a function of the value of owner-occupied housing (hsngval) and the percentage of the population living in urban areas (pcturban): rent = β0 + β1 hsngval + β2 pcturban + u by 2SLS. We argued that random shocks that affect rental rates likely also affect housing values, so we treated hsngval as an endogenous variable. As additional instruments, we used family income, faminc, and three regional dummies (reg2–reg4). To replicate the results of ivregress 2sls by using gmm, we type . use http://www.stata-press.com/data/r13/hsng2 (1980 Census housing data) . gmm (rent - {xb:hsngval pcturban} - {b0}), > instruments(pcturban faminc reg2-reg4) vce(unadjusted) onestep Step 1 Iteration 0: GMM criterion Q(b) = 56115.03 Iteration 1: GMM criterion Q(b) = 110.91583 Iteration 2: GMM criterion Q(b) = 110.91583 GMM estimation Number of parameters = 3 Number of moments = 6 Initial weight matrix: Unadjusted Number of obs Coef. /xb_hsngval /xb_pcturban /b0 .0022398 .081516 120.7065 = 50 Std. Err. z P>|z| [95% Conf. Interval] .0003284 .2987652 15.22839 6.82 0.27 7.93 0.000 0.785 0.000 .0015961 -.504053 90.85942 .0028836 .667085 150.5536 Instruments for equation 1: pcturban faminc reg2 reg3 reg4 _cons We specified vce(unadjusted) so that we would obtain an unadjusted VCE matrix and our standard errors would match those reported in [R] ivregress. Pay attention to how we specified the instruments() option. In Introduction, we mentioned that the moment conditions for the 2SLS estimator are E(zu) = 0, and we mentioned that if some elements of x (the regressors) are not endogenous, then they should also appear in z. In this model, we assume the regressor pcturban is exogenous, so we included it in the list of instrumental variables. Commands like ivregress, ivprobit, and ivtobit accept standard varlists, so they can deduce the exogenous regressors in the model. Because gmm accepts arbitrary functions in the form of substitutable expressions, it has no way of discerning the exogenous variables of the model on its own. Also notice that we specified the onestep option. The 2SLS estimator is a one-step GMM estimator that is based on a weight matrix that assumes the error terms are i.i.d. Unlike the previous example, here we had more instruments than parameters, so the minimized value of Q(β) is nonzero. We discuss the weight matrix and its relationship to two-step estimation next. gmm — Generalized method of moments estimation 713 The weight matrix and two-step estimation b depends on the choice Recall our definition of the GMM estimator given in (2). The estimator, β, b of the weight matrix, W. Under relatively mild assumptions, our estimator, β, is consistent regardless of the choice of W, so how are we to decide what W to use? The most common solution is to use the two-step estimator, which we now describe. 
A key result in Hansen’s (1982) seminal paper is that if we denote by S the covariance matrix of the moment conditions, then the optimal (in a way we make precise later) GMM estimator is the one that uses a weight matrix equal to the inverse of the moment covariance matrix. That is, if we let S = Cov(zu), then we want to use W = S−1 . But how do we obtain S in the first place? If we assume that the errors are i.i.d., then Cov(zu) = E(u2 zz0 ) = σ 2 E(zz0 ) where σ 2 is the variance of u. Because σ 2 is a positive scalar, we can ignore it when solving (2). Thus we compute X −1 c1 = 1 W zi z0i (4) i N c 1 is the same weight which does not depend on any unknown model parameters. (Notice that W c b1. matrix used in 2SLS.) Given W1 , we can solve (2) to obtain an initial estimate, say, β b 1 , is consistent, so by Slutsky’s theorem, the sample residuals u Our estimate, β b computed at this value of β will also be consistent. Using virtually the same arguments used to justify the Huber/Eicker/White heteroskedasticity-robust VCE, if we assume that the residuals are independent though not identically distributed, we can estimate S as X b= 1 S u b2 zi z0i N i i c2 = S b −1 , yielding the two-step GMM estimate Then, in the second step, we re-solve (2), using W βb2 . If the residuals exhibit clustering, you can specify wmatrix(cluster varname) so that gmm computes a weight matrix that does not assume the ui ’s are independent within clusters identified by varname. You can specify wmatrix(hac . . .) to obtain weight matrices that are suitable for when the ui ’s exhibit autocorrelation as well as heteroskedasticity. We could take the point estimates from the second round of estimation and use them to compute c 3 , say, re-solve (2) yet again, and so on, stopping when the parameters yet another weight matrix, W or weight matrix do not change much from one iteration to the next. This procedure is known as the iterative GMM estimator and is obtained with the igmm option. Asymptotically, the two-step and iterative GMM estimators have the same distribution. However, Hall (2005, 90) suggests that the iterative estimator may have better finite-sample properties. c 1 as in (4), we could simply choose W c 1 = I, the identity matrix. Instead of computing W b 1 , would still be consistent. You can request this behavior by specifying The initial estimate, β the winitial(identity) option. However, if you specify all your moment equations of the form E(zu) = 0, we recommend using the default winitial(unadjusted) instead; the rescaling of the moment conditions implied by using a homoskedastic initial weight matrix makes the numerical routines used to solve (2) more stable. 714 gmm — Generalized method of moments estimation If you fit a model with more than one of the moment equations of the form E {h(z; β)} = 0, then you must use winitial(identity) or winitial(unadjusted, independent). With moment equations of that form, you do not specify a list of instruments, and gmm cannot evaluate (4)—the matrix expression in parentheses would necessarily be singular, so it cannot be inverted. Example 3: Two-step linear GMM estimator From the previous discussion and the comments in Introduction, we see that the linear 2SLS estimator is a one-step GMM estimator where we use the weight matrix defined in (4) that assumes the errors are i.i.d. 
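As a sketch of how these choices are requested (not one of the numbered examples; output omitted), the rental-rate model used above could be fit with the iterative estimator, and winitial(identity) could be added to override the default initial weight matrix:

. gmm (rent - {xb:hsngval pcturban} - {b0}), instruments(pcturban faminc reg2-reg4) igmm
. gmm (rent - {xb:hsngval pcturban} - {b0}), instruments(pcturban faminc reg2-reg4) igmm winitial(identity)

Because every moment equation in this model is of the form E(zu) = 0, the default winitial(unadjusted) remains the recommended starting point; the second command merely illustrates the syntax.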
If we use the 2SLS estimate of β to obtain the sample residuals, compute a new weight matrix based on those residuals, and then do a second step of GMM estimation, we obtain the linear two-step GMM estimator as implemented by ivregress gmm. In example 3 of [R] ivregress, we fit the model of rental rates as discussed in example 2 above. We now allow the residuals to be heteroskedastic, though we will maintain our assumption that they are independent. We type . gmm (rent - {xb:hsngval pcturban} - {b0}), inst(pcturban faminc reg2-reg4) Step 1 Iteration 0: GMM criterion Q(b) = 56115.03 Iteration 1: GMM criterion Q(b) = 110.91583 Iteration 2: GMM criterion Q(b) = 110.91583 Step 2 Iteration 0: GMM criterion Q(b) = .2406087 Iteration 1: GMM criterion Q(b) = .13672801 Iteration 2: GMM criterion Q(b) = .13672801 (backed up) GMM estimation Number of parameters = 3 Number of moments = 6 Initial weight matrix: Unadjusted GMM weight matrix: Robust Coef. /xb_hsngval /xb_pcturban /b0 .0014643 .7615482 112.1227 Robust Std. Err. .0004473 .2895105 10.80234 Number of obs z 3.27 2.63 10.38 = 50 P>|z| [95% Conf. Interval] 0.001 0.009 0.000 .0005877 .1941181 90.95052 .002341 1.328978 133.2949 Instruments for equation 1: pcturban faminc reg2 reg3 reg4 _cons By default, gmm computes a heteroskedasticity-robust weight matrix before the second step of estimation, though we could have specified wmatrix(robust) if we wanted to be explicit. Because we did not specify the vce() option, gmm used a heteroskedasticity-robust one. Our results match those in example 3 of [R] ivregress. Moreover, the only substantive difference between this example and example 2 is that here we did not specify the onestep option, so we obtain the two-step estimates. gmm — Generalized method of moments estimation 715 Obtaining standard errors This section is a bit more theoretical and can be skipped on first reading. However, the information is sufficiently important that you should return to this section at some point. So far in our discussion, we have focused on point estimation without much mention of how we obtain the standard errors of the estimates. We also mentioned that if we choose W to be the inverse of the covariance matrix of the moment conditions, then we obtain the “optimal” GMM estimator. We elaborate those points now. Using mostly standard statistical arguments, we can show that for the GMM estimator defined in b is given by (2), the variance of β b) = Var(β o−1 n o−1 1 n b 0 b) b )0 WSWG(β b ) G(β b )0 WG(β b) G(β) WG(β G(β N where b) = G(β 1 X ∂ui zi i N ∂ β β=β b or b) = G(β (5) 1 X ∂hi i ∂β N b β=β as the case may be and S = E(zuu0 z0 ). Assuming the vce(unadjusted) option is not specified, gmm reports standard errors based on the robust variance matrix defined in (5). For the two-step estimator, W is the weight matrix requested using the wmatrix() option, and it is calculated based on the residuals obtained after the first estimation step. The second-step point estimates and residuals are obtained, and S is calculated based on the specification of the vce() option. For the iterated estimator, W is calculated based on the second-to-last round of estimation, while S is based on the residuals obtained after the last round of estimation. Computation of the covariance matrix for the one-step estimator is, perhaps surprisingly, more involved; we discuss the covariance matrix with the one-step estimator in the technical note at the end of this section. 
If we choose the weight matrix to be the inverse of the covariance matrix of the moment conditions so that W = S−1 , then (5) simplifies substantially: n o−1 b )0 WG(β b) b ) = 1 G(β Var(β N (6) The GMM estimator constructed using this choice of weight matrix along with the covariance matrix in (6) is known as the “optimal” GMM estimator. One can show that if in fact W = S−1 , then the variance in (6) is smaller than the variance in (5) of any other GMM estimator based on the same moment conditions but with a different choice of weight matrix. Thus the optimal GMM estimator is also known as the efficient GMM estimator, because it has the smallest variance of any estimator based on the given moment conditions. To obtain standard errors from gmm based on the optimal GMM estimator, you specify the vce(unadjusted) option. We call that VCE unadjusted because we do not recompute the residuals after estimation to obtain the matrix S required in (5) or allow for the fact that those residuals may not be i.i.d. Some statistical packages by default report standard errors based on (6) and offer standard errors based on (5) only as an option or not at all. While the optimal GMM estimator is theoretically appealing, Cameron and Trivedi (2005, 177) suggest that in finite samples it need not perform better than the GMM estimator that uses (5) to obtain standard errors. 716 gmm — Generalized method of moments estimation Technical note Computing the covariance matrix of the parameters after using the one-step estimator is actually a bit more complex than after using the two-step or iterative estimators. We can illustrate most of the intricacies by using linear regression with moment conditions of the form E{x(y − x0 β)} = 0. If you specify winitial(unadjusted) and vce(unadjusted), then the initial weight matrix will be computed as !−1 1 X 0 c W1 = xi xi (7) N i Moreover, for linear regression, we can show that b) = G(β 1 X xi x0i N i so that (6) becomes ! 1 X 1 0 b) = Var(β xi xi N N i 1 X xi x0i N i !−1 !−1 1 X xi x0i N i !−1 = X xi x0i i = (X0 X)−1 (8) However, we know that the nonrobust covariance matrix for the OLS estimator is actually σ b2 (X0 X)−1 . 2 What is missing from (8) is the scalar σ b , the estimated variance of the residuals. When you use the one-step estimator and specify winitial(unadjusted), the weight matrix (7) does not include the σ b2 term because gmm does not have a consistent estimate of β from which it can then estimate σ 2 . The point estimates are still correct, because multiplying the weight matrix by a scalar factor does not affect the solution to the minimization problem. To circumvent this issue, if you specify winitial(unadjusted) and vce(unadjusted), gmm b (which is consistent) to obtain a new unadjusted weight matrix that does include uses the estimated β the term σ b2 so that evaluating (6) will yield correct standard errors. If you use the two-step or iterated GMM estimators, this extra effort is not needed to obtain standard errors because the first-step (and subsequent steps’) estimate of β is consistent and can be used to estimate σ 2 or some other weight matrix based on the wmatrix() option. Straightforward algebra shows that this extra effort is also not needed if you request any type of adjusted (robust) covariance matrix with the one-step estimator. A similar issue arises when you specify winitial(identity) and vce(unadjusted) with the one-step estimator. 
Again the solution is to compute an unadjusted weight matrix after obtaining βb so that (6) provides the correct standard errors. We have illustrated the problem and solution using a single-equation linear model. However, the problem arises whenever you use the one-step estimator with an unadjusted VCE, regardless of the number of equations; and gmm handles all the details automatically. Computation of Hansen’s J statistic presents an identical issue, and gmm takes care of that as well. gmm — Generalized method of moments estimation 717 If you supply your own initial weight matrix by using winitial(matname), then the standard errors (as well as the J statistic reported by estat overid) are based on that weight matrix. You should verify that the weight matrix you provide will yield appropriate statistics. Exponential (Poisson) regression models Exponential regression models are frequently encountered in applied work. For example, they can be used as alternatives to linear regression models on log-transformed dependent variables, obviating the need for post-hoc transformations to obtain predicted values in the original metric of the dependent variable. When the dependent variable represents a discrete count variable, they are also known as Poisson regression models; see Cameron and Trivedi (2013). For now, we consider models of the form y = exp(x0 β) + u (9) where u is a zero-mean additive error term so that E(y) = exp(x0 β). Because the error term is additive, if x represents strictly exogenous regressors, then we have the population moment condition E[x{y − exp(x0 β)}] = 0 (10) Moreover, because the number of parameters in the model is equal to the number of instruments, there is no point to using the two-step GMM estimator. Example 4: Exponential regression Cameron and Trivedi (2010, 323) fit a model of the number of doctor visits based on whether the patient has private insurance, whether the patient has a chronic disease, gender, and income. Here we fit that model by using gmm. To allow for potential excess dispersion, we will obtain a robust VCE matrix, which is the default for gmm anyway. We type . use http://www.stata-press.com/data/r13/docvisits . gmm (docvis - exp({xb:private chronic female income}+{b0})), > instruments(private chronic female income) onestep Step 1 Iteration 0: GMM criterion Q(b) = 16.853973 Iteration 1: GMM criterion Q(b) = 2.2706472 Iteration 2: GMM criterion Q(b) = .19088097 Iteration 3: GMM criterion Q(b) = .00041101 Iteration 4: GMM criterion Q(b) = 3.939e-09 Iteration 5: GMM criterion Q(b) = 6.572e-19 GMM estimation Number of parameters = 5 Number of moments = 5 Initial weight matrix: Unadjusted Number of obs Coef. /xb_private /xb_chronic /xb_female /xb_income /b0 .7986654 1.091865 .4925481 .003557 -.2297263 Robust Std. Err. .1089891 .0559888 .0585298 .0010824 .1108607 z 7.33 19.50 8.42 3.29 -2.07 P>|z| 0.000 0.000 0.000 0.001 0.038 = 4412 [95% Conf. Interval] .5850507 .9821291 .3778317 .0014356 -.4470093 Instruments for equation 1: private chronic female income _cons 1.01228 1.201601 .6072644 .0056784 -.0124434 718 gmm — Generalized method of moments estimation Our point estimates agree with those reported by Cameron and Trivedi to at least six significant digits; the small discrepancies are attributable to different optimization techniques and convergence criteria being used by gmm and poisson. 
The standard errors differ by a factor of sqrt(4412/4411) because gmm uses N in the denominator of the formula for the robust covariance matrix, while the robust covariance matrix estimator used by poisson uses N − 1. Technical note That the GMM and maximum likelihood estimators of the exponential regression model coincide is not a general property of these two classes of estimators. The maximum likelihood estimator solves the score equations N 1 X ∂ ln `i =0 N ∂β i=1 where li is the likelihood for the ith observation. These score equations can be viewed as the sample analogues of the population moment conditions E ∂ ln `i ∂β =0 establishing that maximum likelihood estimators represent a subset of the class of GMM estimators. For the Poisson model, ln `i = −exp(x0i β) + yi x0i β − ln yi ! so the score equations are N 1 X xi {yi − exp(x0i β)} = 0 N i=1 which are just the sample moment conditions implied by (10) that we used in the previous example. That is why our results using gmm match Cameron and Trivedi’s results using poisson. On the other hand, an intuitive set of moment conditions to consider for GMM estimation of a probit model is E[x{y − Φ(x0 β)}] = 0 where Φ() is the standard normal cumulative distribution function. Differentiating the likelihood function for the maximum likelihood probit estimator, we can show that the corresponding score equations are N 1 X φ(x0i β) φ(x0i β) x i yi − (1 − y ) =0 i N Φ(x0i β) 1 − Φ(x0i β) i=1 where φ() is the standard normal density function. These two moment conditions are not equivalent, so the maximum likelihood and GMM probit estimators are distinct. gmm — Generalized method of moments estimation 719 Example 5: Comparison of GMM and maximum likelihood Using the automobile dataset, here we fit a probit model of foreign on gear ratio, length, and headroom using first the score equations and then the intuitive set of GMM equations. We type . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . global xb "{b1}*gear_ratio + {b2}*length + {b3}*headroom + {b0}" . global phi "normalden($xb)" . global Phi "normal($xb)" . gmm (foreign*$phi/$Phi - (1-foreign)*$phi/(1-$Phi)), > instruments(gear_ratio length headroom) onestep (output omitted ) . estimates store ml . gmm (foreign - $Phi), instruments(gear_ratio length headroom) onestep (output omitted ) . estimates store gmm . estimates table ml gmm, b se Variable ml gmm b1 _cons 2.9586277 .64042341 2.8489213 .63570247 _cons -.02148933 .01382043 -.02056033 .01396954 _cons .01136927 .27278528 .02240761 .2849891 _cons -6.0222289 3.5594588 -5.8595615 3.5188028 b2 b3 b0 legend: b/se The coefficients on gear ratio and length are close for the two estimators. The GMM estimate of the coefficient on headroom is twice that of the maximum likelihood estimate, though the relatively large standard errors imply that this difference is not significant. You can verify that the coefficients in the column marked “ml” match those you would obtain using probit. We have not discussed the differences among standard errors based on the various GMM and maximum-likelihood covariance matrix estimators to avoid tedious algebra, though you can verify that the robust covariance matrix after one-step GMM estimation differs by only a finite-sample adjustment factor of (N/N − 1) from the robust covariance matrix reported by probit. 
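That verification is straightforward; as a quick sketch (output omitted; this step is ours and not part of the original example), fit the same model by maximum likelihood and compare:

. probit foreign gear_ratio length headroom

The reported coefficients on gear_ratio, length, headroom, and the constant should match the /b1, /b2, /b3, and /b0 estimates stored as ml above, up to optimization tolerances.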
Both the maximum likelihood and GMM probit estimators require the normality assumption, and the maximum likelihood estimator is efficient if that normality assumption is correct; therefore, in this particular example, there is no reason to prefer the GMM estimator. We can modify (10) easily to allow for endogenous regressors. Suppose that xj is endogenous in the sense that E(u|xj ) 6= 0. Then (10) is no longer a valid moment condition. However, suppose we have some variables other than x such that E(u|z) = 0. We can instead use the moment conditions 720 gmm — Generalized method of moments estimation E(zu) = E[z{y − exp(x0 β)}] = 0 (11) As usual, if some elements of x are exogenous, then they should appear in z as well. Example 6: Exponential regression with endogenous regressors Returning to the model discussed in example 4, here we treat income as endogenous; unobservable factors that determine a person’s income may also affect the number of times a person visits a doctor. We use a person’s age and race as instruments. These are valid instruments if we believe that age and race influence a person’s income but do not have a direct impact on the number of doctor visits. (Whether this belief is justified is another matter; we test that belief in [R] gmm postestimation.) Because we have more instruments (seven) than parameters (five), we have an overidentified model. Therefore, the choice of weight matrix does matter. We will utilize the default two-step GMM estimator. In the first step, we will use a weight matrix that assumes the errors are i.i.d. In the second step, we will use a weight matrix that assumes heteroskedasticity. When you specify twostep, these are the defaults for the first- and second-step weight matrices, so we do not have to use the winitial() or wmatrix() options. We will again obtain a robust VCE, which is also the default. We type . use http://www.stata-press.com/data/r13/docvisits . gmm (docvis - exp({xb:private chronic female income}+{b0})), > instruments(private chronic female age black hispanic) twostep Step 1 Iteration 0: GMM criterion Q(b) = 16.910173 Iteration 1: GMM criterion Q(b) = .82276104 Iteration 2: GMM criterion Q(b) = .21832032 Iteration 3: GMM criterion Q(b) = .12685935 Iteration 4: GMM criterion Q(b) = .12672369 Iteration 5: GMM criterion Q(b) = .12672365 Step 2 Iteration 0: GMM criterion Q(b) = .00234641 Iteration 1: GMM criterion Q(b) = .00215957 Iteration 2: GMM criterion Q(b) = .00215911 Iteration 3: GMM criterion Q(b) = .00215911 GMM estimation Number of parameters = 5 Number of moments = 7 Initial weight matrix: Unadjusted GMM weight matrix: Robust Coef. /xb_private /xb_chronic /xb_female /xb_income /b0 .535335 1.090126 .6636579 .0142855 -.5983477 Robust Std. Err. .1599039 .0617659 .0959884 .0027162 .138433 Number of obs z 3.35 17.65 6.91 5.26 -4.32 P>|z| 0.001 0.000 0.000 0.000 0.000 = 4412 [95% Conf. Interval] .2219291 .9690668 .4755241 .0089618 -.8696713 .8487409 1.211185 .8517918 .0196092 -.327024 Instruments for equation 1: private chronic female age black hispanic _cons Once we control for the endogeneity of income, we find that its coefficient has quadrupled in size. Additionally, access to private insurance has less of an impact on the number of doctor visits and gender has more of an impact. 
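Because this model has seven instruments and only five parameters, it is overidentified, so the validity of the additional instruments can be probed with the test of overidentifying restrictions mentioned earlier. As a sketch (output omitted; the test is documented in [R] gmm postestimation), type

. estat overid

immediately after the gmm command above. A small Hansen's J statistic (equivalently, a large p-value) is consistent with age, black, and hispanic being validly excluded from the doctor-visit equation; a large J statistic casts doubt on that exclusion restriction.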
gmm — Generalized method of moments estimation 721 Technical note Although perhaps at first tempting, unlike the Poisson model, you cannot simply replace x in the moment conditions for the probit (or logit) model with a vector of instruments, z, if you have endogenous regressors. See Wilde (2008). Mullahy (1997) considers a slightly more complicated version of the exponential regression model that incorporates nonadditive unobserved heterogeneity. His model can be written as yi = exp(x0i β)ηi + i where ηi > 0 is an unobserved heterogeneity term that may be correlated with xi . One result from his paper is that instead of using the additive moment condition (10), we can use the multiplicative moment condition y − exp(x0 β) = E[z{y exp(−x0 β) − 1}] = 0 (12) E z exp(x0 β) Windmeijer and Santos Silva (1997) discuss the use of additive versus multiplicative moment conditions with endogenous regressors and note that a set of instruments that satisfies the additive moment conditions will not also satisfy the multiplicative moment conditions. They remark that which to use is an empirical issue that can at least partially be settled by using the test of overidentifying restrictions that is implemented by estat overid after gmm to ascertain whether the instruments for a given model are valid. See [R] gmm postestimation for information on the test of overidentifying restrictions. Specifying derivatives By default, gmm calculates derivatives numerically, and the method used produces accurate results for the vast majority of applications. However, if you refit the same model repeatedly or else have the derivatives available, then gmm will run more quickly if you supply it with analytic derivatives. When you use the interactive version of gmm, you specify derivatives using substitutable expressions in much the same way you specify the moment equations. There are three rules you must follow: 1. As with the substitutable expressions that define residual equations, you bind parameters of the model in braces: {b0}, {param}, etc. 2. You must specify a derivative for each parameter that appears in each moment equation. If a parameter does not appear in a moment equation, then you do not specify a derivative for that parameter in that moment equation. 3. If you declare a linear combination in an equation, then you specify a derivative with respect to that linear combination. gmm applies the chain rule to obtain the derivatives with respect to the individual parameters encompassed by that linear combination. We illustrate with several examples. Example 7: Derivatives for a single-equation model Consider a simple exponential regression model with one exogenous regressor and a constant term. We have ui = yi − exp(β0 + β1 xi ) Now ∂ui = −exp(β0 + β1 xi ) ∂β0 and ∂ui = −xi exp(β0 + β1 xi ) ∂β1 722 gmm — Generalized method of moments estimation In Stata, we type . gmm (docvis - exp({b0} + {b1}*income)), instruments(income) > deriv(/b0 = -1*exp({b0} + {b1}*income)) > deriv(/b1 = -1*income*exp({b0}+{b1}*income)) onestep Step 1 Iteration 0: GMM criterion Q(b) Iteration 1: GMM criterion Q(b) Iteration 2: GMM criterion Q(b) Iteration 3: GMM criterion Q(b) Iteration 4: GMM criterion Q(b) Iteration 5: GMM criterion Q(b) GMM estimation Number of parameters = 2 Number of moments = 2 Initial weight matrix: Unadjusted Coef. /b0 /b1 1.204888 .0046702 = = = = = = 9.1548611 3.5146131 .01344695 3.690e-06 4.606e-13 1.501e-26 Number of obs Robust Std. Err. .0462355 .0009715 z 26.06 4.81 = 4412 P>|z| [95% Conf. 
Interval] 0.000 0.000 1.114268 .0027662 1.295507 .0065743 Instruments for equation 1: income _cons Notice how we specified the derivative() option for each parameter. We simply specified a slash, the name of the parameter, an equal sign, then a substitutable expression that represents the derivative. Because our model has only one residual equation, we do not need to specify equation numbers in the derivative() options. When you specify a linear combination of variables, your derivative should be with respect to the entire linear combination. For example, say we have the residual equation ui = y − exp(x0i β + β0 ) for which we would type . gmm (y - exp({xb: x1 x2 x3} + {b0}) ... Then in addition to the derivative ∂ui /∂β0 , we are to compute and specify ∂ui = −exp(x0i β + β0 ) ∂x0i β Using the chain rule, ∂ui /∂βj = ∂ui /∂(x0i β) × ∂(x0i β)/∂βj = −xij exp(x0i β + β0 ). Stata does this last calculation automatically. It knows the variables in the linear combination, so all it needs is the derivative of the residual function with respect to the linear combination. This allows you to change the variables in your linear combination without having to change the derivatives. Example 8: Derivatives with a linear combination We refit the model described in the example illustrating exponential regression with endogenous regressors, now providing analytic derivatives. We type gmm — Generalized method of moments estimation . gmm (docvis - exp({xb:private chronic female income}+{b0})), > instruments(private chronic female age black hispanic) > derivative(/xb = -1*exp({xb:} + {b0})) > derivative(/b0 = -1*exp({xb:} + {b0})) Step 1 Iteration 0: GMM criterion Q(b) = 16.910173 Iteration 1: GMM criterion Q(b) = .82270871 Iteration 2: GMM criterion Q(b) = .21831995 Iteration 3: GMM criterion Q(b) = .12685934 Iteration 4: GMM criterion Q(b) = .12672369 Iteration 5: GMM criterion Q(b) = .12672365 Step 2 Iteration 0: GMM criterion Q(b) = .00234641 Iteration 1: GMM criterion Q(b) = .00215957 Iteration 2: GMM criterion Q(b) = .00215911 Iteration 3: GMM criterion Q(b) = .00215911 GMM estimation Number of parameters = 5 Number of moments = 7 Initial weight matrix: Unadjusted Number of obs GMM weight matrix: Robust Coef. /xb_private /xb_chronic /xb_female /xb_income /b0 .535335 1.090126 .6636579 .0142855 -.5983477 Robust Std. Err. .159904 .0617659 .0959885 .0027162 .138433 z 3.35 17.65 6.91 5.26 -4.32 P>|z| 0.001 0.000 0.000 0.000 0.000 = 723 4412 [95% Conf. Interval] .221929 .9690668 .475524 .0089618 -.8696714 .848741 1.211185 .8517918 .0196092 -.327024 Instruments for equation 1: private chronic female age black hispanic _cons In the first derivative() option, we specified the name of the linear combination, xb, instead of an individual parameter’s name. We already declared the variables of our linear combination in the substitutable expression for the residual equation, so in our substitutable expressions for the derivatives, we can use the shorthand notation {xb:} to refer to it. Our point estimates are identical to those we obtained earlier. The standard errors and confidence intervals differ by only trivial amounts. Exponential regression models with panel data In addition to supporting cross-sectional and time-series data, gmm also works with panel-data models. Here we illustrate gmm’s panel-data capabilities by expanding our discussion of exponential regression models to allow for panel data. This also provides us the opportunity to demonstrate the moment-evaluator program version of gmm. 
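As a quick orientation before the details, a moment-evaluator program receives a list of variables to fill in with the residuals, an if condition marking the estimation sample, and an at() option holding the current parameter vector. A minimal sketch, with placeholder names gmm_sketch, y, and x1, is

program gmm_sketch
        version 13
        syntax varlist if, at(name)
        // fill in `varlist' with the residual evaluated at the
        // parameter values stored in the row vector `at'
        quietly replace `varlist' = y - exp(x1*`at'[1,1] + `at'[1,2]) `if'
end
. gmm gmm_sketch, nequations(1) parameters(b1 b0) instruments(x1) onestep

The examples below expand on this pattern to handle panel-level means, predetermined regressors, and endogenous regressors.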
Our discussion is based on Blundell, Griffith, and Windmeijer (2002). Also see Wooldridge (1999) for further discussion of nonlinear panel-data models. First, we expand (9) for panel data. With individual heterogeneity term ηi , we have E(yit |xit , ηi ) = exp(x0it β + ηi ) = µit νi where µit = exp(x0it β) and νi = exp(ηi ). Note that there is no constant term in this model because its effect cannot be disentangled from νi . With an additive idiosyncratic error term, we have the regression model yit = µit νi + it 724 gmm — Generalized method of moments estimation We do not impose the assumption E(xit ηi ) = 0, so ηi can be considered a fixed effect in the sense that it may be correlated with the regressors. As discussed by Blundell, Griffith, and Windmeijer (2002), if xit is strictly exogenous, meaning E(xit is ) = 0 for all t and s, then we can estimate the parameters of the model by using the sample moment conditions XX y xit yit − µit i = 0 (13) µi t i where y i and µi are the means of yit and µit for panel i, respectively. Because µi depends on the parameters of the model, it must be recomputed each time gmm needs to evaluate the residual equation. Therefore, we cannot use the substitutable expression version of gmm. Instead, we must use the moment-evaluator program version. The moment-evaluator program version of gmm functions much like the function-evaluator program versions of nl and nlsur. The program you write is passed one or more variables to be filled in with the residuals evaluated at the parameter values specified in an option passed to your program. For the fixed-effects Poisson model with strictly exogenous regressors, our first crack at a function-evaluator program is program gmm_poi version 13 syntax varlist if, at(name) quietly { tempvar mu mubar ybar gen double ‘mu’ = exp(x1*‘at’[1,1] + x2*‘at’[1,2] /// + x3*‘at’[1,3]) ‘if’ egen double ‘mubar’ = mean(‘mu’) ‘if’, by(id) egen double ‘ybar’ = mean(y) ‘if’, by(id) replace ‘varlist’ = y - ‘mu’*‘ybar’/‘mubar’ ‘if’ } end You can save your program in an ado-file named name.ado, where name is the name you use for your program; here we would save the program in the ado-file gmm poi.ado. Alternatively, if you are working from within a do-file, you can simply define the program before calling gmm. The syntax statement declares that we are expecting to receive a varlist, containing the names of variables whose values we are to replace with the values of the residual equations, and an if expression that will mark the estimation sample; because our model has one residual equation, varlist will consist of one variable. at() is a required option to our program, and it will contain the name of a matrix containing the parameter values at which we are to evaluate the residual equation. All moment-evaluator programs must accept the varlist, if condition, and at() option. The first part of our program computes µit . In the model we will fit shortly, we have three regressors, named x1, x2, and x3. The ‘at’ vector will have three elements, one for each of those variables. Notice that we included ‘if’ at the end of each statement that affects variables to restrict the computations to the relevant estimation sample. The two egen statements compute µi and y i ; in the example dataset we will use shortly, the panel variable is named id, and for simplicity we hardcoded that variable into our program as well. Finally, we compute the residual equation, which is the portion of (13) bound in parentheses. 
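For reference, written out cleanly, moment condition (13) and the residual that the replace statement computes are

\sum_i \sum_t \mathbf{x}_{it} \left( y_{it} - \mu_{it}\,\frac{\bar{y}_i}{\bar{\mu}_i} \right) = \mathbf{0},
\qquad
u_{it}(\boldsymbol{\beta}) = y_{it} - \mu_{it}\,\frac{\bar{y}_i}{\bar{\mu}_i}

which is exactly the quantity assigned to `varlist' above.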
gmm — Generalized method of moments estimation 725 Example 9: Panel Poisson with strictly exogenous regressors To fit our model, we type . use http://www.stata-press.com/data/r13/poisson1 . gmm gmm_poi, nequations(1) parameters(b1 b2 b3) > instruments(x1 x2 x3, noconstant) vce(cluster id) onestep Step 1 Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: GMM GMM GMM GMM GMM criterion criterion criterion criterion criterion Q(b) Q(b) Q(b) Q(b) Q(b) = = = = = 51.99142 .04345191 8.720e-06 7.115e-13 5.130e-27 GMM estimation Number of parameters = 3 Number of moments = 3 Initial weight matrix: Unadjusted Number of obs = 409 (Std. Err. adjusted for 45 clusters in id) Coef. /b1 /b2 /b3 1.94866 -2.966119 1.008634 Robust Std. Err. .1000265 .0923592 .1156561 z 19.48 -32.12 8.72 P>|z| [95% Conf. Interval] 0.000 0.000 0.000 1.752612 -3.14714 .781952 2.144709 -2.785099 1.235315 Instruments for equation 1: x1 x2 x3 All three of our regressors are strictly exogenous, so they can serve as their own regressors. There is no constant term in the model (it would be unidentified), so we exclude a constant term from our list of instruments. We have one residual equation as indicated by nequations(1), and we have three parameters, named b1, b2, and b3. The order in which you declare parameters in the parameters() option determines the order in which they appear in the ‘at’ vector in the moment-evaluator program. We specified vce(cluster id) to obtain standard errors that allow for correlation among observations within each panel. The program we just wrote is sufficient to fit the model to the poisson1 dataset, but if we want to fit that model to other datasets, we would need to change the variable names and perhaps account for having a different number of parameters as well. Despite those limitations, if you just want to fit a single model, that program is adequate. Next we take advantage of the ability to specify full equation names in the parameters() option and rewrite our evaluator program so that we can more easily change the variables in our model. This approach is particularly useful if some of the moment equations are linear in the parameters, because then we can use matrix score (see [P] matrix score) to evaluate those moments. 726 gmm — Generalized method of moments estimation Our new evaluator program is program gmm_poieq version 13 syntax varlist if, at(name) quietly { tempvar mu mubar ybar matrix score double ‘mu’ = ‘at’ ‘if’, eq(#1) replace ‘mu’ = exp(‘mu’) egen double ‘mubar’ = mean(‘mu’) ‘if’, by(id) egen double ‘ybar’ = mean(y) ‘if’, by(id) replace ‘varlist’ = y - ‘mu’*‘ybar’/‘mubar’ ‘if’ } end Rather than using generate to compute the temporary variable ‘mu’, we used matrix score to obtain the linear combination x0it β and then called replace to compute exp(x0it β). Example 10: Panel Poisson using matrix score To fit our model, we type . use http://www.stata-press.com/data/r13/poisson1 . gmm gmm_poieq, nequations(1) parameters(y:x1 y:x2 y:x3) > instruments(x1 x2 x3, noconstant) vce(cluster id) onestep Step 1 Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: GMM GMM GMM GMM GMM criterion criterion criterion criterion criterion Q(b) Q(b) Q(b) Q(b) Q(b) = = = = = 51.99142 .04345191 8.720e-06 7.115e-13 5.106e-27 GMM estimation Number of parameters = 3 Number of moments = 3 Initial weight matrix: Unadjusted Number of obs = 409 (Std. Err. adjusted for 45 clusters in id) Coef. x1 x2 x3 1.94866 -2.966119 1.008634 Robust Std. Err. 
.1000265 .0923592 .1156561 z 19.48 -32.12 8.72 P>|z| [95% Conf. Interval] 0.000 0.000 0.000 1.752612 -3.14714 .781952 2.144709 -2.785099 1.235315 Instruments for equation 1: x1 x2 x3 Instead of specifying simple parameter names in the parameters() option, we specified an equation name and the variables associated with that equation. We named our equation y, but you could use any valid Stata name. When we use this syntax, the rows of the coefficient table are grouped by the equation names. Say we wanted to refit our model using just x1 and x3 as regressors. We do not need to make any changes to gmm poieq. We just change the specification of the parameters() option: . gmm gmm_poieq, nequations(1) parameters(y:x1 y:x3) /// > instruments(x1 x3, noconstant) vce(cluster id) onestep gmm — Generalized method of moments estimation 727 In this evaluator program, we have still hard-coded the name of the dependent variable. The next two examples include methods to tackle that shortcoming. Technical note Say we specify the parameters() option like this: . gmm . . ., parameters(y1:x1 y1:x2 y1:_cons y2:_cons y3:x1 y3:_cons) Then the ‘at’ vector passed to our program will have the following column names attached to it: ‘at’[1,6] y1: x1 y1: y1: y2: x2 _cons _cons y3: y3: x1 _cons Typing . matrix score double eq1 = ‘at’, eq(#1) is equivalent to typing . generate double eq1 = x1*‘at’[1,1] + x2*‘at’[1,2] + ‘at’[1,3] with one important difference. If we change some of the variables in the parameters() option when we call gmm, matrix score will compute the correct linear combination. If we were to use the generate statement instead, then every time we wanted to change the variables in our model, we would have to modify that statement as well. The command . matrix score double alpha = ‘at’, eq(#2) scalar is equivalent to . scalar alpha = ‘at’[1,4] Thus even if you specify equation and variable names in the parameters() option, you can still have scalar parameters in your model. When past values of the idiosyncratic error term affect the value of a regressor, we say that regressor is predetermined. When one or more regressors are predetermined, sample moment condition (10) is no longer valid. However, Chamberlain (1992) shows that a simple alternative is to consider moment conditions of the form T XX yit xi,t−1 yi,t−1 − µi,t−1 =0 (14) µit i t=2 Also see Wooldridge (1997) and Windmeijer (2000) for other moment conditions that can be used with predetermined regressors. 728 gmm — Generalized method of moments estimation Example 11: Panel Poisson with predetermined regressors Here we refit the previous model, treating all the regressors as predetermined and using the moment conditions in (14). Our moment-evaluator program is program gmm_poipre version 13 syntax varlist if, at(name) mylhs(varlist) quietly { tempvar mu matrix score double ‘mu’ = ‘at’ ‘if’, eq(#1) replace ‘mu’ = exp(‘mu’) replace ‘varlist’ = L.‘mylhs’ - L.‘mu’*‘mylhs’/‘mu’ ‘if’ } end To compute the moment equation, we used lag-operator notation so that Stata properly handles gaps in our panel dataset. We also made our program accept an additional option that we will use to pass in the dependent variable. When we specify this option in our gmm statement, it will get passed to our evaluator program because gmm will not recognize the option as one of its own. Equation (14) shows that we are to use the first lags of the regressors as instruments, so we type . 
gmm gmm_poipre, mylhs(y) nequations(1) vce(cluster id) onestep > parameters(y:x1 y:x2 y:x3) instruments(L.(x1 x2 x3), noconstant) warning: 45 missing values returned for equation 1 at initial values Step 1 Iteration Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: 5: GMM GMM GMM GMM GMM GMM criterion criterion criterion criterion criterion criterion Q(b) Q(b) Q(b) Q(b) Q(b) Q(b) = = = = = = 52.288048 2.3599714 .16951739 .00020399 3.392e-10 9.230e-22 GMM estimation Number of parameters = 3 Number of moments = 3 Initial weight matrix: Unadjusted Number of obs = 319 (Std. Err. adjusted for 45 clusters in id) Coef. x1 x2 x3 2.025956 -2.909646 1.202926 Robust Std. Err. .2777156 .2577577 .1873571 z 7.30 -11.29 6.42 P>|z| 0.000 0.000 0.000 [95% Conf. Interval] 1.481644 -3.414842 .8357131 2.570269 -2.404451 1.570139 Instruments for equation 1: L.x1 L.x2 L.x3 Here, like earlier with strictly exogenous regressors, the number of instruments equals the number of parameters, so there is no gain to using the two-step or iterated estimators. However, if you do have more instruments than parameters, you will most likely want to use one of those other estimators instead. Instead of making our program accept the mylhs() option, we could have used Stata’s coleq extended macro function to determine the dependent variable based on the column names attached to the ‘at’ vector; see [P] macro. Then we could refit our model with a different dependent variable by changing the eqname used in the parameters() option. In the next example, we take this approach. gmm — Generalized method of moments estimation 729 In the previous example, we used xi,t−1 as instruments. A more efficient GMM estimator would also use xi,t−2 , xi,t−3 , . . . , xi,1 as instruments in period t as well. gmm’s xtinstruments() option allows you to specify instrument lists that grow as t increases. Later we discuss the xtinstruments() option in detail in the context of linear dynamic panel-data models. When a regressor is contemporaneously correlated with the idiosyncratic error term, we say that regressor is endogenous. Windmeijer (2000) shows that here we can use the moment condition T XX i xi,t−2 t=3 yit yi,t−1 − µit µi,t−1 Here we use the second lag of the endogenous regressor as an instrument. If a variable is strictly exogenous, it can of course serve as its own instrument. Example 12: Panel Poisson with endogenous regressors Here we refit the model, treating x3 as endogenous and x1 and x2 as strictly exogenous. Our moment-evaluator program is program gmm_poiend version 13 syntax varlist if, at(name) quietly { tempvar mu matrix score double ‘mu’ = ‘at’ ‘if’, eq(#1) replace ‘mu’ = exp(‘mu’) local mylhs : coleq ‘at’ local mylhs : word 1 of ‘mylhs’ replace ‘varlist’ = ‘mylhs’/‘mu’ - L.‘mylhs’/L.‘mu’ ‘if’ } end Now we call gmm using x1, x2, and L2.x3 as instruments: 730 gmm — Generalized method of moments estimation . use http://www.stata-press.com/data/r13/poisson2 . 
gmm gmm_poiend, nequations(1) vce(cluster id) onestep > parameters(y:x1 y:x2 y:x3) instruments(x1 x2 L2.x3, noconstant) warning: 500 missing values returned for equation 1 at initial values Step 1 Iteration 0: GMM criterion Q(b) = 43.799922 Iteration 1: GMM criterion Q(b) = .06998898 Iteration 2: GMM criterion Q(b) = .04165161 Iteration 3: GMM criterion Q(b) = .03573502 Iteration 4: GMM criterion Q(b) = .00001981 Iteration 5: GMM criterion Q(b) = 3.168e-12 Iteration 6: GMM criterion Q(b) = 1.529e-23 GMM estimation Number of parameters = 3 Number of moments = 3 Initial weight matrix: Unadjusted Number of obs = 3266 (Std. Err. adjusted for 500 clusters in id) Coef. x1 x2 x3 1.857766 -2.865858 4.961867 Robust Std. Err. .2631454 .2151529 14.91462 z 7.06 -13.32 0.33 P>|z| 0.000 0.000 0.739 [95% Conf. Interval] 1.34201 -3.28755 -24.27025 2.373521 -2.444167 34.19399 Instruments for equation 1: x1 x2 L2.x3 The warning at the top of the output indicates that we have 500 panels in our dataset. Our moment equation includes lagged terms and therefore cannot be evaluated for the first time period within each panel. Warning messages like that can be ignored once you know why they occurred. If you receive a warning message that you were not expecting, you should first investigate the cause of the warning before trusting the results. As in the previous example, instead of using just xi,t−2 as an instrument, we could use all further lags of xit as instruments as well. Rational-expectations models Macroeconomic models typically assume that agents’ expectations about the future are formed rationally. By rational expectations, we mean that agents use all information available when forming their forecasts, so the forecast error is uncorrelated with the information available when the forecast was made. Say that at time t, people make a forecast, ybt+1 , of variable y in the next period. If Ωt denotes all available information at time t, then rational expectations implies that E {(b yt+1 − yt+1 )|Ωt } = 0. If Ωt denotes observable variables such as interest rates or prices, then this conditional expectation can serve as the basis of a moment condition for GMM estimation. Example 13: Fitting a Euler equation In a well-known article, Hansen and Singleton (1982) consider a model of portfolio decision making and discuss parameter estimation using GMM. We will consider a simple example with one asset in which the agent can invest. A consumer wants to maximize the present value of his lifetime utility derived from consuming the good. On the one hand, the consumer is impatient, so he would rather consume today than wait until tomorrow. On the other hand, if he consumes less today, he can invest more of his money, earning more interest that he can then use to consume more of the good tomorrow. Thus there is a tradeoff between having his cake today or sacrificing a bit today to have more cake tomorrow. gmm — Generalized method of moments estimation 731 If we assume a specific form for the agent’s utility function, known as the constant relative-risk aversion utility function, we can show that the Euler equation is E zt 1 − β(1 + rt+1 )(ct+1 /ct )−γ = 0 where β and γ are the parameters to estimate, rt is the return to the financial asset, and ct is consumption in period t. β measures the agent’s discount factor. If β is near one, the agent is patient and is more willing to forgo consumption this period. If β is close to zero, the agent is less patient and prefers to consume more now. 
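For background (gmm does not use this directly), the constant relative-risk aversion utility function referred to above is conventionally written as

u(c_t) = \frac{c_t^{1-\gamma} - 1}{1 - \gamma}, \qquad \gamma \neq 1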
The parameter γ characterizes the agent’s utility function. If γ equals zero, the utility function is linear. As γ tends toward one, the utility function tends toward u = log(c). We have data on 3-month Treasury bills (rt ) and consumption expenditures (ct ). As instruments, we will use lagged rates of return and past growth rates of consumption. We will use the two-step estimator and a weight matrix that allows for heteroskedasticity and autocorrelation up to four lags with the Bartlett kernel. In Stata, we type . use http://www.stata-press.com/data/r13/cr . generate cgrowth = c / L.c (1 missing value generated) . gmm (1 - {b=1}*(1+F.r)*(F.c/c)^(-1*{gamma=1})), > inst(L.r L2.r cgrowth L.cgrowth) wmat(hac nw 4) twostep warning: 1 missing value returned for equation 1 at initial values Step 1 Iteration 0: GMM criterion Q(b) = .00226482 Iteration 1: GMM criterion Q(b) = .00054369 Iteration 2: GMM criterion Q(b) = .00053904 Iteration 3: GMM criterion Q(b) = .00053904 Step 2 Iteration 0: GMM criterion Q(b) = .0600729 Iteration 1: GMM criterion Q(b) = .0596369 Iteration 2: GMM criterion Q(b) = .0596369 GMM estimation Number of parameters = 2 Number of moments = 5 Initial weight matrix: Unadjusted GMM weight matrix: HAC Bartlett 4 Coef. /b /gamma .9204617 -4.222361 HAC Std. Err. .0134646 1.473895 z 68.36 -2.86 Number of obs P>|z| 0.000 0.004 = 239 [95% Conf. Interval] .8940716 -7.111143 .9468518 -1.333579 HAC standard errors based on Bartlett kernel with 4 lags. Instruments for equation 1: L.r L2.r cgrowth L.cgrowth _cons The warning message at the top of the output appears because the forward operator in our substitutable expression says that residuals can be computed only for 239 observations; our dataset contains 240 observations. Our estimate of β is near one, in line with expectations and published results. However, our estimate of γ implies risk-loving behavior and therefore a poorly specified model. 732 gmm — Generalized method of moments estimation System estimators In many economic models, two or more variables are determined jointly through a system of simultaneous equations. Indeed, some of the earliest work in econometrics, including that of the Cowles Commission, was centered around estimation of the parameters of simultaneous equations. The 2SLS and IV estimators we have already discussed are used in some circumstances to estimate such parameters. Here we focus on the joint estimation of all the parameters of systems of equations, and we begin with the well-known three-stage least-squares (3SLS) estimator. Recall that the 2SLS estimator is based on the moment conditions E(zu) = 0. The 2SLS estimator can be used to estimate the parameters of one equation of a system of structural equations. Moreover, with the 2SLS estimator, we do not even need to specify the structural relationship among all the endogenous variables; we need to specify only the equation on which interest focuses and simply assume reduced-form relationships among the endogenous regressors of the equation of interest and the exogenous variables of the model. If we are willing to specify the complete system of structural equations, then assuming our model is correctly specified, by estimating all the equations jointly, we can obtain estimates that are more efficient than equation-by-equation 2SLS. 
In [R] reg3, we fit a simple two-equation macroeconomic model: consump = β0 + β1 wagepriv + β2 wagegovt + 1 wagepriv = β3 + β4 consump + β5 govt + β6 capital1 + 2 (15) (16) where consump represents aggregate consumption; wagepriv and wagegovt are total wages paid by the private and government sectors, respectively; govt is government spending; and capital1 is the previous period’s capital stock. We are not willing to assume that 1 and 2 are independent, so we must treat both consump and wagepriv as endogenous. Suppose that a random shock makes 2 positive. Then by (16), wagepriv will be higher than it otherwise would. Moreover, 1 will either be higher or lower, depending on the correlation between it and 2 . The shock to 2 has made both wagepriv and 1 move, implying that in (15) wagepriv is an endogenous regressor. A similar argument shows that consump is an endogenous regressor in the second equation. In our model, wagegovt, govt, and capital1 are all exogenous variables. Let z1 and z2 denote the instruments for the first and second equations, respectively; we will discuss what comprises them shortly. We have two sets of moment conditions: z1 (consump − β0 − β1 wagepriv − β2 wagegovt) E =0 (17) z2 (wagepriv − β3 − β4 consump − β5 govt − β6 capital1) One of the defining characteristics of 3SLS is that the errors are homoskedastic conditional on the instrumental variables. Using this assumption, we have σ11 E(z1 z01 ) σ12 E(z1 z02 ) z1 1 0 0 (18) E { z1 1 z2 2 } = z2 2 σ21 E(z2 z01 ) σ22 E(z2 z02 ) where σij = cov(i , j ). Let Σ denote the 2 × 2 matrix with typical element σij . The second defining characteristic of the 3SLS estimator is that it uses all the exogenous variables as instruments for all equations; here z1 = z2 = (wagegovt, govt, capital1, 1), where the 1 indicates a constant term. From our discussion on the weight matrix and two-step estimation, we want to use the sample analogue of the matrix inverse of the right-hand side of (18) as our weight matrix. To implement the 3SLS estimator, we apparently need to know Σ or at least have a consistent estimator of it. The solution is to fit (15) and (16) by 2SLS, use the sample residuals b1 and b2 to estimate Σ, then estimate the parameters of (17) via GMM by using the weight matrix just discussed. gmm — Generalized method of moments estimation 733 Example 14: 3SLS estimation 3SLS is easier to do using gmm than it sounds. The 3SLS estimator is a two-step GMM estimator. In the first step, we do the equivalent of 2SLS on each equation, and then we compute a weight matrix based on (18). Finally, we perform a second step of GMM with this weight matrix. In Stata, we type . use http://www.stata-press.com/data/r13/klein, clear . gmm (eq1: consump - {b0} - {xb: wagepriv wagegovt}) > (eq2: wagepriv - {c0} - {xc: consump govt capital1}), > instruments(eq1: wagegovt govt capital1) > instruments(eq2: wagegovt govt capital1) > winitial(unadjusted, independent) wmatrix(unadjusted) twostep Step 1 Iteration 0: GMM criterion Q(b) = 4195.4487 Iteration 1: GMM criterion Q(b) = .22175631 Iteration 2: GMM criterion Q(b) = .22175631 (backed up) Step 2 Iteration 0: GMM criterion Q(b) = .09716589 Iteration 1: GMM criterion Q(b) = .07028208 Iteration 2: GMM criterion Q(b) = .07028208 GMM estimation Number of parameters = 7 Number of moments = 8 Initial weight matrix: Unadjusted Number of obs GMM weight matrix: Unadjusted Coef. /b0 /xb_wagepriv /xb_wagegovt /c0 /xc_consump /xc_govt /xc_capital1 19.3559 .8012754 1.029531 14.63026 .4026076 1.177792 -.0281145 Std. 
Err. 3.583772 .1279329 .3048424 10.26693 .2567312 .5421253 .0572111 z 5.40 6.26 3.38 1.42 1.57 2.17 -0.49 P>|z| 0.000 0.000 0.001 0.154 0.117 0.030 0.623 = 22 [95% Conf. Interval] 12.33184 .5505314 .432051 -5.492552 -.1005764 .1152461 -.1402462 26.37996 1.052019 1.627011 34.75306 .9057916 2.240338 .0840173 Instruments for equation 1: wagegovt govt capital1 _cons Instruments for equation 2: wagegovt govt capital1 _cons The independent suboption of the winitial() option tells gmm to assume that the residuals are independent across equations; this suboption sets σ21 = σ12 = 0 in (18). Assuming both homoskedasticity and cross-equation independence is equivalent to fitting the two equations of our model independently by 2SLS. The wmatrix() option controls how the weight matrix is computed based on the first-step parameter estimates before the second step of estimation; here we request a weight matrix that assumes conditional homoskedasticity but that does not impose the cross-equation independence like the initial weight matrix we used. In this example, we also illustrated how to name equations and how equation names can be used in the instruments() option. Our results are identical to those in [R] reg3. We could have specified our instruments with the syntax instruments(wagegovt govt capital1) because gmm uses the variables listed in the instruments() option for all equations unless you specify which equations the list of instruments is to be used with. However, we wanted to emphasize that the same instruments are being used for both equations; in a moment, we will discuss an estimator that does not use the same instruments in all equations. 734 gmm — Generalized method of moments estimation In the previous example, if we omit the twostep option, the resulting coefficients will be equivalent to equation-by-equation 2SLS, which Wooldridge (2010, 216) calls the “system 2SLS estimator”. Eliminating the twostep option makes the wmatrix() option irrelevant, so that option can be eliminated as well. So far, we have developed the traditional 3SLS estimator. Wooldridge (2010, chap. 8) discusses the “GMM 3SLS” estimator that extends the traditional 3SLS estimator by allowing for heteroskedasticity and different instruments for different equations. Generalizing (18) to an arbitrary number of equations, we have E (Z0 0 Z) = E (Z0 ΣZ) (19) where z1 0 Z= ... 0 z2 .. . ··· ··· .. . 0 0 · · · zm 0 0 .. . and Σ is now m × m. Equation (19) is the multivariate analogue of a homoskedasticity assumption; for each equation, the error variance is constant for all observations, as is the covariance between any two equations’ errors. We can relax this homoskedasticity assumption by considering different weight matrices. For example, if we continue to assume that observations are independent but not necessarily identically distributed, then by specifying wmatrix(robust), we would obtain a weight matrix that allows for heteroskedasticity: X c= 1 W Z0i b i b 0i Zi N i This is the weight matrix in Wooldridge’s (2010, 218) Procedure 8.1, “GMM with Optimal Weighting Matrix”. By default, gmm would report standard errors based on his covariance matrix (8.27); specifying vce(unadjusted) would provide the optimal GMM standard errors. If you have multiple observations for each individual or firm in your dataset, you could specify wmatrix(cluster id), where id identifies individuals or firms. This would allow arbitrary within-individual correlation, though it does not account for an individual-specific fixed or random effect. 
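As an illustration, a heteroskedasticity-robust GMM 3SLS fit of the system from example 14 could be requested as follows (a sketch; output omitted):

. gmm (eq1: consump - {b0} - {xb: wagepriv wagegovt})
>     (eq2: wagepriv - {c0} - {xc: consump govt capital1}),
>     instruments(wagegovt govt capital1)
>     winitial(unadjusted, independent) wmatrix(robust) twostep

Only the wmatrix() specification differs from example 14; with grouped data, wmatrix(cluster id) would be substituted in the same place.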
In both cases, we would continue to use winitial(unadjusted, independent) so that the first-step estimates are the system 2SLS estimates. Wooldridge (2010, sec. 9.6) discusses instances where it is necessary to use different instruments in different equations. The GMM 3SLS estimator with different instruments in different equations but with conditional homoskedasticity is what Hayashi (2000, 275) calls the “full-information instrumental variables efficient” (FIVE) estimator. Implementing the FIVE estimator is easy with gmm. For example, say we have a two-equation system, where kids, age, income, and education are all valid instruments for the first equation; but education is not a valid instrument for the second equation. Then our syntax would take the form gmm ( ) ( ), instruments(1:kids age income education) instruments(2:kids age income) The following syntax is equivalent: gmm ( ) ( ), instruments(kids age income) instruments(1:education) gmm — Generalized method of moments estimation 735 Because we did not specify a list of equations in the second example’s first instruments() option, those variables are used as instruments in both equations. You can use whichever syntax you prefer. The first requires a bit more typing but is arguably more transparent. If all the regressors in the model are exogenous, then the traditional 3SLS estimator is the seemingly unrelated regression (SUR) estimator. Here you would specify all the regressors as instruments. Dynamic panel-data models Commands in Stata that work with panel data expect the data to be in the “long” format, meaning that each row of the dataset consists of one subobservation that is a member of a logical observation (represented by the panel identifier variable). See [D] reshape for a discussion of the long versus “wide” data forms. gmm is no exception in this respect when used with panel data. From a theoretical perspective, however, it is sometimes easier to view GMM estimators for panel data as system estimators in which we have N observations on a system of T equations, where N and T are the number of observations and panels, respectively, rather than a single-equation estimator with N T observations. Usually, each of the T equations will in fact be the same, though we will want to specify different instruments for each of these equations. In a dynamic panel-data model, lagged values of the dependent variable are included as regressors. Here we consider a simple model with one lag of the dependent variable y as a regressor and a vector of strictly exogenous regressors, xit : yit = ρyi,t−1 + x0it β + ui + it (20) ui can be either a fixed- or a random-effect term, in the sense that we do not require xit to be independent of it. Even with the assumption that it is i.i.d., the presence of both yi,t−1 and ui in (20) renders both the standard fixed- and random-effects estimators to be inconsistent because of the well-known Nickell (1981) bias. OLS regression of yit on yi,t−1 and xit also produces inconsistent estimates, because yi,t−1 will be correlated with the error term. 
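To spell out that last claim, the composite error in (20) is u_i + \epsilon_{it}, and lagging (20) one period gives

y_{i,t-1} = \rho\, y_{i,t-2} + \mathbf{x}_{i,t-1}'\boldsymbol{\beta} + u_i + \epsilon_{i,t-1}

so y_{i,t-1} contains u_i, which also appears in the error of the equation for y_{it}; hence E\{ y_{i,t-1}(u_i + \epsilon_{it}) \} \neq 0 even when \epsilon_{it} is i.i.d.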
Technical note Stata has the xtabond, xtdpd, and xtdpdsys commands (see [XT] xtabond, [XT] xtdpd, and [XT] xtdpdsys) to fit equations like (20), and for everyday use those commands are preferred because they offer features such as Windmeijer (2005) bias-corrected standard errors to account for the bias of traditional two-step GMM standard errors seen in dynamic panel-data models and, being linear estimators, only require you to specify variable names instead of complete equations. However, using gmm has several pedagogical advantages, including the ability to tie those model-specific commands into a more general framework, a clear illustration of how certain types of instrument matrices for panel-data models are formed, and demonstrations of several advanced features of gmm. First-differencing (20) removes the panel-specific ui term: yit − yi,t−1 = ρ(yi,t−1 − yi,t−2 ) + (xit − xi,t−1 )0 β + (it − i,t−1 ) (21) However, now (yi,t−1 − yi,t−2 ) is correlated with (it − i,t−1 ). Thus we need an instrument that is correlated with the former but not the latter. The lagged variables in (21) mean that equation is not estimable for t < 3, so consider when t = 3. We have yi3 − yi2 = ρ(yi2 − yi1 ) + (xi3 − xi2 )0 β + (i3 − i2 ) (22) 736 gmm — Generalized method of moments estimation In the Arellano–Bond (1991) estimator, lagged levels of the dependent variable are used as instruments. With our assumption that the it are i.i.d., (20) intimates that yi1 can serve as an instrumental variable when we fit (22). Next consider (21) when t = 4. We have yi4 − yi3 = ρ(yi3 − yi2 ) + (xi4 − xi3 )0 β + (i4 − i3 ) Now (20) shows that both yi1 and yi2 are uncorrelated with the error term (i4 − i3 ), so we have two instruments available. For t = 5, you can show that yi1 , yi2 , and yi3 can serve as instruments. As may now be apparent, one of the key features of these dynamic panel-data models is that the available instruments depend on the time period, t, as was the case for some of the panel Poisson models we considered earlier. Because the xit are strictly exogenous by assumption, they can serve as their own instruments. The initial weight matrix that is appropriate for the GMM dynamic panel-data estimator is slightly more involved than the unadjusted matrix we have used in most of our previous examples that assumes the errors are i.i.d. First, rewrite (21) for panel i as L yi − yiL = ρ (yiL − yiLL ) + (Xi − XL i )β + (i − i ) where yi = (yi3 , . . . , yiT ) and yiL = (yi2 , . . . , yi,T −1 ), yiLL = (yi1 , . . . , yi,T −2 ), and Xi , XL i , i , and L i are defined analogously. Let Z denote the full matrix of instruments for panel i, including the variables specified in both the instruments() and xtinstruments() options; the exact structure is detailed in Methods and formulas. By assumption, it is i.i.d., so the first-difference (it − i,t−1 ) is necessarily autocorrelated with correlation −0.5. Therefore, we should not use a weight matrix that assumes the errors are independent. For dynamic panel-data models, we can show that the appropriate initial weight matrix is c= W where −1 1 X 0 Zi HD Zi i N 1 −0.5 0 ... 0 0 −0.5 1 −0.5 . . . 0 0 . .. .. .. .. . . . = . . . . . . 0 0 0 ... 1 −0.5 0 0 0 . . . −0.5 1 HD We can obtain this initial weight matrix by specifying winitial(xt D). The letter D indicates that the equation we are estimating is specified in first-differences. 
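Written out, the initial weight matrix and the H_D matrix just described are

\widehat{W} = \left( \frac{1}{N} \sum_i \mathbf{Z}_i' H_D \mathbf{Z}_i \right)^{-1},
\qquad
H_D =
\begin{pmatrix}
 1   & -0.5 &  0   & \cdots &  0   &  0   \\
-0.5 &  1   & -0.5 & \cdots &  0   &  0   \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
 0   &  0   &  0   & \cdots &  1   & -0.5 \\
 0   &  0   &  0   & \cdots & -0.5 &  1
\end{pmatrix}

reflecting the first-order autocorrelation of −0.5 in the first-differenced errors.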
Example 15: Arellano–Bond estimator Say we want to fit the model nit = ρ ni,t−1 + β1 wit + β2 wi,t−1 + β3 kit + β4 ki,t−1 + ui + it (23) where we assume that wit and kit are strictly exogenous. First-differencing, our residual equation is ∗it = (it − i,t−1 ) =nit − ni,t−1 − ρ (ni,t−1 − ni,t−2 ) − β1 (wit − wi,t−1 ) − β2 (wi,t−1 − wi,t−2 ) − β3 (kit − ki,t−1 ) − β4 (ki,t−1 − ki,t−2 ) (24) gmm — Generalized method of moments estimation 737 In Stata, we type . use http://www.stata-press.com/data/r13/abdata . gmm (D.n - {rho}*LD.n - {xb:D.w LD.w D.k LD.k}), > xtinstruments(n, lags(2/.)) instruments(D.w LD.w D.k LD.k, noconstant) > deriv(/rho = -1*LD.n) deriv(/xb = -1) winitial(xt D) onestep Step 1 Iteration 0: Iteration 1: Iteration 2: GMM criterion Q(b) = GMM criterion Q(b) = GMM criterion Q(b) = .0011455 .00009103 .00009103 (backed up) GMM estimation Number of parameters = 5 Number of moments = 32 Initial weight matrix: XT D Coef. /rho /xb_D_w /xb_LD_w /xb_D_k /xb_LD_k .8041712 -.5600476 .3946699 .3520286 -.2160435 Number of obs Robust Std. Err. .1199819 .1619472 .1092229 .0536546 .0679689 z 6.70 -3.46 3.61 6.56 -3.18 P>|z| 0.000 0.001 0.000 0.000 0.001 = 751 [95% Conf. Interval] .5690111 -.8774583 .1805969 .2468676 -.3492601 1.039331 -.242637 .6087429 .4571897 -.0828269 Instruments for equation 1: XT-style: L(2/.).n Standard: D.w LD.w D.k LD.k Because w and k are strictly exogenous, we specified the variants of them that appear in (24) in the instruments() option; because there is no constant term in the model, we specified noconstant to omit the constant from the instrument list. We specified xtinstruments(n, lags(2/.)) to tell gmm what instruments to use for the lagged dependent variable included as a regressor in (23). Based on our previous discussion, lags two and higher of nit can serve as instruments. The lags(2/.) suboption tells gmm that the first available instrument for nit is the lag-two value ni,t−2 . The “.” tells gmm to use all further lags of nit as instruments as well. The instrument matrices in dynamic panel-data models can become large if the dataset has many time periods per panel. In those cases, you could specify, for example, lags(2/4) to use just lags two through four instead of using all available lags. Our results are identical to those we would obtain using xtabond with the syntax xtabond n L(0/1).w L(0/1).k, lags(1) noconstant vce(robust) Had we left off the vce(robust) option in our call to xtabond, we would have had to specify vce(unadjusted) in our call to gmm to obtain the same standard errors. Technical note gmm automatically excludes observations for which there are no valid observations for the panelstyle instruments. However, it keeps in the estimation sample those observations for which fewer than the maximum number of instruments you requested are available. For example, if you specify the lags(2/4) suboption, you have requested three instruments, but gmm will keep observations even if only one or two instruments are available. 738 gmm — Generalized method of moments estimation Example 16: Two-step Arellano–Bond estimator Here we refit the model from example 15, using the two-step GMM estimator. . 
> > > gmm (D.n - {rho}*LD.n - {xb:D.w LD.w D.k LD.k}), xtinstruments(n, lags(2/.)) instruments(D.w LD.w D.k LD.k, noconstant) deriv(/rho = -1*LD.n) deriv(/xb = -1) winitial(xt D) wmatrix(robust) vce(unadjusted) Step 1 Iteration 0: Iteration 1: Iteration 2: GMM criterion Q(b) = GMM criterion Q(b) = GMM criterion Q(b) = .0011455 .00009103 .00009103 Step 2 Iteration 0: Iteration 1: Iteration 2: GMM criterion Q(b) = GMM criterion Q(b) = GMM criterion Q(b) = .44107941 .4236729 .4236729 (backed up) GMM estimation Number of parameters = 5 Number of moments = 32 Initial weight matrix: XT D GMM weight matrix: Robust Coef. /rho /xb_D_w /xb_LD_w /xb_D_k /xb_LD_k .8044783 -.5154978 .4059309 .3556204 -.2204521 Std. Err. .0534763 .0335506 .0637294 .0390892 .046439 Number of obs z 15.04 -15.36 6.37 9.10 -4.75 P>|z| 0.000 0.000 0.000 0.000 0.000 = 751 [95% Conf. Interval] .6996667 -.5812557 .2810235 .2790071 -.3114709 .90929 -.4497399 .5308384 .4322337 -.1294332 Instruments for equation 1: XT-style: L(2/.).n Standard: D.w LD.w D.k LD.k Our results match those you would obtain with the command xtabond n L(0/1).(w k), lags(1) noconstant twostep Technical note Had we specified vce(robust) in our call to gmm, we would have obtained the traditional sandwich-based robust covariance matrix, but our standard errors would not match those we would obtain by specifying vce(robust) with the xtabond command. The xtabond, xtdpd, and xtdpdsys commands implement a bias-corrected robust VCE for the two-step GMM dynamic panel-data estimator. Traditional VCEs computed after the two-step dynamic panel-data estimator have been shown to exhibit often-severe bias; see Windmeijer (2005). Neither of the two dynamic panel-data examples (15 and 16) we have fit so far include a constant term. When a constant term is included, the dynamic panel-data estimator is in fact a two-equation system estimator. For notational simplicity, consider a simple model containing just a constant term and one lag of the dependent variable: yit = α + ρyi,t−1 + ui + it gmm — Generalized method of moments estimation 739 First-differencing to remove the ui term, we have yit − yi,t−1 = ρ(yi,t−1 − yi,t−2 ) + (it − i,t−1 ) (25) This has also eliminated the constant term. If we assume E(ui ) = 0, which is reasonable if a constant term is included in the model, then we can recover α by including the moment condition yit = α + ρyi,t−1 + 0it (26) where 0it = ui + it . The parameter ρ continues to be identified by (25), so the only instrument we use with (26) is a constant term. As before, the error term (i,t − i,t−1 ) is necessarily autocorrelated with correlation coefficient −0.5, though the error term 0it is white noise. Therefore, our initial weight matrix should be c= W −1 1 X 0 Zi HZi i N where H= HD 0 0 I and I is a conformable identity matrix. One complication arises concerning the relevant estimation sample. Looking at (25), we apparently lose the first two observations from each panel because of the presence of yi,t−2 , but in (26) we need only to sacrifice one observation, for yi,t−1 . For most multiple-equation models, we need to use the same estimation sample for all equations. However, in dynamic panel-data models, we can use more observations to fit the equation in level form [(26) here] than the equation in first-differences [equation (25)]. To request this behavior, we specify the nocommonesample option to gmm. 
That option tells gmm to use as many observations as possible for each equation, ignoring the loss of observations due to lagging or differencing. Example 17: Arellano–Bond estimator with constant term Here we fit the model nit = α + ρ ni,t−1 + ui + it Without specifying derivatives, our command would be . gmm (D.n - {rho}*LD.n) (n - {alpha} - {rho}*L.n), > xtinstruments(1: n, lags(2/.)) instruments(1:, noconstant) onestep > winitial(xt DL) vce(unadj) nocommonesample We would specify winitial(xt DL) to obtain the required initial weight matrix. The notation DL indicates that our first moment equation is in first-differences and the second moment equation is in levels (not first-differenced). We exclude a constant in the instrument list for the first equation, because first-differencing removed the constant term. Because we do not specify the instruments() option for the second moment equation, a constant is used by default. This example also provides us the opportunity to illustrate how to specify derivatives for multipleequation GMM models. Within the derivative() option, instead of specifying just the parameter name, now you must specify the equation name or number, a slash, and the parameter name to which the derivative applies. In Stata, we type 740 gmm — Generalized method of moments estimation . gmm (D.n - {rho}*LD.n) (n - {alpha} - {rho}*L.n), > xtinstruments(1: n, lags(2/.)) instruments(1:, noconstant) > derivative(1/rho = -1*LD.n) derivative(2/alpha = -1) > derivative(2/rho = -1*L.n) winitial(xt DL) vce(unadj) > nocommonesample onestep Step 1 Iteration 0: GMM criterion Q(b) = .09894466 Iteration 1: GMM criterion Q(b) = .00023508 Iteration 2: GMM criterion Q(b) = .00023508 GMM estimation Number of parameters = 2 Number of moments = 29 Initial weight matrix: XT DL Number of obs Coef. /rho /alpha 1.023349 -.0690864 Std. Err. .0608293 .0660343 z 16.82 -1.05 P>|z| 0.000 0.295 = * [95% Conf. Interval] .9041259 -.1985112 1.142572 .0603384 * Number of observations for equation 1: 751 Number of observations for equation 2: 891 Instruments for equation 1: XT-style: L(2/.).n Instruments for equation 2: Standard: _cons These results are identical to those we would obtain by typing xtabond n, lags(1) Because we specified nocommonesample, gmm did not report the number of observations used in the header of the output. In this dataset, there are in fact 1,031 observations on 140 panels. In the second equation, the presence of the lagged value of n reduces the sample size for that equation to 1031 − 140 = 891. In the first equation, we lose the first two observations per panel due to lagging and differencing, leading to 751 usable observations. These tallies are listed after the coefficient table in the output. Technical note Specifying xtinstruments(x1 x2 x3, lags(1/3)) differs from instruments(L(1/3).(x1 x2 x3)) in how observations are excluded from the estimation sample. When you use the latter syntax, gmm must exclude the first three observations from each panel when computing the moment equation: you requested three lags of each regressor be used as instruments, so the first residual that could be interacted with those instruments is the one for t = 4. On the other hand, when you use xtinstruments(), you are telling gmm that you would like to use up to the first three lags of x1, x2, and x3 as instruments but that using just one lag is acceptable. 
Because most panel datasets have a relatively modest number of observations per panel, dynamic instrument lists are typically used so that the number of usable observations is maximized. Dynamic instrument lists also accommodate the fact that there are more valid instruments for later time periods than earlier time periods. gmm — Generalized method of moments estimation 741 Specifying panel-style instruments using the xtinstruments() option also affects how the standard instruments specified in the instruments() option are treated. To illustrate, suppose we have a balanced panel dataset with T = 5 observations per panel and we specify . gmm . . ., xtinstruments(w, lags(1/2)) instruments(x) We will lose the first observation because we need at least one lag of w to serve as an instrument. Our instrument matrix for panel i will therefore be wi1 0 0 0 Zi = 0 0 0 xi2 1 0 wi1 wi2 0 0 0 0 xi3 1 0 0 0 wi2 wi3 0 0 xi4 1 0 0 0 0 0 wi3 wi4 xi5 1 (27) The vector of ones in the final row represents the constant term implied by the instruments() option. Because we lost the first observation, the residual vector ui will be 4 × 1. Thus our moment conditions for the ith panel can be written in matrix notation as ui2 (β) u (β) E{Zi ui (β)} = E Zi i3 =0 ui4 (β) ui5 (β) The moment conditions corresponding to the final two rows of (27) say that E (T =4 X t=2 ) xit uit (β) =0 and E (T =4 X ) uit (β) =0 t=2 Because we specified panel-style instruments with the xtinstruments() option, gmm no longer uses moment conditions for strictly exogenous variables of the form E{xit uit (β)} = 0 for each t. Instead, the moment conditions now stipulate that the average (over t) of xit uit (β) has expectation zero. This corresponds to the approach proposed by Arellano and Bond (1991, 280) and others. When you request panel-style instruments with the xtinstruments() option, the number of instruments in the Zi matrix increases quadratically in the number of periods. The dynamic paneldata estimators we have discussed in this section are designed for datasets that contain a large number of panels and a modest number of time periods. When the number of time periods is large, estimators that use standard (non–panel-style) instruments are more appropriate. We have focused on the Arellano–Bond dynamic panel-data estimator because of its relative simplicity. gmm can additionally fit any models that can be formulated using the xtdpd and xtdpdsys commands; see [XT] xtdpd and [XT] xtdpdsys. The key is to determine the appropriate instruments to use for the level and difference equations. You may find it useful to fit a version of your model with those commands to determine what instruments and XT-style instruments to use. We conclude this section with an example using the Arellano–Bover/Blundell–Bond estimator. 742 gmm — Generalized method of moments estimation Example 18: Arellano–Bover/Blundell–Bond estimator We fit a small model that includes one lag of the dependent variable n as a regressor as well as the contemporaneous and first lag of w, which we assume are strictly exogenous. We could fit our model using xtdpdsys using the syntax xtdpdsys n L(0/1).w, lags(1) twostep Applying virtually all the syntax issues we have discussed so far, the equivalent gmm command is . 
gmm (n - {rho}*L.n - {w}*w - {lagw}*L.w - {c}) > (D.n - {rho}*LD.n - {w}*D.w - {lagw}*LD.w), > xtinst(1: D.n, lags(1/1)) xtinst(2: n, lags(2/.)) > inst(2: D.w LD.w, noconstant) > deriv(1/rho = -1*L.n) deriv(1/w = -1*w) > deriv(1/lagw = -1*L.w) deriv(1/c = -1) > deriv(2/rho = -1*LD.n) deriv(2/w = -1*D.w) > deriv(2/lagw = -1*LD.w) > winit(xt LD) wmatrix(robust) vce(unadjusted) > nocommonesample Step 1 Iteration 0: GMM criterion Q(b) = .10170339 Iteration 1: GMM criterion Q(b) = .00022772 Iteration 2: GMM criterion Q(b) = .00022772 Step 2 Iteration 0: GMM criterion Q(b) = .59965014 Iteration 1: GMM criterion Q(b) = .56578186 Iteration 2: GMM criterion Q(b) = .56578186 GMM estimation Number of parameters = 4 Number of moments = 39 Initial weight matrix: XT LD Number of obs GMM weight matrix: Robust Coef. /rho /w /lagw /c 1.122738 -.6719909 .571274 .154309 Std. Err. .0206512 .0246148 .0403243 .17241 z 54.37 -27.30 14.17 0.90 P>|z| 0.000 0.000 0.000 0.371 = * [95% Conf. Interval] 1.082263 -.7202351 .4922398 -.1836084 1.163214 -.6237468 .6503083 .4922263 * Number of observations for equation 1: 891 Number of observations for equation 2: 751 Instruments for equation 1: XT-style: LD.n Standard: _cons Instruments for equation 2: XT-style: L(2/.).n Standard: D.w LD.w Details of moment-evaluator programs In examples 9, 10, 11, and 12, we used moment-evaluator programs to evaluate moment conditions that could not be specified using the interactive version of gmm. In example 11, we also showed how to pass additional information to an evaluator program. Here we discuss how to make moment-evaluator programs provide derivatives and accept weights. gmm — Generalized method of moments estimation 743 The complete specification for a moment-evaluator program’s syntax statement is syntax varlist if [weight], at(name) options [derivatives(varlist)] The macro ‘varlist’ contains the list of variables that we are to fill in with the values of our residual equations. The macro ‘if’ represents an if condition that restricts the estimation sample. The macro ‘at’ represents a vector containing the parameter values at which we are to evaluate our residual equations. options represent other options that you specify in your call to gmm and want to have passed to your moment-evaluator programs. In example 11, we included the mylhs() option so that we could pass the name of the dependent variable to our evaluator program. Two new elements of the syntax statement allow for weights and derivatives. weight specifies the types of weights your program allows. The interactive version of gmm allows for fweights, aweights, and pweights. However, unless you explicitly allow your moment evaluator program to accept weights, you cannot specify weights in your call to gmm with the moment-evaluator program version. The derivatives() option is used to pass to your program a set of variables that you are to fill in with the derivatives of your residual equations with respect to the parameters. To indicate that your program can calculate derivatives, you specify either the hasderivatives or the haslfderivatives option to gmm. The hasderivatives option indicates that your program calculates parameter-level derivatives; that method requires more work but can be applied to any GMM problem. The haslfderivatives option requires less work but can be used only when the model’s residual equations satisfy certain restrictions and you use the : syntax with the parameters() option. 
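Putting the two extensions together, the opening lines of a moment-evaluator program that accepts weights and can supply derivatives might read

        syntax varlist if [fweight aweight pweight], at(name) [derivatives(varlist)]

with the body filling in `varlist' as before and, whenever `derivatives' is not empty, filling in those variables as well; the examples that follow show the derivative logic in full.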
We first consider how to write the derivative computation logic to work with the hasderivatives option and provide an example; then we do the same for the haslfderivatives option.

Say you specify k parameters in the nparameters() or parameters() option and q equations in the nequations() or equations() option, and you specify hasderivatives. Then ‘derivatives’ will contain k × q variables. The first k variables are for the derivatives of the first residual equation with respect to the k parameters, the second k variables are for the derivatives of the second residual equation, and so on.

Example 19: Specifying derivatives with simple parameter names

To focus on how to specify derivatives, we return to the simple moment-evaluator program we used in example 9, in which we had three regressors, and extend it to supply derivatives. The residual equation corresponding to moment condition (13) is

$$ u_{it}(\boldsymbol\beta) = y_{it} - \mu_{it}\,\frac{\bar y_i}{\bar\mu_i} $$

where $\mu_{it}$, $\bar\mu_i$, and $\bar y_i$ were defined previously. Now

$$ \frac{\partial\, u_{it}(\boldsymbol\beta)}{\partial\beta_j}
   = -\mu_{it}\,\frac{\bar y_i}{\bar\mu_i^{\,2}}
     \left( x^{(j)}_{it}\,\bar\mu_i - \frac{1}{T}\sum_{l=1}^{T} x^{(j)}_{il}\,\mu_{il} \right)
   \tag{28} $$

where $x^{(j)}_{it}$ represents the jth element of $\mathbf{x}_{it}$.

Our moment-evaluator program is

program gmm_poideriv
        version 13
        syntax varlist if, at(name) [derivatives(varlist)]
        quietly {
                // Calculate residuals as before
                tempvar mu mubar ybar
                gen double ‘mu’ = exp(x1*‘at’[1,1] + x2*‘at’[1,2]   ///
                                + x3*‘at’[1,3]) ‘if’
                egen double ‘mubar’ = mean(‘mu’) ‘if’, by(id)
                egen double ‘ybar’ = mean(y) ‘if’, by(id)
                replace ‘varlist’ = y - ‘mu’*‘ybar’/‘mubar’ ‘if’
                // Did -gmm- request derivatives?
                if "‘derivatives’" == "" {
                        exit            // no, so we are done
                }
                // Calculate derivatives
                // We need the panel means of x1*mu, x2*mu, and x3*mu
                tempvar work x1mubar x2mubar x3mubar
                generate double ‘work’ = x1*‘mu’ ‘if’
                egen double ‘x1mubar’ = mean(‘work’) ‘if’, by(id)
                replace ‘work’ = x2*‘mu’ ‘if’
                egen double ‘x2mubar’ = mean(‘work’) ‘if’, by(id)
                replace ‘work’ = x3*‘mu’ ‘if’
                egen double ‘x3mubar’ = mean(‘work’) ‘if’, by(id)
                local d1: word 1 of ‘derivatives’
                local d2: word 2 of ‘derivatives’
                local d3: word 3 of ‘derivatives’
                replace ‘d1’ = -1*‘mu’*‘ybar’/‘mubar’^2*(x1*‘mubar’ - ‘x1mubar’)
                replace ‘d2’ = -1*‘mu’*‘ybar’/‘mubar’^2*(x2*‘mubar’ - ‘x2mubar’)
                replace ‘d3’ = -1*‘mu’*‘ybar’/‘mubar’^2*(x3*‘mubar’ - ‘x3mubar’)
        }
end

The derivatives() option is made optional in the syntax statement by placing it in square brackets. If gmm needs to evaluate your moment equations but does not need derivatives at that time, then the derivatives() option will be empty. In our program, we check whether that is the case and, if so, exit without calculating derivatives. As is often the case with [R] ml as well, the portion of our program devoted to derivatives is longer than the code to compute the objective function.

The first part of our derivative code computes the term

$$ \frac{1}{T}\sum_{l=1}^{T} x^{(j)}_{il}\,\mu_{il} \tag{29} $$

for $x^{(j)}$ = x1, x2, and x3. The ‘derivatives’ macro contains three variable names, corresponding to the three parameters of the ‘at’ matrix. We extract those names into local macros ‘d1’, ‘d2’, and ‘d3’ and then fill in the variables those macros represent with the derivatives shown in (28).

With our program written, we fit our model by typing

. use http://www.stata-press.com/data/r13/poisson1
.
gmm gmm_poideriv, nequations(1) parameters(b1 b2 b3) > instruments(x1 x2 x3, noconstant) vce(cluster id) onestep hasderivatives Step 1 Iteration 0: GMM criterion Q(b) = 51.99142 Iteration 1: GMM criterion Q(b) = .04345191 Iteration 2: GMM criterion Q(b) = 8.720e-06 Iteration 3: GMM criterion Q(b) = 7.115e-13 Iteration 4: GMM criterion Q(b) = 5.129e-27 GMM estimation Number of parameters = 3 Number of moments = 3 Initial weight matrix: Unadjusted Number of obs = 409 (Std. Err. adjusted for 45 clusters in id) Coef. /b1 /b2 /b3 1.94866 -2.966119 1.008634 Robust Std. Err. .1000265 .0923592 .1156561 z 19.48 -32.12 8.72 P>|z| [95% Conf. Interval] 0.000 0.000 0.000 1.752612 -3.14714 .781952 2.144709 -2.785099 1.235315 Instruments for equation 1: x1 x2 x3 Our results are identical to those in example 9. Another way to verify that our program calculates derivatives correctly would be to type . gmm gmm_poideriv, nequations(1) parameters(b1 b2 b3) > instruments(x1 x2 x3, noconstant) vce(cluster id) onestep Without the hasderivatives or haslfderivatives option, gmm will not request derivatives from your program, even if it contains code to compute them. If you have trouble obtaining convergence with the hasderivatives or haslfderivatives option but do not have trouble without specifying one of them, then you need to recheck your derivatives. After example 9, we remarked that the evaluator program would have to be changed to accommodate different regressors. We then showed how you can specify parameters using the syntax : and then use matrix score to compute linear combinations of variables. To specify derivatives when you specify parameters using this equation-name syntax, ensure that your residual equations satisfy the “linear-form restriction” analogous to the restrictions of linear-form evaluators used by ml. See [R] ml and Gould, Pitblado, and Poi (2010) for more information about linear-form evaluators. A GMM residual equation satisfies the linear-form restriction if the equation can be written in terms of a single observation in the dataset and if the equation for observation i does not depend on any observations j 6= i. Cross-sectional models satisfy the linear-form restriction. Time-series models satisfy the linear-form restriction only when no lags or leads are used. Panel-data models often do not satisfy the linear-form restriction. For example, recall moment condition (13) for a panel Poisson model. That residual equation included panel-level mean terms ȳi and µ̄i , so the residual equation for an individual observation depends on all the observations in the same panel. When a residual equation does not satisfy the linear-form restriction, neither will its derivatives. To apply the chain rule, we need a way to multiply the eqname-level derivative by each of the variables 746 gmm — Generalized method of moments estimation in the equation to obtain parameter-level derivatives. In (28), for example, there is no way to factor (j) out each xit variable and obtain an eqname-level derivative that we then multiply by each of the (j) xit s. Suppose we do have a model with q = 2 moment equations, both of which do satisfy the linear-form restriction, and we specify the parameters() option like this: . gmm . . ., parameters(eq1:x1 eq1:x2 eq1:_cons eq2:_cons eq3:x1 eq3:x2 eq3:_cons) We have specified n = 3 eqnames in the parameters() option: eq1, eq2, and eq3. When we specify the haslfderivatives option, gmm will pass n × q = 3 × 2 = 6 variables in the derivatives() option. 
The first three variables are to be filled with ∂ u1i (β) ∂eq1 ∂ u1i (β) and ∂eq2 ∂ u1i (β) ∂eq3 where u1i (β) is the ith observation for the first moment equation. Then the second three variables are to be filled with ∂ u2i (β) ∂eq1 ∂ u2i (β) and ∂eq2 ∂ u2i (β) ∂eq3 where u2i (β) is the second moment equation. In this example, we filled in a total of six variables with derivatives. If we instead used the hasderivatives option, we would have filled k × q = 7 × 2 = 14 variables; moreover, if we wanted to change the number of variables in our model, we would have modified our evaluator program. Example 20: Specifying derivatives with linear-form residual equations In examples 7 and 8, we showed how to specify derivatives with an exponential regression model when using the interactive version of gmm. Here we show how to write a moment-evaluator program for the exponential regression model, including derivatives. The residual equation for observation i is ui = yi − exp(x0i β) where xi may include a constant term. The derivative with respect to the linear combination x0i β is ∂ui = − exp(x0i β) ∂x0i β To verify this residual equation satisfies the linear-form restriction, we see that for the j th element of β, we have ∂ui ∂ui = −xij exp(x0i β) = × xij ∂βj ∂x0i β so that given ∂ui /∂x0i β, gmm can apply the chain rule to obtain the derivatives with respect to the individual parameters. gmm — Generalized method of moments estimation Our moment-evaluator program is program gmm_poideriv2 version 13 syntax varlist if, at(name) [derivatives(varlist)] quietly { tempvar mu matrix score double ‘mu’ = ‘at’ ‘if’, eq(#1) replace ‘mu’ = exp(‘mu’) local depvar : coleq ‘at’ local depvar : word 1 of ‘depvar’ replace ‘varlist’ = ‘depvar’ - ‘mu’ ‘if’ // Did -gmm- request derivatives? if "‘derivatives’" == "" { exit // no, so we are done } // Calculate derivatives // The derivatives macro only has one variable // for this model. replace ‘derivatives’ = -1*‘mu’ ‘if’ } end To fit our model of doctor visits treating income as an endogenous regressor, we type . use http://www.stata-press.com/data/r13/docvisits . gmm gmm_poideriv2, nequations(1) > instruments(private chronic female age black hispanic) > parameters(docvis:private docvis:chronic > docvis:female docvis:income docvis:_cons) haslfderivatives Step 1 Iteration 0: GMM criterion Q(b) = 16.910173 Iteration 1: GMM criterion Q(b) = .82270871 Iteration 2: GMM criterion Q(b) = .21831995 Iteration 3: GMM criterion Q(b) = .12685934 Iteration 4: GMM criterion Q(b) = .12672369 Iteration 5: GMM criterion Q(b) = .12672365 Step 2 Iteration 0: GMM criterion Q(b) = .00234641 Iteration 1: GMM criterion Q(b) = .00215957 Iteration 2: GMM criterion Q(b) = .00215911 Iteration 3: GMM criterion Q(b) = .00215911 GMM estimation Number of parameters = 5 Number of moments = 7 Initial weight matrix: Unadjusted Number of obs GMM weight matrix: Robust Coef. private chronic female income _cons .535335 1.090126 .6636579 .0142855 -.5983477 Robust Std. Err. .159904 .0617659 .0959885 .0027162 .138433 z 3.35 17.65 6.91 5.26 -4.32 P>|z| 0.001 0.000 0.000 0.000 0.000 = 4412 [95% Conf. Interval] .221929 .9690668 .475524 .0089618 -.8696714 .848741 1.211185 .8517918 .0196092 -.327024 Instruments for equation 1: private chronic female age black hispanic _cons Our results match those shown in example 8. 
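As with parameter-level derivatives, a quick check of the linear-form derivatives is to refit the model without the haslfderivatives option so that gmm computes the derivatives numerically and then confirm that the estimates do not change. This is the same diagnostic described earlier for hasderivatives; only the option is dropped from the command just shown:

. gmm gmm_poideriv2, nequations(1)
>     instruments(private chronic female age black hispanic)
>     parameters(docvis:private docvis:chronic
>     docvis:female docvis:income docvis:_cons)
(output omitted)

If convergence is easy without the option but difficult with it, that is a signal to recheck the derivative code.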
747 748 gmm — Generalized method of moments estimation We can change the variables in our model just by changing the parameters() and instruments() options; we do not need to make any changes to the moment-evaluator program, because we used linear-form derivatives. Depending on your model, allowing your moment-evaluator program to accept weights may be as easy as modifying the syntax command to allow them, or it may require significantly more work. If your program uses only commands like generate and replace, then just modifying the syntax command is all you need to do; gmm takes care of applying the weights to the observation-level residuals when computing the sample moments, derivatives, and weight matrices. On the other hand, if your moment-evaluator program computes residuals using statistics that depend on multiple observations, then you must apply the weights passed to your program when computing those statistics. In our examples of panel Poisson with strictly exogenous regressors (9 and 18), we used the statistics µi and y i when computing the residuals. If we are to allow weights with our momentevaluator program, then we must incorporate those weights when computing µi and y i . Moreover, looking at the derivative in (28), the term highlighted in (29) is in fact a sample mean, so we must incorporate weights when computing it. Example 21: Panel Poisson with derivatives and weights Here we modify the program in example 19. to accept frequency weights. One complication immediately arises: we had been using egen to compute µi and y i . egen does not accept weights, so we must compute µi and y i ourselves, incorporating any weights the user may specify. Our program is gmm — Generalized method of moments estimation 749 program gmm_poiderivfw version 13 syntax varlist if [fweight/], at(name) [derivatives(varlist)] quietly { if "‘exp’" == "" { // no weights local exp 1 // weight each observation equally } // Calculate residuals as before tempvar mu mubar ybar sumwt gen double ‘mu’ = exp(x1*‘at’[1,1] + x2*‘at’[1,2] /// + x3*‘at’[1,3]) ‘if’ bysort id: gen double ‘sumwt’ = sum(‘exp’) by id: gen double ‘mubar’ = sum(‘mu’*‘exp’) by id: gen double ‘ybar’ = sum(y*‘exp’) by id: replace ‘mubar’ = ‘mubar’[_N] / ‘sumwt’[_N] by id: replace ‘ybar’ = ‘ybar’[_N] / ‘sumwt’[_N] replace ‘varlist’ = y - ‘mu’*‘ybar’/‘mubar’ ‘if’ // Did -gmm- request derivatives? if "‘derivatives’" == "" { exit // no, so we are done } // Calculate derivatives // We need the panel means of x1*mu, x2*mu, and x3*mu tempvar work x1mubar x2mubar x3mubar generate double ‘work’ = x1*‘mu’ ‘if’ by id: generate double ‘x1mubar’ = sum(‘work’*‘exp’) by id: replace ‘x1mubar’ = ‘x1mubar’[_N] / ‘sumwt’[_N] replace ‘work’ = x2*‘mu’ ‘if’ by id: generate double ‘x2mubar’ = sum(‘work’*‘exp’) by id: replace ‘x2mubar’ = ‘x2mubar’[_N] / ‘sumwt’[_N] replace ‘work’ = x3*‘mu’ ‘if’ by id: generate double ‘x3mubar’ = sum(‘work’*‘exp’) by id: replace ‘x3mubar’ = ‘x3mubar’[_N] / ‘sumwt’[_N] local d1: word 1 of ‘derivatives’ local d2: word 2 of ‘derivatives’ local d3: word 3 of ‘derivatives’ replace ‘d1’ = -1*‘mu’*‘ybar’/‘mubar’^2*(x1*‘mubar’ - ‘x1mubar’) replace ‘d2’ = -1*‘mu’*‘ybar’/‘mubar’^2*(x2*‘mubar’ - ‘x2mubar’) replace ‘d3’ = -1*‘mu’*‘ybar’/‘mubar’^2*(x3*‘mubar’ - ‘x3mubar’) } end Our syntax command now indicates that fweights are allowed. The first part of our code looks at the macro ‘exp’. If it is empty, then the user did not specify weights in their call to gmm; and we set the macro equal to 1, so that we weight each observation equally. 
After we compute µit , we calculate µi and y i , taking into account weights. To compute frequency-weighted means for each panel, we just multiply each observation by its respective weight, sum over all observations in the panel, then divide by the sum of the weights for the panel. (See [U] 20.23 Weighted estimation for information on how to handle aweights and pweights.) We use the same procedure to compute the frequency-weighted variant of expression (29) in the derivative calculations. 750 gmm — Generalized method of moments estimation To use our program, we type . use http://www.stata-press.com/data/r13/poissonwts . gmm gmm_poiderivfw [fw=fwt], nequations(1) parameters(b1 b2 b3) > instruments(x1 x2 x3, noconstant) vce(cluster id) onestep hasderivatives (sum of wgt is 819) Step 1 Iteration 0: GMM criterion Q(b) = 49.8292 Iteration 1: GMM criterion Q(b) = .11136736 Iteration 2: GMM criterion Q(b) = .00008519 Iteration 3: GMM criterion Q(b) = 7.110e-11 Iteration 4: GMM criterion Q(b) = 5.596e-23 GMM estimation Number of parameters = 3 Number of moments = 3 Initial weight matrix: Unadjusted Number of obs = 819 (Std. Err. adjusted for 45 clusters in id) Coef. /b1 /b2 /b3 1.967766 -3.060838 1.037594 Robust Std. Err. .111795 .0935561 .1184227 z 17.60 -32.72 8.76 P>|z| 0.000 0.000 0.000 [95% Conf. Interval] 1.748652 -3.244205 .80549 2.186881 -2.877472 1.269698 Instruments for equation 1: x1 x2 x3 Testing whether our program works correctly with frequency weights is easy. A frequency-weighted dataset is just a compact form of a larger dataset in which identical observations are omitted and a frequency-weight variable is included to tell us how many times each observation in the smaller dataset appears in the larger dataset. Therefore, we can expand our smaller dataset by the frequency-weight variable and then refit our model without specifying frequency weights. If we obtain the same results, our program works correctly. When we type . expand fw . gmm gmm_poiderivfw, nequations(1) parameters(b1 b2 b3) > instruments(x1 x2 x3, noconstant) vce(cluster id) onestep we obtain the same results as before. 
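One way to make that frequency-weight check concrete is to store both sets of results and list them side by side. The sketch below assumes the frequency-weight variable is named fwt, as in the weighted call above; the stored-results names weighted and expanded are arbitrary labels, not part of the original example.

. use http://www.stata-press.com/data/r13/poissonwts, clear
. gmm gmm_poiderivfw [fw=fwt], nequations(1) parameters(b1 b2 b3)
>     instruments(x1 x2 x3, noconstant) vce(cluster id) onestep hasderivatives
. estimates store weighted
. expand fwt
. gmm gmm_poiderivfw, nequations(1) parameters(b1 b2 b3)
>     instruments(x1 x2 x3, noconstant) vce(cluster id) onestep
. estimates store expanded
. estimates table weighted expanded, b(%10.6f) se

Identical point estimates in the two columns indicate that the weights are being applied correctly inside the evaluator.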
gmm — Generalized method of moments estimation Stored results gmm stores the following in e(): Scalars e(N) e(k) e(k eq) e(k eq model) e(k aux) e(n moments) e(n eq) e(Q) e(J) e(J df) e(k i) e(has xtinst) e(N clust) e(type) e(rank) e(ic) e(converged) Macros e(cmd) e(cmdline) e(title) e(title 2) e(clustvar) e(inst i) e(eqnames) e(winit) e(winitname) e(estimator) e(rhs) e(params i) e(wmatrix) e(vce) e(vcetype) e(params) e(sexp i) e(evalprog) e(evalopts) e(nocommonesample) e(technique) e(properties) e(estat cmd) e(predict) e(marginsnotok) Matrices e(b) e(init) e(Wuser) e(W) e(S) e(N byequation) e(V) e(V modelbased) Functions e(sample) number of observations number of parameters number of equations in e(b) number of equations in overall model test number of auxiliary parameters number of moments number of equations in moment-evaluator program criterion function Hansen J χ2 statistic J statistic degrees of freedom number of parameters in equation i 1 if panel-style instruments specified, 0 otherwise number of clusters 1 if interactive version, 2 if moment-evaluator program version rank of e(V) number of iterations used by iterative GMM estimator 1 if converged, 0 otherwise gmm command as typed title specified in title() title specified in title2() name of cluster variable equation i instruments equation names initial weight matrix used name of user-supplied initial weight matrix onestep, twostep, or igmm variables specified in variables() equation i parameters wmtype specified in wmatrix() vcetype specified in vce() title used to label Std. Err. parameter names substitutable expression for equation i moment-evaluator program options passed to moment-evaluator program nocommonesample, if specified optimization technique b V program used to implement estat program used to implement predict predictions disallowed by margins coefficient vector initial values of the estimators user-supplied initial weight matrix weight matrix used for final round of estimation moment covariance matrix used in robust VCE computations number of observations per equation, if nocommonesample specified variance–covariance matrix model-based variance marks estimation sample 751 752 gmm — Generalized method of moments estimation Methods and formulas Let q denote the number of moment equations. For observation i, i = 1, . . . , N , write the j th moment equation as zij uij (βj ) for j = 1, . . . , q . zij is a 1 × mj vector, where mj is the number of instruments specified for equation j . Let m = m1 + · · · + mq . Our notation can incorporate moment conditions of the form hij (wij ; βj ) with instruments wij by defining zij = 1 and uij (βj ) = hij (wij ; βj ), so except when necessary we do not distinguish between the two types of moment conditions. We could instead use notation so that all our moment conditions are of the form hij (wij ; βj ), or we could adopt notation that explicitly combines both forms of moment equations. However, because moment conditions of the form z0ij uij (βj ) are arguably more common, we use that notation. Let β denote a k × 1 vector of parameters, consisting of all the unique parameters of β1 , . . . , βq . Then we can stack the moment conditions and write them more compactly as Z0i ui (β), where zi1 0 Zi = ... 0 zi2 .. . 0 0 ··· ··· .. . 0 0 .. . u (β ) i1 1 ui2 (β2 ) ui (β) = .. . and · · · ziq uiq (βj ) b is the value of β that minimizes The GMM estimator β ( Q(β) = N −1 N X )0 Z0i ui (β) ( W N −1 i=1 N X ) Z0i ui (β) (A1) i=1 for q × q weight matrix W. 
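In matrix form, the stacked instruments, residuals, and criterion (A1) just defined can be written compactly as follows; this restates the definitions above in LaTeX notation and introduces nothing new.

$$
\mathbf{Z}_i =
\begin{pmatrix}
\mathbf{z}_{i1} & \mathbf{0} & \cdots & \mathbf{0}\\
\mathbf{0} & \mathbf{z}_{i2} & \cdots & \mathbf{0}\\
\vdots & \vdots & \ddots & \vdots\\
\mathbf{0} & \mathbf{0} & \cdots & \mathbf{z}_{iq}
\end{pmatrix}
\qquad
\mathbf{u}_i(\boldsymbol\beta) =
\begin{pmatrix}
u_{i1}(\boldsymbol\beta_1)\\
u_{i2}(\boldsymbol\beta_2)\\
\vdots\\
u_{iq}(\boldsymbol\beta_q)
\end{pmatrix}
$$

$$
Q(\boldsymbol\beta) =
\left\{ N^{-1}\sum_{i=1}^{N}\mathbf{Z}_i'\,\mathbf{u}_i(\boldsymbol\beta) \right\}'
\mathbf{W}
\left\{ N^{-1}\sum_{i=1}^{N}\mathbf{Z}_i'\,\mathbf{u}_i(\boldsymbol\beta) \right\}
\tag{A1}
$$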
By default, gmm minimizes (A1) using the Gauss–Newton method. See Hayashi (2000, 498) for a derivation. This technique is typically faster than quasi-Newton methods and does not require second-order derivatives. Methods and formulas are presented under the following headings: Initial weight matrix Weight matrix Variance–covariance matrix Hansen’s J statistic Panel-style instruments Initial weight matrix If you specify winitial(identity), then we set W = Iq . If you specify winitial(unadjusted), then we create matrix Λ with typical submatrix Λrs = N −1 N X z0ir zis i=1 for r = 1, . . . , q and s = 1, . . . , q . If you include the independent suboption, then we set Λrs = 0 for r = 6 s. The weight matrix W equals Λ−1 . If you specify winitial(matname), then we set W equal to Stata matrix matname. gmm — Generalized method of moments estimation 753 If you specify winitial(xt xtspec), then you must specify one or two items in xtspec, one for each equation. gmm allows you to specify at most two moment equations when you specify winitial(xt xtspec), one in first-differences and one in levels. We create the block-diagonal matrix H with typical block Hj . If the j th element of xtspec is “L”, then Hj is the identity matrix of suitable dimension. If the j th element of xtspec is “D”, then 1 −0.5 0 ... 0 0 −0.5 1 −0.5 . . . 0 0 . .. .. .. .. . . . Hj = . . . . . . 0 0 0 ... 1 −0.5 0 0 0 . . . −0.5 1 Then ΛH = Ng−1 g=N XG Z0g HZg g=1 where g indexes panels in the dataset, NG is the number of panels, Zg is the full instrument matrix for panel g , and W = Λ−1 H . See Panel-style instruments below for a discussion of how Zg is formed. Weight matrix Specification of the weight matrix applies only to the two-step and iterative estimators. When you use the onestep option, the wmatrix() option is ignored. b ). We first evaluate (A1) using the initial weight matrix described above and then compute ui (β In all cases, W = Λ−1 . If you specify wmatrix(unadjusted), then we create Λ to have typical submatrix N X Λrs = σrs N −1 z0ir zis i=1 where σrs = N −1 N X b )uis (β b) uir (β i=1 and r and s index moment equations. For all types of weight matrices, if the independent suboption is specified, then Λrs = 0 for r 6= s, where Λrs measures the covariance between moment conditions for equations r and s. If you specify wmatrix(robust), then Λ = N −1 N X b )u0i (β b )Z0i Zi ui (β i=1 If you specify wmatrix(cluster clustvar), then Λ = N −1 c=N XC c=1 qc q0c 754 gmm — Generalized method of moments estimation where c indexes clusters, NC is the number of clusters, and X b) qc = Zi ui (β i∈cj If you specify wmatrix(hac kernel # ), then Λ =N −1 N X b )ui (β b )0 Z0i Zi ui (β + i=1 N −1 l=n−1 X N X l=1 i=l+1 n o b )u0i−l (β b )Z0i−l + Zi−l ui−l (β b )u0i (β b )Z0i K(l, m) Zi ui (β where m = # if # is specified and m = N − 2 otherwise. Define z = l/(m + 1). If kernel is bartlett or nwest, then n 1−z 0≤z ≤1 K(l, m) = 0 otherwise If kernel is parzen or gallant, then ( 1 − 6z 2 + 6z 3 0 ≤ z ≤ 0.5 K(l, m) = 2(1 − z)3 0.5 < z ≤ 1 0 otherwise If kernel is quadraticspectral or andrews, then 1 K(l, m) = 3{sin(θ)/θ − cos(θ)}/θ2 z=0 otherwise where θ = 6πz/5. If wmatrix(hac kernel opt) is specified, then gmm uses Newey and West’s (1994) automatic lag-selection algorithm, which proceeds as follows. Define h to be an m × 1 vector of ones. Note that this definition of h is slightly different than the one used by ivregress. 
There, the element of h corresponding to the constant term equals zero, effectively ignoring the effect of the constant in determining the optimal lag length. Here we include the effect of the constant term. Now define 0 fi = {Z0i ui (β)} h N X σ bj = N −1 fi fi−j j = 0, . . . , m∗ i=j+1 sb(q) = 2 ∗ j=m X σ bj j q j=1 (0) sb =σ b0 + 2 ∗ j=m X σ bj j=1 ( γ b = cγ sb(q) sb(0) 2 )1/(2q+1) m=γ bN 1/(2q+1) gmm — Generalized method of moments estimation 755 where q , m∗ , and cγ depend on the kernel specified: Kernel q Bartlett/Newey–West Parzen/Gallant Quadratic spectral/Andres 1 2 2 m∗ int 20(T /100)2/9 int 20(T /100)4/25 int 20(T /100)2/25 cγ 1.1447 2.6614 1.3221 where int(x) denotes the integer obtained by truncating x toward zero. For the Bartlett and Parzen kernels, the optimal lag is min{int(m), m∗ }. For the quadratic spectral kernel, the optimal lag is min{m, m∗ }. Variance–covariance matrix If you specify vce(unadjusted), then the VCE matrix is computed as o−1 n b )0 WG(β b) b ) = N −1 G(β Var(β (A2) where b ) = N −1 G(β N X i=1 Z0i ∂ui (β) ∂ β0 β=β b (A3) For the two-step and iterated estimators, we use the weight matrix W that was used to compute the b final-round estimate β. When you do not specify analytic derivatives, gmm must compute the Jacobian matrix (A3) numerically. By default, gmm computes each element of the matrix individually by using the Mata deriv() function; see [M-5] deriv( ). This procedure results in accurate derivatives but can be slow if your model has many instruments or parameters. When you specify the quickderivatives option, gmm computes all derivatives corresponding to parameter βj , j = 1, . . . , q , at once, using two-sided derivatives with a step size of |βj |1/3 , where is the machine precision of a double precision number (approximately 2.22045 × 10−16 ). This method requires just two evaluations of the model’s moments to compute an entire column of (A3) and therefore has the most impact when you specify many instruments or moment equations so that (A3) has many rows. For the one-step estimator, how the unadjusted VCE is computed depends on the type of initial weight matrix requested and the form of the moment equations. If you specify two or more moment equations of the form hij (wij ; βj ), then gmm issues a warning message and computes a heteroskedasticityrobust VCE because here the matrix Z0 Z is necessarily singular; moreover, here you must use the identity matrix as the initial weight matrix. Otherwise, if you specify winitial(identity) or b before winitial(unadjusted), then gmm first computes an unadjusted weight matrix based on β evaluating (A2). If you specify winitial(matname), then (A2) is evaluated based on matname; the user is responsible for verifying that the VCE and other statistics so produced are appropriate. All types of robust VCEs computed by gmm take the form n o−1 n o−1 b ) = N −1 G(β b )0 WG(β b) b )0 WSWG(β b ) G(β b )0 WG(β b) Var(β G(β 756 gmm — Generalized method of moments estimation For the one-step estimator, W represents the initial weight matrix requested using the winitial() option, and S is computed based on the specification of the vce() option. The formulas for the S matrix are identical to the ones that define the Λ matrix in Weight matrix above, except that S is b For the two-step computed after the moment equations are reevaluated using the final estimate of β. and iterated GMM estimators, computation of W is controlled by the wmatrix() option based on b the penultimate estimate of β. 
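For reference, the unadjusted VCE in (A2), the Jacobian in (A3), and the robust sandwich form described in this section can be written in LaTeX notation as follows; again, this only restates the formulas above.

$$
\widehat{\mathrm{Var}}(\widehat{\boldsymbol\beta}) =
N^{-1}\left\{\mathbf{G}(\widehat{\boldsymbol\beta})'\,\mathbf{W}\,\mathbf{G}(\widehat{\boldsymbol\beta})\right\}^{-1}
\tag{A2}
$$

$$
\mathbf{G}(\widehat{\boldsymbol\beta}) =
N^{-1}\sum_{i=1}^{N}\mathbf{Z}_i'\,
\left.\frac{\partial\,\mathbf{u}_i(\boldsymbol\beta)}{\partial\boldsymbol\beta'}\right|_{\boldsymbol\beta=\widehat{\boldsymbol\beta}}
\tag{A3}
$$

$$
\widehat{\mathrm{Var}}(\widehat{\boldsymbol\beta}) =
N^{-1}\left(\mathbf{G}'\mathbf{W}\mathbf{G}\right)^{-1}
\mathbf{G}'\mathbf{W}\,\mathbf{S}\,\mathbf{W}\mathbf{G}
\left(\mathbf{G}'\mathbf{W}\mathbf{G}\right)^{-1},
\qquad \mathbf{G}\equiv\mathbf{G}(\widehat{\boldsymbol\beta})
$$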
For details on computation of the VCE matrix with dynamic panel-data models, see Panel-style instruments below. Hansen’s J statistic b ). J ∼ χ2 (m − k). If Hansen’s (1982) J test of overidentifying restrictions is J = N × Q(β m < k , gmm issues an error message without estimating the parameters. If m = k , the model is just-identified and J is saved as missing (“.”). For the two-step and iterated GMM estimators, the J statistic is based on the last-computed weight matrix as determined by the wmatrix() option. For the one-step estimator, gmm recomputes a weight matrix as described in the second paragraph of Variance–covariance matrix above. To obtain Hansen’s J statistic, you use estat overid; see [R] gmm postestimation. Panel-style instruments Here we discuss several issues that arise only when you specify panel-style instruments by using the xtinstruments() option. When you specify the xtinstruments() option, we can no longer consider the instruments for one observation in isolation; instead, we must consider the instrument matrix for an entire panel at once. In the following discussion, we let T denote the number of time periods in a panel. To accommodate unbalanced datasets, conceptually we simply use zeros as instruments and residuals for time periods that are missing in a panel. We consider the case where you specify both an equation in levels and an equation in differences, yielding two residual equations. Let uL pt (β) denote the residual for the level equation for panel p in D period t, and let upt (β) denote the residual for the corresponding difference equation. Now define the 2T × 1 vector up (β) as L L D D D up (β) = [uL p1 (β), up2 (β), . . . , upT (β), up2 (β), up3 (β), . . . , upT (β)] The T + 1 element of up is uD p2 (β) since we lose the first observation of the difference equation because of differencing. We write the moment conditions for the pth panel as Zp up (β). To see how Zp is defined, let L D wpt and wpt denote the vectors of panel-style instruments for the level and difference equations, respectively, and let time be denoted by t; we discuss their dimensions momentarily. Also let xL pt and xD pt denote the vectors of instruments specified in instruments() for the level and difference equations at time t. Without loss of generality, for our discussion we assume that you specify the level equation first. Then Zp has the form gmm — Generalized method of moments estimation wL 1 0 . . . 0 L x Zp = 1 0 0 . . . 0 0 0 w2L .. . ··· ··· .. . 0 xL 2 0 0 .. . · · · wTL · · · xL T ··· 0 ··· 0 .. .. . . ··· 0 ··· 0 0 0 0 0 .. . 0 0 .. . 0 0 .. . ··· ··· .. . 0 0 w1D 0 .. . 0 0 0 w2D .. . ··· ··· ··· ··· .. . 0 xD 1 0 xD 2 ··· ··· 0 0 .. . 757 0 0 0 0 .. . wTD xD T (A4) To see how the w vectors are formed, suppose you specify xtinstruments(eq(1): d, lags( a/ b)) Then wtL will be a (b − a + 1) × 1 vector consisting of dt−a , . . . , dt−b . If (t − a) ≤ 0, then instead we set wtL = 0. If (t − a) > 0 but (t − b) ≤ 0, then we create wtL to consist of dt−a , . . . , d1 . With this definition, (b − a + 1) defines the maximum number of lags of d used, but gmm will proceed with fewer lags if all (b − a + 1) lags are not available. If you specify two panel-style instruments, d and e, say, then wtL will consist of dt−a , . . . , dt−b , et−a , . . . , et−b . wtD is handled analogously. The xL t vectors are simply j × 1 vectors, where j is the number of regular instruments specified with the instruments() option; these vectors include a “1” unless you specify the noconstant suboption. 
Looking carefully at (A4), you will notice that for dynamic panel-data models, moment conditions corresponding to the instruments xL pt take the form E "t=T X # L xL pt upt (β) =0 t=1 and likewise for xD pt . Instead of having separate moment conditions for each time period, there is one moment condition equal to the average of individual periods’ moments. See Arellano and Bond (1991, 280). To include separate moment conditions for each time period, instead of specifying, say, instruments(1: x) you could instead first generate a variable called one equal to unity for all observations and specify xtinstruments(1: x one) (Creating the variable one is necessary because a constant is not automatically included in variable lists specified in xtinstruments().) Unbalanced panels are essentially handled by including zeros rows and columns of Zp and up (β) corresponding to missing time periods. However, the numbers of instruments and moment conditions reported by gmm do not reflect this trickery and instead reflect the numbers of instruments and moment conditions that are not manipulated in this way. Moreover, gmm includes code to work through these situations efficiently without actually having to fill in zeros. When you specify winitial(xt . . .), the one-step unadjusted VCE is computed as b) = σ Var(β b12 ΛH 758 gmm — Generalized method of moments estimation where ΛH was defined previously, σ b12 = (N − k)−1 p=P X b uD p (β) p=1 D b D b −1 b and uD instead of N −1 to match xtdpd. p (β) = [up2 (β), . . . , upT (β)]. Here we use (N − k) References Arellano, M., and S. Bond. 1991. Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. Review of Economic Studies 58: 277–297. Baum, C. F. 2006. An Introduction to Modern Econometrics Using Stata. College Station, TX: Stata Press. Blundell, R., R. Griffith, and F. Windmeijer. 2002. Individual effects and dynamics in count data models. Journal of Econometrics 108: 113–131. Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. New York: Cambridge University Press. . 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press. . 2013. Regression Analysis of Count Data. 2nd ed. New York: Cambridge University Press. Chamberlain, G. 1992. Comment: Sequential moment restrictions in panel data. Journal of Business and Economic Statistics 10: 20–26. Davidson, R., and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press. . 2004. Econometric Theory and Methods. New York: Oxford University Press. Doris, A., D. O’Neill, and O. Sweetman. 2011. GMM estimation of the covariance structure of longitudinal data on earnings. Stata Journal 11: 439–459. Flynn, Z. L., and L. M. Magnusson. 2013. Parametric inference using structural break tests. Stata Journal 13: 836–861. Gould, W. W., J. S. Pitblado, and B. P. Poi. 2010. Maximum Likelihood Estimation with Stata. 4th ed. College Station, TX: Stata Press. Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. Hall, A. R. 2005. Generalized Method of Moments. Oxford: Oxford University Press. Hamilton, J. D. 1994. Time Series Analysis. Princeton: Princeton University Press. Hansen, L. P. 1982. Large sample properties of generalized method of moments estimators. Econometrica 50: 1029–1054. Hansen, L. P., and K. J. Singleton. 1982. Generalized instrumental variables estimation of nonlinear rational expectations models. Econometrica 50: 1269–1286. 
Hayashi, F. 2000. Econometrics. Princeton, NJ: Princeton University Press. Manski, C. F. 1988. Analog Estimation Methods in Econometrics. New York: Chapman & Hall/CRC. Mátyás, L. 1999. Generalized Method of Moments Estimation. Cambridge: Cambridge University Press. Mullahy, J. 1997. Instrumental-variable estimation of count data models: Applications to models of cigarette smoking behavior. Review of Economics and Statistics 79: 586–593. Newey, W. K., and K. D. West. 1994. Automatic lag selection in covariance matrix estimation. Review of Economic Studies 61: 631–653. Nickell, S. J. 1981. Biases in dynamic models with fixed effects. Econometrica 49: 1417–1426. Ruud, P. A. 2000. An Introduction to Classical Econometric Theory. New York: Oxford University Press. Wilde, J. 2008. A note on GMM estimation of probit models with endogenous regressors. Statistical Papers 49: 471–484. gmm — Generalized method of moments estimation 759 Windmeijer, F. 2000. Moment conditions for fixed effects count data models with endogenous regressors. Economics Letters 68: 21–24. . 2005. A finite sample correction for the variance of linear efficient two-step GMM estimators. Journal of Econometrics 126: 25–51. Windmeijer, F., and J. M. C. Santos Silva. 1997. Endogeneity in count data models: An application to demand for health care. Journal of Applied Econometrics 12: 281–294. Wooldridge, J. M. 1997. Multiplicative panel data models without the strict exogeneity assumption. Econometric Theory 13: 667–678. . 1999. Distribution-free estimation of some nonlinear panel data models. Journal of Econometrics 90: 77–97. . 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press. Also see [R] gmm postestimation — Postestimation tools for gmm [R] ivregress — Single-equation instrumental-variables regression [R] ml — Maximum likelihood estimation [R] mlexp — Maximum likelihood estimation of user-specified expressions [R] nl — Nonlinear least-squares estimation [R] nlsur — Estimation of nonlinear systems of equations [XT] xtabond — Arellano–Bond linear dynamic panel-data estimation [XT] xtdpd — Linear dynamic panel-data estimation [XT] xtdpdsys — Arellano–Bover/Blundell–Bond linear dynamic panel-data estimation [U] 20 Estimation and postestimation commands Title gmm postestimation — Postestimation tools for gmm Description Syntax for estat overid Reference Syntax for predict Menu for estat Also see Menu for predict Remarks and examples Option for predict Stored results Description The following postestimation command is of special interest after gmm: Command Description estat overid perform test of overidentifying restrictions The following standard postestimation commands are also available: Command Description estat vce estimates lincom variance–covariance matrix of the estimators (VCE) cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients residuals point estimates, standard errors, testing, and inference for generalized predictions Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses nlcom predict predictnl test testnl Special-interest postestimation command estat overid reports Hansen’s J statistic, which is used to determine the validity of the overidentifying restrictions in a GMM model. 
If the model is correctly specified in the sense that E{z_i u_i(β)} = 0, then the sample analog to that condition should hold at the estimated value of β. Hansen's J statistic is valid only if the weight matrix is optimal, meaning that it equals the inverse of the covariance matrix of the moment conditions. Therefore, estat overid only reports Hansen's J statistic after two-step or iterated estimation, or if you specified winitial(matname) when calling gmm. In the latter case, it is your responsibility to determine the validity of the J statistic.

Syntax for predict

        predict [type] newvar [if] [in] [, equation(#eqno | eqname)]

        predict [type] {stub* | newvar1 ... newvarq} [if] [in]

Residuals are available both in and out of sample; type predict . . . if e(sample) . . . if wanted only for the estimation sample.

You specify one new variable and (optionally) equation(), or you specify stub* or q new variables, where q is the number of moment equations.

Menu for predict

Statistics > Postestimation > Predictions, residuals, etc.

Option for predict

Main

equation(#eqno | eqname) specifies the equation for which residuals are desired. Specifying equation(#1) indicates that the calculation is to be made for the first moment equation. Specifying equation(demand) would indicate that the calculation is to be made for the moment equation named demand, assuming there is an equation named demand in the model.

If you specify one new variable name and omit equation(), results are the same as if you had specified equation(#1).

For more information on using predict after multiple-equation estimation commands, see [R] predict.

Syntax for estat overid

        estat overid

Menu for estat

Statistics > Postestimation > Reports and statistics

Remarks and examples

As we noted in Introduction of [R] gmm, underlying generalized method of moments (GMM) estimators is a set of l moment conditions, E{z_i u_i(β)} = 0. When l is greater than the number of parameters, k, any size-k subset of the moment conditions would yield a consistent parameter estimate. We remarked that the parameter estimates we would obtain would in general depend on which k moment conditions we used. However, if all our moment conditions are indeed valid, then the parameter estimates should not differ too much regardless of which k moment conditions we used to estimate the parameters. The test of overidentifying restrictions is a model specification test based on this observation.

The test of overidentifying restrictions requires that the number of moment conditions be greater than the number of parameters in the model. Recall that the GMM criterion function is

$$ Q = \left\{ \frac{1}{N}\sum_i \mathbf{z}_i\, u_i(\boldsymbol\beta) \right\}' \mathbf{W} \left\{ \frac{1}{N}\sum_i \mathbf{z}_i\, u_i(\boldsymbol\beta) \right\} $$

The test of overidentifying restrictions is remarkably simple. If W is an optimal weight matrix, under the null hypothesis H0: E{z_i u_i(β)} = 0, the test statistic J = N × Q ∼ χ²(l − k). A large test statistic casts doubt on the null hypothesis.

For the test to be valid, W must be optimal, meaning that W must be the inverse of the covariance matrix of the moment conditions:

$$ \mathbf{W}^{-1} = E\{\mathbf{z}_i\, u_i(\boldsymbol\beta)\, u_i'(\boldsymbol\beta)\,\mathbf{z}_i'\} $$

Therefore, estat overid works only after the two-step and iterated estimators, or if you supplied your own initial weight matrix by using the winitial(matname) option to gmm and used the one-step estimator.

Often the overidentifying restrictions test is interpreted as a test of the validity of the instruments z.
However, other forms of model misspecification can sometimes lead to a significant test statistic. See Hall (2005, sec. 5.1) for a discussion of the overidentifying restrictions test and its behavior in correctly and misspecified models. Example 1 In example 6 of [R] gmm, we fit an exponential regression model of the number of doctor visits based on the person’s gender, income, possession of private health insurance, and presence of a chronic disease. We argued that the variable income may be endogenous; we used the person’s age and race as additional instrumental variables. Here we refit the model and test the specification of the model. We type . use http://www.stata-press.com/data/r13/docvisits . gmm (docvis - exp({xb:private chronic female income} + {b0})), > instruments(private chronic female age black hispanic) (output omitted ) . estat overid Test of overidentifying restriction: Hansen’s J chi2(2) = 9.52598 (p = 0.0085) The J statistic is significant even at the 1% significance level, so we conclude that our model is misspecified. One possibility is that age and race directly affect the number of doctor visits, so we are not justified in excluding them from the model. A simple technique to explore whether any of the instruments is invalid is to examine the statistics ( ) N 1 X 1/2 b rj = Wjj zij ui (β) N i=1 b ) denotes the sample for j = 1, . . . , k , where Wjj denotes the j th diagonal element of W, ui (β residuals, and k is the number of instruments. If all the instruments are valid, then the scaled sample moments should at least be on the same order of magnitude. If one (or more) instrument’s rj is large in absolute value relative to the others, then that could be an indication that instrument is not valid. In Stata, we type . . . . . predict double r if e(sample) // obtain residual from the model matrix W = e(W) // retrieve weight matrix local i 1 // loop over each instrument and compute r_j foreach var of varlist private chronic female age black hispanic { 2. generate double r‘var’ = r*‘var’*sqrt(W[‘i’, ‘i’]) 3. local ‘++i’ 4. } gmm postestimation — Postestimation tools for gmm . summarize r* Variable Obs Mean r rprivate rchronic rfemale rage 4412 4412 4412 4412 4412 .0344373 .007988 .0026947 .0028168 .0360978 rblack rhispanic 4412 4412 -.0379317 -.017435 Std. Dev. Min Max 8.26176 3.824118 2.0707 1.566397 4.752986 -151.1847 -72.66254 -43.7311 -12.7388 -89.74112 113.059 54.33852 32.703 24.43621 55.58143 1.062027 1.08567 -24.39747 -5.509386 27.34512 31.53512 763 We notice that the rj statistics for age, black, and hispanic are larger than those for the other instruments in our model, supporting our suspicion that age and race may have a direct impact on the number of doctor visits. Stored results estat overid stores the following in r(): Scalars r(J) r(J df) r(J p) Hansen’s J statistic J statistic degrees of freedom J statistic p-value Reference Hall, A. R. 2005. Generalized Method of Moments. Oxford: Oxford University Press. Also see [R] gmm — Generalized method of moments estimation [U] 20 Estimation and postestimation commands Title grmeanby — Graph means and medians by categorical variables Syntax Remarks and examples Menu References Description Options Syntax grmeanby varlist if in weight , summarize(varname) , options Description options Main ∗ summarize(varname) median graph mean (or median) of varname graph medians; default is to graph means Plot change the look of the lines change look of markers (color, size, etc.) 
add marker labels; change look or position cline options marker options marker label options Y axis, X axis, Titles, Legend, Overall any options other than by() documented in [G-3] twoway options twoway options ∗ summarize(varname) is required. aweights and fweights are allowed; see [U] 11.1.6 weight. Menu Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Graph means/medians by groups Description grmeanby graphs the (optionally weighted) means or medians of varname according to the values of the variables in varlist. The variables in varlist may be string or numeric and, if numeric, may be labeled. Options Main summarize(varname) is required; it specifies the name of the variable whose mean or median is to be graphed. median specifies that the graph is to be of medians, not means. 764 grmeanby — Graph means and medians by categorical variables 765 Plot cline options affect the rendition of the lines through the markers, including their color, pattern, and width; see [G-3] cline options. marker options affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G-3] marker options. marker label options specify if and how the markers are to be labeled; see [G-3] marker label options. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Remarks and examples The idea of graphing means of categorical variables was shown in Chambers and Hastie (1992, 3). Because this was shown in the context of an S function for making such graphs, it doubtless has roots going back further than that. grmeanby is, in any case, another implementation of what we will assume is their idea. Example 1 Using a variation of our auto dataset, we graph the mean of mpg by foreign, rep77, rep78, and make: . use http://www.stata-press.com/data/r13/auto1 (Automobile Models) . grmeanby foreign rep77 rep78 make, sum(mpg) 35 Means of mpg, Mileage (mpg) 30 Subaru Mazda 25 Foreign 20 Exc Domestic Exc Good Average Poor Fair Good Poor Average Fair VW Plym. Honda Renault Datsun BMW Ford Toyota Chev. Fiat AMC Dodge Audi Pont. Olds Buick 15 Merc. Volvo Cad. 10 Peugeot Linc. foreign rep77 rep78 make If we had wanted a graph of medians rather than means, we could have typed . grmeanby foreign rep77 rep78 make, sum(mpg) median 766 grmeanby — Graph means and medians by categorical variables References Chambers, J. M., and T. J. Hastie, ed. 1992. Statistical Models in S. Pacific Grove, CA: Wadsworth and Brooks/Cole. Gould, W. W. 1993. gr12: Graphs of means and medians by categorical variables. Stata Technical Bulletin 12: 13. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 44–45. College Station, TX: Stata Press. 
Title hausman — Hausman specification test Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Options Acknowledgment Syntax hausman name-consistent name-efficient , options Description options Main include estimated intercepts in comparison; default is to exclude use all equations to perform test; default is first equation only skip specified equations when performing test associate/compare the specified (by number) pairs of equations force performance of test, even though assumptions are not met use # degrees of freedom base both (co)variance matrices on disturbance variance estimate from efficient estimator base both (co)variance matrices on disturbance variance estimate from consistent estimator constant alleqs skipeqs(eqlist) equations(matchlist) force df(#) sigmamore sigmaless Advanced consistent estimator column header efficient estimator column header tconsistent(string) tefficient(string) where name-consistent and name-efficient are names under which estimation results were stored via estimates store; see [R] estimates store. A period (.) may be used to refer to the last estimation results, even if these were not already stored. Not specifying name-efficient is equivalent to specifying the last estimation results as “.”. Menu Statistics > Postestimation > Tests > Hausman specification test Description hausman performs Hausman’s (1978) specification test. Options Main constant specifies that the estimated intercept(s) be included in the model comparison; by default, they are excluded. The default behavior is appropriate for models in which the constant does not have a common interpretation across the two models. 767 768 hausman — Hausman specification test alleqs specifies that all the equations in the models be used to perform the Hausman test; by default, only the first equation is used. skipeqs(eqlist) specifies in eqlist the names of equations to be excluded from the test. Equation numbers are not allowed in this context, because the equation names, along with the variable names, are used to identify common coefficients. equations(matchlist) specifies, by number, the pairs of equations that are to be compared. The matchlist in equations() should follow the syntax #c :#e ,#c :#e ,. . . where #c (#e ) is an equation number of the always-consistent (efficient under H0 ) estimator. For instance, equations(1:1), equations(1:1, 2:2), or equations(1:2). If equations() is not specified, then equations are matched on equation names. equations() handles the situation in which one estimator uses equation names and the other does not. For instance, equations(1:2) means that equation 1 of the always-consistent estimator is to be tested against equation 2 of the efficient estimator. equations(1:1, 2:2) means that equation 1 is to be tested against equation 1 and that equation 2 is to be tested against equation 2. If equations() is specified, the alleqs and skipeqs options are ignored. force specifies that the Hausman test be performed, even though the assumptions of the Hausman test seem not to be met, for example, because the estimators were pweighted or the data were clustered. df(#) specifies the degrees of freedom for the Hausman test. The default is the matrix rank of the variance of the difference between the coefficients of the two estimators. sigmamore and sigmaless specify that the two covariance matrices used in the test be based on a common estimate of disturbance variance (σ 2 ). 
sigmamore specifies that the covariance matrices be based on the estimated disturbance variance from the efficient estimator. This option provides a proper estimate of the contrast variance for so-called tests of exogeneity and overidentification in instrumental-variables regression. sigmaless specifies that the covariance matrices be based on the estimated disturbance variance from the consistent estimator. These options can be specified only when both estimators store e(sigma) or e(rmse), or with the xtreg command. e(sigma e) is stored after the xtreg command with the fe or mle option. e(rmse) is stored after the xtreg command with the re option. sigmamore or sigmaless are recommended when comparing fixed-effects and random-effects linear regression because they are much less likely to produce a non–positive-definite-differenced covariance matrix (although the tests are asymptotically equivalent whether or not one of the options is specified). Advanced tconsistent(string) and tefficient(string) are formatting options. They allow you to specify the headers of the columns of coefficients that default to the names of the models. These options will be of interest primarily to programmers. hausman — Hausman specification test 769 Remarks and examples hausman is a general implementation of Hausman’s (1978) specification test, which compares an estimator θb1 that is known to be consistent with an estimator θb2 that is efficient under the assumption being tested. The null hypothesis is that the estimator θb2 is indeed an efficient (and consistent) estimator of the true parameters. If this is the case, there should be no systematic difference between the two estimators. If there exists a systematic difference in the estimates, you have reason to doubt the assumptions on which the efficient estimator is based. The assumption of efficiency is violated if the estimator is pweighted or the data are clustered, so hausman cannot be used. The test can be forced by specifying the force option with hausman. For an alternative to using hausman in these cases, see [R] suest. To use hausman, you . . . . (compute the always-consistent estimator) estimates store name-consistent (compute the estimator that is efficient under H 0 ) hausman name-consistent . Alternatively, you can turn this around: . . . . . (compute the estimator that is efficient under H 0 ) estimates store name-efficient (fit the less-efficient model ) (compute the always-consistent estimator) hausman . name-efficient You can, of course, also compute and store both the always-consistent and efficient-under-H0 estimators and perform the Hausman test with . hausman name-consistent name-efficient Example 1 We are studying the factors that affect the wages of young women in the United States between 1970 and 1988, and we have a panel-data sample of individual women over that time span. . use http://www.stata-press.com/data/r13/nlswork4 (National Longitudinal Survey. Young Women 14-26 years of age in 1968) . describe Contains data from http://www.stata-press.com/data/r13/nlswork4.dta obs: 28,534 National Longitudinal Survey. 
Young Women 14-26 years of age in 1968 vars: 6 29 Jan 2013 16:35 size: 370,942 variable name idcode year age msp ttl_exp ln_wage Sorted by: storage type int byte byte byte float float idcode display format %8.0g %8.0g %8.0g %8.0g %9.0g %9.0g year value label variable label NLS ID interview year age in current year 1 if married, spouse present total work experience ln(wage/GNP deflator) 770 hausman — Hausman specification test We believe that a random-effects specification is appropriate for individual-level effects in our model. We fit a fixed-effects model that will capture all temporally constant individual-level effects. . xtreg ln_wage age msp ttl_exp, fe Fixed-effects (within) regression Group variable: idcode R-sq: within = 0.1373 between = 0.2571 overall = 0.1800 corr(u_i, Xb) Number of obs Number of groups Obs per group: min avg max F(3,23781) Prob > F = 0.1476 ln_wage Coef. age msp ttl_exp _cons -.005485 .0033427 .0383604 1.593953 sigma_u sigma_e rho .37674223 .29751014 .61591044 F test that all u_i=0: Std. Err. .000837 .0054868 .0012416 .0177538 t -6.55 0.61 30.90 89.78 P>|t| 0.000 0.542 0.000 0.000 = = = = = = = 28494 4710 1 6.0 15 1262.01 0.0000 [95% Conf. Interval] -.0071256 -.0074118 .0359268 1.559154 -.0038443 .0140971 .0407941 1.628752 (fraction of variance due to u_i) F(4709, 23781) = 7.76 Prob > F = 0.0000 We assume that this model is consistent for the true parameters and store the results by using estimates store under a name, fixed: . estimates store fixed Now we fit a random-effects model as a fully efficient specification of the individual effects under the assumption that they are random and follow a normal distribution. We then compare these estimates with the previously stored results by using the hausman command. . xtreg ln_wage age msp ttl_exp, re Random-effects GLS regression Group variable: idcode R-sq: within = 0.1373 between = 0.2552 overall = 0.1797 corr(u_i, X) Number of obs Number of groups Obs per group: min avg max Wald chi2(3) Prob > chi2 = 0 (assumed) ln_wage Coef. Std. Err. z age msp ttl_exp _cons -.0069749 .0046594 .0429635 1.609916 .0006882 .0051012 .0010169 .0159176 sigma_u sigma_e rho .32648519 .29751014 .54633481 (fraction of variance due to u_i) -10.13 0.91 42.25 101.14 P>|z| 0.000 0.361 0.000 0.000 = = = = = = = 28494 4710 1 6.0 15 5100.33 0.0000 [95% Conf. Interval] -.0083238 -.0053387 .0409704 1.578718 -.0056259 .0146575 .0449567 1.641114 hausman — Hausman specification test . hausman fixed ., sigmamore Coefficients (b) (B) fixed . age msp ttl_exp Test: -.005485 .0033427 .0383604 -.0069749 .0046594 .0429635 (b-B) Difference sqrt(diag(V_b-V_B)) S.E. .0014899 -.0013167 -.0046031 .0004803 .0020596 .0007181 771 b = consistent under Ho and Ha; obtained from xtreg B = inconsistent under Ha, efficient under Ho; obtained from xtreg Ho: difference in coefficients not systematic chi2(3) = (b-B)’[(V_b-V_B)^(-1)](b-B) = 260.40 Prob>chi2 = 0.0000 Under the current specification, our initial hypothesis that the individual-level effects are adequately modeled by a random-effects model is resoundingly rejected. This result is based on the rest of our model specification, and random effects might be appropriate for some alternate model of wages. Jerry Allen Hausman was born in West Virginia in 1946. He studied economics at Brown and Oxford, has been at MIT since 1972, and has made many outstanding contributions to econometrics and applied microeconomics. 
Example 2 A stringent assumption of multinomial and conditional logit models is that outcome categories for the model have the property of independence of irrelevant alternatives (IIA). Stated simply, this assumption requires that the inclusion or exclusion of categories does not affect the relative risks associated with the regressors in the remaining categories. One classic example of a situation in which this assumption would be violated involves the choice of transportation mode; see McFadden (1974). For simplicity, postulate a transportation model with the four possible outcomes: rides a train to work, takes a bus to work, drives the Ford to work, and drives the Chevrolet to work. Clearly, “drives the Ford” is a closer substitute to “drives the Chevrolet” than it is to “rides a train” (at least for most people). This means that excluding “drives the Ford” from the model could be expected to affect the relative risks of the remaining options and that the model would not obey the IIA assumption. Using the data presented in [R] mlogit, we will use a simplified model to test for IIA. The choice of insurance type among indemnity, prepaid, and uninsured is modeled as a function of age and gender. The indemnity category is allowed to be the base category, and the model including all three outcomes is fit. The results are then stored under the name allcats. 772 hausman — Hausman specification test . use http://www.stata-press.com/data/r13/sysdsn3 (Health insurance data) . mlogit insure age male Iteration 0: log likelihood = -555.85446 Iteration 1: log likelihood = -551.32973 Iteration 2: log likelihood = -551.32802 Iteration 3: log likelihood = -551.32802 Multinomial logistic regression Number of obs LR chi2(4) Prob > chi2 Log likelihood = -551.32802 Pseudo R2 insure Indemnity Coef. Std. Err. z P>|z| = = = = 615 9.05 0.0598 0.0081 [95% Conf. Interval] (base outcome) Prepaid age male _cons -.0100251 .5095747 .2633838 .0060181 .1977893 .2787575 -1.67 2.58 0.94 0.096 0.010 0.345 -.0218204 .1219147 -.2829708 .0017702 .8972346 .8097383 -.0051925 .4748547 -1.756843 .0113821 .3618462 .5309602 -0.46 1.31 -3.31 0.648 0.189 0.001 -.0275011 -.2343508 -2.797506 .0171161 1.18406 -.7161803 Uninsure age male _cons . estimates store allcats Under the IIA assumption, we would expect no systematic change in the coefficients if we excluded one of the outcomes from the model. (For an extensive discussion, see Hausman and McFadden [1984].) We reestimate the parameters, excluding the uninsured outcome, and perform a Hausman test against the fully efficient full model. . mlogit insure age male if insure != "Uninsure":insure Iteration 0: log likelihood = -394.8693 Iteration 1: log likelihood = -390.4871 Iteration 2: log likelihood = -390.48643 Iteration 3: log likelihood = -390.48643 Multinomial logistic regression Number of obs LR chi2(2) Prob > chi2 Log likelihood = -390.48643 Pseudo R2 insure Indemnity Coef. Std. Err. z P>|z| = = = = 570 8.77 0.0125 0.0111 [95% Conf. Interval] (base outcome) Prepaid age male _cons -.0101521 .5144003 .2678043 .0060049 .1981735 .2775563 -1.69 2.60 0.96 0.091 0.009 0.335 -.0219214 .1259874 -.276196 .0016173 .9028133 .8118046 hausman — Hausman specification test . hausman . allcats, alleqs constant Coefficients (b) (B) . allcats age male _cons Test: -.0101521 .5144003 .2678043 -.0100251 .5095747 .2633838 (b-B) Difference sqrt(diag(V_b-V_B)) S.E. -.0001269 .0048256 .0044205 . .0123338 . 
773 b = consistent under Ho and Ha; obtained from mlogit B = inconsistent under Ha, efficient under Ho; obtained from mlogit Ho: difference in coefficients not systematic chi2(3) = (b-B)’[(V_b-V_B)^(-1)](b-B) = 0.08 Prob>chi2 = 0.9944 (V_b-V_B is not positive definite) The syntax of the if condition on the mlogit command simply identified the "Uninsured" category with the insure value label; see [U] 12.6.3 Value labels. On examining the output from hausman, we see that there is no evidence that the IIA assumption has been violated. Because the Hausman test is a standardized comparison of model coefficients, using it with mlogit requires that the base outcome be the same in both competing models. In particular, if the most-frequent category (the default base outcome) is being removed to test for IIA, you must use the baseoutcome() option in mlogit to manually set the base outcome to something else. Or you can use the equation() option of the hausman command to align the equations of the two models. Having the missing values for the square root of the diagonal of the covariance matrix of the differences is not comforting, but it is also not surprising. This covariance matrix is guaranteed to be positive definite only asymptotically (it is a consequence of the assumption that one of the estimators is efficient), and assurances are not made about the diagonal elements. Negative values along the diagonal are possible, and the fourth column of the table is provided mainly for descriptive use. We can also perform the Hausman IIA test against the remaining alternative in the model: . mlogit insure age male if insure != "Prepaid":insure Iteration 0: log likelihood = -132.59913 Iteration 1: log likelihood = -131.78009 Iteration 2: log likelihood = -131.76808 Iteration 3: log likelihood = -131.76807 Multinomial logistic regression Number of obs LR chi2(2) Prob > chi2 Log likelihood = -131.76807 Pseudo R2 insure Indemnity Coef. Std. Err. z P>|z| = = = = 338 1.66 0.4356 0.0063 [95% Conf. Interval] (base outcome) Uninsure age male _cons -.0041055 .4591074 -1.801774 .0115807 .3595663 .5474476 -0.35 1.28 -3.29 0.723 0.202 0.001 -.0268033 -.2456296 -2.874752 .0185923 1.163844 -.7287968 774 hausman — Hausman specification test . hausman . allcats, alleqs constant Coefficients (b) (B) . allcats age male _cons Test: -.0041055 .4591074 -1.801774 -.0051925 .4748547 -1.756843 (b-B) Difference sqrt(diag(V_b-V_B)) S.E. .001087 -.0157473 -.0449311 .0021355 . .1333421 b = consistent under Ho and Ha; obtained from mlogit B = inconsistent under Ha, efficient under Ho; obtained from mlogit Ho: difference in coefficients not systematic chi2(3) = (b-B)’[(V_b-V_B)^(-1)](b-B) = -0.18 chi2<0 ==> model fitted on these data fails to meet the asymptotic assumptions of the Hausman test; see suest for a generalized test Here the χ2 statistic is actually negative. We might interpret this result as strong evidence that we cannot reject the null hypothesis. Such a result is not an unusual outcome for the Hausman test, particularly when the sample is relatively small — there are only 45 uninsured individuals in this dataset. Are we surprised by the results of the Hausman test in this example? Not really. Judging from the z statistics on the original multinomial logit model, we were struggling to identify any structure in the data with the current specification. 
Even when we were willing to assume IIA and computed the efficient estimator under this assumption, few of the effects could be identified as statistically different from those on the base category. Trying to base a Hausman test on a contrast (difference) between two poor estimates is just asking too much of the existing data.
In example 2, we encountered a case in which the Hausman test was not well defined. Unfortunately, in our experience this happens fairly often. Stata provides an alternative to the Hausman test that overcomes this problem by using a different estimator of the variance of the difference between the two estimators, one that is guaranteed to be positive semidefinite. This alternative estimator also widens the scope of problems to which Hausman-type tests can be applied by relaxing the assumption that one of the estimators is efficient. For instance, you can perform Hausman-type tests with clustered observations and survey estimators. See [R] suest for details.
Stored results
hausman stores the following in r():
Scalars
    r(chi2)     χ2
    r(df)       degrees of freedom for the statistic
    r(p)        p-value for the χ2
    r(rank)     rank of (V_b−V_B)^(−1)
Methods and formulas
The Hausman statistic is distributed as χ2 and is computed as
    H = (\beta_c - \beta_e)'(V_c - V_e)^{-1}(\beta_c - \beta_e)
where
    \beta_c   is the coefficient vector from the consistent estimator
    \beta_e   is the coefficient vector from the efficient estimator
    V_c       is the covariance matrix of the consistent estimator
    V_e       is the covariance matrix of the efficient estimator
When the difference in the variance matrices is not positive definite, a Moore–Penrose generalized inverse is used. As noted in Gourieroux and Monfort (1995, 125–128), the choice of generalized inverse is not important asymptotically.
The number of degrees of freedom for the statistic is the rank of the difference in the variance matrices. When the difference is positive definite, this is the number of common coefficients in the models being compared.
Acknowledgment
Portions of hausman are based on an early implementation by Jeroen Weesie of the Department of Sociology at Utrecht University, The Netherlands.
References
Baltagi, B. H. 2011. Econometrics. 5th ed. Berlin: Springer.
Gourieroux, C. S., and A. Monfort. 1995. Statistics and Econometric Models, Vol. 2: Testing, Confidence Regions, Model Selection, and Asymptotic Theory. Trans. Q. Vuong. Cambridge: Cambridge University Press.
Hausman, J. A. 1978. Specification tests in econometrics. Econometrica 46: 1251–1271.
Hausman, J. A., and D. L. McFadden. 1984. Specification tests for the multinomial logit model. Econometrica 52: 1219–1240.
McFadden, D. L. 1974. Measurement of urban travel demand. Journal of Public Economics 3: 303–328.
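The stored results listed above make hausman convenient to use in do-files and programs. A minimal sketch, assuming the fixed estimates from example 1 are still stored (the display formats are illustrative only):
. quietly hausman fixed ., sigmamore
. display "chi2(" r(df) ") = " %6.2f r(chi2) "    Prob > chi2 = " %6.4f r(p)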
Also see [R] lrtest — Likelihood-ratio test after estimation [R] suest — Seemingly unrelated estimation [R] test — Test linear hypotheses after estimation [XT] xtreg — Fixed-, between-, and random-effects and population-averaged linear models Title heckman — Heckman selection model Syntax Description Options for Heckman selection model (two-step) Stored results References Menu Options for Heckman selection model (ML) Remarks and examples Methods and formulas Also see Syntax Basic syntax heckman depvar indepvars , select(varlists ) twostep indepvars , select(depvars = varlists ) twostep or heckman depvar Full syntax for maximum likelihood estimates only heckman depvar indepvars if in weight , select( depvars = varlists , noconstant offset(varnameo ) ) heckman ml options Full syntax for Heckman’s two-step consistent estimates only heckman depvar indepvars if in , twostep select( depvars = varlists , noconstant ) heckman ts options 776 heckman — Heckman selection model heckman ml options 777 Description Model ∗ specify selection equation: dependent and independent variables; whether to have constant term and offset variable noconstant suppress constant term include varname in model with coefficient constrained to 1 offset(varname) constraints(constraints) apply specified linear constraints keep collinear variables collinear select() SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife Reporting level(#) first noskip nshazard(newvar) mills(newvar) nocnsreport display options set confidence level; default is level(95) report first-step probit estimates perform likelihood-ratio test generate nonselection hazard variable synonym for nshazard() do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics ∗ select( ) is required. The full specification is select( depvars = varlists , noconstant offset(varnameo ) ). 778 heckman — Heckman selection model heckman ts options Model ∗ select() ∗ twostep noconstant rhosigma rhotrunc rholimited rhoforce Description specify selection equation: dependent and independent variables; whether to have constant term produce two-step consistent estimate suppress constant term truncate ρ to [ −1, 1 ] with consistent σ truncate ρ to [ −1, 1 ] truncate ρ in limited cases do not truncate ρ SE vce(vcetype) vcetype may be conventional, bootstrap, or jackknife Reporting level(#) first nshazard(newvar) mills(newvar) display options set confidence level; default is level(95) report first-step probit estimates generate nonselection hazard variable synonym for nshazard() control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling coeflegend display legend instead of statistics ∗ select( ) and twostep are required. The full specification is select( depvars = varlists , noconstant ). indepvars and varlists may contain factor variables; see [U] 11.4.3 Factor variables. depvar, indepvars, varlists , and depvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. 
twostep, vce(), first, noskip, and weights are not allowed with the svy prefix; see [SVY] svy. pweights, aweights, fweights, and iweights are allowed with maximum likelihood estimation; see [U] 11.1.6 weight. No weights are allowed if twostep is specified. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu heckman for maximum likelihood estimates Statistics > Sample-selection models > Heckman selection model (ML) heckman for two-step consistent estimates Statistics > Sample-selection models > Heckman selection model (two-step) heckman — Heckman selection model 779 Description heckman fits regression models with selection by using either Heckman’s two-step consistent estimator or full maximum likelihood. Options for Heckman selection model (ML) Model select( depvars = varlists , noconstant offset(varnameo ) ) specifies the variables and options for the selection equation. It is an integral part of specifying a Heckman model and is required. The selection equation should contain at least one variable that is not in the outcome equation. If depvars is specified, it should be coded as 0 or 1, with 0 indicating an observation not selected and 1 indicating a selected observation. If depvars is not specified, observations for which depvar is not missing are assumed selected, and those for which depvar is missing are assumed not selected. noconstant suppresses the selection constant term (intercept). offset(varnameo ) specifies that selection offset varnameo be included in the model with the coefficient constrained to be 1. noconstant, offset(varname), constraints(constraints), collinear; see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. first specifies that the first-step probit estimates of the selection equation be displayed before estimation. noskip specifies that a full maximum-likelihood model with only a constant for the regression equation be fit. This model is not displayed but is used as the base model to compute a likelihood-ratio test for the model test statistic displayed in the estimation header. By default, the overall model test statistic is an asymptotically equivalent Wald test that all the parameters in the regression equation are zero (except the constant). For many models, this option can substantially increase estimation time. nshazard(newvar) and mills(newvar) are synonyms; either will create a new variable containing the nonselection hazard — what Heckman (1979) referred to as the inverse of the Mills’ ratio — from the selection equation. The nonselection hazard is computed from the estimated parameters of the selection equation. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. 
780 heckman — Heckman selection model Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). The following option is available with heckman but is not shown in the dialog box: coeflegend; see [R] estimation options. Options for Heckman selection model (two-step) Model select( depvars = varlists , noconstant ) specifies the variables and options for the selection equation. It is an integral part of specifying a Heckman model and is required. The selection equation should contain at least one variable that is not in the outcome equation. If depvars is specified, it should be coded as 0 or 1, with 0 indicating an observation not selected and 1 indicating a selected observation. If depvars is not specified, observations for which depvar is not missing are assumed selected, and those for which depvar is missing are assumed not selected. noconstant suppresses the selection constant term (intercept). twostep specifies that Heckman’s (1979) two-step efficient estimates of the parameters, standard errors, and covariance matrix be produced. noconstant; see [R] estimation options. rhosigma, rhotrunc, rholimited, and rhoforce are rarely used options to specify how the two-step estimator (option twostep) handles unusual cases in which the two-step estimate of ρ is outside the admissible range for a correlation, [ −1, 1 ]. When abs(ρ) > 1, the two-step estimate of the coefficient variance–covariance matrix may not be positive definite and thus may be unusable for testing. The default is rhosigma. rhosigma specifies that ρ be truncated, as with the rhotrunc option, and that the estimate of σ be made consistent with ρb, the truncated estimate of ρ. So, σ b = βm ρb; see Methods and formulas for the definition of βm . Both the truncated ρ and the new estimate of σ b are used in all computations to estimate the two-step covariance matrix. rhotrunc specifies that ρ be truncated to lie in the range [ −1, 1 ]. If the two-step estimate is less than −1, ρ is set to −1; if the two-step estimate is greater than 1, ρ is set to 1. This truncated value of ρ is used in all computations to estimate the two-step covariance matrix. rholimited specifies that ρ be truncated only in computing the diagonal matrix D as it enters Vtwostep and Q; see Methods and formulas. In all other computations, the untruncated estimate of ρ is used. rhoforce specifies that the two-step estimate of ρ be retained, even if it is outside the admissible range for a correlation. This option may, in rare cases, lead to a non–positive-definite covariance matrix. These options have no effect when estimation is by maximum likelihood, the default. They also have no effect when the two-step estimate of ρ is in the range [ −1, 1 ]. heckman — Heckman selection model 781 SE vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (conventional) and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. vce(conventional), the default, uses the two-step variance estimator derived by Heckman. Reporting level(#); see [R] estimation options. first specifies that the first-step probit estimates of the selection equation be displayed before estimation. 
nshazard(newvar) and mills(newvar) are synonyms; either will create a new variable containing the nonselection hazard — what Heckman (1979) referred to as the inverse of the Mills’ ratio — from the selection equation. The nonselection hazard is computed from the estimated parameters of the selection equation. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. The following option is available with heckman but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples The Heckman selection model (Gronau 1974; Lewis 1974; Heckman 1976) assumes that there exists an underlying regression relationship, yj = xj β + u1j regression equation The dependent variable, however, is not always observed. Rather, the dependent variable for observation j is observed if zj γ + u2j > 0 selection equation where u1 ∼ N (0, σ) u2 ∼ N (0, 1) corr(u1 , u2 ) = ρ When ρ 6= 0, standard regression techniques applied to the first equation yield biased results. heckman provides consistent, asymptotically efficient estimates for all the parameters in such models. In one classic example, the first equation describes the wages of women. Women choose whether to work, and thus, from our point of view as researchers, whether we observe their wages in our data. If women made this decision randomly, we could ignore that not all wages are observed and use ordinary regression to fit a wage model. Such an assumption of random participation, however, is unlikely to be true; women who would have low wages may be unlikely to choose to work, and thus the sample of observed wages is biased upward. In the jargon of economics, women choose not to work when their personal reservation wage is greater than the wage offered by employers. Thus women who choose not to work might have even higher offer wages than those who do work — they may have high offer wages, but they have even higher reservation wages. We could tell a story that competency is related to wages, but competency is rewarded more at home than in the labor force. 782 heckman — Heckman selection model In any case, in this problem — which is the paradigm for most such problems — a solution can be found if there are some variables that strongly affect the chances for observation (the reservation wage) but not the outcome under study (the offer wage). Such a variable might be the number of children in the home. (Theoretically, we do not need such identifying variables, but without them, we depend on functional form to identify the model. It would be difficult for anyone to take such results seriously because the functional form assumptions have no firm basis in theory.) Example 1 In the syntax for heckman, depvar and indepvars are the dependent variable and regressors for the underlying regression model to be fit (y = Xβ), and varlists are the variables (Z) thought to determine whether depvar is observed or unobserved (selected or not selected). In our female wage example, the number of children at home would be included in the second list. By default, heckman assumes that missing values (see [U] 12.2.1 Missing values) of depvar imply that the dependent variable is unobserved (not selected). 
With some datasets, it is more convenient to specify a binary variable (depvars ) that identifies the observations for which the dependent is observed/selected (depvars 6= 0) or not observed (depvars = 0); heckman will accommodate either type of data. We have a (fictional) dataset on 2,000 women, 1,343 of whom work: . use http://www.stata-press.com/data/r13/womenwk . summarize age educ married children wage Obs Mean Std. Dev. Variable age education married children wage 2000 2000 2000 2000 1343 36.208 13.084 .6705 1.6445 23.69217 8.28656 3.045912 .4701492 1.398963 6.305374 Min Max 20 10 0 0 5.88497 59 20 1 5 45.80979 We will assume that the hourly wage is a function of education and age, whereas the likelihood of working (the likelihood of the wage being observed) is a function of marital status, the number of children at home, and (implicitly) the wage (via the inclusion of age and education, which we think determine the wage): heckman — Heckman selection model . heckman wage educ age, select(married children educ age) Iteration 0: log likelihood = -5178.7009 Iteration 1: log likelihood = -5178.3049 Iteration 2: log likelihood = -5178.3045 Heckman selection model Number of obs (regression model with sample selection) Censored obs Uncensored obs Wald chi2(2) Log likelihood = -5178.304 Prob > chi2 Std. Err. Coef. wage education age _cons .9899537 .2131294 .4857752 .0532565 .0206031 1.077037 18.59 10.34 0.45 0.000 0.000 0.652 .8855729 .1727481 -1.625179 1.094334 .2535108 2.59673 select married children education age _cons .4451721 .4387068 .0557318 .0365098 -2.491015 .0673954 .0277828 .0107349 .0041533 .1893402 6.61 15.79 5.19 8.79 -13.16 0.000 0.000 0.000 0.000 0.000 .3130794 .3842534 .0346917 .0283694 -2.862115 .5772647 .4931601 .0767718 .0446502 -2.119915 /athrho /lnsigma .8742086 1.792559 .1014225 .027598 8.62 64.95 0.000 0.000 .6754241 1.738468 1.072993 1.84665 rho sigma lambda .7035061 6.004797 4.224412 .0512264 .1657202 .3992265 .5885365 5.68862 3.441942 .7905862 6.338548 5.006881 chi2(1) = P>|z| 2000 657 1343 508.44 0.0000 wage LR test of indep. eqns. (rho = 0): z = = = = = 61.20 783 [95% Conf. Interval] Prob > chi2 = 0.0000 heckman assumes that wage is the dependent variable and that the first variable list (educ and age) are the determinants of wage. The variables specified in the select() option (married, children, educ, and age) are assumed to determine whether the dependent variable is observed (the selection equation). Thus we fit the model wage = β0 + β1 educ + β2 age + u1 and we assumed that wage is observed if γ0 + γ1 married + γ2 children + γ3 educ + γ4 age + u2 > 0 where u1 and u2 have correlation ρ. The reported results for the wage equation are interpreted exactly as though we observed wage data for all women in the sample; the coefficients on age and education level represent the estimated marginal effects of the regressors in the underlying regression equation. The results for the two ancillary parameters require some explanation. heckman does not directly estimate ρ; to constrain ρ within its valid limits, and for numerical stability during optimization, it estimates the inverse hyperbolic tangent of ρ: 1+ρ 1 atanh ρ = ln 2 1−ρ 784 heckman — Heckman selection model This estimate is reported as /athrho. In the bottom panel of the output, heckman undoes this transformation for you: the estimated value of ρ is 0.7035061. The standard error for ρ is computed using the delta method, and its confidence intervals are the transformed intervals of /athrho. 
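As a quick check, not part of the original output, the reported value of ρ can be recovered from the /athrho estimate with the hyperbolic tangent:
. display tanh(.8742086)      // approximately .7035061, the value reported for rho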
Similarly, σ , the standard error of the residual in the wage equation, is not directly estimated; for numerical stability, heckman instead estimates ln σ . The untransformed sigma is reported at the end of the output: 6.004797. Finally, some researchers — especially economists — are used to the selectivity effect summarized not by ρ but by λ = ρσ . heckman reports this, too, along with an estimate of the standard error and confidence interval. Technical note If each of the equations in the model had contained many regressors, the heckman command could have become long. An alternate way of specifying our wage model would be to use Stata’s global macros. The following lines are an equivalent way of specifying our model: . global wageeq "wage educ age" . global seleq "married children educ age" . heckman $wageeq, select($seleq) (output omitted ) Technical note The reported model χ2 test is a Wald test that all coefficients in the regression model (except the constant) are 0. heckman is an estimation command, so you can use test, testnl, or lrtest to perform tests against whatever nested alternate model you choose; see [R] test, [R] testnl, and [R] lrtest. The estimation of ρ and σ in the forms atanh ρ and ln σ extends the range of these parameters to infinity in both directions, thus avoiding boundary problems during the maximization. Tests of ρ must be made in the transformed units. However, because atanh(0) = 0, the reported test for atanh ρ = 0 is equivalent to the test for ρ = 0. The likelihood-ratio test reported at the bottom of the output is an equivalent test for ρ = 0 and is computationally the comparison of the joint likelihood of an independent probit model for the selection equation and a regression model on the observed wage data against the Heckman model likelihood. Because χ2 = 61.20, this clearly justifies the Heckman selection equation with these data. Example 2 heckman supports the Huber/White/sandwich estimator of variance under the vce(robust) and vce(cluster clustvar) options or when pweights are used for population-weighted data; see [U] 20.21 Obtaining robust variance estimates. We can obtain robust standard errors for our wage model by specifying clustering on county of residence (the county variable). heckman — Heckman selection model 785 . heckman wage educ age, select(married children educ age) vce(cluster county) Iteration 0: log pseudolikelihood = -5178.7009 Iteration 1: log pseudolikelihood = -5178.3049 Iteration 2: log pseudolikelihood = -5178.3045 Heckman selection model Number of obs = 2000 (regression model with sample selection) Censored obs = 657 Uncensored obs = 1343 Wald chi2(1) = . Log pseudolikelihood = -5178.304 Prob > chi2 = . (Std. Err. adjusted for 10 clusters in county) Robust Std. Err. wage Coef. wage education age _cons .9899537 .2131294 .4857752 .0600061 .020995 1.302103 16.50 10.15 0.37 0.000 0.000 0.709 .8723438 .17198 -2.066299 1.107564 .2542789 3.03785 select married children education age _cons .4451721 .4387068 .0557318 .0365098 -2.491015 .0731472 .0312386 .0110039 .004038 .1153305 6.09 14.04 5.06 9.04 -21.60 0.000 0.000 0.000 0.000 0.000 .3018062 .3774802 .0341645 .0285954 -2.717059 .5885379 .4999333 .0772991 .0444242 -2.264972 /athrho /lnsigma .8742086 1.792559 .1403337 .0258458 6.23 69.36 0.000 0.000 .5991596 1.741902 1.149258 1.843216 rho sigma lambda .7035061 6.004797 4.224412 .0708796 .155199 .5186709 .5364513 5.708189 3.207835 .817508 6.316818 5.240988 z Wald test of indep. eqns. (rho = 0): chi2(1) = P>|z| 38.81 [95% Conf. 
Interval] Prob > chi2 = 0.0000 The robust standard errors tend to be a bit larger, but we notice no systematic differences. This finding is not surprising because the data were not constructed to have any county-specific correlations or any other characteristics that would deviate from the assumptions of the Heckman model. Example 3 Stata also produces Heckman’s (1979) two-step efficient estimator of the model with the twostep option. Maximum likelihood estimation of the parameters can be time consuming with large datasets, and the two-step estimates may provide a good alternative in such cases. Continuing with the women’s wage model, we can obtain the two-step estimates with Heckman’s consistent covariance estimates by typing 786 heckman — Heckman selection model . heckman wage educ age, select(married children educ age) twostep Heckman selection model -- two-step estimates Number of obs (regression model with sample selection) Censored obs Uncensored obs Wald chi2(2) Prob > chi2 Std. Err. z P>|z| = = = = = 2000 657 1343 442.54 0.0000 wage Coef. [95% Conf. Interval] wage education age _cons .9825259 .2118695 .7340391 .0538821 .0220511 1.248331 18.23 9.61 0.59 0.000 0.000 0.557 .8769189 .1686502 -1.712645 1.088133 .2550888 3.180723 select married children education age _cons .4308575 .4473249 .0583645 .0347211 -2.467365 .074208 .0287417 .0109742 .0042293 .1925635 5.81 15.56 5.32 8.21 -12.81 0.000 0.000 0.000 0.000 0.000 .2854125 .3909922 .0368555 .0264318 -2.844782 .5763025 .5036576 .0798735 .0430105 -2.089948 lambda 4.001615 .6065388 6.60 0.000 2.812821 5.19041 rho sigma 0.67284 5.9473529 mills Technical note The Heckman selection model depends strongly on the model being correct, much more so than ordinary regression. Running a separate probit or logit for sample inclusion followed by a regression, referred to in the literature as the two-part model (Manning, Duan, and Rogers 1987) — not to be confused with Heckman’s two-step procedure — is an especially attractive alternative if the regression part of the model arose because of taking a logarithm of zero values. When the goal is to analyze an underlying regression model or to predict the value of the dependent variable that would be observed in the absence of selection, however, the Heckman model is more appropriate. When the goal is to predict an actual response, the two-part model is usually the better choice. The Heckman selection model can be unstable when the model is not properly specified or if a specific dataset simply does not support the model’s assumptions. For example, let’s examine the solution to another simulated problem. heckman — Heckman selection model . use http://www.stata-press.com/data/r13/twopart . heckman yt x1 x2 x3, select(z1 z2) nonrtol Iteration 0: log likelihood = -111.94996 Iteration 1: log likelihood = -110.82258 Iteration 2: log likelihood = -110.17707 Iteration 3: log likelihood = -107.70663 (not concave) Iteration 4: log likelihood = -107.07729 (not concave) (output omitted ) Iteration 33: log likelihood = -104.0825 (not concave) Iteration 34: log likelihood = -104.0825 Heckman selection model Number of obs (regression model with sample selection) Censored obs Uncensored obs Wald chi2(3) Log likelihood = -104.0825 Prob > chi2 Std. Err. z P>|z| = = = = = 787 150 87 63 8.64e+08 0.0000 yt Coef. [95% Conf. 
Interval] x1 x2 x3 _cons .8974192 -2.525303 2.855786 .6975442 .0002247 3994.69 .0001472 -1.7e+04 .0004181 6829.86 .0920515 7.58 0.000 0.000 0.000 0.000 .8969789 -2.525591 2.854966 .5171265 .8978595 -2.525014 2.856605 .8779619 z1 z2 _cons -.6825988 1.003605 -.3604652 .0900159 .132347 .1232778 -7.58 7.58 -2.92 0.000 0.000 0.003 -.8590267 .7442097 -.6020852 -.5061709 1.263 -.1188452 /athrho /lnsigma 16.19193 -.5396153 280.9822 .1318714 0.06 -4.09 0.954 0.000 -534.523 -.7980786 566.9069 -.2811521 rho sigma lambda 1 .5829725 .5829725 9.73e-12 .0768774 .0768774 -1 .4501931 .4322955 1 .7549135 .7336494 yt select LR test of indep. eqns. (rho = 0): chi2(1) = 25.67 Prob > chi2 = 0.0000 The model has converged to a value of ρ that is 1.0 — within machine-rounding tolerances. Given the form of the likelihood for the Heckman selection model, this implies a division by zero, and it is surprising that the model solution turns out as well as it does. Reparameterizing ρ has allowed the estimation to converge, but we clearly have problems with the estimates. Moreover, if this had occurred in a large dataset, waiting for convergence might take considerable time. This dataset was not intentionally developed to cause problems. It is actually generated by a “Heckman process” and when generated starting from different random values can be easily estimated. The luck of the draw here merely led to data that, despite the source, did not support the assumptions of the Heckman model. The two-step model is generally more stable when the data are problematic. It even tolerates estimates of ρ less than −1 and greater than 1. For these reasons, the two-step model may be preferred when exploring a large dataset. Still, if the maximum likelihood estimates cannot converge, or converge to a value of ρ that is at the boundary of acceptable values, there is scant support for fitting a Heckman selection model on the data. Heckman (1979) discusses the implications of ρ being exactly 1 or 0, together with the implications of other possible covariance relationships among the model’s determinants. 788 heckman — Heckman selection model James Joseph Heckman was born in Chicago in 1944 and studied mathematics at Colorado College and economics at Princeton. He has taught economics at Columbia and (since 1973) at the University of Chicago. He has worked on developing a scientific basis for economic policy evaluation, with emphasis on models of individuals or disaggregated groups and the problems and possibilities created by heterogeneity, diversity, and unobserved counterfactual states. In 2000, he shared the Nobel Prize in Economics with Daniel L. McFadden. 
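Because heckman is an estimation command, these transformations can also be verified after fitting the model by using nlcom. Here is a minimal sketch, assuming the maximum likelihood fit from example 1 and assuming the ancillary parameters are stored in equations named athrho and lnsigma (confirm the names with heckman, coeflegend):
. use http://www.stata-press.com/data/r13/womenwk, clear
. quietly heckman wage educ age, select(married children educ age)
. nlcom (rho: tanh([athrho]_b[_cons])) (sigma: exp([lnsigma]_b[_cons]))
> (lambda: tanh([athrho]_b[_cons])*exp([lnsigma]_b[_cons]))
The point estimates should reproduce the rho, sigma, and lambda lines reported at the bottom of the heckman output, with delta-method standard errors.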
Stored results heckman (maximum likelihood) stores the following in e(): Scalars e(N) e(N cens) e(k) e(k eq) e(k eq model) e(k aux) e(k dv) e(df m) e(ll) e(ll 0) e(N clust) e(lambda) e(selambda) e(sigma) e(chi2) e(chi2 c) e(p c) e(p) e(rho) e(rank) e(rank0) e(ic) e(rc) e(converged) number of observations number of censored observations number of parameters number of equations in e(b) number of equations in overall model test number of auxiliary parameters number of dependent variables model degrees of freedom log likelihood log likelihood, constant-only model number of clusters λ standard error of λ sigma χ2 χ2 for comparison test p-value for comparison test significance of comparison test ρ rank of e(V) rank of e(V) for constant-only model number of iterations return code 1 if converged, 0 otherwise heckman — Heckman selection model Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(title2) e(clustvar) e(offset1) e(offset2) e(mills) e(chi2type) e(chi2 ct) e(vce) e(vcetype) e(opt) e(which) e(method) e(ml method) e(user) e(technique) e(properties) e(predict) e(marginsok) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) Functions e(sample) heckman command as typed names of dependent variables weight type weight expression title in estimation output secondary title in estimation output name of cluster variable offset for regression equation offset for selection equation variable containing nonselection hazard (inverse of Mills’) Wald or LR; type of model χ2 test Wald or LR; type of model χ2 test corresponding to e(chi2 c) vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization ml type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict predictions allowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample 789 790 heckman — Heckman selection model heckman (two-step) stores the following in e(): Scalars e(N) e(N cens) e(df m) e(lambda) e(selambda) e(sigma) e(chi2) e(p) e(rho) e(rank) Macros e(cmd) e(cmdline) e(depvar) e(title) e(title2) e(mills) e(chi2type) e(vce) e(vcetype) e(rhometh) e(method) e(properties) e(predict) e(marginsok) e(marginsnotok) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(V) Functions e(sample) number of observations number of censored observations model degrees of freedom λ standard error of λ sigma χ2 significance of comparison test ρ rank of e(V) heckman command as typed names of dependent variables title in estimation output secondary title in estimation output variable containing nonselection hazard (inverse of Mills’) Wald or LR; type of model χ2 test vcetype specified in vce() title used to label Std. Err. rhosigma, rhotrunc, rholimited, or rhoforce twostep b V program used to implement predict predictions allowed by margins predictions disallowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix variance–covariance matrix of the estimators marks estimation sample Methods and formulas Cameron and Trivedi (2010, 556–562) and Greene (2012, 873–880) provide good introductions to the Heckman selection model. Adkins and Hill (2011, sec. 
16.8) describe the two-step estimator with an application using Stata. Jones (2007, 35–40) illustrates Heckman estimation with an application to health economics.
Regression estimates using the nonselection hazard (Heckman 1979) provide starting values for maximum likelihood estimation. The regression equation is
    y_j = x_j\beta + u_{1j}
The selection equation is
    z_j\gamma + u_{2j} > 0
where
    u_1 \sim N(0, \sigma)
    u_2 \sim N(0, 1)
    \mathrm{corr}(u_1, u_2) = \rho
The log likelihood for observation j, \ln L_j = l_j, is
    l_j =
      \begin{cases}
        w_j \ln\Phi\left\{ \dfrac{z_j\gamma + (y_j - x_j\beta)\rho/\sigma}{\sqrt{1-\rho^2}} \right\}
          - \dfrac{w_j}{2}\left(\dfrac{y_j - x_j\beta}{\sigma}\right)^2
          - w_j \ln(\sqrt{2\pi}\,\sigma)   & y_j \text{ observed} \\
        w_j \ln\Phi(-z_j\gamma)            & y_j \text{ not observed}
      \end{cases}
where \Phi(\cdot) is the standard cumulative normal and w_j is an optional weight for observation j.
In the maximum likelihood estimation, \sigma and \rho are not directly estimated. Directly estimated are \ln\sigma and \operatorname{atanh}\rho:
    \operatorname{atanh}\rho = \frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right)
The standard error of \lambda = \rho\sigma is approximated through the propagation of error (delta) method; that is,
    \mathrm{Var}(\lambda) \approx D\,\mathrm{Var}\{(\operatorname{atanh}\rho,\ \ln\sigma)\}\,D'
where D is the Jacobian of \lambda with respect to \operatorname{atanh}\rho and \ln\sigma.
With maximum likelihood estimation, this command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] _robust, particularly Maximum likelihood estimators and Methods and formulas. The maximum likelihood version of heckman also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation.
The two-step estimates are computed using Heckman's (1979) procedure. Probit estimates of the selection equation
    \Pr(y_j \text{ observed} \mid z_j) = \Phi(z_j\gamma)
are obtained. From these estimates, the nonselection hazard—what Heckman (1979) referred to as the inverse of the Mills' ratio, m_j—for each observation j is computed as
    m_j = \frac{\phi(z_j\hat{\gamma})}{\Phi(z_j\hat{\gamma})}
where \phi is the normal density. We also define
    \delta_j = m_j(m_j + z_j\hat{\gamma})
Following Heckman, the two-step parameter estimates of \beta are obtained by augmenting the regression equation with the nonselection hazard m. Thus the regressors become [X m], and we obtain the additional parameter estimate \beta_m on the variable containing the nonselection hazard.
A consistent estimate of the regression disturbance variance is obtained using the residuals from the augmented regression and the parameter estimate on the nonselection hazard,
    \hat{\sigma}^2 = \frac{e'e + \beta_m^2 \sum_{j=1}^{N}\delta_j}{N}
The two-step estimate of \rho is then
    \hat{\rho} = \frac{\beta_m}{\hat{\sigma}}
Heckman derived consistent estimates of the coefficient covariance matrix on the basis of the augmented regression. Let W = [X m] and R be a square, diagonal matrix of dimension N, with (1 - \hat{\rho}^2\delta_j) as the diagonal elements. The conventional VCE is
    V_{\mathrm{twostep}} = \hat{\sigma}^2 (W'W)^{-1}(W'RW + Q)(W'W)^{-1}
where
    Q = \hat{\rho}^2 (W'DZ)\,V_p\,(Z'DW)
where D is the square, diagonal matrix of dimension N with \delta_j as the diagonal elements; Z is the data matrix of selection equation covariates; and V_p is the variance–covariance estimate from the probit estimation of the selection equation.
References
Adkins, L. C., and R. C. Hill. 2011. Using Stata for Principles of Econometrics. 4th ed. Hoboken, NJ: Wiley.
Baum, C. F. 2006. An Introduction to Modern Econometrics Using Stata. College Station, TX: Stata Press.
Cameron, A. C., and P. K. Trivedi. 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press.
Chiburis, R., and M. Lokshin. 2007. Maximum likelihood and two-step estimation of an ordered-probit selection model. Stata Journal 7: 167–182.
Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. Gronau, R. 1974. Wage comparisons: A selectivity bias. Journal of Political Economy 82: 1119–1143. Heckman, J. 1976. The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement 5: 475–492. . 1979. Sample selection bias as a specification error. Econometrica 47: 153–161. Jones, A. 2007. Applied Econometrics for Health Economists: A Practical Guide. 2nd ed. Abingdon, UK: Radcliffe. Lewis, H. G. 1974. Comments on selectivity biases in wage comparisons. Journal of Political Economy 82: 1145–1155. Manning, W. G., N. Duan, and W. H. Rogers. 1987. Monte Carlo evidence on the choice between sample selection and two-part models. Journal of Econometrics 35: 59–82. heckman — Heckman selection model Also see [R] heckman postestimation — Postestimation tools for heckman [R] heckoprobit — Ordered probit model with sample selection [R] heckprobit — Probit model with sample selection [R] regress — Linear regression [R] tobit — Tobit regression [SVY] svy estimation — Estimation commands for survey data [TE] etregress — Linear regression with endogenous treatment effects [U] 20 Estimation and postestimation commands 793 Title heckman postestimation — Postestimation tools for heckman Description Remarks and examples Syntax for predict Reference Menu for predict Also see Options for predict Description The following postestimation commands are available after heckman: Command Description contrast estat ic1 estat summarize estat vce estat (svy) estimates lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test; not available with two-step estimator marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest1 test testnl 1 estat ic and suest are not appropriate after heckman, twostep. 2 lrtest is not appropriate with svy estimation results. 
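As an illustration, and not part of the original entry, a few of these commands might be used after the maximum likelihood fit from [R] heckman as follows:
. use http://www.stata-press.com/data/r13/womenwk, clear
. quietly heckman wage educ age, select(married children educ age)
. estat ic                        // AIC and BIC; not available after the two-step estimator
. margins, dydx(education age)    // average marginal effects on the linear prediction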
794 heckman postestimation — Postestimation tools for heckman 795 Syntax for predict After ML or twostep predict type newvar if in , statistic nooffset After ML predict type stub* | newvarreg newvarsel newvarathrho newvarlnsigma if in , scores Description statistic Main linear prediction; the default standard error of the prediction standard error of the forecast linear prediction for selection equation standard error of the linear prediction for selection equation Pr(yj | a < yj < b) E(yj | a < yj < b) E(yj∗ ), yj∗ = max{a, min(yj , b)} E(yj |yj observed) E(yj∗ ), yj taken to be 0 where unobserved nonselection hazard (also called the inverse of Mills’ ratio) Pr(yj observed) xb stdp stdf xbsel stdpsel pr(a,b) e(a,b) ystar(a,b) ycond yexpected nshazard or mills psel These statistics are available both in and out of sample; type predict the estimation sample. stdf is not allowed with svy estimation results. . . . if e(sample) . . . if wanted only for where a and b may be numbers or variables; a missing (a ≥ .) means −∞, and b missing (b ≥ .) means +∞; see [U] 12.2.1 Missing values. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main xb, the default, calculates the linear prediction xj b. stdp calculates the standard error of the prediction, which can be thought of as the standard error of the predicted expected value or mean for the observation’s covariate pattern. The standard error of the prediction is also referred to as the standard error of the fitted value. stdf calculates the standard error of the forecast, which is the standard error of the point prediction for 1 observation. It is commonly referred to as the standard error of the future or forecast value. By construction, the standard errors produced by stdf are always larger than those produced by stdp; see Methods and formulas in [R] regress postestimation. 796 heckman postestimation — Postestimation tools for heckman xbsel calculates the linear prediction for the selection equation. stdpsel calculates the standard error of the linear prediction for the selection equation. pr(a,b) calculates Pr(a < xj b + u1 < b), the probability that yj |xj would be observed in the interval (a, b). a and b may be specified as numbers or variable names; lb and ub are variable names; pr(20,30) calculates Pr(20 < xj b + u1 < 30); pr(lb,ub) calculates Pr(lb < xj b + u1 < ub); and pr(20,ub) calculates Pr(20 < xj b + u1 < ub). a missing (a ≥ .) means −∞; pr(.,30) calculates Pr(−∞ < xj b + uj < 30); pr(lb,30) calculates Pr(−∞ < xj b + uj < 30) in observations for which lb ≥ . and calculates Pr(lb < xj b + uj < 30) elsewhere. b missing (b ≥ .) means +∞; pr(20,.) calculates Pr(+∞ > xj b + uj > 20); pr(20,ub) calculates Pr(+∞ > xj b + uj > 20) in observations for which ub ≥ . and calculates Pr(20 < xj b + uj < ub) elsewhere. e(a,b) calculates E(xj b + u1 | a < xj b + u1 < b), the expected value of yj |xj conditional on yj |xj being in the interval (a, b), meaning that yj |xj is truncated. a and b are specified as they are for pr(). ystar(a,b) calculates E(yj∗ ), where yj∗ = a if xj b + uj ≤ a, yj∗ = b if xj b + uj ≥ b, and yj∗ = xj b + uj otherwise, meaning that yj∗ is censored. a and b are specified as they are for pr(). ycond calculates the expected value of the dependent variable conditional on the dependent variable being observed, that is, selected; E(yj | yj observed). 
yexpected calculates the expected value of the dependent variable (yj∗ ), where that value is taken to be 0 when it is expected to be unobserved; yj∗ = Pr(yj observed)E(yj | yj observed). The assumption of 0 is valid for many cases where nonselection implies nonparticipation (for example, unobserved wage levels, insurance claims from those who are uninsured) but may be inappropriate for some problems (for example, unobserved disease incidence). nshazard and mills are synonyms; both calculate the nonselection hazard — what Heckman (1979) referred to as the inverse of the Mills’ ratio — from the selection equation. psel calculates the probability of selection (or being observed): Pr(yj observed) = Pr(zj γ + u2j > 0). nooffset is relevant when you specify offset(varname) for heckman. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xj b rather than as xj b + offsetj . scores, not available with twostep, calculates equation-level score variables. The first new variable will contain ∂ ln L/∂(xj β). The second new variable will contain ∂ ln L/∂(zj γ). The third new variable will contain ∂ ln L/∂(atanh ρ). The fourth new variable will contain ∂ ln L/∂(ln σ). heckman postestimation — Postestimation tools for heckman 797 Remarks and examples Example 1 The default statistic produced by predict after heckman is the expected value of the dependent variable from the underlying distribution of the regression model. In the wage model of [R] heckman, this is the expected wage rate among all women, regardless of whether they were observed to participate in the labor force: . use http://www.stata-press.com/data/r13/womenwk . heckman wage educ age, select(married children educ age) vce(cluster county) (output omitted ) . predict heckwage (option xb assumed; fitted values) It is instructive to compare these predicted wage values from the Heckman model with an ordinary regression model — a model without the selection adjustment: . regress wage educ age Source SS df MS Model Residual 13524.0337 39830.8609 2 1340 6762.01687 29.7245231 Total 53354.8946 1342 39.7577456 wage Coef. education age _cons .8965829 .1465739 6.084875 Std. Err. .0498061 .0187135 .8896182 . predict regwage (option xb assumed; fitted values) . summarize heckwage regwage Variable Obs Mean heckwage regwage 2000 2000 21.15532 23.12291 t 18.00 7.83 6.84 Number of obs F( 2, 1340) Prob > F R-squared Adj R-squared Root MSE = = = = = = 1343 227.49 0.0000 0.2535 0.2524 5.452 P>|t| [95% Conf. Interval] 0.000 0.000 0.000 .7988765 .109863 4.339679 Std. Dev. 3.83965 3.241911 .9942893 .1832848 7.830071 Min Max 14.6479 17.98218 32.85949 32.66439 Since this dataset was concocted, we know the true coefficients of the wage regression equation to be 1, 0.2, and 1, respectively. We can compute the true mean wage for our sample. . generate truewage = 1 + .2*age + 1*educ . summarize truewage Variable Obs Mean Std. Dev. truewage 2000 21.3256 3.797904 Min Max 15 32.8 Whereas the mean of the predictions from heckman is within 18 cents of the true mean wage, ordinary regression yields predictions that are on average about $1.80 per hour too high because of the selection effect. The regression predictions also show somewhat less variation than the true wages. The coefficients from heckman are so close to the true values that they are not worth testing. Conversely, the regression equation is significantly off but seems to give the right sense. 
Would we be led far astray if we relied on the OLS coefficients? The effect of age is off by more than 5 cents per year of age, and the coefficient on education level is off by about 10%. We can test the OLS coefficient on education level against the true value by using test. 798 heckman postestimation — Postestimation tools for heckman . test educ = 1 ( 1) education = 1 F( 1, 1340) = Prob > F = 4.31 0.0380 Not only is the OLS coefficient on education substantially lower than the true parameter, but the difference from the true parameter is also statistically significant beyond the 5% level. We can perform a similar test for the OLS age coefficient: . test age = .2 ( 1) age = .2 F( 1, 1340) = Prob > F = 8.15 0.0044 We find even stronger evidence that the OLS regression results are biased away from the true parameters. Example 2 Several other interesting aspects of the Heckman model can be explored with predict. Continuing with our wage model, we can obtain the expected wages for women conditional on participating in the labor force with the ycond option. Let’s get these predictions and compare them with actual wages for women participating in the labor force. . use http://www.stata-press.com/data/r13/womenwk, clear . heckman wage educ age, select(married children educ age) (output omitted ) . predict hcndwage, ycond . summarize wage hcndwage if wage != . Obs Mean Variable wage hcndwage 1343 1343 23.69217 23.68239 Std. Dev. 6.305374 3.335087 Min Max 5.88497 16.18337 45.80979 33.7567 We see that the average predictions from heckman are close to the observed levels but do not have the same mean. These conditional wage predictions are available for all observations in the dataset but can be directly compared only with observed wages, where individuals are participating in the labor force. What if we were interested in making predictions about mean wages for all women? Here the expected wage is 0 for those who are not expected to participate in the labor force, with expected participation determined by the selection equation. These values can be obtained with the yexpected option of predict. For comparison, a variable can be generated where the wage is set to 0 for nonparticipants. . predict hexpwage, yexpected . generate wage0 = wage (657 missing values generated) . replace wage0 = 0 if wage == . (657 real changes made) heckman postestimation — Postestimation tools for heckman . summarize hexpwage wage0 Obs Variable hexpwage wage0 2000 2000 Mean 15.92511 15.90929 Std. Dev. 5.979336 12.27081 Min Max 2.492469 0 32.45858 45.80979 799 Again we note that the predictions from heckman are close to the observed mean hourly wage rate for all women. Why aren’t the predictions using ycond and yexpected equal to their observed sample equivalents? For the Heckman model, unlike linear regression, the sample moments implied by the optimal solution to the model likelihood do not require that these predictions match observed data. Properly accounting for the additional variation from the selection equation requires that the model use more information than just the sample moments of the observed wages. Reference Heckman, J. 1979. Sample selection bias as a specification error. Econometrica 47: 153–161. 
Also see [R] heckman — Heckman selection model [U] 20 Estimation and postestimation commands Title heckoprobit — Ordered probit model with sample selection Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax heckoprobit depvar indepvars if in weight , select( depvars = varlists , noconstant offset(varnameo ) ) options options Description Model ∗ specify selection equation: dependent and independent variables; whether to have constant term and offset variable offset(varname) include varname in model with coefficient constrained to 1 constraints(constraints) apply specified linear constraints collinear keep collinear variables select() SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife Reporting level(#) first noheader nofootnote nocnsreport display options set confidence level; default is level(95) report first-step probit estimates do not display header above parameter table do not display footnotes below parameter table do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics ∗ select() is required. The full specification is select( depvars = varlists , noconstant offset(varnameo ) ). indepvars and varlists may contain factor variables; see [U] 11.4.3 Factor variables. depvar, indepvars, depvars , and varlists may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, jackknife, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce(), first, and weights are not allowed with the svy prefix; see [SVY] svy. pweights, fweights, and iweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 800 heckoprobit — Ordered probit model with sample selection 801 Menu Statistics > Sample-selection models > Ordered probit model with selection Description heckoprobit fits maximum-likelihood ordered probit models with sample selection. Options Model select( depvars = varlists , noconstant offset(varnameo ) ). specifies the variables and options for the selection equation. It is an integral part of specifying a selection model and is required. The selection equation should contain at least one variable that is not in the outcome equation. If depvars is specified, it should be coded as 0 or 1, 0 indicating an observation not selected and 1 indicating a selected observation. If depvars is not specified, observations for which depvar is not missing are assumed selected, and those for which depvar is missing are assumed not selected. noconstant suppresses the selection constant term (intercept). offset(varnameo ) specifies that selection offset varnameo be included in the model with the coefficient constrained to be 1. offset(varname), constraints(constraints), collinear; see [R] estimation options. 
SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. first specifies that the first-step probit estimates of the selection equation be displayed before estimation. noheader suppresses the header above the parameter table, the display that reports the final loglikelihood value, number of observations, etc. nofootnote suppresses the footnotes displayed below the parameter table. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. 802 heckoprobit — Ordered probit model with sample selection Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). The following option is available with heckoprobit but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples heckoprobit estimates the parameters of a regression model for an ordered categorical outcome from a nonrandom sample known as a selected sample. Selected samples suffer from “selection on unobservables” because the errors that determine whether a case is missing are correlated with the errors that determine the outcome. For ordered categorical regression from samples that do not suffer from selection on unobservables, see [R] oprobit or [R] ologit. For regression of a continuous outcome variable from a selected sample, see [R] heckman. Even though we are interested in modeling a single ordinal outcome, there are two dependent variables in the ordered probit sample-selection model because we must also model the sampleselection process. First, there is the ordinal outcome yj . Second, there is a binary variable that indicates whether each case in the sample is observed or unobserved. To handle the sample-selection problem, we model both dependent variables jointly. Both variables are categorical. Their categorical values are determined by the values of linear combinations of covariates and normally distributed error terms relative to certain cutpoints that partition the real line. The error terms used in the determination of selection and the ordinal outcome value may be correlated. The probability that the ordinal outcome yj is equal to the value vh is given by the probability that xj β + u1j falls within the cutpoints κh−1 and κh , Pr(yj = vh ) = Pr(κh−1 < xj β + u1j ≤ κh ) where xj is the outcome covariates, β is the coefficients, and u1j is a random-error term. The observed outcome values v1 , . . . , vH are integers such that vi < vm for i < m. κ0 is taken as −∞ and κH is taken as +∞. 
We model the selection process for the outcome by sj = 1(zj γ + u2j > 0) where sj = 1 if we observed yj and 0 otherwise, zj is the covariates used to model the selection process, γ is the coefficients for the selection process, 1(·) denotes the indicator function, and u2j is a random-error term. (u1j , u2j ) have bivariate normal distribution with mean zero and variance matrix 1 ρ ρ 1 heckoprobit — Ordered probit model with sample selection 803 When ρ 6= 0, standard ordered probit techniques applied to the outcome equation yield inconsistent results. heckoprobit provides consistent, asymptotically efficient estimates for all the parameters in such models. De Luca and Perotti (2011) describe the maximum likelihood estimator used in heckoprobit. Example 1 We have a simulated dataset containing a sample of 5,000 women, 3,480 of whom work. The outcome of interest is a woman’s job satisfaction, and we suspect that unobservables that determine job satisfaction and the unobservables that increase the likelihood of employment are correlated. Women may make a decision to work based on how satisfying their job would be. We estimate the parameters of an ordered probit sample-selection model for the outcome of job satisfaction (satisfaction) with selection on employment (work). Age (age) and years of education (education) are used as outcome covariates, and we also expect that they affect selection. Additional covariates for selection are marital status (married) and the number of children at home (children). Here we estimate the parameters of the model with heckoprobit. We use the factorial interaction of married and children in select(). This specifies that the number of children and marital status affect selection, and it allows the effect of the number of children to differ among married and nonmarried women. The factorial interaction is specified using factor-variable notation, which is described in [U] 11.4.3 Factor variables. 804 heckoprobit — Ordered probit model with sample selection . use http://www.stata-press.com/data/r13/womensat (Job satisfaction, female) . heckoprobit satisfaction education age, > select(work=education age i.married##c.children) Fitting oprobit model: Iteration 0: log likelihood = -3934.1474 Iteration 1: log likelihood = -3571.886 Iteration 2: log likelihood = -3570.2616 Iteration 3: log likelihood = -3570.2616 Fitting selection model: Iteration 0: log likelihood = -3071.0775 Iteration 1: log likelihood = -2565.5092 Iteration 2: log likelihood = -2556.8369 Iteration 3: log likelihood = -2556.8237 Iteration 4: log likelihood = -2556.8237 Comparison: log likelihood = -6127.0853 Fitting full model: Iteration 0: log likelihood = -6127.0853 Iteration 1: log likelihood = -6093.8868 Iteration 2: log likelihood = -6083.215 Iteration 3: log likelihood = -6083.0376 Iteration 4: log likelihood = -6083.0372 Ordered probit model with sample selection Number of obs Censored obs Uncensored obs Wald chi2(2) Prob > chi2 Log likelihood = -6083.037 Coef. Std. Err. z = = = = = 5000 1520 3480 842.42 0.0000 P>|z| [95% Conf. 
Interval] satisfaction education age .1536381 .0334463 .0068266 .0024049 22.51 13.91 0.000 0.000 .1402583 .0287329 .1670179 .0381598 work education age 1.married children .0512494 .0288084 .6120876 .5140995 .0068095 .0026528 .0700055 .0288529 7.53 10.86 8.74 17.82 0.000 0.000 0.000 0.000 .037903 .023609 .4748794 .4575489 .0645958 .0340078 .7492958 .5706501 -.1337573 .035126 -3.81 0.000 -.202603 -.0649117 _cons -2.203036 .125772 -17.52 0.000 -2.449545 -1.956528 /cut1 /cut2 /cut3 /athrho 1.728757 2.64357 3.642911 .7430919 .1232063 .116586 .1178174 .0780998 14.03 22.67 30.92 9.51 0.000 0.000 0.000 0.000 1.487277 2.415066 3.411993 .5900191 1.970237 2.872075 3.873829 .8961646 rho .6310096 .0470026 .5299093 .7144252 married# c.children 1 LR test of indep. eqns. (rho = 0): chi2(1) = 88.10 Prob > chi2 = 0.0000 The output shows several iteration logs. The first iteration log corresponds to running the ordered probit model for those observations in the sample where we have observed the outcome. The second iteration log corresponds to running the selection probit model, which models whether we observe heckoprobit — Ordered probit model with sample selection 805 our outcome of interest. If ρ = 0, the sum of the log likelihoods from these two models will equal the log likelihood of the ordered probit sample-selection model; this sum is printed in the iteration log as the comparison log likelihood. The final iteration log is for fitting the full ordered probit sample-selection model. The Wald test in the header is highly significant, indicating a good model fit. All the covariates are statistically significant. The likelihood-ratio test in the footer indicates that we can reject the null hypothesis that the errors for outcome and selection are uncorrelated. This means that we should use the ordered probit sample-selection model instead of the simple ordered probit model. The positive estimate of 0.63 for ρ indicates that unobservables that increase job satisfaction tend to occur with unobservables that increase the chance of having a job. 
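A short follow-up sketch may help connect the output above to the stored results listed below; it assumes the example 1 estimation results are still in memory. The value 0.15 in the test command is an arbitrary illustrative benchmark, not a value suggested by theory.

    * Sketch: retrieve stored quantities and test a coefficient after
    * heckoprobit (assumes the example 1 results are in memory)
    display "estimated rho      = " e(rho)
    display "comparison chi2(1) = " e(chi2_c)
    * Wald test that the outcome-equation coefficient on education is 0.15
    test [satisfaction]education = .15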
Stored results heckoprobit stores the following in e(): Scalars e(N) e(N cens) e(N cd) e(k cat) e(k) e(k eq) e(k eq model) e(k aux) e(k dv) e(df m) e(ll) e(ll c) e(N clust) e(chi2) e(chi2 c) e(p c) e(p) e(rho) e(rank) e(ic) e(rc) e(converged) number of observations number of censored observations number of completely determined observations number of categories number of parameters number of equations in e(b) number of equations in overall model test number of auxiliary parameters number of dependent variables model degrees of freedom log likelihood log likelihood, comparison model number of clusters χ2 χ2 for comparison test p-value for comparison test significance of comparison test ρ rank of e(V) number of iterations return code 1 if converged, 0 otherwise 806 heckoprobit — Ordered probit model with sample selection Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(title2) e(clustvar) e(offset1) e(offset2) e(chi2type) e(chi2 ct) e(vce) e(vcetype) e(opt) e(which) e(method) e(ml method) e(user) e(technique) e(properties) e(predict) e(marginsok) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(ilog) e(gradient) e(cat) e(V) e(V modelbased) Functions e(sample) heckoprobit command as typed names of dependent variables weight type weight expression title in estimation output secondary title in estimation output name of cluster variable offset for regression equation offset for selection equation Wald or LR; type of model χ2 test type of comparison χ2 test vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization ml type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict predictions allowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector category values variance–covariance matrix of the estimators model-based variance marks estimation sample Methods and formulas De Luca and Perotti (2011) provide an introduction to this model. The ordinal outcome equation is yj = H X vh 1 (κh−1 < xj β + u1j ≤ κh ) h=1 where xj is the outcome covariates, β is the coefficients, and u1j is a random-error term. The observed outcome values v1 , . . . , vH are integers such that vi < vm for i < m. κ1 , . . . , κH−1 are real numbers such that κi < κm for i < m. κ0 is taken as −∞ and κH is taken as +∞. The selection equation is sj = 1(zj γ + u2j > 0) where sj = 1 if we observed yj and 0 otherwise, zj is the covariates used to model the selection process, γ is the coefficients for the selection process, and u2j is a random-error term. (u1j , u2j ) have bivariate normal distribution with mean zero and variance matrix 1 ρ ρ 1 heckoprobit — Ordered probit model with sample selection γ 807 β Let aj = zj γ + offsetj and bj = xj β + offsetj . This yields the log likelihood lnL = X wj ln {Φ (−aj )} + j6∈S H X X h=1 wj ln {Φ2 (aj , κh − bj , −ρ) − Φ2 (aj , κh−1 − bj , −ρ)} j∈S yj =vh where S is the set of observations for which yj is observed, Φ2 (·) is the cumulative bivariate normal 0 distribution function (with mean [ 0 0 ] ), Φ(·) is the standard cumulative normal, and wj is an optional weight for observation j . In the maximum likelihood estimation, ρ is not directly estimated. 
Directly estimated is atanh ρ: atanh ρ = 1+ρ 1 ln 2 1−ρ From the form of the likelihood, it is clear that if ρ = 0, the log likelihood for the ordered probit sample-selection model is equal to the sum of the ordered probit model for the outcome y and the selection model. We can perform a likelihood-ratio test by comparing the log likelihood of the full model with the sum of the log likelihoods for the ordered probit and selection models. References Baum, C. F. 2006. An Introduction to Modern Econometrics Using Stata. College Station, TX: Stata Press. Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. New York: Cambridge University Press. Chiburis, R., and M. Lokshin. 2007. Maximum likelihood and two-step estimation of an ordered-probit selection model. Stata Journal 7: 167–182. De Luca, G., and V. Perotti. 2011. Estimation of ordered response models with sample selection. Stata Journal 11: 213–239. Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. Heckman, J. 1979. Sample selection bias as a specification error. Econometrica 47: 153–161. Miranda, A., and S. Rabe-Hesketh. 2006. Maximum likelihood estimation of endogenous switching and sample selection models for binary, ordinal, and count variables. Stata Journal 6: 285–308. Muro, J., C. Suárez, and M. Zamora. 2010. Computing Murphy–Topel-corrected variances in a heckprobit model with endogeneity. Stata Journal 10: 252–258. Van de Ven, W. P. M. M., and B. M. S. Van Pragg. 1981. The demand for deductibles in private health insurance: A probit model with sample selection. Journal of Econometrics 17: 229–252. Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press. 808 heckoprobit — Ordered probit model with sample selection Also see [R] heckoprobit postestimation — Postestimation tools for heckoprobit [R] heckman — Heckman selection model [R] heckprobit — Probit model with sample selection [R] oprobit — Ordered probit regression [R] probit — Probit regression [R] regress — Linear regression [R] tobit — Tobit regression [SVY] svy estimation — Estimation commands for survey data [U] 20 Estimation and postestimation commands Title heckoprobit postestimation — Postestimation tools for heckoprobit Description Remarks and examples Syntax for predict Methods and formulas Menu for predict Also see Options for predict Description The following postestimation commands are available after heckoprobit: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) 
point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, standard errors, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest1 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 lrtest is not appropriate with svy estimation results. 809 810 heckoprobit postestimation — Postestimation tools for heckoprobit Syntax for predict stub* | newvar | newvarlist outcome(outcome) nooffset predict type if in , statistic predict type stub* | newvarreg newvarsel newvar1 . . . newvarh newvarathrho if in , scores Description statistic Main marginal probabilities; the default bivariate probabilities of levels with selection bivariate probabilities of levels with no selection probabilities of levels conditional on selection probabilities of levels conditional on no selection selection probability linear prediction standard error of the linear prediction linear prediction for selection equation standard error of the linear prediction for selection equation pmargin p1 p0 pcond1 pcond0 psel xb stdp xbsel stdpsel If you do not specify outcome(), pmargin (with one new variable specified) assumes outcome(#1). You specify one or k new variables with pmargin, where k is the number of outcomes. You specify one new variable with psel, xb, stdp, xbsel, and stdpsel. These statistics are available both in and out of sample; type predict the estimation sample. . . . if e(sample) . . . if wanted only for Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main pmargin, the default, calculates the predicted marginal probabilities. You specify one or k new variables, where k is the number of categories of the outcome variable yj . If you specify the outcome() option, you must specify one new variable. If you specify one new variable and do not specify outcome(), outcome(#1) is assumed. When outcome() is specified, the marginal probability that yj is equal to the level outcome() is calculated. When outcome() is not specified, the marginal probabilities for each outcome level are calculated. p1 calculates the predicted bivariate probabilities of outcome levels with selection. You specify one or k new variables, where k is the number of categories of the outcome variable yj . If you specify the outcome() option, you must specify one new variable. If you specify one new variable and do not specify outcome(), outcome(#1) is assumed. heckoprobit postestimation — Postestimation tools for heckoprobit 811 When outcome() is specified, the bivariate probability that yj is equal to the level outcome() and that yj is observed is calculated. When outcome() is not specified, the bivariate probabilities for each outcome level and selection are calculated. p0 calculates the predicted bivariate probabilities of outcome levels with no selection. You specify one or k new variables, where k is the number of categories of the outcome variable yj . If you specify the outcome() option, you must specify one new variable. If you specify one new variable and do not specify outcome(), outcome(#1) is assumed. When outcome() is specified, the bivariate probability that yj is equal to the level outcome() and that yj is not observed is calculated. 
When outcome() is not specified, the bivariate probabilities for each outcome level and no selection are calculated. pcond1 calculates the predicted probabilities of outcome levels conditional on selection. You specify one or k new variables, where k is the number of categories of the outcome variable yj . If you specify the outcome() option, you must specify one new variable. If you specify one new variable and do not specify outcome(), outcome(#1) is assumed. When outcome() is specified, the probability that yj is equal to the level outcome() given that yj is observed is calculated. When outcome() is not specified, the probabilities for each outcome level conditional on selection are calculated. pcond0 calculates the predicted probabilities of outcome levels conditional on no selection. You specify one or k new variables, where k is the number of categories of the outcome variable yj . If you specify the outcome() option, you must specify one new variable. If you specify one new variable and do not specify outcome(), outcome(#1) is assumed. When outcome() is specified, the probability that yj is equal to the level outcome() given that yj is not observed is calculated. When outcome() is not specified, the probabilities for each outcome level conditional on no selection are calculated. psel calculates the predicted univariate (marginal) probability of selection. xb calculates the linear prediction for outcome variable, which is xj β if offset() was not specified β and xj β + offsetj if offset() was specified. stdp calculates the standard error of the prediction, which can be thought of as the standard error of the predicted expected value or mean for the observation’s covariate pattern. The standard error of the prediction is also referred to as the standard error of the fitted value. xbsel calculates the linear prediction for the selection equation, which is zj γ if offset() was not γ specified in select() and zj γ + offsetj if offset() was specified in select(). stdpsel calculates the standard error of the linear prediction for the selection equation. outcome(outcome) specifies for which outcome the predicted probabilities are to be calculated. outcome() should contain either one value of the dependent variable or one of #1, #2, . . . , with #1 meaning the first category of the dependent variable, #2 meaning the second category, etc. nooffset is relevant only if you specified offset(varname) for heckoprobit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xj b rather than as xj b + offsetj . 812 heckoprobit postestimation — Postestimation tools for heckoprobit scores calculates equation-level score variables. The first new variable will contain ∂ ln L/∂(xj β). The second new variable will contain ∂ ln L/∂(zj γ). When the dependent variable takes k different values, new variables three through k + 1 will contain ∂ ln L/∂(κj−2 ). The last new variable will contain ∂ ln L/∂(atanh ρ). Remarks and examples Example 1 In example 1 of [R] heckoprobit, we examined a simulated dataset of 5,000 women, 3,480 of whom work and can thus report job satisfaction. Using job satisfaction (satisfaction) as the outcome variable and employment (work) as the selection variable, we estimated the parameters of an ordered probit sample-selection model. Covariates age (age), years of education (education), number of children (children), and marital status (married) are expected to affect selection. 
The outcome, job satisfaction, is affected by age (age) and education (education). We first reestimate the parameters of the regression, but this time we request a robust variance estimator: . use http://www.stata-press.com/data/r13/womensat (Job satisfaction, female) . heckoprobit satisfaction education age, > select(work=education age i.married##c.children) vce(robust) (output omitted ) We then use margins (see [R] margins) to estimate the average marginal effect of education on the probability of having low job satisfaction. . margins, dydx(education) vce(unconditional) Average marginal effects Expression : Pr(satisfaction=1), predict() dy/dx w.r.t. : education dy/dx education Unconditional Std. Err. -.0234776 .0019176 z -12.24 Number of obs = 5000 P>|z| [95% Conf. Interval] 0.000 -.027236 -.0197192 The estimated average marginal effect of education on the probability of having low job satisfaction is approximately −0.023. Methods and formulas The ordinal outcome equation is yj = H X h=1 vh 1 (κh−1 < xj β + u1j ≤ κh ) heckoprobit postestimation — Postestimation tools for heckoprobit 813 where xj is the outcome covariates, β is the coefficients, and u1j is a random-error term. The observed outcome values v1 , . . . , vH are integers such that vi < vm for i < m. κ1 , . . . , κH−1 are real numbers such that κi < κm for i < m. κ0 is taken as −∞ and κH is taken as +∞. The selection equation is sj = 1(zj γ + u2j > 0) where sj = 1 if we observed yj and 0 otherwise, zj is the covariates used to model the selection process, γ is the coefficients for the selection process, and u2j is a random-error term. (u1j , u2j ) have bivariate normal distribution with mean zero and variance matrix 1 ρ ρ 1 The probability of selection is Pr(sj = 1) = Φ(zj γ + offsetγj ) Φ(·) is the standard cumulative normal distribution function. The probability of selection and the outcome yj = vh is Pr(yj = vh , sj = 1) = Φ2 zj γ + offsetγj , κh − xj β − offsetβj , −ρ − Φ2 zj γ + offsetγj , κh−1 − xj β − offsetβj , −ρ 0 Φ2 (·) is the cumulative bivariate normal distribution function (with mean [ 0 0 ] ). 
The probability of yj not being selected and the outcome yj = vh is Pr(yj = vh , sj = 0) = Φ2 −zj γ − offsetγj , κh − xj β − offsetβj , ρ − Φ2 −zj γ − offsetγj , κh−1 − xj β − offsetβj , ρ The probability of outcome yj = vh given selection is Pr(yj = vh |sj = 1) = Pr(yj = vh , sj = 1) Pr(sj = 1) The probability of outcome yj = vh given yj is not selected is Pr(yj = vh |sj = 0) = Pr(yj = vh , sj = 0) Pr(sj = 0) The marginal probabilities of the outcome yj are Pr(yj = v1 ) = Φ(κ1 − xj β − offsetβj ) Pr(yj = vH ) = 1 − Φ(κH−1 − xj β − offsetβj ) Pr(yj = vh ) = Φ(κh − xj β − offsetβj ) − Φ(κh−1 − xj β − offsetβj ) Also see [R] heckoprobit — Ordered probit model with sample selection [U] 20 Estimation and postestimation commands Title heckprobit — Probit model with sample selection Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax heckprobit depvar indepvars if in weight , select( depvars = varlists , noconstant offset(varnameo ) ) options options Description Model ∗ specify selection equation: dependent and independent variables; whether to have constant term and offset variable noconstant suppress constant term offset(varname) include varname in model with coefficient constrained to 1 constraints(constraints) apply specified linear constraints collinear keep collinear variables select() SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife Reporting level(#) first noskip nocnsreport display options set confidence level; default is level(95) report first-step probit estimates perform likelihood-ratio test do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics ∗ select( ) is required. The full specification is select( depvars = varlists , noconstant offset(varnameo ) ). indepvars and varlists may contain factor variables; see [U] 11.4.3 Factor variables. depvar, indepvars, depvars , and varlists may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce(), first, noskip, and weights are not allowed with the svy prefix; see [SVY] svy. pweights, fweights, and iweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 814 heckprobit — Probit model with sample selection 815 Menu Statistics > Sample-selection models > Probit model with selection Description heckprobit fits maximum-likelihood probit models with sample selection. heckprob is a synonym for heckprobit. Options Model select( depvars = varlists , noconstant offset(varnameo ) ) specifies the variables and options for the selection equation. It is an integral part of specifying a selection model and is required. The selection equation should contain at least one variable that is not in the outcome equation. If depvars is specified, it should be coded as 0 or 1, 0 indicating an observation not selected and 1 indicating a selected observation. 
If depvars is not specified, observations for which depvar is not missing are assumed selected, and those for which depvar is missing are assumed not selected. noconstant suppresses the selection constant term (intercept). offset(varnameo ) specifies that selection offset varnameo be included in the model with the coefficient constrained to be 1. noconstant, offset(varname), constraints(constraints), collinear; see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. first specifies that the first-step probit estimates of the selection equation be displayed before estimation. noskip specifies that a full maximum-likelihood model with only a constant for the regression equation be fit. This model is not displayed but is used as the base model to compute a likelihood-ratio test for the model test statistic displayed in the estimation header. By default, the overall model test statistic is an asymptotically equivalent Wald test that all the parameters in the regression equation are zero (except the constant). For many models, this option can substantially increase estimation time. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. 816 heckprobit — Probit model with sample selection Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). The following option is available with heckprobit but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples The probit model with sample selection (Van de Ven and Van Pragg 1981) assumes that there exists an underlying relationship yj∗ = xj β + u1j latent equation such that we observe only the binary outcome yjprobit = (yj∗ > 0) probit equation The dependent variable, however, is not always observed. Rather, the dependent variable for observation j is observed if yjselect = (zj γ + u2j > 0) selection equation where u1 ∼ N (0, 1) u2 ∼ N (0, 1) corr(u1 , u2 ) = ρ When ρ 6= 0, standard probit techniques applied to the first equation yield biased results. heckprobit provides consistent, asymptotically efficient estimates for all the parameters in such models. For the model to be well identified, the selection equation should have at least one variable that is not in the probit equation. Otherwise, the model is identified only by functional form, and the coefficients have no structural interpretation. Example 1 We use the data from Pindyck and Rubinfeld (1998). In this dataset, the variables are whether children attend private school (private), number of years the family has been at the present residence (years), log of property tax (logptax), log of income (loginc), and whether one voted for an increase in property taxes (vote). 
In this example, we alter the meaning of the data. Here we assume that we observe whether children attend private school only if the family votes for increasing the property taxes. This assumption is not true in the dataset, and we make it only to illustrate the use of this command. We observe whether children attend private school only if the head of household voted for an increase in property taxes. We assume that the vote is affected by the number of years in residence, the current property taxes paid, and the household income. We wish to model whether children are sent to private school on the basis of the number of years spent in the current residence and the current property taxes paid. heckprobit — Probit model with sample selection 817 . use http://www.stata-press.com/data/r13/school . heckprob private years logptax, select(vote=years loginc logptax) Fitting probit model: Iteration 0: log likelihood = -17.122381 Iteration 1: log likelihood = -16.243974 (output omitted ) Iteration 5: log likelihood = -15.883655 Fitting selection model: Iteration Iteration Iteration Iteration 0: 1: 2: 3: Comparison: log log log log likelihood likelihood likelihood likelihood = = = = -63.036914 -58.534843 -58.497292 -58.497288 log likelihood = -74.380943 Fitting starting values: Iteration 0: log likelihood = -40.895684 Iteration 1: log likelihood = -16.654497 (output omitted ) Iteration 6: log likelihood = -15.753765 Fitting full model: Iteration Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: 5: log log log log log log likelihood likelihood likelihood likelihood likelihood likelihood = = = = = = -75.010619 -74.287786 -74.250137 -74.245088 -74.244973 -74.244973 (not concave) Probit model with sample selection Log likelihood = -74.24497 Coef. private years logptax _cons Std. Err. z Number of obs Censored obs Uncensored obs = = = 95 36 59 Wald chi2(2) Prob > chi2 = = 1.04 0.5935 P>|z| [95% Conf. Interval] -.1142597 .3516098 -2.780665 .1461717 1.016485 6.905838 -0.78 0.35 -0.40 0.434 0.729 0.687 -.400751 -1.640665 -16.31586 .1722317 2.343884 10.75453 years loginc logptax _cons -.0167511 .9923024 -1.278783 -.545821 .0147735 .4430009 .5717545 4.070418 -1.13 2.24 -2.24 -0.13 0.257 0.025 0.025 0.893 -.0457067 .1240366 -2.399401 -8.523694 .0122045 1.860568 -.1581647 7.432052 /athrho -.8663156 1.450028 -0.60 0.550 -3.708318 1.975687 rho -.6994973 .7405343 -.9987984 .962269 vote LR test of indep. eqns. (rho = 0): chi2(1) = 0.27 Prob > chi2 = 0.6020 The output shows several iteration logs. The first iteration log corresponds to running the probit model for those observations in the sample where we have observed the outcome. The second iteration log corresponds to running the selection probit model, which models whether we observe our outcome of interest. If ρ = 0, the sum of the log likelihoods from these two models will equal the log likelihood of the probit model with sample selection; this sum is printed in the iteration log as the comparison log likelihood. The third iteration log shows starting values for the iterations. 818 heckprobit — Probit model with sample selection The final iteration log is for fitting the full probit model with sample selection. A likelihood-ratio test of the log likelihood for this model and the comparison log likelihood is presented at the end of the output. If we had specified the vce(robust) option, this test would be presented as a Wald test instead of as a likelihood-ratio test. 
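Before turning to robust standard errors, a small sketch shows one way to use these results; it assumes the fit above is still in memory and relies on the pmargin, pcond, and psel statistics documented in [R] heckprobit postestimation. The variable names p_marg, p_cond, and p_vote are introduced only for illustration.

    * Sketch: predicted probabilities after the heckprob fit above
    predict p_marg, pmargin     // Pr(private = 1), marginal
    predict p_cond, pcond       // Pr(private = 1) conditional on selection
    predict p_vote, psel        // Pr(vote = 1), the selection probability
    summarize p_marg p_cond p_vote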
Example 2 In example 1, we could have obtained robust standard errors by specifying the vce(robust) option. We do this here and also eliminate the iteration logs by using the nolog option: . heckprob private years logptax, sel(vote=years loginc logptax) vce(robust) nolog Probit model with sample selection Number of obs = 95 Censored obs = 36 Uncensored obs = 59 Wald chi2(2) = 2.55 Log pseudolikelihood = -74.24497 Prob > chi2 = 0.2798 Coef. private years logptax _cons Robust Std. Err. z P>|z| [95% Conf. Interval] -.1142597 .3516098 -2.780665 .1113977 .7358265 4.786678 -1.03 0.48 -0.58 0.305 0.633 0.561 -.3325951 -1.090584 -12.16238 .1040758 1.793803 6.601051 years loginc logptax _cons -.0167511 .9923024 -1.278783 -.545821 .0173344 .4228044 .5095156 4.543892 -0.97 2.35 -2.51 -0.12 0.334 0.019 0.012 0.904 -.0507259 .1636209 -2.277415 -9.451686 .0172237 1.820984 -.2801508 8.360044 /athrho -.8663156 1.630643 -0.53 0.595 -4.062318 2.329687 rho -.6994973 .8327753 -.9994079 .981233 vote Wald test of indep. eqns. (rho = 0): chi2(1) = 0.28 Prob > chi2 = 0.5952 Regardless of whether we specify the vce(robust) option, the outcome is not significantly different from the outcome obtained by fitting the probit and selection models separately. This result is not surprising because the selection mechanism estimated was invented for the example rather than borne from any economic theory. heckprobit — Probit model with sample selection Stored results heckprobit stores the following in e(): Scalars e(N) e(N cens) e(k) e(k eq) e(k eq model) e(k aux) e(k dv) e(df m) e(ll) e(ll 0) e(ll c) e(N clust) e(chi2) e(chi2 c) e(p c) e(p) e(rho) e(rank) e(rank0) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(clustvar) e(offset1) e(offset2) e(chi2type) e(chi2 ct) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) Functions e(sample) number of observations number of censored observations number of parameters number of equations in e(b) number of equations in overall model test number of auxiliary parameters number of dependent variables model degrees of freedom log likelihood log likelihood, constant-only model log likelihood, comparison model number of clusters χ2 χ2 for comparison test p-value for comparison test significance of comparison test ρ rank of e(V) rank of e(V) for constant-only model number of iterations return code 1 if converged, 0 otherwise heckprobit command as typed names of dependent variables weight type weight expression title in estimation output name of cluster variable offset for regression equation offset for selection equation Wald or LR; type of model χ2 test type of comparison χ2 test vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample 819 820 heckprobit — Probit model with sample selection Methods and formulas Van de Ven and Van Pragg (1981) provide an introduction and an explanation of this model. 
The probit equation is

    yj = (xj β + u1j > 0)

The selection equation is

    zj γ + u2j > 0

where

    u1 ∼ N(0, 1)
    u2 ∼ N(0, 1)
    corr(u1, u2) = ρ

The log likelihood is

    lnL = Σ_{j∈S, yj≠0} wj ln Φ2{ xj β + offset^β_j,  zj γ + offset^γ_j,  ρ }
        + Σ_{j∈S, yj=0} wj ln Φ2{ −(xj β + offset^β_j),  zj γ + offset^γ_j,  −ρ }
        + Σ_{j∉S}       wj ln{ 1 − Φ(zj γ + offset^γ_j) }

where S is the set of observations for which yj is observed, Φ2(·) is the cumulative bivariate normal distribution function (with mean [ 0 0 ]′), Φ(·) is the standard cumulative normal, and wj is an optional weight for observation j.

In the maximum likelihood estimation, ρ is not directly estimated. Directly estimated is atanh ρ:

    atanh ρ = (1/2) ln{ (1 + ρ)/(1 − ρ) }

From the form of the likelihood, it is clear that if ρ = 0, the log likelihood for the probit model with sample selection is equal to the sum of the probit model for the outcome y and the selection model. We can perform a likelihood-ratio test by comparing the likelihood of the full model with the sum of the log likelihoods for the probit and selection models.

This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] _robust, particularly Maximum likelihood estimators and Methods and formulas.

heckprobit also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation.

References
Baum, C. F. 2006. An Introduction to Modern Econometrics Using Stata. College Station, TX: Stata Press.
Chiburis, R., and M. Lokshin. 2007. Maximum likelihood and two-step estimation of an ordered-probit selection model. Stata Journal 7: 167–182.
De Luca, G. 2008. SNP and SML estimation of univariate and bivariate binary-choice models. Stata Journal 8: 190–220.
De Luca, G., and V. Perotti. 2011. Estimation of ordered response models with sample selection. Stata Journal 11: 213–239.
Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall.
Heckman, J. 1979. Sample selection bias as a specification error. Econometrica 47: 153–161.
Lokshin, M., and Z. Sajaia. 2011. Impact of interventions on discrete outcomes: Maximum likelihood estimation of the binary choice models with binary endogenous regressors. Stata Journal 11: 368–385.
Miranda, A., and S. Rabe-Hesketh. 2006. Maximum likelihood estimation of endogenous switching and sample selection models for binary, ordinal, and count variables. Stata Journal 6: 285–308.
Muro, J., C. Suárez, and M. Zamora. 2010. Computing Murphy–Topel-corrected variances in a heckprobit model with endogeneity. Stata Journal 10: 252–258.
Pindyck, R. S., and D. L. Rubinfeld. 1998. Econometric Models and Economic Forecasts. 4th ed. New York: McGraw–Hill.
Van de Ven, W. P. M. M., and B. M. S. Van Pragg. 1981. The demand for deductibles in private health insurance: A probit model with sample selection. Journal of Econometrics 17: 229–252.
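Because ρ is not estimated directly, it can be useful to recover it, along with a delta-method confidence interval, from the estimated atanh ρ. The sketch below assumes a heckprobit (or heckprob) fit is in memory; referring to the ancillary parameter as [athrho]_b[_cons] matches the /athrho line of the output, but the exact name should be confirmed with the coeflegend option.

    * Sketch: recover rho from the directly estimated atanh rho
    display "rho (stored result) = " e(rho)
    * Delta-method inference on rho = tanh(atanh rho); the parameter name
    * below corresponds to the /athrho line of the output
    nlcom tanh([athrho]_b[_cons])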
Also see [R] heckprobit postestimation — Postestimation tools for heckprobit [R] heckman — Heckman selection model [R] heckoprobit — Ordered probit model with sample selection [R] probit — Probit regression [SVY] svy estimation — Estimation commands for survey data [TE] etregress — Linear regression with endogenous treatment effects [U] 20 Estimation and postestimation commands Title heckprobit postestimation — Postestimation tools for heckprobit Description Remarks and examples Syntax for predict Also see Menu for predict Options for predict Description The following postestimation commands are available after heckprobit: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest1 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 lrtest is not appropriate with svy estimation results. 822 heckprobit postestimation — Postestimation tools for heckprobit 823 Syntax for predict predict type predict type newvar if in , statistic nooffset stub* | newvarreg newvarsel newvarathrho if in , scores Description statistic Main Φ(xj b), success probability; the default Φ2 (xj b, zj g, ρ), predicted probability Pr(yjprobit = 1, yjselect = 1) Φ2 (xj b, −zj g, −ρ), predicted probability Pr(yjprobit = 1, yjselect = 0) Φ2 (−xj b, zj g, −ρ), predicted probability Pr(yjprobit = 0, yjselect = 1) Φ2 (−xj b, −zj g, ρ), predicted probability Pr(yjprobit = 0, yjselect = 0) Φ(zj g), selection probability Φ2 (xj b, zj g, ρ)/Φ(zj g), probability of success conditional on selection linear prediction standard error of the linear prediction linear prediction for selection equation standard error of the linear prediction for selection equation pmargin p11 p10 p01 p00 psel pcond xb stdp xbsel stdpsel where Φ(·) is the standard normal distribution function and Φ2 (·) is the bivariate normal distribution function. These statistics are available both in and out of sample; type predict . . . if e(sample) . . . if wanted only for the estimation sample. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main pmargin, the default, calculates the univariate (marginal) predicted probability of success Pr(yjprobit = 1). probit = 1, yjselect = 1). probit = 1, yjselect = 0). probit = 0, yjselect = 1). probit = 0, yjselect = 0). 
p11 calculates the bivariate predicted probability Pr(yj p10 calculates the bivariate predicted probability Pr(yj p01 calculates the bivariate predicted probability Pr(yj p00 calculates the bivariate predicted probability Pr(yj psel calculates the univariate (marginal) predicted probability of selection Pr(yjselect = 1). pcond calculates the conditional (on selection) predicted probability of success Pr(yjprobit = 1, yjselect = 1)/Pr(yjselect = 1). xb calculates the probit linear prediction xj b. 824 heckprobit postestimation — Postestimation tools for heckprobit stdp calculates the standard error of the prediction, which can be thought of as the standard error of the predicted expected value or mean for the observation’s covariate pattern. The standard error of the prediction is also referred to as the standard error of the fitted value. xbsel calculates the linear prediction for the selection equation. stdpsel calculates the standard error of the linear prediction for the selection equation. nooffset is relevant only if you specified offset(varname) for heckprobit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xj b rather than as xj b + offsetj . scores calculates equation-level score variables. The first new variable will contain ∂ ln L/∂(xj β). The second new variable will contain ∂ ln L/∂(zj γ). The third new variable will contain ∂ ln L/∂(atanh ρ). Remarks and examples Example 1 It is instructive to compare the marginal predicted probabilities with the predicted probabilities that we would obtain by ignoring the selection mechanism. To compare the two approaches, we will synthesize data so that we know the “true” predicted probabilities. First, we need to generate correlated error terms, which we can do using a standard Cholesky decomposition approach. For our example, we will clear any data from memory and then generate errors that have a correlation of 0.5 by using the following commands. We set the seed so that interested readers can type in these same commands and obtain the same results. . set seed 12309 . set obs 5000 obs was 0, now 5000 . gen c1 = rnormal() . gen c2 = rnormal() . matrix P = (1,.5\.5,1) . matrix A = cholesky(P) . . . . local fac1 = A[2,1] local fac2 = A[2,2] gen u1 = c1 gen u2 = ‘fac1’*c1 + ‘fac2’*c2 We can check that the errors have the correct correlation by using the correlate command. We will also normalize the errors so that they have a standard deviation of one, so we can generate a bivariate probit model with known coefficients. We do that with the following commands: heckprobit postestimation — Postestimation tools for heckprobit 825 . correlate u1 u2 (obs=5000) u1 u2 u1 u2 1.0000 0.5020 1.0000 . summarize u1 (output omitted ) . replace u1 = u1/r(sd) (5000 real changes made) . summarize u2 (output omitted ) . replace u2 = u2/r(sd) (5000 real changes made) . drop c1 c2 . gen x1 = runiform()-.5 . gen x2 = runiform()+1/3 . gen y1s = 0.5 + 4*x1 + u1 . gen y2s = 3 - 3*x2 + .5*x1 + u2 . gen y1 = (y1s>0) . gen y2 = (y2s>0) We have now created two dependent variables, y1 and y2, which are defined by our specified coefficients. We also included error terms for each equation, and the error terms are correlated. We run heckprobit to verify that the data have been correctly generated according to the model y1 = .5 + 4x1 + u1 y2 = 3 + .5x1 − 3x2 + u2 where we assume that y1 is observed only if y2 = 1. . 
heckprobit y1 x1, sel(y2 = x1 x2) nolog Probit model with sample selection Log likelihood = -3679.5 Coef. Std. Err. z Number of obs Censored obs Uncensored obs = = = 5000 1762 3238 Wald chi2(1) Prob > chi2 = = 953.71 0.0000 P>|z| [95% Conf. Interval] y1 x1 _cons 3.784705 .4630922 .1225532 .0453952 30.88 10.20 0.000 0.000 3.544505 .3741192 4.024905 .5520653 x1 x2 _cons .3693052 -3.05069 3.037696 .0721694 .0832424 .0777733 5.12 -36.65 39.06 0.000 0.000 0.000 .2278558 -3.213842 2.885263 .5107547 -2.887538 3.190128 /athrho .5186232 .083546 6.21 0.000 .354876 .6823705 rho .4766367 .0645658 .3406927 .5930583 y2 LR test of indep. eqns. (rho = 0): chi2(1) = 40.43 Prob > chi2 = 0.0000 826 heckprobit postestimation — Postestimation tools for heckprobit Now that we have verified that we have generated data according to a known model, we can obtain and then compare predicted probabilities from the probit model with sample selection and a (usual) probit model. . predict pmarg (option pmargin assumed; Pr(y1=1)) . probit y1 x1 if y2==1 (output omitted ) . predict phat (option pr assumed; Pr(y1)) Using the (marginal) predicted probabilities from the probit model with sample selection (pmarg) and the predicted probabilities from the (usual) probit model (phat), we can also generate the “true” predicted probabilities from the synthesized y1s variable and then compare the predicted probabilities: . gen ptrue = normal(y1s) . summarize pmarg ptrue phat Variable Obs pmarg ptrue phat 5000 5000 5000 Mean .6071226 .5974195 .6568175 Std. Dev. .3147861 .348396 .3025085 Min Max .0766334 5.53e-06 .1059824 .9907113 .9999999 .9954919 Here we see that ignoring the selection mechanism (comparing the phat variable with the true ptrue variable) results in predicted probabilities that are much higher than the true values. Looking at the marginal predicted probabilities from the model with sample selection, however, results in more accurate predictions. Also see [R] heckprobit — Probit model with sample selection [U] 20 Estimation and postestimation commands Title help — Display help in Stata Syntax Remarks and examples Menu Also see Description Options Syntax help command or topic name , nonew name(viewername) marker(markername) Menu Help > Stata Command... Description The help command displays help information about the specified command or topic. Stata for Mac, Stata for Unix(GUI), and Stata for Windows: help launches a new Viewer to display help for the specified command or topic. If help is not followed by a command or a topic name, Stata launches the Viewer and displays help help advice, advice for using the help system and documentation. Help may be accessed either by selecting Help > Stata Command... and filling in the desired command name or by typing help followed by a command or topic name. Stata for Unix(console): Typing help followed by a command name or a topic name will display help on the console. If help is not followed by a command or a topic name, a description of how to use the help system is displayed. Options nonew specifies that a new Viewer window not be opened for the help topic if a Viewer window is already open. The default is for a new Viewer window to be opened each time help is typed so that multiple help files may be viewed at once. nonew causes the help file to be displayed in the topmost open Viewer. name(viewername) specifies that help be displayed in a Viewer window named viewername. If the named window already exists, its contents will be replaced. 
If the named window does not exist, it will be created. marker(markername) specifies that the help file be opened to the position of markername within the help file. Remarks and examples To obtain help for any Stata command, type help command or select Help > Stata Command... and fill in command. 827 828 help — Display help in Stata help is best explained by examples. To obtain help for . . . regress postestimation tools for regress graph option xlabel() Stata function strpos() Mata function optimize() type help regress help regress postestimation help regress post help graph xlabel() help strpos() help mata optimize() or Tips: • help displays advice for using the help system and documentation. • help guide displays a table of contents for basic Stata concepts. • help estimation commands displays an alphabetical listing of all Stata estimation commands. • help functions displays help on Stata functions by category. • help mata functions displays a subject table of contents for Mata’s functions. • help ts glossary displays the glossary for the time-series manual, and similarly for the other Stata specialty manuals. If you type help topic and help for topic is not found, Stata will automatically perform a search for topic. For instance, try typing help forecasting. A forecasting help file is not found, so Stata executes search forecasting and displays the results in the Viewer. See [U] 4 Stata’s help and search facilities for a complete description of how to use help. Technical note When you type help topic, Stata first looks along the adopath for topic.sthlp; see [U] 17.5 Where does Stata look for ado-files?. Video examples Quick help in Stata Also see [R] net search — Search the Internet for installable packages [R] search — Search Stata documentation and other resources [GSM] 4 Getting help [GSW] 4 Getting help [GSU] 4 Getting help [U] 4 Stata’s help and search facilities Title hetprobit — Heteroskedastic probit model Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax hetprobit depvar indepvars if in weight , het(varlist , offset(varnameo ) ) options options Description Model ∗ het(varlist . . . ) independent variables to model the variance and possible offset variable noconstant suppress constant term offset(varname) include varname in model with coefficient constrained to 1 asis retain perfect predictor variables constraints(constraints) apply specified linear constraints collinear keep collinear variables SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife Reporting level(#) noskip nolrtest nocnsreport display options set confidence level; default is level(95) perform likelihood-ratio test perform Wald test on variance do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics ∗ het() is required. The full specification is het(varlist , offset(varnameo ) ). indepvars and varlist may contain factor variables; see [U] 11.4.3 Factor variables. depvar, indepvars, and varlist may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. 
vce(), noskip, and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 829 830 hetprobit — Heteroskedastic probit model Menu Statistics > Binary outcomes > Heteroskedastic probit regression Description hetprobit fits a maximum-likelihood heteroskedastic probit model. hetprob is a synonym for hetprobit. See [R] logistic for a list of related estimation commands. Options Model het(varlist , offset(varnameo ) ) specifies the independent variables and the offset variable, if there is one, in the variance function. het() is required. offset(varnameo ) specifies that selection offset varnameo be included in the model with the coefficient constrained to be 1. noconstant, offset(varname); see [R] estimation options. asis forces the retention of perfect predictor variables and their associated perfectly predicted observations and may produce instabilities in maximization; see [R] probit. constraints(constraints), collinear; see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. noskip requests fitting of the constant-only model and calculation of the corresponding likelihood-ratio χ2 statistic for testing significance of the full model. By default, a Wald χ2 statistic is computed for testing the significance of the full model. nolrtest specifies that a Wald test of whether lnsigma2 = 0 be performed instead of the LR test. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). hetprobit — Heteroskedastic probit model 831 The following option is available with hetprobit but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Remarks are presented under the following headings: Introduction Robust standard errors Introduction hetprobit fits a maximum-likelihood heteroskedastic probit model, which is a generalization of the probit model. Let yj , j = 1, . . . , N , be a binary outcome variable taking on the value 0 (failure) or 1 (success). In the probit model, the probability that yj takes on the value 1 is modeled as a nonlinear function of a linear combination of the k independent variables xj = (x1j , x2j , . . . , xkj ), Pr(yj = 1) = Φ(xj b) in which Φ() is the cumulative distribution function (CDF) of a standard normal random variable, that is, a normally distributed (Gaussian) random variable with mean 0 and variance 1. The linear combination of the independent variables, xj b, is commonly called the index function, or index. 
Heteroskedastic probit generalizes the probit model by generalizing Φ() to a normal CDF with a variance that is no longer fixed at 1 but can vary as a function of the independent variables. hetprobit models the variance as a multiplicative function of these m variables zj = (z1j , z2j , . . . , zmj ), following Harvey (1976): 2 σj2 = {exp(zj γ)} Thus the probability of success as a function of all the independent variables is n o Pr(yj = 1) = Φ xj b/ exp(zj γ) From this expression, it is clear that, unlike the index xj b, no constant term can be present in zj γ if the model is to be identifiable. Suppose that the binary outcomes yj are generated by thresholding an unobserved random variable, w, which is normally distributed with mean xj b and variance 1 such that 1 if wj > 0 yj = 0 if wj ≤ 0 This process gives the probit model: Pr(yj = 1) = Pr(wj > 0) = Φ(xj b) Now suppose that the unobserved wj are heteroskedastic with variance 2 σj2 = {exp(zj γ)} Relaxing the homoskedastic assumption of the probit model in this manner yields our multiplicative heteroskedastic probit model: n o Pr(yj = 1) = Φ xj b/ exp(zj γ) 832 hetprobit — Heteroskedastic probit model Example 1 For this example, we generate simulated data for a simple heteroskedastic probit model and then estimate the coefficients with hetprobit: . set obs 1000 obs was 0, now 1000 . set seed 1234567 . gen x = 1-2*runiform() . gen xhet = runiform() . gen sigma = exp(1.5*xhet) . gen p = normal((0.3+2*x)/sigma) . gen y = cond(runiform()<=p,1,0) . hetprob y x, het(xhet) Fitting probit model: Iteration 0: log likelihood = -688.53208 Iteration 1: log likelihood = -591.59895 Iteration 2: log likelihood = -591.50674 Iteration 3: log likelihood = -591.50674 Fitting full model: Iteration 0: log likelihood = -591.50674 Iteration 1: log likelihood = -572.12219 Iteration 2: log likelihood = -570.7742 Iteration 3: log likelihood = -569.48921 Iteration 4: log likelihood = -569.47828 Iteration 5: log likelihood = -569.47827 Heteroskedastic probit model Number of obs Zero outcomes Nonzero outcomes Wald chi2(1) Prob > chi2 Log likelihood = -569.4783 y Coef. x _cons lnsigma2 xhet = = = = = 1000 452 548 78.66 0.0000 Std. Err. z P>|z| [95% Conf. Interval] 2.228031 .2493822 .2512073 .0862833 8.87 2.89 0.000 0.004 1.735673 .08027 2.720388 .4184943 1.602537 .2640131 6.07 0.000 1.085081 2.119993 y Likelihood-ratio test of lnsigma2=0: chi2(1) = 44.06 Prob > chi2 = 0.0000 Above we created two variables, x and xhet, and then simulated the model n o Pr(y = 1) = F (β0 + β1 x)/ exp(γ1 xhet) for β0 = 0.3, β1 = 2, and γ1 = 1.5. According to hetprobit’s output, all coefficients are significant, and, as we would expect, the Wald test of the full model versus the constant-only model—for example, the index consisting of β0 + β1 x versus that of just β0 —is significant with χ2 (1) = 79. Likewise, the likelihood-ratio test of heteroskedasticity, which tests the full model with heteroskedasticity against the full model without, is significant with χ2 (1) = 44. See [R] maximize for more explanation of the output. For this simple model, hetprobit took five iterations to converge. As stated elsewhere (Greene 2012, 714), this is a difficult model to fit, and it is not uncommon for it to require many iterations or for the optimizer to print out warnings and informative messages during the optimization. Slow convergence is especially common for models in which one or more of the independent variables appear in both the index and variance functions. 
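Before turning to the technical note, it may help to collect the model equations introduced above in one place. This is only a compact restatement of the formulas already given, not an additional result:

\[
\sigma_j^2 = \{\exp(z_j\gamma)\}^2, \qquad
\Pr(y_j = 1) = \Phi\!\left\{\frac{x_j b}{\exp(z_j\gamma)}\right\}
\]

Setting \(\gamma = 0\) makes \(\exp(z_j\gamma) = 1\) and recovers the ordinary probit model, and, as noted above, no constant term may appear in \(z_j\gamma\) if the model is to be identifiable.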
hetprobit — Heteroskedastic probit model 833 Technical note Stata interprets a value of 0 as a negative outcome (failure) and treats all other values (except missing) as positive outcomes (successes). Thus if your dependent variable takes on the values 0 and 1, then 0 is interpreted as failure and 1 as success. If your dependent variable takes on the values 0, 1, and 2, then 0 is still interpreted as failure, but both 1 and 2 are treated as successes. Robust standard errors If you specify the vce(robust) option, hetprobit reports robust standard errors as described in [U] 20.21 Obtaining robust variance estimates. To illustrate the effect of this option, we will reestimate our coefficients by using the same model and data in our example, this time adding vce(robust) to our hetprobit command. Example 2 . hetprob y x, het(xhet) vce(robust) nolog Heteroskedastic probit model Number of obs Zero outcomes Nonzero outcomes Wald chi2(1) Prob > chi2 Log pseudolikelihood = -569.4783 = = = = = 1000 452 548 65.23 0.0000 Robust Std. Err. z P>|z| [95% Conf. Interval] 2.22803 .2493821 .2758597 .0843367 8.08 2.96 0.000 0.003 1.687355 .0840853 2.768705 .4146789 1.602537 .2671326 6.00 0.000 1.078967 2.126107 y Coef. x _cons lnsigma2 xhet y Wald test of lnsigma2=0: chi2(1) = 35.99 Prob > chi2 = 0.0000 The vce(robust) standard errors for two of the three parameters are larger than the previously reported conventional standard errors. This is to be expected, even though (by construction) we have perfect model specification because this option trades off efficient estimation of the coefficient variance–covariance matrix for robustness against misspecification. Specifying the vce(cluster clustvar) option relaxes the usual assumption of independence between observations to the weaker assumption of independence just between clusters; that is, hetprobit, vce(cluster clustvar) is robust with respect to within-cluster correlation. This option is less efficient than the xtgee population-averaged models because hetprobit inefficiently sums within cluster for the standard-error calculation rather than attempting to exploit what might be assumed about the within-cluster correlation. 
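To illustrate the cluster–robust syntax only—not as a substantively meaningful analysis, because the simulated observations in example 1 are in fact independent—one could form a hypothetical grouping variable and refit the model. The variable id below is ours and is not part of the simulated dataset:

. gen id = ceil(_n/10)
. hetprob y x, het(xhet) vce(cluster id) nolog

The coefficient estimates are unchanged by vce(cluster clustvar); only the reported standard errors and the associated Wald statistics differ.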
834 hetprobit — Heteroskedastic probit model Stored results hetprobit stores the following in e(): Scalars e(N) e(N f) e(N s) e(k) e(k eq) e(k eq model) e(k dv) e(df m) e(ll) e(ll 0) e(ll c) e(N clust) e(chi2) e(chi2 c) e(p c) e(df m c) e(p) e(rank) e(rank0) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(clustvar) e(offset1) e(offset2) e(chi2type) e(chi2 ct) e(vce) e(vcetype) e(opt) e(which) e(method) e(ml method) e(user) e(technique) e(properties) e(predict) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) Functions e(sample) number of observations number of zero outcomes number of nonzero outcomes number of parameters number of equations in e(b) number of equations in overall model test number of dependent variables model degrees of freedom log likelihood log likelihood, constant-only model log likelihood, comparison model number of clusters χ2 χ2 for heteroskedasticity LR test p-value for heteroskedasticity LR test degrees of freedom for heteroskedasticity LR test significance rank of e(V) rank of e(V) for constant-only model number of iterations return code 1 if converged, 0 otherwise hetprobit command as typed name of dependent variable weight type weight expression title in estimation output name of cluster variable offset for probit equation offset for variance equation Wald or LR; type of model χ2 test Wald or LR; type of model χ2 test corresponding to e(chi2 c) vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization requested estimation method type of ml method name of likelihood-evaluator maximization technique b V program used to implement predict factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample hetprobit — Heteroskedastic probit model 835 Methods and formulas The heteroskedastic probit model is a generalization of the probit model because it allows the scale of the inverse link function to vary from observation to observation as a function of the independent variables. The log-likelihood function for the heteroskedastic probit model is lnL = X wj lnΦ{xj β/ exp(zγ)} + j∈S X wj ln 1 − Φ{xj β/ exp(zγ)} j6∈S where S is the set of all observations j such that yj 6= 0 and wj denotes the optional weights. lnL is maximized as described in [R] maximize. This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. hetprobit also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. References Blevins, J. R., and S. Khan. 2013. Distribution-free estimation of heteroskedastic binary response models in Stata. Stata Journal 13: 588–602. Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. Harvey, A. C. 1976. Estimating regression models with multiplicative heteroscedasticity. Econometrica 44: 461–465. 
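Written out with subscripts, the log likelihood given in Methods and formulas above is

\[
\ln L \;=\; \sum_{j\in S} w_j \,\ln \Phi\!\left\{\frac{x_j\beta}{\exp(z_j\gamma)}\right\}
\;+\; \sum_{j\notin S} w_j \,\ln\!\left[\,1-\Phi\!\left\{\frac{x_j\beta}{\exp(z_j\gamma)}\right\}\right]
\]

where \(S\) is the set of observations with \(y_j \neq 0\) and \(w_j\) denotes the optional weights.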
Also see [R] hetprobit postestimation — Postestimation tools for hetprobit [R] logistic — Logistic regression, reporting odds ratios [R] probit — Probit regression [SVY] svy estimation — Estimation commands for survey data [XT] xtprobit — Random-effects and population-averaged probit models [U] 20 Estimation and postestimation commands Title hetprobit postestimation — Postestimation tools for hetprobit Description Remarks and examples Syntax for predict Also see Menu for predict Options for predict Description The following postestimation commands are available after hetprobit: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 2 forecast is not appropriate with svy estimation results. lrtest is not appropriate with svy estimation results. 836 hetprobit postestimation — Postestimation tools for hetprobit 837 Syntax for predict predict type predict type newvar if in , statistic nooffset stub* | newvarreg newvarlnsigma2 if in , scores Description statistic Main probability of a positive outcome; the default linear prediction standard deviation of the error term pr xb sigma These statistics are available both in and out of sample; type predict estimation sample. . . . if e(sample) . . . if wanted only for the Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main pr, the default, calculates the probability of a positive outcome. xb calculates the linear prediction. sigma calculates the standard deviation of the error term. nooffset is relevant only if you specified offset(varname) for hetprobit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xj b rather than as xj b + offsetj . scores calculates equation-level score variables. The first new variable will contain ∂ ln L/∂(xj β). The second new variable will contain ∂ ln L/∂(zj γ). Remarks and examples Once you have fit a model, you can use the predict command to obtain the predicted probabilities for both the estimation sample and other samples; see [U] 20 Estimation and postestimation commands and [R] predict. predict without arguments calculates the predicted probability of a positive outcome. 
With the xb option, predict calculates the index function combination, xj b, where xj are the independent variables in the j th observation and b is the estimated parameter vector. With the sigma option, predict calculates the predicted standard deviation, σj = exp(zj γ). Example 1 We use predict to compute the predicted probabilities and standard deviations based on the model in example 2 in [R] hetprobit to compare these with the actual values: 838 hetprobit postestimation — Postestimation tools for hetprobit . predict phat (option pr assumed; Pr(y)) . gen diff_p = phat - p . summarize diff_p Variable Obs diff_p 1000 . predict sigmahat, sigma Mean -.0107081 . gen diff_s = sigmahat - sigma . summarize diff_s Variable Obs Mean diff_s 1000 .1558882 Std. Dev. .0131869 Std. Dev. .1363698 Also see [R] hetprobit — Heteroskedastic probit model [U] 20 Estimation and postestimation commands Min Max -.0466331 .010482 Min Max .0000417 .4819107 Title histogram — Histograms for continuous and categorical variables Syntax Description Options for use in the discrete case Remarks and examples Also see Menu Options for use in the continuous case Options for use in the continuous and discrete cases References Syntax histogram varname continuous opts if in weight , continuous opts | discrete opts Description Main bin(#) width(#) start(#) set number of bins to # set width of bins to # set lower limit of first bin to # discrete opts Description Main discrete width(#) start(#) specify that data are discrete set width of bins to # set theoretical minimum value to # 839 options 840 histogram — Histograms for continuous and categorical variables Description options Main density fraction frequency percent bar options addlabels addlabopts(marker label options) draw as density; the default draw as fractions draw as frequencies draw as percentages rendition of bars add height labels to bars affect rendition of labels Density plots add a affect add a affect normal normopts(line options) kdensity kdenopts(kdensity options) normal density to the graph rendition of normal density kernel density estimate to the graph rendition of kernel density Add plots add other plots to the histogram addplot(plot) Y axis, X axis, Titles, Legend, Overall, By any options documented in [G-3] twoway options twoway options fweights are allowed; see [U] 11.1.6 weight. Menu Graphics > Histogram Description histogram draws histograms of varname, which is assumed to be the name of a continuous variable unless the discrete option is specified. Options for use in the continuous case Main bin(#) and width(#) are alternatives. They specify how the data are to be aggregated into bins: bin() by specifying the number of bins (from which the width can be derived) and width() by specifying the bin width (from which the number of bins can be derived). If neither option is specified, results are the same as if bin(k) had been specified, where n o k = min sqrt(N ), 10 ln(N )/ln(10) and where N is the (weighted) number of observations. start(#) specifies the theoretical minimum of varname. The default is start(m), where m is the observed minimum value of varname. histogram — Histograms for continuous and categorical variables 841 Specify start() when you are concerned about sparse data, for instance, if you know that varname can have a value of 0, but you are concerned that 0 may not be observed. start(#), if specified, must be less than or equal to m, or else an error will be issued. 
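A short sketch of these continuous-case options using the automobile data (the particular width and starting value are arbitrary choices for illustration):

. use http://www.stata-press.com/data/r13/auto, clear
. histogram mpg, width(2) start(10) frequency

Here each bar covers 2 mpg, and the first bin begins at 10 rather than at the observed minimum of 12.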
Options for use in the discrete case Main discrete specifies that varname is discrete and that you want each unique value of varname to have its own bin (bar of histogram). width(#) is rarely specified in the discrete case; it specifies the width of the bins. The default is width(d), where d is the observed minimum difference between the unique values of varname. Specify width() if you are concerned that your data are sparse. For example, in theory varname could take on the values, say, 1, 2, 3, . . . , 9, but because of the sparseness, perhaps only the values 2, 4, 7, and 8 are observed. Here the default width calculation would produce width(2), and you would want to specify width(1). start(#) is also rarely specified in the discrete case; it specifies the theoretical minimum value of varname. The default is start(m), where m is the observed minimum value. As with width(), specify start(#) if you are concerned that your data are sparse. In the previous example, you might also want to specify start(1). start() does nothing more than add white space to the left side of the graph. The value of # in start() must be less than or equal to m, or an error will be issued. Options for use in the continuous and discrete cases Main density, fraction, frequency, and percent specify whether you want the histogram scaled to density units, fractional units, frequencies, or percentages. density is the default. density scales the height of the bars so that the sum of their areas equals 1. fraction scales the height of the bars so that the sum of their heights equals 1. frequency scales the height of the bars so that each bar’s height is equal to the number of observations in the category. Thus the sum of the heights is equal to the total number of observations. percent scales the height of the bars so that the sum of their heights equals 100. bar options are any of the options allowed by graph twoway bar; see [G-2] graph twoway bar. One of the most useful bar options is barwidth(#), which specifies the width of the bars in varname units. By default, histogram draws the bars so that adjacent bars just touch. If you want gaps between the bars, do not specify histogram’s width() option—which would change how the histogram is calculated—but specify the bar option barwidth() or the histogram option gap, both of which affect only how the bar is rendered. The bar option horizontal cannot be used with the addlabels option. addlabels specifies that the top of each bar be labeled with the density, fraction, or frequency, as determined by the density, fraction, and frequency options. 842 histogram — Histograms for continuous and categorical variables addlabopts(marker label options) specifies how to render the labels atop the bars. See [G-3] marker label options. Do not specify the marker label option mlabel(varname), which specifies the variable to be used; this is specified for you by histogram. addlabopts() will accept more options than those documented in [G-3] marker label options. All options allowed by twoway scatter are also allowed by addlabopts(); see [G-2] graph twoway scatter. One particularly useful option is yvarformat(); see [G-3] advanced options. Density plots normal specifies that the histogram be overlaid with an appropriately scaled normal density. The normal will have the same mean and standard deviation as the data. normopts(line options) specifies details about the rendition of the normal curve, such as the color and style of line used. See [G-2] graph twoway line. 
kdensity specifies that the histogram be overlaid with an appropriately scaled kernel density estimate of the density. By default, the estimate will be produced using the Epanechnikov kernel with an “optimal” half-width. This default corresponds to the default of kdensity; see [R] kdensity. How the estimate is produced can be controlled using the kdenopts() option described below. kdenopts(kdensity options) specifies details about how the kernel density estimate is to be produced along with details about the rendition of the resulting curve, such as the color and style of line used. The kernel density estimate is described in [G-2] graph twoway kdensity. As an example, if you wanted to produce kernel density estimates by using the Gaussian kernel with optimal half-width, you would specify kdenopts(gauss) and if you also wanted a half-width of 5, you would specify kdenopts(gauss width(5)). Add plots addplot(plot) allows adding more graph twoway plots to the graph; see [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall, By twoway options are any of the options documented in [G-3] twoway options. This includes, most importantly, options for titling the graph (see [G-3] title options), options for saving the graph to disk (see [G-3] saving option), and the by() option, which will allow you to simultaneously graph histograms for different subsets of the data (see [G-3] by option). Remarks and examples Remarks are presented under the following headings: Histograms of continuous variables Overlaying normal and kernel density estimates Histograms of discrete variables Use with by() Video example For an example of editing a histogram with the Graph Editor, see Pollock (2011, 29–31). histogram — Histograms for continuous and categorical variables 843 Histograms of continuous variables histogram assumes that the variable is continuous, so you need type only histogram followed by the variable name: 0 5.0e−05 Density 1.0e−04 1.5e−04 2.0e−04 . use http://www.stata-press.com/data/r13/sp500 (S&P 500) . histogram volume (bin=15, start=4103, width=1280.3533) 5,000 10,000 15,000 Volume (thousands) 20,000 25,000 The small values reported for density on the y axis are correct; if you added up the area of the bars, you would get 1. Nevertheless, many people are used to seeing histograms scaled so that the bar heights sum to 1, 844 histogram — Histograms for continuous and categorical variables 0 .05 .1 Fraction .15 .2 .25 . histogram volume, fraction (bin=15, start=4103, width=1280.3533) 5,000 10,000 15,000 Volume (thousands) 20,000 25,000 and others are used to seeing histograms so that the bar height reflects the number of observations, 0 20 Frequency 40 60 . histogram volume, frequency (bin=15, start=4103, width=1280.3533) 5,000 10,000 15,000 Volume (thousands) 20,000 25,000 Regardless of the scale you prefer, you can specify other options to make the graph look more impressive: histogram — Histograms for continuous and categorical variables 845 . summarize volume Variable Obs Mean volume 248 12320.68 Std. Dev. Min Max 2585.929 4103 23308.3 . histogram volume, freq > xaxis(1 2) > ylabel(0(10)60, grid) > xlabel(12321 "mean" > 9735 "-1 s.d." > 14907 "+1 s.d." > 7149 "-2 s.d." > 17493 "+2 s.d." > 20078 "+3 s.d." > 22664 "+4 s.d." > , axis(2) grid gmax) > xtitle("", axis(2)) > subtitle("S&P 500, January 2001 - December 2001") > note("Source: Yahoo! Finance and Commodity Systems, Inc.") (bin=15, start=4103, width=1280.3533) S&P 500, January 2001 − December 2001 mean +1 s.d. +2 s.d. +3 s.d. +4 s.d. 
0 10 20 Frequency 30 40 50 60 −2 s.d. −1 s.d. 5,000 10,000 15,000 Volume (thousands) 20,000 25,000 Source: Yahoo! Finance and Commodity Systems, Inc. For an explanation of the xaxis() option—it created the upper and lower x axis— see [G-3] axis choice options. For an explanation of the ylabel() and xlabel() options, see [G-3] axis label options. For an explanation of the subtitle() and note() options, see [G-3] title options. Overlaying normal and kernel density estimates Specifying normal will overlay a normal density over the histogram. It would be enough to type . histogram volume, normal but we will add the option to our more impressive rendition: . summarize volume Variable Obs Mean volume 248 12320.68 Std. Dev. Min Max 2585.929 4103 23308.3 846 histogram — Histograms for continuous and categorical variables . histogram volume, freq normal > xaxis(1 2) > ylabel(0(10)60, grid) > xlabel(12321 "mean" > 9735 "-1 s.d." > 14907 "+1 s.d." > 7149 "-2 s.d." > 17493 "+2 s.d." > 20078 "+3 s.d." > 22664 "+4 s.d." > , axis(2) grid gmax) > xtitle("", axis(2)) > subtitle("S&P 500, January 2001 - December 2001") > note("Source: Yahoo! Finance and Commodity Systems, Inc.") (bin=15, start=4103, width=1280.3533) S&P 500, January 2001 − December 2001 mean +1 s.d. +2 s.d. +3 s.d. +4 s.d. 0 10 20 Frequency 30 40 50 60 −2 s.d. −1 s.d. 5,000 10,000 15,000 Volume (thousands) 20,000 25,000 Source: Yahoo! Finance and Commodity Systems, Inc. If we instead wanted to overlay a kernel density estimate, we could specify kdensity in place of normal. Histograms of discrete variables Specify histogram’s discrete option when you wish to treat the data as discrete—when you wish each unique value of the variable to be assigned its own bin. For instance, in the automobile data, mpg is a continuous variable, but the mileage ratings have been measured to integer precision. If we were to type . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . histogram mpg (bin=8, start=12, width=3.625) mpg would be treated as continuous and categorized into eight bins by the default number-of-bins calculation, which is based on the number of observations, 74. histogram — Histograms for continuous and categorical variables 847 Adding the discrete option makes a histogram with a bin for each of the 21 unique values. 0 .05 Density .1 .15 . histogram mpg, discrete (start=12, width=1) 10 20 30 40 Mileage (mpg) Just as in the continuous case, the y axis was reported in density, and we could specify the fraction or frequency options if we wanted it to be reported differently. Below we specify frequency, we specify addlabels to add a report of frequencies printed above the bars, we specify ylabel(,grid) to add horizontal grid lines, and we specify xlabel(12(2)42) to label the values 12, 14, . . . , 42 on the x axis: 10 . histogram mpg, discrete freq addlabels ylabel(,grid) xlabel(12(2)42) (start=12, width=1) 9 8 8 Frequency 6 6 5 5 5 4 4 4 4 3 3 3 3 2 2 2 2 2 1 1 1 0 1 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 Mileage (mpg) 848 histogram — Histograms for continuous and categorical variables Use with by() histogram may be used with graph twoway’s by(); for example, . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . histogram mpg, discrete by(foreign) Foreign .1 0 .05 Density .15 .2 Domestic 10 20 30 40 10 20 30 40 Mileage (mpg) Graphs by Car type Here results would be easier to compare if the graphs were presented in one column: . 
histogram mpg, discrete by(foreign, col(1)) 0 .2 Foreign 0 .05 .1 .15 Density .05 .1 .15 .2 Domestic 10 20 30 40 Mileage (mpg) Graphs by Car type col(1) is a by() suboption—see [G-3] by option —and there are other useful suboptions, such as total, which will add an overall total histogram. total is a suboption of by(), not an option of histogram, so you would type . histogram mpg, discrete by(foreign, total) and not histogram mpg, discrete by(foreign) total. histogram — Histograms for continuous and categorical variables 849 As another example, Lipset (1993) reprinted data from the New York Times (November 5, 1992) collected by the Voter Research and Surveys based on questionnaires completed by 15,490 U.S. presidential voters from 300 polling places on election day in 1992. . use http://www.stata-press.com/data/r13/voter . histogram candi [freq=pop], discrete fraction by(inc, total) > gap(40) xlabel(2 3 4, valuelabel) $15−30k $30−50k $50−75k $75k+ Total 0 .6 0 .2 .4 Fraction .2 .4 .6 <$15k Clinton Bush Perot Clinton Bush Perot Clinton Bush Perot Candidate voted for, 1992 Graphs by Family Income We specified gap(40) to reduce the width of the bars by 40%. We also used xlabel()’s valuelabel suboption, which caused our bars to be labeled “Clinton”, “Bush”, and “Perot”, rather than 2, 3, and 4; see [G-3] axis label options. Video example Histograms in Stata References Cox, N. J. 2004. Speaking Stata: Graphing distributions. Stata Journal 4: 66–88. . 2005. Speaking Stata: Density probability plots. Stata Journal 5: 259–273. Harrison, D. A. 2005. Stata tip 20: Generating histogram bin variables. Stata Journal 5: 280–281. Lipset, S. M. 1993. The significance of the 1992 election. PS: Political Science and Politics 26: 7–16. Pollock, P. H., III. 2011. A Stata Companion to Political Analysis. 2nd ed. Washington, DC: CQ Press. 
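As mentioned under Overlaying normal and kernel density estimates, kdensity may be specified in place of normal. A minimal sketch with the same S&P 500 data follows; the choice of the Gaussian kernel is illustrative only:

. use http://www.stata-press.com/data/r13/sp500, clear
. histogram volume, kdensity kdenopts(gauss)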
Also see [R] kdensity — Univariate kernel density estimation [R] spikeplot — Spike plots and rootograms [G-2] graph twoway histogram — Histogram plots Title icc — Intraclass correlation coefficients Syntax Description Options for two-way RE and ME models Stored results References Menu Options for one-way RE model Remarks and examples Methods and formulas Also see Syntax Calculate intraclass correlations for one-way random-effects model icc depvar target if in , oneway options Calculate intraclass correlations for two-way random-effects model icc depvar target rater if in , twoway re options Calculate intraclass correlations for two-way mixed-effects model icc depvar target rater if in , mixed twoway me options oneway options Description Main absolute testvalue(#) estimate absolute agreement; the default test whether intraclass correlations equal #; default is testvalue(0) Reporting level(#) format(% fmt) set confidence level; default is level(95) display format for statistics and confidence intervals; default is format(%9.0g) twoway re options Description Main absolute consistency testvalue(#) estimate absolute agreement; the default estimate consistency of agreement test whether intraclass correlations equal #; default is testvalue(0) Reporting level(#) format(% fmt) set confidence level; default is level(95) display format for statistics and confidence intervals; default is format(%9.0g) 850 icc — Intraclass correlation coefficients twoway me options Main ∗ mixed consistency absolute testvalue(#) 851 Description estimate intraclass correlations for a mixed-effects model estimate consistency of agreement; the default estimate absolute agreement test whether intraclass correlations equal #; default is testvalue(0) Reporting level(#) format(% fmt) ∗ set confidence level; default is level(95) display format for statistics and confidence intervals; default is format(%9.0g) mixed is required. bootstrap, by, jackknife, and statsby are allowed; see [U] 11.1.10 Prefix commands. Menu Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Intraclass correlations Description icc estimates intraclass correlations for one-way random-effects models, two-way random-effects models, or two-way mixed-effects models for both individual and average measurements. Intraclass correlations measuring consistency of agreement or absolute agreement of the measurements may be estimated. Options for one-way RE model Main absolute specifies that intraclass correlations measuring absolute agreement of the measurements be estimated. This is the default for random-effects models. testvalue(#) tests whether intraclass correlations equal #. The default is testvalue(0). Reporting level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [R] level. format(% fmt) specifies how the intraclass correlation estimates and confidence intervals are to be formatted. The default is format(%9.0g). 852 icc — Intraclass correlation coefficients Options for two-way RE and ME models Main mixed is required to calculate two-way mixed-effects models. mixed specifies that intraclass correlations for a mixed-effects model be estimated. absolute specifies that intraclass correlations measuring absolute agreement of the measurements be estimated. This is the default for random-effects models. Only one of absolute or consistency may be specified. 
consistency specifies that intraclass correlations measuring consistency of agreement of the measurements be estimated. This is the default for mixed-effects models. Only one of absolute or consistency may be specified. testvalue(#) tests whether intraclass correlations equal #. The default is testvalue(0). Reporting level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [R] level. format(% fmt) specifies how the intraclass correlation estimates and confidence intervals are to be formatted. The default is format(%9.0g). Remarks and examples Remarks are presented under the following headings: Introduction One-way random effects Two-way random effects Two-way mixed effects Adoption study Relationship between ICCs Tests against nonzero values Introduction In some disciplines, such as psychology and sociology, data are often measured with error that can seriously affect statistical interpretation of the results. Thus it is important to assess the amount of measurement error by evaluating the consistency or reliability of measurements. The intraclass correlation coefficient (ICC) is often used to measure the consistency or homogeneity of measurements. Several versions of ICCs are introduced in the literature depending on the experimental design and goals of the study (see, for example, Shrout and Fleiss [1979] and McGraw and Wong [1996a]). Following Shrout and Fleiss (1979), we describe various forms of ICCs in the context of a reliability study of ratings of different targets (or objects of measurements) by several raters. Consider n targets (for example, students, patients, athletes) that are randomly sampled from a population of interest. Each target is rated independently by a set of k raters (for example, teachers, doctors, judges). One rating per target and rater is obtained. It is of interest to determine the extent of the agreement of the ratings. As noted by Shrout and Fleiss (1979) and McGraw and Wong (1996a), you need to answer several questions to decide what version of ICC is appropriate to measure the agreement in your study: icc — Intraclass correlation coefficients 853 1. Is a one-way or two-way analysis-of-variance model appropriate for your study? 2. Are differences between raters’ mean ratings relevant to the reliability of interest? 3. Is the unit of analysis an individual rating or the mean rating over several raters? 4. Is the consistency of agreement or the absolute agreement of ratings of interest? Three types of analysis-of-variance models are considered for the reliability study: one-way random effects, two-way random effects, and two-way mixed effects. Mixed models contain both fixed effects and random effects. In the one-way random-effects model, each target is rated by a different set of k independent raters, who are randomly drawn from the population of raters. The target is the only random effect in this model; the effects due to raters and possibly due to rater-and-target interaction cannot be separated from random error. In the two-way random-effects model, each target is rated by the same set of k independent raters, who are randomly drawn from the population of raters. The random effects in this model are target and rater and possibly their interaction, although in the absence of repeated measurements for each rater on each target, the effect of an interaction cannot be separated from random error. 
In the two-way mixed-effects model, each target is rated by the same set of k independent raters. Because they are the only raters of interest, rater is a fixed effect. The random effects are target and possibly target-and-rater interaction, but again the interaction effect cannot be separated from random error without repeated measurements for each rater and target. The definition of ICC depends on the chosen random-effects model; see Methods and formulas for details. In summary, use a one-way model if there are no systematic differences in measurements due to raters and use a two-way model otherwise. If you want to generalize your results to a population of raters from which the observed raters are sampled, use a two-way random-effects model, treating raters as random. If you are only interested in the effects of the observed k raters, use a two-way mixed-effects model, treating raters as fixed. For example, suppose you compare judges’ ratings of targets from different groups. If you use the combined data from k judges to compare the groups, the random-effects model is appropriate. If you compare groups separately for each judge and then pool the differences, the mixed-effects model is appropriate. The definition of ICC also depends on the unit of analysis in a study—whether the agreement is measured between individual ratings (individual ICC) or between the averages of ratings over several raters (average ICC). The data on individual ratings are more common. The data on average ratings are typically used when individual ratings are deemed unreliable. The average ICC can also be used when teams of raters are used to rate a target. For example, the ratings of teams of physicians may be evaluated in this manner. When the unit of analysis is an average rating, you should remember that the interpretation of ICC pertains to average ratings and not individual ratings. Finally, depending on whether consistency of agreement or absolute agreement is of interest, two types of ICC are used: consistency-of-agreement ICC (CA-ICC) and absolute-agreement ICC (AA-ICC). Under consistency of agreement, the scores are considered consistent if the scores from any two raters differ by the same constant value for all targets. This implies that raters give the same ranking to all targets. Under absolute agreement, the scores are considered in absolute agreement if the scores from all raters match exactly. For example, suppose we observe three targets and two raters. The ratings are (2,4), (4,6), and (6,8), with rater 1 giving the scores (2,4,6) and rater 2 giving the scores (4,6,8), two points higher than rater 1. The CA-ICC between individual ratings is 1 because the scores from rater 1 and rater 2 differ by a constant value (two points) for all targets. That rater 1 gives lower scores than rater 2 is deemed irrelevant under the consistency measure of agreement. The raters have the same difference of opinion on every target, and the variation between raters that is caused by this difference is not relevant. On the other hand, the AA-ICC between individual ratings is 8/12 = 0.67, where 8 is the estimated between-target variance and 12 is the estimated total variance of ratings. 854 icc — Intraclass correlation coefficients Either CA-ICC or AA-ICC can serve as a useful measure of agreement depending on whether rater variability is relevant for determining the degree of agreement. As McGraw and Wong (1996a) point out, CA-ICC is useful when comparative judgments are made about objects of measurement. 
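The small numerical example above is easy to check directly. The sketch below enters the three targets and two raters by hand (the variable names target, rater, and score are ours) and then computes both individual ICCs; per the text above, the first call reports an individual absolute-agreement estimate of about 0.67 and the second an individual consistency-of-agreement estimate of 1:

. clear
. input target rater score
1 1 2
1 2 4
2 1 4
2 2 6
3 1 6
3 2 8
end
. icc score target rater
. icc score target rater, consistency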
The CA-ICC represents correlation when the rater is fixed; the AA-ICC represents correlation when the rater is random. See Shrout and Fleiss (1979) and McGraw and Wong (1996a) for more detailed guidelines about the choice of appropriate ICC. Shrout and Fleiss (1979) and McGraw and Wong (1996a) describe 10 versions of ICCs based on the concepts above: individual and average AA-ICCs for a one-way model (consistency of agreement is not defined for this model); individual and average AA-ICCs and CA-ICCs for a two-way random-effects model; and individual and average AA-ICCs and CA-ICCs for a two-way mixed-effects model. Although each of these ICCs has its own definition and interpretation, the estimators for some are identical, leading to the same estimates of those ICCs; see Relationship between ICCs and Methods and formulas for details. The icc command calculates ICCs for each of the three analysis-of-variance models. You can use option absolute to compute AA-ICCs or option consistency to compute CA-ICCs. By default, icc computes ICCs corresponding to the correlation between ratings and between average ratings made on the same target: AA-ICC for a random-effects model and CA-ICC for a mixed-effects model. As pointed out by Shrout and Fleiss (1979), although the data on average ratings might be needed for reliability, the generalization of interest might be individuals. For this reason, icc reports ICCs for both units, individual and average, for each model. In addition to estimates of ICCs, icc provides confidence intervals and one-sided F tests. The F test of Ho: ρ = 0 versus Ha: ρ > 0 is the same for the individual and average ICCs, so icc reports one test. This is not true, however, for nonzero null hypotheses (see Tests against nonzero values for details), so icc reports a separate test in this case. The icc command requires data in long form; see [D] reshape for how to convert data in wide form to long form. The data must also be balanced and contain one observation per target and rater. For unbalanced data, icc omits all targets with fewer than k ratings from computation. Under one-way models, k is determined as the largest number of observed ratings for a target. Under two-way models, k is the number of unique raters. If multiple observations per target and rater are detected, icc issues an error. We demonstrate the use of icc using datasets from Shrout and Fleiss (1979) and McGraw and Wong (1996a). In the next three sections, we use an example from table 2 of Shrout and Fleiss (1979) with six targets and four judges. For instructional purposes, we analyze these data under each of the three different models: one-way random effects, two-way random effects, and two-way mixed effects. One-way random effects In the one-way random-effects model, we assume that the n targets being rated are randomly selected from the population of potential targets. Each is rated by a different set of k raters randomly drawn from the population of potential raters. McGraw and Wong (1996a) describe an example of this setting, where behavioral genetics data are used to assess familial resemblance. Family units can be viewed as “targets”, and children can be viewed as “raters”. By taking a measurement on a child of the family unit, we obtain the “rating” of the family unit by the “child-rater”. In this case, we can use ICC to assess similarity between children within a family or, in other words, assess if there is a family effect in these data. 
icc — Intraclass correlation coefficients 855 As we mentioned in the introduction, only AA-ICC is defined for a one-way model. The consistency of agreement is not defined in this case, as each target is evaluated by a different set of raters. Thus there is no between-rater variability in this model. In a one-way model, the AA-ICC corresponds to the correlation coefficient between ratings within a target. It is also a ratio of the between-target variance of ratings to the total variance of ratings, the sum of the between-target and error variances. Example 1: One-way random-effects ICCs Consider data from table 2 of Shrout and Fleiss (1979) stored in judges.dta. The data contain 24 ratings of n = 6 targets by k = 4 judges. We list the first eight observations: . use http://www.stata-press.com/data/r13/judges (Ratings of targets by judges) . list in 1/8, sepby(target) rating target judge 1. 2. 3. 4. 9 2 5 8 1 1 1 1 1 2 3 4 5. 6. 7. 8. 6 1 3 2 2 2 2 2 1 2 3 4 For a moment, let’s ignore that targets are rated by the same set of judges. Instead, we assume that a different set of four judges is used to rate each target. In this case, the only systematic variation in the data is due to targets, so the one-way random-effects model is appropriate. We use icc to estimate the intraclass correlations for these data. To compute ICCs for a one-way model, we specify the dependent variable rating followed by the target variable target: . icc rating target Intraclass correlations One-way random-effects model Absolute agreement Random effects: target Number of targets = Number of raters = rating ICC Individual Average .1657418 .4427971 F test that ICC=0.00: F(5.0, 18.0) = 1.79 6 4 [95% Conf. Interval] -.1329323 -.8844422 .7225601 .9124154 Prob > F = 0.165 Note: ICCs estimate correlations between individual measurements and between average measurements made on the same target. icc reports the AA-ICCs for both individual and average ratings. The individual AA-ICC corresponds to ICC(1) in McGraw and Wong (1996a) or ICC(1,1) in Shrout and Fleiss (1979). The average AA-ICC corresponds to ICC(k ) in McGraw and Wong (1996a) or ICC(1,k ) in Shrout and Fleiss (1979). 856 icc — Intraclass correlation coefficients The estimated correlation between individual ratings is 0.17, indicating little similarity between ratings within a target, low reliability of individual target ratings, or no target effect. The estimated intraclass correlation between ratings averaged over k = 4 judges is higher, 0.44. (The average ICC will typically be higher than the individual ICC.) The estimated intraclass correlation measures the similarity or reliability of mean ratings from groups of four judges. We do not have statistical evidence that either ICC is different from zero based on reported confidence intervals and the one-sided F test. Note that although the estimates of ICCs cannot be negative in this setting, the lower bound of the computed confidence interval may be negative. A common ad-hoc way of handling this is to truncate the lower bound at zero. The estimates of both the individual and the average AA-ICC are also computed by the loneway command (see [R] loneway), which performs a one-way analysis of variance. Technical note Mean rating is commonly used when individual rating is unreliable because the reliability of a mean rating is always higher than the reliability of the individual rating when the individual reliability is positive. 
In the previous example, we estimated low reliability of the individual ratings of a target, 0.17. The reliability increased to 0.44 for the ratings averaged over four judges. What if we had more judges? We can use the Spearman–Brown formula (Spearman 1910; Brown 1910) to compute the m-average ICC based on the individual ICC: ICC(m) = mICC(1) 1 + (m − 1)ICC(1) Using this formula for the previous example, we find that the mean reliability over, say, 10 judges is 10 × 0.17/(1 + 9 × 0.17) = 0.67. Alternatively, we can invert the Spearman–Brown formula to determine the number of judges (or the number of ratings of a target) we need to achieve the desired reliability. Suppose we would like an average reliability of 0.9, then m= ICC(m){(1 − ICC(1))} ICC(1){1 − ICC(m)} = 0.9(1 − 0.17) = 44 0.17(1 − 0.9) See, for example, Bliese (2000) for other examples. Two-way random effects As before, we assume that the targets being rated are randomly selected from the population of potential targets. We now also assume that each target is evaluated by the same set of k raters, who have been randomly sampled from the population of raters. In this scenario, we want to generalize our findings to the population of raters from which the observed k raters were sampled. For example, suppose we want to estimate the reliability of doctors’ evaluations of patients with a certain condition. Unless the reliability at a specific hospital is of interest, the doctors may be interchanged with others in the relevant population of doctors. icc — Intraclass correlation coefficients 857 As for a one-way model, the AA-ICC corresponds to the correlation between measurements on the same target and is also a ratio of the between-target variance to the total variance of measurements in a two-way random-effects model. The total variance is now the sum of the between-target, between-rater, and error variances. Unlike a one-way model, the CA-ICC can be computed for a two-way randomeffects model when the consistency of agreement is of interest rather than the absolute agreement. The CA-ICC is also the ratio of the between-target variance to the total variance, but the total variance does not include the between-rater variance because the between-rater variability is irrelevant for the consistency of agreement. Again, the two versions, individual and average, are available for each ICC. Example 2: Two-way random-effects ICCs Continuing with example 1, recall that we previously ignored that each target is rated by the same set of four judges and instead assumed different sets of judges. We return to the original data setting. We want to evaluate the agreement between judges’ ratings of targets in a population represented by the observed set of four judges. In a two-way model, we must specify both the target and the rater variables. In icc, we now additionally specify the rater variable judge following the target variable target; the random-effects model is assumed by default. . icc rating target judge Intraclass correlations Two-way random-effects model Absolute agreement Random effects: target Random effects: judge Number of targets = Number of raters = rating ICC Individual Average .2897638 .6200505 6 4 [95% Conf. Interval] .0187865 .0711368 .7610844 .927232 F test that ICC=0.00: F(5.0, 15.0) = 11.03 Prob > F = 0.000 Note: ICCs estimate correlations between individual measurements and between average measurements made on the same target. 
As for a one-way random-effects model, icc by default reports AA-ICCs that correspond to the correlation between ratings on a target. Notice that both individual and average ICCs are larger in the two-way random-effects model than in the previous one-way model—0.29 versus 0.17 and 0.62 versus 0.44, respectively. We also have statistical evidence to reject the null hypothesis that neither ICC is zero based on confidence intervals and the F test. If a one-way model is used when a two-way model is appropriate, the true ICC will generally be underestimated. The individual AA-ICC corresponds to ICC(A,1) in McGraw and Wong (1996a) or ICC(2,1) in Shrout and Fleiss (1979). The average AA-ICC corresponds to ICC(A,k ) in McGraw and Wong (1996a) or ICC(2,k ) in Shrout and Fleiss (1979). Instead of the absolute agreement, we can also assess the consistency of agreement. The individual and average CA-ICCs are considered in McGraw and Wong (1996a) and denoted as ICC(C,1) and ICC(C,k ), respectively. These ICCs are not considered in Shrout and Fleiss (1979) because they are not correlations in the strict sense. Although CA-ICCs do not estimate correlation, they can provide useful information about the reliability of the raters. McGraw and Wong (1996a) note that the practical value of the individual and average CA-ICCs in the two-way random-effects model setting is well documented in measurement theory, citing Hartmann (1982) and Suen (1988). 858 icc — Intraclass correlation coefficients To estimate the individual and average CA-ICCs, we specify the consistency option: . icc rating target judge, consistency Intraclass correlations Two-way random-effects model Consistency of agreement Random effects: target Number of targets = Random effects: judge Number of raters = rating ICC Individual Average .7148407 .9093155 F test that ICC=0.00: F(5.0, 15.0) = 11.03 6 4 [95% Conf. Interval] .3424648 .6756747 .9458583 .9858917 Prob > F = 0.000 We estimate that the consistency of agreement of ratings in the considered population of raters is high, 0.71, based on the individual CA-ICC. On the other hand, the absolute agreement of ratings is low, 0.29, based on the individual AA-ICC from the previous output. The measure of consistency of agreement among means, the average CA-ICC, is equivalent to Cronbach’s alpha (Cronbach 1951); see [MV] alpha. The individual CA-ICC can also be equivalent to the Pearson’s correlation coefficient between raters when k = 2; see McGraw and Wong (1996a) for details. In the next example, we will see that the actual estimates of the individual and average AA-ICCs and CA-ICCs are the same whether we examine a random-effects model or a mixed-effects model. The differences between these ICCs are in their definitions and interpretations. Two-way mixed effects As in a two-way random-effects model, we assume that the targets are randomly selected from the population of potential targets and that each is evaluated by the same set of k raters. In a mixed-effects model, however, we assume that these raters are the only raters of interest. So as before, the targets are random, but now the raters are fixed. In the two-way mixed-effects model, the fixed effect of the rater does not contribute to the betweenrater random variance component to the total variance. As such, the definitions and interpretations of ICCs are different in a mixed-effects model than in a random-effects model. However, the estimates of ICCs as well as test statistics and confidence intervals are the same. 
The only exceptions are average AA-ICCs and CA-ICCs. These are not estimable in a two-way mixed-effects model including an interaction term between target and rater; see Relationship between ICCs and Methods and formulas for details. In a two-way mixed-effects model, the CA-ICC corresponds to the correlation between measurements on the same target. As pointed out by Shrout and Fleiss (1979), when the rater variance is ignored, the correlation coefficient is interpreted in terms of rater consistency rather than rater absolute agreement. Formally, the CA-ICC is the ratio of the covariance between measurements on the target to the total variance of the measurements. The AA-ICC corresponds to the same ratio, but includes a variance of the fixed factor, rater, in its denominator. icc — Intraclass correlation coefficients 859 Example 3: Two-way mixed-effects ICCs Continuing with example 2, suppose that we are now interested in assessing the agreement of ratings from only the observed four judges. The judges are now fixed effects, and the appropriate model is a two-way mixed-effects model. To estimate ICCs for a two-way mixed-effects model, we specify the mixed option with icc: . icc rating target judge, mixed Intraclass correlations Two-way mixed-effects model Consistency of agreement Random effects: target Number of targets = Fixed effects: judge Number of raters = rating ICC Individual Average .7148407 .9093155 6 4 [95% Conf. Interval] .3424648 .6756747 .9458583 .9858917 F test that ICC=0.00: F(5.0, 15.0) = 11.03 Prob > F = 0.000 Note: ICCs estimate correlations between individual measurements and between average measurements made on the same target. As we described in the introduction, icc by default reports ICCs corresponding to the correlations. So, for a mixed-effects model, icc reports CA-ICCs by default. The individual and average CA-ICCs are denoted as ICC(3,1) and ICC(3,k ) in Shrout and Fleiss (1979) and ICC(C,1) and ICC(C ,k ) in McGraw and Wong (1996a). Our estimates of the individual and average CA-ICCs are identical to the CA-ICC estimates obtained under the two-way random-effects model in example 2, but our interpretation of the results is different. Under a mixed-effects model, 0.71 and 0.91 are the estimates, respectively, of the correlation between individual measurements and the correlation between average measurements made on the same target. We can also estimate the AA-ICCs in this setting by specifying the absolute option: . icc rating target judge, mixed absolute Intraclass correlations Two-way mixed-effects model Absolute agreement Random effects: target Number of targets = Fixed effects: judge Number of raters = rating ICC Individual Average .2897638 .6200505 F test that ICC=0.00: F(5.0, 15.0) = 11.03 6 4 [95% Conf. Interval] .0187865 .0711368 .7610844 .927232 Prob > F = 0.000 The intraclass correlation estimates match the individual and average AA-ICCs obtained under the two-way random-effects model in example 2; but in a mixed-effects model, they do not represent correlations. We demonstrate the use of an individual AA-ICC in a mixed-effects setting in the next example. The AA-ICCs under a mixed-effects model are not considered by Shrout and Fleiss (1979). They are denoted as ICC(A,1) and ICC(A,k ) in McGraw and Wong (1996a). 860 icc — Intraclass correlation coefficients Adoption study In this section, we consider the adoption study described in McGraw and Wong (1996a). Adoption studies commonly include two effects of interest. 
One is the mean difference between the adopted child and its biological parents. It is used to determine if characteristics of adopted children differ on average from those of their biological parents. Another effect of interest is the correlation between genetically paired individuals and genetically unrelated individuals who live together. This effect is used to evaluate the impact of genetic differences on individual differences. As discussed in McGraw and Wong (1996a), a consistent finding from adoption research using IQ as a trait characteristic is that while adopted children typically have higher IQs than their biological parents, their IQs correlate better with those of their biological parents than with those of their adoptive parents. Both effects are important, and there is additional need to reconcile the two findings. McGraw and Wong (1996a) propose to use the individual AA-ICC for this purpose. Example 4: Absolute-agreement ICC in a mixed-effects model The adoption.dta dataset contains the data from table 6 of McGraw and Wong (1996a) on IQ scores: . use http://www.stata-press.com/data/r13/adoption (Biological mother and adopted child IQ scores) . describe Contains data from http://www.stata-press.com/data/r13/adoption.dta obs: 20 Biological mother and adopted child IQ scores vars: 5 15 May 2013 13:50 size: 160 (_dta has notes) variable name storage type display format value label mcvalues family mc iq3 byte byte int %9.0g %9.0g %9.0g iq9 int %9.0g iq15 int %9.0g variable label Adoptive family ID 1=Mother, 2=Child IQ scores, mother-child difference of 3 pts IQ scores, mother-child difference of 9 pts IQ scores, mother-child difference of 15 pts Sorted by: The family variable contains adoptive family identifiers, the mc variable records a mother or a child, and the iq3, iq9, and iq15 variables record IQ scores with differences between mother and child mean IQ scores of 3, 9, and 15 points, respectively. icc — Intraclass correlation coefficients 861 . by mc, sort: summarize iq* -> mc = Mother Variable Obs Mean 10 10 10 97 91 85 Variable Obs iq3 iq9 iq15 10 10 10 iq3 iq9 iq15 Std. Dev. Min Max 15.0037 15.0037 15.0037 62 56 50 116 110 104 Mean Std. Dev. Min Max 100 100 100 15.0037 15.0037 15.0037 65 65 65 119 119 119 -> mc = Child The variances of the mother and child IQ scores are the same. Children are fixed effects, so the mixed-effects model is appropriate for these data. We want to compare individual CA-ICC with individual AA-ICC for each of the three IQ variables. We could issue a separate icc command for each of the three IQ variables to obtain the intraclass correlations. Instead, we use reshape to convert our data to long form with one iq variable and the new diff variable recording mean differences: . reshape long iq, i(family mc) j(diff) (note: j = 3 9 15) Data wide Number of obs. Number of variables j variable (3 values) xij variables: -> long 20 5 -> -> -> 60 4 diff iq3 iq9 iq15 -> iq We can now use the by prefix with icc to estimate intraclass correlations for the three groups of interest: 862 icc — Intraclass correlation coefficients . by diff, sort: icc iq family mc, mixed -> diff = 3 Intraclass correlations Two-way mixed-effects model Consistency of agreement Random effects: family Fixed effects: mc Number of targets = Number of raters = iq ICC Individual Average .7142152 .8332853 F test that ICC=0.00: F(9.0, 9.0) = 6.00 10 2 [95% Conf. 
Interval] .1967504 .3288078 .920474 .9585904 Prob > F = 0.007 Note: ICCs estimate correlations between individual measurements and between average measurements made on the same target. -> diff = 9 Intraclass correlations Two-way mixed-effects model Consistency of agreement Random effects: family Fixed effects: mc Number of targets = Number of raters = iq ICC Individual Average .7142152 .8332853 10 2 [95% Conf. Interval] .1967504 .3288078 .920474 .9585904 F test that ICC=0.00: F(9.0, 9.0) = 6.00 Prob > F = 0.007 Note: ICCs estimate correlations between individual measurements and between average measurements made on the same target. -> diff = 15 (output omitted ) The estimated CA-ICCs are the same in all three groups and are equal to the corresponding estimates of the Pearson’s correlation coefficients because mothers’ and childrens’ IQ scores have the same variability. The scores differ only in means, and mean differences are irrelevant when measuring the consistency of agreement. icc — Intraclass correlation coefficients 863 The AA-ICCs, however, differ across the three groups: . by diff, sort: icc iq family mc, mixed absolute -> diff = 3 Intraclass correlations Two-way mixed-effects model Absolute agreement Random effects: family Fixed effects: mc Number of targets = Number of raters = iq ICC Individual Average .7204023 .8374812 F test that ICC=0.00: F(9.0, 9.0) = 6.00 -> diff = 9 Intraclass correlations Two-way mixed-effects model Absolute agreement Random effects: family Fixed effects: mc [95% Conf. Interval] .2275148 .3706917 iq ICC Individual Average .6203378 .7656895 F test that ICC=0.00: F(9.0, 9.0) = 6.00 10 2 [95% Conf. Interval] .0293932 .0571077 .8905025 .9420802 Prob > F = 0.007 Number of targets = Number of raters = iq ICC Individual Average .4854727 .6536272 F test that ICC=0.00: F(9.0, 9.0) = 6.00 .9217029 .9592564 Prob > F = 0.007 Number of targets = Number of raters = -> diff = 15 Intraclass correlations Two-way mixed-effects model Absolute agreement Random effects: family Fixed effects: mc 10 2 10 2 [95% Conf. Interval] -.1194157 -.2712191 .8466905 .9169815 Prob > F = 0.007 As the mean differences increase, the AA-ICCs decrease. Their attenuation reflects the difference in means between biological mother and child IQs while still measuring their agreement. Notice that for small mean differences, the estimates of AA-ICCs and CA-ICCs are very similar. Note that our estimates match those given in McGraw and Wong (1996b), who correct the original table 6 of McGraw and Wong (1996a). 864 icc — Intraclass correlation coefficients Relationship between ICCs In examples 2 and 3, we saw that the estimates of AA-ICCs and CA-ICCs are the same for twoway random-effects and two-way mixed-effects models. In this section, we consider the relationship between various forms of ICCs in more detail; also see Methods and formulas. There are 10 different versions of ICCs, but only six different estimators are needed to compute them. These estimators include the two estimators for the individual and average AA-ICCs in a one-way model, the two estimators for the individual and average AA-ICCs in two-way models, and the two estimators for the individual and average CA-ICCs in two-way models. Only individual and average AA-ICCs are defined for the one-way model. The estimates of AA-ICCs based on the one-way model will typically be smaller than individual and average estimates of AA-ICCs and CA-ICCs based on two-way models. 
The estimates of individual and average CA-ICCs will typically be larger than the estimates of individual and average AA-ICCs. Although AA-ICCs and CA-ICCs have the same respective estimators in two-way random-effects and mixed-effects models, their definitions and interpretations are different. The AA-ICCs based on a random-effects model contain the between-rater variance component in the denominator of the variance ratio. The AA-ICCs based on a mixed-effects model contain the variance of the fixed-factor rater instead of the random between-rater variability. The AA-ICCs in a random-effects model represent correlations between any two measurements made on a target. The AA-ICCs in a mixed-effects model measure absolute agreement of measurements treating raters as fixed. The CA-ICCs for random-effects and mixed-effects models have the same definition but different interpretations. The CA-ICCs represent correlations between any two measurements made on a target in a mixed-effects model but estimate the degree of consistency among measurements treating raters as random in a random-effects model. The difference in the definitions of AA-ICCs and CA-ICCs is that CA-ICCs do not contain the between-rater variance in the denominator of the variance ratio. For two-way models, the definitions and interpretations (but not the estimators) of ICCs also depend on whether the model contains an interaction between target and rater. For two-way models with interaction, ICCs include an additional variance component for the target-rater interaction in the denominator of the variance ratio. This component cannot be separated from random error because there is only one observation per target and rater. Also, under a two-way mixed-effects model including interaction, the interaction components are not mutually independent, as they are in a two-way random-effects model. The considered version of the mixed-effects model places a constraint on the interaction effects—the sum of the interaction effects over levels of the fixed factor is zero; see, for example, chapter 7 in Kuehl (2000) for an introductory discussion of mixed models. In this version of the model, there is a correlation between the interaction effects. Specifically, the two interaction effects for the same target and two different raters are negatively correlated. As a result, the estimated intraclass correlation can be negative under a two-way mixed-effects model with interaction. Also, average AA-ICC and average CA-ICC cannot be estimated in a two-way mixed-effects model including interaction; see Methods and formulas and McGraw and Wong (1996a) for details. Tests against nonzero values It may be of interest to test whether the intraclass correlation is equal to a value other than zero. icc supports testing against positive values through the use of the testvalue() option. Specifying testvalue(#) provides a one-sided hypothesis test of Ho : ρ = # versus Ha : ρ > #. The test is provided separately for both individual and average ICCs. icc — Intraclass correlation coefficients 865 Example 5: Testing ICC against a nonzero value We return to the two-way random-effects model for the judge and target data from Shrout and Fleiss (1979). Suppose we want to test whether the individual and average AA-ICCs are each equal to 0.2. We specify the testvalue(0.2) option with icc: . use http://www.stata-press.com/data/r13/judges, clear (Ratings of targets by judges) . 
icc rating target judge, testvalue(0.2) Intraclass correlations Two-way random-effects model Absolute agreement Random effects: target Number of targets = Random effects: judge Number of raters = rating ICC Individual Average .2897638 .6200505 6 4 [95% Conf. Interval] .0187865 .0711368 .7610844 .927232 F test that ICC(1)=0.20: F(5.0, 5.3) = 1.54 Prob > F = 0.317 ICC(k)=0.20: F(5.0, 9.4) = 4.35 Prob > F = 0.026 Note: ICCs estimate correlations between individual measurements and between average measurements made on the same target. We reject the null hypothesis that the average AA-ICC, labeled as ICC(k ) in the output, is equal to 0.2, but we do not have statistical evidence to reject the null hypothesis that the individual AA-ICC, labeled as ICC(1), is equal to 0.2. Stored results icc stores the following in r(): Scalars r(N target) r(N rater) r(icc i) r(icc i F) r(icc i df1) r(icc i df2) r(icc i p) r(icc i lb) r(icc i ub) r(icc avg) r(icc avg F) r(icc avg df1) r(icc avg df2) r(icc avg p) r(icc avg lb) r(icc avg ub) r(testvalue) r(level) Macros r(model) r(depvar) r(target) r(rater) r(type) number of targets number of raters intraclass correlation for individual measurements F test statistic for individual ICC numerator degrees of freedom for r(icc i F) denominator degrees of freedom for r(icc i F) p-value for F test of individual ICC lower endpoint for confidence intervals of individual ICC upper endpoint for confidence intervals of individual ICC intraclass correlation for average measurements F test statistic for average ICC numerator degrees of freedom for r(icc avg F) denominator degrees of freedom for r(icc avg F) p-value for F test of average ICC lower endpoint for confidence intervals of average ICC upper endpoint for confidence intervals of average ICC null hypothesis value confidence level analysis-of-variance model name of dependent variable target variable rater variable type of ICC estimated (absolute or consistency) 866 icc — Intraclass correlation coefficients Methods and formulas We observe yij , where i = 1, . . . , n and j = 1, . . . , k . yij is the j th rating on the ith target. Let α = 1 − l/100, where l is the significance level specified by the user. Methods and formulas are presented under the following headings: Mean squares One-way random effects Two-way random effects Two-way mixed effects Mean squares The mean squares within targets are WMS where y i· = P j = X X (yij − y )2 i· n(k − 1) i j yij /k . The mean squares between targets are BMS where y ·· = P i = X (y − y )2 i· ·· n − 1 i y i· /n. These are the only mean squares needed to estimate ICC in the one-way random-effects model. For the two-way models, we need two additional mean squares. The mean squares between raters are JMS where y ·j = P i yij /n and y ·· = P j = X (y ·j − y ·· )2 k−1 j y ·j /k . The residual or error mean square is P P EMS = i j (yij − y)2 − (k − 1)JMS − (n − 1)BMS (n − 1)(k − 1) icc — Intraclass correlation coefficients 867 One-way random effects Under the one-way random-effects model, we observe yij = µ + ri + ij (M1) where µ is the mean rating, ri is the target random effect, and ij is random error. The ri s are i.i.d. N (0, σr2 ); ij s are i.i.d. N (0, σ2 ) and are independent of ri s. There is no rater effect separate from the residual error because each target is evaluated by a different set of raters. 
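Because the one-way estimators given next are simple functions of the between-target and within-target mean squares, they can be reproduced by hand from a one-way ANOVA. Below is a minimal sketch using the judges data from example 5 purely as an arithmetic illustration; the local macros and the hard-coded k = 4 are ours and are not part of icc.

. use http://www.stata-press.com/data/r13/judges, clear
. anova rating target
. local BMS = e(mss)/e(df_m)        // between-target mean square
. local WMS = e(rss)/e(df_r)        // within-target mean square
. local k = 4                       // ratings per target
. display "ICC(1) = " (`BMS' - `WMS')/(`BMS' + (`k' - 1)*`WMS')
. display "ICC(k) = " (`BMS' - `WMS')/`BMS'

These values should match what icc rating target reports for a one-way model, because that estimator is exactly this ratio of mean squares.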
The individual AA-ICC is the correlation between individual measurements on the same target: ρ1 = ICC(1) = Corr(yij , yij 0 ) = σr2 σr2 + σ2 The average AA-ICC is the correlation between average measurements of size k made on the same target: ρk = ICC(k) = Corr(y i. , y 0i. ) = σr2 σr2 + σ2 /k They are estimated by d ρb1 = ICC (1) = BMS − WMS BMS + (k − 1)WMS BMS − WMS d ρbk = ICC (k) = BMS Confidence intervals. Let Fobs = BMS/WMS, let Fl be the (1 − α/2) × 100th percentile of the Fn−1,n(k−1) distribution, and let Fu be the (1 − α/2) × 100th percentile of the Fn(k−1),n−1 distribution. Let FL = Fobs /Fl and FU = Fobs Fu . A (1 − α) × 100% confidence interval for ρ1 is FU − 1 FL − 1 , Fl + k − 1 FU + k − 1 (1) A (1 − α) × 100% confidence interval for ρk is 1 1 1− ,1 − FL FU (2) Hypothesis tests. Consider a one-sided hypothesis test of Ho: ICC = ρ0 versus Ha: ICC > ρ0 . The test statistic for ρ1 is Fρ1 = BMS WMS 1 − ρ0 1 + (k − 1)ρ0 (3) The test statistic for ρk is Fρk = BMS WMS (1 − ρ0 ) (4) Under the null hypothesis, both Fρ1 and Fρk have the Fn−1,n(k−1) distribution. When ρ0 = 0, the two test statistics coincide. 868 icc — Intraclass correlation coefficients Two-way random effects In this setting, the target is evaluated by the same set of raters, who are randomly drawn from the population of raters. The underlying models with and without interaction are yij = µ + ri + cj + (rc)ij + ij yij = µ + ri + cj + ij (M2) (M2A) where yij is the rating of the ith target by the j th rater, µ is the mean rating, ri is the target random effect, cj is the rater random effect, (rc)ij is the target-rater random effect, and ij is random 2 ), and ij s are error. The ri s are i.i.d. N (0, σr2 ), cj s are i.i.d. N (0, σc2 ), (rc)ij s are i.i.d. N (0, σrc 2 i.i.d. N (0, σ ). Each effect is mutually independent of the others. Below we provide formulas for ICCs for model (M2). The corresponding ICCs for model (M2A) 2 = 0. can be obtained by setting σrc The individual AA-ICC is the correlation between individual measurements on the same target: ρA,1 = ICC(A,1) = Corr(yij , yij 0 ) = σr2 2 + σ2 ) σr2 + σc2 + (σrc The average AA-ICC is the correlation between average measurements of size k made on the same target: ρA,k = ICC(A,k ) = Corr(y i. , y 0i. ) = σr2 2 + σ 2 )/k σr2 + (σc2 + σrc The consistency-of-agreement intraclass correlation for individual measurements, individual CA-ICC, is ρC,1 = ICC(C ,1) = σr2 σr2 2 + σ2 ) + (σrc The consistency-of-agreement intraclass correlation for average measurements of size k , average CA-ICC, is ρC,k = ICC(C ,k ) = σr2 2 + σ 2 )/k σr2 + (σrc 2 With one observation per target and rater, σrc and σ2 cannot be estimated separately. The estimators of intraclass correlations, confidence intervals, and test statistics are the same for models (M2) and (M2A). The estimators of ICCs are icc — Intraclass correlation coefficients d ρbA,1 = ICC( A,1) = 869 BMS − EMS − 1)EMS + nk (JMS − EMS) BMS − EMS d ρbA,k = ICC( A, k ) = 1 BMS + n (JMS − EMS) BMS − EMS d ρbC,1 = ICC( C ,1) = BMS + (k − 1)EMS BMS − EMS d ρbC,k = ICC( C ,k) = BMS + (k BMS Confidence intervals. Let a = kb ρA,1 /{n(1 − ρbA,1 )}, b = 1 + kb ρA,1 (n − 1)/{n(1 − ρbA,1 )}, and v= (aJMS + bEMS)2 a2 JMS2 k−1 + b2 EMS2 (n−1)(k−1) (5) Let Fl be the (1 −α/2)× 100th percentile of the Fn−1,v distribution and Fu be the (1 −α/2)× 100th percentile of the Fv,n−1 distribution. 
A (1 − α) × 100% confidence interval for ρA,1 is given by (L, U), where

$$
L = \frac{n(\mathrm{BMS} - F_l\,\mathrm{EMS})}{F_l\{k\,\mathrm{JMS} + (kn - k - n)\,\mathrm{EMS}\} + n\,\mathrm{BMS}}
\qquad
U = \frac{n(F_u\,\mathrm{BMS} - \mathrm{EMS})}{k\,\mathrm{JMS} + (kn - k - n)\,\mathrm{EMS} + n\,F_u\,\mathrm{BMS}}
\tag{6}
$$

A (1 − α) × 100% confidence interval for ρA,k is a special case of (6) with k = 1, where a = ρ̂A,k/{n(1 − ρ̂A,k)}, b = 1 + ρ̂A,k(n − 1)/{n(1 − ρ̂A,k)}, and v is defined in (5).

To define confidence intervals for ρC,1 and ρC,k, let Fobs = BMS/EMS, let Fl be the (1 − α/2) × 100th percentile of the Fn−1,(n−1)(k−1) distribution, and let Fu be the (1 − α/2) × 100th percentile of the F(n−1)(k−1),n−1 distribution. Let FL = Fobs/Fl and FU = Fobs Fu. (1 − α) × 100% confidence intervals for ρC,1 and ρC,k are then as given by (1) and (2) for model (M1).

Hypothesis tests. Consider a one-sided hypothesis test of Ho: ICC = ρ0 versus Ha: ICC > ρ0. Let a = kρ0/{n(1 − ρ0)} and b = 1 + kρ0(n − 1)/{n(1 − ρ0)}. The test statistic for ρA,1 is

$$
F_{\rho_{A,1}} = \frac{\mathrm{BMS}}{a\,\mathrm{JMS} + b\,\mathrm{EMS}}
$$

Under the null hypothesis, FρA,1 has the Fn−1,v distribution, where v is defined in (5). The test statistic for ρA,k is defined similarly, except a = ρ0/{n(1 − ρ0)} and b = 1 + ρ0(n − 1)/{n(1 − ρ0)}. Under the null hypothesis, FρA,k has the Fn−1,v distribution, where v is defined in (5). When ρ0 = 0, then a = 0, b = 1, and the two test statistics coincide.

The test statistics for ρC,1 and ρC,k are defined by (3) and (4), respectively, with WMS replaced by EMS. Under the null hypothesis, both FρC,1 and FρC,k have the Fn−1,(n−1)(k−1) distribution. They also both have the same value when ρ0 = 0.

Two-way mixed effects

In this setting, every target is evaluated by the same set of judges, who are the only judges of interest. The underlying models with and without interaction are

$$
y_{ij} = \mu + r_i + c_j + (rc)_{ij} + \epsilon_{ij} \tag{M3}
$$
$$
y_{ij} = \mu + r_i + c_j + \epsilon_{ij} \tag{M3A}
$$

where yij is the rating of the ith target by the jth rater, µ is the mean rating, ri is the target random effect, cj is the rater effect (treated as fixed), (rc)ij is an interaction effect between target and rater, and εij is random error. The ri's are i.i.d. N(0, σr²), the (rc)ij's are N(0, σrc²), and the εij's are i.i.d. N(0, σε²). Each random effect is mutually independent of the others. The cj's are fixed such that Σj cj = 0. The variance of the cj's is θc² = Σj cj²/(k − 1).

In the presence of an interaction, two versions of a mixed-effects model may be considered. One assumes that the (rc)ij's are i.i.d. N(0, σrc²). Another assumes that the (rc)ij's are N(0, σrc²) with the additional constraint that Σj (rc)ij = 0 (for example, Kuehl [2000]), so only interaction terms involving different targets are independent. The latter model is considered here.

We now define the intraclass correlations for individual measurements for model (M3). The individual CA-ICC, the correlation between individual measurements on the same target, is

$$
\rho_{C,1} = \mathrm{ICC}(C,1) = \mathrm{Corr}(y_{ij}, y_{ij'}) = \frac{\sigma_r^2 - \sigma_{rc}^2/(k-1)}{\sigma_r^2 + (\sigma_{rc}^2 + \sigma_\epsilon^2)}
$$

The absolute-agreement intraclass correlation for individual measurements, individual AA-ICC, is

$$
\rho_{A,1} = \mathrm{ICC}(A,1) = \frac{\sigma_r^2 - \sigma_{rc}^2/(k-1)}{\sigma_r^2 + \theta_c^2 + (\sigma_{rc}^2 + \sigma_\epsilon^2)}
$$

Shrout and Fleiss (1979) show that the individual ICC could be negative in this case, a phenomenon first pointed out by Sitgreaves (1960). This can happen when the interaction term has a high variance relative to the targets and there are not many raters.

The individual intraclass correlations for model (M3A) have similar definitions with σrc² = 0.
The individual CA-ICC is the correlation between individual measurements on the same target, Corr(yij , yij 0 ). We now discuss the intraclass correlations that correspond to average measurements. Neither average AA-ICC, ρA,k , nor average CA-ICC, ρC,k , can be estimated under model (M3) (Shrout and Fleiss 1979; McGraw and Wong 1996a). The problem is that in this model, σr2 , which is the covariance between two means based on k raters, cannot be estimated. 2 Specifically, the parameter P σr appears only in the expectation of the between-target mean squares (rc) ij = 0, j BMS. Under the restriction E(BMS) = kσr2 + σ2 2 Note that σrc does not appear in the expectation of between-target mean squares. With one 2 2 observation per target and rater, σrc and σ2 cannot be estimated separately (only their sum σrc + σ2 can be estimated), so BMS alone cannot be used to estimate σr2 . icc — Intraclass correlation coefficients 871 Under model (M3A), however, there is no interaction (and thus no interaction variance component 2 σrc ), so ρA,k or ρC,k can be estimated. The average AA-ICC, the absolute-agreement intraclass correlation for average measurements of size k , is ρA,k = ICC(A,k ) = σr2 σr2 + (θc2 + σ2 )/k The average CA-ICC, the correlation between average measurements of size k made on the same target, is ρC,k = ICC(C ,k ) = Corr(y i. , y 0i. ) = σr2 σr2 + σ2 /k The estimators of ICCs, their confidence intervals, and hypothesis tests are as described for two-way random-effects models, except ρA,k and ρC,k are not defined under model (M3). References Bliese, P. D. 2000. Within-group agreement, non-independence, and reliability: Implications for data aggregation and analysis. In Multilevel Theory, Research, and Methods in Organizations: Foundations, Extensions, and New Directions, ed. K. J. Klein and S. W. J. Kozlowski, 349–381. San Francisco: Jossey-Bass. Brown, W. 1910. Some experimental results in the correlation of mental abilities. British Journal of Psychology 3: 296–322. Cronbach, L. J. 1951. Coefficient alpha and the internal structure of tests. Psychometrika 16: 297–334. Hartmann, D. P. 1982. Assessing the dependability of observational data. In Using Observers to Study Behavior, 51–65. San Francisco: Jossey-Bass. Kuehl, R. O. 2000. Design of Experiments: Statistical Principles of Research Design and Analysis. 2nd ed. Belmont, CA: Duxbury. McGraw, K. O., and S. P. Wong. 1996a. Forming inferences about some intraclass correlation coefficients. Psychological Methods 1: 30–46. . 1996b. Forming inferences about some intraclass correlation coefficients: Correction. Psychological Methods 1: 390. Shrout, P. E., and J. L. Fleiss. 1979. Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin 86: 420–428. Sitgreaves, R. 1960. Book reviews: Intraclass Correlation and the Analysis of Variance, Ernest A. Haggard. Journal of the American Statistical Association 55: 384–385. Spearman, C. E. 1910. Correlation calculated from faulty data. British Journal of Psychology 3: 271–295. Suen, H. K. 1988. Agreement, reliability, accuracy, and validity: Toward a clarification. Behavioral Assessment 10: 343–366. 
Also see [R] anova — Analysis of variance and covariance [R] correlate — Correlations (covariances) of variables or coefficients [R] loneway — Large one-way ANOVA, random effects, and reliability [MV] alpha — Compute interitem correlations (covariances) and Cronbach’s alpha Title inequality — Inequality measures Remarks and examples References Remarks and examples Stata does not have commands for inequality measures, except roctab has an option to report Gini and Pietra indices; see [R] roctab. Stata users, however, have developed an excellent suite of commands, many of which have been published in the Stata Journal (SJ) and in the Stata Technical Bulletin (STB). Issue Insert Author(s) Command Description SJ-12-3 st0266 I. Almås, T. Havnes, M. Mogstad adgini Adjusting for age effects in cross-sectional distributions STB-48 gr35 N. J. Cox psm, qsm, pdagum, qdagum Diagnostic plots for assessing Singh–Maddala and Dagum distributions fit by MLE SJ-11-3 st0237 A. Doris, D. O’Neill, O. Sweetman gmmcovearn GMM estimation of the covariance structure of longitudinal data STB-23 sg31 R. Goldstein rspread Measures of diversity: Absolute and relative STB-48 sg104 S. P. Jenkins sumdist, xfrac, ineqdeco, geivars, ineqfac, povdeco Analysis of income distributions STB-48 sg106 S. P. Jenkins smfit, dagumfit Fitting Singh–Maddala and Dagum distributions by maximum likelihood STB-51 sg115 D. Jolliffe, B. Krushelnytskyy ineqerr Bootstrap standard errors for indices of inequality STB-51 sg117 D. Jolliffe, A. Semykina sepov Robust standard errors for the Foster–Greer– Thorbecke class of poverty indices SJ-8-4 st0100 1 A. López-Feldman descogini Decomposing inequality and obtaining marginal effects SJ-6-4 snp15 7 R. Newson somersd Gini coefficient is a special case of Somers’ D SJ-7-2 gr0001 3 S. P. Van Kerm, P. Jenkins glcurve Generalized Lorenz curves and related graphs STB-48 sg108 P. Van Kerm poverty Computing poverty indices STB-23 sg30 E. Whitehouse lorenz, inequal, atkinson, relsgini Measures of inequality in Stata More commands may be available; enter Stata and type search inequality measure, historical. 872 inequality — Inequality measures 873 Max Otto Lorenz (1876–1959) was born in Iowa and studied at the Universities of Iowa and Wisconsin. He proposed what is now known as the Lorenz curve in 1905. Lorenz worked for the Interstate Commerce Commission between 1911 and 1944, mainly with transportation data. His hobbies included calendar reform and Interlingua, a proposed international language. To download and install the Jenkins and Van Kerm glcurve command from the Internet, for instance, you could 1. Select Help > SJ and User-written Programs. 2. Click on Stata Journal. 3. Click on sj7-2. 4. Click on gr0001 3. 5. Click on click here to install. or you could instead do the following: 1. Navigate to the appropriate SJ issue: a. Type net from http://www.stata-journal.com/software Type net cd sj7-2 or b. Type net from http://www.stata-journal.com/software/sj7-2 2. Type net describe gr0001 3 3. Type net install gr0001 3 To download and install the Jenkins sumdist command from the Internet, for instance, you could 1. Select Help > SJ and User-written Programs. 2. Click on STB. 3. Click on stb48. 4. Click on sg104. 5. Click on click here to install. or you could instead do the following: 1. Navigate to the appropriate STB issue: a. Type net from http://www.stata.com Type net cd stb Type net cd stb48 or b. Type net from http://www.stata.com/stb/stb48 2. Type net describe sg104 3. 
Type net install sg104 874 inequality — Inequality measures References Almås, I., T. Havnes, and M. Mogstad. 2012. Adjusting for age effects in cross-sectional distributions. Stata Journal 12: 393–405. Cox, N. J. 1999. gr35: Diagnostic plots for assessing Singh–Maddala and Dagum distributions fitted by MLE. Stata Technical Bulletin 48: 2–4. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 72–74. College Station, TX: Stata Press. Doris, A., D. O’Neill, and O. Sweetman. 2011. GMM estimation of the covariance structure of longitudinal data on earnings. Stata Journal 11: 439–459. Goldstein, R. 1995. sg31: Measures of diversity: Absolute and relative. Stata Technical Bulletin 23: 23–26. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 150–154. College Station, TX: Stata Press. Haughton, J. H., and S. R. Khandker. 2009. Handbook on Poverty + Inequality. Washington, DC: World Bank. Jenkins, S. P. 1999a. sg104: Analysis of income distributions. Stata Technical Bulletin 48: 4–18. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 243–260. College Station, TX: Stata Press. . 1999b. sg106: Fitting Singh–Maddala and Dagum distributions by maximum likelihood. Stata Technical Bulletin 48: 19–25. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 261–268. College Station, TX: Stata Press. Jenkins, S. P., and P. Van Kerm. 1999a. sg107: Generalized Lorenz curves and related graphs. Stata Technical Bulletin 48: 25–29. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 269–274. College Station, TX: Stata Press. . 1999b. sg107.1: Generalized Lorenz curves and related graphs. Stata Technical Bulletin 49: 23. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 171. College Station, TX: Stata Press. . 2001. Generalized Lorenz curves and related graphs: An update for Stata 7. Stata Journal 1: 107–112. . 2004. gr0001 1: Software Updates: Generalized Lorenz curves and related graphs. Stata Journal 4: 490. . 2006. gr0001 2: Software Updates: Generalized Lorenz curves and related graphs. Stata Journal 6: 597. . 2007. gr0001 3: Software Updates: Generalized Lorenz curves and related graphs. Stata Journal 7: 280. Jolliffe, D., and B. Krushelnytskyy. 1999. sg115: Bootstrap standard errors for indices of inequality. Stata Technical Bulletin 51: 28–32. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 191–196. College Station, TX: Stata Press. Jolliffe, D., and A. Semykina. 1999. sg117: Robust standard errors for the Foster–Greer–Thorbecke class of poverty indices. Stata Technical Bulletin 51: 34–36. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 200–203. College Station, TX: Stata Press. Kleiber, C., and S. Kotz. 2003. Statistical Size Distributions in Economics and Actuarial Sciences. Hoboken, NJ: Wiley. López-Feldman, A. 2006. Decomposing inequality and obtaining marginal effects. Stata Journal 6: 106–111. . 2008. Software Updates: Decomposing inequality and obtaining marginal effects. Stata Journal 8: 594. Lorenz, M. O. 1905. Methods of measuring the concentration of wealth. American Statistical Association 9: 209–219. Newson, R. B. 2006. Confidence intervals for rank statistics: Percentile slopes, differences, and ratios. Stata Journal 6: 497–520. Van Kerm, P. 1999. sg108: Computing poverty indices. Stata Technical Bulletin 48: 29–33. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 274–278. College Station, TX: Stata Press. Whitehouse, E. 1995. sg30: Measures of inequality in Stata. Stata Technical Bulletin 23: 20–23. 
Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 146–150. College Station, TX: Stata Press. Title intreg — Interval regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax intreg depvar1 depvar2 options indepvars if in weight , options Description Model noconstant suppress constant term het(varlist , noconstant ) independent variables to model the variance; use noconstant to suppress constant term include varname in model with coefficient constrained to 1 offset(varname) constraints(constraints) apply specified linear constraints collinear keep collinear variables SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife Reporting level(#) nocnsreport display options set confidence level; default is level(95) do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics indepvars and varlist may contain factor variables; see [U] 11.4.3 Factor variables. depvar1 , depvar2 , indepvars, and varlist may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, mfp, nestreg, rolling, statsby, stepwise, and svy are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. vce() and weights are not allowed with the svy prefix; see [SVY] svy. aweights, fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 875 876 intreg — Interval regression Menu Statistics > Linear models and related > Censored regression > Interval regression Description intreg fits a model of y = [ depvar 1 , depvar 2 ] on indepvars, where y for each observation is point data, interval data, left-censored data, or right-censored data. depvar1 and depvar2 should have the following form: Type of data point data interval data left-censored data right-censored data depvar1 a = [ a, a ] [ a, b ] ( −∞, b ] [ a, +∞ ) a a . a depvar2 a b b . Options Model noconstant; see [R] estimation options. het(varlist , noconstant ) specifies that varlist be included in the specification of the conditional variance. This varlist enters the variance specification collectively as multiplicative heteroskedasticity. offset(varname), constraints(constraints), collinear; see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. 
Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). intreg — Interval regression 877 The following option is available with intreg but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples intreg is a generalization of the models fit by tobit. Cameron and Trivedi (2010, 548–550) discuss the differences among censored, truncated, and interval data. If you know that the value for the j th individual is somewhere in the interval [ y1j , y2j ], then the likelihood contribution from this individual is simply Pr(y1j ≤ Yj ≤ y2j ). For censored data, their likelihoods contain terms of the form Pr(Yj ≤ yj ) for left-censored data and Pr(Yj ≥ yj ) for right-censored data, where yj is the observed censoring value and Yj denotes the random variable representing the dependent variable in the model. Hence, intreg can fit models for data where each observation represents interval data, left-censored data, right-censored data, or point data. Regardless of the type of observation, the data should be stored in the dataset as interval data; that is, two dependent variables, depvar1 and depvar2 , are used to hold the endpoints of the interval. If the data are left-censored, the lower endpoint is −∞ and is represented by a missing value, ‘.’, or an extended missing value, ‘.a, .b, . . . , .z’, in depvar1 . If the data are right-censored, the upper endpoint is +∞ and is represented by a missing value, ‘.’ (or an extended missing value), in depvar2 . Point data are represented by the two endpoints being equal. Type of data point data interval data left-censored data right-censored data depvar1 a = [ a, a ] [ a, b ] ( −∞, b ] [ a, +∞ ) a a . a depvar2 a b b . Truly missing values of the dependent variable must be represented by missing values in both depvar1 and depvar2 . Interval data arise naturally in many contexts, such as wage data. Often you know only that, for example, a person’s salary is between $30,000 and $40,000. Below we give an example for wage data and show how to set up depvar1 and depvar2 . Example 1 We have a dataset that contains the yearly wages of working women. Women were asked via a questionnaire to indicate a category for their yearly income from employment. The categories were less than 5,000, 5,001 – 10,000, . . . , 25,001 – 30,000, 30,001 – 40,000, 40,001 – 50,000, and more than 50,000. The wage categories are stored in the wagecat variable. 878 intreg — Interval regression . use http://www.stata-press.com/data/r13/womenwage (Wages of women) . tabulate wagecat Wage category ($1000s) Freq. Percent Cum. 5 10 15 20 25 30 40 50 51 14 83 158 107 57 30 19 14 6 2.87 17.01 32.38 21.93 11.68 6.15 3.89 2.87 1.23 Total 488 100.00 2.87 19.88 52.25 74.18 85.86 92.01 95.90 98.77 100.00 A value of 5 for wagecat represents the category less than 5,000, a value of 10 represents 5,001 – 10,000, . . . , and a value of 51 represents greater than 50,000. To use intreg, we must create two variables, wage1 and wage2, containing the lower and upper endpoints of the wage categories. Here is one way to do it. 
We first create a dataset containing the nine wage categories, lag the wage categories into wage1, and match-merge this dataset with nine observations back into the main one. . by wagecat: keep if _n==1 (479 observations deleted) . generate wage1 = wagecat[_n-1] (1 missing value generated) . keep wagecat wage1 . save lagwage file lagwage.dta saved . use http://www.stata-press.com/data/r13/womenwage (Wages of women) . merge m:1 wagecat using lagwage Result # of obs. not matched matched 0 488 (_merge==3) Now we create the upper endpoint and list the new variables: . generate wage2 = wagecat . replace wage2 = . if wagecat == 51 (6 real changes made, 6 to missing) . sort age, stable intreg — Interval regression . list wage1 wage2 in 1/10 wage1 wage2 1. 2. 3. 4. 5. . 5 5 10 . 5 10 10 15 5 6. 7. 8. 9. 10. . . 5 5 5 5 5 10 10 10 We can now run intreg: . intreg wage1 wage2 age c.age#c.age nev_mar rural school tenure Fitting constant-only model: Iteration 0: log likelihood = -967.24956 Iteration 1: log likelihood = -967.1368 Iteration 2: log likelihood = -967.1368 Fitting full model: Iteration 0: Iteration 1: Iteration 2: log likelihood = -856.65324 log likelihood = -856.33294 log likelihood = -856.33293 Interval regression Number of obs LR chi2(6) Prob > chi2 Log likelihood = -856.33293 Coef. Std. Err. z P>|z| = = = 488 221.61 0.0000 [95% Conf. Interval] age .7914438 .4433604 1.79 0.074 -.0775265 1.660414 c.age#c.age -.0132624 .0073028 -1.82 0.069 -.0275757 .0010509 nev_mar rural school tenure _cons -.2075022 -3.043044 1.334721 .8000664 -12.70238 .8119581 .7757324 .1357873 .1045077 6.367117 -0.26 -3.92 9.83 7.66 -1.99 0.798 0.000 0.000 0.000 0.046 -1.798911 -4.563452 1.068583 .5952351 -25.1817 1.383906 -1.522637 1.600859 1.004898 -.2230583 /lnsigma 1.987823 .0346543 57.36 0.000 1.919902 2.055744 sigma 7.299626 .2529634 6.82029 7.81265 Observation summary: 14 left-censored observations 0 uncensored observations 6 right-censored observations 468 interval observations 879 880 intreg — Interval regression We could also model these data by using an ordered probit model with oprobit (see [R] oprobit): . oprobit Iteration Iteration Iteration Iteration Iteration wagecat age c.age#c.age nev_mar rural school tenure 0: log likelihood = -881.1491 1: log likelihood = -764.31729 2: log likelihood = -763.31191 3: log likelihood = -763.31049 4: log likelihood = -763.31049 Ordered probit regression Number of obs LR chi2(6) Prob > chi2 Pseudo R2 Log likelihood = -763.31049 wagecat Coef. age = = = = 488 235.68 0.0000 0.1337 Std. Err. z P>|z| [95% Conf. Interval] .1674519 .0620333 2.70 0.007 .0458689 .289035 c.age#c.age -.0027983 .0010214 -2.74 0.006 -.0048001 -.0007964 nev_mar rural school tenure -.0046417 -.5270036 .2010587 .0989916 .1126737 .1100449 .0201189 .0147887 -0.04 -4.79 9.99 6.69 0.967 0.000 0.000 0.000 -.225478 -.7426875 .1616263 .0700063 .2161946 -.3113196 .2404911 .127977 /cut1 /cut2 /cut3 /cut4 /cut5 /cut6 /cut7 /cut8 2.650637 3.941018 5.085205 5.875534 6.468723 6.922726 7.34471 7.963441 .8957245 .8979167 .9056582 .9120933 .918117 .9215455 .9237628 .9338881 .8950495 2.181134 3.310148 4.087864 4.669247 5.11653 5.534168 6.133054 4.406225 5.700903 6.860263 7.663204 8.268199 8.728922 9.155252 9.793828 We can directly compare the log likelihoods for the intreg and oprobit models because both likelihoods are discrete. If we had point data in our intreg estimation, the likelihood would be a mixture of discrete and continuous terms, and we could not compare it directly with the oprobit likelihood. 
Here the oprobit log likelihood is significantly larger (that is, less negative), so it fits better than the intreg model. The intreg model assumes normality, but the distribution of wages is skewed and definitely nonnormal. Normality is more closely approximated if we model the log of wages. intreg — Interval regression 881 . generate logwage1 = log(wage1) (14 missing values generated) . generate logwage2 = log(wage2) (6 missing values generated) . intreg logwage1 logwage2 age c.age#c.age nev_mar rural school tenure Fitting constant-only model: Iteration 0: Iteration 1: Iteration 2: log likelihood = -889.23647 log likelihood = -889.06346 log likelihood = -889.06346 Fitting full model: Iteration 0: Iteration 1: Iteration 2: log likelihood = -773.81968 log likelihood = -773.36566 log likelihood = -773.36563 Interval regression Number of obs LR chi2(6) Prob > chi2 Log likelihood = -773.36563 Coef. = = = 488 231.40 0.0000 Std. Err. z P>|z| [95% Conf. Interval] age .0645589 .0249954 2.58 0.010 .0155689 .1135489 c.age#c.age -.0010812 .0004115 -2.63 0.009 -.0018878 -.0002746 nev_mar rural school tenure _cons -.0058151 -.2098361 .0804832 .0397144 .7084023 .0454867 .0439454 .0076783 .0058001 .3593193 -0.13 -4.77 10.48 6.85 1.97 0.898 0.000 0.000 0.000 0.049 -.0949674 -.2959675 .0654341 .0283464 .0041495 .0833371 -.1237047 .0955323 .0510825 1.412655 /lnsigma -.906989 .0356265 -25.46 0.000 -.9768157 -.8371623 sigma .4037381 .0143838 .3765081 .4329373 Observation summary: 14 left-censored observations 0 uncensored observations 6 right-censored observations 468 interval observations The log likelihood of this intreg model is close to the oprobit log likelihood, and the z statistics for both models are similar. Technical note intreg has two parameterizations for the log-likelihood function: the transformed parameterization (β/σ , 1/σ ) and the untransformed parameterization (β , ln(σ)). By default, the log likelihood for intreg is parameterized in the transformed parameter space. This parameterization tends to be more convergent, but it requires that any starting values and constraints have the same parameterization, and it prevents the estimation with multiplicative heteroskedasticity. Therefore, when the het() option is specified, intreg switches to the untransformed log likelihood for the fit of the conditional-variance model. Similarly, specifying from() or constraints() causes the optimization in the untransformed parameter space to allow constraints on (and starting values for) the coefficients on the covariates without reference to σ . The estimation results are all stored in the (β , ln(σ)) metric. 
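The het() option models the conditional variance as multiplicative heteroskedasticity, which forces the untransformed parameterization described above. As a minimal sketch continuing the log-wage model, one might allow the error variance to differ with rural residence; the choice of rural as the variance covariate is purely illustrative.

. intreg logwage1 logwage2 age c.age#c.age nev_mar rural school tenure, het(rural)

Because het() is specified, the optimization is carried out in the (β, ln(σ)) metric, as noted in the technical note above.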
882 intreg — Interval regression Stored results intreg stores the following in e(): Scalars e(N) e(N unc) e(N lc) e(N rc) e(N int) e(k) e(k aux) e(k eq) e(k eq model) e(k dv) e(df m) e(ll) e(ll 0) e(ll c) e(N clust) e(chi2) e(p) e(sigma) e(se sigma) e(rank) e(rank0) e(ic) e(rc) e(converged) sigma standard error of sigma rank of e(V) rank of e(V) for constant-only model number of iterations return code 1 if converged, 0 otherwise Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(clustvar) e(offset) e(chi2type) e(vce) e(vcetype) e(het) e(ml score) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(footnote) e(asbalanced) e(asobserved) intreg command as typed names of dependent variables weight type weight expression title in estimation output name of cluster variable linear offset variable Wald or LR; type of model χ2 test vcetype specified in vce() title used to label Std. Err. heteroskedasticity, if het() specified program used to implement scores type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict program and arguments to display footnote factor variables fvset as asbalanced factor variables fvset as asobserved Matrices e(b) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance Functions e(sample) marks estimation sample number of observations number of uncensored observations number of left-censored observations number of right-censored observations number of interval observations number of parameters number of auxiliary parameters number of equations in e(b) number of equations in overall model test number of dependent variables model degrees of freedom log likelihood log likelihood, constant-only model log likelihood, comparison model number of clusters χ2 p-value for model χ2 test intreg — Interval regression 883 Methods and formulas See Wooldridge (2013, sec. 17.4) or Davidson and MacKinnon (2004, sec. 11.6) for an introduction to censored and truncated regression models. The likelihood for intreg subsumes that of the tobit models. Let y = Xβ + be the model. y represents continuous outcomes — either observed or not observed. Our model assumes ∼ N (0, σ 2 I). For observations j ∈ C , we observe yj , that is, point data. Observations j ∈ L are left-censored; we know only that the unobserved yj is less than or equal to yLj , a censoring value that we do know. Similarly, observations j ∈ R are right-censored; we know only that the unobserved yj is greater than or equal to yRj . Observations j ∈ I are intervals; we know only that the unobserved yj is in the interval [ y1j , y2j ]. The log likelihood is ) 2 yj − x β + log 2πσ 2 σ j∈C X yLj − xβ + wj log Φ σ j∈L X yRj − xβ + wj log 1 − Φ σ j∈R X y2j − xβ y1j − xβ + wj log Φ −Φ σ σ 1X wj lnL = − 2 ( j∈I where Φ() is the standard cumulative normal and wj is the weight for the j th observation. If no √ weights are specified, wj = 1. If aweights are specified, wj = 1, and σ is replaced by σ/ aj in the above, where aj are the aweights normalized to sum to N . Maximization is as described in [R] maximize; the estimate reported as sigma is σ b. See Amemiya (1973) for a generalization of the tobit model to variable, but known, cutoffs. 
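Collecting the contributions from the four observation types, the log likelihood given above can be written in one expression as

$$
\ln L = -\frac{1}{2}\sum_{j\in C} w_j\left\{\left(\frac{y_j-\mathbf{x}_j\boldsymbol\beta}{\sigma}\right)^2+\log 2\pi\sigma^2\right\}
+\sum_{j\in L} w_j\log\Phi\!\left(\frac{y_{Lj}-\mathbf{x}_j\boldsymbol\beta}{\sigma}\right)
+\sum_{j\in R} w_j\log\left\{1-\Phi\!\left(\frac{y_{Rj}-\mathbf{x}_j\boldsymbol\beta}{\sigma}\right)\right\}
+\sum_{j\in I} w_j\log\left\{\Phi\!\left(\frac{y_{2j}-\mathbf{x}_j\boldsymbol\beta}{\sigma}\right)-\Phi\!\left(\frac{y_{1j}-\mathbf{x}_j\boldsymbol\beta}{\sigma}\right)\right\}
$$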
This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. intreg also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. References Amemiya, T. 1973. Regression analysis when the dependent variable is truncated normal. Econometrica 41: 997–1016. Cameron, A. C., and P. K. Trivedi. 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press. Conroy, R. M. 2005. Stings in the tails: Detecting and dealing with censored data. Stata Journal 5: 395–404. Davidson, R., and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press. . 2004. Econometric Theory and Methods. New York: Oxford University Press. Goldberger, A. S. 1983. Abnormal selection bias. In Studies in Econometrics, Time Series, and Multivariate Statistics, ed. S. Karlin, T. Amemiya, and L. A. Goodman, 67–84. New York: Academic Press. 884 intreg — Interval regression Hurd, M. 1979. Estimation in truncated samples when there is heteroscedasticity. Journal of Econometrics 11: 247–258. Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage. Stewart, M. B. 1983. On least squares estimation when the dependent variable is grouped. Review of Economic Studies 50: 737–753. Wooldridge, J. M. 2013. Introductory Econometrics: A Modern Approach. 5th ed. Mason, OH: South-Western. Also see [R] intreg postestimation — Postestimation tools for intreg [R] regress — Linear regression [R] tobit — Tobit regression [SVY] svy estimation — Estimation commands for survey data [XT] xtintreg — Random-effects interval-data regression models [XT] xttobit — Random-effects tobit models [U] 20 Estimation and postestimation commands Title intreg postestimation — Postestimation tools for intreg Description Remarks and examples Syntax for predict Also see Menu for predict Options for predict Description The following postestimation commands are available after intreg: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest1 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 lrtest is not appropriate with svy estimation results. 
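Several of these commands are typically combined immediately after fitting a model. A brief sketch continuing with the wage model of [R] intreg; the particular coefficients tested and the margins requested are illustrative only.

. estat ic                      // AIC and BIC for the fitted model
. test nev_mar rural            // joint Wald test that both coefficients are zero
. margins, dydx(school)         // average marginal effect of school on the linear prediction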
885 886 intreg postestimation — Postestimation tools for intreg Syntax for predict predict type predict type newvar if in , statistic nooffset stub* | newvarreg newvarlnsigma if in , scores Description statistic Main xb stdp stdf pr(a,b) e(a,b) ystar(a,b) linear prediction; the default standard error of the prediction standard error of the forecast Pr(a < yj < b) E(yj | a < yj < b) E(yj∗ ), yj∗ = max{a, min(yj , b)} These statistics are available both in and out of sample; type predict the estimation sample. . . . if e(sample) . . . if wanted only for stdf is not allowed with svy postestimation results. where a and b may be numbers or variables; a missing (a ≥ .) means −∞, and b missing (b ≥ .) means +∞; see [U] 12.2.1 Missing values. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main xb, the default, calculates the linear prediction. stdp calculates the standard error of the prediction, which can be thought of as the standard error of the predicted expected value or mean for the observation’s covariate pattern. The standard error of the prediction is also referred to as the standard error of the fitted value. stdf calculates the standard error of the forecast, which is the standard error of the point prediction for 1 observation. It is commonly referred to as the standard error of the future or forecast value. By construction, the standard errors produced by stdf are always larger than those produced by stdp; see Methods and formulas in [R] regress postestimation. pr(a,b) calculates Pr(a < xj b + uj < b), the probability that yj |xj would be observed in the interval (a, b). a and b may be specified as numbers or variable names; lb and ub are variable names; pr(20,30) calculates Pr(20 < xj b + uj < 30); pr(lb,ub) calculates Pr(lb < xj b + uj < ub); and pr(20,ub) calculates Pr(20 < xj b + uj < ub). a missing (a ≥ .) means −∞; pr(.,30) calculates Pr(−∞ < xj b + uj < 30); pr(lb,30) calculates Pr(−∞ < xj b + uj < 30) in observations for which lb ≥ . and calculates Pr(lb < xj b + uj < 30) elsewhere. intreg postestimation — Postestimation tools for intreg 887 b missing (b ≥ .) means +∞; pr(20,.) calculates Pr(+∞ > xj b + uj > 20); pr(20,ub) calculates Pr(+∞ > xj b + uj > 20) in observations for which ub ≥ . and calculates Pr(20 < xj b + uj < ub) elsewhere. e(a,b) calculates E(xj b + uj | a < xj b + uj < b), the expected value of yj |xj conditional on yj |xj being in the interval (a, b), meaning that yj |xj is truncated. a and b are specified as they are for pr(). ystar(a,b) calculates E(yj∗ ), where yj∗ = a if xj b + uj ≤ a, yj∗ = b if xj b + uj ≥ b, and yj∗ = xj b + uj otherwise, meaning that yj∗ is censored. a and b are specified as they are for pr(). nooffset is relevant only if you specified offset(varname). It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xj b rather than as xj b + offsetj . scores calculates equation-level score variables. The first new variable will contain ∂ ln L/∂(xj β). The second new variable will contain ∂ ln L/∂ lnσ . Remarks and examples Example 1 We continue with example 1 of [R] intreg. . use http://www.stata-press.com/data/r13/intregxmpl . intreg wage1 wage2 age c.age#c.age nev_mar rural school tenure (output omitted ) By default, the predict command produces the linear prediction, which in this case is the expected wage for each individual. . 
predict w1
(option xb assumed; fitted values)

We can use the e(a,b) option to compute the expected wage, conditional on it being larger than $5,000:

. predict w2, e(5,.)

The probability of earning more than $5,000 might vary with age. We can use margins to compute the marginal means for those probabilities for different ages.

. margins, predict(pr(5,.)) at(age=(20(5)50))
Predictive margins                                Number of obs   =        488
Model VCE    : OIM
Expression   : Pr(y>5), predict(pr(5,.))
1._at        : age             =          20
2._at        : age             =          25
3._at        : age             =          30
4._at        : age             =          35
5._at        : age             =          40
6._at        : age             =          45
7._at        : age             =          50

                            Delta-method
                 Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
         _at
          1    .8912598   .0151773    58.72   0.000     .8615127    .9210068
          2    .9104568   .0103467    87.99   0.000     .8901775     .930736
          3    .9160005   .0120025    76.32   0.000      .892476    .9395251
          4    .9096667   .0136693    66.55   0.000     .8828753    .9364581
          5    .8894289   .0206992    42.97   0.000     .8488593    .9299985
          6    .8491103   .0447429    18.98   0.000     .7614159    .9368048
          7    .7781644   .0970557     8.02   0.000     .5879387    .9683902

We can visualize these results by using marginsplot:

. qui margins, predict(pr(5,.)) at(age=(20(5)50))
. marginsplot
  Variables that uniquely identify margins: age

[Figure: "Predictive Margins with 95% CIs" -- Pr(Y>5) with 95% confidence intervals plotted against age in current year, 20 to 50]

The probability increases until age 30, and it decreases after that age.
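The other statistics listed in the predict table are obtained the same way. A brief sketch continuing this example; the new variable names are arbitrary.

. predict p30, pr(.,30)         // Pr(-inf < xb + u < 30), wages below $30,000
. predict wstar, ystar(5,.)     // expected wage, censored below at 5
. predict se_xb, stdp           // standard error of the linear prediction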
igmmeps(1e-6) specify # for iterated GMM weight-matrix convergence criterion; default is igmmweps(1e-6) control the optimization process; seldom used You can specify at most one of these options. ‡ These options may be specified only when igmm is specified. varlist1 and varlistiv may contain factor variables; see [U] 11.4.3 Factor variables. depvar, varlist1 , varlist2 , and varlistiv may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, jackknife, rolling, and statsby are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. aweights, fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 892 ivpoisson — Poisson regression with endogenous regressors Menu Statistics > Endogenous covariates > Poisson regression with endogenous regressors Description ivpoisson estimates the parameters of a Poisson regression model in which some of the regressors are endogenous. The model is also known as an exponential conditional mean model in which some of the regressors are endogenous. The model may be specified using either additive or multiplicative error terms. The model is frequently used to model count outcomes and is also used to model nonnegative outcome variables. Options Model noconstant, exposure(varnamee ), offset(varnameo ); see [R] estimation options. additive, the default, specifies that the regression errors be added to the conditional mean term and have mean 0. multiplicative specifies that the regression errors be multiplied by the conditional mean term and have mean 1. twostep, onestep, and igmm specify which estimator is to be used. twostep requests the two-step GMM estimator. gmm obtains parameter estimates based on the initial weight matrix, computes a new weight matrix based on those estimates, and then reestimates the parameters based on that weight matrix. twostep is the default for ivpoisson gmm. onestep requests the one-step GMM estimator. The parameters are estimated based on an initial weight matrix, and no updating of the weight matrix is performed except when calculating the appropriate variance–covariance (VCE) matrix. onestep is the default for ivpoisson cfunction. igmm requests the iterative GMM estimator. gmm obtains parameter estimates based on the initial weight matrix, computes a new weight matrix based on those estimates, reestimates the parameters based on that weight matrix, computes a new weight matrix, and so on, to convergence. Convergence is declared when the relative change in the parameter vector is less than igmmeps(), the relative change in the weight matrix is less than igmmweps(), or igmmiterate() iterations have been completed. Hall (2005, sec. 2.4 and 3.6) mentions that there may be gains to finite-sample efficiency from using the iterative estimator. Weight matrix wmatrix(wmtype) specifies the type of weight matrix to be used in conjunction with the two-step and iterated GMM estimators. Specifying wmatrix(robust) requests a weight matrix that is appropriate when the errors are independent but not necessarily identically distributed. wmatrix(robust) is the default. Specifying wmatrix(cluster clustvar) requests a weight matrix that accounts for arbitrary correlation among observations within clusters identified by clustvar. 
Specifying wmatrix(unadjusted) requests a weight matrix that is suitable when the errors are homoskedastic. wmatrix() cannot be specified if onestep is also specified. ivpoisson — Poisson regression with endogenous regressors 893 center requests that the sample moments be centered (demeaned) when computing GMM weight matrices. By default, centering is not done. winitial(wmtype , independent ) specifies the weight matrix to use to obtain the first-step parameter estimates. Specifying winitial(unadjusted) requests a weighting matrix that assumes the error functions are independent and identically distributed. This matrix is of the form (Z0 Z)−1 , where Z represents all the exogenous and instrumental variables. winitial(identity) requests that the identity matrix be used. winitial(matname) requests that Stata matrix matname be used. Including the independent suboption creates a weight matrix that assumes error functions are independent. Elements of the weight matrix corresponding to covariances between any two error functions are set equal to zero. This suboption only applies to ivpoisson cfunction. winitial(unadjusted) is the default for ivpoisson gmm. winitial(unadjusted, independent) is the default for ivpoisson cfunction. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. vce(unadjusted) specifies that an unadjusted (nonrobust) VCE matrix be used; this, along with the twostep option, results in the “optimal two-step GMM” estimates often discussed in textbooks. vce(unadjusted) may not be set in ivpoisson cfunction. The default vcetype is based on the wmtype specified in the wmatrix() option. If wmatrix() is specified but vce() is not, then vcetype is set equal to wmtype. To override this behavior in ivpoisson gmm and obtain an unadjusted (nonrobust) VCE matrix, specify vce(unadjusted). The default vcetype for ivpoisson cfunction is robust. Specifying vce(bootstrap) or vce(jackknife) results in standard errors based on the bootstrap or jackknife, respectively. See [R] vce option, [R] bootstrap, and [R] jackknife for more information on these VCEs. The syntax for vcetypes is identical to those for wmatrix(). Reporting level(#); see [R] estimation options. irr reports estimated coefficients transformed to incidence-rate ratios, that is, eβi rather than βi . Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated or stored. irr may be specified at estimation or when replaying previously estimated results. irr is not allowed with additive. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Optimization from(initial values) specifies the initial values to begin the estimation. You can specify a 1 × k matrix, where k is the number of parameters in the model, or you can specify a parameter name, 894 ivpoisson — Poisson regression with endogenous regressors its initial value, another parameter name, its initial value, and so on. For example, to initialize the coefficient for male to 1.23 and the constant cons to 4.57, you would type ivpoisson ..., from(male 1.23 _cons 4.57) ... 
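As a concrete sketch using the website dataset from example 1 below (the particular starting values are arbitrary illustrations, not recommendations):

. use http://www.stata-press.com/data/r13/website
. ivpoisson gmm visits ad female (time = phone frfam), from(ad 0.1 _cons 1)

Starting values can also be recycled from a previous fit by saving e(b) to a Stata matrix and passing that matrix to from(); the matrix form is described below.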
Initial values declared using this option override any that are declared within substitutable expressions. If you specify a parameter that does not appear in your model, ivpoisson exits with error code 480. If you specify a matrix, the values must be in the same order in which the parameters are declared in your model. ivpoisson ignores the row and column names of the matrix.

igmmiterate(#), igmmeps(#), and igmmweps(#) control the iterative process for the iterative GMM estimator for ivpoisson. These options can be specified only if you also specify igmm.

igmmiterate(#) specifies the maximum number of iterations to perform with the iterative GMM estimator. The default is the number set using set maxiter (see [R] maximize), which is 16,000 by default.

igmmeps(#) specifies the convergence criterion used for successive parameter estimates when the iterative GMM estimator is used. The default is igmmeps(1e-6). Convergence is declared when the relative difference between successive parameter estimates is less than igmmeps() and the relative difference between successive estimates of the weight matrix is less than igmmweps().

igmmweps(#) specifies the convergence criterion used for successive estimates of the weight matrix when the iterative GMM estimator is used. The default is igmmweps(1e-6). Convergence is declared when the relative difference between successive parameter estimates is less than igmmeps() and the relative difference between successive estimates of the weight matrix is less than igmmweps().

optimization_options: technique(), conv_maxiter(), conv_ptol(), conv_vtol(), conv_nrtol(), and tracelevel(). technique() specifies the optimization technique to use; gn (the default), nr, dfp, and bfgs are allowed. conv_maxiter() specifies the maximum number of iterations; conv_ptol(), conv_vtol(), and conv_nrtol() specify the convergence criteria for the parameters, gradient, and scaled Hessian, respectively. tracelevel() allows you to obtain additional details during the iterative process. See [M-5] optimize().

Remarks and examples

ivpoisson estimates the parameters of a Poisson regression model in which some of the regressors are endogenous. A regressor is endogenous if it is related to the unobserved error term. The model is also known as an exponential conditional mean model in which some of the regressors are endogenous. The model may be specified using either additive or multiplicative error terms. The model is frequently used to model count outcomes and is also used to model nonnegative outcome variables.

Poisson regression is a special exponential conditional mean model. See [R] poisson for more information on Poisson regression.

The exponential conditional mean model has an error-form representation in which the dependent variable y is a function of the exogenous regressors x, the endogenous regressors y2, and an error ε. The regressors x are independent of ε, while y2 are not. ivpoisson allows ε to enter either additively,

\[ y_i = \exp(x_i'\beta_1 + y_{2,i}'\beta_2) + \varepsilon_i \]

or multiplicatively,

\[ y_i = \exp(x_i'\beta_1 + y_{2,i}'\beta_2)\,\varepsilon_i \]

Mullahy (1997), Cameron and Trivedi (2013), Windmeijer and Santos Silva (1997), and Wooldridge (2010) discuss the generalized method of moments (GMM) estimators implemented in ivpoisson. GMM is frequently used in modern econometrics. Many econometric and statistical models can be expressed as conditions on the population moments.
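In ivpoisson, for example, the conditions take the form spelled out in the sections that follow: writing u(·) for the model's error function and $\tilde z_i$ for the vector of exogenous regressors and instruments, the population-moment conditions and their sample analog are

\[ E\{\tilde z_i\, u(y_i, x_i, y_{2,i}, \beta_1, \beta_2)\} = 0 \qquad\text{and}\qquad \frac{1}{N}\sum_{i=1}^{N} \tilde z_i\, u(y_i, x_i, y_{2,i}, \beta_1, \beta_2) \]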
The parameter estimates produced by GMM estimators make the sample-moment conditions as true as possible given the data. See [R] gmm for further information on GMM estimation and how Stata performs it. The rest of the discussion is presented under the following headings: GMM estimator for additive model GMM estimator for multiplicative model CF estimator for multiplicative model GMM estimator for additive model The GMM estimator uses additional variables, known as instruments and denoted by zi , to specify moment conditions that hold in the population. The GMM parameter estimates make the sample versions of these population-moment conditions as close to true as possible. The instrumental variables are assumed to be correlated with the endogenous regressors y2,i but independent of the errors i . Under additive errors, the dependent variable yi is determined by exogenous regressors xi , endogenous regressors y2,i , and zero-mean error i as 0 yi = exp(x0i β1 + y2,i β 2 ) + i This leads to the following error function 0 u(yi , xi , y2,i , β1 , β2 ) = yi − exp(x0i β1 + y2,i β2 ) The population-moment conditions for GMM estimation are E {e zi u(yi , xi , y2,i , β1 , β2 )} = 0, where the vector e zi is partitioned as (x0i , z0i ). The sample-moment conditions are formed by replacing the expectation with the corresponding sample mean. The GMM estimator solves a minimization problem to make the sample-moment conditions as close to zero as possible. Details on how estimation is performed are given in Methods and formulas. Now we will demonstrate how ivpoisson gmm works in the additive error setting with an example. Example 1: ivpoisson gmm with additive errors This example uses simulated data based on the following story. A news website randomly samples 500 young adults in a major city. The website wants to model the number of times the sampled individuals visit its website (visits) based on their overall time spent on the Internet (time) and the number of times they receive an ad for the website through email or viewing another website (ad). The website also suspects the gender of the individual may matter, so an exogenous dummy variable, female, is included in the model. We suspect time spent on the Internet is correlated with unobserved factors that additively affect the number of times an individual visits the website. So we treat time as an endogenous regressor. Two instruments are used for this variable. The time spent on the phone (phone) is one instrument. The other instrument is the time spent interacting with friends and family that live out of town (frfam). We model the number of visits the website receives using an exponential conditional mean model with additive errors and use ivpoisson gmm to estimate the parameters of the regression in the output below. To allow for heteroskedasticity of the errors, we use robust standard errors, which is the default; see Obtaining standard errors in [R] gmm for a discussion of why robust standard errors is the default. 896 ivpoisson — Poisson regression with endogenous regressors .use http://www.stata-press.com/data/r13/website (Visits to website) . 
ivpoisson gmm visits ad female (time = phone frfam) Step 1 Iteration 0: GMM criterion Q(b) = .33829416 Iteration 1: GMM criterion Q(b) = .00362656 Iteration 2: GMM criterion Q(b) = .00131886 Iteration 3: GMM criterion Q(b) = .00131876 Step 2 Iteration 0: GMM criterion Q(b) = .00027102 Iteration 1: GMM criterion Q(b) = .00025811 Iteration 2: GMM criterion Q(b) = .00025811 Exponential mean model with endogenous regressors Number of parameters = 4 Number of moments = 5 Initial weight matrix: Unadjusted GMM weight matrix: Robust visits Coef. time ad female _cons .0589294 .137344 -.0247707 1.041505 Instrumented: Instruments: Robust Std. Err. .0107942 .010157 .0376218 .0385848 Number of obs z 5.46 13.52 -0.66 26.99 = 500 P>|z| [95% Conf. Interval] 0.000 0.000 0.510 0.000 .0377732 .1174366 -.098508 .9658807 .0800857 .1572515 .0489666 1.11713 time ad female phone frfam We find significant coefficients for all regressors but female. At fixed values of the other regressors, increased time spent on the Internet will raise the expected number of website visits. Receiving additional advertisements will also cause an increase in the expected number of website visits. GMM estimator for multiplicative model Under multiplicative errors, the dependent variable yi is determined by exogenous regressors xi , endogenous regressors y2,i , and unit-mean errors i as 0 yi = exp(x0i β1 + y2,i β2 )i This setting yields a different error function than the additive error case. This ratio formulation is 0 u(yi , xi , y2,i , β1 , β2 ) = yi / exp(x0i β1 + y2,i β2 ) − 1 Given the instrumental variables z, the population-moment conditions for GMM estimation are E {e zi u(yi , xi , y2,i , β1 , β2 )} = 0. The vector e zi is partitioned as (x0i , z0i ). As above, the samplemoment conditions are the sample analogs of the population-moment conditions, and the GMM estimator solves a minimization problem to make the sample-moment conditions as close to zero as possible. Details on how estimation is performed are given in Methods and formulas. ivpoisson — Poisson regression with endogenous regressors 897 Example 2: ivpoisson gmm with multiplicative errors In this example, we observe a simulated random sample of 5,000 households. We model the number of trips taken by members of the household in the 24-hour period immediately prior to the interview time by using an exponential conditional mean model with multiplicative errors. Exogenous regressors include the distance to the central business district from the household (cbd), the distance from the household to a public transit node (ptn), whether there is a full-time worker in the household (worker), and whether the examined period is on a weekend (weekend). We suspect that the endogenous regressor, the transportation cost of the household in the prior week (tcost), is correlated with unobserved factors that affect the number of trips taken. This transportation cost includes gasoline and bus, train tickets, etc. The ratio of the cost of a public transit day pass in the sampled area to the national average cost of such a pass (pt) is also observed. This is used as an instrument for transportation cost. In the output below, we estimate the parameters of the regression with ivpoisson gmm. To allow for heteroskedasticity of the errors, we use robust standard errors, which is the default. . use http://www.stata-press.com/data/r13/trip (Household trips) . 
ivpoisson gmm trips cbd ptn worker weekend (tcost = pt), multiplicative Step 1 Iteration Iteration Iteration Iteration 0: 1: 2: 3: Step 2 Iteration 0: Iteration 1: GMM GMM GMM GMM criterion criterion criterion criterion Q(b) Q(b) Q(b) Q(b) = = = = .04949852 .00011194 1.563e-08 3.685e-16 GMM criterion Q(b) = GMM criterion Q(b) = 2.287e-16 1.413e-31 Exponential mean model with endogenous regressors Number of parameters = 6 Number of moments = 6 Initial weight matrix: Unadjusted GMM weight matrix: Robust trips Coef. tcost cbd ptn worker weekend _cons .0352185 -.008398 -.0113146 .6623018 .3009323 .2654423 Instrumented: Instruments: Robust Std. Err. .0098182 .0020172 .0021819 .0519909 .0362682 .1550127 Number of obs z 3.59 -4.16 -5.19 12.74 8.30 1.71 P>|z| 0.000 0.000 0.000 0.000 0.000 0.087 = 5000 [95% Conf. Interval] .0159752 -.0123517 -.015591 .5604015 .2298479 -.0383769 .0544617 -.0044444 -.0070383 .764202 .3720167 .5692616 tcost cbd ptn worker weekend pt We find that all coefficients are significant. At fixed values of the other regressors, we see that additional mileage from the central business district and public transit nodes reduces the expected number of trips taken. Individuals who live farther away from the central business district may still be out of the house the same amount of time, but they will take fewer trips because the transit time has increased. The situation is similar for those who live farther from public transit. To interpret the other parameters, we will look at the partial effects of their respective independent variables. The partial effects of a change in an independent variable on the modeled conditional expectation function vary over the data because the model is nonlinear. However, under the multiplicative 898 ivpoisson — Poisson regression with endogenous regressors error model, the ratio of the new value to the old value after a discrete change in an independent variable is constant over the data. Let w = (x0 , y20 )0 . If we add 1 to the j th independent variable in w, the functional form of the model implies that E(y|w, wj + 1, ) E(y|w1 , . . . , wj + 1, . . . , wk , ) = = eβj E(y|w, wj , ) E(y|w1 , . . . , wj , . . . , wk , ) When y is a count variable, this normalized effect is called the incidence-rate ratio (IRR) for a one-unit change in wj . More generally, the IRR for a ∆wj change in wj is eβj ∆wj under a multiplicative-error exponential conditional mean model. We can calculate incidence-rate ratios for different changes in the regressors by using lincom; see [R] lincom. Here we replay the ivpoisson results by typing the command name and we specify the irr option to get the incidence-rate ratios. Each significance test for a coefficient equaling zero becomes a test for the incidence-rate ratio equaling one. . ivpoisson, irr Exponential mean model with endogenous regressors Number of parameters = 6 Number of moments = 6 Initial weight matrix: Unadjusted GMM weight matrix: Robust trips IRR tcost cbd ptn worker weekend _cons 1.035846 .9916371 .9887491 1.939251 1.351118 1.304008 Instrumented: Instruments: Robust Std. Err. .0101701 .0020003 .0021573 .1008234 .0490026 .2021377 z 3.59 -4.16 -5.19 12.74 8.30 1.71 Number of obs = 5000 P>|z| [95% Conf. 
Interval] 0.000 0.000 0.000 0.000 0.000 0.087 1.016103 .9877243 .9845299 1.751376 1.258409 .9623501 1.055972 .9955655 .9929864 2.14728 1.450657 1.766962 tcost cbd ptn worker weekend pt Holding other regressors and the error constant, the expected number of trips made from houses with a full-time worker is nearly twice that of those houses without a full-time worker. Similarly, the expected number of trips made during a weekend day is close to 35% higher than the expected number of trips made on other days. For each additional dollar of weekly transportation cost, the expected number of household trips is increased by approximately 3.6%. CF estimator for multiplicative model Control-function (CF) estimators can be used to account for endogenous regressors. As Wooldridge (2010, sec. 18.5) describes, CF estimators assume a certain structural relationship between the endogenous regressors and the exogenous regressors and use functions of first-stage parameter estimates to control for the endogeneity in the second stage. ivpoisson — Poisson regression with endogenous regressors 899 Wooldridge (2010, sec. 18.5) notes that the VCE of the second-stage estimator must be adjusted to account for estimates from the first stage. ivpoisson cfunction solves this problem by stacking the moment conditions that define each stage and applying a single GMM estimator. See Newey (1984) and Wooldridge (2010, sec. 14.2) for a description of this technique. No adjustment to the VCE is necessary because there is only one stage. The CF estimator augments the original multiplicative model with an estimated term that controls for the endogeneity of y2,i . When y2,i is exogenous, the coefficient on this control term is zero. Let z be instrumental variables, and the vector e zi be (x0i , z0i ). The augmented model is 0 yi = exp(x0i β1 + y2,i β2 + vi0 ρ + ci ) where y2,i = Be z0i + vi The term vi0 ρ controls for the endogeneity of y2,i , and we normalize E{exp(ci )} = 1. The coefficient vector ρ measures the strength of the endogeneity of y2,i ; y2,i is exogenous when ρ = 0. ivpoisson cfunction estimates β1 and β2 and the auxiliary parameters ρ and B by GMM; see Methods and formulas for details. Example 3: Control-function estimator We return to the previous example, where we estimated the parameters of an exponential conditional mean model for the number of trips taken by a household in a 24-hour period. We will estimate the parameters of the regression with the CF estimator method and compare our results with those obtained with the GMM estimator in example 2. In the output below, we estimate the parameters of the regression with the ivpoisson cfunction command. 900 ivpoisson — Poisson regression with endogenous regressors . ivpoisson cfunction trips cbd ptn worker weekend (tcost = pt) Step 1 Iteration 0: GMM criterion Q(b) = .00056156 Iteration 1: GMM criterion Q(b) = 2.366e-07 Iteration 2: GMM criterion Q(b) = 5.552e-14 Iteration 3: GMM criterion Q(b) = 9.772e-27 Exponential mean model with endogenous regressors Number of parameters = 13 Number of moments = 13 Initial weight matrix: Unadjusted GMM weight matrix: Robust Robust Std. Err. z P>|z| Number of obs = 5000 trips Coef. [95% Conf. 
Interval] cbd ptn worker weekend tcost _cons -.0082567 -.0113719 .6903044 .2978149 .0320718 .2145986 .0020005 .0021625 .0521642 .0356474 .0092738 .1359327 -4.13 -5.26 13.23 8.35 3.46 1.58 0.000 0.000 0.000 0.000 0.001 0.114 -.0121777 -.0156102 .5880645 .2279472 .0138955 -.0518246 -.0043357 -.0071335 .7925444 .3676825 .0502481 .4810218 cbd ptn worker weekend pt _cons .0165466 -.040652 1.550985 .0423009 .7739176 12.13934 .0043693 .0045946 .0996496 .0779101 .0150072 .1123471 3.79 -8.85 15.56 0.54 51.57 108.05 0.000 0.000 0.000 0.587 0.000 0.000 .0079829 -.0496573 1.355675 -.1104002 .7445041 11.91915 .0251102 -.0316467 1.746294 .1950019 .8033312 12.35954 /c_tcost .1599984 .0111752 14.32 0.000 .1380954 .1819014 trips tcost Instrumented: Instruments: tcost cbd ptn worker weekend pt The output table presents results for the estimated coefficients in each of three equations. First, in the trips equation, we see the results for the estimated coefficients in the equation for the dependent variable trips. Second, in the tcost equation, we see the estimated coefficients in the regression of tcost on the instrumental and exogenous variables. Third, the /c tcost ancillary parameter corresponds to the estimate of ρ, the coefficient on the residual variable included to control for the endogeneity of tcost. We find that all coefficients are significant in the exponential conditional mean equation, trips. The coefficient estimates in the trips equation are similar to the estimates obtained by the GMM estimator in example 2. That the estimated coefficient on the tcost control variable is significantly different from zero suggests that tcost is endogenous. ivpoisson — Poisson regression with endogenous regressors Stored results ivpoisson stores the following in e(): Scalars e(N) e(k) e(k eq) e(k aux) e(k dv) e(Q) e(J) e(J df) e(N clust) e(rank) e(ic) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(instd) e(insts) e(wtype) e(wexp) e(title) e(clustvar) e(offset1) e(winit) e(winitname) e(estimator) e(additive) e(multiplicative) e(gmmestimator) e(wmatrix) e(vce) e(vcetype) e(technique) e(properties) e(estat cmd) e(predict) e(footnote) e(marginsok) e(asbalanced) e(asobserved) Matrices e(b) e(V) e(init) e(Wuser) e(W) e(S) e(V modelbased) Functions e(sample) number of observations number of parameters number of equations number of auxiliary parameters number of dependent variables criterion function Hansen J χ2 statistic J statistic degrees of freedom number of clusters rank of e(V) number of iterations used by iterative GMM estimator 1 if converged, 0 otherwise ivpoisson command as typed dependent variable of regression instrumented variable instruments weight type weight expression title in estimation output name of cluster variable offset variable for first equation initial weight matrix used name of user-supplied initial weight matrix gmm or cfunction additive if additive errors specified multiplicative if multiplicative errors specified onestep, twostep, or igmm wmtype specified in wmatrix() vcetype specified in vce() title used to label Std. Err. 
optimization technique b V program used to implement estat program used to implement predict program used to implement footnote display predictions allowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector variance–covariance matrix initial values of the estimators user-supplied initial weight matrix weight matrix used for final round of estimation moment covariance matrix used in robust VCE computations model-based variance marks estimation sample 901 902 ivpoisson — Poisson regression with endogenous regressors Methods and formulas The estimators in ivpoisson are GMM estimators that can be expressed in terms of error functions β and the instruments that are used to form the moment conditions. When offsets oj are used in the β outcome variable equation, the following formulas apply with x0j β1 changed to x0j β1 + oj . The error functions for the GMM estimators are given in the text. Here we provide some details about the form of the error function used by the CF estimator. Recall that the multiplicative model is 0 yi = exp(x0i β1 + y2,i β2 )i We parameterize the endogenous variables in the form y2,i = Be z0i + vi This allows us to decompose i as i = exp(vi0 ρ + ci ) Given this setup, we obtain the following conditional mean: 0 E(y|xi , zi , y2,i , vi ) = exp(x0i β1 + y2,i β2 + vi0 ρ) We estimate vi as the residuals of the linear regression of y2,i on e zi . The estimates of vi are used as additional regressors in the exponential conditional mean model for y to estimate β1 , β2 , and ρ. In essence, the estimates of vi control for the endogeneity. The error functions for the endogenous regressors are defined as uen,i (y2,i , e zi , B) = y2,i − Be z0i Now we define the error function for the dependent variable as 0 uy (yi , xi , y2,i , uen,i , β1 , β2 , ρ) = yi / exp(x0i β1 + y2,i β2 + u0en,i ρ) − 1 uen,i will be vector valued if we have multiple endogenous regressors y2,i . Call the dimension of y2,i g . uen,i and uy,i define g + 1 separate error functions. We will use variables e zi to instrument each error function in uen,i . So for error function j = 1, . . . , g , we have the error function uen,i,j and the population-moment conditions E (e zi uen,i,j ) = 0. boi previous to estimation as the residuals of the linear regression of y2,i on e We calculate v zi . We boi to instrument the error function uy . This leads to the population-moment use variables xi , y2,i , and v 0 0 boi conditions E (x0i , y2,i ,v )uy,i = 0 Details of GMM estimation can be found in Methods and formulas of [R] gmm. Determination of the weight matrix WN is discussed there. b 1 and β b 2 are the values of β1 and β2 that Under the GMM estimation, the GMM estimators β minimize X 0 X 1 1 e e Q(β1 , β2 ) = zi ui (yi , xi , y2,i , β1 , β2 ) WN zi ui (yi , xi , y2,i , β1 , β2 ) (1) i i N N ivpoisson — Poisson regression with endogenous regressors 903 for q × q weight matrix WN , where q is the dimension of e zi . The error functions ui were defined in the text. In the CF method, we have multiple error functions as defined above. We can stack the moment conditions and write them more compactly as Z0i ui (B, β1 , β2 , ρ), where x0i 0 0 Zi = . .. 0 and 0 0 0 y2,i 0 0 e zi .. .. . . 0 0 ··· ··· ··· .. . 0 0 0 .. . ··· e zi uy (yi , xi , y2,i , uen,i , β1 , β2 , ρ) ui (B, β1 , β2 , ρ) = uen (y2,i , e zi , B) The matrix Zi has g + 1 rows and k + gz columns, where k is the number of regressors for yi and z is the number of exogenous regressors in e zi . 
b, β b1, β b 2 , and ρ The GMM estimators B b are the values of B, β1 , β2 , and ρ that minimize Q(B, β1 , β2 , ρ) = ( N −1 N X )0 Z0i ui (B, β1 , β2 , ρ) ( WN N −1 i=1 N X ) Z0i ui (B, β1 , β2 , ρ) (2) i=1 for (k + gz) × (k + gz) weight matrix WN . By default, ivpoisson minimizes (1) and (2) using the Gauss–Newton method. See Hayashi (2000, 498) for a derivation. This technique is typically faster than quasi-Newton methods and does not require second-order derivatives. References Cameron, A. C., and P. K. Trivedi. 2013. Regression Analysis of Count Data. 2nd ed. New York: Cambridge University Press. Hall, A. R. 2005. Generalized Method of Moments. Oxford: Oxford University Press. Hayashi, F. 2000. Econometrics. Princeton, NJ: Princeton University Press. Mullahy, J. 1997. Instrumental-variable estimation of count data models: Applications to models of cigarette smoking behavior. Review of Economics and Statistics 79: 586–593. Newey, W. K. 1984. A method of moments interpretation of sequential estimators. Economics Letters 14: 201–206. Windmeijer, F., and J. M. C. Santos Silva. 1997. Endogeneity in count data models: An application to demand for health care. Journal of Applied Econometrics 12: 281–294. Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press. 904 ivpoisson — Poisson regression with endogenous regressors Also see [R] ivpoisson postestimation — Postestimation tools for ivpoisson [R] gmm — Generalized method of moments estimation [R] ivprobit — Probit model with continuous endogenous regressors [R] ivregress — Single-equation instrumental-variables regression [R] ivtobit — Tobit model with continuous endogenous regressors [R] nl — Nonlinear least-squares estimation [R] nlsur — Estimation of nonlinear systems of equations [R] poisson — Poisson regression [R] regress — Linear regression [U] 20 Estimation and postestimation commands Title ivpoisson postestimation — Postestimation tools for ivpoisson Description Options for predict Remarks and examples Reference Syntax for predict Syntax for estat overid Stored results Also see Menu for predict Menu for estat Methods and formulas Description The following postestimation command is of special interest after ivpoisson: Command Description estat overid perform test of overidentifying restrictions The following standard postestimation commands are also available: Command Description contrast estat summarize estat vce estimates lincom contrasts and ANOVA-style joint tests of estimates summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions and probabilities point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses margins marginsplot nlcom predict predictnl pwcompare test testnl Special-interest postestimation command estat overid reports Hansen’s J statistic, which is used to determine the validity of the overidentifying restrictions in a GMM model. 
ivpoisson gmm uses GMM estimation to obtain parameter estimates. Under additive and multiplicative errors, Hansen’s J statistic can be accurately reported when more instruments than endogenous regressors are specified. It is not appropriate to report the J statistic after ivpoisson cfunction, because a just-identified model is fit. 905 906 ivpoisson postestimation — Postestimation tools for ivpoisson If the model is correctly specified in the sense that E {e zi u(yi , xi , y2,i , β)} = 0, then the sample analog to that condition should hold at the estimated value of β1 and β2 . The e zi variables are the exogenous regressors xi and instrumental variables zi used in ivpoisson gmm. The y2,i are the endogenous regressors. The u function is the error function, which will have a different form for multiplicative and additive errors in the regression. Hansen’s J statistic is valid only if the weight matrix is optimal, meaning that it equals the inverse of the covariance matrix of the moment conditions. Therefore, estat overid only reports Hansen’s J statistic after two-step or iterated estimation or if you specified winitial(matname) when calling ivpoisson gmm. In the latter case, it is your responsibility to determine the validity of the J statistic. Syntax for predict predict type newvar if in , statistic nooffset Description statistic Main n xbtotal xb residuals number of events; the default linear prediction, using residual estimates for ivpoisson cfunction linear prediction residuals These statistics are available both in and out of sample; type predict only for the estimation sample. . . . if e(sample) . . . if wanted Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main n, the default, calculates the predicted number of events via the exponential-form estimate. This is 0 0 exp(x0j β1 + y2,j β2 ) if neither offset() nor exposure() was specified, exp(x0j β1 + y2,j β2 + 0 0 offsetj ) if offset() was specified, or exp(xj β1 + y2,j β2 ) × exposurej if exposure() was specified. After generalized method of moments estimation, the exponential-form estimate is not a consistent estimate of the conditional mean of yj , because it is not corrected for E(j |y2,j ). More details are found in Methods and formulas. After control-function estimation, we correct the exponential-form estimate for E(j |y2,j ) by using the estimated residuals of y2,j and the c * auxiliary parameters. This supplements the direct effect of y2,j and xj through β1 and β2 with the indirect effects of y2,j , xj , and the instruments zj through the endogenous error j . Thus the exponential-form estimate consistently estimates the conditional mean of yj . 0 xbtotal calculates the linear prediction, which is x0j β1 +y2,j β2 if neither offset() nor exposure() 0 0 0 was specified, xj β1 +y2,j β2 +offsetj if offset() was specified, or x0j β1 +y2,j β2 + ln(exposurej ) if exposure() was specified. ivpoisson postestimation — Postestimation tools for ivpoisson 907 After control-function estimation, the estimate of the linear form x0j β1 includes the estimated residuals of the endogenous regressors with coefficients from the c * auxiliary parameters. 0 xb calculates the linear prediction, which is x0j β1 + y2,j β2 if neither offset() nor exposure() was 0 0 0 specified, xj β1 + y2,j β2 + offsetj if offset() was specified, or x0j β1 + y2,j β2 + ln(exposurej ) if exposure() was specified. residuals calculates the residuals. 
Under additive errors, these are calculated as $y_j - \exp(x_j'\beta_1 + y_{2,j}'\beta_2)$. Under multiplicative errors, they are calculated as $y_j / \exp(x_j'\beta_1 + y_{2,j}'\beta_2) - 1$. When offset() or exposure() is specified, $x_j'\beta_1$ is not used directly in the residuals. $x_j'\beta_1 + \text{offset}_j$ is used if offset() was specified. $x_j'\beta_1 + \ln(\text{exposure}_j)$ is used if exposure() was specified. See nooffset below.

After control-function estimation, the estimate of the linear form $x_j'\beta_1$ includes the estimated residuals of the endogenous regressors with coefficients from the c_* auxiliary parameters.

nooffset is relevant only if you specified offset() or exposure() when you fit the model. It modifies the calculations made by predict so that they ignore the offset or exposure variable; that is, the offset or exposure term is removed from all the calculations described above.

Syntax for estat overid

    estat overid

Menu for estat

Statistics > Postestimation > Reports and statistics

Remarks and examples

estat overid reports Hansen's J statistic, which is used to determine the validity of the overidentifying restrictions in a GMM model. It is not appropriate to use it after ivpoisson cfunction, because a just-identified model is fit.

Recall that the GMM criterion function is

\[ Q(\beta) = \left\{ \frac{1}{N}\sum_i \tilde z_i\, u(y_i, x_i, y_{2,i}, \beta_1, \beta_2) \right\}' W_N \left\{ \frac{1}{N}\sum_i \tilde z_i\, u(y_i, x_i, y_{2,i}, \beta_1, \beta_2) \right\} \tag{A1} \]

Our u function within this formula will change depending on whether we use additive or multiplicative errors. The $\tilde z$ vector contains the exogenous regressors and instrumental variables used. ivpoisson gmm estimates regression coefficients to minimize Q.

Let l be the dimension of $\tilde z$ and k the number of regressors. If $W_N$ is an optimal weight matrix, under the null hypothesis $H_0\colon E\{\tilde z_i\, u(y_i, x_i, y_{2,i}, \beta_1, \beta_2)\} = 0$, the test statistic $J = N \times Q \sim \chi^2(l - k)$. A large test statistic casts doubt on the null hypothesis.

Because the weight matrix $W_N$ must be optimal, estat overid works only after the two-step and iterated estimation or if you supplied your own initial weight matrix by using the winitial(matname) option of ivpoisson gmm and used the one-step estimator.

Often the overidentifying restrictions test is interpreted as a test of the validity of the instruments z. However, other forms of model misspecification can sometimes lead to a significant test statistic. See Hall (2005, sec. 5.1) for a discussion of the overidentifying restrictions test and its behavior in correctly specified and misspecified models.

Note that ivpoisson gmm defaults to the two-step estimator when other options are not specified to override the default. Thus it is appropriate to perform the J test after the regression of example 1 in [R] ivpoisson.

Example 1: Specification test

Recall example 1 of [R] ivpoisson. We estimated the parameters of an exponential conditional mean model for the number of visits to a website. Additive errors were used. Exogenous regressors included the gender of an individual and the number of ads received from the website. An endogenous regressor, time spent on the Internet, was also included in the model. Two instruments were used. One of the instruments measured the time spent interacting with friends and out-of-town family. The other measured the time spent on the phone. We will reestimate the parameters of the regression here and then test the specification.

. use http://www.stata-press.com/data/r13/website
(Visits to website)
.
ivpoisson gmm visits ad female (time = phone frfam) (output omitted ) . estat overid Test of overidentifying restriction: Hansen’s J chi2(1) = .129055 (p = 0.7194) We have two instruments for one endogenous variable, so the J statistic has one degree of freedom. The J statistic is not significant. We fail to reject the null hypothesis that the model is correctly specified. Stored results estat overid stores the following in r(): Scalars r(J) r(J df) r(J p) Hansen’s J statistic J statistic degrees of freedom J statistic p-value Methods and formulas The vector xi contains the exogenous regressors, and zi the instruments. The vector e zi is partitioned as (xi , zi ). The vector y2,i contains the endogenous regressors. Under multiplicative errors, the conditional mean of yi is E(yi |y2,i , e zi ) = E{E(yi |xi , y2,i , i )|y2,i , e zi } 0 0 = E exp xi β1 + y2,i β2 i |y2,i , e zi 0 0 = exp xi β1 + y2,i β2 E(i |y2,i , e zi ) ivpoisson postestimation — Postestimation tools for ivpoisson 909 Under the CF estimator, E(i |y2,i , e zi ) = E {E(i |vi , ci )|y2,i , e zi } = E {exp(vi0 ρ + ci )|y2,i , e zi } = exp {(y2,i − Be z0i )0 ρ} E(ci |y2,i , e zi ) = exp {(y2,i − Be z0i )0 ρ} Thus under the CF estimator, we estimate the conditional mean of yi as 0 E(yi |y2,i , e zi ) = exp x0i β1 + y2,i β2 + (y2,i − Be z0i )0 ρ The CF estimator explicitly models the functional form of the endogeneity of y2,i and i with the instruments and exogenous regressors e zi . This allows it to correct the exponential-form estimator for the E(i |y2,i , e zi ) term. In contrast, the GMM estimator does not model the functional form of the endogeneity of y2,i and i . Therefore, E(i |y2,i , e zi ) is not estimated, and the exponential-form estimator under GMM estimation simply ignores this term. Noting that because e zi and i are independent, E(i |y2,i , e zi ) = E(i |y2,i ), we can obviously see that ignoring the term will lead to inconsistent estimation of the conditional mean of yi . y2,i and i are not independent, so E(i |y2,i ) may vary based on y2,i . In the additive errors setting, a similar derivation will show that the exponential-form estimator obtained from GMM estimation is inconsistent for the conditional mean of yi . Reference Hall, A. R. 2005. Generalized Method of Moments. Oxford: Oxford University Press. 
Also see [R] ivpoisson — Poisson regression with endogenous regressors [U] 20 Estimation and postestimation commands Title ivprobit — Probit model with continuous endogenous regressors Syntax Options for ML estimator Stored results References Menu Options for two-step estimator Methods and formulas Also see Description Remarks and examples Acknowledgments Syntax Maximum likelihood estimator ivprobit depvar varlist1 (varlist2 = varlistiv ) if in weight , mle options Two-step estimator ivprobit depvar varlist1 (varlist2 = varlistiv ) if in weight , twostep tse options mle options Description Model mle use conditional maximum-likelihood estimator; the default asis retain perfect predictor variables constraints(constraints) apply specified linear constraints SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife Reporting level(#) first nocnsreport display options set confidence level; default is level(95) report first-stage regression do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process coeflegend display legend instead of statistics 910 ivprobit — Probit model with continuous endogenous regressors 911 Description tse options Model ∗ use Newey’s two-step estimator; the default is mle retain perfect predictor variables twostep asis SE vce(vcetype) vcetype may be twostep, bootstrap, or jackknife Reporting level(#) first display options set confidence level; default is level(95) report first-stage regression control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling coeflegend display legend instead of statistics ∗ twostep is required. varlist1 and varlistiv may contain factor variables; see [U] 11.4.3 Factor variables. depvar, varlist1 , varlist2 , and varlistiv may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, jackknife, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. fp is allowed with the maximum likelihood estimator. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce(), first, twostep, and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed with the maximum likelihood estimator. fweights are allowed with Newey’s two-step estimator. See [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Endogenous covariates > Probit model with endogenous covariates Description ivprobit fits probit models where one or more of the regressors are endogenously determined. By default, ivprobit uses maximum likelihood estimation. Alternatively, Newey’s (1987) minimum chi-squared estimator can be invoked with the twostep option. Both estimators assume that the endogenous regressors are continuous and are not appropriate for use with discrete endogenous regressors. See [R] ivtobit for tobit estimation with endogenous regressors and [R] probit for probit estimation when the model contains no endogenous regressors. Options for ML estimator Model mle requests that the conditional maximum-likelihood estimator be used. This is the default. 
912 ivprobit — Probit model with continuous endogenous regressors asis requests that all specified variables and observations be retained in the maximization process. This option is typically not used and may introduce numerical instability. Normally, ivprobit drops any endogenous or exogenous variables that perfectly predict success or failure in the dependent variable. The associated observations are also dropped. For more information, see Model identification in [R] probit. constraints(constraints); see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. first requests that the parameters for the reduced-form equations showing the relationships between the endogenous variables and instruments be displayed. For the two-step estimator, first shows the first-stage regressions. For the maximum likelihood estimator, these parameters are estimated jointly with the parameters of the probit equation. The default is not to show these parameter estimates. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. This model’s likelihood function can be difficult to maximize, especially with multiple endogenous variables. The difficult and technique(bfgs) options may be helpful in achieving convergence. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). The following option is available with ivprobit but is not shown in the dialog box: coeflegend; see [R] estimation options. Options for two-step estimator Model twostep is required and requests that Newey’s (1987) efficient two-step estimator be used to obtain the coefficient estimates. asis requests that all specified variables and observations be retained in the maximization process. This option is typically not used and may introduce numerical instability. Normally, ivprobit drops any endogenous or exogenous variables that perfectly predict success or failure in the dependent variable. The associated observations are also dropped. For more information, see Model identification in [R] probit. ivprobit — Probit model with continuous endogenous regressors 913 SE vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (twostep) and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. first requests that the parameters for the reduced-form equations showing the relationships between the endogenous variables and instruments be displayed. For the two-step estimator, first shows the first-stage regressions. For the maximum likelihood estimator, these parameters are estimated jointly with the parameters of the probit equation. The default is not to show these parameter estimates. 
display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. The following option is available with ivprobit but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Remarks are presented under the following headings: Model setup Model identification Model setup ivprobit fits models with dichotomous dependent variables and endogenous regressors. You can use it to fit a probit model when you suspect that one or more of the regressors are correlated with the error term. ivprobit is to probit modeling what ivregress is to linear regression analysis; see [R] ivregress for more information. Formally, the model is ∗ y1i =y2i β + x1i γ + ui y2i =x1i Π1 + x2i Π2 + vi where i = 1, . . . , N , y2i is a 1 ×p vector of endogenous variables, x1i is a 1 ×k1 vector of exogenous variables, x2i is a 1 × k2 vector of additional instruments, and the equation for y2i is written in reduced form. By assumption, (ui , vi ) ∼ N(0, Σ), where σ11 is normalized to one to identify the model. β and γ are vectors of structural parameters, and Π1 and Π2 are matrices of reduced-form ∗ ∗ parameters. This is a recursive model: y2i appears in the equation for y1i , but y1i does not appear ∗ in the equation for y2i . We do not observe y1i ; instead, we observe ∗ 0 y1i <0 y1i = ∗ 1 y1i ≥ 0 The order condition for identification of the structural parameters requires that k2 ≥ p. Presumably, Σ is not block diagonal between ui and vi ; otherwise, y2i would not be endogenous. 914 ivprobit — Probit model with continuous endogenous regressors Technical note This model is derived under the assumption that (ui , vi ) is independent and identically distributed multivariate normal for all i. The vce(cluster clustvar) option can be used to control for a lack of independence. As with most probit models, if ui is heteroskedastic, point estimates will be inconsistent. Example 1 We have hypothetical data on 500 two-parent households, and we wish to model whether the woman is employed. We have a variable, fem work, that is equal to one if she has a job and zero otherwise. Her decision to work is a function of the number of children at home (kids), number of years of schooling completed (fem educ), and other household income measured in thousands of dollars (other inc). We suspect that unobservable shocks affecting the woman’s decision to hold a job also affect the household’s other income. Therefore, we treat other inc as endogenous. As an instrument, we use the number of years of schooling completed by the man (male educ). The syntax for specifying the exogenous, endogenous, and instrumental variables is identical to that used in ivregress; see [R] ivregress for details. . use http://www.stata-press.com/data/r13/laborsup . ivprobit fem_work fem_educ kids (other_inc = male_educ) Fitting exogenous probit model Iteration Iteration Iteration Iteration 0: 1: 2: 3: log log log log likelihood likelihood likelihood likelihood = = = = -344.63508 -255.36855 -255.31444 -255.31444 likelihood likelihood likelihood likelihood likelihood = = = = = -2371.4753 -2369.3178 -2368.2198 -2368.2062 -2368.2062 Fitting full model Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: log log log log log Probit model with endogenous regressors Number of obs Wald chi2(3) Prob > chi2 Log likelihood = -2368.2062 Coef. Std. Err. z P>|z| = = = 500 163.88 0.0000 [95% Conf. 
Interval] other_inc fem_educ kids _cons -.0542756 .211111 -.1820929 .3672083 .0060854 .0268648 .0478267 .4480724 -8.92 7.86 -3.81 0.82 0.000 0.000 0.000 0.412 -.0662027 .1584569 -.2758316 -.5109975 -.0423485 .2637651 -.0883543 1.245414 /athrho /lnsigma .3907858 2.813383 .1509443 .0316228 2.59 88.97 0.010 0.000 .0949403 2.751404 .6866313 2.875363 rho sigma .3720374 16.66621 .1300519 .5270318 .0946561 15.66461 .5958135 17.73186 Instrumented: Instruments: other_inc fem_educ kids male_educ Wald test of exogeneity (/athrho = 0): chi2(1) = 6.70 Prob > chi2 = 0.0096 Because we did not specify mle or twostep, ivprobit used the maximum likelihood estimator by default. At the top of the output, we see the iteration log. ivprobit fits a probit model ignoring ivprobit — Probit model with continuous endogenous regressors 915 endogeneity to obtain starting values for the endogenous model. The header of the output contains the sample size as well as a Wald statistic and p-value for the test of the hypothesis that all the slope coefficients are jointly zero. Below the table of coefficients, Stata reminds us that the endogenous variable is other inc and that fem educ, kids, and male educ were used as instruments. At the bottom of the output is a Wald test of the exogeneity of the instrumented variables. We reject the null hypothesis of no endogeneity. However, if the test statistic is not significant, there is not sufficient information in the sample to reject the null, so a regular probit regression may be appropriate. The point estimates from ivprobit are still consistent, though those from probit (see [R] probit) are likely to have smaller standard errors. Various two-step estimators have also been proposed for the endogenous probit model, and Newey’s (1987) minimum chi-squared estimator is available with the twostep option. Example 2 Refitting our labor-supply model with the two-step estimator yields . ivprobit fem_work fem_educ kids (other_inc = male_educ), twostep Checking reduced-form model... Two-step probit with endogenous regressors Number of obs = Wald chi2(3) = Prob > chi2 = Coef. other_inc fem_educ kids _cons Instrumented: Instruments: -.058473 .227437 -.1961748 .3956061 Std. Err. .0093364 .0281628 .0496323 .4982649 z -6.26 8.08 -3.95 0.79 P>|z| 0.000 0.000 0.000 0.427 500 93.97 0.0000 [95% Conf. Interval] -.0767719 .1722389 -.2934522 -.5809752 -.040174 .282635 -.0988973 1.372187 other_inc fem_educ kids male_educ Wald test of exogeneity: chi2(1) = 6.50 Prob > chi2 = 0.0108 All the coefficients have the same signs as their counterparts in the maximum likelihood model. The Wald test at the bottom of the output confirms our earlier finding of endogeneity. Technical note In a standard probit model, the error term is assumed to have a variance of one. In the probit model with endogenous regressors, we assume that (ui , vi ) is multivariate normal with covariance matrix 1 Σ021 Var(ui , vi ) = Σ = Σ21 Σ22 With the properties of the multivariate normal distribution, Var(ui |vi ) = 1 − Σ021 Σ−1 22 Σ21 . As a result, Newey’s estimator and other two-step probit estimators do not yield estimates of β and γ but rather β/σ and γ/σ , where σ is the square root of Var(ui |vi ). Hence, we cannot directly compare the estimates obtained from Newey’s estimator with those obtained via maximum likelihood or with those obtained from probit. See Wooldridge (2010, 585–594) for a discussion of Rivers and Vuong’s (1988) two-step estimator. 
The issues raised pertaining to the interpretation of the coefficients of that 916 ivprobit — Probit model with continuous endogenous regressors estimator are identical to those that arise with Newey’s estimator. Wooldridge also discusses ways to obtain marginal effects from two-step estimators. Despite the coefficients not being directly comparable to their maximum likelihood counterparts, the two-step estimator is nevertheless useful. The maximum likelihood estimator may have difficulty converging, especially with multiple endogenous variables. The two-step estimator, consisting of nothing more complicated than a probit regression, will almost certainly converge. Moreover, although the coefficients from the two models are not directly comparable, the two-step estimates can still be used to test for statistically significant relationships. Model identification As in the linear simultaneous-equation model, the order condition for identification requires that the number of excluded exogenous variables (that is, the additional instruments) be at least as great as the number of included endogenous variables. ivprobit checks this for you and issues an error message if the order condition is not met. Like probit, logit, and logistic, ivprobit checks the exogenous and endogenous variables to see if any of them predict the outcome variable perfectly. It will then drop offending variables and observations and fit the model on the remaining data. Instruments that are perfect predictors do not affect estimation, so they are not checked. See Model identification in [R] probit for more information. ivprobit will also occasionally display messages such as Note: 4 failures and 0 successes completely determined. For an explanation of this message, see [R] logit. Stored results ivprobit, mle stores the following in e(): Scalars e(N) e(N cds) e(N cdf) e(k) e(k eq) e(k eq model) e(k aux) e(k dv) e(df m) e(ll) e(N clust) e(endog ct) e(p) e(p exog) e(chi2) e(chi2 exog) e(rank) e(ic) e(rc) e(converged) number of observations number of completely determined successes number of completely determined failures number of parameters number of equations in e(b) number of equations in overall model test number of auxiliary parameters number of dependent variables model degrees of freedom log likelihood number of clusters number of endogenous regressors model Wald p-value exogeneity test Wald p-value model Wald χ2 Wald χ2 test of exogeneity rank of e(V) number of iterations return code 1 if converged, 0 otherwise ivprobit — Probit model with continuous endogenous regressors Macros e(cmd) e(cmdline) e(depvar) e(instd) e(insts) e(wtype) e(wexp) e(title) e(clustvar) e(chi2type) e(vce) e(vcetype) e(asis) e(method) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(estat cmd) e(predict) e(footnote) e(marginsok) e(asbalanced) e(asobserved) ivprobit command as typed name of dependent variable instrumented variables instruments weight type weight expression title in estimation output name of cluster variable Wald; type of model χ2 test vcetype specified in vce() title used to label Std. Err. 
asis, if specified ml type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement estat program used to implement predict program used to implement the footnote display predictions allowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved Matrices e(b) e(Cns) e(rules) e(ilog) e(gradient) coefficient vector constraints matrix information about perfect predictors iteration log (up to 20 iterations) gradient vector e(Sigma) e(V) e(V modelbased) Functions e(sample) b Σ variance–covariance matrix of the estimators model-based variance marks estimation sample 917 918 ivprobit — Probit model with continuous endogenous regressors ivprobit, twostep stores the following in e(): Scalars e(N) e(N cds) e(N cdf) e(df m) e(df exog) e(p) e(p exog) e(chi2) e(chi2 exog) e(rank) Macros e(cmd) e(cmdline) e(depvar) e(instd) e(insts) e(wtype) e(wexp) e(chi2type) e(vce) e(vcetype) e(asis) e(method) e(properties) e(estat cmd) e(predict) e(footnote) e(marginsok) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(rules) e(V) Functions e(sample) number of observations number of completely determined successes number of completely determined failures model degrees of freedom degrees of freedom for χ2 test of exogeneity model Wald p-value exogeneity test Wald p-value model Wald χ2 Wald χ2 test of exogeneity rank of e(V) ivprobit command as typed name of dependent variable instrumented variables instruments weight type weight expression Wald; type of model χ2 test vcetype specified in vce() title used to label Std. Err. asis, if specified twostep b V program used to implement estat program used to implement predict program used to implement the footnote display predictions allowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix information about perfect predictors variance–covariance matrix of the estimators marks estimation sample Methods and formulas Fitting limited-dependent variable models with endogenous regressors has received considerable attention in the econometrics literature. Building on the results of Amemiya (1978, 1979), Newey (1987) developed an efficient method of estimation that encompasses both Rivers and Vuong’s (1988) simultaneous-equations probit model and Smith and Blundell’s (1986) simultaneous-equations tobit model. With modern computers, maximum likelihood estimation is feasible as well. For compactness, we write the model as ∗ y1i = zi δ + ui y2i = xi Π + vi (1a) (1b) where zi = (y2i , x1i ), xi = (x1i , x2i ), δ = (β0 , γ0 )0 , and Π = (Π01 , Π02 )0 . Deriving the likelihood function is straightforward because we can write the joint density f (y1i , y2i |xi ) as f (y1i |y2i , xi ) f (y2i |xi ). When there is an endogenous regressor, the log likelihood for observation i is ivprobit — Probit model with continuous endogenous regressors 919 y2i − xi Π lnLi = wi y1i lnΦ (mi ) + (1 − y1i ) ln {1 − Φ (mi )} + lnφ − lnσ σ where mi = zi δ + ρ (y2i − xi Π) /σ 1 (1 − ρ2 ) 2 Φ(·) and φ(·) are the standard normal distribution and density functions, respectively; σ is the standard deviation of vi ; ρ is the correlation coefficient between ui and vi ; and wi is the weight for observation i or one if no weights were specified. 
Instead of estimating σ and ρ, we estimate lnσ and atanh ρ, where 1+ρ 1 atanh ρ = ln 2 1−ρ For multiple endogenous regressors, let 1 Var(ui , vi ) = Σ = Σ21 Σ021 Σ22 As in any probit model, we have imposed the normalization Var(ui ) = 1 to identify the model. The log likelihood for observation i is lnLi = wi y1i lnΦ (mi ) + (1 − y1i ) ln {1 − Φ (mi )} + lnf (y2i |xi ) where 1 1 p 0 lnf (y2i |xi ) = − ln2π − ln |Σ22 | − (y2i − xi Π) Σ−1 22 (y2i − xi Π) 2 2 2 and mi = 1 − Σ021 Σ−1 22 Σ21 − 21 zi δ + (y2i − xi Π) Σ−1 22 Σ21 Instead of maximizing the log-likelihood function with respect to Σ, we maximize with respect to the Cholesky decomposition S of Σ; that is, there exists a lower triangular matrix, S, such that SS0 = Σ. This maximization ensures that Σ is positive definite, as a covariance matrix must be. Let S= 1 s21 s31 .. . 0 s22 s32 .. . 0 0 s33 .. . sp+1,1 sp+1,2 sp+1,3 ... ... ... .. . 0 0 0 .. . . . . sp+1,p+1 920 ivprobit — Probit model with continuous endogenous regressors With maximum likelihood estimation, this command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. The maximum likelihood version of ivprobit also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. The two-step estimates are obtained using Newey’s (1987) minimum chi-squared estimator. The ∗ reduced-form equation for y1i is ∗ y1i = (xi Π + vi )β + x1i γ + ui = xi α + vi β + ui = xi α + ν i where νi = vi β + ui . Because ui and vi are jointly normal, νi is also normal. Note that Π1 I α= β+ γ = D(Π)δ Π2 0 b , x1i ), zbi δ = where D(Π) = (Π, I1 ) and I1 is defined such that xi I1 = x1i . Letting zbi = (xi Π b b b b b xi D(Π)δ, where D(Π) = (Π, I1 ). Thus one estimator of α is D(Π)δ; denote this estimator by Dδ. α could also be estimated directly as the solution to max α,λ N X l(y1i , xi α + vbi λ) (2) i=1 where l(·) is the log likelihood for probit. Denote this estimator by α e . The inclusion of the vbi λ term follows because the multivariate normality of (ui , vi ) implies that, conditional on y2i , the expected value of ui is nonzero. Because vi is unobservable, the least-squares residuals from fitting (1b) are used. Amemiya (1978) shows that the estimator of δ defined by b )0 Ω b max (e α − Dδ δ −1 b ) (e α − Dδ √ b ), is asymptotically efficient b is a consistent estimator of the covariance of N (e α − Dδ where Ω b )δ. Thus an efficient relative to all other estimators that minimize the distance between α e and D(Π estimator of δ is b b 0Ω b −1 D b )−1 D b 0Ω b −1 α δ = (D e (3) and b 0Ω b −1 D b )−1 Var(b δ) = (D (4) b −1 . To implement this estimator, we need Ω Consider the two-step maximum likelihood estimator that results from first fitting (1b) by OLS and b The estimator is then obtained by solving computing the residuals vbi = y2i − xi Π. max δ,λ N X i=1 l(y1i , zi δ + vbi λ) ivprobit — Probit model with continuous endogenous regressors 921 This is the two-step instrumental variables (2SIV) estimator proposed by Rivers and Vuong (1988), and its role will become apparent shortly. √ d b )−→N(0, From Proposition 5 of Newey (1987), N (e α − Dδ Ω), where 0 −1 Ω = J−1 αα + (λ − β) Σ22 (λ − β)Q b is an estimated e , ignoring that Π and Σ22 = E{v0i vi }. J−1 αα is simply the covariance matrix of α parameter matrix. 
Moreover, Newey shows that the covariance matrix from an OLS regression of b−β b ) on xi is a consistent estimator of the second term. λ b can be obtained from solving (2), y2i (λ b and the 2SIV estimator yields a consistent estimate, β. Mechanically, estimation proceeds in several steps. 1. Each of the endogenous right-hand-side variables is regressed on all the exogenous variables, b = D(Π b ) is assembled from the and the fitted values and residuals are calculated. The matrix D estimated coefficients. b The portion of the covariance matrix corresponding 2. probit is used to solve (2) and obtain α e and λ. to α, J−1 , is also saved. αα b corresponding to y2i are collected. 3. The 2SIV estimator is evaluated, and the parameters β b−β b ) is regressed on xi . The covariance matrix of the parameters from this regression is 4. y2i (λ b added to J−1 αα , yielding Ω. 5. Evaluating (3) and (4) yields the estimates b δ and Var(b δ). 6. A Wald test of the null hypothesis H0 : λ = 0, using the 2SIV estimates, serves as our test of exogeneity. The two-step estimates are not directly comparable to those obtained from the maximum likelihood estimator or from probit. The argument is the same for Newey’s efficient estimator as for Rivers and Vuong’s (1988) 2SIV estimator, so we consider the simpler 2SIV estimator. From the properties of the normal distribution, E(ui |vi ) = vi Σ−1 22 Σ21 and Var(ui |vi ) = 1 − Σ021 Σ−1 22 Σ21 −1 0 2 2 We write ui as ui = vi Σ−1 22 Σ21 + ei = vi λ + ei , where ei ∼ N(0, 1 − ρ ), ρ = Σ21 Σ22 Σ21 , and ei is independent of vi . In the second stage of 2SIV, we use a probit regression to estimate the parameters of y1i = zi δ + vi λ + ei Because vi is unobservable, we use the sample residuals from the first-stage regressions. o n 1 Pr(y1i = 1|zi , vi ) = Pr(zi δ + vi λ + ei > 0|zi , vi ) = Φ (1 − ρ2 )− 2 (zi δ + vi λ) Hence, as mentioned previously, 2SIV and Newey’s estimator do not estimate δ and λ but rather δρ = 1 (1 − 1 ρ2 ) 2 δ and λρ = 1 1 (1 − ρ2 ) 2 λ 922 ivprobit — Probit model with continuous endogenous regressors Acknowledgments The two-step estimator is based on the probitiv command written by Jonah Gelbach of the Department of Economics at Yale University and the ivprob command written by Joe Harkness of the Institute of Policy Studies at Johns Hopkins University. References Amemiya, T. 1978. The estimation of a simultaneous equation generalized probit model. Econometrica 46: 1193–1205. . 1979. The estimation of a simultaneous-equation tobit model. International Economic Review 20: 169–181. Finlay, K., and L. M. Magnusson. 2009. Implementing weak-instrument robust tests for a general class of instrumentalvariables models. Stata Journal 9: 398–421. Miranda, A., and S. Rabe-Hesketh. 2006. Maximum likelihood estimation of endogenous switching and sample selection models for binary, ordinal, and count variables. Stata Journal 6: 285–308. Newey, W. K. 1987. Efficient estimation of limited dependent variable models with endogenous explanatory variables. Journal of Econometrics 36: 231–250. Rivers, D., and Q. H. Vuong. 1988. Limited information estimators and exogeneity tests for simultaneous probit models. Journal of Econometrics 39: 347–366. Smith, R. J., and R. Blundell. 1986. An exogeneity test for the simultaneous equation tobit model with an application to labor supply. Econometrica 54: 679–685. Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press. 
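To make the control-function logic of the 2SIV second stage concrete, it can be sketched by hand with the labor-supply variables used in example 1 of this entry. This is an illustration only: the probit below recovers delta and lambda only up to the scale factor discussed above and with unadjusted standard errors, and it does not apply Newey's minimum chi-squared correction of steps 3-5, so ivprobit remains the estimator to use in practice. The residual variable name vhat is arbitrary.

. use http://www.stata-press.com/data/r13/laborsup
. regress other_inc male_educ fem_educ kids
. predict double vhat, residuals
. probit fem_work fem_educ kids other_inc vhat

regress fits the reduced-form equation (1b) by OLS, and predict saves the least-squares residuals; the probit then includes those residuals as an additional regressor, and the z test on vhat corresponds to the Wald test of H0: lambda = 0 described in step 6, the test of exogeneity.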
Also see [R] ivprobit postestimation — Postestimation tools for ivprobit [R] gmm — Generalized method of moments estimation [R] ivregress — Single-equation instrumental-variables regression [R] ivtobit — Tobit model with continuous endogenous regressors [R] probit — Probit regression [SVY] svy estimation — Estimation commands for survey data [XT] xtprobit — Random-effects and population-averaged probit models [U] 20 Estimation and postestimation commands Title ivprobit postestimation — Postestimation tools for ivprobit Description Remarks and examples Syntax for predict Methods and formulas Menu for predict Also see Options for predict Description The following postestimation commands are of special interest after ivprobit: Command Description estat classification report various summary statistics, including the classification table lroc compute area under ROC curve and graph the curve lsens graph sensitivity and specificity versus probability cutoff These commands are not appropriate after the two-step estimator or the svy prefix. The following standard postestimation commands are also available: Command Description contrast estat ic1 estat summarize estat vce estat (svy) estimates forecast2 hausman lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations Hausman’s specification test point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test; not available with two-step estimator marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest3 margins marginsplot nlcom predict predictnl pwcompare suest1 test testnl 1 estat ic and suest are not appropriate after ivprobit, twostep. 2 forecast is not appropriate with svy estimation results or after ivprobit, twostep. lrtest is not appropriate with svy estimation results. 3 923 924 ivprobit postestimation — Postestimation tools for ivprobit Syntax for predict After ML or twostep predict type newvar if in , statistic rules asif After ML predict type stub* | newvarlist if in , scores Description statistic Main linear prediction; the default standard error of the linear prediction probability of a positive outcome; not available with two-step estimator xb stdp pr These statistics are available both in and out of sample; type predict the estimation sample. . . . if e(sample) . . . if wanted only for Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main xb, the default, calculates the linear prediction. stdp calculates the standard error of the linear prediction. pr calculates the probability of a positive outcome. pr is not available with the two-step estimator. rules requests that Stata use any rules that were used to identify the model when making the prediction. 
By default, Stata calculates missing for excluded observations. rules is not available with the two-step estimator. asif requests that Stata ignore the rules and the exclusion criteria and calculate predictions for all observations possible using the estimated parameters from the model. asif is not available with the two-step estimator. scores, not available with twostep, calculates equation-level score variables. For models with one endogenous regressor, four new variables are created. The first new variable will contain ∂ lnL/∂(zi δ). The second new variable will contain ∂ lnL/∂(xi Π). The third new variable will contain ∂ lnL/∂ atanh ρ. The fourth new variable will contain ∂ lnL/∂ lnσ . For models with p endogenous regressors, p + {(p + 1)(p + 2)}/2 new variables are created. The first new variable will contain ∂ lnL/∂(zi δ). ivprobit postestimation — Postestimation tools for ivprobit 925 The second through (p + 1)th new variables will contain ∂ lnL/∂(xi Πk ), k = 1, . . . , p, where Πk is the k th column of Π. The remaining score variables will contain the partial derivatives of lnL with respect to s21 , s31 , . . . , sp+1,1 , s22 , . . . , sp+1,2 , . . . , sp+1,p+1 , where sm,n denotes the (m, n) element of the Cholesky decomposition of the error covariance matrix. Remarks and examples Remarks are presented under the following headings: Marginal effects Obtaining predicted values Marginal effects Example 1 We can obtain marginal effects by using the margins command after ivprobit. We will calculate average marginal effects by using the labor-supply model of example 1 in [R] ivprobit. . use http://www.stata-press.com/data/r13/laborsup . ivprobit fem_work fem_educ kids (other_inc = male_educ) (output omitted ) . margins, dydx(*) predict(pr) Average marginal effects Model VCE : OIM Number of obs = 500 Expression : Probability of positive outcome, predict(pr) dy/dx w.r.t. : other_inc fem_educ kids male_educ dy/dx other_inc fem_educ kids male_educ -.014015 .0545129 -.0470199 0 Delta-method Std. Err. .0009836 .0066007 .0123397 (omitted) z -14.25 8.26 -3.81 P>|z| 0.000 0.000 0.000 [95% Conf. Interval] -.0159428 .0415758 -.0712052 -.0120872 .06745 -.0228346 Here we see that a $1,000 increase in other inc leads to an average decrease of 0.014 in the probability that the woman has a job. male edu has no effect because it appears only as an instrument. Obtaining predicted values After fitting your model with ivprobit, you can obtain the linear prediction and its standard error for both the estimation sample and other samples by using the predict command; see [U] 20 Estimation and postestimation commands and [R] predict. If you had used the maximum likelihood estimator, you could also obtain the probability of a positive outcome. 926 ivprobit postestimation — Postestimation tools for ivprobit predict’s pr option calculates the probability of a positive outcome, remembering any rules used to identify the model, and calculates missing for excluded observations. predict’s rules option uses the rules in predicting probabilities, whereas predict’s asif option ignores both the rules and the exclusion criteria and calculates probabilities for all possible observations by using the estimated parameters from the model. See Obtaining predicted values in [R] probit postestimation for an example. Methods and formulas The linear prediction is calculated as zib δ, where b δ is the estimated value of δ, and zi and δ are defined in (1a) of [R] ivprobit. 
The probability of a positive outcome is Φ(zib δ), where Φ(·) is the standard normal distribution function. Also see [R] ivprobit — Probit model with continuous endogenous regressors [R] estat classification — Classification statistics and table [R] lroc — Compute area under ROC curve and graph the curve [R] lsens — Graph sensitivity and specificity versus probability cutoff [U] 20 Estimation and postestimation commands Title ivregress — Single-equation instrumental-variables regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax ivregress estimator depvar , options varlist1 (varlist2 = varlistiv ) if estimator Description 2sls liml gmm two-stage least squares (2SLS) limited-information maximum likelihood (LIML) generalized method of moments (GMM) options Description in weight Model noconstant hascons GMM suppress constant term has user-supplied constant 1 wmatrix(wmtype) center igmm eps(#)2 weps(#)2 optimization options2 wmtype may be robust, cluster clustvar, hac kernel, or unadjusted center moments in weight matrix computation use iterative instead of two-step GMM estimator specify # for parameter convergence criterion; default is eps(1e-6) specify # for weight matrix convergence criterion; default is weps(1e-6) control the optimization process; seldom used SE/Robust vce(vcetype) vcetype may be unadjusted, robust, cluster clustvar, bootstrap, jackknife, or hac kernel Reporting level(#) first small noheader depname(depname) eform(string) display options set confidence level; default is level(95) report first-stage regression make degrees-of-freedom adjustments and report small-sample statistics display only the coefficient table substitute dependent variable name report exponentiated coefficients and use string to label them control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling 927 928 ivregress — Single-equation instrumental-variables regression perfect coeflegend 1 do not check for collinearity between endogenous regressors and excluded instruments display legend instead of statistics These options may be specified only when gmm is specified. 2 These options may be specified only when igmm is specified. varlist1 , varlist2 , and varlistiv may contain factor variables; see [U] 11.4.3 Factor variables. depvar, varlist1 , varlist2 , and varlistiv may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, jackknife, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. hascons, vce(), noheader, depname(), and weights are not allowed with the svy prefix; see [SVY] svy. aweights, fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. perfect and coeflegend do not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Endogenous covariates > Single-equation instrumental-variables regression Description ivregress fits a linear regression of depvar on varlist1 and varlist2 , using varlistiv (along with varlist1 ) as instruments for varlist2 . ivregress supports estimation via two-stage least squares (2SLS), limited-information maximum likelihood (LIML), and generalized method of moments (GMM). 
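For orientation, all three estimators share the same basic syntax. The commands below simply preview the housing example developed under Remarks and examples, where rent is depvar, pcturban is varlist1, hsngval is varlist2, and faminc together with the region indicators forms varlistiv:

. ivregress 2sls rent pcturban (hsngval = faminc i.region)
. ivregress liml rent pcturban (hsngval = faminc i.region)
. ivregress gmm rent pcturban (hsngval = faminc i.region), wmatrix(robust)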
In the language of instrumental variables, varlist1 and varlistiv are the exogenous variables, and varlist2 are the endogenous variables. Options Model noconstant; see [R] estimation options. hascons indicates that a user-defined constant or its equivalent is specified among the independent variables. GMM wmatrix(wmtype) specifies the type of weighting matrix to be used in conjunction with the GMM estimator. Specifying wmatrix(robust) requests a weighting matrix that is optimal when the error term is heteroskedastic. wmatrix(robust) is the default. Specifying wmatrix(cluster clustvar) requests a weighting matrix that accounts for arbitrary correlation among observations within clusters identified by clustvar. Specifying wmatrix(hac kernel #) requests a heteroskedasticity- and autocorrelation-consistent (HAC) weighting matrix using the specified kernel (see below) with # lags. The bandwidth of a kernel is equal to # + 1. ivregress — Single-equation instrumental-variables regression 929 Specifying wmatrix(hac kernel opt) requests an HAC weighting matrix using the specified kernel, and the lag order is selected using Newey and West’s (1994) optimal lag-selection algorithm. Specifying wmatrix(hac kernel) requests an HAC weighting matrix using the specified kernel and N − 2 lags, where N is the sample size. There are three kernels available for HAC weighting matrices, and you may request each one by using the name used by statisticians or the name perhaps more familiar to economists: bartlett or nwest requests the Bartlett (Newey–West) kernel; parzen or gallant requests the Parzen (Gallant 1987) kernel; and quadraticspectral or andrews requests the quadratic spectral (Andrews 1991) kernel. Specifying wmatrix(unadjusted) requests a weighting matrix that is suitable when the errors are homoskedastic. The GMM estimator with this weighting matrix is equivalent to the 2SLS estimator. center requests that the sample moments be centered (demeaned) when computing GMM weight matrices. By default, centering is not done. igmm requests that the iterative GMM estimator be used instead of the default two-step GMM estimator. Convergence is declared when the relative change in the parameter vector from one iteration to the next is less than eps() or the relative change in the weight matrix is less than weps(). eps(#) specifies the convergence criterion for successive parameter estimates when the iterative GMM estimator is used. The default is eps(1e-6). Convergence is declared when the relative difference between successive parameter estimates is less than eps() and the relative difference between successive estimates of the weighting matrix is less than weps(). weps(#) specifies the convergence criterion for successive estimates of the weighting matrix when the iterative GMM estimator is used. The default is weps(1e-6). Convergence is declared when the relative difference between successive parameter estimates is less than eps() and the relative difference between successive estimates of the weighting matrix is less than weps(). optimization options: iterate(#), no log. iterate() specifies the maximum number of iterations to perform in conjunction with the iterative GMM estimator. The default is 16,000 or the number set using set maxiter (see [R] maximize). log/nolog specifies whether to show the iteration log. These options are seldom used. 
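As a sketch of how the GMM options combine, the call below requests the iterative GMM estimator with a cluster-robust weight matrix and a tighter parameter convergence criterion; the variable names y, x1, x2, z1, and z2 and the cluster identifier id are placeholders rather than variables from any shipped dataset:

. ivregress gmm y x1 (x2 = z1 z2), wmatrix(cluster id) igmm eps(1e-8)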
SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. vce(unadjusted), the default for 2sls and liml, specifies that an unadjusted (nonrobust) VCE matrix be used. The default for gmm is based on the wmtype specified in the wmatrix() option; see wmatrix(wmtype) above. If wmatrix() is specified with gmm but vce() is not, then vcetype is set equal to wmtype. To override this behavior and obtain an unadjusted (nonrobust) VCE matrix, specify vce(unadjusted). ivregress also allows the following: vce(hac kernel # | opt ) specifies that an HAC covariance matrix be used. The syntax used with vce(hac kernel . . .) is identical to that used with wmatrix(hac kernel . . .); see wmatrix(wmtype) above. Reporting level(#); see [R] estimation options. 930 ivregress — Single-equation instrumental-variables regression first requests that the first-stage regression results be displayed. small requests that the degrees-of-freedom adjustment N/(N −k) be made to the variance–covariance matrix of parameters and that small-sample F and t statistics be reported, where N is the sample size and k is the number of parameters estimated. By default, no degrees-of-freedom adjustment is made, and Wald and z statistics are reported. Even with this option, no degrees-of-freedom adjustment is made to the weighting matrix when the GMM estimator is used. noheader suppresses the display of the summary statistics at the top of the output, displaying only the coefficient table. depname(depname) is used only in programs and ado-files that use ivregress to fit models other than instrumental-variables regression. depname() may be specified only at estimation time. depname is recorded as the identity of the dependent variable, even though the estimates are calculated using depvar. This method affects the labeling of the output — not the results calculated — but could affect later calculations made by predict, where the residual would be calculated as deviations from depname rather than depvar. depname() is most typically used when depvar is a temporary variable (see [P] macro) used as a proxy for depname. eform(string) is used only in programs and ado-files that use ivregress to fit models other than instrumental-variables regression. eform() specifies that the coefficient table be displayed in “exponentiated form”, as defined in [R] maximize, and that string be used to label the exponentiated coefficients in the table. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. The following options are available with ivregress but are not shown in the dialog box: perfect requests that ivregress not check for collinearity between the endogenous regressors and excluded instruments, allowing one to specify “perfect” instruments. This option cannot be used with the LIML estimator. This option may be required when using ivregress to implement other estimators. coeflegend; see [R] estimation options. Remarks and examples ivregress performs instrumental-variables regression and weighted instrumental-variables regression. For a general discussion of instrumental variables, see Baum (2006), Cameron and Trivedi (2005; 2010, chap. 
6) Davidson and MacKinnon (1993, 2004), Greene (2012, chap. 8), and Wooldridge (2010, 2013). See Hall (2005) for a lucid presentation of GMM estimation. Angrist and Pischke (2009, chap. 4) offer a casual yet thorough introduction to instrumental-variables estimators, including their use in estimating treatment effects. Some of the earliest work on simultaneous systems can be found in Cowles Commission monographs — Koopmans and Marschak (1950) and Koopmans and Hood (1953) — with the first developments of 2SLS appearing in Theil (1953) and Basmann (1957). However, Stock and Watson (2011, 422–424) present an example of the method of instrumental variables that was first published in 1928 by Philip Wright. The syntax for ivregress assumes that you want to fit one equation from a system of equations or an equation for which you do not want to specify the functional form for the remaining equations of the system. To fit a full system of equations, using either 2SLS equation-by-equation or three-stage least squares, see [R] reg3. An advantage of ivregress is that you can fit one equation of a multiple-equation system without specifying the functional form of the remaining equations. ivregress — Single-equation instrumental-variables regression 931 Formally, the model fit by ivregress is yi = yi β1 + x1i β2 + ui yi = x1i Π1 + x2i Π2 + vi (1) (2) Here yi is the dependent variable for the ith observation, yi represents the endogenous regressors (varlist2 in the syntax diagram), x1i represents the included exogenous regressors (varlist1 in the syntax diagram), and x2i represents the excluded exogenous regressors (varlistiv in the syntax diagram). x1i and x2i are collectively called the instruments. ui and vi are zero-mean error terms, and the correlations between ui and the elements of vi are presumably nonzero. The rest of the discussion is presented under the following headings: 2SLS and LIML estimators GMM estimator 2SLS and LIML estimators The most common instrumental-variables estimator is 2SLS. Example 1: 2SLS estimator We have state data from the 1980 census on the median dollar value of owner-occupied housing (hsngval) and the median monthly gross rent (rent). We want to model rent as a function of hsngval and the percentage of the population living in urban areas (pcturban): renti = β0 + β1 hsngvali + β2 pcturbani + ui where i indexes states and ui is an error term. Because random shocks that affect rental rates in a state probably also affect housing values, we treat hsngval as endogenous. We believe that the correlation between hsngval and u is not equal to zero. On the other hand, we have no reason to believe that the correlation between pcturban and u is nonzero, so we assume that pcturban is exogenous. Because we are treating hsngval as an endogenous regressor, we must have one or more additional variables available that are correlated with hsngval but uncorrelated with u. Moreover, these excluded exogenous variables must not affect rent directly, because if they do then they should be included in the regression equation we specified above. In our dataset, we have a variable for family income (faminc) and for region of the country (region) that we believe are correlated with hsngval but not the error term. Together, pcturban, faminc, and factor variables 2.region, 3.region, and 4.region constitute our set of instruments. To fit the equation in Stata, we specify the dependent variable and the list of included exogenous variables. 
In parentheses, we specify the endogenous regressors, an equal sign, and the excluded exogenous variables. Only the additional exogenous variables must be specified to the right of the equal sign; the exogenous variables that appear in the regression equation are automatically included as instruments. 932 ivregress — Single-equation instrumental-variables regression Here we fit our model with the 2SLS estimator: . use http://www.stata-press.com/data/r13/hsng (1980 Census housing data) . ivregress 2sls rent pcturban (hsngval = faminc i.region) Instrumental variables (2SLS) regression Number of obs Wald chi2(2) Prob > chi2 R-squared Root MSE rent Coef. hsngval pcturban _cons .0022398 .081516 120.7065 Instrumented: Instruments: = = = = = 50 90.76 0.0000 0.5989 22.166 Std. Err. z P>|z| [95% Conf. Interval] .0003284 .2987652 15.22839 6.82 0.27 7.93 0.000 0.785 0.000 .0015961 -.504053 90.85942 .0028836 .667085 150.5536 hsngval pcturban faminc 2.region 3.region 4.region As we would expect, states with higher housing values have higher rental rates. The proportion of a state’s population that is urban does not have a significant effect on rents. Technical note In a simultaneous-equations framework, we could write the model we just fit as hsngvali = π0 + π1 faminci + π2 2.regioni + π3 3.regioni + π4 4.regioni + vi renti = β0 + β1 hsngvali + β2 pcturbani + ui which here happens to be recursive (triangular), because hsngval appears in the equation for rent but rent does not appear in the equation for hsngval. In general, however, systems of simultaneous equations are not recursive. Because this system is recursive, we could fit the two equations individually via OLS if we were willing to assume that u and v were independent. For a more detailed discussion of triangular systems, see Kmenta (1997, 719–720). Historically, instrumental-variables estimation and systems of simultaneous equations were taught concurrently, and older textbooks describe instrumental-variables estimation solely in the context of simultaneous equations. However, in recent decades, the treatment of endogeneity and instrumentalvariables estimation has taken on a much broader scope, while interest in the specification of complete systems of simultaneous equations has waned. Most recent textbooks, such as Cameron and Trivedi (2005), Davidson and MacKinnon (1993, 2004), and Wooldridge (2010, 2013), treat instrumental-variables estimation as an integral part of the modern economists’ toolkit and introduce it long before shorter discussions on simultaneous equations. In addition to the 2SLS member of the κ-class estimators, ivregress implements the LIML estimator. Both theoretical and Monte Carlo exercises indicate that the LIML estimator may yield less bias and confidence intervals with better coverage rates than the 2SLS estimator. See Poi (2006) and Stock, Wright, and Yogo (2002) (and the papers cited therein) for Monte Carlo evidence. ivregress — Single-equation instrumental-variables regression 933 Example 2: LIML estimator Here we refit our model with the LIML estimator: . ivregress liml rent pcturban (hsngval = faminc i.region) Instrumental variables (LIML) regression Number of obs Wald chi2(2) Prob > chi2 R-squared Root MSE rent Coef. hsngval pcturban _cons .0026686 -.1827391 117.6087 Instrumented: Instruments: Std. Err. z .0004173 .3571132 17.22625 6.39 -0.51 6.83 P>|z| = = = = = 50 75.71 0.0000 0.4901 24.992 [95% Conf. 
Interval] 0.000 0.609 0.000 .0018507 -.8826681 83.84587 .0034865 .5171899 151.3715 hsngval pcturban faminc 2.region 3.region 4.region These results are qualitatively similar to the 2SLS results, although the coefficient on hsngval is about 19% higher. GMM estimator Since the celebrated paper of Hansen (1982), the GMM has been a popular method of estimation in economics and finance, and it lends itself well to instrumental-variables estimation. The basic principle is that we have some moment or orthogonality conditions of the form E(zi ui ) = 0 (3) From (1), we have ui = yi − yi β1 − x1i β2 . What are the elements of the instrument vector zi ? By assumption, x1i is uncorrelated with ui , as are the excluded exogenous variables x2i , and so we use zi = [x1i x2i ]. The moment conditions are simply the mathematical representation of the assumption that the instruments are exogenous—that is, the instruments are orthogonal to (uncorrelated with) ui . If the number of elements in zi is just equal to the number of unknown parameters, then we can apply the analogy principle to (3) and solve 1 X 1 X zi ui = zi (yi − yi β1 − x1i β2 ) = 0 N i N i (4) This equation is known as the method of moments estimator. Here where the number of instruments equals the number of parameters, the method of moments estimator coincides with the 2SLS estimator, which also coincides with what has historically been called the indirect least-squares estimator (Judge et al. 1985, 595). The “generalized” in GMM addresses the case in which the number of instruments (columns of zi ) exceeds the number of parameters to be estimated. Here there is no unique solution to the population moment conditions defined in (3), so we cannot use (4). Instead, we define the objective function !0 ! 1 X 1 X Q(β1 , β2 ) = zi ui W zi u i (5) N N i i 934 ivregress — Single-equation instrumental-variables regression where W is a positive-definite matrix with the same number of rows and columns as the number of columns of zi . W is known as the weighting matrix, and we specify its structure with the wmatrix() option. The GMM estimator of (β1 , β2 ) minimizes Q(β1 , β2 ); that is, the GMM estimator chooses β1 and β2 to make the moment conditions as close to zero as possible for a given W. For a more general GMM estimator, see [R] gmm. gmm does not restrict you to fitting a single linear equation, though the syntax is more complex. A well-known result is that if we define the matrix S0 to be the covariance of zi ui and set W = S−1 0 , then we obtain the optimal two-step GMM estimator, where by optimal estimator we mean the one that results in the smallest variance given the moment conditions defined in (3). Suppose that the errors ui are heteroskedastic but independent among observations. Then S0 = E(zi ui ui z0i ) = E(u2i zi z0i ) and the sample analogue is X b= 1 S u b2 zi z0i N i i (6) To implement this estimator, we need estimates of the sample residuals u bi . ivregress gmm obtains b −1 . Equation (6) the residuals by estimating β1 and β2 by 2SLS and then evaluates (6) and sets W = S is the same as the center term of the “sandwich” robust covariance matrix available from most Stata estimation commands through the vce(robust) option. Example 3: GMM estimator Here we refit our model of rents by using the GMM estimator, allowing for heteroskedasticity in ui : . 
ivregress gmm rent pcturban (hsngval = faminc i.region), wmatrix(robust) Instrumental variables (GMM) regression Number of obs Wald chi2(2) Prob > chi2 R-squared Root MSE GMM weight matrix: Robust rent Coef. hsngval pcturban _cons .0014643 .7615482 112.1227 Instrumented: Instruments: Robust Std. Err. .0004473 .2895105 10.80234 z 3.27 2.63 10.38 = = = = = 50 112.09 0.0000 0.6616 20.358 P>|z| [95% Conf. Interval] 0.001 0.009 0.000 .0005877 .1941181 90.95052 .002341 1.328978 133.2949 hsngval pcturban faminc 2.region 3.region 4.region Because we requested that a heteroskedasticity-consistent weighting matrix be used during estimation but did not specify the vce() option, ivregress reported standard errors that are robust to heteroskedasticity. Had we specified vce(unadjusted), we would have obtained standard errors that would be correct only if the weighting matrix W does in fact converge to S−1 0 . ivregress — Single-equation instrumental-variables regression 935 Technical note Many software packages that implement GMM estimation use the same heteroskedasticity-consistent weighting matrix we used in the previous example to obtain the optimal two-step estimates but do not use a heteroskedasticity-consistent VCE, even though they may label the standard errors as being “robust”. To replicate results obtained from other packages, you may have to use the vce(unadjusted) option. See Methods and formulas below for a discussion of robust covariance matrix estimation in the GMM framework. By changing our definition of S0 , we can obtain GMM estimators suitable for use with other types of data that violate the assumption that the errors are independent and identically distributed. For example, you may have a dataset that consists of multiple observations for each person in a sample. The observations that correspond to the same person are likely to be correlated, and the estimation technique should account for that lack of independence. Say that in your dataset, people are identified by the variable personid and you type . ivregress gmm ..., wmatrix(cluster personid) Here ivregress estimates S0 as X b= 1 qc q0c S N c∈C where C denotes the set of clusters and qc = X u bi zi i∈cj where cj denotes the j th cluster. This weighting matrix accounts for the within-person correlation among observations, so the GMM estimator that uses this version of S0 will be more efficient than the estimator that ignores this correlation. Example 4: GMM estimator with clustering We have data from the National Longitudinal Survey on young women’s wages as reported in a series of interviews from 1968 through 1988, and we want to fit a model of wages as a function of each woman’s age and age squared, job tenure, birth year, and level of education. We believe that random shocks that affect a woman’s wage also affect her job tenure, so we treat tenure as endogenous. As additional instruments, we use her union status, number of weeks worked in the past year, and a dummy indicating whether she lives in a metropolitan area. Because we have several observations for each woman (corresponding to interviews done over several years), we want to control for clustering on each person. 936 ivregress — Single-equation instrumental-variables regression . use http://www.stata-press.com/data/r13/nlswork (National Longitudinal Survey. Young Women 14-26 years of age in 1968) . 
ivregress gmm ln_wage age c.age#c.age birth_yr grade > (tenure = union wks_work msp), wmatrix(cluster idcode) Instrumental variables (GMM) regression Number of obs = 18625 Wald chi2(5) = 1807.17 Prob > chi2 = 0.0000 R-squared = . GMM weight matrix: Cluster (idcode) Root MSE = .46951 (Std. Err. adjusted for 4110 clusters in idcode) Robust Std. Err. ln_wage Coef. tenure age .099221 .0171146 .0037764 .0066895 c.age#c.age -.0005191 birth_yr grade _cons -.0085994 .071574 .8575071 Instrumented: Instruments: z P>|z| [95% Conf. Interval] 26.27 2.56 0.000 0.011 .0918194 .0040034 .1066227 .0302259 .000111 -4.68 0.000 -.0007366 -.0003016 .0021932 .0029938 .1616274 -3.92 23.91 5.31 0.000 0.000 0.000 -.012898 .0657062 .5407231 -.0043008 .0774417 1.174291 tenure age c.age#c.age birth_yr grade union wks_work msp Both job tenure and years of schooling have significant positive effects on wages. Time-series data are often plagued by serial correlation. In these cases, we can construct a weighting matrix to account for the fact that the error in period t is probably correlated with the errors in periods t − 1, t − 2, etc. An HAC weighting matrix can be used to account for both serial correlation and potential heteroskedasticity. To request an HAC weighting matrix, you specify the wmatrix(hac kernel # | opt ) option. kernel specifies which of three kernels to use: bartlett, parzen, or quadraticspectral. kernel determines the amount of weight given to lagged values when computing the HAC matrix, and # denotes the maximum number of lags to use. Many texts refer to the bandwidth of the kernel instead of the number of lags; the bandwidth is equal to the number of lags plus one. If neither opt nor # is specified, then N − 2 lags are used, where N is the sample size. If you specify wmatrix(hac kernel opt), then ivregress uses Newey and West’s (1994) algorithm for automatically selecting the number of lags to use. Although the authors’ Monte Carlo simulations do show that the procedure may result in size distortions of hypothesis tests, the procedure is still useful when little other information is available to help choose the number of lags. For more on GMM estimation, see Baum (2006); Baum, Schaffer, and Stillman (2003, 2007); Cameron and Trivedi (2005); Davidson and MacKinnon (1993, 2004); Hayashi (2000); or Wooldridge (2010). See Newey and West (1987) and Wang and Wu (2012) for an introduction to HAC covariance matrix estimation. 
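For instance, with data that have already been declared as a time series (the time variable year below is a placeholder, as are y, x1, x2, z1, and z2), an HAC weight matrix could be requested either with a fixed lag length or with automatic lag selection:

. tsset year
. ivregress gmm y x1 (x2 = z1 z2), wmatrix(hac bartlett 4)
. ivregress gmm y x1 (x2 = z1 z2), wmatrix(hac nwest opt)

The first estimation uses the Bartlett (Newey–West) kernel with 4 lags (a bandwidth of 5); the second lets Newey and West's (1994) algorithm choose the number of lags.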
ivregress — Single-equation instrumental-variables regression 937 Stored results ivregress stores the following in e(): Scalars e(N) e(mss) e(df m) e(rss) e(df r) e(r2) e(r2 a) e(F) e(rmse) e(N clust) e(chi2) e(kappa) e(J) e(wlagopt) e(vcelagopt) e(rank) e(iterations) Macros e(cmd) e(cmdline) e(depvar) e(instd) e(insts) e(constant) e(wtype) e(wexp) e(title) e(clustvar) e(hac kernel) e(hac lag) e(vce) e(vcetype) e(estimator) e(exogr) e(wmatrix) e(moments) e(small) e(depname) e(properties) e(estat cmd) e(predict) e(footnote) e(marginsok) e(marginsnotok) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(W) e(S) e(V) e(V modelbased) Functions e(sample) number of observations model sum of squares model degrees of freedom residual sum of squares residual degrees of freedom R2 adjusted R2 F statistic root mean squared error number of clusters χ2 κ used in LIML estimator value of GMM objective function lags used in HAC weight matrix (if Newey–West algorithm used) lags used in HAC VCE matrix (if Newey–West algorithm used) rank of e(V) number of GMM iterations (0 if not applicable) ivregress command as typed name of dependent variable instrumented variable instruments noconstant or hasconstant if specified weight type weight expression title in estimation output name of cluster variable HAC kernel HAC lag vcetype specified in vce() title used to label Std. Err. 2sls, liml, or gmm exogenous regressors wmtype specified in wmatrix() centered if center specified small if small-sample statistics depname if depname(depname) specified; otherwise same as e(depvar) b V program used to implement estat program used to implement predict program used to implement footnote display predictions allowed by margins predictions disallowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix weight matrix used to compute GMM estimates moment covariance matrix used to compute GMM variance–covariance matrix variance–covariance matrix of the estimators model-based variance marks estimation sample 938 ivregress — Single-equation instrumental-variables regression Methods and formulas Methods and formulas are presented under the following headings: Notation 2SLS and LIML estimators GMM estimator Notation Items printed in lowercase and italicized (for example, x) are scalars. Items printed in lowercase and boldfaced (for example, x) are vectors. Items printed in uppercase and boldfaced (for example, X) are matrices. The model is y = Yβ1 + X1 β2 + u = Xβ + u Y = X1 Π1 + X2 Π2 + v = ZΠ + V where y is an N × 1 vector of the left-hand-side variable; N is the sample size; Y is an N × p matrix of p endogenous regressors; X1 is an N × k1 matrix of k1 included exogenous regressors; X2 is an N × k2 matrix of k2 excluded exogenous variables, X = [Y X1 ], Z = [X1 X2 ]; u is an N × 1 vector of errors; V is an N × p matrix of errors; β = [β1 β2 ] is a k = (p + k1 ) × 1 vector of parameters; and Π is a (k1 + k2 ) × p vector of parameters. If a constant term is included in the model, then one column of X1 contains all ones. Let v be a column vector of weights specified by the user. If no weights are specified, v = 1. Let w be a column vector of normalized weights.If no weights are specified or if the user specified fweights or iweights, w = v; otherwise, w = v/(10 v) (10 1). Let D denote the N × N matrix with w on the main diagonal and zeros elsewhere. If no weights are specified, D is the identity matrix. The weighted number of observations n is defined as 10 w. 
For iweights, this is truncated to an integer. The sum of the weights is 10 v. Define c = 1 if there is a constant in the regression and zero otherwise. The order condition for identification requires that k2 ≥ p: the number of excluded exogenous variables must be at least as great as the number of endogenous regressors. In the following formulas, if weights are specified, X01 X1 , X0 X, X0 y, y0 y, Z0 Z, Z0 X, and Z0 y are replaced with X01 DX1 , X0 DX, X0 Dy, y0 Dy, Z0 DZ, Z0 DX, and Z0 Dy, respectively. We suppress the D below to simplify the notation. 2SLS and LIML estimators Define the κ-class estimator of β as b = X0 (I − κMZ )X −1 X0 (I − κMZ )y where MZ = I − Z(Z0 Z)−1 Z0 . The 2SLS estimator results from setting κ = 1. The LIML estimator results from selecting κ to be the minimum eigenvalue of (Y0 MZ Y)−1/2 Y0 MX1 Y(Y0 MZ Y)−1/2 , where MX1 = I − X1 (X01 X1 )−1 X01 . The total sum of squares (TSS) equals y0 y if there is no intercept and y0 y − (10 y)2 /n otherwise. The degrees of freedom is n−c. The error sum of squares (ESS) is defined as y0 y− 2bX0 y+b0 X0 Xb. The model sum of squares (MSS) equals TSS − ESS. The degrees of freedom is k − c. ivregress — Single-equation instrumental-variables regression 939 The mean squared error, s2 , is defined as ESS/(n − k) if small is specified and ESS/n otherwise. The root mean squared error is s, its square root. If c = 1 and small is not specified, a Wald statistic, W , of the joint significance of the k − 1 parameters of β except the constant term is calculated; W ∼ χ2 (k − 1). If c = 1 and small is specified, then an F statistic is calculated as F = W/(k − 1); F ∼ F (k − 1, n − k). The R-squared is defined as R2 = 1 − ESS/TSS. The adjusted R-squared is Ra2 = 1 − (1 − R2 )(n − c)/(n − k). −1 If robust is not specified, then Var(b) = s2 X0 (I − κMZ )X . For a discussion of robust variance estimates in regression and regression with instrumental variables, see Methods and formulas in [R] regress. If small is not specified, then k = 0 in the formulas given there. This command also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. GMM estimator We obtain an initial consistent estimate of β by using the 2SLS estimator; see above. Using this estimate of β, we compute the weighting matrix W and calculate the GMM estimator −1 0 bGMM = X0 ZWZ0 X X ZWZ0 y The variance of bGMM is Var(bGMM ) = n X0 ZWZ0 X −1 0 b X0 ZWSWZ X X0 ZWZ0 X −1 Var(bGMM ) is of the sandwich form DMD; see [P] robust. If the user specifies the small option, ivregress implements a small-sample adjustment by multiplying the VCE by N/(N − k). b = W−1 and the VCE reduces to the “optimal” If vce(unadjusted) is specified, then we set S GMM variance estimator −1 Var(βGMM ) = n X0 ZWZ0 X However, if W−1 is not a good estimator of E(zi ui ui z0i ), then the optimal GMM estimator is inefficient, and inference based on the optimal variance estimator could be misleading. W is calculated using the residuals from the initial 2SLS estimates, whereas S is estimated using the residuals based on bGMM . The wmatrix() option affects the form of W, whereas the vce() option affects the form of S. Except for different residuals being used, the formulas for W−1 and S are identical, so we focus on estimating W−1 . If wmatrix(unadjusted) is specified, then W−1 = where s2 = P i s2 X zi z0i n i u2i /n. This weight matrix is appropriate if the errors are homoskedastic. 
If wmatrix(robust) is specified, then W−1 = 1X 2 0 u zi zi n i i which is appropriate if the errors are heteroskedastic. 940 ivregress — Single-equation instrumental-variables regression If wmatrix(cluster clustvar) is specified, then 1X W−1 = qc q0c n c where c indexes clusters, qc = X u i zi i∈cj and cj denotes the j th cluster. If wmatrix(hac kernel # ) is specified, then W −1 l=n−1 i=n 1X 2 0 1 X X u zi zi + K(l, m)ui ui−l zi z0i−l + zi−l z0i = n i i n l=1 i=l+1 where m = # if # is specified and m = n − 2 otherwise. Define z = l/(m + 1). If kernel is nwest, then n 1−z 0≤z ≤1 K(l, m) = 0 otherwise If kernel is gallant, then ( 1 − 6z 2 + 6z 3 0 ≤ z ≤ 0.5 K(l, m) = 2(1 − z)3 0.5 < z ≤ 1 0 otherwise If kernel is quadraticspectral, then 1 K(l, m) = 3 {sin(θ)/θ − cos(θ)} /θ2 z=0 otherwise where θ = 6πz/5. If wmatrix(hac kernel opt) is specified, then ivregress uses Newey and West’s (1994) automatic lag-selection algorithm, which proceeds as follows. Define h to be a (k1 + k2 ) × 1 vector containing ones in all rows except for the row corresponding to the constant term (if present); that row contains a zero. Define fi = (ui zi )h n 1 X σ bj = fi fi−j n i=j+1 j = 0, . . . , m∗ ∗ sb (q) =2 m X σ bj j q j=1 ∗ sb (0) =σ b0 + 2 m X σ bj j=1 ( γ b = cγ sb (q) sb (0) m=γ bn1/(2q+1) 2 )1/2q+1 ivregress — Single-equation instrumental-variables regression 941 where q , m∗ , and cγ depend on the kernel specified: Kernel m∗ q Bartlett 1 Parzen 2 Quadratic spectral 2 int 20(T /100)2/9 int 20(T /100)4/25 int 20(T /100)2/25 cγ 1.1447 2.6614 1.3221 where int(x) denotes the integer obtained by truncating x toward zero. For the Bartlett and Parzen kernels, the optimal lag is min{int(m), m∗ }. For the quadratic spectral, the optimal lag is min{m, m∗ }. If center is specified, when computing weighting P matrices ivregress replaces the term ui zi in the formulas above with ui zi − uz, where uz = i ui zi /N . References Andrews, D. W. K. 1991. Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59: 817–858. Angrist, J. D., and J.-S. Pischke. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton, NJ: Princeton University Press. Basmann, R. L. 1957. A generalized classical method of linear estimation of coefficients in a structural equation. Econometrica 25: 77–83. Bauldry, S. 2014. miivfind: A command for identifying model-implied instrumental variables for structural equation models in Stata. Stata Journal 14: 60–75. Baum, C. F. 2006. An Introduction to Modern Econometrics Using Stata. College Station, TX: Stata Press. Baum, C. F., M. E. Schaffer, and S. Stillman. 2003. Instrumental variables and GMM: Estimation and testing. Stata Journal 3: 1–31. . 2007. Enhanced routines for instrumental variables/generalized method of moments estimation and testing. Stata Journal 7: 465–506. Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. New York: Cambridge University Press. . 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press. Davidson, R., and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press. . 2004. Econometric Theory and Methods. New York: Oxford University Press. Desbordes, R., and V. Verardi. 2012. A robust instrumental-variables estimator. Stata Journal 12: 169–181. Finlay, K., and L. M. Magnusson. 2009. Implementing weak-instrument robust tests for a general class of instrumentalvariables models. Stata Journal 9: 398–421. Gallant, A. R. 1987. 
Nonlinear Statistical Models. New York: Wiley. Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. Hall, A. R. 2005. Generalized Method of Moments. Oxford: Oxford University Press. Hansen, L. P. 1982. Large sample properties of generalized method of moments estimators. Econometrica 50: 1029–1054. Hayashi, F. 2000. Econometrics. Princeton, NJ: Princeton University Press. Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2nd ed. New York: Wiley. Kmenta, J. 1997. Elements of Econometrics. 2nd ed. Ann Arbor: University of Michigan Press. Koopmans, T. C., and W. C. Hood. 1953. Studies in Econometric Method. New York: Wiley. 942 ivregress — Single-equation instrumental-variables regression Koopmans, T. C., and J. Marschak. 1950. Statistical Inference in Dynamic Economic Models. New York: Wiley. Newey, W. K., and K. D. West. 1987. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55: 703–708. . 1994. Automatic lag selection in covariance matrix estimation. Review of Economic Studies 61: 631–653. Nichols, A. 2007. Causal inference with observational data. Stata Journal 7: 507–541. Palmer, T. M., V. Didelez, R. R. Ramsahai, and N. A. Sheehan. 2011. Nonparametric bounds for the causal effect in a binary instrumental-variable model. Stata Journal 11: 345–367. Poi, B. P. 2006. Jackknife instrumental variables estimation in Stata. Stata Journal 6: 364–376. Stock, J. H., and M. W. Watson. 2011. Introduction to Econometrics. 3rd ed. Boston: Addison–Wesley. Stock, J. H., J. H. Wright, and M. Yogo. 2002. A survey of weak instruments and weak identification in generalized method of moments. Journal of Business and Economic Statistics 20: 518–529. Theil, H. 1953. Repeated Least Squares Applied to Complete Equation Systems. Mimeograph from the Central Planning Bureau, The Hague. Wang, Q., and N. Wu. 2012. Long-run covariance and its applications in cointegration regression. Stata Journal 12: 515–542. Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press. . 2013. Introductory Econometrics: A Modern Approach. 5th ed. Mason, OH: South-Western. Wright, P. G. 1928. The Tariff on Animal and Vegetable Oils. New York: Macmillan. 
Also see [R] ivregress postestimation — Postestimation tools for ivregress [R] gmm — Generalized method of moments estimation [R] ivprobit — Probit model with continuous endogenous regressors [R] ivtobit — Tobit model with continuous endogenous regressors [R] reg3 — Three-stage estimation for systems of simultaneous equations [R] regress — Linear regression [SEM] intro 5 — Tour of models [SVY] svy estimation — Estimation commands for survey data [TS] forecast — Econometric model forecasting [XT] xtivreg — Instrumental variables and two-stage least squares for panel-data models [U] 20 Estimation and postestimation commands Title ivregress postestimation — Postestimation tools for ivregress Description Syntax for estat Stored results Syntax for predict Menu for estat Methods and formulas Menu for predict Options for estat References Options for predict Remarks and examples Also see Description The following postestimation commands are of special interest after ivregress: Command Description estat endogenous estat firststage estat overid perform tests of endogeneity report “first-stage” regression statistics perform tests of overidentifying restrictions These commands are not appropriate after the svy prefix. The following postestimation commands are also available: Command Description contrast estat summarize estat vce estat (svy) estimates forecast1 hausman lincom contrasts and ANOVA-style joint tests of estimates summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations Hausman’s specification test point estimates, standard errors, testing, and inference for linear combinations of coefficients marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses margins marginsplot nlcom predict predictnl pwcompare test testnl 1 forecast is not appropriate with svy estimation results. 943 944 ivregress postestimation — Postestimation tools for ivregress Special-interest postestimation commands estat endogenous performs tests to determine whether endogenous regressors in the model are in fact exogenous. After GMM estimation, the C (difference-in-Sargan) statistic is reported. After 2SLS estimation with an unadjusted VCE, the Durbin (1954) and Wu–Hausman (Wu 1974; Hausman 1978) statistics are reported. After 2SLS estimation with a robust VCE, Wooldridge’s (1995) robust score test and a robust regression-based test are reported. In all cases, if the test statistic is significant, then the variables being tested must be treated as endogenous. estat endogenous is not available after LIML estimation. estat firststage reports various statistics that measure the relevance of the excluded exogenous variables. By default, whether the equation has one or more than one endogenous regressor determines what statistics are reported. estat overid performs tests of overidentifying restrictions. 
If the 2SLS estimator was used, Sargan's (1958) and Basmann's (1960) χ2 tests are reported, as is Wooldridge's (1995) robust score test; if the LIML estimator was used, Anderson and Rubin's (1950) χ2 test and Basmann's F test are reported; and if the GMM estimator was used, Hansen's (1982) J statistic χ2 test is reported. In all cases, a statistically significant test statistic indicates that the instruments may not be valid.

Syntax for predict

    predict [type] newvar [if] [in] [, statistic]

    predict [type] {stub* | newvarlist} [if] [in], scores

statistic        Description
--------------------------------------------------------------------
Main
  xb             linear prediction; the default
  residuals      residuals
  stdp           standard error of the prediction
  stdf           standard error of the forecast
  pr(a,b)        Pr(a < y_j < b)
  e(a,b)         E(y_j | a < y_j < b)
  ystar(a,b)     E(y_j*), y_j* = max{a, min(y_j, b)}
--------------------------------------------------------------------
These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
stdf is not allowed with svy estimation results.
where a and b may be numbers or variables; a missing (a ≥ .) means −∞, and b missing (b ≥ .) means +∞; see [U] 12.2.1 Missing values.

Menu for predict

Statistics > Postestimation > Predictions, residuals, etc.

Options for predict

Main

xb, the default, calculates the linear prediction.

residuals calculates the residuals, that is, y_j − x_j b. These are based on the estimated equation when the observed values of the endogenous variables are used—not the projections of the instruments onto the endogenous variables.

stdp calculates the standard error of the prediction, which can be thought of as the standard error of the predicted expected value or mean for the observation's covariate pattern. This is also referred to as the standard error of the fitted value.

stdf calculates the standard error of the forecast, which is the standard error of the point prediction for 1 observation. It is commonly referred to as the standard error of the future or forecast value. By construction, the standard errors produced by stdf are always larger than those produced by stdp; see Methods and formulas in [R] regress postestimation.

pr(a,b) calculates Pr(a < x_j b + u_j < b), the probability that y_j|x_j would be observed in the interval (a, b).
    a and b may be specified as numbers or variable names; lb and ub are variable names;
    pr(20,30) calculates Pr(20 < x_j b + u_j < 30);
    pr(lb,ub) calculates Pr(lb < x_j b + u_j < ub);
    and pr(20,ub) calculates Pr(20 < x_j b + u_j < ub).
    a missing (a ≥ .) means −∞; pr(.,30) calculates Pr(−∞ < x_j b + u_j < 30);
    pr(lb,30) calculates Pr(−∞ < x_j b + u_j < 30) in observations for which lb ≥ . and calculates Pr(lb < x_j b + u_j < 30) elsewhere.
    b missing (b ≥ .) means +∞; pr(20,.) calculates Pr(+∞ > x_j b + u_j > 20);
    pr(20,ub) calculates Pr(+∞ > x_j b + u_j > 20) in observations for which ub ≥ . and calculates Pr(20 < x_j b + u_j < ub) elsewhere.

e(a,b) calculates E(x_j b + u_j | a < x_j b + u_j < b), the expected value of y_j|x_j conditional on y_j|x_j being in the interval (a, b), meaning that y_j|x_j is truncated. a and b are specified as they are for pr().

ystar(a,b) calculates E(y_j*), where y_j* = a if x_j b + u_j ≤ a, y_j* = b if x_j b + u_j ≥ b, and y_j* = x_j b + u_j otherwise, meaning that y_j* is censored. a and b are specified as they are for pr().

scores calculates the scores for the model. A new score variable is created for each endogenous regressor, as well as an equation-level score that applies to all exogenous variables and the constant term (if present).
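For instance, the default linear prediction and the residuals can be computed after the rental-rate model that is used in the examples below; the following is a minimal sketch (the new variable names xbhat, uhat, and se_xb are arbitrary):

. use http://www.stata-press.com/data/r13/hsng
. ivregress 2sls rent pcturban (hsngval = faminc i.region)
(output omitted )
. predict double xbhat, xb
. predict double uhat, residuals
. predict double se_xb, stdp
. summarize xbhat uhat se_xb

Because residuals uses the observed values of hsngval rather than its first-stage projection, uhat differs from the residuals one would obtain by refitting the second stage by hand with fitted values of hsngval.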
Syntax for estat Perform tests of endogeneity estat endogenous varlist , lags(#) forceweights forcenonrobust Report “first-stage” regression statistics estat firststage , all forcenonrobust 946 ivregress postestimation — Postestimation tools for ivregress Perform tests of overidentifying restrictions estat overid , lags(#) forceweights forcenonrobust Menu for estat Statistics > Postestimation > Reports and statistics Options for estat Options for estat are presented under the following headings: Options for estat endogenous Options for estat firststage Options for estat overid Options for estat endogenous lags(#) specifies the number of lags to use for prewhitening when computing the heteroskedasticityand autocorrelation-consistent (HAC) version of the score test of endogeneity. Specifying lags(0) requests no prewhitening. This option is valid only when the model was fit via 2SLS and an HAC covariance matrix was requested when the model was fit. The default is lags(1). forceweights requests that the tests of endogeneity be computed even though aweights, pweights, or iweights were used in the previous estimation. By default, these tests are conducted only after unweighted or frequency-weighted estimation. The reported critical values may be inappropriate for weighted data, so the user must determine whether the critical values are appropriate for a given application. forcenonrobust requests that the Durbin and Wu–Hausman tests be performed after 2SLS estimation even though a robust VCE was used at estimation time. This option is available only if the model was fit by 2SLS. Options for estat firststage all requests that all first-stage goodness-of-fit statistics be reported regardless of whether the model contains one or more endogenous regressors. By default, if the model contains one endogenous regressor, then the first-stage R2 , adjusted R2 , partial R2 , and F statistics are reported, whereas if the model contains multiple endogenous regressors, then Shea’s partial R2 and adjusted partial R2 are reported instead. forcenonrobust requests that the minimum eigenvalue statistic and its critical values be reported even though a robust VCE was used at estimation time. The reported critical values assume that the errors are independent and identically distributed (i.i.d.) normal, so the user must determine whether the critical values are appropriate for a given application. ivregress postestimation — Postestimation tools for ivregress 947 Options for estat overid lags(#) specifies the number of lags to use for prewhitening when computing the heteroskedasticityand autocorrelation-consistent (HAC) version of the score test of overidentifying restrictions. Specifying lags(0) requests no prewhitening. This option is valid only when the model was fit via 2SLS and an HAC covariance matrix was requested when the model was fit. The default is lags(1). forceweights requests that the tests of overidentifying restrictions be computed even though aweights, pweights, or iweights were used in the previous estimation. By default, these tests are conducted only after unweighted or frequency-weighted estimation. The reported critical values may be inappropriate for weighted data, so the user must determine whether the critical values are appropriate for a given application. forcenonrobust requests that the Sargan and Basmann tests of overidentifying restrictions be performed after 2SLS or LIML estimation even though a robust VCE was used at estimation time. These tests assume that the errors are i.i.d. 
normal, so the user must determine whether the critical values are appropriate for a given application. Remarks and examples Remarks are presented under the following headings: estat endogenous estat firststage estat overid estat endogenous A natural question to ask is whether a variable presumed to be endogenous in the previously fit model could instead be treated as exogenous. If the endogenous regressors are in fact exogenous, then the OLS estimator is more efficient; and depending on the strength of the instruments and other factors, the sacrifice in efficiency by using an instrumental-variables estimator can be significant. Thus, unless an instrumental-variables estimator is really needed, OLS should be used instead. estat endogenous provides several tests of endogeneity after 2SLS and GMM estimation. Example 1 In example 1 of [R] ivregress, we fit a model of the average rental rate for housing in a state as a function of the percentage of the population living in urban areas and the average value of houses. We treated hsngval as endogenous because unanticipated shocks that affect rental rates probably affect house prices as well. We used family income and region dummies as additional instruments for hsngval. Here we test whether we could treat hsngval as exogenous. . use http://www.stata-press.com/data/r13/hsng (1980 Census housing data) . ivregress 2sls rent pcturban (hsngval = faminc i.region) (output omitted ) . estat endogenous Tests of endogeneity Ho: variables are exogenous Durbin (score) chi2(1) Wu-Hausman F(1,46) = = 12.8473 15.9067 (p = 0.0003) (p = 0.0002) 948 ivregress postestimation — Postestimation tools for ivregress Because we did not specify any variable names after the estat endogenous command, Stata by default tested all the endogenous regressors (namely, hsngval) in our model. The null hypothesis of the Durbin and Wu–Hausman tests is that the variable under consideration can be treated as exogenous. Here both test statistics are highly significant, so we reject the null of exogeneity; we must continue to treat hsngval as endogenous. The difference between the Durbin and Wu–Hausman tests of endogeneity is that the former uses an estimate of the error term’s variance based on the model assuming the variables being tested are exogenous, while the latter uses an estimate of the error variance based on the model assuming the variables being tested are endogenous. Under the null hypothesis that the variables being tested are exogenous, both estimates of the error variance are consistent. What we label the Wu–Hausman statistic is Wu’s (1974) “T2 ” statistic, which Hausman (1978) showed can be calculated very easily via linear regression. Baum, Schaffer, and Stillman (2003, 2007) provide a lucid discussion of these tests. When you fit a model with multiple endogenous regressors, you can test the exogeneity of a subset of the regressors while continuing to treat the others as endogenous. For example, say you have three endogenous regressors, y1, y2, and y3, and you fit your model by typing . ivregress depvar . . . (y1 y2 y3 = . . .) Suppose you are confident that y1 must be treated as endogenous, but you are undecided about y2 and y3. To test whether y2 and y3 can be treated as exogenous, you would type . estat endogenous y2 y3 The Durbin and Wu–Hausman tests assume that the error term is i.i.d. Therefore, if you requested a robust VCE at estimation time, estat endogenous will instead report Wooldridge’s (1995) score test and a regression-based test of exogeneity. 
Both these tests can tolerate heteroskedastic and autocorrelated errors, while only the regression-based test is amenable to clustering. Example 2 We refit our housing model, requesting robust standard errors, and then test the exogeneity of hsngval: . use http://www.stata-press.com/data/r13/hsng (1980 Census housing data) . ivregress 2sls rent pcturban (hsngval = faminc i.region), vce(robust) (output omitted ) . estat endogenous Tests of endogeneity Ho: variables are exogenous Robust score chi2(1) = 2.10428 (p = 0.1469) Robust regression F(1,46) = 4.31101 (p = 0.0435) Wooldridge’s score test does not reject the null hypothesis that hsngval is exogenous at conventional significance levels (p = 0.1469). However, the regression-based test does reject the null hypothesis at the 5% significance level (p = 0.0435). Typically, these two tests yield the same conclusion; the fact that our dataset has only 50 observations could be contributing to the discrepancy. Here we would be inclined to continue to treat hsngval as endogenous. Even if hsngval is exogenous, the 2SLS estimates are still consistent. On the other hand, if hsngval is in fact endogenous, the OLS estimates would not be consistent. Moreover, as we will see in our discussion of the estat overid command, our additional instruments may be invalid. To test whether an endogenous variable can be treated as exogenous, we must have a valid set of instruments to use to fit the model in the first place! ivregress postestimation — Postestimation tools for ivregress 949 Unlike the Durbin and Wu–Hausman tests, Wooldridge’s score and the regression-based tests do not allow you to test a subset of the endogenous regressors in the model; you can test only whether all the endogenous regressors are in fact exogenous. After GMM estimation, estat endogenous calculates what Hayashi (2000, 220) calls the C statistic, also known as the difference-in-Sargan statistic. The C statistic can be made robust to heteroskedasticity, autocorrelation, and clustering; and the version reported by estat endogenous is determined by the weight matrix requested via the wmatrix() option used when fitting the model with ivregress. Additionally, the test can be used to determine the exogeneity of a subset of the endogenous regressors, regardless of the type of weight matrix used. If you fit your model using the LIML estimator, you can use the hausman command to carry out a traditional Hausman (1978) test between the OLS and LIML estimates. estat firststage For an excluded exogenous variable to be a valid instrument, it must be sufficiently correlated with the included endogenous regressors but uncorrelated with the error term. In recent decades, researchers have paid considerable attention to the issue of instruments that are only weakly correlated with the endogenous regressors. In such cases, the usual 2SLS, GMM, and LIML estimators are biased toward the OLS estimator, and inference based on the standard errors reported by, for example, ivregress can be severely misleading. For more information on the theory behind instrumental-variables estimation with weak instruments, see Nelson and Startz (1990); Staiger and Stock (1997); Hahn and Hausman (2003); the survey article by Stock, Wright, and Yogo (2002); and Angrist and Pischke (2009, chap. 4). 
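As noted above, a traditional Hausman (1978) test is available after LIML by storing the LIML and OLS results and calling hausman; a minimal sketch for the same rental-rate model follows (the names liml and ols given to estimates store are arbitrary; see [R] hausman for options controlling how the variance estimates are pooled):

. ivregress liml rent pcturban (hsngval = faminc i.region)
(output omitted )
. estimates store liml
. regress rent pcturban hsngval
(output omitted )
. estimates store ols
. hausman liml ols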
When the instruments are only weakly correlated with the endogenous regressors, some Monte Carlo evidence suggests that the LIML estimator performs better than the 2SLS and GMM estimators; see, for example, Poi (2006) and Stock, Wright, and Yogo (2002) (and the papers cited therein). On the other hand, the LIML estimator often results in confidence intervals that are somewhat larger than those from the 2SLS estimator. Moreover, using more instruments is not a solution, because the biases of instrumental-variables estimators increase with the number of instruments. See Hahn and Hausman (2003). estat firststage produces several statistics for judging the explanatory power of the instruments and is most easily explained with examples. Example 3 Again building on the model fit in example 1 of [R] ivregress, we now explore the degree of correlation between the additional instruments faminc, 2.region, 3.region, and 4.region and the endogenous regressor hsngval: . use http://www.stata-press.com/data/r13/hsng (1980 Census housing data) . ivregress 2sls rent pcturban (hsngval = faminc i.region) (output omitted ) . estat firststage First-stage regression summary statistics Variable R-sq. Adjusted R-sq. Partial R-sq. F(4,44) Prob > F hsngval 0.6908 0.6557 0.5473 13.2978 0.0000 950 ivregress postestimation — Postestimation tools for ivregress Minimum eigenvalue statistic = 13.2978 Critical Values # of endogenous regressors: Ho: Instruments are weak # of excluded instruments: 2SLS relative bias 5% 16.85 10% 10.27 20% 6.71 30% 5.34 2SLS Size of nominal 5% Wald test LIML Size of nominal 5% Wald test 10% 24.58 5.44 15% 13.96 3.87 20% 10.26 3.30 25% 8.31 2.98 1 4 To understand these results, recall that the first-stage regression is hsngvali = π0 + π1 pcturbani + π2 faminc + π3 2.region + π4 3.region + π5 4.region + vi where vi is an error term. The column marked “R-sq.” is the simple R2 from fitting the first-stage regression by OLS, and the column marked “Adjusted R-sq.” is the adjusted R2 from that regression. Higher values purportedly indicate stronger instruments, and instrumental-variables estimators exhibit less bias when the instruments are strongly correlated with the endogenous variable. Looking at just the R2 and adjusted R2 can be misleading, however. If hsngval were strongly correlated with the included exogenous variable pcturban but only weakly correlated with the additional instruments, then these statistics could be large even though a weak-instrument problem is present. The partial R2 statistic measures the correlation between hsngval and the additional instruments after partialling out the effect of pcturban. Unlike the R2 and adjusted R2 statistics, the partial R2 statistic will not be inflated because of strong correlation between hsngval and pcturban. Bound, Jaeger, and Baker (1995) and others have promoted using this statistic. The column marked “F(4, 44)” is an F statistic for the joint significance of π2 , π3 , π4 , and π5 , the coefficients on the additional instruments. Its p-value is listed in the column marked “Prob > F”. If the F statistic is not significant, then the additional instruments have no significant explanatory power for hsngval after controlling for the effect of pcturban. However, Hall, Rudebusch, and Wilcox (1996) used Monte Carlo simulation to show that simply having an F statistic that is significant at the typical 5% or 10% level is not sufficient. 
Stock, Wright, and Yogo (2002) suggest that the F statistic should exceed 10 for inference based on the 2SLS estimator to be reliable when there is one endogenous regressor. estat firststage also presents the Cragg and Donald (1993) minimum eigenvalue statistic as a further test of weak instruments. Stock and Yogo (2005) discuss two characterizations of weak instruments: first, weak instruments cause instrumental-variables estimators to be biased; second, hypothesis tests of parameters estimated by instrumental-variables estimators may suffer from severe size distortions. The test statistic in our example is 13.30, which is identical to the F statistic just discussed because our model contains one endogenous regressor. The null hypothesis of each of Stock and Yogo’s tests is that the set of instruments is weak. To perform these tests, we must first choose either the largest relative bias of the 2SLS estimator we are willing to tolerate or the largest rejection rate of a nominal 5% Wald test we are willing to tolerate. If the test statistic exceeds the critical value, we can conclude that our instruments are not weak. The row marked “2SLS relative bias” contains critical values for the test that the instruments are weak based on the bias of the 2SLS estimator relative to the bias of the OLS estimator. For example, from past experience we might know that the OLS estimate of a parameter β may be 50% too high. Saying that we are willing to tolerate a 10% relative bias means that we are willing to tolerate a bias of the 2SLS estimator no greater than 5% (that is, 10% of 50%). In our rental rate model, if we ivregress postestimation — Postestimation tools for ivregress 951 are willing to tolerate a 10% relative bias, then we can conclude that our instruments are not weak because the test statistic of 13.30 exceeds the critical value of 10.22. However, if we were willing to tolerate only a relative bias of 5%, we would conclude that our instruments are weak because 13.30 < 16.85. The rows marked “2SLS Size of nominal 5% Wald test” and “LIML Size of nominal 5% Wald test” contain critical values pertaining to Stock and Yogo’s (2005) second characterization of weak instruments. This characterization defines a set of instruments to be weak if a Wald test at the 5% level can have an actual rejection rate of no more than 10%, 15%, 20%, or 25%. Using the current example, suppose that we are willing to accept a rejection rate of at most 10%. Because 13.30 < 24.58, we cannot reject the null hypothesis of weak instruments. On the other hand, if we use the LIML estimator instead, then we can reject the null hypothesis because 13.30 > 5.44. Technical note Stock and Yogo (2005) tabulated critical values for 2SLS relative biases of 5%, 10%, 20%, and 30% for models with 1, 2, or 3 endogenous regressors and between 3 and 30 excluded exogenous variables (instruments). They also provide critical values for worst-case rejection rates of 5%, 10%, 20%, and 25% for nominal 5% Wald tests of the endogenous regressors with 1 or 2 endogenous regressors and between 1 and 30 instruments. If the model previously fit by ivregress has more instruments or endogenous regressors than these limits, the critical values are not shown. Stock and Yogo did not consider GMM estimators. When the model being fit contains more than one endogenous regressor, the R2 and F statistics described above can overstate the relevance of the excluded instruments. 
Suppose that there are two endogenous regressors, Y1 and Y2 , and that there are two additional instruments, z1 and z2 . Say that z1 is highly correlated with both Y1 and Y2 but z2 is not correlated with either Y1 or Y2 . Then the first-stage regression of Y1 on z1 and z2 (along with the included exogenous variables) will produce large R2 and F statistics, as will the regression of Y2 on z1 , z2 , and the included exogenous variables. Nevertheless, the lack of correlation between z2 and Y1 and Y2 is problematic. Here, although the order condition indicates that the model is just identified (the number of excluded instruments equals the number of endogenous regressors), the irrelevance of z2 implies that the model is in fact not identified. Even if the model is overidentified, including irrelevant instruments can adversely affect the properties of instrumental-variables estimators, because their biases increase as the number of instruments increases. Example 4 estat firststage presents different statistics when the model contains multiple endogenous regressors. For illustration, we refit our model of rental rates, assuming that both hsngval and faminc are endogenously determined. We use i.region along with popden, a measure of population density, as additional instruments. 952 ivregress postestimation — Postestimation tools for ivregress . ivregress 2sls rent pcturban (hsngval faminc = i.region popden) (output omitted ) . estat firststage Shea’s partial R-squared Variable Shea’s Partial R-sq. Shea’s Adj. Partial R-sq. hsngval faminc 0.3477 0.1893 0.2735 0.0972 Minimum eigenvalue statistic = 2.51666 Critical Values Ho: Instruments are weak # of endogenous regressors: # of excluded instruments: 2SLS relative bias 5% 11.04 10% 7.56 20% 5.57 30% 4.73 2SLS Size of nominal 5% Wald test LIML Size of nominal 5% Wald test 10% 16.87 4.72 15% 9.93 3.39 20% 7.54 2.99 25% 6.28 2.79 2 4 Consider the endogenous regressor hsngval. Part of its variation is attributable to its correlation with the other regressors pcturban and faminc. The other component of hsngval’s variation is peculiar to it and orthogonal to the variation in the other regressors. Similarly, we can think of the instruments as predicting the variation in hsngval in two ways, one stemming from the fact that the predicted values of hsngval are correlated with the predicted values of the other regressors and one from the variation in the predicted values of hsngval that is orthogonal to the variation in the predicted values of the other regressors. What really matters for instrumental-variables estimation is whether the component of hsngval that is orthogonal to the other regressors can be explained by the component of the predicted value of hsngval that is orthogonal to the predicted values of the other regressors in the model. Shea’s (1997) partial R2 statistic measures this correlation. Because the bias of instrumental-variables estimators increases as more instruments are used, Shea’s adjusted partial R2 statistic is often used instead, as it makes a degrees-of-freedom adjustment for the number of instruments, analogous to the adjusted R2 measure used in OLS regression. Although what constitutes a “low” value for Shea’s partial R2 depends on the specifics of the model being fit and the data used, these results, taken in isolation, do not strike us as being a particular cause for concern. However, with this specification the minimum eigenvalue statistic is low. 
We cannot reject the null hypothesis of weak instruments for either of the characterizations we have discussed. By default, estat firststage determines which statistics to present based on the number of endogenous regressors in the model previously fit. However, you can specify the all option to obtain all the statistics. Technical note If the previous estimation was conducted using aweights, pweights, or iweights, then the first-stage regression summary statistics are computed using those weights. However, in these cases the minimum eigenvalue statistic and its critical values are not available. ivregress postestimation — Postestimation tools for ivregress 953 If the previous estimation included a robust VCE, then the first-stage F statistic is based on a robust VCE as well; for example, if you fit your model with an HAC VCE using the Bartlett kernel and four lags, then the F statistic reported is based on regression results using an HAC VCE using the Bartlett kernel and four lags. By default, the minimum eigenvalue statistic and its critical values are not displayed. You can use the forcenonrobust option to obtain them in these cases; the minimum eigenvalue statistic is computed using the weights, though the critical values reported may not be appropriate. estat overid In addition to the requirement that instrumental variables be correlated with the endogenous regressors, the instruments must also be uncorrelated with the structural error term. If the model is overidentified, meaning that the number of additional instruments exceeds the number of endogenous regressors, then we can test whether the instruments are uncorrelated with the error term. If the model is just identified, then we cannot perform a test of overidentifying restrictions. The estimator you used to fit the model determines which tests of overidentifying restrictions estat overid reports. If you used the 2SLS estimator without a robust VCE, estat overid reports Sargan’s (1958) and Basmann’s (1960) χ2 tests. If you used the 2SLS estimator and requested a robust VCE, Wooldridge’s robust score test of overidentifying restrictions is performed instead; without a robust VCE, Wooldridge’s test statistic is identical to Sargan’s test statistic. If you used the LIML estimator, estat overid reports the Anderson–Rubin (1950) likelihood-ratio test and Basmann’s (1960) F test. estat overid reports Hansen’s (1982) J statistic if you used the GMM estimator. Davidson and MacKinnon (1993, 235–236) give a particularly clear explanation of the intuition behind tests of overidentifying restrictions. Also see Judge et al. (1985, 614–616) for a summary of tests of overidentifying restrictions for the 2SLS and LIML estimators. Tests of overidentifying restrictions actually test two different things simultaneously. One, as we have discussed, is whether the instruments are uncorrelated with the error term. The other is that the equation is misspecified and that one or more of the excluded exogenous variables should in fact be included in the structural equation. Thus a significant test statistic could represent either an invalid instrument or an incorrectly specified structural equation. Example 5 Here we refit the model that treated just hsngval as endogenous using 2SLS, and then we perform tests of overidentifying restrictions: . ivregress 2sls rent pcturban (hsngval = faminc i.region) (output omitted ) . 
estat overid Tests of overidentifying restrictions: Sargan (score) chi2(3) = 11.2877 (p = 0.0103) Basmann chi2(3) = 12.8294 (p = 0.0050) Both test statistics are significant at the 5% test level, which means that either one or more of our instruments are invalid or that our structural model is specified incorrectly. One possibility is that the error term in our structural model is heteroskedastic. Both Sargan’s and Basmann’s tests assume that the errors are i.i.d.; if the errors are not i.i.d., then these tests are not valid. Here we refit the model by requesting heteroskedasticity-robust standard errors, and then we use estat overid to obtain Wooldridge’s score test of overidentifying restrictions, which is robust to heteroskedasticity. 954 ivregress postestimation — Postestimation tools for ivregress . ivregress 2sls rent pcturban (hsngval = faminc i.region), vce(robust) (output omitted ) . estat overid Test of overidentifying restrictions: Score chi2(3) = 6.8364 (p = 0.0773) Here we no longer reject the null hypothesis that our instruments are valid at the 5% significance level, though we do reject the null at the 10% level. You can verify that the robust standard error on the coefficient for hsngval is more than twice as large as its nonrobust counterpart and that the robust standard error for pcturban is nearly 50% larger. Technical note The test statistic for the test of overidentifying restrictions performed after GMM estimation is simply the sample size times the value of the objective function Q(β1 , β2 ) defined in (5) of [R] ivregress, evaluated at the GMM parameter estimates. If the weighting matrix W is optimal, meaning that A W = Var (zi ui ), then Q(β1 , β2 ) ∼χ2 (q), where q is the number of overidentifying restrictions. However, if the estimated W is not optimal, then the test statistic will not have an asymptotic χ2 distribution. Like the Sargan and Basmann tests of overidentifying restrictions for the 2SLS estimator, the Anderson–Rubin and Basmann tests after LIML estimation are predicated on the errors’ being i.i.d. If the previous LIML results were reported with robust standard errors, then estat overid by default issues an error message and refuses to report the Anderson–Rubin and Basmann test statistics. You can use the forcenonrobust option to override this behavior. You can also use forcenonrobust to obtain the Sargan and Basmann test statistics after 2SLS estimation with robust standard errors. By default, estat overid issues an error message if the previous estimation was conducted using aweights, pweights, or iweights. You can use the forceweights option to override this behavior, though the test statistics may no longer have the expected χ2 distributions. 
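To illustrate the forcenonrobust behavior described in the technical note: if the model is refit by LIML with a robust VCE, estat overid will not report the Anderson–Rubin and Basmann statistics unless the option is supplied. A sketch (output omitted):

. ivregress liml rent pcturban (hsngval = faminc i.region), vce(robust)
. estat overid, forcenonrobust

The reported statistics are then predicated on i.i.d. errors even though the fitted model was not, so they should be interpreted with care.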
Stored results After 2SLS estimation, estat endogenous stores the following in r(): Scalars r(durbin) r(p durbin) r(wu) r(p wu) r(df) r(wudf r) r(r score) r(p r score) r(hac score) r(p hac score) r(lags) r(regF) r(p regF) r(regFdf n) r(regFdf r) Durbin χ2 statistic p-value for Durbin χ2 statistic Wu–Hausman F statistic p-value for Wu–Hausman F statistic degrees of freedom denominator degrees of freedom for Wu–Hausman F robust score statistic p-value for robust score statistic HAC score statistic p-value for HAC score statistic lags used in prewhitening regression-based F statistic p-value for regression-based F statistic regression-based F numerator degrees of freedom regression-based F denominator degrees of freedom ivregress postestimation — Postestimation tools for ivregress After GMM estimation, estat endogenous stores the following in r(): Scalars r(C) r(p C) r(df) C χ2 statistic p-value for C χ2 statistic degrees of freedom estat firststage stores the following in r(): Scalars r(mineig) Matrices r(mineigcv) r(multiresults) r(singleresults) minimum eigenvalue statistic critical values for minimum eigenvalue statistic Shea’s partial R2 statistics first-stage R2 and F statistics After 2SLS estimation, estat overid stores the following in r(): Scalars r(lags) r(df) r(score) r(p score) r(basmann) r(p basmann) r(sargan) r(p sargan) lags used in prewhitening χ2 degrees of freedom score χ2 statistic p-value for score χ2 statistic Basmann χ2 statistic p-value for Basmann χ2 statistic Sargan χ2 statistic p-value for Sargan χ2 statistic After LIML estimation, estat overid stores the following in r(): Scalars r(ar) r(p ar) r(ar df) r(basmann) r(p basmann) r(basmann df n) r(basmann df d) Anderson–Rubin χ2 statistic p-value for Anderson–Rubin χ2 statistic χ2 degrees of freedom Basmann F statistic p-value for Basmann F statistic F numerator degrees of freedom F denominator degrees of freedom After GMM estimation, estat overid stores the following in r(): Scalars r(HansenJ) r(p HansenJ) r(J df) Hansen’s J χ2 statistic p-value for Hansen’s J χ2 statistic χ2 degrees of freedom Methods and formulas Methods and formulas are presented under the following headings: Notation estat endogenous estat firststage estat overid 955 956 ivregress postestimation — Postestimation tools for ivregress Notation Recall from [R] ivregress that the model is y = Yβ1 + X1 β2 + u = Xβ + u Y = X 1 Π 1 + X 2 Π 2 + V = ZΠ + V where y is an N × 1 vector of the left-hand-side variable, N is the sample size, Y is an N × p matrix of p endogenous regressors, X1 is an N × k1 matrix of k1 included exogenous regressors, X2 is an N × k2 matrix of k2 excluded exogenous variables, X = [Y X1 ], Z = [X1 X2 ], u is an N × 1 vector of errors, V is an N × p matrix of errors, β = [β1 β2 ] is a k = (p + k1 ) × 1 vector of parameters, and Π is a (k1 + k2 ) × p vector of parameters. If a constant term is included in the model, then one column of X1 contains all ones. estat endogenous Partition Y as Y = [Y1 Y2 ], where Y1 represents the p1 endogenous regressors whose endogeneity is being tested and Y2 represents the p2 endogenous regressors whose endogeneity is not being tested. If the endogeneity of all endogenous regressors is being tested, Y = Y1 and p2 = 0. After GMM estimation, estat endogenous refits the model treating Y1 as exogenous using the same type of weight matrix as requested at estimation time with the wmatrix() option; denote the Sargan statistic from this model by Je and the estimated weight matrix by We . Let Se = We−1 . 
estat endogenous removes from $S_e$ the rows and columns corresponding to the variables represented by $Y_1$; denote the inverse of the resulting matrix by $W_{e0}$. Next estat endogenous fits the model treating both $Y_1$ and $Y_2$ as endogenous, using the weight matrix $W_{e0}$; denote the Sargan statistic from this model by $J_c$. Then $C = (J_e - J_c) \sim \chi^2(p_1)$. If one simply used the $J$ statistic from the original model fit by ivregress in place of $J_c$, then in finite samples $J_e - J$ might be negative. The procedure used by estat endogenous is guaranteed to yield $C \geq 0$; see Hayashi (2000, 220).

Let $\widehat{u}_c$ denote the residuals from the model treating both $Y_1$ and $Y_2$ as endogenous, and let $\widehat{u}_e$ denote the residuals from the model treating only $Y_2$ as endogenous. Then Durbin's (1954) statistic is

$$D = \frac{\widehat{u}_e' P_{ZY_1} \widehat{u}_e - \widehat{u}_c' P_Z \widehat{u}_c}{\widehat{u}_e'\widehat{u}_e / N}$$

where $P_Z = Z(Z'Z)^{-1}Z'$ and $P_{ZY_1} = [Z\ Y_1]\left([Z\ Y_1]'[Z\ Y_1]\right)^{-1}[Z\ Y_1]'$. $D \sim \chi^2(p_1)$. The Wu–Hausman (Wu 1974; Hausman 1978) statistic is

$$WH = \frac{\left(\widehat{u}_e' P_{ZY_1} \widehat{u}_e - \widehat{u}_c' P_Z \widehat{u}_c\right)/p_1}{\left\{\widehat{u}_e'\widehat{u}_e - \left(\widehat{u}_e' P_{ZY_1} \widehat{u}_e - \widehat{u}_c' P_Z \widehat{u}_c\right)\right\}/(N - k_1 - p - p_1)}$$

$WH \sim F(p_1,\, N - k_1 - p - p_1)$. Baum, Schaffer, and Stillman (2003, 2007) discuss these tests in more detail.

Next we describe Wooldridge's (1995) score test. The nonrobust version of Wooldridge's test is identical to Durbin's test. Suppose a robust covariance matrix was used at estimation time. Let $\widehat{e}$ denote the sample residuals obtained by fitting the model via OLS, treating $Y$ as exogenous. We then regress each variable represented in $Y$ on $Z$; call the residuals for the $j$th regression $\widehat{r}_j$, $j = 1, \ldots, p$. Define $\widehat{k}_{ij} = \widehat{e}_i \widehat{r}_{ij}$, $i = 1, \ldots, N$. We then run the regression

$$1 = \theta_1 \widehat{k}_1 + \cdots + \theta_p \widehat{k}_p + \epsilon$$

where $1$ is an $N \times 1$ vector of ones and $\epsilon$ is a regression error term. $N - \mathrm{RSS} \sim \chi^2(p)$, where RSS is the residual sum of squares from the regression just described. If instead an HAC VCE was used at estimation time, then before running the final regression we prewhiten the $\widehat{k}_j$ series by using a VAR($q$) model, where $q$ is the number of lags specified with the lags() option.

The regression-based test proceeds as follows. Following Hausman (1978, 1259), we regress $Y$ on $Z$ and obtain the residuals $\widehat{V}$. Next we fit the augmented regression

$$y = Y\beta_1 + X_1\beta_2 + \widehat{V}\gamma + \epsilon$$

by OLS regression, where $\epsilon$ is a regression error term. A test of the exogeneity of $Y$ is equivalent to a test of $\gamma = 0$. As Cameron and Trivedi (2005, 276) suggest, this test can be made robust to heteroskedasticity, autocorrelation, or clustering by using the appropriate robust VCE when testing $\gamma = 0$. When a nonrobust VCE is used, this test is equivalent to the Wu–Hausman test described earlier. One cannot simply fit this augmented regression via 2SLS to test the endogeneity of a subset of the endogenous regressors; Davidson and MacKinnon (1993, 229–231) discuss a test of $\gamma = 0$ for the homoskedastic version of the augmented regression fit by 2SLS, but an appropriate robust test is not apparent.

estat firststage

When the structural equation includes one endogenous regressor, estat firststage fits the regression

$$Y = X_1\pi_1 + X_2\pi_2 + v$$

via OLS. The $R^2$ and adjusted $R^2$ from that regression are reported in the output, as well as the $F$ statistic from the Wald test of $H_0\colon \pi_2 = 0$.
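With one endogenous regressor and a nonrobust VCE, these first-stage summary statistics can be reproduced directly from the first-stage regression; a sketch for the model of example 3 (testparm is used only to obtain the joint F test of the excluded instruments):

. use http://www.stata-press.com/data/r13/hsng, clear
. regress hsngval pcturban faminc i.region
(output omitted )
. testparm faminc i.region

The R2, adjusted R2, and F(4, 44) statistics from this regression should match those reported by estat firststage; the partial R2 additionally requires partialling pcturban out of hsngval and the excluded instruments, as described next.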
To obtain the partial $R^2$ statistic, estat firststage fits the regression

$$M_{X_1} y = M_{X_1} X_2 \xi + \epsilon$$

by OLS, where $\epsilon$ is a regression error term, $\xi$ is a $k_2 \times 1$ parameter vector, and $M_{X_1} = I - X_1(X_1'X_1)^{-1}X_1'$; that is, the partial $R^2$ is the $R^2$ between $y$ and $X_2$ after eliminating the effects of $X_1$. If the model contains multiple endogenous regressors and the all option is specified, these statistics are calculated for each endogenous regressor in turn.

To calculate Shea's partial $R^2$, let $y_1$ denote the endogenous regressor whose statistic is being calculated and $Y_0$ denote the other endogenous regressors. Define $\tilde{y}_1$ as the residuals obtained from regressing $y_1$ on $Y_0$ and $X_1$. Let $\widehat{y}_1$ denote the fitted values obtained from regressing $y_1$ on $X_1$ and $X_2$; that is, $\widehat{y}_1$ are the fitted values from the first-stage regression for $y_1$, and define the columns of $\widehat{Y}_0$ analogously. Finally, let $\tilde{\widehat{y}}_1$ denote the residuals from regressing $\widehat{y}_1$ on $\widehat{Y}_0$ and $X_1$. Shea's partial $R^2$ is the simple $R^2$ from the regression of $\tilde{y}_1$ on $\tilde{\widehat{y}}_1$; denote this as $R_S^2$. Shea's adjusted partial $R^2$ is equal to $1 - (1 - R_S^2)(N - 1)/(N - k_Z + 1)$ if a constant term is included and $1 - (1 - R_S^2)(N - 1)/(N - k_Z)$ if there is no constant term included in the model, where $k_Z = k_1 + k_2$. For one endogenous regressor, one instrument, no exogenous regressors, and a constant term, $R_S^2$ equals the adjusted $R_S^2$.

The Stock and Yogo minimum eigenvalue statistic, first proposed by Cragg and Donald (1993) as a test for underidentification, is the minimum eigenvalue of the matrix

$$G = \frac{1}{k_Z}\,\widehat{\Sigma}_{VV}^{-1/2\,\prime}\, Y' M_{X_1} X_2 \left(X_2' M_{X_1} X_2\right)^{-1} X_2' M_{X_1} Y\, \widehat{\Sigma}_{VV}^{-1/2}$$

where

$$\widehat{\Sigma}_{VV} = \frac{1}{N - k_Z}\, Y' M_Z Y$$

$M_Z = I - Z(Z'Z)^{-1}Z'$, and $Z = [X_1\ X_2]$. Critical values are obtained from the tables in Stock and Yogo (2005).

estat overid

The Sargan (1958) and Basmann (1960) $\chi^2$ statistics are calculated by running the auxiliary regression

$$\widehat{u} = Z\delta + e$$

where $\widehat{u}$ are the sample residuals from the model and $e$ is an error term. Then Sargan's statistic is

$$S = N\left(1 - \frac{\widehat{e}'\widehat{e}}{\widehat{u}'\widehat{u}}\right)$$

where $\widehat{e}$ are the residuals from that auxiliary regression. Basmann's statistic is calculated as

$$B = S\,\frac{N - k_Z}{N - S}$$

Both $S$ and $B$ are distributed $\chi^2(m)$, where $m$, the number of overidentifying restrictions, is equal to $k_Z - k$, where $k = p + k_1$ is the number of regressors in the structural equation.

Wooldridge's (1995) score test of overidentifying restrictions is identical to Sargan's (1958) statistic under the assumption of i.i.d. errors and therefore is not recomputed unless a robust VCE was used at estimation time. If a heteroskedasticity-robust VCE was used, Wooldridge's test proceeds as follows. Let $\widehat{Y}$ denote the $N \times k$ matrix of fitted values obtained by regressing the endogenous regressors on $X_1$ and $X_2$. Let $Q$ denote an $N \times m$ matrix of excluded exogenous variables; the test statistic to be calculated is invariant to whichever $m$ of the $k_2$ excluded exogenous variables is chosen. Define the $i$th element of $\widehat{k}_j$, $i = 1, \ldots, N$, $j = 1, \ldots, m$, as

$$k_{ij} = \widehat{q}_{ij}\,\widehat{u}_i$$

where $\widehat{q}_{ij}$ is the $i$th element of $\widehat{q}_j$, the fitted values from regressing the $j$th column of $Q$ on $\widehat{Y}$ and $X_1$. Finally, fit the regression

$$1 = \theta_1 \widehat{k}_1 + \cdots + \theta_m \widehat{k}_m + \epsilon$$

where $1$ is an $N \times 1$ vector of ones and $\epsilon$ is a regression error term, and calculate the residual sum of squares, RSS. Then the test statistic is $W = N - \mathrm{RSS}$. $W \sim \chi^2(m)$. If an HAC VCE was used at estimation, then the $\widehat{k}_j$ are prewhitened using a VAR($p$) model, where $p$ is specified using the lags() option.
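The Sargan statistic is straightforward to reproduce from the auxiliary regression given above; a sketch using the model of example 5 (because the 2SLS residuals have mean zero when a constant is included, N times the R2 of the auxiliary regression equals S):

. use http://www.stata-press.com/data/r13/hsng, clear
. ivregress 2sls rent pcturban (hsngval = faminc i.region)
(output omitted )
. predict double uhat, residuals
. quietly regress uhat pcturban faminc i.region
. display "Sargan chi2(3) = " %9.4f e(N)*e(r2)

The displayed value should agree with the Sargan (score) statistic reported by estat overid in example 5.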
The Anderson–Rubin (1950), AR, test of overidentifying restrictions for use after the LIML estimator is calculated as AR = N (κ − 1), where κ is the minimal eigenvalue of a certain matrix defined in Methods and formulas of [R] ivregress. AR ∼ χ2 (m). (Some texts define this statistic as N ln(κ) because ln(x) ≈ (x − 1) for x near one.) Basmann’s F statistic for use after the LIML estimator is calculated as BF = (κ − 1)(N − kZ )/m. BF ∼ F (m, N − kZ ). Hansen’s J statistic is simply the sample size times the value of the GMM objective function defined in (5) of [R] ivregress, evaluated at the estimated parameter values. Under the null hypothesis that the overidentifying restrictions are valid, J ∼ χ2 (m). ivregress postestimation — Postestimation tools for ivregress 959 References Anderson, T. W., and H. Rubin. 1950. The asymptotic properties of estimates of the parameters of a single equation in a complete system of stochastic equations. Annals of Mathematical Statistics 21: 570–582. Angrist, J. D., and J.-S. Pischke. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton, NJ: Princeton University Press. Basmann, R. L. 1960. On finite sample distributions of generalized classical linear identifiability test statistics. Journal of the American Statistical Association 55: 650–659. Baum, C. F., M. E. Schaffer, and S. Stillman. 2003. Instrumental variables and GMM: Estimation and testing. Stata Journal 3: 1–31. . 2007. Enhanced routines for instrumental variables/generalized method of moments estimation and testing. Stata Journal 7: 465–506. Bound, J., D. A. Jaeger, and R. M. Baker. 1995. Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association 90: 443–450. Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. New York: Cambridge University Press. Cragg, J. G., and S. G. Donald. 1993. Testing identifiability and specification in instrumental variable models. Econometric Theory 9: 222–240. Davidson, R., and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press. Durbin, J. 1954. Errors in variables. Review of the International Statistical Institute 22: 23–32. Hahn, J., and J. A. Hausman. 2003. Weak instruments: Diagnosis and cures in empirical econometrics. American Economic Review Papers and Proceedings 93: 118–125. Hall, A. R., G. D. Rudebusch, and D. W. Wilcox. 1996. Judging instrument relevance in instrumental variables estimation. International Economic Review 37: 283–298. Hansen, L. P. 1982. Large sample properties of generalized method of moments estimators. Econometrica 50: 1029–1054. Hausman, J. A. 1978. Specification tests in econometrics. Econometrica 46: 1251–1271. Hayashi, F. 2000. Econometrics. Princeton, NJ: Princeton University Press. Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2nd ed. New York: Wiley. Nelson, C. R., and R. Startz. 1990. The distribution of the instrumental variable estimator and its t ratio when the instrument is a poor one. Journal of Business 63: S125–S140. Poi, B. P. 2006. Jackknife instrumental variables estimation in Stata. Stata Journal 6: 364–376. Sargan, J. D. 1958. The estimation of economic relationships using instrumental variables. Econometrica 26: 393–415. Shea, J. S. 1997. Instrument relevance in multivariate linear models: A simple measure. 
Review of Economics and Statistics 79: 348–352. Staiger, D. O., and J. H. Stock. 1997. Instrumental variables regression with weak instruments. Econometrica 65: 557–586. Stock, J. H., J. H. Wright, and M. Yogo. 2002. A survey of weak instruments and weak identification in generalized method of moments. Journal of Business and Economic Statistics 20: 518–529. Stock, J. H., and M. Yogo. 2005. Testing for weak instruments in linear IV regression. In Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, ed. D. W. K. Andrews and J. H. Stock, 80–108. New York: Cambridge University Press. Wooldridge, J. M. 1995. Score diagnostics for linear models estimated by two stage least squares. In Advances in Econometrics and Quantitative Economics: Essays in Honor of Professor C. R. Rao, ed. G. S. Maddala, P. C. B. Phillips, and T. N. Srinivasan, 66–87. Oxford: Blackwell. Wu, D.-M. 1974. Alternative tests of independence between stochastic regressors and disturbances: Finite sample results. Econometrica 42: 529–546. 960 ivregress postestimation — Postestimation tools for ivregress Also see [R] ivregress — Single-equation instrumental-variables regression [U] 20 Estimation and postestimation commands Title ivtobit — Tobit model with continuous endogenous regressors Syntax Options for ML estimator Stored results References Menu Options for two-step estimator Methods and formulas Also see Description Remarks and examples Acknowledgments Syntax Maximum likelihood estimator ivtobit depvar varlist1 (varlist2 = varlistiv ) if in weight , ll (#) ul (#) mle options Two-step estimator ivtobit depvar varlist1 (varlist2 = varlistiv ) if in weight , twostep ll (#) ul (#) tse options mle options Model ∗ ll (#) ∗ ul (#) mle constraints(constraints) Description lower limit for left censoring upper limit for right censoring use conditional maximum-likelihood estimator; the default apply specified linear constraints SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife Reporting level(#) first nocnsreport display options set confidence level; default is level(95) report first-stage regression do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process coeflegend display legend instead of statistics ∗ You must specify at least one of ll (#) and ul (#) . 961 962 ivtobit — Tobit model with continuous endogenous regressors Description tse options Model ∗ use Newey’s two-step estimator; the default is mle lower limit for left censoring upper limit for right censoring twostep ll (#) ∗ ul (#) ∗ SE vce(vcetype) vcetype may be twostep, bootstrap, or jackknife Reporting level(#) first display options set confidence level; default is level(95) report first-stage regression control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling coeflegend display legend instead of statistics ∗ twostep is required. You must specify at least one of ll (#) and ul (#) . varlist1 and varlistiv may contain factor variables; see [U] 11.4.3 Factor variables. depvar, varlist1 , varlist2 , and varlistiv may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, jackknife, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. fp is allowed with the maximum likelihood estimator. 
Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce(), first, twostep, and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed with the maximum likelihood estimator. fweights are allowed with Newey’s two-step estimator. See [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Endogenous covariates > Tobit model with endogenous covariates Description ivtobit fits tobit models where one or more of the regressors is endogenously determined. By default, ivtobit uses maximum likelihood estimation. Alternatively, Newey’s (1987) minimum chi-squared estimator can be invoked with the twostep option. Both estimators assume that the endogenous regressors are continuous and so are not appropriate for use with discrete endogenous regressors. See [R] ivprobit for probit estimation with endogenous regressors and [R] tobit for tobit estimation when the model contains no endogenous regressors. ivtobit — Tobit model with continuous endogenous regressors 963 Options for ML estimator Model ll(#) and ul(#) indicate the lower and upper limits for censoring, respectively. You may specify one or both. Observations with depvar ≤ ll() are left-censored; observations with depvar ≥ ul() are right-censored; and remaining observations are not censored. You do not have to specify the censoring values at all. It is enough to type ll, ul, or both. When you do not specify a censoring value, ivtobit assumes that the lower limit is the minimum observed in the data (if ll is specified) and that the upper limit is the maximum (if ul is specified). mle requests that the conditional maximum-likelihood estimator be used. This is the default. constraints(constraints); see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. first requests that the parameters for the reduced-form equations showing the relationships between the endogenous variables and instruments be displayed. For the two-step estimator, first shows the first-stage regressions. For the maximum likelihood estimator, these parameters are estimated jointly with the parameters of the tobit equation. The default is not to show these parameter estimates. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. This model’s likelihood function can be difficult to maximize, especially with multiple endogenous variables. The difficult and technique(bfgs) options may be helpful in achieving convergence. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). 
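For example, if the maximizer has trouble converging in a model with several endogenous variables, the suggested options can be combined on the ivtobit command line; a hypothetical sketch (the dataset and variable names are those used in Remarks and examples below):

. use http://www.stata-press.com/data/r13/laborsup
. ivtobit fem_inc fem_educ kids (other_inc = male_educ), ll difficult technique(bfgs)

Because technique(bhhh) changes the default VCE to vce(opg), you may want to specify vce() explicitly if you switch to that algorithm.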
The following option is available with ivtobit but is not shown in the dialog box: coeflegend; see [R] estimation options. Options for two-step estimator Model twostep is required and requests that Newey’s (1987) efficient two-step estimator be used to obtain the coefficient estimates. 964 ivtobit — Tobit model with continuous endogenous regressors ll(#) and ul(#) indicate the lower and upper limits for censoring, respectively. You may specify one or both. Observations with depvar ≤ ll() are left-censored; observations with depvar ≥ ul() are right-censored; and remaining observations are not censored. You do not have to specify the censoring values at all. It is enough to type ll, ul, or both. When you do not specify a censoring value, ivtobit assumes that the lower limit is the minimum observed in the data (if ll is specified) and that the upper limit is the maximum (if ul is specified). SE vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (twostep) and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. first requests that the parameters for the reduced-form equations showing the relationships between the endogenous variables and instruments be displayed. For the two-step estimator, first shows the first-stage regressions. For the maximum likelihood estimator, these parameters are estimated jointly with the parameters of the tobit equation. The default is not to show these parameter estimates. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. The following option is available with ivtobit but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples ivtobit fits models with censored dependent variables and endogenous regressors. You can use it to fit a tobit model when you suspect that one or more of the regressors is correlated with the error term. ivtobit is to tobit what ivregress is to linear regression analysis; see [R] ivregress for more information. Formally, the model is ∗ y1i =y2i β + x1i γ + ui y2i =x1i Π1 + x2i Π2 + vi where i = 1, . . . , N ; y2i is a 1 ×p vector of endogenous variables; x1i is a 1 ×k1 vector of exogenous variables; x2i is a 1 × k2 vector of additional instruments; and the equation for y2i is written in reduced form. By assumption (ui , vi ) ∼ N(0). β and γ are vectors of structural parameters, and Π1 ∗ and Π2 are matrices of reduced-form parameters. We do not observe y1i ; instead, we observe y1i a ∗ = y1i b ∗ y1i b The order condition for identification of the structural parameters is that k2 ≥ p. Presumably, Σ is not block diagonal between ui and vi ; otherwise, y2i would not be endogenous. ivtobit — Tobit model with continuous endogenous regressors 965 Technical note This model is derived under the assumption that (ui , vi ) is independent and identically distributed multivariate normal for all i. The vce(cluster clustvar) option can be used to control for a lack of independence. As with the standard tobit model without endogeneity, if ui is heteroskedastic, point estimates will be inconsistent. Example 1 Using the same dataset as in [R] ivprobit, we now want to estimate women’s incomes. In our hypothetical world, all women who choose not to work receive $10,000 in welfare and child-support payments. 
Therefore, we never observe incomes under $10,000: a woman offered a job with an annual wage less than that would not accept and instead would collect the welfare payment. We model income as a function of the number of years of schooling completed, the number of children at home, and other household income. We again believe that other inc is endogenous, so we use male educ as an instrument. . use http://www.stata-press.com/data/r13/laborsup . ivtobit fem_inc fem_educ kids (other_inc = male_educ), ll Fitting exogenous tobit model Fitting full model Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: log log log log log likelihood likelihood likelihood likelihood likelihood = = = = = -3228.4224 -3226.2882 -3226.085 -3226.0845 -3226.0845 Tobit model with endogenous regressors Number of obs Wald chi2(3) Prob > chi2 Log likelihood = -3226.0845 Coef. Std. Err. z P>|z| = = = 500 117.42 0.0000 [95% Conf. Interval] other_inc fem_educ kids _cons -.9045399 3.272391 -3.312357 19.24735 .1329762 .3968708 .7218628 7.372391 -6.80 8.25 -4.59 2.61 0.000 0.000 0.000 0.009 -1.165168 2.494538 -4.727182 4.797725 -.6439114 4.050243 -1.897532 33.69697 /alpha /lns /lnv .2907654 2.874031 2.813383 .1379965 .0506672 .0316228 2.11 56.72 88.97 0.035 0.000 0.000 .0202972 2.774725 2.751404 .5612336 2.973337 2.875363 s v 17.70826 16.66621 .897228 .5270318 16.03422 15.66461 19.55707 17.73186 Instrumented: Instruments: other_inc fem_educ kids male_educ Wald test of exogeneity (/alpha = 0): chi2(1) = Obs. summary: 4.44 Prob > chi2 = 0.0351 272 left-censored observations at fem_inc<=10 228 uncensored observations 0 right-censored observations Because we did not specify mle or twostep, ivtobit used the maximum likelihood estimator by default. ivtobit fits a tobit model, ignoring endogeneity, to get starting values for the full model. The header of the output contains the maximized log likelihood, the number of observations, and a 966 ivtobit — Tobit model with continuous endogenous regressors Wald statistic and p-value for the test of the hypothesis that all the slope coefficients are jointly zero. At the end of the output, we see a count of the censored and uncensored observations. Near the bottom of the output is a Wald test of the exogeneity of the instrumented variables. If the test statistic is not significant, there is not sufficient information in the sample to reject the null hypothesis of no endogeneity. Then the point estimates from ivtobit are consistent, although those from tobit are likely to have smaller standard errors. Various two-step estimators have also been proposed for the endogenous tobit model, and Newey’s (1987) minimum chi-squared estimator is available with the twostep option. Example 2 Refitting our labor-supply model with the two-step estimator yields . ivtobit fem_inc fem_educ kids (other_inc = male_educ), ll twostep Two-step tobit with endogenous regressors Number of obs = Wald chi2(3) = Prob > chi2 = Coef. other_inc fem_educ kids _cons Instrumented: Instruments: -.9045397 3.27239 -3.312356 19.24735 Std. Err. .1330015 .3969399 .7220066 7.37392 z -6.80 8.24 -4.59 2.61 P>|z| 0.000 0.000 0.000 0.009 500 117.38 0.0000 [95% Conf. Interval] -1.165218 2.494402 -4.727463 4.794728 -.6438616 4.050378 -1.897249 33.69997 other_inc fem_educ kids male_educ Wald test of exogeneity: chi2(1) = 4.64 Prob > chi2 = 0.0312 Obs. 
summary: 272 left-censored observations at fem_inc<=10 228 uncensored observations 0 right-censored observations All the coefficients have the same signs as their counterparts in the maximum likelihood model. The Wald test at the bottom of the output confirms our earlier finding of endogeneity. Technical note In the tobit model with endogenous regressors, we assume that (ui , vi ) is multivariate normal with covariance matrix 2 σu Σ021 Var(ui , vi ) = Σ = Σ21 Σ22 2 Using the properties of the multivariate normal distribution, Var(ui |vi ) ≡ σu|v = σu2 − Σ021 Σ−1 22 Σ21 . Calculating the marginal effects on the conditional expected values of the observed and latent dependent variables and on the probability of censoring requires an estimate of σu2 . The two-step 2 estimator identifies only σu|v , not σu2 , so only the linear prediction and its standard error are available after you have used the twostep option. However, unlike the two-step probit estimator described in [R] ivprobit, the two-step tobit estimator does identify β and γ. See Wooldridge (2010, 683–684) for more information. ivtobit — Tobit model with continuous endogenous regressors Stored results ivtobit, mle stores the following in e(): Scalars e(N) e(N unc) e(N lc) e(N rc) e(llopt) e(ulopt) e(k) e(k eq) e(k eq model) e(k aux) e(k dv) e(df m) e(ll) e(N clust) e(endog ct) e(p) e(p exog) e(chi2) e(chi2 exog) e(rank) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(instd) e(insts) e(wtype) e(wexp) e(title) e(clustvar) e(chi2type) e(vce) e(vcetype) e(method) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(footnote) e(marginsok) e(asbalanced) e(asobserved) number of observations number of uncensored observations number of left-censored observations number of right-censored observations contents of ll() contents of ul() number of parameters number of equations in e(b) number of equations in overall model test number of auxiliary parameters number of dependent variables model degrees of freedom log likelihood number of clusters number of endogenous regressors model Wald p-value exogeneity test Wald p-value model Wald χ2 Wald χ2 test of exogeneity rank of e(V) number of iterations return code 1 if converged, 0 otherwise ivtobit command as typed name of dependent variable instrumented variables instruments weight type weight expression title in estimation output name of cluster variable Wald; type of model χ2 test vcetype specified in vce() title used to label Std. Err. 
ml type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict program used to implement the footnote display predictions allowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved 967 968 ivtobit — Tobit model with continuous endogenous regressors Matrices e(b) e(Cns) e(ilog) e(gradient) e(Sigma) e(V) e(V modelbased) Functions e(sample) coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector b Σ variance–covariance matrix of the estimators model-based variance marks estimation sample ivtobit, twostep stores the following in e(): Scalars e(N) e(N unc) e(N lc) e(N rc) e(llopt) e(ulopt) e(df m) e(df exog) e(p) e(p exog) e(chi2) e(chi2 exog) e(rank) number of observations number of uncensored observations number of left-censored observations number of right-censored observations contents of ll() contents of ul() model degrees of freedom degrees of freedom for χ2 test of exogeneity model Wald p-value exogeneity test Wald p-value model Wald χ2 Wald χ2 test of exogeneity rank of e(V) Macros e(cmd) e(cmdline) e(depvar) e(instd) e(insts) e(wtype) e(wexp) e(chi2type) e(vce) e(vcetype) e(method) e(properties) e(predict) e(footnote) e(marginsok) e(asbalanced) e(asobserved) ivtobit command as typed name of dependent variable instrumented variables instruments weight type weight expression Wald; type of model χ2 test vcetype specified in vce() title used to label Std. Err. twostep b V program used to implement predict program used to implement the footnote display predictions allowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved Matrices e(b) e(Cns) e(V) e(V modelbased) coefficient vector constraints matrix variance–covariance matrix of the estimators model-based variance Functions e(sample) marks estimation sample Methods and formulas The estimation procedure used by ivtobit is similar to that used by ivprobit. For compactness, we write the model as ivtobit — Tobit model with continuous endogenous regressors ∗ y1i = zi δ + ui y2i = xi Π + vi 969 (1a) (1b) ∗ ; where zi = (y2i , x1i ), xi = (x1i , x2i ), δ = (β0 , γ0 )0 , and Π = (Π01 , Π02 )0 . We do not observe y1i instead, we observe ∗ y1i b (ui , vi ) is distributed multivariate normal with mean zero and covariance matrix 2 σu Σ021 Σ= Σ21 Σ22 Using the properties of the multivariate normal distribution, we can write ui = v0i α + i , where −1 0 2 2 2 α = Σ−1 22 Σ21 ; i ∼ N(0; σu|v ), where σu|v = σu − Σ21 Σ22 Σ21 ; and i is independent of vi , zi , and xi . The likelihood function is straightforward to derive because we can write the joint density f (y1i , y2i |xi ) as f (y1i |y2i , xi ) f (y2i |xi ). With one endogenous regressor, (y2i − xi Π)2 1 lnf (y2i |xi ) = − ln2π + lnσv2 + 2 σv2 and o n mi −a ln 1 − Φ σu|v 2 lnf (y1i |y2i , xi ) = − 12 ln2π + lnσu|v + lnΦ mi −b σu|v y1i = a (y1i −mi )2 2 σu|v a < y1i < b y1i = b where mi = zi δ + α (y2i − xi Π) and Φ(·) is the normal distribution function so that the log likelihood for observation i is lnLi = wi { lnf (y1i |y2i , xi ) + lnf (y2i |xi )} where wi is the weight for observation i or one if no weights were specified. Instead of estimating σu|v and σv directly, we estimate lnσu|v and lnσv . 
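Gathering these pieces, the contribution of observation i to the log likelihood in the one-endogenous-regressor case can be written in display form as follows. This is only a restatement, in LaTeX notation, of the expressions just given, with a and b denoting the lower and upper censoring limits:

\ln f(y_{2i} \mid x_i) = -\frac{1}{2}\left\{ \ln 2\pi + \ln\sigma_v^2 + \frac{(y_{2i} - x_i\Pi)^2}{\sigma_v^2} \right\}

\ln f(y_{1i} \mid y_{2i}, x_i) =
\begin{cases}
\ln\left\{ 1 - \Phi\!\left( \dfrac{m_i - a}{\sigma_{u|v}} \right) \right\} & y_{1i} = a \\[1.5ex]
-\dfrac{1}{2}\left\{ \ln 2\pi + \ln\sigma_{u|v}^2 + \dfrac{(y_{1i} - m_i)^2}{\sigma_{u|v}^2} \right\} & a < y_{1i} < b \\[1.5ex]
\ln \Phi\!\left( \dfrac{m_i - b}{\sigma_{u|v}} \right) & y_{1i} = b
\end{cases}

\ln L_i = w_i\left\{ \ln f(y_{1i} \mid y_{2i}, x_i) + \ln f(y_{2i} \mid x_i) \right\},
\qquad m_i = z_i\delta + \alpha\,(y_{2i} - x_i\Pi)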
For multiple endogenous regressors, we have lnf (y2i |xi ) = − 1 ln2π + ln |Σ22 | + v0i Σ−1 22 vi 2 and lnf (y1i |y2i , xi ) is the same as before, except that now mi = zi δ + (y2i − xi Π)Σ−1 22 Σ21 Instead of maximizing the log-likelihood function with respect to Σ, we maximize with respect to the Cholesky decomposition S of Σ; that is, there exists a lower triangular matrix S such that SS0 = Σ. This maximization ensures that Σ is positive definite, as a covariance matrix must be. Let 970 ivtobit — Tobit model with continuous endogenous regressors s11 s21 s31 .. . 0 s22 s32 .. . 0 0 s33 .. . sp+1,1 sp+1,2 sp+1,3 S= ... ... ... .. . 0 0 0 .. . . . . sp+1,p+1 With maximum likelihood estimation, this command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. The maximum likelihood version of ivtobit also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. The two-step estimates are obtained using Newey’s (1987) minimum chi-squared estimator. The procedure is identical to the one described in [R] ivprobit, except that tobit is used instead of probit. Acknowledgments The two-step estimator is based on the tobitiv command written by Jonah Gelbach of the Department of Economics at Yale University and the ivtobit command written by Joe Harkness of the Institute of Policy Studies at Johns Hopkins University. References Finlay, K., and L. M. Magnusson. 2009. Implementing weak-instrument robust tests for a general class of instrumentalvariables models. Stata Journal 9: 398–421. Miranda, A., and S. Rabe-Hesketh. 2006. Maximum likelihood estimation of endogenous switching and sample selection models for binary, ordinal, and count variables. Stata Journal 6: 285–308. Newey, W. K. 1987. Efficient estimation of limited dependent variable models with endogenous explanatory variables. Journal of Econometrics 36: 231–250. Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press. 
Also see [R] ivtobit postestimation — Postestimation tools for ivtobit [R] gmm — Generalized method of moments estimation [R] ivprobit — Probit model with continuous endogenous regressors [R] ivregress — Single-equation instrumental-variables regression [R] regress — Linear regression [R] tobit — Tobit regression [SVY] svy estimation — Estimation commands for survey data [XT] xtintreg — Random-effects interval-data regression models [XT] xttobit — Random-effects tobit models [U] 20 Estimation and postestimation commands Title ivtobit postestimation — Postestimation tools for ivtobit Description Remarks and examples Syntax for predict Methods and formulas Menu for predict Also see Options for predict Description The following postestimation commands are available after ivtobit: Command Description contrast estat ic1 estat summarize estat vce estat (svy) estimates forecast2 hausman lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations Hausman’s specification test point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test; not available with two-step estimator marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest3 margins marginsplot nlcom predict predictnl pwcompare suest1 test testnl 1 estat ic and suest are not appropriate after ivtobit, twostep. forecast is not appropriate with svy estimation results or after ivtobit, twostep. 3 lrtest is not appropriate with svy estimation results. 2 971 972 ivtobit postestimation — Postestimation tools for ivtobit Syntax for predict After ML or twostep predict type newvar if in , statistic After ML predict type stub* | newvarlist if in , scores Description statistic Main xb stdp stdf pr(a,b) e(a,b) ystar(a,b) linear prediction; the default standard error of the linear prediction standard error of the forecast; not available with two-step estimator Pr(a < yj < b); not available with two-step estimator E(yj |a < yj < b); not available with two-step estimator E(yj∗ ), yj = max{a, min(yj , b)}; not available with two-step estimator These statistics are available both in and out of sample; type predict the estimation sample. . . . if e(sample) . . . if wanted only for stdf is not allowed with svy estimation results. where a and b may be numbers or variables; a missing (a ≥ .) means −∞, and b missing (b ≥ .) means +∞; see [U] 12.2.1 Missing values. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main xb, the default, calculates the linear prediction. stdp calculates the standard error of the linear prediction. It can be thought of as the standard error of the predicted expected value or mean for the observation’s covariate pattern. 
The standard error of the prediction is also referred to as the standard error of the fitted value. stdf calculates the standard error of the forecast, which is the standard error of the point prediction for 1 observation. It is commonly referred to as the standard error of the future or forecast value. By construction, the standard errors produced by stdf are always larger than those produced by stdp; see Methods and formulas in [R] regress postestimation. stdf is not available with the two-step estimator. pr(a,b) calculates Pr(a < xj b + uj < b), the probability that yj |xj would be observed in the interval (a, b). a and b may be specified as numbers or variable names; lb and ub are variable names; pr(20,30) calculates Pr(20 < xj b + uj < 30); pr(lb,ub) calculates Pr(lb < xj b + uj < ub); and pr(20,ub) calculates Pr(20 < xj b + uj < ub). ivtobit postestimation — Postestimation tools for ivtobit 973 a missing (a ≥ .) means −∞; pr(.,30) calculates Pr(−∞ < xj b + uj < 30); pr(lb,30) calculates Pr(−∞ < xj b + uj < 30) in observations for which lb ≥ . and calculates Pr(lb < xj b + uj < 30) elsewhere. b missing (b ≥ .) means +∞; pr(20,.) calculates Pr(+∞ > xj b + uj > 20); pr(20,ub) calculates Pr(+∞ > xj b + uj > 20) in observations for which ub ≥ . and calculates Pr(20 < xj b + uj < ub) elsewhere. pr(a,b) is not available with the two-step estimator. e(a,b) calculates E(xj b + uj a < xj b + uj < b), the expected value of yj |xj conditional on yj |xj being in the interval (a, b), meaning that yj |xj is truncated. a and b are specified as they are for pr(). e(a,b) is not available with the two-step estimator. ystar(a,b) calculates E(yj∗ ), where yj∗ = a if xj b + uj ≤ a, yj∗ = b if xj b + uj ≥ b, and yj∗ = xj b + uj otherwise, meaning that yj∗ is censored. a and b are specified as they are for pr(). ystar(a,b) is not available with the two-step estimator. scores, not available with twostep, calculates equation-level score variables. For models with one endogenous regressor, five new variables are created. The first new variable will contain ∂ lnL/∂(zi δ). The second new variable will contain ∂ lnL/∂(xi Π). The third new variable will contain ∂ lnL/∂α. The fourth new variable will contain ∂ lnL/∂ lnσu|v . The fifth new variable will contain ∂ lnL/∂ lnσv . For models with p endogenous regressors, p + {(p + 1)(p + 2)}/2 + 1 new variables are created. The first new variable will contain ∂ lnL/∂(zi δ). The second through (p + 1)th new score variables will contain ∂ lnL/∂(xi Πk ), k = 1, . . . , p, where Πk is the k th column of Π. The remaining score variables will contain the partial derivatives of lnL with respect to s11 , s21 , . . . , sp+1,1 , s22 , . . . , sp+1,2 , . . . , sp+1,p+1 , where sm,n denotes the (m, n) element of the Cholesky decomposition of the error covariance matrix. Remarks and examples Remarks are presented under the following headings: Marginal effects Obtaining predicted values Marginal effects Example 1 We can obtain average marginal effects by using the margins command after ivtobit. For the labor-supply model of example 1 in [R] ivtobit, suppose that we wanted to know the average marginal effects on the woman’s expected income, conditional on her income being greater than $10,000. 974 ivtobit postestimation — Postestimation tools for ivtobit . use http://www.stata-press.com/data/r13/laborsup . ivtobit fem_inc fem_educ kids (other_inc = male_educ), ll (output omitted ) . 
margins, dydx(*) predict(e(10, .)) Average marginal effects Number of obs Model VCE : OIM Expression : E(fem_inc|fem_inc>10), predict(e(10, .)) dy/dx w.r.t. : other_inc fem_educ kids male_educ dy/dx other_inc fem_educ kids male_educ -.3420189 1.237336 -1.252447 0 Delta-method Std. Err. .0553591 .1534025 .2725166 (omitted) z -6.18 8.07 -4.60 P>|z| 0.000 0.000 0.000 = 500 [95% Conf. Interval] -.4505208 .9366723 -1.78657 -.233517 1.537999 -.7183246 In our sample, increasing the number of children in the family by one decreases the expected wage by $1,252 on average (wages in our dataset are measured in thousands of dollars). male edu has no effect because it appears only as an instrument. Obtaining predicted values After fitting your model using ivtobit, you can obtain the linear prediction and its standard error for both the estimation sample and other samples using the predict command. If you used the maximum likelihood estimator, you can also obtain conditional expected values of the observed and latent dependent variables, the standard error of the forecast, and the probability of observing the dependent variable in a specified interval. See [U] 20 Estimation and postestimation commands and [R] predict. Methods and formulas The linear prediction is calculated as zib δ, where b δ is the estimated value of δ, and zi and δ are defined in (1a) of [R] ivtobit. Expected values and probabilities are calculated using the same formulas as those used by the standard exogenous tobit model. Also see [R] ivtobit — Tobit model with continuous endogenous regressors [U] 20 Estimation and postestimation commands Title jackknife — Jackknife estimation Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax jackknife exp list , options eform option : command options Description Main eclass rclass n(exp) number of observations used is stored in e(N) number of observations used is stored in r(N) specify exp that evaluates to the number of observations used Options cluster(varlist) idcluster(newvar) saving( filename, . . .) keep mse variables identifying sample clusters create new cluster ID variable save results to filename; save statistics in double precision; save results to filename every # replications keep pseudovalues use MSE formula for variance estimation Reporting level(#) notable noheader nolegend verbose nodots noisily trace title(text) display options eform option set confidence level; default is level(95) suppress table of results suppress table header suppress table legend display the full table legend suppress replication dots display any output from command trace command use text as title for jackknife results control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling display coefficient table in exponentiated form Advanced nodrop reject(exp) do not drop observations identify invalid results coeflegend display legend instead of statistics svy is allowed; see [SVY] svy jackknife. All weight types supported by command are allowed except aweights; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 
975 976 jackknife — Jackknife estimation exp list contains elist contains eexp is specname is eqno is (name: elist) elist eexp newvar = (exp) (exp) specname [eqno]specname b b[] se se[] ## name exp is a standard Stata expression; see [U] 13 Functions and expressions. Distinguish between [ ], which are to be typed, and , which indicate optional arguments. Menu Statistics > Resampling > Jackknife estimation Description jackknife performs jackknife estimation. Typing . jackknife exp list: command executes command once for each observation in the dataset, leaving the associated observation out of the calculations that make up exp list. command defines the statistical command to be executed. Most Stata commands and user-written programs can be used with jackknife, as long as they follow standard Stata syntax and allow the if qualifier; see [U] 11 Language syntax. The by prefix may not be part of command. exp list specifies the statistics to be collected from the execution of command. If command changes the contents in e(b), exp list is optional and defaults to b. Many estimation commands allow the vce(jackknife) option. For those commands, we recommend using vce(jackknife) over jackknife because the estimation command already handles clustering and other model-specific details for you. The jackknife prefix command is intended for use with nonestimation commands, such as summarize, user-written commands, or functions of coefficients. jknife is a synonym for jackknife. Options Main eclass, rclass, and n(exp) specify where command stores the number of observations on which it based the calculated results. We strongly advise you to specify one of these options. jackknife — Jackknife estimation 977 eclass specifies that command store the number of observations in e(N). rclass specifies that command store the number of observations in r(N). n(exp) specifies an expression that evaluates to the number of observations used. Specifying n(r(N)) is equivalent to specifying the rclass option. Specifying n(e(N)) is equivalent to specifying the eclass option. If command stores the number of observations in r(N1), specify n(r(N1)). If you specify no options, jackknife will assume eclass or rclass, depending on which of e(N) and r(N) is not missing (in that order). If both e(N) and r(N) are missing, jackknife assumes that all observations in the dataset contribute to the calculated result. If that assumption is incorrect, the reported standard errors will be incorrect. For instance, say that you specify . jackknife coef=_b[x2]: myreg y x1 x2 x3 where myreg uses e(n) instead of e(N) to identify the number of observations used in calculations. Further assume that observation 42 in the dataset has x3 equal to missing. The 42nd observation plays no role in obtaining the estimates, but jackknife has no way of knowing that and will use the wrong N . If, on the other hand, you specify . jackknife coef=_b[x2], n(e(n)): myreg y x1 x2 x3 jackknife will notice that observation 42 plays no role. The n(e(n)) option is specified because myreg is an estimation command but it stores the number of observations used in e(n) (instead of the standard e(N)). When jackknife runs the regression omitting the 42nd observation, jackknife will observe that e(n) has the same value as when jackknife previously ran the regression using all the observations. Thus jackknife will know that myreg did not use the observation. Options cluster(varlist) specifies the variables identifying sample clusters. 
If cluster() is specified, one cluster is left out of each call to command, instead of 1 observation. idcluster(newvar) creates a new variable containing a unique integer identifier for each resampled cluster, starting at 1 and leading up to the number of clusters. This option may be specified only when the cluster() option is specified. idcluster() helps identify the cluster to which a pseudovalue belongs. saving( filename , suboptions ) creates a Stata data file (.dta file) consisting of (for each statistic in exp list) a variable containing the replicates. double specifies that the results for each replication be saved as doubles, meaning 8-byte reals. By default, they are saved as floats, meaning 4-byte reals. This option may be used without the saving() option to compute the variance estimates by using double precision. every(#) specifies that results be written to disk every #th replication. every() should be specified only in conjunction with saving() when command takes a long time for each replication. This option will allow recovery of partial results should some other software crash your computer. See [P] postfile. replace specifies that filename be overwritten if it exists. This option does not appear in the dialog box. keep specifies that new variables be added to the dataset containing the pseudovalues of the requested statistics. For instance, if you typed . jackknife coef=_b[x2], eclass keep: regress y x1 x2 x3 978 jackknife — Jackknife estimation new variable coef would be added to the dataset containing the pseudovalues for b[x2]. Let b be the value of b[x2] when all observations are used to fit the model, and let b(j) be the value when the j th observation is omitted. The pseudovalues are defined as pseudovaluej = N {b − b(j)} + b(j) where N is the number of observations used to produce b. When the cluster() option is specified, each cluster is given at most one nonmissing pseudovalue. The keep option implies the nodrop option. mse specifies that jackknife compute the variance by using deviations of the replicates from the observed value of the statistics based on the entire dataset. By default, jackknife computes the variance by using deviations of the pseudovalues from their mean. Reporting level(#); see [R] estimation options. notable suppresses the display of the table of results. noheader suppresses the display of the table header. This option implies nolegend. nolegend suppresses the display of the table legend. The table legend identifies the rows of the table with the expressions they represent. verbose specifies that the full table legend be displayed. By default, coefficients and standard errors are not displayed. nodots suppresses display of the replication dots. By default, one dot character is displayed for each successful replication. A red ‘x’ is displayed if command returns an error or if one of the values in exp list is missing. noisily specifies that any output from command be displayed. This option implies the nodots option. trace causes a trace of the execution of command to be displayed. This option implies the noisily option. title(text) specifies a title to be displayed above the table of jackknife results; the default title is Jackknife results or what is produced in e(title) by an estimation command. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. 
eform option causes the coefficient table to be displayed in exponentiated form; see [R] eform option. command determines which eform option is allowed (eform(string) and eform are always allowed). jackknife — Jackknife estimation 979 command determines which of the following are allowed (eform(string) and eform are always allowed): eform option Description eform(string) eform hr shr irr or rrr use string for the column title exponentiated coefficient, string is exp(b) hazard ratio, string is Haz. Ratio subhazard ratio, string is SHR incidence-rate ratio, string is IRR odds ratio, string is Odds Ratio relative-risk ratio, string is RRR Advanced nodrop prevents observations outside e(sample) and the if and in qualifiers from being dropped before the data are resampled. reject(exp) identifies an expression that indicates when results should be rejected. When exp is true, the resulting values are reset to missing values. The following option is available with jackknife but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Remarks are presented under the following headings: Introduction Jackknifed standard deviation Collecting multiple statistics Collecting coefficients Introduction Although the jackknife—developed in the late 1940s and early 1950s—is of largely historical interest today, it is still useful in searching for overly influential observations. This feature is often forgotten. In any case, the jackknife is • an alternative, first-order unbiased estimator for a statistic; • a data-dependent way to calculate the standard error of the statistic and to obtain significance levels and confidence intervals; and • a way of producing measures called pseudovalues for each observation, reflecting the observation’s influence on the overall statistic. The idea behind the simplest form of the jackknife—the one implemented here—is to repeatedly calculate the statistic in question, each time omitting just one of the dataset’s observations. Assume that our statistic of interest is the sample mean. Let yj be the j th observation of our data on some measurement y , where j = 1, . . . , N and N is the sample size. If y is the sample mean of y using the entire dataset and y (j) is the mean when the j th observation is omitted, then y= (N − 1) y (j) + yj N 980 jackknife — Jackknife estimation Solving for yj , we obtain yj = N y − (N − 1) y (j) These are the pseudovalues that jackknife calculates. To move this discussion beyond the sample mean, let θb be the value of our statistic (not necessarily the sample mean) using the entire dataset, and let θb(j) be the computed value of our statistic with the j th observation omitted. The pseudovalue for the j th observation is θbj∗ = N θb − (N − 1) θb(j) The mean of the pseudovalues is the alternative, first-order unbiased estimator mentioned above, and the standard error of the mean of the pseudovalues is an estimator for the standard error of θb (Tukey 1958). When the cluster() option is given, clusters are omitted instead of observations, and N is the number of clusters instead of the sample size. The jackknife estimate of variance has been largely replaced by the bootstrap estimate (see [R] bootstrap), which is widely viewed as more efficient and robust. The use of jackknife pseudovalues to detect outliers is too often forgotten and is something the bootstrap does not provide. See Mosteller and Tukey (1977, 133–163) and Mooney and Duval (1993, 22–27) for more information. 
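Before turning to the examples, the pseudovalue arithmetic can be made concrete with a small do-file sketch that computes the pseudovalues of the sample mean of mpg by brute force, using the auto dataset shipped with Stata. The variable pv and the local macros N and full are purely illustrative names, not part of jackknife itself; for the sample mean, the mean of the pseudovalues reproduces the sample mean, and the standard error of that mean reproduces the classical standard error, which is exactly what example 1 below demonstrates with the jackknife prefix.

use http://www.stata-press.com/data/r13/auto, clear
quietly summarize mpg
local N    = r(N)                   // number of observations
local full = r(mean)                // statistic computed from the full sample
generate double pv = .              // will hold the pseudovalues
forvalues j = 1/`N' {
    quietly summarize mpg if _n != `j'                         // leave out obs. j
    quietly replace pv = `N'*`full' - (`N'-1)*r(mean) in `j'   // pseudovalue for obs. j
}
ci pv            // mean of pv and its std. err. are the jackknife estimate
                 // and the jackknife standard error of the sample mean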
Example 1 As our first example, we will show that the jackknife standard error of the sample mean is equivalent to the standard error of the sample mean computed using the classical formula in the ci command. We use the double option to compute the standard errors with the same precision as the ci command. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . jackknife r(mean), double: summarize mpg (running summarize on estimation sample) Jackknife replications (74) 1 2 3 4 5 .................................................. 50 ........................ Jackknife results Number of obs Replications command: summarize mpg _jk_1: r(mean) n(): r(N) Coef. _jk_1 21.2973 Jackknife Std. Err. .6725511 t 31.67 P>|t| 0.000 = = 74 74 [95% Conf. Interval] 19.9569 22.63769 . ci mpg Variable Obs Mean mpg 74 21.2973 Std. Err. .6725511 [95% Conf. Interval] 19.9569 22.63769 jackknife — Jackknife estimation 981 Jackknifed standard deviation Example 2 Mosteller and Tukey (1977, 139–140) request a 95% confidence interval for the standard deviation of the 11 values: 0.1, 0.1, 0.1, 0.4, 0.5, 1.0, 1.1, 1.3, 1.9, 1.9, 4.7 Stata’s summarize command calculates the mean and standard deviation and stores them as r(mean) and r(sd). To obtain the jackknifed standard deviation of the 11 values and save the pseudovalues as a new variable, sd, we would type . clear . input x x 1. 0.1 2. 0.1 3. 0.1 4. 0.4 5. 0.5 6. 1.0 7. 1.1 8. 1.3 9. 1.9 10. 1.9 11. 4.7 12. end . jackknife sd=r(sd), rclass keep: summarize x (running summarize on estimation sample) Jackknife replications (11) 1 2 3 4 ........... Jackknife results command: sd: n(): Number of obs Replications = = 11 11 summarize x r(sd) r(N) Coef. sd 5 1.343469 Jackknife Std. Err. t P>|t| [95% Conf. Interval] .624405 2.15 0.057 -.047792 2.73473 Interpreting the output, the standard deviation reported by summarize mpg is 1.34. The jackknife standard error is 0.62. The 95% confidence interval for the standard deviation is −0.048 to 2.73. 982 jackknife — Jackknife estimation By specifying keep, jackknife creates in our dataset a new variable, sd, for the pseudovalues. . list, sep(4) x sd 1. 2. 3. 4. .1 .1 .1 .4 1.139977 1.139977 1.139977 .8893147 5. 6. 7. 8. .5 1 1.1 1.3 .824267 .632489 .6203189 .6218889 9. 10. 11. 1.9 1.9 4.7 .835419 .835419 7.703949 The jackknife estimate is the average of the sd variable, so sd contains the individual values of our statistic. We can see that the last observation is substantially larger than the others. The last observation is certainly an outlier, but whether that reflects the considerable information it contains or indicates that it should be excluded from analysis depends on the context of the problem. Here Mosteller and Tukey created the dataset by sampling from an exponential distribution, so the observation is informative. Example 3 Let’s repeat the example above using the automobile dataset, obtaining the standard error of the standard deviation of mpg. . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . jackknife sd=r(sd), rclass keep: summarize mpg (running summarize on estimation sample) Jackknife replications (74) 1 2 3 4 5 .................................................. 50 ........................ Jackknife results Number of obs Replications command: summarize mpg sd: r(sd) n(): r(N) Coef. sd 5.785503 = = 74 74 Jackknife Std. Err. t P>|t| [95% Conf. Interval] .6072509 9.53 0.000 4.575254 6.995753 jackknife — Jackknife estimation 983 Let’s look at sd more carefully: . 
summarize sd, detail pseudovalues: r(sd) 1% 5% 10% 25% 50% Percentiles 2.870471 2.870471 2.906255 3.328489 3.948335 Smallest 2.870471 2.870471 2.870471 2.870471 Largest 75% 6.844418 17.34316 90% 9.597018 19.7617 95% 17.34316 19.7617 99% 38.60905 38.60905 . list make mpg sd if sd > 30 make 71. VW Diesel mpg sd 41 38.60905 Obs Sum of Wgt. Mean Std. Dev. 74 74 5.817374 5.22377 Variance Skewness Kurtosis 27.28777 4.07202 23.37823 Here the VW Diesel is the only diesel car in our dataset. Collecting multiple statistics Example 4 jackknife is not limited to collecting just one statistic. For instance, we can use summarize, detail and then obtain the jackknife estimate of the standard deviation and skewness. summarize, detail stores the standard deviation in r(sd) and the skewness in r(skewness), so we might type . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . jackknife sd=r(sd) skew=r(skewness), rclass: summarize mpg, detail (running summarize on estimation sample) Jackknife replications (74) 1 2 3 4 5 .................................................. 50 ........................ Jackknife results Number of obs = Replications = command: summarize mpg, detail sd: r(sd) skew: r(skewness) n(): r(N) Coef. sd skew 5.785503 .9487176 74 74 Jackknife Std. Err. t P>|t| [95% Conf. Interval] .6072509 .3367242 9.53 2.82 0.000 0.006 4.575254 .2776272 6.995753 1.619808 984 jackknife — Jackknife estimation Collecting coefficients Example 5 jackknife can also collect coefficients from estimation commands. For instance, using auto.dta, we might wish to obtain the jackknife standard errors of the coefficients from a regression in which we model the mileage of a car by its weight and trunk space. To do this, we could refer to the coefficients as b[weight], b[trunk], se[weight], and se[trunk] in the exp list, or we could simply use the extended expressions b. In fact, jackknife assumes b by default when used with estimation commands. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . jackknife: regress mpg weight trunk (running regress on estimation sample) Jackknife replications (74) 1 2 3 4 5 .................................................. 50 ........................ Linear regression Number of obs Replications F( 2, 73) Prob > F R-squared Adj R-squared Root MSE mpg Coef. weight trunk _cons -.0056527 -.096229 39.68913 Jackknife Std. Err. .0010216 .1486236 1.873324 t -5.53 -0.65 21.19 P>|t| 0.000 0.519 0.000 = = = = = = = 74 74 78.10 0.0000 0.6543 0.6446 3.4492 [95% Conf. Interval] -.0076887 -.3924354 35.9556 -.0036167 .1999773 43.42266 If you are going to use jackknife to estimate standard errors of model coefficients, we recommend using the vce(jackknife) option when it is allowed with the estimation command; see [R] vce option. . regress mpg weight trunk, vce(jackknife, nodots) Linear regression Number of obs Replications F( 2, 73) Prob > F R-squared Adj R-squared Root MSE mpg Coef. weight trunk _cons -.0056527 -.096229 39.68913 Jackknife Std. Err. .0010216 .1486236 1.873324 t -5.53 -0.65 21.19 P>|t| 0.000 0.519 0.000 = = = = = = = 74 74 78.10 0.0000 0.6543 0.6446 3.4492 [95% Conf. Interval] -.0076887 -.3924354 35.9556 -.0036167 .1999773 43.42266 jackknife — Jackknife estimation 985 John Wilder Tukey (1915–2000) was born in Massachusetts. He studied chemistry at Brown and mathematics at Princeton and afterward worked at both Princeton and Bell Labs, as well as being involved in a great many government projects, consultancies, and committees. 
He made outstanding contributions to several areas of statistics, including time series, multiple comparisons, robust statistics, and exploratory data analysis. Tukey was extraordinarily energetic and inventive, not least in his use of terminology: he is credited with inventing the terms bit and software, in addition to ANOVA, boxplot, data analysis, hat matrix, jackknife, stem-and-leaf plot, trimming, and winsorizing, among many others. Tukey’s direct and indirect impacts mark him as one of the greatest statisticians of all time. Stored results jknife stores the following in e(): Scalars e(N) e(N reps) e(N misreps) e(N clust) e(k eq) e(k extra) e(k exp) e(k eexp) e(df r) Macros e(cmdname) e(cmd) e(command) e(cmdline) e(prefix) e(wtype) e(wexp) e(title) e(cluster) e(pseudo) e(nfunction) e(exp#) e(mse) e(vce) e(vcetype) e(properties) Matrices e(b) e(b jk) e(V) e(V modelbased) When exp list is command. sample size number of complete replications number of incomplete replications number of clusters number of equations in e(b) number of extra equations number of expressions number of extended expressions ( b or degrees of freedom se) command name from command same as e(cmdname) or jackknife command command as typed jackknife weight type weight expression title in estimation output cluster variables new variables containing pseudovalues e(N), r(N), n() option, or empty expression for the #th statistic from mse option jackknife title used to label Std. Err. b V observed statistics jackknife estimates jackknife variance–covariance matrix model-based variance b, jackknife will also carry forward most of the results already in e() from 986 jackknife — Jackknife estimation Methods and formulas Let θb be the observed value of the statistic, that is, the value of the statistic calculated using the original dataset. Let θb(j) be the value of the statistic computed by leaving out the jth observation (or cluster); thus j = 1, 2, . . . , N identifies an individual observation (or cluster), and N is the total number of observations (or clusters). The jth pseudovalue is given by θbj∗ = θb(j) + N {θb − θb(j) } When the mse option is specified, the standard error is estimated as se b = 1/2 N N −1 X b b2 (θ(j) − θ) N j=1 and the jackknife estimate is θ̄(.) = N 1 Xb θ(j) N j=1 Otherwise, the standard error is estimated as 1/2 N X 1 ∗ ∗ 2 b se b = (θj − θ̄ ) N (N − 1) θ̄∗ = j=1 N 1 X b∗ θ N j=1 j where θ̄∗ is the jackknife estimate. The variance–covariance matrix is similarly computed. References Brillinger, D. R. 2002. John W. Tukey: His life and professional contributions. Annals of Statistics 30: 1535–1575. Gould, W. W. 1995. sg34: Jackknife estimation. Stata Technical Bulletin 24: 25–29. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 165–170. College Station, TX: Stata Press. Mooney, C. Z., and R. D. Duval. 1993. Bootstrapping: A Nonparametric Approach to Statistical Inference. Newbury Park, CA: Sage. Mosteller, C. F., and J. W. Tukey. 1977. Data Analysis and Regression: A Second Course in Statistics. Reading, MA: Addison–Wesley. Tukey, J. W. 1958. Bias and confidence in not-quite large samples. Abstract in Annals of Mathematical Statistics 29: 614. 
Also see [R] jackknife postestimation — Postestimation tools for jackknife [R] bootstrap — Bootstrap sampling and estimation [R] permute — Monte Carlo permutation tests [R] simulate — Monte Carlo simulations [SVY] svy jackknife — Jackknife estimation for survey data [U] 13.5 Accessing coefficients and standard errors [U] 13.6 Accessing results from Stata commands [U] 20 Estimation and postestimation commands Title jackknife postestimation — Postestimation tools for jackknife Description Syntax for predict Also see Description The following postestimation commands are available after jackknife: Command Description ∗ contrast estat ic estat summarize estat vce estimates lincom ∗ margins ∗ marginsplot nlcom ∗ predict predictnl contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients pairwise comparisons of estimates Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses ∗ ∗ pwcompare test testnl ∗ This postestimation command is allowed only if it may be used after command. Syntax for predict The syntax of predict (and whether predict is even allowed) following jackknife depends on the command used with jackknife. Also see [R] jackknife — Jackknife estimation [U] 20 Estimation and postestimation commands 987 Title kappa — Interrater agreement Syntax Remarks and examples Menu Stored results Description Methods and formulas Options References Syntax Interrater agreement, two unique raters kap varname1 varname2 if in weight , options Weights for weighting disagreements kapwgt wgtid 1 \ # 1 \ # # 1 . . . Interrater agreement, nonunique raters, variables record ratings for each rater kap varname1 varname2 varname3 . . . if in weight Interrater agreement, nonunique raters, variables record frequency of ratings kappa varlist if in Description options Main tab wgt(wgtid) absolute display table of assessments specify how to weight disagreements; see Options for alternatives treat rating categories as absolute fweights are allowed; see [U] 11.1.6 weight. Menu kap: two unique raters Statistics > Epidemiology and related > Other > Interrater agreement, two unique raters > Epidemiology and related > Other > Define weights for the above (kap) kapwgt Statistics kap: nonunique raters Statistics > Epidemiology and related > Other > Interrater agreement, nonunique raters > Epidemiology and related > Other > Interrater agreement, nonunique raters with frequencies kappa Statistics 988 kappa — Interrater agreement 989 Description kap (first syntax) calculates the kappa-statistic measure of interrater agreement when there are two unique raters and two or more ratings. kapwgt defines weights for use by kap in measuring the importance of disagreements. 
kap (second syntax) and kappa calculate the kappa-statistic measure when there are two or more (nonunique) raters and two outcomes, more than two outcomes when the number of raters is fixed, and more than two outcomes when the number of raters varies. kap (second syntax) and kappa produce the same results; they merely differ in how they expect the data to be organized. kap assumes that each observation is a subject. varname1 contains the ratings by the first rater, varname2 by the second rater, and so on. kappa also assumes that each observation is a subject. The variables, however, record the frequencies with which ratings were assigned. The first variable records the number of times the first rating was assigned, the second variable records the number of times the second rating was assigned, and so on. Options Main tab displays a tabulation of the assessments by the two raters. wgt(wgtid) specifies that wgtid be used to weight disagreements. You can define your own weights by using kapwgt; wgt() then specifies the name of the user-defined matrix. For instance, you might define . kapwgt mine 1 \ .8 1 \ 0 .8 1 \ 0 0 .8 1 and then . kap rata ratb, wgt(mine) Also, two prerecorded weights are available. wgt(w) specifies weights 1 − |i − j|/(k − 1), where i and j index the rows and columns of the ratings by the two raters and k is the maximum number of possible ratings. wgt(w2) specifies weights 1 − {(i − j)/(k − 1)}2 . absolute is relevant only if wgt() is also specified. The absolute option modifies how i, j , and k are defined and how corresponding entries are found in a user-defined weighting matrix. When absolute is not specified, i and j refer to the row and column index, not to the ratings themselves. Say that the ratings are recorded as {0, 1, 1.5, 2}. There are four ratings; k = 4, and i and j are still 1, 2, 3, and 4 in the formulas above. Index 3, for instance, corresponds to rating = 1.5. This system is convenient but can, with some data, lead to difficulties. When absolute is specified, all ratings must be integers, and they must be coded from the set {1, 2, 3, . . .}. Not all values need be used; integer values that do not occur are simply assumed to be unobserved. 990 kappa — Interrater agreement Remarks and examples Remarks are presented under the following headings: Two raters More than two raters The kappa-statistic measure of agreement is scaled to be 0 when the amount of agreement is what would be expected to be observed by chance and 1 when there is perfect agreement. For intermediate values, Landis and Koch (1977a, 165) suggest the following interpretations: below 0.0 0.00 – 0.20 0.21 – 0.40 0.41 – 0.60 0.61 – 0.80 0.81 – 1.00 Poor Slight Fair Moderate Substantial Almost perfect Two raters Example 1 Consider the classification by two radiologists of 85 xeromammograms as normal, benign disease, suspicion of cancer, or cancer (a subset of the data from Boyd et al. [1982] and discussed in the context of kappa in Altman [1991, 403–405]). . use http://www.stata-press.com/data/r13/rate2 (Altman p. 403) . tabulate rada radb Radiologist A’s Radiologist B’s assessment normal benign suspect assessment cancer Total normal benign suspect cancer 21 4 3 0 12 17 9 0 0 1 15 0 0 0 2 1 33 22 29 1 Total 28 38 16 3 85 Our dataset contains two variables: rada, radiologist A’s assessment, and radb, radiologist B’s assessment. Each observation is a patient. We can obtain the kappa measure of interrater agreement by typing . kap rada radb Expected Agreement Agreement 63.53% 30.82% Kappa Std. 
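Before the examples, it may help to see exactly which numbers the prerecorded weights produce. The do-file sketch below builds the wgt(w) and wgt(w2) matrices for k = 4 ratings directly from the formulas above; the matrix names W and W2 are illustrative only, and kap itself prints the same matrices in examples 2 and 3 below.

local k = 4                                 // number of possible ratings
matrix W  = J(`k', `k', 0)
matrix W2 = J(`k', `k', 0)
forvalues i = 1/`k' {
    forvalues j = 1/`k' {
        matrix W[`i',`j']  = 1 - abs(`i'-`j')/(`k'-1)       // wgt(w) weights
        matrix W2[`i',`j'] = 1 - ((`i'-`j')/(`k'-1))^2      // wgt(w2) weights
    }
}
matrix list W
matrix list W2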
Err. 0.4728 0.0694 Z Prob>Z 6.81 0.0000 If each radiologist had made his determination randomly (but with probabilities equal to the overall proportions), we would expect the two radiologists to agree on 30.8% of the patients. In fact, they agreed on 63.5% of the patients, or 47.3% of the way between random agreement and perfect agreement. The amount of agreement indicates that we can reject the hypothesis that they are making their determinations randomly. kappa — Interrater agreement 991 Example 2: Weighted kappa, prerecorded weight w There is a difference between two radiologists disagreeing about whether a xeromammogram indicates cancer or the suspicion of cancer and disagreeing about whether it indicates cancer or is normal. The weighted kappa attempts to deal with this. kap provides two “prerecorded” weights, w and w2: . kap rada radb, wgt(w) Ratings weighted by: 1.0000 0.6667 0.3333 0.6667 1.0000 0.6667 0.3333 0.6667 1.0000 0.0000 0.3333 0.6667 0.0000 0.3333 0.6667 1.0000 Agreement Expected Agreement Kappa Std. Err. 86.67% 69.11% 0.5684 0.0788 Z Prob>Z 7.22 0.0000 The w weights are given by 1 − |i − j|/(k − 1), where i and j index the rows of columns of the ratings by the two raters and k is the maximum number of possible ratings. The weighting matrix is printed above the table. Here the rows and columns of the 4 × 4 matrix correspond to the ratings normal, benign, suspicious, and cancerous. A weight of 1 indicates that an observation should count as perfect agreement. The matrix has 1s down the diagonals — when both radiologists make the same assessment, they are in agreement. A weight of, say, 0.6667 means that they are in two-thirds agreement. In our matrix, they get that score if they are “one apart” — one radiologist assesses cancer and the other is merely suspicious, or one is suspicious and the other says benign, and so on. An entry of 0.3333 means that they are in one-third agreement, or, if you prefer, two-thirds disagreement. That is the score attached when they are “two apart”. Finally, they are in complete disagreement when the weight is zero, which happens only when they are three apart — one says cancer and the other says normal. Example 3: Weighted kappa, prerecorded weight w2 The other prerecorded weight is w2, where the weights are given by 1 − {(i − j)/(k − 1)}2 : . kap rada radb, wgt(w2) Ratings weighted by: 1.0000 0.8889 0.5556 0.0000 0.8889 1.0000 0.8889 0.5556 0.5556 0.8889 1.0000 0.8889 0.0000 0.5556 0.8889 1.0000 Expected Agreement Agreement Kappa Std. Err. 94.77% 84.09% 0.6714 0.1079 Z Prob>Z 6.22 0.0000 The w2 weight makes the categories even more alike and is probably inappropriate here. 992 kappa — Interrater agreement Example 4: Weighted kappa, user-defined weights In addition to using prerecorded weights, we can define our own weights with the kapwgt command. For instance, we might feel that suspicious and cancerous are reasonably similar, that benign and normal are reasonably similar, but that the suspicious/cancerous group is nothing like the benign/normal group: . kapwgt xm 1 \ .8 1 \ 0 0 1 \ 0 0 .8 1 . kapwgt xm 1.0000 0.8000 1.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.8000 1.0000 We name the weights xm, and after the weight name, we enter the lower triangle of the weighting matrix, using \ to separate rows. We have four outcomes, so we continued entering numbers until we had defined the fourth row of the weighting matrix. If we type kapwgt followed by a name and nothing else, it shows us the weights recorded under that name. 
Satisfied that we have entered them correctly, we now use the weights to recalculate kappa: . kap rada radb, wgt(xm) Ratings weighted by: 1.0000 0.8000 0.0000 0.0000 0.8000 1.0000 0.0000 0.0000 0.0000 0.0000 1.0000 0.8000 0.0000 0.0000 0.8000 1.0000 Expected Agreement Agreement Kappa Std. Err. 80.47% 52.67% 0.5874 0.0865 Z Prob>Z 6.79 0.0000 Technical note In addition to using weights for weighting the differences in categories, you can specify Stata’s traditional weights for weighting the data. In the examples above, we have 85 observations in our dataset — one for each patient. If we only knew the table of outcomes — that there were 21 patients rated normal by both radiologists, etc. — it would be easier to enter the table into Stata and work from it. The easiest way to enter the data is with tabi; see [R] tabulate twoway. . tabi 21 12 0 0 \ 4 17 1 0 \ 3 9 15 2 \ 0 0 0 1, replace col row 1 2 3 4 1 2 3 4 21 4 3 0 Total 28 Pearson chi2(9) = 12 17 9 0 38 77.8111 0 1 15 0 16 Pr = 0.000 Total 0 0 2 1 33 22 29 1 3 85 tabi reported the Pearson χ2 for this table, but we do not care about it. The important thing is that, with the replace option, tabi left the table in memory: kappa — Interrater agreement 993 . list in 1/5 row col pop 1 1 1 1 2 1 2 3 4 1 21 12 0 0 4 1. 2. 3. 4. 5. The variable row is radiologist A’s assessment, col is radiologist B’s assessment, and pop is the number so assessed by both. Thus . kap row col [freq=pop] Agreement Expected Agreement Kappa Std. Err. 63.53% 30.82% 0.4728 0.0694 Z Prob>Z 6.81 0.0000 If we are going to keep these data, the names row and col are not indicative of what the data reflect. We could type (see [U] 12.6 Dataset, variable, and value labels) . rename row rada . . . . . . rename col radb label var rada "Radiologist A’s assessment" label var radb "Radiologist B’s assessment" label define assess 1 normal 2 benign 3 suspect 4 cancer label values rada assess label values radb assess . label data "Altman p. 403" kap’s tab option, which can be used with or without weighted data, shows the table of assessments: . kap rada radb [freq=pop], tab Radiologist A’s Radiologist B’s assessment assessment normal benign suspect cancer Total normal benign suspect cancer 21 4 3 0 12 17 9 0 0 1 15 0 0 0 2 1 33 22 29 1 Total 38 16 3 85 Agreement 28 Expected Agreement Kappa Std. Err. 63.53% 30.82% 0.4728 0.0694 Z Prob>Z 6.81 0.0000 994 kappa — Interrater agreement Technical note You have data on individual patients. There are two raters, and the possible ratings are 1, 2, 3, and 4, but neither rater ever used rating 3: . use http://www.stata-press.com/data/r13/rate2no3, clear . tabulate ratera raterb raterb ratera 1 2 4 Total 1 2 4 6 5 1 4 3 1 3 3 26 13 11 28 Total 12 8 32 52 Here kap would determine that the ratings are from the set {1, 2, 4} because those were the only values observed. kap would expect a user-defined weighting matrix to be 3 × 3, and if it were not, kap would issue an error message. In the formula-based weights, the calculation would be based on i, j = 1, 2, 3 corresponding to the three observed ratings {1, 2, 4}. Specifying the absolute option would clarify that the ratings are 1, 2, 3, and 4; it just so happens that rating 3 was never assigned. If a user-defined weighting matrix were also specified, kap would expect it to be 4 × 4 or larger (larger because we can think of the ratings being 1, 2, 3, 4, 5, . . . and it just so happens that ratings 5, 6, . . . were never observed, just as rating 3 was not observed). 
In the formula-based weights, the calculation would be based on i, j = 1, 2, 4. . kap ratera raterb, wgt(w) Ratings weighted by: 1.0000 0.5000 0.0000 0.5000 1.0000 0.5000 0.0000 0.5000 1.0000 Expected Agreement Agreement Kappa 79.81% 57.17% . kap ratera raterb, Ratings weighted by: 1.0000 0.6667 0.6667 1.0000 0.0000 0.3333 Std. Err. 0.5285 0.1169 wgt(w) absolute Z Prob>Z 4.52 0.0000 Z Prob>Z 4.85 0.0000 0.0000 0.3333 1.0000 Agreement Expected Agreement Kappa Std. Err. 81.41% 55.08% 0.5862 0.1209 If all conceivable ratings are observed in the data, specifying absolute makes no difference. For instance, if rater A assigns ratings {1, 2, 4} and rater B assigns {1, 2, 3, 4}, the complete set of assigned ratings is {1, 2, 3, 4}, the same that absolute would specify. Without absolute, it makes no difference whether the ratings are coded {1, 2, 3, 4}, {0, 1, 2, 3}, {1, 7, 9, 100}, {0, 1, 1.5, 2.0}, or otherwise. kappa — Interrater agreement 995 More than two raters For more than two raters, the mathematics are such that the two raters are not considered unique. For instance, if there are three raters, there is no assumption that the three raters who rate the first subject are the same as the three raters who rate the second. Although we call this the “more than two raters” case, it can be used with two raters when the raters’ identities vary. The nonunique rater case can be usefully broken down into three subcases: 1) there are two possible ratings, which we will call positive and negative; 2) there are more than two possible ratings, but the number of raters per subject is the same for all subjects; and 3) there are more than two possible ratings, and the number of raters per subject varies. kappa handles all these cases. To emphasize that there is no assumption of constant identity of raters across subjects, the variables specified contain counts of the number of raters rating the subject into a particular category. Jacob Cohen (1923–1998) was born in New York City. After studying psychology at City College of New York and New York University, he worked as a medical psychologist until 1959 when he became a full professor in the Department of Psychology at New York University. He made many contributions to research methods, including the kappa measure. He persistently emphasized the value of multiple regression and the importance of power and of measuring effects rather than testing significance. Example 5: Two ratings Fleiss, Levin, and Paik (2003, 612) offers the following hypothetical ratings by different sets of raters on 25 subjects: Subject 1 2 3 4 5 6 7 8 9 10 11 12 13 No. of No. of raters pos. ratings 2 2 2 0 3 2 4 3 3 3 4 1 3 0 5 0 2 0 4 4 5 5 3 3 4 4 Subject 14 15 16 17 18 19 20 21 22 23 24 25 No. of raters 4 2 2 3 2 4 5 3 4 3 3 2 No. of pos. ratings 3 0 2 1 1 1 4 2 0 0 3 2 We have entered these data into Stata, and the variables are called subject, raters, and pos. kappa, however, requires that we specify variables containing the number of positive ratings and negative ratings, that is, pos and raters-pos: . use http://www.stata-press.com/data/r13/p612 . gen neg = raters-pos . kappa pos neg Two-outcomes, multiple raters: Kappa Z Prob>Z 0.5415 5.28 0.0000 996 kappa — Interrater agreement We would have obtained the same results if we had typed kappa neg pos. Example 6: More than two ratings, constant number of raters, kappa Each of 10 subjects is rated into one of three categories by five raters (Fleiss, Levin, and Paik 2003, 615): . use http://www.stata-press.com/data/r13/p615, clear . 
list subject cat1 cat2 cat3 1. 2. 3. 4. 5. 1 2 3 4 5 1 2 0 4 3 4 0 0 0 0 0 3 5 1 2 6. 7. 8. 9. 10. 6 7 8 9 10 1 5 0 1 3 4 0 4 0 0 0 0 1 4 2 We obtain the kappa statistic: . kappa cat1-cat3 Kappa Z Prob>Z cat1 cat2 cat3 0.2917 0.6711 0.3490 2.92 6.71 3.49 0.0018 0.0000 0.0002 combined 0.4179 5.83 0.0000 Outcome The first part of the output shows the results of calculating kappa for each of the categories separately against an amalgam of the remaining categories. For instance, the cat1 line is the two-rating kappa, where positive is cat1 and negative is cat2 or cat3. The test statistic, however, is calculated differently (see Methods and formulas). The combined kappa is the appropriately weighted average of the individual kappas. There is considerably less agreement about the rating of subjects into the first category than there is for the second. Example 7: More than two ratings, constant number of raters, kap Now suppose that we have the same data as in the previous example but that the data are organized differently: kappa — Interrater agreement 997 . use http://www.stata-press.com/data/r13/p615b . list subject rater1 rater2 rater3 rater4 rater5 1. 2. 3. 4. 5. 1 2 3 4 5 1 1 3 1 1 2 1 3 1 1 2 3 3 1 1 2 3 3 1 3 2 3 3 3 3 6. 7. 8. 9. 10. 6 7 8 9 10 1 1 2 1 1 2 1 2 3 1 2 1 2 3 1 2 1 2 3 3 2 1 3 3 3 Here we would use kap rather than kappa because the variables record ratings for each rater. . kap rater1 rater2 rater3 rater4 rater5 There are 5 raters per subject: Outcome Kappa Z Prob>Z 1 2 3 0.2917 0.6711 0.3490 2.92 6.71 3.49 0.0018 0.0000 0.0002 combined 0.4179 5.83 0.0000 It does not matter which rater is which when there are more than two raters. Example 8: More than two ratings, varying number of raters, kappa In this unfortunate case, kappa can be calculated, but there is no test statistic for testing against κ > 0. We do nothing differently — kappa calculates the total number of raters for each subject, and, if it is not a constant, kappa suppresses the calculation of test statistics. . use http://www.stata-press.com/data/r13/rvary . list subject cat1 cat2 cat3 1. 2. 3. 4. 5. 1 2 3 4 5 1 2 0 4 3 3 0 0 0 0 0 3 5 1 2 6. 7. 8. 9. 10. 6 7 8 9 10 1 5 0 1 3 4 0 4 0 0 0 0 1 2 2 998 kappa — Interrater agreement . kappa cat1-cat3 Outcome cat1 cat2 cat3 Note: Kappa Z 0.2685 0.6457 0.2938 Prob>Z . . . . . . combined 0.3816 . . number of ratings per subject vary; cannot calculate test statistics. Example 9: More than two ratings, varying number of raters, kap This case is similar to the previous example, but the data are organized differently: . use http://www.stata-press.com/data/r13/rvary2 . list subject rater1 rater2 rater3 rater4 rater5 1. 2. 3. 4. 5. 1 2 3 4 5 1 1 3 1 1 2 1 3 1 1 2 3 3 1 1 . 3 3 1 3 2 3 3 3 3 6. 7. 8. 9. 10. 6 7 8 9 10 1 1 2 1 1 2 1 2 3 1 2 1 2 . 1 2 1 2 . 3 2 1 3 3 3 Here we specify kap instead of kappa because the variables record ratings for each rater. . kap rater1-rater5 There are between 3 and 5 (median = 5.00) raters per subject: Outcome 1 2 3 Note: Kappa 0.2685 0.6457 0.2938 Z Prob>Z . . . . . . combined 0.3816 . . number of ratings per subject vary; cannot calculate test statistics. kappa — Interrater agreement 999 Stored results kap and kappa store the following in r(): Scalars r(N) number of subjects (kap only) r(prop o) observed proportion of agreement (kap only) r(prop e) expected proportion of agreement (kap only) r(kappa) r(z) kappa z statistic r(se) standard error for kappa statistic Methods and formulas The kappa statistic was first proposed by Cohen (1960). 
The generalization for weights reflecting the relative seriousness of each possible disagreement is due to Cohen (1968). The analysis-of-variance approach for k = 2 and m ≥ 2 is due to Landis and Koch (1977b). See Altman (1991, 403–409) or Dunn (2000, chap. 2) for an introductory treatment and Fleiss, Levin, and Paik (2003, chap. 18) for a more detailed treatment. All formulas below are as presented in Fleiss, Levin, and Paik (2003). Let m be the number of raters, and let k be the number of rating outcomes.

Methods and formulas are presented under the following headings:

kap: m = 2
kappa: m > 2, k = 2
kappa: m > 2, k > 2

kap: m = 2

Define $w_{ij}$ ($i = 1, \dots, k$ and $j = 1, \dots, k$) as the weights for agreement and disagreement (wgt()), or, if the data are not weighted, define $w_{ii} = 1$ and $w_{ij} = 0$ for $i \neq j$. If wgt(w) is specified, $w_{ij} = 1 - |i - j|/(k - 1)$. If wgt(w2) is specified, $w_{ij} = 1 - \{(i - j)/(k - 1)\}^2$.

The observed proportion of agreement is
$$ p_o = \sum_{i=1}^{k} \sum_{j=1}^{k} w_{ij}\, p_{ij} $$
where $p_{ij}$ is the fraction of ratings $i$ by the first rater and $j$ by the second. The expected proportion of agreement is
$$ p_e = \sum_{i=1}^{k} \sum_{j=1}^{k} w_{ij}\, p_{i\cdot}\, p_{\cdot j} $$
where $p_{i\cdot} = \sum_j p_{ij}$ and $p_{\cdot j} = \sum_i p_{ij}$. Kappa is given by $\widehat\kappa = (p_o - p_e)/(1 - p_e)$.

The standard error of $\widehat\kappa$ for testing against 0 is
$$ \widehat s_0 = \frac{1}{(1 - p_e)\sqrt{n}} \Bigl[\, \sum_i \sum_j p_{i\cdot}\, p_{\cdot j} \bigl\{ w_{ij} - (\overline w_{i\cdot} + \overline w_{\cdot j}) \bigr\}^2 - p_e^2 \,\Bigr]^{1/2} $$
where $n$ is the number of subjects being rated, $\overline w_{i\cdot} = \sum_j p_{\cdot j}\, w_{ij}$, and $\overline w_{\cdot j} = \sum_i p_{i\cdot}\, w_{ij}$. The test statistic $Z = \widehat\kappa / \widehat s_0$ is assumed to be distributed N(0, 1).

kappa: m > 2, k = 2

Each subject $i$, $i = 1, \dots, n$, is found by $x_i$ of $m_i$ raters to be positive (the choice as to what is labeled positive is arbitrary).

The overall proportion of positive ratings is $\overline p = \sum_i x_i / (n \overline m)$, where $\overline m = \sum_i m_i / n$. The between-subjects mean square is (approximately)
$$ B = \frac{1}{n} \sum_i \frac{(x_i - m_i \overline p)^2}{m_i} $$
and the within-subject mean square is
$$ W = \frac{1}{n(\overline m - 1)} \sum_i \frac{x_i (m_i - x_i)}{m_i} $$
Kappa is then defined as
$$ \widehat\kappa = \frac{B - W}{B + (\overline m - 1) W} $$
The standard error for testing against 0 (Fleiss and Cuzick 1979) is approximately equal to and is calculated as
$$ \widehat s_0 = \frac{1}{(\overline m - 1)\sqrt{n\, \overline m_H}} \Bigl\{ 2(\overline m_H - 1) + \frac{(\overline m - \overline m_H)(1 - 4 \overline p\, \overline q)}{\overline m\, \overline p\, \overline q} \Bigr\}^{1/2} $$
where $\overline m_H$ is the harmonic mean of $m_i$ and $\overline q = 1 - \overline p$. The test statistic $Z = \widehat\kappa / \widehat s_0$ is assumed to be distributed N(0, 1).

kappa: m > 2, k > 2

Let $x_{ij}$ be the number of ratings on subject $i$, $i = 1, \dots, n$, into category $j$, $j = 1, \dots, k$. Define $\overline p_j$ as the overall proportion of ratings in category $j$, $\overline q_j = 1 - \overline p_j$, and let $\widehat\kappa_j$ be the kappa statistic given above for k = 2 when category $j$ is compared with the amalgam of all other categories. Kappa is
$$ \overline\kappa = \frac{\sum_j \overline p_j\, \overline q_j\, \widehat\kappa_j}{\sum_j \overline p_j\, \overline q_j} $$
(Landis and Koch 1977b). In the case where the number of raters per subject, $\sum_j x_{ij}$, is a constant $m$ for all $i$, Fleiss, Nee, and Landis (1979) derived the following formulas for the approximate standard errors. The standard error for testing $\widehat\kappa_j$ against 0 is
$$ \widehat s_j = \Bigl\{ \frac{2}{n m (m - 1)} \Bigr\}^{1/2} $$
and the standard error for testing $\overline\kappa$ is
$$ \overline s = \frac{\sqrt{2}}{\sum_j \overline p_j\, \overline q_j\, \sqrt{n m (m - 1)}} \Bigl\{ \Bigl(\sum_j \overline p_j\, \overline q_j\Bigr)^2 - \sum_j \overline p_j\, \overline q_j\, (\overline q_j - \overline p_j) \Bigr\}^{1/2} $$

References

Abramson, J. H., and Z. H. Abramson. 2001. Making Sense of Data: A Self-Instruction Manual on the Interpretation of Epidemiological Data. 3rd ed. New York: Oxford University Press.
Altman, D. G. 1991. Practical Statistics for Medical Research. London: Chapman & Hall/CRC.
Boyd, N. F., C. Wolfson, M. Moskowitz, T. Carlile, M. Petitclerc, H. A. Ferri, E. Fishell, A. Gregoire, M. Kiernan, J. D. Longley, I. S. Simor, and A. B. Miller. 1982.
Observer variation in the interpretation of xeromammograms. Journal of the National Cancer Institute 68: 357–363. Campbell, M. J., D. Machin, and S. J. Walters. 2007. Medical Statistics: A Textbook for the Health Sciences. 4th ed. Chichester, UK: Wiley. Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20: 37–46. . 1968. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin 70: 213–220. Cox, N. J. 2006. Assessing agreement of measurements and predictions in geomorphology. Geomorphology 76: 332–346. Dunn, G. 2000. Statistics in Psychiatry. London: Arnold. Fleiss, J. L., and J. Cuzick. 1979. The reliability of dichotomous judgments: Unequal numbers of judges per subject. Applied Psychological Measurement 3: 537–542. Fleiss, J. L., B. Levin, and M. C. Paik. 2003. Statistical Methods for Rates and Proportions. 3rd ed. New York: Wiley. Fleiss, J. L., J. C. M. Nee, and J. R. Landis. 1979. Large sample variance of kappa in the case of different sets of raters. Psychological Bulletin 86: 974–977. Gould, W. W. 1997. stata49: Interrater agreement. Stata Technical Bulletin 40: 2–8. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 20–28. College Station, TX: Stata Press. Landis, J. R., and G. G. Koch. 1977a. The measurement of observer agreement for categorical data. Biometrics 33: 159–174. . 1977b. A one-way components of variance model for categorical data. Biometrics 33: 671–679. Reichenheim, M. E. 2000. sxd3: Sample size for the kappa-statistic of interrater agreement. Stata Technical Bulletin 58: 41–45. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 382–387. College Station, TX: Stata Press. . 2004. Confidence intervals for the kappa statistic. Stata Journal 4: 421–428. Shrout, P. E. 2001. Jacob Cohen (1923–1998). American Psychologist 56: 166. Steichen, T. J., and N. J. Cox. 1998a. sg84: Concordance correlation coefficient. Stata Technical Bulletin 43: 35–39. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 137–143. College Station, TX: Stata Press. . 1998b. sg84.1: Concordance correlation coefficient, revisited. Stata Technical Bulletin 45: 21–23. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 143–145. College Station, TX: Stata Press. . 2000a. sg84.3: Concordance correlation coefficient: Minor corrections. Stata Technical Bulletin 58: 9. Reprinted in Stata Technical Bulletin Reprints, vol. 10, p. 137. College Station, TX: Stata Press. . 2000b. sg84.2: Concordance correlation coefficient: Update for Stata 6. Stata Technical Bulletin 54: 25–26. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 169–170. College Station, TX: Stata Press. . 2002. A note on the concordance correlation coefficient. Stata Journal 2: 183–189. 
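As a closing aside to examples 6 through 9 above, the two data layouts are interchangeable: rater-format data (used with kap) can be collapsed into category-count format (used with kappa) with egen's anycount() function. A minimal sketch, assuming the ratings are coded 1, 2, and 3 as in the p615b data; the combined kappa should match the one reported in example 7:

. use http://www.stata-press.com/data/r13/p615b, clear
. egen cat1 = anycount(rater1-rater5), values(1)
. egen cat2 = anycount(rater1-rater5), values(2)
. egen cat3 = anycount(rater1-rater5), values(3)
. kappa cat1-cat3

For two unique raters, user-defined disagreement weights can also be supplied by first defining a weight matrix with kapwgt (for example, kapwgt mine 1 \ .8 1 \ 0 .8 1) and then specifying wgt(mine) with kap.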
Title kdensity — Univariate kernel density estimation Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Options Acknowledgments Syntax kdensity varname if options in weight , options Description Main kernel(kernel) bwidth(#) generate(newvarx newvard ) n(#) at(varx ) nograph specify kernel function; default is kernel(epanechnikov) half-width of kernel store the estimation points in newvarx and the density estimate in newvard estimate density using # points; default is min(N , 50) estimate density using the values specified by varx suppress graph Kernel plot cline options affect rendition of the plotted kernel density estimate Density plots normal normopts(cline options) student(#) stopts(cline options) add normal density to the graph affect rendition of normal density add Student’s t density with # degrees of freedom to the graph affect rendition of the Student’s t density Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options kernel Description epanechnikov epan2 biweight cosine gaussian parzen rectangle triangle Epanechnikov kernel function; the default alternative Epanechnikov kernel function biweight kernel function cosine trace kernel function Gaussian kernel function Parzen kernel function rectangle kernel function triangle kernel function fweights, aweights, and iweights are allowed; see [U] 11.1.6 weight. 1002 kdensity — Univariate kernel density estimation 1003 Menu Statistics > Nonparametric analysis > Kernel density estimation Description kdensity produces kernel density estimates and graphs the result. Options Main kernel(kernel) specifies the kernel function for use in calculating the kernel density estimate. The default kernel is the Epanechnikov kernel (epanechnikov). bwidth(#) specifies the half-width of the kernel, the width of the density window around each point. If bwidth() is not specified, the “optimal” width is calculated and used. The optimal width is the width that would minimize the mean integrated squared error if the data were Gaussian and a Gaussian kernel were used, so it is not optimal in any global sense. In fact, for multimodal and highly skewed densities, this width is usually too wide and oversmooths the density (Silverman 1992). generate(newvarx newvard ) stores the results of the estimation. newvarx will contain the points at which the density is estimated. newvard will contain the density estimate. n(#) specifies the number of points at which the density estimate is to be evaluated. The default is min(N, 50), where N is the number of observations in memory. at(varx ) specifies a variable that contains the values at which the density should be estimated. This option allows you to more easily obtain density estimates for different variables or different subsamples of a variable and then overlay the estimated densities for comparison. nograph suppresses the graph. This option is often used with the generate() option. Kernel plot cline options affect the rendition of the plotted kernel density estimate. See [G-3] cline options. Density plots normal requests that a normal density be overlaid on the density estimate for comparison. normopts(cline options) specifies details about the rendition of the normal curve, such as the color and style of line used. See [G-3] cline options. 
student(#) specifies that a Student’s t density with # degrees of freedom be overlaid on the density estimate for comparison. stopts(cline options) affects the rendition of the Student’s t density. See [G-3] cline options. Add plots addplot(plot) provides a way to add other plots to the generated graph. See [G-3] addplot option. 1004 kdensity — Univariate kernel density estimation Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Remarks and examples Kernel density estimators approximate the density f (x) from observations on x. Histograms do this, too, and the histogram itself is a kind of kernel density estimate. The data are divided into nonoverlapping intervals, and counts are made of the number of data points within each interval. Histograms are bar graphs that depict these frequency counts — the bar is centered at the midpoint of each interval — and its height reflects the average number of data points in the interval. In more general kernel density estimates, the range is still divided into intervals, and estimates of the density at the center of intervals are produced. One difference is that the intervals are allowed to overlap. We can think of sliding the interval — called a window — along the range of the data and collecting the center-point density estimates. The second difference is that, rather than merely counting the number of observations in a window, a kernel density estimator assigns a weight between 0 and 1 — based on the distance from the center of the window — and sums the weighted values. The function that determines these weights is called the kernel. Kernel density estimates have the advantages of being smooth and of being independent of the choice of origin (corresponding to the location of the bins in a histogram). See Salgado-Ugarte, Shimizu, and Taniuchi (1993) and Fox (1990) for discussions of kernel density estimators that stress their use as exploratory data-analysis tools. Cox (2007) gives a lucid introductory tutorial on kernel density estimation with several Stata produced examples. He provides tips and tricks for working with skewed or bounded distributions and applying the same techniques to estimate the intensity function of a point process. Example 1: Histogram and kernel density estimate Goeden (1978) reports data consisting of 316 length observations of coral trout. We wish to investigate the underlying density of the lengths. To begin on √familiar ground, we might draw a histogram. In [R] histogram, we suggest setting the bins to min( n, 10 · log10 n), which for n = 316 is roughly 18: kdensity — Univariate kernel density estimation 1005 0 .002 Density .004 .006 .008 . use http://www.stata-press.com/data/r13/trocolen . histogram length, bin(18) (bin=18, start=226, width=19.777778) 200 300 400 length 500 600 The kernel density estimate, on the other hand, is smooth. . kdensity length 0 .001 Density .002 .003 .004 .005 Kernel density estimate 200 300 400 length 500 600 kernel = epanechnikov, bandwidth = 20.1510 Kernel density estimators are, however, sensitive to an assumption, just as are histograms. In histograms, we specify a number of bins. For kernel density estimators, we specify a width. In the graph above, we used the default width. kdensity is smarter than twoway histogram in that its default width is not a fixed constant. 
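Rather than typing a bandwidth, the data-dependent default can also be rescaled directly. A minimal sketch, reusing the length variable from this example (the returned scalar r(bwidth) holds the default half-width):

. quietly kdensity length, nograph
. kdensity length, bwidth(`=2*r(bwidth)')

The second call oversmooths relative to the default; replacing 2 with a fraction such as 0.5 does the opposite.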
Even so, the default width is not necessarily best. kdensity stores the width in the returned scalar bwidth, so typing display r(bwidth) reveals it. Doing this, we discover that the width is approximately 20. Widths are similar to the inverse of the number of bins in a histogram in that smaller widths provide more detail. The units of the width are the units of x, the variable being analyzed. The width is specified as a half-width, meaning that the kernel density estimator with half-width 20 corresponds to sliding a window of size 40 across the data. 1006 kdensity — Univariate kernel density estimation We can specify half-widths for ourselves by using the bwidth() option. Smaller widths do not smooth the density as much: . kdensity length, bwidth(10) 0 .002 Density .004 .006 .008 Kernel density estimate 200 300 400 length 500 600 kernel = epanechnikov, bandwidth = 10.0000 . kdensity length, bwidth(15) 0 .002 Density .004 .006 Kernel density estimate 200 300 400 length 500 600 kernel = epanechnikov, bandwidth = 15.0000 Example 2: Different kernels can produce different results When widths are held constant, different kernels can produce surprisingly different results. This is really an attribute of the kernel and width combination; for a given width, some kernels are more sensitive than others at identifying peaks in the density estimate. We can see this when using a dataset with lots of peaks. In the automobile dataset, we characterize the density of weight, the weight of the vehicles. Below we compare the Epanechnikov and Parzen kernels. kdensity — Univariate kernel density estimation 1007 . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . kdensity weight, kernel(epanechnikov) nograph generate(x epan) kdensity weight, kernel(parzen) nograph generate(x2 parzen) label var epan "Epanechnikov density estimate" label var parzen "Parzen density estimate" line epan parzen x, sort ytitle(Density) legend(cols(1)) 0 .0002 Density .0004 .0006 .0008 . . . . 1000 2000 3000 Weight (lbs.) 4000 5000 Epanechnikov density estimate Parzen density estimate We did not specify a width, so we obtained the default width. That width is not a function of the selected kernel, but of the data. See Methods and formulas for the calculation of the optimal width. Example 3: Density with overlaid normal density In examining the density estimates, we may wish to overlay a normal density or a Student’s t density for comparison. Using automobile weights, we can get an idea of the distance from normality by using the normal option. . kdensity weight, kernel(epanechnikov) normal 0 Density .0001 .0002 .0003 .0004 .0005 Kernel density estimate 1000 2000 3000 Weight (lbs.) 4000 Kernel density estimate Normal density kernel = epanechnikov, bandwidth = 295.7504 5000 1008 kdensity — Univariate kernel density estimation Example 4: Compare two densities We also may want to compare two or more densities. In this example, we will compare the density estimates of the weights for the foreign and domestic cars. . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . . . . . kdensity weight, nograph generate(x fx) kdensity weight if foreign==0, nograph generate(fx0) at(x) kdensity weight if foreign==1, nograph generate(fx1) at(x) label var fx0 "Domestic cars" label var fx1 "Foreign cars" 0 .0002 Density .0004 .0006 .0008 .001 . line fx0 fx1 x, sort ytitle(Density) 1000 2000 3000 Weight (lbs.) 
Domestic cars 4000 5000 Foreign cars Technical note Although all the examples we included had densities of less than 1, the density may exceed 1. The probability density f (x) of a continuous variable, x, has the units and dimensions of the reciprocal of x. If x is measured in meters, f (x) has units 1/meter. Thus the density is not measured on a probability scale, so it is possible for f (x) to exceed 1. To see this, think of a uniform density on the interval 0 to 1. The area under the density curve is 1: this is the product of the density, which is constant at 1, and the range, which is 1. If the variable is then transformed by doubling, the area under the curve remains 1 and is the product of the density, constant at 0.5, and the range, which is 2. Conversely, if the variable is transformed by halving, the area under the curve also remains at 1 and is the product of the density, constant at 2, and the range, which is 0.5. (Strictly, the range is measured in certain units, and the density is measured in the reciprocal of those units, so the units cancel on multiplication.) kdensity — Univariate kernel density estimation 1009 Stored results kdensity stores the following in r(): Scalars r(bwidth) r(n) r(scale) kernel bandwidth number of points at which the estimate was evaluated density bin width Macros r(kernel) name of kernel Methods and formulas A kernel density estimate is formed by summing the weighted values calculated with the kernel function K , as in n x − Xi 1 X b wi K fK = qh h i=1 P where q = i wi if weights are frequency weights (fweight) or analytic weights (aweight), and P q = 1 if weights are importance weights (iweights). Analytic weights are rescaled so that i wi = n (see [U] 11 Language syntax). If weights are not used, then wi = 1, for i = 1, . . . , n. kdensity includes seven different kernel functions. The Epanechnikov is the default function if no other kernel is specified and is the most efficient in minimizing the mean integrated squared error. Kernel Formula 15 Biweight Cosine Epanechnikov Epan2 Gaussian Parzen Rectangular Triangular 16 (1 K[z] = − z 2 )2 0 1 + cos(2πz) 0 √ 3 (1 − 15 z 2 )/ 5 K[z] = 4 0 3 (1 − z 2 ) K[z] = 4 0 K[z] = K[z] = n if |z| < 1 otherwise if |z| < 1/2 otherwise √ if |z| < 5 otherwise if |z| < 1 otherwise 2 √1 e−z /2 2π 4 2 3 3 − 8z + 8|z| 3 K[z] = 8(1 − |z|) /3 0 n K[z] = 1/2 0 n 1 − |z| K[z] = 0 if |z| ≤ 1/2 if 1/2 < |z| ≤ 1 otherwise if |z| < 1 otherwise if |z| < 1 otherwise From the definitions given in the table, we can see that the choice of h will drive how many values are included in estimating the density at each point. This value is called the window width or bandwidth. If the window width is not specified, it is determined as 1010 kdensity — Univariate kernel density estimation m = min h= √ variancex , interquartile rangex 1.349 0.9m n1/5 where x is the variable for which we wish to estimate the kernel and n is the number of observations. Most researchers agree that the choice of kernel is not as important as the choice of bandwidth. There is a great deal of literature on choosing bandwidths under various conditions; see, for example, Parzen (1962) or Tapia and Thompson (1978). Also see Newton (1988) for a comparison with sample spectral density estimation in time-series applications. Acknowledgments We gratefully acknowledge the previous work by Isaı́as H. Salgado-Ugarte of Universidad Nacional Autónoma de México, and Makoto Shimizu and Toru Taniuchi of the University of Tokyo; see Salgado-Ugarte, Shimizu, and Taniuchi (1993). 
Their article provides a good overview of the subject of univariate kernel density estimation and presents arguments for its use in exploratory data analysis. References Cox, N. J. 2005. Speaking Stata: Density probability plots. Stata Journal 5: 259–273. . 2007. Kernel estimation as a basic tool for geomorphological data analysis. Earth Surface Processes and Landforms 32: 1902–1912. Fiorio, C. V. 2004. Confidence intervals for kernel density estimation. Stata Journal 4: 168–179. Fox, J. 1990. Describing univariate distributions. In Modern Methods of Data Analysis, ed. J. Fox and J. S. Long, 58–125. Newbury Park, CA: Sage. Goeden, G. B. 1978. A monograph of the coral trout, Plectropomus leopardus (Lacépède). Queensland Fisheries Services Research Bulletin 1: 1–42. Kohler, U., and F. Kreuter. 2012. Data Analysis Using Stata. 3rd ed. College Station, TX: Stata Press. Newton, H. J. 1988. TIMESLAB: A Time Series Analysis Laboratory. Belmont, CA: Wadsworth. Parzen, E. 1962. On estimation of a probability density function and mode. Annals of Mathematical Statistics 33: 1065–1076. Royston, P., and N. J. Cox. 2005. A multivariable scatterplot smoother. Stata Journal 5: 405–412. Salgado-Ugarte, I. H., and M. A. Pérez-Hernández. 2003. Exploring the use of variable bandwidth kernel density estimators. Stata Journal 3: 133–147. Salgado-Ugarte, I. H., M. Shimizu, and T. Taniuchi. 1993. snp6: Exploring the shape of univariate data using kernel density estimators. Stata Technical Bulletin 16: 8–19. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 155–173. College Station, TX: Stata Press. . 1995a. snp6.1: ASH, WARPing, and kernel density estimation for univariate data. Stata Technical Bulletin 26: 23–31. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 161–172. College Station, TX: Stata Press. . 1995b. snp6.2: Practical rules for bandwidth selection in univariate density estimation. Stata Technical Bulletin 27: 5–19. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 172–190. College Station, TX: Stata Press. . 1997. snp13: Nonparametric assessment of multimodality for univariate data. Stata Technical Bulletin 38: 27–35. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 232–243. College Station, TX: Stata Press. Scott, D. W. 1992. Multivariate Density Estimation: Theory, Practice, and Visualization. New York: Wiley. Silverman, B. W. 1992. Density Estimation for Statistics and Data Analysis. London: Chapman & Hall. Simonoff, J. S. 1996. Smoothing Methods in Statistics. New York: Springer. kdensity — Univariate kernel density estimation 1011 Steichen, T. J. 1998. gr33: Violin plots. Stata Technical Bulletin 46: 13–18. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 57–65. College Station, TX: Stata Press. Tapia, R. A., and J. R. Thompson. 1978. Nonparametric Probability Density Estimation. Baltimore: Johns Hopkins University Press. Van Kerm, P. 2003. Adaptive kernel density estimation. Stata Journal 3: 148–156. . 2012. Kernel-smoothed cumulative distribution function estimation with akdensity. Stata Journal 12: 543–548. Wand, M. P., and M. C. Jones. 1995. Kernel Smoothing. London: Chapman & Hall. 
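As a check on the bandwidth rule given under Methods and formulas above, the default h can be reproduced by hand from the results of summarize, detail. A minimal sketch using the coral trout data of example 1; the displayed value should be close to the bandwidth that kdensity reports there:

. use http://www.stata-press.com/data/r13/trocolen, clear
. quietly summarize length, detail
. local m = min(sqrt(r(Var)), (r(p75)-r(p25))/1.349)
. display 0.9*`m'/(r(N)^(1/5))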
Also see [R] histogram — Histograms for continuous and categorical variables Title ksmirnov — Kolmogorov – Smirnov equality-of-distributions test Syntax Options for two-sample test Methods and formulas Menu Remarks and examples References Description Stored results Also see Syntax One-sample Kolmogorov–Smirnov test ksmirnov varname = exp if in Two-sample Kolmogorov–Smirnov test ksmirnov varname if in , by(groupvar) exact Menu Statistics > Nonparametric analysis > Tests of hypotheses > Kolmogorov-Smirnov test Description ksmirnov performs one- and two-sample Kolmogorov – Smirnov tests of the equality of distributions. In the first syntax, varname is the variable whose distribution is being tested, and exp must evaluate to the corresponding (theoretical) cumulative. In the second syntax, groupvar must take on two distinct values. The distribution of varname for the first value of groupvar is compared with that of the second value. When testing for normality, please see [R] sktest and [R] swilk. Options for two-sample test Main by(groupvar) is required. It specifies a binary variable that identifies the two groups. exact specifies that the exact p-value be computed. This may take a long time if n > 50. Remarks and examples Example 1: Two-sample test Say that we have data on x that resulted from two different experiments, labeled as group==1 and group==2. Our data contain 1012 ksmirnov — Kolmogorov – Smirnov equality-of-distributions test 1013 . use http://www.stata-press.com/data/r13/ksxmpl . list group x 1. 2. 3. 4. 5. 2 1 2 1 1 2 0 3 4 5 6. 7. 2 2 8 10 We wish to use the two-sample Kolmogorov – Smirnov test to determine if there are any differences in the distribution of x for these two groups: . ksmirnov x, by(group) Two-sample Kolmogorov-Smirnov test for equality of distribution functions Smaller group D P-value Corrected 1: 2: Combined K-S: 0.5000 -0.1667 0.5000 0.424 0.909 0.785 0.735 The first line tests the hypothesis that x for group 1 contains smaller values than for group 2. The largest difference between the distribution functions is 0.5. The approximate p-value for this is 0.424, which is not significant. The second line tests the hypothesis that x for group 1 contains larger values than for group 2. The largest difference between the distribution functions in this direction is 0.1667. The approximate p-value for this small difference is 0.909. Finally, the approximate p-value for the combined test is 0.785, corrected to 0.735. The p-values ksmirnov calculates are based on the asymptotic distributions derived by Smirnov (1933). These approximations are not good for small samples (n < 50). They are too conservative — real p-values tend to be substantially smaller. We have also included a less conservative approximation for the nondirectional hypothesis based on an empirical continuity correction—the 0.735 reported in the third column. That number, too, is only an approximation. An exact value can be calculated using the exact option: . ksmirnov x, by(group) exact Two-sample Kolmogorov-Smirnov test for equality of distribution functions Smaller group D P-value Exact 1: 2: Combined K-S: 0.5000 -0.1667 0.5000 0.424 0.909 0.785 0.657 Example 2: One-sample test Let’s now test whether x in the example above is distributed normally. Kolmogorov – Smirnov is not a particularly powerful test in testing for normality, and we do not endorse such use of it; see [R] sktest and [R] swilk for better tests. 
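In the example that follows, the sample mean and standard deviation are typed into the expression by hand. The same test can be set up without retyping them, as in this minimal sketch (the r(mean) and r(sd) scalars left behind by summarize may be used directly in the expression); up to rounding of the typed values, the results should agree with those shown below:

. quietly summarize x
. ksmirnov x = normal((x - r(mean))/r(sd))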
1014 ksmirnov — Kolmogorov – Smirnov equality-of-distributions test In any case, we will test against a normal distribution with the same mean and standard deviation: . summarize x Variable Obs Mean Std. Dev. Min Max x 7 4.571429 3.457222 0 10 . ksmirnov x = normal((x-4.571429)/3.457222) One-sample Kolmogorov-Smirnov test against theoretical distribution normal((x-4.571429)/3.457222) Smaller group D P-value Corrected x: Cumulative: Combined K-S: 0.1650 -0.1250 0.1650 0.683 0.803 0.991 0.978 Because Stata has no way of knowing that we based this calculation on the calculated mean and standard deviation of x, the test statistics will be slightly conservative in addition to being approximations. Nevertheless, they clearly indicate that the data cannot be distinguished from normally distributed data. Stored results ksmirnov stores the following in r(): Scalars r(D 1) r(p 1) r(D 2) r(p 2) Macros r(group1) D from line 1 p-value from line 1 D from line 2 p-value from line 2 r(D) r(p) r(p cor) r(p exact) combined D combined p-value corrected combined p-value exact combined p-value name of group from line 1 r(group2) name of group from line 2 Methods and formulas In general, the Kolmogorov – Smirnov test (Kolmogorov 1933; Smirnov 1933; also see Conover [1999], 428 – 465) is not very powerful against differences in the tails of distributions. In return for this, it is fairly powerful for alternative hypotheses that involve lumpiness or clustering in the data. The directional hypotheses are evaluated with the statistics n o D+ = max F (x) − G(x) x n o − D = min F (x) − G(x) x where F (x) and G(x) are the empirical distribution functions for the sample being compared. The combined statistic is D = max |D+ | , |D− | The p-value for this statistic may be obtained by evaluating the asymptotic limiting distribution. Let m be the sample size for the first sample, and let n be the sample size for the second sample. Smirnov (1933) shows that ksmirnov — Kolmogorov – Smirnov equality-of-distributions test lim Pr np m,n→∞ o mn/(m + n)Dm,n ≤ z = 1 − 2 ∞ X 1015 i−1 −1 exp − 2i2 z 2 i=1 The first five terms form the approximation Pa used by Stata. The exact p-value is calculated by a counting algorithm; see Gibbons and Chakraborti (2011, 236–238). A corrected p-value was obtained by modifying the asymptotic p-value by using a numerical approximation technique: p Z = Φ−1 (Pa ) + 1.04/ min(m, n) + 2.09/ max(m, n) − 1.35/ mn/(m + n) p-value = Φ(Z) where Φ(·) is the cumulative normal distribution. Andrei Nikolayevich Kolmogorov (1903–1987), of Russia, was one of the great mathematicians of the twentieth century, making outstanding contributions in many different branches, including set theory, measure theory, probability and statistics, approximation theory, functional analysis, classical dynamics, and theory of turbulence. He was a faculty member at Moscow State University for more than 60 years. Nikolai Vasilyevich Smirnov (1900–1966) was a Russian statistician whose work included contributions in nonparametric statistics, order statistics, and goodness of fit. After army service and the study of philosophy and philology, he turned to mathematics and eventually rose to be head of mathematical statistics at the Steklov Mathematical Institute in Moscow. References Aivazian, S. A. 1997. Smirnov, Nikolai Vasil’yevich. In Leading Personalities in Statistical Sciences: From the Seventeenth Century to the Present, ed. N. L. Johnson and S. Kotz, 208–210. New York: Wiley. Conover, W. J. 1999. 
Practical Nonparametric Statistics. 3rd ed. New York: Wiley. Gibbons, J. D., and S. Chakraborti. 2011. Nonparametric Statistical Inference. 5th ed. Boca Raton, FL: Chapman & Hall/CRC. Goerg, S. J., and J. Kaiser. 2009. Nonparametric testing of distributions—the Epss–Singleton two-sample test using the empirical characteristic function. Stata Journal 9: 454–465. Jann, B. 2008. Multinomial goodness-of-fit: Large-sample tests with survey design correction and exact tests for small samples. Stata Journal 8: 147–169. Johnson, N. L., and S. Kotz. 1997. Kolmogorov, Andrei Nikolayevich. In Leading Personalities in Statistical Sciences: From the Seventeenth Century to the Present, ed. N. L. Johnson and S. Kotz, 255–256. New York: Wiley. Kolmogorov, A. N. 1933. Sulla determinazione empirica di una legge di distribuzione. Giornale dell’ Istituto Italiano degli Attuari 4: 83–91. Riffenburgh, R. H. 2012. Statistics in Medicine. 3rd ed. San Diego, CA: Academic Press. Smirnov, N. V. 1933. Estimate of deviation between empirical distribution functions in two independent samples. Bulletin Moscow University 2: 3–16. Also see [R] runtest — Test for random order [R] sktest — Skewness and kurtosis test for normality [R] swilk — Shapiro – Wilk and Shapiro – Francia tests for normality Title kwallis — Kruskal – Wallis equality-of-populations rank test Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Option References Syntax kwallis varname if in , by(groupvar) Menu Statistics > Nonparametric analysis > Tests of hypotheses > Kruskal-Wallis rank test Description kwallis tests the hypothesis that several samples are from the same population. In the syntax diagram above, varname refers to the variable recording the outcome, and groupvar refers to the variable denoting the population. by() is required. Option by(groupvar) is required. It specifies a variable that identifies the groups. Remarks and examples Example 1 We have data on the 50 states. The data contain the median age of the population, medage, and the region of the country, region, for each state. We wish to test for the equality of the median age distribution across all four regions simultaneously: . use http://www.stata-press.com/data/r13/census (1980 Census data by state) . kwallis medage, by(region) Kruskal-Wallis equality-of-populations rank test region Obs Rank Sum NE N Cntrl South West 9 12 16 13 376.50 294.00 398.00 206.50 chi-squared probability chi-squared probability = 17.041 with 3 d.f. = 0.0007 with ties = 17.062 with 3 d.f. = 0.0007 1016 kwallis — Kruskal – Wallis equality-of-populations rank test 1017 From the output, we see that we can reject the hypothesis that the populations are the same at any level below 0.07%. Stored results kwallis stores the following in r(): Scalars r(df) degrees of freedom r(chi2) χ2 r(chi2 adj) χ2 adjusted for ties Methods and formulas The Kruskal – Wallis test (Kruskal and Wallis 1952, 1953; also see Altman [1991, 213 – 215]; Conover [1999, 288 – 297]; and Riffenburgh [2012, sec. 11.6]) is a multiple-sample generalization of the two-sample Wilcoxon (also called Mann – Whitney) rank sum test (Wilcoxon 1945; Mann and Whitney 1947). Samples of sizes nj , j = 1, . . . , m, are combined and ranked in ascending order of magnitude. Pnj Tied values are assigned the average ranks. Let n denote the overall sample size, and let Rj = i=1 R(Xji ) denote the sum of the ranks for the j th sample. 
The Kruskal – Wallis one-way analysis-of-variance test, H , is defined as m 2 2 X R n(n + 1) 1 j − H= 2 S j=1 nj 4 where X 2 n(n + 1) 1 R(Xji )2 − S2 = n−1 4 all ranks If there are no ties, this equation simplifies to m X Rj2 12 H= − 3(n + 1) n(n + 1) j=1 nj The sampling distribution of H is approximately χ2 with m − 1 degrees of freedom. William Henry Kruskal (1919–2005) was born in New York City. He studied mathematics and statistics at Antioch College, Harvard, and Columbia, and joined the University of Chicago in 1951. He made many outstanding contributions to linear models, nonparametric statistics, government statistics, and the history and methodology of statistics. Wilson Allen Wallis (1912–1998) was born in Philadelphia. He studied psychology and economics at the Universities of Minnesota and Chicago and at Columbia. He taught at Yale, Stanford, and Chicago, before moving as president (later chancellor) to the University of Rochester in 1962. He also served in several Republican administrations. Wallis served as editor of the Journal of the American Statistical Association, coauthored a popular introduction to statistics, and contributed to nonparametric statistics. 1018 kwallis — Kruskal – Wallis equality-of-populations rank test References Altman, D. G. 1991. Practical Statistics for Medical Research. London: Chapman & Hall/CRC. Conover, W. J. 1999. Practical Nonparametric Statistics. 3rd ed. New York: Wiley. Fienberg, S. E., S. M. Stigler, and J. M. Tanur. 2007. The William Kruskal Legacy: 1919–2005. Statistical Science 22: 255–261. Kruskal, W. H., and W. A. Wallis. 1952. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association 47: 583–621. . 1953. Errata: Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association 48: 907–911. Mann, H. B., and D. R. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics 18: 50–60. Newson, R. B. 2006. Confidence intervals for rank statistics: Somers’ D and extensions. Stata Journal 6: 309–334. Olkin, I. 1991. A conversation with W. Allen Wallis. Statistical Science 6: 121–140. Riffenburgh, R. H. 2012. Statistics in Medicine. 3rd ed. San Diego, CA: Academic Press. Wilcoxon, F. 1945. Individual comparisons by ranking methods. Biometrics 1: 80–83. Zabell, S. L. 1994. A conversation with William Kruskal. Statistical Science 9: 285–303. Also see [R] nptrend — Test for trend across ordered groups [R] oneway — One-way analysis of variance [R] sdtest — Variance-comparison tests [R] signrank — Equality tests on matched data Title ladder — Ladder of powers Syntax Options for ladder Remarks and examples Acknowledgment Menu Options for gladder Stored results References Description Options for qladder Methods and formulas Also see Syntax Ladder of powers ladder varname if in , generate(newvar) noadjust Ladder-of-powers histograms gladder varname if in , histogram options combine options Ladder-of-powers quantile–normal plots qladder varname if in , qnorm options combine options by is allowed with ladder; see [D] by. 
Menu ladder Statistics > Summaries, tables, and tests > Distributional plots and tests > Ladder of powers Summaries, tables, and tests > Distributional plots and tests > Ladder-of-powers histograms Summaries, tables, and tests > Distributional plots and tests gladder Statistics > qladder Statistics > > Ladder-of-powers quantile-normal plots Description ladder searches a subset of the ladder of powers (Tukey 1977) for a transform that converts varname into a normally distributed variable. sktest tests for normality; see [R] sktest. Also see [R] boxcox. gladder displays nine histograms of transforms of varname according to the ladder of powers. gladder is useful pedagogically, but we do not advise looking at histograms for research work; ladder or qnorm (see [R] diagnostic plots) is preferred. qladder displays the quantiles of transforms of varname according to the ladder of powers against the quantiles of a normal distribution. 1019 1020 ladder — Ladder of powers Options for ladder Main generate(newvar) saves the transformed values corresponding to the minimum chi-squared value from the table. We do not recommend using generate() because it is literal in interpreting the minimum, thus ignoring nearly equal but perhaps more interpretable transforms. noadjust is the noadjust option to sktest; see [R] sktest. Options for gladder histogram options affect the rendition of the histograms across all relevant transformations; see [R] histogram. Here the normal option is assumed, so you must supply the nonormal option to suppress the overlaid normal density. Also, gladder does not allow the width(#) option of histogram. combine options are any of the options documented in [G-2] graph combine. These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Options for qladder qnorm options affect the rendition of the quantile–normal plots across all relevant transformations. See [R] diagnostic plots. combine options are any of the options documented in [G-2] graph combine. These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Remarks and examples Example 1: ladder We have data on the mileage rating of 74 automobiles and wish to find a transform that makes the variable normally distributed: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . ladder mpg Transformation formula chi2(2) P(chi2) cubic square identity square root log 1/(square root) inverse 1/square 1/cubic 0.000 0.000 0.004 0.084 0.647 0.905 0.307 0.002 0.000 mpg^3 mpg^2 mpg sqrt(mpg) log(mpg) 1/sqrt(mpg) 1/mpg 1/(mpg^2) 1/(mpg^3) 43.59 27.03 10.95 4.94 0.87 0.20 2.36 11.99 24.30 √ If we had typed ladder mpg, gen(mpgx), the variable mpgx containing 1/ mpg would have been automatically generated for us. This is the perfect example of why you should not, in general, specify the generate() option. We also cannot reject the hypothesis that the inverse of mpg is normally distributed and that 1/mpg — gallons per mile — has a better interpretation. It is a measure of energy consumption. ladder — Ladder of powers 1021 Example 2: gladder gladder explores the same transforms as ladder but presents results graphically: . 
gladder mpg, fraction square identity 0 20000 40000 60000 80000 .2 .05 0 0 0 .1 .1 .2 .1 .2 .3 .15 .3 .4 .5 .4 .25 cubic 0 500 1000 2000 10 20 30 40 −.2 −.15 −.0002 0 1/sqrt 3 4 5 6 7 .2 0 0 .05 .05 .1 .1 .15 .15 .2 .3 .2 .1 0 Fraction 1500 log .25 sqrt 2.5 3 3.5 4 −.3 −.25 1/cubic .3 .2 .2 .15 −.08 −.06 −.04 −.02 0 .05 0 0 .05 .1 .1 .1 .15 .2 .25 1/square .25 inverse −.008 −.006 −.004 −.002 0 −.0006 −.0004 Mileage (mpg) Histograms by transformation Technical note gladder is useful pedagogically, but be careful when using it for research work, especially with many observations. For instance, consider the following data on the average July temperature in degrees Fahrenheit for 954 U.S. cities: . use http://www.stata-press.com/data/r13/citytemp (City Temperature Data) . ladder tempjuly Transformation formula chi2(2) P(chi2) cubic square identity square root log 1/(square root) inverse 1/square 1/cubic 0.000 0.000 0.147 0.400 0.067 0.001 0.000 0.000 0.000 tempjuly^3 tempjuly^2 tempjuly sqrt(tempjuly) log(tempjuly) 1/sqrt(tempjuly) 1/tempjuly 1/(tempjuly^2) 1/(tempjuly^3) 47.49 19.70 3.83 1.83 5.40 13.72 26.36 64.43 . The period in the last line indicates that the χ2 is very large; see [R] sktest. 1022 ladder — Ladder of powers From the table, we see that there is certainly a difference in normality between the square and square-root transform. If, however, you can see the difference between the transforms in the diagram below, you have better eyes than we do: . gladder tempjuly, l1title("") ylabel(none) xlabel(none) cubic square identity sqrt log 1/sqrt inverse 1/square 1/cubic Average July temperature Histograms by transformation Example 3: qladder A better graph for seeing normality is the quantile–normal graph, which can be produced by qladder. ladder — Ladder of powers 1023 . qladder tempjuly, ylabel(none) xlabel(none) cubic square identity sqrt log 1/sqrt inverse 1/square 1/cubic Average July temperature Quantile−Normal plots by transformation This graph shows that for the square transform, the upper tail—and only the upper tail—diverges from what would be expected. This divergence is detected by sktest (see [R] sktest) as a problem with skewness, as we would learn from using sktest to examine tempjuly squared and square rooted. 1024 ladder — Ladder of powers Stored results ladder stores the following in r(): Scalars r(N) r(invcube) r(P invcube) r(invsq) r(P invsq) r(inv) r(P inv) r(invsqrt) r(P invsqrt) r(log) r(P log) r(sqrt) r(P sqrt) r(ident) r(P ident) r(square) r(P square) r(cube) r(P cube) number of observations χ2 for inverse-cubic transformation significance level for inverse-cubic transformation χ2 for inverse-square transformation significance level for inverse-square transformation χ2 for inverse transformation significance level for inverse transformation χ2 for inverse-root transformation significance level for inverse-root transformation χ2 for log transformation significance level for log transformation χ2 for square-root transformation significance level for square-root transformation χ2 for untransformed data significance level for untransformed data χ2 for square transformation significance level for square transformation χ2 for cubic transformation significance level for cubic transformation Methods and formulas For ladder, results are as reported by sktest; see [R] sktest. If generate() is specified, the transform with the minimum χ2 value is chosen. 
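Because the minimum-χ2 transform is not always the most interpretable one (see example 1), an alternative to generate() is to create the preferred transform explicitly and recheck it with sktest. A minimal sketch following example 1, where 1/mpg was the transform of interest; gallons per 100 miles is simply a rescaling of 1/mpg, and rescaling does not change the skewness and kurtosis on which sktest is based:

. use http://www.stata-press.com/data/r13/auto, clear
. generate gp100m = 100/mpg
. sktest gp100m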
√ gladder sets the number of bins to min( n, 10 log10 n), rounded to the closest integer, where n is the number of unique values of varname. See [R] histogram for a discussion of the optimal number of bins. Also see Findley (1990) for a ladder-of-powers variable transformation program that produces one-way graphs with overlaid box plots, in addition to histograms with overlaid normals. Buchner and Findley (1990) discuss ladder-of-powers transformations as one aspect of preliminary data analysis. Also see Hamilton (1992, 18–23) and Hamilton (2013, 129–132). Acknowledgment qladder was written by Jeroen Weesie of the Department of Sociology at Utrecht University, The Netherlands. References Buchner, D. M., and T. W. Findley. 1990. Research in physical medicine and rehabilitation: VIII. Preliminary data analysis. American Journal of Physical Medicine and Rehabilitation 69: 154–169. Cox, N. J. 2005. Speaking Stata: Density probability plots. Stata Journal 5: 259–273. Findley, T. W. 1990. sed3: Variable transformation and evaluation. Stata Technical Bulletin 2: 15. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 85–86. College Station, TX: Stata Press. Hamilton, L. C. 1992. Regression with Graphics: A Second Course in Applied Statistics. Belmont, CA: Duxbury. . 2013. Statistics with Stata: Updated for Version 12. 8th ed. Boston: Brooks/Cole. Tukey, J. W. 1977. Exploratory Data Analysis. Reading, MA: Addison–Wesley. ladder — Ladder of powers Also see [R] diagnostic plots — Distributional diagnostic plots [R] lnskew0 — Find zero-skewness log or Box – Cox transform [R] lv — Letter-value displays [R] sktest — Skewness and kurtosis test for normality 1025 Title level — Set default confidence level Syntax Description Option Remarks and examples Also see Syntax set level # , permanently Description set level specifies the default confidence level for confidence intervals for all commands that report confidence intervals. The initial value is 95, meaning 95% confidence intervals. # may be between 10.00 and 99.99, and # can have at most two digits after the decimal point. Option permanently specifies that, in addition to making the change right now, the level setting be remembered and become the default setting when you invoke Stata. Remarks and examples To change the level of confidence intervals reported by a particular command, you need not reset the default confidence level. All commands that report confidence intervals have a level(#) option. When you do not specify the option, the confidence intervals are calculated for the default level set by set level, or for 95% if you have not reset set level. Example 1 We use the ci command to obtain the confidence interval for the mean of mpg: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . ci mpg Obs Mean Std. Err. Variable mpg 74 21.2973 .6725511 [95% Conf. Interval] 19.9569 22.63769 To obtain 90% confidence intervals, we would type . ci mpg, level(90) Variable mpg Obs Mean 74 21.2973 Obs Mean 74 21.2973 Std. Err. [90% Conf. Interval] .6725511 20.17683 Std. Err. [90% Conf. Interval] .6725511 20.17683 22.41776 or . set level 90 . ci mpg Variable mpg 1026 22.41776 level — Set default confidence level 1027 If we opt for the second alternative, the next time that we fit a model (say, with regress), 90% confidence intervals will be reported. If we wanted 95% confidence intervals, we could specify level(95) on the estimation command, or we could reset the default by typing set level 95. 
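When the change is needed for only a few commands, the previous default can be saved and restored afterward. A minimal sketch (c(level), discussed next, holds the current setting):

. local oldlevel = c(level)
. set level 90
. ci mpg
. set level `oldlevel'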
The current setting of level() is stored as the c-class value c(level); see [P] creturn. Also see [R] query — Display system parameters [P] creturn — Return c-class values [U] 20 Estimation and postestimation commands [U] 20.7 Specifying the width of confidence intervals Title limits — Quick reference for limits Description Remarks and examples Also see Description This entry provides a quick reference for the size limits in Stata. Note that most of these limits are so high that you will never encounter them. Remarks and examples Remarks are presented under the following headings: Maximum size limits Matrix size Determining which flavor of Stata you are running Maximum size limits Small Stata/IC Stata/MP and Stata/SE 1,200 99 800 2,147,483,647 2,047 24,564 2,147,483,647 32,767 393,192 100 98 800 798 11,000 10,998 # characters in a command # options for a command 13,416 70 165,216 70 1,081,527 70 # of elements in a numlist 2,500 2,500 2,500 8 8 8 8 8 8 100 100 100 8 8 8 66 50 256 2,000,000,000 5 249 800 300 512 2,000,000,000 5 249 800 300 512 2,000,000,000 5 249 13,400 165,200 1,081,511 64 64 64 # of observations (1) # of variables width of a dataset in bytes value of matsize # of RHS variables # of interacted continuous variables # of interacted factor variables # of unique time-series operators in a command # seasonal suboperators per time-series operator # of dyadic operators in an expression # of numeric literals in an expression # of string literals in an expression length of string in string expression # of sum functions in an expression # of pairs of nested parentheses # of characters in a macro (2) # of nested do-files 1028 limits — Quick reference for limits Small Stata/IC Stata/MP and Stata/SE 3,500 135,600 3,500 135,600 3,500 135,600 32 32 32 31 32 32 32 31 32 32 32 31 2,045 2,000,000,000 2,045 2,000,000,000 2,045 2,000,000,000 continued # of lines in a program # of characters in a program length length length length of of of of a variable name ado-command name a global macro name a local macro name length of a str# variable length of a strL variable 1029 anova # of variables in one anova term # of terms in the repeated() option 8 4 8 4 8 4 13,400 67,784 67,784 constraint # of constraints 1,999 1,999 1,999 encode and decode # of unique values 1,000 65,536 65,536 estimates hold # of stored estimation results 300 300 300 estimates store # of stored estimation results 300 300 300 exlogistic and expoisson maximum memory specification in memory(#) 2gb 2gb 2gb grmeanby # of unique values in varlist N/2 N/2 N/2 graph twoway # of variables in a plot # of styles in an option’s stylelist 100 20 100 20 100 20 none none none infile (fixed format) record length with a dictionary 524,275 524,275 524,275 infix (fixed format) record length with a dictionary 524,275 524,275 524,275 char length of one characteristic infile (free format) record length without dictionary 1030 limits — Quick reference for limits continued label length of dataset label length of variable label length of value label string length of name of value label # of codings within one value label Small Stata/IC Stata/MP and Stata/SE 80 80 32,000 32 1,000 80 80 32,000 32 65,536 80 80 32,000 32 65,536 100 100 100 20 20 20 8 8 8 40 × 40 800 × 800 11,000 × 11,000 16,000 16,000 16,000 30 30 30 100 100 100 8 8 8 13,400 9,999 9,999 67,784 9,999 9,999 67,784 9,999 9,999 2,500 2,500 2,500 40 800 11,000 set adosize memory ado-files may consume 1000K 1000K 1000K set scrollbufsize memory for Results window buffer 2000K 2000K 2000k label 
language # of different languages macro # of nested macros manova # of variables in single manova term matrix (3) dimension of single matrix maximize options iterate() maximum mprobit # of categories in a depvar net # of description lines in .pkg file nlogit and nlogittree # of levels in model notes length of one note # of notes attached to dta # of notes attached to each variable numlist # of elements in the numeric list reg3, sureg, and other system estimators # of equations limits — Quick reference for limits continued slogit # of categories in a depvar Small Stata/MP and Stata/SE 30 30 80 1,000 80 1,000 5 5 5 10 10 10 4 4 4 3,000 3,000 3,000 500 3,000 12,000 160 × 20 300 × 20 1,200 × 80 375 375 375 20 20 20 # of time periods within panel 40 800 11,000 # of integration points accepted by intpoints(#) 195 195 195 snapspan length of label # of saved snapshots stcox # of variables in strata() option stcurve # of curves plotted on the same graph table and tabdisp # of by variables # of margins, i.e., sum of rows, columns, supercolumns, and by groups tabulate oneway # of rows in one-way table tabulate twoway # of rows & cols in two-way table tabulate, summarize() # of cells (rows X cols) teffects # of treatments 30 Stata/IC 1031 80 1,000 xt estimation commands (e.g., xtgee, xtgls, xtpoisson, xtprobit, xtreg with mle option, and xtpcse when neither option hetonly nor option independent is specified) (1) 2,147,483,647 is a theoretical maximum; memory availability will certainly impose a smaller maximum. (2) The maximum length of the contents of a macro are fixed in Stata/IC and settable in Stata/SE and Stata/MP. The currently set maximum length is recorded in c(macrolen); type display c(macrolen). The maximum length can be changed with set maxvar. If you set maxvar to a larger value, the maximum length increases; if you set maxvar to a smaller value, the maximum length decreases. The relationship between them is maximum length = 33 × maxvar + 200. (3) In Mata, matrices are limited only by the amount of memory on your computer. 1032 limits — Quick reference for limits Matrix size See [R] matsize. Determining which flavor of Stata you are running Type . about The response will be Stata/MP, Stata/SE, Stata/IC, or Small Stata. Other information is also shown, including your serial number. See [R] about. Also see [R] about — Display information about your Stata [R] matsize — Set the maximum number of variables in a model [D] compress — Compress data in memory [D] data types — Quick reference for data types [D] import — Overview of importing data into Stata [D] infile (fixed format) — Read text data in fixed format with a dictionary [D] infile (free format) — Read unformatted text data [D] memory — Memory management [D] obs — Increase the number of observations in a dataset Title lincom — Linear combinations of estimators Syntax Remarks and examples Menu Stored results Description References Options Also see Syntax lincom exp , options options Description eform or hr shr irr rrr level(#) display options generic label; exp(b) odds ratio hazard ratio subhazard ratio incidence-rate ratio relative-risk ratio set confidence level; default is level(95) control column formats df(#) use t distribution with # degrees of freedom for computing p-values and confidence intervals exp is any linear combination of coefficients that is a valid syntax for test; see [R] test. exp must not contain an equal sign. df(#) does not appear in the dialog box. 
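To fix ideas before the formal details below, here is a minimal sketch of the command after an ordinary linear regression (any estimation command whose coefficients test understands would serve equally well; nonlinear combinations such as ratios require nlcom instead):

. use http://www.stata-press.com/data/r13/auto, clear
. regress price weight length
. lincom weight + length
. lincom 2*weight - length, level(90)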
Menu Statistics > Postestimation > Linear combinations of estimates Description lincom computes point estimates, standard errors, t or z statistics, p-values, and confidence intervals for linear combinations of coefficients after any estimation command. Results can optionally be displayed as odds ratios, hazard ratios, incidence-rate ratios, or relative-risk ratios. lincom can be used with svy estimation results; see [SVY] svy postestimation. Options b rather than βb. Standard eform, or, hr, shr, irr, and rrr all report coefficient estimates as exp(β) errors and confidence intervals are similarly transformed. or is the default after logistic. The only difference in these options is how the output is labeled. 1033 1034 lincom — Linear combinations of estimators Option Label Explanation Example commands eform or hr shr irr rrr exp(b) Odds Ratio Haz. Ratio SHR IRR RRR Generic label Odds ratio Hazard ratio Subhazard ratio Incidence-rate ratio Relative-risk ratio cloglog logistic, logit stcox, streg stcrreg poisson mlogit exp may not contain any additive constants when you use the eform, or, hr, irr, or rrr option. level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. display options: cformat(% fmt), pformat(% fmt), and sformat(% fmt); see [R] estimation options. The following option is available with lincom but is not shown in the dialog box: df(#) specifies that the t distribution with # degrees of freedom be used for computing p-values and confidence intervals. The default is to use e(df r) degrees of freedom or the standard normal distribution if e(df r) is missing. Remarks and examples Remarks are presented under the following headings: Using lincom Odds ratios and incidence-rate ratios Multiple-equation models Using lincom After fitting a model and obtaining estimates for coefficients β1 , β2 , . . . , βk , you may want to view estimates for linear combinations of the βi , such as β1 − β2 . lincom can display estimates for any linear combination of the form c0 + c1 β1 + c2 β2 + · · · + ck βk . lincom works after any estimation command for which test works. Any valid expression for test syntax 1 (see [R] test) is a valid expression for lincom. lincom is useful for viewing odds ratios, hazard ratios, etc., for one group (that is, one set of covariates) relative to another group (that is, another set of covariates). See the examples below. lincom — Linear combinations of estimators 1035 Example 1 We perform a linear regression: . use http://www.stata-press.com/data/r13/regress . regress y x1 x2 x3 Source SS df MS Model Residual 3259.3561 1627.56282 3 144 1086.45203 11.3025196 Total 4886.91892 147 33.2443464 y Coef. x1 x2 x3 _cons 1.457113 2.221682 -.006139 36.10135 Std. Err. 1.07461 .8610358 .0005543 4.382693 t 1.36 2.58 -11.08 8.24 P>|t| 0.177 0.011 0.000 0.000 Number of obs F( 3, 144) Prob > F R-squared Adj R-squared Root MSE = = = = = = 148 96.12 0.0000 0.6670 0.6600 3.3619 [95% Conf. Interval] -.666934 .5197797 -.0072345 27.43863 3.581161 3.923583 -.0050435 44.76407 To see the difference of the coefficients of x2 and x1, we type . lincom x2 - x1 ( 1) - x1 + x2 = 0 y Coef. (1) .7645682 Std. Err. t P>|t| [95% Conf. Interval] .9950282 0.77 0.444 -1.20218 Std. Err. t P>|t| [95% Conf. Interval] 3.396624 0.38 0.702 2.731316 The expression can be any linear combination. . lincom 3*x1 + 500*x3 ( 1) 3*x1 + 500*x3 = 0 y Coef. 
(1) 1.301825 -5.411858 8.015507 Nonlinear expressions are not allowed. . lincom x2/x1 not possible with test r(131); For information about estimating nonlinear expressions, see [R] nlcom. Technical note lincom uses the same shorthands for coefficients as does test (see [R] test). When you type x1, for instance, lincom knows that you mean the coefficient of x1. The formal syntax for referencing this coefficient is actually b[x1], or alternatively, coef[x1]. So, more formally, in the last example we could have typed . lincom 3*_b[x1] + 500*_b[x3] (output omitted ) 1036 lincom — Linear combinations of estimators Odds ratios and incidence-rate ratios After logistic regression, the or option can be specified with lincom to display odds ratios for any effect. Incidence-rate ratios after commands such as poisson can be similarly obtained by specifying the irr option. Example 2 Consider the low birthweight dataset from Hosmer, Lemeshow, and Sturdivant (2013, 24). We fit a logistic regression model of low birthweight (variable low) on the following variables: Variable Description age race smoke ht ui lwd ptd c.age##lwd age in years race smoking status history of hypertension uterine irritability maternal weight before pregnancy history of premature labor age main effects, lwd main effects, and their interaction smoke main effects, lwd main effects, and their interaction smoke##lwd Coding 1 1 1 1 1 1 if if if if if if white, 2 if black, 3 if other smoker, 0 if nonsmoker yes, 0 if no yes, 0 if no weight < 110 lb., 0 otherwise yes, 0 if no We first fit a model without the interaction terms by using logit. . use http://www.stata-press.com/data/r13/lbw3 (Hosmer & Lemeshow data) . logit low age lwd i.race smoke ptd ht ui Iteration 0: log likelihood = -117.336 Iteration 1: log likelihood = -99.3982 Iteration 2: log likelihood = -98.780418 Iteration 3: log likelihood = -98.777998 Iteration 4: log likelihood = -98.777998 Logistic regression Log likelihood = -98.777998 Std. Err. z Number of obs LR chi2(8) Prob > chi2 Pseudo R2 P>|z| = = = = 189 37.12 0.0000 0.1582 low Coef. [95% Conf. Interval] age lwd -.0464796 .8420615 .0373888 .4055338 -1.24 2.08 0.214 0.038 -.1197603 .0472299 .0268011 1.636893 race black other 1.073456 .815367 .5150753 .4452979 2.08 1.83 0.037 0.067 .0639273 -.0574008 2.082985 1.688135 smoke ptd ht ui _cons .8071996 1.281678 1.435227 .6576256 -1.216781 .404446 .4621157 .6482699 .4666192 .9556797 2.00 2.77 2.21 1.41 -1.27 0.046 0.006 0.027 0.159 0.203 .0145001 .3759478 .1646414 -.2569313 -3.089878 1.599899 2.187408 2.705813 1.572182 .656317 lincom — Linear combinations of estimators 1037 To get the odds ratio for black smokers relative to white nonsmokers (the reference group), we type . lincom 2.race + smoke, or ( 1) [low]2.race + [low]smoke = 0 low Odds Ratio (1) 6.557805 Std. Err. z P>|z| [95% Conf. Interval] 4.744692 2.60 0.009 1.588176 27.07811 lincom computed exp(β2.race + βsmoke ) = 6.56. To see the odds ratio for white smokers relative to black nonsmokers, we type . lincom smoke - 2.race, or ( 1) - [low]2.race + [low]smoke = 0 low Odds Ratio (1) .7662425 Std. Err. .4430176 z -0.46 P>|z| [95% Conf. Interval] 0.645 .2467334 2.379603 Now let’s add the interaction terms to the model (Hosmer and Lemeshow 1989, table 4.10). This time, we will use logistic rather than logit. By default, logistic displays odds ratios. . 
logistic low i.race ht ui ptd c.age##lwd smoke##lwd Logistic regression Log likelihood = Number of obs LR chi2(10) Prob > chi2 Pseudo R2 -96.00616 low Odds Ratio race black other = = = = 189 42.66 0.0000 0.1818 Std. Err. z P>|z| [95% Conf. Interval] 2.95383 2.137589 1.532789 .9919138 2.09 1.64 0.037 0.102 1.068277 .8608708 8.167465 5.307752 ht ui ptd age 1.lwd 3.893141 2.071284 3.426633 .9194513 .1772934 2.575201 .9931388 1.615282 .041896 .3312384 2.05 1.52 2.61 -1.84 -0.93 0.040 0.129 0.009 0.065 0.354 1.064768 .8092926 1.360252 .8408967 .0045539 14.2346 5.301192 8.632089 1.005344 6.902367 lwd#c.age 1 1.15883 .09602 1.78 0.075 .9851215 1.36317 smoke smoker 3.168096 1.452378 2.52 0.012 1.289956 7.78076 smoke#lwd smoker 1 .2447849 .2003996 -1.72 0.086 .0491956 1.217988 _cons .599443 .6519163 -0.47 0.638 .0711271 5.051971 Hosmer and Lemeshow (1989, table 4.13) consider the effects of smoking (smoke = 1) and low maternal weight before pregnancy (lwd = 1). The effect of smoking among non–low-weight mothers (lwd = 0) is given by the odds ratio 3.17 for smoke in the logistic output. The effect of smoking among low-weight mothers is given by 1038 lincom — Linear combinations of estimators . lincom 1.smoke + 1.smoke#1.lwd ( 1) [low]1.smoke + [low]1.smoke#1.lwd = 0 low Odds Ratio (1) .7755022 Std. Err. .574951 z -0.34 P>|z| [95% Conf. Interval] 0.732 .1813465 3.316323 We did not have to specify the or option. After logistic, lincom assumes or by default. The effect of low weight (lwd = 1) is more complicated because we fit an age × lwd interaction. We must specify the age of mothers for the effect. The effect among 30-year-old nonsmokers is given by . lincom 1.lwd + 30*1.lwd#c.age ( 1) [low]1.lwd + 30*[low]1.lwd#c.age = 0 low Odds Ratio (1) 14.7669 Std. Err. z P>|z| [95% Conf. Interval] 13.5669 2.93 0.003 2.439264 89.39633 lincom computed exp(βlwd + 30βagelwd ) = 14.8. It may seem odd that we entered it as 1.lwd + 30*1.lwd#c.age, but remember that these terms are just lincom’s (and test’s) shorthands for b[1.lwd] and b[1.lwd#c.age]. We could have typed . lincom _b[1.lwd] + 30*_b[1.lwd#c.age] ( 1) [low]1.lwd + 30*[low]1.lwd#c.age = 0 low Odds Ratio (1) 14.7669 Std. Err. z P>|z| [95% Conf. Interval] 13.5669 2.93 0.003 2.439264 89.39633 Multiple-equation models lincom also works with multiple-equation models. The only difference is how you refer to the coefficients. Recall that for multiple-equation models, coefficients are referenced using the syntax [eqno]varname where eqno is the equation number or equation name and varname is the corresponding variable name for the coefficient; see [U] 13.5 Accessing coefficients and standard errors and [R] test for details. lincom — Linear combinations of estimators 1039 Example 3 Let’s consider example 4 from [R] mlogit (Tarlov et al. 1989; Wells et al. 1989). . use http://www.stata-press.com/data/r13/sysdsn1 (Health insurance data) . mlogit insure age male nonwhite i.site, nolog Multinomial logistic regression Number of obs LR chi2(10) Prob > chi2 Log likelihood = -534.36165 Pseudo R2 insure Indemnity Coef. Std. Err. z P>|z| = = = = 615 42.99 0.0000 0.0387 [95% Conf. 
Interval] (base outcome) Prepaid age male nonwhite -.011745 .5616934 .9747768 .0061946 .2027465 .2363213 -1.90 2.77 4.12 0.058 0.006 0.000 -.0238862 .1643175 .5115955 .0003962 .9590693 1.437958 site 2 3 .1130359 -.5879879 .2101903 .2279351 0.54 -2.58 0.591 0.010 -.2989296 -1.034733 .5250013 -.1412433 _cons .2697127 .3284422 0.82 0.412 -.3740222 .9134476 age male nonwhite -.0077961 .4518496 .2170589 .0114418 .3674867 .4256361 -0.68 1.23 0.51 0.496 0.219 0.610 -.0302217 -.268411 -.6171725 .0146294 1.17211 1.05129 site 2 3 -1.211563 -.2078123 .4705127 .3662926 -2.57 -0.57 0.010 0.570 -2.133751 -.9257327 -.2893747 .510108 _cons -1.286943 .5923219 -2.17 0.030 -2.447872 -.1260134 Uninsure To see the estimate of the sum of the coefficient of male and the coefficient of nonwhite for the Prepaid outcome, we type 1040 lincom — Linear combinations of estimators . lincom [Prepaid]male + [Prepaid]nonwhite ( 1) [Prepaid]male + [Prepaid]nonwhite = 0 insure Coef. (1) 1.53647 Std. Err. z P>|z| [95% Conf. Interval] .3272489 4.70 0.000 .8950741 2.177866 To view the estimate as a ratio of relative risks (see [R] mlogit for the definition and interpretation), we specify the rrr option. . lincom [Prepaid]male + [Prepaid]nonwhite, rrr ( 1) [Prepaid]male + [Prepaid]nonwhite = 0 insure RRR (1) 4.648154 Std. Err. z P>|z| [95% Conf. Interval] 1.521103 4.70 0.000 2.447517 8.827451 Stored results lincom stores the following in r(): Scalars r(estimate) r(se) r(df) point estimate estimate of standard error degrees of freedom References Hosmer, D. W., Jr., and S. A. Lemeshow. 1989. Applied Logistic Regression. New York: Wiley. Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley. Tarlov, A. R., J. E. Ware, Jr., S. Greenfield, E. C. Nelson, E. Perrin, and M. Zubkoff. 1989. The medical outcomes study. An application of methods for monitoring the results of medical care. Journal of the American Medical Association 262: 925–930. Wells, K. B., R. D. Hays, M. A. Burnam, W. H. Rogers, S. Greenfield, and J. E. Ware, Jr. 1989. Detection of depressive disorder for patients receiving prepaid or fee-for-service care. Results from the Medical Outcomes Survey. Journal of the American Medical Association 262: 3298–3302. Also see [R] nlcom — Nonlinear combinations of estimators [R] test — Test linear hypotheses after estimation [R] testnl — Test nonlinear hypotheses after estimation [U] 13.5 Accessing coefficients and standard errors [U] 20 Estimation and postestimation commands Title linktest — Specification link test for single-equation models Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Option References Syntax linktest if in , cmd options When if and in are not specified, the link test is performed on the same sample as the previous estimation. Menu Statistics > Postestimation > Tests > Specification link test for single-equation models Description linktest performs a link test for model specification after any single-equation estimation command, such as logistic, regress, stcox, etc. Option Main cmd options must be the same options specified with the underlying estimation command, except the display options may differ. Remarks and examples The form of the link test implemented here is based on an idea of Tukey (1949), which was further described by Pregibon (1980), elaborating on work in his unpublished thesis (Pregibon 1979). See Methods and formulas below for more details. 
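One practical implication of the cmd_options syntax above is that whatever options mattered for the original estimation must be repeated with linktest. A hedged sketch (the censoring point 17 is made up for illustration, and the auto dataset is assumed; output omitted):

. use http://www.stata-press.com/data/r13/auto, clear
. tobit mpg weight displacement, ll(17)
. linktest, ll(17)

The technical note following the examples below returns to this point: options such as logit's nolog affect only the displayed output, whereas options such as tobit's censoring points change what is estimated and so must be passed along.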
Example 1 We want to explain the mileage ratings of cars in our automobile dataset by using the weight, engine displacement, and whether the car is manufactured outside the United States: 1041 1042 linktest — Specification link test for single-equation models . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress mpg weight displ foreign SS df MS Source Model Residual 1619.71935 823.740114 3 70 539.906448 11.7677159 Total 2443.45946 73 33.4720474 mpg Coef. weight displacement foreign _cons -.0067745 .0019286 -1.600631 41.84795 Std. Err. .0011665 .0100701 1.113648 2.350704 t -5.81 0.19 -1.44 17.80 Number of obs F( 3, 70) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.849 0.155 0.000 = = = = = = 74 45.88 0.0000 0.6629 0.6484 3.4304 [95% Conf. Interval] -.0091011 -.0181556 -3.821732 37.15962 -.0044479 .0220129 .6204699 46.53628 On the basis of the R2 , we are reasonably pleased with this model. If our model really is specified correctly, then if we were to regress mpg on the prediction and the prediction squared, the prediction squared would have no explanatory power. This is what linktest does: . linktest Source SS df MS Model Residual 1670.71514 772.744316 2 71 835.357572 10.8837228 Total 2443.45946 73 33.4720474 mpg Coef. _hat _hatsq _cons -.4127198 .0338198 14.00705 Std. Err. .6577736 .015624 6.713276 t -0.63 2.16 2.09 Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.532 0.034 0.041 = = = = = = 74 76.75 0.0000 0.6837 0.6748 3.299 [95% Conf. Interval] -1.724283 .0026664 .6211539 .8988434 .0649732 27.39294 We find that the prediction squared does have explanatory power, so our specification is not as good as we thought. Although linktest is formally a test of the specification of the dependent variable, it is often interpreted as a test that, conditional on the specification, the independent variables are specified incorrectly. We will follow that interpretation and now include weight squared in our model: linktest — Specification link test for single-equation models . regress mpg weight c.weight#c.weight displ foreign SS df MS Source Model Residual 1699.02634 744.433124 4 69 424.756584 10.7888859 Total 2443.45946 73 33.4720474 t P>|t| = = = = = = 74 39.37 0.0000 0.6953 0.6777 3.2846 mpg Coef. weight -.0173257 .0040488 -4.28 0.000 -.0254028 -.0092486 c.weight# c.weight 1.87e-06 6.89e-07 2.71 0.008 4.93e-07 3.24e-06 -.0101625 -2.560016 58.23575 .0106236 1.123506 6.449882 -0.96 -2.28 9.03 0.342 0.026 0.000 -.031356 -4.801349 45.36859 .011031 -.3186832 71.10291 displacement foreign _cons Std. Err. Number of obs F( 4, 69) Prob > F R-squared Adj R-squared Root MSE 1043 [95% Conf. Interval] Now we perform the link test on our new model: . linktest Source SS df MS Model Residual 1699.39489 744.06457 2 71 849.697445 10.4797827 Total 2443.45946 73 33.4720474 mpg Coef. _hat _hatsq _cons 1.141987 -.0031916 -1.50305 Std. Err. .7612218 .0170194 8.196444 t 1.50 -0.19 -0.18 Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.138 0.852 0.855 = = = = = = 74 81.08 0.0000 0.6955 0.6869 3.2372 [95% Conf. Interval] -.3758456 -.0371272 -17.84629 2.659821 .0307441 14.84019 We now pass the link test. Example 2 Above we followed a standard misinterpretation of the link test — when we discovered a problem, we focused on the explanatory variables of our model. We might consider varying exactly what the link test tests. The link test told us that our dependent variable was misspecified. 
For those with an engineering background, mpg is indeed a strange measure. It would make more sense to model energy consumption — gallons per mile — in terms of weight and displacement: 1044 linktest — Specification link test for single-equation models . gen gpm = 1/mpg . regress gpm weight displ foreign SS Source df MS Model Residual .009157962 .002799666 3 70 .003052654 .000039995 Total .011957628 73 .000163803 gpm Coef. weight displacement foreign _cons .0000144 .0000186 .0066981 .0008917 Number of obs F( 3, 70) Prob > F R-squared Adj R-squared Root MSE Std. Err. t P>|t| 2.15e-06 .0000186 .0020531 .0043337 6.72 1.00 3.26 0.21 0.000 0.319 0.002 0.838 = = = = = = 74 76.33 0.0000 0.7659 0.7558 .00632 [95% Conf. Interval] .0000102 -.0000184 .0026034 -.0077515 .0000187 .0000557 .0107928 .009535 This model looks every bit as reasonable as our original model: . linktest Source SS df MS Model Residual .009175219 .002782409 2 71 .004587609 .000039189 Total .011957628 73 .000163803 gpm Coef. _hat _hatsq _cons .6608413 3.275857 .008365 Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE Std. Err. t P>|t| .515275 4.936655 .0130468 1.28 0.66 0.64 0.204 0.509 0.523 = = = = = = 74 117.06 0.0000 0.7673 0.7608 .00626 [95% Conf. Interval] -.3665877 -6.567553 -.0176496 1.68827 13.11927 .0343795 Specifying the model in terms of gallons per mile also solves the specification problem and results in a more parsimonious specification. Example 3 The link test can be used with any single-equation estimation procedure, not solely regression. Let’s turn our problem around and attempt to explain whether a car is manufactured outside the United States by its mileage rating and weight. To save paper, we will specify logit’s nolog option, which suppresses the iteration log: . logit foreign mpg weight, nolog Logistic regression Number of obs LR chi2(2) Prob > chi2 Pseudo R2 Log likelihood = -27.175156 foreign Coef. mpg weight _cons -.1685869 -.0039067 13.70837 Std. Err. .0919175 .0010116 4.518709 z -1.83 -3.86 3.03 P>|z| 0.067 0.000 0.002 = = = = 74 35.72 0.0000 0.3966 [95% Conf. Interval] -.3487418 -.0058894 4.851859 .011568 -.001924 22.56487 linktest — Specification link test for single-equation models 1045 When we run linktest after logit, the result is another logit specification: . linktest, nolog Logistic regression Number of obs LR chi2(2) Prob > chi2 Pseudo R2 Log likelihood = -26.615714 foreign Coef. _hat _hatsq _cons .8438531 -.1559115 .2630557 Std. Err. .2738759 .1568642 .4299598 z 3.08 -0.99 0.61 P>|z| 0.002 0.320 0.541 = = = = 74 36.83 0.0000 0.4090 [95% Conf. Interval] .3070661 -.4633596 -.57965 1.38064 .1515366 1.105761 The link test reveals no problems with our specification. If there had been a problem, we would have been virtually forced to accept the misinterpretation of the link test — we would have reconsidered our specification of the independent variables. When using logit, we have no control over the specification of the dependent variable other than to change likelihood functions. We admit to having seen a dataset once for which the link test rejected the logit specification. We did change the likelihood function, refitting the model using probit, and satisfied the link test. Probit has thinner tails than logit. In general, however, you will not be so lucky. Technical note You should specify the same options with linktest that you do with the estimation command, although you do not have to follow this advice as literally as we did in the preceding example. 
logit’s nolog option merely suppresses a part of the output, not what is estimated. We specified nolog both times to save space. If you are testing a tobit model, you must specify the censoring points just as you do with the tobit command. If you are not sure which options are important, duplicate exactly what you specified on the estimation command.

If you do not specify if exp or in range with linktest, Stata will by default perform the link test on the same sample as the previous estimation. Suppose that you omitted some data when performing your estimation but want to calculate the link test on all the data, which you might do if you believe the model is appropriate for all the data. You would type linktest if e(sample) < . to do this.

Stored results

linktest stores the following in r():

Scalars
    r(t)     t statistic on hatsq
    r(df)    degrees of freedom

linktest is not an estimation command in the sense that it leaves previous estimation results unchanged. For instance, after running a regression and performing the link test, typing regress without arguments after the link test still replays the original regression.

For integrating an estimation command with linktest, linktest assumes that the name of the estimation command is stored in e(cmd) and that the name of the dependent variable is stored in e(depvar). After estimation, it assumes that the number of degrees of freedom for the t test is given by e(df_m) if the macro is defined. If the estimation command reports z statistics instead of t statistics, linktest will also report z statistics. The z statistic, however, is still returned in r(t), and r(df) is set to a missing value.

Methods and formulas

The link test is based on the idea that if a regression or regression-like equation is properly specified, you should be able to find no additional independent variables that are significant except by chance. One kind of specification error is called a link error. In regression, this means that the dependent variable needs a transformation or “link” function to properly relate to the independent variables. The idea of a link test is to add an independent variable to the equation that is especially likely to be significant if there is a link error.

Let y = f(Xβ) be the model and β̂ be the parameter estimates. linktest calculates

    hat = Xβ̂   and   hatsq = hat^2

The model is then refit with these two variables, and the test is based on the significance of hatsq. This is the form suggested by Pregibon (1979) based on an idea of Tukey (1949). Pregibon (1980) suggests a slightly different method that has come to be known as “Pregibon’s goodness-of-link test”. We prefer the older version because it is universally applicable, straightforward, and a good second-order approximation. It can be applied to any single-equation estimation technique, whereas Pregibon’s more recent tests are estimation-technique specific.

References

Pregibon, D. 1979. Data analytic methods for generalized linear models. PhD diss., University of Toronto.
———. 1980. Goodness of link tests for generalized linear models. Applied Statistics 29: 15–24.
Tukey, J. W. 1949. One degree of freedom for non-additivity. Biometrics 5: 232–242.
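As a concrete illustration of the computation just described, the link test can be reproduced by hand. This is a hedged sketch using the auto dataset and the model from Example 1; hat and hatsq are created here under ordinary names, whereas linktest itself manages the variables _hat and _hatsq internally.

. use http://www.stata-press.com/data/r13/auto, clear
. quietly regress mpg weight displacement foreign
. predict double hat, xb            // the linear prediction, Xb
. generate double hatsq = hat^2     // its square
. regress mpg hat hatsq             // refit; the test is the significance of hatsq

The coefficient and t statistic on hatsq from the last regression should essentially reproduce what linktest reported in Example 1.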
Also see [R] regress postestimation — Postestimation tools for regress Title lnskew0 — Find zero-skewness log or Box – Cox transform Syntax Remarks and examples Reference Menu Stored results Also see Description Methods and formulas Options Acknowledgment Syntax Zero-skewness log transform lnskew0 newvar = exp if in , options Zero-skewness Box–Cox transform bcskew0 newvar = exp if in , options Description options Main delta(#) zero(#) level(#) increment for derivative of skewness function; default is delta(0.02) for lnskew0 and delta(0.01) for bcskew0 value for determining convergence; default is zero(0.001) set confidence level; default is level(95) Menu lnskew0 Data > Create or change data > Other variable-creation commands > Zero-skewness log transform > Other variable-creation commands > Box-Cox transform bcskew0 Data > Create or change data Description lnskew0 creates newvar = ln(±exp − k), choosing k and the sign of exp so that the skewness of newvar is zero. bcskew0 creates newvar = (expλ −1)/λ, the Box – Cox power transformation (Box and Cox 1964), choosing λ so that the skewness of newvar is zero. exp must be strictly positive. Also see [R] boxcox for maximum likelihood estimation of λ. Options Main delta(#) specifies the increment used for calculating the derivative of the skewness function with respect to k (lnskew0) or λ (bcskew0). The default values are 0.02 for lnskew0 and 0.01 for bcskew0. 1047 1048 lnskew0 — Find zero-skewness log or Box – Cox transform zero(#) specifies a value for skewness to determine convergence that is small enough to be considered zero and is, by default, 0.001. level(#) specifies the confidence level for the confidence interval for k (lnskew0) or λ (bcskew0). The confidence interval is calculated only if level() is specified. # is specified as an integer; 95 means 95% confidence intervals. The level() option is honored only if the number of observations exceeds 7. Remarks and examples Example 1: lnskew0 Using our automobile dataset (see [U] 1.2.2 Example datasets), we want to generate a new variable equal to ln(mpg − k) to be approximately normally distributed. mpg records the miles per gallon for each of our cars. One feature of the normal distribution is that it has skewness 0. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . lnskew0 lnmpg = mpg Transform k [95% Conf. Interval] ln(mpg-k) 5.383659 (not calculated) Skewness -7.05e-06 This created the new variable lnmpg = ln(mpg − 5.384): . describe lnmpg storage variable name type display format lnmpg %9.0g float value label variable label ln(mpg-5.383659) Because we did not specify the level() option, no confidence interval was calculated. At the outset, we could have typed . use http://www.stata-press.com/data/r13/auto, clear (Automobile Data) . lnskew0 lnmpg = mpg, level(95) Transform k [95% Conf. Interval] ln(mpg-k) 5.383659 -17.12339 9.892416 Skewness -7.05e-06 The confidence interval is calculated under the assumption that ln(mpg − k) really does have a normal distribution. It would be perfectly reasonable to use lnskew0, even if we did not believe that the transformed variable would have a normal distribution — if we literally wanted the zero-skewness transform — although, then the confidence interval would be an approximation of unknown quality to the true confidence interval. If we now wanted to test the believability of the confidence interval, we could also test our new variable lnmpg by using swilk (see [R] swilk) with the lnnormal option. 
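Following that suggestion, the check is a one-line command; this brief sketch continues Example 1 and assumes lnmpg is still in memory (output omitted):

. swilk lnmpg, lnnormal

See [R] swilk for exactly how the lnnormal option modifies the test and how to interpret the result.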
Technical note

lnskew0 and bcskew0 report the resulting skewness of the variable merely to reassure you of the accuracy of its results. In our example above, lnskew0 found k such that the resulting skewness was −7 × 10^−6, a number close enough to zero for all practical purposes. If we wanted to make it even smaller, we could specify the zero() option. Typing lnskew0 new=mpg, zero(1e-8) changes the estimated k to 5.383552 from 5.383659 and reduces the calculated skewness to −2 × 10^−11.

When you request a confidence interval, lnskew0 may report the lower confidence interval as ‘.’, which should be taken as indicating the lower confidence limit k_L = −∞. (This cannot happen with bcskew0.) As an example, consider a sample of size n on x and assume that the skewness of x is positive, but not significantly so, at the desired significance level—say, 5%. Then no matter how large and negative you make k_L, there is no value extreme enough to make the skewness of ln(x − k_L) equal the corresponding percentile (97.5 for a 95% confidence interval) of the distribution of skewness in a normal distribution of the same sample size. You cannot do this because the distribution of ln(x − k_L) tends to that of x—apart from location and scale shift—as k_L → −∞. This “problem” never applies to the upper confidence limit, k_U, because the skewness of ln(x − k_U) tends to −∞ as k_U tends upward to the minimum value of x.

Example 2: bcskew0

In example 1, using lnskew0 with a variable such as mpg is probably undesirable. mpg has a natural zero, and we are shifting that zero arbitrarily. On the other hand, use of lnskew0 with a variable such as temperature measured in Fahrenheit or Celsius would be more appropriate, as the zero is indeed arbitrary. For a variable like mpg, it makes more sense to use the Box–Cox power transform (Box and Cox 1964):

    y^(λ) = (y^λ − 1) / λ

λ is free to take on any value, but y^(1) = y − 1, y^(0) = ln(y), and y^(−1) = 1 − 1/y.

bcskew0 works like lnskew0:

. bcskew0 bcmpg = mpg, level(95)

    Transform          L         [95% Conf. Interval]     Skewness
    (mpg^L-1)/L    -.3673283    -1.212752     .4339645     .0001898

The 95% confidence interval includes λ = −1 (λ is labeled L in the output), which has a rather more pleasing interpretation—gallons per mile—than (mpg^(−0.3673) − 1)/(−0.3673). The confidence interval, however, is calculated assuming that the power-transformed variable is normally distributed. It makes perfect sense to use bcskew0, even when you do not believe that the transformed variable will be normally distributed, but then the confidence interval is an approximation of unknown quality. If you believe that the transformed data are normally distributed, you can alternatively use boxcox to estimate λ; see [R] boxcox.

Stored results

lnskew0 and bcskew0 store the following in r():

Scalars
    r(gamma)      k (lnskew0)
    r(lambda)     λ (bcskew0)
    r(lb)         lower bound of confidence interval
    r(ub)         upper bound of confidence interval
    r(skewness)   resulting skewness of transformed variable

Methods and formulas

Skewness is as calculated by summarize; see [R] summarize. Newton’s method with numeric, uncentered derivatives is used to estimate k (lnskew0) and λ (bcskew0). For lnskew0, the initial value is chosen so that the minimum of x − k is 1, and thus ln(x − k) is 0. bcskew0 starts with λ = 1.
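To see what the zero-skewness criterion means operationally, here is a small do-file-style sketch that finds k for ln(mpg − k) by simple bisection on the sample skewness. This is only an illustration under stated assumptions (auto dataset, skewness taken from summarize, detail); it is not the Newton's-method algorithm lnskew0 actually uses.

use http://www.stata-press.com/data/r13/auto, clear
quietly summarize mpg
local hi = r(min) - 1e-4            // k must stay strictly below min(mpg)
local lo = r(min) - 1000            // a very negative starting lower bound
forvalues i = 1/60 {
    local k = (`lo' + `hi')/2
    tempvar z
    quietly generate double `z' = ln(mpg - `k')
    quietly summarize `z', detail
    if r(skewness) > 0 {
        local lo = `k'              // still right-skewed: move k up toward min(mpg)
    }
    else {
        local hi = `k'              // left-skewed: move k back down
    }
    drop `z'
}
display "zero-skewness k = " `k'

With these data the result should land near 5.3837, the value lnskew0 reported in Example 1.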
Acknowledgment lnskew0 and bcskew0 were written by Patrick Royston of the MRC Clinical Trials Unit, London, and coauthor of the Stata Press book Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model. Reference Box, G. E. P., and D. R. Cox. 1964. An analysis of transformations. Journal of the Royal Statistical Society, Series B 26: 211–252. Also see [R] boxcox — Box–Cox regression models [R] ladder — Ladder of powers [R] swilk — Shapiro – Wilk and Shapiro – Francia tests for normality Title log — Echo copy of session to file Syntax Description Options for use with log Remarks and examples Also see Menu Options for use with both log and cmdlog Option for use with set logtype Stored results Syntax Report status of log file log log query logname | all Open log file log using filename , append replace text | smcl name(logname) Close log log close logname | all Temporarily suspend logging or resume logging logname log off | on Report status of command log file cmdlog Open command log file cmdlog using filename , append replace Close command log, temporarily suspend logging, or resume logging cmdlog close | on | off Set default format for logs set logtype text | smcl , permanently Specify screen width set linesize # In addition to using the log command, you may access the capabilities of log by selecting File > Log from the menu and choosing one of the options in the list. 1051 1052 log — Echo copy of session to file Menu File > Log Description log allows you to make a full record of your Stata session. A log is a file containing what you type and Stata’s output. You may start multiple log files at the same time, and you may refer to them with a logname. If you do not specify a logname, Stata will use the name . cmdlog allows you to make a record of what you type during your Stata session. A command log contains only what you type, so it is a subset of a full log. You can make full logs, command logs, or both simultaneously. Neither is produced until you tell Stata to start logging. Command logs are always text files, making them easy to convert into do-files. (In this respect, it would make more sense if the default extension of a command log file was .do because command logs are do-files. The default is .txt, not .do, however, to keep you from accidentally overwriting your important do-files.) Full logs are recorded in one of two formats: Stata Markup and Control Language (SMCL) or plain text. The default is SMCL, but you can use set logtype to change that, or you can specify an option to state the format you wish. We recommend SMCL because it preserves fonts and colors. SMCL logs can be converted to text or to other formats by using the translate command; see [R] translate. You can also use translate to produce printable versions of SMCL logs. SMCL logs can be viewed and printed from the Viewer, as can any text file; see [R] view. When using multiple log files, you may have up to five SMCL logs and five text logs open at the same time. log or cmdlog, typed without arguments, reports the status of logging. log query, when passed an optional logname, reports the status of that log. log using and cmdlog using open a log file. log close and cmdlog close close the file. Between times, log off and cmdlog off, and log on and cmdlog on, can temporarily suspend and resume logging. If filename is specified without an extension, one of the suffixes .smcl, .log, or .txt is added. The extension .smcl or .log is added by log, depending on whether the file format is SMCL or text. 
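Because several logs may be open at once, the name() option lets you manage them individually. A hedged sketch of such a session (the filenames are made up; output omitted):

. log using session, name(main) smcl
. log using session_cmds, name(plain) text
. summarize mpg weight                // echoed to both open logs
. log off plain                       // suspend only the text log
. log on plain                        // resume it
. log close _all                      // close every open log

Each log can also be closed individually by name, for example, log close plain.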
The extension .txt is added by cmdlog. If filename contains embedded spaces, remember to enclose it in double quotes. set logtype specifies the default format in which full logs are to be recorded. Initially, full logs are recorded in SMCL format. set linesize specifies the maximum width, in characters, of Stata output. Most commands in Stata do not respect linesize, because it is not important for most commands. Most users never need to set linesize, because it will automatically be reset if you resize your Results window. This is also why there is no permanently option allowed with set linesize. set linesize is for use with commands such as list and display and is typically used by programmers who wish the output of those commands to be wider or narrower than the current width of the Results window. log — Echo copy of session to file 1053 Options for use with both log and cmdlog append specifies that results be appended to an existing file. If the file does not already exist, a new file is created. replace specifies that filename, if it already exists, be overwritten. When you do not specify either replace or append, the file is assumed to be new. If the specified file already exists, an error message is issued and logging is not started. Options for use with log text and smcl specify the format in which the log is to be recorded. The default is complicated to describe but is what you would expect: If you specify the file as filename.smcl, the default is to write the log in SMCL format (regardless of the value of set logtype). If you specify the file as filename.log, the default is to write the log in text format (regardless of the value of set logtype). If you type filename without an extension and specify neither the smcl option nor the text option, the default is to write the file according to the value of set logtype. If you have not set logtype, then the default is SMCL. Also, the filename you specified will be fixed to read filename.smcl if a SMCL log is being created or filename.log if a text log is being created. If you specify either the text or smcl option, then what you specify determines how the log is written. If filename was specified without an extension, the appropriate extension is added for you. If you open multiple log files, you may choose a different format for each file. name(logname) specifies an optional name you may use to refer to the log while it is open. You can start multiple log files, give each a different logname, and then close, temporarily suspend, or resume them each individually. Option for use with set logtype permanently specifies that, in addition to making the change right now, the logtype setting be remembered and become the default setting when you invoke Stata. Remarks and examples For a detailed explanation of logs, see [U] 15 Saving and printing output—log files. When you open a full log, the default is to show the name of the file and a time and date stamp: . log using myfile name: log: log type: opened on: . C:\data\proj1\myfile.smcl smcl 12 Jan 2013, 12:28:23 The above information will appear in the log. If you do not want this information to appear, precede the command by quietly: . quietly log using myfile quietly will not suppress any error messages or anything else you need to know. 1054 log — Echo copy of session to file Similarly, when you close a full log, the default is to show the full information, . 
log close name: log: log type: closed on: C:\data\proj1\myfile.smcl smcl 12 Jan 2013, 12:32:41 and that information will also appear in the log. If you want to suppress that, type quietly log close. Stored results log and cmdlog store the following in r(): Macros r(name) r(filename) r(status) r(type) logname name of file on or off smcl or text log query all stores the following in r(): Scalars r(numlogs) number of open log files For each open log file, log query all also stores r(name#) r(filename#) r(status#) r(type#) logname name of file on or off smcl or text where # varies between 1 and the value of r(numlogs). Be aware that # will not necessarily represent the order in which the log files were first opened, nor will it necessarily remain constant for a given log file upon multiple calls to log query. Also see [R] query — Display system parameters [R] translate — Print and translate logs [GSM] 16 Saving and printing results by using logs [GSW] 16 Saving and printing results by using logs [GSU] 16 Saving and printing results by using logs [U] 15 Saving and printing output—log files Title logistic — Logistic regression, reporting odds ratios Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax logistic depvar indepvars options if in weight , options Description Model noconstant offset(varname) asis constraints(constraints) collinear suppress constant term include varname in model with coefficient constrained to 1 retain perfect predictor variables apply specified linear constraints keep collinear variables SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, bootstrap, or jackknife Reporting level(#) coef nocnsreport display options set confidence level; default is level(95) report estimated coefficients do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics indepvars may contain factor variables; see [U] 11.4.3 Factor variables. depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, mfp, mi estimate, nestreg, rolling, statsby, stepwise, and svy are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce() and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 1055 1056 logistic — Logistic regression, reporting odds ratios Menu Statistics > Binary outcomes > Logistic regression (reporting odds ratios) Description logistic fits a logistic regression model of depvar on indepvars, where depvar is a 0/1 variable (or, more precisely, a 0/non-0 variable). Without arguments, logistic redisplays the last logistic estimates. logistic displays estimates as odds ratios; to view coefficients, type logit after running logistic. To obtain odds ratios for any covariate pattern relative to another, see [R] lincom. Options Model noconstant, offset(varname), constraints(constraints), collinear; see [R] estimation options. 
asis forces retention of perfect predictor variables and their associated perfectly predicted observations and may produce instabilities in maximization; see [R] probit. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. coef causes logistic to report the estimated coefficients rather than the odds ratios (exponentiated coefficients). coef may be specified when the model is fit or may be used later to redisplay results. coef affects only how results are displayed and not how they are estimated. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. The following option is available with logistic but is not shown in the dialog box: coeflegend; see [R] estimation options. logistic — Logistic regression, reporting odds ratios 1057 Remarks and examples Remarks are presented under the following headings: logistic and logit Robust estimate of variance Video examples logistic and logit logistic provides an alternative and preferred way to fit maximum-likelihood logit models, the other choice being logit ([R] logit). First, let’s dispose of some confusing terminology. We use the words logit and logistic to mean the same thing: maximum likelihood estimation. To some, one or the other of these words connotes transforming the dependent variable and using weighted least squares to fit the model, but that is not how we use either word here. Thus the logit and logistic commands produce the same results. The logistic command is generally preferred to the logit command because logistic presents the estimates in terms of odds ratios rather than coefficients. To some people, this may seem disadvantageous, but you can type logit without arguments after logistic to see the underlying coefficients. You should be cautious when interpreting the odds ratio of the constant term. Usually, this odds ratio represents the baseline odds of the model when all predictor variables are set to zero. However, you must verify that a zero value for all predictor variables in the model actually makes sense before continuing with this interpretation. Nevertheless, [R] logit is still worth reading because logistic shares the same features as logit, including omitting variables due to collinearity or one-way causation. For an introduction to logistic regression, see Lemeshow and Hosmer (2005), Pagano and Gauvreau (2000, 470–487), or Pampel (2000); for a complete but nonmathematical treatment, see Kleinbaum and Klein (2010); and for a thorough discussion, see Hosmer, Lemeshow, and Sturdivant (2013). See Gould (2000) for a discussion of the interpretation of logistic regression. See Dupont (2009) or Hilbe (2009) for a discussion of logistic regression with examples using Stata. 
For a discussion using Stata with an emphasis on model specification, see Vittinghoff et al. (2012). Stata has a variety of commands for performing estimation when the dependent variable is dichotomous or polytomous. See Long and Freese (2014) for a book devoted to fitting these models with Stata. Here is a list of some estimation commands that may be of interest. See help estimation commands for a complete list of all of Stata’s estimation commands. 1058 logistic — Logistic regression, reporting odds ratios asclogit [R] asclogit Alternative-specific conditional logit (McFadden’s choice) model asmprobit [R] asmprobit Alternative-specific multinomial probit regression asroprobit [R] asroprobit Alternative-specific rank-ordered probit regression binreg [R] binreg Generalized linear models for the binomial family biprobit [R] biprobit Bivariate probit regression blogit [R] glogit Logit regression for grouped data bprobit [R] glogit Probit regression for grouped data clogit [R] clogit Conditional (fixed-effects) logistic regression cloglog [R] cloglog Complementary log-log regression exlogistic [R] exlogistic Exact logistic regression glm [R] glm Generalized linear models glogit [R] glogit Weighted least-squares logistic regression for grouped data gprobit [R] glogit Weighted least-squares probit regression for grouped data heckoprobit [R] heckoprobit Ordered probit model with sample selection heckprobit [R] heckprobit Probit model with sample selection hetprobit [R] hetprobit Heteroskedastic probit model ivprobit [R] ivprobit Probit model with endogenous regressors logit [R] logit Logistic regression, reporting coefficients mecloglog [ME] mecloglog Multilevel mixed-effects complementary log-log regression meglm [ME] meglm Multilevel mixed-effects generalized linear model melogit [ME] melogit Multilevel mixed-effects logistic regression meprobit [ME] meprobit Multilevel mixed-effects probit regression mlogit [R] mlogit Multinomial (polytomous) logistic regression mprobit [R] mprobit Multinomial probit regression nlogit [R] nlogit Nested logit regression (RUM-consistent and nonnormalized) ologit [R] ologit Ordered logistic regression oprobit [R] oprobit Ordered probit regression probit [R] probit Probit regression rologit [R] rologit Rank-ordered logistic regression scobit [R] scobit Skewed logistic regression slogit [R] slogit Stereotype logistic regression svy: cmd [SVY] svy estimation Survey versions of many of these commands are available; see [SVY] svy estimation xtcloglog [XT] xtcloglog Random-effects and population-averaged cloglog models xtgee [XT] xtgee GEE population-averaged generalized linear models xtlogit [XT] xtlogit Fixed-effects, random-effects, and population-averaged logit models xtologit [XT] xtologit Random-effects ordered logistic models xtoprobit [XT] xtoprobit Random-effects ordered probit models xtprobit [XT] xtprobit Random-effects and population-averaged probit models logistic — Logistic regression, reporting odds ratios 1059 Example 1 Consider the following dataset from a study of risk factors associated with low birthweight described in Hosmer, Lemeshow, and Sturdivant (2013, 24). . use http://www.stata-press.com/data/r13/lbw (Hosmer & Lemeshow data) . 
describe Contains data from http://www.stata-press.com/data/r13/lbw.dta obs: 189 Hosmer & Lemeshow data vars: 11 15 Jan 2013 05:01 size: 2,646 variable name storage type display format id low age lwt race smoke ptl ht ui ftv int byte byte int byte byte byte byte byte byte %8.0g %8.0g %8.0g %8.0g %8.0g %9.0g %8.0g %8.0g %8.0g %8.0g bwt int %8.0g value label variable label identification code birthweight<2500g age of mother weight at last menstrual period race smoked during pregnancy premature labor history (count) has history of hypertension presence, uterine irritability number of visits to physician during 1st trimester birthweight (grams) race smoke Sorted by: We want to investigate the causes of low birthweight. Here race is a categorical variable indicating whether a person is white (race = 1), black (race = 2), or some other race (race = 3). We want indicator (dummy) variables for race included in the regression, so we will use factor variables. . logistic low age lwt i.race smoke ptl ht ui Logistic regression Log likelihood = -100.724 low Odds Ratio Std. Err. age lwt .9732636 .9849634 .0354759 .0068217 race black other 3.534767 2.368079 smoke ptl ht ui _cons 2.517698 1.719161 6.249602 2.1351 1.586014 z Number of obs LR chi2(8) Prob > chi2 Pseudo R2 = = = = 189 33.22 0.0001 0.1416 P>|z| [95% Conf. Interval] -0.74 -2.19 0.457 0.029 .9061578 .9716834 1.045339 .9984249 1.860737 1.039949 2.40 1.96 0.016 0.050 1.259736 1.001356 9.918406 5.600207 1.00916 .5952579 4.322408 .9808153 1.910496 2.30 1.56 2.65 1.65 0.38 0.021 0.118 0.008 0.099 0.702 1.147676 .8721455 1.611152 .8677528 .1496092 5.523162 3.388787 24.24199 5.2534 16.8134 The odds ratios are for a one-unit change in the variable. If we wanted the odds ratio for age to be in terms of 4-year intervals, we would type 1060 logistic — Logistic regression, reporting odds ratios . gen age4 = age/4 . logistic low age4 lwt i.race smoke ptl ht ui (output omitted ) After logistic, we can type logit to see the model in terms of coefficients and standard errors: . logit Logistic regression Log likelihood = Number of obs LR chi2(8) Prob > chi2 Pseudo R2 -100.724 Std. Err. z P>|z| = = = = 189 33.22 0.0001 0.1416 low Coef. [95% Conf. Interval] age4 lwt -.1084012 -.0151508 .1458017 .0069259 -0.74 -2.19 0.457 0.029 -.3941673 -.0287253 .1773649 -.0015763 race black other 1.262647 .8620792 .5264101 .4391532 2.40 1.96 0.016 0.050 .2309024 .0013548 2.294392 1.722804 smoke ptl ht ui _cons .9233448 .5418366 1.832518 .7585135 .4612239 .4008266 .346249 .6916292 .4593768 1.20459 2.30 1.56 2.65 1.65 0.38 0.021 0.118 0.008 0.099 0.702 .137739 -.136799 .4769494 -.1418484 -1.899729 1.708951 1.220472 3.188086 1.658875 2.822176 If we wanted to see the logistic output again, we would type logistic without arguments. Example 2 We can specify the confidence interval for the odds ratios with the level() option, and we can do this either at estimation time or when replaying the model. For instance, to see our first model in example 1 with narrower, 90% confidence intervals, we might type . logistic, level(90) Logistic regression Log likelihood = Number of obs LR chi2(8) Prob > chi2 Pseudo R2 -100.724 low Odds Ratio Std. Err. age4 lwt .8972675 .9849634 .1308231 .0068217 race black other 3.534767 2.368079 smoke ptl ht ui _cons 2.517698 1.719161 6.249602 2.1351 1.586014 z = = = = 189 33.22 0.0001 0.1416 P>|z| [90% Conf. 
Interval] -0.74 -2.19 0.457 0.029 .7059409 .9738063 1.140448 .9962483 1.860737 1.039949 2.40 1.96 0.016 0.050 1.487028 1.149971 8.402379 4.876471 1.00916 .5952579 4.322408 .9808153 1.910496 2.30 1.56 2.65 1.65 0.38 0.021 0.118 0.008 0.099 0.702 1.302185 .9726876 2.003487 1.00291 .2186791 4.867819 3.038505 19.49478 4.545424 11.50288 logistic — Logistic regression, reporting odds ratios 1061 Robust estimate of variance If you specify vce(robust), Stata reports the robust estimate of variance described in [U] 20.21 Obtaining robust variance estimates. Here is the model previously fit with the robust estimate of variance: . logistic low age lwt i.race smoke ptl ht ui, vce(robust) Logistic regression Number of obs Wald chi2(8) Prob > chi2 Log pseudolikelihood = -100.724 Pseudo R2 Robust Std. Err. low Odds Ratio age lwt .9732636 .9849634 .0329376 .0070209 race black other 3.534767 2.368079 smoke ptl ht ui _cons 2.517698 1.719161 6.249602 2.1351 1.586014 z = = = = 189 29.02 0.0003 0.1416 P>|z| [95% Conf. Interval] -0.80 -2.13 0.423 0.034 .9108015 .9712984 1.040009 .9988206 1.793616 1.026563 2.49 1.99 0.013 0.047 1.307504 1.012512 9.556051 5.538501 .9736417 .7072902 4.102026 1.042775 1.939482 2.39 1.32 2.79 1.55 0.38 0.017 0.188 0.005 0.120 0.706 1.179852 .7675715 1.726445 .8197749 .144345 5.372537 3.850476 22.6231 5.560858 17.42658 Also you can specify vce(cluster clustvar) and then, within cluster, relax the assumption of independence. To illustrate this, we have made some fictional additions to the low-birthweight data. Say that these data are not a random sample of mothers but instead are a random sample of mothers from a random sample of hospitals. In fact, that may be true — we do not know the history of these data. Hospitals specialize, and it would not be too incorrect to say that some hospitals specialize in more difficult cases. We are going to show two extremes. In one, all hospitals are alike, but we are going to estimate under the possibility that they might differ. In the other, hospitals are strikingly different. In both cases, we assume that patients are drawn from 20 hospitals. In both examples, we will fit the same model, and we will type the same command to fit it. Below are the same data we have been using but with a new variable, hospid, that identifies from which of the 20 hospitals each patient was drawn (and which we have made up): 1062 logistic — Logistic regression, reporting odds ratios . use http://www.stata-press.com/data/r13/hospid1, clear . logistic low age lwt i.race smoke ptl ht ui, vce(cluster hospid) Logistic regression Number of obs = 189 Wald chi2(8) = 49.67 Prob > chi2 = 0.0000 Log pseudolikelihood = -100.724 Pseudo R2 = 0.1416 (Std. Err. adjusted for 20 clusters in hospid) Robust Std. Err. low Odds Ratio age lwt .9732636 .9849634 .0397476 .0057101 race black other 3.534767 2.368079 smoke ptl ht ui _cons 2.517698 1.719161 6.249602 2.1351 1.586014 z P>|z| [95% Conf. Interval] -0.66 -2.61 0.507 0.009 .898396 .9738352 1.05437 .9962187 2.013285 .8451325 2.22 2.42 0.027 0.016 1.157563 1.176562 10.79386 4.766257 .8284259 .6676221 4.066275 1.093144 1.661913 2.81 1.40 2.82 1.48 0.44 0.005 0.163 0.005 0.138 0.660 1.321062 .8030814 1.74591 .7827337 .2034094 4.79826 3.680219 22.37086 5.824014 12.36639 The standard errors are similar to the standard errors we have previously obtained, whether we used the robust or conventional estimators. In this example, we invented the hospital IDs randomly. 
Here are the results of the estimation with the same data but with a different set of hospital IDs: . use http://www.stata-press.com/data/r13/hospid2 . logistic low age lwt i.race smoke ptl ht ui, vce(cluster hospid) Logistic regression Number of obs = 189 Wald chi2(8) = 7.19 Prob > chi2 = 0.5167 Log pseudolikelihood = -100.724 Pseudo R2 = 0.1416 (Std. Err. adjusted for 20 clusters in hospid) Robust Std. Err. low Odds Ratio age lwt .9732636 .9849634 .0293064 .0106123 race black other 3.534767 2.368079 smoke ptl ht ui _cons 2.517698 1.719161 6.249602 2.1351 1.586014 z P>|z| [95% Conf. Interval] -0.90 -1.41 0.368 0.160 .9174862 .9643817 1.032432 1.005984 3.120338 1.297738 1.43 1.57 0.153 0.116 .6265521 .8089594 19.9418 6.932114 1.570287 .6799153 7.165454 1.411977 1.946253 1.48 1.37 1.60 1.15 0.38 0.139 0.171 0.110 0.251 0.707 .7414969 .7919045 .660558 .5841231 .1431423 8.548655 3.732161 59.12808 7.804266 17.573 Note the strikingly larger standard errors. What happened? In these data, women most likely to have low-birthweight babies are sent to certain hospitals, and the decision on likeliness is based not just on age, smoking history, etc., but on other things that doctors can see but that are not recorded in our data. Thus merely because a woman is at one of the centers identifies her to be more likely to have a low-birthweight baby. logistic — Logistic regression, reporting odds ratios 1063 Video examples Logistic regression, part 1: Binary predictors Logistic regression, part 2: Continuous predictors Logistic regression, part 3: Factor variables Stored results logistic stores the following in e(): Scalars e(N) e(N cds) e(N cdf) e(k) e(k eq) e(k eq model) e(k dv) e(df m) e(r2 p) e(ll) e(ll 0) e(N clust) e(chi2) e(p) e(rank) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(clustvar) e(offset) e(chi2type) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(estat cmd) e(predict) e(marginsnotok) e(asbalanced) e(asobserved) number of observations number of completely determined successes number of completely determined failures number of parameters number of equations in e(b) number of equations in overall model test number of dependent variables model degrees of freedom pseudo-R-squared log likelihood log likelihood, constant-only model number of clusters χ2 significance of model test rank of e(V) number of iterations return code 1 if converged, 0 otherwise logistic command as typed name of dependent variable weight type weight expression title in estimation output name of cluster variable linear offset variable Wald or LR; type of model χ2 test vcetype specified in vce() title used to label Std. Err. 
type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement estat program used to implement predict predictions disallowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved 1064 logistic — Logistic regression, reporting odds ratios Matrices e(b) e(Cns) e(ilog) e(gradient) e(mns) e(rules) e(V) e(V modelbased) coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector vector of means of the independent variables information about perfect predictors variance–covariance matrix of the estimators model-based variance Functions e(sample) marks estimation sample Methods and formulas Define xj as the (row) vector of independent variables, augmented by 1, and b as the corresponding estimated parameter (column) vector. The logistic regression model is fit by logit; see [R] logit for details of estimation. The odds ratio corresponding to the ith coefficient is ψi = exp(bi ). The standard error of the odds ψ ratio is si = ψi si , where si is the standard error of bi estimated by logit. Define Ij = xj b as the predicted index of the j th observation. The predicted probability of a positive outcome is exp(Ij ) pj = 1 + exp(Ij ) This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. logistic also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. References Archer, K. J., and S. A. Lemeshow. 2006. Goodness-of-fit test for a logistic regression model fitted using survey sample data. Stata Journal 6: 97–105. Brady, A. R. 1998. sbe21: Adjusted population attributable fractions from logistic regression. Stata Technical Bulletin 42: 8–12. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 137–143. College Station, TX: Stata Press. Buis, M. L. 2010a. Direct and indirect effects in a logit model. Stata Journal 10: 11–29. . 2010b. Stata tip 87: Interpretation of interactions in nonlinear models. Stata Journal 10: 305–308. Cleves, M. A., and A. Tosetto. 2000. sg139: Logistic regression when binary outcome is measured with uncertainty. Stata Technical Bulletin 55: 20–23. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 152–156. College Station, TX: Stata Press. Collett, D. 2003. Modelling Survival Data in Medical Research. 2nd ed. London: Chapman & Hall/CRC. de Irala-Estévez, J., and M. A. Martı́nez. 2000. sg125: Automatic estimation of interaction effects and their confidence intervals. Stata Technical Bulletin 53: 29–31. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 270–273. College Station, TX: Stata Press. Dupont, W. D. 2009. Statistical Modeling for Biomedical Researchers: A Simple Introduction to the Analysis of Complex Data. 2nd ed. Cambridge: Cambridge University Press. Freese, J. 2002. Least likely observations in regression models for categorical outcomes. Stata Journal 2: 296–300. logistic — Logistic regression, reporting odds ratios 1065 Garrett, J. M. 1997. sbe14: Odds ratios and confidence intervals for logistic regression models with effect modification. Stata Technical Bulletin 36: 15–22. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 104–114. College Station, TX: Stata Press. Gould, W. W. 2000. 
sg124: Interpreting logistic regression in all its forms. Stata Technical Bulletin 53: 19–29. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 257–270. College Station, TX: Stata Press.
Hilbe, J. M. 1997. sg63: Logistic regression: Standardized coefficients and partial correlations. Stata Technical Bulletin 35: 21–22. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 162–163. College Station, TX: Stata Press.
Hilbe, J. M. 2009. Logistic Regression Models. Boca Raton, FL: Chapman & Hall/CRC.
Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley.
Kleinbaum, D. G., and M. Klein. 2010. Logistic Regression: A Self-Learning Text. 3rd ed. New York: Springer.
Lemeshow, S. A., and J.-R. L. Gall. 1994. Modeling the severity of illness of ICU patients: A systems update. Journal of the American Medical Association 272: 1049–1055.
Lemeshow, S. A., and D. W. Hosmer, Jr. 2005. Logistic regression. In Vol. 2 of Encyclopedia of Biostatistics, ed. P. Armitage and T. Colton, 2870–2880. Chichester, UK: Wiley.
Long, J. S., and J. Freese. 2014. Regression Models for Categorical Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata Press.
Miranda, A., and S. Rabe-Hesketh. 2006. Maximum likelihood estimation of endogenous switching and sample selection models for binary, ordinal, and count variables. Stata Journal 6: 285–308.
Mitchell, M. N., and X. Chen. 2005. Visualizing main effects and interactions for binary logit models. Stata Journal 5: 64–82.
Pagano, M., and K. Gauvreau. 2000. Principles of Biostatistics. 2nd ed. Belmont, CA: Duxbury.
Pampel, F. C. 2000. Logistic Regression: A Primer. Thousand Oaks, CA: Sage.
Paul, C. 1998. sg92: Logistic regression for data including multiple imputations. Stata Technical Bulletin 45: 28–30. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 180–183. College Station, TX: Stata Press.
Pearce, M. S. 2000. sg148: Profile likelihood confidence intervals for explanatory variables in logistic regression. Stata Technical Bulletin 56: 45–47. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 211–214. College Station, TX: Stata Press.
Pregibon, D. 1981. Logistic regression diagnostics. Annals of Statistics 9: 705–724.
Reilly, M., and A. Salim. 2000. sg156: Mean score method for missing covariate data in logistic regression models. Stata Technical Bulletin 58: 25–27. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 256–258. College Station, TX: Stata Press.
Schonlau, M. 2005. Boosted regression (boosting): An introductory tutorial and a Stata plugin. Stata Journal 5: 330–354.
Vittinghoff, E., D. V. Glidden, S. C. Shiboski, and C. E. McCulloch. 2012. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. 2nd ed. New York: Springer.
Xu, J., and J. S. Long. 2005. Confidence intervals for predicted outcomes in regression models for categorical outcomes. Stata Journal 5: 537–559.
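As a worked illustration of the odds-ratio transformation given in Methods and formulas above, the following minimal sketch recovers an odds ratio and its standard error by hand from the logit coefficients of the low-birthweight model used throughout this entry. It is illustrative only; the choice of the smoke coefficient is arbitrary.

. use http://www.stata-press.com/data/r13/lbw
. quietly logit low age lwt i.race smoke ptl ht ui
. display exp(_b[smoke])               // odds ratio: psi = exp(b)
. display exp(_b[smoke])*_se[smoke]    // standard error of the odds ratio: psi*se(b)
. logit, or                            // replay with odds ratios for comparison

The last command simply replays the results in odds-ratio form, which is what logistic reports by default.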
1066 logistic — Logistic regression, reporting odds ratios Also see [R] logistic postestimation — Postestimation tools for logistic [R] brier — Brier score decomposition [R] cloglog — Complementary log-log regression [R] exlogistic — Exact logistic regression [R] logit — Logistic regression, reporting coefficients [R] roc — Receiver operating characteristic (ROC) analysis [MI] estimation — Estimation commands for use with mi estimate [SVY] svy estimation — Estimation commands for survey data [XT] xtlogit — Fixed-effects, random-effects, and population-averaged logit models [U] 20 Estimation and postestimation commands Title logistic postestimation — Postestimation tools for logistic Description Remarks and examples Syntax for predict Methods and formulas Menu for predict References Options for predict Also see Description The following postestimation commands are of special interest after logistic: Command Description estat classification report various summary statistics, including the classification table estat gof Pearson or Hosmer–Lemeshow goodness-of-fit test lroc compute area under ROC curve and graph the curve lsens graph sensitivity and specificity versus probability cutoff These commands are not appropriate after the svy prefix. The following standard postestimation commands are also available: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 2 forecast is not appropriate with mi or svy estimation results. lrtest is not appropriate with svy estimation results. 
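As a quick orientation before the predict syntax that follows, the special-interest commands listed above are typically issued immediately after logistic on the estimation sample. The sketch below is only meant to show the order of operations, not a recommended analysis path; it reuses the low-birthweight model from example 1 of [R] logistic.

. use http://www.stata-press.com/data/r13/lbw
. logistic low age lwt i.race smoke ptl ht ui
. estat gof, group(10)        // Hosmer-Lemeshow goodness-of-fit test with 10 groups
. estat classification        // classification table and summary statistics
. lroc                        // area under the ROC curve, with graph
. lsens                       // sensitivity and specificity versus probability cutoff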
1067 1068 logistic postestimation — Postestimation tools for logistic Syntax for predict predict type newvar if in , statistic nooffset rules asif Description statistic Main pr xb stdp ∗ dbeta ∗ deviance ∗ dx2 ∗ ddeviance ∗ hat ∗ number ∗ residuals ∗ rstandard score probability of a positive outcome; the default linear prediction standard error of the prediction Pregibon (1981) ∆βb influence statistic deviance residual Hosmer, Lemeshow, and Sturdivant (2013) ∆ χ2 influence statistic Hosmer, Lemeshow, and Sturdivant (2013) ∆ D influence statistic Pregibon (1981) leverage sequential number of the covariate pattern Pearson residuals; adjusted for number sharing covariate pattern standardized Pearson residuals; adjusted for number sharing covariate pattern first derivative of the log likelihood with respect to xj β Unstarred statistics are available both in and out of sample; type predict . . . if e(sample) . . . if wanted only for the estimation sample. Starred statistics are calculated only for the estimation sample, even when if e(sample) is not specified. pr, xb, stdp, and score are the only options allowed with svy estimation results. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main pr, the default, calculates the probability of a positive outcome. xb calculates the linear prediction. stdp calculates the standard error of the linear prediction. dbeta calculates the Pregibon (1981) ∆βb influence statistic, a standardized measure of the difference in the coefficient vector that is due to deletion of the observation along with all others that share the same covariate pattern. In Hosmer, Lemeshow, and Sturdivant (2013, 154–155) jargon, this statistic is M -asymptotic; that is, it is adjusted for the number of observations that share the same covariate pattern. deviance calculates the deviance residual. dx2 calculates the Hosmer, Lemeshow, and Sturdivant (2013, 191) ∆χ2 influence statistic, reflecting the decrease in the Pearson χ2 that is due to the deletion of the observation and all others that share the same covariate pattern. ddeviance calculates the Hosmer, Lemeshow, and Sturdivant (2013, 191) ∆D influence statistic, which is the change in the deviance residual that is due to deletion of the observation and all others that share the same covariate pattern. logistic postestimation — Postestimation tools for logistic 1069 hat calculates the Pregibon (1981) leverage or the diagonal elements of the hat matrix adjusted for the number of observations that share the same covariate pattern. number numbers the covariate patterns — observations with the same covariate pattern have the same number. Observations not used in estimation have number set to missing. The first covariate pattern is numbered 1, the second 2, and so on. residuals calculates the Pearson residual as given by Hosmer, Lemeshow, and Sturdivant (2013, 155) and adjusted for the number of observations that share the same covariate pattern. rstandard calculates the standardized Pearson residual as given by Hosmer, Lemeshow, and Sturdivant (2013, 191) and adjusted for the number of observations that share the same covariate pattern. score calculates the equation-level score, ∂ ln L/∂(xj β). Options nooffset is relevant only if you specified offset(varname) for logistic. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xj b rather than as xj b + offsetj . 
rules requests that Stata use any rules that were used to identify the model when making the prediction. By default, Stata calculates missing for excluded observations. See example 1 in [R] logit postestimation. asif requests that Stata ignore the rules and the exclusion criteria and calculate predictions for all observations possible by using the estimated parameter from the model. See example 1 in [R] logit postestimation. Remarks and examples predict is used after logistic to obtain predicted probabilities, residuals, and influence statistics for the estimation sample. The suggested diagnostic graphs below are from Hosmer, Lemeshow, and Sturdivant (2013), where they are more elaborately explained. Also see Collett (2003, 129–168) for a thorough discussion of model checking. Remarks are presented under the following headings: predict predict predict predict predict predict predict predict predict predict without options with the xb and stdp options with the residuals option with the number option with the deviance option with the rstandard option with the hat option with the dx2 option with the ddeviance option with the dbeta option predict without options Typing predict newvar after estimation calculates the predicted probability of a positive outcome. In example 1 of [R] logistic, we ran the model logistic low age lwt i.race smoke ptl ht ui. We obtain the predicted probabilities of a positive outcome by typing . use http://www.stata-press.com/data/r13/lbw 1070 logistic postestimation — Postestimation tools for logistic (Hosmer & Lemeshow data) . logistic low age lwt i.race smoke ptl ht ui (output omitted ) . predict p (option pr assumed; Pr(low)) . summarize p low Obs Mean Std. Dev. Variable p low 189 189 .3121693 .3121693 .1913915 .4646093 Min Max .0272559 0 .8391283 1 predict with the xb and stdp options predict with the xb option calculates the linear combination xj b, where xj are the independent variables in the j th observation and b is the estimated parameter vector. This is sometimes known as the index function because the cumulative distribution function indexed at this value is the probability of a positive outcome. With the stdp option, predict calculates the standard error of the prediction, which is not adjusted for replicated covariate patterns in the data. The influence statistics described below are adjusted for replicated covariate patterns in the data. predict with the residuals option predict can calculate more than predicted probabilities. The Pearson residual is defined as the square root of the contribution of the covariate pattern to the Pearson χ2 goodness-of-fit statistic, signed according to whether the observed number of positive responses within the covariate pattern is less than or greater than expected. For instance, . predict r, residuals . summarize r, detail Pearson residual 1% 5% 10% 25% 50% Percentiles -1.750923 -1.129907 -.9581174 -.6545911 -.3806923 75% 90% 95% 99% .8162894 1.510355 1.747948 3.002206 Smallest -2.283885 -1.750923 -1.636279 -1.636279 Obs Sum of Wgt. Mean Std. Dev. Largest 2.23879 2.317558 3.002206 3.126763 Variance Skewness Kurtosis .9941981 .8618271 3.038448 We notice the prevalence of a few large positive residuals: . sort r . list id r low p age race in -5/l 185. 186. 187. 188. 189. 
189 189 -.0242299 .9970949 id r low p age race 33 57 16 77 36 2.224501 2.23879 2.317558 3.002206 3.126763 1 1 1 1 1 .1681123 .166329 .1569594 .0998678 .0927932 19 15 27 26 24 white white other white white logistic postestimation — Postestimation tools for logistic 1071 predict with the number option Covariate patterns play an important role in logistic regression. Two observations are said to share the same covariate pattern if the independent variables for the two observations are identical. Although we might think of having individual observations, the statistical information in the sample can be summarized by the covariate patterns, the number of observations with that covariate pattern, and the number of positive outcomes within the pattern. Depending on the model, the number of covariate patterns can approach or be equal to the number of observations, or it can be considerably less. Stata calculates all the residual and diagnostic statistics in terms of covariate patterns, not observations. That is, all observations with the same covariate pattern are given the same residual and diagnostic statistics. Hosmer, Lemeshow, and Sturdivant (2013, 154–155) argue that such “M asymptotic” statistics are more useful than “N -asymptotic” statistics. To understand the difference, think of an observed positive outcome with predicted probability of 0.8. Taking the observation in isolation, the residual must be positive — we expected 0.8 positive responses and observed 1. This may indeed be the correct residual, but not necessarily. Under the M -asymptotic definition, we ask how many successes we observed across all observations with this covariate pattern. If that number were, say, six, and there were a total of 10 observations with this covariate pattern, then the residual is negative for the covariate pattern — we expected eight positive outcomes but observed six. predict makes this kind of calculation and then attaches the same residual to all observations in the covariate pattern. Occasionally, you might want to find all observations sharing a covariate pattern. number allows you to do this: . predict pattern, number . summarize pattern Variable Obs pattern 189 Mean Std. Dev. 89.2328 53.16573 Min Max 1 182 We previously fit the model logistic low age lwt i.race smoke ptl ht ui over 189 observations. There are 182 covariate patterns in our data. predict with the deviance option The deviance residual is defined as the square root of the contribution to the likelihood-ratio test statistic of a saturated model versus the fitted model. It has slightly different properties from the Pearson residual (see Hosmer, Lemeshow, and Sturdivant [2013, 155–157]): . predict d, deviance . summarize d, detail deviance residual 1% 5% 10% 25% 50% Percentiles -1.843472 -1.33477 -1.148316 -.8445325 -.5202702 75% 90% 95% 99% .9129041 1.541558 1.673338 2.146583 Smallest -1.911621 -1.843472 -1.843472 -1.674869 Largest 1.894089 1.924457 2.146583 2.180542 Obs Sum of Wgt. Mean Std. Dev. Variance Skewness Kurtosis 189 189 -.1228811 1.049237 1.100898 .6598857 2.036938 1072 logistic postestimation — Postestimation tools for logistic predict with the rstandard option Pearson residuals do not have a standard deviation equal to 1. rstandard generates Pearson residuals normalized to have an expected standard deviation equal to 1. . predict rs, rstandard . summarize r rs Variable Obs Mean r rs 189 189 -.0242299 -.0279135 r rs 1.0000 0.9998 1.0000 Std. Dev. .9970949 1.026406 Min Max -2.283885 -2.4478 3.126763 3.149081 . 
correlate r rs (obs=189) r rs Remember that we previously created r containing the (unstandardized) Pearson residuals. In these data, whether we use standardized or unstandardized residuals does not matter much. predict with the hat option hat calculates the leverage of a covariate pattern — a scaled measure of distance in terms of the independent variables. Large values indicate covariate patterns far from the average covariate pattern that can have a large effect on the fitted model even if the corresponding residual is small. Consider the following graph: 0 .05 leverage .1 .15 .2 . predict h, hat . scatter h r, xline(0) −2 −1 0 1 Pearson residual 2 3 The points to the left of the vertical line are observed negative outcomes; here our data contain almost as many covariate patterns as observations, so most covariate patterns are unique. In such unique patterns, we observe either 0 or 1 success and expect p, thus forcing the sign of the residual. If we had fewer covariate patterns—if we did not have continuous variables in our model—there would be no such interpretation, and we would not have drawn the vertical line at 0. logistic postestimation — Postestimation tools for logistic 1073 Points on the left and right edges of the graph represent large residuals — covariate patterns that are not fit well by our model. Points at the top of our graph represent high leverage patterns. When analyzing the influence of observations on the model, we are most interested in patterns with high leverage and small residuals — patterns that might otherwise escape our attention. predict with the dx2 option There are many ways to measure influence, and hat is one example. dx2 measures the decrease in the Pearson χ2 goodness-of-fit statistic that would be caused by deleting an observation (and all others sharing the covariate pattern): . predict dx2, dx2 0 2 H−L dX^2 4 6 8 10 . scatter dx2 p 0 .2 .4 Pr(low) .6 .8 Paraphrasing Hosmer, Lemeshow, and Sturdivant (2013, 195–197), the points going from the top left to the bottom right correspond to covariate patterns with the number of positive outcomes equal to the number in the group; the points on the other curve correspond to 0 positive outcomes. In our data, most of the covariate patterns are unique, so the points tend to lie along one or the other curves; the points that are off the curves correspond to the few repeated covariate patterns in our data in which all the outcomes are not the same. We examine this graph for large values of dx2 — there are two at the top left. predict with the ddeviance option Another measure of influence is the change in the deviance residuals due to deletion of a covariate pattern: . predict dd, ddeviance As with dx2, we typically graph ddeviance against the probability of a positive outcome. We direct you to Hosmer, Lemeshow, and Sturdivant (2013, 195) for an example and for the interpretation of this graph. 1074 logistic postestimation — Postestimation tools for logistic predict with the dbeta option One of the more direct measures of influence of interest to model fitters is the Pregibon (1981) dbeta measure, a measure of the change in the coefficient vector that would be caused by deleting an observation (and all others sharing the covariate pattern): 0 .2 Pregibon’s dbeta .4 .6 .8 1 . predict db, dbeta . scatter db p 0 .2 .4 Pr(low) .6 .8 One observation has a large effect on the estimated coefficients. We can easily find this point: . sort db . list in l 189. 
id 188 low 0 p .8391283 age 25 lwt 95 r -2.283885 dx2 5.991726 race white pattern 117 smoke smoker ptl 3 d -1.911621 dd 4.197658 ht 0 ui 1 rs -2.4478 ftv 0 bwt 3637 h .1294439 db .8909163 logistic postestimation — Postestimation tools for logistic 1075 Hosmer, Lemeshow, and Sturdivant (2013, 196) suggest a graph that combines two of the influence measures: . scatter dx2 p [w=db], title("Symbol size proportional to dBeta") mfcolor(none) (analytic weights assumed) (analytic weights assumed) 0 2 H−L dX^2 4 6 8 10 Symbol size proportional to dBeta 0 .2 .4 Pr(low) .6 .8 We can easily spot the most influential points by the dbeta and dx2 measures. Methods and formulas Let j index observations. Define Mj for each observation as the total number of observations sharing j ’s covariate pattern. Define Yj as the total number of positive responses among observations sharing j ’s covariate pattern. The Pearson residual for the j th observation is defined as rj = p Yj − Mj pj Mj pj (1 − pj ) For Mj > 1, the deviance residual dj is defined as " Yj dj = ± 2 Yj ln Mj p j Mj − Yj + (Mj − Yj ) ln Mj (1 − pj ) #!1/2 where the sign is the same as the sign of (Yj − Mj pj ). In the limiting cases, the deviance residual is given by ( p − 2Mj | ln(1 − pj )| if Yj = 0 dj = p 2Mj | lnpj | if Yj = Mj The unadjusted diagonal elements of the hat matrix hUj are given by hUj = (XVX0 )jj , where V is the estimated covariance matrix of parameters. The adjusted diagonal elements hj created by hat are then hj = Mj pj (1 − pj )hUj . 1076 logistic postestimation — Postestimation tools for logistic p The standardized Pearson residual rSj is rj / 1 − hj . The Pregibon (1981) ∆βbj influence statistic is ∆βbj = rj2 hj (1 − hj )2 2 . The corresponding change in the deviance residual The corresponding change in the Pearson χ2 is rSj 2 is ∆Dj = dj /(1 − hj ). References Collett, D. 2003. Modelling Survival Data in Medical Research. 2nd ed. London: Chapman & Hall/CRC. Garrett, J. M. 2000. sg157: Predicted values calculated from linear or logistic regression models. Stata Technical Bulletin 58: 27–30. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 258–261. College Station, TX: Stata Press. Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley. Mitchell, M. N., and X. Chen. 2005. Visualizing main effects and interactions for binary logit models. Stata Journal 5: 64–82. Newson, R. B. 2013. Attributable and unattributable risks and fractions and other scenario comparisons. Stata Journal 13: 672–698. Powers, D. A., H. Yoshioka, and M.-S. Yun. 2011. mvdcmp: Multivariate decomposition for nonlinear response models. Stata Journal 11: 556–576. Pregibon, D. 1981. Logistic regression diagnostics. Annals of Statistics 9: 705–724. Wang, Z. 2007. Two postestimation commands for assessing confounding effects in epidemiological studies. Stata Journal 7: 183–196. 
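The formulas above tie the influence statistics together, and it can be reassuring to verify them numerically. The sketch below assumes the variables r, rs, d, h, dx2, dd, and db created with predict earlier in the Remarks are still in memory; the *_check variable names are arbitrary illustration choices, and small discrepancies reflect only float storage of the predict results.

. generate double rs_check  = r/sqrt(1 - h)       // standardized Pearson residual
. generate double dx2_check = rs_check^2          // change in Pearson chi-squared
. generate double dd_check  = d^2/(1 - h)         // change in deviance
. generate double db_check  = r^2*h/(1 - h)^2     // Pregibon delta-beta
. summarize rs rs_check dx2 dx2_check dd dd_check db db_check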
Also see [R] logistic — Logistic regression, reporting odds ratios [R] estat classification — Classification statistics and table [R] estat gof — Pearson or Hosmer–Lemeshow goodness-of-fit test [R] lroc — Compute area under ROC curve and graph the curve [R] lsens — Graph sensitivity and specificity versus probability cutoff [U] 20 Estimation and postestimation commands Title logit — Logistic regression, reporting coefficients Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax logit depvar indepvars options if in weight , options Description Model noconstant offset(varname) asis constraints(constraints) collinear suppress constant term include varname in model with coefficient constrained to 1 retain perfect predictor variables apply specified linear constraints keep collinear variables SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, bootstrap, or jackknife Reporting level(#) or nocnsreport display options set confidence level; default is level(95) report odds ratios do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used nocoef coeflegend do not display coefficient table; seldom used display legend instead of statistics indepvars may contain factor variables; see [U] 11.4.3 Factor variables. depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, mfp, mi estimate, nestreg, rolling, statsby, stepwise, and svy are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce(), nocoef, and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. nocoef and coeflegend do not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 1077 1078 logit — Logistic regression, reporting coefficients Menu Statistics > Binary outcomes > Logistic regression Description logit fits a logit model for a binary response by maximum likelihood; it models the probability of a positive outcome given a set of regressors. depvar equal to nonzero and nonmissing (typically depvar equal to one) indicates a positive outcome, whereas depvar equal to zero indicates a negative outcome. Also see [R] logistic; logistic displays estimates as odds ratios. Many users prefer the logistic command to logit. Results are the same regardless of which you use—both are the maximumlikelihood estimator. Several auxiliary commands that can be run after logit, probit, or logistic estimation are described in [R] logistic postestimation. A list of related estimation commands is given in [R] logistic. If estimating on grouped data, see [R] glogit. Options Model noconstant, offset(varname), constraints(constraints), collinear; see [R] estimation options. asis forces retention of perfect predictor variables and their associated perfectly predicted observations and may produce instabilities in maximization; see [R] probit. 
SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. or reports the estimated coefficients transformed to odds ratios, that is, eb rather than b. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated. or may be specified at estimation or when replaying previously estimated results. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. logit — Logistic regression, reporting coefficients 1079 The following options are available with logit but are not shown in the dialog box: nocoef specifies that the coefficient table not be displayed. This option is sometimes used by program writers but is of no use interactively. coeflegend; see [R] estimation options. Remarks and examples Remarks are presented under the following headings: Basic usage Model identification Basic usage logit fits maximum likelihood models with dichotomous dependent (left-hand-side) variables coded as 0/1 (or, more precisely, coded as 0 and not-0). Example 1 We have data on the make, weight, and mileage rating of 22 foreign and 52 domestic automobiles. We wish to fit a logit model explaining whether a car is foreign on the basis of its weight and mileage. Here is an overview of our data: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . keep make mpg weight foreign . describe Contains data from http://www.stata-press.com/data/r13/auto.dta obs: 74 1978 Automobile Data vars: 4 13 Apr 2013 17:45 size: 1,702 (_dta has notes) variable name make mpg weight foreign storage type str18 int int byte display format %-18s %8.0g %8.0gc %8.0g value label variable label origin Make and Model Mileage (mpg) Weight (lbs.) Car type Sorted by: foreign Note: dataset has changed since last saved . inspect foreign foreign: Car type Number of Observations # # # # # # 0 Negative Zero Positive # # Total Missing Total 52 22 74 - Integers 52 22 Nonintegers - 74 1 74 (2 unique values) foreign is labeled and all values are documented in the label. - 1080 logit — Logistic regression, reporting coefficients The variable foreign takes on two unique values, 0 and 1. The value 0 denotes a domestic car, and 1 denotes a foreign car. The model that we wish to fit is Pr(foreign = 1) = F (β0 + β1 weight + β2 mpg) where F (z) = ez /(1 + ez ) is the cumulative logistic distribution. To fit this model, we type . 
logit foreign weight mpg Iteration Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: 5: log log log log log log likelihood likelihood likelihood likelihood likelihood likelihood = = = = = = -45.03321 -29.238536 -27.244139 -27.175277 -27.175156 -27.175156 Logistic regression Number of obs LR chi2(2) Prob > chi2 Pseudo R2 Log likelihood = -27.175156 foreign Coef. weight mpg _cons -.0039067 -.1685869 13.70837 Std. Err. .0010116 .0919175 4.518709 z -3.86 -1.83 3.03 P>|z| 0.000 0.067 0.002 = = = = 74 35.72 0.0000 0.3966 [95% Conf. Interval] -.0058894 -.3487418 4.851859 -.001924 .011568 22.56487 We find that heavier cars are less likely to be foreign and that cars yielding better gas mileage are also less likely to be foreign, at least holding the weight of the car constant. Technical note Stata interprets a value of 0 as a negative outcome (failure) and treats all other values (except missing) as positive outcomes (successes). Thus if your dependent variable takes on the values 0 and 1, then 0 is interpreted as failure and 1 as success. If your dependent variable takes on the values 0, 1, and 2, then 0 is still interpreted as failure, but both 1 and 2 are treated as successes. If you prefer a more formal mathematical statement, when you type logit y x, Stata fits the model exp(xj β) Pr(yj 6= 0 | xj ) = 1 + exp(xj β) Model identification The logit command has one more feature, and it is probably the most useful. logit automatically checks the model for identification and, if it is underidentified, drops whatever variables and observations are necessary for estimation to proceed. (logistic, probit, and ivprobit do this as well.) logit — Logistic regression, reporting coefficients 1081 Example 2 Have you ever fit a logit model where one or more of your independent variables perfectly predicted one or the other outcome? For instance, consider the following data: Outcome y Independent variable x 0 0 0 1 1 1 0 0 Say that we wish to predict the outcome on the basis of the independent variable. The outcome is always zero whenever the independent variable is one. In our data, Pr(y = 0 | x = 1) = 1, which means that the logit coefficient on x must be minus infinity with a corresponding infinite standard error. At this point, you may suspect that we have a problem. Unfortunately, not all such problems are so easily detected, especially if you have a lot of independent variables in your model. If you have ever had such difficulties, you have experienced one of the more unpleasant aspects of computer optimization. The computer has no idea that it is trying to solve for an infinite coefficient as it begins its iterative process. All it knows is that at each step, making the coefficient a little bigger, or a little smaller, works wonders. It continues on its merry way until either 1) the whole thing comes crashing to the ground when a numerical overflow error occurs or 2) it reaches some predetermined cutoff that stops the process. In the meantime, you have been waiting. The estimates that you finally receive, if you receive any at all, may be nothing more than numerical roundoff. Stata watches for these sorts of problems, alerts us, fixes them, and properly fits the model. Let’s return to our automobile data. Among the variables we have in the data is one called repair, which takes on three values. A value of 1 indicates that the car has a poor repair record, 2 indicates an average record, and 3 indicates a better-than-average record. Here is a tabulation of our data: . 
use http://www.stata-press.com/data/r13/repair, clear (1978 Automobile Data) . tabulate foreign repair Car type 1 repair 2 3 Total Domestic Foreign 10 0 27 3 9 9 46 12 Total 10 30 18 58 All the cars with poor repair records (repair = 1) are domestic. If we were to attempt to predict foreign on the basis of the repair records, the predicted probability for the repair = 1 category would have to be zero. This in turn means that the logit coefficient must be minus infinity, and that would set most computer programs buzzing. 1082 logit — Logistic regression, reporting coefficients Let’s try Stata on this problem. . logit foreign b3.repair note: 1.repair != 0 predicts failure perfectly 1.repair dropped and 10 obs not used Iteration 0: log likelihood = -26.992087 Iteration 1: log likelihood = -22.483187 Iteration 2: log likelihood = -22.230498 Iteration 3: log likelihood = -22.229139 Iteration 4: log likelihood = -22.229138 Logistic regression Log likelihood = -22.229138 Std. Err. z Number of obs LR chi2(1) Prob > chi2 Pseudo R2 P>|z| = = = = 48 9.53 0.0020 0.1765 foreign Coef. [95% Conf. Interval] repair 1 2 0 -2.197225 (empty) .7698003 -2.85 0.004 -3.706005 -.6884436 _cons -1.98e-16 .4714045 -0.00 1.000 -.9239359 .9239359 Remember that all the cars with poor repair records (repair = 1) are domestic, so the model cannot be fit, or at least it cannot be fit if we restrict ourselves to finite coefficients. Stata noted that fact “note: 1.repair !=0 predicts failure perfectly”. This is Stata’s mathematically precise way of saying what we said in English. When repair is 1, the car is domestic. Stata then went on to say “1.repair dropped and 10 obs not used”. This is Stata eliminating the problem. First 1.repair had to be removed from the model because it would have an infinite coefficient. Then the 10 observations that led to the problem had to be eliminated, as well, so as not to bias the remaining coefficients in the model. The 10 observations that are not used are the 10 domestic cars that have poor repair records. Stata then fit what was left of the model, using the remaining observations. Because no observations remained for cars with poor repair records, Stata reports “(empty)” in the row for repair = 1. Technical note Stata is pretty smart about catching problems like this. It will catch “one-way causation by a dummy variable”, as we demonstrated above. Stata also watches for “two-way causation”, that is, a variable that perfectly determines the outcome, both successes and failures. Here Stata says, “so-and-so predicts outcome perfectly” and stops. Statistics dictates that no model can be fit. Stata also checks your data for collinear variables; it will say, “so-and-so omitted because of collinearity”. No observations need to be eliminated in this case, and model fitting will proceed without the offending variable. It will also catch a subtle problem that can arise with continuous data. For instance, if we were estimating the chances of surviving the first year after an operation, and if we included in our model age, and if all the persons over 65 died within the year, Stata would say, “age > 65 predicts failure perfectly”. It would then inform us about the fix-up it takes and fit what can be fit of our model. logit — Logistic regression, reporting coefficients 1083 logit (and logistic, probit, and ivprobit) will also occasionally display messages such as Note: 4 failures and 0 successes completely determined. There are two causes for a message like this. 
The first—and most unlikely—case occurs when a continuous variable (or a combination of a continuous variable with other continuous or dummy variables) is simply a great predictor of the dependent variable. Consider Stata’s auto.dta dataset with 6 observations removed. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . drop if foreign==0 & gear_ratio > 3.1 (6 observations deleted) . logit foreign mpg weight gear_ratio, nolog Logistic regression Log likelihood = -6.4874814 foreign Coef. mpg weight gear_ratio _cons -.4944907 -.0060919 15.70509 -21.39527 Std. Err. .2655508 .003101 8.166234 25.41486 z -1.86 -1.96 1.92 -0.84 Number of obs LR chi2(3) Prob > chi2 Pseudo R2 P>|z| 0.063 0.049 0.054 0.400 = = = = 68 72.64 0.0000 0.8484 [95% Conf. Interval] -1.014961 -.0121698 -.300436 -71.20747 .0259792 -.000014 31.71061 28.41694 Note: 4 failures and 0 successes completely determined. There are no missing standard errors in the output. If you receive the “completely determined” message and have one or more missing standard errors in your output, see the second case discussed below. Note gear ratio’s large coefficient. logit thought that the 4 observations with the smallest predicted probabilities were essentially predicted perfectly. . predict p (option pr assumed; Pr(foreign)) . sort p . list p in 1/4 p 1. 2. 3. 4. 1.34e-10 6.26e-09 7.84e-09 1.49e-08 If this happens to you, you do not have to do anything. Computationally, the model is sound. The second case discussed below requires careful examination. The second case occurs when the independent terms are all dummy variables or continuous ones with repeated values (for example, age). Here one or more of the estimated coefficients will have missing standard errors. For example, consider this dataset consisting of 5 observations. 1084 logit — Logistic regression, reporting coefficients . use http://www.stata-press.com/data/r13/logitxmpl, clear . list, separator(0) 1. 2. 3. 4. 5. 6. y x1 x2 0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 1 1 . logit y x1 x2 Iteration 0: log likelihood = Iteration 1: log likelihood = Iteration 2: log likelihood = Iteration 3: log likelihood = Iteration 4: log likelihood = Iteration 5: log likelihood = (output omitted ) Iteration 15996: log likelihood Iteration 15997: log likelihood Iteration 15998: log likelihood Iteration 15999: log likelihood Iteration 16000: log likelihood convergence not achieved Logistic regression -3.819085 -2.9527336 -2.8110282 -2.7811973 -2.7746107 -2.7730128 = = = = = -2.7725887 -2.7725887 -2.7725887 -2.7725887 -2.7725887 Coef. x1 x2 _cons 18.3704 18.3704 -18.3704 Std. Err. 2 . 1.414214 concave) concave) concave) concave) concave) Number of obs LR chi2(1) Prob > chi2 Pseudo R2 Log likelihood = -2.7725887 y (not (not (not (not (not z 9.19 . -12.99 P>|z| 0.000 . 0.000 = = = = 6 2.09 0.1480 0.2740 [95% Conf. Interval] 14.45047 . -21.14221 22.29033 . -15.5986 Note: 2 failures and 0 successes completely determined. convergence not achieved r(430); Three things are happening here. First, logit iterates almost forever and then declares nonconvergence. Second, logit can fit the outcome (y = 0) for the covariate pattern x1 = 0 and x2 = 0 (that is, the first two observations) perfectly. This observation is the “2 failures and 0 successes completely determined”. Third, if this observation is dropped, then x1, x2, and the constant are collinear. This is the cause of the nonconvergence, the message “completely determined”, and the missing standard errors. 
It happens when you have a covariate pattern (or patterns) with only one outcome and there is collinearity when the observations corresponding to this covariate pattern are dropped. If this happens to you, confirm the causes. First, identify the covariate pattern with only one outcome. (For your data, replace x1 and x2 with the independent variables of your model.) logit — Logistic regression, reporting coefficients 1085 . egen pattern = group(x1 x2) . quietly logit y x1 x2, iterate(100) . predict p (option pr assumed; Pr(y)) . summarize p Variable p Obs Mean 6 .3333333 Std. Dev. .2581989 Min Max 1.05e-08 .5 If successes were completely determined, that means that there are predicted probabilities that are almost 1. If failures were completely determined, that means that there are predicted probabilities that are almost 0. The latter is the case here, so we locate the corresponding value of pattern: . tabulate pattern if p < 1e-7 group(x1 x2) Freq. Percent Cum. 1 2 100.00 100.00 Total 2 100.00 Once we omit this covariate pattern from the estimation sample, logit can deal with the collinearity: . logit y x1 x2 if pattern != 1, nolog note: x2 omitted because of collinearity Logistic regression Number of obs LR chi2(1) Prob > chi2 Pseudo R2 Log likelihood = -2.7725887 y Coef. x1 x2 _cons 0 0 0 Std. Err. 2 (omitted) 1.414214 = = = = 4 0.00 1.0000 0.0000 z P>|z| [95% Conf. Interval] 0.00 1.000 -3.919928 3.919928 0.00 1.000 -2.771808 2.771808 We omit the collinear variable. Then we must decide whether to include or omit the observations with pattern = 1. We could include them, . logit y x1, nolog Logistic regression Number of obs LR chi2(1) Prob > chi2 Pseudo R2 Log likelihood = -3.6356349 y Coef. x1 _cons 1.098612 -1.098612 Std. Err. 1.825742 1.154701 z 0.60 -0.95 P>|z| 0.547 0.341 = = = = 6 0.37 0.5447 0.0480 [95% Conf. Interval] -2.479776 -3.361784 4.677001 1.164559 1086 logit — Logistic regression, reporting coefficients or exclude them, . logit y x1 if pattern != 1, nolog Logistic regression Number of obs LR chi2(1) Prob > chi2 Pseudo R2 Log likelihood = -2.7725887 y Coef. x1 _cons 0 0 Std. Err. z P>|z| 2 1.414214 0.00 0.00 1.000 1.000 = = = = 4 0.00 1.0000 0.0000 [95% Conf. Interval] -3.919928 -2.771808 3.919928 2.771808 If the covariate pattern that predicts outcome perfectly is meaningful, you may want to exclude these observations from the model. Here you would report that covariate pattern such and such predicted outcome perfectly and that the best model for the rest of the data is . . . . But, more likely, the perfect prediction was simply the result of having too many predictors in the model. Then you would omit the extraneous variables from further consideration and report the best model for all the data. 
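Pulling the steps above together, a sketch of the complete check for a "completely determined" message might look like the following; x1, x2, and the 1e-7 cutoff are taken from the example above and stand in for whatever covariates appear in your own model, and the pattern to exclude (here pattern 1) is whatever the tabulations identify.

. egen pattern = group(x1 x2)               // one identifier per covariate pattern
. quietly logit y x1 x2, iterate(100)       // cap the iterations so the fit stops
. predict p
. summarize p                               // look for probabilities near 0 or 1
. tabulate pattern if p < 1e-7              // failures completely determined
. tabulate pattern if p > 1 - 1e-7 & p < .  // analogous check for successes
. logit y x1 x2 if pattern != 1             // refit without the offending pattern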
Stored results logit stores the following in e(): Scalars e(N) e(N cds) e(N cdf) e(k) e(k eq) e(k eq model) e(k dv) e(df m) e(r2 p) e(ll) e(ll 0) e(N clust) e(chi2) e(p) e(rank) e(ic) e(rc) e(converged) number of observations number of completely determined successes number of completely determined failures number of parameters number of equations in e(b) number of equations in overall model test number of dependent variables model degrees of freedom pseudo-R-squared log likelihood log likelihood, constant-only model number of clusters χ2 significance of model test rank of e(V) number of iterations return code 1 if converged, 0 otherwise logit — Logistic regression, reporting coefficients Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(clustvar) e(offset) e(chi2type) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(estat cmd) e(predict) e(marginsnotok) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(ilog) e(gradient) e(mns) e(rules) e(V) e(V modelbased) Functions e(sample) 1087 logit command as typed name of dependent variable weight type weight expression title in estimation output name of cluster variable linear offset variable Wald or LR; type of model χ2 test vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement estat program used to implement predict predictions disallowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector vector of means of the independent variables information about perfect predictors variance–covariance matrix of the estimators model-based variance marks estimation sample Methods and formulas Cramer (2003, chap. 9) surveys the prehistory and history of the logit model. The word “logit” was coined by Berkson (1944) and is analogous to the word “probit”. For an introduction to probit and logit, see, for example, Aldrich and Nelson (1984), Cameron and Trivedi (2010), Greene (2012), Jones (2007), Long (1997), Long and Freese (2014), Pampel (2000), or Powers and Xie (2008). The likelihood function for logit is lnL = X j∈S wj lnF (xj b) + X j6∈S wj ln 1 − F (xj b) 1088 logit — Logistic regression, reporting coefficients where S is the set of all observations j , such that yj 6= 0, F (z) = ez /(1 + ez ), and wj denotes the optional weights. lnL is maximized as described in [R] maximize. This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. The scores are calculated as uj = {1 − F (xj b)}xj for the positive outcomes and −F (xj b)xj for the negative outcomes. logit also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. Joseph Berkson (1899–1982) was born in New York City and studied at the College of the City of New York, Columbia, and Johns Hopkins, earning both an MD and a doctorate in statistics. He then worked at Johns Hopkins before moving to the Mayo Clinic in 1931 as a biostatistician. 
Among many other contributions, his most influential one drew upon a long-sustained interest in the logistic function, especially his 1944 paper on bioassay, in which he introduced the term “logit”. Berkson was a frequent participant in controversy—sometimes humorous, sometimes bitter—on subjects such as the evidence for links between smoking and various diseases and the relative merits of probit and logit methods and of different calculation methods. References Aldrich, J. H., and F. D. Nelson. 1984. Linear Probability, Logit, and Probit Models. Newbury Park, CA: Sage. Archer, K. J., and S. A. Lemeshow. 2006. Goodness-of-fit test for a logistic regression model fitted using survey sample data. Stata Journal 6: 97–105. Berkson, J. 1944. Application of the logistic function to bio-assay. Journal of the American Statistical Association 39: 357–365. Buis, M. L. 2010a. Direct and indirect effects in a logit model. Stata Journal 10: 11–29. . 2010b. Stata tip 87: Interpretation of interactions in nonlinear models. Stata Journal 10: 305–308. Cameron, A. C., and P. K. Trivedi. 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press. Cleves, M. A., and A. Tosetto. 2000. sg139: Logistic regression when binary outcome is measured with uncertainty. Stata Technical Bulletin 55: 20–23. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 152–156. College Station, TX: Stata Press. Cramer, J. S. 2003. Logit Models from Economics and Other Fields. Cambridge: Cambridge University Press. Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. Hilbe, J. M. 2009. Logistic Regression Models. Boca Raton, FL: Chapman & Hill/CRC. Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley. Jones, A. 2007. Applied Econometrics for Health Economists: A Practical Guide. 2nd ed. Abingdon, UK: Radcliffe. Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2nd ed. New York: Wiley. Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage. Long, J. S., and J. Freese. 2014. Regression Models for Categorical Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata Press. Miranda, A., and S. Rabe-Hesketh. 2006. Maximum likelihood estimation of endogenous switching and sample selection models for binary, ordinal, and count variables. Stata Journal 6: 285–308. Mitchell, M. N., and X. Chen. 2005. Visualizing main effects and interactions for binary logit models. Stata Journal 5: 64–82. logit — Logistic regression, reporting coefficients 1089 O’Fallon, W. M. 1998. Berkson, Joseph. In Vol. 1 of Encyclopedia of Biostatistics, ed. P. Armitage and T. Colton, 290–295. Chichester, UK: Wiley. Orsini, N., R. Bellocco, and P. C. Sjölander. 2013. Doubly robust estimation in generalized linear models. Stata Journal 13: 185–205. Pampel, F. C. 2000. Logistic Regression: A Primer. Thousand Oaks, CA: Sage. Powers, D. A., and Y. Xie. 2008. Statistical Methods for Categorical Data Analysis. 2nd ed. Bingley, UK: Emerald. Pregibon, D. 1981. Logistic regression diagnostics. Annals of Statistics 9: 705–724. Schonlau, M. 2005. Boosted regression (boosting): An introductory tutorial and a Stata plugin. Stata Journal 5: 330–354. Xu, J., and J. S. Long. 2005. Confidence intervals for predicted outcomes in regression models for categorical outcomes. Stata Journal 5: 537–559. 
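For readability, the log likelihood and scores from Methods and formulas above can be restated in one place; this is only a transcription of the formulas already given there, written in standard notation:

\[
\ln L \;=\; \sum_{j \in S} w_j \,\ln F(x_j b) \;+\; \sum_{j \notin S} w_j \,\ln\bigl\{1 - F(x_j b)\bigr\},
\qquad F(z) \;=\; \frac{e^{z}}{1+e^{z}},
\]

where S is the set of observations j with y_j != 0 and w_j denotes the optional weights. The scores are u_j = {1 - F(x_j b)} x_j for the positive outcomes and u_j = -F(x_j b) x_j for the negative outcomes.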
Also see [R] logit postestimation — Postestimation tools for logit [R] brier — Brier score decomposition [R] cloglog — Complementary log-log regression [R] exlogistic — Exact logistic regression [R] glogit — Logit and probit regression for grouped data [R] logistic — Logistic regression, reporting odds ratios [R] probit — Probit regression [R] roc — Receiver operating characteristic (ROC) analysis [ME] melogit — Multilevel mixed-effects logistic regression [MI] estimation — Estimation commands for use with mi estimate [SVY] svy estimation — Estimation commands for survey data [XT] xtlogit — Fixed-effects, random-effects, and population-averaged logit models [U] 20 Estimation and postestimation commands Title logit postestimation — Postestimation tools for logit Description Remarks and examples Syntax for predict Methods and formulas Menu for predict References Options for predict Also see Description The following postestimation commands are of special interest after logit: Command Description estat classification report various summary statistics, including the classification table estat gof Pearson or Hosmer–Lemeshow goodness-of-fit test lroc compute area under ROC curve and graph the curve lsens graph sensitivity and specificity versus probability cutoff These commands are not appropriate after the svy prefix. The following standard postestimation commands are also available: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 2 forecast is not appropriate with mi or svy estimation results. lrtest is not appropriate with svy estimation results. 
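For instance, the likelihood-ratio test listed in the table above compares a stored full model with a nested one. The following minimal sketch uses the automobile model from example 1 of [R] logit purely for illustration; the name full is an arbitrary label.

. use http://www.stata-press.com/data/r13/auto
. logit foreign weight mpg
. estimates store full              // keep the full model's results
. logit foreign weight              // nested model omitting mpg
. lrtest full                       // LR test of the nested model against full
. estat ic                          // AIC and BIC for the current (nested) model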
1090 logit postestimation — Postestimation tools for logit 1091 Syntax for predict predict type newvar if in , statistic nooffset rules asif Description statistic Main pr xb stdp ∗ dbeta ∗ deviance ∗ dx2 ∗ ddeviance ∗ hat ∗ number ∗ residuals ∗ rstandard score probability of a positive outcome; the default linear prediction standard error of the prediction Pregibon (1981) ∆βb influence statistic deviance residual Hosmer, Lemeshow, and Sturdivant (2013) ∆ χ2 influence statistic Hosmer, Lemeshow, and Sturdivant (2013) ∆ D influence statistic Pregibon (1981) leverage sequential number of the covariate pattern Pearson residuals; adjusted for number sharing covariate pattern standardized Pearson residuals; adjusted for number sharing covariate pattern first derivative of the log likelihood with respect to xj β Unstarred statistics are available both in and out of sample; type predict . . . if e(sample) . . . if wanted only for the estimation sample. Starred statistics are calculated only for the estimation sample, even when if e(sample) is not specified. pr, xb, stdp, and score are the only options allowed with svy estimation results. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main pr, the default, calculates the probability of a positive outcome. xb calculates the linear prediction. stdp calculates the standard error of the linear prediction. dbeta calculates the Pregibon (1981) ∆βb influence statistic, a standardized measure of the difference in the coefficient vector that is due to deletion of the observation along with all others that share the same covariate pattern. In Hosmer, Lemeshow, and Sturdivant (2013, 154–155) jargon, this statistic is M -asymptotic; that is, it is adjusted for the number of observations that share the same covariate pattern. deviance calculates the deviance residual. dx2 calculates the Hosmer, Lemeshow, and Sturdivant (2013, 191) ∆χ2 influence statistic, reflecting the decrease in the Pearson χ2 that is due to deletion of the observation and all others that share the same covariate pattern. ddeviance calculates the Hosmer, Lemeshow, and Sturdivant (2013, 191) ∆D influence statistic, which is the change in the deviance residual that is due to deletion of the observation and all others that share the same covariate pattern. 1092 logit postestimation — Postestimation tools for logit hat calculates the Pregibon (1981) leverage or the diagonal elements of the hat matrix adjusted for the number of observations that share the same covariate pattern. number numbers the covariate patterns — observations with the same covariate pattern have the same number. Observations not used in estimation have number set to missing. The first covariate pattern is numbered 1, the second 2, and so on. residuals calculates the Pearson residual as given by Hosmer, Lemeshow, and Sturdivant (2013, 155) and adjusted for the number of observations that share the same covariate pattern. rstandard calculates the standardized Pearson residual as given by Hosmer, Lemeshow, and Sturdivant (2013, 191) and adjusted for the number of observations that share the same covariate pattern. score calculates the equation-level score, ∂ ln L/∂(xj β). Options nooffset is relevant only if you specified offset(varname) for logit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xj b rather than as xj b + offsetj . 
rules requests that Stata use any rules that were used to identify the model when making the prediction. By default, Stata calculates missing for excluded observations. asif requests that Stata ignore the rules and exclusion criteria and calculate predictions for all observations possible by using the estimated parameter from the model. Remarks and examples Once you have fit a logit model, you can obtain the predicted probabilities by using the predict command for both the estimation sample and other samples; see [U] 20 Estimation and postestimation commands and [R] predict. Here we will make only a few more comments. predict without arguments calculates the predicted probability of a positive outcome, that is, Pr(yj = 1) = F (xj b). With the xb option, predict calculates the linear combination xj b, where xj are the independent variables in the j th observation and b is the estimated parameter vector. This is sometimes known as the index function because the cumulative distribution function indexed at this value is the probability of a positive outcome. In both cases, Stata remembers any rules used to identify the model and calculates missing for excluded observations, unless rules or asif is specified. For information about the other statistics available after predict, see [R] logistic postestimation. Example 1: Predicted probabilities In example 2 of [R] logit, we fit the logit model logit foreign b3.repair. To obtain predicted probabilities, type . use http://www.stata-press.com/data/r13/repair (1978 Automobile Data) . logit foreign b3.repair note: 1.repair != 0 predicts failure perfectly 1.repair dropped and 10 obs not used (output omitted ) . predict p (option pr assumed; Pr(foreign)) (10 missing values generated) logit postestimation — Postestimation tools for logit . summarize foreign p Obs Variable foreign p 58 48 Mean .2068966 .25 Std. Dev. Min Max .4086186 .1956984 0 .1 1 .5 1093 Stata remembers any rules used to identify the model and sets predictions to missing for any excluded observations. logit dropped the variable 1.repair from our model and excluded 10 observations. Thus when we typed predict p, those same 10 observations were again excluded, and their predictions were set to missing. predict’s rules option uses the rules in the prediction. During estimation, we were told “1.repair != 0 predicts failure perfectly”, so the rule is that when 1.repair is not zero, we should predict 0 probability of success or a positive outcome: . predict p2, rules (option pr assumed; Pr(foreign)) . summarize foreign p p2 Variable Obs foreign p p2 58 48 58 Mean .2068966 .25 .2068966 Std. Dev. Min Max .4086186 .1956984 .2016268 0 .1 0 1 .5 .5 predict’s asif option ignores the rules and exclusion criteria and calculates predictions for all observations possible by using the estimated parameters from the model: . predict p3, asif (option pr assumed; Pr(foreign)) . summarize foreign p p2 p3 Variable Obs Mean foreign p p2 p3 58 48 58 58 .2068966 .25 .2068966 .2931035 Std. Dev. Min Max .4086186 .1956984 .2016268 .2016268 0 .1 0 .1 1 .5 .5 .5 Which is right? What predict does by default is the most conservative approach. If many observations had been excluded because of a simple rule, we could be reasonably certain that the rules prediction is correct. The asif prediction is correct only if the exclusion is a fluke, and we would be willing to exclude the variable from the analysis anyway. Then, however, we would refit the model to include the excluded observations. 
Example 2: Predictive margins We can use the command margins, contrast after logit to make comparisons on the probability scale. Let’s fit a model predicting low birthweight from characteristics of the mother: . use http://www.stata-press.com/data/r13/lbw, clear (Hosmer & Lemeshow data) 1094 logit postestimation — Postestimation tools for logit . logit low age i.race i.smoke ptl i.ht i.ui Iteration 0: log likelihood = -117.336 Iteration 1: log likelihood = -103.81846 Iteration 2: log likelihood = -103.40486 Iteration 3: log likelihood = -103.40384 Iteration 4: log likelihood = -103.40384 Logistic regression Number of obs LR chi2(7) Prob > chi2 Pseudo R2 Log likelihood = -103.40384 Std. Err. z P>|z| = = = = 189 27.86 0.0002 0.1187 low Coef. [95% Conf. Interval] age -.0403293 .0357127 -1.13 0.259 -.1103249 .0296663 race black other 1.009436 1.001908 .5025122 .4248342 2.01 2.36 0.045 0.018 .0245302 .1692485 1.994342 1.834568 smoke smoker ptl 1.ht 1.ui _cons .9631876 .6288678 1.358142 .8001832 -1.184127 .3904357 .3399067 .6289555 .4572306 .9187461 2.47 1.85 2.16 1.75 -1.29 0.014 0.064 0.031 0.080 0.197 .1979477 -.0373371 .125412 -.0959724 -2.984837 1.728427 1.295073 2.590872 1.696339 .6165818 The coefficients are log odds-ratios: conditional on the other predictors, smoking during pregnancy is associated with an increase of 0.96 in the log odds-ratios of low birthweight. The model is linear in the log odds-scale, so the estimate of 0.96 has the same interpretation, whatever the values of the other predictors might be. We could convert 0.96 to an odds ratio by replaying the results with logit, or. But what if we want to talk about the probability of low birthweight, and not the odds? Then we will need the command margins, contrast. We will use the r. contrast operator to compare each level of smoke with a reference level. (smoke has only two levels, so there will be only one comparison: a comparison of smokers with nonsmokers.) . margins r.smoke, contrast Contrasts of predictive margins Model VCE : OIM Expression : Pr(low), predict() smoke smoke (smoker vs nonsmoker) df chi2 P>chi2 1 6.32 0.0119 Contrast Delta-method Std. Err. .1832779 .0728814 [95% Conf. Interval] .0404329 .3261229 We see that maternal smoking is associated with an 18.3% increase in the probability of low birthweight. (We received a contrast in the probability scale because predicted probabilities are the default when margins is used after logit.) logit postestimation — Postestimation tools for logit 1095 The contrast of 18.3% is a difference of margins that are computed by averaging over the predictions for observations in the estimation sample. If the values of the other predictors were different, the contrast for smoke would be different, too. Let’s estimate the contrast for 25-year-old mothers: . margins r.smoke, contrast at(age=25) Contrasts of Model VCE Expression at predictive margins : OIM : Pr(low), predict() : age = smoke smoke (smoker vs nonsmoker) 25 df chi2 P>chi2 1 6.19 0.0129 Contrast Delta-method Std. Err. .1808089 .0726777 [95% Conf. Interval] .0383632 .3232547 Specifying a maternal age of 25 changed the contrast to 18.1%. Our contrast of probabilities changed because the logit model is nonlinear in the probability scale. A contrast of log odds-ratios would not have changed. Methods and formulas See Methods and formulas of the individual postestimation commands for details. References Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley. 
Newson, R. B. 2013. Attributable and unattributable risks and fractions and other scenario comparisons. Stata Journal 13: 672–698. Powers, D. A., H. Yoshioka, and M.-S. Yun. 2011. mvdcmp: Multivariate decomposition for nonlinear response models. Stata Journal 11: 556–576. Pregibon, D. 1981. Logistic regression diagnostics. Annals of Statistics 9: 705–724. Also see [R] logit — Logistic regression, reporting coefficients [R] estat classification — Classification statistics and table [R] estat gof — Pearson or Hosmer–Lemeshow goodness-of-fit test [R] lroc — Compute area under ROC curve and graph the curve [R] lsens — Graph sensitivity and specificity versus probability cutoff [U] 20 Estimation and postestimation commands Title loneway — Large one-way ANOVA, random effects, and reliability Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Options Acknowledgment Syntax loneway response var group var if in weight , options Description options Main expected value of F distribution; default is 1 median of F distribution; default is 1 exact confidence intervals (groups must be equal with no weights) set confidence level; default is level(95) mean median exact level(#) by is allowed; see [D] by. aweights are allowed; see [U] 11.1.6 weight. Menu Statistics > Linear models and related > ANOVA/MANOVA > Large one-way ANOVA Description loneway fits one-way analysis-of-variance (ANOVA) models on datasets with many levels of group var and presents different ancillary statistics from oneway (see [R] oneway): Feature Fit one-way model on fewer than 376 levels on more than 376 levels Bartlett’s test for equal variance Multiple-comparison tests Intragroup correlation and SE Intragroup correlation confidence interval Est. reliability of group-averaged score Est. SD of group effect Est. SD within group oneway loneway x x x x x x x x x x x x Options Main mean specifies that the expected value of the Fk−1,N −k distribution be used as the reference point Fm in the estimation of ρ instead of the default value of 1. 1096 loneway — Large one-way ANOVA, random effects, and reliability 1097 median specifies that the median of the Fk−1,N −k distribution be used as the reference point Fm in the estimation of ρ instead of the default value of 1. exact requests that exact confidence intervals be computed, as opposed to the default asymptotic confidence intervals. This option is allowed only if the groups are equal in size and weights are not used. level(#) specifies the confidence level, as a percentage, for confidence intervals of the coefficients. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. Remarks and examples Remarks are presented under the following headings: The one-way ANOVA model R-squared The random-effects ANOVA model Intraclass correlation Estimated reliability of the group-averaged score The one-way ANOVA model Example 1 loneway’s output looks like that of oneway, except that loneway presents more information at the end. Using our automobile dataset, we have created a (numeric) variable called manufacturer grp identifying the manufacturer of each car, and within each manufacturer we have retained a maximum of four models, selecting those with the lowest mpg. We can compute the intraclass correlation of mpg for all manufacturers with at least four models as follows: . use http://www.stata-press.com/data/r13/auto7 (1978 Automobile Data) . 
loneway mpg manufacturer_grp if nummake == 4 One-way Analysis of Variance for mpg: Mileage (mpg) Number of obs = 36 R-squared = 0.5228 Source SS df MS F Prob > F Between manufactur~p Within manufactur~p 621.88889 567.75 8 27 77.736111 21.027778 Total 1189.6389 Asy. S.E. 35 33.989683 Intraclass correlation 3.70 0.0049 [95% Conf. Interval] 0.40270 0.18770 0.03481 0.77060 Estimated SD of manufactur~p effect 3.765247 Estimated SD within manufactur~p 4.585605 Est. reliability of a manufactur~p mean 0.72950 (evaluated at n=4.00) In addition to the standard one-way ANOVA output, loneway produces the R-squared, the estimated standard deviation of the group effect, the estimated standard deviation within group, the intragroup correlation, the estimated reliability of the group-averaged mean, and, for unweighted data, the asymptotic standard error and confidence interval for the intragroup correlation. 1098 loneway — Large one-way ANOVA, random effects, and reliability R-squared The R-squared is, of course, simply the underlying R2 for a regression of response var on the levels of group var, or mpg on the various manufacturers here. The random-effects ANOVA model loneway assumes that we observe a variable, yij , measured for ni elements within k groups or classes such that yij = µ + αi + ij , i = 1, 2, . . . , k, j = 1, 2, . . . , ni 2 and σ2 , respectively. and αi and ij are independent zero-mean random variables with variance σα This is the random-effects ANOVA model, also known as the components-of-variance model, in which it is typically assumed that the yij are normally distributed. The interpretation with respect to our example is that the observed value of our response variable, mpg, is created in two steps. First, the ith manufacturer is chosen, and a value, αi , is determined — the typical mpg for that manufacturer less the overall mpg µ. Then a deviation, ij , is chosen for the j th model within this manufacturer. This is how much that particular automobile differs from the typical mpg value for models from this manufacturer. For our sample of 36 car models, the estimated standard deviations are σα = 3.8 and σ = 4.6. Thus a little more than half of the variation in mpg between cars is attributable to the car model, with the rest attributable to differences between manufacturers. These standard deviations differ from those that would be produced by a (standard) fixed-effects regression in that the regression would require the sum within each manufacturer of the ij , i. for the ith manufacturer, to be zero, whereas these estimates merely impose the constraint that the sum is expected to be zero. Intraclass correlation There are various estimators of the intraclass correlation, such as the pairwise estimator, which is defined as the Pearson product-moment correlation computed over all possible pairs of observations that can be constructed within groups. For a discussion of various estimators, see Donner (1986). loneway computes what is termed the analysis of variance, or ANOVA, estimator. This intraclass correlation is the theoretical upper bound on the variation in response var that is explainable by group var, of which R-squared is an overestimate because of the serendipity of fitting. This correlation is comparable to an R-squared — you do not have to square it. In our example, the intra-manu correlation, the correlation of mpg within manufacturer, is 0.40. 
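As described under Methods and formulas below, this is the ANOVA estimator ρA = (Fobs − 1)/(Fobs − 1 + g), and with no weights and equal group sizes g reduces to the common group size, here 4. A quick by-hand check from the ANOVA table above, using the displayed F statistic, is
. display (3.70 - 1)/(3.70 - 1 + 4)
which returns approximately 0.403 and agrees with the reported 0.40270 up to rounding of the displayed F.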
Because aweights were not used and the default correlation was computed (that is, the mean and median options were not specified), loneway also provided the asymptotic confidence interval and standard error of the intraclass correlation estimate. Estimated reliability of the group-averaged score The estimated reliability of the group-averaged score or mean has an interpretation similar to that of the intragroup correlation; it is a comparable number if we average response var by group var, or mpg by manu in our example. It is the theoretical upper bound of a regression of manufactureraveraged mpg on characteristics of manufacturers. Why would we want to collapse our 36-observation dataset into a 9-observation dataset of manufacturer averages? Because the 36 observations might be a mirage. When General Motors builds cars, do they sometimes put a Pontiac label and sometimes a Chevrolet label on them, so that it appears in our data as if we have two cars when we really have loneway — Large one-way ANOVA, random effects, and reliability 1099 only one, replicated? If that were the case, and if it were the case for many other manufacturers, then we would be forced to admit that we do not have data on 36 cars; we instead have data on nine manufacturer-averaged characteristics. Stored results loneway stores the following in r(): Scalars r(N) r(rho) r(lb) r(ub) number of observations intraclass correlation lower bound of 95% CI for rho upper bound of 95% CI for rho r(rho t) r(se) r(sd w) r(sd b) estimated reliability asymp. SE of intraclass correlation estimated SD within group estimated SD of group effect Methods and formulas The mean squares in the loneway’s ANOVA table are computed as X MSα = wi· (y i· − y ·· )2 /(k − 1) i and MS = XX i wij (yij − y i· )2 /(N − k) j in which wi· = X wij w·· = j X wi· y i· = i X wij yij /wi· and y .. = j X wi· y i· /w·· i The corresponding expected values of these mean squares are E(MSα ) = σ2 + gσα2 in which g= and w·· − E(MS ) = σ2 wi·2 /w·· k−1 P i In the unweighted case, we get P N − i n2i /N k−1 As expected, g = m for the case of no weights and equal group sizes in the data, that is, ni = m for all i. Replacing the expected values with the observed values and solving yields the ANOVA estimates 2 of σα and σ2 . Substituting these into the definition of the intraclass correlation g= ρ= σα2 σα2 + σ2 yields the ANOVA estimator of the intraclass correlation: ρA = Fobs − 1 Fobs − 1 + g 1100 loneway — Large one-way ANOVA, random effects, and reliability Fobs is the observed value of the F statistic from the ANOVA table. For no weights and equal ni , ρA = roh, which is the intragroup correlation defined by Kish (1965). Two slightly different estimators are available through the mean and median options (Gleason 1997). If either of these options is specified, the estimate of ρ becomes ρ= Fobs − Fm Fobs + (g − 1)Fm For the mean option, Fm = E(Fk−1,N −K ) = (N − k)/(N − k − 2), that is, the expected value of the ANOVA table’s F statistic. For the median option, Fm is simply the median of the F statistic. Setting Fm to 1 gives ρA , so for large samples, these different point estimators are essentially the same. Also, because the intraclass correlation of the random-effects model is by definition nonnegative, for any of the three possible point estimators, ρ is truncated to zero if Fobs is less than Fm . For no weighting, interval estimators for ρA are computed. 
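In practice, these interval estimators are requested from loneway itself. A brief sketch continuing example 1, where the groups are equal in size and no weights are used (output omitted):
. loneway mpg manufacturer_grp if nummake == 4, exact
. loneway mpg manufacturer_grp if nummake == 4, level(90)
The first command requests the exact interval defined below; the second recomputes the default asymptotic interval at the 90% level.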
If the groups are equal sized (all ni equal) and the exact option is specified, the following exact (assuming that the yij are normally distributed) 100(1 − α)% confidence interval is computed: Fobs − Fm Fu Fobs − Fm Fl , Fobs + (g − 1)Fm Fu Fobs + (g − 1)Fm Fl with Fm = 1, Fl = Fα/2,k−1,N −k , and Fu = F1−α/2,k−1,N −k , F·,k−1,N −k being the cumulative distribution function for the F distribution with k − 1 and N − k degrees of freedom. If mean or median is specified, Fm is defined as above. If the groups are equal sized and exact is not specified, the following asymptotic 100(1 − α)% confidence interval for ρA is computed, p p ρA − zα/2 V (ρA ), ρA + zα/2 V (ρA ) p where zα/2 is the 100(1 − α/2) percentile of the standard normal distribution and V (ρA ) is the asymptotic standard error of ρ defined below. This confidence interval is also available for unequal groups. It is not applicable and, therefore, not computed for the estimates of ρ provided by the mean and median options. Again, because the intraclass coefficient is nonnegative, if the lower bound is negative for either confidence interval, it is truncated to zero. As might be expected, the coverage probability of a truncated interval is higher than its nominal value. The asymptotic standard error of ρA , assuming that the yij are normally distributed, is also computed when appropriate, namely, for unweighted data and when ρA is computed (neither the mean option nor the median option is specified): V (ρA ) = 2(1 − ρ)2 (A + B + C) g2 with A= {1 + ρ(g − 1)}2 N −k (1 − ρ){1 + ρ(2g − 1)} k−1 P 2 P 3 P 2 −1 ρ { ni − 2N ni + N −2 ( n2i )2 } C= (k − 1)2 B= and ρA is substituted for ρ (Donner 1986). loneway — Large one-way ANOVA, random effects, and reliability 1101 The estimated reliability of the group-averaged score, known as the Spearman – Brown prediction formula in the psychometric literature (Winer, Brown, and Michels 1991, 1014), is ρt = tρ 1 + (t − 1)ρ for group size t. loneway computes ρt for t = g . p The estimated standard deviation of the group effect is σα = (MSα − MS )/g . This deviation comes from the assumption that an observation is derived by adding a group effect to a within-group effect. or The estimated standard deviation within group is the square root of the mean square due to error, √ MS . Acknowledgment We thank John Gleason of Syracuse University (retired) for his contributions to improving loneway. References Donner, A. 1986. A review of inference procedures for the intraclass correlation coefficient in the one-way random effects model. International Statistical Review 54: 67–82. Gleason, J. R. 1997. sg65: Computing intraclass correlations and large ANOVAs. Stata Technical Bulletin 35: 25–31. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 167–176. College Station, TX: Stata Press. Kish, L. 1965. Survey Sampling. New York: Wiley. Marchenko, Y. V. 2006. Estimating variance components in Stata. Stata Journal 6: 1–21. Winer, B. J., D. R. Brown, and K. M. Michels. 1991. Statistical Principles in Experimental Design. 3rd ed. New York: McGraw–Hill. 
Also see [R] anova — Analysis of variance and covariance [R] icc — Intraclass correlation coefficients [R] oneway — One-way analysis of variance Title lowess — Lowess smoothing Syntax Remarks and examples Also see Menu Methods and formulas Description Acknowledgment Options References Syntax lowess yvar xvar if in , options Description options Main mean noweight bwidth(#) logit adjust nograph generate(newvar) running-mean smooth; default is running-line least squares suppress weighted regressions; default is tricube weighting function use # for the bandwidth; default is bwidth(0.8) transform dependent variable to logits adjust smoothed mean to equal mean of dependent variable suppress graph create newvar containing smoothed values of yvar Plot marker options marker label options change look of markers (color, size, etc.) add marker labels; change look or position Smoothed line lineopts(cline options) affect rendition of the smoothed line Add plots addplot(plot) add other plots to generated graph Y axis, X axis, Titles, Legend, Overall, By twoway options any of the options documented in [G-3] twoway options yvar and xvar may contain time-series operators; see [U] 11.4.4 Time-series varlists. Menu Statistics > Nonparametric analysis > Lowess smoothing Description lowess carries out a locally weighted regression of yvar on xvar, displays the graph, and optionally saves the smoothed variable. Warning: lowess is computationally intensive and may therefore take a long time to run on a slow computer. Lowess calculations on 1,000 observations, for instance, require performing 1,000 regressions. 1102 lowess — Lowess smoothing 1103 Options Main mean specifies running-mean smoothing; the default is running-line least-squares smoothing. noweight prevents the use of Cleveland’s (1979) tricube weighting function; the default is to use the weighting function. bwidth(#) specifies the bandwidth. Centered subsets of bwidth() × N observations are used for calculating smoothed values for each point in the data except for the end points, where smaller, uncentered subsets are used. The greater the bwidth(), the greater the smoothing. The default is 0.8. logit transforms the smoothed yvar into logits. Predicted values less than 0.0001 or greater than 0.9999 are set to 1/N and 1 − 1/N , respectively, before taking logits. adjust adjusts the mean of the smoothed yvar to equal the mean of yvar by multiplying by an appropriate factor. This option is useful when smoothing binary (0/1) data. nograph suppresses displaying the graph. generate(newvar) creates newvar containing the smoothed values of yvar. Plot marker options affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G-3] marker options. marker label options specify if and how the markers are to be labeled; see [G-3] marker label options. Smoothed line lineopts(cline options) affects the rendition of the lowess-smoothed line; see [G-3] cline options. Add plots addplot(plot) provides a way to add other plots to the generated graph; see [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall, By twoway options are any of the options documented in [G-3] twoway options. These include options for titling the graph (see [G-3] title options), options for saving the graph to disk (see [G-3] saving option), and the by() option (see [G-3] by option). Remarks and examples By default, lowess provides locally weighted scatterplot smoothing. 
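Used with the generate() and nograph options, lowess can also serve purely as a smoother that adds a variable to the dataset rather than drawing a graph. A minimal sketch using the automobile data (the variable name mpg_sm is arbitrary):
. use http://www.stata-press.com/data/r13/auto
. lowess mpg weight, generate(mpg_sm) nograph
. list weight mpg mpg_sm in 1/5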
The basic idea is to create a new variable (newvar) that, for each yvar yi , contains the corresponding smoothed value. The smoothed values are obtained by running a regression of yvar on xvar by using only the data (xi , yi ) and a few of the data near this point. In lowess, the regression is weighted so that the central point (xi , yi ) gets the highest weight and points that are farther away (based on the distance |xj − xi |) receive less weight. The estimated regression line is then used to predict the smoothed value ybi for yi only. The procedure is repeated to obtain the remaining smoothed values, which means that a separate weighted regression is performed for every point in the data. Lowess is a desirable smoother because of its locality — it tends to follow the data. Polynomial smoothing methods, for instance, are global in that what happens on the extreme left of a scatterplot can affect the fitted values on the extreme right. 1104 lowess — Lowess smoothing Example 1 The amount of smoothing is affected by bwidth(#). You are warned to experiment with different values. For instance, . use http://www.stata-press.com/data/r13/lowess1 (example data for lowess) . lowess h1 depth 0 5 Wet hole 1 10 15 Lowess smoother 0 100 200 depth 300 400 300 400 bandwidth = .8 Now compare that with . lowess h1 depth, bwidth(.4) 0 5 Wet hole 1 10 15 Lowess smoother 0 100 200 depth bandwidth = .4 In the first case, the default bandwidth of 0.8 is used, meaning that 80% of the data are used in smoothing each point. In the second case, we explicitly specified a bandwidth of 0.4. Smaller bandwidths follow the original data more closely. lowess — Lowess smoothing 1105 Example 2 Two lowess options are especially useful with binary (0/1) data: adjust and logit. adjust adjusts the resulting curve (by multiplication) so that the mean of the smoothed values is equal to the mean of the unsmoothed values. logit specifies that the smoothed curve be in terms of the log of the odds ratio: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . lowess foreign mpg, ylabel(0 "Domestic" 1 "Foreign") jitter(5) adjust Lowess smoother Domestic Car type Foreign Mean adjusted smooth 10 20 30 40 Mileage (mpg) bandwidth = .8 . lowess foreign mpg, logit yline(0) Lowess smoother −4 −2 Car type 0 2 4 Logit transformed smooth 10 20 30 40 Mileage (mpg) bandwidth = .8 With binary data, if you do not use the logit option, it is a good idea to specify graph’s jitter() option; see [G-2] graph twoway scatter. Because the underlying data (whether the car was manufactured outside the United States here) take on only two values, raw data points are more likely to be on top of each other, thus making it impossible to tell how many points there are. graph’s jitter() option adds some noise to the data to shift the points around. This noise affects only the location of points on the graph, not the lowess curve. 1106 lowess — Lowess smoothing When you specify the logit option, the display of the raw data is suppressed. Technical note lowess can be used for more than just lowess smoothing. Lowess can be usefully thought of as a combination of two smoothing concepts: the use of predicted values from regression (rather than means) for imputing a smoothed value and the use of the tricube weighting function (rather than a constant weighting function). lowess allows you to combine these concepts freely. 
You can use line smoothing without weighting (specify noweight), mean smoothing with tricube weighting (specify mean), or mean smoothing without weighting (specify mean and noweight).

Methods and formulas

Let yi and xi be the two variables, and assume that the data are ordered so that xi ≤ xi+1 for i = 1, . . . , N − 1. For each yi, a smoothed value yi^s is calculated.

The subset used in calculating yi^s is indices i− = max(1, i − k) through i+ = min(i + k, N), where k = ⌊(N × bwidth − 0.5)/2⌋. The weights for each of the observations between j = i−, . . . , i+ are either 1 (noweight) or the tricube (default),

    wj = {1 − (|xj − xi| / ∆)^3}^3

where ∆ = 1.0001 max(xi+ − xi, xi − xi−). The smoothed value yi^s is then the (weighted) mean or the (weighted) regression prediction at xi.

William Swain Cleveland (1943– ) studied mathematics and statistics at Princeton and Yale. He worked for several years at Bell Labs in New Jersey and now teaches statistics and computer science at Purdue. He has made key contributions in many areas of statistics, including graphics and data visualization, time series, environmental applications, and analysis of Internet traffic data.

Acknowledgment

lowess is a modified version of a command originally written by Patrick Royston of the MRC Clinical Trials Unit, London, and coauthor of the Stata Press book Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model.

References

Chambers, J. M., W. S. Cleveland, B. Kleiner, and P. A. Tukey. 1983. Graphical Methods for Data Analysis. Belmont, CA: Wadsworth.
Cleveland, W. S. 1979. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74: 829–836.
Cleveland, W. S. 1993. Visualizing Data. Summit, NJ: Hobart.
Cleveland, W. S. 1994. The Elements of Graphing Data. Rev. ed. Summit, NJ: Hobart.
Cox, N. J. 2005. Speaking Stata: Smoothing in various directions. Stata Journal 5: 574–593.
Goodall, C. 1990. A survey of smoothing techniques. In Modern Methods of Data Analysis, ed. J. Fox and J. S. Long, 126–176. Newbury Park, CA: Sage.
Lindsey, C., and S. J. Sheather. 2010. Model fit assessment via marginal model plots. Stata Journal 10: 215–225.
Royston, P. 1991. gr6: Lowess smoothing. Stata Technical Bulletin 3: 7–9. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 41–44. College Station, TX: Stata Press.
Royston, P., and N. J. Cox. 2005. A multivariable scatterplot smoother. Stata Journal 5: 405–412.
Salgado-Ugarte, I. H., and M. Shimizu. 1995. snp8: Robust scatterplot smoothing: Enhancements to Stata's ksm. Stata Technical Bulletin 25: 23–26. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 190–194. College Station, TX: Stata Press.
Sasieni, P. D. 1994. snp7: Natural cubic splines. Stata Technical Bulletin 22: 19–22. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 171–174. College Station, TX: Stata Press.
Also see [R] lpoly — Kernel-weighted local polynomial smoothing [R] smooth — Robust nonlinear smoother [D] ipolate — Linearly interpolate (extrapolate) values Title lpoly — Kernel-weighted local polynomial smoothing Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax lpoly yvar xvar if in options weight , options Description Main kernel(kernel) bwidth(# | varname) degree(#) generate( newvarx newvars ) n(#) at(varname) nograph noscatter specify kernel function; default is kernel(epanechnikov) specify kernel bandwidth specify degree of the polynomial smooth; default is degree(0) store smoothing grid in newvarx and smoothed points in newvars obtain the smooth at # points; default is min(N , 50) obtain the smooth at the values specified by varname suppress graph suppress scatterplot only SE/CI ci level(#) se(newvar) pwidth(#) var(# | varname) plot confidence bands set confidence level; default is level(95) store standard errors in newvar specify pilot bandwidth for standard error calculation specify estimates of residual variance Scatterplot marker options marker label options change look of markers (color, size, etc.) add marker labels; change look or position Smoothed line lineopts(cline options) affect rendition of the smoothed line CI plot ciopts(cline options) affect rendition of the confidence bands Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options 1108 lpoly — Kernel-weighted local polynomial smoothing kernel Description epanechnikov epan2 biweight cosine gaussian parzen rectangle triangle Epanechnikov kernel function; the default alternative Epanechnikov kernel function biweight kernel function cosine trace kernel function Gaussian kernel function Parzen kernel function rectangle kernel function triangle kernel function 1109 fweights and aweights are allowed; see [U] 11.1.6 weight. Menu Statistics > Nonparametric analysis > Local polynomial smoothing Description lpoly performs a kernel-weighted local polynomial regression of yvar on xvar and displays a graph of the smoothed values with (optional) confidence bands. Options Main kernel(kernel) specifies the kernel function for use in calculating the weighted local polynomial estimate. The default is kernel(epanechnikov). bwidth(# | varname) specifies the half-width of the kernel—the width of the smoothing window around each point. If bwidth() is not specified, a rule-of-thumb (ROT) bandwidth estimator is calculated and used. A local variable bandwidth may be specified in varname, in conjunction with an explicit smoothing grid using the at() option. degree(#) specifies the degree of the polynomial to be used in the smoothing. The default is degree(0), meaning local-mean smoothing. generate( [ newvarx ] newvars ) stores the smoothing grid in newvarx and the smoothed values in newvars . If at() is not specified, then both newvarx and newvars must be specified. Otherwise, only newvars is to be specified. n(#) specifies the number of points at which the smooth is to be calculated. The default is min(N, 50), where N is the number of observations. at(varname) specifies a variable that contains the values at which the smooth should be calculated. By default, the smoothing is done on an equally spaced grid, but you can use at() to instead perform the smoothing at the observed x’s, for example. 
This option also allows you to more easily obtain smooths for different variables or different subsamples of a variable and then overlay the estimates for comparison. nograph suppresses drawing the graph of the estimated smooth. This option is often used with the generate() option. 1110 lpoly — Kernel-weighted local polynomial smoothing noscatter suppresses superimposing a scatterplot of the observed data over the smooth. This option is useful when the number of resulting points would be so large as to clutter the graph. SE/CI ci plots confidence bands, using the confidence level specified in level(). level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. se(newvar) stores the estimates of the standard errors in newvar. This option requires specifying generate() or at(). pwidth(#) specifies the pilot bandwidth to be used for standard-error computations. The default is chosen to be 1.5 times the value of the ROT bandwidth selector. If you specify pwidth() without specifying se() or ci, then the ci option is assumed. var(# | varname) specifies an estimate of a constant residual variance or a variable containing estimates of the residual variances at each grid point required for standard-error computation. By default, the residual variance at each smoothing point is estimated by the normalized weighted residual sum of squares obtained from locally fitting a polynomial of order p + 2, where p is the degree specified in degree(). var(varname) is allowed only if at() is specified. If you specify var() without specifying se() or ci, then the ci option is assumed. Scatterplot marker options affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G-3] marker options. marker label options specify if and how the markers are to be labeled; see [G-3] marker label options. Smoothed line lineopts(cline options) affects the rendition of the smoothed line; see [G-3] cline options. CI plot ciopts(cline options) affects the rendition of the confidence bands; see [G-3] cline options. Add plots addplot(plot) provides a way to add other plots to the generated graph; see [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Remarks and examples Remarks are presented under the following headings: Introduction Local polynomial smoothing Choice of a bandwidth Confidence bands lpoly — Kernel-weighted local polynomial smoothing 1111 Introduction The last 25 years or so has seen a significant outgrowth in the literature on scatterplot smoothing, otherwise known as univariate nonparametric regression. Of most appeal is the idea of making no assumptions about the functional form for the expected value of a response given a regressor, but instead allowing the data to “speak for themselves”. Various methods and estimators fall into the category of nonparametric regression, including local mean smoothing as described independently by Nadaraya (1964) and Watson (1964), the Gasser and Müller (1979) estimator, locally weighted scatterplot smoothing (LOWESS) as described by Cleveland (1979), wavelets (for example, Donoho [1995]), and splines (Eubank 1999), to name a few. 
Much of the vast literature focuses on automating the amount of smoothing to be performed and dealing with the bias/variance tradeoff inherent to this type of estimation. For example, for Nadaraya–Watson the amount of smoothing is controlled by choosing a bandwidth. Smoothing via local polynomials is by no means a new idea but instead one that has been rediscovered in recent years in articles such as Fan (1992). A natural extension of the local mean smoothing of Nadaraya–Watson, local polynomial regression involves fitting the response to a polynomial form of the regressor via locally weighted least squares. Higher-order polynomials have better bias properties than the zero-degree local polynomials of the Nadaraya–Watson estimator; in general, higher-order polynomials do not require bias adjustment at the boundary of the regression space. For a definitive reference on local polynomial smoothing, see Fan and Gijbels (1996). Local polynomial smoothing Consider a set of scatterplot data {(x1 , y1 ), . . . , (xn , yn )} from the model yi = m(xi ) + σ(xi )i (1) for some unknown mean and variance functions m(·) and σ 2 (·), and symmetric errors i with E(i ) = 0 and Var(i ) = 1. The goal is to estimate m(x0 ) = E[Y |X = x0 ], making no assumption about the functional form of m(·). lpoly estimates m(x0 ) as the constant term (intercept) of a regression, weighted by the kernel function specified in kernel(), of yvar on the polynomial terms (xvar −x0 ), (xvar −x0 )2 , . . . , (xvar − x0 )p for each smoothing point x0 . The degree of the polynomial, p, is specified in degree(), the amount of smoothing is controlled by the bandwidth specified in bwidth(), and the chosen kernel function is specified in kernel(). Example 1 Consider the motorcycle data as examined (among other places) in Fan and Gijbels (1996). The data consist of 133 observations and measure the acceleration (accel measured in grams [g]) of a dummy’s head during impact over time (time measured in milliseconds). For these data, we use lpoly to fit a local cubic polynomial with the default bandwidth (obtained using the ROT method) and the default Epanechnikov kernel. 1112 lpoly — Kernel-weighted local polynomial smoothing . use http://www.stata-press.com/data/r13/motorcycle (Motorcycle data from Fan & Gijbels (1996)) . lpoly accel time, degree(3) −150 −100 acceleration (g) −50 0 50 100 Local polynomial smooth 0 20 40 60 time (msec) kernel = epanechnikov, degree = 3, bandwidth = 6.04 The default bandwidth and kernel settings do not provide a satisfactory fit in this example. To improve the fit, we can either supply a different bandwidth by using the bwidth() option or specify a different kernel by using the kernel() option. For example, using the alternative Epanechnikov kernel, kernel(epan2), below provides a better fit for these data. . lpoly accel time, degree(3) kernel(epan2) −150 −100 acceleration (g) −50 0 50 100 Local polynomial smooth 0 20 40 60 time (msec) kernel = epan2, degree = 3, bandwidth = 6.88 Technical note lpoly allows specifying in degree() both odd and even orders of the polynomial to be used for the smoothing. However, the odd-order, 2k + 1, polynomial approximations are preferable. They have lpoly — Kernel-weighted local polynomial smoothing 1113 an extra parameter compared with the even-order, 2k , approximations, which leads to a significant bias reduction and there is no increase of variability associated with adding this extra parameter. 
Using an odd order when estimating the regression function is therefore usually sufficient. For a more thorough discussion, see Fan and Gijbels (1996). Choice of a bandwidth The choice of a bandwidth is crucial for many smoothing techniques, including local polynomial smoothing. In general, using a large bandwidth gives smooths with a large bias, whereas a small bandwidth may result in highly variable smoothed values. Various techniques exist for optimal bandwidth selection. By default, lpoly uses the ROT method to estimate the bandwidth used for the smoothing; see Methods and formulas for details. Example 2 Using the motorcycle data, we demonstrate how a local linear polynomial fit changes using different bandwidths. . lpoly accel time, degree(1) kernel(epan2) bwidth(1) generate(at smooth1) > nograph . lpoly accel time, degree(1) kernel(epan2) bwidth(7) at(at) generate(smooth2) > nograph . label variable smooth1 "smooth: width = 1" . label variable smooth2 "smooth: width = 7" . lpoly accel time, degree(1) kernel(epan2) at(at) addplot(line smooth* at) > legend(label(2 "smooth: width = 3.42 (ROT)")) note("kernel = epan2, degree = 1") −150 −100 acceleration (g) −50 0 50 100 Local polynomial smooth 0 20 40 60 time (msec) acceleration (g) smooth: width = 1 smooth: width = 3.42 (ROT) smooth: width = 7 kernel = epan2, degree = 1 From this graph, we can see that the local linear polynomial fit with larger bandwidth (width = 7) corresponds to a smoother line but fails to fit the curvature of the scatterplot data. The smooth obtained using the width equal to one seems to fit most data points, but the corresponding line has several spikes indicating larger variability. The smooth obtained using the ROT bandwidth estimator seems to have a good tradeoff between the fit and variability in this example. 1114 lpoly — Kernel-weighted local polynomial smoothing In the above, we also demonstrated how the generate() and addplot() options may be used to produce overlaid plots obtained from lpoly with different options. The nograph option saves time when you need to save only results with generate(). However, to avoid generating variables manually, one can use twoway lpoly instead; see [G-2] graph twoway lpoly for more details. . twoway scatter accel time || > lpoly accel time, degree(1) kernel(epan2) lpattern(solid) || > lpoly accel time, degree(1) kernel(epan2) bwidth(1) || > lpoly accel time, degree(1) kernel(epan2) bwidth(7) || > , legend(label(2 "smooth: width = 3.42 (ROT)") label(3 "smooth: width = 1") > label(4 "smooth: width = 7")) > title("Local polynomial smooth") note("kernel = epan2, degree = 1") > xtitle("time (msec)") ytitle("acceleration (g)") −150 −100 acceleration (g) −50 0 50 100 Local polynomial smooth 0 20 40 60 time (msec) acceleration (g) smooth: width = 1 smooth: width = 3.42 (ROT) smooth: width = 7 kernel = epan2, degree = 1 The ROT estimate is commonly used as an initial guess for the amount of smoothing; this approach may be sufficient when the choice of a bandwidth is less important. In other cases, you can pick your own bandwidth. When the shape of the regression function has a combination of peaked and flat regions, a variable bandwidth may be preferable over the constant bandwidth to allow for different degrees of smoothness in different regions. The bwidth() option allows you to specify the values of the local variable bandwidths as those stored in a variable in your data. 
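For instance, a variable-bandwidth fit for the motorcycle data might be sketched as follows; the evaluation grid, the 15-millisecond cutoff, and the bandwidth values of 8 and 3 are purely illustrative:
. use http://www.stata-press.com/data/r13/motorcycle, clear
. range grid 0 60
. generate bw = cond(grid < 15, 8, 3)
. lpoly accel time, degree(1) kernel(epan2) bwidth(bw) at(grid)
Here a wider bandwidth is used over the flat initial segment of the series and a narrower one where the curve has sharper features.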
Similar issues with bias and variability arise when choosing a pilot bandwidth (the pwidth() option) used to compute standard errors of the local polynomial smoother. The default value is chosen to be 1.5 × ROT. For a review of methods for pilot bandwidth selection, see Fan and Gijbels (1996). Confidence bands The established asymptotic normality of the local polynomial estimators under certain conditions allows the construction of approximate confidence bands. lpoly offers the ci option to plot these bands. lpoly — Kernel-weighted local polynomial smoothing 1115 Example 3 Let us plot the confidence bands for the local polynomial fit from example 1. . lpoly accel time, degree(3) kernel(epan2) ci −150 −100 acceleration (g) −50 0 50 100 Local polynomial smooth 0 20 40 60 time (msec) 95% CI acceleration (g) lpoly smooth kernel = epan2, degree = 3, bandwidth = 6.88, pwidth = 10.33 You can obtain graphs with overlaid confidence bands by using twoway lpolyci; see [G-2] graph twoway lpolyci for examples. Constructing the confidence intervals involves computing standard errors obtained by taking a square root of the estimate of the conditional variance of the local polynomial estimator at each grid point x0 . Estimating the conditional variance requires fitting a polynomial of a higher order locally by using a different bandwidth, the pilot bandwidth. The value of the pilot bandwidth may be supplied by using pwidth(). By default, the value of 1.5 × ROT is used. Also, estimates of the residual variance σ 2 (x0 ) at each grid point, x0 , are required to obtain the estimates of the conditional variances. These estimates may be supplied by using the var() option. By default, they are computed using the normalized weighted residual sum of squares from a local polynomial fit of a higher order. See Methods and formulas for details. The standard errors may be saved by using se(). Stored results lpoly stores the following in r(): Scalars r(degree) r(ngrid) r(N) Macros r(kernel) smoothing polynomial degree number of successful regressions sample size name of kernel r(bwidth) r(pwidth) bandwidth of the smooth pilot bandwidth 1116 lpoly — Kernel-weighted local polynomial smoothing Methods and formulas Consider model (1), written in matrix notation, y = m(x) + where y and x are the n × 1 vectors of scatterplot values, is the n × 1 vector of errors with zero mean and covariance matrix Σ = diag{σ(xi )}In , and m() and σ() are some unknown functions. Define m(x0 ) = E[Y |X = x0 ] and σ 2 (x0 ) = Var[Y |X = x0 ] to be the conditional mean and conditional variance of random variable Y (residual variance), respectively, for some realization x0 of random variable X . The method of local polynomial smoothing is based on the approximation of m(x) locally by a pth order polynomial in (x − x0 ) for some x in the neighborhood of x0 . For the scatterplot data {(x1 , y1 ), . . . , (xn , yn )}, the pth-order local polynomial smooth m(x b 0 ) is equal to βb0 , an estimate of the intercept of the weighted linear regression, βb = (XT WX)−1 XT Wy (2) where βb = (βb0 , βb1 , . . . , βbp )T is the vector of estimated regression coefficients (with {βbj = (j!)−1 m b (j) (x)|x=x0 , j = 0, . . . , p} also representing estimated coefficients from a corresponding n,p Taylor expansion); X = {(xi − x0 )j }i,j=1,0 is a design matrix; and W = diag{Kh (xi − x0 )}n×n is a weighting matrix with weights Kh (·) defined as Kh (x) = h−1 K(x/h), with K(·) being a kernel function and h defining a bandwidth. 
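To make (2) concrete: because the kernel constant cancels in the weighted least-squares point estimate, the local cubic smooth at a single point can be reproduced with an ordinary weighted regression. A sketch for the motorcycle data at the illustrative point x0 = 20, using the epan2 kernel and the bandwidth h = 6.88 from example 1:
. use http://www.stata-press.com/data/r13/motorcycle, clear
. generate u = (time - 20)/6.88
. generate w = (1 - u^2)*(abs(u) < 1)
. generate t1 = time - 20
. generate t2 = t1^2
. generate t3 = t1^3
. regress accel t1 t2 t3 [aweight = w] if w > 0
The intercept _b[_cons] from this regression is the local estimate of m(x0) at x0 = 20, which should agree closely with what lpoly accel time, degree(3) kernel(epan2) plots near that point.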
The kernels are defined in Methods and formulas of [R] kdensity. The default bandwidth is obtained using the ROT method of bandwidth selection. The ROT bandwidth is the plugin estimator of the asymptotically optimal constant bandwidth. This is the bandwidth that minimizes the conditional weighted mean integrated squared error. The ROT plugin bandwidth selector for the smoothing bandwidth h is defined as follows; assuming constant residual variance σ 2 (x0 ) = σ 2 and odd degree p: " σ b2 R w0 (x)dx b h = C0,p (K) R n {m b (p+1) (x)}2 w0 (x)f (x)dx #1/(2p+3) (3) where C0,p (K) is a constant, as defined in Fan and Gijbels (1996), that depends on the kernel function K(·), and the degree of a polynomial p and w0 is chosen to be an indicator function on the interval [minx + 0.05 × rangex , maxx − 0.05 × rangex ] with minx , maxx , and rangex being, respectively, the minimum, maximum, and the range of x. To obtain the estimates of a constant residual variance, σ b2 , (p+1) and (p + 1)th order derivative of m(x), denoted as m b (x), a polynomial in x of order (p + 3) is fit globally to y. σ b2 is estimated as a standardized residual sum of squares from this fit. The expression for the asymptotically optimal constant bandwidth used in constructing the ROT bandwidth estimator is derived for the odd-order polynomial approximations. For even-order polynomial fits the expression would depend not only on m(p+1) (x) but also on m(p+2) (x) and the design density and its derivative, f (x) and f 0 (x). Therefore, the ROT bandwidth selector would require estimation of these additional quantities. Instead, for an even-degree p of the local polynomial, lpoly uses the value of the ROT estimator (3) computed using degree p + 1. As such, for even degrees this is not a plugin estimator of the asymptotically optimal constant bandwidth. The estimates of the conditional variance of local polynomial estimators are obtained using 2 d m(x Var{ b 0 )|X = x0 } = σ bm (x0 ) = (XT WX)−1 (XT W2 X)(XT WX)−1 σ b2 (x0 ) (4) lpoly — Kernel-weighted local polynomial smoothing 1117 where σ b2 (x0 ) is estimated by the normalized weighted residual sum of squares from the (p + 2)th order polynomial fit using pilot bandwidth h? . When the bias is negligible the normal-approximation method yields a (1 − α) × 100% confidence interval for m(x0 ), m(x b 0 ) − z(1−α/2) σ bm (x0 ), m(x b 0 ) + z(1−α/2) σ bm (x0 ) where z(1−α/2) is the (1 − α/2)th quantile of the standard Gaussian distribution, and m(x b 0 ) and σ bm (x0 ) are as defined in (2) and (4), respectively. References Cleveland, W. S. 1979. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74: 829–836. Cox, N. J. 2005. Speaking Stata: Smoothing in various directions. Stata Journal 5: 574–593. Donoho, D. L. 1995. Nonlinear solution of linear inverse problems by wavelet-vaguelette decomposition. Applied and Computational Harmonic Analysis 2: 101–126. Eubank, R. L. 1999. Nonparametric Regression and Spline Smoothing. 2nd ed. New York: Dekker. Fan, J. 1992. Design-adaptive nonparametric regression. Journal of the American Statistical Association 87: 998–1004. Fan, J., and I. Gijbels. 1996. Local Polynomial Modelling and Its Applications. London: Chapman & Hall. Gasser, T., and H.-G. Müller. 1979. Kernel estimation of regression functions. In Smoothing Techniques for Curve Estimation, Lecture Notes in Mathematics, ed. T. Gasser and M. Rosenblatt, 23–68. New York: Springer. Gutierrez, R. G., J. M. Linhart, and J. S. Pitblado. 
2003. From the help desk: Local polynomial regression and Stata plugins. Stata Journal 3: 412–419. Nadaraya, E. A. 1964. On estimating regression. Theory of Probability and Its Application 9: 141–142. Sheather, S. J., and M. C. Jones. 1991. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B 53: 683–690. Verardi, V., and N. Debarsy. 2012. Robinson’s square root of Stata. Stata Journal 12: 726–735. N consistent semiparametric regression estimator in Watson, G. S. 1964. Smooth regression analysis. Sankhyā Series A 26: 359–372. Also see [R] kdensity — Univariate kernel density estimation [R] lowess — Lowess smoothing [R] smooth — Robust nonlinear smoother [G-2] graph twoway lpoly — Local polynomial smooth plots [G-2] graph twoway lpolyci — Local polynomial smooth plots with CIs Title lroc — Compute area under ROC curve and graph the curve Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax lroc depvar if in weight , options Description options Main compute area under ROC curve and graph curve for all observations suppress graph all nograph Advanced beta(matname) row vector containing model coefficients Plot cline options marker options marker label options change the look of the line change look of markers (color, size, etc.) add marker labels; change look or position Reference line rlopts(cline options) affect rendition of the reference line Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options fweights are allowed; see [U] 11.1.6 weight. lroc is not appropriate after the svy prefix. Menu Statistics > Binary outcomes > Postestimation > ROC curve after logistic/logit/probit/ivprobit Description lroc graphs the ROC curve and calculates the area under the curve. lroc requires that the current estimation results be from logistic, logit, probit, or ivprobit; see [R] logistic, [R] logit, [R] probit, or [R] ivprobit. 1118 lroc — Compute area under ROC curve and graph the curve 1119 Options Main all requests that the statistic be computed for all observations in the data, ignoring any if or in restrictions specified by the estimation command. nograph suppresses graphical output. Advanced beta(matname) specifies a row vector containing model coefficients. The columns of the row vector must be labeled with the corresponding names of the independent variables in the data. The dependent variable depvar must be specified immediately after the command name. See Models other than the last fitted model later in this entry. Plot cline options, marker options, and marker label options affect the rendition of the ROC curve—the plotted points connected by lines. These options affect the size and color of markers, whether and how the markers are labeled, and whether and how the points are connected; see [G-3] cline options, [G-3] marker options, and [G-3] marker label options. Reference line rlopts(cline options) affects the rendition of the reference line; see [G-3] cline options. Add plots addplot(plot) provides a way to add other plots to the generated graph; see [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). 
Remarks and examples Remarks are presented under the following headings: Introduction Samples other than the estimation sample Models other than the last fitted model Introduction Stata also has a suite of commands for performing both parametric and nonparametric receiver operating characteristic (ROC) analysis. See [R] roc for an overview of these commands. lroc graphs the ROC curve — a graph of sensitivity versus one minus specificity as the cutoff c is varied — and calculates the area under it. Sensitivity is the fraction of observed positive-outcome cases that are correctly classified; specificity is the fraction of observed negative-outcome cases that are correctly classified. When the purpose of the analysis is classification, you must choose a cutoff. 1120 lroc — Compute area under ROC curve and graph the curve The curve starts at (0, 0), corresponding to c = 1, and continues to (1, 1), corresponding to c = 0. A model with no predictive power would be a 45◦ line. The greater the predictive power, the more bowed the curve, and hence the area beneath the curve is often used as a measure of the predictive power. A model with no predictive power has area 0.5; a perfect model has area 1. The ROC curve was first discussed in signal detection theory (Peterson, Birdsall, and Fox 1954) and then was quickly introduced into psychology (Tanner and Swets 1954). It has since been applied in other fields, particularly medicine (for instance, Metz [1978]). For a classic text on ROC techniques, see Green and Swets (1966). lsens also plots sensitivity and specificity; see [R] lsens. Example 1 Hardin and Hilbe (2012) examine data from the National Canadian Registry of Cardiovascular Disease (FASTRAK), sponsored by Hoffman-La Roche Canada. They model death within 48 hours based on whether a patient suffers an anterior infarct (heart attack) rather than an inferior infarct using a logistic regression and evaluate the model using an ROC curve. We replicate their analysis here. Both anterior and inferior refer to sites on the heart where damage occurs. The model is also adjusted for hcabg, whether the subject has had a cardiac bypass surgery (CABG); age, a four-category age-group indicator; and killip, a four-level risk indicator. We load the data and then estimate the parameters of the logistic regression with logistic. Factor-variable notation is used for each predictor, because they are categorical; see [U] 11.4.3 Factor variables. . use http://www.stata-press.com/data/r13/heart (Heart attacks) . logistic death i.site i.hcabg i.killip i.age Logistic regression Log likelihood = -636.62553 death Odds Ratio site Anterior 1.hcabg Number of obs LR chi2(8) Prob > chi2 Pseudo R2 = = = = 4483 211.37 0.0000 0.1424 Std. Err. z P>|z| [95% Conf. Interval] 1.901333 2.105275 .3185757 .7430694 3.83 2.11 0.000 0.035 1.369103 1.054076 2.640464 4.204801 killip 2 3 4 2.251732 2.172105 14.29137 .4064423 .584427 5.087654 4.50 2.88 7.47 0.000 0.004 0.000 1.580786 1.281907 7.112964 3.207453 3.680487 28.71423 age 60-69 70-79 >=80 1.63726 4.532029 8.893222 .5078582 1.206534 2.41752 1.59 5.68 8.04 0.112 0.000 0.000 .8914261 2.689568 5.219991 3.007115 7.636647 15.15125 _cons .0063961 .0016541 -19.54 0.000 .0038529 .010618 The odds ratios for a unit change in each covariate are reported by logistic. At fixed values of the other covariates, patients who enter Canadian hospitals with an anterior infarct have nearly twice the odds of death within 48 hours than those with an inferior infarct. 
Those who have had a previous CABG have approximately twice the risk of death of those who have not. Those with higher Killip risks and those who are older are also at greater risk of death. lroc — Compute area under ROC curve and graph the curve 1121 We use lroc to draw the ROC curve for the model. The area under the curve of approximately 0.8 indicates acceptable discrimination for the model. . lroc 4483 0.7965 0.00 0.25 Sensitivity 0.50 0.75 1.00 Logistic model for death number of observations = area under ROC curve = 0.00 0.25 0.50 1 − Specificity 0.75 1.00 Area under ROC curve = 0.7965 Samples other than the estimation sample lroc can be used with samples other than the estimation sample. By default, lroc remembers the estimation sample used with the last logistic, logit, probit, or ivprobit command. To override this, simply use an if or in restriction to select another set of observations, or specify the all option to force the command to use all the observations in the dataset. See example 3 in [R] estat gof for an example of using lroc with a sample other than the estimation sample. Models other than the last fitted model By default, lroc uses the last model fit by logistic, logit, probit, or ivprobit. You may also directly specify the model to lroc by inputting a vector of coefficients with the beta() option and passing the name of the dependent variable depvar to lroc. Example 2 Suppose that someone publishes the following logistic model of low birthweight: Pr(low = 1) = F (−0.02 age − 0.01 lwt + 1.3 black + 1.1 smoke + 0.5 ptl + 1.8 ht + 0.8 ui + 0.5) where F is the cumulative logistic distribution. These coefficients are not odds ratios; they are the equivalent of what logit produces. 1122 lroc — Compute area under ROC curve and graph the curve We can see whether this model fits our data. First we enter the coefficients as a row vector and label its columns with the names of the independent variables plus cons for the constant (see [P] matrix define and [P] matrix rownames). . use http://www.stata-press.com/data/r13/lbw3, clear (Hosmer & Lemeshow data) . matrix input b = (-.02, -.01, 1.3, 1.1, .5, 1.8, .8, .5) . matrix colnames b = age lwt black smoke ptl ht ui _cons Here we use lroc to examine the predictive ability of the model: . lroc low, beta(b) nograph Logistic model for low number of observations = 189 area under ROC curve = 0.7275 The area under the curve indicates that this model does have some predictive power. We can obtain a graph of sensitivity and specificity as a function of the cutoff probability by typing 0.00 Sensitivity/Specificity 0.25 0.50 0.75 1.00 . lsens low, beta(b) 0.00 0.25 0.50 Probability cutoff Sensitivity 0.75 1.00 Specificity See [R] lsens. Stored results lroc stores the following in r(): Scalars r(N) r(area) number of observations area under the ROC curve Methods and formulas The ROC curve is a graph of specificity against (1 − sensitivity). This is guaranteed to be a monotone nondecreasing function because the number of correctly predicted successes increases and the number of correctly predicted failures decreases as the classification cutoff c decreases. lroc — Compute area under ROC curve and graph the curve 1123 The area under the ROC curve is the area on the bottom of this graph and is determined by integrating the curve. The vertices of the curve are determined by sorting the data according to the predicted index, and the integral is computed using the trapezoidal rule. References Green, D. M., and J. A. Swets. 1966. 
Signal Detection Theory and Psychophysics. New York: Wiley. Hardin, J. W., and J. M. Hilbe. 2012. Generalized Linear Models and Extensions. 3rd ed. College Station, TX: Stata Press. Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley. Metz, C. E. 1978. Basic principles of ROC analysis. Seminars in Nuclear Medicine 8: 283–298. Peterson, W. W., T. G. Birdsall, and W. C. Fox. 1954. The theory of signal detectability. Transactions IRE Professional Group on Information Theory PGIT-4: 171–212. Tanner, W. P., Jr., and J. A. Swets. 1954. A decision-making theory of visual detection. Psychological Review 61: 401–409. Tilford, J. M., P. K. Roberson, and D. H. Fiser. 1995. sbe12: Using lfit and lroc to evaluate mortality prediction models. Stata Technical Bulletin 28: 14–18. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 77–81. College Station, TX: Stata Press. Also see [R] logistic — Logistic regression, reporting odds ratios [R] logit — Logistic regression, reporting coefficients [R] probit — Probit regression [R] ivprobit — Probit model with continuous endogenous regressors [R] lsens — Graph sensitivity and specificity versus probability cutoff [R] estat classification — Classification statistics and table [R] estat gof — Pearson or Hosmer–Lemeshow goodness-of-fit test [R] roc — Receiver operating characteristic (ROC) analysis [U] 20 Estimation and postestimation commands Title lrtest — Likelihood-ratio test after estimation Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax lrtest modelspec1 modelspec2 where modelspec is , options name | . | (namelist) where name is the name under which estimation results were stored using estimates store (see [R] estimates store), and “.” refers to the last estimation results, whether or not these were already stored. options Description stats dir df(#) force display statistical information about the two models display descriptive information about the two models override the automatic degrees-of-freedom calculation; seldom used force testing even when apparently invalid Menu Statistics > Postestimation > Tests > Likelihood-ratio test Description lrtest performs a likelihood-ratio test of the null hypothesis that the parameter vector of a statistical model satisfies some smooth constraint. To conduct the test, both the unrestricted and the restricted models must be fit using the maximum likelihood method (or some equivalent method), and the results of at least one must be stored using estimates store; see [R] estimates store. modelspec1 and modelspec2 specify the restricted and unrestricted model in any order. modelspec1 and modelspec2 cannot have names in common; for example, lrtest (A B C) (C D E) is not allowed because both model specifications include C. If modelspec2 is not specified, the last estimation result is used; this is equivalent to specifying modelspec2 as a period (.). lrtest supports composite models specified by a parenthesized list of model names. In a composite model, we assume that the log likelihood and dimension (number of free parameters) of the full model are obtained as the sum of the log-likelihood values and dimensions of the constituting models. lrtest provides an important alternative to test (see [R] test) for models fit via maximum likelihood or equivalent methods. 
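For instance, assuming that results have already been stored under the names A, B, and C and that another model has just been fit (all names here are purely illustrative), each of the following is valid syntax:

. lrtest A B
. lrtest A
. lrtest (A B) C

The first compares the two stored models, the second compares A with the last estimation results, and the third treats A and B together as a composite model and compares that composite with C.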
1124 lrtest — Likelihood-ratio test after estimation 1125 Options stats displays statistical information about the unrestricted and restricted models, including the information indices of Akaike and Schwarz. dir displays descriptive information about the unrestricted and restricted models; see estimates dir in [R] estimates store. df(#) is seldom specified; it overrides the automatic degrees-of-freedom calculation. force forces the likelihood-ratio test calculations to take place in situations where lrtest would normally refuse to do so and issue an error. Such situations arise when one or more assumptions of the test are violated, for example, if the models were fit with vce(robust), vce(cluster clustvar), or pweights; when the dependent variables in the two models differ; when the null log likelihoods differ; when the samples differ; or when the estimation commands differ. If you use the force option, there is no guarantee as to the validity or interpretability of the resulting test. Remarks and examples The standard way to use lrtest is to do the following: 1. Fit either the restricted model or the unrestricted model by using one of Stata’s estimation commands and then store the results using estimates store name. 2. Fit the alternative model (the unrestricted or restricted model) and then type ‘lrtest name .’. lrtest determines for itself which of the two models is the restricted model by comparing the degrees of freedom. Often you may want to store the alternative model with estimates store name2 , for instance, if you plan additional tests against models yet to be fit. The likelihood-ratio test is then obtained as lrtest name name2 . Remarks are presented under the following headings: Nested models Composite models Nested models lrtest may be used with any estimation command that reports a log likelihood, including heckman, logit, poisson, stcox, and streg. You must check that one of the model specifications implies a statistical model that is nested within the model implied by the other specification. Usually, this means that both models are fit with the same estimation command (for example, both are fit by logit, with the same dependent variables) and that the set of covariates of one model is a subset of the covariates of the other model. Second, lrtest is valid only for models that are fit by maximum likelihood or by some equivalent method, so it does not apply to models that were fit with probability weights or clusters. Specifying the vce(robust) option similarly would indicate that you are worried about the valid specification of the model, so you would not use lrtest. Third, lrtest assumes that under the null hypothesis, the test statistic is (approximately) distributed as chi-squared. This assumption is not true for likelihood-ratio tests of “boundary conditions”, such as tests for the presence of overdispersion or random effects (Gutierrez, Carter, and Drukker 2001). Example 1 We have data on infants born with low birthweights along with the characteristics of the mother (Hosmer, Lemeshow, and Sturdivant 2013; see also [R] logistic). We fit the following model: 1126 lrtest — Likelihood-ratio test after estimation . use http://www.stata-press.com/data/r13/lbw (Hosmer & Lemeshow data) . logistic low age lwt i.race smoke ptl ht ui Logistic regression Log likelihood = -100.724 low Odds Ratio Std. Err. 
age lwt .9732636 .9849634 .0354759 .0068217 race black other 3.534767 2.368079 smoke ptl ht ui _cons 2.517698 1.719161 6.249602 2.1351 1.586014 z Number of obs LR chi2(8) Prob > chi2 Pseudo R2 = = = = 189 33.22 0.0001 0.1416 P>|z| [95% Conf. Interval] -0.74 -2.19 0.457 0.029 .9061578 .9716834 1.045339 .9984249 1.860737 1.039949 2.40 1.96 0.016 0.050 1.259736 1.001356 9.918406 5.600207 1.00916 .5952579 4.322408 .9808153 1.910496 2.30 1.56 2.65 1.65 0.38 0.021 0.118 0.008 0.099 0.702 1.147676 .8721455 1.611152 .8677528 .1496092 5.523162 3.388787 24.24199 5.2534 16.8134 We now wish to test the constraint that the coefficients on age, lwt, ptl, and ht are all zero or, equivalently here, that the odds ratios are all 1. One solution is to type . test ( 1) ( 2) ( 3) ( 4) age lwt ptl ht [low]age = 0 [low]lwt = 0 [low]ptl = 0 [low]ht = 0 chi2( 4) = Prob > chi2 = 12.38 0.0147 This test is based on the inverse of the information matrix and is therefore based on a quadratic approximation to the likelihood function; see [R] test. A more precise test would be to refit the model, applying the proposed constraints, and then calculate the likelihood-ratio test. We first save the current model: . estimates store full We then fit the constrained model, which here is the model omitting age, lwt, ptl, and ht: . logistic low i.race smoke ui Logistic regression Number of obs LR chi2(4) Prob > chi2 Pseudo R2 Log likelihood = -107.93404 low Odds Ratio race black other smoke ui _cons = = = = 189 18.80 0.0009 0.0801 Std. Err. z P>|z| [95% Conf. Interval] 3.052746 2.922593 1.498087 1.189229 2.27 2.64 0.023 0.008 1.166747 1.316457 7.987382 6.488285 2.945742 2.419131 .1402209 1.101838 1.047359 .0512295 2.89 2.04 -5.38 0.004 0.041 0.000 1.415167 1.035459 .0685216 6.131715 5.651788 .2869447 lrtest — Likelihood-ratio test after estimation 1127 That done, lrtest compares this model with the model we previously stored: . lrtest full . Likelihood-ratio test (Assumption: . nested in full) LR chi2(4) = Prob > chi2 = 14.42 0.0061 Let’s compare results. test reported that age, lwt, ptl, and ht were jointly significant at the 1.5% level; lrtest reports that they are significant at the 0.6% level. Given the quadratic approximation made by test, we could argue that lrtest’s results are more accurate. lrtest explicates the assumption that, from a comparison of the degrees of freedom, it has assessed that the last fit model (.) is nested within the model stored as full. In other words, full is the unconstrained model and . is the constrained model. The names in “(Assumption: . nested in full)” are actually links. Click on a name, and the results for that model are replayed. Aside: The nestreg command provides a simple syntax for performing likelihood-ratio tests for nested model specifications; see [R] nestreg. In the previous example, we fit a full logistic model, used estimates store to store the full model, fit a constrained logistic model, and used lrtest to report a likelihood-ratio test between two models. To do this with one call to nestreg, use the lrtable option. Technical note lrtest determines the degrees of freedom of a model as the rank of the (co)variance matrix e(V). There are two issues here. First, the numerical determination of the rank of a matrix is a subtle problem that can, for instance, be affected by the scaling of the variables in the model. The rank of a matrix depends on the number of (independent) linear combinations of coefficients that sum exactly to zero. 
In the world of numerical mathematics, it is hard to tell whether a very small number is really nonzero or is a real zero that happens to be slightly off because of roundoff error from the finite precision with which computers make floating-point calculations. Whether a small number is being classified as one or the other, typically on the basis of a threshold, affects the determined degrees of freedom. Although Stata generally makes sensible choices, it is bound to make mistakes occasionally. The moral of this story is to make sure that the calculated degrees of freedom is as you expect before interpreting the results. Technical note A second issue involves regress and related commands such as anova. Mainly for historical reasons, regress does not treat the residual variance, σ 2 , the same way that it treats the regression coefficients. Type estat vce after regress, and you will see the regression coefficients, not σ b2 . Most estimation commands for models with ancillary parameters (for example, streg and heckman) treat all parameters as equals. There is nothing technically wrong with regress here; we are usually focused on the regression coefficients, and their estimators are uncorrelated with σ b2 . But, formally, 2 σ adds a degree of freedom to the model, which does not matter if you are comparing two regression models by a likelihood-ratio test. This test depends on the difference in the degrees of freedom, and hence being “off by 1” in each does not matter. But, if you are comparing a regression model with a larger model—for example, a heteroskedastic regression model fit by arch—the automatic determination of the degrees of freedom is incorrect, and you must specify the df(#) option. 1128 lrtest — Likelihood-ratio test after estimation Example 2 Returning to the low-birthweight data in the example 1, we now wish to test that the coefficient on 2.race (black) is equal to that on 3.race (other). The base model is still stored under the name full, so we need only fit the constrained model and perform the test. With z as the index of the logit model, the base model is z = β0 + β1 age + β2 lwt + β3 2.race + β4 3.race + · · · If β3 = β4 , this can be written as z = β0 + β1 age + β2 lwt + β3 (2.race + 3.race) + · · · We can fit the constrained model as follows: . constraint 1 2.race = 3.race . logistic low age lwt i.race smoke ptl ht ui, constraints(1) Logistic regression Number of obs Wald chi2(7) Log likelihood = -100.9997 Prob > chi2 ( 1) [low]2.race - [low]3.race = 0 low Odds Ratio Std. Err. age lwt .9716799 .9864971 .0352638 .0064627 race black other 2.728186 2.728186 smoke ptl ht ui _cons 2.664498 1.709129 6.116391 2.09936 1.309371 z = = = 189 25.17 0.0007 P>|z| [95% Conf. Interval] -0.79 -2.08 0.429 0.038 .9049649 .9739114 1.043313 .9992453 1.080207 1.080207 2.53 2.53 0.011 0.011 1.255586 1.255586 5.927907 5.927907 1.052379 .5924776 4.215585 .9699702 1.527398 2.48 1.55 2.63 1.61 0.23 0.013 0.122 0.009 0.108 0.817 1.228633 .8663666 1.58425 .8487997 .1330839 5.778414 3.371691 23.61385 5.192407 12.8825 Comparing this model with our original model, we obtain . lrtest full . Likelihood-ratio test (Assumption: . nested in full) LR chi2(1) = Prob > chi2 = 0.55 0.4577 By comparison, typing test 2.race=3.race after fitting our base model results in a significance level of 0.4572. Alternatively, we can first store the restricted model, here using the name equal. Next lrtest is invoked specifying the names of the restricted and unrestricted models (we do not care about the order). 
This time, we also add the option stats requesting a table of model statistics, including the model selection indices AIC and BIC. . estimates store equal . lrtest equal full, stats Likelihood-ratio test (Assumption: equal nested in full) LR chi2(1) = Prob > chi2 = 0.55 0.4577 Model Obs ll(null) ll(model) df AIC BIC equal full 189 189 . -117.336 -100.9997 -100.724 8 9 217.9994 219.448 243.9334 248.6237 Note: N=Obs used in calculating BIC; see [R] BIC note lrtest — Likelihood-ratio test after estimation 1129 Composite models lrtest supports composite models; that is, models that can be fit by fitting a series of simpler models or by fitting models on subsets of the data. Theoretically, a composite model is one in which the likelihood function, L(θ), of the parameter vector, θ, can be written as the product L(θ) = L1 (θ1 ) × L2 (θ2 ) × · · · × Lk (θk ) of likelihood terms with θ = (θ1 , . . . , θk ) a partitioning of the full parameter vector. In such a case, the full-model likelihood L(θ) is maximized by maximizing the likelihood terms Lj (θj ) in b = Pk log Lj (θbj ). The degrees of freedom for the composite model is turn. Obviously, log L(θ) j=1 obtained as the sum of the degrees of freedom of the constituting models. Example 3 As an example of the application of composite models, we consider a test of the hypothesis that the coefficients of a statistical model do not differ between different portions (“regimes”) of the covariate space. Economists call a test for such a hypothesis a Chow test. We continue the analysis of the data on children of low birthweight by using logistic regression modeling and study whether the regression coefficients are the same among the three races: white, black, and other. A likelihood-ratio Chow test can be obtained by fitting the logistic regression model for each of the races and then comparing the combined results with those of the model previously stored as full. Because the full model included dummies for the three races, this version of the Chow test allows the intercept of the logistic regression model to vary between the regimes (races). . logistic low age lwt smoke ptl ht ui if 1.race, nolog Logistic regression Number of obs LR chi2(6) Prob > chi2 Log likelihood = -45.927061 Pseudo R2 low Odds Ratio age lwt smoke ptl ht ui _cons .9869674 .9900874 4.208697 1.592145 2.900166 1.229523 .4891008 . estimates store white Std. Err. .0527757 .0106101 2.680133 .7474264 3.193537 .9474768 .993785 z -0.25 -0.93 2.26 0.99 0.97 0.27 -0.35 = = = = 96 13.86 0.0312 0.1311 P>|z| [95% Conf. Interval] 0.806 0.353 0.024 0.322 0.334 0.789 0.725 .8887649 .9695089 1.20808 .6344379 .3350554 .2715165 .0091175 1.096021 1.011103 14.66222 3.995544 25.1032 5.567715 26.23746 1130 lrtest — Likelihood-ratio test after estimation . logistic low age lwt smoke ptl ht ui if 2.race, nolog Logistic regression Number of obs LR chi2(6) Prob > chi2 Log likelihood = -12.654157 Pseudo R2 low Odds Ratio age lwt smoke ptl ht ui _cons .8735313 .9747736 16.50373 4.866916 85.05605 67.61338 48.7249 Std. Err. .1377846 .016689 24.37044 9.33151 214.6382 133.3313 169.9216 z -0.86 -1.49 1.90 0.83 1.76 2.14 1.11 Odds Ratio age lwt smoke ptl ht ui _cons .9263905 .9724499 .7979034 2.845675 7.767503 2.925006 49.09444 Std. Err. .0665386 .015762 .6340585 1.777944 10.00537 2.046473 113.9165 z -1.06 -1.72 -0.28 1.67 1.59 1.53 1.68 26 10.12 0.1198 0.2856 P>|z| [95% Conf. Interval] 0.391 0.136 0.058 0.409 0.078 0.033 0.265 .6412332 .9426065 .9133647 .1135573 .6049308 1.417399 .0523961 . estimates store black . 
logistic low age lwt smoke ptl ht ui if 3.race, nolog Logistic regression Number of obs LR chi2(6) Prob > chi2 Log likelihood = -37.228444 Pseudo R2 low = = = = 1.189983 1.008038 298.2083 208.5895 11959.27 3225.322 45310.94 = = = = 67 14.06 0.0289 0.1589 P>|z| [95% Conf. Interval] 0.287 0.085 0.776 0.094 0.112 0.125 0.093 .8047407 .9420424 .1680885 .8363053 .6220764 .7423107 .5199275 1.06643 1.003839 3.787586 9.682908 96.98826 11.52571 4635.769 . estimates store other We are now ready to perform the likelihood-ratio Chow test: . lrtest (full) (white black other), stats Likelihood-ratio test LR chi2(12) = Prob > chi2 = 9.83 0.6310 Assumption: (full) nested in (white, black, other) Model Obs ll(null) ll(model) df AIC BIC full white black other 189 96 26 67 -117.336 -52.85752 -17.71291 -44.26039 -100.724 -45.92706 -12.65416 -37.22844 9 7 7 7 219.448 105.8541 39.30831 88.45689 248.6237 123.8046 48.11499 103.8897 Note: N=Obs used in calculating BIC; see [R] BIC note We cannot reject the hypothesis that the logistic regression model applies to each of the races at any reasonable significance level. By specifying the stats option, we can verify the degrees of freedom of the test: 12 = 7 + 7 + 7 − 9. We can obtain the same test by fitting an expanded model with interactions between all covariates and race. lrtest — Likelihood-ratio test after estimation . logistic low race##c.(age lwt smoke ptl ht ui) Logistic regression Log likelihood = -95.809661 low Odds Ratio race black other Number of obs LR chi2(20) Prob > chi2 Pseudo R2 = = = = 189 43.05 0.0020 0.1835 Std. Err. z P>|z| [95% Conf. Interval] 99.62137 100.3769 402.0829 309.586 1.14 1.49 0.254 0.135 .0365434 .2378638 271578.9 42358.38 age lwt smoke ptl ht ui .9869674 .9900874 4.208697 1.592145 2.900166 1.229523 .0527757 .0106101 2.680133 .7474264 3.193537 .9474768 -0.25 -0.93 2.26 0.99 0.97 0.27 0.806 0.353 0.024 0.322 0.334 0.789 .8887649 .9695089 1.20808 .6344379 .3350554 .2715165 1.096021 1.011103 14.66222 3.995544 25.1032 5.567715 race#c.age black other .885066 .9386232 .1474079 .0840486 -0.73 -0.71 0.464 0.479 .638569 .7875366 1.226714 1.118695 race#c.lwt black other .9845329 .9821859 .0198857 .0190847 -0.77 -0.93 0.440 0.355 .9463191 .9454839 1.02429 1.020313 race#c.smoke black other 3.921338 .1895844 6.305992 .1930601 0.85 -1.63 0.395 0.102 .167725 .025763 91.67917 1.395113 race#c.ptl black other 3.05683 1.787322 6.034089 1.396789 0.57 0.74 0.571 0.457 .0638301 .3863582 146.3918 8.268285 race#c.ht black other 29.328 2.678295 80.7482 4.538712 1.23 0.58 0.220 0.561 .1329492 .0966916 6469.623 74.18702 race#c.ui black other 54.99155 2.378976 116.4274 2.476124 1.89 0.83 0.058 0.405 .8672471 .309335 3486.977 18.29579 _cons .4891008 .993785 -0.35 0.725 .0091175 26.23746 . lrtest full . Likelihood-ratio test (Assumption: full nested in .) LR chi2(12) = Prob > chi2 = 1131 9.83 0.6310 Applying lrtest for the full model against the model with all interactions yields the same test statistic and p-value as for the full model against the composite model for the three regimes. Here the specification of the model with interactions was convenient, and logistic had no problem computing the estimates for the expanded model. In models with more complicated likelihoods, such as Heckman’s selection model (see [R] heckman) or complicated survival-time models (see [ST] streg), fitting the models with all interactions may be numerically demanding and may be much more time consuming than fitting a series of models separately for each regime. 
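The regime-by-regime route just described can be sketched in general form as follows; the variable names (y, x1, x2, regime) and stored-results names are illustrative, assume a three-level regime variable, and are not part of the example above:

. logistic y x1 x2 i.regime
. estimates store pooled
. logistic y x1 x2 if regime == 1
. estimates store m1
. logistic y x1 x2 if regime == 2
. estimates store m2
. logistic y x1 x2 if regime == 3
. estimates store m3
. lrtest (pooled) (m1 m2 m3)

Only the pooled fit and one fit per regime are required; the model with the full set of interactions never needs to be estimated.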
Given the model with all interactions, we could also test the hypothesis of no differences among the regions (races) by a Wald version of the Chow test by using the testparm command; see [R] test. 1132 lrtest — Likelihood-ratio test after estimation . testparm race#c.(age lwt smoke ptl ht ui) ( 1) [low]2.race#c.age = 0 ( 2) [low]3.race#c.age = 0 ( 3) [low]2.race#c.lwt = 0 ( 4) [low]3.race#c.lwt = 0 ( 5) [low]2.race#c.smoke = 0 ( 6) [low]3.race#c.smoke = 0 ( 7) [low]2.race#c.ptl = 0 ( 8) [low]3.race#c.ptl = 0 ( 9) [low]2.race#c.ht = 0 (10) [low]3.race#c.ht = 0 (11) [low]2.race#c.ui = 0 (12) [low]3.race#c.ui = 0 chi2( 12) = 8.24 Prob > chi2 = 0.7663 We conclude that, here, the Wald version of the Chow test is similar to the likelihood-ratio version of the Chow test. Stored results lrtest stores the following in r(): Scalars r(p) r(df) r(chi2) level of significance degrees of freedom LR test statistic Programmers wishing their estimation commands to be compatible with lrtest should note that lrtest requires that the following results be returned: e(cmd) e(ll) e(V) e(N) name of estimation command log likelihood variance–covariance matrix of the estimators number of observations lrtest also verifies that e(N), e(ll 0), and e(depvar) are consistent between two noncomposite models. Methods and formulas Let L0 and L1 be the log-likelihood values associated with the full and constrained models, respectively. The test statistic of the likelihood-ratio test is LR = −2(L1 − L0 ). If the constrained model is true, LR is approximately χ2 distributed with d0 − d1 degrees of freedom, where d0 and d1 are the model degrees of freedom associated with the full and constrained models, respectively (Greene 2012, 526 – 527). lrtest determines the degrees of freedom of a model as the rank of e(V), computed as the number of nonzero diagonal elements of invsym(e(V)). lrtest — Likelihood-ratio test after estimation 1133 References Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. Gutierrez, R. G., S. L. Carter, and D. M. Drukker. 2001. sg160: On boundary-value likelihood-ratio tests. Stata Technical Bulletin 60: 15–18. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 269–273. College Station, TX: Stata Press. Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley. Kleinbaum, D. G., and M. Klein. 2010. Logistic Regression: A Self-Learning Text. 3rd ed. New York: Springer. Pérez-Hoyos, S., and A. Tobı́as. 1999. sg111: A modified likelihood-ratio test command. Stata Technical Bulletin 49: 24–25. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 171–173. College Station, TX: Stata Press. Wang, Z. 2000. sg133: Sequential and drop one term likelihood-ratio tests. Stata Technical Bulletin 54: 46–47. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 332–334. College Station, TX: Stata Press. 
Also see [R] test — Test linear hypotheses after estimation [R] testnl — Test nonlinear hypotheses after estimation [R] nestreg — Nested model statistics Title lsens — Graph sensitivity and specificity versus probability cutoff Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options Reference Syntax lsens depvar if in weight , options Description options Main all genprob(varname) gensens(varname) genspec(varname) replace nograph graph all observations in the data create variable containing probability cutoffs create variable containing sensitivity create variable containing specificity overwrite existing variables suppress the graph Advanced beta(matname) row vector containing model coefficients Plot connect options affect rendition of the plotted points connected by lines Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options fweights are allowed; see [U] 11.1.6 weight. lsens is not appropriate after the svy prefix. Menu Statistics > Binary outcomes > Postestimation > Sensitivity/specificity plot Description lsens graphs sensitivity and specificity versus probability cutoff and optionally creates new variables containing these data. lsens requires that the current estimation results be from logistic, logit, probit, or ivprobit; see [R] logistic, [R] logit, [R] probit, or [R] ivprobit. 1134 lsens — Graph sensitivity and specificity versus probability cutoff 1135 Options Main all requests that the statistic be computed for all observations in the data, ignoring any if or in restrictions specified by the estimation command. genprob(varname), gensens(varname), and genspec(varname) specify the names of new variables created to contain, respectively, the probability cutoffs and the corresponding sensitivity and specificity. replace requests that existing variables specified for genprob(), gensens(), or genspec() be overwritten. nograph suppresses graphical output. Advanced beta(matname) specifies a row vector containing model coefficients. The columns of the row vector must be labeled with the corresponding names of the independent variables in the data. The dependent variable depvar must be specified immediately after the command name. See Models other than the last fitted model later in this entry. Plot connect options affect the rendition of the plotted points connected by lines; see connect options in [G-2] graph twoway scatter. Add plots addplot(plot) provides a way to add other plots to the generated graph. See [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Remarks and examples Remarks are presented under the following headings: Introduction Models other than the last fitted model Introduction lsens plots sensitivity and specificity; it plots both sensitivity and specificity versus probability cutoff c. The graph is equivalent to what you would get from estat classification (see [R] estat classification) if you varied the cutoff probability c from 0 to 1. 1136 lsens — Graph sensitivity and specificity versus probability cutoff Example 1 We illustrate lsens after logistic; see [R] logistic. 0.00 Sensitivity/Specificity 0.25 0.50 0.75 1.00 . 
use http://www.stata-press.com/data/r13/lbw (Hosmer & Lemeshow data) . logistic low age i.race smoke ui (output omitted ) . lsens 0.00 0.25 0.50 Probability cutoff Sensitivity 0.75 1.00 Specificity lsens optionally creates new variables containing the probability cutoff, sensitivity, and specificity. . lsens, genprob(p) gensens(sens) genspec(spec) nograph The variables created will have M + 2 distinct nonmissing values: one for each of the M covariate patterns, one for c = 0, and another for c = 1. Values are recorded for p = 0, for each of the observed predicted probabilities, and for p = 1. The total number of observations required to do this can be fewer than N, the same as N, or N + 1, or N + 2. If more observations are added, they are added at the end of the dataset and the values of the original variables are set to missing in the added observations. How the values added align with existing observations is irrelevant. Technical note logistic, logit, probit, or ivprobit and lsens keep track of the estimation sample. If you type, for instance, logistic . . . if x==1, then when you type lsens, the statistics will be calculated on the x==1 subsample of the data automatically. You should specify if or in with lsens only when you wish to produce graphs and calculate statistics for a set of observations other than the estimation sample. If the logistic model was fit with fweights, lsens properly accounts for the weights in its calculations. You do not have to specify the weights when you run lsens. Weights should be specified with lsens only when you wish to use a different set of weights. lsens — Graph sensitivity and specificity versus probability cutoff 1137 Models other than the last fitted model By default, lsens uses the last model fit. You may also directly specify the model to lsens by inputting a vector of coefficients with the beta() option and passing the name of the dependent variable depvar to lsens. Example 2 Suppose that someone publishes the following logistic model of low birthweight: Pr(low = 1) = F (−0.02 age − 0.01 lwt + 1.3 black + 1.1 smoke + 0.5 ptl + 1.8 ht + 0.8 ui + 0.5) where F is the cumulative logistic distribution. These coefficients are not odds ratios; they are the equivalent of what logit produces. We can see whether this model fits our data. First we enter the coefficients as a row vector and label its columns with the names of the independent variables plus cons for the constant (see [P] matrix define and [P] matrix rownames). . use http://www.stata-press.com/data/r13/lbw3, clear (Hosmer & Lemeshow data) . matrix input b = (-0.02, -.01, 1.3, 1.1, .5, 1.8, .8, .5) . matrix colnames b = age lwt black smoke ptl ht ui _cons We can use lroc (see [R] lroc) to examine the predictive ability of the model: . lroc low, beta(b) nograph Logistic model for low number of observations = area under ROC curve = 189 0.7275 The area under the curve indicates that this model does have some predictive power. We can obtain a graph of sensitivity and specificity as a function of the cutoff probability by typing 0.00 Sensitivity/Specificity 0.25 0.50 0.75 1.00 . lsens low, beta(b) 0.00 0.25 0.50 Probability cutoff Sensitivity 0.75 Specificity 1.00 1138 lsens — Graph sensitivity and specificity versus probability cutoff Stored results lsens stores the following in r(): Scalars r(N) number of observations Methods and formulas Let j index observations and c be the cutoff probability. 
Let pj be the predicted probability of a positive outcome and yj be the actual outcome, which we will treat as 0 or 1, although Stata treats it as 0 and non-0, excluding missing observations. A prediction is classified as positive if pj ≥ c and otherwise is classified as negative. The classification is correct if it is positive and yj = 1 or if it is negative and yj = 0. Sensitivity is the fraction of yj = 1 observations that are correctly classified. Specificity is the percentage of yj = 0 observations that are correctly classified. Reference Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley. Also see [R] logistic — Logistic regression, reporting odds ratios [R] logit — Logistic regression, reporting coefficients [R] probit — Probit regression [R] ivprobit — Probit model with continuous endogenous regressors [R] lroc — Compute area under ROC curve and graph the curve [R] estat classification — Classification statistics and table [R] estat gof — Pearson or Hosmer–Lemeshow goodness-of-fit test [R] roc — Receiver operating characteristic (ROC) analysis [U] 20 Estimation and postestimation commands Title lv — Letter-value displays Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax lv varlist if in , generate tail(#) by is allowed; see [D] by. Menu Statistics > Summaries, tables, and tests > Distributional plots and tests > Letter-value display Description lv shows a letter-value display (Tukey 1977, 44 – 49; Hoaglin 1983) for each variable in varlist. If no variables are specified, letter-value displays are shown for each numeric variable in the data. Options Main generate adds four new variables to the data: mid, containing the midsummaries; spread, containing the spreads; psigma, containing the pseudosigmas; and z2, containing the squared values from a standard normal distribution corresponding to the particular letter value. If the variables mid, spread, psigma, and z2 already exist, their contents are replaced. At most, only the first 11 observations of each variable are used; the remaining observations contain missing. If varlist specifies more than one variable, the newly created variables contain results for the last variable specified. The generate option may not be used with the by prefix. tail(#) indicates the inverse of the tail density through which letter values are to be displayed: 2 corresponds to the median (meaning half in each tail), 4 to the fourths (roughly the 25th and 75th percentiles), 8 to the eighths, and so on. # may be specified as 4, 8, 16, 32, 64, 128, 256, 512, or 1,024 and defaults to a value of # that has corresponding depth just greater than 1. The default is taken as 1,024 if the calculation results in a number larger than 1,024. Given the intelligent default, this option is rarely specified. Remarks and examples Letter-value displays are a collection of observations drawn systematically from the data, focusing especially on the tails rather than the middle of the distribution. The displays are called letter-value displays because letters have been (almost arbitrarily) assigned to tail densities: Letter M F E D C Tail area 1/2 1/4 1/8 1/16 1/32 Letter B A Z Y X 1139 Tail area 1/64 1/128 1/256 1/512 1/1024 1140 lv — Letter-value displays Example 1 We have data on the mileage ratings of 74 automobiles. To obtain a letter-value display, we type . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . 
lv mpg # 74 Mileage (mpg) M F E D C B A 37.5 19 10 5.5 3 2 1.5 1 18 15 14 14 12 12 12 inner fence outer fence 7.5 -3 20 21.5 21.5 22.25 24.5 23.5 25 26.5 25 28 30.5 35 35 38 41 spread 7 13 16.5 21 23 26 29 35.5 46 # below 0 0 pseudosigma 5.216359 5.771728 5.576303 5.831039 5.732448 6.040635 6.16562 # above 1 0 The decimal points can be made to line up and thus the output made more readable by specifying a display format for the variable; see [U] 12.5 Formats: Controlling how data are displayed. . format mpg %9.2f . lv mpg # 74 M F E D C B A Mileage (mpg) 37.5 19 10 5.5 3 2 1.5 1 18.00 15.00 14.00 14.00 12.00 12.00 12.00 inner fence outer fence 7.50 -3.00 20.00 21.50 21.50 22.25 24.50 23.50 25.00 26.50 25.00 28.00 30.50 35.00 35.00 38.00 41.00 spread 7.00 13.00 16.50 21.00 23.00 26.00 29.00 35.50 46.00 # below 0 0 pseudosigma 5.22 5.77 5.58 5.83 5.73 6.04 6.17 # above 1 0 At the top, the number of observations is indicated as 74. The first line shows the statistics associated with M, the letter value that puts half the density in each tail, or the median. The median has depth 37.5 (that is, in the ordered data, M is 37.5 observations in from the extremes) and has value 20. The next line shows the statistics associated with F or the fourths. The fourths have depth 19 (that is, in the ordered data, the lower fourth is observation 19, and the upper fourth is observation 74 − 19 + 1), and the values of the lower and upper fourths are 18 and 25. The number in the middle is the point halfway between the fourths — called a midsummary. If the distribution were perfectly symmetric, the midsummary would equal the median. The spread is the difference between the lower and upper summaries (25 − 18 = 7). For fourths, half the data lie within a 7-mpg band. The pseudosigma is a calculation of the standard deviation using only the lower and upper summaries and assuming that the variable is normally distributed. If the data really were normally distributed, all the pseudosigmas would be roughly equal. After the letter values, the line labeled with depth 1 reports the minimum and maximum values. Here the halfway point between the extremes is 26.5, which is greater than the median, indicating that 41 is more extreme than 12, at least relative to the median. And with each letter value, the midsummaries are increasing — our data are skewed. The pseudosigmas are also increasing, indicating lv — Letter-value displays 1141 that the data are spreading out relative to a normal distribution, although, given the evident skewness, this elongation may be an artifact of the skewness. At the end is an attempt to identify outliers, although the points so identified are merely outside some predetermined cutoff. Points outside the inner fence are called outside values or mild outliers. Points outside the outer fence are called severe outliers. The inner fence is defined as (3/2)IQR and the outer fence as 3IQR above and below the F summaries, where the interquartile range (IQR) is the spread of the fourths. Technical note The form of the letter-value display has varied slightly with different authors. lv displays appear as described by Hoaglin (1983) but as modified by Emerson and Stoto (1983), where they included the midpoint of each of the spreads. This format was later adopted by Hoaglin (1985). If the distribution is symmetric, the midpoints will all be roughly equal. On the other hand, if the midpoints vary systematically, the distribution is skewed. 
The pseudosigmas are obtained from the lower and upper summaries for each letter value. For each letter value, they are the standard deviation a normal distribution would have if its spread for the given letter value were to equal the observed spread. If the pseudosigmas are all roughly equal, the data are said to have neutral elongation. If the pseudosigmas increase systematically, the data are said to be more elongated than a normal, that is, have thicker tails. If the pseudosigmas decrease systematically, the data are said to be less elongated than a normal, that is, have thinner tails. Interpretation of the number of mild and severe outliers is more problematic. The following discussion is drawn from Hamilton (1991): Obviously, the presence of any such outliers does not rule out that the data have been drawn from a normal distribution; in large datasets, there will most certainly be observations outside (3/2)IQR and 3IQR. Severe outliers, however, make up about two per million (0.0002%) of a normal population. In samples, they lie far enough out to have substantial effects on means, standard deviations, and other classical statistics. The 0.0002%, however, should be interpreted carefully; outliers appear more often in small samples than one might expect from population proportions because of sampling variation in estimated quartiles. Monte Carlo simulation by Hoaglin, Iglewicz, and Tukey (1986) obtained these results on the percentages and numbers of outliers in random samples from a normal population: n 10 20 50 100 200 300 ∞ percentage any outliers severe 2.83 1.66 1.15 .95 .79 .75 .70 .362 .074 .011 .002 .001 .001 .0002 number any outliers severe .283 .332 .575 .95 1.58 2.25 .0362 .0148 .0055 .002 .002 .003 ∞ ∞ Thus the presence of any severe outliers in samples of less than 300 is sufficient to reject normality. Hoaglin, Iglewicz, and Tukey (1981) suggested the approximation 0.00698 + 0.4/n for the fraction of mild outliers in a sample of size n or, equivalently, 0.00698n + 0.4 for the number of outliers. 1142 lv — Letter-value displays Example 2 The generate option adds the mid, spread, psigma, and z2 variables to our data, making possible many of the diagnostic graphs suggested by Hoaglin (1985). . lv mpg, generate (output omitted ) . list _mid _spread _psigma _z2 in 1/12 _mid _spread _psigma _z2 1. 2. 3. 4. 5. 20 21.5 21.5 22.25 24.5 . 7 13 16.5 21 . 5.216359 5.771728 5.576303 5.831039 . .4501955 1.26828 2.188846 3.24255 6. 7. 8. 9. 10. 23.5 25 . . . 23 26 . . . 5.732448 6.040635 . . . 4.024532 4.631499 . . . 11. 12. 26.5 . 29 . 6.16562 . 5.53073 . Observations 12 through the end are missing for these new variables. The definition of the observations is always the same. The first observation contains the M summary; the second, the F; the third, the E; and so on. Observation 11 always contains the summary for depth 1. Observations 8–10 — corresponding to letter values Z, Y, and X — contain missing because these statistics were not calculated. We have only 74 observations, and their depth would be 1. Hoaglin (1985) suggests graphing the midsummary against z 2 . If the distribution is not skewed, the points in the resulting graph will be along a horizontal line: 20 mpg midsummary 22 24 26 . scatter _mid _z2 0 2 4 6 Z squared The graph clearly indicates the skewness of the distribution. We might also graph psigma against z2 to examine elongation. 
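For instance, the elongation check mentioned in the last sentence uses the variables just created by the generate option:

. scatter _psigma _z2

If the data were close to normal, the points would lie along a roughly horizontal line; given the increasing pseudosigmas seen in example 1, this plot trends upward for the mileage data.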
Stored results

lv stores the following in r():

Scalars
  r(N)         number of observations
  r(min)       minimum
  r(max)       maximum
  r(median)    median
  r(l_F)       lower 4th
  r(u_F)       upper 4th
  r(l_E)       lower 8th
  r(u_E)       upper 8th
  r(l_D)       lower 16th
  r(u_D)       upper 16th
  r(l_C)       lower 32nd
  r(u_C)       upper 32nd
  r(l_B)       lower 64th
  r(u_B)       upper 64th
  r(l_A)       lower 128th
  r(u_A)       upper 128th
  r(l_Z)       lower 256th
  r(u_Z)       upper 256th
  r(l_Y)       lower 512th
  r(u_Y)       upper 512th
  r(l_X)       lower 1024th
  r(u_X)       upper 1024th

The lower/upper 8ths, 16ths, . . . , 1024ths will be defined only if there are sufficient data.

Methods and formulas

Let N be the number of (nonmissing) observations on x, and let x(i) refer to the ordered data when i is an integer. Define x(i+0.5) = (x(i) + x(i+1))/2; the median is defined as x{(N+1)/2}.

Define x[d] as the pair of numbers x(d) and x(N+1−d), where d is called the depth. Thus x[1] refers to the minimum and maximum of the data. Define m = (N + 1)/2 as the depth of the median, f = (⌊m⌋ + 1)/2 as the depth of the fourths, e = (⌊f⌋ + 1)/2 as the depth of the eighths, and so on. Depths are reported on the far left of the letter-value display. The corresponding fourths of the data are x[f], the eighths are x[e], and so on. These values are reported inside the display. The middle value is defined as the corresponding midpoint of x[·]. The spreads are defined as the difference in x[·].

The corresponding point z_i on a standard normal distribution is obtained as (Hoaglin 1985, 456–457)

    z_i = F^(-1){(d_i − 1/3)/(N + 1/3)}    if d_i > 1
    z_i = F^(-1){0.695/(N + 0.390)}        otherwise

where d_i is the depth of the letter value. The corresponding pseudosigma is obtained as the ratio of the spread to −2z_i (Hoaglin 1985, 431).

Define (F_l, F_u) = x[f]. The inner fence has cutoffs F_l − (3/2)(F_u − F_l) and F_u + (3/2)(F_u − F_l). The outer fence has cutoffs F_l − 3(F_u − F_l) and F_u + 3(F_u − F_l).

The inner-fence values reported by lv are almost equal to those used by graph, box to identify outside points. The only difference is that graph uses a slightly different definition of fourths, namely, the 25th and 75th percentiles as defined by summarize; see [R] summarize.

References

Emerson, J. D., and M. A. Stoto. 1983. Transforming data. In Understanding Robust and Exploratory Data Analysis, ed. D. C. Hoaglin, C. F. Mosteller, and J. W. Tukey, 97–128. New York: Wiley.
Fox, J. 1990. Describing univariate distributions. In Modern Methods of Data Analysis, ed. J. Fox and J. S. Long, 58–125. Newbury Park, CA: Sage.
Hamilton, L. C. 1991. sed4: Resistant normality check and outlier identification. Stata Technical Bulletin 3: 15–18. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 86–90. College Station, TX: Stata Press.
Hoaglin, D. C. 1983. Letter values: A set of selected order statistics. In Understanding Robust and Exploratory Data Analysis, ed. D. C. Hoaglin, C. F. Mosteller, and J. W. Tukey, 33–57. New York: Wiley.
Hoaglin, D. C. 1985. Using quantiles to study shape. In Exploring Data Tables, Trends, and Shapes, ed. D. C. Hoaglin, C. F. Mosteller, and J. W. Tukey, 417–460. New York: Wiley.
Hoaglin, D. C., B. Iglewicz, and J. W. Tukey. 1981. Small-sample performance of a resistant rule for outlier detection. In 1980 Proceedings of the Statistical Computing Section. Washington, DC: American Statistical Association.
Hoaglin, D. C., B. Iglewicz, and J. W. Tukey. 1986. Performance of some resistant rules for outlier labeling. Journal of the American Statistical Association 81: 991–999.
Tukey, J. W. 1977. Exploratory Data Analysis.
Reading, MA: Addison–Wesley. Also see [R] diagnostic plots — Distributional diagnostic plots [R] stem — Stem-and-leaf displays [R] summarize — Summary statistics Title margins — Marginal means, predictive margins, and marginal effects Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax margins marginlist if in weight , response options options where marginlist is a list of factor variables or interactions that appear in the current estimation results. The variables may be typed with or without the i. prefix, and you may use any factor-variable syntax: . margins i.sex i.group i.sex#i.group . margins sex group sex#i.group . margins sex##group response options Description Main predict(pred opt) expression(pnl exp) dydx(varlist) eyex(varlist) dyex(varlist) eydx(varlist) continuous estimate margins for predict, pred opt estimate margins for pnl exp estimate marginal effect of variables in varlist estimate elasticities of variables in varlist estimate semielasticity— d(y)/d(lnx) estimate semielasticity— d(lny)/d(x) treat factor-level indicators as continuous options Description Main grand add the overall margin; default if no marginlist At at(atspec) atmeans asbalanced estimate margins at specified values of covariates estimate margins at the means of covariates treat all factor variables as balanced if/in/over over(varlist) subpop(subspec) estimate margins at unique values of varlist estimate margins for subpopulation Within within(varlist) estimate margins at unique values of the nesting factors in varlist Contrast contrast options any options documented in [R] margins, contrast Pairwise comparisons pwcompare options any options documented in [R] margins, pwcompare 1145 1146 margins — Marginal means, predictive margins, and marginal effects SE estimate SEs using delta method; the default estimate SEs allowing for sampling of covariates do not estimate SEs vce(delta) vce(unconditional) nose Advanced ignore weights specified in estimation do not restrict margins to the estimation sample treatment of empty cells for balanced factors specify numerical tolerance used to determine estimable functions; default is estimtolerance(1e-5) suppress estimability checks estimate margins despite potential problems use the chain rule when computing derivatives do not use the chain rule noweights noesample emptycells(empspec) estimtolerance(tol) noestimcheck force chainrule nochainrule Reporting level(#) mcompare(method) noatlegend post display options set confidence level; default is level(95) adjust for multiple comparisons; default is mcompare(noadjust) suppress legend of fixed covariate values post margins and their VCE as estimation results control columns and column formats, row spacing, line width and factor-variable labeling df(#) use t distribution with # degrees of freedom for computing p-values and confidence intervals method Description noadjust bonferroni adjustall sidak adjustall scheffe do not adjust for multiple comparisons; the default Bonferroni’s method; adjust across all terms Šidák’s method; adjust across all terms Scheffé’s method Time-series operators are allowed if they were used in the estimation. See at() under Options for a description of atspec. fweights, aweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. df(#) does not appear in the dialog box. 
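A few further illustrations of the syntax (the variable names sex and age are used only for illustration and assume a previously fit model that includes them):

. margins sex
. margins, dydx(age)
. margins sex, at(age=(20(10)50))

The first estimates margins for each level of sex, the second reports average marginal effects of age, and the third estimates margins for each level of sex with age fixed first at 20, then at 30, 40, and 50.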
Menu Statistics > Postestimation > Marginal means and predictive margins Statistics > Postestimation > Marginal effects Description Margins are statistics calculated from predictions of a previously fit model at fixed values of some covariates and averaging or otherwise integrating over the remaining covariates. margins — Marginal means, predictive margins, and marginal effects 1147 The margins command estimates margins of responses for specified values of covariates and presents the results as a table. Capabilities include estimated marginal means, least-squares means, average and conditional marginal and partial effects (which may be reported as derivatives or as elasticities), average and conditional adjusted predictions, and predictive margins. Options Warning: The option descriptions are brief and use jargon. Skip to Remarks and examples if you are reading about margins for the first time. Main predict(pred opt) and expression(pnl exp) are mutually exclusive; they specify the response. If neither is specified, the response will be the default prediction that would be produced by predict after the underlying estimation command. predict(pred opt) specifies the option(s) to be specified with the predict command to produce the variable that will be used as the response. After estimation by logistic, you could specify predict(xb) to obtain linear predictions rather than the predict command’s default, the probabilities. expression(pnl exp) specifies the response as an expression. See [R] predictnl for a full description of pnl exp. After estimation by logistic, you might specify expression(exp(predict(xb))) to use relative odds rather than probabilities as the response. For examples, see Example 12: Margins of a specified expression. dydx(varlist), eyex(varlist), dyex(varlist), and eydx(varlist) request that margins report derivatives of the response with respect to varlist rather than on the response itself. eyex(), dyex(), and eydx() report derivatives as elasticities; see Expressing derivatives as elasticities. continuous is relevant only when one of dydx() or eydx() is also specified. It specifies that the levels of factor variables be treated as continuous; see Derivatives versus discrete differences. This option is implied if there is a single-level factor variable specified in dydx() or eydx(). grand specifies that the overall margin be reported. grand is assumed when marginlist is empty. At at(atspec) specifies values for covariates to be treated as fixed. at(age=20) fixes covariate age to the value specified. at() may be used to fix continuous or factor covariates. at(age=20 sex=1) simultaneously fixes covariates age and sex at the values specified. at(age=(20 30 40 50)) fixes age first at 20, then at 30, . . . . margins produces separate results for each specified value. at(age=(20(10)50)) does the same as at(age=(20 30 40 50)); that is, you may specify a numlist. at((mean) age (median) distance) fixes the covariates at the summary statistics specified. at((p25) all) fixes all covariates at their 25th percentile values. See Syntax of at() for the full list of summary-statistic modifiers. at((mean) all (median) x x2=1.2 z=(1 2 3)) is read from left to right, with latter specifiers overriding former ones. Thus all covariates are fixed at their means except for x (fixed at its median), x2 (fixed at 1.2), and z (fixed first at 1, then at 2, and finally at 3). 
1148 margins — Marginal means, predictive margins, and marginal effects at((means) all (asobserved) x2) is a convenient way to set all covariates except x2 to the mean. Multiple at() options can be specified, and each will produce a different set of margins. See Syntax of at() for more information. atmeans specifies that covariates be fixed at their means and is shorthand for at((mean) all). atmeans differs from at((mean) all) in that atmeans will affect subsequent at() options. For instance, . margins . . . , atmeans at((p25) x) at((p75) x) produces two sets of margins with both sets evaluated at the means of all covariates except x. asbalanced is shorthand for at((asbalanced) factor) and specifies that factor covariates be evaluated as though there were an equal number of observations in each level; see Obtaining margins as though the data were balanced. asbalanced differs from at((asbalanced) factor) in that asbalanced will affect subsequent at() options in the same way as atmeans does. if/in/over over(varlist) specifies that separate sets of margins be estimated for the groups defined by varlist. The variables in varlist must contain nonnegative integer values. The variables need not be covariates in your model. When over() is combined with the vce(unconditional) option, each group is treated as a subpopulation; see [SVY] subpopulation estimation. subpop( varname if ) is intended for use with the vce(unconditional) option. It specifies that margins be estimated for the single subpopulation identified by the indicator variable or by the if expression or by both. Zero indicates that the observation be excluded; nonzero, that it be included; and missing value, that it be treated as outside of the population (and so ignored). See [SVY] subpopulation estimation for why subpop() is preferred to if expressions and in ranges when also using vce(unconditional). If subpop() is used without vce(unconditional), it is treated merely as an additional if qualifier. Within within(varlist) allows for nested designs. varlist contains the nesting variable(s) over which margins are to be estimated. See Obtaining margins with nested designs. As with over(varlist), when within(varlist) is combined with vce(unconditional), each level of the variables in varlist is treated as a subpopulation. Contrast contrast options are any of the options documented in [R] margins, contrast. Pairwise comparisons pwcompare options are any of the options documented in [R] margins, pwcompare. SE vce(delta) and vce(unconditional) specify how the VCE and, correspondingly, standard errors are calculated. vce(delta) is the default. The delta method is applied to the formula for the response and the VCE of the estimation command. This method assumes that values of the covariates used to calculate the response are given or, if all covariates are not fixed using at(), that the data are given. margins — Marginal means, predictive margins, and marginal effects 1149 vce(unconditional) specifies that the covariates that are not fixed be treated in a way that accounts for their having been sampled. The VCE is estimated using the linearization method. This method allows for heteroskedasticity or other violations of distributional assumptions and allows for correlation among the observations in the same manner as vce(robust) and vce(cluster . . . ), which may have been specified with the estimation command. This method also accounts for complex survey designs if the data are svyset. 
See Obtaining margins with survey data and representative samples. When you use complex survey data, this method requires that the linearized variance estimation method be used for the model. See [SVY] svy postestimation for an example of margins with replication-based methods. nose suppresses calculation of the VCE and standard errors. See Requirements for model specification for an example of the use of this option. Advanced noweights specifies that any weights specified on the previous estimation command be ignored by margins. By default, margins uses the weights specified on the estimator to average responses and to compute summary statistics. If weights are specified on the margins command, they override previously specified weights, making it unnecessary to specify noweights. The noweights option is not allowed after svy: estimation when the vce(unconditional) option is specified. noesample specifies that margins not restrict its computations to the estimation sample used by the previous estimation command. See Example 15: Margins evaluated out of sample. With the default delta-method VCE, noesample margins may be estimated on samples other than the estimation sample; such results are valid under the assumption that the data used are treated as being given. You can specify noesample and vce(unconditional) together, but if you do, you should be sure that the data in memory correspond to the original e(sample). To show that you understand that, you must also specify the force option. Be aware that making the vce(unconditional) calculation on a sample different from the estimation sample would be equivalent to estimating the coefficients on one set of data and computing the scores used by the linearization on another set; see [P] robust. emptycells(strict) and emptycells(reweight) are relevant only when the asbalanced option is also specified. emptycells() specifies how empty cells are handled in interactions involving factor variables that are being treated as balanced; see Obtaining margins as though the data were balanced. emptycells(strict) is the default; it specifies that margins involving empty cells be treated as not estimable. emptycells(reweight) specifies that the effects of the observed cells be increased to accommodate any missing cells. This makes the margin estimable but changes its interpretation. emptycells(reweight) is implied when the within() option is specified. estimtolerance(tol) specifies the numerical tolerance used to determine estimable functions. The default is estimtolerance(1e-5). A linear combination of the model coefficients z is found to be not estimable if mreldif(z, z × H) > tol where H is defined in Methods and formulas. 1150 margins — Marginal means, predictive margins, and marginal effects noestimcheck specifies that margins not check for estimability. By default, the requested margins are checked and those found not estimable are reported as such. Nonestimability is usually caused by empty cells. If noestimcheck is specified, estimates are computed in the usual way and reported even though the resulting estimates are manipulable, which is to say they can differ across equivalent models having different parameterizations. See Estimability of margins. force instructs margins to proceed in some situations where it would otherwise issue an error message because of apparent violations of assumptions. Do not be casual about specifying force. You need to understand and fully evaluate the statistical issues. 
For an example of the use of force, see Using margins after the estimates use command.

chainrule and nochainrule specify whether margins uses the chain rule when numerically computing derivatives. You need not specify these options when using margins after any official Stata estimator; margins will choose the appropriate method automatically. Specify nochainrule after estimation by a user-written command. We recommend using nochainrule, even though chainrule is usually safe and is always faster. nochainrule is safer because it makes no assumptions about how the parameters and covariates join to form the response. nochainrule is implied when the expression() option is specified.

Reporting

level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals.

mcompare(method) specifies the method for computing p-values and confidence intervals that account for multiple comparisons within a factor-variable term. Most methods adjust the comparisonwise error rate, αc, to achieve a prespecified experimentwise error rate, αe.

mcompare(noadjust) is the default; it specifies no adjustment.

    αc = αe

mcompare(bonferroni) adjusts the comparisonwise error rate based on the upper limit of the Bonferroni inequality

    αe ≤ m × αc

where m is the number of comparisons within the term. The adjusted comparisonwise error rate is

    αc = αe/m

mcompare(sidak) adjusts the comparisonwise error rate based on the upper limit of the probability inequality

    αe ≤ 1 − (1 − αc)^m

where m is the number of comparisons within the term. The adjusted comparisonwise error rate is

    αc = 1 − (1 − αe)^(1/m)

This adjustment is exact when the m comparisons are independent.

mcompare(scheffe) controls the experimentwise error rate using the F or χ2 distribution with degrees of freedom equal to the rank of the term.

mcompare(method adjustall) specifies that the multiple-comparison adjustments count all comparisons across all terms rather than performing multiple comparisons term by term. This leads to more conservative adjustments when multiple variables or terms are specified in marginslist. This option is compatible only with the bonferroni and sidak methods.

noatlegend specifies that the legend showing the fixed values of covariates be suppressed.

post causes margins to behave like a Stata estimation (e-class) command. margins posts the vector of estimated margins along with the estimated variance–covariance matrix to e(), so you can treat the estimated margins just as you would results from any other estimation command. For example, you could use test to perform simultaneous tests of hypotheses on the margins, or you could use lincom to create linear combinations. See Example 10: Testing margins—contrasts of margins.

display_options: noci, nopvalues, vsquish, nofvlabel, fvwrap(#), fvwrapon(style), cformat(%fmt), pformat(%fmt), sformat(%fmt), and nolstretch.

noci suppresses confidence intervals from being reported in the coefficient table.

nopvalues suppresses p-values and their test statistics from being reported in the coefficient table.

vsquish specifies that the blank space separating factor-variable terms or time-series–operated variables from other variables in the model be suppressed.

nofvlabel displays factor-variable level values rather than attached value labels. This option overrides the fvlabel setting; see [R] set showbaselevels.
fvwrap(#) allows long value labels to wrap the first # lines in the coefficient table. This option overrides the fvwrap setting; see [R] set showbaselevels.

fvwrapon(style) specifies whether value labels that wrap will break at word boundaries or break based on available space. fvwrapon(word), the default, specifies that value labels break at word boundaries. fvwrapon(width) specifies that value labels break based on available space. This option overrides the fvwrapon setting; see [R] set showbaselevels.

cformat(%fmt) specifies how to format margins, standard errors, and confidence limits in the table of estimated margins.

pformat(%fmt) specifies how to format p-values in the table of estimated margins.

sformat(%fmt) specifies how to format test statistics in the table of estimated margins.

nolstretch specifies that the width of the table of estimated margins not be automatically widened to accommodate longer variable names. The default, lstretch, is to automatically widen the table of estimated margins up to the width of the Results window. To change the default, use set lstretch off. nolstretch is not shown in the dialog box.

The following option is available with margins but is not shown in the dialog box:

df(#) specifies that the t distribution with # degrees of freedom be used for computing p-values and confidence intervals. The default typically is to use the standard normal distribution. However, if the estimation command computes the residual degrees of freedom (e(df_r)) and predict(xb) is specified with margins, the default is to use the t distribution with e(df_r) degrees of freedom.

Remarks and examples

Remarks are presented under the following headings:

Introduction
Obtaining margins of responses
    Example 1: A simple case after regress
    Example 2: A simple case after logistic
    Example 3: Average response versus response at average
    Example 4: Multiple margins from one command
    Example 5: Margins with interaction terms
    Example 6: Margins with continuous variables
    Example 7: Margins of continuous variables
    Example 8: Margins of interactions
    Example 9: Decomposing margins
    Example 10: Testing margins—contrasts of margins
    Example 11: Margins of a specified prediction
    Example 12: Margins of a specified expression
    Example 13: Margins with multiple outcomes (responses)
    Example 14: Margins with multiple equations
    Example 15: Margins evaluated out of sample
Obtaining margins of derivatives of responses (a.k.a.
marginal effects) Do not specify marginlist when you mean over() Use at() freely, especially with continuous variables Expressing derivatives as elasticities Derivatives versus discrete differences Example 16: Average marginal effect (partial effects) Example 17: Average marginal effect of all covariates Example 18: Evaluating marginal effects over the response surface Obtaining margins with survey data and representative samples Example 19: Inferences for populations, margins of response Example 20: Inferences for populations, marginal effects Example 21: Inferences for populations with svyset data Standardizing margins Obtaining margins as though the data were balanced Balancing using asbalanced Balancing by standardization Balancing nonlinear responses Treating a subset of covariates as balanced Using fvset design Balancing in the presence of empty cells Obtaining margins with nested designs Introduction Margins with nested designs as though the data were balanced Coding of nested designs Special topics Requirements for model specification Estimability of margins Manipulability of tests Using margins after the estimates use command Syntax of at() Estimation commands that may be used with margins Video examples Glossary Introduction margins is a postestimation command, a command for use after you have fit a model using an estimation command such as regress or logistic, or using almost any other estimation command. margins estimates and reports margins of responses and margins of derivatives of responses, also known as marginal effects. A margin is a statistic based on a fitted model in which some of or all the margins — Marginal means, predictive margins, and marginal effects 1153 covariates are fixed. Marginal effects are changes in the response for change in a covariate, which can be reported as a derivative, elasticity, or semielasticity. For a brief overview of margins, see Williams (2012). Obtaining margins of responses What we call margins of responses are also known as predictive margins, adjusted predictions, and recycled predictions. When applied to balanced data, margins of responses are also called estimated marginal means and least-squares means. A margin is a statistic based on a fitted model calculated over a dataset in which some of or all the covariates are fixed at values different from what they really are. For instance, after a linear regression fit on males and females, the marginal mean (margin of mean) for males is the predicted mean of the dependent variable, where every observation is treated as if it represents a male; thus those observations that in fact do represent males are included, as well as those observations that represent females. The marginal mean for female would be similarly obtained by treating all observations as if they represented females. In making the calculation, sex is treated as male or female everywhere it appears in the model. The model might be . regress y age bp i.sex sex#c.age sex#c.bp and then, in making the marginal calculation of the mean for males and females, margins not only accounts for the direct effect of i.sex but also for the indirect effects of sex#c.age and sex#c.bp. The response being margined can be any statistic produced by [R] predict, or any expression of those statistics. Standard errors are obtained by the delta method, at least by default. The delta method assumes that the values at which the covariates are evaluated to obtain the marginal responses are fixed. 
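To make this averaging concrete, the margin for males from a model like the one above can be reproduced by hand. The sketch below uses the margex dataset and the simpler model from example 1 below; it reproduces the point estimate only, not the delta-method standard error just described.

. use http://www.stata-press.com/data/r13/margex
. regress y i.sex i.group
. margins sex
. preserve
. replace sex = 0                    // treat every observation as if it were male
. predict yhat0 if e(sample)         // linear predictions with sex fixed at 0
. summarize yhat0                    // the mean matches the margin reported for male
. restore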
When your sample represents a population, whether you are using svy or not (see [SVY] svy), you can specify margins’ vce(unconditional) option and margins will produce standard errors that account for the sampling variability of the covariates. Some researchers reserve the term predictive margins to describe this. The best way to understand margins is to see some examples. You can run the following examples yourself if you type . use http://www.stata-press.com/data/r13/margex (Artificial data for margins) 1154 margins — Marginal means, predictive margins, and marginal effects Example 1: A simple case after regress . regress y i.sex i.group (output omitted ) . margins sex Predictive margins Model VCE : OLS Expression : Linear prediction, predict() Margin sex male female 60.56034 78.88236 Delta-method Std. Err. .5781782 .5772578 t 104.74 136.65 Number of obs = 3000 P>|t| [95% Conf. Interval] 0.000 0.000 59.42668 77.7505 61.69401 80.01422 The numbers reported in the “Margin” column are average values of y. Based on a linear regression of y on sex and group, 60.6 would be the average value of y if everyone in the data were treated as if they were male, and 78.9 would be the average value if everyone were treated as if they were female. Example 2: A simple case after logistic margins may be used after almost any estimation command. . logistic outcome i.sex i.group (output omitted ) . margins sex Predictive margins Model VCE : OIM Expression : Pr(outcome), predict() Margin sex male female .1286796 .1905087 Delta-method Std. Err. .0111424 .0089719 Number of obs z 11.55 21.23 = 3000 P>|z| [95% Conf. Interval] 0.000 0.000 .106841 .1729241 .1505182 .2080933 The numbers reported in the “Margin” column are average predicted probabilities. Based on a logistic regression of outcome on sex and group, 0.13 would be the average probability of outcome if everyone in the data were treated as if they were male, and 0.19 would be the average probability if everyone were treated as if they were female. margins reports average values after regress and average probabilities after logistic. By default, margins makes tables of whatever it is that predict (see [R] predict) predicts by default. Alternatively, margins can make tables of anything that predict can produce if you use margins’ predict() option; see Example 11: Margins of a specified prediction. margins — Marginal means, predictive margins, and marginal effects 1155 Example 3: Average response versus response at average In example 2, margins reported average probabilities of outcome for sex = 0 and sex = 1. If we instead wanted the predicted probabilities evaluated at the mean of the covariates, we would specify margins’ atmeans option. We previously typed . logistic outcome i.sex i.group (output omitted ) . margins sex (output omitted ) and now we type . margins sex, atmeans Adjusted predictions Model VCE : OIM Expression : Pr(outcome), predict() at : 0.sex = .4993333 1.sex = .5006667 1.group = .3996667 2.group = .3726667 3.group = .2276667 Margin sex male female .0966105 .1508362 Delta-method Std. Err. .0089561 .0118064 Number of obs = 3000 (mean) (mean) (mean) (mean) (mean) z 10.79 12.78 P>|z| [95% Conf. Interval] 0.000 0.000 .0790569 .127696 .1141641 .1739764 The prediction at the average of the covariates is different from the average of the predictions. The first is the expected probability of a person with average characteristics, a person who, in another problem, might be 3/4 married and have 1.2 children. 
The second is the average of the probability among actual persons in the data. When you specify atmeans or any other at option, margins reports the values used for the covariates in the legend above the table. margins lists the values for all the covariates, including values it may not use, in the results that follow. In this example, margins reported means for sex even though those means were not used. They were not used because we asked for the margins of sex, so sex was fixed first at 0 and then at 1. If you wish to suppress this legend, specify the nolegend option. Example 4: Multiple margins from one command More than one margin can be reported by just one margins command. You can type . margins sex group and doing that is equivalent in terms of the output to typing . margins sex . margins group When multiple margins are requested on the same command, each is estimated separately. There is, however, a difference when you also specify margins’ post option. Then the variance–covariance matrix for all margins requested is posted, and that is what allows you to test equality of margins, etc. Testing equality of margins is covered in Example 10: Testing margins—contrasts of margins. 1156 margins — Marginal means, predictive margins, and marginal effects In any case, below we request margins for sex and for group. . margins sex group Predictive margins Model VCE : OIM Expression Number of obs = 3000 : Pr(outcome), predict() Margin Delta-method Std. Err. z P>|z| [95% Conf. Interval] sex male female .1286796 .1905087 .0111424 .0089719 11.55 21.23 0.000 0.000 .106841 .1729241 .1505182 .2080933 group 1 2 3 .2826207 .1074814 .0291065 .0146234 .0094901 .0073417 19.33 11.33 3.96 0.000 0.000 0.000 .2539593 .0888812 .0147169 .311282 .1260817 .043496 Example 5: Margins with interaction terms The estimation command on which margins bases its calculations may contain interaction terms, such as an interaction of sex and group: . logistic outcome i.sex i.group sex#group (output omitted ) . margins sex group Predictive margins Model VCE : OIM Expression Number of obs = 3000 : Pr(outcome), predict() Margin Delta-method Std. Err. z P>|z| [95% Conf. Interval] sex male female .1561738 .1983749 .0132774 .0101546 11.76 19.54 0.000 0.000 .1301506 .1784723 .182197 .2182776 group 1 2 3 .3211001 .1152127 .0265018 .0176403 .0099854 .0109802 18.20 11.54 2.41 0.000 0.000 0.016 .2865257 .0956417 .0049811 .3556744 .1347838 .0480226 We fit the model by typing logistic outcome i.sex i.group sex#group, but the meaning would have been the same had we typed logistic outcome sex##group. As mentioned in example 4, the results for sex and the results for group are calculated independently, and we would have obtained the same results had we typed margins sex followed by margins group. The margin for male (sex = 0) is 0.16. The probability 0.16 is the average probability if everyone in the data were treated as if sex = 0, including sex = 0 in the main effect and sex = 0 in the interaction of sex with group. Had we specified margins sex, atmeans, we would have obtained not average probabilities but the probabilities evaluated at the average. Rather than obtaining 0.16, we would have obtained 0.10 margins — Marginal means, predictive margins, and marginal effects 1157 for sex = 0. The 0.10 is calculated by taking the fitted model, plugging in sex = 0 everywhere, and plugging in the average value of the group indicator variables everywhere they are used. 
That is, rather than treating the group indicators as being (1, 0, 0), (0, 1, 0), or (0, 0, 1) depending on observation, the group indicators are treated as being (0.40, 0.37, 0.23), which are the average values of group = 1, group = 2, and group = 3. Example 6: Margins with continuous variables To the above example, we will add the continuous covariate age to the model and then rerun margins sex group. . logistic outcome i.sex i.group sex#group age (output omitted ) . margins sex group Predictive margins Model VCE : OIM Expression : Pr(outcome), predict() Margin Delta-method Std. Err. z Number of obs = 3000 P>|z| [95% Conf. Interval] sex male female .1600644 .1966902 .0125653 .0100043 12.74 19.66 0.000 0.000 .1354368 .1770821 .184692 .2162983 group 1 2 3 .2251302 .150603 .0736157 .0123233 .0116505 .0337256 18.27 12.93 2.18 0.000 0.000 0.029 .200977 .1277685 .0075147 .2492834 .1734376 .1397167 Compared with the results presented in example 5, results for sex change little, but results for groups 1 and 3 change markedly. The tables differ because now we are adjusting for the continuous covariate age, as well as for sex and group. We will continue examining interactions in example 8. Because we have added a continuous variable, let’s take a detour to explain how to obtain margins for continuous variables and to explain their interpretation. Example 7: Margins of continuous variables Continuing with our example of . logistic outcome i.sex i.group sex#group age let’s examine the continuous covariate age. 1158 margins — Marginal means, predictive margins, and marginal effects You are not allowed to type margins age; doing that will produce an error: . margins age ‘age’ not found in list of covariates r(322); The message “‘age’ not found in list of covariates” is margins’ way of saying, “Yes, age might be in the model, but if it is, it is not included as a factor variable; it is in as a continuous variable.” Sometimes, Stata is overly terse. margins might also say that because age is continuous there are an infinite number of values at which it could evaluate the margins. At what value(s) should age be fixed? margins requires more guidance with continuous covariates. We can provide that guidance by using the at() option and typing . margins, at(age=40) To understand why that yields the desired result, let us tell you that if you were to type . margins margins would report the overall margin—the margin that holds nothing constant. Because our model is logistic, the average value of the predicted probabilities would be reported. The at() option fixes one or more covariates to the value(s) specified and can be used with both factor and continuous variables. Thus, if you typed margins, at(age=40), then margins would average over the data the responses for everybody, setting age=40. Here is what happens when you type that: . margins, at(age=40) Predictive margins Model VCE : OIM Expression at : Pr(outcome), predict() : age = Margin _cons Number of obs .0070731 3000 40 Delta-method Std. Err. .1133603 = z 16.03 P>|z| [95% Conf. Interval] 0.000 .0994972 .1272234 Reported is the margin for age = 40, adjusted for the other covariates in our model. If we wanted to obtain the margins for age 30, 35, 40, 45, and 50, we could type . margins, at(age=(30 35 40 45 50)) or, equivalently, . margins, at(age=(30(5)50)) Example 8: Margins of interactions Our model is . 
logistic outcome i.sex i.group sex#group age margins — Marginal means, predictive margins, and marginal effects 1159 We can obtain the margins of all possible combinations of the levels of sex and the levels of group by typing . margins sex#group Predictive margins Model VCE : OIM Expression Number of obs = 3000 : Pr(outcome), predict() Margin sex#group male#1 male#2 male#3 female#1 female#2 female#3 .2379605 .0658294 .0538001 .2158632 .2054406 .085448 Delta-method Std. Err. .0237178 .0105278 .0136561 .0112968 .0183486 .0533914 z 10.03 6.25 3.94 19.11 11.20 1.60 P>|z| 0.000 0.000 0.000 0.000 0.000 0.110 [95% Conf. Interval] .1914745 .0451953 .0270347 .1937218 .1694781 -.0191973 .2844465 .0864636 .0805656 .2380045 .2414032 .1900932 The first line in the table reports the marginal probability for sex = 0 (male) and group = 1. That is, it reports the estimated probability if everyone in the data were treated as if they were sex = 0 and group = 1. Also reported are all the other combinations of sex and group. By the way, we could have typed margins sex#group even if our fitted model did not include sex#group. Estimation is one thing, and asking questions about the nature of the estimates is another. margins does, however, require that i.sex and i.group appear somewhere in the model, because fixing a value outside the model would just produce the grand margin, and you can separately ask for that if you want it by typing margins without arguments. Example 9: Decomposing margins We have the model . logistic outcome i.sex i.group sex#group age In example 6, we typed margins sex and obtained 0.160 for males and 0.197 for females. We are going to decompose each of those numbers. Let us explain: 1. The margin for males, 0.160, treats everyone as if they were male, and that amounts to simultaneously 1a. treating males as males and 1b. treating females as males. 2. The margin for females, 0.197, treats everyone as if they were female, and that amounts to simultaneously 2a. treating males as females and 2b. treating females as females. The margins 1a and 1b are the decomposition of 1, and the margins 2a and 2b are the decomposition of 2. We could obtain 1a and 2a by typing . margins if sex==0, at(sex=(0 1)) 1160 margins — Marginal means, predictive margins, and marginal effects because the qualifier if sex==0 would restrict margins to running on only the males. Similarly, we could obtain 1b and 2b by typing . margins if sex==1, at(sex=(0 1)) We run these examples below: . margins if sex==0, at(sex=(0 1)) Predictive margins Model VCE : OIM Expression : Pr(outcome), predict() 1._at : sex = 2._at : sex = Margin _at 1 2 .0794393 .1335584 Delta-method Std. Err. .0062147 .0127351 . margins if sex==1, at(sex=(0 1)) Predictive margins Model VCE : OIM Expression : Pr(outcome), predict() 1._at : sex = 2._at : sex = Margin _at 1 2 .2404749 .2596538 Delta-method Std. Err. .0199709 .0104756 Number of obs = 1498 0 1 z 12.78 10.49 P>|z| [95% Conf. Interval] 0.000 0.000 .0672586 .1085981 Number of obs .0916199 .1585187 = 1502 0 1 z 12.04 24.79 P>|z| [95% Conf. 
Interval] 0.000 0.000 .2013326 .2391219 .2796171 .2801857 Putting together the results from example 6 and the results above, we have Margin treating everybody as themself 0.170 Margin treating everybody as male Margin treating male as male Margin treating female as male 0.160 0.079 0.240 Margin treating everybody as female Margin treating male as female Margin treating female as female 0.197 0.134 0.260 Example 10: Testing margins—contrasts of margins Continuing with the previous example, it would be interesting to test the equality of 2b and 1b, to test whether the average probability of a positive outcome for females treated as females is equal to that for females treated as males. That test would be different from testing the overall significance of sex in our model. The test performed on our model would be a test of whether the probability of a positive outcome differs between males and females when they have equal values of the other covariates. The test of equality of margins is a test of whether the average probabilities differ given the different pattern of values of the other covariates that the two sexes have in our data. margins — Marginal means, predictive margins, and marginal effects 1161 We can also perform such tests by treating the results from margins as estimation results. There are three steps required to perform tests on margins. First, you must arrange it so that all the margins of interest are reported by just one margins command. Second, you must specify margins’ post option. Third, you perform the test with the test command. Such tests and comparisons can be readily performed by contrasting margins; see [R] margins, contrast. Also see Contrasts of margins—effects (discrete marginal effects) in [R] marginsplot. In the previous example, we used two commands to obtain our results, namely, . margins if sex==0, at(sex=(0 1)) . margins if sex==1, at(sex=(0 1)) We could, however, have obtained the same results by typing just one command: . margins, over(sex) at(sex=(0 1)) Performing margins, over(sex) first restricts the sample to sex==0 and then restricts it to sex==1, and that is equivalent to the two different if conditions that we specified before. To test whether females treated as females is equal to females treated as males, we will need to type . margins, over(sex) at(sex=(0 1)) post . test _b[2._at#1.sex] = _b[1._at#1.sex] We admit that the second command may seem to have come out of nowhere. When we specify post on the margins command, margins behaves as if it were an estimation command, which means that 1) it posts its estimates and full VCE to e(), 2) it gains the ability to replay results just as any estimation command can, and 3) it gains access to the standard postestimation commands. Item 3 explains why we could use test. We learned that we wanted to test b[2. at#1.sex] and b[1. at#1.sex] by replaying the estimation results, but this time with the standard estimation command coeflegend option. So what we typed was . margins, over(sex) at(sex=(0 1)) post . margins, coeflegend . test _b[2._at#1.sex] = _b[1._at#1.sex] 1162 margins — Marginal means, predictive margins, and marginal effects We will let you try margins, coeflegend for yourself. The results of running the other two commands are . 
margins, over(sex) at(sex=(0 1)) post Predictive margins Model VCE : OIM Expression : Pr(outcome), predict() over : sex 1._at : 0.sex sex = 1.sex sex = 2._at : 0.sex sex = 1.sex sex = Margin _at#sex 1#male 1#female 2#male 2#female .0794393 .2404749 .1335584 .2596538 Number of obs 0 1 1 z 12.78 12.04 10.49 24.79 P>|z| [95% Conf. Interval] 0.000 0.000 0.000 0.000 .0672586 .2013326 .1085981 .2391219 . test _b[2._at#1.sex] = _b[1._at#1.sex] ( 1) - 1bn._at#1.sex + 2._at#1.sex = 0 chi2( 1) = 0.72 Prob > chi2 = 0.3951 We can perform the same test in one command using contrasts of margins: . logistic outcome i.sex i.group sex#group age (output omitted ) . margins, over(sex) at(sex=(0 1)) contrast(atcontrast(r._at) wald) Contrasts of Model VCE Expression over predictive margins : OIM : Pr(outcome), predict() : sex 1._at : 0.sex sex 1.sex sex : 0.sex sex 1.sex sex 2._at _at@sex (2 vs 1) male (2 vs 1) female Joint 3000 0 Delta-method Std. Err. .0062147 .0199709 .0127351 .0104756 = = 0 = 0 = 1 = 1 df chi2 P>chi2 1 1 2 14.59 0.72 16.13 0.0001 0.3951 0.0003 .0916199 .2796171 .1585187 .2801857 margins — Marginal means, predictive margins, and marginal effects _at@sex (2 vs 1) male (2 vs 1) female Contrast Delta-method Std. Err. .0541192 .0191789 .0141706 .0225516 1163 [95% Conf. Interval] .0263453 -.0250215 .081893 .0633793 We refitted our logistic model because its estimation results were replaced when we posted our margins. The syntax to perform the contrast we want is admittedly not obvious. Contrasting (testing) across at() groups is more difficult than contrasting across the margins themselves or across over() groups, because we have no natural place for the contrast operators (r., in our case). We also explicitly requested Wald tests of the contrasts, which are not provided by default. Nevertheless, the chi-squared statistic and its p-value for (2 vs 1) for male matches the results of our test command. We also obtain the test of whether the response of males treated as males is equal to the response of males treated as females. For a gentler introduction to contrasts of margins, see [R] margins, contrast. Example 11: Margins of a specified prediction We will fit the model . use http://www.stata-press.com/data/r13/margex . tobit ycn i.sex i.group sex#group age, ul(90) and we will tell the following story about the variables: We run a peach orchard where we allow people to pick their own peaches. A person receives one empty basket in exchange for $20, along with the right to enter the orchard. There is no official limit on how many peaches a person can pick, but only 90 peaches will fit into a basket. The dependent variable in the above tobit model, ycn, is the number of peaches picked. We use tobit, a special case of censored-normal regression, because ycn is censored at 90. After fitting this model, if we typed . margins sex we would obtain the margins for males and for females of the uncensored number of peaches picked. We would obtain that because predict after tobit produces the uncensored number by default. To obtain the censored prediction, we would have to specify predict’s ystar(.,90) option. If we want the margins based on that response, we type . margins sex, predict(ystar(.,90)) 1164 margins — Marginal means, predictive margins, and marginal effects The results of typing that are . tobit ycn i.sex i.group sex#group age, ul(90) (output omitted ) . 
margins sex, predict(ystar(.,90)) Predictive margins Model VCE : OIM Expression Number of obs 3000 : E(ycn*|ycn<90), predict(ystar(.,90)) Margin sex male female = 62.21804 78.34272 Delta-method Std. Err. .5996928 .455526 z 103.75 171.98 P>|z| [95% Conf. Interval] 0.000 0.000 61.04266 77.4499 63.39342 79.23553 In our previous examples, sex = 1 has designated females, so evidently the females visiting our orchard are better at filling baskets than the men. Example 12: Margins of a specified expression Continuing with our peach orchard example and the previously fit model . use http://www.stata-press.com/data/r13/margex . tobit ycn i.sex i.group sex#group age, ul(90) let’s examine how well our baskets are working for us. What is the proportion of the number of peaches actually picked to the number that would have been picked were the baskets larger? As mentioned in example 11, predict, ystar(.,90) produces the expected number picked given the limit of basket size. predict, xb would predict the expected number without a limit. We want the ratio of those two predictions. That ratio will measure as a proportion how well the baskets work. Thus we could type . margins sex, expression(predict(ystar(.,90))/predict(xb)) That would give us the proportion for everyone treated as male and everyone treated as female, but what we want to know is how well baskets work for true males and true females, so we will type . margins, over(sex) expression(predict(ystar(.,90))/predict(xb)) . margins, over(sex) expression(predict(ystar(0,90))/predict(xb)) Predictive margins Model VCE : OIM Expression over Number of obs 3000 : predict(ystar(0,90))/predict(xb) : sex Margin sex male female = .9811785 .9419962 Delta-method Std. Err. .0013037 .0026175 z 752.60 359.88 P>|z| [95% Conf. Interval] 0.000 0.000 .9786233 .936866 .9837337 .9471265 By the way, we could count the number of peaches saved by the limited basket size during the period of data collection by typing margins — Marginal means, predictive margins, and marginal effects 1165 . count 3000 . margins, expression(3000*(predict(xb)-predict(ystar(.,90)))) (output omitted ) The number of peaches saved turns outs to be 9,183. Example 13: Margins with multiple outcomes (responses) Estimation commands such as mlogit and mprobit (see [R] mlogit and [R] mprobit) calculate multiple responses, and those multiple responses are reflected in the options available with predict after estimation. Obtaining margins for such estimators is thus the same as obtaining margins of a specified prediction, which was demonstrated in example 11. The solution is to include the predict opt that selects the desired response in margins’ predict(predict opt) option. If we fit the multinomial logistic model . mlogit group i.sex age then to obtain the margins for the probability that group = 1, we would type . margins sex, predict(outcome(1)) and to obtain the margins for the probability that group = 3, we would type . margins sex, predict(outcome(3)) We learned about the outcome(1) and outcome(3) options by looking in [R] mlogit postestimation. For an example using margins with a multiple-outcome estimator, see example 4 in [R] mlogit postestimation. Example 14: Margins with multiple equations Estimation commands such as mvreg, manova, sureg, and reg3 (see [MV] mvreg, [MV] manova, [R] sureg, and [R] reg3) fit multiple equations. 
Obtaining margins for such estimators is the same as obtaining margins with multiple outcomes (see example 13), which in turn is the same as obtaining margins of a specified prediction (see example 11). You place the relevant option from the estimator’s predict command into margins’ predict(predict opt) option. If we fit the seemingly unrelated regression model . sureg (y = i.sex age) (distance = i.sex i.group) we can obtain the marginal means of y for males and females by typing . margins sex, predict(equation(y)) and we can obtain the marginal means of distance by typing . margins sex, predict(equation(distance)) We could obtain the difference between the margins of y and distance by typing . margins sex, expression(predict(equation(y)) > predict(equation(distance))) More examples can be found in [MV] manova and [MV] manova postestimation. 1166 margins — Marginal means, predictive margins, and marginal effects Example 15: Margins evaluated out of sample You can fit your model on one dataset and use margins on another if you specify margins’ noesample option. Remember that margins reports estimated average responses, and, unless you lock all the covariates at fixed values by using the at() option, the remaining variables are allowed to vary as they are observed to vary in the data. That is indeed the point of using margins. The fitted model provides the basis for adjusting for the remaining variables, and the data provide their values. The predictions produced by margins are of interest assuming the data used by margins are in some sense interesting or representative. In some cases, you might need to fit your model on one set of data and perform margins on another. In example 11, we fit the model . tobit ycn i.sex i.group sex#group age, ul(90) and we told a story about our peach orchard in which we charged people $20 to collect a basket of peaches, where baskets could hold at most 90 peaches. Let us now tell you that we believe the data on which we estimated those margins were unrepresentative, or at least, we have a more representative sample stored in another .dta file. That dataset includes the demographics of our customers but does not include counts of peaches picked. It is a lot of work counting those peaches. Thus we will fit our model just as we did previously using the detailed data, but we will bring the other, more representative dataset into memory before issuing the margins sex, predict(ystar(.,90)) command, and we will add noesample to it. . use http://www.stata-press.com/data/r13/margex (Artificial data for margins) . tobit ycn i.sex i.group sex#group age, ul(90) (output omitted ) . use http://www.stata-press.com/data/r13/peach . margins sex, predict(ystar(.,90)) noesample Predictive margins Model VCE : OIM Expression = 2727 : E(ycn*|ycn<90), predict(ystar(.,90)) Margin sex 0 1 Number of obs 56.79774 75.02146 Delta-method Std. Err. 1.003727 .643742 z 56.59 116.54 P>|z| [95% Conf. Interval] 0.000 0.000 54.83047 73.75975 58.76501 76.28317 In example 12, we produced an estimate of the number of peaches saved by the limited-size baskets. We can update that estimate using the new demographic data by typing . count 2727 . margins, exp(2727*(predict(xb)-predict(ystar(.,90)))) noesample (output omitted ) By running the above, we find that the updated number of peaches saved is 6,408. margins — Marginal means, predictive margins, and marginal effects 1167 Obtaining margins of derivatives of responses (a.k.a. 
marginal effects) Derivatives of responses are themselves responses, so everything said above in Obtaining margins of responses is equally true of derivatives of responses, and every example above could be repeated here substituting the derivative of the response for the response. Derivatives are of interest because they are an informative way of summarizing fitted results. The change in a response for a change in the covariate is easy to understand and to explain. In simple models, one hardly needs margins to assist in obtaining such margins. Consider the simple linear regression y = β0 + β1 × sex + β2 × age + The derivatives of the responses are dy/d(sex) = β1 dy/d(age) = β2 The derivatives are the fitted coefficients. How does y change between males and females? It changes by β1 . How does y change with age? It changes by β2 per year. If you make the model a little more complicated, however, the need for margins arises. Consider the model y = β0 + β1 × sex + β2 × age + β3 × age2 + Now the derivative with respect to age is dy/d(age) = β2 + 2 × β3 × age The change in y for a change in age itself changes with age, and so to better understand the fitted results, you might want to make a table of the change in y for a change in age for age = 30, age = 40, and age = 50. margins can do that. Consider an even more complicated model, such as y = β0 + β1 × sex + β2 × age + β3 × age2 + β4 × bp + β5 × sex × bp + β6 × tmt + β7 × tmt × age + β8 × tmt × age2 + The derivatives are dy/d(sex) = β1 + β5 × bp dy/d(age) = β2 + 2 × β3 × age + β7 × tmt + 2 × β8 × tmt × age dy/d(bp) = β4 + β5 × sex dy/d(tmt) = β6 + β7 × age + β8 × age2 At this point, margins becomes indispensable. (1) 1168 margins — Marginal means, predictive margins, and marginal effects Do not specify marginlist when you mean over() margins has the same syntax when used with derivatives of responses as when used with responses. To obtain derivatives, one specifies the dydx() option. If we wanted to examine the response variable dy/d(tmt), we would specify margins’ dydx(tmt) option. The rest of the margins command has the same syntax as ordinarily, although one tends to specify different syntactical elements. For instance, one usually does not specify a marginlist. If we typed . margins sex, dydx(tmt) we would obtain dy/d(tmt) calculated first as if everyone were male and then as if everyone were female. At the least, we would probably want to specify . margins sex, dydx(tmt) grand so as also to obtain dy/d(tmt), the overall margin, the margin with everyone having their own value of sex. Usually, however, all we want is the overall margin, and because grand is the default when the marginlist is not specified, we would just type . margins, dydx(tmt) Alternatively, if we were interested in the decomposition by sex, then rather than type margins sex, dydx(tmt), we probably want to type . margins, over(sex) dydx(tmt) This command gives us the average effect of tmt for males and again for females rather than the average effect with everyone treated as male and then again with everyone treated as female. Use at() freely, especially with continuous variables Another option one tends to use more often with derivatives of responses than one does with responses is at(). Such use is often to better understand or to communicate how the response varies, or, in technical jargon, to explore the nature of the response surface. 
For instance, the effect dy/d(tmt) in (1) is equal to β6 + β7 × age + β8 × age2 , and so simply to understand how treatment varies with age, we may want to fix age at various values. We might type . margins, dydx(tmt) over(sex) at(age=(30 40 50)) Expressing derivatives as elasticities You specify the dydx(varname) option on the margins command to use dy/d(varname) as the response variable. If you want that derivative expressed as an elasticity, you can specify eyex(varname), eydx(varname), or dyex(varname). You substitute e for d where you want an elasticity. The formulas are dydx() = dy/dx eyex() = dy/dx × (x/y) eydx() = dy/dx × (1/y) dyex() = dy/dx × (x) margins — Marginal means, predictive margins, and marginal effects 1169 and the interpretations are dydx(): eyex(): eydx(): dyex(): proportional proportional change change change change in in in in y y y y for for for for a a a a change change change proportional change proportional in in in in x x x x As margins always does with response functions, calculations are made at the observational level and are then averaged. Let’s assume that in observation 5, dy/dx = 0.5, y = 15, and x = 30; then dydx() = 0.5 eyex() = 1.0 eydx() = 0.03 dyex() = 15.0 Many social scientists would informally explain the meaning of eyex() = 1 as “y increases 100% when x increases 100%” or as “y doubles when x doubles”, although neither statement is literally true. eyex(), eydx(), and dyex() are rates evaluated at a point, just as dydx() is a rate, and all such interpretations are valid only for small (infinitesimal) changes in x. It is true that eyex() = 1 means y increases with x at a rate such that, if the rate were constant, y would double if x doubled. This issue of casual interpretation is no different from casually interpreting dydx() as if it represents the response to a unit change. It is not necessarily true that dydx() = 0.5 means that “y increases by 0.5 if x increases by 1”. It is true that “y increases with x at a rate such that, if the rate were constant, y would increase by 0.5 if x increased by 1”. dydx(), eyex(), eydx(), and dyex() may be used with continuous x variables. dydx() and eydx() may also be used with factor variables. Derivatives versus discrete differences In (1), y = β0 + β1 × sex + β2 × age + β3 × age2 + β4 × bp + β5 × sex × bp + β6 × tmt + β7 × tmt × age + β8 × tmt × age2 + Let us call your attention to the derivatives of y with respect to age and sex: dy/d(age) = β2 + 2 × β3 × age + β7 × tmt + 2 × β8 × tmt × age (2) dy/d(sex) = β1 + β5 × bp (3) age is presumably a continuous variable and (2) is precisely how margins calculates its derivatives when you type margins, dydx(age). sex, however, is presumably a factor variable, and margins does not necessarily make the calculation using (3) were you to type margins, dydx(sex). We will explain, but let us first clarify what we mean by a continuous and a factor variable. Say that you fit (1) by typing . regress y i.sex age c.age#c.age i.bp bp#sex > i.tmt tmt#c.age tmt#c.age#c.age 1170 margins — Marginal means, predictive margins, and marginal effects It is important that sex entered the model as a factor variable. It would not do to type regress y sex . . . because then sex would be a continuous variable, or at least it would be a continuous variable from Stata’s point of view. The model estimates would be the same, but margins’ understanding of those estimates would be a little different. With the model estimated using i.sex, margins understands that either sex is 0 or sex is 1. 
With the model estimated using sex, margins thinks sex is continuous and, for instance, sex = 1.5 is a possibility. margins calculates dydx() differently for continuous and for factor variables. For continuous variables, margins calculates dy/dx. For factor variables, margins calculates the discrete firstdifference from the base category. To obtain that for sex, write down the model and then subtract from it the model evaluated at the base category for sex, which is sex = 0. If you do that, you will get the same formula as we obtained for the derivative, namely, discrete difference{(sex = 1) − (sex = 0)} = β1 + β5 × bp We obtain the same formula because our model is linear regression. Outside of linear regression, and outside of linear response functions generally, the discrete difference is not equal to the derivative. The discrete difference is not equal to the derivative for logistic regression, probit, etc. The discrete difference calculation is generally viewed as better for factor variables than the derivative calculation because the discrete difference is what would actually be observed. If you want the derivative calculation for your factor variables, specify the continuous option on the margins command. Example 16: Average marginal effect (partial effects) Concerning the title of this example, the way we use the term marginal effect, the effects of factor variables are calculated using discrete first-differences. If you wanted the continuous calculation, you would specify margins’ continuous option in what follows. . use http://www.stata-press.com/data/r13/margex (Artificial data for margins) . logistic outcome treatment##group age c.age#c.age treatment#c.age (output omitted ) . margins, dydx(treatment) Average marginal effects Model VCE : OIM Expression : Pr(outcome), predict() dy/dx w.r.t. : 1.treatment dy/dx 1.treatment .0385625 Number of obs = 3000 Delta-method Std. Err. z P>|z| [95% Conf. Interval] .0162848 2.37 0.018 .0066449 .0704801 Note: dy/dx for factor levels is the discrete change from the base level. The average marginal effect of treatment on the probability of a positive outcome is 0.039. margins — Marginal means, predictive margins, and marginal effects 1171 Example 17: Average marginal effect of all covariates We will continue with the model . logistic outcome treatment##group age c.age#c.age treatment#c.age if we wanted the average marginal effects for all covariates, we would type margins, dydx(*) or margins, dydx( all); they mean the same thing. This is probably the most common way margins, dydx() is used. . margins, dydx(*) Average marginal effects Model VCE : OIM Expression : Pr(outcome), predict() dy/dx w.r.t. : 1.treatment 2.group 3.group age dy/dx Number of obs = 3000 Delta-method Std. Err. z P>|z| [95% Conf. Interval] 1.treatment .0385625 .0162848 2.37 0.018 .0066449 .0704801 group 2 3 -.0776906 -.1505652 .0181584 .0400882 -4.28 -3.76 0.000 0.000 -.1132805 -.2291366 -.0421007 -.0719937 age .0095868 .0007796 12.30 0.000 .0080589 .0111148 Note: dy/dx for factor levels is the discrete change from the base level. Example 18: Evaluating marginal effects over the response surface Continuing with the model . logistic outcome treatment##group age c.age#c.age treatment#c.age What follows maps out the entire response surface of our fitted model. We report the marginal effect of treatment evaluated at age = 20, 30, . . . , 60, by each level of group. 1172 margins — Marginal means, predictive margins, and marginal effects . 
margins group, dydx(treatment) at(age=(20(10)60)) Conditional marginal effects Number of obs Model VCE : OIM Expression : Pr(outcome), predict() dy/dx w.r.t. : 1.treatment 1._at : age = 20 2._at : age = 30 3._at : age = 40 4._at : age = 50 5._at : age dy/dx 1.treatment _at#group 1 1 1 2 1 3 2 1 2 2 2 3 3 1 3 2 3 3 4 1 4 2 4 3 5 1 5 2 5 3 -.0208409 .009324 .0006558 -.0436964 .0382959 .0064564 -.055676 .1152235 .0284808 -.027101 .2447682 .0824401 .0292732 .3757777 .1688268 = 60 Delta-method Std. Err. z .0152862 .0059896 .0048682 .0279271 .0120405 .0166581 .0363191 .0209858 .0471293 .0395501 .0362623 .1025028 .0587751 .0578106 .1642191 -1.36 1.56 0.13 -1.56 3.18 0.39 -1.53 5.49 0.60 -0.69 6.75 0.80 0.50 6.50 1.03 P>|z| 0.173 0.120 0.893 0.118 0.001 0.698 0.125 0.000 0.546 0.493 0.000 0.421 0.618 0.000 0.304 = 3000 [95% Conf. Interval] -.0508013 -.0024155 -.0088856 -.0984325 .014697 -.0261929 -.1268601 .074092 -.0638908 -.1046177 .1736954 -.1184616 -.0859239 .2624709 -.1530368 .0091196 .0210635 .0101972 .0110397 .0618949 .0391057 .015508 .156355 .1208524 .0504158 .315841 .2833418 .1444703 .4890844 .4906904 Note: dy/dx for factor levels is the discrete change from the base level. Obtaining margins with survey data and representative samples The standard errors and confidence intervals produced by margins are based by default on the delta method applied to the VCE of the current estimates. Delta-method standard errors treat the covariates at which the response is evaluated as given or fixed. Such standard errors are appropriate if you specify at() to fix the covariates, and they are appropriate when you are making inferences about groups exactly like your sample whether you specify at() or not. On the other hand, if you have a representative sample of the population or if you have complex survey data and if you want to make inferences about the underlying population, you need to account for the variation in the covariates that would arise in repeated sampling. You do that using vce(unconditional), which invokes a different standard-error calculation based on Korn and Graubard (1999). Syntactically, there are three cases. They all involve specifying the vce(unconditional) option on the margins command: 1. You have a representative random sample, and you have not svyset your data. When you fit the model, you need to specify the vce(robust) or vce(cluster clustvar) option. When you issue the margins command, you need to specify the vce(unconditional) option. margins — Marginal means, predictive margins, and marginal effects 1173 2. You have a weighted sample, and you have not svyset your data. You need to specify [pw=weight] when you fit the model and, of course, specify the vce(unconditional) option on the margins command. You do not need to specify the weights on the margins command because margins will obtain them from the estimation results. 3. You have svyset your data, whether it be a simple random sample or something more complex including weights, strata, sampling units, or poststratification, and you are using the linearized variance estimator. You need to use the svy prefix when you fit the model. You need to specify vce(unconditional) when you issue the margins command. You do not need to respecify the weights. Even though the data are svyset, and even though the estimation was svy estimation, margins does not default to vce(unconditional). It does not default to vce(unconditional) because there are valid reasons to want the data-specific, vce(delta) standard-error estimates. 
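In outline, and using the illustrative model from example 19 below (the weight and design variables samplewgt, psu, and stratum are hypothetical), the three cases look like this:

Case 1: a representative random sample, data not svyset
. logistic outcome i.sex i.group sex#group age, vce(robust)
. margins sex, vce(unconditional)

Case 2: a weighted sample, data not svyset
. logistic outcome i.sex i.group sex#group age [pweight=samplewgt]
. margins sex, vce(unconditional)

Case 3: svyset data, using the linearized variance estimator
. svyset psu [pweight=samplewgt], strata(stratum)
. svy: logistic outcome i.sex i.group sex#group age
. margins sex, vce(unconditional)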
Whether you specify vce(unconditional) or not, margins uses the weights, so you do not need to respecify them even if you are using vce(unconditional). vce(unconditional) is allowed only after estimation with vce(robust), vce(cluster . . .), or the svy prefix with the linearized variance estimator. If the VCE of the current estimates was specified as clustered, so will be the VCE estimates of margins. If the estimates were from a survey estimation, the survey settings in the dataset will be used by margins. When you use vce(unconditional), never specify if exp or in range on the margins command; instead, specify the subpop(if exp) option. You do that for the usual reasons; see [SVY] subpopulation estimation. If you specify over(varlist) to examine subgroups, the subgroups will automatically be treated as subpopulations. If you are using a replication-based variance estimator, you may want to use this method to estimate the variance of your margins; see [SVY] svy postestimation. Example 19: Inferences for populations, margins of response In example 6, we fit the model . logistic outcome i.sex i.group sex#group age and we obtained margins by sex and margins by group, . margins sex group If our data were randomly drawn from the population of interest and we wanted to account for this, we would have typed . logistic outcome i.sex i.group sex#group age, vce(robust) . margins sex group, vce(unconditional) We do that below: 1174 margins — Marginal means, predictive margins, and marginal effects . logistic outcome i.sex i.group sex#group age, vce(robust) (output omitted ) . margins sex group, vce(unconditional) Predictive margins Number of obs Expression : Pr(outcome), predict() Margin Unconditional Std. Err. z = 3000 P>|z| [95% Conf. Interval] sex male female .1600644 .1966902 .0131685 .0104563 12.16 18.81 0.000 0.000 .1342546 .1761963 .1858743 .2171841 group 1 2 3 .2251302 .150603 .0736157 .0127069 .0118399 .0343188 17.72 12.72 2.15 0.000 0.000 0.032 .200225 .1273972 .0063522 .2500354 .1738088 .1408793 The estimated margins are the same as they were in example 6, but the standard errors and confidence intervals differ, although not by much. Given that we have 3,000 observations in our randomly drawn sample, we should expect this. Example 20: Inferences for populations, marginal effects In example 17, we fit a logistic model and then obtained the average marginal effects for all covariates by typing . logistic outcome treatment##group age c.age#c.age treatment#c.age . margins, dydx(*) To repeat that and also obtain standard errors for our population, we would type . logistic outcome treatment##group age c.age#c.age treatment#c.age, > vce(robust) . margins, dydx(*) vce(unconditional) The results are . logistic outcome treatment##group age c.age#c.age treatment#c.age, vce(robust) (output omitted ) . margins, dydx(*) vce(unconditional) Average marginal effects Number of obs = 3000 Expression : Pr(outcome), predict() dy/dx w.r.t. : 1.treatment 2.group 3.group age dy/dx Unconditional Std. Err. z P>|z| [95% Conf. Interval] 1.treatment .0385625 .0163872 2.35 0.019 .0064442 .0706808 group 2 3 -.0776906 -.1505652 .0179573 .0411842 -4.33 -3.66 0.000 0.000 -.1128863 -.2312848 -.0424949 -.0698456 age .0095868 .0007814 12.27 0.000 .0080553 .0111183 Note: dy/dx for factor levels is the discrete change from the base level. margins — Marginal means, predictive margins, and marginal effects 1175 Example 21: Inferences for populations with svyset data See example 3 in [SVY] svy postestimation. 
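In outline, such an analysis combines the svy prefix with vce(unconditional) and uses subpop() rather than if or in to restrict attention to a subgroup. The following sketch assumes hypothetical design variables psu, stratum, and finalwgt:

. svyset psu [pweight=finalwgt], strata(stratum)
. svy: logistic outcome i.sex i.group sex#group age
. margins sex, vce(unconditional) subpop(if group==1)    // margins within the group==1 subpopulation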
Standardizing margins A standardized margin is the margin calculated on data different from the data used to fit the model. Typically, the word standardized is reserved for situations in which the alternate population is a reference population, which may be real or artificial, and which is treated as fixed. Say that you work for a hospital and have fit a model of mortality on the demographic characteristics of the hospital’s patients. At this stage, were you to type . margins you would obtain the mortality rate for your hospital. You have another dataset, hstandard.dta, that contains demographic characteristics of patients across all hospitals along with the population of each hospital recorded in the pop variable. You could obtain the expected mortality rate at your hospital if your patients matched the characteristics of the standard population by typing . use http://www.stata-press.com/data/r13/hstandard, clear . margins [fw=pop], noesample You specified noesample because the margin is being calculated on data other than the data used to estimate the model. You specified [fw=pop] because the reference dataset you are using included population counts, as many reference datasets do. Obtaining margins as though the data were balanced Here we discuss what are commonly called estimated marginal means or least-squares means. These are margins assuming that all levels of factor variables are equally likely or, equivalently, that the design is balanced. The seminal reference on these margins is Searle, Speed, and Milliken (1980). In designed experiments, observations are often allocated in a balanced way so that the variances can be easily compared and decomposed. At the Acme Portable Widget Company, they are experimenting with a new machine. The machine has three temperature settings and two pressure settings; a combination of settings will be optimal on any particular day, determined by the weather. At start-up, one runs a quick test and chooses the optimal setting for the day. Across different days, each setting will be used about equally, says the manufacturer. In experiments with the machine, 10 widgets were collected for stress testing at each of the settings over a six-week period. We wish to know the average stress-test value that can be expected from these machines over a long period. Balancing using asbalanced The data were intended to be balanced, but unfortunately, the stress test sometimes destroys samples before the stress can be measured. Thus even though the experiment was designed to be balanced, the data are not balanced. You specify the asbalanced option to estimate the margins as if the data were balanced. We will type . use http://www.stata-press.com/data/r13/acmemanuf . regress y pressure##temp . margins, asbalanced 1176 margins — Marginal means, predictive margins, and marginal effects So that you can compare the asbalanced results with the observed results, we will also include margins without the asbalanced option in what follows: . use http://www.stata-press.com/data/r13/acmemanuf . regress y pressure##temp (output omitted ) . margins Predictive margins Number of obs Model VCE : OLS Expression : Linear prediction, predict() Margin _cons 109.9214 Delta-method Std. Err. 1.422629 t 77.27 . margins, asbalanced Adjusted predictions Model VCE : OLS Expression : Linear prediction, predict() at : pressure (asbalanced) temp (asbalanced) Margin _cons 115.3758 Delta-method Std. Err. 1.530199 t 75.40 = 49 P>|t| [95% Conf. 
Interval] 0.000 107.0524 Number of obs 112.7904 = 49 P>|t| [95% Conf. Interval] 0.000 112.2899 118.4618 Technical note Concerning how asbalanced calculations are performed, if a factor variable has l levels, then each level’s coefficient contributes to the response weighted by 1/l. If two factors, a and b, interact, then each coefficient associated with their interaction is weighted by 1/(la × lb ). If a balanced factor interacts with a continuous variable, then each coefficient in the interaction is applied to the value of the continuous variable, and the results are weighted equally. So, if the factor being interacted has la levels, the effect of each coefficient on the value of the continuous covariate is weighted by 1/la . Balancing by standardization To better understand the balanced results, we can perform the balancing ourselves by using the standardizing method shown in Standardizing margins. To do that, we will input a balanced dataset and then type margins, noesample. margins — Marginal means, predictive margins, and marginal effects 1177 . use http://www.stata-press.com/data/r13/acmemanuf . regress y pressure##temp (output omitted ) . drop _all . input pressure temp 1. 2. 3. 4. 5. 6. 7. pressure 1 1 1 2 1 3 2 1 2 2 2 3 end temp . margins, noesample Predictive margins Model VCE : OLS Expression Number of obs 6 : Linear prediction, predict() Margin _cons = 115.3758 Delta-method Std. Err. 1.530199 t 75.40 P>|t| [95% Conf. Interval] 0.000 112.2899 118.4618 We obtain the same results as previously. Balancing nonlinear responses If our testing had produced a binary outcome, say, acceptable/unacceptable, rather than a continuous variable, we would type . use http://www.stata-press.com/data/r13/acmemanuf, clear . logistic acceptable pressure##temp . margins, asbalanced The result of doing that would be 0.680. If we omitted the asbalanced option, the result would have been 0.667. The two results are so similar because acmemanuf.dta is nearly balanced. Even though the asbalanced option can be used on both linear and nonlinear responses, such as probabilities, there is an issue of which you should be aware. The most widely used formulas for balancing responses apply the balancing to the linear prediction, average that as if it were balanced, and then apply the nonlinear transform. That is the calculation that produced 0.680. An alternative would be to apply the standardization method. That amounts to making the linear predictions observation by observation, applying the nonlinear transform to each, and then averaging the nonlinear result as if it were balanced. You could do that by typing . . . . use http://www.stata-press.com/data/r13/acmemanuf, clear logistic acceptable pressure##temp clear input pressure temp (see above for entered data) . margins, noesample 1178 margins — Marginal means, predictive margins, and marginal effects The result from the standardization procedure would be 0.672. These two ways of averaging nonlinear responses are discussed in detail in Lane and Nelder (1982) within the context of general linear models. Concerning the method used by the asbalanced option, if your data start balanced and you have a nonlinear response, you will get different results with and without the asbalanced option! Treating a subset of covariates as balanced So far, we have treated all the covariates as if they were balanced. margins will allow you to treat a subset of the covariates as balanced, too. 
For instance, you might be performing an experiment in which you are randomly allocating patients to a treatment arm and so want to balance on arm, but you do not want to balance the other characteristics because you want mean effects for the experiment’s population. In this example, we will imagine that the outcome of the experiment is continuous. We type . use http://www.stata-press.com/data/r13/margex, clear . regress y arm##sex sex##agegroup . margins, at((asbalanced) arm) If we wanted results balanced on agegroup as well, we could type . margins, at((asbalanced) arm agegroup) If we wanted results balanced on all three covariates, we could type . margins, at((asbalanced) arm agegroup sex) or we could type . margins, at((asbalanced) _factor) or we could type . margins, asbalanced Using fvset design As a convenience feature, equivalent to . regress y arm##sex sex##agegroup . margins, at((asbalanced) arm sex) is . fvset design asbalanced arm sex . regress y arm##sex sex##agegroup . margins The advantage of the latter is that you have to set the variables as balanced only once. This is useful when balancing is a design characteristic of certain variables and you wish to avoid accidentally treating them as unbalanced. If you save your data after fvsetting, the settings will be remembered in future sessions. If you want to clear the setting(s), type . fvset clear varlist See [R] fvset. margins — Marginal means, predictive margins, and marginal effects 1179 Balancing in the presence of empty cells The issue of empty cells is not exclusively an issue of balancing, but there are special considerations when balancing. Empty cells are discussed generally in Estimability of margins. An empty cell is an interaction of levels of two or more factor variables for which you have no data. Usually, margins involving empty cells cannot be estimated. When balancing, there is an alternate definition of the margin that allows the margin to be estimated. margins makes the alternate calculation when you specify the emptycells(reweight) option. By default, margins uses the emptycells(strict) option. If you have empty cells in your data and you request margins involving the empty cells, those margins will be marked as not estimable even if you specify the asbalanced option. . use http://www.stata-press.com/data/r13/estimability, clear (margins estimability) . regress y sex##group (output omitted ) . margins sex, asbalanced Adjusted predictions Number of obs Model VCE : OLS Expression : Linear prediction, predict() at : sex (asbalanced) group (asbalanced) Margin sex male female 21.91389 . Delta-method Std. Err. t 1.119295 19.58 (not estimable) = 69 P>|t| [95% Conf. Interval] 0.000 19.67572 24.15206 This example is discussed in Estimability of margins, although without the asbalanced option. What is said there is equally relevant to the asbalanced case. For reasons explained there, the margin for sex = 1 (female) cannot be estimated. The margin for sex = 1 can be estimated in the asbalanced case if you are willing to make an assumption. Remember that margins makes the balanced calculation by summing the responses associated with the levels and then dividing by the number of levels. If you specify emptycells(reweight), margins sums what is available and divides by the number available. Thus you are assuming that, whatever the responses in the empty cells, those responses are such that they would not change the overall mean of what is observed. 
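To make the reweighting assumption concrete, here is a simplified sketch in our own notation; the general balancing weights appear later under Methods and formulas. Write $\bar f_{gs}$ for the average prediction with group fixed at $g$ and sex fixed at $s$, let $G$ be the number of levels of group, and let $\delta_{gs}=1$ if cell $(g,s)$ is nonempty and 0 otherwise. With the default emptycells(strict), the balanced margin for sex $=s$ is $\sum_{g=1}^{G}\bar f_{gs}/G$, which is not estimable when any $\delta_{gs}=0$. With emptycells(reweight), it becomes
\[
\hat p_s = \frac{\sum_{g=1}^{G} \delta_{gs}\,\bar f_{gs}}{\sum_{g=1}^{G} \delta_{gs}}
\]
that is, the average over the nonempty cells only.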
1180 margins — Marginal means, predictive margins, and marginal effects The results of specifying emptycells(reweight) are . margins sex, asbalanced emptycells(reweight) Adjusted predictions Model VCE : OLS Expression : Linear prediction, predict() Empty cells : reweight at : sex (asbalanced) group (asbalanced) Margin sex male female Delta-method Std. Err. 21.91389 24.85185 t 1.119295 1.232304 19.58 20.17 Number of obs = 69 P>|t| [95% Conf. Interval] 0.000 0.000 19.67572 22.38771 24.15206 27.316 Obtaining margins with nested designs Introduction Factors whose meaning depends on other factors are called nested factors, and the factors on which their meaning depends are called the nesting factors. For instance, assume that we have a sample of patients and each patient is assigned to one doctor. Then patient is nested within doctor. Let the identifiers of the first 5 observations of our data be Doctor 1 1 1 Patient 1 2 3 Name Fred Mary Bob 2 2 1 2 Karen Hank The first patient on one doctor’s list has nothing whatsoever to do with the first patient on another doctor’s list. The meaning of patient = 1 is defined only when the value of doctor is supplied. Nested factors enter into models as interactions of nesting and nested; the nested factor does not appear by itself. We might estimate a model such as . regress y . . . i.doctor doctor#patient . . . You do not include i.patient because the coding for patient has no meaning except within doctor. Patient 1 is Fred for doctor 1 and Karen for doctor 2, etc. margins provides an option to help account for the structure of nested models. The within(varlist) option specifies that margins estimate and report a set of margins for the value combinations of varlist. We might type . margins, within(doctor) Margin calculations are performed first for doctor = 1, then for doctor = 2, and so on. Sometimes you need to specify within(), and other times you do not. Let’s consider the particular model . regress y i.doctor doctor#patient i.sex sex#doctor#patient margins — Marginal means, predictive margins, and marginal effects 1181 The guidelines are the following: 1. You may compute overall margins by typing margins. 2. You may compute overall margins within levels of a nesting factor by typing margins, within(doctor). 3. You may compute margins of a nested factor within levels of its nesting factor by typing margins patient, within(doctor). 4. You may compute margins of factors in your model, as long as the factor does not nest other factors and is not nested within other factors, by typing margins sex. 5. You may not compute margins of a nesting factor, such as margins doctor, because they are not estimable. For examples using within(), see [R] anova. Margins with nested designs as though the data were balanced To obtain margins with nested designs as though the data were balanced, the guidelines are the same as above except that 1) you add the asbalanced option and 2) whenever you do not specify within(), you specify emptycells(reweight). The updated guidelines are 1. You may compute overall margins by typing margins, asbalanced emptycells(reweight). 2. You may compute overall margins within levels of a nesting factor by typing margins, asbalanced within(doctor). 3. You may compute margins of a nested factor within levels of its nesting factor by typing margins patient, asbalanced within(doctor). 4. 
You may compute margins of factors in your model, as long as the factor does not nest other factors and is not nested within other factors, by typing margins sex, asbalanced emptycells(reweight). 5. You may not compute margins of a nesting factor, such as margins doctor, because they are not estimable. Just as explained in Using fvset design, rather than specifying the asbalanced option, you may set the balancing characteristic on the factor variables once and for all by using the command fvset design asbalanced varlist. Technical note Specifying either emptycells(reweight) or within(varlist) causes margins to rebalance over all empty cells in your model. If you have interactions in your model that are not involved in the nesting, margins will lose its ability to detect estimability. Technical note Careful readers will note that the description of within(varlist) matches closely the description of over(varlist). The concept of nesting is similar to the concept of subpopulations. within() differs from over() in that it gracefully handles the missing cells when margins are computed as balanced. 1182 margins — Marginal means, predictive margins, and marginal effects Coding of nested designs In the Introduction to this section, we showed a coding of the nested variable patient, where the coding started over with each doctor: Doctor 1 1 1 Patient 1 2 3 Name Fred Mary Bob 2 2 1 2 Karen Hank That coding style is not required. The data could just as well have been coded Doctor 1 1 1 Patient 1 2 3 Name Fred Mary Bob 2 2 4 5 Karen Hank or even Doctor 1 1 1 Patient 1037239 2223942 0611393 Name Fred Mary Bob 2 2 4433329 6110271 Karen Hank Actually, either of the above two alternatives are better than the first one because margins will be better able to give you feedback about estimability should you make a mistake following the guidelines. On the other hand, both of these two alternatives require more memory at the estimation step. If you run short of memory, you will need to recode your patient ID to the first coding style, which you could do by typing . sort doctor patient . by doctor: gen newpatient = _n Alternatively, you can set emptycells drop and continue to use your patient ID variable just as it is coded. If you do this, we recommend that you remember to type set emptycells keep when you are finished; margins is better able to determine estimability that way. If you regularly work with large nested models, you can set emptycells keep, permanently so that the setting persists across sessions. See [R] set emptycells. margins — Marginal means, predictive margins, and marginal effects 1183 Special topics Requirements for model specification The results that margins reports are based on the most recently fit model or, in Stata jargon, the most recently issued estimation command. Here we discuss 1) mechanical requirements for how you specify that estimation command, 2) work-arounds to use when those restrictions prove impossible, and 3) requirements for margins’ predict(pred opt) option to work. Concerning 1, when you specify the estimation command, covariates that are logically factor variables must be Stata factor variables, and that includes indicator variables, binary variables, and dummies. It will not do to type . regress y . . . female . . . even if female is a 0/1 variable. You must type . regress y . . . i.female . . . If you violate this rule, you will not get incorrect results, but you will discover that you will be unable to obtain margins on female: . 
margins female
factor female not found in e(b)
r(111);

It is also important that if the same continuous variable appears in your model more than once, differently transformed, those transforms be performed via Stata's factor-variable notation. It will not do to type

. generate age2 = age^2
. regress y . . . age age2 . . .

You must type

. regress y . . . age c.age#c.age . . .

You must do that because margins needs to know everywhere that variable appears in the model if it is to be able to set covariates to fixed values.

Concerning 2, sometimes the transformations you desire may not be achievable using the factor-variable notation; in those situations, there is a work-around. Let's assume you wish to estimate

. generate age1_5 = age^1.5
. regress y . . . age age1_5 . . .

There is no factor-variable notation for including age and age^1.5 in a model, so obviously you are going to obtain the estimates by typing just what we have shown. In what follows, it would be okay if there are interactions of age and age1_5 with other variables specified by the factor-variable notation, so the model could just as well be

. regress y . . . age age1_5 sex#c.age sex#c.age1_5 . . .

Let's assume you have fit one of these two models. On any subsequent margins command where you leave age free to vary, there will be no issue. You can type

. margins female

and results will be correct. Issues arise when you attempt to fix age at predetermined values. The following would produce incorrect results:

. margins female, at(age=20)

The results would be incorrect because they leave age1_5 free to vary, and, logically, fixing age implies that age1_5 should also be fixed. Because we were unable to state the relationship between age and age1_5 using the factor-variable notation, margins does not know to fix age1_5 at 20^1.5 when it fixes age at 20. To get the correct results, you must fix the value of age1_5 yourself:

. margins female, at(age=20 age1_5=89.442719)

That command produces correct results. In the command, 89.442719 is 20^1.5.

In summary, when there is a functional relationship between covariates of your model and that functional relationship is not communicated to margins via the factor-variable notation, then it becomes your responsibility to ensure that all variables that are functionally related are set to the appropriate fixed values when any one of them is set to a fixed value.

Concerning 3, we wish to amend our claim that you can calculate margins for anything that predict will produce. We need to add a qualifier. Let us show you an example where the statement is not true. After regress, predict will predict something it calls pr(a,b), which is the probability a ≤ y ≤ b. Yet if we attempted to use pr() with margins after estimation by regress, we would obtain

. margins sex, predict(pr(10,20))
prediction is a function of possibly stochastic quantities other than e(b)
r(498);

What we should have stated was that you can calculate margins for anything that predict will produce for which all the estimated quantities used in its calculation appear in e(V), the estimated VCE. pr() is a function of β, the estimated coefficients, and of s^2, the estimated variance of the residual. regress does not post the variance of the residual variance (sic) in e(V), or even estimate it, and therefore, predict(pr(10,20)) cannot be specified with margins after estimation by regress.
It is unlikely that you will ever encounter these kinds of problems because there are so few predictions where the components are not posted to e(V). If you do encounter the problem, the solution may be to specify nose to suppress the standard-error calculation. If the problem is not with computing the margin, but with computing its standard error, margins will report the result: . margins sex, predict(pr(10,20)) nose (output appears with SEs, tests, and CIs left blank) Technical note Programmers: If you run into this after running an estimation command that you have written, be aware that as of Stata 11, you are supposed to set in e(marginsok) the list of options allowed with predict that are okay to use with margins. When that list is not set, margins looks for violations of its assumptions and, if it finds any, refuses to proceed. margins — Marginal means, predictive margins, and marginal effects 1185 Estimability of margins Sometimes margins will report that a margin cannot be estimated: . use http://www.stata-press.com/data/r13/estimability, clear (margins estimability) . regress y sex##group (output omitted ) . margins sex Predictive margins Model VCE : OLS Expression Number of obs = 69 : Linear prediction, predict() Margin sex male female 21 . Delta-method Std. Err. t .8500245 24.71 (not estimable) P>|t| [95% Conf. Interval] 0.000 19.30027 22.69973 In the above output, the margin for sex = 0 (male) is estimated, but the margin for sex = 1 (female) is not estimable. This occurs because of empty cells. An empty cell is an interaction of levels of two or more factor variables for which you have no data. In the example, the lack of estimability arises because we have two empty cells: . table sex group sex 1 2 group 3 male female 2 9 9 9 27 3 4 5 8 2 To calculate the marginal mean response for sex = 1, we have no responses to average over for group = 4 and group = 5. We obviously could calculate that mean for the observations that really are sex = 1, but remember, the marginal calculation for sex = 1 treats everyone as if female, and we will thus have 8 and 2 observations for which we have no basis for estimating the response. There is no solution for this problem unless you are willing to treat the data as if it were balanced and adjust your definition of a margin; see Balancing in the presence of empty cells. Manipulability of tests Manipulability is a problem that arises with some tests, and in particular, arises with Wald tests. Tests of margins are based on Wald tests, hence our interest. This is a generic issue and not specific to the margins command. Let’s understand the problem. Consider performing a test of whether some statistic φ is 0. Whatever the outcome of that test, it would be desirable if the outcome were the same were we to test whether the sqrt(φ) were 0, or whether φ2 were 0, or whether any other monotonic transform of φ were 0 (for φ2 , we were considering only the positive half of the number line). If a test does not have that property, it is manipulable. Wald tests are manipulable, and that means the tests produced by margins are manipulable. You can see this for yourself by typing 1186 margins — Marginal means, predictive margins, and marginal effects . . . . . use http://www.stata-press.com/data/r13/margex, clear replace y = y - 65 regress y sex##group margins, df(.) margins, expression(predict(xb)^2) To compare the results from the two margins commands, we added the df(.) 
option to the first one, forcing it to report a z statistic even though a t statistic would have been appropriate in this case. We would prefer if the test against zero produced by margins, df(.) was equal to the test produced by margins, expression(predict(xb)^2). But alas, they produce different results. The first produces z = 12.93, and the second produces z = 12.57. The difference is not much in our example, but behind the scenes, we worked to make it small. We subtracted 65 from y so that the experiment would be for a case where it might be reasonable that you would be testing against 0. One does not typically test whether the mean income in the United States is zero or whether the mean blood pressure of live patients is zero. Had we left y as it was originally, we would have obtained z = 190 and z = 96. We did not want to show that comparison to you first because the mean of y is so far from 0 that you probably would never be testing it. The corresponding difference in φ is tiny. Regardless of the example, it is important that you base your tests in the metric where the likelihood surface is most quadratic. For further discussion on manipulability, see Manipulability in [R] predictnl. This manipulability is not limited to Wald tests after estimation; you can also see the manipulability of results produced by linear regression just by applying nonlinear transforms to a covariate (Phillips and Park 1988; Gould 1996). Using margins after the estimates use command Assume you fit and used estimates save (see [R] estimates save) to save the estimation results: . regress y sex##group age c.age*c.age if site==1 . ... . estimates save mymodel (file mymodel.ster saved) Later, perhaps in a different Stata session, you reload the estimation results by typing . estimates use mymodel You plan to use margins with the reloaded results. You must remember that margins bases its results not only on the current estimation results but also on the current data in memory. Before you can use margins, you must reload the dataset on which you fit the model or, if you wish to produce standardized margins, some other dataset. . use mydata, clear (data for fitting models) If the dataset you loaded contained the data for standardization, you can stop reading; you know that to produce standardized margins, you need to specify the noesample option. We reloaded the original data and want to produce margins for the estimation sample. In addition to the data, margins requires that e(sample) be set, as margins will remind us: . margins sex e(sample) does not identify the estimation sample r(322); margins — Marginal means, predictive margins, and marginal effects 1187 The best solution is to use estimates esample to rebuild e(sample): . estimates esample: y sex group age if site==1 If we knew we had no missing values in y and the covariates, we could type . estimates esample: if site==1 Either way, margins would now work: . margins sex (usual output appears) There is an alternative. We do not recommend it, but we admit that we have used it. Rather than rebuilding e(sample), you can use margins’ noesample option to tell margins to skip using e(sample). You could then specify the appropriate if statement (if necessary) to identify the estimation sample: . estimates use mymodel . use mydata, clear (data for fitting models) . 
margins sex if !missing(y, sex, group age) & site==1, noesample (usual output appears) In the above, we are not really running on a sample different from the estimation sample; we are merely using noesample to fool margins, and then we are specifying on the margins command the conditions equivalent to re-create e(sample). If we wish to obtain vce(unconditional) results, however, noesample will be insufficient. We must also specify the force option, . margins sex if !missing(y, sex, group age) & site==1, > vce(unconditional) noesample force (usual output appears) Regardless of the approach you choose—resetting e(sample) or specifying noesample and possibly force—make sure you are right. In the vce(delta) case, you want to be right to ensure that you obtain the results you want. In the vce(unconditional) case, you need to be right because otherwise results will be statistically invalid. Syntax of at() In at(atspec), atspec may contain one or more of the following specifications: varlist (stat) varlist varname = # varname = (numlist) varname = generate(exp) where 1. varnames must be covariates in the previously fit model (estimation command). 2. Variable names (whether in varname or varlist) may be continuous variables, factor variables, or specific level variables, such as age, group, or 3.group. 1188 margins — Marginal means, predictive margins, and marginal effects 3. varlist may also be one of three standard lists: a. all (all covariates), b. factor (all factor-variable covariates), or c. continuous (all continuous covariates). 4. Specifications are processed from left to right with latter specifications overriding previous ones. 5. stat can be any of the following: stat Description Variables allowed asobserved mean median p1 p2 ... p50 ... p98 p99 min max zero base asbalanced at observed values in the sample (default) means (default for varlist) medians 1st percentile 2nd percentile 3rd–49th percentiles 50th percentile (same as median) 51st–97th percentiles 98th percentile 99th percentile minimums maximums fixed at zero base level all levels equally probable and sum to 1 all all continuous continuous continuous continuous continuous continuous continuous continuous continuous continuous continuous factors factors Any stat except zero, base, and asbalanced may be prefixed with an o to get the overall statistic—the sample over all over() groups. For example, omean, omedian, and op25. Overall statistics differ from their correspondingly named statistics only when the over() or within() option is specified. When no stat is specified, mean is assumed. Estimation commands that may be used with margins margins may be used after most estimation commands. margins cannot be used after estimation commands that do not produce full variance matrices, such as exlogistic and expoisson (see [R] exlogistic and [R] expoisson). margins is all about covariates and cannot be used after estimation commands that do not post the covariates, which eliminates gmm (see [R] gmm). margins cannot be used after estimation commands that have an odd data organization, and that excludes asclogit, asmprobit, asroprobit, and nlogit (see [R] asclogit, [R] asmprobit, [R] asroprobit, and [R] nlogit). margins — Marginal means, predictive margins, and marginal effects 1189 Video examples Introduction to margins, part 1: Categorical variables Introduction to margins, part 2: Continuous variables Introduction to margins, part 3: Interactions Glossary adjusted mean. 
A margin when the response is the linear predictor from linear regression, ANOVA, etc. For some authors, adjusting also implies adjusting for unbalanced data. See Obtaining margins of responses and see Obtaining margins as though the data were balanced. average marginal effect. See marginal effect and average marginal effect. average partial effect. See partial effect and average partial effect. conditional margin. A margin when the response is evaluated at fixed values of all the covariates. If any covariates are left to vary, the margin is called a predictive margin. effect. The effect of x is the derivative of the response with respect to covariate x, or it is the difference in responses caused by a discrete change in x. Also see marginal effect. The effect of x measures the change in the response for a change in x. Derivatives or differences might be reported as elasticities. If x is continuous, the effect is measured continuously. If x is a factor, the effect is measured with respect to each level of the factor and may be calculated as a discrete difference or as a continuous change, as measured by the derivative. margins calculates the discrete difference by default and calculates the derivative if the continuous option is specified. elasticity and semielasticity. The elasticity of y with respect to x is d(lny)/d(lnx) = (x/y)×(dy/dx), which is approximately equal to the proportional change in y for a proportional change in x. The semielasticity of y with respect to x is either 1) dy/d(lnx) = x × (dy/dx) or 2) d(lny)/dx = (1/y) × (dy/dx), which is approximately 1) the change in y for a proportional change in x or 2) the proportional change in y for a change in x. empty cell. An interaction of levels of two or more factor variables for which you have no data. For instance, you have sex interacted with group in your model, and in your data there are no females in group 1. Empty cells affect which margins can be estimated; see Estimability of margins. estimability. Estimability concerns whether a margin can be uniquely estimated (identified); see Estimability of margins. estimated marginal mean. This is one of the few terms that has the same meaning across authors. An estimated marginal mean is a margin assuming the levels of each factor covariate are equally likely (balanced), including interaction terms. This is obtained using margins’ asbalanced option. In addition, there is an alternate definition of estimated marginal mean in which margins involving empty cells are redefined so that they become estimable. This is invoked by margins’ emptycells(reweight) option. See Balancing in the presence of empty cells. least-squares mean. Synonym for estimated marginal mean. margin. A statistic calculated from predictions or other statistics of a previously fit model at fixed values of some covariates and averaging or otherwise integrating over the remaining covariates. The prediction or other statistic on which the margin is based is called the response. If all the covariates are fixed, then the margin is called a conditional margin. If any covariates are left to vary, the margin is called a predictive margin. 1190 margins — Marginal means, predictive margins, and marginal effects In this documentation, we divide margins on the basis of whether the statistic is a response or a derivative of a response; see Obtaining margins of responses and Obtaining margins of derivatives of responses. marginal effect and average marginal effect. The marginal effect of x is the margin of the effect of x. 
The term is popular with social scientists, and because of that, you might think the word marginal in marginal effect means derivative because of terms like marginal cost and marginal revenue. Marginal used in that way, however, refers to the derivative of revenue and the derivative of cost; it refers to the numerator, whereas marginal effect refers to the denominator. Moreover, effect is already a derivative or difference. Some researchers interpret marginal in marginal effect to mean instantaneous, and thus a marginal effect is the instantaneous derivative rather than the discrete first-difference, corresponding to margins’ continuous option. Researchers who use marginal in this way refer to the discrete difference calculation of an effect as a partial effect. Other researchers define marginal effect to be the margin when all covariates are held fixed and the average marginal effect when some covariates are not fixed. out-of-sample prediction. Predictions made in one dataset using the results from a model fit on another. Sample here refers to the sample on which the model was fit, and out-of-sample refers to the dataset on which the predictions are made. partial effect and average partial effect. Some authors restrict the term marginal effect to mean derivatives and use the term partial effect to denote discrete differences; see marginal effect and average marginal effect. population marginal mean. The theoretical (true) value that is estimated by estimated marginal mean. We avoid this term because it can be confused with the concept of a population in survey statistics, with which the population marginal mean has no connection. posting results, posting margins. A Stata concept having to do with storing the results from the margins command in e() so that those results can be used as if they were estimation results, thus allowing the subsequent use of postestimation commands, such as test, testnl, lincom, and nlcom (see [R] test, [R] testnl, [R] lincom, and [R] nlcom). This is achieved by specifying margins’ post option. See Example 10: Testing margins—contrasts of margins. predictive margin. A margin in which all the covariates are not fixed. When all covariates are fixed, it is called a conditional margin. recycled prediction. A synonym for predictive margin. response. A prediction or other statistic derived from combining the parameter estimates of a fitted model with data or specified values on covariates. Derivatives of responses are themselves responses. Responses are what we take margins of. standardized margin. The margin calculated on data different from the data used to fit the model. The term standardized is usually reserved for situations in which the alternate population is a reference population, which may be real or artificial, and which is treated as fixed. subpopulation. A subset of your sample that represents a subset of the population, such as the males in a sample of people. In survey contexts when it is desired to account for sampling of the covariates, standard errors for marginal statistics and effects need to account for both the population and the subpopulation. This is accomplished by specifying the vce(unconditional) option and one of the subpop() or over() options. In fact, the above is allowed even when your data are not svyset because vce(unconditional) implies that the sample represents a population. 
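To connect the elasticity and semielasticity definitions above to the command itself, here is a minimal sketch; the model and variable names are hypothetical and serve only to show which option corresponds to which definition:

. regress y c.x i.group
. margins, eyex(x)
. margins, dyex(x)
. margins, eydx(x)

eyex() reports d(lny)/d(lnx), the elasticity; dyex() reports dy/d(lnx) and eydx() reports d(lny)/dx, the two semielasticities; each is averaged over the estimation sample in the usual way.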
Stored results

margins stores the following in r():

Scalars
    r(N)                number of observations
    r(N_sub)            subpopulation observations
    r(N_clust)          number of clusters
    r(N_psu)            number of sampled PSUs, survey data only
    r(N_strata)         number of strata, survey data only
    r(df_r)             variance degrees of freedom, survey data only
    r(N_poststrata)     number of post strata, survey data only
    r(k_margins)        number of terms in marginlist
    r(k_by)             number of subpopulations
    r(k_at)             number of at() options
    r(level)            confidence level of confidence intervals

Macros
    r(cmd)              margins
    r(cmdline)          command as typed
    r(est_cmd)          e(cmd) from original estimation results
    r(est_cmdline)      e(cmdline) from original estimation results
    r(title)            title in output
    r(subpop)           subspec from subpop()
    r(model_vce)        vcetype from estimation command
    r(model_vcetype)    Std. Err. title from estimation command
    r(vce)              vcetype specified in vce()
    r(vcetype)          title used to label Std. Err.
    r(clustvar)         name of cluster variable
    r(margins)          marginlist
    r(predict_label)    label from predict()
    r(expression)       response expression
    r(xvars)            varlist from dydx(), dyex(), eydx(), or eyex()
    r(derivatives)      "", "dy/dx", "dy/ex", "ey/dx", or "ey/ex"
    r(over)             varlist from over()
    r(within)           varlist from within()
    r(by)               union of r(over) and r(within) lists
    r(by#)              interaction notation identifying the #th subpopulation
    r(atstats#)         the #th at() specification
    r(emptycells)       empspec from emptycells()
    r(mcmethod)         method from mcompare()
    r(mcadjustall)      adjustall or empty

Matrices
    r(b)                estimates
    r(V)                variance-covariance matrix of the estimates
    r(Jacobian)         Jacobian matrix
    r(_N)               sample size corresponding to each margin estimate
    r(at)               matrix of values from the at() options
    r(chainrule)        chainrule information from the fitted model
    r(error)            margin estimability codes; 0 means estimable, 8 means not estimable
    r(table)            matrix containing the margins with their standard errors, test statistics, p-values, and confidence intervals

margins with the post option also stores the following in e():

Scalars
    e(N)                number of observations
    e(N_sub)            subpopulation observations
    e(N_clust)          number of clusters
    e(N_psu)            number of sampled PSUs, survey data only
    e(N_strata)         number of strata, survey data only
    e(df_r)             variance degrees of freedom, survey data only
    e(N_poststrata)     number of post strata, survey data only
    e(k_margins)        number of terms in marginlist
    e(k_by)             number of subpopulations
    e(k_at)             number of at() options

Macros
    e(cmd)              margins
    e(cmdline)          command as typed
    e(est_cmd)          e(cmd) from original estimation results
    e(est_cmdline)      e(cmdline) from original estimation results
    e(title)            title in estimation output
    e(subpop)           subspec from subpop()
    e(model_vce)        vcetype from estimation command
    e(model_vcetype)    Std. Err. title from estimation command
    e(vce)              vcetype specified in vce()
    e(vcetype)          title used to label Std. Err.
    e(clustvar)         name of cluster variable
    e(margins)          marginlist
    e(predict_label)    label from predict()
    e(expression)       prediction expression
    e(xvars)            varlist from dydx(), dyex(), eydx(), or eyex()
    e(derivatives)      "", "dy/dx", "dy/ex", "ey/dx", or "ey/ex"
    e(over)             varlist from over()
    e(within)           varlist from within()
    e(by)               union of r(over) and r(within) lists
    e(by#)              interaction notation identifying the #th subpopulation
    e(atstats#)         the #th at() specification
    e(emptycells)       empspec from emptycells()
    e(mcmethod)         method from mcompare()
    e(mcadjustall)      adjustall or empty

Matrices
    e(b)                estimates
    e(V)                variance-covariance matrix of the estimates
    e(Jacobian)         Jacobian matrix
    e(_N)               sample size corresponding to each margin estimate
    e(at)               matrix of values from the at() options
    e(chainrule)        chainrule information from the fitted model

Functions
    e(sample)           marks estimation sample

Methods and formulas

Margins are statistics calculated from predictions of a previously fit model at fixed values of some covariates and averaging or otherwise integrating over the remaining covariates. There are many names for the different statistics that margins can compute: estimated marginal means (see Searle, Speed, and Milliken [1980]), predictive margins (see Graubard and Korn [2004]), marginal effects (see Greene [2012]), and average marginal/partial effects (see Wooldridge [2010] and Bartus [2005]).

Methods and formulas are presented under the following headings:
    Notation
    Marginal effects
    Fixing covariates and balancing factors
    Estimable functions
    Standard errors conditional on the covariates
    Unconditional standard errors

Notation

Let $\theta$ be the vector of parameters in the current model fit, let $z$ be a vector of covariate values, and let $f(z,\theta)$ be a scalar-valued function returning the value of the predictions of interest. The following table illustrates the parameters and default prediction for several of Stata's estimation commands.

    Command    $\theta$                              $z$           $f(z,\theta)$
    regress    $\beta$                               $x$           $x\beta$
    cloglog    $\beta$                               $x$           $1 - e^{-e^{x\beta}}$
    logit      $\beta$                               $x$           $1/(1 + e^{-x\beta})$
    poisson    $\beta$                               $x$           $e^{x\beta}$
    probit     $\beta$                               $x$           $\Phi(x\beta)$
    biprobit   $\beta_1, \beta_2, \rho$              $x_1, x_2$    $\Phi_2(x_1\beta_1, x_2\beta_2, \rho)$
    mlogit     $\beta_1, \beta_2, \ldots, \beta_k$   $x$           $e^{x\beta_1}\big/\sum_i e^{x\beta_i}$
    nbreg      $\beta, \ln\alpha$                    $x$           $e^{x\beta}$

$\Phi()$ and $\Phi_2()$ are cumulative distribution functions: $\Phi()$ for the standard normal distribution and $\Phi_2()$ for the standard bivariate normal distribution.

margins computes estimates of
\[
p(\theta) = \frac{1}{M_{S_p}} \sum_{j=1}^{M} \delta_j(S_p)\, f(z_j, \theta)
\]
where $\delta_j(S_p)$ identifies elements within the subpopulation $S_p$ (for the prediction of interest),
\[
\delta_j(S_p) = \begin{cases} 1, & j \in S_p \\ 0, & j \notin S_p \end{cases}
\]
$M_{S_p}$ is the subpopulation size,
\[
M_{S_p} = \sum_{j=1}^{M} \delta_j(S_p)
\]
and $M$ is the population size.

Let $\hat\theta$ be the vector of parameter estimates. Then margins estimates $p(\theta)$ via
\[
\hat p = \frac{1}{w_\cdot} \sum_{j=1}^{N} \delta_j(S_p)\, w_j\, f(z_j, \hat\theta)
\]
where
\[
w_\cdot = \sum_{j=1}^{N} \delta_j(S_p)\, w_j
\]
$\delta_j(S_p)$ indicates whether observation $j$ is in subpopulation $S_p$, $w_j$ is the weight for the $j$th observation, and $N$ is the sample size.

Marginal effects

margins also computes marginal/partial effects. For the marginal effect of continuous covariate $x$, margins computes
\[
\hat p = \frac{1}{w_\cdot} \sum_{j=1}^{N} \delta_j(S_p)\, w_j\, h(z_j, \hat\theta)
\]
where
\[
h(z,\theta) = \frac{\partial f(z,\theta)}{\partial x}
\]
The marginal effect for level $k$ of factor variable $A$ is the simple contrast (a.k.a. difference) comparing its margin with the margin at the base level,
\[
h(z,\theta) = f(z,\theta \mid A=k) - f(z,\theta \mid A=\text{base})
\]

Fixing covariates and balancing factors

margins controls the values in each $z$ vector through the marginlist, the at() option, the atmeans option, and the asbalanced and emptycells() options.

Suppose $z$ is composed of the elements from the equation specification

    A##B x

where A is a factor variable with $a$ levels, B is a factor variable with $b$ levels, and x is a continuous covariate. To simplify the notation for this discussion, assume the levels of A and B start with 1 and are contiguous. Then
\[
z = (A_1, \ldots, A_a, B_1, \ldots, B_b, A_1B_1, A_1B_2, \ldots, A_aB_b, x, 1)
\]
where $A_i$, $B_j$, and $A_iB_j$ represent the indicator values for the factor variables A and B and the interaction A#B.

When factor A is in the marginlist, margins replaces A with $i$ and then computes the mean of the subsequent prediction, for $i = 1, \ldots, a$. When the interaction term A#B is in the marginlist, margins replaces A with $i$ and B with $j$, and then computes the mean of the subsequent prediction, for all combinations of $i = 1, \ldots, a$ and $j = 1, \ldots, b$.

The at() option sets model covariates to fixed values. For example, at(x=15) causes margins to temporarily set x to 15 for each observation in the dataset before computing any predictions. Similarly, at((median) x) causes margins to temporarily set x to the median of x using the current dataset.

When factor variable A is specified as asbalanced, margins sets each $A_i$ to $1/a$. Thus each $z$ vector will look like
\[
z = (1/a, \ldots, 1/a, B_1, \ldots, B_b, B_1/a, B_2/a, \ldots, B_b/a, x, 1)
\]
If B is also specified as asbalanced, then each $B_j$ is set to $1/b$, and each $z$ vector will look like
\[
z = (1/a, \ldots, 1/a, 1/b, \ldots, 1/b, 1/ab, 1/ab, \ldots, 1/ab, x, 1)
\]
If emptycells(reweight) is also specified, then margins uses a different balancing weight for each element of $z$, depending on how many empty cells the element is associated with. Let $\delta_{ij}$ indicate that the $ij$th cell of A#B was observed in the estimation sample,
\[
\delta_{ij} = \begin{cases} 0, & A=i \text{ and } B=j \text{ was an empty cell} \\ 1, & \text{otherwise} \end{cases}
\]
For the grand margin, the affected elements of $z$ and their corresponding balancing weights are
\[
A_i = \frac{\sum_j \delta_{ij}}{\sum_k \sum_j \delta_{kj}}, \qquad
B_j = \frac{\sum_i \delta_{ij}}{\sum_i \sum_k \delta_{ik}}, \qquad
A_iB_j = \frac{\delta_{ij}}{\sum_k \sum_l \delta_{kl}}
\]
For the $j$th margin of B, the affected elements of $z$ and their corresponding balancing weights are
\[
A_i = \frac{\delta_{ij}}{\sum_k \delta_{kj}}, \qquad
B_l = \begin{cases} 1, & \text{if } l=j \text{ and not all } \delta_{ij} \text{ are zero} \\ 0, & \text{otherwise} \end{cases}, \qquad
A_iB_l = B_l\,\frac{\delta_{il}}{\sum_k \delta_{kl}}
\]

Estimable functions

The fundamental idea behind estimable functions is clearly defined in the statistical literature for linear models; see Searle (1971). Assume that we are working with the following linear model:
\[
y = Xb + e
\]
where $y$ is an $N \times 1$ vector of responses, $X$ is an $N \times p$ matrix of covariate values, $b$ is a $p \times 1$ vector of coefficients, and $e$ is a vector of random errors. Assuming a constant variance for the random errors, the normal equations for the least-squares estimator, $\hat b$, are
\[
X'X\,\hat b = X'y
\]
When $X$ is not of full column rank, we will need a generalized inverse (g-inverse) of $X'X$ to solve for $\hat b$. Let $G$ be a g-inverse of $X'X$.

Searle (1971) defines a linear function of the parameters as estimable if it is identically equal to some linear function of the expected values of the $y$ vector. Let $H = GX'X$. Then this definition simplifies to the following rule:
\[
zb \text{ is estimable if } z = zH
\]
margins generalizes this to nonlinear functions by assuming the prediction function $f(z,\theta)$ is a function of one or more of the linear predictions from the equations in the model that $\theta$ represents,
\[
f(z,\theta) = h(z_1\beta_1, z_2\beta_2, \ldots, z_k\beta_k)
\]
$z_i\beta_i$ is considered estimable if $z_i = z_iH_i$, where $H_i = G_iX_i'X_i$, $G_i$ is a g-inverse for $X_i'X_i$, and $X_i$ is the matrix of covariates from the $i$th equation of the fitted model. margins considers $p(\theta)$ to be estimable if every $z_i\beta_i$ is estimable.

Standard errors conditional on the covariates

By default, margins uses the delta method to estimate the variance of $\hat p$,
\[
\widehat{\rm Var}(\hat p \mid z) = v'Vv
\]
where $V$ is a variance estimate for $\hat\theta$ and
\[
v = \left.\frac{\partial \hat p}{\partial \theta}\right|_{\theta=\hat\theta}
\]
This variance estimate is conditional on the $z$ vectors used to compute the marginalized predictions.

Unconditional standard errors

margins with the vce(unconditional) option uses linearization to estimate the unconditional variance of $\hat\theta$. Linearization uses the variance estimator for the total of a score variable for $\hat p$ as an approximate estimator for ${\rm Var}(\hat p)$; see [SVY] variance estimation. margins requires that the model was fit using some form of linearized variance estimator and that predict, scores computes the appropriate score values for the linearized variance estimator.

The score for $\hat p$ from the $j$th observation is given by
\[
s_j = \frac{\partial \hat p}{\partial w_j}
    = -\frac{\delta_j(S_p)}{w_\cdot}\,\hat p
      + \frac{\delta_j(S_p)}{w_\cdot}\, f(z_j,\hat\theta)
      + \frac{1}{w_\cdot} \sum_{i=1}^{N} \delta_i(S_p)\, w_i\, \frac{\partial f(z_i,\hat\theta)}{\partial w_j}
\]
The remaining partial derivative can be decomposed using the chain rule,
\[
\frac{\partial f(z_i,\hat\theta)}{\partial w_j}
    = \left(\left.\frac{\partial f(z_i,\theta)}{\partial \theta}\right|_{\theta=\hat\theta}\right)'
      \frac{\partial \hat\theta}{\partial w_j}
\]
This is the inner product of two vectors, the second of which is not a function of the $i$ index. Thus the score is
\[
s_j = -\frac{\delta_j(S_p)}{w_\cdot}\,\hat p
      + \frac{\delta_j(S_p)}{w_\cdot}\, f(z_j,\hat\theta)
      + \left(\left.\frac{\partial \hat p}{\partial \theta}\right|_{\theta=\hat\theta}\right)'
        \frac{\partial \hat\theta}{\partial w_j}
\]
If $\hat\theta$ was derived from a system of equations (such as in linear regression or maximum likelihood estimation), then $\hat\theta$ is the solution to
\[
G(\theta) = \sum_{j=1}^{N} \delta_j(S_m)\, w_j\, g(\theta, y_j, x_j) = 0
\]
where $S_m$ identifies the subpopulation used to fit the model, $g()$ is the model's gradient function, and $y_j$ and $x_j$ are the values of the dependent and independent variables for the $j$th observation. We can use linearization to derive a first-order approximation for $\partial\hat\theta/\partial w_j$,
\[
G(\hat\theta) \approx G(\theta_0) + \left.\frac{\partial G(\theta)}{\partial \theta}\right|_{\theta=\theta_0} (\hat\theta - \theta_0)
\]
Let $H$ be the Hessian matrix
\[
H = \left.\frac{\partial G(\theta)}{\partial \theta}\right|_{\theta=\theta_0}
\]
Then
\[
\hat\theta \approx \theta_0 + (-H)^{-1} G(\theta_0)
\]
and
\[
\frac{\partial \hat\theta}{\partial w_j}
  \approx (-H)^{-1} \left.\frac{\partial G(\theta)}{\partial w_j}\right|_{\theta=\hat\theta}
  = (-H)^{-1} \delta_j(S_m)\, g(\hat\theta, y_j, x_j)
\]
The computed value of the score for $\hat p$ for the $j$th observation is $s_j = v'u_j$, where
\[
v = \begin{pmatrix} -\hat p / w_\cdot \\ 1/w_\cdot \\ (-H)^{-1}\, \partial\hat p/\partial\theta \end{pmatrix}
\qquad \text{and} \qquad
u_j = \begin{pmatrix} \delta_j(S_p) \\ \delta_j(S_p)\, f(z_j,\hat\theta) \\ \delta_j(S_m)\, g(\hat\theta, y_j, x_j) \end{pmatrix}
\]
Thus the variance estimate for $\hat p$ is
\[
\widehat{\rm Var}(\hat p) = v'\, \widehat{\rm Var}(\widehat U)\, v
\]
where
\[
\widehat U = \sum_{j=1}^{N} w_j u_j
\]
margins uses the model-based variance estimates for $(-H)^{-1}$ and the scores from predict for $g(\hat\theta, y_j, x_j)$.

References

Bartus, T. 2005. Estimation of marginal effects using margeff. Stata Journal 5: 309–329.

Baum, C. F. 2010. Stata tip 88: Efficiently evaluating elasticities with the margins command. Stata Journal 10: 309–312.

Buis, M. L. 2010. Stata tip 87: Interpretation of interactions in nonlinear models. Stata Journal 10: 305–308.

Chang, I. M., R. Gelman, and M. Pagano. 1982. Corrected group prognostic curves and summary statistics. Journal of Chronic Diseases 35: 669–674.

Cummings, P. 2011. Estimating adjusted risk ratios for matched and unmatched data: An update. Stata Journal 11: 290–298.

Gould, W. W. 1996. crc43: Wald test of nonlinear hypotheses after model estimation. Stata Technical Bulletin 29: 2–4. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 15–18. College Station, TX: Stata Press.
margins — Marginal means, predictive margins, and marginal effects 1199 Graubard, B. I., and E. L. Korn. 2004. Predictive margins with survey data. Biometrics 55: 652–659. Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. Korn, E. L., and B. I. Graubard. 1999. Analysis of Health Surveys. New York: Wiley. Lane, P. W., and J. A. Nelder. 1982. Analysis of covariance and standardization as instances of prediction. Biometrics 38: 613–621. Mitchell, M. N. 2012. Interpreting and Visualizing Regression Models Using Stata. College Station, TX: Stata Press. Newson, R. B. 2013. Attributable and unattributable risks and fractions and other scenario comparisons. Stata Journal 13: 672–698. Phillips, P. C. B., and J. Y. Park. 1988. On the formulation of Wald tests of nonlinear restrictions. Econometrica 56: 1065–1083. Searle, S. R. 1971. Linear Models. New York: Wiley. . 1997. Linear Models for Unbalanced Data. New York: Wiley. Searle, S. R., F. M. Speed, and G. A. Milliken. 1980. Population marginal means in the linear model: An alternative to least squares means. American Statistician 34: 216–221. Williams, R. 2012. Using the margins command to estimate and interpret adjusted predictions and marginal effects. Stata Journal 12: 308–331. Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press. Also see [R] contrast — Contrasts and linear hypothesis tests after estimation [R] margins, contrast — Contrasts of margins [R] margins, pwcompare — Pairwise comparisons of margins [R] margins postestimation — Postestimation tools for margins [R] marginsplot — Graph results from margins (profile plots, etc.) [R] lincom — Linear combinations of estimators [R] nlcom — Nonlinear combinations of estimators [R] predict — Obtain predictions, residuals, etc., after estimation [R] predictnl — Obtain nonlinear predictions, standard errors, etc., after estimation [U] 20 Estimation and postestimation commands Title margins postestimation — Postestimation tools for margins Description Remarks and examples Also see Description The following standard postestimation command is available after margins: Command Description marginsplot graph the results from margins—profile plots, interaction plots, etc. For information on marginsplot, see [R] marginsplot. The following standard postestimation commands are available after margins, post: Command Description contrast estat summarize estimates lincom contrasts and ANOVA-style joint tests of estimates summary statistics for the estimation sample cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients pairwise comparisons of estimates Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses nlcom pwcompare test testnl Remarks and examples Continuing with the example from Example 8: Margins of interactions in [R] margins, we use the dataset and reestimate the logistic model of outcome: . use http://www.stata-press.com/data/r13/margex (Artificial data for margins) . logistic outcome sex##group age (output omitted ) 1200 margins postestimation — Postestimation tools for margins 1201 We then estimate the margins for males and females and post the margins as estimation results with a full VCE. . 
margins sex, post Predictive margins Model VCE : OIM Expression : Pr(outcome), predict() Margin sex male female .1600644 .1966902 Number of obs Delta-method Std. Err. .0125653 .0100043 z 12.74 19.66 = 3000 P>|z| [95% Conf. Interval] 0.000 0.000 .1354368 .1770821 .184692 .2162983 We can now use nlcom (see [R] nlcom) to estimate a risk ratio of females to males using the average probabilities for females and males posted by margins: . nlcom (risk_ratio: _b[1.sex] / _b[0.sex]) risk_ratio: _b[1.sex] / _b[0.sex] Coef. risk_ratio 1.228819 Std. Err. .1149538 z 10.69 P>|z| [95% Conf. Interval] 0.000 1.003514 1.454124 We could similarly estimate the average risk difference between females and males: . nlcom (risk_diff: _b[1.sex] - _b[0.sex]) risk_diff: _b[1.sex] - _b[0.sex] Coef. risk_diff .0366258 Std. Err. z P>|z| [95% Conf. Interval] .0160632 2.28 0.023 .0051425 Also see [R] margins — Marginal means, predictive margins, and marginal effects [R] marginsplot — Graph results from margins (profile plots, etc.) [U] 20 Estimation and postestimation commands .068109 Title margins, contrast — Contrasts of margins Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Suboptions Reference Syntax margins marginlist if in weight , contrast margins options margins marginlist if in weight , contrast(suboptions) margins options where marginlist is a list of factor variables or interactions that appear in the current estimation results. The variables may be typed with or without contrast operators, and you may use any factor-variable syntax: . margins sex##group, contrast . margins sex##g.group, contrast . margins sex@group, contrast See the operators (op.) table in [R] contrast for the list of contrast operators. Contrast operators may also be specified on the variables in margins’ over() and within() options to perform contrasts across the levels of those variables. See [R] margins for the available margins options. suboptions Description Contrast overall lincom atcontrast(op . at ) atjoint overjoint withinjoint marginswithin add a joint hypothesis test for all specified contrasts treat user-defined contrasts as linear combinations apply the op. contrast operator to the groups defined by at() test jointly across all groups defined by at() test jointly across all levels of the unoperated over() variables test jointly across all levels of the unoperated within() variables perform contrasts within the levels of the unoperated terms in marginlist cieffects pveffects effects nowald noatlevels show effects table with confidence intervals show effects table with p-values show effects table with confidence intervals and p-values suppress table of Wald tests report only the overall Wald test for terms that use the within @ or nested | operator compute unadjusted Wald tests for survey results nosvyadjust fweights, aweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. 1202 margins, contrast — Contrasts of margins 1203 Menu Statistics > Postestimation > Contrasts of margins Description margins with the contrast option or with contrast operators performs contrasts of margins. This extends the capabilities of contrast to any of the nonlinear responses, predictive margins, or other margins that can be estimated by margins. Suboptions Contrast overall specifies that a joint hypothesis test over all terms be performed. lincom specifies that user-defined contrasts be treated as linear combinations. The default is to require that all user-defined contrasts sum to zero. 
(Summing to zero is part of the definition of a contrast.) atcontrast(op . at ) specifies that the op. contrast operator be applied to the groups defined by the at() option(s). The default behavior, by comparison, is to perform tests and contrasts within the groups defined by the at() option(s). See example 6 in Remarks and examples. atjoint specifies that joint tests be performed across all groups defined by the at() option. The default behavior, by comparison, is to perform contrasts and tests within each group. See example 5 in Remarks and examples. overjoint specifies how unoperated variables in the over() option are treated. Each variable in the over() option may be specified either with or without a contrast operator. For contrast-operated variables, the specified contrast comparisons are always performed. overjoint specifies that joint tests be performed across all levels of the unoperated variables. The default behavior, by comparison, is to perform contrasts and tests within each combination of levels of the unoperated variables. See example 3 in Remarks and examples. withinjoint specifies how unoperated variables in the within() option are treated. Each variable in the within() option may be specified either with or without a contrast operator. For contrast-operated variables, the specified contrast comparisons are always performed. withinjoint specifies that joint tests be performed across all levels of the unoperated variables. The default behavior, by comparison, is to perform contrasts and tests within each combination of levels of the unoperated variables. marginswithin specifies how unoperated variables in marginlist are treated. Each variable in marginlist may be specified either with or without a contrast operator. For contrast-operated variables, the specified contrast comparisons are always performed. marginswithin specifies that contrasts and tests be performed within each combination of levels of the unoperated variables. The default behavior, by comparison, is to perform joint tests across all levels of the unoperated variables. See example 4 in Remarks and examples. 1204 margins, contrast — Contrasts of margins cieffects specifies that a table containing a confidence interval for each individual contrast be reported. pveffects specifies that a table containing a p-value for each individual contrast be reported. effects specifies that a single table containing a confidence interval and p-value for each individual contrast be reported. nowald suppresses the table of Wald tests. noatlevels indicates that only the overall Wald test be reported for each term containing within or nested (@ or |) operators. nosvyadjust is for use with svy estimation commands. It specifies that the Wald test be carried out without the default adjustment for the design degrees of freedom. That is to say the test is carried out as W/k ∼ F (k, d) rather than as (d − k + 1)W/(kd) ∼ F (k, d − k + 1), where k is the dimension of the test and d is the total number of sampled PSUs minus the total number of strata. Remarks and examples Remarks are presented under the following headings: Contrasts of margins Contrasts and the over() option The overjoint suboption The marginswithin suboption Contrasts and the at() option Estimating treatment effects with margins Conclusion Contrasts of margins Example 1 Estimating contrasts of margins is as easy as adding a contrast operator to the variable name. Let’s review Example 2: A simple case after logistic of [R] margins. 
Variable sex is coded 0 for males and 1 for females. . use http://www.stata-press.com/data/r13/margex . logistic outcome i.sex i.group (output omitted ) . margins sex Predictive margins Model VCE : OIM Expression : Pr(outcome), predict() Margin sex male female .1286796 .1905087 Delta-method Std. Err. .0111424 .0089719 z 11.55 21.23 Number of obs = 3000 P>|z| [95% Conf. Interval] 0.000 0.000 .106841 .1729241 .1505182 .2080933 The first margin, 0.13, is the average probability of a positive outcome, treating everyone as if they were male. The second margin, 0.19, is the average probability of a positive outcome, treating everyone as if they were female. We can compare females with males by rerunning margins and adding a contrast operator: margins, contrast — Contrasts of margins 1205 . margins r.sex Contrasts of predictive margins Model VCE : OIM Expression : Pr(outcome), predict() sex sex (female vs male) df chi2 P>chi2 1 16.61 0.0000 Contrast Delta-method Std. Err. .0618291 .0151719 [95% Conf. Interval] .0320927 .0915656 The r. prefix for sex is the reference-category contrast operator—see [R] contrast. (The default reference category is zero, the lowest value of sex.) Contrast operators in a marginlist work just as they do in the termlist of a contrast command. The contrast estimate of 0.06 says that unconditional on group, females on average are about 6% more likely than males to have a positive outcome. The chi-squared statistic of 16.61 shows that the contrast is significantly different from zero. You may be surprised that we did not need to include the contrast option to estimate our contrast. If we had included the option, our output would not have changed: . margins r.sex, contrast Contrasts of predictive margins Model VCE : OIM Expression : Pr(outcome), predict() sex sex (female vs male) df chi2 P>chi2 1 16.61 0.0000 Contrast Delta-method Std. Err. .0618291 .0151719 [95% Conf. Interval] .0320927 .0915656 The contrast option is useful mostly for its suboptions, which control the output and how contrasts are estimated in more complicated situations. But contrast may be specified on its own (without contrast operators or suboptions) if we do not need estimates or confidence intervals: . margins sex group, contrast Contrasts of predictive margins Model VCE : OIM Expression : Pr(outcome), predict() df chi2 P>chi2 sex 1 16.61 0.0000 group 2 225.76 0.0000 1206 margins, contrast — Contrasts of margins Each chi-squared statistic is a joint test of constituent contrasts. The test for group has two degrees of freedom because group has three levels. Contrasts and the over() option Example 2 It is common to estimate margins at combinations of factor levels, and margins, contrast includes several suboptions for contrasting such margins. Let’s fit a model with two categorical predictors and their interaction: . logistic outcome agegroup##group Logistic regression Number of obs LR chi2(8) Prob > chi2 Pseudo R2 Log likelihood = -1105.7504 outcome Odds Ratio agegroup 30-39 40+ group 2 3 agegroup# group 30-39#2 30-39#3 40+#2 40+#3 _cons = = = = 3000 520.64 0.0000 0.1906 Std. Err. z P>|z| [95% Conf. 
Interval] 3.54191 16.23351 2.226951 9.61188 2.01 4.71 0.044 0.000 1.032882 5.086452 12.14576 51.80955 .834507 .2146729 .5663738 .1772897 -0.27 -1.86 0.790 0.062 .2206611 .0425407 3.15598 1.083303 .4426927 1.160885 .440672 .4407912 .3358505 1.103527 .3049393 .4034688 -1.07 0.16 -1.18 -0.89 0.283 0.875 0.236 0.371 .1000772 .1801543 .1135259 .0733 1.958257 7.480553 1.71055 2.650709 .0379747 .0223371 -5.56 0.000 .0119897 .1202762 Each of agegroup and group has three levels. To compare each age group with the reference category on the probability scale, we can again use margins with the r. contrast operator. margins, contrast — Contrasts of margins 1207 . margins r.agegroup Contrasts of predictive margins Model VCE : OIM Expression : Pr(outcome), predict() agegroup (30-39 vs 20-29) (40+ vs 20-29) Joint agegroup (30-39 vs 20-29) (40+ vs 20-29) df chi2 P>chi2 1 1 2 10.04 224.44 238.21 0.0015 0.0000 0.0000 Contrast Delta-method Std. Err. .044498 .2059281 .0140448 .0137455 [95% Conf. Interval] .0169706 .1789874 .0720253 .2328688 Our model includes an interaction, though, so it would be nice to estimate the contrasts separately for each value of group. We need the over() option: . margins r.agegroup, over(group) Contrasts of predictive margins Model VCE : OIM Expression : Pr(outcome), predict() over : group agegroup@group (30-39 vs 20-29) 1 (30-39 vs 20-29) 2 (30-39 vs 20-29) 3 (40+ vs 20-29) 1 (40+ vs 20-29) 2 (40+ vs 20-29) 3 Joint agegroup@group (30-39 vs 20-29) 1 (30-39 vs 20-29) 2 (30-39 vs 20-29) 3 (40+ vs 20-29) 1 (40+ vs 20-29) 2 (40+ vs 20-29) 3 df chi2 P>chi2 1 1 1 1 1 1 6 6.94 1.18 3.10 173.42 57.77 5.12 266.84 0.0084 0.2783 0.0783 0.0000 0.0000 0.0236 0.0000 Contrast Delta-method Std. Err. .0819713 .0166206 .0243462 .3447797 .1540882 .0470319 .0311208 .0153309 .0138291 .0261811 .0202722 .0207774 [95% Conf. Interval] .0209757 -.0134275 -.0027583 .2934658 .1143554 .006309 .142967 .0466686 .0514508 .3960937 .193821 .0877548 The effect of agegroup appears to be greatest for the first level of group. Including a variable in the over() option is not equivalent to including the variable in the main marginlist. The variables in the marginlist are manipulated in the analysis, so that we can measure, for example, the effect of being in age group 3 and not age group 1. (The manipulation could be mimicked by running replace and then predict, but the manipulations actually performed by margins do not 1208 margins, contrast — Contrasts of margins change the data in memory.) The variables in the over() option are not so manipulated—the values of the over() variables are left as they were observed, and the marginlist variables are manipulated separately for each observed over() group. For more information, see Do not specify marginlist when you mean over() in [R] margins. The overjoint suboption Example 3 Each variable in an over() option may be specified with or without contrast operators. Our option over(group) did not include a contrast operator, so margins estimated the contrasts separately for each level of group. If we had instead specified over(r.group), we would have received differences of the contrasts: . 
margins r.agegroup, over(r.group) Contrasts of predictive margins Model VCE : OIM Expression : Pr(outcome), predict() over : group group#agegroup (2 vs 1) (30-39 vs 20-29) (2 vs 1) (40+ vs 20-29) (3 vs 1) (30-39 vs 20-29) (3 vs 1) (40+ vs 20-29) Joint group#agegroup (2 vs 1) (30-39 vs 20-29) (2 vs 1) (40+ vs 20-29) (3 vs 1) (30-39 vs 20-29) (3 vs 1) (40+ vs 20-29) df chi2 P>chi2 1 1 1 1 4 3.55 33.17 2.86 79.36 83.88 0.0596 0.0000 0.0906 0.0000 0.0000 Contrast Delta-method Std. Err. -.0653508 -.1906915 -.0576251 -.2977479 .0346921 .0331121 .0340551 .0334237 [95% Conf. Interval] -.133346 -.25559 -.1243719 -.3632572 .0026445 -.1257931 .0091216 -.2322385 The contrasts are double differences: the estimate of −0.19, for example, says that the difference in the probability of success between age group 3 and age group 1 is smaller in group 2 than in group 1. We can jointly test pairs of the double differences with the overjoint suboption: . margins r.agegroup, over(group) contrast(overjoint) Contrasts of predictive margins Model VCE : OIM Expression : Pr(outcome), predict() over : group group#agegroup (joint) (30-39 vs 20-29) (joint) (40+ vs 20-29) Joint df chi2 P>chi2 2 2 4 3.62 79.45 83.88 0.1641 0.0000 0.0000 margins, contrast — Contrasts of margins 1209 The contrast(overjoint) option overrides the default behavior of over() and requests joint tests over the levels of the unoperated variable group. The chi-squared statistic of 3.62 tests that the first and third contrasts from the previous table are jointly zero. The chi-squared statistic of 79.45 jointly tests the other pair of contrasts. The marginswithin suboption Example 4 Another suboption that may usefully be combined with over() is marginswithin. marginswithin requests that contrasts be performed within the levels of unoperated variables in the main marginlist, instead of performing them jointly across the levels. marginswithin affects only unoperated variables because contrast operators take precedence over suboptions. Let’s first look at the default behavior, which occurs when marginswithin is not specified: . margins agegroup, over(r.group) contrast(effects) Contrasts of predictive margins Model VCE : OIM Expression : Pr(outcome), predict() over : group group#agegroup (2 vs 1) (joint) (3 vs 1) (joint) Joint group# agegroup (2 vs 1) (30-39 vs base) (2 vs 1) (40+ vs base) (3 vs 1) (30-39 vs base) (3 vs 1) (40+ vs base) df chi2 P>chi2 2 2 4 33.94 83.38 83.88 0.0000 0.0000 0.0000 Contrast Delta-method Std. Err. -.0653508 .0346921 -.1906915 z P>|z| [95% Conf. Interval] -1.88 0.060 -.133346 .0026445 .0331121 -5.76 0.000 -.25559 -.1257931 -.0576251 .0340551 -1.69 0.091 -.1243719 .0091216 -.2977479 .0334237 -8.91 0.000 -.3632572 -.2322385 Here agegroup in the main marginlist is an unoperated variable, so margins by default performs joint tests across the levels of agegroup: the chi-squared statistic of 33.94, for example, jointly tests whether the first two contrast estimates in the lower table differ significantly from zero. 1210 margins, contrast — Contrasts of margins When we specify marginswithin, the contrasts will instead be performed within the levels of agegroup: . 
margins agegroup, over(r.group) contrast(marginswithin effects) Contrasts of predictive margins Model VCE : OIM Expression : Pr(outcome), predict() over : group group@agegroup (2 vs 1) 20-29 (2 vs 1) 30-39 (2 vs 1) 40+ (3 vs 1) 20-29 (3 vs 1) 30-39 (3 vs 1) 40+ Joint group@ agegroup (2 vs 1) 20-29 (2 vs 1) 30-39 (2 vs 1) 40+ (3 vs 1) 20-29 (3 vs 1) 30-39 (3 vs 1) 40+ df chi2 P>chi2 1 1 1 1 1 1 6 0.06 7.55 68.39 1.80 10.47 159.89 186.87 0.7991 0.0060 0.0000 0.1798 0.0012 0.0000 0.0000 Contrast Delta-method Std. Err. -.0058686 .0230533 -0.25 0.799 -.0510523 .039315 -.0712194 .0259246 -2.75 0.006 -.1220308 -.0204081 -.1965602 .0237688 -8.27 0.000 -.2431461 -.1499742 -.0284991 .0212476 -1.34 0.180 -.0701436 .0131453 -.0861243 .0266137 -3.24 0.001 -.1382862 -.0339624 -.326247 .0258009 -12.64 0.000 -.3768159 -.2756781 z P>|z| [95% Conf. Interval] The joint tests in the top table have been replaced by one-degree-of-freedom tests, one for each combination of the two reference comparisons and three levels of agegroup. The reference-category contrasts for group have been performed within levels of agegroup. Contrasts and the at() option Example 5 The at() option of margins is used to set predictors to particular values. When at() is used, contrasts are by default performed within each at() level: margins, contrast — Contrasts of margins 1211 . margins r.agegroup, at(group=(1/3)) Contrasts of adjusted predictions Model VCE : OIM Expression : Pr(outcome), predict() 1._at : group = 1 2._at : group = 2 3._at : group = 3 df chi2 P>chi2 1 1 1 1 1 1 6 6.94 1.18 3.10 173.42 57.77 5.12 266.84 0.0084 0.2783 0.0783 0.0000 0.0000 0.0236 0.0000 (30-39 (30-39 (30-39 (40+ (40+ (40+ (30-39 (30-39 (30-39 (40+ (40+ (40+ agegroup@_at vs 20-29) 1 vs 20-29) 2 vs 20-29) 3 vs 20-29) 1 vs 20-29) 2 vs 20-29) 3 Joint agegroup@_at vs 20-29) 1 vs 20-29) 2 vs 20-29) 3 vs 20-29) 1 vs 20-29) 2 vs 20-29) 3 Contrast Delta-method Std. Err. .0819713 .0166206 .0243462 .3447797 .1540882 .0470319 .0311208 .0153309 .0138291 .0261811 .0202722 .0207774 [95% Conf. Interval] .0209757 -.0134275 -.0027583 .2934658 .1143554 .006309 .142967 .0466686 .0514508 .3960937 .193821 .0877548 Our option at(group=(1/3)) manipulates the values of group and is therefore not equivalent to over(group). We see that the reference-category contrasts for agegroup have been performed within each at() level. For a similar example that uses the . at operator instead of the at() option, see Contrasts of at() groups—discrete effects in [R] marginsplot. The default within behavior of at() may be changed to joint behavior with the atjoint suboption: . margins r.agegroup, at(group=(1/3)) contrast(atjoint) Contrasts of adjusted predictions Model VCE : OIM Expression : Pr(outcome), predict() 1._at : group = 1 2._at : group = 2 3._at : group = 3 _at#agegroup (joint) (30-39 vs 20-29) (joint) (40+ vs 20-29) Joint df chi2 P>chi2 2 2 4 3.62 79.45 83.88 0.1641 0.0000 0.0000 Now the tests are performed jointly over the levels of group, the at() variable. The atjoint suboption is the analogue for at() of the overjoint suboption from example 3. 1212 margins, contrast — Contrasts of margins Example 6 What if we would like to apply a contrast operator, like r., to the at() levels? It is not possible to specify the operator inside the at() option. Instead, we need a new suboption, atcontrast(): . 
margins r.agegroup, at(group=(1/3)) contrast(atcontrast(r)) Contrasts of adjusted predictions Model VCE : OIM Expression 1._at 2._at 3._at : : : : Pr(outcome), predict() group = group = group = _at#agegroup (2 vs 1) (30-39 vs 20-29) (2 vs 1) (40+ vs 20-29) (3 vs 1) (30-39 vs 20-29) (3 vs 1) (40+ vs 20-29) Joint _at#agegroup (2 vs 1) (30-39 vs 20-29) (2 vs 1) (40+ vs 20-29) (3 vs 1) (30-39 vs 20-29) (3 vs 1) (40+ vs 20-29) 1 2 3 df chi2 P>chi2 1 1 1 1 4 3.55 33.17 2.86 79.36 83.88 0.0596 0.0000 0.0906 0.0000 0.0000 Contrast Delta-method Std. Err. -.0653508 -.1906915 -.0576251 -.2977479 .0346921 .0331121 .0340551 .0334237 [95% Conf. Interval] -.133346 -.25559 -.1243719 -.3632572 .0026445 -.1257931 .0091216 -.2322385 When we specify contrast(atcontrast(r)), margins will apply the r. reference-category operator to the levels of group, the variable specified inside at(). The default reference category is 1, the lowest level of group. Estimating treatment effects with margins margins with the contrast option can also be used to estimate treatment effects in certain cases. A treatment effect represents the change in an outcome variable that is attributable to a particular event, controlling for all other factors that could affect the outcome. For example, we might want to know how a person’s wage changes as a result of being in a union. Here the outcome variable is the person’s wage, and the “event” is membership in a union. The treatment effect measures the difference in a person’s wage as a result of being or not being in a union once we control for the person’s educational background, level of experience, industry, and other factors. In fact, Stata has an entire manual dedicated to estimators designed specifically for estimating treatment effects; see the Stata Treatment-Effects Reference Manual. Here we show how margins can be used to estimate treatment effects using the regression-adjustment estimator when the conditional independence assumption is met; see [TE] teffects intro. Regression adjustment simply means that we are going to use a regression model to predict the outcome variable, controlling for treatment status and other characteristics. The conditional independence assumption implies that we have enough variables in our dataset so that once we control for them in our regression model, the outcomes one would obtain with and without treatment are independent of how treatment status is determined. margins, contrast — Contrasts of margins 1213 Example 7: Regression adjustment with a binary treatment variable nlsw88.dta contains women’s wages (wage) in dollars per hour, a binary variable indicating their union status (union), years of experience (ttl exp), and a variable, grade, indicating the number of years of schooling completed. We want to know how being in a union (the treatment) affects women’s wages. Traditionally, a wage equation of the form ln wagei = β0 + β1 unioni + β2 gradei + β3 ttl exp + β4 ttl exp2 + i would be fit. However, there are two shortcomings that we will improve upon. First, to avoid the problem of predicting the level of a log-transformed dependent variable, we will use poisson with the vce(robust) option to fit an exponential regression model; see Wooldridge (2010, sec. 18.2) for background on this approach. Second, the previous equation implies that factors other than union status have the same impact on wages for both union and nonunion workers. 
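For concreteness, that traditional specification could be fit by ordinary least squares on the log of wage. The two commands below are only a sketch for comparison (assuming nlsw88.dta has been loaded) and are not used in what follows; lnwage is a variable name introduced here purely for illustration.
. generate lnwage = ln(wage)
. regress lnwage i.union c.grade c.ttl_exp##c.ttl_exp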
Regression-adjustment estimators allow all the variables to have different impacts depending on the level of the treatment variable, and we can accomplish that here using factor-variable notation. In Stata, we fit our model by typing . use http://www.stata-press.com/data/r13/nlsw88 (NLSW, 1988 extract) . poisson wage i.union##(c.grade c.ttl_exp##c.ttl_exp), vce(robust) note: you are responsible for interpretation of noncount dep. variable Iteration 0: log pseudolikelihood = -4770.7957 Iteration 1: log pseudolikelihood = -4770.7693 Iteration 2: log pseudolikelihood = -4770.7693 Poisson regression Number of obs = 1876 Wald chi2(7) = 1047.11 Prob > chi2 = 0.0000 Log pseudolikelihood = -4770.7693 Pseudo R2 = 0.1195 Robust Std. Err. wage Coef. union union grade ttl_exp .8638376 .0895252 .0805737 .168233 .0056874 .0114534 c.ttl_exp# c.ttl_exp -.0015502 union# c.grade union P>|z| [95% Conf. Interval] 5.13 15.74 7.03 0.000 0.000 0.000 .534107 .0783782 .0581255 1.193568 .1006722 .103022 .0004612 -3.36 0.001 -.0024541 -.0006463 -.0310298 .0088259 -3.52 0.000 -.0483282 -.0137314 union# c.ttl_exp union -.0404226 .0230113 -1.76 0.079 -.085524 .0046788 union# c.ttl_exp# c.ttl_exp union .0011808 .0008428 1.40 0.161 -.0004711 .0028327 .017488 .0893602 0.20 0.845 -.1576547 .1926308 _cons z 1214 margins, contrast — Contrasts of margins To see how union status affects wages, we can use margins: . margins r.union, vce(unconditional) Contrasts of predictive margins Expression : Predicted number of events, predict() union union (union vs nonunion) df chi2 P>chi2 1 26.22 0.0000 Contrast Unconditional Std. Err. 1.004119 .1960944 [95% Conf. Interval] .6197815 1.388457 The estimated contrast 1.004 indicates that on average, belonging to a union causes a woman’s wage to be slightly more than a dollar higher than if she were not in the union. This estimated contrast is called the average treatment effect (ATE). Conceptually, we predicted the wage of each woman in the estimation sample assuming she was in a union and obtained the sample mean. We then predicted each woman’s wage assuming she was not in a union and obtained that sample mean. The difference between these two sample means represents the ATE. We obtain essentially the same results by using teffects ra: . teffects ra (wage c.grade c.ttl_exp##c.ttl_exp, poisson) (union) Iteration 0: EE criterion = 2.611e-13 Iteration 1: EE criterion = 1.098e-26 Treatment-effects estimation Number of obs = Estimator : regression adjustment Outcome model : Poisson Treatment model: none 1876 Robust Std. Err. z P>|z| 1.004119 .1960421 5.12 0.000 .619884 1.388355 7.346493 .1096182 67.02 0.000 7.131645 7.561341 wage Coef. union (union vs nonunion) [95% Conf. Interval] ATE POmean union nonunion The point estimates of the ATE are identical to those we obtained using margins, though the standard errors differ slightly from those reported by margins. The standard errors from the two estimators are, however, asymptotically equivalent, meaning they would coincide with a sufficiently large dataset. The last statistic in this output table indicates the untreated potential-outcome mean (untreated POM), which is the mean predicted wage assuming each woman did not belong to a union. If we specify the pomeans option with teffects ra, we can obtain both the treated and the untreated POMs, which represent the predicted mean wages assuming all women were or were not in the union: margins, contrast — Contrasts of margins 1215 . 
teffects ra (wage c.grade c.ttl_exp##c.ttl_exp, poisson) (union), pomeans Iteration 0: EE criterion = 2.611e-13 Iteration 1: EE criterion = 1.098e-26 Treatment-effects estimation Number of obs = 1876 Estimator : regression adjustment Outcome model : Poisson Treatment model: none wage Coef. POmeans union nonunion union 7.346493 8.350612 Robust Std. Err. .1096182 .1757346 z P>|z| [95% Conf. Interval] 0.000 0.000 7.131645 8.006179 67.02 47.52 7.561341 8.695046 Notice that the difference between these two POMs equals 1.004119, which is the ATE we obtained earlier. In some applications, the average treatment effect of the treated (ATET) is more germane than the ATE. For example, if the untreated subjects in the sample could not possibly receive treatment (perhaps because a medical condition precludes their taking an experimental drug), then considering the counterfactual outcome had those subjects taken the drug may not be relevant. In these cases, the ATET is a better statistic because it measures the effect of the treatment only for those subjects who actually did receive treatment. Like the ATE, the ATET involves computing predicted outcomes for each treatment level, obtaining the sample means, and computing the difference between those two means. Unlike the ATE, however, we only use observations corresponding to treated subjects. Example 8: Regression adjustment with a binary treatment variable (continued) Here we calculate the ATET of union membership, first using margins. Because teffects ra overwrote our estimation results, we first quietly refit our poisson model. We then call margins to obtain the ATET: . quietly poisson wage i.union##(c.grade c.ttl_exp##c.ttl_exp), vce(robust) . margins r.union, subpop(union) vce(unconditional) Contrasts of predictive margins Expression : Predicted number of events, predict() union union (union vs nonunion) df chi2 P>chi2 1 18.86 0.0000 Contrast Unconditional Std. Err. .901419 .2075863 [95% Conf. Interval] .4945574 1.308281 The key here was specifying the subpop(union) option to restrict margin’s computations to those women who are union members. The results indicate that being in the union causes the union members’ wages to be about 90 cents higher than they would otherwise be. 1216 margins, contrast — Contrasts of margins To replicate these results using teffects ra, we include the atet option to obtain ATETs: . teffects ra (wage c.grade c.ttl_exp##c.ttl_exp, poisson) (union), atet Iteration 0: EE criterion = 2.611e-13 Iteration 1: EE criterion = 9.324e-27 Treatment-effects estimation Number of obs = Estimator : regression adjustment Outcome model : Poisson Treatment model: none 1876 Robust Std. Err. z P>|z| [95% Conf. Interval] .901419 .2075309 4.34 0.000 .4946658 1.308172 7.776417 .162121 47.97 0.000 7.458665 8.094168 wage Coef. union (union vs nonunion) ATET POmean union nonunion We obtain the same point estimate of the effect of union status as with margins. As before, the standard errors differ slightly between the two estimators, but they are asymptotically equivalent. The output also indicates that among the women who are in a union, their average wage would be $7.78 if they were not in a union. Technical note One advantage of the ATET over the ATE is that the ATET can be consistently estimated with slightly weaker assumptions than are required to consistently estimate the ATE. See Comparing the ATE and ATET in Remarks and examples of [TE] teffects intro advanced. 
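As an aside, the potential-outcome means underlying the ATET can also be recovered from margins itself. The sketch below assumes we again quietly refit the poisson model (because teffects ra replaced those estimation results) and simply omits the r. contrast operator, so that margins reports the two subpopulation averages rather than their difference:
. quietly poisson wage i.union##(c.grade c.ttl_exp##c.ttl_exp), vce(robust)
. margins union, subpop(union) vce(unconditional)
The margin for nonunion members should essentially reproduce the potential-outcome mean reported by teffects ra above, and the difference between the two margins is the ATET.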
Both margins and teffects can estimate treatment effects using regression adjustment, so which should you use? In addition to regression adjustment, the teffects command implements other estimators of treatment effects; some of these estimators possess desirable robustness properties that we cannot replicate using margins. Moreover, all the teffects estimators use a common syntax and automatically present the estimated treatment effects, whereas we must first fit our own regression model and then call margins to obtain the treatment effects. On the other hand, particularly with the at() option, margins gives us more flexibility in specifying our scenarios. The teffects commands allow us to measure the effect of a single binary or multinomial treatment, but we can have margins compute the effects of arbitrary interventions, as we illustrate in the next example. Example 9: Interventions involving multiple variables Suppose we want to see how women’s wages would be affected if we could increase each woman’s education level by one year. That is, we want to measure the treatment effect of an additional year of schooling. We assume that if a woman attains another year of schooling, she cannot simultaneously work. Thus an additional year of education implies her total work experience must decrease by a year. The flexible at() option of margins allows us to manipulate both variables at once: margins, contrast — Contrasts of margins 1217 . quietly poisson wage i.union##(c.grade c.ttl_exp##c.ttl_exp), vce(robust) . margins, at((asobserved) _all) > at(grade=generate(grade+1) ttl_exp=generate(ttl_exp-1)) > contrast(atcontrast(r._at)) Contrasts of predictive margins Model VCE : Robust Expression : Predicted number of events, predict() 1._at 2._at : (asobserved) : grade ttl_exp _at _at (2 vs 1) = grade+1 = ttl_exp-1 df chi2 P>chi2 1 58.53 0.0000 Contrast Delta-method Std. Err. .3390392 .0443161 [95% Conf. Interval] .2521813 .4258971 The first at() option instructs margins to obtain predicted wages for all women in the sample using their existing values for grade and ttl exp and to record the mean of those predictions. The second at() option instructs margins to obtain the mean predicted wage under the counterfactual scenario where each woman’s education level is increased by one year and total work experience is simultaneously decreased by one year. The contrast() option instructs margins to compute the difference between the two means. The output indicates that increasing education by one year, which will necessarily decrease work experience by the same amount, will cause the average wage to increase by about 34 cents per hour, a statistically significant amount. Conclusion margins, contrast is a powerful command, and its abundance of suboptions may seem daunting. The suboptions are in the service of only three goals, however. There are three things that margins, contrast can do with a factor variable or a set of at() definitions: 1. Perform contrasts across the levels of the factor or set (as in example 1). 2. Perform a joint test across the levels of the factor or set (as in example 5). 3. Perform other tests and contrasts within each level of the factor or set (as in example 4). The default behavior for variables specified inside at(), over(), and within() is to perform contrasts within groups; the default behavior for variables in the marginlist is to perform joint tests across groups. 
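To make these three behaviors concrete, here is a compact recap using the agegroup and group model from examples 2 through 6. This is only a sketch; it assumes that logistic fit is still the current estimation result, and each command already appeared above.
. margins r.agegroup                                    // contrasts across the levels of agegroup
. margins r.agegroup, at(group=(1/3)) contrast(atjoint)  // joint tests across the at() groups
. margins agegroup, over(r.group) contrast(marginswithin effects)  // contrasts of group within each level of agegroup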
1218 margins, contrast — Contrasts of margins Stored results margins, contrast stores the following additional results in r(): Scalars r(k terms) Macros r(cmd) r(cmd2) r(overall) Matrices r(L) r(chi2) r(p) r(df) number of terms participating in contrasts contrast margins overall or empty matrix vector vector vector of of of of contrasts applied to the margins χ2 statistics p-values corresponding to r(chi2) degrees of freedom corresponding to r(p) margins, contrast with the post option also stores the following additional results in e(): Scalars e(k terms) Macros e(cmd) e(cmd2) e(overall) Matrices e(L) e(chi2) e(p) e(df) number of terms participating in contrasts contrast margins overall or empty matrix vector vector vector of of of of contrasts applied to the margins χ2 statistics p-values corresponding to e(chi2) degrees of freedom corresponding to e(p) Methods and formulas See Methods and formulas in [R] margins and Methods and formulas in [R] contrast. Reference Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press. Also see [R] contrast — Contrasts and linear hypothesis tests after estimation [R] lincom — Linear combinations of estimators [R] margins — Marginal means, predictive margins, and marginal effects [R] margins postestimation — Postestimation tools for margins [R] margins, pwcompare — Pairwise comparisons of margins [R] pwcompare — Pairwise comparisons Title margins, pwcompare — Pairwise comparisons of margins Syntax Remarks and examples Menu Stored results Description Methods and formulas Suboptions Also see Syntax margins marginlist if in weight , pwcompare margins options margins marginlist if in weight , pwcompare(suboptions) margins options where marginlist is a list of factor variables or interactions that appear in the current estimation results. The variables may be typed with or without the i. prefix, and you may use any factor-variable syntax: . margins i.sex i.group i.sex#i.group, pwcompare . margins sex group sex#i.group, pwcompare . margins sex##group, pwcompare See [R] margins for the available margins options. Description suboptions Pairwise comparisons cieffects pveffects effects cimargins groups sort show effects table with confidence intervals; the default show effects table with p-values show effects table with confidence intervals and p-values show table of margins and confidence intervals show table of margins and group codes sort the margins or contrasts in each term fweights, aweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. Menu Statistics > Postestimation > Pairwise comparisons of margins Description margins with the pwcompare option performs pairwise comparisons of margins. margins, pwcompare extends the capabilities of pwcompare to any of the nonlinear responses, predictive margins, or other margins that can be estimated by margins. 1219 1220 margins, pwcompare — Pairwise comparisons of margins Suboptions Pairwise comparisons cieffects specifies that a table of the pairwise comparisons with their standard errors and confidence intervals be reported. This is the default. pveffects specifies that a table of the pairwise comparisons with their standard errors, test statistics, and p-values be reported. effects specifies that a table of the pairwise comparisons with their standard errors, test statistics, p-values, and confidence intervals be reported. cimargins specifies that a table of the margins with their standard errors and confidence intervals be reported. 
groups specifies that a table of the margins with their standard errors and group codes be reported. Margins with the same letter in the group code are not significantly different at the specified significance level. sort specifies that the reported tables be sorted on the margins or contrasts in each term. Remarks and examples You should be familiar with the concepts and syntax of both margins and pwcompare before using the pwcompare option of margins. These remarks build on those in [R] margins and [R] pwcompare. margins can perform pairwise comparisons of any of the margins that it estimates. We begin by fitting a logistic regression model using the NHANES II dataset, ignoring the complex survey nature of the data. Our dependent variable is highbp, an indicator for whether a person has high blood pressure. We fit an interacted model including two factor variables representing the region of the country as well as the continuous covariate bmi. . use http://www.stata-press.com/data/r13/nhanes2 . logistic highbp region##c.bmi (output omitted ) By default, margins will compute the predictive margins of the probability of a positive outcome for each of the terms in marginlist after logistic regression. We will margin on region so that margins will estimate the average predicted probabilities of having high blood pressure conditional on being in each of the four regions and unconditional on BMI. We can specify the pwcompare option to obtain all possible pairwise comparisons of these predictive margins: margins, pwcompare — Pairwise comparisons of margins 1221 . margins region, pwcompare Pairwise comparisons of predictive margins Model VCE : OIM Expression : Pr(highbp), predict() region MW vs NE S vs NE W vs NE S vs MW W vs MW W vs S Contrast Delta-method Std. Err. -.0377194 -.0156843 -.006873 .0220351 .0308463 .0088112 .0133571 .0133986 .0136595 .0124564 .0127366 .0127801 Unadjusted [95% Conf. Interval] -.0638987 -.041945 -.0336451 -.0023789 .0058831 -.0162373 -.01154 .0105764 .019899 .0464492 .0558096 .0338598 This table gives each of the pairwise differences with confidence intervals. We can see that the confidence interval in the row labeled MW vs NE does not include 0. At the 5% level, the predictive margins for the first and second regions, the Northeast and the Midwest, are significantly different. The same is true of the second and fourth regions, the Midwest and the West. With many pairwise comparisons, output in this format can be difficult to sort through. We can organize it by adding the group suboption: . margins region, pwcompare(group) Pairwise comparisons of predictive margins Model VCE : OIM Expression : Pr(highbp), predict() Margin region NE MW S W .4388358 .4011164 .4231516 .4319628 Delta-method Unadjusted Std. Err. Groups .010069 .0087764 .0088395 .0092301 B A AB B Note: Margins sharing a letter in the group label are not significantly different at the 5% level. The group output includes the predictive margins for each region and letters denoting margins that are not significantly different from one another. In this case, the Northeast (NE), South (S), and West (W) regions have the letter B in the “Unadjusted Groups” column. The letter B indicates that the average predicted probability for the Northeast region is not significantly different from the average predicted probabilities for the South and West regions at the 5% significance level. 
The Midwest (MW) region does not share a letter with the Northeast region nor the West region, which indicates that the average predicted probability for the Midwest region is significantly different for each of the other two regions at our 5% level. We can also include the mcompare(bonferroni) option to perform tests using Bonferroni’s method to account for making multiple comparisons. 1222 margins, pwcompare — Pairwise comparisons of margins . margins region, pwcompare(group) mcompare(bonferroni) Pairwise comparisons of predictive margins Model VCE : OIM Expression : Pr(highbp), predict() Number of Comparisons region 6 Margin region NE MW S W .4388358 .4011164 .4231516 .4319628 Delta-method Bonferroni Std. Err. Groups .010069 .0087764 .0088395 .0092301 B A AB AB Note: Margins sharing a letter in the group label are not significantly different at the 5% level. We now see the letter A on the row corresponding to the West region. At the 5% level and with Bonferroni’s adjustment, the predictive margins for the probability in the Midwest and West regions are not significantly different. Stored results margins, pwcompare stores the following additional results in r(): Scalars r(k terms) Macros r(cmd) r(cmd2) r(group#) r(mcmethod vs) r(mctitle vs) r(mcadjustall vs) Matrices r(b) r(V) r(b vs) r(V vs) r(error vs) r(table vs) r(L) number of terms participating in pairwise comparisons pwcompare margins group code for the #th margin in r(b) method from mcompare() title for method from mcompare() adjustall or empty margin estimates variance–covariance matrix of the margin estimates margin difference estimates variance–covariance margin difference of the margin estimates margin difference estimability codes; 0 means estimable, 8 means not estimable matrix containing the margin differences with their standard errors, test statistics, p-values, and confidence intervals matrix that produces the margin differences margins, pwcompare — Pairwise comparisons of margins 1223 margins, pwcompare with the post option also stores the following additional results in e(): Scalars e(k terms) Macros e(cmd) e(cmd2) Matrices e(b) e(V) e(b vs) e(V vs) e(error vs) e(L) number of terms participating in pairwise comparisons pwcompare margins margin estimates variance–covariance matrix of the margin estimates margin difference estimates variance–covariance margin difference of the margin estimates margin difference estimability codes; 0 means estimable, 8 means not estimable matrix that produces the margin differences Methods and formulas See Methods and formulas in [R] margins and Methods and formulas in [R] pwcompare. Also see [R] contrast — Contrasts and linear hypothesis tests after estimation [R] margins — Marginal means, predictive margins, and marginal effects [R] margins, contrast — Contrasts of margins [R] margins postestimation — Postestimation tools for margins [R] pwcompare — Pairwise comparisons Title marginsplot — Graph results from margins (profile plots, etc.) 
Syntax Options Acknowledgments Menu Remarks and examples References Description Addendum: Advanced uses of dimlist Also see Syntax marginsplot , options options Description Main xdimension(dimlist , dimopts ) plotdimension(dimlist , dimopts ) ) bydimension(dimlist , dimopts graphdimension(dimlist , dimopts ) horizontal noci name(name | stub , replace ) use dimlist to define x axis create plots for groups in dimlist create subgraphs for groups in dimlist create graphs for groups in dimlist swap x and y axes do not plot confidence intervals name of graph, or stub if multiple graphs Labels allxlabels nolabels allsimplelabels nosimplelabels separator(string) noseparator place ticks and labels on the x axis for each value label groups with their values, not their labels forgo variable name and equal signs in all labels include variable name and equal signs in all labels separator for labels when multiple variables are specified in a dimension do not use a separator Plot plotopts(plot options) plot#opts(plot options) recast(plottype) affect rendition of all margin plots affect rendition of #th margin plot plot margins using plottype CI plot ciopts(rcap options) ci#opts(rcap options) recastci(plottype) mcompare(method) level(#) affect rendition of all confidence interval plots affect rendition of #th confidence interval plot plot confidence intervals using plottype adjust for multiple comparisons set confidence level Pairwise unique csort plot only unique pairwise comparisons sort comparison categories first Add plots addplot(plot) add other plots to the graph 1224 marginsplot — Graph results from margins (profile plots, etc.) 1225 Y axis, X axis, Titles, Legend, Overall, By any options documented in [G-3] twoway options how subgraphs are combined, labeled, etc. twoway options byopts(byopts) where dimlist may be any of the dimensions across which margins were computed in the immediately preceding margins command; see [R] margins. That is to say, dimlist may be any variable used in the margins command, including variables specified in the at(), over(), and within() options. More advanced specifications of dimlist are covered in Addendum: Advanced uses of dimlist. dimopts Description labels(lablist) elabels(elablist) nolabels allsimplelabels nosimplelabels separator(string) list of quoted strings to label each level of the dimension list of enumerated labels label groups with their values, not their labels forgo variable name and equal signs in all labels include variable name and equal signs in all labels separator for labels when multiple variables are specified in the dimension do not use a separator noseparator where lablist is defined as "label" "label" . . . elablist is defined as # "label" # "label" . . . and the #s are the indices of the levels of the dimension—1 is the first level, 2 is the second level, and so on. plot options Description marker options marker label options cline options change look of markers (color, size, etc.) add marker labels; change look or position change look of the line method Description noadjust bonferroni adjustall sidak adjustall scheffe do not adjust for multiple comparisons Bonferroni’s method; adjust across all terms Šidák’s method; adjust across all terms Scheffé’s method Menu Statistics > Postestimation > Margins plots and profile plots 1226 marginsplot — Graph results from margins (profile plots, etc.) Description marginsplot graphs the results of the immediately preceding margins command; see [R] margins. 
Common names for some of the graphs that marginsplot can produce are profile plots and interaction plots. Options Main xdimension(), plotdimension(), bydimension(), and graphdimension() specify the variables from the preceding margins command whose group levels will be used for the graph’s x axis, plots, by() subgraphs, and graphs. marginsplot chooses default dimensions based on the margins command. In most cases, the first variable appearing in an at() option and evaluated over more than one value is used for the x axis. If no at() variable meets this condition, the first variable in the marginlist is usually used for the x axis and the remaining variables determine the plotted lines or markers. Pairwise comparisons and graphs of marginal effects (derivatives) have different defaults. In all cases, you may override the defaults and explicitly control which variables are used on each dimension of the graph by using these dimension options. Each of these options supports suboptions that control the labeling of the dimension—axis labels for xdimension(), plot labels for plotdimension(), subgraph titles for bydimension(), and graph titles for graphdimension() titles. For examples using the dimension options, see Controlling the graph’s dimensions. xdimension(dimlist , dimopts ) specifies the variables for the x axis in dimlist and controls the content of those labels with dimopts. plotdimension(dimlist , dimopts ) specifies in dimlist the variables whose group levels determine the plots and optionally specifies in dimopts the content of the plots’ labels. bydimension(dimlist , dimopts ) specifies in dimlist the variables whose group levels determine the by() subgraphs and optionally specifies in dimopts the content of the subgraphs’ titles. For an example using by(), see Three-way interactions. graphdimension(dimlist , dimopts ) specifies in dimlist the variables whose group levels determine the graphs and optionally specifies in dimopts the content of the graphs’ titles. horizontal reverses the default x and y axes. By default, the y axis represents the estimates of the margins and the x axis represents one or more factors or continuous covariates. Specifying horizontal swaps the axes so that the x axis represents the estimates of the margins. This option can be useful if the labels on the factor or continuous covariates are long. The horizontal option is discussed in Horizontal is sometimes better. noci removes plots of the pointwise confidence intervals. The default is to plot the confidence intervals. name(name | stub , replace ) specifies the name of the graph or graphs. If the graphdimension() option is specified, or if the default action is to produce multiple graphs, then the argument of name() is taken to be stub and graphs named stub1, stub2, . . . are created. The replace suboption causes existing graphs with the specified name or names to be replaced. If name() is not specified, default names are used and the graphs may be replaced by subsequent marginsplot or other graphing commands. marginsplot — Graph results from margins (profile plots, etc.) 1227 Labels With the exception of allxlabels, all of these options may be specified either directly as options or as dimopts within options xdimension(), plotdimension(), bydimension(), and graphdimension(). When specified in one of the dimension options, only the labels for that dimension are affected. When specified outside the dimension options, all labels on all dimensions are affected. 
Specifications within the dimension options take precedence. allxlabels specifies that tick marks and labels be placed on the x axis for each value of the x-dimension variables. By default, if there are more than 25 ticks, default graph axis labeling rules are applied. Labeling may also be specified using the standard graph twoway x-axis label rules and options—xlabel(); see [G-3] axis label options. nolabels specifies that value labels not be used to construct graph labels and titles for the group levels in the dimension. By default, if a variable in a dimension has value labels, those labels are used to construct labels and titles for axis ticks, plots, subgraphs, and graphs. Graphs of contrasts and pairwise comparisons are an exception to this rule and are always labeled with values rather than value labels. allsimplelabels and nosimplelabels control whether graphs’ labels and titles include just the values of the variables or include variable names and equal signs. The default is to use just the value label for variables that have value labels and to use variable names and equal signs for variables that do not have value labels. An example of the former is “Female” and the latter is “country=2”. Sometimes value labels are universally descriptive, and sometimes they have meaning only when considered in relation to their variable. For example, “Male” and “Female” are typically universal, regardless of the variable from which they are taken. “High” and “Low” may not have meaning unless you know they are in relation to a specific measure, say, blood-pressure level. The allsimplelabels and nosimplelabels options let you override the default labeling. allsimplelabels specifies that all titles and labels use just the value or value label of the variable. nosimplelabels specifies that all titles and labels include varname= before the value or value label of the variable. separator(string) and noseparator control the separator between label sections when more than one variable is used to specify a dimension. The default separator is a comma followed by a space, but no separator may be requested with noseparator or the default may be changed to any string with separator(). For example, if plotdimension(a b) is specified, the plot labels in our graph legend might be “a=1, b=1”, “a=1, b=2”, . . . . Specifying separator(:) would create labels “a=1:b=1”, “a=1:b=2”, . . . . Plot plotopts(plot options) affects the rendition of all margin plots. The plot options can affect the size and color of markers, whether and how the markers are labeled, and whether and how the points are connected; see [G-3] marker options, [G-3] marker label options, and [G-3] cline options. These settings may be overridden for specific plots by using the plot#opts() option. plot#opts(plot options) affects the rendition of the #th margin plot. The plot options can affect the size and color of markers, whether and how the markers are labeled, and whether and how the points are connected; see [G-3] marker options, [G-3] marker label options, and [G-3] cline options. 1228 marginsplot — Graph results from margins (profile plots, etc.) recast(plottype) specifies that margins be plotted using plottype. plottype may be scatter, line, connected, bar, area, spike, dropline, or dot; see [G-2] graph twoway. When recast() is specified, the plot-rendition options appropriate to the specified plottype may be used in lieu of plot options. For details on those options, follow the appropriate link from [G-2] graph twoway. 
For an example using recast(), see Continuous covariates. You may specify recast() within a plotopts() or plot#opts() option. It is better, however, to specify it as documented here, outside those options. When specified outside those options, you have greater access to the plot-specific rendition options of your specified plottype. CI plot ciopts(rcap options) affects the rendition of all confidence interval plots; see [G-3] rcap options. These settings may be overridden for specific confidence interval plots with the ci#opts() option. ci#opts(rcap options) affects the rendition of the #th confidence interval; see [G-3] rcap options. recastci(plottype) specifies that confidence intervals be plotted using plottype. plottype may be rarea, rbar, rspike, rcap, rcapsym, rline, rconnected, or rscatter; see [G-2] graph twoway. When recastci() is specified, the plot-rendition options appropriate to the specified plottype may be used in lieu of rcap options. For details on those options, follow the appropriate link from [G-2] graph twoway. For an example using recastci(), see Continuous covariates. You may specify recastci() within a ciopts() or ci#opts() option. It is better, however, to specify it as documented here, outside those options. When specified outside those options, you have greater access to the plot-specific rendition options of your specified plottype. mcompare(method) specifies the method for confidence intervals that account for multiple comparisons within a factor-variable term. The default is determined by the margins results stored in r(). If marginsplot is working from margins results stored in e(), the default is mcompare(noadjust). level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is determined by the margins results stored in r(). If marginsplot is working from margins results stored in e(), the default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. Pairwise These options have an effect only when the pwcompare option was specified on the preceding margins command. unique specifies that only unique pairwise comparisons be plotted. The default is to plot all pairwise comparisons, including those that are mirror images of each other—“male” versus “female” and “female” versus “male”. margins reports only the unique pairwise comparisons. unique also changes the default xdimension() for graphs of pairwise comparisons from the reference categories ( pw0) to the comparisons of each pairwise category ( pw). Unique comparisons are often preferred with horizontal graphs that put all pairwise comparisons on the x axis, whereas including the full matrix of comparisons is preferred for charts showing the reference groups on an axis and the comparison groups as plots; see Pairwise comparisons and Horizontal is sometimes better. csort specifies that comparison categories are sorted first, and then reference categories are sorted within comparison category. The default is to sort reference categories first, and then sort comparison categories within reference categories. This option has an observable effect only when pw is also marginsplot — Graph results from margins (profile plots, etc.) 1229 specified in one of the dimension options. It then determines the order of the labeling in the dimension where pw is specified. Add plots addplot(plot) provides a way to add other plots to the generated graph; see [G-3] addplot option. For an example using addplot(), see Adding scatterplots of the data. 
If multiple graphs are drawn by a single marginsplot command or if plot specifies plots with multiple y variables, for example, scatter y1 y2 x, then the graph’s legend will not clearly identify all the plots and will require customization using the legend() option; see [G-3] legend options. Y axis, X axis, Titles, Legend, Overall, By twoway options are any of the options documented in [G-3] twoway options. These include options for titling the graph (see [G-3] title options); for saving the graph to disk (see [G-3] saving option); for controlling the labeling and look of the axes (see [G-3] axis options); for controlling the look, contents, position, and organization of the legend (see [G-3] legend options); for adding lines (see [G-3] added line options) and text (see [G-3] added text options); and for controlling other aspects of the graph’s appearance (see [G-3] twoway options). The label() suboption of the legend() option has no effect on marginsplot. Use the order() suboption instead. byopts(byopts) affects the appearance of the combined graph when bydimension() is specified or when the default graph has subgraphs, including the overall graph title, the position of the legend, and the organization of subgraphs. See [G-3] by option. Remarks and examples Remarks are presented under the following headings: Introduction Dataset Profile plots Interaction plots Contrasts of margins—effects (discrete marginal effects) Three-way interactions Continuous covariates Plots at every value of a continuous covariate Contrasts of at() groups—discrete effects Controlling the graph’s dimensions Pairwise comparisons Horizontal is sometimes better Marginal effects Plotting a subset of the results from margins Advanced usage Plots with multiple terms Plots with multiple at() options Adding scatterplots of the data Video examples 1230 marginsplot — Graph results from margins (profile plots, etc.) Introduction marginsplot is a post-margins command. It graphs the results of the margins command, whether those results are marginal means, predictive margins, marginal effects, contrasts, pairwise comparisons, or other statistics; see [R] margins. By default, the margins are plotted on the y axis, and all continuous and factor covariates specified in the margins command will usually be placed on the x axis or used to identify plots. Exceptions are discussed in the following sections and in Addendum: Advanced uses of dimlist below. marginsplot produces classic plots, such as profile plots and interaction plots. Beyond that, anything that margins can compute, marginsplot can graph. We will be using some relatively complicated margins commands with little explanation of the syntax. We will also avoid lengthy interpretations of the results of margins. See [R] margins for the complete syntax of margins and discussions of its results. All graphs in this entry were drawn using the s2gcolor scheme; see [G-4] scheme s2. Mitchell (2012) shows in many examples how to use marginsplot to understand a fitted model. Dataset For continuity, we will use one dataset for most examples—the Second National Health and Nutrition Examination Survey (NHANES II) (McDowell et al. 1981). NHANES II is part of a study to assess the health and nutritional status of adults and children in the United States. It is designed to be a nationally representative sample of the U.S. population. This particular sample is from 1976 to 1980. The survey nature of the dataset—weights, strata, and sampling units—will be ignored in our analyses. 
We are discussing graphing, not survey statistics. If you would like to see the results with the appropriate adjustments for the survey design, just add svy: before each estimation command, and if you wish, add vce(unconditional) as an option to each margins command. See [R] margins, particularly the discussion and examples under Obtaining margins with survey data and representative samples, for reasons why you probably would want to add vce(unconditional) when analyzing survey data. For the most part, adjusting for survey design produces moderately larger confidence intervals and relatively small changes in point estimates. Profile plots What does my estimation say about how my response varies as one (or more) of my covariates changes? That is the question that is answered by profile plots. Profile plots are also referred to as plots of estimated (or expected, or least-squares) means, though that is unnecessarily restrictive when considering models of binary, count, and ordered outcomes. In the latter cases, we might prefer to say they plot conditional expectations of responses, where a response might be a probability. What we do with the other covariates depends on the questions we wish to answer. Sometimes we wish to hold other covariates at fixed values, and sometimes we wish to average the response over their values. margins can do either, so you can graph either. We can fit a fully factorial two-way ANOVA of systolic blood pressure on age group and sex using the NHANES II data. marginsplot — Graph results from margins (profile plots, etc.) . use http://www.stata-press.com/data/r13/nhanes2 . anova bpsystol agegrp##sex Number of obs = 10351 Root MSE = 20.2209 R-squared = Adj R-squared = MS F 1231 0.2497 0.2489 Source Partial SS df Prob > F Model 1407229.28 11 127929.935 312.88 0.0000 agegrp sex agegrp#sex 1243037.82 27728.3794 88675.043 5 1 5 248607.565 27728.3794 17735.0086 608.02 67.81 43.37 0.0000 0.0000 0.0000 Residual 4227440.75 10339 408.882943 Total 5634670.03 10350 544.412563 If you are more comfortable with regression than ANOVA, then type . regress bpsystol agegrp##sex The anova and regress commands fit identical models. The output from anova displays all the terms in the model and thus tends to be more conducive to exploration with margins and marginsplot. We estimate the predictive margins of systolic blood pressure for each age group using margins. . margins agegrp Predictive margins Expression : Linear prediction, predict() Margin agegrp 20-29 30-39 40-49 50-59 60-69 70+ 117.2684 120.2383 126.9255 135.682 141.5285 148.1096 Delta-method Std. Err. .419845 .5020813 .56699 .5628593 .3781197 .6445073 t 279.31 239.48 223.86 241.06 374.30 229.80 Number of obs = 10351 P>|t| [95% Conf. Interval] 0.000 0.000 0.000 0.000 0.000 0.000 116.4454 119.2541 125.8141 134.5787 140.7873 146.8463 118.0914 121.2225 128.0369 136.7853 142.2696 149.373 The six predictive margins are just the averages of the predictions over the estimation sample, holding agegrp to each of its six levels. If this were a designed experiment rather than survey data, we might wish to assume the cells are balanced—that they have the same number of observations—and thus estimate what are often called expected means or least-squares means. To do that, we would simply add the asbalanced option to the margins command. The NHANES II data are decidedly unbalanced over sex#agegrp cells. So much so that it is unreasonable to assume the cells are balanced. 1232 marginsplot — Graph results from margins (profile plots, etc.) 
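Purely to illustrate the syntax (we do not run it here, and with cells this unbalanced it would not be a reasonable analysis), the as-balanced version of the command above would be
. margins agegrp, asbalanced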
We graph the results: . marginsplot Variables that uniquely identify margins: agegrp 110 120 Linear Prediction 130 140 150 Predictive Margins of agegrp with 95% CIs 20−29 30−39 40−49 50−59 Age Group 60−69 70+ Profile plots are often drawn without confidence intervals (CIs). The CIs may be removed by adding the noci option. We prefer to see the CIs. Disciplines vary widely in their use of the term profile plot. Some disciplines consider any connected plot of a response over values of other variables to be a profile plot. By that definition, most graphs in this entry are profile plots. Interaction plots Interaction plots are often used to explore the form of an interaction. The interaction term in our ANOVA results is highly significant. Are the interaction effects also large enough to matter? What form do they take? We can answer these questions by fixing agegrp and sex to each possible combination of the two covariates and estimating the margins for those cells. . margins agegrp#sex marginsplot — Graph results from margins (profile plots, etc.) 1233 Then we can graph the results: . marginsplot Variables that uniquely identify margins: agegrp sex 110 120 Linear Prediction 130 140 150 Adjusted Predictions of agegrp#sex with 95% CIs 20−29 30−39 40−49 50−59 Age Group Male 60−69 70+ Female It is clear that the effect of age differs by sex—there is an interaction. If there were no interaction, then the two lines would be parallel. While males start out with higher systolic blood pressure, females catch up to the males as age increases and may even surpass males in the upper age groups. We say “may” because we cannot tell if the differences are statistically significant. The CIs overlap for the top three age groups. It is tempting to conclude from this overlap that the differences are not statistically significant. Do not fall into this trap. Likewise, do not fall into the trap that the first three age groups are different because their CIs do not overlap. The CIs are for the point estimates, not the differences. There is a covariance between the differences that we must consider if we are to make statements about those differences. Contrasts of margins—effects (discrete marginal effects) To assess the differences, all we need do is ask margins to contrast the sets of effects that we just estimated; see [R] margins, contrast. With only two groups in sex, it does not matter much which contrast operator we choose. We will use the reference contrast. It will compare the difference between males and females, with males (the first category) as the reference category. 1234 marginsplot — Graph results from margins (profile plots, etc.) . margins r.sex@agegrp Contrasts of adjusted predictions Expression : Linear prediction, predict() df F P>F sex@agegrp (Female vs Male) 20-29 (Female vs Male) 30-39 (Female vs Male) 40-49 (Female vs Male) 50-59 (Female vs Male) 60-69 (Female vs Male) 70+ Joint 1 1 1 1 1 1 6 224.92 70.82 12.15 0.47 3.88 6.37 53.10 0.0000 0.0000 0.0005 0.4949 0.0488 0.0116 0.0000 Denominator 10339 sex@agegrp (Female vs Male) 20-29 (Female vs Male) 30-39 (Female vs Male) 40-49 (Female vs Male) 50-59 (Female vs Male) 60-69 (Female vs Male) 70+ Contrast Delta-method Std. Err. -12.60132 -8.461161 -3.956451 -.7699782 1.491684 3.264762 .8402299 1.005448 1.134878 1.128119 .756906 1.293325 [95% Conf. 
Interval] -14.24833 -10.43203 -6.181031 -2.981309 .0080022 .729594 -10.9543 -6.490288 -1.731871 1.441353 2.975367 5.79993 Because we are looking for effects that are different from 0, we will add a reference line at 0 to our graph. . marginsplot, yline(0) Variables that uniquely identify margins: agegrp −15 Contrasts of Linear Prediction −10 −5 0 5 Contrasts of Adjusted Predictions of sex@agegrp with 95% CIs 20−29 30−39 40−49 50−59 Age Group 60−69 70+ We can now say that females’ systolic blood pressure is substantially and significantly lower than males’ in the first three age groups but is significantly higher in the last two age groups. Despite the overlapping CIs for the last two age groups in the interaction graph, the effect of sex is significant in these age groups. marginsplot — Graph results from margins (profile plots, etc.) 1235 The terminology for what we just estimated and graphed varies widely across disciplines. Those versed in design of experiments refer to these values as contrasts or effects. Economists and some other social scientists call them marginal or partial effects. The latter groups might be more comfortable if we avoided the whole concept of contrasts and instead estimated the effects by typing . margins agegrp, dydx(sex) This will produce estimates that are identical to those shown above, and we can graph them by typing marginsplot. The advantage of using the contrast notation and thinking in contrasts is most evident when we take marginal effects with respect to a categorical covariate with more than two levels. Marginal effects for each level of the covariate will be taken with respect to a specified base level. Contrasts are much more flexible. Using the r. operator, we can reproduce the marginal-effects results by taking derivatives with respect to a reference level (as we saw above.) We can also estimate the marginal effect of first moving from level 1 to level 2, then from level 2 to level 3, then from level 3 to level 4, . . . using the ar. or “reverse adjacent” operator. Adjacent effects (marginal effects) can be valuable when evaluating an ordinal covariate, such as agegrp in our current model. For a discussion of contrasts, see [R] contrast and [R] margins, contrast. Three-way interactions marginsplot can handle any number of covariates in your margins command. Consider the three-way ANOVA model that results from adding an indicator for whether an individual has been diagnosed with diabetes. We will fully interact the new covariate with the others in the model. . anova bpsystol agegrp##sex##diabetes Number of obs = 10349 Root MSE = 20.131 Source Partial SS df MS R-squared = 0.2572 Adj R-squared = 0.2556 F Prob > F Model 1448983.17 23 62999.2681 155.45 0.0000 agegrp sex agegrp#sex diabetes agegrp#diabetes sex#diabetes agegrp#sex#diabetes 107963.582 1232.79267 11679.5925 7324.98924 5484.54623 102.988239 4863.14971 5 1 5 1 5 1 5 21592.7164 1232.79267 2335.91849 7324.98924 1096.90925 102.988239 972.629943 53.28 3.04 5.76 18.07 2.71 0.25 2.40 0.0000 0.0812 0.0000 0.0000 0.0189 0.6142 0.0349 Residual 4184296.88 10325 405.258778 Total 5633280.05 10348 544.38346 The three-way interaction is significant, as is the main effect of diabetes and its interaction with agegrp. Again, if you are more comfortable with regression than ANOVA, you may type . regress bpsystol agegrp##sex##diabetes The margins and marginsplot results will be the same. 1236 marginsplot — Graph results from margins (profile plots, etc.) 
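If you would like to confirm that the two fits lead to the same margins, one quick check (our own sketch, not part of the original entry; the choice of agegrp#diabetes is arbitrary, and the output is omitted) is to list the returned margin vector after each fit:

. anova bpsystol agegrp##sex##diabetes
. margins agegrp#diabetes
. matrix list r(b)
. regress bpsystol agegrp##sex##diabetes
. margins agegrp#diabetes
. matrix list r(b)

The two r(b) vectors match because the underlying models are identical.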
We estimate the expected cell means for each combination of agegrp, sex, and diabetes, and then graph the results by typing . margins agegrp#sex#diabetes (output omitted ) . marginsplot Variables that uniquely identify margins: agegrp sex diabetes 80 100 Linear Prediction 120 140 160 180 Adjusted Predictions of agegrp#sex#diabetes with 95% CIs 20−29 30−39 40−49 50−59 Age Group Male, diabetes=0 Female, diabetes=0 60−69 70+ Male, diabetes=1 Female, diabetes=1 The graph is busy and difficult to interpret. We can make it better by putting those with diabetes on one subgraph and those without on another: . marginsplot, by(diabetes) Variables that uniquely identify margins: agegrp sex diabetes Adjusted Predictions of agegrp#sex#diabetes with 95% CIs diabetes=1 140 120 100 80 Linear Prediction 160 diabetes=0 20−29 30−39 40−49 50−59 60−69 70+ 20−29 30−39 40−49 50−59 60−69 70+ Age Group Male Female We notice much larger CIs for diabetics. That is not surprising because our sample contains only 499 diabetics compared with 9,850 nondiabetics. marginsplot — Graph results from margins (profile plots, etc.) 1237 A more interesting way to arrange the plots is by grouping the subgraphs on sex: . marginsplot, by(sex) Variables that uniquely identify margins: agegrp sex diabetes Adjusted Predictions of agegrp#sex#diabetes with 95% CIs Female 140 120 100 80 Linear Prediction 160 Male 20−29 30−39 40−49 50−59 60−69 70+ 20−29 30−39 40−49 50−59 60−69 70+ Age Group diabetes=0 diabetes=1 Aside from increased systolic blood pressure in the upper-age groups, which we saw earlier, it appears that those with diabetes are at greater risk of higher systolic blood pressure for many upper-age groups. We can check that by having margins estimate the differences between diabetics and nondiabetics, and graphing the results. . margins r.diabetes@agegrp#sex (output omitted ) . marginsplot, by(sex) yline(0) Variables that uniquely identify margins: agegrp sex Contrasts of Adjusted Predictions of diabetes@agegrp#sex with 95% CIs sex: Female 20 0 −20 −40 Contrasts of Linear Prediction 40 sex: Male 20−29 30−39 40−49 50−59 60−69 70+ 20−29 30−39 40−49 50−59 60−69 70+ Age Group With CIs above 0 for six of eight age groups over 40, this graph provides evidence that diabetes is related to higher blood pressure in those over 40. 1238 marginsplot — Graph results from margins (profile plots, etc.) Continuous covariates margins and marginsplot are just as useful with continuous covariates as they are with factor variables. As a variation on our ANOVA/regression models, let’s move to a logistic regression, using as our dependent variable an indicator for whether a person has high blood pressure. We introduce a continuous covariate—body mass index (BMI), a measure of weight relative to height. High BMI is often associated with high blood pressure. We will allow the effect of BMI to vary across sexes, age groups, and sex/age combinations by fully interacting the covariates. . logistic highbp sex##agegrp##c.bmi If we wished, we could perform all the analyses above on this model. Instead of estimating margins, contrasts, and marginal effects on the level of systolic blood pressure, we would be estimating margins, contrasts, and marginal effects on the probability of having high blood pressure. You can see those results by repeating any of the prior commands that involve sex and agegrp. In this section, we will focus on the continuous covariate bmi. 
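Before turning to bmi, here is one such repetition spelled out as a sketch (output omitted); it reruns the sex-within-agegrp contrasts from earlier, now after the logistic fit:

. margins r.sex@agegrp
. marginsplot, yline(0)

The plotted contrasts are now differences in the probability of high blood pressure rather than differences in systolic blood pressure.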
With continuous covariates, rather than specify them in the marginlist of margins, we specify the specific values at which we want the covariate evaluated in an at() option. at() options are very flexible, and there are many ways to specify values; see Syntax of at() in [R] margins. BMI in our sample ranges from 12.4 to 61.1. Let’s estimate the predictive margins for males and females at levels of BMI from 10 through 65 at intervals of 5 and graph the results: . margins sex, at(bmi=(10(5)65)) (output omitted ) . marginsplot, xlabel(10(10)60) Variables that uniquely identify margins: bmi sex 0 .2 Pr(Highbp) .4 .6 .8 1 Predictive Margins of sex with 95% CIs 10 20 30 40 Body Mass Index (BMI) Male 50 60 Female We added the xlabel(10(10)60) option to improve the labeling of the x axis. You may add any twoway options (see [G-3] twoway options) to the marginsplot command. For a given BMI, males are generally more susceptible to high blood pressure, though the effect is attenuated by the logistic response when the probabilities approach 0 or 1. Because bmi is continuous, we might prefer to see the response graphed using a line. We might also prefer that the CIs be plotted as areas. We change the plottype of the response by using the recast() option and the plottype of the CI by using the recastci() option: marginsplot — Graph results from margins (profile plots, etc.) 1239 . marginsplot, xlabel(10(10)60) recast(line) recastci(rarea) Variables that uniquely identify margins: bmi sex 0 .2 Pr(Highbp) .4 .6 .8 1 Predictive Margins of sex with 95% CIs 10 20 30 40 Body Mass Index (BMI) Male 50 60 Female The CIs are a little dark for our tastes. You can dim them a bit by reducing the intensity of their color. Adding ciopts(color(*.8)) to our marginsplot command will do that. Any plot option accepted by twoway rarea (see [G-2] graph twoway rarea) may be specified in a ciopts() option. Given their confidence regions, the male and female profiles appear to be statistically different over most of the range of BMI. As with the profiles of categorical covariates, we can check that assertion by contrasting the two profiles on sex and graphing the results. Let’s improve the smoothness of the response by specifying intervals of 1 instead of 5. . margins r.sex, at(bmi=(10(1)65)) (output omitted ) . marginsplot, xlabel(10(10)60) recast(line) recastci(rarea) Variables that uniquely identify margins: bmi −.2 −.15 Contrasts of Pr(Highbp) −.1 −.05 0 .05 Contrasts of Predictive Margins of sex with 95% CIs 10 20 30 40 Body Mass Index (BMI) 50 60 We see that the difference between the sexes is largest at a BMI of about 35 and that the sexes respond more similarly with very high and very low BMI. This shape is largely determined by the 1240 marginsplot — Graph results from margins (profile plots, etc.) response of the logistic function, which is attenuated near probabilities 0 and 1, combined with the fact that the lowest measured BMIs are associated with low probabilities of high blood pressure and the highest measured BMIs are associated with high probabilities of high blood pressure. As when we contrasted profiles of categorical variables, different disciplines will think of this graph differently. Those familiar with designed experiments will be comfortable with the terms used above—this is a contrast of profiles, or a profile of effects, or a profile of a contrast. Many social scientists will prefer to think of this as a graph of marginal or partial effects. 
For them, this is a plot of the discrete marginal effect of being female for various levels of BMI. They can obtain an identical graph, with labeling more appropriate for the marginal effect’s interpretation, by typing . margins, at(bmi=(10(1)65)) dydx(sex) . marginsplot, xlabel(10(10)60) recast(line) recastci(rarea) We can also plot profiles of the response of BMI by levels of another continuous covariate (rather than by the categorical variable sex). To do so, we will need another continuous variable in our model. We have been using age groups as a covariate to emphasize the treatment of categorical variables and to allow the effect of age to be flexible. Our dataset also has age recorded in integer years. We replace agegrp with continuous age in our logistic regression. . logistic highbp sex##c.age##c.bmi We can now obtain profiles of BMI for different ages by specifying ranges for both bmi and age in a single at() option on the margins command: . margins sex, at(bmi=(10(5)60) age=(20(10)70)) With six ages specified, we have many profiles, so we will dispense with the CIs by adding the noci option and also tidy up the graph by asking for three columns in the legend: . marginsplot, noci by(sex) legend(cols(3)) Variables that uniquely identify margins: bmi age sex Adjusted Predictions of sex Female .5 0 Pr(Highbp) 1 Male 10 15 20 25 30 35 40 45 50 55 60 10 15 20 25 30 35 40 45 50 55 60 Body Mass Index (BMI) age=20 age=50 age=30 age=60 age=40 age=70 Our model seems to indicate that males have a sharper reaction to body mass indices than do females. Likewise, younger subjects display a sharper response, while older subjects have a more gradual response with earlier onset. That interpretation might be a result of our parametric treatment of age. As it turns out, the interpretation holds if we allow age to take more flexible forms or return to our use of age groups, which allows each of seven age groups to have unique BMI profiles. Here are the commands to perform that analysis: marginsplot — Graph results from margins (profile plots, etc.) 1241 . logistic highbp sex##agegrp##c.bmi (output omitted ) . margins sex#agegrp, at(bmi=(10(5)60)) (output omitted ) . marginsplot, noci by(sex) legend(cols(3)) Variables that uniquely identify margins: bmi sex agegrp Adjusted Predictions of sex#agegrp Female .5 0 Pr(Highbp) 1 Male 10 15 20 25 30 35 40 45 50 55 60 10 15 20 25 30 35 40 45 50 55 60 Body Mass Index (BMI) 20−29 50−59 30−39 60−69 40−49 70+ Plots at every value of a continuous covariate In some cases, the specific values of a continuous covariate are important, and we want to plot the response at those specific values. Return to our logistic example with age treated as a continuous covariate. . logistic highbp sex##c.age##c.bmi We can use a programming trick to extract all the values of age and then supply them in an at() option, just as we would any list of values. . levelsof age . margins sex, at(age=(‘r(levels)’)) See [P] levelsof for a discussion of the levelsof command. levelsof returns in r(levels) the sorted list of unique values of the specified varlist, in our case, age. We can then plot the results using marginsplot. This is not a very interesting trick when using our age variable, which is recorded as integers from 20 to 74, but the approach will work with almost any continuous variable. In our model, bmi might seem more interesting, but there are 9,941 unique values of bmi in our dataset. A graph cannot resolve so many different values. 
For that reason, we usually recommend against plotting at every value of a covariate. Instead, graph at reasonable values over the range of the covariate by using the at() option, as we did earlier. This trick is best reserved for variables with a few, or at most a few dozen, unique values. 1242 marginsplot — Graph results from margins (profile plots, etc.) Contrasts of at() groups—discrete effects We have previously contrasted across the values of factor variables in our model. Put another way, we have estimated the discrete marginal effects of factor variables. We can do the same for the levels of variables in at() specifications and across separate at() specifications. Returning to one of our logistic models and its margins, we earlier estimated the predictive margins of BMI at 5-unit intervals for both sexes. These are the commands we typed: . logistic highbp sex##agegrp##c.bmi . margins sex, at(bmi=(10(5)65)) . marginsplot, xlabel(10(10)60) We can estimate the discrete effects by sex of bmi moving from 10 to 15, then from 15 to 20, . . . , and then from 60 to 65 by contrasting the levels of the at() groups using the reverse-adjacent contrast operator (ar.). We specify the operator within the atcontrast() suboption of the contrast() option. We need to specify one other option. By default, margins, contrast will apply a contrast to all variables in its marginlist when a contrast has been requested. In this case, we do not want to contrast across sexes but rather to contrast across the levels of BMI within each sex. To prevent margins from contrasting across the sexes, we specify the marginswithin option. Our margins command is . margins sex, at(bmi=(10(5)65)) contrast(atcontrast(ar._at) marginswithin) And we graph the results using marginsplot: . marginsplot Variables that uniquely identify margins: bmi sex 0 Contrasts of Pr(Highbp) .05 .1 .15 .2 Contrasts of Predictive Margins of sex with 95% CIs 10 20 30 40 Body Mass Index (BMI) sex: Male 50 60 sex: Female The graph shows the contrasts (or if you prefer, discrete changes) in the probability of high blood pressure by sex as one increases BMI in 5-unit increments. We can even estimate contrasts (discrete effects) across at() options. To start, let’s compare the age-group profiles of the probability of high blood pressure for those in the 25th and 75th percentile of BMI. marginsplot — Graph results from margins (profile plots, etc.) 1243 . margins agegrp, at((p25) bmi) at((p75) bmi) (output omitted ) . marginsplot Variables that uniquely identify margins: agegrp _atopt Multiple at() options specified: _atoption=1: (p25) bmi _atoption=2: (p75) bmi 0 .2 Pr(Highbp) .4 .6 .8 Predictive Margins of agegrp with 95% CIs 20−29 30−39 40−49 50−59 Age Group (p25) bmi 60−69 70+ (p75) bmi For each age group, people whose BMI is at the 75th percentile have a much higher probability of high blood pressure than those at the 25th percentile. What is that difference in probability and its CI? To contrast across the percentiles of BMI within age groups, we again specify a contrast operator on the at() groups using atcontrast(), and we also tell margins to perform that contrast within the levels of the marginlist by using the marginswithin option. 1244 marginsplot — Graph results from margins (profile plots, etc.) . margins agegrp, at((p25) bmi) at((p75) bmi) > contrast(atcontrast(r._at) marginswithin) (output omitted ) . 
marginsplot Variables that uniquely identify margins: agegrp _atopt Multiple at() options specified: _atoption=1: (p25) bmi _atoption=2: (p75) bmi .05 Contrasts of Pr(Highbp) .1 .15 .2 .25 Contrasts of Predictive Margins of agegrp with 95% CIs 20−29 30−39 40−49 50−59 Age Group 60−69 70+ The differences in probability between 25th and 75th BMI percentiles are clearly significantly greater than 0. The differences appear to be smallest for those in the youngest and oldest age groups. Controlling the graph’s dimensions Thus far, marginsplot has miraculously done almost exactly what we want in most cases. The things we want on the x axis have been there, the choice of plots has made sense, etc. Some of that luck sprang from the relatively simple analyses we were performing, and some was from careful specification of our margins command. Sometimes, we will not be so lucky. marginsplot — Graph results from margins (profile plots, etc.) 1245 Consider the following regress, margins, and marginsplot commands: . regress bpsystol agegrp##sex##c.bmi (output omitted ) . margins agegrp, over(sex) at(bmi=(10(10)60)) (output omitted ) . marginsplot Variables that uniquely identify margins: agegrp _atopt Multiple at() options specified: _atoption=1: (p25) bmi _atoption=2: (p75) bmi 100 Linear Prediction 150 200 Predictive Margins of agegrp with 95% CIs 10 20 30 40 Body Mass Index (BMI) 20−29, Male 30−39, Male 40−49, Male 50−59, Male 60−69, Male 70+, Male 50 60 20−29, Female 30−39, Female 40−49, Female 50−59, Female 60−69, Female 70+, Female By default, marginsplot places the levels of the first multilevel at() specification on the x axis, and then usually plots the levels of all remaining variables as connected lines. That is what we see in the graph above—bmi, the at() variable, is on the x axis, and each combination of agegrp and sex is plotted as a separate connected line. If there is no multilevel at() specification, then the first variable in marginlist becomes the x axis. There are many more rules, but it is usually best to simply type marginsplot and see what happens. If you do not like marginsplot’s choices, change them. What if we wanted agegrp on the x axis instead of BMI? We tell marginsplot to make that change by specifying agegrp in the xdimension() option: 1246 marginsplot — Graph results from margins (profile plots, etc.) . marginsplot, xdimension(agegrp) Variables that uniquely identify margins: bmi agegrp sex 100 Linear Prediction 150 200 Predictive Margins of agegrp with 95% CIs 20−29 30−39 40−49 50−59 Age Group bmi=10, Male bmi=20, Male bmi=30, Male bmi=40, Male bmi=50, Male bmi=60, Male 60−69 70+ bmi=10, Female bmi=20, Female bmi=30, Female bmi=40, Female bmi=50, Female bmi=60, Female We have been suppressing the Results window output for marginsplot, but that output is helpful if we want to change how things are plotted. You may specify any variable used in your margins command in any of the dimension options—xdimension(), plotdimension(), bydimension(), and graphdimension(). (In fact, there are some pseudovariables that you may also specify in some cases; see Addendum: Advanced uses of dimlist for details.) marginsplot tries to help you narrow your choices by listing a set of variables that uniquely identify all your margins. You are not restricted to this list. We have a different x axis and a different set of plots, but our graph is still busy and difficult to read. We can make it better by creating separate graph panels for each sex. 
We do that by adding a bydimension() option with sex as the argument. . marginsplot, xdimension(agegrp) bydimension(sex) Variables that uniquely identify margins: bmi agegrp sex Predictive Margins of agegrp with 95% CIs Female 150 100 Linear Prediction 200 Male 20−29 30−39 40−49 50−59 60−69 70+ 20−29 30−39 40−49 50−59 60−69 70+ Age Group bmi=10 bmi=30 bmi=50 bmi=20 bmi=40 bmi=60 The patterns and the differences between males and females are now easier to see. marginsplot — Graph results from margins (profile plots, etc.) 1247 If our interest is in comparing males and females, we might even choose to create a separate panel for each level of BMI: . marginsplot, xdimension(agegrp) bydimension(bmi) xlabel(, angle(45)) Variables that uniquely identify margins: bmi agegrp sex Predictive Margins of agegrp with 95% CIs bmi=20 bmi=30 bmi=40 bmi=50 bmi=60 100 200 150 9 70 + 9 −6 60 9 −5 9 −4 50 40 9 −3 −2 30 20 9 70 + 9 −6 60 9 −5 50 9 −4 40 9 −3 −2 30 9 70 + 20 9 −6 60 9 −5 50 9 −4 −3 40 30 20 −2 9 100 Linear Prediction 150 200 bmi=10 Age Group Male Female The x-axis labels did not fit, so we angled them. We leave you to explore the use of the graphdimension() option. It is much like bydimension() but creates separate graphs rather than separate panels. Operationally, the plotdimension() option is rarely used. All variables not in the x dimension and not specified elsewhere become the plotted connected lines. You will likely use the dimension options frequently. This is one of the rare cases where we recommend using the minimal abbreviations of the options—x() for xdimension(), plot() for plotdimension(), by() for bydimension(), and graph() for graphdimension(). The abbreviations are easy to read and just as meaningful as the full option names. The full names exist to reinforce the relationship between the dimension options. Pairwise comparisons marginsplot can graph the results of margins, pwcompare; see [R] margins, pwcompare. We return to one of our ANOVA examples. Here we request pairwise comparisons with the pwcompare option of margins, and we request Bonferroni-adjusted CIs with the mcompare() option: 1248 marginsplot — Graph results from margins (profile plots, etc.) . anova bpsystol agegrp##sex (output omitted ) . margins agegrp, pwcompare mcompare(bonferroni) (output omitted ) . marginsplot Variables that uniquely identify margins: _pw1 _pw0 _pw enumerates all pairwise comparisons; _pw0 enumerates the reference categories; _pw1 enumerates the comparison categories. Comparisons of Linear Prediction −40 −20 0 20 40 Pairwise Comparisons of Predictive Margins of agegrp with 95% CIs 1 2 3 4 Comparison category 5 6 Reference category 1 3 5 2 4 6 Each connected line plot in the graph represents a reference age-group category for the pairwise comparison. The ticks on the x axis represent comparison age-group categories. So, each plot is a profile for a reference category showing its comparison to each other category. Horizontal is sometimes better Another interesting way to graph pairwise comparisons is to simply plot each comparison and label the two categories being compared. This type of graph works better if it is oriented horizontally rather than vertically. Continuing with the example above, we will switch the graph to horizontal. We will also make several changes to display the graph better. We specify that only unique comparisons be plotted. The graph above plotted both 1 versus 2 and 2 versus 1, which are the same comparison with opposite signs. 
We add a reference line at 0 because we are interested in comparisons that differ from 0. This graph looks better without the connecting lines, so we add the option recast(scatter). We also reverse the y scale so that the smallest levels of age group appear at the top of the axis. marginsplot — Graph results from margins (profile plots, etc.) 1249 . marginsplot, horizontal unique xline(0) recast(scatter) yscale(reverse) Variables that uniquely identify margins: _pw1 _pw0 _pw enumerates all pairwise comparisons; _pw0 enumerates the reference categories; _pw1 enumerates the comparison categories. Pairwise Comparisons of Predictive Margins of agegrp with 95% CIs 2 vs 1 3 vs 1 4 vs 1 5 vs 1 Comparisons 6 vs 1 3 vs 2 4 vs 2 5 vs 2 6 vs 2 4 vs 3 5 vs 3 6 vs 3 5 vs 4 6 vs 4 6 vs 5 0 10 20 30 Comparisons of Linear Prediction 40 All the comparisons differ from 0, so all our age groups are statistically different from each other. The horizontal option can be useful outside of pairwise comparisons. Profile plots are usually oriented vertically. However, when your covariates have long labels or there are many levels at which the margins are being evaluated, the graph may be easier to read when rendered horizontally. Marginal effects We have seen how to graph discrete effects for factor variables and continuous variables by using contrasts, and optionally by using the dydx() option of margins: Contrasts of margins—effects (discrete marginal effects) and Continuous covariates. Let’s now consider graphing instantaneous marginal effects for continuous covariates. Begin by refitting our logistic model of high blood pressure as a function of sex, age, and BMI: . logistic highbp sex##agegrp##c.bmi 1250 marginsplot — Graph results from margins (profile plots, etc.) We estimate the average marginal effect of BMI on the probability of high blood pressure for each age group and then graph the results by typing . margins agegrp, dydx(bmi) (output omitted ) . marginsplot Variables that uniquely identify margins: agegrp .01 Effects on Pr(Highbp) .02 .03 .04 Average Marginal Effects of bmi with 95% CIs 20−29 30−39 40−49 50−59 Age Group 60−69 70+ These are the conditional expectations of the marginal effects treating everyone in the sample as though they were in each age group. We can estimate fully conditional marginal effects that do not depend on averaging over the sample by also margining on our one remaining covariate—sex. . margins agegrp#sex, dydx(bmi) (output omitted ) . marginsplot Variables that uniquely identify margins: agegrp sex 0 Effects on Pr(Highbp) .01 .02 .03 .04 .05 Average Marginal Effects of bmi with 95% CIs 20−29 30−39 40−49 50−59 Age Group Male 60−69 70+ Female The effect of BMI on the probability of high blood pressure looks to increase with age for females. The marginal effect is higher for males than females in the younger age groups but then decreases with age for males after the 40–49 age group. marginsplot — Graph results from margins (profile plots, etc.) 1251 You may want to test for differences in the marginal effect of BMI for males and females by contrasting across sexes within agegrp: . margins r.sex@agegrp, dydx(bmi) Plotting a subset of the results from margins marginsplot plots all the margins produced by the preceding margins command. If you want a graph that does not include all the margins, then enter a margins command that produces a reduced set of margins. 
Obvious ways to reduce the number of margins include not specifying some factors or interactions in the marginlist of margins, not specifying some at() or over() options, or reducing the values specified in an at() option. A less obvious technique uses selection lists in factor operators to select specific sets of levels from factor variables specified in the marginlist. Instead of typing . margins agegrp which will give you margins for all six age groups in our sample, type . margins i(2/4).agegrp which will give you only three margins—those for groups 2, 3, and 4. See [U] 11.4.3.4 Selecting levels. Advanced usage margins is incredibly flexible in the statistics it can estimate and in the grouping of those estimates. Many of the estimates that margins can produce do not make convincing graphs. marginsplot plots the results of any margins command, regardless of whether the resulting graph is easily interpreted. Here we demonstrate some options that can make complicated margins into graphs that are somewhat more useful than those produced by marginsplot’s defaults. Others may find truly useful applications for these approaches. Plots with multiple terms Margins plots are rarely interesting when you specify multiple terms on your margins command, for example, margins a b. Such plots often compare things that are not comparable. The defaults for marginsplot rarely produce useful plots with multiple terms. Perhaps the most interesting graph in such cases puts all the levels of all the terms together on the vertical axis and plots their margins on the horizontal axis. We do that by including the marginlist from margins in an xdimension() option on marginsplot. The long labels on such graphs look better with a horizontal orientation, and there is no need to connect the margin estimates, so we specify the recast(scatter) option. 1252 marginsplot — Graph results from margins (profile plots, etc.) Using one of our ANOVA examples from earlier, . anova bpsystol agegrp##sex (output omitted ) . margins agegrp sex (output omitted ) . marginsplot, xdimension(agegrp sex) horizontal recast(scatter) Variables that uniquely identify margins: agegrp sex Predictive Margins with 95% CIs asobserved, Female asobserved, Male agegrp, sex 70+, asobserved 60−69, asobserved 50−59, asobserved 40−49, asobserved 30−39, asobserved 20−29, asobserved 110 120 130 Linear Prediction 140 150 The “asobserved” notations in the y -axis labels are informing us that, for example, when the margin for females is evaluated, the values of age group are taken as they are observed in the dataset. The margin is computed as an average over those values. Plots with multiple at() options Some disciplines like to compute margins at the means of other covariates in their model and others like to compute the response for each observation and then take the means of the response. These correspond to the margins options at((mean) all) and at((asobserved) all). For responses that are linear functions of the coefficients, such as predict after regress, the two computations yield identical results. For responses that are nonlinear functions of the coefficients, the two computations estimate different things. Using one of our logistic models of high blood pressure, . logistic highbp sex##agegrp##c.bmi and computing both sets of margins for each age group, . margins agegrp, at((mean) _all) at((asobserved) _all) marginsplot — Graph results from margins (profile plots, etc.) 1253 we can use marginsplot to compare the approaches: . 
marginsplot Variables that uniquely identify margins: agegrp _atopt Multiple at() options specified: _atoption=1: (mean)_all _atoption=2: (asobserved) _all 0 .2 Pr(Highbp) .4 .6 .8 Predictive Margins of agegrp with 95% CIs 20−29 30−39 40−49 50−59 Age Group (mean) _all 60−69 70+ (asobserved) _all For the first three age groups, the probabilities of high blood pressure are lower at the means of sex and bpi than are the mean probabilities of high blood pressure averaged over the observed values of sex and bpi. The reverse is true for the last three age groups, although the values are very similar in these older age groups. 1254 marginsplot — Graph results from margins (profile plots, etc.) Such comparisons come up even more frequently when evaluating marginal effects. We can estimate the marginal effects of sex at each age group and graph the results by adding dydx(sex) to our margins command: . margins agegrp, at((mean) _all) at((asobserved) _all) dydx(sex) (output omitted ) . marginsplot Variables that uniquely identify margins: agegrp _atopt Multiple at() options specified: _atoption=1: (mean)_all _atoption=2: (asobserved) _all −.3 Effects on Pr(Highbp) −.2 −.1 0 .1 Average Marginal Effects of 2.sex with 95% CIs 20−29 30−39 40−49 50−59 Age Group (mean) _all 60−69 70+ (asobserved) _all The average marginal effect is smaller for most age groups, but the CIs for both sets of estimates are wide. Can we tell the difference between the estimates? To answer that, we use the now-familiar tactic of taking the contrast of our estimated marginal-effects profiles. That means adding contrast(atjoint marginswithin) to our margins command. We will also add mcompare(bonferroni) to account for the fact that we will be comparing six contrasts. . margins agegrp, at((mean) _all) at((asobserved) _all) dydx(sex) > contrast(atjoint marginswithin) mcompare(bonferroni) marginsplot — Graph results from margins (profile plots, etc.) 1255 We will also add the familiar reference line at 0 to our graph of the contrasts. . marginsplot, yline(0) Variables that uniquely identify margins: agegrp _atopt Multiple at() options specified: _atoption=1: (mean)_all _atoption=2: (asobserved) _all −.04 Contrasts of Pr(Highbp) −.02 0 .02 .04 Contrasts of Average Marginal Effects of 2.sex with 95% CIs 20−29 30−39 40−49 50−59 Age Group 60−69 70+ While the difference in the estimates of marginal effects is not large, we can distinguish the estimates for the 30–39 and 70+ age groups. The at() option of margins provides far more flexibility than demonstrated above. It can be used to evaluate a response or marginal effect at almost any point of interest or combinations of such points. See Syntax of at() in [R] margins. Adding scatterplots of the data We can add scatterplots of the observed data to our plots of the margins. The NHANES II dataset is too large for this to be interesting, so for this example, we will use auto.dta. We fit mileage on whether the care is foreign and on a quadratic in the weight of the car. We convert the weight into tons (U.S. definition) to improve the scaling, and we format the new tons variable to improve its labels on the graph. For our graph, we create separate variables for mileage of domestic and of foreign cars. We fit a fully interacted model so that the effect of weight on mileage can be different for foreign and for domestic cars. . use http://www.stata-press.com/data/r13/auto . generate tons = weight/2000 . format tons %6.2f . separate mpg, by(foreign) . 
regress mpg foreign##c.tons##c.tons We then estimate the margins over the range of tons, using the option over(foreign) to obtain separate estimates for foreign and domestic cars. . margins, at(tons=(.8(.05)2.4)) over(foreign) 1256 marginsplot — Graph results from margins (profile plots, etc.) Adding scatterplots of mileage for domestic and foreign cars is easy. We insert into an addplot() option of marginsplot the same scatterplot syntax for twoway that we would type to produce a scatterplot of the data: . marginsplot, addplot(scatter mpg0 tons || scatter mpg1 tons) recast(line) noci Variables that uniquely identify margins: tons foreign 0 10 Linear Prediction 20 30 40 Predictive Margins 0.50 1.00 1.50 tons Domestic mpg, foreign == Domestic 2.00 2.50 Foreign mpg, foreign == Foreign Many will be surprised that the mileage profile is higher in 1978 for domestic (U.S. built) cars. Is the difference significant? . margins, at(tons=(.8(.05)2.4)) over(r.for) (output omitted ) . marginsplot, yline(0) Variables that uniquely identify margins: tons −40 Contrasts of Linear Prediction −20 0 20 Contrasts of Predictive Margins with 95% CIs 0.50 1.00 1.50 tons 2.00 2.50 As we did earlier, we contrast the two profiles. We can discern some difference between the two profiles for midweight vehicles, but otherwise there is insufficient information to believe mileage differs across domestic and foreign cars. marginsplot — Graph results from margins (profile plots, etc.) 1257 Video examples Profile plots and interaction plots, part 1: A single categorical variable Profile plots and interaction plots, part 2: A single continuous variable Profile plots and interaction plots, part 3: Interactions between categorical variables Profile plots and interaction plots, part 4: Interactions of continuous and categorical variables Profile plots and interaction plots, part 5: Interactions of two continuous variables Addendum: Advanced uses of dimlist dimlist specifies the dimensions from the immediately preceding margins command that are to be used for the marginsplot’s x axis, plots, subgraphs, and graphs. dimlist may contain: dim Description varname Any variable referenced in the preceding margins command. at(varname) If a variable is specified in both the marginlist or the over() option and in the at() option of margins, then the two uses can be distinguished in marginsplot by typing the at() variables as at(varname) in dimlist. deriv If the preceding margins command included a dydx(), eyex(), dyex(), or eydx() option, dimlist may also contain deriv to specify all the variables over which derivatives were taken. term If the preceding margins command included multiple terms (for example, margins a b), then dimlist may contain term to enumerate those terms. atopt If the preceding margins command included multiple at() options, then dimlist may contain atopt to enumerate those at() options. When the pairwise option is specified on margins, you may specify dimensions that enumerate the pairwise comparisons. pw pw0 pw1 enumerates all the pairwise comparisons enumerates the reference categories of the comparisons enumerates the comparison categories of the comparisons Acknowledgments We thank Philip B. Ender of UCLA Academic Technology Services for his programs that demonstrated what could be done in this area. We also thank Michael N. 
Mitchell, author of the Stata Press books Data Management Using Stata: A Practical Handbook and A Visual Guide to Stata Graphics, for his generous advice and comprehensive insight into the application of margins and their plots. References McDowell, A., A. Engel, J. T. Massey, and K. Maurer. 1981. Plan and operation of the Second National Health and Nutrition Examination Survey, 1976–1980. Vital and Health Statistics 1(15): 1–144. Mitchell, M. N. 2012. Interpreting and Visualizing Regression Models Using Stata. College Station, TX: Stata Press. Royston, P. 2013. marginscontplot: Plotting the marginal effects of continuous predictors. Stata Journal 13: 510–527. Williams, R. 2012. Using the margins command to estimate and interpret adjusted predictions and marginal effects. Stata Journal 12: 308–331. 1258 marginsplot — Graph results from margins (profile plots, etc.) Also see [R] margins — Marginal means, predictive margins, and marginal effects [R] margins, contrast — Contrasts of margins [R] margins, pwcompare — Pairwise comparisons of margins [R] margins postestimation — Postestimation tools for margins Title matsize — Set the maximum number of variables in a model Syntax Description Option Remarks and examples Also see Syntax set matsize # , permanently where 10 ≤ # ≤ 11000 for Stata/MP and Stata/SE and where 10 ≤ # ≤ 800 for Stata/IC. Description set matsize sets the maximum number of variables that can be included in any of Stata’s estimation commands. For Stata/MP and Stata/SE, the default value is 400, but it may be changed upward or downward. The upper limit is 11,000. For Stata/IC, the initial value is 400, but it may be changed upward or downward. The upper limit is 800. This command may not be used with Small Stata; matsize is permanently frozen at 100. Changing matsize has no effect on Mata. Option permanently specifies that, in addition to making the change right now, the matsize setting be remembered and become the default setting when you invoke Stata. Remarks and examples set matsize controls the internal size of matrices that Stata uses. The default of 400 for Stata/IC, for instance, means that linear regression models are limited to 198 independent variables — 198 because the constant uses one position and the dependent variable another, making a total of 200. You may change matsize with data in memory, but increasing matsize increases the amount of memory consumed by Stata, increasing the probability of page faults and thus of making Stata run more slowly. Example 1 We wish to fit a model of y on the variables x1 through x400. Without thinking, we type . regress y x1-x400 matsize too small You have attempted to create a matrix with more than 400 rows or columns or to fit a model with more than 400 variables plus ancillary parameters. You need to increase matsize by using the set matsize command; see help matsize. r(908); 1259 1260 matsize — Set the maximum number of variables in a model We realize that we need to increase matsize, so we type . set matsize 450 . regress y x1-x400 (output omitted ) Programmers should note that the current setting of matsize is stored as the c-class value c(matsize); see [P] creturn. Also see [R] query — Display system parameters [D] memory — Memory management [U] 6 Managing memory Title maximize — Details of iterative maximization Syntax Option for set maxiter Methods and formulas Description Remarks and examples References Maximization options Stored results Also see Syntax Maximum likelihood optimization mle cmd . . . 
, options

Set default maximum iterations
    set maxiter # , permanently

options             Description
difficult           use a different stepping algorithm in nonconcave regions
technique(algorithm spec)   maximization technique
iterate(#)          perform maximum of # iterations; default is iterate(16000)
[no]log             display an iteration log of the log likelihood; typically, the default
trace               display current parameter vector in iteration log
gradient            display current gradient vector in iteration log
showstep            report steps within an iteration in iteration log
hessian             display current negative Hessian matrix in iteration log
showtolerance       report the calculated result that is compared to the effective convergence criterion
tolerance(#)        tolerance for the coefficient vector; see Options for the defaults
ltolerance(#)       tolerance for the log likelihood; see Options for the defaults
nrtolerance(#)      tolerance for the scaled gradient; see Options for the defaults
qtolerance(#)       when specified with algorithms bhhh, dfp, or bfgs, the q − H matrix is used as the final check for convergence rather than nrtolerance() and the H matrix; seldom used
nonrtolerance       ignore the nrtolerance() option
from(init specs)    initial values for the coefficients

where algorithm spec is
    algorithm [#] [algorithm [#] . . .]
and algorithm is nr | bhhh | dfp | bfgs

and init specs is one of
    matname [, skip copy]
    {[eqname:]name = # | /eqname = #} [. . .]
    # [# . . .], copy

Description

All Stata commands maximize likelihood functions using moptimize() and optimize(); see Methods and formulas below. Commands use the Newton–Raphson method with step halving and special fixups when they encounter nonconcave regions of the likelihood. For details, see [M-5] moptimize( ) and [M-5] optimize( ). For more information about programming maximum likelihood estimators in ado-files and Mata, see [R] ml and the fourth edition of Maximum Likelihood Estimation with Stata (Gould, Pitblado, and Poi 2010).

set maxiter specifies the default maximum number of iterations for estimation commands that iterate. The initial value is 16000, and # can be 0 to 16000. To change the maximum number of iterations performed by a particular estimation command, you need not reset maxiter; you can specify the iterate(#) option. When iterate(#) is not specified, the maxiter value is used.

Maximization options

difficult specifies that the likelihood function is likely to be difficult to maximize because of nonconcave regions. When the message "not concave" appears repeatedly, ml's standard stepping algorithm may not be working well. difficult specifies that a different stepping algorithm be used in nonconcave regions. There is no guarantee that difficult will work better than the default; sometimes it is better and sometimes it is worse. You should use the difficult option only when the default stepper declares convergence and the last iteration is "not concave" or when the default stepper is repeatedly issuing "not concave" messages and producing only tiny improvements in the log likelihood.

technique(algorithm spec) specifies how the likelihood function is to be maximized. The following algorithms are allowed. For details, see Gould, Pitblado, and Poi (2010).

technique(nr) specifies Stata's modified Newton–Raphson (NR) algorithm.
technique(bhhh) specifies the Berndt–Hall–Hall–Hausman (BHHH) algorithm.
technique(dfp) specifies the Davidon–Fletcher–Powell (DFP) algorithm.
technique(bfgs) specifies the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm.
The default is technique(nr). You can switch between algorithms by specifying more than one in the technique() option. By default, an algorithm is used for five iterations before switching to the next algorithm. To specify a different number of iterations, include the number after the technique in the option. For example, specifying technique(bhhh 10 nr 1000) requests that ml perform 10 iterations with the BHHH algorithm followed by 1000 iterations with the NR algorithm, and then switch back to BHHH for 10 iterations, and so on. The process continues until convergence or until the maximum number of iterations is reached. maximize — Details of iterative maximization 1263 iterate(#) specifies the maximum number of iterations. When the number of iterations equals iterate(), the optimizer stops and presents the current results. If convergence is declared before this threshold is reached, it will stop when convergence is declared. Specifying iterate(0) is useful for viewing results evaluated at the initial value of the coefficient vector. Specifying iterate(0) and from() together allows you to view results evaluated at a specified coefficient vector; however, not all commands allow the from() option. The default value of iterate(#) for both estimators programmed internally and estimators programmed with ml is the current value of set maxiter, which is iterate(16000) by default. log and nolog specify whether an iteration log showing the progress of the log likelihood is to be displayed. For most commands, the log is displayed by default, and nolog suppresses it. For a few commands (such as the svy maximum likelihood estimators), you must specify log to see the log. trace adds to the iteration log a display of the current parameter vector. gradient adds to the iteration log a display of the current gradient vector. showstep adds to the iteration log a report on the steps within an iteration. This option was added so that developers at StataCorp could view the stepping when they were improving the ml optimizer code. At this point, it mainly provides entertainment. hessian adds to the iteration log a display of the current negative Hessian matrix. showtolerance adds to the iteration log the calculated value that is compared with the effective convergence criterion at the end of each iteration. Until convergence is achieved, the smallest calculated value is reported. shownrtolerance is a synonym of showtolerance. Below we describe the three convergence tolerances. Convergence is declared when the nrtolerance() criterion is met and either the tolerance() or the ltolerance() criterion is also met. tolerance(#) specifies the tolerance for the coefficient vector. When the relative change in the coefficient vector from one iteration to the next is less than or equal to tolerance(), the tolerance() convergence criterion is satisfied. tolerance(1e-4) is the default for estimators programmed with ml. tolerance(1e-6) is the default. ltolerance(#) specifies the tolerance for the log likelihood. When the relative change in the log likelihood from one iteration to the next is less than or equal to ltolerance(), the ltolerance() convergence is satisfied. ltolerance(0) is the default for estimators programmed with ml. ltolerance(1e-7) is the default. nrtolerance(#) specifies the tolerance for the scaled gradient. Convergence is declared when gH−1 g0 < nrtolerance(). The default is nrtolerance(1e-5). 
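As a hypothetical illustration (this command and these tolerance values are ours, not part of the original entry), the convergence tolerances can be tightened on any estimation command that accepts the maximize options, for example:

. sysuse auto, clear
. logit foreign mpg weight, tolerance(1e-6) ltolerance(1e-8) nrtolerance(1e-7)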
qtolerance(#) when specified with algorithms bhhh, dfp, or bfgs uses the q − H matrix as the final check for convergence rather than nrtolerance() and the H matrix. Beginning with Stata 12, by default, Stata now computes the H matrix when the q−H matrix passes the convergence tolerance, and Stata requires that H be concave and pass the nrtolerance() criterion before concluding convergence has occurred. qtolerance() provides a way for the user to obtain Stata’s earlier behavior. 1264 maximize — Details of iterative maximization nonrtolerance specifies that the default nrtolerance() criterion be turned off. from() specifies initial values for the coefficients. Not all estimators in Stata support this option. You can specify the initial values in one of three ways: by specifying the name of a vector containing the initial values (for example, from(b0), where b0 is a properly labeled vector); by specifying coefficient names with the values (for example, from(age=2.1 /sigma=7.4)); or by specifying a list of values (for example, from(2.1 7.4, copy)). from() is intended for use when doing bootstraps (see [R] bootstrap) and in other special situations (for example, with iterate(0)). Even when the values specified in from() are close to the values that maximize the likelihood, only a few iterations may be saved. Poor values in from() may lead to convergence problems. skip specifies that any parameters found in the specified initialization vector that are not also found in the model be ignored. The default action is to issue an error message. copy specifies that the list of values or the initialization vector be copied into the initial-value vector by position rather than by name. Option for set maxiter permanently specifies that, in addition to making the change right now, the maxiter setting be remembered and become the default setting when you invoke Stata. Remarks and examples Only in rare circumstances would you ever need to specify any of these options, except nolog. The nolog option is useful for reducing the amount of output appearing in log files. The following is an example of an iteration log: Iteration 0: log likelihood Iteration 1: log likelihood Iteration 2: log likelihood Iteration 3: log likelihood Iteration 4: log likelihood Iteration 5: log likelihood Iteration 6: log likelihood Iteration 7: log likelihood Iteration 8: log likelihood (table of results omitted ) = = = = = = = = = -3791.0251 -3761.738 -3758.0632 -3758.0447 -3757.5861 -3757.474 -3757.4613 -3757.4606 -3757.4606 (not concave) At iteration 8, the model converged. The message “not concave” at the second iteration is notable. This example was produced using the heckman command; its likelihood is not globally concave, so it is not surprising that this message sometimes appears. The other message that is occasionally seen is “backed up”. Neither of these messages should be of any concern unless they appear at the final iteration. If a “not concave” message appears at the last step, there are two possibilities. One is that the result is valid, but there is collinearity in the model that the command did not otherwise catch. Stata checks for obvious collinearity among the independent variables before performing the maximization, but strange collinearities or near collinearities can sometimes arise between coefficients and ancillary parameters. The second, more likely cause for a “not concave” message at the final step is that the optimizer entered a flat region of the likelihood and prematurely declared convergence. 
If a "backed up" message appears at the last step, there are also two possibilities. One is that Stata found a perfect maximum and could not step to a better point; if this is the case, all is fine, but this is a highly unlikely occurrence. The second is that the optimizer worked itself into a bad concave spot where the computed gradient and Hessian gave a bad direction for stepping.

If either of these messages appears at the last step, perform the maximization again with the gradient option. If the gradient goes to zero, the optimizer has found a maximum that may not be unique but is a maximum. From the standpoint of maximum likelihood estimation, this is a valid result. If the gradient is not zero, it is not a valid result, and you should try tightening up the convergence criterion, or try ltol(0) tol(1e-7) to see if the optimizer can work its way out of the bad region.

If you get repeated "not concave" steps with little progress being made at each step, try specifying the difficult option. Sometimes difficult works wonderfully, reducing the number of iterations and producing convergence at a good (that is, concave) point. Other times, difficult works poorly, taking much longer to converge than the default stepper.

Stored results

Maximum likelihood estimators store the following in e():

Scalars
e(N)             number of observations                      always stored
e(k)             number of parameters                        always stored
e(k_eq)          number of equations in e(b)                 usually stored
e(k_eq_model)    number of equations in overall model test   usually stored
e(k_dv)          number of dependent variables               usually stored
e(df_m)          model degrees of freedom                    always stored
e(r2_p)          pseudo-R-squared                            sometimes stored
e(ll)            log likelihood                              always stored
e(ll_0)          log likelihood, constant-only model         stored when constant-only model is fit
e(N_clust)       number of clusters                          stored when vce(cluster clustvar) is specified; see [U] 20.21 Obtaining robust variance estimates
e(chi2)          χ2                                          usually stored
e(p)             significance of model test                  usually stored
e(rank)          rank of e(V)                                always stored
e(rank0)         rank of e(V) for constant-only model        stored when constant-only model is fit
e(ic)            number of iterations                        usually stored
e(rc)            return code                                 usually stored
e(converged)     1 if converged, 0 otherwise                 usually stored

Macros
e(cmd)              name of command                                             always stored
e(cmdline)          command as typed                                            always stored
e(depvar)           names of dependent variables                                always stored
e(wtype)            weight type                                                 stored when weights are specified or implied
e(wexp)             weight expression                                           stored when weights are specified or implied
e(title)            title in estimation output                                  usually stored by commands using ml
e(clustvar)         name of cluster variable                                    stored when vce(cluster clustvar) is specified; see [U] 20.21 Obtaining robust variance estimates
e(chi2type)         Wald or LR; type of model χ2 test                           usually stored
e(vce)              vcetype specified in vce()                                  stored when command allows (vce())
e(vcetype)          title used to label Std. Err.                               sometimes stored
e(opt)              type of optimization                                        always stored
e(which)            max or min; whether optimizer is to perform maximization or minimization   always stored
e(ml_method)        type of ml method                                           always stored by commands using ml
e(user)             name of likelihood-evaluator program                        always stored
e(technique)        from technique() option                                     sometimes stored
e(singularHmethod)  m-marquardt or hybrid; method used when Hessian is singular   sometimes stored (1)
e(crittype)         optimization criterion                                      always stored (1)
e(properties)       estimator properties                                        always stored
e(predict)          program used to implement predict                           usually stored

Matrices
e(b)                coefficient vector                                          always stored
e(Cns)              constraints matrix                                          sometimes stored
e(ilog)             iteration log (up to 20 iterations)                         usually stored
e(gradient)         gradient vector                                             usually stored
e(V)                variance–covariance matrix of the estimators                always stored
e(V_modelbased)     model-based variance                                        only stored when e(V) is neither the OIM nor OPG variance

Functions
e(sample)           marks estimation sample                                     always stored

(1) Type ereturn list, all to view these results; see [P] return.

See Stored results in the manual entry for any maximum likelihood estimator for a list of returned results.

Methods and formulas

Optimization is currently performed by moptimize() and optimize(), with the former implemented in terms of the latter; see [M-5] moptimize( ) and [M-5] optimize( ). Some estimators use moptimize() and optimize() directly, and others use the ml ado-file interface to moptimize().

Prior to Stata 11, Stata had three separate optimization engines: an internal one used by estimation commands implemented in C code; ml implemented in ado-code separately from moptimize() and used by most estimators; and moptimize() and optimize() used by a few recently written estimators. These days, the internal optimizer and the old version of ml are used only under version control. In addition, arch and arima (see [TS] arch and [TS] arima) are currently implemented using the old ml.

Let L1 be the log likelihood of the full model (that is, the log-likelihood value shown on the output), and let L0 be the log likelihood of the "constant-only" model. The likelihood-ratio χ2 model test is defined as 2(L1 − L0). The pseudo-R2 (McFadden 1974) is defined as 1 − L1/L0. This is simply the log likelihood on a scale where 0 corresponds to the "constant-only" model and 1 corresponds to perfect prediction for a discrete model (in which case the overall log likelihood is 0).

Some maximum likelihood routines can report coefficients in an exponentiated form, for example, odds ratios in logistic. Let b be the unexponentiated coefficient, s its standard error, and b0 and b1 the reported confidence interval for b. In exponentiated form, the point estimate is e^b, the standard error e^b s, and the confidence interval e^b0 and e^b1. The displayed Z (or t) statistics and p-values are the same as those for the unexponentiated results. This is justified because e^b = 1 and b = 0 are equivalent hypotheses, and normality is more likely to hold in the b metric.

References

Gould, W. W., J. S. Pitblado, and B. P. Poi. 2010. Maximum Likelihood Estimation with Stata. 4th ed.
College Station, TX: Stata Press.
McFadden, D. L. 1974. Conditional logit analysis of qualitative choice behavior. In Frontiers in Econometrics, ed. P. Zarembka, 105–142. New York: Academic Press.

Also see
[R] ml — Maximum likelihood estimation
[SVY] ml for svy — Maximum pseudolikelihood estimation for survey data
[M-5] moptimize( ) — Model optimization
[M-5] optimize( ) — Function optimization

Title
mean — Estimate means

Syntax   Menu   Description   Options   Remarks and examples   Stored results   Methods and formulas   References   Also see

Syntax
mean varlist [if] [in] [weight] [, options]

options                      Description
Model
  stdize(varname)            variable identifying strata for standardization
  stdweight(varname)         weight variable for standardization
  nostdrescale               do not rescale the standard weight variable
if/in/over
  over(varlist[, nolabel])   group over subpopulations defined by varlist; optionally, suppress group labels
SE/Cluster
  vce(vcetype)               vcetype may be analytic, cluster clustvar, bootstrap, or jackknife
Reporting
  level(#)                   set confidence level; default is level(95)
  noheader                   suppress table header
  nolegend                   suppress table legend
  display options            control column formats and line width
  coeflegend                 display legend instead of statistics

bootstrap, jackknife, mi estimate, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. vce() and weights are not allowed with the svy prefix; see [SVY] svy. fweights, aweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.

Menu
Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Means

Description
mean produces estimates of means, along with standard errors.

Options
Model
stdize(varname) specifies that the point estimates be adjusted by direct standardization across the strata identified by varname. This option requires the stdweight() option.
stdweight(varname) specifies the weight variable associated with the standard strata identified in the stdize() option. The standardization weights must be constant within the standard strata.
nostdrescale prevents the standardization weights from being rescaled within the over() groups. This option requires stdize() but is ignored if the over() option is not specified.
if/in/over
over(varlist[, nolabel]) specifies that estimates be computed for multiple subpopulations, which are identified by the different values of the variables in varlist. When this option is supplied with one variable name, such as over(varname), the value labels of varname are used to identify the subpopulations. If varname does not have labeled values (or there are unlabeled values), the values themselves are used, provided that they are nonnegative integers. Noninteger values, negative values, and labels that are not valid Stata names are substituted with a default identifier. When over() is supplied with multiple variable names, each subpopulation is assigned a unique default identifier.
nolabel requests that value labels attached to the variables identifying the subpopulations be ignored.
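As a brief aside, the following is a minimal sketch of the over() syntax just described, using the auto dataset shipped with Stata; the variable choices are illustrative only.

. sysuse auto, clear
. mean mpg, over(foreign)
. mean mpg, over(foreign, nolabel)

The first call labels the subpopulations with the value labels of foreign; the second uses the underlying values instead.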
SE/Cluster vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (analytic), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. vce(analytic), the default, uses the analytically derived variance estimator associated with the sample mean. Reporting level(#); see [R] estimation options. noheader prevents the table header from being displayed. This option implies nolegend. nolegend prevents the table legend identifying the subpopulations from being displayed. display options: cformat(% fmt) and nolstretch; see [R] estimation options. The following option is available with mean but is not shown in the dialog box: coeflegend; see [R] estimation options. 1270 mean — Estimate means Remarks and examples Example 1 Using the fuel data from example 3 of [R] ttest, we estimate the average mileage of the cars without the fuel treatment (mpg1) and those with the fuel treatment (mpg2). . use http://www.stata-press.com/data/r13/fuel . mean mpg1 mpg2 Mean estimation Number of obs mpg1 mpg2 = 12 Mean Std. Err. [95% Conf. Interval] 21 22.75 .7881701 .9384465 19.26525 20.68449 22.73475 24.81551 Using these results, we can test the equality of the mileage between the two groups of cars. . test mpg1 = mpg2 ( 1) mpg1 - mpg2 = 0 F( 1, 11) = Prob > F = 5.04 0.0463 Example 2 In example 1, the joint observations of mpg1 and mpg2 were used to estimate a covariance between their means. . matrix list e(V) symmetric e(V)[2,2] mpg1 mpg2 mpg1 .62121212 mpg2 .4469697 .88068182 If the data were organized this way out of convenience but the two variables represent independent samples of cars (coincidentally of the same sample size), we should reshape the data and use the over() option to ensure that the covariance between the means is zero. . use http://www.stata-press.com/data/r13/fuel . stack mpg1 mpg2, into(mpg) clear . mean mpg, over(_stack) Mean estimation 1: _stack = 1 2: _stack = 2 Number of obs = 24 Over Mean Std. Err. [95% Conf. Interval] 1 2 21 22.75 .7881701 .9384465 19.36955 20.80868 mpg 22.63045 24.69132 mean — Estimate means 1271 . matrix list e(V) symmetric e(V)[2,2] mpg: mpg: 1 2 mpg:1 .62121212 mpg:2 0 .88068182 Now we can test the equality of the mileage between the two independent groups of cars. . test [mpg]1 = [mpg]2 ( 1) [mpg]1 - [mpg]2 = 0 F( 1, 23) = 2.04 Prob > F = 0.1667 Example 3: standardized means Suppose that we collected the blood pressure data from example 2 of [R] dstdize, and we wish to obtain standardized high blood pressure rates for each city in 1990 and 1992, using, as the standard, the age, sex, and race distribution of the four cities and two years combined. Our rate is really the mean of a variable that indicates whether a sampled individual has high blood pressure. First, we generate the strata and weight variables from our standard distribution, and then use mean to compute the rates. . use http://www.stata-press.com/data/r13/hbp, clear . egen strata = group(age race sex) if inlist(year, 1990, 1992) (675 missing values generated) . by strata, sort: gen stdw = _N . mean hbp, over(city year) stdize(strata) stdweight(stdw) Mean estimation N. 
of std strata = 24 Number of obs = 455 Over: city year _subpop_1: 1 1990 _subpop_2: 1 1992 _subpop_3: 2 1990 _subpop_4: 2 1992 _subpop_5: 3 1990 _subpop_6: 3 1992 _subpop_7: 5 1990 _subpop_8: 5 1992 Over Mean hbp _subpop_1 _subpop_2 _subpop_3 _subpop_4 _subpop_5 _subpop_6 _subpop_7 _subpop_8 .058642 .0117647 .0488722 .014574 .1011211 .0810577 .0277778 .0548926 Std. Err. .0296273 .0113187 .0238958 .007342 .0268566 .0227021 .0155121 0 [95% Conf. Interval] .0004182 -.0104789 .0019121 .0001455 .0483425 .0364435 -.0027066 . .1168657 .0340083 .0958322 .0290025 .1538998 .1256719 .0582622 . The standard error of the high blood pressure rate estimate is missing for city 5 in 1992 because there was only one individual with high blood pressure; that individual was the only person observed in the stratum of white males 30–35 years old. 1272 mean — Estimate means By default, mean rescales the standard weights within the over() groups. In the following, we use the nostdrescale option to prevent this, thus reproducing the results in [R] dstdize. . mean hbp, over(city year) nolegend stdize(strata) stdweight(stdw) > nostdrescale Mean estimation N. of std strata = 24 Number of obs = 455 Over Mean hbp _subpop_1 _subpop_2 _subpop_3 _subpop_4 _subpop_5 _subpop_6 _subpop_7 _subpop_8 .0073302 .0015432 .0078814 .0025077 .0155271 .0081308 .0039223 .0088735 Std. Err. .0037034 .0014847 .0038536 .0012633 .0041238 .0022772 .0021904 0 [95% Conf. Interval] .0000523 -.0013745 .0003084 .000025 .007423 .0036556 -.0003822 . Video example Descriptive statistics in Stata Stored results mean stores the following in e(): Scalars e(N) e(N over) e(N stdize) e(N clust) e(k eq) e(df r) e(rank) Macros e(cmd) e(cmdline) e(varlist) e(stdize) e(stdweight) e(wtype) e(wexp) e(title) e(cluster) e(over) e(over labels) e(over namelist) e(vce) e(vcetype) e(properties) e(estat cmd) e(marginsnotok) number of observations number of subpopulations number of standard strata number of clusters number of equations in e(b) sample degrees of freedom rank of e(V) mean command as typed varlist varname from stdize() varname from stdweight() weight type weight expression title in estimation output name of cluster variable varlist from over() labels from over() variables names from e(over labels) vcetype specified in vce() title used to label Std. Err. b V program used to implement estat predictions disallowed by margins .0146082 .004461 .0154544 .0049904 .0236312 .012606 .0082268 . mean — Estimate means Matrices e(b) e(V) e( N) e( N stdsum) e( p stdize) e(error) Functions e(sample) 1273 vector of mean estimates (co)variance estimates vector of numbers of nonmissing observations number of nonmissing observations within the standard strata standardizing proportions error code corresponding to e(b) marks estimation sample Methods and formulas Methods and formulas are presented under the following headings: The mean estimator Survey data The survey mean estimator The standardized mean estimator The poststratified mean estimator The standardized poststratified mean estimator Subpopulation estimation The mean estimator Let y be the variable on which we want to calculate the mean and yj an individual observation on y , where j = 1, . . . , n and n is the sample size. Let wj be the weight, and if no weight is specified, define wj = 1 for all j . For aweights, the wj are normalized to sum to n. See The survey mean estimator for pweighted data. 
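The displayed equations in this and the following subsections did not survive extraction cleanly. As a reading aid, the weighted-mean estimator and its variance defined next can be written as follows; this is a reconstruction from the surrounding definitions, not new material.

W = \sum_{j=1}^{n} w_j, \qquad \bar{y} = \frac{1}{W} \sum_{j=1}^{n} w_j y_j

\widehat{V}(\bar{y}) = \frac{1}{W(W-1)} \sum_{j=1}^{n} w_j (y_j - \bar{y})^2, \qquad \widehat{\mathrm{Cov}}(\bar{x},\bar{y}) = \frac{1}{W(W-1)} \sum_{j=1}^{n} w_j (x_j - \bar{x})(y_j - \bar{y})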
Let W be the sum of the weights W = n X wj j=1 The mean is defined as y= n 1 X wj yj W j=1 The default variance estimator for the mean is n Vb (y) = X 1 wj (yj − y)2 W (W − 1) j=1 The standard error of the mean is the square root of the variance. If x, xj , and x are similarly defined for another variable (observed jointly with y ), the covariance estimator between x and y is n d Cov(x, y) = X 1 wj (xj − x)(yj − y) W (W − 1) j=1 1274 mean — Estimate means Survey data See [SVY] variance estimation, [SVY] direct standardization, and [SVY] poststratification for discussions that provide background information for the following formulas. The following formulas are derived from the fact that the mean is a special case of the ratio estimator where the denominator variable is one, xj = 1; see [R] ratio. The survey mean estimator Let Yj be a survey item for the j th individual in the population, where j = 1, . . . , M and M is the size of the population. The associated population mean for the item of interest is Y = Y /M where M X Yj Y = j=1 Let yj be the survey item for the j th sampled individual from the population, where j = 1, . . . , m and m is the number of observations in the sample. c, where The estimator for the mean is y = Yb /M Yb = m X wj yj c= M and m X j=1 wj j=1 and wj is a sampling weight. The score variable for the mean estimator is zj (y) = cyj − Yb yj − y M = c c2 M M The standardized mean estimator Let Dg denote the set of sampled observations that belong to the g th standard stratum and define IDg (j) to indicate if the j th observation is a member of the g th standard stratum; where g = 1, . . . , LD and LD is the number of standard strata. Also, let πg denote the fraction of the population that belongs to the g th standard stratum, thus π1 + · · · + πLD = 1. πg is derived from the stdweight() option. The estimator for the standardized mean is yD = LD X πg g=1 where Ybg = m X IDg (j) wj yj Ybg cg M and j=1 cg = M m X j=1 The score variable for the standardized mean is zj (y D ) = LD X g=1 πg IDg (j) cg yj − Ybg M cg2 M IDg (j) wj mean — Estimate means 1275 The poststratified mean estimator Let Pk denote the set of sampled observations that belong to poststratum k and define IPk (j) to indicate if the j th observation is a member of poststratum k ; where k = 1, . . . , LP and LP is the number of poststrata. Also let Mk denote the population size for poststratum k . Pk and Mk are identified by specifying the poststrata() and postweight() options on svyset; see [SVY] svyset. The estimator for the poststratified mean is yP = where Yb P = LP X Mk ck M k=1 and cP = M Yb P Yb P = cP M M LP m X Mk X Ybk = k=1 LP X Mk k=1 ck M LP X ck = M ck M IPk (j) wj yj j=1 Mk = M k=1 The score variable for the poststratified mean is LP Mk zj (Yb P ) 1 X IPk (j) zj (y ) = = ck M M M P k=1 Ybk yj − ck M ! 
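For the same reason, the survey-data estimators defined in the preceding subsections can be summarized as follows, with I_{D_g}(j) and I_{P_k}(j) the standard-stratum and poststratum membership indicators defined above; again, this is a reconstruction of the damaged displays, not new material.

\bar{y} = \frac{\hat{Y}}{\hat{M}}, \qquad \hat{Y} = \sum_{j=1}^{m} w_j y_j, \qquad \hat{M} = \sum_{j=1}^{m} w_j, \qquad z_j(\bar{y}) = \frac{y_j - \bar{y}}{\hat{M}}

\bar{y}^{D} = \sum_{g=1}^{L_D} \pi_g \frac{\hat{Y}_g}{\hat{M}_g}, \qquad \hat{Y}_g = \sum_{j=1}^{m} I_{D_g}(j)\, w_j y_j, \qquad \hat{M}_g = \sum_{j=1}^{m} I_{D_g}(j)\, w_j

\bar{y}^{P} = \frac{\hat{Y}^{P}}{\hat{M}^{P}}, \qquad \hat{Y}^{P} = \sum_{k=1}^{L_P} \frac{M_k}{\hat{M}_k}\, \hat{Y}_k, \qquad \hat{M}^{P} = \sum_{k=1}^{L_P} \frac{M_k}{\hat{M}_k}\, \hat{M}_k = M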
The standardized poststratified mean estimator The estimator for the standardized poststratified mean is y DP = LD X YbgP cgP M πg g=1 where YbgP = Lp X Mk k=1 and cgP = M ck M Ybg,k = k=1 Lp X Mk k=1 Lp m X Mk X ck M cg,k = M ck M IDg (j)IPk (j) wj yj j=1 Lp m X Mk X k=1 ck M IDg (j)IPk (j) wj j=1 The score variable for the standardized poststratified mean is zj (y DP ) = LD X g=1 where zj (YbgP ) = LP X k=1 and cP ) zj ( M g = πg cgP zj (YbgP ) − YbgP zj (M cgP ) M cP )2 (M Mk IPk (j) ck M LP X k=1 g ( Mk IPk (j) ck M Ybg,k IDg (j)yj − ck M ( cg,k M IDg (j) − ck M ) ) 1276 mean — Estimate means Subpopulation estimation Let S denote the set of sampled observations that belong to the subpopulation of interest, and define IS (j) to indicate if the j th observation falls within the subpopulation. cS , where The estimator for the subpopulation mean is y S = Yb S /M Yb S = m X IS (j) wj yj cS = M and j=1 m X IS (j) wj j=1 Its score variable is zj (y S ) = IS (j) cS yj − Yb S yj − y S M = IS (j) cS )2 cS (M M The estimator for the standardized subpopulation mean is y DS = LD X πg g=1 where YbgS = m X IDg (j)IS (j) wj yj YbgS cgS M cgS = M and j=1 m X IDg (j)IS (j) wj j=1 Its score variable is zj (y DS ) = LD X πg IDg (j)IS (j) g=1 cgS yj − YbgS M cgS )2 (M The estimator for the poststratified subpopulation mean is yP S = where Yb P S = LP X Mk k=1 and cP S = M ck M YbkS = k=1 LP m X Mk X k=1 LP X Mk ck M Yb P S cP S M ckS = M ck M IPk (j)IS (j) wj yj j=1 LP m X Mk X k=1 ck M IPk (j)IS (j) wj j=1 Its score variable is zj (y P S ) = where bPS zj ( Y )= cP S zj (Yb P S ) − Yb P S zj (M cP S ) M cP S )2 (M LP X k=1 Mk IPk (j) ck M ( Yb S IS (j) yj − k ck M ) mean — Estimate means and cP S ) = zj (M LP X k=1 Mk IPk (j) ck M ( cS M IS (j) − k ck M 1277 ) The estimator for the standardized poststratified subpopulation mean is y DP S = LD X πg g=1 where YbgP S = Lp X Mk k=1 and cgP S = M ck M S Ybg,k = k=1 Lp m X Mk X k=1 Lp X Mk ck M YbgP S cgP S M S cg,k M = ck M IDg (j)IPk (j)IS (j) wj yj j=1 Lp m X Mk X k=1 ck M IDg (j)IPk (j)IS (j) wj j=1 Its score variable is zj (y DP S ) = LD X g=1 where zj (YbgP S ) = LP X k=1 and cgP S ) = zj (M πg cP S zj (Yb P S ) − Yb P S zj (M cP S ) M g g g g cgP S )2 (M Mk IPk (j) ck M LP X k=1 ( Mk IPk (j) ck M S Ybg,k IDg (j)IS (j) yj − ck M ( cS M g,k IDg (j)IS (j) − ck M ) ) References Bakker, A. 2003. The early history of average values and implications for education. Journal of Statistics Education 11(1). http://www.amstat.org/publications/jse/v11n1/bakker.html. Cochran, W. G. 1977. Sampling Techniques. 3rd ed. New York: Wiley. Stuart, A., and J. K. Ord. 1994. Kendall’s Advanced Theory of Statistics: Distribution Theory, Vol I. 6th ed. London: Arnold. 
1278 mean — Estimate means Also see [R] mean postestimation — Postestimation tools for mean [R] ameans — Arithmetic, geometric, and harmonic means [R] proportion — Estimate proportions [R] ratio — Estimate ratios [R] summarize — Summary statistics [R] total — Estimate totals [MI] estimation — Estimation commands for use with mi estimate [SVY] direct standardization — Direct standardization of means, proportions, and ratios [SVY] poststratification — Poststratification for survey data [SVY] subpopulation estimation — Subpopulation estimation for survey data [SVY] svy estimation — Estimation commands for survey data [SVY] variance estimation — Variance estimation for survey data [U] 20 Estimation and postestimation commands Title mean postestimation — Postestimation tools for mean Description Remarks and examples Also see Description The following postestimation commands are available after mean: Command Description estat vce estat (svy) estimates lincom variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses nlcom test testnl Remarks and examples Example 1 We have a dataset with monthly rates of returns on the Dow and NASDAQ stock indices. We can use mean to compute the average quarterly rates of return for the two indices separately; . use http://www.stata-press.com/data/r13/rates . mean dow nasdaq Mean estimation Number of obs Mean dow nasdaq .2489137 10.78477 Std. Err. 6.524386 4.160821 = 357 [95% Conf. Interval] -12.58227 2.601887 13.0801 18.96765 If you chose just one of the indices for your portfolio, you either did rather well or rather poorly, depending on which one you picked. However, as we now show with the postestimation command lincom, if you diversified your portfolio, you would have earned a respectable 5.5% rate of return without having to guess which index would be the better performer. 1279 1280 mean postestimation — Postestimation tools for mean . lincom .5*dow + .5*nasdaq ( 1) .5*dow + .5*nasdaq = 0 Mean Coef. (1) 5.51684 Std. Err. t P>|t| 4.262673 1.29 0.196 Also see [R] mean — Estimate means [U] 20 Estimation and postestimation commands [95% Conf. Interval] -2.866347 13.90003 Title meta — Meta-analysis Remarks and examples References Remarks and examples Stata does not have a meta-analysis command. Stata users, however, have developed an excellent suite of commands for performing meta-analysis, including commands for performing standard and cumulative meta-analysis, commands for producing forest plots and contour-enhanced funnel plots, and commands for nonparametric analysis of publication bias. Many articles describing these commands have been published in the Stata Technical Bulletin and the Stata Journal. These articles were updated and published in a cohesive collection: Meta-Analysis in Stata: An Updated Collection from the Stata Journal. In this collection, editor Jonathan Sterne discusses how these articles relate to each other and how they fit in the overall literature of meta-analysis. 
Sterne has organized the collection into four areas: classic meta-analysis; meta-regression; graphical and analytic tools for detecting bias; and recent advances such as meta-analysis for dose–response curves, diagnostic accuracy, multivariate analysis, and studies containing missing values. All meta-analysis commands discussed in this collection may be downloaded by visiting http://www.stata-press.com/books/mais.html. We highly recommend that Stata users interested in meta-analysis read this book. Since the publication of the meta-analysis collection, Kontopantelis and Reeves (2010) published an article in the Stata Journal describing a new command metaan that performs fixed- or random-effects meta-analysis. Please also see the following FAQ on the Stata website: What meta-analysis features are available in Stata? http://www.stata.com/support/faqs/stat/meta.html References Borenstein, M., L. V. Hedges, J. P. T. Higgins, and H. R. Rothstein. 2009. Introduction to Meta-Analysis. Chichester, UK: Wiley. Crowther, M. J., S. R. Hinchliffe, A. Donald, and A. J. Sutton. 2013. Simulation-based sample-size calculation for designing new clinical trials and diagnostic test accuracy studies to update an existing meta-analysis. Stata Journal 13: 451–473. Crowther, M. J., D. Langan, and A. J. Sutton. 2012. Graphical augmentations to the funnel plot to assess the impact of a new study on an existing meta-analysis. Stata Journal 12: 605–622. Egger, M., G. Davey Smith, and D. G. Altman, ed. 2001. Systematic Reviews in Health Care: Meta-analysis in Context. 2nd ed. London: BMJ Books. Kontopantelis, E., and D. Reeves. 2010. metaan: Random-effects meta-analysis. Stata Journal 10: 395–407. . 2013. A short guide and a forest plot command (ipdforest) for one-stage meta-analysis. Stata Journal 13: 574–587. Miladinovic, B., I. Hozo, A. Chaimani, and B. Djulbegovic. 2014. Indirect treatment comparison. Stata Journal 14: 76–86. 1281 1282 meta — Meta-analysis Miladinovic, B., I. Hozo, and B. Djulbegovic. 2013. Trial sequential boundaries for cumulative meta-analyses. Stata Journal 13: 77–91. Ringquist, E. J. 2013. Meta-Analysis for Public Management and Policy. San Francisco: Jossey-Bass. Sterne, J. A. C., ed. 2009. Meta-Analysis in Stata: An Updated Collection from the Stata Journal. College Station, TX: Stata Press. Sutton, A. J., K. R. Abrams, D. R. Jones, T. A. Sheldon, and F. Song. 2000. Methods for Meta-Analysis in Medical Research. New York: Wiley. White, I. R. 2011. Multivariate random-effects meta-regression: Updates to mvmeta. Stata Journal 11: 240–254. Title mfp — Multivariable fractional polynomial models Syntax Remarks and examples Also see Menu Stored results Description Acknowledgments Options References Syntax , options : mfp regression cmd , regression cmd options options Description yvar1 yvar2 xvarlist if in weight Model 2 sequential cycles(#) dfdefault(#) center(cent list) alpha(alpha list) df(df list) powers(numlist) use the Royston and Altman model-selection algorithm; default uses closed-test procedure maximum number of iteration cycles; default is cycles(5) default maximum degrees of freedom; default is dfdefault(4) specification of centering for the independent variables p-values for testing between FP models; default is alpha(0.05) degrees of freedom for each predictor list of FP powers to use; default is powers(-2 -1(.5)1 2 3) Adv. 
model xorder(+ | - | n) select(select list) xpowers(xp list) zero(varlist) catzero(varlist) all order of entry into model-selection algorithm; default is xorder(+) nominal p-values for selection on each predictor FP powers for each predictor treat nonpositive values of specified predictors as zero when FP transformed add indicator variable for specified predictors include out-of-sample observations in generated variables Reporting level(#) display options set confidence level; default is level(95) control column formats and line width regression cmd options Description Adv. model regression cmd options options appropriate to the regression command in use All weight types supported by regression cmd are allowed; see [U] 11.1.6 weight. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. fp generate may be used to create new variables containing fractional polynomial powers. See [R] fp. 1283 1284 mfp — Multivariable fractional polynomial models where regression cmd may be clogit, glm, intreg, logistic, logit, mlogit, nbreg, ologit, oprobit, poisson, probit, qreg, regress, rreg, stcox, stcrreg, streg, or xtgee. yvar1 is not allowed for streg, stcrreg, and stcox. For these commands, you must first stset your data. yvar1 and yvar2 must both be specified when regression cmd is intreg. xvarlist has elements of type varlist and/or (varlist), for example, x1 x2 (x3 x4 x5) Elements enclosed in parentheses are tested jointly for inclusion in the model and are not eligible for fractional polynomial transformation. Menu Statistics > Linear models and related > Fractional polynomials > Multivariable fractional polynomial models Description mfp selects the multivariable fractional polynomial (MFP) model that best predicts the outcome variable from the right-hand-side variables in xvarlist. For univariate fractional polynomials, fp can be used to fit a wider range of models than mfp. See [R] fp for more details. Options Model 2 sequential chooses the sequential fractional polynomial (FP) selection algorithm (see Methods of FP model selection). cycles(#) sets the maximum number of iteration cycles permitted. cycles(5) is the default. dfdefault(#) determines the default maximum degrees of freedom (df) for a predictor. The default is dfdefault(4) (second-degree FP). center(cent list) defines the centering of the covariates xvar1 , xvar2 , . . . of xvarlist. The default is center(mean), except for binary covariates, where it is center(#), with # being the lower of the two distinct values of the covariate. A typical item in cent list is varlist:{mean | # | no}. Items are separated by commas. The first item is special in that varlist is optional, and if it is omitted, the default is reset to the specified value (mean, #, or no). For example, center(no, age:mean) sets the default to no (that is, no centering) and the centering of age to mean. alpha(alpha list) sets the significance levels for testing between FP models of different degrees. The rules for alpha list are the same as those for df list in the df() option (see below). The default nominal p-value (significance level, selection level) is 0.05 for all variables. Example: alpha(0.01) specifies that all variables have an FP selection level of 1%. Example: alpha(0.05, weight:0.1) specifies that all variables except weight have an FP selection level of 5%; weight has a level of 10%. mfp — Multivariable fractional polynomial models 1285 df(df list) sets the df for each predictor. 
The df (not counting the regression constant, cons) is twice the degree of the FP, so, for example, an xvar fit as a second-degree FP (FP2) has 4 df. The first item in df list may be either # or varlist:#. Subsequent items must be varlist:#. Items are separated by commas, and varlist is specified in the usual way for variables. With the first type of item, the df for all predictors is taken to be #. With the second type of item, all members of varlist (which must be a subset of xvarlist) have # df. The default number of degrees of freedom for a predictor of type varlist specified in xvarlist but not in df list is assigned according to the number of distinct (unique) values of the predictor, as follows: # of distinct values 1 2–3 4–5 ≥6 Default df (invalid predictor) 1 min(2, dfdefault()) dfdefault() Example: df(4) All variables have 4 df. Example: df(2, weight displ:4) weight and displ have 4 df; all other variables have 2 df. Example: df(weight displ:4, mpg:2) weight and displ have 4 df, mpg has 2 df; all other variables have default df. powers(numlist) is the set of FP powers to be used. The default set is −2, −1, −0.5, 0, 0.5, 1, 2, 3 (0 means log). Adv. model xorder(+ | - | n) determines the order of entry of the covariates into the model-selection algorithm. The default is xorder(+), which enters them in decreasing order of significance in a multiple linear regression (most significant first). xorder(-) places them in reverse significance order, whereas xorder(n) respects the original order in xvarlist. select(select list) sets the nominal p-values (significance levels) for variable selection by backward elimination. A variable is dropped if its removal causes a nonsignificant increase in deviance. The rules for select list are the same as those for df list in the df() option (see above). Using the default selection level of 1 for all variables forces them all into the model. Setting the nominal p-value to be 1 for a given variable forces it into the model, leaving others to be selected or not. The nominal p-value for elements of xvarlist bound by parentheses is specified by including (varlist) in select list. Example: select(0.05) All variables have a nominal p-value of 5%. Example: select(0.05, weight:1) All variables except weight have a nominal p-value of 5%; weight is forced into the model. Example: select(a (b c):0.05) All variables except a, b, and c are forced into the model. b and c are tested jointly with 2 df at the 5% level, and a is tested singly at the 5% level. 1286 mfp — Multivariable fractional polynomial models xpowers(xp list) sets the permitted FP powers for covariates individually. The rules for xp list are the same as for df list in the df() option. The default selection is the same as that for the powers() option. Example: xpowers(-1 0 1) All variables have powers −1, 0, 1. Example: xpowers(x5:-1 0 1) All variables except x5 have default powers; x5 has powers −1, 0, 1. zero(varlist) treats negative and zero values of members of varlist as zero when FP transformations are applied. By default, such variables are subjected to a preliminary linear transformation to avoid negative and zero values, as described in the scale option of [R] fp. varlist must be part of xvarlist. catzero(varlist) is a variation on zero(); see Zeros and zero categories below. varlist must be part of xvarlist. regression cmd options may be any of the options appropriate to regression cmd. all includes out-of-sample observations when generating the FP variables. 
By default, the generated FP variables contain missing values outside the estimation sample. Reporting level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. display options: cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Remarks and examples Remarks are presented under the following headings: Iteration report Estimation algorithm Methods of FP model selection Zeros and zero categories For elements in xvarlist not enclosed in parentheses, mfp leaves variables in the data named Ixvar 1, Ixvar 2, . . . , where xvar represents the first four letters of the name of xvar1 , and so on, for xvar2 , xvar3 , etc. The new variables contain the best-fitting FP powers of xvar1 , xvar2 , . . . . Iteration report By default, for each continuous predictor, x, mfp compares null, linear, and FP1 models for x with an FP2 model. The deviance for each of these nested submodels is given in the column labeled “Deviance”. The line labeled “Final” gives the deviance for the selected model and its powers. All the other predictors currently selected are included, with their transformations (if any). For models specified as having 1 df, the only choice is whether the variable enters the model. mfp — Multivariable fractional polynomial models 1287 Estimation algorithm The estimation algorithm in mfp processes the xvars in turn. Initially, mfp silently arranges xvarlist in order of increasing p-value (that is, of decreasing statistical significance) for omitting each predictor from the model comprising xvarlist, with each term linear. The aim is to model relatively important variables before unimportant ones. This approach may help to reduce potential model-fitting difficulties caused by collinearity or, more generally, “concurvity” among the predictors. See the xorder() option above for details on how to change the ordering. At the initial cycle, the best-fitting FP function for xvar1 (the first of xvarlist) is determined, with all the other variables assumed to be linear. Either the default or the alternative procedure is used (see Methods of FP model selection below). The functional form (but not the estimated regression coefficients) for xvar1 is kept, and the process is repeated for xvar2 , xvar3 , etc. The first iteration concludes when all the variables have been processed in this way. The next cycle is similar, except that the functional forms from the initial cycle are retained for all variables except the one currently being processed. A variable whose functional form is prespecified to be linear (that is, to have 1 df) is tested for exclusion within the above procedure when its nominal p-value (selection level) according to select() is less than 1; otherwise, it is included. Updating of FP functions and candidate variables continues until the functions and variables included in the overall model do not change (convergence). Convergence is usually achieved within 1–4 cycles. Methods of FP model selection mfp includes two algorithms for FP model selection, both of which combine backward elimination with the selection of an FP function. For each continuous variable in turn, they start from a mostcomplex permitted FP model and attempt to simplify the model by reducing the degree. 
The default algorithm resembles a closed-test procedure, a sequence of tests maintaining the overall type I error rate at a prespecified nominal level, such as 5%. All significance tests are approximate; therefore, the algorithm is not precisely a closed-test procedure (Royston and Sauerbrei 2008, chap. 6). The closed-test algorithm for choosing an FP model with maximum permitted degree m = 2 (that is, an FP2 model with 4 df) for one continuous predictor, x, is as follows: 1. Inclusion: Test FP2 against the null model for x on 4 df at the significance level determined by select(). If x is significant, continue; otherwise, drop x from the model. 2. Nonlinearity: Test FP2 against a straight line in x on 3 df at the significance level determined by alpha(). If significant, continue; otherwise, stop, with the chosen model for x being a straight line. 3. Simplification: Test FP2 against FP1 on 2 df at the significance level determined by alpha(). If significant, the final model is FP2; otherwise, it is FP1. The first step is omitted if x is to be retained in the model, that is, if its nominal p-value, according to the select() option, is 1. An alternative algorithm is available with the sequential option, as originally suggested by Royston and Altman (1994): 1. Test FP2 against FP1 on 2 df at the alpha() significance level. If significant, the final model is FP2; otherwise, continue. 2. Test FP1 against a straight line on 1 df at the alpha() level. If significant, the final model is FP1; otherwise, continue. 1288 mfp — Multivariable fractional polynomial models 3. Test a straight line against omitting x on 1 df at the select() level. If significant, the final model is a straight line; otherwise, drop x. The final step is omitted if x is to be retained in the model, that is, if its nominal p-value, according to the select() option, is 1. If x is uninfluential, the overall type I error rate of this procedure is about double that of the closed-test procedure, for which the rate is close to the nominal value. This inflated type I error rate confers increased apparent power to detect nonlinear relationships. Zeros and zero categories The zero() option permits fitting an FP model to the positive values of a covariate, taking nonpositive values as zero. An application is the assessment of the effect of cigarette smoking as a risk factor in an epidemiological study. Nonsmokers may be qualitatively different from smokers, so the effect of smoking (regarded as a continuous variable) may not be continuous between one and zero cigarettes. To allow for this, the risk may be modeled as constant for the nonsmokers and as an FP function of the number of cigarettes for the smokers: . generate byte nonsmokr = cond(n_cigs==0, 1, 0) if n_cigs != . . mfp, zero(n_cigs) df(4, nonsmokr:1): logit case n_cigs nonsmokr age Omission of zero(n cigs) would cause n cigs to be transformed before analysis by the addition of a suitable constant, probably 1. A closely related approach involves the catzero() option. The command . mfp, catzero(n_cigs): logit case n_cigs age would achieve a similar result to the previous command but with important differences. First, mfp would create the equivalent of the binary variable nonsmokr automatically and include it in the model. Second, the two smoking variables would be treated as one predictor in the model. With the select() option active, the two variables would be tested jointly for inclusion in the model. A modified version is described in Royston and Sauerbrei (2008, sec. 4.15). 
Example 1 We illustrate two of the analyses performed by Sauerbrei and Royston (1999). We use brcancer.dta, which contains prognostic factors data from the German Breast Cancer Study Group of patients with node-positive breast cancer. The response variable is recurrence-free survival time (rectime), and the censoring variable is censrec. There are 686 patients with 299 events. We use Cox regression to predict the log hazard of recurrence from prognostic factors of which five are continuous (x1, x3, x5, x6, x7) and three are binary (x2, x4a, x4b). Hormonal therapy (hormon) is known to reduce recurrence rates and is forced into the model. We use mfp to build a model from the initial set of eight predictors by using the backfitting model-selection algorithm. We set the nominal p-value for variable and FP selection to 0.05 for all variables except hormon, which it is set to 1: . use http://www.stata-press.com/data/r13/brcancer (German breast cancer data) . stset rectime, fail(censrec) (output omitted ) mfp — Multivariable fractional polynomial models . mfp, alpha(.05) select(.05, hormon:1): stcox x1 x2 x3 x4a x4b x5 x6 x7 hormon, > nohr Deviance for model with all terms untransformed = 3471.637, 686 observations Variable Model (vs.) Deviance Dev diff. P Powers (vs.) x5 null lin. FP1 Final FP2 3503.610 3471.637 3449.203 3442.244 61.366 29.393 6.959 0.000* 0.000+ 0.031+ . 1 0 .5 3 .5 3 x6 null lin. FP1 Final FP2 3464.113 3442.244 3435.550 3435.550 29.917 8.048 1.354 0.000* 0.045+ 0.508 . 1 .5 .5 -2 .5 [hormon included with 1 df in model] x4a null Final lin. 3440.749 3435.550 5.199 0.023* . 1 1 x3 null Final FP2 3436.832 3436.832 3.560 0.469 . . -2 3 x2 null Final lin. 3437.589 3437.589 0.756 0.384 . . 1 x4b null Final lin. 3437.848 3437.848 0.259 0.611 . . 1 x1 null lin. FP1 Final FP2 3437.893 3437.848 3433.628 3419.808 18.085 18.040 13.820 0.001* 0.000+ 0.001+ . 1 -2 -2 -.5 -2 -.5 x7 null Final FP2 3420.805 3420.805 3.715 0.446 . . -.5 3 End of Cycle 1: deviance = 3420.805 x5 null lin. FP1 Final FP2 3494.867 3451.795 3428.023 3420.724 74.143 31.071 7.299 0.000* 0.000+ 0.026+ . 1 0 -2 -1 -2 -1 x6 null lin. FP1 Final FP2 3452.093 3427.703 3420.724 3420.724 32.704 8.313 1.334 0.000* 0.040+ 0.513 . 1 .5 .5 0 0 [hormon included with 1 df in model] x4a null Final lin. 3425.310 3420.724 4.586 0.032* . 1 1 x3 null Final FP2 3420.724 3420.724 5.305 0.257 . . -.5 0 x2 null Final lin. 3420.724 3420.724 0.214 0.644 . . 1 x4b null Final lin. 3420.724 3420.724 0.145 0.703 . . 1 x1 null lin. FP1 Final FP2 3440.057 3440.038 3436.949 3420.724 19.333 19.314 16.225 0.001* 0.000+ 0.000+ . 1 -2 -2 -.5 -2 -.5 x7 null Final FP2 3420.724 3420.724 2.152 0.708 . . -1 3 Fractional polynomial fitting algorithm converged after 2 cycles. 
1289 1290 mfp — Multivariable fractional polynomial models Transformations of covariates: -> gen double Ix1__1 = X^-2-.0355294635 if e(sample) -> gen double Ix1__2 = X^-.5-.4341573547 if e(sample) (where: X = x1/10) -> gen double Ix5__1 = X^-2-3.983723313 if e(sample) -> gen double Ix5__2 = X^-1-1.99592668 if e(sample) (where: X = x5/10) -> gen double Ix6__1 = X^.5-.3331600619 if e(sample) (where: X = (x6+1)/1000) Final multivariable fractional polynomial model for _t Variable x1 x2 x3 x4a x4b x5 x6 x7 hormon df Initial Select Alpha Status Final df 4 1 4 1 1 4 4 4 1 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 1.0000 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 in out out in out in in out in 4 0 0 1 0 4 2 0 1 Cox regression -- Breslow method for ties Entry time _t0 Coef. Ix1__1 Ix1__2 x4a Ix5__1 Ix5__2 Ix6__1 hormon 44.73377 -17.92302 .5006982 .0387904 -.5490645 -1.806966 -.4024169 Std. Err. 8.256682 3.909611 .2496324 .0076972 .0864255 .3506314 .1280843 -2 -.5 1 -2 -1 .5 1 Number of obs LR chi2(7) Prob > chi2 Pseudo R2 Log likelihood = -1710.3619 _t Powers z 5.42 -4.58 2.01 5.04 -6.35 -5.15 -3.14 P>|z| = = = = 686 155.62 0.0000 0.0435 [95% Conf. Interval] 0.000 0.000 0.045 0.000 0.000 0.000 0.002 28.55097 -25.58571 .0114276 .0237041 -.7184554 -2.494191 -.6534575 60.91657 -10.26032 .9899687 .0538767 -.3796736 -1.119741 -.1513763 Deviance: 3420.724. Some explanation of the output from the model-selection algorithm is desirable. Consider the first few lines of output in the iteration log: 1. Deviance for model with all terms untransformed = Variable 2. x5 3. 4. 5. 3471.637, 686 observations Model (vs.) Deviance Dev diff. P null lin. FP1 Final 3503.610 3471.637 3449.203 3442.244 61.366 29.393 6.959 0.000* 0.000+ 0.031+ FP2 Powers . 1 0 .5 3 (vs.) .5 3 Line 1 gives the deviance (−2 × log partial likelihood) for the Cox model with all terms linear, the place where the algorithm starts. The model is modified variable by variable in subsequent steps. The most significant linear term turns out to be x5, which is therefore processed first. Line 2 compares the best-fitting FP2 for x5 with a model omitting x5. The FP has powers (0.5, 3), and the test for inclusion of x5 is highly significant. The reported deviance of 3,503.610 is of the null model, not for the FP2 model. The deviance for the FP2 model may be calculated by subtracting the deviance mfp — Multivariable fractional polynomial models 1291 difference (Dev diff.) from the reported deviance, giving 3,503.610 − 61.366 = 3,442.244. Line 3 shows that the FP2 model is also a significantly better fit than a straight line (lin.) and line 4 that FP2 is also somewhat better than FP1 (p = 0.031). Thus at this stage in the model-selection procedure, the final model for x5 (line 5) is FP2 with powers (0.5, 3). The overall model with an FP2 for x5 and all other terms linear has a deviance of 3,442.244. After all the variables have been processed (cycle 1) and reprocessed (cycle 2) in this way, convergence is achieved because the functional forms (FP powers and variables included) after cycle 2 are the same as they were after cycle 1. The model finally chosen is Model II as given in tables 3 and 4 of Sauerbrei and Royston (1999). Because of scaling of variables, the regression coefficients reported there are different, but the model and its deviance are identical. The model includes x1 with powers (−2, −0.5), x4a, x5 with powers (−2, −1), and x6 with power 0.5. 
There is strong evidence of nonlinearity for x1 and for x5, the deviance differences for comparison with a straight-line model (FP2 vs lin.) being, respectively, 19.3 and 31.1 at convergence (cycle 2). Predictors x2, x3, x4b, and x7 are dropped, as may be seen from their status out in the table Final multivariable fractional polynomial model for t (the assumed depvar when using stcox). All predictors except x4a and hormon, which are binary, have been centered on the mean of the original variable. For example, the mean of x1 (age) is 53.05 years. The first FP-transformed variable for x1 is x1^-2 and is created by the expression gen double Ix1 1 = X^-2-.0355 if e(sample). The value 0.0355 is obtained from (53.05/10)−2 . The division by 10 is applied automatically to improve the scaling of the regression coefficient for Ix1 1. According to Sauerbrei and Royston (1999), medical knowledge dictates that the estimated risk function for x5 (number of positive nodes), which was based on the above FP with powers (−2, −1), should be monotonic, but it was not. They improved Model II by estimating a preliminary exponential transformation, x5e = exp(−0.12 · x5), for x5 and fitting a degree 1 FP for x5e, thus obtaining a monotonic risk function. The value of −0.12 was estimated univariately using nonlinear Cox regression with the ado-file boxtid (Royston and Ambler 1999b, 1999d). To ensure a negative exponent, Sauerbrei and Royston (1999) restricted the powers for x5e to be positive. Their Model III may be fit by using the following command: . mfp, alpha(.05) select(.05, hormon:1) df(x5e:2) xpowers(x5e:0.5 1 2 3): > stcox x1 x2 x3 x4a x4b x5e x6 x7 hormon Other than the customization for x5e, the command is the same as it was before. The resulting model is as reported in table 4 of Sauerbrei and Royston (1999): 1292 mfp — Multivariable fractional polynomial models . use http://www.stata-press.com/data/r13/brcancer, clear (German breast cancer data) . stset rectime, fail(censrec) (output omitted ) . mfp, alpha(.05) select(.05, hormon:1) df(x5e:2) xpowers(x5e:0.5 1 2 3): > stcox x1 x2 x3 x4a x4b x5e x6 x7 hormon, nohr (output omitted ) Final multivariable fractional polynomial model for _t Variable x1 x2 x3 x4a x4b x5e x6 x7 hormon df Initial Select Alpha Status Final df 4 1 4 1 1 2 4 4 1 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 1.0000 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 in out out in out in in out in 4 0 0 1 0 1 2 0 1 Cox regression -- Breslow method for ties Entry time _t0 Coef. Ix1__1 Ix1__2 x4a Ix5e__1 Ix6__1 hormon 43.55382 -17.48136 .5174351 -1.981213 -1.84008 -.3944998 Deviance: 3423.237. Std. Err. 8.253433 3.911882 .2493739 .2268903 .3508432 .128097 -2 -.5 1 1 .5 1 Number of obs LR chi2(6) Prob > chi2 Pseudo R2 Log likelihood = -1711.6186 _t Powers z 5.28 -4.47 2.07 -8.73 -5.24 -3.08 P>|z| 0.000 0.000 0.038 0.000 0.000 0.002 = = = = 686 153.11 0.0000 0.0428 [95% Conf. 
Interval] 27.37738 -25.14851 .0286713 -2.425909 -2.52772 -.6455654 59.73025 -9.814212 1.006199 -1.536516 -1.15244 -.1434342 mfp — Multivariable fractional polynomial models 1293 Stored results In addition to what regression cmd stores, mfp stores the following in e(): Scalars e(fp e(fp e(Fp e(Fp e(Fp e(Fp nx) dev) id#) fd#) al#) se#) Macros e(fp cmd) e(fp cmd2) e(cmdline) e(fracpoly) e(fp fvl) e(fp depv) e(fp opts) e(fp x1) e(fp x2) number of predictors in xvarlist deviance of final model fit initial degrees of freedom for the #th element of xvarlist final degrees of freedom for the #th element of xvarlist FP selection level for the #th element of xvarlist backward elimination selection level for the #th element of xvarlist fracpoly mfp command as typed command used to fit the selected model using fracpoly variables in final model yvar1 (yvar2 ) estimation command options first variable in xvarlist second variable in xvarlist ... e(fp xN) last variable in xvarlist, N =e(fp nx) e(fp k1) e(fp k2) power for first variable in xvarlist (*) power for second variable in xvarlist (*) ... e(fp kN) power for last var. in xvarlist (*), N =e(fp nx) Note: (*) contains ‘.’ if the variable is not selected in the final model. Acknowledgments mfp is an update of mfracpol by Royston and Ambler (1998). References Ambler, G., and P. Royston. 2001. Fractional polynomial model selection procedures: Investigation of Type I error rate. Journal of Statistical Computation and Simulation 69: 89–108. Royston, P., and D. G. Altman. 1994. Regression using fractional polynomials of continuous covariates: Parsimonious parametric modelling. Applied Statistics 43: 429–467. Royston, P., and G. Ambler. 1998. sg81: Multivariable fractional polynomials. Stata Technical Bulletin 43: 24–32. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 123–132. College Station, TX: Stata Press. . 1999a. sg112: Nonlinear regression models involving power or exponential functions of covariates. Stata Technical Bulletin 49: 25–30. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 173–179. College Station, TX: Stata Press. . 1999b. sg81.1: Multivariable fractional polynomials: Update. Stata Technical Bulletin 49: 17–23. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 161–168. College Station, TX: Stata Press. . 1999c. sg112.1: Nonlinear regression models involving power or exponential functions of covariates: Update. Stata Technical Bulletin 50: 26. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 180. College Station, TX: Stata Press. . 1999d. sg81.2: Multivariable fractional polynomials: Update. Stata Technical Bulletin 50: 25. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 168. College Station, TX: Stata Press. Royston, P., and W. Sauerbrei. 2007. Multivariable modeling with cubic regression splines: A principled approach. Stata Journal 7: 45–70. 1294 mfp — Multivariable fractional polynomial models . 2008. Multivariable Model-building: A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modelling Continuous Variables. Chichester, UK: Wiley. . 2009a. Two techniques for investigating interactions between treatment and continuous covariates in clinical trials. Stata Journal 9: 230–251. . 2009b. Bootstrap assessment of the stability of multivariable models. Stata Journal 9: 547–570. Sauerbrei, W., and P. Royston. 1999. Building multivariable prognostic and diagnostic models: Transformation of the predictors by using fractional polynomials. 
Journal of the Royal Statistical Society, Series A 162: 71–94. . 2002. Corrigendum: Building multivariable prognostic and diagnostic models: Transformation of the predictors by using fractional polynomials. Journal of the Royal Statistical Society, Series A 165: 399–400. Also see [R] mfp postestimation — Postestimation tools for mfp [R] fp — Fractional polynomial regression [U] 20 Estimation and postestimation commands Title mfp postestimation — Postestimation tools for mfp Description Options for fracplot Methods and formulas Syntax for fracplot and fracpred Options for fracpred Also see Menu for fracplot and fracpred Remarks and examples Description The following postestimation commands are of special interest after mfp: Command Description fracplot fracpred plot data and fit from most recently fit fractional polynomial model create variable containing prediction, deviance residuals, or SEs of fitted values The following standard postestimation commands are also available if available after regression cmd: Command Description estat ic estat summarize estat vce estimates lincom Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest nlcom test testnl Special-interest postestimation commands fracplot plots the data and fit, with 95% confidence limits, from the most recently fit fractional polynomial (FP) model. The data and fit are plotted against varname, which may be xvar1 or another of the covariates (xvar2 , . . . , or a variable from xvarlist). If varname is not specified, xvar1 is assumed. fracpred creates newvar containing the fitted index or deviance residuals for the whole model, or the fitted index or its standard error for varname, which may be xvar1 or another covariate. 1295 1296 mfp postestimation — Postestimation tools for mfp Syntax for fracplot and fracpred Plot data and fit from most recently fit fractional polynomial model fracplot varname if in , fracplot options Create variable containing the prediction, deviance residuals, or SEs of fitted values fracpred newvar , fracpred options fracplot options Description Plot marker options marker label options change look of markers (color, size, etc.) add marker labels; change look or position Fitted line lineopts(cline options) affect rendition of the fitted line CI plot ciopts(area options) affect rendition of the confidence bands Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options fracpred options Description for(varname) dresid stdp compute prediction for varname compute deviance residuals compute standard errors of the fitted values varname fracplot is not allowed after mfp with clogit, mlogit, or stcrreg. fracpred, dresid is not allowed after mfp with clogit, mlogit, or stcrreg. 
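A minimal sketch of how these two commands fit together after mfp follows; the dataset and variables are illustrative only.

. sysuse auto, clear
. mfp: regress mpg weight displacement
. fracplot weight
. fracpred fw, for(weight)
. fracpred sew, stdp for(weight)

fracplot draws the data and FP fit for weight with its 95% confidence limits; the two fracpred calls save the partial prediction for weight and its standard error, as described above.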
mfp postestimation — Postestimation tools for mfp 1297 Menu for fracplot and fracpred fracplot Statistics > Linear models and related > Fractional polynomials > Multivariable fractional polynomial plot Linear models and related > Fractional polynomials > Multivariable fractional polynomial prediction fracpred Statistics > Options for fracplot Plot marker options affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G-3] marker options. marker label options specify if and how the markers are to be labeled; see [G-3] marker label options. Fitted line lineopts(cline options) affect the rendition of the fitted line; see [G-3] cline options. CI plot ciopts(area options) affect the rendition of the confidence bands; see [G-3] area options. Add plots addplot(plot) provides a way to add other plots to the generated graph. See [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Options for fracpred for(varname) specifies (partial) prediction for variable varname. The fitted values are adjusted to the value specified by the center() option in mfp. dresid specifies that deviance residuals be calculated. stdp specifies calculation of the standard errors of the fitted values varname, adjusted for all the other predictors at the values specified by center(). Remarks and examples fracplot actually produces a component-plus-residual plot. For normal-error models with constant weights and one covariate, this amounts to a plot of the observations with the fitted line inscribed. For other normal-error models, weighted residuals are calculated and added to the fitted values. For models with additional covariates, the line is the partial linear predictor for the variable in question (xvar1 or a covariate) and includes the intercept β0 . 1298 mfp postestimation — Postestimation tools for mfp For generalized linear and Cox models, the fitted values are plotted on the scale of the “index” (linear predictor). Deviance residuals are added to the (partial) linear predictor to give component-plus-residual values. These values are plotted as small circles. Example 1 In example 1 of [R] mfp, we used Cox regression to predict the log hazard of breast cancer recurrence from prognostic factors of which five are continuous (x1, x3, x5, x6, x7) and three are binary (x2, x4a, x4b). We also controlled for hormonal therapy (hormon). We used mfp to build a model from the initial set of eight predictors by using the backfitting model-selection algorithm. The nominal p-value for variable and FP selection was set to 0.05 for all variables except hormon, which is set to 1. . use http://www.stata-press.com/data/r13/brcancer (German breast cancer data) . stset rectime, fail(censrec) (output omitted ) . mfp, alpha(.05) select(.05, hormon:1): stcox x1 x2 x3 x4a x4b x5 x6 x7 hormon, > nohr (output omitted ) We can use fracplot to produce component-plus-residual plots of the continuous variables. We produce the component-plus-residual plot for x1 with fracplot by specifying x1 after the command name. . fracplot x1 −2 Partial predictor+residual of _t 0 2 4 6 Fractional Polynomial (−2 −.5), adjusted for covariates 20 40 60 80 age, years We use fracpred with the stdp option to predict the standard error of the fractional polynomial prediction for x1. 
The standard error prediction will be stored in variable sepx1. We specify that prediction is made for x1 with the for() option. After prediction, we use summarize to show how the standard error estimate varies over different values of x1.

. fracpred sepx1, stdp for(x1)
. summarize sepx1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       sepx1 |       686    .0542654    .0471993   .0003304   .6862065

Methods and formulas

The general definition of an FP, accommodating possible repeated powers, may be written for functions H1(x), ..., Hm(x) as

    β0 + Σ_{j=1}^{m} βj Hj(x)

where H1(x) = x^(p1) and, for j = 2, ..., m,

    Hj(x) = x^(pj)            if pj ≠ pj−1
    Hj(x) = Hj−1(x) log x     if pj = pj−1

For example, an FP of degree 3 with powers (1, 3, 3) has H1(x) = x, H2(x) = x^3, and H3(x) = x^3 log x and equals β0 + β1 x + β2 x^3 + β3 x^3 log x.

The component-plus-residual values graphed by fracplot are calculated as follows: Let the data consist of triplets (yi, xi, zi), i = 1, ..., n, where zi is the vector of covariates for the ith observation, after applying possible fractional polynomial transformation and adjustment as described earlier. Let

    η̂i = β̂0 + {H(xi) − H(x0)}′β̂ + zi γ̂

be the linear predictor from the FP model, as given by the fracpred command or, equivalently, by the predict command with the xb option, following mfp. Here H(xi) = {H1(xi), ..., Hm(xi)}′ is the vector of FP functions described above, H(x0) = {H1(x0), ..., Hm(x0)}′ is the vector of adjustments to x0 (often, x0 is chosen to be the mean of the xi), β̂ is the estimated parameter vector, and γ̂ is the estimated parameter vector for the covariates. The values

    η̂i* = β̂0 + {H(xi) − H(x0)}′β̂

represent the behavior of the FP model for x at fixed values z = 0 of the (adjusted) covariates. The ith component-plus-residual is defined as η̂i* + di, where di is the deviance residual for the ith observation. For normal-errors models, di = √wi (yi − η̂i), where wi is the case weight (or 1, if weight is not specified). For logistic, Cox, and generalized linear regression models, see [R] logistic, [R] probit, [ST] stcox, and [R] glm for the formula for di. The formula for poisson models is the same as that for glm with family(poisson). For stcox, di is the partial martingale residual (see [ST] stcox postestimation).

fracplot plots the values of di and the curve represented by η̂i* against xi. The confidence interval for η̂i* is obtained from the variance–covariance matrix of the entire model and takes into account the uncertainty in estimating β0, β, and γ (but not in estimating the FP powers for x).

fracpred with the for(varname) option calculates the predicted index at xi = x0 and zi = 0; that is, η̂i = β̂0 + {H(xi) − H(x0)}′β̂. The standard error is calculated from the variance–covariance matrix of (β̂0, β̂), again ignoring estimation of the powers.
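To make the repeated-powers rule in the FP definition above concrete, the FP terms for powers (1, 3, 3) can be generated by hand. This is only an illustrative sketch; x stands for any strictly positive variable, and the H* names are arbitrary.

. generate double H1 = x
. generate double H2 = x^3
. generate double H3 = x^3*ln(x)

Regressing the outcome on H1, H2, and H3 (plus a constant) then fits β0 + β1 x + β2 x^3 + β3 x^3 log x directly.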
Also see

[R] mfp — Multivariable fractional polynomial models
[U] 20 Estimation and postestimation commands

Title

misstable — Tabulate missing values

Syntax     Menu     Description     Options
Remarks and examples     Stored results     Also see

Syntax

Report counts of missing values
    misstable summarize [varlist] [if] [in] [, summarize_options]

Report pattern of missing values
    misstable patterns [varlist] [if] [in] [, patterns_options]

Present a tree view of the pattern of missing values
    misstable tree [varlist] [if] [in] [, tree_options]

List the nesting rules that describe the missing-value pattern
    misstable nested [varlist] [if] [in] [, nested_options]

summarize_options        Description
------------------------------------------------------------------------
all                      show all variables
showzeros                show zeros in table
generate(stub[, exok])   generate missing-value indicators

patterns_options         Description
------------------------------------------------------------------------
asis                     use variables in order given
frequency                report frequencies instead of percentages
exok                     treat .a, .b, ..., .z as nonmissing
replace                  replace data in memory with dataset of patterns
clear                    okay to replace even if original unsaved
bypatterns               list by patterns rather than by frequency

tree_options             Description
------------------------------------------------------------------------
asis                     use variables in order given
frequency                report frequencies instead of percentages
exok                     treat .a, .b, ..., .z as nonmissing

nested_options           Description
------------------------------------------------------------------------
exok                     treat .a, .b, ..., .z as nonmissing

In addition, programmer's option nopreserve is allowed with all syntaxes; see [P] nopreserve option.

Menu

Statistics > Summaries, tables, and tests > Other tables > Tabulate missing values

Description

misstable makes tables that help you understand the pattern of missing values in your data.

Options

Options are presented under the following headings:
    Options for misstable summarize
    Options for misstable patterns
    Options for misstable tree
    Option for misstable nested
    Common options

Options for misstable summarize

all specifies that the table should include all the variables specified or all the variables in the dataset. The default is to include only numeric variables that contain missing values.

showzeros specifies that zeros in the table should display as 0 rather than being omitted.

generate(stub[, exok]) requests that a missing-value indicator newvar, a new binary variable containing 0 for complete observations and 1 for incomplete observations, be generated for every numeric variable in varlist containing missing values. If the all option is specified, missing-value indicators are created for all the numeric variables specified or for all the numeric variables in the dataset. If exok is specified within generate(), the extended missing values .a, .b, ..., .z are treated as if they do not designate missing.

For each variable in varlist, newvar is the corresponding variable name varname prefixed with stub. If the total length of stub and varname exceeds 32 characters, newvar is abbreviated so that its name does not exceed 32 characters.

Options for misstable patterns

asis, frequency, and exok; see Common options below.

replace specifies that the data in memory be replaced with a dataset corresponding to the table just displayed; see misstable patterns under Remarks and examples below.

clear is for use with replace; it specifies that it is okay to change the data in memory even if they have not been saved to disk.

bypatterns specifies the table be ordered by pattern rather than by frequency.
That is, bypatterns specifies that patterns containing one incomplete variable be listed first, followed by those for two incomplete variables, and so on. The default is to list the most frequent pattern first, followed by the next most frequent pattern, etc. Options for misstable tree asis, frequency, and exok – see Common options below. Option for misstable nested exok – see Common options below. Common options asis specifies that the order of the variables in the table be the same as the order in which they are specified on the misstable command. The default is to order the variables by the number of missing values, and within that, by the amount of overlap of missing values. frequency specifies that the table should report frequencies instead of percentages. exok specifies that the extended missing values .a, .b, . . . , .z should be treated as if they do not designate missing. Some users use extended missing values to designate values that are missing for a known and valid reason. nopreserve is a programmer’s option allowed with all misstable commands; see [P] nopreserve option. Remarks and examples Remarks are presented under the following headings: misstable summarize misstable patterns misstable tree misstable nested Execution time of misstable nested In what follows, we will use data from a 125-observation, fictional, student-satisfaction survey: . use http://www.stata-press.com/data/r13/studentsurvey (Student Survey) . summarize Variable Obs Mean Std. Dev. Min Max m1 m2 age female dept 125 125 122 122 116 2.456 2.472 18.97541 .5245902 2.491379 .8376619 .8089818 .8763477 .5014543 1.226488 1 1 17 0 1 4 4 21 1 4 offcampus comment 125 0 .36 .4819316 0 1 The m1 and m2 variables record the student’s satisfaction with teaching and with academics. comment is a string variable recording any comments the student might have had. misstable — Tabulate missing values 1303 misstable summarize Example 1 misstable summarize reports counts of missing values: . misstable summarize Obs<. Variable Obs=. age female dept 3 3 9 Obs>. Obs<. Unique values Min Max 122 122 116 5 2 4 17 0 1 21 1 4 Stata provides 27 different missing values, namely, ., .a, .b, . . . , .z. The first of those, ., is often called system missing. The remaining missing values are called extended missings. The nonmissing and missing values are ordered nonmissing < . < .a < .b < · · · < .z. Thus reported in the column “Obs=.” are counts of system missing values; in the column “Obs>.”, extended missing values; and in the column “Obs<.”, nonmissing values. The rightmost portion of the table is included to remind you how the variables are encoded. Our data contain seven variables and yet misstable reported only three of them. The omitted variables contain no missing values or are string variables. Even if we specified the varlist explicitly, those variables would not appear in the table unless we specified the all option. We can also create missing-value indicators for each of the variables above using the generate() option: . quietly misstable summarize, generate(miss_) . describe miss_* storage variable name type display format miss_age miss_female miss_dept %8.0g %8.0g %8.0g byte byte byte value label variable label (age>=.) (female>=.) (dept>=.) For each variable containing missing values, the generate() option creates a new binary variable containing 0 for complete observations and 1 for incomplete observations. 
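Indicators created this way can be used directly in later analyses, for example, to check informally whether missingness is related to another variable. Below is a minimal sketch using one of the indicators just created together with the offcampus variable from the survey; the particular model is illustrative only.

. logit miss_dept offcampus
(output omitted )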
In our example, three new missing-value indicators are generated, one for each of the incomplete variables age, female, and dept. The naming convention of generate() is to prefix the corresponding variable names with the specified stub, which is miss in this example. Missing-value indicators are useful, for example, for checking whether data are missing completely at random. They are also often used within the multiple-imputation context to identify the observed and imputed data; see [MI] intro substantive for a general introduction to multiple imputation. Within Stata’s multiple-imputation commands, an incomplete value is identified by the system missing value, a dot. By default, misstable summarize, generate() marks the extended missing values as incomplete values, as well. You can use exok within generate() to treat extended missing values as complete when creating missing-value identifiers. 1304 misstable — Tabulate missing values misstable patterns Example 2 misstable patterns reports the pattern of missing values: . misstable patterns Missing-value patterns (1 means complete) Pattern Percent 1 2 3 93% 1 1 1 5 2 1 0 1 0 0 0 100% Variables are (1) age (2) female (3) dept There are three patterns in these data: (1,1,1), (1,1,0), and (0,0,0). By default, the rows of the table are ordered by frequency. In larger tables that have more patterns, it is sometimes useful to order the rows by pattern. We could have obtained that by typing mi misstable patterns, bypatterns. In a pattern, 1 indicates that all values of the variable are nonmissing and 0 indicates that all values are missing. Thus pattern (1,1,1) means no missing values, and 93% of our data have that pattern. There are two patterns in which variables are missing, (1,1,0) and (0,0,0). Pattern (1,1,0) means that age is nonmissing, female is nonmissing, and dept is missing. The order of the variables in the patterns appears in the key at the bottom of the table. Five percent of the observations have pattern (1,1,0). The remaining 2% have pattern (0,0,0), meaning that all three variables contain missing. As with misstable summarize, only numeric variables that contain missing are listed, so had we typed misstable patterns comments age female offcampus dept, we still would have obtained the same table. Variables that are automatically omitted contain no missing values or are string variables. The variables in the table are ordered from lowest to highest frequency of missing values, although you cannot see that from the information presented in the table. The variables are ordered this way even if you explicitly specify the varlist with a different ordering. Typing misstable patterns dept female age would produce the same table as above. Specify the asis option if you want the variables in the order in which you specify them. You can obtain a dataset of the patterns by specifying the replace option: . misstable patterns, replace clear Missing-value patterns (1 means complete) Pattern Percent 1 2 3 93% 1 1 1 5 2 1 0 1 0 0 0 100% Variables are (1) age (2) female (summary data now in memory) (3) dept misstable — Tabulate missing values 1305 . list _freq age female dept 3 6 116 0 1 1 0 1 1 0 0 1 1. 2. 3. The differences between the dataset and the printed table are that 1) the dataset always records frequency and 2) the rows are reversed. misstable tree Example 3 misstable tree presents a tree view of the pattern of missing values: . use http://www.stata-press.com/data/r13/studentsurvey, clear (Student Survey) . 
misstable tree, frequency Nested pattern of missing values dept age female 9 3 6 116 0 116 3 0 0 6 0 0 0 116 (number missing listed first) In this example, we specified the frequency option to see the table in frequency rather than percentage terms. In the table, each column sums to the total number of observations in the data, 125. Variables are ordered from those with the most missing values to those with the least. Start with the first column. The dept variable is missing in 9 observations and, farther down, the table reports that it is not missing in 116 observations. Go back to the first row and read across, but only to the second column. The dept variable is missing in 9 observations. Within those 9, age is missing in 3 of them and is not missing in the remaining 6. Reading down the second column, within the 116 observations that dept is not missing, age is missing in 0 and not missing in 116. Reading straight across the first row again, dept is missing in 9 observations, and within the 9, age is missing in 3, and within the 3, female is also missing in 3. Skipping down just a little, within the 6 observations for which dept is missing and age is not missing, female is not missing, too. 1306 misstable — Tabulate missing values misstable nested Example 4 misstable nested lists the nesting rules that describe the missing-value pattern, . misstable nested 1. female(3) <-> age(3) -> dept(9) This line says that in observations in which female is missing, so is age missing, and vice versa, and in observations in which age (or female) is missing, so is dept. The numbers in parentheses are counts of the missing values. The female variable happens to be missing in 3 observations, and the same is true for age; the dept variable is missing in 9 observations. Thus dept is missing in the 3 observations for which age and female are missing, and in 6 more observations, too. In these data, it turns out that the missing-value pattern can be summarized in one statement. In a larger dataset, you might see something like this: . misstable nested 1. female(50) <-> age(50) -> dept(120) 2. female(50) -> m1(58) 3. offcampus(11) misstable nested accounts for every missing value. In the above, in addition to female <-> age -> dept, we have that female -> m1, and we have offcampus, the last all by itself. The last line says that the 11 missing values in offcampus are not themselves nested in the missing value of any other variable, nor do they imply the missing values in another variable. In some datasets, all the statements will be of this last form. In our data, however, we have one statement: . misstable nested 1. female(3) <-> age(3) -> dept(9) When the missing-value pattern can be summarized in one misstable nested statement, the pattern of missing values in the data is said to be monotone. Execution time of misstable nested The execution time of misstable nested is affected little by the number of observations but can grow quickly with the number of variables, depending on the fraction of missing values within variable. The execution time of the example above, which has 3 variables containing missing, is instant. In worst-case scenarios, with 500 variables, the time might be 25 seconds; with 1,000 variables, the execution time might be closer to an hour. In situations where misstable nested takes a long time to complete, it will produce thousands of rules that will defy interpretation. A 523-variable dataset we have seen ran in 20 seconds and produced 8,040 rules. 
Although we spotted a few rules in the output that did not surprise us, such as the year of the date being missing implied that the month and the day were also missing, mostly the output was not helpful. If you have such a dataset, we recommend you run misstable on groups of variables that you have reason to believe the pattern of missing values might be related. misstable — Tabulate missing values 1307 Stored results misstable summarize stores the following values of the last variable summarized in r(): Scalars r(N eq dot) r(N gt dot) r(N lt dot) r(K uniq) r(min) r(max) Macros r(vartype) number of observations containing . number of observations containing .a, .b, . . . , .z number of observations containing nonmissing number of unique, nonmissing values variable’s minimum value variable’s maximum value numeric, string, or none r(K uniq) contains . if the number of unique, nonmissing values is greater than 500. r(vartype) contains none if no variables are summarized, and in that case, the value of the scalars are all set to missing (.). Programmers intending to access results after misstable summarize should specify the all option. misstable patterns stores the following in r(): Scalars r(N complete) r(N incomplete) r(K) Macros r(vars) number of complete observations number of incomplete observations number of patterns variables used in order presented r(N complete) and r(N incomplete) are defined with respect to the variables specified if variables were specified and otherwise, defined with respect to all the numeric variables in the dataset. r(N complete) is the number of observations that contain no missing values. misstable tree stores the following in r(): Macros r(vars) variables used in order presented misstable nested stores the following in r(): Scalars r(K) Macros r(stmt1) r(stmt2) . . r(stmt‘r(K)’) r(stmt1wc) r(vars) number of statements first statement second statement . . last statement r(stmt1) with missing-value counts variables considered A statement is encoded “varname”, “varname op varname”, or “varname op varname op varname”, and so on; op is either “->” or “<->”. Also see [MI] mi misstable — Tabulate pattern of missing values [R] summarize — Summary statistics [R] tabulate oneway — One-way table of frequencies [R] tabulate twoway — Two-way table of frequencies Title mkspline — Linear and restricted cubic spline construction Syntax Remarks and examples Also see Menu Methods and formulas Description Acknowledgment Options References Syntax Linear spline with knots at specified points mkspline newvar1 # 1 [newvar2 # 2 [. . .] ] newvark = oldvar if in , marginal displayknots Linear spline with knots equally spaced or at percentiles of data mkspline stubname # = oldvar if in weight , marginal pctile displayknots Restricted cubic spline mkspline stubname = oldvar displayknots if in weight , cubic nknots(#) knots(numlist) fweights are allowed with the second and third syntax; see [U] 11.1.6 weight. Menu Data > Create or change data > Other variable-creation commands > Linear and cubic spline construction Description mkspline creates variables containing a linear spline or a restricted cubic spline of oldvar. In the first syntax, mkspline creates newvar1 , . . . , newvark containing a linear spline of oldvar with knots at the specified # 1 , . . . , # k−1 . In the second syntax, mkspline creates # variables named stubname1, . . . , stubname# containing a linear spline of oldvar. The knots are equally spaced over the range of oldvar or are placed at the percentiles of oldvar. 
In the third syntax, mkspline creates variables containing a restricted cubic spline of oldvar. This is also known as a natural spline. The location and spacing of the knots is determined by the specification of the nknots() and knots() options.

Options

marginal is allowed with the first or second syntax. It specifies that the new variables be constructed so that, when used in estimation, the coefficients represent the change in the slope from the preceding interval. The default is to construct the variables so that, when used in estimation, the coefficients measure the slopes for the interval.

displayknots displays the values of the knots that were used in creating the linear or restricted cubic spline.

pctile is allowed only with the second syntax. It specifies that the knots be placed at percentiles of the data rather than being equally spaced over the range.

nknots(#) is allowed only with the third syntax. It specifies the number of knots that are to be used for a restricted cubic spline. This number must be between 3 and 7 unless the knot locations are specified using knots(). The default number of knots is 5.

knots(numlist) is allowed only with the third syntax. It specifies the exact location of the knots to be used for a restricted cubic spline. The values of these knots must be given in increasing order. When this option is omitted, the default knot values are based on Harrell's recommended percentiles with the additional restriction that the smallest knot may not be less than the fifth-smallest value of oldvar and the largest knot may not be greater than the fifth-largest value of oldvar. If both nknots() and knots() are given, they must specify the same number of knots.

Remarks and examples

Remarks are presented under the following headings:
    Linear splines
    Restricted cubic splines

Linear splines

Linear splines allow estimating the relationship between y and x as a piecewise linear function, which is a function composed of linear segments (straight lines). One linear segment represents the function for values of x below x0, another linear segment handles values between x0 and x1, and so on. The linear segments are arranged so that they join at x0, x1, ..., which are called the knots. An example of a piecewise linear function is shown below.

(figure omitted: "A piecewise linear function", a piecewise linear curve of z plotted against x, with knot 1 and knot 2 marked at the joins)

Example 1

We wish to fit a model of log income on education and age by using a piecewise linear function for age:

    lninc = b0 + b1 educ + f(age) + u

The knots are to be placed at 10-year intervals: 20, 30, 40, 50, and 60.

. use http://www.stata-press.com/data/r13/mksp1
. mkspline age1 20 age2 30 age3 40 age4 50 age5 60 age6 = age, marginal
. regress lninc educ age1-age6
(output omitted )

Because we specified the marginal option, we could test whether the age effect is the same in the 30–40 and 40–50 intervals by asking whether the age4 coefficient is zero. With the marginal option, coefficients measure the change in slope from the preceding group. Specifying marginal changes only the interpretation of the coefficients; the same model is fit in either case.
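The test just mentioned can be carried out with test after regress. A minimal sketch, continuing the example above:

. test age4

Had the spline variables been created without marginal, the coefficients would be the interval slopes themselves, and the corresponding test would compare them directly: test age3 = age4.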
Without the marginal option, the interpretation of the coefficients would have been

    dy/dage = a1    if age < 20
              a2    if 20 ≤ age < 30
              a3    if 30 ≤ age < 40
              a4    if 40 ≤ age < 50
              a5    if 50 ≤ age < 60
              a6    otherwise

With the marginal option, the interpretation is

    dy/dage = a1                            if age < 20
              a1 + a2                       if 20 ≤ age < 30
              a1 + a2 + a3                  if 30 ≤ age < 40
              a1 + a2 + a3 + a4             if 40 ≤ age < 50
              a1 + a2 + a3 + a4 + a5        if 50 ≤ age < 60
              a1 + a2 + a3 + a4 + a5 + a6   otherwise

Example 2

Say that we have a binary outcome variable called outcome. We are beginning an analysis and wish to parameterize the effect of dosage on outcome. We wish to divide the data into five equal-width groups of dosage for the piecewise linear function.

. use http://www.stata-press.com/data/r13/mksp2, clear
. mkspline dose 5 = dosage, displayknots

                 knot1      knot2      knot3      knot4
      dosage        20         40         60         80

. logistic outcome dose1-dose5
(output omitted )

mkspline dose 5 = dosage creates five variables, dose1 through dose5, equally spacing the knots over the range of dosage. Because dosage varied between 0 and 100, the mkspline command above has the same effect as typing

. mkspline dose1 20 dose2 40 dose3 60 dose4 80 dose5 = dosage

The pctile option sets the knots to divide the data into five equal sample-size groups rather than five equal-width ranges. Typing

. mkspline pctdose 5 = dosage, pctile displayknots

                 knot1      knot2      knot3      knot4
      dosage        16       36.4       55.6         82

places the knots at the 20th, 40th, 60th, and 80th percentiles of the data.

Restricted cubic splines

A linear spline can be used to fit many functions well. However, a restricted cubic spline may be a better choice than a linear spline when working with a very curved function. When using a restricted cubic spline, one obtains a continuous smooth function that is linear before the first knot, a piecewise cubic polynomial between adjacent knots, and linear again after the last knot.

Example 3

Returning to the data from example 1, we may feel that a curved function is a better fit. First, we will use the knots() option to specify the five knots that we used previously.

. use http://www.stata-press.com/data/r13/mksp1, clear
. mkspline agesp = age, cubic knots(20 30 40 50 60)
. regress lninc educ agesp*
(output omitted )

Harrell (2001, 23) recommends placing knots at equally spaced percentiles of the original variable's marginal distribution. If we do not specify the knots() option, variables will be created containing a restricted cubic spline with five knots determined by Harrell's default percentiles.

. use http://www.stata-press.com/data/r13/mksp1, clear
. mkspline agesp = age, cubic displayknots
. regress lninc educ agesp*
(output omitted )

Methods and formulas

Methods and formulas are presented under the following headings:
    Linear splines
    Restricted cubic splines

Linear splines

Let Vi, i = 1, ..., n, be the variables to be created; ki, i = 1, ..., n − 1, be the corresponding knots; and V be the original variable (the command is mkspline V1 k1 V2 k2 ... Vn = V). Then

    V1 = min(V, k1)
    Vi = max{min(V, ki), ki−1} − ki−1    for i = 2, ..., n − 1
    Vn = max(V, kn−1) − kn−1

If the marginal option is specified, the definitions are

    V1 = V
    Vi = max(0, V − ki−1)    for i = 2, ..., n

In the second syntax, mkspline stubname # = V, so let m and M be the minimum and maximum of V. Without the pctile option, knots are set at m + (M − m)(i/n) for i = 1, ..., n − 1.
If pctile is specified, knots are set at the 100(i/n) percentiles, for i = 1, . . . , n − 1. Percentiles are calculated by centile; see [R] centile. Restricted cubic splines Let ki , i = 1, . . . , n, be the knot values; Vi , i = 1, . . . , n − 1, be the variables to be created; and V be the original variable. Then V1 = V Vi+1 = (V − ki )3+ − (kn − kn−1 )−1 {(V − kn−1 )3+ (kn − ki ) − (V − kn )3+ (kn−1 − ki )} (kn − k1 )2 i = 1, . . . , n − 2 where ( (u)+ = u, if u > 0 0, if u ≤ 0 Without the knots() option, the locations of the knots are determined by the percentiles recommended in Harrell (2001, 23). These percentiles are based on the chosen number of knots as follows: No. of knots 3 4 5 6 7 Percentiles 10 5 5 5 2.5 50 35 27.5 23 18.33 90 65 50 41 34.17 95 72.5 59 50 95 77 65.83 95 81.67 97.5 mkspline — Linear and restricted cubic spline construction 1313 Harrell provides default percentiles when the number of knots is between 3 and 7. When using a number of knots outside this range, the location of the knots must be specified in knots(). Acknowledgment The restricted cubic spline portion of mkspline is based on the rc spline command by William Dupont of the Department of Biostatistics at Vanderbilt University. References Gould, W. W. 1993. sg19: Linear splines and piecewise linear functions. Stata Technical Bulletin 15: 13–17. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 98–104. College Station, TX: Stata Press. Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. Harrell, F. E., Jr. 2001. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer. Newson, R. B. 2000. sg151: B-splines and splines parameterized by their values at reference points on the x-axis. Stata Technical Bulletin 57: 20–27. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 221–230. College Station, TX: Stata Press. . 2012. Sensible parameters for univariate and multivariate splines. Stata Journal 12: 479–504. Orsini, N., and S. Greenland. 2011. A procedure to tabulate and plot results after flexible modeling of a quantitative covariate. Stata Journal 11: 1–29. Panis, C. 1994. sg24: The piecewise linear spline transformation. Stata Technical Bulletin 18: 27–29. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 146–149. College Station, TX: Stata Press. Also see [R] fp — Fractional polynomial regression Title ml — Maximum likelihood estimation Syntax Stored results Description Methods and formulas Options References Remarks and examples Also see Syntax ml model in interactive mode ml model method progname eq eq . . . if in weight , model options svy diparm options ml model method funcname() eq eq . . . if in weight , model options svy diparm options ml model in noninteractive mode ml model method progname eq eq . . . if in weight , maximize model options svy diparm options noninteractive options ml model method funcname() eq eq . . . if in weight , maximize model options svy diparm options noninteractive options Noninteractive mode is invoked by specifying the maximize option. Use maximize when ml will be used as a subroutine of another ado-file or program and you want to carry forth the problem, from definition to posting of results, in one command. ml clear ml query ml check ml search / eqname : # lb # ub ... , search options ml plot ml init ml init ml init eqname: name # # # , saving(filename , replace ) eqname: name=# | /eqname=# # # . . . 
, copy matname , copy skip ml report ml trace on | off 1314 ... ml — Maximum likelihood estimation ml count clear | on | off ml maximize , ml maximize options display options eform option ml graph # , saving(filename , replace ) ml display 1315 , display options eform option ml footnote ml score newvar if ml score newvarlist ml score type if stub* where method is one of lf d0 d1 d1debug d2 d2debug in in if , equation(eqname) missing in , missing lf0 lf1 lf1debug lf2 lf2debug , missing gf0 or method can be specified using one of the longer, more descriptive names method Longer name lf d0 d1 d1debug d2 d2debug lf0 lf1 lf1debug lf2 lf2debug gf0 linearform derivative0 derivative1 derivative1debug derivative2 derivative2debug linearform0 linearform1 linearform1debug linearform2 linearform2debug generalform0 eq is the equation to be estimated, enclosed in parentheses, and optionally with a name to be given to the equation, preceded by a colon, ( eqname: varlisty = varlistx , eq options ) or eq is the name of a parameter, such as sigma, with a slash in front /eqname which is equivalent to (eqname:) and diparm options is one or more diparm(diparm args) options where diparm args is either sep or anything accepted by the “undocumented” diparm command; see help diparm. 1316 ml — Maximum likelihood estimation eq options Description noconstant offset(varnameo ) exposure(varnamee ) do not include an intercept in the equation include varnameo in model with coefficient constrained to 1 include ln(varnamee ) in model with coefficient constrained to 1 model options Description group(varname) vce(vcetype) constraints(numlist) constraints(matname) nocnsnotes title(string) nopreserve collinear missing lf0(# k # ll ) obs(#) crittype(string) subpop(varname) nosvyadjust technique(nr) technique(bhhh) technique(dfp) technique(bfgs) use varname to identify groups vcetype may be robust, cluster clustvar, oim, or opg constraints by number to be applied matrix that contains the constraints to be applied do not display notes when constraints are dropped place a title on the estimation output do not preserve the estimation subsample in memory keep collinear variables within equations keep observations containing variables with missing values number of parameters and log-likelihood value of the constant-only model specifies that a model has been fit and sets the initial values b0 for the model to be fit based on those results perform a Wald test; see Options for use with ml model in interactive or noninteractive mode below number of observations describe the criterion optimized by ml compute estimates for the single subpopulation carry out Wald test as W/k ∼ F (k, d) Stata’s modified Newton–Raphson (NR) algorithm Berndt–Hall–Hall–Hausman (BHHH) algorithm Davidon–Fletcher–Powell (DFP) algorithm Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm noninteractive options Description init(ml init args) search(on) search(norescale) search(quietly) search(off) repeat(#) bounds(ml search bounds) nowarning novce negh score(newvars) maximize options set the initial values b0 equivalent to ml search, repeat(0); the default equivalent to ml search, repeat(0) norescale same as search(on), except that output is suppressed prevents calling ml search ml search’s repeat() option; see below specify bounds for ml search suppress “convergence not achieved” message of iterate(0) substitute the zero matrix for the variance matrix indicates that the evaluator returns the negative Hessian matrix new variables containing the contribution to the 
score control the maximization process; seldom used continue waldtest(#) ml — Maximum likelihood estimation 1317 search options Description repeat(#) restart norescale maximize options number of random attempts to find better initial-value vector; default is repeat(10) in interactive mode and repeat(0) in noninteractive mode use random actions to find starting values; not recommended do not rescale to improve parameter vector; not recommended control the maximization process; seldom used ml maximize options Description nowarning novce negh score(newvars | stub*) nooutput noclear maximize options suppress “convergence not achieved” message of iterate(0) substitute the zero matrix for the variance matrix indicates that the evaluator returns the negative Hessian matrix new variables containing the contribution to the score suppress display of final results do not clear ml problem definition after model has converged control the maximization process; seldom used display options Description noheader nofootnote level(#) first neq(#) showeqns plus nocnsreport noomitted vsquish suppress header display above the coefficient table suppress footnote display below the coefficient table set confidence level; default is level(95) display coefficient table reporting results for first equation only display coefficient table reporting first # equations display equation names in the coefficient table display coefficient table ending in dashes–plus-sign–dashes suppress constraints display above the coefficient table suppress display of omitted variables suppress blank space separating factor-variable terms or time-series–operated variables from other variables suppress empty cells for interactions of factor variables report base levels of factor variables and interactions display all base levels of factor variables and interactions format the coefficients, standard errors, and confidence limits in the coefficient table format the p-values in the coefficient table format the test statistics in the coefficient table do not automatically widen the coefficient table to accommodate longer variable names display legend instead of statistics noemptycells baselevels allbaselevels cformat(% fmt) pformat(% fmt) sformat(% fmt) nolstretch coeflegend 1318 ml — Maximum likelihood estimation eform option Description eform(string) eform hr shr irr or rrr display exponentiated coefficients; column title is “string” display exponentiated coefficients; column title is “exp(b)” report hazard ratios report subhazard ratios report incidence-rate ratios report odds ratios report relative-risk ratios fweights, aweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. With all but method lf, you must write your likelihood-evaluation program carefully if pweights are to be specified, and pweights may not be specified with method d0, d1, d1debug, d2, or d2debug. See Gould, Pitblado, and Poi (2010, chap. 6) for details. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. To redisplay results, type ml display. Syntax of subroutines for use by evaluator programs mleval newvar = vecname mleval scalarname = vecname , scalar mlsum scalarnamelnf = exp mlvecsum mlmatsum mlmatbysum , eq(#) eq(#) , noweight scalarnamelnf rowvecname = exp if , eq(#) scalarnamelnf matrixname = exp if , eq(# ,# ) scalarnamelnf matrixname varnamea varnameb varnamec if , eq(# ,# ) by(varname) if Syntax of user-written evaluator Summary of notation The log-likelihood function is ln L(θ1j , θ2j , . . . 
, θEj ), where θij = xij bi , j = 1, . . . , N indexes observations, and i = 1, . . . , E indexes the linear equations defined by ml model. If the likelihood PN satisfies the linear-form restrictions, it can be decomposed as ln L = j=1 ln `(θ1j , θ2j , . . . , θEj ). Method-lf evaluators program progname version 13 args lnfj theta1 theta2 ... // if you need to create any intermediate results: tempvar tmp1 tmp2 . . . quietly gen double ‘tmp1 ’ = . . . ... quietly replace ‘lnfj’ = end ... ml — Maximum likelihood estimation where ‘lnfj’ ‘theta1’ ‘theta2’ variable to be filled in with observation-by-observation values of ln`j variable containing evaluation of first equation θ1j =x1j b1 variable containing evaluation of second equation θ2j =x2j b2 ... Method-d0 evaluators program progname version 13 args todo b lnf tempvar theta1 theta2 . . . mleval ‘theta1 ’ = ‘b’, eq(1) mleval ‘theta2 ’ = ‘b’, eq(2) // if there is a θ2 ... // if you need to create any intermediate results: tempvar tmp1 tmp2 . . . gen double ‘tmp1 ’ = . . . ... mlsum ‘lnf’ = ... end where always contains 0 (may be ignored) full parameter row vector b=(b1 ,b2 ,...,bE ) scalar to be filled in with overall lnL ‘todo’ ‘b’ ‘lnf’ Method-d1 evaluators program progname version 13 args todo b lnf g tempvar theta1 theta2 . . . mleval ‘theta1 ’ = ‘b’, eq(1) mleval ‘theta2 ’ = ‘b’, eq(2) // if there is a θ2 ... // if you need to create any intermediate results: tempvar tmp1 tmp2 . . . gen double ‘tmp1 ’ = . . . ... mlsum ‘lnf’ = . . . if (‘todo’==0 | ‘lnf’>=.) exit tempname d1 d2 . . . mlvecsum ‘lnf’ ‘d1’ = formula for ∂ ln`j /∂θ1j , eq(1) mlvecsum ‘lnf’ ‘d2’ = formula for ∂ ln`j /∂θ2j , eq(2) ... matrix ‘g’ = (‘d1’,‘d2’, ... ) end where ‘todo’ ‘b’ ‘lnf’ ‘g’ contains 0 or 1 0⇒‘lnf’to be filled in; 1⇒‘lnf’ and ‘g’ to be filled in full parameter row vector b=(b1 ,b2 ,...,bE ) scalar to be filled in with overall lnL row vector to be filled in with overall g=∂ lnL/∂b 1319 1320 ml — Maximum likelihood estimation Method-d2 evaluators program progname version 13 args todo b lnf g H tempvar theta1 theta2 . . . mleval ‘theta1 ’ = ‘b’, eq(1) mleval ‘theta2 ’ = ‘b’, eq(2) // if there is a θ2 ... // if you need to create any intermediate results: tempvar tmp1 tmp2 . . . gen double ‘tmp1 ’ = . . . ... mlsum ‘lnf’ = . . . if (‘todo’==0 | ‘lnf’>=.) exit tempname d1 d2 . . . mlvecsum ‘lnf’ ‘d1’ = formula for ∂ ln`j /∂θ1j , eq(1) mlvecsum ‘lnf’ ‘d2’ = formula for ∂ ln`j /∂θ2j , eq(2) ... matrix ‘g’ = (‘d1’,‘d2’, . . . ) if (‘todo’==1 | ‘lnf’>=.) exit tempname d11 d12 d22 . . . 2 mlmatsum ‘lnf’ ‘d11’ = formula for ∂ 2 ln`j /∂θ1j , eq(1) mlmatsum ‘lnf’ ‘d12’ = formula for ∂ 2 ln`j /∂θ1j ∂θ2j , eq(1,2) 2 mlmatsum ‘lnf’ ‘d22’ = formula for ∂ 2 ln`j /∂θ2j , eq(2) ... matrix ‘H’ = (‘d11’,‘d12’, . . . \ ‘d12’’,‘d22’, . . . ) end where contains 0, 1, or 2 0⇒‘lnf’ to be filled in; 1⇒‘lnf’ and ‘g’ to be filled in; 2⇒‘lnf’, ‘g’, and ‘H’ to be filled in full parameter row vector b=(b1 ,b2 ,...,bE ) scalar to be filled in with overall lnL row vector to be filled in with overall g=∂ ln L/∂b matrix to be filled in with overall Hessian H=∂ 2 ln L/∂b∂b0 ‘todo’ ‘b’ ‘lnf’ ‘g’ ‘H’ Method-lf0 evaluators program progname version 13 args todo b lnfj tempvar theta1 theta2 . . . mleval ‘theta1 ’ = ‘b’, eq(1) mleval ‘theta2 ’ = ‘b’, eq(2) // if there is a θ2 ... // if you need to create any intermediate results: tempvar tmp1 tmp2 . . . gen double ‘tmp1 ’ = . . . ... quietly replace ‘lnfj’ = ... 
end where ‘todo’ ‘b’ ‘lnfj’ always contains 0 (may be ignored) full parameter row vector b=(b1 ,b2 ,...,bE ) variable to be filled in with observation-by-observation values of ln`j ml — Maximum likelihood estimation Method-lf1 evaluators program progname version 13 args todo b lnfj g1 g2 ... tempvar theta1 theta2 . . . mleval ‘theta1 ’ = ‘b’, eq(1) mleval ‘theta2 ’ = ‘b’, eq(2) // if there is a θ2 ... // if you need to create any intermediate results: tempvar tmp1 tmp2 . . . gen double ‘tmp1 ’ = . . . ... quietly replace ‘lnfj’ = . . . if (‘todo’==0) exit quietly replace ‘g1’ = formula for ∂ ln`j /∂θ1j quietly replace ‘g2’ = formula for ∂ ln`j /∂θ2j ... end where contains 0 or 1 0⇒‘lnfj’to be filled in; 1⇒‘lnfj’, ‘g1’, ‘g2’, . . ., to be filled in full parameter row vector b=(b1 ,b2 ,...,bE ) variable to be filled in with observation-by-observation values of ln`j variable to be filled in with ∂ ln`j /∂θ1j variable to be filled in with ∂ ln`j /∂θ2j ‘todo’ ‘b’ ‘lnfj’ ‘g1’ ‘g2’ ... Method-lf2 evaluators program progname version 13 args todo b lnfj g1 g2 . . . H tempvar theta1 theta2 . . . mleval ‘theta1 ’ = ‘b’, eq(1) mleval ‘theta2 ’ = ‘b’, eq(2) // if there is a θ2 ... // if you need to create any intermediate results: tempvar tmp1 tmp2 . . . gen double ‘tmp1 ’ = . . . ... quietly replace ‘lnfj’ = . . . if (‘todo’==0) exit quietly replace ‘g1’ = formula for ∂ ln`j /∂θ1j quietly replace ‘g2’ = formula for ∂ ln`j /∂θ2j ... if (‘todo’==1) exit tempname d11 d12 d22 mlmatsum ‘lnf’ ‘d11’ mlmatsum ‘lnf’ ‘d12’ mlmatsum ‘lnf’ ‘d22’ ... lnf . . . 2 = formula for ∂ 2 ln`j /∂θ1j , eq(1) = formula for ∂ 2 ln`j /∂θ1j ∂θ2j , eq(1,2) 2 = formula for ∂ 2 ln`j /∂θ2j , eq(2) matrix ‘H’ = (‘d11’,‘d12’, end . . . \ ‘d12’’,‘d22’, . . . ) 1321 1322 ml — Maximum likelihood estimation where contains 0 or 1 0⇒‘lnfj’to be filled in; 1⇒‘lnfj’, ‘g1’, ‘g2’, . . ., to be filled in 2⇒‘lnfj’, ‘g1’, ‘g2’, . . ., and ‘H’ to be filled in full parameter row vector b=(b1 ,b2 ,...,bE ) scalar to be filled in with observation-by-observation lnL variable to be filled in with ∂ ln`j /∂θ1j variable to be filled in with ∂ ln`j /∂θ2j ‘todo’ ‘b’ ‘lnfj’ ‘g1’ ‘g2’ ... matrix to be filled in with overall Hessian H=∂ 2 ln L/∂b∂b0 ‘H’ Method-gf0 evaluators program progname version 13 args todo b lnfj tempvar theta1 theta2 . . . mleval ‘theta1 ’ = ‘b’, eq(1) mleval ‘theta2 ’ = ‘b’, eq(2) // if there is a θ2 ... // if you need to create any intermediate results: tempvar tmp1 tmp2 . . . gen double ‘tmp1 ’ = . . . ... quietly replace ‘lnfj’ = ... end where ‘todo’ ‘b’ ‘lnfj’ always contains 0 (may be ignored) full parameter row vector b=(b1 ,b2 ,...,bE ) variable to be filled in with the values of the log-likelihood ln`j Global macros for use by all evaluators $ML y1 $ML y2 name of first dependent variable name of second dependent variable, if any ... $ML samp $ML w variable containing 1 if observation to be used; 0 otherwise variable containing weight associated with observation or 1 if no weights specified Method-lf evaluators can ignore $ML samp, but restricting calculations to the $ML samp==1 subsample will speed execution. Method-lf evaluators must ignore $ML w; application of weights is handled by the method itself. Methods d0, d1, d2, lf0, lf1, lf2, and gf0 can ignore $ML samp as long as ml model’s nopreserve option is not specified. These methods will run more quickly if nopreserve is specified. These evaluators can ignore $ML w only if they use mlsum, mlvecsum, mlmatsum, and mlmatbysum to produce all final results. 
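To make the method-lf template above concrete, here is a minimal sketch of a linear-regression (normal-errors) evaluator together with the ml commands that fit it. The program name mynormal_lf, the equation names mu and lnsigma, and the use of the automobile data are all illustrative, not part of ml itself.

program mynormal_lf
        version 13
        args lnfj mu lnsigma
        // $ML_y1 is the dependent variable; the second equation carries ln(sigma)
        quietly replace `lnfj' = ln(normalden($ML_y1, `mu', exp(`lnsigma')))
end

. sysuse auto
. ml model lf mynormal_lf (mu: mpg = weight displacement) /lnsigma
. ml check
. ml maximize

Parameterizing the standard deviation as exp(lnsigma) keeps it positive without constraining the maximization.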
Description ml model defines the current problem. ml clear clears the current problem definition. This command is rarely used because when you type ml model, any previous problem is automatically cleared. ml query displays a description of the current problem. ml check verifies that the log-likelihood evaluator you have written works. We strongly recommend using this command. ml — Maximum likelihood estimation 1323 ml search searches for (better) initial values. We recommend using this command. ml plot provides a graphical way of searching for (better) initial values. ml init provides a way to specify initial values. ml report reports ln L’s values, gradient, and Hessian at the initial values or current parameter estimates, b0 . ml trace traces the execution of the user-defined log-likelihood evaluation program. ml count counts the number of times the user-defined log-likelihood evaluation program is called; this command is seldom used. ml count clear clears the counter. ml count on turns on the counter. ml count without arguments reports the current values of the counter. ml count off stops counting calls. ml maximize maximizes the likelihood function and reports results. Once ml maximize has successfully completed, the previously mentioned ml commands may no longer be used unless noclear is specified. ml graph and ml display may be used whether or not noclear is specified. ml graph graphs the log-likelihood values against the iteration number. ml display redisplays results. ml footnote displays a warning message when the model did not converge within the specified number of iterations. ml score creates new variables containing the equation-level scores. The variables generated by ml score are equivalent to those generated by specifying the score() option of ml maximize (and ml model . . . , . . . maximize). progname is the name of a Stata program you write to evaluate the log-likelihood function. funcname() is the name of a Mata function you write to evaluate the log-likelihood function. In this documentation, progname and funcname() are referred to as the user-written evaluator, the likelihood evaluator, or sometimes simply as the evaluator. The program you write is written in the style required by the method you choose. The methods are lf, d0, d1, d2, lf0, lf1, lf2, and gf0. Thus, if you choose to use method lf, your program is called a method-lf evaluator. Method-lf evaluators are required to evaluate the observation-by-observation log likelihood ln `j , j = 1, . . . , N . Method-d0 evaluators are required to evaluate the overall log likelihood ln L. Method-d1 evaluators are required to evaluate the overall log likelihood and its gradient vector g = ∂ ln L/∂b. Method-d2 evaluators are required to evaluate the overall log likelihood, its gradient, and its Hessian matrix H = ∂ 2 ln L/∂b∂b0 . Method-lf0 evaluators are required to evaluate the observation-by-observation log likelihood ln `j , j = 1, . . . , N . Method-lf1 evaluators are required to evaluate the observation-by-observation log likelihood and its equation-level scores gji = ∂ ln `/∂xji bi . Method-lf2 evaluators are required to evaluate the observation-by-observation log likelihood, its equation-level scores, and its Hessian matrix H = ∂ 2 ln `/∂b∂b0 . Method-gf0 evaluators are required to evaluate the summable pieces of the log likelihood ln `k , k = 1, . . . , K . mleval is a subroutine used by evaluators of methods d0, d1, d2, lf0, lf1, lf2, and gf0 to evaluate the coefficient vector, b, that they are passed. 
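For instance, within a method-d0 evaluator, mleval converts the parameter vector into equation-level variables, and mlsum (described next) accumulates the overall log likelihood. A minimal sketch, assuming a single-equation normal-errors model with the standard deviation fixed at 1 purely for brevity; the program name is illustrative.

program mynormal_d0
        version 13
        args todo b lnf
        tempvar mu
        mleval `mu' = `b', eq(1)
        // sum the observation-level log densities into the scalar `lnf'
        mlsum `lnf' = ln(normalden($ML_y1, `mu', 1))
end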
mlsum is a subroutine used by evaluators of methods d0, d1, and d2 to define the value, ln L, that is to be returned. 1324 ml — Maximum likelihood estimation mlvecsum is a subroutine used by evaluators of methods d1 and d2 to define the gradient vector, g, that is to be returned. It is suitable for use only when the likelihood function meets the linear-form restrictions. mlmatsum is a subroutine used by evaluators of methods d2 and lf2 to define the Hessian matrix, H, that is to be returned. It is suitable for use only when the likelihood function meets the linear-form restrictions. mlmatbysum is a subroutine used by evaluator of method d2 to help define the Hessian matrix, H, that is to be returned. It is suitable for use when the likelihood function contains terms made up of grouped sums, such as in panel-data models. For such models, use mlmatsum to compute the observation-level outer products and mlmatbysum to compute the group-level outer products. mlmatbysum requires that the data be sorted by the variable identified in the by() option. Options Options are presented under the following headings: Options for use with ml model in interactive or noninteractive mode Options for use with ml model in noninteractive mode Options for use when specifying equations Options for use with ml search Option for use with ml plot Options for use with ml init Options for use with ml maximize Option for use with ml graph Options for use with ml display Options for use with mleval Option for use with mlsum Option for use with mlvecsum Option for use with mlmatsum Options for use with mlmatbysum Options for use with ml score Options for use with ml model in interactive or noninteractive mode group(varname) specifies the numeric variable that identifies groups. This option is typically used to identify panels for panel-data models. vce(vcetype) specifies the type of standard error reported, which includes types that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that are derived from asymptotic theory (oim, opg); see [R] vce option. vce(robust), vce(cluster clustvar), pweight, and svy will work with evaluators of methods lf, lf0, lf1, lf2, and gf0; all you need do is specify them. These options will not work with evaluators of methods d0, d1, or d2, and specifying these options will produce an error message. constraints(numlist | matname) specifies the linear constraints to be applied during estimation. constraints(numlist) specifies the constraints by number. Constraints are defined by using the constraint command; see [R] constraint. constraint(matname) specifies a matrix that contains the constraints. nocnsnotes prevents notes from being displayed when constraints are dropped. A constraint will be dropped if it is inconsistent, contradicts other constraints, or causes some other error when the constraint matrix is being built. Constraints are checked in the order in which they are specified. ml — Maximum likelihood estimation 1325 title(string) specifies the title for the estimation output when results are complete. nopreserve specifies that ml need not ensure that only the estimation subsample is in memory when the user-written likelihood evaluator is called. nopreserve is irrelevant when you use method lf. For the other methods, if nopreserve is not specified, ml saves the data in a file (preserves the original dataset) and drops the irrelevant observations before calling the user-written evaluator. 
This way, even if the evaluator does not restrict its attentions to the $ML samp==1 subsample, results will still be correct. Later, ml automatically restores the original dataset. ml need not go through these machinations for method lf because the user-written evaluator calculates observation-by-observation values, and ml itself sums the components. ml goes through these machinations if and only if the estimation sample is a subsample of the data in memory. If the estimation sample includes every observation in memory, ml does not preserve the original dataset. Thus programmers must not alter the original dataset unless they preserve the data themselves. We recommend that interactive users of ml not specify nopreserve; the speed gain is not worth the possibility of getting incorrect results. We recommend that programmers specify nopreserve, but only after verifying that their evaluator really does restrict its attentions solely to the $ML samp==1 subsample. collinear specifies that ml not remove the collinear variables within equations. There is no reason to leave collinear variables in place, but this option is of interest to programmers who, in their code, have already removed collinear variables and do not want ml to waste computer time checking again. missing specifies that observations containing variables with missing values not be eliminated from the estimation sample. There are two reasons you might want to specify missing: Programmers may wish to specify missing because, in other parts of their code, they have already eliminated observations with missing values and do not want ml to waste computer time looking again. You may wish to specify missing if your model explicitly deals with missing values. Stata’s heckman command is a good example of this. In such cases, there will be observations where missing values are allowed and other observations where they are not—where their presence should cause the observation to be eliminated. If you specify missing, it is your responsibility to specify an if exp that eliminates the irrelevant observations. lf0(# k # ll ) is typically used by programmers. It specifies the number of parameters and log-likelihood value of the constant-only model so that ml can report a likelihood-ratio test rather than a Wald test. These values may have been analytically determined, or they may have been determined by a previous fitting of the constant-only model on the estimation sample. Also see the continue option directly below. If you specify lf0(), it must be safe for you to specify the missing option, too, else how did you calculate the log likelihood for the constant-only model on the same sample? You must have identified the estimation sample, and done so correctly, so there is no reason for ml to waste time rechecking your results. All of which is to say, do not specify lf0() unless you are certain your code identifies the estimation sample correctly. lf0(), even if specified, is ignored if vce(robust), vce(cluster clustvar), pweight, or svy is specified because, in that case, a likelihood-ratio test would be inappropriate. 1326 ml — Maximum likelihood estimation continue is typically specified by programmers and does two things: First, it specifies that a model has just been fit by either ml or some other estimation command, such as logit, and that the likelihood value stored in e(ll) and the number of parameters stored in e(b) as of that instant are the relevant values of the constant-only model. 
The current value of the log likelihood is used to present a likelihood-ratio test unless vce(robust), vce(cluster clustvar), pweight, svy, or constraints() is specified. A likelihood-ratio test is inappropriate when vce(robust), vce(cluster clustvar), pweight, or svy is specified. We suggest using lrtest when constraints() is specified; see [R] lrtest. Second, continue sets the initial values, b0 , for the model about to be fit according to the e(b) currently stored. The comments made about specifying missing with lf0() apply equally well here. waldtest(#) is typically specified by programmers. By default, ml presents a Wald test, but that is overridden if the lf0() or continue option is specified. A Wald test is performed if vce(robust), vce(cluster clustvar), or pweight is specified. waldtest(0) prevents even the Wald test from being reported. waldtest(-1) is the default. It specifies that a Wald test be performed by constraining all coefficients except the intercept to 0 in the first equation. Remaining equations are to be unconstrained. A Wald test is performed if neither lf0() nor continue was specified, and a Wald test is forced if vce(robust), vce(cluster clustvar), or pweight was specified. waldtest(k ) for k ≤ −1 specifies that a Wald test be performed by constraining all coefficients except intercepts to 0 in the first |k| equations; remaining equations are to be unconstrained. A Wald test is performed if neither lf0() nor continue was specified, and a Wald test is forced if vce(robust), vce(cluster clustvar), or pweight was specified. waldtest(k ) for k ≥ 1 works like the options above, except that it forces a Wald test to be reported even if the information to perform the likelihood-ratio test is available and even if none of vce(robust), vce(cluster clustvar), or pweight was specified. waldtest(k ), k ≥ 1, may not be specified with lf0(). obs(#) is used mostly by programmers. It specifies that the number of observations reported and ultimately stored in e(N) be #. Ordinarily, ml works that out for itself. Programmers may want to specify this option when, for the likelihood evaluator to work for N observations, they first had to modify the dataset so that it contained a different number of observations. crittype(string) is used mostly by programmers. It allows programmers to supply a string (up to 32 characters long) that describes the criterion that is being optimized by ml. The default is "log likelihood" for nonrobust and "log pseudolikelihood" for robust estimation. svy indicates that ml is to pick up the svy settings set by svyset and use the robust variance estimator. This option requires the data to be svyset; see [SVY] svyset. svy may not be specified with vce() or weights. subpop(varname) specifies that estimates be computed for the single subpopulation defined by the observations for which varname 6= 0. Typically, varname = 1 defines the subpopulation, and varname = 0 indicates observations not belonging to the subpopulation. For observations whose subpopulation status is uncertain, varname should be set to missing (‘.’). This option requires the svy option. nosvyadjust specifies that the model Wald test be carried out as W/k ∼ F (k, d), where W is the Wald test statistic, k is the number of terms in the model excluding the constant term, d is the total number of sampled PSUs minus the total number of strata, and F (k, d) is an F distribution with k numerator degrees of freedom and d denominator degrees of freedom. 
By default, an adjusted ml — Maximum likelihood estimation 1327 Wald test is conducted: (d − k + 1)W/(kd) ∼ F (k, d − k + 1). See Korn and Graubard (1990) for a discussion of the Wald test and the adjustments thereof. This option requires the svy option. technique(algorithm spec) specifies how the likelihood function is to be maximized. The following algorithms are currently implemented in ml. For details, see Gould, Pitblado, and Poi (2010). technique(nr) specifies Stata’s modified Newton–Raphson (NR) algorithm. technique(bhhh) specifies the Berndt–Hall–Hall–Hausman (BHHH) algorithm. technique(dfp) specifies the Davidon–Fletcher–Powell (DFP) algorithm. technique(bfgs) specifies the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm. The default is technique(nr). You can switch between algorithms by specifying more than one in the technique() option. By default, ml will use an algorithm for five iterations before switching to the next algorithm. To specify a different number of iterations, include the number after the technique in the option. For example, technique(bhhh 10 nr 1000) requests that ml perform 10 iterations using the BHHH algorithm, followed by 1,000 iterations using the NR algorithm, and then switch back to BHHH for 10 iterations, and so on. The process continues until convergence or until reaching the maximum number of iterations. Options for use with ml model in noninteractive mode The following extra options are for use with ml model in noninteractive mode. Noninteractive mode is for programmers who use ml as a subroutine and want to issue one command that will carry forth the estimation from start to finish. maximize is required. It specifies noninteractive mode. init(ml init args) sets the initial values, b0 . ml init args are whatever you would type after the ml init command. search(on | norescale | quietly | off) specifies whether ml search is to be used to improve the initial values. search(on) is the default and is equivalent to separately running ml search, repeat(0). search(norescale) is equivalent to separately running ml search, repeat(0) norescale. search(quietly) is equivalent to search(on), except that it suppresses ml search’s output. search(off) prevents calling ml search. repeat(#) is ml search’s repeat() option. repeat(0) is the default. bounds(ml search bounds) specifies the search bounds. ml search bounds is specified as eqn name lower bound upper bound . . . eqn name lower bound upper bound for instance, bounds(100 100 lnsigma 0 10). The ml model command issues ml search ml search bounds, repeat(#). Specifying search bounds is optional. nowarning, novce, negh, and score() are ml maximize’s equivalent options. maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. 1328 ml — Maximum likelihood estimation Options for use when specifying equations noconstant specifies that the equation not include an intercept. offset(varnameo ) specifies that the equation be xb + varnameo —that it include varnameo with coefficient constrained to be 1. exposure(varnamee ) is an alternative to offset(varnameo ); it specifies that the equation be xb + ln(varnamee ). The equation is to include ln(varnamee ) with coefficient constrained to be 1. 
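To make the equation options above concrete, here is a minimal sketch; the evaluator name mypoisson and the variables deaths, smokes, pyears, and lnpyears are hypothetical and are not defined elsewhere in this entry. An equation's options are placed inside the parentheses of the equation to which they apply:

. ml model lf mypoisson (deaths = smokes, exposure(pyears))
. ml model lf mypoisson (deaths = smokes, offset(lnpyears))

If lnpyears contains ln(pyears), the two statements specify the same equation; in both, the added term enters with its coefficient constrained to be 1. Adding noconstant inside the same parentheses would suppress that equation's intercept.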
Options for use with ml search repeat(#) specifies the number of random attempts that are to be made to find a better initial-value vector. The default is repeat(10). repeat(0) specifies that no random attempts be made. More precisely, repeat(0) specifies that no random attempts be made if the first initial-value vector is a feasible starting point. If it is not, ml search will make random attempts, even if you specify repeat(0), because it has no alternative. The repeat() option refers to the number of random attempts to be made to improve the initial values. When the initial starting value vector is not feasible, ml search will make up to 1,000 random attempts to find starting values. It stops when it finds one set of values that works and then moves into its improve-initial-values logic. repeat(k ), k > 0, specifies the number of random attempts to be made to improve the initial values. restart specifies that random actions be taken to obtain starting values and that the resulting starting values not be a deterministic function of the current values. Generally, you should not specify this option because, with restart, ml search intentionally does not produce as good a set of starting values as it could. restart is included for use by the optimizer when it gets into serious trouble. The random actions ensure that the optimizer and ml search, working together, do not cause an endless loop. restart implies norescale, which is why we recommend that you do not specify restart. In testing, sometimes rescale worked so well that, even after randomization, the rescaler would bring the starting values right back to where they had been the first time and thus defeat the intended randomization. norescale specifies that ml search not engage in its rescaling actions to improve the parameter vector. We do not recommend specifying this option because rescaling tends to work so well. maximize options: no log and trace; see [R] maximize. These options are seldom used. Option for use with ml plot saving( filename[ , replace]) specifies that the graph be saved in filename.gph. See [G-3] saving option. Options for use with ml init copy specifies that the list of numbers or the initialization vector be copied into the initial-value vector by position rather than by name. skip specifies that any parameters found in the specified initialization vector that are not also found in the model be ignored. The default action is to issue an error message. ml — Maximum likelihood estimation 1329 Options for use with ml maximize nowarning is allowed only with iterate(0). nowarning suppresses the “convergence not achieved” message. Programmers might specify iterate(0) nowarning when they have a vector b already containing the final estimates and want ml to calculate the variance matrix and postestimation results. Then specify init(b) search(off) iterate(0) nowarning nolog. novce is allowed only with iterate(0). novce substitutes the zero matrix for the variance matrix, which in effect posts estimation results as fixed constants. negh indicates that the evaluator returns the negative Hessian matrix. By default, ml assumes d2 and lf2 evaluators return the Hessian matrix. score(newvars | stub*) creates new variables containing the contributions to the score for each equation and ancillary parameter in the model; see [U] 20.22 Obtaining scores. If score(newvars) is specified, the newvars must contain k new variables. For evaluators of methods lf, lf0, lf1, and lf2, k is the number of equations. 
For evaluators of method gf0, k is the number of parameters. If score(stub*) is specified, variables named stub1, stub2, . . . , stubk are created. For evaluators of methods lf, lf0, lf1, and lf2, the first variable contains ∂ ln `j /∂(x1j b1 ), the second variable contains ∂ ln `j /∂(x2j b2 ), and so on. For evaluators of method gf0, the first variable contains ∂ ln `j /∂b1 , the second variable contains ∂ ln `j /∂b2 , and so on. nooutput suppresses display of results. This option is different from prefixing ml maximize with quietly in that the iteration log is still displayed (assuming that nolog is not specified). noclear specifies that the ml problem definition not be cleared after the model has converged. Perhaps you are having convergence problems and intend to run the model to convergence. If so, use ml search to see if those values can be improved, and then restart the estimation. maximize options: difficult, iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance; see [R] maximize. These options are seldom used. display options; see Options for use with ml display below. eform option; see Options for use with ml display below. Option for use with ml graph saving( filename[ , replace]) specifies that the graph be saved in filename.gph. See [G-3] saving option. Options for use with ml display noheader suppresses the header display above the coefficient table that displays the final log-likelihood value, the number of observations, and the model significance test. nofootnote suppresses the footnote display below the coefficient table, which displays a warning if the model fit did not converge within the specified number of iterations. Use ml footnote to display the warning if 1) you add to the coefficient table using the plus option or 2) you have your own footnotes and want the warning to be last. 1330 ml — Maximum likelihood estimation level(#) is the standard confidence-level option. It specifies the confidence level, as a percentage, for confidence intervals of the coefficients. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. first displays a coefficient table reporting results for the first equation only, and the report makes it appear that the first equation is the only equation. This option is used by programmers who estimate ancillary parameters in the second and subsequent equations and who wish to report the values of such parameters themselves. neq(#) is an alternative to first. neq(#) displays a coefficient table reporting results for the first # equations. This option is used by programmers who estimate ancillary parameters in the # + 1 and subsequent equations and who wish to report the values of such parameters themselves. showeqns is a seldom-used option that displays the equation names in the coefficient table. ml display uses the numbers stored in e(k eq) and e(k aux) to determine how to display the coefficient table. e(k eq) identifies the number of equations, and e(k aux) identifies how many of these are for ancillary parameters. The first option is implied when showeqns is not specified and all but the first equation are for ancillary parameters. plus displays the coefficient table, but rather than ending the table in a line of dashes, ends it in dashes–plus-sign–dashes. This is so that programmers can write additional display code to add more results to the table and make it appear as if the combined result is one table. 
Programmers typically specify plus with the first or neq() options. This option implies nofootnote. nocnsreport suppresses the display of constraints above the coefficient table. This option is ignored if constraints were not used to fit the model. noomitted specifies that variables that were omitted because of collinearity not be displayed. The default is to include in the table any variables omitted because of collinearity and to label them as “(omitted)”. vsquish specifies that the blank space separating factor-variable terms or time-series–operated variables from other variables in the model be suppressed. noemptycells specifies that empty cells for interactions of factor variables not be displayed. The default is to include in the table interaction cells that do not occur in the estimation sample and to label them as “(empty)”. baselevels and allbaselevels control whether the base levels of factor variables and interactions are displayed. The default is to exclude from the table all base categories. baselevels specifies that base levels be reported for factor variables and for interactions whose bases cannot be inferred from their component factor variables. allbaselevels specifies that all base levels of factor variables and interactions be reported. cformat(% fmt) specifies how to format coefficients, standard errors, and confidence limits in the coefficient table. pformat(% fmt) specifies how to format p-values in the coefficient table. sformat(% fmt) specifies how to format test statistics in the coefficient table. nolstretch specifies that the width of the coefficient table not be automatically widened to accommodate longer variable names. The default, lstretch, is to automatically widen the coefficient table up to the width of the Results window. To change the default, use set lstretch off. nolstretch is not shown in the dialog box. coeflegend specifies that the legend of the coefficients and how to specify them in an expression be displayed rather than displaying the statistics for the coefficients. ml — Maximum likelihood estimation 1331 eform option: eform(string), eform, hr, shr, irr, or, and rrr display the coefficient table in exponentiated form: for each coefficient, exp(b) rather than b is displayed, and standard errors and confidence intervals are transformed. string is the table header that will be displayed above the transformed coefficients and must be 11 characters or shorter in length—for example, eform("Odds ratio"). The options eform, hr, shr, irr, or, and rrr provide a default string equivalent to “exp(b)”, “Haz. Ratio”, “SHR”, “IRR”, “Odds Ratio”, and “RRR”, respectively. These options may not be combined. ml display looks at e(k eform) to determine how many equations are affected by an eform option; by default, only the first equation is affected. Type ereturn list, all to view e(k eform); see [P] ereturn. Options for use with mleval eq(#) specifies the equation number, i, for which θij = xij bi is to be evaluated. eq(1) is assumed if eq() is not specified. scalar asserts that the ith equation is known to evaluate to a constant, meaning that the equation was specified as (), (name:), or /name on the ml model statement. If you specify this option, the new variable created is created as a scalar. If the ith equation does not evaluate to a scalar, an error message is issued. Option for use with mlsum noweight specifies that weights ($ML w) be ignored when summing the likelihood function. 
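The following is a minimal sketch of how the mleval and mlsum options fit together in a method-d0 evaluator for linear regression with normal errors; the program name mynormal_d0 and the /lnsigma equation are assumptions made for this illustration, not part of this entry. Because the second equation is constant-only, mleval's scalar option returns it as a scalar rather than as a variable, and mlsum then sums the observation-level terms over the estimation sample.

program mynormal_d0
        version 13
        args todo b lnf
        tempvar mu
        tempname lnsigma
        mleval `mu' = `b', eq(1)
        // eq. 2 is specified as /lnsigma below, so it evaluates to a constant
        mleval `lnsigma' = `b', eq(2) scalar
        // sum the observation-level log likelihoods into `lnf'
        mlsum `lnf' = ln(normalden($ML_y1, `mu', exp(`lnsigma')))
end

It could be fit, for example, by typing

. ml model d0 mynormal_d0 (mpg = weight) /lnsigma
. ml maximize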
Option for use with mlvecsum eq(#) specifies the equation for which a gradient vector ∂ ln L/∂bi is to be constructed. The default is eq(1). Option for use with mlmatsum eq(# ,# ) specifies the equations for which the Hessian matrix is to be constructed. The default is eq(1), which is the same as eq(1,1), which means ∂ 2 ln L/∂b1 ∂b01 . Specifying eq(i,j ) results in ∂ 2 ln L/∂bi ∂b0j . Options for use with mlmatbysum by(varname) is required and specifies the group variable. eq(# ,# ) specifies the equations for which the Hessian matrix is to be constructed. The default is eq(1), which is the same as eq(1,1), which means ∂ 2 ln L/∂b1 ∂b01 . Specifying eq(i,j ) results in ∂ 2 ln L/∂bi ∂b0j . Options for use with ml score equation(eqname) identifies from which equation the observation scores are to come. This option may be used only when generating one variable. missing specifies that observations containing variables with missing values not be eliminated from the estimation sample. 1332 ml — Maximum likelihood estimation Remarks and examples For a thorough discussion of ml, see the fourth edition of Maximum Likelihood Estimation with Stata (Gould, Pitblado, and Poi 2010). The book provides a tutorial introduction to ml, notes on advanced programming issues, and a discourse on maximum likelihood estimation from both theoretical and practical standpoints. See Survey options and ml at the end of Remarks and examples for examples of the new svy options. For more information about survey estimation, see [SVY] survey, [SVY] svy estimation, and [SVY] variance estimation. ml requires that you write a program that evaluates the log-likelihood function and, possibly, its first and second derivatives. The style of the program you write depends upon the method you choose. Methods lf, lf0, d0, and gf0 require that your program evaluate the log likelihood only. Methods d1 and lf1 require that your program evaluate the log likelihood and its first derivatives. Methods d2 and lf2 requires that your program evaluate the log likelihood and its first and second derivatives. Methods lf, lf0, d0, and gf0 differ from each other in that, with methods lf and lf0, your program is required P to produce observation-by-observation log-likelihood values ln `j and it is assumed that ln L = j ln `j ; with method d0, your program is required to produce only the overall value ln L; and with method gf0, your program is required to produce the summable pieces of the log likelihood, such as those in panel-data models. Once you have written the program—called an evaluator—you define a model to be fit using ml model and obtain estimates using ml maximize. You might type . ml model . . . . ml maximize but we recommend that you type . . . . ml ml ml ml model . . . check search maximize ml check verifies your evaluator has no obvious errors, and ml search finds better initial values. You fill in the ml model statement with 1) the method you are using, 2) the name of your program, and 3) the “equations”. You write your evaluator in terms of θ1 , θ2 , . . . , each of which has a linear equation associated with it. That linear equation might be as simple as θi = b0 , it might be θi = b1 mpg + b2 weight + b3 , or it might omit the intercept b3 . The equations are specified in parentheses on the ml model line. Suppose that you are using method lf and the name of your evaluator program is myprog. The statement . ml model lf myprog (mpg weight) would specify one equation with θi = b1 mpg + b2 weight + b3 . 
If you wanted to omit b3 , you would type . ml model lf myprog (mpg weight, nocons) and if all you wanted was θi = b0 , you would type . ml model lf myprog () With multiple equations, you list the equations one after the other; so, if you typed . ml model lf myprog (mpg weight) () ml — Maximum likelihood estimation 1333 you would be specifying θ1 = b1 mpg + b2 weight + b3 and θ2 = b4 . You would write your likelihood in terms of θ1 and θ2 . If the model was linear regression, θ1 might be the xb part and θ2 the variance of the residuals. When you specify the equations, you also specify any dependent variables. If you typed . ml model lf myprog (price = mpg weight) () price would be the one and only dependent variable, and that would be passed to your program in $ML y1. If your model had two dependent variables, you could type . ml model lf myprog (price displ = mpg weight) () Then $ML y1 would be price and $ML y2 would be displ. You can specify however many dependent variables are necessary and specify them on any equation. It does not matter on which equation you specify them; the first one specified is placed in $ML y1, the second in $ML y2, and so on. Example 1: Method lf Using method lf, we want to produce observation-by-observation values of the log likelihood. The probit log-likelihood function is ln `j = ln Φ(θ1j ) ln Φ(−θ1j ) if yj = 1 if yj = 0 θ1j = xj b1 The following is the method-lf evaluator for this likelihood function: program myprobit version 13 args lnf theta1 quietly replace ‘lnf’ = ln(normal(‘theta1’)) if $ML_y1==1 quietly replace ‘lnf’ = ln(normal(-‘theta1’)) if $ML_y1==0 end If we wanted to fit a model of foreign on mpg and weight, we would type the following commands. The ‘foreign =’ part specifies that y is foreign. The ‘mpg weight’ part specifies that θ1j = b1 mpgj + b2 weightj + b3 . 1334 ml — Maximum likelihood estimation . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . ml model lf myprobit (foreign = mpg weight) . ml maximize initial: log likelihood = -51.292891 alternative: log likelihood = -45.055272 rescale: log likelihood = -45.055272 Iteration 0: log likelihood = -45.055272 Iteration 1: log likelihood = -27.905385 Iteration 2: log likelihood = -26.858058 Iteration 3: log likelihood = -26.844198 Iteration 4: log likelihood = -26.844189 Iteration 5: log likelihood = -26.844189 Number of obs Wald chi2(2) Prob > chi2 Log likelihood = -26.844189 foreign Coef. mpg weight _cons -.1039503 -.0023355 8.275464 Std. Err. .0515689 .0005661 2.554142 z -2.02 -4.13 3.24 P>|z| 0.044 0.000 0.001 = = = 74 20.75 0.0000 [95% Conf. Interval] -.2050235 -.003445 3.269438 -.0028772 -.0012261 13.28149 Example 2: Method lf for two-equation, two-dependent-variable model A two-equation, two-dependent-variable model is a little different. Rather than receiving one θ, our program will receive two. Rather than there being one dependent variable in $ML y1, there will be dependent variables in $ML y1 and $ML y2. For instance, the Weibull regression log-likelihood function is ln `j = −(tj e−θ1j )exp(θ2j ) + dj {θ2j − θ1j + (eθ2j − 1)(ln tj − θ1j )} θ1j = xj b1 θ2j = s where tj is the time of failure or censoring and dj = 1 if failure and 0 if censored. 
We can make the log likelihood a little easier to program by introducing some extra variables: pj = exp(θ2j ) Mj = {tj exp(−θ1j )}pj Rj = ln tj − θ1j ln `j = −Mj + dj {θ2j − θ1j + (pj − 1)Rj } The method-lf evaluator for this is program myweib version 13 args lnf theta1 theta2 tempvar p M R quietly gen double ‘p’ = exp(‘theta2’) quietly gen double ‘M’ = ($ML_y1*exp(-‘theta1’))^‘p’ quietly gen double ‘R’ = ln($ML_y1)-‘theta1’ quietly replace ‘lnf’ = -‘M’ + $ML_y2*(‘theta2’-‘theta1’ + (‘p’-1)*‘R’) end ml — Maximum likelihood estimation 1335 We can fit a model by typing . ml model lf myweib (studytime died = i.drug age) () . ml maximize Note that we specified ‘()’ for the second equation. The second equation corresponds to the Weibull shape parameter s, and the linear combination we want for s contains just an intercept. Alternatively, we could type . ml model lf myweib (studytime died = i.drug age) /s Typing /s means the same thing as typing (s:), and both really mean the same thing as (). The s, either after a slash or in parentheses before a colon, labels the equation. It makes the output look prettier, and that is all: . use http://www.stata-press.com/data/r13/cancer, clear (Patient Survival in Drug Trial) . ml model lf myweib (studytime died = i.drug age) /s . ml maximize initial: alternative: rescale: rescale eq: Iteration 0: Iteration 1: Iteration 2: Iteration 3: Iteration 4: Iteration 5: Iteration 6: log log log log log log log log log log log likelihood likelihood likelihood likelihood likelihood likelihood likelihood likelihood likelihood likelihood likelihood = = = = = = = = = = = -744 -356.14276 -200.80201 -136.69232 -136.69232 -124.11726 -113.91566 -110.30559 -110.26747 -110.26736 -110.26736 (not concave) Number of obs Wald chi2(3) Prob > chi2 Log likelihood = -110.26736 Coef. = = = 48 35.25 0.0000 Std. Err. z P>|z| [95% Conf. Interval] eq1 drug 2 3 1.012966 1.45917 .2903917 .2821195 3.49 5.17 0.000 0.000 .4438086 .9062261 1.582123 2.012114 age _cons -.0671728 6.060723 .0205688 1.152845 -3.27 5.26 0.001 0.000 -.1074868 3.801188 -.0268587 8.320259 _cons .5573333 .1402154 3.97 0.000 .2825162 .8321504 s Example 3: Method d0 Method-d0 evaluators receive b = (b1 , b2 , . . . , bE ), the coefficient vector, rather than the already evaluated θ1 , θ2 , . . . , θE , and they are required to evaluate the overall log-likelihood ln L rather than ln `j , j = 1, . . . , N . Use mleval to produce the thetas from the coefficient vector. Use mlsum to sum the components that enter into ln L. 1336 ml — Maximum likelihood estimation In the case of Weibull, ln L = P ln `j , and our method-d0 evaluator is program weib0 version 13 args todo b lnf tempvar theta1 theta2 mleval ‘theta1’ = ‘b’, eq(1) mleval ‘theta2’ = ‘b’, eq(2) local t "$ML_y1" local d "$ML_y2" tempvar p M R quietly gen double quietly gen double quietly gen double mlsum ‘lnf’ = -‘M’ // this is just for readability ‘p’ = exp(‘theta2’) ‘M’ = (‘t’*exp(-‘theta1’))^‘p’ ‘R’ = ln(‘t’)-‘theta1’ + ‘d’*(‘theta2’-‘theta1’ + (‘p’-1)*‘R’) end To fit our model using this evaluator, we would type . ml model d0 weib0 (studytime died = i.drug age) /s . ml maximize Technical note P Method d0 does not require ln L = j ln `j , j = 1, . . . , N , as method lf does. Your likelihood function might have independent components only for groups of observations. Panel-data estimators P have a log-likelihood value ln L = i ln Li , where i indexes the P panels, each of which contains multiple observations. 
ConditionalP logistic regression has ln L = k ln Lk , where k indexes the risk pools. Cox regression has ln L = (t) ln L(t) , where (t) denotes the ordered failure times. To evaluate such likelihood functions, first calculate the within-group log-likelihood contributions. This usually involves generate and replace statements prefixed with by, as in tempvar sumd by group: gen double ‘sumd’ = sum($ML_y1) Structure your code so that the log-likelihood contributions are recorded in the last observation of each group. Say that a variable is named ‘cont’. To sum the contributions, code tempvar last quietly by group: gen byte ‘last’ = (_n==_N) mlsum ‘lnf’ = ‘cont’ if ‘last’ You must inform mlsum which observations contain log-likelihood values to be summed. First, you do not want to include intermediate results in the sum. Second, mlsum does not skip missing values. Rather, if mlsum sees a missing value among the contributions, it sets the overall result, ‘lnf’, to missing. That is how ml maximize is informed that the likelihood function could not be evaluated at the particular value of b. ml maximize will then take action to escape from what it thinks is an infeasible area of the likelihood function. P When the likelihood function violates the linear-form restriction ln L = j ln `j , j = 1, . . . , N , with ln `j being a function solely of values within the j th observation, use method d0. In the following examples, we will demonstrate methods d1 and d2 with likelihood functions that meet this linear-form restriction. The d1 and d2 methods themselves do not require the linear-form restriction, but the utility routines mlvecsum and mlmatsum do. Using method d1 or d2 when the restriction is violated is difficult; however, mlmatbysum may be of some help for method-d2 evaluators. ml — Maximum likelihood estimation 1337 Example 4: Method d1 Method-d1 evaluators are required to produce the gradient vector g = ∂ ln L/∂b, as well as the overall log-likelihood value. Using mlvecsum, we can obtain ∂ ln L/∂b from ∂ lnL/∂θi , i = 1, . . . , E . The derivatives of the Weibull log-likelihood function are ∂ ln `j = pj (Mj − dj ) ∂θ1j ∂ ln `j = dj − Rj pj (Mj − dj ) ∂θ2j The method-d1 evaluator for this is program weib1 version 13 args todo b lnf g // g is new tempvar t1 t2 mleval ‘t1’ = ‘b’, eq(1) mleval ‘t2’ = ‘b’, eq(2) local t "$ML_y1" local d "$ML_y2" tempvar quietly quietly quietly p M gen gen gen R double ‘p’ = exp(‘t2’) double ‘M’ = (‘t’*exp(-‘t1’))^‘p’ double ‘R’ = ln(‘t’)-‘t1’ mlsum ‘lnf’ = -‘M’ + ‘d’*(‘t2’-‘t1’ + (‘p’-1)*‘R’) if (‘todo’==0 | ‘lnf’>=.) exit /* <-- new */ tempname d1 d2 mlvecsum ‘lnf’ ‘d1’ = ‘p’*(‘M’-‘d’), eq(1) mlvecsum ‘lnf’ ‘d2’ = ‘d’ - ‘R’*‘p’*(‘M’-‘d’), eq(2) matrix ‘g’ = (‘d1’,‘d2’) /* /* /* /* <-<-<-<-- new new new new */ */ */ */ end We obtained this code by starting with our method-d0 evaluator and then adding the extra lines that method d1 requires. To fit our model using this evaluator, we could type . ml model d1 weib1 (studytime died = drug2 drug3 age) /s . ml maximize but we recommend substituting method d1debug for method d1 and typing . ml model d1debug weib1 (studytime died = drug2 drug3 age) /s . ml maximize Method d1debug will compare the derivatives we calculate with numerical derivatives and thus verify that our program is correct. Once we are certain the program is correct, then we would switch from method d1debug to method d1. 
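A minimal sketch of the full interactive sequence for this evaluator, assuming the cancer data from example 2 are in memory and that indicator variables drug2 and drug3 have been generated from drug (the indicators are an assumption of this sketch), is

. ml model d1debug weib1 (studytime died = drug2 drug3 age) /s
. ml check
. ml search
. ml maximize

ml check exercises the evaluator before estimation begins, and ml search looks for better initial values, just as recommended at the start of Remarks and examples.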
Example 5: Method d2 Method-d2 evaluators are required to produce H = ∂ 2 ln L/∂b∂b0 , the Hessian matrix, as well as the gradient and log-likelihood value. mlmatsum will help calculate ∂ 2 ln L/∂b∂b0 from the second derivatives with respect to θ. For the Weibull model, these second derivatives are 1338 ml — Maximum likelihood estimation ∂ 2 ln `j = −p2j Mj 2 ∂θ1j ∂ 2 ln `j = pj (Mj − dj + Rj pj Mj ) ∂θ1j ∂θ2j ∂ 2 ln `j = −pj Rj (Rj pj Mj + Mj − dj ) 2 ∂θ2j The method-d2 evaluator is program weib2 version 13 args todo b lnf g H // H added tempvar t1 t2 mleval ‘t1’ = ‘b’, eq(1) mleval ‘t2’ = ‘b’, eq(2) local t "$ML_y1" local d "$ML_y2" tempvar p M R quietly gen double ‘p’ = exp(‘t2’) quietly gen double ‘M’ = (‘t’*exp(-‘t1’))^‘p’ quietly gen double ‘R’ = ln(‘t’)-‘t1’ mlsum ‘lnf’ = -‘M’ + ‘d’*(‘t2’-‘t1’ + (‘p’-1)*‘R’) if (‘todo’==0 | ‘lnf’>=.) exit tempname d1 d2 mlvecsum ‘lnf’ ‘d1’ = ‘p’*(‘M’-‘d’), eq(1) mlvecsum ‘lnf’ ‘d2’ = ‘d’ - ‘R’*‘p’*(‘M’-‘d’), eq(2) matrix ‘g’ = (‘d1’,‘d2’) if (‘todo’==1 | ‘lnf’>=.) exit // new from here down tempname d11 d12 d22 mlmatsum ‘lnf’ ‘d11’ = -‘p’^2 * ‘M’, eq(1) mlmatsum ‘lnf’ ‘d12’ = ‘p’*(‘M’-‘d’ + ‘R’*‘p’*‘M’), eq(1,2) mlmatsum ‘lnf’ ‘d22’ = -‘p’*‘R’*(‘R’*‘p’*‘M’ + ‘M’ - ‘d’) , eq(2) matrix ‘H’ = (‘d11’,‘d12’ \ ‘d12’’,‘d22’) end We started with our previous method-d1 evaluator and added the lines that method d2 requires. We could now fit a model by typing . ml model d2 weib2 (studytime died = drug2 drug3 age) /s . ml maximize but we would recommend substituting method d2debug for method d2 and typing . ml model d2debug weib2 (studytime died = drug2 drug3 age) /s . ml maximize Method d2debug will compare the first and second derivatives we calculate with numerical derivatives and thus verify that our program is correct. Once we are certain the program is correct, then we would switch from method d2debug to method d2. As we stated earlier, to produce the robust variance estimator with method lf, there is nothing to do except specify vce(robust), vce(cluster clustvar), or pweight. For methods d0, d1, and d2, these options do not work. If your likelihood function meets the linear-form restrictions, you can use methods lf0, lf1, and lf2, then these options will work. The equation scores are defined as ∂ ln `j ∂ ln `j , , ... ∂θ1j ∂θ2j ml — Maximum likelihood estimation 1339 Your evaluator will be passed variables, one for each equation, which you fill in with the equation scores. For both method lf1 and lf2, these variables are passed in the fourth and subsequent positions of the argument list. That is, you must process the arguments as args todo b lnf g1 g2 ... H Note that for method lf1, the ‘H’ argument is not used and can be ignored. Example 6: Robust variance estimates If you have used mlvecsum in your evaluator of method d1 or d2, it is easy to turn it into evaluator of method lf1 or lf2 that allows the computation of the robust variance estimator. The expression that you specified on the right-hand side of mlvecsum is the equation score. Here we turn the program that we gave earlier in the method-d1 example into a method-lf1 evaluator that allows vce(robust), vce(cluster clustvar), or pweight. 
program weib1 version 13 args todo b lnfj g1 g2 tempvar t1 t2 mleval ‘t1’ = ‘b’, eq(1) mleval ‘t2’ = ‘b’, eq(2) local t "$ML_y1" local d "$ML_y2" tempvar p M R quietly gen double ‘p’ = quietly gen double ‘M’ = quietly gen double ‘R’ = quietly replace ‘lnfj’ = if (‘todo’==0) exit // g1 and g2 are new exp(‘t2’) (‘t’*exp(-‘t1’))^‘p’ ln(‘t’)-‘t1’ -‘M’ + ‘d’*(‘t2’-‘t1’ + (‘p’-1)*‘R’) quietly replace ‘g1’ = ‘p’*(‘M’-‘d’) quietly replace ‘g2’ = ‘d’ - ‘R’*‘p’*(‘M’-‘d’) /* <-- new /* <-- new */ */ end To fit our model and get the robust variance estimates, we type . ml model lf1 weib1 (studytime died = drug2 drug3 age) /s, vce(robust) . ml maximize Survey options and ml ml can handle stratification, poststratification, multiple stages of clustering, and finite population corrections. Specifying the svy option implies that the data come from a survey design and also implies that the survey linearized variance estimator is to be used; see [SVY] variance estimation. Example 7 Suppose that we are interested in a probit analysis of data from a survey in which q1 is the answer to a yes/no question and x1, x2, x3 are demographic responses. The following is a lf2 evaluator for the probit model that meets the requirements for vce(robust) (linear form and computes the scores). 1340 ml — Maximum likelihood estimation program mylf2probit version 13 args todo b lnfj g1 H tempvar z Fz lnf mleval ‘z’ = ‘b’ quietly gen double ‘Fz’ = normal( ‘z’) if $ML_y1 == quietly replace ‘Fz’ = normal(-‘z’) if $ML_y1 == quietly replace ‘lnfj’ = log(‘Fz’) if (‘todo’==0) exit quietly replace ‘g1’ = normalden(‘z’)/‘Fz’ if $ML_y1 quietly replace ‘g1’ = -normalden(‘z’)/‘Fz’ if $ML_y1 if (‘todo’==1) exit mlmatsum ‘lnf’ ‘H’ = -‘g1’*(‘g1’+‘z’), eq(1,1) end 1 0 == 1 == 0 To fit a model, we svyset the data, then use svy with ml. . svyset psuid [pw=w], strata(strid) . ml model lf2 mylf2probit (q1 = x1 x2 x3), svy . ml maximize We could also use the subpop() option to make inferences about the subpopulation identified by the variable sub: . svyset psuid [pw=w], strata(strid) . ml model lf2 mylf2probit (q1 = x1 x2 x3), svy subpop(sub) . ml maximize Stored results For results stored by ml without the svy option, see [R] maximize. For results stored by ml with the svy option, see [SVY] svy. Methods and formulas ml is implemented using moptimize(); see [M-5] moptimize( ). References Gould, W. W., J. S. Pitblado, and B. P. Poi. 2010. Maximum Likelihood Estimation with Stata. 4th ed. College Station, TX: Stata Press. Korn, E. L., and B. I. Graubard. 1990. Simultaneous testing of regression coefficients with complex survey data: Use of Bonferroni t statistics. American Statistician 44: 270–276. Royston, P. 2007. Profile likelihood for estimation and confidence intervals. Stata Journal 7: 376–387. Also see [R] maximize — Details of iterative maximization [R] mlexp — Maximum likelihood estimation of user-specified expressions [R] nl — Nonlinear least-squares estimation [M-5] moptimize( ) — Model optimization [M-5] optimize( ) — Function optimization Title mlexp — Maximum likelihood estimation of user-specified expressions Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax mlexp ( ) . . . if in weight , options where is a substitutable expression representing the log-likelihood function. 
options Description Model variables(varlist) from(initial values) specify variables in model specify initial values for parameters Derivatives derivative(/name = ) specify derivative of with respect to parameter name; can be specified more than once SE/Robust vce(vcetype) vcetype may be oim, opg, robust, cluster clustvar, bootstrap, or jackknife Reporting level(#) title(string) title2(string) display options set confidence level; default is level(95) display string as title above the table of parameter estimates display string as subtitle control column formats Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics may contain time-series operators; see [U] 13.9 Time-series operators. bootstrap, by, jackknife, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. aweights, fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. and are extensions of valid Stata expressions that also contain parameters to be estimated. The parameters are enclosed in curly braces and must otherwise satisfy the naming requirements for variables; {beta} is an example of a parameter. Also allowed is a notation of the 1341 1342 mlexp — Maximum likelihood estimation of user-specified expressions form { :varlist} for linear combinations of multiple covariates and their parameters. For example, {xb: mpg price turn} defines a linear combination of the variables mpg, price, and turn. See Substitutable expressions under Remarks and examples below. Menu Statistics > Other > Maximum likelihood estimation of expression Description mlexp performs maximum likelihood estimation of models that satisfy the linear-form restrictions, which is to say models for which you can write down the log likelihood for an individual observation and for which the overall log likelihood is simply the sum of the individual observations’ log likelihoods. You express the observation-level log-likelihood function by using a substitutable expression. Unlike models fit using ml, you do not need to do any programming. However, ml can fit classes of models that cannot be fit by mlexp. Options Model variables(varlist) specifies the variables in the model. mlexp ignores observations for which any of these variables has missing values. If you do not specify variables(), then mlexp assumes all the observations are valid. If the log likelihood cannot be calculated at the initial values for any observation, mlexp will exit with an error message. from(initial values) specifies the initial values to begin the estimation. You can specify a 1 × k matrix, where k is the number of parameters in the model, or you can specify parameter names and values. For example, to initialize alpha to 1.23 and delta to 4.57, you would type mlexp ..., from(alpha=1.23 delta=4.57) ... Initial values declared using this option override any that are declared within substitutable expressions. If you specify a parameter that does not appear in your model, mlexp exits with an error. If you specify a matrix, the values must be in the same order in which the parameters are declared in your model. mlexp ignores the row and column names of the matrix. 
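As a small sketch of the matrix form of from() (the variable x is hypothetical; the log-likelihood expression is the exponential model used under Substitutable expressions below), the initial value for a single-parameter model could be supplied as a 1 × 1 matrix:

. matrix b0 = (0.5)
. mlexp (ln({lambda}) - {lambda}*x), from(b0) variables(x)

Because a matrix is given, the value 0.5 is assigned to the first (and here only) parameter declared in the expression; the matrix's row and column names are ignored.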
Derivatives derivative(/name = ) specifies the derivative of the observation-level log-likelihood function with respect to parameter name. uses the same substitutable expression syntax as is used to specify the log-likelihood function. If you declare a linear combination in the log-likelihood function, you provide the derivative for the linear combination; mlexp then applies the chain rule for you. See Specifying derivatives under Remarks and examples below for examples. If you do not specify the derivative() option, mlexp calculates derivatives numerically. You must either specify no derivatives or specify all the derivatives; you cannot specify some analytic derivatives and have mlexp compute the rest numerically. If you are estimating multiple parameters, you supply derivatives using multiple derivative() specifications. mlexp — Maximum likelihood estimation of user-specified expressions 1343 SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. title(string) specifies an optional title that will be displayed just above the table of parameter estimates. title2(string) specifies an optional subtitle that will be displayed between the title specified in title() and the table of parameter estimates. If title2() is specified but title() is not, then title2() has the same effect as title(). display options: cformat(% fmt), pformat(% fmt), and sformat(% fmt); see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), and nonrtolerance; see [R] maximize. These options are seldom used. The following option is available with mlexp but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Remarks are presented under the following headings: Introduction Substitutable expressions Parameter constraints Specifying derivatives Introduction mlexp performs maximum likelihood estimation of models that satisfy the linear-form restrictions, which is to say models for which you can write down the log likelihood for a single observation and for which the overall log likelihood is simply the sum of the individual observations’ log likelihoods. Models designed for use with cross-sectional data usually meet the linear-form restrictions, including linear regression, many discrete-choice models, limited-dependent-variable models, and selection models. Examples of models that do not satisfy the linear-form restrictions are random-effects paneldata models (because the likelihood function is defined at the panel level) and Cox proportional hazards models (because the likelihood function is defined for risk sets). Because of its straightforward syntax and accessibility from the menu system, mlexp is particularly suited to users who are new to Stata and to those using Stata for pedagogical purposes. You express the observation-level log-likelihood function by using a substitutable expression, which we explain below. Unlike models fit using ml, you do not need to do any programming. 
However, ml can fit classes of models that cannot be fit by mlexp, including those that do not meet the linear-form restrictions. 1344 mlexp — Maximum likelihood estimation of user-specified expressions Substitutable expressions You specify the log-likelihood function that mlexp is to maximize by using substitutable expressions that are similar to those used by nl, nlsur, and gmm. You specify substitutable expressions just as you would specify any other mathematical expression involving scalars and variables, such as those expressions you would use with Stata’s generate command, except that the parameters to be estimated are bound in braces. See [U] 13.2 Operators and [U] 13.3 Functions for more information on expressions. Parameter names must follow the same conventions as variable names. See [U] 11.3 Naming conventions. For example, say that you have observations on variable x, and the log likelihood for the ith observation is ln`i = lnλ − λxi where λ is a parameter to be estimated. Then you would type . mlexp (ln({lambda}) - {lambda}*x) Because λ is a parameter, we enclosed it in braces. To specify initial values for a parameter, you can include an equal sign and the initial value after the parameter; for example, . mlexp (ln({lambda = 0.75}) - {lambda}*x) would initialize λ to be 0.75. If you do not initialize a parameter, mlexp initializes it to zero. Frequently, even nonlinear functions contain linear combinations of variables. Continuing the previous example, say that we want to parameterize λ as λi = α1 ui + α2 vi where u and v are variables in the dataset. Instead of typing . mlexp (ln({alpha1}*u + {alpha2}*v) - ({alpha1}*u + {alpha2}*v)*x) you can instead type . mlexp (ln({lambda: u v}) - {lambda:}*x) The notation {lambda: u v} indicates to mlexp that you want a linear combination of the variables u and v. We named the linear combination lambda, so mlexp will name the parameters for the two variables lambda u and lambda v, respectively. Once you have declared a linear combination, you can subsequently refer to the linear combination by specifying its name and a colon inside braces, as we did with this example. You cannot use the same name for both an individual parameter and a linear combination. However, after a linear combination has been declared, you can refer to the parameter of an individual variable within that linear combination by using the notation {lc z }, where lc is the name of the linear combination and z is the variable whose parameter you want to reference. Linear combinations do not include a constant term. There are three rules to follow when defining substitutable expressions: 1. Parameters of the model are bound in braces: {b0}, {param}, etc. 2. Initial values for parameters are given by including an equal sign and the initial value inside the braces: {b0=1}, {param=3.571}, etc. 3. Linear combinations of variables can be included using the notation {eqname:varlist}: {xb: mpg price weight}, {score: w x z}, etc. Parameters of linear combinations are initialized to zero. mlexp — Maximum likelihood estimation of user-specified expressions 1345 If you specify initial values by using the from() option, they override whatever initial values are given within the substitutable expression. Substitutable expressions are so named because once values are assigned to the parameters, the resulting expressions can be handled by generate and replace. 
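For instance, here is a sketch of a Poisson log likelihood that uses all three rules; the variables y, x1, and x2 and the names xb and b0 are hypothetical. The separate parameter {b0} supplies the constant term that the linear combination {xb:} does not include:

. mlexp (-exp({xb: x1 x2} + {b0}) + y*({xb:} + {b0}) - lnfactorial(y))

mlexp would estimate the parameters xb_x1, xb_x2, and b0, all initialized to zero because no initial values are given in the braces.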
Regardless of whether you specify initial values, mlexp performs a search procedure for better starting values before commencing the first iteration of the maximization routine. If you specify initial values, the search procedure tries to improve upon those values. Otherwise, the search procedure begins with all parameters set to zero. Example 1: The gamma density function The two-parameter gamma density function for y ≥ 0 is f (y) = λP exp(−λy)y P −1 Γ(P ) λ > 0, P > 0 so that the log likelihood for the ith observation is ln`i = P lnλ − lnΓ(P ) − λyi + (P − 1) lnyi The dataset greenegamma.dta, based on Greene (2012, 460–461), contains 20 observations drawn randomly from the two-parameter gamma distribution. We want to estimate the parameters of that distribution. We type . use http://www.stata-press.com/data/r13/greenegamma . mlexp ({P}*ln({lambda}) - lngamma({P}) - {lambda}*y + ({P}-1)*ln(y)) initial: log likelihood = - (could not be evaluated) feasible: log likelihood = -363.37264 rescale: log likelihood = -153.09898 rescale eq: log likelihood = -88.863468 Iteration 0: log likelihood = -88.863468 Iteration 1: log likelihood = -85.405011 Iteration 2: log likelihood = -85.375857 Iteration 3: log likelihood = -85.375669 Iteration 4: log likelihood = -85.375669 Maximum likelihood estimation Log likelihood = -85.375669 Coef. /P /lambda 2.410602 .0770702 Number of obs = 20 Std. Err. z P>|z| [95% Conf. Interval] .7158452 .0254361 3.37 3.03 0.001 0.002 1.007571 .0272164 3.813633 .1269241 In our substitutable expression for the log-likelihood function, we enclosed the two parameters of our model, P and λ, in curly braces. We used the lngamma() function to compute lnΓ(P ) because Stata (unlike Mata) does not have a built-in function to compute Γ(P ), and numerical algorithms for computing lnΓ(P ) directly are more accurate than taking the natural logarithm of Γ(P ), anyway. Because we did not specify initial values, mlexp initialized P and λ to be zero. When both parameters are zero, the log-likelihood function cannot be evaluated because ln(0) is undefined. Therefore, in the iteration log above the coefficient table, we see that mlexp reported the initial log likelihood to be - (could not be evaluated). mlexp uses a search routine to find alternative initial values that do allow the log-likelihood function to be calculated. 1346 mlexp — Maximum likelihood estimation of user-specified expressions Example 2: Obtaining alternative VCEs In example 1, by default mlexp reported standard errors based on the observed information matrix of the log-likelihood function. See [R] vce option for an overview or Gould, Pitblado, and Poi (2010) for an in-depth discussion of different ways of obtaining the VCE in maximum likelihood estimation. With mlexp, we can use the vce() option to obtain standard errors based on alternative VCEs. For example, to obtain the outer product of gradients (OPG) standard errors, we type . mlexp ({P}*ln({lambda}) - lngamma({P}) - {lambda}*y + ({P}-1)*ln(y)), vce(opg) initial: log likelihood = - (could not be evaluated) feasible: log likelihood = -363.37264 rescale: log likelihood = -153.09898 rescale eq: log likelihood = -88.863468 Iteration 0: log likelihood = -88.863468 Iteration 1: log likelihood = -85.405011 Iteration 2: log likelihood = -85.375857 Iteration 3: log likelihood = -85.375669 Iteration 4: log likelihood = -85.375669 Maximum likelihood estimation Log likelihood = -85.375669 Coef. /P /lambda 2.410602 .0770702 Number of obs = 20 OPG Std. Err. z P>|z| [95% Conf. 
Interval] .8768255 .0270771 2.75 2.85 0.006 0.004 .6920557 .0240001 4.129149 .1301404 Parameter constraints In examples 1 and 2, we were lucky. The two-parameter gamma density function is defined only when both λ and P are positive. However, mlexp does not know this; when maximizing the log-likelihood function, it will consider all real values for the parameters. Rather than relying on luck, we should instead reparameterize our model so that we avoid having to directly estimate parameters that are restricted. For example, consider the parameter λ > 0, and suppose we define the new parameter θ = ln(λ) so that λ = exp(θ). With this parameterization, for any real value of θ that mlexp might try to use when evaluating the log-likelihood function, λ is guaranteed to be positive. Example 3: Classical normal regression In the classical normal variant of linear regression (Goldberger 1991, chap. 19), we assume not only that yi = x0i β + i but also that xi is nonstochastic and that i is distributed independently and identically normal with mean zero and variance σ 2 . This is equivalent to assuming that yi |xi ∼ N (x0i β, σ 2 ) Using the properties of the normal distribution, we could write the log likelihood for the ith observation as 1 yi − x0i β ln`i = ln φ σ σ mlexp — Maximum likelihood estimation of user-specified expressions 1347 where φ(·) is the standard normal density function. In Stata, we use the three-argument version of the normalden() function and directly specify the conditional mean (x0i β) and standard deviation (σ) as the additional arguments. The normal density function is defined only for σ > 0, so instead of estimating σ directly, we will instead estimate the unconstrained parameter θ and let σ = exp(θ). For any real value of θ, this transformation ensures that σ > 0. Using auto.dta, say that we want to fit the classical normal regression mpgi = β0 + β1 weighti + i We type . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . mlexp (ln(normalden(mpg, {b0} initial: log likelihood = feasible: log likelihood = rescale: log likelihood = rescale eq: log likelihood = Iteration 0: log likelihood = (output omitted ) Iteration 13: log likelihood = + {b1}*weight, exp({theta})))) - (could not be evaluated) -882.02886 -882.02886 -274.09391 -274.09391 (not concave) -195.38869 Maximum likelihood estimation Log likelihood = -195.38869 Coef. /b0 /b1 /theta 39.44028 -.0060087 1.221449 Number of obs Std. Err. 1.592043 .0005108 .0821995 z 24.77 -11.76 14.86 P>|z| 0.000 0.000 0.000 = 74 [95% Conf. Interval] 36.31993 -.0070099 1.060341 42.56063 -.0050075 1.382557 To recover our estimate of σ , we can use nlcom: . nlcom (sigma: exp(_b[/theta])) sigma: exp(_b[/theta]) Coef. sigma 3.392099 Std. Err. .2788288 z 12.17 P>|z| [95% Conf. Interval] 0.000 2.845605 3.938594 In the previous example, we named the unconstrained parameter θ theta. In actual practice, however, we would generally name that parameter lnsigma to indicate that it is ln(σ). Two other parameter restrictions often appear in maximum likelihood estimation. Consider the correlation coefficient ρ. In general, it must be true that −1 < ρ < 1. Define the parameter η = tanh−1 (ρ), where tanh−1 (·) is the hyperbolic arctangent function. Then ρ = tanh(η), and by the properties of the hyperbolic tangent function, for any real value of η , we will have −1 < ρ < 1. Stata has the built-in function tanh(), so recovering ρ from η is easy. 
In practice, instead of naming the unconstrained parameter eta in our likelihood expression, we would name it atanhrho to remind us that it is the hyperbolic arctangent of ρ. 1348 mlexp — Maximum likelihood estimation of user-specified expressions Other parameters, such as those that represent probabilities or ratios of variances, are often restricted to be between 0 and 1. Say that we have the restriction 0 < κ < 1. Consider the parameter ψ = ln {κ/(1 − κ)}. Then κ = eψ /(1 + eψ ) and for any real value of ψ , we have 0 < κ < 1. The formula for κ in terms of ψ is known as the inverse logit transformation and is available as invlogit() in Stata. Thus in our likelihood expression, we would code invlogit({logitk}) to map the unconstrained parameter logitk into the (0, 1) interval. Specifying derivatives By default, mlexp calculates derivatives of the log-likelihood function numerically using a sophisticated algorithm that produces accurate results. However, mlexp will fit your model more quickly (and even more accurately) if you specify analytic derivatives. You specify derivatives by using substitutable expressions in much the same way as you specify the log-likelihood function. If you specify a linear combination in your log-likelihood function, then you supply a derivative with respect to that linear combination; mlexp then uses the chain rule to obtain the derivatives with respect to the individual parameters. We will illustrate how to specify derivatives using the probit model for dichotomous outcomes. The log likelihood for the probit model is often written as ln`i = lnΦ(x0i β) yi = 1 lnΦ(−x0i β) yi = 0 using the fact that 1 − Φ(x0i β) = Φ(−x0i β), where Φ(·) is the cumulative standard normal distribution function. If we use the trick suggested by Greene (2012, 691, fn. 7), we can simplify the log-likelihood function, making the derivative calculation easier. Let qi = 2yi − 1. Then we can write the loglikelihood function as ln`i = lnΦ(qi x0i β) (1) and the first derivative as ∂ ln`i qi φ(qi x0i β) = xi ∂β Φ(qi x0i β) (2) Example 4: Probit with one regressor Say that we want to fit a probit model of foreign on mpg and a constant term. We have two parameters, so we will need to specify two derivatives; xi consists of the ith observation on mpg and a 1 as the constant term. Because the term qi x0i β will appear several times in our command, we create a macro to store it. We type mlexp — Maximum likelihood estimation of user-specified expressions 1349 . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . generate q = 2*foreign - 1 . global qxb "q*({b1}*mpg + {b0})" . mlexp (ln(normal($qxb))), derivative(/b1 = q*normalden($qxb)/normal($qxb)*mpg) > deriv(/b0 = q*normalden($qxb)/normal($qxb)) initial: log likelihood = -51.292891 alternative: log likelihood = -2017.3105 rescale: log likelihood = -47.888213 rescale eq: log likelihood = -46.343247 Iteration 0: log likelihood = -46.343247 Iteration 1: log likelihood = -39.268764 Iteration 2: log likelihood = -39.258972 Iteration 3: log likelihood = -39.258972 Maximum likelihood estimation Log likelihood = -39.258972 Number of obs = 74 Coef. /b1 /b0 .0960601 -2.635268 Std. Err. .0301523 .6841462 z 3.19 -3.85 P>|z| [95% Conf. Interval] 0.001 0.000 .0369627 -3.97617 .1551575 -1.294366 When you specify a linear combination of variables, you specify the derivative with respect to the linear combination. That way, if you change the variables that comprise the linear combination, you do not need to change the derivative at all. 
To see why this is the case, consider the function f (x0i β), where x0i β is a linear combination. Then, using the chain rule, ∂f (x0i β) ∂f (x0i β) ∂x0i β ∂f (x0i β) × × xij = = ∂βj ∂x0i β ∂βj ∂x0i β Once the derivative with respect to the linear combination is known, mlexp can then multiply it by each of the variables in the linear combination to get the full set of derivatives with respect to the parameters needed to maximize the likelihood function. Moreover, the derivative with respect to the linear combination does not depend on the variables within the linear combination, so even if you change the variables in it, you will not need to modify the specification of the corresponding derivative() option. Example 5: Probit with a linear combination Now let’s fit a probit model of foreign on mpg and gear ratio. We could specify the parameters and independent variables individually, but we will use a linear combination instead. First, note that ∂ ln`i qi φ(qi x0i β) = ∂x0i β Φ(qi x0i β) 1350 mlexp — Maximum likelihood estimation of user-specified expressions We type . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . generate q = 2*foreign - 1 . global qxb "q*({xb:} + {b0})" . mlexp (ln(normal(q*({xb:mpg gear_ratio}+{b0})))), > deriv(/xb = q*normalden($qxb)/normal($qxb)) > deriv(/b0 = q*normalden($qxb)/normal($qxb)) initial: log likelihood = -51.292891 alternative: log likelihood = -2556.2172 rescale: log likelihood = -47.865271 rescale eq: log likelihood = -46.658776 Iteration 0: log likelihood = -46.658776 Iteration 1: log likelihood = -22.541058 Iteration 2: log likelihood = -21.467371 Iteration 3: log likelihood = -21.454446 Iteration 4: log likelihood = -21.454436 Iteration 5: log likelihood = -21.454436 Maximum likelihood estimation Log likelihood = -21.454436 Number of obs Coef. /xb_mpg /xb_gear_r~o /b0 -.0282433 3.699635 -11.57588 Std. Err. .0464514 .8368276 2.337239 z -0.61 4.42 -4.95 P>|z| 0.543 0.000 0.000 = 74 [95% Conf. Interval] -.1192864 2.059483 -16.15678 .0627998 5.339787 -6.994972 We first redefined our global macro $qxb to contain the linear combination xb and a constant term b0. More importantly, we did not specify the variables in the linear combination just yet. Instead, we will use $qxb after we explicitly declare the variables in xb when we specify our model. To avoid making mistakes, you can declare the variables in a linear combination only once when you set up your model. If we had declared the variables when we defined $qxb, we would have received an error because, upon substituting for $qxb, we would have declared the variables multiple times in our call to mlexp. 
mlexp — Maximum likelihood estimation of user-specified expressions Stored results mlexp stores the following in e(): Scalars e(N) e(k) e(k aux) e(k eq) e(k eq model) e(df m) e(ll) e(N clust) e(rank) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(lexp) e(wtype) e(wexp) e(usrtitle) e(usrtitle2) e(vce) e(vcetype) e(params) e(hasderiv) e(d j ) e(rhs) e(opt) e(ml method) e(technique) e(singularHmethod) e(crittype) e(properties) e(estat cmd) e(predict) e(marginsnotok) e(marginsprop) Matrices e(b) e(ilog) e(init) e(gradient) e(V) e(V modelbased) Functions e(sample) number of observations number of parameters number of ancillary parameters number of equations in e(b) number of equations in overall model test model degrees of freedom log likelihood number of clusters rank of e(V) number of iterations return code 1 if converged, 0 otherwise mlexp command as typed likelihood expression weight type weight expression user-specified title user-specified secondary title vcetype specified in vce() title used to label Std. Err. names of parameters yes, if derivative() is specified derivative expression for parameter j contents of variables() type of optimization type of ml method maximization technique m-marquardt or hybrid; method used when Hessian is singular1 optimization criterion1 b V program used to implement estat program used to implement predict predictions disallowed by margins signals to the margins command coefficient vector iteration log (up to 20 iterations) initial values gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample 1. Type ereturn list, all to view these results; see [P] return. 1351 1352 mlexp — Maximum likelihood estimation of user-specified expressions Methods and formulas Optimization is carried out using moptimize(); see [M-5] moptimize( ). References Goldberger, A. S. 1991. A Course in Econometrics. Cambridge, MA: Harvard University Press. Gould, W. W., J. S. Pitblado, and B. P. Poi. 2010. Maximum Likelihood Estimation with Stata. 4th ed. College Station, TX: Stata Press. Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. Also see [R] mlexp postestimation — Postestimation tools for mlexp [R] gmm — Generalized method of moments estimation [R] maximize — Details of iterative maximization [R] ml — Maximum likelihood estimation [R] nl — Nonlinear least-squares estimation [R] nlsur — Estimation of nonlinear systems of equations Title mlexp postestimation — Postestimation tools for mlexp Description Syntax for predict Menu for predict Option for predict Also see Description The following postestimation commands are available after mlexp: Command Description estat ic estat summarize estat vce estimates lincom Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients scores point estimates, standard errors, testing, and inference for generalized predictions seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest1 nlcom predict predictnl suest test testnl 1 lrtest is not appropriate with svy estimation results. Syntax for predict predict type stub* | newvar1 . . . 
newvark if in , scores This statistic is only available for observations within the estimation sample. k represents the number of parameters in the model. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Option for predict scores, the default, calculates the equation-level score variables. The j th new variable will contain the scores for the j th parameter of the model. Linear combinations are expanded prior to computing scores, so each variable’s parameter will have its own score variable. 1353 1354 mlexp postestimation — Postestimation tools for mlexp Also see [R] mlexp — Maximum likelihood estimation of user-specified expressions [U] 20 Estimation and postestimation commands Title mlogit — Multinomial (polytomous) logistic regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax mlogit depvar options indepvars if in weight , options Description Model noconstant baseoutcome(#) constraints(clist) collinear suppress constant term value of depvar that will be the base outcome apply specified linear constraints; clist has the form # -# , # -# . . . keep collinear variables SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, bootstrap, or jackknife Reporting level(#) rrr nocnsreport display options set confidence level; default is level(95) report relative-risk ratios do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics indepvars may contain factor variables; see [U] 11.4.3 Factor variables. indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, mfp, mi estimate, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce() and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 1355 1356 mlogit — Multinomial (polytomous) logistic regression Menu Statistics > Categorical outcomes > Multinomial logistic regression Description mlogit fits maximum-likelihood multinomial logit models, also known as polytomous logistic regression. You can define constraints to perform constrained estimation. Some people refer to conditional logistic regression as multinomial logit. If you are one of them, see [R] clogit. See [R] logistic for a list of related estimation commands. Options Model noconstant; see [R] estimation options. baseoutcome(#) specifies the value of depvar to be treated as the base outcome. The default is to choose the most frequent outcome. constraints(clist), collinear; see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. 
If specifying vce(bootstrap) or vce(jackknife), you must also specify baseoutcome().

Reporting

level(#); see [R] estimation options.

rrr reports the estimated coefficients transformed to relative-risk ratios, that is, e^b rather than b; see Description of the model below for an explanation of this concept. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated. rrr may be specified at estimation or when replaying previously estimated results.

nocnsreport; see [R] estimation options.

display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(%fmt), pformat(%fmt), sformat(%fmt), and nolstretch; see [R] estimation options.

Maximization

maximize options: difficult, technique(algorithm spec), iterate(#), [no]log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used.

The following option is available with mlogit but is not shown in the dialog box:

coeflegend; see [R] estimation options.

Remarks and examples

Remarks are presented under the following headings:
    Description of the model
    Fitting unconstrained models
    Fitting constrained models

mlogit fits maximum likelihood models with discrete dependent (left-hand-side) variables when the dependent variable takes on more than two outcomes and the outcomes have no natural ordering. If the dependent variable takes on only two outcomes, estimates are identical to those produced by logistic or logit; see [R] logistic or [R] logit. If the outcomes are ordered, see [R] ologit.

Description of the model

For an introduction to multinomial logit models, see Greene (2012, 763–766), Hosmer, Lemeshow, and Sturdivant (2013, 269–289), Long (1997, chap. 6), Long and Freese (2014, chap. 8), and Treiman (2009, 336–341). For a description emphasizing the difference in assumptions and data requirements for conditional and multinomial logit, see Davidson and MacKinnon (1993).

Consider the outcomes 1, 2, 3, . . . , m recorded in y, and the explanatory variables X. Assume that there are m = 3 outcomes: "buy an American car", "buy a Japanese car", and "buy a European car". The values of y are then said to be "unordered". Even though the outcomes are coded 1, 2, and 3, the numerical values are arbitrary because 1 < 2 < 3 does not imply that outcome 1 (buy American) is less than outcome 2 (buy Japanese) is less than outcome 3 (buy European). This unordered categorical property of y distinguishes the use of mlogit from regress (which is appropriate for a continuous dependent variable), from ologit (which is appropriate for ordered categorical data), and from logit (which is appropriate for two outcomes, which can be thought of as ordered).

In the multinomial logit model, you estimate a set of coefficients, β(1), β(2), and β(3), corresponding to each outcome:

        Pr(y = 1) = e^{Xβ(1)} / (e^{Xβ(1)} + e^{Xβ(2)} + e^{Xβ(3)})

        Pr(y = 2) = e^{Xβ(2)} / (e^{Xβ(1)} + e^{Xβ(2)} + e^{Xβ(3)})

        Pr(y = 3) = e^{Xβ(3)} / (e^{Xβ(1)} + e^{Xβ(2)} + e^{Xβ(3)})

The model, however, is unidentified in the sense that there is more than one solution to β(1), β(2), and β(3) that leads to the same probabilities for y = 1, y = 2, and y = 3. To identify the model, you arbitrarily set one of β(1), β(2), or β(3) to 0 — it does not matter which.
That is, if you arbitrarily set β(1) = 0, the remaining coefficients β(2) and β(3) will measure the change relative to the y = 1 group. If you instead set β(2) = 0, the remaining coefficients β(1) and β(3) will measure the change relative to the y = 2 group. The coefficients will differ because they have different interpretations, but the predicted probabilities for y = 1, 2, and 3 will still be the same. Thus either parameterization will be a solution to the same underlying model.

Setting β(1) = 0, the equations become

        Pr(y = 1) = 1 / (1 + e^{Xβ(2)} + e^{Xβ(3)})

        Pr(y = 2) = e^{Xβ(2)} / (1 + e^{Xβ(2)} + e^{Xβ(3)})

        Pr(y = 3) = e^{Xβ(3)} / (1 + e^{Xβ(2)} + e^{Xβ(3)})

The relative probability of y = 2 to the base outcome is

        Pr(y = 2)/Pr(y = 1) = e^{Xβ(2)}

Let's call this ratio the relative risk, and let's further assume that X and β(2) are vectors equal to (x1, x2, . . . , xk) and (β1(2), β2(2), . . . , βk(2))′, respectively. The ratio of the relative risk for a one-unit change in xi is then

        e^{β1(2)x1 + ··· + βi(2)(xi+1) + ··· + βk(2)xk} / e^{β1(2)x1 + ··· + βi(2)xi + ··· + βk(2)xk} = e^{βi(2)}

Thus the exponentiated value of a coefficient is the relative-risk ratio for a one-unit change in the corresponding variable (risk is measured as the risk of the outcome relative to the base outcome).

Fitting unconstrained models

Example 1: A first example

We have data on the type of health insurance available to 616 psychologically depressed subjects in the United States (Tarlov et al. 1989; Wells et al. 1989). The insurance is categorized as either an indemnity plan (that is, regular fee-for-service insurance, which may have a deductible or coinsurance rate) or a prepaid plan (a fixed up-front payment allowing subsequent unlimited use as provided, for instance, by an HMO). The third possibility is that the subject has no insurance whatsoever. We wish to explore the demographic factors associated with each subject's insurance choice. One of the demographic factors in our data is the race of the participant, coded as white or nonwhite:

    . use http://www.stata-press.com/data/r13/sysdsn1
    (Health insurance data)
    . tabulate insure nonwhite, chi2 col

    Key:  frequency
          column percentage

                        nonwhite
       insure          0          1      Total
    Indemnity        251         43        294
                   50.71      35.54      47.73
      Prepaid        208         69        277
                   42.02      57.02      44.97
     Uninsure         36          9         45
                    7.27       7.44       7.31
        Total        495        121        616
                  100.00     100.00     100.00

        Pearson chi2(2) =   9.5599   Pr = 0.008

Although insure appears to take on the values Indemnity, Prepaid, and Uninsure, it actually takes on the values 1, 2, and 3. The words appear because we have associated a value label with the numeric variable insure; see [U] 12.6.3 Value labels.

When we fit a multinomial logit model, we can tell mlogit which outcome to use as the base outcome, or we can let mlogit choose. To fit a model of insure on nonwhite, letting mlogit choose the base outcome, we type

    . mlogit insure nonwhite
    Iteration 0:   log likelihood = -556.59502
    Iteration 1:   log likelihood = -551.78935
    Iteration 2:   log likelihood = -551.78348
    Iteration 3:   log likelihood = -551.78348
    Multinomial logistic regression             Number of obs =    616
                                                LR chi2(2)    =   9.62
                                                Prob > chi2   = 0.0081
    Log likelihood = -551.78348                 Pseudo R2     = 0.0086

       insure        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    Indemnity    (base outcome)
    Prepaid
     nonwhite    .6608212    .2157321     3.06   0.002      .2379942    1.083648
        _cons   -.1879149    .0937644    -2.00   0.045     -.3716896   -.0041401
    Uninsure
     nonwhite    .3779586     .407589     0.93   0.354     -.4209011    1.176818
        _cons   -1.941934    .1782185   -10.90   0.000     -2.291236   -1.592632

mlogit chose the indemnity outcome as the base outcome and presented coefficients for the outcomes prepaid and uninsured. According to the model, the probability of prepaid for whites (nonwhite = 0) is

        Pr(insure = Prepaid) = e^{−.188} / (1 + e^{−.188} + e^{−1.942}) = 0.420

Similarly, for nonwhites, the probability of prepaid is

        Pr(insure = Prepaid) = e^{−.188+.661} / (1 + e^{−.188+.661} + e^{−1.942+.378}) = 0.570

These results agree with the column percentages presented by tabulate because the mlogit model is fully saturated. That is, there are enough terms in the model to fully explain the column percentage in each cell. The model chi-squared and the tabulate chi-squared are in almost perfect agreement; both test that the column percentages of insure are the same for both values of nonwhite.

Example 2: Specifying the base outcome

By specifying the baseoutcome() option, we can control which outcome of the dependent variable is treated as the base. Left to its own, mlogit chose to make outcome 1, indemnity, the base outcome. To make outcome 2, prepaid, the base, we would type

    . mlogit insure nonwhite, base(2)
    Iteration 0:   log likelihood = -556.59502
    Iteration 1:   log likelihood = -551.78935
    Iteration 2:   log likelihood = -551.78348
    Iteration 3:   log likelihood = -551.78348
    Multinomial logistic regression             Number of obs =    616
                                                LR chi2(2)    =   9.62
                                                Prob > chi2   = 0.0081
    Log likelihood = -551.78348                 Pseudo R2     = 0.0086

       insure        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    Indemnity
     nonwhite   -.6608212    .2157321    -3.06   0.002     -1.083648   -.2379942
        _cons    .1879149    .0937644     2.00   0.045      .0041401    .3716896
    Prepaid      (base outcome)
    Uninsure
     nonwhite   -.2828627    .3977302    -0.71   0.477       -1.0624    .4966742
        _cons   -1.754019    .1805145    -9.72   0.000     -2.107821   -1.400217

The baseoutcome() option requires that we specify the numeric value of the outcome, so we could not type base(Prepaid).

Although the coefficients now appear to be different, the summary statistics reported at the top are identical. With this parameterization, the probability of prepaid insurance for whites is

        Pr(insure = Prepaid) = 1 / (1 + e^{.188} + e^{−1.754}) = 0.420

This is the same answer we obtained previously.

Example 3: Displaying relative-risk ratios

By specifying rrr, which we can do at estimation time or when we redisplay results, we see the model in terms of relative-risk ratios:

    . mlogit, rrr
    Multinomial logistic regression             Number of obs =    616
                                                LR chi2(2)    =   9.62
                                                Prob > chi2   = 0.0081
    Log likelihood = -551.78348                 Pseudo R2     = 0.0086

       insure         RRR    Std. Err.      z    P>|z|     [95% Conf. Interval]
    Indemnity
     nonwhite     .516427    .1114099    -3.06   0.002      .3383588    .7882073
        _cons    1.206731    .1131483     2.00   0.045      1.004149    1.450183
    Prepaid      (base outcome)
    Uninsure
     nonwhite    .7536233    .2997387    -0.71   0.477      .3456255    1.643247
        _cons    .1730769    .0312429    -9.72   0.000      .1215024    .2465434

Looked at this way, the relative risk of choosing an indemnity over a prepaid plan is 0.516 for nonwhites relative to whites.
To illustrate, from the output and discussions of examples 1 and 2 we find that

        Pr(insure = Indemnity | white) = 1 / (1 + e^{−.188} + e^{−1.942}) = 0.507

and thus the relative risk of choosing indemnity over prepaid (for whites) is

        Pr(insure = Indemnity | white) / Pr(insure = Prepaid | white) = 0.507/0.420 = 1.207

For nonwhites,

        Pr(insure = Indemnity | not white) = 1 / (1 + e^{−.188+.661} + e^{−1.942+.378}) = 0.355

and thus the relative risk of choosing indemnity over prepaid (for nonwhites) is

        Pr(insure = Indemnity | not white) / Pr(insure = Prepaid | not white) = 0.355/0.570 = 0.623

The ratio of these two relative risks, hence the name "relative-risk ratio", is 0.623/1.207 = 0.516, as given in the output under the heading "RRR".

Technical note

In models where only two categories are considered, the mlogit model reduces to standard logit. Consequently the exponentiated regression coefficients, labeled as RRR within mlogit, are equal to the odds ratios as given when the or option is specified under logit; see [R] logit.

As such, always referring to mlogit's exponentiated coefficients as odds ratios may be tempting. However, the discussion in example 3 demonstrates that doing so would be incorrect. In general mlogit models, the exponentiated coefficients are ratios of relative risks, not ratios of odds.

Example 4: Model with continuous and multiple categorical variables

One of the advantages of mlogit over tabulate is that we can include continuous variables and multiple categorical variables in the model. In examining the data on insurance choice, we decide that we want to control for age, gender, and site of study (the study was conducted in three sites):

    . mlogit insure age male nonwhite i.site
    Iteration 0:   log likelihood = -555.85446
    Iteration 1:   log likelihood = -534.67443
    Iteration 2:   log likelihood = -534.36284
    Iteration 3:   log likelihood = -534.36165
    Iteration 4:   log likelihood = -534.36165
    Multinomial logistic regression             Number of obs =    615
                                                LR chi2(10)   =  42.99
                                                Prob > chi2   = 0.0000
    Log likelihood = -534.36165                 Pseudo R2     = 0.0387

       insure        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    Indemnity    (base outcome)
    Prepaid
          age    -.011745    .0061946    -1.90   0.058     -.0238862    .0003962
         male    .5616934    .2027465     2.77   0.006      .1643175    .9590693
     nonwhite    .9747768    .2363213     4.12   0.000      .5115955    1.437958
         site
            2    .1130359    .2101903     0.54   0.591     -.2989296    .5250013
            3   -.5879879    .2279351    -2.58   0.010     -1.034733   -.1412433
        _cons    .2697127    .3284422     0.82   0.412     -.3740222    .9134476
    Uninsure
          age   -.0077961    .0114418    -0.68   0.496     -.0302217    .0146294
         male    .4518496    .3674867     1.23   0.219      -.268411     1.17211
     nonwhite    .2170589    .4256361     0.51   0.610     -.6171725     1.05129
         site
            2   -1.211563    .4705127    -2.57   0.010     -2.133751   -.2893747
            3   -.2078123    .3662926    -0.57   0.570     -.9257327     .510108
        _cons   -1.286943    .5923219    -2.17   0.030     -2.447872   -.1260134

These results suggest that the inclination of nonwhites to choose prepaid care is even stronger than it was without controlling. We also see that subjects in site 2 are less likely to be uninsured.

Fitting constrained models

mlogit can fit models with subsets of coefficients constrained to be zero, with subsets of coefficients constrained to be equal both within and across equations, and with subsets of coefficients arbitrarily constrained to equal linear combinations of other estimated coefficients. Before fitting a constrained model, you define the constraints with the constraint command; see [R] constraint.
Once the constraints are defined, you estimate using mlogit, specifying the constraint() option. Typing constraint(4) would use the constraint you previously saved as 4. Typing constraint(1,4,6) would use the previously stored constraints 1, 4, and 6. Typing constraint(1-4,6) would use the previously stored constraints 1, 2, 3, 4, and 6. Sometimes you will not be able to specify the constraints without knowing the omitted outcome. In such cases, assume that the omitted outcome is whatever outcome is convenient for you, and include the baseoutcome() option when you specify the mlogit command. Example 5: Specifying constraints to test hypotheses We can use constraints to test hypotheses, among other things. In our insurance-choice model, let’s test the hypothesis that there is no distinction between having indemnity insurance and being uninsured. Indemnity-style insurance was the omitted outcome, so we type . test ( 1) ( 2) ( 3) ( 4) ( 5) ( 6) [Uninsure] [Uninsure]age = 0 [Uninsure]male = 0 [Uninsure]nonwhite = 0 [Uninsure]1b.site = 0 [Uninsure]2.site = 0 [Uninsure]3.site = 0 Constraint 4 dropped chi2( 5) = Prob > chi2 = 9.31 0.0973 If indemnity had not been the omitted outcome, we would have typed test [Uninsure=Indemnity]. The results produced by test are an approximation based on the estimated covariance matrix of the coefficients. Because the probability of being uninsured is low, the log likelihood may be nonlinear for the uninsured. Conventional statistical wisdom is not to trust the asymptotic answer under these circumstances but to perform a likelihood-ratio test instead. To use Stata’s lrtest (likelihood-ratio test) command, we must fit both the unconstrained and constrained models. The unconstrained model is the one we have previously fit. Following the instruction in [R] lrtest, we first store the unconstrained model results: . estimates store unconstrained To fit the constrained model, we must refit our model with all the coefficients except the constant set to 0 in the Uninsure equation. We define the constraint and then refit: 1364 mlogit — Multinomial (polytomous) logistic regression . constraint 1 [Uninsure] . mlogit insure age male nonwhite i.site, constraints(1) Iteration 0: log likelihood = -555.85446 Iteration 1: log likelihood = -539.80523 Iteration 2: log likelihood = -539.75644 Iteration 3: log likelihood = -539.75643 Multinomial logistic regression Number of obs Wald chi2(5) Log likelihood = -539.75643 Prob > chi2 ( 1) [Uninsure]o.age = 0 ( 2) [Uninsure]o.male = 0 ( 3) [Uninsure]o.nonwhite = 0 ( 4) [Uninsure]2o.site = 0 ( 5) [Uninsure]3o.site = 0 insure Indemnity Coef. Std. Err. z P>|z| = = = 615 29.70 0.0000 [95% Conf. Interval] (base outcome) Prepaid age male nonwhite -.0107025 .4963616 .9421369 .0060039 .1939683 .2252094 -1.78 2.56 4.18 0.075 0.010 0.000 -.0224699 .1161907 .5007346 .0010649 .8765324 1.383539 site 2 3 .2530912 -.5521773 .2029465 .2187237 1.25 -2.52 0.212 0.012 -.1446767 -.9808678 .6508591 -.1234869 _cons .1792752 .3171372 0.57 0.572 -.4423023 .8008527 age male nonwhite 0 0 0 (omitted) (omitted) (omitted) site 2 3 0 0 (omitted) (omitted) _cons -1.87351 .1601099 -11.70 0.000 -2.18732 -1.5597 Uninsure We can now perform the likelihood-ratio test: . lrtest unconstrained . Likelihood-ratio test (Assumption: . nested in unconstrained) LR chi2(5) = Prob > chi2 = 10.79 0.0557 The likelihood-ratio chi-squared is 10.79 with 5 degrees of freedom — just slightly greater than the magic p = 0.05 level—so we should not call this difference significant. 
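Constraints need not set whole equations to zero. The constraint command also accepts equality restrictions within and across equations. As a minimal sketch added here (it reuses the insurance-choice model but is not one of the manual's numbered examples), the following forces the coefficient on nonwhite to be the same in the Prepaid and Uninsure equations; the constrained fit could then be compared with the unconstrained one by estimates store and lrtest, just as in example 5:

    . constraint 2 [Prepaid]nonwhite = [Uninsure]nonwhite
    . mlogit insure age male nonwhite i.site, constraints(2)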
Technical note In certain circumstances, you should fit a multinomial logit model with conditional logit; see [R] clogit. With substantial data manipulation, clogit can handle the same class of models with some interesting additions. For example, if we had available the price and deductible of the most competitive insurance plan of each type, mlogit could not use this information, but clogit could. mlogit — Multinomial (polytomous) logistic regression 1365 Stored results mlogit stores the following in e(): Scalars e(N) e(N cd) e(k out) e(k) e(k eq) e(k eq model) e(k dv) e(df m) e(r2 p) e(ll) e(ll 0) e(N clust) e(chi2) e(p) e(k eq base) e(baseout) e(ibaseout) e(rank) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(clustvar) e(chi2type) e(vce) e(vcetype) e(eqnames) e(baselab) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(marginsnotok) e(asbalanced) e(asobserved) number of observations number of completely determined observations number of outcomes number of parameters number of equations in e(b) number of equations in overall model test number of dependent variables model degrees of freedom pseudo-R-squared log likelihood log likelihood, constant-only model number of clusters χ2 significance equation number of the base outcome the value of depvar to be treated as the base outcome index of the base outcome rank of e(V) number of iterations return code 1 if converged, 0 otherwise mlogit command as typed name of dependent variable weight type weight expression title in estimation output name of cluster variable Wald or LR; type of model χ2 test vcetype specified in vce() title used to label Std. Err. names of equations value label corresponding to base outcome type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict predictions disallowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved 1366 mlogit — Multinomial (polytomous) logistic regression Matrices e(b) e(out) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) Functions e(sample) coefficient vector outcome values constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample Methods and formulas The multinomial logit model is described in Greene (2012, 763–766). Suppose that there are k categorical outcomes and—without loss of generality—let the base outcome be 1. The probability that the response for the j th observation is equal to the ith outcome is pij = Pr(yj = i) = 1 , if i = 1 k P 1 + exp(x β ) j m m=2 exp(xj βi ) , if i > 1 k P exp(xj βm ) 1+ m=2 where xj is the row vector of observed values of the independent variables for the j th observation and βm is the coefficient vector for outcome m. The log pseudolikelihood is lnL = X wj j k X Ii (yj ) lnpik i=1 where wj is an optional weight and ( Ii (yj ) = 1, if yj = i 0, otherwise Newton – Raphson maximum likelihood is used; see [R] maximize. For constrained equations, the set of constraints is orthogonalized, and a subset of maximizable parameters is selected. For example, a parameter that is constrained to zero is not a maximizable parameter. If two parameters are constrained to be equal to each other, only one is a maximizable parameter. Let r be the vector of maximizable parameters. 
r is physically a subset of the solution parameters, b. A matrix, T, and a vector, m, are defined as b = Tr + m mlogit — Multinomial (polytomous) logistic regression so that 1367 ∂f ∂f 0 = T ∂b ∂r ∂2f 0 ∂2f = T T ∂b2 ∂r2 T consists of a block form in which one part is a permutation of the identity matrix and the other part describes how to calculate the constrained parameters from the maximizable parameters. This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. mlogit also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. References Davidson, R., and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press. Freese, J., and J. S. Long. 2000. sg155: Tests for the multinomial logit model. Stata Technical Bulletin 58: 19–25. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 247–255. College Station, TX: Stata Press. Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. Haan, P., and A. Uhlendorff. 2006. Estimation of multinomial logit models with unobserved heterogeneity using maximum simulated likelihood. Stata Journal 6: 229–245. Hamilton, L. C. 1993. sqv8: Interpreting multinomial logistic regression. Stata Technical Bulletin 13: 24–28. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 176–181. College Station, TX: Stata Press. . 2013. Statistics with Stata: Updated for Version 12. 8th ed. Boston: Brooks/Cole. Hendrickx, J. 2000. sbe37: Special restrictions in multinomial logistic regression. Stata Technical Bulletin 56: 18–26. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 93–103. College Station, TX: Stata Press. Hole, A. R. 2007. Fitting mixed logit models by using maximum simulated likelihood. Stata Journal 7: 388–401. Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley. Kleinbaum, D. G., and M. Klein. 2010. Logistic Regression: A Self-Learning Text. 3rd ed. New York: Springer. Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage. Long, J. S., and J. Freese. 2014. Regression Models for Categorical Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata Press. Tarlov, A. R., J. E. Ware, Jr., S. Greenfield, E. C. Nelson, E. Perrin, and M. Zubkoff. 1989. The medical outcomes study. An application of methods for monitoring the results of medical care. Journal of the American Medical Association 262: 925–930. Treiman, D. J. 2009. Quantitative Data Analysis: Doing Social Research to Test Ideas. San Francisco: Jossey-Bass. Wells, K. B., R. D. Hays, M. A. Burnam, W. H. Rogers, S. Greenfield, and J. E. Ware, Jr. 1989. Detection of depressive disorder for patients receiving prepaid or fee-for-service care. Results from the Medical Outcomes Survey. Journal of the American Medical Association 262: 3298–3302. Xu, J., and J. S. Long. 2005. Confidence intervals for predicted outcomes in regression models for categorical outcomes. Stata Journal 5: 537–559. 
1368 mlogit — Multinomial (polytomous) logistic regression Also see [R] mlogit postestimation — Postestimation tools for mlogit [R] clogit — Conditional (fixed-effects) logistic regression [R] logistic — Logistic regression, reporting odds ratios [R] logit — Logistic regression, reporting coefficients [R] mprobit — Multinomial probit regression [R] nlogit — Nested logit regression [R] ologit — Ordered logistic regression [R] rologit — Rank-ordered logistic regression [R] slogit — Stereotype logistic regression [MI] estimation — Estimation commands for use with mi estimate [SVY] svy estimation — Estimation commands for survey data [U] 20 Estimation and postestimation commands Title mlogit postestimation — Postestimation tools for mlogit Description Remarks and examples Syntax for predict Reference Menu for predict Also see Options for predict Description The following postestimation commands are available after mlogit: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 hausman lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations Hausman’s specification test point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 2 forecast is not appropriate with mi or svy estimation results. lrtest is not appropriate with svy estimation results. 1369 1370 mlogit postestimation — Postestimation tools for mlogit Syntax for predict predict type stub* | newvar | newvarlist predict type stub* | newvarlist if if in , statistic outcome(outcome) in , scores Description statistic Main probability of a positive outcome; the default linear prediction standard error of the linear prediction standard error of the difference in two linear predictions pr xb stdp stddp If you do not specify outcome(), pr (with one new variable specified), xb, and stdp assume outcome(#1). You must specify outcome() with the stddp option. You specify one or k new variables with pr, where k is the number of outcomes. You specify one new variable with xb, stdp, and stddp. These statistics are available both in and out of sample; type predict the estimation sample. . . . if e(sample) . . . if wanted only for Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main pr, the default, calculates the probability of each of the categories of the dependent variable or the probability of the level specified in outcome(outcome). 
If you specify the outcome(outcome) option, you need to specify only one new variable; otherwise, you must specify a new variable for each category of the dependent variable. xb calculates the linear prediction. You must also specify the outcome(outcome) option. stdp calculates the standard error of the linear prediction. You must also specify the outcome(outcome) option. stddp calculates the standard error of the difference in two linear predictions. You must specify the outcome(outcome) option, and here you specify the two particular outcomes of interest inside the parentheses, for example, predict sed, stddp outcome(1,3). outcome(outcome) specifies the outcome for which the statistic is to be calculated. equation() is a synonym for outcome(): it does not matter which you use. outcome() or equation() can be specified using #1, #2, . . . , where #1 means the first category of the dependent variable, #2 means the second category, etc.; the values of the dependent variable; or the value labels of the dependent variable if they exist. mlogit postestimation — Postestimation tools for mlogit 1371 scores calculates equation-level score variables. The number of score variables created will be one less than the number of outcomes in the model. If the number of outcomes in the model were k , then the first new variable will contain ∂ ln L/∂(xj β1 ); the second new variable will contain ∂ ln L/∂(xj β2 ); ... the (k − 1)th new variable will contain ∂ ln L/∂(xj βk−1 ). Remarks and examples Remarks are presented under the following headings: Obtaining predicted values Calculating marginal effects Testing hypotheses about coefficients Obtaining predicted values Example 1: Obtaining predicted probabilities After estimation, we can use predict to obtain predicted probabilities, index values, and standard errors of the index, or differences in the index. For instance, in example 4 of [R] mlogit, we fit a model of insurance choice on various characteristics. We can obtain the predicted probabilities for outcome 1 by typing . use http://www.stata-press.com/data/r13/sysdsn1 (Health insurance data) . mlogit insure age i.male i.nonwhite i.site (output omitted ) . predict p1 if e(sample), outcome(1) (option pr assumed; predicted probability) (29 missing values generated) . summarize p1 Variable Obs Mean Std. Dev. p1 615 .4764228 .1032279 Min Max .1698142 .71939 We added the i. prefix to the male, nonwhite, and site variables to explicitly identify them as factor variables. That makes no difference in the estimated results, but we will take advantage of it in later examples. We also included if e(sample) to restrict the calculation to the estimation sample. In example 4 of [R] mlogit, the multinomial logit model was fit on 615 observations, so there must be missing values in our dataset. Although we typed outcome(1), specifying 1 for the indemnity outcome, we could have typed outcome(Indemnity). For instance, to obtain the probabilities for prepaid, we could type . predict p2 if e(sample), outcome(Prepaid) (option pr assumed; predicted probability) (29 missing values generated) . summarize p2 Variable Obs Mean Std. Dev. p2 615 .4504065 .1125962 Min Max .1964103 .7885724 1372 mlogit postestimation — Postestimation tools for mlogit We must specify the label exactly as it appears in the underlying value label (or how it appears in the mlogit output), including capitalization. Here we have used predict to obtain probabilities for the same sample on which we estimated. That is not necessary. 
We could use another dataset that had the independent variables defined (in our example, age, male, nonwhite, and site) and use predict to obtain predicted probabilities; here, we would not specify if e(sample).

Example 2: Obtaining index values

predict can also be used to obtain the index values — the xi β̂(k) — as well as the probabilities:

    . predict idx1, outcome(Indemnity) xb
    (1 missing value generated)
    . summarize idx1

        Variable     Obs        Mean    Std. Dev.        Min        Max
            idx1     643           0           0           0          0

The indemnity outcome was our base outcome — the outcome for which all the coefficients were set to 0 — so the index is always 0. For the prepaid and uninsured outcomes, we type

    . predict idx2, outcome(Prepaid) xb
    (1 missing value generated)
    . predict idx3, outcome(Uninsure) xb
    (1 missing value generated)
    . summarize idx2 idx3

        Variable     Obs        Mean    Std. Dev.        Min        Max
            idx2     643   -.0566113    .4962973   -1.298198   1.700719
            idx3     643   -1.980747    .6018139   -3.112741  -.8258458

We can obtain the standard error of the index by specifying the stdp option:

    . predict se2, outcome(Prepaid) stdp
    (1 missing value generated)
    . list p2 idx2 se2 in 1/5

                p2        idx2        se2
      1.  .3709022   -.4831167   .2437772
      2.  .4977667     .055111   .1694686
      3.  .4113073   -.1712106   .1793498
      4.  .5424927    .3788345   .2513701
      5.         .   -.0925817   .1452616

We obtained the probability, p2, in the previous example.

Finally, predict can calculate the standard error of the difference in the index values between two outcomes with the stddp option:

    . predict se_2_3, outcome(Prepaid,Uninsure) stddp
    (1 missing value generated)
    . list idx2 idx3 se_2_3 in 1/5

              idx2        idx3      se_2_3
      1.  -.4831167   -3.073253   .5469354
      2.    .055111   -2.715986   .4331918
      3.  -.1712106   -1.579621   .3053815
      4.   .3788345   -1.462007   .4492552
      5.  -.0925817   -2.814022   .4024784

In the first observation, the difference in the indexes is −0.483 − (−3.073) = 2.59. The standard error of that difference is 0.547.

Example 3: Interpreting results using predictive margins

It is more difficult to interpret the results from mlogit than those from clogit or logit because there are multiple equations. For example, suppose that one of the independent variables in our model takes on the values 0 and 1, and we are attempting to understand the effect of this variable. Assume that the coefficient on this variable for the second outcome, β(2), is positive. We might then be tempted to reason that the probability of the second outcome is higher if the variable is 1 rather than 0. Most of the time, that will be true, but occasionally we will be surprised. The probability of some other outcome could increase even more (say, β(3) > β(2)), and thus the probability of outcome 2 would actually fall relative to that outcome.

We can use predict to help interpret such results. Continuing with our previously fit insurance-choice model, we wish to describe the model's predictions by race. For this purpose, we can use the method of predictive margins (also known as recycled predictions), in which we vary characteristics of interest across the whole dataset and average the predictions. That is, we have data on both whites and nonwhites, and our individuals have other characteristics as well. We will first pretend that all the people in our data are white but hold their other characteristics constant. We then calculate the probabilities of each outcome. Next we will pretend that all the people in our data are nonwhite, still holding their other characteristics constant. Again we calculate the probabilities of each outcome.
The difference in those two sets of calculated probabilities, then, is the difference due to race, holding other characteristics constant.

    . gen byte nonwhold=nonwhite                 // save real race
    . replace nonwhite=0                         // make everyone white
    (126 real changes made)
    . predict wpind, outcome(Indemnity)          // predict probabilities
    (option pr assumed; predicted probability)
    (1 missing value generated)
    . predict wpp, outcome(Prepaid)
    (option pr assumed; predicted probability)
    (1 missing value generated)
    . predict wpnoi, outcome(Uninsure)
    (option pr assumed; predicted probability)
    (1 missing value generated)
    . replace nonwhite=1                         // make everyone nonwhite
    (644 real changes made)
    . predict nwpind, outcome(Indemnity)
    (option pr assumed; predicted probability)
    (1 missing value generated)
    . predict nwpp, outcome(Prepaid)
    (option pr assumed; predicted probability)
    (1 missing value generated)
    . predict nwpnoi, outcome(Uninsure)
    (option pr assumed; predicted probability)
    (1 missing value generated)
    . replace nonwhite=nonwhold                  // restore real race
    (518 real changes made)
    . summarize wp* nwp*, sep(3)

        Variable     Obs        Mean    Std. Dev.        Min        Max
           wpind     643    .5141673    .0872679    .3092903     .71939
             wpp     643    .4082052    .0993286    .1964103   .6502247
           wpnoi     643    .0776275    .0360283    .0273596   .1302816
          nwpind     643    .3112809    .0817693    .1511329    .535021
            nwpp     643     .630078    .0979976    .3871782   .8278881
          nwpnoi     643    .0586411    .0287185    .0209648   .0933874

In example 1 of [R] mlogit, we presented a cross-tabulation of insurance type and race. Those values were unadjusted. The means reported above are the values adjusted for age, sex, and site. Combining the results gives

                      Indemnity     Prepaid    Uninsured
    Unadjusted
        white             0.51        0.42         0.07
        nonwhite          0.36        0.57         0.07
    Adjusted
        white             0.51        0.41         0.08
        nonwhite          0.31        0.63         0.06

We find, for instance, after adjusting for age, sex, and site, that although 57% of nonwhites in our data had prepaid plans, 63% of nonwhites chose prepaid plans.

Computing predictive margins by hand was instructive, but we can compute these values more easily using the margins command (see [R] margins). The two margins for the indemnity outcome can be estimated by typing

    . margins nonwhite, predict(outcome(Indemnity)) noesample
    Predictive margins                          Number of obs = 643
    Model VCE    : OIM
    Expression   : Pr(insure==Indemnity), predict(outcome(Indemnity))

                             Delta-method
                  Margin     Std. Err.       z    P>|z|     [95% Conf. Interval]
     nonwhite
            0    .5141673     .0223485    23.01   0.000       .470365    .5579695
            1    .3112809     .0418049     7.45   0.000      .2293448     .393217

margins also estimates the standard errors and confidence intervals of the margins. By default, margins uses only the estimation sample. We added the noesample option so that margins would use the entire sample and produce results comparable to our earlier analysis.

We can use marginsplot to graph the results from margins:

    . marginsplot
    Variables that uniquely identify margins: nonwhite
    (figure omitted: "Predictive Margins of nonwhite with 95% CIs," plotting Pr(insure==Indemnity) against nonwhite)

The margins for the other two outcomes can be computed by typing

    . margins nonwhite, predict(outcome(Prepaid)) noesample
    (output omitted)
    . margins nonwhite, predict(outcome(Uninsure)) noesample
    (output omitted)

Technical note

You can use predict to classify predicted values and compare them with the observed outcomes to interpret a multinomial logit model. This is a variation on the notions of sensitivity and specificity for logistic regression.
Here we will classify indemnity and prepaid as definitely predicting indemnity, definitely predicting prepaid, and ambiguous. . predict indem, outcome(Indemnity) index // (1 missing value generated) . predict prepaid, outcome(Prepaid) index (1 missing value generated) . gen diff = prepaid-indem // (1 missing value generated) . predict sediff, outcome(Indemnity,Prepaid) stddp // (1 missing value generated) . gen type = 1 if diff/sediff < -1.96 // (504 missing values generated) . replace type = 3 if diff/sediff > 1.96 // (100 real changes made) . replace type = 2 if type>=. & diff/sediff < . // (404 real changes made) . label def type 1 "Def Ind" 2 "Ambiguous" 3 "Def Prep" . label values type type // obtain indexes obtain difference & its standard error definitely indemnity definitely prepaid ambiguous label results 1376 mlogit postestimation — Postestimation tools for mlogit . tabulate insure type insure Def Ind type Ambiguous Def Prep Total Indemnity Prepaid Uninsure 78 44 12 183 177 28 33 56 5 294 277 45 Total 134 388 94 616 We can see that the predictive power of this model is modest. There are many misclassifications in both directions, though there are more correctly classified observations than misclassified observations. Also the uninsured look overwhelmingly as though they might have come from the indemnity system rather than from the prepaid system. Calculating marginal effects Example 4 We have already noted that the coefficients from multinomial logit can be difficult to interpret because they are relative to the base outcome. Another way to evaluate the effect of covariates is to examine the marginal effect of changing their values on the probability of observing an outcome. The margins command can be used for this too. We can estimate the marginal effect of each covariate on the probability of observing the first outcome—indemnity insurance—by typing . margins, dydx(*) predict(outcome(Indemnity)) Average marginal effects Number of obs = Model VCE : OIM Expression : Pr(insure==Indemnity), predict(outcome(Indemnity)) dy/dx w.r.t. : age 1.male 1.nonwhite 2.site 3.site dy/dx Delta-method Std. Err. z P>|z| 615 [95% Conf. Interval] age 1.male 1.nonwhite .0026655 -.1295734 -.2032404 .001399 .0450945 .0482554 1.91 -2.87 -4.21 0.057 0.004 0.000 -.0000765 -.2179571 -.2978192 .0054074 -.0411898 -.1086616 site 2 3 .0070995 .1216165 .0479993 .0505833 0.15 2.40 0.882 0.016 -.0869775 .022475 .1011765 .220758 Note: dy/dx for factor levels is the discrete change from the base level. By default, margins estimates the average marginal effect over the estimation sample, and that is what we see above. Being male decreases the average probability of having indemnity insurance by 0.130. We also see, from the note at the bottom of the table, that the marginal effect was computed as a discrete change in the probability of being male rather than female. That is why we made male a factor variable when fitting the model. The dydx(*) option requested that margins estimate the marginal effect for each regressor, dydx(age) would have produced estimates only for the effect of age. margins has many options for controlling how the marginal effect is computed, including the ability to average over subgroups or to compute estimates for specified values of the regressors; see [R] margins. mlogit postestimation — Postestimation tools for mlogit 1377 We could evaluate the marginal effects on the other two outcomes by typing . margins, dydx(*) predict(outcome(Prepaid)) (output omitted ) . 
margins, dydx(*) predict(outcome(Uninsure)) (output omitted ) Testing hypotheses about coefficients Example 5 test tests hypotheses about the coefficients just as after any estimation command; see [R] test. Note, however, test’s syntax for dealing with multiple-equation models. Because test bases its results on the estimated covariance matrix, we might prefer a likelihood-ratio test; see example 5 in [R] mlogit for an example of lrtest. If we simply list variables after the test command, we are testing that the corresponding coefficients are zero across all equations: . test ( 1) ( 2) ( 3) ( 4) ( 5) ( 6) 2.site 3.site [Indemnity]2o.site = 0 [Prepaid]2.site = 0 [Uninsure]2.site = 0 [Indemnity]3o.site = 0 [Prepaid]3.site = 0 [Uninsure]3.site = 0 Constraint 1 dropped Constraint 4 dropped chi2( 4) = 19.74 Prob > chi2 = 0.0006 We can test that all the coefficients (except the constant) in an equation are zero by simply typing the outcome in square brackets: . test ( 1) ( 2) ( 3) ( 4) ( 5) ( 6) ( 7) ( 8) [Uninsure] [Uninsure]age = 0 [Uninsure]0b.male = 0 [Uninsure]1.male = 0 [Uninsure]0b.nonwhite = 0 [Uninsure]1.nonwhite = 0 [Uninsure]1b.site = 0 [Uninsure]2.site = 0 [Uninsure]3.site = 0 Constraint 2 dropped Constraint 4 dropped Constraint 6 dropped chi2( 5) = 9.31 Prob > chi2 = 0.0973 We specify the outcome just as we do with predict; we can specify the label if the outcome variable is labeled, or we can specify the numeric value of the outcome. We would have obtained the same test as above if we had typed test [3] because 3 is the value of insure for the outcome uninsured. We can combine the two syntaxes. To test that the coefficients on the site variables are 0 in the equation corresponding to the outcome prepaid, we can type 1378 mlogit postestimation — Postestimation tools for mlogit . test [Prepaid]: 2.site ( 1) [Prepaid]2.site = ( 2) [Prepaid]3.site = chi2( 2) = Prob > chi2 = 3.site 0 0 10.78 0.0046 We specified the outcome and then followed that with a colon and the variables we wanted to test. We can also test that coefficients are equal across equations. To test that all coefficients except the constant are equal for the prepaid and uninsured outcomes, we can type . test ( 1) ( 2) ( 3) ( 4) ( 5) ( 6) ( 7) ( 8) [Prepaid=Uninsure] [Prepaid]age - [Uninsure]age = 0 [Prepaid]0b.male - [Uninsure]0b.male = 0 [Prepaid]1.male - [Uninsure]1.male = 0 [Prepaid]0b.nonwhite - [Uninsure]0b.nonwhite = 0 [Prepaid]1.nonwhite - [Uninsure]1.nonwhite = 0 [Prepaid]1b.site - [Uninsure]1b.site = 0 [Prepaid]2.site - [Uninsure]2.site = 0 [Prepaid]3.site - [Uninsure]3.site = 0 Constraint 2 dropped Constraint 4 dropped Constraint 6 dropped chi2( 5) = 13.80 Prob > chi2 = 0.0169 To test that only the site variables are equal, we can type . test [Prepaid=Uninsure]: 2.site 3.site ( 1) [Prepaid]2.site - [Uninsure]2.site = 0 ( 2) [Prepaid]3.site - [Uninsure]3.site = 0 chi2( 2) = 12.68 Prob > chi2 = 0.0018 Finally, we can test any arbitrary constraint by simply entering the equation and specifying the coefficients as described in [U] 13.5 Accessing coefficients and standard errors. The following hypothesis is senseless but illustrates the point: . test ([Prepaid]age+[Uninsure]2.site)/2 = 2-[Uninsure]1.nonwhite ( 1) .5*[Prepaid]age + [Uninsure]1.nonwhite + .5*[Uninsure]2.site = 2 chi2( 1) = 22.45 Prob > chi2 = 0.0000 See [R] test for more information about test. The information there about combining hypotheses across test commands (the accumulate option) also applies after mlogit. Reference Fagerland, M. W., and D. W. 
Hosmer, Jr. 2012. A generalized HosmerLemeshow goodness-of-fit test for multinomial logistic regression models. Stata Journal 12: 447–453. Also see [R] mlogit — Multinomial (polytomous) logistic regression [U] 20 Estimation and postestimation commands Title more — The —more— message Syntax Description Option Remarks and examples Also see Syntax Tell Stata to pause or not pause for —more— messages , permanently set more on | off Set number of lines between —more— messages set pagesize # Description set more on, which is the default, tells Stata to wait until you press a key before continuing when a more message is displayed. set more off tells Stata not to pause or display the set pagesize # sets the number of lines between is not allowed with set pagesize. more more message. messages. The permanently option Option permanently specifies that, in addition to making the change right now, the more setting be remembered and become the default setting when you invoke Stata. Remarks and examples When you see more at the bottom of the screen, Press . . . letter l or Enter letter q Spacebar or any other key and Stata . . . displays the next line acts as if you pressed Break displays the next screen You can also click on the More button or click on more to display the next screen. more is Stata’s way of telling you that it has something more to show you but that showing it to you will cause the information on the screen to scroll off. If you type set more off, at full speed. If you type set more on, more more conditions will never arise, and Stata’s output will scroll by conditions will be restored at the appropriate places. Programmers should see [P] more for information on the more programming command. 1379 1380 more — The —more— message Also see [R] query — Display system parameters [P] creturn — Return c-class values [P] more — Pause until key is pressed [P] sleep — Pause for a specified time [U] 7 –more– conditions Title mprobit — Multinomial probit regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax mprobit depvar indepvars if in weight , options options Description Model noconstant baseoutcome(# | lbl) probitparam constraints(constraints) collinear suppress constant terms outcome used to normalize location use the probit variance parameterization apply specified linear constraints keep collinear variables SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife Reporting level(#) nocnsreport display options set confidence level; default is level(95) do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Integration intpoints(#) number of quadrature points Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics indepvars may contain factor variables; see [U] 11.4.3 Factor variables. bootstrap, by, fp, jackknife, mi estimate, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce() and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. 
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 1381 1382 mprobit — Multinomial probit regression Menu Statistics > Categorical outcomes > Independent multinomial probit Description mprobit fits multinomial probit (MNP) models via maximum likelihood. depvar contains the outcome for each observation, and indepvars are the associated covariates. The error terms are assumed to be independent, standard normal, random variables. See [R] asmprobit for the case where the latent-variable errors are correlated or heteroskedastic and you have alternative-specific variables. Options Model noconstant suppresses the J − 1 constant terms. baseoutcome(# | lbl) specifies the outcome used to normalize the location of the latent variable. The base outcome may be specified as a number or a label. The default is to use the most frequent outcome. The coefficients associated with the base outcome are zero. probitparam specifies to use the probit variance parameterization by fixing the variance of the differenced latent errors between the scale and the base alternatives to be one. The default is to make the variance of the base and scale latent errors one, thereby making the variance of the difference to be two. constraints(constraints), collinear; see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. If specifying vce(bootstrap) or vce(jackknife), you must also specify baseoutcome(). Reporting level(#); see [R] estimation options. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Integration intpoints(#) specifies the number of Gaussian quadrature points to use in approximating the likelihood. The default is 15. mprobit — Multinomial probit regression 1383 Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). The following option is available with mprobit but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples The MNP model is used with discrete dependent variables that take on more than two outcomes that do not have a natural ordering. The stochastic error terms for this implementation of the model are assumed to have independent, standard normal distributions. To use mprobit, you must have one observation for each decision maker in the sample. See [R] asmprobit for another implementation of the MNP model that permits correlated and heteroskedastic errors and is suitable when you have data for each alternative that a decision maker faced. The MNP model is frequently motivated using a latent-variable framework. The latent variable for the j th alternative, j = 1, . . . , J , is ηij = zi αj + ξij where the 1 × q row vector zi contains the observed independent variables for the ith decision maker. 
Associated with zi are the J vectors of regression coefficients αj . The ξi,1 , . . . , ξi,J are distributed independently and identically standard normal. The decision maker chooses the alternative k such that ηik ≥ ηim for m 6= k . Suppose that case i chooses alternative k , and take the difference between latent variable ηik and the J − 1 others: vijk = ηij − ηik = zi (αj − αk ) + ξij − ξik (1) = zi γj 0 + ij 0 where j 0 = j if j < k and j 0 = j−1 if j > k so that j 0 = 1, . . . , J −1. Var(ij 0 ) = Var(ξij −ξik ) = 2 and Cov(ij 0 , il0 ) = 1 for j 0 6= l0 . The probability that alternative k is chosen is Pr(i chooses k) = Pr(vi1k ≤ 0, . . . , vi,J−1,k ≤ 0) = Pr(i1 ≤ −zi γ1 , . . . , i,J−1 ≤ −zi γJ−1 ) Hence, evaluating the likelihood function involves computing probabilities from the multivariate normal distribution. That all the covariances are equal simplifies the problem somewhat; see Methods and formulas for details. In (1), not all J of the αj are identifiable. To remove the indeterminacy, αl is set to the zero vector, where l is the base outcome as specified in the baseoutcome() option. That fixes the lth latent variable to zero so that the remaining variables measure the attractiveness of the other alternatives relative to the base. 1384 mprobit — Multinomial probit regression Example 1 As discussed in example 1 of [R] mlogit, we have data on the type of health insurance available to 616 psychologically depressed subjects in the United States (Tarlov et al. 1989; Wells et al. 1989). Patients may have either an indemnity (fee-for-service) plan or a prepaid plan such as an HMO, or the patient may be uninsured. Demographic variables include age, gender, race, and site. Indemnity insurance is the most popular alternative, so mprobit will choose it as the base outcome by default. . use http://www.stata-press.com/data/r13/sysdsn1 (Health insurance data) . mprobit insure age male nonwhite i.site Iteration 0: log likelihood = -535.89424 Iteration 1: log likelihood = -534.56173 Iteration 2: log likelihood = -534.52835 Iteration 3: log likelihood = -534.52833 Multinomial probit regression Number of obs Wald chi2(10) Prob > chi2 Log likelihood = -534.52833 insure Indemnity Coef. Std. Err. z P>|z| = = = 615 40.18 0.0000 [95% Conf. Interval] (base outcome) Prepaid age male nonwhite -.0098536 .4774678 .8245003 .0052688 .1718316 .1977582 -1.87 2.78 4.17 0.061 0.005 0.000 -.0201802 .1406841 .4369013 .000473 .8142515 1.212099 site 2 3 .0973956 -.495892 .1794546 .1904984 0.54 -2.60 0.587 0.009 -.2543289 -.869262 .4491201 -.1225221 _cons .22315 .2792424 0.80 0.424 -.324155 .7704549 age male nonwhite -.0050814 .3332637 .2485859 .0075327 .2432986 .2767734 -0.67 1.37 0.90 0.500 0.171 0.369 -.0198452 -.1435929 -.29388 .0096823 .8101203 .7910518 site 2 3 -.6899485 -.1788447 .2804497 .2479898 -2.46 -0.72 0.014 0.471 -1.23962 -.6648957 -.1402771 .3072063 _cons -.9855917 .3891873 -2.53 0.011 -1.748385 -.2227986 Uninsure The likelihood function for mprobit is derived under the assumption that all decision-making units face the same choice set, which is the union of all outcomes observed in the dataset. If that is not true for your model, then an alternative is to use the asmprobit command, which does not require this assumption. To do that, you will need to expand the dataset so that each decision maker has ki observations, where ki is the number of alternatives in the choice set faced by decision maker i. You will also need to create a binary variable to indicate the choice made by each decision maker. 
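A rough sketch of that restructuring follows (hypothetical variable names; an id variable identifying each decision maker is assumed, and the command lines are illustrative rather than tested):

. expand 3
. bysort id: generate alt = _n
. generate byte chosen = (alt == insure)
. asmprobit chosen, case(id) alternatives(alt) casevars(age male nonwhite) correlation(independent) stddev(homoskedastic)

Here expand 3 creates one record per alternative for a three-alternative choice set, alt indexes the alternatives within each decision maker, and chosen marks the alternative actually selected.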
Moreover, you will need to use the correlation(independent) and stddev(homoskedastic) options with asmprobit unless you have alternative-specific variables. mprobit — Multinomial probit regression 1385 Stored results mprobit stores the following in e(): Scalars e(N) e(k out) e(k points) e(k) e(k eq) e(k eq model) e(k indvars) e(k dv) e(df m) e(ll) e(N clust) e(chi2) e(p) e(i base) e(const) e(probitparam) e(rank) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(indvars) e(wtype) e(wexp) e(title) e(clustvar) e(chi2type) e(vce) e(vcetype) e(outeqs) e(out#) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(marginsnotok) e(asbalanced) e(asobserved) number of observations number of outcomes number of quadrature points number of parameters number of equations in e(b) number of equations in overall model test number of independent variables number of dependent variables model degrees of freedom log simulated-likelihood number of clusters χ2 significance base outcome index 0 if noconstant is specified, 1 otherwise 1 if probitparam is specified, 0 otherwise rank of e(V) number of iterations return code 1 if converged, 0 otherwise mprobit command as typed name of dependent variable independent variables weight type weight expression title in estimation output name of cluster variable Wald, type of model χ2 test vcetype specified in vce() title used to label Std. Err. outcome equations outcome labels, # =1,...,e(k out) type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict predictions disallowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved 1386 mprobit — Multinomial probit regression Matrices e(b) e(outcomes) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) Functions e(sample) coefficient vector outcome values constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample Methods and formulas See Cameron and Trivedi (2005, chap. 15) for a discussion of multinomial models, including multinomial probit. Long and Freese (2014, chap. 8) discuss the multinomial logistic, multinomial probit, and stereotype logistic regression models, with examples using Stata. As discussed in Remarks and examples, the latent variables for a J -alternative model are ηij = zi αj + ξij , for j = 1, . . . , J , i = 1, . . . , n, and {ξi,1 , . . . , ξi,J } ∼ i.i.d.N (0, 1). The experimenter observes alternative k for the ith observation if ηik > ηil for l 6= k . For j 6= k , let vij 0 = ηij − ηik = zi (αj − αk ) + ξij − ξik = zi γj 0 + ij 0 where j 0 = j if j < k and j 0 = j − 1 if j > k so that j 0 = 1, . . . , J − 1. i = (i1 , . . . , i,J−1 ) ∼ M V N (0, Σ), where 2 1 1 ... 1 2 1 ... 1 1 2 ... Σ= . . . . .. .. .. .. 1 1 1 ... 1 1 1 .. . 2 Denote the deterministic part of the model as λij 0 = zi γj 0 ; the probability that subject i chooses outcome k is Pr(yi = k) = Pr(vi1 ≤ 0, . . . , vi,J−1 ≤ 0) = Pr(i1 ≤ −λi1 , . . . 
, i,J−1 ≤ −λi,J−1 ) Z −λi1 Z −λi,J−1 1 ··· exp − 12 z0 Σ−1 z dz = (J−1)/2 1/2 −∞ −∞ (2π) |Σ| Because of the exchangeable correlation structure of Σ (ρij = 1/2 for all i 6= j ), we can use Dunnett’s (1989) result to reduce the multidimensional integral to one dimension: Z ∞ J−1 J−1 Y Y √ √ 2 1 Pr(yi = k) = √ Φ −z 2 − λij + Φ z 2 − λij e−z dz π 0 j=1 j=1 mprobit — Multinomial probit regression 1387 Gaussian quadrature is used to approximate this integral, resulting in the K -point quadrature formula K J−1 Y Y J−1 √ √ 1X wk Pr(yi = k) ≈ Φ − 2xk − λij + Φ 2xk − λij 2 j=1 j=1 k=1 where wk and xk are the weights and roots of the Laguerre polynomial of order K . In mprobit, K is specified by the intpoints() option. This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. mprobit also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. References Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. New York: Cambridge University Press. Dunnett, C. W. 1989. Algorithm AS 251: Multivariate normal probability integrals with product correlation structure. Journal of the Royal Statistical Society, Series C 38: 564–579. Haan, P., and A. Uhlendorff. 2006. Estimation of multinomial logit models with unobserved heterogeneity using maximum simulated likelihood. Stata Journal 6: 229–245. Hole, A. R. 2007. Fitting mixed logit models by using maximum simulated likelihood. Stata Journal 7: 388–401. Long, J. S., and J. Freese. 2014. Regression Models for Categorical Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata Press. Tarlov, A. R., J. E. Ware, Jr., S. Greenfield, E. C. Nelson, E. Perrin, and M. Zubkoff. 1989. The medical outcomes study. An application of methods for monitoring the results of medical care. Journal of the American Medical Association 262: 925–930. Wells, K. B., R. D. Hays, M. A. Burnam, W. H. Rogers, S. Greenfield, and J. E. Ware, Jr. 1989. Detection of depressive disorder for patients receiving prepaid or fee-for-service care. Results from the Medical Outcomes Survey. Journal of the American Medical Association 262: 3298–3302. 
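Restated compactly (a reconstruction of the displays in Methods and formulas above), the choice probability, its one-dimensional reduction by Dunnett's result, and the K-point quadrature approximation are

\Pr(y_i = k) \;=\; \int_{-\infty}^{-\lambda_{i1}} \cdots \int_{-\infty}^{-\lambda_{i,J-1}} \frac{1}{(2\pi)^{(J-1)/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}\,\mathbf{z}'\Sigma^{-1}\mathbf{z}\right) d\mathbf{z}

\Pr(y_i = k) \;=\; \frac{1}{\sqrt{\pi}} \int_0^{\infty} \left\{ \prod_{j=1}^{J-1} \Phi\!\left(-z\sqrt{2}-\lambda_{ij}\right) + \prod_{j=1}^{J-1} \Phi\!\left(z\sqrt{2}-\lambda_{ij}\right) \right\} e^{-z^2}\, dz

\Pr(y_i = k) \;\approx\; \frac{1}{2} \sum_{k=1}^{K} w_k \left\{ \prod_{j=1}^{J-1} \Phi\!\left(-\sqrt{2}\,x_k-\lambda_{ij}\right) + \prod_{j=1}^{J-1} \Phi\!\left(\sqrt{2}\,x_k-\lambda_{ij}\right) \right\}

where, as above, \lambda_{ij'} = z_i\gamma_{j'}, \Sigma has 2 on the diagonal and 1 off the diagonal, and w_k and x_k are the weights and roots of the Laguerre polynomial of order K, the number of points set by intpoints().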
Also see [R] mprobit postestimation — Postestimation tools for mprobit [R] asmprobit — Alternative-specific multinomial probit regression [R] clogit — Conditional (fixed-effects) logistic regression [R] mlogit — Multinomial (polytomous) logistic regression [R] nlogit — Nested logit regression [R] ologit — Ordered logistic regression [R] oprobit — Ordered probit regression [MI] estimation — Estimation commands for use with mi estimate [SVY] svy estimation — Estimation commands for survey data [U] 20 Estimation and postestimation commands Title mprobit postestimation — Postestimation tools for mprobit Description Remarks and examples Syntax for predict References Menu for predict Also see Options for predict Description The following postestimation commands are available after mprobit: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 hausman lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations Hausman’s specification test point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predicted probabilities, linear predictions, and standard errors point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 2 forecast is not appropriate with mi or svy estimation results. lrtest is not appropriate with svy estimation results. 1388 mprobit postestimation — Postestimation tools for mprobit 1389 Syntax for predict predict type stub* | newvar | newvarlist predict type stub* | newvarlist if if in , statistic outcome(outcome) in , scores Description statistic Main probability of a positive outcome; the default linear prediction standard error of the linear prediction pr xb stdp If you do not specify outcome(), pr (with one new variable specified), xb, and stdp assume outcome(#1). You specify one or k new variables with pr, where k is the number of outcomes. You specify one new variable with xb and stdp. These statistics are available both in and out of sample; type predict . . . if e(sample) . . . if wanted only for the estimation sample. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main pr, the default, calculates the probability of each of the categories of the dependent variable or the probability of the level specified in outcome(outcome). If you specify the outcome(outcome) option, you need to specify only one new variable; otherwise, you must specify a new variable for each category of the dependent variable. xb calculates the linear prediction, xi αj , for alternative j and individual i. The index, j , corresponds to the outcome specified in outcome(). stdp calculates the standard error of the linear prediction. 
outcome(outcome) specifies the outcome for which the statistic is to be calculated. equation() is a synonym for outcome(): it does not matter which you use. outcome() or equation() can be specified using #1, #2, . . . , where #1 means the first category of the dependent variable, #2 means the second category, etc.; the values of the dependent variable; or the value labels of the dependent variable if they exist. scores calculates the equation-level score variables. The j th new variable will contain the scores for the j th fitted equation. 1390 mprobit postestimation — Postestimation tools for mprobit Remarks and examples Once you have fit a multinomial probit model, you can use predict to obtain probabilities that an individual will choose each of the alternatives for the estimation sample, as well as other samples; see [U] 20 Estimation and postestimation commands and [R] predict. Example 1 In example 1 of [R] mprobit, we fit the multinomial probit model to a dataset containing the type of health insurance available to 616 psychologically depressed subjects in the United States (Tarlov et al. 1989; Wells et al. 1989). We can obtain the predicted probabilities by typing . use http://www.stata-press.com/data/r13/sysdsn1 (Health insurance data) . mprobit insure age male nonwhite i.site (output omitted ) . predict p1-p3 (option pr assumed; predicted probabilities) . list p1-p3 insure in 1/10 p1 p2 p3 insure 1. 2. 3. 4. 5. .5961306 .4719296 .4896086 .3730529 .5063069 .3741824 .4972289 .4121961 .5416623 .4629773 .029687 .0308415 .0981953 .0852848 .0307158 Indemnity Prepaid Indemnity Prepaid . 6. 7. 8. 9. 10. .4768125 .5035672 .3326361 .4758165 .5734057 .4923548 .4657016 .5580404 .4384811 .3316601 .0308327 .0307312 .1093235 .0857024 .0949342 Prepaid Prepaid . Uninsure Prepaid insure contains a missing value for observations 5 and 8. Because of that, those two observations were not used in the estimation. However, because none of the independent variables is missing, predict can still calculate the probabilities. Had we typed . predict p1-p3 if e(sample) predict would have filled in missing values for p1, p2, and p3 for those observations because they were not used in the estimation. References Tarlov, A. R., J. E. Ware, Jr., S. Greenfield, E. C. Nelson, E. Perrin, and M. Zubkoff. 1989. The medical outcomes study. An application of methods for monitoring the results of medical care. Journal of the American Medical Association 262: 925–930. Wells, K. B., R. D. Hays, M. A. Burnam, W. H. Rogers, S. Greenfield, and J. E. Ware, Jr. 1989. Detection of depressive disorder for patients receiving prepaid or fee-for-service care. Results from the Medical Outcomes Survey. Journal of the American Medical Association 262: 3298–3302. 
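As a quick follow-up check (not part of the original example), the predicted probabilities should sum to one across the three outcomes in every observation for which they were computed:

. egen ptot = rowtotal(p1-p3)
. summarize ptot
. assert reldif(ptot, 1) < 1e-6

rowtotal() and reldif() are used here only for illustration; a discrepancy larger than roundoff error would indicate a problem in the prediction step rather than in the model.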
Also see [R] mprobit — Multinomial probit regression [U] 20 Estimation and postestimation commands Title nbreg — Negative binomial regression Syntax Options for gnbreg References Menu Remarks and examples Also see Description Stored results Options for nbreg Methods and formulas Syntax Negative binomial regression model nbreg depvar indepvars if in weight , nbreg options Generalized negative binomial model gnbreg depvar indepvars if in weight , gnbreg options nbreg options Description Model noconstant dispersion(mean) dispersion(constant) exposure(varnamee ) offset(varnameo ) constraints(constraints) collinear suppress constant term parameterization of dispersion; the default constant dispersion for all observations include ln(varnamee ) in model with coefficient constrained to 1 include varnameo in model with coefficient constrained to 1 apply specified linear constraints keep collinear variables SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife Reporting level(#) nolrtest irr nocnsreport display options set confidence level; default is level(95) suppress likelihood-ratio test report incidence-rate ratios do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics 1391 1392 nbreg — Negative binomial regression Description gnbreg options Model noconstant lnalpha(varlist) exposure(varnamee ) offset(varnameo ) constraints(constraints) collinear suppress constant term dispersion model variables include ln(varnamee ) in model with coefficient constrained to 1 include varnameo in model with coefficient constrained to 1 apply specified linear constraints keep collinear variables SE/Robust vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife vce(vcetype) Reporting set confidence level; default is level(95) report incidence-rate ratios do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling level(#) irr nocnsreport display options Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics indepvars and varlist may contain factor variables; see [U] 11.4.3 Factor variables. depvar, indepvars, varnamee , and varnameo may contain time-series operators (nbreg only); see [U] 11.4.4 Time-series varlists. bootstrap, by (nbreg only), fp (nbreg only), jackknife, mfp (nbreg only), mi estimate, nestreg (nbreg only), rolling, statsby, stepwise, and svy are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce() and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu nbreg Statistics > Count outcomes > Negative binomial regression > Count outcomes > Generalized negative binomial regression gnbreg Statistics Description nbreg fits a negative binomial regression model of depvar on indepvars, where depvar is a nonnegative count variable. 
In this model, the count variable is believed to be generated by a Poissonlike process, except that the variation is greater than that of a true Poisson. This extra variation is referred to as overdispersion. See [R] poisson before reading this entry. nbreg — Negative binomial regression 1393 gnbreg fits a generalization of the negative binomial mean-dispersion model; the shape parameter α may also be parameterized. If you have panel data, see [XT] xtnbreg and [ME] menbreg. Options for nbreg Model noconstant; see [R] estimation options. dispersion(mean | constant) specifies the parameterization of the model. dispersion(mean), the default, yields a model with dispersion equal to 1 + α exp(xj β + offsetj ); that is, the dispersion is a function of the expected mean: exp(xj β + offsetj ). dispersion(constant) has dispersion equal to 1 + δ ; that is, it is a constant for all observations. exposure(varnamee ), offset(varnameo ), constraints(constraints), collinear; see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. nolrtest suppresses fitting the Poisson model. Without this option, a comparison Poisson model is fit, and the likelihood is used in a likelihood-ratio test of the null hypothesis that the dispersion parameter is zero. irr reports estimated coefficients transformed to incidence-rate ratios, that is, eβi rather than βi . Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated or stored. irr may be specified at estimation or when replaying previously estimated results. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). The following option is available with nbreg but is not shown in the dialog box: coeflegend; see [R] estimation options. 1394 nbreg — Negative binomial regression Options for gnbreg Model noconstant; see [R] estimation options. lnalpha(varlist) allows you to specify a linear equation for lnα. Specifying lnalpha(male old) means that lnα = γ0 + γ1 male + γ2 old, where γ0 , γ1 , and γ2 are parameters to be estimated along with the other model coefficients. If this option is not specified, gnbreg and nbreg will produce the same results because the shape parameter will be parameterized as a constant. exposure(varnamee ), offset(varnameo ), constraints(constraints), collinear; see [R] estimation options. 
SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. irr reports estimated coefficients transformed to incidence-rate ratios, that is, eβi rather than βi . Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated or stored. irr may be specified at estimation or when replaying previously estimated results. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). The following option is available with gnbreg but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Remarks are presented under the following headings: Introduction to negative binomial regression nbreg gnbreg nbreg — Negative binomial regression 1395 Introduction to negative binomial regression Negative binomial regression models the number of occurrences (counts) of an event when the event has extra-Poisson variation, that is, when it has overdispersion. The Poisson regression model is yj ∼ Poisson(µj ) where µj = exp(xj β + offsetj ) for observed counts yj with covariates xj for the j th observation. One derivation of the negative binomial mean-dispersion model is that individual units follow a Poisson regression model, but there is an omitted variable νj , such that eνj follows a gamma distribution with mean 1 and variance α: yj ∼ Poisson(µ∗j ) where µ∗j = exp(xj β + offsetj + νj ) and eνj ∼ Gamma(1/α, α) With this parameterization, a Gamma(a, b) distribution will have expectation ab and variance ab2 . We refer to α as the overdispersion parameter. The larger α is, the greater the overdispersion. The Poisson model corresponds to α = 0. nbreg parameterizes α as lnα. gnbreg allows lnα to be modeled as lnαj = zj γ, a linear combination of covariates zj . nbreg will fit two different parameterizations of the negative binomial model. The default, described above and also given by the dispersion(mean) option, has dispersion for the j th observation equal to 1 + α exp(xj β + offsetj ). This is seen by noting that the above implies that µ∗j ∼ Gamma(1/α, αµj ) and thus Var(yj ) = E Var(yj |µ∗j ) + Var E(yj |µ∗j ) = E(µ∗j ) + Var(µ∗j ) = µj (1 + αµj ) The alternative parameterization, given by the dispersion(constant) option, has dispersion equal to 1 + δ ; that is, it is constant for all observations. This is so because the constant-dispersion model assumes instead that µ∗j ∼ Gamma(µj /δ, δ) and thus Var(yj ) = µj (1 + δ). The Poisson model corresponds to δ = 0. For detailed derivations of both models, see Cameron and Trivedi (2013, 80–89). 
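Collecting the two parameterizations just described, the conditional variance functions are

\operatorname{Var}(y_j) = \mu_j(1 + \alpha\mu_j) \quad\text{(mean dispersion)} \qquad\qquad \operatorname{Var}(y_j) = \mu_j(1 + \delta) \quad\text{(constant dispersion)}

with the Poisson model recovered at \alpha = 0 and \delta = 0, respectively.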
In particular, note that the mean-dispersion model is known as the NB2 model in their terminology, whereas the constant-dispersion model is referred to as the NB1 model. See Long and Freese (2014) and Cameron and Trivedi (2010, chap. 17) for a discussion of the negative binomial regression model with Stata examples and for a discussion of other regression models for count data. Hilbe (2011) provides an extensive review of the negative binomial model and its variations, using Stata examples. 1396 nbreg — Negative binomial regression nbreg It is not uncommon to posit a Poisson regression model and observe a lack of model fit. The following data appeared in Rodrı́guez (1993): . use http://www.stata-press.com/data/r13/rod93 . list, sepby(cohort) cohort age_mos deaths exposure 1. 2. 3. 4. 5. 6. 7. 1941-1949 1941-1949 1941-1949 1941-1949 1941-1949 1941-1949 1941-1949 0.5 2.0 4.5 9.0 18.0 42.0 90.0 168 48 63 89 102 81 40 278.4 538.8 794.4 1,550.8 3,006.0 8,743.5 14,270.0 8. 9. 10. 11. 12. 13. 14. 1960-1967 1960-1967 1960-1967 1960-1967 1960-1967 1960-1967 1960-1967 0.5 2.0 4.5 9.0 18.0 42.0 90.0 197 48 62 81 97 103 39 403.2 786.0 1,165.3 2,294.8 4,500.5 13,201.5 19,525.0 15. 16. 17. 18. 19. 20. 21. 1968-1976 1968-1976 1968-1976 1968-1976 1968-1976 1968-1976 1968-1976 0.5 2.0 4.5 9.0 18.0 42.0 90.0 195 55 58 85 87 70 10 495.3 956.7 1,381.4 2,604.5 4,618.5 9,814.5 5,802.5 . generate logexp = ln(exposure) . poisson deaths i.cohort, offset(logexp) Iteration 0: log likelihood = -2160.0544 Iteration 1: log likelihood = -2159.5162 Iteration 2: log likelihood = -2159.5159 Iteration 3: log likelihood = -2159.5159 Poisson regression Number of obs LR chi2(2) Prob > chi2 Pseudo R2 Log likelihood = -2159.5159 Std. Err. z P>|z| = = = = 21 49.16 0.0000 0.0113 deaths Coef. [95% Conf. Interval] cohort 1960-1967 1968-1976 -.3020405 .0742143 .0573319 .0589726 -5.27 1.26 0.000 0.208 -.4144089 -.0413698 -.1896721 .1897983 _cons logexp -3.899488 1 .0411345 (offset) -94.80 0.000 -3.98011 -3.818866 nbreg — Negative binomial regression . estat gof Deviance goodness-of-fit Prob > chi2(18) Pearson goodness-of-fit Prob > chi2(18) = = = = 1397 4190.689 0.0000 15387.67 0.0000 The extreme significance of the goodness-of-fit χ2 indicates that the Poisson regression model is inappropriate, suggesting to us that we should try a negative binomial model: . nbreg deaths i.cohort, offset(logexp) nolog Negative binomial regression Dispersion = mean Log likelihood = -131.3799 Std. Err. P>|z| = = = = 21 0.40 0.8171 0.0015 deaths Coef. cohort 1960-1967 1968-1976 -.2676187 -.4573957 .7237203 .7236651 -0.37 -0.63 0.712 0.527 -1.686084 -1.875753 1.150847 .9609618 _cons logexp -2.086731 1 .511856 (offset) -4.08 0.000 -3.08995 -1.083511 /lnalpha .5939963 .2583615 .0876171 1.100376 alpha 1.811212 .4679475 1.09157 3.005295 Likelihood-ratio test of alpha=0: z Number of obs LR chi2(2) Prob > chi2 Pseudo R2 [95% Conf. Interval] chibar2(01) = 4056.27 Prob>=chibar2 = 0.000 Our original Poisson model is a special case of the negative binomial — it corresponds to α = 0. nbreg, however, estimates α indirectly, estimating instead lnα. In our model, lnα = 0.594, meaning that α = 1.81 (nbreg undoes the transformation for us at the bottom of the output). To test α = 0 (equivalent to lnα = −∞), nbreg performs a likelihood-ratio test. The staggering χ2 value of 4,056 asserts that the probability that we would observe these data conditional on α = 0 is virtually zero, that is, conditional on the process being Poisson. The data are not Poisson. 
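The same comparison can be made through information criteria (a sketch, not part of the original example): store both fits and list their AIC and BIC side by side.

. quietly poisson deaths i.cohort, offset(logexp)
. estimates store pois
. quietly nbreg deaths i.cohort, offset(logexp)
. estimates store nb
. estimates stats pois nb

A markedly smaller AIC and BIC for the negative binomial fit would point to the same conclusion as the likelihood-ratio test reported above.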
It is not accidental that this χ2 value is close to the goodness-of-fit statistic from the Poisson regression itself. Technical note The usual Gaussian test of α = 0 is omitted because this test occurs on the boundary, invalidating the usual theory associated with such tests. However, the likelihood-ratio test of α = 0 has been modified to be valid on the boundary. In particular, the null distribution of the likelihood-ratio test statistic is not the usual χ21 , but rather a 50 : 50 mixture of a χ20 (point mass at zero) and a χ21 , denoted as χ201 . See Gutierrez, Carter, and Drukker (2001) for more details. Technical note The negative binomial model deals with cases in which there is more variation than would be expected if the process were Poisson. The negative binomial model is not helpful if there is less than Poisson variation — if the variance of the count variable is less than its mean. However, underdispersion is uncommon. Poisson models arise because of independently generated events. Overdispersion comes about if some of the parameters (causes) of the Poisson processes are unknown. To obtain underdispersion, the sequence of events somehow would have to be regulated; that is, events would not be independent but controlled based on past occurrences. 1398 nbreg — Negative binomial regression gnbreg gnbreg is a generalization of nbreg, dispersion(mean). Whereas in nbreg, one lnα is estimated, gnbreg allows lnα to vary, observation by observation, as a linear combination of another set of covariates: lnαj = zj γ. We will assume that the number of deaths is a function of age, whereas the lnα parameter is a function of cohort. To fit the model, we type . gnbreg deaths age_mos, lnalpha(i.cohort) offset(logexp) Fitting constant-only model: Iteration 0: log likelihood Iteration 1: log likelihood Iteration 2: log likelihood Iteration 3: log likelihood Iteration 4: log likelihood Iteration 5: log likelihood Iteration 6: log likelihood Fitting full model: Iteration 0: log likelihood Iteration 1: log likelihood Iteration 2: log likelihood Iteration 3: log likelihood Iteration 4: log likelihood Generalized negative binomial = = = = = = = -187.067 -137.4064 -134.07766 -131.60668 -131.57951 -131.57948 -131.57948 (not concave) = -124.34327 = -117.70256 = -117.56373 = -117.56164 = -117.56164 regression Number of obs LR chi2(1) Prob > chi2 Pseudo R2 Log likelihood = -117.56164 Std. Err. z P>|z| = = = = 21 28.04 0.0000 0.1065 deaths Coef. [95% Conf. Interval] deaths age_mos _cons logexp -.0516657 -1.867225 1 .0051747 .2227944 (offset) -9.98 -8.38 0.000 0.000 -.061808 -2.303894 -.0415233 -1.430556 lnalpha cohort 1960-1967 1968-1976 .0939546 .0815279 .7187747 .7365476 0.13 0.11 0.896 0.912 -1.314818 -1.362079 1.502727 1.525135 _cons -.4759581 .5156502 -0.92 0.356 -1.486614 .5346978 We find that age is a significant determinant of the number of deaths. The standard errors for the variables in the lnα equation suggest that the overdispersion parameter does not vary across cohorts. We can test this assertion by typing . test 2.cohort 3.cohort ( 1) ( 2) [lnalpha]2.cohort = 0 [lnalpha]3.cohort = 0 chi2( 2) = 0.02 Prob > chi2 = 0.9904 There is no evidence of variation by cohort in these data. nbreg — Negative binomial regression 1399 Technical note Note the intentional absence of a likelihood-ratio test for α = 0 in gnbreg. 
The test is affected by the same boundary condition that affects the comparison test in nbreg; however, when α is parameterized by more than a constant term, the null distribution becomes intractable. For this reason, we recommend using nbreg to test for overdispersion and, if you have reason to believe that overdispersion exists, only then modeling the overdispersion using gnbreg. Stored results nbreg and gnbreg store the following in e(): Scalars e(N) e(k) e(k aux) e(k eq) e(k eq model) e(k dv) e(df m) e(r2 p) e(ll) e(ll 0) e(ll c) e(alpha) e(delta) e(N clust) e(chi2) e(chi2 c) e(p) e(rank) e(rank0) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(clustvar) e(offset) e(offset1) e(chi2type) e(chi2 ct) e(dispers) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(asbalanced) e(asobserved) number of observations number of parameters number of auxiliary parameters number of equations in e(b) number of equations in overall model test number of dependent variables model degrees of freedom pseudo-R-squared log likelihood log likelihood, constant-only mode log likelihood, comparison model value of alpha value of delta number of clusters χ2 χ2 for comparison test significance rank of e(V) rank of e(V) for constant-only model number of iterations return code 1 if converged, 0 otherwise nbreg or gnbreg command as typed name of dependent variable weight type weight expression title in estimation output name of cluster variable linear offset variable (nbreg) linear offset variable (gnbreg) Wald or LR; type of model χ2 test Wald or LR; type of model χ2 test corresponding to e(chi2 c) mean or constant vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict factor variables fvset as asbalanced factor variables fvset as asobserved 1400 nbreg — Negative binomial regression Matrices e(b) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) Functions e(sample) coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample Methods and formulas See [R] poisson and Johnson, Kemp, and Kotz (2005, chap. 4) for an introduction to the Poisson distribution. Methods and formulas are presented under the following headings: Mean-dispersion model Constant-dispersion model Mean-dispersion model A negative binomial distribution can be regarded as a gamma mixture of Poisson random variables. The number of times something occurs, yj , is distributed as Poisson(νj µj ). That is, its conditional likelihood is (νj µj )yj e−νj µj f (yj | νj ) = Γ(yj + 1) where µj = exp(xj β + offsetj ) and νj is an unobserved parameter with a Gamma(1/α, α) density: g(ν) = ν (1−α)/α e−ν/α α1/α Γ(1/α) This gamma distribution has mean 1 and variance α, where α is our ancillary parameter. The unconditional likelihood for the j th observation is therefore Z ∞ Γ(m + yj ) yj f (yj ) = f (yj | ν)g(ν) dν = pm j (1 − pj ) Γ(y j + 1)Γ(m) 0 where pj = 1/(1 + αµj ) and m = 1/α. Solutions for α are handled by searching for lnα because α must be greater than zero. 
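Written out in one place (a reconstruction of the displays above), the conditional likelihood, the gamma mixing density, and the unconditional likelihood are

f(y_j \mid \nu_j) = \frac{(\nu_j\mu_j)^{y_j}\, e^{-\nu_j\mu_j}}{\Gamma(y_j+1)} \qquad\qquad g(\nu) = \frac{\nu^{(1-\alpha)/\alpha}\, e^{-\nu/\alpha}}{\alpha^{1/\alpha}\,\Gamma(1/\alpha)}

f(y_j) = \int_0^{\infty} f(y_j \mid \nu)\, g(\nu)\, d\nu = \frac{\Gamma(m+y_j)}{\Gamma(y_j+1)\,\Gamma(m)}\; p_j^{\,m}\,(1-p_j)^{y_j}

with p_j = 1/(1+\alpha\mu_j), m = 1/\alpha, and \mu_j = \exp(x_j\beta + \text{offset}_j).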
The log likelihood (with weights wj and offsets) is given by m = 1/α lnL = pj = 1/(1 + αµj ) n X µj = exp(xj β + offsetj ) wj ln{Γ(m + yj )} − ln{Γ(yj + 1)} j=1 − ln{Γ(m)} + m ln(pj ) + yj ln(1 − pj ) For gnbreg, α can vary across the observations according to the parameterization lnαj = zj γ. nbreg — Negative binomial regression 1401 Constant-dispersion model The constant-dispersion model assumes that yj is conditionally distributed as Poisson(µ∗j ), where ∼ Gamma(µj /δ, δ) for some dispersion parameter δ (by contrast, the mean-dispersion model assumes that µ∗j ∼ Gamma(1/α, αµj )). The log likelihood is given by µ∗j mj = µj /δ lnL = n X p = 1/(1 + δ) wj ln{Γ(mj + yj )} − ln{Γ(yj + 1)} j=1 − ln{Γ(mj )} + mj ln(p) + yj ln(1 − p) with everything else defined as before in the calculations for the mean-dispersion model. nbreg and gnbreg support the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. These commands also support estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. References Cameron, A. C., and P. K. Trivedi. 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press. . 2013. Regression Analysis of Count Data. 2nd ed. New York: Cambridge University Press. Deb, P., and P. K. Trivedi. 2006. Maximum simulated likelihood estimation of a negative binomial regression model with multinomial endogenous treatment. Stata Journal 6: 246–255. Gutierrez, R. G., S. L. Carter, and D. M. Drukker. 2001. sg160: On boundary-value likelihood-ratio tests. Stata Technical Bulletin 60: 15–18. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 269–273. College Station, TX: Stata Press. Hilbe, J. M. 1998. sg91: Robust variance estimators for MLE Poisson and negative binomial regression. Stata Technical Bulletin 45: 26–28. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 177–180. College Station, TX: Stata Press. . 1999. sg102: Zero-truncated Poisson and negative binomial regression. Stata Technical Bulletin 47: 37–40. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 233–236. College Station, TX: Stata Press. . 2011. Negative Binomial Regression. 2nd ed. Cambridge: Cambridge University Press. Johnson, N. L., A. W. Kemp, and S. Kotz. 2005. Univariate Discrete Distributions. 3rd ed. New York: Wiley. Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage. Long, J. S., and J. Freese. 2001. Predicted probabilities for count models. Stata Journal 1: 51–57. . 2014. Regression Models for Categorical Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata Press. Miranda, A., and S. Rabe-Hesketh. 2006. Maximum likelihood estimation of endogenous switching and sample selection models for binary, ordinal, and count variables. Stata Journal 6: 285–308. Rodrı́guez, G. 1993. sbe10: An improvement to poisson. Stata Technical Bulletin 11: 11–14. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 94–98. College Station, TX: Stata Press. Rogers, W. H. 1991. sbe1: Poisson regression with rates. Stata Technical Bulletin 1: 11–12. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 62–64. College Station, TX: Stata Press. . 1993. sg16.4: Comparison of nbreg and glm for negative binomial. Stata Technical Bulletin 16: 7. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 82–84. 
College Station, TX: Stata Press. 1402 nbreg — Negative binomial regression Also see [R] nbreg postestimation — Postestimation tools for nbreg and gnbreg [R] glm — Generalized linear models [R] poisson — Poisson regression [R] tnbreg — Truncated negative binomial regression [R] zinb — Zero-inflated negative binomial regression [ME] menbreg — Multilevel mixed-effects negative binomial regression [MI] estimation — Estimation commands for use with mi estimate [SVY] svy estimation — Estimation commands for survey data [XT] xtnbreg — Fixed-effects, random-effects, & population-averaged negative binomial models [U] 20 Estimation and postestimation commands Title nbreg postestimation — Postestimation tools for nbreg and gnbreg Description Remarks and examples Syntax for predict Methods and formulas Menu for predict Also see Options for predict Description The following postestimation commands are available after nbreg and gnbreg: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 2 forecast is not appropriate with mi or svy estimation results. lrtest is not appropriate with svy estimation results. 1403 1404 nbreg postestimation — Postestimation tools for nbreg and gnbreg Syntax for predict newvar if in , statistic nooffset predict type stub* | newvarreg newvardisp if in , scores predict type Description statistic Main number of events; the default incidence rate (equivalent to predict . . . , n nooffset) probability Pr(yj = n) probability Pr(a ≤ yj ≤ b) linear prediction standard error of the linear prediction n ir pr(n) pr(a,b) xb stdp In addition, relevant only after gnbreg are the following: Description statistic Main predicted values of αj predicted values of lnαj standard error of predicted lnαj alpha lnalpha stdplna These statistics are available both in and out of sample; type predict only for the estimation sample. . . . if e(sample) . . . if wanted Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main n, the default, calculates the predicted number of events, which is exp(xj β) if neither offset(varnameo ) nor exposure(varnamee ) was specified when the model was fit; exp(xj β + offsetj ) if offset() was specified; or exp(xj β) × exposurej if exposure() was specified. ir calculates the incidence rate exp(xj β), which is the predicted number of events when exposure is 1. 
This is equivalent to specifying both the n and the nooffset options. pr(n) calculates the probability Pr(yj = n), where n is a nonnegative integer that may be specified as a number or a variable. pr(a,b) calculates the probability Pr(a ≤ yj ≤ b), where a and b are nonnegative integers that may be specified as numbers or variables; b missing (b ≥ .) means +∞; pr(20,.) calculates Pr(yj ≥ 20); pr(20,b) calculates Pr(yj ≥ 20) in observations for which b ≥ . and calculates Pr(20 ≤ yj ≤ b) elsewhere. nbreg postestimation — Postestimation tools for nbreg and gnbreg 1405 pr(.,b) produces a syntax error. A missing value in an observation of the variable a causes a missing value in that observation for pr(a,b). xb calculates the linear prediction, which is xj β if neither offset() nor exposure() was specified; xj β + offsetj if offset() was specified; or xj β + ln(exposurej ) if exposure() was specified; see nooffset below. stdp calculates the standard error of the linear prediction. alpha, lnalpha, and stdplna are relevant after gnbreg estimation only; they produce the predicted values of αj , lnαj , and the standard error of the predicted lnαj , respectively. nooffset is relevant only if you specified offset() or exposure() when you fit the model. It modifies the calculations made by predict so that they ignore the offset or exposure variable; the linear prediction is treated as xj β rather than as xj β + offsetj or xj β + ln(exposurej ). Specifying predict . . . , nooffset is equivalent to specifying predict . . . , ir. scores calculates equation-level score variables. The first new variable will contain ∂ ln L/∂(xj β). The second new variable will contain ∂ ln L/∂( lnαj ) for dispersion(mean) and gnbreg. The second new variable will contain ∂ ln L/∂( lnδ) for dispersion(constant). Remarks and examples After nbreg and gnbreg, predict returns the expected number of deaths per cohort and the probability of observing the number of deaths recorded or fewer. . use http://www.stata-press.com/data/r13/rod93 . nbreg deaths i.cohort, nolog Negative binomial regression Number of obs LR chi2(2) Prob > chi2 Pseudo R2 Dispersion = mean Log likelihood = -108.48841 Std. Err. P>|z| 21 0.14 0.9307 0.0007 deaths Coef. cohort 1960-1967 1968-1976 .0591305 -.0538792 .2978419 .2981621 0.20 -0.18 0.843 0.857 -.5246289 -.6382662 .64289 .5305077 _cons 4.435906 .2107213 21.05 0.000 4.0229 4.848912 /lnalpha -1.207379 .3108622 -1.816657 -.5980999 alpha .29898 .0929416 .1625683 .5498555 Likelihood-ratio test of alpha=0: z = = = = chibar2(01) = [95% Conf. Interval] 434.62 Prob>=chibar2 = 0.000 . predict count (option n assumed; predicted number of events) . predict p, pr(0, deaths) . summarize deaths count p Variable Obs Mean deaths count p 21 21 21 84.66667 84.66667 .4991542 Std. Dev. 48.84192 4.00773 .2743702 Min Max 10 80 .0070255 197 89.57143 .9801285 1406 nbreg postestimation — Postestimation tools for nbreg and gnbreg The expected number of deaths ranges from 80 to 90. The probability Pr(yi ≤ deaths) ranges from 0.007 to 0.98. Methods and formulas In the following, we use the same notation as in [R] nbreg. Methods and formulas are presented under the following headings: Mean-dispersion model Constant-dispersion model Mean-dispersion model The equation-level scores are given by score(xβ)j = pj (yj − µj ) αj (µj − yj ) score(τ )j = −m − ln(1 + αj µj ) + ψ(yj + m) − ψ(m) 1 + αj µj where τj = lnαj and ψ(z) is the digamma function. 
Constant-dispersion model The equation-level scores are given by score(xβ)j = mj {ψ(yj + mj ) − ψ(mj ) + ln(p)} score(τ )j = yj − (yj + mj )(1 − p) − score(xβ)j where τj = lnδj . Also see [R] nbreg — Negative binomial regression [U] 20 Estimation and postestimation commands Title nestreg — Nested model statistics Syntax Remarks and examples Also see Menu Stored results Description Acknowledgment Options Reference Syntax Standard estimation command syntax nestreg , options : command name depvar (varlist) (varlist) . . . if in weight command options Survey estimation command syntax nestreg , options : svy vcetype , svy options : command name depvar (varlist) (varlist) . . . if in , command options Description options Reporting report Wald test results; the default report likelihood-ratio test results suppress any output from command name store nested estimation results in est stub# waldtable lrtable quietly store(stub) by is allowed; see [U] 11.1.10 Prefix commands. Weights are allowed if command name allows them; see [U] 11.1.6 weight. A varlist in parentheses indicates that this list of variables is to be considered as a block. Each variable in a varlist not bound in parentheses will be treated as its own block. All postestimation commands behave as they would after command name without the nestreg prefix; see the postestimation manual entry for command name. Menu Statistics > Other > Nested model statistics Description nestreg fits nested models by sequentially adding blocks of variables and then reports comparison tests between the nested models. 1407 1408 nestreg — Nested model statistics Options Reporting waldtable specifies that the table of Wald test results be reported. waldtable is the default. lrtable specifies that the table of likelihood-ratio tests be reported. This option is not allowed if pweights, the vce(robust) option, or the vce(cluster clustvar) option is specified. lrtable is also not allowed with the svy prefix. quietly suppresses the display of any output from command name. store(stub) specifies that each model fit by nestreg be stored under the name est stub#, where # is the nesting order from first to last. Remarks and examples Remarks are presented under the following headings: Estimation commands Wald tests Likelihood-ratio tests Programming for nestreg Estimation commands nestreg removes collinear predictors and observations with missing values from the estimation sample before calling command name. The following Stata commands are supported by nestreg: clogit cloglog glm intreg logistic logit nbreg ologit oprobit poisson probit qreg regress scobit stcox stcrreg streg tobit You do not supply a depvar for stcox, stcrreg, or streg; otherwise, depvar is required. You must supply two depvars for intreg. Wald tests Use nestreg to test the significance of blocks of predictors, building the regression model one block at a time. Using the data from example 1 of [R] test, we wish to test the significance of the following predictors of birth rate: medage, medagesq, and region (already partitioned into four indicator variables: reg1, reg2, reg3, and reg4). nestreg — Nested model statistics . use http://www.stata-press.com/data/r13/census4 (birth rate, median age) . nestreg: regress brate (medage) (medagesq) (reg2-reg4) Block 1: medage SS Source df MS Model Residual 32675.1044 9521.71561 1 48 32675.1044 198.369075 Total 42196.82 49 861.159592 brate Coef. medage _cons -15.24893 618.3935 Block 2: medagesq Source SS Std. Err. 
1.188141 35.15416 df t -12.83 17.59 36755.8524 5440.96755 2 47 18377.9262 115.765267 Total 42196.82 49 861.159592 brate Coef. medage medagesq _cons -109.8925 1.607332 2007.071 3: reg2 reg3 reg4 Source SS P>|t| 0.000 0.000 MS Model Residual Block Number of obs F( 1, 48) Prob > F R-squared Adj R-squared Root MSE Std. Err. 15.96663 .2707228 235.4316 df t -6.88 5.94 8.53 0.000 0.000 0.000 MS Model Residual 38803.419 3393.40095 5 44 7760.68381 77.1227489 Total 42196.82 49 861.159592 brate Coef. medage medagesq reg2 reg3 reg4 _cons -109.0957 1.635208 15.00284 7.366435 21.39679 1947.61 Std. Err. t 13.52452 .2290536 4.252068 3.953336 4.650602 199.8405 -8.07 7.14 3.53 1.86 4.60 9.75 -17.63785 547.7113 0.000 0.000 0.001 0.069 0.000 0.000 -12.86002 689.0756 = = = = = = 50 158.75 0.0000 0.8711 0.8656 10.759 [95% Conf. Interval] -142.0132 1.062708 1533.444 Number of obs F( 5, 44) Prob > F R-squared Adj R-squared Root MSE P>|t| 50 164.72 0.0000 0.7743 0.7696 14.084 [95% Conf. Interval] Number of obs F( 2, 47) Prob > F R-squared Adj R-squared Root MSE P>|t| = = = = = = -77.7718 2.151956 2480.698 = = = = = = 50 100.63 0.0000 0.9196 0.9104 8.782 [95% Conf. Interval] -136.3526 1.173581 6.433365 -.6009898 12.02412 1544.858 Block F Block df Residual df Pr > F R2 Change in R2 1 2 3 164.72 35.25 8.85 1 1 3 48 47 44 0.0000 0.0000 0.0001 0.7743 0.8711 0.9196 0.0967 0.0485 -81.83886 2.096835 23.57233 15.33386 30.76946 2350.362 1409 1410 nestreg — Nested model statistics This single call to nestreg ran regress three times, adding a block of predictors to the model for each run as in . regress brate medage SS Source df MS Model Residual 32675.1044 9521.71561 1 48 32675.1044 198.369075 Total 42196.82 49 861.159592 brate Coef. medage _cons -15.24893 618.3935 Std. Err. 1.188141 35.15416 . regress brate medage medagesq Source SS df t -12.83 17.59 Number of obs F( 1, 48) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.000 MS Model Residual 36755.8524 5440.96755 2 47 18377.9262 115.765267 Total 42196.82 49 861.159592 brate Coef. medage medagesq _cons -109.8925 1.607332 2007.071 Std. Err. 15.96663 .2707228 235.4316 t -6.88 5.94 8.53 0.000 0.000 0.000 50 164.72 0.0000 0.7743 0.7696 14.084 [95% Conf. Interval] -17.63785 547.7113 Number of obs F( 2, 47) Prob > F R-squared Adj R-squared Root MSE P>|t| = = = = = = -12.86002 689.0756 = = = = = = 50 158.75 0.0000 0.8711 0.8656 10.759 [95% Conf. Interval] -142.0132 1.062708 1533.444 -77.7718 2.151956 2480.698 . regress brate medage medagesq reg2-reg4 Source SS df MS Model Residual 38803.419 3393.40095 5 44 7760.68381 77.1227489 Total 42196.82 49 861.159592 brate Coef. medage medagesq reg2 reg3 reg4 _cons -109.0957 1.635208 15.00284 7.366435 21.39679 1947.61 Std. Err. 13.52452 .2290536 4.252068 3.953336 4.650602 199.8405 t -8.07 7.14 3.53 1.86 4.60 9.75 Number of obs F( 5, 44) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.000 0.001 0.069 0.000 0.000 = = = = = = 50 100.63 0.0000 0.9196 0.9104 8.782 [95% Conf. Interval] -136.3526 1.173581 6.433365 -.6009898 12.02412 1544.858 -81.83886 2.096835 23.57233 15.33386 30.76946 2350.362 nestreg collected the F statistic for the corresponding block of predictors and the model R2 statistic from each model fit. The F statistic for the first block, 164.72, is for a test of the joint significance of the first block of variables; it is simply the F statistic from the regression of brate on medage. 
The F statistic for the second block, 35.25, is for a test of the joint significance of the second block of variables in a regression of both the first and second blocks of variables. In our example, it is an F test of medagesq in the regression of brate on medage and medagesq. Similarly, the third block’s F statistic of 8.85 corresponds to a joint test of reg2, reg3, and reg4 in the final regression. nestreg — Nested model statistics 1411 Likelihood-ratio tests The nestreg command provides a simple syntax for performing likelihood-ratio tests for nested model specifications; also see lrtest. Using the data from example 1 of [R] lrtest, we wish to jointly test the significance of the following predictors of low birthweight: age, lwt, ptl, and ht. . use http://www.stata-press.com/data/r13/lbw (Hosmer & Lemeshow data) . xi: nestreg, lr: logistic low (i.race smoke ui) (age lwt ptl ht) i.race _Irace_1-3 (naturally coded; _Irace_1 omitted) Block 1: _Irace_2 _Irace_3 smoke ui Logistic regression Number of obs = 189 LR chi2(4) = 18.80 Prob > chi2 = 0.0009 Log likelihood = -107.93404 Pseudo R2 = 0.0801 low Odds Ratio Std. Err. _Irace_2 _Irace_3 smoke ui _cons 3.052746 2.922593 2.945742 2.419131 .1402209 1.498087 1.189229 1.101838 1.047359 .0512295 z 2.27 2.64 2.89 2.04 -5.38 Block 2: age lwt ptl ht Logistic regression Log likelihood = Odds Ratio _Irace_2 _Irace_3 smoke ui age lwt ptl ht _cons 3.534767 2.368079 2.517698 2.1351 .9732636 .9849634 1.719161 6.249602 1.586014 [95% Conf. Interval] 0.023 0.008 0.004 0.041 0.000 1.166747 1.316457 1.415167 1.035459 .0685216 Number of obs LR chi2(8) Prob > chi2 Pseudo R2 -100.724 low P>|z| Std. Err. 1.860737 1.039949 1.00916 .9808153 .0354759 .0068217 .5952579 4.322408 1.910496 z 2.40 1.96 2.30 1.65 -0.74 -2.19 1.56 2.65 0.38 = = = = 7.987382 6.488285 6.131715 5.651788 .2869447 189 33.22 0.0001 0.1416 P>|z| [95% Conf. Interval] 0.016 0.050 0.021 0.099 0.457 0.029 0.118 0.008 0.702 1.259736 1.001356 1.147676 .8677528 .9061578 .9716834 .8721455 1.611152 .1496092 Block LL LR df Pr > LR AIC BIC 1 2 -107.934 -100.724 18.80 14.42 4 4 0.0009 0.0061 225.8681 219.448 242.0768 248.6237 9.918406 5.600207 5.523162 5.2534 1.045339 .9984249 3.388787 24.24199 16.8134 The estimation results from the full model are left in e(), so we can later use estat and other postestimation commands. . estat gof Logistic model for low, goodness-of-fit test number of observations number of covariate patterns Pearson chi2(173) Prob > chi2 = = = = 189 182 179.24 0.3567 1412 nestreg — Nested model statistics Programming for nestreg If you want your user-written command (command name) to work with nestreg, it must follow standard Stata syntax and allow the if qualifier. Furthermore, command name must have sw or swml as a program property; see [P] program properties. If command name has swml as a property, command name must store the log-likelihood value in e(ll) and the model degrees of freedom in e(df m). Stored results nestreg stores the following in r(): Matrices r(wald) r(lr) matrix corresponding to the Wald table matrix corresponding to the likelihood-ratio table Acknowledgment We thank Paul H. Bern of Syracuse University for developing the hierarchical regression command that inspired nestreg. Reference Acock, A. C. 2014. A Gentle Introduction to Stata. 4th ed. College Station, TX: Stata Press. 
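A minimal sketch of a user-written command satisfying the Programming for nestreg requirements above follows (the command name mypois is hypothetical, and the internals simply delegate to an existing estimator rather than implement one):

program mypois, eclass properties(swml)
        version 13
        syntax varlist(min=1 fv) [if] [in]
        // Delegate the fit to poisson.  Because poisson itself leaves the
        // log likelihood in e(ll) and the model degrees of freedom in
        // e(df_m), the swml requirements described above are met.
        poisson `varlist' `if' `in'
end

Saved as mypois.ado on the ado-path, this should then allow, for example, nestreg, lr: mypois deaths (i.cohort) (age_mos).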
Also see [P] program properties — Properties of user-defined programs Title net — Install and manage user-written additions from the Internet Syntax References Description Also see Options Remarks and examples Syntax Set current location for net net from directory or url Change to a different net directory net cd path or url Change to a different net site net link linkname Search for installed packages net search (see [R] net search) Report current net location net Describe a package net describe pkgname , from(directory or url) Set location where packages will be installed net set ado dirname Set location where ancillary files will be installed net set other dirname Report net ‘from’, ‘ado’, and ‘other’ settings net query Install ado-files and help files from a package net install pkgname , all replace force from(directory or url) 1413 1414 net — Install and manage user-written additions from the Internet Install ancillary files from a package net get pkgname , all replace force from(directory or url) Shortcut to access Stata Journal (SJ) net site net sj vol-issue insert Shortcut to access Stata Technical Bulletin (STB) net site net stb issue insert List installed packages ado , find(string) from(dirname) ado dir pkgid , find(string) from(dirname) Describe installed packages ado describe pkgid , find(string) from(dirname) Uninstall an installed package ado uninstall pkgid , from(dirname) where pkgname is pkgid is or dirname is or or or name of a package name of a package a number in square brackets: [#] a directory name PLUS (default) PERSONAL SITE Description net downloads and installs additions to Stata. The additions can be obtained from the Internet or from physical media. The additions can be ado-files (new commands), help files, or even datasets. Collections of files are bound together into packages. For instance, the package named zz49 might add the xyz command to Stata. At a minimum, such a package would contain xyz.ado, the code to implement the new command, and xyz.sthlp, the system help to describe it. That the package contains two files is a detail: you use net to download the package zz49, regardless of the number of files. ado manages the packages you have installed by using net. The ado command lets you list and uninstall previously installed packages. You can also access the net and ado features by selecting Help > SJ and User-written Programs; this is the recommended method to find and install additions to Stata. net — Install and manage user-written additions from the Internet 1415 Options all is used with net install and net get. Typing it with either one makes the command equivalent to typing net install followed by net get. replace is for use with net install and net get. It specifies that the downloaded files replace existing files if any of the files already exists. force specifies that the downloaded files replace existing files if any of the files already exists, even if Stata thinks all the files are the same. force implies replace. find(string) is for use with ado, ado dir, and ado describe. It specifies that the descriptions of the packages installed on your computer be searched, and that the package descriptions containing string be listed. from(dirname), when used with ado, specifies where the packages are installed. The default is from(PLUS). PLUS is a code word that Stata understands to correspond to a particular directory on your computer that was set at installation time. 
On Windows computers, PLUS probably means the directory c:\ado\plus, but it might mean something else. You can find out what it means by typing sysdir, but doing so is irrelevant if you use the defaults. from(directory or url), when used with net, specifies the directory or URL where installable packages may be found. The directory or URL is the same as the one that would have been specified with net from. Remarks and examples For an introduction to using net and ado, see [U] 28 Using the Internet to keep up to date. The purpose of this documentation is • to briefly, but accurately, describe net and ado and all their features and • to provide documentation to those who wish to set up their own sites to distribute additions to Stata. Remarks are presented under the following headings: Definition of a package The purpose of the net and ado commands Content pages Package-description pages Where packages are installed A summary of the net command A summary of the ado command Relationship of net and ado to the point-and-click interface Creating your own site Format of content and package-description files Example 1 Example 2 Additional package directives SMCL in content and package-description files Error-free file delivery 1416 net — Install and manage user-written additions from the Internet Definition of a package A package is a collection of files—typically, .ado and .sthlp files—that together provide a new feature in Stata. Packages contain additions that you wish had been part of Stata at the outset. We write such additions, and so do other users. One source of these additions is the Stata Journal, a printed and electronic journal with corresponding software. If you want the journal, you must subscribe, but the software is available for free from our website. The purpose of the net and ado commands The net command makes it easy to distribute and install packages. The goal is to get you quickly to a package-description page that summarizes the addition, for example, . net describe rte_stat, from(http://www.wemakeitupaswego.edu/faculty/sgazer/) package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer/ TITLE rte_stat. The robust-to-everything statistic; update. DESCRIPTION/AUTHOR(S) S. Gazer, Dept. of Applied Theoretical Mathematics, WMIUAWG Univ. Aleph-0 100% confidence intervals proved too conservative for some applications; Aleph-1 confidence intervals have been substituted. The new robust-to-everything supplants the previous robust-toeverything-conceivable statistic. See "Inference in the absence of data" (forthcoming). After installation, see help rte. INSTALLATION FILES (type net install rte_stat) rte.ado rte.sthlp nullset.ado random.ado If you decide that the addition might prove useful, net makes the installation easy: . net install rte_stat checking rte_stat consistency and verifying not already installed... installing into c:\ado\plus\ ... installation complete. The ado command helps you manage packages installed with net. Perhaps you remember that you installed a package that calculates the robust-to-everything statistic, but you cannot remember the command’s name. You could use ado to search what you have previously installed for the rte command, . ado [1] package sg145 from http://www.stata.com/stb/stb56 STB-56 sg145. Scalar measures of fit for regression models. (output omitted ) [15] package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer rte_stat. The robust-to-everything statistic; update. 
(output omitted ) [21] package st0119 from http://www.stata-journal.com/software/sj7-1 SJ7-1 st0119. Rasch analysis net — Install and manage user-written additions from the Internet 1417 or you might type . ado, find("robust-to-everything") [15] package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer rte_stat. The robust-to-everything statistic; update. Perhaps you decide that rte, despite the author’s claims, is not worth the disk space it occupies. You can use ado to erase it: . ado uninstall rte_stat package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer rte_stat. The robust-to-everything statistic; update. (package uninstalled) ado uninstall is easier than erasing the files by hand because ado uninstall erases every file associated with the package, and, moreover, ado knows where on your computer rte stat is installed; you would have to hunt for these files. Content pages There are two types of pages displayed by net: content pages and package-description pages. When you type net from, net cd, net link, or net without arguments, Stata goes to the specified place and displays the content page: . net from http://www.stata.com http://www.stata.com/ StataCorp Welcome to StataCorp. Below we provide links to sites providing additions to Stata, including the Stata Journal, STB, and Statalist. These are NOT THE OFFICIAL UPDATES; you fetch and install the official updates by typing -update-. PLACES you could -net link- to: sj The Stata Journal DIRECTORIES you could -net cd- to: stb materials published in the Stata Technical Bulletin users materials written by various people, including StataCorp employees meetings software packages from Stata Users Group meetings links links to other locations providing additions to Stata A content page tells you about other content pages and package-description pages. The example above lists other content pages only. Below we follow one of the links for the Stata Journal: 1418 net — Install and manage user-written additions from the Internet . net link sj http://www.stata-journal.com/ The Stata Journal The Stata Journal is a refereed, quarterly journal containing articles of interest to Stata users. For more details and subscription information, visit the Stata Journal website at http://www.stata-journal.com. PLACES you could -net link- to: stata StataCorp website DIRECTORIES you could -net cd- to: production Files for authors of the Stata Journal software Software associated with Stata Journal articles . net cd software http://www.stata-journal.com/software/ The Stata Journal PLACES you could -net link- to: stata StataCorp website stb Stata Technical Bulletin (STB) software archive DIRECTORIES you could -net cd- to: (output omitted ) sj7-1 volume 7, issue 1 (output omitted ) sj1-1 volume 1, issue 1 . net cd sj7-1 http://www.stata-journal.com/software/sj7-1/ Stata Journal volume 7, issue 1 DIRECTORIES you could -net cd- to: .. Other Stata Journals PACKAGES you could -net describe-: dm0027 File filtering in Stata: handling complex data formats and navigating log files efficiently st0119 Rasch analysis st0120 Multivariable regression spline models st0121 mhbounds - Sensitivity Analysis for Average Treatment Effects dm0027, st0119, . . . , st0121 are links to package-description pages. 1. When you type net from, you follow that with a location to display the location’s content page. a. The location could be a URL, such as http://www.stata.com. The content page at that location would then be listed. b. 
The location could be e: on a Windows computer or a mounted volume on a Mac computer. The content page on that source would be listed. That would work if you had special media obtained from StataCorp or special media prepared by another user. c. The location could even be a directory on your computer, but that would work only if that directory contained the right kind of files. net — Install and manage user-written additions from the Internet 1419 2. Once you have specified a location, typing net cd will take you into subdirectories of that location, if there are any. Typing . net from http://www.stata-journal.com . net cd software is equivalent to typing . net from http://www.stata-journal.com/software Typing net cd displays the content page from that location. 3. Typing net without arguments redisplays the current content page, which is the content page last displayed. 4. net link is similar to net cd in that the result is to change the location, but rather than changing to subdirectories of the current location, net link jumps to another location: . net from http://www.stata-journal.com http://www.stata-journal.com/ The Stata Journal The Stata Journal is a refereed, quarterly journal containing articles of interest to Stata users. For more details and subscription information, visit the Stata Journal website at http://www.stata-journal.com. PLACES you could -net link- to: stata StataCorp website DIRECTORIES you could -net cd- to: production Files for authors of the Stata Journal software Software associated with Stata Journal articles Typing net link stata would jump to http://www.stata.com: . net link stata http://www.stata.com/ StataCorp Welcome to StataCorp. (output omitted ) Package-description pages Package-description pages describe what could be installed: 1420 net — Install and manage user-written additions from the Internet . net from http://www.stata-journal.com/software/sj7-1 http://www.stata-journal.com/software/sj7-1/ (output omitted ) . net describe st0119 package st0119 from http://www.stata-journal.com/software/sj7-1 TITLE SJ7-1 st0119. Rasch analysis DESCRIPTION/AUTHOR(S) Rasch analysis by Jean-Benoit Hardouin, University of Nantes, France Support: jean-benoit.hardouin@univ-nantes.fr After installation, type help gammasym, gausshermite, geekel2d, raschtest, and raschtestv7 INSTALLATION FILES (type net install st0119) st0119/raschtest.ado st0119/raschtest.hlp st0119/raschtestv7.ado st0119/raschtestv7.hlp st0119/gammasym.ado st0119/gammasym.hlp st0119/gausshermite.ado st0119/gausshermite.hlp st0119/geekel2d.ado st0119/geekel2d.hlp ANCILLARY FILES (type net get st0119) st0119/data.dta st0119/outrasch.do A package-description page describes the package and tells you how to install the component files. Package-description pages potentially describe two types of files: 1. Installation files: files that you type net install to install and that are required to make the addition work. 2. Ancillary files: additional files that you might want to install—you type net get to install them— but that you can ignore. Ancillary files are typically datasets that are useful for demonstration purposes. Ancillary files are not really installed in the sense of being copied to an official place for use by Stata itself. They are merely copied into the current directory so that you may use them if you wish. You install the official files by typing net install followed by the package name. For example, to install st0119, you would type . 
net install st0119 checking st0119 consistency and verifying not already installed... installing into c:\ado\plus\ ... installation complete. You get the ancillary files—if there are any and if you want them—by typing net get followed by the package name: net — Install and manage user-written additions from the Internet 1421 . net get st0119 checking st0119 consistency and verifying not already installed... copying into current directory... copying data.dta copying outrasch.do ancillary files successfully copied. Most users ignore the ancillary files. Once you have installed a package—by typing net install—use ado to redisplay the packagedescription page whenever you wish: . ado describe st0119 [1] package st0119 from http://www.stata-journal.com/software/sj7-1 TITLE SJ7-1 st0119. Rasch analysis DESCRIPTION/AUTHOR(S) Rasch analysis by Jean-Benoit Hardouin, University of Nantes, France Support: jean-benoit.hardouin@univ-nantes.fr After installation, type help gammasym, gausshermite, geekel2d, raschtest, and raschtestv7 INSTALLATION FILES r/raschtest.ado r/raschtest.hlp r/raschtestv7.ado r/raschtestv7.hlp g/gammasym.ado g/gammasym.hlp g/gausshermite.ado g/gausshermite.hlp g/geekel2d.ado g/geekel2d.hlp INSTALLED ON 24 Apr 2013 The package-description page shown by ado includes the location from which we got the package and when we installed it. It does not mention the ancillary files that were originally part of this package because they are not tracked by ado. Where packages are installed Packages should be installed in PLUS or SITE, which are code words that Stata understands and that correspond to some real directories on your computer. Typing sysdir will tell you where these are, if you care. . sysdir STATA: BASE: SITE: PLUS: PERSONAL: OLDPLACE: C:\Program Files\Stata13\ C:\Program Files\Stata13\ado\base\ C:\Program Files\Stata13\ado\site\ c:\ado\plus\ c:\ado\personal\ c:\ado\ If you type sysdir, you may obtain different results. 1422 net — Install and manage user-written additions from the Internet By default, net installs in the PLUS directory, and ado tells you about what is installed there. If you are on a multiple-user system, you may wish to install some packages in the SITE directory. This way, they will be available to other Stata users. To do that, before using net install, type . net set ado SITE and when reviewing what is installed or removing packages, redirect ado to that directory: . ado . . ., from(SITE) In both cases, you type SITE because Stata will understand that SITE means the site ado-directory as defined by sysdir. To install into SITE, you must have write access to that directory. If you reset where net installs and then, in the same session, wish to install into your private ado-directory, type . net set ado PLUS That is how things were originally. If you are confused as to where you are, type net query. A summary of the net command The net command displays content pages and package-description pages. Such pages are provided over the Internet, and most users get them there. We recommend that you start at http://www.stata.com and work out from there. We also recommend using net search to find packages of interest to you; see [R] net search. net from moves you to a location and displays the content page. net cd and net link change from your current location to other locations. net cd enters subdirectories of the original location. net link jumps from one location to another, depending on the code on the content page. net describe lists a package-description page. 
Packages are named, and you type net describe pkgname. net install installs a package into your copy of Stata. net get copies any additional files (ancillary files) to your current directory. net sj and net stb simplify loading files from the Stata Journal and its predecessor, the Stata Technical Bulletin. net sj vol-issue is a synonym for typing net from http://www.stata-journal.com/software/sjvol-issue whereas net sj vol-issue insert is a synonym for typing net from http://www.stata-journal.com/software/sjvol-issue net describe insert net set controls where net installs files. By default, net installs in the PLUS directory; see [P] sysdir. net set ado SITE would cause subsequent net commands to install in the SITE directory. net set other sets where ancillary files, such as .dta files, are installed. The default is the current directory. net query displays the current net from, net set ado, and net set other settings. net — Install and manage user-written additions from the Internet 1423 A summary of the ado command The ado command lists the package descriptions of previously installed packages. Typing ado without arguments is the same as typing ado dir. Both list the names and titles of the packages you have installed. ado describe lists full package-description pages. ado uninstall removes packages from your computer. Because you can install packages from a variety of sources, the package names may not always be unique. Thus the packages installed on your computer are numbered sequentially, and you may refer to them by name or by number. For instance, say that you wanted to get rid of the robust-to-everything statistic command you installed. Type . ado, find("robust-to-everything") [15] package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer rte_stat. The robust-to-everything statistic; update. You could then type . ado uninstall rte_stat or . ado uninstall [15] Typing ado uninstall rte stat would work only if the name rte stat were unique; otherwise, ado would refuse, and you would have to type the number. The find() option is allowed with ado dir and ado describe. It searches the package description for the word or phrase you specify, ignoring case (alpha matches Alpha). The complete package description is searched, including the author’s name and the name of the files. Thus if rte was the name of a command that you wanted to eliminate, but you could not remember the name of the package, you could type . ado, find(rte) [15] package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer rte_stat. The robust-to-everything statistic; update. Relationship of net and ado to the point-and-click interface Users may instead select Help > SJ and User-written Programs. There are advantages and disadvantages: 1. Flipping through content and package-description pages is easier; it is much like a browser. See [GS] 19 Updating and extending Stata—Internet functionality (GSM, GSU, or GSW). 2. When browsing a product-description page, note that the .sthlp files are highlighted. You may click on .sthlp files to review them before installing the package. 3. You may not redirect from where ado searches for files. 1424 net — Install and manage user-written additions from the Internet Creating your own site The rest of this entry concerns how to create your own site to distribute additions to Stata. 
The idea is that you have written additions for use with Stata—say, xyz.ado and xyz.sthlp—and you wish to put them out so that coworkers or researchers at other institutions can easily install them. Or, perhaps you just have a dataset that you and others want to share. In any case, all you need is a webpage. You place the files that you want to distribute on your webpage (or in a subdirectory), and you add two more files—a content file and a package-description file—and you are done. Format of content and package-description files The content file describes the content page. It must be named stata.toc: begin stata.toc OFF (to make site unavailable temporarily) * lines starting with * are comments; they are ignored * * v * * * d d d d blank lines are ignored, too v indicates version—specify v 3, which is the current version of .toc files 3 d lines display description text the first d line is the title, and the remaining ones are text blank d lines display a blank line title text text ... * l lines display links l word-to-show path-or-url [description] l word-to-show path-or-url [description] ... * t lines display other directories within the site t path [description] t path [description] ... * p lines display packages p pkgname [description] p pkgname [description] ... end stata.toc Package files describe packages and are named pkgname.pkg: net — Install and manage user-written additions from the Internet 1425 begin pkgname.pkg * * * v * * * d d d d d lines starting with * are comments; they are ignored blank lines are ignored, too v indicates version—specify v 3, which is the current version of .toc files 3 d lines display package description text the first d line is the title, and the remaining ones are text blank d lines display a blank line title text Distribution-Date: date text ... * f identifies the component files f [path/]filename [description] f [path/]filename [description] ... * e line is optional; it means stop reading e end pkgname.pkg Note the Distribution-Date description line. This line is optional but recommended. Stata can look for updates to user-written programs with the adoupdate command if the package files from which those programs were installed contain a Distribution-Date description line. Example 1 Say that we want the user to see the following: . net from http://www.university.edu/~me http://www.university.edu/~me Chris Farrar, Uni University PACKAGES you could -net describe-: xyz interval-truncated survival . net describe xyz package xyz from http://www.university.edu/~me TITLE xyz. interval-truncated survival. DESCRIPTION/AUTHOR(S) C. Farrar, Uni University. INSTALLATION FILES xyz.ado xyz.sthlp ANCILLARY FILES (type net install xyz) (type net get xyz) sample.dta The files needed to do this would be begin stata.toc v 3 d Chris Farrar, Uni University p xyz interval-truncated survival end stata.toc 1426 net — Install and manage user-written additions from the Internet begin xyz.pkg v d d f f f 3 xyz. interval-truncated survival. C. Farrar, Uni University. xyz.ado xyz.sthlp sample.dta end xyz.pkg On his homepage, Chris would place the following files: stata.toc xyz.pkg xyz.ado xyz.sthlp sample.dta (shown above) (shown above) file to be delivered (for use by net install) file to be delivered (for use by net install) file to be delivered (for use by net get) Chris does nothing to distinguish ancillary files from installation files. Example 2 S. Gazer wants to create a more complex site: . 
net from http://www.wemakeitupaswego.edu/faculty/sgazer http://www.wemakeitupaswego.edu/faculty/sgazer Data-free inference materials S. Gazer, Department of Applied Theoretical Mathematics Also see my homepage for the preprint of "Irrefutable inference". PLACES you could -net link- to: stata StataCorp website DIRECTORIES you could -net cd- to: ir irrefutable inference programs (work in progress) PACKAGES you could -net describe-: rtec Robust-to-everything-conceivable statistic rte Robust-to-everything statistic net — Install and manage user-written additions from the Internet . net describe rte package rte from http://www.wemakeitupaswego.edu/faculty/sgazer/ TITLE rte. The robust-to-everything statistic; update. DESCRIPTION/AUTHOR(S) S. Gazer, Dept. of Applied Theoretical Mathematics, WMIUAWG Univ. Aleph-0 100% confidence intervals proved too conservative for some applications; Aleph-1 confidence intervals have been substituted. The new robust-to-everything supplants the previous robust-toeverything-conceivable statistic. See "Inference in the absence of data" (forthcoming). After installation, see help rte. Distribution-Date: 20130420 Support: email sgazer@wemakeitupaswego.edu INSTALLATION FILES (type net install rte_stat) rte.ado rte.sthlp nullset.ado random.ado ANCILLARY FILES (type net get rte_stat) empty.dta The files needed to do this would be begin stata.toc v d d d d l t p p v d d d d d d d d d d d f f f f f 3 Data-free inference materials S. Gazer, Department of Applied Theoretical Mathematics Also see my homepage for the preprint of "Irrefutable inference". stata http://www.stata.com ir irrefutable inference programs (work in progress) rtec Robust-to-everything-conceivable statistic rte Robust-to-everything statistic end stata.toc begin rte.pkg 3 rte. The robust-to-everything statistic; update. {bf:S. Gazer, Dept. of Applied Theoretical Mathematics, WMIUAWG Univ.} Aleph-0 100% confidence intervals proved too conservative for some applications; Aleph-1 confidence intervals have been substituted. The new robust-to-everything supplants the previous robust-toeverything-conceivable statistic. See "Inference in the absence of data" (forthcoming). After installation, see help {bf:rte}. Distribution-Date: 20130420 Support: email sgazer@wemakeitupaswego.edu rte.ado rte.sthlp nullset.ado random.ado empty.dta end rte.pkg 1427 1428 net — Install and manage user-written additions from the Internet On his homepage, Mr. Gazer would place the following files: stata.toc rte.pkg rte.ado rte.sthlp nullset.ado random.ado empty.dta rtec.pkg rtec.ado rtec.sthlp ir/stata.toc ir/. . . ir/. . . (shown above) (shown above) (file to be delivered) (file to be delivered) (file to be delivered) (file to be delivered) (file to be delivered) the other package referred to in stata.toc the corresponding files to be delivered the contents file for when the user types net cd ir whatever other .pkg files are referred to whatever other files are to be delivered If Mr. Gazer later updated the rte package, he could change the Distribution-Date description line in his package. Then, if someone who had previously installed the rte packaged wanted to obtain the latest version, that person could use the adoupdate command; see [R] adoupdate. For complex sites, a different structure may prove more convenient: stata.toc rte.pkg rtec.pkg rte/ rte/rte.ado rte/rte.sthlp rte/nullset.ado rte/random.ado rte/empty.dta rtec/ rtec/. . . ir/stata.toc ir/*.pkg ir/*/. . . 
(shown above) (shown above) the other package referred to in stata.toc directory containing rte files to be delivered: (file to be delivered) (file to be delivered) (file to be delivered) (file to be delivered) (file to be delivered) directory containing rtec files to be delivered: (files to be delivered) the contents file for when the user types net cd ir whatever other package files are referred to whatever other files are to be delivered If you prefer this structure, it is simply a matter of changing the bottom of the rte.pkg from f f f f f rte.ado rte.sthlp nullset.ado random.ado empty.dta f f f f f rte/rte.ado rte/rte.sthlp rte/nullset.ado rte/random.ado rte/empty.dta to In writing paths and files, the directory separator forward slash (/) is used, regardless of operating system, because this is what the Internet uses. It does not matter whether the files you put out are in Windows, Mac, or Unix format (how lines end is recorded differently). When Stata reads the files over the Internet, it will figure out the file format on its own and will automatically translate the files to what is appropriate for the receiver. net — Install and manage user-written additions from the Internet 1429 Additional package directives F filename is similar to f filename, except that, when the file is installed, it will always be copied to the system directories (and not the current directory). With f filename, the file is installed into a directory according to the file’s suffix. For instance, xyz.ado would be installed in the system directories, whereas xyz.dta would be installed in the current directory. Coding F xyz.ado would have the same result as coding f xyz.ado. Coding F xyz.dta, however, would state that xyz.dta is to be installed in the system directories. g platformname filename is also a variation on f filename. It specifies that the file be installed only if the user’s operating system is of type platformname; otherwise, the file is ignored. The platform names are WIN64A (64-bit x86-64) and WIN (32-bit x86) for Windows; MACINTEL64 (64-bit Intel, GUI), OSX.X8664 (64-bit Intel, console), MACINTEL (32-bit Intel, GUI), and OSX.X86 (32-bit Intel, console) for Mac; and LINUX64 (64-bit x86-64), LINUX (32-bit x86), SOLX8664 (64-bit x86-64), and SOL64 for Unix. G platformname filename is a variation on F filename. The file, if not ignored, is to be installed in the system directories. g platformname filename1 filename2 is a more detailed version of g platformname filename. In this case, filename1 is the name of the file on the server (the file to be copied), and filename2 is to be the name of the file on the user’s system; for example, you might code g WIN mydll.forwin mydll.plugin g LINUX mydll.forlinux mydll.plugin When you specify one filename, the result is the same as specifying two identical filenames. G platformname filename1 filename2 is the install-in-system-directories version of g platformname filename1 filename2. h filename asserts that filename must be loaded, or this package is not to be installed; for example, you might code g WIN mydll.forwin mydll.plugin g LINUX mydll.forlinux mydll.plugin h mydll.plugin if you were offering the plugin mydll.plugin for Windows and Linux only. SMCL in content and package-description files The text listed on the second and subsequent d lines in both stata.toc and pkgname.pkg may contain SMCL as long as you include v 3; see [P] smcl. Thus, in rte.pkg, S. Gazer coded the third line as d {bf:S. Gazer, Dept. 
of Applied Theoretical Mathematics, WMIUAWG Univ.} 1430 net — Install and manage user-written additions from the Internet Error-free file delivery Most people transport files over the Internet and never worry about the file being corrupted in the process because corruption rarely occurs. If, however, the files must be delivered perfectly or not at all, you can include checksum files in the directory. For instance, say that big.dta is included in your package and that it must be sent perfectly. First, use Stata to make the checksum file for big.dta . checksum big.dta, save That command creates a small file called big.sum; see [D] checksum. Then copy both big.dta and big.sum to your homepage. If set checksum is on (the default is off), whenever Stata reads filename.whatever over the net, it also looks for filename.sum. If it finds such a file, it uses the information recorded in it to verify that what was copied was error free. If you do this, be cautious. If you put big.dta and big.sum on your homepage and then later change big.dta without changing big.sum, people will think that there are transmission errors when they try to download big.dta. References Baum, C. F., and N. J. Cox. 1999. ip29: Metadata for user-written contributions to the Stata programming language. Stata Technical Bulletin 52: 10–12. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 121–124. College Station, TX: Stata Press. Cox, N. J., and C. F. Baum. 2000. ip29.1: Metadata for user-written contributions to the Stata programming language. Stata Technical Bulletin 54: 21–22. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 124–126. College Station, TX: Stata Press. Also see [R] adoupdate — Update user-written ado-files [R] net search — Search the Internet for installable packages [R] netio — Control Internet connections [R] search — Search Stata documentation and other resources [R] sj — Stata Journal and STB installation instructions [R] ssc — Install and uninstall packages from SSC [R] update — Check for official updates [D] checksum — Calculate checksum of file [P] smcl — Stata Markup and Control Language [GSM] 19 Updating and extending Stata—Internet functionality [GSU] 19 Updating and extending Stata—Internet functionality [GSW] 19 Updating and extending Stata—Internet functionality [U] 28 Using the Internet to keep up to date Title net search — Search the Internet for installable packages Syntax References Description Also see Options Remarks and examples Syntax net search word word . . . , options options Description or nosj tocpkg toc pkg everywhere filenames errnone list packages that contain any of the keywords; default is all search non-SJ and non-STB sources search both tables of contents and packages; the default search tables of contents only search packages only search packages for match search filenames associated with package for match make return code 111 instead of 0 when no matches found Description net search searches the Internet for user-written additions to Stata, including, but not limited to, user-written additions published in the Stata Journal (SJ) and in the Stata Technical Bulletin (STB). net search lists the available additions that contain the specified keywords. The user-written materials found are available for immediate download by using the net command or by clicking on the link. In addition to typing net search, you may select Help > Search... and choose Search net resources. This is the recommended way to search for user-written additions to Stata. 
Options or is relevant only when multiple keywords are specified. By default, net search lists only packages that include all the keywords. or changes the command to list packages that contain any of the keywords. nosj specifies that net search not list matches that were published in the SJ or in the STB. tocpkg, toc, and pkg determine what is searched. tocpkg is the default, meaning that both tables of contents (tocs) and packages (pkgs) are searched. toc restricts the search to tables of contents. pkg restricts the search to packages. everywhere and filenames determine where in packages net search looks for keywords. The default is everywhere. filenames restricts net search to search for matches only in the filenames associated with a package. Specifying everywhere implies pkg. errnone is a programmer’s option that causes the return code to be 111 instead of 0 when no matches are found. 1431 1432 net search — Search the Internet for installable packages Remarks and examples net search searches the Internet for user-written additions to Stata. If you want to search the Stata documentation for a particular topic, command, or author, see [R] search . net search word word . . . (without options) is equivalent to typing search word word . . . , net. Remarks are presented under the following headings: Topic searches Author searches Command searches Where does net search look? How does net search work? Topic searches Example: Find what is available about random effects . net search random effect Comments: • It is best to search using the singular form of a word. net search random effect will find both “random effect” and “random effects”. • net search random effect will also find “random-effect” because net search performs a string search and not a word search. • net search random effect lists all packages containing the words “random” and “effect”, not necessarily used together. • If you wanted all packages containing the word “random” or the word “effect”, you would type net search random effect, or. Author searches Example: Find what is available by author Jeroen Weesie . net search weesie Comments: • You could type net search jeroen weesie, but that might list fewer results because sometimes the last name is used without the first. • You could type net search Weesie, but it would not matter. Capitalization is ignored in the search. Example: Find what is available by Jeroen Weesie, excluding SJ and STB materials . net search weesie, nosj • The SJ and the STB tend to dominate search results because so much has been published in them. If you know that what you are looking for is not in the SJ or in the STB, specifying the nosj option will narrow the search. • net search weesie lists everything that net search weesie, nosj lists, and more. If you just type net search weesie, look down the list. SJ and STB materials are listed first, and non-SJ and non-STB materials are listed last. net search — Search the Internet for installable packages 1433 Command searches Example: Find the user-written command kursus . net search kursus, file • You could just type net search kursus, and that will list everything net search kursus, file lists, and more. Because you know kursus is a command, however, there must be a kursus.ado file associated with the package. Typing net search kursus, file narrows the search. • You could also type net search kursus.ado, file to narrow the search even more. Where does net search look? net search looks everywhere, not just at http://www.stata.com. 
net search begins by looking at http://www.stata.com, but then follows every link, which takes it to other places, and then follows every link again, which takes it to even more places, and so on. Authors: Please let us know if you have a site that we should include in our search by sending an email to webmaster@stata.com. We will then link to your site from ours to ensure that net search finds your materials. That is not strictly necessary, however, as long as your site is directly or indirectly linked from some site that is linked to ours. How does net search work? www.stata.com crawler Your computer talks to www.stata.com net search database The Internet 1434 net search — Search the Internet for installable packages Our website maintains a database of Stata resources. When you use net search, it contacts http://www.stata.com with your request, http://www.stata.com searches its database, and Stata returns the results to you. Another part of the system is called the crawler, which searches the web for new Stata resources to add to the net search database and verifies that the resources already found are still available. When a new resource becomes available, the crawler takes about 2 days to add it to the database, and, similarly, if a resource disappears, the crawler takes roughly 2 days to remove it from the database. References Baum, C. F., and N. J. Cox. 1999. ip29: Metadata for user-written contributions to the Stata programming language. Stata Technical Bulletin 52: 10–12. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 121–124. College Station, TX: Stata Press. Cox, N. J., and C. F. Baum. 2000. ip29.1: Metadata for user-written contributions to the Stata programming language. Stata Technical Bulletin 54: 21–22. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 124–126. College Station, TX: Stata Press. Gould, W. W., and A. R. Riley. 2000. stata55: Search web for installable packages. Stata Technical Bulletin 54: 4–6. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 10–13. College Station, TX: Stata Press. Also see [R] adoupdate — Update user-written ado-files [R] net — Install and manage user-written additions from the Internet [R] search — Search Stata documentation and other resources [R] sj — Stata Journal and STB installation instructions [R] ssc — Install and uninstall packages from SSC [R] update — Check for official updates Title netio — Control Internet connections Syntax Description Options Remarks and examples Also see Syntax Turn on or off the use of a proxy server set httpproxy on | off , init Set proxy host name set httpproxyhost " name " Set the proxy port number set httpproxyport # Turn on or off proxy authorization set httpproxyauth on | off Set proxy authorization user ID set httpproxyuser " name " Set proxy authorization password set httpproxypw " password " Set time limit for establishing initial connection set timeout1 #seconds , permanently Set time limit for data transfer set timeout2 #seconds , permanently Description Several commands (for example, net, news, and update) are designed specifically for use over the Internet. Many other Stata commands that read a file (for example, copy, type, and use) can also read directly from a URL. All of these commands will usually work without your ever needing to concern yourself with the set commands discussed here. These set commands provide control over network system parameters. 
If you experience problems when using Stata’s network features, ask your system administrator if your site uses a proxy. A proxy is a server between your computer and the rest of the Internet, and your computer may need to communicate with other computers on the Internet through this proxy. If your site uses a proxy, your system administrator can provide you with its host name and the port your computer can use to communicate with it. If your site’s proxy requires you to log in to it before it will respond, your system administrator will provide you with a user ID and password. 1435 1436 netio — Control Internet connections set httpproxyhost sets the name of the host to be used as a proxy server. set httpproxyport sets the port number. set httpproxy turns on or off the use of a proxy server, leaving the proxy host name and port intact, even when not in use. Under the Mac and Windows operating systems, when you set httpproxy on, Stata will attempt to obtain the values of httpproxyhost and httpproxyport from the operating system if they have not been previously set. set httpproxy on, init attempts to obtain these values from the operating system, even if they have been previously set. If the proxy requires authorization (user ID and password), set authorization on via set httpproxyauth on. The proxy user and proxy password must also be set to the appropriate user ID and password by using set httpproxyuser and set httpproxypw. Stata remembers the various proxy settings between sessions and does not need a permanently option. set timeout1 changes the time limit in seconds that Stata imposes for establishing the initial connection with a remote host. The default value is 30. set timeout2 changes the time limit in seconds that Stata imposes for subsequent data transfer with the host. The default value is 180. If these time limits are exceeded, a “connection timed out” message and error code 2 are produced. You should seldom need to change these settings. Options init specifies that set httpproxy on attempts to initialize httpproxyhost and httpproxyport from the operating system (Mac and Windows only). permanently specifies that, in addition to making the change right now, the timeout1 and timeout2 settings be remembered and become the default setting when you invoke Stata. The various httpproxy settings do not have a permanently option because permanently is implied. Remarks and examples If you receive an error message, see http://www.stata.com/support/faqs/web/ for the latest information. 1. remote connection failed r(677); If you see remote connection failed r(677); then you asked for something to be done over the web, and Stata tried but could not contact the specified host. Stata was able to talk over the network and look up the host but was not able to establish a connection to that host. Perhaps the host is down; try again later. If all your web accesses result in this message, then perhaps your network connection is through a proxy server. If it is, then you must tell Stata. netio — Control Internet connections 1437 Contact your system administrator. Ask for the name and port of the “HTTP proxy server”. Say that you are told HTTP proxy server: jupiter.myuni.edu port number: 8080 In Stata, type . set httpproxyhost jupiter.myuni.edu . set httpproxyport 8080 . set httpproxy on Your web accesses should then work. 2. connection timed out r(2); If you see connection timed out r(2); then an Internet connection has timed out. This can happen when a. 
the connection between you and the host is slow, or b. the connection between you and the host has disappeared, and so it eventually “timed out”. For (b), wait a while (say, 5 minutes) and try again (sometimes pieces of the Internet can break for up to a day, but that is rare). For (a), you can reset the limits for what constitutes “timed out”. There are two numbers to set. The time to establish the initial connection is timeout1. By default, Stata waits 30 seconds before declaring a timeout. You can change the limit: . set timeout1 #seconds You might try doubling the usual limit and specify 60; #seconds must be between 1 and 32,000. The time to retrieve data from an open connection is timeout2. By default, Stata waits 180 seconds (3 minutes) before declaring a timeout. To change the limit, type . set timeout2 #seconds You might try doubling the usual limit and specify 360; #seconds must be between 1 and 32,000. Also see [R] query — Display system parameters [P] creturn — Return c-class values [U] 28 Using the Internet to keep up to date Title news — Report Stata news Syntax Menu Description Remarks and examples Also see Syntax news Menu Help > News Description news displays a brief listing of recent Stata news and information, which it obtains from Stata’s website. news requires that your computer be connected to the Internet. You may also execute news by selecting Help > News. Remarks and examples news provides an easy way of displaying a brief list of the latest Stata news: . news ___ ____ ____ /__ / ____/ ___/ / /___/ ____ ____ / ____/ / /___/ News The latest from http://www.stata.com 24 June 2013. Stata 13 available Stata 13 is now available. Visit http://www.stata.com/stata13/ for more information. Quick summary: 1) 2-billion character long string variables. 2) Treatment-effects estimators. 3) More multilevel mixed-effects models. 4) Multivariate mixed-effects and generalized linear SEM. 5) Forecasts. 6) Power and sample size. 7) New and extended random-effects panel-data estimators. 8) Effect sizes. 9) Project Manager. 10) Java plugins. 11) ... 12 June 2013. New public training dates announced. Visit http://www.stata.com/public-training/ for course offerings and dates. 20 May 2013. Official update available for download Click here (equivalent to pulling down Help and selecting Check for Updates) or type update from http://www.stata.com. 25 March 2013. NetCourse schedule updated See http://www.stata.com/netcourse/ for more information. (output omitted ) 1438 news — Report Stata news Also see [U] 28 Using the Internet to keep up to date 1439 Title nl — Nonlinear least-squares estimation Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Syntax Interactive version nl (depvar = ) if in weight , options Programmed substitutable expression version if in weight , options nl sexp prog : depvar varlist Function evaluator program version if in weight , nl func prog @ depvar varlist parameters(namelist) | nparameters(#) options where depvar is the dependent variable; is a substitutable expression; sexp prog is a substitutable expression program; and func prog is a function evaluator program. 
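For orientation, the three versions might be typed as follows. Here y, x, x1, and x2 stand in for your own variables; log4 refers to one of nl's built-in substitutable expression programs (see Built-in functions under Remarks and examples); and nlmyfunc is a hypothetical user-written function evaluator program:

Interactive version
        . nl (y = {b0}*(1 - exp(-{b1}*x)))

Programmed substitutable expression version (program nllog4, typed without the nl prefix)
        . nl log4: y x

Function evaluator program version (program nlmyfunc, typed without the nl prefix)
        . nl myfunc @ y x1 x2, nparameters(3)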
1440 Options Acknowledgments nl — Nonlinear least-squares estimation 1441 Description options Model variables(varlist) initial(initial values) ∗ parameters(namelist) ∗ nparameters(#) sexp options func options variables in model initial values for parameters parameters in model (function evaluator program version only) number of parameters in model (function evaluator program version only) options for substitutable expression program options for function evaluator program Model 2 lnlsq(#) noconstant hasconstant(name) use log least-squares where ln(depvar − # ) is assumed to be normally distributed the model has no constant term; seldom used use name as constant term; seldom used SE/Robust vce(vcetype) vcetype may be gnr, robust, cluster clustvar, bootstrap, jacknife, hac kernel, hc2, or hc3 Reporting level(#) leave title(string) title2(string) display options set confidence level; default is level(95) create variables containing derivative of E(y) display string as title above the table of parameter estimates display string as subtitle control column formats and line width Optimization optimization options eps(#) delta(#) control the optimization process; seldom used specify # for convergence criterion; default is eps(1e-5) specify # for computing derivatives; default is delta(4e-7) coeflegend display legend instead of statistics ∗ For function evaluator program version, you must specify parameters(namelist) or nparameters(#), or both. bootstrap, by, jackknife, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. vce(), leave, and weights are not allowed with the svy prefix; see [SVY] svy. aweights, fweights, and iweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Linear models and related > Nonlinear least squares 1442 nl — Nonlinear least-squares estimation Description nl fits an arbitrary nonlinear regression function by least squares. With the interactive version of the command, you enter the function directly on the command line or in the dialog box by using a substitutable expression. If you have a function that you use regularly, you can write a substitutable expression program and use the second syntax to avoid having to reenter the function every time. The function evaluator program version gives you the most flexibility in exchange for increased complexity; with this version, your program is given a vector of parameters and a variable list, and your program computes the regression function. When you write a substitutable expression program or function evaluator program, the first two letters of the name must be nl. sexp prog and func prog refer to the name of the program without the first two letters. For example, if you wrote a function evaluator program named nlregss, you would type nl regss @ . . . to estimate the parameters. Options Model variables(varlist) specifies the variables in the model. nl ignores observations for which any of these variables have missing values. If you do not specify variables(), then nl issues an error message with return code 480 if the estimation sample contains any missing values. initial(initial values) specifies the initial values to begin the estimation. 
You can specify a 1 × k matrix, where k is the number of parameters in the model, or you can specify a parameter name, its initial value, another parameter name, its initial value, and so on. For example, to initialize alpha to 1.23 and delta to 4.57, you would type nl . . . , initial(alpha 1.23 delta 4.57) . . . Initial values declared using this option override any that are declared within substitutable expressions. If you specify a parameter that does not appear in your model, nl exits with error code 480. If you specify a matrix, the values must be in the same order that the parameters are declared in your model. nl ignores the row and column names of the matrix. parameters(namelist) specifies the names of the parameters in the model. The names of the parameters must adhere to the naming conventions of Stata’s variables; see [U] 11.3 Naming conventions. If you specify both parameters() and nparameters(), the number of names in the former must match the number specified in the latter; if not, nl issues an error message with return code 198. nparameters(#) specifies the number of parameters in the model. If you do not specify names with the parameters() option, nl names them b1, b2, . . . , b#. If you specify both parameters() and nparameters(), the number of names in the former must match the number specified in the latter; if not, nl issues an error message with return code 198. sexp options refer to any options allowed by your sexp prog. func options refer to any options allowed by your func prog. Model 2 lnlsq(#) fits the model by using log least-squares, which we define as least squares with shifted lognormal errors. In other words, ln(depvar − # ) is assumed to be normally distributed. Sums of squares and deviance are adjusted to the same scale as depvar. nl — Nonlinear least-squares estimation 1443 noconstant indicates that the function does not include a constant term. This option is generally not needed, even if there is no constant term in the model, unless the coefficient of variation (over observations) of the partial derivative of the function with respect to a parameter is less than eps() and that parameter is not a constant term. hasconstant(name) indicates that parameter name be treated as the constant term in the model and that nl should not use its default algorithm to find a constant term. As with noconstant, this option is seldom used. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (gnr), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. vce(gnr), the default, uses the conventionally derived variance estimator for nonlinear models fit using Gauss–Newton regression. nl also allows the following: vce(hac kernel # ) specifies that a heteroskedasticity- and autocorrelation-consistent (HAC) variance estimate be used. HAC refers to the general form for combining weighted matrices to form the variance estimate. There are three kernels available for nl: nwest | gallant | anderson # specifies the number of lags. If # is not specified, N − 2 is assumed. vce(hac kernel # ) is not allowed if weights are specified. vce(hc2) and vce(hc3) specify alternative bias corrections for the robust variance calculation. vce(hc2) and vce(hc3) may not be specified with the svy prefix. 
By default, vce(robust) uses σ bj2 = {n/(n − k)}u2j as an estimate of the variance of the j th observation, where uj is the calculated residual and n/(n − k) is included to improve the overall estimate’s small-sample properties. vce(hc2) instead uses u2j /(1 − hjj ) as the observation’s variance estimate, where hjj is the j th diagonal element of the hat (projection) matrix. This produces an unbiased estimate of the covariance matrix if the model is homoskedastic. vce(hc2) tends to produce slightly more conservative confidence intervals than vce(robust). vce(hc3) uses u2j /(1 − hjj )2 as suggested by Davidson and MacKinnon (1993 and 2004), who report that this often produces better results when the model is heteroskedastic. vce(hc3) produces confidence intervals that tend to be even more conservative. See, in particular, Davidson and MacKinnon (2004, 239), who advocate the use of vce(hc2) or vce(hc3) instead of the plain robust estimator for nonlinear least squares. Reporting level(#); see [R] estimation options. leave leaves behind after estimation a set of new variables with the same names as the estimated parameters containing the derivatives of E(y) with respect to the parameters. If the dataset contains an existing variable with the same name as a parameter, then using leave causes nl to issue an error message with return code 110. leave may not be specified with vce(cluster clustvar) or the svy prefix. 1444 nl — Nonlinear least-squares estimation title(string) specifies an optional title that will be displayed just above the table of parameter estimates. title2(string) specifies an optional subtitle that will be displayed between the title specified in title() and the table of parameter estimates. If title2() is specified but title() is not, title2() has the same effect as title(). display options: cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Optimization optimization options: iterate(#), no log, trace. iterate() specifies the maximum number of iterations, log/nolog specifies whether to show the iteration log, and trace specifies that the iteration log should include the current parameter vector. These options are seldom used. eps(#) specifies the convergence criterion for successive parameter estimates and for the residual sum of squares. The default is eps(1e-5). delta(#) specifies the relative change in a parameter to be used in computing the numeric derivatives. The derivative for parameter βi is computed as {f (X, β1 , β2 , . . . , βi + d, βi+1 , . . .) − f (X, β1 , β2 , . . . , βi , βi+1 , . . .)}/d, where d is δ(βi + δ). The default is delta(4e-7). The following options are available with nl but are not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Remarks are presented under the following headings: Substitutable expressions Substitutable expression programs Built-in functions Lognormal errors Other uses Weights Potential errors General comments on fitting nonlinear models Function evaluator programs nl fits an arbitrary nonlinear function by least squares. The interactive version allows you to enter the function directly on the command line or dialog box using substitutable expressions. You can write a substitutable expression program for functions that you fit frequently to save yourself time. Finally, function evaluator programs give you the most flexibility in defining your nonlinear function, though they are more complicated to use. 
The next section explains the substitutable expressions that are used to define the regression function, and the section thereafter explains how to write substitutable expression program files so that you do not need to type in commonly used functions over and over. Later sections highlight other features of nl. The final section discusses function evaluator programs. If you find substitutable expressions adequate to define your nonlinear function, then you can skip that section entirely. Function evaluator programs are generally needed only for complicated problems, such as multistep estimators. The program receives a vector of parameters at which it is to compute the function and a variable into which the results are to be placed. nl — Nonlinear least-squares estimation 1445 Substitutable expressions You define the nonlinear function to be fit by nl by using a substitutable expression. Substitutable expressions are just like any other mathematical expressions involving scalars and variables, such as those you would use with Stata’s generate command, except that the parameters to be estimated are bound in braces. See [U] 13.2 Operators and [U] 13.3 Functions for more information on expressions. For example, suppose that you wish to fit the function yi = β0 (1 − e−β1 xi ) + i where β0 and β1 are the parameters to be estimated and i is an error term. You would simply type . nl (y = {b0}*(1 - exp(-1*{b1}*x))) You must enclose the entire equation in parentheses. Because b0 and b1 are enclosed in braces, nl knows that they are parameters in the model. nl will initialize b0 and b1 to zero by default. To request that nl initialize b0 to 1 and b1 to 0.25, you would type . nl (y = {b0=1}*(1 - exp(-1*{b1=0.25}*x))) That is, inside the braces denoting a parameter, you put the parameter name followed by an equal sign and the initial value. If a parameter appears in your function multiple times, you need only specify an initial value only once (or never, if you wish to set the initial value to zero). If you do specify more than one initial value for the same parameter, nl will use the last value given. Parameter names must follow the same conventions as variable names. See [U] 11.3 Naming conventions. Frequently, even nonlinear functions contain linear combinations of variables. As an example, suppose that you wish to fit the function n o yi = β0 1 − e−(β1 x1i +β2 x2i +β3 x3i ) + i nl allows you to declare a linear combination of variables by using the shorthand notation . nl (y = {b0=1}*(1 - exp(-1*{xb: x1 x2 x3}))) In the syntax {xb: x1 x2 x3}, you are telling nl that you are declaring a linear combination named xb that is a function of three variables, x1, x2, and x3. nl will create three parameters, named xb x1, xb x2, and xb x3, and initialize them to zero. Instead of typing the previous command, you could have typed . nl (y = {b0=1}*(1 - exp(-1*({xb x1}*x1 + {xb x2}*x2 + {xb x3}*x3)))) and yielded the same result. You can refer to the parameters created by nl in the linear combination later in the function, though you must declare the linear combination first if you intend to do that. When creating linear combinations, nl ensures that the parameter names it chooses are unique and have not yet been used in the function. In general, there are three rules to follow when defining substitutable expressions: 1. 2. 3. Parameters of the model are bound in braces: {b0}, {param}, etc. 
Initial values for parameters are given by including an equal sign and the initial value inside the braces: {b0=1}, {param=3.571}, etc. Linear combinations of variables can be included using the notation {eqname:varlist}, for example, {xb: mpg price weight}, {score: w x z}, etc. Parameters of linear combinations are initialized to zero. 1446 nl — Nonlinear least-squares estimation If you specify initial values by using the initial() option, they override whatever initial values are given within the substitutable expression. Substitutable expressions are so named because, once values are assigned to the parameters, the resulting expression can be handled by generate and replace. Example 1 We wish to fit the CES production function lnQi = β0 − 1 −ρ −ρ ln δKi + (1 − δ)Li + i ρ (1) where lnQi is the log of output for firm i; Ki and Li are firm i’s capital and labor usage, respectively; and i is a regression error term. Because ρ appears in the denominator of a fraction, zero is not a feasible initial value; for a CES production function, ρ = 1 is a reasonable choice. Setting δ = 0.5 implies that labor and capital have equal impacts on output, which is also a reasonable choice for an initial value. We type . use http://www.stata-press.com/data/r13/production . nl (lnoutput = {b0} - 1/{rho=1}*ln({delta=0.5}*capital^(-1*{rho}) + > (1 - {delta})*labor^(-1*{rho}))) (obs = 100) Iteration 0: residual SS = 29.38631 Iteration 1: residual SS = 29.36637 Iteration 2: residual SS = 29.36583 Iteration 3: residual SS = 29.36581 Iteration 4: residual SS = 29.36581 Iteration 5: residual SS = 29.36581 Iteration 6: residual SS = 29.36581 Iteration 7: residual SS = 29.36581 SS df MS Source Number of obs = 100 Model 91.1449924 2 45.5724962 R-squared = 0.7563 Residual 29.3658055 97 .302740263 Adj R-squared = 0.7513 Root MSE = .5502184 120.510798 99 1.21728079 Res. dev. = 161.2538 Total lnoutput Coef. /b0 /rho /delta 3.792158 1.386993 .4823616 Std. Err. .099682 .472584 .0519791 t 38.04 2.93 9.28 P>|t| [95% Conf. Interval] 0.000 0.004 0.000 3.594316 .4490443 .3791975 3.989999 2.324941 .5855258 Parameter b0 taken as constant term in model & ANOVA table nl will attempt to find a constant term in the model and, if one is found, mention it at the bottom of the output. nl found b0 to be a constant because the partial derivative ∂ lnQi /∂b0 has a coefficient of variation less than eps() in the estimation sample. The elasticity of substitution for the CES production function is σ = 1/(1 + ρ); and, having fit the model, we can use nlcom to estimate it: . nlcom (1/(1 + _b[/rho])) _nl_1: 1/(1 + _b[/rho]) lnoutput Coef. _nl_1 .4189372 Std. Err. z P>|z| .0829424 5.05 0.000 [95% Conf. Interval] .256373 .5815014 nl — Nonlinear least-squares estimation 1447 See [R] nlcom and [U] 13.5 Accessing coefficients and standard errors for more information. nl’s output closely mimics that of regress; see [R] regress for more information. The R2 , sums of squares, and similar statistics are calculated in the same way that regress calculates them. If no “constant” term is specified, the usual caveats apply to the interpretation of the R2 statistic; see the comments and references in Goldstein (1992). Unlike regress, nl does not report a model F statistic, because a test of the joint significance of all the parameters except the constant term may not be relevant in a nonlinear model. 
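Although nl does not report a model F statistic, Wald tests of individual parameters or of restrictions you choose can still be obtained after estimation with test or testnl. The commands below are only a sketch, assuming the CES model above has just been fit; the null values shown (rho = 1 and delta = 0.5) are chosen purely for illustration:
. test (_b[/rho] = 1) (_b[/delta] = 0.5)
. testnl 1/(1 + _b[/rho]) = 0.5
The first command performs a joint Wald test of the two linear restrictions; the second tests a nonlinear restriction on the elasticity of substitution.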
Substitutable expression programs If you fit the same model often or if you want to write an estimator that will operate on whatever variables you specify, then you will want to write a substitutable expression program. That program will return a macro containing a substitutable expression that nl can then evaluate, and it may optionally calculate initial values as well. The name of the program must begin with the letters nl. To illustrate, suppose that you use the CES production function often in your work. Instead of typing in the formula each time, you can write a program like this: program nlces, rclass version 13 syntax varlist(min=3 max=3) [if] local logout : word 1 of ‘varlist’ local capital : word 2 of ‘varlist’ local labor : word 3 of ‘varlist’ // Initial value for b0 given delta=0.5 and rho=1 tempvar y generate double ‘y’ = ‘logout’ + ln(0.5*‘capital’^-1 + 0.5*‘labor’^-1) summarize ‘y’ ‘if’, meanonly local b0val = r(mean) // Terms for substitutable expression local capterm "{delta=0.5}*‘capital’^(-1*{rho})" local labterm "(1-{delta})*‘labor’^(-1*{rho})" local term2 "1/{rho=1}*ln(‘capterm’ + ‘labterm’)" // Return substitutable expression and title return local eq "‘logout’ = {b0=‘b0val’} - ‘term2’" return local title "CES ftn., ln Q=‘logout’, K=‘capital’, L=‘labor’" end The program accepts three variables for log output, capital, and labor, and it accepts an if exp qualifier to restrict the estimation sample. All programs that you write to use with nl must accept an if exp qualifier because, when nl calls the program, it passes a binary variable that marks the estimation sample (the variable equals one if the observation is in the sample and zero otherwise). When calculating initial values, you will want to restrict your computations to the estimation sample, and you can do so by using if with any commands that accept if exp qualifiers. Even if your program does not calculate initial values or otherwise use the if qualifier, the syntax statement must still allow it. See [P] syntax for more information on the syntax command and the use of if. As in the previous example, reasonable initial values for δ and ρ are 0.5 and 1, respectively. Conditional on those values, (1) can be rewritten as β0 = lnQi + ln(0.5Ki−1 + 0.5L−1 i ) − i (2) so a good initial value for β0 is the mean of the right-hand side of (2) ignoring i . Lines 7–10 of the function evaluator program calculate that mean and store it in a local macro. Notice the use of if in the summarize statement so that the mean is calculated only for the estimation sample. 1448 nl — Nonlinear least-squares estimation The final part of the program returns two macros. The macro title is optional and defines a short description of the model that will be displayed in the output immediately above the table of parameter estimates. The macro eq is required and defines the substitutable expression that nl will use. If the expression is short, you can define it all at once. However, because the expression used here is somewhat lengthy, defining local macros and then building up the final expression from them is easier. To verify that there are no errors in your program, you can call it directly and then use return list: . use http://www.stata-press.com/data/r13/production . nlces lnoutput capital labor (output omitted ) . 
return list macros: r(title) : "CES ftn., ln Q=lnoutput, K=capital, L=labor" r(eq) : "lnoutput = {b0=3.711606264663641} - 1/{rho=1}*ln({delt > a=0.5}*capital^(-1*{rho}) + (1-{delta})*labor^(-1*{rho}))" The macro r(eq) contains the same substitutable expression that we specified at the command line in the preceding example, except for the initial value for b0. In short, an nl substitutable expression program should return in r(eq) the same substitutable expression you would type at the command line. The only difference is that when writing a substitutable expression program, you do not bind the entire expression inside parentheses. Having written the program, you can use it by typing . nl ces: lnoutput capital labor (There is a space between nl and ces.) The output is identical to that shown in example 1, save for the title defined in the function evaluator program that appears immediately above the table of parameter estimates. Technical note You will want to store nlces as an ado-file called nlces.ado. The alternative is to type the code into Stata interactively or to place the code in a do-file. While those alternatives are adequate for occasional use, if you save the program as an ado-file, you can use the function anytime you use Stata without having to redefine the program. When nl attempts to execute nlces, if the program is not in Stata’s memory, Stata will search the disk(s) for an ado-file of the same name and, if found, automatically load it. All you have to do is name the file with the .ado suffix and then place it in a directory where Stata will find it. You should put the file in the directory Stata reserves for user-written ado-files, which, depending on your operating system, is c:\ado\personal (Windows), ~ /ado/personal (Unix), or ~:ado:personal (Mac). See [U] 17 Ado-files. Sometimes you may want to pass additional options to the substitutable expression program. You can modify the syntax statement of your program to accept whatever options you wish. Then when you call nl with the syntax . nl func prog: varlist, options any options that are not recognized by nl (see the table of options at the beginning of this entry) are passed on to your function evaluator program. The only other restriction is that your program cannot accept an option named at because nl uses that option with function evaluator programs. nl — Nonlinear least-squares estimation 1449 Built-in functions Some functions are used so often that nl has them built in so that you do not need to write them yourself. nl automatically chooses initial values for the parameters, though you can use the initial(. . .) option to override them. Three alternatives are provided for exponential regression with one asymptote: exp3 yi = β0 + β1 β2xi + i exp2 yi = β1 β2xi + i exp2a yi = β1 1 − β2xi + i For instance, typing nl exp3: ras dvl fits the three-parameter exponential model (parameters β0 , β1 , and β2 ) using yi = ras and xi = dvl. Two alternatives are provided for the logistic function (symmetric sigmoid shape; not to be confused with logistic regression): .h i log4 yi = β0 + β1 1 + exp −β2 (xi − β3 ) + i .h i log3 yi = β1 1 + exp −β2 (xi − β3 ) + i Finally, two alternatives are provided for the Gompertz function (asymmetric sigmoid shape): h i gom4 yi = β0 + β1 exp − exp −β2 (xi − β3 ) + i h i gom3 yi = β1 exp − exp −β2 (xi − β3 ) + i Lognormal errors A nonlinear model with errors that are independent and identically distributed normal may be written as yi = f (xi , β) + ui , ui ∼ N (0, σ 2 ) (3) for i = 1, . . . , n. 
If the yi are thought to have a k-shifted lognormal instead of a normal distribution—that is, ln(yi − k) ∼ N(ζi, τ²), and the systematic part f(xi, β) of the original model is still thought appropriate for yi—the model becomes

    ln(yi − k) = ζi + vi = ln{f(xi, β) − k} + vi,    vi ∼ N(0, τ²)    (4)

This model is fit if lnlsq(k) is specified.

If model (4) is correct, the variance of (yi − k) is proportional to {f(xi, β) − k}². Probably the most common case is k = 0, sometimes called “proportional errors” because the standard error of yi is proportional to its expectation, f(xi, β). Assuming that the value of k is known, (4) is just another nonlinear model in β, and it may be fit as usual. However, we may wish to compare the fit of (3) with that of (4) using the residual sum of squares (RSS) or the deviance D, D = −2 × log-likelihood, from each model. To do so, we must allow for the change in scale introduced by the log transformation.

Assuming, then, the yi to be normally distributed, Atkinson (1985, 85–87, 184), by considering the Jacobian ∏|∂ln(yi − k)/∂yi|, showed that multiplying both sides of (4) by the geometric mean of yi − k, ẏ, gives residuals on the same scale as those of yi. The geometric mean is given by

    ẏ = exp{n⁻¹ Σ ln(yi − k)}

which is a constant for a given dataset. The residual deviance for (3) and for (4) may be expressed as

    D(β̂) = {1 + ln(2πσ̂²)} n    (5)

where β̂ is the maximum likelihood estimate (MLE) of β for each model and nσ̂² is the RSS from (3), or that from (4) multiplied by ẏ².

Because (3) and (4) are models with different error structures but the same functional form, the arithmetic difference in their RSS or deviances is not easily tested for statistical significance. However, if the deviance difference is large (>4, say), we would naturally prefer the model with the smaller deviance. Of course, the residuals for each model should be examined for departures from assumptions (nonconstant variance, nonnormality, serial correlations, etc.) in the usual way.

Alternatively, consider modeling

    E(yi) = 1/(C + Ae^(Bxi))    (6)

    E(1/yi) = E(yi′) = C + Ae^(Bxi)    (7)

where C, A, and B are parameters to be estimated. Using the data (y, x) = (0.04, 5), (0.06, 12), (0.08, 25), (0.1, 35), (0.15, 42), (0.2, 48), (0.25, 60), (0.3, 75), and (0.5, 120) (Danuso 1991), fitting the models yields

    Model                    C        A        B          RSS        Deviance
    (6)                    1.781    25.74   −0.03926    0.001640     −51.95
    (6) with lnlsq(0)      1.799    25.45   −0.04051    0.001431     −53.18
    (7)                    1.781    25.74   −0.03926    8.197         24.70
    (7) with lnlsq(0)      1.799    27.45   −0.04051    3.651         17.42

There is little to choose between the two versions of the logistic model (6), whereas for the exponential model (7), the fit using lnlsq(0) is much better (a deviance difference of 7.28). The reciprocal transformation has introduced heteroskedasticity into yi′, which is countered by the proportional-errors property of the lognormal distribution implicit in lnlsq(0). The deviances are not comparable between the logistic and exponential models because the change of scale has not been allowed for, although in principle it could be.

Other uses

Even if you are fitting linear regression models, you may find that nl can save you some typing. Because you specify the parameters of your model explicitly, you can impose constraints on them directly.
Example 2 In example 2 of [R] cnsreg, we showed how to fit the model mpg = β0 + β1 price + β2 weight + β3 displ + β4 gear ratio + β5 foreign + β6 length + u subject to the constraints β1 = β2 = β3 = β6 β4 = −β5 = β0 /20 nl — Nonlinear least-squares estimation 1451 An alternative way is to use nl: . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . nl (mpg = {b0} + {b1}*price + {b1}*weight + {b1}*displ + > {b0}/20*gear_ratio - {b0}/20*foreign + {b1}*length) (obs = 74) Iteration 0: residual SS = 1578.522 Iteration 1: residual SS = 1578.522 Source SS df MS Number of obs 34429.4777 2 17214.7389 R-squared Model Residual 1578.52226 72 21.9239203 Adj R-squared Root MSE Total 36008 74 486.594595 Res. dev. mpg Coef. /b0 /b1 26.52229 -.000923 Std. Err. 1.375178 .0001534 t P>|t| 19.29 -6.02 0.000 0.000 = = = = = 74 0.9562 0.9549 4.682299 436.4562 [95% Conf. Interval] 23.78092 -.0012288 29.26365 -.0006172 The point estimates and standard errors for β0 and β1 are identical to those reported in example 2 of [R] cnsreg. To get the estimate for β4 , we can use nlcom: . nlcom _b[/b0]/20 _nl_1: _b[/b0]/20 mpg Coef. _nl_1 1.326114 Std. Err. .0687589 z 19.29 P>|z| [95% Conf. Interval] 0.000 1.191349 1.460879 The advantage to using nl is that we do not need to use the constraint command six times. nl is also a useful tool when doing exploratory data analysis. For example, you may want to run a regression of y on a function of x, though you have not decided whether to use sqrt(x) or ln(x). You can use nl to run both regressions without having first to generate two new variables: . nl (y = {b0} + {b1}*ln(x)) . nl (y = {b0} + {b1}*sqrt(x)) Poi (2008) shows the advantages of using nl when marginal effects of transformed variables are desired as well. Weights Weights are specified in the usual way — analytic and frequency weights as well as iweights are supported; see [U] 20.23 Weighted estimation. Use of analytic weights implies that the yi have different variances. Therefore, model (3) may be rewritten as yi = f (xi , β) + ui , ui ∼ N (0, σ 2 /wi ) (3a) where wi are (positive) weights, assumed to be known and normalized such that their sum equals the number of observations. The residual deviance for (3a) is X b ) = 1 + ln(2πb D(β σ2 ) n − ln(wi ) (5a) 1452 nl — Nonlinear least-squares estimation [compare with (5)], where nb σ 2 = RSS = X b) wi yi − f (xi , β 2 Defining and fitting a model equivalent to (4) when weights have been specified as in (3a) is not straightforward and has not been attempted. Thus deviances using and not using the lnlsq() option may not be strictly comparable when analytic weights (other than 0 and 1) are used. You do not need to modify your substitutable expression in any way to use weights. If, however, you write a substitutable expression program, then you should account for weights when obtaining initial values. When nl calls your program, it passes whatever weight expression (if any) was specified by the user. Here is an outline of a substitutable expression program that accepts weights: program nl name, rclass version 13 syntax varlist [aw fw iw] if ... // Obtain initial values allowing weights // Use the syntax [‘weight’‘exp’]. For example, summarize varname [‘weight’‘exp’] ‘if’ regress depvar varlist [‘weight’‘exp’] ‘if’ ... // Return substitutable expression return local eq "substitutable expression" return local title "description of estimator" end For details on how the syntax command processes weight expressions, see [P] syntax. 
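For interactive use, no change to the substitutable expression is needed; the weights are simply attached to the command in the usual place. As a minimal sketch (the variables y, x, and the analytic-weight variable w here are hypothetical):
. nl (y = {b0} + {b1}*ln(x)) [aweight = w]
Frequency weights and iweights are specified the same way, for example, [fweight = n].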
Potential errors nl is reasonably robust to the inability of your nonlinear function to be evaluated at some parameter values. nl does assume that your function can be evaluated at the initial values of the parameters. If your function cannot be evaluated at the initial values, an error message is issued with return code 480. Recall that if you do not specify an initial value for a parameter, then nl initializes it to zero. Many nonlinear functions cannot be evaluated when some parameters are zero, so in those cases specifying alternative initial values is crucial. Thereafter, as nl changes the parameter values, it monitors your function for unexpected missing values. If these are detected, nl backs up. That is, nl finds a point between the previous, known-tobe-good parameter vector and the new, known-to-be-bad vector at which the function can be evaluated and continues its iterations from that point. nl requires that once a parameter vector is found where the predictions can be calculated, small changes to the parameter vector be made to calculate numeric derivatives. If a boundary is encountered at this point, an error message is issued with return code 481. When specifying lnlsq(), an attempt to take logarithms of yi − k when yi ≤ k results in an error message with return code 482. If iterate() iterations are performed and estimates still have not converged, results are presented with a warning, and the return code is set to 430. If you use the programmed substitutable expression version of nl with a function evaluator program, or vice versa, Stata issues an error message. Verify that you are using the syntax appropriate for the program you have. nl — Nonlinear least-squares estimation 1453 General comments on fitting nonlinear models Achieving convergence is often problematic. For example, a unique minimum of the sum-ofsquares function may not exist. Much literature exists on different algorithms that have been used, on strategies for obtaining good initial parameter values, and on tricks for parameterizing the model to make its behavior as linear-like as possible. Selected references are Kennedy and Gentle (1980, chap. 10) for computational matters and Ross (1990) and Ratkowsky (1983) for all three aspects. Ratkowsky’s book is particularly clear and approachable, with useful discussion on the meaning and practical implications of intrinsic and parameter-effects nonlinearity. An excellent text on nonlinear estimation is Gallant (1987). Also see Davidson and MacKinnon (1993 and 2004). To enhance the success of nl, pay attention to the form of the model fit, along the lines of Ratkowsky and Ross. For example, Ratkowsky (1983, 49–59) analyzes three possible three-parameter yield-density models for plant growth: (α + βxi )−1/θ E(yi ) = (α + βxi + γx2i )−1 (α + βxφi )−1 All three models give similar fits. However, he shows that the second formulation is dramatically more linear-like than the other two and therefore has better convergence properties. In addition, the parameter estimates are virtually unbiased and normally distributed, and the asymptotic approximation to the standard errors, correlations, and confidence intervals is much more accurate than for the other models. Even within a given model, the way the parameters are expressed (for example, φxi or eθxi ) affects the degree of linearity and convergence behavior. Function evaluator programs Occasionally, a nonlinear function may be so complex that writing a substitutable expression for it is impractical. 
For example, there could be many parameters in the model. Alternatively, if you are implementing a two-step estimator, writing a substitutable expression may be altogether impossible. Function evaluator programs can be used in these situations. nl will pass to your function evaluator program a list of variables, a weight expression, a variable marking the estimation sample, and a vector of parameters. Your program is to replace the dependent variable, which is the first variable in the variables list, with the values of the nonlinear function evaluated at those parameters. As with substitutable expression programs, the first two letters of the name must be nl. To focus on the mechanics of the function evaluator program, again let’s compare the CES production function to the previous examples. The function evaluator program is 1454 nl — Nonlinear least-squares estimation program nlces2 version 13 syntax varlist(min=3 max=3) if, at(name) local logout : word 1 of ‘varlist’ local capital : word 2 of ‘varlist’ local labor : word 3 of ‘varlist’ // Retrieve parameters out of at matrix tempname b0 rho delta scalar ‘b0’ = ‘at’[1, 1] scalar ‘rho’ = ‘at’[1, 2] scalar ‘delta’ = ‘at’[1, 3] tempvar kterm lterm generate double ‘kterm’ = ‘delta’*‘capital’^(-1*‘rho’) ‘if’ generate double ‘lterm’ = (1-‘delta’)*‘labor’^(-1*‘rho’) ‘if’ // Fill in dependent variable replace ‘logout’ = ‘b0’ - 1/‘rho’*ln(‘kterm’ + ‘lterm’) ‘if’ end Unlike the previous nlces program, this one is not declared to be r-class. The syntax statement again accepts three variables: one for log output, one for capital, and one for labor. An if exp is again required because nl will pass a binary variable marking the estimation sample. All function evaluator programs must accept an option named at() that takes a name as an argument—that is how nl passes the parameter vector to your program. The next part of the program retrieves the output, labor, and capital variables from the variables list. It then breaks up the temporary matrix at and retrieves the parameters b0, rho, and delta. Pay careful attention to the order in which the parameters refer to the columns of the at matrix because that will affect the syntax you use with nl. The temporary names you use inside this program are immaterial, however. The rest of the program computes the nonlinear function, using some temporary variables to hold intermediate results. The final line of the program then replaces the dependent variable with the values of the function. Notice the use of ‘if’ to restrict attention to the estimation sample. nl makes a copy of your dependent variable so that when the command is finished your data are left unchanged. To use the program and fit your model, you type . use http://www.stata-press.com/data/r13/production, clear . nl ces2 @ lnoutput capital labor, parameters(b0 rho delta) > initial(b0 0 rho 1 delta 0.5) The output is again identical to that shown in example 1. The order in which the parameters were specified in the parameters() option is the same in which they are retrieved from the at matrix in the program. To initialize them, you simply list the parameter name, a space, the initial value, and so on. If you use the nparameters() option instead of the parameters() option, the parameters are named b1, b2, . . . , bk , where k is the number of parameters. Thus you could have typed . 
nl ces2 @ lnoutput capital labor, nparameters(3) initial(b1 0 b2 1 b3 0.5) With that syntax, the parameters called b0, rho, and delta in the program will be labeled b1, b2, and b3, respectively. In programming situations or if there are many parameters, instead of listing the parameter names and initial values in the initial() option, you may find it more convenient to pass a column vector. In those cases, you could type . matrix myvals = (0, 1, 0.5) . nl ces2 @ lnoutput capital labor, nparameters(3) initial(myvals) nl — Nonlinear least-squares estimation 1455 In summary, a function evaluator program receives a list of variables, the first of which is the dependent variable that you are to replace with the values of your nonlinear function. Additionally, it must accept an if exp, as well as an option named at that will contain the vector of parameters at which nl wants the function evaluated. You are then free to do whatever is necessary to evaluate your function and replace the dependent variable. If you wish to use weights, your function evaluator program’s syntax statement must accept them. If your program consists only of, for example, generate statements, you need not do anything with the weights passed to your program. However, if in calculating the nonlinear function you use commands such as summarize or regress, then you will want to use the weights with those commands. As with substitutable expression programs, nl will pass to it any options specified that nl does not accept, providing you with a way to pass more information to your function. Technical note Before version 9 of Stata, the nl command used a different syntax, which required you to write an nlfcn program, and it did not have a syntax for interactive use other than the seven functions that were built-in. The old syntax of nl still works, and you can still use those nlfcn programs. If nl does not see a colon, an at sign, or a set of parentheses surrounding the equation in your command, it assumes that the old syntax is being used. The current version of nl uses scalars and matrices to store intermediate calculations instead of local and global macros as the old version did, so the current version produces more accurate results. In practice, however, any discrepancies are likely to be small. 1456 nl — Nonlinear least-squares estimation Stored results nl stores the following in e(): Scalars e(N) e(k) e(k eq model) e(df m) e(df r) e(df t) e(mss) e(rss) e(tss) e(mms) e(msr) e(ll) e(r2) e(r2 a) e(rmse) e(dev) e(N clust) e(lnlsq) e(log t) e(gm 2) e(cj) e(delta) e(rank) e(ic) e(converge) Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(title 2) e(clustvar) e(hac kernel) e(hac lag) e(vce) e(vcetype) e(type) e(sexp) e(params) e(funcprog) e(rhs) e(properties) e(predict) e(marginsnotok) Matrices e(b) e(init) e(V) Functions e(sample) number of observations number of parameters number of equations in overall model test; always 0 model degrees of freedom residual degrees of freedom total degrees of freedom model sum of squares residual sum of squares total sum of squares model mean square residual mean square log likelihood assuming i.i.d. 
normal errors R-squared adjusted R-squared root mean squared error residual deviance number of clusters value of lnlsq if specified 1 if lnlsq specified, 0 otherwise square of geometric mean of (y−k) if lnlsq; 1 otherwise position of constant in e(b) or 0 if no constant relative change used to compute derivatives rank of e(V) number of iterations 1 if converged, 0 otherwise nl command as typed name of dependent variable weight type weight expression title in estimation output secondary title in estimation output name of cluster variable HAC kernel HAC lag vcetype specified in vce() title used to label Std. Err. 1 = interactively entered expression 2 = substitutable expression program 3 = function evaluator program substitutable expression names of parameters function evaluator program contents of variables() b V program used to implement predict predictions disallowed by margins coefficient vector initial values vector variance–covariance matrix of the estimators marks estimation sample nl — Nonlinear least-squares estimation 1457 Methods and formulas The derivation here is based on Davidson and MacKinnon (2004, chap. 6). Let β denote the k × 1 vector of parameters, and write the regression function using matrix notation as y = f (x, β) + u so that the objective function can be written as SSR(β) 0 = {y − f (x, β)} D {y − f (x, β)} The D matrix contains the weights and is defined in [R] regress; if no weights are specified, then D is the N × N identity matrix. Taking a second-order Taylor series expansion centered at β0 yields SSR(β) 1 ≈ SSR(β0 ) + g0 (β0 )(β − β0 ) + (β − β0 )0 H(β0 )(β − β0 ) 2 (8) where g(β0 ) denotes the k × 1 gradient of SSR(β) evaluated at β0 and H(β0 ) denotes the k × k Hessian of SSR(β) evaluated at β0 . Letting X denote the N × k matrix of derivatives of f (x, β) with respect to β, the gradient g(β) is g(β) = −2X0 Du (9) X and u are obviously functions of β, though for notational simplicity that dependence is not shown explicitly. The (m, n) element of the Hessian can be written as Hmn (β) = −2 i=N X i=1 dii ∂ 2 fi ui − Xim Xin ∂βm ∂βn (10) where dii is the ith diagonal element of D. As discussed in Davidson and MacKinnon (2004, chap. 6), the first term inside the brackets of (10) has expectation zero, so the Hessian can be approximated as H(β) = 2X0 DX (11) Differentiating the Taylor series expansion of SSR(β) shown in (8) yields the first-order condition for a minimum g(β0 ) + H(β0 )(β − β0 ) = 0 which suggests the iterative procedure βj+1 = βj − αH−1 (βj )g(βj ) (12) where α is a “step size” parameter chosen at each iteration to improve convergence. Using (9) and (11), we can write (12) as βj+1 = βj + α(X0 DX)−1 X0 Du (13) where X and u are evaluated at βj . Apart from the scalar α, the second term on the right-hand side of (13) can be computed via a (weighted) regression of the columns of X on the errors. nl computes the derivatives numerically and then calls regress. At each iteration, α is set to one, and a candidate value β∗j+1 is computed by (13). If SSR(β∗j+1 ) < SSR(βj ), then βj+1 = β∗j+1 and the iteration is complete. Otherwise, α is halved, a new β∗j+1 is calculated, and the process is repeated. Convergence is declared when α|βj+1,m | ≤ (|βjm | + τ ) for all m = 1, . . . , k . nl uses τ = 10−3 and, by default, = 10−5 , though you can specify an alternative value of with the eps() option. 1458 nl — Nonlinear least-squares estimation As derived, for example, in Davidson and MacKinnon (2004, chap. 
6), an expedient way to b and then obtain the covariance matrix is to compute u and the columns of X at the final estimate β regress that u on X. The covariance matrix of the estimated parameters of that regression serves b ). If that regression employs a robust covariance matrix estimator, then the as an estimate of Var(β covariance matrix for the parameters of the nonlinear regression will also be robust. All other statistics are calculated analogously to those in linear regression, except that the nonlinear function f (xi , β) plays the role of the linear function x0i β. See [R] regress. This command supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. Acknowledgments The original version of nl was written by Patrick Royston (1992) of the MRC Clinical Trials Unit, London, and coauthor of the Stata Press book Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model. Francesco Danuso’s menu-driven nonlinear regression program (1991) provided the inspiration. References Atkinson, A. C. 1985. Plots, Transformations, and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis. Oxford: Oxford University Press. Canette, I. 2011. A tip to debug your nl/nlsur function evaluator program. The Stata Blog: Not Elsewhere Classified. http://blog.stata.com/2011/12/05/a-tip-to-debug-your-nlnlsur-function-evaluator-program/. Danuso, F. 1991. sg1: Nonlinear regression command. Stata Technical Bulletin 1: 17–19. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 96–98. College Station, TX: Stata Press. Davidson, R., and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press. . 2004. Econometric Theory and Methods. New York: Oxford University Press. Gallant, A. R. 1987. Nonlinear Statistical Models. New York: Wiley. Goldstein, R. 1992. srd7: Adjusted summary statistics for logarithmic regressions. Stata Technical Bulletin 5: 17–21. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 178–183. College Station, TX: Stata Press. Kennedy, W. J., Jr., and J. E. Gentle. 1980. Statistical Computing. New York: Dekker. Poi, B. P. 2008. Stata tip 58: nl is not just for nonlinear models. Stata Journal 8: 139–141. Ratkowsky, D. A. 1983. Nonlinear Regression Modeling: A Unified Practical Approach. New York: Dekker. Ross, G. J. S. 1987. MLP User Manual, Release 3.08. Oxford: Numerical Algorithms Group. . 1990. Nonlinear Estimation. New York: Springer. Royston, P. 1992. sg7: Centile estimation command. Stata Technical Bulletin 8: 12–15. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 122–125. College Station, TX: Stata Press. . 1993. sg1.4: Standard nonlinear curve fits. Stata Technical Bulletin 11: 17. Reprinted in Stata Technical Bulletin Reprints, vol. 2, p. 121. College Station, TX: Stata Press. 
nl — Nonlinear least-squares estimation Also see [R] nl postestimation — Postestimation tools for nl [R] gmm — Generalized method of moments estimation [R] ml — Maximum likelihood estimation [R] mlexp — Maximum likelihood estimation of user-specified expressions [R] nlcom — Nonlinear combinations of estimators [R] nlsur — Estimation of nonlinear systems of equations [R] regress — Linear regression [SVY] svy estimation — Estimation commands for survey data [U] 20 Estimation and postestimation commands 1459 Title nl postestimation — Postestimation tools for nl Description Remarks and examples Syntax for predict Also see Menu for predict Options for predict Description The following postestimation commands are available after nl: Command Description estat ic estat summarize estat vce estat (svy) estimates forecast1 lincom Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions and residuals point estimates, standard errors, testing, and inference for generalized predictions Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest2 margins3 marginsplot nlcom predict predictnl test testnl 1 forecast is not appropriate with svy estimation results. lrtest is not appropriate with svy estimation results. 3 You must specify the variables() option with nl. 2 1460 nl postestimation — Postestimation tools for nl 1461 Syntax for predict predict type predict type newvar if in , statistic stub* | newvar1 . . . newvark if in , scores where k is the number of parameters in the model. Description statistic Main yhat residuals pr(a,b) fitted values; the default residuals Pr(yj | a < yj < b) e(a,b) E(yj | a < yj < b) ystar(a,b) E(yj∗ ), yj∗ = max{a, min(yj , b)} These statistics are available both in and out of sample; type predict only for the estimation sample. . . . if e(sample) . . . if wanted Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main yhat, the default, calculates the fitted values. residuals calculates the residuals. pr(a,b) calculates Pr(a < xj b + uj < b), the probability that yj |xj would be observed in the interval (a, b). a and b may be specified as numbers or variable names; lb and ub are variable names; pr(20,30) calculates Pr(20 < xj b + uj < 30); pr(lb,ub) calculates Pr(lb < xj b + uj < ub); and pr(20,ub) calculates Pr(20 < xj b + uj < ub). a missing (a ≥ .) means −∞; pr(.,30) calculates Pr(−∞ < xj b + uj < 30); pr(lb,30) calculates Pr(−∞ < xj b + uj < 30) in observations for which lb ≥ . and calculates Pr(lb < xj b + uj < 30) elsewhere. b missing (b ≥ .) means +∞; pr(20,.) calculates Pr(+∞ > xj b + uj > 20); pr(20,ub) calculates Pr(+∞ > xj b + uj > 20) in observations for which ub ≥ . and calculates Pr(20 < xj b + uj < ub) elsewhere. e(a,b) calculates E(xj b + uj | a < xj b + uj < b), the expected value of yj |xj conditional on yj |xj being in the interval (a, b), meaning that yj |xj is truncated. 
a and b are specified as they are for pr().

ystar(a,b) calculates E(yj*), where yj* = a if xj b + uj ≤ a, yj* = b if xj b + uj ≥ b, and yj* = xj b + uj otherwise, meaning that yj* is censored. a and b are specified as they are for pr().

scores calculates the scores. The jth new variable created will contain the score for the jth parameter in e(b).

Remarks and examples

Example 1

Obtaining predictions after fitting a nonlinear regression model with nl is no more difficult than obtaining predictions after fitting a linear regression model with regress. Here we fit a model of mpg on weight, allowing for a nonlinear relationship:

. use http://www.stata-press.com/data/r13/auto
(1978 Automobile Data)
. nl (mpg = {b0} + {b1}*weight^{gamma=-.5}), variables(weight) nolog
(obs = 74)

      Source         SS        df        MS         Number of obs =       74
      Model      1646.43761     2    823.218806     R-squared     =   0.6738
      Residual   797.021847    71    11.2256598     Adj R-squared =   0.6646
                                                    Root MSE      = 3.350472
      Total      2443.45946    73    33.4720474     Res. dev.     = 385.8874

         mpg        Coef.    Std. Err.      t     P>|t|     [95% Conf. Interval]
         /b0    -18.17583    60.61713     -0.30   0.765     -139.0429   102.6913
         /b1     1377.267    5292.443      0.26   0.795     -9175.564    11930.1
      /gamma    -.4460916    .6763643     -0.66   0.512     -1.794724   .9025405

Parameter b0 taken as constant term in model & ANOVA table

Now we obtain the predicted values of mpg and plot them in a graph along with the observed values:

. predict mpghat
(option yhat assumed; fitted values)
. scatter mpg weight || line mpghat weight, sort

(figure omitted: scatterplot of Mileage (mpg) against Weight (lbs.), with the fitted values overlaid as a line)

Suppose we wanted to know how sensitive mpg is to changes in weight for cars that weigh 3,000 pounds. We can use margins to find out:

. margins, eyex(weight) at(weight = 3000)
Warning: cannot perform check for estimable functions.
Conditional marginal effects                      Number of obs   =         74
Model VCE    : GNR
Expression   : Fitted values, predict()
ey/ex w.r.t. : weight
at           : weight          =        3000

                               Delta-method
                    ey/ex      Std. Err.      z     P>|z|     [95% Conf. Interval]
      weight    -.8408119      .0804379    -10.45   0.000     -.9984673  -.6831565

With the eyex() option, margins reports elasticities. These results show that if we increase weight by 1%, then mpg decreases by about 0.84%.

Technical note

Observant readers will notice that margins issued a warning message stating that it could not perform its usual check for estimable functions. In the case of nl, as long as you do not specify the predict() option of margins or specify the default predict(yhat), you can safely ignore that message. The predicted values that nl produces are suitable for use with margins. However, if you specify any predict() options other than yhat, then the output from margins after using nl will not be correct.
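Residuals are requested just as fitted values are. Continuing with the model above, a minimal sketch of a residual check might be (the new variable name resid is arbitrary):
. predict double resid, residuals
. summarize resid
. scatter resid weight
Plotting the residuals against weight is one informal way to look for remaining systematic structure or nonconstant variance.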
Also see [R] nl — Nonlinear least-squares estimation [U] 20 Estimation and postestimation commands Title nlcom — Nonlinear combinations of estimators Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax Nonlinear combination of estimators—one expression nlcom name: exp , options Nonlinear combinations of estimators—more than one expression nlcom ( name: exp) ( name: exp , options options Description level(#) iterate(#) post display options set confidence level; default is level(95) maximum number of iterations post estimation results control column formats and line width noheader df(#) suppress output header use t distribution with # degrees of freedom for computing p-values and confidence intervals noheader and df(#) do not appear in the dialog box. The second syntax means that if more than one expression is specified, each must be surrounded by parentheses. The optional name is any valid Stata name and labels the transformations. exp is a possibly nonlinear expression containing b[coef ] b[eqno:coef ] [eqno]coef [eqno] b[coef ] eqno is ## name coef identifies a coefficient in the model. coef is typically a variable name, a level indicator, an interaction indicator, or an interaction involving continuous variables. Level indicators identify one level of a factor variable and interaction indicators identify one combination of levels of an interaction; see [U] 11.4.3 Factor variables. coef may contain time-series operators; see [U] 11.4.4 Time-series varlists. Distinguish between [ ], which are to be typed, and , which indicate optional arguments. 1464 nlcom — Nonlinear combinations of estimators 1465 Menu Statistics > Postestimation > Nonlinear combinations of estimates Description nlcom computes point estimates, standard errors, test statistics, significance levels, and confidence intervals for (possibly) nonlinear combinations of parameter estimates after any Stata estimation command. Results are displayed in the usual table format used for displaying estimation results. Calculations are based on the “delta method”, an approximation appropriate in large samples. nlcom can be used with svy estimation results; see [SVY] svy postestimation. Options level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. iterate(#) specifies the maximum number of iterations used to find the optimal step size in calculating numerical derivatives of the transformation(s) with respect to the original parameters. By default, the maximum number of iterations is 100, but convergence is usually achieved after only a few iterations. You should rarely have to use this option. post causes nlcom to behave like a Stata estimation (eclass) command. When post is specified, nlcom will post the vector of transformed estimators and its estimated variance–covariance matrix to e(). This option, in essence, makes the transformation permanent. Thus you could, after posting, treat the transformed estimation results in the same way as you would treat results from other Stata estimation commands. For example, after posting, you could redisplay the results by typing nlcom without any arguments, or use test to perform simultaneous tests of hypotheses on linear combinations of the transformed estimators; see [R] test. 
Specifying post clears out the previous estimation results, which can be recovered only by refitting the original model or by storing the estimation results before running nlcom and then restoring them; see [R] estimates store. display options: cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. The following options are available with nlcom but are not shown in the dialog box: noheader suppresses the output header. df(#) specifies that the t distribution with # degrees of freedom be used for computing p-values and confidence intervals. Remarks and examples Remarks are presented under the following headings: Introduction Basics Using the post option Reparameterizing ML estimators for univariate data nlcom versus eform 1466 nlcom — Nonlinear combinations of estimators Introduction nlcom and predictnl both use the delta method. They take nonlinear transformations of the estimated parameter vector from some fitted model and apply the delta method to calculate the variance, standard error, Wald test statistic, etc., of the transformations. nlcom is designed for functions of the parameters, and predictnl is designed for functions of the parameters and of the data, that is, for predictions. nlcom generalizes lincom (see [R] lincom) in two ways. First, nlcom allows the transformations to be nonlinear. Second, nlcom can be used to simultaneously estimate many transformations (whether linear or nonlinear) and to obtain the estimated variance–covariance matrix of these transformations. Basics In [R] lincom, the following regression was performed: . use http://www.stata-press.com/data/r13/regress . regress y x1 x2 x3 Source SS df MS Model Residual 3259.3561 1627.56282 3 144 1086.45203 11.3025196 Total 4886.91892 147 33.2443464 y Coef. x1 x2 x3 _cons 1.457113 2.221682 -.006139 36.10135 Std. Err. 1.07461 .8610358 .0005543 4.382693 t 1.36 2.58 -11.08 8.24 P>|t| 0.177 0.011 0.000 0.000 Number of obs F( 3, 144) Prob > F R-squared Adj R-squared Root MSE = = = = = = 148 96.12 0.0000 0.6670 0.6600 3.3619 [95% Conf. Interval] -.666934 .5197797 -.0072345 27.43863 3.581161 3.923583 -.0050435 44.76407 Then lincom was used to estimate the difference between the coefficients of x1 and x2: . lincom _b[x2] - _b[x1] ( 1) - x1 + x2 = 0 y Coef. (1) .7645682 Std. Err. t P>|t| [95% Conf. Interval] .9950282 0.77 0.444 -1.20218 2.731316 It was noted, however, that nonlinear expressions are not allowed with lincom: . lincom _b[x2]/_b[x1] not possible with test r(131); Nonlinear transformations are instead estimated using nlcom: . nlcom _b[x2]/_b[x1] _nl_1: _b[x2]/_b[x1] y Coef. _nl_1 1.524714 Std. Err. z P>|z| .9812848 1.55 0.120 [95% Conf. Interval] -.3985686 3.447997 nlcom — Nonlinear combinations of estimators 1467 Technical note The notation b[name] is the standard way in Stata to refer to regression coefficients; see [U] 13.5 Accessing coefficients and standard errors. Some commands, such as lincom and test, allow you to drop the b[] and just refer to the coefficients by name. nlcom, however, requires the full specification b[name]. Returning to our linear regression example, nlcom also allows simultaneous estimation of more than one combination: . nlcom (_b[x2]/_b[x1]) (_b[x3]/_b[x1]) (_b[x3]/_b[x2]) _nl_1: _b[x2]/_b[x1] _nl_2: _b[x3]/_b[x1] _nl_3: _b[x3]/_b[x2] y Coef. _nl_1 _nl_2 _nl_3 1.524714 -.0042131 -.0027632 Std. Err. .9812848 .0033483 .0010695 z 1.55 -1.26 -2.58 P>|z| 0.120 0.208 0.010 [95% Conf. 
Interval] -.3985686 -.0107756 -.0048594 3.447997 .0023494 -.000667 We can also label the transformations to produce more informative names in the estimation table: . nlcom (ratio21:_b[x2]/_b[x1]) (ratio31:_b[x3]/_b[x1]) (ratio32:_b[x3]/_b[x2]) ratio21: _b[x2]/_b[x1] ratio31: _b[x3]/_b[x1] ratio32: _b[x3]/_b[x2] y Coef. ratio21 ratio31 ratio32 1.524714 -.0042131 -.0027632 Std. Err. .9812848 .0033483 .0010695 z 1.55 -1.26 -2.58 P>|z| 0.120 0.208 0.010 [95% Conf. Interval] -.3985686 -.0107756 -.0048594 3.447997 .0023494 -.000667 nlcom stores the vector of estimated combinations and its estimated variance–covariance matrix in r(). . matrix list r(b) r(b)[1,3] ratio21 ratio31 c1 1.5247143 -.00421315 . matrix list r(V) symmetric r(V)[3,3] ratio21 ratio21 .96291982 ratio31 -.00287781 ratio32 -.00014234 ratio32 -.00276324 ratio31 ratio32 .00001121 2.137e-06 1.144e-06 1468 nlcom — Nonlinear combinations of estimators Using the post option When used with the post option, nlcom stores the estimation vector and variance–covariance matrix in e(), making the transformation permanent: . quietly nlcom (ratio21:_b[x2]/_b[x1]) (ratio31:_b[x3]/_b[x1]) > (ratio32:_b[x3]/_b[x2]), post . matrix list e(b) e(b)[1,3] ratio21 ratio31 y1 1.5247143 -.00421315 . matrix list e(V) symmetric e(V)[3,3] ratio21 ratio21 .96291982 ratio31 -.00287781 ratio32 -.00014234 ratio32 -.00276324 ratio31 ratio32 .00001121 2.137e-06 1.144e-06 After posting, we can proceed as if we had just run a Stata estimation (eclass) command. For instance, we can replay the results, . nlcom y Coef. ratio21 ratio31 ratio32 1.524714 -.0042131 -.0027632 Std. Err. .9812848 .0033483 .0010695 z 1.55 -1.26 -2.58 P>|z| 0.120 0.208 0.010 [95% Conf. Interval] -.3985686 -.0107756 -.0048594 3.447997 .0023494 -.000667 or perform other postestimation tasks in the transformed metric, this time making reference to the new “coefficients”: . display _b[ratio31] -.00421315 . estat vce, correlation Correlation matrix of coefficients of nlcom model e(V) ratio21 ratio21 1.0000 ratio31 -0.8759 ratio32 -0.1356 . test _b[ratio21] = 1 ( 1) ratio21 = 1 chi2( 1) = Prob > chi2 = ratio31 ratio32 1.0000 0.5969 1.0000 0.29 0.5928 We see that testing b[ratio21]=1 in the transformed metric is equivalent to testing using testnl b[x2]/ b[x1]=1 in the original metric: . quietly regress y x1 x2 x3 . testnl _b[x2]/_b[x1] = 1 (1) _b[x2]/_b[x1] = 1 chi2(1) = Prob > chi2 = 0.29 0.5928 We needed to refit the regression model to recover the original parameter estimates. nlcom — Nonlinear combinations of estimators 1469 Technical note In a previous technical note, we mentioned that commands such as lincom and test permit reference to name instead of b[name]. This is not the case when lincom and test are used after nlcom, post. In the above, we used . test _b[ratio21] = 1 rather than . test ratio21 = 1 which would have returned an error. Consider this a limitation of Stata. For the shorthand notation to work, you need a variable named name in the data. In nlcom, however, name is just a coefficient label that does not necessarily correspond to any variable in the data. Reparameterizing ML estimators for univariate data When run using only a response and no covariates, Stata’s maximum likelihood (ML) estimation commands will produce ML estimates of the parameters of some assumed univariate distribution for the response. The parameterization, however, is usually not one we are used to dealing with in a nonregression setting. 
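Because the estimated combinations and their covariance matrix are left behind in r(), they can be copied and used in later calculations. The following is only a sketch; copies are made first because r() results are overwritten by subsequent commands:
. matrix bhat = r(b)
. matrix Vhat = r(V)
. display bhat[1,1]
. display sqrt(Vhat[1,1])
The last two commands display the point estimate of the first combination and its standard error.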
In such cases, nlcom can be used to transform the estimation results from a regression model to those from a maximum likelihood estimation of the parameters of a univariate probability distribution in a more familiar metric. Example 1 Consider the following univariate data on Y = # of traffic accidents at a certain intersection in a given year: . use http://www.stata-press.com/data/r13/trafint . summarize accidents Variable Obs Mean Std. Dev. accidents 12 13.83333 14.47778 Min Max 0 41 A quick glance of the output from summarize leads us to quickly reject the assumption that Y is distributed as Poisson because the estimated variance of Y is much greater than the estimated mean of Y . Instead, we choose to model the data as univariate negative binomial, of which a common parameterization is Pr(Y = y) = Γ(r + y) pr (1 − p)y Γ(r)Γ(y + 1) with E(Y ) = r(1 − p) p 0 ≤ p ≤ 1, Var(Y ) = r > 0, y = 0, 1, . . . r(1 − p) p2 There exist no closed-form solutions for the maximum likelihood estimates of p and r, yet they may be estimated by the iterative method of Newton–Raphson. One way to get these estimates would be to write our own Newton–Raphson program for the negative binomial. Another way would be to write our own ML evaluator; see [R] ml. 1470 nlcom — Nonlinear combinations of estimators The easiest solution, however, would be to use Stata’s existing negative binomial ML regression command, nbreg. The only problem with this solution is that nbreg estimates a different parameterization of the negative binomial, but we can worry about that later. . nbreg accidents Fitting Poisson model: Iteration 0: Iteration 1: log likelihood = -105.05361 log likelihood = -105.05361 Fitting constant-only model: Iteration Iteration Iteration Iteration 0: 1: 2: 3: log log log log likelihood likelihood likelihood likelihood = -43.948619 = -43.891483 = -43.89144 = -43.89144 Fitting full model: Iteration 0: Iteration 1: log likelihood = log likelihood = -43.89144 -43.89144 Negative binomial regression Number of obs LR chi2(0) Prob > chi2 Pseudo R2 Dispersion = mean Log likelihood = -43.89144 accidents Coef. _cons = = = = 12 0.00 . 0.0000 Std. Err. z P>|z| [95% Conf. Interval] 2.627081 .3192233 8.23 0.000 2.001415 3.252747 /lnalpha .1402425 .4187147 -.6804233 .9609083 alpha 1.150553 .4817534 .5064026 2.61407 Likelihood-ratio test of alpha=0: chibar2(01) = 122.32 Prob>=chibar2 = 0.000 . nbreg, coeflegend Negative binomial regression Number of obs LR chi2(0) Prob > chi2 Pseudo R2 Dispersion = mean Log likelihood = -43.89144 accidents Coef. _cons 2.627081 _b[accidents:_cons] /lnalpha .1402425 _b[lnalpha:_cons] alpha 1.150553 = = = = 12 0.00 . 0.0000 Legend Likelihood-ratio test of alpha=0: chibar2(01) = 122.32 Prob>=chibar2 = 0.000 From this output, we see that, when used with univariate data, nbreg estimates a regression intercept, β0 , and the logarithm of some parameter α. This parameterization is useful in regression models: β0 is the intercept meant to be augmented with other terms of the linear predictor, and α is an overdispersion parameter used for comparison with the Poisson regression model. However, we need to transform (β0 , lnα) to (p, r). Examining Methods and formulas of [R] nbreg reveals the transformation as p = {1 + α exp(β0 )}−1 which we apply using nlcom: r = α−1 nlcom — Nonlinear combinations of estimators 1471 . nlcom (p:1/(1 + exp([lnalpha]_b[_cons] + _b[_cons]))) > (r:exp(-[lnalpha]_b[_cons])) p: 1/(1 + exp([lnalpha]_b[_cons] + _b[_cons])) r: exp(-[lnalpha]_b[_cons]) accidents Coef. 
p r .0591157 .8691474 Std. Err. z P>|z| [95% Conf. Interval] .0292857 .3639248 2.02 2.39 0.044 0.017 .0017168 .1558679 .1165146 1.582427 Given the invariance of maximum likelihood estimators and the properties of the delta method, the above parameter estimates, standard errors, etc., are precisely those we would have obtained had we instead performed the Newton–Raphson optimization in the (p, r) metric. Technical note Note how we referred to the estimate of lnα above as [lnalpha] b[ cons]. This is not entirely evident from the output of nbreg, which is why we redisplayed the results using the coeflegend option so that we would know how to refer to the coefficients; [U] 13.5 Accessing coefficients and standard errors. nlcom versus eform Many Stata estimation commands allow you to display exponentiated regression coefficients, some by default, some optionally. Known as “eform” in Stata terminology, this reparameterization serves many uses: it gives odds ratios for logistic models, hazard ratios in survival models, incidence-rate ratios in Poisson models, and relative-risk ratios in multinomial logit models, to name a few. For example, consider the following estimation taken directly from the technical note in [R] poisson: . use http://www.stata-press.com/data/r13/airline . generate lnN = ln(n) . poisson injuries XYZowned lnN Iteration 0: log likelihood = -22.333875 Iteration 1: log likelihood = -22.332276 Iteration 2: log likelihood = -22.332276 Poisson regression Number of obs LR chi2(2) Prob > chi2 Log likelihood = -22.332276 Pseudo R2 injuries Coef. XYZowned lnN _cons .6840667 1.424169 4.863891 Std. Err. z P>|z| .3895877 .3725155 .7090501 1.76 3.82 6.86 0.079 0.000 0.000 = = = = 9 19.15 0.0001 0.3001 [95% Conf. Interval] -.0795111 .6940517 3.474178 1.447645 2.154285 6.253603 1472 nlcom — Nonlinear combinations of estimators When we replay results and specify the irr (incidence-rate ratios) option, . poisson, irr Poisson regression Number of obs LR chi2(2) Prob > chi2 Pseudo R2 Log likelihood = -22.332276 injuries IRR XYZowned lnN _cons 1.981921 4.154402 129.5272 = = = = 9 19.15 0.0001 0.3001 Std. Err. z P>|z| [95% Conf. Interval] .7721322 1.547579 91.84126 1.76 3.82 6.86 0.079 0.000 0.000 .9235678 2.00181 32.2713 4.253085 8.621728 519.8828 we obtain the exponentiated regression coefficients and their estimated standard errors. Contrast this with what we obtain if we exponentiate the coefficients manually by using nlcom: . nlcom (E_XYZowned:exp(_b[XYZowned])) (E_lnN:exp(_b[lnN])) E_XYZowned: exp(_b[XYZowned]) E_lnN: exp(_b[lnN]) injuries Coef. E_XYZowned E_lnN 1.981921 4.154402 Std. Err. z P>|z| [95% Conf. Interval] .7721322 1.547579 2.57 2.68 0.010 0.007 .4685701 1.121203 3.495273 7.187602 There are three things to note when comparing poisson, irr (and eform in general) with nlcom: 1. The exponentiated coefficients and standard errors are identical. This is certainly good news. 2. The Wald test statistic (z) and level of significance are different. When using poisson, irr and other related eform options, the Wald test does not change from what you would have obtained without the eform option, and you can see this by comparing both versions of the poisson output given previously. 
When you use eform, Stata knows that what is usually desired is a test of H0 : exp(β) = 1 and not the uninformative-by-comparison H0 : exp(β) = 0 The test of H0 : exp(β) = 1 is asymptotically equivalent to a test of H0 : β = 0, the Wald test in the original metric, but the latter has better small-sample properties. Thus if you specify eform, you get a test of H0 : β = 0. nlcom, however, is general. It does not attempt to infer the test of greatest interest for a given transformation, and so a test of H0 : transformed coefficient = 0 is always given, regardless of the transformation. 3. You may be surprised to see that, even though the coefficients and standard errors are identical, the confidence intervals (both 95%) are different. nlcom — Nonlinear combinations of estimators 1473 eform confidence intervals are standard confidence intervals with the endpoints transformed. For example, the confidence interval for the coefficient on lnN is [0.694, 2.154], whereas the confidence interval for the incidence-rate ratio due to lnN is [exp(0.694), exp(2.154)] = [2.002, 8.619], which, except for some roundoff error, is what we see from the output of poisson, irr. For exponentiated coefficients, confidence intervals based on transform-the-endpoints methodology generally have better small-sample properties than their asymptotically equivalent counterparts. The transform-the-endpoints method, however, gives valid coverage only when the transformation is monotonic. nlcom uses a more general and asymptotically equivalent method for calculating confidence intervals, as described in Methods and formulas. Stored results nlcom stores the following in r(): Scalars r(N) r(df r) number of observations residual degrees of freedom Matrices r(b) r(V) vector of transformed coefficients estimated variance–covariance matrix of the transformed coefficients If post is specified, nlcom also stores the following in e(): Scalars e(N) e(df r) e(N strata) e(N psu) e(rank) number residual number number rank of of observations degrees of freedom of strata L, if used after svy of sampled PSUs n, if used after svy e(V) Macros e(cmd) nlcom e(predict) program used to implement predict e(properties) b V Matrices e(b) e(V) e(V srs) e(V srswr) e(V msp) vector of transformed coefficients estimated variance–covariance matrix of the transformed coefficients bsrswor , if svy simple-random-sampling-without-replacement (co)variance V bsrswr , if svy and fpc() simple-random-sampling-with-replacement (co)variance V bmsp , if svy and available misspecification (co)variance V Functions e(sample) marks estimation sample Methods and formulas Given a 1 ×k vector of parameter estimates, b θ = (θb1 , . . . , θbk ), consider the estimated p-dimensional transformation g(b θ) = [g1 (b θ), g2 (b θ), . . . , gp (b θ)] The estimated variance–covariance of g(b θ) is given by n o c g(b Var θ) = GVG0 1474 nlcom — Nonlinear combinations of estimators where G is the p × k matrix of derivatives for which Gij = ∂gi (θ) ∂θj θ=b θ i = 1, . . . , p j = 1, . . . , k and V is the estimated variance–covariance matrix of b θ. Standard errors are obtained as the square roots of the variances. The Wald test statistic for testing H0 : gi (θ) = 0 versus the two-sided alternative is given by Zi = h gi (b θ) n oi1/2 c ii g(b Var θ) When the variance–covariance matrix of b θ is an asymptotic covariance matrix, Zi is approximately distributed as Gaussian. 
For linear regression, Zi is taken to be approximately distributed as t1,r where r is the residual degrees of freedom from the original fitted model. A (1 − α) × 100% confidence interval for gi (θ) is given by h n oi1/2 c ii g(b gi (b θ) ± zα/2 Var θ) for those cases where Zi is Gaussian and h n oi1/2 c ii g(b gi (b θ) ± tα/2,r Var θ) for those cases where Zi is t distributed. zp is the 1 − p quantile of the standard normal distribution, and tp,r is the 1 − p quantile of the t distribution with r degrees of freedom. References Feiveson, A. H. 1999. FAQ: What is the delta method and how is it used to estimate the standard error of a transformed parameter? http://www.stata.com/support/faqs/stat/deltam.html. Gould, W. W. 1996. crc43: Wald test of nonlinear hypotheses after model estimation. Stata Technical Bulletin 29: 2–4. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 15–18. College Station, TX: Stata Press. Oehlert, G. W. 1992. A note on the delta method. American Statistician 46: 27–29. Phillips, P. C. B., and J. Y. Park. 1988. On the formulation of Wald tests of nonlinear restrictions. Econometrica 56: 1065–1083. Also see [R] lincom — Linear combinations of estimators [R] predictnl — Obtain nonlinear predictions, standard errors, etc., after estimation [R] test — Test linear hypotheses after estimation [R] testnl — Test nonlinear hypotheses after estimation [U] 20 Estimation and postestimation commands Title nlogit — Nested logit regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax Nested logit regression nlogit depvar indepvars if in weight || lev1 equation || altvar: byaltvarlist , case(varname) options || lev2 equation . . . where the syntax of lev# equation is altvar: byaltvarlist , base(# | lbl) estconst Create variable based on specification of branches nlogitgen newaltvar = altvar (branchlist) , nolog where branchlist is branch, branch , branch . . . and branch is label: alternative | alternative | alternative . . . Display tree structure nlogittree altvarlist if in weight , choice(depvar) nolabel nobranches 1475 1476 nlogit — Nested logit regression Description options Model ∗ case(varname) base(# | lbl) noconstant nonnormalized altwise constraints(constraints) collinear use varname to identify cases use the specified level or label of altvar as the base alternative for the bottom level suppress the constant terms for the bottom-level alternatives use the nonnormalized parameterization use alternativewise deletion instead of casewise deletion apply specified linear constraints keep collinear variables SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, bootstrap, or jackknife Reporting level(#) notree nocnsreport display options set confidence level; default is level(95) suppress display of tree-structure output; see also nolabel and nobranches do not display constraints control column formats and line width Maximization maximize options control the maximization process; seldom used ∗ case(varname) is required. bootstrap, by, fp, jackknife, statsby, and xi are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. fweights, iweights, and pweights are allowed with nlogit, and fweights are allowed with nlogittree; see [U] 11.1.6 weight. Weights for nlogit must be constant within case. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 
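Before turning to the individual pieces of syntax, it may help to see how nlogitgen, nlogittree, and nlogit fit together. The following sketch is only a schematic of the workflow; it uses the restaurant-choice data and the same commands that are developed step by step in Remarks and examples below.
. use http://www.stata-press.com/data/r13/restaurant, clear
. nlogitgen type = restaurant(fast: Freebirds | MamasPizza,
>     family: CafeEccell | LosNortenos | WingsNmore, fancy: Christophers | MadCows)
. nlogittree restaurant type, choice(chosen)
. nlogit chosen cost rating distance || type: income kids, base(family) ||
>     restaurant:, noconstant case(family_id)
The first command groups the bottom-level alternatives into nests, the second displays the resulting tree, and the third fits the RUM-consistent nested logit model; each step is discussed in detail in the examples that follow.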
Menu nlogit Statistics > Categorical outcomes > Nested logit regression Categorical outcomes > Setup for nested logit regression Categorical outcomes > Display nested logit tree structure nlogitgen Statistics > nlogittree Statistics > Description nlogit performs full information maximum-likelihood estimation for nested logit models. These models relax the assumption of independently distributed errors and the independence of irrelevant alternatives inherent in conditional and multinomial logit models by clustering similar alternatives into nests. nlogit — Nested logit regression 1477 By default, nlogit uses a parameterization that is consistent with random utility maximization (RUM). Before version 10 of Stata, a nonnormalized version of the nested logit model was fit, which you can request by specifying the nonnormalized option. You must use nlogitgen to generate a new categorical variable to specify the branches of the decision tree before calling nlogit. Options Specification and options for lev# equation altvar is a variable identifying alternatives at this level of the hierarchy. byaltvarlist specifies the variables to be used to compute the by-alternative regression coefficients for that level. For each variable specified in the variable list, there will be one regression coefficient for each alternative of that level of the hierarchy. If the variable is constant across each alternative (a case-specific variable), the regression coefficient associated with the base alternative is not identifiable. These regression coefficients are labeled as (base) in the regression table. If the variable varies among the alternatives, a regression coefficient is estimated for each alternative. base(# | lbl) can be specified in each level equation where it identifies the base alternative to be used at that level. The default is the alternative that has the highest frequency. If vce(bootstrap) or vce(jackknife) is specified, you must specify the base alternative for each level that has a byaltvarlist or if the constants will be estimated. Doing so ensures that the same model is fit with each call to nlogit. estconst applies to all the level equations except the bottom-level equation. Specifying estconst requests that constants for each alternative (except the base alternative) be estimated. By default, no constant is estimated at these levels. Constants can be estimated in only one level of the tree hierarchy. If you specify estconst for one of the level equations, you must specify noconstant for the bottom-level equation. Options for nlogit Model case(varname) specifies the variable that identifies each case. case() is required. base(# | lbl) can be specified in each level equation where it identifies the base alternative to be used at that level. The default is the alternative that has the highest frequency. If vce(bootstrap) or vce(jackknife) is specified, you must specify the base alternative for each level that has a byaltvarlist or if the constants will be estimated. Doing so ensures that the same model is fit with each call to nlogit. noconstant applies only to the equation defining the bottom level of the hierarchy. By default, constants are estimated for each alternative of altvar, less the base alternative. To suppress the constant terms for this level, specify noconstant. If you do not specify noconstant, you cannot specify estconst for the higher-level equations. 
nonnormalized requests a nonnormalized parameterization of the model that does not scale the inclusive values by the degree of dissimilarity of the alternatives within each nest. Use this option to replicate results from older versions of Stata. The default is to use the RUM –consistent parameterization. 1478 nlogit — Nested logit regression altwise specifies that alternativewise deletion be used when marking out observations because of missing values in your variables. The default is to use casewise deletion. This option does not apply to observations that are marked out by the if or in qualifier or the by prefix. constraints(constraints); see [R] estimation options. The inclusive-valued/dissimilarity parameters are parameterized as ml ancillary parameters. They are labeled as alternative tau const, where alternative is one of the alternatives defining a branch in the tree. To constrain the inclusive-valued/dissimilarity parameter for alternative a1 to be, say, equal to alternative a2, you would use the following syntax: . constraint 1 [a1_tau]_cons = [a2_tau]_cons . nlogit ..., constraints(1) collinear prevents collinear variables from being dropped. Use this option when you know that you have collinear variables and you are applying constraints() to handle the rank reduction. See [R] estimation options for details on using collinear with constraints(). nlogit will not allow you to specify an independent variable in more than one level equation. Specifying the collinear option will allow execution to proceed in this case, but it is your responsibility to ensure that the parameters are identified. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. If vce(robust) or vce(cluster clustvar) is specified, the likelihood-ratio test for the independence of irrelevant alternatives (IIA) is not computed. Reporting level(#); see [R] estimation options. notree specifies that the tree structure of the nested logit model not be displayed. See also nolabel and nobranches below for when notree is not specified. nocnsreport; see [R] estimation options. display options: cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. The technique(bhhh) option is not allowed. Specification and options for nlogitgen newaltvar and altvar are variables identifying alternatives at each level of the hierarchy. label defines a label to associate with the branch. If no label is given, a numeric value is used. nlogit — Nested logit regression 1479 alternative specifies an alternative, of altvar specified in the syntax, to be included in the branch. It is either a numeric value or the label associated with that value. An example of nlogitgen is . nlogitgen type = restaurant(fast: 1 | 2, > family: CafeEccell | LosNortenos | WingsNmore, fancy: 6 | 7) nolog suppresses the display of the iteration log. Specification and options for nlogittree Main altvarlist is a list of alternative variables that define the tree hierarchy. 
The first variable must define bottom-level alternatives, and the order continues to the variable defining the top-level alternatives. choice(depvar) defines the choice indicator variable and forces nlogittree to compute and display choice frequencies for each bottom-level alternative. nolabel forces nlogittree to suppress value labels in tree-structure output. nobranches forces nlogittree to suppress drawing branches in the tree-structure output. Remarks and examples Remarks are presented under the following headings: Introduction Data setup and the tree structure Estimation Testing for the IIA Nonnormalized model Introduction nlogit performs full information maximum-likelihood estimation for nested logit models. These models relax the assumption of independently distributed errors and the IIA inherent in conditional and multinomial logit models by clustering similar alternatives into nests. Because the nested logit model is a direct generalization of the alternative-specific conditional logit model (also known as McFadden’s choice model), you may want to read [R] asclogit before continuing. By default, nlogit uses a parameterization that is consistent with RUM. Before version 10 of Stata, a nonnormalized version of the nested logit model was fit, which you can request by specifying the nonnormalized option. We recommend using the RUM-consistent version of the model for new projects because it is based on a sound model of consumer behavior. McFadden (1977, 1981) showed how this model can be derived from a rational choice framework. Amemiya (1985, chap. 9) contains a nice discussion of how this model can be derived under the assumption of utility maximization. Hensher, Rose, and Greene (2005) provide a lucid introduction to choice models including nested logit. Throughout this entry, we consider a model of restaurant choice. We begin by introducing the data. 1480 nlogit — Nested logit regression Example 1 We have fictional data on 300 families and their choice of seven local restaurants. Freebirds and Mama’s Pizza are fast food restaurants; Café Eccell, Los Norteños, and Wings ’N More are family restaurants; and Christopher’s and Mad Cows are fancy restaurants. We want to model the decision of where to eat as a function of household income (income, in thousands of dollars), the number of children in the household (kids), the rating of the restaurant according to a local restaurant guide (rating, coded 0–5), the average meal cost per person (cost), and the distance between the household and the restaurant (distance, in miles). income and kids are attributes of the family, rating is an attribute of the alternative (the restaurant), and cost and distance are attributes of the alternative as perceived by the families—that is, each family has its own cost and distance for each restaurant. We begin by loading the data and listing some of the variables for the first three families: . use http://www.stata-press.com/data/r13/restaurant . 
describe Contains data from http://www.stata-press.com/data/r13/restaurant.dta obs: 2,100 vars: 8 10 Mar 2013 01:17 size: 67,200 variable name storage type display format family_id restaurant income cost kids rating distance float float float float float float float %9.0g %12.0g %9.0g %9.0g %9.0g %9.0g %9.0g chosen float %9.0g Sorted by: family_id value label names variable label family ID choices of restaurants household income average meal cost per person number of kids in the household ratings in local restaurant guide distance between home and restaurant 0 no 1 yes nlogit — Nested logit regression 1481 . list family_id restaurant chosen kids rating distance in 1/21, sepby(fam) > abbrev(10) family_id restaurant chosen kids rating distance 1. 2. 3. 4. 5. 6. 7. 1 1 1 1 1 1 1 Freebirds MamasPizza CafeEccell LosNortenos WingsNmore Christophers MadCows 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 1 2 3 2 4 5 1.245553 2.82493 4.21293 4.167634 6.330531 10.19829 5.601388 8. 9. 10. 11. 12. 13. 14. 2 2 2 2 2 2 2 Freebirds MamasPizza CafeEccell LosNortenos WingsNmore Christophers MadCows 0 0 0 1 0 0 0 3 3 3 3 3 3 3 0 1 2 3 2 4 5 4.162657 2.865081 5.337799 4.282864 8.133914 8.664631 9.119597 15. 16. 17. 18. 19. 20. 21. 3 3 3 3 3 3 3 Freebirds MamasPizza CafeEccell LosNortenos WingsNmore Christophers MadCows 1 0 0 0 0 0 0 3 3 3 3 3 3 3 0 1 2 3 2 4 5 2.112586 2.215329 6.978715 5.117877 5.312941 9.551273 5.539806 Because each family chose among seven restaurants, there are 7 observations in the dataset for each family. The variable chosen is coded 0/1, with 1 indicating the chosen restaurant and 0 otherwise. We could fit a conditional logit model to our data. Because income and kids are constant within each family, we would use the asclogit command instead of clogit. However, the conditional logit may be inappropriate. That model assumes that the random errors are independent, and as a result it forces the odds ratio of any two alternatives to be independent of the other alternatives, a property known as the IIA. We will discuss the IIA assumption in more detail later. Assuming that unobserved shocks influencing a decision maker’s attitude toward one alternative have no effect on his attitudes toward the other alternatives may seem innocuous, but often this assumption is too restrictive. Suppose that when a family was deciding which restaurant to visit, they were pressed for time because of plans to attend a movie later. The unobserved shock (being in a hurry) would raise the likelihood that the family goes to either fast food restaurant (Freebirds or Mama’s Pizza). Similarly, another family might be choosing a restaurant to celebrate a birthday and therefore be inclined to attend a fancy restaurant (Christopher’s or Mad Cows). Nested logit models relax the independence assumption and allow us to group alternatives for which unobserved shocks may have concomitant effects. Here we suspect that restaurants should be grouped by type (fast, family, or fancy). The tree structure of a family’s decision about where to eat might look like this: 1482 nlogit — Nested logit regression Dining Fast food restaurants Mama's Freebirds Pizza Family restaurants Wings 'N More Los Café Norteños Eccell Fancy restaurants Christopher's Mad Cows At the bottom of the tree are the individual restaurants, indicating that there are some random shocks that affect a family’s decision to eat at each restaurant independently. 
Above the restaurants are the three types of restaurants, indicating that other random shocks affect the type of restaurant chosen. As is customary when drawing decision trees, at the top level is one box, representing the family making the decision. We use the following terms to describe nested logit models. level, or decision level, is the level or stage at which a decision is made. The example above has only two levels. In the first level, a type of restaurant is chosen—fast food, family, or fancy—and in the second level, a specific restaurant is chosen. bottom level is the level where the final decision is made. In our example, this is when we choose a specific restaurant. alternative set is the set of all possible alternatives at any given decision level. bottom alternative set is the set of all possible alternatives at the bottom level. This concept is often referred to as the choice set in the economics-choice literature. In our example, the bottom alternative set is all seven of the specific restaurants. alternative is a specific alternative within an alternative set. In the first level of our example, “fast food” is an alternative. In the second or bottom level, “Mad Cows” is an alternative. Not all alternatives within an alternative set are available to someone making a choice at a specific stage, only those that are nested within all higher-level decisions. chosen alternative is the alternative from an alternative set that we observe someone having chosen. Technical note Although decision trees in nested logit analysis are often interpreted as implying that the highestlevel decisions are made first, followed by decisions at lower levels, and finally the decision among alternatives at the bottom level, no such temporal ordering is implied. See Hensher, Rose, and Greene (2005, chap. 13). In our example, we are not assuming that families first choose whether to attend a fast, family, or fancy restaurant and then choose the particular restaurant; we assume merely that they choose one of the seven restaurants. nlogit — Nested logit regression 1483 Data setup and the tree structure To fit a nested logit model, you must first create a variable that defines the structure of your decision tree. Example 2 To run nlogit, we need to generate a categorical variable that identifies the first-level set of alternatives: fast food, family restaurants, or fancy restaurants. We can do so easily by using nlogitgen. . nlogitgen type = restaurant(fast: Freebirds | MamasPizza, > family: CafeEccell | LosNortenos| WingsNmore, fancy: Christophers | MadCows) new variable type is generated with 3 groups label list lb_type lb_type: 1 fast 2 family 3 fancy . nlogittree restaurant type, choice(chosen) tree structure specified for the nested logit model type N restaurant N k fast 600 family 900 fancy 600 Freebirds MamasPizza CafeEccell LosNortenos WingsNmore Christophers MadCows 300 300 300 300 300 300 300 12 15 78 75 69 27 24 total 2100 300 k = number of times alternative is chosen N = number of observations at each level The new categorical variable is type, which takes on value 1 (fast) if restaurant is Freebirds or Mama’s Pizza; value 2 (family) if restaurant is Café Eccell, Los Norteños, or Wings ’N More; and value 3 (fancy) otherwise. nlogittree displays the tree structure. Technical note We could also use values instead of value labels of restaurant in nlogitgen. Value labels are optional, and the default value labels for type are type1, type2, and type3. The vertical bar is also optional. . 
use http://www.stata-press.com/data/r13/restaurant, clear . nlogitgen type = restaurant(1 2, 3 4 5, 6 7) new variable type is generated with 3 groups label list lb_type lb_type: 1 type1 2 type2 3 type3 1484 nlogit — Nested logit regression . nlogittree restaurant type tree structure specified for the nested logit model type N restaurant N type1 600 type2 900 type3 600 Freebirds MamasPizza CafeEccell LosNortenos WingsNmore Christophers MadCows 300 300 300 300 300 300 300 total 2100 N = number of observations at each level In our dataset, every family was able to choose among all seven restaurants. However, in other applications some decision makers may not have been able to choose among all possible alternatives. For example, two cases may have choice hierarchies of case 1 type restaurant type fast Freebirds MamasPizza CafeEccell LosNortenos WingsNmore Christophers MadCows fast family fancy case 2 restaurant family fancy Freebirds MamasPizza LosNortenos WingsNmore Christophers where the second case does not have the restaurant alternatives Café Eccell or Mad Cows available to them. The only restriction is that the relationships between higher- and lower-level alternative sets be the same for all decision makers. In this two-level example, Freebirds and Mama’s Pizza are classified as fast food restaurants for both cases; Café Eccell, Los Norteños, and Wings ’N More are family restaurants; and Christopher’s and Mad Cows are fancy restaurants. nlogit requires only that hierarchy be maintained for all cases. Estimation Example 3 With our type variable created that defines the three types of restaurants, we can now examine how the alternative-specific attributes (cost, rating, and distance) apply to the bottom alternative set (the seven restaurants) and how family-specific attributes (income and kid) apply to the alternative set at the first decision level (the three types of restaurants). nlogit — Nested logit regression . use http://www.stata-press.com/data/r13/restaurant, clear . qui nlogitgen type = restaurant(fast: Freebirds | MamasPizza, > family: CafeEccell | LosNortenos| WingsNmore, fancy: Christophers | MadCows) . nlogit chosen cost rating distance || type: income kids, base(family) || > restaurant:, noconstant case(family_id) tree structure specified for the nested logit model type N restaurant N k fast 600 family 900 fancy 600 Freebirds MamasPizza CafeEccell LosNortenos WingsNmore Christophers MadCows 300 300 300 300 300 300 300 12 15 78 75 69 27 24 total 2100 300 k = number of times alternative is chosen N = number of observations at each level Iteration 0: log likelihood = -541.93581 (output omitted ) Iteration 17: log likelihood = -485.47331 RUM-consistent nested logit regression Case variable: family_id Alternative variable: restaurant Number of obs Number of cases Alts per case: min avg max Wald chi2(7) Prob > chi2 Log likelihood = -485.47331 Std. Err. z P>|z| = = = = = = = 2100 300 7 7.0 7 46.71 0.0000 chosen Coef. [95% Conf. 
Interval] restaurant cost rating distance -.1843847 .463694 -.3797474 .0933975 .3264935 .1003828 -1.97 1.42 -3.78 0.048 0.156 0.000 -.3674404 -.1762215 -.5764941 -.0013289 1.10361 -.1830007 income kids -.0266038 -.0872584 .0117306 .1385026 -2.27 -0.63 0.023 0.529 -.0495952 -.3587184 -.0036123 .1842016 family income kids 0 0 5.08 -3.24 0.000 0.001 .0283595 -.6351267 .0640059 -.1567559 -1.201295 .614463 -1.407896 4.627051 4.395763 9.607583 type equations fast (base) (base) fancy income kids .0461827 -.3959413 .0090936 .1220356 dissimilarity parameters type /fast_tau /family_tau /fancy_tau 1.712878 2.505113 4.099844 LR test for IIA (tau = 1): 1.48685 .9646351 2.810123 chi2(3) = 6.87 Prob > chi2 = 0.0762 1485 1486 nlogit — Nested logit regression First, let’s examine how we called nlogit. The delimiters (||) separate equations. The first equation specifies the dependent variable, chosen, and three alternative-specific variables, cost, rating, and distance. We refer to these variables as alternative-specific because they vary among the bottom-level alternatives, the restaurants. We obtain one parameter estimate for each variable. These estimates are listed in the equation subtable labeled restaurant. For the second equation, we specify the type variable. It identifies the first-level alternatives, the restaurant types. Following the colon after type, we specify two case-specific variables, income and kids. Here we obtain a parameter estimate for each variable for each alternative at this level. That is why we call these variable lists by-alternative variables. Because income and kids do not vary within each case, to identify the model one alternative’s set of parameters must be set to zero. We specified the base(family) option with this equation to restrict the parameters for the family alternative. The variable identifying the bottom-level alternatives, restaurant, is specified after the second equation delimiter. We do not specify any variables after the colon delimiter at this level. Had we specified variables here, we would have obtained an estimate for each variable in each equation. As we will see below, these variables parameterize the constant term in the utility equation for each bottom-level alternative. The noconstant option suppresses bottom-level alternative-specific constant terms. Near the bottom of the output are the dissimilarity parameters, which measure the degree of correlation of random shocks within each of the three types of restaurants. Dissimilarity parameters greater than one imply that the model is inconsistent with RUM; Hensher, Rose, and Greene (2005, sec. 13.6) discuss this in detail. We will ignore the fact that all our dissimilarity parameters exceed one. The conditional logit model is a special case of nested logit in which all the dissimilarity parameters are equal to one. At the bottom of the output, we find a likelihood-ratio test of this hypothesis. Here we have mixed evidence of the null hypothesis that all the parameters are one. Equivalently, the property known as the IIA imposed by the conditional logit model holds if and only if all dissimilarity parameters are equal to one. We discuss the IIA in more detail now. Testing for the IIA The IIA is a property of the multinomial and conditional logit models that forces the odds of choosing one alternative over another to be independent of the other alternatives. 
For simplicity, suppose that a family was choosing only between Freebirds and Mama’s Pizza, and the family was equally likely to choose either of the restaurants. The probability of going to each restaurant is 50%. Now suppose that Bill’s Burritos opens up next door to Freebirds, which is also a burrito restaurant. If the IIA holds, then the probability of going to each restaurant must now be 33.33% so that the family remains equally likely to go to Mama’s Pizza or Freebirds. The IIA may sometimes be a plausible assumption. However, a more likely scenario would be for the probability of going to Mama’s Pizza to remain at 50% and the probabilities of going to Freebirds and Bill’s Burritos to be 25% each, because the two restaurants are next door to each other and serve the same food. Nested logit analysis would allow us to relax the IIA assumption of conditional logit. We could group Bill’s Burritos and Freebirds into one nest that encompasses all burrito restaurants and create a second nest for pizzerias. The IIA is a consequence of assuming that the errors are independent and identically distributed (i.i.d.). Because the errors are i.i.d., they cannot contain any alternative-specific unobserved information, and therefore adding a new alternative cannot affect the relationship between a pair of existing alternatives. nlogit — Nested logit regression 1487 In the previous example, we saw that a joint test that the dissimilarity parameters were equal to one is one way to test for IIA. However, that test required us to specify a decision tree for the nested logit model, and different specifications could lead to conflicting results of the test. Hausman and McFadden (1984) suggest that if part of the choice set truly is irrelevant with respect to the other alternatives, omitting that subset from the conditional logit model will not lead to inconsistent estimates. Therefore, Hausman’s (1978) specification test can be used to test for IIA, and this test will not be sensitive to the tree structure we specify for a nested logit model. Example 4 We want to test the IIA for the subset of family restaurants against the alternatives of fast food and fancy restaurants. To do so, we need to use Stata’s hausman command; see [R] hausman. We first run the estimation on the full bottom alternative set, store the results by using estimates store, and then run the estimation on the bottom alternative set, excluding the alternatives of family restaurants. We then run the hausman test. . . . . generate generate generate generate incFast = (type == 1) * income incFancy = (type == 3) * income kidFast = (type == 1) * kids kidFancy = (type == 3) * kids . clogit chosen cost rating distance incFast incFancy kidFast kidFancy, > group(family_id) nolog Conditional (fixed-effects) logistic regression Number of obs = 2100 LR chi2(7) = 189.73 Prob > chi2 = 0.0000 Log likelihood = -488.90834 Pseudo R2 = 0.1625 chosen Coef. cost rating distance incFast incFancy kidFast kidFancy -.1367799 .3066622 -.1977505 -.0390183 .0407053 -.2398757 -.3893862 Std. Err. .0358479 .1418291 .0471653 .0094018 .0080405 .1063674 .1143797 z -3.82 2.16 -4.19 -4.15 5.06 -2.26 -3.40 P>|z| 0.000 0.031 0.000 0.000 0.000 0.024 0.001 [95% Conf. Interval] -.2070404 .0286823 -.2901927 -.0574455 .0249462 -.448352 -.6135662 -.0665193 .584642 -.1053082 -.0205911 .0564644 -.0313994 -.1652061 . estimates store fullset . 
clogit chosen cost rating distance incFast kidFast if type != 2, > group(family_id) nolog note: 222 groups (888 obs) dropped because of all positive or all negative outcomes. Conditional (fixed-effects) logistic regression Number of obs = LR chi2(5) = Prob > chi2 = Log likelihood = -85.955324 Pseudo R2 = chosen Coef. cost rating distance incFast kidFast -.0616621 .1659001 -.244396 -.0737506 .4105386 Std. Err. .067852 .2832041 .0995056 .0177444 .2137051 z -0.91 0.59 -2.46 -4.16 1.92 P>|z| 0.363 0.558 0.014 0.000 0.055 312 44.35 0.0000 0.2051 [95% Conf. Interval] -.1946496 -.3891698 -.4394234 -.108529 -.0083157 .0713254 .72097 -.0493687 -.0389721 .8293928 1488 nlogit — Nested logit regression . hausman . fullset Coefficients (b) (B) . fullset cost rating distance incFast kidFast Test: -.0616621 .1659001 -.244396 -.0737506 .4105386 -.1367799 .3066622 -.1977505 -.0390183 -.2398757 (b-B) Difference sqrt(diag(V_b-V_B)) S.E. .0751178 -.1407621 -.0466456 -.0347323 .6504143 .0576092 .2451308 .0876173 .015049 .1853533 b = consistent under Ho and Ha; obtained from clogit B = inconsistent under Ha, efficient under Ho; obtained from clogit Ho: difference in coefficients not systematic chi2(5) = (b-B)’[(V_b-V_B)^(-1)](b-B) = 10.70 Prob>chi2 = 0.0577 (V_b-V_B is not positive definite) Similar to our findings in example 3, the results of the test of the IIA are mixed. We cannot reject the IIA at the commonly used 5% significance level, but we could at the 10% level. Substantively, a significant test result suggests that the odds of going to one of the fancy restaurants versus going to one of the fast food restaurants changes if we include the family restaurants in the alternative set and that a nested logit specification may be warranted. Nonnormalized model Previous versions of Stata fit a nonnormalized nested logit model that is available via the nonnormalized option. The nonnormalized version is presented in, for example, Greene (2012, 768–770). Here we outline the differences between the RUM-consistent and nonnormalized models. Our discussion follows Heiss (2002) and assumes the decision tree has two levels, with M alternatives at the upper level and a total of J alternatives at the bottom level. In a RUM framework, by consuming alternative j , decision maker i obtains utility Uij = Vij + ij = αj + xij βj + zi γj + ij where Vij is the deterministic part of utility and ij is the random part. xij are alternative-specific variables and zi are case-specific variables. The set of errors i1 , . . . , iJ are assumed to follow the generalized extreme-value (GEV) distribution, which is a generalization of the type 1 extreme-value distribution that allows for alternatives within nests of the tree structure to be √ correlated. Let ρm denote the correlation in nest m, and define the dissimilarity parameter τm = 1 − ρm . τm = 0 implies that the alternatives in nest m are perfectly correlated, whereas τm = 1 implies independence. The inclusive value for the mth nest corresponds to the expected value of the utility that decision maker i obtains by consuming an alternative in nest m. Denote this value by IVm : X IVm = ln exp (Vk /τm ) (1) j∈Bm where Bm denotes the set of alternatives in nest m. 
Given the inclusive values, we can show that the probability that random-utility–maximizing decision maker i chooses alternative j is Prj = exp {Vj /τ (j)} exp {τ (j)IV(j)} P exp {IV(j)} m exp (τm IVm ) nlogit — Nested logit regression 1489 where τ (j) and IV(j) are the dissimilarity parameter and inclusive value for the nest in which alternative j lies. In contrast, for the nonnormalized model, we have a latent variable e j + zi γ Vei,j = α ej + xi,j β ej and corresponding inclusive values em IV = ln X exp (Vek ) (2) j∈Bm The probability of choosing alternative j is e (j) exp (Vej ) exp τ (j)IV P Prj = e e (j) exp IV m exp (τm IVm ) Equations (1) and (2) represent the key difference between the RUM-consistent and nonnormalized models. By scaling the Vij within each nest, the RUM-consistent model allows utilities to be compared across nests. Without the rescaling, utilities can be compared only for goods within the same nest. Moreover, adding a constant to each Vij for consumer i will not affect the probabilities of the RUMconsistent model, but adding a constant to each Veij will affect the probabilities from the nonnormalized model. Decisions based on utility maximization can depend only on utility differences and not the scale or zero point of the utility function because utility is an ordinal concept, so the nonnormalized model cannot be consistent with utility maximization. Heiss (2002) showed that the nonnormalized model can be RUM consistent in the special case where all the variables are specified in the bottom-level equation. Then multiplying the nonnormalized coefficients by the respective dissimilarity parameters results in the RUM-consistent coefficients. Technical note Degenerate nests occur when there is only one alternative in a branch of the tree hierarchy. The associated dissimilarity parameter of the RUM model is not defined. The inclusive-valued parameter of the nonnormalized model will be identifiable if there are alternative-specific variables specified in (1) of the model specification (the indepvars in the model syntax). Numerically, you can skirt the issue of nonidentifiable/undefined parameters by setting constraints on them. For the RUM model constraint, set the dissimilarity parameter to 1. See the description of constraints() in Options for details on setting constraints on the dissimilarity parameters. 
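For example, suppose that alternative a1 defines a degenerate branch, so that its dissimilarity parameter is undefined. Following the alternative_tau_cons naming convention described under constraints() in Options above, a minimal sketch of the recommended constraint is
. constraint 1 [a1_tau]_cons = 1
. nlogit ..., constraints(1)
Here a1 stands in for the label of the degenerate branch in your own tree; the rest of the nlogit specification is unchanged.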
1490 nlogit — Nested logit regression Stored results nlogit stores the following in e(): Scalars e(N) e(N case) e(k eq) e(k eq model) e(k alt) e(k altj) e(k indvars) e(k ind2vars) e(k ind2varsj) e(df m) e(df c) e(ll) e(ll c) e(N clust) e(chi2) e(chi2 c) e(p) e(p c) e(i base) e(i basej) e(levels) e(alt min) e(alt avg) e(alt max) e(const) e(constj) e(rum) e(rank) e(ic) e(rc) e(converged) number of observations number of cases number of equations in e(b) number of equations in overall model test number of alternatives for bottom level number of alternatives for jth level number of independent variables number of by-alternative variables for bottom level number of by-alternative variables for jth level model degrees of freedom clogit model degrees of freedom log likelihood clogit model log likelihood number of clusters χ2 likelihood-ratio test for IIA p-value for model Wald test p-value for IIA test base index for bottom level base index for jth level number of levels minimum number of alternatives average number of alternatives maximum number of alternatives constant indicator for bottom level constant indicator for jth level 1 if RUM model, 0 otherwise rank of e(V) number of iterations return code 1 if converged, 0 otherwise nlogit — Nested logit regression Macros e(cmd) e(cmdline) e(depvar) e(indvars) e(ind2vars) e(ind2varsj) e(case) e(altvar) e(altvarj) e(alteqs) e(alteqsj) e(alti) e(altj i) e(wtype) e(wexp) e(title) e(clustvar) e(chi2type) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(datasignature) e(datasignaturevars) e(properties) e(estat cmd) e(predict) e(marginsnotok) Matrices e(b) e(Cns) e(k altern) e(k branchj) e(stats) e(statsj) e(altidxj) e(alt ind2vars) e(alt ind2varsj) e(ilog) e(gradient) e(V) e(V modelbased) Functions e(sample) 1491 nlogit command as typed name of dependent variable name of independent variables by-alternative variables for bottom level by-alternative variables for jth level variable defining cases alternative variable for bottom level alternative variable for jth level equation names for bottom level equation names for jth level ith alternative for bottom level ith alternative for jth level weight type weight expression title in estimation output name of cluster variable Wald, type of model χ2 test vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique the checksum variables used in calculation of checksum b V program used to implement estat program used to implement predict predictions disallowed by margins coefficient vector constraints matrix number of alternatives at each level number of branches at each alternative of jth level alternative statistics for bottom level alternative statistics for jth level alternative indices for jth level indicators for bottom level estimated by-alternative variable—e(k alt)×e(k ind2vars) indicators for jth level estimated by-alternative variable—e(k altj)×e(k ind2varsj) iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample Methods and formulas Methods and formulas are presented under the following headings: Two-level nested logit model Three-level nested logit model 1492 nlogit — Nested logit regression Two-level nested logit model Consider our two-level nested logit model for restaurant choice. 
We define T = {1, 2, 3} to be the set of indices denoting the three restaurant types and R1 = {1, 2}, R2 = {3, 4, 5}, and R3 = {6, 7} to be the set of indices representing each restaurant within type t ∈ T . Let C1 and C2 be the random variables that represent the choices made for the first level, restaurant type, and second level, restaurant, of the hierarchy, where we observe the choices C1 = t, t ∈ T , and C2 = j, j ∈ Rt . Let zt and xtj , for t ∈ T and j ∈ Rt , refer to the row vectors of explanatory variables for the first-level alternatives and bottom-level alternatives for one case, respectively. We write the utilities (latent variables) as Utj = zt αt + xtj βj + tj = ηtj + tj , where αt and βj are column vectors and the tj are random disturbances. When the xtj are alternative specific, we can drop the indices from β, where we estimate one coefficient for each alternative in Rt , t ∈ T . These variables are specified in the first equation of the nlogit syntax (see example 3). When the random-utility framework is used to describe the choice behavior, the alternative that is chosen is the alternative that has the highest utility. Assume for our restaurant example that we choose restaurant type t ∈ T . For the RUM parameterization of nlogit, the conditional distribution of tj given choice of restaurant type t is a multivariate version of Gumbel’s extreme-value distribution, " ( )τt # X FR|T ( | t) = exp − exp(tm /τt ) (3) m∈Rt where it has been shown that the tj , j ∈ Rt , are exchangeable with correlation 1 − τt2 , for τt ∈ (0, 1] (Kotz and Nadarajah 2000). For example, the probability of choosing Christopher’s, j = 6 given type t = 3, is Pr(C2 = 6 | C1 = 3) = Pr (U36 − U37 > 0) = Pr (37 ≤ 36 + η36 − η37 ) Z ∞ Z 36 +η36 −η37 = fR|T (36 , 37 ) d37 d36 −∞ −∞ ∂F is the joint density function of given t. U37 is the utility of eating at Mad ∂36 ∂37 Cows, the other fancy (t = 3) restaurant. Amemiya (1985) demonstrates that this integral evaluates to the logistic function where f = exp(η36 /τ3 ) exp(η36 /τ3 ) + exp(η37 /τ3 ) exp(x36 β6 /τ3 ) = exp(x36 β6 /τ3 ) + exp(x37 β7 /τ3 ) Pr(C2 = 6 | C1 = 3) = and in general Pr(C2 = j | C1 = t) = P exp(xtj βj /τt ) m∈Rt exp(xtm βm /τt ) (4) Letting τt = 1 in (3) reduces to the product of independent extreme-value distributions, and (4) reduces to the multinomial logistic function. For the logistic function in (4), we scale the linear predictors by the dissimilarity parameters. Another formulation of the conditional probability of choosing alternative j ∈ Rt given choice t ∈ T is the logistic function without this normalization: nlogit — Nested logit regression 1493 exp(xtj βj ) m∈Rt exp(xtm βm ) Pr(C2 = j | C1 = t) = P and this is what is used in nlogit’s nonnormalized parameterization. Amemiya (1985) defines the general form for the joint distribution of the ’s as ( !τk ) X X FT,R () = exp − θk exp(−km /τk ) m∈Rk k∈T from which the probability of choice t, t ∈ T can be derived as P τt θt m∈Rt exp(ηtm /τt ) Pr(C1 = t) = P P k∈T θk m∈Rk exp(ηkm /τk ) τk (5) nlogit sets θt = 1. Noting that ( )τ t ( )τt X X zt αt + xtm βm exp(ηtm /τt ) = exp τt m∈Rt m∈Rt ( )τt X = exp(zt αt ) exp (xtm βm /τt ) m∈Rt = exp(zt αt + τt It ) we define the inclusive values It as X It = ln exp(xtm βm /τt ) m∈Rt and we can view )τt ( exp(τt It ) = X 1/τt exp(xtm βm ) m∈Rt as a weighted average of the exp(xtm βm ), for m ∈ Rt . 
For the nlogit RUM parameterization, we can express (5) as exp(zt αt + τt It ) Pr(C1 = t) = P k∈T exp(zk αk + τk Ik ) Next we define inclusive values for the nonnormalized model to be X e It = ln exp(xtm βm ) m∈Rt and we express Pr(C1 = t) as Pr(C1 = t) = P exp(zt αt + τt Iet ) k∈T exp(zk αk + τk Iek ) (6) 1494 nlogit — Nested logit regression Equation (5) is consistent with (6) only when ηij = xij βj , so in general the nlogit nonnormalized model is not consistent with the RUM model. Now assume that we have N cases where we add a third subscript, i, to denote case i, i = 1, . . . , N . Denote yitj to be a binary variable indicating the choice made by case i so that for each i only one yitj is 1 and the rest are 0 for all t ∈ T and j ∈ Rt . The log likelihood for the two-level RUM-consistent model is log ` = N X X X yitj log {Pr(Ci1 = k)Pr(Ci2 = m|Ci1 = k)} i=1 k∈T m∈Rk = N X X X " ( yitj zik αk + τk Iik − log i=1 k∈T m∈Rk ) X exp(zil αl + τl Iil ) + l∈T ( xitj βm /τk − log )# X exp(xikl βl /τk ) l∈Rk The likelihood for the nonnormalized model has a similar form, replacing I with Ie and by not scaling xikj βj by τk . Three-level nested logit model Here we define a three-level nested logit model that can be generalized to the four-level and higher models. As before, let the integer set T be the indices for the first level of choices. Let sets St , t ∈ T , be mutually exclusive sets of integers representing the choices of the second level of the hierarchy. Finally, let Rj , j ∈ St , be the bottom-level choices. Let Utjk = ηtjk + tjk , k ∈ Rj , and the distribution of tjk be Gumbel’s multivariate extreme value of the form τj /υt υj X X X F () = exp − exp(−ηtjk /τj ) t∈T j∈St k∈Rj Let C1 , C2 , and C3 represent the choice random variables for levels 1, 2, and the bottom, respectively. Then the set of conditional probabilities is exp(ηtjk /τj ) l∈Rj exp(ηtjl /τj ) nP oτj /υt k∈Rj exp(ηtjk /τj ) Pr(C2 = j | C1 = t) = P P τl /υt l∈St k∈Rl exp(ηtlk /τl ) nP oτj /υt υt P j∈St k∈Rj exp(ηtjk /τj ) Pr(C1 = t) = nP oτj /υl υl P P exp(η /τ ) ljk j l∈T j∈Sl k∈Rj Pr(C3 = k | C1 = t, C2 = j) = P nlogit — Nested logit regression 1495 Assume that we can decompose the linear predictor as ηtjk = zt αt + utj γj + xtjk βk . Here zt , utj , and xtjk are the row vectors of explanatory variables for the first, second, and bottom levels of the hierarchy, respectively, and αt , γj , and βk are the corresponding column vectors of regression coefficients for t ∈ T , j ∈ St , and k ∈ Rj . We then can define the inclusive values for the first and second levels as X exp(xtjk βk /τj ) Itj = log k∈Rj Jt = log X exp(utj γj /υt + j∈St τj Itj ) υt and rewrite the probabilities exp(xtjk βk /τj ) l∈Rj exp(xtjl βl /τj ) Pr(C3 = k | C1 = t, C2 = j) = P τj υt Itj ) P τl l∈St exp(utl γl /υt + υt Itl ) exp(utj γj /υt + Pr(C2 = j | C1 = t) = exp(zt αt + υt Jt ) l∈T exp(zl αl + υl Jl ) Pr(C1 = t) = P We add a fourth index, i, for case and define the indicator variable yitjk , i = 1, . . . , N , to indicate the choice made by case i, t ∈ T , j ∈ St , and k ∈ Rj . The log likelihood for the nlogit RUM-consistent model is ( ! N X X X X X `= yitjk zit αt + υt Jit − log zim αm + υm Jim + i=1 t∈T j∈St k∈Rj m∈T τj uitj γj /υt + Iitj − log υt xitjk βk /τk − X τm Iitm uitm γm /υt + υt m∈St ) X ! + exp(xitjm βm /τk ) m∈Rt and for the nonnormalized nlogit model the log likelihood is ( ! N X X X X X `= yitjk zit αt + υt Jit − log zim αm + υm Jim + i=1 t∈T j∈St k∈Rj m∈T ! 
uitj γj + τj Iitj − log X uitm γm + τm Iitm + m∈St ) xitjk βk − X exp(xitjm βm ) m∈Rt Extending the model to more than three levels is straightforward, albeit notationally cumbersome. This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. 1496 nlogit — Nested logit regression References Amemiya, T. 1985. Advanced Econometrics. Cambridge, MA: Harvard University Press. Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. Hausman, J. A. 1978. Specification tests in econometrics. Econometrica 46: 1251–1271. Hausman, J. A., and D. L. McFadden. 1984. Specification tests for the multinomial logit model. Econometrica 52: 1219–1240. Heiss, F. 2002. Structural choice analysis with nested logit models. Stata Journal 2: 227–252. Hensher, D. A., J. M. Rose, and W. H. Greene. 2005. Applied Choice Analysis: A Primer. New York: Cambridge University Press. Kotz, S., and S. Nadarajah. 2000. Extreme Value Distributions: Theory and Applications. London: Imperial College Press. Maddala, G. S. 1983. Limited-Dependent and Qualitative Variables in Econometrics. Cambridge: Cambridge University Press. McFadden, D. L. 1977. Quantitative methods for analyzing travel behaviour of individuals: Some recent developments. Working paper 474, Cowles Foundation. http://cowles.econ.yale.edu/P/cd/d04b/d0474.pdf. . 1981. Econometric models of probabilistic choice. In Structural Analysis of Discrete Data with Econometric Applications, ed. C. F. Manski and D. McFadden, 198–272. Cambridge, MA: MIT Press. Also see [R] nlogit postestimation — Postestimation tools for nlogit [R] asclogit — Alternative-specific conditional logit (McFadden’s choice) model [R] clogit — Conditional (fixed-effects) logistic regression [R] mlogit — Multinomial (polytomous) logistic regression [R] ologit — Ordered logistic regression [R] rologit — Rank-ordered logistic regression [R] slogit — Stereotype logistic regression [U] 20 Estimation and postestimation commands Title nlogit postestimation — Postestimation tools for nlogit Description Options for predict Remarks and examples Syntax for predict Syntax for estat alternatives Also see Menu for predict Menu for estat Description The following postestimation command is of special interest after nlogit: Command Description estat alternatives alternative summary statistics The following standard postestimation commands are also available: Command Description estat ic estat summarize estat vce estimates hausman lincom Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results Hausman’s specification test point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest nlcom predict predictnl test testnl Special-interest postestimation command estat alternatives displays summary statistics about the alternatives in the estimation sample for each level of 
the tree structure. 1497 1498 nlogit postestimation — Postestimation tools for nlogit Syntax for predict predict type predict type newvar if in stub* | newvarlist , statistic hlevel(#) altwise if in , scores Description statistic Main predicted probabilities of choosing the alternatives at all levels of the hierarchy or at level #, where # is specified by hlevel(#); the default linear predictors for all levels of the hierarchy or at level #, where # is specified by hlevel(#) predicted conditional probabilities at all levels of the hierarchy or at level #, where # is specified by hlevel(#) inclusive values for levels 2, . . . , e(levels) or for hlevel(#) pr xb condp iv The inclusive value for the first-level alternatives is not used in estimation; therefore, it is not calculated. These statistics are available both in and out of sample; type predict . . . if e(sample) . . . if wanted only for the estimation sample. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main pr calculates the probability of choosing each alternative at each level of the hierarchy. Use the hlevel(#) option to compute the alternative probabilities at level #. When hlevel(#) is not specified, j new variables must be given, where j is the number of levels, or use the stub* option to have predict generate j variables with the prefix stub and numbered from 1 to j . The pr option is the default and if one new variable is given, the probability of the bottom-level alternatives are computed. Otherwise, probabilities for all levels are computed and the stub* option is still valid. xb calculates the linear prediction for each alternative at each level. Use the hlevel(#) option to compute the linear predictor at level #. When hlevel(#) is not specified, j new variables must be given, where j is the number of levels, or use the stub* option to have predict generate j variables with the prefix stub and numbered from 1 to j . condp calculates the conditional probabilities for each alternative at each level. Use the hlevel(#) option to compute the conditional probabilities of the alternatives at level #. When hlevel(#) is not specified, j new variables must be given, where j is the number of levels, or use the stub* option to have predict generate j variables with the prefix stub and numbered from 1 to j . iv calculates the inclusive value for each alternative at each level. Use the hlevel(#) option to compute the inclusive value at level #. There is no inclusive value at level 1. If hlevel(#) is not used, j − 1 new variables are required, where j is the number of levels, or use stub* to have predict generate j − 1 variables with the prefix stub and numbered from 2 to j . See Methods and formulas in [R] nlogit for a definition of the inclusive values. hlevel(#) calculates the prediction only for hierarchy level #. nlogit postestimation — Postestimation tools for nlogit 1499 altwise specifies that alternativewise deletion be used when marking out observations due to missing values in your variables. The default is to use casewise deletion. The xb option always uses alternativewise deletion. scores calculates the scores for each coefficient in e(b). This option requires a new-variable list of length equal to the number of columns in e(b). Otherwise, use the stub* option to have predict generate enumerated variables with prefix stub. 
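When only one level of the hierarchy is of interest, the hlevel(#) option lets predict create a single new variable rather than one variable per level. The sketch below assumes the two-level restaurant model fit in [R] nlogit; the new variable names ptype and cprest are arbitrary.
. predict ptype, pr hlevel(1)
. predict cprest, condp hlevel(2)
The first command stores the predicted probabilities of the first-level alternatives, and the second stores the probabilities of the bottom-level alternatives conditional on the first-level choice. Example 1 below gives a fuller demonstration using all the prediction types.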
Syntax for estat alternatives estat alternatives Menu for estat Statistics > Postestimation > Reports and statistics Remarks and examples predict may be used after nlogit to obtain the predicted values of the probabilities, the conditional probabilities, the linear predictions, and the inclusive values for each level of the nested logit model. Predicted probabilities for nlogit must be interpreted carefully. Probabilities are estimated for each case as a whole and not for individual observations. Example 1 Continuing with our model in example 3 of [R] nlogit, we refit the model and then examine a summary of the alternatives and their frequencies in the estimation sample. . use http://www.stata-press.com/data/r13/restaurant . nlogitgen type = restaurant(fast: Freebirds | MamasPizza, > family: CafeEccell | LosNortenos | WingsNmore, fancy: Christophers | MadCows) (output omitted ) . nlogit chosen cost rating distance || type: income kids, base(family) || > restaurant:, noconst case(family_id) (output omitted ) 1500 nlogit postestimation — Postestimation tools for nlogit . estat alternatives Alternatives summary for type index Alternative value 1 2 3 1 2 3 label Cases present Frequency selected Percent selected fast family fancy 600 900 600 27 222 51 9.00 74.00 17.00 label Cases present Frequency selected Percent selected Freebirds MamasPizza CafeEccell LosNortenos WingsNmore Christophers MadCows 300 300 300 300 300 300 300 12 15 78 75 69 27 24 4.00 5.00 26.00 25.00 23.00 9.00 8.00 Alternatives summary for restaurant index Alternative value 1 2 3 4 5 6 7 1 2 3 4 5 6 7 Next we predict p2 = Pr(restaurant); p1 = Pr(type); condp = Pr(restaurant | type); xb2, the linear prediction for the bottom-level alternatives; xb1, the linear prediction for the first-level alternatives; and iv, the inclusive values for the bottom-level alternatives. . predict p* (option pr assumed) . predict condp, condp hlevel(2) . sort family_id type restaurant . list restaurant type chosen p2 p1 condp in 1/14, sepby(family_id) divider restaurant type chosen p2 p1 condp 1. 2. 3. 4. 5. 6. 7. Freebirds MamasPizza CafeEccell LosNortenos WingsNmore Christophers MadCows fast fast family family family fancy fancy 1 0 0 0 0 0 0 .0642332 .0547278 .284409 .3045242 .1849429 .0429508 .0642122 .1189609 .1189609 .7738761 .7738761 .7738761 .107163 .107163 .5399519 .4600481 .3675124 .3935051 .2389825 .4007991 .5992009 8. 9. 10. 11. 12. 13. 14. Freebirds MamasPizza CafeEccell LosNortenos WingsNmore Christophers MadCows fast fast family family family fancy fancy 0 0 0 1 0 0 0 .0183578 .030537 .2832149 .3038883 .1689618 .1041277 .0909125 .0488948 .0488948 .756065 .756065 .756065 .1950402 .1950402 .3754559 .6245441 .3745907 .4019341 .2234752 .533878 .466122 . predict xb*, xb . predict iv, iv nlogit postestimation — Postestimation tools for nlogit . list restaurant type chosen xb* iv in 1/14, sepby(family_id) divider restaurant type chosen xb1 xb2 iv 1. 2. 3. 4. 5. 6. 7. Freebirds MamasPizza CafeEccell LosNortenos WingsNmore Christophers MadCows fast fast family family family fancy fancy 1 0 0 0 0 0 0 -1.124805 -1.124805 0 0 0 1.405185 1.405185 -1.476914 -1.751229 -2.181112 -2.00992 -3.259229 -6.804211 -5.155514 -.2459659 -.2459659 .1303341 .1303341 .1303341 -.745332 -.745332 8. 9. 10. 11. 12. 13. 14. 
Freebirds MamasPizza CafeEccell LosNortenos WingsNmore Christophers MadCows fast fast family family family fancy fancy 0 0 0 1 0 0 0 -1.804794 -1.804794 0 0 0 1.490775 1.490775 -2.552233 -1.680583 -2.400434 -2.223939 -3.694409 -5.35932 -5.915751 -.5104123 -.5104123 .0237072 .0237072 .0237072 -.6796131 -.6796131 Also see [R] nlogit — Nested logit regression [U] 20 Estimation and postestimation commands 1501 Title nlsur — Estimation of nonlinear systems of equations Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax Interactive version nlsur (depvar 1 = ) (depvar 2 = ) . . . if in weight , options Programmed substitutable expression version if in weight , options nlsur sexp prog : depvar 1 depvar 2 . . . varlist Function evaluator program version if in weight , nlsur func prog @ depvar 1 depvar 2 . . . varlist nequations(#) parameters(namelist) | nparameters(#) options where depvar j is the dependent variable for equation j ; j is the substitutable expression for equation j ; sexp prog is a substitutable expression program; and func prog is a function evaluator program. 1502 nlsur — Estimation of nonlinear systems of equations 1503 Description options Model fgnls ifgnls nls variables(varlist) initial(initial values) nequations(#) ∗ parameters(namelist) ∗ nparameters(#) sexp options func options use two-step FGNLS estimator; the default use iterative FGNLS estimator use NLS estimator variables in model initial values for parameters number of equations in model (function evaluator program version only) parameters in model (function evaluator program version only) number of parameters in model (function evaluator program version only) options for substitutable expression program options for function evaluator program SE/Robust vce(vcetype) vcetype may be gnr, robust, cluster clustvar, bootstrap, or jackknife Reporting level(#) title(string) title2(string) display options set confidence level; default is level(95) display string as title above the table of parameter estimates display string as subtitle control column formats and line width Optimization optimization options eps(#) ifgnlsiterate(#) ifgnlseps(#) delta(#) noconstants hasconstants(namelist) control the optimization process; seldom used specify # for convergence criteria; default is eps(1e-5) set maximum number of FGNLS iterations specify # for FGNLS convergence criterion; default is ifgnlseps(1e-10) specify stepsize # for computing derivatives; default is delta(4e-7) no equations have constant terms use namelist as constant terms coeflegend display legend instead of statistics ∗ You must specify parameters(namelist), nparameters(#), or both. bootstrap, by, jackknife, rolling, and statsby are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. aweights, fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Linear models and related > Multiple-equation models > Nonlinear seemingly unrelated regression 1504 nlsur — Estimation of nonlinear systems of equations Description nlsur fits a system of nonlinear equations by feasible generalized nonlinear least squares (FGNLS). 
With the interactive version of the command, you enter the system of equations on the command line or in the dialog box by using substitutable expressions. If you have a system that you use regularly, you can write a substitutable expression program and use the second syntax to avoid having to reenter the system every time. The function evaluator program version gives you the most flexibility in exchange for increased complexity; with this version, your program is given a vector of parameters and a variable list, and your program computes the system of equations. When you write a substitutable expression program or a function evaluator program, the first five letters of the name must be nlsur. sexp prog and func prog refer to the name of the program without the first five letters. For example, if you wrote a function evaluator program named nlsurregss, you would type nlsur regss @ . . . to estimate the parameters. Options Model fgnls requests the two-step FGNLS estimator; this is the default. ifgnls requests the iterative FGNLS estimator. For the nonlinear systems estimator, this is equivalent to maximum likelihood estimation. nls requests the nonlinear least-squares (NLS) estimator. variables(varlist) specifies the variables in the system. nlsur ignores observations for which any of these variables has missing values. If you do not specify variables(), nlsur issues an error message if the estimation sample contains any missing values. initial(initial values) specifies the initial values to begin the estimation. You can specify a 1 × k matrix, where k is the total number of parameters in the system, or you can specify a parameter name, its initial value, another parameter name, its initial value, and so on. For example, to initialize alpha to 1.23 and delta to 4.57, you would type . nlsur . . . , initial(alpha 1.23 delta 4.57) . . . Initial values declared using this option override any that are declared within substitutable expressions. If you specify a matrix, the values must be in the same order in which the parameters are declared in your model. nlsur ignores the row and column names of the matrix. nequations(#) specifies the number of equations in the system. parameters(namelist) specifies the names of the parameters in the system. The names of the parameters must adhere to the naming conventions of Stata’s variables; see [U] 11.3 Naming conventions. If you specify both parameters() and nparameters(), the number of names in the former must match the number specified in the latter. nparameters(#) specifies the number of parameters in the system. If you do not specify names with the parameters() options, nlsur names them b1, b2, . . . , b#. If you specify both parameters() and nparameters(), the number of names in the former must match the number specified in the latter. sexp options refer to any options allowed by your sexp prog. func options refer to any options allowed by your func prog. nlsur — Estimation of nonlinear systems of equations 1505 SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (gnr), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. vce(gnr), the default, uses the conventionally derived variance estimator for nonlinear models fit using Gauss–Newton regression. Reporting level(#); see [R] estimation options. 
title(string) specifies an optional title that will be displayed just above the table of parameter estimates. title2(string) specifies an optional subtitle that will be displayed between the title specified in title() and the table of parameter estimates. If title2() is specified but title() is not, title2() has the same effect as title(). display options: cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Optimization optimization options: iterate(#), no log, trace. iterate() specifies the maximum number of iterations to use for NLS at each round of FGNLS estimation. This option is different from ifgnlsiterate(), which controls the maximum rounds of FGNLS estimation to use when the ifgnls option is specified. log/nolog specifies whether to show the iteration log, and trace specifies that the iteration log should include the current parameter vector. eps(#) specifies the convergence criterion for successive parameter estimates and for the residual sum of squares (RSS). The default is eps(1e-5) (0.00001). eps() also specifies the convergence criterion for successive parameter estimates between rounds of iterative FGNLS estimation when ifgnls is specified. ifgnlsiterate(#) specifies the maximum number of FGNLS iterations to perform. The default is the number set using set maxiter (see [R] maximize), which is 16,000 by default. To use this option, you must also specify the ifgnls option. ifgnlseps(#) specifies the convergence criterion for successive estimates of the error covariance matrix during iterative FGNLS estimation. The default is ifgnlseps(1e-10). To use this option, you must also specify the ifgnls option. delta(#) specifies the relative change in a parameter, δ , to be used in computing the numeric derivatives. The derivative for parameter βi is computed as {fi (xi , β1 , β2 , . . . , βi + d, βi+1 , . . .) − fi (xi , β1 , β2 , . . . , βi , βi+1 , . . .)} /d where d = δ(|βi | + δ). The default is delta(4e-7). noconstants indicates that none of the equations in the system includes constant terms. This option is generally not needed, even if there are no constant terms in the system; though in rare cases without this option, nlsur may claim that there is one or more constant terms even if there are none. 1506 nlsur — Estimation of nonlinear systems of equations hasconstants(namelist) indicates the parameters that are to be treated as constant terms in the system of equations. The number of elements of namelist must equal the number of equations in the system. The ith entry of namelist specifies the constant term in the ith equation. If an equation does not include a constant term, specify a period (.) instead of a parameter name. This option is seldom needed with the interactive and programmed substitutable expression versions, because in those cases nlsur can almost always find the constant terms automatically. The following options are available with nlsur but are not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Remarks are presented under the following headings: Introduction Substitutable expression programs Function evaluator programs Introduction nlsur fits a system of nonlinear equations by FGNLS. It can be viewed as a nonlinear variant of Zellner’s seemingly unrelated regression model (Zellner 1962; Zellner and Huang 1962; Zellner 1963) and is therefore commonly called nonlinear SUR or nonlinear SURE. 
The model is also discussed in textbooks such as Davidson and MacKinnon (1993, 2004) and Greene (2012, 305–306). Formally, the model fit by nlsur is yi1 yi2 .. . yiM = f1 (xi , β) + ui1 = f2 (xi , β) + ui2 . = .. = fM (xi , β) + uiM for i = 1, . . . , N observations and m = 1, . . . , M equations. The errors for the ith observation, ui1 , ui2 , . . . , uiM , may be correlated, so fitting the m equations jointly may lead to more efficient estimates. Moreover, fitting the equations jointly allows us to impose cross-equation restrictions on the parameters. Not all elements of the parameter vector β and data vector xi must appear in all the equations, though each element of β must appear in at least one equation for β to be identified. For this model, iterative FGNLS estimation is equivalent to maximum likelihood estimation with multivariate normal disturbances. The syntax you use with nlsur closely mirrors that used with nl. In particular, you use substitutable expressions with the interactive and programmed substitutable expression versions to define the functions in your system. See [R] nl for more information on substitutable expressions. Here we reiterate the three rules that you must follow: 1. Parameters of the model are bound in braces: {b0}, {param}, etc. 2. Initial values for parameters are given by including an equal sign and the initial value inside the braces: {b0=1}, {param=3.571}, etc. If you do not specify an initial value, that parameter is initialized to zero. The initial() option overrides initial values in substitutable expressions. nlsur — Estimation of nonlinear systems of equations 1507 3. Linear combinations of variables can be included using the notation {eqname:varlist}, for example, {xb: mpg price weight}, {score: w x z}, etc. Parameters of linear combinations are initialized to zero. Example 1: Interactive version using two-step FGNLS estimator We have data from an experiment in which two closely related types of bacteria were placed in a Petri dish, and the number of each type of bacteria were recorded every hour. We suspect a two-parameter exponential growth model can be used to model each type of bacteria, but because they shared the same dish, we want to allow for correlation in the error terms. We want to fit the system of equations p1 = β1 β2 t + u1 p 2 = γ1 γ2 t + u2 where p1 and p2 are the two populations and t is time, and we want to allow for nonzero correlation between u1 and u2 . We type . use http://www.stata-press.com/data/r13/petridish . nlsur (p1 = {b1}*{b2}^t) (p2 = {g1}*{g2}^t) (obs = 25) Calculating NLS estimates... Iteration 0: Residual SS = 335.5286 Iteration 1: Residual SS = 333.8583 Iteration 2: Residual SS = 219.9233 Iteration 3: Residual SS = 127.9355 Iteration 4: Residual SS = 14.86765 Iteration 5: Residual SS = 8.628459 Iteration 6: Residual SS = 8.281268 Iteration 7: Residual SS = 8.28098 Iteration 8: Residual SS = 8.280979 Iteration 9: Residual SS = 8.280979 Calculating FGNLS estimates... Iteration 0: Scaled RSS = 49.99892 Iteration 1: Scaled RSS = 49.99892 Iteration 2: Scaled RSS = 49.99892 FGNLS regression Equation Obs Parms RMSE p1 p2 25 25 2 2 .4337019 .3783479 1 2 R-sq 0.9734* 0.9776* Constant (none) (none) * Uncentered R-sq Coef. /b1 /b2 /g1 /g2 .3926631 1.119593 .5090441 1.102315 Std. Err. .064203 .0088999 .0669495 .0072183 z 6.12 125.80 7.60 152.71 P>|z| [95% Conf. 
Interval] 0.000 0.000 0.000 0.000 .2668275 1.102149 .3778256 1.088167 .5184987 1.137036 .6402626 1.116463 The header of the output contains a summary of each equation, including the number of observations and parameters and the root mean squared error of the residuals. nlsur checks to see whether each equation contains a constant term, and if an equation does contain a constant term, an R2 statistic is 1508 nlsur — Estimation of nonlinear systems of equations presented. If an equation does not have a constant term, an uncentered R2 is instead reported. The R2 statistic for each equation measures the percentage of variance explained by the nonlinear function and may be useful for descriptive purposes, though it does not have the same formal interpretation in the context of FGNLS as it does with NLS estimation. As we would expect, β2 and γ2 are both greater than one, indicating the two bacterial populations increased in size over time. The model we fit in the next three examples is in fact linear in the parameters, so it could be fit using the sureg command. However, we will fit the model using nlsur so that we can focus on the mechanics of using the command. Moreover, using nlsur will obviate the need to generate several variables as well as the need to use the constraint command to impose parameter restrictions. Example 2: Interactive version using iterative FGNLS estimator—the translog production function Greene (1997, sec. 15.6) discusses the transcendental logarithmic (translog) cost function and provides cost and input price data for capital, labor, energy, and materials for the U.S. economy. One way to fit the translog production function to these data is to fit the system of three equations pk pl pe sk = βk + δkk ln + δkl ln + δke ln + u1 pm pm pm pk pl pe sl = βl + δkl ln + δll ln + δle ln + u2 pm pm pm pl pe pk + δle ln + δee ln + u3 se = βe + δke ln pm pm pm where sk is capital’s cost share, sl is labor’s cost share, and se is energy’s cost share; pk , pl , pe , and pm are the prices of capital, labor, energy, and materials, respectively; the u’s are regression error terms; and the β s and δ s are parameters to be estimated. There are three cross-equation restrictions on the parameters: δkl , δke , and δle each appear in two equations. To fit this model by using the iterative FGNLS estimator, we type nlsur — Estimation of nonlinear systems of equations 1509 . use http://www.stata-press.com/data/r13/mfgcost . nlsur (s_k = {bk} + {dkk}*ln(pk/pm) + {dkl}*ln(pl/pm) + {dke}*ln(pe/pm)) > (s_l = {bl} + {dkl}*ln(pk/pm) + {dll}*ln(pl/pm) + {dle}*ln(pe/pm)) > (s_e = {be} + {dke}*ln(pk/pm) + {dle}*ln(pl/pm) + {dee}*ln(pe/pm)), > ifgnls (obs = 25) Calculating NLS estimates... Iteration 0: Residual SS = .0009989 Iteration 1: Residual SS = .0009989 Calculating FGNLS estimates... Iteration 0: Scaled RSS = 65.45197 Iteration 1: Scaled RSS = 65.45197 (output omitted ) FGNLS iteration 10... Iteration 0: Scaled RSS = 75 Iteration 1: Scaled RSS = 75 Parameter change = 4.074e-06 Covariance matrix change = 6.265e-10 FGNLS regression Equation Obs Parms RMSE R-sq Constant s_k s_l s_e 25 25 25 4 4 4 .0031722 .0053963 .00177 0.4776 0.8171 0.6615 bk bl be 1 2 3 Coef. /bk /dkk /dkl /dke /bl /dll /dle /be /dee .0568925 .0294833 -.0000471 -.0106749 .253438 .0754327 -.004756 .0444099 .0183415 Std. Err. .0013454 .0057956 .0038478 .0033882 .0020945 .0067572 .002344 .0008533 .0049858 z 42.29 5.09 -0.01 -3.15 121.00 11.16 -2.03 52.04 3.68 P>|z| 0.000 0.000 0.990 0.002 0.000 0.000 0.042 0.000 0.000 [95% Conf. 
Interval] .0542556 .0181241 -.0075887 -.0173157 .2493329 .0621889 -.0093501 .0427374 .0085694 .0595294 .0408425 .0074945 -.0040341 .2575432 .0886766 -.0001619 .0460823 .0281135 We draw your attention to the iteration log at the top of the output. When iterative FGNLS estimation is used, the final scaled RSS will equal the product of the number of observations in the estimation sample and the number of equations; see Methods and formulas for details. Because the RSS is scaled by the error covariance matrix during each round of FGNLS estimation, the scaled RSS is not comparable from one FGNLS iteration to the next. Technical note You may have noticed that we mentioned having data for four factors of production, yet we fit only three share equations. Because the four shares sum to one, we must drop one of the equations to avoid having a singular error covariance matrix. The iterative FGNLS estimator is equivalent to maximum likelihood estimation, and thus it is invariant to which one of the four equations we choose to drop. The (linearly restricted) parameters of the fourth equation can be obtained using the lincom command. Nonlinear functions of the parameters, such as the elasticities of substitution, can be computed using nlcom. 1510 nlsur — Estimation of nonlinear systems of equations Substitutable expression programs If you fit the same model repeatedly or you want to share code with colleagues, you can write a substitutable expression program to define your system of equations and avoid having to retype the system every time. The first five letters of the program’s name must be nlsur, and the program must set the r-class macro r(n eq) to the number of equations in your system. The first equation’s substitutable expression must be returned in r(eq 1), the second equation’s in r(eq 2), and so on. You may optionally set r(title) to label your output; that has the same effect as specifying the title() option. Example 3: Programmed substitutable expression version We return to our translog cost function, for which a substitutable expression program is program nlsurtranslog, rclass version 13 syntax varlist(min=7 max=7) [if] tokenize ‘varlist’ args sk sl se pk pl pe pm local pkpm ln(‘pk’/‘pm’) local plpm ln(‘pl’/‘pm’) local pepm ln(‘pe’/‘pm’) return scalar n_eq = 3 return local eq_1 "‘sk’= {bk} + {dkk}*‘pkpm’ + {dkl}*‘plpm’ + {dke}*‘pepm’" return local eq_2 "‘sl’= {bl} + {dkl}*‘pkpm’ + {dll}*‘plpm’ + {dle}*‘pepm’" return local eq_3 "‘se’= {be} + {dke}*‘pkpm’ + {dle}*‘plpm’ + {dee}*‘pepm’" return local title "4-factor translog cost function" end We made our program accept seven variables, for the three dependent variables sk , sl , and se , and the four factor prices pk , pl , pm , and pe . The tokenize command assigns to macros ‘1’, ‘2’, . . . , ‘7’ the seven variables stored in ‘varlist’, and the args command transfers those numbered macros to macros ‘sk’, ‘sl’, . . . , ‘pm’. Because we knew our substitutable expressions were going to be somewhat long, we created local macros to hold the log price ratios. These are simply macros that hold strings such as ln(pk/pm), not variables, and they will save us some repetitious typing when we define our substitutable expressions. Our program returns the number of equations in r(n eq), and we defined our substitutable expressions in eq 1, eq 2, and eq 3. We do not bind the expressions in parentheses as we do with the interactive version of nlsur. Finally, we put a title in r(title) to label our output. 
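As an aside, once the system is defined this way, starting values can be supplied through the initial() option described above rather than inside the substitutable expressions. A minimal sketch follows; the nine values are purely illustrative and must appear in the order in which the parameters are first declared in the program (bk, dkk, dkl, dke, bl, dll, dle, be, dee).

. matrix b0 = (.06, .03, 0, -.01, .25, .08, 0, .04, .02)
. nlsur translog: s_k s_l s_e pk pl pe pm, ifgnls initial(b0)

Equivalently, name–value pairs can be given, for example, initial(bk .06 bl .25 be .04), leaving the remaining parameters at their default initial value of zero.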
Our syntax command also accepts an if clause, and that is how nlsur indicates the estimation sample to our program. In this application, we can safely ignore it, because our program does not compute initial values. However, had we used commands such as summarize or regress to obtain initial values, then we would need to restrict those commands to analyze only the estimation sample. In those cases, typically, you simply need to include ‘if’ with the commands you are using. For example, instead of the command summarize ‘depvar’, meanonly you would use summarize ‘depvar’ ‘if’, meanonly nlsur — Estimation of nonlinear systems of equations 1511 We can check our program by typing . nlsurtranslog s_k s_l s_e pk pl pe pm . return list scalars: r(n_eq) = 3 macros: r(title) r(eq_3) r(eq_2) r(eq_1) : : : : "4-factor translog cost function" "s_e= {be} + {dke}*ln(pk/pm) + {dle}*ln(pl/pm) + {.." "s_l= {bl} + {dkl}*ln(pk/pm) + {dll}*ln(pl/pm) + {.." "s_k= {bk} + {dkk}*ln(pk/pm) + {dkl}*ln(pl/pm) + {.." Now that we know that our program works, we fit our model by typing . nlsur translog: s_k s_l s_e pk pl pe pm, ifgnls (obs = 25) Calculating NLS estimates... Iteration 0: Residual SS = .0009989 Iteration 1: Residual SS = .0009989 Calculating FGNLS estimates... Iteration 0: Scaled RSS = 65.45197 Iteration 1: Scaled RSS = 65.45197 FGNLS iteration 2... Iteration 0: Scaled RSS = 73.28311 Iteration 1: Scaled RSS = 73.28311 Iteration 2: Scaled RSS = 73.28311 Parameter change = 6.537e-03 Covariance matrix change = 1.002e-06 (output omitted ) FGNLS iteration 10... Iteration 0: Scaled RSS = 75 Iteration 1: Scaled RSS = 75 Parameter change = 4.074e-06 Covariance matrix change = 6.265e-10 FGNLS regression Equation Obs Parms RMSE R-sq Constant s_k s_l s_e 25 25 25 4 4 4 .0031722 .0053963 .00177 0.4776 0.8171 0.6615 bk bl be 1 2 3 4-factor translog cost function Coef. /bk /dkk /dkl /dke /bl /dll /dle /be /dee .0568925 .0294833 -.0000471 -.0106749 .253438 .0754327 -.004756 .0444099 .0183415 Std. Err. .0013454 .0057956 .0038478 .0033882 .0020945 .0067572 .002344 .0008533 .0049858 z 42.29 5.09 -0.01 -3.15 121.00 11.16 -2.03 52.04 3.68 P>|z| 0.000 0.000 0.990 0.002 0.000 0.000 0.042 0.000 0.000 [95% Conf. Interval] .0542556 .0181241 -.0075887 -.0173157 .2493329 .0621889 -.0093501 .0427374 .0085694 .0595294 .0408425 .0074945 -.0040341 .2575432 .0886766 -.0001619 .0460823 .0281135 Because we set r(title) in our substitutable expression program, the coefficient table has a title attached to it. The estimates are identical to those we obtained in example 2. 1512 nlsur — Estimation of nonlinear systems of equations Technical note nlsur accepts frequency and analytic weights as well as pweights (sampling weights) and iweights (importance weights). You do not need to modify your substitutable expressions in any way to perform weighted estimation, though you must make two changes to your substitutable expression program. The general outline of a sexp prog program is program nlsur name, rclass version 13 syntax varlist [fw aw pw iw] [if] // Obtain initial values incorporating weights. summarize varname [‘weight’‘exp’] ‘if’ For example, ... // Return n_eqn and substitutable expressions return scalar n_eq = # return local eq_1 = . . . ... end First, we wrote the syntax statement to accept a weight expression. Here we allow all four types of weights, but if you know that your estimator is valid, say, for only frequency weights, then you should modify the syntax line to accept only fweights. 
Second, if your program computes starting values, then any commands you use must incorporate the weights passed to the program; you do that by including [‘weight’‘exp’] when calling those commands. Function evaluator programs Although substitutable expressions are extremely flexible, there are some problems for which the nonlinear system cannot be defined using them. You can use the function evaluator program version of nlsur in these cases. We present two examples, a simple one to illustrate the mechanics of function evaluator programs and a more complicated one to illustrate the power of nlsur. nlsur — Estimation of nonlinear systems of equations 1513 Example 4: Function evaluator program version Here we write a function evaluator program to fit the translog cost function used in examples 2 and 3. The function evaluator program is program nlsurtranslog2 version 13 syntax varlist(min=7 max=7) [if], at(name) tokenize ‘varlist’ args sk sl se pk pl pe pm tempname bk dkk dkl dke bl dll dle be dee scalar ‘bk’ = ‘at’[1,1] scalar ‘dkk’ = ‘at’[1,2] scalar ‘dkl’ = ‘at’[1,3] scalar ‘dke’ = ‘at’[1,4] scalar ‘bl’ = ‘at’[1,5] scalar ‘dll’ = ‘at’[1,6] scalar ‘dle’ = ‘at’[1,7] scalar ‘be’ = ‘at’[1,8] scalar ‘dee’ = ‘at’[1,9] local pkpm ln(‘pk’/‘pm’) local plpm ln(‘pl’/‘pm’) local pepm ln(‘pe’/‘pm’) quietly { replace ‘sk’ = ‘bk’ + ‘dkk’*‘pkpm’ ‘dke’*‘pepm’ replace ‘sl’ = ‘bl’ + ‘dkl’*‘pkpm’ ‘dle’*‘pepm’ replace ‘se’ = ‘be’ + ‘dke’*‘pkpm’ ‘dee’*‘pepm’ } end + ‘dkl’*‘plpm’ + ‘if’ + ‘dll’*‘plpm’ + ‘if’ + ‘dle’*‘plpm’ + ‘if’ /// /// /// Unlike the substitutable expression program we wrote in example 3, nlsurtranslog2 is not declared as r-class because we will not be returning any stored results. We are again expecting seven variables: three shares and four factor prices, and nlsur will again mark the estimation sample with an if expression. Our function evaluator program also accepts an option named at(), which will receive a parameter vector at which we are to evaluate the system of equations. All function evaluator programs must accept this option. Our model has nine parameters to estimate, and we created nine temporary scalars to hold the elements of the ‘at’ matrix. Because our model has three equations, the first three variables passed to our program are the dependent variables that we are to fill in with the function values. We replaced only the observations in our estimation sample by including the ‘if’ qualifier in the replace statements. Here we could have ignored the ‘if’ qualifier because nlsur will skip over observations not in the estimation sample and we did not perform any computations requiring knowledge of the estimation sample. However, including the ‘if’ is good practice and may result in a slight speed improvement if the functions of your model are complicated and the estimation sample is much smaller than the dataset in memory. We could have avoided creating temporary scalars to hold our individual parameters by writing the replace statements as, for example, replace ‘sk’ = ‘at’[1,1] + ‘at’[1,2]*‘pkpm’ + ‘at’[1,3]*‘plpm’ + ‘at’[1,4]*‘pepm’ ‘if’ You can use whichever method you find more appealing, though giving the parameters descriptive names reduces the chance for mistakes and makes debugging easier. 1514 nlsur — Estimation of nonlinear systems of equations To fit our model by using the function evaluator program version of nlsur, we type . 
nlsur translog2 @ s_k s_l s_e pk pl pe pm, ifgnls nequations(3) > parameters(bk dkk dkl dke bl dll dle be dee) > hasconstants(bk bl be) (obs = 25) Calculating NLS estimates... Iteration 0: Residual SS = .0009989 Iteration 1: Residual SS = .0009989 Calculating FGNLS estimates... Iteration 0: Scaled RSS = 65.45197 Iteration 1: Scaled RSS = 65.45197 FGNLS iteration 2... Iteration 0: Scaled RSS = 73.28311 Iteration 1: Scaled RSS = 73.28311 Iteration 2: Scaled RSS = 73.28311 Parameter change = 6.537e-03 Covariance matrix change = 1.002e-06 FGNLS iteration 3... Iteration 0: Scaled RSS = 74.7113 Iteration 1: Scaled RSS = 74.7113 Parameter change = 2.577e-03 Covariance matrix change = 3.956e-07 FGNLS iteration 4... Iteration 0: Scaled RSS = 74.95356 Iteration 1: Scaled RSS = 74.95356 Iteration 2: Scaled RSS = 74.95356 Parameter change = 1.023e-03 Covariance matrix change = 1.571e-07 FGNLS iteration 5... Iteration 0: Scaled RSS = 74.99261 Iteration 1: Scaled RSS = 74.99261 Parameter change = 4.067e-04 Covariance matrix change = 6.250e-08 FGNLS iteration 6... Iteration 0: Scaled RSS = 74.99883 Iteration 1: Scaled RSS = 74.99883 Iteration 2: Scaled RSS = 74.99883 Parameter change = 1.619e-04 Covariance matrix change = 2.489e-08 FGNLS iteration 7... Iteration 0: Scaled RSS = 74.99981 Iteration 1: Scaled RSS = 74.99981 Iteration 2: Scaled RSS = 74.99981 Parameter change = 6.449e-05 Covariance matrix change = 9.912e-09 FGNLS iteration 8... Iteration 0: Scaled RSS = 74.99997 Iteration 1: Scaled RSS = 74.99997 Iteration 2: Scaled RSS = 74.99997 Parameter change = 2.569e-05 Covariance matrix change = 3.948e-09 FGNLS iteration 9... Iteration 0: Scaled RSS = 75 Iteration 1: Scaled RSS = 75 Parameter change = 1.023e-05 Covariance matrix change = 1.573e-09 FGNLS iteration 10... Iteration 0: Scaled RSS = 75 Iteration 1: Scaled RSS = 75 Parameter change = 4.074e-06 Covariance matrix change = 6.265e-10 nlsur — Estimation of nonlinear systems of equations 1515 FGNLS regression Equation Obs Parms RMSE R-sq Constant s_k s_l s_e 25 25 25 . . . .0031722 .0053963 .00177 0.4776 0.8171 0.6615 bk bl be 1 2 3 Coef. /bk /dkk /dkl /dke /bl /dll /dle /be /dee .0568925 .0294833 -.0000471 -.0106749 .253438 .0754327 -.004756 .0444099 .0183415 Std. Err. .0013454 .0057956 .0038478 .0033882 .0020945 .0067572 .002344 .0008533 .0049858 z 42.29 5.09 -0.01 -3.15 121.00 11.16 -2.03 52.04 3.68 P>|z| 0.000 0.000 0.990 0.002 0.000 0.000 0.042 0.000 0.000 [95% Conf. Interval] .0542556 .0181241 -.0075887 -.0173157 .2493329 .0621889 -.0093501 .0427374 .0085694 .0595294 .0408425 .0074945 -.0040341 .2575432 .0886766 -.0001619 .0460823 .0281135 When we use the function evaluator program version, nlsur requires us to specify the number of equations in nequations(), and it requires us to either specify names for each of our parameters or the number of parameters in the model. Here we used the parameters() option to name our parameters; the order in which we specified them in this option is the same as the order in which we extracted them from the ‘at’ matrix in our program. Had we instead specified nparameters(9), our parameters would have been labeled /b1, /b2, . . . , /b9 in the output. nlsur has no way of telling how many parameters appear in each equation, so the Parms column in the header contains missing values. Moreover, the function evaluator program version of nlsur does not attempt to identify constant terms, so we used the hasconstant option to tell nlsur which parameter in each equation is a constant term. 
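For completeness, the same parameters could be estimated without naming them; a sketch of the equivalent call is below. The coefficients would then be labeled /b1 through /b9, and because hasconstants() is omitted here, we would expect the constant terms to go undetected and the R2 values to be reported as uncentered, as in example 5 below.

. nlsur translog2 @ s_k s_l s_e pk pl pe pm, ifgnls nequations(3) nparameters(9)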
The estimates are identical to those we obtained in examples 2 and 3. Technical note As with substitutable expression programs, if you intend to do weighted estimation with a function evaluator program, you must modify your func prog program’s syntax statement to accept weights. Moreover, if you use any statistical commands when computing your nonlinear functions, then you must include the weight expression with those commands. Example 5: Fitting the basic AIDS model using nlsur Deaton and Muellbauer (1980) introduce the almost ideal demand system (AIDS), and Poi (2012) presents a set of commands and several extensions for fitting the AIDS automatically. Here we show how to fit the basic AIDS model, which is a common example of a nonlinear system of equations, by manually using nlsur. The dataset food.dta contains household expenditures, expenditure shares, and log prices for four broad food groups. For a four-good demand system, we need to fit the following system of three equations: 1516 nlsur — Estimation of nonlinear systems of equations m + u1 P (p) m + u2 w2 = α2 + γ12 lnp1 + γ22 lnp2 + γ23 lnp3 + β2 ln P (p) m + u3 w3 = α3 + γ13 lnp1 + γ23 lnp2 + γ33 lnp3 + β3 ln P (p) w1 = α1 + γ11 lnp1 + γ12 lnp2 + γ13 lnp3 + β1 ln where wk denotes a household’s fraction of expenditures on good k , lnpk denotes the logarithm of the price paid for good k , m denotes a household’s total expenditure on all four goods, the u’s are regression error terms, and lnP (p) = α0 + 4 X 4 αi lnpi + i=1 4 1 XX γij lnpi lnpj 2 i=1 j=1 The parameters for the fourth good’s share equation can be recovered from the following constraints that are imposed by economic theory: 4 X i=1 αi = 1 4 X i=1 βi = 0 γij = γji and 4 X γij = 0 for all j i=1 Our model has a total of 12 unrestricted parameters. We will not estimate α0 directly. Instead, we will set it equal to 5; see Deaton and Muellbauer (1980) for a discussion of why treating α0 as fixed is acceptable. 
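For reference, the translog price index and the theoretical restrictions stated above can be written compactly (LaTeX notation, with the same symbols as in the text):

\[
\ln P(\mathbf{p}) \;=\; \alpha_0 \;+\; \sum_{i=1}^{4}\alpha_i \ln p_i \;+\; \frac{1}{2}\sum_{i=1}^{4}\sum_{j=1}^{4}\gamma_{ij}\,\ln p_i \ln p_j
\]
\[
\sum_{i=1}^{4}\alpha_i = 1, \qquad \sum_{i=1}^{4}\beta_i = 0, \qquad \gamma_{ij}=\gamma_{ji}, \qquad \sum_{i=1}^{4}\gamma_{ij}=0 \ \text{ for all } j
\]

These are the restrictions that the nlsuraids program below imposes when it recovers a4, g14, g24, g34, and the remaining elements of the gamma matrix from the 12 estimated parameters.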
nlsur — Estimation of nonlinear systems of equations 1517 Our function evaluator program is program nlsuraids version 13 syntax varlist(min=8 max=8) if, at(name) tokenize ‘varlist’ args w1 w2 w3 lnp1 lnp2 lnp3 lnp4 lnm tempname a1 a2 a3 a4 scalar ‘a1’ = ‘at’[1,1] scalar ‘a2’ = ‘at’[1,2] scalar ‘a3’ = ‘at’[1,3] scalar ‘a4’ = 1 - ‘a1’ - ‘a2’ - ‘a3’ tempname b1 b2 b3 scalar ‘b1’ = ‘at’[1,4] scalar ‘b2’ = ‘at’[1,5] scalar ‘b3’ = ‘at’[1,6] tempname g11 g12 g13 g14 tempname g21 g22 g23 g24 tempname g31 g32 g33 g34 tempname g41 g42 g43 g44 scalar ‘g11’ = ‘at’[1,7] scalar ‘g12’ = ‘at’[1,8] scalar ‘g13’ = ‘at’[1,9] scalar ‘g14’ = -‘g11’-‘g12’-‘g13’ scalar ‘g21’ = ‘g12’ scalar ‘g22’ = ‘at’[1,10] scalar ‘g23’ = ‘at’[1,11] scalar ‘g24’ = -‘g21’-‘g22’-‘g23’ scalar ‘g31’ = ‘g13’ scalar ‘g32’ = ‘g23’ scalar ‘g33’ = ‘at’[1,12] scalar ‘g34’ = -‘g31’-‘g32’-‘g33’ scalar ‘g41’ = ‘g14’ scalar ‘g42’ = ‘g24’ scalar ‘g43’ = ‘g34’ scalar ‘g44’ = -‘g41’-‘g42’-‘g43’ quietly { tempvar lnpindex gen double ‘lnpindex’ = 5 + ‘a1’*‘lnp1’ + ‘a2’*‘lnp2’ + /// ‘a3’*‘lnp3’ + ‘a4’*‘lnp4’ forvalues i = 1/4 { forvalues j = 1/4 { replace ‘lnpindex’ = ‘lnpindex’ + /// 0.5*‘g‘i’‘j’’*‘lnp‘i’’*‘lnp‘j’’ } } replace ‘w1’ = ‘a1’ + ‘g11’*‘lnp1’ + ‘g12’*‘lnp2’ + /// ‘g13’*‘lnp3’ + ‘g14’*‘lnp4’ + /// ‘b1’*(‘lnm’ - ‘lnpindex’) replace ‘w2’ = ‘a2’ + ‘g21’*‘lnp1’ + ‘g22’*‘lnp2’ + /// ‘g23’*‘lnp3’ + ‘g24’*‘lnp4’ + /// ‘b2’*(‘lnm’ - ‘lnpindex’) replace ‘w3’ = ‘a3’ + ‘g31’*‘lnp1’ + ‘g32’*‘lnp2’ + /// ‘g33’*‘lnp3’ + ‘g34’*‘lnp4’ + /// ‘b3’*(‘lnm’ - ‘lnpindex’) } end The syntax statement accepts eight variables: three expenditure share variables, all four log-price variables, and a variable for log expenditures ( lnm). Most of the code simply extracts the parameters 1518 nlsur — Estimation of nonlinear systems of equations from the ‘at’ matrix. Although we are estimating only 12 parameters, to calculate the price index term and the expenditure share equations, we need the restricted parameters as well. Notice how we impose the constraints on the parameters. We then created a temporary variable to hold lnP (p), and we filled the three dependent variables with the predicted expenditure shares. To fit our model, we type . use http://www.stata-press.com/data/r13/food . nlsur aids @ w1 w2 w3 lnp1 lnp2 lnp3 lnp4 lnexp, > parameters(a1 a2 a3 b1 b2 b3 > g11 g12 g13 g22 g32 g33) > neq(3) ifgnls (obs = 4048) Calculating NLS estimates... Iteration 0: Residual SS = 126.9713 Iteration 1: Residual SS = 125.669 Iteration 2: Residual SS = 125.669 Iteration 3: Residual SS = 125.669 Iteration 4: Residual SS = 125.669 Calculating FGNLS estimates... Iteration 0: Scaled RSS = 12080.14 Iteration 1: Scaled RSS = 12080.14 Iteration 2: Scaled RSS = 12080.14 Iteration 3: Scaled RSS = 12080.14 FGNLS iteration 2... Iteration 0: Scaled RSS = 12143.99 Iteration 1: Scaled RSS = 12143.99 Iteration 2: Scaled RSS = 12143.99 Parameter change = 1.972e-04 Covariance matrix change = 2.936e-06 FGNLS iteration 3... Iteration 0: Scaled RSS = 12144 Iteration 1: Scaled RSS = 12144 Parameter change = 2.178e-06 Covariance matrix change = 3.469e-08 FGNLS regression Equation Obs Parms RMSE w1 w2 w3 4048 4048 4048 . . . .1333175 .1024166 .053777 1 2 3 R-sq 0.9017* 0.8480* 0.7906* Constant (none) (none) (none) * Uncentered R-sq Coef. /a1 /a2 /a3 /b1 /b2 /b3 /g11 /g12 /g13 /g22 /g32 /g33 .3163959 .2712501 .1039898 .0161044 -.0260771 .0014538 .1215838 -.0522943 -.0351292 .0644298 -.0011786 .0424381 Std. Err. 
.0073871 .0056938 .0029004 .0034153 .002623 .0013776 .0057186 .0039305 .0021788 .0044587 .0019767 .0017589 z 42.83 47.64 35.85 4.72 -9.94 1.06 21.26 -13.30 -16.12 14.45 -0.60 24.13 P>|z| 0.000 0.000 0.000 0.000 0.000 0.291 0.000 0.000 0.000 0.000 0.551 0.000 [95% Conf. Interval] .3019175 .2600904 .0983051 .0094105 -.0312181 -.0012463 .1103756 -.0599979 -.0393996 .0556909 -.0050528 .0389909 .3308742 .2824097 .1096746 .0227983 -.0209361 .0041539 .1327921 -.0445908 -.0308588 .0731687 .0026957 .0458854 nlsur — Estimation of nonlinear systems of equations 1519 To get the restricted parameters for the fourth share equation, we can use lincom. For example, to obtain α4 , we type . lincom 1 - [a1]_cons - [a2]_cons - [a3]_cons ( 1) - [a1]_cons - [a2]_cons - [a3]_cons = -1 Coef. (1) .3083643 Std. Err. .0052611 z 58.61 P>|z| [95% Conf. Interval] 0.000 .2980528 For more information on lincom, see [R] lincom. Stored results nlsur stores the following in e(): Scalars e(N) e(k) e(k #) e(k eq) e(k eq model) e(n eq) e(mss #) e(rss #) e(rmse #) e(r2 #) e(ll) e(N clust) e(rank) e(converge) Macros e(cmd) e(cmdline) e(method) e(depvar) e(depvar #) e(wtype) e(wexp) e(title) e(title 2) e(clustvar) e(vce) e(vcetype) e(type) e(sexpprog) e(sexp #) e(params) e(params #) e(funcprog) e(rhs) e(constants) e(properties) e(predict) number of observations number of parameters number of parameters for equation # number of equation names in e(b) number of equations in overall model test number of equations model sum of squares for equation # RSS for equation # root mean squared error for equation # R2 for equation # Gaussian log likelihood (iflgs version only) number of clusters rank of e(V) 1 if converged, 0 otherwise nlsur command as typed fgnls, ifgnls, or nls names of dependent variables dependent variable for equation # weight type weight expression title in estimation output secondary title in estimation output name of cluster variable vcetype specified in vce() title used in label Std. Err. 1 = interactively entered expression 2 = substitutable expression program 3 = function evaluator program substitutable expression program substitutable expression for equation # names of all parameters parameters in equation # function evaluator program contents of variables() identifies constant terms b V program used to implement predict .3186758 1520 nlsur — Estimation of nonlinear systems of equations Matrices e(b) e(init) e(Sigma) e(V) Functions e(sample) coefficient vector initial values vector b) error covariance matrix (Σ variance–covariance matrix of the estimators marks estimation sample Methods and formulas Write the system of equations for the ith observation as yi = f (xi , β) + ui (1) where yi and ui are 1 × M vectors, for i = 1, . . . , N ; f is a function that returns a 1 × M vector; xi represents all the exogenous variables in the system; and β is a 1 × k vector of parameters. The generalized nonlinear least-squares system estimator is defined as b ≡ argmin β β N X 0 {yi − f (xi , β)} Σ−1 {yi − f (xi , β)} i=1 where Σ = E(u0i ui ) is an M × M positive-definite weight matrix. Let T be the Cholesky decomposition of Σ−1 ; that is, TT0 = Σ−1 . Postmultiply (1) by T: yi T = f (xi , β)T + ui T (2) Because E(T0 u0i ui T) = I, we can “stack” the columns of (2) and write y1 T1 = f (x1 , β)T1 + u e11 y1 T2 = f (x1 , β)T2 + u e12 .. .. .=. y1 TM = f (x1 , β)TM + u e1M .. .. .=. (3) yN T1 = f (xN , β)T1 + u eN 1 yN T2 = f (xN , β)T2 + u eN 2 .. .. .=. yN TM = f (xN , β)TM + u eN M where Tj denotes the j th column of T. 
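Restating the estimator just defined in compact notation may help; writing r_i(beta) for the 1 × M residual row of observation i (a symbol introduced here only for convenience), the pieces above are

\[
\widehat{\boldsymbol\beta} \;=\; \arg\min_{\boldsymbol\beta}\; \sum_{i=1}^{N} r_i(\boldsymbol\beta)\,\Sigma^{-1}\,r_i(\boldsymbol\beta)',
\qquad r_i(\boldsymbol\beta) = \mathbf{y}_i - \mathbf{f}(\mathbf{x}_i,\boldsymbol\beta),
\qquad \Sigma = E(\mathbf{u}_i'\mathbf{u}_i)
\]
\[
\mathbf{T}\mathbf{T}' = \Sigma^{-1}, \qquad \mathbf{y}_i\mathbf{T} = \mathbf{f}(\mathbf{x}_i,\boldsymbol\beta)\,\mathbf{T} + \mathbf{u}_i\mathbf{T}, \qquad E(\mathbf{T}'\mathbf{u}_i'\mathbf{u}_i\mathbf{T}) = \mathbf{I}_M
\]

so that stacking the M transformed columns reduces the multivariate problem to the univariate nonlinear least-squares problem solved by nl, as described next.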
By construction, all u eij are independently distributed with unit variance. As a result, by transforming the model in (1) to that shown in (3), we have reduced the multivariate generalized nonlinear least-squares system estimator to a univariate nonlinear least-squares problem; and the same parameter estimation technique used by nl can be used here. See [R] nl for the details. Moreover, because the u eij all have variance 1, the final scaled RSS reported by nlsur is equal to N M . nlsur — Estimation of nonlinear systems of equations 1521 b of Σ. nlsur first sets Σ b = I. Although To make the estimator feasible, we require an estimate Σ b NLS , is consistent. If the nls option is specified, estimation is not efficient, the resulting estimate, β complete. Otherwise, the residuals b NLS ) b i = yi − f (xi , β u are calculated and used to compute N X b= 1 b 0i u bi u Σ N i=1 b is then obtained. b in hand, a new estimate β With Σ b is used to recompute the residuals and obtain a new If the ifgnls option is specified, the new β b b estimate of Σ, from which β can then be reestimated. Iterations stop when the relative change in b is less than eps(), the relative change in Σ b is less than ifgnlseps(), or if ifgnlsiterate() β iterations have been performed. If the vce(robust) and vce(cluster clustvar) options were not specified, then N X b) = V (β −1 b X0i Σ !−1 Xi i=1 where the M × k matrix Xi has typical element Xist , the derivative of the sth element of f with b As a practical matter, once the model is respect to the tth element of β, evaluated at xi and β. written in the form of (3), the variance–covariance matrix can be calculated via a Gauss–Newton regression; see Davidson and MacKinnon (1993, chap. 6). If robust is specified, then b) = V R (β N X b −1 Xi X0i Σ !−1 i=1 N X b −1 u b −1 Xi b 0i u biΣ X0i Σ i=1 N X b −1 Xi X0i Σ !−1 i=1 The cluster–robust variance matrix is b) = VC (β N X −1 b X0i Σ !−1 Xi NC X wc0 wc c=1 i=1 N X −1 b X0i Σ !−1 Xi i=1 where NC is the number of clusters and wc = X −1 0 bj u b X0j Σ j∈Ck with Ck denoting the set of observations in the k th cluster. In evaluating these formulas, we use the b That is, we do not recalculate Σ b used in calculating the final estimate of β. b after we value of Σ b obtain the final value of β. 1522 nlsur — Estimation of nonlinear systems of equations The RSS for the j th equation, RSSj , is RSSj = N X 2 (b yij − yij ) i=1 where ybij is the predicted value of the ith observation on the j th dependent variable; the total sum of squares (TSS) for the j th equation, TSSj , is TSSj = N X 2 (yij − ȳj ) i=1 if there is a constant term in the j th equation, where ȳj is the sample mean of the j th dependent variable, and N X 2 TSSj = yij i=1 if there is no constant term in the j th equation; and the model sum of squares (MSS) for the j th equation, MSSj , is TSSj − RSSj . The R2 for the j th equation is MSSj /TSSj . If an equation does not have a constant term, then the reported R2 for that equation is “uncentered” and based on the latter definition of TSSj . b ), the log Under the assumption that the ui are independent and identically distributed N (0, Σ likelihood for the model is lnL = − N MN b {1 + ln(2π)} − ln Σ 2 2 The log likelihood is reported only when the ifgnls option is specified. References Canette, I. 2011. A tip to debug your nl/nlsur function evaluator program. The Stata Blog: Not Elsewhere Classified. http://blog.stata.com/2011/12/05/a-tip-to-debug-your-nlnlsur-function-evaluator-program/. Davidson, R., and J. G. MacKinnon. 
1993. Estimation and Inference in Econometrics. New York: Oxford University Press. . 2004. Econometric Theory and Methods. New York: Oxford University Press. Deaton, A. S., and J. Muellbauer. 1980. An almost ideal demand system. American Economic Review 70: 312–326. Greene, W. H. 1997. Econometric Analysis. 3rd ed. Upper Saddle River, NJ: Prentice Hall. . 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. Poi, B. P. 2012. Easy demand-system estimation with quaids. Stata Journal 12: 433–446. Zellner, A. 1962. An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. Journal of the American Statistical Association 57: 348–368. . 1963. Estimators for seemingly unrelated regression equations: Some exact finite sample results. Journal of the American Statistical Association 58: 977–992. Zellner, A., and D. S. Huang. 1962. Further properties of efficient estimators for seemingly unrelated regression equations. International Economic Review 3: 300–313. nlsur — Estimation of nonlinear systems of equations Also see [R] nlsur postestimation — Postestimation tools for nlsur [R] nl — Nonlinear least-squares estimation [R] gmm — Generalized method of moments estimation [R] ml — Maximum likelihood estimation [R] mlexp — Maximum likelihood estimation of user-specified expressions [R] reg3 — Three-stage estimation for systems of simultaneous equations [R] sureg — Zellner’s seemingly unrelated regression [U] 20 Estimation and postestimation commands 1523 Title nlsur postestimation — Postestimation tools for nlsur Description Remarks and examples Syntax for predict Also see Menu for predict Options for predict Description The following postestimation commands are available after nlsur: Command Description estat ic estat summarize estat vce estimates forecast lincom Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest margins1 marginsplot nlcom predict predictnl test testnl 1 You must specify the variables() option with nlsur. Syntax for predict predict type newvar if in , equation(#eqno) yhat residuals These statistics are available both in and out of sample; type predict for the estimation sample. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. 1524 . . . if e(sample) . . . if wanted only nlsur postestimation — Postestimation tools for nlsur 1525 Options for predict Main equation(#eqno) specifies to which equation you are referring. equation(#1) would mean that the calculation is to be made for the first equation, equation(#2) would mean the second, and so on. If you do not specify equation(), results are the same as if you had specified equation(#1). yhat, the default, calculates the fitted values for the specified equation. 
residuals calculates the residuals for the specified equation. Remarks and examples Example 1 In example 2 of [R] nlsur, we fit a four-factor translog cost function to data for the U.S. economy. The own-price elasticity for a factor measures the percentage change in its usage as a result of a 1% increase in the factor’s price, assuming that output is held constant. For the translog production function, the own-price factor elasticities are ηi = δii + si (si − 1) si Here we compute the elasticity for capital at the sample mean of capital’s factor share. First, we use summarize to get the mean of s k and store that value in a scalar: . summarize s_k Variable Obs Mean s_k 25 . scalar kmean = r(mean) .053488 Std. Dev. .0044795 Min Max .04602 .06185 Now we can use nlcom to calculate the elasticity: . nlcom (([dkk]_cons + kmean*(kmean-1)) / kmean) _nl_1: ([dkk]_cons + kmean*(kmean-1)) / kmean Coef. _nl_1 -.3952986 Std. Err. .1083535 z -3.65 P>|z| 0.000 [95% Conf. Interval] -.6076676 -.1829295 If the price of capital increases by 1%, its usage will decrease by about 0.4%. To maintain its current level of output, a firm would increase its usage of other inputs to compensate for the lower capital usage. The standard error reported by nlcom reflects the sampling variance of the estimated parameter δc kk , but nlcom treats the sample mean of s k as a fixed parameter that does not contribute to the sampling variance of the estimated elasticity. 1526 nlsur postestimation — Postestimation tools for nlsur Also see [R] nlsur — Estimation of nonlinear systems of equations [U] 20 Estimation and postestimation commands Title nptrend — Test for trend across ordered groups Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Options Acknowledgments Syntax nptrend varname if in , by(groupvar) nodetail score(scorevar) Menu Statistics > Nonparametric analysis > Tests of hypotheses > Trend test across ordered groups Description nptrend performs a nonparametric test for trend across ordered groups. Options Main by(groupvar) is required; it specifies the group on which the data are to be ordered. nodetail suppresses the listing of group rank sums. score(scorevar) defines scores for groups. When it is not specified, the values of groupvar are used for the scores. Remarks and examples nptrend performs the nonparametric test for trend across ordered groups developed by Cuzick (1985), which is an extension of the Wilcoxon rank-sum test (see [R] ranksum). A correction for ties is incorporated into the test. nptrend is a useful adjunct to the Kruskal – Wallis test; see [R] kwallis. If your data are not grouped, you can test for trend with the signtest and spearman commands; see [R] signrank and [R] spearman. With signtest, you can perform the Cox and Stuart test, a sign test applied to differences between equally spaced observations of varname. With spearman, you can perform the Daniels test, a test of zero Spearman correlation between varname and a time index. See Conover (1999, 169–175, 323) for a discussion of these tests and their asymptotic relative efficiency. 1527 1528 nptrend — Test for trend across ordered groups Example 1 The following data (Altman 1991, 217) show ocular exposure to ultraviolet radiation for 32 pairs of sunglasses classified into three groups according to the amount of visible light transmitted. 
Group Transmission of visible light 1 2 < 25% 25 to 35% 3 > 35% Ocular exposure to ultraviolet radiation 1.4 0.9 2.6 0.8 1.4 1.0 2.8 1.7 1.4 1.1 2.8 1.7 1.6 1.1 3.2 1.7 2.3 1.2 3.5 3.4 2.3 1.2 1.5 1.9 2.2 2.6 2.6 4.3 5.1 7.1 8.9 13.5 Entering these data into Stata, we have . use http://www.stata-press.com/data/r13/sg . list, sep(6) group exposure 1 1 1 1 1 1 1.4 1.4 1.4 1.6 2.3 2.3 1. 2. 3. 4. 5. 6. 7. 2 .9 (output omitted ) 31. 32. 3 3 8.9 13.5 We use nptrend to test for a trend of (increasing) exposure across the three groups by typing . nptrend exposure, by(group) group 1 2 3 score 1 2 3 obs 6 18 8 sum of ranks 76 290 162 z = 1.52 Prob > |z| = 0.129 When the groups are given any equally spaced scores (such as −1, 0, 1), we will obtain the same answer as above. To illustrate the effect of changing scores, an analysis of these data with scores 1, 2, and 5 (admittedly not sensible here) produces . gen mysc = cond(group==3,5,group) . nptrend exposure, by(group) score(mysc) group 1 2 3 score 1 2 5 obs 6 18 8 sum of ranks 76 290 162 z = 1.46 Prob > |z| = 0.143 This example suggests that the analysis is not all that sensitive to the scores chosen. nptrend — Test for trend across ordered groups 1529 Technical note The grouping variable may be either a string variable or a numeric variable. If it is a string variable and no score variable is specified, the natural numbers 1, 2, 3, . . . are assigned to the groups in the sort order of the string variable. This may not always be what you expect. For example, the sort order of the strings “one”, “two”, “three” is “one”, “three”, “two”. Stored results nptrend stores the following in r(): Scalars r(N) r(p) number of observations two-sided p-value r(z) r(T) z statistic test statistic Methods and formulas nptrend is based on a method in Cuzick (1985). The following description of the statistic is from Altman (1991, 215–217). We have k groups of sample sizes ni (i = 1, . . . , k ). The groups are given scores, li , which reflect their ordering, P such as 1, 2, and 3. The scores do not have to be equally spaced, but they usually are. N = ni observations are ranked from 1 to N , and the sums of the ranks in each group, Ri , are obtained. L, the weighted sum of all the group scores, is L= k X li ni i=1 The statistic T is calculated as T = k X li Ri i=1 Under the null hypothesis, the expected value of T is E(T ) = 0.5(N + 1)L, and its standard error is v u k uN + 1 X se(T ) = t N li2 ni − L2 12 i=1 so that the test statistic, z , is given by z = { T − E(T ) }/se(T ), which has an approximately standard normal distribution when the null hypothesis of no trend is true. e be the number of unique values of The correction for ties affects the standard error of T . Let N e the variable being tested (N ≤ N ), and let tj be the number of times the j th unique value of the variable appears in the data. Define PNe a= The corrected standard error of T is se(T e )= 2 j=1 tj (tj − 1) N (N 2 − 1) √ 1 − a se(T ). 1530 nptrend — Test for trend across ordered groups Acknowledgments nptrend was written by K. A. Stepniewska and D. G. Altman (1992) of the Cancer Research UK. References Altman, D. G. 1991. Practical Statistics for Medical Research. London: Chapman & Hall/CRC. Conover, W. J. 1999. Practical Nonparametric Statistics. 3rd ed. New York: Wiley. Cuzick, J. 1985. A Wilcoxon-type test for trend. Statistics in Medicine 4: 87–90. Sasieni, P. D. 1996. snp12: Stratified test for trend across ordered groups. Stata Technical Bulletin 33: 24–27. 
Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 196–200. College Station, TX: Stata Press. Sasieni, P. D., K. A. Stepniewska, and D. G. Altman. 1996. snp11: Test for trend across ordered groups revisited. Stata Technical Bulletin 32: 27–29. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 193–196. College Station, TX: Stata Press. Stepniewska, K. A., and D. G. Altman. 1992. snp4: Non-parametric test for trend across ordered groups. Stata Technical Bulletin 9: 21–22. Reprinted in Stata Technical Bulletin Reprints, vol. 2, p. 169. College Station, TX: Stata Press. Also see [R] kwallis — Kruskal – Wallis equality-of-populations rank test [R] signrank — Equality tests on matched data [R] spearman — Spearman’s and Kendall’s correlations [R] symmetry — Symmetry and marginal homogeneity tests [ST] epitab — Tables for epidemiologists [ST] strate — Tabulate failure rates and rate ratios Title ologit — Ordered logistic regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax ologit depvar indepvars if in weight , options Description options Model offset(varname) include varname in model with coefficient constrained to 1 constraints(constraints) apply specified linear constraints collinear keep collinear variables SE/Robust vcetype may be oim, robust, cluster clustvar, bootstrap, or jackknife vce(vcetype) Reporting set confidence level; default is level(95) report odds ratios do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling level(#) or nocnsreport display options Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics indepvars may contain factor variables; see [U] 11.4.3 Factor variables. depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, mfp, mi estimate, nestreg, rolling, statsby, stepwise, and svy are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce() and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Ordinal outcomes > Ordered logistic regression 1531 1532 ologit — Ordered logistic regression Description ologit fits ordered logit models of ordinal variable depvar on the independent variables indepvars. The actual values taken on by the dependent variable are irrelevant, except that larger values are assumed to correspond to “higher” outcomes. See [R] logistic for a list of related estimation commands. Options Model offset(varname), constraints(constraints), collinear; see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. 
or reports the estimated coefficients transformed to odds ratios, that is, eb rather than b. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated. or may be specified at estimation or when replaying previously estimated results. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. The following option is available with ologit but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Ordered logit models are used to estimate relationships between an ordinal dependent variable and a set of independent variables. An ordinal variable is a variable that is categorical and ordered, for instance, “poor”, “good”, and “excellent”, which might indicate a person’s current health status or the repair record of a car. If there are only two outcomes, see [R] logistic, [R] logit, and [R] probit. This entry is concerned only with more than two outcomes. If the outcomes cannot be ordered (for example, residency in the north, east, south, or west), see [R] mlogit. This entry is concerned only with models in which the outcomes can be ordered. ologit — Ordered logistic regression 1533 In ordered logit, an underlying score is estimated as a linear function of the independent variables and a set of cutpoints. The probability of observing outcome i corresponds to the probability that the estimated linear function, plus random error, is within the range of the cutpoints estimated for the outcome: Pr(outcomej = i) = Pr(κi−1 < β1 x1j + β2 x2j + · · · + βk xkj + uj ≤ κi ) uj is assumed to be logistically distributed in ordered logit. In either case, we estimate the coefficients β1 , β2 , . . . , βk together with the cutpoints κ1 , κ2 , . . . , κk−1 , where k is the number of possible outcomes. κ0 is taken as −∞, and κk is taken as +∞. All of this is a direct generalization of the ordinary two-outcome logit model. Example 1 We wish to analyze the 1977 repair records of 66 foreign and domestic cars. The data are a variation of the automobile dataset described in [U] 1.2.2 Example datasets. The 1977 repair records, like those in 1978, take on values “Poor”, “Fair”, “Average”, “Good”, and “Excellent”. Here is a cross-tabulation of the data: . use http://www.stata-press.com/data/r13/fullauto (Automobile Models) . tabulate rep77 foreign, chi2 Repair Record Foreign 1977 Domestic Foreign Poor Fair Average Good Excellent 2 10 20 13 0 Total 45 Pearson chi2(4) = Total 1 1 7 7 5 3 11 27 20 5 21 13.8619 66 Pr = 0.008 Although it appears that foreign takes on the values Domestic and Foreign, it is actually a numeric variable taking on the values 0 and 1. Similarly, rep77 takes on the values 1, 2, 3, 4, and 5, corresponding to Poor, Fair, and so on. The more meaningful words appear because we have attached value labels to the data; see [U] 12.6.3 Value labels. Because the chi-squared value is significant, we could claim that there is a relationship between foreign and rep77. 
Literally, however, we can only claim that the distributions are different; the chi-squared test is not directional. One way to model these data is to model the categorization that took place when the data were created. Cars have a true frequency of repair, which we will assume is given by Sj = β foreignj + uj , and a car is categorized as “poor” if Sj ≤ κ0 , as “fair” if κ0 < Sj ≤ κ1 , and so on: 1534 ologit — Ordered logistic regression . ologit rep77 Iteration 0: Iteration 1: Iteration 2: Iteration 3: Iteration 4: foreign log likelihood log likelihood log likelihood log likelihood log likelihood = = = = = -89.895098 -85.951765 -85.908227 -85.908161 -85.908161 Ordered logistic regression Number of obs LR chi2(1) Prob > chi2 Pseudo R2 Log likelihood = -85.908161 rep77 Coef. foreign /cut1 /cut2 /cut3 /cut4 = = = = 66 7.97 0.0047 0.0444 Std. Err. z P>|z| [95% Conf. Interval] 1.455878 .5308951 2.74 0.006 .4153425 2.496413 -2.765562 -.9963603 .9426153 3.123351 .5988208 .3217706 .3136398 .5423257 -3.939229 -1.627019 .3278925 2.060412 -1.591895 -.3657016 1.557338 4.18629 Our model is Sj = 1.46 foreignj + uj ; the expected value for foreign cars is 1.46 and, for domestic cars, 0; foreign cars have better repair records. The estimated cutpoints tell us how to interpret the score. For a foreign car, the probability of a poor record is the probability that 1.46 + uj ≤ −2.77, or equivalently, uj ≤ −4.23. Making this calculation requires familiarity with the logistic distribution: the probability is 1/(1 + e4.23 ) = 0.014. On the other hand, for domestic cars, the probability of a poor record is the probability uj ≤ −2.77, which is 0.059. This, it seems to us, is a far more reasonable prediction than we would have made based on the table alone. The table showed that 2 of 45 domestic cars had poor records, whereas 1 of 21 foreign cars had poor records — corresponding to probabilities 2/45 = 0.044 and 1/21 = 0.048. The predictions from our model imposed a smoothness assumption — foreign cars should not, overall, have better repair records without the difference revealing itself in each category. In our data, the fractions of foreign and domestic cars in the poor category are virtually identical only because of the randomness associated with small samples. Thus if we were asked to predict the true fractions of foreign and domestic cars that would be classified in the various categories, we would choose the numbers implied by the ordered logit model: tabulate Domestic Foreign Poor Fair Average Good Excellent 0.044 0.222 0.444 0.289 0.000 0.048 0.048 0.333 0.333 0.238 Domestic 0.059 0.210 0.450 0.238 0.043 logit Foreign 0.014 0.065 0.295 0.467 0.159 See [R] ologit postestimation for a more complete explanation of how to generate predictions from an ordered logit model. ologit — Ordered logistic regression 1535 Technical note Here ordered logit provides an alternative to ordinary two-outcome logistic models with an arbitrary dichotomization, which might otherwise have been tempting. We could, for instance, have summarized these data by converting the five-outcome rep77 variable to a two-outcome variable, combining cars in the average, fair, and poor categories to make one outcome and combining cars in the good and excellent categories to make the second. Another even less appealing alternative would have been to use ordinary regression, arbitrarily labeling “excellent” as 5, “good” as 4, and so on. 
The problem is that with different but equally valid labelings (say, 10 for “excellent”), we would obtain different estimates. We would have no way of choosing one metric over another. That assertion is not, however, true of ologit. The actual values used to label the categories make no difference other than through the order they imply. In fact, our labeling was 5 for “excellent”, 4 for “good”, and so on. The words “excellent” and “good” appear in our output because we attached a value label to the variables; see [U] 12.6.3 Value labels. If we were to now go back and type replace rep77=10 if rep77==5, changing all the 5s to 10s, we would still obtain the same results when we refit our model. Example 2 In the example above, we used ordered logit as a way to model a table. We are not, however, limited to including only one explanatory variable or to including only categorical variables. We can explore the relationship of rep77 with any of the variables in our data. We might, for instance, model rep77 not only in terms of the origin of manufacture, but also including length (a proxy for size) and mpg: . ologit rep77 foreign length Iteration 0: log likelihood Iteration 1: log likelihood Iteration 2: log likelihood Iteration 3: log likelihood Iteration 4: log likelihood Ordered logistic regression mpg = -89.895098 = -78.775147 = -78.254294 = -78.250719 = -78.250719 Number of obs LR chi2(3) Prob > chi2 Pseudo R2 Log likelihood = -78.250719 rep77 Coef. foreign length mpg /cut1 /cut2 /cut3 /cut4 = = = = 66 23.29 0.0000 0.1295 Std. Err. z P>|z| [95% Conf. Interval] 2.896807 .0828275 .2307677 .7906411 .02272 .0704548 3.66 3.65 3.28 0.000 0.000 0.001 1.347179 .0382972 .0926788 4.446435 .1273579 .3688566 17.92748 19.86506 22.10331 24.69213 5.551191 5.59648 5.708936 5.890754 7.047344 8.896161 10.914 13.14647 28.80761 30.83396 33.29262 36.2378 foreign still plays a role—and an even larger role than previously. We find that larger cars tend to have better repair records, as do cars with better mileage ratings. 1536 ologit — Ordered logistic regression Stored results ologit stores the following in e(): Scalars e(N) e(N cd) e(k cat) e(k) e(k aux) e(k eq) e(k eq model) e(k dv) e(df m) e(r2 p) e(ll) e(ll 0) e(N clust) e(chi2) e(p) e(rank) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(clustvar) e(offset) e(chi2type) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(ilog) e(gradient) e(cat) e(V) e(V modelbased) Functions e(sample) number of observations number of completely determined observations number of categories number of parameters number of auxiliary parameters number of equations in e(b) number of equations in overall model test number of dependent variables model degrees of freedom pseudo-R-squared log likelihood log likelihood, constant-only model number of clusters χ2 significance of model test rank of e(V) number of iterations return code 1 if converged, 0 otherwise ologit command as typed name of dependent variable weight type weight expression title in estimation output name of cluster variable linear offset variable Wald or LR; type of model χ2 test vcetype specified in vce() title used to label Std. Err. 
type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector category values variance–covariance matrix of the estimators model-based variance marks estimation sample ologit — Ordered logistic regression 1537 Methods and formulas See Long and Freese (2014, chap. 7) for a discussion of models for ordinal outcomes and examples that use Stata. Cameron and Trivedi (2005, chap. 15) describe multinomial models, including the model fit by ologit. When you have a qualitative dependent variable, several estimation procedures are available. A popular choice is multinomial logistic regression (see [R] mlogit), but if you use this procedure when the response variable is ordinal, you are discarding information because multinomial logit ignores the ordered aspect of the outcome. Ordered logit and probit models provide a means to exploit the ordering information. There is more than one “ordered logit” model. The model fit by ologit, which we will call the ordered logit model, is also known as the proportional odds model. Another popular choice, not fit by ologit, is known as the stereotype model; see [R] slogit. All ordered logit models have been derived by starting with a binary logit/probit model and generalizing it to allow for more than two outcomes. The proportional-odds ordered logit model is so called because, if we consider the odds odds(k) = P (Y ≤ k)/P (Y > k), then odds(k1 ) and odds(k2 ) have the same ratio for all independent variable combinations. The model is based on the principle that the only effect of combining adjoining categories in ordered categorical regression problems should be a loss of efficiency in estimating the regression parameters (McCullagh 1980). This model was also described by McKelvey and Zavoina (1975) and, previously, by Aitchison and Silvey (1957) in a different algebraic form. Brant (1990) offers a set of diagnostics for the model. Peterson and Harrell (1990) suggest a model that allows nonproportional odds for a subset of the explanatory variables. ologit does not allow this, but a model similar to this was implemented by Fu (1998). The stereotype model rejects the principle on which the ordered logit model is based. Anderson (1984) argues that there are two distinct types of ordered categorical variables: “grouped continuous”, such as income, where the “type a” model applies; and “assessed”, such as extent of pain relief, where the stereotype model applies. Greenland (1985) independently developed the same model. The stereotype model starts with a multinomial logistic regression model and imposes constraints on this model. Goodness of fit for ologit can be evaluated by comparing the likelihood value with that obtained by fitting the model with mlogit. Let lnL1 be the log-likelihood value reported by ologit, and let lnL0 be the log-likelihood value reported by mlogit. If there are p independent variables (excluding the constant) and k categories, mlogit will estimate p(k − 1) additional parameters. We can then perform a “likelihood-ratio test”, that is, calculate −2( lnL1 − lnL0 ), and compare it with χ2 p(k − 2) . This test is suggestive only because the ordered logit model is not nested within the multinomial logit model. 
A large value of −2( lnL1 − lnL0 ) should, however, be taken as evidence of poorness of fit. Marginally large values, on the other hand, should not be taken too seriously. The coefficients and cutpoints are estimated using maximum likelihood as described in [R] maximize. In our parameterization, no constant appears, because the effect is absorbed into the cutpoints. ologit and oprobit begin by tabulating the dependent variable. Category i = 1 is defined as the minimum value of the variable, i = 2 as the next ordered value, and so on, for the empirically determined k categories. 1538 ologit — Ordered logistic regression The probability of a given observation for ordered logit is pij = Pr(yj = i) = Pr κi−1 < xj β + u ≤ κi = 1 1 − 1 + exp(−κi + xj β) 1 + exp(−κi−1 + xj β) κ0 is defined as −∞ and κk as +∞. For ordered probit, the probability of a given observation is pij = Pr(yj = i) = Pr κi−1 < xj β + u ≤ κi = Φ κi − xj β − Φ κi−1 − xj β where Φ(·) is the standard normal cumulative distribution function. The log likelihood is lnL = N X wj j=1 k X Ii (yj ) lnpij i=1 where wj is an optional weight and ( Ii (yj ) = 1, if yj = i 0, otherwise ologit and oprobit support the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. These commands also support estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. References Aitchison, J., and S. D. Silvey. 1957. The generalization of probit analysis to the case of multiple responses. Biometrika 44: 131–140. Anderson, J. A. 1984. Regression and ordered categorical variables (with discussion). Journal of the Royal Statistical Society, Series B 46: 1–30. Brant, R. 1990. Assessing proportionality in the proportional odds model for ordinal logistic regression. Biometrics 46: 1171–1178. Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. New York: Cambridge University Press. Fu, V. K. 1998. sg88: Estimating generalized ordered logit models. Stata Technical Bulletin 44: 27–30. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 160–164. College Station, TX: Stata Press. Goldstein, R. 1997. sg59: Index of ordinal variation and Neyman–Barton GOF. Stata Technical Bulletin 33: 10–12. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 145–147. College Station, TX: Stata Press. Greenland, S. 1985. An application of logistic models to the analysis of ordinal responses. Biometrical Journal 27: 189–197. ologit — Ordered logistic regression 1539 Kleinbaum, D. G., and M. Klein. 2010. Logistic Regression: A Self-Learning Text. 3rd ed. New York: Springer. Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage. Long, J. S., and J. Freese. 2014. Regression Models for Categorical Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata Press. Lunt, M. 2001. sg163: Stereotype ordinal regression. Stata Technical Bulletin 61: 12–18. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 298–307. College Station, TX: Stata Press. McCullagh, P. 1977. A logistic model for paired comparisons with ordered categorical data. Biometrika 64: 449–453. . 1980. Regression models for ordinal data (with discussion). Journal of the Royal Statistical Society, Series B 42: 109–142. McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. 2nd ed. 
London: Chapman & Hall/CRC. McKelvey, R. D., and W. Zavoina. 1975. A statistical model for the analysis of ordinal level dependent variables. Journal of Mathematical Sociology 4: 103–120. Miranda, A., and S. Rabe-Hesketh. 2006. Maximum likelihood estimation of endogenous switching and sample selection models for binary, ordinal, and count variables. Stata Journal 6: 285–308. Peterson, B., and F. E. Harrell, Jr. 1990. Partial proportional odds models for ordinal response variables. Applied Statistics 39: 205–217. Williams, R. 2006. Generalized ordered logit/partial proportional odds models for ordinal dependent variables. Stata Journal 6: 58–82. . 2010. Fitting heterogeneous choice models with oglm. Stata Journal 10: 540–567. Wolfe, R. 1998. sg86: Continuation-ratio models for ordinal response data. Stata Technical Bulletin 44: 18–21. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 149–153. College Station, TX: Stata Press. Wolfe, R., and W. W. Gould. 1998. sg76: An approximate likelihood-ratio test for ordinal response models. Stata Technical Bulletin 42: 24–27. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 199–204. College Station, TX: Stata Press. Xu, J., and J. S. Long. 2005. Confidence intervals for predicted outcomes in regression models for categorical outcomes. Stata Journal 5: 537–559. Also see [R] ologit postestimation — Postestimation tools for ologit [R] clogit — Conditional (fixed-effects) logistic regression [R] logistic — Logistic regression, reporting odds ratios [R] logit — Logistic regression, reporting coefficients [R] mlogit — Multinomial (polytomous) logistic regression [R] oprobit — Ordered probit regression [R] rologit — Rank-ordered logistic regression [R] slogit — Stereotype logistic regression [ME] meologit — Multilevel mixed-effects ordered logistic regression [MI] estimation — Estimation commands for use with mi estimate [SVY] svy estimation — Estimation commands for survey data [XT] xtologit — Random-effects ordered logistic models [U] 20 Estimation and postestimation commands Title ologit postestimation — Postestimation tools for ologit Description Remarks and examples Syntax for predict Also see Menu for predict Options for predict Description The following postestimation commands are available after ologit: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) 
point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 2 forecast is not appropriate with mi or svy estimation results. lrtest is not appropriate with svy estimation results. 1540 ologit postestimation — Postestimation tools for ologit 1541 Syntax for predict predict type predict type stub* | newvar | newvarlist outcome(outcome) nooffset stub* | newvarlist if if in , statistic in , scores Description statistic Main predicted probabilities; the default linear prediction standard error of the linear prediction pr xb stdp If you do not specify outcome(), pr (with one new variable specified) assumes outcome(#1). You specify one or k new variables with pr, where k is the number of outcomes. You specify one new variable with xb and stdp. These statistics are available both in and out of sample; type predict . . . if e(sample) . . . if wanted only for the estimation sample. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main pr, the default, calculates the predicted probabilities. If you do not also specify the outcome() option, you specify k new variables, where k is the number of categories of the dependent variable. Say that you fit a model by typing ologit result x1 x2, and result takes on three values. Then you could type predict p1 p2 p3 to obtain all three predicted probabilities. If you specify the outcome() option, you must specify one new variable. Say that result takes on the values 1, 2, and 3. Typing predict p1, outcome(1) would produce the same p1. xb calculates the linear prediction. You specify one new variable, for example, predict linear, xb. The linear prediction is defined, ignoring the contribution of the estimated cutpoints. stdp calculates the standard error of the linear prediction. You specify one new variable, for example, predict se, stdp. outcome(outcome) specifies for which outcome the predicted probabilities are to be calculated. outcome() should contain either one value of the dependent variable or one of #1, #2, . . . , with #1 meaning the first category of the dependent variable, #2 meaning the second category, etc. nooffset is relevant only if you specified offset(varname) for ologit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xj b rather than as xj b + offsetj . 1542 ologit postestimation — Postestimation tools for ologit scores calculates equation-level score variables. The number of score variables created will equal the number of outcomes in the model. If the number of outcomes in the model was k , then the first new variable will contain ∂ ln L/∂(xj b); the second new variable will contain ∂ ln L/∂κ1 ; the third new variable will contain ∂ ln L/∂κ2 ; ... and the k th new variable will contain ∂ ln L/∂κk−1 , where κi refers to the ith cutpoint. Remarks and examples See [U] 20 Estimation and postestimation commands for instructions on obtaining the variance– covariance matrix of the estimators, predicted values, and hypothesis tests. Also see [R] lrtest for performing likelihood-ratio tests. 
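Methods and formulas in [R] ologit describes a suggestive goodness-of-fit comparison between ologit and mlogit based on −2(lnL1 − lnL0). The following is a minimal sketch of that calculation using the stored log likelihoods and model degrees of freedom; the comparison is only suggestive because the two models are not nested, and the commands below simply refit the models from this entry on the fullauto data.

. use http://www.stata-press.com/data/r13/fullauto, clear
. quietly ologit rep77 foreign length mpg
. scalar ll1 = e(ll)          // lnL1 from ologit
. scalar df1 = e(df_m)
. quietly mlogit rep77 foreign length mpg
. scalar ll0 = e(ll)          // lnL0 from mlogit
. scalar df0 = e(df_m)
. display "-2*(lnL1-lnL0) = " -2*(ll1-ll0) "   df = " df0-df1
. display "approximate p-value = " chi2tail(df0-df1, -2*(ll1-ll0))

A large value would suggest that the ordered logit model fits poorly; marginally large values should not be taken too seriously.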
Example 1 In example 2 of [R] ologit, we fit the model ologit rep77 foreign length mpg. The predict command can be used to obtain the predicted probabilities. We type predict followed by the names of the new variables to hold the predicted probabilities, ordering the names from low to high. In our data, the lowest outcome is “poor”, and the highest is “excellent”. We have five categories, so we must type five names following predict; the choice of names is up to us: . use http://www.stata-press.com/data/r13/fullauto (Automobile Models) . ologit rep77 foreign length mpg (output omitted ) . predict poor fair avg good exc (option pr assumed; predicted probabilities) . list exc good make model rep78 if rep77>=., sep(4) divider exc good make model rep78 3. 10. 32. 44. .0033341 .0098392 .0023406 .015697 .0393056 .1070041 .0279497 .1594413 AMC Buick Ford Merc. Spirit Opel Fiesta Monarch . . Good Average 53. 56. 57. 63. .065272 .005187 .0261461 .0294961 .4165188 .059727 .2371826 .2585825 Peugeot Plym. Plym. Pont. 604 Horizon Sapporo Phoenix . Average . . The eight cars listed were introduced after 1977, so they do not have 1977 repair records in our data. We predicted what their 1977 repair records might have been using the fitted model. We see that, based on its characteristics, the Peugeot 604 had about a 41.65 + 6.53 ≈ 48.2% chance of a good or excellent repair record. The Ford Fiesta, which had only a 3% chance of a good or excellent repair record, in fact, had a good record when it was introduced in the following year. ologit postestimation — Postestimation tools for ologit 1543 Technical note For ordered logit, predict, xb produces Sj = x1j β1 + x2j β2 + · · · + xkj βk . The ordered-logit predictions are then the probability that Sj + uj lies between a pair of cutpoints, κi−1 and κi . Some handy formulas are Pr(Sj + uj < κ) = 1/(1 + eSj −κ ) Pr(Sj + uj > κ) = 1 − 1/(1 + eSj −κ ) Pr(κ1 < Sj + uj < κ2 ) = 1/(1 + eSj −κ2 ) − 1/(1 + eSj −κ1 ) Rather than using predict directly, we could calculate the predicted probabilities by hand. If we wished to obtain the predicted probability that the repair record is excellent and the probability that it is good, we look back at ologit’s output to obtain the cutpoints. We find that “good” corresponds to the interval /cut3 < Sj + u < /cut4 and “excellent” to the interval Sj + u > /cut4: . predict score, xb . generate probgood = 1/(1+exp(score-_b[/cut4])) - 1/(1+exp(score-_b[/cut3])) . generate probexc = 1 - 1/(1+exp(score-_b[/cut4])) The results of our calculation will be the same as those produced in the previous example. We refer to the estimated cutpoints just as we would any coefficient, so b[/cut3] refers to the value of the /cut3 coefficient; see [U] 13.5 Accessing coefficients and standard errors. 
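To verify that the by-hand calculation reproduces what predict computed, the two sets of variables can be compared directly. A minimal check, assuming the variables good and exc created by predict in example 1 are still in memory:

. summarize good probgood exc probexc
. assert reldif(good, probgood) < 1e-6
. assert reldif(exc, probexc) < 1e-6

assert exits with an error if any observation differs by more than the stated tolerance.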
Also see [R] ologit — Ordered logistic regression [U] 20 Estimation and postestimation commands Title oneway — One-way analysis of variance Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax oneway response var factor var if in weight , options Description options Main bonferroni scheffe sidak tabulate no means no standard no freq no obs noanova nolabel wrap missing Bonferroni multiple-comparison test Scheffé multiple-comparison test Šidák multiple-comparison test produce summary table include or suppress means; default is means include or suppress standard deviations; default is standard include or suppress frequencies; default is freq include or suppress number of obs; default is obs if data are weighted suppress the ANOVA table show numeric codes, not labels do not break wide tables treat missing values as categories by is allowed; see [D] by. aweights and fweights are allowed; see [U] 11.1.6 weight. Menu Statistics > Linear models and related > ANOVA/MANOVA > One-way ANOVA Description The oneway command reports one-way analysis-of-variance (ANOVA) models and performs multiplecomparison tests. If you wish to fit more complicated ANOVA layouts or wish to fit analysis-of-covariance (ANCOVA) models, see [R] anova. See [D] encode for examples of fitting ANOVA models on string variables. See [R] loneway for an alternative oneway command with slightly different features. 1544 oneway — One-way analysis of variance 1545 Options Main bonferroni reports the results of a Bonferroni multiple-comparison test. scheffe reports the results of a Scheffé multiple-comparison test. sidak reports the results of a Šidák multiple-comparison test. tabulate produces a table of summary statistics of the response var by levels of the factor var. The table includes the mean, standard deviation, frequency, and, if the data are weighted, the number of observations. Individual elements of the table may be included or suppressed by using the [no]means, [no]standard, [no]freq, and [no]obs options. For example, typing oneway response factor, tabulate means standard produces a summary table that contains only the means and standard deviations. You could achieve the same result by typing oneway response factor, tabulate nofreq [no]means includes or suppresses only the means from the table produced by the tabulate option. See tabulate above. [no]standard includes or suppresses only the standard deviations from the table produced by the tabulate option. See tabulate above. [no]freq includes or suppresses only the frequencies from the table produced by the tabulate option. See tabulate above. [no]obs includes or suppresses only the reported number of observations from the table produced by the tabulate option. If the data are not weighted, only the frequency is reported. If the data are weighted, the frequency refers to the sum of the weights. See tabulate above. noanova suppresses the display of the ANOVA table. nolabel causes the numeric codes to be displayed rather than the value labels in the ANOVA and multiple-comparison test tables. wrap requests that Stata not break up wide tables to make them more readable. missing requests that missing values of factor var be treated as a category rather than as observations to be omitted from the analysis. 
Remarks and examples Remarks are presented under the following headings: Introduction Obtaining observed means Multiple-comparison tests Weighted data Video example Introduction The oneway command reports one-way ANOVA models. To perform a one-way layout of a variable called endog on exog, type oneway endog exog. 1546 oneway — One-way analysis of variance Example 1 We run an experiment varying the amount of fertilizer used in growing apple trees. We test four concentrations, using each concentration in three groves of 12 trees each. Later in the year, we measure the average weight of the fruit. If all had gone well, we would have had 3 observations on the average weight for each of the four concentrations. Instead, two of the groves were mistakenly leveled by a confused man on a large bulldozer. We are left with the following dataset: . use http://www.stata-press.com/data/r13/apple (Apple trees) . describe Contains data from http://www.stata-press.com/data/r13/apple.dta obs: 10 Apple trees vars: 2 16 Jan 2013 11:23 size: 100 variable name storage type treatment weight display format int double value label variable label %8.0g %10.0g Fertilizer Average weight in grams Sorted by: . list, abbreviate(10) treatment weight 1. 2. 3. 4. 5. 1 1 1 2 2 117.5 113.8 104.4 48.9 50.4 6. 7. 8. 9. 10. 2 3 3 4 4 58.9 70.4 86.9 87.7 67.3 To obtain the one-way ANOVA results, we type . oneway weight treatment Analysis of Variance Source SS df MS Between groups Within groups Total 5295.54433 493.591667 3 6 5789.136 9 Bartlett’s test for equal variances: F 1765.18144 82.2652778 21.46 Prob > F 0.0013 643.237333 chi2(3) = 1.3900 Prob>chi2 = 0.708 We find significant (at better than the 1% level) differences among the four concentrations. oneway — One-way analysis of variance 1547 Technical note Rather than using the oneway command, we could have performed this analysis by using anova. Example 1 in [R] anova repeats this same analysis. You may wish to compare the output. You will find the oneway command quicker than the anova command, and, as you will learn, oneway allows you to perform multiple-comparison tests. On the other hand, anova will let you generate predictions, examine the covariance matrix of the estimators, and perform more general hypothesis tests. Technical note Although the output is a usual ANOVA table, let’s run through it anyway. The between-group sum of squares for the model is 5295.5 with 3 degrees of freedom, resulting in a mean square of 5295.5/3 ≈ 1765.2. The corresponding F statistic is 21.46 and has a significance level of 0.0013. Thus the model appears to be significant at the 0.13% level. The second line summarizes the within-group (residual) variation. The within-group sum of squares is 493.59 with 6 degrees of freedom, resulting in a mean squared error of 82.27. The between- and residual-group variations sum to the total sum of squares (TSS), which is reported as 5789.1 in the last line of the table. This is the TSS of weight after removal of the mean. Similarly, the between plus residual degrees of freedom sum to the total degrees of freedom, 9. Remember that there are 10 observations. Subtracting 1 for the mean, we are left with 9 total degrees of freedom. At the bottom of the table, Bartlett’s test for equal variances is reported. The value of the statistic is 1.39. The corresponding significance level (χ2 with 3 degrees of freedom) is 0.708, so we cannot reject the assumption that the variances are homogeneous. 
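Every element of this table is also available in r() after oneway (see Stored results below), so the F statistic and its significance level can be reproduced by hand. A minimal sketch using the stored sums of squares and degrees of freedom:

. quietly oneway weight treatment
. display "F = " (r(mss)/r(df_m)) / (r(rss)/r(df_r))
. display "Prob > F = " Ftail(r(df_m), r(df_r), r(F))
. display "Prob > chi2 = " chi2tail(r(df_bart), r(chi2bart))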
Obtaining observed means Example 2 We typed oneway weight treatment to obtain an ANOVA table of weight of fruit by fertilizer concentration. Although we obtained the table, we obtained no information on which fertilizer seems to work the best. If we add the tabulate option, we obtain that additional information: . oneway weight treatment, tabulate Summary of Average weight in grams Fertilizer Mean Std. Dev. Freq. 1 2 3 4 Total Source Between groups Within groups 111.9 52.733333 78.65 77.5 6.7535176 5.3928966 11.667262 14.424978 3 3 2 2 80.62 25.362124 Analysis of Variance SS df MS 10 5295.54433 493.591667 3 6 1765.18144 82.2652778 F 21.46 Total 5789.136 9 643.237333 Bartlett’s test for equal variances: chi2(3) = 1.3900 Prob > F 0.0013 Prob>chi2 = 0.708 1548 oneway — One-way analysis of variance We find that the average weight was largest when we used fertilizer concentration 1. Multiple-comparison tests Example 3: Bonferroni multiple-comparison test oneway can also perform multiple-comparison tests using either Bonferroni, Scheffé, or Šidák normalizations. For instance, to obtain the Bonferroni multiple-comparison test, we specify the bonferroni option: . oneway weight treatment, bonferroni Analysis of Variance Source SS df MS Between groups Within groups 5295.54433 493.591667 3 6 1765.18144 82.2652778 F 21.46 Prob > F 0.0013 Total 5789.136 9 643.237333 Bartlett’s test for equal variances: chi2(3) = 1.3900 Prob>chi2 = 0.708 Comparison of Average weight in grams by Fertilizer (Bonferroni) Row MeanCol Mean 1 2 3 2 -59.1667 0.001 3 -33.25 0.042 25.9167 0.122 4 -34.4 0.036 24.7667 0.146 -1.15 1.000 The results of the Bonferroni test are presented as a matrix. The first entry, −59.17, represents the difference between fertilizer concentrations 2 and 1 (labeled “Row Mean - Col Mean” in the upper stub of the table). Remember that in the previous example we requested the tabulate option. Looking back, we find that the means of concentrations 1 and 2 are 111.90 and 52.73, respectively. Thus 52.73 − 111.90 = −59.17. Underneath that number is reported “0.001”. This is the Bonferroni-adjusted significance of the difference. The difference is significant at the 0.1% level. Looking down the column, we see that concentration 3 is also worse than concentration 1 (4.2% level), as is concentration 4 (3.6% level). On the basis of this evidence, we would use concentration 1 if we grew apple trees. Example 4: Scheffé multiple-comparison test We can just as easily obtain the Scheffé-adjusted significance levels. Rather than specifying the bonferroni option, we specify the scheffe option. oneway — One-way analysis of variance 1549 We will also add the noanova option to prevent Stata from redisplaying the ANOVA table: . oneway weight treatment, noanova scheffe Comparison of Average weight in grams by Fertilizer (Scheffe) Row MeanCol Mean 1 2 3 2 -59.1667 0.001 3 -33.25 0.039 25.9167 0.101 4 -34.4 0.034 24.7667 0.118 -1.15 0.999 The differences are the same as those we obtained in the Bonferroni output, but the significance levels are not. According to the Bonferroni-adjusted numbers, the significance of the difference between fertilizer concentrations 1 and 3 is 4.2%. The Scheffé-adjusted significance level is 3.9%. We will leave it to you to decide which results are more accurate. Example 5: Šidák multiple-comparison test Let’s conclude this example by obtaining the Šidák-adjusted multiple-comparison tests. 
We do this to illustrate Stata’s capabilities to calculate these results, because searching across adjustment methods until you find the results you want is not a valid technique for obtaining significance levels. . oneway weight treatment, noanova sidak Comparison of Average weight in grams by Fertilizer (Sidak) Row MeanCol Mean 1 2 3 2 -59.1667 0.001 3 -33.25 0.041 25.9167 0.116 4 -34.4 0.035 24.7667 0.137 -1.15 1.000 We find results that are similar to the Bonferroni-adjusted numbers. Henry Scheffé (1907–1977) was born in New York. He studied mathematics at the University of Wisconsin, gaining a doctorate with a dissertation on differential equations. He taught mathematics at Wisconsin, Oregon State University, and Reed College, but his interests changed to statistics and he joined Wilks at Princeton. After periods at Syracuse, UCLA, and Columbia, Scheffé settled in Berkeley from 1953. His research increasingly focused on linear models and particularly ANOVA, on which he produced a celebrated monograph. His death was the result of a bicycle accident. 1550 oneway — One-way analysis of variance Weighted data Example 6 oneway can work with both weighted and unweighted data. Let’s assume that we wish to perform a one-way layout of the death rate on the four census regions of the United States using state data. Our data contain three variables, drate (the death rate), region (the region), and pop (the population of the state). To fit the model, we type oneway drate region [weight=pop], although we typically abbreviate weight as w. We will also add the tabulate option to demonstrate how the table of summary statistics differs for weighted data: . use http://www.stata-press.com/data/r13/census8 (1980 Census data by state) . oneway drate region [w=pop], tabulate (analytic weights assumed) Census Summary of Death Rate region Mean Std. Dev. Freq. NE N Cntrl South West Total Source Between groups Within groups 97.15 88.10 87.05 75.65 49135283 58865670 74734029 43172490 9 12 16 13 87.34 10.43 2.259e+08 Analysis of Variance SS df MS 50 2360.92281 2974.09635 5.82 5.58 10.40 8.23 Obs. 3 46 786.974272 64.6542685 F Prob > F 12.17 Total 5335.01916 49 108.877942 Bartlett’s test for equal variances: chi2(3) = 5.4971 0.0000 Prob>chi2 = 0.139 When the data are weighted, the summary table has four columns rather than three. The column labeled “Freq.” reports the sum of the weights. The overall frequency is 2.259 × 108 , meaning that there are approximately 226 million people in the United States. The ANOVA table is appropriately weighted. Also see [U] 11.1.6 weight. Video example One-way ANOVA in Stata oneway — One-way analysis of variance 1551 Stored results oneway stores the following in r(): Scalars r(N) r(F) r(df r) r(mss) number of observations F statistic within-group degrees of freedom between-group sum of squares r(df m) r(rss) r(chi2bart) r(df bart) between-group degrees of freedom within-group sum of squares Bartlett’s χ2 Bartlett’s degrees of freedom Methods and formulas Methods and formulas are presented under the following headings: One-way analysis of variance Bartlett’s test Multiple-comparison tests One-way analysis of variance The model of one-way ANOVA is yij = µ + αi + ij for levels i = 1, . . . , k and observations j = 1, . . . , ni . Define y i as the (weighted) mean of yij over j and y as the overall (weighted) mean of yij . Define wij as the weight associated with yij , which P is 1 if the data are unweighted. 
wij is normalized to sum to n = n P Pi i if aweights are used and is otherwise not normalized. wi refers to j wij , and w refers to i wi . The between-group sum of squares is then S1 = X wi (y i − y)2 i The TSS is S= XX i wij (yij − y)2 j The within-group sum of squares is given by Se = S − S1 . s2e The between-group mean square is s21 = S1 /(k − 1), and the within-group mean square is = Se /(w − k). The test statistic is F = s21 /s2e . See, for instance, Snedecor and Cochran (1989). Bartlett’s test Bartlett’s test assumes that you have m independent, normal, random samples and tests the 2 hypothesis σ12 = σ22 = · · · = σm . The test statistic, M , is defined as M= P (T − m) lnσ b2 − (Ti − 1) lnσ bi2 nP o 1 1 1 1 + 3(m−1) − Ti −1 T −m 1552 oneway — One-way analysis of variance where there are T overall observations, Ti observations in the ith group, and (Ti − 1)b σi2 Ti X = (yij − y i )2 (T − m)b σ2 = j=1 m X (Ti − 1)b σi2 i=1 An approximate test of the homogeneity of variance is based on the statistic M with critical values obtained from the χ2 distribution of m − 1 degrees of freedom. See Bartlett (1937) or Draper and Smith (1998, 56–57). Multiple-comparison tests Let’s begin by reviewing the logic behind these adjustments. The “standard” t statistic for the comparison of two means is yi − yj t= q s n1i + n1j where s is the overall standard deviation, y i is the measured average of y in group i, and ni is the number of observations in the group. We perform hypothesis tests by calculating this t statistic. We simultaneously choose a critical level, α, and look up the t statistic corresponding to that level in a table. We reject the hypothesis if our calculated t exceeds the value we looked up. Alternatively, because we have a computer at our disposal, we calculate the significance level e corresponding to our calculated t statistic, and if e < α, we reject the hypothesis. This logic works well when we are performing one test. Now consider what happens when we perform several separate tests, say, n of them. Let’s assume, just for discussion, that we set α equal to 0.05 and that we will perform six tests. For each test, we have a 0.05 probability of falsely rejecting the equality-of-means hypothesis. Overall, then, our chances of falsely rejecting at least one of the hypotheses is 1 − (1 − 0.05)6 ≈ 0.26 if the tests are independent. The idea behind multiple-comparison tests is to control for the fact that we will perform multiple tests and to reduce our overall chances of falsely rejecting each hypothesis to α rather than letting our chances increase with each additional test. (See Miller [1981] and Hochberg and Tamhane [1987] for rather advanced texts on multiple-comparison procedures.) The Bonferroni adjustment (see Miller [1981]; also see van Belle et al. [2004, 534 – 537]) does this by (falsely but approximately) asserting that the critical level we should use, a, is the true critical level, α, divided by the number of tests, n; that is, a = α/n. For instance, if we are going to perform six tests, each at the 0.05 significance level, we want to adopt a critical level of 0.05/6 ≈ 0.00833. We can just as easily apply this logic to e, the significance level associated with our t statistic, as to our critical level α. If a comparison has a calculated significance of e, then its “real” significance, adjusted for the fact of n comparisons, is n × e. If a comparison has a significance level of, say, 0.012, and we perform six tests, then its “real” significance is 0.072. 
If we adopt a critical level of 0.05, we cannot reject the hypothesis. If we adopt a critical level of 0.10, we can reject it. Of course, this calculation can go above 1, but that just means that there is no α < 1 for which we could reject the hypothesis. (This situation arises because of the crude nature of the Bonferroni adjustment.) Stata handles this case by simply calling the significance level 1. Thus the formula for the Bonferroni significance level is eb = min(1, en) where n = k(k − 1)/2 is the number of comparisons. oneway — One-way analysis of variance 1553 The Šidák adjustment (Šidák [1967]; also see Winer, Brown, and Michels [1991, 165 – 166]) is slightly different and provides a tighter bound. It starts with the assertion that a = 1 − (1 − α)1/n Turning this formula around and substituting calculated significance levels, we obtain n o es = min 1, 1 − (1 − e)n For example, if the calculated significance is 0.012 and we perform six tests, the “real” significance is approximately 0.07. The Scheffé test (Scheffé [1953, 1959]; also see Kuehl [2000, 97 – 98]) differs in derivation, but it attacks the same problem. Let there be k means for which we want to make all the pairwise tests. Two means are declared significantly different if t≥ p (k − 1)F (α; k − 1, ν) where F (α; k − 1, ν) is the α-critical value of the F distribution with k − 1 numerator and ν denominator degrees of freedom. Scheffé’s test has the nicety that it never declares a contrast significant if the overall F test is not significant. Turning the test around, Stata calculates a significance level t2 eb = F , k − 1, ν k−1 For instance, you have a calculated t statistic of 4.0 with 50 degrees of freedom. The simple t test says that the significance level is 0.00021. The F test equivalent, 16 with 1 and 50 degrees of freedom, says the same. If you are comparing three means, however, you calculate an F test of 8.0 with 2 and 50 degrees of freedom, which says that the significance level is 0.0010. References Acock, A. C. 2014. A Gentle Introduction to Stata. 4th ed. College Station, TX: Stata Press. Altman, D. G. 1991. Practical Statistics for Medical Research. London: Chapman & Hall/CRC. Bartlett, M. S. 1937. Properties of sufficiency and statistical tests. Proceedings of the Royal Society, Series A 160: 268–282. Daniel, C., and E. L. Lehmann. 1979. Henry Scheffé 1907–1977. Annals of Statistics 7: 1149–1161. Draper, N., and H. Smith. 1998. Applied Regression Analysis. 3rd ed. New York: Wiley. Hochberg, Y., and A. C. Tamhane. 1987. Multiple Comparison Procedures. New York: Wiley. Kuehl, R. O. 2000. Design of Experiments: Statistical Principles of Research Design and Analysis. 2nd ed. Belmont, CA: Duxbury. Marchenko, Y. V. 2006. Estimating variance components in Stata. Stata Journal 6: 1–21. Miller, R. G., Jr. 1981. Simultaneous Statistical Inference. 2nd ed. New York: Springer. Scheffé, H. 1953. A method for judging all contrasts in the analysis of variance. Biometrika 40: 87–104. . 1959. The Analysis of Variance. New York: Wiley. Šidák, Z. 1967. Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association 62: 626–633. 1554 oneway — One-way analysis of variance Snedecor, G. W., and W. G. Cochran. 1989. Statistical Methods. 8th ed. Ames, IA: Iowa State University Press. van Belle, G., L. D. Fisher, P. J. Heagerty, and T. S. Lumley. 2004. Biostatistics: A Methodology for the Health Sciences. 2nd ed. New York: Wiley. Winer, B. J., D. R. Brown, and K. M. 
Michels. 1991. Statistical Principles in Experimental Design. 3rd ed. New York: McGraw–Hill. Also see [R] anova — Analysis of variance and covariance [R] loneway — Large one-way ANOVA, random effects, and reliability [PSS] power oneway — Power analysis for one-way analysis of variance Title oprobit — Ordered probit regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax oprobit depvar indepvars if in weight , options Description options Model offset(varname) include varname in model with coefficient constrained to 1 constraints(constraints) apply specified linear constraints collinear keep collinear variables SE/Robust vcetype may be oim, robust, cluster clustvar, bootstrap, or vce(vcetype) jackknife Reporting set confidence level; default is level(95) do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling level(#) nocnsreport display options Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics indepvars may contain factor variables; see [U] 11.4.3 Factor variables. depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, mfp, mi estimate, nestreg, rolling, statsby, stepwise, and svy are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce() and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Ordinal outcomes > Ordered probit regression 1555 1556 oprobit — Ordered probit regression Description oprobit fits ordered probit models of ordinal variable depvar on the independent variables indepvars. The actual values taken on by the dependent variable are irrelevant, except that larger values are assumed to correspond to “higher” outcomes. See [R] logistic for a list of related estimation commands. Options Model offset(varname), constraints(constraints), collinear; see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. The following option is available with oprobit but is not shown in the dialog box: coeflegend; see [R] estimation options. 
Remarks and examples An ordered probit model is used to estimate relationships between an ordinal dependent variable and a set of independent variables. An ordinal variable is a variable that is categorical and ordered, for instance, “poor”, “good”, and “excellent”, which might indicate a person’s current health status or the repair record of a car. If there are only two outcomes, see [R] logistic, [R] logit, and [R] probit. This entry is concerned only with more than two outcomes. If the outcomes cannot be ordered (for example, residency in the north, east, south, or west), see [R] mlogit. This entry is concerned only with models in which the outcomes can be ordered. oprobit — Ordered probit regression 1557 In ordered probit, an underlying score is estimated as a linear function of the independent variables and a set of cutpoints. The probability of observing outcome i corresponds to the probability that the estimated linear function, plus random error, is within the range of the cutpoints estimated for the outcome: Pr(outcomej = i) = Pr(κi−1 < β1 x1j + β2 x2j + · · · + βk xkj + uj ≤ κi ) uj is assumed to be normally distributed. In either case, we estimate the coefficients β1 , β2 , . . . , βk together with the cutpoints κ1 , κ2 , . . . , κI−1 , where I is the number of possible outcomes. κ0 is taken as −∞, and κI is taken as +∞. All of this is a direct generalization of the ordinary two-outcome probit model. Example 1 In example 2 of [R] ologit, we use a variation of the automobile dataset (see [U] 1.2.2 Example datasets) to analyze the 1977 repair records of 66 foreign and domestic cars. We use ordered logit to explore the relationship of rep77 in terms of foreign (origin of manufacture), length (a proxy for size), and mpg. Here we fit the same model using ordered probit rather than ordered logit: . use http://www.stata-press.com/data/r13/fullauto (Automobile Models) . oprobit rep77 foreign length mpg Iteration 0: log likelihood = -89.895098 Iteration 1: log likelihood = -78.106316 Iteration 2: log likelihood = -78.020086 Iteration 3: log likelihood = -78.020025 Iteration 4: log likelihood = -78.020025 Ordered probit regression Number of obs LR chi2(3) Prob > chi2 Log likelihood = -78.020025 Pseudo R2 rep77 Coef. foreign length mpg /cut1 /cut2 /cut3 /cut4 = = = = 66 23.75 0.0000 0.1321 Std. Err. z P>|z| [95% Conf. Interval] 1.704861 .0468675 .1304559 .4246796 .012648 .0378628 4.01 3.71 3.45 0.000 0.000 0.001 .8725037 .022078 .0562463 2.537217 .0716571 .2046656 10.1589 11.21003 12.54561 13.98059 3.076754 3.107527 3.155233 3.218793 4.128577 5.119389 6.361467 7.671874 16.18923 17.30067 18.72975 20.28931 We find that foreign cars have better repair records, as do larger cars and cars with better mileage ratings. 
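Because ologit and oprobit assume different distributions for uj, the coefficients are on different scales and should not be compared directly; the overall fits, however, can be placed side by side. A minimal sketch using estimates store and estimates table; the names ologit_fit and oprobit_fit are arbitrary:

. quietly ologit rep77 foreign length mpg
. estimates store ologit_fit       // arbitrary name
. quietly oprobit rep77 foreign length mpg
. estimates store oprobit_fit      // arbitrary name
. estimates table ologit_fit oprobit_fit, stats(N ll aic bic) b(%9.3f)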
1558 oprobit — Ordered probit regression Stored results oprobit stores the following in e(): Scalars e(N) e(N cd) e(k cat) e(k) e(k aux) e(k eq) e(k eq model) e(k dv) e(df m) e(r2 p) e(ll) e(ll 0) e(N clust) e(chi2) e(p) e(rank) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(clustvar) e(offset) e(chi2type) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(ilog) e(gradient) e(cat) e(V) e(V modelbased) Functions e(sample) number of observations number of completely determined observations number of categories number of parameters number of auxiliary parameters number of equations in e(b) number of equations in overall model test number of dependent variables model degrees of freedom pseudo-R-squared log likelihood log likelihood, constant-only model number of clusters χ2 significance of model test rank of e(V) number of iterations return code 1 if converged, 0 otherwise oprobit command as typed name of dependent variable weight type weight expression title in estimation output name of cluster variable linear offset variable Wald or LR; type of model χ2 test vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector category values variance–covariance matrix of the estimators model-based variance marks estimation sample oprobit — Ordered probit regression 1559 Methods and formulas See Methods and formulas of [R] ologit. References Aitchison, J., and S. D. Silvey. 1957. The generalization of probit analysis to the case of multiple responses. Biometrika 44: 131–140. Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. New York: Cambridge University Press. Chiburis, R., and M. Lokshin. 2007. Maximum likelihood and two-step estimation of an ordered-probit selection model. Stata Journal 7: 167–182. De Luca, G., and V. Perotti. 2011. Estimation of ordered response models with sample selection. Stata Journal 11: 213–239. Goldstein, R. 1997. sg59: Index of ordinal variation and Neyman–Barton GOF. Stata Technical Bulletin 33: 10–12. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 145–147. College Station, TX: Stata Press. Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage. Long, J. S., and J. Freese. 2014. Regression Models for Categorical Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata Press. Miranda, A., and S. Rabe-Hesketh. 2006. Maximum likelihood estimation of endogenous switching and sample selection models for binary, ordinal, and count variables. Stata Journal 6: 285–308. Stewart, M. B. 2004. Semi-nonparametric estimation of extended ordered probit models. Stata Journal 4: 27–39. Williams, R. 2010. Fitting heterogeneous choice models with oglm. Stata Journal 10: 540–567. Wolfe, R. 1998. sg86: Continuation-ratio models for ordinal response data. Stata Technical Bulletin 44: 18–21. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 149–153. College Station, TX: Stata Press. Wolfe, R., and W. W. Gould. 1998. 
sg76: An approximate likelihood-ratio test for ordinal response models. Stata Technical Bulletin 42: 24–27. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 199–204. College Station, TX: Stata Press. Xu, J., and J. S. Long. 2005. Confidence intervals for predicted outcomes in regression models for categorical outcomes. Stata Journal 5: 537–559. Also see [R] oprobit postestimation — Postestimation tools for oprobit [R] heckoprobit — Ordered probit model with sample selection [R] logistic — Logistic regression, reporting odds ratios [R] mlogit — Multinomial (polytomous) logistic regression [R] mprobit — Multinomial probit regression [R] ologit — Ordered logistic regression [R] probit — Probit regression [ME] meoprobit — Multilevel mixed-effects ordered probit regression [MI] estimation — Estimation commands for use with mi estimate [SVY] svy estimation — Estimation commands for survey data [XT] xtoprobit — Random-effects ordered probit models [U] 20 Estimation and postestimation commands Title oprobit postestimation — Postestimation tools for oprobit Description Remarks and examples Syntax for predict Also see Menu for predict Options for predict Description The following postestimation commands are available after oprobit: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 2 forecast is not appropriate with mi or svy estimation results. lrtest is not appropriate with svy estimation results. 1560 oprobit postestimation — Postestimation tools for oprobit 1561 Syntax for predict predict type predict type stub* | newvar | newvarlist outcome(outcome) nooffset stub* | newvarlist if if in , statistic in , scores Description statistic Main predicted probabilities; the default linear prediction standard error of the linear prediction pr xb stdp If you do not specify outcome(), pr (with one new variable specified) assumes outcome(#1). You specify one or k new variables with pr, where k is the number of outcomes. You specify one new variable with xb and stdp. These statistics are available both in and out of sample; type predict . . . if e(sample) . . . if wanted only for the estimation sample. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main pr, the default, calculates the predicted probabilities. 
If you do not also specify the outcome() option, you specify k new variables, where k is the number of categories of the dependent variable. Say that you fit a model by typing oprobit result x1 x2, and result takes on three values. Then you could type predict p1 p2 p3 to obtain all three predicted probabilities. If you specify the outcome() option, you must specify one new variable. Say that result takes on values 1, 2, and 3. Typing predict p1, outcome(1) would produce the same p1. xb calculates the linear prediction. You specify one new variable, for example, predict linear, xb. The linear prediction is defined, ignoring the contribution of the estimated cutpoints. stdp calculates the standard error of the linear prediction. You specify one new variable, for example, predict se, stdp. outcome(outcome) specifies for which outcome the predicted probabilities are to be calculated. outcome() should contain either one value of the dependent variable or one of #1, #2, . . . , with #1 meaning the first category of the dependent variable, #2 meaning the second category, etc. nooffset is relevant only if you specified offset(varname) for oprobit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xj b rather than as xj b + offsetj . 1562 oprobit postestimation — Postestimation tools for oprobit scores calculates equation-level score variables. The number of score variables created will equal the number of outcomes in the model. If the number of outcomes in the model was k , then the first new variable will contain ∂ ln L/∂(xj b); the second new variable will contain ∂ ln L/∂κ1 ; the third new variable will contain ∂ ln L/∂κ2 ; ... and the k th new variable will contain ∂ ln L/∂κk−1 , where κi refers to the ith cutpoint. Remarks and examples See [U] 20 Estimation and postestimation commands for instructions on obtaining the variance– covariance matrix of the estimators, predicted values, and hypothesis tests. Also see [R] lrtest for performing likelihood-ratio tests. Example 1 In example 1 of [R] oprobit, we fit the model oprobit rep77 foreign length mpg. The predict command can be used to obtain the predicted probabilities. We type predict followed by the names of the new variables to hold the predicted probabilities, ordering the names from low to high. In our data, the lowest outcome is “poor” and the highest is “excellent”. We have five categories, so we must type five names following predict; the choice of names is up to us: . use http://www.stata-press.com/data/r13/fullauto (Automobile Models) . oprobit rep77 foreign length mpg (output omitted ) . predict poor fair avg good exc (option pr assumed; predicted probabilities) . list make model exc good if rep77>=., sep(4) divider make model exc good 3. 10. 32. 44. AMC Buick Ford Merc. Spirit Opel Fiesta Monarch .0006044 .0043803 .0002927 .0093209 .0351813 .1133763 .0222789 .1700846 53. 56. 57. 63. Peugeot Plym. Plym. Pont. 604 Horizon Sapporo Phoenix .0734199 .001413 .0197543 .0234156 .4202766 .0590294 .2466034 .266771 oprobit postestimation — Postestimation tools for oprobit 1563 Technical note For ordered probit, predict, xb produces Sj = x1j β1 + x2j β2 + · · · + xkj βk . Ordered probit is identical to ordered logit, except that we use different distribution functions for calculating probabilities. The ordered-probit predictions are then the probability that Sj + uj lies between a pair of cutpoints κi−1 and κi . 
The formulas for ordered probit are Pr(Sj + u < κ) = Φ(κ − Sj ) Pr(Sj + u > κ) = 1 − Φ(κ − Sj ) = Φ(Sj − κ) Pr(κ1 < Sj + u < κ2 ) = Φ(κ2 − Sj ) − Φ(κ1 − Sj ) Rather than using predict directly, we could calculate the predicted probabilities by hand. . predict pscore, xb . generate probexc = normal(pscore-_b[/cut4]) . generate probgood = normal(_b[/cut4]-pscore) - normal(_b[/cut3]-pscore) Also see [R] oprobit — Ordered probit regression [U] 20 Estimation and postestimation commands Title orthog — Orthogonalize variables and compute orthogonal polynomials Syntax Options for orthog Methods and formulas Menu Options for orthpoly References Description Remarks and examples Also see Syntax Orthogonalize variables orthog varlist if in weight , generate(newvarlist) matrix(matname) Compute orthogonal polynomial orthpoly varname if in weight , generate(newvarlist) | poly(matname) degree(#) orthpoly requires that generate(newvarlist) or poly(matname), or both, be specified. varlist may contain time-series operators; see [U] 11.4.4 Time-series varlists. iweights, aweights, fweights, and pweights are allowed, see [U] 11.1.6 weight. Menu orthog Data > Create or change data > Other variable-creation commands > Orthogonalize variables > Other variable-creation commands > Orthogonal polynomials orthpoly Data > Create or change data Description orthog orthogonalizes a set of variables, creating a new set of orthogonal variables (all of type double), using a modified Gram–Schmidt procedure (Golub and Van Loan 1996). The order of the variables determines the orthogonalization; hence, the “most important” variables should be listed first. Execution time is proportional to the square of the number of variables. With many ( >10) variables, orthog will be fairly slow. orthpoly computes orthogonal polynomials for one variable. Options for orthog Main generate(newvarlist) is required. generate() creates new orthogonal variables of type double. For orthog, newvarlist will contain the orthogonalized varlist. If varlist contains d variables, then so will newvarlist. newvarlist can be specified by giving a list of exactly d new variable names, or it can be abbreviated using the styles newvar1-newvard or newvar*. For these two styles of abbreviation, new variables newvar1, newvar2, . . . , newvard are generated. 1564 orthog — Orthogonalize variables and compute orthogonal polynomials 1565 matrix(matname) creates a (d + 1) × (d + 1) matrix containing the matrix R defined by X = QR, where X is the N × (d + 1) matrix representation of varlist plus a column of ones and Q is the N × (d + 1) matrix representation of newvarlist plus a column of ones (d = number of variables in varlist, and N = number of observations). Options for orthpoly Main generate(newvarlist) or poly(), or both, must be specified. generate() creates new orthogonal variables of type double. newvarlist will contain orthogonal polynomials of degree 1, 2, . . . , d evaluated at varname, where d is as specified by degree(d). newvarlist can be specified by giving a list of exactly d new variable names, or it can be abbreviated using the styles newvar1newvard or newvar*. For these two styles of abbreviation, new variables newvar1, newvar2, . . . , newvard are generated. poly(matname) creates a (d + 1) × (d + 1) matrix called matname containing the coefficients of the orthogonal polynomials. 
The orthogonal polynomial of degree i ≤ d is matname[ i, d + 1 ] + matname[ i, 1 ]*varname + matname[ i, 2 ]*varname2 + · · · + matname[ i, i ]*varnamei The coefficients corresponding to the constant term are placed in the last column of the matrix. The last row of the matrix is all zeros, except for the last column, which corresponds to the constant term. degree(#) specifies the highest-degree polynomial to include. Orthogonal polynomials of degree 1, 2, . . . , d = # are computed. The default is d = 1. Remarks and examples Orthogonal variables are useful for two reasons. The first is numerical accuracy for highly collinear variables. Stata’s regress and other estimation commands can face much collinearity and still produce accurate results. But, at some point, these commands will drop variables because of collinearity. If you know with certainty that the variables are not perfectly collinear, you may want to retain all their effects in the model. If you use orthog or orthpoly to produce a set of orthogonal variables, all variables will be present in the estimation results. Users are more likely to find orthogonal variables useful for the second reason: ease of interpreting results. orthog and orthpoly create a set of variables such that the “effects” of all the preceding variables have been removed from each variable. For example, if we issue the command . orthog x1 x2 x3, generate(q1 q2 q3) the effect of the constant is removed from x1 to produce q1; the constant and x1 are removed from x2 to produce q2; and finally the constant, x1, and x2 are removed from x3 to produce q3. Hence, q1 = r01 + r11 x1 q2 = r02 + r12 x1 + r22 x2 q3 = r03 + r13 x1 + r23 x2 + r33 x3 This effect can be generalized and written in matrix notation as X = QR 1566 orthog — Orthogonalize variables and compute orthogonal polynomials where X is the N × (d + 1) matrix representation of varlist plus a column of ones, and Q is the N × (d + 1) matrix representation of newvarlist plus a column of ones (d = number of variables in varlist and N = number of observations). The (d + 1) × (d + 1) matrix R is a permuted uppertriangular matrix, that is, R would be upper triangular if the constant were first, but the constant is last, so the first row/column has been permuted with the last row/column. Because Stata’s estimation commands list the constant term last, this allows R, obtained via the matrix() option, to be used to transform estimation results. Example 1: orthog Consider Stata’s auto.dta dataset. Suppose that we postulate a model in which price depends on the car’s length, weight, headroom, and trunk size (trunk). These predictors are collinear, but not extremely so—the correlations are not that close to 1: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . correlate length weight headroom trunk (obs=74) length weight headroom length weight headroom trunk 1.0000 0.9460 0.5163 0.7266 1.0000 0.4835 0.6722 trunk 1.0000 0.6620 1.0000 regress certainly has no trouble fitting this model: . regress price length weight headroom trunk SS df MS Source Model Residual 236016580 399048816 4 69 59004145 5783316.17 Total 635065396 73 8699525.97 price Coef. length weight headroom trunk _cons -101.7092 4.753066 -711.5679 114.0859 11488.47 Std. Err. 42.12534 1.120054 445.0204 109.9488 4543.902 t -2.41 4.24 -1.60 1.04 2.53 Number of obs F( 4, 69) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.018 0.000 0.114 0.303 0.014 = = = = = = 74 10.20 0.0000 0.3716 0.3352 2404.9 [95% Conf. 
Interval] -185.747 2.518619 -1599.359 -105.2559 2423.638 -17.67147 6.987512 176.2236 333.4277 20553.31 However, we may believe a priori that length is the most important predictor, followed by weight, headroom, and trunk. We would like to remove the “effect” of length from all the other predictors, remove weight from headroom and trunk, and remove headroom from trunk. We can do this by running orthog, and then we fit the model again using the orthogonal variables: orthog — Orthogonalize variables and compute orthogonal polynomials . orthog length weight headroom trunk, gen(olength oweight oheadroom > matrix(R) . regress price olength oweight oheadroom otrunk SS df MS Number of obs Source F( 4, 69) Model 236016580 4 59004145 Prob > F Residual 399048816 69 5783316.17 R-squared Adj R-squared Total 635065396 73 8699525.97 Root MSE price Coef. olength oweight oheadroom otrunk _cons 1265.049 1175.765 -349.9916 290.0776 6165.257 Std. Err. 279.5584 279.5584 279.5584 279.5584 279.5584 t 4.53 4.21 -1.25 1.04 22.05 P>|t| 1567 otrunk) = = = = = = 74 10.20 0.0000 0.3716 0.3352 2404.9 [95% Conf. Interval] 0.000 0.000 0.215 0.303 0.000 707.3454 618.0617 -907.6955 -267.6262 5607.553 1822.753 1733.469 207.7122 847.7815 6722.961 Using the matrix R, we can transform the results obtained using the orthogonal predictors back to the metric of original predictors: . matrix b = e(b)*inv(R)’ . matrix list b b[1,5] length y1 -101.70924 weight 4.7530659 headroom -711.56789 trunk 114.08591 _cons 11488.475 Technical note The matrix R obtained using the matrix() option with orthog can also be used to recover X (the original varlist) from Q (the orthogonalized newvarlist), one variable at a time. Continuing with the previous example, we illustrate how to recover the trunk variable: . matrix C = R[1...,"trunk"]’ . matrix score double rtrunk = C . compare rtrunk trunk count minimum difference average maximum rtrunk>trunk 74 1.42e-14 2.27e-14 3.55e-14 jointly defined 74 1.42e-14 2.27e-14 3.55e-14 total 74 Here the recovered variable rtrunk is almost exactly the same as the original trunk variable. When you are orthogonalizing many variables, this procedure can be performed to check the numerical soundness of the orthogonalization. Because of the ordering of the orthogonalization procedure, the last variable and the variables near the end of the varlist are the most important ones to check. The orthpoly command effectively does for polynomial terms what the orthog command does for an arbitrary set of variables. 1568 orthog — Orthogonalize variables and compute orthogonal polynomials Example 2: orthpoly Again consider the auto.dta dataset. Suppose that we wish to fit the model mpg = β0 + β1 weight + β2 weight2 + β3 weight3 + β4 weight4 + We will first compute the regression with natural polynomials: . gen double w1 = . gen double w2 = . gen double w3 = . gen double w4 = . correlate w1-w4 (obs=74) w1 w2 w3 w4 weight w1*w1 w2*w1 w3*w1 w1 w2 w3 w4 1.0000 0.9915 0.9665 0.9279 1.0000 0.9916 0.9679 1.0000 0.9922 1.0000 df MS . regress mpg w1-w4 Source SS Model Residual 1652.73666 790.722803 4 69 413.184164 11.4597508 Total 2443.45946 73 33.4720474 mpg Coef. w1 w2 w3 w4 _cons .0289302 -.0000229 5.74e-09 -4.86e-13 23.94421 Std. Err. .1161939 .0000566 1.19e-08 9.14e-13 86.60667 t 0.25 -0.40 0.48 -0.53 0.28 Number of obs F( 4, 69) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.804 0.687 0.631 0.596 0.783 = = = = = = 74 36.06 0.0000 0.6764 0.6576 3.3852 [95% Conf. 
Interval] -.2028704 -.0001359 -1.80e-08 -2.31e-12 -148.8314 .2607307 .0000901 2.95e-08 1.34e-12 196.7198 Some of the correlations among the powers of weight are very large, but this does not create any problems for regress. However, we may wish to look at the quadratic trend with the constant removed, the cubic trend with the quadratic and constant removed, etc. orthpoly will generate polynomial terms with this property: . orthpoly weight, generate(pw*) deg(4) poly(P) . regress mpg pw1-pw4 Source SS df MS Model Residual 1652.73666 790.722803 4 69 413.184164 11.4597508 Total 2443.45946 73 33.4720474 mpg Coef. pw1 pw2 pw3 pw4 _cons -4.638252 .8263545 -.3068616 -.209457 21.2973 Std. Err. .3935245 .3935245 .3935245 .3935245 .3935245 t -11.79 2.10 -0.78 -0.53 54.12 Number of obs F( 4, 69) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.039 0.438 0.596 0.000 = = = = = = 74 36.06 0.0000 0.6764 0.6576 3.3852 [95% Conf. Interval] -5.423312 .0412947 -1.091921 -.9945168 20.51224 -3.853192 1.611414 .4781982 .5756028 22.08236 orthog — Orthogonalize variables and compute orthogonal polynomials 1569 Compare the p-values of the terms in the natural polynomial regression with those in the orthogonal polynomial regression. With orthogonal polynomials, it is easy to see that the pure cubic and quartic trends are not significant and that the constant, linear, and quadratic terms each have p < 0.05. The matrix P obtained with the poly() option can be used to transform coefficients for orthogonal polynomials to coefficients for natural polynomials: . orthpoly weight, poly(P) deg(4) . matrix b = e(b)*P . matrix list b b[1,5] deg1 deg2 deg3 y1 .02893016 -.00002291 5.745e-09 deg4 -4.862e-13 _cons 23.944212 Methods and formulas orthog’s orthogonalization can be written in matrix notation as X = QR where X is the N × (d + 1) matrix representation of varlist plus a column of ones and Q is the N × (d + 1) matrix representation of newvarlist plus a column of ones (d = number of variables in varlist, and N = number of observations). The (d + 1) × (d + 1) matrix R is a permuted uppertriangular matrix; that is, R would be upper triangular if the constant were first, but the constant is last, so the first row/column has been permuted with the last row/column. Q and R are obtained using a modified Gram–Schmidt procedure; see Golub and Van Loan (1996, 218–219) for details. The traditional Gram–Schmidt procedure is notoriously unsound, but the modified procedure is good. orthog performs two passes of this procedure. orthpoly uses the Christoffel–Darboux recurrence formula (Abramowitz and Stegun 1972). Both orthog and orthpoly normalize the orthogonal variables such that Q0 W Q = M I where W = diag(w1 , w2 , . . . , wN ) with weights w1 , w2 , . . . , wN (all 1 if weights are not specified), and M is the sum of the weights (the number of observations if weights are not specified). References Abramowitz, M., and I. A. Stegun, ed. 1972. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. 10th ed. Washington, DC: National Bureau of Standards. Golub, G. H., and C. F. Van Loan. 1996. Matrix Computations. 3rd ed. Baltimore: Johns Hopkins University Press. Sribney, W. M. 1995. sg37: Orthogonal polynomials. Stata Technical Bulletin 25: 17–18. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 96–98. College Station, TX: Stata Press. 
Also see [R] regress — Linear regression Title pcorr — Partial and semipartial correlation coefficients Syntax Stored results Also see Menu Methods and formulas Description Acknowledgment Remarks and examples References Syntax pcorr varname1 varlist if in weight varname1 and varlist may contain time-series operators; see [U] 11.4.4 Time-series varlists. by is allowed; see [D] by. aweights and fweights are allowed; see [U] 11.1.6 weight. Menu Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Partial correlations Description pcorr displays the partial and semipartial correlation coefficients of varname1 with each variable in varlist after removing the effects of all other variables in varlist. The squared correlations and corresponding significance are also reported. Remarks and examples Assume that y is determined by x1 , x2 , . . . , xk . The partial correlation between y and x1 is an attempt to estimate the correlation that would be observed between y and x1 if the other x’s did not vary. The semipartial correlation, also called part correlation, between y and x1 is an attempt to estimate the correlation that would be observed between y and x1 after the effects of all other x’s are removed from x1 but not from y . Both squared correlations estimate the proportion of the variance of y that is explained by each predictor. The squared semipartial correlation between y and x1 represents the proportion of variance in y that is explained by x1 only. This squared correlation can also be interpreted as the decrease in the model’s R2 value that results from removing x1 from the full model. Thus one could use the squared semipartial correlations as criteria for model selection. The squared partial correlation between y and x1 represents the proportion of variance in y not associated with any other x’s that is explained by x1 . Thus the squared partial correlation gives an estimate of how much of the variance of y not explained by the other x’s is explained by x1 . Example 1 Using our automobile dataset (described in [U] 1.2.2 Example datasets), we can obtain the simple correlations between price, mpg, weight, and foreign from correlate (see [R] correlate): 1570 pcorr — Partial and semipartial correlation coefficients 1571 . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . correlate price mpg weight foreign (obs=74) price mpg weight foreign price mpg weight foreign 1.0000 -0.4686 0.5386 0.0487 1.0000 -0.8072 0.3934 1.0000 -0.5928 1.0000 Although correlate gave us the full correlation matrix, our interest is in just the first column. We find, for instance, that the higher the mpg, the lower the price. We obtain the partial and semipartial correlation coefficients by using pcorr: . pcorr price mpg weight foreign (obs=74) Partial and semipartial correlations of price with Partial Semipartial Partial Variable Corr. Corr. Corr.^2 mpg weight foreign 0.0352 0.5488 0.5402 0.0249 0.4644 0.4541 0.0012 0.3012 0.2918 Semipartial Corr.^2 Significance Value 0.0006 0.2157 0.2062 0.7693 0.0000 0.0000 We now find that the partial and semipartial correlations of price with mpg are near 0. In the simple correlations, we found that price and foreign were virtually uncorrelated. In the partial and semipartial correlations, we find that price and foreign are positively correlated. The nonsignificance of mpg tells us that the amount in which R2 decreases by removing mpg from the model is not significant. 
We find that removing either weight or foreign results in a significant drop in the R2 of the model. Technical note Use caution when interpreting the above results. As we said at the outset, the partial and semipartial correlation coefficients are an attempt to estimate the correlation that would be observed if the effects of all other variables were taken out of both y and x or only x. pcorr makes it too easy to ignore the fact that we are fitting a model. In the example above, the model is price = β0 + β1 mpg + β2 weight + β3 foreign + which is, in all honesty, a rather silly model. Even if we accept the implied economic assumptions of the model — that consumers value mpg, weight, and foreign — do we really believe that consumers place equal value on every extra 1,000 pounds of weight? That is, have we correctly parameterized the model? If we have not, then the estimated partial and semipartial correlation coefficients may not represent what they claim to represent. Partial and semipartial correlation coefficients are a reasonable way to summarize data if we are convinced that the underlying model is reasonable. We should not, however, pretend that there is no underlying model and that these correlation coefficients are unaffected by the assumptions and parameterization. 1572 pcorr — Partial and semipartial correlation coefficients Stored results pcorr stores the following in r(): Scalars r(N) r(df) Matrices r(p corr) r(sp corr) number of observations degrees of freedom partial correlation coefficient vector semipartial correlation coefficient vector Methods and formulas Results are obtained by fitting a linear regression of varname1 on varlist; see [R] regress. The partial correlation coefficient between varname1 and each variable in varlist is then calculated as √ t2 t +n−k (Greene 2012, 37), where t is the t statistic, n is the number of observations, and k is the number of independent variables, including the constant but excluding any dropped variables. as The semipartial correlation coefficient between varname1 and each variable in varlist is calculated r t2 (1 − R2 ) sign(t) n−k (Cohen et al. 2003, 89), where R2 is the model R2 value, and t, n, and k are as described above. The significance is given by 2Pr(tn−k > |t|), where tn−k follows a Student’s t distribution with n − k degrees of freedom. Acknowledgment The addition of semipartial correlation coefficients to pcorr is based on the pcorr2 command by Richard Williams of the Department of Sociology at the University of Notre Dame. References Cohen, J., P. Cohen, S. G. West, and L. S. Aiken. 2003. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. 3rd ed. Hillsdale, NJ: Erlbaum. Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. Also see [R] correlate — Correlations (covariances) of variables or coefficients [R] spearman — Spearman’s and Kendall’s correlations Title permute — Monte Carlo permutation tests Syntax Remarks and examples Menu Stored results Description References Options Also see Syntax Compute permutation test permute permvar exp list , options : command Report saved results permute varlist using filename , display options options Description Main reps(#) left | right perform # random permutations; default is reps(100) compute one-sided p-values; default is two-sided Options strata(varlist) saving( filename, . . .) 
permute within strata save results to filename; save statistics in double precision; save results to filename every # replications Reporting level(#) noheader nolegend verbose nodrop nodots noisily trace title(text) set confidence level; default is level(95) suppress table header suppress table legend display full table legend do not drop observations suppress replication dots display any output from command trace command use text as title for permutation results Advanced eps(#) nowarn force reject(exp) seed(#) numerical tolerance; seldom used do not warn when e(sample) is not set do not check for weights or svy commands; seldom used identify invalid results set random-number seed to # weights are not allowed in command. 1573 1574 permute — Monte Carlo permutation tests display options Description left | right level(#) noheader nolegend verbose title(text) eps(#) compute one-sided p-values; default is two-sided set confidence level; default is level(95) suppress table header suppress table legend display full table legend use text as title for results numerical tolerance; seldom used exp list contains elist contains eexp is specname is eqno is (name: elist) elist eexp newvar = (exp) (exp) specname [eqno]specname b b[] se se[] ## name exp is a standard Stata expression; see [U] 13 Functions and expressions. Distinguish between [ ], which are to be typed, and , which indicate optional arguments. Menu Statistics > Resampling > Permutation tests Description permute estimates p-values for permutation tests on the basis of Monte Carlo simulations. Typing . permute permvar exp list, reps(#): command randomly permutes the values in permvar # times, each time executing command and collecting the associated values from the expression in exp list. These p-value estimates can be one-sided: Pr(T ∗ ≤ T ) or Pr(T ∗ ≥ T ). The default is two-sided: Pr(|T ∗ | ≥ |T |). Here T ∗ denotes the value of the statistic from a randomly permuted dataset, and T denotes the statistic as computed on the original data. permvar identifies the variable whose observed values will be randomly permuted. permute — Monte Carlo permutation tests 1575 command defines the statistical command to be executed. Most Stata commands and user-written programs can be used with permute, as long as they follow standard Stata syntax; see [U] 11 Language syntax. The by prefix may not be part of command. exp list specifies the statistics to be collected from the execution of command. permute may be used for replaying results, but this feature is appropriate only when a dataset generated by permute is currently in memory or is identified by the using option. The variables specified in varlist in this context must be present in the respective dataset. Options Main reps(#) specifies the number of random permutations to perform. The default is 100. left or right requests that one-sided p-values be computed. If left is specified, an estimate of Pr(T ∗ ≤ T ) is produced, where T ∗ is the test statistic and T is its observed value. If right is specified, an estimate of Pr(T ∗ ≥ T ) is produced. By default, two-sided p-values are computed; that is, Pr(|T ∗ | ≥ |T |) is estimated. Options strata(varlist) specifies that the permutations be performed within each stratum defined by the values of varlist. saving( filename , suboptions ) creates a Stata data file (.dta file) consisting of (for each statistic in exp list) a variable containing the replicates. double specifies that the results for each replication be saved as doubles, meaning 8-byte reals. 
By default, they are saved as floats, meaning 4-byte reals. every(#) specifies that results are to be written to disk every #th replication. every() should be specified only in conjunction with saving() when command takes a long time for each replication. This will allow recovery of partial results should some other software crash your computer. See [P] postfile. replace specifies that filename be overwritten if it exists. This option does not appear in the dialog box. Reporting level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [R] level. noheader suppresses display of the table header. This option implies the nolegend option. nolegend suppresses display of the table legend. The table legend identifies the rows of the table with the expressions they represent. verbose requests that the full table legend be displayed. By default, coefficients and standard errors are not displayed. nodrop prevents permute from dropping observations outside the if and in qualifiers. nodrop will also cause permute to ignore the contents of e(sample) if it exists as a result of running command. By default, permute temporarily drops out-of-sample observations. 1576 permute — Monte Carlo permutation tests nodots suppresses display of the replication dots. By default, one dot character is displayed for each successful replication. A red ‘x’ is displayed if command returns an error or if one of the values in exp list is missing. noisily requests that any output from command be displayed. This option implies the nodots option. trace causes a trace of the execution of command to be displayed. This option implies the noisily option. title(text) specifies a title to be displayed above the table of permutation results; the default title is Monte Carlo permutation results. Advanced eps(#) specifies the numerical tolerance for testing |T ∗ | ≥ |T |, T ∗ ≤ T , or T ∗ ≥ T . These are considered true if, respectively, |T ∗ | ≥ |T |−#, T ∗ ≤ T +#, or T ∗ ≥ T −#. The default is 1e-7. You will not have to specify eps() under normal circumstances. nowarn suppresses the printing of a warning message when command does not set e(sample). force suppresses the restriction that command may not specify weights or be a svy command. permute is not suited for weighted estimation, thus permute should not be used with weights or svy. permute reports an error when it encounters weights or svy in command if the force option is not specified. This is a seldom used option, so use it only if you know what you are doing! reject(exp) identifies an expression that indicates when results should be rejected. When exp is true, the resulting values are reset to missing values. seed(#) sets the random-number seed. Specifying this option is equivalent to typing the following command prior to calling permute: . set seed # Remarks and examples Permutation tests determine the significance of the observed value of a test statistic in light of rearranging the order (permuting) of the observed values of a variable. Example 1: A simple two-sample test Suppose that we conducted an experiment to determine the effect of a treatment on the development of cells. Further suppose that we are restricted to six experimental units because of the extreme cost of the experiment. Thus three units are to be given a placebo, and three units are given the treatment. The measurement is the number of newly developed healthy cells. 
The following listing gives the hypothetical data, along with some summary statistics. . input y treatment y treatment 1. 7 0 2. 9 0 3. 11 0 4. 10 1 5. 12 1 6. 14 1 7. end . sort treatment permute — Monte Carlo permutation tests . summarize y Variable Obs Mean Std. Dev. 10.5 2.428992 Obs Mean Std. Dev. 3 9 Obs Mean 3 12 y 6 . by treatment: summarize y -> treatment = 0 Variable y -> treatment = 1 Variable y Min Max 7 14 Min Max 7 11 Min Max 10 14 2 Std. Dev. 2 1577 Clearly, there are more cells in the treatment group than in the placebo group, but a statistical test is needed to conclude that the treatment does affect the development of cells. If the sum of the treatment measures is our test statistic, we can use permute to determine the probability of observing 36 or more cells, given the observed data and assuming that there is no effect due to the treatment. . set seed 1234 . permute y sum=r(sum), saving(permdish) right nodrop nowarn: sum y if treatment (running summarize on estimation sample) Permutation replications (100) 1 2 3 4 5 .................................................. 50 .................................................. 100 Monte Carlo permutation results Number of obs = 6 command: summarize y if treatment sum: r(sum) permute var: y T sum Note: Note: T(obs) c n p=c/n 36 10 100 0.1000 SE(p) [95% Conf. Interval] 0.0300 .0490047 .1762226 confidence interval is with respect to p=c/n. c = #{T >= T(obs)} We see that 10 of the 100 randomly permuted datasets yielded sums from the treatment group larger than or equal to the observed sum of 36. Thus the evidence is not strong enough, at the 5% level, to reject the null hypothesis that there is no effect of the treatment. Because of the small size of this experiment, we could have calculated the exact permutation p-value from all possible permutations. There are six units, but we want the sum of the treatment units. Thus there are 63 = 20 permutation sums from the possible unique permutations. 1578 permute — Monte Carlo permutation tests 7 + 9 + 10 = 26 7 + 10 + 12 = 29 9 + 10 + 11 = 30 9 + 12 + 14 = 35 7 + 9 + 11 = 27 7 + 10 + 14 = 31 9 + 10 + 12 = 31 10 + 11 + 12 = 33 7 + 9 + 12 = 28 7 + 11 + 12 = 30 9 + 10 + 14 = 33 10 + 11 + 14 = 35 7 + 9 + 14 = 30 7 + 11 + 14 = 32 9 + 11 + 12 = 32 10 + 12 + 14 = 36 7 + 10 + 11 = 28 7 + 12 + 14 = 33 9 + 11 + 14 = 34 11 + 12 + 14 = 37 Two of the 20 permutation sums are greater than or equal to 36. Thus the exact p-value for this permutation test is 0.1. Tied values will decrease the number of unique permutations. When the saving() option is supplied, permute saves the values of the permutation statistic to the indicated file, in our case, permdish.dta. This file can be used to replay the result of permute. The level() option controls the confidence level of the confidence interval for the permutation p-value. This confidence interval is calculated using cii with the reported n (number of nonmissing replications) and c (the counter for events of significance). . permute using permdish, level(80) Monte Carlo permutation results command: sum: permute var: T sum Note: Note: Number of obs = 6 summarize y if treatment r(sum) y T(obs) c n p=c/n 36 10 100 0.1000 SE(p) [80% Conf. Interval] 0.0300 .0631113 .1498826 confidence interval is with respect to p=c/n. c = #{|T| >= |T(obs)|} Example 2: Permutation tests with ANOVA Consider some fictional data from a randomized complete-block design in which we wish to determine the significance of five treatments. . 
use http://www.stata-press.com/data/r13/permute1, clear . list y treatment in 1/10, abbrev(10) y treatment 1. 2. 3. 4. 5. 4.407557 5.693386 7.099699 3.12132 5.242648 1 1 1 1 1 6. 7. 8. 9. 10. 4.280349 4.508785 4.079967 5.904368 3.010556 2 2 2 2 2 permute — Monte Carlo permutation tests 1579 These data may be analyzed using anova. . anova y treatment subject Source Number of obs = 50 Root MSE = .914159 Partial SS df MS R-squared = 0.3544 Adj R-squared = 0.1213 F Prob > F Model 16.5182188 13 1.27063221 1.52 0.1574 treatment subject 13.0226706 3.49554813 9 4 1.44696341 .873887032 1.73 1.05 0.1174 0.3973 Residual 30.0847503 36 .835687509 Total 46.6029691 49 .951081002 Suppose that we want to compute the significance of the F statistic for treatment by using permute. All we need to do is write a short program that will save the result of this statistic for permute to use. For example, program panova, rclass version 13 args response fac_intrst fac_other anova ‘response’ ‘fac_intrst’ ‘fac_other’ return scalar Fmodel = e(F) test ‘fac_intrst’ return scalar F = r(F) end Now in panova, test saves the F statistic for the factor of interest in r(F). This is different from e(F), which is the overall model F statistic for the model fit by anova that panova saves in r(Fmodel). In the following example, we use the strata() option so that the treatments are randomly rearranged within each subject. It should not be too surprising that the estimated p-values are equal for this example, because the two F statistics are equivalent when controlling for differences between subjects. However, we would not expect to always get the same p-values every time we reran permute. . set seed 1234 . permute treatment treatmentF=r(F) modelF=e(F), reps(1000) strata(subject) > saving(permanova) nodots: panova y treatment subject Monte Carlo permutation results Number of strata = 5 Number of obs = 50 command: panova y treatment subject treatmentF: r(F) modelF: e(F) permute var: treatment T treatmentF modelF Note: Note: T(obs) c n p=c/n 1.731465 1.520463 118 118 1000 1000 0.1180 0.1180 SE(p) [95% Conf. Interval] 0.0102 0.0102 confidence intervals are with respect to p=c/n. c = #{|T| >= |T(obs)|} .0986525 .0986525 .1396277 .1396277 1580 permute — Monte Carlo permutation tests Example 3: Wilcoxon rank-sum test As a final example, let’s consider estimating the p-value of the Z statistic returned by ranksum. Suppose that we collected data from some experiment: y is some measure we took on 17 individuals, and group identifies the group that an individual belongs to. . use http://www.stata-press.com/data/r13/permute2 . list group y 1. 2. 3. 4. 5. 1 1 1 1 1 6 11 20 2 9 6. 7. 8. 9. 10. 1 0 0 0 0 5 2 1 6 0 11. 12. 13. 14. 15. 0 0 0 0 0 2 3 3 12 4 16. 17. 0 0 1 5 Next we analyze the data using ranksum and notice that the observed value of the test statistic (stored as r(z)) is −2.02 with an approximate p-value of 0.0434. . ranksum y, by(group) Two-sample Wilcoxon rank-sum (Mann-Whitney) test obs rank sum expected group 0 1 combined unadjusted variance adjustment for ties 11 6 17 79 74 99 54 153 153 99.00 -0.97 adjusted variance 98.03 Ho: y(group==0) = y(group==1) z = -2.020 Prob > |z| = 0.0434 The observed value of the rank-sum statistic is 79, with an expected value (under the null hypothesis of no group effect) of 99. There are 17 observations, so the permutation distribution contains 17 6 = 12,376 possible values of the rank-sum statistic if we ignore ties. 
With ties, we have fewer possible values but still too many to want to count them. Thus we use permute with 10,000 replications and see that the Monte Carlo permutation test agrees with the result of the test based on the normal approximation. permute — Monte Carlo permutation tests . set seed 18385766 . permute y z=r(z), reps(10000) nowarn nodots: ranksum y, by(group) Monte Carlo permutation results Number of obs = command: ranksum y, by(group) z: r(z) permute var: y T z Note: Note: T(obs) c n p=c/n -2.020002 468 10000 0.0468 1581 17 SE(p) [95% Conf. Interval] 0.0021 .0427429 .0511236 confidence interval is with respect to p=c/n. c = #{|T| >= |T(obs)|} For an application of a permutation test to a problem in epidemiology, see Hayes and Moulton (2009, 190–193). Technical note permute reports confidence intervals for p to emphasize that it is based on the binomial estimator for proportions. When the variability implied by the confidence interval makes conclusions difficult, you may increase the number of replications to determine more precisely the significance of the test statistic of interest. In other words, the value of p from permute will converge to the true permutation p-value as the number of replications gets arbitrarily large. Stored results permute stores the following in r(): Scalars r(N) sample size r(N reps) number of requested replications r(level) confidence level Macros r(cmd) permute r(command) command following colon r(permvar) permutation variable r(title) title in output r(exp#) #th expression Matrices r(b) observed statistics r(c) count when r(event) is true r(reps) number of nonmissing results r(k exp) r(k eexp) number of standard expressions number of b/ se expressions r(left) r(right) r(seed) r(event) left or empty right or empty initial random-number seed T <= T(obs), T >= T(obs), or |T| <= |T(obs)| r(p) r(se) r(ci) observed proportions standard errors of observed proportions confidence intervals of observed proportions References Ängquist, L. 2010. Stata tip 92: Manual implementation of permutations and bootstraps. Stata Journal 10: 686–688. Good, P. I. 2006. Resampling Methods: A Practical Guide to Data Analysis. 3rd ed. Boston: Birkhäuser. Hayes, R. J., and L. H. Moulton. 2009. Cluster Randomised Trials. Boca Raton, FL: Chapman & Hall/CRC. Kaiser, J. 2007. An exact and a Monte Carlo proposal to the Fisher–Pitman permutation tests for paired replicates and for independent samples. Stata Journal 7: 402–412. 1582 permute — Monte Carlo permutation tests Kaiser, J., and M. G. Lacy. 2009. A general-purpose method for two-group randomization tests. Stata Journal 9: 70–85. Also see [R] bootstrap — Bootstrap sampling and estimation [R] jackknife — Jackknife estimation [R] simulate — Monte Carlo simulations Title pk — Pharmacokinetic (biopharmaceutical) data Description Remarks and examples References Description The term pk refers to pharmacokinetic data and the Stata commands, all of which begin with the letters pk, designed to do some of the analyses commonly performed in the pharmaceutical industry. The system is intended for the analysis of pharmacokinetic data, although some of the commands are for general use. 
The pk commands are pkexamine pksumm pkshape pkcross pkequiv pkcollapse [R] [R] [R] [R] [R] [R] pkexamine pksumm pkshape pkcross pkequiv pkcollapse Calculate pharmacokinetic measures Summarize pharmacokinetic data Reshape (pharmacokinetic) Latin-square data Analyze crossover experiments Perform bioequivalence tests Generate pharmacokinetic measurement dataset Remarks and examples Several types of clinical trials are commonly performed in the pharmaceutical industry. Examples include combination trials, multicenter trials, equivalence trials, and active control trials. For each type of trial, there is an optimal study design for estimating the effects of interest. Currently, the pk system can be used to analyze equivalence trials, which are usually conducted using a crossover design; however, it is possible to use a parallel design and still draw conclusions about equivalence. Equivalence trials assess bioequivalence between two drugs. Although proving that two drugs behave the same is impossible, the United States Food and Drug Administration believes that if the absorption properties of two drugs are similar, the two drugs will produce similar effects and have similar safety profiles. Generally, the goal of an equivalence trial is to assess the equivalence of a generic drug to an existing drug. This goal is commonly accomplished by comparing a confidence interval about the difference between a pharmacokinetic measurement of two drugs with a confidence limit constructed from U.S. federal regulations. If the confidence interval is entirely within the confidence limit, the drugs are declared bioequivalent. Another approach to assessing bioequivalence is to use the method of interval hypotheses testing. pkequiv is used to conduct these tests of bioequivalence. Several pharmacokinetic measures can be used to ascertain how available a drug is for cellular absorption. The most common measure is the area under the time-versus-concentration curve (AUC). Another common measure of drug availability is the maximum concentration (Cmax ) achieved by the drug during the follow-up period. Stata reports these and other less common measures of drug availability, including the time at which the maximum drug concentration was observed and the duration of the period during which the subject was being measured. Stata also reports the elimination rate, that is, the rate at which the drug is metabolized, and the drug’s half-life, that is, the time it takes for the drug concentration to fall to one-half of its maximum concentration. 1583 1584 pk — Pharmacokinetic (biopharmaceutical) data pkexamine computes and reports all the pharmacokinetic measures that Stata produces, including four calculations of the area under the time-versus-concentration curve. The standard area under the curve from 0 to the maximum observed time (AUC0,tmax ) is computed using cubic splines or the trapezoidal rule. Additionally, pkexamine also computes the area under the curve from 0 to infinity by extending the standard time-versus-concentration curve from the maximum observed time by using three different methods. The first method simply extends the standard curve by using a least-squares linear fit through the last few data points. The second method extends the standard curve by fitting a decreasing exponential curve through the last few data points. Finally, the third method extends the curve by fitting a least-squares linear regression line on the log concentration. 
The mathematical details of these extensions are described in Methods and formulas of [R] pkexamine. Data from an equivalence trial may also be analyzed using methods appropriate to the particular study design. When you have a crossover design, pkcross can be used to fit an appropriate ANOVA model. As an aside, a crossover design is simply a restricted Latin square; therefore, pkcross can also be used to analyze any Latin-square design. There are some practical concerns when dealing with data from equivalence trials. Primarily, the data must be organized in a manner that Stata can use. The pk commands include pkcollapse and pkshape, which are designed to help transform data from a common format to one that is suitable for analysis with Stata. In the following example, we illustrate several different data formats that are often encountered in pharmaceutical research and describe how these formats can be transformed to formats that can be analyzed with Stata. Example 1 Assume that we have one subject and are interested in determining the drug profile for that subject. A reasonable experiment would be to give the subject the drug and then measure the concentration of the drug in the subject’s blood over a given period. For example, here is a part of a dataset from Chow and Liu (2009, 13): . use http://www.stata-press.com/data/r13/auc . list, abbrev(14) id time concentration 1. 2. 3. 4. 5. 1 1 1 1 1 0 .5 1 1.5 2 0 0 2.8 4.4 4.4 6. 7. 8. 9. 10. 1 1 1 1 1 3 4 6 8 12 4.7 4.1 4 3.6 3 11. 12. 13. 1 1 1 16 24 32 2.5 2 1.6 Examining these data, we notice that the concentration quickly increases, plateaus for a short period, and then slowly decreases over time. pkexamine is used to calculate the pharmacokinetic measures of interest. pkexamine is explained in detail in [R] pkexamine. The output is pk — Pharmacokinetic (biopharmaceutical) data 1585 . pkexamine time conc Maximum concentration Time of maximum concentration Time of last observation (Tmax) Elimination rate Half life = = = = = 4.7 3 32 0.0279 24.8503 Area under the curve AUC [0, Tmax] AUC [0, inf.) Linear of log conc. 85.24 AUC [0, inf.) Linear fit AUC [0, inf.) Exponential fit 107.759 142.603 142.603 Fit based on last 3 points. Clinical trials, however, require that data be collected on more than one subject. There are several ways to enter raw measured data collected on several subjects. It would be reasonable to enter for each subject the drug concentration value at specific points in time. Such data could be id 1 2 3 conc1 0 0 0 conc2 1 2 1 conc3 4 6 2 conc4 7 5 3 conc5 5 4 5 conc6 3 3 4 conc7 1 2 1 where conc1 is the concentration at the first measured time, conc2 is the concentration at the second measured time, etc. This format requires that each drug concentration measurement be made at the same time on each subject. Another more flexible way to enter the data is to have an observation with three variables for each time measurement on a subject. Each observation would have a subject ID, the time at which the measurement was made, and the corresponding drug concentration at that time. The data would be 1586 pk — Pharmacokinetic (biopharmaceutical) data . use http://www.stata-press.com/data/r13/pkdata . list id concA time, sepby(id) id concA time 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 1 1 1 1 1 1 1 1 1 1 1 1 1 0 3.073403 5.188444 5.898577 5.096378 6.094085 5.158772 5.7065 5.272467 4.4576 5.146423 4.947427 1.920421 0 .5 1 1.5 2 3 4 6 8 12 16 24 32 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 
2 2 2 2 2 2 2 2 2 2 2 2 2 0 2.48462 4.883569 7.253442 5.849345 6.761085 4.33839 5.04199 4.25128 6.205004 5.566165 3.689007 3.644063 0 .5 1 1.5 2 3 4 6 8 12 16 24 32 27. 3 207. 208. 0 (output omitted ) 20 4.673281 20 3.487347 0 24 32 Stata expects the data to be organized in the second form. If your data are organized as described in the first dataset, you will need to use reshape to change the data to the second form; see [D] reshape. Because the data in the second (or long) format contain information for one drug on several subjects, pksumm can be used to produce summary statistics of the pharmacokinetic measurements. The output is . pksumm id time concA ................ Summary statistics for the pharmacokinetic measures Number of observations = 16 Measure Mean Median Variance Skewness Kurtosis p-value auc aucline aucexp auclog half ke cmax tomc tmax 151.63 397.09 668.60 665.95 90.68 0.02 7.37 3.38 32.00 152.18 219.83 302.96 298.03 29.12 0.02 7.42 3.00 32.00 127.58 178276.59 720356.98 752573.34 17750.70 0.00 0.40 7.25 0.00 -0.34 2.69 2.67 2.71 2.36 0.88 -0.64 2.27 . 2.07 9.61 9.54 9.70 7.92 3.87 2.75 7.70 . 0.55 0.00 0.00 0.00 0.00 0.08 0.36 0.00 . pk — Pharmacokinetic (biopharmaceutical) data 1587 Until now, we have been concerned with the profile of only one drug. We have characterized the profile of that drug by individual subjects by using pkexamine and by a group of subjects by using pksumm. The goal of an equivalence trial, however, is to compare two drugs, which we will do in the rest of this example. For equivalence trials, the study design most often used is the crossover design. For a complete discussion of crossover designs, see Ratkowsky, Evans, and Alldredge (1993). In brief, crossover designs require that each subject be given both treatments at two different times. The order in which the treatments are applied changes between groups. For example, if we had 20 subjects numbered 1–20, the first 10 would receive treatment A during the first period of the study, and then they would be given treatment B. The second 10 subjects would be given treatment B during the first period of the study, and then they would be given treatment A. Each subject in the study will have four variables that describe the observation: a subject identifier, a sequence identifier that indicates the order of treatment, and two outcome variables, one for each treatment. The outcome variables for each subject are the pharmacokinetic measures. The data must be transformed from a series of measurements on individual subjects to data containing the pharmacokinetic measures for each subject. In Stata parlance, this is referred to as a collapse, which can be done with pkcollapse; see [R] pkcollapse. Here is a part of our data: . list, sepby(id) id seq time concA concB 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 .5 1 1.5 2 3 4 6 8 12 16 24 32 0 3.073403 5.188444 5.898577 5.096378 6.094085 5.158772 5.7065 5.272467 4.4576 5.146423 4.947427 1.920421 0 3.712592 6.230602 7.885944 9.241735 13.10507 .169429 8.759894 7.985409 7.740126 7.607208 7.588428 2.791115 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 0 .5 1 1.5 2 3 4 6 8 12 16 24 32 0 2.48462 4.883569 7.253442 5.849345 6.761085 4.33839 5.04199 4.25128 6.205004 5.566165 3.689007 3.644063 0 .9209593 5.925818 8.710549 10.90552 8.429898 5.573152 6.32341 .5251224 7.415988 6.323938 1.133553 5.759489 27. 3 1 0 207. 208. 
20 20 2 2 0 0 (output omitted ) 24 4.673281 32 3.487347 6.059818 5.213639 1588 pk — Pharmacokinetic (biopharmaceutical) data This format is similar to the second format described above, except that now we have measurements for two drugs at each time for each subject. We transform these data with pkcollapse: . pkcollapse time concA concB, id(id) keep(seq) stat(auc) ................................ . list, sep(8) abbrev(10) id seq auc_concA auc_concB 1. 2. 3. 4. 5. 6. 7. 8. 1 2 3 4 5 7 8 9 1 1 1 1 1 1 1 1 150.9643 146.7606 160.6548 157.8622 133.6957 160.639 131.2604 168.5186 218.5551 133.3201 126.0635 96.17461 188.9038 223.6922 104.0139 237.8962 9. 10. 11. 12. 13. 14. 15. 16. 10 12 13 14 15 18 19 20 2 2 2 2 2 2 2 2 137.0627 153.4038 163.4593 146.0462 158.1457 147.1977 164.9988 145.3823 139.7382 202.3942 136.7848 104.5191 165.8654 139.235 166.2391 158.5146 For this example, we chose to use the AUC for two drugs as our pharmacokinetic measure. We could have used any of the measures computed by pkexamine. In addition to the AUCs, the dataset also contains a sequence variable for each subject indicating when each treatment was administered. The data produced by pkcollapse are in what Stata calls wide format; that is, there is one observation per subject containing two or more outcomes. To use pkcross and pkequiv, we need to transform these data to long format. This goal can be accomplished using pkshape; see [R] pkshape. Consider the first subject in the dataset. This subject is in sequence one, which means that treatment A was applied during the first period of the study and treatment B was applied in the second period of the study. We need to split the first observation into two observations so that the outcome measure is only in one variable. Also we need two new variables, one indicating the treatment the subject received and another recording the period of the study when the subject received that treatment. We might expect the expansion of the first subject to be id 1 1 sequence 1 1 auc 150.9643 218.5551 treat A B period 1 2 We see that subject number 1 was in sequence 1, had an AUC of 150.9643 when treatment A was applied in the first period of the study, and had an AUC of 218.5551 when treatment B was applied. Similarly, the expansion of subject 10 (the first subject in sequence 2) would be id 10 10 sequence 2 2 auc 137.0627 139.7382 treat B A period 1 2 Here treatment B was applied to the subject during the first period of the study, and treatment A was applied to the subject during the second period of the study. An additional complication is common in crossover study designs. The treatment applied in the first period of the study might still have some effect on the outcome in the second period. In this example, pk — Pharmacokinetic (biopharmaceutical) data 1589 each subject was given one treatment followed by another treatment. To get accurate estimates of treatment effects, it is necessary to account for the effect that the first treatment has in the second period of the study. This is called the carryover effect. We must, therefore, have a variable that indicates which treatment was applied in the first treatment period. pkshape creates a variable that indicates the carryover effect. For treatments applied during the first treatment period, there will never be a carryover effect. 
Thus the expanded data created by pkshape for subject 1 will be id 1 1 sequence 1 1 outcome 150.9643 218.5551 treat A B period 1 2 carry 0 A treat B A period 1 2 carry 0 B and the data for subject 10 will be id 10 10 sequence 2 2 outcome 137.0627 139.7382 We pkshape the data: . pkshape id seq auc*, order(ab ba) . sort id sequence period . list, sep(16) id sequence outcome treat carry period 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 1 1 2 2 3 3 4 4 5 5 7 7 8 8 9 9 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 150.9643 218.5551 146.7606 133.3201 160.6548 126.0635 157.8622 96.17461 133.6957 188.9038 160.639 223.6922 131.2604 104.0139 168.5186 237.8962 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 10 10 12 12 13 13 14 14 15 15 18 18 19 19 20 20 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 137.0627 139.7382 153.4038 202.3942 163.4593 136.7848 146.0462 104.5191 158.1457 165.8654 147.1977 139.235 164.9988 166.2391 145.3823 158.5146 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 As an aside, crossover designs do not require that each subject receive each treatment, but if they do, the crossover design is referred to as a complete crossover design. 1590 pk — Pharmacokinetic (biopharmaceutical) data The last dataset is organized in a manner that can be analyzed with Stata. To fit an ANOVA model to these data, we can use anova or pkcross. To conduct equivalence tests, we can use pkequiv. This example is further analyzed in [R] pkcross and [R] pkequiv. References Chow, S.-C., and J.-P. Liu. 2009. Design and Analysis of Bioavailability and Bioequivalence Studies. 3rd ed. Boca Raton, FL: Chapman & Hall/CRC. Ratkowsky, D. A., M. A. Evans, and J. R. Alldredge. 1993. Cross-over Experiments: Design, Analysis, and Application. New York: Dekker. Title pkcollapse — Generate pharmacokinetic measurement dataset Syntax Remarks and examples Menu Methods and formulas Description Also see Options Syntax pkcollapse time concentration if , id(id var) options Description options Main ∗ id(id var) stat(measures) trapezoid fit(#) keep(varlist) force nodots ∗ subject ID variable create specified measures; default is all use trapezoidal rule; default is cubic splines use # points to estimate AUC0,∞ ; default is fit(3) keep variables in varlist force collapse suppress dots during calculation id(id var) is required. measures Description auc aucline aucexp area under the concentration-time curve (AUC0,∞ ) area under the concentration-time curve from 0 to ∞ using a linear extension area under the concentration-time curve from 0 to ∞ using an exponential extension area under the log-concentration-time curve extended with a linear fit half-life of the drug elimination rate maximum concentration time at last concentration time of maximum concentration auclog half ke cmax tmax tomc Menu Statistics > Epidemiology and related > Other > Generate pharmacokinetic measurement dataset Description pkcollapse generates new variables with the pharmacokinetic summary measures of interest. pkcollapse is one of the pk commands. Please read [R] pk before reading this entry. 1591 1592 pkcollapse — Generate pharmacokinetic measurement dataset Options Main id(id var) is required and specifies the variable that contains the subject ID over which pkcollapse is to operate. stat(measures) specifies the measures to be generated. The default is to generate all the measures. 
trapezoid tells Stata to use the trapezoidal rule when calculating the AUC. The default is to use cubic splines, which give better results for most functions. When the curve is irregular, trapezoid may give better results. fit(#) specifies the number of points to use in estimating the AUC0,∞ . The default is fit(3), the last three points. This number should be viewed as a minimum; the appropriate number of points will depend on your data. keep(varlist) specifies the variables to be kept during the collapse. Variables not specified with the keep() option will be dropped. When keep() is specified, the keep variables are checked to ensure that all values of the variables are the same within id var. force forces the collapse, even when the values of the keep() variables are different within the id var. nodots suppresses the display of dots during calculation. Remarks and examples pkcollapse generates all the summary pharmacokinetic measures. Example 1 We demonstrate the use of pkcollapse with the data described in [R] pk. We have drug concentration data on 15 subjects. Each subject is measured at 13 time points over a 32-hour period. Some of the records are . use http://www.stata-press.com/data/r13/pkdata . list, sep(0) id seq 1. 2. 3. 4. 5. 6. 1 1 1 1 1 1 1 1 1 1 1 1 14. 15. 16. 17. 18. 19. 2 2 2 2 2 2 1 1 1 1 1 1 207. 208. 20 20 2 2 time concA concB 0 0 .5 3.073403 1 5.188444 1.5 5.898577 2 5.096378 3 6.094085 (output omitted ) 0 0 .5 2.48462 1 4.883569 1.5 7.253442 2 5.849345 3 6.761085 (output omitted ) 24 4.673281 32 3.487347 0 3.712592 6.230602 7.885944 9.241735 13.10507 0 .9209593 5.925818 8.710549 10.90552 8.429898 6.059818 5.213639 pkcollapse — Generate pharmacokinetic measurement dataset 1593 Although pksumm allows us to view all the pharmacokinetic measures, we can create a dataset with the measures by using pkcollapse. . pkcollapse time concA concB, id(id) stat(auc) keep(seq) ................................ . list, sep(8) abbrev(10) id seq auc_concA auc_concB 1. 2. 3. 4. 5. 6. 7. 8. 1 2 3 4 5 7 8 9 1 1 1 1 1 1 1 1 150.9643 146.7606 160.6548 157.8622 133.6957 160.639 131.2604 168.5186 218.5551 133.3201 126.0635 96.17461 188.9038 223.6922 104.0139 237.8962 9. 10. 11. 12. 13. 14. 15. 16. 10 12 13 14 15 18 19 20 2 2 2 2 2 2 2 2 137.0627 153.4038 163.4593 146.0462 158.1457 147.1977 164.9988 145.3823 139.7382 202.3942 136.7848 104.5191 165.8654 139.235 166.2391 158.5146 The resulting dataset, which we will call pkdata2, contains 1 observation per subject. This dataset is in wide format. If we want to use pkcross or pkequiv, we must transform these data to long format, which we do in the last example of [R] pkshape. Methods and formulas The statistics generated by pkcollapse are described in [R] pkexamine. 
Also see [R] pk — Pharmacokinetic (biopharmaceutical) data Title pkcross — Analyze crossover experiments Syntax Remarks and examples Menu Methods and formulas Description References Options Also see Syntax pkcross outcome if in , options Description options Model sequence(varname) treatment(varname) period(varname) id(varname) carryover(varname) carryover(none) model(string) sequential sequence variable; default is sequence(sequence) treatment variable; default is treatment(treat) period variable; default is period(period) ID variable name of carryover variable; default is carryover(carry) omit carryover effects from model; default is carryover(carry) specify the model to fit estimate sequential instead of partial sums of squares Parameterization estimate mean and the period, treatment, and sequence effects; assume no carryover effects exist; the default estimate mean and the period, treatment, and carryover effects; assume no sequence effects exist estimate mean, period and treatment effects, and period-by-treatment interaction; assume no sequence or carryover effects exist estimate mean, period and treatment effects, and period-by-treatment interaction; assume no period or crossover effects exist param(3) param(1) param(2) param(4) Menu Statistics > Epidemiology and related > Other > Analyze crossover experiments Description pkcross analyzes data from a crossover design experiment. When analyzing pharmaceutical trial data, if the treatment, carryover, and sequence variables are known, the omnibus test for separability of the treatment and carryover effects is calculated. pkcross is one of the pk commands. Please read [R] pk before reading this entry. 1594 pkcross — Analyze crossover experiments 1595 Options Model sequence(varname) specifies the variable that contains the sequence in which the treatment was administered. If this option is not specified, sequence(sequence) is assumed. treatment(varname) specifies the variable that contains the treatment information. If this option is not specified, treatment(treat) is assumed. period(varname) specifies the variable that contains the period information. If this option is not specified, period(period) is assumed. id(varname) specifies the variable that contains the subject identifiers. If this option is not specified, id(id) is assumed. carryover(varname | none) specifies the variable that contains the carryover information. If carry(none) is specified, the carryover effects are omitted from the model. If this option is not specified, carryover(carry) is assumed. model(string) specifies the model to be fit. For higher-order crossover designs, this option can be useful if you want to fit a model other than the default. However, anova (see [R] anova) can also be used to fit a crossover model. The default model for higher-order crossover designs is outcome predicted by sequence, period, treatment, and carryover effects. By default, the model statement is model(sequence period treat carry). sequential specifies that sequential sums of squares be estimated. Parameterization param(#) specifies which of the four parameterizations to use for the analysis of a 2 × 2 crossover experiment. This option is ignored with higher-order crossover designs. The default is param(3). See the technical note for 2 × 2 crossover designs for more details. param(3) estimates the overall mean, the period effects, the treatment effects, and the sequence effects, assuming that no carryover effects exist. This is the default parameterization. 
param(1) estimates the overall mean, the period effects, the treatment effects, and the carryover effects, assuming that no sequence effects exist. param(2) estimates the overall mean, the period effects, the treatment effects, and the period-bytreatment interaction, assuming that no sequence or carryover effects exist. param(4) estimates the overall mean, the sequence effects, the treatment effects, and the sequenceby-treatment interaction, assuming that no period or crossover effects exist. When the sequence by treatment is equivalent to the period effect, this reduces to the third parameterization. Remarks and examples pkcross is designed to analyze crossover experiments. Use pkshape first to reshape your data; see [R] pkshape. pkcross assumes that the data were reshaped by pkshape or are organized in the same manner as produced with pkshape. Washout periods are indicated by the number 0. See the technical note in this entry for more information on analyzing 2 × 2 crossover experiments. 1596 pkcross — Analyze crossover experiments Technical note The 2 × 2 crossover design cannot be used to estimate more than four parameters because there are only four pieces of information (the four cell means) collected. pkcross uses ANOVA models to analyze the data, so one of the four parameters must be the overall mean of the model, leaving just 3 degrees of freedom to estimate the remaining effects (period, sequence, treatment, and carryover). Thus the model is overparameterized. Estimation of treatment and carryover effects requires the assumption of either no period effects or no sequence effects. Some researchers maintain that it estimating carryover effects at the expense of other effects is a bad idea. This is a limitation of this design. pkcross implements four parameterizations for this model. They are numbered sequentially from one to four and are described in Options. Example 1 Consider the example data published in Chow and Liu (2009, 71) and described in [R] pkshape. We have entered and reshaped the data with pkshape and have variables that identify the subjects, periods, treatments, sequence, and carryover treatment. To compute the ANOVA table, use pkcross: . use http://www.stata-press.com/data/r13/chowliu . pkshape id seq period1 period2, order(ab ba) . pkcross outcome sequence variable = sequence period variable = period treatment variable = treat carryover variable = carry id variable = id Analysis of variance (ANOVA) for a 2x2 crossover study Source of Variation Partial SS df MS F Prob > F Intersubjects Sequence effect Residuals 276.00 16211.49 1 22 276.00 736.89 0.37 4.41 0.5468 0.0005 Intrasubjects Treatment effect Period effect Residuals 62.79 35.97 3679.43 1 1 22 62.79 35.97 167.25 0.38 0.22 0.5463 0.6474 Total 20265.68 47 Omnibus measure of separability of treatment and carryover = 29.2893% There is evidence of intersubject variability, but there are no other significant effects. The omnibus test for separability is a measure reflecting the degree to which the study design allows the treatment effects to be estimated independently of the carryover effects. The measure of separability of the treatment and carryover effects indicates approximately 29% separability, which can be interpreted as the degree to which the treatment and carryover effects are orthogonal. This is a characteristic of the design of the study. For a complete discussion, see Ratkowsky, Evans, and Alldredge (1993). 
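As an aside, the same decomposition can be obtained directly from anova, as noted in the description of the model() option. A sketch of the equivalent specification, assuming the variable names that pkshape created above (shown only for illustration; it is not part of the original example):
. * subjects nested within sequence supply the error term for the sequence effect
. anova outcome sequence / id|sequence treat period
The sequence effect is then tested against the id|sequence mean square, and the treatment and period effects against the residual, which should reproduce the intersubjects and intrasubjects strata of the pkcross table.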
Compared to the output in Chow and Liu (2009), the sequence effect is mislabeled as a carryover effect. See Ratkowsky, Evans, and Alldredge (1993, sec. 3.2) for a complete discussion of the mislabeling. pkcross — Analyze crossover experiments 1597 By specifying param(1), we obtain parameterization 1 for this model. . pkcross outcome, param(1) sequence variable = sequence period variable = period treatment variable = treat carryover variable = carry id variable = id Analysis of variance (ANOVA) for a 2x2 crossover study Partial SS df MS F Prob > F Source of Variation Treatment effect Period effect Carryover effect Residuals 301.04 255.62 276.00 19890.92 1 1 1 44 301.04 255.62 276.00 452.07 0.67 0.57 0.61 Total 20265.68 47 Omnibus measure of separability of treatment and carryover = 0.4189 0.4561 0.4388 29.2893% Example 2 Consider the case of a two-treatment, four-sequence, two-period crossover design. This design is commonly referred to as Balaam’s design (Balaam 1968). Ratkowsky, Evans, and Alldredge (1993, 140) published the following data from an amantadine trial, originally published by Taka and Armitage (1983): . use http://www.stata-press.com/data/r13/balaam, clear . list, sep(0) 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. id seq period1 period2 period3 1 2 3 4 1 2 3 4 5 1 2 3 4 1 2 3 4 -ab -ab -ab -ab -ba -ba -ba -ba -ba -aa -aa -aa -aa -bb -bb -bb -bb 9 12 17 21 23 15 13 24 18 14 27 19 30 21 11 20 25 8.75 10.5 15 21 22 15 14 22.75 17.75 12.5 24.25 17.25 28.25 20 10.5 19.5 22.5 8.75 9.75 18.5 21.5 18 13 13.75 21.5 16.75 14 22.5 16.25 29.75 19.51 10 20.75 23.5 The sequence identifier must be a string with zeros to indicate washout or baseline periods, or a number. If the sequence identifier is numeric, the order option must be specified with pkshape. If the sequence identifier is a string, pkshape will create sequence, period, and treatment identifiers without the order option. In this example, the dash is used to indicate a baseline period, which is an invalid code for this purpose. As a result, the data must be encoded; see [D] encode. 1598 pkcross — Analyze crossover experiments . encode seq, gen(num_seq) . pkshape id num_seq period1 period2 period3, order(0aa 0ab 0ba 0bb) . pkcross outcome, se sequence variable = period variable = treatment variable = carryover variable = id variable = Analysis of variance (ANOVA) for a crossover study SS df MS F Source of Variation sequence period treat carry id Prob > F Intersubjects Sequence effect Residuals 285.82 1221.49 3 13 95.27 93.96 1.01 59.96 0.4180 0.0000 Intrasubjects Period effect Treatment effect Carryover effect Residuals 15.13 8.48 0.11 29.56 2 1 1 30 7.56 8.48 0.11 0.99 6.34 8.86 0.12 0.0048 0.0056 0.7366 Total 1560.59 50 Omnibus measure of separability of treatment and carryover = 64.6447% In this example, the sequence specifier used dashes instead of zeros to indicate a baseline period during which no treatment was given. For pkcross to work, we need to encode the string sequence variable and then use the order option with pkshape. A word of caution: encode does not necessarily choose the first sequence to be sequence 1, as in this example. Always double-check the sequence numbering when using encode. pkcross — Analyze crossover experiments Example 3 Continuing with the example from [R] pkshape, we fit an ANOVA model. . use http://www.stata-press.com/data/r13/pkdata3, clear . list, sep(8) id sequence outcome treat carry period 1. 2. 3. 4. 5. 6. 7. 8. 
1 2 3 4 5 7 8 9 1 1 1 1 1 1 1 1 150.9643 146.7606 160.6548 157.8622 133.6957 160.639 131.2604 168.5186 A A A A A A A A 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 9. 10. 11. 12. 13. 14. 15. 16. 10 12 13 14 15 18 19 20 2 2 2 2 2 2 2 2 137.0627 153.4038 163.4593 146.0462 158.1457 147.1977 164.9988 145.3823 B B B B B B B B 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 17. 18. 19. 20. 21. 22. 23. 24. 1 2 3 4 5 7 8 9 1 1 1 1 1 1 1 1 218.5551 133.3201 126.0635 96.17461 188.9038 223.6922 104.0139 237.8962 B B B B B B B B A A A A A A A A 2 2 2 2 2 2 2 2 25. 26. 27. 28. 29. 30. 31. 32. 10 12 13 14 15 18 19 20 2 2 2 2 2 2 2 2 139.7382 202.3942 136.7848 104.5191 165.8654 139.235 166.2391 158.5146 A A A A A A A A B B B B B B B B 2 2 2 2 2 2 2 2 1599 1600 pkcross — Analyze crossover experiments The ANOVA model is fit using pkcross: . pkcross outcome sequence variable = sequence period variable = period treatment variable = treat carryover variable = carry id variable = id Analysis of variance (ANOVA) for a 2x2 crossover study Partial SS df MS F Prob > F Source of Variation Intersubjects Sequence effect Residuals 378.04 17991.26 1 14 378.04 1285.09 0.29 1.40 0.5961 0.2691 Intrasubjects Treatment effect Period effect Residuals 455.04 419.47 12860.78 1 1 14 455.04 419.47 918.63 0.50 0.46 0.4931 0.5102 Total 32104.59 31 Omnibus measure of separability of treatment and carryover = 29.2893% Example 4 Consider the case of a six-treatment crossover trial in which the squares are not variance balanced. The following dataset is from a partially balanced crossover trial published by Patterson and Lucas (1962) and reproduced in Ratkowsky, Evans, and Alldredge (1993, 231): . use http://www.stata-press.com/data/r13/nobalance . list, sep(4) cow seq period1 period2 period3 period4 block 1. 2. 3. 4. 1 2 3 4 adbe baed ebda deab 38.7 48.9 34.6 35.2 37.4 46.9 32.3 33.5 34.3 42 28.5 28.4 31.3 39.6 27.1 25.1 1 1 1 1 5. 6. 7. 8. 1 2 3 4 dafc fdca cfad acdf 32.9 30.4 30.8 25.7 33.1 29.5 29.3 26.1 27.5 26.7 26.4 23.4 25.1 23.1 23.2 18.7 2 2 2 2 9. 10. 11. 12. 1 2 3 4 efbc becf fceb cbfe 25.4 21.8 21.4 22.8 26 23.9 22 21 23.9 21.7 19.4 18.6 19.9 17.6 16.6 16.1 3 3 3 3 pkcross — Analyze crossover experiments 1601 When there is no variance balance in the design, a square or blocking variable is needed to indicate in which treatment cell a sequence was observed, but the mechanical steps are the same. . pkshape cow seq period1 period2 period3 period4 . pkcross outcome, model(block cow|block period|block treat carry) se Number of obs = 48 R-squared = 0.9965 Root MSE = .740408 Adj R-squared = 0.9903 Seq. SS df MS F Prob > F Source Model 2650.1331 30 88.3377701 161.14 0.0000 block cow|block period|block treat carry 1607.01128 628.706274 408.031253 2.50000057 3.88428906 2 9 9 5 5 803.505642 69.8562527 45.3368059 .500000114 .776857812 1465.71 127.43 82.70 0.91 1.42 0.0000 0.0000 0.0000 0.4964 0.2680 Residual 9.31945887 17 .548203463 Total 2659.45256 47 56.584097 When the model statement is used and the omnibus measure of separability is desired, specify the variables in the treatment(), carryover(), and sequence() options to pkcross. Methods and formulas pkcross uses ANOVA to fit models for crossover experiments; see [R] anova. The omnibus measure of separability is S = 100(1 − V )% where V is Cramér’s V and is defined as χ2 N ( V = ) 21 min(r − 1, c − 1) The χ2 is calculated as ( 2 χ = XX i j 2 (Oij − Eij ) Eij ) where O and E are the observed and expected counts in a table of the number of times each treatment is followed by the other treatments. 
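Restated in conventional notation (this is only a transcription of the definitions above; N denotes the total number of counts in that table, the usual normalization for Cramér's V, and r and c are the numbers of rows and columns of the table):
S = 100\,(1 - V)\%
V = \left\{ \frac{\chi^2}{N \, \min(r-1,\; c-1)} \right\}^{1/2}
\chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}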
References Balaam, L. N. 1968. A two-period design with t2 experimental units. Biometrics 24: 61–73. Chow, S.-C., and J.-P. Liu. 2009. Design and Analysis of Bioavailability and Bioequivalence Studies. 3rd ed. Boca Raton, FL: Chapman & Hall/CRC. Kutner, M. H., C. J. Nachtsheim, J. Neter, and W. Li. 2005. Applied Linear Statistical Models. 5th ed. New York: McGraw–Hill/Irwin. 1602 pkcross — Analyze crossover experiments Patterson, H. D., and H. L. Lucas. 1962. Change-over designs. Technical Bulletin 147, North Carolina Agricultural Experiment Station and the USDA. Ratkowsky, D. A., M. A. Evans, and J. R. Alldredge. 1993. Cross-over Experiments: Design, Analysis, and Application. New York: Dekker. Taka, M. T., and P. Armitage. 1983. Autoregressive models in clinical trials. Communications in Statistics—Theory and Methods 12: 865–876. Also see [R] pk — Pharmacokinetic (biopharmaceutical) data Title pkequiv — Perform bioequivalence tests Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax pkequiv outcome treatment period sequence id if in , options Description options Options compare(string) limit(#) level(#) fieller symmetric anderson tost noboot compare the two specified values of the treatment variable equivalence limit (between 0.10 and 0.99); default is 0.2 set confidence level; default is level(90) calculate confidence interval by Fieller’s theorem calculate symmetric equivalence interval Anderson and Hauck hypothesis test for bioequivalence two one-sided hypothesis tests for bioequivalence do not estimate probability that CI lies within confidence limits Menu Statistics > Epidemiology and related > Other > Bioequivalence tests Description pkequiv performs bioequivalence testing for two treatments. By default, pkequiv calculates a standard confidence interval symmetric about the difference between the two treatment means. pkequiv also calculates confidence intervals symmetric about zero and intervals based on Fieller’s theorem. Also, pkequiv can perform interval hypothesis tests for bioequivalence. pkequiv is one of the pk commands. Please read [R] pk before reading this entry. Options Options compare(string) specifies the two treatments to be tested for equivalence. Sometimes there may be more than two treatments, but the equivalence can be determined only between any two treatments. limit(#) specifies the equivalence limit. The default is 0.2. The equivalence limit can be changed only symmetrically; that is, it is not possible to have a 0.15 lower limit and a 0.2 upper limit in the same test. level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(90). This setting is not controlled by the set level command. 1603 1604 pkequiv — Perform bioequivalence tests fieller specifies that an equivalence interval based on Fieller’s theorem be calculated. symmetric specifies that a symmetric equivalence interval be calculated. anderson specifies that the Anderson and Hauck (1983) hypothesis test for bioequivalence be computed. This option is ignored when calculating equivalence intervals based on Fieller’s theorem or when calculating a confidence interval that is symmetric about zero. tost specifies that the two one-sided hypothesis tests for bioequivalence be computed. This option is ignored when calculating equivalence intervals based on Fieller’s theorem or when calculating a confidence interval that is symmetric about zero. 
noboot prevents the estimation of the probability that the confidence interval lies within the confidence limits. If this option is not specified, this probability is estimated by resampling the data. Remarks and examples pkequiv is designed to conduct tests for bioequivalence based on data from a crossover experiment. pkequiv requires that the user specify the outcome, treatment, period, sequence, and id variables. The data must be in the same format as that produced by pkshape; see [R] pkshape. pkequiv — Perform bioequivalence tests 1605 Example 1 We have the following data on which we want to conduct a bioequivalence test between treat = A and treat = B . . use http://www.stata-press.com/data/r13/pkdata3 . list, sep(4) id sequence outcome treat carry period 1. 2. 3. 4. 1 2 3 4 1 1 1 1 150.9643 146.7606 160.6548 157.8622 A A A A 0 0 0 0 1 1 1 1 5. 6. 7. 8. 5 7 8 9 1 1 1 1 133.6957 160.639 131.2604 168.5186 A A A A 0 0 0 0 1 1 1 1 9. 10. 11. 12. 10 12 13 14 2 2 2 2 137.0627 153.4038 163.4593 146.0462 B B B B 0 0 0 0 1 1 1 1 13. 14. 15. 16. 15 18 19 20 2 2 2 2 158.1457 147.1977 164.9988 145.3823 B B B B 0 0 0 0 1 1 1 1 17. 18. 19. 20. 1 2 3 4 1 1 1 1 218.5551 133.3201 126.0635 96.17461 B B B B A A A A 2 2 2 2 21. 22. 23. 24. 5 7 8 9 1 1 1 1 188.9038 223.6922 104.0139 237.8962 B B B B A A A A 2 2 2 2 25. 26. 27. 28. 10 12 13 14 2 2 2 2 139.7382 202.3942 136.7848 104.5191 A A A A B B B B 2 2 2 2 29. 30. 31. 32. 15 18 19 20 2 2 2 2 165.8654 139.235 166.2391 158.5146 A A A A B B B B 2 2 2 2 . set seed 1 . pkequiv outcome treat period seq id Classic confidence interval for bioequivalence difference: ratio: [equivalence limits] [ -30.296 80% -11.332 92.519% 30.296 120% test limits probability test limits are within equivalence limits = note: reference treatment = 1 ] 26.416 117.439% 0.6410 1606 pkequiv — Perform bioequivalence tests The default output for pkequiv shows a confidence interval for the difference of the means (test limits), the ratio of the means, and the federal equivalence limits. The classic confidence interval can be constructed around the difference between the average measure of effect for the two drugs or around the ratio of the average measure of effect for the two drugs. pkequiv reports both the difference measure and the ratio measure. For these data, U.S. federal government regulations state that the confidence interval for the difference must be entirely contained within the range [ −30.296, 30.296 ] and between 80% and 120% for the ratio. Here the test limits are within the equivalence limits. Although the test limits are inside the equivalence limits, there is only a 64% assurance that the observed confidence interval will be within the equivalence limits in the long run. This is an interesting case because, although this sample shows bioequivalence, the evaluation of the long-run performance indicates possible problems. These fictitious data were generated with high intersubject variability, which causes poor long-run performance. If we conduct a bioequivalence test with the data published in Chow and Liu (2009, 71), which we introduced in [R] pk and fully described in [R] pkshape, we observe that the probability that the test limits are within the equivalence limits is high. . use http://www.stata-press.com/data/r13/chowliu2 . set seed 1 . 
pkequiv outcome treat period seq id Classic confidence interval for bioequivalence [equivalence limits] difference: ratio: -16.512 80% [ 16.512 120% test limits -8.698 89.464% probability test limits are within equivalence limits = note: reference treatment = 1 ] 4.123 104.994% 0.9980 For these data, the test limits are well within the equivalence limits, and the probability that the test limits are within the equivalence limits is 99.8%. Example 2 We compute a confidence interval that is symmetric about zero: . pkequiv outcome treat period seq id, symmetric Westlake’s symmetric confidence interval for bioequivalence [Equivalence limits] Test formulation: 75.145 89.974 [ Test mean ] 80.272 note: reference treatment = 1 The reported equivalence limit is constructed symmetrically about the reference mean, which is equivalent to constructing a confidence interval symmetric about zero for the difference in the two drugs. In the output above, we see that the test formulation mean of 80.272 is within the equivalence limits, indicating that the test drug is bioequivalent to the reference drug. pkequiv — Perform bioequivalence tests 1607 pkequiv displays interval hypothesis tests of bioequivalence if you specify the tost or the anderson option, or both. For example, . pkequiv outcome treat period seq id, tost anderson Classic confidence interval for bioequivalence [equivalence limits] difference: ratio: -16.512 80% [ 16.512 120% test limits -8.698 89.464% probability test limits are within equivalence limits = ] 4.123 104.994% 0.9990 Schuirmann’s two one-sided tests upper test statistic = -5.036 lower test statistic = 3.810 Anderson and Hauck’s test noncentrality parameter = 4.423 test statistic = -0.613 note: reference treatment = 1 p-value = p-value = 0.000 0.001 empirical p-value = 0.0005 Both of Schuirmann’s one-sided tests are highly significant, suggesting that the two drugs are bioequivalent. A similar conclusion is drawn from the Anderson and Hauck test of bioequivalence. Stored results pkequiv stores the following in r(): Scalars r(stddev) r(uci) r(lci) r(delta) r(u3) r(l3) pooled-sample standard deviation of period differences from both sequences upper confidence interval for a classic interval lower confidence interval for a classic interval delta value used in calculating a symmetric confidence interval upper confidence interval for Fieller’s confidence interval lower confidence interval for Fieller’s confidence interval Methods and formulas The lower confidence interval for the difference in the two treatments for the classic shortest confidence interval is r 1 1 L1 = Y T − Y R − t(α,n1 +n2 −2) σ bd + n1 n2 The upper limit is r bd U1 = Y T − Y R + t(α,n1 +n2 −2) σ 1 1 + n1 n2 1608 pkequiv — Perform bioequivalence tests The limits for the ratio measure are L1 + 1 100% YR U1 + 1 100% YR L2 = and U2 = where Y T is the mean of the test formulation of the drug, Y R is the mean of the reference formulation of the drug, and t(α,n1 +n2 −2) is the t distribution with n1 + n2 − 2 degrees of freedom. 
σ bd is the pooled sample variance of the period differences from both sequences, defined as 2 σ bd = n k XX 2 1 dik − d.k n1 + n2 − 2 i=1 k=1 The upper and lower limits for the symmetric confidence interval are Y R + ∆ and Y R − ∆, where r 1 1 + − YT −YR n1 n2 r 1 1 + +2 YT −YR n1 n2 ∆ = k1 σ bd and (simultaneously) ∆ = −k2 σ bd and k1 and k2 are computed iteratively to satisfy the above equalities and the condition Z k2 f (t)dt = 1 − 2α k1 where f (t) is the probability density function of the t distribution with n1 + n2 − 2 degrees of freedom. See Chow and Liu (2009, 88–92) for details about calculating the confidence interval based on Fieller’s theorem. The two test statistics for the two one-sided tests of equivalence are Y T − Y R − θL q TL = σ bd n11 + n12 and Y T − Y R − θU q TU = σ bd n11 + n12 where −θL = θU and are the regulated confidence limits. pkequiv — Perform bioequivalence tests 1609 The logic of the Anderson and Hauck test is tricky; see Chow and Liu (2009) for a complete explanation. However, the test statistic is TAH YT −YR − q = σ bd n11 + θL +θU 2 1 n2 and the noncentrality parameter is estimated by δb = θU − θL q 2b σd n11 + 1 n2 The empirical p-value is calculated as p = Ft |TAH | − δb − Ft − |TAH | − δb where Ft is the cumulative distribution function of the t distribution with n1 + n2 − 2 degrees of freedom. References Anderson, S., and W. W. Hauck. 1983. A new procedure for testing equivalence in comparative bioavailability and other clinical trials. Communications in Statistics—Theory and Methods 12: 2663–2692. Chow, S.-C., and J.-P. Liu. 2009. Design and Analysis of Bioavailability and Bioequivalence Studies. 3rd ed. Boca Raton, FL: Chapman & Hall/CRC. Fieller, E. C. 1954. Some problems in interval estimation. Journal of the Royal Statistical Society, Series B 16: 175–185. Kutner, M. H., C. J. Nachtsheim, J. Neter, and W. Li. 2005. Applied Linear Statistical Models. 5th ed. New York: McGraw–Hill/Irwin. Locke, C. S. 1984. An exact confidence interval from untransformed data for the ratio of two formulation means. Journal of Pharmacokinetics and Biopharmaceutics 12: 649–655. Schuirmann, D. J. 1989. Confidence intervals for the ratio of two means from a cross-over study. In Proceedings of the Biopharmaceutical Section, 121–126. Washington, DC: American Statistical Association. Westlake, W. J. 1976. Symmetrical confidence intervals for bioequivalence trials. Biometrics 32: 741–744. Also see [R] pk — Pharmacokinetic (biopharmaceutical) data Title pkexamine — Calculate pharmacokinetic measures Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options Reference Syntax pkexamine time concentration if in , options Description options Main fit(#) trapezoid graph line log exp(#) use # points to estimate AUC0,∞ ; default is fit(3) use trapezoidal rule; default is cubic splines graph the AUC graph the linear extension graph the log extension plot the exponential fit for the AUC0,∞ AUC plot cline options marker options marker label options affect rendition of plotted points connected by lines change look of markers (color, size, etc.) add marker labels; change look or position Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options by is allowed; see [D] by. 
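Because the by prefix is allowed, per-subject results for data containing several subjects can be requested with a call along the following lines (a sketch only; id, time, and conc are illustrative variable names rather than part of the syntax diagram):
. * display one set of pharmacokinetic measures for each subject
. by id, sort: pkexamine time conc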
Menu Statistics > Epidemiology and related > Other > Pharmacokinetic measures Description pkexamine calculates pharmacokinetic measures from time-and-concentration subject-level data. pkexamine computes and displays the maximum measured concentration, the time at the maximum measured concentration, the time of the last measurement, the elimination time, the half-life, and the area under the concentration-time curve (AUC). Three estimates of the area under the concentration-time curve from 0 to infinity (AUC0,∞ ) are also calculated. pkexamine is one of the pk commands. Please read [R] pk before reading this entry. 1610 pkexamine — Calculate pharmacokinetic measures 1611 Options Main fit(#) specifies the number of points, counting back from the last measurement, to use in fitting the extension to estimate the AUC0,∞ . The default is fit(3), or the last three points. This value should be viewed as a minimum; the appropriate number of points will depend on your data. trapezoid specifies that the trapezoidal rule be used to calculate the AUC. The default is cubic splines, which give better results for most functions. When the curve is irregular, trapezoid may give better results. graph tells pkexamine to graph the concentration-time curve. line and log specify the estimates of the AUC0,∞ to display when graphing the AUC0,∞ . These options are ignored, unless they are specified with the graph option. exp(#) specifies that the exponential fit for the AUC0,∞ be plotted. You must specify the maximum time value to which you want to plot the curve, and this time value must be greater than the maximum time measurement in the data. If you specify 0, the curve will be plotted to the point at which the linear extension would cross the x axis. This option is not valid with the line or log option and is ignored, unless the graph option is also specified. AUC plot cline options affect the rendition of the plotted points connected by lines; see [G-3] cline options. marker options specify the look of markers. This look includes the marker symbol, the marker size, and its color and outline; see [G-3] marker options. marker label options specify if and how the markers are to be labeled; see [G-3] marker label options. Add plots addplot(plot) provides a way to add other plots to the generated graph. See [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Remarks and examples pkexamine computes summary statistics for a given patient in a pharmacokinetic trial. If by idvar: is specified, statistics will be displayed for each subject in the data. Example 1 Chow and Liu (2009, 13) present data on a study examining primidone concentrations versus time for a subject over a 32-hour period after dosing. 1612 pkexamine — Calculate pharmacokinetic measures . use http://www.stata-press.com/data/r13/auc . list, abbrev(14) id time concentration 1. 2. 3. 4. 5. 1 1 1 1 1 0 .5 1 1.5 2 0 0 2.8 4.4 4.4 6. 7. 8. 9. 10. 1 1 1 1 1 3 4 6 8 12 4.7 4.1 4 3.6 3 11. 12. 13. 1 1 1 16 24 32 2.5 2 1.6 We use pkexamine to produce the summary statistics: . pkexamine time conc, graph Maximum concentration Time of maximum concentration Time of last observation (Tmax) Elimination rate Half life = = = = = 4.7 3 32 0.0279 24.8503 Area under the curve AUC [0, inf.) Linear of log conc. 
AUC [0, Tmax] 85.24 142.603 AUC [0, inf.) Linear fit AUC [0, inf.) Exponential fit 107.759 142.603 0 1 Concentration 2 3 4 5 Fit based on last 3 points. 0 10 20 Analysis Time 30 pkexamine — Calculate pharmacokinetic measures 1613 The maximum concentration of 4.7 occurs at time 3, and the time of the last observation (Tmax) is 32. In addition to the AUC, which is calculated from 0 to the maximum value of time, pkexamine also reports the area under the curve, computed by extending the curve with each of three methods: a linear fit to the log of the concentration, a linear regression line, and a decreasing exponential regression line. See Methods and formulas for details on these three methods. By default, all extensions to the AUC are based on the last three points. Looking at the graph for these data, it seems more appropriate to use the last seven points to estimate the AUC0,∞ : . pkexamine time conc, fit(7) Maximum concentration Time of maximum concentration Time of last observation (Tmax) Elimination rate Half life = = = = = 4.7 3 32 0.0349 19.8354 Area under the curve AUC [0, Tmax] AUC [0, inf.) Linear of log conc. 85.24 131.027 AUC [0, inf.) Linear fit AUC [0, inf.) Exponential fit 96.805 129.181 Fit based on last 7 points. This approach decreased the estimate of the AUC0,∞ for all extensions. To see a graph of the AUC0,∞ using a linear extension, specify the graph and line options. . pkexamine time conc, fit(7) graph line Maximum concentration Time of maximum concentration Time of last observation (Tmax) Elimination rate Half life = = = = = 4.7 3 32 0.0349 19.8354 Area under the curve AUC [0, Tmax] 85.24 AUC [0, inf.) Linear of log conc. 131.027 Fit based on last 7 points. AUC [0, inf.) Linear fit AUC [0, inf.) Exponential fit 96.805 129.181 pkexamine — Calculate pharmacokinetic measures 0 1 Concentration 2 3 4 5 1614 0 10 20 30 Analysis Time 40 50 Stored results pkexamine stores the following in r(): Scalars r(auc) r(half) r(ke) r(tmax) r(cmax) r(tomc) r(auc line) r(auc exp) r(auc ln) area under the concentration curve half-life of the drug elimination rate time at last concentration measurement maximum concentration time of maximum concentration AUC0,∞ estimated with a linear fit AUC0,∞ estimated with an exponential fit AUC0,∞ estimated with a linear fit of the natural log Methods and formulas Let i index the observations sorted by time, let k be the number of observations, and let f be the number of points specified in the fit(#) option. The AUC0,tmax is defined as tmax Z AUC0,tmax = Ct dt 0 where Ct is the concentration at time t. By default, the integral is calculated numerically using cubic splines. However, if the trapezoidal rule is used, the AUC0,tmax is given as AUC0,tmax = k X Ci−1 + Ci i=2 2 (ti − ti−1 ) pkexamine — Calculate pharmacokinetic measures 1615 The AUC0,∞ is the AUC0,tmax + AUCtmax ,∞ , or Z AUC0,∞ = tmax Z ∞ Ct dt + 0 Ct dt tmax When using the linear extension to the AUC0,tmax , the integration is cut off when the line crosses the x axis. The log extension is a linear extension on the log concentration scale. The area for the exponential extension is Z AUC0,∞ ∞ e−(β0 +tβ1 ) dt = − = tmax e−(β0 +tmax β1 ) β1 The elimination rate Keq is the negative of the slope from a linear regression of log concentration on time fit to the number of points specified in the fit(#) option: Pk Keq = − i=k−f +1 ti − t Pk i=k−f +1 lnCi − lnC 2 ti − t The half-life is thalf = ln2 Keq Reference Chow, S.-C., and J.-P. Liu. 2009. 
Design and Analysis of Bioavailability and Bioequivalence Studies. 3rd ed. Boca Raton, FL: Chapman & Hall/CRC. Also see [R] pk — Pharmacokinetic (biopharmaceutical) data Title pkshape — Reshape (pharmacokinetic) Latin-square data Syntax Remarks and examples Menu References Description Also see Options Syntax pkshape id sequence period1 period2 options Description order(string) outcome(newvar) treatment(newvar) carryover(newvar) sequence(newvar) period(newvar) apply name name name name name period list , options treatments in specified order for outcome variable; default is outcome(outcome) for treatment variable; default is treatment(treat) for carryover variable; default is carryover(carry) for sequence variable; default is sequence(sequence) for period variable; default is period(period) Menu Statistics > Epidemiology and related > Other > Reshape pharmacokinetic latin-square data Description pkshape reshapes the data for use with anova, pkcross, and pkequiv; see [R] anova, [R] pkcross, and [R] pkequiv. Latin-square and crossover data are often organized in a manner that cannot be analyzed easily with Stata. pkshape reorganizes the data in memory for use in Stata. pkshape is one of the pk commands. Please read [R] pk before reading this entry. Options order(string) specifies the order in which treatments were applied. If the sequence() specifier is a string variable that specifies the order, this option is not necessary. Otherwise, order() specifies how to generate the treatment and carryover variables. Any string variable can be used to specify the order. For crossover designs, any washout periods can be indicated with the number 0. outcome(newvar) specifies the name for the outcome variable in the reorganized data. By default, outcome(outcome) is used. treatment(newvar) specifies the name for the treatment variable in the reorganized data. By default, treatment(treat) is used. carryover(newvar) specifies the name for the carryover variable in the reorganized data. By default, carryover(carry) is used. sequence(newvar) specifies the name for the sequence variable in the reorganized data. By default, sequence(sequence) is used. period(newvar) specifies the name for the period variable in the reorganized data. By default, period(period) is used. 1616 pkshape — Reshape (pharmacokinetic) Latin-square data 1617 Remarks and examples Often data from a Latin-square experiment are naturally organized in a manner that Stata cannot manage easily. pkshape reorganizes Latin-square data so that they can be used with anova (see [R] anova) or any pk command. This includes the classic 2 × 2 crossover design commonly used in pharmaceutical research, as well as many other Latin-square designs. Example 1 Consider the example data published in Chow and Liu (2009, 71). There are 24 patients, 12 in each sequence. Sequence 1 consists of the reference formulation followed by the test formulation; sequence 2 is the test formulation followed by the reference formulation. The measurements reported are the AUC0−tmax for each patient and for each period. . use http://www.stata-press.com/data/r13/chowliu . list, sep(4) id seq period1 period2 1. 2. 3. 4. 1 4 5 6 1 1 1 1 74.675 96.4 101.95 79.05 73.675 93.25 102.125 69.45 5. 6. 7. 8. 11 12 15 16 1 1 1 1 79.05 85.95 69.725 86.275 69.025 68.7 59.425 76.125 9. 10. 11. 12. 19 20 23 24 1 1 1 1 112.675 99.525 89.425 55.175 114.875 116.25 64.175 74.575 13. 14. 15. 16. 2 3 7 8 2 2 2 2 74.825 86.875 81.675 92.7 37.35 51.925 72.175 77.5 17. 18. 19. 20. 
9 10 13 14 2 2 2 2 50.45 66.125 122.45 99.075 71.875 94.025 124.975 85.225 21. 22. 23. 24. 17 18 21 22 2 2 2 2 86.35 49.925 42.7 91.725 95.925 67.1 59.425 114.05 Because the outcome for one person is in two different variables, the treatment that was applied to an individual is a function of the period and the sequence. To analyze this treatment using anova, all the outcomes must be in one variable, and each covariate must be in its own variable. To reorganize these data, use pkshape: . pkshape id seq period1 period2, order(ab ba) . sort seq id treat 1618 pkshape — Reshape (pharmacokinetic) Latin-square data . list, sep(8) id sequence outcome treat carry period 1. 2. 3. 4. 5. 6. 7. 8. 1 1 4 4 5 5 6 6 1 1 1 1 1 1 1 1 74.675 73.675 96.4 93.25 101.95 102.125 79.05 69.45 1 2 1 2 1 2 1 2 0 1 0 1 0 1 0 1 1 2 1 2 1 2 1 2 9. 10. 11. 12. 13. 14. 15. 16. 11 11 12 12 15 15 16 16 1 1 1 1 1 1 1 1 79.05 69.025 85.95 68.7 69.725 59.425 86.275 76.125 1 2 1 2 1 2 1 2 0 1 0 1 0 1 0 1 1 2 1 2 1 2 1 2 17. 18. 19. 20. 21. 22. 23. 24. 19 19 20 20 23 23 24 24 1 1 1 1 1 1 1 1 112.675 114.875 99.525 116.25 89.425 64.175 55.175 74.575 1 2 1 2 1 2 1 2 0 1 0 1 0 1 0 1 1 2 1 2 1 2 1 2 25. 26. 27. 28. 29. 30. 31. 32. 2 2 3 3 7 7 8 8 2 2 2 2 2 2 2 2 37.35 74.825 51.925 86.875 72.175 81.675 77.5 92.7 1 2 1 2 1 2 1 2 2 0 2 0 2 0 2 0 2 1 2 1 2 1 2 1 33. 34. 35. 36. 37. 38. 39. 40. 9 9 10 10 13 13 14 14 2 2 2 2 2 2 2 2 71.875 50.45 94.025 66.125 124.975 122.45 85.225 99.075 1 2 1 2 1 2 1 2 2 0 2 0 2 0 2 0 2 1 2 1 2 1 2 1 41. 42. 43. 44. 45. 46. 47. 48. 17 17 18 18 21 21 22 22 2 2 2 2 2 2 2 2 95.925 86.35 67.1 49.925 59.425 42.7 114.05 91.725 1 2 1 2 1 2 1 2 2 0 2 0 2 0 2 0 2 1 2 1 2 1 2 1 Now the data are organized into separate variables that indicate each factor level for each of the covariates, so the data may be used with anova or pkcross; see [R] anova and [R] pkcross. pkshape — Reshape (pharmacokinetic) Latin-square data 1619 Example 2 Consider the study of background music on bank teller productivity published in Kutner et al. (2005). The data are Week 1 2 3 4 5 Monday Tuesday Wednesday Thursday Friday 18(D) 13(C) 7(A) 17(E) 21(B) 17(C) 34(B) 29(D) 13(A) 26(E) 14(A) 21(E) 32(B) 24(C) 26(D) 21(B) 16(A) 27(E) 31(D) 31(C) 17(E) 15(D) 13(C) 25(B) 7(A) The numbers are the productivity scores, and the letters represent the treatment. We entered the data into Stata: . use http://www.stata-press.com/data/r13/music, clear . list 1. 2. 3. 4. 5. id seq day1 day2 day3 day4 day5 1 2 3 4 5 dcabe cbead adbec eacdb bedca 18 13 7 17 21 17 34 29 13 26 14 21 32 24 26 21 16 27 31 31 17 15 13 25 7 We reshape these data with pkshape: . pkshape id seq day1 day2 day3 day4 day5 . list, sep(0) 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. id sequence outcome treat carry period 3 5 2 1 4 3 5 2 1 4 3 5 2 1 4 3 5 2 1 4 3 5 2 1 4 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 7 21 13 18 17 29 26 34 17 13 32 26 21 14 24 27 31 16 21 31 13 7 15 17 25 1 3 5 2 4 2 4 3 5 1 3 2 4 1 5 4 5 1 3 2 5 1 2 4 3 0 0 0 0 0 1 3 5 2 4 2 4 3 5 1 3 2 4 1 5 4 5 1 3 2 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 1620 pkshape — Reshape (pharmacokinetic) Latin-square data Here the sequence variable is a string variable that specifies how the treatments were applied, so the order option is not used. When the sequence variable is a string and the order is specified, the arguments from the order option are used. We could now produce an ANOVA table: . 
anova outcome seq period treat Source Number of obs = 25 Root MSE = 3.96232 Partial SS df MS R-squared = 0.8666 Adj R-squared = 0.7331 F Prob > F Model 1223.6 12 101.966667 6.49 0.0014 sequence period treat 82 477.2 664.4 4 4 4 20.5 119.3 166.1 1.31 7.60 10.58 0.3226 0.0027 0.0007 Residual 188.4 12 15.7 Total 1412 24 58.8333333 Example 3 Consider the Latin-square crossover example published in Kutner et al. (2005). The example is about apple sales given different methods for displaying apples. Pattern 1 Store 1 2 1 2 1 2 2 3 Week 1 9(B) 4(B) 12(A) 13(A) 7(C) 5(C) Week 2 12(C) 12(C) 14(B) 14(B) 18(A) 20(A) Week 3 15(A) 9(A) 3(C) 3(C) 6(B) 4(B) We entered the data into Stata: . use http://www.stata-press.com/data/r13/applesales, clear . list, sep(2) id seq p1 p2 p3 square 1. 2. 1 2 1 1 9 4 12 12 15 9 1 2 3. 4. 3 4 2 2 12 13 14 14 3 3 1 2 5. 6. 5 6 3 3 7 5 18 20 6 4 1 2 pkshape — Reshape (pharmacokinetic) Latin-square data Now the data can be reorganized using descriptive names for the outcome variables. . pkshape id seq p1 p2 p3, order(bca abc cab) seq(pattern) period(order) > treat(displays) . anova outcome pattern order display id|pattern Number of obs = 18 R-squared = 0.9562 Root MSE = 1.59426 Adj R-squared = 0.9069 Partial SS df MS F Prob > F Source Model 443.666667 9 49.2962963 19.40 0.0002 pattern order displays id|pattern .333333333 233.333333 189 21 2 2 2 3 .166666667 116.666667 94.5 7 0.07 45.90 37.18 2.75 0.9370 0.0000 0.0001 0.1120 Residual 20.3333333 8 2.54166667 Total 464 17 27.2941176 These are the same results reported by Kutner et al. (2005). Example 4 We continue with example 1 from [R] pkcollapse; the data are . use http://www.stata-press.com/data/r13/pkdata2, clear . list, sep(4) abbrev(10) id seq auc_concA auc_concB 1. 2. 3. 4. 1 2 3 4 1 1 1 1 150.9643 146.7606 160.6548 157.8622 218.5551 133.3201 126.0635 96.17461 5. 6. 7. 8. 5 7 8 9 1 1 1 1 133.6957 160.639 131.2604 168.5186 188.9038 223.6922 104.0139 237.8962 9. 10. 11. 12. 10 12 13 14 2 2 2 2 137.0627 153.4038 163.4593 146.0462 139.7382 202.3942 136.7848 104.5191 13. 14. 15. 16. 15 18 19 20 2 2 2 2 158.1457 147.1977 164.9988 145.3823 165.8654 139.235 166.2391 158.5146 . pkshape id seq auc_concA auc_concB, order(ab ba) . sort period id 1621 1622 pkshape — Reshape (pharmacokinetic) Latin-square data . list, sep(4) id sequence outcome treat carry period 1. 2. 3. 4. 1 2 3 4 1 1 1 1 150.9643 146.7606 160.6548 157.8622 1 1 1 1 0 0 0 0 1 1 1 1 5. 6. 7. 8. 5 7 8 9 1 1 1 1 133.6957 160.639 131.2604 168.5186 1 1 1 1 0 0 0 0 1 1 1 1 9. 10. 11. 12. 10 12 13 14 2 2 2 2 137.0627 153.4038 163.4593 146.0462 2 2 2 2 0 0 0 0 1 1 1 1 13. 14. 15. 16. 15 18 19 20 2 2 2 2 158.1457 147.1977 164.9988 145.3823 2 2 2 2 0 0 0 0 1 1 1 1 17. 18. 19. 20. 1 2 3 4 1 1 1 1 218.5551 133.3201 126.0635 96.17461 2 2 2 2 1 1 1 1 2 2 2 2 21. 22. 23. 24. 5 7 8 9 1 1 1 1 188.9038 223.6922 104.0139 237.8962 2 2 2 2 1 1 1 1 2 2 2 2 25. 26. 27. 28. 10 12 13 14 2 2 2 2 139.7382 202.3942 136.7848 104.5191 1 1 1 1 2 2 2 2 2 2 2 2 29. 30. 31. 32. 15 18 19 20 2 2 2 2 165.8654 139.235 166.2391 158.5146 1 1 1 1 2 2 2 2 2 2 2 2 We call the resulting dataset pkdata3. We conduct equivalence testing on the data in [R] pkequiv, and we fit an ANOVA model to these data in the third example of [R] pkcross. References Chow, S.-C., and J.-P. Liu. 2009. Design and Analysis of Bioavailability and Bioequivalence Studies. 3rd ed. Boca Raton, FL: Chapman & Hall/CRC. Kutner, M. H., C. J. Nachtsheim, J. Neter, and W. Li. 2005. Applied Linear Statistical Models. 5th ed. 
New York: McGraw–Hill/Irwin. pkshape — Reshape (pharmacokinetic) Latin-square data Also see [R] pk — Pharmacokinetic (biopharmaceutical) data 1623 Title pksumm — Summarize pharmacokinetic data Syntax Remarks and examples Menu Methods and formulas Description Also see Options Syntax pksumm id time concentration if in , options Description options Main trapezoid fit(#) notimechk nodots graph stat(statistic) use trapezoidal rule to calculate AUC; default is cubic splines use # points to estimate AUC; default is fit(3) do not check whether follow-up time for all subjects is the same suppress the dots during calculation graph the distribution of statistic graph the specified statistic; default is stat(auc) Histogram, Density plots, Y axis, X axis, Titles, Legend, Overall histogram options any option other than by() documented in [R] histogram statistic Description auc aucline aucexp auclog half ke cmax tmax tomc area under the concentration-time curve (AUC0,∞ ); the default area under the concentration-time curve from 0 to ∞ using a linear extension area under the concentration-time curve from 0 to ∞ using an exponential extension area under the log-concentration-time curve extended with a linear fit half-life of the drug elimination rate maximum concentration time at last concentration time of maximum concentration Menu Statistics > Epidemiology and related > Other > Summarize pharmacokinetic data Description pksumm obtains summary measures based on the first four moments from the empirical distribution of each pharmacokinetic measurement and tests the null hypothesis that the distribution of that measurement is normally distributed. pksumm is one of the pk commands. Please read [R] pk before reading this entry. 1624 pksumm — Summarize pharmacokinetic data 1625 Options Main trapezoid specifies that the trapezoidal rule be used to calculate the AUC. The default is cubic splines, which give better results for most situations. When the curve is irregular, the trapezoidal rule may give better results. fit(#) specifies the number of points, counting back from the last time measurement, to use in fitting the extension to estimate the AUC0,∞ . The default is fit(3), the last three points. This default should be viewed as a minimum; the appropriate number of points will depend on the data. notimechk suppresses the check that the follow-up time for all subjects is the same. By default, pksumm expects the maximum follow-up time to be equal for all subjects. nodots suppresses the progress dots during calculation. By default, a period is displayed for every call to calculate the pharmacokinetic measures. graph requests a graph of the distribution of the statistic specified with stat(). stat(statistic) specifies the statistic that pksumm should graph. The default is stat(auc). If the graph option is not specified, this option is ignored. Histogram, Density plots, Y axis, X axis, Titles, Legend, Overall histogram options are any of the options documented in [R] histogram, excluding by(). For pksumm, fraction is the default, not density. Remarks and examples pksumm produces summary statistics for the distribution of nine common pharmacokinetic measurements. If there are more than eight subjects, pksumm also computes a test for normality on each measurement. The nine measurements summarized by pksumm are listed above and are described in Methods and formulas of [R] pkexamine. Example 1 We demonstrate the use of pksumm on a variation of the data described in [R] pk. 
We have drug concentration data on 15 subjects, each measured at 13 time points over a 32-hour period. A few of the records are 1626 pksumm — Summarize pharmacokinetic data . use http://www.stata-press.com/data/r13/pksumm . list, sep(0) id 1. 2. 3. 4. 5. 6. 1 1 1 1 1 1 183. 184. 185. 186. 187. 188. 189. 190. 191. 192. 193. 194. 195. 15 15 15 15 15 15 15 15 15 15 15 15 15 time conc 0 0 .5 3.073403 1 5.188444 1.5 5.898577 2 5.096378 3 6.094085 (output omitted ) 0 0 .5 3.86493 1 6.432444 1.5 6.969195 2 6.307024 3 6.509584 4 6.555091 6 7.318319 8 5.329813 12 5.411624 16 3.891397 24 5.167516 32 2.649686 We can use pksumm to view the summary statistics for all the pharmacokinetic parameters. . pksumm id time conc ............... Summary statistics for the pharmacokinetic measures Measure Mean Median Variance auc aucline aucexp auclog half ke cmax tomc tmax 150.74 408.30 691.68 688.98 94.84 0.02 7.36 3.47 32.00 150.96 214.17 297.08 297.67 29.39 0.02 7.42 3.00 32.00 123.07 188856.87 762679.94 797237.24 18722.13 0.00 0.42 7.62 0.00 Number of observations = 15 Skewness Kurtosis p-value -0.26 2.57 2.56 2.59 2.26 0.89 -0.60 2.17 . 2.10 8.93 8.87 9.02 7.37 3.70 2.56 7.18 . 0.69 0.00 0.00 0.00 0.00 0.09 0.44 0.00 . For the 15 subjects, the mean AUC0,tmax is 150.74, and σ 2 = 123.07. The skewness of −0.26 indicates that the distribution is slightly skewed left. The p-value of 0.69 for the χ2 test of normality indicates that we cannot reject the null hypothesis that the distribution is normal. If we were to consider any of the three variants of the AUC0,∞ , we would see that there is huge variability and that the distribution is heavily skewed. A skewness different from 0 and a kurtosis different from 3 are expected because the distribution of the AUC0,∞ is not normal. We now graph the distribution of AUC0,tmax by specifying the graph option. pksumm — Summarize pharmacokinetic data . pksumm id time conc, graph bin(20) ............... Summary statistics for the pharmacokinetic measures Number of observations = 1627 15 Mean Median Variance Skewness Kurtosis p-value auc aucline aucexp auclog half ke cmax tomc tmax 150.74 408.30 691.68 688.98 94.84 0.02 7.36 3.47 32.00 150.96 214.17 297.08 297.67 29.39 0.02 7.42 3.00 32.00 123.07 188856.87 762679.94 797237.24 18722.13 0.00 0.42 7.62 0.00 -0.26 2.57 2.56 2.59 2.26 0.89 -0.60 2.17 . 2.10 8.93 8.87 9.02 7.37 3.70 2.56 7.18 . 0.69 0.00 0.00 0.00 0.00 0.09 0.44 0.00 . 0 .05 Fraction .1 .15 Measure 130 140 150 Area Under Curve (AUC) 160 170 graph, by default, plots AUC0,tmax . To plot a graph of one of the other pharmacokinetic measurements, we need to specify the stat() option. For example, we can ask Stata to produce a plot of the AUC0,∞ using the log extension: . pksumm id time conc, stat(auclog) graph bin(20) ............... Summary statistics for the pharmacokinetic measures Number of observations = 15 Mean Median Variance Skewness Kurtosis p-value Measure auc aucline aucexp auclog half ke cmax tomc tmax 150.74 408.30 691.68 688.98 94.84 0.02 7.36 3.47 32.00 150.96 214.17 297.08 297.67 29.39 0.02 7.42 3.00 32.00 123.07 188856.87 762679.94 797237.24 18722.13 0.00 0.42 7.62 0.00 -0.26 2.57 2.56 2.59 2.26 0.89 -0.60 2.17 . 2.10 8.93 8.87 9.02 7.37 3.70 2.56 7.18 . 0.69 0.00 0.00 0.00 0.00 0.09 0.44 0.00 . pksumm — Summarize pharmacokinetic data 0 .2 Fraction .4 .6 .8 1628 0 1000 2000 3000 Linear fit to log concentration AUC for AUC 0−inf. 
Methods and formulas
The χ2 test for normality is conducted with sktest; see [R] sktest for more information on the test of normality. The statistics reported by pksumm are identical to those reported by summarize and sktest; see [R] summarize and [R] sktest.

Also see
[R] pk — Pharmacokinetic (biopharmaceutical) data

Title
poisson — Poisson regression

Syntax    Menu    Description    Options    Remarks and examples    Stored results    Methods and formulas    References    Also see

Syntax
poisson depvar [indepvars] [if] [in] [weight] [, options]

options                       Description
Model
  noconstant                  suppress constant term
  exposure(varname_e)         include ln(varname_e) in model with coefficient constrained to 1
  offset(varname_o)           include varname_o in model with coefficient constrained to 1
  constraints(constraints)    apply specified linear constraints
  collinear                   keep collinear variables
SE/Robust
  vce(vcetype)                vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife
Reporting
  level(#)                    set confidence level; default is level(95)
  irr                         report incidence-rate ratios
  nocnsreport                 do not display constraints
  display_options             control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling
Maximization
  maximize_options            control the maximization process; seldom used
  coeflegend                  display legend instead of statistics

indepvars may contain factor variables; see [U] 11.4.3 Factor variables.
depvar, indepvars, varname_e, and varname_o may contain time-series operators; see [U] 11.4.4 Time-series varlists.
bootstrap, by, fp, jackknife, mfp, mi estimate, nestreg, rolling, statsby, stepwise, and svy are allowed; see [U] 11.1.10 Prefix commands.
vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate.
Weights are not allowed with the bootstrap prefix; see [R] bootstrap.
vce() and weights are not allowed with the svy prefix; see [SVY] svy.
fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight.
coeflegend does not appear in the dialog box.
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.

Menu
Statistics > Count outcomes > Poisson regression

Description
poisson fits a Poisson regression of depvar on indepvars, where depvar is a nonnegative count variable. If you have panel data, see [XT] xtpoisson.

Options
Model
noconstant, exposure(varname_e), offset(varname_o), constraints(constraints), collinear; see [R] estimation options.
SE/Robust
vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option.
Reporting
level(#); see [R] estimation options.
irr reports estimated coefficients transformed to incidence-rate ratios, that is, e^βi rather than βi. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated or stored. irr may be specified at estimation or when replaying previously estimated results.
nocnsreport; see [R] estimation options.
display_options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(%fmt), pformat(%fmt), sformat(%fmt), and nolstretch; see [R] estimation options.
Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). The following option is available with poisson but is not shown in the dialog box: coeflegend; see [R] estimation options. poisson — Poisson regression 1631 Remarks and examples The basic idea of Poisson regression was outlined by Coleman (1964, 378–379). See Cameron and Trivedi (2013; 2010, chap. 17) and Johnson, Kemp, and Kotz (2005, chap. 4) for information about the Poisson distribution. See Cameron and Trivedi (2013), Long (1997, chap. 8), Long and Freese (2014, chap. 9), McNeil (1996, chap. 6), and Selvin (2011, chap. 6) for an introduction to Poisson regression. Also see Selvin (2004, chap. 5) for a discussion of the analysis of spatial distributions, which includes a discussion of the Poisson distribution. An early example of Poisson regression was Cochran (1940). Poisson regression fits models of the number of occurrences (counts) of an event. The Poisson distribution has been applied to diverse events, such as the number of soldiers kicked to death by horses in the Prussian army (von Bortkiewicz 1898); the pattern of hits by buzz bombs launched against London during World War II (Clarke 1946); telephone connections to a wrong number (Thorndike 1926); and disease incidence, typically with respect to time, but occasionally with respect to space. The basic assumptions are as follows: 1. There is a quantity called the incidence rate that is the rate at which events occur. Examples are 5 per second, 20 per 1,000 person-years, 17 per square meter, and 38 per cubic centimeter. 2. The incidence rate can be multiplied by exposure to obtain the expected number of observed events. For example, a rate of 5 per second multiplied by 30 seconds means that 150 events are expected; a rate of 20 per 1,000 person-years multiplied by 2,000 person-years means that 40 events are expected; and so on. 3. Over very small exposures , the probability of finding more than one event is small compared with . 4. Nonoverlapping exposures are mutually independent. With these assumptions, to find the probability of k events in an exposure of size E , you divide E into n subintervals E1 , E2 , . . . , En , and approximate the answer as the binomial probability of observing k successes in n trials. If you let n → ∞, you obtain the Poisson distribution. In the Poisson regression model, the incidence rate for the j th observation is assumed to be given by rj = eβ0 +β1 x1,j +···+βk xk,j If Ej is the exposure, the expected number of events, Cj , will be Cj = Ej eβ0 +β1 x1,j +···+βk xk,j = e ln(Ej )+β0 +β1 x1,j +···+βk xk,j This model is fit by poisson. Without the exposure() or offset() options, Ej is assumed to be 1 (equivalent to assuming that exposure is unknown), and controlling for exposure, if necessary, is your responsibility. Comparing rates is most easily done by calculating incidence-rate ratios (IRRs). For instance, what is the relative incidence rate of chromosome interchanges in cells as the intensity of radiation increases; the relative incidence rate of telephone connections to a wrong number as load increases; or the relative incidence rate of deaths due to cancer for females relative to males? 
That is, you want to hold all the x’s in the model constant except one, say, the ith. The IRR for a one-unit change in xi is e ln(E)+β1 x1 +···+βi (xi +1)+···+βk xk = eβi e ln(E)+β1 x1 +···+βi xi +···+βk xk 1632 poisson — Poisson regression More generally, the IRR for a ∆xi change in xi is eβi ∆xi . The lincom command can be used after poisson to display incidence-rate ratios for any group relative to another; see [R] lincom. Example 1 Chatterjee and Hadi (2012, 174) give the number of injury incidents and the proportion of flights for each airline out of the total number of flights from New York for nine major U.S. airlines in one year: . use http://www.stata-press.com/data/r13/airline . list airline injuries n XYZowned 1. 2. 3. 4. 5. 1 2 3 4 5 11 7 7 19 9 0.0950 0.1920 0.0750 0.2078 0.1382 1 0 0 0 0 6. 7. 8. 9. 6 7 8 9 4 3 1 3 0.0540 0.1292 0.0503 0.0629 1 0 0 1 To their data, we have added a fictional variable, XYZowned. We will imagine that an accusation is made that the airlines owned by XYZ Company have a higher injury rate. . poisson injuries Iteration 0: log Iteration 1: log Iteration 2: log Poisson regression XYZowned, exposure(n) irr likelihood = -23.027197 likelihood = -23.027177 likelihood = -23.027177 Number of obs LR chi2(1) Prob > chi2 Pseudo R2 Log likelihood = -23.027177 injuries IRR Std. Err. XYZowned _cons ln(n) 1.463467 58.04416 1 .406872 8.558145 (exposure) z 1.37 27.54 = = = = 9 1.77 0.1836 0.0370 P>|z| [95% Conf. Interval] 0.171 0.000 .8486578 43.47662 2.523675 77.49281 We specified irr to see the IRRs rather than the underlying coefficients. We estimate that XYZ Airlines’ injury rate is 1.46 times larger than that for other airlines, but the 95% confidence interval is 0.85 to 2.52; we cannot even reject the hypothesis that XYZ Airlines has a lower injury rate. Technical note In example 1, we assumed that each airline’s exposure was proportional to its fraction of flights out of New York. What if “large” airlines, however, also used larger planes, and so had even more passengers than would be expected, given this measure of exposure? A better measure would be each airline’s fraction of passengers on flights out of New York, a number that we do not have. Even so, we suppose that n represents this number to some extent, so a better estimate of the effect might be poisson — Poisson regression . gen lnN=ln(n) . poisson injuries Iteration 0: log Iteration 1: log Iteration 2: log Poisson regression 1633 XYZowned lnN likelihood = -22.333875 likelihood = -22.332276 likelihood = -22.332276 Number of obs LR chi2(2) Prob > chi2 Pseudo R2 Log likelihood = -22.332276 injuries Coef. XYZowned lnN _cons .6840667 1.424169 4.863891 Std. Err. z P>|z| .3895877 .3725155 .7090501 1.76 3.82 6.86 0.079 0.000 0.000 = = = = 9 19.15 0.0001 0.3001 [95% Conf. Interval] -.0795111 .6940517 3.474178 1.447645 2.154285 6.253603 Here rather than specifying the exposure() option, we explicitly included the variable that would normalize for exposure in the model. We did not specify the irr option, so we see coefficients rather than IRRs. We started with the model rate = eβ0 +β1 XYZowned The observed counts are therefore count = neβ0 +β1 XYZowned = e ln(n)+β0 +β1 XYZowned which amounts to constraining the coefficient on ln(n) to 1. This is what was estimated when we specified the exposure(n) option. In the above model, we included the normalizing exposure ourselves and, rather than constraining the coefficient to be 1, estimated the coefficient. 
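Whether the estimated coefficient on lnN is compatible with the constraint implied by exposure(n) can be checked directly with a Wald test; a minimal sketch, run after the second model:

. test lnN = 1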
The estimated coefficient is 1.42, a respectable distance away from 1, and is consistent with our speculation that larger airlines also use larger airplanes. With this small amount of data, however, we also have a wide confidence interval that includes 1. Our estimated coefficient on XYZowned is now 0.684, and the implied IRR is e0.684 ≈ 1.98 (which we could also see by typing poisson, irr). The 95% confidence interval for the coefficient still includes 0 (the interval for the IRR includes 1), so although the point estimate is now larger, we still cannot be certain of our results. Our expert opinion would be that, although there is not enough evidence to support the charge, there is enough evidence to justify collecting more data. Example 2 In a famous age-specific study of coronary disease deaths among male British doctors, Doll and Hill (1966) reported the following data (reprinted in Rothman, Greenland, and Lash [2008, 264]): Age 35 – 44 45 – 54 55 – 64 65 – 74 75 – 84 Smokers Deaths Person-years 32 104 206 186 102 52,407 43,248 28,612 12,663 5,317 Nonsmokers Deaths Person-years 2 12 28 28 31 18,790 10,673 5,710 2,585 1,462 1634 poisson — Poisson regression The first step is to enter these data into Stata, which we have done: . use http://www.stata-press.com/data/r13/dollhill3, clear . list agecat smokes deaths pyears 1. 2. 3. 4. 5. 35-44 45-54 55-64 65-74 75-84 1 1 1 1 1 32 104 206 186 102 52,407 43,248 28,612 12,663 5,317 6. 7. 8. 9. 10. 35-44 45-54 55-64 65-74 75-84 0 0 0 0 0 2 12 28 28 31 18,790 10,673 5,710 2,585 1,462 The most “natural” analysis of these data would begin by introducing indicator variables for each age category and one indicator for smoking: . poisson deaths smokes i.agecat, exposure(pyears) irr Iteration 0: log likelihood = -33.823284 Iteration 1: log likelihood = -33.600471 Iteration 2: log likelihood = -33.600153 Iteration 3: log likelihood = -33.600153 Poisson regression Number of obs LR chi2(5) Prob > chi2 Log likelihood = -33.600153 Pseudo R2 deaths IRR smokes = = = = 10 922.93 0.0000 0.9321 Std. Err. z P>|z| [95% Conf. Interval] 1.425519 .1530638 3.30 0.001 1.154984 1.759421 agecat 45-54 55-64 65-74 75-84 4.410584 13.8392 28.51678 40.45121 .8605197 2.542638 5.269878 7.775511 7.61 14.30 18.13 19.25 0.000 0.000 0.000 0.000 3.009011 9.654328 19.85177 27.75326 6.464997 19.83809 40.96395 58.95885 _cons ln(pyears) .0003636 1 .0000697 (exposure) -41.30 0.000 .0002497 .0005296 In the above, we specified irr to obtain IRRs. We estimate that smokers have 1.43 times the mortality rate of nonsmokers. See, however, example 1 in [R] poisson postestimation. 
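Before moving on, the fit from example 2 can be inspected by comparing observed and expected deaths; a minimal sketch, where nhat and rate are variable names we chose:

. predict double nhat
. predict double rate, ir
. list agecat smokes deaths nhat rate, sep(5)

Because exposure(pyears) was specified, nhat contains the expected number of deaths, exp(xb) × pyears, and rate contains the incidence rate per person-year, exp(xb); see [R] poisson postestimation.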
poisson — Poisson regression 1635 Stored results poisson stores the following in e(): Scalars e(N) e(k) e(k eq) e(k eq model) e(k dv) e(df m) e(r2 p) e(ll) e(ll 0) e(N clust) e(chi2) e(p) e(rank) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(clustvar) e(offset) e(chi2type) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(estat cmd) e(predict) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) Functions e(sample) number of observations number of parameters number of equations in e(b) number of equations in overall model test number of dependent variables model degrees of freedom pseudo-R-squared log likelihood log likelihood, constant-only model number of clusters χ2 significance rank of e(V) number of iterations return code 1 if converged, 0 otherwise poisson command as typed name of dependent variable weight type weight expression title in estimation output name of cluster variable linear offset variable Wald or LR; type of model χ2 test vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement estat program used to implement predict factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample 1636 poisson — Poisson regression Methods and formulas The log likelihood (with weights wj and offsets) is given by Pr(Y = y) = e−λ λy y! ξj = xj β + offsetj e− exp(ξj ) eξj yj yj ! n X lnL = wj −eξj + ξj yj − ln(yj !) f (yj ) = j=1 This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. poisson also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. Siméon-Denis Poisson (1781–1840) was a French mathematician and physicist who contributed to several fields: his name is perpetuated in Poisson brackets, Poisson’s constant, Poisson’s differential equation, Poisson’s integral, and Poisson’s ratio. Among many other results, he produced a version of the law of large numbers. His rather misleadingly titled Recherches sur la probabilité des jugements embraces a complete treatise on probability, as the subtitle indicates, including what is now known as the Poisson distribution. That, however, was discovered earlier by the Huguenot–British mathematician Abraham de Moivre (1667–1754). References Bru, B. 2001. Siméon-Denis Poisson. In Statisticians of the Centuries, ed. C. C. Heyde and E. Seneta, 123–126. New York: Springer. Cameron, A. C., and P. K. Trivedi. 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press. . 2013. Regression Analysis of Count Data. 2nd ed. New York: Cambridge University Press. Chatterjee, S., and A. S. Hadi. 2012. Regression Analysis by Example. 5th ed. New York: Hoboken, NJ. Clarke, R. D. 1946. An application of the Poisson distribution. Journal of the Institute of Actuaries 72: 481. Cochran, W. G. 1940. The analysis of variance when experimental errors follow the Poisson or binomial laws. 
Annals of Mathematical Statistics 11: 335–347. . 1982. Contributions to Statistics. New York: Wiley. Coleman, J. S. 1964. Introduction to Mathematical Sociology. New York: Free Press. Doll, R., and A. B. Hill. 1966. Mortality of British doctors in relation to smoking: Observations on coronary thrombosis. Journal of the National Cancer Institute, Monographs 19: 205–268. Gould, W. W. 2011. Use poisson rather than regress; tell a friend. The Stata Blog: Not Elsewhere Classified. http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/. Harris, T., Z. Yang, and J. W. Hardin. 2012. Modeling underdispersed count data with generalized Poisson regression. Stata Journal 12: 736–747. poisson — Poisson regression 1637 Hilbe, J. M. 1998. sg91: Robust variance estimators for MLE Poisson and negative binomial regression. Stata Technical Bulletin 45: 26–28. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 177–180. College Station, TX: Stata Press. . 1999. sg102: Zero-truncated Poisson and negative binomial regression. Stata Technical Bulletin 47: 37–40. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 233–236. College Station, TX: Stata Press. Hilbe, J. M., and D. H. Judson. 1998. sg94: Right, left, and uncensored Poisson regression. Stata Technical Bulletin 46: 18–20. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 186–189. College Station, TX: Stata Press. Johnson, N. L., A. W. Kemp, and S. Kotz. 2005. Univariate Discrete Distributions. 3rd ed. New York: Wiley. Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage. Long, J. S., and J. Freese. 2001. Predicted probabilities for count models. Stata Journal 1: 51–57. . 2014. Regression Models for Categorical Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata Press. McNeil, D. 1996. Epidemiological Research Methods. Chichester, UK: Wiley. Miranda, A., and S. Rabe-Hesketh. 2006. Maximum likelihood estimation of endogenous switching and sample selection models for binary, ordinal, and count variables. Stata Journal 6: 285–308. Newman, S. C. 2001. Biostatistical Methods in Epidemiology. New York: Wiley. Poisson, S. D. 1837. Recherches sur la probabilité des jugements en matière criminelle et en matière civile: précédées des règles générales du calcul des probabilités. Paris: Bachelier. Raciborski, R. 2011. Right-censored Poisson regression model. Stata Journal 11: 95–105. Rodrı́guez, G. 1993. sbe10: An improvement to poisson. Stata Technical Bulletin 11: 11–14. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 94–98. College Station, TX: Stata Press. Rogers, W. H. 1991. sbe1: Poisson regression with rates. Stata Technical Bulletin 1: 11–12. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 62–64. College Station, TX: Stata Press. Rothman, K. J., S. Greenland, and T. L. Lash. 2008. Modern Epidemiology. 3rd ed. Philadelphia: Lippincott Williams & Wilkins. Rutherford, E., J. Chadwick, and C. D. Ellis. 1930. Radiations from Radioactive Substances. Cambridge: Cambridge University Press. Rutherford, M. J., P. C. Lambert, and J. R. Thompson. 2010. Age–period–cohort modeling. Stata Journal 10: 606–627. Sasieni, P. D. 2012. Age–period–cohort models in Stata. Stata Journal 12: 45–60. Schonlau, M. 2005. Boosted regression (boosting): An introductory tutorial and a Stata plugin. Stata Journal 5: 330–354. Selvin, S. 2004. Statistical Analysis of Epidemiologic Data. 3rd ed. New York: Oxford University Press. . 2011. 
Statistical Tools for Epidemiologic Research. New York: Oxford University Press. Thorndike, F. 1926. Applications of Poisson’s probability summation. Bell System Technical Journal 5: 604–624. Tobı́as, A., and M. J. Campbell. 1998. sg90: Akaike’s information criterion and Schwarz’s criterion. Stata Technical Bulletin 45: 23–25. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 174–177. College Station, TX: Stata Press. von Bortkiewicz, L. 1898. Das Gesetz der Kleinen Zahlen. Leipzig: Teubner. 1638 poisson — Poisson regression Also see [R] poisson postestimation — Postestimation tools for poisson [R] glm — Generalized linear models [R] nbreg — Negative binomial regression [R] tpoisson — Truncated Poisson regression [R] zip — Zero-inflated Poisson regression [ME] mepoisson — Multilevel mixed-effects Poisson regression [MI] estimation — Estimation commands for use with mi estimate [SVY] svy estimation — Estimation commands for survey data [XT] xtpoisson — Fixed-effects, random-effects, and population-averaged Poisson models [U] 20 Estimation and postestimation commands Title poisson postestimation — Postestimation tools for poisson Description Options for predict Remarks and examples Syntax for predict Syntax for estat gof Methods and formulas Menu for predict Menu for estat Also see Description The following postestimation command is of special interest after poisson: Command Description estat gof goodness-of-fit test estat gof is not appropriate after the svy prefix. The following standard postestimation commands are also available: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 2 forecast is not appropriate with mi or svy estimation results. lrtest is not appropriate with svy estimation results. 1639 1640 poisson postestimation — Postestimation tools for poisson Special-interest postestimation command estat gof performs a goodness-of-fit test of the model. Both the deviance statistic and the Pearson statistic are reported. If the tests are significant, the Poisson regression model is inappropriate. Then you could try a negative binomial model; see [R] nbreg. 
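In practice, estat gof is typed immediately after poisson; a minimal sketch using the airline data from example 1 of [R] poisson, where fitting nbreg is only one possible follow-up and is worthwhile only if the tests reject:

. use http://www.stata-press.com/data/r13/airline
. poisson injuries XYZowned, exposure(n)
. estat gof
. nbreg injuries XYZowned, exposure(n)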
Syntax for predict predict type newvar if in , statistic nooffset Description statistic Main n ir pr(n) pr(a,b) xb stdp score number of events; the default incidence rate probability Pr(yj = n) probability Pr(a ≤ yj ≤ b) linear prediction standard error of the linear prediction first derivative of the log likelihood with respect to xj β These statistics are available both in and out of sample; type predict only for the estimation sample. . . . if e(sample) . . . if wanted Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main n, the default, calculates the predicted number of events, which is exp(xj β) if neither offset() nor exposure() was specified when the model was fit; exp(xj β + offsetj ) if offset() was specified; or exp(xj β) × exposurej if exposure() was specified. ir calculates the incidence rate exp(xj β), which is the predicted number of events when exposure is 1. Specifying ir is equivalent to specifying n when neither offset() nor exposure() was specified when the model was fit. pr(n) calculates the probability Pr(yj = n), where n is a nonnegative integer that may be specified as a number or a variable. pr(a,b) calculates the probability Pr(a ≤ yj ≤ b), where a and b are nonnegative integers that may be specified as numbers or variables; b missing (b ≥ .) means +∞; pr(20,.) calculates Pr(yj ≥ 20); pr(20,b) calculates Pr(yj ≥ 20) in observations for which b ≥ . and calculates Pr(20 ≤ yj ≤ b) elsewhere. pr(.,b) produces a syntax error. A missing value in an observation of the variable a causes a missing value in that observation for pr(a,b). poisson postestimation — Postestimation tools for poisson 1641 xb calculates the linear prediction, which is xj β if neither offset() nor exposure() was specified; xj β + offsetj if offset() was specified; or xj β + ln(exposurej ) if exposure() was specified; see nooffset below. stdp calculates the standard error of the linear prediction. score calculates the equation-level score, ∂ ln L/∂(xj β). nooffset is relevant only if you specified offset() or exposure() when you fit the model. It modifies the calculations made by predict so that they ignore the offset or exposure variable; the linear prediction is treated as xj β rather than as xj β + offsetj or xj β + ln(exposurej ). Specifying predict . . . , nooffset is equivalent to specifying predict . . . , ir. Syntax for estat gof estat gof Menu for estat Statistics > Postestimation > Reports and statistics Remarks and examples Example 1 Continuing with example 2 of [R] poisson, we use estat gof to determine whether the model fits the data well. . use http://www.stata-press.com/data/r13/dollhill3 . poisson deaths smokes i.agecat, exp(pyears) irr (output omitted ) . estat gof Deviance goodness-of-fit = 12.13244 Prob > chi2(4) = 0.0164 Pearson goodness-of-fit = 11.15533 Prob > chi2(4) = 0.0249 The deviance goodness-of-fit test tells us that, given the model, we can reject the hypothesis that these data are Poisson distributed at the 1.64% significance level. The Pearson goodness-of-fit test tells us that we can reject the hypothesis at the 2.49% significance level. So let us now back up and be more careful. We can most easily obtain the incidence-rate ratios within age categories by using ir; see [ST] epitab: . ir deaths smokes pyears, by(agecat) nohet age category IRR [95% Conf. 
Interval] 35-44 45-54 55-64 65-74 75-84 5.736638 2.138812 1.46824 1.35606 .9047304 1.463557 1.173714 .9863624 .9081925 .6000757 49.40468 4.272545 2.264107 2.096412 1.399687 Crude M-H combined 1.719823 1.424682 1.391992 1.154703 2.14353 1.757784 M-H Weight 1.472169 9.624747 23.34176 23.25315 24.31435 (exact) (exact) (exact) (exact) (exact) (exact) 1642 poisson postestimation — Postestimation tools for poisson We find that the mortality incidence ratios are greatly different within age category, being highest for the youngest categories and actually dropping below 1 for the oldest. (In the last case, we might argue that those who smoke and who have not died by age 75 are self-selected to be particularly robust.) Seeing this, we will now parameterize the smoking effects separately for each category, although we will begin by constraining the smoking effects on third and fourth age categories to be equivalent: . constraint 1 smokes#3.agecat = smokes#4.agecat . poisson deaths c.smokes#agecat i.agecat, exposure(pyears) irr constraints(1) Iteration 0: log likelihood = -31.95424 Iteration 1: log likelihood = -27.796801 Iteration 2: log likelihood = -27.574177 Iteration 3: log likelihood = -27.572645 Iteration 4: log likelihood = -27.572645 Poisson regression Number of obs = 10 Wald chi2(8) = 632.14 Log likelihood = -27.572645 Prob > chi2 = 0.0000 ( 1) [deaths]3.agecat#c.smokes - [deaths]4.agecat#c.smokes = 0 deaths IRR Std. Err. z P>|z| [95% Conf. Interval] agecat# c.smokes 35-44 45-54 55-64 65-74 75-84 5.736637 2.138812 1.412229 1.412229 .9047304 4.181256 .6520701 .2017485 .2017485 .1855513 2.40 2.49 2.42 2.42 -0.49 0.017 0.013 0.016 0.016 0.625 1.374811 1.176691 1.067343 1.067343 .6052658 23.93711 3.887609 1.868557 1.868557 1.35236 agecat 45-54 55-64 65-74 75-84 10.5631 47.671 98.22765 199.2099 8.067701 34.37409 70.85012 145.3356 3.09 5.36 6.36 7.26 0.002 0.000 0.000 0.000 2.364153 11.60056 23.89324 47.67693 47.19623 195.8978 403.8244 832.3648 _cons ln(pyears) .0001064 1 .0000753 (exposure) -12.94 0.000 .0000266 .0004256 . estat gof Deviance goodness-of-fit Prob > chi2(1) Pearson goodness-of-fit Prob > chi2(1) = = = = .0774185 0.7808 .0773882 0.7809 The goodness-of-fit is now small; we are no longer running roughshod over the data. Let us now consider simplifying the model. The point estimate of the incidence-rate ratio for smoking in age category 1 is much larger than that for smoking in age category 2, but the confidence interval for smokes#1.agecat is similarly wide. Is the difference real? . test smokes#1.agecat = smokes#2.agecat ( 1) [deaths]1b.agecat#c.smokes - [deaths]2.agecat#c.smokes = 0 chi2( 1) = 1.56 Prob > chi2 = 0.2117 The point estimates of the incidence-rate ratio for smoking in the 35–44 age category is much larger than that for smoking in the 45–54 age category, but there is insufficient data, and we may be observing random differences. With that success, might we also combine the smokers in the third and fourth categories with those in the first and second categories? poisson postestimation — Postestimation tools for poisson 1643 . test smokes#2.agecat = smokes#3.agecat, accum ( 1) [deaths]1b.agecat#c.smokes - [deaths]2.agecat#c.smokes = 0 ( 2) [deaths]2.agecat#c.smokes - [deaths]3.agecat#c.smokes = 0 chi2( 2) = 4.73 Prob > chi2 = 0.0938 Combining the first four categories may be overdoing it — the 9.38% significance level is enough to stop us, although others may disagree. Thus we now fit our final model: . constraint 2 smokes#1.agecat = smokes#2.agecat . 
poisson deaths c.smokes#agecat i.agecat, exposure(pyears) irr constraints(1/2) Iteration 0: log Iteration 1: log Iteration 2: log Iteration 3: log Poisson regression likelihood likelihood likelihood likelihood deaths IRR = = = = -31.550722 -28.525057 -28.514535 -28.514535 Number of obs = Wald chi2(7) = Log likelihood = -28.514535 Prob > chi2 = ( 1) [deaths]3.agecat#c.smokes - [deaths]4.agecat#c.smokes = 0 ( 2) [deaths]1b.agecat#c.smokes - [deaths]2.agecat#c.smokes = 0 Std. Err. z 10 642.25 0.0000 P>|z| [95% Conf. Interval] agecat# c.smokes 35-44 45-54 55-64 65-74 75-84 2.636259 2.636259 1.412229 1.412229 .9047304 .7408403 .7408403 .2017485 .2017485 .1855513 3.45 3.45 2.42 2.42 -0.49 0.001 0.001 0.016 0.016 0.625 1.519791 1.519791 1.067343 1.067343 .6052658 4.572907 4.572907 1.868557 1.868557 1.35236 agecat 45-54 55-64 65-74 75-84 4.294559 23.42263 48.26309 97.87965 .8385329 7.787716 16.06939 34.30881 7.46 9.49 11.64 13.08 0.000 0.000 0.000 0.000 2.928987 12.20738 25.13068 49.24123 6.296797 44.94164 92.68856 194.561 _cons ln(pyears) .0002166 1 .0000652 (exposure) -28.03 0.000 .0001201 .0003908 1644 poisson postestimation — Postestimation tools for poisson The above strikes us as a fair representation of the data. The probabilities of observing the deaths seen in these data are estimated using the following predict command: . predict p, pr(0, deaths) . list deaths p deaths p 1. 2. 3. 4. 5. 32 104 206 186 102 .6891766 .4456625 .5455328 .4910622 .5263011 6. 7. 8. 9. 10. 2 12 28 28 31 .227953 .7981917 .4772961 .6227565 .5475718 The probability Pr(y ≤ deaths) ranges from 0.23 to 0.80. Methods and formulas In the following, we use the same notation as in [R] poisson. The equation-level scores are given by score(xβ)j = yj − eξj The deviance (D) and Pearson (P) goodness-of-fit statistics are given by lnLmax = n X wj [−yj { ln(yj ) − 1} − ln(yj !)] j=1 χ2D = −2{ lnL − lnLmax } n X wj (yj − eξj )2 χ2P = e ξj j=1 Also see [R] poisson — Poisson regression [U] 20 Estimation and postestimation commands Title predict — Obtain predictions, residuals, etc., after estimation Syntax Remarks and examples Menu for predict Methods and formulas Description Also see Options Syntax After single-equation (SE) models predict type newvar if in , single options After multiple-equation (ME) models predict type newvar if in , multiple options predict type stub* | newvar1 . . . newvarq if in , scores Description single options Main calculate linear prediction calculate standard error of the prediction calculate first derivative of the log likelihood with respect to xj β xb stdp score Options nooffset other options ignore any offset() or exposure() variable command-specific options multiple options Description Main equation(eqno , eqno ) xb stdp stddp specify equations calculate linear prediction calculate standard error of the prediction calculate the difference in linear predictions Options ignore any offset() or exposure() variable command-specific options nooffset other options Menu for predict Statistics > Postestimation > Predictions, residuals, etc. 1645 1646 predict — Obtain predictions, residuals, etc., after estimation Description predict calculates predictions, residuals, influence statistics, and the like after estimation. Exactly what predict can do is determined by the previous estimation command; command-specific options are documented with each estimation command. Regardless of command-specific options, the actions of predict share certain similarities across estimation commands: 1. 
predict newvar creates newvar containing “predicted values” — numbers related to the E(yj |xj ). For instance, after linear regression, predict newvar creates xj b and, after probit, creates the probability Φ(xj b). 2. predict newvar, xb creates newvar containing xj b. This may be the same result as option 1 (for example, linear regression) or different (for example, probit), but regardless, option xb is allowed. 3. predict newvar, stdp creates newvar containing the standard error of the linear prediction xj b. 4. predict newvar, other options may create newvar containing other useful quantities; see help or the reference manual entry for the particular estimation command to find out about other available options. 5. nooffset added to any of the above commands requests that the calculation ignore any offset or exposure variable specified by including the offset(varnameo ) or exposure(varnamee ) option when you fit the model. predict can be used to make in-sample or out-of-sample predictions: 6. predict calculates the requested statistic for all possible observations, whether they were used in fitting the model or not. predict does this for standard options 1–3 and generally does this for estimator-specific options 4. 7. predict newvar if e(sample), . . . restricts the prediction to the estimation subsample. 8. Some statistics make sense only with respect to the estimation subsample. In such cases, the calculation is automatically restricted to the estimation subsample, and the documentation for the specific option states this. Even so, you can still specify if e(sample) if you are uncertain. 9. predict can make out-of-sample predictions even using other datasets. In particular, you can . . . . use ds1 (fit a model) use two predict yhat, ... /* another dataset */ /* fill in the predictions */ Options Main xb calculates the linear prediction from the fitted model. That is, all models can be thought of as estimating a set of parameters b1 , b2 , . . . , bk , and the linear prediction is ybj = b1 x1j + b2 x2j + bj = xj b. For linear regression, the values ybj · · · + bk xkj , often written in matrix notation as y are called the predicted values or, for out-of-sample predictions, the forecast. For logit and probit, for example, ybj is called the logit or probit index. x1j , x2j , . . . , xkj are obtained from the data currently in memory and do not necessarily correspond to the data on the independent variables used to fit the model (obtaining b1 , b2 , . . . , bk ). predict — Obtain predictions, residuals, etc., after estimation 1647 stdp calculates the standard error of the linear prediction. Here the prediction means the same thing as the “index”, namely, xj b. The statistic produced by stdp can be thought of as the standard error of the predicted expected value, or mean index, for the observation’s covariate pattern. The standard error of the prediction is also commonly referred to as the standard error of the fitted value. The calculation can be made in or out of sample. stddp is allowed only after you have previously fit a multiple-equation model. The standard error of the difference in linear predictions (x1j b − x2j b) between equations 1 and 2 is calculated. This option requires that equation(eqno1 ,eqno2 ) be specified. score calculates the equation-level score, ∂ ln L/∂(xj β). Here lnL refers to the log-likelihood function. scores is the ME model equivalent of the score option, resulting in multiple equation-level score variables. 
An equation-level score variable is created for each equation in the model; ancillary parameters — such as lnσ and atanhρ — make up separate equations. equation(eqno ,eqno ) — synonym outcome() — is relevant only when you have previously fit a multiple-equation model. It specifies the equation to which you are referring. equation() is typically filled in with one eqno — it would be filled in that way with options xb and stdp, for instance. equation(#1) would mean the calculation is to be made for the first equation, equation(#2) would mean the second, and so on. You could also refer to the equations by their names. equation(income) would refer to the equation named income and equation(hours) to the equation named hours. If you do not specify equation(), results are the same as if you specified equation(#1). Other statistics, such as stddp, refer to between-equation concepts. In those cases, you might specify equation(#1,#2) or equation(income,hours). When two equations must be specified, equation() is required. Options nooffset may be combined with most statistics and specifies that the calculation should be made, ignoring any offset or exposure variable specified when the model was fit. This option is available, even if it is not documented for predict after a specific command. If neither the offset(varnameo ) option nor the exposure(varnamee ) option was specified when the model was fit, specifying nooffset does nothing. other options refers to command-specific options that are documented with each command. Remarks and examples Remarks are presented under the following headings: Estimation-sample predictions Out-of-sample predictions Residuals Single-equation (SE) models SE model scores Multiple-equation (ME) models ME model scores Most of the examples are presented using linear regression, but the general syntax is applicable to all estimators. 1648 predict — Obtain predictions, residuals, etc., after estimation You can think of any estimation command as estimating a set of coefficients b1 , b2 , . . . , bk corresponding to the variables x1 , x2 , . . . , xk , along with a (possibly empty) set of ancillary statistics γ1 , γ2 , . . . , γm . All estimation commands store the bi s and γi s. predict accesses that stored information and combines it with the data currently in memory to make various calculations. For instance, predict can calculate the linear prediction, ybj = b1 x1j + b2 x2j + · · · + bk xkj . The data on which predict makes the calculation can be the same data used to fit the model or a different dataset — it does not matter. predict uses the stored parameter estimates from the model, obtains the corresponding values of x for each observation in the data, and then combines them to produce the desired result. Estimation-sample predictions Example 1 We have a 74-observation dataset on automobiles, including the mileage rating (mpg), the car’s weight (weight), and whether the car is foreign (foreign). We fit the model . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress mpg weight if foreign SS df MS Source Model Residual 427.990298 489.873338 1 20 427.990298 24.4936669 Total 917.863636 21 43.7077922 mpg Coef. weight _cons -.010426 48.9183 Std. Err. .0024942 5.871851 t -4.18 8.33 Number of obs F( 1, 20) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.000 = = = = = = 22 17.47 0.0005 0.4663 0.4396 4.9491 [95% Conf. 
Interval] -.0156287 36.66983 -.0052232 61.16676 If we were to type predict pmpg now, we would obtain the linear predictions for all 74 observations. To obtain the predictions just for the sample on which we fit the model, we could type . predict pmpg if e(sample) (option xb assumed; fitted values) (52 missing values generated) Here e(sample) is true only for foreign cars because we typed if foreign when we fit the model and because there are no missing values among the relevant variables. If there had been missing values, e(sample) would also account for those. By the way, the if e(sample) restriction can be used with any Stata command, so we could obtain summary statistics on the estimation sample by typing . summarize if e(sample) (output omitted ) predict — Obtain predictions, residuals, etc., after estimation 1649 Out-of-sample predictions By out-of-sample predictions, we mean predictions extending beyond the estimation sample. In the example above, typing predict pmpg would generate linear predictions using all 74 observations. predict will work on other datasets, too. You can use a new dataset and type predict to obtain results for that sample. Example 2 Using the same auto dataset, assume that we wish to fit the model mpg = β1 weight + β2 ln(weight) + β3 foreign + β4 We first create the ln(weight) variable, and then type the regress command: . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . generate lnweight = ln(weight) . regress mpg weight lnweight foreign Source SS df MS Model Residual 1690.27997 753.179489 3 70 563.426657 10.759707 Total 2443.45946 73 33.4720474 mpg Coef. weight lnweight foreign _cons .003304 -29.59133 -2.125299 248.0548 Std. Err. .0038995 11.52018 1.052324 80.37079 t 0.85 -2.57 -2.02 3.09 Number of obs F( 3, 70) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.400 0.012 0.047 0.003 = = = = = = 74 52.36 0.0000 0.6918 0.6785 3.2802 [95% Conf. Interval] -.0044734 -52.5676 -4.224093 87.76035 .0110813 -6.615061 -.0265044 408.3493 If we typed predict pmpg now, we would obtain predictions for all 74 cars in the current data. Instead, we are going to use a new dataset. The dataset newautos.dta contains the make, weight, and place of manufacture of two cars, the Pontiac Sunbird and the Volvo 260. Let’s use the dataset and create the predictions: . use http://www.stata-press.com/data/r13/newautos, clear (New Automobile Models) . list 1. 2. make weight foreign Pont. Sunbird Volvo 260 2690 3170 Domestic Foreign . predict mpg (option xb assumed; fitted values) variable lnweight not found r(111); Things did not work. We typed predict mpg, and Stata responded with the message “variable lnweight not found”. predict can calculate predicted values on a different dataset only if that dataset contains the variables that went into the model. Here our dataset does not contain a variable called lnweight. lnweight is just the log of weight, so we can create it and try again: 1650 predict — Obtain predictions, residuals, etc., after estimation . generate lnweight = ln(weight) . predict mpg (option xb assumed; fitted values) . list 1. 2. make weight foreign lnweight mpg Pont. Sunbird Volvo 260 2690 3170 Domestic Foreign 7.897296 8.061487 23.25097 17.85295 We obtained our predicted values. The Pontiac Sunbird has a predicted mileage rating of 23.3 mpg, whereas the Volvo 260 has a predicted rating of 17.9 mpg. Residuals Example 3 With many estimators, predict can calculate more than predicted values. 
With most regressiontype estimators, we can, for instance, obtain residuals. Using our regression example, we return to our original data and obtain residuals by typing . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . generate lnweight = ln(weight) . regress mpg weight lnweight foreign (output omitted ) . predict double resid, residuals . summarize resid Variable resid Obs Mean 74 -1.51e-15 Std. Dev. 3.212091 Min Max -5.453078 13.83719 We could do this without refitting the model. Stata always remembers the last set of estimates, even as we use new datasets. It was not necessary to type the double in predict double resid, residuals, but we wanted to remind you that you can specify the type of a variable in front of the variable’s name; see [U] 11.4.2 Lists of new variables. We made the new variable resid a double rather than the default float. If you want your residuals to have a mean as close to zero as possible, remember to request the extra precision of double. If we had not specified double, the mean of resid would have been roughly 10−9 rather than 10−14 . Although 10−14 sounds more precise than 10−9 , the difference really does not matter. For linear regression, predict can also calculate standardized residuals and Studentized residuals with the options rstandard and rstudent; for examples, see [R] regress postestimation. predict — Obtain predictions, residuals, etc., after estimation 1651 Single-equation (SE) models If you have not read the discussion above on using predict after linear regression, please do so. And predict’s default calculation almost always produces a statistic in the same metric as the dependent variable of the fitted model — for example, predicted counts for Poisson regression. In any case, xb can always be specified to obtain the linear prediction. predict can calculate the standard error of the prediction, which is obtained by using the covariance matrix of the estimators. Example 4 After most binary outcome models (for example, logistic, logit, probit, cloglog, scobit), predict calculates the probability of a positive outcome if we do not tell it otherwise. We can specify the xb option if we want the linear prediction (also known as the logit or probit index). The odd abbreviation xb is meant to suggest xβ. In logit and probit models, for example, the predicted probability is p = F (xβ), where F () is the logistic or normal cumulative distribution function, respectively. . logistic foreign mpg weight (output omitted ) . predict phat (option pr assumed; Pr(foreign)) . predict idxhat, xb . summarize foreign phat idxhat Variable Obs Mean foreign phat idxhat 74 74 74 .2972973 .2972973 -1.678202 Std. Dev. .4601885 .3052979 2.321509 Min Max 0 .000729 -7.223107 1 .8980594 2.175845 Because this is a logit model, we could obtain the predicted probabilities ourselves from the predicted index . generate phat2 = exp(idxhat)/(1+exp(idxhat)) but using predict without options is easier. Example 5 For all models, predict attempts to produce a predicted value in the same metric as the dependent variable of the model. We have seen that for dichotomous outcome models, the default statistic produced by predict is the probability of a success. Similarly, for Poisson regression, the default statistic produced by predict is the predicted count for the dependent variable. You can always specify the xb option to obtain the linear combination of the coefficients with an observation’s x values (the inner product of the coefficients and x values). 
For poisson (without an explicit exposure), this is the natural log of the count. . use http://www.stata-press.com/data/r13/airline, clear . poisson injuries XYZowned (output omitted ) 1652 predict — Obtain predictions, residuals, etc., after estimation . predict injhat (option n assumed; predicted number of events) . predict idx, xb . generate exp_idx = exp(idx) . summarize injuries injhat exp_idx idx Obs Mean Std. Dev. Variable injuries injhat exp_idx idx 9 9 9 9 7.111111 7.111111 7.111111 1.955174 5.487359 .8333333 .8333333 .1225612 Min Max 1 6 6 1.791759 19 7.666667 7.666667 2.036882 We note that our “hand-computed” prediction of the count (exp idx) matches what was produced by the default operation of predict. If our model has an exposure-time variable, we can use predict to obtain the linear prediction with or without the exposure. Let’s verify what we are getting by obtaining the linear prediction with and without exposure, transforming these predictions to count predictions and comparing them with the default count prediction from predict. We must remember to multiply by the exposure time when using predict . . . , nooffset. . use http://www.stata-press.com/data/r13/airline, clear . poisson injuries XYZowned, exposure(n) (output omitted ) . predict double injhat (option n assumed; predicted number of events) . predict double idx, xb . gen double exp_idx = exp(idx) . predict double idxn, xb nooffset . gen double exp_idxn = exp(idxn)*n . summarize injuries injhat exp_idx exp_idxn idx idxn Variable Obs Mean Std. Dev. Min Max injuries injhat exp_idx exp_idxn idx 9 9 9 9 9 7.111111 7.111111 7.111111 7.111111 1.869722 5.487359 3.10936 3.10936 3.10936 .4671044 1 2.919621 2.919621 2.919621 1.071454 19 12.06158 12.06158 12.06158 2.490025 idxn 9 4.18814 .1904042 4.061204 4.442013 Looking at the identical means and standard deviations for injhat, exp idx, and exp idxn, we see that we can reproduce the default computations of predict for poisson estimations. We have also demonstrated the relationship between the count predictions and the linear predictions with and without exposure. SE model scores Example 6 With most maximum likelihood estimators, predict can calculate equation-level scores. The first derivative of the log likelihood with respect to xj β is the equation-level score. predict — Obtain predictions, residuals, etc., after estimation . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . logistic foreign mpg weight (output omitted ) . predict double sc, score . summarize sc Variable Obs Mean Std. Dev. sc 74 -1.37e-12 .3533133 Min Max -.8760856 .8821309 1653 See [P] robust and [SVY] variance estimation for details regarding the role equation-level scores play in linearization-based variance estimators. Technical note predict after some estimation commands, such as regress and cnsreg, allows the score option as a synonym for the residuals option. Multiple-equation (ME) models If you have not read the above discussion on using predict after SE models, please do so. With the exception of the ability to select specific equations to predict from, the use of predict after ME models follows almost the same form that it does for SE models. Example 7 The details of prediction statistics that are specific to particular ME models are documented with the estimation command. 
If you are using ME commands that do not have separate discussions on obtaining predictions, read Obtaining predicted values in [R] mlogit postestimation, even if your interest is not in multinomial logistic regression. As a general introduction to the ME models, we will demonstrate predict after sureg: . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . sureg (price foreign displ) (weight foreign length) Seemingly unrelated regression Equation price weight Obs Parms RMSE "R-sq" chi2 P 74 74 2 2 2202.447 245.5238 0.4348 0.8988 45.21 658.85 0.0000 0.0000 Coef. Std. Err. z P>|z| [95% Conf. Interval] price foreign displacement _cons 3137.894 23.06938 680.8438 697.3805 3.443212 859.8142 4.50 6.70 0.79 0.000 0.000 0.428 1771.054 16.32081 -1004.361 4504.735 29.81795 2366.049 weight foreign length _cons -154.883 30.67594 -2699.498 75.3204 1.531981 302.3912 -2.06 20.02 -8.93 0.040 0.000 0.000 -302.5082 27.67331 -3292.173 -7.257674 33.67856 -2106.822 1654 predict — Obtain predictions, residuals, etc., after estimation sureg estimated two equations, one called price and the other weight; see [R] sureg. . predict pred_p, equation(price) (option xb assumed; fitted values) . predict pred_w, equation(weight) (option xb assumed; fitted values) . summarize price pred_p weight pred_w Obs Mean Variable price pred_p weight pred_w 74 74 74 74 6165.257 6165.257 3019.459 3019.459 Std. Dev. 2949.496 1678.805 777.1936 726.0468 Min Max 3291 2664.81 1760 1501.602 15906 10485.33 4840 4447.996 You may specify the equation by name, as we did above, or by number: equation(#1) means the same thing as equation(price) in this case. ME model scores Example 8 For ME models, predict allows you to specify a stub when generating equation-level score variables. predict generates new variables using this stub by appending an equation index. Depending upon the command, the index will start with 0 or 1. Here is an example where predict starts indexing the score variables with 0. . ologit rep78 mpg weight (output omitted ) . predict double sc*, scores . summarize sc* Variable sc0 sc1 sc2 sc3 sc4 Obs Mean 69 69 69 69 69 -1.33e-11 -7.69e-13 -2.87e-11 -1.04e-10 1.47e-10 Std. Dev. .5337363 .186919 .4061637 .5315368 .360525 Min Max -.9854088 -.2738537 -.5188487 -1.067351 -.921433 .921433 .9854088 1.130178 .8194842 .6140182 Although it involves much more typing, we could also specify the new variable names individually. . predict double (sc_xb sc_1 sc_2 sc_3 sc_4), scores . summarize sc_* Variable Obs Mean Std. Dev. sc_xb sc_1 sc_2 sc_3 sc_4 69 69 69 69 69 -1.33e-11 -7.69e-13 -2.87e-11 -1.04e-10 1.47e-10 .5337363 .186919 .4061637 .5315368 .360525 Min Max -.9854088 -.2738537 -.5188487 -1.067351 -.921433 .921433 .9854088 1.130178 .8194842 .6140182 predict — Obtain predictions, residuals, etc., after estimation 1655 Methods and formulas Denote the previously estimated coefficient vector as b and its estimated variance matrix as V. predict works by recalling various aspects of the model, such as b, and combining that information with the data currently in memory. Let’s write xj for the j th observation currently in memory. The predicted value (xb option) is defined as ybj = xj b + offsetj The standard error of the prediction (the stdp option) is defined as spj = q xj Vx0j The standard error of the difference in linear predictions between equations 1 and 2 is defined as 1 sdpj = {(x1j , −x2j , 0, . . . , 0) V (x1j , −x2j , 0, . . . 
, 0)0 } 2 See the individual estimation commands for information about calculating command-specific predict statistics. Also see [R] predictnl — Obtain nonlinear predictions, standard errors, etc., after estimation [P] predict — Obtain predictions, residuals, etc., after estimation programming command [U] 20 Estimation and postestimation commands Title predictnl — Obtain nonlinear predictions, standard errors, etc., after estimation Syntax Remarks and examples Menu Methods and formulas Description References Options Also see Syntax in , options predictnl type newvar = pnl exp if Description options Main create newvar containing standard errors create newvar containing variances create newvar containing the Wald test statistic create newvar containing the significance level (p-value) of the Wald test create newvars containing lower and upper confidence intervals set confidence level; default is level(95) create stub1, stub2, . . . , stubk variables containing observationspecific derivatives se(newvar) variance(newvar) wald(newvar) p(newvar) ci(newvars) level(#) g(stub) Advanced iterate(#) force maximum iterations for finding optimal step size; default is 100 calculate standard errors, etc., even when possibly inappropriate df(#) use F distribution with # denominator degrees of freedom for the reference distribution of the test statistic df(#) does not appear in the dialog box. Menu Statistics > Postestimation > Nonlinear predictions Description predictnl calculates (possibly) nonlinear predictions after any Stata estimation command and optionally calculates the variances, standard errors, Wald test statistics, significance levels, and confidence limits for these predictions. Unlike its companion nonlinear postestimation commands testnl and nlcom, predictnl generates functions of the data (that is, predictions), not scalars. The quantities generated by predictnl are thus vectorized over the observations in the data. Consider some general prediction, g(θ, xi ), for i = 1, . . . , n, where θ are the model parameters and xi are some data for the ith observation; xi is assumed fixed. Typically, g(θ, xi ) is estimated by g(b θ, xi ), where b θ are the estimated model parameters, which are stored in e(b) following any Stata estimation command. 1656 predictnl — Obtain nonlinear predictions, standard errors, etc., after estimation 1657 In its most common use, predictnl generates two variables: one containing the estimated prediction, g(b θ, xi ), the other containing the estimated standard error of g(b θ, xi ). The calculation of standard errors (and other obtainable quantities that are based on the standard errors, such as test statistics) is based on the delta method, an approximation appropriate in large samples; see Methods and formulas. predictnl can be used with svy estimation results (assuming that predict is also allowed), see [SVY] svy postestimation. The specification of g(b θ, xi ) is handled by specifying pnl exp, and the values of g(b θ, xi ) are stored in the new variable newvar of storage type type. pnl exp is any valid Stata expression and may also contain calls to two special functions unique to predictnl: 1. predict([predict options]): When you are evaluating pnl exp, predict() is a convenience function that replicates the calculation performed by the command predict . . ., predict options As such, the predict() function may be used either as a shorthand for the formula used to make this prediction or when the formula is not readily available. 
When used without arguments, predict() replicates the default prediction for that particular estimation command. 2. xb([eqno]): The xb() function replicates the calculation of the linear predictor xi b for equation eqno. If xb() is specified without eqno, the linear predictor for the first equation (or the only equation in single-equation estimation) is obtained. For example, xb(#1) (or equivalently, xb() with no arguments) translates to the linear predictor for the first equation, xb(#2) for the second, and so on. You could also refer to the equations by their names, such as xb(income). When specifying pnl exp, both of these functions may be used repeatedly, in combination, and in combination with other Stata functions and expressions. See Remarks and examples for examples that use both of these functions. Options Main se(newvar) adds newvar of storage type type, where for each i in the prediction sample, newvar[i] contains the estimated standard error of g(b θ, xi ). variance(newvar) adds newvar of storage type type, where for each i in the prediction sample, newvar[i] contains the estimated variance of g(b θ, xi ). wald(newvar) adds newvar of storage type type, where for each i in the prediction sample, newvar[i] contains the Wald test statistic for the test of the hypothesis H0 : g(θ, xi ) = 0. p(newvar) adds newvar of storage type type, where newvar[i] contains the significance level (p-value) of the Wald test of H0 : g(θ, xi ) = 0 versus the two-sided alternative. ci(newvars) requires the specification of two newvars, such that the ith observation of each will contain the left and right endpoints (respectively) of a confidence interval for g(θ, xi ). The level of the confidence intervals is determined by level(#). level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. 1658 predictnl — Obtain nonlinear predictions, standard errors, etc., after estimation g(stub) specifies that new variables, stub1, stub2, . . . , stubk be created, where k is the dimension of θ. stub1 will contain the observation-specific derivatives of g(θ, xi ) with respect to the first element, θ1 , of θ; stub2 will contain the derivatives of g(θ, xi ) with respect to θ2 , etc.; If the derivative of g(θ, xi ) with respect to a particular coefficient in θ equals zero for all observations in the prediction sample, the stub variable for that coefficient is not created. The ordering of the parameters in θ is precisely that of the stored vector of parameter estimates e(b). Advanced iterate(#) specifies the maximum number of iterations used to find the optimal step size in the calculation of numerical derivatives of g(θ, xi ) with respect to θ. By default, the maximum number of iterations is 100, but convergence is usually achieved after only a few iterations. You should rarely have to use this option. force forces the calculation of standard errors and other inference-related quantities in situations where predictnl would otherwise refuse to do so. The calculation of standard errors takes place by evaluating (at b θ) the numerical derivative of g(θ, xi ) with respect to θ. If predictnl detects that g() is possibly a function of random quantities other than b θ, it will refuse to calculate standard errors or any other quantity derived from them. The force option forces the calculation to take place anyway. 
If you use the force option, there is no guarantee that any inference quantities (for example, standard errors) will be correct or that the values obtained can be interpreted. The following option is available with predictnl but is not shown in the dialog box: df(#) specifies that the F distribution with # denominator degrees of freedom be used for the reference distribution of the test statistic. Remarks and examples Remarks are presented under the following headings: Introduction Nonlinear transformations and standard errors Using xb() and predict() Multiple-equation (ME) estimators Test statistics and significance levels Manipulability Confidence intervals Introduction predictnl and nlcom both use the delta method. They take a nonlinear transformation of the estimated parameter vector from some fitted model and apply the delta method to calculate the variance, standard error, Wald test statistic, etc., of this transformation. nlcom is designed for scalar functions of the parameters, and predictnl is designed for functions of the parameters and of the data, that is, for predictions. Nonlinear transformations and standard errors We begin by fitting a probit model to the low-birthweight data of Hosmer, Lemeshow, and Sturdivant (2013, 24). The data are described in detail in example 1 of [R] logistic. . use http://www.stata-press.com/data/r13/lbw predictnl — Obtain nonlinear predictions, standard errors, etc., after estimation (Hosmer & Lemeshow data) . probit low lwt smoke ptl ht Iteration 0: log likelihood Iteration 1: log likelihood Iteration 2: log likelihood Iteration 3: log likelihood Probit regression = -117.336 = -106.75886 = -106.67852 = -106.67851 Number of obs LR chi2(4) Prob > chi2 Pseudo R2 Log likelihood = -106.67851 low Coef. lwt smoke ptl ht _cons -.0095164 .3487004 .365667 1.082355 .4238985 1659 Std. Err. .0036875 .2041772 .1921201 .410673 .4823224 z P>|z| -2.58 1.71 1.90 2.64 0.88 0.010 0.088 0.057 0.008 0.379 = = = = 189 21.31 0.0003 0.0908 [95% Conf. Interval] -.0167438 -.0514794 -.0108815 .2774503 -.5214361 -.0022891 .7488803 .7422154 1.887259 1.369233 After we fit such a model, we first would want to generate the predicted probabilities of a low birthweight, given the covariate values in the estimation sample. This is easily done using predict after probit, but it doesn’t answer the question, “What are the standard errors of those predictions?” For the time being, we will consider ourselves ignorant of any automated way to obtain the predicted probabilities after probit. The formula for the prediction is Pr(y 6= 0|xi ) = Φ(xi β) where Φ is the standard cumulative normal. Thus for this example, g(θ, xi ) = Φ(xi β). Armed with the formula, we can use predictnl to generate the predictions and their standard errors: . predictnl phat = normal(_b[_cons] + _b[ht]*ht + _b[ptl]*ptl + > _b[smoke]*smoke + _b[lwt]*lwt), se(phat_se) . list phat phat_se lwt smoke ptl ht in -10/l phat phat_se lwt smoke ptl ht 180. 181. 182. 183. 184. .2363556 .6577712 .2793261 .1502118 .5702871 .042707 .1580714 .0519958 .0676339 .0819911 120 154 106 190 101 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 185. 186. 187. 188. 189. .4477045 .2988379 .4514706 .5615571 .7316517 .079889 .0576306 .080815 .1551051 .1361469 95 100 94 142 130 1 0 1 0 1 0 0 0 0 0 0 0 0 1 1 Thus subject 180 in our data has an estimated probability of low birthweight of 23.6% with standard error 4.3%. Used without options, predictnl is not much different from generate. 
By specifying the se(phat se) option, we were able to obtain a variable containing the standard errors of the predictions; therein lies the utility of predictnl. 1660 predictnl — Obtain nonlinear predictions, standard errors, etc., after estimation Using xb() and predict() As was the case above, a prediction is often not a function of a few isolated parameters and their corresponding variables but instead is some (possibly elaborate) function of the entire linear predictor. For models with many predictors, the brute-force expression for the linear predictor can be cumbersome to type. An alternative is to use the inline function xb(). xb() is a shortcut for having to type b[ cons] + b[ht]*ht + b[ptl]*ptl + . . . , . drop phat phat_se . predictnl phat = norm(xb()), se(phat_se) . list phat phat_se lwt smoke ptl ht in -10/l phat phat_se lwt smoke ptl ht 180. 181. 182. 183. 184. .2363556 .6577712 .2793261 .1502118 .5702871 .042707 .1580714 .0519958 .0676339 .0819911 120 154 106 190 101 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 185. 186. 187. 188. 189. .4477045 .2988379 .4514706 .5615571 .7316517 .079889 .0576306 .080815 .1551051 .1361469 95 100 94 142 130 1 0 1 0 1 0 0 0 0 0 0 0 0 1 1 which yields the same results. This approach is easier, produces more readable code, and is less prone to error, such as forgetting to include a term in the sum. Here we used xb() without arguments because we have only one equation in our model. In multiple-equation (ME) settings, xb() (or equivalently xb(#1)) yields the linear predictor from the first equation, xb(#2) from the second, etc. You can also refer to equations by their names, for example, xb(income). Technical note Most estimation commands in Stata allow the postestimation calculation of linear predictors and their standard errors via predict. For example, to obtain these for the first (or only) equation in the model, you could type predict xbvar, xb predict stdpvar, stdp Equivalently, you could type predictnl xbvar = xb(), se(stdpvar) but we recommend the first method, as it is faster. As we demonstrated above, however, predictnl is more general. Returning to our probit example, we can further simplify the calculation by using the inline function predict(). predict(pred options) works by substituting, within our predictnl expression, the calculation performed by predict . . ., pred options predictnl — Obtain nonlinear predictions, standard errors, etc., after estimation 1661 In our example, we are interested in the predicted probabilities after a probit regression, normally obtained via predict . . ., p We can obtain these predictions (and standard errors) by using . drop phat phat_se . predictnl phat = predict(p), se(phat_se) . list phat phat_se lwt smoke ptl ht in -10/l phat phat_se lwt smoke ptl ht 180. 181. 182. 183. 184. .2363556 .6577712 .2793261 .1502118 .5702871 .042707 .1580714 .0519958 .0676339 .0819911 120 154 106 190 101 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 185. 186. 187. 188. 189. .4477045 .2988379 .4514706 .5615571 .7316517 .079889 .0576306 .080815 .1551051 .1361469 95 100 94 142 130 1 0 1 0 1 0 0 0 0 0 0 0 0 1 1 which again replicates what we have already done by other means. However, this version did not require knowledge of the formula for the predicted probabilities after a probit regression—predict(p) took care of that for us. Because the predicted probability is the default prediction after probit, we could have just used predict() without arguments, namely, . 
predictnl phat = predict(), se(phat_se) Also, the expression pnl exp can be inordinately complicated, with multiple calls to predict() and xb(). For example, . predictnl phat = normal(invnormal(predict()) + predict(xb)/xb() - 1), > se(phat_se) is perfectly valid and will give the same result as before, albeit a bit inefficiently. Technical note When using predict() and xb(), the formula for the calculation is substituted within pnl exp, not the values that result from the application of that formula. To see this, note the subtle difference between . predict xbeta, xb . predictnl phat = normal(xbeta), se(phat_se) and . predictnl phat = normal(xb()), se(phat_se) 1662 predictnl — Obtain nonlinear predictions, standard errors, etc., after estimation Both sequences will yield the same phat, yet for the first sequence, phat se will equal zero for all observations. The reason is that, once evaluated, xbeta will contain the values of the linear predictor, yet these values are treated as fixed and nonstochastic as far as predictnl is concerned. By contrast, because xb() is shorthand for the formula used to calculate the linear predictor, it contains not values, but references to the estimated regression coefficients and corresponding variables. Thus the second method produces the desired result. Multiple-equation (ME) estimators In [R] mlogit, data on insurance choice (Tarlov et al. 1989; Wells et al. 1989) were examined, and a multinomial logit was used to assess the effects of age, gender, race, and site of study (one of three sites) on the type of insurance: . use http://www.stata-press.com/data/r13/sysdsn1, clear (Health insurance data) . mlogit insure age male nonwhite i.site, nolog Multinomial logistic regression Number of obs LR chi2(10) Prob > chi2 Log likelihood = -534.36165 Pseudo R2 insure Indemnity Coef. Std. Err. z P>|z| = = = = 615 42.99 0.0000 0.0387 [95% Conf. Interval] (base outcome) Prepaid age male nonwhite -.011745 .5616934 .9747768 .0061946 .2027465 .2363213 -1.90 2.77 4.12 0.058 0.006 0.000 -.0238862 .1643175 .5115955 .0003962 .9590693 1.437958 site 2 3 .1130359 -.5879879 .2101903 .2279351 0.54 -2.58 0.591 0.010 -.2989296 -1.034733 .5250013 -.1412433 _cons .2697127 .3284422 0.82 0.412 -.3740222 .9134476 age male nonwhite -.0077961 .4518496 .2170589 .0114418 .3674867 .4256361 -0.68 1.23 0.51 0.496 0.219 0.610 -.0302217 -.268411 -.6171725 .0146294 1.17211 1.05129 site 2 3 -1.211563 -.2078123 .4705127 .3662926 -2.57 -0.57 0.010 0.570 -2.133751 -.9257327 -.2893747 .510108 _cons -1.286943 .5923219 -2.17 0.030 -2.447872 -.1260134 Uninsure Of particular interest is the estimation of the relative risk, which, for a given selection, is the ratio of the probability of making that selection to the probability of selecting the base category (Indemnity here), given a set of covariate values. In a multinomial logit model, the relative risk (when comparing to the base category) simplifies to the exponentiated linear predictor for that selection. Using this example, we can estimate the observation-specific relative risks of selecting a prepaid plan over the base category (with standard errors) by either referring to the Prepaid equation by name or number, . predictnl RRppaid = exp(xb(Prepaid)), se(SERRppaid) predictnl — Obtain nonlinear predictions, standard errors, etc., after estimation 1663 or . predictnl RRppaid = exp(xb(#1)), se(SERRppaid) because Prepaid is the first equation in the model. 
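The same mechanics apply to the other non-base outcome. Here is a minimal sketch, assuming the mlogit fit above is still in memory; RRunins and SERRunins are hypothetical variable names:

* Sketch: relative risk of the Uninsure outcome versus the Indemnity base,
* taken from the equation named Uninsure
predictnl RRunins = exp(xb(Uninsure)), se(SERRunins)
summarize RRunins SERRunins

Because predictnl vectorizes over the observations, summarize then gives a quick view of how these observation-specific relative risks and their standard errors are distributed.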
Those of us for whom the simplified formula for the relative risk does not immediately come to mind may prefer to calculate the relative risk directly from its definition, that is, as a ratio of two predicted probabilities. After mlogit, the predicted probability for a category may be obtained using predict, but we must specify the category as the outcome: . predictnl RRppaid = predict(outcome(Prepaid))/predict(outcome(Indemnity)), > se(SERRppaid) (1 missing value generated) . list RRppaid SERRppaid age male nonwhite site in 1/10 RRppaid SERRpp~d age male nonwhite site 1. 2. 3. 4. 5. .6168578 1.056658 .8426442 1.460581 .9115747 .1503759 .1790703 .1511281 .3671465 .1324168 73.722107 27.89595 37.541397 23.641327 40.470901 0 0 0 0 0 0 0 0 1 0 2 2 1 3 2 6. 7. 8. 9. 10. 1.034701 .9223664 1.678312 .9188519 .5766296 .1696923 .1344981 .4216626 .2256017 .1334877 29.683777 39.468857 26.702255 63.101974 69.839828 0 0 1 0 0 0 0 0 1 0 2 2 1 3 1 The “(1 missing value generated)” message is not an error; further examination of the data would reveal that age is missing in one observation and that the offending observation (among others) is not in the estimation sample. Just as with predict, predictnl can generate predictions in or out of the estimation sample. Thus we estimate (among other things) that a white, female, 73-year-old from site 2 is less likely to choose a prepaid plan over an indemnity plan—her relative risk is about 62% with standard error 15%. Test statistics and significance levels Often a standard error calculation is just a means to an end, and what is really desired is a test of the hypothesis, H0 : g(θ, xi ) = 0 versus the two-sided alternative. We can use predictnl to obtain the Wald test statistics or significance levels (or both) for the above tests, whether or not we want standard errors. To obtain the Wald test statistics, we use the wald() option; for significance levels, we use p(). Returning to our mlogit example, suppose that we wanted for each observation a test of whether the relative risk of choosing a prepaid plan over an indemnity plan is different from one. One way to do this would be to define g() to be the relative risk minus one and then test whether g() is different from zero. . predictnl RRm1 = exp(xb(Prepaid)) - 1, wald(W_RRm1) p(sig_RRm1) (1 missing value generated) 1664 predictnl — Obtain nonlinear predictions, standard errors, etc., after estimation note: significance levels are with respect to the chi-squared(1) distribution . list RRm1 W_RRm1 sig_RRm1 age male nonwhite in 1/10 RRm1 W_RRm1 sig_RRm1 age male nonwhite 1. 2. 3. 4. 5. -.3831422 .0566578 -.1573559 .4605812 -.0884253 6.491778 .100109 1.084116 1.573743 .4459299 .0108375 .7516989 .2977787 .2096643 .5042742 73.722107 27.89595 37.541397 23.641327 40.470901 0 0 0 0 0 0 0 0 1 0 6. 7. 8. 9. 10. .0347015 -.0776336 .6783119 -.0811482 -.4233705 .0418188 .3331707 2.587788 .1293816 10.05909 .8379655 .563798 .1076906 .719074 .001516 29.683777 39.468857 26.702255 63.101974 69.839828 0 0 1 0 0 0 0 0 1 0 The newly created variable W RRm1 contains the Wald test statistic for each observation, and sig RRm1 contains the level of significance. Thus our 73-year-old white female represented by the first observation would have a relative risk of choosing prepaid over indemnity that is significantly different from 1, at least at the 5% level. For this test, it was not necessary to generate a variable containing the standard error of the relative risk minus 1, but we could have done so had we wanted. 
We could have also omitted specifying wald(W RRm1) if all we cared about were, say, the significance levels of the tests. In this regard, predictnl acts as an observation-specific version of testnl, with the test results vectorized over the observations in the data. The significance levels are pointwise—they are not adjusted to reflect any simultaneous testing over the observations in the data. Manipulability There are many ways to specify g(θ, xi ) to yield tests such that, for multiple specifications of g(), the theoretical conditions for which H0 : g(θ, xi ) = 0 is true will be equivalent. However, this does not mean that the tests themselves will be equivalent. This is known as the manipulability of the Wald test for nonlinear hypotheses; also see [R] boxcox. As an example, consider the previous section where we defined g() to be the relative risk between choosing a prepaid plan over an indemnity plan, minus 1. We could also have defined g() to be the risk difference—the probability of choosing a prepaid plan minus the probability of choosing an indemnity plan. Either specification of g() yields a mathematically equivalent specification of H0 : g() = 0; that is, the risk difference will equal zero when the relative risk equals one. However, the tests themselves do not give the same results: predictnl — Obtain nonlinear predictions, standard errors, etc., after estimation 1665 . predictnl RD = predict(outcome(Prepaid)) - predict(outcome(Indemnity)), > wald(W_RD) p(sig_RD) (1 missing value generated) note: significance levels are with respect to the chi-squared(1) distribution . list RD W_RD sig_RD RRm1 W_RRm1 sig_RRm1 in 1/10 RD W_RD sig_RD RRm1 W_RRm1 sig_RRm1 1. 2. 3. 4. 5. -.2303744 .0266902 -.0768078 .1710702 -.0448509 4.230243 .1058542 .9187646 2.366535 .4072922 .0397097 .7449144 .3377995 .1239619 .5233471 -.3831422 .0566578 -.1573559 .4605812 -.0884253 6.491778 .100109 1.084116 1.573743 .4459299 .0108375 .7516989 .2977787 .2096643 .5042742 6. 7. 8. 9. 10. .0165251 -.0391535 .22382 -.0388409 -.2437626 .0432816 .3077611 4.539085 .1190183 6.151558 .835196 .5790573 .0331293 .7301016 .0131296 .0347015 -.0776336 .6783119 -.0811482 -.4233705 .0418188 .3331707 2.587788 .1293816 10.05909 .8379655 .563798 .1076906 .719074 .001516 In certain cases (such as subject 8), the difference can be severe enough to potentially change the conclusion. The reason for this inconsistency is that the nonlinear Wald test is actually a standard Wald test of a first-order Taylor approximation of g(), and this approximation can differ according to how g() is specified. As such, keep in mind the manipulability of nonlinear Wald tests when drawing scientific conclusions. Confidence intervals We can also use predictnl to obtain confidence intervals for the observation-specific g(θ, xi ) by using the ci() option to specify two new variables to contain the left and right endpoints of the confidence interval, respectively. For example, we could generate confidence intervals for the risk differences calculated previously: . drop RD . predictnl RD = predict(outcome(Prepaid)) - predict(outcome(Indemnity)), > ci(RD_lcl RD_rcl) (1 missing value generated) note: Confidence intervals calculated using Z critical values . list RD RD_lcl RD_rcl age male nonwhite in 1/10 RD RD_lcl RD_rcl age male nonwhite 1. 2. 3. 4. 5. 
-.2303744 .0266902 -.0768078 .1710702 -.0448509 -.4499073 -.1340948 -.2338625 -.0468844 -.1825929 -.0108415 .1874752 .080247 .3890248 .092891 73.722107 27.89595 37.541397 23.641327 40.470901 0 0 0 0 0 0 0 0 1 0 6. 7. 8. 9. 10. .0165251 -.0391535 .22382 -.0388409 -.2437626 -.1391577 -.177482 .0179169 -.2595044 -.4363919 .1722078 .099175 .4297231 .1818226 -.0511332 29.683777 39.468857 26.702255 63.101974 69.839828 0 0 1 0 0 0 0 0 1 0 The confidence level, here, 95%, is either set using the level() option or obtained from the current default level, c(level); see [U] 20.7 Specifying the width of confidence intervals. 1666 predictnl — Obtain nonlinear predictions, standard errors, etc., after estimation From the above output, we can see that, for subjects 1, 8, and 10, a 95% confidence interval for the risk difference does not contain zero, meaning that, for these subjects, there is some evidence of a significant difference in risks. The confidence intervals calculated by predictnl are pointwise; there is no adjustment (such as a Bonferroni correction) made so that these confidence intervals may be considered jointly at the specified level. Methods and formulas For the ith observation, consider the transformation g(θ, xi ), estimated by g(b θ, xi ), for the 1 × k parameter vector θ and data xi (xi is assumed fixed). The variance of g(b θ, xi ) is estimated by n o c g(b Var θ, xi ) = GVG0 where G is the vector of derivatives G= ∂g(θ, xi ) ∂θ θ=b θ (1×k) b {g(b and V is the estimated variance–covariance matrix of b θ. Standard errors, se θ, xi )}, are obtained as the square roots of the variances. The Wald test statistic for testing H0 : g(θ, xi ) = 0 versus the two-sided alternative is given by n o2 g(b θ, xi ) n o Wi = c g(b Var θ, xi ) When the variance–covariance matrix of b θ is an asymptotic covariance matrix, Wi is approximately distributed as χ2 with 1 degree of freedom. For linear regression, Wi is taken to be approximately distributed as F1,r , where r is the residual degrees of freedom from the original model fit. The levels of significance of the observation-by-observation tests of H0 versus the two-sided alternative are given by pi = Pr(T > Wi ) where T is either a χ2 - or F -distributed random variable, as described above. A (1 − α) × 100% confidence interval for g(θ, xi ) is given by h n oi b g(b g(b θ, xi ) ± zα/2 se θ, xi ) when Wi is χ2 -distributed, and h n oi b g(b g(b θ, xi ) ± tα/2,r se θ, xi ) predictnl — Obtain nonlinear predictions, standard errors, etc., after estimation 1667 when Wi is F -distributed. zp is the 1 − p quantile of the standard normal distribution, and tp,r is the 1 − p quantile of the t distribution with r degrees of freedom. References Gould, W. W. 1996. crc43: Wald test of nonlinear hypotheses after model estimation. Stata Technical Bulletin 29: 2–4. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 15–18. College Station, TX: Stata Press. Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley. Phillips, P. C. B., and J. Y. Park. 1988. On the formulation of Wald tests of nonlinear restrictions. Econometrica 56: 1065–1083. Tarlov, A. R., J. E. Ware, Jr., S. Greenfield, E. C. Nelson, E. Perrin, and M. Zubkoff. 1989. The medical outcomes study. An application of methods for monitoring the results of medical care. Journal of the American Medical Association 262: 925–930. Wells, K. B., R. D. Hays, M. A. Burnam, W. H. Rogers, S. Greenfield, and J. E. Ware, Jr. 1989. 
Detection of depressive disorder for patients receiving prepaid or fee-for-service care. Results from the Medical Outcomes Survey. Journal of the American Medical Association 262: 3298–3302. Also see [R] lincom — Linear combinations of estimators [R] nlcom — Nonlinear combinations of estimators [R] predict — Obtain predictions, residuals, etc., after estimation [R] test — Test linear hypotheses after estimation [R] testnl — Test nonlinear hypotheses after estimation [U] 20 Estimation and postestimation commands Title probit — Probit regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax probit depvar indepvars options if in weight , options Description Model noconstant offset(varname) asis constraints(constraints) collinear suppress constant term include varname in model with coefficient constrained to 1 retain perfect predictor variables apply specified linear constraints keep collinear variables SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, bootstrap, or jackknife Reporting level(#) nocnsreport display options set confidence level; default is level(95) do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used nocoef coeflegend do not display the coefficient table; seldom used display legend instead of statistics indepvars may contain factor variables; see [U] 11.4.3 Factor variables. depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, mfp, mi estimate, nestreg, rolling, statsby, stepwise, and svy are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce(), nocoef, and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. nocoef and coeflegend do not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 1668 probit — Probit regression 1669 Menu Statistics > Binary outcomes > Probit regression Description probit fits a maximum-likelihood probit model. If estimating on grouped data, see the bprobit command described in [R] glogit. Several auxiliary commands may be run after probit, logit, or logistic; see [R] logistic postestimation for a description of these commands. See [R] logistic for a list of related estimation commands. Options Model noconstant, offset(varname), constraints(constraints), collinear; see [R] estimation options. asis specifies that all specified variables and observations be retained in the maximization process. This option is typically not specified and may introduce numerical instability. Normally probit drops variables that perfectly predict success or failure in the dependent variable along with their associated observations. In those cases, the effective coefficient on the dropped variables is infinity (negative infinity) for variables that completely determine a success (failure). Dropping the variable and perfectly predicted observations has no effect on the likelihood or estimates of the remaining coefficients and increases the numerical stability of the optimization process. 
Specifying this option forces retention of perfect predictor variables and their associated observations. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. 1670 probit — Probit regression The following options are available with probit but are not shown in the dialog box: nocoef specifies that the coefficient table not be displayed. This option is sometimes used by programmers but is of no use interactively. coeflegend; see [R] estimation options. Remarks and examples Remarks are presented under the following headings: Robust standard errors Model identification probit fits maximum likelihood models with dichotomous dependent (left-hand-side) variables coded as 0/1 (more precisely, coded as 0 and not 0). Example 1 We have data on the make, weight, and mileage rating of 22 foreign and 52 domestic automobiles. We wish to fit a probit model explaining whether a car is foreign based on its weight and mileage. Here is an overview of our data: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . keep make mpg weight foreign . describe Contains data from http://www.stata-press.com/data/r13/auto.dta obs: 74 1978 Automobile Data vars: 4 13 Apr 2013 17:45 size: 1,702 (_dta has notes) variable name make mpg weight foreign storage type display format str18 int int byte Sorted by: Note: %-18s %8.0g %8.0gc %8.0g value label variable label origin Make and Model Mileage (mpg) Weight (lbs.) Car type foreign dataset has changed since last saved . inspect foreign foreign: # # # # # # Car type Number of Observations Negative Zero Positive # # 0 Total Missing 1 Total 52 22 74 - Integers 52 22 Nonintegers - 74 74 (2 unique values) foreign is labeled and all values are documented in the label. - probit — Probit regression 1671 The foreign variable takes on two unique values, 0 and 1. The value 0 denotes a domestic car, and 1 denotes a foreign car. The model that we wish to fit is Pr(foreign = 1) = Φ(β0 + β1 weight + β2 mpg) where Φ is the cumulative normal distribution. To fit this model, we type . probit foreign weight mpg Iteration 0: log likelihood = -45.03321 Iteration 1: log likelihood = -27.914626 (output omitted ) Iteration 5: log likelihood = -26.844189 Probit regression Number of obs LR chi2(2) Prob > chi2 Pseudo R2 Log likelihood = -26.844189 foreign Coef. weight mpg _cons -.0023355 -.1039503 8.275464 Std. Err. .0005661 .0515689 2.554142 z -4.13 -2.02 3.24 P>|z| 0.000 0.044 0.001 = = = = 74 36.38 0.0000 0.4039 [95% Conf. 
Interval] -.003445 -.2050235 3.269437 -.0012261 -.0028772 13.28149 We find that heavier cars are less likely to be foreign and that cars yielding better gas mileage are also less likely to be foreign, at least holding the weight of the car constant. See [R] maximize for an explanation of the output. Technical note Stata interprets a value of 0 as a negative outcome (failure) and treats all other values (except missing) as positive outcomes (successes). Thus if your dependent variable takes on the values 0 and 1, then 0 is interpreted as failure and 1 as success. If your dependent variable takes on the values 0, 1, and 2, then 0 is still interpreted as failure, but both 1 and 2 are treated as successes. If you prefer a more formal mathematical statement, when you type probit y x, Stata fits the model Pr(yj 6= 0 | xj ) = Φ(xj β) where Φ is the standard cumulative normal. Robust standard errors If you specify the vce(robust) option, probit reports robust standard errors; see [U] 20.21 Obtaining robust variance estimates. Example 2 For the model from example 1, the robust calculation increases the standard error of the coefficient on mpg by almost 15%: 1672 probit — Probit regression . probit foreign weight mpg, vce(robust) nolog Probit regression Log pseudolikelihood = -26.844189 foreign Coef. weight mpg _cons -.0023355 -.1039503 8.275464 Robust Std. Err. .0004934 .0593548 2.539177 z -4.73 -1.75 3.26 Number of obs Wald chi2(2) Prob > chi2 Pseudo R2 P>|z| 0.000 0.080 0.001 = = = = 74 30.26 0.0000 0.4039 [95% Conf. Interval] -.0033025 -.2202836 3.298769 -.0013686 .0123829 13.25216 Without vce(robust), the standard error for the coefficient on mpg was reported to be 0.052 with a resulting confidence interval of [ −0.21, −0.00 ]. Example 3 The vce(cluster clustvar) option can relax the independence assumption required by the probit estimator to independence between clusters. To demonstrate, we will switch to a different dataset. We are studying unionization of women in the United States and have a dataset with 26,200 observations on 4,434 women between 1970 and 1988. We will use the variables age (the women were 14 – 26 in 1968, and our data span the age range of 16 – 46), grade (years of schooling completed, ranging from 0 to 18), not smsa (28% of the person-time was spent living outside an SMSA —standard metropolitan statistical area), south (41% of the person-time was in the South), and year. Each of these variables is included in the regression as a covariate along with the interaction between south and year. This interaction, along with the south and year variables, is specified in the probit command using factor-variables notation, south##c.year. We also have variable union, indicating union membership. Overall, 22% of the person-time is marked as time under union membership, and 44% of these women have belonged to a union. probit — Probit regression 1673 We fit the following model, ignoring that the women are observed an average of 5.9 times each in these data: . use http://www.stata-press.com/data/r13/union, clear (NLS Women 14-24 in 1968) . probit union age grade not_smsa south##c.year Iteration 0: log likelihood = -13864.23 Iteration 1: log likelihood = -13545.541 Iteration 2: log likelihood = -13544.385 Iteration 3: log likelihood = -13544.385 Probit regression Number of obs LR chi2(6) Prob > chi2 Pseudo R2 Log likelihood = -13544.385 Std. Err. z P>|z| = = = = 26200 639.69 0.0000 0.0231 union Coef. [95% Conf. 
Interval] age grade not_smsa 1.south year .0118481 .0267365 -.1293525 -.8281077 -.0080931 .0029072 .0036689 .0202595 .2472219 .0033469 4.08 7.29 -6.38 -3.35 -2.42 0.000 0.000 0.000 0.001 0.016 .0061502 .0195457 -.1690604 -1.312654 -.0146529 .017546 .0339273 -.0896445 -.3435618 -.0015333 south#c.year 1 .0057369 .0030917 1.86 0.064 -.0003226 .0117965 _cons -.6542487 .2007777 -3.26 0.001 -1.047766 -.2607316 The reported standard errors in this model are probably meaningless. Women are observed repeatedly, and so the observations are not independent. Looking at the coefficients, we find a large southern effect against unionization and a time trend for the south that is almost significantly different from the overall downward trend. The vce(cluster clustvar) option provides a way to fit this model and obtains correct standard errors: 1674 probit — Probit regression . probit union age Iteration 0: log Iteration 1: log Iteration 2: log Iteration 3: log Probit regression grade not_smsa south##c.year, vce(cluster id) pseudolikelihood = -13864.23 pseudolikelihood = -13545.541 pseudolikelihood = -13544.385 pseudolikelihood = -13544.385 Number of obs = 26200 Wald chi2(6) = 166.53 Prob > chi2 = 0.0000 Log pseudolikelihood = -13544.385 Pseudo R2 = 0.0231 (Std. Err. adjusted for 4434 clusters in idcode) Robust Std. Err. union Coef. z P>|z| [95% Conf. Interval] age grade not_smsa 1.south year .0118481 .0267365 -.1293525 -.8281077 -.0080931 .0056625 .0078124 .0403885 .3201584 .0060829 2.09 3.42 -3.20 -2.59 -1.33 0.036 0.001 0.001 0.010 0.183 .0007499 .0114244 -.2085125 -1.455607 -.0200153 .0229463 .0420486 -.0501925 -.2006089 .0038292 south#c.year 1 .0057369 .0040133 1.43 0.153 -.002129 .0136029 _cons -.6542487 .3485976 -1.88 0.061 -1.337487 .02899 These standard errors are larger than those reported by the inappropriate conventional calculation. By comparison, another model we could fit is an equal-correlation population-averaged probit model: . xtprobit union age grade not_smsa south##c.year, pa Iteration 1: tolerance = .12544249 Iteration 2: tolerance = .0034686 Iteration 3: tolerance = .00017448 Iteration 4: tolerance = 8.382e-06 Iteration 5: tolerance = 3.997e-07 GEE population-averaged model Number of obs Group variable: idcode Number of groups Link: probit Obs per group: min Family: binomial avg Correlation: exchangeable max Wald chi2(6) Scale parameter: 1 Prob > chi2 Std. Err. z P>|z| = = = = = = = 26200 4434 1 5.9 12 242.57 0.0000 union Coef. [95% Conf. Interval] age grade not_smsa 1.south year .0089699 .0333174 -.0715717 -1.017368 -.0062708 .0053208 .0062352 .027543 .207931 .0055314 1.69 5.34 -2.60 -4.89 -1.13 0.092 0.000 0.009 0.000 0.257 -.0014586 .0210966 -.1255551 -1.424905 -.0171122 .0193985 .0455382 -.0175884 -.6098308 .0045706 south#c.year 1 .0086294 .00258 3.34 0.001 .0035727 .013686 _cons -.8670997 .294771 -2.94 0.003 -1.44484 -.2893592 probit — Probit regression 1675 The coefficient estimates are similar, but these standard errors are smaller than those produced by probit, vce(cluster clustvar), as we would expect. If the equal-correlation assumption is valid, the population-averaged probit estimator above should be more efficient. Is the assumption valid? That is a difficult question to answer. The default population-averaged estimates correspond to an assumption of exchangeable correlation within person. It would not be unreasonable to assume an AR(1) correlation within person or to assume that the observations are correlated but that we do not wish to impose any structure. 
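One pragmatic option, sketched below under the assumption that the union data from the example above are still in memory, is to keep the population-averaged point estimates but pair them with robust standard errors, which do not rely on the exchangeable working correlation being correct:

* Sketch: population-averaged probit with standard errors that are robust
* to misspecification of the working correlation (clustered on the panel variable)
xtprobit union age grade not_smsa south##c.year, pa vce(robust)

The coefficients are those of the population-averaged model, while the reported standard errors remain valid even if the within-person correlation is not exchangeable.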
See [XT] xtprobit and [XT] xtgee for full details. probit, vce(cluster clustvar) is robust to assumptions about within-cluster correlation. That is, it inefficiently sums within cluster for the standard error calculation rather than attempting to exploit what might be assumed about the within-cluster correlation. Model identification The probit command has one more feature that is probably the most useful. It will automatically check the model for identification and, if the model is underidentified, drop whatever variables and observations are necessary for estimation to proceed. Example 4 Have you ever fit a probit model where one or more of your independent variables perfectly predicted one or the other outcome? For instance, consider the following data: Outcome y Independent variable x 0 0 0 1 1 1 0 0 Say that we wish to predict the outcome on the basis of the independent variable. The outcome is always zero when the independent variable is one. In our data, Pr(y = 0 | x = 1) = 1, which means that the probit coefficient on x must be minus infinity with a corresponding infinite standard error. At this point, you may suspect that we have a problem. Unfortunately, not all such problems are so easily detected, especially if you have many independent variables in your model. If you have ever had such difficulties, then you have experienced one of the more unpleasant aspects of computer optimization. The computer has no idea that it is trying to solve for an infinite coefficient as it begins its iterative process. All it knows is that, at each step, making the coefficient a little bigger, or a little smaller, works wonders. It continues on its merry way until either 1) the whole thing comes crashing to the ground when a numerical overflow error occurs or 2) it reaches some predetermined cutoff that stops the process. Meanwhile, you have been waiting. And the estimates that you finally receive, if any, may be nothing more than numerical roundoff. Stata watches for these sorts of problems, alerts you, fixes them, and then properly fits the model. Let’s return to our automobile data. Among the variables we have in the data is one called repair that takes on three values. A value of 1 indicates that the car has a poor repair record, 2 indicates an average record, and 3 indicates a better-than-average record. Here is a tabulation of our data: 1676 probit — Probit regression . use http://www.stata-press.com/data/r13/repair (1978 Automobile Data) . tabulate foreign repair repair 1 2 3 Car type Total Domestic Foreign 10 0 27 3 9 9 46 12 Total 10 30 18 58 All the cars with poor repair records (repair = 1) are domestic. If we were to attempt to predict foreign on the basis of the repair records, the predicted probability for the repair = 1 category would have to be zero. This in turn means that the probit coefficient must be minus infinity, and that would set most computer programs buzzing. Let’s try using Stata on this problem. . probit foreign b3.repair note: 1.repair != 0 predicts failure perfectly 1.repair dropped and 10 obs not used Iteration 0: log likelihood = -26.992087 Iteration 1: log likelihood = -22.276479 Iteration 2: log likelihood = -22.229184 Iteration 3: log likelihood = -22.229138 Iteration 4: log likelihood = -22.229138 Probit regression Log likelihood = -22.229138 Std. Err. z Number of obs LR chi2(1) Prob > chi2 Pseudo R2 P>|z| = = = = 48 9.53 0.0020 0.1765 foreign Coef. [95% Conf. 
Interval] repair 1 2 0 -1.281552 (empty) .4297326 -2.98 0.003 -2.123812 -.4392911 _cons 9.89e-17 .295409 0.00 1.000 -.578991 .578991 Remember that all the cars with poor repair records (repair = 1) are domestic, so the model cannot be fit, or at least it cannot be fit if we restrict ourselves to finite coefficients. Stata noted that fact “note: 1.repair != 0 predicts failure perfectly”. This is Stata’s mathematically precise way of saying what we said in English. When repair is 1, the car is domestic. Stata then went on to say, “1.repair dropped and 10 obs not used”. This is Stata eliminating the problem. First, 1.repair had to be removed from the model because it would have an infinite coefficient. Then the 10 observations that led to the problem had to be eliminated, as well, so as not to bias the remaining coefficients in the model. The 10 observations that are not used are the 10 domestic cars that have poor repair records. Stata then fit what was left of the model, using the remaining observations. Because no observations remained for cars with poor repair records, Stata reports “(empty)” in the row for repair = 1. probit — Probit regression 1677 Technical note Stata is pretty smart about catching these problems. It will catch “one-way causation by a dummy variable”, as we demonstrated above. Stata also watches for “two-way causation”, that is, a variable that perfectly determines the outcome, both successes and failures. Here Stata says that the variable “predicts outcome perfectly” and stops. Statistics dictate that no model can be fit. Stata also checks your data for collinear variables; it will say “so-and-so omitted because of collinearity”. No observations need to be eliminated here and model fitting will proceed without the offending variable. It will also catch a subtle problem that can arise with continuous data. For instance, if we were estimating the chances of surviving the first year after an operation, and if we included in our model age, and if all the persons over 65 died within the year, Stata will say, “age > 65 predicts failure perfectly”. It will then inform us about how it resolves the issue and fit what can be fit of our model. probit (and logit, logistic, and ivprobit) will also occasionally fail to converge and then display messages such as Note: 4 failures and 0 successes completely determined. The cause of this message and what to do if you see it are described in [R] logit. 
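Stata also leaves a record of what it did so that you can verify it after the fact. The following is a minimal sketch, assuming the probit foreign b3.repair fit from example 4 is still in memory; e(rules) and e(sample) appear under Stored results below:

* Sketch: inspect the rule probit recorded and the observations it set aside
matrix list e(rules)                    // perfect-predictor information stored by probit
count if !e(sample)                     // observations excluded from the estimation sample
tabulate foreign repair if !e(sample)   // the excluded cars are the domestic repair==1 cars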
Stored results probit stores the following in e(): Scalars e(N) e(N cds) e(N cdf) e(k) e(k eq) e(k eq model) e(k dv) e(df m) e(r2 p) e(ll) e(ll 0) e(N clust) e(chi2) e(p) e(rank) e(ic) e(rc) e(converged) number of observations number of completely determined successes number of completely determined failures number of parameters number of equations in e(b) number of equations in overall model test number of dependent variables model degrees of freedom pseudo-R-squared log likelihood log likelihood, constant-only model number of clusters χ2 significance of model test rank of e(V) number of iterations return code 1 if converged, 0 otherwise 1678 probit — Probit regression Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(clustvar) e(offset) e(chi2type) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(estat cmd) e(predict) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(ilog) e(gradient) e(mns) e(rules) e(V) e(V modelbased) Functions e(sample) probit command as typed name of dependent variable weight type weight expression title in estimation output name of cluster variable linear offset variable Wald or LR; type of model χ2 test vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement estat program used to implement predict factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector vector of means of the independent variables information about perfect predictors variance–covariance matrix of the estimators model-based variance marks estimation sample Methods and formulas Probit analysis originated in connection with bioassay, and the word probit, a contraction of “probability unit”, was suggested by Bliss (1934a, 1934b). For an introduction to probit and logit, see, for example, Aldrich and Nelson (1984), Cameron and Trivedi (2010), Greene (2012), Long (1997), Pampel (2000), or Powers and Xie (2008). Long and Freese (2014, chap. 5 and 6) and Jones (2007, chap. 3) provide introductions to probit and logit, along with Stata examples. The log-likelihood function for probit is n o X X lnL = wj lnΦ(xj β) + wj ln 1 − Φ(xj β) j∈S j6∈S where Φ is the cumulative normal and wj denotes the optional weights. lnL is maximized, as described in [R] maximize. This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. The scores are calculated as uj = {φ(xj b)/Φ(xj b)}xj for the positive outcomes and −[φ(xj b)/{1 − Φ(xj b)}]xj for the negative outcomes, where φ is the normal density. probit also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. probit — Probit regression 1679 Chester Ittner Bliss (1899–1979) was born in Ohio. He was educated as an entomologist, earning degrees from Ohio State and Columbia, and was employed by the United States Department of Agriculture until 1933. When he lost his job because of the Depression, Bliss then worked with R. A. Fisher in London and at the Institute of Plant Protection in Leningrad before returning to a post at the Connecticut Agricultural Experiment Station in 1938. 
He was also a lecturer at Yale for 25 years. Among many contributions to biostatistics, his development and application of probit methods to biological problems are outstanding. References Aldrich, J. H., and F. D. Nelson. 1984. Linear Probability, Logit, and Probit Models. Newbury Park, CA: Sage. Berkson, J. 1944. Application of the logistic function to bio-assay. Journal of the American Statistical Association 39: 357–365. Bliss, C. I. 1934a. The method of probits. Science 79: 38–39. . 1934b. The method of probits—a correction. Science 79: 409–410. Cameron, A. C., and P. K. Trivedi. 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press. Cochran, W. G., and D. J. Finney. 1979. Chester Ittner Bliss 1899–1979. Biometrics 35: 715–717. De Luca, G. 2008. SNP and SML estimation of univariate and bivariate binary-choice models. Stata Journal 8: 190–220. Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. Hilbe, J. M. 1996. sg54: Extended probit regression. Stata Technical Bulletin 32: 20–21. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 131–132. College Station, TX: Stata Press. Jones, A. 2007. Applied Econometrics for Health Economists: A Practical Guide. 2nd ed. Abingdon, UK: Radcliffe. Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2nd ed. New York: Wiley. Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage. Long, J. S., and J. Freese. 2014. Regression Models for Categorical Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata Press. Miranda, A., and S. Rabe-Hesketh. 2006. Maximum likelihood estimation of endogenous switching and sample selection models for binary, ordinal, and count variables. Stata Journal 6: 285–308. Pampel, F. C. 2000. Logistic Regression: A Primer. Thousand Oaks, CA: Sage. Powers, D. A., and Y. Xie. 2008. Statistical Methods for Categorical Data Analysis. 2nd ed. Bingley, UK: Emerald. Xu, J., and J. S. Long. 2005. Confidence intervals for predicted outcomes in regression models for categorical outcomes. Stata Journal 5: 537–559. 
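As a check on the log-likelihood expression given under Methods and formulas above, the following sketch refits the example 1 model quietly, recomputes lnL term by term, and compares the sum with the stored e(ll); xbhat and llj are hypothetical variable names:

* Sketch: rebuild the probit log likelihood observation by observation
use http://www.stata-press.com/data/r13/auto, clear
quietly probit foreign weight mpg
predict double xbhat, xb
generate double llj = cond(foreign, ln(normal(xbhat)), ln(1 - normal(xbhat)))
quietly summarize llj
display "by hand: " r(sum) "    e(ll): " e(ll)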
1680 probit — Probit regression Also see [R] probit postestimation — Postestimation tools for probit [R] asmprobit — Alternative-specific multinomial probit regression [R] biprobit — Bivariate probit regression [R] brier — Brier score decomposition [R] glm — Generalized linear models [R] heckoprobit — Ordered probit model with sample selection [R] hetprobit — Heteroskedastic probit model [R] ivprobit — Probit model with continuous endogenous regressors [R] logistic — Logistic regression, reporting odds ratios [R] logit — Logistic regression, reporting coefficients [R] mprobit — Multinomial probit regression [R] roc — Receiver operating characteristic (ROC) analysis [R] scobit — Skewed logistic regression [ME] meprobit — Multilevel mixed-effects probit regression [MI] estimation — Estimation commands for use with mi estimate [SVY] svy estimation — Estimation commands for survey data [XT] xtprobit — Random-effects and population-averaged probit models [U] 20 Estimation and postestimation commands Title probit postestimation — Postestimation tools for probit Description Remarks and examples Syntax for predict Methods and formulas Menu for predict Also see Options for predict Description The following postestimation commands are of special interest after probit: Command Description estat classification report various summary statistics, including the classification table estat gof Pearson or Hosmer–Lemeshow goodness-of-fit test lroc compute area under ROC curve and graph the curve lsens graph sensitivity and specificity versus probability cutoff These commands are not appropriate after the svy prefix. The following standard postestimation commands are also available: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 hausman lincom linktest lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 2 contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations Hausman’s specification test point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses forecast is not appropriate with mi or svy estimation results. lrtest is not appropriate with svy estimation results. 1681 1682 probit postestimation — Postestimation tools for probit Syntax for predict predict type newvar if in , statistic nooffset rules asif Description statistic Main pr xb stdp ∗ deviance score probability of a positive outcome; the default linear prediction standard error of the linear prediction deviance residual first derivative of the log likelihood with respect to xj β Unstarred statistics are available both in and out of sample; type predict . . . 
if e(sample) . . . if wanted only for the estimation sample. Starred statistics are calculated only for the estimation sample, even when if e(sample) is not specified. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main pr, the default, calculates the probability of a positive outcome. xb calculates the linear prediction. stdp calculates the standard error of the linear prediction. deviance calculates the deviance residual. score calculates the equation-level score, ∂ ln L/∂(xj β). nooffset is relevant only if you specified offset(varname) for probit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xj b rather than as xj b + offsetj . rules requests that Stata use any rules that were used to identify the model when making the prediction. By default, Stata calculates missing for excluded observations. asif requests that Stata ignore the rules and exclusion criteria and calculate predictions for all observations possible using the estimated parameter from the model. Remarks and examples Remarks are presented under the following headings: Obtaining predicted values Performing hypothesis tests probit postestimation — Postestimation tools for probit 1683 Obtaining predicted values Once you have fit a probit model, you can obtain the predicted probabilities by using the predict command for both the estimation sample and other samples; see [U] 20 Estimation and postestimation commands and [R] predict. Here we will make only a few additional comments. predict without arguments calculates the predicted probability of a positive outcome. With the xb option, predict calculates the linear combination xj b, where xj are the independent variables in the j th observation and b is the estimated parameter vector. This is known as the index function because the cumulative density indexed at this value is the probability of a positive outcome. In both cases, Stata remembers any rules used to identify the model and calculates missing for excluded observations unless rules or asif is specified. This is covered in the following example. With the stdp option, predict calculates the standard error of the prediction, which is not adjusted for replicated covariate patterns in the data. You can calculate the unadjusted-for-replicated-covariate-patterns diagonal elements of the hat matrix, or leverage, by typing . predict pred . predict stdp, stdp . generate hat = stdp^2*pred*(1-pred) Example 1 In example 4 of [R] probit, we fit the probit model probit foreign b3.repair. To obtain predicted probabilities, we type . predict p (option pr assumed; Pr(foreign)) (10 missing values generated) . summarize foreign p Variable Obs Mean foreign p 58 48 .2068966 .25 Std. Dev. Min Max .4086186 .1956984 0 .1 1 .5 Stata remembers any rules used to identify the model and sets predictions to missing for any excluded observations. In example 4 of [R] probit, probit dropped the variable 1.repair from our model and excluded 10 observations. When we typed predict p, those same 10 observations were again excluded and their predictions set to missing. predict’s rules option uses the rules in the prediction. During estimation, we were told, “1.repair != 0 predicts failure perfectly”, so the rule is that when 1.repair is not zero, we should predict 0 probability of success or a positive outcome: . predict p2, rules (option pr assumed; Pr(foreign)) . 
summarize foreign p p2 Variable Obs Mean foreign p p2 58 48 58 .2068966 .25 .2068966 Std. Dev. Min Max .4086186 .1956984 .2016268 0 .1 0 1 .5 .5 predict’s asif option ignores the rules and the exclusion criteria and calculates predictions for all observations possible using the estimated parameters from the model: 1684 probit postestimation — Postestimation tools for probit . predict p3, asif (option pr assumed; Pr(foreign)) . summarize for p p2 p3 Variable Obs Mean foreign p p2 p3 58 48 58 58 .2068966 .25 .2068966 .2931034 Std. Dev. Min Max .4086186 .1956984 .2016268 .2016268 0 .1 0 .1 1 .5 .5 .5 Which is right? By default, predict uses the most conservative approach. If many observations had been excluded due to a simple rule, we could be reasonably certain that the rules prediction is correct. The asif prediction is correct only if the exclusion is a fluke and we would be willing to exclude the variable from the analysis, anyway. Then, however, we should refit the model to include the excluded observations. Performing hypothesis tests After estimation with probit, you can perform hypothesis tests by using the test or testnl command; see [U] 20 Estimation and postestimation commands. Methods and formulas Let index j be used to index observations. Define Mj for each observation as the total number of observations sharing j ’s covariate pattern. Define Yj as the total number of positive responses among observations sharing j ’s covariate pattern. Define pj as the predicted probability of a positive outcome for observation j . For Mj > 1, the deviance residual dj is defined as " Yj dj = ± 2 Yj ln Mj p j Mj − Yj + (Mj − Yj ) ln Mj (1 − pj ) #!1/2 where the sign is the same as the sign of (Yj − Mj pj ). In the limiting cases, the deviance residual is given by ( p − 2Mj | ln(1 − pj )| if Yj = 0 dj = p 2Mj | lnpj | if Yj = Mj Also see [R] probit — Probit regression [R] estat classification — Classification statistics and table [R] estat gof — Pearson or Hosmer–Lemeshow goodness-of-fit test [R] lroc — Compute area under ROC curve and graph the curve [R] lsens — Graph sensitivity and specificity versus probability cutoff [U] 20 Estimation and postestimation commands Title proportion — Estimate proportions Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax proportion varlist if in weight , options options Description Model stdize(varname) stdweight(varname) nostdrescale nolabel missing variable identifying strata for standardization weight variable for standardization do not rescale the standard weight variable suppress value labels from varlist treat missing values like other values if/in/over over(varlist , nolabel ) group over subpopulations defined by varlist; optionally, suppress group labels SE/Cluster vce(vcetype) vcetype may be analytic, cluster clustvar, bootstrap, or jackknife Reporting noheader nolegend display options set confidence level; default is level(95) method to compute limits of confidence intervals; default is citype(logit) suppress table header suppress table legend control column formats and line width coeflegend display legend instead of statistics level(#) citype(logit | normal) bootstrap, jackknife, mi estimate, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. 
vce() and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 1685 1686 proportion — Estimate proportions Menu Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Proportions Description proportion produces estimates of proportions, along with standard errors, for the categories identified by the values in each variable of varlist. Options Model stdize(varname) specifies that the point estimates be adjusted by direct standardization across the strata identified by varname. This option requires the stdweight() option. stdweight(varname) specifies the weight variable associated with the standard strata identified in the stdize() option. The standardization weights must be constant within the standard strata. nostdrescale prevents the standardization weights from being rescaled within the over() groups. This option requires stdize() but is ignored if the over() option is not specified. nolabel specifies that value labels attached to the variables in varlist be ignored. missing specifies that missing values in varlist be treated as valid categories, rather than omitted from the analysis (the default). if/in/over over(varlist , nolabel ) specifies that estimates be computed for multiple subpopulations, which are identified by the different values of the variables in varlist. When this option is supplied with one variable name, such as over(varname), the value labels of varname are used to identify the subpopulations. If varname does not have labeled values (or there are unlabeled values), the values themselves are used, provided that they are nonnegative integers. Noninteger values, negative values, and labels that are not valid Stata names are substituted with a default identifier. When over() is supplied with multiple variable names, each subpopulation is assigned a unique default identifier. nolabel requests that value labels attached to the variables identifying the subpopulations be ignored. SE/Cluster vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (analytic), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. vce(analytic), the default, uses the analytically derived variance estimator associated with the sample proportion. Reporting level(#); see [R] estimation options. proportion — Estimate proportions 1687 citype(logit | normal) specifies how to compute the limits of confidence intervals. citype(logit), the default, uses the logit transformation to compute the limits of confidence intervals. citype(normal) uses the normal approximation to compute the limits of confidence intervals. noheader prevents the table header from being displayed. This option implies nolegend. nolegend prevents the table legend identifying the subpopulations from being displayed. display options: cformat(% fmt) and nolstretch; see [R] estimation options. The following option is available with proportion but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Example 1 We can estimate the proportion of each repair rating in auto2.dta: . use http://www.stata-press.com/data/r13/auto2 (1978 Automobile Data) . 
proportion rep78 Proportion estimation Number of obs Proportion = 69 Std. Err. [95% Conf. Interval] .0203446 .0388245 .0601159 .0532498 .0443922 .0070061 .0580159 .3207109 .1690271 .089188 rep78 Poor Fair Average Good Excellent .0289855 .115942 .4347826 .2608696 .1594203 .1121326 .2183014 .556206 .3798066 .2686455 Here we use the missing option to include missing values as a category of rep78: . proportion rep78, missing Proportion estimation _prop_6: rep78 = . Proportion Number of obs = 74 Std. Err. [95% Conf. Interval] .0189796 .0363433 .0574637 .0502154 .0416364 .0293776 .0065484 .054094 .2977369 .1572724 .0831005 .0278144 rep78 Poor Fair Average Good Excellent _prop_6 .027027 .1081081 .4054054 .2432432 .1486486 .0675676 .1047932 .204402 .523012 .3563376 .2517065 .1550743 1688 proportion — Estimate proportions Example 2 We can also estimate proportions over groups: . proportion rep78, over(foreign) Proportion estimation Poor: rep78 = Poor Fair: rep78 = Fair Average: rep78 = Average Good: rep78 = Good Excellent: rep78 = Excellent Domestic: foreign = Domestic Foreign: foreign = Foreign Number of obs Std. Err. = 69 Over Proportion [95% Conf. Interval] Poor Domestic Foreign .0416667 . .0291477 .0100299 (no observations) .1572433 Fair Domestic Foreign .1666667 . .0543607 .0839032 (no observations) .3039797 Average Domestic Foreign .5625 .1428571 .0723605 .0782461 .4169211 .0444941 .6980553 .3736393 Good Domestic Foreign .1875 .4285714 .0569329 .1106567 .0986718 .2333786 .3272601 .6488451 Excellent Domestic Foreign .0416667 .4285714 .0291477 .1106567 .0100299 .2333786 .1572433 .6488451 proportion — Estimate proportions 1689 Stored results proportion stores the following in e(): Scalars e(N) e(N over) e(N stdize) e(N clust) e(k eq) e(df r) e(rank) Macros e(cmd) e(cmdline) e(varlist) e(stdize) e(stdweight) e(wtype) e(wexp) e(title) e(cluster) e(over) e(over labels) e(over namelist) e(namelist) e(label#) e(vce) e(vcetype) e(properties) e(estat cmd) e(marginsnotok) Matrices e(b) e(V) e( N) e( N stdsum) e( p stdize) e(error) Functions e(sample) number of observations number of subpopulations number of standard strata number of clusters number of equations in e(b) sample degrees of freedom rank of e(V) proportion command as typed varlist varname from stdize() varname from stdweight() weight type weight expression title in estimation output name of cluster variable varlist from over() labels from over() variables names from e(over labels) proportion identifiers labels from #th variable in varlist vcetype specified in vce() title used to label Std. Err. b V program used to implement estat predictions disallowed by margins vector of proportion estimates (co)variance estimates vector of numbers of nonmissing observations number of nonmissing observations within the standard strata standardizing proportions error code corresponding to e(b) marks estimation sample Methods and formulas Proportions are means of indicator variables; see [R] mean. Confidence intervals Confidence intervals for proportions are calculated using a logit transform so that the endpoints lie between 0 and 1. Let pb be an estimated proportion and sb be an estimate of its standard error. Let pb f (b p) = ln 1 − pb be the logit transform of the proportion. 
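For reference, the interval that this derivation arrives at, and that proportion displays, can be written in one line as

$$
f^{-1}\!\left\{\,\ln\!\left(\frac{\hat{p}}{1-\hat{p}}\right)\;\pm\; t_{1-\alpha/2,\,\nu}\,\frac{\hat{s}}{\hat{p}(1-\hat{p})}\right\}
\qquad\text{where}\qquad
f^{-1}(y)=\frac{e^{y}}{1+e^{y}}
$$

The pieces of this expression are developed in the remainder of this section.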
In this metric, an estimate of the standard error is c {f (b SE p)} = f 0 (b p)b s = sb pb(1 − pb) 1690 proportion — Estimate proportions Thus a 100(1 − α)% confidence interval in this metric is ln t1−α/2,ν sb pb ± 1 − pb pb(1 − pb) where t1−α/2,ν is the (1 − α/2)th quantile of Student’s t distribution with ν degrees of freedom. The endpoints of this confidence interval are transformed back to the proportion metric by using the inverse of the logit transform ey f −1 (y) = 1 + ey Hence, the displayed confidence intervals for proportions are f −1 ln t1−α/2,ν sb pb ± 1 − pb pb(1 − pb) References Cochran, W. G. 1977. Sampling Techniques. 3rd ed. New York: Wiley. Stuart, A., and J. K. Ord. 1994. Kendall’s Advanced Theory of Statistics: Distribution Theory, Vol I. 6th ed. London: Arnold. Also see [R] proportion postestimation — Postestimation tools for proportion [R] mean — Estimate means [R] ratio — Estimate ratios [R] total — Estimate totals [MI] estimation — Estimation commands for use with mi estimate [SVY] direct standardization — Direct standardization of means, proportions, and ratios [SVY] poststratification — Poststratification for survey data [SVY] subpopulation estimation — Subpopulation estimation for survey data [SVY] svy estimation — Estimation commands for survey data [SVY] variance estimation — Variance estimation for survey data [U] 20 Estimation and postestimation commands Title proportion postestimation — Postestimation tools for proportion Description Remarks and examples Also see Description The following postestimation commands are available after proportion: Command Description estat vce estat (svy) estimates lincom variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses nlcom test testnl Remarks and examples Example 1 In example 2 of [R] proportion, we computed the proportions of cars with different repair records for each group, foreign or domestic. We use test to test whether the proportion of cars with repair record equal to 4 is the same for domestic and foreign cars. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . proportion rep78, over(foreign) (output omitted ) . test [_prop_4]:Domestic=[_prop_4]:Foreign ( 1) [_prop_4]Domestic - [_prop_4]Foreign = 0 F( 1, 68) = Prob > F = 3.75 0.0569 There is not a significant difference between those proportions at the 5% level. Example 2 Continuing with auto.dta from example 1, we generate a new variable, highprice, that indicates if the price is larger than $5,000 and then use proportion to see the proportion of cars with high price among domestic and foreign cars separately. 1691 1692 proportion postestimation — Postestimation tools for proportion . generate highprice = price>5000 . proportion highprice, over(foreign) Proportion estimation Number of obs _prop_1: highprice = 0 _prop_2: highprice = 1 Domestic: foreign = Domestic Foreign: foreign = Foreign Over Proportion _prop_1 Domestic Foreign _prop_2 Domestic Foreign = 74 Std. Err. [95% Conf. 
Interval] .5576923 .3636364 .0695464 .1049728 .4182157 .1879015 .6886264 .5852765 .4423077 .6363636 .0695464 .1049728 .3113736 .4147235 .5817843 .8120985 We will compute the odds ratio of having a high price in group Foreign to having a high price in group Domestic. Usually, odds ratios are computed by using the logistic command, but here we will perform the computation by using nlcom after proportion. . nlcom OR: ([_prop_2]_b[Foreign]/[_prop_1]_b[Foreign])/([_prop_2]_b[Domestic]/ > [_prop_1]_b[Domestic]) OR: ([_prop_2]_b[Foreign]/[_prop_1]_b[Foreign])/([_prop_2]_b[Domesti > c]/[_prop_1]_b[Domestic]) Proportion Coef. OR 2.206522 Std. Err. z P>|z| 1.178522 1.87 0.061 [95% Conf. Interval] -.1033393 4.516383 This is the same odds ratio that we would obtain from . logistic highprice foreign The odds ratio is slightly larger than 2, which means that the odds of having a high price among foreign cars are more than twice that of having a high price among domestic cars. Also see [R] proportion — Estimate proportions [SVY] svy postestimation — Postestimation tools for svy [U] 20 Estimation and postestimation commands Title prtest — Tests of proportions Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax One-sample test of proportion prtest varname == #p if in , level(#) Two-sample test of proportions using groups prtest varname if in , by(groupvar) level(#) Two-sample test of proportions using variables prtest varname1 == varname2 if in , level(#) Immediate form of one-sample test of proportion prtesti # obs1 # p1 # p2 , level(#) count Immediate form of two-sample test of proportions prtesti # obs1 # p1 # obs2 # p2 , level(#) count by is allowed with prtest; see [D] by. Menu prtest Statistics > Summaries, tables, and tests > Classical tests of hypotheses > Proportion test > Summaries, tables, and tests > Classical tests of hypotheses > Proportion test calculator prtesti Statistics Description prtest performs tests on the equality of proportions using large-sample statistics. In the first form, prtest tests that varname has a proportion of #p . In the second form, prtest tests that varname has the same proportion within the two groups defined by groupvar. In the third form, prtest tests that varname1 and varname2 have the same proportion. prtesti is the immediate form of prtest; see [U] 19 Immediate commands. 1693 1694 prtest — Tests of proportions The bitest command is a better version of the first form of prtest in that it gives exact p-values. Researchers should use bitest when possible, especially for small samples; see [R] bitest. Options Main by(groupvar) specifies a numeric variable that contains the group information for a given observation. This variable must have only two values. Do not confuse the by() option with the by prefix; both may be specified. level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. count specifies that integer counts instead of proportions be used in the immediate forms of prtest. In the first syntax, prtesti expects that #obs1 and #p1 are counts—#p1 ≤ #obs1 —and #p2 is a proportion. In the second syntax, prtesti expects that all four numbers are integer counts, that #obs1 ≥ #p1 , and that #obs2 ≥ #p2 . Remarks and examples The prtest output follows the output of ttest in providing a lot of information. Each proportion is presented along with a confidence interval. 
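For instance, with an indicator of success and a two-level group variable (the names relieved and treatment below are hypothetical and serve only to illustrate the by() syntax), the group-based two-sample form is invoked as

. prtest relieved, by(treatment)

which tests whether the proportion of successes is the same in the two groups defined by treatment, just as the form prtest varname1 == varname2 does when each group's outcomes are stored in a separate variable.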
The appropriate one- or two-sample test is performed, and the two-sided and both one-sided results are included at the bottom of the output. For a two-sample test, the calculated difference is also presented with its confidence interval. This command may be used for both large-sample testing and large-sample interval estimation. Example 1: One-sample test of proportion In the first form, prtest tests whether the mean of the sample is equal to a known constant. Assume that we have a sample of 74 automobiles. We wish to test whether the proportion of automobiles that are foreign is different from 40%. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . prtest foreign == .4 One-sample test of proportion foreign: Number of obs = Variable Mean foreign .2972973 74 Std. Err. [95% Conf. Interval] .0531331 .1931583 p = proportion(foreign) Ho: p = 0.4 Ha: p < 0.4 Ha: p != 0.4 Pr(Z < z) = 0.0357 Pr(|Z| > |z|) = 0.0713 .4014363 z = -1.8034 Ha: p > 0.4 Pr(Z > z) = 0.9643 The test indicates that we cannot reject the hypothesis that the proportion of foreign automobiles is 0.40 at the 5% significance level. prtest — Tests of proportions 1695 Example 2: Two-sample test of proportions We have two headache remedies that we give to patients. Each remedy’s effect is recorded as 0 for failing to relieve the headache and 1 for relieving the headache. We wish to test the equality of the proportion of people relieved by the two treatments. . use http://www.stata-press.com/data/r13/cure . prtest cure1 == cure2 Two-sample test of proportions cure1: Number of obs = cure2: Number of obs = Variable Mean Std. Err. cure1 cure2 .52 .7118644 .0706541 .0589618 diff -.1918644 under Ho: .0920245 .0931155 z -2.06 P>|z| [95% Conf. Interval] .3815205 .5963013 .6584795 .8274275 -.372229 -.0114998 0.039 diff = prop(cure1) - prop(cure2) Ho: diff = 0 Ha: diff < 0 Pr(Z < z) = 0.0197 50 59 z = Ha: diff != 0 Pr(|Z| < |z|) = 0.0394 -2.0605 Ha: diff > 0 Pr(Z > z) = 0.9803 We find that the proportions are statistically different from each other at any level greater than 3.9%. Example 3: Immediate form of one-sample test of proportion prtesti is like prtest, except that you specify summary statistics rather than variables as arguments. For instance, we are reading an article that reports the proportion of registered voters among 50 randomly selected eligible voters as 0.52. We wish to test whether the proportion is 0.7: . prtesti 50 .52 .70 One-sample test of proportion x: Number of obs = 50 Variable Mean Std. Err. [95% Conf. Interval] x .52 .0706541 .3815205 p = proportion(x) Ho: p = 0.7 Ha: p < 0.7 Pr(Z < z) = 0.0027 z = Ha: p != 0.7 Pr(|Z| > |z|) = 0.0055 .6584795 -2.7775 Ha: p > 0.7 Pr(Z > z) = 0.9973 Example 4: Immediate form of two-sample test of proportions To judge teacher effectiveness, we wish to test whether the same proportion of people from two classes will answer an advanced question correctly. In the first classroom of 30 students, 40% answered the question correctly, whereas in the second classroom of 45 students, 67% answered the question correctly. 1696 prtest — Tests of proportions . prtesti 30 .4 45 .67 Two-sample test of proportions x: Number of obs = y: Number of obs = Variable Mean Std. Err. x y .4 .67 .0894427 .0700952 diff -.27 under Ho: .1136368 .1169416 z P>|z| -2.31 [95% Conf. 
Interval] .2246955 .532616 .5753045 .807384 -.4927241 -.0472759 0.021 diff = prop(x) - prop(y) Ho: diff = 0 Ha: diff < 0 Pr(Z < z) = 0.0105 30 45 z = Ha: diff != 0 Pr(|Z| < |z|) = 0.0210 -2.3088 Ha: diff > 0 Pr(Z > z) = 0.9895 Stored results prtest and prtesti store the following in r(): Scalars r(z) z statistic r(P #) proportion for variable # r(N #) number of observations for variable # Methods and formulas See Acock (2014, 155–161) for additional examples of tests of proportions using Stata. A large-sample 100(1 − α)% confidence interval for a proportion p is r pb ± z1−α/2 pb qb n and a 100(1 − α)% confidence interval for the difference of two proportions is given by s (b p1 − pb2 ) ± z1−α/2 pb1 qb1 pb2 qb2 + n1 n2 where qb = 1 − pb and z is calculated from the inverse cumulative standard normal distribution. The one-tailed and two-tailed tests of a population proportion use a normally distributed test statistic calculated as pb − p0 z=p p0 q0 /n where p0 is the hypothesized proportion. A test of the difference of two proportions also uses a normally distributed test statistic calculated as prtest — Tests of proportions z=p 1697 pb1 − pb2 pbp qbp (1/n1 + 1/n2 ) where pbp = x1 + x2 n1 + n2 and x1 and x2 are the total number of successes in the two populations. References Acock, A. C. 2014. A Gentle Introduction to Stata. 4th ed. College Station, TX: Stata Press. Wang, D. 2000. sg154: Confidence intervals for the ratio of two binomial proportions by Koopman’s method. Stata Technical Bulletin 58: 16–19. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 244–247. College Station, TX: Stata Press. Also see [R] bitest — Binomial probability test [R] proportion — Estimate proportions [R] ttest — t tests (mean-comparison tests) [MV] hotelling — Hotelling’s T-squared generalized means test Title pwcompare — Pairwise comparisons Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax pwcompare marginlist , options where marginlist is a list of factor variables or interactions that appear in the current estimation results or eqns to reference equations. The variables may be typed with or without the i. prefix, and you may use any factor-variable syntax: . pwcompare i.sex i.group i.sex#i.group . pwcompare sex group sex#group . pwcompare sex##group options Description Main mcompare(method) asobserved adjust for multiple comparisons; default is mcompare(noadjust) treat all factor variables as observed Equations equation(eqspec) atequations perform comparisons within equation eqspec perform comparisons within each equation Advanced emptycells(empspec) treatment of empty cells for balanced factors noestimcheck suppress estimability checks Reporting level(#) cieffects pveffects effects cimargins groups sort post display options eform option confidence level; default is level(95) show effects table with confidence intervals; the default show effects table with p-values show effects table with confidence intervals and p-values show table of margins and confidence intervals show table of margins and group codes sort the margins or contrasts within each term post margins and their VCEs as estimation results control column formats, row spacing, line width, and factor-variable labeling report exponentiated contrasts df(#) use t distribution with # degrees of freedom for computing p-values and confidence intervals df(#) does not appear in the dialog box. 
1698 pwcompare — Pairwise comparisons Description method noadjust bonferroni adjustall sidak adjustall scheffe ∗ tukey ∗ snk ∗ duncan ∗ dunnett ∗ 1699 do not adjust for multiple comparisons; the default Bonferroni’s method; adjust across all terms Šidák’s method; adjust across all terms Scheffé’s method Tukey’s method Student–Newman–Keuls’ method Duncan’s method Dunnett’s method tukey, snk, duncan, and dunnett are only allowed with results from anova, manova, regress, and mvreg. tukey, snk, duncan, and dunnett are not allowed with results from svy. Time-series operators are allowed if they were used in the estimation. Menu Statistics > Postestimation > Pairwise comparisons Description pwcompare performs pairwise comparisons across the levels of factor variables from the most recently fit model. pwcompare can compare estimated cell means, marginal means, intercepts, marginal intercepts, slopes, or marginal slopes—collectively called margins. pwcompare reports the comparisons as contrasts (differences) of margins along with significance tests or confidence intervals for the contrasts. The tests and confidence intervals can be adjusted for multiple comparisons. pwcompare can be used with svy estimation results; see [SVY] svy postestimation. See [R] margins, pwcompare for performing pairwise comparisons of margins of linear and nonlinear predictions. Options Main mcompare(method) specifies the method for computing p-values and confidence intervals that account for multiple comparisons within a factor-variable term. Most methods adjust the comparisonwise error rate, αc , to achieve a prespecified experimentwise error rate, αe . mcompare(noadjust) is the default; it specifies no adjustment. αc = αe mcompare(bonferroni) adjusts the comparisonwise error rate based on the upper limit of the Bonferroni inequality: αe ≤mαc where m is the number of comparisons within the term. The adjusted comparisonwise error rate is αc = αe /m 1700 pwcompare — Pairwise comparisons mcompare(sidak) adjusts the comparisonwise error rate based on the upper limit of the probability inequality αe ≤1 − (1 − αc )m where m is the number of comparisons within the term. The adjusted comparisonwise error rate is αc = 1 − (1 − αe )1/m This adjustment is exact when the m comparisons are independent. mcompare(scheffe) controls the experimentwise error rate using the F (or χ2 ) distribution with degrees of freedom equal to the rank of the term. For results from anova, regress, manova, and mvreg (see [R] anova, [R] regress, [MV] manova, and [MV] mvreg), pwcompare allows the following additional methods. These methods are not allowed with results that used vce(robust) or vce(cluster clustvar). mcompare(tukey) uses what is commonly referred to as Tukey’s honestly significant difference. This method uses the Studentized range distribution instead of the t distribution. mcompare(snk) is a variation on mcompare(tukey) that counts only the number of margins in the range for a given comparison instead of the full number of margins. mcompare(duncan) is a variation on mcompare(snk) with additional adjustment to the significance probabilities. mcompare(dunnett) uses Dunnett’s method for making comparisons with a reference category. mcompare(method adjustall) specifies that the multiple-comparison adjustments count all comparisons across all terms rather than performing multiple comparisons term by term. This leads to more conservative adjustments when multiple variables or terms are specified in marginlist. 
This option is compatible only with the bonferroni and sidak methods. asobserved specifies that factor covariates be evaluated using the cell frequencies observed when the model was fit. The default is to treat all factor covariates as though there were an equal number of observations at each level. Equations equation(eqspec) specifies the equation from which margins are to be computed. The default is to compute margins from the first equation. atequations specifies that the margins be computed within each equation. Advanced emptycells(empspec) specifies how empty cells are handled in interactions involving factor variables that are being treated as balanced. emptycells(strict) is the default; it specifies that margins involving empty cells be treated as not estimable. emptycells(reweight) specifies that the effects of the observed cells be increased to accommodate any missing cells. This makes the margins estimable but changes their interpretation. noestimcheck specifies that pwcompare not check for estimability. By default, the requested margins are checked and those found not estimable are reported as such. Nonestimability is usually caused by empty cells. If noestimcheck is specified, estimates are computed in the usual way and reported even though the resulting estimates are manipulable, which is to say they can differ across equivalent models having different parameterizations. pwcompare — Pairwise comparisons 1701 Reporting level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. The significance level used by the groups option is 100 − #, expressed as a percentage. cieffects specifies that a table of the pairwise comparisons with their standard errors and confidence intervals be reported. This is the default. pveffects specifies that a table of the pairwise comparisons with their standard errors, test statistics, and p-values be reported. effects specifies that a table of the pairwise comparisons with their standard errors, test statistics, p-values, and confidence intervals be reported. cimargins specifies that a table of the margins with their standard errors and confidence intervals be reported. groups specifies that a table of the margins with their standard errors and group codes be reported. Margins with the same letter in the group code are not significantly different at the specified significance level. sort specifies that the reported tables be sorted on the margins or differences in each term. post causes pwcompare to behave like a Stata estimation (e-class) command. pwcompare posts the vector of estimated margins along with the estimated variance–covariance matrix to e(), so you can treat the estimated margins just as you would results from any other estimation command. For example, you could use test to perform simultaneous tests of hypotheses on the margins, or you could use lincom to create linear combinations. display options: vsquish, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch. vsquish specifies that the blank space separating factor-variable terms or time-series–operated variables from other variables in the model be suppressed. nofvlabel displays factor-variable level values rather than attached value labels. This option overrides the fvlabel setting; see [R] set showbaselevels. fvwrap(#) specifies how many lines to allow when long value labels must be wrapped. 
Labels requiring more than # lines are truncated. This option overrides the fvwrap setting; see [R] set showbaselevels. fvwrapon(style) specifies whether value labels that wrap will break at word boundaries or break based on available space. fvwrapon(word), the default, specifies that value labels break at word boundaries. fvwrapon(width) specifies that value labels break based on available space. This option overrides the fvwrapon setting; see [R] set showbaselevels. cformat(% fmt) specifies how to format contrasts or margins, standard errors, and confidence limits in the table of pairwise comparisons. pformat(% fmt) specifies how to format p-values in the table of pairwise comparisons. sformat(% fmt) specifies how to format test statistics in the table of pairwise comparisons. nolstretch specifies that the width of the table of pairwise comparisons not be automatically widened to accommodate longer variable names. The default, lstretch, is to automatically widen the table of pairwise comparisons up to the width of the Results window. To change the default, use set lstretch off. nolstretch is not shown in the dialog box. 1702 pwcompare — Pairwise comparisons eform option specifies that the contrasts table be displayed in exponentiated form. econtrast is displayed rather than contrast. Standard errors and confidence intervals are also transformed. See [R] eform option for the list of available options. The following option is available with pwcompare but is not shown in the dialog box: df(#) specifies that the t distribution with # degrees of freedom be used for computing p-values and confidence intervals. The default is to use e(df r) degrees of freedom or the standard normal distribution if e(df r) is missing. Remarks and examples pwcompare performs pairwise comparisons of margins across the levels of factor variables from the most recently fit model. The margins can be estimated cell means, marginal means, intercepts, marginal intercepts, slopes, or marginal slopes. With the exception of slopes, we can also consider these margins to be marginal linear predictions. The margins are calculated as linear combinations of the coefficients. Let k be the number of levels for a factor term in our model; then there are k margins for that term, and m= k k(k − 1) = 2 2 unique pairwise comparisons of those margins. The confidence intervals and p-values for these pairwise comparisons can be adjusted to account for multiple comparisons. Bonferroni’s, Šidák’s, and Scheffé’s adjustments can be made for multiple comparisons after fitting any type of model. In addition, Tukey’s, Student–Newman–Keuls’, Duncan’s, and Dunnett’s adjustments are available when fitting ANOVA, linear regression, MANOVA, or multivariate regression models. Remarks are presented under the following headings: Pairwise comparisons of means Marginal means All pairwise comparisons Overview of multiple-comparison methods Fisher’s protected least-significant difference (LSD) Bonferroni’s adjustment Šidák’s adjustment Scheffé’s adjustment Tukey’s HSD adjustment Student–Newman–Keuls’ adjustment Duncan’s adjustment Dunnett’s adjustment Example adjustments using one-way models Fisher’s protected LSD Tukey’s HSD Dunnett’s method for comparisons to a control Two-way models Pairwise comparisons of slopes Nonlinear models Multiple-equation models Unbalanced data Empty cells pwcompare — Pairwise comparisons 1703 Pairwise comparisons of means Suppose we are interested in the effects of five different fertilizers on wheat yield. 
We could estimate the following linear regression model to determine the effect of each type of fertilizer on the yield. . use http://www.stata-press.com/data/r13/yield (Artificial wheat yield dataset) . regress yield i.fertilizer Source SS df MS Model Residual 1078.84207 9859.55334 4 195 269.710517 50.561812 Total 10938.3954 199 54.9668111 Std. Err. t Number of obs F( 4, 195) Prob > F R-squared Adj R-squared Root MSE P>|t| = = = = = = 200 5.33 0.0004 0.0986 0.0801 7.1107 yield Coef. [95% Conf. Interval] fertilizer 10-08-22 16-04-08 18-24-06 29-03-04 3.62272 .4906299 4.922803 -1.238328 1.589997 1.589997 1.589997 1.589997 2.28 0.31 3.10 -0.78 0.024 0.758 0.002 0.437 .4869212 -2.645169 1.787005 -4.374127 6.758518 3.626428 8.058602 1.89747 _cons 41.36243 1.124298 36.79 0.000 39.14509 43.57977 In this simple case, the coefficients for fertilizers 10-08-22, 16-04-08, 18-24-06, and 29-03-04 indicate the difference in the mean yield for that fertilizer versus the mean yield for fertilizer 10-10-10. That the standard errors of all four coefficients are identical results from having perfectly balanced data. Marginal means We can use pwcompare with the cimargins option to compute the mean yield for each of the fertilizers. . pwcompare fertilizer, cimargins Pairwise comparisons of marginal linear predictions Margins : asbalanced Margin fertilizer 10-10-10 10-08-22 16-04-08 18-24-06 29-03-04 41.36243 44.98515 41.85306 46.28523 40.1241 Std. Err. Unadjusted [95% Conf. Interval] 1.124298 1.124298 1.124298 1.124298 1.124298 39.14509 42.7678 39.63571 44.06789 37.90676 43.57977 47.20249 44.0704 48.50258 42.34145 Looking at the confidence intervals for fertilizers 10-10-10 and 10-08-22 in the table above, we might be tempted to conclude that these means are not significantly different because the intervals overlap. However, as discussed in Interaction plots of [R] marginsplot, we cannot draw conclusions about the differences in means by looking at confidence intervals for the means themselves. Instead, we would need to look at confidence intervals for the difference in means. 1704 pwcompare — Pairwise comparisons All pairwise comparisons By default, pwcompare calculates all pairwise differences of the margins, in this case pairwise differences of the mean yields. . pwcompare fertilizer Pairwise comparisons of marginal linear predictions Margins 10-08-22 16-04-08 18-24-06 29-03-04 16-04-08 18-24-06 29-03-04 18-24-06 29-03-04 29-03-04 : asbalanced fertilizer vs 10-10-10 vs 10-10-10 vs 10-10-10 vs 10-10-10 vs 10-08-22 vs 10-08-22 vs 10-08-22 vs 16-04-08 vs 16-04-08 vs 18-24-06 Contrast Std. Err. 3.62272 .4906299 4.922803 -1.238328 -3.13209 1.300083 -4.861048 4.432173 -1.728958 -6.161132 1.589997 1.589997 1.589997 1.589997 1.589997 1.589997 1.589997 1.589997 1.589997 1.589997 Unadjusted [95% Conf. Interval] .4869212 -2.645169 1.787005 -4.374127 -6.267889 -1.835715 -7.996847 1.296375 -4.864757 -9.29693 6.758518 3.626428 8.058602 1.89747 .0037086 4.435882 -1.725249 7.567972 1.406841 -3.025333 If a confidence interval does not include zero, the means for the compared fertilizers are significantly different. Therefore, at the 5% significance level, we would reject the hypothesis that the means for fertilizers 10-10-10 and 10-08-22 are equivalent—as we would do for 18-24-06 vs 10-10-10, 29-03-04 vs 10-08-22, 18-24-06 vs 16-04-08, and 29-03-04 vs 18-24-06. We may prefer to see the p-values instead of looking at confidence intervals to determine whether the pairwise differences are significantly different from zero. 
We could use the pveffects option to see the differences with standard errors and p-values, or we could use the effects option to see both p-values and confidence intervals in the same table. Here we specify effects as well as the sort option so that the differences are sorted from smallest to largest. pwcompare — Pairwise comparisons 1705 . pwcompare fertilizer, effects sort Pairwise comparisons of marginal linear predictions Margins : asbalanced fertilizer 29-03-04 vs 18-24-06 29-03-04 vs 10-08-22 16-04-08 vs 10-08-22 29-03-04 vs 16-04-08 29-03-04 vs 10-10-10 16-04-08 vs 10-10-10 18-24-06 vs 10-08-22 10-08-22 vs 10-10-10 18-24-06 vs 16-04-08 18-24-06 vs 10-10-10 Unadjusted t P>|t| Unadjusted [95% Conf. Interval] Contrast Std. Err. -6.161132 1.589997 -3.87 0.000 -9.29693 -3.025333 -4.861048 1.589997 -3.06 0.003 -7.996847 -1.725249 -3.13209 1.589997 -1.97 0.050 -6.267889 .0037086 -1.728958 1.589997 -1.09 0.278 -4.864757 1.406841 -1.238328 1.589997 -0.78 0.437 -4.374127 1.89747 .4906299 1.589997 0.31 0.758 -2.645169 3.626428 1.300083 1.589997 0.82 0.415 -1.835715 4.435882 3.62272 1.589997 2.28 0.024 .4869212 6.758518 4.432173 1.589997 2.79 0.006 1.296375 7.567972 4.922803 1.589997 3.10 0.002 1.787005 8.058602 We find that 5 of the 10 pairs of means are significantly different at the 5% significance level. We can use the groups option to obtain a table that identifies groups whose means are not significantly different by assigning them the same letter. 1706 pwcompare — Pairwise comparisons . pwcompare fertilizer, groups sort Pairwise comparisons of marginal linear predictions Margins : asbalanced Margin fertilizer 29-03-04 10-10-10 16-04-08 10-08-22 18-24-06 40.1241 41.36243 41.85306 44.98515 46.28523 Std. Err. Unadjusted Groups 1.124298 1.124298 1.124298 1.124298 1.124298 A A AB BC C Note: Margins sharing a letter in the group label are not significantly different at the 5% level. The letter A that is assigned to fertilizers 29-03-04, 10-10-10, and 16-04-08 designates that the mean yields for these fertilizers are not different at the 5% level. Overview of multiple-comparison methods For a single test, if we choose a 5% significance level, we would have a 5% chance of concluding that two margins are different when the population values are actually equal. This is known as making a type I error. When we perform m = k(k − 1)/2 pairwise comparisons of the k margins, we have m opportunities to make a type I error. pwcompare with the mcompare() option allows us to adjust the confidence intervals and p-values for each comparison to account for the increased probability of making a type I error when making multiple comparisons. Bonferroni’s adjustment, Šidák’s adjustment, and Scheffé’s adjustment can be used when making pairwise comparisons of the margins after any estimation command. Tukey’s honestly significant difference, Student–Newman–Keuls’ method, Duncan’s method, and Dunnett’s method are only available when fitting linear models after anova, manova, regress, or mvreg. Fisher’s protected least-significant difference (LSD) pwcompare does not offer an mcompare() option specifically for Fisher’s protected least-significant difference (LSD). In this methodology, no adjustment is made to the confidence intervals or p-values. However, it is protected in the sense that no pairwise comparisons are tested unless the joint test for the corresponding term in the model is significant. 
Therefore, the default mcompare(noadjust) corresponds to Fisher's protected LSD assuming that the corresponding joint test was performed before using pwcompare. Milliken and Johnson (2009) recommend using this methodology for planned comparisons, assuming the corresponding joint test is significant.

Bonferroni's adjustment

mcompare(bonferroni) adjusts significance levels based on the Bonferroni inequality, which, in the case of multiple testing, tells us that the maximum error rate for all comparisons is the sum of the error rates for the individual comparisons. Assuming that we are using the same significance level for all tests, the experimentwise error rate is the error rate for a single test multiplied by the number of comparisons. Therefore, a p-value for each comparison can be computed by multiplying the unadjusted p-value by the total number of comparisons. If the adjusted p-value is greater than 1, then pwcompare will report a p-value of 1.

Bonferroni's adjustment is popular because it is easy to compute manually and because it can be applied to any set of tests, not only the pairwise comparisons available in pwcompare. In addition, this method does not require equal sample sizes.

Because Bonferroni's adjustment is so general, it is more conservative than many of the other adjustments. It is especially conservative when a large number of tests is being performed.

Šidák's adjustment

mcompare(sidak) performs an adjustment using Šidák's method. This adjustment, like Bonferroni's adjustment, is derived from an inequality. However, in this case, the inequality is based on the probability of not making a type I error. For a single test, the probability that we do not make a type I error is 1 − α. For two independent tests, both using α as a significance level, the probability is (1 − α)(1 − α). Likewise, for m independent tests, the probability of not making a type I error is (1 − α)^m. Therefore, the probability of making one or more type I errors is 1 − (1 − α)^m. When tests are not independent, the probability of making at least one error is less than 1 − (1 − α)^m. Therefore, we can compute an adjusted p-value as 1 − (1 − p)^m, where p is the unadjusted p-value for a single comparison.

Šidák's method is also conservative, although slightly less so than Bonferroni's method. Like Bonferroni's method, this method does not require equal sample sizes.

Scheffé's adjustment

Scheffé's adjustment is used when mcompare(scheffe) is specified. This adjustment is derived from the joint F test and its correspondence to the maximum normalized comparison. To adjust for multiple comparisons, the absolute value of the t statistic for a particular comparison can be compared with a critical value equal to the square root of (k − 1)Fk−1,ν, where ν is the residual degrees of freedom. Fk−1,ν is the distribution of the joint F test for the corresponding term in a one-way ANOVA model. Winer, Brown, and Michels (1991, 191–195) discuss this in detail. For estimation commands that report z statistics instead of t statistics for the tests on coefficients, a χ2 distribution is used instead of an F distribution.

Scheffé's method allows for making all possible comparisons of the k margins, not just the pairwise comparisons. Unlike the methods described above, it does not take into account the number of comparisons that are currently being made. Therefore, this method is even more conservative than the others.
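As a quick hand check of these three adjustments (these are the formulas above evaluated with display, not pwcompare output), consider the one-way fertilizer model fit above: k = 5 margins give m = 10 pairwise comparisons, the residual degrees of freedom are ν = 195, and the 16-04-08 versus 10-08-22 comparison has an unadjusted p-value of 0.050.

. display min(1, 10*.050)
. display 1 - (1 - .050)^10
. display sqrt(4*invFtail(4, 195, .05))

The first line is the Bonferroni-adjusted p-value (0.50), the second is the Šidák-adjusted p-value (about 0.40), and the third is the critical value, the square root of (k − 1)Fk−1,ν, against which Scheffé's adjustment compares each |t| at the 5% level.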
Because Scheffé's method adjusts for all possible comparisons of the levels of the term, Milliken and Johnson (2009) recommend using this procedure when making unplanned contrasts that are suggested by the data. As Winer, Brown, and Michels (1991, 191) put it, this method is often used to adjust for "unfettered data snooping". When using this adjustment, a contrast will never be significant if the joint F or χ2 test for the term is not also significant. This is another method that does not require equal sample sizes.

Tukey's HSD adjustment

Tukey's adjustment is also referred to as Tukey's honestly significant difference (HSD) and is used when mcompare(tukey) is specified. It is often applied to all pairwise comparisons of means. Tukey's HSD is commonly used as a post hoc test, although this is not a requirement.

To adjust for multiple comparisons, Tukey's method compares the absolute value of the t statistic from the individual comparison with a critical value based on a Studentized range distribution with parameter equal to the number of levels in the term. When applied to pairwise comparisons of means,

q = (mean_max − mean_min)/s

follows a Studentized range distribution with parameter k and ν degrees of freedom. Here mean_max and mean_min are the largest and smallest marginal means, and s is an estimate of the standard error of the means. Now for the comparison of the smallest and largest means, we can say that the probability of not making a type I error is

Pr{mean_max − mean_min ≤ q_{k,ν} s} = 1 − α

Then the following inequality holds for all pairs of means simultaneously:

Pr{|mean_i − mean_j| ≤ q_{k,ν} s} ≥ 1 − α

Based on this procedure, Tukey's HSD computes the p-value for each of the individual comparisons using the Studentized range distribution. However, because the equality holds only for the difference in the largest and smallest means, this procedure produces conservative tests for the remaining comparisons. Winer, Brown, and Michels (1991, 172–182) discuss this in further detail.

With unequal sample sizes, mcompare(tukey) produces the Tukey–Kramer adjustment (Tukey 1953; Kramer 1956).

Student–Newman–Keuls' adjustment

The Student–Newman–Keuls (SNK) method is used when mcompare(snk) is specified. It is a modification to Tukey's method and is less conservative. In this procedure, we first order the means. We then test the difference in the smallest and largest means using a critical value from the Studentized range distribution with parameter k, where k is the number of levels in the term. This step uses the same methodology as in Tukey's procedure. In the next step, however, we test for differences in the two sets of means that are the endpoints of the two ranges including k − 1 means. Specifically, we test the difference in the smallest mean and the second-largest mean using a critical value from the Studentized range distribution with parameter k − 1. We would also test the difference in the second-smallest mean and the largest mean using this critical value. Likewise, the means that are the endpoints of ranges including k − 2 means when ordered are tested using the Studentized range distribution with parameter k − 2, and so on. Equal sample sizes are required for this method.

Duncan's adjustment

When mcompare(duncan) is specified, tests are adjusted for multiple comparisons using Duncan's method, which is sometimes referred to as Duncan's new multiple range method.
This adjustment produces tests that are less conservative than both Tukey’s HSD and SNK. This procedure is performed in the same manner as SNK except that the p-values for the individual comparisons are adjusted as 1 − (1 − snk pi )1/(r+1) , where snk p is the p-value computed using the SNK method and r represents the number of means that, when ordered, fall between the two that are being compared. Again equal sample sizes are required for this adjustment. Dunnett’s adjustment Dunnett’s adjustment is obtained by specifying mcompare(dunnett). It is used when one of the levels of a factor can be considered a control or reference level with which each of the other levels is being compared. When Dunnett’s adjustment is requested, k − 1 instead of k(k − 1)/2 pairwise comparisons are made. Dunnett (1955, 1964) developed tables of critical values for what Miller (1981, 76) refers to as the “many-one t statistic”. The t statistics for individual comparisons are compared with these critical values when making many comparisons to a single reference level. This method also requires equal sample sizes. Example adjustments using one-way models Fisher’s protected LSD Fisher’s protected LSD requires that we first verify that the joint test for a term in our model is significant before proceeding with pairwise comparisons. Using our previous example, we could have first used the contrast command to obtain a joint test for the effects of fertilizer. . contrast fertilizer Contrasts of marginal linear predictions Margins : asbalanced df F P>F fertilizer 4 5.33 0.0004 Denominator 195 This test for the effects of fertilizer is highly significant. Now we can say we are using Fisher’s protected LSD when looking at the unadjusted p-values that were obtained from our previous command, . pwcompare fertilizer, effects sort 1710 pwcompare — Pairwise comparisons Tukey’s HSD Because we fit a linear regression model and are interested in all pairwise comparisons of the marginal means, we may instead choose to use Tukey’s HSD. . pwcompare fertilizer, effects sort mcompare(tukey) Pairwise comparisons of marginal linear predictions Margins : asbalanced Number of Comparisons fertilizer fertilizer 29-03-04 vs 18-24-06 29-03-04 vs 10-08-22 16-04-08 vs 10-08-22 29-03-04 vs 16-04-08 29-03-04 vs 10-10-10 16-04-08 vs 10-10-10 18-24-06 vs 10-08-22 10-08-22 vs 10-10-10 18-24-06 vs 16-04-08 18-24-06 vs 10-10-10 10 Tukey t P>|t| Tukey [95% Conf. Interval] Contrast Std. Err. -6.161132 1.589997 -3.87 0.001 -10.53914 -1.78312 -4.861048 1.589997 -3.06 0.021 -9.239059 -.4830368 -3.13209 1.589997 -1.97 0.285 -7.510101 1.245921 -1.728958 1.589997 -1.09 0.813 -6.106969 2.649053 -1.238328 1.589997 -0.78 0.936 -5.616339 3.139683 .4906299 1.589997 0.31 0.998 -3.887381 4.868641 1.300083 1.589997 0.82 0.925 -3.077928 5.678095 3.62272 1.589997 2.28 0.156 -.7552913 8.000731 4.432173 1.589997 2.79 0.046 .0541623 8.810185 4.922803 1.589997 3.10 0.019 .5447922 9.300815 This time, our p-values have been modified, and we find that only four of the pairwise differences are considered significantly different from zero at the 5% level. If we only are interested in performing pairwise comparisons of a subset of our means, we can use factor-variable operators to select the levels of the factor that we want to compare. Here we exclude all comparisons involving fertilizer 10-10-10. pwcompare — Pairwise comparisons 1711 . 
pwcompare i(2/5).fertilizer, effects sort mcompare(tukey) Pairwise comparisons of marginal linear predictions Margins : asbalanced Number of Comparisons fertilizer fertilizer 29-03-04 vs 18-24-06 29-03-04 vs 10-08-22 16-04-08 vs 10-08-22 29-03-04 vs 16-04-08 18-24-06 vs 10-08-22 18-24-06 vs 16-04-08 6 Tukey t P>|t| Tukey [95% Conf. Interval] Contrast Std. Err. -6.161132 1.589997 -3.87 0.001 -10.28133 -2.040937 -4.861048 1.589997 -3.06 0.013 -8.981242 -.7408538 -3.13209 1.589997 -1.97 0.203 -7.252284 .9881042 -1.728958 1.589997 -1.09 0.698 -5.849152 2.391236 1.300083 1.589997 0.82 0.846 -2.820111 5.420278 4.432173 1.589997 2.79 0.030 .3119792 8.552368 The adjusted p-values and confidence intervals differ from those in the previous output because Tukey’s adjustment takes into account the total number of comparisons being made when determining the appropriate degrees of freedom to use for the Studentized range distribution. Dunnett’s method for comparisons to a control If one of our five fertilizer groups represents fields where no fertilizer was applied, we may want to use Dunnett’s method to compare each of the four fertilizers with the control group. In this case, we make only k − 1 comparisons for k groups. 1712 pwcompare — Pairwise comparisons . pwcompare fertilizer, effects mcompare(dunnett) Pairwise comparisons of marginal linear predictions Margins : asbalanced Number of Comparisons fertilizer fertilizer 10-08-22 vs 10-10-10 16-04-08 vs 10-10-10 18-24-06 vs 10-10-10 29-03-04 vs 10-10-10 4 Dunnett t P>|t| Dunnett [95% Conf. Interval] Contrast Std. Err. 3.62272 1.589997 2.28 0.079 -.2918331 7.537273 .4906299 1.589997 0.31 0.994 -3.423923 4.405183 4.922803 1.589997 3.10 0.008 1.00825 8.837356 -1.238328 1.589997 -0.78 0.852 -5.152881 2.676225 In our previous regress command, fertilizer 10-10-10 was treated as the base. Therefore, by default, it was treated as the control when using Dunnett’s adjustment, and the pairwise comparisons are equivalent to the coefficients reported by regress. Based on our regress output, we would conclude that fertilizers 10-08-22 and 18-24-06 are different from fertilizer 10-10-10 at the 5% level. However, using Dunnett’s adjustment, we find only fertilizer 18-24-06 to be different from fertilizer 10-10-10 at this same significance level. If the model is fit without a base level for a factor variable, then pwcompare will choose the first level as the reference level. If we want to make comparisons with a different level than the one mcompare(dunnett) chooses by default, we can use the b. operator to override the default. Here we use fertilizer 5 (29-03-04) as the reference level. pwcompare — Pairwise comparisons 1713 . pwcompare b5.fertilizer, effects sort mcompare(dunnett) Pairwise comparisons of marginal linear predictions Margins : asbalanced Number of Comparisons fertilizer fertilizer 10-10-10 vs 29-03-04 16-04-08 vs 29-03-04 10-08-22 vs 29-03-04 18-24-06 vs 29-03-04 4 Dunnett t P>|t| Dunnett [95% Conf. Interval] Contrast Std. Err. 1.238328 1.589997 0.78 0.852 -2.676225 5.152881 1.728958 1.589997 1.09 0.649 -2.185595 5.643511 4.861048 1.589997 3.06 0.009 .9464951 8.775601 6.161132 1.589997 3.87 0.001 2.246579 10.07568 Two-way models In the previous examples, we have performed pairwise comparisons after fitting a model with a single factor. Now we include two factors and their interaction in our model. . 
regress yield fertilizer##irrigation Source SS df MS Model Residual 6200.81605 4737.57936 9 190 688.979561 24.9346282 Total 10938.3954 199 54.9668111 P>|t| 200 27.63 0.0000 0.5669 0.5464 4.9935 Coef. fertilizer 10-08-22 16-04-08 18-24-06 29-03-04 1.882256 -.5687418 4.904999 -1.217496 1.57907 1.57907 1.57907 1.57907 1.19 -0.36 3.11 -0.77 0.235 0.719 0.002 0.442 -1.232505 -3.683502 1.790239 -4.332257 4.997016 2.546019 8.01976 1.897264 1.irrigation 8.899721 1.57907 5.64 0.000 5.784961 12.01448 3.480928 2.118743 .0356082 -.0416636 2.233143 2.233143 2.233143 2.233143 1.56 0.95 0.02 -0.02 0.121 0.344 0.987 0.985 -.9240084 -2.286193 -4.369328 -4.4466 7.885865 6.52368 4.440545 4.363273 36.91257 1.116571 33.06 0.000 34.7101 39.11504 _cons t = = = = = = yield fertilizer# irrigation 10-08-22#1 16-04-08#1 18-24-06#1 29-03-04#1 Std. Err. Number of obs F( 9, 190) Prob > F R-squared Adj R-squared Root MSE [95% Conf. Interval] 1714 pwcompare — Pairwise comparisons We can perform pairwise comparisons of the cell means defined by the fertilizer and irrigation interaction. . pwcompare fertilizer#irrigation, sort groups mcompare(tukey) Pairwise comparisons of marginal linear predictions Margins : asbalanced Number of Comparisons fertilizer#irrigation 45 Margin fertilizer#irrigation 29-03-04#0 16-04-08#0 10-10-10#0 10-08-22#0 18-24-06#0 29-03-04#1 10-10-10#1 16-04-08#1 18-24-06#1 10-08-22#1 35.69507 36.34383 36.91257 38.79482 41.81757 44.55313 45.81229 47.36229 50.7529 51.17547 Std. Err. Tukey Groups 1.116571 1.116571 1.116571 1.116571 1.116571 1.116571 1.116571 1.116571 1.116571 1.116571 A A AB AB BC CD CDE DEF EF F Note: Margins sharing a letter in the group label are not significantly different at the 5% level. Based on Tukey’s HSD and a 5% significance level, we would conclude that the mean yield for fertilizer 29-03-04 without irrigation is not significantly different from the mean yields for fertilizers 10-10-10, 10-08-22, and 16-04-08 when used without irrigation but is significantly different from the remaining means. Up to this point, most of the pairwise comparisons that we have performed could have also been obtained with pwmean (see [R] pwmean) if we had not been interested in examining the results from the estimation command before making pairwise comparisons of the means. For instance, we could reproduce the results from the above pwcompare command by typing . pwmean yield, over(fertilizer irrigation) sort group mcompare(tukey) However, pwcompare extends the capabilities of pwmean in many ways. For instance, pwmean only allows for pairwise comparisons of the cell means determined by the highest level interaction of the variables specified in the over() option. However, pwcompare allows us to fit a single model, such as the two-way model that we fit above, . regress yield fertilizer##irrigation pwcompare — Pairwise comparisons 1715 and compute pairwise comparisons of the marginal means for only one of the variables in the model: . pwcompare fertilizer, sort effects mcompare(tukey) Pairwise comparisons of marginal linear predictions Margins : asbalanced Number of Comparisons fertilizer fertilizer 29-03-04 vs 18-24-06 29-03-04 vs 10-08-22 16-04-08 vs 10-08-22 29-03-04 vs 16-04-08 29-03-04 vs 10-10-10 16-04-08 vs 10-10-10 18-24-06 vs 10-08-22 10-08-22 vs 10-10-10 18-24-06 vs 16-04-08 18-24-06 vs 10-10-10 10 Tukey t P>|t| Tukey [95% Conf. Interval] Contrast Std. Err. 
-6.161132 1.116571 -5.52 0.000 -9.236338 -3.085925 -4.861048 1.116571 -4.35 0.000 -7.936255 -1.785841 -3.13209 1.116571 -2.81 0.044 -6.207297 -.0568832 -1.728958 1.116571 -1.55 0.532 -4.804165 1.346249 -1.238328 1.116571 -1.11 0.802 -4.313535 1.836879 .4906299 1.116571 0.44 0.992 -2.584577 3.565837 1.300083 1.116571 1.16 0.772 -1.775123 4.37529 3.62272 1.116571 3.24 0.012 .5475131 6.697927 4.432173 1.116571 3.97 0.001 1.356967 7.50738 4.922803 1.116571 4.41 0.000 1.847597 7.99801 Here the standard errors for the differences in marginal means and the residual degrees of freedom are based on the full model. Therefore, the results will differ from those obtained from pwcompare after fitting the one-way model with only fertilizer (or equivalently using pwmean). Pairwise comparisons of slopes If we fit a model with a factor variable that is interacted with a continuous variable, pwcompare will even allow us to make pairwise comparisons of the slopes of the continuous variable for the levels of the factor variable. In this case, we have a continuous variable, N03 N, indicating the amount of nitrate nitrogen already existing in the soil, based on a sample taken from each field. 1716 pwcompare — Pairwise comparisons . regress yield fertilizer##c.N03_N SS df Source MS Model Residual 7005.69932 3932.69609 9 190 778.411035 20.6984005 Total 10938.3954 199 54.9668111 P>|t| 200 37.61 0.0000 0.6405 0.6234 4.5495 Coef. fertilizer 10-08-22 16-04-08 18-24-06 29-03-04 18.65019 -13.34076 24.35061 17.58529 8.452061 10.07595 9.911463 8.446736 2.21 -1.32 2.46 2.08 0.029 0.187 0.015 0.039 1.97826 -33.21585 4.799973 .9238646 35.32212 6.534327 43.90125 34.24671 N03_N 4.915653 .7983509 6.16 0.000 3.340884 6.490423 -1.282039 -1.00571 -2.97627 -3.275947 .8953419 .9025862 .9136338 .8247385 -1.43 -1.11 -3.26 -3.97 0.154 0.267 0.001 0.000 -3.048126 -2.786087 -4.778438 -4.902767 .4840487 .7746662 -1.174102 -1.649127 -5.459168 7.638241 -0.71 0.476 -20.52581 9.607477 _cons t = = = = = = yield fertilizer# c.N03_N 10-08-22 16-04-08 18-24-06 29-03-04 Std. Err. Number of obs F( 9, 190) Prob > F R-squared Adj R-squared Root MSE [95% Conf. Interval] These are the pairwise differences of the slopes of NO3 N for each pair of fertilizers: . pwcompare fertilizer#c.N03_N, pveffects sort mcompare(scheffe) Pairwise comparisons of marginal linear predictions Margins : asbalanced Number of Comparisons fertilizer#c.N03_N fertilizer#c.N03_N 29-03-04 vs 10-10-10 18-24-06 vs 10-10-10 29-03-04 vs 16-04-08 29-03-04 vs 10-08-22 18-24-06 vs 16-04-08 18-24-06 vs 10-08-22 10-08-22 vs 10-10-10 16-04-08 vs 10-10-10 29-03-04 vs 18-24-06 16-04-08 vs 10-08-22 10 Contrast Std. Err. -3.275947 -2.97627 -2.270237 -1.993909 -1.97056 -1.694232 -1.282039 -1.00571 -.2996772 .276328 .8247385 .9136338 .4691771 .4550851 .612095 .6013615 .8953419 .9025862 .4900939 .5844405 Scheffe t P>|t| -3.97 -3.26 -4.84 -4.38 -3.22 -2.82 -1.43 -1.11 -0.61 0.47 0.004 0.034 0.000 0.001 0.038 0.099 0.727 0.871 0.984 0.994 Using Scheffé’s adjustment, we find that five of the pairs have significantly different slopes at the 5% level. pwcompare — Pairwise comparisons 1717 Nonlinear models pwcompare can also perform pairwise comparisons of the marginal linear predictions after fitting a nonlinear model. For instance, we can use the dataset from Beyond linear models in [R] contrast and fit the following logistic regression model of patient satisfaction on hospital: . use http://www.stata-press.com/data/r13/hospital (Artificial hospital satisfaction data) . 
logit satisfied i.hospital Iteration 0: log likelihood = -393.72216 Iteration 1: log likelihood = -387.55736 Iteration 2: log likelihood = -387.4768 Iteration 3: log likelihood = -387.47679 Logistic regression Number of obs LR chi2(2) Prob > chi2 Pseudo R2 Log likelihood = -387.47679 satisfied Coef. hospital 2 3 _cons = = = = 802 12.49 0.0019 0.0159 Std. Err. z P>|z| [95% Conf. Interval] .5348129 .7354519 .2136021 .2221929 2.50 3.31 0.012 0.001 .1161604 .2999618 .9534654 1.170942 1.034708 .1391469 7.44 0.000 .7619855 1.307431 For this model, the marginal linear predictions are the predicted log odds for each hospital and can be obtained with the cimargins option: . pwcompare hospital, cimargins Pairwise comparisons of marginal linear predictions Margins : asbalanced Margin hospital 1 2 3 1.034708 1.569521 1.77016 Std. Err. Unadjusted [95% Conf. Interval] .1391469 .1620618 .1732277 .7619855 1.251886 1.43064 1.307431 1.887157 2.10968 The pairwise comparisons are, therefore, differences in the log odds. We can specify mcompare(bonferroni) and effects to request Bonferroni-adjusted p-values and confidence intervals. . pwcompare hospital, effects mcompare(bonferroni) Pairwise comparisons of marginal linear predictions Margins : asbalanced Number of Comparisons satisfied hospital 3 1718 pwcompare — Pairwise comparisons satisfied hospital 2 vs 1 3 vs 1 3 vs 2 Contrast Std. Err. .5348129 .7354519 .200639 .2136021 .2221929 .2372169 Bonferroni z P>|z| 2.50 3.31 0.85 0.037 0.003 1.000 Bonferroni [95% Conf. Interval] .0234537 .2035265 -.3672535 1.046172 1.267377 .7685314 For nonlinear models, only Bonferroni’s adjustment, Šidák’s adjustment, and Scheffé’s adjustment are available. If we want pairwise comparisons reported as odds ratios, we can specify the or option. . pwcompare hospital, effects mcompare(bonferroni) or Pairwise comparisons of marginal linear predictions Margins : asbalanced Number of Comparisons satisfied hospital 3 Odds Ratio satisfied hospital 2 vs 1 3 vs 1 3 vs 2 1.707129 2.086425 1.222183 Std. Err. .3646464 .4635888 .2899226 Bonferroni z P>|z| 2.50 3.31 0.85 0.037 0.003 1.000 Bonferroni [95% Conf. Interval] 1.023731 1.225718 .6926341 2.846733 3.551525 2.156597 Notice that these tests are still performed on the marginal linear predictions. The odds ratios reported here are the exponentiated versions of the pairwise differences of log odds in the previous output. For further discussion, see [R] contrast. Multiple-equation models pwcompare works with models containing multiple equations. Commands such as intreg and gnbreg allow their ancillary parameters to be modeled as a function of independent variables, and pwcompare can compare the margins within these equations. The equation() option can be used to specify the equation for which pairwise comparisons of the margins should be made. The atequations option specifies that pairwise comparisons be computed for each equation. In addition, pwcompare allows a special pseudofactor for equation—called eqns—when working with results from manova, mvreg, mlogit, and mprobit. Here we use the jaw fracture dataset described in example 4 of [MV] manova. We fit a multivariate regression model including one independent factor variable, fracture. pwcompare — Pairwise comparisons . use http://www.stata-press.com/data/r13/jaw (Table 4.6 Two-Way Unbalanced Data for Fractures of the Jaw -- Rencher (1998)) . mvreg y1 y2 y3 = i.fracture Equation Obs Parms RMSE "R-sq" F P y1 y2 y3 27 27 27 3 3 3 Coef. 10.42366 6.325398 5.976973 Std. Err. 
0.2966 0.1341 0.1024 t 5.060804 1.858342 1.368879 P>|t| 0.0147 0.1777 0.2735 [95% Conf. Interval] y1 fracture two compo.. one simpl.. -8.833333 6 4.957441 5.394759 -1.78 1.11 0.087 0.277 -19.06499 -5.134235 1.398322 17.13423 _cons 37 3.939775 9.39 0.000 28.8687 45.1313 fracture two compo.. one simpl.. -5.761905 -3.053571 3.008327 3.273705 -1.92 -0.93 0.067 0.360 -11.97079 -9.810166 .446977 3.703023 _cons 38.42857 2.390776 16.07 0.000 33.49425 43.36289 fracture two compo.. one simpl.. 4.261905 .9285714 2.842618 3.093377 1.50 0.30 0.147 0.767 -1.60497 -5.455846 10.12878 7.312989 _cons 58.57143 2.259083 25.93 0.000 53.90891 63.23395 y2 y3 1719 1720 pwcompare — Pairwise comparisons pwcompare performs pairwise comparisons of the margins using the coefficients from the first equation by default: . pwcompare fracture, mcompare(bonferroni) Pairwise comparisons of marginal linear predictions Margins : asbalanced Number of Comparisons y1 fracture 3 Bonferroni [95% Conf. Interval] Contrast Std. Err. -8.833333 4.957441 -21.59201 3.925341 6 5.394759 -7.884173 19.88417 14.83333 4.75773 2.588644 27.07802 y1 fracture two compound fractures vs one compound fracture one simple fracture vs one compound fracture one simple fracture vs two compound fractures We can use the equation() option to get pwcompare to perform comparisons in the y2 equation: . pwcompare fracture, equation(y2) mcompare(bonferroni) Pairwise comparisons of marginal linear predictions Margins : asbalanced Number of Comparisons y2 fracture 3 Bonferroni [95% Conf. Interval] Contrast Std. Err. -5.761905 3.008327 -13.50426 1.980449 -3.053571 3.273705 -11.47891 5.371769 2.708333 2.887136 -4.722119 10.13879 y2 fracture two compound fractures vs one compound fracture one simple fracture vs one compound fracture one simple fracture vs two compound fractures pwcompare — Pairwise comparisons 1721 Because we are working with mvreg results, we can use the eqns pseudofactor to compare the margins between the three dependent variables. The levels of eqns index the equations: 1 for the first equation, 2 for the second, and 3 for the third. . pwcompare _eqns, mcompare(bonferroni) Pairwise comparisons of marginal linear predictions Margins : asbalanced Number of Comparisons _eqns _eqns 2 vs 1 3 vs 1 3 vs 2 3 Contrast Std. Err. -.5654762 24.24603 24.81151 2.545923 2.320677 2.368188 Bonferroni [95% Conf. Interval] -7.117768 18.27344 18.71664 5.986815 30.21862 30.90637 For the previous command, the only methods available are mcompare(bonferroni), mcompare(sidak), or mcompare(scheffe). Methods that use the Studentized range are not appropriate for making comparisons across equations. Unbalanced data pwcompare treats all factors as balanced when it computes the marginal means. By “balanced”, we mean that the number of observations in each combination of factor levels (in each cell mean) is equal. We can alternatively specify the asobserved option when we have unbalanced data to obtain marginal means that are based on the observed cell frequencies from the model fit. For more details on the difference in these two types of marginal means and a discussion of when each may be appropriate, see [R] margins and [R] contrast. In addition, when our data are not balanced, some of the multiple-comparison adjustments are no longer appropriate. Student–Newman–Keuls’ method, Duncan’s method, and Dunnett’s method assume equal numbers of observations per group. Here we use an unbalanced dataset and fit a two-way ANOVA model for cholesterol levels on race and age group. 
Then we perform pairwise comparisons of the mean cholesterol levels for each race, requesting Šidák’s adjustment as well as marginal means that are computed using the observed cell frequencies. 1722 pwcompare — Pairwise comparisons . use http://www.stata-press.com/data/r13/cholesterol3 (Artificial cholesterol data, unbalanced) . anova chol race##agegrp Number of obs = 67 R-squared = 0.8179 Root MSE = 8.37496 Adj R-squared = 0.7689 Partial SS df MS F Prob > F Source Model 16379.9926 14 1169.99947 16.68 0.0000 race agegrp race#agegrp 230.754396 13857.9877 857.815209 2 4 8 115.377198 3464.49693 107.226901 1.64 49.39 1.53 0.2029 0.0000 0.1701 Residual 3647.2774 52 70.13995 Total 20027.27 66 303.443485 . pwcompare race, asobserved mcompare(sidak) Pairwise comparisons of marginal linear predictions Margins : asobserved Number of Comparisons race race white vs black other vs black other vs white 3 Contrast Std. Err. -7.232433 -5.231198 2.001235 2.686089 2.651203 2.414964 Sidak [95% Conf. Interval] -13.85924 -11.77194 -3.956682 -.6056277 1.309541 7.959152 Empty cells An empty cell is a combination of the levels of factor variables that is not observed in the estimation sample. When we have empty cells in our data, the marginal means involving those empty cells are not estimable as described in [R] margins. In addition, all pairwise comparisons involving a marginal mean that is not estimable are themselves not estimable. Here we use a dataset where we do not have any observations for white individuals in the 20–29 age group. We can use the emptycells(reweight) option to reweight the nonempty cells so that we can estimate the marginal mean for whites and compute pairwise comparisons involving that marginal mean. pwcompare — Pairwise comparisons . use http://www.stata-press.com/data/r13/cholesterol2 (Artificial cholesterol data, empty cells) . tabulate race agegrp agegrp 10-19 20-29 30-39 40-59 race 60-79 Total black white other 5 5 5 5 0 5 5 5 5 5 5 5 5 5 5 25 20 25 Total 15 10 15 15 15 70 1723 . anova chol race##agegrp Source Number of obs = 70 Root MSE = 9.47055 Partial SS df MS R-squared = 0.7582 Adj R-squared = 0.7021 F Prob > F Model 15751.6113 13 1211.66241 13.51 0.0000 race agegrp race#agegrp 305.49046 14387.8559 795.807574 2 4 7 152.74523 3596.96397 113.686796 1.70 40.10 1.27 0.1914 0.0000 0.2831 Residual 5022.71559 56 89.6913498 Total 20774.3269 69 301.077201 . pwcompare race, emptycells(reweight) Pairwise comparisons of marginal linear predictions Margins : asbalanced Empty cells : reweight race white vs black other vs black other vs white Contrast Std. Err. 2.922769 -4.12621 -7.048979 2.841166 2.678677 2.841166 Unadjusted [95% Conf. Interval] -2.768769 -9.492244 -12.74052 8.614308 1.239824 -1.35744 For further details on the emptycells(reweight) option, see [R] margins and [R] contrast. 
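The commands from the last two examples can be collected into one short do-file sketch. Nothing here is new; it simply repeats commands shown above, with comments restating what the text says each step does.

    use http://www.stata-press.com/data/r13/cholesterol2, clear
    anova chol race##agegrp

    * with the default asbalanced margins, the margin for whites involves the
    * empty white#20-29 cell, so pairwise comparisons involving it are not estimable
    pwcompare race

    * reweight the nonempty cells so that the margin for whites, and the
    * pairwise comparisons involving it, become estimable
    pwcompare race, emptycells(reweight)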
Stored results

pwcompare stores the following in r():

Scalars
    r(df_r)              variance degrees of freedom
    r(k_terms)           number of terms in marginlist
    r(level)             confidence level of confidence intervals
    r(balanced)          1 if fully balanced data; 0 otherwise

Macros
    r(cmd)               pwcompare
    r(cmdline)           command as typed
    r(est_cmd)           e(cmd) from original estimation results
    r(est_cmdline)       e(cmdline) from original estimation results
    r(title)             title in output
    r(emptycells)        empspec from emptycells()
    r(groups#)           group codes for the #th margin in r(b)
    r(mcmethod_vs)       method from mcompare()
    r(mctitle_vs)        title for method from mcompare()
    r(mcadjustall_vs)    adjustall or empty
    r(margin_method)     asbalanced or asobserved
    r(vce)               vcetype specified in vce() in original estimation command

Matrices
    r(b)                 margin estimates
    r(V)                 variance–covariance matrix of the margin estimates
    r(error)             margin estimability codes; 0 means estimable, 8 means not estimable
    r(table)             matrix containing the margins with their standard errors, test statistics, p-values, and confidence intervals
    r(M)                 matrix that produces the margins from the model coefficients
    r(b_vs)              margin difference estimates
    r(V_vs)              variance–covariance matrix of the margin difference estimates
    r(error_vs)          margin difference estimability codes; 0 means estimable, 8 means not estimable
    r(table_vs)          matrix containing the margin differences with their standard errors, test statistics, p-values, and confidence intervals
    r(L)                 matrix that produces the margin differences from the model coefficients
    r(k_groups)          number of significance groups for each term

pwcompare with the post option also stores the following in e():

Scalars
    e(df_r)              variance degrees of freedom
    e(k_terms)           number of terms in marginlist
    e(balanced)          1 if fully balanced data; 0 otherwise

Macros
    e(cmd)               pwcompare
    e(cmdline)           command as typed
    e(est_cmd)           e(cmd) from original estimation results
    e(est_cmdline)       e(cmdline) from original estimation results
    e(title)             title in output
    e(emptycells)        empspec from emptycells()
    e(margin_method)     asbalanced or asobserved
    e(vce)               vcetype specified in vce() in original estimation command
    e(properties)        b V

Matrices
    e(b)                 margin estimates
    e(V)                 variance–covariance matrix of the margin estimates
    e(error)             margin estimability codes; 0 means estimable, 8 means not estimable
    e(M)                 matrix that produces the margins from the model coefficients
    e(b_vs)              margin difference estimates
    e(V_vs)              variance–covariance matrix of the margin difference estimates
    e(error_vs)          margin difference estimability codes; 0 means estimable, 8 means not estimable
    e(L)                 matrix that produces the margin differences from the model coefficients
    e(k_groups)          number of significance groups for each term

Methods and formulas

Methods and formulas are presented under the following headings:

    Notation
    Unadjusted comparisons
    Bonferroni's method
    Šidák's method
    Scheffé's method
    Tukey's method
    Student–Newman–Keuls' method
    Duncan's method
    Dunnett's method

Notation

pwcompare performs comparisons of margins; see Methods and formulas in [R] contrast. If there are k margins for a given factor term, then there are

    m = \binom{k}{2} = \frac{k(k-1)}{2}

unique pairwise comparisons. Let the ith pairwise comparison be denoted by

    \hat{\delta}_i = l_i' b

where b is a column vector of coefficients from the fitted model and l_i is a column vector that forms the corresponding linear combination. If \hat{V} denotes the estimated variance matrix for b, then the standard error for \hat{\delta}_i is given by

    \widehat{se}(\hat{\delta}_i) = \sqrt{l_i' \hat{V} l_i}

The corresponding test statistic is then

    t_i = \hat{\delta}_i / \widehat{se}(\hat{\delta}_i)

and the limits for a 100(1 - \alpha)% confidence interval for the expected value of \hat{\delta}_i are

    \hat{\delta}_i \pm c_i(\alpha) \, \widehat{se}(\hat{\delta}_i)

where c_i(\alpha) is the critical value corresponding to the chosen multiple-comparison method.

Unadjusted comparisons

pwcompare computes unadjusted p-values and confidence intervals by default. pwcompare uses the t distribution with \nu = e(df_r) degrees of freedom when e(df_r) is posted by the estimation command. The unadjusted two-sided p-value is

    {}_{u}p_i = 2 \Pr(t_\nu > |t_i|)

and the unadjusted critical value {}_{u}c_i(\alpha) satisfies the following probability statement:

    \alpha = 2 \Pr\{ t_\nu > {}_{u}c_i(\alpha) \}

pwcompare uses the standard normal distribution when e(df_r) is not posted.

Bonferroni's method

For mcompare(bonferroni), the adjusted p-value is

    {}_{b}p_i = \min(1, \; m \, {}_{u}p_i)

and the adjusted critical value is

    {}_{b}c_i(\alpha) = {}_{u}c_i(\alpha/m)

Šidák's method

For mcompare(sidak), the adjusted p-value is

    {}_{si}p_i = 1 - (1 - {}_{u}p_i)^m

and the adjusted critical value is

    {}_{si}c_i(\alpha) = {}_{u}c_i\{ 1 - (1 - \alpha)^{1/m} \}

Scheffé's method

For mcompare(scheffe), the adjusted p-value is

    {}_{sc}p_i = \Pr( F_{d,\nu} > t_i^2 / d )

where F_{d,\nu} is distributed as an F with d numerator and \nu denominator degrees of freedom and d is the rank of the VCE for the term. The adjusted critical value satisfies the following probability statement:

    \alpha = \Pr[ F_{d,\nu} > \{ {}_{sc}c_i(\alpha) \}^2 / d ]

pwcompare uses the \chi^2 distribution when e(df_r) is not posted.

Tukey's method

For mcompare(tukey), the adjusted p-value is

    {}_{t}p_i = \Pr( q_{k,\nu} > |t_i| \sqrt{2} )

where q_{k,\nu} is distributed as the Studentized range statistic for k means and \nu residual degrees of freedom (Miller 1981). The adjusted critical value satisfies the following probability statement:

    \alpha = \Pr\{ q_{k,\nu} > {}_{t}c_i(\alpha) \sqrt{2} \}

Student–Newman–Keuls' method

For mcompare(snk), suppose t_i is comparing two margins that have r other margins between them. Then the adjusted p-value is

    {}_{snk}p_i = \Pr( q_{r+2,\nu} > |t_i| \sqrt{2} )

where r ranges from 0 to k - 2. The adjusted critical value {}_{snk}c_i(\alpha) satisfies the following probability statement:

    \alpha = \Pr\{ q_{r+2,\nu} > {}_{snk}c_i(\alpha) \sqrt{2} \}

Duncan's method

For mcompare(duncan), the adjusted p-value is

    {}_{dunc}p_i = 1 - (1 - {}_{snk}p_i)^{1/(r+1)}

and the adjusted critical value is

    {}_{dunc}c_i(\alpha) = {}_{snk}c_i\{ 1 - (1 - \alpha)^{r+1} \}

Dunnett's method

For mcompare(dunnett), the margins are compared with a reference category, resulting in only k - 1 pairwise comparisons. The adjusted p-value is

    {}_{dunn}p_i = \Pr( d_{k-1,\nu} > |t_i| )

where d_{k-1,\nu} is distributed as the many-one t statistic (Miller 1981, 76). The adjusted critical value {}_{dunn}c_i(\alpha) satisfies the following probability statement:

    \alpha = \Pr\{ d_{k-1,\nu} > {}_{dunn}c_i(\alpha) \}

The multiple-comparison methods for mcompare(tukey), mcompare(snk), mcompare(duncan), and mcompare(dunnett) assume the normal distribution with equal variance; thus these methods are allowed only with results from anova, regress, manova, and mvreg. mcompare(snk), mcompare(duncan), and mcompare(dunnett) assume equal sample size for each marginal mean. These options will cause pwcompare to report a footnote if unbalanced factors are detected.

References

Dunnett, C. W. 1955. A multiple comparison procedure for comparing several treatments with a control. Journal of the American Statistical Association 50: 1096–1121.

Dunnett, C. W. 1964. New tables for multiple comparisons with a control. Biometrics 20: 482–491.

Kramer, C. Y. 1956.
Extension of multiple range tests to group means with unequal numbers of replications. Biometrics 12: 307–310. Miller, R. G., Jr. 1981. Simultaneous Statistical Inference. 2nd ed. New York: Springer. Milliken, G. A., and D. E. Johnson. 2009. Analysis of Messy Data, Volume 1: Designed Experiments. 2nd ed. Boca Raton, FL: CRC Press. Mitchell, M. N. 2012. Interpreting and Visualizing Regression Models Using Stata. College Station, TX: Stata Press. Searle, S. R. 1997. Linear Models for Unbalanced Data. New York: Wiley. Tukey, J. W. 1953. The problem of multiple comparisons. Unpublished manuscript, Princeton University. Winer, B. J., D. R. Brown, and K. M. Michels. 1991. Statistical Principles in Experimental Design. 3rd ed. New York: McGraw–Hill. pwcompare — Pairwise comparisons Also see [R] pwcompare postestimation — Postestimation tools for pwcompare [R] contrast — Contrasts and linear hypothesis tests after estimation [R] lincom — Linear combinations of estimators [R] margins — Marginal means, predictive margins, and marginal effects [R] margins, pwcompare — Pairwise comparisons of margins [R] pwmean — Pairwise comparisons of means [R] test — Test linear hypotheses after estimation [U] 20 Estimation and postestimation commands 1729 Title pwcompare postestimation — Postestimation tools for pwcompare Description Remarks and examples Also see Description The following postestimation commands are available after pwcompare, post: Command Description estat vce estat (svy) estimates lincom variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses nlcom test testnl Remarks and examples When we use the post option with pwcompare, the marginal linear predictions are posted as estimation results, and we can use postestimation commands to perform further analysis on them. In Pairwise comparisons of means of [R] pwcompare, we fit a regression of wheat yield on types of fertilizers. . use http://www.stata-press.com/data/r13/yield (Artificial wheat yield dataset) . regress yield i.fertilizer (output omitted ) We also used pwcompare with the cimargins option to obtain the marginal mean yield for each fertilizer. We can add the post option to this command to post these marginal means and their VCEs as estimation results. 1730 pwcompare postestimation — Postestimation tools for pwcompare 1731 . pwcompare fertilizer, cimargins post Pairwise comparisons of marginal linear predictions Margins : asbalanced Margin fertilizer 10-10-10 10-08-22 16-04-08 18-24-06 29-03-04 41.36243 44.98515 41.85306 46.28523 40.1241 Std. Err. Unadjusted [95% Conf. Interval] 1.124298 1.124298 1.124298 1.124298 1.124298 39.14509 42.7678 39.63571 44.06789 37.90676 43.57977 47.20249 44.0704 48.50258 42.34145 Now we can use nlcom to compute a percentage improvement in the mean yield for fertilizer 2 when compared with fertilizer 1. . nlcom (pct_chg: 100*(_b[2.fertilizer] - _b[1.fertilizer])/_b[1.fertilizer]) pct_chg: 100*(_b[2.fertilizer] - _b[1.fertilizer])/_b[1.fertilizer] Coef. pct_chg 8.758479 Std. Err. z P>|z| [95% Conf. Interval] 4.015932 2.18 0.029 .8873982 16.62956 The mean yield for fertilizer 2 is about 9% higher than that of fertilizer 1, with a standard error of 4%. 
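The other postestimation commands listed in the table above can be applied to the posted margins in the same way. The lines below are only a sketch that assumes the same posted results as in the example; the coefficient names (1.fertilizer, 2.fertilizer, ...) follow the naming used in the nlcom example.

    * joint Wald test that all five posted marginal means are equal
    test (1.fertilizer = 2.fertilizer) (1.fertilizer = 3.fertilizer) ///
         (1.fertilizer = 4.fertilizer) (1.fertilizer = 5.fertilizer)

    * difference between fertilizers 4 and 5, with its standard error and confidence interval
    lincom 4.fertilizer - 5.fertilizer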
Also see [R] pwcompare — Pairwise comparisons [U] 20 Estimation and postestimation commands Title pwmean — Pairwise comparisons of means Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options Reference Syntax pwmean varname, over(varlist) options Description options Main ∗ over(varlist) mcompare(method) compare means across each combination of the levels in varlist adjust for multiple comparisons; default is mcompare(noadjust) Reporting level(#) cieffects pveffects effects cimeans groups sort display options ∗ confidence level; default is level(95) display a table of mean differences and confidence intervals; the default display a table of mean differences and p-values display a table of mean differences with p-values and confidence intervals display a table of means and confidence intervals display a table of means with codes that group them with other means that are not significantly different sort results tables by displayed mean or difference control column formats, line width, and factor-variable labeling over(varlist) is required. method Description noadjust bonferroni sidak scheffe tukey snk duncan dunnett do not adjust for multiple comparisons; the default Bonferroni’s method Šidák’s method Scheffé’s method Tukey’s method Student–Newman–Keuls’ method Duncan’s method Dunnett’s method Menu Statistics > Summaries, tables, and tests > Summary and descriptive statistics 1732 > Pairwise comparisons of means pwmean — Pairwise comparisons of means 1733 Description pwmean performs pairwise comparisons of means. It computes all pairwise differences of the means of varname over the combination of the levels of the variables in varlist. The tests and confidence intervals for the pairwise comparisons assume equal variances across groups. pwmean also allows for adjusting the confidence intervals and p-values to account for multiple comparisons using Bonferroni’s method, Scheffé’s method, Tukey’s method, Dunnett’s method, and others. See [R] pwcompare for performing pairwise comparisons of means, estimated marginal means, and other types of marginal linear predictions after anova, regress, and most other estimation commands. See [R] margins, pwcompare for performing pairwise comparisons of marginal probabilities and other linear and nonlinear predictions after estimation commands. Options Main over(varlist) is required and specifies that means are computed for each combination of the levels of the variables in varlist. mcompare(method) specifies the method for computing p-values and confidence intervals that account for multiple comparisons. Most methods adjust the comparisonwise error rate, αc , to achieve a prespecified experimentwise error rate, αe . mcompare(noadjust) is the default; it specifies no adjustment. αc = αe mcompare(bonferroni) adjusts the comparisonwise error rate based on the upper limit of the Bonferroni inequality: αe ≤mαc where m is the number of comparisons within the term. The adjusted comparisonwise error rate is αc = αe /m mcompare(sidak) adjusts the comparisonwise error rate based on the upper limit of the probability inequality αe ≤1 − (1 − αc )m where m is the number of comparisons within the term. The adjusted comparisonwise error rate is αc = 1 − (1 − αe )1/m This adjustment is exact when the m comparisons are independent. mcompare(scheffe) controls the experimentwise error rate using the F (or χ2 ) distribution with degrees of freedom equal to k − 1 where k is the number of means being compared. 
mcompare(tukey) uses what is commonly referred to as Tukey’s honestly significant difference. This method uses the Studentized range distribution instead of the t distribution. 1734 pwmean — Pairwise comparisons of means mcompare(snk) is a variation on mcompare(tukey) that counts only the number of means participating in the range for a given comparison instead of the full number of means. mcompare(duncan) is a variation on mcompare(snk) with additional adjustment to the significance probabilities. mcompare(dunnett) uses Dunnett’s method for making comparisons with a reference category. Reporting level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. The significance level used by the groups option is 100 − #, expressed as a percentage. cieffects specifies that a table of the pairwise comparisons of means with their standard errors and confidence intervals be reported. This is the default. pveffects specifies that a table of the pairwise comparisons of means with their standard errors, test statistics, and p-values be reported. effects specifies that a table of the pairwise comparisons of means with their standard errors, test statistics, p-values, and confidence intervals be reported. cimeans specifies that a table of the means with their standard errors and confidence intervals be reported. groups specifies that a table of the means with their standard errors and group codes be reported. Means with the same letter in the group code are not significantly different at the specified significance level. sort specifies that the reported tables be sorted by the mean or difference that is displayed in the table. display options: nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch. nofvlabel displays factor-variable level values rather than attached value labels. This option overrides the fvlabel setting; see [R] set showbaselevels. fvwrap(#) specifies how many lines to allow when long value labels must be wrapped. Labels requiring more than # lines are truncated. This option overrides the fvwrap setting; see [R] set showbaselevels. fvwrapon(style) specifies whether value labels that wrap will break at word boundaries or break based on available space. fvwrapon(word), the default, specifies that value labels break at word boundaries. fvwrapon(width) specifies that value labels break based on available space. This option overrides the fvwrapon setting; see [R] set showbaselevels. cformat(% fmt) specifies how to format means, standard errors, and confidence limits in the table of pairwise comparison of means. pformat(% fmt) specifies how to format p-values in the table of pairwise comparison of means. sformat(% fmt) specifies how to format test statistics in the table of pairwise comparison of means. nolstretch specifies that the width of the table of pairwise comparisons not be automatically widened to accommodate longer variable names. The default, lstretch, is to automatically pwmean — Pairwise comparisons of means 1735 widen the table of pairwise comparisons up to the width of the Results window. To change the default, use set lstretch off. nolstretch is not shown in the dialog box. Remarks and examples pwmean performs pairwise comparisons (differences) of means, assuming a common variance among groups. 
It can easily adjust the p-values and confidence intervals for the differences to account for the elevated type I error rate due to multiple comparisons. Adjustments for multiple comparisons can be made using Bonferroni’s method, Scheffé’s method, Tukey’s method, Dunnett’s method, and others. Remarks are presented under the following headings: Group means Pairwise differences of means Group output Adjusting for multiple comparisons Tukey’s method Dunnett’s method Multiple over() variables Equal variance assumption Group means Suppose we have data on the wheat yield of fields that were each randomly assigned an application of one of five types of fertilizers. Let’s first look at the mean yield for each type of fertilizer. . use http://www.stata-press.com/data/r13/yield (Artificial wheat yield dataset) . pwmean yield, over(fertilizer) cimeans Pairwise comparisons of means with equal variances over : fertilizer yield Mean fertilizer 10-10-10 10-08-22 16-04-08 18-24-06 29-03-04 41.36243 44.98515 41.85306 46.28523 40.1241 Std. Err. Unadjusted [95% Conf. Interval] 1.124298 1.124298 1.124298 1.124298 1.124298 39.14509 42.7678 39.63571 44.06789 37.90676 43.57977 47.20249 44.0704 48.50258 42.34145 1736 pwmean — Pairwise comparisons of means Pairwise differences of means We can compute all pairwise differences in mean wheat yields for the types of fertilizers. . pwmean yield, over(fertilizer) effects Pairwise comparisons of means with equal variances over : fertilizer yield fertilizer 10-08-22 vs 10-10-10 16-04-08 vs 10-10-10 18-24-06 vs 10-10-10 29-03-04 vs 10-10-10 16-04-08 vs 10-08-22 18-24-06 vs 10-08-22 29-03-04 vs 10-08-22 18-24-06 vs 16-04-08 29-03-04 vs 16-04-08 29-03-04 vs 18-24-06 Unadjusted t P>|t| Unadjusted [95% Conf. Interval] Contrast Std. Err. 3.62272 1.589997 2.28 0.024 .4869212 6.758518 .4906299 1.589997 0.31 0.758 -2.645169 3.626428 4.922803 1.589997 3.10 0.002 1.787005 8.058602 -1.238328 1.589997 -0.78 0.437 -4.374127 1.89747 -3.13209 1.589997 -1.97 0.050 -6.267889 .0037086 1.300083 1.589997 0.82 0.415 -1.835715 4.435882 -4.861048 1.589997 -3.06 0.003 -7.996847 -1.725249 4.432173 1.589997 2.79 0.006 1.296375 7.567972 -1.728958 1.589997 -1.09 0.278 -4.864757 1.406841 -6.161132 1.589997 -3.87 0.000 -9.29693 -3.025333 The contrast in the row labeled (10-08-22 vs 10-10-10) is the difference in the mean wheat yield for fertilizer 10-08-22 and fertilizer 10-10-10. At a 5% significance level, we conclude that there is a difference in the means for these two fertilizers. Likewise, the rows labeled (18-24-06 vs 10-10-10), (29-03-04 vs 10-08-22), (18-24-06 vs 16-04-08) and (29-03-04 vs 18-24-06) show differences in these pairs of means. In all, we find that 5 of the 10 mean differences are significantly different from zero at a 5% significance level. pwmean — Pairwise comparisons of means 1737 We can specify the sort option to order the differences from smallest to largest in the table. . pwmean yield, over(fertilizer) effects sort Pairwise comparisons of means with equal variances over : fertilizer yield fertilizer 29-03-04 vs 18-24-06 29-03-04 vs 10-08-22 16-04-08 vs 10-08-22 29-03-04 vs 16-04-08 29-03-04 vs 10-10-10 16-04-08 vs 10-10-10 18-24-06 vs 10-08-22 10-08-22 vs 10-10-10 18-24-06 vs 16-04-08 18-24-06 vs 10-10-10 Unadjusted t P>|t| Unadjusted [95% Conf. Interval] Contrast Std. Err. 
-6.161132 1.589997 -3.87 0.000 -9.29693 -3.025333 -4.861048 1.589997 -3.06 0.003 -7.996847 -1.725249 -3.13209 1.589997 -1.97 0.050 -6.267889 .0037086 -1.728958 1.589997 -1.09 0.278 -4.864757 1.406841 -1.238328 1.589997 -0.78 0.437 -4.374127 1.89747 .4906299 1.589997 0.31 0.758 -2.645169 3.626428 1.300083 1.589997 0.82 0.415 -1.835715 4.435882 3.62272 1.589997 2.28 0.024 .4869212 6.758518 4.432173 1.589997 2.79 0.006 1.296375 7.567972 4.922803 1.589997 3.10 0.002 1.787005 8.058602 Ordering the pairwise differences is particularly convenient when we are comparing means for a large number of groups. Group output We can use the group option to see the mean of each group and a visual representation of the tests for differences. 1738 pwmean — Pairwise comparisons of means . pwmean yield, over(fertilizer) group sort Pairwise comparisons of means with equal variances over : fertilizer yield Mean fertilizer 29-03-04 10-10-10 16-04-08 10-08-22 18-24-06 40.1241 41.36243 41.85306 44.98515 46.28523 Std. Err. 1.124298 1.124298 1.124298 1.124298 1.124298 Unadjusted Groups A A AB BC C Note: Means sharing a letter in the group label are not significantly different at the 5% level. Fertilizers 29-03-04, 10-10-10, and 16-04-08 are all in group A. This means that at our 5% level of significance, we have insufficient information to distinguish their means. Likewise, fertilizers 16-0408 and 10-08-22 are in group B and cannot be distinguished at the 5% level. The same is true for fertilizers 10-08-22 and 18-24-06 in group C. Fertilizer 29-03-04 and fertilizer 10-08-22 have no letters in common, indicating that the mean yields of these two groups are significantly different at the 5% level. We can conclude that any other fertilizers without a letter in common have significantly different means as well. Adjusting for multiple comparisons The statistics in the examples above take no account that we are performing 10 comparisons. With our 5% significance level and assuming the comparisons are independent, we expect 1 in 20 tests of comparisons to be significant, even if all the population means are truly the same. If we are performing many comparisons, then we should account for the fact that some tests will be found significant by chance alone. More formally, the test for each pairwise comparison is made without adjusting for the elevated type I experimentwise error rate that is introduced when performing multiple tests. We can use the mcompare() option to adjust the confidence intervals and p-values for multiple comparisons. Tukey’s method Of the available adjustments for multiple comparisons, Tukey’s honestly significant difference, Student–Newman–Keuls’ method, and Duncan’s method are most often used when performing all pairwise comparisons of means. Of these, Tukey’s method is the most conservative and Duncan’s method is the least conservative. For further discussion of each of the multiple-comparison adjustments, see [R] pwcompare. Here we use Tukey’s adjustment to compute p-values and confidence intervals for the pairwise differences. pwmean — Pairwise comparisons of means 1739 . pwmean yield, over(fertilizer) effects sort mcompare(tukey) Pairwise comparisons of means with equal variances over : fertilizer Number of Comparisons fertilizer yield fertilizer 29-03-04 vs 18-24-06 29-03-04 vs 10-08-22 16-04-08 vs 10-08-22 29-03-04 vs 16-04-08 29-03-04 vs 10-10-10 16-04-08 vs 10-10-10 18-24-06 vs 10-08-22 10-08-22 vs 10-10-10 18-24-06 vs 16-04-08 18-24-06 vs 10-10-10 10 Tukey t P>|t| Tukey [95% Conf. 
Interval] Contrast Std. Err. -6.161132 1.589997 -3.87 0.001 -10.53914 -1.78312 -4.861048 1.589997 -3.06 0.021 -9.239059 -.4830368 -3.13209 1.589997 -1.97 0.285 -7.510101 1.245921 -1.728958 1.589997 -1.09 0.813 -6.106969 2.649053 -1.238328 1.589997 -0.78 0.936 -5.616339 3.139683 .4906299 1.589997 0.31 0.998 -3.887381 4.868641 1.300083 1.589997 0.82 0.925 -3.077928 5.678095 3.62272 1.589997 2.28 0.156 -.7552913 8.000731 4.432173 1.589997 2.79 0.046 .0541623 8.810185 4.922803 1.589997 3.10 0.019 .5447922 9.300815 When using a 5% significance level, Tukey’s adjustment indicates that four pairs of means are different. With the adjustment, we no longer conclude that the difference in the mean yields for fertilizers 10-08-22 and 10-10-10 is significantly different from zero. 1740 pwmean — Pairwise comparisons of means Dunnett’s method Now let’s suppose that fertilizer 10-10-10 actually represents fields on which no fertilizer was applied. In this case, we can use Dunnett’s method for comparing each of the fertilizers to the control. . pwmean yield, over(fertilizer) effects mcompare(dunnett) Pairwise comparisons of means with equal variances over : fertilizer Number of Comparisons fertilizer yield fertilizer 10-08-22 vs 10-10-10 16-04-08 vs 10-10-10 18-24-06 vs 10-10-10 29-03-04 vs 10-10-10 4 Dunnett t P>|t| Dunnett [95% Conf. Interval] Contrast Std. Err. 3.62272 1.589997 2.28 0.079 -.2918331 7.537273 .4906299 1.589997 0.31 0.994 -3.423923 4.405183 4.922803 1.589997 3.10 0.008 1.00825 8.837356 -1.238328 1.589997 -0.78 0.852 -5.152881 2.676225 Using Dunnett’s adjustment, we conclude that only fertilizer 4 (18-24-06) produces a mean yield that is significantly different from the mean yield of the field with no fertilizer applied. By default, pwmean treats the lowest level of the group variable as the control. If, for instance, fertilizer 3 (16-04-08) was our control group, we could type . pwmean yield, over(b3.fertilizer) effects mcompare(dunnett) using the b3. factor-variable operator to specify this level as the reference level. pwmean — Pairwise comparisons of means 1741 Multiple over() variables When we specify more than one variable in the over() option, pairwise comparisons are performed for the means defined by each combination of levels of these variables. . pwmean yield, over(fertilizer irrigation) group Pairwise comparisons of means with equal variances over : fertilizer irrigation yield Mean fertilizer#irrigation 10-10-10#0 10-10-10#1 10-08-22#0 10-08-22#1 16-04-08#0 16-04-08#1 18-24-06#0 18-24-06#1 29-03-04#0 29-03-04#1 36.91257 45.81229 38.79482 51.17547 36.34383 47.36229 41.81757 50.7529 35.69507 44.55313 Std. Err. 1.116571 1.116571 1.116571 1.116571 1.116571 1.116571 1.116571 1.116571 1.116571 1.116571 Unadjusted Groups A B A C E A B CD E A B D Note: Means sharing a letter in the group label are not significantly different at the 5% level. Here the row labeled 10-10-10#0 is the mean for the fields treated with fertilizer 10-10-10 and without irrigation. This mean is significantly different from the mean of all fertilizer/irrigation pairings that do not have an A in the “Unadjusted Groups” column. These include all pairings where the fields were irrigated as well as the fields treated with fertilizer 18-24-06 but without irrigation. Equal variance assumption pwmean performs multiple comparisons assuming that there is a common variance for all groups. In the case of two groups, this is equivalent to performing the familiar two-sample t test when equal variances are assumed. 
1742 pwmean — Pairwise comparisons of means . ttest yield, by(irrigation) Two-sample t test with equal variances Group Obs Mean 0 1 100 100 combined 200 diff Std. Err. Std. Dev. [95% Conf. Interval] 37.91277 47.93122 .5300607 .5630353 5.300607 5.630353 36.86102 46.81403 38.96453 49.0484 42.92199 .5242462 7.413961 41.8882 43.95579 -10.01844 .7732872 -11.54338 -8.493509 diff = mean(0) - mean(1) Ho: diff = 0 t = -12.9557 degrees of freedom = 198 Ha: diff < 0 Ha: diff != 0 Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 . pwmean yield, over(irrigation) effects Pairwise comparisons of means with equal variances over : irrigation yield Contrast Std. Err. irrigation 1 vs 0 10.01844 .7732872 Unadjusted t P>|t| 12.96 0.000 Ha: diff > 0 Pr(T > t) = 1.0000 Unadjusted [95% Conf. Interval] 8.493509 11.54338 The signs for the difference, the test statistic, and the confidence intervals are reversed because the difference is taken in the opposite direction. The p-value from pwmean is equivalent to the one for the two-sided test in the ttest output. pwmean extends the capabilities of ttest to allow for simultaneously comparing all pairs of means and to allow for using one common variance estimate for all the tests instead of computing a separate pooled variance for each pair of means when using multiple ttest commands. In addition, pwmean allows adjustments for multiple comparisons, many of which rely on an assumption of equal variances among groups. pwmean — Pairwise comparisons of means 1743 Stored results pwmean stores the following in e(): Scalars e(df r) e(balanced) Macros e(cmd) e(cmdline) e(title) e(depvar) e(over) e(properties) Matrices e(b) e(V) e(error) e(b vs) e(V vs) e(error vs) e(k groups) variance degrees of freedom 1 if fully balanced data; 0 otherwise pwmean command as typed title in output name of variable from which the means are computed varlist from over() b V mean estimates variance–covariance matrix of the mean estimates mean estimability codes; 0 means estimable, 8 means not estimable mean difference estimates variance–covariance matrix of the mean difference estimates mean difference estimability codes; 0 means estimable, 8 means not estimable number of significance groups for each term Methods and formulas pwmean is a convenience command that uses pwcompare after fitting a fully factorial linear model. See Methods and formulas described in [R] pwcompare. Reference Searle, S. R. 1997. Linear Models for Unbalanced Data. New York: Wiley. 
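As noted under Methods and formulas, pwmean is a convenience command that fits a fully factorial linear model and then calls pwcompare, so its results can be reproduced in two steps. A minimal sketch with the wheat-yield data used throughout this entry:

    * one step with pwmean
    pwmean yield, over(fertilizer irrigation) sort group mcompare(tukey)

    * the same comparisons from an explicit model followed by pwcompare
    regress yield fertilizer##irrigation
    pwcompare fertilizer#irrigation, sort groups mcompare(tukey)

Fitting the model explicitly is what makes the additional capabilities of pwcompare, such as comparisons of slopes and use after nonlinear models, available; see [R] pwcompare.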
Also see [R] pwmean postestimation — Postestimation tools for pwmean [R] contrast — Contrasts and linear hypothesis tests after estimation [R] margins — Marginal means, predictive margins, and marginal effects [R] margins, pwcompare — Pairwise comparisons of margins [R] pwcompare — Pairwise comparisons [R] ttest — t tests (mean-comparison tests) [U] 20 Estimation and postestimation commands Title pwmean postestimation — Postestimation tools for pwmean Description Remarks and examples Also see Description The following postestimation commands are available after pwmean: Command Description estat vce estimates lincom variance–covariance matrix of the estimators (VCE) cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses nlcom test testnl Remarks and examples In Pairwise differences of means of [R] pwmean, we computed all pairwise differences in mean wheat yields for five fertilizers. . use http://www.stata-press.com/data/r13/yield (Artificial wheat yield dataset) . pwmean yield, over(fertilizer) Pairwise comparisons of means with equal variances over 10-08-22 16-04-08 18-24-06 29-03-04 16-04-08 18-24-06 29-03-04 18-24-06 29-03-04 29-03-04 : fertilizer yield Contrast Std. Err. fertilizer vs 10-10-10 vs 10-10-10 vs 10-10-10 vs 10-10-10 vs 10-08-22 vs 10-08-22 vs 10-08-22 vs 16-04-08 vs 16-04-08 vs 18-24-06 3.62272 .4906299 4.922803 -1.238328 -3.13209 1.300083 -4.861048 4.432173 -1.728958 -6.161132 1.589997 1.589997 1.589997 1.589997 1.589997 1.589997 1.589997 1.589997 1.589997 1.589997 Unadjusted [95% Conf. Interval] .4869212 -2.645169 1.787005 -4.374127 -6.267889 -1.835715 -7.996847 1.296375 -4.864757 -9.29693 6.758518 3.626428 8.058602 1.89747 .0037086 4.435882 -1.725249 7.567972 1.406841 -3.025333 After pwmean, we can use testnl to test whether the improvement in mean wheat yield when using fertilizer 18-24-06 instead of fertilizer 29-03-04 is significantly different from 10%. 1744 pwmean postestimation — Postestimation tools for pwmean . testnl (_b[4.fertilizer] - _b[5.fertilizer])/_b[5.fertilizer] = 0.1 (1) (_b[4.fertilizer] - _b[5.fertilizer])/_b[5.fertilizer] = 0.1 chi2(1) = 1.57 Prob > chi2 = 0.2106 The improvement is not significantly different from 10%. Also see [R] pwmean — Pairwise comparisons of means [U] 20 Estimation and postestimation commands 1745 Title qc — Quality control charts Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax Draw a c chart cchart defect var unit var , cchart options Draw a p (fraction-defective) chart pchart reject var unit var ssize var , pchart options Draw an R (range or dispersion) chart rchart varlist if in , rchart options Draw an X (control line) chart xchart varlist if in , xchart options Draw vertically aligned X and R charts shewhart varlist if in , shewhart options cchart options Description Main nograph suppress graph Plot connect options marker options marker label options affect rendition of the plotted points change look of markers (color, size, etc.) 
add marker labels; change look or position Control limits clopts(cline options) affect rendition of the control limits Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options 1746 qc — Quality control charts pchart options Description Main stabilized stabilize the p chart when sample sizes are unequal suppress graph nograph generate(newvarf newvarlcl newvarucl ) store the fractions of defective elements and the lower and upper control limits Plot connect options marker options marker label options affect rendition of the plotted points change look of markers (color, size, etc.) add marker labels; change look or position Control limits clopts(cline options) affect rendition of the control limits Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options rchart options Description Main std(#) nograph user-specified standard deviation suppress graph Plot connect options marker options marker label options affect rendition of the plotted points change look of markers (color, size, etc.) add marker labels; change look or position Control limits clopts(cline options) affect rendition of the control limits Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options 1747 1748 qc — Quality control charts Description xchart options Main user-specified standard deviation user-specified mean lower and upper limits of the X-bar limits suppress graph std(#) mean(#) lower(#) upper(#) nograph Plot affect rendition of the plotted points change look of markers (color, size, etc.) add marker labels; change look or position connect options marker options marker label options Control limits affect rendition of the control limits clopts(cline options) Add plots add other plots to the generated graph addplot(plot) Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options shewhart options Description Main user-specified standard deviation user-specified mean suppress graph std(#) mean(#) nograph Plot affect rendition of the plotted points change look of markers (color, size, etc.) add marker labels; change look or position connect options marker options marker label options Control limits affect rendition of the control limits clopts(cline options) Y axis, X axis, Titles, Legend, Overall any options documented in [G-2] graph combine combine options Menu cchart Statistics > Other > Quality control > C chart > Other > Quality control > P chart > Other > Quality control > R chart pchart Statistics rchart Statistics qc — Quality control charts 1749 xchart Statistics > Other > Quality control > X-bar chart Other > Quality control > Vertically aligned X-bar and R chart shewhart Statistics > Description These commands provide standard quality-control charts. cchart draws a c chart; pchart, a p (fraction-defective) chart; rchart, an R (range or dispersion) chart; xchart, an X (control line) chart; and shewhart, vertically aligned X and R charts. Options Main stabilized stabilizes the p chart when sample sizes are unequal. std(#) specifies the standard deviation of the process. The R chart is calculated (based on the range) if this option is not specified. 
mean(#) specifies the grand mean, which is calculated if not specified. lower(#) and upper(#) must be specified together or not at all. They specify the lower and upper limits of the X chart. Calculations based on the mean and standard deviation (whether specified by option or calculated) are used otherwise. nograph suppresses the graph. generate(newvarf newvarlcl newvarucl ) stores the plotted values in the p chart. newvarf will contain the fractions of defective elements; newvarlcl and newvarucl will contain the lower and upper control limits, respectively. Plot connect options affect whether lines connect the plotted points and the rendition of those lines; see [G-3] connect options. marker options affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G-3] marker options. marker label options specify if and how the markers are to be labeled; see [G-3] marker label options. Control limits clopts(cline options) affects the rendition of the control limits; see [G-3] cline options. Add plots addplot(plot) provides a way to add other plots to the generated graph. See [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). 1750 qc — Quality control charts combine options (shewhart only) are any of the options documented in [G-2] graph combine. These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Remarks and examples Control charts may be used to define the goal of a repetitive process, to control that process, and to determine if the goal has been achieved. Walter A. Shewhart of Bell Telephone Laboratories devised the first control chart in 1924. In 1931, Shewhart published Economic Control of Quality of Manufactured Product. According to Burr, “Few fields of knowledge have ever been so completely explored and charted in the first exposition” (1976, 29). Shewhart states that “a phenomenon will be said to be controlled when, through the use of past experience, we can predict, at least within limits, how the phenomenon may be expected to vary in the future. Here it is understood that prediction within limits means that we can state, at least approximately, the probability that the observed phenomenon will fall within given limits” (1931, 6). For more information on quality-control charts, see Burr (1976), Duncan (1986), Harris (1999), or Ryan (2000). Example 1: cchart cchart graphs a c chart showing the number of nonconformities in a unit, where defect var records the number of defects in each inspection unit and unit var records the unit number. The unit numbers need not be in order. For instance, consider the following example dataset from Ryan (2000, 156): . use http://www.stata-press.com/data/r13/ncu . describe Contains data from http://www.stata-press.com/data/r13/ncu.dta obs: 30 vars: 2 31 Mar 2013 03:56 size: 240 variable name day defects storage type float float Sorted by: . list in 1/5 1. 2. 3. 4. 5. day defects 1 2 3 4 5 7 5 11 13 9 display format %9.0g %9.0g value label variable label Days in April Numbers of Nonconforming Units qc — Quality control charts 1751 . 
cchart defects day, title(c Chart for Nonconforming Transistors) 0 10 20 30 .8327076 0 10.6 Numbers of Nonconforming Units 5 10 15 20 20.36729 c Chart for Nonconforming Transistors Days in April 0 units are out of control The expected number of defects is 10.6, with lower and upper control limits of 0.8327 and 20.37, respectively. No units are out of control. Example 2: pchart pchart graphs a p chart, which shows the fraction of nonconforming items in a subgroup, where reject var records the number rejected in each inspection unit, unit var records the inspection unit number, and ssize var records the number inspected in each unit. Consider the example dataset from Ryan (2000, 156) of the number of nonconforming transistors out of 1,000 inspected each day during the month of April: . use http://www.stata-press.com/data/r13/ncu2 . describe Contains data from http://www.stata-press.com/data/r13/ncu2.dta obs: 30 vars: 3 31 Mar 2013 14:13 size: 360 variable name day rejects ssize Sorted by: storage type float float float display format %9.0g %9.0g %9.0g value label variable label Days in April Numbers of Nonconforming Units Sample size 1752 qc — Quality control charts . list in 1/5 1. 2. 3. 4. 5. day rejects ssize 1 2 3 4 5 7 5 11 13 9 1000 1000 1000 1000 1000 0 .0008846 .005 .0106 Fraction defective .01 .015 .02 .0203154 . pchart rejects day ssize 0 10 20 30 Days in April 0 units are out of control All the points are within the control limits, which are 0.0009 for the lower limit and 0.0203 for the upper limit. Here the sample sizes are fixed at 1,000, so the ssize variable contains 1,000 for each observation. Sample sizes need not be fixed, however. Say that our data were slightly different: . use http://www.stata-press.com/data/r13/ncu3 . list in 1/5 1. 2. 3. 4. 5. day rejects ssize 1 2 3 4 5 7 5 11 13 9 920 920 920 950 950 qc — Quality control charts 1753 0 .005 .0119445 Fraction defective .01 .015 .02 .025 . pchart rejects day ssize 0 10 20 30 Days in April 0 units are out of control Here the control limits are, like the sample size, no longer constant. The stabilize option will stabilize the control chart: −4 −3 0 Fraction defective (Standard Deviation units) −2 0 2 3 4 . pchart rejects day ssize, stabilize 0 10 20 Days in April 0 units are out of control Stabilized p Chart, average number of defects = .0119 30 1754 qc — Quality control charts Example 3: rchart rchart displays an R chart showing the range for repeated measurements at various times. Variables within observations record measurements. Observations represent different samples. For instance, say that we take five samples of 5 observations each. In our first sample, our measurements are 10, 11, 10, 11, and 12. The data are . list 1. 2. 3. 4. 5. m1 m2 m3 m4 m5 10 12 10 9 12 11 10 11 9 12 10 9 10 9 12 11 10 12 10 12 12 9 10 11 13 0 0 1 2 Range 2 3 4 4.23 . rchart m1-m5, connect(l) 1 2 3 Sample 4 5 0 units are out of control The expected range in each sample is 2 with lower and upper control limits of 0 and 4.23, respectively. If we know that the process standard deviation is 0.3, we could specify qc — Quality control charts 1755 1 2 3 Sample 4 5 −2.121525 −2 −1 0 Range .6978 1 2 3 2.721525 . rchart m1-m5, connect(l) std(.3) 1 unit is out of control Example 4: xchart xchart graphs an X chart for repeated measurements at various times. Variables within observations record measurements, and observations represent different samples. 
Using the same data as in the previous example, we type 1 2 3 Sample 4 5 9.486 9.5 10 10.64 Average 10.5 11 11.5 11.794 12 . xchart m1-m5, connect(l) 1 unit is out of control The average measurement in the sample is 10.64, and the lower and upper control limits are 9.486 and 11.794, respectively. Suppose that we knew from prior information that the mean of the process is 11. Then we would type 1756 qc — Quality control charts 9.5 9.846 10 11 Average 10.5 11 11.5 12 12.154 . xchart m1-m5, connect(l) mean(11) 1 2 3 Sample 4 5 2 units are out of control If we also know that the standard deviation of the process is 0.3, we could type 9.5 10 Average 10.5 11 11.5 10.59751 11 11.40249 12 . xchart m1-m5, connect(l) mean(11) std(.3) 1 2 3 Sample 3 units are out of control Finally, xchart allows us to specify our own control limits: 4 5 qc — Quality control charts 1757 12 10 9.5 10 11 Average 10.5 11 11.5 12 . xchart m1-m5, connect(l) mean(11) lower(10) upper(12) 1 2 3 Sample 4 5 2 units are out of control Walter Andrew Shewhart (1891–1967) was born in Illinois and educated as a physicist, with degrees from the Universities of Illinois and California. After a brief period teaching physics, he worked for the Western Electric Company and (from 1925) the Bell Telephone Laboratories. His name is most associated with control charts used in quality controls, but his many other interests ranged generally from quality assurance to the philosophy of science. Example 5: shewhart shewhart displays a vertically aligned X and R chart in the same image. To produce the bestlooking combined image possible, you will want to use the xchart and rchart commands separately and then combine the graphs. shewhart, however, is more convenient. Using the same data as previously, but realizing that the standard deviation should have been 0.4, we type 1758 qc — Quality control charts 10.463341111.53666 Average 9.5 10 10.5 11 11.5 12 . shewhart m1-m5, connect(l) mean(11) std(.4) 1 2 3 Sample 4 5 2 3 Sample 4 5 −4 −2 −2.8287 Range 0 2 4 .9304 3.6287 3 units are out of control 1 0 units are out of control Stored results cchart stores the following in r(): Scalars r(cbar) r(lcl c) r(ucl c) r(N) r(out c) r(below c) r(above c) expected number of nonconformities lower control limit upper control limit number of observations number of units out of control number of units below the lower limit number of units above the upper limit pchart stores the following in r(): Scalars r(pbar) r(lcl p) r(ucl p) r(N) r(out p) r(below p) r(above p) average fraction of nonconformities lower control limit upper control limit number of observations number of units out of control number of units below the lower limit number of units above the upper limit rchart stores the following in r(): Scalars r(central line) r(lcl r) r(ucl r) r(N) r(out r) r(below r) r(above r) ordinate of the central line lower control limit upper control limit number of observations number of units out of control number of units below the lower limit number of units above the upper limit qc — Quality control charts 1759 xchart stores the following in r(): Scalars r(xbar) r(lcl x) r(ucl x) r(N) r(out x) r(below x) r(above x) grand mean lower control limit upper control limit number of observations number of units out of control number of units below the lower limit number of units above the upper limit shewhart stores in r() the combination of stored results from xchart and rchart. 
Methods and formulas

For the c chart, the number of defects per unit, $C$, is taken to be a value of a random variable having a Poisson distribution. If $k$ is the number of units available for estimating $\lambda$, the parameter of the Poisson distribution, and if $C_i$ is the number of defects in the $i$th unit, then $\lambda$ is estimated by $\bar{C} = \sum_i C_i / k$. Then

$$\text{central line} = \bar{C} \qquad \text{UCL} = \bar{C} + 3\sqrt{\bar{C}} \qquad \text{LCL} = \bar{C} - 3\sqrt{\bar{C}}$$

Control limits for the p chart are based on the sampling theory for proportions, using the normal approximation to the binomial. If $k$ samples are taken, the estimator of $p$ is given by $\bar{p} = \sum_i \hat{p}_i / k$, where $\hat{p}_i = x_i/n_i$, and $x_i$ is the number of defects in the $i$th sample of size $n_i$. The central line and the control limits are given by

$$\text{central line} = \bar{p} \qquad \text{UCL} = \bar{p} + 3\sqrt{\bar{p}(1-\bar{p})/n_i} \qquad \text{LCL} = \bar{p} - 3\sqrt{\bar{p}(1-\bar{p})/n_i}$$

Control limits for the R chart are based on the distribution of the range of samples of size $n$ from a normal population. If the standard deviation of the process, $\sigma$, is known,

$$\text{central line} = d_2\sigma \qquad \text{UCL} = D_2\sigma \qquad \text{LCL} = D_1\sigma$$

where $d_2$, $D_1$, and $D_2$ are functions of the number of observations in the sample and are obtained from the table published in Beyer (1976).

When $\sigma$ is unknown,

$$\text{central line} = \bar{R} \qquad \text{UCL} = (D_2/d_2)\bar{R} \qquad \text{LCL} = (D_1/d_2)\bar{R}$$

where $\bar{R} = \sum_i R_i / k$ is the average of the $k$ sample ranges $R_i$.

Control limits for the X chart are given by

$$\text{central line} = \bar{x} \qquad \text{UCL} = \bar{x} + (3/\sqrt{n})\sigma \qquad \text{LCL} = \bar{x} - (3/\sqrt{n})\sigma$$

if $\sigma$ is known. If $\sigma$ is unknown,

$$\text{central line} = \bar{x} \qquad \text{UCL} = \bar{x} + A_2\bar{R} \qquad \text{LCL} = \bar{x} - A_2\bar{R}$$

where $\bar{R}$ is the average range as defined above and $A_2$ is a function (op. cit.) of the number of observations in the sample.

References

Bayart, D. 2001. Walter Andrew Shewhart. In Statisticians of the Centuries, ed. C. C. Heyde and E. Seneta, 398–401. New York: Springer.
Beyer, W. H. 1976. Factors for computing control limits. In Vol. 2 of Handbook of Tables for Probability and Statistics, ed. W. H. Beyer, 451–465. Cleveland, OH: The Chemical Rubber Company.
Burr, I. W. 1976. Statistical Quality Control Methods. New York: Dekker.
Caulcutt, R. 2004. Control charts in practice. Significance 1: 81–84.
Duncan, A. J. 1986. Quality Control and Industrial Statistics. 5th ed. Homewood, IL: Irwin.
Harris, R. L. 1999. Information Graphics: A Comprehensive Illustrated Reference. New York: Oxford University Press.
Ryan, T. P. 2000. Statistical Methods for Quality Improvement. 2nd ed. New York: Wiley.
Saw, S. L. C., and T. W. Soon. 1994. sqc1: Estimating process capability indices with Stata. Stata Technical Bulletin 17: 18–19. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 174–175. College Station, TX: Stata Press.
Shewhart, W. A. 1931. Economic Control of Quality of Manufactured Product. New York: Van Nostrand.
Also see [R] serrbar — Graph standard error bar chart Title qreg — Quantile regression Syntax Options for iqreg Stored results Menu Options for sqreg Methods and formulas Description Options for bsqreg References Options for qreg Remarks and examples Also see Syntax Quantile regression qreg depvar indepvars if in weight , qreg options Interquantile range regression iqreg depvar indepvars if in , iqreg options Simultaneous-quantile regression sqreg depvar indepvars if in , sqreg options Bootstrapped quantile regression bsqreg depvar indepvars if in , bsqreg options qreg options Description Model quantile(#) estimate # quantile; default is quantile(.5) SE/Robust vce( vcetype , vceopts ) technique used to estimate standard errors Reporting level(#) display options set confidence level; default is level(95) control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Optimization optimization options wlsiter(#) control the optimization process; seldom used attempt # weighted least-squares iterations before doing linear programming iterations vcetype Description iid robust compute the VCE assuming the residuals are i.i.d. compute the robust VCE 1761 1762 qreg — Quantile regression vceopts Description denmethod bwidth nonparametric density estimation technique bandwidth method used by the density estimator denmethod Description fitted residual kernel (kernel) use the empirical quantile function using fitted values; the default use the empirical residual quantile function use a nonparametric kernel density estimator; default is epanechnikov bwidth Description hsheather bofinger chamberlain Hall–Sheather’s bandwidth; the default Bofinger’s bandwidth Chamberlain’s bandwidth kernel Description epanechnikov epan2 biweight cosine gaussian parzen rectangle triangle Epanechnikov kernel function; the default alternative Epanechnikov kernel function biweight kernel function cosine trace kernel function Gaussian kernel function Parzen kernel function rectangle kernel function triangle kernel function iqreg options Description Model quantiles(# #) reps(#) interquantile range; default is quantiles(.25 .75) perform # bootstrap replications; default is reps(20) Reporting level(#) nodots display options set confidence level; default is level(95) suppress display of the replication dots control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling qreg — Quantile regression sqreg options 1763 Description Model quantiles(# # # . . . ) estimate # quantiles; default is quantiles(.5) perform # bootstrap replications; default is reps(20) reps(#) Reporting level(#) nodots display options set confidence level; default is level(95) suppress display of the replication dots control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling bsqreg options Description Model quantile(#) reps(#) estimate # quantile; default is quantile(.5) perform # bootstrap replications; default is reps(20) Reporting level(#) display options set confidence level; default is level(95) control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling indepvars may contain factor variables; see [U] 11.4.3 Factor variables. 
by, mi estimate, rolling, and statsby, are allowed by qreg, iqreg, sqreg, and bsqreg; mfp, nestreg, and stepwise are allowed only with qreg; see [U] 11.1.10 Prefix commands. qreg allows fweights, iweights, and pweights; see [U] 11.1.6 weight. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu qreg Statistics > Nonparametric analysis > Quantile regression > Nonparametric analysis > Interquantile regression > Nonparametric analysis > Simultaneous-quantile regression > Nonparametric analysis > Bootstrapped quantile regression iqreg Statistics sqreg Statistics bsqreg Statistics Description qreg fits quantile (including median) regression models, also known as least–absolute-value models (LAV or MAD) and minimum L1-norm models. The quantile regression models fit by qreg express the quantiles of the conditional distribution as linear functions of the independent variables. 1764 qreg — Quantile regression iqreg estimates interquantile range regressions, regressions of the difference in quantiles. The estimated variance–covariance matrix of the estimators (VCE) is obtained via bootstrapping. sqreg estimates simultaneous-quantile regression. It produces the same coefficients as qreg for each quantile. Reported standard errors will be similar, but sqreg obtains an estimate of the VCE via bootstrapping, and the VCE includes between-quantile blocks. Thus you can test and construct confidence intervals comparing coefficients describing different quantiles. bsqreg is equivalent to sqreg with one quantile. Options for qreg Model quantile(#) specifies the quantile to be estimated and should be a number between 0 and 1, exclusive. Numbers larger than 1 are interpreted as percentages. The default value of 0.5 corresponds to the median. SE/Robust vce( vcetype , vceopts ) specifies the type of VCE to compute and the density estimation method to use in computing the VCE. vcetype specifies the type of VCE to compute. Available types are iid and robust. vce(iid), the default, computes the VCE under the assumption that the residuals are independent and identically distributed (i.i.d.). vce(robust) computes the robust VCE under the assumption that the residual density is continuous and bounded away from 0 and infinity at the specified quantile(); see Koenker (2005, sec. 4.2). vceopts consists of available denmethod and bwidth options. denmethod specifies the method to use for the density estimator. Available nonparametric methods are fitted, residual, or kernel (kernel) , where the optional kernel must be one of the kernel choices listed below. fitted and residual specify that the nonparametric density estimator use some of the structure imposed by quantile regression. The default fitted uses a function of the fitted values and residual uses a function of the residuals. vce(robust, residual) is not allowed. kernel() specifies that the nonparametric density estimator use a kernel method. The available kernel functions are epanechnikov, epan2, biweight, cosine, gaussian, parzen, rectangle, and triangle. The default is epanechnikov. See [R] kdensity for the kernel function forms. bwidth specifies the bandwidth method to use by the nonparametric density estimator. Available methods are hsheather for the Hall–Sheather bandwidth, bofinger for the Bofinger bandwidth, and chamberlain for the Chamberlain bandwidth. See Koenker (2005, sec. 3.4 and 4.10) for a description of the sparsity estimation techniques and the Hall–Sheather and Bofinger bandwidth formulas. 
See Chamberlain (1994, eq. 2.2) for the Chamberlain bandwidth. Reporting level(#); see [R] estimation options. qreg — Quantile regression 1765 display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Optimization optimization options: iterate(#), no log, trace. iterate() specifies the maximum number of iterations; log/nolog specifies whether to show the iteration log; and trace specifies that the iteration log should include the current parameter vector. These options are seldom used. wlsiter(#) specifies the number of weighted least-squares iterations that will be attempted before the linear programming iterations are started. The default value is 1. If there are convergence problems, increasing this number should help. Options for iqreg Model quantiles(# #) specifies the quantiles to be compared. The first number must be less than the second, and both should be between 0 and 1, exclusive. Numbers larger than 1 are interpreted as percentages. Not specifying this option is equivalent to specifying quantiles(.25 .75), meaning the interquantile range. reps(#) specifies the number of bootstrap replications to be used to obtain an estimate of the variance–covariance matrix of the estimators (standard errors). reps(20) is the default and is arguably too small. reps(100) would perform 100 bootstrap replications. reps(1000) would perform 1,000 replications. Reporting level(#); see [R] estimation options. nodots suppresses display of the replication dots. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Options for sqreg Model quantiles(# # # . . . ) specifies the quantiles to be estimated and should contain numbers between 0 and 1, exclusive. Numbers larger than 1 are interpreted as percentages. The default value of 0.5 corresponds to the median. reps(#) specifies the number of bootstrap replications to be used to obtain an estimate of the variance–covariance matrix of the estimators (standard errors). reps(20) is the default and is arguably too small. reps(100) would perform 100 bootstrap replications. reps(1000) would perform 1,000 replications. Reporting level(#); see [R] estimation options. nodots suppresses display of the replication dots. 1766 qreg — Quantile regression display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Options for bsqreg Model quantile(#) specifies the quantile to be estimated and should be a number between 0 and 1, exclusive. Numbers larger than 1 are interpreted as percentages. The default value of 0.5 corresponds to the median. reps(#) specifies the number of bootstrap replications to be used to obtain an estimate of the variance–covariance matrix of the estimators (standard errors). reps(20) is the default and is arguably too small. reps(100) would perform 100 bootstrap replications. reps(1000) would perform 1,000 replications. Reporting level(#); see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. 
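As a schematic recap of the options above (a sketch, not a worked example), the following lines show some typical combinations; y and x are placeholder variable names standing in for an actual dependent and independent variable.

. qreg y x, quantile(.5) vce(iid, kernel(epan2) chamberlain)   // kernel density estimate, Chamberlain bandwidth
. qreg y x, quantile(.25) vce(robust)                          // Huber sandwich VCE at the first quartile
. iqreg y x, quantiles(.25 .75) reps(100)                      // interquantile range with 100 bootstrap replications
. sqreg y x, quantiles(.25 .5 .75) reps(100)                   // three quantiles fit simultaneously
. bsqreg y x, quantile(.5) reps(100)                           // single quantile with a bootstrapped VCE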
Remarks and examples Remarks are presented under the following headings: Median regression Quantile regression Estimated standard errors Interquantile and simultaneous-quantile regression What are the parameters? Median regression qreg fits quantile regression models. The default form is median regression, where the objective is to estimate the median of the dependent variable, conditional on the values of the independent variables. This method is similar to ordinary regression, where the objective is to estimate the conditional mean of the dependent variable. Simply put, median regression finds a line through the data that minimizes the sum of the absolute residuals rather than the sum of the squares of the residuals, as in ordinary regression. Equivalently, median regression expresses the median of the conditional distribution of the dependent variable as a linear function of the conditioning (independent) variables. Cameron and Trivedi (2010, chap. 7) provide a nice introduction to quantile regression using Stata. qreg — Quantile regression 1767 Example 1: Estimating the conditional median Consider a two-group experimental design with 5 observations per group: . use http://www.stata-press.com/data/r13/twogrp . list x y 1. 2. 3. 4. 5. 0 0 0 0 0 0 1 3 4 95 6. 7. 8. 9. 10. 1 1 1 1 1 14 19 20 22 23 . qreg y x Iteration 1: WLS sum of weighted deviations = Iteration 1: sum of abs. weighted deviations = Iteration 2: sum of abs. weighted deviations = Median regression Raw sum of deviations Min sum of deviations 60.941342 55.5 55 Number of obs = 78.5 (about 14) 55 y Coef. x _cons 17 3 Pseudo R2 Std. Err. t P>|t| 18.23213 12.89207 0.93 0.23 0.378 0.822 = 10 0.2994 [95% Conf. Interval] -25.04338 -26.72916 59.04338 32.72916 We have estimated the equation ymedian = 3 + 17 x We look back at our data. x takes on the values 0 and 1, so the median for the x = 0 group is 3, whereas for x = 1 it is 3 + 17 = 20. The output reports that the raw sum of absolute deviations about 14 is 78.5; that is, the sum of |y − 14| is 78.5. Fourteen is the unconditional median of y, although in these data, any value between 14 and 19 could also be considered an unconditional median (we have an even number of observations, so the median is bracketed by those two values). In any case, the raw sum of deviations of y about the median would be the same no matter what number we choose between 14 and 19. (With a “median” of 14, the raw sum of deviations is 78.5. Now think of choosing a slightly larger number for the median and recalculating the sum. Half the observations will have larger negative residuals, but the other half will have smaller positive residuals, resulting in no net change.) We turn now to the actual estimated equation. The sum of the absolute deviations about the solution ymedian = 3 + 17x is 55. The pseudo-R2 is calculated as 1 − 55/78.5 ≈ 0.2994. This result is based on the idea that the median regression is the maximum likelihood estimate for the double-exponential distribution. 1768 qreg — Quantile regression Technical note qreg is an alternative to regular regression or robust regression — see [R] regress and [R] rreg. Let’s compare the results: . regress y x Source SS df MS Model Residual 2.5 6978.4 1 8 2.5 872.3 Total 6980.9 9 775.655556 y Coef. Std. Err. x _cons -1 20.6 18.6794 13.20833 Number of obs F( 1, 8) Prob > F R-squared Adj R-squared Root MSE t P>|t| -0.05 1.56 0.959 0.157 = 10 = 0.00 = 0.9586 = 0.0004 = -0.1246 = 29.535 [95% Conf. 
Interval] -44.07477 -9.858465 42.07477 51.05847 Unlike qreg, regress fits ordinary linear regression and is concerned with predicting the mean rather than the median, so both results are, in a technical sense, correct. Putting aside those technicalities, however, we tend to use either regression to describe the central tendency of the data, of which the mean is one measure and the median another. Thus we can ask, “which method better describes the central tendency of these data?” Means—and therefore ordinary linear regression—are sensitive to outliers, and our data were purposely designed to contain two such outliers: 95 for x = 0 and 14 for x = 1. These two outliers dominated the ordinary regression and produced results that do not reflect the central tendency well — you are invited to enter the data and graph y against x. Robust regression attempts to correct the outlier-sensitivity deficiency in ordinary regression: . rreg y x, genwt(wt) Huber iteration 1: Huber iteration 2: Huber iteration 3: Biweight iteration 4: Biweight iteration 5: Biweight iteration 6: Biweight iteration 7: Biweight iteration 8: Robust regression maximum maximum maximum maximum maximum maximum maximum maximum y Coef. x _cons 18.16597 2.000003 difference difference difference difference difference difference difference difference in in in in in in in in weights weights weights weights weights weights weights weights = = = = = = = = Std. Err. t P>|t| 2.023114 1.430558 8.98 1.40 0.000 0.200 .7311828 .17695779 .03149585 .1979335 .23332905 .09960067 .02691458 .0009113 Number of obs = F( 1, 8) = Prob > F = 10 80.63 0.0000 [95% Conf. Interval] 13.50066 -1.298869 22.83128 5.298875 Here rreg discarded the first outlier completely. (We know this because we included the genwt() option on rreg and, after fitting the robust regression, examined the weights.) For the other “outlier”, rreg produced a weight of 0.47. In any case, the answers produced by qreg and rreg to describe the central tendency are similar, but the standard errors are different. In general, robust regression will have smaller standard errors because it is not as sensitive to the exact placement of observations near the median. You are welcome to try removing the first outlier in the qreg estimation to observe an improvement in the standard errors by typing qreg — Quantile regression 1769 . qreg y x if _n!=5 Also, some authors (Rousseeuw and Leroy 1987, 11) have noted that quantile regression, unlike the unconditional median, may be sensitive to even one outlier if its leverage is high enough. Rousseeuw and Leroy (1987) discuss estimators that are more robust to perturbations to the data than either mean regression or quantile regression. In the end, quantile regression may be more useful for the interpretation of the parameters that it estimates than for its robustness to perturbations to the data. Example 2: Median regression Let’s now consider a less artificial example using the automobile data described in [U] 1.2.2 Example datasets. Using median regression, we will regress each car’s price on its weight and length and whether it is of foreign manufacture: . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . qreg price weight length foreign Iteration 1: WLS sum of weighted deviations = 56397.829 Iteration 1: sum of abs. weighted deviations = 55950.5 Iteration 2: sum of abs. weighted deviations = 55264.718 Iteration 3: sum of abs. weighted deviations = 54762.283 Iteration 4: sum of abs. 
weighted deviations = 54734.152 Iteration 5: sum of abs. weighted deviations = 54552.638 note: alternate solutions exist Iteration 6: sum of abs. weighted deviations = 54465.511 Iteration 7: sum of abs. weighted deviations = 54443.699 Iteration 8: sum of abs. weighted deviations = 54411.294 Median regression Raw sum of deviations 71102.5 (about 4934) Min sum of deviations 54411.29 price Coef. weight length foreign _cons 3.933588 -41.25191 3377.771 344.6489 Std. Err. 1.328718 45.46469 885.4198 5182.394 t 2.96 -0.91 3.81 0.07 Number of obs = Pseudo R2 P>|t| 0.004 0.367 0.000 0.947 = 74 0.2347 [95% Conf. Interval] 1.283543 -131.9284 1611.857 -9991.31 6.583632 49.42456 5143.685 10680.61 The estimated equation is pricemedian = 3.93 weight − 41.25 length + 3377.8 foreign + 344.65 The output may be interpreted in the same way as linear regression output; see [R] regress. The variables weight and foreign are significant, but length is not significant. The median price of the cars in these data is $4,934. This value is a median (one of the two center observations), not the median, which would typically be defined as the midpoint of the two center observations. 1770 qreg — Quantile regression Quantile regression Quantile regression is similar to median regression in that it estimates an equation expressing a quantile of the conditional distribution, albeit one that generally differs from the 0.5 quantile that is the median. For example, specifying quantile(.25) estimates the parameters that describe the 25th percentile (first quartile) of the conditional distribution. Quantile regression allows for effects of the independent variables to differ over the quantiles. For example, Chamberlain (1994) finds that union membership has a larger effect on the lower quantiles than on the higher quantiles of the conditional distribution of U.S. wages. That the effects of the independent variables may vary over quantiles of the conditional distribution is an important advantage of quantile regression over mean regression. Example 3: Estimating quantiles other than the median Returning to real data, the equation for the 25th percentile of price conditional on weight, length, and foreign in our automobile data is . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . qreg price weight length foreign, quantile(.25) Iteration 1: WLS sum of weighted deviations = 49469.235 Iteration 1: sum of abs. weighted deviations = 49728.883 Iteration 2: sum of abs. weighted deviations = 45669.89 Iteration 3: sum of abs. weighted deviations = 43416.646 Iteration 4: sum of abs. weighted deviations = 41947.221 Iteration 5: sum of abs. weighted deviations = 41093.025 Iteration 6: sum of abs. weighted deviations = 37623.424 Iteration 7: sum of abs. weighted deviations = 35721.453 Iteration 8: sum of abs. weighted deviations = 35226.308 Iteration 9: sum of abs. weighted deviations = 34823.319 Iteration 10: sum of abs. weighted deviations = 34801.777 .25 Quantile regression Raw sum of deviations 41912.75 (about 4187) Min sum of deviations 34801.78 price Coef. weight length foreign _cons 1.831789 2.84556 2209.925 -1879.775 Std. Err. .6328903 21.65558 421.7401 2468.46 t 2.89 0.13 5.24 -0.76 Number of obs = Pseudo R2 P>|t| 0.005 0.896 0.000 0.449 = 74 0.1697 [95% Conf. Interval] .5695289 -40.34514 1368.791 -6802.963 3.094049 46.03626 3051.059 3043.413 Compared with our previous median regression, the coefficient on length now has a positive sign, and the coefficients on foreign and weight are reduced. 
The actual lower quantile is $4,187, substantially less than the median $4,934. qreg — Quantile regression 1771 We can also estimate the upper quartile as a function of the same three variables: . qreg price weight length foreign, quantile(.75) Iteration 1: WLS sum of weighted deviations = 55465.741 Iteration 1: sum of abs. weighted deviations = 55652.957 Iteration 2: sum of abs. weighted deviations = 52994.785 Iteration 3: sum of abs. weighted deviations = 50189.446 Iteration 4: sum of abs. weighted deviations = 49898.245 Iteration 5: sum of abs. weighted deviations = 49398.106 Iteration 6: sum of abs. weighted deviations = 49241.835 Iteration 7: sum of abs. weighted deviations = 49197.967 .75 Quantile regression Raw sum of deviations 79860.75 (about 6342) Min sum of deviations 49197.97 price Coef. weight length foreign _cons 9.22291 -220.7833 3595.133 20242.9 Std. Err. 1.785767 61.10352 1189.984 6965.02 t 5.16 -3.61 3.02 2.91 Number of obs = Pseudo R2 P>|t| 0.000 0.001 0.004 0.005 = 74 0.3840 [95% Conf. Interval] 5.66131 -342.6504 1221.785 6351.61 12.78451 -98.91616 5968.482 34134.2 This result tells a different story: weight is much more important, and length is now significant — with a negative coefficient! The prices of high-priced cars seem to be determined by factors different from those affecting the prices of low-priced cars. Technical note One explanation for having substantially different regression functions for different quantiles is that the data are heteroskedastic, as we will demonstrate below. The following statements create a sharply heteroskedastic set of data: . drop _all . set obs 10000 obs was 0, now 10000 . set seed 50550 . gen x = .1 + .9 * runiform() . gen y = x * runiform()^2 1772 qreg — Quantile regression Let’s now fit the regressions for the 5th and 95th quantiles: . qreg y x, quantile(.05) Iteration 1: WLS sum of weighted deviations = 540.36365 Iteration Iteration Iteration Iteration Iteration Iteration Iteration Iteration Iteration 539.15959 141.36772 91.234609 91.127281 91.126351 91.126236 91.126229 91.126224 91.126221 1: 2: 3: 4: 5: 6: 7: 8: 9: sum sum sum sum sum sum sum sum sum of of of of of of of of of abs. abs. abs. abs. abs. abs. abs. abs. abs. weighted weighted weighted weighted weighted weighted weighted weighted weighted deviations deviations deviations deviations deviations deviations deviations deviations deviations = = = = = = = = = .05 Quantile regression Raw sum of deviations 91.17849 (about .0009234) Min sum of deviations 91.12622 y Coef. x _cons .002601 -.0001393 Std. Err. .0004576 .0002782 t Number of obs = Pseudo R2 P>|t| 5.68 -0.50 0.000 0.617 .001704 -.0006846 618.77845 Iteration Iteration Iteration Iteration Iteration 619.00068 228.32522 169.22749 169.21949 169.21945 sum sum sum sum sum of of of of of abs. abs. abs. abs. abs. weighted weighted weighted weighted weighted deviations deviations deviations deviations deviations = = = = = .95 Quantile regression Raw sum of deviations 277.3444 (about .61326343) Min sum of deviations 169.2194 y Coef. x _cons .8898259 .0021514 Std. Err. .0090984 .0055307 t 97.80 0.39 0.0006 [95% Conf. Interval] . qreg y x, quantile(.95) Iteration 1: WLS sum of weighted deviations = 1: 2: 3: 4: 5: = 10000 .003498 .000406 Number of obs = Pseudo R2 = 10000 0.3899 P>|t| [95% Conf. Interval] 0.000 0.697 .8719912 -.00869 .9076605 .0129927 The coefficient on x, in particular, differs markedly between the two estimates. 
For the mathematically inclined, it is not too difficult to show that the theoretical lines are y = 0.0025 x for the 5th percentile and y = 0.9025 x for the 95th, numbers in close agreement with our numerical results. The estimator for the standard errors computed by qreg assumes that the sample is independent and identically distributed (i.i.d.); see Estimated standard errors and Methods and formulas for details. Because the data are conditionally heteroskedastic, we should have used bsqreg to consistently estimate the standard errors using a bootstrap method. Estimated standard errors The variance–covariance matrix of the estimator (VCE) depends on the reciprocal of the density of the dependent variable evaluated at the quantile of interest. This function, known as the “sparsity function”, is hard to estimate. qreg — Quantile regression 1773 The default method, which uses the fitted values for the predicted quantiles, generally performs well, but other methods may be preferred in larger samples. The vce() suboptions denmethod and bwidth provide other estimators of the sparsity function, the details of which are described in Methods and formulas. For models with heteroskedastic errors, option vce(robust) computes a Huber (1967) form of sandwich estimate (Koenker 2005). Alternatively, Gould (1992, 1997b) introduced generalized versions of qreg that obtain estimates of the standard errors by using bootstrap resampling (see Efron and Tibshirani [1993] or Wu [1986] for an introduction to bootstrap standard errors). The iqreg, sqreg, and bsqreg commands provide a bootstrapped estimate of the entire variance–covariance matrix of the estimators. Example 4: Obtaining robust standard errors Example 2 of qreg on real data above was a median regression of price on weight, length, and foreign using auto.dta. Suppose, after investigation, we are convinced that car price observations are not independent. We decide that standard errors robust to non-i.i.d. errors would be appropriate and use the option vce(robust). . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . qreg price weight length foreign, vce(robust) Iteration 1: WLS sum of weighted deviations = 56397.829 Iteration 1: sum of abs. weighted deviations = 55950.5 Iteration 2: sum of abs. weighted deviations = 55264.718 Iteration 3: sum of abs. weighted deviations = 54762.283 Iteration 4: sum of abs. weighted deviations = 54734.152 Iteration 5: sum of abs. weighted deviations = 54552.638 note: alternate solutions exist Iteration 6: sum of abs. weighted deviations = 54465.511 Iteration 7: sum of abs. weighted deviations = 54443.699 Iteration 8: sum of abs. weighted deviations = 54411.294 Median regression Number of obs = Raw sum of deviations 71102.5 (about 4934) Min sum of deviations 54411.29 Pseudo R2 = price Coef. weight length foreign _cons 3.933588 -41.25191 3377.771 344.6489 Robust Std. Err. 1.694477 51.73571 728.5115 5096.528 t 2.32 -0.80 4.64 0.07 P>|t| 0.023 0.428 0.000 0.946 74 0.2347 [95% Conf. Interval] .55406 -144.4355 1924.801 -9820.055 7.313116 61.93171 4830.741 10509.35 We see that the robust standard error for weight increases making it less significant in modifying the median automobile price. The standard error for length also increases, but the standard error for the foreign indicator decreases. 1774 qreg — Quantile regression For comparison, we repeat the estimation using bootstrap standard errors: . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . set seed 1001 . 
bsqreg price weight length foreign (fitting base model) Bootstrap replications (20) 1 2 3 4 5 .................... Median regression, bootstrap(20) SEs Number of obs = Raw sum of deviations 71102.5 (about 4934) Min sum of deviations 54411.29 Pseudo R2 = price Coef. weight length foreign _cons 3.933588 -41.25191 3377.771 344.6489 Std. Err. 3.12446 83.71267 1057.281 7053.301 t 1.26 -0.49 3.19 0.05 P>|t| 0.212 0.624 0.002 0.961 74 0.2347 [95% Conf. Interval] -2.297951 -208.2116 1269.09 -13722.72 10.16513 125.7077 5486.452 14412.01 The coefficient estimates are the same — indeed, they are obtained using the same technique. Only the standard errors differ. Therefore, the t statistics, significance levels, and confidence intervals also differ. Because bsqreg (as well as sqreg and iqreg) obtains standard errors by randomly resampling the data, the standard errors it produces will not be the same from run to run unless we first set the random-number seed to the same number; see [R] set seed. qreg — Quantile regression 1775 By default, bsqreg, sqreg, and iqreg use 20 replications. We can control the number of replications by specifying the reps() option: . bsqreg price weight length i.foreign, reps(1000) (fitting base model) Bootstrap replications (1000) 1 2 3 4 5 .................................................. .................................................. .................................................. .................................................. .................................................. .................................................. .................................................. .................................................. .................................................. .................................................. .................................................. .................................................. .................................................. .................................................. .................................................. .................................................. .................................................. .................................................. .................................................. .................................................. Median regression, bootstrap(1000) SEs Raw sum of deviations 71102.5 (about 4934) Min sum of deviations 54411.29 Std. Err. t 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 Number of obs = Pseudo R2 P>|t| = 74 0.2347 price Coef. [95% Conf. Interval] weight length 3.933588 -41.25191 2.659381 69.29771 1.48 -0.60 0.144 0.554 -1.370379 -179.4618 9.237555 96.95802 foreign Foreign _cons 3377.771 344.6489 1094.264 5916.906 3.09 0.06 0.003 0.954 1195.331 -11456.25 5560.211 12145.55 A comparison of the standard errors is informative. Variable weight length 1.foreign cons qreg qreg vce(robust) bsqreg reps(20) bsqreg reps(1000) 1.329 45.46 885.4 5182. 1.694 51.74 728.5 5096. 3.124 83.71 1057. 7053. 2.660 69.30 1094. 5917. The results shown above are typical for models with heteroskedastic errors. (Our dependent variable is price; if our model had been in terms of ln(price), the standard errors estimated by qreg and bsqreg would have been nearly identical.) Also, even for heteroskedastic errors, 20 replications is generally sufficient for hypothesis tests against 0. 
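One way to rebuild a comparison like the table above in a single pass is to store each fit and lay the coefficients and standard errors side by side. The lines below are a sketch, not part of the original example; they assume auto.dta is in memory and use the standard estimates store and estimates table commands.

. qreg price weight length foreign
. estimates store iid
. qreg price weight length foreign, vce(robust)
. estimates store robust
. set seed 1001
. bsqreg price weight length foreign, reps(1000)
. estimates store boot1000
. estimates table iid robust boot1000, se    // standard errors shown beneath each coefficient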
1776 qreg — Quantile regression Interquantile and simultaneous-quantile regression Consider a quantile regression model where the q th quantile is given by Qq (y) = aq + bq,1 x1 + bq,2 x2 For instance, the 75th and 25th quantiles are given by Q0.75 (y) = a0.75 + b0.75,1 x1 + b0.75,2 x2 Q0.25 (y) = a0.25 + b0.25,1 x1 + b0.25,2 x2 The difference in the quantiles is then Q0.75 (y) − Q0.25 (y) = (a0.75 − a0.25 ) + (b0.75,1 − b0.25,1 )x1 + (b0.75,2 − b0.25,2 )x2 qreg fits models such as Q0.75 (y) and Q0.25 (y). iqreg fits interquantile models, such as Q0.75 (y) − Q0.25 (y). The relationships of the coefficients estimated by qreg and iqreg are exactly as shown: iqreg reports coefficients that are the difference in coefficients of two qreg models, and, of course, iqreg reports the appropriate standard errors, which it obtains by bootstrapping. sqreg is like qreg in that it estimates the equations for the quantiles Q0.75 (y) = a0.75 + b0.75,1 x1 + b0.75,2 x2 Q0.25 (y) = a0.25 + b0.25,1 x1 + b0.25,2 x2 The coefficients it obtains are the same that would be obtained by estimating each equation separately using qreg. sqreg differs from qreg in that it estimates the equations simultaneously and obtains an estimate of the entire variance–covariance matrix of the estimators by bootstrapping. Thus you can perform hypothesis tests concerning coefficients both within and across equations. For example, to fit the above model, you could type . qreg y x1 x2, quantile(.25) . qreg y x1 x2, quantile(.75) By doing this, you would obtain estimates of the parameters, but you could not test whether b0.25,1 = b0.75,1 or, equivalently, b0.75,1 − b0.25,1 = 0. If your interest really is in the difference of coefficients, you could type . iqreg y x1 x2, quantiles(.25 .75) The “coefficients” reported would be the difference in quantile coefficients. You could also estimate both quantiles simultaneously and then test the equality of the coefficients: . sqreg y x1 x2, quantiles(.25 .75) . test [q25]x1 = [q75]x1 Whether you use iqreg or sqreg makes no difference for this test. sqreg, however, because it estimates the quantiles simultaneously, allows you to test other hypotheses. iqreg, by focusing on quantile differences, presents results in a way that is easier to read. Finally, sqreg can estimate quantiles singly, . sqreg y x1 x2, quantiles(.5) and can thereby be used as a substitute for the slower bsqreg. (Gould [1997b] presents timings demonstrating that sqreg is faster than bsqreg.) sqreg can also estimate more than two quantiles simultaneously: . sqreg y x1 x2, quantiles(.25 .5 .75) qreg — Quantile regression 1777 Example 5: Simultaneous quantile estimation In demonstrating qreg, we performed quantile regressions using auto.dta. We discovered that the regression of price on weight, length, and foreign produced vastly different coefficients for the 0.25, 0.5, and 0.75 quantile regressions. Here are the coefficients that we obtained: Variable weight length foreign cons 25th percentile 1.83 2.85 2209.9 −1879.8 50th percentile 3.93 75th percentile 9.22 −41.25 −220.8 3377.8 344.6 3595.1 20242.9 All we can say, having estimated these equations separately, is that price seems to depend differently on the weight, length, and foreign variables depending on the portion of the price distribution we examine. We cannot be more precise because the estimates have been made separately. With sqreg, however, we can estimate all the effects simultaneously: . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . 
set seed 1001 . sqreg price weight length foreign, q(.25 .5 .75) reps(100) (fitting base model) Bootstrap replications (100) 1 2 3 4 5 .................................................. 50 .................................................. 100 Simultaneous quantile regression Number of obs bootstrap(100) SEs .25 Pseudo R2 .50 Pseudo R2 .75 Pseudo R2 Bootstrap Std. Err. t P>|t| = = = = 74 0.1697 0.2347 0.3840 price Coef. [95% Conf. Interval] weight length foreign _cons 1.831789 2.84556 2209.925 -1879.775 1.574777 38.63523 1008.521 3665.184 1.16 0.07 2.19 -0.51 0.249 0.941 0.032 0.610 -1.309005 -74.20998 198.494 -9189.753 4.972583 79.9011 4221.357 5430.204 weight length foreign _cons 3.933588 -41.25191 3377.771 344.6489 2.529541 68.62258 1025.882 6199.257 1.56 -0.60 3.29 0.06 0.124 0.550 0.002 0.956 -1.111423 -178.1153 1331.715 -12019.38 8.978599 95.61151 5423.827 12708.68 weight length foreign _cons 9.22291 -220.7833 3595.133 20242.9 2.483676 86.17422 1145.124 9414.242 3.71 -2.56 3.14 2.15 0.000 0.013 0.002 0.035 4.269374 -392.6524 1311.255 1466.79 14.17645 -48.91421 5879.011 39019.02 q25 q50 q75 The coefficient estimates above are the same as those previously estimated, although the standard error estimates are a little different. sqreg obtains estimates of variance by bootstrapping. The important thing here, however, is that the full covariance matrix of the estimators has been estimated and stored, and thus it is now possible to perform hypothesis tests. Are the effects of weight the same at the 25th and 75th percentiles? 1778 qreg — Quantile regression . test [q25]weight = ( 1) [q25]weight F( 1, 70) Prob > F [q75]weight [q75]weight = 0 = 8.97 = 0.0038 It appears that they are not. We can obtain a confidence interval for the difference by using lincom: . lincom [q75]weight-[q25]weight ( 1) - [q25]weight + [q75]weight = 0 price Coef. (1) 7.391121 Std. Err. t P>|t| [95% Conf. Interval] 2.467548 3.00 0.004 2.469752 12.31249 Indeed, we could test whether the weight and length sets of coefficients are equal at the three quantiles estimated: . quietly test [q25]weight = [q50]weight . quietly test [q25]weight = [q75]weight, accumulate . quietly test [q25]length = [q50]length, accumulate . test [q25]length = [q75]length, accumulate ( 1) [q25]weight - [q50]weight = 0 ( 2) [q25]weight - [q75]weight = 0 ( 3) [q25]length - [q50]length = 0 ( 4) [q25]length - [q75]length = 0 F( 4, 70) = 2.43 Prob > F = 0.0553 iqreg focuses on one quantile comparison but presents results that are more easily interpreted: . set seed 1001 . iqreg price weight length foreign, q(.25 .75) reps(100) nolog .75-.25 Interquantile regression Number of obs = bootstrap(100) SEs .75 Pseudo R2 = .25 Pseudo R2 = price Coef. weight length foreign _cons 7.391121 -223.6288 1385.208 22122.68 Bootstrap Std. Err. 2.467548 83.09868 1191.018 9009.159 t 3.00 -2.69 1.16 2.46 P>|t| 0.004 0.009 0.249 0.017 74 0.3840 0.1697 [95% Conf. Interval] 2.469752 -389.3639 -990.2036 4154.478 12.31249 -57.89376 3760.619 40090.88 Looking only at the 0.25 and 0.75 quantiles (the interquartile range), the iqreg command output is easily interpreted. Increases in weight correspond significantly to increases in price dispersion. Increases in length correspond to decreases in price dispersion. The foreign variable does not significantly change price dispersion. Do not make too much of these results; the purpose of this example is simply to illustrate the sqreg and iqreg commands and to do so in a context that suggests why analyzing dispersion might be of interest. 
qreg — Quantile regression 1779 lincom after sqreg produced the same t statistic for the interquartile range of weight, as did the iqreg command above. In general, they will not agree exactly because of the randomness of bootstrapping, unless the random-number seed is set to the same value before estimation (as was done here). Gould (1997a) presents simulation results showing that the coverage — the actual percentage of confidence intervals containing the true value — for iqreg is appropriate. What are the parameters? In this section, we use a specific data-generating process (DGP) to illustrate the interpretation of the parameters estimated by qreg. If simulation experiments are not intuitive to you, skip this section. In general, quantile regression parameterizes the quantiles of the distribution of y conditional on the independent variables x as xβ, where β is a vector of estimated parameters. In our example, we include a constant term and a single independent variable, and we express quantiles of the distribution of y conditional on x as β0 + β1 x. We use simulated data to illustrate what we mean by a conditional distribution and how to interpret the parameters β estimated by qreg. We also note how we could change our example to illustrate a DGP for which the estimator in qreg would be misspecified. We suppose that the distribution of y conditional on x has a Weibull form. If y has a Weibull distribution, the distribution function is F (y) = 1 − exp{−(y/λ)k }, where the scale parameter λ > 0 and the shape parameter k > 0. We can make y have a Weibull distribution function conditional on x by making the scale parameter or the shape parameter functions of x.√In our example, we specify a particular DGP by supposing that λ = (1 + αx), α = 1.5, x = 1 + ν , and that ν has a χ2 (1) distribution. For the moment, we leave the parameter k as is so that we can discuss how this decision relates to model specification. Plugging in for λ yields the functional form for the distribution of y conditional on x, which is known as the conditional distribution function and is denoted F (y|x). F (y|x) is the distribution for y for each given value of x. Some algebra yields that F (y|x) = 1 − exp[−{y/(1 + αx)}k ]. Letting τ = F (y|x) implies that 0 ≤ τ ≤ 1, because probabilities must be between 0 and 1. To obtain the τ quantile of the distribution of y conditional on x, we solve τ = 1 − exp[−{y/(1 + αx)}k ] for y as a function of τ , x, α, and k . The solution is y = (1 + αx){− ln(1 − τ )}(1/k) (1) For any value of τ ∈ (0, 1), expression (1) gives the τ quantile of the distribution of y conditional on x. To use qreg, we must rewrite (1) as a function of x, β0 , and β1 . Some algebra yields that (1) can be rewritten as y = β0 + β1 ∗ x where β0 = {− ln(1 − τ )}(1/k) and β1 = α{− ln(1 − τ )}(1/k) . We can express the conditional quantiles as linear combinations of x, which is a property of the estimator implemented in qreg. 1780 qreg — Quantile regression If we parameterize k as a nontrivial function of x, the conditional quantiles will not be linear in x. If the conditional quantiles cannot be represented as linear functions of x, we cannot estimate the true parameters of the DGP. This restriction illustrates the limits of the estimator implemented in qreg. We set k = 2 for our example. Conditional quantile regression allows the coefficients to change with the specified quantile. For our DGP increase as τ gets larger. 
Substituting in for α and k yields that p, the coefficients β0 and β1 p β0 = − ln(1 − τ ) and β1 = 1.5 − ln(1 − τ ). Table 1 presents the true values for β0 and β1 implied by our DGP when τ ∈ {0.25, 0.5, 0.8}. Table 1: True values for β0 and β1 τ 0.25 0.5 0.8 β0 0.53636 0.8325546 1.268636 β1 0.80454 1.248832 1.902954 We can also use (1) to generate data from the specified distribution of y conditional on x by plugging in random uniform numbers for τ . Each random uniform number substituted in for τ in (1) yields a draw from the conditional distribution of y given x. Example 6 In this example, we generate 100,000 observations from our specified DGP by substituting random √ uniform numbers for τ in (1), with α = 1.5, k = 2, x = 1 + ν , and ν coming from a χ2 (1) distribution. We begin by executing the code that implements this method; below we discuss each line of the output produced. . clear // drop existing variables . set seed 1234571 // set random-number seed . set obs 100000 obs was 0, now 100000 // set number of observations . generate double tau = runiform() // generate uniform variate . generate double x = 1 + sqrt(rchi2(1)) // generate values for x . generate double lambda = 1 + 1.5*x // lambda is 1 + alpha*x . generate double k = 2 // fix value of k . . . generate double y // generate random values for y // given x = lambda*((-ln(1-tau))^(1/k)) Although the comments at the end of each line briefly describe what each line is doing, we provide a more careful description. The first line drops any variables in memory. The second sets the seed of the random-number generator so that we will always get the same sequence of random uniform numbers. The third line sets the sample size to 100,000 observations, and the fourth line reports the change in sample size. The fifth line substitutes random uniform numbers for τ . This line is the key to the algorithm. This standard method, known as inverse-probability transforms, for computing random numbers is discussed by Cameron and Trivedi (2010, 126–127), among others. qreg — Quantile regression 1781 Lines 6–8 generate x, λ, and k per our specified DGP. Lines 9–11 implement (1) using the previously generated λ, x, and k . At the end, we have 100,000 observations on y and x, with y coming from the conditional distribution that we specified above. Example 7 In the example below, we use qreg to estimate β1 and β0 , the parameters from the conditional quantile function, for the 0.5 quantile from our simulated data. . qreg y x, quantile(.5) Iteration 1: WLS sum of weighted deviations = 68975.517 Iteration 1: sum of abs. weighted deviations = 68975.325 Iteration 2: sum of abs. weighted deviations = 68843.958 Iteration 3: sum of abs. weighted deviations = 68629.64 Iteration 4: sum of abs. weighted deviations = 68626.382 Iteration 5: sum of abs. weighted deviations = 68625.659 Iteration 6: sum of abs. weighted deviations = 68625.657 Iteration 7: sum of abs. weighted deviations = 68625.657 Median regression Number of obs = Raw sum of deviations 73840.51 (about 2.944248) Min sum of deviations 68625.66 Pseudo R2 = y Coef. x _cons 1.228536 .8693355 Std. Err. .0118791 .0225288 t 103.42 38.59 100000 0.0706 P>|t| [95% Conf. Interval] 0.000 0.000 1.205253 .8251793 1.251819 .9134917 In the qreg output, the results for x correspond to the estimate of β1 , and the results for cons correspond to the estimate of β0 . The reported estimates are close to their true values of 1.248832 and 0.8325546, which are given in table 1. 
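As a quick arithmetic check (not part of the original example), the true values in table 1 can be reproduced directly from the expressions for β0 and β1 with α = 1.5 and k = 2:

. display sqrt(-ln(1-.5))        // beta_0 at tau = .5  ->  .8325546
. display 1.5*sqrt(-ln(1-.5))    // beta_1 at tau = .5  -> 1.248832
. display sqrt(-ln(1-.8))        // beta_0 at tau = .8  -> 1.268636
. display 1.5*sqrt(-ln(1-.8))    // beta_1 at tau = .8  -> 1.902954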
The intuition in this example comes from the ability of qreg to recover the true parameters of our specified DGP. As we increase the number of observations in our sample size, the qreg estimates will get closer to the true values. 1782 qreg — Quantile regression Example 8 In the example below, we estimate the parameters of the conditional quantile function for the 0.25 quantile and compare them with the true values. . qreg y x, quantile(.25) Iteration 1: WLS sum of weighted deviations = 65497.284 Iteration 1: sum of abs. weighted deviations = 65492.359 Iteration 2: sum of abs. weighted deviations = 60139.477 Iteration 3: sum of abs. weighted deviations = 49999.793 Iteration 4: sum of abs. weighted deviations = 49999.479 Iteration 5: sum of abs. weighted deviations = 49999.465 Iteration 6: sum of abs. weighted deviations = 49999.465 .25 Quantile regression Number of obs = Raw sum of deviations 52014.79 (about 1.857329) Min sum of deviations 49999.47 Pseudo R2 = y Coef. x _cons .7844305 .5633285 Std. Err. .0107092 .0203102 t 73.25 27.74 100000 0.0387 P>|t| [95% Conf. Interval] 0.000 0.000 .7634405 .5235209 .8054204 .6031362 As above, qreg reports the estimates of β1 and β0 in the output table for x and cons, respectively. The reported estimates are close to their true values of 0.80454 and 0.53636, which are given in table 1. As expected, the estimates are close to their true values. Also as expected, the estimates for the 0.25 quantile are smaller than the estimates for the 0.5 quantile. qreg — Quantile regression 1783 Example 9 We finish this section by estimating the parameters of the conditional quantile function for the 0.8 quantile and comparing them with the true values. . qreg y x, quantile(.8) Iteration 1: WLS sum of Iteration 1: sum of abs. Iteration 2: sum of abs. Iteration 3: sum of abs. Iteration 4: sum of abs. Iteration 5: sum of abs. Iteration 6: sum of abs. Iteration 7: sum of abs. Iteration 8: sum of abs. Iteration 9: sum of abs. Iteration 10: sum of abs. Iteration 11: sum of abs. Iteration 12: sum of abs. Iteration 13: sum of abs. Iteration 14: sum of abs. Iteration 15: sum of abs. weighted weighted weighted weighted weighted weighted weighted weighted weighted weighted weighted weighted weighted weighted weighted weighted deviations deviations deviations deviations deviations deviations deviations deviations deviations deviations deviations deviations deviations deviations deviations deviations = = = = = = = = = = = = = = = = 66332.299 66332.194 60076.645 52589.193 52340.961 52262.505 52249.305 52245.124 52245.103 52245.081 52245.075 52245.074 52245.073 52245.073 52245.073 52245.073 .8 Quantile regression Raw sum of deviations 60093.34 (about 4.7121822) Min sum of deviations 52245.07 y Coef. x _cons 1.889702 1.293773 Std. Err. .0146895 .0278587 t 128.64 46.44 Number of obs = 100000 Pseudo R2 0.1306 = P>|t| [95% Conf. Interval] 0.000 0.000 1.860911 1.23917 1.918493 1.348375 As above, qreg reports the estimates of β1 and β0 in the output table for x and cons, respectively. The reported estimates are close to their true values of 1.902954 and 1.268636, which are given in table 1. As expected, the estimates are close to their true values. Also as expected, the estimates for the 0.8 quantile are larger than the estimates for the 0.5 quantile. 
Stored results

qreg stores the following in e():

Scalars
  e(N)             number of observations
  e(df_m)          model degrees of freedom
  e(df_r)          residual degrees of freedom
  e(q)             quantile requested
  e(q_v)           value of the quantile
  e(sum_adev)      sum of absolute deviations
  e(sum_rdev)      sum of raw deviations
  e(sum_w)         sum of weights
  e(f_r)           density estimate
  e(sparsity)      sparsity estimate
  e(bwidth)        bandwidth
  e(kbwidth)       kernel bandwidth
  e(rank)          rank of e(V)
  e(convcode)      0 if converged; otherwise, return code for why nonconvergence

Macros
  e(cmd)           qreg
  e(cmdline)       command as typed
  e(depvar)        name of dependent variable
  e(bwmethod)      bandwidth method; hsheather, bofinger, or chamberlain
  e(denmethod)     density estimation method; fitted, residual, or kernel
  e(kernel)        kernel function
  e(wtype)         weight type
  e(wexp)          weight expression
  e(vce)           vcetype specified in vce()
  e(vcetype)       title used to label Std. Err.
  e(properties)    b V
  e(predict)       program used to implement predict
  e(marginsnotok)  predictions disallowed by margins

Matrices
  e(b)             coefficient vector
  e(V)             variance–covariance matrix of the estimators

Functions
  e(sample)        marks estimation sample

iqreg stores the following in e():

Scalars
  e(N)             number of observations
  e(df_r)          residual degrees of freedom
  e(q0)            lower quantile requested
  e(q1)            upper quantile requested
  e(reps)          number of replications
  e(sumrdev0)      lower quantile sum of raw deviations
  e(sumrdev1)      upper quantile sum of raw deviations
  e(sumadev0)      lower quantile sum of absolute deviations
  e(sumadev1)      upper quantile sum of absolute deviations
  e(rank)          rank of e(V)
  e(convcode)      0 if converged; otherwise, return code for why nonconvergence

Macros
  e(cmd)           iqreg
  e(cmdline)       command as typed
  e(depvar)        name of dependent variable
  e(vcetype)       title used to label Std. Err.
  e(properties)    b V
  e(predict)       program used to implement predict
  e(marginsnotok)  predictions disallowed by margins

Matrices
  e(b)             coefficient vector
  e(V)             variance–covariance matrix of the estimators

Functions
  e(sample)        marks estimation sample

sqreg stores the following in e():

Scalars
  e(N)             number of observations
  e(df_r)          residual degrees of freedom
  e(n_q)           number of quantiles requested
  e(q#)            the quantiles requested
  e(reps)          number of replications
  e(sumrdv#)       sum of raw deviations for q#
  e(sumadv#)       sum of absolute deviations for q#
  e(rank)          rank of e(V)
  e(convcode)      0 if converged; otherwise, return code for why nonconvergence

Macros
  e(cmd)           sqreg
  e(cmdline)       command as typed
  e(depvar)        name of dependent variable
  e(eqnames)       names of equations
  e(vcetype)       title used to label Std. Err.
  e(properties)    b V
  e(predict)       program used to implement predict
  e(marginsnotok)  predictions disallowed by margins

Matrices
  e(b)             coefficient vector
  e(V)             variance–covariance matrix of the estimators

Functions
  e(sample)        marks estimation sample

bsqreg stores the following in e():

Scalars
  e(N)             number of observations
  e(df_r)          residual degrees of freedom
  e(q)             quantile requested
  e(q_v)           value of the quantile
  e(reps)          number of replications
  e(sum_adev)      sum of absolute deviations
  e(sum_rdev)      sum of raw deviations
  e(rank)          rank of e(V)
  e(convcode)      0 if converged; otherwise, return code for why nonconvergence

Macros
  e(cmd)           bsqreg
  e(cmdline)       command as typed
  e(depvar)        name of dependent variable
  e(properties)    b V
  e(predict)       program used to implement predict
  e(marginsnotok)  predictions disallowed by margins

Matrices
  e(b)             coefficient vector
  e(V)             variance–covariance matrix of the estimators

Functions
  e(sample)        marks estimation sample

Methods and formulas

Methods and formulas are presented under the following headings:

    Introduction
    Linear programming formulation of quantile regression
    Standard errors when residuals are i.i.d.
    Pseudo-R2

Introduction

According to Stuart and Ord (1991, 1084), the method of minimum absolute deviations was first proposed by Boscovich in 1757 and was later developed by Laplace; Stigler (1986, 39–55) and Hald (1998, 97–103, 112–116) provide historical details. According to Bloomfield and Steiger (1980), Harris (1950) later observed that the problem of minimum absolute deviations could be turned into the linear programming problem that was first implemented by Wagner (1959). Interest has grown in this method because robust methods and extreme value modeling have become more popular. Statistical and computational properties of minimum absolute deviation estimators are surveyed by Narula and Wellington (1982). Cameron and Trivedi (2005), Hao and Naiman (2007), and Wooldridge (2010) provide excellent introductions to quantile regression methods, while Koenker (2005) gives an in-depth review of the topic.

Linear programming formulation of quantile regression

Define $\tau$ as the quantile to be estimated; the median is $\tau = 0.5$. For each observation $i$, let $\varepsilon_i$ be the residual

$$\varepsilon_i = y_i - x_i'\hat{\beta}_\tau$$

The objective function to be minimized is

$$c_\tau(\varepsilon_i) = \bigl(\tau\,1\{\varepsilon_i \ge 0\} + (1-\tau)\,1\{\varepsilon_i < 0\}\bigr)|\varepsilon_i|
 = \bigl(\tau\,1\{\varepsilon_i \ge 0\} - (1-\tau)\,1\{\varepsilon_i < 0\}\bigr)\varepsilon_i
 = \bigl(\tau - 1\{\varepsilon_i < 0\}\bigr)\varepsilon_i \qquad (2)$$

where $1\{\cdot\}$ is the indicator function. This function is sometimes referred to as the check function because it resembles a check mark (Wooldridge 2010, 450); the slope of $c_\tau(\varepsilon_i)$ is $\tau$ when $\varepsilon_i > 0$ and is $\tau - 1$ when $\varepsilon_i < 0$, but is undefined for $\varepsilon_i = 0$. Choosing the $\hat{\beta}_\tau$ that minimize $c_\tau(\varepsilon_i)$ is equivalent to finding the $\hat{\beta}_\tau$ that make $x\hat{\beta}_\tau$ best fit the quantiles of the distribution of $y$ conditional on $x$.

This minimization problem is set up as a linear programming problem and is solved with linear programming techniques, as suggested by Armstrong, Frome, and Kung (1979) and described in detail by Koenker (2005). Here $2n$ slack variables, $u_{n\times 1}$ and $v_{n\times 1}$, are introduced, where $u_i \ge 0$, $v_i \ge 0$, and $u_i \times v_i = 0$, reformulating the problem as

$$\min_{\beta_\tau,\,u,\,v}\ \bigl\{\,\tau\,1_n'u + (1-\tau)\,1_n'v \ \big|\ y - X\beta_\tau = u - v \,\bigr\}$$

where $1_n$ is a vector of 1s. This is a linear objective function on a polyhedral constraint set with $\binom{n}{k}$ vertices, and our goal is to find the vertex that minimizes (2). Each step in the search is described by a set of $k$ observations through which the regression plane passes, called the basis.
A step is taken by replacing a point in the basis if the linear objective function can be improved. If this occurs, a line is printed in the iteration log. The definition of convergence is exact in the sense that no amount of added iterations could improve the objective function. A series of weighted least-squares (WLS) regressions is used to identify a set of observations as a starting basis. The WLS algorithm for τ = 0.5 is taken from Schlossmacher (1973) with a generalization for 0 < τ < 1 implied from Hunter and Lange (2000). Standard errors when residuals are i.i.d. The estimator for the VCE implemented in qreg assumes that the errors of the model are independent and identically distributed (i.i.d.). When the errors are i.i.d., the large-sample VCE is cov(βτ ) = τ (1 − τ ) −1 {E(xi x0i )} fY2 (ξτ ) (3) where ξτ = FY−1 (τ ) and FY (y) is the distribution function of Y with density fY (y). See Koenker (2005, 73) for this result. From (3), we see that the regression precision depends on the inverse of the density function, termed the sparsity function, sτ = 1/fY (ξτ ). Pn While 1/n i=1 xi x0i estimates E(xi x0i ), estimating the sparsity function is more difficult. qreg provides several methods to estimate the sparsity function. The different estimators are specified through the suboptions of vce(iid, denmethod bwidth). The suboption denmethod specifies the functional form for the sparsity estimator. The default is fitted. Here we outline the logic underlying the fitted estimator. Because FY (y) is the distribution function for Y , we have fY (y) = {dFy (y)}/dy , τ = FY (ξτ ), and ξτ = FY−1 (τ ). When differentiating the identity FY {FY−1 (τ )} = τ , the sparsity function can be written as sτ = {FY−1 (τ )}/dt. Numerically, we can approximate the derivative using the centered difference, 1788 qreg — Quantile regression FY−1 (τ ) F −1 (τ + h) − FY−1 (τ − h) ξτ +h − ξτ −h ≈ Y = = sbτ dt 2h 2h where h is the bandwidth. (4) The empirical quantile function is computed by first estimating βτ +h and βτ −h , and then computing −1 b b τ +h and Fb−1 (τ − h) = x0 β b τ −h , where x is the sample mean of the independent FY (τ + h) = x0 β Y variables x. These quantities are then substituted into (4). Alternatively, as the option suggests, vce(iid, residual) specifies that qreg use the empirical quantile function of the residuals to estimate the sparsity. Here we substitute F , the distribution of the residuals, for FY , which only differ by their first moments. The k residuals associated with the linear programming basis will be zero, where k is the number of regression coefficients. These zero residuals are removed before computing the τ + h and τ − h quantiles, ε(τ +h) = Fb−1 (τ + h) and ε(τ −h) = Fb−1 (τ − h). The Fb−1 estimates are then substituted for FY−1 in (4). Each of the estimators for the sparsity function depends on a bandwidth. The vce() suboption bwidth specifies the bandwidth method to use. The three bandwidth options and their citations are hsheather (Hall and Sheather 1988), bofinger (Bofinger 1975), and chamberlain (Chamberlain 1994). Their formulas are 1/3 α 2/3 3 φ{Φ−1 (τ )}4 hs = n Φ 1− × 2 2 2Φ−1 (τ )2 + 1 9 1/5 φ{2Φ−1 (τ )}4 hb = n−1/5 2 −1 2 {2Φ (τ ) + 1}2 r α τ (1 − τ ) −1 hc = Φ 1− 2 n −1/3 −1 where hs is the Hall–Sheather bandwidth, hb is the Bofinger bandwidth, hc is the Chamberlain bandwidth, Φ() and φ() are the standard normal distribution and density functions, n is the sample size, and 100(1 − α) is the confidence level set by the level() option. 
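In practice, these choices are made through the vce() suboptions just described. As a brief sketch (the model is the one used in the examples for this entry, and any of the documented denmethod and bwidth keywords could be substituted), the following commands fit the same median regression with the default settings, with the residual density estimator combined with the Bofinger bandwidth, and with a kernel estimate of the sparsity using the default Epanechnikov kernel:

. use http://www.stata-press.com/data/r13/auto, clear
. qreg price weight length foreign, quantile(.5)
. qreg price weight length foreign, quantile(.5) vce(iid, residual bofinger)
. qreg price weight length foreign, quantile(.5) vce(iid, kernel(epanechnikov))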
Koenker (2005) discusses the derivation of the Hall–Sheather and the Bofinger bandwidth formulas. You should avoid modifying the confidence level when replaying estimates that use the Hall–Sheather or Chamberlain bandwidths because these methods use the confidence level to estimate the coefficient standard errors. Finally, the vce() suboption kernel(kernel) specifies that qreg use one of several kernel-density estimators to estimate the sparsity function. kernel allows you to choose which kernel function to use, where the default is the Epanechnikov kernel. See [R] kdensity for the functional form of the eight kernels. The kernel bandwidth is computed using an adaptive estimate of scale rq −1 × Φ (τ + h) − Φ−1 (τ − h) hk = min σ b, 1.34 where h is one of hs , hb , or hc ; rq is the interquartile range; and σ b is the standard deviation of y; see Silverman (1992, 47) and Koenker (2005, 81) for discussions. Let fb (εi ) be the kernel density estimate for the ith residual, and then the kernel estimator for the sparsity function is sbτ = Pn nhk fb (εi ) i=1 qreg — Quantile regression 1789 Finally, substituting your choice of sparsity estimate into (3) results in the i.i.d. variance–covariance matrix !−1 n X 2 0 Vn = sbτ τ (1 − τ ) xi xi i=1 Pseudo-R2 The pseudo-R2 is calculated as 1− sum of weighted deviations about estimated quantile sum of weighted deviations about raw quantile This is based on the likelihood for a double-exponential distribution evi |εi | , where vi are multipliers vi = 2τ if εi > 0 2(1 − τ ) otherwise P Minimizing the objective function (2) with respect to βτ also minimizes i |εi |vi , the sum of weighted least absolute deviations. For example, for the 50th percentile vi = 1, for all i, and we have median regression. If we want to estimate the 75th percentile, we weight the negative residuals by 0.50 and the positive residuals by 1.50. It can be shown that the criterion is minimized when 75% of the residuals are negative. References Angrist, J. D., and J.-S. Pischke. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton, NJ: Princeton University Press. Armstrong, R. D., E. L. Frome, and D. S. Kung. 1979. Algorithm 79-01: A revised simplex algorithm for the absolute deviation curve fitting problem. Communications in Statistics—Simulation and Computation 8: 175–190. Bloomfield, P., and W. Steiger. 1980. Least absolute deviations curve-fitting. SIAM Journal on Scientific Computing 1: 290–301. Bofinger, E. 1975. Estimation of a density function using order statistics. Australian Journal of Statistics 17: 1–17. Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. New York: Cambridge University Press. . 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press. Chamberlain, G. 1994. Quantile regression, censoring, and the structure of wages. In Advances in Econometrics, Vol. 1: Sixth World Congress, ed. C. A. Sims, 171–209. Cambridge: Cambridge University Press. Efron, B., and R. J. Tibshirani. 1993. An Introduction to the Bootstrap. New York: Chapman & Hall/CRC. Frölich, M., and B. Melly. 2010. Estimation of quantile treatment effects with Stata. Stata Journal 10: 423–457. Gould, W. W. 1992. sg11.1: Quantile regression with bootstrapped standard errors. Stata Technical Bulletin 9: 19–21. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 137–139. College Station, TX: Stata Press. . 1997a. crc46: Better numerical derivatives and integrals. Stata Technical Bulletin 35: 3–5. 
Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 8–12. College Station, TX: Stata Press. . 1997b. sg70: Interquantile and simultaneous-quantile regression. Stata Technical Bulletin 38: 14–22. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 167–176. College Station, TX: Stata Press. Gould, W. W., and W. H. Rogers. 1994. Quantile regression as an alternative to robust regression. In 1994 Proceedings of the Statistical Computing Section. Alexandria, VA: American Statistical Association. 1790 qreg — Quantile regression Hald, A. 1998. A History of Mathematical Statistics from 1750 to 1930. New York: Wiley. Hall, P., and S. J. Sheather. 1988. On the distribution of a Studentized quantile. Journal of the Royal Statistical Society, Series B 50: 381–391. Hao, L., and D. Q. Naiman. 2007. Quantile Regression. Thousand Oaks, CA: Sage. Harris, T. 1950. Regression using minimum absolute deviations. American Statistician 4: 14–15. Huber, P. J. 1967. The behavior of maximum likelihood estimates under nonstandard conditions. In Vol. 1 of Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 221–233. Berkeley: University of California Press. . 1981. Robust Statistics. New York: Wiley. Hunter, D. R., and K. Lange. 2000. Quantile regression via an MM algorithm. Journal of Computational and Graphical Statistics 9: 60–77. Jolliffe, D., B. Krushelnytskyy, and A. Semykina. 2000. sg153: Censored least absolute deviations estimator: CLAD. Stata Technical Bulletin 58: 13–16. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 240–244. College Station, TX: Stata Press. Koenker, R. 2005. Quantile Regression. New York: Cambridge University Press. Koenker, R., and K. Hallock. 2001. Quantile regression. Journal of Economic Perspectives 15: 143–156. Narula, S. C., and J. F. Wellington. 1982. The minimum sum of absolute errors regression: A state of the art survey. International Statistical Review 50: 317–326. Orsini, N., and M. Bottai. 2011. Logistic quantile regression in Stata. Stata Journal 11: 327–344. Rousseeuw, P. J., and A. M. Leroy. 1987. Robust Regression and Outlier Detection. New York: Wiley. Schlossmacher, E. J. 1973. An iterative technique for absolute deviations curve fitting. Journal of the American Statistical Association 68: 857–859. Silverman, B. W. 1992. Density Estimation for Statistics and Data Analysis. London: Chapman & Hall. Stigler, S. M. 1986. The History of Statistics: The Measurement of Uncertainty before 1900. Cambridge, MA: Belknap Press. Stuart, A., and J. K. Ord. 1991. Kendall’s Advanced Theory of Statistics: Distribution Theory, Vol I. 5th ed. New York: Oxford University Press. Wagner, H. M. 1959. Linear programming techniques for regression analysis. Journal of the American Statistical Association 54: 206–212. Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press. Wu, C. F. J. 1986. Jackknife, bootstrap and other resampling methods in regression analysis. Annals of Statistics 14: 1261–1350 (including discussions and rejoinder). 
Also see

[R] qreg postestimation — Postestimation tools for qreg, iqreg, sqreg, and bsqreg
[R] bootstrap — Bootstrap sampling and estimation
[R] regress — Linear regression
[R] rreg — Robust regression
[MI] estimation — Estimation commands for use with mi estimate
[U] 20 Estimation and postestimation commands

Title

qreg postestimation — Postestimation tools for qreg, iqreg, sqreg, and bsqreg

Description    Syntax for predict    Menu for predict    Options for predict    Remarks and examples    Also see

Description

The following postestimation commands are available after qreg, iqreg, bsqreg, and sqreg:

Command            Description
estat summarize    summary statistics for the estimation sample
estat vce          variance–covariance matrix of the estimators (VCE)
estimates          cataloging estimation results
forecast (1)       dynamic forecasts and simulations
lincom             point estimates, standard errors, testing, and inference for linear combinations of coefficients
linktest           link test for model specification
margins            marginal means, predictive margins, marginal effects, and average marginal effects
marginsplot        graph the results from margins (profile plots, interaction plots, etc.)
nlcom              point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients
predict            predictions, residuals, influence statistics, and other diagnostic measures
predictnl          point estimates, standard errors, testing, and inference for generalized predictions
test               Wald tests of simple and composite linear hypotheses
testnl             Wald tests of nonlinear hypotheses

(1) forecast is not appropriate with mi estimation results.

Syntax for predict

For qreg, iqreg, and bsqreg

    predict [type] newvar [if] [in] [, xb | stdp | residuals]

For sqreg

    predict [type] newvar [if] [in] [, equation(eqno[, eqno]) statistic]

statistic     Description
Main
xb            linear prediction; the default
stdp          standard error of the linear prediction
stddp         standard error of the difference in linear predictions
residuals     residuals

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.

Menu for predict

Statistics > Postestimation > Predictions, residuals, etc.

Options for predict

Main

xb, the default, calculates the linear prediction.

stdp calculates the standard error of the linear prediction.

stddp is allowed only after you have fit a model using sqreg. The standard error of the difference in linear predictions (x1j b − x2j b) between equations 1 and 2 is calculated.

residuals calculates the residuals, that is, yj − xj b.

equation(eqno[, eqno]) specifies the equation to which you are making the calculation. equation() is filled in with one eqno for the xb, stdp, and residuals options. equation(#1) would mean that the calculation is to be made for the first equation, equation(#2) would mean the second, and so on. You could also refer to the equations by their names. equation(income) would refer to the equation named income and equation(hours) to the equation named hours. If you do not specify equation(), results are the same as if you had specified equation(#1). To use stddp, you must specify two equations. You might specify equation(#1, #2) or equation(q80, q20) to indicate the 80th and 20th quantiles.

Remarks and examples

Example 1

In example 4 of [R] qreg, we fit regressions for the lower and the upper quartile of the price variable.
The predict command can be used to obtain the linear prediction after each regression.

. use http://www.stata-press.com/data/r13/auto
(1978 Automobile Data)
. qreg price weight length foreign, quantile(.25)
  (output omitted)
. predict q25
(option xb assumed; fitted values)
. qreg price weight length foreign, quantile(.75)
  (output omitted)
. predict q75
(option xb assumed; fitted values)

We can use the variables generated by predict to compute the predicted interquartile range, that is,

. generate iqr1 = q75 - q25

If we directly perform the interquartile range regression with the iqreg command, we can predict the interquartile range and also the standard error for the prediction.

. iqreg price weight length foreign, quantile(.25 .75)
  (output omitted)
. predict iqr2
(option xb assumed; fitted values)
. predict stdp, stdp

We now plot the predicted interquartile range versus the variable length:

. scatter iqr2 length
(scatterplot omitted; vertical axis: Linear prediction, roughly 0 to 8,000; horizontal axis: Length (in.), 140 to 240)

As stated in example 5 of [R] qreg, the negative coefficient for the length variable means that increases in length imply decreases in the interquartile range and therefore in price dispersion. Consequently, we could have expected a downward trend in the plot, but there is none. This is because the regression output indicates that when we hold the rest of the variables constant, an increase in length leads to a decrease in iqr2. However, there is a high correlation between weight and length, which could be masking the effect of length on iqr2. We can achieve a better visualization by using a contour plot.

. twoway contour iqr2 weight length, level(10)
(contour plot omitted; vertical axis: Weight (lbs.), 2,000 to 5,000; horizontal axis: Length (in.), 140 to 240; contour levels of the linear prediction from about 310 to 7,330)

We can see the effect by setting a fixed value of weight on the vertical axis, say, 3,000 lbs. When we move from left to right on the horizontal axis, we see that for small values of length, iqr2 values are shown in red, meaning high values, and when we move toward the right, the graph indicates a transition into increasingly smaller values.

Also see

[R] qreg — Quantile regression
[U] 20 Estimation and postestimation commands

Title

query — Display system parameters

Syntax    Description    Remarks and examples    Also see

Syntax

query [ memory | output | interface | graphics | efficiency | network | update | trace | mata | other ]

Description

query displays the settings of various Stata parameters.

Remarks and examples

query provides more system information than you will ever want to know. You do not need to understand every line of output that query produces if all you need is one piece of information. Here is what happens when you type query:

. query
Memory settings
    set maxvar          5000        2048-32767; max. vars allowed
    set matsize         400         10-11000; max. # vars in models
    set niceness        5           0-10
    set min_memory      0           0-1600g
    set max_memory      .           32m-1600g or .
    set segmentsize     32m
1m-32g Output settings set more set rmsg set dp set linesize set pagesize on off period 80 28 may be period or comma characters lines 95 percent confidence intervals set level set set set set set set showbaselevels showemptycells showomitted fvlabel fvwrap fvwrapon may be empty, off, on, or all may be empty, off, or on may be empty, off, or on on 1 word may be word or width set lstretch may be empty, off, or on set cformat set pformat set sformat may be empty or a numerical format may be empty or a numerical format may be empty or a numerical format 1795 1796 query — Display system parameters set coeftabresults on set logtype smcl Interface settings set dockable set dockingguides set floatwindows set locksplitters set pinnable set doublebuffer set set set set set linegap scrollbufsize fastscroll varlabelpos reventries may be smcl or text on on off off on on 1 204800 on pixels characters 5000 (not relevant) lines 50 dialog boxes Graphics settings set graphics set autotabgraphs set scheme set printcolor set copycolor on off s2color automatic automatic may be automatic, asis, gs1, gs2, gs3 may be automatic, asis, gs1, gs2, gs3 Efficiency settings set adosize 1000 kilobytes Network set set set off 30 180 seconds seconds set maxdb settings checksum timeout1 timeout2 set httpproxy set httpproxyhost set httpproxyport set httpproxyauth set httpproxyuser set httpproxypw off 80 off Update settings set update_query on set update_interval 7 set update_prompt on Trace (programming debugging) settings set trace off set tracedepth 32000 set traceexpand on set tracesep on set traceindent on set tracenumber off set tracehilite query — Display system parameters 1797 Mata settings set matastrict off set matalnum off set mataoptimize on set matafavor space may be space or speed set matacache 400 kilobytes set matalibs lmatabase;lmataado;lmatafc;lmatagsem;lmataopt; > lmatapath;lmatapss;lmatasem set matamofirst off Other settings set type set maxiter set searchdefault set seed set varabbrev set emptycells set processors set haverdir float may be float or double 16000 max iterations for estimation commands all may be local, net, or all X075bcd151f123bb5159a55e50022865700043e55 on keep may be keep or drop 2 1-2 The output is broken into several divisions: memory, output, interface, graphics, efficiency, network, update, trace, mata, and other settings. We will discuss each one in turn. We generated the output above using Stata/MP for Windows. Here is what happens when we type query and we are running Stata/SE for Mac: . query Memory settings set maxvar set matsize set niceness set min_memory set max_memory set segmentsize 5000 400 5 0 . 32m 2048-32767; max. vars allowed 10-11000; max. # vars in models 0-10 0-1600g 32m-1600g or . 
1m-32g Output settings set more set rmsg set dp set linesize set pagesize off off period 80 23 may be period or comma characters lines 95 percent confidence intervals set level set set set set set set showbaselevels showemptycells showomitted fvlabel fvwrap fvwrapon may be empty, off, on, or all may be empty, off, or on may be empty, off, or on on 1 word may be word or width set lstretch may be empty, off, or on set cformat set pformat set sformat may be empty or a numerical format may be empty or a numerical format may be empty or a numerical format set coeftabresults on set logtype smcl may be smcl or text 1798 query — Display system parameters set set set set set charset eolchar notifyuser playsnd include_bitmap Interface settings set revkeyboard set varkeyboard set smoothfonts set set set set linegap scrollbufsize varlabelpos reventries mac unix on off on may be mac or latin1 may be mac or unix on on on 5000 pixels characters (not relevant) lines 50 dialog boxes Graphics settings set graphics set scheme set printcolor set copycolor on s2color automatic automatic may be automatic, asis, gs1, gs2, gs3 may be automatic, asis, gs1, gs2, gs3 Efficiency settings set adosize 1000 kilobytes Network set set set off 30 180 seconds seconds set maxdb settings checksum timeout1 timeout2 set httpproxy set httpproxyhost set httpproxyport set httpproxyauth set httpproxyuser set httpproxypw 1 204800 off 80 off Update settings set update_query on set update_interval 7 set update_prompt on Trace (programming debugging) settings set trace off set tracedepth 32000 set traceexpand on set tracesep on set traceindent on set tracenumber off set tracehilite query — Display system parameters 1799 Mata settings set matastrict off set matalnum off set mataoptimize on set matafavor space may be space or speed set matacache 400 kilobytes set matalibs lmatabase;lmataado;lmatafc;lmatagsem;lmataopt; > lmatapath;lmatapss;lmatasem set matamofirst off Other settings set type set maxiter set searchdefault set seed set varabbrev set emptycells set processors float may be float or double 16000 max iterations for estimation commands local may be local, net, or all X075bcd151f123bb5159a55e50022865700043e55 on keep may be keep or drop 1 Memory settings Memory settings indicate how memory is allocated, the maximum number of variables, and the maximum size of a matrix. For more information, see maxvar matsize niceness min memory max memory segmentsize [D] [R] [D] [D] [D] [D] memory matsize memory memory memory memory Output settings Output settings show how Stata displays output on the screen and in log files. For more information, see more rmsg dp linesize pagesize level showbaselevels showemptycells showomitted fvlabel fvwrap fvwrapon cformat pformat sformat [R] more [P] rmsg [D] format [R] log [R] more [R] level [R] set showbaselevels [R] set showbaselevels [R] set showbaselevels [R] set showbaselevels [R] set showbaselevels [R] set showbaselevels [R] set cformat [R] set cformat [R] set cformat 1800 query — Display system parameters [R] [R] [R] [R] [R] [R] [R] [R] coeftabresults lstretch logtype charset eolchar notifyuser playsnd include bitmap set set log set set set set set Interface settings Interface settings control how Stata’s interface works. 
For more information, see dockable dockingguides floatwindows locksplitters pinnable doublebuffer revkeyboard varkeyboard smoothfonts linegap scrollbufsize fastscroll reventries maxdb [R] [R] [R] [R] [R] [R] [R] [R] [R] [R] [R] [R] [R] [R] set set set set set set set set set set set set set db Graphics settings Graphics settings indicate how Stata’s graphics are displayed. For more information, see graphics autotabgraphs scheme printcolor copycolor [G-2] set [R] set [G-2] set [G-2] set [G-2] set graphics scheme printcolor printcolor Efficiency settings The efficiency settings set the maximum amount of memory allocated to automatically loaded do-files, the maximum number of remembered-contents dialog boxes, and the use of virtual memory. For more information, see adosize [P] sysdir query — Display system parameters Network settings Network settings determine how Stata interacts with the Internet. For more information, see [R] netio. Update settings Update settings determine how Stata performs updates. For more information, see [R] update. Trace settings Trace settings adjust Stata’s behavior and are particularly useful in debugging code. For more information, see [P] trace. Mata settings Mata settings affect Mata’s system parameters. For more information, see [M-3] mata set. Other settings The other settings are a miscellaneous collection. For more information, see type maxiter searchdefault seed varabbrev emptycells processors odbcmgr haverdir [D] [R] [R] [R] [R] [R] [R] [D] [D] generate maximize search set seed set set set odbc import haver In general, the parameters displayed by query can be changed by set; see [R] set. Also see [R] set — Overview of system parameters [P] creturn — Return c-class values [M-3] mata set — Set and display Mata system parameters 1801 Title ranksum — Equality tests on unmatched data Syntax Options for median References Menu Remarks and examples Also see Description Stored results Options for ranksum Methods and formulas Syntax Wilcoxon rank-sum test ranksum varname if in , by(groupvar) porder Nonparametric equality-of-medians test median varname if in weight , by(groupvar) median options ranksum options Main ∗ Description by(groupvar) porder grouping variable probability that variable for first group is larger than variable for second group median options Description Main ∗ by(groupvar) exact medianties(below) medianties(above) medianties(drop) medianties(split) grouping variable perform Fisher’s exact test assign values equal to the median to below group assign values equal to the median to above group drop values equal to the median from the analysis split values equal to the median equally between the two groups ∗ by(groupvar) is required. by is allowed with ranksum and median; see [D] by. fweights are allowed with median; see [U] 11.1.6 weight. Menu ranksum Statistics > Nonparametric analysis > Tests of hypotheses > Wilcoxon rank-sum test Nonparametric analysis > Tests of hypotheses > K-sample equality-of-medians test median Statistics > 1802 ranksum — Equality tests on unmatched data 1803 Description ranksum tests the hypothesis that two independent samples (that is, unmatched data) are from populations with the same distribution by using the Wilcoxon rank-sum test, which is also known as the Mann – Whitney two-sample statistic (Wilcoxon 1945; Mann and Whitney 1947). median performs a nonparametric k-sample test on the equality of medians. It tests the null hypothesis that the k samples were drawn from populations with the same median. 
For two samples, the chi-squared test statistic is computed both with and without a continuity correction. ranksum and median are for use with unmatched data. For equality tests on matched data, see [R] signrank. Options for ranksum Main by(groupvar) is required. It specifies the name of the grouping variable. porder displays an estimate of the probability that a random draw from the first population is larger than a random draw from the second population. Options for median Main by(groupvar) is required. It specifies the name of the grouping variable. exact displays the significance calculated by Fisher’s exact test. For two samples, both one- and two-sided probabilities are displayed. medianties(below | above | drop | split) specifies how values equal to the overall median are to be handled. The median test computes the median for varname by using all observations and then divides the observations into those falling above the median and those falling below the median. When values for an observation are equal to the sample median, they can be dropped from the analysis by specifying medianties(drop); added to the group above or below the median by specifying medianties(above) or medianties(below), respectively; or if there is more than 1 observation with values equal to the median, they can be equally divided into the two groups by specifying medianties(split). If this option is not specified, medianties(below) is assumed. Remarks and examples Example 1 We are testing the effectiveness of a new fuel additive. We run an experiment with 24 cars: 12 cars with the fuel treatment and 12 cars without. We input these data by creating a dataset with 24 observations. mpg records the mileage rating, and treat records 0 if the mileage corresponds to untreated fuel and 1 if it corresponds to treated fuel. 1804 ranksum — Equality tests on unmatched data . use http://www.stata-press.com/data/r13/fuel2 . ranksum mpg, by(treat) Two-sample Wilcoxon rank-sum (Mann-Whitney) test obs rank sum expected treat untreated treated 12 12 combined unadjusted variance adjustment for ties 24 128 172 150 150 300 300 300.00 -4.04 adjusted variance 295.96 Ho: mpg(treat==untreated) = mpg(treat==treated) z = -1.279 Prob > |z| = 0.2010 These results indicate that the medians are not statistically different at any level smaller than 20.1%. Similarly, the median test, . median mpg, by(treat) exact Median test whether car received Greater than the fuel additive untreated treated median no yes 7 5 Total 12 Pearson chi2(1) Fisher’s exact 1-sided Fisher’s exact Continuity corrected: Pearson chi2(1) = = = = Total 5 7 12 12 12 0.6667 24 Pr = 0.414 0.684 0.342 0.1667 Pr = 0.683 fails to reject the null hypothesis that there is no difference between the fuel with the additive and the fuel without the additive. Compare these results from these two tests with those obtained from the signrank and signtest where we found significant differences; see [R] signrank. An experiment run on 24 different cars is not as powerful as a before-and-after comparison using the same 12 cars. 
Stored results ranksum stores the following in r(): Scalars r(N 1) r(N 2) r(z) r(Var a) r(group1) r(sum obs) r(sum exp) r(porder) sample size n1 sample size n2 z statistic adjusted variance value of variable for first group actual sum of ranks for first group expected sum of ranks for first group probability that draw from first population is larger than draw from second population ranksum — Equality tests on unmatched data 1805 median stores the following in r(): Scalars r(N) r(chi2) r(p) r(p exact) r(groups) r(chi2 cc) r(p cc) r(p1 exact) sample size Pearson’s χ2 significance of Pearson’s χ2 Fisher’s exact p number of groups compared continuity-corrected Pearson’s χ2 continuity-corrected significance one-sided Fisher’s exact p Methods and formulas For a practical introduction to these techniques with an emphasis on examples rather than theory, see Acock (2014), Bland (2000), or Sprent and Smeeton (2007). For a summary of these tests, see Snedecor and Cochran (1989). Methods and formulas are presented under the following headings: ranksum median ranksum For the Wilcoxon rank-sum test, there are two independent random variables, X1 and X2 , and we test the null hypothesis that X1 ∼ X2 . We have a sample of size n1 from X1 and another of size n2 from X2 . The data are then ranked without regard to the sample to which they belong. If the data are tied, averaged ranks are used. Wilcoxon’s test statistic (1945) is the sum of the ranks for the observations in the first sample: n1 X R1i T = i=1 Mann and Whitney’s U statistic (1947) is the number of pairs (X1i , X2j ) such that X1i > X2j . These statistics differ only by a constant: U =T − n1 (n1 + 1) 2 Again Fisher’s principle of randomization provides a method for calculating the distribution of the test statistic, ties or not. The randomization distribution consists of the nn1 ways to choose n1 ranks from the set of all n = n1 + n2 ranks and assign them to the first sample. It is a straightforward exercise to verify that E(T ) = n1 (n + 1) 2 and Var(T ) = n1 n2 s2 n where s is the standard deviation of the combined ranks, ri , for both groups: n s2 = 1 X (ri − r)2 n − 1 i=1 1806 ranksum — Equality tests on unmatched data This formula for the variance is exact and holds both when there are no ties and when there are ties and we use averaged ranks. (Indeed, the variance formula holds for the randomization distribution of choosing n1 numbers from any set of n numbers.) Using a normal approximation, we calculate T − E(T ) z= p Var(T ) When the porder option is specified, the probability p= U n1 n2 is computed. Technical note We follow the great majority of the literature in naming these tests for Wilcoxon, Mann, and Whitney. However, they were independently developed by several other researchers in the late 1940s and early 1950s. In addition to Wilcoxon, Mann, and Whitney, credit is due to Festinger (1946), Whitfield (1947), Haldane and Smith (1947), and Van der Reyden (1952). Leon Festinger (1919–1989), John Burdon Sanderson Haldane (1892–1964), and Cedric Austen Bardell Smith (1917–2002) are well known for other work, but little seems to be known about Whitfield or van der Reyden. For a detailed study, including information on these researchers, see Berry, Mielke, and Johnston (2012). median The median test examines whether it is likely that two or more samples came from populations with the same median. The null hypothesis is that the samples were drawn from populations with the same median. 
The alternative hypothesis is that at least one sample was drawn from a population with a different median. The test should be used only with ordinal or interval data. Assume that there are score values for k independent samples to be compared. The median test is performed by first computing the median score for all observations combined, regardless of the sample group. Each score is compared with this computed grand median and is classified as being above the grand median, below the grand median, or equal to the grand median. Observations with scores equal to the grand median can be dropped, added to the “above” group, added to the “below” group, or split between the two groups. Once all observations are classified, the data are cast into a 2 × k contingency table, and a Pearson’s chi-squared test or Fisher’s exact test is performed. ranksum — Equality tests on unmatched data 1807 Henry Berthold Mann (1905–2000) was born in Vienna, Austria, where he completed a doctorate in algebraic number theory. He moved to the United States in 1938 and for several years made his livelihood by tutoring in New York. During this time, he proved a celebrated conjecture in number theory and studied statistics at Columbia with Abraham Wald, with whom he wrote three papers. After the war, he taught at Ohio State and the Universities of Wisconsin and Arizona. In addition to his work in number theory and statistics, he made major contributions to algebra and combinatorics. Donald Ransom Whitney (1915–2007) studied at Oberlin, Princeton, and Ohio State Universities and worked at the latter throughout his career. His PhD thesis under Henry Mann was on nonparametric statistics. It was this work that produced the test that bears their names. References Acock, A. C. 2014. A Gentle Introduction to Stata. 4th ed. College Station, TX: Stata Press. Berry, K. J., P. W. Mielke, Jr., and J. E. Johnston. 2012. The two-sample rank-sum test: Early development. Electronic Journal for History of Probability and Statistics 8: 1–26. http://www.jehps.net/decembre2012/BerryMielkeJohnston.pdf. Bland, M. 2000. An Introduction to Medical Statistics. 3rd ed. Oxford: Oxford University Press. Conroy, R. M. 2012. What hypotheses do nonparametric two-group tests actually test? Stata Journal 12: 182–190. Feiveson, A. H. 2002. Power by simulation. Stata Journal 2: 107–124. Festinger, L. 1946. The significance of difference between means without reference to the frequency distribution function. Psychometrika 11: 97–105. Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver & Boyd. Goldstein, R. 1997. sg69: Immediate Mann–Whitney and binomial effect-size display. Stata Technical Bulletin 36: 29–31. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 187–189. College Station, TX: Stata Press. Haldane, J. B. S., and C. A. B. Smith. 1947. A simple exact test for birth-order effect. Annals of Human Genetics 14: 117–124. Harris, T., and J. W. Hardin. 2013. Exact Wilcoxon signed-rank and Wilcoxon Mann–Whitney ranksum tests. Stata Journal 13: 337–343. Kruskal, W. H. 1957. Historical notes on the Wilcoxon unpaired two-sample test. Journal of the American Statistical Association 52: 356–360. Mann, H. B., and D. R. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics 18: 50–60. Newson, R. B. 2000a. snp15: somersd—Confidence intervals for nonparametric statistics and their differences. Stata Technical Bulletin 55: 47–55. 
Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 312–322. College Station, TX: Stata Press. . 2000b. snp15.1: Update to somersd. Stata Technical Bulletin 57: 35. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 322–323. College Station, TX: Stata Press. . 2000c. snp15.2: Update to somersd. Stata Technical Bulletin 58: 30. Reprinted in Stata Technical Bulletin Reprints, vol. 10, p. 323. College Station, TX: Stata Press. . 2001. snp15.3: Update to somersd. Stata Technical Bulletin 61: 22. Reprinted in Stata Technical Bulletin Reprints, vol. 10, p. 324. College Station, TX: Stata Press. . 2003. snp15 4: Software update for somersd. Stata Journal 3: 325. . 2005. snp15 5: Software update for somersd. Stata Journal 5: 470. Perkins, A. M. 1998. snp14: A two-sample multivariate nonparametric test. Stata Technical Bulletin 42: 47–49. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 243–245. College Station, TX: Stata Press. Snedecor, G. W., and W. G. Cochran. 1989. Statistical Methods. 8th ed. Ames, IA: Iowa State University Press. 1808 ranksum — Equality tests on unmatched data Sprent, P., and N. C. Smeeton. 2007. Applied Nonparametric Statistical Methods. 4th ed. Boca Raton, FL: Chapman & Hall/CRC. Sribney, W. M. 1995. crc40: Correcting for ties and zeros in sign and rank tests. Stata Technical Bulletin 26: 2–4. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 5–8. College Station, TX: Stata Press. Van der Reyden, D. 1952. A simple statistical significance test. Rhodesia Agricultural Journal 49: 96–104. Whitfield, J. W. 1947. Rank correlation between two variables, one of which is ranked, the other dichotomous. Biometrika 34: 292–296. Wilcoxon, F. 1945. Individual comparisons by ranking methods. Biometrics 1: 80–83. Also see [R] signrank — Equality tests on matched data [R] ttest — t tests (mean-comparison tests) Title ratio — Estimate ratios Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax Basic syntax ratio name: varname / varname Full syntax ratio ( name: varname / varname) ( name: varname / varname) . . . if in weight , options options Description Model stdize(varname) stdweight(varname) nostdrescale variable identifying strata for standardization weight variable for standardization do not rescale the standard weight variable if/in/over over(varlist , nolabel ) group over subpopulations defined by varlist; optionally, suppress group labels SE/Cluster vce(vcetype) vcetype may be linearized, cluster clustvar, bootstrap, or jackknife Reporting level(#) noheader nolegend display options set confidence level; default is level(95) suppress table header suppress table legend control column formats and line width coeflegend display legend instead of statistics bootstrap, jackknife, mi estimate, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce() and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 
1809 1810 ratio — Estimate ratios Menu Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Ratios Description ratio produces estimates of ratios, along with standard errors. Options Model stdize(varname) specifies that the point estimates be adjusted by direct standardization across the strata identified by varname. This option requires the stdweight() option. stdweight(varname) specifies the weight variable associated with the standard strata identified in the stdize() option. The standardization weights must be constant within the standard strata. nostdrescale prevents the standardization weights from being rescaled within the over() groups. This option requires stdize() but is ignored if the over() option is not specified. if/in/over over(varlist , nolabel ) specifies that estimates be computed for multiple subpopulations, which are identified by the different values of the variables in varlist. When this option is supplied with one variable name, such as over(varname), the value labels of varname are used to identify the subpopulations. If varname does not have labeled values (or there are unlabeled values), the values themselves are used, provided that they are nonnegative integers. Noninteger values, negative values, and labels that are not valid Stata names are substituted with a default identifier. When over() is supplied with multiple variable names, each subpopulation is assigned a unique default identifier. nolabel requests that value labels attached to the variables identifying the subpopulations be ignored. SE/Cluster vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (linearized), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. vce(linearized), the default, uses the linearized or sandwich estimator of variance. Reporting level(#); see [R] estimation options. noheader prevents the table header from being displayed. This option implies nolegend. nolegend prevents the table legend identifying the subpopulations from being displayed. display options: cformat(% fmt) and nolstretch; see [R] estimation options. The following option is available with ratio but is not shown in the dialog box: coeflegend; see [R] estimation options. ratio — Estimate ratios 1811 Remarks and examples Example 1 Using the fuel data from example 3 of [R] ttest, we estimate the ratio of mileage for the cars without the fuel treatment (mpg1) to those with the fuel treatment (mpg2). . use http://www.stata-press.com/data/r13/fuel . ratio myratio: mpg1/mpg2 Ratio estimation myratio: mpg1/mpg2 Ratio myratio .9230769 Number of obs Linearized Std. Err. .032493 = 12 [95% Conf. Interval] .8515603 .9945936 Using these results, we can test to see if this ratio is significantly different from one. . test _b[myratio] = 1 ( 1) myratio = 1 F( 1, 11) = Prob > F = 5.60 0.0373 We find that the ratio is different from one at the 5% significance level but not at the 1% significance level. Example 2 Using state-level census data, we want to test whether the marriage rate is equal to the death rate. . use http://www.stata-press.com/data/r13/census2 (1980 Census data by state) . ratio (deathrate: death/pop) (marrate: marriage/pop) Ratio estimation Number of obs = deathrate: death/pop marrate: marriage/pop Ratio deathrate marrate .0087368 .0105577 Linearized Std. Err. .0002052 .0006184 . 
test _b[deathrate] = _b[marrate] ( 1) deathrate - marrate = 0 F( 1, 49) = 6.93 Prob > F = 0.0113 50 [95% Conf. Interval] .0083244 .009315 .0091492 .0118005 1812 ratio — Estimate ratios Stored results ratio stores the following in e(): Scalars e(N) e(N over) e(N stdize) e(N clust) e(k eq) e(df r) e(rank) Macros e(cmd) e(cmdline) e(varlist) e(stdize) e(stdweight) e(wtype) e(wexp) e(title) e(cluster) e(over) e(over labels) e(over namelist) e(namelist) e(vce) e(vcetype) e(properties) e(estat cmd) e(marginsnotok) Matrices e(b) e(V) e( N) e( N stdsum) e( p stdize) e(error) Functions e(sample) number of observations number of subpopulations number of standard strata number of clusters number of equations in e(b) sample degrees of freedom rank of e(V) ratio command as typed varlist varname from stdize() varname from stdweight() weight type weight expression title in estimation output name of cluster variable varlist from over() labels from over() variables names from e(over labels) ratio identifiers vcetype specified in vce() title used to label Std. Err. b V program used to implement estat predictions disallowed by margins vector of ratio estimates (co)variance estimates vector of numbers of nonmissing observations number of nonmissing observations within the standard strata standardizing proportions error code corresponding to e(b) marks estimation sample Methods and formulas Methods and formulas are presented under the following headings: The ratio estimator Survey data The survey ratio estimator The standardized ratio estimator The poststratified ratio estimator The standardized poststratified ratio estimator Subpopulation estimation ratio — Estimate ratios 1813 The ratio estimator Let R = Y /X be the ratio to be estimated, where Y and X are totals; see [R] total. The estimate b = Yb /X b (the ratio of the sample totals). From the delta method (that is, a first-order for R is R b is Taylor expansion), the approximate variance of the sampling distribution of the linearized R b ≈ V (R) o 1 n b b + R2 V (X) b V (Y ) − 2RCov(Yb , X) 2 X b, R b, and the estimated variances and covariance of X b and Yb leads to the Direct substitution of X following variance estimator: n o bCov( d Yb , X) b +R b2 Vb (X) b b = 1 Vb (Yb ) − 2R Vb (R) b2 X (1) Survey data See [SVY] variance estimation, [SVY] direct standardization, and [SVY] poststratification for discussions that provide background information for the following formulas. The survey ratio estimator Let Yj and Xj be survey items for the j th individual in the population, where j = 1, . . . , M and M is the size of the population. The associated population ratio for the items of interest is R = Y /X where M M X X Y = Yj and X= Xj j=1 j=1 Let yj and xj be the corresponding survey items for the j th sampled individual from the population, where j = 1, . . . , m and m is the number of observations in the sample. b for the population ratio R is R b = Yb /X b , where The estimator R Yb = m X j=1 wj yj and b= X m X w j xj j=1 and wj is a sampling weight. The score variable for the ratio estimator is b b b b = yj − Rxj = Xyj − Y xj zj (R) b b2 X X 1814 ratio — Estimate ratios The standardized ratio estimator Let Dg denote the set of sampled observations that belong to the g th standard stratum and define IDg (j) to indicate if the j th observation is a member of the g th standard stratum; where g = 1, . . . , LD and LD is the number of standard strata. 
Also, let πg denote the fraction of the population that belongs to the g th standard stratum, thus π1 + · · · + πLD = 1. Note that πg is derived from the stdweight() option. The estimator for the standardized ratio is LD X bD = R πg g=1 where Ybg = m X Ybg bg X IDg (j) wj yj j=1 bg is similarly defined. The score variable for the standardized ratio is and X bD ) = zj (R LD X πg IDg (j) g=1 bg yj − Ybg xj X bg2 X The poststratified ratio estimator Let Pk denote the set of sampled observations that belong to poststratum k , and define IPk (j) to indicate if the j th observation is a member of poststratum k , where k = 1, . . . , LP and LP is the number of poststrata. Also, let Mk denote the population size for poststratum k . Pk and Mk are identified by specifying the poststrata() and postweight() options on svyset; see [SVY] svyset. The estimator for the poststratified ratio is bP bP = Y R bP X where Yb P = LP X Mk k=1 ck M Ybk = LP m X Mk X k=1 ck M IPk (j) wj yj j=1 b P is similarly defined. The score variable for the poststratified ratio is and X bP ) = zj (R bP zj (X bP ) b P zj (Yb P ) − Yb P zj (X bP ) zj (Yb P ) − R X = b P )2 bP (X X where bP zj (Y ) = LP X k=1 b P ) is similarly defined. and zj (X Mk IPk (j) ck M Ybk yj − ck M ! ratio — Estimate ratios 1815 The standardized poststratified ratio estimator The estimator for the standardized poststratified ratio is bDP = R LD X πg g g=1 where YbgP = Lp X Mk k=1 ck M Ybg,k = YbgP bP X Lp m X Mk X k=1 ck M IDg (j)IPk (j) wj yj j=1 bgP is similarly defined. The score variable for the standardized poststratified ratio is and X bDP ) = zj (R LD X πg g=1 where zj (YbgP ) = LP X k=1 and bP ) zj (X g bgP ) bgP zj (YbgP ) − YbgP zj (X X bgP )2 (X Mk IPk (j) ck M ( Ybg,k IDg (j)yj − ck M ) is similarly defined. Subpopulation estimation Let S denote the set of sampled observations that belong to the subpopulation of interest, and define IS (j) to indicate if the j th observation falls within the subpopulation. bS = Yb S /X b S , where The estimator for the subpopulation ratio is R Yb S = m X IS (j) wj yj bS = X and j=1 m X IS (j) wj xj j=1 Its score variable is bS bS bS bS ) = IS (j) yj − R xj = IS (j) X yj − Y xj zj (R bS b S )2 X (X The estimator for the standardized subpopulation ratio is bDS = R LD X g=1 where YbgS = m X πg YbgS bS X g IDg (j)IS (j) wj yj j=1 b S is similarly defined. Its score variable is and X g bDS ) = zj (R LD X g=1 πg IDg (j)IS (j) bgS yj − YbgS xj X bgS )2 (X 1816 ratio — Estimate ratios The estimator for the poststratified subpopulation ratio is bPS bP S = Y R bPS X where Yb P S = LP X Mk k=1 ck M YbkS = LP m X Mk X ck M k=1 IPk (j)IS (j) wj yj j=1 b P S is similarly defined. Its score variable is and X bPS b PS bPS bPS bP S ) = X zj (Y ) − Y zj (X ) zj (R b P S )2 (X where bPS zj ( Y )= LP X k=1 Mk IPk (j) ck M ( Yb S IS (j) yj − k ck M ) b P S ) is similarly defined. and zj (X The estimator for the standardized poststratified subpopulation ratio is bDP S = R LD X πg g=1 where YbgP S = Lp X Mk k=1 ck M S Ybg,k = Lp m X Mk X k=1 ck M YbgP S bgP S X IDg (j)IPk (j)IS (j) wj yj j=1 bgP S is similarly defined. Its score variable is and X bDP S ) = zj ( R LD X g=1 where zj (YbgP S ) = LP X k=1 and bgP S ) zj (X πg bgP S zj (YbgP S ) − YbgP S zj (X bgP S ) X bgP S )2 (X Mk IPk (j) ck M ( S Ybg,k IDg (j)IS (j) yj − ck M ) is similarly defined. References Cochran, W. G. 1977. Sampling Techniques. 3rd ed. New York: Wiley. Stuart, A., and J. K. Ord. 1994. Kendall’s Advanced Theory of Statistics: Distribution Theory, Vol I. 
6th ed. London: Arnold. ratio — Estimate ratios Also see [R] ratio postestimation — Postestimation tools for ratio [R] mean — Estimate means [R] proportion — Estimate proportions [R] total — Estimate totals [MI] estimation — Estimation commands for use with mi estimate [SVY] direct standardization — Direct standardization of means, proportions, and ratios [SVY] poststratification — Poststratification for survey data [SVY] subpopulation estimation — Subpopulation estimation for survey data [SVY] svy estimation — Estimation commands for survey data [SVY] variance estimation — Variance estimation for survey data [U] 20 Estimation and postestimation commands 1817 Title ratio postestimation — Postestimation tools for ratio Description Remarks and examples Also see Description The following postestimation commands are available after ratio: Command Description estat vce estat (svy) estimates lincom variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses nlcom test testnl Remarks and examples For examples of the use of test after ratio, see [R] ratio. Also see [R] ratio — Estimate ratios [U] 20 Estimation and postestimation commands 1818 Title reg3 — Three-stage estimation for systems of simultaneous equations Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax Basic syntax reg3 (depvar1 varlist1 ) (depvar2 varlist2 ) . . . (depvarN varlistN ) if in weight Full syntax reg3 ( eqname1 : depvar1a depvar1b . . . = varlist1 , noconstant ) ( eqname2 : depvar2a depvar2b . . . = varlist2 , noconstant ) ... ( eqnameN : depvarN a depvarN b . . . = varlistN , noconstant ) if in weight , options options Description Model ireg3 constraints(constraints) iterate until estimates converge apply specified linear constraints Model 2 exog(varlist) endog(varlist) inst(varlist) allexog noconstant exogenous variables not specified in system equations additional right-hand-side endogenous variables full list of exogenous variables all right-hand-side variables are exogenous suppress constant from instrument list Est. method 3sls 2sls ols sure mvreg corr(correlation) three-stage least squares; the default two-stage least squares ordinary least squares (OLS) seemingly unrelated regression estimation (SURE) sure with OLS degrees-of-freedom adjustment unstructured or independent correlation structure; default is unstructured df adj. small dfk dfk2 report small-sample statistics use small-sample adjustment use alternate adjustment 1819 1820 reg3 — Three-stage estimation for systems of simultaneous equations Reporting level(#) first nocnsreport display options set confidence level; default is level(95) report first-stage regression do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Optimization optimization options control the optimization process; seldom used noheader notable nofooter coeflegend suppress display of header suppress display of coefficient table suppress display of footer display legend instead of statistics varlist1 , . . . 
, varlistN and the exog() and the inst() varlist may contain factor variables; see [U] 11.4.3 Factor variables. You must have the same levels of factor variables in all equations that have factor variables. depvar and varlist may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, rolling, and statsby are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. aweights and fweights are allowed; see [U] 11.1.6 weight. noheader, notable, nofooter, and coeflegend do not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Explicit equation naming (eqname:) cannot be combined with multiple dependent variables in an equation specification. Menu Statistics > Endogenous covariates > Three-stage least squares Description reg3 estimates a system of structural equations, where some equations contain endogenous variables among the explanatory variables. Estimation is via three-stage least squares (3SLS); see Zellner and Theil (1962). Typically, the endogenous explanatory variables are dependent variables from other equations in the system. reg3 supports iterated GLS estimation and linear constraints. reg3 can also estimate systems of equations by seemingly unrelated regression estimation (SURE), multivariate regression (MVREG), and equation-by-equation ordinary least squares (OLS) or two-stage least squares (2SLS). Nomenclature Under 3SLS or 2SLS estimation, a structural equation is defined as one of the equations specified in the system. A dependent variable will have its usual interpretation as the left-hand-side variable in an equation with an associated disturbance term. All dependent variables are explicitly taken to be endogenous to the system and are treated as correlated with the disturbances in the system’s equations. Unless specified in an endog() option, all other variables in the system are treated as exogenous to the system and uncorrelated with the disturbances. The exogenous variables are taken to be instruments for the endogenous variables. reg3 — Three-stage estimation for systems of simultaneous equations 1821 Options Model ireg3 causes reg3 to iterate over the estimated disturbance covariance matrix and parameter estimates until the parameter estimates converge. Although the iteration is usually successful, there is no guarantee that it will converge to a stable point. Under SURE, this iteration converges to the maximum likelihood estimates. constraints(constraints); see [R] estimation options. Model 2 exog(varlist) specifies additional exogenous variables that are included in none of the system equations. This can occur when the system contains identities that are not estimated. If implicitly exogenous variables from the equations are listed here, reg3 will just ignore the additional information. Specified variables will be added to the exogenous variables in the system and used in the first stage as instruments for the endogenous variables. By specifying dependent variables from the structural equations, you can use exog() to override their endogeneity. endog(varlist) identifies variables in the system that are not dependent variables but are endogenous to the system. These variables must appear in the variable list of at least one equation in the system. Again the need for this identification often occurs when the system contains identities. 
For example, a variable that is the sum of an exogenous variable and a dependent variable may appear as an explanatory variable in some equations. inst(varlist) specifies a full list of all exogenous variables and may not be used with the endog() or exog() options. It must contain a full list of variables to be used as instruments for the endogenous regressors. Like exog(), the list may contain variables not specified in the system of equations. This option can be used to achieve the same results as the endog() and exog() options, and the choice is a matter of convenience. Any variable not specified in the varlist of the inst() option is assumed to be endogenous to the system. As with exog(), including the dependent variables from the structural equations will override their endogeneity. allexog indicates that all right-hand-side variables are to be treated as exogenous—even if they appear as the dependent variable of another equation in the system. This option can be used to enforce a SURE or MVREG estimation even when some dependent variables appear as regressors. noconstant; see [R] estimation options. Est. method 3sls specifies the full 3SLS estimation of the system and is the default for reg3. 2sls causes reg3 to perform equation-by-equation 2SLS on the full system of equations. This option implies dfk, small, and corr(independent). Cross-equation testing should not be performed after estimation with this option. With 2sls, no covariance is estimated between the parameters of the equations. For cross-equation testing, use 3sls. ols causes reg3 to perform equation-by-equation OLS on the system—even if dependent variables appear as regressors or the regressors differ for each equation; see [MV] mvreg. ols implies allexog, dfk, small, and corr(independent); nodfk and nosmall may be specified to override dfk and small. The covariance of the coefficients between equations is not estimated under this option, and cross-equation tests should not be performed after estimation with ols. For cross-equation testing, use sure or 3sls (the default). 1822 reg3 — Three-stage estimation for systems of simultaneous equations sure causes reg3 to perform a SURE of the system—even if dependent variables from some equations appear as regressors in other equations; see [R] sureg. sure is a synonym for allexog. mvreg is identical to sure, except that the disturbance covariance matrix is estimated with an OLS degrees-of-freedom adjustment—the dfk option. If the regressors are identical for all equations, the parameter point estimates will be the standard MVREG results. If any of the regressors differ, the point estimates are those for SURE with an OLS degrees-of-freedom adjustment in computing the covariance matrix. nodfk and nosmall may be specified to override dfk and small. corr(correlation) specifies the assumed form of the correlation structure of the equation disturbances and is rarely requested explicitly. For the family of models fit by reg3, the only two allowable correlation structures are unstructured and independent. The default is unstructured. This option is used almost exclusively to estimate a system of equations by 2SLS or to perform OLS regression with reg3 on multiple equations. In these cases, the correlation is set to independent, forcing reg3 to treat the covariance matrix of equation disturbances as diagonal in estimating model parameters. 
Thus a set of two-stage coefficient estimates can be obtained if the system contains endogenous right-hand-side variables, or OLS regression can be imposed, even if the regressors differ across equations. Without imposing independent disturbances, reg3 would estimate the former by 3SLS and the latter by SURE.

Any tests performed after estimation with the independent option will treat coefficients in different equations as having no covariance; cross-equation tests should not be used after specifying corr(independent).

df adj.

small specifies that small-sample statistics be computed. It shifts the test statistics from χ2 and z statistics to F statistics and t statistics. This option is intended primarily to support MVREG. Although the standard errors from each equation are computed using the degrees of freedom for the equation, the degrees of freedom for the t statistics are all taken to be those for the first equation. This approach poses no problem under MVREG because the regressors are the same across equations.

dfk specifies the use of an alternative divisor in computing the covariance matrix for the equation residuals. As an asymptotically justified estimator, reg3 by default uses the number of sample observations n as the divisor. When the dfk option is set, a small-sample adjustment is made, and the divisor is taken to be √{(n − ki)(n − kj)}, where ki and kj are the numbers of parameters in equations i and j, respectively.

dfk2 specifies the use of an alternative divisor in computing the covariance matrix for the equation errors. When the dfk2 option is set, the divisor is taken to be the mean of the residual degrees of freedom from the individual equations.

Reporting

level(#); see [R] estimation options.

first requests that the first-stage regression results be displayed during estimation.

nocnsreport; see [R] estimation options.

display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(%fmt), pformat(%fmt), sformat(%fmt), and nolstretch; see [R] estimation options.

Optimization

optimization options control the iterative process that minimizes the sum of squared errors when ireg3 is specified. These options are seldom used.

iterate(#) specifies the maximum number of iterations. When the number of iterations equals #, the optimizer stops and presents the current results, even if the convergence tolerance has not been reached. The default value of iterate() is the current value of set maxiter (see [R] maximize), which is iterate(16000) if maxiter has not been changed.

trace adds to the iteration log a display of the current parameter vector.

nolog suppresses the display of the iteration log.

tolerance(#) specifies the tolerance for the coefficient vector. When the relative change in the coefficient vector from one iteration to the next is less than or equal to #, the optimization process is stopped. tolerance(1e-6) is the default.

The following options are available with reg3 but are not shown in the dialog box:

noheader suppresses display of the header reporting the estimation method and the table of equation summary statistics.

notable suppresses display of the coefficient table.

nofooter suppresses display of the footer reporting the list of endogenous and exogenous variables in the model.

coeflegend; see [R] estimation options.
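As an illustration of how the optimization options are typically combined with ireg3, the commands below sketch a fit with a tighter convergence criterion and a cap on the number of iterations. This is only a hedged sketch: it assumes the Klein consumption–wage system used in example 1 of the Remarks below, and the particular tolerance and iteration limit are arbitrary choices made for illustration.

. use http://www.stata-press.com/data/r13/klein
. reg3 (consump wagepriv wagegovt) (wagepriv consump govt capital1), ireg3 tolerance(1e-8) iterate(100) nolog

If the relative change in the coefficient vector has not fallen below 1e–8 within 100 iterations, reg3 stops and reports the current estimates, as described under iterate() above.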
Remarks and examples reg3 estimates systems of structural equations where some equations contain endogenous variables among the explanatory variables. Generally, these endogenous variables are the dependent variables of other equations in the system, though not always. The disturbance is correlated with the endogenous variables—violating the assumptions of OLS. Further, because some of the explanatory variables are the dependent variables of other equations in the system, the error terms among the equations are expected to be correlated. reg3 uses an instrumental-variables approach to produce consistent estimates and generalized least squares (GLS) to account for the correlation structure in the disturbances across the equations. Good general references on three-stage estimation include Davidson and MacKinnon (1993, 651–661) and Greene (2012, 331–334). Three-stage least squares can be thought of as producing estimates from a three-step process. Step 1. Develop instrumented values for all endogenous variables. These instrumented values can simply be considered as the predicted values resulting from a regression of each endogenous variable on all exogenous variables in the system. This stage is identical to the first step in 2SLS and is critical for the consistency of the parameter estimates. Step 2. Obtain a consistent estimate for the covariance matrix of the equation disturbances. These estimates are based on the residuals from a 2SLS estimation of each structural equation. Step 3. Perform a GLS-type estimation using the covariance matrix estimated in the second stage and with the instrumented values in place of the right-hand-side endogenous variables. 1824 reg3 — Three-stage estimation for systems of simultaneous equations Technical note The estimation and use of the covariance matrix of disturbances in three-stage estimation is almost identical to the SURE method—sureg. As with SURE, using this covariance matrix improves the efficiency of the three-stage estimator. Even without the covariance matrix, the estimates would be consistent. (They would be 2SLS estimates.) This improvement in efficiency comes with a caveat. All the parameter estimates now depend on the consistency of the covariance matrix estimates. If one equation in the system is misspecified, the disturbance covariance estimates will be inconsistent, and the resulting coefficients will be biased and inconsistent. Alternatively, if each equation is estimated separately by 2SLS ([R] regress), only the coefficients in the misspecified equation are affected. Technical note If an equation is just identified, the 3SLS point estimates for that equation are identical to the 2SLS estimates. However, as with sureg, even if all equations are just identified, fitting the model via reg3 has at least one advantage over fitting each equation separately via ivregress; by using reg3, tests involving coefficients in different equations can be performed easily using test or testnl. Example 1 A simple macroeconomic model relates consumption (consump) to private and government wages paid (wagepriv and wagegovt). Simultaneously, private wages depend on consumption, total government expenditures (govt), and the lagged stock of capital in the economy (capital1). Although this is not a plausible model, it does meet the criterion of being simple. 
This model could be written as consump = β0 + β1 wagepriv + β2 wagegovt + 1 wagepriv = β3 + β4 consump + β5 govt + β6 capital1 + 2 If we assume that this is the full system, consump and wagepriv will be endogenous variables, with wagegovt, govt, and capital1 exogenous. Data for the U.S. economy on these variables are taken from Klein (1950). This model can be fit with reg3 by typing reg3 — Three-stage estimation for systems of simultaneous equations 1825 . use http://www.stata-press.com/data/r13/klein . reg3 (consump wagepriv wagegovt) (wagepriv consump govt capital1) Three-stage least-squares regression Equation Obs Parms RMSE "R-sq" chi2 P consump wagepriv 22 22 2 3 1.776297 2.372443 0.9388 0.8542 208.02 80.04 0.0000 0.0000 Coef. Std. Err. z P>|z| [95% Conf. Interval] consump wagepriv wagegovt _cons .8012754 1.029531 19.3559 .1279329 .3048424 3.583772 6.26 3.38 5.40 0.000 0.001 0.000 .5505314 .432051 12.33184 1.052019 1.627011 26.37996 wagepriv consump govt capital1 _cons .4026076 1.177792 -.0281145 14.63026 .2567312 .5421253 .0572111 10.26693 1.57 2.17 -0.49 1.42 0.117 0.030 0.623 0.154 -.1005764 .1152461 -.1402462 -5.492552 .9057916 2.240338 .0840173 34.75306 Endogenous variables: Exogenous variables: consump wagepriv wagegovt govt capital1 Without showing the 2SLS results, we note that the consumption function in this system falls under the conditions noted earlier. That is, the 2SLS and 3SLS coefficients for the equation are identical. Example 2 Some of the most common simultaneous systems encountered are supply-and-demand models. A simple system could be specified as qDemand = β0 + β1 price + β2 pcompete + β3 income + 1 qSupply = β4 + β5 price + β6 praw + 2 Equilibrium condition: quantity = qDemand = qSupply where quantity is the quantity of a product produced and sold, price is the price of the product, pcompete is the price of a competing product, income is the average income level of consumers, and praw is the price of raw materials used to produce the product. In this system, price is assumed to be determined simultaneously with demand. The important statistical implications are that price is not a predetermined variable and that it is correlated with the disturbances of both equations. The system is somewhat unusual: quantity is associated with two disturbances. This fact really poses no problem because the disturbances are specified on the behavioral demand and supply equations—two separate entities. Often one of the two equations is rewritten to place price on the left-hand side, making this endogeneity explicit in the specification. 1826 reg3 — Three-stage estimation for systems of simultaneous equations To provide a concrete illustration of the effects of simultaneous equations, we can simulate data for the above system by using known coefficients and disturbance properties. Specifically, we will simulate the data as qDemand = 40 − 1.0 price + 0.25 pcompete + 0.5 income + 1 qSupply = 0.5 price − 0.75 praw + 2 where 1 ∼ N (0, 3.8) 2 ∼ N (0, 2.4) For comparison, we can estimate the supply and demand equations separately by OLS. The estimates for the demand equation are . use http://www.stata-press.com/data/r13/supDem . regress quantity price pcompete income SS df MS Source Model Residual 23.1579302 346.459313 3 45 7.71931008 7.69909584 Total 369.617243 48 7.70035923 quantity Coef. price pcompete income _cons .1186265 .0946416 .0785339 7.563261 Std. Err. 
t P>|t| .1716014 .1200815 .1159867 5.019479 0.69 0.79 0.68 1.51 0.493 0.435 0.502 0.139 Number of obs F( 3, 45) Prob > F R-squared Adj R-squared Root MSE = = = = = = 49 1.00 0.4004 0.0627 0.0002 2.7747 [95% Conf. Interval] -.2269965 -.1472149 -.1550754 -2.54649 .4642496 .3364981 .3121432 17.67301 The OLS estimates for the supply equation are . regress quantity price praw Source SS df MS Model Residual 224.819549 144.797694 2 46 112.409774 3.14777596 Total 369.617243 48 7.70035923 quantity Coef. price praw _cons .724675 -.8674796 -6.97291 Std. Err. .1095657 .1066114 3.323105 t 6.61 -8.14 -2.10 Number of obs F( 2, 46) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.000 0.041 = = = = = = 49 35.71 0.0000 0.6082 0.5912 1.7742 [95% Conf. Interval] .5041307 -1.082077 -13.66197 .9452192 -.652882 -.283847 Examining the coefficients from these regressions, we note that they are not close to the known parameters used to generate the simulated data. In particular, the positive coefficient on price in the demand equation stands out. We constructed our simulated data to be consistent with economic theory—people demand less of a product if its price rises and more if their personal income rises. Although the price coefficient is statistically insignificant, the positive value contrasts starkly with what is predicted from economic price theory and the −1.0 value that we used in the simulation. Likewise, we are disappointed with the insignificance and level of the coefficient on average income. The supply equation has correct signs on the two main parameters, but their levels are different from the known values. In fact, the coefficient on price (0.724675) is different from the simulated parameter (0.5) at the 5% level of significance. reg3 — Three-stage estimation for systems of simultaneous equations 1827 All of these problems are to be expected. We explicitly constructed a simultaneous system of equations that violated one of the assumptions of least squares. Specifically, the disturbances were correlated with one of the regressors—price. Two-stage least squares can be used to address the correlation between regressors and disturbances. Using instruments for the endogenous variable, price, 2SLS will produce consistent estimates of the parameters in the system. Let’s use ivregress (see [R] ivregress) to see how our simulated system behaves when fit using 2SLS. . ivregress 2sls quantity (price = praw) pcompete income Instrumental variables (2SLS) regression Number of obs Wald chi2(3) Prob > chi2 R-squared Root MSE quantity Coef. price pcompete income _cons -1.015817 .3319504 .5090607 39.89988 Std. Err. .374209 .172912 .1919482 10.77378 z -2.71 1.92 2.65 3.70 P>|z| 0.007 0.055 0.008 0.000 Coef. price praw _cons .5773133 -.7835496 -2.550694 Instrumented: Instruments: Std. Err. .1749974 .1312414 5.273067 z 3.30 -5.97 -0.48 P>|z| 0.001 0.000 0.629 49 8.77 0.0326 . 3.7333 [95% Conf. Interval] -1.749253 -.0069508 .1328491 18.78366 Instrumented: price Instruments: pcompete income praw . ivregress 2sls quantity (price = pcompete income) praw Instrumental variables (2SLS) regression Number of obs Wald chi2(2) Prob > chi2 R-squared Root MSE quantity = = = = = -.282381 .6708517 .8852723 61.01611 = = = = = 49 39.25 0.0000 0.5928 1.7525 [95% Conf. Interval] .2343247 -1.040778 -12.88571 .9203019 -.5263213 7.784327 price praw pcompete income We are now much happier with the estimation results. All the coefficients from both equations are close to the true parameter values for the system. 
In particular, the coefficients are all well within 95% confidence intervals for the parameters. The missing R-squared in the demand equation seems unusual; we will discuss that more later. Finally, this system could be estimated using 3SLS. To demonstrate how large systems might be handled and to avoid multiline commands, we will use global macros (see [P] macro) to hold the specifications for our equations. . global demand "(qDemand: quantity price pcompete income)" . global supply "(qSupply: quantity price praw)" . reg3 $demand $supply, endog(price) We must specify price as endogenous because it does not appear as a dependent variable in either equation. Without this option, reg3 would assume that there are no endogenous variables in the system and produce seemingly unrelated regression (sureg) estimates. The reg3 output from our series of commands is 1828 reg3 — Three-stage estimation for systems of simultaneous equations Three-stage least-squares regression Equation Obs Parms RMSE "R-sq" chi2 P qDemand qSupply 49 49 3 2 3.739686 1.752501 -0.8540 0.5928 8.68 39.25 0.0338 0.0000 Coef. Std. Err. z P>|z| [95% Conf. Interval] qDemand price pcompete income _cons -1.014345 .2647206 .5299146 40.08749 .3742036 .1464194 .1898161 10.77072 -2.71 1.81 2.79 3.72 0.007 0.071 0.005 0.000 -1.74777 -.0222561 .1578819 18.97726 -.2809194 .5516973 .9019472 61.19772 qSupply price praw _cons .5773133 -.7835496 -2.550694 .1749974 .1312414 5.273067 3.30 -5.97 -0.48 0.001 0.000 0.629 .2343247 -1.040778 -12.88571 .9203019 -.5263213 7.784327 Endogenous variables: Exogenous variables: quantity price pcompete income praw The use of 3SLS over 2SLS is essentially an efficiency issue. The coefficients of the demand equation from 3SLS are close to the coefficients from two-stage least squares, and those of the supply equation are identical. The latter case was mentioned earlier for systems with some exactly identified equations. However, even for the demand equation, we do not expect the coefficients to change systematically. What we do expect from three-stage least squares are more precise estimates of the parameters given the validity of our specification and reg3’s use of the covariances among the disturbances. Let’s summarize the results. With OLS, we got obviously biased estimates of the parameters. No amount of data would have improved the OLS estimates—they are inconsistent in the face of the violated OLS assumptions. With 2SLS, we obtained consistent estimates of the parameters, and these would have improved with more data. With 3SLS, we obtained consistent estimates of the parameters that are more efficient than those obtained by 2SLS. Technical note We noted earlier that the R-squared was missing from the two-stage estimates of the demand equation. Now we see that the R-squared is negative for the three-stage estimates of the same equation. How can we have a negative R-squared? In most estimators, other than least squares, the R-squared is no more than a summary measure of the overall in-sample predictive power of the estimator. The computational formula for R-squared is R-squared = 1 − RSS/TSS, where RSS is the residual sum of squares (sum of squared residuals) and TSS is the total sum of squared deviations about the mean of the dependent variable. In a standard linear model with a constant, the model from which the TSS is computed is nested within the full model from which RSS is computed—they both have a constant term based on the same data. 
Thus it must be that TSS ≥ RSS and R-squared is constrained between 0 and 1. For 2SLS and 3SLS, some of the regressors enter the model as instruments when the parameters are estimated. However, because our goal is to fit the structural model, the actual values, not the instruments for the endogenous right-hand-side variables, are used to determine R-squared. The model residuals are computed over a different set of regressors from those used to fit the model. The two- reg3 — Three-stage estimation for systems of simultaneous equations 1829 or three-stage estimates are no longer nested within a constant-only model of the dependent variable, and the residual sum of squares is no longer constrained to be smaller than the total sum of squares. A negative R-squared in 3SLS should be taken for exactly what it is—an indication that the structural model predicts the dependent variable worse than a constant-only model. Is this a problem? It depends on the application. Three-stage least squares applied to our contrived supply-and-demand example produced good estimates of the known true parameters. Still, the demand equation produced an R-squared of −0.854. How do we feel about our parameter estimates? This should be determined by the estimates themselves, their associated standard errors, and the overall model significance. On this basis, negative R-squared and all, we feel pretty good about all the parameter estimates for both the supply and demand equations. Would we want to make predictions about equilibrium quantity by using the demand equation alone? Probably not. Would we want to make these quantity predictions by using the supply equation? Possibly, because based on in-sample predictions, they seem better than those from the demand equations. However, both the supply and demand estimates are based on limited information. If we are interested in predicting quantity, a reduced-form equation containing all our independent variables would usually be preferred. Technical note As a matter of syntax, we could have specified the supply-and-demand model on one line without using global macros. . reg3 (quantity price pcompete income) (quantity price praw), endog(price) Three-stage least-squares regression Equation Obs Parms RMSE "R-sq" chi2 P quantity 2quantity 49 49 3 2 3.739686 1.752501 -0.8540 0.5928 8.68 39.25 0.0338 0.0000 Coef. Std. Err. z P>|z| [95% Conf. Interval] quantity price pcompete income _cons -1.014345 .2647206 .5299146 40.08749 .3742036 .1464194 .1898161 10.77072 -2.71 1.81 2.79 3.72 0.007 0.071 0.005 0.000 -1.74777 -.0222561 .1578819 18.97726 -.2809194 .5516973 .9019472 61.19772 2quantity price praw _cons .5773133 -.7835496 -2.550694 .1749974 .1312414 5.273067 3.30 -5.97 -0.48 0.001 0.000 0.629 .2343247 -1.040778 -12.88571 .9203019 -.5263213 7.784327 Endogenous variables: Exogenous variables: quantity price pcompete income praw However, here reg3 has been forced to create a unique equation name for the supply equation— 2quantity. Both the supply and demand equations could not be designated as quantity, so a number was prefixed to the name for the supply equation. 1830 reg3 — Three-stage estimation for systems of simultaneous equations We could have specified . reg3 (qDemand: quantity price pcompete income) (qSupply: quantity price praw), > endog(price) and obtained the same results and equation labeling as when we used global macros to hold the equation specifications. Without explicit equation names, reg3 always assumes that the dependent variable should be used to name equations. 
When each equation has a different dependent variable, this rule causes no problems and produces easily interpreted result tables. If the same dependent variable appears in more than one equation, however, reg3 will create a unique equation name based on the dependent variable name. Because equation names must be used for cross-equation tests, you have more control in this situation if explicit names are placed on the equations. Example 3: Using the full syntax of reg3 Klein’s (1950) model of the U.S. economy is often used to demonstrate system estimators. It contains several common features that will serve to demonstrate the full syntax of reg3. The Klein model is defined by the following seven relationships: c = β0 + β1 p + β2 L.p + β3 w + 1 i = β4 + β5 p + β6 L.p + β7 L.k + 2 wp = β8 + β9 y + β10 L.y + β11 yr + 3 y=c+i+g p = y − t − wp k = L.k + i w = wg + wp (1) (2) (3) (4) (5) (6) (7) Here we have used Stata’s lag operator L. to represent variables that appear with a one-period lag in our model; see [U] 13.9 Time-series operators. The variables in the model are listed below. Two sets of variable names are shown. The concise first name uses traditional economics mnemonics, whereas the second name provides more guidance for everyone else. The concise names serve to keep the specification of the model small (and quite understandable to economists). Short name Long name Variable definition Type c p wp wg w i k y g t yr consump profits wagepriv wagegovt wagetot invest capital totinc govt taxnetx year Consumption Private industry profits Private wage bill Government wage bill Total wage bill Investment Capital stock Total income/demand Government spending Indirect bus. taxes + net exports Year—1931 endogenous endogenous endogenous exogenous endogenous endogenous endogenous endogenous exogenous exogenous exogenous reg3 — Three-stage estimation for systems of simultaneous equations 1831 Equations (1)–(3) are behavioral and contain explicit disturbances (1 , 2 , and 3 ). The remaining equations are identities that specify additional variables in the system and their accounting relationships with the variables in the behavioral equations. Some variables are explicitly endogenous by appearing as dependent variables in (1)–(3). Others are implicitly endogenous as linear combinations that contain other endogenous variables (for example, w and p). Still other variables are implicitly exogenous by appearing in the identities but not in the behavioral equations (for example, wg and g). Using the concise names, we can fit Klein’s model with the following command: . use http://www.stata-press.com/data/r13/klein2 . reg3 (c p L.p w) (i p L.p L.k) (wp y L.y yr), endog(w p y) exog(t wg g) Three-stage least-squares regression Equation c i wp Obs Parms RMSE "R-sq" chi2 P 21 21 21 3 3 3 .9443305 1.446736 .7211282 0.9801 0.8258 0.9863 864.59 162.98 1594.75 0.0000 0.0000 0.0000 Coef. Std. Err. z P>|z| [95% Conf. Interval] c p --. L1. .1248904 .1631439 .1081291 .1004382 1.16 1.62 0.248 0.104 -.0870387 -.0337113 .3368194 .3599992 w _cons .790081 16.44079 .0379379 1.304549 20.83 12.60 0.000 0.000 .715724 13.88392 .8644379 18.99766 p --. L1. -.0130791 .7557238 .1618962 .1529331 -0.08 4.94 0.936 0.000 -.3303898 .4559805 .3042316 1.055467 k L1. -.1948482 .0325307 -5.99 0.000 -.2586072 -.1310893 _cons 28.17785 6.793768 4.15 0.000 14.86231 41.49339 y --. L1. 
.4004919 .181291 .0318134 .0341588 12.59 5.31 0.000 0.000 .3381388 .1143411 .462845 .2482409 yr _cons .149674 1.797216 .0279352 1.115854 5.36 1.61 0.000 0.107 .094922 -.3898181 .2044261 3.984251 i wp Endogenous variables: Exogenous variables: c i wp w p y L.p L.k L.y yr t wg g We used the exog() option to identify t, wg, and g as exogenous variables in the system. These variables must be identified because they are part of the system but appear directly in none of the behavioral equations. Without this option, reg3 would not know they were part of the system and could be used as instrumental variables. The endog() option specifying w, p, and y is also required. Without this information, reg3 would be unaware that these variables are linear combinations that include endogenous variables. We did not include k in the endog() option because only its lagged value appears in the behavioral equations. 1832 reg3 — Three-stage estimation for systems of simultaneous equations Technical note Rather than listing additional endogenous and exogenous variables, we could specify the full list of exogenous variables in an inst() option, . reg3 (c p L.p w) (i p L.p L.k) (wp y L.y yr), inst(g t wg yr L.p L.k L.y) or equivalently, . . . . . global conseqn "(c p L.p w)" global inveqn "(i p L.p L.k)" global wageqn "(wp y L.y yr)" global inlist "g t wg yr L.p L.k L.y" reg3 $conseqn $inveqn $wageqn, inst($inlist) Macros and explicit equations can also be mixed in the specification . reg3 $conseqn (i p L.p L.k) $wageqn, endog(w p y) exog(t wg g) or . reg3 (c p L.p w) $inveqn (wp y L.y yr), endog(w p y) exog(t wg g) Placing the equation-binding parentheses in the global macros was also arbitrary. We could have used . . . . global consump global invest global wagepriv reg3 ($consump) "c p L.p w" "i p L.p L.k" "wp y L.y yr" ($invest) ($wagepriv), endog(w p y) exog(t wg g) reg3 is tolerant of all combinations, and these commands will produce identical output. Switching to the full variable names, we can fit Klein’s model with the commands below. We will use global macros to store the lists of endogenous and exogenous variables. Again this is not necessary: these lists could have been typed directly on the command line. However, assigning the lists to local macros makes additional processing easier if alternative models are to be fit. We will also use the ireg3 option to produce the iterated estimates. . use http://www.stata-press.com/data/r13/kleinfull . global conseqn "(consump profits L.profits wagetot)" . global inveqn "(invest profits L.profits L.capital)" . global wageqn "(wagepriv totinc L.totinc year)" . global enlist "wagetot profits totinc" . global exlist "taxnetx wagegovt govt" . reg3 $conseqn $inveqn $wageqn, endog($enlist) exog($exlist) ireg3 Iteration 1: tolerance = .3712549 Iteration 2: tolerance = .1894712 Iteration 3: tolerance = .1076401 (output omitted ) Iteration 24: tolerance = 7.049e-07 Three-stage least-squares regression, iterated Equation Obs Parms RMSE "R-sq" chi2 P consump invest wagepriv 21 21 21 3 3 3 .9565088 2.134327 .7782334 0.9796 0.6209 0.9840 970.31 56.78 1312.19 0.0000 0.0000 0.0000 reg3 — Three-stage estimation for systems of simultaneous equations Coef. Std. Err. z P>|z| [95% Conf. Interval] consump profits --. L1. .1645096 .1765639 .0961979 .0901001 1.71 1.96 0.087 0.050 -.0240348 -.0000291 .3530539 .3531569 wagetot _cons .7658011 16.55899 .0347599 1.224401 22.03 13.52 0.000 0.000 .6976729 14.15921 .8339294 18.95877 invest profits --. L1. 
-.3565316 1.011299 .2601568 .2487745 -1.37 4.07 0.171 0.000 -.8664296 .5237098 .1533664 1.498888 capital L1. -.2602 .0508694 -5.12 0.000 -.3599022 -.1604978 _cons 42.89629 10.59386 4.05 0.000 22.13271 63.65987 wagepriv totinc --. L1. .3747792 .1936506 .0311027 .0324018 12.05 5.98 0.000 0.000 .3138191 .1301443 .4357394 .257157 year _cons .1679262 2.624766 .0289291 1.195559 5.80 2.20 0.000 0.028 .1112263 .2815124 .2246261 4.968019 Endogenous variables: Exogenous variables: govt 1833 consump invest wagepriv wagetot profits totinc L.profits L.capital L.totinc year taxnetx wagegovt Example 4: Constraints with reg3 As a simple example of constraints, (1) above may be rewritten with both wages explicitly appearing (rather than as a variable containing the sum). Using the longer variable names, we have consump = β0 + β1 profits + β2 L.profits + β3 wagepriv + β12 wagegovt + 1 To retain the effect of the identity in (7), we need β3 = β12 as a constraint on the system. We obtain this result by defining the constraint in the usual way and then specifying its use in reg3. Because reg3 is a system estimator, we will need to use the full equation syntax of constraint. The assumption that the following commands are entered after the model above has been estimated. We are simply changing the definition of the consumption equation (consump) and adding a constraint on two of its parameters. The rest of the model definition is carried forward. . global conseqn "(consump profits L.profits wagepriv wagegovt)" . constraint define 1 [consump]wagepriv = [consump]wagegovt . reg3 $conseqn $inveqn $wageqn, endog($enlist) exog($exlist) constr(1) ireg3 note: additional endogenous variables not in the system have no effect and are ignored: wagetot Iteration 1: tolerance = .3712547 Iteration 2: tolerance = .189471 Iteration 3: tolerance = .10764 (output omitted ) Iteration 24: tolerance = 7.049e-07 1834 reg3 — Three-stage estimation for systems of simultaneous equations Three-stage least-squares regression, iterated Equation Obs Parms RMSE "R-sq" chi2 P consump invest wagepriv 21 21 21 3 3 3 .9565086 2.134326 .7782334 0.9796 0.6209 0.9840 970.31 56.78 1312.19 0.0000 0.0000 0.0000 ( 1) [consump]wagepriv - [consump]wagegovt = 0 Coef. Std. Err. z P>|z| [95% Conf. Interval] consump profits --. L1. .1645097 .1765639 .0961978 .0901001 1.71 1.96 0.087 0.050 -.0240346 -.0000291 .353054 .3531568 wagepriv wagegovt _cons .7658012 .7658012 16.55899 .0347599 .0347599 1.224401 22.03 22.03 13.52 0.000 0.000 0.000 .6976729 .6976729 14.1592 .8339294 .8339294 18.95877 invest profits --. L1. -.3565311 1.011298 .2601567 .2487744 -1.37 4.07 0.171 0.000 -.8664288 .5237096 .1533666 1.498887 capital L1. -.2601999 .0508694 -5.12 0.000 -.359902 -.1604977 _cons 42.89626 10.59386 4.05 0.000 22.13269 63.65984 wagepriv totinc --. L1. .3747792 .1936506 .0311027 .0324018 12.05 5.98 0.000 0.000 .313819 .1301443 .4357394 .257157 year _cons .1679262 2.624766 .0289291 1.195559 5.80 2.20 0.000 0.028 .1112263 .281512 .2246261 4.968019 Endogenous variables: Exogenous variables: govt consump invest wagepriv wagetot profits totinc L.profits wagegovt L.capital L.totinc year taxnetx As expected, none of the parameter or standard error estimates has changed from the previous estimates (before the seventh significant digit). We have simply decomposed the total wage variable into its two parts and constrained the coefficients on these parts. 
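Because reg3 retains the full cross-equation covariance under the default 3SLS method, restrictions that span equations can also be examined after estimation with test (or testnl for nonlinear restrictions). The line below is only a hedged sketch: it assumes the constrained model just fit is the current set of estimation results, and the particular hypothesis—equality of the lagged-profits coefficients across the consumption and investment equations—is chosen purely for illustration.

. test [consump]L.profits = [invest]L.profits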
The warning about additional endogenous variables was just reg3’s way of letting us know that we had specified some information that was irrelevant to the estimation of the system. We had left the wagetot variable in our endog macro. It does not mean anything to the system to specify wagetot as endogenous because it is no longer in the system. That’s fine with reg3 and fine for our current purposes. We can also impose constraints across the equations. For example, the admittedly meaningless constraint of requiring profits to have the same effect in both the consumption and investment equations could be imposed. Retaining the constraint on the wage coefficients, we would estimate this constrained system. reg3 — Three-stage estimation for systems of simultaneous equations . constraint define 2 [consump]profits = [invest]profits . reg3 $conseqn $inveqn $wageqn, endog($enlist) exog($exlist) constr(1 2) ireg3 note: additional endogenous variables not in the system have no effect and are ignored: wagetot Iteration 1: tolerance = .1427927 Iteration 2: tolerance = .032539 Iteration 3: tolerance = .00307811 Iteration 4: tolerance = .00016903 Iteration 5: tolerance = .00003409 Iteration 6: tolerance = 7.763e-06 Iteration 7: tolerance = 9.240e-07 Three-stage least-squares regression, iterated Equation Obs Parms RMSE "R-sq" chi2 P consump invest wagepriv 21 21 21 3 3 3 .9504669 1.247066 .7225276 0.9798 0.8706 0.9862 1019.54 144.57 1537.45 0.0000 0.0000 0.0000 ( 1) ( 2) [consump]wagepriv - [consump]wagegovt = 0 [consump]profits - [invest]profits = 0 Coef. Std. Err. z P>|z| [95% Conf. Interval] consump profits --. L1. .1075413 .1712756 .0957767 .0912613 1.12 1.88 0.262 0.061 -.0801777 -.0075932 .2952602 .3501444 wagepriv wagegovt _cons .798484 .798484 16.2521 .0340876 .0340876 1.212157 23.42 23.42 13.41 0.000 0.000 0.000 .7316734 .7316734 13.87631 .8652946 .8652946 18.62788 invest profits --. L1. .1075413 .6443378 .0957767 .1058682 1.12 6.09 0.262 0.000 -.0801777 .43684 .2952602 .8518356 capital L1. -.1766669 .0261889 -6.75 0.000 -.2279962 -.1253375 _cons 24.31931 5.284325 4.60 0.000 13.96222 34.6764 wagepriv totinc --. L1. .4014106 .1775359 .0300552 .0321583 13.36 5.52 0.000 0.000 .3425035 .1145068 .4603177 .240565 year _cons .1549211 1.959788 .0282291 1.14467 5.49 1.71 0.000 0.087 .099593 -.2837242 .2102492 4.203299 Endogenous variables: Exogenous variables: govt consump invest wagepriv wagetot profits totinc L.profits wagegovt L.capital L.totinc year taxnetx 1835 1836 reg3 — Three-stage estimation for systems of simultaneous equations Technical note Identification in a system of simultaneous equations involves the notion that there is enough information to estimate the parameters of the model given the specified functional form. Underidentification usually manifests itself as one matrix in the 3SLS computations. The most commonly violated order condition for 2SLS or 3SLS involves the number of endogenous and exogenous variables. There must be at least as many noncollinear exogenous variables in the remaining system as there are endogenous right-hand-side variables in an equation. This condition must hold for each structural equation in the system. Put as a set of rules the following: 1. Count the number of right-hand-side endogenous variables in an equation and call this mi . 2. Count the number of exogenous variables in the same equation and call this ki . 3. 
Count the total number of exogenous variables in all the structural equations plus any additional variables specified in an exog() or inst() option and call this K.

4. If mi > (K − ki) for any structural equation (i), then the system is underidentified and cannot be estimated by 3SLS.

We are also possibly in trouble if any of the exogenous variables are linearly dependent. We must have mi linearly independent variables among the exogenous variables represented by (K − ki).

The complete conditions for identification involve rank-order conditions on several matrices. For a full treatment, see Theil (1971) or Greene (2012, 331–334).

Henri Theil (1924–2000) was born in Amsterdam and awarded a PhD in 1951 by the University of Amsterdam. He researched and taught econometric theory, statistics, microeconomics, macroeconomic modeling, economic forecasting, and policy at (now) Erasmus University Rotterdam, the University of Chicago, and the University of Florida. Theil's many specific contributions include work on 2SLS and 3SLS, inequality and concentration, and consumer demand.

Stored results

reg3 stores the following in e():

Scalars
  e(N)           number of observations
  e(k)           number of parameters
  e(k_eq)        number of equations in e(b)
  e(mss_#)       model sum of squares for equation #
  e(df_m#)       model degrees of freedom for equation #
  e(rss_#)       residual sum of squares for equation #
  e(df_r)        residual degrees of freedom (small)
  e(r2_#)        R-squared for equation #
  e(F_#)         F statistic for equation # (small)
  e(rmse_#)      root mean squared error for equation #
  e(dfk2_adj)    divisor used with VCE when dfk2 specified
  e(ll)          log likelihood
  e(chi2_#)      χ2 for equation #
  e(p_#)         significance for equation #
  e(cons_#)      1 when equation # has a constant, 0 otherwise
  e(rank)        rank of e(V)
  e(ic)          number of iterations

Macros
  e(cmd)             reg3
  e(cmdline)         command as typed
  e(depvar)          names of dependent variables
  e(exog)            names of exogenous variables
  e(endog)           names of endogenous variables
  e(eqnames)         names of equations
  e(corr)            correlation structure
  e(wtype)           weight type
  e(wexp)            weight expression
  e(method)          3sls, 2sls, ols, sure, or mvreg
  e(small)           small
  e(dfk)             dfk, if specified
  e(properties)      b V
  e(predict)         program used to implement predict
  e(marginsok)       predictions allowed by margins
  e(marginsnotok)    predictions disallowed by margins
  e(asbalanced)      factor variables fvset as asbalanced
  e(asobserved)      factor variables fvset as asobserved

Matrices
  e(b)       coefficient vector
  e(Cns)     constraints matrix
  e(Sigma)   Σ̂ matrix
  e(V)       variance–covariance matrix of the estimators

Functions
  e(sample)  marks estimation sample

Methods and formulas

The most concise way to represent a system of equations for 3SLS requires thinking of the individual equations and their associated data as being stacked. reg3 does not expect the data in this format, but it is a convenient shorthand. The system could then be formulated as

   [ y1 ]   [ Z1   0  ...   0 ] [ β1 ]   [ ε1 ]
   [ y2 ] = [  0  Z2  ...   0 ] [ β2 ] + [ ε2 ]
   [ ⋮  ]   [  ⋮   ⋮   ⋱    ⋮ ] [ ⋮  ]   [ ⋮  ]
   [ yM ]   [  0   0  ...  ZM ] [ βM ]   [ εM ]

In full matrix notation, this is just

   y = ZB + ε

The Z elements in these matrices represent both the endogenous and the exogenous right-hand-side variables in the equations. Also assume that there will be correlation between the disturbances of the equations so that

   E(εε′) = Σ

where the disturbances are further assumed to have an expected value of 0; E(ε) = 0. The first stage of 3SLS regression requires developing instrumented values for the endogenous variables in the system.
These values can be derived as the predictions from a linear regression of each endogenous regressor on all exogenous variables in the system or, more succinctly, as the projection of each regressor through the projection matrix of all exogenous variables onto the regressors. Designating the set of all exogenous variables as X results in

   ẑi = X(X′X)⁻¹X′zi    for each i

Taken collectively, these Ẑ contain the instrumented values for all the regressors. They take on the actual values for the exogenous variables and first-stage predictions for the endogenous variables. Given these instrumented variables, a generalized least squares (GLS) or Aitken (1935) estimator can be formed for the parameters of the system

   B̂ = {Ẑ′(Σ⁻¹ ⊗ I)Ẑ}⁻¹ Ẑ′(Σ⁻¹ ⊗ I)y

All that remains is to obtain a consistent estimator for Σ. This estimate can be formed from the residuals of 2SLS estimates of each equation in the system. Alternately, and identically, the residuals can be computed from the estimates formed by taking Σ to be an identity matrix. This maintains the full system of coefficients and allows constraints to be applied when the residuals are computed. If we take E to be the matrix of residuals from these estimates, a consistent estimate of Σ is

   Σ̂ = E′E / n

where n is the number of observations in the sample. An alternative divisor for this estimate can be obtained with the dfk option as outlined under options.

With the estimate of Σ̂ placed into the GLS estimating equation,

   B̂ = {Ẑ′(Σ̂⁻¹ ⊗ I)Ẑ}⁻¹ Ẑ′(Σ̂⁻¹ ⊗ I)y

is the 3SLS estimates of the system parameters. The asymptotic variance–covariance matrix of the estimator is just the standard formulation for a GLS estimator

   V(B̂) = {Ẑ′(Σ̂⁻¹ ⊗ I)Ẑ}⁻¹

Iterated 3SLS estimates can be obtained by computing the residuals from the three-stage parameter estimates, using these to formulate a new Σ̂, and recomputing the parameter estimates. This process is repeated until the estimates B̂ converge—if they converge. Convergence is not guaranteed. When estimating a system by SURE, these iterated estimates will be the maximum likelihood estimates for the system. The iterated solution can also be used to produce estimates that are invariant to choice of system and restriction parameterization for many linear systems under full 3SLS.

The exposition above follows the parallel developments in Greene (2012) and Davidson and MacKinnon (1993).

References

Aitken, A. C. 1935. On least squares and linear combination of observations. Proceedings of the Royal Society of Edinburgh 55: 42–48.
Bewley, R. 2000. Mr. Henri Theil: An interview with the International Journal of Forecasting. International Journal of Forecasting 16: 1–16.
Davidson, R., and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press.
Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall.
Klein, L. R. 1950. Economic Fluctuations in the United States 1921–1941. New York: Wiley.
Nichols, A. 2007. Causal inference with observational data. Stata Journal 7: 507–541.
Poi, B. P. 2006. Jackknife instrumental variables estimation in Stata. Stata Journal 6: 364–376.
Theil, H. 1971. Principles of Econometrics. New York: Wiley.
Weesie, J. 1999. sg121: Seemingly unrelated estimation and the cluster-adjusted sandwich estimator. Stata Technical Bulletin 52: 34–47.
Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 231–248. College Station, TX: Stata Press. Zellner, A., and H. Theil. 1962. Three stage least squares: Simultaneous estimate of simultaneous equations. Econometrica 29: 54–78. Also see [R] reg3 postestimation — Postestimation tools for reg3 [R] ivregress — Single-equation instrumental-variables regression [R] nlsur — Estimation of nonlinear systems of equations [R] regress — Linear regression [R] sureg — Zellner’s seemingly unrelated regression [MV] mvreg — Multivariate regression [SEM] example 7 — Nonrecursive structural model [SEM] intro 5 — Tour of models [TS] forecast — Econometric model forecasting [U] 20 Estimation and postestimation commands Title reg3 postestimation — Postestimation tools for reg3 Description Remarks and examples Syntax for predict Methods and formulas Menu for predict Reference Options for predict Also see Description The following postestimation commands are available after reg3: ∗ Command Description contrast estat ic estat summarize estat vce estimates forecast hausman lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results dynamic forecasts and simulations Hausman’s specification test point estimates, standard errors, testing, and inference for linear combinations of coefficients marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses margins marginsplot nlcom predict predictnl pwcompare test testnl ∗ estat ic is not appropriate after reg3, 2sls. Syntax for predict predict statistic type newvar if in , equation(eqno ,eqno ) statistic Description Main xb stdp residuals difference stddp linear prediction; the default standard error of the linear prediction residuals difference between the linear predictions of two equations standard error of the difference in linear predictions These statistics are available both in and out of sample; type predict only for the estimation sample. 1840 . . . if e(sample) . . . if wanted reg3 postestimation — Postestimation tools for reg3 1841 Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main equation(eqno ,eqno ) specifies to which equation you are referring. equation() is filled in with one eqno for the xb, stdp, and residuals options. equation(#1) would mean the calculation is to be made for the first equation, equation(#2) would mean the second, and so on. You could also refer to the equations by their names. equation(income) would refer to the equation named income and equation(hours) to the equation named hours. If you do not specify equation(), results are the same as if you specified equation(#1). difference and stddp refer to between-equation concepts. To use these options, you must specify two equations, for example, equation(#1,#2) or equation(income,hours). When two equations must be specified, equation() is required. 
xb, the default, calculates the linear prediction (fitted values) — the prediction of xj b for the specified equation. stdp calculates the standard error of the prediction for the specified equation. It can be thought of as the standard error of the predicted expected value or mean for the observation’s covariate pattern. The standard error of the prediction is also referred to as the standard error of the fitted value. residuals calculates the residuals. difference calculates the difference between the linear predictions of two equations in the system. With equation(#1,#2), difference computes the prediction of equation(#1) minus the prediction of equation(#2). stddp is allowed only after you have previously fit a multiple-equation model. The standard error of the difference in linear predictions (x1j b − x2j b) between equations 1 and 2 is calculated. For more information on using predict after multiple-equation estimation commands, see [R] predict. Remarks and examples Example 1: Using predict In example 2 of [R] reg3, we fit a simple supply-and-demand model. Here we obtain the fitted supply and demand curves assuming that the exogenous regressors equal their sample means. We first replace each of the three exogenous regressors with their sample means, then we call predict to obtain the predictions. . . . . use http://www.stata-press.com/data/r13/supDem global demand "(qDemand: quantity price pcompete income)" global supply "(qSupply: quantity price praw)" reg3 $demand $supply, endog(price) (output omitted ) . summarize pcompete, meanonly . replace pcompete = r(mean) (49 real changes made) 1842 reg3 postestimation — Postestimation tools for reg3 5 10 15 20 . summarize income, meanonly . replace income = r(mean) (49 real changes made) . summarize praw, meanonly . replace praw = r(mean) (49 real changes made) . predict demand, equation(qDemand) (option xb assumed; fitted values) . predict supply, equation(qSupply) (option xb assumed; fitted values) . graph twoway line demand price, sort || line supply price, ytitle(" ") > legend(label(1 "Fitted values: qDemand") label(2 "Fitted values: qSupply")) 25 30 35 40 price Fitted values: qDemand Fitted values: qSupply As we would expect based on economic theory, the demand curve slopes downward while the supply curve slopes upward. With the exogenous variables at their mean levels, the equilibrium price and quantity are slightly less than 33 and 13, respectively. Example 2: Obtaining forecasts In example 3 of [R] reg3, we fit Klein’s (1950) model of the U.S. economy. That model includes three stochastic equations we fit using reg3 as well as four identities. Here we briefly illustrate how the forecast command can be used to obtain forecasts for all the endogenous variables in the model. For a more detailed discussion of how to forecast with this model, see [TS] forecast. In Stata, we type . use http://www.stata-press.com/data/r13/klein2, clear . reg3 (c p L.p w) (i p L.p L.k) (wp y L.y yr), endog(w p y) exog(t wg g) (output omitted ) . estimates store kleineqs . forecast create kleinmodel Forecast model kleinmodel started. . forecast estimates kleineqs Added estimation results from reg3. Forecast model kleinmodel now contains 3 endogenous variables. reg3 postestimation — Postestimation tools for reg3 . forecast identity y = c + i + Forecast model kleinmodel now . forecast identity p = y - t Forecast model kleinmodel now . forecast identity k = L.k + i Forecast model kleinmodel now . forecast identity w = wg + wp Forecast model kleinmodel now . 
forecast solve, begin(1937) Computing dynamic forecasts for 1843 g contains 4 endogenous variables. wp contains 5 endogenous variables. contains 6 endogenous variables. contains 7 endogenous variables. model kleinmodel. Starting period: 1937 Ending period: 1941 Forecast prefix: f_ 1937: ........................................... 1938: ............................................ 1939: ........................................... 1940: ......................................... 1941: ............................................. Forecast 7 variables spanning 5 periods. Here we have obtained dynamic forecasts for our 7 endogenous variables beginning in 1937. By default, the variables containing the forecasts begin with the prefix f . Next we plot the forecast and actual values of consumption: 40 50 60 70 . tsline c f_c 1920 1925 consumption 1930 year 1935 1940 consumption (kleinmodel f_) For more information about producing forecasts, see [TS] forecast. Methods and formulas The computational formulas for the statistics produced by predict can be found in [R] predict and [R] regress postestimation. 1844 reg3 postestimation — Postestimation tools for reg3 Reference Klein, L. R. 1950. Economic Fluctuations in the United States 1921–1941. New York: Wiley. Also see [R] reg3 — Three-stage estimation for systems of simultaneous equations [U] 20 Estimation and postestimation commands Title regress — Linear regression Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Options Acknowledgments Syntax regress depvar options indepvars if in weight , options Description Model noconstant hascons tsscons suppress constant term has user-supplied constant compute total sum of squares with constant; seldom used SE/Robust vce(vcetype) vcetype may be ols, robust, cluster clustvar, bootstrap, jackknife, hc2, or hc3 Reporting level(#) beta eform(string) depname(varname) display options set confidence level; default is level(95) report standardized beta coefficients report exponentiated coefficients and label as string substitute dependent variable name; programmer’s option control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling noheader notable plus mse1 coeflegend suppress output header suppress coefficient table make table extendable force mean squared error to 1 display legend instead of statistics indepvars may contain factor variables; see [U] 11.4.3 Factor variables. depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, mfp, mi estimate, nestreg, rolling, statsby, stepwise, and svy are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. hascons, tsscons, vce(), beta, noheader, notable, plus, depname(), mse1, and weights are not allowed with the svy prefix; see [SVY] svy. aweights, fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. noheader, notable, plus, mse1, and coeflegend do not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 
1845 1846 regress — Linear regression Menu Statistics > Linear models and related > Linear regression Description regress fits a model of depvar on indepvars using linear regression. Here is a short list of other regression commands that may be of interest. See help estimation commands for a complete list. Command Entry Description areg arch arima boxcox cnsreg eivreg etregress frontier gmm heckman intreg ivregress ivtobit newey nl nlsur qreg reg3 rreg gsem sem sureg tobit truncreg xtabond xtdpd xtfrontier xtgls xthtaylor xtintreg xtivreg xtpcse xtreg xtregar xttobit [R] areg [TS] arch [TS] arima [R] boxcox [R] cnsreg [R] eivreg [TE] etregress [R] frontier [R] gmm [R] heckman [R] intreg [R] ivregress [R] ivtobit [TS] newey [R] nl [R] nlsur [R] qreg [R] reg3 [R] rreg [SEM] intro 5 [SEM] intro 5 [R] sureg [R] tobit [R] truncreg [XT] xtabond [XT] xtdpd [XT] xtfrontier [XT] xtgls [XT] xthtaylor [XT] xtintreg [XT] xtivreg [XT] xtpcse [XT] xtreg [XT] xtregar [XT] xttobit an easier way to fit regressions with many dummy variables regression models with ARCH errors ARIMA models Box–Cox regression models constrained linear regression errors-in-variables regression Linear regression with endogenous treatment effects stochastic frontier models generalized method of moments estimation Heckman selection model interval regression single-equation instrumental-variables regression tobit regression with endogenous variables regression with Newey – West standard errors nonlinear least-squares estimation estimation of nonlinear systems of equations quantile (including median) regression three-stage least-squares (3SLS) regression a type of robust regression generalized structural equation models linear structural equation models seemingly unrelated regression tobit regression truncated regression Arellano–Bond linear dynamic panel-data estimation linear dynamic panel-data estimation panel-data stochastic frontier models panel-data GLS models Hausman–Taylor estimator for error-components models panel-data interval regression models panel-data instrumental-variables (2SLS) regression linear regression with panel-corrected standard errors fixed- and random-effects linear models fixed- and random-effects linear models with an AR(1) disturbance panel-data tobit models regress — Linear regression 1847 Options Model noconstant; see [R] estimation options. hascons indicates that a user-defined constant or its equivalent is specified among the independent variables in indepvars. Some caution is recommended when specifying this option, as resulting estimates may not be as accurate as they otherwise would be. Use of this option requires “sweeping” the constant last, so the moment matrix must be accumulated in absolute rather than deviation form. This option may be safely specified when the means of the dependent and independent variables are all reasonable and there is not much collinearity between the independent variables. The best procedure is to view hascons as a reporting option — estimate with and without hascons and verify that the coefficients and standard errors of the variables not affected by the identity of the constant are unchanged. tsscons forces the total sum of squares to be computed as though the model has a constant, that is, as deviations from the mean of the dependent variable. This is a rarely used option that has an effect only when specified with noconstant. It affects the total sum of squares and all results derived from the total sum of squares. 
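To see the check suggested above for hascons in practice, one might supply the constant explicitly and compare the two fits. This is only a hedged sketch: it assumes the auto dataset used in example 1 below, and the variable one is created here purely for illustration.

. use http://www.stata-press.com/data/r13/auto
. generate one = 1
. regress mpg weight one, hascons
. regress mpg weight

The coefficient and standard error on weight should agree across the two fits (subject to the numerical-accuracy caveat noted above), with the coefficient on one playing the role of the constant term.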
SE/Robust

vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (ols), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce_option.

    vce(ols), the default, uses the standard variance estimator for ordinary least-squares regression.

    regress also allows the following:

    vce(hc2) and vce(hc3) specify an alternative bias correction for the robust variance calculation. vce(hc2) and vce(hc3) may not be specified with the svy prefix. In the unclustered case, vce(robust) uses $\hat\sigma_j^2 = \{n/(n-k)\}u_j^2$ as an estimate of the variance of the jth observation, where $u_j$ is the calculated residual and $n/(n-k)$ is included to improve the overall estimate's small-sample properties.

    vce(hc2) instead uses $u_j^2/(1-h_{jj})$ as the observation's variance estimate, where $h_{jj}$ is the jth diagonal element of the hat (projection) matrix. This estimate is unbiased if the model really is homoskedastic. vce(hc2) tends to produce slightly more conservative confidence intervals.

    vce(hc3) uses $u_j^2/(1-h_{jj})^2$ as suggested by Davidson and MacKinnon (1993), who report that this method tends to produce better results when the model really is heteroskedastic. vce(hc3) produces confidence intervals that tend to be even more conservative.

    See Davidson and MacKinnon (1993, 554–556) and Angrist and Pischke (2009, 294–308) for more discussion on these two bias corrections.

Reporting

level(#); see [R] estimation options.

beta asks that standardized beta coefficients be reported instead of confidence intervals. The beta coefficients are the regression coefficients obtained by first standardizing all variables to have a mean of 0 and a standard deviation of 1. beta may not be specified with vce(cluster clustvar) or the svy prefix.

eform(string) is used only in programs and ado-files that use regress to fit models other than linear regression. eform() specifies that the coefficient table be displayed in exponentiated form as defined in [R] maximize and that string be used to label the exponentiated coefficients in the table.

depname(varname) is used only in programs and ado-files that use regress to fit models other than linear regression. depname() may be specified only at estimation time. varname is recorded as the identity of the dependent variable, even though the estimates are calculated using depvar. This method affects the labeling of the output, not the results calculated, but could affect subsequent calculations made by predict, where the residual would be calculated as deviations from varname rather than depvar. depname() is most typically used when depvar is a temporary variable (see [P] macro) used as a proxy for varname. depname() is not allowed with the svy prefix.

display_options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(%fmt), pformat(%fmt), sformat(%fmt), and nolstretch; see [R] estimation options.

The following options are available with regress but are not shown in the dialog box:

noheader suppresses the display of the ANOVA table and summary statistics at the top of the output; only the coefficient table is displayed. This option is often used in programs and ado-files.

notable suppresses display of the coefficient table.

plus specifies that the output table be made extendable.
This option is often used in programs and ado-files. mse1 is used only in programs and ado-files that use regress to fit models other than linear regression and is not allowed with the svy prefix. mse1 sets the mean squared error to 1, forcing the variance–covariance matrix of the estimators to be (X0 DX)−1 (see Methods and formulas below) and affecting calculated standard errors. Degrees of freedom for t statistics is calculated as n rather than n − k . coeflegend; see [R] estimation options. Remarks and examples Remarks are presented under the following headings: Ordinary least squares Treatment of the constant Robust standard errors Weighted regression Instrumental variables and two-stage least-squares regression Video example regress performs linear regression, including ordinary least squares and weighted least squares. For a general discussion of linear regression, see Draper and Smith (1998), Greene (2012), or Kmenta (1997). See Wooldridge (2013) for an excellent treatment of estimation, inference, interpretation, and specification testing in linear regression models. This presentation stands out for its clarification of the statistical issues, as opposed to the algebraic issues. See Wooldridge (2010, chap. 4) for a more advanced discussion along the same lines. regress — Linear regression 1849 See Hamilton (2013, chap. 7) and Cameron and Trivedi (2010, chap. 3) for an introduction to linear regression using Stata. Dohoo, Martin, and Stryhn (2012, 2010) discuss linear regression using examples from epidemiology, and Stata datasets and do-files used in the text are available. Cameron and Trivedi (2010) discuss linear regression using econometric examples with Stata. Mitchell (2012) shows how to use graphics and postestimation commands to understand a fitted regression model. Chatterjee and Hadi (2012) explain regression analysis by using examples containing typical problems that you might encounter when performing exploratory data analysis. We also recommend Weisberg (2005), who emphasizes the importance of the assumptions of linear regression and problems resulting from these assumptions. Becketti (2013) discusses regression analysis with an emphasis on time-series data. Angrist and Pischke (2009) approach regression as a tool for exploring relationships, estimating treatment effects, and providing answers to public policy questions. For a discussion of model-selection techniques and exploratory data analysis, see Mosteller and Tukey (1977). For a mathematically rigorous treatment, see Peracchi (2001, chap. 6). Finally, see Plackett (1972) if you are interested in the history of regression. Least squares, which dates back to the 1790s, was discovered independently by Legendre and Gauss. Ordinary least squares Example 1: Basic linear regression Suppose that we have data on the mileage rating and weight of 74 automobiles. The variables in our data are mpg, weight, and foreign. The last variable assumes the value 1 for foreign and 0 for domestic automobiles. We wish to fit the model mpg = β0 + β1 weight + β2 foreign + This model can be fit with regress by typing . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress mpg weight foreign SS df MS Source Model Residual 1619.2877 824.171761 2 71 809.643849 11.608053 Total 2443.45946 73 33.4720474 mpg Coef. weight foreign _cons -.0065879 -1.650029 41.6797 Std. Err. 
.0006371 1.075994 2.165547 t -10.34 -1.53 19.25 Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.130 0.000 = = = = = = 74 69.75 0.0000 0.6627 0.6532 3.4071 [95% Conf. Interval] -.0078583 -3.7955 37.36172 -.0053175 .4954422 45.99768 regress produces a variety of summary statistics along with the table of regression coefficients. At the upper left, regress reports an analysis-of-variance (ANOVA) table. The column headings SS, df, and MS stand for “sum of squares”, “degrees of freedom”, and “mean square”, respectively. In this example, the total sum of squares is 2,443.5: 1,619.3 accounted for by the model and 824.2 left unexplained. Because the regression included a constant, the total sum reflects the sum after removal of means, as does the sum of squares due to the model. The table also reveals that there are 73 total degrees of freedom (counted as 74 observations less 1 for the mean removal), of which 2 are consumed by the model, leaving 71 for the residual. 1850 regress — Linear regression To the right of the ANOVA table are presented other summary statistics. The F statistic associated with the ANOVA table is 69.75. The statistic has 2 numerator and 71 denominator degrees of freedom. The F statistic tests the hypothesis that all coefficients excluding the constant are zero. The chance of observing an F statistic that large or larger is reported as 0.0000, which is Stata’s way of indicating a number smaller than 0.00005. The R-squared (R2 ) for the regression is 0.6627, and the R-squared adjusted for degrees of freedom (Ra2 ) is 0.6532. The root mean squared error, labeled Root MSE, is 3.4071. It is the square root of the mean squared error reported for the residual in the ANOVA table. Finally, Stata produces a table of the estimated coefficients. The first line of the table indicates that the left-hand-side variable is mpg. Thereafter follow the estimated coefficients. Our fitted model is mpg hat = 41.68 − 0.0066 weight − 1.65 foreign Reported to the right of the coefficients in the output are the standard errors. For instance, the standard error for the coefficient on weight is 0.0006371. The corresponding t statistic is −10.34, which has a two-sided significance level of 0.000. This number indicates that the significance is less than 0.0005. The 95% confidence interval for the coefficient is [ −0.0079, −0.0053 ]. Example 2: Transforming the dependent variable If we had a graph comparing mpg with weight, we would notice that the relationship is distinctly nonlinear. This is to be expected because energy usage per distance should increase linearly with weight, but mpg is measuring distance per energy used. We could obtain a better model by generating a new variable measuring the number of gallons used per 100 miles (gp100m) and then using this new variable in our model: gp100m = β0 + β1 weight + β2 foreign + We can now fit this model: . generate gp100m = 100/mpg . regress gp100m weight foreign Source SS df MS Model Residual 91.1761694 28.4000913 2 71 45.5880847 .400001287 Total 119.576261 73 1.63803097 gp100m Coef. weight foreign _cons .0016254 .6220535 -.0734839 Std. Err. .0001183 .1997381 .4019932 t 13.74 3.11 -0.18 Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.003 0.855 = = = = = = 74 113.97 0.0000 0.7625 0.7558 .63246 [95% Conf. Interval] .0013896 .2237871 -.8750354 Fitting the physically reasonable model increases our R-squared to 0.7625. 
.0018612 1.02032 .7280677 regress — Linear regression 1851 Example 3: Obtaining beta coefficients regress shares the features of all estimation commands. Among other things, this means that after running a regression, we can use test to test hypotheses about the coefficients, estat vce to examine the covariance matrix of the estimators, and predict to obtain predicted values, residuals, and influence statistics. See [U] 20 Estimation and postestimation commands. Options that affect how estimates are displayed, such as beta or level(), can be used when replaying results. Suppose that we meant to specify the beta option to obtain beta coefficients (regression coefficients normalized by the ratio of the standard deviation of the regressor to the standard deviation of the dependent variable). Even though we forgot, we can specify the option now: . regress, beta Source SS df MS Model Residual 91.1761694 28.4000913 2 71 45.5880847 .400001287 Total 119.576261 73 1.63803097 gp100m Coef. weight foreign _cons .0016254 .6220535 -.0734839 Std. Err. .0001183 .1997381 .4019932 t 13.74 3.11 -0.18 Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE = = = = = = 74 113.97 0.0000 0.7625 0.7558 .63246 P>|t| Beta 0.000 0.003 0.855 .9870255 .2236673 . Treatment of the constant By default, regress includes an intercept (constant) term in the model. The noconstant option suppresses it, and the hascons option tells regress that the model already has one. Example 4: Suppressing the constant term We wish to fit a regression of the weight of an automobile against its length, and we wish to impose the constraint that the weight is zero when the length is zero. If we simply type regress weight length, we are fitting the model weight = β0 + β1 length + Here a length of zero corresponds to a weight of β0 . We want to force β0 to be zero or, equivalently, estimate an equation that does not include an intercept: weight = β1 length + 1852 regress — Linear regression We do this by specifying the noconstant option: . regress weight length, noconstant SS df Source MS Model Residual 703869302 14892897.8 1 73 703869302 204012.299 Total 718762200 74 9713002.7 weight Coef. length 16.29829 Std. Err. .2774752 t 58.74 Number of obs F( 1, 73) Prob > F R-squared Adj R-squared Root MSE = 74 = 3450.13 = 0.0000 = 0.9793 = 0.9790 = 451.68 P>|t| [95% Conf. Interval] 0.000 15.74528 16.8513 In our data, length is measured in inches and weight in pounds. We discover that each inch of length adds 16 pounds to the weight. Sometimes there is no need for Stata to include a constant term in the model. Most commonly, this occurs when the model contains a set of mutually exclusive indicator variables. hascons is a variation of the noconstant option — it tells Stata not to add a constant to the regression because the regression specification already has one, either directly or indirectly. For instance, we now refit our model of weight as a function of length and include separate constants for foreign and domestic cars by specifying bn.foreign. bn.foreign is factor-variable notation for “no base for foreign” or “include all levels of variable foreign in the model”; see [U] 11.4.3 Factor variables. . regress weight length bn.foreign, hascons Source SS df MS Model Residual 39647744.7 4446433.7 2 71 19823872.3 62625.8268 Total 44094178.4 73 604029.841 weight Coef. Std. Err. 
length 31.44455 1.601234 foreign Domestic Foreign -2850.25 -2983.927 315.9691 275.1041 t Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE = = = = = = 74 316.54 0.0000 0.8992 0.8963 250.25 P>|t| [95% Conf. Interval] 19.64 0.000 28.25178 34.63732 -9.02 -10.85 0.000 0.000 -3480.274 -3532.469 -2220.225 -2435.385 regress — Linear regression 1853 Technical note There is a subtle distinction between the hascons and noconstant options. We can most easily reveal it by refitting the last regression, specifying noconstant rather than hascons: . regress weight length bn.foreign, noconstant Source SS df MS Model Residual 714315766 4446433.7 3 71 238105255 62625.8268 Total 718762200 74 9713002.7 weight Coef. Std. Err. length 31.44455 1.601234 foreign Domestic Foreign -2850.25 -2983.927 315.9691 275.1041 t Number of obs F( 3, 71) Prob > F R-squared Adj R-squared Root MSE = 74 = 3802.03 = 0.0000 = 0.9938 = 0.9936 = 250.25 P>|t| [95% Conf. Interval] 19.64 0.000 28.25178 34.63732 -9.02 -10.85 0.000 0.000 -3480.274 -3532.469 -2220.225 -2435.385 Comparing this output with that produced by the previous regress command, we see that they are almost, but not quite, identical. The parameter estimates and their associated statistics — the second half of the output — are identical. The overall summary statistics and the ANOVA table — the first half of the output — are different, however. In the first case, the R2 is shown as 0.8992; here it is shown as 0.9938. In the first case, the F statistic is 316.54; now it is 3,802.03. The numerator degrees of freedom is different as well. In the first case, the numerator degrees of freedom is 2; now the degrees of freedom is 3. Which is correct? Both are. Specifying the hascons option causes regress to adjust the ANOVA table and its associated statistics for the explanatory power of the constant. The regression in effect has a constant; it is just written in such a way that a separate constant is unnecessary. No such adjustment is made with the noconstant option. Technical note When the hascons option is specified, regress checks to make sure that the model does in fact have a constant term. If regress cannot find a constant term, it automatically adds one. Fitting a model of weight on length and specifying the hascons option, we obtain . regress weight length, hascons (note: hascons false) SS df Source MS Model Residual 39461306.8 4632871.55 1 72 39461306.8 64345.4382 Total 44094178.4 73 604029.841 weight Coef. length _cons 33.01988 -3186.047 Std. Err. 1.333364 252.3113 t 24.76 -12.63 Number of obs F( 1, 72) Prob > F R-squared Adj R-squared Root MSE = = = = = = 74 613.27 0.0000 0.8949 0.8935 253.66 P>|t| [95% Conf. Interval] 0.000 0.000 30.36187 -3689.02 35.67789 -2683.073 1854 regress — Linear regression Even though we specified hascons, regress included a constant, anyway. It also added a note to our output: “note: hascons false”. Technical note Even if the model specification effectively includes a constant term, we need not specify the hascons option. regress is always on the lookout for collinear variables and omits them from the model. For instance, . regress weight length bn.foreign note: 1.foreign omitted because of collinearity Source SS df MS Model Residual 39647744.7 4446433.7 2 71 19823872.3 62625.8268 Total 44094178.4 73 604029.841 weight Coef. Std. Err. 
length 31.44455 1.601234 foreign Domestic Foreign 133.6775 0 _cons -2983.927 Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE t = = = = = = 74 316.54 0.0000 0.8992 0.8963 250.25 P>|t| [95% Conf. Interval] 19.64 0.000 28.25178 34.63732 77.47615 (omitted) 1.73 0.089 -20.80555 288.1605 275.1041 -10.85 0.000 -3532.469 -2435.385 Robust standard errors regress with the vce(robust) option substitutes a robust variance matrix calculation for the conventional calculation, or if vce(cluster clustvar) is specified, allows relaxing the assumption of independence within groups. How this method works is explained in [U] 20.21 Obtaining robust variance estimates. Below we show how well this approach works. Example 5: Heteroskedasticity and robust standard errors Specifying the vce(robust) option is equivalent to requesting White-corrected standard errors in the presence of heteroskedasticity. We use the automobile data and, in the process of looking at the energy efficiency of cars, analyze a variable with considerable heteroskedasticity. We will examine the amount of energy — measured in gallons of gasoline — that the cars in the data need to move 1,000 pounds of their weight 100 miles. We are going to examine the relative efficiency of foreign and domestic cars. . gen gpmw = ((1/mpg)/weight)*100*1000 . summarize gpmw Variable Obs Mean gpmw 74 1.682184 Std. Dev. .2426311 Min Max 1.09553 2.30521 In these data, the engines consume between 1.10 and 2.31 gallons of gas to move 1,000 pounds of the car’s weight 100 miles. If we ran a regression with conventional standard errors of gpmw on foreign, we would obtain regress — Linear regression . regress gpmw foreign Source SS df MS Model Residual .936705572 3.36079459 1 72 .936705572 .046677703 Total 4.29750017 73 .058869865 gpmw Coef. foreign _cons .2461526 1.609004 Std. Err. .0549487 .0299608 t 4.48 53.70 Number of obs F( 1, 72) Prob > F R-squared Adj R-squared Root MSE = = = = = = 1855 74 20.07 0.0000 0.2180 0.2071 .21605 P>|t| [95% Conf. Interval] 0.000 0.000 .1366143 1.549278 .3556909 1.66873 regress with the vce(robust) option, on the other hand, reports . regress gpmw foreign, vce(robust) Linear regression gpmw Coef. foreign _cons .2461526 1.609004 Robust Std. Err. .0679238 .0234535 Number of obs = F( 1, 72) = Prob > F = R-squared = Root MSE = t 3.62 68.60 74 13.13 0.0005 0.2180 .21605 P>|t| [95% Conf. Interval] 0.001 0.000 .1107489 1.56225 .3815563 1.655758 The point estimates are the same (foreign cars need one-quarter gallon more gas), but the standard errors differ by roughly 20%. Conventional regression reports the 95% confidence interval as [ 0.14, 0.36 ], whereas the robust standard errors make the interval [ 0.11, 0.38 ]. Which is right? Notice that gpmw is a variable with considerable heteroskedasticity: . tabulate foreign, summarize(gpmw) Summary of gpmw Mean Std. Dev. Car type Freq. Domestic Foreign 1.6090039 1.8551565 .16845182 .30186861 52 22 Total 1.6821844 .24263113 74 Thus here we favor the robust standard errors. In [U] 20.21 Obtaining robust variance estimates, we show another example using linear regression where it makes little difference whether we specify vce(robust). The linear-regression assumptions were true, and we obtained nearly linear-regression results. The advantage of the robust estimate is that in neither case did we have to check assumptions. 
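When deciding between the conventional and robust variance estimators, it can help to see the standard errors side by side. A minimal sketch, assuming gpmw has been generated as in example 5 (output omitted):

. quietly regress gpmw foreign
. estimates store conventional
. quietly regress gpmw foreign, vce(robust)
. estimates store robust
. estimates table conventional robust, se b(%9.4f)    // coefficients with standard errors beneath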
Technical note regress purposefully suppresses displaying the ANOVA table when vce(robust) is specified, as it is no longer appropriate in a statistical sense, even though, mechanically, the numbers would be unchanged. That is, sums of squares remain unchanged, but the meaning of those sums is no longer relevant. The F statistic, for instance, is no longer based on sums of squares; it becomes a Wald test based on the robustly estimated variance matrix. Nevertheless, regress continues to report the R2 1856 regress — Linear regression and the root MSE even though both numbers are based on sums of squares and are, strictly speaking, irrelevant. In this, the root MSE is more in violation of the spirit of the robust estimator than is R2 . As a goodness-of-fit statistic, R2 is still fine; just do not use it in formulas to obtain F statistics because those formulas no longer apply. The root MSE is valid in a literal sense — it is the square root of the mean squared error, but it is no longer an estimate of σ because there is no single σ ; the variance of the residual varies observation by observation. Example 6: Alternative robust standard errors The vce(hc2) and vce(hc3) options modify the robust variance calculation. In the context of linear regression without clustering, the idea behind the robust calculation is somehow to measure σj2 , the variance of the residual associated with the j th observation, and then to use that estimate b Because residuals have (theoretically and practically) mean to improve the estimated variance of β. 0, one estimate of σj2 is the observation’s squared residual itself — u2j . A finite-sample correction could improve that by multiplying u2j by n/(n − k), and, as a matter of fact, vce(robust) uses {n/(n − k)}u2j as its estimate of the residual’s variance. vce(hc2) and vce(hc3) use alternative estimators of the observation-specific variances. For instance, if the residuals are homoskedastic, we can show that the expected value of u2j is σ 2 (1 − hjj ), where hjj is the j th diagonal element of the projection (hat) matrix. hjj has average value k/n, so 1 −hjj has average value 1 −k/n = (n−k)/n. Thus the default robust estimator σ bj = {n/(n−k)}u2j 2 amounts to dividing uj by the average of the expectation. vce(hc2) divides u2j by 1 − hjj itself, so it should yield better estimates if the residuals really are homoskedastic. vce(hc3) divides u2j by (1 − hjj )2 and has no such clean interpretation. Davidson and MacKinnon (1993) show that u2j /(1 − hjj )2 approximates a more complicated estimator that they obtain by jackknifing (MacKinnon and White 1985). Angrist and Pischke (2009) also illustrate the relative merits of these adjustments. Here are the results of refitting our efficiency model using vce(hc2) and vce(hc3): . regress gpmw foreign, vce(hc2) Linear regression gpmw Coef. foreign _cons .2461526 1.609004 Robust HC2 Std. Err. .0684669 .0233601 Number of obs = F( 1, 72) = Prob > F = R-squared = Root MSE = t 3.60 68.88 74 12.93 0.0006 0.2180 .21605 P>|t| [95% Conf. Interval] 0.001 0.000 .1096662 1.562437 .3826389 1.655571 regress — Linear regression . regress gpmw foreign, vce(hc3) Linear regression gpmw Coef. foreign _cons .2461526 1.609004 Robust HC3 Std. Err. .069969 .023588 Number of obs = F( 1, 72) = Prob > F = R-squared = Root MSE = t 3.52 68.21 1857 74 12.38 0.0008 0.2180 .21605 P>|t| [95% Conf. 
Interval] 0.001 0.000 .1066719 1.561982 .3856332 1.656026 Example 7: Standard errors for clustered data The vce(cluster clustvar) option relaxes the assumption of independence. Below we have 28,534 observations on 4,711 women aged 14–46 years. Data were collected on these women between 1968 and 1988. We are going to fit a classic earnings model, and we begin by ignoring that each woman appears an average of 6.057 times in the data. . use http://www.stata-press.com/data/r13/regsmpl, clear (NLS Women 14-26 in 1968) . regress ln_wage age c.age#c.age tenure SS df MS Number of obs Source F( 3, 28097) Model 1054.52501 3 351.508335 Prob > F Residual 5360.43962 28097 .190783344 R-squared Adj R-squared 6414.96462 28100 .228290556 Root MSE Total ln_wage Coef. Std. Err. age .0752172 .0034736 c.age#c.age -.0010851 tenure _cons .0390877 .3339821 t = 28101 = 1842.45 = 0.0000 = 0.1644 = 0.1643 = .43679 P>|t| [95% Conf. Interval] 21.65 0.000 .0684088 .0820257 .0000575 -18.86 0.000 -.0011979 -.0009724 .0007743 .0504413 50.48 6.62 0.000 0.000 .0375699 .2351148 .0406054 .4328495 The number of observations in our model is 28,101 because Stata drops observations that have a missing value for one or more of the variables in the model. We can be reasonably certain that the standard errors reported above are meaningless. Without a doubt, a woman with higher-than-average wages in one year typically has higher-than-average wages in other years, and so the residuals are not independent. One way to deal with this would be to fit a random-effects model — and we are going to do that — but first we fit the model using regress specifying vce(cluster id), which treats only observations with different person ids as truly independent: 1858 regress — Linear regression . regress ln_wage age c.age#c.age tenure, vce(cluster id) Linear regression Number of obs F( 3, 4698) Prob > F R-squared Root MSE = = = = = 28101 748.82 0.0000 0.1644 .43679 (Std. Err. adjusted for 4699 clusters in idcode) Robust Std. Err. ln_wage Coef. age .0752172 .0045711 c.age#c.age -.0010851 tenure _cons .0390877 .3339821 t P>|t| [95% Conf. Interval] 16.45 0.000 .0662557 .0841788 .0000778 -13.94 0.000 -.0012377 -.0009325 .0014425 .0641918 27.10 5.20 0.000 0.000 .0362596 .208136 .0419157 .4598282 For comparison, we focus on the tenure coefficient, which in economics jargon can be interpreted as the rate of return for keeping your job. The 95% confidence interval we previously estimated — an interval we do not believe — is [ 0.038, 0.041 ]. The robust interval is twice as wide, being [ 0.036, 0.042 ]. As we said, one correct way to fit this model is by random-effects regression. Here is the random-effects result: . xtreg ln_wage age c.age#c.age tenure, re Random-effects GLS regression Group variable: idcode R-sq: Number of obs Number of groups within = 0.1370 between = 0.2154 overall = 0.1608 corr(u_i, X) = = 28101 4699 Obs per group: min = avg = max = 1 6.0 15 Wald chi2(3) Prob > chi2 = 0 (assumed) ln_wage Coef. Std. Err. age .0568296 .0026958 c.age#c.age -.0007566 tenure _cons .0260135 .6136792 sigma_u sigma_e rho .33542449 .29674679 .56095413 z = = 4717.05 0.0000 P>|z| [95% Conf. Interval] 21.08 0.000 .0515459 .0621132 .0000447 -16.93 0.000 -.0008441 -.000669 .0007477 .0394611 34.79 15.55 0.000 0.000 .0245481 .5363368 .0274789 .6910216 (fraction of variance due to u_i) Robust regression estimated the 95% interval [ 0.036, 0.042 ], and xtreg (see [XT] xtreg) estimates [ 0.025, 0.027 ]. Which is better? The random-effects regression estimator assumes a lot. 
We can check some of these assumptions by performing a Hausman test. Using estimates (see [R] estimates store), we store the random-effects estimation results, and then we run the required fixed-effects regression to perform the test. regress — Linear regression . estimates store random . xtreg ln_wage age c.age#c.age tenure, fe Fixed-effects (within) regression Group variable: idcode R-sq: Number of obs Number of groups within = 0.1375 between = 0.2066 overall = 0.1568 corr(u_i, Xb) ln_wage Coef. age .0522751 .002783 c.age#c.age -.0006717 tenure _cons .021738 .687178 sigma_u sigma_e rho .38743138 .29674679 .6302569 Std. Err. t Test: .0522751 -.0006717 .021738 Obs per group: min = avg = max = 1 6.0 15 = = 1243.00 0.0000 P>|t| [95% Conf. Interval] 18.78 0.000 .0468202 .05773 .0000461 -14.56 0.000 -.0007621 -.0005813 .000799 .0405944 27.21 16.93 0.000 0.000 .020172 .6076103 .023304 .7667456 (fraction of variance due to u_i) F(4698, 23399) = Coefficients (b) (B) . random age c.age#c.age tenure 28101 4699 F(3,23399) Prob > F = 0.1380 F test that all u_i=0: . hausman . random = = 1859 .0568296 -.0007566 .0260135 7.98 Prob > F = 0.0000 (b-B) Difference sqrt(diag(V_b-V_B)) S.E. -.0045545 .0000849 -.0042756 .0006913 .0000115 .0002816 b = consistent under Ho and Ha; obtained from xtreg B = inconsistent under Ha, efficient under Ho; obtained from xtreg Ho: difference in coefficients not systematic chi2(3) = (b-B)’[(V_b-V_B)^(-1)](b-B) = 336.62 Prob>chi2 = 0.0000 The Hausman test casts grave suspicions on the random-effects model we just fit, so we should be careful in interpreting those results. Meanwhile, our robust regression results still stand, as long as we are careful about the interpretation. The correct interpretation is that, if the data collection were repeated (on women sampled the same way as in the original sample), and if we were to refit the model, 95% of the time we would expect the estimated coefficient on tenure to be in the range [ 0.036, 0.042 ]. Even with robust regression, we must be careful about going beyond that statement. Here the Hausman test is probably picking up something that differs within and between person, which would cast doubt on our robust regression model in terms of interpreting [ 0.036, 0.042 ] to contain the rate of return for keeping a job, economywide, for all women, without exception. 1860 regress — Linear regression Weighted regression regress can perform weighted and unweighted regression. We indicate the weight by specifying the [weight] qualifier. By default, regress assumes analytic weights; see the technical note below. Example 8: Using means as regression variables We have census data recording the death rate (drate) and median age (medage) for each state. The data also record the region of the country in which each state is located and the overall population of the state: . use http://www.stata-press.com/data/r13/census9 (1980 Census data by state) . describe Contains data from http://www.stata-press.com/data/r13/census9.dta obs: 50 1980 Census data by state vars: 6 6 Apr 2013 15:43 size: 1,450 variable name state state2 drate pop medage region storage type display format str14 str2 float long float byte %-14s %-2s %9.0g %12.0gc %9.2f %-8.0g value label variable label cenreg State Two-letter state abbreviation Death Rate Population Median age Census region Sorted by: We can use factor variables to include dummy variables for region. 
Because the variables in the regression reflect means rather than individual observations, the appropriate method of estimation is analytically weighted least squares (Davidson and MacKinnon 2004, 261–262), where the weight is total population:

. regress drate medage i.region [w=pop]
(analytic weights assumed)
(sum of wgt is   2.2591e+08)

      Source |       SS        df       MS             Number of obs =      50
       Model |   4096.6093      4  1024.15232          F(  4,    45) =   37.21
    Residual |  1238.40987     45  27.5202192          Prob > F      =  0.0000
       Total |  5335.01916     49  108.877942          R-squared     =  0.7679
                                                       Adj R-squared =  0.7472
                                                       Root MSE      =   5.246

       drate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      medage |   4.283183   .5393329     7.94   0.000     3.196911    5.369455
      region |
     N Cntrl |   .3138738   2.456431     0.13   0.899    -4.633632     5.26138
       South |  -1.438452   2.320244    -0.62   0.538    -6.111663    3.234758
        West |  -10.90629   2.681349    -4.07   0.000    -16.30681   -5.505777
       _cons |  -39.14727   17.23613    -2.27   0.028    -73.86262   -4.431915

To weight the regression by population, we added the qualifier [w=pop] to the end of the regress command. Our qualifier was vague (we did not say [aweight=pop]), but unless told otherwise, Stata assumes analytic weights for regress. Stata informed us that the sum of the weight is 2.2591 × 10^8; there were approximately 226 million people residing in the United States according to our 1980 data.

Technical note

Once we fit a weighted regression, we can obtain the appropriately weighted variance–covariance matrix of the estimators using estat vce and perform appropriately weighted hypothesis tests using test.

In the weighted regression in example 8, we see that 4.region is statistically significant but that 2.region and 3.region are not. We use test to test the joint significance of the region variables:

. test 2.region 3.region 4.region
 ( 1)  2.region = 0
 ( 2)  3.region = 0
 ( 3)  4.region = 0
       F(  3,    45) =    9.84
            Prob > F =    0.0000

The results indicate that the region variables are jointly significant.

regress also accepts frequency weights (fweights). Frequency weights are appropriate when the data do not reflect cell means, but instead represent replicated observations. Specifying aweights or fweights will not change the parameter estimates, but it will change the corresponding significance levels. For instance, if we specified [fweight=pop] in the weighted regression example above (which would be statistically incorrect), Stata would treat the data as if the data represented 226 million independent observations on death rates and median age. The data most certainly do not represent that; they represent 50 observations on state averages.

With aweights, Stata treats the number of observations on the process as the number of observations in the data. When we specify fweights, Stata treats the number of observations as if it were equal to the sum of the weights; see Methods and formulas below.

Technical note

A popular request on the help line is to describe the effect of specifying [aweight=exp] with regress in terms of transformation of the dependent and independent variables. The mechanical answer is that typing

. regress y x1 x2 [aweight=n]

is equivalent to fitting the model

$$ y_j\sqrt{n_j} \;=\; \beta_0\sqrt{n_j} + \beta_1 x_{1j}\sqrt{n_j} + \beta_2 x_{2j}\sqrt{n_j} + u_j\sqrt{n_j} $$

This regression will reproduce the coefficients and covariance matrix produced by the aweighted regression. The mean squared errors (estimates of the variance of the residuals) will, however, be different. The transformed regression reports $s_t^2$, an estimate of $\mathrm{Var}(u_j\sqrt{n_j})$. The aweighted regression reports $s_a^2$, an estimate of $\mathrm{Var}(u_j\sqrt{n_j}\sqrt{N/\sum_k n_k})$, where $N$ is the number of observations. Thus

$$ s_a^2 \;=\; \frac{N}{\sum_k n_k}\, s_t^2 \;=\; \frac{s_t^2}{\bar{n}} \tag{1} $$
The aweighted p P √ regression reports s2a , an estimate of Var(uj nj N/ k nk ), where N is the number of observations. Thus N s2 s2a = P s2t = t (1) n k nk 1862 regress — Linear regression The logic for this adjustment is as follows: Consider the model y = β0 + β1 x1 + β2 x2 + u Assume that, were this model fit on individuals, Var(u) = σu2 , a constant. Assume that individual data are not available; what is available are averages (y j , x1j , x2j ) for j = 1, . . . , N , and each average is calculated over nj observations. Then it is still true that y j = β0 + β1 x1j + β2 x2j + uj where uj is the average of nj mean 0, variance σu2 deviates and has variance σu2 = σu2 /nj . Thus √ multiplying through by nj produces √ √ √ √ √ y j nj = β0 nj + β1 x1j nj + β2 x2j nj + uj nj √ and Var(uj nj ) = σu2 . The mean squared error, s2t , reported by fitting this transformed regression is an estimate of σu2 . The coefficients and covariance matrix could also be obtained by aweighted regress. The only difference would be in the reported mean squared error, which from P (1) is σu2 /n. On average, each observation in the data reflects the averages calculated over n = k nk /N individuals, and thus this reported mean squared error is the average variance of an observation in the dataset. We can retrieve the estimate of σu2 by multiplying the reported mean squared error by n. More generally, aweights are used to solve general heteroskedasticity problems. In these cases, we have the model yj = β0 + β1 x1j + β2 x2j + uj and the variance of uj is thought to be proportional to aj . If the variance is proportional to aj , it is also proportional to αaj , where αPis any positive constant. Not quite arbitrarily, but with no loss of generality, we could choose α = k (1/ak )/N , the average value of the inverse of aj . We can then write Var(uj ) = kαaj σ 2 , where k is the constant of proportionality that is no longer a function of the scale of the weights. √ Dividing this regression through by the aj , √ √ √ √ √ yj / aj = β0 / aj + β1 x1j / aj + β2 x2j / aj + uj / aj √ produces a model with Var(uj / aj ) = kασ 2 , which is the constant part of Var(uj ). This variance is a function of α, the average of the reciprocal weights; if the weights are scaled arbitrarily, then so is this variance. We can also fit this model by typing . regress y x1 x2 [aweight=1/a] This input will produce the same estimates of the coefficients and covariance matrix; the reported P mean squared error is, from (1), N/ k (1/ak ) kασ 2 = kσ 2 . This variance is independent of the scale of aj . regress — Linear regression 1863 Instrumental variables and two-stage least-squares regression An alternate syntax for regress can be used to produce instrumental-variables (two-stage least squares) estimates. regress depvar varlist1 (varlist2 ) if in weight , regress options ] This syntax is used mainly by programmers developing estimators using the instrumental-variables estimates as intermediate results. ivregress is normally used to directly fit these models; see [R] ivregress. With this syntax, regress fits a structural equation of depvar on varlist1 using instrumental variables regression; (varlist2 ) indicates the list of instrumental variables. With the exception of vce(hc2) and vce(hc3), all standard regress options are allowed. 
Video example

Simple linear regression in Stata

Stored results

regress stores the following in e():

Scalars
    e(N)               number of observations
    e(mss)             model sum of squares
    e(df_m)            model degrees of freedom
    e(rss)             residual sum of squares
    e(df_r)            residual degrees of freedom
    e(r2)              R-squared
    e(r2_a)            adjusted R-squared
    e(F)               F statistic
    e(rmse)            root mean squared error
    e(ll)              log likelihood under additional assumption of i.i.d. normal errors
    e(ll_0)            log likelihood, constant-only model
    e(N_clust)         number of clusters
    e(rank)            rank of e(V)

Macros
    e(cmd)             regress
    e(cmdline)         command as typed
    e(depvar)          name of dependent variable
    e(model)           ols or iv
    e(wtype)           weight type
    e(wexp)            weight expression
    e(title)           title in estimation output when vce() is not ols
    e(clustvar)        name of cluster variable
    e(vce)             vcetype specified in vce()
    e(vcetype)         title used to label Std. Err.
    e(properties)      b V
    e(estat_cmd)       program used to implement estat
    e(predict)         program used to implement predict
    e(marginsok)       predictions allowed by margins
    e(asbalanced)      factor variables fvset as asbalanced
    e(asobserved)      factor variables fvset as asobserved

Matrices
    e(b)               coefficient vector
    e(V)               variance–covariance matrix of the estimators
    e(V_modelbased)    model-based variance

Functions
    e(sample)          marks estimation sample

Methods and formulas

Methods and formulas are presented under the following headings:

    Coefficient estimation and ANOVA table
    A general notation for the robust variance calculation
    Robust calculation for regress

Coefficient estimation and ANOVA table

Variables printed in lowercase and not boldfaced (for example, x) are scalars. Variables printed in lowercase and boldfaced (for example, x) are column vectors. Variables printed in uppercase and boldfaced (for example, X) are matrices.

Let v be a column vector of weights specified by the user. If no weights are specified, v = 1. Let w be a column vector of normalized weights. If no weights are specified or if the user specified fweights or iweights, w = v; otherwise, w = {v/(1'v)}(1'1). The number of observations, n, is defined as 1'w. For iweights, this is truncated to an integer. The sum of the weights is 1'v. Define c = 1 if there is a constant in the regression and zero otherwise. Define k as the number of right-hand-side variables (including the constant).

Let X denote the matrix of observations on the right-hand-side variables, y the vector of observations on the left-hand-side variable, and Z the matrix of observations on the instruments. If the user specifies no instruments, then Z = X. In the following formulas, if the user specifies weights, then X'X, X'y, y'y, Z'Z, Z'X, and Z'y are replaced by X'DX, X'Dy, y'Dy, Z'DZ, Z'DX, and Z'Dy, respectively, where D is a diagonal matrix whose diagonal elements are the elements of w. We suppress the D below to simplify the notation.

If no instruments are specified, define A as X'X and a as X'y. Otherwise, define A as X'Z(Z'Z)^{-1}(X'Z)' and a as X'Z(Z'Z)^{-1}Z'y.

The coefficient vector b is defined as A^{-1}a. Although not shown in the notation, unless hascons is specified, A and a are accumulated in deviation form and the constant is calculated separately. This comment applies to all statistics listed below.

The total sum of squares, TSS, equals y'y if there is no intercept and y'y - (1'y)^2/n otherwise. The degrees of freedom is n - c.

The error sum of squares, ESS, is defined as y'y - 2b'X'y + b'X'Xb if there are instruments and as y'y - b'X'y otherwise. The degrees of freedom is n - k.
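The definition b = A^{-1}a for the no-instrument case can be verified directly with Stata's matrix commands. A minimal sketch, assuming the unweighted auto-data model from example 1; matrix accum and matrix vecaccum form X'X and y'X with the constant appended as _cons, so here the moment matrices are accumulated in absolute rather than deviation form:

. sysuse auto, clear
. matrix accum A = weight foreign             // A = X'X, constant included as _cons
. matrix vecaccum ya = mpg weight foreign     // ya = y'X, a 1 x k row vector
. matrix b = ya * invsym(A)                   // b' = (y'X)(X'X)^(-1), the transpose of A^(-1)a
. matrix list b
. regress mpg weight foreign                  // the reported coefficients should match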
The model sum of squares, MSS, equals TSS - ESS. The degrees of freedom is k - c.

The mean squared error, s^2, is defined as ESS/(n - k). The root mean squared error is s, its square root.

The F statistic with k - c and n - k degrees of freedom is defined as

$$ F = \frac{\mathrm{MSS}}{(k-c)\,s^2} $$

if no instruments are specified. If instruments are specified and c = 1, then F is defined as

$$ F = \frac{(\mathbf{b}-\mathbf{c})'\,\mathbf{A}\,(\mathbf{b}-\mathbf{c})}{(k-1)\,s^2} $$

where c is a vector of k - 1 zeros and kth element 1'y/n. Otherwise, F is defined as "missing". (Here you may use the test command to construct any F test that you wish.)

The R-squared, R^2, is defined as R^2 = 1 - ESS/TSS.

The adjusted R-squared, R_a^2, is 1 - (1 - R^2)(n - c)/(n - k).

If vce(robust) is not specified, the conventional estimate of variance is s^2 A^{-1}. The handling of vce(robust) is described below.

A general notation for the robust variance calculation

Put aside all context of linear regression and the notation that goes with it; we will return to it. First, we are going to establish a notation for describing robust variance calculations.

The calculation formula for the robust variance calculation is

$$ \widehat{\mathcal{V}} = q_c\, \widehat{\mathbf{V}} \Biggl( \sum_{k=1}^{M} \mathbf{u}_k^{(G)\prime}\, \mathbf{u}_k^{(G)} \Biggr) \widehat{\mathbf{V}} $$

where

$$ \mathbf{u}_k^{(G)} = \sum_{j \in G_k} w_j\, \mathbf{u}_j $$

G_1, G_2, ..., G_M are the clusters specified by vce(cluster clustvar), and w_j are the user-specified weights, normalized if aweights or pweights are specified and equal to 1 if no weights are specified.

For fweights without clusters, the variance formula is

$$ \widehat{\mathcal{V}} = q_c\, \widehat{\mathbf{V}} \Biggl( \sum_{j=1}^{N} w_j\, \mathbf{u}_j' \mathbf{u}_j \Biggr) \widehat{\mathbf{V}} $$

which is the same as expanding the dataset and making the calculation on the unweighted data.

If vce(cluster clustvar) is not specified, M = N, and each cluster contains 1 observation.

The inputs into this calculation are

  • $\widehat{\mathbf{V}}$, which is typically a conventionally calculated variance matrix;
  • $\mathbf{u}_j$, j = 1, ..., N, a row vector of scores; and
  • $q_c$, a constant finite-sample adjustment.

Thus we can now describe how estimators apply the robust calculation formula by defining $\widehat{\mathbf{V}}$, $\mathbf{u}_j$, and $q_c$.

Two definitions are popular enough for $q_c$ to deserve a name. The regression-like formula for $q_c$ (Fuller et al. 1986) is

$$ q_c = \frac{N-1}{N-k}\,\frac{M}{M-1} $$

where M is the number of clusters and N is the number of observations. For weights, N refers to the sum of the weights if weights are frequency weights and the number of observations in the dataset (ignoring weights) in all other cases. Also note that, weighted or not, M = N when vce(cluster clustvar) is not specified, and then q_c = N/(N - k).

The asymptotic-like formula for $q_c$ is

$$ q_c = \frac{M}{M-1} $$

where M = N if vce(cluster clustvar) is not specified.
Acknowledgments The robust estimate of variance was first implemented in Stata by Mead Over of the Center for Global Development, Dean Jolliffe of the World Bank, and Andrew Foster of the Department of Economics at Brown University (Over, Jolliffe, and Foster 1996). regress — Linear regression 1867 The history of regression is long and complicated: the books by Stigler (1986) and Hald (1998) are devoted largely to the story. Legendre published first on least squares in 1805. Gauss published later in 1809, but he had the idea earlier. Gauss, and especially Laplace, tied least squares to a normal errors assumption. The idea of the normal distribution can itself be traced back to De Moivre in 1733. Laplace discussed a variety of other estimation methods and error assumptions over his long career, while linear models long predate either innovation. Most of this work was linked to problems in astronomy and geodesy. A second wave of ideas started when Galton used graphical and descriptive methods on data bearing on heredity to develop what he called regression. His term reflects the common phenomenon that characteristics of offspring are positively correlated with those of parents but with regression slope such that offspring “regress toward the mean”. Galton’s work was rather intuitive: contributions from Pearson, Edgeworth, Yule, and others introduced more formal machinery, developed related ideas on correlation, and extended application into the biological and social sciences. So most of the elements of regression as we know it were in place by 1900. Pierre-Simon Laplace (1749–1827) was born in Normandy and was early recognized as a remarkable mathematician. He weathered a changing political climate well enough to rise to Minister of the Interior under Napoleon in 1799 (although only for 6 weeks) and to be made a Marquis by Louis XVIII in 1817. He made many contributions to mathematics and physics, his two main interests being theoretical astronomy and probability theory (including statistics). Laplace transforms are named for him. Adrien-Marie Legendre (1752–1833) was born in Paris (or possibly in Toulouse) and educated in mathematics and physics. He worked in number theory, geometry, differential equations, calculus, function theory, applied mathematics, and geodesy. The Legendre polynomials are named for him. His main contribution to statistics is as one of the discoverers of least squares. He died in poverty, having refused to bow to political pressures. Johann Carl Friedrich Gauss (1777–1855) was born in Braunschweig (Brunswick), now in Germany. He studied there and at Göttingen. His doctoral dissertation at the University of Helmstedt was a discussion of the fundamental theorem of algebra. He made many fundamental contributions to geometry, number theory, algebra, real analysis, differential equations, numerical analysis, statistics, astronomy, optics, geodesy, mechanics, and magnetism. An outstanding genius, Gauss worked mostly in isolation in Göttingen. Francis Galton (1822–1911) was born in Birmingham, England, into a well-to-do family with many connections: he and Charles Darwin were first cousins. After an unsuccessful foray into medicine, he became independently wealthy at the death of his father. Galton traveled widely in Europe, the Middle East, and Africa, and became celebrated as an explorer and geographer. His pioneering work on weather maps helped in the identification of anticyclones, which he named. 
From about 1865, most of his work was centered on quantitative problems in biology, anthropology, and psychology. In a sense, Galton (re)invented regression, and he certainly named it. Galton also promoted the normal distribution, correlation approaches, and the use of median and selected quantiles as descriptive statistics. He was knighted in 1909. References Adkins, L. C., and R. C. Hill. 2011. Using Stata for Principles of Econometrics. 4th ed. Hoboken, NJ: Wiley. Alexandersson, A. 1998. gr32: Confidence ellipses. Stata Technical Bulletin 46: 10–13. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 54–57. College Station, TX: Stata Press. 1868 regress — Linear regression Angrist, J. D., and J.-S. Pischke. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton, NJ: Princeton University Press. Becketti, S. 2013. Introduction to Time Series Using Stata. College Station, TX: Stata Press. Cameron, A. C., and P. K. Trivedi. 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press. Chatterjee, S., and A. S. Hadi. 2012. Regression Analysis by Example. 5th ed. New York: Hoboken, NJ. Davidson, R., and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press. . 2004. Econometric Theory and Methods. New York: Oxford University Press. Dohoo, I., W. Martin, and H. Stryhn. 2010. Veterinary Epidemiologic Research. 2nd ed. Charlottetown, Prince Edward Island: VER Inc. . 2012. Methods in Epidemiologic Research. Charlottetown, Prince Edward Island: VER Inc. Draper, N., and H. Smith. 1998. Applied Regression Analysis. 3rd ed. New York: Wiley. Dunnington, G. W. 1955. Gauss: Titan of Science. New York: Hafner Publishing. Duren, P. 2009. Changing faces: The mistaken portrait of Legendre. Notices of the American Mathematical Society 56: 1440–1443. Filoso, V. 2013. Regression anatomy, revealed. Stata Journal 13: 92–106. Fuller, W. A., W. J. Kennedy, Jr., D. Schnell, G. Sullivan, and H. J. Park. 1986. PC CARP. Software package. Ames, IA: Statistical Laboratory, Iowa State University. Gillham, N. W. 2001. A Life of Sir Francis Galton: From African Exploration to the Birth of Eugenics. New York: Oxford University Press. Gillispie, C. C. 1997. Pierre-Simon Laplace, 1749–1827: A Life in Exact Science. Princeton: Princeton University Press. Gould, W. W. 2011a. Understanding matrices intuitively, part 1. The Stata Blog: Not Elsewhere Classified. http://blog.stata.com/2011/03/03/understanding-matrices-intuitively-part-1/. . 2011b. Use poisson rather than regress; tell a friend. The Stata Blog: Not Elsewhere Classified. http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/. Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. Hald, A. 1998. A History of Mathematical Statistics from 1750 to 1930. New York: Wiley. Hamilton, L. C. 2013. Statistics with Stata: Updated for Version 12. 8th ed. Boston: Brooks/Cole. Hill, R. C., W. E. Griffiths, and G. C. Lim. 2011. Principles of Econometrics. 4th ed. Hoboken, NJ: Wiley. Kmenta, J. 1997. Elements of Econometrics. 2nd ed. Ann Arbor: University of Michigan Press. Kohler, U., and F. Kreuter. 2012. Data Analysis Using Stata. 3rd ed. College Station, TX: Stata Press. Long, J. S., and J. Freese. 2000. sg152: Listing and interpreting transformed coefficients from certain regression models. Stata Technical Bulletin 57: 27–34. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 231–240. College Station, TX: Stata Press. 
MacKinnon, J. G., and H. L. White, Jr. 1985. Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties. Journal of Econometrics 29: 305–325. Mitchell, M. N. 2012. Interpreting and Visualizing Regression Models Using Stata. College Station, TX: Stata Press. Mosteller, C. F., and J. W. Tukey. 1977. Data Analysis and Regression: A Second Course in Statistics. Reading, MA: Addison–Wesley. Over, M., D. Jolliffe, and A. Foster. 1996. sg46: Huber correction for two-stage least squares estimates. Stata Technical Bulletin 29: 24–25. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 140–142. College Station, TX: Stata Press. Peracchi, F. 2001. Econometrics. Chichester, UK: Wiley. Plackett, R. L. 1972. Studies in the history of probability and statistics: XXIX. The discovery of the method of least squares. Biometrika 59: 239–251. Rogers, W. H. 1991. smv2: Analyzing repeated measurements—some practical alternatives. Stata Technical Bulletin 4: 10–16. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 123–131. College Station, TX: Stata Press. regress — Linear regression 1869 Royston, P., and G. Ambler. 1998. sg79: Generalized additive models. Stata Technical Bulletin 42: 38–43. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 217–224. College Station, TX: Stata Press. Schonlau, M. 2005. Boosted regression (boosting): An introductory tutorial and a Stata plugin. Stata Journal 5: 330–354. Stigler, S. M. 1986. The History of Statistics: The Measurement of Uncertainty before 1900. Cambridge, MA: Belknap Press. Tyler, J. H. 1997. sg73: Table making programs. Stata Technical Bulletin 40: 18–23. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 186–192. College Station, TX: Stata Press. Weesie, J. 1998. sg77: Regression analysis with multiplicative heteroscedasticity. Stata Technical Bulletin 42: 28–32. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 204–210. College Station, TX: Stata Press. Weisberg, S. 2005. Applied Linear Regression. 3rd ed. New York: Wiley. Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press. . 2013. Introductory Econometrics: A Modern Approach. 5th ed. Mason, OH: South-Western. Zimmerman, F. 1998. sg93: Switching regressions. Stata Technical Bulletin 45: 30–33. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 183–186. College Station, TX: Stata Press. 
Also see [R] regress postestimation — Postestimation tools for regress [R] regress postestimation diagnostic plots — Postestimation plots for regress [R] regress postestimation time series — Postestimation tools for regress with time series [R] anova — Analysis of variance and covariance [R] contrast — Contrasts and linear hypothesis tests after estimation [MI] estimation — Estimation commands for use with mi estimate [SEM] example 6 — Linear regression [SEM] intro 5 — Tour of models [SVY] svy estimation — Estimation commands for survey data [TS] forecast — Econometric model forecasting [U] 20 Estimation and postestimation commands Title regress postestimation — Postestimation tools for regress Description Tests for violation of assumptions Methods and formulas Also see Predictions Variance inflation factors Acknowledgments DFBETA influence statistics Measures of effect size References Description The following postestimation commands are of special interest after regress: Command Description dfbeta estat hettest estat imtest estat ovtest estat szroeter estat vif estat esize DFBETA influence statistics tests for heteroskedasticity information matrix test Ramsey regression specification-error test for omitted variables Szroeter’s rank test for heteroskedasticity variance inflation factors for the independent variables η 2 and ω 2 effect sizes These commands are not appropriate after the svy prefix. 1870 regress postestimation — Postestimation tools for regress 1871 The following standard postestimation commands are also available: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 hausman lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations Hausman’s specification test point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 forecast is not appropriate with mi or svy estimation results. 2 lrtest is not appropriate with svy estimation results. 
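As a quick illustration of how these commands are typically used, a minimal sketch based on the model from example 1 of [R] regress (output omitted):

. sysuse auto, clear
. regress mpg weight foreign
. estat hettest     // test for heteroskedasticity
. estat ovtest      // Ramsey regression specification-error test for omitted variables
. estat vif         // variance inflation factors for the independent variables
. estat esize       // eta-squared and omega-squared effect sizes
. dfbeta            // DFBETA influence statistics, stored as new variables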
Predictions

Syntax for predict

    predict [type] newvar [if] [in] [, statistic]

statistic            Description
-----------------------------------------------------------------------------------
Main
  xb                 linear prediction; the default
  residuals          residuals
  score              score; equivalent to residuals
  rstandard          standardized residuals
  rstudent           Studentized (jackknifed) residuals
  cooksd             Cook's distance
  leverage | hat     leverage (diagonal elements of hat matrix)
  pr(a,b)            Pr(a < yj < b)
  e(a,b)             E(yj | a < yj < b)
  ystar(a,b)         E(yj*), yj* = max{a, min(yj, b)}
* dfbeta(varname)    DFBETA for varname
  stdp               standard error of the linear prediction
  stdf               standard error of the forecast
  stdr               standard error of the residual
* covratio           COVRATIO
* dfits              DFITS
* welsch             Welsch distance
-----------------------------------------------------------------------------------
Unstarred statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample. Starred statistics are calculated only for the estimation sample, even when if e(sample) is not specified.
rstandard, rstudent, cooksd, leverage, dfbeta(), stdf, stdr, covratio, dfits, and welsch are not available if any vce() other than vce(ols) was specified with regress.
xb, residuals, score, and stdp are the only options allowed with svy estimation results.
where a and b may be numbers or variables; a missing (a ≥ .) means −∞, and b missing (b ≥ .) means +∞; see [U] 12.2.1 Missing values.

Menu for predict

Statistics > Postestimation > Predictions, residuals, etc.

Options for predict

Main

xb, the default, calculates the linear prediction.

residuals calculates the residuals.

score is equivalent to residuals in linear regression.

rstandard calculates the standardized residuals.

rstudent calculates the Studentized (jackknifed) residuals.

cooksd calculates the Cook's D influence statistic (Cook 1977).

leverage or hat calculates the diagonal elements of the projection ("hat") matrix.

pr(a,b) calculates Pr(a < x_j b + u_j < b), the probability that yj|xj would be observed in the interval (a, b).
    a and b may be specified as numbers or variable names; lb and ub are variable names;
    pr(20,30) calculates Pr(20 < x_j b + u_j < 30);
    pr(lb,ub) calculates Pr(lb < x_j b + u_j < ub);
    and pr(20,ub) calculates Pr(20 < x_j b + u_j < ub).
    a missing (a ≥ .) means −∞; pr(.,30) calculates Pr(−∞ < x_j b + u_j < 30);
    pr(lb,30) calculates Pr(−∞ < x_j b + u_j < 30) in observations for which lb ≥ . and calculates Pr(lb < x_j b + u_j < 30) elsewhere.
    b missing (b ≥ .) means +∞; pr(20,.) calculates Pr(+∞ > x_j b + u_j > 20);
    pr(20,ub) calculates Pr(+∞ > x_j b + u_j > 20) in observations for which ub ≥ . and calculates Pr(20 < x_j b + u_j < ub) elsewhere.

e(a,b) calculates E(x_j b + u_j | a < x_j b + u_j < b), the expected value of yj|xj conditional on yj|xj being in the interval (a, b), meaning that yj|xj is truncated. a and b are specified as they are for pr().

ystar(a,b) calculates E(yj*), where yj* = a if x_j b + u_j ≤ a, yj* = b if x_j b + u_j ≥ b, and yj* = x_j b + u_j otherwise, meaning that yj* is censored. a and b are specified as they are for pr().

dfbeta(varname) calculates the DFBETA for varname, the difference between the regression coefficient when the jth observation is included and excluded, said difference being scaled by the estimated standard error of the coefficient. varname must have been included among the regressors in the previously fitted model. The calculation is automatically restricted to the estimation subsample.
stdp calculates the standard error of the prediction, which can be thought of as the standard error of the predicted expected value or mean for the observation’s covariate pattern. The standard error of the prediction is also referred to as the standard error of the fitted value. stdf calculates the standard error of the forecast, which is the standard error of the point prediction for 1 observation. It is commonly referred to as the standard error of the future or forecast value. By construction, the standard errors produced by stdf are always larger than those produced by stdp; see Methods and formulas. stdr calculates the standard error of the residuals. covratio calculates COVRATIO (Belsley, Kuh, and Welsch 1980), a measure of the influence of the j th observation based on considering the effect on the variance–covariance matrix of the estimates. The calculation is automatically restricted to the estimation subsample. dfits calculates DFITS (Welsch and Kuh 1977) and attempts to summarize the information in the leverage versus residual-squared plot into one statistic. The calculation is automatically restricted to the estimation subsample. welsch calculates Welsch distance (Welsch 1982) and is a variation on dfits. The calculation is automatically restricted to the estimation subsample. 1874 regress postestimation — Postestimation tools for regress Remarks and examples for predict Remarks are presented under the following headings: Terminology Fitted values and residuals Prediction standard errors Prediction with weighted data Leverage statistics Standardized and Studentized residuals DFITS, Cook’s Distance, and Welsch Distance COVRATIO Terminology Many of these commands concern identifying influential data in linear regression. This is, unfortunately, a field that is dominated by jargon, codified and partially begun by Belsley, Kuh, and Welsch (1980). In the words of Chatterjee and Hadi (1986, 416), “Belsley, Kuh, and Welsch’s book, Regression Diagnostics, was a very valuable contribution to the statistical literature, but it unleashed on an unsuspecting statistical community a computer speak (à la Orwell), the likes of which we have never seen.” Things have only gotten worse since then. Chatterjee and Hadi’s (1986, 1988) own attempts to clean up the jargon did not improve matters (see Hoaglin and Kempthorne [1986], Velleman [1986], and Welsch [1986]). We apologize for the jargon, and for our contribution to the jargon in the form of inelegant command names, we apologize most of all. Model sensitivity refers to how estimates are affected by subsets of our data. Imagine data on y and x, and assume that the data are to be fit by the regression yi = α + βxi + i . The regression estimates of α and β are a and b, respectively. Now imagine that the estimated a and b would be different if a small portion of the dataset, perhaps even one observation, were deleted. As a data analyst, you would like to think that you are summarizing tendencies that apply to all the data, but you have just been told that the model you fit is unduly influenced by one point or just a few points and that, as a matter of fact, there is another model that applies to the rest of the data — a model that you have ignored. The search for subsets of the data that, if deleted, would change the results markedly is a predominant theme of this entry. There are three key issues in identifying model sensitivity to individual observations, which go by the names residuals, leverage, and influence. 
In our yi = a + bxi + ei regression, the residuals are, of course, ei — they reveal how much our fitted value ybi = a + bxi differs from the observed yi . A point (xi , yi ) with a corresponding large residual is called an outlier. Say that you are interested in outliers because you somehow think that such points will exert undue influence on your estimates. Your feelings are generally right, but there are exceptions. A point might have a huge residual and yet not affect the estimated b at all. Nevertheless, studying observations with large residuals almost always pays off. (xi , yi ) can be an outlier in another way — just as yi can be far from ybi , xi can be far from the center of mass of the other x’s. Such an “outlier” should interest you just as much as the more traditional outliers. Picture a scatterplot of y against x with thousands of points in some sort of mass at the lower left of the graph and one point at the upper right of the graph. Now run a regression line through the points — the regression line will come close to the point at the upper right of the graph and may in fact, go through it. That is, this isolated point will not appear as an outlier as measured by residuals because its residual will be small. Yet this point might have a dramatic effect on our resulting estimates in the sense that, were you to delete the point, the estimates would change markedly. Such a point is said to have high leverage. Just as with traditional outliers, a high leverage point does not necessarily have an undue effect on regression estimates, but if it does not, it is more the exception than the rule. regress postestimation — Postestimation tools for regress 1875 Now all this is a most unsatisfactory state of affairs. Points with large residuals may, but need not, have a large effect on our results, and points with small residuals may still have a large effect. Points with high leverage may, but need not, have a large effect on our results, and points with low leverage may still have a large effect. Can you not identify the influential points and simply have the computer list them for you? You can, but you will have to define what you mean by “influential”. “Influential” is defined with respect to some statistic. For instance, you might ask which points in your data have a large effect on your estimated a, which points have a large effect on your estimated b, which points have a large effect on your estimated standard error of b, and so on, but do not be surprised when the answers to these questions are different. In any case, obtaining such measures is not difficult — all you have to do is fit the regression excluding each observation one at a time and record the statistic of interest which, in the day of the modern computer, is not too onerous. Moreover, you can save considerable computer time by doing algebra ahead of time and working out formulas that will calculate the same answers as if you ran each of the regressions. (Ignore the question of pairs of observations that, together, exert undue influence, and triples, and so on, which remains largely unsolved and for which the brute force fit-every-possible-regression procedure is not a viable alternative.) Fitted values and residuals Typing predict newvar with no options creates newvar containing the fitted values. Typing predict newvar, resid creates newvar containing the residuals. Example 1 Continuing with example 1 from [R] regress, we wish to fit the following model: mpg = β0 + β1 weight + β2 foreign + . 
use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress mpg weight foreign Source SS df MS Model Residual 1619.2877 824.171761 2 71 809.643849 11.608053 Total 2443.45946 73 33.4720474 mpg Coef. weight foreign _cons -.0065879 -1.650029 41.6797 Std. Err. .0006371 1.075994 2.165547 t -10.34 -1.53 19.25 Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.130 0.000 = = = = = = 74 69.75 0.0000 0.6627 0.6532 3.4071 [95% Conf. Interval] -.0078583 -3.7955 37.36172 -.0053175 .4954422 45.99768 That done, we can now obtain the predicted values from the regression. We will store them in a new variable called pmpg by typing predict pmpg. Because predict produces no output, we will follow that by summarizing our predicted and observed values. 1876 regress postestimation — Postestimation tools for regress . predict pmpg (option xb assumed; fitted values) . summarize pmpg mpg Obs Mean Variable pmpg mpg 74 74 21.2973 21.2973 Std. Dev. 4.709779 5.785503 Min Max 9.794333 12 29.82151 41 Example 2: Out-of-sample predictions We can just as easily obtain predicted values from the model by using a wholly different dataset from the one on which the model was fit. The only requirement is that the data have the necessary variables, which here are weight and foreign. Using the data on two new cars (the Pontiac Sunbird and the Volvo 260) from the newautos.dta dataset, we can obtain out-of-sample predictions (or forecasts) by typing . use http://www.stata-press.com/data/r13/newautos, clear (New Automobile Models) . predict pmpg (option xb assumed; fitted values) . list, divider 1. 2. make weight foreign pmpg Pont. Sunbird Volvo 260 2690 3170 Domestic Foreign 23.95829 19.14607 The Pontiac Sunbird has a predicted mileage rating of 23.96 mpg, whereas the Volvo 260 has a predicted rating of 19.15 mpg. In comparison, the actual mileage ratings are 24 for the Pontiac and 17 for the Volvo. Prediction standard errors predict can calculate the standard error of the forecast (stdf option), the standard error of the prediction (stdp option), and the standard error of the residual (stdr option). It is easy to confuse stdf and stdp because both are often called the prediction error. Consider the prediction ybj = xj b, where b is the estimated coefficient (column) vector and xj is a (row) vector of independent variables for which you want the prediction. First, ybj has a variance due to the variance of the estimated coefficient vector b, Var(b yj ) = Var(xj b) = s2 hj where hj = xj (X0 X)−1 x0j and s2 is the mean squared error of the regression. Do not panic over the algebra — just remember that Var(b yj ) = s2 hj , whatever s2 and hj are. stdp calculates this quantity. This is the error in the prediction due to the uncertainty about b. If you are about to hand this number out as your forecast, however, there is another error. According to your model, the true value of yj is given by yj = xj b + j = ybj + j regress postestimation — Postestimation tools for regress 1877 and thus the Var(yj ) = Var(b yj ) + Var(j ) = s2 hj + s2 , which is the square of stdf. stdf, then, is the sum of the error in the prediction plus the residual error. stdr has to do with an analysis-of-variance decomposition of s2 , the estimated variance of y . The standard error of the prediction is s2 hj , and therefore s2 hj + s2 (1 − hj ) = s2 decomposes s2 into the prediction and residual variances. 
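This decomposition can be checked numerically. The following minimal sketch is not part of the original manual; it assumes the same auto data and model, and the variable names sp, sf, sr, chk1, and chk2 are arbitrary.

. use http://www.stata-press.com/data/r13/auto, clear
. quietly regress mpg weight foreign

. predict double sp, stdp        // s*sqrt(h_j): uncertainty from estimating b
. predict double sf, stdf        // s*sqrt(1 + h_j): standard error of the forecast
. predict double sr, stdr        // s*sqrt(1 - h_j): standard error of the residual

. generate double chk1 = sp^2 + sr^2 - e(rmse)^2   // s^2*h_j + s^2*(1-h_j) = s^2
. generate double chk2 = sf^2 - sp^2 - e(rmse)^2   // stdf^2 = stdp^2 + s^2
. summarize chk1 chk2                              // both should be numerically zero

In words, the squared standard error of the forecast is the squared standard error of the prediction plus the residual variance, and the prediction and residual variances together exhaust s².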
Example 3: standard error of the forecast Returning to our model of mpg on weight and foreign, we previously predicted the mileage rating for the Pontiac Sunbird and Volvo 260 as 23.96 and 19.15 mpg, respectively. We now want to put a standard error around our forecast. Remember, the data for these two cars were in newautos.dta: . use http://www.stata-press.com/data/r13/newautos, clear (New Automobile Models) . predict pmpg (option xb assumed; fitted values) . predict se_pmpg, stdf . list, divider 1. 2. make weight foreign pmpg se_pmpg Pont. Sunbird Volvo 260 2690 3170 Domestic Foreign 23.95829 19.14607 3.462791 3.525875 Thus an approximate 95% confidence interval for the mileage rating of the Volvo 260 is 19.15±2·3.53 = [ 12.09, 26.21 ]. Prediction with weighted data predict can be used after frequency-weighted (fweight) estimation, just as it is used after unweighted estimation. The technical note below concerns the use of predict after analytically weighted (aweight) estimation. Technical note After analytically weighted estimation, predict is willing to calculate only the prediction (no options), residual (residual option), standard error of the prediction (stdp option), and diagonal elements of the projection matrix (hat option). Moreover, the results produced by hat need to be adjusted, as will be described. For analytically weighted estimation, the standard error of the forecast and residuals, the standardized and Studentized residuals, and Cook’s D are not statistically well-defined concepts. 1878 regress postestimation — Postestimation tools for regress Leverage statistics In addition to providing fitted values and the associated standard errors, the predict command can also be used to generate various statistics used to detect the influence of individual observations. This section provides a brief introduction to leverage (hat) statistics, and some of the following subsections discuss other influence statistics produced by predict. Example 4: diagonal elements of projection matrix The diagonal elements of the projection matrix, obtained by the hat option, are a measure of distance in explanatory variable space. leverage is a synonym for hat. . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . regress mpg weight foreign (output omitted ) . predict xdist, hat . summarize xdist, detail Leverage 1% 5% 10% 25% 50% Percentiles .0192325 .0192686 .0193448 .0220291 .0383797 75% 90% 95% 99% .0494002 .0693432 .0880814 .1003283 Smallest .0192325 .0192366 .019241 .0192686 Largest .0880814 .099715 .099715 .1003283 Obs Sum of Wgt. Mean Std. Dev. 74 74 .0405405 .0207624 Variance Skewness Kurtosis .0004311 1.159745 4.083313 Some 5% of our sample has an xdist measure in excess of 0.08. Let’s force them to reveal their identities: . list foreign make mpg if xdist>.08, divider foreign 24. 26. 27. 43. 64. Domestic Domestic Domestic Domestic Foreign make Ford Fiesta Linc. Continental Linc. Mark V Plym. Champ Peugeot 604 mpg 28 12 12 34 14 To understand why these cars are on this list, we must remember that the explanatory variables in our model are weight and foreign and that xdist measures distance in this metric. The Ford Fiesta and the Plymouth Champ are the two lightest domestic cars in our data. The Lincolns are the two heaviest domestic cars, and the Peugeot is the heaviest foreign car. See lvr2plot in [R] regress postestimation diagnostic plots for information on a leverage-versussquared-residual plot. 
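As a small supplementary check on the leverage measure (not in the original manual text), the diagonal elements of the hat matrix sum to the number of estimated coefficients, which for the model above is 3 (weight, foreign, and the constant); their mean, 3/74, matches the mean of xdist reported in the output above. A sketch, with the variable name h arbitrary:

. use http://www.stata-press.com/data/r13/auto, clear
. quietly regress mpg weight foreign
. predict double h, hat          // h_j = x_j (X'X)^(-1) x_j'
. quietly summarize h
. display r(sum)                 // equals 3, the trace of the hat matrix
. display r(mean)                // average leverage 3/74, about .0405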
Standardized and Studentized residuals

The terms standardized and Studentized residuals have meant different things to different authors. In Stata, predict defines the standardized residual as $\widehat{e}_i = e_i/(s\sqrt{1-h_i})$ and the Studentized residual as $r_i = e_i/(s_{(i)}\sqrt{1-h_i})$, where $s_{(i)}$ is the root mean squared error of a regression with the ith observation removed. Stata's definition of the Studentized residual is the same as the one given in Bollen and Jackman (1990, 264) and is what Chatterjee and Hadi (1988, 74) call the "externally Studentized" residual. Stata's "standardized" residual is the same as what Chatterjee and Hadi (1988, 74) call the "internally Studentized" residual.

Standardized and Studentized residuals are attempts to adjust residuals for their standard errors. Although the theoretical residuals $\epsilon_i$ are homoskedastic by assumption (that is, they all have the same variance), the calculated $e_i$ are not. In fact, $\mathrm{Var}(e_i) = \sigma^2(1-h_i)$, where $h_i$ are the leverage measures obtained from the diagonal elements of the hat matrix. Thus observations with the greatest leverage have corresponding residuals with the smallest variance.

Standardized residuals use the root mean squared error of the regression for $\sigma$. Studentized residuals use the root mean squared error of a regression omitting the observation in question for $\sigma$. In general, Studentized residuals are preferable to standardized residuals for purposes of outlier identification. Studentized residuals can be interpreted as the t statistic for testing the significance of a dummy variable equal to 1 in the observation in question and 0 elsewhere (Belsley, Kuh, and Welsch 1980). Such a dummy variable would effectively absorb the observation and so remove its influence in determining the other coefficients in the model. Caution must be exercised here, however, because of the simultaneous testing problem. You cannot simply list the residuals that would be individually significant at the 5% level — their joint significance would be far less (their joint significance level would be far greater).

Example 5: standardized and Studentized residuals

In the Terminology section of Remarks and examples for predict, we distinguished residuals from leverage and speculated on the impact of an observation with a small residual but large leverage. If we adjust the residuals for their standard errors, however, the adjusted residual would be (relatively) larger and perhaps large enough so that we could simply examine the adjusted residuals. Taking our price on weight and foreign##c.mpg model from example 1 of [R] regress postestimation diagnostic plots, we can obtain the in-sample standardized and Studentized residuals by typing

. use http://www.stata-press.com/data/r13/auto, clear
(1978 Automobile Data)
. regress price weight foreign##c.mpg
  (output omitted)
. predict esta if e(sample), rstandard
. predict estu if e(sample), rstudent

In the lvr2plot section of [R] regress postestimation diagnostic plots, we discovered that the VW Diesel has the highest leverage in our data, but a corresponding small residual. The standardized and Studentized residuals for the VW Diesel are

. list make price esta estu if make=="VW Diesel"

            make    price       esta       estu
 71.   VW Diesel    5,397   .6142691   .6114758

The Studentized residual of 0.611 can be interpreted as the t statistic for including a dummy variable for VW Diesel in our regression.
Such a variable would not be significant.

DFITS, Cook's Distance, and Welsch Distance

DFITS (Welsch and Kuh 1977), Cook's Distance (Cook 1977), and Welsch Distance (Welsch 1982) are three attempts to summarize the information in the leverage versus residual-squared plot into one statistic. That is, the goal is to create an index that is affected by the size of the residuals (outliers) and the size of $h_i$ (leverage). Viewed mechanically, one way to write DFITS (Bollen and Jackman 1990, 265) is

    $\mathrm{DFITS}_i = r_i \sqrt{\dfrac{h_i}{1-h_i}}$

where $r_i$ are the Studentized residuals. Thus large residuals increase the value of DFITS, as do large values of $h_i$. Viewed more traditionally, DFITS is a scaled difference between predicted values for the ith case when the regression is fit with and without the ith observation, hence the name.

The mechanical relationship between DFITS and Cook's Distance, $D_i$ (Bollen and Jackman 1990, 266), is

    $D_i = \dfrac{1}{k}\,\dfrac{s_{(i)}^2}{s^2}\,\mathrm{DFITS}_i^2$

where k is the number of variables (including the constant) in the regression, s is the root mean squared error of the regression, and $s_{(i)}$ is the root mean squared error when the ith observation is omitted. Viewed more traditionally, $D_i$ is a scaled measure of the distance between the coefficient vectors when the ith observation is omitted.

The mechanical relationship between DFITS and Welsch's Distance, $W_i$ (Chatterjee and Hadi 1988, 123), is

    $W_i = \mathrm{DFITS}_i \sqrt{\dfrac{n-1}{1-h_i}}$

The interpretation of $W_i$ is more difficult, as it is based on the empirical influence curve. Although DFITS and Cook's distance are similar, the Welsch distance measure includes another normalization by leverage.

Belsley, Kuh, and Welsch (1980, 28) suggest that DFITS values greater than $2\sqrt{k/n}$ deserve more investigation, and so values of Cook's distance greater than $4/n$ should also be examined (Bollen and Jackman 1990, 265–266). Through similar logic, the cutoff for Welsch distance is approximately $3\sqrt{k}$ (Chatterjee and Hadi 1988, 124).

Example 6: DFITS influence measure

Continuing with our model of price on weight and foreign##c.mpg, we can obtain the DFITS influence measure:

. predict e if e(sample), resid
. predict dfits, dfits

We did not specify if e(sample) in computing the DFITS statistic. DFITS is available only over the estimation sample, so specifying if e(sample) would have been redundant. It would have done no harm, but it would not have changed the results. Our model has k = 5 independent variables (k includes the constant) and n = 74 observations; following the $2\sqrt{k/n}$ cutoff advice, we type

. list make price e dfits if abs(dfits) > 2*sqrt(5/74), divider

            make               price          e        dfits
 12.   Cad. Eldorado         14,500     7271.96     .9564455
 13.   Cad. Seville          15,906    5036.348     1.356619
 24.   Ford Fiesta            4,389    3164.872     .5724172
 27.   Linc. Mark V          13,594    3109.193     .5200413
 28.   Linc. Versailles      13,466    6560.912     .8760136

 42.   Plym. Arrow            4,647   -3312.968    -.9384231

We calculate Cook's distance and list the observations greater than the suggested 4/n cutoff:

. predict cooksd if e(sample), cooksd
. list make price e cooksd if cooksd > 4/74, divider

            make               price          e       cooksd
 12.   Cad. Eldorado         14,500     7271.96     .1492676
 13.   Cad. Seville          15,906    5036.348     .3328515
 24.   Ford Fiesta            4,389    3164.872     .0638815
 28.   Linc. Versailles      13,466    6560.912     .1308004
 42.   Plym. Arrow            4,647   -3312.968     .1700736

Here we used if e(sample) because Cook's distance is not restricted to the estimation sample by default. It is worth comparing this list with the preceding one.
Finally, we use Welsch distance and the suggested $3\sqrt{k}$ cutoff:

. predict wd, welsch
. list make price e wd if abs(wd) > 3*sqrt(5), divider

            make               price          e          wd
 12.   Cad. Eldorado         14,500     7271.96    8.394372
 13.   Cad. Seville          15,906    5036.348    12.81125
 28.   Linc. Versailles      13,466    6560.912    7.703005
 42.   Plym. Arrow            4,647   -3312.968   -8.981481

Here we did not need to specify if e(sample) because welsch automatically restricts the prediction to the estimation sample.

COVRATIO

COVRATIO (Belsley, Kuh, and Welsch 1980) measures the influence of the ith observation by considering the effect on the variance–covariance matrix of the estimates. The measure is the ratio of the determinants of the covariance matrix, with and without the ith observation. The resulting formula is

    $\mathrm{COVRATIO}_i = \dfrac{1}{1-h_i}\left(\dfrac{n-k-\widehat{e}_i^{\,2}}{n-k-1}\right)^{k}$

where $\widehat{e}_i$ is the standardized residual.

For noninfluential observations, the value of COVRATIO is approximately 1. Large values of the residuals or large values of leverage will cause deviations from 1, although if both are large, COVRATIO may tend back toward 1 and therefore not identify such observations (Chatterjee and Hadi 1988, 139).

Belsley, Kuh, and Welsch (1980) suggest that observations for which $|\mathrm{COVRATIO}_i - 1| \geq 3k/n$ are worthy of further examination.

Example 7: COVRATIO influence measure

Using our model of price on weight and foreign##c.mpg, we can obtain the COVRATIO measure and list the observations outside the suggested cutoff by typing

. predict covr, covratio
. list make price e covr if abs(covr-1) >= 3*5/74, divider

            make               price          e        covr
 12.   Cad. Eldorado         14,500     7271.96    .3814242
 13.   Cad. Seville          15,906    5036.348    .7386969
 28.   Linc. Versailles      13,466    6560.912    .4761695
 43.   Plym. Champ            4,425    1621.747     1.27782
 53.   Audi 5000              9,690    591.2883    1.206842

 57.   Datsun 210             4,589    19.81829    1.284801
 64.   Peugeot 604           12,990    1037.184    1.348219
 66.   Subaru                 3,798   -909.5894    1.264677
 71.   VW Diesel              5,397    999.7209    1.630653
 74.   Volvo 260             11,995    1327.668    1.211888

The covratio option automatically restricts the prediction to the estimation sample.

DFBETA influence statistics

Syntax for dfbeta

    dfbeta [indepvar [indepvar [...]]] [, stub(name)]

Menu for dfbeta

Statistics > Linear models and related > Regression diagnostics > DFBETAs

Description for dfbeta

dfbeta will calculate one, more than one, or all the DFBETAs after regress. Although predict will also calculate DFBETAs, predict can do this for only one variable at a time. dfbeta is a convenience tool for those who want to calculate DFBETAs for multiple variables. The names for the new variables created are chosen automatically and begin with the letters _dfbeta_.

Option for dfbeta

stub(name) specifies the leading characters dfbeta uses to name the new variables to be generated. The default is stub(_dfbeta_).

Remarks and examples for dfbeta

DFBETAs are perhaps the most direct influence measure of interest to model builders. DFBETAs focus on one coefficient and measure the difference between the regression coefficient when the ith observation is included and excluded, the difference being scaled by the estimated standard error of the coefficient. Belsley, Kuh, and Welsch (1980, 28) suggest observations with $|\mathrm{DFBETA}_i| > 2/\sqrt{n}$ as deserving special attention, but it is also common practice to use 1 (Bollen and Jackman 1990, 267), meaning that the observation shifted the estimate at least one standard error.
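The mechanical relationships among the Studentized residual, DFITS, Cook's distance, Welsch distance, and COVRATIO stated above can be checked numerically. The following sketch is not part of the original manual; it continues with the price on weight and foreign##c.mpg model, and the *_chk variable names are arbitrary.

. use http://www.stata-press.com/data/r13/auto, clear
. quietly regress price weight foreign##c.mpg
. local k = e(df_m) + 1                          // number of coefficients, including _cons (5)
. local n = e(N)                                 // 74 observations

. predict double h,     hat
. predict double rstu,  rstudent
. predict double esta,  rstandard
. predict double dfits, dfits
. predict double D,     cooksd
. predict double wd,    welsch
. predict double covr,  covratio

. generate double dfits_chk = rstu*sqrt(h/(1-h))
. generate double wd_chk    = dfits*sqrt((`n'-1)/(1-h))
. generate double D_chk     = (esta^2/`k')*h/(1-h)
. generate double covr_chk  = (1/(1-h))*((`n'-`k'-esta^2)/(`n'-`k'-1))^`k'

. summarize dfits dfits_chk wd wd_chk D D_chk covr covr_chk   // each pair should agree

Each *_chk variable is computed directly from the formulas given above, so a mismatch would indicate a typing error rather than a problem with predict.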
Example 8: DFBETAs influence measure; the dfbeta() option Using our model of price on weight and foreign##c.mpg, let’s first ask which observations have the greatest √ impact on the determination of the coefficient on 1.foreign. We will use the suggested 2/ n cutoff: . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . regress price weight foreign##c.mpg (output omitted ) 1884 regress postestimation — Postestimation tools for regress . sort foreign make . predict dfor, dfbeta(1.foreign) . list make price foreign dfor if abs(dfor) > 2/sqrt(74), divider make price foreign dfor 12. 13. 28. 42. 43. Cad. Eldorado Cad. Seville Linc. Versailles Plym. Arrow Plym. Champ 14,500 15,906 13,466 4,647 4,425 Domestic Domestic Domestic Domestic Domestic -.5290519 .8243419 -.5283729 -.6622424 .2371104 64. 69. Peugeot 604 Toyota Corona 12,990 5,719 Foreign Foreign .2552032 -.256431 The Cadillac Seville shifted the coefficient on 1.foreign 0.82 standard deviations! Now let us ask which observations have the greatest effect on the mpg coefficient: . predict dmpg, dfbeta(mpg) . list make price mpg dmpg if abs(dmpg) > 2/sqrt(74), divider make 12. 13. 28. 42. 43. Cad. Eldorado Cad. Seville Linc. Versailles Plym. Arrow Plym. Champ price mpg dmpg 14,500 15,906 13,466 4,647 4,425 14 21 14 28 34 -.5970351 1.134269 -.6069287 -.8925859 .3186909 Once again we see the Cadillac Seville heading the list, indicating that our regression results may be dominated by this one car. Example 9: DFBETAs influence measure; the dfbeta command We can use predict, dfbeta() or the dfbeta command to generate the DFBETAs. dfbeta makes up names for the new variables automatically and, without arguments, generates the DFBETAs for all the variables in the regression: . dfbeta _dfbeta_1: _dfbeta_2: _dfbeta_3: _dfbeta_4: dfbeta(weight) dfbeta(1.foreign) dfbeta(mpg) dfbeta(1.foreign#c.mpg) dfbeta created four new variables in our dataset: dfbeta 1, containing the DFBETAs for weight; dfbeta 2, containing the DFBETAs for mpg; and so on. Had we wanted only the DFBETAs for mpg and weight, we might have typed . dfbeta mpg weight _dfbeta_5: dfbeta(weight) _dfbeta_6: dfbeta(mpg) In the example above, we typed dfbeta mpg weight instead of dfbeta; if we had typed dfbeta followed by dfbeta mpg weight, here is what would have happened: regress postestimation — Postestimation tools for regress 1885 . dfbeta _dfbeta_7: _dfbeta_8: _dfbeta_9: _dfbeta_10: dfbeta(weight) dfbeta(1.foreign) dfbeta(mpg) dfbeta(1.foreign#c.mpg) . dfbeta mpg weight _dfbeta_11: dfbeta(weight) _dfbeta_12: dfbeta(mpg) dfbeta would have made up different names for the new variables. dfbeta never replaces existing variables — it instead makes up a different name, so we need to pay attention to dfbeta’s output. Tests for violation of assumptions Syntax for estat hettest estat hettest varlist Menu for estat Statistics > Postestimation > , rhs normal | iid | fstat mtest (spec) Reports and statistics Description for estat hettest estat hettest performs three versions of the Breusch–Pagan (1979) and Cook–Weisberg (1983) test for heteroskedasticity. All three versions of this test present evidence against the null hypothesis that t = 0 in Var(e) = σ 2 exp(zt). In the normal version, performed by default, the null hypothesis also includes the assumption that the regression disturbances are independent-normal draws with variance σ 2 . 
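As a small supplement, not part of the original manual text, the two routes to a DFBETA can be compared directly. The stub name chk_ below is arbitrary, and the resulting variable name chk_1 is an assumption about how stub() numbers its first variable; agreement between the two variables is up to storage precision.

. use http://www.stata-press.com/data/r13/auto, clear
. quietly regress price weight foreign##c.mpg

. predict dfw, dfbeta(weight)         // DFBETA for weight via predict
. dfbeta weight, stub(chk_)           // the same statistic via the dfbeta command
. compare dfw chk_1                   // the two variables should agree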
The normality assumption is dropped from the null hypothesis in the iid and fstat versions, which respectively produce the score and F tests discussed in Methods and formulas. If varlist is not specified, the fitted values are used for z. If varlist or the rhs option is specified, the variables specified are used for z. Options for estat hettest rhs specifies that tests for heteroskedasticity be performed for the right-hand-side (explanatory) variables of the fitted regression model. The rhs option may be combined with a varlist. normal, the default, causes estat hettest to compute the original Breusch–Pagan/Cook–Weisberg test, which assumes that the regression disturbances are normally distributed. iid causes estat hettest to compute the N ∗ R2 version of the score test that drops the normality assumption. fstat causes estat hettest to compute the F -statistic version that drops the normality assumption. 1886 regress postestimation — Postestimation tools for regress mtest (spec) specifies that multiple testing be performed. The argument specifies how p-values are adjusted. The following specifications, spec, are supported: bonferroni holm sidak noadjust Bonferroni’s multiple testing adjustment Holm’s multiple testing adjustment Šidák’s multiple testing adjustment no adjustment is made for multiple testing mtest may be specified without an argument. This is equivalent to specifying mtest(noadjust); that is, tests for the individual variables should be performed with unadjusted p-values. By default, estat hettest does not perform multiple testing. mtest may not be specified with iid or fstat. Syntax for estat imtest estat imtest , preserve white Menu for estat Statistics > Postestimation > Reports and statistics Description for estat imtest estat imtest performs an information matrix test for the regression model and an orthogonal decomposition into tests for heteroskedasticity, skewness, and kurtosis due to Cameron and Trivedi (1990); White’s test for homoskedasticity against unrestricted forms of heteroskedasticity (1980) is available as an option. White’s test is usually similar to the first term of the Cameron–Trivedi decomposition. Options for estat imtest preserve specifies that the data in memory be preserved, all variables and cases that are not needed in the calculations be dropped, and at the conclusion the original data be restored. This option is costly for large datasets. However, because estat imtest has to perform an auxiliary regression on k(k + 1)/2 temporary variables, where k is the number of regressors, it may not be able to perform the test otherwise. white specifies that White’s original heteroskedasticity test also be performed. Syntax for estat ovtest estat ovtest , rhs Menu for estat Statistics > Postestimation > Reports and statistics regress postestimation — Postestimation tools for regress 1887 Description for estat ovtest estat ovtest performs two versions of the Ramsey (1969) regression specification-error test (RESET) for omitted variables. This test amounts to fitting y = xb + zt + u and then testing t = 0. If the rhs option is not specified, powers of the fitted values are used for z. If rhs is specified, powers of the individual elements of x are used. Option for estat ovtest rhs specifies that powers of the right-hand-side (explanatory) variables be used in the test rather than powers of the fitted values. Syntax for estat szroeter estat szroeter varlist , rhs mtest(spec) Either varlist or rhs must be specified. 
Menu for estat Statistics > Postestimation > Reports and statistics Description for estat szroeter estat szroeter performs Szroeter’s rank test for heteroskedasticity for each of the variables in varlist or for the explanatory variables of the regression if rhs is specified. Options for estat szroeter rhs specifies that tests for heteroskedasticity be performed for the right-hand-side (explanatory) variables of the fitted regression model. Option rhs may be combined with a varlist. mtest(spec) specifies that multiple testing be performed. The argument specifies how p-values are adjusted. The following specifications, spec, are supported: bonferroni holm sidak noadjust Bonferroni’s multiple testing adjustment Holm’s multiple testing adjustment Šidák’s multiple testing adjustment no adjustment is made for multiple testing estat szroeter always performs multiple testing. By default, it does not adjust the p-values. 1888 regress postestimation — Postestimation tools for regress Remarks and examples for estat hettest, estat imtest, estat ovtest, and estat szroeter We introduce some regression diagnostic commands that are designed to test for certain violations that rvfplot (see [R] regress postestimation diagnostic plots) less formally attempts to detect. estat ovtest provides Ramsey’s test for omitted variables — a pattern in the residuals. estat hettest provides a test for heteroskedasticity — the increasing or decreasing variation in the residuals with fitted values, with respect to the explanatory variables, or with respect to yet other variables. The score test implemented in estat hettest (Breusch and Pagan 1979; Cook and Weisberg 1983) performs a score test of the null hypothesis that b = 0 against the alternative hypothesis of multiplicative heteroskedasticity. estat szroeter provides a rank test for heteroskedasticity, which is an alternative to the score test computed by estat hettest. Finally, estat imtest computes an information matrix test, including an orthogonal decomposition into tests for heteroskedasticity, skewness, and kurtosis (Cameron and Trivedi 1990). The heteroskedasticity test computed by estat imtest is similar to the general test for heteroskedasticity that was proposed by White (1980). Cameron and Trivedi (2010, chap. 3) discuss most of these tests and provides more examples. Example 10: estat ovtest, estat hettest, estat szroeter, and estat imtest We use our model of price on weight and foreign##c.mpg. . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . regress price weight foreign##c.mpg (output omitted ) . estat ovtest Ramsey RESET test using powers of the fitted values of price Ho: model has no omitted variables F(3, 66) = 7.77 Prob > F = 0.0002 . estat hettest Breusch-Pagan / Cook-Weisberg test for heteroskedasticity Ho: Constant variance Variables: fitted values of price chi2(1) Prob > chi2 = = 6.50 0.0108 Testing for heteroskedasticity in the right-hand-side variables is requested by specifying the rhs option. By specifying the mtest(bonferroni) option, we request that tests be conducted for each of the variables, with a Bonferroni adjustment for the p-values to accommodate our testing multiple hypotheses. regress postestimation — Postestimation tools for regress 1889 . 
estat hettest, rhs mtest(bonf) Breusch-Pagan / Cook-Weisberg test for heteroskedasticity Ho: Constant variance Variable chi2 df p weight foreign Foreign mpg foreign# c.mpg Foreign 15.24 1 0.0004 # 6.15 9.04 1 1 0.0525 # 0.0106 # 6.02 1 0.0566 # 15.60 4 0.0036 simultaneous # Bonferroni-adjusted p-values . estat szroeter, rhs mtest(holm) Szroeter’s test for homoskedasticity Ho: variance constant Ha: variance monotonic in variable Variable chi2 df p weight foreign Foreign mpg foreign# c.mpg Foreign 17.07 1 0.0001 # 6.15 11.45 1 1 0.0131 # 0.0021 # 6.17 1 0.0260 # # Holm-adjusted p-values Finally, we request the information matrix test, which is a conditional moments test with second-, third-, and fourth-order moment conditions. . estat imtest Cameron & Trivedi’s decomposition of IM-test Source chi2 df p Heteroskedasticity Skewness Kurtosis 18.86 11.69 2.33 10 4 1 0.0420 0.0198 0.1273 Total 32.87 15 0.0049 We find evidence for omitted variables, heteroskedasticity, and nonnormal skewness. So, why bother with the various graphical commands when the tests seem so much easier to interpret? In part, it is a matter of taste: both are designed to uncover the same problem, and both are, in fact, going about it in similar ways. One is based on a formal calculation, whereas the other is based on personal judgment in evaluating a graph. On the other hand, the tests are seeking evidence of specific problems, whereas judgment is more general. The careful analyst will use both. We performed the omitted-variable test first. Omitted variables are a more serious problem than heteroskedasticity or the violations of higher moment conditions tested by estat imtest. If this 1890 regress postestimation — Postestimation tools for regress were not a manual, having found evidence of omitted variables, we would never have run the estat hettest, estat szroeter, and estat imtest commands, at least not until we solved the omitted-variable problem. Technical note estat ovtest and estat hettest both perform two flavors of their respective tests. By default, estat ovtest looks for evidence of omitted variables by fitting the original model augmented by yb2 , yb3 , and yb4 , which are the fitted values from the original model. Under the assumption of no misspecification, the coefficients on the powers of the fitted values will be zero. With the rhs option, estat ovtest instead augments the original model with powers (second through fourth) of the explanatory variables (except for dummy variables). estat hettest, by default, looks for heteroskedasticity by modeling the variance as a function of the fitted values. If, however, we specify a variable or variables, the variance will be modeled as a function of the specified variables. In our example, if we had, a priori, some reason to suspect heteroskedasticity and that the heteroskedasticity is a function of a car’s weight, then using a test that focuses on weight would be more powerful than the more general tests such as White’s test or the first term in the Cameron–Trivedi decomposition test. estat hettest, by default, computes the original Breusch–Pagan/Cook–Weisberg test, which includes the assumption of normally distributed errors. Koenker (1981) derived an N ∗ R2 version of this test that drops the normality assumption. Wooldridge (2013) gives an F -statistic version that does not require the normality assumption. 
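To make the preceding discussion concrete, here is a minimal sketch, not part of the original manual, that runs each flavor of the heteroskedasticity and specification tests on the same model and then retrieves the stored results of the last test; the displayed labels are arbitrary.

. use http://www.stata-press.com/data/r13/auto, clear
. quietly regress price weight foreign##c.mpg

. estat hettest                       // original Breusch-Pagan/Cook-Weisberg (normal errors)
. estat hettest, iid                  // Koenker's N*R^2 score version, no normality assumption
. estat hettest, fstat                // F-statistic version, no normality assumption
. estat hettest weight                // variance modeled as a function of weight only

. estat ovtest                        // RESET using powers of the fitted values
. estat ovtest, rhs                   // RESET using powers of the explanatory variables

. return list                         // results of the last test are left behind in r()
. display "RESET F = " r(F) " with p = " r(p)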
Stored results for estat hettest, estat imtest, and estat ovtest estat hettest stores the following results for the (multivariate) score test in r(): Scalars r(chi2) r(df) r(p) χ2 test statistic #df for the asymptotic χ2 distribution under H0 p-value estat hettest, fstat stores results for the (multivariate) score test in r(): Scalars r(F) r(df m) r(df r) r(p) test statistic #df of the test for the F distribution under H0 #df of the residuals for the F distribution under H0 p-value estat hettest (if mtest is specified) and estat szroeter store the following in r(): Matrices r(mtest) Macros r(mtmethod) a matrix of test results, with rows corresponding to the univariate tests mtest[.,1] χ2 test statistic mtest[.,2] #df mtest[.,3] unadjusted p-value mtest[.,4] adjusted p-value (if an mtest() adjustment method is specified) adjustment method for p-values regress postestimation — Postestimation tools for regress 1891 estat imtest stores the following in r(): Scalars r(chi2 t) r(df t) r(chi2 h) r(df h) r(chi2 s) r(df s) r(chi2 k) r(df k) r(chi2 w) r(df w) IM-test statistic (= r(chi2 h) + r(chi2 s) + r(chi2 k)) df for limiting χ2 distribution under H0 (= r(df h) + r(df s) + r(df k)) heteroskedasticity test statistic df for limiting χ2 distribution under H0 skewness test statistic df for limiting χ2 distribution under H0 kurtosis test statistic df for limiting χ2 distribution under H0 White’s heteroskedasticity test (if white specified) df for limiting χ2 distribution under H0 estat ovtest stores the following in r(): Scalars r(p) r(F) r(df) r(df r) two-sided p-value F statistic degrees of freedom residual degrees of freedom Variance inflation factors Syntax for estat vif estat vif , uncentered Menu for estat Statistics > Postestimation > Reports and statistics Description for estat vif estat vif calculates the centered or uncentered variance inflation factors (VIFs) for the independent variables specified in a linear regression model. Option for estat vif uncentered requests that the computation of the uncentered variance inflation factors. This option is often used to detect the collinearity of the regressors with the constant. estat vif, uncentered may be used after regression models fit without the constant term. Remarks and examples for estat vif Problems arise in regression when the predictors are highly correlated. In this situation, there may be a significant change in the regression coefficients if you add or delete an independent variable. The estimated standard errors of the fitted coefficients are inflated, or the estimated coefficients may not be statistically significant even though a statistical relation exists between the dependent and independent variables. 1892 regress postestimation — Postestimation tools for regress Data analysts rely on these facts to check informally for the presence of multicollinearity. estat vif, another command for use after regress, calculates the variance inflation factors and tolerances for each of the independent variables. The output shows the variance inflation factors together with their reciprocals. Some analysts compare the reciprocals with a predetermined tolerance. In the comparison, if the reciprocal of the VIF is smaller than the tolerance, the associated predictor variable is removed from the regression model. However, most analysts rely on informal rules of thumb applied to the VIF; see Chatterjee and Hadi (2012). According to these rules, there is evidence of multicollinearity if 1. 
The largest VIF is greater than 10 (some choose a more conservative threshold value of 30). 2. The mean of all the VIFs is considerably larger than 1. Example 11: estat vif We examine a regression model fit using the ubiquitous automobile dataset: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress price mpg rep78 trunk headroom length turn displ gear_ratio Source SS df Model Residual 264102049 312694909 8 60 33012756.2 5211581.82 Total 576796959 68 8482308.22 price Coef. mpg rep78 trunk headroom length turn displacement gear_ratio _cons -144.84 727.5783 44.02061 -807.0996 -8.688914 -177.9064 30.73146 1500.119 6691.976 MS Std. Err. 82.12751 337.6107 108.141 435.5802 34.89848 137.3455 7.576952 1110.959 7457.906 t -1.76 2.16 0.41 -1.85 -0.25 -1.30 4.06 1.35 0.90 Number of obs F( 8, 60) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.083 0.035 0.685 0.069 0.804 0.200 0.000 0.182 0.373 = = = = = = 69 6.33 0.0000 0.4579 0.3856 2282.9 [95% Conf. Interval] -309.1195 52.25638 -172.2935 -1678.39 -78.49626 -452.6383 15.5753 -722.1303 -8226.058 19.43948 1402.9 260.3347 64.19062 61.11843 96.82551 45.88762 3722.368 21610.01 . estat vif Variable VIF 1/VIF length displacement turn gear_ratio mpg trunk headroom rep78 8.22 6.50 4.85 3.45 3.03 2.88 1.80 1.46 0.121614 0.153860 0.205997 0.290068 0.330171 0.347444 0.554917 0.686147 Mean VIF 4.02 The results are mixed. Although we have no VIFs greater than 10, the mean VIF is greater than 1, though not considerably so. We could continue the investigation of collinearity, but given that other authors advise that collinearity is a problem only when VIFs exist that are greater than 30 (contradicting our rule above), we will not do so here. regress postestimation — Postestimation tools for regress 1893 Example 12: estat vif, with strong evidence of multicollinearity This example comes from a dataset described in Kutner, Nachtsheim, and Neter (2004, 257) that examines body fat as modeled by caliper measurements on the triceps, midarm, and thigh. . use http://www.stata-press.com/data/r13/bodyfat (Body Fat) . regress bodyfat tricep thigh midarm SS df MS Source Model Residual 396.984607 98.4049068 3 16 132.328202 6.15030667 Total 495.389513 19 26.0731323 bodyfat Coef. triceps thigh midarm _cons 4.334085 -2.856842 -2.186056 117.0844 Std. Err. 3.015511 2.582015 1.595499 99.78238 . estat vif Variable VIF 1/VIF triceps thigh midarm 708.84 564.34 104.61 0.001411 0.001772 0.009560 Mean VIF 459.26 t 1.44 -1.11 -1.37 1.17 P>|t| 0.170 0.285 0.190 0.258 Number of obs F( 3, 16) Prob > F R-squared Adj R-squared Root MSE = = = = = = 20 21.52 0.0000 0.8014 0.7641 2.48 [95% Conf. Interval] -2.058512 -8.330468 -5.568362 -94.44474 10.72668 2.616785 1.19625 328.6136 Here we see strong evidence of multicollinearity in our model. More investigation reveals that the measurements on the thigh and the triceps are highly correlated: . correlate triceps thigh midarm (obs=20) triceps thigh triceps thigh midarm 1.0000 0.9238 0.4578 1.0000 0.0847 midarm 1.0000 If we remove the predictor tricep from the model (because it had the highest VIF), we get . regress bodyfat thigh midarm Source SS df MS Model Residual 384.279748 111.109765 2 17 192.139874 6.53586854 Total 495.389513 19 26.0731323 bodyfat Coef. thigh midarm _cons .8508818 .0960295 -25.99696 Std. Err. .1124482 .1613927 6.99732 t 7.57 0.60 -3.72 Number of obs F( 2, 17) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.560 0.002 = = = = = = 20 29.40 0.0000 0.7757 0.7493 2.5565 [95% Conf. 
Interval] .6136367 -.2444792 -40.76001 1.088127 .4365383 -11.2339 1894 regress postestimation — Postestimation tools for regress . estat vif Variable VIF 1/VIF midarm thigh 1.01 1.01 0.992831 0.992831 Mean VIF 1.01 Note how the coefficients change and how the estimated standard errors for each of the regression coefficients become much smaller. The calculated value of R2 for the overall regression for the subset model does not appreciably decline when we remove the correlated predictor. Removing an independent variable from the model is one way to deal with multicollinearity. Other methods include ridge regression, weighted least squares, and restricting the use of the fitted model to data that follow the same pattern of multicollinearity. In economic studies, it is sometimes possible to estimate the regression coefficients from different subsets of the data by using cross-section and time series. All examples above demonstrated the use of centered VIFs. As pointed out by Belsley (1991), the centered VIFs may fail to discover collinearity involving the constant term. One solution is to use the uncentered VIFs instead. According to the definition of the uncentered VIFs, the constant is viewed as a legitimate explanatory variable in a regression model, which allows one to obtain the VIF value for the constant term. Example 13: estat vif, with strong evidence of collinearity with the constant term Consider the extreme example in which one of the regressors is highly correlated with the constant. We simulate the data and examine both centered and uncentered VIF diagnostics after fitted regression model as follows. . use http://www.stata-press.com/data/r13/extreme_collin . regress y one x z Source SS df MS Number of obs F( 3, 96) Model 223801.985 3 74600.6617 Prob > F Residual 2642.42124 96 27.5252213 R-squared Adj R-squared Total 226444.406 99 2287.31723 Root MSE y Coef. one x z _cons -3.278582 2.038696 4.863137 9.760075 Std. Err. 10.5621 .0242673 .2681036 10.50935 . estat vif Variable VIF 1/VIF z x one 1.03 1.03 1.00 0.968488 0.971307 0.995425 Mean VIF 1.02 t -0.31 84.01 18.14 0.93 P>|t| 0.757 0.000 0.000 0.355 = 100 = 2710.27 = 0.0000 = 0.9883 = 0.9880 = 5.2464 [95% Conf. Interval] -24.24419 1.990526 4.330956 -11.10082 17.68702 2.086866 5.395319 30.62097 regress postestimation — Postestimation tools for regress . estat vif, uncentered VIF Variable one intercept z x 402.94 401.26 2.93 1.13 Mean VIF 202.06 1895 1/VIF 0.002482 0.002492 0.341609 0.888705 According to the values of the centered VIFs (1.03, 1.03, 1.00), no harmful collinearity is detected in the model. However, by the construction of these simulated data, we know that one is highly collinear with the constant term. As such, the large values of uncentered VIFs for one (402.94) and intercept (401.26) reveal high collinearity of the variable one with the constant term. Measures of effect size Syntax for estat esize estat esize , omega level(#) Menu for estat Statistics > Postestimation > Reports and statistics Description for estat esize estat esize calculates effect sizes for linear models after regress or anova. By default, estat esize reports η 2 (eta-squared) estimates (Kerlinger 1964), which are equivalent to R2 estimates. If the option omega is specified, estat esize reports ω 2 estimates (Hays 1963), which are equivalent to adjusted R2 estimates. Confidence intervals for η 2 and ω 2 estimates are estimated by using the noncentral F distribution (Smithson 2001). See Kline (2013) or Thompson (2006) for further information. 
Options for estat esize omega specifies that the ω 2 estimates of effect size be reported. The default is η 2 estimates. level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. 1896 regress postestimation — Postestimation tools for regress Remarks and examples for estat esize Whereas p-values are used to assess the statistical significance of a result, measures of effect size are used to assess the practical significance of a result. Effect sizes can be broadly categorized as “measures of group differences” (the d family) and “measures of association” (the r family); see Ellis (2010, table 1.1). The d family includes estimators such as Cohen’s D, Hedges’s G, and Glass’s ∆ (also see [R] esize). The r family includes estimators such as the point-biserial correlation coefficient, ω 2 , and η 2 . For an introduction to the concepts and calculation of effect sizes, see Kline (2013) or Thompson (2006). For a more detailed discussion, see Kirk (1996), Ellis (2010), Cumming (2012), Grissom and Kim (2012), and Kelley and Preacher (2012). Example 14: Calculating effect sizes for a linear regression model Suppose we fit a linear regression model for low-birthweight infants. . use http://www.stata-press.com/data/r13/lbw (Hosmer & Lemeshow data) . regress bwt smoke i.race Source SS df MS Model Residual 12346897.6 87568400.9 3 185 4115632.54 473342.708 Total 99915298.6 188 531464.354 Std. Err. t Number of obs F( 3, 185) Prob > F R-squared Adj R-squared Root MSE P>|t| = = = = = = 189 8.69 0.0000 0.1236 0.1094 688 bwt Coef. [95% Conf. Interval] smoke -428.0254 109.0033 -3.93 0.000 -643.0746 -212.9761 race black other -450.54 -454.1813 153.066 116.436 -2.94 -3.90 0.004 0.000 -752.5194 -683.8944 -148.5607 -224.4683 _cons 3334.858 91.74301 36.35 0.000 3153.86 3515.855 We can use the estat esize command to calculate η 2 for the entire model and a partial η 2 for each term in the model. regress postestimation — Postestimation tools for regress . estat esize Effect sizes for linear models Source Eta-Squared df [95% Conf. Interval] Model .1235736 3 .0399862 .2041365 smoke race .0769345 .0908394 1 2 .0193577 .0233037 .1579213 .1700334 The omega option causes estat esize to report ω 2 and partial ω 2 . . estat esize, omega Effect sizes for linear models Source Omega-Squared df [95% Conf. Interval] Model .1093613 3 .0244184 .1912306 smoke race .0719449 .0810106 1 2 .0140569 .0127448 .1533695 .1610608 Example 15: Calculating effect size for an ANOVA model We can use estat esize after ANOVA models as well. . anova bwt smoke race Source Number of obs = 189 Root MSE = 687.999 Partial SS df MS R-squared = 0.1236 Adj R-squared = 0.1094 F Prob > F Model 12346897.6 3 4115632.54 8.69 0.0000 smoke race 7298536.57 8749453.3 1 2 7298536.57 4374726.65 15.42 9.24 0.0001 0.0001 Residual 87568400.9 185 473342.708 Total 99915298.6 . estat esize Effect sizes for linear models 188 531464.354 Source Eta-Squared df [95% Conf. Interval] Model .1235736 3 .0399862 .2041365 smoke race .0769345 .0908394 1 2 .0193577 .0233037 .1579213 .1700334 1897 1898 regress postestimation — Postestimation tools for regress Technical note η 2 and ω 2 were developed in the context of analysis of variance. Thus, the published research on the calculation of their confidence intervals focuses on cases where the numerator degrees of freedom are relatively small (for example, df < 20). 
Some combinations of the F statistic, numerator degrees of freedom, and denominator degrees of freedom yield confidence limits that do not contain the corresponding estimated value for an η 2 or ω 2 . This problem is most commonly observed for larger numerator degrees of freedom. Nothing in the literature suggests alternative methods for constructing confidence intervals in such cases; therefore, we recommend cautious interpretation of confidence intervals for η 2 and ω 2 when the numerator degrees of freedom are greater than 20. Stored results for estat esize estat esize stores the following results in r(): Scalars r(level) confidence level Matrices r(esize) a matrix of effect sizes, confidence intervals, degrees of freedom, and F statistics with rows corresponding to each term in the model esize[.,1] η 2 esize[.,2] lower confidence bound for η 2 esize[.,3] upper confidence bound for η 2 esize[.,4] ω 2 esize[.,5] lower confidence bound for ω 2 esize[.,6] upper confidence bound for ω 2 esize[.,7] numerator degrees of freedom esize[.,8] denominator degrees of freedom esize[.,9] F statistic Methods and formulas See Hamilton (2013, chap. 7), Kohler and Kreuter (2012, sec. 9.3), or Baum (2006, chap. 5) for an overview of using Stata to perform regression diagnostics. See Peracchi (2001, chap. 8) for a mathematically rigorous discussion of diagnostics. Methods and formulas are presented under the following headings: predict Special-interest postestimation commands predict Assume that you have already fit the regression model y = Xb + e where X is n × k . regress postestimation — Postestimation tools for regress 1899 Denote the previously estimated coefficient vector by b and its estimated variance matrix by V. predict works by recalling various aspects of the model, such as b, and combining that information with the data currently in memory. Let xj be the j th observation currently in memory, and let s2 be the mean squared error of the regression. If the user specified weights in regress, then X0 X in the following formulas is replaced by X DX, where D is defined in Coefficient estimation and ANOVA table under Methods and formulas in [R] regress. 0 Let V = s2 (X0 X)−1 . Let k be the number of independent variables including the intercept, if any, and let yj be the observed value of the dependent variable. The predicted value (xb option) is defined as ybj = xj b. Let `j represent a lower bound for an observation j and uj represent an upper bound. The probability that yj |xj would be observed in the interval (`j , uj ) — the pr(`, u) option — is P (`j , uj ) = Pr(`j < xj b + ej < uj ) = Φ uj − ybj s −Φ `j − ybj s where for the pr(`, u), e(`, u), and ystar(`, u) options, `j and uj can be anywhere in the range (−∞, +∞). The option e(`, u) computes the expected value of yj |xj conditional on yj |xj being in the interval (`j , uj ), that is, when yj |xj is truncated. It can be expressed as E(`j , uj ) = E(xj b + ej | `j < xj b + ej < uj ) = ybj − s φ uj −b yj s −φ Φ uj −b yj s −Φ `j −b yj s `j −b yj s where φ is the normal density and Φ is the cumulative normal. 
You can also compute ystar(`, u) — the expected value of yj |xj , where yj is assumed censored at `j and uj : if xj b + ej ≤ `j `j yj∗ = xj b + u if `j < xj b + ej < uj uj if xj b + ej ≥ uj This computation can be expressed in several ways, but the most intuitive formulation involves a combination of the two statistics just defined: yj∗ = P (−∞, `j )`j + P (`j , uj )E(`j , uj ) + P (uj , +∞)uj A diagonal element of the projection matrix (hat) or (leverage) is given by hj = xj (X0 X)−1 x0j The standard error of the prediction (the stdp option) is defined as spj = p and can also be written as spj = s hj . p The standard error of the forecast (stdf) is defined as sfj = s 1 + hj . q xj Vx0j 1900 regress postestimation — Postestimation tools for regress p The standard error of the residual (stdr) is defined as srj = s 1 − hj . The residuals (residuals) are defined as ebj = yj − ybj . The standardized residuals (rstandard) are defined as ebsj = ebj /srj . The Studentized residuals (rstudent) are defined as rj = eb pj s(j) 1 − hj where s(j) represents the root mean squared error with the j th observation removed, which is given by ebj2 s2 (T − k) − s2(j) = T − k − 1 (T − k − 1)(1 − hj ) Cook’s D (cooksd) is given by Dj = ebs2j (spj /srj )2 k = hj ebj2 ks2 (1 − hj )2 DFITS (dfits) is given by s DFITSj = rj hj 1 − hj Welsch distance (welsch) is given by Wj = rj p hj (n − 1) 1 − hj COVRATIO (covratio) is given by 1 COVRATIOj = 1 − hj n − k − eb2j n−k−1 k The DFBETAs (dfbeta) for a particular regressor xi are given by DFBETAj =p rj uj U 2 (1 − hj ) where uj are the residuals obtained from a regression of xi on the remaining x’s and U 2 = P j u2j . regress postestimation — Postestimation tools for regress 1901 Special-interest postestimation commands The omitted-variable test (Ramsey 1969) reported by estat ovtest fits the regression yi = xi b + zi t + ui and then performs a standard F test of t = 0. The default test uses zi = (b yi2 , ybi3 , ybi4 ). 2 3 4 2 4 If rhs is specified, zi = (x1i , x1i , x1i , x2i , . . . , xmi ). In either case, the variables are normalized to have minimum 0 and maximum 1 before powers are calculated. The test for heteroskedasticity (Breusch and Pagan 1979; Cook and Weisberg 1983) models Var(ei ) = σ 2 exp(zt), where z is a variable list specified by the user, the list of right-hand-side b The test is of t = 0. Mechanically, estat hettest fits the variables, or the fitted values xβ. augmented regression eb2i /b σ 2 = a + zi t + vi . The original Breusch–Pagan/Cook–Weisberg version of the test assumes that the ei are normally distributed under the null hypothesis which implies that the score test statistic S is equal to the model sum of squares from the augmented regression divided by 2. Under the null hypothesis, S has the χ2 distribution with m degrees of freedom, where m is the number of columns of z. Koenker (1981) derived a score test of the null hypothesis that t = 0 under the assumption that the ei are independent and identically distributed (i.i.d.). Koenker showed that S = N ∗ R2 has a large-sample χ2 distribution with m degrees of freedom, where N is the number of observations and R2 is the R-squared in the augmented regression and m is the number of columns of z. estat hettest, iid produces this version of the test. Wooldridge (2013) showed that an F test of t = 0 in the augmented regression can also be used under the assumption that the ei are i.i.d. estat hettest, fstat produces this version of the test. 
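To illustrate how these tests are invoked (an illustrative sketch; the simple two-regressor price model from the automobile data is used only for concreteness), one might type

. use http://www.stata-press.com/data/r13/auto
. regress price mpg weight
  (output omitted)
. estat ovtest
. estat ovtest, rhs
. estat hettest
. estat hettest, iid
. estat hettest, fstat

The first two commands perform the Ramsey test using powers of the fitted values and powers of the right-hand-side variables, respectively; the last three perform the Breusch–Pagan/Cook–Weisberg, Koenker, and Wooldridge versions of the heteroskedasticity test just described.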
Szroeter’s class of tests for homoskedasticity against the alternative that the residual variance increases in some variable x is defined in terms of Pn h(x )e2 Pn i2 i H = i=1 i=1 ei where h(x) is some weight function that increases in x (Szroeter 1978). H is a weighted average of the h(x), with the squared residuals serving as weights. Under homoskedasticity, H should be approximately equal to the unweighted average of h(x). Large values of H suggest that e2i tends to be large where h(x) is large; that is, the variance indeed increases in x, whereas small values of H suggest that the variance actually decreases in x. estat szroeter uses h(xi ) = rank(xi in x1 . . . xn ); see Judge et al. [1985, 452] for details. estat szroeter displays a normalized version of H , r Q= 6n H n2 − 1 which is approximately N (0, 1) distributed under the null (homoskedasticity). estat hettest and estat szroeter provide adjustments of p-values for multiple testing. The supported methods are described in [R] test. estat imtest performs the information matrix test for the regression model, as well as an orthogonal decomposition into tests for heteroskedasticity δ1 , nonnormal skewness δ2 , and nonnormal kurtosis δ3 (Cameron and Trivedi 1990; Long and Trivedi 1993). The decomposition is obtained via three auxiliary regressions. Let e be the regression residuals, σ b2 be the maximum likelihood estimate 2 of σ in the regression, n be the number of observations, X be the set of k variables specified with 2 2 estat imtest, and Run be the uncentered R2 from a regression. δ1 is obtained as nRun from a 2 2 2 regression of e − σ b on the cross products of the variables in X . δ2 is computed as nRun from a 2 regression of e3 − 3b σ 2 e on X . Finally, δ3 is obtained as nRun from a regression of e4 − 6b σ 2 e2 − 3b σ4 1902 regress postestimation — Postestimation tools for regress on X . δ1 , δ2 , and δ3 are asymptotically χ2 distributed with 1/2k(k +1), K , and 1 degree of freedom. The information test statistic δ = δ1 + δ2 + δ3 is asymptotically χ2 distributed with 1/2k(k + 3) degrees of freedom. White’s test for heteroskedasticity is computed as nR2 from a regression of u b2 on X and the cross products of the variables in X . This test statistic is usually close to δ1 . estat vif calculates the centered variance inflation factor (VIFc ) (Chatterjee and Hadi 2012, 248 – 251) for xj , given by 1 VIFc (xj ) = b2 1−R j b2 is the square of the centered multiple correlation coefficient that results when xj is regressed where R j with intercept against all the other explanatory variables. The uncentered variance inflation factor (VIFuc ) (Belsley 1991, 28 – 29) for xj is given by VIFuc (xj ) = 1 e2 1−R j e2 is the square of the uncentered multiple correlation coefficient that results when xj is where R j regressed without intercept against all the other explanatory variables including the constant term. The methods and formulas for estat esize are described in Methods and formulas of [R] esize. Acknowledgments estat ovtest and estat hettest are based on programs originally written by Richard Goldstein (1991, 1992). estat imtest, estat szroeter, and the current version of estat hettest were written by Jeroen Weesie of the Department of Sociology at Utrecht University, The Netherlands. estat imtest is based in part on code written by J. 
Scott Long of the Department of Sociology at Indiana University, coauthor of the Stata Press book Regression Models for Categorical and Limited Dependent Variables, and author of the Stata Press book The Workflow of Data Analysis Using Stata. References Adkins, L. C., and R. C. Hill. 2011. Using Stata for Principles of Econometrics. 4th ed. Hoboken, NJ: Wiley. Baum, C. F. 2006. An Introduction to Modern Econometrics Using Stata. College Station, TX: Stata Press. Baum, C. F., N. J. Cox, and V. L. Wiggins. 2000. sg137: Tests for heteroskedasticity in regression error distribution. Stata Technical Bulletin 55: 15–17. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 147–149. College Station, TX: Stata Press. Baum, C. F., and V. L. Wiggins. 2000a. sg135: Test for autoregressive conditional heteroskedasticity in regression error distribution. Stata Technical Bulletin 55: 13–14. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 143–144. College Station, TX: Stata Press. . 2000b. sg136: Tests for serial correlation in regression error distribution. Stata Technical Bulletin 55: 14–15. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 145–147. College Station, TX: Stata Press. Belsley, D. A. 1991. Conditional Diagnostics: Collinearity and Weak Data in Regression. New York: Wiley. Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley. Bollen, K. A., and R. W. Jackman. 1990. Regression diagnostics: An expository treatment of outliers and influential cases. In Modern Methods of Data Analysis, ed. J. Fox and J. S. Long, 257–291. Newbury Park, CA: Sage. Breusch, T. S., and A. R. Pagan. 1979. A simple test for heteroscedasticity and random coefficient variation. Econometrica 47: 1287–1294. regress postestimation — Postestimation tools for regress 1903 Cameron, A. C., and P. K. Trivedi. 1990. The information matrix test and its applied alternative hypotheses. Working paper 372, University of California–Davis, Institute of Governmental Affairs. . 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press. Chatterjee, S., and A. S. Hadi. 1986. Influential observations, high leverage points, and outliers in linear regression. Statistical Science 1: 379–393. . 1988. Sensitivity Analysis in Linear Regression. New York: Wiley. . 2012. Regression Analysis by Example. 5th ed. New York: Hoboken, NJ. Cook, R. D. 1977. Detection of influential observation in linear regression. Technometrics 19: 15–18. Cook, R. D., and S. Weisberg. 1982. Residuals and Influence in Regression. New York: Chapman & Hall/CRC. . 1983. Diagnostics for heteroscedasticity in regression. Biometrika 70: 1–10. Cox, N. J. 2004. Speaking Stata: Graphing model diagnostics. Stata Journal 4: 449–475. Cumming, G. 2012. Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Taylor & Francis. DeMaris, A. 2004. Regression with Social Data: Modeling Continuous and Limited Response Variables. Hoboken, NJ: Wiley. Ellis, P. D. 2010. The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results. Cambridge: Cambridge University Press. Garrett, J. M. 2000. sg157: Predicted values calculated from linear or logistic regression models. Stata Technical Bulletin 58: 27–30. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 258–261. College Station, TX: Stata Press. Goldstein, R. 1991. 
srd5: Ramsey test for heteroscedasticity and omitted variables. Stata Technical Bulletin 2: 27. Reprinted in Stata Technical Bulletin Reprints, vol. 1, p. 177. College Station, TX: Stata Press. . 1992. srd14: Cook–Weisberg test of heteroscedasticity. Stata Technical Bulletin 10: 27–28. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 183–184. College Station, TX: Stata Press. Grissom, R. J., and J. J. Kim. 2012. Effect Sizes for Research: Univariate and Multivariate Applications. 2nd ed. New York: Taylor & Francis. Hamilton, L. C. 1992. Regression with Graphics: A Second Course in Applied Statistics. Belmont, CA: Duxbury. . 2013. Statistics with Stata: Updated for Version 12. 8th ed. Boston: Brooks/Cole. Hardin, J. W. 1995. sg32: Variance inflation factors and variance-decomposition proportions. Stata Technical Bulletin 24: 17–22. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 154–160. College Station, TX: Stata Press. Hays, W. L. 1963. Statistics for Psychologists. New York: Holt, Rinehart & Winston. Hill, R. C., W. E. Griffiths, and G. C. Lim. 2011. Principles of Econometrics. 4th ed. Hoboken, NJ: Wiley. Hoaglin, D. C., and P. J. Kempthorne. 1986. Comment [on Chatterjee and Hadi 1986]. Statistical Science 1: 408–412. Hoaglin, D. C., and R. E. Welsch. 1978. The hat matrix in regression and ANOVA. American Statistician 32: 17–22. Huber, C. 2013. Measures of effect size in Stata 13. The Stata Blog: Not Elsewhere Classified. http://blog.stata.com/2013/09/05/measures-of-effect-size-in-stata-13/. Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2nd ed. New York: Wiley. Kelley, K., and K. J. Preacher. 2012. On effect size. Psychological Methods 17: 137–152. Kerlinger, F. N. 1964. Foundations of Behavioral Research. New York: Holt, Rinehart & Winston. Kirk, R. E. 1996. Practical significance: A concept whose time has come. Educational and Psychological Measurement 56: 746–759. Kline, R. B. 2013. Beyond Significance Testing: Statistics Reform in the Behavioral Sciences. 2nd ed. Washington, DC: American Psychological Association. Koenker, R. 1981. A note on studentizing a test for heteroskedasticity. Journal of Econometrics 17: 107–112. Kohler, U., and F. Kreuter. 2012. Data Analysis Using Stata. 3rd ed. College Station, TX: Stata Press. 1904 regress postestimation — Postestimation tools for regress Kutner, M. H., C. J. Nachtsheim, and J. Neter. 2004. Applied Linear Regression Models. 4th ed. New York: McGraw–Hill/Irwin. Lindsey, C., and S. J. Sheather. 2010a. Optimal power transformation via inverse response plots. Stata Journal 10: 200–214. . 2010b. Model fit assessment via marginal model plots. Stata Journal 10: 215–225. Long, J. S., and J. Freese. 2000. sg145: Scalar measures of fit for regression models. Stata Technical Bulletin 56: 34–40. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 197–205. College Station, TX: Stata Press. Long, J. S., and P. K. Trivedi. 1993. Some specification tests for the linear regression model. Sociological Methods and Research 21: 161–204. Reprinted in Testing Structural Equation Models, ed. K. A. Bollen and J. S. Long, pp. 66–110. Newbury Park, CA: Sage. Peracchi, F. 2001. Econometrics. Chichester, UK: Wiley. Ramsey, J. B. 1969. Tests for specification errors in classical linear least-squares regression analysis. Journal of the Royal Statistical Society, Series B 31: 350–371. Ramsey, J. B., and P. Schmidt. 1976. 
Some further results on the use of OLS and BLUS residuals in specification error tests. Journal of the American Statistical Association 71: 389–390. Rousseeuw, P. J., and A. M. Leroy. 1987. Robust Regression and Outlier Detection. New York: Wiley. Smithson, M. 2001. Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational and Psychological Measurement 61: 605–632. Szroeter, J. 1978. A class of parametric tests for heteroscedasticity in linear econometric models. Econometrica 46: 1311–1327. Thompson, B. 2006. Foundations of Behavioral Statistics: An Insight-Based Approach. New York: Guilford Press. Velleman, P. F. 1986. Comment [on Chatterjee and Hadi 1986]. Statistical Science 1: 412–413. Velleman, P. F., and R. E. Welsch. 1981. Efficient computing of regression diagnostics. American Statistician 35: 234–242. Weesie, J. 2001. sg161: Analysis of the turning point of a quadratic specification. Stata Technical Bulletin 60: 18–20. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 273–277. College Station, TX: Stata Press. Weisberg, S. 2005. Applied Linear Regression. 3rd ed. New York: Wiley. Welsch, R. E. 1982. Influence functions and regression diagnostics. In Modern Data Analysis, ed. R. L. Launer and A. F. Siegel, 149–169. New York: Academic Press. . 1986. Comment [on Chatterjee and Hadi 1986]. Statistical Science 1: 403–405. Welsch, R. E., and E. Kuh. 1977. Linear Regression Diagnostics. Technical Report 923-77, Massachusetts Institute of Technology, Cambridge, MA. White, H. L., Jr. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48: 817–838. Wooldridge, J. M. 2013. Introductory Econometrics: A Modern Approach. 5th ed. Mason, OH: South-Western. Also see [R] regress — Linear regression [R] regress postestimation diagnostic plots — Postestimation plots for regress [R] regress postestimation time series — Postestimation tools for regress with time series [U] 20 Estimation and postestimation commands Title regress postestimation diagnostic plots — Postestimation plots for regress Description rvpplot rvfplot lvr2plot avplot Methods and formulas avplots References cprplot Also see acprplot Description The following postestimation commands are of special interest after regress: Command Description rvfplot avplot avplots cprplot acprplot rvpplot lvr2plot residual-versus-fitted plot added-variable plot all added-variables plots in one image component-plus-residual plot augmented component-plus-residual plot residual-versus-predictor plot leverage-versus-squared-residual plot These commands are not appropriate after the svy prefix. For a discussion of the terminology used in this entry, see the Terminology section of Remarks and examples for predict in [R] regress postestimation. rvfplot Syntax for rvfplot rvfplot , rvfplot options rvfplot options Description Plot marker options marker label options change look of markers (color, size, etc.) 
add marker labels; change look or position Add plots addplot(plot) add plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options 1905 1906 regress postestimation diagnostic plots — Postestimation plots for regress Menu for rvfplot Statistics > Linear models and related > Regression diagnostics > Residual-versus-fitted plot Description for rvfplot rvfplot graphs a residual-versus-fitted plot, a graph of the residuals against the fitted values. Options for rvfplot Plot marker options affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G-3] marker options. marker label options specify if and how the markers are to be labeled; see [G-3] marker label options. Add plots addplot(plot) provides a way to add plots to the generated graph. See [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Remarks and examples for rvfplot rvfplot graphs the residuals against the fitted values. Example 1 Using auto.dta described in [U] 1.2.2 Example datasets, we will use regress to fit a model of price on weight, mpg, foreign, and the interaction of foreign with mpg. We specify foreign##c.mpg to obtain the interaction of foreign with mpg; see [U] 11.4.3 Factor variables. regress postestimation diagnostic plots — Postestimation plots for regress 1907 . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress price weight foreign##c.mpg Source SS df Model Residual 350319665 284745731 4 69 87579916.3 4126749.72 Total 635065396 73 8699525.97 price Coef. weight MS Number of obs F( 4, 69) Prob > F R-squared Adj R-squared Root MSE = = = = = = 74 21.22 0.0000 0.5516 0.5256 2031.4 Std. Err. t P>|t| [95% Conf. Interval] 4.613589 .7254961 6.36 0.000 3.166263 6.060914 foreign Foreign mpg 11240.33 263.1875 2751.681 110.7961 4.08 2.38 0.000 0.020 5750.878 42.15527 16729.78 484.2197 foreign#c.mpg Foreign -307.2166 108.5307 -2.83 0.006 -523.7294 -90.70368 _cons -14449.58 4425.72 -3.26 0.002 -23278.65 -5620.51 Once we have fit a model, we may use any of the regression diagnostics commands. rvfplot (read residual-versus-fitted plot) graphs the residuals against the fitted values: −5000 0 Residuals 5000 10000 . rvfplot, yline(0) 2000 4000 6000 8000 Fitted values 10000 12000 All the diagnostic plot commands allow the graph twoway and graph twoway scatter options; we specified a yline(0) to draw a line across the graph at y = 0; see [G-2] graph twoway scatter. In a well-fitted model, there should be no pattern to the residuals plotted against the fitted values — something not true of our model. Ignoring the two outliers at the top center of the graph, we see curvature in the pattern of the residuals, suggesting a violation of the assumption that price is linear in our independent variables. We might also have seen increasing or decreasing variation in the residuals — heteroskedasticity. Any pattern whatsoever indicates a violation of the least-squares assumptions. 1908 regress postestimation diagnostic plots — Postestimation plots for regress avplot Syntax for avplot avplot indepvar , avplot options avplot options Description Plot marker options marker label options change look of markers (color, size, etc.) 
add marker labels; change look or position Reference line rlopts(cline options) affect rendition of the reference line Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options Menu for avplot Statistics > Linear models and related > Regression diagnostics > Added-variable plot Description for avplot avplot graphs an added-variable plot (a.k.a. partial-regression leverage plot, partial regression plot, or adjusted partial residual plot) after regress. indepvar may be an independent variable (a.k.a. predictor, carrier, or covariate) that is currently in the model or not. Options for avplot Plot marker options affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G-3] marker options. marker label options specify if and how the markers are to be labeled; see [G-3] marker label options. Reference line rlopts(cline options) affects the rendition of the reference line. See [G-3] cline options. Add plots addplot(plot) provides a way to add other plots to the generated graph. See [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). regress postestimation diagnostic plots — Postestimation plots for regress 1909 Remarks and examples for avplot avplot graphs an added-variable plot, also known as the partial-regression leverage plot. One of the wonderful features of one-regressor regressions (regressions of y on one x) is that we can graph the data and the regression line. There is no easier way to understand the regression than to examine such a graph. Unfortunately, we cannot do this when we have more than one regressor. With two regressors, it is still theoretically possible — the graph must be drawn in three dimensions, but with three or more regressors no graph is possible. The added-variable plot is an attempt to project multidimensional data back to the two-dimensional world for each of the original regressors. This is, of course, impossible without making some concessions. Call the coordinates on an added-variable plot y and x. The added-variable plot has the following properties: • There is a one-to-one correspondence between (xi , yi ) and the ith observation used in the original regression. • A regression of y on x has the same coefficient and standard error (up to a degree-of-freedom adjustment) as the estimated coefficient and standard error for the regressor in the original regression. • The “outlierness” of each observation in determining the slope is in some sense preserved. It is equally important to note the properties that are not listed. The y and x coordinates of the added-variable plot cannot be used to identify functional form, or, at least, not well (see Mallows [1986]). In the construction of the added-variable plot, the relationship between y and x is forced to be linear. Example 2 Let’s use the same model as we used in example 1. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress price weight foreign##c.mpg (output omitted ) We can now examine the added-variable plot for mpg. −4000 −2000 e( price | X ) 0 2000 4000 6000 . 
avplot mpg −5 0 5 e( mpg | X ) coef = 263.18749, se = 110.79612, t = 2.38 10 1910 regress postestimation diagnostic plots — Postestimation plots for regress This graph suggests a problem in determining the coefficient on mpg. Were this a one-regressor regression, the two points at the top-left corner and the one at the top right would cause us concern, and so it does in our more complicated multiple-regressor case. To identify the problem points, we retyped our command, modifying it to read avplot mpg, mlabel(make), and discovered that the two cars at the top left are the Cadillac Eldorado and the Lincoln Versailles; the point at the top right is the Cadillac Seville. These three cars account for 100% of the luxury cars in our data, suggesting that our model is misspecified. By the way, the point at the lower right of the graph, also cause for concern, is the Plymouth Arrow, our data entry error. Technical note Stata’s avplot command can be used with regressors already in the model, as we just did, or with potential regressors not yet in the model. In either case, avplot will produce the correct graph. The name “added-variable plot” is unfortunate in the case when the variable is already among the list of regressors but is, we think, still preferable to the name “partial-regression leverage plot” assigned by Belsley, Kuh, and Welsch (1980, 30) and more in the spirit of the original use of such plots by Mosteller and Tukey (1977, 271–279). Welsch (1986, 403), however, disagrees: “I am sorry to see that Chatterjee and Hadi [1986] endorse the term ‘added-variable plot’ when Xj is part of the original model” and goes on to suggest the name “adjusted partial residual plot”. avplots Syntax for avplots avplots , avplots options avplots options Description Plot marker options marker label options combine options change look of markers (color, size, etc.) add marker labels; change look or position any of the options documented in [G-2] graph combine Reference line rlopts(cline options) affect rendition of the reference line Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options Menu for avplots Statistics > Linear models and related > Regression diagnostics > Added-variable plot regress postestimation diagnostic plots — Postestimation plots for regress 1911 Description for avplots avplots graphs all the added-variable plots in one image. Options for avplots Plot marker options affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G-3] marker options. marker label options specify if and how the markers are to be labeled; see [G-3] marker label options. combine options are any of the options documented in [G-2] graph combine. These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Reference line rlopts(cline options) affects the rendition of the reference line. See [G-3] cline options. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Remarks and examples for avplots Example 3 In example 2, we used avplot to examine the added-variable plot for mpg in our regression of price on weight and foreign##c.mpg. Now let’s use avplots to graph an added-variable plot for every regressor in the data. 
1912 regress postestimation diagnostic plots — Postestimation plots for regress −500 0 500 e( weight | X ) 1000 −5000 −5000 e( price | X ) 0 5000 e( price | X ) 0 5000 10000 10000 . avplots −5 0 5 e( mpg | X ) coef = 263.18749, se = 110.79612, t = 2.38 0 .1 .2 e( 1.foreign | X ) .3 coef = 11240.331, se = 2751.6808, t = 4.08 10 e( price | X ) −4000 −2000 0 200040006000 e( price | X ) −4000 −2000 0 200040006000 coef = 4.6135886, se = .7254961, t = 6.36 −.1 −6 −4 −2 0 2 e( 1.foreign#c.mpg | X ) coef = −307.21656, se = 108.53072, t = −2.83 4 regress postestimation diagnostic plots — Postestimation plots for regress 1913 cprplot Syntax for cprplot cprplot indepvar , cprplot options cprplot options Description Plot marker options marker label options change look of markers (color, size, etc.) add marker labels; change look or position Reference line rlopts(cline options) affect rendition of the reference line Options lowess lsopts(lowess options) mspline msopts(mspline options) add a lowess smooth of the plotted points affect rendition of the lowess smooth add median spline of the plotted points affect rendition of the spline Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options Menu for cprplot Statistics > Linear models and related > Regression diagnostics > Component-plus-residual plot Description for cprplot cprplot graphs a component-plus-residual plot (a.k.a. partial residual plot) after regress. indepvar must be an independent variable that is currently in the model. Options for cprplot Plot marker options affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G-3] marker options. marker label options specify if and how the markers are to be labeled; see [G-3] marker label options. Reference line rlopts(cline options) affects the rendition of the reference line. See [G-3] cline options. 1914 regress postestimation diagnostic plots — Postestimation plots for regress Options lowess adds a lowess smooth of the plotted points to assist in detecting nonlinearities. lsopts(lowess options) affects the rendition of the lowess smooth. For an explanation of these options, especially the bwidth() option, see [R] lowess. Specifying lsopts() implies the lowess option. mspline adds a median spline of the plotted points to assist in detecting nonlinearities. msopts(mspline options) affects the rendition of the spline. For an explanation of these options, especially the bands() option, see [G-2] graph twoway mspline. Specifying msopts() implies the mspline option. Add plots addplot(plot) provides a way to add other plots to the generated graph. See [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Remarks and examples for cprplot Added-variable plots are successful at identifying outliers, but they cannot be used to identify functional form. The component-plus-residual plot (Ezekiel 1924; Larsen and McCleary 1972) is another attempt at projecting multidimensional data into a two-dimensional form, but with different properties. Although the added-variable plot can identify outliers, the component-plus-residual plot cannot. 
It can, however, be used to examine the functional form assumptions of the model. Both plots have the property that a regression line through the coordinates has a slope equal to the estimated coefficient in the regression model. Example 4 We illustrate component-plus-residual plots using a variation of auto.dta. . use http://www.stata-press.com/data/r13/auto1 (Automobile Models) . regress price mpg weight Source SS df Model Residual 187716578 447348818 2 71 93858289 6300687.58 Total 635065396 73 8699525.97 price Coef. mpg weight _cons -55.9393 1.710992 2197.9 MS Std. Err. 75.24136 .5861682 3190.768 t -0.74 2.92 0.69 Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.460 0.005 0.493 = = = = = = 74 14.90 0.0000 0.2956 0.2757 2510.1 [95% Conf. Interval] -205.9663 .5422063 -4164.311 94.08771 2.879779 8560.11 regress postestimation diagnostic plots — Postestimation plots for regress 1915 In fact, we know that the effects of mpg in this model are nonlinear — if we added mpg squared to the model, its coefficient would have a t statistic of 2.38, the t statistic on mpg would become −2.48, and weight’s effect would become about one-third of its current value and become statistically insignificant. Pretend that we do not know this. The component-plus-residual plot for mpg is −4000 Component plus residual −2000 0 2000 4000 6000 . cprplot mpg, mspline msopts(bands(13)) 10 20 30 40 Mileage (mpg) We are supposed to examine the above graph for nonlinearities or, equivalently, ask if the regression line, which has slope equal to the estimated effect of mpg in the original model, fits the data adequately. To assist our eyes, we added a median spline. Perhaps some people may detect nonlinearity from this graph, but we assert that if we had not previously revealed the nonlinearity of mpg and if we had not added the median spline, the graph would not overly bother us. 1916 regress postestimation diagnostic plots — Postestimation plots for regress acprplot Syntax for acprplot acprplot indepvar , acprplot options acprplot options Description Plot marker options marker label options change look of markers (color, size, etc.) add marker labels; change look or position Reference line rlopts(cline options) affect rendition of the reference line Options lowess lsopts(lowess options) mspline msopts(mspline options) add a lowess smooth of the plotted points affect rendition of the lowess smooth add median spline of the plotted points affect rendition of the spline Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options Menu for acprplot Statistics > Linear models and related > Regression diagnostics > Augmented component-plus-residual plot Description for acprplot acprplot graphs an augmented component-plus-residual plot (a.k.a. augmented partial residual plot) as described by Mallows (1986). This seems to work better than the component-plus-residual plot for identifying nonlinearities in the data. Options for acprplot Plot marker options affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G-3] marker options. marker label options specify if and how the markers are to be labeled; see [G-3] marker label options. Reference line rlopts(cline options) affects the rendition of the reference line. See [G-3] cline options. 
regress postestimation diagnostic plots — Postestimation plots for regress 1917 Options lowess adds a lowess smooth of the plotted points to assist in detecting nonlinearities. lsopts(lowess options) affects the rendition of the lowess smooth. For an explanation of these options, especially the bwidth() option, see [R] lowess. Specifying lsopts() implies the lowess option. mspline adds a median spline of the plotted points to assist in detecting nonlinearities. msopts(mspline options) affects the rendition of the spline. For an explanation of these options, especially the bands() option, see [G-2] graph twoway mspline. Specifying msopts() implies the mspline option. Add plots addplot(plot) provides a way to add other plots to the generated graph. See [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Remarks and examples for acprplot In the cprplot section above, we discussed the component-plus-residual plot. Mallows (1986) proposed an augmented component-plus-residual plot that is often more sensitive to detecting nonlinearity. Example 5 Let’s compare the augmented component-plus-residual plot with the component-plus-residual plot of example 4. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress price weight foreign##c.mpg (output omitted ) 1918 regress postestimation diagnostic plots — Postestimation plots for regress −15000 Augmented component plus residual −10000 −5000 . acprplot mpg, mspline msopts(bands(13)) 10 20 30 40 Mileage (mpg) It does do somewhat better. rvpplot Syntax for rvpplot rvpplot indepvar , rvpplot options rvpplot options Description Plot marker options marker label options change look of markers (color, size, etc.) add marker labels; change look or position Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options Menu for rvpplot Statistics > Linear models and related > Regression diagnostics > Residual-versus-predictor plot regress postestimation diagnostic plots — Postestimation plots for regress 1919 Description for rvpplot rvpplot graphs a residual-versus-predictor plot (a.k.a. independent variable plot or carrier plot), a graph of the residuals against the specified predictor. Options for rvpplot Plot marker options affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G-3] marker options. marker label options specify if and how the markers are to be labeled; see [G-3] marker label options. Add plots addplot(plot) provides a way to add other plots to the generated graph; see [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Remarks and examples for rvpplot The residual-versus-predictor plot is a simple way to look for violations of the regression assumptions. If the assumptions are correct, there should be no pattern on the graph. Example 6 Let’s use our model of price on mpg and weight. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) −5000 0 Residuals 5000 10000 . 
regress price weight foreign##c.mpg (output omitted ) . rvpplot mpg, yline(0) 10 20 30 Mileage (mpg) 40 1920 regress postestimation diagnostic plots — Postestimation plots for regress Remember, any pattern counts as a problem, and in this graph, we see that the variation in the residuals decreases as mpg increases. lvr2plot Syntax for lvr2plot lvr2plot , lvr2plot options lvr2plot options Description Plot marker options marker label options change look of markers (color, size, etc.) add marker labels; change look or position Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options Menu for lvr2plot Statistics > Linear models and related > Regression diagnostics > Leverage-versus-squared-residual plot Description for lvr2plot lvr2plot graphs a leverage-versus-squared-residual plot (a.k.a. L-R plot). Options for lvr2plot Plot marker options affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G-3] marker options. marker label options specify if and how the markers are to be labeled; see [G-3] marker label options. Add plots addplot(plot) provides a way to add other plots to the generated graph. See [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). regress postestimation diagnostic plots — Postestimation plots for regress 1921 Remarks and examples for lvr2plot One of the most useful diagnostic graphs is provided by lvr2plot (leverage-versus-residual-squared plot), a graph of leverage against the (normalized) residuals squared. Example 7 We illustrate lvr2plot using our model in example 1. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) 0 .1 Leverage .2 .3 .4 . regress price weight foreign##c.mpg (output omitted ) . lvr2plot 0 .05 .1 .15 Normalized residual squared .2 The lines on the chart show the average values of leverage and the (normalized) residuals squared. Points above the horizontal line have higher-than-average leverage; points to the right of the vertical line have larger-than-average residuals. One point immediately catches our eye, and four more make us pause. The point at the top of the graph has high leverage and a smaller-than-average residual. The other points that bother us all have higher-than-average leverage, two with smaller-than-average residuals and two with larger-than-average residuals. A less pretty but more useful version of the above graph specifies that make be used as the symbol (see [G-3] marker label options): 1922 regress postestimation diagnostic plots — Postestimation plots for regress .4 . lvr2plot, mlabel(make) mlabp(0) m(none) mlabsize(small) Leverage .2 .3 VW Diesel Peugeot 604 Plym. Champ Plym. Arrow Cad. Seville Subaru Datsun 210 Volvo 260 .1 Olds 98 Audi 5000 Dodge Colt Ford Fiesta Linc. Continental Toyota Celica Toyota Corona Linc. Mark V Chev.Toyota Chevette Corolla Honda Civic Fiat Strada Mazda GLC Renault Le CarHorizon Datsun 810 Plym. VW Rabbit Buick Opel Audi Fox BMW 320i VW Scirocco Cad. Deville VW Dasher Ford Mustang Plym. Sapporo Merc. Bobcat Merc. XR−7 Datsun 200 Merc. Cougar Merc. Zephyr Datsun 510 Honda Accord Merc. Marquis AMC Spirit Buick Electra AMC Pacer Olds Toronado Dodge Magnum Pont. 
Sunbird Olds Starfire Chev. Monza Chev. Impala Buick Riviera Plym. Volare Chev. Monte Carlo Dodge St. Regis Merc. Monarch Pont. Le Mans Chev. Malibu Pont. Grand Prix Pont. Catalina AMC Concord Olds Delta 88 Buick LeSabre Pont. Firebird Dodge Diplomat Olds Cutlass Olds Cutl Supr Olds Omega Buick Skylark Chev. Nova Pont. Phoenix Buick Century Buick Regal 0 Linc. Versailles Cad. Eldorado 0 .05 .1 .15 Normalized residual squared .2 The VW Diesel, Plymouth Champ, Plymouth Arrow, and Peugeot 604 are the points that cause us the most concern. When we further examine our data, we discover that the VW Diesel is the only diesel in our data and that the data for the Plymouth Arrow were entered incorrectly into the computer. No such simple explanations were found for the Plymouth Champ and Peugeot 604. Methods and formulas See Hamilton (2013, 209–214) and Kohler and Kreuter (2012, sec. 9.3) for a discussion of these diagnostic graphs. The lvr2plot command plots leverage against the squares of the normalized residuals. The P normalized residuals are defined as ebnj = ebj /( i ebi2 )1/2 . References Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley. Chatterjee, S., and A. S. Hadi. 1986. Influential observations, high leverage points, and outliers in linear regression. Statistical Science 1: 379–393. Cox, N. J. 2004. Speaking Stata: Graphing model diagnostics. Stata Journal 4: 449–475. Ezekiel, M. 1924. A method of handling curvilinear correlation for any number of variables. Journal of the American Statistical Association 19: 431–453. Hamilton, L. C. 1992. Regression with Graphics: A Second Course in Applied Statistics. Belmont, CA: Duxbury. . 2013. Statistics with Stata: Updated for Version 12. 8th ed. Boston: Brooks/Cole. Hoaglin, D. C., and R. E. Welsch. 1978. The hat matrix in regression and ANOVA. American Statistician 32: 17–22. Kohler, U., and F. Kreuter. 2012. Data Analysis Using Stata. 3rd ed. College Station, TX: Stata Press. Larsen, W. A., and S. J. McCleary. 1972. The use of partial residual plots in regression analysis. Technometrics 14: 781–790. Lindsey, C., and S. J. Sheather. 2010a. Optimal power transformation via inverse response plots. Stata Journal 10: 200–214. regress postestimation diagnostic plots — Postestimation plots for regress 1923 . 2010b. Model fit assessment via marginal model plots. Stata Journal 10: 215–225. Mallows, C. L. 1986. Augmented partial residuals. Technometrics 28: 313–319. Mosteller, C. F., and J. W. Tukey. 1977. Data Analysis and Regression: A Second Course in Statistics. Reading, MA: Addison–Wesley. Welsch, R. E. 1986. Comment [on Chatterjee and Hadi 1986]. Statistical Science 1: 403–405. 
Also see
[R] regress — Linear regression
[R] regress postestimation — Postestimation tools for regress
[R] regress postestimation time series — Postestimation tools for regress with time series
[U] 20 Estimation and postestimation commands

Title
regress postestimation time series — Postestimation tools for regress with time series
Description   Syntax for estat archlm   Options for estat archlm   Syntax for estat bgodfrey   Options for estat bgodfrey   Syntax for estat durbinalt   Options for estat durbinalt   Syntax for estat dwatson   Menu for estat   Remarks and examples   Stored results   Methods and formulas   Acknowledgment   References   Also see

Description
The following postestimation commands for time series are available for regress:

Command           Description
estat archlm      test for ARCH effects in the residuals
estat bgodfrey    Breusch–Godfrey test for higher-order serial correlation
estat durbinalt   Durbin's alternative test for serial correlation
estat dwatson     Durbin–Watson d statistic to test for first-order serial correlation

These commands provide regression diagnostic tools specific to time series. You must tsset your data before using these commands; see [TS] tsset. estat archlm tests for time-dependent volatility. estat bgodfrey, estat durbinalt, and estat dwatson test for serial correlation in the residuals of a linear regression. For non-time-series regression diagnostic tools, see [R] regress postestimation.
estat archlm performs Engle's Lagrange multiplier (LM) test for the presence of autoregressive conditional heteroskedasticity.
estat bgodfrey performs the Breusch–Godfrey test for higher-order serial correlation in the disturbance. This test does not require that all the regressors be strictly exogenous.
estat durbinalt performs Durbin's alternative test for serial correlation in the disturbance. This test does not require that all the regressors be strictly exogenous.
estat dwatson computes the Durbin–Watson d statistic (Durbin and Watson 1950) to test for first-order serial correlation in the disturbance when all the regressors are strictly exogenous.

Syntax for estat archlm
estat archlm [, archlm options]

archlm options    Description
lags(numlist)     test numlist lag orders
force             allow test after regress, vce(robust)

Options for estat archlm
lags(numlist) specifies a list of numbers, indicating the lag orders to be tested. The test will be performed separately for each order. The default is order one.
force allows the test to be run after regress, vce(robust). The command will not work if the vce(cluster clustvar) option is specified with regress; see [R] regress.

Syntax for estat bgodfrey
estat bgodfrey [, bgodfrey options]

bgodfrey options    Description
lags(numlist)       test numlist lag orders
nomiss0             do not use Davidson and MacKinnon's approach
small               obtain p-values using the F or t distribution

Options for estat bgodfrey
lags(numlist) specifies a list of numbers, indicating the lag orders to be tested. The test will be performed separately for each order. The default is order one.
nomiss0 specifies that Davidson and MacKinnon's approach (1993, 358), which replaces the missing values in the initial observations on the lagged residuals in the auxiliary regression with zeros, not be used.
small specifies that the p-values of the test statistics be obtained using the F or t distribution instead of the default chi-squared or normal distribution.
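For instance (an illustrative command line, not additional syntax; example 1 below gives a complete worked analysis), after fitting a model with regress on tsset data, typing

. estat bgodfrey, lags(1/4) small

tests for serial correlation separately at lag orders 1 through 4 and reports F-based p-values; adding nomiss0 would drop, rather than zero-fill, the initial observations whose lagged residuals are missing.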
Syntax for estat durbinalt estat durbinalt , durbinalt options durbinalt options Description lags(numlist) nomiss0 robust small force test numlist lag orders do not use Davidson and MacKinnon’s approach compute standard errors using the robust/sandwich estimator obtain p-values using the F or t distribution allow test after regress, vce(robust) or after newey Options for estat durbinalt lags(numlist) specifies a list of numbers, indicating the lag orders to be tested. The test will be performed separately for each order. The default is order one. 1926 regress postestimation time series — Postestimation tools for regress with time series nomiss0 specifies that Davidson and MacKinnon’s approach (1993, 358), which replaces the missing values in the initial observations on the lagged residuals in the auxiliary regression with zeros, not be used. robust specifies that the Huber/White/sandwich robust estimator of the variance–covariance matrix be used in Durbin’s alternative test. small specifies that the p-values of the test statistics be obtained using the F or t distribution instead of the default chi-squared or normal distribution. This option may not be specified with robust, which always uses an F or t distribution. force allows the test to be run after regress, vce(robust) and after newey (see [R] regress and [TS] newey). The command will not work if the vce(cluster clustvar) option is specified with regress. Syntax for estat dwatson estat dwatson Menu for estat Statistics > Postestimation > Reports and statistics Remarks and examples The Durbin–Watson test is used to determine whether the error term in a linear regression model follows an AR(1) process. For the linear model yt = xt β + ut the AR(1) process can be written as ut = ρut−1 + t In general, an AR(1) process requires only that t be independent and identically distributed (i.i.d.). The Durbin–Watson test, however, requires t to be distributed N (0, σ 2 ) for the statistic to have an exact distribution. Also, the Durbin–Watson test can be applied only when the regressors are strictly exogenous. A regressor x is strictly exogenous if Corr(xs , ut ) = 0 for all s and t, which precludes the use of the Durbin–Watson statistic with models where lagged values of the dependent variable are included as regressors. The null hypothesis of the test is that there is no first-order autocorrelation. The Durbin–Watson d statistic can take on values between 0 and 4 and under the null d is equal to 2. Values of d less than 2 suggest positive autocorrelation (ρ > 0), whereas values of d greater than 2 suggest negative autocorrelation (ρ < 0). Calculating the exact distribution of the d statistic is difficult, but empirical upper and lower bounds have been established based on the sample size and the number of regressors. Extended tables for the d statistic have been published by Savin and White (1977). For example, suppose you have a model with 30 observations and three regressors (including the constant term). For a test of the null hypothesis of no autocorrelation versus the alternative of positive autocorrelation, the lower bound of the d statistic is 1.284, and the upper bound is 1.567 at the 5% significance level. You would reject the null if d < 1.284, and you would fail to reject if d > 1.567. A value falling within the range (1.284, 1.567) leads to no conclusion about whether or not to reject the null hypothesis. 
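A rough guide for interpreting the magnitude of d (an approximation, not a substitute for the bounds tables cited above) follows from expanding the numerator of the statistic:

\[ d \;\approx\; 2\left(1 - \widehat{\rho}\,\right) \]

where \(\widehat{\rho}\) is the estimated first-order autocorrelation of the residuals. Thus d = 2 corresponds to \(\widehat{\rho} = 0\), values of d near 0 correspond to strong positive autocorrelation, and values near 4 correspond to strong negative autocorrelation. For instance, the d statistic of 0.32 obtained in example 1 below corresponds to \(\widehat{\rho} \approx 0.84\) under this approximation.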
regress postestimation time series — Postestimation tools for regress with time series 1927 When lagged dependent variables are included among the regressors, the past values of the error term are correlated with those lagged variables at time t, implying that they are not strictly exogenous regressors. The inclusion of covariates that are not strictly exogenous causes the d statistic to be biased toward the acceptance of the null hypothesis. Durbin (1970) suggested an alternative test for models with lagged dependent variables and extended that test to the more general AR(p) serial correlation process ut = ρ1 ut−1 + · · · + ρp ut−p + t where t is i.i.d. with variance σ 2 but is not assumed or required to be normal for the test. The null hypothesis of Durbin’s alternative test is H0 : ρ1 = 0, . . . , ρp = 0 and the alternative is that at least one of the ρ’s is nonzero. Although the null hypothesis was originally derived for an AR(p) process, this test turns out to have power against MA(p) processes as well. Hence, the actual null of this test is that there is no serial correlation up to order p because the MA(p) and the AR(p) models are locally equivalent alternatives under the null. See Godfrey (1988, 113–115) for a discussion of this result. Durbin’s alternative test is in fact a LM test, but it is most easily computed with a Wald test on the coefficients of the lagged residuals in an auxiliary OLS regression of the residuals on their lags and all the covariates in the original regression. Consider the linear regression model yt = β1 x1t + · · · + βk xkt + ut (1) in which the covariates x1 through xk are not assumed to be strictly exogenous and ut is assumed to be i.i.d. and to have finite variance. The process is also assumed to be stationary. (See Wooldridge [2013] for a discussion of stationarity.) Estimating the parameters in (1) by OLS obtains the residuals u bt . Next another OLS regression is performed of u bt on u bt−1 , . . . , u bt−p and the other regressors, u bt = γ1 u bt−1 + · · · + γp u bt−p + β1 x1t + · · · + βk xkt + t (2) where t stands for the random-error term in this auxiliary OLS regression. Durbin’s alternative test is then obtained by performing a Wald test that γ1 , . . . , γp are jointly zero. The test can be made robust to an unknown form of heteroskedasticity by using a robust VCE estimator when estimating the regression in (2). When there are only strictly exogenous regressors and p = 1, this test is asymptotically equivalent to the Durbin–Watson test. The Breusch–Godfrey test is also an LM test of the null hypothesis of no autocorrelation versus the alternative that ut follows an AR(p) or MA(p) process. Like Durbin’s alternative test, it is based on the auxiliary regression (2), and it is computed as N R2 , where N is the number of observations and R2 is the simple R2 from the regression. This test and Durbin’s alternative test are asymptotically equivalent. The test statistic N R2 has an asymptotic χ2 distribution with p degrees of freedom. It is valid with or without the strict exogeneity assumption but is not robust to conditional heteroskedasticity, even if a robust VCE is used when fitting (2). In fitting (2), the values of the lagged residuals will be missing in the initial periods. As noted by Davidson and MacKinnon (1993), the residuals will not be orthogonal to the other covariates in the model in this restricted sample, which implies that the R2 from the auxiliary regression will not be zero when the lagged residuals are left out. 
Hence, Breusch and Godfrey’s N R2 version of the test may overreject in small samples. To correct this problem, Davidson and MacKinnon (1993) recommend setting the missing values of the lagged residuals to zero and running the auxiliary regression in (2) over the full sample used in (1). This small-sample correction has become conventional for both the Breusch–Godfrey and Durbin’s alternative test, and it is the default for both commands. Specifying the nomiss0 option overrides this default behavior and treats the initial missing values generated by regressing on the lagged residuals as missing. Hence, nomiss0 causes these initial observations to be dropped from the sample of the auxiliary regression. 1928 regress postestimation time series — Postestimation tools for regress with time series Durbin’s alternative test and the Breusch–Godfrey test were originally derived for the case covered by regress without the vce(robust) option. However, after regress, vce(robust) and newey, Durbin’s alternative test is still valid and can be invoked if the robust and force options are specified. Example 1: tests for serial correlation Using data from Klein (1950), we first fit an OLS regression of consumption on the government wage bill: . use http://www.stata-press.com/data/r13/klein . tsset yr time variable: yr, 1920 to 1941 delta: 1 unit . regress consump wagegovt Source SS df MS Model Residual 532.567711 601.207167 1 20 532.567711 30.0603584 Total 1133.77488 21 53.9892799 consump Coef. wagegovt _cons 2.50744 40.84699 Std. Err. t .5957173 3.192183 4.21 12.80 Number of obs F( 1, 20) Prob > F R-squared Adj R-squared Root MSE = = = = = = 22 17.72 0.0004 0.4697 0.4432 5.4827 P>|t| [95% Conf. Interval] 0.000 0.000 1.264796 34.18821 3.750085 47.50577 If we assume that wagegov is a strictly exogenous variable, we can use the Durbin–Watson test to check for first-order serial correlation in the errors. . estat dwatson Durbin-Watson d-statistic( 2, 22) = .3217998 The Durbin–Watson d statistic, 0.32, is far from the center of its distribution (d = 2.0). Given 22 observations and two regressors (including the constant term) in the model, the lower 5% bound is about 0.997, much greater than the computed d statistic. Assuming that wagegov is strictly exogenous, we can reject the null of no first-order serial correlation. Rejecting the null hypothesis does not necessarily mean an AR process; other forms of misspecification may also lead to a significant test statistic. If we are willing to assume that the errors follow an AR(1) process and that wagegov is strictly exogenous, we could refit the model using arima or prais and model the error process explicitly; see [TS] arima and [TS] prais. If we are not willing to assume that wagegov is strictly exogenous, we could instead use Durbin’s alternative test or the Breusch–Godfrey to test for first-order serial correlation. Because we have only 22 observations, we will use the small option. . estat durbinalt, small Durbin’s alternative test for autocorrelation lags(p) F 1 35.035 df ( 1, Prob > F 19 ) H0: no serial correlation 0.0000 regress postestimation time series — Postestimation tools for regress with time series 1929 . estat bgodfrey, small Breusch-Godfrey LM test for autocorrelation lags(p) F 1 14.264 df ( 1, Prob > F 19 ) 0.0013 H0: no serial correlation Both tests strongly reject the null of no first-order serial correlation, so we decide to refit the model with two lags of consump included as regressors and then rerun estat durbinalt and estat bgodfrey. 
Because the revised model includes lagged values of the dependent variable, the Durbin–Watson test is not applicable. . regress consump wagegovt L.consump L2.consump Source SS df MS Model Residual 702.660311 85.1596011 3 16 234.220104 5.32247507 Total 787.819912 19 41.4642059 consump Coef. wagegovt Number of obs F( 3, 16) Prob > F R-squared Adj R-squared Root MSE = = = = = = 20 44.01 0.0000 0.8919 0.8716 2.307 Std. Err. t P>|t| [95% Conf. Interval] .6904282 .3295485 2.10 0.052 -.0081835 1.38904 consump L1. L2. 1.420536 -.650888 .197024 .1933351 7.21 -3.37 0.000 0.004 1.002864 -1.06074 1.838208 -.241036 _cons 9.209073 5.006701 1.84 0.084 -1.404659 19.82281 . estat durbinalt, small lags(1/2) Durbin’s alternative test for autocorrelation lags(p) F 1 2 0.080 0.260 df ( ( 1, 2, Prob > F 15 ) 14 ) 0.7805 0.7750 H0: no serial correlation . estat bgodfrey, small lags(1/2) Breusch-Godfrey LM test for autocorrelation lags(p) F 1 2 0.107 0.358 df ( ( 1, 2, Prob > F 15 ) 14 ) 0.7484 0.7056 H0: no serial correlation Although wagegov and the constant term are no longer statistically different from zero at the 5% level, the output from estat durbinalt and estat bgodfrey indicates that including the two lags of consump has removed any serial correlation from the errors. Engle (1982) suggests an LM test for checking for autoregressive conditional heteroskedasticity (ARCH) in the errors. The pth-order ARCH model can be written as 1930 regress postestimation time series — Postestimation tools for regress with time series σt2 = E(u2t |ut−1 , . . . , ut−p ) = γ0 + γ1 u2t−1 + · · · + γp u2t−p To test the null hypothesis of no autoregressive conditional heteroskedasticity (that is, γ1 = · · · = γp = 0), we first fit the OLS model (1), obtain the residuals u bt , and run another OLS regression on the lagged residuals: u b2t = γ0 + γ1 u b2t−1 + · · · + γp u b2t−p + (3) The test statistic is N R2 , where N is the number of observations in the sample and R2 is the R2 from the regression in (3). Under the null hypothesis, the test statistic follows a χ2p distribution. Example 2: estat archlm We refit the original model that does not include the two lags of consump and then use estat archlm to see if there is any evidence that the errors are autoregressive conditional heteroskedastic. . regress consump wagegovt Source SS df MS Model Residual 532.567711 601.207167 1 20 532.567711 30.0603584 Total 1133.77488 21 53.9892799 consump Coef. wagegovt _cons 2.50744 40.84699 Std. Err. .5957173 3.192183 t 4.21 12.80 Number of obs F( 1, 20) Prob > F R-squared Adj R-squared Root MSE = = = = = = 22 17.72 0.0004 0.4697 0.4432 5.4827 P>|t| [95% Conf. Interval] 0.000 0.000 1.264796 34.18821 3.750085 47.50577 . estat archlm, lags(1 2 3) LM test for autoregressive conditional heteroskedasticity (ARCH) lags(p) chi2 df Prob > chi2 1 2 3 5.543 9.431 9.039 1 2 3 0.0186 0.0090 0.0288 H0: no ARCH effects vs. H1: ARCH(p) disturbance estat archlm shows the results for tests of ARCH(1), ARCH(2), and ARCH(3) effects, respectively. At the 5% significance level, all three tests reject the null hypothesis that the errors are not autoregressive conditional heteroskedastic. See [TS] arch for information on fitting ARCH models. 
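As with the serial-correlation tests, the construction in (3) can be verified by hand. The sketch below assumes the same Klein regression and a single lag; the variable names u and u2 are introduced only for illustration, and estat archlm should be preferred in practice because it handles the lag list, gaps, and stored results for you.

. use http://www.stata-press.com/data/r13/klein, clear
. tsset yr
. regress consump wagegovt
. predict double u, residuals
. generate double u2 = u^2
. regress u2 L.u2
. display "ARCH LM statistic = " e(N)*e(r2)

Because L.u2 is missing in the first period, this auxiliary regression uses one fewer observation than the original model, so the by-hand statistic need not match the output of estat archlm exactly.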
Stored results

estat archlm stores the following in r():

Scalars
    r(N)          number of observations
    r(N_gaps)     number of gaps
    r(k)          number of regressors
Macros
    r(lags)       lag order
Matrices
    r(arch)       test statistic for each lag order
    r(df)         degrees of freedom
    r(p)          two-sided p-values

estat bgodfrey stores the following in r():

Scalars
    r(N)          number of observations
    r(N_gaps)     number of gaps
    r(k)          number of regressors
Macros
    r(lags)       lag order
Matrices
    r(chi2)       chi-squared statistic for each lag order
    r(F)          F statistic for each lag order (small only)
    r(df)         degrees of freedom
    r(df_r)       residual degrees of freedom (small only)
    r(p)          two-sided p-values

estat durbinalt stores the following in r():

Scalars
    r(N)          number of observations
    r(N_gaps)     number of gaps
    r(k)          number of regressors
Macros
    r(lags)       lag order
Matrices
    r(chi2)       chi-squared statistic for each lag order
    r(F)          F statistic for each lag order (small only)
    r(df)         degrees of freedom
    r(df_r)       residual degrees of freedom (small only)
    r(p)          two-sided p-values

estat dwatson stores the following in r():

Scalars
    r(N)          number of observations
    r(N_gaps)     number of gaps
    r(k)          number of regressors
    r(dw)         Durbin–Watson statistic

Methods and formulas

Consider the regression

    y_t = \beta_1 x_{1t} + \cdots + \beta_k x_{kt} + u_t    (4)

in which some of the covariates are not strictly exogenous. In particular, some of the x_{it} may be lags of the dependent variable. We are interested in whether the u_t are serially correlated.

The Durbin–Watson d statistic reported by estat dwatson is

    d = \frac{\sum_{t=1}^{n-1} (\hat{u}_{t+1} - \hat{u}_t)^2}{\sum_{t=1}^{n} \hat{u}_t^2}

where \hat{u}_t represents the residual of the tth observation.

To compute Durbin's alternative test and the Breusch–Godfrey test against the null hypothesis that there is no pth-order serial correlation, we fit the regression in (4), compute the residuals, and then fit the following auxiliary regression of the residuals \hat{u}_t on p lags of \hat{u}_t and on all the covariates in the original regression in (4):

    \hat{u}_t = \gamma_1 \hat{u}_{t-1} + \cdots + \gamma_p \hat{u}_{t-p} + \beta_1 x_{1t} + \cdots + \beta_k x_{kt} + \epsilon_t    (5)

Durbin's alternative test is computed by performing a Wald test to determine whether the coefficients of \hat{u}_{t-1}, \ldots, \hat{u}_{t-p} are jointly different from zero. By default, the statistic is assumed to be distributed \chi^2(p). When small is specified, the statistic is assumed to follow an F(p, N - p - k) distribution. The reported p-value is a two-sided p-value. When robust is specified, the Wald test is performed using the Huber/White/sandwich estimator of the variance–covariance matrix, and the test is robust to an unspecified form of heteroskedasticity.

The Breusch–Godfrey test is computed as NR^2, where N is the number of observations in the auxiliary regression (5) and R^2 is the R^2 from the same regression (5). Like Durbin's alternative test, the Breusch–Godfrey test is asymptotically distributed \chi^2(p), but specifying small causes the p-value to be computed using an F(p, N - p - k) distribution.

By default, the initial missing values of the lagged residuals are replaced with zeros, and the auxiliary regression is run over the full sample used in the original regression of (4). Specifying the nomiss0 option causes these missing values to be treated as missing values, and the corresponding observations are dropped from the sample.

Engle's LM test for ARCH(p) effects fits an OLS regression of \hat{u}_t^2 on \hat{u}_{t-1}^2, \ldots, \hat{u}_{t-p}^2:

    \hat{u}_t^2 = \gamma_0 + \gamma_1 \hat{u}_{t-1}^2 + \cdots + \gamma_p \hat{u}_{t-p}^2 + \epsilon_t

The test statistic is nR^2 and is asymptotically distributed \chi^2(p).

Acknowledgment

The original versions of estat archlm, estat bgodfrey, and estat durbinalt were written by Christopher F. Baum of the Department of Economics at Boston College and author of the Stata Press books An Introduction to Modern Econometrics Using Stata and An Introduction to Stata Programming.

References

Baum, C. F. 2006. An Introduction to Modern Econometrics Using Stata. College Station, TX: Stata Press.
Baum, C. F., and V. L. Wiggins. 2000a. sg135: Test for autoregressive conditional heteroskedasticity in regression error distribution. Stata Technical Bulletin 55: 13–14. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 143–144. College Station, TX: Stata Press.
———. 2000b. sg136: Tests for serial correlation in regression error distribution. Stata Technical Bulletin 55: 14–15. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 145–147. College Station, TX: Stata Press.
Beran, R. J., and N. I. Fisher. 1998. A conversation with Geoff Watson. Statistical Science 13: 75–93.
Breusch, T. S. 1978. Testing for autocorrelation in dynamic linear models. Australian Economic Papers 17: 334–355.
Davidson, R., and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press.
Durbin, J. 1970. Testing for serial correlation in least-squares regressions when some of the regressors are lagged dependent variables. Econometrica 38: 410–421.
Durbin, J., and S. J. Koopman. 2012. Time Series Analysis by State Space Methods. 2nd ed. Oxford: Oxford University Press.
Durbin, J., and G. S. Watson. 1950. Testing for serial correlation in least squares regression. I. Biometrika 37: 409–428.
———. 1951. Testing for serial correlation in least squares regression. II. Biometrika 38: 159–177.
Engle, R. F. 1982. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica 50: 987–1007.
Fisher, N. I., and P. Hall. 1998. Geoffrey Stuart Watson: Tributes and obituary (3 December 1921–3 January 1998). Australian and New Zealand Journal of Statistics 40: 257–267.
Godfrey, L. G. 1978. Testing against general autoregressive and moving average error models when the regressors include lagged dependent variables. Econometrica 46: 1293–1301.
———. 1988. Misspecification Tests in Econometrics: The Lagrange Multiplier Principle and Other Approaches. Econometric Society Monographs, No. 16. Cambridge: Cambridge University Press.
Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall.
Klein, L. R. 1950. Economic Fluctuations in the United States 1921–1941. New York: Wiley.
Koopman, S. J. 2012. James Durbin, FBA, 1923–2012. Journal of the Royal Statistical Society, Series A 175: 1060–1064.
Phillips, P. C. B. 1988. The ET Interview: Professor James Durbin. Econometric Theory 4: 125–157.
Savin, N. E., and K. J. White. 1977. The Durbin–Watson test for serial correlation with extreme sample sizes or many regressors. Econometrica 45: 1989–1996.
Wooldridge, J. M. 2013. Introductory Econometrics: A Modern Approach. 5th ed. Mason, OH: South-Western.

James Durbin (1923–2012) was a British statistician who was born in Wigan, near Manchester.
He studied mathematics at Cambridge and after military service and various research posts joined the London School of Economics in 1950. Later in life, he was also affiliated with University College London. His many contributions to statistics centered on serial correlation, time series (including major contributions to structural or unobserved components models), sample survey methodology, goodness-of-fit tests, and sample distribution functions, with emphasis on applications in the social sciences. He served terms as president of the Royal Statistical Society and the International Statistical Institute. Geoffrey Stuart Watson (1921–1998) was born in Victoria, Australia, and earned degrees at Melbourne University and North Carolina State University. After a visit to the University of Cambridge, he returned to Australia, working at Melbourne and then the Australian National University. Following periods at Toronto and Johns Hopkins, he settled at Princeton. Throughout his wide-ranging career, he made many notable accomplishments and important contributions, including the Durbin–Watson test for serial correlation, the Nadaraya–Watson estimator in nonparametric regression, and methods for analyzing directional data. Leslie G. Godfrey (1946– ) was born in London and earned degrees at the Universities of Exeter and London. He is now a professor of econometrics at the University of York. His interests center on implementation and interpretation of tests of econometric models, including nonnested models. Trevor Stanley Breusch (1949– ) was born in Queensland and earned degrees at the University of Queensland and Australian National University (ANU). After a post at the University of Southampton, he returned to work at ANU. His background is in econometric methods and his recent interests include political values and social attitudes, earnings and income, and measurement of underground economic activity. Also see [R] regress — Linear regression [TS] tsset — Declare data to be time-series data [R] regress postestimation — Postestimation tools for regress [R] regress postestimation diagnostic plots — Postestimation plots for regress Title #review — Review previous commands Syntax Description Remarks and examples Syntax #review #1 #2 Description The #review command displays the last few lines typed at the terminal. Remarks and examples #review (pronounced pound-review) is a Stata preprocessor command. #commands do not generate a return code or generate ordinary Stata errors. The only error message associated with #commands is “unrecognized #command”. The #review command displays the last few lines typed at the terminal. If no arguments follow #review, the last five lines typed at the terminal are displayed. The first argument specifies the number of lines to be reviewed, so #review 10 displays the last 10 lines typed. The second argument specifies the number of lines to be displayed, so #review 10 5 displays five lines, starting at the 10th previous line. Stata reserves a buffer for #review lines and stores as many previous lines in the buffer as will fit, rolling out the oldest line to make room for the newest. Requests to #review lines no longer stored will be ignored. Only lines typed at the terminal are placed in the #review buffer. See [U] 10.5 Editing previous lines in Stata. Example 1 Typing #review by itself will show the last five lines you typed at the terminal: . 5 4 3 2 1 . 
#review use mydata * comments go into the #review buffer, too describe tabulate marriage educ [freq=number] tabulate marriage educ [freq=number], chi2 Typing #review 15 2 shows the 15th and 14th previous lines: . #review 15 2 15 replace x=. if x<200 14 summarize x . 1934 Title roc — Receiver operating characteristic (ROC) analysis Description Reference Description ROC analysis quantifies the accuracy of diagnostic tests or other evaluation modalities used to discriminate between two states or conditions, which are here referred to as normal and abnormal or control and case. The discriminatory accuracy of a diagnostic test is measured by its ability to correctly classify known normal and abnormal subjects. For this reason, we often refer to the diagnostic test as a classifier. The analysis uses the ROC curve, a graph of the sensitivity versus 1 − specificity of the diagnostic test. The sensitivity is the fraction of positive cases that are correctly classified by the diagnostic test, whereas the specificity is the fraction of negative cases that are correctly classified. Thus the sensitivity is the true-positive rate, and the specificity is the true-negative rate. There are six ROC commands: Command Entry roccomp rocgold rocfit rocreg rocregplot roctab [R] [R] [R] [R] [R] [R] roccomp roccomp rocfit rocreg rocregplot roctab Description Tests of equality of ROC areas Tests of equality of ROC areas against a standard ROC curve Parametric ROC models Nonparametric and parametric ROC regression models Plot marginal and covariate-specific ROC curves Nonparametric ROC analysis Postestimation commands are available after rocfit and rocreg; see [R] rocfit postestimation and [R] rocreg postestimation. Both nonparametric and parametric (semiparametric) methods have been suggested for generating the ROC curve. The roctab command performs nonparametric ROC analysis for a single classifier. roccomp extends the nonparametric ROC analysis function of roctab to situations where we have multiple diagnostic tests of interest to be compared and tested. The rocgold command also provides ROC analysis for multiple classifiers. rocgold compares each classifier’s ROC curve to a “gold standard” ROC curve and makes adjustments for multiple comparisons in the analysis. Both rocgold and roccomp also allow parametric estimation of the ROC curve through a binormal fit. In a binormal fit, both the control and the case populations are normal. The rocfit command also estimates the ROC curve of a classifier through a binormal fit. Unlike roctab, roccomp, and rocgold, rocfit is an estimation command. In postestimation, graphs of the ROC curve and confidence bands can be produced. Additional tests on the parameters can also be conducted. ROC analysis can be interpreted as a two-stage process. First, the control distribution of the classifier is estimated, assuming a normal model or using a distribution-free estimation technique. The classifier is standardized using the control distribution to 1 − percentile value, the false-positive rate. Second, the ROC curve is estimated as the case distribution of the standardized classifier values. Covariates may affect both stages of ROC analysis. The first stage may be affected, yielding a covariate-adjusted ROC curve. The second stage may also be affected, producing multiple covariatespecific ROC curves. 1935 1936 roc — Receiver operating characteristic (ROC) analysis The rocreg command performs ROC analysis under both types of covariate effects. 
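The two stages described above map directly onto rocreg options: ctrlcov() lists covariates that adjust the control distribution (the first stage), and roccov() lists covariates that affect the ROC curve itself (the second stage). The line below is only a schematic sketch with hypothetical variable names (d for the 0/1 reference variable, y for the classifier, x1 adjusting the control distribution, and x2 entering the ROC model); the full syntax, defaults, and worked examples are given in [R] rocreg.

. rocreg d y, probit ml ctrlcov(x1) roccov(x2)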
Both parametric (semiparametric) and nonparametric methods may be used by rocreg. Like rocfit, rocreg is an estimation command and provides many postestimation capabilities. The global performance of a diagnostic test is commonly summarized by the area under the ROC curve (AUC). This area can be interpreted as the probability that the result of a diagnostic test of a randomly selected abnormal subject will be greater than the result of the same diagnostic test from a randomly selected normal subject. The greater the AUC, the better the global performance of the diagnostic test. Each of the ROC commands provides computation of the AUC. Citing a lack of clinical relevance for the AUC, other ROC summary measures have been suggested. These include the partial area under the ROC curve for a given false-positive rate t [pAUC(t)]. This is the area under the ROC curve from the false-positive rate of 0 to t. The ROC value at a particular false-positive rate and the false-positive rate for a particular ROC value are also useful summary measures for the ROC curve. These three measures are directly estimated by rocreg during the model fit or postestimation stages. Point estimates of ROC value are computed by the other ROC commands, but no standard errors are reported. See Pepe (2003) for a discussion of ROC analysis. Pepe has posted Stata datasets and programs used to reproduce results presented in the book (http://www.stata.com/bookstore/pepe.html). Reference Pepe, M. S. 2003. The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press. Title roccomp — Tests of equality of ROC areas Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax Test equality of ROC areas roccomp refvar classvar classvars if in weight , roccomp options Test equality of ROC area against a standard ROC curve rocgold refvar goldvar classvar classvars if in weight , rocgold options roccomp options Description Main by(varname) test(matname) graph norefline separate summary binormal line#opts(cline options) level(#) split into groups by variable use contrast matrix for comparing ROC areas graph the ROC curve suppress plotting the 45-degree reference line place each ROC curve on its own graph report the area under the ROC curve estimate areas by using binormal distribution assumption affect rendition of the #th binormal fit line set confidence level; default is level(95) Plot plot#opts(plot options) affect rendition of the #th ROC curve Reference line rlopts(cline options) affect rendition of the reference line Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options 1937 1938 roccomp — Tests of equality of ROC areas rocgold options Description Main sidak test(matname) graph norefline separate summary binormal line#opts(cline options) level(#) adjust the significance probability by using Šidák’s method use contrast matrix for comparing ROC areas graph the ROC curve suppress plotting the 45-degree reference line place each ROC curve on its own graph report the area under the ROC curve estimate areas by using binormal distribution assumption affect rendition of the #th binormal fit line set confidence level; default is level(95) Plot plot#opts(plot options) affect rendition of the #th ROC curve; plot 1 is the “gold standard” Reference line rlopts(cline options) affect rendition of the reference line Y axis, X axis, Titles, Legend, Overall twoway options any 
options other than by() documented in [G-3] twoway options plot options Description marker options marker label options cline options change look of markers (color, size, etc.) add marker labels; change look or position change the look of the line fweights are allowed; see [U] 11.1.6 weight. Menu roccomp Statistics > Epidemiology and related > ROC analysis > Test equality of two or more ROC areas Epidemiology and related > ROC analysis > Test equality of ROC area against gold standard rocgold Statistics > Description The above commands are used to perform receiver operating characteristic (ROC) analyses with rating and discrete classification data. The two variables refvar and classvar must be numeric. The reference variable indicates the true state of the observation, such as diseased and nondiseased or normal and abnormal, and must be coded as 0 and 1. The rating or outcome of the diagnostic test or test modality is recorded in classvar, which must be at least ordinal, with higher values indicating higher risk. roccomp — Tests of equality of ROC areas 1939 roccomp tests the equality of two or more ROC areas obtained from applying two or more test modalities to the same sample or to independent samples. roccomp expects the data to be in wide form when comparing areas estimated from the same sample and in long form for areas estimated from independent samples. rocgold independently tests the equality of the ROC area of each of several test modalities, specified by classvar, against a “gold standard” ROC curve, goldvar. For each comparison, rocgold reports the raw and the Bonferroni-adjusted significance probability. Optionally, Šidák’s adjustment for multiple comparisons can be obtained. See [R] rocfit and [R] rocreg for commands that fit maximum-likelihood ROC models. Options Main by(varname) (roccomp only) is required when comparing independent ROC areas. The by() variable identifies the groups to be compared. sidak (rocgold only) requests that the significance probability be adjusted for the effect of multiple comparisons by using Šidák’s method. Bonferroni’s adjustment is reported by default. test(matname) specifies the contrast matrix to be used when comparing ROC areas. By default, the null hypothesis that all areas are equal is tested. graph produces graphical output of the ROC curve. norefline suppresses plotting the 45-degree reference line from the graphical output of the ROC curve. separate is meaningful only with roccomp and specifies that each ROC curve be placed on its own graph rather than one curve on top of the other. summary reports the area under the ROC curve, its standard error, and its confidence interval. This option is needed only when also specifying graph. binormal specifies that the areas under the ROC curves to be compared should be estimated using the binormal distribution assumption. By default, areas to be compared are computed using the trapezoidal rule. line#opts(cline options) affects the rendition of the line representing the #th ROC curve drawn using the binormal distribution assumption; see [G-3] cline options. These lines are drawn only if the binormal option is specified. level(#) specifies the confidence level, as a percentage, for the confidence intervals. The default is level(95) or as set by set level; see [R] level. Plot plot#opts(plot options) affects the rendition of the #th ROC curve—the curve’s plotted points connected by lines. 
The plot options can affect the size and color of markers, whether and how the markers are labeled, and whether and how the points are connected; see [G-3] marker options, [G-3] marker label options, and [G-3] cline options. For rocgold, plot1opts() are applied to the ROC for the gold standard. Reference line rlopts(cline options) affects the rendition of the reference line; see [G-3] cline options. 1940 roccomp — Tests of equality of ROC areas Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options. These include options for titling the graph (see [G-3] title options), options for saving the graph to disk (see [G-3] saving option), and the by() option (see [G-3] by option). Remarks and examples Remarks are presented under the following headings: Introduction Comparing areas under the ROC curve Correlated data Independent data Comparing areas with a gold standard Introduction roccomp provides comparison of the ROC curves of multiple classifiers. rocgold compares the ROC curves of multiple classifiers with a single “gold standard” classifier. Adjustment of inference for multiple comparisons is also provided by rocgold. See Pepe (2003) for a discussion of ROC analysis. Pepe has posted Stata datasets and programs used to reproduce results presented in the book (http://www.stata.com/bookstore/pepe.html). Comparing areas under the ROC curve The area under multiple ROC curves can be compared by using roccomp. The command syntax is slightly different if the ROC curves are correlated (that is, different diagnostic tests are applied to the same sample) or independent (that is, diagnostic tests are applied to different samples). Correlated data Example 1 Hanley and McNeil (1983) presented data from an evaluation of two computer algorithms designed to reconstruct CT images from phantoms. We will call these two algorithms’ modalities 1 and 2. A sample of 112 phantoms was selected; 58 phantoms were considered normal, and the remaining 54 were abnormal. Each of the two modalities was applied to each phantom, and the resulting images were rated by a reviewer using a six-point scale: 1 = definitely normal, 2 = probably normal, 3 = possibly normal, 4 = possibly abnormal, 5 = probably abnormal, and 6 = definitely abnormal. Because each modality was applied to the same sample of phantoms, the two sets of outcomes are correlated. roccomp — Tests of equality of ROC areas 1941 We list the first 7 observations: . use http://www.stata-press.com/data/r13/ct . list in 1/7, sep(0) 1. 2. 3. 4. 5. 6. 7. mod1 mod2 status 2 5 2 2 5 2 3 1 5 1 3 6 2 2 0 1 0 0 1 0 0 The data are in wide form, which is required when dealing with correlated data. Each observation corresponds to one phantom. The variable mod1 identifies the rating assigned for the first modality, and mod2 identifies the rating assigned for the second modality. The true status of the phantoms is given by status=0 if they are normal and status=1 if they are abnormal. The observations with at least one missing rating were dropped from the analysis. We plot the two ROC curves and compare their areas. . roccomp status mod1 mod2, graph summary mod1 mod2 Obs ROC Area 112 112 0.8828 0.9302 0.0317 0.0256 Prob>chi2 = Asymptotic Normal [95% Conf. Interval] 0.82067 0.88005 0.94498 0.98042 0.1282 0.00 0.25 Sensitivity 0.50 0.75 1.00 Ho: area(mod1) = area(mod2) chi2(1) = 2.31 Std. Err. 
0.00 0.25 0.50 1−Specificity mod1 ROC area: 0.8828 Reference 0.75 1.00 mod2 ROC area: 0.9302 By default, roccomp, with the graph option specified, plots the ROC curves on the same graph. Optionally the curves can be plotted side by side, each on its own graph, by also specifying separate. For each curve, roccomp reports summary statistics and provides a test for the equality of the area under the curves, using an algorithm suggested by DeLong, DeLong, and Clarke-Pearson (1988). 1942 roccomp — Tests of equality of ROC areas Although the area under the ROC curve for modality 2 is larger than that of modality 1, the chi-squared test yielded a significance probability of 0.1282, suggesting that there is no significant difference between these two areas. The roccomp command can also be used to compare more than two ROC areas. To illustrate this, we modified the previous dataset by including a fictitious third modality. . use http://www.stata-press.com/data/r13/ct2 . roccomp status mod1 mod2 mod3, graph summary ROC Obs Area Std. Err. mod1 mod2 mod3 112 112 112 0.8828 0.9302 0.9240 0.0317 0.0256 0.0241 0.82067 0.88005 0.87670 0.94498 0.98042 0.97132 0.0381 0.00 0.25 Sensitivity 0.50 0.75 1.00 Ho: area(mod1) = area(mod2) = area(mod3) chi2(2) = 6.54 Prob>chi2 = Asymptotic Normal [95% Conf. Interval] 0.00 0.25 0.50 1−Specificity mod1 ROC area: 0.8828 mod3 ROC area: 0.924 0.75 1.00 mod2 ROC area: 0.9302 Reference By default, roccomp tests whether the areas under the ROC curves are all equal. Other comparisons can be tested by creating a contrast matrix and specifying test(matname), where matname is the name of the contrast matrix. For example, assume that we are interested in testing whether the area under the ROC for mod1 is equal to that of mod3. To do this, we can first create an appropriate contrast matrix and then specify its name with the test() option. Of course, this is a trivial example because we could have just specified . roccomp status mod1 mod3 without including mod2 to obtain the same test results. However, for illustration, we will continue with this example. The contrast matrix must have its number of columns equal to the number of classvars (that is, the total number of ROC curves) and a number of rows less than or equal to the number of classvars, and the elements of each row must add to zero. roccomp — Tests of equality of ROC areas . matrix C=(1,0,-1) . roccomp status mod1 mod2 mod3, test(C) ROC Obs Area Std. Err. mod1 mod2 mod3 112 112 112 0.8828 0.9302 0.9240 0.0317 0.0256 0.0241 1943 Asymptotic Normal [95% Conf. Interval] 0.82067 0.88005 0.87670 0.94498 0.98042 0.97132 Ho: Comparison as defined by contrast matrix: C chi2(1) = 5.25 Prob>chi2 = 0.0220 Although all three areas are reported, the comparison is made using the specified contrast matrix. Perhaps more interesting would be a comparison of the area from mod1 and the average area of mod2 and mod3. . matrix C=(1,-.5,-.5) . roccomp status mod1 mod2 mod3, test(C) ROC Obs Area Std. Err. mod1 mod2 mod3 112 112 112 0.8828 0.9302 0.9240 0.0317 0.0256 0.0241 Asymptotic Normal [95% Conf. Interval] 0.82067 0.88005 0.87670 0.94498 0.98042 0.97132 Ho: Comparison as defined by contrast matrix: C chi2(1) = 3.43 Prob>chi2 = 0.0642 Other contrasts could be made. For example, we could test if mod3 is different from at least one of the other two by first creating the following contrast matrix: . matrix C=(-1,0,1 \ 0,-1,1) . 
mat list C C[2,3] c1 c2 c3 r1 -1 0 1 r2 0 -1 1 Independent data Example 2 In example 1, we noted that because each test modality was applied to the same sample of phantoms, the classification outcomes were correlated. Now assume that we have collected the same data presented by Hanley and McNeil (1983), except that we applied the first test modality to one sample of phantoms and the second test modality to a different sample of phantoms. The resulting measurements are now considered independent. 1944 roccomp — Tests of equality of ROC areas Here are a few of the observations. . use http://www.stata-press.com/data/r13/ct3 . list in 1/7, sep(0) pop status rating mod 12 31 1 3 28 19 3 0 0 1 1 0 0 1 1 1 1 1 2 2 2 1 2 1 2 1 2 1 1. 2. 3. 4. 5. 6. 7. The data are in long form, which is required when dealing with independent data. The data consist of 24 observations: 6 observations corresponding to abnormal phantoms and 6 to normal phantoms evaluated using the first modality, and similarly 6 observations corresponding to abnormal phantoms and 6 to normal phantoms evaluated using the second modality. The number of phantoms corresponding to each observation is given by the pop variable. Once again we have frequency-weighted data. The variable mod identifies the modality, and rating is the assigned classification. We can better view our data by using the table command. . table status rating [fw=pop], by(mod) row col mod and status 1 2 3 rating 4 5 6 Total 0 1 12 1 28 3 8 6 6 13 4 22 9 58 54 Total 13 31 14 19 26 9 112 0 1 31 3 19 2 5 5 3 19 15 10 58 54 Total 34 21 10 22 15 10 112 1 2 The status variable indicates the true status of the phantoms: status = 0 if they are normal and status = 1 if they are abnormal. We now compare the areas under the two ROC curves. roccomp — Tests of equality of ROC areas 1945 . roccomp status rating [fw=pop], by(mod) graph summary Asymptotic Normal ROC mod Obs Area Std. Err. [95% Conf. Interval] 1 2 112 112 0.0317 0.0256 Prob>chi2 = 0.82067 0.88005 0.94498 0.98042 0.2447 0.00 0.25 Sensitivity 0.50 0.75 1.00 Ho: area(1) = area(2) chi2(1) = 1.35 0.8828 0.9302 0.00 0.25 0.50 1−Specificity 1 ROC area: 0.8828 Reference 0.75 1.00 2 ROC area: 0.9302 Comparing areas with a gold standard The area under multiple ROC curves can be compared with a gold standard using rocgold. The command syntax is similar to that of roccomp. The tests are corrected for the effect of multiple comparisons. Example 3 We will use the same data (presented by Hanley and McNeil [1983]) as in the roccomp examples. Let’s assume that the first modality is considered to be the standard against which both the second and third modalities are compared. We want to plot and compare both the areas of the ROC curves of mod2 and mod3 with mod1. Because we consider mod1 to be the gold standard, it is listed first after the reference variable in the rocgold command line. 1946 roccomp — Tests of equality of ROC areas . use http://www.stata-press.com/data/r13/ct2 . rocgold status mod1 mod2 mod3, graph summary ROC Area df Pr>chi2 Bonferroni Pr>chi2 0.8828 0.9302 0.9240 0.0317 0.0256 0.0241 2.3146 5.2480 1 1 0.1282 0.0220 0.2563 0.0439 0.00 0.25 Sensitivity 0.50 0.75 1.00 mod1 (standard) mod2 mod3 chi2 Std. Err. 0.00 0.25 0.50 1−Specificity mod1 ROC area: 0.8828 mod3 ROC area: 0.924 0.75 1.00 mod2 ROC area: 0.9302 Reference Equivalently, we could have done this in two steps by using the roccomp command. . roccomp status mod1 mod2, graph summary . 
roccomp status mod1 mod3, graph summary Stored results roccomp stores the following in r(): Scalars r(N g) number of groups r(p) significance probability Matrices r(V) variance–covariance matrix r(df) r(chi2) χ2 degrees of freedom χ2 r(p) r(p adj) significance-probability vector adjusted significance-probability vector rocgold stores the following in r(): Scalars r(N g) Matrices r(V) r(chi2) r(df) number of groups variance–covariance matrix χ2 vector χ2 degrees-of-freedom vector roccomp — Tests of equality of ROC areas 1947 Methods and formulas Assume that we applied a diagnostic test to each of Nn normal and Na abnormal subjects. Further assume that the higher the outcome value of the diagnostic test, the higher the risk of the subject being abnormal. Let θb be the estimated area under the curve, and let Xi , i = 1, 2, . . . , Na and Yj , j = 1, 2, . . . , Nn be the values of the diagnostic test for the abnormal and normal subjects, respectively. Areas under ROC curves are compared using an algorithm suggested by DeLong, DeLong, and Clarke-Pearson (1988). Let b θ = (θb1 , θb2 , . . . , θbk ) be a vector representing the areas under k ROC curves. See Methods and formulas in [R] roctab for the definition of these area estimates. For the rth area, define r V10 (Xi ) = Nn 1 X ψ(Xir , Yjr ) Nn j=1 and for each normal subject, j , define r V01 (Yj ) = Na 1 X ψ(Xir , Yjr ) Na i=1 where ( r r ψ(X , Y ) = 1 1 2 0 Y r < Xr Y r = Xr Y r > Xr Define the k × k matrix S10 such that the (r, s)th element is N r,s S10 = a X 1 s {V r (Xi ) − θbr }{V10 (Xi ) − θbs } Na − 1 i=1 10 and S01 such that the (r, s)th element is N r,s S01 = n X 1 s {V r (Yi ) − θbr }{V01 (Yi ) − θbs } Nn − 1 j=1 01 Then the covariance matrix is S= 1 1 S10 + S01 Na Nn Let L be a contrast matrix defining the comparison, so that (θb − θ)0 L0 LSL0 −1 L(θb − θ) has a chi-squared distribution with degrees of freedom equal to the rank of LSL0 . 1948 roccomp — Tests of equality of ROC areas References Cleves, M. A. 1999. sg120: Receiver operating characteristic (ROC) analysis. Stata Technical Bulletin 52: 19–33. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 212–229. College Station, TX: Stata Press. . 2000. sg120.2: Correction to roccomp command. Stata Technical Bulletin 54: 26. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 231. College Station, TX: Stata Press. . 2002a. Comparative assessment of three common algorithms for estimating the variance of the area under the nonparametric receiver operating characteristic curve. Stata Journal 2: 280–289. . 2002b. From the help desk: Comparing areas under receiver operating characteristic curves from two or more probit or logit models. Stata Journal 2: 301–313. DeLong, E. R., D. M. DeLong, and D. L. Clarke-Pearson. 1988. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 44: 837–845. Erdreich, L. S., and E. T. Lee. 1981. Use of relative operating characteristic analysis in epidemiology: A method for dealing with subjective judgment. American Journal of Epidemiology 114: 649–662. Hanley, J. A., and B. J. McNeil. 1983. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 148: 839–843. Harbord, R. M., and P. Whiting. 2009. metandi: Meta-analysis of diagnostic accuracy using hierarchical logistic regression. Stata Journal 9: 211–229. Juul, S., and M. Frydenberg. 2014. An Introduction to Stata for Health Researchers. 4th ed. 
College Station, TX: Stata Press. Ma, G., and W. J. Hall. 1993. Confidence bands for the receiver operating characteristic curves. Medical Decision Making 13: 191–197. Pepe, M. S. 2003. The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press. Reichenheim, M. E., and A. Ponce de Leon. 2002. Estimation of sensitivity and specificity arising from validity studies with incomplete design. Stata Journal 2: 267–279. Seed, P. T., and A. Tobı́as. 2001. sbe36.1: Summary statistics for diagnostic tests. Stata Technical Bulletin 59: 25–27. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 90–93. College Station, TX: Stata Press. Tobı́as, A. 2000. sbe36: Summary statistics report for diagnostic tests. Stata Technical Bulletin 56: 16–18. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 87–90. College Station, TX: Stata Press. Working, H., and H. Hotelling. 1929. Application of the theory of error to the interpretation of trends. Journal of the American Statistical Association 24 (Suppl.): 73–85. Also see [R] logistic postestimation — Postestimation tools for logistic [R] roc — Receiver operating characteristic (ROC) analysis [R] rocfit — Parametric ROC models [R] rocreg — Receiver operating characteristic (ROC) regression [R] roctab — Nonparametric ROC analysis Title rocfit — Parametric ROC models Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax rocfit refvar classvar rocfit options if in weight , rocfit options Description Model continuous(#) generate(newvar) divide classvar into # groups of approximately equal length create newvar containing classification groups SE vce(vcetype) vcetype may be oim or opg Reporting set confidence level; default is level(95) level(#) Maximization maximize options control the maximization process; seldom used fp, is allowed; see [U] 11.1.10 Prefix commands. fweights are allowed; see [U] 11.1.6 weight. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Epidemiology and related > ROC analysis > Parametric ROC analysis without covariates Description rocfit fits maximum-likelihood ROC models assuming a binormal distribution of the latent variable. The two variables refvar and classvar must be numeric. The reference variable indicates the true state of the observation, such as diseased and nondiseased or normal and abnormal, and must be coded as 0 and 1. The rating or outcome of the diagnostic test or test modality is recorded in classvar, which must be at least ordinal, with higher values indicating higher risk. See [R] roc for other commands designed to perform receiver operating characteristic (ROC) analyses with rating and discrete classification data. 1949 1950 rocfit — Parametric ROC models Options Model continuous(#) specifies that the continuous classvar be divided into # groups of approximately equal length. This option is required when classvar takes on more than 20 distinct values. continuous(.) may be specified to indicate that classvar be used as it is, even though it could have more than 20 distinct values. generate(newvar) specifies the new variable that is to contain the values indicating the groups produced by continuous(#). generate() may be specified only with continuous(). SE vce(vcetype) specifies the type of standard error reported. vcetype may be either oim or opg; see [R] vce option. Reporting level(#); see [R] estimation options. 
Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). Remarks and examples Dorfman and Alf (1969) developed a generalized approach for obtaining maximum likelihood estimates of the parameters for a smooth fitting ROC curve. The most commonly used method for ordinal data, and the one implemented here, is based upon the binormal model; see Pepe (2003), Pepe, Longton, and Janes (2009), and Janes, Longton, and Pepe (2009) for methods of ROC analysis for continuous data, including methods for adjusting for covariates. The model assumes the existence of an unobserved, continuous, latent variable that is normally distributed (perhaps after a monotonic transformation) in both the normal and abnormal populations with means µn and µa and variances σn2 and σa2 , respectively. The model further assumes that the K categories of the rating variable result from partitioning the unobserved latent variable by K − 1 fixed boundaries. The method fits a straight line to the empirical ROC points plotted using normal probability scales on both axes. Maximum likelihood estimates of the line’s slope and intercept and the K − 1 boundaries are obtained simultaneously. See Methods and formulas for details. The intercept from the fitted line is a measurement of (µa − µn )/σa , and the slope measures σn /σa . Thus the intercept is the standardized difference between the two latent population means, and the slope is the ratio of the two standard deviations. The null hypothesis that there is no difference between the two population means is evaluated by testing that the intercept = 0, and the null hypothesis that the variances in the two populations are equal is evaluated by testing that the slope = 1. rocfit — Parametric ROC models 1951 Example 1 We use Hanley and McNeil’s (1982) dataset, described in example 1 of [R] roctab, to fit a smooth ROC curve assuming a binormal model. . use http://www.stata-press.com/data/r13/hanley . rocfit disease rating Fitting binormal model: Iteration 0: log likelihood = -123.68069 Iteration 1: log likelihood = -123.64867 Iteration 2: log likelihood = -123.64855 Iteration 3: log likelihood = -123.64855 Binormal model of disease on rating Goodness-of-fit chi2(2) = 0.21 Prob > chi2 = 0.9006 Log likelihood = -123.64855 Coef. Std. Err. Number of obs z = 109 P>|z| [95% Conf. Interval] intercept slope (*) 1.656782 0.713002 0.310456 0.215882 5.34 -1.33 0.000 0.184 1.048300 0.289881 2.265265 1.136123 /cut1 /cut2 /cut3 /cut4 0.169768 0.463215 0.766860 1.797938 0.165307 0.167235 0.174808 0.299581 1.03 2.77 4.39 6.00 0.304 0.006 0.000 0.000 -0.154227 0.135441 0.424243 1.210770 0.493764 0.790990 1.109477 2.385106 Index Estimate Indices from binormal fit Std. Err. [95% Conf. Interval] ROC area delta(m) d(e) d(a) 0.911331 2.323671 1.934361 1.907771 0.029506 0.502370 0.257187 0.259822 0.853501 1.339044 1.430284 1.398530 0.969161 3.308298 2.438438 2.417012 (*) z test for slope==1 rocfit outputs the MLE for the intercept and slope of the fitted regression line along with, here, four boundaries (because there are five ratings) labeled /cut1 through /cut4. 
Also rocfit computes and reports four indices based on the fitted ROC curve: the area under the curve (labeled ROC area), δ(m) (labeled delta(m)), de (labeled d(e)), and da (labeled d(a)). More information about these indices can be found in Methods and formulas and in Erdreich and Lee (1981). 1952 rocfit — Parametric ROC models Stored results rocfit stores the following in e(): Scalars e(N) e(k) e(k eq) e(k eq model) e(k dv) e(df m) e(ll) e(chi2 gf) e(df gf) e(p gf) e(area) e(se area) e(deltam) e(se delm) e(de) e(se de) e(da) e(se da) e(rank) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(chi2type) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) Matrices e(b) e(ilog) e(gradient) e(V) Functions e(sample) number of observations number of parameters number of equations in e(b) number of equations in overall model test number of dependent variables model degrees of freedom log likelihood goodness-of-fit χ2 goodness-of-fit degrees of freedom χ2 goodness-of-fit significance probability area under the ROC curve standard error for the area under the ROC curve delta(m) standard area for delta(m) d(e) index standard error for d(e) index d(a) index standard error for d(a) index rank of e(V) number of iterations return code 1 if converged, 0 otherwise rocfit command as typed refvar and classvar weight type weight expression title in estimation output GOF; type of model χ2 test vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V coefficient vector iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators marks estimation sample rocfit — Parametric ROC models 1953 Methods and formulas Dorfman and Alf (1969) developed a general procedure for obtaining maximum likelihood estimates of the parameters of a smooth-fitting ROC curve. The most common method, and the one implemented in Stata, is based upon the binormal model. The model assumes that there is an unobserved continuous latent variable that is normally distributed in both the normal and abnormal populations. The idea is better explained with the following illustration: Normal Abnormal Z 1 Z 1 2 2 3 Z Z4 3 4 5 The latent variable is assumed to be normally distributed for both the normal and abnormal subjects, perhaps after a monotonic transformation, with means µn and µa and variances σn2 and σa2 , respectively. This latent variable is assumed to be partitioned into the k categories of the rating variable by k − 1 fixed boundaries. In the above figure, the k = 5 categories of the rating variable identified on the bottom result from the partition of the four boundaries Z1 through Z4 . Let Rj for j = 1, 2, . . . , k indicate the categories of the rating variable, let i = 1 if the subject belongs to the normal group, and let i = 2 if the subject belongs to the abnormal group. Then p(Rj |i = 1) = F (Zj ) − F (Zj−1 ) where Zk = (xk − µn )/σn , F is the cumulative normal distribution, F (Z0 ) = 0, and F (Zk ) = 1. Also, p(Rj |i = 2) = F (bZj − a) − F (bZj−1 − a) where b = σn /σa and a = (µa − µn )/σa . The parameters a, b and the k − 1 fixed boundaries Zj are simultaneously estimated by maximizing the log-likelihood function 2 X k X logL = rij log p(Rj |i) i=1 j=1 where rij is the number of Rj s in group i. 
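Before turning to the summary indices, it may help to state the curve that the fitted a and b imply. For the binormal model, the ROC curve evaluated at false-positive rate t satisfies the standard identity (recorded here only as a reminder; it is not part of rocfit's displayed output)

    \mathrm{ROC}(t) = \Phi\{a + b\,\Phi^{-1}(t)\}, \qquad 0 < t < 1

where \Phi is the standard normal cumulative distribution function. Integrating this expression over t from 0 to 1 gives the area formula reported next.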
1954 rocfit — Parametric ROC models The area under the fitted ROC curve is computed as a Φ √ 1 + b2 where Φ is the standard normal cumulative distribution function. Point estimates for the ROC curve indices are as follows: a δ(m) = b 2a de = b+1 √ a 2 da = √ 1 + b2 Variances for these indices are computed using the delta method. (µa − µn )/σn , de estimates 2(µa − µn )/(σa − σn ), and da estimates √ The δ(m) estimates 2(µa − µn )/(σa2 − σn2 )2 . Simultaneous confidence bands for the entire curve are obtained, as suggested by Ma and Hall (1993), by first obtaining Working–Hotelling (1929) confidence bands for the fitted straight line in normal probability coordinates and then transforming them back to ROC coordinates. References Bamber, D. 1975. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology 12: 387–415. Choi, B. C. K. 1998. Slopes of a receiver operating characteristic curve and likelihood ratios for a diagnostic test. American Journal of Epidemiology 148: 1127–1132. Cleves, M. A. 1999. sg120: Receiver operating characteristic (ROC) analysis. Stata Technical Bulletin 52: 19–33. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 212–229. College Station, TX: Stata Press. . 2000. sg120.1: Two new options added to rocfit command. Stata Technical Bulletin 53: 18–19. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 230–231. College Station, TX: Stata Press. Dorfman, D. D., and E. Alf, Jr. 1969. Maximum-likelihood estimation of parameters of signal-detection theory and determination of confidence intervals–rating-method data. Journal of Mathematical Psychology 6: 487–496. Erdreich, L. S., and E. T. Lee. 1981. Use of relative operating characteristic analysis in epidemiology: A method for dealing with subjective judgment. American Journal of Epidemiology 114: 649–662. Hanley, J. A., and B. J. McNeil. 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143: 29–36. Janes, H., G. M. Longton, and M. S. Pepe. 2009. Accommodating covariates in receiver operating characteristic analysis. Stata Journal 9: 17–39. Ma, G., and W. J. Hall. 1993. Confidence bands for the receiver operating characteristic curves. Medical Decision Making 13: 191–197. Pepe, M. S. 2003. The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press. Pepe, M. S., G. M. Longton, and H. Janes. 2009. Estimation and comparison of receiver operating characteristic curves. Stata Journal 9: 1–16. Working, H., and H. Hotelling. 1929. Application of the theory of error to the interpretation of trends. Journal of the American Statistical Association 24 (Suppl.): 73–85. 
rocfit — Parametric ROC models Also see [R] rocfit postestimation — Postestimation tools for rocfit [R] roc — Receiver operating characteristic (ROC) analysis [R] rocreg — Receiver operating characteristic (ROC) regression [U] 20 Estimation and postestimation commands 1955 Title rocfit postestimation — Postestimation tools for rocfit Description Remarks and examples Syntax for rocplot Also see Menu Options for rocplot Description The following command is of special interest after rocfit: Command Description rocplot plot the fitted ROC curve and simultaneous confidence bands The following standard postestimation commands are also available: Command estat ic estat summarize estat vce estimates ∗ lincom ∗ test ∗ Description Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients Wald tests of simple and composite linear hypotheses See Using lincom and test below. Special-interest postestimation command rocplot plots the fitted ROC curve and simultaneous confidence bands. 1956 rocfit postestimation — Postestimation tools for rocfit 1957 Syntax for rocplot rocplot , rocplot options rocplot options Description Main confband norefline level(#) display confidence bands suppress plotting the reference line set confidence level; default is level(95) Plot plotopts(plot options) affect rendition of the ROC points Fit line lineopts(cline options) affect rendition of the fitted ROC line CI plot ciopts(area options) affect rendition of the confidence bands Reference line rlopts(cline options) affect rendition of the reference line Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options plot options Description marker options marker label options cline options change look of markers (color, size, etc.) add marker labels; change look or position change the look of the line Menu Statistics > Epidemiology and related > ROC analysis > ROC curves after rocfit Options for rocplot Main confband specifies that simultaneous confidence bands be plotted around the ROC curve. norefline suppresses plotting the 45-degree reference line from the graphical output of the ROC curve. level(#) specifies the confidence level, as a percentage, for the confidence bands. The default is level(95) or as set by set level; see [R] level. 1958 rocfit postestimation — Postestimation tools for rocfit Plot plotopts(plot options) affects the rendition of the plotted ROC points, including the size and color of markers, whether and how the markers are labeled, and whether and how the points are connected. For the full list of available plot options, see [G-3] marker options, [G-3] marker label options, and [G-3] cline options. Fit line lineopts(cline options) affects the rendition of the fitted ROC line; see [G-3] cline options. CI plot ciopts(area options) affects the rendition of the confidence bands; see [G-3] area options. Reference line rlopts(cline options) affects the rendition of the reference line; see [G-3] cline options. Add plots addplot(plot) provides a way to add other plots to the generated graph. See [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). 
These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Remarks and examples Remarks are presented under the following headings: Using lincom and test Using rocplot Using lincom and test intercept, slope, and cut#, shown in example 1 of [R] rocfit, are equation names and not variable names, so they need to be referenced as described in Special syntaxes after multiple-equation estimation of [R] test. For example, instead of typing . test intercept intercept not found r(111); you should type . test [intercept]_cons ( 1) [intercept]_cons = 0 chi2( 1) = 28.48 Prob > chi2 = 0.0000 rocfit postestimation — Postestimation tools for rocfit 1959 Using rocplot Example 1 In example 1 of [R] rocfit, we fit a ROC curve by typing rocfit disease rating. In the output table for our model, we are testing whether the variances of the two latent populations are equal by testing that the slope = 1. We plot the fitted ROC curve. 0 .25 Sensitivity .5 .75 1 . rocplot, confband 0 .25 .5 1 − Specificity Area under curve = 0.9113 se(area) = 0.0295 Also see [R] rocfit — Parametric ROC models [U] 20 Estimation and postestimation commands .75 1 Title rocreg — Receiver operating characteristic (ROC) regression Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Options Acknowledgments Syntax Perform nonparametric analysis of ROC curve under covariates, using bootstrap rocreg refvar classvar classvars if in , np options Perform parametric analysis of ROC curve under covariates, using bootstrap rocreg refvar classvar classvars if in , probit probit options Perform parametric analysis of ROC curve under covariates, using maximum likelihood rocreg refvar classvar classvars if in weight , probit ml probit ml options np options Description Model estimate total area under the ROC curve; the default estimate ROC for given false-positive rates estimate false-positive rates for given ROC values estimate partial area under the ROC curve (pAUC) up to each false-positive rate cluster(varname) variable identifying resampling clusters ctrlcov(varlist) adjust control distribution for covariates in varlist ctrlmodel(strata | linear) stratify or regress on covariates; default is ctrlmodel(strata) pvc(empirical | normal) use empirical or normal distribution percentile value estimates; default is pvc(empirical) tiecorrected adjust for tied observations; not allowed with pvc(normal) auc roc(numlist) invroc(numlist) pauc(numlist) Bootstrap nobootstrap bseed(#) breps(#) bootcc nobstrata nodots do not perform bootstrap, just output point estimates random-number seed for bootstrap number of bootstrap replications; default is breps(1000) perform case–control (stratified on refvar) sampling rather than cohort sampling in bootstrap ignore covariate stratification in bootstrap sampling suppress bootstrap replication dots Reporting level(#) set confidence level; default is level(95) 1960 rocreg — Receiver operating characteristic (ROC) regression probit options 1961 Description Model ∗ fit the probit model covariates affecting ROC curve number of false-positive rate points to use in fitting ROC curve; default is fprpts(10) ctrlfprall fit ROC curve at each false-positive rate in control population variable identifying resampling clusters cluster(varname) ctrlcov(varlist) adjust control distribution for covariates in varlist ctrlmodel(strata | linear) stratify or regress on covariates; default is ctrlmodel(strata) pvc(empirical | 
normal) use empirical or normal distribution percentile value estimates; default is pvc(empirical) adjust for tied observations; not allowed with pvc(normal) tiecorrected probit roccov(varlist) fprpts(#) Bootstrap nobootstrap bseed(#) breps(#) bootcc nobstrata nodots bsave(filename, . . . ) bfile(filename) do not perform bootstrap, just output point estimates random-number seed for bootstrap number of bootstrap replications; default is breps(1000) perform case–control (stratified on refvar) sampling rather than cohort sampling in bootstrap ignore covariate stratification in bootstrap sampling suppress bootstrap replication dots save bootstrap replicates from parametric estimation use bootstrap replicates dataset for estimation replay Reporting level(#) ∗ probit is required. set confidence level; default is level(95) 1962 rocreg — Receiver operating characteristic (ROC) regression probit ml options Model ∗ ∗ probit ml roccov(varlist) cluster(varname) ctrlcov(varlist) Description fit the probit model fit the probit model by maximum likelihood estimation covariates affecting ROC curve variable identifying clusters adjust control distribution for covariates in varlist Reporting set confidence level; default is level(95) control column formats, line width, and display of omitted variables level(#) display options Maximization maximize options control the maximization process; seldom used ∗ probit and ml are required. fweights, iweights, and pweights are allowed with maximum likelihood estimation; see [U] 11.1.6 weight. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Epidemiology and related > ROC analysis > ROC regression models Description The rocreg command is used to perform receiver operating characteristic (ROC) analyses with rating and discrete classification data under the presence of covariates. The two variables refvar and classvar must be numeric. The reference variable indicates the true state of the observation—such as diseased and nondiseased or normal and abnormal—and must be coded as 0 and 1. The refvar coded as 0 can also be called the control population, while the refvar coded as 1 comprises the case population. The rating or outcome of the diagnostic test or test modality is recorded in classvar, which must be ordinal, with higher values indicating higher risk. rocreg can fit three models: a nonparametric model, a parametric probit model that uses the bootstrap for inference, and a parametric probit model fit using maximum likelihood. Options Options are presented under the following headings: Options for nonparametric ROC estimation, using bootstrap Options for parametric ROC estimation, using bootstrap Options for parametric ROC estimation, using maximum likelihood rocreg — Receiver operating characteristic (ROC) regression 1963 Options for nonparametric ROC estimation, using bootstrap Model auc estimates the total area under the ROC curve. This is the default summary statistic. roc(numlist) estimates the ROC corresponding to each of the false-positive rates in numlist. The values of numlist must be in the range (0,1). invroc(numlist) estimates the false-positive rates corresponding to each of the ROC values in numlist. The values of numlist must be in the range (0,1). pauc(numlist) estimates the partial area under the ROC curve up to each false-positive rate in numlist. The values of numlist must in the range (0,1]. cluster(varname) specifies the variable identifying resampling clusters. 
ctrlcov(varlist) specifies the covariates to be used to adjust the control population. ctrlmodel(strata | linear) specifies how to model the control population of classifiers on ctrlcov(). When ctrlmodel(linear) is specified, linear regression is used. The default is ctrlmodel(strata); that is, the control population of classifiers is stratified on the control variables. pvc(empirical | normal) determines how the percentile values of the control population will be calculated. When pvc(normal) is specified, the standard normal cumulative distribution function (CDF) is used for calculation. Specifying pvc(empirical) will use the empirical CDFs of the control population classifiers for calculation. The default is pvc(empirical). tiecorrected adjusts the percentile values for ties. For each value of the classifier, one half the probability that the classifier equals that value under the control population is added to the percentile value. tiecorrected is not allowed with pvc(normal). Bootstrap nobootstrap specifies that bootstrap standard errors not be calculated. bseed(#) specifies the random-number seed to be used in the bootstrap. breps(#) sets the number of bootstrap replications. The default is breps(1000). bootcc performs case–control (stratified on refvar) sampling rather than cohort bootstrap sampling. nobstrata ignores covariate stratification in bootstrap sampling. nodots suppresses bootstrap replicate dots. Reporting level(#); see [R] estimation options. Options for parametric ROC estimation, using bootstrap Model probit fits the probit model. This option is required and implies parametric estimation. roccov(varlist) specifies the covariates that will affect the ROC curve. fprpts(#) sets the number of false-positive rate points to use in modeling the ROC curve. These points form an equispaced grid on (0,1). The default is fprpts(10). 1964 rocreg — Receiver operating characteristic (ROC) regression ctrlfprall models the ROC curve at each false-positive rate in the control population. cluster(varname) specifies the variable identifying resampling clusters. ctrlcov(varlist) specifies the covariates to be used to adjust the control population. ctrlmodel(strata | linear) specifies how to model the control population of classifiers on ctrlcov(). When ctrlmodel(linear) is specified, linear regression is used. The default is ctrlmodel(strata); that is, the control population of classifiers is stratified on the control variables. pvc(empirical | normal) determines how the percentile values of the control population will be calculated. When pvc(normal) is specified, the standard normal CDF is used for calculation. Specifying pvc(empirical) will use the empirical CDFs of the control population classifiers for calculation. The default is pvc(empirical). tiecorrected adjusts the percentile values for ties. For each value of the classifier, one half the probability that the classifier equals that value under the control population is added to the percentile value. tiecorrected is not allowed with pvc(normal). Bootstrap nobootstrap specifies that bootstrap standard errors not be calculated. bseed(#) specifies the random-number seed to be used in the bootstrap. breps(#) sets the number of bootstrap replications. The default is breps(1000). bootcc performs case–control (stratified on refvar) sampling rather than cohort bootstrap sampling. nobstrata ignores covariate stratification in bootstrap sampling. nodots suppresses bootstrap replicate dots. bsave(filename, . . . 
) saves bootstrap replicates from parametric estimation in the given filename with specified options (that is, replace). bsave() is only allowed with parametric analysis using bootstrap. bfile(filename) specifies to use the bootstrap replicates dataset for estimation replay. bfile() is only allowed with parametric analysis using bootstrap. Reporting level(#); see [R] estimation options. Options for parametric ROC estimation, using maximum likelihood Model probit fits the probit model. This option is required and implies parametric estimation. ml fits the probit model by maximum likelihood estimation. This option is required and must be specified with probit. roccov(varlist) specifies the covariates that will affect the ROC curve. cluster(varname) specifies the variable used for clustering. ctrlcov(varlist) specifies the covariates to be used to adjust the control population. rocreg — Receiver operating characteristic (ROC) regression 1965 Reporting level(#); see [R] estimation options. display options: noomitted, cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. The technique(bhhh) option is not allowed. Remarks and examples Remarks are presented under the following headings: Introduction ROC statistics Covariate-adjusted ROC curves Parametric ROC curves: Estimating equations Parametric ROC curves: Maximum likelihood Introduction Receiver operating characteristic (ROC) analysis provides a quantitative measure of the accuracy of diagnostic tests to discriminate between two states or conditions. These conditions may be referred to as normal and abnormal, nondiseased and diseased, or control and case. We will use these terms interchangeably. The discriminatory accuracy of a diagnostic test is measured by its ability to correctly classify known control and case subjects. The analysis uses the ROC curve, a graph of the sensitivity versus 1 − specificity of the diagnostic test. The sensitivity is the fraction of positive cases that are correctly classified by the diagnostic test, whereas the specificity is the fraction of negative cases that are correctly classified. Thus the sensitivity is the true-positive rate, and the specificity is the true-negative rate. We also call 1 − specificity the false-positive rate. These rates are functions of the possible outcomes of the diagnostic test. At each outcome, a decision will be made by the user of the diagnostic test to classify the tested subject as either normal or abnormal. The true-positive and false-positive rates measure the probability of correct classification or incorrect classification of the subject as abnormal. Given the classification role of the diagnostic test, we will refer to it as the classifier. Using this basic definition of the ROC curve, Pepe (2000) and Pepe (2003) describe how ROC analysis can be performed as a two-stage process. In the first stage, the control distribution of the classifier is estimated. The specificity is then determined as the percentiles of the classifier values calculated based on the control population. The false-positive rates are calculated as 1 − specificity. 
In the second stage, the ROC curve is estimated as the cumulative distribution of the case population’s “false-positive” rates, also known as the survival function under the case population of the previously calculated percentiles. We use the terms ROC value and true-positive value interchangeably. This formulation of ROC curve analysis provides simple, nonparametric estimates of several ROC curve summary parameters: area under the ROC curve, partial area under the ROC curve, ROC value for a given false-positive rate, and false-positive rate (also known as invROC) for a given ROC value. In the next section, we will show how to use rocreg to compute these estimates with bootstrap inference. There we will also show how rocreg complements the other nonparametric Stata ROC commands roctab and roccomp. 1966 rocreg — Receiver operating characteristic (ROC) regression Other factors beyond condition status and the diagnostic test may affect both stages of ROC analysis. For example, a test center may affect the control distribution of the diagnostic test. Disease severity may affect the distribution of the standardized diagnostic test under the case population. Our analysis of the ROC curve in these situations will be more accurate if we take these covariates into account. In a nonparametric ROC analysis, covariates may only affect the first stage of estimation; that is, they may be used to adjust the control distribution of the classifier. In a parametric ROC analysis, it is assumed that ROC follows a normal distribution, and thus covariates may enter the model at both stages; they may be used to adjust the control distribution and to model ROC as a function of these covariates and the false-positive rate. In parametric models, both sets of covariates need not be distinct but, in fact, they are often the same. To model covariate effects on the first stage of ROC analysis, Janes and Pepe (2009) propose a covariate-adjusted ROC curve. We will demonstrate the covariate adjustment capabilities of rocreg in Covariate-adjusted ROC curves. To account for covariate effects at the second stage, we assume a parametric model. Particularly, the ROC curve is a generalized linear model of the covariates. We will thus have a separate ROC curve for each combination of the relevant covariates. In Parametric ROC curves: Estimating equations, we show how to fit the model with estimating equations and bootstrap inference using rocreg. This method, documented as the “pdf” approach in Alonzo and Pepe (2002), works well with weak assumptions about the control distribution. Also in Parametric ROC curves: Estimating equations, we show how to fit a constant-only parametric model (involving no covariates) of the ROC curve with weak assumptions about the control distribution. The constant-only model capabilities of rocreg in this context will be compared with those of rocfit. roccomp has the binormal option, which will allow it to compute area under the ROC curve according to a normal ROC curve, equivalent to that obtained by rocfit. We will compare this functionality with that of rocreg. In Parametric ROC curves: Maximum likelihood, we demonstrate maximum likelihood estimation of the ROC curve model with rocreg. There we assume a normal linear model for the classifier on the covariates and case–control status. This method is documented in Pepe (2003). We will also demonstrate how to use this method with no covariates, and we will compare rocreg under the constant-only model with rocfit and roccomp. 
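As a roadmap for the examples that follow, the estimation approaches just outlined correspond to the call patterns sketched below. This is only a schematic, not output from this entry: d, y, and x stand in for a status variable, a classifier, and a covariate.

* nonparametric ROC with bootstrap inference (the default)
rocreg d y

* nonparametric ROC, control distribution adjusted linearly for a covariate
rocreg d y, ctrlcov(x) ctrlmodel(linear)

* parametric probit ROC fit by estimating equations, with bootstrap inference
rocreg d y, probit roccov(x)

* parametric probit ROC fit by maximum likelihood
rocreg d y, probit ml roccov(x)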
The rocregplot command is used repeatedly in this entry. This command provides graphical output for rocreg and is documented in [R] rocregplot. ROC statistics roctab computes the ROC curve by calculating the false-positive rate and true-positive rate empirically at every value of the input classifier. It makes no distributional assumptions about the case or control distributions. We can get identical behavior from rocreg by using the default option settings. Example 1: Nonparametric ROC, AUC Hanley and McNeil (1982) presented data from a study in which a reviewer was asked to classify, using a five-point scale, a random sample of 109 tomographic images from patients with neurological problems. The rating scale was as follows: 1 is definitely normal, 2 is probably normal, 3 is questionable, 4 is probably abnormal, and 5 is definitely abnormal. The true disease status was normal for 58 of the patients and abnormal for the remaining 51 patients. rocreg — Receiver operating characteristic (ROC) regression 1967 Here we list 9 of the 109 observations: . use http://www.stata-press.com/data/r13/hanley . list disease rating in 1/9 disease rating 1. 2. 3. 4. 5. 1 0 1 0 0 5 1 5 4 1 6. 7. 8. 9. 0 1 0 0 3 5 5 1 For each observation, disease identifies the true disease status of the subject (0 is normal, 1 is abnormal), and rating contains the classification value assigned by the reviewer. We run roctab on these data, specifying the graph option so that the ROC curve is rendered. We then calculate the false-positive and true-positive rates of the ROC curve by using rocreg. We graph the rates with rocregplot. Because we focus on rocreg output later, for now we use the quietly prefix to omit the output of rocreg. Both graphs are combined using graph combine (see [G-2] graph combine) for comparison. To ease the comparison, we specify the aspectratio(1) option in roctab; this is the default aspect ratio in rocregplot. roctab disease rating, graph aspectratio(1) name(a) nodraw title("roctab") quietly rocreg disease rating rocregplot, name(b) nodraw legend(off) title("rocreg") graph combine a b roctab rocreg 0.00 0.25 0.50 1 − Specificity Area under ROC curve = 0.8932 0.75 1.00 0 0.00 .25 0.25 Sensitivity 0.50 True−positive rate (ROC) .5 .75 0.75 1 1.00 . . . . 0 .25 .5 False−positive rate .75 1 Both roctab and rocreg compute the same false-positive rate and ROC values. The stairstep line connection style of the graph on the right emphasizes the empirical nature of its estimates. The control distribution of the classifier is estimated using the empirical CDF estimate. Similarly, the ROC curve, the distribution of the resulting case observation false-positive rate values, is estimated using the empirical CDF. Note the footnote in the roctab plot. By default, roctab will estimate the area 1968 rocreg — Receiver operating characteristic (ROC) regression under the ROC curve (AUC) using a trapezoidal approximation to the estimated false-positive rate and true-positive rate points. The AUC can be interpreted as the probability that a randomly selected member of the case population will have a larger classifier value than a randomly selected member of the control population. It can also be viewed as the average ROC value, averaged uniformly over the (0,1) false-positive rate domain (Pepe 2003). The nonparametric estimator of the AUC (DeLong, DeLong, and Clarke-Pearson 1988; Hanley and Hajian-Tilaki 1997) used by rocreg is equivalent to the sample mean of the percentile values of the case observations. 
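Stated compactly (in our notation, and ignoring ties for the moment), the estimator can be written as
\[
\widehat{PV}_i = \widehat F_0(y_{Di}), \qquad
\widehat{\mathrm{ROC}}(t) = \frac{1}{n_D}\sum_{i=1}^{n_D} 1\{\,1-\widehat{PV}_i \le t\,\}, \qquad
\widehat{\mathrm{AUC}} = \frac{1}{n_D}\sum_{i=1}^{n_D}\widehat{PV}_i,
\]
where \(\widehat F_0\) is the empirical CDF of the classifier in the control sample, \(y_{Di}\) denotes the \(i\)th case observation, and \(n_D\) is the number of cases.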
Thus to calculate the nonparametric AUC estimate, we only need to calculate the percentile values of the case observations with respect to the control distribution. This estimate can differ from the trapezoidal approximation estimate. Under discrete classification data, like we have here, there may be ties between classifier values from case to control. The trapezoidal approximation uses linear interpolation between the classifier values to correct for ties. Correcting the nonparametric estimator involves adding a correction term to each observation’s percentile value, which measures the probability that the classifier is equal to (instead of less than) the observation’s classifier value. The tie-corrected nonparametric estimate (trapezoidal approximation) is used when we think the true ROC curve is smooth. This means that the classifier we measure is a discretized approximation of a true latent and a continuous classifier. We now recompute the ROC curve of rating for classifying disease and calculate the AUC. Specifying the tiecorrected option allows tie correction to be used in the rocreg calculation. Under nonparametric estimation, rocreg bootstraps to obtain standard errors and confidence intervals for requested statistics. We use the default 1,000 bootstrap replications to obtain confidence intervals for our parameters. This is a reasonable lower bound to the number of replications (Mooney and Duval 1993) required for estimating percentile confidence intervals. By specifying the summary option in roctab, we will obtain output showing the trapezoidal approximation of the AUC estimate, along with standard error and confidence interval estimates for the trapezoidal approximation suggested by DeLong, DeLong, and Clarke-Pearson (1988). rocreg — Receiver operating characteristic (ROC) regression . roctab disease rating, summary ROC Obs Area Std. Err. 1969 Asymptotic Normal [95% Conf. Interval] 109 0.8932 0.0307 0.83295 0.95339 . rocreg disease rating, tiecorrected bseed(29092) (running rocregstat on estimation sample) Bootstrap replications (1000) 1 2 3 4 5 .................................................. 50 .................................................. 100 (output omitted ) .................................................. 950 .................................................. 1000 Bootstrap results Number of obs Replications = = 109 1000 Nonparametric ROC estimation Control standardization: empirical, corrected for ties ROC method : empirical Area under the ROC curve Status : disease Classifier: rating AUC Observed Coef. Bias .8931711 .000108 Bootstrap Std. Err. .0292028 [95% Conf. Interval] .8359347 .8290958 .8280714 .9504075 (N) .9457951 (P) .9450642 (BC) The estimates of AUC match well. The standard error from roctab is close to the bootstrap standard error calculated by rocreg. The bootstrap standard error generalizes to the more complex models that we consider later, whereas the roctab standard-error calculation does not. The AUC can be used to compare different classifiers. It is the most popular summary statistic for comparisons (Pepe, Longton, and Janes 2009). roccomp will compute the trapezoidal approximation of the AUC and graph the ROC curves of multiple classifiers. Using the DeLong, DeLong, and ClarkePearson (1988) covariance estimates for the AUC estimate, roccomp performs a Wald test of the null hypothesis that all classifier AUC values are equal. rocreg has similar capabilities. 
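For reference before the next example, which again specifies tiecorrected, the tie correction amounts to replacing each case percentile value, in the same notation, by
\[
\widehat{PV}^{\,c}_i \;=\; \widehat P(Y_{\bar D} < y_{Di}) \;+\; \tfrac12\,\widehat P(Y_{\bar D} = y_{Di}),
\]
where \(Y_{\bar D}\) is a classifier value drawn from the control sample. Averaging these corrected values over the case observations reproduces the trapezoidal approximation of the AUC.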
Example 2: Nonparametric ROC, AUC, multiple classifiers Hanley and McNeil (1983) presented data from an evaluation of two computer algorithms designed to reconstruct CT images from phantoms. We will call these two algorithms modalities 1 and 2. A sample of 112 phantoms was selected; 58 phantoms were considered normal, and the remaining 54 were abnormal. Each of the two modalities was applied to each phantom, and the resulting images were rated by a reviewer using a six-point scale: 1 is definitely normal, 2 is probably normal, 3 is possibly normal, 4 is possibly abnormal, 5 is probably abnormal, and 6 is definitely abnormal. Because each modality was applied to the same sample of phantoms, the two sets of outcomes are correlated. 1970 rocreg — Receiver operating characteristic (ROC) regression We list the first seven observations: . use http://www.stata-press.com/data/r13/ct, clear . list in 1/7, sep(0) 1. 2. 3. 4. 5. 6. 7. mod1 mod2 status 2 5 2 2 5 2 3 1 5 1 3 6 2 2 0 1 0 0 1 0 0 Each observation corresponds to one phantom. The mod1 variable identifies the rating assigned for the first modality, and the mod2 variable identifies the rating assigned for the second modality. The true status of the phantoms is given by status==0 if they are normal and status==1 if they are abnormal. The observations with at least one missing rating were dropped from the analysis. A fictitious dataset was created from this true dataset, adding a third test modality. We will use roccomp to compute the AUC statistic for each modality in these data and compare the AUC of the three modalities. We obtain the same behavior from rocreg. As before, the tiecorrected option is specified so that the AUC is calculated with the trapezoidal approximation. . use http://www.stata-press.com/data/r13/ct2 . roccomp status mod1 mod2 mod3, summary ROC Obs Area Std. Err. mod1 mod2 mod3 112 112 112 0.8828 0.9302 0.9240 0.0317 0.0256 0.0241 Ho: area(mod1) = area(mod2) = area(mod3) chi2(2) = 6.54 Prob>chi2 = 0.0381 Asymptotic Normal [95% Conf. Interval] 0.82067 0.88005 0.87670 0.94498 0.98042 0.97132 rocreg — Receiver operating characteristic (ROC) regression 1971 . rocreg status mod1 mod2 mod3, tiecorrected bseed(38038) nodots Bootstrap results Number of obs Replications = = 112 1000 Nonparametric ROC estimation Control standardization: empirical, corrected for ties ROC method : empirical Area under the ROC curve Status : status Classifier: mod1 AUC Observed Coef. Bias .8828225 -.0006367 Bootstrap Std. Err. .0322291 [95% Conf. Interval] .8196546 .8147518 .8124397 .9459903 (N) .9421572 (P) .9394085 (BC) Status : status Classifier: mod2 AUC Observed Coef. Bias .9302363 -.0015402 Bootstrap Std. Err. .0259593 [95% Conf. Interval] .8793569 .8737522 .8739467 .9811156 (N) .9737432 (P) .9737768 (BC) Status : status Classifier: mod3 AUC Observed Coef. Bias .9240102 -.0003528 Bootstrap Std. Err. .0247037 [95% Conf. Interval] .8755919 .8720036 .8693548 .9724286 (N) .9674485 (P) .965 (BC) Ho: All classifiers have equal AUC values. Ha: At least one classifier has a different AUC value. P-value: .0389797 Test based on bootstrap (N) assumptions. We see that the AUC estimates are equivalent, and the standard errors are quite close as well. The p-value for the tests of equal AUC under rocreg leads to similar inference as the p-value from roccomp. The Wald test performed by rocreg uses the joint bootstrap estimate variance matrix of the three AUC estimators rather than the DeLong, DeLong, and Clarke-Pearson (1988) variance estimate used by roccomp. 
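Schematically, and up to the particular parameterization rocreg uses internally, a Wald test of equal AUCs across the three modalities has the familiar form
\[
W \;=\; (C\hat{\boldsymbol\alpha})'\,(C\,\widehat V\,C')^{-1}(C\hat{\boldsymbol\alpha}),
\qquad
C=\begin{pmatrix} 1 & -1 & \phantom{-}0\\ 1 & \phantom{-}0 & -1 \end{pmatrix},
\]
where \(\hat{\boldsymbol\alpha}\) stacks the three AUC estimates and \(\widehat V\) is their joint bootstrap covariance matrix; under the null hypothesis, \(W\) is referred to a \(\chi^2(2)\) distribution.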
roccomp is used here on potentially correlated classifiers that are recorded in wide-format data. It can also be used on long-format data to compare independent classifiers. Further details can be found in [R] roccomp. Citing the AUC’s lack of clinical relevance, there is argument against using it as a key summary statistic of the ROC curve (Pepe 2003; Cook 2007). Pepe, Longton, and Janes (2009) suggest using the estimate of the ROC curve itself at a particular point, or the estimate of the false-positive rate at a given ROC value, also known as invROC. 1972 rocreg — Receiver operating characteristic (ROC) regression Recall from example 1 how nonparametric rocreg graphs look, with the stairstep pattern in the ROC curve. In an ideal world, the graph would be a smooth one-to-one function, and it would be trivial to map a false-positive rate to its corresponding true-positive rate and vice versa. However, smooth ROC curves can only be obtained by assuming a parametric model that uses linear interpolation between observed false-positive rates and between observed true-positive rates, and rocreg is certainly capable of that; see example 1 of [R] rocregplot. However, under nonparametric estimation, the mapping between false-positive rates and true-positive rates is not one to one, and estimates tend to be less reliable the further you are from an observed data point. This is somewhat mitigated by using tie-corrected rates (the tiecorrected option). When we examine continuous data, the difference between the tie-corrected estimates and the standard estimates becomes negligible, and the empirical estimate of the ROC curve becomes close to the smooth ROC curve obtained by linear interpolation. So the nonparametric ROC and invROC estimates work well. Fixing one rate value of interest can be difficult and subjective (Pepe 2003). A compromise measure is the partial area under the ROC curve (pAUC) (McClish 1989; Thompson and Zucchini 1989). This is the integral of the ROC curve from 0 and above to a given false-positive rate (perhaps the largest clinically acceptable value). Like the AUC estimate, the nonparametric estimate of the pAUC can be written as a sample average of the case observation percentiles, but with an adjustment based on the prescribed maximum false-positive rate (Dodd and Pepe 2003). A tie correction may also be applied so that it reflects the trapezoidal approximation. We cannot compare rocreg with roctab or roccomp on the estimation of pAUC, because pAUC is not computed by the latter two. Example 3: Nonparametric ROC, other statistics To see how rocreg estimates ROC, invROC, and pAUC, we will examine a new study. Wieand et al. (1989) examined a pancreatic cancer study with two continuous classifiers, here called y1 (CA 19-9) and y2 (CA 125). This study was also examined in Pepe, Longton, and Janes (2009). The indicator of cancer in a subject is recorded as d. The study was a case–control study, stratifying participants on disease status. We list the first five observations: . use http://labs.fhcrc.org/pepe/book/data/wiedat2b, clear (S. Wieand - Pancreatic cancer diagnostic marker data) . list in 1/5 1. 2. 3. 4. 5. y1 y2 d 28 15.5 8.2 3.4 17.3 13.3 11.1 16.7 12.6 7.4 no no no no no We will estimate the ROC curves at a large value (0.7) and a small value (0.2) of the false-positive rate. These values are specified in roc(). The false-positive rate for ROC or sensitivity value of 0.6 will also be estimated by specifying invroc(). 
Percentile confidence intervals for these parameters are displayed in the graph obtained by rocregplot after rocreg. The pAUC statistic will be calculated for the false-positive rate of 0.5, which is specified as an argument to the pauc() option. Following Pepe, Longton, and Janes (2009), we use a stratified bootstrap, sampling separately from the case rocreg — Receiver operating characteristic (ROC) regression 1973 and control populations by specifying the bootcc option. This reflects the case–control nature of the study. All four statistics can be estimated simultaneously by rocreg. For clarity, however, we will estimate each statistic with a separate call to rocreg. rocregplot is used after estimation to graph the ROC and false-positive rate estimates. The display of the individual, observation-specific false-positive rate and ROC values will be omitted in the plot. This is accomplished by specifying msymbol(i) in our plot1opts() and plot2opts() options to rocregplot. . rocreg d y1 y2, roc(.7) bseed(8378923) bootcc nodots Bootstrap results Number of strata = 2 Number of obs Replications Nonparametric ROC estimation = = 141 1000 Control standardization: empirical ROC method : empirical ROC curve Status : d Classifier: y1 ROC Observed Coef. Bias .7 .9222222 -.0021889 ROC Observed Coef. Bias .7 .8888889 -.0035556 Bootstrap Std. Err. .0323879 [95% Conf. Interval] .8587432 .8444445 .8555555 .9857013 (N) .9777778 (P) .9777778 (BC) Status : d Classifier: y2 Bootstrap Std. Err. .0414215 Ho: All classifiers have equal ROC values. Ha: At least one classifier has a different ROC value. Test based on bootstrap (N) assumptions. ROC P-value .7 .5423044 [95% Conf. Interval] .8077043 .8 .7888889 .9700735 (N) .9611111 (P) .9555556 (BC) 1974 rocreg — Receiver operating characteristic (ROC) regression 0 True−positive rate (ROC) .25 .5 .75 1 . rocregplot, plot1opts(msymbol(i)) plot2opts(msymbol(i)) 0 .25 .5 .75 False−positive rate 1 CA 19−9 CA 125 In this study, we see that classifier y1 (CA 19-9) is a uniformly better test than is classifier y2 (CA 125) until high levels of false-positive rate and sensitivity or ROC value are reached. At the high level of false-positive rate, 0.7, the ROC value does not significantly differ between the two classifiers. This can be seen in the plot by the overlapping confidence intervals. rocreg — Receiver operating characteristic (ROC) regression . rocreg d y1 y2, roc(.2) bseed(8378923) bootcc nodots Bootstrap results Number of strata = 2 Number of obs Replications = = 141 1000 Nonparametric ROC estimation Control standardization: empirical ROC method : empirical ROC curve Status : d Classifier: y1 ROC Observed Coef. Bias .2 .7777778 .0011778 ROC Observed Coef. Bias .2 .4888889 -.0091667 Bootstrap Std. Err. .0483655 [95% Conf. Interval] .6829831 .6888889 .6777778 .8725725 (N) .8777778 (P) .8666667 (BC) Status : d Classifier: y2 Bootstrap Std. Err. .1339863 [95% Conf. Interval] .2262806 .2222222 .2111111 Ho: All classifiers have equal ROC values. Ha: At least one classifier has a different ROC value. Test based on bootstrap (N) assumptions. ROC P-value 0 True−positive rate (ROC) .25 .5 .75 1 .2 .043234 . rocregplot, plot1opts(msymbol(i)) plot2opts(msymbol(i)) 0 .25 .5 .75 False−positive rate CA 19−9 CA 125 1 .7514971 (N) .7 (P) .7 (BC) 1975 1976 rocreg — Receiver operating characteristic (ROC) regression The sensitivity for the false-positive rate of 0.2 is found to be higher under y1 than under y2, and this difference is significant at the 0.05 level. 
In the plot, this is shown by the vertical confidence intervals. . rocreg d y1 y2, invroc(.6) bseed(8378923) bootcc nodots Bootstrap results Number of strata = 2 Number of obs Replications Nonparametric ROC estimation Control standardization: empirical ROC method : empirical False-positive rate Status : d Classifier: y1 invROC Observed Coef. Bias .6 0 .0158039 invROC Observed Coef. Bias .6 .254902 .0101961 Bootstrap Std. Err. .0267288 = = 141 1000 [95% Conf. Interval] -.0523874 0 0 .0523874 (N) .0784314 (P) .1372549 (BC) Status : d Classifier: y2 Bootstrap Std. Err. .0757902 [95% Conf. Interval] .1063559 .1372549 .1176471 Ho: All classifiers have equal invROC values. Ha: At least one classifier has a different invROC value. Test based on bootstrap (N) assumptions. invROC P-value 0 True−positive rate (ROC) .25 .5 .75 1 .6 .0016562 . rocregplot, plot1opts(msymbol(i)) plot2opts(msymbol(i)) 0 .25 .5 .75 False−positive rate CA 19−9 CA 125 1 .403448 (N) .4313726 (P) .3921569 (BC) rocreg — Receiver operating characteristic (ROC) regression 1977 We find significant evidence that false-positive rates corresponding to a sensitivity of 0.6 are different from y1 to y2. This is visually indicated by the horizontal confidence intervals, which are separated from each other. . rocreg d y1 y2, pauc(.5) bseed(8378923) bootcc nodots Bootstrap results Number of strata = 2 Number of obs Replications Nonparametric ROC estimation Control standardization: empirical ROC method : empirical Partial area under the ROC curve Status : d Classifier: y1 pAUC Observed Coef. Bias .5 .3932462 -.0000769 pAUC Observed Coef. Bias .5 .2496732 .0019168 Bootstrap Std. Err. .021332 = = 141 1000 [95% Conf. Interval] .3514362 .3492375 .3492375 .4350562 (N) .435512 (P) .435403 (BC) Status : d Classifier: y2 Bootstrap Std. Err. .0374973 [95% Conf. Interval] .1761798 .177451 .1738562 .3231666 (N) .3253268 (P) .3233115 (BC) Ho: All classifiers have equal pAUC values. Ha: At least one classifier has a different pAUC value. Test based on bootstrap (N) assumptions. pAUC P-value .5 .0011201 We also find significant evidence supporting the hypothesis that the pAUC for y1 up to a false-positive rate of 0.5 differs from the area of the same region under the ROC curve of y2. Covariate-adjusted ROC curves When covariates affect the control distribution of the diagnostic test, thresholds for the test being classified as abnormal may be chosen that vary with the covariate values. These conditional thresholds will be more accurate than the marginal thresholds that would normally be used, because they take into account the specific distribution of the diagnostic test under the given covariate values as opposed to the marginal distribution over all covariate values. By using these covariate-specific thresholds, we are essentially creating new classifiers for each covariate-value combination, and thus we are creating multiple ROC curves. As explained in Pepe (2003), when the case and control distributions of the covariates are the same, the marginal ROC curve will always be bound above by these covariate-specific ROC curves. So using conditional thresholds will never provide a less powerful test diagnostic in this case. 1978 rocreg — Receiver operating characteristic (ROC) regression In the marginal ROC curve calculation, the classifiers are standardized to percentiles according to the control distribution, marginalized over the covariates. Thus the ROC curve is the CDF of the standardized case observations. 
The covariate-adjusted ROC curve is the CDF of one minus the conditional control percentiles for the case observations, and the marginal ROC curve is the CDF of one minus the marginal control percentiles for the case observations (Pepe and Cai 2004). Thus the standardization of classifier to false-positive rate value is conditioned on the specific covariate values under the covariate-adjusted ROC curve. The covariate-adjusted ROC curve (Janes and Pepe 2009) at a given false-positive rate t is equivalent to the expected value of the covariate-specific ROC at t over all covariate combinations. When the covariates in question do not affect the case distribution of the classifier, the covariate-specific ROC will have the same value at each covariate combination. So here the covariate-adjusted ROC is equivalent to the covariate-specific ROC, regardless of covariate values. When covariates do affect the case distribution of the classifier, users of the diagnostic test would likely want to model the covariate-specific ROC curves separately. Tools to do this can be found in the parametric modeling discussion in the following two sections. Regardless, the covariate-adjusted ROC curve can serve as a meaningful summary of covariate-adjusted accuracy. Also note that the ROC summary statistics defined in the previous section have covariate-adjusted analogs. These analogs are estimated in a similar manner as under the marginal ROC curve (Janes, Longton, and Pepe 2009). The options for their calculation in rocreg are identical to those given in the previous section. Further details can be found in Methods and formulas. Example 4: Nonparametric ROC, linear covariate adjustment Norton et al. (2000) studied data from a neonatal audiology study on three tests to identify hearing impairment in newborns. These data were also studied in Janes, Longton, and Pepe (2009). Here we list 5 of the 5,058 observations. . use http://www.stata-press.com/data/r13/nnhs, clear (Norton - neonatal audiology data) . list in 1/5 1. 2. 3. 4. 5. id ear male currage d y1 y2 y3 B0157 B0157 B0158 B0161 B0167 R L R L R M M M F F 42.42 42.42 40.14 38.14 37 0 0 1 0 0 -3.1 -4.5 -3.2 -22.1 -10.9 -9 -8.7 -13.2 -7.8 -6.6 -1.5 -2.71 -2.64 -2.59 -1.42 The classifiers y1 (DPOAE 65 at 2 kHz), y2 (TEOAE 80 at 2 kHz), and y3 (ABR) and the hearing impairment indicator d are recorded along with some relevant covariates. The infant’s age is recorded in months as currage, and the infant’s gender is indicated by male. Over 90% of the newborns were tested in each ear (ear), so we will cluster on infant ID (id). Following the strategy of Janes, Longton, and Pepe (2009), we will first perform ROC analysis for the classifiers while adjusting for the covariate effects of the infant’s gender and age. This is done by specifying these variables in the ctrlcov() option. We adjust using a linear regression rule, by specifying ctrlmodel(linear). This means that when a user of the diagnostic test chooses a threshold conditional on the age and gender covariates, they assume that the diagnostic test classifier has some linear dependence on age and gender and equal variance as their levels vary. Our cluster adjustment is made by specifying the cluster() option. rocreg — Receiver operating characteristic (ROC) regression 1979 We will focus on the first classifier. The percentile, or specificity, values are calculated empirically by default, and thus so are the false-positive rates, (1 − specificity). 
Also by default, the ROC curve values are empirically defined by the false-positive rates. To draw the ROC curve, we again use rocregplot. The AUC is calculated by default. For brevity, we specify the nobootstrap option so that bootstrap sampling is not performed. The AUC point estimate will be sufficient for our purposes. . rocreg d y1, ctrlcov(male currage) ctrlmodel(linear) cluster(id) nobootstrap Nonparametric ROC estimation Covariate control : linear regression Control variables : male currage Control standardization: empirical ROC method : empirical Status : d Classifier: y1 Covariate control adjustment model: Linear regression Number of obs = 4907 F( 2, 2685) = 13.80 Prob > F = 0.0000 R-squared = 0.0081 Root MSE = 7.7515 (Std. Err. adjusted for 2686 clusters in id) y1 Coef. male currage _cons .2471744 -.2032456 -1.239484 Robust Std. Err. .2603598 .0389032 1.487855 t 0.95 -5.22 -0.83 P>|t| 0.343 0.000 0.405 [95% Conf. Interval] -.2633516 -.2795288 -4.156942 .7577005 -.1269624 1.677973 Area under the ROC curve Status : d Classifier: y1 AUC Observed Coef. Bias .6293994 . Bootstrap Std. Err. . [95% Conf. Interval] . . . . (N) . (P) . (BC) 1980 rocreg — Receiver operating characteristic (ROC) regression 0 True−positive rate (ROC) .25 .5 .75 1 . rocregplot 0 .25 .5 .75 False−positive rate 1 DPOAE 65 at 2kHz Our covariate control adjustment model shows that currage has a negative effect on y1 (DPOAE 65 at 2 kHz) under the control population. At the 0.001 significance level, we reject that its contribution to y1 is zero, and the point estimate has a negative sign. This result does not directly tell us about the effect of currage on the ROC curve of y1 as a classifier of d. None of the case observations are used in the linear regression, so information on currage for abnormal cases is not used in the model. This result does show us how to calculate false-positive rates for tests that use thresholds conditional on a child’s sex and current age. We will see how currage affects the ROC curve when y1 is used as a classifier and conditional thresholds are used based on male and currage in the following section, Parametric ROC curves: Estimating equations. Technical note Under this nonparametric estimation, rocreg saved the false-positive rate for each observation’s y1 values in the utility variable fpr y1. The true-positive rates are stored in the utility variable roc y1. For other models, say with classifier yname, these variables would be named fpr yname and roc yname. They will also be overwritten with each call of rocreg. The variables roc * and fpr * are usually for internal rocreg use only and are overwritten with each call of rocreg. They are only created for nonparametric models or parametric models that do not involve ROC covariates. In these models, covariates may only affect the first stage of estimation, the control distribution, and not the ROC curve itself. In parametric models that allow ROC covariates, different covariate values would lead to different ROC curves. To see how the covariate-adjusted ROC curve estimate differs from the standard marginal estimate, we will reestimate the ROC curve for classifier y1 without covariate adjustment. We rename these variables before the new estimation and then draw an overlaid twoway line (see [G-2] graph twoway line) plot to compare the two. rocreg — Receiver operating characteristic (ROC) regression 1981 . rename _fpr_y1 o_fpr_y1 . rename _roc_y1 o_roc_y1 . label variable o_roc_y1 "covariate_adjusted" . 
rocreg d y1, cluster(id) nobootstrap Nonparametric ROC estimation Control standardization: empirical ROC method : empirical Area under the ROC curve Status : d Classifier: y1 AUC Observed Coef. Bias .6279645 . Bootstrap Std. Err. [95% Conf. Interval] . . . . . (N) . (P) . (BC) 0 .2 .4 .6 .8 1 . label variable _roc_y1 "marginal" . twoway line _roc_y1 _fpr_y1, sort(_fpr_y1 _roc_y1) connect(J) || > line o_roc_y1 o_fpr_y1, sort(o_fpr_y1 o_roc_y1) > connect(J) lpattern(dash) aspectratio(1) legend(cols(1)) 0 .2 .4 .6 .8 false−positive rate for y1 1 marginal covariate_adjusted Though they are close, particularly in AUC, there are clearly some points of difference between the estimates. So the covariate-adjusted ROC curve may be useful here. In our examples thus far, we have used the empirical CDF estimator to estimate the control distribution. rocreg allows some flexibility here. The pvc(normal) option may be specified to calculate the percentile values according to a Gaussian distribution of the control. Covariate adjustment in rocreg may also be performed with stratification instead of linear regression. Under the stratification method, the unique values of the stratified covariates each define separate parameters for the control distribution of the classifier. A user of the diagnostic test chooses a threshold based on the control distribution conditioned on the unique covariate value parameters. We will demonstrate the use of normal percentile values and covariate stratification in our next example. 1982 rocreg — Receiver operating characteristic (ROC) regression Example 5: Nonparametric ROC, covariate stratification The hearing test study of Stover et al. (1996) examined the effectiveness of negative signal-to-noise ratio, nsnr, as a classifier of hearing loss. The test was administered under nine different settings, corresponding to different frequency, xf, and intensity, xl, combinations. Here we list 10 of the 1,848 observations. . use http://www.stata-press.com/data/r13/dp, clear (Stover - DPOAE test data) . list in 1/10 id d nsnr xf xl xd 1. 2. 3. 4. 5. 101 101 101 101 101 1 1 1 1 1 18 19 7.6 15 16 10.01 20.02 10.01 20.02 10.01 5.5 5.5 6 6 6.5 3.5 3 3.5 3 3.5 6. 7. 8. 9. 10. 101 102 102 102 102 1 0 0 1 0 5.8 -2.6 -3 10 -5.8 20.02 10.01 14.16 20.02 10.01 6.5 5.5 5.5 5.5 6 3 . . 1 . Hearing loss is represented by d. The covariate xd is a measure of the degree of hearing loss. We will use this covariate in later analysis, because it only affects the case distribution of the classifier. Multiple measurements are taken for each individual, id, so we will cluster by individual. We evaluate the effectiveness of nsnr using xf and xl as stratification covariates with rocreg; the default method of covariate adjustment. As mentioned before, the default false-positive rate calculation method in rocreg estimates the conditional control distribution of the classifiers empirically. For comparison, we will also estimate a separate ROC curve using false-positive rates assuming the conditional control distribution is normal. This behavior is requested by specifying the pvc(normal) option. Using the rocregplot option name() to store the ROC plots and using the graph combine command, we are able to compare the Gaussian and empirical ROC curves side by side. As before, for brevity we specify the nobootstrap option to suppress bootstrap sampling. . 
rocreg d nsnr, ctrlcov(xf xl) cluster(id) nobootstrap Nonparametric ROC estimation Covariate control : stratification Control variables : xf xl Control standardization: empirical ROC method : empirical Area under the ROC curve Status : d Classifier: nsnr AUC Observed Coef. Bias .9264192 . Bootstrap Std. Err. . . rocregplot, title(Empirical FPR) name(a) nodraw [95% Conf. Interval] . . . . (N) . (P) . (BC) rocreg — Receiver operating characteristic (ROC) regression 1983 . rocreg d nsnr, pvc(normal) ctrlcov(xf xl) cluster(id) nobootstrap Nonparametric ROC estimation Covariate control : stratification Control variables : xf xl Control standardization: normal ROC method : empirical Area under the ROC curve Status : d Classifier: nsnr AUC Observed Coef. Bias .9309901 . Bootstrap Std. Err. [95% Conf. Interval] . . . . . (N) . (P) . (BC) . rocregplot, title(Normal FPR) name(b) nodraw . graph combine a b, xsize(5) True−positive rate (ROC) .5 .75 .25 0 0 .25 True−positive rate (ROC) .5 .75 1 Normal FPR 1 Empirical FPR 0 .25 .5 False−positive rate .75 1 −SNR 0 .25 .5 False−positive rate .75 1 −SNR On cursory visual inspection, we see little difference between the two curves. The AUC values are close as well. So it is sensible to assume that we have Gaussian percentile values for control standardization. Parametric ROC curves: Estimating equations We now assume a parametric model for covariate effects on the second stage of ROC analysis. Particularly, the ROC curve is a probit model of the covariates. We will thus have a separate ROC curve for each combination of the relevant covariates. Under weak assumptions about the control distribution of the classifier, we can fit this model by using estimating equations as described in Alonzo and Pepe (2002). This method can be also be used without covariate effects in the second stage, assuming a parametric model for the single (constant only) ROC curve. Covariates may still affect the first stage of estimation, so we parametrically model the single covariate-adjusted ROC curve (from the previous section). The marginal ROC curve, involving no covariates in either stage of estimation, can be fit parametrically as well. 1984 rocreg — Receiver operating characteristic (ROC) regression In addition to the Alonzo and Pepe (2002) explanation, further details are given in Pepe, Longton, and Janes (2009); Janes, Longton, and Pepe (2009); Pepe (2003); and Janes and Pepe (2009). The parametric models that we consider assume that the ROC curve is a cumulative distribution function g invoked with input of a linear polynomial in the corresponding quantile function invoked on the false-positive rate u. In this context, we assume that g corresponds to a standard normal cumulative distribution function, Φ. So the corresponding quantile function is Φ−1 . The constant intercept of the polynomial may depend on covariates, but the slope term α (the quantile coefficient) may not. 0 −1 ROC (u) = g{x β + αg (u)} The first step of the algorithm involves the choice of false-positive rates to use in the parametric fit. These are typically a set of equispaced points spanning the interval (0,1). Alonzo and Pepe (2002) examined the effect of fitting large and small sets of points, finding that relatively small sets could be used with little loss of efficiency. Alternatively, the set can be formed by using the observed false-positive rates in the data (Pepe 2003). Further details on the algorithm are provided in Methods and formulas. 
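With the probit link written out explicitly, the model for the ROC value at false-positive rate u given above is
\[
\mathrm{ROC}(u) \;=\; \Phi\{\mathbf{x}'\boldsymbol\beta + \alpha\,\Phi^{-1}(u)\}, \qquad u\in(0,1),
\]
where the intercept term \(\mathbf{x}'\boldsymbol\beta\), which may depend on the ROC covariates, and the common slope \(\alpha\) on the probit-transformed false-positive rate are the parameters reported by rocreg in the examples that follow.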
Under parametric estimation, all the summary measures we defined earlier, except the AUC, are not calculated until postestimation. In models with covariates, each covariate combination would yield a different ROC curve and thus different summary parameters, so no summary parameters are initially estimated. In marginal parametric models (where there are no ROC covariates, but there are potentially control covariates), we will calculate the AUC and leave the other measures for postestimation; see [R] rocreg postestimation. As with the other parameters, we bootstrap for standard errors and inference. We will now demonstrate how rocreg performs the Alonzo and Pepe (2002) algorithm using the previous section’s examples and others. Example 6: Parametric ROC, linear covariate adjustment We return to the neonatal audiology study with gender and age covariates (Norton et al. 2000), which we discussed in example 4. Janes, Longton, and Pepe (2009) suspected the current age of the infant would play a role in the case distribution of the classifier y1 (DPOAE 65 at 2 kHz). They postulated a probit link between the ROC curve and the covariate-adjusted false-positive rates. We follow their investigation and reach similar results. In example 4, we saw the results of adjusting for the currage and male variables in the control population for classifier y1. Now we see how currage affects the ROC curve when y1 is used with thresholds conditioned on male and currage. We specify the covariates that should affect the ROC curve in the roccov() option. By default, rocreg will choose 10 equally spaced false-positive rates in the (0,1) interval as fitting points. The fprpts() option allows the user to specify more or fewer points. We specify the bsave() option with the nnhs2y1 dataset so that we can use the bootstrap resamples in postestimation. rocreg — Receiver operating characteristic (ROC) regression 1985 . use http://www.stata-press.com/data/r13/nnhs, clear (Norton - neonatal audiology data) . rocreg d y1, probit ctrlcov(currage male) ctrlmodel(linear) roccov(currage) > cluster(id) bseed(56930) bsave(nnhs2y1) nodots Bootstrap results Number of obs = 5056 Replications = 1000 Parametric ROC estimation Covariate control : linear regression Control variables : currage male Control standardization: empirical ROC method : parametric Link: probit Status : d Classifier: y1 Covariate control adjustment model: Linear regression Number of obs = 4907 F( 2, 2685) = 13.80 Prob > F = 0.0000 R-squared = 0.0081 Root MSE = 7.7515 (Std. Err. adjusted for 2686 clusters in id) y1 Coef. currage male _cons -.2032456 .2471744 -1.239484 Robust Std. Err. .0389032 .2603598 1.487855 t -5.22 0.95 -0.83 P>|t| 0.000 0.343 0.405 [95% Conf. Interval] -.2795288 -.2633516 -4.156942 -.1269624 .7577005 1.677973 Status : d Classifier: y1 ROC Model : (Replications based on 2741 clusters in id) y1 Observed Coef. Bias Bootstrap Std. Err. _cons -1.272505 -.0566737 1.076706 currage .0448228 .0015878 .0280384 .9372393 .0128376 .0747228 [95% Conf. Interval] -3.38281 -3.509356 -3.487457 -.0101316 -.007932 -.0102905 .8377993 (N) .7178385 (P) .7813575 (BC) .0997771 (N) .1033131 (P) .101021 (BC) .7907853 .8079087 .7928988 1.083693 (N) 1.101941 (P) 1.083399 (BC) probit _cons Note how the number of clusters—here infants—changes from the covariate control adjustment model fit to the ROC model. The control fit is limited to control cases and thus fewer infants. The ROC is fit on all the data, so the variance is adjusted for all clustering on all infants. 
With a 0.05 level of statistical significance, we cannot reject the null hypothesis that currage has no effect on the ROC curve at a given false-positive rate. This is because each of our 95% bootstrap confidence intervals contains 0. This corresponds with the finding in Janes, Longton, and Pepe (2009) where the reported 95% intervals each contained 0. We cannot reject that the intercept parameter β0 , reported as cons in the main table, is 0 at the 0.05 level either. The slope parameter α, reported 1986 rocreg — Receiver operating characteristic (ROC) regression as cons in the probit table, is close to 1 and cannot be rejected as being 1 at the 0.05 level. Under the assumption that the ROC coefficients except α are 0 and that α = 1, the ROC curve at false-positive rate u is equal to u. In other words, we cannot reject that the false-positive rate is equal to the true-positive rate, and so the test is noninformative. Further investigation of the results requires postestimation; see [R] rocreg postestimation. The fitting point set can be formed by using the observed false-positive rates (Pepe 2003). Our next example will illustrate this. Example 7: Parametric ROC, covariate stratification We return to the hearing test study of Stover et al. (1996), which we discussed in example 5. Pepe (2003) suspected that intensity, xd, would play a role in the case distribution of the negative signal-to-noise ratio (nsnr) classifier. A ROC regression was fit with covariate adjustment for xf and xl with stratification, and for ROC covariates xf, xl, and xd. There is no prohibition against the same covariate being used in the first and second stages of ROC calculation. The false-positive rate fitting point set was composed of all observed false-positive rates in the control data. We fit the model with rocreg here. Using observed false-positive rates as the fitting point set can make the dataset very large, so fitting the model is computationally intensive. We demonstrate the fitting algorithm without precise confidence intervals, focusing instead on the coefficient estimates and standard errors. We will thus perform only 50 bootstrap replications, a reasonable number to obtain accurate standard error estimates (Mooney and Duval 1993). The number of replications is specified in the breps() option. The ROC covariates are specified in roccov(). We specify that all observed false-positive rates in the control observations be used as fitting points with the ctrlfprall option. The nobstrata option specifies that the bootstrap is not stratified. The covariate stratification in the first stage of estimation does not affect the resampling. We will return to this example in postestimation, so we save the bootstrap results in the nsnrf dataset with the bsave() option. rocreg — Receiver operating characteristic (ROC) regression 1987 . use http://www.stata-press.com/data/r13/dp (Stover - DPOAE test data) . rocreg d nsnr, probit ctrlcov(xf xl) roccov(xf xl xd) ctrlfprall cluster(id) > nobstrata bseed(156385) breps(50) bsave(nsnrf) (running rocregstat on estimation sample) Bootstrap replications (50) 1 2 3 4 5 .................................................. Bootstrap results 50 Number of obs Replications = = 1848 50 Parametric ROC estimation Covariate control : Control variables : Control standardization: ROC method : stratification xf xl empirical parametric Link: probit Status : d Classifier: nsnr ROC Model : (Replications based on 208 clusters in id) nsnr Observed Coef. Bias Bootstrap Std. Err. 
_cons 3.247872 -.0846178 .8490006 xf .0502557 .014478 .0329044 xl -.4327223 -.0194846 .1116309 xd .4431764 .0086147 .0936319 1.032657 -.0188887 .1224993 [95% Conf. Interval] 1.583862 1.598022 1.346904 -.0142357 -.0031814 -.0053095 -.6515149 -.6570321 -.6570321 .2596612 .330258 .3487118 4.911883 4.690076 4.690076 .1147471 .1186107 .1132185 -.2139298 -.2499706 -.231854 .6266916 .6672749 .7674865 (N) (P) (BC) (N) (P) (BC) (N) (P) (BC) (N) (P) (BC) probit _cons .7925628 .7815666 .7815666 1.272751 (N) 1.236179 (P) 1.237131 (BC) We obtain results similar to those reported in Pepe (2003, 159). We find that the coefficients for xl and xd differ from 0 at the 0.05 level of significance. So over certain covariate combinations, we can have a variety of informative tests using nsnr as a classifier. As mentioned before, when there are no covariates, rocreg can still fit a parametric model for the ROC curve of a classifier by using the Alonzo and Pepe (2002) method. roccomp and rocfit can fit marginal probit models as well. We will compare the behavior of rocreg with that of roccomp and rocfit for probit models without covariates. When the binormal option is specified, roccomp calculates the AUC for input classifiers according to the maximum likelihood algorithm of rocfit. The rocfit algorithm expects discrete classifiers but can slice continuous classifiers into discrete partitions. Further, the case and control distributions are both assumed normal. Actually, the observed classification values are taken as discrete indicators 1988 rocreg — Receiver operating characteristic (ROC) regression of the latent normally distributed classification values. This method is documented in Dorfman and Alf (1969). Alonzo and Pepe (2002) compared their estimating equations probability density function method (with empirical estimation of the false-positive rates) to the maximum likelihood approach of Dorfman and Alf (1969) and found that they had similar efficiency and mean squared error. So we should expect rocfit and rocreg to give similar results when fitting a simple probit model. Example 8: Parametric ROC, marginal model We return to the Hanley and McNeil (1982) data. We will fit a probit model to the ROC curve, assuming that the rating variable is a discrete indicator of an underlying latent normal random variable in both the case and control populations of disease. We invoke rocfit with the default options. rocreg is invoked with the probit option. The percentile values are calculated empirically. Because there are fewer categories than 10, there will be fewer than 10 false-positive rates that trigger a different true-positive rate value. So for efficiency, we invoke rocreg with the ctrlfprall option. . use http://www.stata-press.com/data/r13/hanley . rocfit disease rating, nolog Binormal model of disease on rating Goodness-of-fit chi2(2) = 0.21 Prob > chi2 = 0.9006 Log likelihood = -123.64855 Coef. Std. Err. Number of obs z = 109 P>|z| [95% Conf. Interval] intercept slope (*) 1.656782 0.713002 0.310456 0.215882 5.34 -1.33 0.000 0.184 1.048300 0.289881 2.265265 1.136123 /cut1 /cut2 /cut3 /cut4 0.169768 0.463215 0.766860 1.797938 0.165307 0.167235 0.174808 0.299581 1.03 2.77 4.39 6.00 0.304 0.006 0.000 0.000 -0.154227 0.135441 0.424243 1.210770 0.493764 0.790990 1.109477 2.385106 Index Estimate Indices from binormal fit Std. Err. [95% Conf. 
Interval] ROC area delta(m) d(e) d(a) 0.911331 2.323671 1.934361 1.907771 0.029506 0.502370 0.257187 0.259822 (*) z test for slope==1 0.853501 1.339044 1.430284 1.398530 0.969161 3.308298 2.438438 2.417012 rocreg — Receiver operating characteristic (ROC) regression . rocreg disease rating, probit ctrlfprall bseed(8574309) nodots Bootstrap results Number of obs Replications Parametric ROC estimation Control standardization: empirical ROC method : parametric Status : disease Classifier: rating ROC Model : = = 1989 109 1000 Link: probit rating Observed Coef. Bias Bootstrap Std. Err. _cons 1.635041 .0588548 .3609651 .9275621 1.162363 1.164204 2.342519 (N) 2.556508 (P) 2.566174 (BC) _cons .6951252 .0572146 .3241451 .0598125 .3500569 .3372983 1.330438 (N) 1.430441 (P) 1.411953 (BC) AUC Observed Coef. Bias .9102903 -.0051749 [95% Conf. Interval] probit Bootstrap Std. Err. .0314546 [95% Conf. Interval] .8486405 .837113 .8468336 .9719402 (N) .9605498 (P) .9630486 (BC) We see that the intercept and slope parameter estimates are close. The intercept ( cons in the main table) is clearly nonzero. Under rocreg, the slope ( cons in the probit table) and its percentile and bias-corrected confidence intervals are close to those of rocfit. The area under the ROC curve for each of the rocreg and rocfit estimators also matches closely. Now we will compare the parametric fit of rocreg under the constant probit model with roccomp. Example 9: Parametric ROC, marginal model, multiple classifiers We now use the fictitious dataset generated from Hanley and McNeil (1983). To fit a probit model using roccomp, we specify the binormal option. Our specification of rocreg remains the same as before. rocregplot is used to render the model produced by rocreg. We specify several graph options to both roccomp and rocregplot to ease comparison. When the binormal option is specified along with graph, roccomp will draw the binormal fitted lines in addition to connected line plots of the empirical false-positive and true-positive rates. In this plot, we overlay scatterplots of the empirical false-positive rates (because percentile value calculation defaulted to pvc(empirical)) and the parametric true-positive rates. 1990 rocreg — Receiver operating characteristic (ROC) regression . use http://www.stata-press.com/data/r13/ct2, clear . roccomp status mod1 mod2 mod3, summary binormal graph aspectratio(1) > plot1opts(connect(i) msymbol(o)) > plot2opts(connect(i) msymbol(s)) > plot3opts(connect(i) msymbol(t)) > legend(label(1 "mod1") label(3 "mod2") label(5 "mod3") > label(2 "mod1 fit") label(4 "mod2 fit") > label(6 "mod3 fit") order(1 3 5 2 4 6) cols(1)) > title(roccomp) name(a) nodraw Fitting binormal model for: mod1 Fitting binormal model for: mod2 Fitting binormal model for: mod3 ROC Obs Area Std. Err. [95% Conf. Interval] mod1 mod2 mod3 112 112 112 0.8945 0.9382 0.9376 0.0305 0.0264 0.0223 0.83482 0.88647 0.89382 0.95422 0.99001 0.98139 Ho: area(mod1) = area(mod2) = area(mod3) chi2(2) = 8.27 Prob>chi2 = 0.0160 . rocreg status mod1 mod2 mod3, probit ctrlfprall bseed(867340912) nodots Bootstrap results Number of obs = 112 Replications = 1000 Parametric ROC estimation Control standardization: empirical ROC method : parametric Link: probit Status : status Classifier: mod1 ROC Model : mod1 Observed Coef. Bias Bootstrap Std. Err. 
_cons 1.726034 .1363112 .5636358 .6213277 1.162477 1.152112 2.83074 (N) 3.277376 (P) 3.187595 (BC) _cons .9666323 .0872018 .4469166 .0906919 .518082 .5568404 1.842573 (N) 2.219548 (P) 2.394036 (BC) AUC Observed Coef. Bias .8927007 -.0011794 [95% Conf. Interval] probit Bootstrap Std. Err. .0313951 [95% Conf. Interval] .8311675 .8245637 .8210562 .954234 (N) .9466904 (P) .9432855 (BC) rocreg — Receiver operating characteristic (ROC) regression Status : status Classifier: mod2 ROC Model : mod2 Observed Coef. Bias Bootstrap Std. Err. _cons 1.696811 .0918364 .5133386 .6906858 1.21812 1.22064 2.702936 (N) 2.973929 (P) 3.068454 (BC) _cons .4553828 .047228 .3345303 -.2002845 .1054933 .1267796 1.11105 (N) 1.18013 (P) 1.272523 (BC) AUC Observed Coef. Bias .938734 -.0037989 [95% Conf. Interval] probit Bootstrap Std. Err. .0261066 [95% Conf. Interval] .8875659 .8777664 .8823555 .9899021 (N) .9778214 (P) .9792451 (BC) Status : status Classifier: mod3 ROC Model : mod3 Observed Coef. Bias Bootstrap Std. Err. _cons 2.281359 .1062846 .6615031 .9848363 1.637764 1.666076 3.577881 (N) 4.157873 (P) 4.474779 (BC) _cons 1.107736 .0514693 .4554427 .2150843 .58586 .6385949 2.000387 (N) 2.28547 (P) 2.671192 (BC) AUC Observed Coef. Bias .9368321 -.0023853 [95% Conf. Interval] probit Bootstrap Std. Err. .0231363 [95% Conf. Interval] .8914859 .8844096 .8836259 .9821784 (N) .9722485 (P) .9718463 (BC) Ho: All classifiers have equal AUC values. Ha: At least one classifier has a different AUC value. P-value: .0778556 Test based on bootstrap (N) assumptions. . rocregplot, title(rocreg) nodraw name(b) > plot1opts(msymbol(o)) plot2opts(msymbol(s)) plot3opts(msymbol(t)) 1991 1992 rocreg — Receiver operating characteristic (ROC) regression . graph combine a b, xsize(5) rocreg 0 0.00 0.25 Sensitivity 0.50 True−positive rate (ROC) .25 .5 .75 0.75 1 1.00 roccomp 0.00 0.25 0.50 0.75 1−Specificity 1.00 0 .25 .5 .75 False−positive rate mod1 mod1 mod2 mod2 mod3 mod3 mod1 fit mod1 Fit mod2 fit mod2 Fit mod3 fit mod3 Fit 1 We see differing true-positive rate values in the scattered points, which is expected because roccomp gives the empirical estimate and rocreg gives the parametric estimate. However, the estimated curves and areas under the ROC curve look similar. Using the Wald test based on the bootstrap covariance, rocreg rejects the null hypothesis that each test has the same AUC at the 0.1 significance level. roccomp formulates the asymptotic covariance using the rocfit estimates of AUC. Examination of its output leads to rejection of the null hypothesis that the AUCs are equal across each test at the 0.05 significance level. Parametric ROC curves: Maximum likelihood The Alonzo and Pepe (2002) method of fitting a parametric model to the ROC curve is powerful because it can be generally applied, but that can be a limitation as well. Whenever we invoke the method and want anything other than point estimates of the parameters, we must perform bootstrap resampling. An alternative is to use maximum likelihood inference to fit the ROC curve. This method can save computational time by avoiding the bootstrap. rocreg implements maximum likelihood estimation for ROC curve analysis when both the case and control populations are normal. Particularly, the classifier is a normal linear model on certain covariates, and the covariate effect and variance of the classifier may change between the case and control populations. This model is defined in Pepe (2003, 145). y = z0 β0 + Dx0 β1 + σ (D) Our error term, , is a standard normal random variable. 
The variable D is our true status variable, being 1 for the case population observations and 0 for the control population observations. The variance function σ is defined as σ (D) = σ0 (D = 0) + σ1 (D = 1) This provides two variance parameters in the model and does not depend on covariate values. rocreg — Receiver operating characteristic (ROC) regression 1993 Suppose a covariate xi is present in z and x. The coefficient β1i represents the interaction effect of the xi and D. It is the extra effect that xi has on classifier y under the case population, D = 1, beyond the main effect β0i . These β1 coefficients are directly related to the ROC curve of y . Under this model, the ROC curve is derived to be 1 0 ROC (u) = Φ {x β1 + σ0 Φ−1 (u)} σ1 For convenience, we reparameterize the model at this point, creating the parameters βi = σ1−1 β1i and α = σ1−1 σ0 . We refer to β0 as the constant intercept, i cons. The parameter α is referred to as the constant slope, s cons. ROC (u) = Φ{x0 β + αΦ−1 (u)} We may interpret the final coefficients as the standardized linear effect of the ROC covariate on the classifier under the case population. The marginal effect of the covariate on the classifier in the control population is removed, and it is rescaled by the case population standard deviation of the classifier when all ROC covariate effects are removed. An appreciable effect on the classifier by a ROC covariate in this measure leads to an appreciable effect on the classifier’s ROC curve by the ROC covariate. The advantage of estimating the control coefficients β0 is similar to the gains of estimating the covariate control models in the estimating equations ROC method and nonparametric ROC estimation. This model would similarly apply when evaluating a test that is conditioned on control covariates. Again we note that under parametric estimation, all the summary measures we defined earlier except the AUC are not calculated until postestimation. In models with covariates, each covariate combination would yield a different ROC curve and thus different summary parameters, so no summary parameters are estimated initially. In marginal parametric models, we will calculate the AUC and leave the other measures for postestimation. There is a simple closed-form formula for the AUC under the probit model. Using this formula, the delta method can be invoked for inference on the AUC. Details on AUC estimation for probit marginal models are found in Methods and formulas. We will demonstrate the maximum likelihood method of rocreg by revisiting the models of the previous section. Example 10: Maximum likelihood ROC, single classifier Returning to the hearing test study of Stover et al. (1996), we use a similar covariate grouping as before. The frequency xf and intensity xl are control covariates (z), while all three covariates xf, xl, and hearing loss degree xd are case covariates (x). In example 7, we fit this model using the Alonzo and Pepe (2002) method. Earlier we stratified on the control covariates and estimated the conditioned control distribution of nsnr empirically. Now we assume a normal linear model for nsnr on xf and xl under the control population. We fit the model by specifying the control covariates in the ctrlcov() option and the case covariates in the roccov() option. The ml option tells rocreg to perform maximum likelihood estimation. 1994 rocreg — Receiver operating characteristic (ROC) regression . use http://www.stata-press.com/data/r13/dp, clear (Stover - DPOAE test data) . 
rocreg d nsnr, ctrlcov(xf xl) roccov(xf xl xd) probit ml cluster(id) nolog Parametric ROC estimation Covariate control : linear regression Control variables : xf xl Control standardization: normal ROC method : parametric Link: probit Status : d Classifiers: nsnr Classifier : nsnr Covariate control adjustment model: (Std. Err. adjusted for 208 clusters in id) Coef. Robust Std. Err. z P>|z| [95% Conf. Interval] casecov xf xl xd _cons .4690907 -3.187785 3.042998 23.48064 .1408683 .8976521 .3569756 5.692069 3.33 -3.55 8.52 4.13 0.001 0.000 0.000 0.000 .192994 -4.947151 2.343339 12.32439 .7451874 -1.42842 3.742657 34.63689 _cons 7.979708 .354936 22.48 0.000 7.284047 8.67537 xf xl _cons -.1447499 -.8631348 1.109477 .0615286 .2871976 1.964004 -2.35 -3.01 0.56 0.019 0.003 0.572 -.2653438 -1.426032 -2.7399 -.0241561 -.3002378 4.958854 _cons 7.731203 .3406654 22.69 0.000 7.063511 8.398894 casesd ctrlcov ctrlsd Status : d ROC Model : (Std. Err. adjusted for 208 clusters in id) Coef. Robust Std. Err. z P>|z| [95% Conf. Interval] nsnr i_cons xf xl xd s_cons 2.942543 .0587854 -.3994865 .381342 .9688578 .7569821 .0175654 .1171914 .0449319 .0623476 3.89 3.35 -3.41 8.49 15.54 0.000 0.001 0.001 0.000 0.000 1.458885 .024358 -.6291775 .2932771 .8466587 4.426201 .0932129 -.1697955 .4694068 1.091057 We find the results are similar to those of example 7. Frequency (xf) and intensity (xl) have a negative effect on the classifier nsnr in the control population. The negative control effect is mitigated for xf in the case population, but the effect for xl is even more negative there. Hearing loss severity, xd, has a positive effect on nsnr in the case population, and it is undefined in the control population. The ROC coefficients are shown in the ROC Model table. Each are different from 0 at the 0.05 level. At this level, we also cannot conclude that the variances differ from case to control populations, because 1 is in the 95% confidence interval for s cons, the ratio of the case to control standard deviation parameters. rocreg — Receiver operating characteristic (ROC) regression 1995 Both frequency (xf) and hearing loss severity (xd) make a positive contribution to the ROC curve and thus make the test more powerful. Intensity (xl) has a negative effect on the ROC curve and weakens the test. We previously saw in example 5 that the control distribution appears to be normal, so using maximum likelihood to fit this model is a reasonable approach. This model was also fit in Pepe (2003, 147). Pepe used separate least-squares estimates for the case and control samples. We obtain similar results for the coefficients, but the maximum likelihood fitting yields slightly different standard deviations by considering both case and control observations concurrently. In addition, a misprint in Pepe (2003, 147) reports a coefficient of −4.91 for xl in the case population instead of −3.19 as reported by Stata. Inference on multiple classifiers using the Alonzo and Pepe (2002) estimating equation method is performed by fitting each model separately and bootstrapping to determine the dependence of the estimates. Using the maximum likelihood method, we also fit each model separately. We use suest (see [R] suest) to estimate the joint variance–covariance of our parameter estimates. For our models, we can view the score equation for each model as an estimating equation. 
The estimate that solves the estimating equation (that makes the score 0) is asymptotically normal with a variance matrix that can be estimated using the inverse of the squared scores. By stacking the score equations of the separate models, we can estimate the variance matrix for all the parameter estimates by using this rule. This is an informal explanation; further details can be found in [R] suest and in the references Rogers (1993); White (1982 and 1996). Now we will examine a case with multiple classification variables. Example 11: Maximum likelihood ROC, multiple classifiers We return to the neonatal audiology study with gender and age covariates (Norton et al. 2000). In example 6, we fit a model with male and currage as control covariates, and currage as a ROC covariate for the classifier y1 (DPOAE 65 at 2 kHz). We will refit this model, extending it to include the classifier y2 (TEOAE 80 at 2 kHz). 1996 rocreg — Receiver operating characteristic (ROC) regression . use http://www.stata-press.com/data/r13/nnhs (Norton - neonatal audiology data) . rocreg d y1 y2, probit ml ctrlcov(currage male) roccov(currage) cluster(id) > nolog Parametric ROC estimation Covariate control : linear regression Control variables : currage male Control standardization: normal ROC method : parametric Link: probit Status : d Classifiers: y1 y2 Classifier : y1 Covariate control adjustment model: Coef. casecov currage _cons Std. Err. z P>|z| [95% Conf. Interval] .494211 -15.00403 .2126672 8.238094 2.32 -1.82 0.020 0.069 .077391 -31.1504 .9110311 1.142338 _cons 8.49794 .4922792 17.26 0.000 7.533091 9.46279 ctrlcov currage male _cons -.2032048 .2369359 -1.23534 .0323803 .2201391 1.252775 -6.28 1.08 -0.99 0.000 0.282 0.324 -.266669 -.1945288 -3.690734 -.1397406 .6684006 1.220055 7.749156 .0782225 99.07 0.000 7.595843 7.902469 P>|z| [95% Conf. Interval] casesd ctrlsd _cons Classifier : y2 Covariate control adjustment model: Coef. casecov currage _cons Std. Err. z .5729861 -18.2597 .2422662 9.384968 2.37 -1.95 0.018 0.052 .0981532 -36.6539 1.047819 .1344949 _cons 9.723858 .5632985 17.26 0.000 8.619813 10.8279 ctrlcov currage male _cons -.1694575 .7122587 -5.651728 .0291922 .1993805 1.129452 -5.80 3.57 -5.00 0.000 0.000 0.000 -.2266732 .3214802 -7.865415 -.1122419 1.103037 -3.438042 6.986167 .0705206 99.07 0.000 6.84795 7.124385 casesd ctrlsd _cons rocreg — Receiver operating characteristic (ROC) regression 1997 Status : d ROC Model : (Std. Err. adjusted for 2741 clusters in id) Coef. Robust Std. Err. z P>|z| [95% Conf. Interval] y1 i_cons currage s_cons -1.765608 .0581566 .9118864 1.105393 .0290177 .0586884 -1.60 2.00 15.54 0.110 0.045 0.000 -3.932138 .0012828 .7968593 .4009225 .1150303 1.026913 i_cons currage s_cons -1.877825 .0589258 .7184563 .905174 .0235849 .0565517 -2.07 2.50 12.70 0.038 0.012 0.000 -3.651933 .0127002 .607617 -.1037167 .1051514 .8292957 y2 Both classifiers have similar results. The results for y1 show the same direction as the estimating equation results in example 6. However, we can now reject the null hypothesis that the ROC currage coefficient is 0 at the 0.05 level. In example 6, we could not reject that the slope parameter s cons was 1 and that the constant intercept or ROC coefficient for current age was 0. The resulting ROC curve implied a noninformative test using y1 as a classifier. This is not the case with our current results. As currage increases, we expect a steeper ROC curve and thus a more powerful test, for both classifiers y1 (DPOAE 65 at 2 kHz) and y2 (TEOAE 80 at 2 kHz). 
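To see what this means on the probability scale, we can plug the reported y1 coefficients into the probit ROC formula ROC(u) = Φ{β0 + β·currage + αΦ⁻¹(u)}. The ages (35 and 50 months) and the false-positive rate of 0.2 below are arbitrary illustration values, and these hand calculations carry no standard errors; predict after rocreg (see [R] rocreg postestimation) is the supported way to obtain such estimates with inference.

* ROC value of y1 at a false-positive rate of 0.2, evaluated at two ages,
* using the coefficients reported above (illustration only)
display normal(-1.765608 + .0581566*35 + .9118864*invnormal(.2))
display normal(-1.765608 + .0581566*50 + .9118864*invnormal(.2))

These give a ROC value at a false-positive rate of 0.2 of roughly 0.31 at 35 months versus roughly 0.65 at 50 months, so the same test discriminates much better for older infants.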
In example 10, the clustering of observations within infant id was adjusted in the individual fit of nsnr. In our current example, the adjustment for the clustering of observations within id is performed during concurrent estimation, as opposed to during the individual classifier fits (as in example 10). This adjustment, performed by suest, is still accurate. Now we will fit constant probit models and compare rocreg with rocfit and roccomp with the binormal option. Our first applications of rocfit and roccomp are taken directly from examples 8 and 9. The Dorfman and Alf (1969) algorithm that rocfit works with uses discrete classifiers or uses slicing to make a classifier discrete. So we are applying the maximum likelihood method of rocreg on discrete classification data here, where it expects continuous data. We expect to see some discrepancies, but we do not find great divergence in the estimates. After revisiting examples 8 and 9, we will fit a probit model with a continuous classifier and no covariates using rocreg, and we will compare the results with those from rocfit. Example 12: Maximum likelihood ROC, marginal model Using the Hanley and McNeil (1982) data, discussed in example 1 and in example 8, we fit a constant probit model of the classifier rating with true status disease. rocreg is invoked with the ml option and compared with rocfit. 1998 rocreg — Receiver operating characteristic (ROC) regression . use http://www.stata-press.com/data/r13/hanley, clear . rocfit disease rating, nolog Binormal model of disease on rating Number of obs Goodness-of-fit chi2(2) = 0.21 Prob > chi2 = 0.9006 Log likelihood = -123.64855 Coef. Std. Err. z = 109 P>|z| [95% Conf. Interval] intercept slope (*) 1.656782 0.713002 0.310456 0.215882 5.34 -1.33 0.000 0.184 1.048300 0.289881 2.265265 1.136123 /cut1 /cut2 /cut3 /cut4 0.169768 0.463215 0.766860 1.797938 0.165307 0.167235 0.174808 0.299581 1.03 2.77 4.39 6.00 0.304 0.006 0.000 0.000 -0.154227 0.135441 0.424243 1.210770 0.493764 0.790990 1.109477 2.385106 Index Estimate Indices from binormal fit Std. Err. [95% Conf. Interval] ROC area delta(m) d(e) d(a) 0.911331 2.323671 1.934361 1.907771 0.029506 0.502370 0.257187 0.259822 0.853501 1.339044 1.430284 1.398530 (*) z test for slope==1 . rocreg disease rating, probit ml nolog Parametric ROC estimation Control standardization: normal ROC method : parametric Status : disease Classifiers: rating Classifier : rating Covariate control adjustment model: Coef. Std. Err. 2.3357 .2334285 _cons 1.117131 ctrlcov _cons casecov _cons 0.969161 3.308298 2.438438 2.417012 Link: probit z P>|z| [95% Conf. Interval] 10.01 0.000 1.878188 2.793211 .1106124 10.10 0.000 .9003344 1.333927 2.017241 .1732589 11.64 0.000 1.67766 2.356823 1.319501 .1225125 10.77 0.000 1.07938 1.559621 casesd ctrlsd _cons Status : disease ROC Model : Coef. rating i_cons s_cons auc 2.090802 1.181151 .9116494 Std. Err. .2941411 .1603263 .0261658 z 7.11 7.37 34.84 P>|z| [95% Conf. Interval] 0.000 0.000 0.000 1.514297 .8669177 .8603654 2.667308 1.495385 .9629333 rocreg — Receiver operating characteristic (ROC) regression 1999 We compare the estimates for these models: slope SE of slope intercept SE of intercept AUC SE of AUC rocfit 0.7130 0.2159 1.6568 0.3105 0.9113 0.0295 rocreg, ml 1.1812 0.1603 2.0908 0.2941 0.9116 0.0262 We find that both the intercept and the slope are estimated as higher with the maximum likelihood method under rocreg than with rocfit. The AUC (ROC area in rocfit) is close for both commands. 
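As a quick arithmetic check (not part of either command's output), the AUC column can be reproduced from the intercept and slope estimates with the binormal formula AUC = Φ{β0/√(1 + α²)} given in Methods and formulas:

* Binormal AUC implied by each fit: AUC = normal(intercept/sqrt(1 + slope^2))
display normal(1.656782/sqrt(1 + 0.713002^2))   // rocfit:     about .9113
display normal(2.090802/sqrt(1 + 1.181151^2))   // rocreg, ml: about .9116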
We find that the standard errors of each of these estimates is slightly lower under rocreg than rocfit as well. Both rocfit and rocreg suggest that the slope parameter of the ROC curve (slope in rocfit and s cons in rocreg) is not significantly different from 1. Thus we cannot reject that the classifier has the same variance in both case and control populations. There is, however, significant evidence that the intercepts (i cons in rocreg and intercept in rocfit) differ from 0. Because of the positive direction of the intercept estimates, the ROC curve for rating as a classifier of disease suggests that rating provides an informative test. This is also suggested by the high AUC, which is significantly different from 0.5, that is, a flip of a coin. Example 13: Maximum likelihood ROC, marginal model, multiple classifiers We use the fictitious dataset generated from Hanley and McNeil (1983), which we previously used in example 2 and in example 9. To fit a probit model using roccomp, we specify the binormal option. We perform parametric, maximum likelihood ROC analysis using rocreg. We use rocregplot to plot the ROC curves created by rocreg. . use http://www.stata-press.com/data/r13/ct2, clear . roccomp status mod1 mod2 mod3, summary binormal graph aspectratio(1) > plot1opts(connect(i) msymbol(o)) > plot2opts(connect(i) msymbol(s)) > plot3opts(connect(i) msymbol(t)) > legend(label(1 "mod1") label(3 "mod2") label(5 "mod3") > label(2 "mod1 fit") label(4 "mod2 fit") label(6 "mod3 fit") > order(1 3 5 2 4 6) cols(1)) title(roccomp) name(a) nodraw Fitting binormal model for: mod1 Fitting binormal model for: mod2 Fitting binormal model for: mod3 ROC Obs Area Std. Err. [95% Conf. Interval] mod1 mod2 mod3 112 112 112 0.8945 0.9382 0.9376 0.0305 0.0264 0.0223 Ho: area(mod1) = area(mod2) = area(mod3) chi2(2) = 8.27 Prob>chi2 = 0.0160 0.83482 0.88647 0.89382 0.95422 0.99001 0.98139 2000 rocreg — Receiver operating characteristic (ROC) regression . rocreg status mod1 mod2 mod3, probit ml nolog Parametric ROC estimation Control standardization: normal ROC method : parametric Link: probit Status : status Classifiers: mod1 mod2 mod3 Classifier : mod1 Covariate control adjustment model: Coef. Std. Err. z P>|z| [95% Conf. Interval] 2.118135 .2165905 9.78 0.000 1.693626 2.542645 _cons 1.166078 .1122059 10.39 0.000 .9461589 1.385998 ctrlcov _cons 2.344828 .1474147 15.91 0.000 2.0559 2.633755 1.122677 .1042379 10.77 0.000 .9183746 1.32698 P>|z| [95% Conf. Interval] casecov _cons casesd ctrlsd _cons Classifier : mod2 Covariate control adjustment model: Coef. casecov _cons Std. Err. z 2.659642 .2072731 12.83 0.000 2.253395 3.06589 _cons 1.288468 .1239829 10.39 0.000 1.045466 1.53147 ctrlcov _cons 1.655172 .1105379 14.97 0.000 1.438522 1.871823 .8418313 .0781621 10.77 0.000 .6886365 .9950262 P>|z| [95% Conf. Interval] casesd ctrlsd _cons Classifier : mod3 Covariate control adjustment model: Coef. casecov _cons Std. Err. z 2.353768 .1973549 11.93 0.000 1.966959 2.740576 _cons 1.143359 .1100198 10.39 0.000 .9277243 1.358994 ctrlcov _cons 2.275862 .1214094 18.75 0.000 2.037904 2.51382 .9246267 .0858494 10.77 0.000 .7563649 1.092888 casesd ctrlsd _cons rocreg — Receiver operating characteristic (ROC) regression 2001 Status : status ROC Model : Coef. Robust Std. Err. z P>|z| [95% Conf. 
Interval] mod1 i_cons s_cons auc 1.81646 .9627801 .904657 .3144804 .1364084 .0343518 5.78 7.06 26.34 0.000 0.000 0.000 1.20009 .6954245 .8373287 2.432831 1.230136 .9719853 i_cons s_cons auc 2.064189 .6533582 .9580104 .3267274 .1015043 .0219713 6.32 6.44 43.60 0.000 0.000 0.000 1.423815 .4544135 .9149473 2.704563 .8523029 1.001073 i_cons s_cons auc 2.058643 .8086932 .9452805 .2890211 .1163628 .0236266 7.12 6.95 40.01 0.000 0.000 0.000 1.492172 .5806262 .8989732 2.625113 1.03676 .9915877 mod2 mod3 Ho: All classifiers have equal AUC values. Ha: At least one classifier has a different AUC value. P-value: .0808808 . rocregplot, title(rocreg) nodraw name(b) > plot1opts(msymbol(o)) plot2opts(msymbol(s)) plot3opts(msymbol(t)) . graph combine a b, xsize(5) rocreg 0 0.00 0.25 Sensitivity 0.50 True−positive rate (ROC) .25 .5 .75 0.75 1 1.00 roccomp 0.00 0.25 0.50 0.75 1−Specificity 1.00 0 .25 .5 .75 False−positive rate mod1 mod1 mod2 mod2 mod3 mod3 mod1 fit mod1 Fit mod2 fit mod2 Fit mod3 fit mod3 Fit We compare the AUC estimates for these models: roccomp mod1 0.8945 mod2 0.9382 mod3 0.9376 1 rocreg, ml 0.9047 0.9580 0.9453 Each classifier has a higher estimated AUC under rocreg than roccomp. Each curve appears to be raised and smoothed in the rocreg fit as compared with roccomp. They are different, but not drastically different. The inference on whether the curve areas are the same is similar to example 9. We reject equality at the 0.10 level under rocreg and at the 0.05 level under roccomp. 2002 rocreg — Receiver operating characteristic (ROC) regression Each intercept is significantly different from 0 at the 0.05 level and is estimated in a positive direction. Though all but classifier mod2 has 1 in their slope confidence intervals, the high intercepts suggest steep ROC curves and powerful tests. Also note that the false-positive and true-positive rate points are calculated empirically in the roccomp graph and parametrically in rocreg. In example 9, the false-positive rates calculated by rocreg were calculated empirically, similar to roccomp. But in this example, the rates are calculated based on normal percentiles. Now we will generate an example to compare rocfit and rocreg under maximum likelihood estimation of a continuous classifier. Example 14: Maximum likelihood ROC, graphical comparison with rocfit We generate 500 realizations of a population under threat of disease. One quarter of the population has the disease. A classifier x is measured, which has a control distribution of N (1, 3) and a case distribution of N (1 + 5, 2). We will invoke rocreg with the ml option on this generated data. We specify the continuous() option for rocfit and invoke it on the data as well. The continuous() option tells rocfit how many discrete slices to partition the data into before fitting. For comparison of the two curves, we will use the rocfit postestimation command, rocplot; see [R] rocfit postestimation. This command graphs the empirical false-positive and true-positive rates with an overlaid fit of the binormal curve estimated by rocfit. rocplot also supports an addplot() option. We use the saved variables from rocreg in this option to overlay a line plot of the rocreg fit. . clear . set seed 8675309 . set obs 500 obs was 0, now 500 . generate d = runiform() < .25 . quietly generate double epsilon = 3*invnormal(runiform()) if d == 0 . quietly replace epsilon = 2*invnormal(runiform()) if d == 1 . quietly generate double x = 1 + d*5 + epsilon rocreg — Receiver operating characteristic (ROC) regression . 
rocreg d x, probit ml nolog Parametric ROC estimation Control standardization: normal ROC method : parametric Link: probit Status : d Classifiers: x Classifier : x Covariate control adjustment model: Coef. casecov _cons Std. Err. z P>|z| [95% Conf. Interval] 4.905612 .2411624 20.34 0.000 4.432943 5.378282 _cons 2.038278 .1299559 15.68 0.000 1.783569 2.292987 ctrlcov _cons 1.010382 .1561482 6.47 0.000 .7043377 1.316427 3.031849 .1104134 27.46 0.000 2.815443 3.248255 P>|z| [95% Conf. Interval] 0.000 0.000 0.000 2.026969 1.273394 .8853318 casesd ctrlsd _cons Status : d ROC Model : Coef. Std. Err. z x i_cons s_cons auc 2.406743 1.487456 .9103292 .193766 .1092172 .012754 12.42 13.62 71.38 . rocfit d x, continuous(10) nolog Binormal model of d on x Goodness-of-fit chi2(7) = 1.69 Prob > chi2 = 0.9751 Log likelihood = -911.91338 Coef. Number of obs = 2.786518 1.701518 .9353266 500 Std. Err. z P>|z| [95% Conf. Interval] intercept slope (*) 2.207250 1.281443 0.232983 0.158767 9.47 1.77 0.000 0.076 1.750611 0.970265 2.663888 1.592620 /cut1 /cut2 /cut3 /cut4 /cut5 /cut6 /cut7 /cut8 /cut9 -1.895707 -1.326900 -0.723677 -0.116960 0.442769 1.065183 1.689570 2.495841 3.417994 0.130255 0.089856 0.070929 0.064666 0.066505 0.075744 0.102495 0.185197 0.348485 -14.55 -14.77 -10.20 -1.81 6.66 14.06 16.48 13.48 9.81 0.000 0.000 0.000 0.070 0.000 0.000 0.000 0.000 0.000 -2.151001 -1.503015 -0.862695 -0.243702 0.312422 0.916728 1.488683 2.132861 2.734976 -1.640412 -1.150784 -0.584660 0.009782 0.573116 1.213637 1.890457 2.858821 4.101012 2003 2004 rocreg — Receiver operating characteristic (ROC) regression Index Estimate Indices from binormal fit Std. Err. [95% Conf. Interval] ROC area delta(m) d(e) d(a) 0.912757 1.722473 1.934960 1.920402 0.013666 0.127716 0.125285 0.121804 0.885972 1.472153 1.689405 1.681670 0.939542 1.972792 2.180515 2.159135 0 .25 Sensitivity .5 .75 1 (*) z test for slope==1 . rocplot, plotopts(msymbol(i)) lineopts(lpattern(dash)) > norefline addplot(line _roc_x _fpr_x, sort(_fpr_x _roc_x) > lpattern(solid)) aspectratio(1) legend(off) 0 .25 .5 1 − Specificity .75 1 Area under curve = 0.9128 se(area) = 0.0137 We find that the curves are close. As before, the rocfit estimates are lower for the slope and intercept than under rocreg. The AUC estimates are close. Though the slope confidence interval contains 1, a high ROC intercept suggests a steep ROC curve and thus a powerful test. 
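Because the data were simulated, we can also compare both fits against the true parameter values implied by the data-generating process. Under the reparameterization used by rocreg (intercept equals the case mean shift divided by the case standard deviation, slope equals the control-to-case standard deviation ratio), the truth here is an intercept of 5/2 = 2.5, a slope of 3/2 = 1.5, and a binormal AUC of about 0.917. The check below uses only the simulation's known parameters, not any command output.

* True binormal ROC parameters implied by the simulation:
* control x ~ N(1, 3^2), case x ~ N(1 + 5, 2^2)
display (6 - 1)/2                     // true intercept, 2.5
display 3/2                           // true slope, 1.5
display normal(2.5/sqrt(1 + 1.5^2))   // true AUC, about .917

Both fits recover these values to within roughly 1.5 standard errors, with the rocreg intercept and slope estimates (2.41 and 1.49) somewhat closer to the truth than the rocfit estimates (2.21 and 1.28).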
rocreg — Receiver operating characteristic (ROC) regression Stored results Nonparametric rocreg stores the following in e(): Scalars e(N) e(N strata) e(N clust) e(rank) Macros e(cmd) e(cmdline) e(classvars) e(refvar) e(ctrlmodel) e(ctrlcov) e(pvc) e(title) e(tiecorrected) e(nobootstrap) e(bseed) e(breps) e(cc) e(nobstrata) e(clustvar) e(roc) e(invroc) e(pauc) e(auc) e(vce) e(properties) Matrices e(b) e(V) e(b bs) e(bias) e(se) e(z0) e(ci normal) e(ci percentile) e(ci bc) coefficient vector variance–covariance matrix of the estimators bootstrap estimates estimated biases estimated standard errors median biases normal-approximation confidence intervals percentile confidence intervals bias-corrected confidence intervals Functions e(sample) marks estimation sample number number number rank of of observations of covariate strata of clusters e(V) rocreg command as typed classification variable list status variable, reference variable covariate-adjustment specification covariate-adjustment variables percentile value calculation method title in estimation output indicates whether tie correction was used indicates that bootstrap was performed seed used in bootstrap, if bootstrap performed number of bootstrap resamples, if bootstrap performed indicates whether case–control groups were used as resampling strata indicates whether resampling should stratify based on control covariates name of cluster variable false-positive rates where ROC was estimated ROC values where false-positive rates were estimated false-positive rates where pAUC was estimated indicates that AUC was calculated bootstrap b V (or b if bootstrap not performed) 2005 2006 rocreg — Receiver operating characteristic (ROC) regression Parametric, bootstrap rocreg stores the following in e(): Scalars e(N) e(N strata) e(N clust) e(rank) Macros e(cmd) e(cmdline) e(title) e(classvars) e(refvar) e(ctrlmodel) e(ctrlcov) e(pvc) e(title) e(tiecorrected) e(probit) e(roccov) e(fprpts) e(ctrlfprall) e(nobootstrap) e(bseed) e(breps) e(cc) e(nobstrata) e(clustvar) e(vce) e(properties) e(predict) Matrices e(b) e(V) e(b bs) e(reps) e(bias) e(se) e(z0) e(ci normal) e(ci percentile) e(ci bc) Functions e(sample) number number number rank of of observations of covariate strata of clusters e(V) rocreg command as typed title in estimation output classification variable list status variable, reference variable covariate-adjustment specification covariate-adjustment variables percentile value calculation method title in estimation output indicates whether tie correction was used probit ROC covariates number of points used as false-positive rate indicates whether all observed false-positive indicates that bootstrap was performed seed used in bootstrap number of bootstrap resamples indicates whether case–control groups were indicates whether resampling should stratify name of cluster variable bootstrap b V (or b if nobootstrap is specified) program used to implement predict coefficient vector variance–covariance matrix of the estimators bootstrap estimates number of nonmissing results estimated biases estimated standard errors median biases normal-approximation confidence intervals percentile confidence intervals bias-corrected confidence intervals marks estimation sample fit points rates were used as fit points used as resampling strata based on control covariates rocreg — Receiver operating characteristic (ROC) regression 2007 Parametric, maximum likelihood rocreg stores the following in e(): Scalars e(N) e(N clust) e(rank) Macros e(cmd) e(cmdline) 
e(classvars) e(refvar) e(ctrlmodel) e(ctrlcov) e(roccov) e(probit) e(pvc) e(wtype) e(wexp) e(title) e(clustvar) e(vce) e(vcetype) e(ml) e(predict) Matrices e(b) e(V) Functions e(sample) number of observations number of clusters rank of e(V) rocreg command as typed classification variable list status variable linear control population covariates ROC covariates probit normal weight type weight expression title in estimation output name of cluster variable cluster if clustering used robust if multiple classifiers or clustering used indicates that maximum likelihood estimation was used program used to implement predict coefficient vector variance–covariance matrix of the estimators marks estimation sample Methods and formulas Assume that we applied a diagnostic test to each of N0 control and N1 case subjects. Further assume that the higher the outcome value of the diagnostic test, the higher the risk of the subject being abnormal. Let y1i , i = 1, 2, . . . , N1 , and y0j , j = 1, 2, . . . , N0 , be the values of the diagnostic test for the case and control subjects, respectively. The true status variable D identifies an observation as case D = 1 or control D = 0. The CDF of the classifier Y is F . Conditional on D, we write the CDF as FD . Methods and formulas are presented under the following headings: ROC statistics Covariate-adjusted ROC curves Parametric ROC curves: Estimating equations Parametric ROC curves: Maximum likelihood 2008 rocreg — Receiver operating characteristic (ROC) regression ROC statistics We obtain these definitions and their estimates from Pepe (2003) and Pepe, Longton, and Janes (2009). The false-positive and true-positive rates at cutoff c are defined as FPR (y) = P Y ≥ y D = 0 TPR (y) = P Y ≥ y D = 1 The true-positive rate, or ROC value at false-positive rate u, is given by ROC (u) = P 1 − F0 (Y ) ≤ u D = 1 When Y is continuous, the false-positive rate can be written as FPR (y) = 1 − F0 (y) The empirical CDF for the sample z1 , . . . , zn is given by Fb(z) = n X I (z < zi ) n i=1 c and Rd The empirical estimates FPR OC both use this empirical CDF estimator. The area under the ROC curve is defined as Z AUC = 1 ROC (u) du 0 The partial area under the ROC curve for false-positive rate a is defined as Z a pAUC (a) = ROC (u) du 0 The nonparametric estimate for the AUC is given by d AUC = N1 X c (y1i ) 1 − FPR i=1 N1 The nonparametric estimate of pAUC is given by N1 X c (y1i ) − (1 − a), 0 max 1 − FPR d (a) = pAUC N1 i=1 For discrete classifiers, a correction term is subtracted from the false-positive rate estimate so that d and pAUC d estimates correspond with a trapezoidal approximation to the area of the ROC the AUC curve. N0 X I (y = y0j ) c b0 (y) − 1 FPR (y) = 1 − F 2 N0 j=1 In the nonparametric estimation of the ROC curve, all inference is performed using the bootstrap command (see [R] bootstrap). rocreg also allows users to calculate the ROC curve and related statistics by assuming a normal control distribution. So these formulas are updated by replacing F0 by Φ (with adjustment of the marginal mean and variance of the control distribution). rocreg — Receiver operating characteristic (ROC) regression 2009 Covariate-adjusted ROC curves Suppose we observe covariate vector Z in addition to the classifier Y . Let Z1i , i = 1, 2, . . . , N1 , and Z0j , j = 1, 2, . . . , N0 , be the values of the covariates for the case and control subjects, respectively. 
The covariate-adjusted ROC curve is defined by Janes and Pepe (2009) as

AROC(t) = E{ ROC(t | Z0) }

It is calculated by replacing the marginal control CDF estimate, F̂0, with the conditional control CDF estimate, F̂0Z. If we used a normal control CDF, then we would replace the marginal control mean and variance with the conditional control mean and variance. The formulas of the previous section can be updated for covariate adjustment by making this substitution of the conditional CDF for the marginal CDF in the false-positive rate calculation. Because the calculation of the ROC value is now performed based on the conditionally calculated false-positive rate, no further conditioning is made in its calculation under nonparametric estimation.

rocreg supports covariate adjustment with stratification and linear regression. Under stratification, separate parameters are estimated for the control distribution at each level of the covariates. Under linear regression, the classifier is regressed on the covariates over the control distribution, and the resulting coefficients serve as parameters for F̂0Z.

Parametric ROC curves: Estimating equations

Under nonparametric estimation of the ROC curve with covariate adjustment, no further conditioning occurs in the ROC curve calculation beyond the use of covariate-adjusted false-positive rates as inputs. Under parametric estimation of the ROC curve, we can relax this restriction. We model the ROC curve as a cumulative distribution function g (standard normal Φ) applied to a linear polynomial in the corresponding quantile function (here Φ⁻¹) evaluated at the false-positive rate u. The constant intercept of the polynomial may depend on covariates; the slope term α (quantile coefficient) may not.

ROC(u) = g{ x'β + α g⁻¹(u) }

Pepe (2003) notes that having a binormal ROC (g = Φ) is equivalent to specifying that some monotone transformation of the data exists to make the case and control classifiers normally distributed. This specification applies to the marginal case and control distributions.

Under weak assumptions about the control distribution of the classifier, we can fit this model by using estimating equations (Alonzo and Pepe 2002). The method can be used without covariate effects in the second stage, assuming a parametric model for the single ROC curve. Using the Alonzo and Pepe (2002) method, the covariate-adjusted ROC curve may be fit parametrically. The marginal ROC curve, involving no covariates in either stage of estimation, can be fit parametrically as well. In addition to the Alonzo and Pepe (2002) explanation, further details are given in Pepe, Longton, and Janes (2009); Janes, Longton, and Pepe (2009); Pepe (2003); and Janes and Pepe (2009).

The algorithm can be described as follows (a schematic Stata sketch appears after the list):

1. Estimate the false-positive rates of the classifier, fpr. These may be computed in any fashion outlined so far: covariate adjusted, empirically, etc.

2. Determine a set of np false-positive rates to use as fitting points f1, ..., fnp. These may be an equispaced grid on (0, 1) or the set of observed false-positive rates from step 1.

3. Expand the case observation portion of the data to include a subobservation for each fitting point, so there are now N1(np − 1) additional observations in the data.

4. Generate a new dummy variable u. For subobservation j, u = I(fpr ≤ fj).

5. Generate a new variable quant containing the quantiles of the false-positive rate fitting points. For subobservation j, quant = g⁻¹(fj).

6. Perform a binary regression (probit, g = Φ) of u on the covariates x and the quantile variable quant.
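To make these steps concrete, here is a minimal do-file sketch of the algorithm for a marginal probit model (no ROC covariates), using the Hanley and McNeil data from example 8. The variable names (fpr, caseid, u, quant), the 10-point equispaced grid, and the clustered standard errors are illustrative choices, not rocreg's internal implementation; the sketch also ignores the tie correction, and real inference would bootstrap the entire algorithm.

* Schematic sketch of the estimating-equations algorithm, marginal probit model
use http://www.stata-press.com/data/r13/hanley, clear

* Step 1: empirical false-positive rate (placement value) of each case rating,
* computed from the control sample
quietly count if disease == 0
local N0 = r(N)
generate double fpr = .
forvalues i = 1/`=_N' {
    if disease[`i'] == 1 {
        quietly count if disease == 0 & rating >= rating[`i']
        quietly replace fpr = r(N)/`N0' in `i'
    }
}

* Steps 2-3: keep the cases and expand each over np = 10 fitting points
local np 10
keep if disease == 1
generate long caseid = _n
expand `np'
bysort caseid: generate int j = _n
generate double f = j/(`np' + 1)        // equispaced fitting points in (0,1)

* Steps 4-5: dummy outcome and fitting-point quantiles
generate byte u = (fpr <= f)
generate double quant = invnormal(f)

* Step 6: probit regression; _b[_cons] estimates beta0 and _b[quant] estimates
* alpha (bootstrapping the whole algorithm would give the reported inference)
probit u quant, vce(cluster caseid)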
The coefficients from step 6 are the coefficients of the ROC model. The coefficients of the covariates coincide naturally with estimates of β, and the α parameter is estimated by the coefficient on quant. Because the method is so general and makes few distributional assumptions, bootstrapping must be performed for inference. If multiple classifiers are to be fit, the algorithm is performed separately for each classifier in each bootstrap replication, and the bootstrap is used to estimate covariances.

We mentioned earlier that in parametric estimation, the AUC was the only summary parameter that could be estimated initially. This is true when we fit the marginal probit model because there are no covariates in step 6 of the algorithm. To calculate the AUC statistic under a marginal probit model, we use the formula

AUC = Φ{ β0 / √(1 + α²) }

Alternatively, the AUC for the probit model can be calculated as pAUC(1) in postestimation. Under both models, bootstrapping is performed for inference on the AUC.

Parametric ROC curves: Maximum likelihood

rocreg supports another form of parametric ROC estimation: maximum likelihood with a normally distributed classifier. This method assumes that the classifier is a normal linear model on certain covariates, and the covariate effect and variance of the classifier may change between the case and control populations. The model is defined in Pepe (2003, 145):

y = z'β0 + D x'β1 + σ(D) ε

Our error term, ε, is a standard normal random variable. The variable D is our true status variable, being 1 for the case population observations and 0 for the control population observations. The variance function σ is defined as

σ(D) = σ0 I(D = 0) + σ1 I(D = 1)

This provides two variance parameters in the model and does not depend on covariate values. Under this model, the ROC curve is easily derived to be

ROC(u) = Φ[ {x'β1 + σ0 Φ⁻¹(u)} / σ1 ]

We reparameterize the model, creating the parameters βi = β1i/σ1 and α = σ0/σ1. We refer to β0 as the constant intercept, i_cons. The parameter α is referred to as the constant slope, s_cons. With this parameterization,

ROC(u) = Φ{ x'β + αΦ⁻¹(u) }

The original model defining the classifier y leads to the following single-observation likelihoods for D = 0 and D = 1:

L(β0, β1, σ1, σ0; D = 0, y, z, x) = {1 / (√(2π) σ0)} exp{ −(y − z'β0)² / (2σ0²) }

L(β0, β1, σ1, σ0; D = 1, y, z, x) = {1 / (√(2π) σ1)} exp{ −(y − z'β0 − x'β1)² / (2σ1²) }

These can be combined to yield the observation-level log likelihood:

lnL(β0, β1, σ1, σ0; D, y, z, x) = −(ln 2π)/2
    − I(D = 0){ lnσ0 + (y − z'β0)² / (2σ0²) }
    − I(D = 1){ lnσ1 + (y − z'β0 − x'β1)² / (2σ1²) }

When there are multiple classifiers, each classifier is fit separately with maximum likelihood. Then the results are combined by stacking the scores and using the sandwich variance estimator. For more information, see [R] suest and the references White (1982); Rogers (1993); and White (1996).

Acknowledgments

We thank Margaret S. Pepe, Holly Janes, and Gary Longton of the Fred Hutchinson Cancer Research Center for providing the inspiration for the rocreg command and for illuminating many useful datasets for its documentation.

References

Alonzo, T. A., and M. S. Pepe. 2002. Distribution-free ROC analysis using binary regression techniques. Biostatistics 3: 421–432.

Cleves, M. A. 1999. sg120: Receiver operating characteristic (ROC) analysis. Stata Technical Bulletin 52: 19–33.
Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 212–229. College Station, TX: Stata Press. . 2000. sg120.2: Correction to roccomp command. Stata Technical Bulletin 54: 26. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 231. College Station, TX: Stata Press. . 2002a. Comparative assessment of three common algorithms for estimating the variance of the area under the nonparametric receiver operating characteristic curve. Stata Journal 2: 280–289. . 2002b. From the help desk: Comparing areas under receiver operating characteristic curves from two or more probit or logit models. Stata Journal 2: 301–313. Cook, N. R. 2007. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation 115: 928–935. DeLong, E. R., D. M. DeLong, and D. L. Clarke-Pearson. 1988. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 44: 837–845. Dodd, L. E., and M. S. Pepe. 2003. Partial AUC estimation and regression. Biometrics 59: 614–623. Dorfman, D. D., and E. Alf, Jr. 1969. Maximum-likelihood estimation of parameters of signal-detection theory and determination of confidence intervals–rating-method data. Journal of Mathematical Psychology 6: 487–496. 2012 rocreg — Receiver operating characteristic (ROC) regression Hanley, J. A., and K. O. Hajian-Tilaki. 1997. Sampling variability of nonparametric estimates of the areas under receiver operating characteristic curves: An update. Academic Radiology 4: 49–58. Hanley, J. A., and B. J. McNeil. 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143: 29–36. . 1983. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 148: 839–843. Janes, H., G. M. Longton, and M. S. Pepe. 2009. Accommodating covariates in receiver operating characteristic analysis. Stata Journal 9: 17–39. Janes, H., and M. S. Pepe. 2009. Adjusting for covariate effects on classification accuracy using the covariate-adjusted receiver operating characteristic curve. Biometrika 96: 371–382. McClish, D. K. 1989. Analyzing a portion of the ROC curve. Medical Decision Making 9: 190–195. Mooney, C. Z., and R. D. Duval. 1993. Bootstrapping: A Nonparametric Approach to Statistical Inference. Newbury Park, CA: Sage. Norton, S. J., M. P. Gorga, J. E. Widen, R. C. Folsom, Y. Sininger, B. Cone-Wesson, B. R. Vohr, K. Mascher, and K. Fletcher. 2000. Identification of neonatal hearing impairment: Evaluation of transient evoked otoacoustic emission, distortion product otoacoustic emission, and auditory brain stem response test performance. Ear and Hearing 21: 508–528. Pepe, M. S. 1998. Three approaches to regression analysis of receiver operating characteristic curves for continuous test results. Biometrics 54: 124–135. . 2000. Receiver operating characteristic methodology. Journal of the American Statistical Association 95: 308–311. . 2003. The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press. Pepe, M. S., and T. Cai. 2004. The analysis of placement values for evaluating discriminatory measures. Biometrics 60: 528–535. Pepe, M. S., G. M. Longton, and H. Janes. 2009. Estimation and comparison of receiver operating characteristic curves. Stata Journal 9: 1–16. Rogers, W. H. 1993. sg16.4: Comparison of nbreg and glm for negative binomial. Stata Technical Bulletin 16: 7. Reprinted in Stata Technical Bulletin Reprints, vol. 
3, pp. 82–84. College Station, TX: Stata Press. Stover, L., M. P. Gorga, S. T. Neely, and D. Montoya. 1996. Toward optimizing the clinical utility of distortion product otoacoustic emission measurements. Journal of the Acoustical Society of America 100: 956–967. Thompson, M. L., and W. Zucchini. 1989. On the statistical analysis of ROC curves. Statistics in Medicine 8: 1277–1290. White, H. L., Jr. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50: 1–25. . 1996. Estimation, Inference and Specification Analysis. Cambridge: Cambridge University Press. Wieand, S., M. H. Gail, B. R. James, and K. L. James. 1989. A family of nonparametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika 76: 585–592. Also see [R] rocreg postestimation — Postestimation tools for rocreg [R] rocregplot — Plot marginal and covariate-specific ROC curves after rocreg [R] rocfit — Parametric ROC models [R] roc — Receiver operating characteristic (ROC) analysis Title rocreg postestimation — Postestimation tools for rocreg Description Options for predict Options for estat nproc Methods and formulas Syntax for predict Syntax for estat nproc Remarks and examples References Menu for predict Menu for estat Stored results Also see Description The following commands are of special interest after rocreg: Command Description estat nproc rocregplot nonparametric ROC curve estimation, keeping fit information from rocreg plot marginal and covariate-specific ROC curves The following standard postestimation commands are also available: Command Description estimates lincom nlcom cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions for parametric ROC curve estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses predict test testnl Special-interest postestimation command The estat nproc command allows calculation of all the ROC curve summary statistics for covariate-specific ROC curves, as well as for a nonparametric ROC estimation. Under nonparametric estimation, a single ROC curve is estimated by rocreg. Covariates can affect this estimation, but there are no separate covariate-specific ROC curves. Thus the input arguments for estat nproc are taken in the command line rather than from the data as variable values. 2013 2014 rocreg postestimation — Postestimation tools for rocreg Syntax for predict predict type newvar if in , statistic options Description statistic Main classvar(varname) input variable for statistic total area under the ROC curve; the default ROC values for given false-positive rates in at() false-positive rate for given ROC values in at() partial area under the ROC curve up to each false-positive rate in at() statistic for given classifier options Description at(varname) auc roc invroc pauc Options points in numeric integration of pAUC calculation predict standard errors produce confidence intervals, stored as variables with prefix stubname and suffixes l and u set confidence level; default is level(95) load dataset containing bootstrap replicates from rocreg produce normal-based (n), percentile (p), or bias-corrected (bc) confidence intervals; default is btype(n) intpts(#) se(newvar) ci(stubname) level(#) bfile(filename, . . . ) ∗ btype(n | p | bc) ∗ ∗ bfile() and btype() are only allowed with parametric analysis using bootstrap inference. 
Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main at(varname) records the variable to be used as input for the above predictions. auc predicts the total area under the ROC curve defined by the covariate values in the data. This is the default statistic. roc predicts the ROC values for false-positive rates stored in varname specified in at(). invroc predicts the false-positive rates for given ROC values stored in varname specified in at(). pauc predicts the partial area under the ROC curve up to each false-positive rate stored in varname specified in at(). classvar(varname) performs the prediction for the specified classifier. rocreg postestimation — Postestimation tools for rocreg 2015 Options intpts(#) specifies that # points be used in the pAUC calculation. se(newvar) specifies that standard errors be produced and stored in newvar. ci(stubname) requests that confidence intervals be produced and the lower and upper bounds be stored in stubname l and stubname u, respectively. level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. bfile(filename, . . . ) uses bootstrap replicates of parameters from rocreg stored in filename to estimate standard errors and confidence intervals of predictions. btype(n | p | bc) specifies whether to produce normal-based (n), percentile (p), or bias-corrected (bc) confidence intervals. The default is btype(n). Syntax for estat nproc estat nproc , estat nproc options Description estat nproc options Main estimate estimate estimate estimate auc roc(numlist) invroc(numlist) pauc(numlist) total area under the ROC curve ROC values for given false-positive rates false-positive rate for given ROC values partial area under the ROC curve up to each false-positive rate At least one option must be specified. Menu for estat Statistics > Postestimation > Reports and statistics Options for estat nproc Main auc estimates the total area under the ROC curve. roc(numlist) estimates the ROC for each of the false-positive rates in numlist. The values in numlist must be in the range (0,1). invroc(numlist) estimates the false-positive rate for each of the ROC values in numlist. The values in numlist must be in the range (0,1). pauc(numlist) estimates the partial area under the ROC curve up to each false-positive rate in numlist. The values in numlist must be in the range (0,1]. 2016 rocreg postestimation — Postestimation tools for rocreg Remarks and examples Remarks are presented under the following headings: Using predict after rocreg Using estat nproc Using predict after rocreg predict, after parametric rocreg, predicts the AUC, the ROC value, the false-positive rate (invROC), or the pAUC value. The default is auc. We begin by estimating the area under the ROC curve for each of the three age-specific ROC curves in example 1 of [R] rocregplot: 30, 40, and 50 months. Example 1: Parametric ROC, AUC In example 6 of [R] rocreg, a probit ROC model was fit to audiology test data from Norton et al. (2000). The estimating equations method of Alonzo and Pepe (2002) was used to fit the model. Gender and age were covariates that affected the control distribution of the classifier y1 (DPOAE 65 at 2 kHz). Age was a ROC covariate for the model, so we fit separate ROC curves at each age. 
Following Janes, Longton, and Pepe (2009), we drew the ROC curves for ages 30, 40, and 50 months in example 1 of [R] rocregplot. Now we use predict to estimate the AUC for the ROC curve at each of those ages. The bootstrap dataset saved by rocreg in example 6 of [R] rocreg, nnhs2y1.dta, is used in the bfile() option. We will store the AUC prediction in the new variable predAUC. We specify the se() option with the new variable name seAUC to produce an estimate of the prediction’s standard error. By specifying the stubname cin in ci(), we tell predict to create normal-based confidence intervals (the default) as new variables cin l and cin u. . use http://www.stata-press.com/data/r13/nnhs (Norton - neonatal audiology data) . rocreg d y1, probit ctrlcov(currage male) ctrlmodel(linear) roccov(currage) > cluster(id) bseed(56930) bsave(nnhs2y1) (output omitted ) . set obs 5061 obs was 5058, now 5061 . quietly replace currage = 30 in 5059 . quietly replace currage = 40 in 5060 . quietly replace currage = 50 in 5061 . predict predAUC in 5059/5061, auc se(seAUC) ci(cin) bfile(nnhs2y1) . list currage predAUC seAUC cin* in 5059/5061 5059. 5060. 5061. currage predAUC seAUC cin_l cin_u 30 40 50 .5209999 .6479176 .7601378 .0712928 .0286078 .0746157 .3812686 .5918474 .6138937 .6607312 .7039879 .9063819 rocreg postestimation — Postestimation tools for rocreg 2017 As expected, we find the AUC to increase with age. Essentially, we have a stored bootstrap sample of ROC covariate coefficient estimates in nnhs2y1.dta. We calculate the AUC using each set of coefficient estimates, resulting in a sample of AUC estimates. Then the bootstrap standard error and confidence intervals are calculated based on this AUC sample. Further details of the computation of the standard error and percentile confidence intervals can be found in Methods and formulas and in [R] bootstrap. We can also produce percentile or bias-corrected confidence intervals by specifying btype(p) or btype(bc), which we now demonstrate. . drop *AUC* . predict predAUC in 5059/5061, auc se(seAUC) ci(cip) bfile(nnhs2y1) btype(p) . list currage predAUC cip* in 5059/5061 5059. 5060. 5061. currage predAUC cip_l cip_u 30 40 50 .5209999 .6479176 .7601378 .3760555 .5893397 .5881404 .6513149 .7032645 .8836223 . drop *AUC* . predict predAUC in 5059/5061, auc se(seAUC) ci(cibc) bfile(nnhs2y1) btype(bc) . list currage predAUC cibc* in 5059/5061 5059. 5060. 5061. currage predAUC cibc_l cibc_u 30 40 50 .5209999 .6479176 .7601378 .3736968 .588947 .5812373 .6500064 .7010052 .8807758 predict can also estimate the ROC value and the false-positive rate (invROC). Example 2: Parametric ROC, invROC, and ROC value In example 7 of [R] rocreg, we fit the ROC curve for status variable hearing loss (d) and classifier negative signal-to-noise ratio nsnr with ROC covariates frequency (xf), intensity (xl), and hearing loss severity (xd). The data were obtained from Stover et al. (1996). The model fit was probit with bootstrap resampling. We saved 50 bootstrap replications in the dataset nsnrf.dta. The covariate value combinations xf = 10.01, xl = 5.5, and xd = .5, and xf = 10.01, xl = 6.5, and xd = 4 are of interest. In example 3 of [R] rocregplot, we estimated the ROC values for false-positive rates 0.2 and 0.7 and the false-positive rate for a ROC value of 0.5 by using rocregplot. We will use predict to replicate the estimation. We begin by appending observations with our desired covariate combinations to the data. 
We also create two new variables: rocinp, which contains the ROC values for which we wish to predict the corresponding invROC values, and invrocinp, which contains the invROC values corresponding to the ROC values we wish to predict. 2018 rocreg postestimation — Postestimation tools for rocreg . clear . input xf xl xd rocinp invrocinp xf xl xd 1. 10.01 5.5 .5 .2 . 2. 10.01 6.5 4 .2 . 3. 10.01 5.5 .5 .7 .5 4. 10.01 6.5 4 .7 .5 5. end . save newdata file newdata.dta saved rocinp invrocinp . use http://www.stata-press.com/data/r13/dp (Stover - DPOAE test data) . quietly rocreg d nsnr, ctrlcov(xf xl) roccov(xf xl xd) probit cluster(id) > nobstrata ctrlfprall bseed(156385) breps(50) ctrlmodel(strata) bsave(nsnrf) . append using newdata . list xf xl xd invrocinp rocinp in 1849/1852 1849. 1850. 1851. 1852. xf xl xd invroc~p rocinp 10.01 10.01 10.01 10.01 5.5 6.5 5.5 6.5 .5 4 .5 4 . . .5 .5 .2 .2 .7 .7 Now we will use predict to estimate the ROC value for the false-positive rates stored in rocinp. We specify the roc option, and we specify rocinp in the at() option. The other options, se() and ci(), are used to obtain standard errors and confidence intervals, respectively. The dataset of bootstrap samples, nsnrf.dta, is specified in bfile(). After prediction, we list the point estimates and standard errors. . predict rocit in 1849/1852, roc at(rocinp) se(seroc) ci(cin) bfile(nsnrf) . list xf xl xd rocinp rocit seroc if !missing(rocit) 1849. 1850. 1851. 1852. xf xl xd rocinp rocit seroc 10.01 10.01 10.01 10.01 5.5 6.5 5.5 6.5 .5 4 .5 4 .2 .2 .7 .7 .7652956 .9672505 .9835816 .999428 .0735506 .0227977 .0204353 .0011309 These results match example 3 of [R] rocregplot. We list the confidence intervals next. These also conform to the rocregplot results from example 3 in [R] rocregplot. We begin with the confidence intervals for ROC under the covariate values xf=10.01, xl=5.5, and xd=.5. rocreg postestimation — Postestimation tools for rocreg 2019 . list xf xl xd rocinp rocit cin* if inlist(_n, 1849, 1851) 1849. 1851. xf xl xd rocinp rocit cin_l cin_u 10.01 10.01 5.5 5.5 .5 .5 .2 .7 .7652956 .9835816 .6211391 .9435292 .9094521 1.023634 Now we list the ROC confidence intervals under the covariate values xf=10.01, xl=6.5, and xd=4. . list xf xl xd rocinp rocit cin* if inlist(_n, 1850, 1852) 1850. 1852. xf xl xd rocinp rocit cin_l cin_u 10.01 10.01 6.5 6.5 4 4 .2 .7 .9672505 .999428 .9225678 .9972115 1.011933 1.001644 Now we will predict the false-positive rate for a ROC value by specifying the invroc option. We pass the invrocinp variable as an argument to the at() option. Again we list the point estimates and standard errors first. . . > . drop ci* predict invrocit in 1849/1852, invroc at(invrocinp) se(serocinv) ci(cin) bfile(nsnrf) list xf xl xd invrocinp invrocit serocinv if !missing(invrocit) 1851. 1852. xf xl xd invroc~p invrocit serocinv 10.01 10.01 5.5 6.5 .5 4 .5 .5 .0615144 .0043298 .0254042 .0045938 These also match those of example 3 of [R] rocregplot. Listing the confidence intervals shows identical results as well. First we list the confidence intervals under the covariate values xf=10.01, xl=5.5, and xd=.5. . list xf xl xd invrocinp invrocit cin* in 1851 1851. xf xl xd invroc~p invrocit cin_l cin_u 10.01 5.5 .5 .5 .0615144 .0117231 .1113057 Now we list the confidence intervals for false-positive rate under the covariate values xf=10.01, xl=6.5, and xd=4. . list xf xl xd invrocinp invrocit cin* in 1852 1852. 
xf xl xd invroc~p invrocit cin_l cin_u 10.01 6.5 4 .5 .0043298 -.004674 .0133335 The predict command can also be used after a maximum-likelihood ROC model is fit. 2020 rocreg postestimation — Postestimation tools for rocreg Example 3: Maximum likelihood ROC, invROC, and ROC value In the previous example, we revisited the estimating equations fit of a probit model with ROC covariates frequency (xf), intensity (xl), and hearing loss severity (xd) to the Stover et al. (1996) audiology study data. A maximum likelihood fit of the same model was performed in example 10 of [R] rocreg. In example 2 of [R] rocregplot, we used rocregplot to estimate ROC values and false-positive rates for this model under two covariate configurations. We will use predict to obtain the same estimates. We will also estimate the partial area under the ROC curve. We append the data as in the previous example. This leads to the following four final observations in the data. . use http://www.stata-press.com/data/r13/dp, clear (Stover - DPOAE test data) . rocreg d nsnr, probit ctrlcov(xf xl) roccov(xf xl xd) ml cluster(id) (output omitted ) . append using newdata . list xf xl xd invrocinp rocinp in 1849/1852 1849. 1850. 1851. 1852. xf xl xd invroc~p rocinp 10.01 10.01 10.01 10.01 5.5 6.5 5.5 6.5 .5 4 .5 4 . . .5 .5 .2 .2 .7 .7 Now we predict the ROC value for false-positive rates of 0.2 and 0.7. Under maximum likelihood prediction, only Wald-type confidence intervals are produced. We specify a new variable name for the standard error in the se() option and a stubname for the confidence interval variables in the ci() option. . predict rocit in 1849/1852, roc at(rocinp) se(seroc) ci(ci) . list xf xl xd rocinp rocit seroc ci_l ci_u if !missing(rocit), noobs xf xl xd rocinp rocit seroc ci_l ci_u 10.01 10.01 10.01 10.01 5.5 6.5 5.5 6.5 .5 4 .5 4 .2 .2 .7 .7 .7608593 .9499408 .978951 .9985001 .0510501 .0179824 .0097382 .0009657 .660803 .914696 .9598644 .9966073 .8609157 .9851856 .9980376 1.000393 These results match our estimates in example 2 of [R] rocregplot. We also match example 2 of [R] rocregplot when we estimate the false-positive rate for a ROC value of 0.5. . drop ci* . predict invrocit in 1851/1852, invroc at(invrocinp) se(serocinv) ci(ci) . list xf xl xd invrocinp invrocit serocinv ci_l ci_u if !missing(invrocit), > noobs xf xl xd invroc~p invrocit serocinv ci_l ci_u 10.01 10.01 5.5 6.5 .5 4 .5 .5 .0578036 .0055624 .0198626 .0032645 .0188736 -.0008359 .0967336 .0119607 rocreg postestimation — Postestimation tools for rocreg 2021 Example 4: Maximum likelihood ROC, pAUC, and ROC value In example 13 of [R] rocreg, we fit a maximum-likelihood marginal probit model to each classifier of the fictitious dataset generated from Hanley and McNeil (1983). In example 5 of [R] rocregplot, rocregplot was used to draw the ROC for the mod1 and mod3 classifiers. Estimates of the ROC value and false-positive rate were also obtained with Wald-type confidence intervals. We return to this example, this time using predict to estimate the ROC value and false-positive rate. We will also estimate the pAUC for the false-positive rates of 0.3 and 0.8. First, we add the input variables to the data. The variable paucinp will hold the 0.3 and 0.8 false-positive rates that we will input to pAUC. The variable invrocinp holds the ROC value of 0.8 for which we will estimate the false-positive rate. Finally, the variable rocinp holds the false-positive rates of 0.15 and 0.75 for which we will estimate the ROC value. . 
use http://www.stata-press.com/data/r13/ct2, clear . rocreg status mod1 mod2 mod3, probit ml (output omitted ) . . . . . quietly quietly quietly quietly quietly generate paucinp = .3 in 111 replace paucinp = .8 in 112 generate invrocinp = .8 in 112 generate rocinp = .15 in 111 replace rocinp = .75 in 112 Then, we estimate the ROC value for false-positive rates 0.15 and 0.75 under classifier mod1. The point estimate is stored in roc1. Wald confidence intervals and standard errors are also estimated. We find that these results match those of example 5 of [R] rocregplot. . predict roc1 in 111/112, classvar(mod1) roc at(rocinp) se(sr1) ci(cir1) . list rocinp roc1 sr1 cir1* in 111/112 111. 112. rocinp roc1 sr1 cir1_l cir1_u .15 .75 .7934935 .9931655 .0801363 .0069689 .6364293 .9795067 .9505578 1.006824 Now we perform the same estimation under the classifier mod3. . predict roc3 in 111/112, classvar(mod3) roc at(roci) se(sr3) ci(cir3) . list rocinp roc3 sr3 cir3* in 111/112 111. 112. rocinp roc3 sr3 cir3_l cir3_u .15 .75 .8888596 .9953942 .0520118 .0043435 .7869184 .9868811 .9908009 1.003907 Next we estimate the false-positive rate for the ROC value of 0.8. These results also match example 5 of [R] rocregplot. . predict invroc1 in 112, classvar(mod1) invroc at(invrocinp) se(sir1) ci(ciir1) . list invrocinp invroc1 sir1 ciir1* in 112 112. invroc~p invroc1 sir1 ciir1_l ciir1_u .8 .1556435 .069699 .0190361 .292251 2022 rocreg postestimation — Postestimation tools for rocreg . predict invroc3 in 112, classvar(mod3) invroc at(invrocinp) se(sir3) ci(ciir3) . list invrocinp invroc3 sir3 ciir3* in 112 invroc~p invroc3 sir3 ciir3_l ciir3_u .8 .0661719 .045316 -.0226458 .1549896 112. Finally, we estimate the pAUC for false-positive rates of 0.3 and 0.8. The point estimate is calculated by numeric integration. Wald confidence intervals are obtained with the delta method. Further details are presented in Methods and formulas. . predict pauc1 in 111/112, classvar(mod1) pauc at(paucinp) se(sp1) ci(cip1) . list paucinp pauc1 sp1 cip1* in 111/112 paucinp pauc1 sp1 cip1_l cip1_u .3 .8 .221409 .7033338 .0240351 .0334766 .174301 .6377209 .268517 .7689466 111. 112. . predict pauc3 in 111/112, classvar(mod3) pauc at(paucinp) se(sp3) ci(cip3) . list paucinp pauc3 sp3 cip3* in 111/112 paucinp pauc3 sp3 cip3_l cip3_u .3 .8 .2540215 .7420408 .0173474 .0225192 .2200213 .6979041 .2880217 .7861776 111. 112. Using estat nproc When you initially use rocreg to fit a nonparametric ROC curve, you can obtain bootstrap estimates of a ROC value, false-positive rate, area under the ROC curve, and partial area under the ROC curve. The estat nproc command allows the user to estimate these parameters after rocreg has originally been used. The seed and resampling settings used by rocreg are used by estat nproc. So the results for these new statistics are identical to what they would be if they had been initially estimated in the rocreg command. These new statistics, together with those previously estimated in rocreg, are returned in r(). We demonstrate with an example. Example 5: Nonparametric ROC, invROC, and pAUC In example 3 of [R] rocreg, we examined data from a pancreatic cancer study (Wieand et al. 1989). Two continuous classifiers, y1 (CA 19-9) and y2 (CA 125), were used for the true status variable d. In that example, we estimated various quantities including the false-positive rate for a ROC value of 0.6 and the pAUC for a false-positive rate of 0.5. 
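(Note that more than one statistic may be requested in a single estat nproc call; a schematic call, with illustrative rates and ROC values rather than the ones used below, is

. estat nproc, auc roc(.1 .2) invroc(.5 .6) pauc(.3 .5)

Each requested quantity is computed with the same seed and resampling settings as the original rocreg run and added to the results returned in r().)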
Here we replicate that estimation with a call to rocreg to estimate the former and follow that with a call to estat nproc to estimate the latter. For simplicity, we restrict estimation to classifier y1 (CA 19-9). We start by executing rocreg, estimating the false-positive rate for a ROC value of 0.6. This value is specified in invroc(). Case–control resampling is used by specifying the bootcc option. rocreg postestimation — Postestimation tools for rocreg . use http://labs.fhcrc.org/pepe/book/data/wiedat2b, clear (S. Wieand - Pancreatic cancer diagnostic marker data) . rocreg d y1, invroc(.6) bseed(8378923) bootcc nodots Bootstrap results Number of strata = 2 Number of obs Replications Nonparametric ROC estimation Control standardization: empirical ROC method : empirical False-positive rate Status : d Classifier: y1 invROC Observed Coef. Bias .6 0 .0158039 Bootstrap Std. Err. .0267288 = = 2023 141 1000 [95% Conf. Interval] -.0523874 0 0 .0523874 (N) .0784314 (P) .1372549 (BC) Now we will estimate the pAUC for the false-positive rate of 0.5 using estat nproc and the pauc() option. . matrix list e(b) symmetric e(b)[1,1] y1: invroc_1 y1 0 . estat nproc, pauc(.5) Bootstrap results Number of strata = 2 Number of obs Replications = = 141 1000 Nonparametric ROC estimation Control standardization: empirical ROC method : empirical False-positive rate Status : d Classifier: y1 invROC Observed Coef. Bias .6 0 .0158039 Bootstrap Std. Err. .0267288 [95% Conf. Interval] -.0523874 0 0 .0523874 (N) .0784314 (P) .1372549 (BC) Partial area under the ROC curve Status : d Classifier: y1 pAUC Observed Coef. Bias .5 .3932462 -.0000769 Bootstrap Std. Err. .021332 [95% Conf. Interval] .3514362 .3492375 .3492375 .4350562 (N) .435512 (P) .435403 (BC) 2024 rocreg postestimation — Postestimation tools for rocreg . matrix list r(b) r(b)[1,2] y1: y1: invroc_1 pauc_1 y1 0 .39324619 . matrix list e(b) symmetric e(b)[1,1] y1: invroc_1 y1 0 . matrix list r(V) symmetric r(V)[2,2] y1: y1: invroc_1 pauc_1 y1:invroc_1 .00071443 y1:pauc_1 -.000326 .00045506 . matrix list e(V) symmetric e(V)[1,1] y1:invroc_1 y1: invroc_1 .00071443 The advantages of using estat nproc are twofold. First, you can estimate additional parameters of interest without having to respecify the bootstrap settings you did with rocreg; instead estat nproc uses the bootstrap settings that were stored by rocreg. Second, parameters estimated with estat nproc are added to those parameters estimated by rocreg and returned in the matrices r(b) (parameter estimates) and r(V) (variance–covariance matrix). Thus you can also obtain correlations between any quantities you wish to estimate. Stored results estat nproc stores the following in r(): r(b) r(V) r(ci normal) r(ci percentile) r(ci bc) coefficient vector variance–covariance matrix of the estimators normal-approximation confidence intervals percentile confidence intervals bias-corrected confidence intervals Methods and formulas Details on computation of the nonparametric ROC curve and the estimation of the parametric ROC curve model coefficients can be found in [R] rocreg. Here we describe how to estimate the ROC curve summary statistics for a parametric model. The cumulative distribution function, g , can be the standard normal cumulative distribution function, Φ. 
Methods and formulas are presented under the following headings:

Parametric model: Summary parameter definition
Maximum likelihood estimation
Estimating equations estimation

Parametric model: Summary parameter definition

Conditioning on covariates x, we have the following ROC curve model:

ROC(u) = g\{x'\beta + \alpha g^{-1}(u)\}

x can be constant, and \beta = \beta_0, the constant intercept. We can solve this equation to obtain the false-positive rate value u for a ROC value of r:

u = g[\{g^{-1}(r) - x'\beta\}\alpha^{-1}]

The partial area under the ROC curve for the false-positive rate u is defined by

pAUC(u) = \int_0^u g\{x'\beta + \alpha g^{-1}(t)\}\,dt

The area under the ROC curve is defined by

AUC = \int_0^1 g\{x'\beta + \alpha g^{-1}(t)\}\,dt

When g is the standard normal cumulative distribution function \Phi, we can express the AUC as

AUC = \Phi\left( \frac{x'\beta}{\sqrt{1+\alpha^2}} \right)

Maximum likelihood estimation

We allow maximum likelihood estimation under probit parametric models, so g = \Phi. The ROC value, false-positive rate, and AUC parameters all have closed-form expressions in terms of the covariate values x, coefficient vector \beta, and slope parameter \alpha. So to estimate these three types of summary parameters, we use the delta method (Oehlert 1992; Phillips and Park 1988). Particularly, we use the nlcom command (see [R] nlcom) to implement the delta method.

To estimate the partial area under the ROC curve for false-positive rate u, we use numeric integration. A trapezoidal approximation is used in calculating the integrals. A numeric integral of the ROC(t) function conditioned on the covariate values x, coefficient vector estimate \hat\beta, and slope parameter estimate \hat\alpha is computed over the range t = [0, u]. This gives us the point estimate of pAUC(u). To calculate the standard error and confidence intervals for the point estimate of pAUC(u), we again use the delta method. Details on the delta method algorithm can be found in Methods and formulas of [R] nlcom and the earlier mentioned references.

Under maximum likelihood estimation, the coefficient estimates \hat\beta and slope estimate \hat\alpha are asymptotically normal with variance matrix V. For convenience, we rename the parameter vector [\beta', \alpha] to the k-parameter vector \theta = [\theta_1, \dots, \theta_k]. We will also explicitly refer to the conditioning of the ROC curve by \theta in its mention as ROC(t, \theta). Under the delta method, the continuous scalar function of the estimate \hat\theta, f(\hat\theta), has asymptotic mean f(\theta) and asymptotic covariance

\widehat{Var}\{f(\hat\theta)\} = f V f'

where f is the 1 x k matrix of derivatives for which

f_{1j} = \partial f(\theta) / \partial\theta_j        j = 1, \dots, k

The asymptotic covariance of f(\hat\theta) is estimated and then used in conjunction with f(\hat\theta) for further inference, including Wald confidence intervals, standard errors, and hypothesis testing.

In the case of pAUC(u) estimation, our f(\hat\theta) is the aforementioned numeric integral of the ROC curve. It estimates f(\theta), the true integral of the ROC curve on the [0, u] range. The V variance matrix is estimated using the likelihood information that rocreg calculated, and the estimation is performed by rocreg itself. The partial derivatives of f(\theta) can be determined by using Leibnitz's rule (Weisstein 2011):

f_{1j} = \frac{\partial}{\partial\theta_j} \int_0^u ROC(t, \theta)\,dt = \int_0^u \frac{\partial}{\partial\theta_j} ROC(t, \theta)\,dt        j = 1, \dots, k

When \theta_j corresponds with the slope parameter \alpha, we obtain the following partial derivative:

\frac{\partial}{\partial\alpha} pAUC(u) = \int_0^u \phi\{x'\beta + \alpha\Phi^{-1}(t)\}\Phi^{-1}(t)\,dt

The partial derivative of f(\theta) [pAUC(u)] for \beta_0 is the following:

\frac{\partial}{\partial\beta_0} pAUC(u) = \int_0^u \phi\{x'\beta + \alpha\Phi^{-1}(t)\}\,dt

For a nonintercept coefficient, we obtain the following:

\frac{\partial}{\partial\beta_i} pAUC(u) = \int_0^u x_i\,\phi\{x'\beta + \alpha\Phi^{-1}(t)\}\,dt

We can estimate each of these integrals by numeric integration, plugging in the estimates \hat\beta and \hat\alpha for the parameters. This, together with the previously calculated estimate \hat{V}, provides an estimate of the asymptotic covariance of f(\hat\theta) = \widehat{pAUC}(u), which allows us to perform further statistical inference on pAUC(u).

Estimating equations estimation

When we fit a model using the Alonzo and Pepe (2002) estimating equations method, we use the bootstrap to perform inference on the ROC curve summary parameters. Each bootstrap sample provides a sample of the coefficient estimates \beta and the slope estimates \alpha. Using the formulas in Parametric model: Summary parameter definition under Methods and formulas, we can obtain an estimate of the ROC, false-positive rate, or AUC for each resample. Using numeric integration (with the trapezoidal approximation), we can also estimate the pAUC of the resample.

By making these calculations, we obtain a bootstrap sample of our summary parameter estimate. We then obtain bootstrap standard errors, normal approximation confidence intervals, percentile confidence intervals, and bias-corrected confidence intervals using this bootstrap sample. Further details can be found in [R] bootstrap.

References

Alonzo, T. A., and M. S. Pepe. 2002. Distribution-free ROC analysis using binary regression techniques. Biostatistics 3: 421–432.
Choi, B. C. K. 1998. Slopes of a receiver operating characteristic curve and likelihood ratios for a diagnostic test. American Journal of Epidemiology 148: 1127–1132.
Cleves, M. A. 1999. sg120: Receiver operating characteristic (ROC) analysis. Stata Technical Bulletin 52: 19–33. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 212–229. College Station, TX: Stata Press.
———. 2000. sg120.1: Two new options added to rocfit command. Stata Technical Bulletin 53: 18–19. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 230–231. College Station, TX: Stata Press.
Hanley, J. A., and B. J. McNeil. 1983. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 148: 839–843.
Janes, H., G. M. Longton, and M. S. Pepe. 2009. Accommodating covariates in receiver operating characteristic analysis. Stata Journal 9: 17–39.
Norton, S. J., M. P. Gorga, J. E. Widen, R. C. Folsom, Y. Sininger, B. Cone-Wesson, B. R. Vohr, K. Mascher, and K. Fletcher. 2000. Identification of neonatal hearing impairment: Evaluation of transient evoked otoacoustic emission, distortion product otoacoustic emission, and auditory brain stem response test performance. Ear and Hearing 21: 508–528.
Oehlert, G. W. 1992. A note on the delta method. American Statistician 46: 27–29.
Phillips, P. C. B., and J. Y. Park. 1988. On the formulation of Wald tests of nonlinear restrictions. Econometrica 56: 1065–1083.
Stover, L., M. P. Gorga, S. T. Neely, and D. Montoya. 1996. Toward optimizing the clinical utility of distortion product otoacoustic emission measurements. Journal of the Acoustical Society of America 100: 956–967.
Weisstein, E. W. 2011. Leibniz integral rule.
From Mathworld—A Wolfram Web Resource. http://mathworld.wolfram.com/LeibnizIntegralRule.html. Wieand, S., M. H. Gail, B. R. James, and K. L. James. 1989. A family of nonparametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika 76: 585–592. Also see [R] rocreg — Receiver operating characteristic (ROC) regression [R] rocregplot — Plot marginal and covariate-specific ROC curves after rocreg [U] 20 Estimation and postestimation commands Title rocregplot — Plot marginal and covariate-specific ROC curves after rocreg Syntax common options References Menu boot options Also see Description Remarks and examples probit options Methods and formulas Syntax Plot ROC curve after nonparametric analysis rocregplot , common options boot options Plot ROC curve after parametric analysis using bootstrap rocregplot , probit options common options boot options Plot ROC curve after parametric analysis using maximum likelihood rocregplot , probit options common options probit options Description Main ∗ ∗ at(varname=# varname=# . . . ) at1(varname=# varname=# . . . ) at2(varname=# varname=# . . . ) ... value of specified covariates/mean of unspecified covariates roc(numlist) invroc(numlist) level(#) show estimated ROC values for given false-positive rates show estimated false-positive rates for given ROC values set confidence level; default is level(95) Curve line#opts(cline options) ∗ affect rendition of ROC curve # Only one of roc() or invroc() may be specified. 2028 rocregplot — Plot marginal and covariate-specific ROC curves after rocreg 2029 Description common options Main restrict plotting of ROC curves to specified classifiers suppress plotting the reference line classvars(varlist) norefline Scatter affect rendition of classifier #s false-positive rate and ROC scatter points; not allowed with at() plot#opts(scatter options) Reference line affect rendition of the reference line rlopts(cline options) Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options boot options Description Bootstrap † bfile(filename) load dataset containing bootstrap replicates from rocreg plot normal-based (n), percentile (p), or bias-corrected (bc) confidence intervals; default is btype(n) btype(n | p | bc) † bfile() is only allowed with parametric analysis using bootstrap inference; in which case this option is required with roc() or invroc(). Menu Statistics > Epidemiology and related > ROC analysis > ROC curves after rocreg Description Under parametric estimation, rocregplot plots the fitted ROC curves for specified covariate values and classifiers. If rocreg, probit or rocreg, probit ml were previously used, the false-positive rates (for specified ROC values) and ROC values (for specified false-positive rates) for each curve may also be plotted, along with confidence intervals. Under nonparametric estimation, rocregplot will plot the fitted ROC curves using the fpr * and roc * variables produced by rocreg. Point estimates and confidence intervals for false-positive rates and ROC values that were computed in rocreg may be plotted as well. probit options Main at(varname=# . . . ) requests that the covariates specified by varname be set to #. By default, rocreg evaluates the function by setting each covariate to its mean value. This option causes the ROC curve to be evaluated at the value of the covariates listed in at() and at the mean of all unlisted covariates. 
2030 rocregplot — Plot marginal and covariate-specific ROC curves after rocreg at1(varname=# . . . ), at2(varname=# . . . ), . . . , at10(varname=# . . . ) specify that ROC curves (up to 10) be plotted on the same graph. at1(), at2(), . . . , at10() work like the at() option. They request that the function be evaluated at the value of the covariates specified and at the mean of all unlisted covariates. at1() specifies the values of the covariates for the first curve, at2() specifies the values of the covariates for the second curve, and so on. roc(numlist) specifies that estimated ROC values for given false-positive rates be graphed. invroc(numlist) specifies that estimated false-positive rates for given ROC values be graphed. level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. level() may be specified with either roc() or invroc(). Curve line#opts(cline options) affects the rendition of ROC curve #. See [G-3] cline options. common options Main classvars(varlist) restricts plotting ROC curves to specified classification variables. norefline suppresses plotting the reference line. Scatter plot#opts(scatter options) affects the rendition of classifier #’s false-positive rate and ROC scatter points. This option applies only to non-ROC covariate estimation graphing. See [G-2] graph twoway scatter. Reference line rlopts(cline options) affects rendition of the reference line. See [G-3] cline options. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and options for saving the graph to disk (see [G-3] saving option). boot options Bootstrap bfile(filename) uses bootstrap replicates of parameters from rocreg stored in filename to estimate standard errors and confidence intervals of predictions. bfile() must be specified with either roc() or invroc() if parametric estimation with bootstrapping was used. btype(n | p | bc) indicates the desired type of confidence interval rendering. n draws normal-based, p draws percentile, and bc draws bias-corrected confidence intervals for specified false-positive rates and ROC values in roc() and invroc(). The default is btype(n). rocregplot — Plot marginal and covariate-specific ROC curves after rocreg 2031 Remarks and examples Remarks are presented under the following headings: Plotting covariate-specific ROC curves Plotting marginal ROC curves Plotting covariate-specific ROC curves The rocregplot command is also demonstrated in [R] rocreg. We will further demonstrate its use with several examples. Particularly, we will show how rocregplot can draw the ROC curves of covariate models that have been fit using rocreg. Example 1: Parametric ROC In example 6 of [R] rocreg, we fit a probit ROC model to audiology test data from Norton et al. (2000). The estimating equation method of Alonzo and Pepe (2002) was used to the fit the model. Gender and age were covariates that affected the control distribution of the classifier y1 (DPOAE 65 at 2 kHz). Age was a ROC covariate for the model, so we fit separate ROC curves at each age. Following Janes, Longton, and Pepe (2009), we draw the ROC curves for ages 30, 40, and 50 months. The at1(), at2(), and at3() options are used to specify the age covariates. 0 True−positive rate (ROC) .25 .5 .75 1 . 
use http://www.stata-press.com/data/r13/nnhs (Norton - neonatal audiology data) . rocreg d y1, probit ctrlcov(currage male) ctrlmodel(linear) roccov(currage) > cluster(id) bseed(56930) bsave(nnhs2y1, replace) (output omitted ) . rocregplot, at1(currage=30) at2(currage=40) at3(currage=50) 0 .25 .5 .75 False−positive rate DPOAE 65 at 2kHz 1 At 1 At 2 At 3 Here we use the default entries of the legend, which indicate the “at #” within the specified at* options and the classifier to which the curve corresponds. ROC curve one corresponds with currage=30, two with currage=40, and three with currage=50. The positive effect of age on the ROC curve is evident. At an age of 30 months (currage=30), the ROC curve of y1 (DPOAE 65 at 2 kHz) is nearly equivalent to that of a noninformative test that gives equal probability to hearing loss. At age 50 months (currage=50), corresponding to some of the oldest children in the study, the ROC curve shows that test y1 (DPOAE 65 at 2 kHz) is considerably more powerful than the noninformative test. 2032 rocregplot — Plot marginal and covariate-specific ROC curves after rocreg You may create your own legend by specifying the legend() option. The default legend is designed for the possibility of multiple covariates. Here we could change the legend entries to currage values and gain some extra clarity. However, this may not be feasible when there are many covariates present. We can also use rocregplot after maximum likelihood estimation. Example 2: Maximum likelihood ROC We return to the audiology study with frequency (xf), intensity (xl), and hearing loss severity (xd) covariates from Stover et al. (1996) that we examined in example 10 of [R] rocreg. Negative signal-to-noise ratio is again used as a classifier. Using maximum likelihood, we fit a probit model to these data with the indicated ROC covariates. After fitting the model, we wish to compare the ROC curves of two covariate combinations. The first has an intensity value of 5.5 (the lowest intensity, corresponding to 55 decibels) and a frequency of 10.01 (the lowest frequency, corresponding to 1001 hertz). We give the first combination a hearing loss severity value of 0.5 (the lowest). The second covariate combination has the same frequency, but the highest intensity value of 6.5 (65 decibels). We give this second covariate set a higher severity value of 4. We will visually compare the two ROC curves resulting from these two covariate value combinations. We specify false-positive rates of 0.7 first followed by 0.2 in the roc() option to visually compare the size of the ROC curve at large and small false-positive rates. Because maximum likelihood estimation was used to fit the model, a Wald confidence interval is produced for the estimated ROC value and false-positive rate parameters. Further details are found in Methods and formulas. . use http://www.stata-press.com/data/r13/dp (Stover - DPOAE test data) . rocreg d nsnr, probit ctrlcov(xf xl) roccov(xf xl xd) ml cluster(id) (output omitted ) rocregplot — Plot marginal and covariate-specific ROC curves after rocreg 2033 . rocregplot, at1(xf=10.01, xl=5.5, xd=.5) at2(xf=10.01, xl=6.5, xd=4) roc(.7) ROC curve Status : d Classifier: nsnr Under covariates: at1 xf xl xd 10.01 5.5 .5 ROC Coef. .7 .978951 Std. Err. [95% Conf. Interval] .0097382 .9598645 Std. Err. [95% Conf. Interval] .0009657 .9966073 .9980376 Under covariates: at2 xf xl xd 10.01 6.5 4 Coef. 
.7 .9985001 1.000393 0 True−positive rate (ROC) .25 .5 .75 1 ROC 0 .25 .5 .75 False−positive rate −SNR 1 At 1 At 2 At the higher false-positive rate value of 0.7, we see little difference in the ROC values and note that the confidence intervals nearly overlap. Now we view the same curves with the lower false-positive rate compared. 2034 rocregplot — Plot marginal and covariate-specific ROC curves after rocreg . rocregplot, at1(xf=10.01, xl=5.5, xd=.5) at2(xf=10.01, xl=6.5, xd=4) roc(.2) ROC curve Status : d Classifier: nsnr Under covariates: at1 xf xl xd 10.01 5.5 .5 ROC Coef. .2 .7608593 Std. Err. .0510501 [95% Conf. Interval] .660803 .8609157 Under covariates: at2 xf xl xd 10.01 6.5 4 Coef. .2 .9499408 Std. Err. .0179824 [95% Conf. Interval] .914696 .9851856 0 True−positive rate (ROC) .25 .5 .75 1 ROC 0 .25 .5 .75 False−positive rate −SNR 1 At 1 At 2 The lower false-positive rate of 0.2 shows clearly distinguishable ROC values. Now we specify option invroc(.5) to view how the false-positive rates vary at a ROC value of 0.5. rocregplot — Plot marginal and covariate-specific ROC curves after rocreg 2035 . rocregplot, at1(xf=10.01, xl=5.5, xd=.5) at2(xf=10.01, xl=6.5, xd=4) invroc(.5) False-positive rate Status : d Classifier: nsnr Under covariates: at1 xf xl xd 10.01 5.5 .5 invROC Coef. .5 .0578036 Std. Err. [95% Conf. Interval] .0198626 .0188736 Std. Err. [95% Conf. Interval] .0967336 Under covariates: at2 xf xl xd 10.01 6.5 4 Coef. .5 .0055624 .0032645 -.0008359 .0119607 0 True−positive rate (ROC) .25 .5 .75 1 invROC 0 .25 .5 .75 False−positive rate −SNR 1 At 1 At 2 At a ROC value of 0.5, the false-positive rates for both curves are small and close to one another. Technical note We can use the testnl command to support our visual observations with statistical inference. We use it to perform a Wald test of the null hypothesis that the two ROC curves just rendered are equal at a false-positive rate of 0.7. 2036 rocregplot — Plot marginal and covariate-specific ROC curves after rocreg . testnl normal(_b[i_cons]+10.01*_b[xf]+5.5*_b[xl] > + .5*_b[xd]+_b[s_cons]*invnormal(.7)) = > normal(_b[i_cons]+10.01*_b[xf]+6.5*_b[xl] > + 4*_b[xd]+_b[s_cons]*invnormal(.7)) (1) normal(_b[i_cons]+10.01*_b[xf]+5.5*_b[xl] +.5*_b[xd]+_b[s_cons]*invnormal(.7))= normal(_b[i_cons]+10.01*_b[xf]+6.5*_b[xl] + 4*_b[xd]+_b[s_cons]*invnormal(.7)) chi2(1) = Prob > chi2 = 4.53 0.0332 The test is significant at the 0.05 level, and thus we find that the two curves are significantly different. Now we will use testnl again to test equality of the false-positive rates for each curve with a ROC value of 0.5. The inverse ROC formula used is derived in Methods and formulas. . testnl normal((invnormal(.5)-(_b[i_cons]+10.01*_b[xf]+5.5*_b[xl]+.5*_b[xd])) > /_b[s_cons]) = > normal((invnormal(.5)-(_b[i_cons]+10.01*_b[xf]+6.5*_b[xl]+4*_b[xd])) > /_b[s_cons]) (1) normal((invnormal(.5)-(_b[i_cons]+10.01*_b[xf]+5.5*_b[xl]+.5*_b[xd])) /_b[s_cons]) = normal((invnormal(.5)-(_b[i_cons]+10.01*_b[xf]+6.5*_b[xl]+4*_b[xd])) /_b[s_cons]) chi2(1) = 8.01 Prob > chi2 = 0.0046 We again reject the null hypothesis that the two curves are equal at the 0.05 level. The model of our last example was also fit using the estimating equations method in example 7 of [R] rocreg. We will demonstrate rocregplot after that model fit as well. 
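Returning briefly to the technical note above: testnl reports only the chi-squared test. If a point estimate and Wald confidence interval for the difference between the two ROC values is also of interest, the same coefficient expressions can be passed to nlcom, the command used for delta-method inference in this setting (see Methods and formulas). A sketch for the false-positive rate of 0.7, using the coefficient names from the testnl calls above:

. nlcom normal(_b[i_cons]+10.01*_b[xf]+5.5*_b[xl]+.5*_b[xd]+_b[s_cons]*invnormal(.7))
>     - normal(_b[i_cons]+10.01*_b[xf]+6.5*_b[xl]+4*_b[xd]+_b[s_cons]*invnormal(.7))

The associated z test corresponds to the chi-squared test that testnl reports.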
Example 3: Parametric ROC, invROC, and ROC value In example 2, we used rocregplot after a maximum likelihood model fit of the ROC curve for classifier nsnr and covariates frequency (xf), intensity (xl), and hearing loss severity (xd). The data were obtained from the audiology study described in Stover et al. (1996). In example 7 of [R] rocreg, we fit the model using the estimating equations method of Alonzo and Pepe (2002). Under this method, bootstrap resampling is used to make inferences. We saved 50 bootstrap replications in nsnrf.dta, which we re-create below. We use rocregplot to draw the ROC curves for nsnr under the covariate values xf = 10.01, xl = 5.5, and xd = .5, and xf = 10.01, xl = 6.5, and xd = 4. The at#() options are used to specify the covariate values. The previous bootstrap results are made available to rocregplot with the bfile() option. As before, we will specify 0.2 and 0.7 as false-positive rates in the roc() option and 0.5 as a ROC value in the invroc() option. We do not specify btype() and thus our graph will contain normal-based bootstrap confidence bands, the default. rocregplot — Plot marginal and covariate-specific ROC curves after rocreg . use http://www.stata-press.com/data/r13/dp (Stover - DPOAE test data) . rocreg d nsnr, probit ctrlcov(xf xl) roccov(xf xl xd) cluster(id) > nobstrata ctrlfprall bseed(156385) breps(50) bsave(nsnrf, replace) (output omitted ) . rocregplot, at1(xf=10.01, xl=5.5, xd=.5) at2(xf=10.01, xl=6.5, xd=4) > roc(.7) bfile(nsnrf) ROC curve Status : d Classifier: nsnr Under covariates: at1 xf xl xd 10.01 5.5 .5 (Replications based on 208 clusters in id) ROC Observed Coef. Bias .7 .9835816 .0087339 Bootstrap Std. Err. .0204353 [95% Conf. Interval] .9435292 .9155462 .9392258 1.023634 (N) .9974037 (P) .9976629 (BC) Under covariates: at2 xf xl xd 10.01 6.5 4 (Replications based on 208 clusters in id) ROC Observed Coef. Bias .7 .999428 .0006059 Bootstrap Std. Err. .0011309 [95% Conf. Interval] .9972115 .9958003 .9968304 1.001644 (N) .9999675 (P) .9999901 (BC) 2037 rocregplot — Plot marginal and covariate-specific ROC curves after rocreg 0 True−positive rate (ROC) .25 .5 .75 1 2038 0 .25 .5 .75 False−positive rate −SNR 1 At 1 At 2 As shown in the graph, we find that the ROC values at a false-positive rate of 0.7 are close together, as they were in the maximum likelihood estimation in example 2. We now repeat this process for the lower false-positive rate of 0.2 by using the roc(.2) option. . rocregplot, at1(xf=10.01, xl=5.5, xd=.5) at2(xf=10.01, xl=6.5, xd=4) > roc(.2) bfile(nsnrf) ROC curve Status : d Classifier: nsnr Under covariates: at1 xf xl xd 10.01 5.5 .5 (Replications based on 208 clusters in id) ROC Observed Coef. Bias .2 .7652956 .0145111 Bootstrap Std. Err. .0735506 [95% Conf. Interval] .6211391 .6054495 .6394838 .9094522 (N) .878052 (P) .9033081 (BC) Under covariates: at2 xf xl xd 10.01 6.5 4 (Replications based on 208 clusters in id) ROC Observed Coef. Bias .2 .9672505 .0072429 Bootstrap Std. Err. .0227977 [95% Conf. Interval] .9225679 .9025254 .9235289 1.011933 (N) .9931714 (P) .9979637 (BC) 2039 0 True−positive rate (ROC) .25 .5 .75 1 rocregplot — Plot marginal and covariate-specific ROC curves after rocreg 0 .25 .5 .75 False−positive rate −SNR 1 At 1 At 2 The ROC values are slightly higher at the false-positive rate of 0.2 than they were in the maximum likelihood estimation in example 2. To see if the false-positive rates differ at a ROC value of 0.5, we specify the invroc(.5) option. . 
rocregplot, at1(xf=10.01, xl=5.5, xd=.5) at2(xf=10.01, xl=6.5, xd=4) > invroc(.5) bfile(nsnrf) False-positive rate Status : d Classifier: nsnr Under covariates: at1 xf xl xd 10.01 5.5 .5 (Replications based on 208 clusters in id) invROC Observed Coef. Bias .5 .0615144 -.0063531 Bootstrap Std. Err. .0254042 [95% Conf. Interval] .0117231 .0225159 .0224352 .1113057 (N) .1265046 (P) .1265046 (BC) Under covariates: at2 xf xl xd 10.01 6.5 4 (Replications based on 208 clusters in id) invROC Observed Coef. Bias .5 .0043298 -.0012579 Bootstrap Std. Err. .0045938 [95% Conf. Interval] -.004674 .0002773 .0001292 .0133335 (N) .0189199 (P) .0134801 (BC) rocregplot — Plot marginal and covariate-specific ROC curves after rocreg 0 True−positive rate (ROC) .25 .5 .75 1 2040 0 .25 .5 .75 False−positive rate −SNR 1 At 1 At 2 The point estimates of the ROC value and false-positive rate are both computed directly using the point estimates of the ROC coefficients. Calculation of the standard errors and confidence intervals is slightly more complicated. Essentially, we have stored a sample of our ROC covariate coefficient estimates in nsnrf.dta. We then calculate the ROC value or false-positive rate estimates using each set of coefficient estimates, resulting in a sample of point estimates. Then the bootstrap standard error and confidence intervals are calculated based on these bootstrap samples. Details of the computation of the standard error and percentile confidence intervals can be found in Methods and formulas and in [R] bootstrap. As mentioned in [R] rocreg, 50 resamples is a reasonable lower bound for obtaining bootstrap standard errors (Mooney and Duval 1993). However, it may be too low for obtaining percentile and bias-corrected confidence intervals. Normal-based confidence intervals are valid when the bootstrap distribution exhibits normality. See [R] bootstrap postestimation for more details. We can assess the normality of the bootstrap distribution by using a normal probability plot. Stata provides this in the pnorm command (see [R] diagnostic plots). We will use nsnrf.dta to draw a normal probability plot for the ROC estimate corresponding to a false-positive rate of 0.2. We use the covariate values xf = 10.01, xl = 6.5, and xd = 4. . use nsnrf (bootstrap: rocregstat) . generate double rocp2 = nsnr_b_i_cons + 10.01*nsnr_b_xf + 6.5*nsnr_b_xl + > 4*nsnr_b_xd+nsnr_b_s_cons*invnormal(.2) . replace rocp2 = normal(rocp2) (50 real changes made) rocregplot — Plot marginal and covariate-specific ROC curves after rocreg 2041 0.00 Normal F[(rocp2−m)/s] 0.25 0.50 0.75 1.00 . pnorm rocp2 0.00 0.25 0.50 Empirical P[i] = i/(N+1) 0.75 1.00 The closeness of the points to the horizontal line on the normal probability plot shows us that the bootstrap distribution is approximately normal. So it is reasonable to use the normal-based confidence intervals for ROC at a false-positive rate of 0.2 under covariate values xf = 10.01, xl = 6.5, and xd = 4. Plotting marginal ROC curves The rocregplot command can also be used after fitting models with no covariates. We will demonstrate this with an empirical ROC model fit in [R] rocreg. Example 4: Nonparametric ROC We run rocregplot after fitting the single-classifier, empirical ROC model shown in example 1 of [R] rocreg. There we empirically predicted the ROC curve of the classifier rating for the true status variable disease from the Hanley and McNeil (1982) data. 
The rocreg command saves variables roc rating and fpr rating, which give the ROC values and false-positive rates, respectively, for every value of rating. These variables are used by rocregplot to render the ROC curve. . use http://www.stata-press.com/data/r13/hanley, clear . rocreg disease rating, noboot Nonparametric ROC estimation Control standardization: empirical ROC method : empirical Area under the ROC curve Status : disease Classifier: rating AUC Observed Coef. Bias .8407708 . Bootstrap Std. Err. . [95% Conf. Interval] . . . . (N) . (P) . (BC) 2042 rocregplot — Plot marginal and covariate-specific ROC curves after rocreg 0 True−positive rate (ROC) .25 .5 .75 1 . rocregplot 0 .25 .5 .75 False−positive rate 1 rating We end our discussion of rocregplot by showing its use after a marginal probit model. Example 5: Maximum likelihood ROC, invROC, and ROC value In example 13 of [R] rocreg, we fit a maximum-likelihood probit model to each classifier of the fictitious dataset generated from Hanley and McNeil (1983). We use rocregplot after the original rocreg command to draw the ROC curves for classifiers mod1 and mod3. This is accomplished by specifying the two variables in the classvars() option. We will use the roc() option to obtain confidence intervals for ROC values at false-positive rates of 0.15 and 0.75. We will specify the invroc() option to obtain false-positive rate confidence intervals for a ROC value of 0.8. As mentioned previously, these are Wald confidence intervals. First, we will view results for a false-positive rate of 0.75. . use http://www.stata-press.com/data/r13/ct2, clear . rocreg status mod1 mod2 mod3, probit ml (output omitted ) . rocregplot, classvars(mod1 mod3) roc(.75) ROC curve Status : status Classifier: mod1 ROC Coef. .75 .9931655 Std. Err. [95% Conf. Interval] .0069689 .9795067 Std. Err. [95% Conf. Interval] .0043435 .9868811 1.006824 Status : status Classifier: mod3 ROC Coef. .75 .9953942 1.003907 2043 0 True−positive rate (ROC) .25 .5 .75 1 rocregplot — Plot marginal and covariate-specific ROC curves after rocreg 0 .25 .5 .75 False−positive rate 1 mod1 mod3 mod1 Fit mod3 Fit We see that the estimates for each of the two ROC curves are close. Because this is a marginal model, the actual false-positive rate and the true-positive rate for each observation are plotted in the graph. The added point estimates of the ROC value at false-positive rate 0.75 are shown as diamond (mod3) and circle (mod1) symbols in the upper-right-hand corner of the graph at FPR = 0.75. Confidence bands are also plotted at FPR = 0.75 but are so narrow that they are barely noticeable. Under both classifiers, the ROC value at 0.75 is very high. Now we will compare these results to those with a lower false-positive rate of 0.15. . rocregplot, classvars(mod1 mod3) roc(.15) ROC curve Status : status Classifier: mod1 ROC Coef. .15 .7934935 Std. Err. [95% Conf. Interval] .0801363 .6364292 Std. Err. [95% Conf. Interval] .0520118 .7869184 .9505578 Status : status Classifier: mod3 ROC Coef. .15 .8888596 .9908008 rocregplot — Plot marginal and covariate-specific ROC curves after rocreg 0 True−positive rate (ROC) .25 .5 .75 1 2044 0 .25 .5 .75 False−positive rate 1 mod1 mod3 mod1 Fit mod3 Fit The ROC value for the false-positive rate of 0.15 is more separated in the two classifiers. Here we see that mod3 has a larger ROC value than mod1 for this false-positive rate, but the confidence intervals of the estimates overlap. 
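The same point estimates and Wald intervals can also be obtained with predict after this rocreg fit, as illustrated in example 4 of [R] rocreg postestimation. A sketch, assuming a variable rocinp holding the false-positive rates 0.15 and 0.75 in observations 111 and 112 (as created in that example):

. predict roc1 in 111/112, classvar(mod1) roc at(rocinp) se(sr1) ci(cir1)
. predict roc3 in 111/112, classvar(mod3) roc at(rocinp) se(sr3) ci(cir3)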
By specifying invroc(.8), we obtain invROC confidence intervals corresponding to a ROC value of 0.8.

. rocregplot, classvars(mod1 mod3) invroc(.8)

False-positive rate
Status    : status
Classifier: mod1

      invROC        Coef.   Std. Err.     [95% Conf. Interval]
          .8     .1556435     .069699       .019036    .2922509

Status    : status
Classifier: mod3

      invROC        Coef.   Std. Err.     [95% Conf. Interval]
          .8     .0661719     .045316     -.0226458    .1549896

(figure: ROC curves for classifiers mod1 and mod3, with fitted curves, observed points, and confidence intervals for the false-positive rate at ROC = 0.8; x axis: False-positive rate, y axis: True-positive rate (ROC))

For estimation of the false-positive rate at a ROC value of 0.8, the confidence intervals overlap. Both classifiers require only a small false-positive rate to achieve a ROC value of 0.8.

Methods and formulas

Details on computation of the nonparametric ROC curve and the estimation of the parametric ROC curve model coefficients can be found in [R] rocreg. Here we describe how to estimate the ROC values and false-positive rates of a parametric model. The cumulative distribution function g can be the standard normal cumulative distribution function.

Methods and formulas are presented under the following headings:

Parametric model: Summary parameter definition
Maximum likelihood estimation
Estimating equations estimation

Parametric model: Summary parameter definition

Conditioning on covariates x, we have the following ROC curve model:

ROC(u) = g\{x'\beta + \alpha g^{-1}(u)\}

x can be constant, and \beta = \beta_0, the constant intercept. With simple algebra, we can solve this equation to obtain the false-positive rate value u for a ROC value of r:

u = g[\{g^{-1}(r) - x'\beta\}\alpha^{-1}]

Maximum likelihood estimation

We allow maximum likelihood estimation under probit parametric models, so g = \Phi. The ROC value and false-positive rate parameters all have closed-form expressions in terms of the covariate values x, coefficient vector \beta, and slope parameter \alpha. Thus to estimate these two types of summary parameters, we use the delta method (Oehlert 1992; Phillips and Park 1988). Particularly, we use the nlcom command (see [R] nlcom) to implement the delta method.

Under maximum likelihood estimation, the coefficient estimates \hat\beta and slope estimate \hat\alpha are asymptotically normal with variance matrix V. For convenience, we rename the parameter vector [\beta', \alpha] to the k-parameter vector \theta = [\theta_1, \dots, \theta_k]. We will also explicitly refer to the conditioning of the ROC curve by \theta in its mention as ROC(t, \theta). Under the delta method, the continuous scalar function of the estimate \hat\theta, f(\hat\theta), has asymptotic mean f(\theta) and asymptotic covariance

\widehat{Var}\{f(\hat\theta)\} = f V f'

where f is the 1 x k matrix of derivatives for which

f_{1j} = \partial f(\theta) / \partial\theta_j        j = 1, \dots, k

The asymptotic covariance of f(\hat\theta) is estimated and then used in conjunction with f(\hat\theta) for further inference, including Wald confidence intervals, standard errors, and hypothesis testing.

Estimating equations estimation

When we fit a model using the Alonzo and Pepe (2002) estimating equations method, we use the bootstrap to perform inference on the ROC curve summary parameters. Each bootstrap sample provides a sample of the coefficient estimates \beta and the slope estimates \alpha. Using the formulas above, we can obtain an estimate of the ROC value or false-positive rate for each resample. By making these calculations, we obtain a bootstrap sample of our summary parameter estimate. We then obtain bootstrap standard errors, normal approximation confidence intervals, percentile confidence intervals, and bias-corrected confidence intervals using this bootstrap sample. Further details can be found in [R] bootstrap.

References

Alonzo, T. A., and M. S. Pepe. 2002. Distribution-free ROC analysis using binary regression techniques. Biostatistics 3: 421–432.
Bamber, D. 1975. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology 12: 387–415.
Choi, B. C. K. 1998. Slopes of a receiver operating characteristic curve and likelihood ratios for a diagnostic test. American Journal of Epidemiology 148: 1127–1132.
Cleves, M. A. 1999. sg120: Receiver operating characteristic (ROC) analysis. Stata Technical Bulletin 52: 19–33. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 212–229. College Station, TX: Stata Press.
———. 2000. sg120.1: Two new options added to rocfit command. Stata Technical Bulletin 53: 18–19. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 230–231. College Station, TX: Stata Press.
Hanley, J. A., and B. J. McNeil. 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143: 29–36.
———. 1983. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 148: 839–843.
Janes, H., G. M. Longton, and M. S. Pepe. 2009. Accommodating covariates in receiver operating characteristic analysis. Stata Journal 9: 17–39.
Mooney, C. Z., and R. D. Duval. 1993. Bootstrapping: A Nonparametric Approach to Statistical Inference. Newbury Park, CA: Sage.
Norton, S. J., M. P. Gorga, J. E. Widen, R. C. Folsom, Y. Sininger, B. Cone-Wesson, B. R. Vohr, K. Mascher, and K. Fletcher. 2000. Identification of neonatal hearing impairment: Evaluation of transient evoked otoacoustic emission, distortion product otoacoustic emission, and auditory brain stem response test performance. Ear and Hearing 21: 508–528.
Oehlert, G. W. 1992. A note on the delta method. American Statistician 46: 27–29.
Phillips, P. C. B., and J. Y. Park. 1988. On the formulation of Wald tests of nonlinear restrictions. Econometrica 56: 1065–1083.
Stover, L., M. P. Gorga, S. T. Neely, and D. Montoya. 1996. Toward optimizing the clinical utility of distortion product otoacoustic emission measurements. Journal of the Acoustical Society of America 100: 956–967.
Also see [R] rocreg — Receiver operating characteristic (ROC) regression [R] rocreg postestimation — Postestimation tools for rocreg [U] 20 Estimation and postestimation commands Title roctab — Nonparametric ROC analysis Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax roctab refvar classvar if roctab options in weight , options Description Main lorenz binomial nolabel detail table bamber hanley graph norefline summary specificity level(#) report Gini and Pietra indices calculate exact binomial confidence intervals display numeric codes rather than value labels show details on sensitivity/specificity for each cutpoint display the raw data in a 2 × k contingency table calculate standard errors by using the Bamber method calculate standard errors by using the Hanley method graph the ROC curve suppress plotting the 45-degree reference line report the area under the ROC curve graph sensitivity versus specificity set confidence level; default is level(95) Plot plotopts(plot options) affect rendition of the ROC curve Reference line rlopts(cline options) affect rendition of the reference line Add plots addplot(plot) add other plots to the generated graph Y axis, X axis, Titles, Legend, Overall twoway options any options other than by() documented in [G-3] twoway options fweights are allowed; see [U] 11.1.6 weight. plot options Description marker options marker label options cline options change look of markers (color, size, etc.) add marker labels; change look or position change the look of the line 2048 roctab — Nonparametric ROC analysis 2049 Menu Statistics > Epidemiology and related > ROC analysis > Nonparametric ROC analysis without covariates Description The above command is used to perform receiver operating characteristic (ROC) analyses with rating and discrete classification data. The two variables refvar and classvar must be numeric. The reference variable indicates the true state of the observation, such as diseased and nondiseased or normal and abnormal, and must be coded as 0 and 1. The rating or outcome of the diagnostic test or test modality is recorded in classvar, which must be at least ordinal, with higher values indicating higher risk. roctab performs nonparametric ROC analyses. By default, roctab calculates the area under the ROC curve. Optionally, roctab can plot the ROC curve, display the data in tabular form, and produce Lorenz-like plots. See [R] rocfit for a command that fits maximum-likelihood ROC models. Options Main lorenz specifies that Gini and Pietra indices be reported. Optionally, graph will plot the Lorenz-like curve. binomial specifies that exact binomial confidence intervals be calculated. nolabel specifies that numeric codes be displayed rather than value labels. detail outputs a table displaying the sensitivity, specificity, the percentage of subjects correctly classified, and two likelihood ratios for each possible cutpoint of classvar. table outputs a 2 × k contingency table displaying the raw data. bamber specifies that the standard error for the area under the ROC curve be calculated using the method suggested by Bamber (1975). Otherwise, standard errors are obtained as suggested by DeLong, DeLong, and Clarke-Pearson (1988). hanley specifies that the standard error for the area under the ROC curve be calculated using the method suggested by Hanley and McNeil (1982). Otherwise, standard errors are obtained as suggested by DeLong, DeLong, and Clarke-Pearson (1988). 
graph produces graphical output of the ROC curve. If lorenz is specified, graphical output of a Lorenz-like curve will be produced. norefline suppresses plotting the 45-degree reference line from the graphical output of the ROC curve. summary reports the area under the ROC curve, its standard error, and its confidence interval. If lorenz is specified, Lorenz indices are reported. This option is needed only when also specifying graph. specificity produces a graph of sensitivity versus specificity instead of sensitivity versus (1 − specificity). specificity implies graph. level(#) specifies the confidence level, as a percentage, for the confidence intervals. The default is level(95) or as set by set level; see [R] level. 2050 roctab — Nonparametric ROC analysis Plot plotopts(plot options) affects the rendition of the plotted ROC curve—the curve’s plotted points connected by lines. The plot options can affect the size and color of markers, whether and how the markers are labeled, and whether and how the points are connected; see [G-3] marker options, [G-3] marker label options, and [G-3] cline options. Reference line rlopts(cline options) affects the rendition of the reference line; see [G-3] cline options. Add plots addplot(plot) provides a way to add other plots to the generated graph; see [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Remarks and examples Remarks are presented under the following headings: Introduction Nonparametric ROC curves Lorenz-like curves Introduction The roctab command provides nonparametric estimation of the ROC for a given classifier and true-status reference variable. The Lorenz curve functionality of roctab, which provides an alternative to standard ROC analysis, is discussed in Lorenz-like curves. See Pepe (2003) for a discussion of ROC analysis. Pepe has posted Stata datasets and programs used to reproduce results presented in the book (http://www.stata.com/bookstore/pepe.html). Nonparametric ROC curves The points on the nonparametric ROC curve are generated using each possible outcome of the diagnostic test as a classification cutpoint and computing the corresponding sensitivity and 1−specificity. These points are then connected by straight lines, and the area under the resulting ROC curve is computed using the trapezoidal rule. Example 1 Hanley and McNeil (1982) presented data from a study in which a reviewer was asked to classify, using a five-point scale, a random sample of 109 tomographic images from patients with neurological problems. The rating scale was as follows: 1 = definitely normal, 2 = probably normal, 3 = questionable, 4 = probably abnormal, and 5 = definitely abnormal. The true disease status was normal for 58 of the patients and abnormal for the remaining 51 patients. roctab — Nonparametric ROC analysis 2051 Here we list 9 of the 109 observations: . use http://www.stata-press.com/data/r13/hanley . list disease rating in 1/9 disease rating 1. 2. 3. 4. 5. 1 0 1 0 0 5 1 5 4 1 6. 7. 8. 9. 0 1 0 0 3 5 5 1 For each observation, disease identifies the true disease status of the subject (0 = normal, 1 = abnormal), and rating contains the classification value assigned by the reviewer. We can use roctab to calculate and plot the nonparametric ROC curve by specifying both the summary and graph options. 
By also specifying the table option, we obtain a contingency table summarizing our dataset. . roctab disease rating, table graph summary rating 1 2 3 disease 4 5 Total 0 1 33 3 6 2 6 2 11 11 2 33 58 51 Total 36 8 8 22 35 109 ROC Area 109 0.8932 Std. Err. 0.0307 Asymptotic Normal [95% Conf. Interval] 0.83295 0.95339 0.00 0.25 Sensitivity 0.50 0.75 1.00 Obs 0.00 0.25 0.50 1 − Specificity 0.75 1.00 Area under ROC curve = 0.8932 By default, roctab reports the area under the curve, its standard error, and its confidence interval. The graph option can be used to plot the ROC curve. 2052 roctab — Nonparametric ROC analysis The ROC curve is plotted by computing the sensitivity and specificity using each value of the rating variable as a possible cutpoint. A point is plotted on the graph for each of the cutpoints. These plotted points are joined by straight lines to form the ROC curve, and the area under the ROC curve is computed using the trapezoidal rule. We can tabulate the computed sensitivities and specificities for each of the possible cutpoints by specifying detail. . roctab disease rating, detail Detailed report of sensitivity and specificity Cutpoint ( ( ( ( ( ( >= >= >= >= >= > 1 2 3 4 5 5 Sensitivity Specificity Correctly Classified 100.00% 94.12% 90.20% 86.27% 64.71% 0.00% 0.00% 56.90% 67.24% 77.59% 96.55% 100.00% 46.79% 74.31% 77.98% 81.65% 81.65% 53.21% Obs ROC Area Std. Err. 109 0.8932 ) ) ) ) ) ) 0.0307 LR+ 1.0000 2.1835 2.7534 3.8492 18.7647 LR- 0.1034 0.1458 0.1769 0.3655 1.0000 Asymptotic Normal [95% Conf. Interval] 0.83295 0.95339 Each cutpoint in the table indicates the ratings used to classify tomographs as being from an abnormal subject. For example, the first cutpoint (>= 1) indicates that all tomographs rated as 1 or greater are classified as coming from abnormal subjects. Because all tomographs have a rating of 1 or greater, all are considered abnormal. Consequently, all abnormal cases are correctly classified (sensitivity = 100%), but none of the normal patients is classified correctly (specificity = 0%). For the second cutpoint (>=2), tomographs with ratings of 1 are classified as normal, and those with ratings of 2 or greater are classified as abnormal. The resulting sensitivity and specificity are 94.12% and 56.90%, respectively. Using this cutpoint, we correctly classified 74.31% of the 109 tomographs. Similar interpretations can be used on the remaining cutpoints. As mentioned, each cutpoint corresponds to a point on the nonparametric ROC curve. The first cutpoint (>=1) corresponds to the point at (1,1), and the last cutpoint (> 5) corresponds to the point at (0,0). detail also reports two likelihood ratios suggested by Choi (1998): the likelihood ratio for a positive test result (LR+) and the likelihood ratio for a negative test result (LR–). The LR+ is the ratio of the probability of a positive test among the truly positive subjects to the probability of a positive test among the truly negative subjects. The LR– is the ratio of the probability of a negative test among the truly positive subjects to the probability of a negative test among the truly negative subjects. Choi points out that LR+ corresponds to the slope of the line from the origin to the point on the ROC curve determined by the cutpoint. Similarly, LR– corresponds to the slope from the point (1,1) to the point on the ROC curve determined by the cutpoint. 
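As a quick arithmetic check of these definitions, the likelihood ratios in the detail table can be reproduced directly from the reported sensitivity and specificity. For the cutpoint >= 2, for example (a sketch using the rounded values shown in the table),

* Sketch: LR+ and LR- for the cutpoint >= 2, from the rounded table values
display "LR+ = " %7.4f 0.9412/(1 - 0.5690)      // sensitivity/(1 - specificity)
display "LR- = " %7.4f (1 - 0.9412)/0.5690      // (1 - sensitivity)/specificity

which agrees with the tabulated 2.1835 and 0.1034 up to the rounding of the displayed sensitivity and specificity.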
By default, roctab calculates the standard error for the area under the curve by using an algorithm suggested by DeLong, DeLong, and Clarke-Pearson (1988) and asymptotic normal confidence intervals. Optionally, standard errors based on methods suggested by Bamber (1975) or Hanley and McNeil (1982) can be computed by specifying bamber or hanley, respectively, and an exact binomial confidence interval can be obtained by specifying binomial. roctab — Nonparametric ROC analysis . roctab disease rating, bamber ROC Bamber Obs Area Std. Err. 109 0.8932 0.0306 . roctab disease rating, hanley binomial ROC Hanley Obs Area Std. Err. 109 0.8932 0.0320 2053 Asymptotic Normal [95% Conf. Interval] 0.83317 0.95317 Binomial Exact [95% Conf. Interval] 0.81559 0.94180 Lorenz-like curves For applications where it is known that the risk status increases or decreases monotonically with increasing values of the diagnostic test, the ROC curve and associated indices are useful in assessing the overall performance of a diagnostic test. When the risk status does not vary monotonically with increasing values of the diagnostic test, however, the resulting ROC curve can be nonconvex and its indices can be unreliable. For these situations, Lee (1999) proposed an alternative to the ROC analysis based on Lorenz-like curves and the associated Pietra and Gini indices. Lee (1999) mentions at least three specific situations where results from Lorenz curves are superior to those obtained from ROC curves: 1) a diagnostic test with similar means but very different standard deviations in the abnormal and normal populations, 2) a diagnostic test with bimodal distributions in either the normal or abnormal population, and 3) a diagnostic test distributed symmetrically in the normal population and skewed in the abnormal. When the risk status increases or decreases monotonically with increasing values of the diagnostic test, the ROC and Lorenz curves yield interchangeable results. Example 2 To illustrate the use of the lorenz option, we constructed a fictitious dataset that yields results similar to those presented in Table III of Lee (1999). The data assume that a 12-point rating scale was used to classify 442 diseased and 442 healthy subjects. We list a few of the observations. . use http://www.stata-press.com/data/r13/lorenz, clear . list in 1/7, noobs sep(0) disease class pop 0 1 0 0 0 0 1 5 11 6 3 10 2 4 66 17 85 19 19 7 16 The data consist of 24 observations: 12 observations from diseased individuals and 12 from nondiseased individuals. Each observation corresponds to one of the 12 classification values of the rating-scale variable, class. The number of subjects represented by each observation is given by the pop variable, making this a frequency-weighted dataset. The data were generated assuming a binormal distribution of the latent variable with similar means for the normal and abnormal populations but with the standard deviation for the abnormal population five times greater than that of the normal population. 2054 roctab — Nonparametric ROC analysis . roctab disease class [fweight=pop], graph summary ROC Area 884 0.5774 Std. Err. 0.0215 Asymptotic Normal [95% Conf. Interval] 0.53517 0.61959 0.00 0.25 Sensitivity 0.50 0.75 1.00 Obs 0.00 0.25 0.50 1 − Specificity 0.75 1.00 Area under ROC curve = 0.5774 The resulting ROC curve is nonconvex or, as termed by Lee, “wiggly”. Lee argues that for this and similar situations, the Lorenz curve and indices are preferred. . 
roctab disease class [fweight=pop], lorenz summary graph

Lorenz curve
Pietra index =    0.6493
Gini index   =    0.7441

[Lorenz curve graph: cumulative % of disease=1 versus cumulative % of disease=0]

Like ROC curves, a more bowed Lorenz curve suggests a better diagnostic test. This bowedness is quantified by the Pietra index, which is geometrically equivalent to twice the largest triangle that can be inscribed in the area between the curve and the diagonal line, and the Gini index, which is equivalent to twice the area between the Lorenz curve and the diagonal. Lee (1999) provides several additional interpretations for the Pietra and Gini indices.

Stored results

roctab stores the following in r():

Scalars
    r(N)         number of observations
    r(se)        standard error for the area under the ROC curve
    r(lb)        lower bound of CI for the area under the ROC curve
    r(ub)        upper bound of CI for the area under the ROC curve
    r(area)      area under the ROC curve
    r(pietra)    Pietra index
    r(gini)      Gini index

Methods and formulas

Assume that we applied a diagnostic test to each of N_n normal and N_a abnormal subjects. Further assume that the higher the outcome value of the diagnostic test, the higher the risk of the subject being abnormal. Let \hat\theta be the estimated area under the curve, and let X_i, i = 1, 2, \dots, N_a, and Y_j, j = 1, 2, \dots, N_n, be the values of the diagnostic test for the abnormal and normal subjects, respectively.

The points on the nonparametric ROC curve are generated using each possible outcome of the diagnostic test as a classification cutpoint and computing the corresponding sensitivity and 1 − specificity. These points are then connected by straight lines, and the area under the resulting ROC curve is computed using the trapezoidal rule.

The default standard error for the area under the ROC curve is computed using the algorithm described by DeLong, DeLong, and Clarke-Pearson (1988). For each abnormal subject, i, define

    V_{10}(X_i) = \frac{1}{N_n} \sum_{j=1}^{N_n} \psi(X_i, Y_j)

and for each normal subject, j, define

    V_{01}(Y_j) = \frac{1}{N_a} \sum_{i=1}^{N_a} \psi(X_i, Y_j)

where

    \psi(X, Y) = \begin{cases} 1 & Y < X \\ 1/2 & Y = X \\ 0 & Y > X \end{cases}

Define

    S_{10} = \frac{1}{N_a - 1} \sum_{i=1}^{N_a} \{ V_{10}(X_i) - \hat\theta \}^2

and

    S_{01} = \frac{1}{N_n - 1} \sum_{j=1}^{N_n} \{ V_{01}(Y_j) - \hat\theta \}^2

The variance of the estimated area under the ROC curve is given by

    \mathrm{var}(\hat\theta) = \frac{1}{N_a} S_{10} + \frac{1}{N_n} S_{01}

The hanley standard error for the area under the ROC curve is computed using the algorithm described by Hanley and McNeil (1982). It requires the calculation of two quantities: Q_1 is Pr(two randomly selected abnormal subjects will both have a higher score than a randomly selected normal subject), and Q_2 is Pr(one randomly selected abnormal subject will have a higher score than any two randomly selected normal subjects). The Hanley and McNeil variance of the estimated area under the ROC curve is

    \mathrm{var}(\hat\theta) = \frac{ \hat\theta(1 - \hat\theta) + (N_a - 1)(Q_1 - \hat\theta^2) + (N_n - 1)(Q_2 - \hat\theta^2) }{ N_a N_n }

The bamber standard error for the area under the ROC curve is computed using the algorithm described by Bamber (1975).
For any two Y values, Yj and Yk , and any Xi value, define byyx = p(Yj , Yk < Xi ) + p(Xi < Yj , Yk ) − 2p(Yj < Xi < Yk ) and similarly, for any two X values, Xi and Xl , and any Yj value, define bxxy = p(Xi , Xl < Yj ) + p(Yj < Xi , Xl ) − 2p(Xi < Yj < Xl ) Bamber’s unbiased estimate of the variance for the area under the ROC curve is b = var(θ) 1 2 b (Na −1)(Nn −1){p(X 6= Y )+(Na −1)bxxy +(Nn −1)byyx −4(Na +Nn −1)(θ−0.5) } 4 Asymptotic confidence intervals are constructed and reported by default, assuming a normal distribution for the area under the ROC curve. Exact binomial confidence intervals are calculated as described in [R] ci, with p equal to the area under the ROC curve. References Bamber, D. 1975. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology 12: 387–415. Choi, B. C. K. 1998. Slopes of a receiver operating characteristic curve and likelihood ratios for a diagnostic test. American Journal of Epidemiology 148: 1127–1132. roctab — Nonparametric ROC analysis 2057 Cleves, M. A. 1999. sg120: Receiver operating characteristic (ROC) analysis. Stata Technical Bulletin 52: 19–33. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 212–229. College Station, TX: Stata Press. . 2000. sg120.2: Correction to roccomp command. Stata Technical Bulletin 54: 26. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 231. College Station, TX: Stata Press. . 2002a. Comparative assessment of three common algorithms for estimating the variance of the area under the nonparametric receiver operating characteristic curve. Stata Journal 2: 280–289. . 2002b. From the help desk: Comparing areas under receiver operating characteristic curves from two or more probit or logit models. Stata Journal 2: 301–313. DeLong, E. R., D. M. DeLong, and D. L. Clarke-Pearson. 1988. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 44: 837–845. Erdreich, L. S., and E. T. Lee. 1981. Use of relative operating characteristic analysis in epidemiology: A method for dealing with subjective judgment. American Journal of Epidemiology 114: 649–662. Hanley, J. A., and B. J. McNeil. 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143: 29–36. Harbord, R. M., and P. Whiting. 2009. metandi: Meta-analysis of diagnostic accuracy using hierarchical logistic regression. Stata Journal 9: 211–229. Juul, S., and M. Frydenberg. 2014. An Introduction to Stata for Health Researchers. 4th ed. College Station, TX: Stata Press. Lee, W. C. 1999. Probabilistic analysis of global performances of diagnostic tests: Interpreting the Lorenz curve-based summary measures. Statistics in Medicine 18: 455–471. Ma, G., and W. J. Hall. 1993. Confidence bands for the receiver operating characteristic curves. Medical Decision Making 13: 191–197. Pepe, M. S. 2003. The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press. Reichenheim, M. E., and A. Ponce de Leon. 2002. Estimation of sensitivity and specificity arising from validity studies with incomplete design. Stata Journal 2: 267–279. Seed, P. T., and A. Tobı́as. 2001. sbe36.1: Summary statistics for diagnostic tests. Stata Technical Bulletin 59: 25–27. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 90–93. College Station, TX: Stata Press. Tobı́as, A. 2000. 
sbe36: Summary statistics report for diagnostic tests. Stata Technical Bulletin 56: 16–18. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 87–90. College Station, TX: Stata Press. Working, H., and H. Hotelling. 1929. Application of the theory of error to the interpretation of trends. Journal of the American Statistical Association 24 (Suppl.): 73–85. Also see [R] logistic postestimation — Postestimation tools for logistic [R] roc — Receiver operating characteristic (ROC) analysis [R] roccomp — Tests of equality of ROC areas [R] rocfit — Parametric ROC models [R] rocreg — Receiver operating characteristic (ROC) regression Title rologit — Rank-ordered logistic regression Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Options Acknowledgment Syntax rologit depvar indepvars if in weight , group(varname) options Description options Model ∗ group(varname) offset(varname) incomplete(#) reverse notestrhs ties(spec) identifier variable that links the alternatives include varname in model with coefficient constrained to 1 use # to code unranked alternatives; default is incomplete(0) reverse the preference order keep right-hand-side variables that do not vary within group method to handle ties: exactm, breslow, efron, or none SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, bootstrap, or jackknife Reporting level(#) display options set confidence level; default is level(95) control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics ∗ group(varname) is required. indepvars may contain factor variables; see [U] 11.4.3 Factor variables. bootstrap, by, fp, jackknife, rolling, and statsby are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. fweights, iweights, and pweights are allowed, except with ties(efron); see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Ordinal outcomes > Rank-ordered logistic regression 2058 rologit — Rank-ordered logistic regression 2059 Description rologit fits the rank-ordered logistic regression model by maximum likelihood (Beggs, Cardell, and Hausman 1981). This model is also known as the Plackett–Luce model (Marden 1995), as the exploded logit model (Punj and Staelin 1978), and as the choice-based method of conjoint analysis (Hair et al. 2010). rologit expects the data to be in long form, similar to clogit (see [R] clogit), in which each of the ranked alternatives forms an observation; all observations related to an individual are linked together by the variable that you specify in the group() option. The distinction from clogit is that depvar in rologit records the rankings of the alternatives, whereas for clogit, depvar marks only the best alternative by a value not equal to zero. rologit interprets equal scores of depvar as ties. The ranking information may be incomplete “at the bottom” (least preferred alternatives). That is, unranked alternatives may be coded as 0 or as a common value that may be specified with the incomplete() option. If your data record only the unique best alternative, rologit fits the same model as clogit. 
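As a concrete illustration of this layout, the vignette dataset used in the examples below has one observation per alternative; a sketch of how to inspect it (assuming web access to the Stata Press site) is

* Sketch: the long-form layout that rologit expects
use http://www.stata-press.com/data/r13/evignet, clear
sort caseid pref
list caseid pref female age grades in 1/10, sepby(caseid)   // alternatives linked by caseid

Each block of observations sharing a caseid value is one ranking task; rologit is then called with group(caseid), as in the examples that follow.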
Options Model group(varname) is required, and it specifies the identifier variable (numeric or string) that links the alternatives for an individual, which have been compared and rank ordered with respect to one another. offset(varname); see [R] estimation options. incomplete(#) specifies the numeric value used to code alternatives that are not ranked. It is assumed that unranked alternatives are less preferred than the ranked alternatives (that is, the data record the ranking of the most preferred alternatives). It is not assumed that subjects are indifferent between the unranked alternatives. # defaults to 0. reverse specifies that in the preference order, a higher number means a less attractive alternative. The default is that higher values indicate more attractive alternatives. The rank-ordered logit model is not symmetric in the sense that reversing the ordering simply leads to a change in the signs of the coefficients. notestrhs suppresses the test that the independent variables vary within (at least some of) the groups. Effects of variables that are always constant are not identified. For instance, a rater’s gender cannot directly affect his or her rankings; it could affect the rankings only via an interaction with a variable that does vary over alternatives. ties(spec) specifies the method for handling ties (indifference between alternatives) (see [ST] stcox for details): exact marginal likelihood (default) exactm breslow Breslow’s method (default if pweights specified) efron Efron’s method (default if robust VCE) none no ties allowed SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. If ties(exactm) is specified, vcetype may be only oim, bootstrap, or jackknife. 2060 rologit — Rank-ordered logistic regression Reporting level(#); see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: iterate(#), trace, no log, tolerance(#), ltolerance(#), nrtolerance(#), and nonrtolerance; see [R] maximize. These options are seldom used. The following option is available with rologit but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples The rank-ordered logit model can be applied to analyze how decision makers combine attributes of alternatives into overall evaluations of the attractiveness of these alternatives. The model generalizes a version of McFadden’s choice model without alternative-specific covariates, as fit by the clogit command. It uses richer information about the comparison of alternatives, namely, how decision-makers rank the alternatives rather than just specifying the alternative that they like best. Remarks are presented under the following headings: Examples Comparing respondents Incomplete rankings and ties Clustered choice data Comparison of rologit and clogit On reversals of rankings Examples A popular way to study employer preferences for characteristics of employees is the quasiexperimental “vignette method”. As an example, we consider the research by de Wolf on the labor market position of social science graduates (de Wolf 2000). 
This study addresses how the educational portfolio (for example, general skills versus specific knowledge) affects short-term and long-term labor-market opportunities. De Wolf asked 22 human resource managers (the respondents) to rank order the six most suitable candidates of 20 fictitious applicants and to rank order these six candidates for three jobs, namely, 1) researcher, 2) management trainee, and 3) policy adviser. Applicants were described by 10 attributes, including their age, gender, details of their portfolio, and work experience. In this example, we analyze a subset of the data. Also, to simplify the output, we drop, at random, 10 nonselected applicants per case. The resulting dataset includes 29 cases, consisting of 10 applicants each. The data are in long form: observations correspond to alternatives (the applications), and alternatives that figured in one decision task are identified by the variable caseid. We list the observations for caseid==7, in which the respondent considered applicants for a social-science research position. rologit — Rank-ordered logistic regression 2061 . use http://www.stata-press.com/data/r13/evignet (Vignet study employer prefs (Inge de Wolf 2000)) . list pref female age grades edufit workexp boardexp if caseid==7, noobs pref female age grades edufit workexp boardexp 0 0 0 0 1 yes no no yes no 28 25 25 25 25 A/B C/D C/D C/D C/D no yes yes no yes none one year none internship one year no no yes yes yes 2 3 4 5 6 no yes yes no yes 25 25 25 25 28 A/B A/B A/B A/B A/B yes yes yes yes yes none one year none internship one year no no yes no yes Here six applicants were selected. The rankings are stored in the variable pref, where a value of 6 corresponds to “best among the candidates”, a value of 5 corresponds to “second-best among the candidates”, etc. The applicants with a ranking of 0 were not among the best six candidates for the job. The respondent was not asked to express his preferences among these four applicants, but by the elicitation procedure, it is known that he ranks these four applicants below the six selected applicants. The best candidate was a female, 28 years old, with education fitting the job, with good grades (A/B), with 1 year of work experience, and with experience being a board member of a fraternity, a sports club, etc. The profiles of the other candidates read similarly. Here the respondent completed the task; that is, he selected and rank ordered the six most suitable applicants. Sometimes the respondent performed only part of the task. . list pref female age grades edufit workexp boardexp if caseid==18, noobs pref female age grades edufit workexp boardexp 0 0 0 0 2 no no no yes yes 25 25 28 25 25 C/D C/D C/D A/B A/B yes no no no no none internship internship one year none yes yes yes no yes 2 2 5 5 5 no no no no yes 25 25 25 25 25 A/B A/B A/B A/B A/B no no no no no none one year none none none yes yes yes yes no The respondent selected the six best candidates and segmented these six candidates into two groups: one group with the three best candidates, and a second group of three candidates that were “still acceptable”. The numbers 2 and 5, indicating these two groups, are arbitrary apart from the implied ranking of the groups. The ties between the candidates in a group indicate that the respondent was not able to rank the candidates within the group. 
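A quick way to see this coding for any one decision task is to tabulate the rankings within the case (a sketch, assuming the evignet data used above are in memory):

* Sketch: inspect the ranking codes for one decision task
tab pref if caseid == 18    // 0 = not among the six selected; repeated nonzero values = a tied group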
The purpose of the vignette experiment was to explore and test hypotheses about which of the employees’ attributes are valued by employers, how these attributes are weighted depending on the type of job (described by variable job in these data), etc. In the psychometric tradition of Thurstone (1927), value is assumed to be linear in the attributes, with the coefficients expressing the direction and weight of the attributes. In addition, it is assumed that valuation is to some extent a random procedure, 2062 rologit — Rank-ordered logistic regression captured by an additive random term. For instance, if value depends only on an applicant’s age and gender, we would have value(femalei , agei ) = β1 femalei + β2 agei + i where the random residual, i , captures all omitted attributes. Thus β1 > 0 means that the employer assigns higher value to a woman than to a man. Given this conceptualization of value, it is straightforward to model the decision (selection) among alternatives or the ranking of alternatives: the alternative with the highest value is selected (chosen), or the alternatives are ranked according to their value. To complete the specification of a model of choice and of ranking, we assume that the random residual i follows an “extreme value distribution of type I”, introduced in this context by Luce (1959). This specific assumption is made mostly for computational convenience. This model is known by many names. Among others, it is known as the rank-ordered logit model in economics (Beggs, Cardell, and Hausman 1981), as the exploded logit model in marketing research (Punj and Staelin 1978), as the choice-based conjoint analysis model (Hair et al. 2010), and as the Plackett–Luce model (Marden 1995). The model coefficients are estimated using the method of maximum likelihood. The implementation in rologit uses an analogy between the rank-ordered logit model and the Cox regression model observed by Allison and Christakis (1994); see Methods and formulas. The rologit command implements this method for rankings, whereas clogit deals with the variant of choices, that is, only the most highly valued alternative is recorded. In the latter case, the model is also known as the Luce–McFadden choice model. In fact, when the data record the most preferred (unique) alternative and no additional ranking information about preferences is available, rologit and clogit return the same information, though formatted somewhat differently. . rologit pref female age grades edufit workexp boardexp if job==1, group(caseid) Iteration 0: log likelihood Iteration 1: log likelihood Iteration 2: log likelihood Iteration 3: log likelihood Iteration 4: log likelihood Refining estimates: Iteration 0: log likelihood = -95.41087 = -71.180903 = -68.47734 = -68.345918 = -68.345389 = -68.345389 Rank-ordered logistic regression Group variable: caseid Number of obs Number of groups = = 80 8 No ties in data Obs per group: min = avg = max = 10 10.00 10 LR chi2(6) Prob > chi2 Log likelihood = -68.34539 pref Coef. female age grades edufit workexp boardexp -.4487287 -.0984926 3.064534 .7658064 1.386427 .6944377 Std. Err. .3671307 .0820473 .6148245 .3602366 .292553 .3762596 z -1.22 -1.20 4.98 2.13 4.74 1.85 P>|z| 0.222 0.230 0.000 0.034 0.000 0.065 = = 54.13 0.0000 [95% Conf. 
Interval] -1.168292 -.2593024 1.8595 .0597556 .8130341 -.0430176 .2708343 .0623172 4.269568 1.471857 1.959821 1.431893 Focusing only on the variables whose coefficients are significant at the 10% level (we are analyzing 8 respondents only!), the estimated value of an applicant for a job of type 1 (research positions) can be written as value = 3.06*grades + 0.77*edufit + 1.39*workexp + 0.69*boardexp rologit — Rank-ordered logistic regression 2063 Thus employers prefer applicants for a research position (job==1) whose educational portfolio fits the job, who have better grades, who have more relevant work experience, and who have (extracurricular) board experience. They do not seem to care much about the sex and age of applicants, which is comforting. Given these estimates of the valuation by employers, we consider the probabilities that each of the applications is ranked first. Under the assumption that the i are independent and follow an extreme value type I distribution, Luce (1959) showed that the probability, πi , that alternative i is valued higher than alternatives 2, . . . , k can be written in the multinomial logit form exp(valuei ) πi = Pr {value1 > max(value2 , . . . , valuem )} = Pk j=1 exp(valuei ) The probability of observing a specific ranking can be written as the product of such terms, representing a sequential decision interpretation in which the rater first chooses the most preferred alternative, and then the most preferred alternative among the rest, etc. The probabilities for alternatives to be ranked first are conveniently computed by predict. . predict p if e(sample) (option pr assumed; conditional probability that alternative is ranked first) (210 missing values generated) . sort caseid pref p . list pref p grades edufit workexp boardexp if caseid==7, noobs pref p grades edufit workexp boardexp 0 0 0 0 1 .0027178 .0032275 .0064231 .0217202 .0434964 C/D C/D A/B C/D C/D yes no no yes yes none internship none one year one year yes yes no no yes 2 3 4 5 6 .0290762 .2970933 .0371747 .1163203 .4427504 A/B A/B A/B A/B A/B yes yes yes yes yes none one year none internship one year no no yes no yes There clearly is a positive relation between the stated ranking and the predicted probabilities for alternatives to be ranked first, but the association is not perfect. In fact, we would not have expected a perfect association, as the model specifies a (nondegenerate) probability distribution over the possible rankings of the alternatives. These predictions for sets of 10 candidates can also be used to make predictions for subsets of the alternatives. For instance, suppose that only the last three candidates listed in this table would be available. According to parameter estimates of the rank-ordered logit model, the probability that the last of these candidates is selected equals 0.443/(0.037 + 0.116 + 0.443) = 0.743. Comparing respondents The rologit model assumes that all respondents, HR managers in large public-sector organizations in The Netherlands, use the same valuation function; that is, they apply the same decision weights. This is the substantive interpretation of the assumption that the β ’s are constant between the respondents. To probe this assumption, we could test whether the coefficients vary between different groups of respondents. 
For a metric characteristic of the HR manager, such as firmsize, we can consider a trend-model in the valuation weights, 2064 rologit — Rank-ordered logistic regression βij = αi0 + αi1 firmsizej and we can test that the slopes αi1 of firmsize are zero. . generate firmsize = employer . rologit pref edufit grades workexp c.firmsize#c.(edufit grades workexp boardexp) > if job==1, group(caseid) nolog Rank-ordered logistic regression Number of obs = 80 Group variable: caseid Number of groups = 8 No ties in data Obs per group: min = 10 avg = 10.00 max = 10 LR chi2(7) = 57.17 Log likelihood = -66.82346 Prob > chi2 = 0.0000 pref Coef. Std. Err. z P>|z| edufit grades workexp [95% Conf. Interval] 1.29122 6.439776 1.23342 1.13764 2.288056 .8065067 1.13 2.81 1.53 0.256 0.005 0.126 -.9385127 1.955267 -.347304 3.520953 10.92428 2.814144 c.firmsize# c.edufit -.0173333 .0711942 -0.24 0.808 -.1568714 .1222048 c.firmsize# c.grades -.2099279 .1218251 -1.72 0.085 -.4487008 .028845 c.firmsize# c.workexp .0097508 .0525081 0.19 0.853 -.0931632 .1126649 c.firmsize# c.boardexp .0382304 .0227545 1.68 0.093 -.0063676 .0828284 . testparm c.firmsize#c.(edufit grades workexp boardexp) ( ( ( ( 1) 2) 3) 4) c.firmsize#c.edufit = 0 c.firmsize#c.grades = 0 c.firmsize#c.workexp = 0 c.firmsize#c.boardexp = 0 chi2( 4) = 7.14 Prob > chi2 = 0.1288 The Wald test that the slopes of the interacted firmsize variables are jointly zero provides no evidence upon which we would reject the null hypothesis; that is, we do not find evidence against the assumption of constant valuation weights of the attributes by firms of different size. We did not enter firmsize as a predictor variable. Characteristics of the decision-making agent do not vary between alternatives. Thus an additive effect of these characteristics on the valuation of alternatives does not affect the agent’s ranking of alternatives and his choice. Consequently the coefficient of firmsize is not identified. rologit would in fact have diagnosed the problem and dropped firmsize from the analysis. Diagnosing this problem can slow the estimation considerably; the test may be suppressed by specifying the notestrhs option. rologit — Rank-ordered logistic regression 2065 Incomplete rankings and ties rologit allows incomplete rankings and ties in the rankings as proposed by Allison and Christakis (1994). rologit permits rankings to be incomplete only “at the bottom”; namely, that the ranking of the least attractive alternatives for subjects may not be known—do not confuse this with the situation that a subject is indifferent between these alternatives. This form of incompleteness occurred in the example discussed here, because the respondents were instructed to select and rank only the top six alternatives. It may also be that respondents refused to rank the alternatives that are very unattractive. rologit does not allow other forms of incompleteness, for instance, data in which respondents indicate which of four cars they like best, and which one they like least, but not how they rank the two intermediate cars. Another example of incompleteness that cannot be analyzed with rologit is data in which respondents select the three alternatives they like best but are not requested to express their preferences among the three selected alternatives. rologit also permits ties in rankings. 
rologit assumes that if a subject expresses a tie between two or more alternatives, he or she actually holds one particular strict preference ordering, but with all possibilities of a strict ordering consistent with the expressed weak ordering being equally probable. For instance, suppose that a respondent ranks alternative 1 highest. He prefers alternatives 2 and 3 over alternative 4, and he is indifferent between alternatives 2 and 3. We assume that this respondent either has the strict preference ordering 1 > 2 > 3 > 4 or 1 > 3 > 2 > 4, with both possibilities being equally likely. From a psychometric perspective, it may actually be more appropriate to also assume that the alternatives 2 and 3 are close; for instance, the difference between the associated valuations (utilities) is less than some threshold or minimally discernible difference. Computationally, however, this is a more demanding model. Clustered choice data We have seen that applicants with work experience are in a relatively favorable position. To test whether the effects of work experience vary between the jobs, we can include interactions between the type of job and the attributes of applicants. Such interactions can be obtained using factor variables. Because some HR managers contributed data for more than one job, we cannot assume that their selection decisions for different jobs are independent. We can account for this by specifying the vce(cluster clustvar) option. By treating choice data as incomplete ranking data with only the most preferred alternative marked, rologit may be used to estimate the model parameters for clustered choice data. 2066 rologit — Rank-ordered logistic regression . rologit pref job##c.(female grades edufit workexp), group(caseid) > vce(cluster employer) nolog 2.job 3.job omitted because of no within-caseid variance Rank-ordered logistic regression Group variable: caseid Ties handled via the Efron method Number of obs Number of groups Obs per group: min avg max Wald chi2(12) Prob > chi2 Log pseudolikelihood = -296.3855 = = = = = = = 290 29 10 10.00 10 79.57 0.0000 (Std. Err. adjusted for 22 clusters in employer) Robust Std. Err. pref Coef. z P>|z| [95% Conf. Interval] job managemen.. policy ad.. 0 0 (omitted) (omitted) female grades edufit workexp -.2286609 2.812555 .7027757 1.224453 .2519883 .8517878 .2398396 .3396773 -0.91 3.30 2.93 3.60 0.364 0.001 0.003 0.000 -.7225489 1.143081 .2326987 .5586978 .2652272 4.482028 1.172853 1.890208 job#c.female managemen.. policy ad.. .0293815 .1195538 .4829166 .3688844 0.06 0.32 0.951 0.746 -.9171177 -.6034463 .9758808 .8425538 job#c.grades managemen.. policy ad.. -2.364247 -1.88232 1.005963 .8995277 -2.35 -2.09 0.019 0.036 -4.335898 -3.645362 -.3925961 -.1192782 job#c.edufit managemen.. policy ad.. -.267475 -.3182995 .4244964 .3689972 -0.63 -0.86 0.529 0.388 -1.099473 -1.041521 .5645226 .4049217 job# c.workexp managemen.. policy ad.. -.6870077 -.4656993 .3692946 .4515712 -1.86 -1.03 0.063 0.302 -1.410812 -1.350763 .0367964 .4193639 The parameter estimates for the first job type are very similar to those that would have been obtained from an analysis isolated to these data. Differences are due only to an implied change in the method of handling ties. With clustered observations, rologit uses Efron’s method. If we had specified the ties(efron) option with the separate analyses, then the parameter estimates would have been identical to the simultaneous results. 
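One way to verify this, sketched here for the first job type (the covariates are those entered in the pooled model above), is to refit the separate analysis with Efron's method for ties:

* Sketch: separate analysis for job 1 with Efron ties; the coefficients should
* match the female, grades, edufit, and workexp estimates in the pooled output
rologit pref female grades edufit workexp if job==1, group(caseid) ties(efron)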
Another difference is that rologit now reports robust standard errors, adjusted for clustering within respondents. These could have been obtained for the separate analyses, as well by specifying the vce(robust) option. In fact, this option would also have forced rologit to switch to Efron’s method as well. Given the combined results for the three types of jobs, we can test easily whether the weights for the attributes of applicants vary between the jobs, in other words, whether employers are looking for different qualifications in applicants for different jobs. A Wald test for the equality hypothesis of no difference can be obtained with the testparm command: rologit — Rank-ordered logistic regression 2067 . testparm job#c.(female grades edufit workexp) ( 1) 2.job#c.female = 0 ( 2) 3.job#c.female = 0 ( 3) 2.job#c.grades = 0 ( 4) 3.job#c.grades = 0 ( 5) 2.job#c.edufit = 0 ( 6) 3.job#c.edufit = 0 ( 7) 2.job#c.workexp = 0 ( 8) 3.job#c.workexp = 0 chi2( 8) = Prob > chi2 = 14.96 0.0599 We find only mild evidence that employers look for different qualities in candidates according to the job for which they are being considered. Technical note Allison (1999) stressed that the comparison between groups of the coefficients of logistic regression is problematic, especially in its latent-variable interpretation. In many common latent-variable models, only the regression coefficients divided by the scale of the latent variable are identified. Thus a comparison of logit regression coefficients between, say, men and women is meaningful only if one is willing to argue that the standard deviation of the latent residual does not differ between the sexes. The rank-ordered logit model is also affected by this problem. While we formulated the model with a scale-free residual, we can actually think of the model for the value of an alternative as being scaled by the standard deviation of the random term, representing other relevant attributes of alternatives. Again comparing attribute weights between jobs is meaningful to the extent that we are willing to defend the proposition that “all omitted attributes” are equally important for different kinds of jobs. Comparison of rologit and clogit The rank-ordered logit model also has a sequential interpretation. A subject first chooses the best among the alternatives. Next he or she selects the best alternative among the remaining alternatives, etc. The decisions at each of the subsequent stages are described by a conditional logit model, and a subject is assumed to apply the same decision weights at each stage. Some authors have expressed concern that later choices may well be made more randomly than the first few decisions. A formalization of this idea is a heteroskedastic version of the rank-ordered logit model in which the scale of the random term increases with the number of decisions made (for example, Hausman and Ruud [1987]). This extended model is currently not supported by rologit. However, the hypothesis that the same decision weights are applied at the first stage and at later stages can be tested by applying a Hausman test. 2068 rologit — Rank-ordered logistic regression First, we fit the rank-ordered logit model on the full ranking data for the first type of job, . rologit pref age female edufit grades workexp boardexp if job==1, > group(caseid) nolog Rank-ordered logistic regression Number of obs = Group variable: caseid Number of groups = No ties in data Obs per group: min = avg = max = LR chi2(6) Prob > chi2 Log likelihood = -68.34539 pref Coef. 
age female edufit grades workexp boardexp -.0984926 -.4487287 .7658064 3.064534 1.386427 .6944377 Std. Err. .0820473 .3671307 .3602366 .6148245 .292553 .3762596 z -1.20 -1.22 2.13 4.98 4.74 1.85 P>|z| 0.230 0.222 0.034 0.000 0.000 0.065 = = 80 8 10 10.00 10 54.13 0.0000 [95% Conf. Interval] -.2593024 -1.168292 .0597556 1.8595 .8130341 -.0430176 .0623172 .2708343 1.471857 4.269568 1.959821 1.431893 and we save the estimates for later use with the estimates command. . estimates store Ranking To estimate the decision weights on the basis of the most preferred alternatives only, we create a variable, best, that is 1 for the best alternatives, and 0 otherwise. The by prefix is useful here. . by caseid (pref), sort: gen best = pref == pref[_N] if job==1 (210 missing values generated) By specifying (pref) with by caseid, we ensured that the data were sorted in increasing order on pref within caseid. Hence, the most preferred alternatives are last in the sort order. The expression pref == pref[ N] is true (1) for the most preferred alternatives, even if the alternative is not unique, and false (0) otherwise. If the most preferred alternatives were sometimes tied, we could still fit the model for the based-alternatives-only data via rologit, but clogit would yield different results because it deals with ties in a less appropriate way for continuous valuations. To ascertain whether there are ties in the selected data regarding applicants for research positions, we can combine by with assert: . by caseid (pref), sort: assert pref[_N-1] != pref[_N] if job==1 There are no ties. We can now fit the model on the choice data by using either clogit or rologit. rologit — Rank-ordered logistic regression 2069 . rologit best age edufit grades workexp boardexp if job==1, group(caseid) nolog Rank-ordered logistic regression Number of obs = 80 Group variable: caseid Number of groups = 8 No ties in data Obs per group: min = 10 avg = 10.00 max = 10 LR chi2(5) = 17.27 Log likelihood = -9.783205 Prob > chi2 = 0.0040 best Coef. age edufit grades workexp boardexp -.1048959 .4558387 3.443851 2.545648 1.765176 Std. Err. .2017068 .9336775 1.969002 1.099513 1.112763 z -0.52 0.49 1.75 2.32 1.59 P>|z| 0.603 0.625 0.080 0.021 0.113 [95% Conf. Interval] -.5002339 -1.374136 -.4153223 .3906422 -.4157988 .2904421 2.285813 7.303025 4.700655 3.946152 . estimates store Choice The same results, though with a slightly different formatted header, would have been obtained by using clogit on these data. . clogit best age edufit grades workexp boardexp if job==1, group(caseid) nolog Conditional (fixed-effects) logistic regression Number of obs = 80 LR chi2(5) = 17.27 Prob > chi2 = 0.0040 Log likelihood = -9.7832046 Pseudo R2 = 0.4689 best Coef. age edufit grades workexp boardexp -.1048959 .4558387 3.443851 2.545648 1.765176 Std. Err. .2017068 .9336775 1.969002 1.099513 1.112763 z -0.52 0.49 1.75 2.32 1.59 P>|z| 0.603 0.625 0.080 0.021 0.113 [95% Conf. Interval] -.5002339 -1.374136 -.4153223 .3906422 -.4157988 .2904421 2.285813 7.303025 4.700655 3.946152 The parameters of the ranking and choice models look different, but the standard errors based on the choice data are much larger. Are we estimating parameters with the ranking data that are different from those with the choice data? A Hausman test compares two estimators of a parameter. One of the estimators should be efficient under the null hypothesis, namely, that choosing the second-best alternative is determined with the same decision weights as the best, etc. 
In our case, the efficient estimator of the decision weights uses the ranking information. The other estimator should be consistent, even if the null hypothesis is false. In our application, this is the estimator that uses the first-choice data only. 2070 rologit — Rank-ordered logistic regression . hausman Choice Ranking Coefficients (b) (B) Choice Ranking age edufit grades workexp boardexp -.1048959 .4558387 3.443851 2.545648 1.765176 -.0984926 .7658064 3.064534 1.386427 .6944377 (b-B) Difference sqrt(diag(V_b-V_B)) S.E. -.0064033 -.3099676 .3793169 1.159221 1.070739 .1842657 .8613846 1.870551 1.059878 1.04722 b = consistent under Ho and Ha; obtained from rologit B = inconsistent under Ha, efficient under Ho; obtained from rologit Test: Ho: difference in coefficients not systematic chi2(5) = (b-B)’[(V_b-V_B)^(-1)](b-B) = 3.05 Prob>chi2 = 0.6918 We do not find evidence for misspecification. We have to be cautious, though, because Hausmantype tests are often not powerful, and the number of observations in our example is very small, which makes the quality of the method of the null distribution by a chi-squared test rather uncertain. On reversals of rankings The rank-ordered logit model has a property that you may find unexpected and even unfortunate. Compare two analyses with the rank-ordered logit model, one in which alternatives are ranked from “most attractive” to “least attractive”, the other a reversed analysis in which these alternatives are ranked from “most unattractive” to “least unattractive”. By unattractiveness, you probably mean just the opposite of attractiveness, and you expect that the weights of the attributes in predicting “attractiveness” to be minus the weights in predicting “unattractiveness”. This is, however, not true for the rank-ordered logit model. The assumed distribution of the random residual takes the form F () = 1 − exp{exp(−)}. This distribution is right-skewed. Therefore, slightly different models result from adding and subtracting the random residual, corresponding with high-to-low and lowto-high rankings. Thus the estimated coefficients will differ between the two specifications, though usually not in an important way. You may observe the difference by specifying the reverse option of rologit. Reversing the rank order makes rankings that are incomplete at the bottom become incomplete at the top. Only the first kind of incompleteness is supported by rologit. Thus, for this comparison, we exclude the alternatives that are not ranked, omitting the information that ranked alternatives are preferred over excluded ones. . rologit pref grades edufit workexp boardexp if job==1 & pref!=0, group(caseid) (output omitted ) . estimates store Original . rologit pref grades edufit workexp boardexp if job==1 & pref!=0, group(caseid) > reverse (output omitted ) . estimates store Reversed rologit — Rank-ordered logistic regression 2071 . estimates table Original Reversed, stats(aic bic) Variable Original Reversed grades edufit workexp boardexp 2.0032332 -.13111006 1.2805373 .46213212 -1.0955335 -.05710681 -1.2096383 -.27200317 aic bic 96.750452 104.23526 99.665642 107.15045 Thus, although the weights of the attributes for reversed rankings are indeed mostly of opposite signs, the magnitudes of the weights and their standard errors differ. Which one is more appropriate? We have no advice to offer here. The specific science of the problem will determine what is appropriate, though we would be surprised indeed if this helps here. 
Formal testing does not help much either, as the models for the original and reversed rankings are not nested. The model-selection indices, such as the AIC and BIC, however, suggest that you stick to the rank-ordered logit model applied to the original ranking rather than to the reversed ranking. 2072 rologit — Rank-ordered logistic regression Stored results rologit stores the following in e(): Scalars e(N) e(ll 0) e(ll) e(df m) e(chi2) e(p) e(r2 p) e(N g) e(g min) e(g avg) e(g max) e(code inc) e(N clust) e(rank) Macros e(cmd) e(cmdline) e(depvar) e(group) e(wtype) e(wexp) e(title) e(clustvar) e(offset) e(chi2type) e(reverse) e(ties) e(vce) e(vcetype) e(properties) e(predict) e(marginsok) e(marginsnotok) e(asbalanced) e(asobserved) Matrices e(b) e(V) e(V modelbased) Functions e(sample) number of observations log likelihood of the null model (“all rankings are equiprobable”) log likelihood model degrees of freedom χ2 significance pseudo-R2 number of groups minimum group size average group size maximum group size value for incomplete preferences number of clusters rank of e(V) rologit command as typed name of dependent variable name of group() variable weight type weight expression title in estimation output name of cluster variable linear offset variable Wald or LR; type of model χ2 test reverse, if specified breslow, efron, exactm vcetype specified in vce() title used to label Std. Err. b V program used to implement predict predictions allowed by margins predictions disallowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector variance–covariance matrix of the estimators model-based variance marks estimation sample Methods and formulas Allison and Christakis (1994) demonstrate that maximum likelihood estimates for the rank-ordered logit model can be obtained as the maximum partial-likelihood estimates of an appropriately specified Cox regression model for waiting time ([ST] stcox). In this analogy, a higher value for an alternative is formally equivalent to a higher hazard rate of failure. rologit uses stcox to fit the rank-ordered logit model based on such a specification of the data in Cox terms. A higher stated preference is represented by a shorter waiting time until failure. Incomplete rankings are dealt with via censoring. Moreover, decision situations (subjects) are to be treated as strata. Finally, as proposed by Allison and Christakis, ties in rankings are handled by the marginal-likelihood method, specifying that all strict preference orderings consistent with the stated weak preference ordering are equally likely. rologit — Rank-ordered logistic regression 2073 The marginal-likelihood estimator is available in stcox via the exactm option. The methods of the marginal likelihood due to Breslow and Efron are also appropriate for the analysis of rank-ordered logit models. Because in most applications the number of ranked alternatives by one subject will be fairly small (at most, say, 20), the number of ties is small as well, and so you rarely will need to turn to methods to restrict computer time. Because the marginal-likelihood estimator in stcox does not support the cluster adjustment or pweights, you should use the Efron method in such cases. This command supports the clustered version of the Huber/White/sandwich estimator of the variance using vce(robust) and vce(cluster clustvar). See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. 
Specifying vce(robust) is equivalent to specifying vce(cluster groupvar), where groupvar is the identifier variable that links the alternatives. Acknowledgment The rologit command was written by Jeroen Weesie of the Department of Sociology at Utrecht University, The Netherlands. References Allison, P. D. 1999. Comparing logit and probit coefficients across groups. Sociological Methods and Research 28: 186–208. Allison, P. D., and N. Christakis. 1994. Logit models for sets of ranked items. In Vol. 24 of Sociological Methodology, ed. P. V. Marsden, 123–126. Oxford: Blackwell. Beggs, S., S. Cardell, and J. A. Hausman. 1981. Assessing the potential demand for electric cars. Journal of Econometrics 17: 1–19. de Wolf, I. 2000. Opleidingsspecialisatie en arbeidsmarktsucces van sociale wetenschappers. Amsterdam: ThelaThesis. Hair, J. F., Jr., W. C. Black, B. J. Babin, and R. E. Anderson. 2010. Multivariate Data Analysis. 7th ed. Upper Saddle River, NJ: Pearson. Hausman, J. A., and P. A. Ruud. 1987. Specifying and testing econometric models for rank-ordered data. Journal of Econometrics 34: 83–104. Luce, R. D. 1959. Individual Choice Behavior: A Theoretical Analysis. New York: Dover. Marden, J. I. 1995. Analyzing and Modeling Rank Data. London: Chapman & Hall. McCullagh, P. 1993. Permutations and regression models. In Probability Models and Statistical Analysis for Ranking Data, ed. M. A. Fligner and J. S. Verducci, 196–215. New York: Springer. Plackett, R. L. 1975. The analysis of permutations. Applied Statistics 24: 193–202. Punj, G. N., and R. Staelin. 1978. The choice process for graduate business schools. Journal of Marketing Research 15: 588–598. Thurstone, L. L. 1927. A law of comparative judgment. Psychological Reviews 34: 273–286. Yellott, J. I., Jr. 1977. The relationship between Luce’s choice axiom, Thurstone’s theory of comparative judgment, and the double exponential distribution. Journal of Mathematical Psychology 15: 109–144. 2074 rologit — Rank-ordered logistic regression Also see [R] rologit postestimation — Postestimation tools for rologit [R] clogit — Conditional (fixed-effects) logistic regression [R] logistic — Logistic regression, reporting odds ratios [R] mlogit — Multinomial (polytomous) logistic regression [R] nlogit — Nested logit regression [R] slogit — Stereotype logistic regression [U] 20 Estimation and postestimation commands Title rologit postestimation — Postestimation tools for rologit Description Remarks and examples Syntax for predict Also see Menu for predict Options for predict Description The following postestimation commands are available after rologit: Command Description contrast estat ic estat summarize estat vce estimates hausman lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results Hausman’s specification test point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) 
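For example, after fitting a model with rologit, typical predict calls are (a sketch; the new variable names are arbitrary)

* Sketch: predictions after rologit
predict double phat if e(sample)    // pr (the default): Pr(alternative is ranked first)
predict double xbhat, xb            // linear prediction
predict double sehat, stdp          // standard error of the linear prediction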
point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest margins1 marginsplot nlcom predict predictnl pwcompare test testnl 1 The default prediction statistic pr cannot be correctly handled by margins; however, margins can be used after rologit with the predict(xb) option. Syntax for predict predict statistic type newvar if in , statistic nooffset Description Main pr xb stdp probability that alternatives are ranked first; the default linear prediction standard error of the linear prediction These statistics are available both in and out of sample; type predict only for the estimation sample. 2075 . . . if esample() . . . if wanted 2076 rologit postestimation — Postestimation tools for rologit Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main pr, the default, calculates the probability that alternatives are ranked first. xb calculates the linear prediction. stdp calculates the standard error of the linear prediction. nooffset is relevant only if you specified offset(varname) for rologit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xj b rather than as xj b + offsetj . Remarks and examples See Comparing respondents and Clustered choice data in [R] rologit for examples of the use of testparm, an alternative to the test command. See Comparison of rologit and clogit and On reversals of rankings in [R] rologit for examples of the use of estimates. See Comparison of rologit and clogit in [R] rologit for an example of the use of hausman. Also see [R] rologit — Rank-ordered logistic regression [U] 20 Estimation and postestimation commands Title rreg — Robust regression Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Options Acknowledgment Syntax rreg depvar indepvars if in , options Description options Model use # as the biweight tuning constant; default is tune(7) tune(#) Reporting level(#) genwt(newvar) display options set confidence level; default is level(95) create newvar containing the weights assigned to each observation control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Optimization optimization options graph control the optimization process; seldom used graph weights during convergence coeflegend display legend instead of statistics indepvars may contain factor variables; see [U] 11.4.3 Factor variables. depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. by, mfp, mi estimate, rolling, and statsby are allowed; see [U] 11.1.10 Prefix commands. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Linear models and related > Other > Robust regression Description rreg performs one version of robust regression of depvar on indepvars. Also see Robust standard errors in [R] regress for standard regression with robust variance estimates and [R] qreg for quantile (including median or least-absolute-residual) regression. 
2077 2078 rreg — Robust regression Options Model tune(#) is the biweight tuning constant. The default is 7, meaning seven times the median absolute deviation (MAD) from the median residual; see Methods and formulas. Lower tuning constants downweight outliers rapidly but may lead to unstable estimates (less than 6 is not recommended). Higher tuning constants produce milder downweighting. Reporting level(#); see [R] estimation options. genwt(newvar) creates the new variable newvar containing the weights assigned to each observation. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Optimization optimization options: iterate(#), tolerance(#), no log. iterate() specifies the maximum number of iterations; iterations stop when the maximum change in weights drops below tolerance(); and log/nolog specifies whether to show the iteration log. These options are seldom used. graph allows you to graphically watch the convergence of the iterative technique. The weights obtained from the most recent round of estimation are graphed against the weights obtained from the previous round. The following option is available with rreg but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples rreg first performs an initial screening based on Cook’s distance > 1 to eliminate gross outliers before calculating starting values and then performs Huber iterations followed by biweight iterations, as suggested by Li (1985). Example 1 We wish to examine the relationship between mileage rating, weight, and location of manufacture for the 74 cars in our automobile data. As a point of comparison, we begin by fitting an ordinary regression: rreg — Robust regression . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress mpg weight foreign SS df MS Source Model Residual 1619.2877 824.171761 2 71 809.643849 11.608053 Total 2443.45946 73 33.4720474 mpg Coef. weight foreign _cons -.0065879 -1.650029 41.6797 Std. Err. .0006371 1.075994 2.165547 t Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE P>|t| -10.34 -1.53 19.25 0.000 0.130 0.000 = = = = = = 2079 74 69.75 0.0000 0.6627 0.6532 3.4071 [95% Conf. Interval] -.0078583 -3.7955 37.36172 -.0053175 .4954422 45.99768 We now compare this with the results from rreg: . rreg mpg weight foreign Huber iteration 1: maximum Huber iteration 2: maximum Huber iteration 3: maximum Huber iteration 4: maximum Biweight iteration 5: maximum Biweight iteration 6: maximum Biweight iteration 7: maximum Biweight iteration 8: maximum Robust regression mpg Coef. weight foreign _cons -.0063976 -3.182639 40.64022 difference difference difference difference difference difference difference difference Std. Err. .0003718 .627964 1.263841 in in in in in in in in weights weights weights weights weights weights weights weights t -17.21 -5.07 32.16 = = = = = = = = P>|t| 0.000 0.000 0.000 .80280176 .2915438 .08911171 .02697328 .29186818 .11988101 .03315872 .00721325 Number of obs = F( 2, 71) = Prob > F = 74 168.32 0.0000 [95% Conf. Interval] -.007139 -4.434763 38.1202 -.0056562 -1.930514 43.16025 Note the large change in the foreign coefficient. Technical note It would have been better if we had fit the previous robust regression by typing rreg mpg weight foreign, genwt(w). The new variable, w, would then contain the estimated weights. 
Let’s pretend that we did this: 2080 rreg — Robust regression . rreg mpg weight foreign, genwt(w) (output omitted ) . summarize w, detail Robust Regression Weight 1% 5% 10% 25% Percentiles 0 .0442957 .4674935 .8894815 50% .9690193 75% 90% 95% 99% Smallest 0 0 0 .0442957 Largest .9996715 .9996953 .9997343 .9998585 .9949395 .9989245 .9996715 .9998585 Obs Sum of Wgt. 74 74 Mean Std. Dev. .8509966 .2746451 Variance Skewness Kurtosis .0754299 -2.287952 6.874605 We discover that 3 observations in our data were dropped altogether (they have weight 0). We could further explore our data: . sort w . list make mpg weight w if w <.467, sep(0) make 1. 2. 3. 4. 5. 6. 7. VW Diesel Subaru Datsun 210 Plym. Arrow Cad. Seville Toyota Corolla Olds 98 mpg weight w 41 35 35 28 21 31 21 2,040 2,050 2,020 3,260 4,290 2,200 4,060 0 0 0 .04429567 .08241943 .10443129 .28141296 Being familiar with the automobile data, we immediately spotted two things: the VW is the only diesel car in our data, and the weight recorded for the Plymouth Arrow is incorrect. Example 2 If we specify no explanatory variables, rreg produces a robust estimate of the mean: . rreg mpg Huber Huber Huber Biweight Biweight iteration iteration iteration iteration iteration 1: 2: 3: 4: 5: maximum maximum maximum maximum maximum difference difference difference difference difference in in in in in weights weights weights weights weights = = = = = Robust regression .64471879 .05098336 .0099887 .25197391 .00358606 Number of obs = F( 0, 73) = Prob > F = mpg Coef. _cons 20.68825 Std. Err. .641813 t 32.23 74 0.00 . P>|t| [95% Conf. Interval] 0.000 19.40912 21.96738 rreg — Robust regression 2081 The estimate is given by the coefficient on cons. The mean is 20.69 with an estimated standard error of 0.6418. The 95% confidence interval is [ 19.4, 22.0 ]. By comparison, ci (see [R] ci) gives us the standard calculation: . ci mpg Variable Obs Mean mpg 74 21.2973 Std. Err. .6725511 [95% Conf. Interval] 19.9569 22.63769 Stored results rreg stores the following in e(): Scalars e(N) e(mss) e(df m) e(rss) e(df r) e(r2) e(r2 a) e(F) e(rmse) e(rank) Macros e(cmd) e(cmdline) e(depvar) e(genwt) e(title) e(model) e(vce) e(properties) e(predict) e(marginsok) e(asbalanced) e(asobserved) Matrices e(b) e(V) Functions e(sample) number of observations model sum of squares model degrees of freedom residual sum of squares residual degrees of freedom R-squared adjusted R-squared F statistic root mean squared error rank of e(V) rreg command as typed name of dependent variable variable containing the weights title in estimation output ols ols b V program used to implement predict predictions allowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector variance–covariance matrix of the estimators marks estimation sample Methods and formulas See Berk (1990), Goodall (1983), and Rousseeuw and Leroy (1987) for a general description of the issues and methods. Hamilton (1991a, 1992) provides a more detailed description of rreg and some Monte Carlo evaluations. rreg begins by fitting the regression (see [R] regress), calculating Cook’s D (see [R] predict and [R] regress postestimation), and excluding any observation for which D > 1. Thereafter rreg works iteratively: it performs a regression, calculates case weights from absolute residuals, and regresses again using those weights. Iterations stop when the maximum change in weights drops below tolerance(). 
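In terms of the options documented earlier, this iterative process can be tuned or watched directly; a minimal sketch (the option values shown are arbitrary):
. * cap the number of iterations and tighten the convergence criterion
. rreg mpg weight foreign, iterate(100) tolerance(1e-8)
. * watch the weights converge graphically
. rreg mpg weight foreign, graph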
Weights derive from one of two weight functions, Huber weights 2082 rreg — Robust regression and biweights. Huber weights (Huber 1964) are used until convergence, and then, from that result, biweights are used until convergence. The biweight was proposed by Beaton and Tukey (1974, 151– 152) after the Princeton robustness study (Andrews et al. 1972) had compared various estimators. Both weighting functions are used because Huber weights have problems dealing with severe outliers, whereas biweights sometimes fail to converge or have multiple solutions. The initial Huber weighting should improve the behavior of the biweight estimator. In Huber weighting, cases with small residuals receive weights of 1; cases with larger residuals receive gradually smaller weights. Let ei = yi − Xi b represent the ith-case residual. The ith scaled residual ui = ei /s is calculated, where s = M/0.6745 is the residual scale estimate and M = med(|ei − med(ei )|) is the median absolute deviation from the median residual. Huber estimation obtains case weights: 1 if |ui | ≤ ch wi = ch /|ui | otherwise rreg defines ch = 1.345, so downweighting begins with cases whose absolute residual exceeds (1.345/0.6745)M ≈ 2M . With biweights, all cases with nonzero residuals receive some downweighting, according to the smoothly decreasing biweight function wi = {1 − (ui /cb )2 }2 0 if |ui | ≤ cb otherwise where cb = 4.685 × tune()/7. Thus when tune() = 7, cases with absolute residuals of (4.685/0.6745)M ≈ 7M or more are assigned 0 weight and thus are effectively dropped. Goodall (1983, 377) suggests using a value between 6 and 9, inclusive, for tune() in the biweight case and states that performance is good between 6 and 12, inclusive. The tuning constants ch = 1.345 and cb = 4.685 (assuming tune() is set at the default 7) give rreg about 95% of the efficiency of OLS when applied to data with normally distributed errors (Hamilton 1991b). Lower tuning constants downweight outliers more drastically (but give up Gaussian efficiency); higher tuning constants make the estimator more like OLS. Standard errors are calculated using the pseudovalues approach described in Street, Carroll, and Ruppert (1988). Acknowledgment The current version of rreg is due to the work of Lawrence Hamilton of the Department of Sociology at the University of New Hampshire. References Andrews, D. F., P. J. Bickel, F. R. Hampel, P. J. Huber, W. H. Rogers, and J. W. Tukey. 1972. Robust Estimates of Location: Survey and Advances. Princeton: Princeton University Press. Beaton, A. E., and J. W. Tukey. 1974. The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics 16: 147–185. Berk, R. A. 1990. A primer on robust regression. In Modern Methods of Data Analysis, ed. J. Fox and J. S. Long, 292–324. Newbury Park, CA: Sage. Goodall, C. 1983. M-estimators of location: An outline of the theory. In Understanding Robust and Exploratory Data Analysis, ed. D. C. Hoaglin, C. F. Mosteller, and J. W. Tukey, 339–431. New York: Wiley. rreg — Robust regression 2083 Gould, W. W., and W. H. Rogers. 1994. Quantile regression as an alternative to robust regression. In 1994 Proceedings of the Statistical Computing Section. Alexandria, VA: American Statistical Association. Hamilton, L. C. 1991a. srd1: How robust is robust regression? Stata Technical Bulletin 2: 21–26. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 169–175. College Station, TX: Stata Press. . 1991b. ssi2: Bootstrap programming. 
Stata Technical Bulletin 4: 18–27. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 208–220. College Station, TX: Stata Press. . 1992. Regression with Graphics: A Second Course in Applied Statistics. Belmont, CA: Duxbury. . 2013. Statistics with Stata: Updated for Version 12. 8th ed. Boston: Brooks/Cole. Huber, P. J. 1964. Robust estimation of a location parameter. Annals of Mathematical Statistics 35: 73–101. Li, G. 1985. Robust regression. In Exploring Data Tables, Trends, and Shapes, ed. D. C. Hoaglin, C. F. Mosteller, and J. W. Tukey, 281–340. New York: Wiley. Mosteller, C. F., and J. W. Tukey. 1977. Data Analysis and Regression: A Second Course in Statistics. Reading, MA: Addison–Wesley. Relles, D. A., and W. H. Rogers. 1977. Statisticians are fairly robust estimators of location. Journal of the American Statistical Association 72: 107–111. Rousseeuw, P. J., and A. M. Leroy. 1987. Robust Regression and Outlier Detection. New York: Wiley. Street, J. O., R. J. Carroll, and D. Ruppert. 1988. A note on computing robust regression estimates via iteratively reweighted least squares. American Statistician 42: 152–154. Verardi, V., and C. Croux. 2009. Robust regression in Stata. Stata Journal 9: 439–453. Also see [R] rreg postestimation — Postestimation tools for rreg [R] qreg — Quantile regression [R] regress — Linear regression [MI] estimation — Estimation commands for use with mi estimate [U] 20 Estimation and postestimation commands Title rreg postestimation — Postestimation tools for rreg Description Syntax for predict Menu for predict Options for predict Also see Description The following postestimation commands are available after rreg: Command Description contrast estat summarize estat vce estimates forecast1 lincom contrasts and ANOVA-style joint tests of estimates summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses margins marginsplot nlcom predict predictnl pwcompare test testnl 1 forecast is not appropriate with mi estimation results. Syntax for predict predict statistic type newvar if in , statistic Description Main xb stdp residuals hat linear prediction; the default standard error of the linear prediction residuals diagonal elements of the hat matrix These statistics are available both in and out of sample; type predict only for the estimation sample. 2084 . . . if e(sample) . . . if wanted rreg postestimation — Postestimation tools for rreg 2085 Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main xb, the default, calculates the linear prediction. stdp calculates the standard error of the linear prediction. residuals calculates the residuals. hat calculates the diagonal elements of the hat matrix. You must have run the rreg command with the genwt() option. 
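A brief sketch tying these statistics together (illustrative only; the new variable names xbhat, res, and lev are arbitrary):
. use http://www.stata-press.com/data/r13/auto, clear
. rreg mpg weight foreign, genwt(w)
(output omitted )
. * linear prediction, the default
. predict xbhat
. * residuals
. predict res, residuals
. * diagonal elements of the hat matrix; available because genwt() was specified at estimation
. predict lev, hat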
Also see [R] rreg — Robust regression [U] 20 Estimation and postestimation commands Title runtest — Test for random order Syntax Remarks and examples References Menu Stored results Description Methods and formulas Options Acknowledgment Syntax runtest varname in , options options Description continuity drop split continuity correction ignore values equal to the threshold randomly split values equal to the threshold as above or below the threshold; default is to count as below use mean as threshold; default is median assign arbitrary threshold; default is median mean threshold(#) Menu Statistics > Nonparametric analysis > Tests of hypotheses > Test for random order Description runtest tests whether the observations of varname are serially independent — that is, whether they occur in a random order — by counting how many runs there are above and below a threshold. By default, the median is used as the threshold. A small number of runs indicates positive serial correlation; a large number indicates negative serial correlation. Options continuity specifies a continuity correction that may be helpful in small samples. If there are fewer than 10 observations either above or below the threshold, however, the tables in Swed and Eisenhart (1943) provide more reliable critical values. By default, no continuity correction is used. drop directs runtest to ignore any values of varname that are equal to the threshold value when counting runs and tabulating observations. By default, runtest counts a value as being above the threshold when it is strictly above the threshold and as being below the threshold when it is less than or equal to the threshold. split directs runtest to randomly split values of varname that are equal to the threshold. In other words, when varname is equal to threshold, a “coin” is flipped. If it comes up heads, the value is counted as above the threshold. If it comes up tails, the value is counted as below the threshold. mean directs runtest to tabulate runs above and below the mean rather than the median. threshold(#) specifies an arbitrary threshold to use in counting runs. For example, if varname has already been coded as a 0/1 variable, the median generally will not be a meaningful separating value. 2086 runtest — Test for random order 2087 Remarks and examples runtest performs a nonparametric test of the hypothesis that the observations of varname occur in a random order by counting how many runs there are above and below a threshold. If varname is positively serially correlated, it will tend to remain above or below its median for several observations in a row; that is, there will be relatively few runs. If, on the other hand, varname is negatively serially correlated, observations above the median will tend to be followed by observations below the median and vice versa; that is, there will be relatively many runs. By default, runtest uses the median for the threshold, and this is not necessarily the best choice. If mean is specified, the mean is used instead of the median. If threshold(#) is specified, # is used. Because runtest divides the data into two states — above and below the threshold — it is appropriate for data that are already binary; for example, win or lose, live or die, rich or poor, etc. Such variables are often coded as 0 for one state and 1 for the other. Here you should specify threshold(0) because, by default, runtest separates the observations into those that are greater than the threshold and those that are less than or equal to the threshold. 
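A compact sketch of the threshold choices just described (illustrative only; x stands for a hypothetical series, win for a hypothetical 0/1 variable, and the numeric threshold shown is arbitrary):
. * median threshold, the default
. runtest x
. * mean used as the threshold instead
. runtest x, mean
. * an explicitly supplied threshold
. runtest x, threshold(2.5)
. * a 0/1 variable: 0s count as below and 1s as above
. runtest win, threshold(0)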
As with most nonparametric procedures, the treatment of ties complicates the test. Observations equal to the threshold value are ties and can be treated in one of three ways. By default, they are treated as if they were below the threshold. If drop is specified, they are omitted from the calculation and the total number of observations is adjusted. If split is specified, each is randomly assigned to the above- and below-threshold groups. The random assignment is different each time the procedure is run unless you specify the random-number seed; see [R] set seed. Example 1 We can use runtest to check regression residuals for serial correlation. . use http://www.stata-press.com/data/r13/run1 . scatter resid year, connect(l) yline(0) title(Regression residuals) −4 −2 Residual 0 2 4 Regression residuals 1975 1980 1985 1990 Year The graph gives the impression that these residuals are positively correlated. Excursions above or below zero — the natural threshold for regression residuals — tend to last for several observations. runtest can evaluate the statistical significance of this impression. 2088 runtest — Test for random order . runtest resid, N(resid <= 0) = N(resid > 0) = obs = N(runs) = z = Prob>|z| = thresh(0) 8 8 16 5 -2.07 .04 There are five runs in these 16 observations. Using the normal approximation to the true distribution of the number of runs, the five runs in this series are fewer than would be expected if the residuals were serially independent. The p-value is 0.04, indicating a two-sided significant result at the 5% level. If the alternative hypothesis is positive serial correlation, rather than any deviation from randomness, then the one-sided p-value is 0.04/2 = 0.015. With so few observations, however, the normal approximation may be inaccurate. (Tables compiled by Swed and Eisenhart list five runs as the 5% critical value for a one-sided test.) runtest is a nonparametric test. It ignores the magnitudes of the observations and notes only whether the values are above or below the threshold. We can demonstrate this feature by reducing the information about the regression residuals in this example to a 0/1 variable that indicates only whether a residual is positive or negative. . generate byte . runtest sign, N(sign <= 0) = N(sign > 0) = obs = N(runs) = z = Prob>|z| = sign = resid>0 thresh(0) 8 8 16 5 -2.07 .04 As expected, runtest produces the same answer as before. Technical note The run test can also be used to test the null hypothesis that two samples are drawn from the same underlying distribution. The run test is sensitive to differences in the shapes, as well as the locations, of the empirical distributions. Suppose, for example, that two different additives are added to the oil in 10 different cars during an oil change. The cars are run until a viscosity test determines that another oil change is needed, and the number of miles traveled between oil changes is recorded. The data are runtest — Test for random order 2089 . use http://www.stata-press.com/data/r13/additive, clear . list additive miles 1. 2. 3. 4. 5. 1 1 1 1 1 4024 4756 7993 5025 4188 6. 7. 8. 9. 10. 2 2 2 2 2 3007 1988 1051 4478 4232 To test whether the additives generate different distributions of miles between oil changes, we sort the data by miles and then use runtest to see whether the marker for each additive occurs in random order: . sort miles . 
runtest additive, N(additive <= 1) = N(additive > 1) = obs = N(runs) = z = Prob>|z| = thresh(1) 5 5 10 4 -1.34 .18 Here the additives do not produce statistically different results. Technical note A test that is related to the run test is the runs up-and-down test. In the latter test, the data are classified not by whether they lie above or below a threshold but by whether they are steadily increasing or decreasing. Thus an unbroken string of increases in the variable of interest is counted as one run, as is an unbroken string of decreases. According to Madansky (1988), the run test is superior to the runs up-and-down test for detecting trends in the data, but the runs up-and-down test is superior for detecting autocorrelation. runtest can be used to perform a runs up-and-down test. Using the regression residuals from the example above, we can perform a runtest on their first differences: . use http://www.stata-press.com/data/r13/run1 . generate resid_D = resid - resid[_n-1] (1 missing value generated) . runtest resid_D, N(resid_D <= 0) = N(resid_D > 0) = obs = N(runs) = z = Prob>|z| = thresh(0) 7 8 15 6 -1.33 .18 2090 runtest — Test for random order Edgington (1961) has compiled a table of the small sample distribution of the runs up-and-down statistic, and this table is reprinted in Madansky (1988). For large samples, the z statistic reported by runtest is incorrect for the runs up-and-down test. Let N be the number of observations (15 here), and let r be the number of runs (6). The expected number of runs in the runs up-and-down test is µr = the variance is σr2 = 2N − 1 3 16N − 29 90 and the correct z statistic is zb = r − µr σr Technical note runtest will tolerate missing values at the beginning or end of a series, as occurred in the technical note above (generating first differences resulted in a missing value for the first observation). runtest, however, will issue an error message if there are any missing observations in the interior of the series (in the portion covered by the in range modifier). To perform the test anyway, simply drop the missing observations before using runtest. Stored results runtest stores the following in r(): Scalars r(N) r(N below) r(N above) r(mean) number of observations number below the threshold number above the threshold expected number of runs r(p) r(z) r(n runs) r(Var) p-value of z z statistic number of runs variance of the number of runs Methods and formulas runtest begins by calculating the number of observations below the threshold, n0 ; the number of observations above the threshold, n1 ; the total number of observations, N = n0 + n1 ; and the number of runs, r. These statistics are always reported, so the exact tables of critical values in Swed and Eisenhart (1943) may be consulted if necessary. The expected number of runs under the null is µr = 2n0 n1 +1 N runtest — Test for random order the variance is σr2 = 2091 2n0 n1 (2n0 n1 − N ) N 2 (N − 1) and the normal approximation test statistic is zb = r − µr σr Acknowledgment runtest was written by Sean Becketti, a past editor of the Stata Technical Bulletin and author of the Stata Press book Introduction to Time Series Using Stata. References Edgington, E. S. 1961. Probability table for number of runs of signs of first differences in ordered series. Journal of the American Statistical Association 56: 156–159. Madansky, A. 1988. Prescriptions for Working Statisticians. New York: Springer. Swed, F. S., and C. Eisenhart. 1943. Tables for testing randomness of grouping in a sequence of alternatives. 
Annals of Mathematical Statistics 14: 66–87. Title scobit — Skewed logistic regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax scobit depvar indepvars if in weight , options Description options Model noconstant offset(varname) asis constraints(constraints) collinear suppress constant term include varname in model with coefficient constrained to 1 retain perfect predictor variables apply specified linear constraints keep collinear variables SE/Robust vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife vce(vcetype) Reporting set confidence level; default is level(95) report odds ratios do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling level(#) or nocnsreport display options Maximization maximize options control the maximization process coeflegend display legend instead of statistics indepvars may contain factor variables; see [U] 11.4.3 Factor variables. bootstrap, by, fp, jackknife, nestreg, rolling, statsby, stepwise, and svy are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce() and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Binary outcomes > Skewed logit regression 2092 scobit — Skewed logistic regression 2093 Description scobit fits a maximum-likelihood skewed logit model. See [R] logistic for a list of related estimation commands. Options Model noconstant, offset(varname), constraints(constraints), collinear; see [R] estimation options. asis forces retention of perfect predictor variables and their associated perfectly predicted observations and may produce instabilities in maximization; see [R] probit. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. or reports the estimated coefficients transformed to odds ratios, that is, eb rather than b. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated. or may be specified at estimation or when replaying previously estimated results. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). The following option is available with scobit but is not shown in the dialog box: coeflegend; see [R] estimation options. 
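Because the or option described above affects only how results are displayed, it may be applied either when fitting or when redisplaying results; a minimal sketch, anticipating the model fit in example 1 below:
. * report exp(b) rather than b at estimation time
. scobit foreign mpg, or
. * replay the stored results, again as odds ratios
. scobit, or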
Remarks and examples Remarks are presented under the following headings: Skewed logistic model Robust standard errors 2094 scobit — Skewed logistic regression Skewed logistic model scobit fits maximum likelihood models with dichotomous dependent variables coded as 0/1 (or, more precisely, coded as 0 and not 0). Example 1 We have data on the make, weight, and mileage rating of 22 foreign and 52 domestic automobiles. We wish to fit a model explaining whether a car is foreign based on its mileage. Here is an overview of our data: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . keep make mpg weight foreign . describe Contains data from http://www.stata-press.com/data/r13/auto.dta obs: 74 1978 Automobile Data vars: 4 13 Apr 2013 17:45 size: 1,702 (_dta has notes) variable name make mpg weight foreign storage type display format str18 int int byte %-18s %8.0g %8.0gc %8.0g value label variable label origin Make and Model Mileage (mpg) Weight (lbs.) Car type Sorted by: foreign Note: dataset has changed since last saved . inspect foreign foreign: Car type Number of Observations # # # # # # Negative Zero Positive # # 0 Total Missing 1 Total 52 22 74 - Integers 52 22 Nonintegers - 74 - 74 (2 unique values) foreign is labeled and all values are documented in the label. The variable foreign takes on two unique values, 0 and 1. The value 0 denotes a domestic car, and 1 denotes a foreign car. scobit — Skewed logistic regression 2095 The model that we wish to fit is Pr(foreign = 1) = F (β0 + β1 mpg) where F (z) = 1 − 1/{1 + exp(z)}α . To fit this model, we type . scobit foreign mpg Fitting logistic model: Iteration 0: log likelihood Iteration 1: log likelihood Iteration 2: log likelihood Iteration 3: log likelihood Iteration 4: log likelihood Fitting full model: Iteration 0: log likelihood Iteration 1: log likelihood Iteration 2: log likelihood Iteration 3: log likelihood Iteration 4: log likelihood Iteration 5: log likelihood Skewed logistic regression = -45.03321 = -39.380959 = -39.288802 = -39.28864 = -39.28864 = = = = = = -39.28864 -39.286393 -39.284415 -39.284234 -39.284197 -39.284196 Number of obs Zero outcomes Nonzero outcomes = = = 74 52 22 Log likelihood = -39.2842 foreign Coef. mpg _cons .1813879 -4.274883 .2407362 1.399305 0.75 -3.06 0.451 0.002 -.2904463 -7.017471 .6532222 -1.532295 /lnalpha -.4450405 3.879885 -0.11 0.909 -8.049476 7.159395 alpha .6407983 2.486224 .0003193 1286.133 Std. Err. z P>|z| [95% Conf. Interval] Likelihood-ratio test of alpha=1: chi2(1) = 0.01 Prob > chi2 = 0.9249 Note: likelihood-ratio tests are recommended for inference with scobit models. We find that cars yielding better gas mileage are less likely to be foreign. The likelihood-ratio test at the bottom of the output indicates that the model is not significantly different from a logit model. Therefore, we should use the more parsimonious model. Technical note Stata interprets a value of 0 as a negative outcome (failure) and treats all other values (except missing) as positive outcomes (successes). Thus if the dependent variable takes on the values 0 and 1, then 0 is interpreted as failure and 1 as success. If the dependent variable takes on the values 0, 1, and 2, then 0 is still interpreted as failure, but both 1 and 2 are treated as successes. 
Formally, when we type scobit y x, Stata fits the model .n oα Pr(yj 6= 0 | xj ) = 1 − 1 1 + exp(xj β) 2096 scobit — Skewed logistic regression Robust standard errors If you specify the vce(robust) option, scobit reports robust standard errors as described in [U] 20.21 Obtaining robust variance estimates. For the model of foreign on mpg, the robust calculation increases the standard error of the coefficient on mpg by around 25%: . scobit foreign mpg, vce(robust) nolog Skewed logistic regression Log pseudolikelihood = Number of obs Zero outcomes Nonzero outcomes -39.2842 Robust Std. Err. z P>|z| = = = 74 52 22 foreign Coef. [95% Conf. Interval] mpg _cons .1813879 -4.274883 .3028487 1.335521 0.60 -3.20 0.549 0.001 -.4121847 -6.892455 .7749606 -1.657311 /lnalpha -.4450405 4.71561 -0.09 0.925 -9.687466 8.797385 alpha .6407983 3.021755 .0000621 6616.919 Without vce(robust), the standard error for the coefficient on mpg was reported to be 0.241, with a resulting confidence interval of [ −0.29, 0.65 ]. Specifying the vce(cluster clustvar) option relaxes the independence assumption required by the skewed logit estimator to being just independence between clusters. To demonstrate this, we will switch to a different dataset. Example 2 We are studying the unionization of women in the United States and have a dataset with 26,200 observations on 4,434 women between 1970 and 1988. For our purposes, we will use the variables age (the women were 14 – 26 in 1968 and the data thus span the age range of 16 – 46), grade (years of schooling completed, ranging from 0 to 18), not smsa (28% of the person-time was spent living outside an SMSA —standard metropolitan statistical area), south (41% of the person-time was in the South), and year. Each of these variables is included in the regression as a covariate along with the interaction between south and year. This interaction, along with the south and year variables, is specified in the scobit command using factor-variables notation, south##c.year. We also have variable union. Overall, 22% of the person-time is marked as time under union membership and 44% of these women have belonged to a union. We fit the following model, ignoring that women are observed an average of 5.9 times each in these data: scobit — Skewed logistic regression 2097 . use http://www.stata-press.com/data/r13/union, clear (NLS Women 14-24 in 1968) . scobit union age grade not_smsa south##c.year, nrtol(1e-3) (output omitted ) Skewed logistic regression Number of obs Zero outcomes Nonzero outcomes Log likelihood = -13540.61 Std. Err. z P>|z| = = = 26200 20389 5811 union Coef. [95% Conf. Interval] age grade not_smsa 1.south year .0185365 .0452803 -.1886849 -1.422381 -.0133017 .0043615 .0057124 .0317802 .3949298 .0049575 4.25 7.93 -5.94 -3.60 -2.68 0.000 0.000 0.000 0.000 0.007 .0099881 .0340842 -.250973 -2.196429 -.0230182 .0270849 .0564764 -.1263968 -.6483327 -.0035853 south#c.year 1 .0105663 .0049233 2.15 0.032 .0009168 .0202158 _cons -10.19247 63.69015 -0.16 0.873 -135.0229 114.6379 /lnalpha 8.972796 63.68825 0.14 0.888 -115.8539 133.7995 alpha 7885.616 502221.1 4.85e-51 1.28e+58 Likelihood-ratio test of alpha=1: chi2(1) = 3.76 Prob > chi2 = 0.0524 Note: likelihood-ratio tests are recommended for inference with scobit models. The reported standard errors in this model are probably meaningless. Women are observed repeatedly, so the observations are not independent. Looking at the coefficients, we find a large southern effect against unionization and a different time trend for the south. 
The vce(cluster clustvar) option provides a way to fit this model and obtains correct standard errors: . scobit union age grade not_smsa south##c.year, vce(cluster id) nrtol(1e-3) (output omitted ) Skewed logistic regression Number of obs = 26200 Zero outcomes = 20389 Log pseudolikelihood = -13540.61 Nonzero outcomes = 5811 (Std. Err. adjusted for 4434 clusters in idcode) Robust Std. Err. union Coef. z P>|z| [95% Conf. Interval] age grade not_smsa 1.south year .0185365 .0452803 -.1886849 -1.422381 -.0133017 .0084867 .0125764 .0642035 .5064916 .0090621 2.18 3.60 -2.94 -2.81 -1.47 0.029 0.000 0.003 0.005 0.142 .0019029 .0206311 -.3145214 -2.415086 -.0310632 .0351701 .0699296 -.0628484 -.4296756 .0044597 south#c.year 1 .0105663 .0063172 1.67 0.094 -.0018152 .0229478 _cons -10.19247 .945772 -10.78 0.000 -12.04615 -8.33879 /lnalpha 8.972796 .7482517 11.99 0.000 7.506249 10.43934 alpha 7885.616 5900.426 1819.377 34178.16 2098 scobit — Skewed logistic regression scobit, vce(cluster clustvar) is robust to assumptions about within-cluster correlation. That is, it inefficiently sums within cluster for the standard error calculation rather than attempting to exploit what might be assumed about the within-cluster correlation (as do the xtgee population-averaged models; see [XT] xtgee). Technical note The scobit model can be difficult to fit because of the functional form. Often it requires many iterations, or the optimizer prints out warning and informative messages during the optimization. For example, without the nrtol(1e-3) option, the model using the union dataset will not converge. See [R] maximize for details about the optimizer. Technical note The main reason for using scobit rather that logit is that the effects of the regressors on the probability of success are not constrained to be the largest when the probability is 0.5. Rather, the independent variables might show their largest impact when the probability of success is 0.3 or 0.6. This added flexibility results because the scobit function, unlike the logit function, can be skewed and is not constrained to be mirror symmetric about the 0.5 probability of success. As Nagler (1994) pointed out, the point of maximum impact is constrained under the scobit model to fall within the interval (0, 1 − e(−1) ) or approximately (0, 0.63). Achen (2002) notes that if we believe the maximum impact to be outside that range, we can instead estimate the “power logit” model by simply reversing the 0s and 1s of our outcome variable and estimating a scobit model on failure, rather than success. We would need to reverse the signs of the coefficients if we wanted to interpret them in terms of impact on success, or we could leave them as they are and interpret them in terms of impact on failure. The important thing to remember is that the scobit model, unlike the logit model, is not invariant to the choice of which result is assigned to success. 
scobit — Skewed logistic regression 2099 Stored results scobit stores the following in e(): Scalars e(N) e(k) e(k eq) e(k aux) e(k dv) e(ll) e(ll c) e(N f) e(N s) e(alpha) e(N clust) e(chi2) e(chi2 c) e(p) e(rank) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(clustvar) e(offset) e(chi2type) e(chi2 ct) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(footnote) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) Functions e(sample) number of observations number of parameters number of equations in e(b) number of auxiliary parameters number of dependent variables log likelihood log likelihood, comparison model number of failures (zero outcomes) number of successes (nonzero outcomes) alpha number of clusters χ2 χ2 for comparison test significance rank of e(V) number of iterations return code 1 if converged, 0 otherwise scobit command as typed name of dependent variable weight type weight expression title in estimation output name of cluster variable linear offset variable Wald or LR; type of model χ2 test Wald or LR; type of model χ2 test corresponding to e(chi2 c) vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict program used to implement the footnote display factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample 2100 scobit — Skewed logistic regression Methods and formulas Skewed logit analysis is an alternative to logit that relaxes the assumption that individuals with initial probability of 0.5 are most sensitive to changes in independent variables. The log-likelihood function for skewed logit is lnL = X wj lnF (xj b) + j∈S X wj ln 1 − F (xj b) j6∈S where S is the set of all observations j such that yj 6= 0, F (z) = 1 − 1/ 1 + exp(z) denotes the optional weights. lnL is maximized as described in [R] maximize. α , and wj This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. scobit also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. References Achen, C. H. 2002. Toward a new political methodology: Microfoundations and ART. Annual Review of Political Science 5: 423–450. Nagler, J. 1994. Scobit: An alternative estimator to logit and probit. American Journal of Political Science 38: 230–255. 
Also see [R] scobit postestimation — Postestimation tools for scobit [R] cloglog — Complementary log-log regression [R] glm — Generalized linear models [R] logistic — Logistic regression, reporting odds ratios [SVY] svy estimation — Estimation commands for survey data [U] 20 Estimation and postestimation commands Title scobit postestimation — Postestimation tools for scobit Description Remarks and examples Syntax for predict Also see Menu for predict Options for predict Description The following postestimation commands are available after scobit: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 2 forecast is not appropriate with svy estimation results. lrtest is not appropriate with svy estimation results. 2101 2102 scobit postestimation — Postestimation tools for scobit Syntax for predict predict type predict type newvar if in , statistic nooffset stub* | newvarreg newvarlnalpha if in , scores Description statistic Main probability of a positive outcome; the default xj b, linear prediction standard error of the linear prediction pr xb stdp These statistics are available both in and out of sample; type predict the estimation sample. . . . if e(sample) . . . if wanted only for Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main pr, the default, calculates the probability of a positive outcome. xb calculates the linear prediction. stdp calculates the standard error of the linear prediction. nooffset is relevant only if you specified offset(varname) for scobit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xj b rather than as xj b + offsetj . scores calculates equation-level score variables. The first new variable will contain ∂ ln L/∂(xj β). The second new variable will contain ∂ ln L/∂ lnα. Remarks and examples Once you have fit a model, you can obtain the predicted probabilities by using the predict command for both the estimation sample and other samples; see [U] 20 Estimation and postestimation commands and [R] predict. Here we will make only a few additional comments. predict without arguments calculates the predicted probability of a positive outcome. With the xb option, it calculates the linear combination xj b, where xj are the independent variables in the j th observation and b is the estimated parameter vector. 
With the stdp option, predict calculates the standard error of the prediction, which is not adjusted for replicated covariate patterns in the data. scobit postestimation — Postestimation tools for scobit 2103 Example 1 In example 1 of [R] scobit, we fit the model scobit foreign mpg. To obtain predicted probabilities, we type . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . keep make mpg weight foreign . scobit foreign mpg (output omitted ) . predict p (option pr assumed; Pr(foreign)) . summarize foreign p Variable Obs Mean Std. Dev. foreign p 74 74 .2972973 .2974049 .4601885 .182352 Also see [R] scobit — Skewed logistic regression [U] 20 Estimation and postestimation commands Min Max 0 .0714664 1 .871624 Title sdtest — Variance-comparison tests Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax One-sample variance-comparison test sdtest varname == # if in , level(#) Two-sample variance-comparison test using groups sdtest varname if in , by(groupvar) level(#) Two-sample variance-comparison test using variables sdtest varname1 == varname2 if in , level(#) Immediate form of one-sample variance-comparison test sdtesti # obs # mean | . # sd # val , level(#) Immediate form of two-sample variance-comparison test sdtesti # obs,1 # mean,1 | . # sd,1 # obs,2 # mean,2 | . # sd,2 , level(#) Robust tests for equality of variances robvar varname if in , by(groupvar) by is allowed with sdtest and robvar; see [D] by. Menu sdtest Statistics > Summaries, tables, and tests > Classical tests of hypotheses > Variance-comparison test > Summaries, tables, and tests > Classical tests of hypotheses > Variance-comparison test calculator > Summaries, tables, and tests > Classical tests of hypotheses > Robust equal-variance test sdtesti Statistics robvar Statistics 2104 sdtest — Variance-comparison tests 2105 Description sdtest performs tests on the equality of standard deviations (variances). In the first form, sdtest tests that the standard deviation of varname is #. In the second form, sdtest performs the same test, using the standard deviations of the two groups defined by groupvar. In the third form, sdtest tests that varname1 and varname2 have the same standard deviation. sdtesti is the immediate form of sdtest; see [U] 19 Immediate commands. Both the traditional F test for the homogeneity of variances and Bartlett’s generalization of this test to K samples are sensitive to the assumption that the data are drawn from an underlying Gaussian distribution. See, for example, the cautionary results discussed by Markowski and Markowski (1990). Levene (1960) proposed a test statistic for equality of variance that was found to be robust under nonnormality. Then Brown and Forsythe (1974) proposed alternative formulations of Levene’s test statistic that use more robust estimators of central tendency in place of the mean. These reformulations were demonstrated to be more robust than Levene’s test when dealing with skewed populations. robvar reports Levene’s robust test statistic (W0 ) for the equality of variances between the groups defined by groupvar and the two statistics proposed by Brown and Forsythe that replace the mean in Levene’s formula with alternative location estimators. The first alternative (W50 ) replaces the mean with the median. The second alternative replaces the mean with the 10% trimmed mean (W10 ). Options level(#) specifies the confidence level, as a percentage, for confidence intervals of the means. 
The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. by(groupvar) specifies the groupvar that defines the groups to be compared. For sdtest, there should be two groups, but for robvar there may be more than two groups. Do not confuse the by() option with the by prefix; both may be specified. Remarks and examples Remarks are presented under the following headings: Basic form Immediate form Robust test Basic form sdtest performs two different statistical tests: one testing equality of variances and the other testing that the standard deviation is equal to a known constant. Which test it performs is determined by whether you type a variable name or a number to the right of the equal sign. Example 1: One-sample test of variance We have a sample of 74 automobiles. For each automobile, we know the mileage rating. We wish to test whether the overall standard deviation is 5 mpg: 2106 sdtest — Variance-comparison tests . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . sdtest mpg == 5 One-sample test of variance Variable Obs Mean mpg 74 21.2973 Std. Err. Std. Dev. .6725511 5.785503 sd = sd(mpg) Ho: sd = 5 [95% Conf. Interval] 19.9569 c = chi2 = degrees of freedom = Ha: sd < 5 Pr(C < c) = 0.9717 Ha: sd != 5 2*Pr(C > c) = 0.0565 22.63769 97.7384 73 Ha: sd > 5 Pr(C > c) = 0.0283 Example 2: Variance ratio test We are testing the effectiveness of a new fuel additive. We run an experiment on 12 cars, running each without and with the additive. The data can be found in [R] ttest. The results for each car are stored in the variables mpg1 and mpg2: . use http://www.stata-press.com/data/r13/fuel . sdtest mpg1==mpg2 Variance ratio test Variable Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] mpg1 mpg2 12 12 21 22.75 .7881701 .9384465 2.730301 3.250874 19.26525 20.68449 22.73475 24.81551 combined 24 21.875 .6264476 3.068954 20.57909 23.17091 ratio = sd(mpg1) / sd(mpg2) Ho: ratio = 1 Ha: ratio < 1 Pr(F < f) = 0.2862 Ha: ratio != 1 2*Pr(F < f) = 0.5725 f = degrees of freedom = 0.7054 11, 11 Ha: ratio > 1 Pr(F > f) = 0.7138 We cannot reject the hypothesis that the standard deviations are the same. In [R] ttest, we draw an important distinction between paired and unpaired data, which, in this example, means whether there are 12 cars in a before-and-after experiment or 24 different cars. For sdtest, on the other hand, there is no distinction. If the data had been unpaired and stored as described in [R] ttest, we could have typed sdtest mpg, by(treated), and the results would have been the same. Immediate form Example 3: sdtesti Immediate commands are used not with data, but with reported summary statistics. For instance, to test whether a variable on which we have 75 observations and a reported standard deviation of 6.5 comes from a population with underlying standard deviation 6, we would type sdtest — Variance-comparison tests 2107 . sdtesti 75 . 6.5 6 One-sample test of variance x Obs Mean 75 . Std. Err. Std. Dev. .7505553 6.5 sd = sd(x) Ho: sd = 6 [95% Conf. Interval] . c = chi2 = degrees of freedom = Ha: sd < 6 Pr(C < c) = 0.8542 Ha: sd != 6 2*Pr(C > c) = 0.2916 . 86.8472 74 Ha: sd > 6 Pr(C > c) = 0.1458 The mean plays no role in the calculation, so it may be omitted. To test whether the variable comes from a population with the same standard deviation as another for which we have a calculated standard deviation of 7.5 over 65 observations, we would type . sdtesti 75 . 6.5 65 . 7.5 Variance ratio test Obs Mean Std. Err. Std. Dev. 
x y 75 65 . . .7505553 .9302605 6.5 7.5 . . . . combined 140 . . . . . ratio = sd(x) / sd(y) Ho: ratio = 1 [95% Conf. Interval] f = degrees of freedom = Ha: ratio < 1 Pr(F < f) = 0.1172 Ha: ratio != 1 2*Pr(F < f) = 0.2344 0.7511 74, 64 Ha: ratio > 1 Pr(F > f) = 0.8828 Robust test Example 4: robvar We wish to test whether the standard deviation of the length of stay for patients hospitalized for a given medical procedure differs by gender. Our data consist of observations on the length of hospital stay for 1778 patients: 884 males and 894 females. Length of stay, lengthstay, is highly skewed (skewness coefficient = 4.912591) and thus violates Bartlett’s normality assumption. Therefore, we use robvar to compare the variances. . use http://www.stata-press.com/data/r13/stay . robvar lengthstay, by(sex) sex W0 Summary of Length of stay in days Mean Std. Dev. Freq. male female 9.0874434 8.800671 9.7884747 9.1081478 884 894 Total 8.9432508 9.4509466 1778 = 0.55505315 df(1, 1776) Pr > F = 0.45635888 W50 = 0.42714734 df(1, 1776) Pr > F = 0.51347664 W10 = 0.44577674 df(1, 1776) Pr > F = 0.50443411 2108 sdtest — Variance-comparison tests For these data, we cannot reject the null hypothesis that the variances are equal. However, Bartlett’s test yields a significance probability of 0.0319 because of the pronounced skewness of the data. Technical note robvar implements both the conventional Levene’s test centered at the mean and a median-centered test. In a simulation study, Conover, Johnson, and Johnson (1981) compare the properties of the two tests and recommend using the median test for asymmetric data, although for small sample sizes the test is somewhat conservative. See Carroll and Schneider (1985) for an explanation of why both mean- and median-centered tests have approximately the same level for symmetric distributions, but for asymmetric distributions the median test is closer to the correct level. Stored results sdtest and sdtesti store the following in r(): Scalars r(N) r(p l) r(p u) r(p) r(F) r(sd) r(sd 1) r(sd 2) r(df) r(df 1) r(df 2) r(chi2) number of observations lower one-sided p-value upper one-sided p-value two-sided p-value F statistic standard deviation standard deviation for first variable standard deviation for second variable degrees of freedom numerator degrees of freedom denominator degrees of freedom χ2 robvar stores the following in r(): Scalars r(N) r(w50) r(p w50) r(w0) r(p w0) r(w10) r(p w10) r(df 1) r(df 2) number of observations Brown and Forsythe’s F statistic (median) Brown and Forsythe’s p-value Levene’s F statistic Levene’s p-value Brown and Forsythe’s F statistic (trimmed mean) Brown and Forsythe’s p-value (trimmed mean) numerator degrees of freedom denominator degrees of freedom Methods and formulas See Armitage et al. (2002, 149 – 153) or Bland (2000, 171–172) for an introduction and explanation of the calculation of these tests. The test for σ = σ0 is given by χ2 = (n − 1)s2 σ02 which is distributed as χ2 with n − 1 degrees of freedom. sdtest — Variance-comparison tests 2109 The test for σx2 = σy2 is given by F = s2x s2y which is distributed as F with nx − 1 and ny − 1 degrees of freedom. Let Xij be the j th observation of X for the ith group. Let Zij = |Xij − X i |, where X i is the mean of X in the ith group. Levene’s test statistic is P ni (Z i − Z)2 /(g − 1) W0 = P P i P 2 i j (Zij − Z i ) / i (ni − 1) where ni is the number of observations in group i and g is the number of groups. 
W50 is obtained by replacing X i with the ith group median of Xij , whereas W10 is obtained by replacing X i with the 10% trimmed mean for group i. References Armitage, P., G. Berry, and J. N. S. Matthews. 2002. Statistical Methods in Medical Research. 4th ed. Oxford: Blackwell. Bland, M. 2000. An Introduction to Medical Statistics. 3rd ed. Oxford: Oxford University Press. Brown, M. B., and A. B. Forsythe. 1974. Robust tests for the equality of variances. Journal of the American Statistical Association 69: 364–367. Carroll, R. J., and H. Schneider. 1985. A note on Levene’s tests for equality of variances. Statistics and Probability Letters 3: 191–194. Cleves, M. A. 1995. sg35: Robust tests for the equality of variances. Stata Technical Bulletin 25: 13–15. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 91–93. College Station, TX: Stata Press. . 2000. sg35.2: Robust tests for the equality of variances update to Stata 6. Stata Technical Bulletin 53: 17–18. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 158–159. College Station, TX: Stata Press. Conover, W. J., M. E. Johnson, and M. M. Johnson. 1981. A comparative study of tests for homogeneity of variances, with applications to the outer continental shelf bidding data. Technometrics 23: 351–361. Gastwirth, J. L., Y. R. Gel, and W. Miao. 2009. The impact of Levene’s test of equality of variances on statistical theory and practice. Statistical Science 24: 343–360. Levene, H. 1960. Robust tests for equality of variances. In Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, ed. I. Olkin, S. G. Ghurye, W. Hoeffding, W. G. Madow, and H. B. Mann, 278–292. Menlo Park, CA: Stanford University Press. Markowski, C. A., and E. P. Markowski. 1990. Conditions for the effectiveness of a preliminary test of variance. American Statistician 44: 322–326. Seed, P. T. 2000. sbe33: Comparing several methods of measuring the same quantity. Stata Technical Bulletin 55: 2–9. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 73–82. College Station, TX: Stata Press. Tobı́as, A. 1998. gr28: A graphical procedure to test equality of variances. Stata Technical Bulletin 42: 4–6. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 68–70. College Station, TX: Stata Press. Also see [R] ttest — t tests (mean-comparison tests) Title search — Search Stata documentation and other resources Syntax Options for search Acknowledgment Menu Option for set searchdefault Also see Description Remarks and examples Syntax search word word . . . , search options set searchdefault all | local | net , permanently search options Description all local net search across both the local keyword database and the net material; the default search using Stata’s keyword database search across materials available via Stata’s net command author entry exact search by author’s name search by entry ID search across both the local keyword database and the net materials; prevents matching on abbreviations search the FAQs posted to the Stata website search entries that are of historical interest only list an entry if any of the words typed after search are associated with the entry search the entries in the Stata Documentation search the entries in the Stata Journal and the STB faq historical or manual sj Menu Help > Search... Description search searches a keyword database and the Internet for Stata materials related to your query. 
Capitalization of the words following search is irrelevant, as is the inclusion or exclusion of special characters such as commas and hyphens. set searchdefault affects the default behavior of the search command. all is the default. search, all is the best way to search for information on a topic across all sources, including the system help, the FAQs at the Stata website, the Stata Journal, and all Stata-related Internet sources including user-written additions. From the results, you can click to go to a source or to install additions. 2110 search — Search Stata documentation and other resources 2111 Options for search all, the default (unless changed by set searchdefault), specifies that the search be performed across both the local keyword database and the net materials. The results of a search performed with all and no other options will be displayed in the Viewer window. local specifies that the search be performed using only Stata’s keyword database. The results of a search performed with local and no other options will be displayed in the Viewer window. net specifies that the search across the materials available via Stata’s net command. be performed Using search word word . . . , net is equivalent to typing net search word word . . . (without options); see [R] net search. The results of a search performed with net and no other options will be displayed in the Viewer window. author specifies that the search be performed on the basis of author’s name rather than keywords. A search with the author option is performed on the local keyword database only, and the results are displayed in the Results window. entry specifies that the search be performed on the basis of entry IDs rather than keywords. A search with the entry option is performed on the local keyword database only, and the results are displayed in the Results window. exact prevents matching on abbreviations. A search with the exact option is performed across both the local keyword database and the net materials, and the results are displayed in the Results window. faq limits the search to the FAQs posted on the Stata website: http://www.stata.com. A search with the faq option is performed on the local keyword database only, and the results are displayed in the Results window. historical adds to the search entries that are of historical interest only. By default, such entries are not listed. Past entries are classified as historical if they discuss a feature that later became an official part of Stata. Updates to historical entries will always be found, even if historical is not specified. A search with the historical option is performed on the local keyword database only, and the results are displayed in the Results window. or specifies that an entry be listed if any of the words typed after search are associated with the entry. The default is to list the entry only if all the words specified are associated with the entry. A search with the or option is performed on the local keyword database only, and the results are displayed in the Results window. manual limits the search to entries in the Stata Documentation; that is, the search is limited to the User’s Guide and all the reference manuals. A search with the manual option is performed on the local keyword database only, and the results are displayed in the Results window. sj limits the search to entries in the Stata Journal and its predecessor, the Stata Technical Bulletin; see [R] sj. 
A search with the sj option is performed on the local keyword database only, and the results are displayed in the Results window. Option for set searchdefault permanently specifies that, in addition to making the change right now, the searchdefault setting be remembered and become the default setting when you invoke Stata. 2112 search — Search Stata documentation and other resources Remarks and examples Remarks are presented under the following headings: Introduction Internet searches Author searches Entry ID searches Return codes Introduction See [U] 4 Stata’s help and search facilities for a tutorial introduction to search. search is one of Stata’s most useful commands. To understand the advanced features of search, you need to know how it works. search has a database — files — containing the titles, etc., of every entry in the User’s Guide, the Base Reference Manual, the Data Management Reference Manual, the Graphics Reference Manual, the Longitudinal-Data/Panel-Data Reference Manual, the Multilevel Mixed-Effects Reference Manual, the Multiple-Imputation Reference Manual, the Multivariate Statistics Reference Manual, the Power and Sample-Size Reference Manual, the Programming Reference Manual, the Structural Equation Modeling Reference Manual, the Survey Data Reference Manual, the Survival Analysis and Epidemiological Tables Reference Manual, the Treatment-Effects Reference Manual, the Time-Series Reference Manual, the Mata Reference Manual, undocumented help files, NetCourses, Stata Press books, FAQs posted on the Stata website, videos posted on the Stata YouTube channel, selected articles on StataCorp’s official blog, selected user-written FAQs and examples, and the articles in the Stata Journal and the Stata Technical Bulletin. In these files is a list of words, called keywords, associated with each entry. When you type search xyz, search reads the database and compares the list of keywords with xyz. If it finds xyz in the list or a keyword that allows an abbreviation of xyz, it displays the entry. When you type search xyz abc, search does the same thing but displays an entry only if it contains both keywords. The order does not matter, so you can search linear regression or search regression linear. Obviously, how many entries search finds depends on how the search database was constructed. We have included a plethora of keywords under the theory that, for a given request, it is better to list too much rather than risk listing nothing at all. Still, you are in the position of guessing the keywords. Do you look up normality test, normality tests, or tests of normality? Well, normality test would be best, but all would work. In general, use the singular, and strike the unnecessary words. For guidelines for specifying keywords, see [U] 4.6 More on search. set searchdefault allows you to specify where search searches. set searchdefault all, the default, indicates that both the keyword database and the Internet are to be searched. set searchdefault local restricts search to using only Stata’s keyword database. set searchdefault net restricts search to searching only the Internet. Internet searches search with the net option searches the Internet for user-written additions to Stata, including, but not limited to, user-written additions published in the Stata Journal (SJ) and the Stata Technical Bulletin (STB). search keywords, net performs the same search as the command net search (with no options); see [R] net search. search — Search Stata documentation and other resources 2113 . 
search random effect, net Web resources from Stata and other users (contacting http://www.stata.com) 70 packages found (Stata Journal and STB listed first) -----------------------------------------------------st0156_1 from http://www.stata-journal.com/software/sj11-2 SJ11-2 st0156_1. Update: Multivariate random-effects... / Update: Multivariate random-effects meta-regression / by Ian White / Support: ian.white@mrc-bsu.cam.ac.uk / After installation, type help mvmeta and mvmeta_make st0201 from http://www.stata-journal.com/software/sj10-3 SJ10-3 st0201. metaan: Random-effects meta-analysis / metaan: Random-effects meta-analysis / by Evangelos Kontopantelis, / National Primary Care Research and Development Centre (NPCRDC), / University of Manchester, Manchester, UK / David Reeves, / National Primary Care sbe24_3 from http://www.stata-journal.com/software/sj9-2 SJ9-2 sbe24_3. Update: metan: fixed- and random-effects... / Update: metan: fixed- and random-effects meta-analysis / by Ross J. Harris, Roger M. Harbord, and Jonathan A. C. Sterne, / Department of Social Medicine, University of Bristol / Jonathan J. Deeks, Department of Primary Care (output omitted ) (end of search) Author searches search ordinarily compares the words following search with the keywords for the entry. If you specify the author option, however, it compares the words with the author’s name. In the search database, we have filled in author names for all SJ and STB inserts. For instance, in [R] kdensity in this manual you will discover that Isaı́as H. Salgado-Ugarte wrote the first version of Stata’s kdensity command and published it in the STB. Assume that you read his original insert and found the discussion useful. You might now wonder what else he has written in the SJ or STB. To find out, you type . search Salgado-Ugarte, author (output omitted ) Names like Salgado-Ugarte are confusing to many people. search does not require you to specify the entire name; what you type is compared with each “word” of the name and, if any part matches, the entry is listed. The hyphen is a special character, and you can omit it. Thus you can obtain the same list by looking up Salgado, Ugarte, or Salgado Ugarte without the hyphen. Actually, to find all entries written by Salgado-Ugarte, you need to type . search Salgado-Ugarte, author historical (output omitted ) Prior inserts in the SJ or STB that provide a feature that later was superseded by a built-in feature of Stata are marked as historical in the search database and, by default, are not listed. The historical option ensures that all entries are listed. 2114 search — Search Stata documentation and other resources Entry ID searches If you specify the entry option, search compares what you have typed with the entry ID. The entry ID is not the title — it is the reference listed to the left of the title that tells you where to look. For instance, in [R] regress . . . . . . . . . . . . . . . . . . . . . . Linear regression (help regress) [R] regress is the entry ID. This is a reference, of course, to this manual. In FAQ . . . . . . . . . . . Analysis of multiple failure-time survival data . . . . . . . . . . . . . . . . . . . . . . . M. Cleves and I. Canette 07/09 How do I analyze multiple failure-time data using Stata? http://www.stata.com/support/faqs/statistics/multiple-failuretype-data/ “FAQ” is the entry ID. In SJ-7-1 st0118 . . A survey on . . . . . . . . . . . . 
Q1/07 SJ7(1):1--21 discusses survey issues data and describes some such analyses survey stat.: What is and can be done in Stata . . . . . . . . . . F. Kreuter and R. Valliant (no commands) in analyzing complex survey of Stata’s capabilities for “SJ-7-1” is the entry ID. search with the entry option searches these entry IDs. Thus you could generate a table of contents for the User’s Guide by typing . search [U], entry (output omitted ) You could generate a table of contents for Stata Journal, Volume 1, Issue 1, by typing . search sj-1-1, entry (output omitted ) To generate a table of contents for the 26th issue of the STB, you would type . search STB-26, entry historical (output omitted ) The historical option here is possibly important. STB-26 was published in July 1995, and perhaps some of its inserts have already been marked historical. You could obtain a list of all inserts associated with sg53 by typing . search sg53, entry historical (output omitted ) Again we include the historical option in case any of the relevant inserts have been marked historical. Return codes In addition to indexing the entries in the User’s Guide and all the Reference manuals, search also can be used to search return codes. search — Search Stata documentation and other resources 2115 To see information on return code 131, type . search rc 131 [P] error . . . . . . . . . . . . . . . . . . . . . . . . Return code 131 not possible with test; You requested a test of a hypothesis that is nonlinear in the variables. test tests only linear hypotheses. Use testnl. If you want a list of all Stata return codes, type . search error, entry (output omitted ) Acknowledgment We thank Nicholas J. Cox of the Department of Geography at Durham University, UK, and coeditor of the Stata Journal for his contributions to the search command. Also see [R] help — Display help in Stata [R] net search — Search the Internet for installable packages [U] 4 Stata’s help and search facilities Title serrbar — Graph standard error bar chart Syntax Remarks and examples Menu Acknowledgment Description Also see Options Syntax serrbar mvar svar xvar if in , options Description options Main scale length of graph bars; default is scale(1) scale(#) Error bars affect rendition of capped spikes rcap options Plotted points mvopts(scatter options) affect rendition of plotted points Add plots add other plots to generated graph addplot(plot) Y axis, X axis, Titles, Legend, Overall any options other than by() documented in [G-3] twoway options twoway options Menu Statistics > Other > Quality control > Standard error bar chart Description serrbar graphs mvar ± scale() × svar against xvar. Usually, but not necessarily, mvar and svar will contain means and standard errors or standard deviations of some variable so that a standard error bar chart is produced. Options Main scale(#) controls the length of the bars. The upper and lower limits of the bars will be mvar + scale() × svar and mvar − scale() × svar. The default is scale(1). Error bars rcap options affect the rendition of the plotted error bars (the capped spikes). See [G-2] graph twoway rcap. 2116 serrbar — Graph standard error bar chart 2117 Plotted points mvopts(scatter options) affects the rendition of the plotted points (mvar versus xvar). See [G-2] graph twoway scatter. Add plots addplot(plot) provides a way to add other plots to the generated graph; see [G-3] addplot option. 
Y axis, X axis, Titles, Legend, Overall twoway options are any of the options documented in [G-3] twoway options, excluding by(). These include options for titling the graph (see [G-3] title options) and for saving the graph to disk (see [G-3] saving option). Remarks and examples Example 1 In quality-control applications, the three most commonly used variables with this command are the process mean, process standard deviation, and time. For instance, we have data on the average weights and standard deviations from an assembly line in San Francisco for the period January 8 to January 16. Our data are . use http://www.stata-press.com/data/r13/assembly . list, sep(0) divider 1. 2. 3. 4. 5. 6. 7. date mean std 108 109 110 113 114 115 116 192.22 192.64 192.37 194.76 192.69 195.02 193.40 3.94 2.83 4.58 3.25 2.89 1.73 2.62 We type serrbar mean std date, scale(2) but, after seeing the result, decide to make it fancier: 2118 serrbar — Graph standard error bar chart . serrbar mean std date, scale(2) title("Observed Weight Variation") > sub("San Francisco plant, 1/8 to 1/16") yline(195) yaxis(1 2) > ylab(195, axis(2)) ytitle("", axis(2)) Observed Weight Variation 185 195 Package weight in lbs. 190 195 200 205 San Francisco plant, 1/8 to 1/16 108 110 112 date 114 116 Acknowledgment serrbar was written by Nicholas J. Cox of the Department of Geography at Durham University, UK, and coeditor of the Stata Journal. Also see [R] qc — Quality control charts Title set — Overview of system parameters Syntax Description Remarks and examples Also see Syntax set setcommand . . . set typed without arguments is equivalent to query typed without arguments. Description This entry provides a reference to Stata’s set commands. For many entries, more thorough information is provided elsewhere; see the Reference field in each entry below for the location of this information. To reset system parameters to factory defaults, see [R] set defaults. Remarks and examples set adosize Syntax: set adosize # , permanently Default: 1,000 Description: sets the maximum amount of memory that automatically loaded do-files may consume. 10 ≤ # ≤ 10000. Reference: [P] sysdir set autotabgraphs (Windows only) Syntax: set autotabgraphs on | off , permanently Default: off Description: determines whether graphs are created as tabs within one window or as separate windows. set cformat Syntax: set cformat fmt , permanently Description: specifies the output format of coefficients, standard errors, and confidence limits in coefficient tables. fmt is a numerical format; see [D] format. Reference: [R] set cformat set charset (Mac only) Syntax: set charset mac | latin1 , permanently Default: mac Description: sets the character set used by Stata for Mac for rendering of ASCII text. 2119 2120 set — Overview of system parameters set checksum Syntax: set checksum on | off , permanently Default: off Description: determines whether files should be prevented from being downloaded from the Internet if checksums do not match. Reference: [D] checksum set coeftabresults Syntax: set coeftabresults on | off Default: on Description: determines whether coefficient table results are stored in r(). There is no permanently option because permanently is implied. set conren (Unix console only) Syntax 1: set conren Syntax 2: set conren clear Syntax 3: set conren sf | bf | it result | txt | text | input | error | link | hilite char char. . . Syntax 4: set conren {ulon | uloff} char char . . . Syntax 5: set conren reset char char . . . 
Description: can possibly make the output on your screen appear prettier. set conren displays a list of the currently defined display codes. set conren clear clears all codes. set conren followed by a font type (bf, sf, or it) and display context (result, error, link, or hilite) and then followed by a series of space-separated characters sets the code for the specified font type and display context. If the font type is omitted, the code is set to the same specified code for all three font types. set conren ulon and set conren uloff set the codes for turning on and off underlining. set conren reset sets the code that will turn off all display and underlining codes. Reference: [GSU] conren set copycolor (Mac and Windows only) , permanently Syntax: set copycolor automatic | asis | gs1 | gs2 | gs3 Default: automatic Description: determines how colors are handled when graphs are copied to the Clipboard. Reference: [G-2] set printcolor set dockable Syntax: Default: Description: (Windows only) set dockable on | off , permanently on determines whether to enable the use of dockable window characteristics, including the ability to dock or tab a window into another window. set — Overview of system parameters 2121 set dockingguides (Windows only) , permanently Syntax: set dockingguides on | off Default: on Description: determines whether to enable the use of dockable guides when repositioning a dockable window. set doublebuffer (Windows only) Syntax: set doublebuffer on | off , permanently Default: on Description: enables or disables double buffering of the Results, Viewer, and Data Editor windows. Double buffering prevents the windows from flickering when redrawn or resized. Users who encounter performance problems such as the Results window outputting very slowly should disable double buffering. set dp Syntax: Default: Description: Reference: , permanently set dp comma | period period determines whether a period or a comma is to be used as the decimal point. [D] format set emptycells Syntax: set emptycells keep | drop , permanently Default: keep Description: sets what to do with empty cells in interactions. Reference: [R] set emptycells set eolchar (Mac only) , permanently Syntax: set eolchar mac | unix Default: unix Description: sets the default end-of-line delimiter for text files created in Stata. set fastscroll (Unix and Windows only) Syntax: set fastscroll on | off , permanently Default: on Description: sets the scrolling method for new output in the Results window. Setting fastscroll to on is faster but can be jumpy. Setting fastscroll to off is slower but smoother. set floatwindows (Windows only) Syntax: set floatwindows on | off Default: off Description: determines whether to enable floating window behavior for dialog boxes and dockable window. The term “float” in this context means that a window will always float over the main Stata window; these windows cannot be placed behind the main Stata window. There is no permanently option because permanently is implied. 2122 set — Overview of system parameters set fvlabel Syntax: set fvlabel { on | off } , permanently Description: specifies whether to display factor-variable value labels in coefficient tables. Reference: [R] set showbaselevels set fvwrap Syntax: set fvwrap # , permanently Description: specifies that long value labels wrap # lines in coefficient tables. 
Reference: [R] set showbaselevels set fvwrapon Syntax: set fvwrapon { word | width } , permanently Description: specifies whether value labels that wrap will break at word bounderies or break based on available space. Reference: [R] set showbaselevels set graphics Syntax: Default: Description: Reference: set graphics on | off on; default is off for console Stata determines whether graphs are displayed on your monitor. [G-2] set graphics set haverdir Syntax: set haverdir "path" , permanently Description: specifies the directory where the Haver databases are stored. Reference: [D] import haver set httpproxy Syntax: set httpproxy on | off , init Default: off Description: turns on/off the use of a proxy server. There is no permanently option because permanently is implied. Reference: [R] netio set httpproxyauth Syntax: set httpproxyauth on | off Default: off Description: determines whether authorization is required for the proxy server. There is no permanently option because permanently is implied. Reference: [R] netio set httpproxyhost Syntax: set httpproxyhost " name " Description: sets the name of a host to be used as a proxy server. There is no permanently option because permanently is implied. Reference: [R] netio set — Overview of system parameters set httpproxyport Syntax: set httpproxyport # Default: 8080 if Stata cannot autodetect the proper setting for your computer. Description: sets the port number for a proxy server. There is no permanently option because permanently is implied. Reference: [R] netio set httpproxypw Syntax: set httpproxypw " password " Description: sets the appropriate password. There is no permanently option because permanently is implied. Reference: [R] netio set httpproxyuser Syntax: set httpproxyuser " name " Description: sets the appropriate user ID. There is no permanently option because permanently is implied. Reference: [R] netio set include bitmap (Mac only) Syntax: set include bitmap on | off , permanently Default: on Description: sets the output behavior when copying an image to the Clipboard. set level Syntax: set level # , permanently Default: 95 Description: sets the default confidence level for confidence intervals for all commands that report confidence intervals. 10.00 ≤ # ≤ 99.99, and # can have at most two digits after the decimal point. Reference: [R] level set linegap Syntax: set linegap # Default: 1 Description: sets the space between lines, in pixels, in the Results window. There is no permanently option because permanently is implied. set linesize Syntax: Default: Description: Reference: set linesize # 1 less than the full width of the screen sets the line width, in characters, for both the screen and the log file. [R] log 2123 2124 set — Overview of system parameters set locksplitters (Windows only) , permanently Syntax: set locksplitters on | off Default: off Description: determines whether splitters should be locked so that docked windows cannot be resized. set logtype Syntax: Default: Description: Reference: set logtype text | smcl , permanently smcl sets the default log filetype. [R] log set lstretch Syntax: set lstretch on | off , permanently Description: specifies whether to automatically widen the coefficient table up to the width of the Results window to accommodate longer variable names. set matacache, set matafavor, set matalibs, set matalnum, set matamofirst, set mataoptimize, and set matastrict; see [M-3] mata set. 
set matsize Syntax: set matsize # , permanently Default: 400 for Stata/MP, Stata/SE, and Stata/IC; 40 for Small Stata Description: sets the maximum number of variables that can be included in any estimation command. This setting cannot be changed in Small Stata. 10 ≤ # ≤ 11000 for Stata/MP and Stata/SE; 10 ≤ # ≤ 800 for Stata/IC. Reference: [R] matsize set max memory Syntax: set max memory # b | k | m | g , permanently Default: . (all the memory the operating system will supply) Description: specifies the maximum amount of memory Stata can use to store your data. 2 × segmentsize ≤ # ≤ . Reference: [D] memory set maxdb Syntax: set maxdb # , permanently Default: 50 Description: sets the maximum number of dialog boxes whose contents are remembered from one invocation to the next during a session. 5 ≤ # ≤ 1000 Reference: [R] db set maxiter Syntax: set maxiter # , permanently Default: 16000 Description: sets the default maximum number of iterations for estimation commands. 0 ≤ # ≤ 16000 Reference: [R] maximize set — Overview of system parameters 2125 set maxvar Syntax: set maxvar # , permanently Default: 5000 for Stata/MP and Stata/SE, 2048 for Stata/IC, and 99 for Small Stata Description: sets the maximum number of variables. This can be changed only in Stata/MP and Stata/SE. 2048 ≤ # ≤ 32767 Reference: [D] memory set min memory Syntax: set min memory # b | k | m | g , permanently Default: 0 Description: specifies an amount of memory Stata will not fall below. This setting affects efficiency, not the size of datasets you can analyze. 0 ≤ # ≤ max memory Reference: [D] memory set more Syntax: Default: Description: Reference: set more on | off , permanently on pauses when more is displayed, continuing only when the user presses a key. [R] more set niceness Syntax: set niceness # , permanently Default: 5 Description: affects how soon Stata gives back unused segments to the operating system. 0 ≤ # ≤ 10 Reference: [D] memory set notifyuser (Mac only) Syntax: set notifyuser on | off , permanently Default: on Description: sets the default Notification Manager behavior in Stata. set obs Syntax: set obs # Default: current number of observations Description: changes the number of observations in the current dataset. # must be at least as large as the current number of observations. If there are variables in memory, the values of all new observations are set to missing. Reference: [D] obs set odbcmgr (Unix only) Syntax: set odbcmgr iodbc | unixodbc , permanently Default: iodbc Description: determines whether iODBC or unixODBC is your ODBC driver manager. Reference: [D] odbc 2126 set — Overview of system parameters set output Syntax: set output proc | inform | error Default: proc Description: specifies the output to be displayed. proc means display all output; inform suppresses procedure output but displays informative messages and error messages; error suppresses all output except error messages. set output is seldom used. Reference: [P] quietly set pagesize Syntax: Default: Description: Reference: set pagesize # 2 less than the physical number of lines on the screen sets the number of lines between more messages. [R] more set pformat Syntax: set pformat fmt , permanently Description: specifies the output format of p-values in coefficient tables. fmt is a numerical format; see [D] format. 
Reference: [R] set cformat set pinnable Syntax: Default: Description: (Windows only) set pinnable on | off , permanently on determines whether to enable the use of pinnable window characteristics for certain windows in Stata. set playsnd (Mac only) Syntax: set playsnd on | off , permanently Default: on Description: sets the sound behavior for the Notification Manager behavior in Stata. set printcolor , permanently Syntax: set printcolor automatic | asis | gs1 | gs2 | gs3 Default: automatic Description: determines how colors are handled when graphs are printed. Reference: [G-2] set printcolor set processors Syntax: set processors # Description: sets the number of processors or cores that Stata/MP will use. The default is the number of processors available on the computer, or the number of processors allowed by Stata/MP’s license, whichever is less. set reventries Syntax: set reventries # , permanently Default: 5000 Description: sets the number of scrollback lines available in the Review window. 5 ≤ # ≤ 32000. set — Overview of system parameters 2127 set revkeyboard (Mac only) Syntax: set revkeyboard on | off , permanently Default: on Description: sets the keyboard navigation behavior for the Review window. on indicates that you can use the keyboard to navigate and enter items from the Review window into the Command window. off indicates that all keyboard input be directed at the Command window; items can be entered from the Review window only by using the mouse. set rmsg Syntax: set rmsg on | off , permanently Default: off Description: indicates whether a return message telling the execution time is to be displayed at the completion of each command. Reference: [P] rmsg set scheme Syntax: Default: Description: Reference: set scheme schemename , permanently s2color determines the overall look for graphs. [G-2] set scheme set scrollbufsize Syntax: set scrollbufsize # Default: 200000 Description: sets the scrollback buffer size, in bytes, for the Results window; may be set between 10,000 and 2,000,000. set searchdefault , permanently Syntax: set searchdefault local | net | all Default: local Description: sets the default behavior of the search command. set searchdefault local restricts search to use only Stata’s keyword database. set searchdefault net restricts search to searching only the Internet. set searchdefault all indicates that both the keyword database and the Internet are to be searched. Reference: [R] search set seed Syntax: Default: Description: Reference: set seed # | code 123456789 specifies initial value of the random-number seed used by the runiform() function. [R] set seed set segmentsize Syntax: set segmentsize # b | k | m | g , permanently Default: 32m for 64-bit machines; 16m for 32-bit machines Description: Stata allocates memory for data in units of segmentsize. This setting changes the amount of memory in a single segment. 1m ≤ # ≤ 32g for 64-bit machines; 1m ≤ # ≤ 1g for 32-bit machines Reference: [D] memory 2128 set — Overview of system parameters set sformat Syntax: set sformat fmt , permanently Description: specifies the output format of test statistics in coefficient tables. fmt is a numerical format; see [D] format. Reference: [R] set cformat set showbaselevels Syntax: set showbaselevels { on | off | all } , permanently Description: specifies whether to display base levels of factor variables and their interactions in coefficient tables. 
Reference: [R] set showbaselevels set showemptycells Syntax: set showemptycells { on | off } , permanently Description: specifies whether to display empty cells in coefficient tables. Reference: [R] set showbaselevels set showomitted Syntax: set showomitted { on | off } , permanently Description: specifies whether to display omitted coefficients in coefficient tables. Reference: [R] set showbaselevels set smoothfonts (Mac only) Syntax: set smoothfonts on | off Default: on Description: determines whether to use font smoothing (antialiased text) in the Results, Viewer, and Data Editor windows. set timeout1 Syntax: set timeout1 #seconds , permanently Default: 30 Description: sets the number of seconds Stata will wait for a remote host to respond to an initial contact before giving up. In general, users should not modify this value unless instructed to do so by Stata Technical Services. Reference: [R] netio set timeout2 Syntax: set timeout2 #seconds , permanently Default: 180 Description: sets the number of seconds Stata will keep trying to get information from a remote host after initial contact before giving up. In general, users should not modify this value unless instructed to do so by Stata Technical Services. Reference: [R] netio set trace Syntax: Default: Description: Reference: set trace on | off off determines whether to trace the execution of programs for debugging. [P] trace set — Overview of system parameters 2129 set tracedepth Syntax: set tracedepth # Default: 32000 (equivalent to ∞) Description: if trace is set on, traces execution of programs and nested programs up to tracedepth. For example, if tracedepth is 2, the current program and any subroutine called would be traced, but subroutines of subroutines would not be traced. Reference: [P] trace set traceexpand Syntax: set traceexpand on | off , permanently Default: on Description: if trace is set on, shows lines both before and after macro expansion. If traceexpand is set off, only the line before macro expansion is shown. Reference: [P] trace set tracehilite Syntax: set tracehilite "pattern" , word Default: "" Description: highlights pattern in the trace output. Reference: [P] trace set traceindent Syntax: set traceindent on | off , permanently Default: on Description: if trace is set on, indents displayed lines according to their nesting level. The lines of the main program are not indented. Two spaces of indentation are used for each level of nested subroutine. Reference: [P] trace set tracenumber , permanently Syntax: set tracenumber on | off Default: off Description: if trace is set on, shows the nesting level numerically in front of the line. Lines of the main program are preceded by 01, lines of subroutines called by the main program are preceded by 02, etc. Reference: [P] trace set tracesep Syntax: set tracesep on | off , permanently Default: on Description: if trace is set on, displays a horizontal separator line that displays the name of the subroutine whenever a subroutine is called or exits. Reference: [P] trace 2130 set — Overview of system parameters set type Syntax: Default: Description: Reference: , permanently set type float | double float specifies the default storage type assigned to new variables. [D] generate set update interval (Mac and Windows only) Syntax: set update interval # Default: 7 Description: sets the number of days to elapse before performing the next automatic update query. 
Reference: [R] update set update prompt (Mac and Windows only) Syntax: set update prompt on | off Default: on Description: determines wheter a dialog is to be displayed before performing an automatic update query. There is no permanently option because permanently is implied. Reference: [R] update set update query (Mac and Windows only) Syntax: set update query on | off Default: on Description: determines whether update query is to be automatically performed when Stata is launched. There is no permanently option because permanently is implied. Reference: [R] update set varabbrev Syntax: set varabbrev on | off , permanently Default: on Description: indicates whether Stata should allow variable abbreviations. Reference: [P] varabbrev set varkeyboard (Mac only) Syntax: set varkeyboard on | off , permanently Default: on Description: sets the keyboard navigation behavior for the Variables window. on indicates that you can use the keyboard to navigate and enter items from the Variables window into the Command window. off indicates that all keyboard input be directed at the Command window; items can be entered from the Variables window only by using the mouse. Also see [R] query — Display system parameters [R] set defaults — Reset system parameters to original Stata defaults [P] creturn — Return c-class values [M-3] mata set — Set and display Mata system parameters Title set cformat — Format settings for coefficient tables Syntax Description Option Remarks and examples Also see Syntax set cformat fmt , permanently set pformat fmt , permanently set sformat fmt , permanently where fmt is a numerical format. Description set cformat specifies the output format of coefficients, standard errors, and confidence limits in coefficient tables. set pformat specifies the output format of p-values in coefficient tables. set sformat specifies the output format of test statistics in coefficient tables. Option permanently specifies that, in addition to making the change right now, the setting be remembered and become the default setting when you invoke Stata. Remarks and examples The formatting of the numbers in the coefficient table can be controlled by using the set cformat, set pformat, and set sformat commands or by using the cformat(% fmt), pformat(% fmt), and sformat(% fmt) options at the time of estimation or on replay of the estimation command. See [R] estimation options. The maximum format widths for set cformat, set pformat, and set sformat in coefficient tables are 9, 5, and 8, respectively. 2131 2132 set cformat — Format settings for coefficient tables Example 1 We use auto.dta to illustrate. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress mpg weight displacement Source SS df MS Model Residual 1595.40969 848.049768 2 71 797.704846 11.9443629 Total 2443.45946 73 33.4720474 mpg Coef. weight displacement _cons -.0065671 .0052808 40.08452 Std. Err. .0011662 .0098696 2.02011 t -5.63 0.54 19.84 Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.594 0.000 = = = = = = 74 66.79 0.0000 0.6529 0.6432 3.4561 [95% Conf. Interval] -.0088925 -.0143986 36.05654 -.0042417 .0249602 44.11251 . set cformat %9.2f . regress mpg weight displacement Source SS df MS Model Residual 1595.40969 848.049768 2 71 797.704846 11.9443629 Total 2443.45946 73 33.4720474 mpg Coef. weight displacement _cons -0.01 0.01 40.08 Std. Err. 0.00 0.01 2.02 t -5.63 0.54 19.84 Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.594 0.000 . 
regress mpg weight displacement, cformat(%9.3f) Source SS df MS Model Residual 1595.40969 848.049768 2 71 797.704846 11.9443629 Total 2443.45946 73 33.4720474 mpg Coef. weight displacement _cons -0.007 0.005 40.085 Std. Err. 0.001 0.010 2.020 t -5.63 0.54 19.84 P>|t| 0.000 0.594 0.000 = = = = = = 74 66.79 0.0000 0.6529 0.6432 3.4561 [95% Conf. Interval] -0.01 -0.01 36.06 Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE -0.00 0.02 44.11 = = = = = = 74 66.79 0.0000 0.6529 0.6432 3.4561 [95% Conf. Interval] -0.009 -0.014 36.057 -0.004 0.025 44.113 set cformat — Format settings for coefficient tables To reset the cformat setting to its command-specific default, type . set cformat . regress mpg weight displacement SS df Source MS Model Residual 1595.40969 848.049768 2 71 797.704846 11.9443629 Total 2443.45946 73 33.4720474 mpg Coef. weight displacement _cons -.0065671 .0052808 40.08452 Std. Err. .0011662 .0098696 2.02011 Also see [R] estimation options — Estimation options [R] query — Display system parameters [R] set — Overview of system parameters [U] 20.8 Formatting the coefficient table t -5.63 0.54 19.84 Number of obs F( 2, 71) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.594 0.000 = = = = = = 74 66.79 0.0000 0.6529 0.6432 3.4561 [95% Conf. Interval] -.0088925 -.0143986 36.05654 -.0042417 .0249602 44.11251 2133 Title set defaults — Reset system parameters to original Stata defaults Syntax Description Option Remarks and examples Also see Syntax set defaults category | all , permanently where category is one of memory | output | interface | graphics | efficiency | network | update | trace | mata | other Description set defaults resets settings made by set to the original default settings that were shipped with Stata. set defaults all resets all the categories, whereas set defaults category resets only the settings for the specified category. Option permanently specifies that, in addition to making the change right now, the settings be remembered and become the default settings when you invoke Stata. Remarks and examples Example 1 To assist us in debugging a new command, we modified some of the trace settings. To return them to their original values, we type . set_defaults trace -> set trace off -> set tracedepth 32000 -> set traceexpand on -> set tracesep on -> set traceindent on -> set tracenumber off -> set tracehilite "" (preferences reset) 2134 set defaults — Reset system parameters to original Stata defaults Also see [R] query — Display system parameters [R] set — Overview of system parameters [M-3] mata set — Set and display Mata system parameters 2135 Title set emptycells — Set what to do with empty cells in interactions Syntax Description Option Remarks and examples Also see Syntax set emptycells { keep | drop } , permanently Description set emptycells allows you to control how Stata handles interaction terms with empty cells. Stata can keep empty cells or drop them. The default is to keep empty cells. Option permanently specifies that, in addition to making the change right now, the setting be remembered and become the default setting when you invoke Stata. Remarks and examples By default, Stata keeps empty cells so they can be reported in the coefficient table. For example, type . use http://www.stata-press.com/data/r13/auto . regress mpg rep78#foreign, baselevels and you will see a regression of mpg on 10 indicator variables because rep78 takes on 5 values and foreign takes on 2 values in the auto dataset. 
Two of those cells will be reported as empty because the data contain no observations of foreign cars with a rep78 value of 1 or 2. Many real datasets contain a large number of empty cells, and this could cause the “matsize too small” error, r(908). In that case, type . set emptycells drop to get Stata to drop empty cells from the list of coefficients. If you commonly fit models with empty cells, you can permanently set Stata to drop empty cells by typing the following: . set emptycells drop, permanently Also see [R] set — Overview of system parameters 2136 Title set seed — Specify initial value of random-number seed Syntax Description Remarks and examples Also see Syntax set seed # set seed statecode where # is any number between 0 and 231 − 1 (2,147,483,647), and statecode is a random-number state previously obtained from creturn value c(seed). Description set seed # specifies the initial value of the random-number seed used by the random-number functions, such as runiform() and rnormal(). set seed statecode resets the state of the random-number functions to the value specified, which is a state previously obtained from creturn value c(seed). Remarks and examples Remarks are presented under the following headings: Examples Setting the seed How to choose a seed Do not set the seed too often Preserving and restoring the random-number generator state Examples 1. Specify initial value of random-number seed . set seed 339487731 2. Create variable u containing uniformly distributed pseudorandom numbers on the interval [0, 1) . generate u = runiform() 3. Create variable z containing normally distributed random numbers with mean 0 and standard deviation 1 . generate z = rnormal() 4. Obtain state of pseudorandom-number generator and store it in a local macro named state . local state = c(seed) 5. Restore pseudorandom-number generator state to that previously stored in local macro named state . set seed ‘state’ 2137 2138 set seed — Specify initial value of random-number seed Setting the seed Stata’s random-number generation functions, such as runiform() and rnormal(), do not really produce random numbers. These functions are deterministic algorithms that produce numbers that can pass for random. runiform() produces numbers that can pass for independent draws from a rectangular distribution over [0, 1); rnormal() produces numbers that can pass for independent draws from N(0, 1). Stata’s random-number functions are formally called pseudorandom-number functions. The sequences these functions produce are determined by the seed, which is just a number and which is set to 123456789 every time Stata is launched. This means that runiform() produces the same sequence each time you start Stata. The first time you use runiform() after Stata is launched, runiform() returns 0.136984078446403146. The second time you use it, runiform() returns 0.643220667960122228. The third time you use it, . . . . To obtain different sequences, you must specify different seeds using the set seed command. You might specify the seed 472195: . set seed 472195 If you were now to use runiform(), the first call would return 0.247166610788553953, the second call would return 0.593119932804256678, and so on. Whenever you set seed 472195, runiform() will return those numbers the first two times you use it. Thus you set the seed to obtain different pseudorandom sequences from the pseudorandom-number functions. 
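As a small sketch of that reproducibility (the seed 472195 is the one used above; the dataset setup and the variable names u1 and u2 are made up for the illustration):

. clear
. set obs 10
. set seed 472195
. generate double u1 = runiform()
. set seed 472195
. generate double u2 = runiform()
. assert u1 == u2        // same seed, same sequence, so the two variables are identical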
If you record the seed you set, pseudorandom results such as results from a simulation or imputed values from mi impute can be reproduced later. Whatever you do after setting the seed, if you set the seed to the same value and repeat what you did, you will obtain the same results.

How to choose a seed

Your best choice for the seed is an element chosen randomly from the set {0, 1, . . . , 2,147,483,647}. We recommend that, but it is difficult to achieve because finding easy-to-access, truly random sources is difficult. One person we know uses digits from the serial numbers of dollar bills he finds in his wallet. Of course, the numbers he obtains are not really random, but they are good enough, and they are probably a good deal more random than the seeds most people choose. Some people use dates and times, although we recommend against that because, over the day, it just gets later and later, and that is a pattern. Others try to make up a random number, figuring that if they include enough digits, the result just has to be random. This is a variation on the five-second rule for dropped food, and we admit to using both of these rules.

It does not really matter how you set the seed, as long as there is no obvious pattern in the seeds that you set and as long as you do not set the seed too often during a session. Nonetheless, here are two methods that we have seen used but you should not use:

1. The first time you set the seed, you set the number 1. The next time, you set 2, and then 3, and so on. Variations on this include setting 1001, 1002, 1003, . . . , or setting 1001, 2001, 3001, and so on. Do not follow any of these procedures. The seeds you set must not exhibit a pattern.

2. To set the seed, you obtain a pseudorandom number from runiform() and then use its digits to form the seed. This is a bad idea because the pseudorandom-number generator can converge to a cycle. If you obtained the pseudorandom number from a generator unrelated to those in Stata, this would work well, but then you would have to find a rule to set the first generator's seed. In any case, the pseudorandom-number generators in Stata are all closely related, and so you must not follow this procedure.

Choosing seeds that do not exhibit a pattern is of great importance. That the seeds satisfy the other properties of randomness is minor by comparison.

Do not set the seed too often

We cannot emphasize this enough: Do not set the seed too often. To see why this is such a bad idea, consider the limiting case: you set the seed, draw one pseudorandom number, reset the seed, draw again, and so continue. The pseudorandom numbers you obtain will be nothing more than the seeds you set, run through a mathematical function. The results you obtain will not pass for random unless the seeds you choose pass for random. If you already had such numbers, why would you even bother to use the pseudorandom-number generator?

The definition of too often is more than once per problem. If you are running a simulation of 10,000 replications, set the seed at the start of the simulation and do not reset it until the 10,000th replication is finished. The pseudorandom-number generators provided by Stata have long periods. The longer you go between setting the seed, the more random-like are the numbers produced.

It is sometimes useful later to be able to reproduce in isolation any one of the replications, and so you might be tempted to set the seed to a known value for each of the replications.
We negatively mentioned setting the seed to 1, 2, . . . , and it is in exactly such situations that we have seen this done. The advantage, however, is that you could reproduce the fifth replication merely by setting the seed to 5 and then repeating whatever it is that is to be replicated. If this is your goal, you do not need to reset the seed. You can record the state of the random-number generator, save the state with your replication results, and then use the recorded states later to reproduce whichever of the replications you wish. This will be discussed in Preserving and restoring the random-number generator state.

There is another reason you might be tempted to set the seed more than once per problem. It sometimes happens that you run a simulation, let's say for 5,000 replications, and then you decide you should have run it for 10,000 replications. Instead of running all 10,000 replications afresh, you decide to save time by running another 5,000 replications and then combining those results with your previous 5,000 results. That is okay. We at StataCorp do this kind of thing. If you do this, it is important that you set the seed especially well, particularly if you repeat this process to add yet another 5,000 replications. It is also important that in each run there be a large enough number of replications, which is to say thousands of them.

Even so, do not do this: You want 500,000 replications. To obtain them, you run in batches of 1,000, setting the seed 500 times. Unless you have a truly random source for the seeds, it is unlikely you can produce a patternless sequence of 500 seeds. The fact that you ran 1,000 replications in between choosing the seeds does not mitigate the requirement that there be no pattern to the seeds you set.

In all cases, the best solution is to set the seed only once and then use the method we suggest in the next section.

Preserving and restoring the random-number generator state

In the previous section, we discussed the case in which you might be tempted to set the seed more frequently than otherwise necessary, either to save time or to be able to rerun any one of the replications. In such cases, there is an alternative to setting a new seed: recording the state of the pseudorandom-number generator and then restoring the state later should the need arise.

The state of the random-number generator is a string that looks like this:

    Xb5804563c43f462544a474abacbdd93d00021fb3

You can obtain the state from c(seed):

. display c(seed)
Xb5804563c43f462544a474abacbdd93d00021fb3

The name c(seed) is unfortunate because it suggests that Xb5804563c43f462544a474abacbdd93d00021fb3 is nothing more than a seed such as 1073741823 in a different guise. It is not. A better name for c(seed) would have been c(rng_state). The state string specifies an entry point into the sequence produced by the pseudorandom-number generator.

Let us explain. The best way to use a pseudorandom-number generator would be to choose a seed once, draw random numbers until you use up the generator, and then get a new generator and choose a new seed. Pseudorandom-number generators have a period, after which they repeat the original sequence. That is what we mean by using up a generator. The period of the pseudorandom-number generator that Stata is currently using is over 2^123. Stata uses the KISS generator. It is difficult to imagine that you could ever use up KISS.
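In code, the record-and-restore pattern just described looks something like this (a sketch for a single session; the seed is arbitrary, and the draws stand in for whatever real work you would do):

. set seed 1073741823
. display runiform()          // ... draw however many numbers you need
. local state = c(seed)       // record the current state string
. set seed `state'            // later: reestablish exactly that state
. display runiform()          // continues the sequence rather than restarting it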
The string reported by c(seed) is an encoded form of the information necessary for Stata to reestablish exactly where it is located in the pseudorandom-number generator's sequence.

We are not seriously suggesting you choose only one seed over your entire lifetime, but let's look at how you might do that. Sometime after birth, when you needed your first random number, you would set your seed,

. set seed 1073741823

On that day, you would draw, say, 10,000 pseudorandom numbers, perhaps to impute some missing values. Being done for the day, you type

. display c(seed)
X15b512f3b2143ab434f1c92f4e7058e400023bc3

The next day, after launching Stata, you type

. set seed X15b512f3b2143ab434f1c92f4e7058e400023bc3

When you type set seed followed by a state string rather than a number, instead of setting the seed, Stata reestablishes the previous state. Thus the next time you draw a pseudorandom number, Stata will produce the 10,001st result after setting seed 1073741823. Let's assume that you draw 100,000 numbers this day. Done for the day, you display c(seed).

. display c(seed)
X5d13d693a72ad0602b093cc4f61e07a500020381

On the third day, after setting the seed to the string above, you will be in a position to draw the 110,001st pseudorandom number. In this way, you would eat your way through the 2^123 random numbers, but you would be unlikely ever to make it to the end. Assuming you did this every day for 100 years, to arrive at the end of the sequence you would need to consume 2.9e+32 pseudorandom numbers per day.

We do not expect you to set the seed just once in your life, but using the state string makes it easy to set the seed just once for a problem. When we do simulations at StataCorp, we record c(seed) for each replication. Just like everybody else, we record results from replications as observations in datasets; we just happen to have an extra variable in the dataset, namely, a string variable named state. That string is filled in, observation by observation, from the then-current values of c(seed), which is a function and so can be used in any context in which a function can be used in Stata.

Anytime we want to reproduce a particular replication, we thus have the information we need to reset the pseudorandom-number generator, and having it in the dataset is convenient because we had to go there anyway to determine which replication we wanted to reproduce. In addition to recording each of the state strings for each replication, we record the closing value of c(seed) as a note, which is easy enough to do:

. note: closing state `c(seed)'

If we want to add more replications later, we have a state string that we can use to continue from where we left off.

Also see
[R] set — Overview of system parameters
[D] functions — Functions

Title
set showbaselevels — Display settings for coefficient tables

Syntax     Description     Option     Remarks and examples     Also see

Syntax
    set showbaselevels { on | off | all } , permanently
    set showemptycells { on | off } , permanently
    set showomitted { on | off } , permanently
    set fvlabel { on | off } , permanently
    set fvwrap # , permanently
    set fvwrapon { word | width } , permanently

Description
set showbaselevels specifies whether to display base levels of factor variables and their interactions in coefficient tables. set showbaselevels on specifies that base levels be reported for factor variables and for interactions whose bases cannot be inferred from their component factor variables.
set showbaselevels all specifies that all base levels of factor variables and interactions be reported. set showemptycells specifies whether to display empty cells in coefficient tables. set showomitted specifies whether to display omitted coefficients in coefficient tables. set fvlabel specifies whether to display factor-variable value labels in coefficient tables. set fvlabel on, the default, specifies that the labels be displayed. set fvlabel off specifies that the levels of factor variables rather than the labels be displayed. set fvwrap # specifies that long value labels wrap # lines in the coefficient table. The default is set fvwrap 1, which means that long value labels will be abbreviated to fit on one line. set fvwrapon specifies whether value labels that wrap will break at word boundaries or break based on available space. set fvwrapon word, the default, specifies that value labels break at word boundaries. set fvwrapon width specifies that value labels break based on available space. Option permanently specifies that, in addition to making the change right now, the setting be remembered and become the default setting when you invoke Stata. 2142 set showbaselevels — Display settings for coefficient tables Remarks and examples Example 1 We illustrate the first three set commands using cholesterol2.dta. . use http://www.stata-press.com/data/r13/cholesterol2 (Artificial cholesterol data, empty cells) . generate x = race . regress chol race##agegrp x note: 2.race#2.agegrp identifies no observations in the sample note: x omitted because of collinearity Source SS df MS Number of obs F( 13, 56) Model 15751.6113 13 1211.66241 Prob > F 5022.71559 56 89.6913498 R-squared Residual Adj R-squared 20774.3269 69 301.077201 Root MSE Total Coef. race white other 12.84185 -.167627 5.989703 5.989703 2.14 -0.03 0.036 0.978 .8430383 -12.16644 24.84067 11.83119 agegrp 20-29 30-39 40-59 60-79 17.24681 31.43847 34.86613 44.43374 5.989703 5.989703 5.989703 5.989703 2.88 5.25 5.82 7.42 0.006 0.000 0.000 0.000 5.247991 19.43966 22.86732 32.43492 29.24562 43.43729 46.86495 56.43256 race#agegrp white 20-29 white 30-39 white 40-59 white 60-79 other 20-29 other 30-39 other 40-59 other 60-79 0 -22.83983 -14.67558 -10.51115 -6.054425 -11.48083 -.6796112 -1.578052 (empty) 8.470719 8.470719 8.470719 8.470719 8.470719 8.470719 8.470719 -2.70 -1.73 -1.24 -0.71 -1.36 -0.08 -0.19 0.009 0.089 0.220 0.478 0.181 0.936 0.853 -39.80872 -31.64447 -27.48004 -23.02331 -28.44971 -17.6485 -18.54694 -5.870939 2.293306 6.457735 10.91446 5.488063 16.28928 15.39084 x _cons 0 175.2309 (omitted) 4.235359 41.37 0.000 166.7464 183.7153 . set showomitted off . set showbaselevels all t P>|t| 70 13.51 0.0000 0.7582 0.7021 9.4706 chol . set showemptycells off Std. Err. = = = = = = [95% Conf. Interval] 2143 2144 set showbaselevels — Display settings for coefficient tables . regress chol race##agegrp x note: 2.race#2.agegrp identifies no observations in the sample note: x omitted because of collinearity Source SS df MS Model Residual 15751.6113 5022.71559 13 56 1211.66241 89.6913498 Total 20774.3269 69 301.077201 Std. Err. t Number of obs F( 13, 56) Prob > F R-squared Adj R-squared Root MSE P>|t| = = = = = = 70 13.51 0.0000 0.7582 0.7021 9.4706 chol Coef. [95% Conf. 
Interval] race black white other 0 12.84185 -.167627 (base) 5.989703 5.989703 2.14 -0.03 0.036 0.978 .8430383 -12.16644 24.84067 11.83119 agegrp 10-19 20-29 30-39 40-59 60-79 0 17.24681 31.43847 34.86613 44.43374 (base) 5.989703 5.989703 5.989703 5.989703 2.88 5.25 5.82 7.42 0.006 0.000 0.000 0.000 5.247991 19.43966 22.86732 32.43492 29.24562 43.43729 46.86495 56.43256 race#agegrp black 10-19 black 20-29 black 30-39 black 40-59 black 60-79 white 10-19 white 30-39 white 40-59 white 60-79 other 10-19 other 20-29 other 30-39 other 40-59 other 60-79 0 0 0 0 0 0 -22.83983 -14.67558 -10.51115 0 -6.054425 -11.48083 -.6796112 -1.578052 (base) (base) (base) (base) (base) (base) 8.470719 8.470719 8.470719 (base) 8.470719 8.470719 8.470719 8.470719 -2.70 -1.73 -1.24 0.009 0.089 0.220 -39.80872 -31.64447 -27.48004 -5.870939 2.293306 6.457735 -0.71 -1.36 -0.08 -0.19 0.478 0.181 0.936 0.853 -23.02331 -28.44971 -17.6485 -18.54694 10.91446 5.488063 16.28928 15.39084 _cons 175.2309 4.235359 41.37 0.000 166.7464 183.7153 set showbaselevels — Display settings for coefficient tables 2145 To restore the display of empty cells, omitted predictors, and baselevels to their command-specific default behavior, type . set showemptycells . set showomitted . set showbaselevels . regress chol race##agegrp x note: 2.race#2.agegrp identifies no observations in the sample note: x omitted because of collinearity SS df MS Number of obs Source F( 13, 56) Model 15751.6113 13 1211.66241 Prob > F Residual 5022.71559 56 89.6913498 R-squared Adj R-squared 20774.3269 69 301.077201 Root MSE Total Std. Err. t P>|t| = = = = = = 70 13.51 0.0000 0.7582 0.7021 9.4706 chol Coef. [95% Conf. Interval] race white other 12.84185 -.167627 5.989703 5.989703 2.14 -0.03 0.036 0.978 .8430383 -12.16644 24.84067 11.83119 agegrp 20-29 30-39 40-59 60-79 17.24681 31.43847 34.86613 44.43374 5.989703 5.989703 5.989703 5.989703 2.88 5.25 5.82 7.42 0.006 0.000 0.000 0.000 5.247991 19.43966 22.86732 32.43492 29.24562 43.43729 46.86495 56.43256 race#agegrp white 20-29 white 30-39 white 40-59 white 60-79 other 20-29 other 30-39 other 40-59 other 60-79 0 -22.83983 -14.67558 -10.51115 -6.054425 -11.48083 -.6796112 -1.578052 (empty) 8.470719 8.470719 8.470719 8.470719 8.470719 8.470719 8.470719 -2.70 -1.73 -1.24 -0.71 -1.36 -0.08 -0.19 0.009 0.089 0.220 0.478 0.181 0.936 0.853 -39.80872 -31.64447 -27.48004 -23.02331 -28.44971 -17.6485 -18.54694 -5.870939 2.293306 6.457735 10.91446 5.488063 16.28928 15.39084 x _cons 0 175.2309 (omitted) 4.235359 41.37 0.000 166.7464 183.7153 2146 set showbaselevels — Display settings for coefficient tables Example 2 We illustrate the last three set commands using jaw.dta. . use http://www.stata-press.com/data/r13/jaw, clear (Table 4.6 Two-Way Unbalanced Data for Fractures of the Jaw -- Rencher (1998)) . mvreg y1 y2 y3 = i.fracture Equation y1 y2 y3 Obs Parms RMSE "R-sq" F P 27 27 27 3 3 3 10.42366 6.325398 5.976973 0.2966 0.1341 0.1024 5.060804 1.858342 1.368879 0.0147 0.1777 0.2735 Coef. Std. Err. t P>|t| [95% Conf. Interval] y1 fracture two compou.. one simple.. -8.833333 6 4.957441 5.394759 -1.78 1.11 0.087 0.277 -19.06499 -5.134235 1.398322 17.13423 _cons 37 3.939775 9.39 0.000 28.8687 45.1313 fracture two compou.. one simple.. -5.761905 -3.053571 3.008327 3.273705 -1.92 -0.93 0.067 0.360 -11.97079 -9.810166 .446977 3.703023 _cons 38.42857 2.390776 16.07 0.000 33.49425 43.36289 fracture two compou.. one simple.. 
4.261905 .9285714 2.842618 3.093377 1.50 0.30 0.147 0.767 -1.60497 -5.455846 10.12878 7.312989 _cons 58.57143 2.259083 25.93 0.000 53.90891 63.23395 y2 y3 set showbaselevels — Display settings for coefficient tables . set fvwrap 2 . mvreg y1 y2 y3 = i.fracture Equation Obs Parms y1 y2 y3 27 27 27 3 3 3 Coef. RMSE "R-sq" F P 10.42366 6.325398 5.976973 0.2966 0.1341 0.1024 5.060804 1.858342 1.368879 0.0147 0.1777 0.2735 Std. Err. t P>|t| [95% Conf. Interval] y1 fracture two compound fractures one simple fracture -8.833333 4.957441 -1.78 0.087 -19.06499 1.398322 6 5.394759 1.11 0.277 -5.134235 17.13423 _cons 37 3.939775 9.39 0.000 28.8687 45.1313 fracture two compound fractures one simple fracture -5.761905 3.008327 -1.92 0.067 -11.97079 .446977 -3.053571 3.273705 -0.93 0.360 -9.810166 3.703023 _cons 38.42857 2.390776 16.07 0.000 33.49425 43.36289 fracture two compound fractures one simple fracture 4.261905 2.842618 1.50 0.147 -1.60497 10.12878 .9285714 3.093377 0.30 0.767 -5.455846 7.312989 _cons 58.57143 2.259083 25.93 0.000 53.90891 63.23395 y2 y3 2147 2148 set showbaselevels — Display settings for coefficient tables . set fvwrapon width . mvreg y1 y2 y3 = i.fracture Equation Obs Parms y1 y2 y3 27 27 27 3 3 3 Coef. RMSE "R-sq" F P 10.42366 6.325398 5.976973 0.2966 0.1341 0.1024 5.060804 1.858342 1.368879 0.0147 0.1777 0.2735 Std. Err. t P>|t| [95% Conf. Interval] y1 fracture two compound fractures one simple f racture -8.833333 4.957441 -1.78 0.087 -19.06499 1.398322 6 5.394759 1.11 0.277 -5.134235 17.13423 _cons 37 3.939775 9.39 0.000 28.8687 45.1313 fracture two compound fractures one simple f racture -5.761905 3.008327 -1.92 0.067 -11.97079 .446977 -3.053571 3.273705 -0.93 0.360 -9.810166 3.703023 _cons 38.42857 2.390776 16.07 0.000 33.49425 43.36289 fracture two compound fractures one simple f racture 4.261905 2.842618 1.50 0.147 -1.60497 10.12878 .9285714 3.093377 0.30 0.767 -5.455846 7.312989 _cons 58.57143 2.259083 25.93 0.000 53.90891 63.23395 y2 y3 set showbaselevels — Display settings for coefficient tables . set showfvlabel off . mvreg y1 y2 y3 = i.fracture Equation Obs Parms y1 y2 y3 27 27 27 3 3 3 Coef. RMSE "R-sq" F P 10.42366 6.325398 5.976973 0.2966 0.1341 0.1024 5.060804 1.858342 1.368879 0.0147 0.1777 0.2735 Std. Err. t P>|t| [95% Conf. Interval] y1 fracture 2 3 -8.833333 6 4.957441 5.394759 -1.78 1.11 0.087 0.277 -19.06499 -5.134235 1.398322 17.13423 _cons 37 3.939775 9.39 0.000 28.8687 45.1313 fracture 2 3 -5.761905 -3.053571 3.008327 3.273705 -1.92 -0.93 0.067 0.360 -11.97079 -9.810166 .446977 3.703023 _cons 38.42857 2.390776 16.07 0.000 33.49425 43.36289 fracture 2 3 4.261905 .9285714 2.842618 3.093377 1.50 0.30 0.147 0.767 -1.60497 -5.455846 10.12878 7.312989 _cons 58.57143 2.259083 25.93 0.000 53.90891 63.23395 y2 y3 2149 2150 set showbaselevels — Display settings for coefficient tables To restore these last three set commands to their defaults, type . set showfvlabel on . set fvwrap 1 . set fvwrapon word . mvreg y1 y2 y3 = i.fracture Equation Obs Parms y1 y2 y3 27 27 27 3 3 3 Coef. RMSE "R-sq" F P 10.42366 6.325398 5.976973 0.2966 0.1341 0.1024 5.060804 1.858342 1.368879 0.0147 0.1777 0.2735 Std. Err. t P>|t| [95% Conf. Interval] y1 fracture two compou.. one simple.. -8.833333 6 4.957441 5.394759 -1.78 1.11 0.087 0.277 -19.06499 -5.134235 1.398322 17.13423 _cons 37 3.939775 9.39 0.000 28.8687 45.1313 fracture two compou.. one simple.. 
-5.761905 -3.053571 3.008327 3.273705 -1.92 -0.93 0.067 0.360 -11.97079 -9.810166 .446977 3.703023 _cons 38.42857 2.390776 16.07 0.000 33.49425 43.36289 fracture two compou.. one simple.. 4.261905 .9285714 2.842618 3.093377 1.50 0.30 0.147 0.767 -1.60497 -5.455846 10.12878 7.312989 _cons 58.57143 2.259083 25.93 0.000 53.90891 63.23395 y2 y3 Also see [R] set — Overview of system parameters [R] query — Display system parameters Title signrank — Equality tests on matched data Syntax Stored results Menu Methods and formulas Description References Remarks and examples Also see Syntax Wilcoxon matched-pairs signed-ranks test signrank varname = exp if in Sign test of matched pairs signtest varname = exp if in by is allowed with signrank and signtest; see [D] by. Menu signrank Statistics > Nonparametric analysis > Tests of hypotheses > Wilcoxon matched-pairs signed-rank test Nonparametric analysis > Tests of hypotheses > Test equality of matched pairs signtest Statistics > Description signrank tests the equality of matched pairs of observations by using the Wilcoxon matched-pairs signed-ranks test (Wilcoxon 1945). The null hypothesis is that both distributions are the same. signtest also tests the equality of matched pairs of observations (Arbuthnott [1710], but better explained by Snedecor and Cochran [1989]) by calculating the differences between varname and the expression. The null hypothesis is that the median of the differences is zero; no further assumptions are made about the distributions. This, in turn, is equivalent to the hypothesis that the true proportion of positive (negative) signs is one-half. For equality tests on unmatched data, see [R] ranksum. Remarks and examples Example 1: signrank We are testing the effectiveness of a new fuel additive. We run an experiment with 12 cars. We first run each car without the fuel treatment and measure the mileage. We then add the fuel treatment and repeat the experiment. The results of the experiment are 2151 2152 signrank — Equality tests on matched data Without treatment 20 23 21 25 18 17 With treatment 24 25 21 22 23 18 Without treatment 18 24 20 24 23 19 With treatment 17 28 24 27 21 23 We create two variables called mpg1 and mpg2, representing mileage without and with the treatment, respectively. We can test the null hypothesis that the treatment had no effect by typing . use http://www.stata-press.com/data/r13/fuel . signrank mpg1=mpg2 Wilcoxon signed-rank test sign obs sum ranks expected positive negative zero 3 8 1 13.5 63.5 1 38.5 38.5 1 all 12 78 78 unadjusted variance adjustment for ties adjustment for zeros 162.50 -1.62 -0.25 adjusted variance 160.62 Ho: mpg1 = mpg2 z = Prob > |z| = -1.973 0.0485 The output indicates that we can reject the null hypothesis at any level above 4.85%. Example 2: signtest signtest tests that the median of the differences is zero, making no further assumptions, whereas signrank assumed that the distributions are equal as well. Using the data above, we type . signtest mpg1=mpg2 Sign test sign observed expected positive negative zero 3 8 1 5.5 5.5 1 all 12 12 One-sided tests: Ho: median of mpg1 - mpg2 = 0 vs. Ha: median of mpg1 - mpg2 > 0 Pr(#positive >= 3) = Binomial(n = 11, x >= 3, p = 0.5) = 0.9673 Ho: median of mpg1 - mpg2 = 0 vs. Ha: median of mpg1 - mpg2 < 0 Pr(#negative >= 8) = Binomial(n = 11, x >= 8, p = 0.5) = 0.1133 signrank — Equality tests on matched data Two-sided test: Ho: median of mpg1 - mpg2 = 0 vs. 
Ha: median of mpg1 - mpg2 != 0
Pr(#positive >= 8 or #negative >= 8) = min(1, 2*Binomial(n = 11, x >= 8, p = 0.5)) = 0.2266
The summary table indicates that there were three comparisons for which mpg1 exceeded mpg2, eight comparisons for which mpg2 exceeded mpg1, and one comparison for which they were the same. The output below the summary table is based on the binomial distribution. The significance of the one-sided test, where the alternative hypothesis is that the median of mpg2 − mpg1 is greater than zero, is 0.1133. The significance of the two-sided test, where the alternative hypothesis is simply that the median of the differences is different from zero, is 0.2266 = 2 × 0.1133.
Stored results
signrank stores the following in r():
Scalars
  r(N_neg)     number of negative comparisons
  r(N_pos)     number of positive comparisons
  r(N_tie)     number of tied comparisons
  r(sum_pos)   sum of the positive ranks
  r(sum_neg)   sum of the negative ranks
  r(z)         z statistic
  r(Var_a)     adjusted variance
signtest stores the following in r():
Scalars
  r(N_neg)     number of negative comparisons
  r(N_pos)     number of positive comparisons
  r(N_tie)     number of tied comparisons
  r(p_2)       two-sided probability
  r(p_neg)     one-sided probability of negative comparison
  r(p_pos)     one-sided probability of positive comparison
Methods and formulas
For a practical introduction to these techniques with an emphasis on examples rather than theory, see Bland (2000) or Sprent and Smeeton (2007). For a summary of these tests, see Snedecor and Cochran (1989).
Methods and formulas are presented under the following headings:
signrank
signtest
signrank
Both the sign test and Wilcoxon signed-rank tests test the null hypothesis that the distribution of a random variable D = varname − exp has median zero. The sign test makes no additional assumptions, but the Wilcoxon signed-rank test makes the additional assumption that the distribution of D is symmetric. If D = X1 − X2, where X1 and X2 have the same distribution, then it follows that the distribution of D is symmetric about zero. Thus the Wilcoxon signed-rank test is often described as a test of the hypothesis that two distributions are the same, that is, X1 ∼ X2.
Let dj denote the difference for any matched pair of observations, dj = x1j − x2j = varname − exp for j = 1, 2, . . . , n. Rank the absolute values of the differences, |dj|, and assign any tied values the average rank. Consider the signs of dj, and let rj = sign(dj) rank(|dj|) be the signed ranks. The test statistic is
T_{\text{obs}} = \sum_{j=1}^{n} r_j = (\text{sum of ranks for + signs}) - (\text{sum of ranks for − signs})
The null hypothesis is that the distribution of dj is symmetric about 0. Hence the likelihood is unchanged if we flip signs on the dj, and thus the randomization datasets are the 2^n possible sign changes for the dj. Thus the randomization distribution of our test statistic T can be computed by considering all the 2^n possible values of
T = \sum_{j=1}^{n} S_j r_j
where the rj are the observed signed ranks (considered fixed) and Sj is either +1 or −1. With this distribution, the mean and variance of T are given by
E(T) = 0 \qquad \text{and} \qquad \mathrm{Var}_{\text{adj}}(T) = \sum_{j=1}^{n} r_j^2
The test statistic for the Wilcoxon signed-rank test is often expressed (equivalently) as the sum of the positive signed ranks, T+, where
E(T_+) = \frac{n(n+1)}{4} \qquad \text{and} \qquad \mathrm{Var}_{\text{adj}}(T_+) = \frac{1}{4}\sum_{j=1}^{n} r_j^2
Zeros and ties do not affect the theory above, and the exact variance is still given by the above formula for Var_adj(T+).
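Before turning to how zeros and ties enter these formulas, the quantities above can be checked against signrank's stored results. The following is a minimal sketch that reruns Example 1 on the fuel data used earlier in this entry; r(sum_pos), r(Var_a), and r(z) are the stored results documented above, and the dataset URL is the one from Example 1.
. use http://www.stata-press.com/data/r13/fuel, clear
. quietly signrank mpg1 = mpg2
. display "T+ = " r(sum_pos) "   Var_adj(T+) = " r(Var_a) "   z = " r(z)
The displayed values match the positive-rank sum (13.5), adjusted variance (160.62), and z statistic (−1.973) reported in Example 1.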
When dj = 0 is observed, dj will always be zero in each of the randomization datasets (using sign(0) = 0). When there are ties, you can assign averaged ranks for each group of ties and then treat them the same as the other ranks.
The “unadjusted variance” reported by signrank is the variance that the randomization distribution would have had if there had been no ties or zeros:
\mathrm{Var}_{\text{unadj}}(T_+) = \frac{1}{4}\sum_{j=1}^{n} j^2 = \frac{n(n+1)(2n+1)}{24}
The adjustment for zeros is the change in the variance when the ranks for the zeros are signed to make rj = 0,
\Delta\mathrm{Var}_{\text{zero adj}}(T_+) = -\frac{1}{4}\sum_{j=1}^{n_0} j^2 = -\frac{n_0(n_0+1)(2n_0+1)}{24}
where n0 is the number of zeros. The adjustment for ties is the change in the variance when the ranks (for nonzero observations) are replaced by averaged ranks:
\Delta\mathrm{Var}_{\text{ties adj}}(T_+) = \mathrm{Var}_{\text{adj}}(T_+) - \mathrm{Var}_{\text{unadj}}(T_+) - \Delta\mathrm{Var}_{\text{zero adj}}(T_+)
A normal approximation is used to calculate
z = \frac{T_+ - E(T_+)}{\sqrt{\mathrm{Var}_{\text{adj}}(T_+)}}
signtest
The test statistic for the sign test is the number n+ of differences dj = x1j − x2j = varname − exp greater than zero. Assuming that the probability of a difference being equal to zero is exactly zero, then, under the null hypothesis, n+ ∼ binomial(n, p = 1/2), where n is the total number of observations.
But what if some differences are zero? This question has a ready answer if you view the test from the perspective of Fisher’s Principle of Randomization (Fisher 1935). Fisher’s idea (stated in a modern way) was to look at a family of transformations of the observed data such that the a priori likelihood (under the null hypothesis) of the transformed data is the same as the likelihood of the observed data. The distribution of the test statistic is then produced by calculating its value for each of the transformed “randomization” datasets, assuming that each dataset is equally likely.
For the sign test, the “data” are simply the set of signs of the differences. Under the null hypothesis of the sign test, the probability that dj is less than zero is equal to the probability that dj is greater than zero. Thus you can transform the observed signs by flipping any number of them, and the set of signs will have the same likelihood. The 2^n possible sign changes form the family of randomization datasets. If you have no zeros, this procedure again leads to n+ ∼ binomial(n, p = 1/2).
If you do have zeros, changing their signs leaves them as zeros. So, if you observe n0 zeros, each of the 2^n sign-change datasets will also have n0 zeros. Hence, the values of n+ calculated over the sign-change datasets range from 0 to n − n0, and the “randomization” distribution of n+ is binomial(n − n0, p = 1/2).
The work of Arbuthnott (1710) and later eighteenth-century contributions is discussed by Hald (2003, chap. 17).
Frank Wilcoxon (1892–1965) was born in Ireland to American parents. After working in various occupations (including merchant seaman, oil-well pump attendant, and tree surgeon), he settled in chemistry, gaining degrees from Rutgers and Cornell and employment from various companies. Working mainly on the development of fungicides and insecticides, Wilcoxon became interested in statistics in 1925 and made several key contributions to nonparametric methods. After retiring from industry, he taught statistics at Florida State until his death.
References
Arbuthnott, J. 1710. An argument for divine providence, taken from the constant regularity observed in the births of both sexes.
Philosophical Transaction of the Royal Society of London 27: 186–190. Bland, M. 2000. An Introduction to Medical Statistics. 3rd ed. Oxford: Oxford University Press. Bradley, R. A. 2001. Frank Wilcoxon. In Statisticians of the Centuries, ed. C. C. Heyde and E. Seneta, 420–424. New York: Springer. Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver & Boyd. Hald, A. 2003. A History of Probability and Statistics and Their Applications before 1750. New York: Wiley. Harris, T., and J. W. Hardin. 2013. Exact Wilcoxon signed-rank and Wilcoxon Mann–Whitney ranksum tests. Stata Journal 13: 337–343. Kaiser, J. 2007. An exact and a Monte Carlo proposal to the Fisher–Pitman permutation tests for paired replicates and for independent samples. Stata Journal 7: 402–412. Newson, R. B. 2006. Confidence intervals for rank statistics: Somers’ D and extensions. Stata Journal 6: 309–334. Snedecor, G. W., and W. G. Cochran. 1989. Statistical Methods. 8th ed. Ames, IA: Iowa State University Press. Sprent, P., and N. C. Smeeton. 2007. Applied Nonparametric Statistical Methods. 4th ed. Boca Raton, FL: Chapman & Hall/CRC. Sribney, W. M. 1995. crc40: Correcting for ties and zeros in sign and rank tests. Stata Technical Bulletin 26: 2–4. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 5–8. College Station, TX: Stata Press. Wilcoxon, F. 1945. Individual comparisons by ranking methods. Biometrics 1: 80–83. Also see [R] ranksum — Equality tests on unmatched data [R] ttest — t tests (mean-comparison tests) Title simulate — Monte Carlo simulations Syntax References Description Also see Options Remarks and examples Syntax simulate exp list , reps(#) options : command options Description nodots noisily trace saving( filename, . . .) nolegend verbose seed(#) suppress replication dots display any output from command trace command save results to filename suppress table legend display the full table legend set random-number seed to # All weight types supported by command are allowed; see [U] 11.1.6 weight. exp list contains elist contains eexp is specname is eqno is (name: elist) elist eexp newvar = (exp) (exp) specname [eqno]specname b b[] se se[] ## name exp is a standard Stata expression; see [U] 13 Functions and expressions. Distinguish between [ ], which are to be typed, and , which indicate optional arguments. Description simulate eases the programming task of performing Monte Carlo–type simulations. Typing . simulate exp list, reps(#): command runs command for # replications and collects the results in exp list. 2157 2158 simulate — Monte Carlo simulations command defines the command that performs one simulation. Most Stata commands and userwritten programs can be used with simulate, as long as they follow standard Stata syntax; see [U] 11 Language syntax. The by prefix may not be part of command. exp list specifies the expression to be calculated from the execution of command. If no expressions are given, exp list assumes a default, depending upon whether command changes results in e() or r(). If command changes results in e(), the default is b. If command changes results in r() (but not e()), the default is all the scalars posted to r(). It is an error not to specify an expression in exp list otherwise. Options reps(#) is required—it specifies the number of replications to be performed. nodots suppresses display of the replication dots. By default, one dot character is displayed for each successful replication. 
A red ‘x’ is displayed if command returns an error or if one of the values in exp list is missing. noisily requests that any output from command be displayed. This option implies the nodots option. trace causes a trace of the execution of command to be displayed. This option implies the noisily option. saving( filename , suboptions ) creates a Stata data file (.dta file) consisting of (for each statistic in exp list) a variable containing the simulated values. double specifies that the results for each replication be saved as doubles, meaning 8-byte reals. By default, they are saved as floats, meaning 4-byte reals. every(#) specifies that results be written to disk every #th replication. every() should be specified only in conjunction with saving() when command takes a long time for each replication. This will allow recovery of partial results should some other software crash your computer. See [P] postfile. replace specifies that filename be overwritten if it exists. nolegend suppresses display of the table legend. The table legend identifies the rows of the table with the expressions they represent. verbose requests that the full table legend be displayed. By default, coefficients and standard errors are not displayed. seed(#) sets the random-number seed. Specifying this option is equivalent to typing the following command before calling simulate: . set seed # Remarks and examples For an introduction to Monte Carlo methods, see Cameron and Trivedi (2010, chap. 4). White (2010) provides a command for analyzing results of simulation studies. simulate — Monte Carlo simulations 2159 Example 1: Simulating basic summary statistics We have a dataset containing means and variances of 100-observation samples from a lognormal distribution (as a first step in evaluating, say, the coverage of a 95%, t-based confidence interval). Then we perform the experiment 1,000 times. The following command definition will generate 100 independent observations from a lognormal distribution and compute the summary statistics for this sample. program lnsim, rclass version 13 drop _all set obs 100 gen z = exp(rnormal()) summarize z return scalar mean = r(mean) return scalar Var = r(Var) end We can save 1,000 simulated means and variances from lnsim by typing . set seed 1234 . simulate mean=r(mean) var=r(Var), reps(1000) nodots: lnsim command: lnsim mean: r(mean) var: r(Var) . describe * variable name mean var . summarize Variable storage type float float mean var display format value label variable label %9.0g %9.0g r(mean) r(Var) Obs Mean 1000 1000 1.638466 4.63856 Std. Dev. .214371 6.428406 Min Max 1.095099 .8626 2.887392 175.3746 Technical note Before executing our lnsim simulator, we can verify that it works by executing it interactively. . set seed 1234 . lnsim obs was 0, now 100 Variable z . return list scalars: Obs Mean 100 1.597757 r(Var) = r(mean) = Std. Dev. 1.734328 3.007893773683719 1.59775722913444 Min Max .0625807 12.71548 2160 simulate — Monte Carlo simulations Example 2: Simulating a regression model Consider a more complicated problem. Let’s experiment with fitting yj = a + bxj + uj when the true model has a = 1, b = 2, uj = zj + cxj , and when zj is N (0, 1). We will save the parameter estimates and standard errors and experiment with varying c. xj will be fixed across experiments but will originally be generated as N (0, 1). We begin by interactively making the true data: . drop _all . set obs 100 obs was 0, now 100 . set seed 54321 . gen x = rnormal() . gen true_y = 1+2*x . 
save truth
file truth.dta saved
Our program is
program hetero1
        version 13
        args c
        use truth, clear
        gen y = true_y + (rnormal() + `c'*x)
        regress y x
end
Note the use of `c' in our statement for generating y. c is a local macro generated from args c and thus refers to the first argument supplied to hetero1. If we want c = 3 for our experiment, we type
. simulate _b _se, reps(10000): hetero1 3
(output omitted )
Our program hetero1 could, however, be more efficient because it rereads the file truth once every replication. It would be better if we could read the data just once. In fact, if we read in the data right before running simulate, we really should not have to reread for each subsequent replication. A faster version reads
program hetero2
        version 13
        args c
        capture drop y
        gen y = true_y + (rnormal() + `c'*x)
        regress y x
end
Requiring that the current dataset has the variables true_y and x may become inconvenient. Another improvement would be to require that the user supply variable names, such as in
program hetero3
        version 13
        args truey x c
        capture drop y
        gen y = `truey' + (rnormal() + `c'*`x')
        regress y x
end
Thus we can type
. simulate _b _se, reps(10000): hetero3 true_y x 3
(output omitted )
Example 3: Simulating a ratio of statistics
Now let’s consider the problem of simulating the ratio of two medians. Suppose that each sample of size ni comes from a normal population with a mean µi and standard deviation σi, where i = 1, 2. We write the program below and save it as a text file called myratio.ado (see [U] 17 Ado-files). Our program is an rclass command that requires six arguments as input, identified by the local macros n1, mu1, sigma1, n2, mu2, and sigma2, which correspond to n1, µ1, σ1, n2, µ2, and σ2, respectively. With these arguments, myratio will generate the data for the two samples, use summarize to compute the two medians and store the ratio of the medians in r(ratio).
program myratio, rclass
        version 13
        args n1 mu1 sigma1 n2 mu2 sigma2
        // generate the data
        drop _all
        local N = `n1'+`n2'
        set obs `N'
        tempvar y
        generate `y' = rnormal()
        replace `y' = cond(_n<=`n1',`mu1'+`y'*`sigma1',`mu2'+`y'*`sigma2')
        // calculate the medians
        tempname m1
        summarize `y' if _n<=`n1', detail
        scalar `m1' = r(p50)
        summarize `y' if _n>`n1', detail
        // store the results
        return scalar ratio = `m1' / r(p50)
end
The result of running our simulation is
. set seed 19192
. simulate ratio=r(ratio), reps(1000) nodots: myratio 5 3 1 10 3 2
      command:  myratio 5 3 1 10 3 2
        ratio:  r(ratio)
. summarize
    Variable |       Obs        Mean    Std. Dev.       Min        Max
       ratio |      1000     1.08571    .4427828   .3834799   6.742217
Technical note
Stata lets us do simulations of simulations and simulations of bootstraps. Stata’s bootstrap command (see [R] bootstrap) works much like simulate, except that it feeds the user-written program a bootstrap sample. Say that we want to evaluate the bootstrap estimator of the standard error of the median when applied to lognormally distributed data. We want to perform a simulation, resulting in a dataset of medians and bootstrap estimated standard errors.
As background, summarize (see [R] summarize) calculates summary statistics, leaving the mean in r(mean) and the standard deviation in r(sd). summarize with the detail option also calculates summary statistics, but more of them, and leaves the median in r(p50).
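As a concrete illustration of that background (the dataset and variable here are arbitrary choices for demonstration, not part of the original example):
. sysuse auto, clear
. quietly summarize price, detail
. display "mean = " r(mean) "   sd = " r(sd) "   median = " r(p50)
Without the detail option, r(mean) and r(sd) would still be filled in, but r(p50) would not.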
Thus our plan is to perform simulations by randomly drawing a dataset: we calculate the median of our random sample, we use bootstrap to obtain a dataset of medians calculated from bootstrap samples of our random sample, the standard deviation of those medians is our estimate of the standard error, and the summary statistics are stored in the results of summarize. 2162 simulate — Monte Carlo simulations Our simulator is program define bsse, rclass version 13 drop _all set obs 100 gen x = rnormal() tempfile bsfile bootstrap midp=r(p50), rep(100) saving(‘bsfile’): summarize x, detail use ‘bsfile’, clear summarize midp return scalar mean = r(mean) return scalar sd = r(sd) end We can obtain final results, running our simulation 1,000 times, by typing . set seed 48901 . simulate med=r(mean) bs_se=r(sd), reps(1000): bsse command: bsse med: r(mean) bs_se: r(sd) Simulations (1000) 1 2 3 4 5 .................................................. 50 .................................................. 100 .................................................. 150 .................................................. 200 .................................................. 250 .................................................. 300 .................................................. 350 .................................................. 400 .................................................. 450 .................................................. 500 .................................................. 550 .................................................. 600 .................................................. 650 .................................................. 700 .................................................. 750 .................................................. 800 .................................................. 850 .................................................. 900 .................................................. 950 .................................................. 1000 . summarize Variable Obs Mean Std. Dev. Min med bs_se 1000 1000 -.0008696 .126236 .1210451 .029646 -.3132536 .0326791 Max .4058724 .2596813 This is a case where the simulation dots (drawn by default, unless the nodots option is specified) will give us an idea of how long this simulation will take to finish as it runs. References Cameron, A. C., and P. K. Trivedi. 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press. Gould, W. W. 1994. ssi6.1: Simplified Monte Carlo simulations. Stata Technical Bulletin 20: 22–24. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 207–210. College Station, TX: Stata Press. simulate — Monte Carlo simulations 2163 Hamilton, L. C. 2013. Statistics with Stata: Updated for Version 12. 8th ed. Boston: Brooks/Cole. Hilbe, J. M. 2010. Creating synthetic discrete-response regression models. Stata Journal 10: 104–124. Weesie, J. 1998. ip25: Parameterized Monte Carlo simulations: Enhancement to the simulation command. Stata Technical Bulletin 43: 13–15. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 75–77. College Station, TX: Stata Press. White, I. R. 2010. simsum: Analyses of simulation studies including Monte Carlo error. Stata Journal 10: 369–385. 
Also see [R] bootstrap — Bootstrap sampling and estimation [R] jackknife — Jackknife estimation [R] permute — Monte Carlo permutation tests Title sj — Stata Journal and STB installation instructions Description Remarks and examples Also see Description The Stata Journal (SJ) is a quarterly journal containing articles about statistics, data analysis, teaching methods, and effective use of Stata’s language. The SJ publishes reviewed papers together with shorter notes and comments, regular columns, tips, book reviews, and other material of interest to researchers applying statistics in a variety of disciplines. You can read all about the Stata Journal at http://www.stata-journal.com. The Stata Journal is a printed and electronic journal with corresponding software. If you want the journal, you must subscribe, but the software is available for no charge from our website at http://www.stata-journal.com. PDF copies of SJ articles that are older than three years are available for download for no charge at http://www.stata-journal.com/archives.html. More recent articles may be individually purchased. The predecessor to the Stata Journal was the Stata Technical Bulletin (STB). The STB was also a printed and electronic journal with corresponding software. PDF copies of all STB journals are available for download for no charge at http://www.stata-press.com/journals/stbj.html. The STB software is available for no charge from our website at http://www.stata.com. Below are instructions for installing the Stata Journal and the Stata Technical Bulletin software from our website. Remarks and examples Remarks are presented under the following headings: Installing the Stata Journal software Obtaining from the Internet by pointing and clicking Obtaining from the Internet via command mode Installing the STB software Obtaining from the Internet by pointing and clicking Obtaining from the Internet via command mode Installing the Stata Journal software Each issue of the Stata Journal is labeled Volume #, Number #. Volume 1 refers to the first year of publication, Volume 2 to the second, and so on. Issues are numbered 1, 2, 3, and 4 within each year. The first issue of the Journal was published in the fourth quarter of 2001, and that issue is numbered Volume 1, Number 1. For installation purposes, we refer to this issue as sj1-1. The articles, columns, notes, and comments that make up the Stata Journal are assigned a letterand-number code, called an insert tag, such as st0001, an0034, or ds0011. The letters represent a category: st is the statistics category, an is the announcements category, etc. The numbers are assigned sequentially, so st0001 is the first article in the statistics category. Sometimes inserts are subsequently updated, either to fix bugs or to add new features. A number such as st0001 1 indicates that this article, column, note, or comment is an update to the original st0001 article. Updates are complete; that is, installing st0001 1 provides all the features of the original article and more. 2164 sj — Stata Journal and STB installation instructions 2165 The Stata Journal software may be obtained by pointing and clicking or by using command mode. The sections below detail how to install an insert. In all cases, pretend that you wish to install insert st0274 from sj12-4. Obtaining from the Internet by pointing and clicking 1. Select Help > SJ and User-written Programs. 2. Click on Stata Journal. 3. Click on sj12-4. 4. Click on st0274. 5. Click on (click here to install). 
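However you obtain and install a package, you can afterward list the user-written additions present on your machine. This is an optional check, not part of the installation steps; ado is a standard Stata command and is not specific to the Stata Journal.
. ado dir
ado dir lists each installed package together with the site it was installed from, so you can verify that an insert such as st0274 arrived as expected.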
Obtaining from the Internet via command mode
Type the following:
. net from http://www.stata-journal.com/software
. net cd sj12-4
. net describe st0274
. net install st0274
The above could be shortened to
. net from http://www.stata-journal.com/software/sj12-4
. net describe st0274
. net install st0274
Alternatively, you could type
. net sj 12-4
. net describe st0274
. net install st0274
but going about it the long way is more entertaining, at least the first time.
Installing the STB software
Each issue of the STB is numbered. STB-1 refers to the first issue (published May 1991), STB-2 refers to the second (published July 1991), and so on. An issue of the STB consists of inserts—articles—and these are assigned letter-and-number combinations, such as sg84, dm80, sbe26.1, etc. The letters represent a category; for example, sg is the general statistics category and dm the data management category. The numbers are assigned sequentially, so sbe39 is the 39th insert in the biostatistics and epidemiology series.
Insert sbe39, it turns out, provides a method of accounting for publication bias in meta-analysis; it adds a new command called metatrim to Stata. If you installed sbe39, you would have that command and its help file. Insert sbe39 was published in STB-57 (September 2000). Obtaining metatrim simply requires going to STB-57 and getting sbe39.
Sometimes inserts were subsequently updated, either to fix bugs or to add new features. sbe39 was updated: the first update is sbe39.1 and the second is sbe39.2. You could install insert sbe39.2, and it would not matter whether you had previously installed sbe39.1. Updates are complete: installing sbe39.2 provides all the features of the original insert and more.
For computer naming purposes, insert sbe39.2 is referred to as sbe39_2. When referred to in normal text, however, the insert is still called sbe39.2 because that looks nicer. Inserts are easily available from the Internet. Inserts may be obtained by pointing and clicking or by using command mode. The sections below detail how to install an insert. In all cases, pretend that you wish to install insert sbe39.2 from STB-61.
Obtaining from the Internet by pointing and clicking
1. Select Help > SJ and User-written Programs.
2. Click on STB.
3. Click on stb61.
4. Click on sbe39_2.
5. Click on (click here to install).
Obtaining from the Internet via command mode
Type the following:
. net from http://www.stata.com
. net cd stb
. net cd stb61
. net describe sbe39_2
. net install sbe39_2
The above could be shortened to
. net from http://www.stata.com/stb/stb61
. net describe sbe39_2
. net install sbe39_2
but going about it the long way is more entertaining, at least the first time.
Also see
[R] search — Search Stata documentation and other resources
[R] net — Install and manage user-written additions from the Internet
[R] net search — Search the Internet for installable packages
[R] update — Check for official updates
[U] 3.5 The Stata Journal
[U] 28 Using the Internet to keep up to date
[GSM] 19 Updating and extending Stata—Internet functionality
[GSU] 19 Updating and extending Stata—Internet functionality
[GSW] 19 Updating and extending Stata—Internet functionality
Title
sktest — Skewness and kurtosis test for normality
Syntax   Menu   Description   Option   Remarks and examples   Stored results   Methods and formulas   Acknowledgments   References   Also see
Syntax
sktest varlist [if] [in] [weight] [, noadjust]
aweights and fweights are allowed; see [U] 11.1.6 weight.
Menu Statistics > Summaries, tables, and tests > Distributional plots and tests > Skewness and kurtosis normality test Description For each variable in varlist, sktest presents a test for normality based on skewness and another based on kurtosis and then combines the two tests into an overall test statistic. sktest requires a minimum of 8 observations to make its calculations. See [MV] mvtest normality for multivariate tests of normality. Option Main noadjust suppresses the empirical adjustment made by Royston (1991c) to the overall χ2 and its significance level and presents the unaltered test as described by D’Agostino, Belanger, and D’Agostino (1990). Remarks and examples Also see [R] swilk for the Shapiro – Wilk and Shapiro – Francia tests for normality. Those tests are, in general, preferred for nonaggregated data (Gould and Rogers 1991; Gould 1992; Royston 1991c). Moreover, a normal quantile plot should be used with any test for normality; see [R] diagnostic plots for more information. 2167 2168 sktest — Skewness and kurtosis test for normality Example 1 Using our automobile dataset, we will test whether the variables mpg and trunk are normally distributed: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . sktest mpg trunk Skewness/Kurtosis tests for Normality Variable Obs Pr(Skewness) Pr(Kurtosis) adj chi2(2) mpg trunk 74 74 0.0015 0.9115 0.0804 0.0445 10.95 4.19 joint Prob>chi2 0.0042 0.1228 We can reject the hypothesis that mpg is normally distributed, but we cannot reject the hypothesis that trunk is normally distributed, at least at the 12% level. The kurtosis for trunk is 2.19, as can be verified by issuing the command . summarize trunk, detail (output omitted ) and the p-value of 0.0445 shown in the table above indicates that it is significantly different from the kurtosis of a normal distribution at the 5% significance level. However, on the basis of skewness alone, we cannot reject the hypothesis that trunk is normally distributed. Technical note sktest implements the test as described by D’Agostino, Belanger, and D’Agostino (1990) but with the adjustment made by Royston (1991c). In the above example, if we had specified the noadjust option, the χ2 values would have been 13.13 for mpg and 4.05 for trunk. With the adjustment, the χ2 value might show as ‘.’. This result should be interpreted as an absurdly large number; the data are most certainly not normal. Stored results sktest stores the following in r(): Scalars r(chi2) r(P skew) r(P kurt) r(P chi2) Matrices r(N) r(Utest) χ2 Pr(skewness) Pr(kurtosis) Prob > chi2 matrix of observations matrix of test results, one row per variable sktest — Skewness and kurtosis test for normality 2169 Methods and formulas sktest implements the test described by D’Agostino, Belanger, and D’Agostino (1990) with the empirical correction developed by Royston (1991c). Let g1 denote the coefficient of skewness and b2 denote the coefficient of kurtosis as calculated by summarize, and let n denote the sample size. If weights are specified, then g1 , b2 , and n denote the weighted coefficients of skewness and kurtosis and weighted sample size, respectively. See [R] summarize for the formulas for skewness and kurtosis. 
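The skewness and kurtosis coefficients that enter these formulas are the ones left behind by summarize with the detail option. As a quick check, shown here for the auto dataset used in Example 1, you can display them directly before (or after) running sktest:
. use http://www.stata-press.com/data/r13/auto, clear
. quietly summarize trunk, detail
. display "skewness = " r(skewness) "   kurtosis = " r(kurtosis) "   n = " r(N)
The kurtosis of 2.19 for trunk quoted in Example 1 is the r(kurtosis) value displayed here.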
To perform the test of skewness, we compute
Y = g_1 \left\{ \frac{(n+1)(n+3)}{6(n-2)} \right\}^{1/2}
\qquad
\beta_2(g_1) = \frac{3(n^2+27n-70)(n+1)(n+3)}{(n-2)(n+5)(n+7)(n+9)}
and
W^2 = -1 + \left[\, 2\{\beta_2(g_1)-1\} \,\right]^{1/2}
\qquad
\alpha = \left\{ 2/(W^2-1) \right\}^{1/2}
Then the distribution of the test statistic
Z_1 = \frac{1}{\sqrt{\ln W}} \, \ln\left[\, Y/\alpha + \left\{ (Y/\alpha)^2 + 1 \right\}^{1/2} \,\right]
is approximately standard normal under the null hypothesis that the data are distributed normally.
To perform the test of kurtosis, we compute
E(b_2) = \frac{3(n-1)}{n+1}
\qquad
\mathrm{var}(b_2) = \frac{24n(n-2)(n-3)}{(n+1)^2(n+3)(n+5)}
\qquad
X = \{\, b_2 - E(b_2) \,\}\big/\sqrt{\mathrm{var}(b_2)}
and
\sqrt{\beta_1(b_2)} = \frac{6(n^2-5n+2)}{(n+7)(n+9)} \left\{ \frac{6(n+3)(n+5)}{n(n-2)(n-3)} \right\}^{1/2}
\qquad
A = 6 + \frac{8}{\sqrt{\beta_1(b_2)}} \left[ \frac{2}{\sqrt{\beta_1(b_2)}} + \left\{ 1 + \frac{4}{\beta_1(b_2)} \right\}^{1/2} \right]
Then the distribution of the test statistic
Z_2 = \frac{1}{\sqrt{2/(9A)}} \left[ \left( 1 - \frac{2}{9A} \right) - \left\{ \frac{1 - 2/A}{1 + X\sqrt{2/(A-4)}} \right\}^{1/3} \right]
is approximately standard normal under the null hypothesis that the data are distributed normally.
D’Agostino, Belanger, and D’Agostino Jr.’s omnibus test of normality uses the statistic
K^2 = Z_1^2 + Z_2^2
which has approximately a χ2 distribution with 2 degrees of freedom under the null of normality.
Royston (1991c) proposed the following adjustment to the test of normality, which sktest uses by default. Let Φ(x) denote the cumulative standard normal distribution function for x, and let Φ−1(p) denote the inverse cumulative standard normal function [that is, x = Φ−1{Φ(x)}]. Define the following terms:
Z_c = -\Phi^{-1}\left\{ \exp\left( -\tfrac{1}{2}K^2 \right) \right\}
and
Z_t = 0.55\, n^{0.2} - 0.21
a_1 = (-5 + 3.46 \ln n)\exp(-1.37 \ln n)
b_1 = 1 + (0.854 - 0.148 \ln n)\exp(-0.55 \ln n)
a_2 = a_1 - \{ 2.13/(1 - 2.37 \ln n) \}\, Z_t
b_2 = 2.13/(1 - 2.37 \ln n) + b_1
If Zc < −1, set Z = Zc; else if Zc < Zt, set Z = a1 + b1 Zc; else set Z = a2 + b2 Zc. Define P = 1 − Φ(Z). Then K^2 = −2 ln P is approximately distributed χ2 with 2 degrees of freedom.
The relative merits of the skewness and kurtosis test versus the Shapiro–Wilk and Shapiro–Francia tests have been a subject of debate. The interested reader is directed to the articles in the Stata Technical Bulletin. Our recommendation is to use the Shapiro–Francia test whenever possible, that is, whenever dealing with nonaggregated or ungrouped data (Gould and Rogers 1991; Gould 1992); see [R] swilk. If normality is rejected, use sktest to determine the source of the problem.
As both D’Agostino, Belanger, and D’Agostino (1990) and Royston (1991d) mention, researchers should also examine the normal quantile plot to determine normality rather than blindly relying on a few test statistics. See the qnorm command documented in [R] diagnostic plots for more information on normal quantile plots.
sktest is similar in spirit to the Jarque–Bera (1987) test of normality. The Jarque–Bera test statistic is also calculated from the sample skewness and kurtosis, though it is based on asymptotic standard errors with no corrections for sample size. In effect, sktest offers two adjustments for sample size, that of Royston (1991c) and that of D’Agostino, Belanger, and D’Agostino (1990).
Acknowledgments
sktest has benefited greatly from the comments and work of Patrick Royston of the MRC Clinical Trials Unit, London, and coauthor of the Stata Press book Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model. At this point, the program should be viewed as due as much to Royston as to us, except, of course, for any errors. We are also indebted to Nicholas J. Cox of the Department of Geography at Durham University, UK, and coeditor of the Stata Journal for his helpful comments.
References D’Agostino, R. B., A. J. Belanger, and R. B. D’Agostino, Jr. 1990. A suggestion for using powerful and informative tests of normality. American Statistician 44: 316–321. . 1991. sg3.3: Comment on tests of normality. Stata Technical Bulletin 3: 20. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 105–106. College Station, TX: Stata Press. Gould, W. W. 1991. sg3: Skewness and kurtosis tests of normality. Stata Technical Bulletin 1: 20–21. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 99–101. College Station, TX: Stata Press. . 1992. sg11.1: Quantile regression with bootstrapped standard errors. Stata Technical Bulletin 9: 19–21. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 137–139. College Station, TX: Stata Press. sktest — Skewness and kurtosis test for normality 2171 Gould, W. W., and W. H. Rogers. 1991. sg3.4: Summary of tests of normality. Stata Technical Bulletin 3: 20–23. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 106–110. College Station, TX: Stata Press. Jarque, C. M., and A. K. Bera. 1987. A test for normality of observations and regression residuals. International Statistical Review 2: 163–172. Marchenko, Y. V., and M. G. Genton. 2010. A suite of commands for fitting the skew-normal and skew-t models. Stata Journal 10: 507–539. Royston, P. 1991a. sg3.1: Tests for departure from normality. Stata Technical Bulletin 2: 16–17. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 101–104. College Station, TX: Stata Press. . 1991b. sg3.2: Shapiro–Wilk and Shapiro–Francia tests. Stata Technical Bulletin 3: 19. Reprinted in Stata Technical Bulletin Reprints, vol. 1, p. 105. College Station, TX: Stata Press. . 1991c. sg3.5: Comment on sg3.4 and an improved D’Agostino test. Stata Technical Bulletin 3: 23–24. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 110–112. College Station, TX: Stata Press. . 1991d. sg3.6: A response to sg3.3: Comment on tests of normality. Stata Technical Bulletin 4: 8–9. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 112–114. College Station, TX: Stata Press. 
Also see [R] diagnostic plots — Distributional diagnostic plots [R] ladder — Ladder of powers [R] lv — Letter-value displays [R] swilk — Shapiro – Wilk and Shapiro – Francia tests for normality [MV] mvtest normality — Multivariate normality tests Title slogit — Stereotype logistic regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax slogit depvar indepvars options if in weight , options Description Model dimension(#) baseoutcome(# | lbl) constraints(numlist) collinear nocorner dimension of the model; default is dimension(1) set the base outcome to # or lbl; default is the last outcome apply specified linear constraints keep collinear variables do not generate the corner constraints SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife Reporting level(#) nocnsreport display options set confidence level; default is level(95) do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization nonormalize control the maximization process; seldom used method of initializing scale parameters; initype can be constant, random, or svd; see Options for details do not normalize the numeric variables coeflegend display legend instead of statistics maximize options initialize(initype) indepvars may contain factor variables; see [U] 11.4.3 Factor variables. bootstrap, by, fp, jackknife, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce() and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 2172 slogit — Stereotype logistic regression 2173 Menu Statistics > Categorical outcomes > Stereotype logistic regression Description slogit fits maximum-likelihood stereotype logistic regression models as developed by Anderson (1984). Like multinomial logistic and ordered logistic models, stereotype logistic models are for use with categorical dependent variables. In a multinomial logistic model, the categories cannot be ranked, whereas in an ordered logistic model the categories follow a natural ranking scheme. You can view stereotype logistic models as a compromise between those two models. You can use them when you are unsure of the relevance of the ordering, as is often the case when subjects are asked to assess or judge something. You can also use them in place of multinomial logistic models when you suspect that some of the alternatives are similar. Unlike ordered logistic models, stereotype logistic models do not impose the proportional-odds assumption. Options Model dimension(#) specifies the dimension of the model, which is the number of equations required to describe the relationship between the dependent variable and the independent variables. The maximum dimension is min(m − 1, p), where m is the number of categories of the dependent variable and p is the number of independent variables in the model. The stereotype model with maximum dimension is a reparameterization of the multinomial logistic model. baseoutcome(# | lbl) specifies the outcome level whose scale parameters and intercept are constrained to be zero. The base outcome may be specified as a number of a label. 
By default, slogit assumes that the outcome levels are ordered and uses the largest level of the dependent variable as the base outcome. constraints(numlist), collinear; see [R] estimation options. By default, the linear equality constraints suggested by Anderson (1984), termed the corner constraints, are generated for you. You can add constraints to these as needed, or you can turn off the corner constraints by specifying nocorner. These constraints are in addition to the constraints placed on the φ parameters corresponding to baseoutcome(#). nocorner specifies that slogit not generate the corner constraints. If you specify nocorner, you must specify at least dimension() × dimension() constraints for the model to be identified. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. If specifying vce(bootstrap) or vce(jackknife), you must also specify baseoutcome(). Reporting level(#); see [R] estimation options. nocnsreport; see [R] estimation options. 2174 slogit — Stereotype logistic regression display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). initialize(constant | random | svd) specifies how initial estimates are computed. The default, initialize(constant), is to set the scale parameters to the constant min(1/2, 1/d), where d is the dimension specified in dimension(). initialize(random) requests that uniformly distributed random numbers between 0 and 1 be used as initial values for the scale parameters. If you specify this option, you should also use set seed to ensure that you can replicate your results; see [R] set seed. initialize(svd) requests that a singular value decomposition (SVD) be performed on the matrix of regression estimates from mlogit to reduce its rank to the dimension specified in dimension(). slogit uses the reduced-rank components of the SVD as initial estimates for the scale and regression coefficients. For details, see Methods and formulas. nonormalize specifies that the numeric variables not be normalized. Normalization of the numeric variables improves numerical stability but consumes more memory in generating temporary doubleprecision variables. Variables that are of type byte are not normalized, and if initial estimates are specified using the from() option, normalization of variables is not performed. See Methods and formulas for more information. The following option is available with slogit but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Remarks are presented under the following headings: Introduction One-dimensional model Higher-dimension models Introduction Stereotype logistic models are often used when subjects are requested to assess or judge something. 
For example, consider a survey in which consumers may be asked to rate the quality of a product on a scale from 1 to 5, with 1 indicating poor quality and 5 indicating excellent quality. If the categories are monotonically related to an underlying latent variable, the ordered logistic model is appropriate. However, suppose that consumers assess quality not just along one dimension, but rather weigh two or three latent factors. Stereotype logistic regression allows you to specify multiple equations to capture the effects of those latent variables, which you then parameterize in terms of observable characteristics. Unlike with multinomial logit, the number of equations you specify could be less than m − 1, where m is the number of categories of the dependent variable. slogit — Stereotype logistic regression 2175 Stereotype logistic models are also used when categories may be indistinguishable. Suppose that a consumer must choose among A, B, C, or D. Multinomial logistic modeling assumes that the four choices are distinct in the sense that a consumer choosing one of the goods can distinguish its characteristics from the others. If goods A and B are in fact similar, consumers may be randomly picking between the two. One alternative is to combine the two categories and fit a three-category multinomial logistic model. A more flexible alternative is to use a stereotype logistic model. e k , k = 1, . . . , m − 1, In the multinomial logistic model, you estimate m − 1 parameter vectors β where m is the number of categories of the dependent variable. The stereotype logistic model is a restriction on the multinomial model in the sense that there are d parameter vectors, where d is between one and min(m − 1, p), and p is the number of regressors. The relationship between the stereotype model’s coefficients βj , j = 1, . . . , d, and the multinomial model’s coefficients is P e k = − d φjk βj . The φs are scale parameters to be estimated along with the βj s. β j=1 Given a row vector of covariates x, let ηk = θk − outcome k is Pd j=1 φjk xβj . The probability of observing exp (ηk ) 1 + Pm−1 exp (η ) l l=1 Pr(Yi = k) = 1 Pm−1 1 + l=1 exp (ηl ) k chi2 Log likelihood = -159.25691 ( 1) = = = 135 9.33 0.0535 [phi1_1]_cons = 1 repair Coef. foreign mpg price gratio 5.947382 .1911968 -.0000576 -4.307571 /phi1_1 /phi1_2 /phi1_3 /phi1_4 /phi1_5 /theta1 /theta2 /theta3 /theta4 /theta5 Std. Err. 2.094126 .08554 .0001357 1.884713 z P>|z| [95% Conf. Interval] 2.84 2.24 -0.42 -2.29 0.005 0.025 0.671 0.022 1.84297 .0235414 -.0003236 -8.00154 10.05179 .3588521 .0002083 -.6136017 1 1.262268 1.17593 .8657195 0 (constrained) .3530565 3.58 .3169397 3.71 .2411228 3.59 (base outcome) 0.000 0.000 0.000 .5702904 .5547394 .3931275 1.954247 1.79712 1.338311 -6.864749 -7.613977 -5.80655 -3.85724 0 4.21252 -1.63 4.861803 -1.57 4.987508 -1.16 3.824132 -1.01 (base outcome) 0.103 0.117 0.244 0.313 -15.12114 -17.14294 -15.58189 -11.3524 1.391639 1.914981 3.968786 3.637922 (repair=Excellent is the base outcome) The coefficient associated with the first scale parameter, φ11 , is 1, and its standard error and other statistics are missing. This is the corner constraint applied to the one-dimensional model; in the header, this constraint is listed as [phi1 1] cons = 1. Also, the φ and θ parameters that are associated with the base outcome are identified. Keep in mind, though, that there are no coefficient estimates for [phi1 5] cons or [theta5] cons in the ereturn matrix e(b). 
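To see exactly which parameters have been estimated and stored after a slogit fit, you can list the coefficient vector directly. This is a generic postestimation check; matrix list and e(b) are standard Stata facilities, not something specific to slogit:
. matrix list e(b)
The column names of e(b) identify each equation and parameter in the fitted model.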
The Wald statistic is for a test of the joint significance of the regression coefficients on foreign, mpg, price, and gratio. The one-dimensional stereotype model restricts the multinomial logistic regression coefficients e k , k = 1, . . . , m − 1 to be parallel; that is, β e k = −φk β. As Lunt (2001) discusses, in the β one-dimensional stereotype model, one linear combination xi β best discriminates the outcomes of the dependent variable, and the scale parameters φk measure the distance between the outcome levels and the linear predictor. If φ1 ≥ φ2 ≥ · · · φm−1 ≥ φm ≡ 0, the model suggests that the subjective assessment of the dependent variable is indeed ordered. Here the maximum likelihood estimates of the φs are not monotonic, as would be assumed in an ordered logit model. slogit — Stereotype logistic regression 2177 We test that φ1 = φ2 by typing . test [phi1_2]_cons = [phi1_1]_cons ( 1) - [phi1_1]_cons + [phi1_2]_cons = 0 chi2( 1) = 0.55 Prob > chi2 = 0.4576 Because the two parameters are not statistically different, we decide to add a constraint to force φ1 = φ2 : . constraint define 1 [phi1_2]_cons = [phi1_1]_cons . slogit repair foreign mpg price gratio, constraint(1) nolog Stereotype logistic regression Number of obs Wald chi2(4) Log likelihood = -159.65769 Prob > chi2 ( 1) ( 2) = = = 135 21.28 0.0003 [phi1_1]_cons = 1 - [phi1_1]_cons + [phi1_2]_cons = 0 repair Coef. Std. Err. foreign mpg price gratio 7.166515 .2340043 -.000041 -5.218107 /phi1_1 /phi1_2 /phi1_3 /phi1_4 /phi1_5 1 1 .9751096 .7209343 0 (constrained) (constrained) .1286563 7.58 .1220353 5.91 (base outcome) /theta1 /theta2 /theta3 /theta4 /theta5 -8.293452 -6.958451 -5.620232 -3.745624 0 4.645182 -1.79 4.629292 -1.50 4.953981 -1.13 3.809189 -0.98 (base outcome) 1.690177 .0807042 .0001618 1.798717 z 4.24 2.90 -0.25 -2.90 P>|z| [95% Conf. Interval] 0.000 0.004 0.800 0.004 3.853829 .0758271 -.0003581 -8.743528 10.4792 .3921816 .000276 -1.692686 0.000 0.000 .7229478 .4817494 1.227271 .9601191 0.074 0.133 0.257 0.325 -17.39784 -16.0317 -15.32986 -11.2115 .8109368 2.114795 4.089392 3.720249 (repair=Excellent is the base outcome) The φ estimates are now monotonically decreasing and the standard errors of the φs are small relative to the size of the estimates, so we conclude that, with the exception of outcomes Poor and Fair, the groups are distinguishable for the one-dimensional model and that the quality assessment can be ordered. Higher-dimension models The stereotype logistic model is not limited to ordered categorical dependent variables; you can use it on nominal data to reduce the dimension of the regressions. Recall that a multinomial model fit to a categorical dependent variable with m levels will have m − 1 sets of regression coefficients. However, a model with fewer dimensions may fit the data equally well, suggesting that some of the categories are indistinguishable. 2178 slogit — Stereotype logistic regression Example 2 As discussed in [R] mlogit, we have data on the type of health insurance available to 616 psychologically depressed subjects in the United States (Tarlov et al. 1989; Wells et al. 1989). Patients may have either an indemnity (fee-for-service) plan or a prepaid plan, such as an HMO, or may be uninsured. Demographic variables include age, gender, race, and site. First, we fit the saturated, two-dimensional model that is equivalent to a multinomial logistic model. We choose the base outcome to be 1 (indemnity insurance) because that is the default for mlogit. . 
use http://www.stata-press.com/data/r13/sysdsn1 (Health insurance data) . slogit insure age male nonwhite i.site, dim(2) base(1) Iteration 0: log likelihood = -534.36165 Iteration 1: log likelihood = -534.36165 Stereotype logistic regression Number of obs Wald chi2(10) Log likelihood = -534.36165 Prob > chi2 ( 1) [phi1_2]_cons = 1 ( 2) [phi1_3]_cons = 0 ( 3) [phi2_2]_cons = 0 ( 4) [phi2_3]_cons = 1 Std. Err. z P>|z| = = = 615 38.17 0.0000 insure Coef. [95% Conf. Interval] age male nonwhite .011745 -.5616934 -.9747768 .0061946 .2027465 .2363213 1.90 -2.77 -4.12 0.058 0.006 0.000 -.0003962 -.9590693 -1.437958 .0238862 -.1643175 -.5115955 site 2 3 -.1130359 .5879879 .2101903 .2279351 -0.54 2.58 0.591 0.010 -.5250013 .1412433 .2989296 1.034733 age male nonwhite .0077961 -.4518496 -.2170589 .0114418 .3674867 .4256361 0.68 -1.23 -0.51 0.496 0.219 0.610 -.0146294 -1.17211 -1.05129 .0302217 .268411 .6171725 site 2 3 1.211563 .2078123 .4705127 .3662926 2.57 0.57 0.010 0.570 .2893747 -.510108 2.133751 .9257327 /phi1_1 /phi1_2 /phi1_3 0 1 0 (base outcome) (constrained) (omitted) /phi2_1 /phi2_2 /phi2_3 0 0 1 (base outcome) (omitted) (constrained) /theta1 /theta2 /theta3 0 .2697127 -1.286943 0.412 0.030 -.3740222 -2.447872 .9134476 -.1260134 dim1 dim2 (base outcome) .3284422 0.82 .5923219 -2.17 (insure=Indemnity is the base outcome) slogit — Stereotype logistic regression 2179 For comparison, we also fit the model by using mlogit: . mlogit insure age male nonwhite i.site, nolog Multinomial logistic regression Log likelihood = -534.36165 insure Indemnity Coef. Std. Err. z Number of obs LR chi2(10) Prob > chi2 Pseudo R2 P>|z| = = = = 615 42.99 0.0000 0.0387 [95% Conf. Interval] (base outcome) Prepaid age male nonwhite -.011745 .5616934 .9747768 .0061946 .2027465 .2363213 -1.90 2.77 4.12 0.058 0.006 0.000 -.0238862 .1643175 .5115955 .0003962 .9590693 1.437958 site 2 3 .1130359 -.5879879 .2101903 .2279351 0.54 -2.58 0.591 0.010 -.2989296 -1.034733 .5250013 -.1412433 _cons .2697127 .3284422 0.82 0.412 -.3740222 .9134476 age male nonwhite -.0077961 .4518496 .2170589 .0114418 .3674867 .4256361 -0.68 1.23 0.51 0.496 0.219 0.610 -.0302217 -.268411 -.6171725 .0146294 1.17211 1.05129 site 2 3 -1.211563 -.2078123 .4705127 .3662926 -2.57 -0.57 0.010 0.570 -2.133751 -.9257327 -.2893747 .510108 _cons -1.286943 .5923219 -2.17 0.030 -2.447872 -.1260134 Uninsure Apart from having opposite signs, the coefficients from the stereotype logistic model are identical to those from the multinomial logit model. Recall the definition of ηk given in the Remarks and examples, particularly the minus sign in front of the summation. One other difference in the output is that the constant estimates labeled /theta in the slogit output are the constants labeled cons in the mlogit output. Next we examine the one-dimensional model. 2180 slogit — Stereotype logistic regression . slogit insure age male nonwhite i.site, dim(1) base(1) nolog Stereotype logistic regression Number of obs Wald chi2(5) Log likelihood = -539.75205 Prob > chi2 ( 1) [phi1_2]_cons = 1 Std. Err. z P>|z| = = = 615 28.20 0.0000 insure Coef. [95% Conf. 
Interval] age male nonwhite .0108366 -.5032537 -.9480351 .0061918 .2078171 .2340604 1.75 -2.42 -4.05 0.080 0.015 0.000 -.0012992 -.9105678 -1.406785 .0229723 -.0959396 -.489285 site 2 3 -.2444316 .556665 .2246366 .2243799 -1.09 2.48 0.277 0.013 -.6847113 .1168886 .1958481 .9964415 /phi1_1 /phi1_2 /phi1_3 0 1 .0383539 (base outcome) (constrained) .4079705 0.09 0.925 -.7612535 .8379613 /theta1 /theta2 /theta3 0 .187542 -1.860134 (base outcome) .3303847 0.57 .2158898 -8.62 0.570 0.000 -.4600001 -2.28327 .835084 -1.436997 (insure=Indemnity is the base outcome) We have reduced a two-dimensional multinomial model to one dimension, reducing the number of estimated parameters by four and decreasing the model likelihood by ≈ 5.4. slogit does not report a model likelihood-ratio test. The test of d = 1 (a one-dimensional model) versus d = 0 (the null model) does not have an asymptotic χ2 distribution because the unconstrained φ parameters (/phi1 3 in this example) cannot be identified if β = 0. More generally, this problem precludes testing any hierarchical model of dimension d versus d − 1. Of course, the likelihood-ratio test of a full-dimension model versus d = 0 is valid because the full model is just multinomial logistic, and all the φ parameters are fixed at 0 or 1. Technical note The stereotype model is a special case of the reduced-rank vector generalized linear model discussed Pd by Yee and Hastie (2003). If we define ηik = θk − j=1 φjk xi βj , for k = 1, . . . , m − 1, we can write the expression in matrix notation as 0 ηi = θ + Φ (xi B) where Φ is a (m − 1) × d matrix containing the φjk parameters and B is a p × d matrix with columns containing the βj parameters, j = 1, . . . , d. The factorization ΦB0 is not unique because ΦB0 = ΦMM−1 B0 for any nonsingular d × d matrix M. To avoid this identifiability problem, we choose M = Φ−1 1 , where Φ1 Φ= Φ2 and Φ1 is d × d of rank d so that ΦM = Id Φ2 Φ−1 1 and Id is a d × d identity matrix. Thus the corner constraints used by slogit are φjj ≡ 1 and φjk ≡ 0 for j = 6 k and k, j ≤ d. slogit — Stereotype logistic regression 2181 Stored results slogit stores the following in e(): Scalars e(N) e(k) e(k indvars) e(k out) e(k eq) e(k eq model) e(df m) e(df 0) e(k dim) e(i base) e(ll) e(ll 0) e(N clust) e(chi2) e(p) e(ic) e(rank) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(indvars) e(wtype) e(wexp) e(title) e(clustvar) e(out#) e(chi2type) e(labels) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(marginsnotok) e(footnote) e(asbalanced) e(asobserved) Matrices e(b) e(outcomes) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) Functions e(sample) number of observations number of parameters number of independent variables number of outcomes number of equations in e(b) number of equations in overall model test Wald test degrees of freedom null model degrees of freedom model dimension base outcome index log likelihood null model log likelihood number of clusters χ2 significance number of iterations rank of e(V) return code 1 if converged, 0 otherwise slogit command as typed name of dependent variable independent variables weight type weight expression title in estimation output name of cluster variable outcome labels, # = 1,..., e(k out) Wald; type of model χ2 test outcome labels or numeric levels vcetype specified in vce() title used to label Std. Err. 
type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict predictions disallowed by margins program used to implement the footnote display factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector outcome values constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample 2182 slogit — Stereotype logistic regression Methods and formulas slogit obtains the maximum likelihood estimates for the stereotype logistic model by using ml; see [R] ml. Each set of regression estimates, one set of βj s for each dimension, constitutes one ml model equation. The d × (m − 1) φs and the (m − 1) θs are ml ancillary parameters. Without loss of generality, let the base outcome level be the mth level of the dependent variable. Define the row vector φk = (φ1k , . . . , φdk ) for k = 1, . . . , m − 1, and define the p × d matrix B = (β1 , . . . , βd ). For observation i, the log odds of outcome level k relative to level m, k = 1, . . . , m − 1 is the index ln Pr(Yi = k) Pr(Yi = m) = ηik = θk − φk (xi B) 0 = θk − φk ν0i The row vector νi can be interpreted as a latent variable reducing the p-dimensional vector of covariates to a more interpretable d < p dimension. The probability of the ith observation having outcome level k is then Pr(Yi = k) = pik = eηik , if k < m P 1 + m−1 eηij j=1 1+ 1 Pm−1 j=1 eηij , if k = m from which the log-likelihood function is computed as L= n X wi i=1 m X Ik (yi ) ln(pik ) (1) k=1 Here wi is the weight for observation i and ( Ik (yi ) = 1, if observation yi has outcome k 0, otherwise Numeric variables are normalized for numerical stability during optimization where a new doubleprecision variable x ej is created from variable xj , j = 1, . . . , p, such that x ej = (xj − x̄j )/sj . This feature is turned off if you specify nonormalize, or if you use the from() option for initial estimates. Normalization is not performed on byte variables, including the indicator variables generated by [R] xi. The linear equality constraints for regression parameters, if specified, must be scaled also. Assume that a constraint is applied to the regression parameter associated with variable j and dimension i, βji , and the corresponding element of the constraint matrix (see [P] makecns) is divided by sj . After convergence, the parameter estimates for variable j and dimension i — βeji , say—are transformed back to their original scale, βji = βeji /sj . For the intercepts, you compute θk = θek + d X i=1 φik p X βeji x̄j j=1 sj slogit — Stereotype logistic regression 2183 Initial values are computed using estimates obtained using mlogit to fit a multinomial logistic model. e contain the multinomial logistic regression parameters less the m − 1 Let the p × (m − 1) matrix B intercepts. Each φ is initialized with constant values min (1/2, 1/d), the initialize(constant) option (the default), or, with uniform random numbers, the initialize(random) option. Constraints are then applied to the starting values so that the structure of the (m − 1) × d matrix Φ is φ1 φ2 I = ed Φ= .. Φ . φm−1 where Id is a d × d identity matrix. Assume that only the corner constraints are used, but any constraints you place on the scale parameters are also applied to the initial scale estimates, so the structure of Φ will change accordingly. 
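As a purely illustrative sketch of this starting matrix, suppose there are m = 5 outcome levels and the dimension is d = 2, so that min(1/2, 1/d) = 0.5. The first d rows of the initial Φ are fixed at the identity matrix by the corner constraints, and the remaining m − 1 − d rows are filled with the constant 0.5. Using Stata's matrix commands (the matrix name Phi0 is arbitrary and has no connection to slogit's internal workings), that starting matrix could be built and inspected by typing
. matrix Phi0 = I(2) \ J(2, 2, 0.5)
. matrix list Phi0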
The φ parameters are invariant to the scale of the covariates, so initial estimates in [ 0, 1 ] are reasonable. The constraints guarantee that the rank of Φ is at least d, e Φ(Φ0 Φ)−1 . so the initial estimates for the stereotype regression parameters are obtained from B = B One other approach for initial estimates is provided: initialize(svd). It starts with the mlogit e 0 = UDV0 , where Um−1×p and Vp×p are orthonormal matrices and estimates and computes B e . The estimates for Φ and B are the Dp×p is a diagonal matrix containing the singular values of B first d columns of U and VD, respectively (Yee and Hastie 2003). The score for regression coefficients is ∂Lik ui (βj ) = = xi ∂ βj m−1 X ! φjl pil − φjk l=1 the score for the scale parameters is ∂Lik = ui (φjl ) = ∂φjl ( xi βj (pik − 1), if l = k xi βj pil , if l 6= k for l = 1, . . . , m − 1; and the score for the intercepts is ∂Lik ui (θl ) = = ∂θl ( 1 − pik , if l = k − pil , if l 6= k This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. slogit also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. References Anderson, J. A. 1984. Regression and ordered categorical variables (with discussion). Journal of the Royal Statistical Society, Series B 46: 1–30. 2184 slogit — Stereotype logistic regression Lunt, M. 2001. sg163: Stereotype ordinal regression. Stata Technical Bulletin 61: 12–18. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 298–307. College Station, TX: Stata Press. . 2005. Prediction of ordinal outcomes when the association between predictors and outcome differs between outcome levels. Statistics in Medicine 24: 1357–1369. Tarlov, A. R., J. E. Ware, Jr., S. Greenfield, E. C. Nelson, E. Perrin, and M. Zubkoff. 1989. The medical outcomes study. An application of methods for monitoring the results of medical care. Journal of the American Medical Association 262: 925–930. Wells, K. B., R. D. Hays, M. A. Burnam, W. H. Rogers, S. Greenfield, and J. E. Ware, Jr. 1989. Detection of depressive disorder for patients receiving prepaid or fee-for-service care. Results from the Medical Outcomes Survey. Journal of the American Medical Association 262: 3298–3302. Yee, T. W., and T. J. Hastie. 2003. Reduced-rank vector generalized linear models. Statistical Modelling 3: 15–41. 
Also see [R] slogit postestimation — Postestimation tools for slogit [R] logistic — Logistic regression, reporting odds ratios [R] mlogit — Multinomial (polytomous) logistic regression [R] ologit — Ordered logistic regression [R] oprobit — Ordered probit regression [R] roc — Receiver operating characteristic (ROC) analysis [SVY] svy estimation — Estimation commands for survey data [U] 20 Estimation and postestimation commands Title slogit postestimation — Postestimation tools for slogit Description Remarks and examples Syntax for predict Methods and formulas Menu for predict Also see Options for predict Description The following postestimation commands are available after slogit: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predicted probabilities, estimated index and its approximate standard error point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest1 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 lrtest is not appropriate with svy estimation results. 2185 2186 slogit postestimation — Postestimation tools for slogit Syntax for predict predict type stub* | newvar | newvarlist predict type stub* | newvarlist if if in , statistic outcome(outcome) in , scores Description statistic Main probability of one or all of the dependent variable outcomes; the default index for the k th outcome standard error of the index for the k th outcome pr xb stdp If you do not specify outcome(), pr (with one new variable specified), xb, and stdp assume outcome(#1). You specify one or k new variables with pr, where k is the number of outcomes. You specify one new variable with xb and stdp. These statistics are available both in and out of sample; type predict . . . if e(sample) . . . if wanted only for the estimation sample. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main pr, the default, calculates the probability of each of the categories of the dependent variable or the probability of the level specified in outcome(outcome). If you specify the outcome(outcome) option, you need to specify only one new variable; otherwise, you must specify a new variable for each category of the dependent variable. Pd xb calculates the index, θk − j=1 φjk xi βj , for outcome level k 6= e(i base) and dimension d = e(k dim). It returns a vector of zeros if k = e(i base). A synonym for xb is index. If outcome() is not specified, outcome(#1) is assumed. stdp calculates the standard error of the index. A synonym for stdp is seindex. If outcome() is not specified, outcome(#1) is assumed. outcome(outcome) specifies the outcome for which the statistic is to be calculated. 
equation() is a synonym for outcome(): it does not matter which you use. outcome() or equation() can be specified using #1, #2, . . . , where #1 means the first category of the dependent variable, #2 means the second category, etc.; the values of the dependent variable; or the value labels of the dependent variable if they exist. scores calculates the equation-level score variables. For models with d dimensions and m levels, d + (d + 1)(m − 1) new variables are created. Assume j = 1, . . . , d and k = 1, . . . , m in the following. The first d new variables will contain ∂ lnL/∂(xβj ). slogit postestimation — Postestimation tools for slogit 2187 The next d(m − 1) new variables will contain ∂ lnL/∂φjk . The last m − 1 new variables will contain ∂ lnL/∂θk . Remarks and examples Once you have fit a stereotype logistic model, you can obtain the predicted probabilities by using the predict command for both the estimation sample and other samples; see [U] 20 Estimation and postestimation commands and [R] predict. predict without arguments (or with the pr option) calculates the predicted probability of each outcome of the dependent variable. You must therefore give a new variable name for each of the outcomes. To compute the estimated probability of one outcome, you use the outcome(outcome) option where outcome is the level encoding the outcome. If the dependent variable’s levels are labeled, the outcomes can also be identified by the label values (see [D] label). The xb option in conjunction with outcome(outcome) specifies that the index be computed for the outcome encoded by level outcome. Its approximate standard error is computed if the stdp option is specified. Only one of the pr, xb, or stdp options can be specified with a call to predict. Example 1 In example 2 of [R] slogit, we fit the one-dimensional stereotype model, where the depvar is insure with levels k = 1 for outcome Indemnity, k = 2 for Prepaid, and k = 3 for Uninsure. The base outcome for the model is Indemnity, so for k 6= 1 the vector of indices for the k th level is ηk = θk − φk (β1 age + β2 male + β3 nonwhite + β4 2.site + β5 3.site) We estimate the group probabilities by calling predict after slogit. . use http://www.stata-press.com/data/r13/sysdsn1 (Health insurance data) . slogit insure age male nonwhite i.site, dim(1) base(1) nolog (output omitted ) . predict pIndemnity pPrepaid pUninsure, p . list pIndemnity pPrepaid pUninsure insure in 1/10 pIndem~y pPrepaid pUnins~e insure 1. 2. 3. 4. 5. .5419344 .4359638 .5111583 .3941132 .4655651 .3754875 .496328 .4105107 .5442234 .4625064 .0825782 .0677081 .0783309 .0616633 .0719285 Indemnity Prepaid Indemnity Prepaid . 6. 7. 8. 9. 10. .4401779 .4632122 .3772302 .4867758 .5823668 .4915102 .4651931 .5635696 .4383018 .3295802 .0683118 .0715948 .0592002 .0749225 .0880531 Prepaid Prepaid . Uninsure Prepaid Observations 5 and 8 are not used to fit the model because insure is missing at these points, but predict estimates the probabilities for these observations since none of the independent variables is missing. You can use if e(sample) in the call to predict to use only those observations that are used to fit the model. 2188 slogit postestimation — Postestimation tools for slogit Methods and formulas predict Let level b be the base outcome that is used to fit the stereotype logistic regression model of Pd dimension d. The index for observation i and level k 6= b is ηik = θk − j=1 φjk xi βj . 
This is the log odds of outcome encoded as level k relative to that of b so thatPwe define ηib ≡ 0. m The outcome probabilities for this model are defined as Pr(Yi = k) = eηik / j=1 eηij . Unlike in mlogit, ologit, and oprobit, the index is no longer a linear function of the parameters. The standard error of index ηik is thus computed using the delta method (see also [R] predictnl). The equation-level score for regression coefficients is m−1 X ∂ lnLik = ∂xi βj ! φjl pil − φjk l=1 the equation-level score for the scale parameters is ∂ lnLik = ∂φjl ( xi βj (pik − 1), if l = k xi βj pil , if l 6= k for l = 1, . . . , m − 1; and the equation-level score for the intercepts is ∂ lnLik = ∂θl ( 1 − pik , if l = k − pil , if l 6= k Also see [R] slogit — Stereotype logistic regression [U] 20 Estimation and postestimation commands Title smooth — Robust nonlinear smoother Syntax Remarks and examples Also see Menu Methods and formulas Description Acknowledgments Option References Syntax smooth smoother , twice varname if in , generate(newvar) where smoother is specified as Sm Sm . . . and Sm is one of 1|2|3|4|5|6|7|8|9 R 3 R S S|R S|R ... E H Letters may be specified in lowercase if preferred. Examples of smoother ,twice include 3RSSH 3rssh 3RSSH,twice 3rssh,twice 4253H 4253h 4253H,twice 4253h,twice 43RSR2H,twice 43rsr2h,twice Menu Statistics > Nonparametric analysis > Robust nonlinear smoother Description smooth applies the specified resistant, nonlinear smoother to varname and stores the smoothed series in newvar. Option generate(newvar) is required; it specifies the name of the new variable that will contain the smoothed values. Remarks and examples Smoothing is an exploratory data-analysis technique for making the general shape of a series apparent. In this approach (Tukey 1977), the observed data series is assumed to be the sum of an underlying process that evolves smoothly (the smooth) and of an unsystematic noise component (the rough); that is, data = smooth + rough 2189 2190 smooth — Robust nonlinear smoother Smoothed values zt are obtained by taking medians (or some other location estimate) of each point in the original data yt and a few of the points around it. The number of points used is called the span of the smoother. Thus a span-3 smoother produces zt by taking the median of yt−1 , yt , and yt+1 . smooth provides running median smoothers of spans 1 to 9 — indicated by the digit that specifies their span. Median smoothers are resistant to isolated outliers, so they provide robustness to spikes in the data. Because the median is also a nonlinear operator, such smoothers are known as robust (or resistant) nonlinear smoothers. smooth also provides the Hanning linear, nonrobust smoother, indicated by the letter H. Hanning is a span-3 smoother with binomial weights. Repeated applications of H—HH, HHH, etc.— provide binomial smoothers of span 5, 7, etc. See Cox (1997, 2004) for a graphical application of this fact. Because one smoother usually cannot adequately separate the smooth from the rough, compound smoothers — multiple smoothers applied in sequence — are used. The smoother 35H, for instance, then smooths the data with a span-3 median smoother, smooths the result with a span-5 median smoother, and finally smooths that result with the Hanning smoother. smooth allows you to specify any number of smoothers in any sequence. Three refinements can be combined with the running median and Hanning smoothers. First, the endpoints of a smooth can be given special treatment. 
This is specified by the E operator. Second, smoothing by 3, the span-3 running median, tends to produce flat-topped hills and valleys. The splitting operator, S, “splits” these repeated values, applies the endpoint operator to them, and then “rejoins” the series. Finally, it is sometimes useful to repeat an odd-span median smoother or the splitting operator until the smooth no longer changes. Following a digit or an S with an R specifies this type of repetition. Even the best smoother may fail to separate the smooth from the rough adequately. To guard against losing any systematic components of the data series, after smoothing, the smoother can be reapplied to the resulting rough, and any recovered signal can be added back to the original smooth. The twice operator specifies this procedure. More generally, an arbitrary smoother can be applied to the rough (using a second smooth command), and the recovered signal can be added back to the smooth. This more general procedure is called reroughing (Tukey 1977). The details of each of the smoothers and operators are explained in Methods and formulas below. Example 1 smooth is designed to recover the general features of a series that has been contaminated with noise. To demonstrate this, we construct a series, add noise to it, and then smooth the noisy version to recover an estimate of the original data. First, we construct and display the data: . . . . drop _all set obs 10 set seed 123456789 generate time = _n . . . . . label variable time "Time" generate x = _n^3 - 10*_n^2 + 5*_n label variable x "Signal" generate z = x + 50*rnormal() label variable z "Observed series" smooth — Robust nonlinear smoother 2191 −200 −100 0 100 . scatter x z time, c(l .) m(i o) ytitle("") 0 2 4 6 8 10 Time Signal Observed series Now we smooth the noisy series, z, assumed to be the only data we would observe: smooth 4253eh,twice z, gen(sz) label variable sz "Smoothed series" scatter x z sz time, c(l . l) m(i o i) ytitle("") || scatter sz time, c(l . l) m(i o i) ytitle("") clpattern(dash_dot) −200 −100 0 100 . . . > 0 2 4 6 8 10 Time Signal Smoothed series Observed series Example 2 Salgado-Ugarte and Curts-Garcı́a (1993) provide data on the frequencies of observed fish lengths. In this example, the series to be smoothed — the frequencies — is ordered by fish length rather than by time. . use http://www.stata-press.com/data/r13/fishdata, clear . smooth 4253eh,twice freq, gen(sfreq) . label var sfreq "4253EH,twice of frequencies" 2192 smooth — Robust nonlinear smoother . scatter sfreq freq length, c(l .) m(i o) > title("Smoothed frequencies of fish lengths") ytitle("") xlabel(#4) 0 5 10 15 Smoothed frequencies of fish lengths 40 50 Standard body length 4253EH,twice of frequencies 60 70 Frequency of indiv counts Technical note smooth allows missing values at the beginning and end of the series, but missing values in the middle are not allowed. Leading and trailing missing values are ignored. If you wish to ignore missing values in the middle of the series, you must drop the missing observations before using smooth. Doing so, of course, would violate smooth’s assumption that observations are equally spaced—each observation represents a year, a quarter, or a month (or a 1-year birth-rate category). In practice, smooth produces good results as long as the spaces between adjacent observations do not vary too much. Smoothing is usually applied to time series, but any variable with a natural order can be smoothed. 
For example, a smoother might be applied to the birth rate recorded by the age of the mothers (birth rate for 17-year-olds, birth rate for 18-year-olds, and so on). Methods and formulas Methods and formulas are presented under the following headings: Running median smoothers of odd span Running median smoothers of even span Repeat operator Endpoint rule Splitting operator Hanning smoother Twicing smooth — Robust nonlinear smoother 2193 Running median smoothers of odd span The smoother 3 defines zt = median(yt−1 , yt , yt+1 ) The smoother 5 defines zt = median(yt−2 , yt−1 , yt , yt+1 , yt+2 ) and so on. The smoother 1 defines zt = median(yt ), so it does nothing. Endpoints are handled by using smoothers of shorter, odd span. Thus for 3, z1 = y1 z2 = median(y1 , y2 , y3 ) .. . zN −1 = median(yN −2 , yN −1 , yN ) Z N = yN For 5, z1 = y1 z2 = median(y1 , y2 , y3 ) z3 = median(y1 , y2 , y3 , y4 , y5 ) z4 = median(y2 , y3 , y4 , y5 , y6 ) .. . zN −2 = median(yN −4 , yN −3 , yN −2 , yN −1 , yN ) zN −1 = median(yN −2 , yN −1 , yN ) ZN = yN and so on. Running median smoothers of even span Define the median() function as returning the linearly interpolated value when given an even number of arguments. Thus the smoother 2 defines zt+0.5 = (yt + yt+1 )/2 The smoother 4 defines zt+0.5 as the linearly interpolated median of (yt−1 , yt , yt+1 , yt+2 ), and so on. Endpoints are always handled using smoothers of shorter, even span. Thus for 4, 2194 smooth — Robust nonlinear smoother z0.5 = y1 z1.5 = median(y1 , y2 ) = (y1 + y2 )/2 z2.5 = median(y1 , y2 , y3 , y4 ) .. . zN −2.5 = median(yN −4 , yN −3 , yN −2 , yN ) zN −1.5 = median(yN −2 , yN −1 ) zN −0.5 = median(yN −1 , yN ) zN +0.5 = yN As defined above, an even-span smoother increases the length of the series by 1 observation. However, the series can be recentered on the original observation numbers, and the “extra” observation can be eliminated by smoothing the series again with another even-span smoother. For instance, the smooth of 4 illustrated above could be followed by a smooth of 2 to obtain z1∗ = (z0.5 + z1.5 )/2 z2∗ = (z1.5 + z2.5 )/2 z3∗ = (z2.5 + z3.5 )/2 .. . ∗ zN −2 = (zN −2.5 + zN −1.5 )/2 ∗ zN −1 = (zN −1.5 + zN −0.5 )/2 ∗ zN = (zN −0.5 + zN +0.5 )/2 smooth keeps track of the number of even smoothers applied to the data and expands and shrinks the length of the series accordingly. To ensure that the final smooth has the same number of observations as varname, smooth requires you to specify an even number of even-span smoothers. However, the pairs of even-span smoothers need not be contiguous; for instance, 4253 and 4523 are both allowed. Repeat operator R indicates that a smoother is to be repeated until convergence, that is, until repeated applications of the smoother produce the same series. Thus 3 applies the smoother of running medians of span 3. 33 applies the smoother twice. 3R produces the result of repeating 3 an infinite number of times. R should be used only with odd-span smoothers because even-span smoothers are not guaranteed to converge. The smoother 453R2 applies a span-4 smoother, followed by a span-5 smoother, followed by repeated applications of a span-3 smoother, followed by a span-2 smoother. smooth — Robust nonlinear smoother 2195 Endpoint rule The endpoint rule E modifies the values z1 and zN according to the following formulas: z1 = median(3z2 − 2z3 , z1 , z2 ) zN = median(3zN −2 − 2zN −1 , zN , zN −1 ) When the endpoint rule is not applied, endpoints are typically “copied in”; that is, z1 = y1 and zN = yN . 
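To make the basic span-3 recursion concrete, the default smoother 3 with copied-in endpoints can be written out by hand, using the fact that the median of three values equals their sum minus their maximum and minimum. The sketch below assumes a variable y that is already sorted in its natural order and has no missing values in the middle of the series; the names y and z3 are arbitrary, and for real work you would simply type smooth 3 y, generate(z3), which should produce the same series for such data.
. generate z3 = y[_n-1] + y[_n] + y[_n+1] - max(y[_n-1], y[_n], y[_n+1]) - min(y[_n-1], y[_n], y[_n+1])
. replace z3 = y if _n == 1 | _n == _N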
Splitting operator The smoothers 3 and 3R can produce flat-topped hills and valleys. The split operator attempts to eliminate such hills and valleys by splitting the sequence, applying the endpoint rule E, rejoining the series, and then resmoothing by 3R. The S operator may be applied only after 3, 3R, or S. We recommend that the S operator be repeated once (SS) or until no further changes take place (SR). Hanning smoother H is the Hanning linear smoother: zt = (yt−1 + 2yt + yt+1 )/4 Endpoints are copied in: z1 = y1 and zN = yN . H should be applied only after all nonlinear smoothers. Twicing A smoother divides the data into a smooth and a rough: data = smooth + rough If the smoothing is successful, the rough should exhibit no pattern. Twicing refers to applying the smoother to the observed, calculating the rough, and then applying the smoother to the rough. The resulting “smoothed rough” is then added back to the smooth from the first step. Acknowledgments smooth was originally written by William Gould (1992) — at which time it was named nlsm — and was inspired by Salgado-Ugarte and Curts-Garcı́a (1992). Salgado-Ugarte and Curts-Garcı́a (1993) subsequently reported anomalies in nlsm’s treatment of even-span median smoothers. smooth corrects these problems and incorporates other improvements but otherwise is essentially the same as originally published. 2196 smooth — Robust nonlinear smoother References Cox, N. J. 1997. gr22: Binomial smoothing plot. Stata Technical Bulletin 35: 7–9. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 36–38. College Station, TX: Stata Press. . 2004. gr22 1: Software update: Binomial smoothing plot. Stata Journal 4: 490. . 2005. Speaking Stata: Smoothing in various directions. Stata Journal 5: 574–593. Gould, W. W. 1992. sg11.1: Quantile regression with bootstrapped standard errors. Stata Technical Bulletin 9: 19–21. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 137–139. College Station, TX: Stata Press. Royston, P., and N. J. Cox. 2005. A multivariable scatterplot smoother. Stata Journal 5: 405–412. Salgado-Ugarte, I. H., and J. Curts-Garcı́a. 1992. sed7: Resistant smoothing using Stata. Stata Technical Bulletin 7: 8–11. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 99–103. College Station, TX: Stata Press. . 1993. sed7.2: Twice reroughing procedure for resistant nonlinear smoothing. Stata Technical Bulletin 11: 14–16. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 108–111. College Station, TX: Stata Press. Sasieni, P. D. 1998. gr27: An adaptive variable span running line smoother. Stata Technical Bulletin 41: 4–7. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 63–68. College Station, TX: Stata Press. Tukey, J. W. 1977. Exploratory Data Analysis. Reading, MA: Addison–Wesley. Velleman, P. F. 1977. Robust nonlinear data smoothers: Definitions and recommendations. Proceedings of the National Academy of Sciences 74: 434–436. . 1980. Definition and comparison of robust nonlinear data smoothing algorithms. Journal of the American Statistical Association 75: 609–615. Velleman, P. F., and D. C. Hoaglin. 1981. Applications, Basics, and Computing of Exploratory Data Analysis. Boston: Duxbury. 
Also see [R] lowess — Lowess smoothing [R] lpoly — Kernel-weighted local polynomial smoothing [TS] tssmooth — Smooth and forecast univariate time-series data Title spearman — Spearman’s and Kendall’s correlations Syntax Options for ktau Acknowledgment Menu Remarks and examples References Description Stored results Also see Options for spearman Methods and formulas Syntax Spearman’s rank correlation coefficients spearman varlist if in , spearman options Kendall’s rank correlation coefficients ktau varlist if in , ktau options spearman options Description Main stats(spearman list) print(#) star(#) bonferroni sidak pw matrix list of statistics; select up to three statistics; default is stats(rho) significance level for displaying coefficients significance level for displaying with a star use Bonferroni-adjusted significance level use Šidák-adjusted significance level calculate all pairwise correlation coefficients by using all available data display output in matrix form ktau options Description Main stats(ktau list) print(#) star(#) bonferroni sidak pw matrix list of statistics; select up to six statistics; default is stats(taua) significance level for displaying coefficients significance level for displaying with a star use Bonferroni-adjusted significance level use Šidák-adjusted significance level calculate all pairwise correlation coefficients by using all available data display output in matrix form by is allowed with spearman and ktau; see [D] by. where the elements of spearman list may be rho obs p correlation coefficient number of observations significance level 2197 2198 spearman — Spearman’s and Kendall’s correlations and the elements of ktau list may be taua taub score se obs p correlation coefficient τa correlation coefficient τb score standard error of score number of observations significance level Menu spearman Statistics > Nonparametric analysis > Tests of hypotheses > Spearman’s rank correlation > Nonparametric analysis > Tests of hypotheses > Kendall’s rank correlation ktau Statistics Description spearman displays Spearman’s rank correlation coefficients for all pairs of variables in varlist or, if varlist is not specified, for all the variables in the dataset. ktau displays Kendall’s rank correlation coefficients between the variables in varlist or, if varlist is not specified, for all the variables in the dataset. ktau is intended for use on small- and moderate-sized datasets; it requires considerable computation time for larger datasets. Options for spearman Main stats(spearman list) specifies the statistics to be displayed in the matrix of output. stats(rho) is the default. Up to three statistics may be specified; stats(rho obs p) would display the correlation coefficient, number of observations, and significance level. If varlist contains only two variables, all statistics are shown in tabular form, and stats(), print(), and star() have no effect unless the matrix option is specified. print(#) specifies the significance level of correlation coefficients to be printed. Correlation coefficients with larger significance levels are left blank in the matrix. Typing spearman, print(.10) would list only those correlation coefficients that are significant at the 10% level or lower. star(#) specifies the significance level of correlation coefficients to be marked with a star. Typing spearman, star(.05) would “star” all correlation coefficients significant at the 5% level or lower. bonferroni makes the Bonferroni adjustment to calculated significance levels. 
This adjustment affects printed significance levels and the print() and star() options. Thus spearman, print(.05) bonferroni prints coefficients with Bonferroni-adjusted significance levels of 0.05 or less. sidak makes the Šidák adjustment to calculated significance levels. This adjustment affects printed significance levels and the print() and star() options. Thus spearman, print(.05) sidak prints coefficients with Šidák-adjusted significance levels of 0.05 or less. pw specifies that correlations be calculated using pairwise deletion of observations with missing values. By default, spearman uses casewise deletion, where observations are ignored if any of the variables in varlist are missing. spearman — Spearman’s and Kendall’s correlations 2199 matrix forces spearman to display the statistics as a matrix, even if varlist contains only two variables. matrix is implied if more than two variables are specified. Options for ktau Main stats(ktau list) specifies the statistics to be displayed in the matrix of output. stats(taua) is the default. Up to six statistics may be specified; stats(taua taub score se obs p) would display the correlation coefficients τa , τb , score, standard error of score, number of observations, and significance level. If varlist contains only two variables, all statistics are shown in tabular form and stats(), print(), and star() have no effect unless the matrix option is specified. print(#) specifies the significance level of correlation coefficients to be printed. Correlation coefficients with larger significance levels are left blank in the matrix. Typing ktau, print(.10) would list only those correlation coefficients that are significant at the 10% level or lower. star(#) specifies the significance level of correlation coefficients to be marked with a star. Typing ktau, star(.05) would “star” all correlation coefficients significant at the 5% level or lower. bonferroni makes the Bonferroni adjustment to calculated significance levels. This adjustment affects printed significance levels and the print() and star() options. Thus ktau, print(.05) bonferroni prints coefficients with Bonferroni-adjusted significance levels of 0.05 or less. sidak makes the Šidák adjustment to calculated significance levels. This adjustment affects printed significance levels and the print() and star() options. Thus ktau, print(.05) sidak prints coefficients with Šidák-adjusted significance levels of 0.05 or less. pw specifies that correlations be calculated using pairwise deletion of observations with missing values. By default, ktau uses casewise deletion, where observations are ignored if any of the variables in varlist are missing. matrix forces ktau to display the statistics as a matrix, even if varlist contains only two variables. matrix is implied if more than two variables are specified. Remarks and examples Example 1 We wish to calculate the correlation coefficients among marriage rate (mrgrate), divorce rate (divorce rate), and median age (medage) in state data. We can calculate the standard Pearson correlation coefficients and significance by typing .use http://www.stata-press.com/data/r13/states2 (State data) . pwcorr mrgrate divorce_rate medage, sig mrgrate divorc~e medage mrgrate 1.0000 divorce_rate 0.7895 0.0000 1.0000 medage 0.0011 0.9941 -0.1526 0.2900 1.0000 2200 spearman — Spearman’s and Kendall’s correlations We can calculate Spearman’s rank correlation coefficients by typing . spearman mrgrate divorce_rate medage, stats(rho p) (obs=50) Key rho Sig. 
level mrgrate divorc~e mrgrate 1.0000 divorce_rate 0.6933 0.0000 1.0000 medage -0.4869 0.0003 -0.2455 0.0857 medage 1.0000 The large difference in the results is caused by one observation. Nevada’s marriage rate is almost 10 times higher than the state with the next-highest marriage rate. An important feature of the Spearman rank correlation coefficient is its reduced sensitivity to extreme values compared with the Pearson correlation coefficient. We can calculate Kendall’s rank correlations by typing . ktau mrgrate divorce_rate medage, stats(taua taub p) (obs=50) Key tau_a tau_b Sig. level mrgrate divorc~e mrgrate 0.9829 1.0000 divorce_rate 0.5110 0.5206 0.0000 0.9804 1.0000 medage -0.3486 -0.3544 0.0004 -0.1698 -0.1728 0.0828 medage 0.9845 1.0000 There are tied values for variables mrgrate, divorce rate, and medage, so tied ranks are used. As a result, τa < 1 on the diagonal (see Methods and formulas for the definition of τa ). spearman — Spearman’s and Kendall’s correlations 2201 Technical note According to Conover (1999, 323), “Spearman’s ρ tends to be larger than Kendall’s τ in absolute value. However, as a test of significance, there is no strong reason to prefer one over the other because both will produce nearly identical results in most cases.” Example 2 We illustrate spearman and ktau with the auto data, which contains some missing values. .use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . spearman mpg rep78 Number of obs = 69 Spearman’s rho = 0.3098 Test of Ho: mpg and rep78 are independent Prob > |t| = 0.0096 Because we specified two variables, spearman displayed the sample size, correlation, and p-value in tabular form. To obtain just the correlation coefficient displayed in matrix form, we type . spearman mpg rep78, stats(rho) matrix (obs=69) mpg rep78 mpg rep78 1.0000 0.3098 1.0000 The pw option instructs spearman and ktau to use all nonmissing observations between a pair of variables when calculating their correlation coefficient. In the output below, some correlations are based on 74 observations, whereas others are based on 69 because 5 observations contain a missing value for rep78. . spearman mpg price rep78, pw stats(rho obs p) star(0.01) Key rho Number of obs Sig. level mpg mpg price rep78 1.0000 74 price -0.5419* 74 0.0000 1.0000 74 rep78 0.3098* 69 0.0096 0.1028 69 0.4008 1.0000 69 2202 spearman — Spearman’s and Kendall’s correlations Finally, the bonferroni and sidak options provide adjusted significance levels: . ktau mpg price rep78, stats(taua taub score se p) bonferroni (obs=69) Key tau_a tau_b score se of score Sig. level mpg price mpg 0.9471 1.0000 2222.0000 191.8600 price -0.3973 -0.4082 -932.0000 192.4561 0.0000 1.0000 1.0000 2346.0000 193.0682 rep78 0.2076 0.2525 487.0000 181.7024 0.0224 0.0648 0.0767 152.0000 182.2233 1.0000 rep78 0.7136 1.0000 1674.0000 172.2161 Charles Edward Spearman (1863–1945) was a British psychologist who made contributions to correlation, factor analysis, test reliability, and psychometrics. After several years’ military service, he obtained a PhD in experimental psychology at Leipzig and became a professor at University College London, where he sustained a long program of work on the interpretation of intelligence tests. Ironically, the rank correlation version bearing his name is not the formula he advocated. Maurice George Kendall (1907–1983) was a British statistician who contributed to rank correlation, time series, multivariate analysis, among other topics, and wrote many statistical texts. 
Most notably, perhaps, his advanced survey of the theory of statistics went through several editions, later ones with Alan Stuart; the baton has since passed to others. Kendall was employed in turn as a government and business statistician, as a professor at the London School of Economics, as a consultant, and as director of the World Fertility Survey. He was knighted in 1974. spearman — Spearman’s and Kendall’s correlations 2203 Stored results spearman stores the following in r(): Scalars r(N) r(rho) r(p) Matrices r(Nobs) r(Rho) r(P) number of observations (last variable pair) ρ (last variable pair) two-sided p-value (last variable pair) number of observations ρ two-sided p-value ktau stores the following in r(): Scalars r(N) r(tau a) r(tau b) r(score) r(se score) r(p) Matrices r(Nobs) r(Tau a) r(Tau b) r(Score) r(Se Score) r(P) number of observations (last variable pair) τa (last variable pair) τb (last variable pair) Kendall’s score (last variable pair) se of score (last variable pair) two-sided p-value (last variable pair) number of observations τa τb Kendall’s score standard error of score two-sided p-value Methods and formulas Spearman’s (1904) rank correlation is calculated as Pearson’s correlation computed on the ranks and average ranks (Conover 1999, 314 – 315). Ranks are as calculated by egen; see [D] egen. The significance is calculated using the approximation p √ p = 2 × ttail(n − 2, |b ρ| n − 2 / 1 − ρb2 ) For any two pairs of ranks (xi , yi ) and (xj , yj ) of one variable pair (varname1 , varname2 ), 1 ≤ i, j ≤ n, where n is the number of observations, define them as concordant if (xi − xj )(yi − yj ) > 0 and discordant if this product is less than zero. Kendall’s (1938; also see Kendall and Gibbons [1990] or Bland [2000], 222–225) score S is defined as C − D, where C (D) is the number of concordant (discordant) pairs. Let N = n(n − 1)/2 be the total number of pairs, so τa is given by τa = S/N and τb is given by τb = √ S √ N −U N −V 2204 spearman — Spearman’s and Kendall’s correlations where U= N1 X ui (ui − 1)/2 i=1 V = N2 X vj (vj − 1)/2 j=1 and where N1 is the number of sets of tied x values, ui is the number of tied x values in the ith set, N2 is the number of sets of tied y values, and vj is the number of tied y values in the j th set. Under the null hypothesis of independence between varname1 and varname2 , the variance of S is exactly (Kendall and Gibbons 1990, 66) N2 N1 X X 1 vj (vj − 1)(2vj + 5) ui (ui − 1)(2ui + 5) − n(n − 1)(2n + 5) − Var(S) = 18 j=1 i=1 + X X N1 N2 1 ui (ui − 1)(ui − 2) vj (vj − 1)(vj − 2) 9n(n − 1)(n − 2) i=1 j=1 X X N2 N1 1 vj (vj − 1) ui (ui − 1) + 2n(n − 1) i=1 j=1 Using a normal approximation with a continuity correction, |S| − 1 z=p Var(S) For the hypothesis of independence, the statistics S , τa , and τb produce equivalent tests and give the same significance. For Kendall’s τ , the normal approximation is surprisingly accurate for sample sizes as small as 8, at least for calculating p-values under the null hypothesis for continuous variables. (See Kendall and Gibbons [1990, chap. 4], who also present some tables for calculating exact p-values for n < 10.) For Spearman’s ρ, the normal approximation requires larger samples to be valid. Let v be the number of variables specified so that k = v(v − 1)/2 correlation coefficients are 0 to be estimated. If bonferroni is specified, the adjusted significance level is p = min(1, kp). If sidak is specified, p0 = min 1, 1 − (1 − p)n . 
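As a small numerical illustration of these adjustments (the unadjusted p-value of 0.02 and the three comparisons are hypothetical), you could compute
. display min(1, 3*.02)
. display min(1, 1 - (1 - .02)^3)
which give 0.06 for the Bonferroni adjustment and approximately 0.0588 for the Šidák adjustment.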
See Methods and formulas in [R] oneway for a more complete description of the logic behind these adjustments. Early work on rank correlation is surveyed by Kruskal (1958). Acknowledgment The original version of ktau was written by Sean Becketti, a past editor of the Stata Technical Bulletin and author of the Stata Press book Introduction to Time Series Using Stata. spearman — Spearman’s and Kendall’s correlations 2205 References Barnard, G. A. 1997. Kendall, Maurice George. In Leading Personalities in Statistical Sciences: From the Seventeenth Century to the Present, ed. N. L. Johnson and S. Kotz, 130–132. New York: Wiley. Bland, M. 2000. An Introduction to Medical Statistics. 3rd ed. Oxford: Oxford University Press. Conover, W. J. 1999. Practical Nonparametric Statistics. 3rd ed. New York: Wiley. David, H. A., and W. A. Fuller. 2007. Sir Maurice Kendall (1907–1983): A centenary appreciation. American Statistician 61: 41–46. Jeffreys, H. 1961. Theory of Probability. 3rd ed. Oxford: Oxford University Press. Kendall, M. G. 1938. A new measure of rank correlation. Biometrika 30: 81–93. Kendall, M. G., and J. D. Gibbons. 1990. Rank Correlation Methods. 5th ed. New York: Oxford University Press. Kruskal, W. H. 1958. Ordinal measures of association. Journal of the American Statistical Association 53: 814–861. Lovie, P., and A. D. Lovie. 1996. Charles Edward Spearman, F.R.S. (1863–1945). Notes and Records of the Royal Society of London 50: 75–88. Newson, R. B. 2000a. snp15: somersd—Confidence intervals for nonparametric statistics and their differences. Stata Technical Bulletin 55: 47–55. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 312–322. College Station, TX: Stata Press. . 2000b. snp15.1: Update to somersd. Stata Technical Bulletin 57: 35. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 322–323. College Station, TX: Stata Press. . 2000c. snp15.2: Update to somersd. Stata Technical Bulletin 58: 30. Reprinted in Stata Technical Bulletin Reprints, vol. 10, p. 323. College Station, TX: Stata Press. . 2001. snp15.3: Update to somersd. Stata Technical Bulletin 61: 22. Reprinted in Stata Technical Bulletin Reprints, vol. 10, p. 324. College Station, TX: Stata Press. . 2003. snp15 4: Software update for somersd. Stata Journal 3: 325. . 2005. snp15 5: Software update for somersd. Stata Journal 5: 470. . 2006. Confidence intervals for rank statistics: Percentile slopes, differences, and ratios. Stata Journal 6: 497–520. Seed, P. T. 2001. sg159: Confidence intervals for correlations. Stata Technical Bulletin 59: 27–28. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 267–269. College Station, TX: Stata Press. Spearman, C. E. 1904. The proof and measurement of association between two things. American Journal of Psychology 15: 72–101. Wolfe, F. 1997. sg64: pwcorrs: Enhanced correlation display. Stata Technical Bulletin 35: 22–25. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 163–167. College Station, TX: Stata Press. . 1999. sg64.1: Update to pwcorrs. Stata Technical Bulletin 49: 17. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 159. College Station, TX: Stata Press. 
Also see
[R] correlate — Correlations (covariances) of variables or coefficients
[R] nptrend — Test for trend across ordered groups

Title
spikeplot — Spike plots and rootograms

Syntax     Menu     Description     Options     Remarks and examples     Acknowledgments     References     Also see

Syntax

    spikeplot varname [if] [in] [weight] [, options]

    options             Description
    Main
      round(#)          round varname to nearest multiple of # (bin width)
      fraction          make vertical scale the proportion of total values; default is frequencies
      root              make vertical scale show square roots of frequencies
    Plot
      spike options     affect rendition of plotted spikes
    Add plots
      addplot(plot)     add other plots to generated graph
    Y axis, X axis, Titles, Legend, Overall, By
      twoway options    any options documented in [G-3] twoway options

    fweights, aweights, and iweights are allowed; see [U] 11.1.6 weight.

Menu

Graphics > Distributional graphs > Spike plot and rootogram

Description

spikeplot produces a frequency plot for a variable in which the frequencies are depicted as vertical lines from zero. The frequency may be a count, a fraction, or the square root of the count (Tukey’s rootogram, circa 1965). The vertical lines may also originate from a baseline other than zero at the user’s option.

Options

Main

round(#) rounds the values of varname to the nearest multiple of #. This action effectively specifies the bin width.

fraction specifies that the vertical scale be the proportion of total values (percentage) rather than the count.

root specifies that the vertical scale show square roots. This option may not be specified if fraction is specified.

Plot

spike options affect the rendition of the plotted spikes; see [G-2] graph twoway spike.

Add plots

addplot(plot) provides a way to add other plots to the generated graph. See [G-3] addplot option.

Y axis, X axis, Titles, Legend, Overall, By

twoway options are any of the options documented in [G-3] twoway options. These include options for titling the graph (see [G-3] title options), options for saving the graph to disk (see [G-3] saving option), and the by() option (see [G-3] by option).

Remarks and examples

Example 1

Cox and Brady (1997a) present an illustrative example using the age structure of the population of Ghana from the 1960 census (rounded to the nearest 1,000). The dataset has ages from 0 (less than 1 year) to 90. To view the distribution of ages, we would like to use each integer from 0 to 90 as the bins for the dataset.

. use http://www.stata-press.com/data/r13/ghanaage
. spikeplot age [fw=pop], ytitle("Population in 1000s") xlab(0(10)90)
> xmtick(5(10)85)
(figure omitted; y axis: Population in 1000s, x axis: Age in years, 0 to 90)

The resulting graph shows a “heaping” of ages at the multiples of 5. Also, ages ending in even numbers are more frequent than ages ending in odd numbers (except for 5). This preference for reporting ages is well known in demography and other social sciences.

Note also that we used the ytitle() option to override the default title of “Frequency” and that we used the xlab() and xmtick() options with numlists to further customize the resulting graph. See [U] 11.1.8 numlist for details on specifying numlists.

Example 2

The rootogram is a plot of the square-root transformation of the frequency counts. The square root of a normal density is a multiple of another normal density.

. clear
. set seed 1234567
. set obs 5000
obs was 0, now 5000
. generate normal = rnormal()
. label variable normal "Gaussian(0,1) random numbers"
. spikeplot normal, round(.10) xlab(-4(1)4)
(figure omitted; y axis: Frequency, x axis: Gaussian(0,1) random numbers, -4 to 4)
. spikeplot normal, round(.10) xlab(-4(1)4) root
(figure omitted; y axis: Root of frequency, x axis: Gaussian(0,1) random numbers, -4 to 4)

Interpreting a histogram in terms of normality is thus similar to interpreting the rootogram for normality. This example also shows how the round() option is used to bin the values for a spike plot of a continuous variable.

Example 3

spikeplot can also be used to produce time-series plots. varname should be the time variable, and weights should be specified as the values for those times. To get a plot of daily rainfalls, we type

. spikeplot day [w=rain] if rain, ytitle("Daily rainfall in mm")

The base() option of graph twoway spike may be used to set a different baseline, such as when we want to show variations relative to an average or to some other measure of level.

Acknowledgments

The original version of spikeplot was written by Nicholas J. Cox of the Department of Geography at Durham University, UK, and coeditor of the Stata Journal, and by Anthony R. Brady of the Imperial College School of Medicine (1997a, 1997b).

References

Cox, N. J., and A. R. Brady. 1997a. gr25: Spike plots for histograms, rootograms, and time-series plots. Stata Technical Bulletin 36: 8–11. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 50–54. College Station, TX: Stata Press.
———. 1997b. gr25.1: Spike plots for histograms, rootograms, and time-series plots: Update. Stata Technical Bulletin 40: 12. Reprinted in Stata Technical Bulletin Reprints, vol. 7, p. 58. College Station, TX: Stata Press.
Tukey, J. W. 1965. The future of processes of data analysis. In The Collected Works of John W. Tukey, Volume IV: Philosophy and Principles of Data Analysis: 1965–1986, ed. L. V. Jones, 123–126. Monterey, CA: Wadsworth & Brooks/Cole.

Also see

[R] histogram — Histograms for continuous and categorical variables

Title
ssc — Install and uninstall packages from SSC

Syntax     Description     Options     Remarks and examples     Acknowledgments     References     Also see

Syntax

Summary of packages most recently added or updated at SSC
    ssc new [, saving(filename[, replace]) type]

Summary of most popular packages at SSC
    ssc hot [, n(#) author(name)]

Describe a specified package at SSC
    ssc describe { pkgname | letter } [, saving(filename[, replace])]

Install a specified package from SSC
    ssc install pkgname [, all replace]

Uninstall from your computer a previously installed package from SSC
    ssc uninstall pkgname

Type a specific file stored at SSC
    ssc type filename [, asis]

Copy a specific file from SSC to your computer
    ssc copy filename [, plus personal replace public binary]

where letter in ssc describe is a–z or _.

Description

ssc works with packages (and files) from the Statistical Software Components (SSC) archive, which is often called the Boston College Archive and is provided by http://repec.org. The SSC has become the premier Stata download site for user-written software on the web. ssc provides a convenient interface to the resources available there. For example, on Statalist (see http://www.statalist.org/), users will often write

    The program can be found by typing ssc install newprogramname.

Typing that would load everything associated with newprogramname, including the help files.
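For instance, a session built around such a recommendation might look like the following minimal sketch; newprogramname here is the same placeholder used above, not an actual package at the SSC.

. ssc describe newprogramname    // read the package description first
. ssc install newprogramname    // download and install the program and help files
. help newprogramname           // read the documentation that was just installed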
If you are searching for what is available, type ssc new and ssc hot, and see [R] search. search searches the SSC and other places, too. search provides a GUI interface from which programs can be installed, including the programs at the SSC archive.

You can uninstall particular packages by using ssc uninstall. For the packages that you keep, see [R] adoupdate for an automated way of keeping those packages up to date.

Command overview

ssc new summarizes the packages made available or updated recently. Output is presented in the Stata Viewer, and from there you may click to find out more about individual packages or to install them.

ssc hot lists the most popular packages—popular based on a moving average of the number of downloads in the past three months. By default, 10 packages are listed.

ssc describe pkgname describes, but does not install, the specified package. Use search to find packages; see [R] search. If you know the package name but do not know the exact spelling, type ssc describe followed by one letter, a–z or _ (underscore), to list all the packages starting with that letter.

ssc install pkgname installs the specified package. You do not have to describe a package before installing it. (You may also install a package by using net install; see [R] net.)

ssc uninstall pkgname removes the previously installed package from your computer. It does not matter how the package was installed. (ssc uninstall is a synonym for ado uninstall, so either may be used to uninstall any package.)

ssc type filename types a specific file stored at SSC. ssc cat is a synonym for ssc type, which may appeal to those familiar with Unix.

ssc copy filename copies a specific file stored at SSC to your computer. By default, the file is copied to the current directory, but you can use options to change this. ssc copy is a rarely used alternative to ssc install . . . , all. ssc cp is a synonym for ssc copy.

Options

Options are presented under the following headings:
    Options for use with ssc new
    Options for use with ssc hot
    Option for use with ssc describe
    Options for use with ssc install
    Option for use with ssc type
    Options for use with ssc copy

Options for use with ssc new

saving(filename[, replace]) specifies that the “what’s new” summary be saved in filename. If filename is specified without a suffix, filename.smcl is assumed. If saving() is not specified, saving(ssc_results.smcl) is assumed.

type specifies that the “what’s new” results be displayed in the Results window rather than in the Viewer.

Options for use with ssc hot

n(#) specifies the number of packages to list; n(10) is the default. Specify n(.) to list all packages in order of popularity.

author(name) lists the 10 most popular packages by the specified author. If n(#) is also specified, the top # packages are listed.

Option for use with ssc describe

saving(filename[, replace]) specifies that, in addition to the descriptions being displayed on your screen, it be saved in the specified file.

If filename is specified without an extension, .smcl will be assumed, and the file will be saved as a SMCL file. If filename is specified with an extension, no default extension is added. If the extension is .log, the file will be stored as a text file.

If replace is specified, filename is replaced if it already exists.
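As a sketch of how the preceding options might be combined in practice (the filename is arbitrary, the author name is only an example, and replace lets the first command be rerun without error):

. ssc new, saving(ssc_whatsnew.smcl) replace   // keep a copy of the whats-new listing
. ssc hot, n(25)                               // list the 25 most popular packages
. ssc hot, author(cox) n(5)                    // top 5 packages by the named author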
Options for use with ssc install all specifies that any ancillary files associated with the package be downloaded to your current directory, in addition to the program and help files being installed. Ancillary files are files that do not end in .ado or .sthlp and typically contain datasets or examples of the use of the new command. You can find out which files are associated with the package by typing ssc describe pkgname before or after installing. If you install without using the all option and then want the ancillary files, you can ssc install again. replace specifies that any files being downloaded that already exist on your computer be replaced by the downloaded files. If replace is not specified and any files already exist, none of the files from the package is downloaded or installed. It is better not to specify the replace option and wait to see if there is a problem. If there is a problem, it is usually better to uninstall the old package by using ssc uninstall or ado uninstall (which are, in fact, the same command). Option for use with ssc type asis affects how files with the suffixes .smcl and .sthlp are displayed. The default is to interpret SMCL directives the file might contain. asis specifies that the file be displayed in raw, uninterpreted form. Options for use with ssc copy plus specifies that the file be copied to the PLUS directory, the directory where user-written additions are installed. Typing sysdir will display the identity of the PLUS directory on your computer; see [P] sysdir. personal specifies that the file be copied to your PERSONAL directory as reported by sysdir; see [P] sysdir. If neither plus nor personal is specified, the default is to copy the file to the current directory. ssc — Install and uninstall packages from SSC 2213 replace specifies that, if the file already exists on your computer, the new file replace it. public specifies that the new file be made readable by everyone; otherwise, the file will be created according to the default permission you have set with your operating system. binary specifies that the file being copied is a binary file and that it is to be copied as is. The default is to assume that the file is a text file and change the end-of-line characters to those appropriate for your computer/operating system. Remarks and examples Users can add new features to Stata, and some users choose to make new features that they have written available to others via the web. The files that comprise a new feature are called a package, and a package usually consists of one or more ado-files and help files. The net command (see [R] net) makes it reasonably easy to install and uninstall packages regardless of where they are on the web. One site, the SSC, has become particularly popular as a repository for additions to Stata. Command ssc is an easier to use version of net designed especially for the SSC. Many packages are available at the SSC. Packages have names, such as oaxaca, estout, or egenmore. At SSC, capitalization is not significant, so Oaxaca, ESTOUT, and EGENmore are ways of writing the same package names. When you type . ssc install oaxaca the files associated with the package are downloaded and installed on your computer. Package names usually correspond to the names of the command being added to Stata, so one would expect that installing the package oaxaca will add command oaxaca to Stata on your computer, and expect that typing help oaxaca will provide the documentation. That is the situation here, but that is not always so. 
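For instance, to see exactly which files a package added to your Stata installation, you can describe the installed package afterward; egenmore, one of the packages listed in the examples below, is used here purely for illustration.

. ssc install egenmore      // install the package from the SSC
. ado describe egenmore     // list the files that the package installed locally
. ssc uninstall egenmore    // remove it again if it is not needed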
Before or after installing a package, type ssc describe pkgname to obtain the details. Example 1 ssc new summarizes the packages most recently made available or updated. Output is presented in the Viewer, from which you may click on a package name to find out more or install it. For example, . ssc new (contacting http://repec.org) (output omitted ) GEOCODE3 module to retrieve coordinates or addresses from Google Geocoding API Version3 Authors: Stefan Bernhard Req: Stata version 12, insheetjson and libjson > from SSC (q.v.) Created: 2013-05-19 GGTAX module to identify the most suitable GG family model Authors: Andres L Gonzalez Rangel Req: Stata version 11 Created: 2013-05-19 ASL_NORM module computing bootstrap Gaussianity tests Authors: Maarten L. Buis Req: Stata version 11 Created: 2013-05-16 (output omitted ) End of recent additions and updates 2214 ssc — Install and uninstall packages from SSC ssc hot provides a list of the most popular packages at SSC. . ssc hot Top 10 packages at SSC Apr 2013 # hits Package Author(s) 1 2 3 4 12621.4 12606.8 8508.8 8061.9 estout outreg2 ranktest ivreg2 5 6 7 8 9 10 3595.9 2862.6 2358.2 2300.3 1743.0 1482.2 psmatch2 tabout outreg winsor xtabond2 xtivreg2 Ben Jann Roy Wada Mark E Schaffer, Frank Kleibergen Mark E Schaffer, Steven Stillman, Christopher F Baum Edwin Leuven, Barbara Sianesi Ian Watson John Luke Gallup Nicholas J. Cox David Roodman Mark E Schaffer Rank (Click on package name for description) Use the n(#) option to change the number of packages listed: . ssc hot, n(20) Top 20 packages at SSC Apr 2013 # hits Package Author(s) 1 2 3 4 12621.4 12606.8 8508.8 8061.9 estout outreg2 ranktest ivreg2 5 6 7 8 9 10 11 12 3595.9 2862.6 2358.2 2300.3 1743.0 1482.2 1481.7 1361.7 psmatch2 tabout outreg winsor xtabond2 xtivreg2 fre cem 13 14 15 16 17 18 19 20 1279.3 1205.7 1188.7 1154.4 1113.3 1111.1 950.6 948.3 xttest3 mdesc bcuse usespss distinct egenmore gllamm hprescott Ben Jann Roy Wada Mark E Schaffer, Frank Kleibergen Mark E Schaffer, Steven Stillman, Christopher F Baum Edwin Leuven, Barbara Sianesi Ian Watson John Luke Gallup Nicholas J. Cox David Roodman Mark E Schaffer Ben Jann Stefano Iacus, Gary King, Matthew Blackwell, Giuseppe Porro Christopher F Baum Rose Anne Medeiros, Dan Blanchette Christopher F Baum Sergiy Radyakin Gary Longton, Nicholas J. Cox Nicholas J. Cox Sophia Rabe-Hesketh Christopher F Baum Rank (Click on package name for description) ssc — Install and uninstall packages from SSC 2215 The author(name) option allows you to list the most popular packages by a specific person: . ssc hot, author(baum) Top 10 packages at SSC by author Baum Apr 2013 # hits Package Author(s) 4 8061.9 ivreg2 13 15 20 27 31 34 42 1279.3 1188.7 948.3 806.8 742.7 696.3 574.0 xttest3 bcuse hprescott tscollap whitetst xttest2 overid 46 58 495.3 437.0 Mark E Schaffer, Steven Stillman, Christopher F Baum Christopher F Baum Christopher F Baum Christopher F Baum Christopher F Baum Christopher F Baum, Nicholas J. Cox Christopher F Baum Christopher F Baum, Mark E Schaffer, Vince Wiggins, Steven Stillman Christopher F Baum Christopher F Baum, Mark E Schaffer, Steven Stillman Rank kpss ivendog (Click on package name for description) ssc describe pkgname describes, but does not install, the specified package. You must already know the name of the package. See [R] search for assistance in searching for packages. Sometimes you know the package name, but you do not know the exact spelling. 
Then you can type ssc describe followed by one letter, a–z or , to list all the packages starting with that letter; even so, using search is better. . ssc describe bidensity package bidensity from http://fmwww.bc.edu/repec/bocode/b TITLE ’BIDENSITY’: module to produce and graph bivariate density estimates DESCRIPTION/AUTHOR(S) bidensity produces bivariate kernel density estimates and graphs the result using a twoway contourline plot, optionally overlaying a scatterplot. The default kernel is Epanechnikov; all of the kernels provided by -kdensity- are also available. Compared to Baum’s -kdens2- (SSC), which was recently enhanced to produce contourline plots, -bidensity- computes the bivariate kernel densities much more efficiently through use of Mata, and provides a choice of kernel estimators. The estimated densities can be saved in a Stata dataset or accessed as Mata matrices. KW: density estimation KW: bivariate density KW: contourline plots Requires: Stata version 12.1 and moremata from SSC (q.v.) Distribution-Date: 20130119 Author: John Luke Gallup, Portland State University Support: email jlgallup@pdx.edu Author: Christopher F Baum, Boston College Support: email baum@bc.edu INSTALLATION FILES bidensity.ado bidensity.sthlp (type -ssc install bidensity- to install) (type net install bidensity) 2216 ssc — Install and uninstall packages from SSC The default setting for the saving() option is for the output to be saved with the .smcl extension. You could also save the file with a log extension, and in this case, the file would be stored as a text file. . ssc describe b, saving(b.index) (output omitted ) . ssc describe bidensity, saving(bidensity.log) (output omitted ) ssc install pkgname installs the specified package. You do not have to describe a package before installing it. There are ways of installing packages other than ssc install, such as net; see [R] net. It does not matter how a package is installed. For instance, a package can be installed using net and still be uninstalled using ssc. . ssc install bidensity checking bidensity consistency and verifying not already installed... installing into C:\ado\plus\... installation complete. ssc uninstall pkgname removes the specified, previously installed package from your computer. You can uninstall immediately after installation or at any time in the future. (Technical note: ssc uninstall is a synonym for ado uninstall, so it can uninstall any installed package, not just packages obtained from the SSC.) . ssc uninstall bidensity package bidensity from http://fmwww.bc.edu/repec/bocode/b 'BIDENSITY': module to produce and graph bivariate density estimates (package uninstalled) ssc type filename types a specific file stored at the SSC. Although not shown in the syntax diagram, ssc cat is a synonym for ssc type, which may appeal to those familiar with Unix. To view only the bidensity help file for the bidensity package, you would type . ssc type bidensity.sthlp help for bidensity Bivariate kernel density estimation bidensity varnameY varnameX [if exp] [in range] [, n(#) kernel(kernelname) xwidth(#) ywidth(#) saving( name) replace nograph scatter[(scatter_options)] contourline_options mname(name) (output omitted ) ssc copy filename copies a specific file stored at the SSC to your computer. By default, the file is copied to the current directory, but you can use options to change this. ssc copy is a rarely used alternative to ssc install . . . , all. ssc cp is a synonym for ssc copy. . 
ssc copy bidensity.ado (file bidensity.ado copied to current directory) For more details on the SSC archive and for information on how to submit your own programs to the SSC, see http://repec.org/bocode/s/sscsubmit.html. ssc — Install and uninstall packages from SSC 2217 Acknowledgments ssc is based on archutil by Nicholas J. Cox of the Department of Geography at Durham University, UK, and coeditor of the Stata Journal and Christopher F. Baum of the Department of Economics at Boston College and author of the Stata Press books An Introduction to Modern Econometrics Using Stata and An Introduction to Stata Programming. The reworking of the original was done with their blessing and their participation. Christopher Baum maintains the Stata-related files stored at the SSC archive. We thank him for this contribution to the Stata community. References Baum, C. F., and N. J. Cox. 1999. ip29: Metadata for user-written contributions to the Stata programming language. Stata Technical Bulletin 52: 10–12. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 121–124. College Station, TX: Stata Press. Cox, N. J., and C. F. Baum. 2000. ip29.1: Metadata for user-written contributions to the Stata programming language. Stata Technical Bulletin 54: 21–22. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 124–126. College Station, TX: Stata Press. Also see [R] adoupdate — Update user-written ado-files [R] net — Install and manage user-written additions from the Internet [R] search — Search Stata documentation and other resources [R] sj — Stata Journal and STB installation instructions [P] sysdir — Query and set system directories Title stem — Stem-and-leaf displays Syntax Remarks and examples Menu Stored results Description References Options Also see Syntax stem varname if in , options Description options Main prune round(#) truncate(#) digits(#) lines(#) width(#) do not print stems that have no leaves round data to this value; default is round(1) truncate data to this value digits per leaf; default is digits(1) number of stems per interval of 10digits stem width; equal to 10digits /width by is allowed; see [D] by. Menu Statistics > Summaries, tables, and tests > Distributional plots and tests > Stem-and-leaf display Description stem displays stem-and-leaf plots. Options Main prune prevents printing any stems that have no leaves. round(#) rounds the data to this value and displays the plot in these units. If round() is not specified, noninteger data will be rounded automatically. truncate(#) truncates the data to this value and displays the plot in these units. digits(#) sets the number of digits per leaf. The default is 1. lines(#) sets the number of stems per every data interval of 10digits . The value of lines() must divide 10digits ; that is, if digits(1) is specified, then lines() must divide 10. If digits(2) is specified, then lines() must divide 100, etc. Only one of lines() or width() may be specified. If neither is specified, an appropriate value will be set automatically. 2218 stem — Stem-and-leaf displays 2219 width(#) sets the width of a stem. lines() is equal to 10digits /width, and this option is merely an alternative way of setting lines(). The value of width() must divide 10digits . Only one of width() or lines() may be specified. If neither is specified, an appropriate value will be set automatically. Note: If lines() or width() is not specified, digits() may be decreased in some circumstances to make a better-looking plot. 
If lines() or width() is set, the user-specified value of digits() will not be altered. Remarks and examples Example 1 Stem-and-leaf displays are a compact way to present considerable information about a batch of data. For instance, using our automobile data (described in [U] 1.2.2 Example datasets): . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . stem mpg Stem-and-leaf plot for mpg (Mileage (mpg)) 1t 22 44444455 1f 1s 66667777 1. 88888888899999999 00011111 2* 2t 22222333 444455555 2f 2s 666 2. 8889 3* 001 3t 3f 455 3s 3. 4* 1 The stem-and-leaf display provides a way to list our data. The expression to the left of the vertical bar is called the stem; the digits to the right are called the leaves. All the stems that begin with the same digit and the corresponding leaves, written beside each other, reconstruct an observation of the data. Thus, if we look at the four stems that begin with the digit 1 and their corresponding leaves, we see that we have two cars rated at 12 mpg, 6 cars at 14, 2 at 15, and so on. The car with the highest mileage rating in our data is rated at 41 mpg. The above plot is a five-line plot with lines() equal to 5 (five lines per interval of 10) and width() equal to 2 (two leaves per stem). Instead, we could specify lines(2): . stem mpg, lines(2) Stem-and-leaf plot for mpg (Mileage (mpg)) 1* 22444444 1. 556666777788888888899999999 2* 00011111222223334444 2. 555556668889 3* 0014 3. 55 4* 1 stem mpg, width(5) would produce the same plot as above. 2220 stem — Stem-and-leaf displays The stem-and-leaf display provides a crude histogram of our data, one not so pretty as that produced by histogram (see [R] histogram), but one that is nonetheless informative. Example 2 The miles per gallon rating fits easily into a stem-and-leaf display because, in our data, it has two digits. However, stem does not require two digits. . stem price, lines(1) digits(3) Stem-and-leaf plot for price (Price) 3*** 291,299,667,748,798,799,829,895,955,984,995 010,060,082,099,172,181,187,195,296,389,424,425,453,482,499, ... (26) 4*** 5*** 079,104,172,189,222,379,397,705,719,788,798,799,886,899 6*** 165,229,295,303,342,486,850 140,827 7*** 8*** 129,814 9*** 690,735 10*** 371,372 11*** 385,497,995 12*** 990 13*** 466,594 500 14*** 15*** 906 The (26) at the right of the second stem shows that there were 26 leaves on this stem — too many to display on one line. We can make a more compact stem-and-leaf plot by rounding. To display stem in units of 100, we could type . stem price, round(100) Stem-and-leaf plot for price (Price) price rounded to nearest multiple of 100 plot in units of 100 33778889 3* 00001112222344455555667777899 4* 5* 11222447788899 6* 2233359 7* 18 18 8* 9* 77 10* 44 11* 45 0 12* 13* 056 5 14* 9 15* price, in our data, has four or five digits. stem presented the display in terms of units of 100, so a car that cost $3,291 was treated for display purposes as $3,300. stem — Stem-and-leaf displays 2221 Technical note Stem-and-leaf diagrams have been used in Japanese railway timetables, as shown in Tufte (1990, 46–47). Stored results stem stores the following in r(): Scalars r(width) r(digits) Macros r(round) r(truncate) width of a stem number of digits per leaf; default is 1 number specified in round() number specified in truncate() References Cox, N. J. 2007. Speaking Stata: Turning over a new leaf. Stata Journal 7: 413–433. Emerson, J. D., and D. C. Hoaglin. 1983. Stem-and-leaf displays. In Understanding Robust and Exploratory Data Analysis, ed. D. C. Hoaglin, C. F. 
Mosteller, and J. W. Tukey, 7–32. New York: Wiley. Tufte, E. R. 1990. Envisioning Information. Cheshire, CT: Graphics Press. Tukey, J. W. 1972. Some graphic and semigraphic displays. In Statistical Papers in Honor of George W. Snedecor, ed. T. A. Bancroft and S. A. Brown, 293–316. Ames, IA: Iowa State University Press. . 1977. Exploratory Data Analysis. Reading, MA: Addison–Wesley. Also see [R] histogram — Histograms for continuous and categorical variables [R] lv — Letter-value displays Title stepwise — Stepwise estimation Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax stepwise , options : command Description options Model ∗ ∗ significance level for removal from the model significance level for addition to the model pr(#) pe(#) Model2 perform forward-stepwise selection perform hierarchical selection keep the first term perform likelihood-ratio test instead of Wald test forward hierarchical lockterm1 lr Reporting control column formats and line width display options ∗ At least one of pr(#) or pe(#) must be specified. by and xi are allowed; see [U] 11.1.10 Prefix commands. Weights are allowed if command allows them; see [U] 11.1.6 weight. All postestimation commands behave as they would after command without the stepwise prefix; see the postestimation manual entry for command. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Other > Stepwise estimation Description stepwise performs stepwise estimation. Typing . stepwise, pr(#): command performs backward-selection estimation for command. The stepwise selection method is determined by the following option combinations: options Description pr(#) pr(#) pr(#) pe(#) pe(#) pr(#) backward selection backward hierarchical selection backward stepwise forward selection forward hierarchical selection forward stepwise hierarchical pe(#) hierarchical pe(#) forward 2222 stepwise — Stepwise estimation 2223 command defines the estimation command to be executed. The following Stata commands are supported by stepwise: clogit cloglog glm intreg logistic logit nbreg ologit oprobit poisson probit qreg regress scobit stcox stcrreg streg tobit stepwise expects command to have the following form: command name depvar term term . . . if in weight , command options where term is either varname or (varlist) (a varlist in parentheses indicates that this group of variables is to be included or excluded together). depvar is not present when command name is stcox, stcrreg, or streg; otherwise, depvar is assumed to be present. For intreg, depvar is actually two dependent variable names (depvar1 and depvar2 ). sw is a synonym for stepwise. Options Model pr(#) specifies the significance level for removal from the model; terms with p ≥ pr() are eligible for removal. pe(#) specifies the significance level for addition to the model; terms with p < pe() are eligible for addition. Model 2 forward specifies the forward-stepwise method and may be specified only when both pr() and pe() are also specified. Specifying both pr() and pe() without forward results in backward-stepwise selection. Specifying only pr() results in backward selection, and specifying only pe() results in forward selection. hierarchical specifies hierarchical selection. lockterm1 specifies that the first term be included in the model and not be subjected to the selection criteria. lr specifies that the test of term significance be the likelihood-ratio test. 
The default is the less computationally expensive Wald test; that is, the test is based on the estimated variance–covariance matrix of the estimators. Reporting display options: cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. 2224 stepwise — Stepwise estimation Remarks and examples Remarks are presented under the following headings: Introduction Search logic for a step Full search logic Examples Estimation sample considerations Messages Programming for stepwise Introduction Typing . stepwise, pr(.10): regress y1 x1 x2 d1 d2 d3 x4 x5 performs a backward-selection search for the regression model y1 on x1, x2, d1, d2, d3, x4, and x5. In this search, each explanatory variable is said to be a term. Typing . stepwise, pr(.10): regress y1 x1 x2 (d1 d2 d3) (x4 x5) performs a similar backward-selection search, but the variables d1, d2, and d3 are treated as one term, as are x4 and x5. That is, d1, d2, and d3 may or may not appear in the final model, but they appear or do not appear together. Example 1 Using the automobile dataset, we fit a backward-selection model of mpg: . . . > use http://www.stata-press.com/data/r13/auto generate weight2 = weight*weight stepwise, pr(.2): regress mpg weight weight2 displ gear turn headroom price begin with full model p = 0.7116 >= 0.2000 removing headroom p = 0.6138 >= 0.2000 removing displacement p = 0.3278 >= 0.2000 removing price SS df MS Number of obs = Source F( 5, 68) = Model 1736.31455 5 347.262911 Prob > F = Residual 707.144906 68 10.3991898 R-squared = Adj R-squared = Total 2443.45946 73 33.4720474 Root MSE = mpg Coef. weight weight2 foreign gear_ratio turn _cons -.0158002 1.77e-06 -3.615107 2.011674 -.3087038 59.02133 Std. Err. .0039169 6.20e-07 1.260844 1.468831 .1763099 9.3903 t -4.03 2.86 -2.87 1.37 -1.75 6.29 P>|t| 0.000 0.006 0.006 0.175 0.084 0.000 foreign 74 33.39 0.0000 0.7106 0.6893 3.2248 [95% Conf. Interval] -.0236162 5.37e-07 -6.131082 -.9193321 -.6605248 40.28327 -.0079842 3.01e-06 -1.099131 4.94268 .0431172 77.75938 This estimation treated each variable as its own term and thus considered each one separately. The engine displacement and gear ratio should really be considered together: stepwise — Stepwise estimation 2225 . stepwise, pr(.2): regress mpg weight weight2 (displ gear) turn headroom > foreign price begin with full model p = 0.7116 >= 0.2000 removing headroom p = 0.3944 >= 0.2000 removing displacement gear_ratio p = 0.2798 >= 0.2000 removing price SS df MS Number of obs = 74 Source F( 4, 69) = 40.76 Model 1716.80842 4 429.202105 Prob > F = 0.0000 726.651041 69 10.5311745 R-squared = 0.7026 Residual Adj R-squared = 0.6854 Total 2443.45946 73 33.4720474 Root MSE = 3.2452 mpg Coef. weight weight2 foreign turn _cons -.0160341 1.70e-06 -2.758668 -.2862724 65.39216 Std. Err. .0039379 6.21e-07 1.101772 .176658 8.208778 t -4.07 2.73 -2.50 -1.62 7.97 P>|t| 0.000 0.008 0.015 0.110 0.000 [95% Conf. Interval] -.0238901 4.58e-07 -4.956643 -.6386955 49.0161 -.0081782 2.94e-06 -.5606925 .0661508 81.76823 Search logic for a step Before discussing the complete search logic, consider the logic for a step — the first step — in detail. The other steps follow the same logic. If you type . stepwise, pr(.20): regress y1 x1 x2 (d1 d2 d3) (x4 x5) the logic is 1. 2. 3. 4. 5. 6. Fit the model y on x1 x2 d1 d2 d3 x4 x5. Consider dropping x1. Consider dropping x2. Consider dropping d1 d2 d3. Consider dropping x4 x5. Find the term above that is least significant. If its significance level is ≥ 0.20, remove that term. 
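The removal decision in step 6 can be checked by hand: fit the full model and obtain the Wald p-value of each term with testparm; the term with the largest p-value at or above pr() is the one that stepwise removes first. Below is a minimal sketch using the automobile data from example 1; the grouping of displ and gear into one term mirrors the second command shown there.

. use http://www.stata-press.com/data/r13/auto, clear
. generate weight2 = weight*weight
. quietly regress mpg weight weight2 displ gear turn headroom foreign price
. testparm headroom        // Wald p-value for a single-variable term
. testparm displ gear      // a multiple-variable term is tested jointly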
If you type . stepwise, pr(.20) hierarchical: regress y1 x1 x2 (d1 d2 d3) (x4 x5) the logic would be different because the hierarchical option states that the terms are ordered. The initial logic would become 1. Fit the model y on x1 x2 d1 d2 d3 x4 x5. 2. Consider dropping x4 x5 — the last term. 3. If the significance of this last term is ≥ 0.20, remove the term. The process would then stop or continue. It would stop if x4 x5 were not dropped, and otherwise, stepwise would continue to consider the significance of the next-to-last term, d1 d2 d3. Specifying pe() rather than pr() switches to forward estimation. If you type . stepwise, pe(.20): regress y1 x1 x2 (d1 d2 d3) (x4 x5) 2226 stepwise — Stepwise estimation stepwise performs forward-selection search. The logic for the first step is 1. 2. 3. 4. 5. 6. Fit a model of y on nothing (meaning a constant). Consider adding x1. Consider adding x2. Consider adding d1 d2 d3. Consider adding x4 x5. Find the term above that is most significant. If its significance level is < 0.20, add that term. As with backward estimation, if you specify hierarchical, . stepwise, pe(.20) hierarchical: regress y1 x1 x2 (d1 d2 d3) (x4 x5) the search for the most significant term is restricted to the next term: 1. Fit a model of y on nothing (meaning a constant). 2. Consider adding x1 — the first term. 3. If the significance is < 0.20, add the term. If x1 were added, stepwise would next consider x2; otherwise, the search process would stop. stepwise can also use a stepwise selection logic that alternates between adding and removing terms. The full logic for all the possibilities is given below. stepwise — Stepwise estimation 2227 Full search logic Option Logic pr() (backward selection) Fit the full model on all explanatory variables. While the least-significant term is “insignificant”, remove it and reestimate. pr() hierarchical (backward hierarchical selection) Fit full model on all explanatory variables. While the last term is “insignificant”, remove it and reestimate. pr() pe() (backward stepwise) Fit full model on all explanatory variables. If the least-significant term is “insignificant”, remove it and reestimate; otherwise, stop. Do that again: if the least-significant term is “insignificant”, remove it and reestimate; otherwise, stop. Repeatedly, if the most-significant excluded term is “significant”, add it and reestimate; if the least-significant included term is “insignificant”, remove it and reestimate; until neither is possible. pe() (forward selection) Fit “empty” model. While the most-significant excluded term is “significant”, add it and reestimate. pe() hierarchical (forward hierarchical selection) Fit “empty” model. While the next term is “significant”, add it and reestimate. pr() pe() forward (forward stepwise) Fit “empty” model. If the most-significant excluded term is “significant”, add it and reestimate; otherwise, stop. Do that again: if the most-significant excluded term is “significant”, add it and reestimate; otherwise, stop. Repeatedly, if the least-significant included term is “insignificant”, remove it and reestimate; if the most-significant excluded term is “significant”, add it and reestimate; until neither is possible. 2228 stepwise — Stepwise estimation Examples The following two statements are equivalent; both include solely single-variable terms: . stepwise, pr(.2): regress price mpg weight displ . stepwise, pr(.2): regress price (mpg) (weight) (displ) The following two statements are equivalent; the last term in each is r1, . . . 
, r4: . stepwise, pr(.2) hierarchical: regress price mpg weight displ (r1-r4) . stepwise, pr(.2) hierarchical: regress price (mpg) (weight) (displ) (r1-r4) To group variables weight and displ into one term, type . stepwise, pr(.2) hierarchical: regress price mpg (weight displ) (r1-r4) stepwise can be used with commands other than regress; for instance, . stepwise, pr(.2): logit outcome (sex weight) treated1 treated2 . stepwise, pr(.2): logistic outcome (sex weight) treated1 treated2 Either statement would fit the same model because logistic and logit both perform logistic regression; they differ only in how they report results; see [R] logit and [R] logistic. We use the lockterm1 option to force the first term to be included in the model. To keep treated1 and treated2 in the model no matter what, we type . stepwise, pr(.2) lockterm1: logistic outcome (treated1 treated2) ... After stepwise estimation, we can type stepwise without arguments to redisplay results, . stepwise (output from logistic appears ) or type the underlying estimation command: . logistic (output from logistic appears ) At estimation time, we can specify options unique to the command being stepped: . stepwise, pr(.2): logit outcome (sex weight) treated1 treated2, or or is logit’s option to report odds ratios rather than coefficients; see [R] logit. Estimation sample considerations Whether you use backward or forward estimation, stepwise forms an estimation sample by taking observations with nonmissing values of all the variables specified (except for depvar1 and depvar2 for intreg). The estimation sample is held constant throughout the stepping. Thus if you type . stepwise, pr(.2) hierarchical: regress amount sk edul sval and variable sval is missing in half the data, that half of the data will not be used in the reported model, even if sval is not included in the final model. The function e(sample) identifies the sample that was used. e(sample) contains 1 for observations used and 0 otherwise. For instance, if you type . stepwise, pr(.2) pe(.10): logistic outcome x1 x2 (x3 x4) (x5 x6 x7) stepwise — Stepwise estimation 2229 and the final model is outcome on x1, x5, x6, and x7, you could re-create the final regression by typing . logistic outcome x1 x5 x6 x7 if e(sample) You could obtain summary statistics within the estimation sample of the independent variables by typing . summarize x1 x5 x6 x7 if e(sample) If you fit another model, e(sample) will automatically be redefined. Typing . stepwise, lock pr(.2): logistic outcome (x1 x2) (x3 x4) (x5 x6 x7) would automatically drop e(sample) and re-create it. Messages note: dropped because of collinearity Each term is checked for collinearity, and variables within the term are dropped if collinearity is found. For instance, say that you type . stepwise, pr(.2): regress y x1 x2 (r1-r4) (x3 x4) and assume that variables r1 through r4 are mutually exclusive and exhaustive dummy variables — perhaps r1, . . . , r4 indicate in which of four regions the subject resides. One of the r1, . . . , r4 variables will be automatically dropped to identify the model. This message should cause you no concern. Error message: between-term collinearity, variable After removing any within-term collinearity, if stepwise still finds collinearity between terms, it refuses to continue. For instance, assume that you type . stepwise, pr(.2): regress y1 x1 x2 (d1-d8) (r1-r4) Assume that r1, . . . , r4 identify in which of four regions the subject resides, and that d1, . . . 
, d8 identify the same sort of information, but more finely. r1, say, amounts to d1 and d2; r2 to d3, d4, and d5; r3 to d6 and d7; and r4 to d8. You can estimate the d* variables or the r* variables, but not both. It is your responsibility to specify noncollinear terms. note: note: dropped because of estimability obs. dropped because of estimability You probably received this message in fitting a logistic or probit model. Regardless of estimation strategy, stepwise checks that the full model can be fit. The indicated variable had a 0 or infinite standard error. For logistic, logit, and probit, this message is typically caused by one-way causation. Assume that you type . stepwise, pr(.2): logistic outcome (x1 x2 x3) d1 2230 stepwise — Stepwise estimation and assume that variable d1 is an indicator (dummy) variable. Further assume that whenever d1 = 1, outcome = 1 in the data. Then the coefficient on d1 is infinite. One (conservative) solution to this problem is to drop the d1 variable and the d1==1 observations. The underlying estimation commands probit, logit, and logistic report the details of the difficulty and solution; stepwise simply accumulates such problems and reports the above summary messages. Thus if you see this message, you could type . logistic outcome x1 x2 x3 d1 to see the details. Although you should think carefully about such situations, Stata’s solution of dropping the offending variables and observations is, in general, appropriate. Programming for stepwise stepwise requires that command name follow standard Stata syntax and allow the if qualifier; see [U] 11 Language syntax. Furthermore, command name must have sw or swml as a program property; see [P] program properties. If command name has swml as a property, command name must store the log-likelihood value in e(ll) and model degrees of freedom in e(df m). Stored results stepwise stores whatever is stored by the underlying estimation command. Also, stepwise stores stepwise in e(stepwise). Methods and formulas Some statisticians do not recommend stepwise procedures; see Sribney (1998) for a summary. References Afifi, A. A., S. May, and V. A. Clark. 2012. Practical Multivariate Analysis. 5th ed. Boca Raton, FL: CRC Press. Beale, E. M. L. 1970. Note on procedures for variable selection in multiple regression. Technometrics 12: 909–914. Bendel, R. B., and A. A. Afifi. 1977. Comparison of stopping rules in forward “stepwise” regression. Journal of the American Statistical Association 72: 46–53. Berk, K. N. 1978. Comparing subset regression procedures. Technometrics 20: 1–6. Draper, N., and H. Smith. 1998. Applied Regression Analysis. 3rd ed. New York: Wiley. Efroymson, M. A. 1960. Multiple regression analysis. In Mathematical Methods for Digital Computers, ed. A. Ralston and H. S. Wilf, 191–203. New York: Wiley. Gorman, J. W., and R. J. Toman. 1966. Selection of variables for fitting equations to data. Technometrics 8: 27–51. Hocking, R. R. 1976. The analysis and selection of variables in linear regression. Biometrics 32: 1–49. Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley. Kennedy, W. J., Jr., and T. A. Bancroft. 1971. Model-building for prediction in regression based on repeated significance tests. Annals of Mathematical Statistics 42: 1273–1284. Lindsey, C., and S. J. Sheather. 2010. Variable selection in linear regression. Stata Journal 10: 650–669. Mantel, N. 1970. Why stepdown procedures in variable selection. Technometrics 12: 621–625. . 1971. 
More on variable selection and an alternative approach (letter to the editor). Technometrics 13: 455–457. stepwise — Stepwise estimation 2231 Sribney, W. M. 1998. FAQ: What are some of the problems with stepwise regression? http://www.stata.com/support/faqs/stat/stepwise.html. Wang, Z. 2000. sg134: Model selection using the Akaike information criterion. Stata Technical Bulletin 54: 47–49. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 335–337. College Station, TX: Stata Press. Williams, R. 2007. Stata tip 46: Step we gaily, on we go. Stata Journal 7: 272–274. Also see [R] nestreg — Nested model statistics Title stored results — Stored results Syntax Description Option Remarks and examples References Also see Syntax List results from general commands, stored in r() return list , all List results from estimation commands, stored in e() ereturn list , all List results from parsing commands, stored in s() sreturn list Description Results of calculations are stored by many Stata commands so that they can be easily accessed and substituted into later commands. return list lists results stored in r(). ereturn list lists results stored in e(). sreturn list lists results stored in s(). This entry discusses using stored results. Programmers wishing to store results should see [P] return and [P] ereturn. Option all is for use with return list and ereturn list. all specifies that hidden and historical stored results be listed along with the usual stored results. This option is seldom used. See Using hidden and historical stored results and Programming hidden and historical stored results under Remarks and examples of [P] return for more information. These sections are written in terms of return list, but everything said there applies equally to ereturn list. all is not allowed with sreturn list because s() does not allow hidden or historical results. Remarks and examples Stata commands are classified as being r-class e-class s-class n-class general commands that store results in r() estimation commands that store results in e() parsing commands that store results in s() commands that do not store in r(), e(), or s() 2232 stored results — Stored results 2233 There is also a c-class, c(), containing the values of system parameters and settings, along with certain constants, such as the value of pi; see [P] creturn. A program, however, cannot be c-class. You can look at the Stored results section of the manual entry of a command to determine whether it is r-, e-, s-, or n-class, but it is easy enough to guess. Commands producing statistical results are either r-class or e-class. They are e-class if they present estimation results and r-class otherwise. s-class is a class used by programmers and is primarily used in subprograms performing parsing. n-class commands explicitly state where the result is to go. For instance, generate and replace are n-class because their syntax is generate varname = . . . and replace varname = . . . . After executing a command, you can type return list, ereturn list, or sreturn list to see what has been stored. Example 1 . use http://www.stata-press.com/data/r13/auto4 (1978 Automobile Data) . describe Contains data from http://www.stata-press.com/data/r13/auto4.dta obs: 74 1978 Automobile Data vars: 6 6 Apr 2013 00:20 size: 2,072 variable name price weight mpg make length rep78 storage type int int int str18 int int Sorted by: . 
return list scalars: r(changed) r(width) r(k) r(N) display format value label %8.0gc %8.0gc %8.0g %-18s %8.0g %8.0g = = = = variable label Price Weight (lbs.) Mileage (mpg) Make and Model Length (in.) Repair Record 1978 0 28 6 74 To view all stored results, including those that are historical or hidden, specify the all option. . return list, all scalars: r(changed) r(width) r(k) r(N) = = = = 0 28 6 74 Historical; used before Stata 12, may exist only under version control scalars: r(widthmax) = r(k_max) = r(N_max) = 1048576 2048 2147483646 r(widthmax), r(k max), and r(N max) are historical stored results. They are no longer relevant because Stata dynamically adjusts memory beginning with Stata 12. 2234 stored results — Stored results Technical note In the above example, we stated that r(widthmax) and r(N max) are no longer relevant. In fact, they are not useful. Stata no longer has a fixed memory size, so the methods used to calculate r(widthmax) and r(N max) are no longer appropriate. Example 2 You can use stored results in expressions. . summarize mpg Variable mpg . return list scalars: Obs Mean 74 21.2973 Std. Dev. Min Max 5.785503 12 41 Min Max -1.606999 3.40553 r(N) = 74 r(sum_w) = 74 r(mean) = 21.2972972972973 r(Var) = 33.47204738985561 r(sd) = 5.785503209735141 r(min) = 12 r(max) = 41 r(sum) = 1576 . generate double mpgstd = (mpg-r(mean))/r(sd) . summarize mpgstd Variable Obs Mean Std. Dev. mpgstd 74 -1.64e-16 1 Be careful to use results stored in r() soon because they will be replaced the next time you execute another r-class command. For instance, although r(mean) was 21.3 (approximately) after summarize mpg, it is −1.64e–16 now because you just ran summarize with mpgstd. Example 3 e-class is really no different from r-class, except for where results are stored and that, when an estimation command stores results, it tends to store a lot of them: . regress mpg weight length (output omitted ) . ereturn list scalars: e(N) = 74 e(df_m) = 2 e(df_r) = 71 e(F) = 69.34050004300228 e(r2) = .6613903979336324 e(rmse) = 3.413681741382589 e(mss) = 1616.08062422659 e(rss) = 827.3788352328694 e(r2_a) = .6518520992838756 e(ll) = -194.3267619410807 e(ll_0) = -234.3943376482347 e(rank) = 3 stored results — Stored results 2235 macros: e(cmdline) e(title) e(marginsok) e(vce) e(depvar) e(cmd) e(properties) e(predict) e(model) e(estat_cmd) matrices: e(b) e(V) functions: e(sample) : : : : : : : : : : : : "regress mpg weight length" "Linear regression" "XB default" "ols" "mpg" "regress" "b V" "regres_p" "ols" "regress_estat" 1 x 3 3 x 3 These e-class results will stick around until you run another estimation command. Typing return list and ereturn list is the easy way to find out what a command stores. Both r- and e-class results come in four flavors: scalars, macros, matrices, and functions. (s-class results come in only one flavor—macros—and as earlier noted, s-class is used solely by programmers, so ignore it.) Scalars are just that—numbers by any other name. You can subsequently refer to r(mean) or e(rmse) in numeric expressions and obtain the result to full precision. Macros are strings. For instance, e(depvar) contains “mpg”. You can refer to it, too, in subsequent expressions, but really that would be of most use to programmers, who will refer to it using constructs like "‘e(depvar)’". In any case, macros are macros, and you obtain their contents just as you would a local macro, by enclosing their name in single quotes. The name here is the full name, so ‘e(depvar)’ is mpg. 
Matrices are matrices, and all estimation commands store e(b) and e(V) containing the coefficient vector and variance–covariance matrix of the estimates (VCE). Functions are stored by e-class commands only, and the only function existing is e(sample). e(sample) evaluates to 1 (meaning true) if the observation was used in the previous estimation and to 0 (meaning false) otherwise. Technical note Say that some command set r(scalar) and r(macro), the first being stored as a scalar and the second as a macro. In theory, in subsequent use you are supposed to refer to r(scalar) and ‘r(macro)’. In fact, however, you can refer to either one with or without quotes, so you could refer to ‘r(scalar)’ and r(macro). Programmers sometimes do this. When you refer to r(scalar), you are referring to the full double-precision stored result. Think of r(scalar) without quotes as a function returning the value of the stored result scalar. When you refer to r(scalar) in quotes, Stata understands ‘r(scalar)’ to mean “substitute the printed result of evaluating r(scalar)”. Pretend that r(scalar) equals the number 23. Then ‘r(scalar)’ is 23, the character 2 followed by 3. Referring to r(scalar) in quotes is sometimes useful. Say that you want to use the immediate command ci with r(scalar). The immediate command ci requires its arguments to be numbers— numeric literals in programmer’s jargon—and it will not take an expression. Thus you could not type 2236 stored results — Stored results ‘ci r(scalar) . . .’. You could, however, type ‘ci ‘r(scalar)’ . . .’ because ‘r(scalar)’ is just a numeric literal. For r(macro), you are supposed to refer to it in quotes: ‘r(macro)’. If, however, you omit the quotes in an expression context, Stata evaluates the macro and then pretends that it is the result of function-returning-string. There are side effects of this, the most important being that the result is trimmed to 80 characters. Referring to r(macro) without quotes is never a good idea; the feature was included merely for completeness. You can even refer to r(matrix) in quotes (assume that r(matrix) is a matrix). ‘r(matrix)’ does not result in the matrix being substituted; it returns the word matrix. Programmers sometimes find that useful. References Jann, B. 2005. Making regression tables from stored estimates. Stata Journal 5: 288–308. . 2007. Making regression tables simplified. Stata Journal 7: 227–244. Also see [P] ereturn — Post the estimation results [P] return — Return stored results [U] 18.8 Accessing results calculated by other programs [U] 18.9 Accessing results calculated by estimation commands Title suest — Seemingly unrelated estimation Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Options Acknowledgment Syntax suest namelist , options where namelist is a list of one or more names under which estimation results were stored via estimates store; see [R] estimates store. Wildcards may be used. * and all refer to all stored results. A period (.) may be used to refer to the last estimation results, even if they have not (yet) been stored. 
Description options SE/Robust survey data estimation vcetype may be robust or cluster clustvar svy vce(vcetype) Reporting level(#) dir eform(string) display options set confidence level; default is level(95) display a table describing the models report exponentiated coefficients and label as string control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling coeflegend display legend instead of statistics coeflegend does not appear in the dialog box. Menu Statistics > Postestimation > Tests > Seemingly unrelated estimation Description suest is a postestimation command; see [U] 20 Estimation and postestimation commands. suest combines the estimation results—parameter estimates and associated (co)variance matrices— stored under namelist into one parameter vector and simultaneous (co)variance matrix of the sandwich/robust type. This (co)variance matrix is appropriate even if the estimates were obtained on the same or on overlapping data. Typical applications of suest are tests for intramodel and cross-model hypotheses using test or testnl, for example, a generalized Hausman specification test. lincom and nlcom may be used after suest to estimate linear combinations and nonlinear functions of coefficients. suest may also be used to adjust a standard VCE for clustering or survey design effects. 2237 2238 suest — Seemingly unrelated estimation Different estimators are allowed, for example, a regress model and a probit model; the only requirement is that predict produce equation-level scores with the score option after an estimation command. The models may be estimated on different samples, due either to explicit if or in selection or to missing values. If weights are applied, the same weights (type and values) should be applied to all models in namelist. The estimators should be estimated without vce(robust) or vce(cluster clustvar) options. suest returns the robust VCE, allows the vce(cluster clustvar) option, and automatically works with results from the svy prefix command (only for vce(linearized)). See example 7 in [SVY] svy postestimation for an example using suest with svy: ologit. Because suest posts its results like a proper estimation command, its results can be stored via estimates store. Moreover, like other estimation commands, suest typed without arguments replays the results. Options SE/Robust svy specifies that estimation results should be modified to reflect the survey design effects according to the svyset specifications, see [SVY] svyset. The svy option is implied when suest encounters survey estimation results from the svy prefix; see [SVY] svy. Poststratification is allowed only with survey estimation results from the svy prefix. vce(vcetype) specifies the type of standard error reported, which includes types that are robust to some kinds of misspecification (robust) and that allow for intragroup correlation (cluster clustvar; see [R] vce option. The vce() option may not be combined with the svy option or estimation results from the svy prefix. Reporting level(#) specifies the confidence level, as a percentage, for confidence intervals of the coefficients; see [R] level. dir displays a table describing the models in namelist just like estimates dir namelist. eform(string) displays the coefficient table in exponentiated form: for each coefficient, exp(b) rather than b is displayed, and standard errors and confidence intervals are transformed. 
string is the table header that will be displayed above the transformed coefficients and must be 11 characters or fewer, for example, eform("Odds ratio"). display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. The following option is available with suest but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Remarks are presented under the following headings: Using suest Remarks on regress Testing the assumption of the independence of irrelevant alternatives Testing proportionality Testing cross-model hypotheses suest — Seemingly unrelated estimation 2239 Using suest If you plan to use suest, you must take precautions when fitting the original models. These restrictions are relaxed when using svy commands; see [SVY] svy postestimation. 1. suest works with estimation commands that allow predict to generate equation-level score variables when supplied with the score (or scores) option. For example, equation-level score variables are generated after running mlogit by typing . predict sc*, scores 2. Estimation should take place without the vce(robust) or vce(cluster clustvar) option. suest always computes the robust estimator of the (co)variance, and suest has a vce(cluster clustvar) option. The within-model covariance matrices computed by suest are identical to those obtained by specifying a vce(robust) or vce(cluster clustvar) option during estimation. suest, however, also estimates the between-model covariances of parameter estimates. 3. Finally, the estimation results to be combined should be stored by estimates store; see [R] estimates store. After estimating and storing a series of estimation results, you are ready to combine the estimation results with suest, . suest name1 name2 . . . , vce(cluster clustvar) and you can subsequently use postestimation commands, such as test, to test hypotheses. Here an important issue is how suest assigns names to the equations. If you specify one model name, the original equation names are left unchanged; otherwise, suest constructs new equation names. The coefficients of a single-equation model (such as logit and poisson) that was estimate stored under name X are collected under equation X. With a multiequation model stored under name X, suest prefixes X to an original equation name eq, forming equation name, X eq. Technical note Earlier we said that standard errors from suest are identical to those obtained by specifying the vce(robust) option with each command individually. Thus if you fit a logistic model using logit with the vce(robust) option, you will get the same standard errors when you type . suest . directly after logit using the same data without the vce(robust) option. This is not true for multiple estimation results when the estimation samples are not all the same. The standard errors from suest will be slightly smaller than those from individual model fits using the vce(robust) option because suest uses a larger number of observations to estimate the simultaneous (co)variance matrix. Technical note In rare circumstances, suest may have to truncate equation names to 32 characters. When equation names are not unique because of truncation, suest numbers the equations within models, using equations named X #. 
2240 suest — Seemingly unrelated estimation Remarks on regress regress (see [R] regress) does not include its ancillary parameter, the residual variance, in its coefficient vector and (co)variance matrix. Moreover, while the score option is allowed with predict after regress, a score variable is generated for the mean but not for the variance parameter. suest contains special code that assigns the equation name mean to the coefficients for the mean, adds the equation lnvar for the log variance, and computes the appropriate two score variables itself. Testing the assumption of the independence of irrelevant alternatives The multinomial logit model and the closely related conditional logit model satisfy a probabilistic version of the assumption of the independence of irrelevant alternatives (IIA), implying that the ratio of the probabilities for two alternatives does not depend on what other alternatives are available. Hausman and McFadden (1984) proposed a test for this assumption that is implemented in the hausman command. The standard Hausman test has several limitations. First, the test statistic may be undefined because the estimated VCE does not satisfy the required asymptotic properties of the test. Second, the classic Hausman test applies only to the test of the equality of two estimators. Third, the test requires access to a fully efficient estimator; such an estimator may not be available, for example, if you are analyzing complex survey data. Using suest can overcome these three limitations. Example 1 In our first example, we follow the analysis of the type of health insurance reported in [R] mlogit and demonstrate the hausman command with the suest/test combination. We fit the full multinomial logit model for all three alternatives and two restricted multinomial models in which one alternative is excluded. After fitting each of these models, we store the results by using the store subcommand of estimates. title() simply documents the models. . use http://www.stata-press.com/data/r13/sysdsn4 (Health insurance data) . mlogit insure age male Iteration Iteration Iteration Iteration 0: 1: 2: 3: log log log log likelihood likelihood likelihood likelihood = = = = -555.85446 -551.32973 -551.32802 -551.32802 Multinomial logistic regression Number of obs LR chi2(4) Prob > chi2 Pseudo R2 Log likelihood = -551.32802 insure Indemnity Coef. Std. Err. z P>|z| = = = = 615 9.05 0.0598 0.0081 [95% Conf. Interval] (base outcome) Prepaid age male _cons -.0100251 .5095747 .2633838 .0060181 .1977893 .2787575 -1.67 2.58 0.94 0.096 0.010 0.345 -.0218204 .1219147 -.2829708 .0017702 .8972346 .8097383 -.0051925 .4748547 -1.756843 .0113821 .3618462 .5309602 -0.46 1.31 -3.31 0.648 0.189 0.001 -.0275011 -.2343508 -2.797506 .0171161 1.18406 -.7161803 Uninsure age male _cons . estimates store m1, title(all three insurance forms) suest — Seemingly unrelated estimation . . . . 2241 quietly mlogit insure age male if insure != "Uninsure":insure estimates store m2, title(insure != "Uninsure":insure) quietly mlogit insure age male if insure != "Prepaid":insure estimates store m3, title(insure != "Prepaid":insure) Having performed the three estimations, we inspect the results. estimates dir provides short descriptions of the models that were stored using estimates store. Typing estimates table lists the coefficients, displaying blanks for a coefficient not contained in a model. . 
estimates dir name m1 m2 m3 command depvar npar mlogit mlogit mlogit insure insure insure 9 6 6 title all three insurance forms insure != Uninsure :insure insure != Prepaid :insure . estimates table m1 m2 m3, star stats(N ll) keep(Prepaid: Uninsure:) Variable m1 m2 -.01002511 .50957468** .26338378 -.01015205 .51440033** .26780432 m3 Prepaid age male _cons Uninsure age male _cons -.00519249 .47485472 -1.7568431*** -.00410547 .45910738 -1.8017743*** Statistics N ll 615 -551.32802 570 -390.48643 338 -131.76807 legend: * p<0.05; ** p<0.01; *** p<0.001 Comparing the coefficients between models does not suggest substantial differences. We can formally test that coefficients are the same for the full model m1 and the restricted models m2 and m3 by using the hausman command. hausman expects the models to be specified in the order “always consistent” first and “efficient under H0 ” second. . hausman m2 m1, alleqs constant Coefficients (b) (B) m2 m1 age male _cons -.0101521 .5144003 .2678043 -.0100251 .5095747 .2633838 (b-B) Difference sqrt(diag(V_b-V_B)) S.E. -.0001269 .0048256 .0044205 . .0123338 . b = consistent under Ho and Ha; obtained from mlogit B = inconsistent under Ha, efficient under Ho; obtained from mlogit Test: Ho: difference in coefficients not systematic chi2(3) = (b-B)’[(V_b-V_B)^(-1)](b-B) = 0.08 Prob>chi2 = 0.9944 (V_b-V_B is not positive definite) 2242 suest — Seemingly unrelated estimation . hausman m3 m1, alleqs constant Coefficients (b) (B) m3 m1 age male _cons Test: -.0041055 .4591074 -1.801774 (b-B) Difference sqrt(diag(V_b-V_B)) S.E. .001087 -.0157473 -.0449311 .0021355 . .1333421 -.0051925 .4748547 -1.756843 b = consistent under Ho and Ha; obtained from mlogit B = inconsistent under Ha, efficient under Ho; obtained from mlogit Ho: difference in coefficients not systematic chi2(3) = (b-B)’[(V_b-V_B)^(-1)](b-B) = -0.18 chi2<0 ==> model fitted on these data fails to meet the asymptotic assumptions of the Hausman test; see suest for a generalized test According to the test of m1 against m2, we cannot reject the hypothesis that the coefficients of m1 and m2 are the same. The second Hausman test is not well defined—something that happens fairly often. The problem is due to the estimator of the variance V(b-B) as V(b)-V(B), which is a feasible estimator only asymptotically. Here it simply is not a proper variance matrix, and the Hausman test becomes undefined. suest m1 m2 estimates the simultaneous (co)variance of the coefficients of models m1 and m2. Although suest is technically a postestimation command, it acts like an estimation command in that it stores the simultaneous coefficients in e(b) and the full (co)variance matrix in e(V). We could have used the estat vce command to display the full (co)variance matrix to show that the cross-model covariances were indeed estimated. Typically, we would not have a direct interest in e(V). . suest m1 m2, noomitted Simultaneous results for m1, m2 Number of obs Coef. Robust Std. Err. z P>|z| = 615 [95% Conf. 
Interval] m1_Indemnity m1_Prepaid age male _cons -.0100251 .5095747 .2633838 .0059403 .1988159 .277307 -1.69 2.56 0.95 0.091 0.010 0.342 -.0216679 .1199027 -.280128 .0016176 .8992467 .8068956 m1_Uninsure age male _cons -.0051925 .4748547 -1.756843 .0109005 .3677326 .4971383 -0.48 1.29 -3.53 0.634 0.197 0.000 -.0265571 -.2458879 -2.731216 .0161721 1.195597 -.78247 -.0101521 .5144003 .2678043 .0058988 .1996133 .2744019 -1.72 2.58 0.98 0.085 0.010 0.329 -.0217135 .1231654 -.2700134 .0014094 .9056352 .8056221 m2_Indemnity m2_Prepaid age male _cons suest — Seemingly unrelated estimation 2243 suest created equation names by combining the name under which we stored the results using estimates store with the original equation names. Thus, in the simultaneous estimation result, equation Prepaid originating in model m1 is named m1 Prepaid. According to the McFadden– Hausman specification of a test for IIA, the coefficients of the equations m1 PrePaid and m2 PrePaid should be equal. This equality can be tested easily with the test command. The cons option specifies that the intercept cons be included in the test. . test ( 1) ( 2) ( 3) [m1_Prepaid = m2_Prepaid], cons [m1_Prepaid]age - [m2_Prepaid]age = 0 [m1_Prepaid]male - [m2_Prepaid]male = 0 [m1_Prepaid]_cons - [m2_Prepaid]_cons = 0 chi2( 3) = 0.89 Prob > chi2 = 0.8266 The Hausman test via suest is comparable to that computed by hausman, but they use different estimators of the variance of the difference of the estimates. The hausman command estimates V (b−B) by V (b) − V (B), whereas suest estimates V (b − B) by V (b) − cov(b, B) − cov(B, b) + V (B). One advantage of the second estimator is that it is always admissible, so the resulting test is always well defined. This quality is illustrated in the Hausman-type test of IIA comparing models m1 and m3. . suest m1 m3, noomitted Simultaneous results for m1, m3 Number of obs Coef. Robust Std. Err. z P>|z| = 615 [95% Conf. Interval] m1_Indemnity m1_Prepaid age male _cons -.0100251 .5095747 .2633838 .0059403 .1988159 .277307 -1.69 2.56 0.95 0.091 0.010 0.342 -.0216679 .1199027 -.280128 .0016176 .8992467 .8068956 m1_Uninsure age male _cons -.0051925 .4748547 -1.756843 .0109005 .3677326 .4971383 -0.48 1.29 -3.53 0.634 0.197 0.000 -.0265571 -.2458879 -2.731216 .0161721 1.195597 -.78247 -.0041055 .4591074 -1.801774 .0111185 .3601307 .5226351 -0.37 1.27 -3.45 0.712 0.202 0.001 -.0258974 -.2467357 -2.82612 .0176865 1.164951 -.7774283 m3_Indemnity m3_Uninsure age male _cons . test ( 1) ( 2) ( 3) [m1_Uninsure = m3_Uninsure], cons [m1_Uninsure]age - [m3_Uninsure]age = 0 [m1_Uninsure]male - [m3_Uninsure]male = 0 [m1_Uninsure]_cons - [m3_Uninsure]_cons = 0 chi2( 3) = 1.49 Prob > chi2 = 0.6845 Although the classic Hausman test computed by hausman is not defined here, the suest-based test is just fine. We cannot reject the equality of the common coefficients across m1 and m3. A second advantage of the suest approach is that we can estimate the (co)variance matrix of the multivariate normal distribution of the estimators of the three models m1, m2, and m3 and test that the common coefficients are equal. 2244 suest — Seemingly unrelated estimation . suest m*, noomitted Simultaneous results for m1, m2, m3 Number of obs Coef. Robust Std. Err. z P>|z| = 615 [95% Conf. 
Interval] m1_Indemnity m1_Prepaid age male _cons -.0100251 .5095747 .2633838 .0059403 .1988159 .277307 -1.69 2.56 0.95 0.091 0.010 0.342 -.0216679 .1199027 -.280128 .0016176 .8992467 .8068956 m1_Uninsure age male _cons -.0051925 .4748547 -1.756843 .0109005 .3677326 .4971383 -0.48 1.29 -3.53 0.634 0.197 0.000 -.0265571 -.2458879 -2.731216 .0161721 1.195597 -.78247 -.0101521 .5144003 .2678043 .0058988 .1996133 .2744019 -1.72 2.58 0.98 0.085 0.010 0.329 -.0217135 .1231654 -.2700134 .0014094 .9056352 .8056221 -.0041055 .4591074 -1.801774 .0111185 .3601307 .5226351 -0.37 1.27 -3.45 0.712 0.202 0.001 -.0258974 -.2467357 -2.82612 .0176865 1.164951 -.7774283 m2_Indemnity m2_Prepaid age male _cons m3_Indemnity m3_Uninsure age male _cons . test [m1_Prepaid = m2_Prepaid] , cons notest ( 1) ( 2) ( 3) [m1_Prepaid]age - [m2_Prepaid]age = 0 [m1_Prepaid]male - [m2_Prepaid]male = 0 [m1_Prepaid]_cons - [m2_Prepaid]_cons = 0 . test [m1_Uninsure = m3_Uninsure], cons acc ( ( ( ( ( ( 1) 2) 3) 4) 5) 6) [m1_Prepaid]age - [m2_Prepaid]age = 0 [m1_Prepaid]male - [m2_Prepaid]male = 0 [m1_Prepaid]_cons - [m2_Prepaid]_cons = 0 [m1_Uninsure]age - [m3_Uninsure]age = 0 [m1_Uninsure]male - [m3_Uninsure]male = 0 [m1_Uninsure]_cons - [m3_Uninsure]_cons = 0 chi2( 6) = Prob > chi2 = 1.95 0.9240 Again we do not find evidence against the correct specification of the multinomial logit for type of insurance. The classic Hausman test assumes that one of the estimators (named B in hausman) is efficient, that is, it has minimal (asymptotic) variance. This assumption ensures that V (b) − V (B) is an admissible, viable estimator for V (b − B). The assumption that we have an efficient estimator is a restrictive one. It is violated, for instance, if our data are clustered. We want to adjust for clustering via a vce(cluster clustvar) option by requesting the cluster-adjusted sandwich estimator of variance. Consequently, in such a case, hausman cannot be used. This problem does not exist with the suest version of the Hausman test. To illustrate this feature, we suppose that the data are clustered by city—we constructed an imaginary variable cityid for this illustration. If we plan to apply suest, we would not specify the vce(cluster clustvar) option at the time of estimation. suest — Seemingly unrelated estimation 2245 suest has a vce(cluster clustvar) option. Thus we do not need to refit the models; we can call suest and test right away. . suest m1 m2, vce(cluster cityid) noomitted Simultaneous results for m1, m2 Number of obs = 615 (Std. Err. adjusted for 260 clusters in cityid) Coef. Robust Std. Err. z P>|z| [95% Conf. Interval] m1_Indemnity m1_Prepaid age male _cons -.0100251 .5095747 .2633838 .005729 .1910496 .2698797 -1.75 2.67 0.98 0.080 0.008 0.329 -.0212538 .1351244 -.2655708 .0012035 .884025 .7923384 m1_Uninsure age male _cons -.0051925 .4748547 -1.756843 .0104374 .3774021 .4916613 -0.50 1.26 -3.57 0.619 0.208 0.000 -.0256495 -.2648399 -2.720481 .0152645 1.214549 -.7932048 -.0101521 .5144003 .2678043 .0057164 .1921385 .2682193 -1.78 2.68 1.00 0.076 0.007 0.318 -.0213559 .1378158 -.2578959 .0010518 .8909848 .7935045 m2_Indemnity m2_Prepaid age male _cons . test ( 1) ( 2) ( 3) [m1_Prepaid = m2_Prepaid], cons [m1_Prepaid]age - [m2_Prepaid]age = 0 [m1_Prepaid]male - [m2_Prepaid]male = 0 [m1_Prepaid]_cons - [m2_Prepaid]_cons = 0 chi2( 3) = 0.79 Prob > chi2 = 0.8529 suest provides some descriptive information about the clustering on cityid. 
Like any other estimation command, suest informs us that the standard errors are adjusted for clustering. The Hausman-type test obtained from the test command uses a simultaneous (co)variance of m1 and m2 appropriately adjusted for clustering. In this example, we still do not have reason to conclude that the multinomial logit model in this application is misspecified, that is, that IIA is violated. The multinomial logistic regression model is a special case of the conditional logistic regression model; see [R] clogit. Like the multinomial logistic regression model, the conditional logistic regression model also makes the IIA assumption. Consider an example, introduced in [R] asclogit, in which the demand for American, Japanese, and European cars is modeled in terms of the number of local dealers of the respective brands and of some individual attributes incorporated in interaction with the nationality of cars. We want to perform a Hausman-type test for IIA comparing the decision between all nationalities with the decision between non-American cars. The following code fragment demonstrates how to conduct a Hausman test for IIA via suest in this case. . . . . clogit choice japan europe maleJap maleEur incJap incEur dealer, group(id) estimates store allcars clogit choice japan maleJap incJap dealer if car!=1 , group(id) estimates store foreign 2246 suest — Seemingly unrelated estimation . suest allcars foreign . test [allcars_choice=foreign_choice], common Testing proportionality The applications of suest that we have discussed so far concern Hausman-type tests for misspecification. To test such a hypothesis, we compared two estimators that have the same probability limit if the hypothesis holds true, but otherwise have different limits. We may also want to compare the coefficients of models (estimators) for other substantive reasons. Although we most often want to test whether coefficients differ between models or estimators, we may occasionally want to test other constraints (see Hausman and Ruud [1987]). Example 2 In this example, using simulated labor market data for siblings, we consider two dependent variables, income (inc) and whether a person was promoted in the last year (promo). We apply familiar economic arguments regarding human capital, according to which employees have a higher income and a higher probability of being promoted, by having more human capital. Human capital is acquired through formal education (edu) and on-the-job training experience (exp). We study whether income and promotion are “two sides of the same coin”, that is, whether they reflect a common latent variable, “human capital”. Accordingly, we want to compare the effects of different aspects of human capital on different outcome variables. We estimate fairly simple labor market equations. The income model is estimated with regress, and the estimation results are stored under the name Inc. . use http://www.stata-press.com/data/r13/income . regress inc edu exp male SS df MS Source Model Residual 2058.44672 4424.05183 3 273 686.148908 16.2053181 Total 6482.49855 276 23.4873136 inc Coef. edu exp male _cons 2.213707 1.47293 .5381153 1.255497 Number of obs F( 3, 273) Prob > F R-squared Adj R-squared Root MSE = = = = = = 277 42.34 0.0000 0.3175 0.3100 4.0256 Std. Err. t P>|t| [95% Conf. Interval] .243247 .231044 .4949466 .3115808 9.10 6.38 1.09 4.03 0.000 0.000 0.278 0.000 1.734828 1.018076 -.436282 .642091 2.692585 1.927785 1.512513 1.868904 . 
est store Inc Being sibling data, the observations are clustered on family of origin, famid. In the estimation of the regression parameters, we did not specify a vce(cluster famid) option to adjust standard errors for clustering on family (famid). Thus the standard errors reported by regress are potentially flawed. This problem will, however, be corrected by specifying a vce(cluster clustvar) option with suest. Next we estimate the promotion equation with probit and again store the results under an appropriate name. suest — Seemingly unrelated estimation . probit promo edu exp male, nolog Probit regression Number of obs LR chi2(3) Prob > chi2 Pseudo R2 Log likelihood = -158.43888 promo Coef. edu exp male _cons .4593002 .3593023 .2079983 -.464622 Std. Err. .0898537 .0805774 .1656413 .1088166 z 5.11 4.46 1.26 -4.27 P>|z| 0.000 0.000 0.209 0.000 = = = = 2247 277 49.76 0.0000 0.1357 [95% Conf. Interval] .2831901 .2013735 -.1166527 -.6778985 .6354102 .5172312 .5326494 -.2513454 . est store Promo The coefficients in the income and promotion equations definitely seem to be different. However, because the scales of the two variables are different, we would not expect the coefficients to be equal. The correct hypothesis here is that the proportionality of the coefficients of the two models, apart from the constant, are equal. This formulation would still reflect that the relative effects of the different aspects of human capital do not differ between the dependent variables. We can obtain a nonlinear Wald test for the hypothesis of proportionality by using the testnl command on the combined estimation results of the two estimators. Thus we first have to form the combined estimation results. At this point, we specify the vce(cluster famid) option to adjust for the clustering of observations on famid. . suest Inc Promo, vce(cluster famid) Simultaneous results for Inc, Promo Number of obs = 277 (Std. Err. adjusted for 135 clusters in famid) Coef. Robust Std. Err. z P>|z| [95% Conf. Interval] Inc_mean edu exp male _cons 2.213707 1.47293 .5381153 1.255497 .2483907 .1890583 .4979227 .3374977 8.91 7.79 1.08 3.72 0.000 0.000 0.280 0.000 1.72687 1.102383 -.4377952 .594014 2.700543 1.843478 1.514026 1.916981 Inc_lnvar _cons 2.785339 .079597 34.99 0.000 2.629332 2.941347 Promo_promo edu exp male _cons .4593002 .3593023 .2079983 -.464622 .0886982 .079772 .1691053 .1042169 5.18 4.50 1.23 -4.46 0.000 0.000 0.219 0.000 .2854549 .2029522 -.1234419 -.6688833 .6331454 .5156525 .5394386 -.2603607 2248 suest — Seemingly unrelated estimation The standard errors reported by suest are identical to those reported by the respective estimation commands when invoked with the vce(cluster famid) option. We are now ready to test for proportionality: Income Income Income βexp βmale βedu H0 : Promotion = Promotion = Promotion βexp βedu βmale It is straightforward to translate this into syntax suitable for testnl, recalling that the coefficient of variable v in equation eq is denoted by [eq]v. . testnl [Inc_mean]edu/[Promo_promo]edu = > [Inc_mean]exp/[Promo_promo]exp = > [Inc_mean]male/[Promo_promo]male (1) (2) [Inc_mean]edu/[Promo_promo]edu = [Inc_mean]exp/[Promo_promo]exp [Inc_mean]edu/[Promo_promo]edu = [Inc_mean]male/[Promo_promo]male chi2(2) = Prob > chi2 = 0.61 0.7385 From the evidence, we fail to reject the hypotheses that the coefficients of the income and promotion equations are proportional. 
Thus it is not unreasonable to assume that income and promotion can be explained by the same latent variable, “labor market success”.

A disadvantage of the nonlinear Wald test is that it is not invariant with respect to representation: a Wald test for a mathematically equivalent formulation of the nonlinear constraint usually leads to a different test result. An equivalent formulation of the proportionality hypothesis is

$$H_0:\quad \beta_{\rm edu}^{\rm Income}\,\beta_{\rm exp}^{\rm Promotion} \;=\; \beta_{\rm exp}^{\rm Income}\,\beta_{\rm edu}^{\rm Promotion} \qquad\text{and}\qquad \beta_{\rm edu}^{\rm Income}\,\beta_{\rm male}^{\rm Promotion} \;=\; \beta_{\rm male}^{\rm Income}\,\beta_{\rm edu}^{\rm Promotion}$$

This formulation is “more linear” in the coefficients. The asymptotic χ2 distribution of the nonlinear Wald statistic can be expected to be more accurate for this representation.

. testnl ([Inc_mean]edu*[Promo_promo]exp = [Inc_mean]exp*[Promo_promo]edu)
>        ([Inc_mean]edu*[Promo_promo]male = [Inc_mean]male*[Promo_promo]edu)
  (1)  [Inc_mean]edu*[Promo_promo]exp = [Inc_mean]exp*[Promo_promo]edu
  (2)  [Inc_mean]edu*[Promo_promo]male = [Inc_mean]male*[Promo_promo]edu
               chi2(2) =        0.46
           Prob > chi2 =      0.7936

Here the two representations lead to similar test statistics and p-values. As before, we fail to reject the hypothesis of proportionality of the coefficients of the two models.

Testing cross-model hypotheses

Example 3

In this example, we demonstrate how some cross-model hypotheses can be tested using the facilities already available in most estimation commands. This demonstration will explain the intricate relationship between the cluster adjustment of the robust estimator of variance and the suest command. It will also be made clear that a new facility is required to perform more general cross-model testing.

We want to test whether the effect of x1 on the binary variable y1 is the same as the effect of x2 on the binary y2; see Clogg, Petkova, and Haritou (1995). In this setting, x1 may equal x2, and y1 may equal y2. We assume that logistic regression models can be used to model the responses, and for simplicity, we ignore further predictor variables in these models. If the two logit models are fit on independent samples so that the estimators are (stochastically) independent, a Wald test for b[x1] = b[x2] rejects the null hypothesis if

$$\frac{\widehat{b}(x_1)-\widehat{b}(x_2)}{\left\{\,\widehat{\sigma}^2\!\left\{\widehat{b}(x_1)\right\}+\widehat{\sigma}^2\!\left\{\widehat{b}(x_2)\right\}\right\}^{1/2}}$$

is larger than the appropriate $\chi^2_1$ threshold. If the models are fit on the same sample (or on dependent samples), so that the estimators are stochastically dependent, the above test that ignores the covariance between the estimators is not appropriate.

It is instructive to see how this problem can be tackled by “stacking” data. In the stacked format, we doubled the number of observations. The dependent variable is y1 in the first half of the data and is y2 in the second half of the data. The predictor variable z1 is set to x1 in the first half of the expanded data and to 0 in the rest. Similarly, z2 is 0 in the first half and x2 in the second half. The following diagram illustrates the transformation, in the terminology of the reshape command, from wide to long format.

    id   y1    y2    x1    x2               id   y     z1    z2    model
     1   y11   y21   x11   x21               1   y11   x11    0      1
     2   y12   y22   x12   x22      =⇒       2   y12   x12    0      1
     3   y13   y23   x13   x23               3   y13   x13    0      1
                                              1   y21    0    x21    2
                                              2   y22    0    x22    2
                                              3   y23    0    x23    2

The observations in the long format data organization are clustered on the original subjects and are identified with the identifier id. The clustering on id has to be accounted for when fitting a simultaneous model.
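Because the diagram is phrased in reshape terms, the same long layout could also be produced with reshape long instead of the stack-based setup shown below; this fragment is only an illustrative alternative and assumes the wide data actually contain the variables id, y1, y2, x1, and x2 from the diagram.

. reshape long y x, i(id) j(model)       // y1 y2 -> y, x1 x2 -> x, with model = 1, 2
. generate z1 = cond(model==1, x, 0)     // x1 in the first half of the data, 0 elsewhere
. generate z2 = cond(model==2, x, 0)     // x2 in the second half of the data, 0 elsewhere

From this point, the analysis proceeds exactly as in the stack-based setup that follows.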
The simplest way to deal with clustering is to use the cluster adjustment of the robust or sandwich estimator; see [P] robust. The data manipulation can be accomplished easily with the stack command; see [D] stack. Subsequently, we fit a simultaneous logit model and perform a Wald test for the hypothesis that the coefficients of z1 and z2 are the same. A full setup to obtain the cross-model Wald test could then be as follows: . . . . . . . generate zero = 0 // a variable generate one = 1 // a variable generate two = 2 // a variable stack id y1 x1 zero one id y2 zero x2 generate model2 = (model==2) logit y model2 z1 z2, vce(cluster id) test _b[z1] = _b[z2] that that that two, is always is always is always into(id y 0 1 2 z1 z2 model) The coefficient of z1 represents the effect of x1 on y1, and similarly, z2 for the effect of x2 on y2. The variable model2 is a dummy for the “second model”, which is included to allow the intercept in the second model to differ from that in the first model. The estimates of the coefficient of z1 and its standard error in the combined model are the same as the estimates of the coefficient of z1 and its standard error if we fit the model on the unstacked data. . logit y1 x1, vce(robust) 2250 suest — Seemingly unrelated estimation The vce(cluster clustvar) option specified with the logit command for the stacked data ensures that the covariances of b[z1] and b[z2] are indeed estimated. This estimation ensures that the Wald test for the equality of the coefficients is correct. If we had not specified the vce(cluster clustvar) option, the (co)variance matrix of the coefficients would have been block-diagonal; that is, the covariances of b[z1] and b[z2] would have been 0. Then test would have effectively used the invalid formula for the Wald test for two independent samples. In this example, the two logit models were fit on the same data. The same setup would apply, without modification, when the two logit models were fit on overlapping data that resulted, for instance, if the y or x variables were missing in some observations. The suest command allows us to obtain the above Wald test more efficiently by avoiding the data manipulation, obviating the need to fit a model with twice the number of coefficients. The test statistic produced by the above code fragment is identical to that obtained via suest on the original (unstacked) data: . logit y1 x1 . estimates store M1 . . . . logit y2 x2 estimates store M2 suest M1 M2 test [M1]x1=[M2]x2 The stacking method can be applied not only to the testing of cross-model hypotheses for logit models but also to any estimation command that supports the vce(cluster clustvar) option. The stacking approach clearly generalizes to stacking more than two logit or other models, testing more general linear hypotheses, and testing nonlinear cross-model hypotheses (see [R] testnl). In all of these cases, suest would yield identical statistical results but at smaller costs in terms of data management, computer storage, and computer time. Is suest nothing but a convenience command? No, there are two disadvantages to the stacking method, both of which are resolved via suest. First, if the models include ancillary parameters (in a regression model, the residual variance; in an ordinal response model, the cutpoints; and in lognormal survival-time regression, the time scale parameter), these parameters are constrained to be equal between the stacked models. In suest, this constraint is relaxed. 
Second, the stacking method does not generalize to compare different statistical models, such as a probit model and a regression model. As demonstrated in the previous section, suest can deal with this situation.

Stored results

suest stores the following in e():

Scalars
    e(N)             number of observations
    e(N_clust)       number of clusters
    e(rank)          rank of e(V)

Macros
    e(cmd)           suest
    e(eqnames#)      original names of equations of model #
    e(names)         list of model names
    e(wtype)         weight type
    e(wexp)          weight expression
    e(clustvar)      name of cluster variable
    e(vce)           vcetype specified in vce()
    e(vcetype)       title used to label Std. Err.
    e(properties)    b V

Matrices
    e(b)             stacked coefficient vector of the models
    e(V)             variance–covariance matrix of the estimators

Functions
    e(sample)        marks estimation sample

Methods and formulas

The estimation of the simultaneous (co)variance of a series of k estimators is a nonstandard application of the sandwich estimator, as implemented by the command [P] robust. You may want to read this manual entry before reading further.

The starting point is that we have fit k different models on the same data—partially overlapping or nonoverlapping data are special cases. We want to derive the simultaneous distribution of these k estimators, for instance, to test a cross-estimator hypothesis H0. As in the framework of Hausman testing, H0 will often be of the form that different estimators have the same probability limit under some hypothesis, while the estimators have different limits if the hypothesis is violated.

We consider (vector) estimators $\widehat{\beta}_i$ to be defined as “the” solution of the estimation equations $G_i$,

$$G_i(\beta_i) \;=\; \sum_j w_{ij}\,u_{ij}(\beta_i) \;=\; 0, \qquad i = 1,\dots,k$$

We refer to the $u_{ij}$ as the “scores”. Specifying some weights $w_{ij}=0$ trivially accommodates for partially overlapping or even disjointed data. Under “suitable regularity conditions” (see White [1982; 1996] for details), the $\widehat{\beta}_i$ are asymptotically normally distributed, with the variance estimated consistently by the sandwich estimator

$$V_i \;=\; \widehat{\rm Var}(\widehat{\beta}_i) \;=\; D_i^{-1}\Big(\sum_j w_{ij}\,u_{ij}u_{ij}'\Big)D_i^{-1}$$

where $D_i$ is the Jacobian of $G_i$ evaluated at $\widehat{\beta}_i$. In the context of maximum likelihood estimation, $D_i$ can be estimated consistently by (minus) the Hessian of the log likelihood or by the Fisher information matrix. If the model is also well specified, the sandwiched term $\big(\sum_j w_{ij}u_{ij}u_{ij}'\big)$ converges in probability to $D_i$, so $V_i$ may be consistently estimated by $D_i^{-1}$.

To derive the simultaneous distribution of the estimators, we consider the “stacked” estimation equation,

$$G(\beta) \;=\; \big\{\,G_1(\beta_1)'\;\;G_2(\beta_2)'\;\cdots\;G_k(\beta_k)'\,\big\}' \;=\; 0$$

Under “suitable regularity conditions” (see White [1996] for details), $\widehat{\beta}$ is asymptotically jointly normally distributed. The Jacobian and scores of the simultaneous equation are easily expressed in the Jacobian and scores of the separate equations. The Jacobian of G,

$$D(\widehat{\beta}) \;=\; \left.\frac{dG(\beta)}{d\beta}\right|_{\beta=\widehat{\beta}}$$

is block diagonal with blocks $D_1,\dots,D_k$. The inverse of $D(\widehat{\beta})$ is again block diagonal, with the inverses of $D_i$ on the diagonal. The scores u of G are simply obtained as the concatenated scores of the separate equations:

$$u_j \;=\; (\,u_{1j}'\;\;u_{2j}'\;\cdots\;u_{kj}'\,)'$$

Out-of-sample (that is, where $w_{ij}=0$) values of the score variables are defined as 0 (thus we drop the i subscript from the common weight variable). The sandwich estimator for the asymptotic variance of $\widehat{\beta}$ reads

$$V \;=\; \widehat{\rm Var}(\widehat{\beta}) \;=\; D(\widehat{\beta})^{-1}\Big(\sum_j w_j\,u_j u_j'\Big)D(\widehat{\beta})^{-1}$$

Taking a “partitioned” look at this expression, we see that $V(\widehat{\beta}_i)$ is estimated by

$$D_i^{-1}\Big(\sum_j w_j\,u_{ij}u_{ij}'\Big)D_i^{-1}$$

which is, yet again, the familiar sandwich estimator for $\widehat{\beta}_i$ based on the separate estimation equation $G_i$. Thus considering several estimators simultaneously in this way does not affect the estimators of the asymptotic variances of these estimators. However, as a bonus of stacking, we obtained a sandwich-type estimate of the covariance $V_{ih}$ of estimators $\widehat{\beta}_i$ and $\widehat{\beta}_h$,

$$V_{ih} \;=\; \widehat{\rm Cov}(\widehat{\beta}_i,\widehat{\beta}_h) \;=\; D_i^{-1}\Big(\sum_j w_j\,u_{ij}u_{ih}'\Big)D_h^{-1}$$

which is also obtained by White (1982).

This estimator for the covariance of estimators is an application of the cluster modification of the sandwich estimator proposed by Rogers (1993). Consider the stacked data format as discussed in the logit example, and assume that Stata would be able to estimate a “stacked model” in which different models apply to different observations, for example, a probit model for the first half, a regression model for the second half, and a one-to-one cluster relation between the first and second half. If there are no common parameters to both models, the score statistics of parameters for the stacked models are zero in the half of the data in which they do not occur. In Rogers’ method, we have to sum the score statistics over the observations within a cluster. This step boils down to concatenating the score statistics at the level of the cluster.

We compare the sandwich estimator of the (co)variance $V_{12}$ of two estimators with the estimator of variance $\widetilde{V}_{12}$ applied in the classic Hausman test. Hausman (1978) showed that if $\widehat{\beta}_1$ is consistent under H0 and $\widehat{\beta}_2$ is efficient under H0, then asymptotically

$$\mathrm{Cov}(\widehat{\beta}_1,\widehat{\beta}_2) \;=\; \mathrm{Var}(\widehat{\beta}_2)$$

and so $\mathrm{var}(\widehat{\beta}_1-\widehat{\beta}_2)$ is consistently estimated by $V_1 - V_2$.

Acknowledgment

suest was written by Jeroen Weesie of the Department of Sociology at Utrecht University, The Netherlands. This research is supported by grant PGS 50-370 by The Netherlands Organization for Scientific Research. An earlier version of suest was published in the Stata Technical Bulletin (1999). The current version of suest is not backward compatible with the STB version because of the introduction of new ways to manage estimation results via the estimates command.

References

Arminger, G. 1990. Testing against misspecification in parametric rate models. In Event History Analysis in Life Course Research, ed. K. U. Mayer and N. B. Tuma, 146–158. Madison: University of Wisconsin Press.
Clogg, C. C., E. Petkova, and A. Haritou. 1995. Statistical methods for comparing regression coefficients between models. American Journal of Sociology 100: 1261–1312. (With comments by P. D. Allison and a reply by C. C. Clogg, E. Petkova, and T. Cheng).
Gourieroux, C. S., and A. Monfort. 1997. Time Series and Dynamic Models. Trans. ed. G. M. Gallo. Cambridge: Cambridge University Press.
Hausman, J. A. 1978. Specification tests in econometrics. Econometrica 46: 1251–1271.
Hausman, J. A., and D. L. McFadden. 1984. Specification tests for the multinomial logit model. Econometrica 52: 1219–1240.
Hausman, J. A., and P. A. Ruud. 1987. Specifying and testing econometric models for rank-ordered data. Journal of Econometrics 34: 83–104.
Huber, P. J. 1967. The behavior of maximum likelihood estimates under nonstandard conditions. In Vol.
1 of Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 221–233. Berkeley: University of California Press. Rogers, W. H. 1993. sg16.4: Comparison of nbreg and glm for negative binomial. Stata Technical Bulletin 16: 7. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 82–84. College Station, TX: Stata Press. Weesie, J. 1999. sg121: Seemingly unrelated estimation and the cluster-adjusted sandwich estimator. Stata Technical Bulletin 52: 34–47. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 231–248. College Station, TX: Stata Press. White, H. L., Jr. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50: 1–25. . 1996. Estimation, Inference and Specification Analysis. Cambridge: Cambridge University Press. 2254 suest — Seemingly unrelated estimation Also see [R] estimates — Save and manipulate estimation results [R] hausman — Hausman specification test [R] lincom — Linear combinations of estimators [R] nlcom — Nonlinear combinations of estimators [R] test — Test linear hypotheses after estimation [R] testnl — Test nonlinear hypotheses after estimation [P] robust — Robust variance estimates Title summarize — Summary statistics Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax summarize varlist if in weight , options Description options Main detail meanonly format separator(#) display options display additional statistics suppress the display; calculate only the mean; programmer’s option use variable’s display format draw separator line after every # variables; default is separator(5) control spacing, line width, and base and empty cells varlist may contain factor variables; see [U] 11.4.3 Factor variables. varlist may contain time-series operators; see [U] 11.4.4 Time-series varlists. by, rolling, and statsby are allowed; see [U] 11.1.10 Prefix commands. aweights, fweights, and iweights are allowed. However, iweights may not be used with the detail option; see [U] 11.1.6 weight. Menu Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Summary statistics Description summarize calculates and displays a variety of univariate summary statistics. If no varlist is specified, summary statistics are calculated for all the variables in the dataset. Also see [R] ci for calculating the standard error and confidence intervals of the mean. Options Main detail produces additional statistics, including skewness, kurtosis, the four smallest and largest values, and various percentiles. meanonly, which is allowed only when detail is not specified, suppresses the display of results and calculation of the variance. Ado-file writers will find this useful for fast calls. 2255 2256 summarize — Summary statistics format requests that the summary statistics be displayed using the display formats associated with the variables rather than the default g display format; see [U] 12.5 Formats: Controlling how data are displayed. separator(#) specifies how often to insert separation lines into the output. The default is separator(5), meaning that a line is drawn after every five variables. separator(10) would draw a line after every 10 variables. separator(0) suppresses the separation line. display options: vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), and fvwrapon(style); see [R] estimation options. Remarks and examples summarize can produce two different sets of summary statistics. 
Without the detail option, the number of nonmissing observations, the mean and standard deviation, and the minimum and maximum values are presented. With detail, the same information is presented along with the variance, skewness, and kurtosis; the four smallest and four largest values; and the 1st, 5th, 10th, 25th, 50th (median), 75th, 90th, 95th, and 99th percentiles. Example 1: summarize with the separator() option We have data containing information on various automobiles, among which is the variable mpg, the mileage rating. We can obtain a quick summary of the mpg variable by typing . use http://www.stata-press.com/data/r13/auto2 (1978 Automobile Data) . summarize mpg Variable Obs Mean mpg 74 21.2973 Std. Dev. Min Max 5.785503 12 41 We see that we have 74 observations. The mean of mpg is 21.3 miles per gallon, and the standard deviation is 5.79. The minimum is 12, and the maximum is 41. If we had not specified the variable (or variables) we wanted to summarize, we would have obtained summary statistics on all the variables in the dataset: . summarize, separator(4) Variable Obs Mean Std. Dev. Min Max make price mpg rep78 0 74 74 69 6165.257 21.2973 3.405797 2949.496 5.785503 .9899323 3291 12 1 15906 41 5 headroom trunk weight length 74 74 74 74 2.993243 13.75676 3019.459 187.9324 .8459948 4.277404 777.1936 22.26634 1.5 5 1760 142 5 23 4840 233 turn displacement gear_ratio foreign 74 74 74 74 39.64865 197.2973 3.014865 .2972973 4.399354 91.83722 .4562871 .4601885 31 79 2.19 0 51 425 3.89 1 There are only 69 observations on rep78, so some of the observations are missing. There are no observations on make because it is a string variable. summarize — Summary statistics 2257 The idea of the mean is quite old (Plackett 1958), but its extension to a scheme of moment-based measures was not done until the end of the 19th century. Between 1893 and 1905, Pearson discussed and named the standard deviation, skewness, and kurtosis, but he was not the first to use any of these. Thiele (1889), in contrast, had earlier firmly grasped the notion that the mr provide a systematic basis for discussing distributions. However, even earlier anticipations can also be found. For example, Euler in 1778 used m2 and m3 in passing in a treatment of estimation (Hald 1998, 87), but seemingly did not build on that. Similarly, the idea of the median is quite old. The history of the interquartile range is tangled up with that of the probable error, a long-popular measure. Extending this in various ways to a more general approach based on quantiles (to use a later term) occurred to several people in the nineteenth century. Galton (1875) is a nice example, particularly because he seems so close to the key idea of the quantiles as a function, which took another century to reemerge strongly. Thorvald Nicolai Thiele (1838–1910) was a Danish scientist who worked in astronomy, mathematics, actuarial science, and statistics. He made many pioneering contributions to statistics, several of which were overlooked until recently. Thiele advocated graphical analysis of residuals checking for trends, symmetry of distributions, and changes of sign, and he even warned against overinterpreting such graphs. Example 2: summarize with the detail option The detail option provides all the information of a normal summarize and more. The format of the output also differs, as shown here: . 
summarize mpg, detail Mileage (mpg) 1% 5% 10% 25% 50% 75% 90% 95% 99% Percentiles 12 14 14 18 20 25 29 34 41 Smallest 12 12 14 14 Largest 34 35 35 41 Obs Sum of Wgt. Mean Std. Dev. 74 74 21.2973 5.785503 Variance Skewness Kurtosis 33.47205 .9487176 3.975005 As in the previous example, we see that the mean of mpg is 21.3 miles per gallon and that the standard deviation is 5.79. We also see the various percentiles. The median of mpg (the 50th percentile) is 20 miles per gallon. The 25th percentile is 18, and the 75th percentile is 25. When we performed summarize, we learned that the minimum and maximum were 12 and 41, respectively. We now see that the four smallest values in our dataset are 12, 12, 14, and 14. The four largest values are 34, 35, 35, and 41. The skewness of the distribution is 0.95, and the kurtosis is 3.98. (A normal distribution would have a skewness of 0 and a kurtosis of 3.) Skewness is a measure of the lack of symmetry of a distribution. If the distribution is symmetric, the coefficient of skewness is 0. If the coefficient is negative, the median is usually greater than the mean and the distribution is said to be skewed left. If the coefficient is positive, the median is usually less than the mean and the distribution is said to be skewed right. Kurtosis (from the Greek kyrtosis, meaning curvature) is a measure of peakedness of a distribution. The smaller the coefficient of kurtosis, the flatter the distribution. The normal distribution has a coefficient of kurtosis of 3 and provides a convenient benchmark. 2258 summarize — Summary statistics Technical note The convention of calculating the median of an even number of values by averaging the central two order statistics is of long standing. (That is, given 8 values, average the 4th and 5th smallest values, or given 42, average the 21st and 22nd smallest.) Stigler (1977) filled a much-needed gap in the literature by naming such paired central order statistics as “comedians”, although it remains unclear how far he was joking. Example 3: summarize with the by prefix summarize can usefully be combined with the by varlist: prefix. In our dataset, we have a variable, foreign, that distinguishes foreign and domestic cars. We can obtain summaries of mpg and weight within each subgroup by typing . by foreign: summarize mpg weight -> foreign = Domestic Variable Obs mpg weight Mean 52 52 19.82692 3317.115 -> foreign = Foreign Variable Obs Mean mpg weight 22 22 24.77273 2315.909 Std. Dev. Min Max 4.743297 695.3637 12 1800 34 4840 Std. Dev. Min Max 6.611187 433.0035 14 1760 41 3420 Domestic cars in our dataset average 19.8 miles per gallon, whereas foreign cars average 24.8. Because by varlist: can be combined with summarize, it can also be combined with summarize, detail: summarize — Summary statistics 2259 . by foreign: summarize mpg, detail -> foreign = Domestic Mileage (mpg) 1% 5% 10% 25% Percentiles 12 14 14 16.5 50% Smallest 12 12 14 14 Obs Sum of Wgt. 19 75% 90% 95% 99% 22 26 29 34 Largest 28 29 30 34 52 52 Mean Std. Dev. 19.82692 4.743297 Variance Skewness Kurtosis 22.49887 .7712432 3.441459 -> foreign = Foreign Mileage (mpg) 1% 5% 10% 25% Percentiles 14 17 17 21 50% Smallest 14 17 17 18 Obs Sum of Wgt. 24.5 75% 90% 95% 99% 28 35 35 41 Largest 31 35 35 41 22 22 Mean Std. Dev. 24.77273 6.611187 Variance Skewness Kurtosis 43.70779 .657329 3.10734 Technical note summarize respects display formats if we specify the format option. When we type summarize price weight, we obtain . 
summarize price weight Variable Obs Mean price weight 74 74 6165.257 3019.459 Std. Dev. Min Max 2949.496 777.1936 3291 1760 15906 4840 The display is accurate but is not as aesthetically pleasing as we may wish, particularly if we plan to use the output directly in published work. By placing formats on the variables, we can control how the table appears: . format price weight %9.2fc . summarize price weight, format Variable Obs Mean price weight 74 74 6,165.26 3,019.46 Std. Dev. 2,949.50 777.19 Min Max 3,291.00 1,760.00 15,906.00 4,840.00 2260 summarize — Summary statistics If you specify a weight (see [U] 11.1.6 weight), each observation is multiplied by the value of the weighting expression before the summary statistics are calculated so that the weighting expression is interpreted as the discrete density of each observation. Example 4: summarize with factor variables You can also use summarize to obtain summary statistics for factor variables. For example, if you type . summarize i.rep78 Variable rep78 Fair Average Good Excellent Obs Mean 69 69 69 69 .115942 .4347826 .2608696 .1594203 Std. Dev. .3225009 .4993602 .4423259 .3687494 Min Max 0 0 0 0 1 1 1 1 you obtain the sample proportions for four of the five levels of the rep78 variable. For example, 11.6% of the 69 cars with nonmissing values of rep78 have a fair repair record. When you use factor-variable notation, the base category is suppressed by default. If you type . summarize bn.rep78 Variable Obs rep78 Poor Fair Average Good Excellent 69 69 69 69 69 Mean .0289855 .115942 .4347826 .2608696 .1594203 Std. Dev. .1689948 .3225009 .4993602 .4423259 .3687494 Min Max 0 0 0 0 0 1 1 1 1 1 the notation bn.rep78 indicates that Stata should not suppress the base category so that we see the proportions for all five levels. We could have used tabulate oneway rep78 to obtain the sample proportions along with the cumulative proportions. Alternatively, we could have used proportions rep78 to obtain the sample proportions along with the standard errors of the proportions instead of the standard deviations of the proportions. Example 5: summarize with weights We have 1980 census data on each of the 50 states. Included in our variables is medage, the median age of the population of each state. If we type summarize medage, we obtain unweighted statistics: . use http://www.stata-press.com/data/r13/census (1980 Census data by state) . summarize medage Variable Obs Mean Std. Dev. Min Max medage 50 29.54 1.693445 24.2 34.7 Also among our variables is pop, the population in each state. Typing summarize medage [w=pop] produces population-weighted statistics: summarize — Summary statistics . summarize medage [w=pop] (analytic weights assumed) Obs Variable medage 50 Weight Mean 225907472 30.11047 Std. Dev. Min Max 1.66933 24.2 34.7 2261 The number listed under Weight is the sum of the weighting variable, pop, indicating that there are roughly 226 million people in the United States. The pop-weighted mean of medage is 30.11 (compared with 29.54 for the unweighted statistic), and the weighted standard deviation is 1.67 (compared with 1.69). Example 6: summarize with weights and the detail option We can obtain detailed summaries of weighted data as well. When we do this, all the statistics are weighted, including the percentiles. . summarize medage [w=pop], detail (analytic weights assumed) Median age 1% 5% 10% 25% 50% Percentiles 27.1 27.7 28.2 29.2 29.9 75% 90% 95% 99% 30.9 32.1 32.2 34.7 Smallest 24.2 26.1 27.1 27.4 Largest 32 32.1 32.2 34.7 Obs Sum of Wgt. 
                        50
                 225907472
Mean              30.11047
Std. Dev.          1.66933
Variance          2.786661
Skewness          .5281972
Kurtosis          4.494223

Technical note

If you are writing a program and need to access the mean of a variable, the meanonly option provides for fast calls. For example, suppose that your program reads as follows:

    program mean
            summarize `1', meanonly
            display " mean = " r(mean)
    end

The result of executing this is

. use http://www.stata-press.com/data/r13/auto2
(1978 Automobile Data)
. mean price
 mean = 6165.2568

Video example

Descriptive statistics in Stata

Stored results

summarize stores the following in r():

Scalars
    r(N)           number of observations
    r(mean)        mean
    r(skewness)    skewness (detail only)
    r(min)         minimum
    r(max)         maximum
    r(sum_w)       sum of the weights
    r(p1)          1st percentile (detail only)
    r(p5)          5th percentile (detail only)
    r(p10)         10th percentile (detail only)
    r(p25)         25th percentile (detail only)
    r(p50)         50th percentile (detail only)
    r(p75)         75th percentile (detail only)
    r(p90)         90th percentile (detail only)
    r(p95)         95th percentile (detail only)
    r(p99)         99th percentile (detail only)
    r(Var)         variance
    r(kurtosis)    kurtosis (detail only)
    r(sum)         sum of variable
    r(sd)          standard deviation

Methods and formulas

Let x denote the variable on which we want to calculate summary statistics, and let $x_i$, $i = 1,\dots,n$, denote an individual observation on x. Let $v_i$ be the weight, and if no weight is specified, define $v_i = 1$ for all i. Define V as the sum of the weight:

$$V \;=\; \sum_{i=1}^{n} v_i$$

Define $w_i$ to be $v_i$ normalized to sum to n, $w_i = v_i(n/V)$.

The mean, $\bar{x}$, is defined as

$$\bar{x} \;=\; \frac{1}{n}\sum_{i=1}^{n} w_i x_i$$

The variance, $s^2$, is defined as

$$s^2 \;=\; \frac{1}{n-1}\sum_{i=1}^{n} w_i (x_i - \bar{x})^2$$

The standard deviation, s, is defined as $\sqrt{s^2}$.

Define $m_r$ as the rth moment about the mean $\bar{x}$:

$$m_r \;=\; \frac{1}{n}\sum_{i=1}^{n} w_i (x_i - \bar{x})^r$$

The coefficient of skewness is then defined as $m_3\,m_2^{-3/2}$. The coefficient of kurtosis is defined as $m_4\,m_2^{-2}$.

Let $x_{(i)}$ refer to the x in ascending order, and let $w_{(i)}$ refer to the corresponding weights of $x_{(i)}$. The four smallest values are $x_{(1)}$, $x_{(2)}$, $x_{(3)}$, and $x_{(4)}$. The four largest values are $x_{(n)}$, $x_{(n-1)}$, $x_{(n-2)}$, and $x_{(n-3)}$.

To obtain the pth percentile, which we will denote as $x_{[p]}$, let $P = np/100$. Let

$$W_{(i)} \;=\; \sum_{j=1}^{i} w_{(j)}$$

Find the first index i such that $W_{(i)} > P$. The pth percentile is then

$$x_{[p]} \;=\; \begin{cases} \dfrac{x_{(i-1)} + x_{(i)}}{2} & \text{if } W_{(i-1)} = P \\[1ex] x_{(i)} & \text{otherwise} \end{cases}$$

References

Cox, N. J. 2010. Speaking Stata: The limits of sample skewness and kurtosis. Stata Journal 10: 482–495.
David, H. A. 2001. First (?) occurrence of common terms in statistics and probability. In Annotated Readings in the History of Statistics, ed. H. A. David and A. W. F. Edwards, 209–246. New York: Springer.
Galton, F. 1875. Statistics by intercomparison, with remarks on the law of frequency of error. Philosophical Magazine 49: 33–46.
Gleason, J. R. 1997. sg67: Univariate summaries with boxplots. Stata Technical Bulletin 36: 23–25. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 179–183. College Station, TX: Stata Press.
———. 1999. sg67.1: Update to univar. Stata Technical Bulletin 51: 27–28. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 159–161. College Station, TX: Stata Press.
Hald, A. 1998. A History of Mathematical Statistics from 1750 to 1930. New York: Wiley.
Hamilton, L. C. 1996. Data Analysis for Social Scientists. Belmont, CA: Duxbury.
———. 2013. Statistics with Stata: Updated for Version 12. 8th ed. Boston: Brooks/Cole.
Kirkwood, B. R., and J. A. C. Sterne. 2003. Essential Medical Statistics. 2nd ed.
Malden, MA: Blackwell. Lauritzen, S. L. 2002. Thiele: Pioneer in Statistics. Oxford: Oxford University Press. Plackett, R. L. 1958. Studies in the history of probability and statistics: VII. The principle of the arithmetic mean. Biometrika 45: 130–135. Stigler, S. M. 1977. Fractional order statistics, with applications. Journal of the American Statistical Association 72: 544–550. Stuart, A., and J. K. Ord. 1994. Kendall’s Advanced Theory of Statistics: Distribution Theory, Vol I. 6th ed. London: Arnold. Thiele, T. N. 1889. Forelæsringer over Almindelig Iagttagelseslære: Sandsynlighedsregning og mindste Kvadraters Methode. Kjøbenhavn: C.A. Reitzel. (English translation included in Lauritzen 2002). Weisberg, H. F. 1992. Central Tendency and Variability. Newbury Park, CA: Sage. 2264 summarize — Summary statistics Also see [R] ameans — Arithmetic, geometric, and harmonic means [R] centile — Report centile and confidence interval [R] mean — Estimate means [R] proportion — Estimate proportions [R] ratio — Estimate ratios [R] table — Flexible table of summary statistics [R] tabstat — Compact table of summary statistics [R] tabulate, summarize() — One- and two-way tables of summary statistics [R] total — Estimate totals [D] codebook — Describe data contents [D] describe — Describe data in memory or in file [D] inspect — Display simple summary of data’s attributes [ST] stsum — Summarize survival-time data [SVY] svy estimation — Estimation commands for survey data [XT] xtsum — Summarize xt data Title sunflower — Density-distribution sunflower plots Syntax Remarks and examples Menu Acknowledgments Description References Options Syntax sunflower yvar xvar options if in weight , options Description Main nograph notable marker options do not show graph do not show summary table; implied when by() is specified affect rendition of markers drawn at the plotted points Bins/Petals binwidth(#) binar(#) bin options light(#) dark(#) xcenter(#) ycenter(#) petalweight(#) petallength(#) petal options flowersonly nosinglepetal width of the hexagonal bins aspect ratio of the hexagonal bins affect rendition of hexagonal bins minimum observations for a light sunflower; default is light(3) minimum observations for a dark sunflower; default is dark(13) x-coordinate of the reference bin y -coordinate of the reference bin observations in a dark sunflower petal length of sunflower petal as a percentage affect rendition of sunflower petals show petals only; do not render bins suppress single petals Add plots addplot(plot) add other plots to generated graph Y axis, X axis, Titles, Legend, Overall, By twoway options any options documented in [G-3] twoway options 2265 2266 sunflower — Density-distribution sunflower plots bin options l|d bstyle(areastyle) l|d bcolor(colorstyle) l|d bfcolor(colorstyle) l|d blstyle(linestyle) l|d blcolor(colorstyle) l|d blwidth(linewidthstyle) petal options l|d flstyle(linestyle) l|d flcolor(colorstyle) l|d flwidth(linewidthstyle) Description overall look of hexagonal bins outline and fill color fill color overall look of outline outline color thickness of outline Description overall style of sunflower petals color of sunflower petals thickness of sunflower petals All options are rightmost; see [G-4] concept: repeated options. fweights are allowed; see [U] 11.1.6 weight. Menu Graphics > Smoothing and densities > Density-distribution sunflower plot Description sunflower draws density-distribution sunflower plots (Plummer and Dupont 2003). 
These plots are useful for displaying bivariate data whose density is too great for conventional scatterplots to be effective. A sunflower is several line segments of equal length, called petals, that radiate from a central point. There are two varieties of sunflowers: light and dark. Each petal of a light sunflower represents 1 observation. Each petal of a dark sunflower represents several observations. Dark and light sunflowers represent high- and medium-density regions of the data, and marker symbols represent individual observations in low-density regions. The plane defined by the variables yvar and xvar is divided into contiguous hexagonal bins. The number of observations contained within a bin determines how the bin will be represented. • When there are fewer than light(#) observations in a bin, each point is plotted using the usual marker symbols in a scatterplot. • Bins with at least light(#) but fewer than dark(#) observations are represented by a light sunflower. • Bins with at least dark(#) observations are represented by a dark sunflower. Options Main nograph prevents the graph from being generated. sunflower — Density-distribution sunflower plots 2267 notable prevents the summary table from being displayed. This option is implied when the by() option is specified. marker options affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G-3] marker options. Bins/Petals binwidth(#) specifies the horizontal width of the hexagonal bins in the same units as xvar. By default, binwidth = max(rbw, nbw) where rbw = range of xvar/40 nbw = range of xvar/max(1,nb) and nb = int(min(sqrt(n),10 * log10(n))) where n = the number of observations in the dataset binar(#) specifies the aspect ratio for the hexagonal bins. The height of the bins is given by √ binheight = binwidth × # × 2/ 3 where binheight and binwidth are specified in the units of yvar and xvar, respectively. The default is binar(r), where r results in the rendering of regular hexagons. bin options affect how the hexagonal bins are rendered. lbstyle(areastyle) and dbstyle(areastyle) specify the look of the light and dark hexagonal bins, respectively. The options listed below allow you to change each attribute, but lbstyle() and dbstyle() provide the starting points. See [G-4] areastyle for a list of available area styles. lbcolor(colorstyle) and dbcolor(colorstyle) specify one color to be used both to outline the shape and to fill the interior of the light and dark hexagonal bins, respectively. See [G-4] colorstyle for a list of color choices. lbfcolor(colorstyle) and dbfcolor(colorstyle) specify the color to be used to fill the interior of the light and dark hexagonal bins, respectively. See [G-4] colorstyle for a list of color choices. lblstyle(linestyle) and dblstyle(linestyle) specify the overall style of the line used to outline the area, which includes its pattern (solid, dashed, etc.), thickness, and color. The other options listed below allow you to change the line’s attributes, but lblstyle() and dblstyle() are the starting points. See [G-4] linestyle for a list of choices. lblcolor(colorstyle) and dblcolor(colorstyle) specify the color to be used to outline the light and dark hexagonal bins, respectively. See [G-4] colorstyle for a list of color choices. lblwidth(linewidthstyle) and dblwidth(linewidthstyle) specify the thickness of the line to be used to outline the light and dark hexagonal bins, respectively. See [G-4] linewidthstyle for a list of choices. 
2268 sunflower — Density-distribution sunflower plots light(#) specifies the minimum number of observations needed for a bin to be represented by a light sunflower. The default is light(3). dark(#) specifies the minimum number of observations needed for a bin to be represented by a dark sunflower. The default is dark(13). xcenter(#) and ycenter(#) specify the center of the reference bin. The default values are the median values of xvar and yvar, respectively. The centers of the other bins are implicitly defined by the location of the reference bin together with the common bin width and height. petalweight(#) specifies the number of observations represented by each petal of a dark sunflower. The default value is chosen so that the maximum number of petals on a dark sunflower is 14. petallength(#) specifies the length of petals in the sunflowers. The value specified is interpreted as a percentage of half the bin width. The default is 100%. petal options affect how the sunflower petals are rendered. lflstyle(linestyle) and dflstyle(linestyle) specify the overall style of the light and dark sunflower petals, respectively. lflcolor(colorstyle) and dflcolor(colorstyle) specify the color of the light and dark sunflower petals, respectively. lflwidth(linewidthstyle) and dflwidth(linewidthstyle) specify the width of the light and dark sunflower petals, respectively. flowersonly suppresses rendering of the bins. This option is equivalent to specifying lbcolor(none) and dbcolor(none). nosinglepetal suppresses flowers from being drawn in light bins that contain only 1 observation and dark bins that contain as many observations as the petal weight (see the petalweight() option). Add plots addplot(plot) provides a way to add other plots to the generated graph; see [G-3] addplot option. Y axis, X axis, Titles, Legend, Overall, By twoway options are any of the options documented in [G-3] twoway options. These include options for titling the graph (see [G-3] title options), options for saving the graph to disk (see [G-3] saving option), and the by() option (see [G-3] by option). Remarks and examples See Dupont (2009, 87–92) for a discussion of sunflower plots and how to create them using Stata. Example 1 Using the auto dataset, we want to examine the relationship between weight and mpg. To do that, we type sunflower — Density-distribution sunflower plots 2269 . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . sunflower mpg weight, binwid(500) petalw(2) dark(8) scheme(s2color) Bin width = 500 Bin height = 8.38703 Bin aspect ratio = .0145268 Max obs in a bin = 15 Light = 3 Dark = 8 X-center = 3190 Y-center = 20 Petal weight = 2 petal weight No. of petals No. of flowers estimated obs. actual obs. none light light light dark dark dark 1 1 1 2 2 2 3 4 7 4 5 8 1 2 3 1 1 1 10 3 8 21 8 10 16 10 3 8 21 8 9 15 76 74 10 Mileage (mpg) 20 30 40 flower type 1,000 2,000 3,000 Weight (lbs.) Mileage (mpg) 1 petal = 2 obs. 4,000 5,000 1 petal = 1 obs. The three darkly shaded sunflowers immediately catch our eyes, indicating a group of eight cars that are heavy (nearly 4,000 pounds) and fuel inefficient and two groups of cars that get about 20 miles per gallon and weight in the neighborhood of 3,000 pounds, one with 10 cars and one with 8 cars. The lighter sunflowers with seven petals each indicate groups of seven cars that share similar weight and fuel economy characteristics. 
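The rendering options described under Options above can be combined freely. The following sketch is not part of the original example; the bin width, the light() and dark() thresholds, and the use of flowersonly are arbitrary illustrative choices, not recommendations:

. use http://www.stata-press.com/data/r13/auto, clear
(1978 Automobile Data)
. * wider bins, a lower dark-sunflower threshold, and petals drawn without bins
. sunflower mpg weight, binwidth(600) light(4) dark(6) flowersonly
(output omitted )
. * same plot, but suppress the summary table and single-observation petals
. sunflower mpg weight, notable nosinglepetal
(output omitted )

Returning to the original graph shown above: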
To obtain the number of cars in each group, we counted the number of petals in each flower and consulted the graph legend to see how many observations each petal represents. 2270 sunflower — Density-distribution sunflower plots Acknowledgments We thank William D. Dupont and W. Dale Plummer Jr., both of the Department of Biostatistics at Vanderbilt University, who are the authors of the original sunflower command, for their assistance in producing this version. References Cleveland, W. S., and R. McGill. 1984. The many faces of a scatterplot. Journal of the American Statistical Association 79: 807–822. Dupont, W. D. 2009. Statistical Modeling for Biomedical Researchers: A Simple Introduction to the Analysis of Complex Data. 2nd ed. Cambridge: Cambridge University Press. Dupont, W. D., and W. D. Plummer, Jr. 2005. Using density-distribution sunflower plots to explore bivariate relationships in dense data. Stata Journal 5: 371–384. Huang, C., J. A. McDonald, and W. Stuetzle. 1997. Variable resolution bivariate plots. Journal of Computational and Graphical Statistics 6: 383–396. Levy, D. E. 1999. 50 Years of Discovery: Medical Milestones from the National Heart, Lung, and Blood Institute’s Framingham Heart Study. Hoboken, NJ: Center for Bio-Medical Communication. Plummer, W. D., Jr., and W. D. Dupont. 2003. Density distribution sunflower plots. Journal of Statistical Software 8: 1–11. Steichen, T. J., and N. J. Cox. 1999. flower: Stata module to draw sunflower plots. Boston College Department of Economics, Statistical Software Components S393001. http://ideas.repec.org/c/boc/bocode/s393001.html. Title sureg — Zellner’s seemingly unrelated regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax Basic syntax sureg (depvar1 varlist1 ) (depvar2 varlist2 ) . . . (depvarN varlistN ) if in weight Full syntax sureg ( eqname1 : depvar1a depvar1b . . . = varlist1 , noconstant ) ( eqname2 : depvar2a depvar2b . . . = varlist2 , noconstant ) ... ( eqnameN : depvarN a depvarN b . . . = varlistN , noconstant ) if in weight , options Explicit equation naming (eqname:) cannot be combined with multiple dependent variables in an equation specification. 2271 2272 sureg — Zellner’s seemingly unrelated regression Description options Model isure constraints(constraints) iterate until estimates converge apply specified linear constraints df adj. report small-sample statistics use small-sample adjustment use alternate adjustment small dfk dfk2 Reporting level(#) corr nocnsreport display options set confidence level; default is level(95) perform Breusch–Pagan test do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Optimization optimization options control the optimization process; seldom used noheader notable coeflegend suppress header table from above coefficient table suppress coefficient table display legend instead of statistics varlist1 , . . . , varlistN may contain factor variables; see [U] 11.4.3 Factor variables. You must have the same levels of factor variables in all equations that have factor variables. depvars and the varlists may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, rolling, and statsby are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. 
aweights are not allowed with the jackknife prefix; see [R] jackknife.
aweights and fweights are allowed; see [U] 11.1.6 weight.
noheader, notable, and coeflegend do not appear in the dialog box.
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.

Menu
Statistics > Linear models and related > Multiple-equation models > Seemingly unrelated regression

Description
sureg fits seemingly unrelated regression models (Zellner 1962; Zellner and Huang 1962; Zellner 1963). The acronyms SURE and SUR are often used for the estimator.

Options

Model

isure specifies that sureg iterate over the estimated disturbance covariance matrix and parameter estimates until the parameter estimates converge. Under seemingly unrelated regression, this iteration converges to the maximum likelihood results. If this option is not specified, sureg produces two-step estimates.

constraints(constraints); see [R] estimation options.

df adj.

small specifies that small-sample statistics be computed. It shifts the test statistics from chi-squared and z statistics to F statistics and t statistics. Although the standard errors from each equation are computed using the degrees of freedom for the equation, the degrees of freedom for the t statistics are all taken to be those for the first equation.

dfk specifies the use of an alternate divisor in computing the covariance matrix for the equation residuals. As an asymptotically justified estimator, sureg by default uses the number of sample observations (n) as a divisor. When the dfk option is set, a small-sample adjustment is made and the divisor is taken to be √{(n − ki)(n − kj)}, where ki and kj are the numbers of parameters in equations i and j, respectively.

dfk2 specifies the use of an alternate divisor in computing the covariance matrix for the equation residuals. When the dfk2 option is set, the divisor is taken to be the mean of the residual degrees of freedom from the individual equations.

Reporting

level(#); see [R] estimation options.

corr displays the correlation matrix of the residuals between equations and performs a Breusch–Pagan test for independent equations, that is, a test that the disturbance covariance matrix is diagonal.

nocnsreport; see [R] estimation options.

display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(%fmt), pformat(%fmt), sformat(%fmt), and nolstretch; see [R] estimation options.

Optimization

optimization options control the iterative process that minimizes the sum of squared errors when isure is specified. These options are seldom used.

iterate(#) specifies the maximum number of iterations. When the number of iterations equals #, the optimizer stops and presents the current results, even if the convergence tolerance has not been reached. The default value of iterate() is the current value of set maxiter (see [R] maximize), which is iterate(16000) if maxiter has not been changed.

trace adds to the iteration log a display of the current parameter vector.

nolog suppresses the display of the iteration log.

tolerance(#) specifies the tolerance for the coefficient vector. When the relative change in the coefficient vector from one iteration to the next is less than or equal to #, the optimization process is stopped. tolerance(1e-6) is the default.
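As a sketch of how these optimization options fit together (it is not an example from this entry; the equations are borrowed from the examples below, and the iteration cap is a placeholder), iterated FGLS with a limited number of iterations and an iteration log might be requested as follows:

. use http://www.stata-press.com/data/r13/auto, clear
(1978 Automobile Data)
. * iterate the FGLS steps to convergence, capping the number of iterations
. * and displaying the current parameter vector at each iteration
. sureg (price foreign length) (weight foreign length), isure iterate(50) trace
(output omitted )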
2274 sureg — Zellner’s seemingly unrelated regression The following options are available with sureg but are not shown in the dialog box: noheader suppresses display of the table reporting F statistics, R-squared, and root mean squared error above the coefficient table. notable suppresses display of the coefficient table. coeflegend; see [R] estimation options. Remarks and examples Seemingly unrelated regression models are so called because they appear to be joint estimates from several regression models, each with its own error term. The regressions are related because the (contemporaneous) errors associated with the dependent variables may be correlated. Chapter 5 of Cameron and Trivedi (2010) contains a discussion of the seemingly unrelated regression model and the feasible generalized least-squares estimator underlying it. Example 1 When we fit models with the same set of right-hand-side variables, the seemingly unrelated regression results (in terms of coefficients and standard errors) are the same as fitting the models separately (using, say, regress). The same is true when the models are nested. Even in such cases, sureg is useful when we want to perform joint tests. For instance, let us assume that we think price = β0 + β1 foreign + β2 length + u1 weight = γ0 + γ1 foreign + γ2 length + u2 Because the models have the same set of explanatory variables, we could estimate the two equations separately. Yet, we might still choose to estimate them with sureg because we want to perform the joint test β1 = γ1 = 0. We use the small and dfk options to obtain small-sample statistics comparable with regress or mvreg. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . sureg (price foreign length) (weight foreign length), small dfk Seemingly unrelated regression Equation price weight Obs Parms RMSE "R-sq" F-Stat P 74 74 2 2 2474.593 250.2515 0.3154 0.8992 16.35 316.54 0.0000 0.0000 Coef. Std. Err. t P>|t| [95% Conf. Interval] price foreign length _cons 2801.143 90.21239 -11621.35 766.117 15.83368 3124.436 3.66 5.70 -3.72 0.000 0.000 0.000 1286.674 58.91219 -17797.77 4315.611 121.5126 -5444.93 weight foreign length _cons -133.6775 31.44455 -2850.25 77.47615 1.601234 315.9691 -1.73 19.64 -9.02 0.087 0.000 0.000 -286.8332 28.27921 -3474.861 19.4782 34.60989 -2225.639 sureg — Zellner’s seemingly unrelated regression 2275 These two equations have a common set of regressors, and we could have used a shorthand syntax to specify the equations: . sureg (price weight = foreign length), small dfk Here the results presented by sureg are the same as if we had estimated the equations separately: . regress price foreign length (output omitted ) . regress weight foreign length (output omitted ) There is, however, a difference. We have allowed u1 and u2 to be correlated and have estimated the full variance–covariance matrix of the coefficients. sureg has estimated the correlations, but it does not report them unless we specify the corr option. We did not remember to specify corr when we fit the model, but we can redisplay the results: . sureg, notable noheader corr Correlation matrix of residuals: price weight price 1.0000 0.5840 weight 1.0000 Breusch-Pagan test of independence: chi2(1) = 25.237, Pr = 0.0000 The notable and noheader options prevented sureg from redisplaying the header and coefficient tables. We find that, for the same cars, the correlation of the residuals in the price and weight equations is 0.5840 and that we can reject the hypothesis that this correlation is zero. 
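The estimated disturbance covariance matrix itself is also available after estimation; it is saved as e(Sigma) (see Stored results below). A minimal sketch, assuming the model fit above is still the active estimation result:

. * e(Sigma) holds the estimated covariance matrix of the equation residuals
. matrix list e(Sigma)
(output omitted )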
We can test that the coefficients on foreign are jointly zero in both equations — as we set out to do — by typing test foreign; see [R] test. When we type a variable without specifying the equation, that variable is tested for zero in all equations in which it appears: . test foreign ( 1) [price]foreign = 0 ( 2) [weight]foreign = 0 F( 2, 142) = 17.99 Prob > F = 0.0000 Example 2 When the models do not have the same set of explanatory variables and are not nested, sureg may lead to more efficient estimates than running the models separately as well as allowing joint tests. This time, let us assume that we believe price = β0 + β1 foreign + β2 mpg + β3 displ + u1 weight = γ0 + γ1 foreign + γ2 length + u2 2276 sureg — Zellner’s seemingly unrelated regression To fit this model, we type . sureg (price foreign mpg displ) (weight foreign length), corr Seemingly unrelated regression Equation price weight Obs Parms RMSE "R-sq" chi2 P 74 74 3 2 2165.321 245.2916 0.4537 0.8990 49.64 661.84 0.0000 0.0000 Coef. Std. Err. z P>|z| [95% Conf. Interval] price foreign mpg displacement _cons 3058.25 -104.9591 18.18098 3904.336 685.7357 58.47209 4.286372 1966.521 4.46 -1.80 4.24 1.99 0.000 0.073 0.000 0.047 1714.233 -219.5623 9.779842 50.0263 4402.267 9.644042 26.58211 7758.645 weight foreign length _cons -147.3481 30.94905 -2753.064 75.44314 1.539895 303.9336 -1.95 20.10 -9.06 0.051 0.000 0.000 -295.2139 27.93091 -3348.763 .517755 33.96718 -2157.365 Correlation matrix of residuals: price weight price 1.0000 weight 0.3285 1.0000 Breusch-Pagan test of independence: chi2(1) = 7.984, Pr = 0.0047 In comparison, if we had fit the price model separately, . regress price foreign mpg displ SS df Source MS Model Residual 294104790 340960606 3 70 98034929.9 4870865.81 Total 635065396 73 8699525.97 price Coef. foreign mpg displacement _cons 3545.484 -98.88559 22.40416 2796.91 Std. Err. 712.7763 63.17063 4.634239 2137.873 t 4.97 -1.57 4.83 1.31 Number of obs F( 3, 70) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.122 0.000 0.195 = = = = = = 74 20.13 0.0000 0.4631 0.4401 2207 [95% Conf. Interval] 2123.897 -224.8754 13.16146 -1466.943 4967.072 27.10426 31.64686 7060.763 The coefficients are slightly different, but the standard errors are uniformly larger. This would still be true if we specified the dfk option to make a small-sample adjustment to the estimated covariance of the disturbances. Technical note Constraints can be applied to SURE models using Stata’s standard syntax for constraints. For a general discussion of constraints, see [R] constraint; for examples similar to seemingly unrelated regression models, see [R] reg3. 
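As an illustration of the technical note above (the restriction itself is arbitrary and chosen only for the sketch), a cross-equation constraint can be defined with constraint and then imposed through the constraints() option:

. use http://www.stata-press.com/data/r13/auto, clear
(1978 Automobile Data)
. * force the coefficient on foreign to be equal in the two equations
. constraint 1 [price]foreign = [weight]foreign
. sureg (price foreign length) (weight foreign length), constraints(1)
(output omitted )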
sureg — Zellner’s seemingly unrelated regression 2277 Stored results sureg stores the following in e(): Scalars e(N) e(k) e(k eq) e(mss #) e(df m#) e(rss #) e(df r) e(r2 #) e(F #) e(rmse #) e(dfk2 adj) e(ll) e(chi2 #) e(p #) e(cons #) e(chi2 bp) e(df bp) e(cons #) e(rank) e(ic) number of observations number of parameters number of equations in e(b) model sum of squares for equation # model degrees of freedom for equation # residual sum of squares for equation # residual degrees of freedom R-squared for equation # F statistic for equation # (small only) root mean squared error for equation # divisor used with VCE when dfk2 specified log likelihood χ2 for equation # significance for equation # 1 if equation # has a constant, 0 otherwise Breusch–Pagan χ2 degrees of freedom for Breusch–Pagan χ2 test 1 when equation # has a constant; 0, otherwise rank of e(V) number of iterations Macros e(cmd) e(cmdline) e(method) e(depvar) e(exog) e(eqnames) e(wtype) e(wexp) e(corr) e(small) e(dfk) e(properties) e(predict) e(marginsok) e(marginsnotok) e(asbalanced) e(asobserved) sureg command as typed sure or isure names of dependent variables names of exogenous variables names of equations weight type weight expression correlation structure small alternate divisor (dfk or dfk2 only) b V program used to implement predict predictions allowed by margins predictions disallowed by margins factor variables fvset as asbalanced factor variables fvset as asobserved Matrices e(b) e(Cns) e(Sigma) e(V) coefficient vector constraints matrix b matrix Σ variance–covariance matrix of the estimators Functions e(sample) marks estimation sample Methods and formulas sureg uses the asymptotically efficient, feasible, generalized least-squares algorithm described in Greene (2012, 292–304). The computing formulas are given on page 293–294. The R-squared reported is the percent of variance explained by the predictors. It may be used for descriptive purposes, but R-squared is not a well-defined concept when GLS is used. 2278 sureg — Zellner’s seemingly unrelated regression sureg will refuse to compute the estimators if the same equation is named more than once or the covariance matrix of the residuals is singular. The Breusch and Pagan (1980) χ2 statistic — a Lagrange multiplier statistic — is given by λ=T M m−1 X X 2 rmn m=1 n=1 where rmn is the estimated correlation between the residuals of the M equations and T is the number of observations. It is distributed as χ2 with M (M − 1)/2 degrees of freedom. Arnold Zellner (1927–2010) was born in New York. He studied physics at Harvard and economics at Berkeley, and then he taught economics at the Universities of Washington and Wisconsin before settling in Chicago in 1966. Among his many major contributions to econometrics and statistics are his work on seemingly unrelated regression, three-stage least squares, and Bayesian econometrics. References Breusch, T. S., and A. R. Pagan. 1980. The Lagrange multiplier test and its applications to model specification in econometrics. Review of Economic Studies 47: 239–253. Cameron, A. C., and P. K. Trivedi. 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press. Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. McDowell, A. W. 2004. From the help desk: Seemingly unrelated regression with unbalanced equations. Stata Journal 4: 442–448. Rossi, P. E. 1989. The ET interview: Professor Arnold Zellner. Econometric Theory 5: 287–317. Weesie, J. 1999. 
sg121: Seemingly unrelated estimation and the cluster-adjusted sandwich estimator. Stata Technical Bulletin 52: 34–47. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 231–248. College Station, TX: Stata Press. Zellner, A. 1962. An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. Journal of the American Statistical Association 57: 348–368. . 1963. Estimators for seemingly unrelated regression equations: Some exact finite sample results. Journal of the American Statistical Association 58: 977–992. Zellner, A., and D. S. Huang. 1962. Further properties of efficient estimators for seemingly unrelated regression equations. International Economic Review 3: 300–313. Also see [R] sureg postestimation — Postestimation tools for sureg [R] nlsur — Estimation of nonlinear systems of equations [R] reg3 — Three-stage estimation for systems of simultaneous equations [R] regress — Linear regression [MV] mvreg — Multivariate regression [SEM] example 12 — Seemingly unrelated regression [SEM] intro 5 — Tour of models [TS] dfactor — Dynamic-factor models [U] 20 Estimation and postestimation commands Title sureg postestimation — Postestimation tools for sureg Description Remarks and examples Syntax for predict Also see Menu for predict Options for predict Description The following postestimation commands are available after sureg: Command Description contrast estat ic estat summarize estat vce estimates forecast lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses margins marginsplot nlcom predict predictnl pwcompare test testnl Syntax for predict predict statistic type newvar if in , equation(eqno ,eqno ) statistic Description Main xb stdp residuals difference stddp linear prediction; the default standard error of the linear prediction residuals difference between the linear predictions of two equations standard error of the difference in linear predictions These statistics are available both in and out of sample; type predict only for the estimation sample. 2279 . . . if e(sample) . . . if wanted 2280 sureg postestimation — Postestimation tools for sureg Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main equation(eqno ,eqno ) specifies to which equation(s) you are referring. equation() is filled in with one eqno for the xb, stdp, and residuals options. equation(#1) would mean that the calculation is to be made for the first equation, equation(#2) would mean the second, and so on. You could also refer to the equations by their names. equation(income) would refer to the equation named income and equation(hours) to the equation named hours. 
If you do not specify equation(), the results are the same as if you specified equation(#1). difference and stddp refer to between-equation concepts. To use these options, you must specify two equations, for example, equation(#1,#2) or equation(income,hours). When two equations must be specified, equation() is required. xb, the default, calculates the linear prediction (fitted values) — the prediction of xj b for the specified equation. stdp calculates the standard error of the prediction for the specified equation. It can be thought of as the standard error of the predicted expected value or mean for the observation’s covariate pattern. The standard error of the prediction is also referred to as the standard error of the fitted value. residuals calculates the residuals. difference calculates the difference between the linear predictions of two equations in the system. With equation(#1,#2), difference computes the prediction of equation(#1) minus the prediction of equation(#2). stddp is allowed only after you have previously fit a multiple-equation model. The standard error of the difference in linear predictions (x1j b − x2j b) between equations 1 and 2 is calculated. For more information on using predict after multiple-equation estimation commands, see [R] predict. Remarks and examples For an example of cross-equation testing of parameters using the test command, see example 1 in [R] sureg. Example 1 In example 1 of [R] sureg, we fit a seemingly unrelated regressions model of price and weight. Here we obtain the fitted values. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . sureg (price foreign length) (weight foreign length), small dfk (output omitted ) . predict phat, equation(price) (option xb assumed; fitted values) . predict what, equation(weight) (option xb assumed; fitted values) sureg postestimation — Postestimation tools for sureg . summarize price phat weight what Obs Mean Variable price phat weight what 74 74 74 74 6165.257 6165.257 3019.459 3019.459 Std. Dev. 2949.496 1656.407 777.1936 736.9666 Min Max 3291 1639.872 1760 1481.199 15906 9398.138 4840 4476.331 2281 Just as in single-equation OLS regression, in a SURE model the sample mean of the fitted values for an equation equals the sample mean of the dependent variable. Example 2 Suppose that for whatever reason we were interested in the difference between the predicted values of price and weight. predict has an option to compute this difference in one step: . predict diff, equation(price, weight) difference diff is the same as phat - what: . generate mydiff = phat - what . summarize diff mydiff Obs Mean Variable diff mydiff 74 74 3145.797 3145.797 Std. Dev. 
1233.26 1233.26 Also see [R] sureg — Zellner’s seemingly unrelated regression [U] 20 Estimation and postestimation commands Min Max -132.2275 -132.2275 5505.914 5505.914 Title swilk — Shapiro – Wilk and Shapiro – Francia tests for normality Syntax Options for sfrancia Acknowledgment Menu Remarks and examples References Description Stored results Also see Options for swilk Methods and formulas Syntax Shapiro–Wilk normality test swilk varlist if in , swilk options Shapiro–Francia normality test sfrancia varlist if in , sfrancia options swilk options Description Main generate(newvar) lnnormal noties create newvar containing W test coefficients test for three-parameter lognormality do not use average ranks for tied values sfrancia options Description Main use the Box–Cox transformation for W 0 ; the default is to use the log transformation do not use average ranks for tied values boxcox noties by is allowed with swilk and sfrancia; see [D] by. Menu swilk Statistics > Summaries, tables, and tests > Distributional plots and tests > Shapiro-Wilk normality test Summaries, tables, and tests > Distributional plots and tests > Shapiro-Francia normality test sfrancia Statistics > Description swilk performs the Shapiro – Wilk W test for normality, and sfrancia performs the Shapiro – Francia W 0 test for normality. swilk can be used with 4 ≤ n ≤ 2000 observations, and sfrancia can be used with 5 ≤ n ≤ 5000 observations; see [R] sktest for a test allowing more observations. See [MV] mvtest normality for multivariate tests of normality. 2282 swilk — Shapiro – Wilk and Shapiro – Francia tests for normality 2283 Options for swilk Main generate(newvar) creates new variable newvar containing the W test coefficients. lnnormal specifies that the test be for three-parameter lognormality, meaning that ln(X − k) is tested for normality, where k is calculated from the data as the value that makes the skewness coefficient zero. When simply testing ln(X) for normality, do not specify this option. See [R] lnskew0 for estimation of k . noties suppresses use of averaged ranks for tied values when calculating the W test coefficients. Options for sfrancia Main boxcox specifies that the Box – Cox transformation of Royston (1983) for calculating W 0 test coefficients be used instead of the default log transformation (Royston 1993a). Under the Box – Cox transformation, the normal approximation to the sampling distribution of W 0 , used by sfrancia, is valid for 5 ≤ n ≤ 1000. Under the log transformation, it is valid for 10 ≤ n ≤ 5000. noties suppresses use of averaged ranks for tied values when calculating the W 0 test coefficients. Remarks and examples Example 1 Using our automobile dataset, we will test whether the variables mpg and trunk are normally distributed: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . swilk mpg trunk Shapiro-Wilk W test for normal data Variable Obs W V z Prob>z mpg 74 0.94821 3.335 2.627 0.00430 74 0.97921 1.339 0.637 0.26215 trunk . sfrancia mpg trunk Shapiro-Francia W’ test for normal data Variable Obs W’ V’ z Prob>z mpg trunk 74 74 0.94872 0.98446 3.650 1.106 2.510 0.195 0.00604 0.42271 We can reject the hypothesis that mpg is normally distributed, but we cannot reject that trunk is normally distributed. The values reported under W and W 0 are the Shapiro – Wilk and Shapiro – Francia test statistics. The tests also report V and V 0 , which are more appealing indexes for departure from normality. The median values of V and V 0 are 1 for samples from normal populations. 
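Because by is allowed with both commands (see the syntax diagrams above), the tests can also be run separately within groups. A small sketch, using the automobile data that appear in the examples below; the grouping variable is chosen only for illustration, and each group must still satisfy the sample-size limits given above:

. use http://www.stata-press.com/data/r13/auto, clear
(1978 Automobile Data)
. * test mpg for normality separately for domestic and foreign cars
. bysort foreign: swilk mpg
(output omitted )
. bysort foreign: sfrancia mpg
(output omitted )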
Large values indicate nonnormality. The 95% critical values of V (V 0 ), which depend on the sample size, are between 1.2 and 2.4 (2.0 and 2.8); see Royston (1991b). There is no more information in V (V 0 ) than in W (W 0 ) — one is just the transform of the other. 2284 swilk — Shapiro – Wilk and Shapiro – Francia tests for normality Example 2 We have data on a variable called studytime, which we suspect is distributed lognormally: . use http://www.stata-press.com/data/r13/cancer (Patient Survival in Drug Trial) . generate lnstudytime = ln(studytime) . swilk lnstudytime Shapiro-Wilk W test for normal data Variable Obs W V z Prob>z lnstudytime 48 0.92731 3.311 2.547 0.00543 We can reject the lognormal assumption. We do not specify the lnnormal option when testing for lognormality. The lnnormal option is for three-parameter lognormality. Example 3 Having discovered that ln(studytime) is not distributed normally, we now test that ln(studytime − k) is normally distributed, where k is chosen so that the resulting skewness is zero. We obtain the estimate for k from lnskew0; see [R] lnskew0: . lnskew0 lnstudytimek = studytime, level(95) Transform ln(studytim-k) k [95% Conf. Interval] -11.01181 -infinity -.9477328 Skewness -.0000173 . swilk lnstudytimek, lnnormal Shapiro-Wilk W test for 3-parameter lognormal data Variable Obs W V z Prob>z lnstudytimek 48 0.97064 1.337 1.261 0.10363 We cannot reject the hypothesis that ln(studytime + 11.01181) is distributed normally. We do specify the lnnormal option when using an estimated value of k . Stored results swilk and sfrancia store the following in r(): Scalars r(N) r(p) r(z) number of observations significance z statistic r(W) r(V) W or W 0 V or V 0 Methods and formulas The Shapiro – Wilk test is based on Shapiro and Wilk (1965) with a new approximation accurate for 4 ≤ n ≤ 2000 (Royston 1992). The calculations made by swilk are based on Royston (1982, 1992, 1993b). swilk — Shapiro – Wilk and Shapiro – Francia tests for normality 2285 The Shapiro – Francia test (Shapiro and Francia 1972; Royston 1983; Royston 1993a) is an approximate test that is similar to the Shapiro – Wilk test for very large samples. Samuel Sanford Shapiro (1930– ) earned degrees in statistics and engineering from City College of New York, Columbia, and Rutgers. After employment in the U.S. Army and industry, he joined the faculty at Florida International University in 1972. Shapiro has coauthored various texts in statistics and published several papers on distributional testing and other statistical topics. Acknowledgment swilk and sfrancia were written by Patrick Royston of the MRC Clinical Trials Unit, London and coauthor of the Stata Press book Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model. References Brzezinski, M. 2012. The Chen–Shapiro test for normality. Stata Journal 12: 368–374. Genest, C., and G. J. Brackstone. 2010. A conversation with Martin Bradbury Wilk. Statistical Science 25: 258–273. Gould, W. W. 1992. sg3.7: Final summary of tests of normality. Stata Technical Bulletin 5: 10–11. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 114–115. College Station, TX: Stata Press. Royston, P. 1982. An extension of Shapiro and Wilks’s W test for normality to large samples. Applied Statistics 31: 115–124. . 1983. A simple method for evaluating the Shapiro–Francia W’ test of non-normality. Statistician 32: 297–300. . 1991a. sg3.2: Shapiro–Wilk and Shapiro–Francia tests. Stata Technical Bulletin 3: 19. 
Reprinted in Stata Technical Bulletin Reprints, vol. 1, p. 105. College Station, TX: Stata Press. . 1991b. Estimating departure from normality. Statistics in Medicine 10: 1283–1293. . 1992. Approximating the Shapiro–Wilk W-test for non-normality. Statistics and Computing 2: 117–119. . 1993a. A pocket-calculator algorithm for the Shapiro–Francia test for non-normality: An application to medicine. Statistics in Medicine 12: 181–184. . 1993b. A toolkit for testing for non-normality in complete and censored samples. Statistician 42: 37–43. Shapiro, S. S., and R. S. Francia. 1972. An approximate analysis of variance test for normality. Journal of the American Statistical Association 67: 215–216. Shapiro, S. S., and M. B. Wilk. 1965. An analysis of variance test for normality (complete samples). Biometrika 52: 591–611. Also see [R] lnskew0 — Find zero-skewness log or Box – Cox transform [R] lv — Letter-value displays [R] sktest — Skewness and kurtosis test for normality [MV] mvtest normality — Multivariate normality tests Title symmetry — Symmetry and marginal homogeneity tests Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax Symmetry and marginal homogeneity tests symmetry casevar controlvar if in weight , options Immediate form of symmetry and marginal homogeneity tests symmi # 11 # 12 [...] \ # 21 # 22 [...] [\...] if in , options Description options Main suppress output of contingency table report contribution of each off-diagonal cell pair perform exact test of table symmetry perform two marginal homogeneity tests perform a test for linear trend in the (log) relative risk (RR) use continuity correction when calculating test for linear trend notable contrib exact mh trend cc fweights are allowed; see [U] 11.1.6 weight. Menu symmetry Statistics > Epidemiology and related > Other > Symmetry and marginal homogeneity test > Epidemiology and related > Other > Symmetry and marginal homogeneity test calculator symmi Statistics Description symmetry performs asymptotic symmetry and marginal homogeneity tests, as well as an exact symmetry test on K × K tables where there is a 1-to-1 matching of cases and controls (nonindependence). This testing is used to analyze matched-pair case–control data with multiple discrete levels of the exposure (outcome) variable. In genetics, the test is known as the transmission/disequilibrium test (TDT) and is used to test the association between transmitted and nontransmitted parental marker alleles to an affected child (Spieldman, McGinnis, and Ewens 1993). For 2 × 2 tables, the asymptotic test statistics reduce to the McNemar test statistic, and the exact symmetry test produces an exact McNemar test; see [ST] epitab. For many exposure variables, symmetry can optionally perform a test for linear trend in the log relative risk. 2286 symmetry — Symmetry and marginal homogeneity tests 2287 symmetry expects the data to be in the wide format; that is, each observation contains the matched case and control values in variables casevar and controlvar. Variables can be numeric or string. symmi is the immediate form of symmetry. The symmi command uses the values specified on the command line; rows are separated by ‘\’, and options are the same as for symmetry. See [U] 19 Immediate commands for a general introduction to immediate commands. Options Main notable suppresses the output of the contingency table. By default, symmetry displays the n × n contingency table at the top of the output. 
contrib reports the contribution of each off-diagonal cell pair to the overall symmetry χ2 . exact performs an exact test of table symmetry. This option is recommended for sparse tables. CAUTION: The exact test requires substantial amounts of time and memory for large tables. mh performs two marginal homogeneity tests that do not require the inversion of the variance–covariance matrix. By default, symmetry produces the Stuart–Maxwell test statistic, which requires the inversion of the nondiagonal variance–covariance matrix, V. When the table is sparse, the matrix may not be of full rank, and then the command substitutes a generalized inverse V∗ for V−1 . mh calculates optional marginal homogeneity statistics that do not require the inversion of the variance–covariance matrix. These tests may be preferred in certain situations. See Methods and formulas and Bickeböller and Clerget-Darpoux (1995) for details on these test statistics. trend performs a test for linear trend in the (log) relative risk (RR). This option is allowed only for numeric exposure (outcome) variables, and its use should be restricted to measurements on the ordinal or the interval scales. cc specifies that the continuity correction be used when calculating the test for linear trend. This correction should be specified only when the levels of the exposure variable are equally spaced. Remarks and examples symmetry and symmi may be used to analyze 1-to-1 matched case–control data with multiple discrete levels of the exposure (outcome) variable. Example 1 Consider a survey of 344 individuals (BMDP 1990, 267–270) who were asked in October 1986 whether they agreed with President Reagan’s handling of foreign affairs. In January 1987, after the Iran-Contra affair became public, these same individuals were surveyed again and asked the same question. We would like to know if public opinion changed over this period. 2288 symmetry — Symmetry and marginal homogeneity tests We first describe the dataset and list a few observations. . use http://www.stata-press.com/data/r13/iran . describe Contains data from http://www.stata-press.com/data/r13/iran.dta obs: 344 vars: 2 29 Jan 2013 02:37 size: 688 variable name storage type before after byte byte display format value label variable label %8.0g %8.0g vlab vlab Public Opinion before IC Public Opinion after IC Sorted by: . list in 1/5 1. 2. 3. 4. 5. before after agree agree agree disagree disagree agree disagree unsure agree disagree Each observation corresponds to one of the 344 individuals. The data are in wide form so that each observation has a before and an after measurement. We now perform the test without options. . symmetry before after Public Opinion before IC Public Opinion after IC agree disagree unsure Total agree disagree unsure 47 28 26 56 61 47 38 31 10 141 120 83 Total 101 164 79 344 Symmetry (asymptotic) Marginal homogeneity (Stuart-Maxwell) chi2 df Prob>chi2 14.87 14.78 3 2 0.0019 0.0006 The test first tabulates the data in a K × K table and then performs Bowker’s (1948) test for table symmetry and the Stuart–Maxwell (Stuart 1955; Maxwell 1970) test for marginal homogeneity. Both the symmetry test and the marginal homogeneity test are highly significant, thus indicating a shift in public opinion. An exact test of symmetry is provided for use on sparse tables. This test is computationally intensive, so it should not be used on large tables. Because we are working on a fast computer, we will run the symmetry test again and this time include the exact option. 
We will suppress the output of the contingency table by specifying notable and include the contrib option so that we may further examine the cells responsible for the significant result. symmetry — Symmetry and marginal homogeneity tests 2289 . symmetry before after, contrib exact mh notable Contribution to symmetry Cells chi-squared n1_2 & n2_1 n1_3 & n3_1 n2_3 & n3_2 Symmetry Marginal Marginal Marginal 9.3333 2.2500 3.2821 (asymptotic) homogeneity (Stuart-Maxwell) homogeneity (Bickenboller) homogeneity (no diagonals) chi2 df Prob>chi2 14.87 14.78 13.53 15.25 3 2 2 2 0.0019 0.0006 0.0012 0.0005 Symmetry (exact significance probability) 0.0018 The largest contribution to the symmetry χ2 is due to cells n12 and n21 . These correspond to changes between the agree and disagree categories. Of the 344 individuals, 56 (16.3%) changed from the agree to the disagree response, whereas only 28 (8.1%) changed in the opposite direction. For these data, the results from the exact test are similar to those from the asymptotic test. Example 2 Breslow and Day (1980, 163) reprinted data from Mack et al. (1976) from a case–control study of the effect of exogenous estrogen on the risk of endometrial cancer. The data consist of 59 elderly women diagnosed with endometrial cancer and 59 disease-free control subjects living in the same community as the cases. Cases and controls were matched on age, marital status, and time living in the community. The data collected included information on the daily dose of conjugated estrogen therapy. Breslow and Day analyzed these data by creating four levels of the dose variable. Here are the data as entered into a Stata dataset: . use http://www.stata-press.com/data/r13/bd163 . list, noobs divider case control count 0 0 0 0 0.1-0.299 0 0.1-0.299 0.3-0.625 0.626+ 0 6 2 3 1 9 0.1-0.299 0.1-0.299 0.1-0.299 0.3-0.625 0.3-0.625 0.1-0.299 0.3-0.625 0.626+ 0 0.1-0.299 4 2 1 9 2 0.3-0.625 0.3-0.625 0.626+ 0.626+ 0.626+ 0.3-0.625 0.626+ 0 0.1-0.299 0.3-0.625 3 1 12 1 2 0.626+ 0.626+ 1 2290 symmetry — Symmetry and marginal homogeneity tests This dataset is in a different format from that of the previous example. Instead of each observation representing one matched pair, each observation represents possibly multiple pairs indicated by the count variable. For instance, the first observation corresponds to six matched pairs where neither the case nor the control was on estrogen, the second observation corresponds to two matched pairs where the case was not on estrogen and the control was on 0.1 to 0.299 mg/day, etc. To use symmetry to analyze this dataset, we must specify fweight to indicate that in our data there are observations corresponding to more than one matched pair. . symmetry case control [fweight=count] case 0 0.1-0.299 control 0.3-0.625 0.626+ Total 0 0.1-0.299 0.3-0.625 0.626+ 6 9 9 12 2 4 2 1 3 2 3 2 1 1 1 1 12 16 15 16 Total 36 9 10 4 59 Symmetry (asymptotic) Marginal homogeneity (Stuart-Maxwell) chi2 df Prob>chi2 17.10 16.96 6 3 0.0089 0.0007 Both the test of symmetry and the test of marginal homogeneity are highly significant, thus leading us to reject the null hypothesis that there is no effect of exposure to estrogen on the risk of endometrial cancer. Breslow and Day perform a test for trend assuming that the estrogen exposure levels were equally spaced by recoding the exposure levels as 1, 2, 3, and 4. We can easily reproduce their results by recoding our data in this way and by specifying the trend option. 
Two new numeric variables were created, ca and co, corresponding to the variables case and control, respectively. Below we list some of the data and our results from symmetry: . . . . . encode case, gen(ca) encode control, gen(co) label values ca label values co list in 1/4 1. 2. 3. 4. case control count ca co 0 0 0 0 0 0.1-0.299 0.3-0.625 0.626+ 6 2 3 1 1 1 1 1 1 2 3 4 . symmetry ca co [fw=count], notable trend cc chi2 df Prob>chi2 Symmetry (asymptotic) Marginal homogeneity (Stuart-Maxwell) 17.10 16.96 6 3 0.0089 0.0007 Linear trend in the (log) RR 14.43 1 0.0001 symmetry — Symmetry and marginal homogeneity tests 2291 We requested the continuity correction by specifying cc. Doing so is appropriate because our coded exposure levels are equally spaced. The test for trend was highly significant, indicating an increased risk of endometrial cancer with increased dosage of conjugated estrogen. You must be cautious: the way in which you code the exposure variable affects the linear trend statistic. If instead of coding the levels as 1, 2, 3, and 4, we had instead used 0, 0.2, 0.46, and 0.7 (roughly the midpoint in the range of each level), we would have obtained a χ2 statistic of 11.19 for these data. Stored results symmetry stores the following in r(): Scalars r(N pair) r(chi2) r(df) r(p) r(chi2 sm) r(df sm) r(p sm) r(chi2 b) r(df b) r(p b) r(chi2 nd) r(df nd) r(p nd) r(chi2 t) r(p trend) r(p exact) number of matched pairs asymptotic symmetry χ2 asymptotic symmetry degrees of freedom asymptotic symmetry p-value MH (Stuart–Maxwell) χ2 MH (Stuart–Maxwell) degrees of freedom MH (Stuart–Maxwell) p-value MH (Bickenböller) χ2 MH (Bickenböller) degrees of freedom MH (Bickenböller) p-value MH (no diagonals) χ2 MH (no diagonals) degrees of freedom MH (no diagonals) p-value χ2 for linear trend p-value for linear trend exact symmetry p-value Methods and formulas Methods and formulas are presented under the following headings: Asymptotic tests Exact symmetry test Asymptotic tests Consider a square table with K exposure categories, that is, K rows and K columns. Let nij be the count corresponding to row i and column j of the table, Nij = nij + nji , for i, j = 1, 2, . . . , K , and ni. , and let n.j be the marginal totals for row i and column j , respectively. Asymptotic tests for symmetry and marginal homogeneity for this K × K table are calculated as follows: The null hypothesis of complete symmetry pij = pji , i 6= j , is tested by calculating the test statistic (Bowker 1948) Tcs = X (nij − nji )2 nij + nji i Summaries, tables, and tests > Other tables > Flexible table of summary statistics Description table calculates and displays tables of statistics. Options Main contents(clist) specifies the contents of the table’s cells; if not specified, contents(freq) is used by default. contents(freq) produces a table of frequencies. contents(mean mpg) produces a table of the means of variable mpg. contents(freq mean mpg sd mpg) produces a table of frequencies together with the mean and standard deviation of variable mpg. Up to five statistics may be specified. by(superrowvarlist) specifies that numeric or string variables be treated as superrows. Up to four variables may be specified in superrowvarlist. The by() option may be specified with the by prefix. Options cellwidth(#) specifies the width of the cell in units of digit widths; 10 means the space occupied by 10 digits, which is 0123456789. 
The default cellwidth() is not a fixed number, but a number chosen by table to spread the table out while presenting a reasonable number of columns across the page. 2296 table — Flexible table of summary statistics csepwidth(#) specifies the separation between columns in units of digit widths. The default is not a fixed number, but a number chosen by table according to what it thinks looks best. stubwidth(#) specifies the width, in units of digit widths, to be allocated to the left stub of the table. The default is not a fixed number, but a number chosen by table according to what it thinks looks best. scsepwidth(#) specifies the separation between supercolumns in units of digit widths. The default is not a fixed number, but a number chosen by table to present the results best. center specifies that results be centered in the table’s cells. The default is to right-align results. For centering to work well, you typically need to specify a display format as well. center format(%9.2f) is popular. left specifies that column labels be left-aligned. The default is to right-align column labels to distinguish them from supercolumn labels, which are left-aligned. cw specifies casewise deletion. If cw is not specified, all observations possible are used to calculate each of the specified statistics. cw is relevant only when you request a table containing statistics on multiple variables. For instance, contents(mean mpg mean weight) would produce a table reporting the means of variables mpg and weight. Consider an observation in which mpg is known but weight is missing. By default, that observation will be used in the calculation of the mean of mpg. If you specify cw, the observation will be excluded in the calculation of the means of both mpg and weight. row specifies that a row be added to the table reflecting the total across the rows. column specifies that a column be added to the table reflecting the total across columns. scolumn specifies that a supercolumn be added to the table reflecting the total across supercolumns. concise specifies that rows with all missing entries not be displayed. missing specifies that missing statistics be shown in the table as periods (Stata’s missing-value indicator). The default is that missing entries be left blank. replace specifies that the data in memory be replaced with data containing 1 observation per cell (row, column, supercolumn, and superrow) and with variables containing the statistics designated in contents(). This option is rarely specified. If you do not specify this option, the data in memory remain unchanged. If you do specify this option, the first statistic will be named table1, the second table2, and so on. For instance, if contents(mean mpg sd mpg) was specified, the means of mpg would be in variable table1 and the standard deviations in table2. name(string) is relevant only if you specify replace. name() allows changing the default stub name that replace uses to name the new variables associated with the statistics. If you specify name(stat), the first statistic will be placed in variable stat1, the second in stat2, and so on. format(% fmt) specifies the display format for presenting numbers in the table’s cells. format(%9.0g) is the default; format(%9.2f) and format(%9.2fc) are popular alternatives. The width of the format you specify does not matter, except that % fmt must be valid. The width of the cells is chosen by table to present the results best. The cellwidth() option allows you to override table’s choice. 
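A brief sketch of the replace and name() options just described; it is not taken from the examples that follow, and the statistics chosen are arbitrary. Note that replace discards the original data in memory:

. use http://www.stata-press.com/data/r13/auto2, clear
(1978 Automobile Data)
. * build the table and replace the data with one observation per cell;
. * the first statistic is stored in stat1 and the second in stat2
. table rep78 foreign, contents(mean mpg sd mpg) replace name(stat)
(output omitted )
. list rep78 foreign stat1 stat2, noobs
(output omitted )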
table — Flexible table of summary statistics 2297 Limits Up to four variables may be specified in the by(), so with the three row, column, and supercolumn variables, seven-way tables may be displayed. Up to five statistics may be displayed in each cell of the table. The sum of the number of rows, columns, supercolumns, and superrows is called the number of margins. A table may contain up to 3,000 margins. Thus a one-way table may contain 3,000 rows. A two-way table could contain 2,998 rows and two columns, 2,997 rows and three columns, . . ., 1,500 rows and 1,500 columns, . . ., two rows and 2,998 columns. A three-way table is similarly limited by the sum of the number of rows, columns, and supercolumns. A r × c × d table is feasible if r + c + d ≤ 3,000. The limit is set in terms of the sum of the rows, columns, supercolumns, and superrows, and not, as you might expect, in terms of their product. Remarks and examples Remarks are presented under the following headings: One-way tables Two-way tables Three-way tables Four-way and higher-dimensional tables Video example One-way tables Example 1 From the automobile dataset, here is a simple one-way table: . use http://www.stata-press.com/data/r13/auto2 (1978 Automobile Data) . table rep78, contents(mean mpg) Repair Record 1978 Poor Fair Average Good Excellent mean(mpg) 21 19.125 19.4333 21.6667 27.3636 We are not limited to including only one statistic: . table rep78, c(n mpg Repair Record 1978 Poor Fair Average Good Excellent mean mpg sd mpg median mpg) N(mpg) mean(mpg) sd(mpg) med(mpg) 2 8 30 18 11 21 19.125 19.4333 21.6667 27.3636 4.24264 3.758324 4.141325 4.93487 8.732385 21 18 19 22.5 30 2298 table — Flexible table of summary statistics We abbreviated contents() as c(). The format() option will allow us to better format the numbers in the table: . table rep78, c(n mpg Repair Record 1978 Poor Fair Average Good Excellent mean mpg sd mpg median mpg) format(%9.2f) N(mpg) mean(mpg) sd(mpg) med(mpg) 2 8 30 18 11 21.00 19.12 19.43 21.67 27.36 4.24 3.76 4.14 4.93 8.73 21.00 18.00 19.00 22.50 30.00 The center option will center the results under the headings: . table rep78, c(n mpg Repair Record 1978 Poor Fair Average Good Excellent N(mpg) 2 8 30 18 11 mean mpg sd mpg median mpg) format(%9.2f) center mean(mpg) sd(mpg) med(mpg) 21.00 19.12 19.43 21.67 27.36 4.24 3.76 4.14 4.93 8.73 21.00 18.00 19.00 22.50 30.00 Two-way tables Example 2 In example 1, when we typed ‘table rep78, . . .’, we obtained a one-way table. If we were to type ‘table rep78 foreign, . . .’, we would obtain a two-way table: . table rep78 foreign, c(mean mpg) Repair Record 1978 Poor Fair Average Good Excellent Car type Domestic Foreign 21 19.125 19 18.4444 32 23.3333 24.8889 26.3333 Note the missing cells. Certain combinations of repair record and car type do not exist in our dataset. As with one-way tables, we can specify a display format for the cells and center the numbers within the cells if we wish. table — Flexible table of summary statistics 2299 . table rep78 foreign, c(mean mpg) format(%9.2f) center Repair Record 1978 Poor Fair Average Good Excellent Car type Domestic Foreign 21.00 19.12 19.00 18.44 32.00 23.33 24.89 26.33 We can obtain row totals by specifying the row option and obtain column totals by specifying the col option. We specify both below: . 
table rep78 foreign, c(mean mpg) format(%9.2f) center row col Repair Record 1978 Domestic Car type Foreign Total Poor Fair Average Good Excellent 21.00 19.12 19.00 18.44 32.00 23.33 24.89 26.33 21.00 19.12 19.43 21.67 27.36 Total 19.54 25.29 21.29 table can display multiple statistics within cells, but once we move beyond one-way tables, the table becomes busy: . table foreign rep78, c(mean mpg Car type Poor Domestic 21.00 2 Foreign n mpg) format(%9.2f) center Repair Record 1978 Fair Average Good 19.12 8 Excellent 19.00 27 18.44 9 32.00 2 23.33 3 24.89 9 26.33 9 This two-way table with two statistics per cell works well here. That was, in part, helped along by our interchanging the rows and columns. We turned the table around by typing table foreign rep78 rather than table rep78 foreign. 2300 table — Flexible table of summary statistics Another way to display two-way tables is to specify a row and superrow rather than a row and column. We do that below and display three statistics per cell: . table foreign, by(rep78) c(mean mpg Repair Record 1978 and Car type sd mpg mean(mpg) sd(mpg) Poor Domestic Foreign 21.00 4.24 2 Fair Domestic Foreign 19.12 3.76 8 Average Domestic Foreign 19.00 23.33 4.09 2.52 27 3 Good Domestic Foreign 18.44 24.89 4.59 2.71 9 9 Excellent Domestic Foreign 32.00 26.33 2.83 9.37 2 9 n mpg) format(%9.2f) center N(mpg) Three-way tables Example 3 We have data on the prevalence of byssinosis, a form of pneumoconiosis to which workers exposed to cotton dust are susceptible. The dataset is on 5,419 workers in a large cotton mill. We know whether each worker smokes, his or her race, and the dustiness of the work area. The categorical variables are smokes Smoker or nonsmoker in the last five years. race White or other. workplace 1 (most dusty), 2 (less dusty), 3 (least dusty). Moreover, this dataset includes a frequency-weight variable pop. Here is a three-way table showing the fraction of workers with byssinosis: . use http://www.stata-press.com/data/r13/byssin (Byssinosis incidence) . table workplace smokes race [fw=pop], c(mean prob) Dustiness of workplace least less most Race and Smokes other white no yes no .0107527 .02 .0820896 .0101523 .0081633 .1679105 .0081549 .0136612 .0833333 yes .0162774 .0143149 .2295082 table — Flexible table of summary statistics 2301 This table would look better if we showed the fraction to four digits: . table workplace smokes race [fw=pop], c(mean prob) format(%9.4f) Dustiness of workplace least less most Race and Smokes other white no yes no yes 0.0108 0.0200 0.0821 0.0102 0.0082 0.1679 0.0082 0.0137 0.0833 0.0163 0.0143 0.2295 In this table, the rows are the dustiness of the workplace, the columns are whether the worker smokes, and the supercolumns are the worker’s race. Now we request that the table include the supercolumn totals by specifying the sctotal option, which we can abbreviate as sc: . table workplace smokes race [fw=pop], c(mean prob) format(%9.4f) sc Dustiness of workplace least less most Race and Smokes white no yes other no yes 0.0108 0.0200 0.0821 0.0102 0.0082 0.1679 0.0082 0.0137 0.0833 Total no yes 0.0163 0.0143 0.2295 0.0090 0.0159 0.0826 0.0145 0.0123 0.1929 The supercolumn total is the total over race and is divided into its columns based on smokes. Here is the table with the column rather than the supercolumn totals: . 
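Before turning to three-way tables, here is a short sketch (not part of the original examples, and continuing with the auto2 data used above) of the cw and missing options described under Options; the variables and format are chosen only for illustration:

. * casewise deletion across both requested statistics; empty cells shown as periods
. table rep78 foreign, c(mean mpg mean weight) cw missing format(%9.1f)
(output omitted )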
table workplace smokes race [fw=pop], c(mean prob) format(%9.4f) col Dustiness of workplace least less most Race and Smokes no other yes Total no white yes Total 0.0108 0.0200 0.0821 0.0102 0.0082 0.1679 0.0104 0.0135 0.1393 0.0082 0.0137 0.0833 0.0163 0.0143 0.2295 0.0129 0.0140 0.1835 Here is the table with both column and supercolumn totals: . table workplace smokes race [fw=pop], c(mean prob) format(%9.4f) sc col Dustin ess of workpl ace least less most no other yes Total 0.0108 0.0200 0.0821 0.0102 0.0082 0.1679 0.0104 0.0135 0.1393 Race and Smokes white no yes Total 0.0082 0.0137 0.0833 0.0163 0.0143 0.2295 0.0129 0.0140 0.1835 no Total yes Total 0.0090 0.0159 0.0826 0.0145 0.0123 0.1929 0.0122 0.0138 0.1570 table is struggling to keep this table from becoming too wide — notice how it divided the words in the title in the top-left stub. Here, if the table had more columns, or, if we demanded more digits, table would be forced to segment the table and present it in pieces, which it would do: 2302 table — Flexible table of summary statistics . table workplace smokes race [fw=pop], c(mean prob) format(%9.6f) sc col Dustiness of workplace least less most Race and Smokes no other yes Total no white yes Total 0.010753 0.020000 0.082090 0.010152 0.008163 0.167910 0.010417 0.013483 0.139303 0.008155 0.013661 0.083333 0.016277 0.014315 0.229508 0.012949 0.014035 0.183521 Dustiness of workplace least less most Race and Smokes Total no yes 0.008990 0.015901 0.082569 0.014471 0.012262 0.192905 Total 0.012174 0.013846 0.156951 Here three digits is probably enough, so here is the table including all the row, column, and supercolumn totals: . table workplace smokes race [fw=pop], c(mean prob) format(%9.3f) sc col row Dustiness of workplace Race and Smokes white no yes Total no other yes Total least less most 0.011 0.020 0.082 0.010 0.008 0.168 0.010 0.013 0.139 0.008 0.014 0.083 0.016 0.014 0.230 Total 0.025 0.048 0.038 0.014 0.035 no Total yes Total 0.013 0.014 0.184 0.009 0.016 0.083 0.014 0.012 0.193 0.012 0.014 0.157 0.026 0.018 0.039 0.030 We can show multiple statistics: . table workplace smokes race [fw=pop], c(mean prob > col row Dustiness of workplace n prob) format(%9.3f) sc Race and Smokes white no yes Total no other yes Total no Total yes Total least 0.011 465 0.010 591 0.010 1,056 0.008 981 0.016 1,413 0.013 2,394 0.009 1,446 0.014 2,004 0.012 3,450 less 0.020 200 0.008 245 0.013 445 0.014 366 0.014 489 0.014 855 0.016 566 0.012 734 0.014 1,300 most 0.082 134 0.168 268 0.139 402 0.083 84 0.230 183 0.184 267 0.083 218 0.193 451 0.157 669 Total 0.025 799 0.048 1,104 0.038 1,903 0.014 1,431 0.035 2,085 0.026 3,516 0.018 2,230 0.039 3,189 0.030 5,419 table — Flexible table of summary statistics 2303 Four-way and higher-dimensional tables Example 4 Let’s pretend that our byssinosis dataset also recorded each worker’s sex (it does not, and we have made up this extra information). We obtain a four-way table just as we would a three-way table, but we specify the fourth variable as a superrow by including it in the by() option: . use http://www.stata-press.com/data/r13/byssin1 (Byssinosis incidence) . 
table workplace smokes race [fw=pop], by(sex) c(mean prob) > col row Sex and Dustiness of workplace Race and Smokes white no yes Total no other yes Total Female least less most 0.006 0.020 0.057 0.009 0.008 0.154 0.008 0.010 0.141 0.009 0.015 0.021 0.015 Total 0.017 0.051 0.043 0.011 Male least less most 0.013 0.020 0.091 0.011 0.000 0.244 0.012 0.019 0.136 Total 0.029 0.041 0.033 format(%9.3f) sc no Total yes Total 0.016 0.015 0.009 0.016 0.057 0.018 0.012 0.154 0.014 0.014 0.141 0.020 0.016 0.012 0.032 0.024 0.006 0.000 0.083 0.007 0.013 0.230 0.006 0.011 0.184 0.009 0.016 0.087 0.008 0.013 0.232 0.009 0.014 0.167 0.020 0.056 0.043 0.025 0.052 0.039 If our dataset also included work group and we wanted a five-way table, we could include both the sex and work-group variables in the by() option. You may include up to four variables in by(), and so produce up to 7-way tables. Video example Combining cross-tabulations and descriptives in Stata Methods and formulas The contents of cells are calculated by collapse and are displayed by tabdisp; see [D] collapse and [P] tabdisp. 2304 table — Flexible table of summary statistics Also see [R] summarize — Summary statistics [R] tabstat — Compact table of summary statistics [R] tabulate oneway — One-way table of frequencies [R] tabulate twoway — Two-way table of frequencies [D] collapse — Make dataset of summary statistics [P] tabdisp — Display tables Title tabstat — Compact table of summary statistics Syntax Remarks and examples Menu Acknowledgments Description Also see Options Syntax tabstat varlist if in weight , options Description options Main by(varname) statistics(statname . . . ) group statistics by variable report specified statistics Options labelwidth(#) varwidth(#) columns(variables) columns(statistics) format (% fmt) casewise nototal missing noseparator longstub save width for by() variable labels; default is labelwidth(16) variable width; default is varwidth(12) display variables in table columns; the default display statistics in table columns display format for statistics; default format is %9.0g perform casewise deletion of observations do not report overall statistics; use with by() report statistics for missing values of by() variable do not use separator line between by() categories make left table stub wider store summary statistics in r() by is allowed; see [D] by. aweights and fweights are allowed; see [U] 11.1.6 weight. Menu Statistics > Summaries, tables, and tests > Other tables > Compact table of summary statistics Description tabstat displays summary statistics for a series of numeric variables in one table, possibly broken down on (conditioned by) another variable. Without the by() option, tabstat is a useful alternative to summarize (see [R] summarize) because it allows you to specify the list of statistics to be displayed. With the by() option, tabstat resembles tabulate used with its summarize() option in that both report statistics of varlist for the different values of varname. tabstat allows more flexibility in terms of the statistics presented and the format of the table. tabstat is sensitive to the linesize (see set linesize in [R] log); it widens the table if possible and wraps if necessary. 2305 2306 tabstat — Compact table of summary statistics Options Main by(varname) specifies that the statistics be displayed separately for each unique value of varname; varname may be numeric or string. For instance, tabstat height would present the overall mean of height. 
tabstat height, by(sex) would present the mean height of males, and of females, and the overall mean height. Do not confuse the by() option with the by prefix (see [D] by); both may be specified. statistics(statname . . . ) specifies the statistics to be displayed; the default is equivalent to specifying statistics(mean). (stats() is a synonym for statistics().) Multiple statistics may be specified and are separated by white space, such as statistics(mean sd). Available statistics are statname mean count n sum max min range sd variance cv semean skewness kurtosis Definition mean count of nonmissing observations same as count sum maximum minimum range = max − min standard deviation variance coefficient of variation (sd/mean) √ standard error of mean (sd/ n) skewness kurtosis statname p1 p5 p10 p25 median p50 p75 p90 p95 p99 iqr q Definition 1st percentile 5th percentile 10th percentile 25th percentile median (same as p50) 50th percentile (same as median) 75th percentile 90th percentile 95th percentile 99th percentile interquartile range = p75 − p25 equivalent to specifying p25 p50 p75 Options labelwidth(#) specifies the maximum width to be used within the stub to display the labels of the by() variable. The default is labelwidth(16). 8 ≤ # ≤ 32. varwidth(#) specifies the maximum width to be used within the stub to display the names of the variables. The default is varwidth(12). varwidth() is effective only with columns(statistics). Setting varwidth() implies longstub. 8 ≤ # ≤ 16. columns(variables | statistics) specifies whether to display variables or statistics in the columns of the table. columns(variables) is the default when more than one variable is specified. format and format(% fmt) specify how the statistics are to be formatted. The default is to use a %9.0g format. format specifies that each variable’s statistics be formatted with the variable’s display format; see [D] format. format(% fmt) specifies the format to be used for all statistics. The maximum width of the specified format should not exceed nine characters. casewise specifies casewise deletion of observations. Statistics are to be computed for the sample that is not missing for any of the variables in varlist. The default is to use all the nonmissing values for each variable. nototal is for use with by(); it specifies that the overall statistics not be reported. tabstat — Compact table of summary statistics 2307 missing specifies that missing values of the by() variable be treated just like any other value and that statistics should be displayed for them. The default is not to report the statistics for the by()== missing group. If the by() variable is a string variable, by()=="" is considered to mean missing. noseparator specifies that a separator line between the by() categories not be displayed. longstub specifies that the left stub of the table be made wider so that it can include names of the statistics or variables in addition to the categories of by(varname). The default is to describe the statistics or variables in a header. longstub is ignored if by(varname) is not specified. save specifies that the summary statistics be returned in r(). The overall (unconditional) statistics are returned in matrix r(StatTotal) (rows are statistics, columns are variables). The conditional statistics are returned in the matrices r(Stat1), r(Stat2), . . . , and the names of the corresponding variables are returned in the macros r(name1), r(name2), . . . . 
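For instance, here is a minimal sketch (not part of the printed examples) of saving and then inspecting the stored results, assuming the automobile dataset is in memory:
. tabstat price mpg, statistics(mean sd) by(foreign) save
. return list
. matrix list r(StatTotal)
. matrix list r(Stat1)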
Remarks and examples This command is probably most easily understood by going through a series of examples. Example 1 We have data on the price, weight, mileage rating, and repair record of 22 foreign and 52 domestic 1978 automobiles. We want to summarize these variables for the different origins of the automobiles. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . tabstat price weight mpg rep78, by(foreign) Summary statistics: mean by categories of: foreign (Car type) foreign price weight mpg rep78 Domestic Foreign 6072.423 6384.682 3317.115 2315.909 19.82692 24.77273 3.020833 4.285714 Total 6165.257 3019.459 21.2973 3.405797 More summary statistics can be requested via the statistics() option. The group totals can be suppressed with the nototal option. . tabstat price weight mpg rep78, by(foreign) stat(mean sd min max) nototal Summary statistics: mean, sd, min, max by categories of: foreign (Car type) foreign price weight mpg rep78 Domestic 6072.423 3097.104 3291 15906 3317.115 695.3637 1800 4840 19.82692 4.743297 12 34 3.020833 .837666 1 5 Foreign 6384.682 2621.915 3748 12990 2315.909 433.0035 1760 3420 24.77273 6.611187 14 41 4.285714 .7171372 3 5 Although the header of the table describes the statistics running vertically in the “cells”, the table may become hard to read, especially with many variables or statistics. The longstub option specifies that a column be added describing the contents of the cells. The format option can be issued to 2308 tabstat — Compact table of summary statistics specify that tabstat display the statistics by using the display format of the variables rather than the overall default %9.0g. . tabstat price weight mpg rep78, by(foreign) stat(mean sd min max) long format price weight mpg rep78 foreign stats Domestic mean sd min max 6,072.4 3,097.1 3,291 15,906 3,317.1 695.364 1,800 4,840 19.8269 4.7433 12 34 3.02083 .837666 1 5 Foreign mean sd min max 6,384.7 2,621.9 3,748 12,990 2,315.9 433.003 1,760 3,420 24.7727 6.61119 14 41 4.28571 .717137 3 5 Total mean sd min max 6,165.3 2,949.5 3,291 15,906 3,019.5 777.194 1,760 4,840 21.2973 5.7855 12 41 3.4058 .989932 1 5 We can specify a layout of the table in which the statistics run horizontally and the variables run vertically by specifying the col(statistics) option. . tabstat price weight mpg rep78, by(foreign) stat(min mean max) col(stat) long min mean max foreign variable Domestic price weight mpg rep78 3291 1800 12 1 6072.423 3317.115 19.82692 3.020833 15906 4840 34 5 Foreign price weight mpg rep78 3748 1760 14 3 6384.682 2315.909 24.77273 4.285714 12990 3420 41 5 Total price weight mpg rep78 3291 1760 12 1 6165.257 3019.459 21.2973 3.405797 15906 4840 41 5 Finally, tabstat can also be used to enhance summarize so we can specify the statistics to be displayed. For instance, we can display the number of observations, the mean, the coefficient of variation, and the 25%, 50%, and 75% quantiles for a list of variables. . tabstat price weight mpg rep78, stat(n mean cv q) col(stat) variable N mean cv p25 p50 price weight mpg rep78 74 74 74 69 6165.257 3019.459 21.2973 3.405797 .478406 .2573949 .2716543 .290661 4195 2240 18 3 5006.5 3190 20 3 p75 6342 3600 25 4 Because we did not specify the by() option, these statistics were not displayed for the subgroups of the data formed by the categories of the by() variable. 
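Had we wanted those subgroup breakdowns as well, a sketch of one way to obtain them (output not shown here) is simply to add the by() option to the same command:
. tabstat price weight mpg rep78, stat(n mean cv q) col(stat) by(foreign)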
tabstat — Compact table of summary statistics 2309 Video example Descriptive statistics in Stata Acknowledgments The tabstat command was written by Jeroen Weesie and Vincent Buskens both of the Department of Sociology at Utrecht University, The Netherlands. Also see [R] summarize — Summary statistics [R] table — Flexible table of summary statistics [R] tabulate, summarize() — One- and two-way tables of summary statistics [D] collapse — Make dataset of summary statistics Title tabulate oneway — One-way table of frequencies Syntax Remarks and examples Menu Stored results Description References Options Also see Syntax One-way table tabulate varname if in weight , tabulate1 options One-way table for each variable—a convenience tool tab1 varlist if in weight , tab1 options tabulate1 options Description Main subpop(varname) missing nofreq nolabel plot sort exclude observations for which varname = 0 treat missing values like other values do not display frequencies display numeric codes rather than value labels produce a bar chart of the relative frequencies display the table in descending order of frequency Advanced generate(stubname) matcell(matname) matrow(matname) create indicator variables for stubname save frequencies in matname; programmer’s option save unique values of varname in matname; programmer’s option tab1 options Description Main subpop(varname) missing nofreq nolabel plot sort exclude observations for which varname = 0 treat missing values like other values do not display frequencies display numeric codes rather than value labels produce a bar chart of the relative frequencies display the table in descending order of frequency by is allowed with tabulate and tab1; see [D] by. fweights, aweights, and iweights are allowed by tabulate. fweights are allowed by tab1. See [U] 11.1.6 weight. 2310 tabulate oneway — One-way table of frequencies 2311 Menu tabulate oneway Statistics > Summaries, tables, and tests > Frequency tables > One-way table tabulate ..., generate() Data > Create or change data > Other variable-creation commands > Create indicator variables tab1 Statistics > Summaries, tables, and tests > Frequency tables > Multiple one-way tables Description tabulate produces a one-way table of frequency counts. For information about a two-way table of frequency counts along with various measures of association, including the common Pearson χ2 , the likelihood-ratio χ2 , Cramér’s V , Fisher’s exact test, Goodman and Kruskal’s gamma, and Kendall’s τb , see [R] tabulate twoway. tab1 produces a one-way tabulation for each variable specified in varlist. Also see [R] table and [R] tabstat if you want one-, two-, or n-way table of frequencies and a wide variety of summary statistics. See [R] tabulate, summarize() for a description of tabulate with the summarize() option; it produces a table (breakdowns) of means and standard deviations. table is better than tabulate, summarize(), but tabulate, summarize() is faster. See [ST] epitab for a 2 × 2 table with statistics of interest to epidemiologists. Options Main subpop(varname) excludes observations for which varname = 0 in tabulating frequencies. The mathematical results of tabulate . . ., subpop(myvar) are the same as tabulate . . . if myvar !=0, but the table may be presented differently. The identities of the rows and columns will be determined from all the data, including the myvar = 0 group, so there may be entries in the table with frequency 0. 
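For example, the two commands compared in the next paragraph would be typed as follows (a sketch; answer and male are hypothetical variables, not part of any shipped dataset):
. tabulate answer if male==1
. tabulate answer, subpop(male)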
Consider tabulating answer, a variable that takes on values 1, 2, and 3, but consider tabulating it just for the male==1 subpopulation. Assume that answer is never 2 in this group. tabulate answer if male==1 produces a table with two rows: one for answer 1 and one for answer 3. There will be no row for answer 2 because answer 2 was never observed. tabulate answer, subpop(male) produces a table with three rows. The row for answer 2 will be shown as having 0 frequency. missing requests that missing values be treated like other values in calculations of counts, percentages, and other statistics. nofreq suppresses the printing of the frequencies. nolabel causes the numeric codes to be displayed rather than the value labels. plot produces a bar chart of the relative frequencies in a one-way table. (Also see [R] histogram.) sort puts the table in descending order of frequency (and ascending order of the variable within equal values of frequency). 2312 tabulate oneway — One-way table of frequencies Advanced generate(stubname) creates a set of indicator variables (stubname1, stubname2, . . . ) reflecting the observed values of the tabulated variable. The generate() option may not be used with the by prefix. matcell(matname) saves the reported frequencies in matname. This option is for use by programmers. matrow(matname) saves the numeric values of the r × 1 row stub in matname. This option is for use by programmers. matrow() may not be specified if the row variable is a string. Limits A one-way table may have a maximum of 12,000 rows (Stata/MP and Stata/SE), 3,000 rows (Stata/IC), or 500 rows (Small Stata). Remarks and examples Remarks are presented under the following headings: tabulate tab1 Video example For each value of a specified variable, tabulate reports the number of observations with that value. The number of times a value occurs is called its frequency. tabulate Example 1 We have data summarizing the speed limit and the accident rate per million vehicle miles along various Minnesota highways in 1973. The variable containing the speed limit is called spdlimit. If we summarize the variable, we obtain its mean and standard deviation: . use http://www.stata-press.com/data/r13/hiway (Minnesota Highway Data, 1973) . summarize spdlimit Variable Obs Mean spdlimit 39 55 Std. Dev. Min Max 5.848977 40 70 The average speed limit is 55 miles per hour. We can learn more about this variable by tabulating it: . tabulate spdlimit Speed Limit Freq. Percent Cum. 40 45 50 55 60 65 70 1 3 7 15 11 1 1 2.56 7.69 17.95 38.46 28.21 2.56 2.56 2.56 10.26 28.21 66.67 94.87 97.44 100.00 Total 39 100.00 tabulate oneway — One-way table of frequencies 2313 We see that one highway has a speed limit of 40 miles per hour, three have speed limits of 45, 7 of 50, and so on. The column labeled Percent shows the percentage of highways in the dataset that have the indicated speed limit. For instance, 38.46% of highways in our dataset have a speed limit of 55 miles per hour. The final column shows the cumulative percentage. We see that 66.67% of highways in our dataset have a speed limit of 55 miles per hour or less. Example 2 The plot option places a sideways histogram alongside the table: . tabulate spdlimit, plot Speed Limit Freq. 40 45 50 55 60 65 70 1 3 7 15 11 1 1 Total 39 * *** ******* *************** *********** * * Of course, graph can produce better-looking histograms; see [R] histogram. Example 3 tabulate labels tables using variable and value labels if they exist. 
To demonstrate how this works, let’s add a new variable to our dataset that categorizes spdlimit into three categories. We will call this new variable spdcat: . generate spdcat=recode(spdlimit,50,60,70) The recode() function divides spdlimit into 50 miles per hour or below, 51 – 60, and above 60; see [D] functions. We specified the breakpoints in the arguments (spdlimit,50,60,70). The first argument is the variable to be recoded. The second argument is the first breakpoint, the third argument is the second breakpoint, and so on. We can specify as many breakpoints as we wish. recode() used our arguments not only as the breakpoints but also to label the results. If spdlimit is less than or equal to 50, spdcat is set to 50; if spdlimit is between 51 and 60, spdcat is 60; otherwise, spdcat is arbitrarily set to 70. (See [U] 25 Working with categorical data and factor variables.) Because we just created the variable spdcat, it is not yet labeled. When we make a table using this variable, tabulate uses the variable’s name to label it: . tabulate spdcat spdcat Freq. Percent Cum. 50 60 70 11 26 2 28.21 66.67 5.13 28.21 94.87 100.00 Total 39 100.00 2314 tabulate oneway — One-way table of frequencies Even through the table is not well labeled, recode()’s coding scheme provides us with clues as to the table’s meaning. The first line of the table corresponds to 50 miles per hour and below, the next to 51 through 60 miles per hour, and the last to above 60 miles per hour. We can improve this table by labeling the values and variables: . label define scat 50 "40 to 50" 60 "55 to 60" 70 "Above 60" . label values spdcat scat . label variable spdcat "Speed Limit Category" We define a value label called scat that attaches labels to the numbers 50, 60, and 70 using the label define command; see [U] 12.6.3 Value labels. We label the value 50 as ‘40 to 50’, because we looked back at our original tabulation in the first example and saw that the speed limit was never less than 40. Similarly, we could have labeled the last category ‘65 to 70’ because the speed limit is never greater than 70 miles per hour. Next we requested that Stata label the values of the new variable spdcat using the value label scat. Finally, we labeled our variable Speed Limit Category. We are now ready to tabulate the result: . tabulate spdcat Speed Limit Category Freq. Percent Cum. 40 to 50 55 to 60 Above 60 11 26 2 28.21 66.67 5.13 28.21 94.87 100.00 Total 39 100.00 Example 4 If we have missing values in our dataset, tabulate ignores them unless we explicitly indicate otherwise. We have no missing data in our example, so let’s add some: . replace spdcat=. in 39 (1 real change made, 1 to missing) We changed the first observation on spdcat to missing. Let’s now tabulate the result: . tabulate spdcat Speed Limit Category Freq. Percent Cum. 40 to 50 55 to 60 Above 60 11 26 1 28.95 68.42 2.63 28.95 97.37 100.00 Total 38 100.00 Comparing this output with that in the previous example, we see that the total frequency count is now one less than it was — 38 rather than 39. Also, the ‘Above 60’ category now has only one observation where it used to have two, so we evidently changed a road with a high speed limit. We want tabulate to treat missing values just as it treats numbers, so we specify the missing option: tabulate oneway — One-way table of frequencies . tabulate spdcat, missing Speed Limit Category Freq. Percent Cum. 28.21 94.87 97.44 100.00 40 to 50 55 to 60 Above 60 . 
11 26 1 1 28.21 66.67 2.56 2.56 Total 39 100.00 2315 We now see our missing value — the last category, labeled ‘.’, shows a frequency count of 1. The table sum is once again 39. Let’s put our dataset back as it was originally: . replace spdcat=70 in 39 (1 real change made) Technical note tabulate also can automatically create indicator variables from categorical variables. We will briefly review that capability here, but see [U] 25 Working with categorical data and factor variables for a complete description. Let’s begin by describing our highway dataset: . describe Contains data from http://www.stata-press.com/data/r13/hiway.dta obs: 39 Minnesota Highway Data, 1973 vars: 3 16 Nov 2012 12:39 size: 351 variable name storage type display format value label spdlimit rate byte float %8.0g %9.0g rcat spdcat float %9.0g scat Sorted by: Note: variable label Speed Limit Accident rate per million vehicle miles Speed Limit Category dataset has changed since last saved Our dataset contains three variables. We will type tabulate spdcat, generate(spd), describe our data, and then explain what happened. . tabulate spdcat, generate(spd) Speed Limit Category Freq. Percent 40 to 50 55 to 60 Above 60 11 26 2 28.21 66.67 5.13 Total 39 100.00 Cum. 28.21 94.87 100.00 2316 tabulate oneway — One-way table of frequencies . describe Contains data from http://www.stata-press.com/data/r13/hiway.dta obs: 39 Minnesota Highway Data, 1973 vars: 6 16 Nov 2012 12:39 size: 468 variable name storage type display format value label rcat spdlimit rate byte float %8.0g %9.0g spdcat spd1 spd2 spd3 float byte byte byte %9.0g %8.0g %8.0g %8.0g Sorted by: Note: scat variable label Speed Limit Accident rate per million vehicle miles Speed Limit Category spdcat==40 to 50 spdcat==55 to 60 spdcat==Above 60 dataset has changed since last saved When we typed tabulate with the generate() option, Stata responded by producing a one-way frequency table, so it appeared that the option did nothing. Yet when we describe our dataset, we find that we now have six variables instead of the original three. The new variables are named spd1, spd2, and spd3. When we specify the generate() option, we are telling Stata to not only produce the table but also create a set of indicator variables that correspond to that table. Stata adds a numeric suffix to the name we specify in the parentheses. spd1 refers to the first line of the table, spd2 to the second line, and so on. Also Stata labels the variables so that we know what they mean. spd1 is an indicator variable that is true (takes on the value 1) when spdcat is between 40 and 50; otherwise, it is zero. (There is an exception: if spdcat is missing, so are the spd1, spd2, and spd3 variables. This did not happen in our dataset.) We want to prove our claim. Because we have not yet introduced two-way tabulations, we will use the summarize statement: . summarize spdlimit if spd1==1 Obs Mean Variable spdlimit 11 47.72727 . summarize spdlimit if spd2==1 Obs Mean Variable spdlimit 26 57.11538 . summarize spdlimit if spd3==1 Variable Obs Mean spdlimit 2 67.5 Std. Dev. Min Max 3.437758 40 50 Std. Dev. Min Max 2.519157 55 60 Std. Dev. Min Max 3.535534 65 70 Notice the indicated minimum and maximum in each of the tables above. When we restrict the sample to spd1, spdlimit is between 40 and 50; when we restrict the sample to spd2, spdlimit is between 55 and 60; when we restrict the sample to spd3, spdlimit is between 65 and 70. 
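Another quick check (a sketch, not part of the original example) is to list the new indicator variables alongside spdcat for a few observations:
. list spdcat spd1 spd2 spd3 in 1/5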
Thus tabulate provides an easy way to create indicator (sometimes called dummy) variables. For an overview of indicator and categorical variables, see [U] 25 Working with categorical data and factor variables. tabulate oneway — One-way table of frequencies 2317 tab1 tab1 is a convenience tool. Typing . tab1 myvar thisvar thatvar, plot is equivalent to typing . tabulate myvar, plot . tabulate thisvar, plot . tabulate thatvar, plot Video example Tables and cross-tabulations in Stata Stored results tabulate and tab1 store the following in r(): Scalars r(N) number of observations r(r) number of rows References Cox, N. J. 2009. Speaking Stata: I. J. Good and quasi-Bayes smoothing of categorical frequencies. Stata Journal 9: 306–314. Harrison, D. A. 2006. Stata tip 34: Tabulation by listing. Stata Journal 6: 425–427. Also see [R] table — Flexible table of summary statistics [R] tabstat — Compact table of summary statistics [R] tabulate twoway — Two-way table of frequencies [R] tabulate, summarize() — One- and two-way tables of summary statistics [D] collapse — Make dataset of summary statistics [ST] epitab — Tables for epidemiologists [SVY] svy: tabulate oneway — One-way tables for survey data [SVY] svy: tabulate twoway — Two-way tables for survey data [XT] xttab — Tabulate xt data [U] 12.6.3 Value labels [U] 25 Working with categorical data and factor variables Title tabulate twoway — Two-way table of frequencies Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax Two-way table tabulate varname1 varname2 if in weight , options Two-way table for all possible combinations—a convenience tool tab2 varlist if in weight , options Immediate form of two-way tabulations tabi # 11 # 12 . . . \ # 21 # 22 . . . \ ... , options options Description Main chi2 exact (#) gamma lrchi2 taub V cchi2 column row clrchi2 cell expected nofreq missing wrap no key nolabel nolog ∗ firstonly report Pearson’s χ2 report Fisher’s exact test report Goodman and Kruskal’s gamma report likelihood-ratio χ2 report Kendall’s τb report Cramér’s V report Pearson’s χ2 in each cell report relative frequency within its column of each cell report relative frequency within its row of each cell report likelihood-ratio χ2 in each cell report the relative frequency of each cell report expected frequency in each cell do not display frequencies treat missing values like other values do not wrap wide tables report/suppress cell contents key display numeric codes rather than value labels do not display enumeration log for Fisher’s exact test show only tables that include the first variable in varlist 2318 tabulate twoway — Two-way table of frequencies 2319 Advanced matcell(matname) matrow(matname) matcol(matname) ‡ replace equivalent to specifying chi2 lrchi2 V gamma taub all ∗ save frequencies in matname; programmer’s option save unique values of varname1 in matname; programmer’s option save unique values of varname2 in matname; programmer’s option replace current data with given cell frequencies firstonly is available only for tab2. ‡ replace is available only for tabi. by is allowed with tabulate and tab2; see [D] by. fweights, aweights, and iweights are allowed by tabulate. fweights are allowed by tab2. See [U] 11.1.6 weight. all does not appear in the dialog box. 
Menu tabulate Statistics > Summaries, tables, and tests > Frequency tables > Two-way table with measures of association > Summaries, tables, and tests > Frequency tables > All possible two-way tables > Summaries, tables, and tests > Frequency tables > Table calculator tab2 Statistics tabi Statistics Description tabulate produces a two-way table of frequency counts, along with various measures of association, including the common Pearson’s χ2 , the likelihood-ratio χ2 , Cramér’s V , Fisher’s exact test, Goodman and Kruskal’s gamma, and Kendall’s τb . Line size is respected. That is, if you resize the Results window before running tabulate, the resulting two-way tabulation will take advantage of the available horizontal space. Stata for Unix(console) users can instead use the set linesize command to take advantage of this feature. tab2 produces all possible two-way tabulations of the variables specified in varlist. tabi displays the r × c table, using the values specified; rows are separated by ‘\’. If no options are specified, it is as if exact were specified for a 2 × 2 table and chi2 were specified otherwise. See [U] 19 Immediate commands for a general description of immediate commands. See Tables with immediate data below for examples using tabi. See [R] tabulate oneway if you want a one-way table of frequencies. See [R] table and [R] tabstat if you want one-, two-, or n-way table of frequencies and a wide variety of summary statistics. See [R] tabulate, summarize() for a description of tabulate with the summarize() option; it produces a table (breakdowns) of means and standard deviations. table is better than tabulate, summarize(), but tabulate, summarize() is faster. See [ST] epitab for a 2 × 2 table with statistics of interest to epidemiologists. 2320 tabulate twoway — Two-way table of frequencies Options Main chi2 calculates and displays Pearson’s χ2 for the hypothesis that the rows and columns in a two-way table are independent. chi2 may not be specified if aweights or iweights are specified. exact (#) displays the significance calculated by Fisher’s exact test and may be applied to r × c as well as to 2 × 2 tables. For 2 × 2 tables, both one- and two-sided probabilities are displayed. For r × c tables, one-sided probabilities are displayed. The optional positive integer # is a multiplier on the amount of memory that the command is permitted to consume. The default is 1. This option should not be necessary for reasonable r × c tables. If the command terminates with error 910, try exact(2). The maximum row or column dimension allowed when computing Fisher’s exact test is the maximum row or column dimension for tabulate (see [R] limits). gamma displays Goodman and Kruskal’s gamma along with its asymptotic standard error. gamma is appropriate only when both variables are ordinal. gamma may not be specified if aweights or iweights are specified. lrchi2 displays the likelihood-ratio χ2 statistic. lrchi2 may not be specified if aweights or iweights are specified. taub displays Kendall’s τb along with its asymptotic standard error. taub is appropriate only when both variables are ordinal. taub may not be specified if aweights or iweights are specified. V (note capitalization) displays Cramér’s V . V may not be specified if aweights or iweights are specified. cchi2 displays each cell’s contribution to Pearson’s chi-squared in a two-way table. column displays the relative frequency of each cell within its column in a two-way table. 
row displays the relative frequency of each cell within its row in a two-way table. clrchi2 displays each cell’s contribution to the likelihood-ratio chi-squared in a two-way table. cell displays the relative frequency of each cell in a two-way table. expected displays the expected frequency of each cell in a two-way table. nofreq suppresses the printing of the frequencies. missing requests that missing values be treated like other values in calculations of counts, percentages, and other statistics. wrap requests that Stata take no action on wide, two-way tables to make them readable. Unless wrap is specified, wide tables are broken into pieces to enhance readability. no key suppresses or forces the display of a key above two-way tables. The default is to display the key if more than one cell statistic is requested, and otherwise to omit it. key forces the display of the key. nokey suppresses its display. nolabel causes the numeric codes to be displayed rather than the value labels. nolog suppresses the display of the log for Fisher’s exact test. Using Fisher’s exact test requires counting all tables that have a probability exceeding that of the observed table given the observed row and column totals. The log counts down each stage of the network computations, starting from the number of columns and counting down to 1, displaying the number of nodes in the network at each stage. A log is not displayed for 2 × 2 tables. firstonly, available only with tab2, restricts the output to only those tables that include the first variable in varlist. Use this option to interact one variable with a set of others. tabulate twoway — Two-way table of frequencies 2321 Advanced matcell(matname) saves the reported frequencies in matname. This option is for use by programmers. matrow(matname) saves the numeric values of the r × 1 row stub in matname. This option is for use by programmers. matrow() may not be specified if the row variable is a string. matcol(matname) saves the numeric values of the 1 × c column stub in matname. This option is for use by programmers. matcol() may not be specified if the column variable is a string. replace indicates that the immediate data specified as arguments to the tabi command be left as the current data in place of whatever data were there. The following option is available with tabulate but is not shown in the dialog box: all is equivalent to specifying chi2 lrchi2 V gamma taub. Note the omission of exact. When all is specified, no may be placed in front of the other options. all noV requests all association measures except Cramér’s V (and Fisher’s exact). all exact requests all association measures, including Fisher’s exact test. all may not be specified if aweights or iweights are specified. Limits Two-way tables may have a maximum of 1,200 rows and 80 columns (Stata/MP and Stata/SE), 300 rows and 20 columns (Stata/IC), or 160 rows and 20 columns (Small Stata). If larger tables are needed, see [R] table. Remarks and examples Remarks are presented under the following headings: tabulate Measures of association N-way tables Weighted data Tables with immediate data tab2 Video examples For each value of a specified variable (or a set of values for a pair of variables), tabulate reports the number of observations with that value. The number of times a value occurs is called its frequency. tabulate Example 1 tabulate will make two-way tables if we specify two variables following the word tabulate. 
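In its simplest form, a sketch of the call (with hypothetical variables rowvar and colvar) is
. tabulate rowvar colvar
and options such as chi2, exact, or row may be added after a comma.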
In our highway dataset, we have a variable called rate that divides the accident rate into three categories: below 4, 4 – 7, and above 7 per million vehicle miles. Let’s make a table of the speed limit category and the accident-rate category: 2322 tabulate twoway — Two-way table of frequencies . use http://www.stata-press.com/data/r13/hiway2 (Minnesota Highway Data, 1973) . tabulate spdcat rate Accident rate per million Speed vehicle miles Limit Category Below 4 4-7 Above 7 Total 40 to 50 55 to 50 Above 60 3 19 2 5 6 0 3 1 0 11 26 2 Total 24 11 4 39 The table indicates that three stretches of highway have an accident rate below 4 and a speed limit of 40 to 50 miles per hour. The table also shows the row and column sums (called the marginals). The number of highways with a speed limit of 40 to 50 miles per hour is 11, which is the same result we obtained in our previous one-way tabulations. Stata can present this basic table in several ways — 16, to be precise — and we will show just a few below. It might be easier to read the table if we included the row percentages. For instance, of 11 highways in the lowest speed limit category, three are also in the lowest accident-rate category. Three-elevenths amounts to some 27.3%. We can ask Stata to fill in this information for us by using the row option: . tabulate spdcat rate, row Key frequency row percentage Speed Limit Category Accident rate per million vehicle miles Below 4 4-7 Above 7 Total 40 to 50 3 27.27 5 45.45 3 27.27 11 100.00 55 to 50 19 73.08 6 23.08 1 3.85 26 100.00 Above 60 2 100.00 0 0.00 0 0.00 2 100.00 Total 24 61.54 11 28.21 4 10.26 39 100.00 The number listed below each frequency is the percentage of cases that each cell represents out of its row. That is easy to remember because we see 100% listed in the “Total” column. The bottom row is also informative. We see that 61.54% of all the highways in our dataset fall into the lowest accident-rate category, that 28.21% are in the middle category, and that 10.26% are in the highest. tabulate can calculate column percentages and cell percentages, as well. It does so when we specify the column or cell options, respectively. We can even specify them together. Below is a table that includes everything: tabulate twoway — Two-way table of frequencies 2323 . tabulate spdcat rate, row column cell Key frequency row percentage column percentage cell percentage Speed Limit Category Accident rate per million vehicle miles Below 4 4-7 Above 7 Total 40 to 50 3 27.27 12.50 7.69 5 45.45 45.45 12.82 3 27.27 75.00 7.69 11 100.00 28.21 28.21 55 to 50 19 73.08 79.17 48.72 6 23.08 54.55 15.38 1 3.85 25.00 2.56 26 100.00 66.67 66.67 Above 60 2 100.00 8.33 5.13 0 0.00 0.00 0.00 0 0.00 0.00 0.00 2 100.00 5.13 5.13 Total 24 61.54 100.00 61.54 11 28.21 100.00 28.21 4 10.26 100.00 10.26 39 100.00 100.00 100.00 The number at the top of each cell is the frequency count. The second number is the row percentage — they sum to 100% going across the table. The third number is the column percentage — they sum to 100% going down the table. The bottom number is the cell percentage — they sum to 100% going down all the columns and across all the rows. For instance, highways with a speed limit above 60 miles per hour and in the lowest accident rate category account for 100% of highways with a speed limit above 60 miles per hour; 8.33% of highways in the lowest accident-rate category; and 5.13% of all our data. A fourth option, nofreq, tells Stata not to print the frequency counts. 
To construct a table consisting of only row percentages, we type . tabulate spdcat rate, row nofreq Speed Accident rate per million vehicle miles Limit Category Below 4 4-7 Above 7 Total 40 to 50 55 to 50 Above 60 27.27 73.08 100.00 45.45 23.08 0.00 27.27 3.85 0.00 100.00 100.00 100.00 Total 61.54 28.21 10.26 100.00 2324 tabulate twoway — Two-way table of frequencies Measures of association Example 2 tabulate will calculate the Pearson χ2 test for the independence of the rows and columns if we specify the chi2 option. Suppose that we have 1980 census data on 956 cities in the United States and wish to compare the age distribution across regions of the country. Assume that agecat is the median age in each city and that region denotes the region of the country in which the city is located. . use http://www.stata-press.com/data/r13/citytemp2 (City Temperature Data) . tabulate region agecat, chi2 Census agecat 19-29 30-34 35+ Total Region NE N Cntrl South West 46 162 139 160 Total 507 Pearson chi2(6) = 83 92 68 73 316 61.2877 37 30 43 23 166 284 250 256 133 956 Pr = 0.000 We obtain the standard two-way table and, at the bottom, a summary of the χ2 test. Stata informs us that the χ2 associated with this table has 6 degrees of freedom and is 61.29. The observed differences are significant. The table is, perhaps, easier to understand if we suppress the frequencies and print just the row percentages: . tabulate region agecat, row nofreq chi2 Census agecat 19-29 30-34 35+ Region NE N Cntrl South West 27.71 57.04 55.60 62.50 50.00 32.39 27.20 28.52 Total 53.03 33.05 Pearson chi2(6) = 61.2877 22.29 10.56 17.20 8.98 Total 100.00 100.00 100.00 100.00 13.91 100.00 Pr = 0.000 Example 3 We have data on dose level and outcome for a set of patients and wish to evaluate the association between the two variables. We can obtain all the association measures by specifying the all and exact options: tabulate twoway — Two-way table of frequencies . use http://www.stata-press.com/data/r13/dose . tabulate dose function, all exact Enumerating sample-space combinations: stage 3: enumerations = 1 stage 2: enumerations = 9 stage 1: enumerations = 0 Function Dosage < 1 hr 1 to 4 4+ 1/day 2/day 3/day 20 16 10 Total 46 Pearson chi2(4) = likelihood-ratio chi2(4) = Cramr’s V = gamma = Kendall’s tau-b = Fisher’s exact = 2325 Total 10 12 16 2 4 6 32 32 32 38 6.7780 6.9844 0.1879 0.3689 0.2378 12 Pr = 0.148 Pr = 0.137 96 ASE = 0.129 ASE = 0.086 0.145 We find evidence of association but not enough to be truly convincing. If we had not also specified the exact option, we would not have obtained Fisher’s exact test. Stata can calculate this statistic both for 2 × 2 tables and for r × c. For 2 × 2 tables, the calculation is almost instant. On more general tables, however, the calculation can take longer. We carefully constructed our example so that all would be meaningful. Kendall’s τb and Goodman and Kruskal’s gamma are relevant only when both dimensions of the table can be ordered, say, from low to high or from worst to best. The other statistics, however, are always applicable. Technical note Be careful when attempting to compute the p-value for Fisher’s exact test because the number of tables that contribute to the p-value can be extremely large and a solution may not be feasible. The errors that are indicative of this situation are errors 910, exceeded memory limitations, and 1401, integer overflow due to large row-margin frequencies. 
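A sketch of the remedy discussed next, rerunning the tabulation with a larger memory multiplier (rowvar and colvar are hypothetical variable names):
. tabulate rowvar colvar, exact(2)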
If execution terminates because of memory limitations, use exact(2) to permit the algorithm to consume twice the memory, exact(3) for three times the memory, etc. The default memory usage should be sufficient for reasonable tables. N-way tables If you need more than two-way tables, your best alternative to is use table, not tabulate; see [R] table. The technical note below shows you how to use tabulate to create a sequence of two-way tables that together form, in effect, a three-way table, but using table is easy and produces prettier results: 2326 tabulate twoway — Two-way table of frequencies . use http://www.stata-press.com/data/r13/birthcat (City data) . table birthcat region agecat, c(freq) birthcat NE 29-136 137-195 196-529 birthcat 29-136 137-195 196-529 agecat and Census Region 19-29 30-34 N Cntrl South West NE N Cntrl South 11 31 4 23 97 38 11 65 59 11 46 91 34 48 1 27 58 3 10 45 12 West 8 42 21 agecat and Census Region 35+ NE N Cntrl South West 34 3 26 4 27 7 4 18 4 Technical note We can make n-way tables by combining the by varlist: prefix with tabulate. Continuing with the dataset of 956 cities, say that we want to make a table of age category by birth-rate category by region of the country. The birth-rate category variable is named birthcat in our dataset. To make separate tables for each age category, we would type . by agecat, sort: tabulate birthcat region -> agecat = 19-29 Census Region N Cntrl South birthcat NE West Total 29-136 137-195 196-529 11 31 4 23 97 38 11 65 59 11 46 91 56 239 192 Total 46 158 135 148 487 Census Region N Cntrl South West Total -> agecat = 30-34 birthcat NE 29-136 137-195 196-529 34 48 1 27 58 3 10 45 12 8 42 21 79 193 37 Total 83 88 67 71 309 tabulate twoway — Two-way table of frequencies 2327 -> agecat = 35+ Census Region N Cntrl South birthcat NE 29-136 137-195 196-529 34 3 0 26 4 0 Total 37 30 West Total 27 7 4 18 4 0 105 18 4 38 22 127 Weighted data Example 4 tabulate can process weighted as well as unweighted data. As with all Stata commands, we indicate the weight by specifying the [weight] modifier; see [U] 11.1.6 weight. Continuing with our dataset of 956 cities, we also have a variable called pop, the population of each city. We can make a table of region by age category, weighted by population, by typing . tabulate region agecat [freq=pop] Census agecat Region 19-29 30-34 NE N Cntrl South West Total 35+ Total 5,323,610 4,015,593 4,141,863 2,375,118 20,466,384 29,881,899 25,722,648 26,828,649 52,215,467 34,827,929 15,856,184 102899580 4,721,387 10,421,387 16,901,550 8,964,756 13,894,254 7,686,531 16,698,276 7,755,255 If we specify the cell, column, or row options, they will also be appropriately weighted. Below we repeat the table, suppressing the counts and substituting row percentages: . tabulate region agecat [freq=pop], nofreq row Census agecat Region 19-29 30-34 35+ Total NE N Cntrl South West 23.07 56.56 54.02 62.24 50.92 30.00 29.88 28.91 26.01 13.44 16.10 8.85 100.00 100.00 100.00 100.00 Total 50.74 33.85 15.41 100.00 2328 tabulate twoway — Two-way table of frequencies Tables with immediate data Example 5 tabi ignores the dataset in memory and uses as the table the values that we specify on the command line: . tabi 30 18 \ 38 14 col 1 row 1 2 30 38 Total 68 Fisher’s exact = 1-sided Fisher’s exact = 2 Total 18 14 48 52 32 100 0.289 0.179 We may specify any of the options of tabulate and are not limited to 2 × 2 tables: . 
tabi 30 18 38 \ 13 7 22, chi2 exact Enumerating sample-space combinations: stage 3: enumerations = 1 stage 2: enumerations = 3 stage 1: enumerations = 0 col row 1 2 3 Total 1 2 30 13 18 7 38 22 86 42 Total 43 25 60 128 Pearson chi2(2) = 0.7967 Pr = 0.671 Fisher’s exact = 0.707 . tabi 30 13 \ 18 7 \ 38 22, all exact col Key frequency column percentage Enumerating sample-space stage 3: enumerations = stage 2: enumerations = stage 1: enumerations = row combinations: 1 3 0 col 1 2 Total 1 30 34.88 13 30.95 43 33.59 2 18 20.93 7 16.67 25 19.53 3 38 44.19 22 52.38 60 46.88 Total 86 100.00 42 100.00 128 100.00 tabulate twoway — Two-way table of frequencies Pearson chi2(2) = likelihood-ratio chi2(2) = Cramr’s V = gamma = Kendall’s tau-b = Fisher’s exact = 0.7967 0.7985 0.0789 0.1204 0.0630 2329 Pr = 0.671 Pr = 0.671 ASE = 0.160 ASE = 0.084 0.707 For 2 × 2 tables, both one- and two-sided Fisher’s exact probabilities are displayed; this is true of both tabulate and tabi. See Cumulative incidence data and Case–control data in [ST] epitab for more discussion on the relationship between one- and two-sided probabilities. Technical note tabi, as with all immediate commands, leaves any data in memory undisturbed. With the replace option, however, the data in memory are replaced by the data from the table: . tabi 30 18 \ 38 14, replace col row 1 2 Total 1 2 30 38 18 14 48 52 Total 68 32 100 Fisher’s exact = 1-sided Fisher’s exact = . list 1. 2. 3. 4. row col pop 1 1 2 2 1 2 1 2 30 18 38 14 0.289 0.179 With this dataset, you could re-create the above table by typing . tabulate row col [freq=pop], exact col row 1 2 1 2 Total 30 38 68 Fisher’s exact = 1-sided Fisher’s exact = 18 14 32 Total 48 52 100 0.289 0.179 2330 tabulate twoway — Two-way table of frequencies tab2 tab2 is a convenience tool. Typing . tab2 myvar thisvar thatvar, chi2 is equivalent to typing . tabulate myvar thisvar, chi2 . tabulate myvar thatvar, chi2 . tabulate thisvar thatvar, chi2 Video examples Pearson’s chi2 and Fisher’s exact test in Stata Tables and cross-tabulations in Stata Immediate commands in Stata: Cross-tabulations and chi-squared tests from summary data Stored results tabulate, tab2, and tabi store the following in r(): Scalars r(N) r(r) r(c) r(chi2) r(p) r(gamma) r(p1 exact) number of observations number of rows number of columns Pearson’s χ2 significance of Pearson’s χ2 gamma one-sided Fisher’s exact p Fisher’s exact p likelihood-ratio χ2 significance of likelihood-ratio χ2 Cramér’s V ASE of gamma ASE of τb r(p exact) r(chi2 lr) r(p lr) r(CramersV) r(ase gam) r(ase taub) r(taub) τb r(p1 exact) is defined only for 2×2 tables. Also, the matrow(), matcol(), and matcell() options allow you to obtain the row values, column values, and frequencies, respectively. Methods and formulas Let nij , i = 1, . . . , I and j = 1, . . . , J , be the number of observations in the ith row and j th column. If the data are not weighted, nij is just a count. If the data are weighted, nij is the sum of the weights of all data corresponding to the (i, j) cell. Define the row and column marginals as ni· = J X nij n·j = j=1 and let n = P P i Aij = j I X nij i=1 nij be the overall sum. 
Also, define the concordance and discordance as XX k>i l>j nkl + XX nkl Dij = ki l j nij Aij and twice the number of discordances tabulate twoway — Two-way table of frequencies 2331 The Pearson χ2 statistic with (I − 1)(J − 1) degrees of freedom (so called because it is based on Pearson (1900); see Conover [1999, 240] and Fienberg [1980, 9]) is defined as X2 = X X (nij − mij )2 mij i j where mij = ni· n·j /n. The likelihood-ratio χ2 statistic with (I − 1)(J − 1) degrees of freedom (Fienberg 1980, 40) is defined as XX G2 = 2 nij ln(nij /mij ) i j Cramér’s V (Cramér 1946) is a measure of association designed so that the attainable upper bound is 1. For 2 × 2 tables, −1 ≤ V ≤ 1, and otherwise, 0 ≤ V ≤ 1. ( V = (n11 n22 − n12 n21 )/(n1· n2· n·1 n·2 )1/2 2 1/2 (X /n)/min(I − 1, J − 1) for 2 × 2 otherwise Gamma (Goodman and Kruskal 1954, 1959, 1963, 1972; also see Agresti [2010,186 – 188]) ignores tied pairs and is based only on the number of concordant and discordant pairs of observations, −1 ≤ γ ≤ 1, γ = (P − Q)/(P + Q) with asymptotic variance 16 XX i nij (QAij − P Dij )2 /(P + Q)4 j Kendall’s τb (Kendall 1945; also see Agresti 2010, 188 – 189), −1 ≤ τb ≤ 1, is similar to gamma, except that it uses a correction for ties, τb = (P − Q)/(wr wc )1/2 with asymptotic variance P P i j nij (2wr wc dij + τb vij )2 − n3 τb2 (wr + wc )2 (wr wc )4 2332 tabulate twoway — Two-way table of frequencies where wr =n2 − X n2i· i 2 wc =n − X n2·j j dij =Aij − Dij vij =ni· wc + n·j wr Fisher’s exact test (Fisher 1935; Finney 1948; see Zelterman and Louis [1992, 293 – 301] for the 2 × 2 case) yields the probability of observing a table that gives at least as much evidence of association as the one actually observed under the assumption of no association. Holding row and column marginals fixed, the hypergeometric probability P of every possible table A is computed, and the X P = Pr(T ) T ∈A where A is the set of all tables with the same marginals as the observed table, T ∗ , such that Pr(T ) ≤ Pr(T ∗ ). For 2 × 2 tables, the one-sided probability is calculated by further restricting A to tables in the same tail as T ∗ . The first algorithm extending this calculation to r × c tables was Pagano and Halvorsen (1981); the one implemented here is the FEXACT algorithm by Mehta and Patel (1986). This is a search-tree clipping method originally published by Mehta and Patel (1983) with further refinements by Joe (1988) and Clarkson, Fan, and Joe (1993). Fisher’s exact test is a permutation test. For more information on permutation tests, see Good (2005 and 2006) and Pesarin (2001). References Agresti, A. 2010. Analysis of Ordinal Categorical Data. 2nd ed. Hoboken, NJ: Wiley. Campbell, M. J., D. Machin, and S. J. Walters. 2007. Medical Statistics: A Textbook for the Health Sciences. 4th ed. Chichester, UK: Wiley. Clarkson, D. B., Y.-A. Fan, and H. Joe. 1993. A remark on Algorithm 643: FEXACT: An algorithm for performing Fisher’s exact test in r×c contingency tables. ACM Transactions on Mathematical Software 19: 484–488. Conover, W. J. 1999. Practical Nonparametric Statistics. 3rd ed. New York: Wiley. Cox, N. J. 1996. sg57: An immediate command for two-way tables. Stata Technical Bulletin 33: 7–9. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 140–143. College Station, TX: Stata Press. . 1999. sg113: Tabulation of modes. Stata Technical Bulletin 50: 26–27. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 180–181. College Station, TX: Stata Press. . 2003. sg113 1: Software update: Tabulation of modes. 
Stata Journal 3: 211. . 2009. Speaking Stata: I. J. Good and quasi-Bayes smoothing of categorical frequencies. Stata Journal 9: 306–314. Cramér, H. 1946. Mathematical Methods of Statistics. Princeton: Princeton University Press. Fienberg, S. E. 1980. The Analysis of Cross-Classified Categorical Data. 2nd ed. Cambridge, MA: MIT Press. Finney, D. J. 1948. The Fisher–Yates test of significance in 2 × 2 contingency tables. Biometrika 35: 145–156. Fisher, R. A. 1935. The logic of inductive inference. Journal of the Royal Statistical Society 98: 39–82. Good, P. I. 2005. Permutation, Parametric, and Bootstrap Tests of Hypotheses: A Practical Guide to Resampling Methods for Testing Hypotheses. 3rd ed. New York: Springer. . 2006. Resampling Methods: A Practical Guide to Data Analysis. 3rd ed. Boston: Birkhäuser. Goodman, L. A., and W. H. Kruskal. 1954. Measures of association for cross classifications. Journal of the American Statistical Association 49: 732–764. tabulate twoway — Two-way table of frequencies 2333 . 1959. Measures of association for cross classifications II: Further discussion and references. Journal of the American Statistical Association 54: 123–163. . 1963. Measures of association for cross classifications III: Approximate sampling theory. Journal of the American Statistical Association 58: 310–364. . 1972. Measures of association for cross classifications IV: Simplification of asymptotic variances. Journal of the American Statistical Association 67: 415–421. Harrison, D. A. 2006. Stata tip 34: Tabulation by listing. Stata Journal 6: 425–427. Jann, B. 2008. Multinomial goodness-of-fit: Large-sample tests with survey design correction and exact tests for small samples. Stata Journal 8: 147–169. Joe, H. 1988. Extreme probabilities for contingency tables under row and column independence with application to Fisher’s exact test. Communications in Statistics—Theory and Methods 17: 3677–3685. Judson, D. H. 1992. sg12: Extended tabulate utilities. Stata Technical Bulletin 10: 22–23. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 140–141. College Station, TX: Stata Press. Kendall, M. G. 1945. The treatment of ties in rank problems. Biometrika 33: 239–251. Longest, K. C. 2012. Using Stata for Quantitative Analysis. Thousand Oaks, CA: Sage. Mehta, C. R., and N. R. Patel. 1983. A network algorithm for performing Fisher’s exact test in r×c contingency tables. Journal of the American Statistical Association 78: 427–434. . 1986. Algorithm 643 FEXACT: A FORTRAN subroutine for Fisher’s exact test on unordered r×c contingency tables. ACM Transactions on Mathematical Software 12: 154–161. Newson, R. B. 2002. Parameters behind “nonparametric” statistics: Kendall’s tau, Somers’ D and median differences. Stata Journal 2: 45–64. Pagano, M., and K. T. Halvorsen. 1981. An algorithm for finding the exact significance levels of r×c contingency tables. Journal of the American Statistical Association 76: 931–934. Pearson, K. 1900. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, Series 5 50: 157–175. Pesarin, F. 2001. Multivariate Permutation Tests: With Applications in Biostatistics. Chichester, UK: Wiley. Weesie, J. 2001. dm91: Patterns of missing values. Stata Technical Bulletin 61: 5–7. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 49–51. College Station, TX: Stata Press. Wolfe, R. 1999. 
Zelterman, D., and T. A. Louis. 1992. Contingency tables in medical studies. In Medical Uses of Statistics, 2nd ed., ed. J. C. Bailar III and C. F. Mosteller, 293–310. Boston: Dekker.

Also see

[R] table — Flexible table of summary statistics
[R] tabstat — Compact table of summary statistics
[R] tabulate oneway — One-way table of frequencies
[R] tabulate, summarize() — One- and two-way tables of summary statistics
[D] collapse — Make dataset of summary statistics
[ST] epitab — Tables for epidemiologists
[SVY] svy: tabulate oneway — One-way tables for survey data
[SVY] svy: tabulate twoway — Two-way tables for survey data
[XT] xttab — Tabulate xt data
[U] 12.6.3 Value labels
[U] 25 Working with categorical data and factor variables

Title

tabulate, summarize() — One- and two-way tables of summary statistics

Syntax    Menu    Description    Options    Remarks and examples    Also see

Syntax

    tabulate varname1 [varname2] [if] [in] [weight] [, options]

    options                 Description
    -------------------------------------------------------------------------
    Main
      summarize(varname3)   report summary statistics for varname3
      [no]means             include or suppress means
      [no]standard          include or suppress standard deviations
      [no]freq              include or suppress frequencies
      [no]obs               include or suppress number of observations
      nolabel               show numeric codes, not labels
      wrap                  do not break wide tables
      missing               treat missing values of varname1 and varname2 as
                              categories
    -------------------------------------------------------------------------
    by is allowed; see [D] by.
    aweights and fweights are allowed; see [U] 11.1.6 weight.

Menu

    Statistics > Summaries, tables, and tests > Other tables > Table of means, std. dev., and frequencies

Description

tabulate, summarize() produces one- and two-way tables (breakdowns) of means and standard deviations. See [R] tabulate oneway and [R] tabulate twoway for one- and two-way frequency tables. See [R] table for a more flexible command that produces one-, two-, and n-way tables of frequencies and a wide variety of summary statistics. table is better, but tabulate, summarize() is faster. Also see [R] tabstat for yet another alternative.

Options

Main

summarize(varname3) identifies the name of the variable for which summary statistics are to be reported. If you do not specify this option, a table of frequencies is produced; see [R] tabulate oneway and [R] tabulate twoway. The description here concerns tabulate when this option is specified.

[no]means includes or suppresses only the means from the table.

The summarize() table normally includes the mean, standard deviation, frequency, and, if the data are weighted, number of observations. Individual elements of the table may be included or suppressed by the [no]means, [no]standard, [no]freq, and [no]obs options. For example, typing

    . tabulate category, summarize(myvar) means standard

produces a summary table by category containing only the means and standard deviations of myvar. You could also achieve the same result by typing

    . tabulate category, summarize(myvar) nofreq

[no]standard includes or suppresses only the standard deviations from the table; see [no]means option above.

[no]freq includes or suppresses only the frequencies from the table; see [no]means option above.

[no]obs includes or suppresses only the reported number of observations from the table.
If the data are not weighted, the number of observations is identical to the frequency, and by default only the frequency is reported. If the data are weighted, the frequency refers to the sum of the weights. See [no]means option above. nolabel causes the numeric codes to be displayed rather than the label values. wrap requests that no action be taken on wide tables to make them readable. Unless wrap is specified, wide tables are broken into pieces to enhance readability. missing requests that missing values of varname1 and varname2 be treated as categories rather than as observations to be omitted from the analysis. Remarks and examples tabulate with the summarize() option produces one- and two-way tables of summary statistics. When combined with the by prefix, it can produce n-way tables as well. Remarks are presented under the following headings: One-way tables Two-way tables One-way tables Example 1 We have data on 74 automobiles. Included in our dataset are the variables foreign, which marks domestic and foreign cars, and mpg, the car’s mileage rating. Typing tabulate foreign displays a breakdown of the number of observations we have by the values of the foreign variable. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . tabulate foreign Car type Freq. Percent Cum. Domestic Foreign 52 22 70.27 29.73 Total 74 100.00 70.27 100.00 tabulate, summarize() — One- and two-way tables of summary statistics 2337 We discover that we have 52 domestic cars and 22 foreign cars in our dataset. If we add the summarize(varname) option, however, tabulate produces a table of summary statistics for varname: . tabulate foreign, summarize(mpg) Summary of Mileage (mpg) Car type Mean Std. Dev. Freq. Domestic Foreign 19.826923 24.772727 4.7432972 6.6111869 52 22 Total 21.297297 5.7855032 74 We also discover that the average gas mileage for domestic cars is about 20 mpg and the average foreign is almost 25 mpg. Overall, the average is 21 mpg in our dataset. Technical note We might now wonder if the difference in gas mileage between foreign and domestic cars is statistically significant. We can use the oneway command to find out; see [R] oneway. To obtain an analysis-of-variance table of mpg on foreign, we type . oneway mpg foreign Source Between groups Within groups Analysis of Variance SS df MS 378.153515 2065.30594 1 72 378.153515 28.6848048 F 13.18 Total 2443.45946 73 33.4720474 Bartlett’s test for equal variances: chi2(1) = 3.4818 Prob > F 0.0005 Prob>chi2 = 0.062 The F statistic is 13.18, and the difference between foreign and domestic cars’ mileage ratings is significant at the 0.05% level. There are several ways that we could have statistically compared mileage ratings — see, for instance, [R] anova, [R] oneway, [R] regress, and [R] ttest — but oneway seemed the most convenient. Two-way tables Example 2 tabulate, summarize can be used to obtain two-way as well as one-way breakdowns. For instance, we obtained summary statistics on mpg decomposed by foreign by typing tabulate foreign, summarize(mpg). We can specify up to two variables before the comma: 2338 tabulate, summarize() — One- and two-way tables of summary statistics . generate wgtcat = autocode(weight,4,1760,4840) . 
tabulate wgtcat foreign, summarize(mpg) Means, Standard Deviations and Frequencies of Mileage (mpg) Car type wgtcat Domestic Foreign Total 2530 28.285714 3.0937725 7 27.0625 5.9829619 16 27.434783 5.2295149 23 3300 21.75 2.4083189 16 19.6 3.4351128 5 21.238095 2.7550819 21 4070 17.26087 1.8639497 23 14 0 1 17.125 1.9406969 24 4840 14.666667 3.32666 6 . . 0 14.666667 3.32666 6 Total 19.826923 4.7432972 52 24.772727 6.6111869 22 21.297297 5.7855032 74 In addition to the means, standard deviations, and frequencies for each weight–mileage cell, also reported are the summary statistics by weight, by mileage, and overall. For instance, the last row of the table reveals that the average mileage of domestic cars is 19.83 and that of foreign cars is 24.77 — domestic cars yield poorer mileage than foreign cars. But we now see that domestic cars yield better gas mileage within weight class — the reason domestic cars yield poorer gas mileage is because they are, on average, heavier. Example 3 If we do not specify the statistics to be included in a table, tabulate reports the mean, standard deviation, and frequency. We can specify the statistics that we want to see using the means, standard, and freq options: . tabulate wgtcat foreign, summarize(mpg) means Means of Mileage (mpg) Car type Domestic Foreign Total wgtcat 2530 3300 4070 4840 28.285714 21.75 17.26087 14.666667 27.0625 19.6 14 . 27.434783 21.238095 17.125 14.666667 Total 19.826923 24.772727 21.297297 When we specify one or more of the means, standard, and freq options, only those statistics are displayed. Thus we could obtain a table containing just the means and standard deviations by typing means standard after the summarize(mpg) option. We can also suppress selected statistics by placing no in front of the option name. Another way of obtaining only the means and standard deviations is to add the nofreq option: tabulate, summarize() — One- and two-way tables of summary statistics . tabulate wgtcat foreign, summarize(mpg) nofreq Means and Standard Deviations of Mileage (mpg) Car type wgtcat Domestic Foreign Total 2530 28.285714 3.0937725 27.0625 5.9829619 27.434783 5.2295149 3300 21.75 2.4083189 19.6 3.4351128 21.238095 2.7550819 4070 17.26087 1.8639497 14 0 17.125 1.9406969 4840 14.666667 3.32666 . . 14.666667 3.32666 Total 19.826923 4.7432972 24.772727 6.6111869 21.297297 5.7855032 Also see [R] table — Flexible table of summary statistics [R] tabstat — Compact table of summary statistics [R] tabulate oneway — One-way table of frequencies [R] tabulate twoway — Two-way table of frequencies [D] collapse — Make dataset of summary statistics [SVY] svy: tabulate oneway — One-way tables for survey data [SVY] svy: tabulate twoway — Two-way tables for survey data [U] 12.6 Dataset, variable, and value labels [U] 25 Working with categorical data and factor variables 2339 Title test — Test linear hypotheses after estimation Syntax Options for test Acknowledgment Menu Remarks and examples References Description Stored results Also see Options for testparm Methods and formulas Syntax Basic syntax test coeflist (Syntax 1 ) test exp=exp =. . . test [eqno] : coeflist test [eqno=eqno =. . . ] : coeflist testparm varlist , equal equation(eqno) (Syntax 2 ) (Syntax 3 ) (Syntax 4 ) Full syntax , test options test (spec) (spec) . . . 
    test_options         Description
    -------------------------------------------------------------------------
    Options
      mtest[(opt)]       test each condition separately
      coef               report estimated constrained coefficients
      accumulate         test hypothesis jointly with previously tested
                           hypotheses
      notest             suppress the output
      common             test only variables common to all the equations
      constant           include the constant in coefficients to be tested
      nosvyadjust        compute unadjusted Wald tests for survey results
      minimum            perform test with the constant, drop terms until the
                           test becomes nonsingular, and test without the
                           constant on the remaining terms; highly technical
      matvlc(matname)    save the variance–covariance matrix; programmer's
                           option
    -------------------------------------------------------------------------
    coeflist and varlist may contain factor variables and time-series
      operators; see [U] 11.4.3 Factor variables and [U] 11.4.4 Time-series
      varlists.
    matvlc(matname) does not appear in the dialog box.

Syntax 1 tests that coefficients are 0. Syntax 2 tests that linear expressions are equal. Syntax 3 tests that coefficients in eqno are 0. Syntax 4 tests equality of coefficients between equations.

spec is one of

    coeflist
    exp=exp[=exp...]
    [eqno] [: coeflist]
    [eqno1=eqno2[=...]] [: coeflist]

coeflist is

    coef [coef ...]
    [eqno]coef [[eqno]coef ...]
    [eqno]_b[coef] [[eqno]_b[coef] ...]

exp is a linear expression containing

    coef
    _b[coef]
    _b[eqno:coef]
    [eqno]coef
    [eqno]_b[coef]

eqno is

    ##
    name

coef identifies a coefficient in the model. coef is typically a variable name, a level indicator, an interaction indicator, or an interaction involving continuous variables. Level indicators identify one level of a factor variable and interaction indicators identify one combination of levels of an interaction; see [U] 11.4.3 Factor variables. coef may contain time-series operators; see [U] 11.4.4 Time-series varlists.

Distinguish between [ ], which are to be typed, and [ ], which indicate optional arguments.

Although not shown in the syntax diagram, parentheses around spec are required only with multiple specifications. Also, the diagram does not show that test may be called without arguments to redisplay the results from the last test.

anova and manova (see [R] anova and [MV] manova) allow the test syntax above plus more (see [R] anova postestimation for test after anova; see [MV] manova postestimation for test after manova).

Menu

 test
    Statistics > Postestimation > Tests > Test linear hypotheses
 testparm
    Statistics > Postestimation > Tests > Test parameters

Description

test performs Wald tests of simple and composite linear hypotheses about the parameters of the most recently fit model. test supports svy estimators (see [SVY] svy estimation), carrying out an adjusted Wald test by default in such cases. test can be used with svy estimation results; see [SVY] svy postestimation.

testparm provides a useful alternative to test that permits varlist rather than a list of coefficients (which is often nothing more than a list of variables), allowing the use of standard Stata notation, including ‘-’ and ‘*’, which are given the expression interpretation by test.

test and testparm perform Wald tests. For likelihood-ratio tests, see [R] lrtest. For Wald-type tests of nonlinear hypotheses, see [R] testnl. To display estimates for one-dimensional linear or nonlinear expressions of coefficients, see [R] lincom and [R] nlcom.

See [R] anova postestimation for additional test syntax allowed after anova. See [MV] manova postestimation for additional test syntax allowed after manova.
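Before turning to the options, the following minimal sketch shows the typical call patterns for test and testparm after a single-equation estimator. It uses the auto dataset shipped with Stata, and the particular hypotheses are chosen purely for illustration; the worked examples under Remarks and examples below treat each form in detail.

    . sysuse auto, clear
    . regress price mpg weight length
    . test mpg weight                      // joint Wald test that both coefficients are 0
    . test mpg = weight                    // one linear restriction between coefficients
    . testparm mpg weight length           // testparm takes a varlist rather than coefficients
    . testparm mpg weight length, equal    // test that the three coefficients are equal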
Options for testparm equal tests that the variables appearing in varlist, which also appear in the previously fit model, are equal to each other rather than jointly equal to zero. equation(eqno) is relevant only for multiple-equation models, such as mvreg, mlogit, and heckman. It specifies the equation for which the all-zero or all-equal hypothesis is tested. equation(#1) specifies that the test be conducted regarding the first equation #1. equation(price) specifies that the test concern the equation named price. Options for test Options mtest (opt) specifies that tests be performed for each condition separately. opt specifies the method for adjusting p-values for multiple testing. Valid values for opt are bonferroni holm sidak noadjust Bonferroni’s method Holm’s method Šidák’s method no adjustment is to be made Specifying mtest without an argument is equivalent to mtest(noadjust). coef specifies that the constrained coefficients be displayed. accumulate allows a hypothesis to be tested jointly with the previously tested hypotheses. notest suppresses the output. This option is useful when you are interested only in the joint test of several hypotheses, specified in a subsequent call of test, accumulate. common specifies that when you use the [eqno1 =eqno2 =. . . ] form of spec, the variables common to the equations eqno1 , eqno2 , etc., be tested. The default action is to complain if the equations have variables not in common. constant specifies that cons be included in the list of coefficients to be tested when using the [eqno1 =eqno2 =. . . ] or [eqno] forms of spec. The default is not to include cons. test — Test linear hypotheses after estimation 2343 nosvyadjust is for use with svy estimation commands; see [SVY] svy estimation. It specifies that the Wald test be carried out without the default adjustment for the design degrees of freedom. That is, the test is carried out as W/k ∼ F (k, d) rather than as (d − k + 1)W/(kd) ∼ F (k, d − k + 1), where k = the dimension of the test and d = the total number of sampled PSUs minus the total number of strata. minimum is a highly technical option. It first performs the test with the constant added. If this test is singular, coefficients are dropped until the test becomes nonsingular. Then the test without the constant is performed with the remaining terms. The following option is available with test but is not shown in the dialog box: matvlc(matname), a programmer’s option, saves the variance–covariance matrix of the linear combinations involved in the suite of tests. For the test of the linear constraints Lb = c, matname contains LVL0 , where V is the estimated variance–covariance matrix of b. Remarks and examples Remarks are presented under the following headings: Introductory examples Special syntaxes after multiple-equation estimation Constrained coefficients Multiple testing Introductory examples test performs F or χ2 tests of linear restrictions applied to the most recently fit model (for example, regress or svy: regress in the linear regression case; logit, stcox, svy: logit, . . . in the single-equation maximum-likelihood case; and mlogit, mvreg, streg, . . . in the multipleequation maximum-likelihood case). test may be used after any estimation command, although for maximum likelihood techniques, test produces a Wald test that depends only on the estimate of the covariance matrix — you may prefer to use the more computationally expensive likelihood-ratio test; see [U] 20 Estimation and postestimation commands and [R] lrtest. 
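To make the comparison with the likelihood-ratio test concrete, here is a minimal sketch, with a model and restriction chosen only for illustration (the auto dataset is assumed), of testing the same restriction both ways after a maximum likelihood estimator:

    . sysuse auto, clear
    . logit foreign mpg weight          // unrestricted model
    . estimates store full
    . test weight                       // Wald test that _b[weight] = 0
    . logit foreign mpg                 // restricted model, omitting weight
    . lrtest full .                     // likelihood-ratio test of the same restriction

The two approaches generally give similar, but not identical, p-values; the likelihood-ratio version requires refitting the restricted model, which is why it is the more computationally expensive choice.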
There are several variations on the syntax for test. The second syntax, test exp=exp =. . . is allowed after any form of estimation. After fitting a model of depvar on x1, x2, and x3, typing test x1+x2=x3 tests the restriction that the coefficients on x1 and x2 sum to the coefficient on x3. The expressions can be arbitrarily complicated; for instance, typing test x1+2*(x2+x3)=x2+3*x3 is the same as typing test x1+x2=x3. As a convenient shorthand, test also allows you to specify equality for multiple expressions; for example, test x1+x2 = x3+x4 = x5+x6 tests that the three specified pairwise sums of coefficients are equal. test understands that when you type x1, you are referring to the coefficient on x1. You could also more explicitly type test b[x1]+ b[x2]= b[x3]; or you could test coef[x1]+ coef[x2]= coef[x3], or test [#1]x1+[#1]x2=[#1]x3, or many other things because there is more than one way to refer to an estimated coefficient; see [U] 13.5 Accessing coefficients and standard errors. The shorthand involves less typing. On the other hand, you must be more explicit 2344 test — Test linear hypotheses after estimation after estimation of multiple-equation models because there may be more than one coefficient associated with an independent variable. You might type, for instance, test [#2]x1+[#2]x2=[#2]x3 to test the constraint in equation 2 or, more readably, test [ford]x1+[ford]x2=[ford]x3, meaning that Stata will test the constraint on the equation corresponding to ford, which might be equation 2. ford would be an equation name after, say, sureg, or, after mlogit, ford would be one of the outcomes. For mlogit, you could also type test [2]x1+[2]x2=[2]x3 — note the lack of the # — meaning not equation 2, but the equation corresponding to the numeric outcome 2. You can even test constraints across equations: test [ford]x1+[ford]x2=[buick]x3. The syntax test coeflist is available after all estimation commands and is a convenient way to test that multiple coefficients are zero following estimation. A coeflist can simply be a list of variable names, test varname varname . . . and it is most often specified that way. After you have fit a model of depvar on x1, x2, and x3, typing test x1 x3 tests that the coefficients on x1 and x3 are jointly zero. After multiple-equation estimation, this would test that the coefficients on x1 and x3 are zero in all equations that contain them. You can also be more explicit and type, for instance, test [ford]x1 [ford]x3 to test that the coefficients on x1 and x3 are zero in the equation for ford. In the multiple-equation case, there are more alternatives. You could also test that the coefficients on x1 and x3 are zero in the equation for ford by typing test [ford]: x1 x3. You could test that all coefficients except the coefficient on the constant are zero in the equation for ford by typing test [ford]. You could test that the coefficients on x1 and x3 in the equation for ford are equal to the corresponding coefficients in the equation corresponding to buick by typing test[ford=buick]: x1 x3. You could test that all the corresponding coefficients except the constant in three equations are equal by typing test [ford=buick=volvo]. testparm is much like the first syntax of test. Its usefulness will be demonstrated below. The examples below use regress, but what is said applies equally after any single-equation estimation command (such as logistic). 
It also applies after multiple-equation estimation commands as long as references to coefficients are qualified with an equation name or number in square brackets placed before them. The convenient syntaxes for dealing with tests of many coefficients in multipleequation models are demonstrated in Special syntaxes after multiple-equation estimation below. Example 1: Testing for a single coefficient against zero We have 1980 census data on the 50 states recording the birth rate in each state (brate), the median age (medage), and the region of the country in which each state is located. The region variable is 1 if the state is in the Northeast, 2 if the state is in the North Central, 3 if the state is in the South, and 4 if the state is in the West. We estimate the following regression: test — Test linear hypotheses after estimation . use http://www.stata-press.com/data/r13/census3 (1980 Census data by state) . regress brate medage c.medage#c.medage i.region SS df MS Source Number of obs F( 5, 44) Prob > F R-squared Adj R-squared Root MSE = = = = = = 50 100.63 0.0000 0.9196 0.9104 8.782 Model Residual 38803.4208 3393.39921 5 44 7760.68416 77.1227094 Total 42196.82 49 861.159592 brate Coef. medage -109.0958 13.52452 -8.07 0.000 -136.3527 -81.83892 c.medage# c.medage 1.635209 .2290536 7.14 0.000 1.173582 2.096836 region N Cntrl South West 15.00283 7.366445 21.39679 4.252067 3.953335 4.650601 3.53 1.86 4.60 0.001 0.069 0.000 6.433353 -.6009775 12.02412 23.57231 15.33387 30.76946 _cons 1947.611 199.8405 9.75 0.000 1544.859 2350.363 Std. Err. t P>|t| 2345 [95% Conf. Interval] test can now be used to perform a variety of statistical tests. Specify the coeflegend option with your estimation command to see a legend of the coefficients and how to specify them; see [R] estimation options. We can test the hypothesis that the coefficient on 3.region is zero by typing . test 3.region=0 ( 1) 3.region = 0 F( 1, 44) = Prob > F = 3.47 0.0691 The F statistic with 1 numerator and 44 denominator degrees of freedom is 3.47. The significance level of the test is 6.91% — we can reject the hypothesis at the 10% level but not at the 5% level. This result from test is identical to one presented in the output from regress, which indicates that the t statistic on the 3.region coefficient is 1.863 and that its significance level is 0.069. The t statistic presented in the output can be used to test the hypothesis that the corresponding coefficient is zero, although it states the test in slightly different terms. The F distribution with 1 numerator degree of freedom is, however, identical to the t2 distribution. We note that 1.8632 ≈ 3.47 and that the significance levels in each test agree, although one extra digit is presented by the test command. Technical note After all estimation commands, including those that use the maximum likelihood method, the test that one variable is zero is identical to that reported by the command’s output. The tests are performed in the same way—using the estimated covariance matrix—and are known as Wald tests. If the estimation command reports significance levels and confidence intervals using z rather than t statistics, test reports results using the χ2 rather than the F statistic. 2346 test — Test linear hypotheses after estimation Example 2: Testing the value of a single coefficient If that were all test could do, it would be useless. We can use test, however, to perform other tests. For instance, we can test the hypothesis that the coefficient on 2.region is 21 by typing . 
test 2.region=21 ( 1) 2.region = 21 F( 1, 44) = Prob > F = 1.99 0.1654 We find that we cannot reject that hypothesis, or at least we cannot reject it at any significance level below 16.5%. Example 3: Testing the equality of two coefficients The previous test is useful, but we could almost as easily perform it by hand using the results presented in the regression output if we were well read on our statistics. We could type . display Ftail(1,44,((_coef[2.region]-21)/4.252068)^2) .16544873 So, now let’s test something a bit more difficult: whether the coefficient on 2.region is the same as the coefficient on 4.region: . test 2.region=4.region ( 1) 2.region - 4.region = 0 F( 1, 44) = 2.84 Prob > F = 0.0989 We find that we cannot reject the equality hypothesis at the 5% level, but we can at the 10% level. Example 4 When we tested the equality of the 2.region and 4.region coefficients, Stata rearranged our algebra. When Stata displayed its interpretation of the specified test, it indicated that we were testing whether 2.region minus 4.region is zero. The rearrangement is innocuous and, in fact, allows Stata to perform much more complicated algebra, for instance, . test 2*(2.region-3*(3.region-4.region))=3.region+2.region+6*(4.region-3.region) ( 1) 2.region - 3.region = 0 F( 1, 44) = 5.06 Prob > F = 0.0295 Although we requested what appeared to be a lengthy hypothesis, once Stata simplified the algebra, it realized that all we wanted to do was test whether the coefficient on 2.region is the same as the coefficient on 3.region. Technical note Stata’s ability to simplify and test complex hypotheses is limited to linear hypotheses. If you attempt to test a nonlinear hypothesis, you will be told that it is not possible: . test 2.region/3.region=2.region+3.region not possible with test r(131); To test a nonlinear hypothesis, see [R] testnl. test — Test linear hypotheses after estimation 2347 Example 5: Testing joint hypotheses The real power of test is demonstrated when we test joint hypotheses. Perhaps we wish to test whether the region variables, taken as a whole, are significant by testing whether the coefficients on 2.region, 3.region, and 4.region are simultaneously zero. test allows us to specify multiple conditions to be tested, each embedded within parentheses. . test ( 1) ( 2) ( 3) (2.region=0) (3.region=0) (4.region=0) 2.region = 0 3.region = 0 4.region = 0 F( 3, 44) = 8.85 Prob > F = 0.0001 test displays the set of conditions and reports an F statistic of 8.85. test also reports the degrees of freedom of the test to be 3, the “dimension” of the hypothesis, and the residual degrees of freedom, 44. The significance level of the test is close to 0, so we can strongly reject the hypothesis of no difference between the regions. An alternative method to specify simultaneous hypotheses uses the convenient shorthand of conditions with multiple equality operators. . test ( 1) ( 2) ( 3) 2.region=3.region=4.region=0 2.region - 3.region = 0 2.region - 4.region = 0 2.region = 0 F( 3, 44) = 8.85 Prob > F = 0.0001 Technical note Another method to test simultaneous hypotheses is to specify a test for each constraint and accumulate it with the previous constraints: . test 2.region=0 ( 1) 2.region = 0 F( 1, 44) = 12.45 Prob > F = 0.0010 . test 3.region=0, accumulate ( 1) 2.region = 0 ( 2) 3.region = 0 F( 2, 44) = 6.42 Prob > F = 0.0036 . 
test 4.region=0, accumulate ( 1) 2.region = 0 ( 2) 3.region = 0 ( 3) 4.region = 0 F( 3, 44) = 8.85 Prob > F = 0.0001 We tested the hypothesis that the coefficient on 2.region was zero by typing test 2.region=0. We then tested whether the coefficient on 3.region was also zero by typing test 3.region=0, accumulate. The accumulate option told Stata that this was not the start of a new test but a continuation of a previous one. Stata responded by showing us the two equations and reporting an F statistic of 6.42. The significance level associated with those two coefficients being zero is 0.36%. 2348 test — Test linear hypotheses after estimation When we added the last constraint test 4.region=0, accumulate, we discovered that the three region variables are significant. If all we wanted was the overall significance and we did not want to bother seeing the interim results, we could have used the notest option: . test ( 1) . test ( 1) ( 2) . test ( 1) ( 2) ( 3) 2.region=0, notest 2.region = 0 3.region=0, accumulate notest 2.region = 0 3.region = 0 4.region=0, accumulate 2.region = 0 3.region = 0 4.region = 0 F( 3, 44) = 8.85 Prob > F = 0.0001 Example 6: Quickly testing coefficients against zero Because tests that coefficients are zero are so common in applied statistics, the test command has a more convenient syntax to accommodate this case: . test ( 1) ( 2) ( 3) 2.region 3.region 4.region 2.region = 0 3.region = 0 4.region = 0 F( 3, 44) = 8.85 Prob > F = 0.0001 Example 7: Specifying varlists We will now show how to use testparm. In its first syntax, test accepts a list of variable names but not a varlist. . test i(2/4).region i not found r(111); In the varlist, i(2/4).region means all the level variables from 2.region through 4.region, yet we received an error. test does not actually understand varlists, but testparm does. In fact, it understands only varlists. . testparm i(2/4).region ( 1) 2.region = 0 ( 2) 3.region = 0 ( 3) 4.region = 0 F( 3, 44) = Prob > F = 8.85 0.0001 Another way to test all the region variables is to type testparm i.region. That testparm accepts varlists has other advantages that do not involve factor variables. Suppose that we have a dataset that has dummy variables reg2, reg3, and reg4, rather than the categorical variable region. test — Test linear hypotheses after estimation 2349 . use http://www.stata-press.com/data/r13/census4 (birth rate, median age) . regress brate medage c.medage#c.medage reg2 reg3 reg4 (output omitted ) . test reg2-reg4 - not found r(111); In a varlist, reg2-reg4 means variables reg2 and reg4 and all the variables between, yet we received an error. test is confused because the - has two meanings: it means subtraction in an expression and “through” in a varlist. Similarly, ‘*’ means “any set of characters” in a varlist and multiplication in an expression. testparm avoids this confusion — it allows only a varlist. . testparm reg2-reg4 ( 1) ( 2) ( 3) reg2 = 0 reg3 = 0 reg4 = 0 F( 3, 44) = Prob > F = 8.85 0.0001 testparm has another advantage. We have five variables in our dataset that start with the characters reg: region, reg1, reg2, reg3, and reg4. reg* thus means those five variables: . 
describe reg* variable name storage type region reg1 reg2 reg3 reg4 int byte byte byte byte display format value label %8.0g %9.0g %9.0g %9.0g %9.0g region variable label Census Region region==NE region==N Cntrl region==South region==West We cannot type test reg* because, in an expression, ‘*’ means multiplication, but here is what would happen if we attempted to test all the variables that begin with reg: . test region reg1 reg2 reg3 reg4 region not found r(111); The variable region was not included in our model, so it was not found. However, with testparm, . testparm reg* ( 1) ( 2) ( 3) reg2 = 0 reg3 = 0 reg4 = 0 F( 3, 44) = Prob > F = 8.85 0.0001 That is, testparm took reg* to mean all variables that start with reg that were in our model. Technical note Actually, reg* means what it always does — all variables in our dataset that begin with reg — in this case, region reg1 reg2 reg3 reg4. testparm just ignores any variables you specify that are not in the model. 2350 test — Test linear hypotheses after estimation Example 8: Replaying the previous test We just used test (testparm, actually, but it does not matter) to test the hypothesis that reg2, reg3, and reg4 are jointly zero. We can review the results of our last test by typing test without arguments: . test ( 1) reg2 = 0 ( 2) reg3 = 0 ( 3) reg4 = 0 F( 3, 44) = Prob > F = 8.85 0.0001 Technical note test does not care how we build joint hypotheses; we may freely mix different forms of syntax. (We can even start with testparm, but we cannot use it thereafter because it does not have an accumulate option.) Say that we type test reg2 reg3 reg4 to test that the coefficients on our region dummies are jointly zero. We could then add a fourth constraint, say, that medage = 100, by typing test medage=100, accumulate. Or, if we had introduced the medage constraint first (our first test command had been test medage=100), we could then add the region dummy test by typing test reg2 reg3 reg4, accumulate or test (reg2=0) (reg3=0) (reg4=0), accumulate. Remember that all previous tests are cleared when we do not specify the accumulate option. No matter what tests we performed in the past, if we type test medage c.medage#c.medage, omitting the accumulate option, we would test that medage and c.medage#c.medage are jointly zero. Example 9: Testing the equality of multiple coefficients Let’s return to our census3.dta dataset and test the hypothesis that all the included regions have the same coefficient — that the Northeast is significantly different from the rest of the nation: . use http://www.stata-press.com/data/r13/census3 (1980 Census data by state) . regress brate medage c.medage#c.medage i.region (output omitted ) . test 2.region=3.region=4.region ( 1) 2.region - 3.region = 0 ( 2) 2.region - 4.region = 0 F( 2, 44) = 8.23 Prob > F = 0.0009 We find that they are not all the same. The syntax 2.region=3.region=4.region with multiple = operators is just a convenient shorthand for typing that the first expression equals the second expression and that the first expression equals the third expression, . test (2.region=3.region) (2.region=4.region) We performed the test for equality of the three regions by imposing two constraints: region 2 has the same coefficient as region 3, and region 2 has the same coefficient as region 4. Alternatively, we could have tested that the coefficients on regions 2 and 3 are the same and that the coefficients on regions 3 and 4 are the same. We would obtain the same results in either case. 
To test for equality of the three regions, we might, likely by mistake, type equality constraints for all pairs of regions: test — Test linear hypotheses after estimation . test ( 1) ( 2) ( 3) 2351 (2.region=3.region) (2.region=4.region) (3.region=4.region) 2.region - 3.region = 0 2.region - 4.region = 0 3.region - 4.region = 0 Constraint 3 dropped F( 2, 44) = 8.23 Prob > F = 0.0009 Equality of regions 2 and 3 and of regions 2 and 4, however, implies equality of regions 3 and 4. test recognized that the last constraint is implied by the other constraints and hence dropped it. Technical note Generally, Stata uses = for assignment, as in gen newvar = exp, and == as the operator for testing equality in expressions. For your convenience, test allows both = and == to be used. Example 10 The test for the equality of the regions is also possible with the testparm command. When we include the equal option, testparm tests that the coefficients of all the variables specified are equal: . testparm i(2/4).region, equal ( 1) - 2.region + 3.region = 0 ( 2) - 2.region + 4.region = 0 F( 2, 44) = 8.23 Prob > F = 0.0009 We can also obtain the equality test by accumulating single equality tests. . test ( 1) . test ( 1) ( 2) 2.region=3.region, notest 2.region - 3.region = 0 2.region=4.region, accum 2.region - 3.region = 0 2.region - 4.region = 0 F( 2, 44) = 8.23 Prob > F = 0.0009 Technical note If we specify a set of inconsistent constraints, test will tell us by dropping the constraint or constraints that led to the inconsistency. For instance, let’s test that the coefficients on region 2 and region 4 are the same, add the test that the coefficient on region 2 is 20, and finally add the test that the coefficient on region 4 is 21: . test ( 1) ( 2) ( 3) (2.region=4.region) (2.region=20) (4.region=21) 2.region - 4.region = 0 2.region = 20 4.region = 21 Constraint 1 dropped F( 2, 44) = 1.29 Prob > F = 0.2868 test informed us that it was dropping constraint 1. All three equations cannot be simultaneously true, so test drops whatever it takes to get back to something that makes sense. 2352 test — Test linear hypotheses after estimation Special syntaxes after multiple-equation estimation Everything said above about tests after single-equation estimation applies to tests after multipleequation estimation, as long as you remember to specify the equation name. To demonstrate, let’s estimate a seemingly unrelated regression by using sureg; see [R] sureg. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . sureg (price foreign mpg displ) (weight foreign length) Seemingly unrelated regression Equation price weight Obs Parms RMSE "R-sq" chi2 P 74 74 3 2 2165.321 245.2916 0.4537 0.8990 49.64 661.84 0.0000 0.0000 Coef. Std. Err. z P>|z| [95% Conf. Interval] price foreign mpg displacement _cons 3058.25 -104.9591 18.18098 3904.336 685.7357 58.47209 4.286372 1966.521 4.46 -1.80 4.24 1.99 0.000 0.073 0.000 0.047 1714.233 -219.5623 9.779842 50.0263 4402.267 9.644042 26.58211 7758.645 weight foreign length _cons -147.3481 30.94905 -2753.064 75.44314 1.539895 303.9336 -1.95 20.10 -9.06 0.051 0.000 0.000 -295.2139 27.93091 -3348.763 .517755 33.96718 -2157.365 To test the significance of foreign in the price equation, we could type . test [price]foreign ( 1) [price]foreign = 0 chi2( 1) = 19.89 Prob > chi2 = 0.0000 which is the same result reported by sureg: 4.4602 ≈ 19.89. To test foreign in both equations, we could type . 
test [price]foreign [weight]foreign ( 1) [price]foreign = 0 ( 2) [weight]foreign = 0 chi2( 2) = 31.61 Prob > chi2 = 0.0000 or . test foreign ( 1) [price]foreign = 0 ( 2) [weight]foreign = 0 chi2( 2) = 31.61 Prob > chi2 = 0.0000 This last syntax — typing the variable name by itself — tests the coefficients in all equations in which they appear. The variable length appears in only the weight equation, so typing test — Test linear hypotheses after estimation 2353 . test length ( 1) [weight]length = 0 chi2( 1) = 403.94 Prob > chi2 = 0.0000 yields the same result as typing test [weight]length. We may also specify a linear expression rather than a list of coefficients: . test mpg=displ ( 1) [price]mpg - [price]displacement = 0 chi2( 1) = 4.85 Prob > chi2 = 0.0277 or . test [price]mpg = [price]displ ( 1) [price]mpg - [price]displacement = 0 chi2( 1) = Prob > chi2 = 4.85 0.0277 A variation on this syntax can be used to test cross-equation constraints: . test [price]foreign = [weight]foreign ( 1) [price]foreign - [weight]foreign = 0 chi2( 1) = 23.07 Prob > chi2 = 0.0000 Typing an equation name in square brackets by itself tests all the coefficients except the intercept in that equation: . test ( 1) ( 2) ( 3) [price] [price]foreign = 0 [price]mpg = 0 [price]displacement = 0 chi2( 3) = 49.64 Prob > chi2 = 0.0000 Typing an equation name in square brackets, a colon, and a list of variable names tests those variables in the specified equation: . test [price]: foreign displ ( 1) [price]foreign = 0 ( 2) [price]displacement = 0 chi2( 2) = 25.19 Prob > chi2 = 0.0000 test [eqname1 =eqname2 ] tests that all the coefficients in the two equations are equal. We cannot use that syntax here because there are different variables in the model: . test [price=weight] variables differ between equations (to test equality of coefficients in common, specify option common) r(111); The common option specifies a test of the equality coefficients common to the equations price and weight, 2354 test — Test linear hypotheses after estimation . test [price=weight], common ( 1) [price]foreign - [weight]foreign = 0 chi2( 1) = 23.07 Prob > chi2 = 0.0000 By default, test does not include the constant, the coefficient of the constant variable cons, in the test. The cons option specifies that the constant be included. . test [price=weight], common cons ( 1) [price]foreign - [weight]foreign = 0 ( 2) [price]_cons - [weight]_cons = 0 chi2( 2) = 51.23 Prob > chi2 = 0.0000 We can also use a modification of this syntax with the model if we also type a colon and the names of the variables we want to test: . test [price=weight]: foreign ( 1) [price]foreign - [weight]foreign = 0 chi2( 1) = 23.07 Prob > chi2 = 0.0000 We have only one variable in common between the two equations, but if there had been more, we could have listed them. Finally, a simultaneous test of multiple constraints may be specified just as after single-equation estimation. . test ([price]: foreign) ([weight]: foreign) ( 1) [price]foreign = 0 ( 2) [weight]foreign = 0 chi2( 2) = 31.61 Prob > chi2 = 0.0000 test can also test for equality of coefficients across more than two equations. For instance, test [eq1=eq2=eq3] specifies a test that the coefficients in the three equations eq1, eq2, and eq3 are equal. This requires that the same variables be included in the three equations. If some variables are entered only in some of the equations, you can type test [eq1=eq2=eq3], common to test that the coefficients of the variables common to all three equations are equal. 
Alternatively, you can explicitly list the variables for which equality of coefficients across the equations is to be tested. For instance, test [eq1=eq2=eq3]: time money tests that the coefficients of the variables time and money do not differ between the equations. Technical note test [eq1=eq2=eq3], common tests the equality of the coefficients common to all equations, but it does not test the equality of all common coefficients. Consider the case where eq1 eq2 eq3 contains the variables var1 var2 var3 contains the variables var1 var2 var4 contains the variables var1 var3 var4 Obviously, only var1 is common to all three equations. Thus test [eq1=eq2=eq3], common tests that the coefficients of var1 do not vary across the equations, so it is equivalent to test [eq1=eq2=eq3]: var1. To perform a test of the coefficients of variables common to two equations, you could explicitly list the constraints to be tested, . test ([eq1=eq2=eq3]:var1) ([eq1=eq2]:var2) ([eq1=eq3]:var3) ([eq2=eq3]:var4) test — Test linear hypotheses after estimation 2355 or use test with the accumulate option, and maybe also with the notest option, to form the appropriate joint hypothesis: . test [eq1=eq2], common notest . test [eq1=eq3], common accumulate notest . test [eq2=eq3], common accumulate Constrained coefficients If the test indicates that the data do not allow you to conclude that the constraints are not satisfied, you may want to inspect the constrained coefficients. The coef option specified that the constrained results, estimated by GLS, are shown. . test [price=weight], common coef ( 1) [price]foreign - [weight]foreign = 0 chi2( 1) = 23.07 Prob > chi2 = 0.0000 Constrained coefficients Coef. Std. Err. z P>|z| [95% Conf. Interval] price foreign mpg displacement _cons -216.4015 -121.5717 7.632566 7312.856 74.06083 58.36972 3.681114 1834.034 -2.92 -2.08 2.07 3.99 0.003 0.037 0.038 0.000 -361.558 -235.9742 .4177148 3718.215 -71.2449 -7.169116 14.84742 10907.5 weight foreign length _cons -216.4015 30.34875 -2619.719 74.06083 1.534815 302.6632 -2.92 19.77 -8.66 0.003 0.000 0.000 -361.558 27.34057 -3212.928 -71.2449 33.35693 -2026.51 The constrained coefficient of foreign is −216.40 with standard error 74.06 in equations price and weight. The other coefficients and their standard errors are affected by imposing the equality constraint of the two coefficients of foreign because the unconstrained estimates of these two coefficients were correlated with the estimates of the other coefficients. Technical note The two-step constrained coefficients bc displayed by test, coef are asymptotically equivalent to the one-stage constrained estimates that are computed by specifying the constraints during estimation using the constraint() option of estimation commands (Gourieroux and Monfort 1995, chap. 10). Generally, one-step constrained estimates have better small-sample properties. For inspection and interpretation, however, two-step constrained estimates are a convenient alternative. Moreover, some estimation commands (for example, stcox, many xt estimators) do not have a constraint() option. 2356 test — Test linear hypotheses after estimation Multiple testing When performing the test of a joint hypothesis, you might want to inspect the underlying 1-degreeof-freedom hypotheses. Which constraint “is to blame”? test displays the univariate as well as the simultaneous test if the mtest option is specified. For example, . 
test [price=weight], common cons mtest
 ( 1)  [price]foreign - [weight]foreign = 0
 ( 2)  [price]_cons - [weight]_cons = 0

                 chi2     df        p
        (1)     23.07      1    0.0000 #
        (2)     11.17      1    0.0008 #
        all     51.23      2    0.0000

        # unadjusted p-values

Both coefficients seem to contribute to the highly significant result. The 1-degree-of-freedom tests shown here are identical to those obtained if test had been invoked for each of these simple hypotheses separately.

There is, of course, a real risk in inspecting these simple hypotheses. Especially in high-dimensional hypotheses, you may easily find one hypothesis that happens to be significant. Multiple testing procedures are designed to provide some safeguard against this risk. p-values of the univariate hypotheses are modified so that the probability of falsely rejecting one of the null hypotheses is bounded. test provides the methods based on Bonferroni, Šidák, and Holm.

. test [price=weight], common cons mtest(b)
 ( 1)  [price]foreign - [weight]foreign = 0
 ( 2)  [price]_cons - [weight]_cons = 0

                 chi2     df        p
        (1)     23.07      1    0.0000 #
        (2)     11.17      1    0.0017 #
        all     51.23      2    0.0000

        # Bonferroni-adjusted p-values

Stored results

test and testparm store the following in r():

Scalars
    r(p)             two-sided p-value
    r(F)             F statistic
    r(df)            test constraints degrees of freedom
    r(df_r)          residual degrees of freedom
    r(dropped_i)     index of ith constraint dropped
    r(chi2)          χ2
    r(ss)            sum of squares (test)
    r(rss)           residual sum of squares
    r(drop)          1 if constraints were dropped, 0 otherwise

Macros
    r(mtmethod)      method of adjustment for multiple testing

Matrices
    r(mtest)         multiple test results

r(ss) and r(rss) are defined only when test is used for testing effects after anova.

Methods and formulas

test and testparm perform Wald tests. Let the estimated coefficient vector be b and the estimated variance–covariance matrix be V. Let Rb = r denote the set of q linear hypotheses to be tested jointly. The Wald test statistic is (Judge et al. 1985, 20–28)

    W = (Rb - r)' (RVR')^{-1} (Rb - r)

If the estimation command reports its significance levels using Z statistics, a chi-squared distribution with q degrees of freedom,

    W \sim \chi^2_q

is used for computation of the significance level of the hypothesis test. If the estimation command reports its significance levels using t statistics with d degrees of freedom, an F statistic,

    F = \frac{1}{q} W

is computed, and an F distribution with q numerator degrees of freedom and d denominator degrees of freedom computes the significance level of the hypothesis test.

The two-step constrained estimates b_c displayed by test with the coef option are the GLS estimates of the unconstrained estimates b subject to the specified constraints Rb = r (Gourieroux and Monfort 1995, chap. 10),

    b_c = b - VR' (RVR')^{-1} (Rb - r)

with variance–covariance matrix

    V_c = V - VR' (RVR')^{-1} RV

If test displays a Wald test for joint (simultaneous) hypotheses, it can also display all 1-degree-of-freedom tests, with p-values adjusted for multiple testing. Let p_1, p_2, ..., p_k be the unadjusted p-values of these 1-degree-of-freedom tests. The Bonferroni-adjusted p-values are defined as p^b_i = min(1, k p_i). The Šidák-adjusted p-values are p^s_i = 1 - (1 - p_i)^k. Holm's method for adjusting p-values is defined as p^h_i = min(1, k_i p_i), where k_i is the number of p-values at least as large as p_i. Note that p^h_i ≤ p^b_i, reflecting that Holm's method is less conservative than the widely used Bonferroni method.
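The Wald statistic W given at the start of this section can be reproduced by hand from the results an estimation command leaves behind. The following minimal sketch, using the auto dataset with a single illustrative restriction (the matrix names b, V, R, and W are arbitrary), builds W from e(b) and e(V) and compares it with what test reports after regress:

    . sysuse auto, clear
    . quietly regress price mpg weight
    . test mpg = weight                            // test's F statistic for this restriction
    . matrix b = e(b)'                             // column vector of estimated coefficients
    . matrix V = e(V)                              // estimated variance-covariance matrix
    . matrix R = (1, -1, 0)                        // coefficient order: mpg, weight, _cons
    . matrix W = (R*b)' * invsym(R*V*R') * (R*b)   // r = 0 for this restriction
    . display "F(1, " e(df_r) ") = " el(W,1,1)     // q = 1, so F = W/q = W

Because regress reports t-based results, the statistic is referred to an F(q, d) distribution with d = e(df_r); after an estimator that reports z statistics, W itself would be referred to a χ2 distribution with q degrees of freedom, as described above.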
If test is used after a svy command, it carries out an adjusted Wald test—this adjustment should not be confused with the adjustment for multiple testing. Both adjustments may actually be combined. Specifically, the survey adjustment uses an approximate F statistic (d−k + 1)W/(kd), where W is the Wald test statistic, k is the dimension of the hypothesis test, and d = the total number of sampled PSUs minus the total number of strata. Under the null hypothesis, (d−k+1)F/(kd) ∼ F (k, d−k+1), where F (k, d − k + 1) is an F distribution with k numerator degrees of freedom and d − k + 1 denominator degrees of freedom. If nosvyadjust is specified, the p-value is computed using W/k ∼ F (k, d). See Korn and Graubard (1990) for a detailed description of the Bonferroni adjustment technique and for a discussion of the relative merits of it and of the adjusted and unadjusted Wald tests. 2358 test — Test linear hypotheses after estimation Acknowledgment The svy adjustment code was adopted from another command developed in collaboration with John L. Eltinge of the Bureau of Labor Statistics. References Beale, E. M. L. 1960. Confidence regions in non-linear estimation. Journal of the Royal Statistical Society, Series B 22: 41–88. Eltinge, J. L., and W. M. Sribney. 1996. svy5: Estimates of linear combinations and hypothesis tests for survey data. Stata Technical Bulletin 31: 31–42. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 246–259. College Station, TX: Stata Press. Gourieroux, C. S., and A. Monfort. 1995. Statistics and Econometric Models, Vol 1: General Concepts, Estimation, Prediction, and Algorithms. Trans. Q. Vuong. Cambridge: Cambridge University Press. Holm, S. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6: 65–70. Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2nd ed. New York: Wiley. Korn, E. L., and B. I. Graubard. 1990. Simultaneous testing of regression coefficients with complex survey data: Use of Bonferroni t statistics. American Statistician 44: 270–276. Weesie, J. 1999. sg100: Two-stage linear constrained estimation. Stata Technical Bulletin 47: 24–30. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 217–225. College Station, TX: Stata Press. Also see [R] anova — Analysis of variance and covariance [R] anova postestimation — Postestimation tools for anova [R] contrast — Contrasts and linear hypothesis tests after estimation [R] lincom — Linear combinations of estimators [R] lrtest — Likelihood-ratio test after estimation [R] nestreg — Nested model statistics [R] nlcom — Nonlinear combinations of estimators [R] testnl — Test nonlinear hypotheses after estimation [U] 13.5 Accessing coefficients and standard errors [U] 20 Estimation and postestimation commands Title testnl — Test nonlinear hypotheses after estimation Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax testnl exp = exp testnl (exp = exp = exp . . . , options = exp . . . ) (exp = exp = exp . . . ) . . . 
, options options mtest (opt) iterate(#) Description df(#) use F distribution with # denominator degrees of freedom for the reference distribution of the test statistic carry out the Wald test as W/k ∼ F (k, d); for use with svy estimation commands when the df() option is also specified test each condition separately use maximum # of iterations to find the optimal step size nosvyadjust df(#) and nosvyadjust do not appear in the dialog box. The second syntax means that if more than one expression is specified, each must be surrounded by parentheses. exp is a possibly nonlinear expression containing b[coef ] b[eqno:coef ] [eqno]coef [eqno] b[coef ] eqno is ## name coef identifies a coefficient in the model. coef is typically a variable name, a level indicator, an interaction indicator, or an interaction involving continuous variables. Level indicators identify one level of a factor variable and interaction indicators identify one combination of levels of an interaction; see [U] 11.4.3 Factor variables. coef may contain time-series operators; see [U] 11.4.4 Time-series varlists. Distinguish between [ ], which are to be typed, and , which indicate optional arguments. Menu Statistics > Postestimation > Tests > Test nonlinear hypotheses 2359 2360 testnl — Test nonlinear hypotheses after estimation Description testnl tests (linear or nonlinear) hypotheses about the estimated parameters from the most recently fit model. testnl produces Wald-type tests of smooth nonlinear (or linear) hypotheses about the estimated parameters from the most recently fit model. The p-values are based on the delta method, an approximation appropriate in large samples. testnl can be used with svy estimation results; see [SVY] svy postestimation. The format (exp1 =exp2 =exp3 . . . ) for a simultaneous-equality hypothesis is just a convenient shorthand for a list (exp1 =exp2 ) (exp1 =exp3 ), etc. testnl may also be used to test linear hypotheses. test is faster if you want to test only linear hypotheses; see [R] test. testnl is the only option for testing linear and nonlinear hypotheses simultaneously. Options mtest (opt) specifies that tests be performed for each condition separately. opt specifies the method for adjusting p-values for multiple testing. Valid values for opt are bonferroni holm sidak noadjust Bonferroni’s method Holm’s method Šidák’s method no adjustment is to be made Specifying mtest without an argument is equivalent to specifying mtest(noadjust). iterate(#) specifies the maximum number of iterations used to find the optimal step size in the calculation of numerical derivatives of the test expressions. By default, the maximum number of iterations is 100, but convergence is usually achieved after only a few iterations. You should rarely have to use this option. The following options are available with testnl but are not shown in the dialog box: df(#) specifies that the F distribution with # denominator degrees of freedom be used for the reference distribution of the test statistic. With survey data, # is the design degrees unless nosvyadjust is specified. nosvyadjust is for use with svy estimation commands when the df() option is also specified; see [SVY] svy estimation. It specifies that the Wald test be carried out without the default adjustment for the design degrees of freedom. That is, the test is carried out as W/k ∼ F (k, d) rather than as (d − k + 1)W/(kd) ∼ F (k, d − k + 1), where k = the dimension of the test and d = the design degrees of freedom specified in the df() option. 
Remarks and examples Remarks are presented under the following headings: Introduction Using testnl to perform linear tests Specifying constraints Dropped constraints Multiple constraints Manipulability testnl — Test nonlinear hypotheses after estimation 2361 Introduction Example 1 We have just estimated the parameters of an earnings model on cross-sectional time-series data using one of Stata’s more sophisticated estimators: . use http://www.stata-press.com/data/r13/earnings (NLS Women 14-24 in 1968) . xtgee ln_w grade age c.age#c.age, corr(exchangeable) nolog GEE population-averaged model Number of obs Group variable: idcode Number of groups Link: identity Obs per group: min Family: Gaussian avg Correlation: exchangeable max Wald chi2(3) Scale parameter: .0976738 Prob > chi2 ln_wage Coef. Std. Err. grade age .0749686 .1080806 .0066111 .0235861 c.age#c.age -.0016253 _cons -.8788933 z = = = = = = = 1326 269 1 4.9 9 327.33 0.0000 P>|z| [95% Conf. Interval] 11.34 4.58 0.000 0.000 .062011 .0618526 .0879261 .1543086 .0004739 -3.43 0.001 -.0025541 -.0006966 .2830899 -3.10 0.002 -1.433739 -.3240473 An implication of this model is that peak earnings occur at age - b[age]/(2* b[c.age#c.age]), which here is equal to 33.2. Say that we have a theory that peak earnings should occur at age 16 + 1/ b[grade]. . testnl -_b[age]/(2*_b[c.age#c.age]) = 16 + 1/_b[grade] (1) -_b[age]/(2*_b[c.age#c.age]) = 16 + 1/_b[grade] chi2(1) = 1.71 Prob > chi2 = 0.1914 These data do not reject our theory. Using testnl to perform linear tests testnl may be used to test linear constraints, but test is faster; see [R] test. You could type . testnl _b[x4] = _b[x1] but it would take less computer time if you typed . test _b[x4] = _b[x1] 2362 testnl — Test nonlinear hypotheses after estimation Specifying constraints The constraints to be tested can be formulated in many different ways. You could type . testnl _b[mpg]*_b[weight] = 1 or . testnl _b[mpg] = 1/_b[weight] or you could express the constraint any other way you wished. (To say that testnl allows constraints to be specified in different ways does not mean that the test itself does not depend on the formulation. This point is briefly discussed later.) In formulating the constraints, you must, however, exercise one caution: users of test often refer to the coefficient on a variable by specifying the variable name. For example, . test mpg = 0 More formally, they should type . test _b[mpg] = 0 but test allows the b[] surrounding the variable name to be omitted. testnl does not allow this shorthand. Typing . testnl mpg=0 specifies the constraint that the value of variable mpg in the first observation is zero. If you make this mistake, sometimes testnl will catch it: . testnl mpg=0 equation (1) contains reference to X rather than _b[X] r(198); In other cases, testnl may not catch the mistake; then the constraint will be dropped because it does not make sense: . testnl mpg=0 Constraint (1) dropped (There are reasons other than this for constraints being dropped.) The worst case, however, is . testnl _b[weight]*mpg = 1 when what you mean is not that b[weight] equals the reciprocal of the value of mpg in the first observation, but rather that . testnl _b[weight]*_b[mpg] = 1 Sometimes this mistake will be caught by the “contains reference to X rather than b[X]” error, and sometimes it will not. Be careful. testnl, like test, can be used after any Stata estimation command, including the survey estimators. 
When you use it after a multiple-equation command, such as mlogit or heckman, you refer to coefficients by using Stata’s standard syntax: [eqname] b[varname]. testnl — Test nonlinear hypotheses after estimation 2363 Stata’s single-equation estimation output looks like this: weight mpg Coef ... 12.27 3.21 ... ... <- coefficient is _b[weight] Stata’s multiple-equation output looks like this: Coef ... weight mpg 12.27 3.21 ... ... ... <- coefficient is [cat1]_b[weight] weight mpg 5.83 7.43 ... ... ... <- coefficient is [8]_b[weight] cat1 8 Dropped constraints testnl automatically drops constraints when • They are nonbinding, for example, b[mpg]= b[mpg]. More subtle cases include _b[mpg]*_b[weight] = 4 _b[weight] = 2 _b[mpg] = 2 In this example, the third constraint is nonbinding because it is implied by the first two. • They are contradictory, for example, b[mpg]=2 and b[mpg]=3. More subtle cases include _b[mpg]*_b[weight] = 4 _b[weight] = 2 _b[mpg] = 3 The third constraint contradicts the first two. Multiple constraints Example 2 We illustrate the simultaneous test of a series of constraints using simulated data on labor-market promotion in a given year. We fit a probit model with separate effects for education, experience, and experience-squared for men and women. 2364 testnl — Test nonlinear hypotheses after estimation . use http://www.stata-press.com/data/r13/promotion . probit promo male male#c.(yedu yexp yexp2), nolog Probit regression Number of obs LR chi2(7) Prob > chi2 Log likelihood = -245.42768 Pseudo R2 promo Coef. male = = = = 775 424.42 0.0000 0.4637 Std. Err. z P>|z| [95% Conf. Interval] .6489974 .203739 3.19 0.001 .2496763 1.048318 male#c.yedu 0 1 .9730237 1.390517 .1056136 .1527288 9.21 9.10 0.000 0.000 .7660248 1.091174 1.180023 1.68986 male#c.yexp 0 1 .4559544 1.422539 .0901169 .1544255 5.06 9.21 0.000 0.000 .2793285 1.11987 .6325803 1.725207 male#c.yexp2 0 1 -.1027149 -.3749457 .0573059 .1160113 -1.79 -3.23 0.073 0.001 -.2150325 -.6023236 .0096026 -.1475677 _cons .9872018 .1148215 8.60 0.000 .7621559 1.212248 Note: 1 failure and 2 successes completely determined. The effects of human capital seem to differ between men and women. A formal test confirms this. . test (yedu#0.male = yedu#1.male) (yexp#0.male = yexp#1.male) > (yexp2#0.male = yexp2#1.male) ( 1) [promo]0b.male#c.yedu - [promo]1.male#c.yedu = 0 ( 2) [promo]0b.male#c.yexp - [promo]1.male#c.yexp = 0 ( 3) [promo]0b.male#c.yexp2 - [promo]1.male#c.yexp2 = 0 chi2( 3) = 35.43 Prob > chi2 = 0.0000 How do we interpret this gender difference? It has repeatedly been stressed (see, for example, Long [1997, 47–50]; Allison [1999]) that comparison of groups in binary response models, and similarly in other latent-variable models, is hampered by an identification problem: with β the regression coefficients for the latent variable and σ the standard deviation of the latent residual, only the β/σ are identified. In fact, in terms of the latent regression, the probit coefficients should be interpreted as β/σ , not as the β . If we cannot claim convincingly that the residual standard deviation σ does not vary between the sexes, equality of the regression coefficients β implies that the coefficients of the probit model for men and women are proportional but not necessarily equal. This is a nonlinear hypothesis in terms of the probit coefficients, not a linear one. . 
testnl _b[yedu#1.male]/_b[yedu#0.male] = _b[yexp#1.male]/_b[yexp#0.male] > = _b[yexp2#1.male]/_b[yexp2#0.male] (1) _b[yedu#1.male]/_b[yedu#0.male] = _b[yexp#1.male]/_b[yexp#0.male] (2) _b[yedu#1.male]/_b[yedu#0.male] = _b[yexp2#1.male]/_b[yexp2#0.male] chi2(2) = 9.21 Prob > chi2 = 0.0100 We conclude that we find fairly strong evidence against the proportionality of the coefficients, and hence we have to conclude that success in the labor market is produced in different ways by men and women. (But remember, these were simulated data.) testnl — Test nonlinear hypotheses after estimation 2365 Example 3 The syntax for specifying the equality of multiple expressions is just a convenient shorthand for specifying a series of constraints, namely, that the first expression equals the second expression, the first expression also equals the third expression, etc. The Wald test performed and the output of testnl are the same whether we use the shorthand or we specify the series of constraints. The lengthy specification as a series of constraints can be simplified using the continuation symbols ///. . testnl (_b[yedu#1.male]/_b[yedu#0.male] = /// > _b[yexp#1.male]/_b[yexp#0.male]) /// > (_b[yedu#1.male]/_b[yedu#0.male] = /// > _b[yexp2#1.male]/_b[yexp2#0.male]) (1) _b[yedu#1.male]/_b[yedu#0.male] = _b[yexp#1.male]/_b[yexp#0.male] (2) _b[yedu#1.male]/_b[yedu#0.male] = _b[yexp2#1.male]/_b[yexp2#0.male] chi2(2) = 9.21 Prob > chi2 = 0.0100 Having established differences between men and women, we would like to do multiple testing between the ratios. Because we did not specify hypotheses in advance, we prefer to adjust the p-values of tests using, here, Bonferroni’s method. . testnl _b[yedu#1.male]/_b[yedu#0.male] = /// > _b[yexp#1.male]/_b[yexp#0.male] = /// > _b[yexp2#1.male]/_b[yexp2#0.male], mtest(b) (1) _b[yedu#1.male]/_b[yedu#0.male] = _b[yexp#1.male]/_b[yexp#0.male] (2) _b[yedu#1.male]/_b[yedu#0.male] = _b[yexp2#1.male]/_b[yexp2#0.male] chi2 df p (1) (2) 6.89 0.93 1 1 0.0173 # 0.6713 # all 9.21 2 0.0100 # Bonferroni-adjusted p-values Manipulability Although testnl allows you to specify constraints in different ways that are mathematically equivalent, as noted above, this does not mean that the tests are the same. This difference is known as the manipulability of the Wald test for nonlinear hypotheses; also see [R] boxcox. The test might even be significant for one formulation but not significant for another formulation that is mathematically equivalent. Trying out different specifications to find a formulation with the desired p-value is totally inappropriate, though it may actually be fun to try. There is no variance under representation because the nonlinear Wald test is actually a standard Wald test for a linearization of the constraint, which depends on the particular specification. We note that the likelihood-ratio test is not manipulable in this sense. From a statistical point of view, it is best to choose a specification of the constraints that is as linear is possible. Doing so usually improves the accuracy of the approximation of the null-distribution of the test by a χ2 or an F distribution. The example above used the nonlinear Wald test to test whether the coefficients of human capital variables for men were proportional to those of women. A specification of proportionality of coefficients in terms of ratios of coefficients is fairly nonlinear if the coefficients in the denominator are close to 0. A more linear version of the test results from a bilinear formulation. 
Thus instead of 2366 testnl — Test nonlinear hypotheses after estimation . testnl _b[yedu#1.male]/_b[yedu#0.male] = _b[yexp#1.male]/_b[yexp#0.male] (1) _b[yedu#1.male]/_b[yedu#0.male] = _b[yexp#1.male]/_b[yexp#0.male] chi2(1) = 6.89 Prob > chi2 = 0.0087 perhaps . testnl _b[yedu#1.male]*_b[yexp#0.male] = _b[yedu#0.male]*_b[yexp#1.male] (1) _b[yedu#1.male]*_b[yexp#0.male] = _b[yedu#0.male]*_b[yexp#1.male] chi2(1) = 13.95 Prob > chi2 = 0.0002 is better, and in fact it has been suggested that the latter version of the test is more reliable. This assertion is confirmed by performing simulations and is in line with theoretical results of Phillips and Park (1988). There is strong evidence against the proportionality of human capital effects between men and women, implying for this example that differences in the residual variances between the sexes can be ruled out as the explanation of the sex differences in the analysis of labor market participation. Stored results testnl stores the following in r(): Scalars r(df) r(df r) r(chi2) r(p) r(F) Macros r(mtmethod) Matrices r(G) r(R) r(mtest) degrees of freedom residual degrees of freedom χ2 significance F statistic method specified in mtest() derivatives of R(b) with respect to b; see Methods and formulas below R(b)−q; see Methods and formulas below multiple test results Methods and formulas After fitting a model, define b as the resulting 1 × k parameter vector and V as the k × k covariance matrix. The (linear or nonlinear) hypothesis is given by R(b) = q, where R is a function returning a j × 1 vector. The Wald test formula is (Greene 2012, 528) n o0 −1 n o W = R(b) − q GVG0 R(b) − q where G is the derivative matrix of R(b) with respect to b. W is distributed as χ2 if V is an asymptotic covariance matrix. F = W/j is distributed as F for linear regression. The adjustment methods for multiple testing are described in [R] test. The adjustment for survey design effects is described in [SVY] svy postestimation. testnl — Test nonlinear hypotheses after estimation 2367 References Allison, P. D. 1999. Comparing logit and probit coefficients across groups. Sociological Methods and Research 28: 186–208. Gould, W. W. 1996. crc43: Wald test of nonlinear hypotheses after model estimation. Stata Technical Bulletin 29: 2–4. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 15–18. College Station, TX: Stata Press. Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage. Phillips, P. C. B., and J. Y. Park. 1988. On the formulation of Wald tests of nonlinear restrictions. Econometrica 56: 1065–1083. 
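As a numerical illustration of the Wald formula in Methods and formulas above, the quantities testnl leaves behind in r() can be combined by hand in Mata. This is only a rough sketch, not part of the manual's examples: the test expression is a placeholder, and it assumes the model was fit by a command whose e(V) is conformable with r(G).

. quietly testnl _b[weight]*_b[mpg] = 1
. mata:
: G = st_matrix("r(G)")                 // derivatives of R(b) with respect to b
: r = vec(st_matrix("r(R)"))            // R(b) - q, forced to a column vector
: V = st_matrix("e(V)")                 // VCE of the fitted model
: W = r' * invsym(G*V*G') * r           // Wald statistic; compare with r(chi2)
: W
: end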
Also see [R] contrast — Contrasts and linear hypothesis tests after estimation [R] lincom — Linear combinations of estimators [R] lrtest — Likelihood-ratio test after estimation [R] nlcom — Nonlinear combinations of estimators [R] test — Test linear hypotheses after estimation [U] 13.5 Accessing coefficients and standard errors [U] 20 Estimation and postestimation commands Title tetrachoric — Tetrachoric correlations for binary variables Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax tetrachoric varlist if in weight , options Description options Main zeroadjust matrix notable posdef list of statistics; select up to 4 statistics; default is stats(rho) use the noniterative Edwards and Edwards estimator; default is the maximum likelihood estimator significance level for displaying coefficients significance level for displaying with a star use Bonferroni-adjusted significance level use Šidák-adjusted significance level calculate all the pairwise correlation coefficients by using all available data (pairwise deletion) adjust frequencies when one cell has a zero count display output in matrix form suppress display of correlations modify correlation matrix to be positive semidefinite statlist Description rho se obs p tetrachoric correlation coefficient standard error of rho number of observations exact two-sided significance level stats(statlist) edwards print(#) star(#) bonferroni sidak pw by is allowed; see [D] by. fweights are allowed; see [U] 11.1.6 weight. Menu Statistics > Summaries, tables, and tests > Summary and descriptive statistics 2368 > Tetrachoric correlations tetrachoric — Tetrachoric correlations for binary variables 2369 Description tetrachoric computes estimates of the tetrachoric correlation coefficients of the binary variables in varlist. All of these variables should be 0, 1, or missing values. Tetrachoric correlations assume a latent bivariate normal distribution (X1 ,X2 ) for each pair of variables (v1 ,v2 ), with a threshold model for the manifest variables, vi = 1 if and only if Xi > 0. The means and variances of the latent variables are not identified, but the correlation, r, of X1 and X2 can be estimated from the joint distribution of v1 and v2 and is called the tetrachoric correlation coefficient. tetrachoric computes pairwise estimates of the tetrachoric correlations by the (iterative) maximum likelihood estimator obtained from bivariate probit without explanatory variables (see [R] biprobit) by using the Edwards and Edwards (1984) noniterative estimator as the initial value. The pairwise correlation matrix is returned as r(Rho) and can be used to perform a factor analysis or a principal component analysis of binary variables by using the factormat or pcamat commands; see [MV] factor and [MV] pca. Options Main stats(statlist) specifies the statistics to be displayed in the matrix of output. stats(rho) is the default. Up to four statistics may be specified. stats(rho se p obs) would display the tetrachoric correlation coefficient, its standard error, the significance level, and the number of observations. If varlist contains only two variables, all statistics are shown in tabular form. stats(), print(), and star() have no effect unless the matrix option is also specified. edwards specifies that the noniterative Edwards and Edwards estimator be used. The default is the maximum likelihood estimator. 
If you analyze many binary variables, you may want to use the fast noniterative estimator proposed by Edwards and Edwards (1984). However, if you have skewed variables, the approximation does not perform well. print(#) specifies the maximum significance level of correlation coefficients to be printed. Correlation coefficients with larger significance levels are left blank in the matrix. Typing tetrachoric . . . , print(.10) would list only those correlation coefficients that are significant at the 10% level or lower. star(#) specifies the maximum significance level of correlation coefficients to be marked with a star. Typing tetrachoric . . . , star(.05) would “star” all correlation coefficients significant at the 5% level or lower. bonferroni makes the Bonferroni adjustment to calculated significance levels. This option affects printed significance levels and the print() and star() options. Thus tetrachoric . . . , print(.05) bonferroni prints coefficients with Bonferroni-adjusted significance levels of 0.05 or less. sidak makes the Šidák adjustment to calculated significance levels. This option affects printed significance levels and the print() and star() options. Thus tetrachoric . . . , print(.05) sidak prints coefficients with Šidák-adjusted significance levels of 0.05 or less. pw specifies that the tetrachoric correlation be calculated by using all available data. By default, tetrachoric uses casewise deletion, where observations are ignored if any of the specified variables in varlist are missing. zeroadjust specifies that when one of the cells has a zero count, a frequency adjustment be applied in such a way as to increase the zero to one-half and maintain row and column totals. 2370 tetrachoric — Tetrachoric correlations for binary variables matrix forces tetrachoric to display the statistics as a matrix, even if varlist contains only two variables. matrix is implied if more than two variables are specified. notable suppresses the output. posdef modifies the correlation matrix so that it is positive semidefinite, that is, a proper correlation matrix. The modified result is the correlation matrix associated with the least-squares approximation of the tetrachoric correlation matrix by a positive-semidefinite matrix. If the correlation matrix is modified, the standard errors and significance levels are not displayed and are returned in r(). Remarks and examples Remarks are presented under the following headings: Association in 2-by-2 tables Factor analysis of dichotomous variables Tetrachoric correlations with simulated data Association in 2-by-2 tables Although a wide variety of measures of association in cross tabulations have been proposed, such measures are essentially equivalent (monotonically related) in the special case of 2 × 2 tables—there is only 1 degree of freedom for nonindependence. Still, some measures have more desirable properties than others. Here we compare two measures: the standard Pearson correlation coefficient and the tetrachoric correlation coefficient. Given asymmetric row or column margins, Pearson correlations are limited to a range smaller than −1 to 1, although tetrachoric correlations can still span the range from −1 to 1. 
To illustrate, consider the following set of tables for two binary variables, X and Z: X=0 X=1 Z=0 Z=1 20 − a 10 + a a 10 − a 20 20 30 10 40 For a equal to 0, 1, 2, 5, 8, 9, and 10, the Pearson and tetrachoric correlations for the above table are a 0 1 2 5 Pearson 0.577 0.462 0.346 0 Tetrachoric 1.000 0.792 0.607 0 8 9 10 −0.346 −0.462 −0.577 −0.607 −0.792 −1.000 The restricted range for the Pearson correlation is especially unfortunate when you try to analyze the association between binary variables by using models developed for continuous data, such as factor analysis and principal component analysis. The tetrachoric correlation of two variables (Y1 , Y2 ) can be thought of as the Pearson correlation of two latent bivariate normal distributed variables (Y1∗ , Y2∗ ) with threshold measurement models Yi = (Yi∗ > ci ) for unknown cutpoints ci . Or equivalently, Yi = (Yi∗∗ > 0) where the latent bivariate normal (Y1∗∗ , Y2∗∗ ) are shifted versions of (Y1∗ , Y2∗ ) so that the cutpoints are zero. Obviously, you must judge whether assuming underlying latent variables is meaningful for the data. If this assumption is justified, tetrachoric correlations have two advantages. First, you have an intuitive understanding of tetrachoric — Tetrachoric correlations for binary variables 2371 the size of correlations that are substantively interesting in your field of research, and this intuition is based on correlations that range from −1 to 1. Second, because the tetrachoric correlation for binary variables estimates the Pearson correlation of the latent continuous variables (assumed multivariate normal distributed), you can use the tetrachoric correlations to analyze multivariate relationships between the dichotomous variables. When doing so, remember that you must interpret the model in terms of the underlying continuous variables. Example 1 To illustrate tetrachoric correlations, we examine three binary variables from the familyvalues dataset (described in example 2). . use http://www.stata-press.com/data/r13/familyvalues (Attitudes on gender, relationships and family) . tabulate RS075 RS076 fam att: fam att: trad division of labor women in charge bad 0 1 Total 0 1 1,564 119 979 632 2,543 751 Total 1,683 1,611 3,294 . correlate (obs=3291) RS074 RS075 RS076 RS074 RS075 RS076 1.0000 0.0396 0.1595 1.0000 0.3830 1.0000 . tetrachoric RS074 RS075 RS076 (obs=3291) RS074 RS075 RS076 RS074 RS075 RS076 RS074 RS075 RS076 1.0000 0.0689 0.2480 1.0000 0.6427 1.0000 As usual, the tetrachoric correlation coefficients are larger (in absolute value) and more dispersed than the Pearson correlations. Factor analysis of dichotomous variables Example 2 Factor analysis is a popular model for measuring latent continuous traits. The standard estimators are appropriate only for continuous unimodal data. Because of the skewness implied by Bernoullidistributed variables (especially when the probability is distributed unevenly), a factor analysis of a Pearson correlation matrix can be rather misleading when used in this context. A factor analysis of a matrix of tetrachoric correlations is more appropriate under these conditions (Uebersax 2000). We illustrate this with data on gender, relationship, and family attitudes of spouses using the Households in The Netherlands survey 1995 (Weesie et al. 1995). For attitude variables, it seems reasonable to assume that agreement or disagreement is just a coarse measurement of more nuanced underlying attitudes. 
2372 tetrachoric — Tetrachoric correlations for binary variables To demonstrate, we examine a few of the variables from the familyvalues dataset. . use http://www.stata-press.com/data/r13/familyvalues (Attitudes on gender, relationships and family) . describe RS056-RS063 storage display value variable name type format label variable label RS056 byte %9.0g RS057 byte %9.0g RS058 byte %9.0g RS059 byte %9.0g RS060 byte %9.0g RS061 byte %9.0g RS062 byte %9.0g RS063 byte %9.0g . summarize RS056-RS063 Variable Obs RS056 RS057 RS058 RS059 RS060 fam fam fam fam fam fam fam fam Mean att: att: att: att: att: att: att: att: Std. Dev. should be together should fight for relat should avoid conflict woman better nurturer both spouses money goo woman techn school goo man natural breadwinne common leisure good Min Max 3298 3296 3283 3308 3302 .5630685 .5400485 .6387451 .654474 .3906723 .4960816 .4984692 .4804374 .4756114 .487975 0 0 0 0 0 1 1 1 1 1 RS061 3293 RS062 3307 RS063 3298 . correlate RS056-RS063 (obs=3221) RS056 .7102946 .5857272 .5379018 .4536945 .4926705 .498637 0 0 0 1 1 1 RS056 RS057 RS058 RS059 RS060 RS061 RS062 RS063 1.0000 0.1350 0.2377 0.1816 -0.1020 -0.1137 0.2014 0.2057 RS063 RS063 1.0000 RS057 RS058 RS059 RS060 RS061 RS062 1.0000 0.0258 0.0097 -0.0538 0.0610 0.0285 0.1460 1.0000 0.2550 -0.0424 -0.1375 0.2273 0.1049 1.0000 0.0126 -0.2076 0.4098 0.0911 1.0000 0.0706 -0.0793 0.0179 1.0000 -0.2873 -0.0233 1.0000 0.0975 tetrachoric — Tetrachoric correlations for binary variables 2373 Skewness in these data is relatively modest. For comparison, here are the tetrachoric correlations: . tetrachoric RS056-RS063 (obs=3221) RS056 RS056 RS057 RS058 RS059 RS060 RS061 RS062 RS063 1.0000 0.2114 0.3716 0.2887 -0.1620 -0.1905 0.3135 0.3187 RS063 RS063 1.0000 RS057 RS058 RS059 RS060 RS061 RS062 1.0000 0.0416 0.0158 -0.0856 0.1011 0.0452 0.2278 1.0000 0.4007 -0.0688 -0.2382 0.3563 0.1677 1.0000 0.0208 -0.3664 0.6109 0.1467 1.0000 0.1200 -0.1267 0.0286 1.0000 -0.4845 -0.0388 1.0000 0.1538 Again we see that the tetrachoric correlations are generally larger in absolute value than the Pearson correlations. The bivariate probit and Edwards and Edwards estimators (the edwards option) implemented in tetrachoric may return a correlation matrix that is not positive semidefinite—a mathematical property of any real correlation matrix. Positive definiteness is required by commands for analyses of correlation matrices, such as factormat and pcamat; see [MV] factor and [MV] pca. The posdef option of tetrachoric tests for positive definiteness and projects the estimated correlation matrix to a positive-semidefinite matrix if needed. . tetrachoric RS056-RS063, notable posdef . matrix C = r(corr) This time, we suppressed the display of the correlations with the notable option and requested that the correlation matrix be positive semidefinite with the posdef option. Had the correlation matrix not been positive definite, tetrachoric would have displayed a warning message and then adjusted the matrix to be positive semidefinite. We placed the resulting tetrachoric correlation matrix into a matrix, C, so that we can perform a factor analysis upon it. tetrachoric with the posdef option asserted that C was positive definite because no warning message was displayed. We can verify this by using a familiar characterization of symmetric positivedefinite matrices: all eigenvalues are real and positive. . matrix symeigen eigenvectors eigenvalues = C . 
matrix list eigenvalues eigenvalues[1,8] e1 e2 e3 e4 r1 2.5974789 1.3544664 1.0532476 .77980391 e8 r1 .35299228 e5 .73462018 e6 .57984565 e7 .54754512 We can proceed with a factor analysis on the matrix C. We use factormat and select iterated principal factors as the estimation method; see [MV] factor. 2374 tetrachoric — Tetrachoric correlations for binary variables . factormat C, n(3221) ipf factor(2) (obs=3221) Factor analysis/correlation Method: iterated principal factors Rotation: (unrotated) Number of obs = Retained factors = Number of params = 3221 2 15 Factor Eigenvalue Difference Proportion Cumulative Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7 Factor8 2.06855 0.66677 0.19497 0.13065 0.02098 -0.07987 -0.09024 -0.17650 1.40178 0.47180 0.06432 0.10967 0.10085 0.01037 0.08626 . 0.7562 0.2438 0.0713 0.0478 0.0077 -0.0292 -0.0330 -0.0645 0.7562 1.0000 1.0713 1.1191 1.1267 1.0975 1.0645 1.0000 LR test: independent vs. saturated: chi2(28) = 4620.01 Prob>chi2 = 0.0000 Factor loadings (pattern matrix) and unique variances Variable Factor1 Factor2 RS056 RS057 RS058 RS059 RS060 RS061 RS062 RS063 0.5528 0.1124 0.5333 0.6961 -0.1339 -0.5126 0.7855 0.2895 0.4120 0.4214 0.0718 -0.1704 -0.0596 0.2851 -0.2165 0.3919 Uniqueness 0.5247 0.8098 0.7105 0.4865 0.9785 0.6560 0.3361 0.7626 Example 3 We noted in example 2 that the matrix of estimates of the tetrachoric correlation coefficients need not be positive definite. Here is an example: . use http://www.stata-press.com/data/r13/familyvalues (Attitudes on gender, relationships and family) . tetrachoric RS056-RS063 in 1/20, posdef (obs=18) matrix with tetrachoric correlations is not positive semidefinite; it has 2 negative eigenvalues maxdiff(corr,adj-corr) = 0.2346 (adj-corr: tetrachoric correlations adjusted to be positive semidefinite) RS056 RS057 RS058 RS059 RS060 RS061 RS062 adj-corr RS056 RS057 RS058 RS059 RS060 RS061 RS062 RS063 adj-corr 1.0000 0.5284 0.3012 0.3251 -0.5197 0.3448 0.1066 0.3830 RS063 RS063 1.0000 1.0000 0.2548 0.2791 -0.4222 0.4815 -0.0375 0.4939 1.0000 0.0550 -0.7163 -0.0958 0.0072 0.4336 1.0000 0.0552 -0.1857 0.3909 0.0075 1.0000 -0.0980 -0.2333 -0.8937 1.0000 -0.7654 -0.0337 1.0000 0.4934 tetrachoric — Tetrachoric correlations for binary variables . mata: 2375 mata (type end to exit) : C2 = st_matrix("r(corr)") : : : : eigenvecs = . eigenvals = . symeigensystem(C2, eigenvecs, eigenvals) eigenvals 1 2 1 1 3 4 3.156592567 2.065279398 1.324911199 .7554904485 5 6 7 8 .4845368741 .2131895139 -8.80914e-19 -1.90196e-16 : end The estimated tetrachoric correlation matrix is rank-2 deficient. With this C2 matrix, we can only use models of correlation that allow for singular cases. Tetrachoric correlations with simulated data Example 4 We use drawnorm (see [D] drawnorm) to generate a sample of 1,000 observations from a bivariate normal distribution with means −1 and 1, unit variances, and correlation 0.4. . clear . set seed . matrix m . matrix V . drawnorm (obs 1000) 11000 = (1, -1) = (1, 0.4 \ 0.4, 1) c1 c2, n(1000) means(m) cov(V) Now consider the measurement model assumed by the tetrachoric correlations. We observe only whether c1 and c2 are greater than zero, . generate d1 = (c1 > 0) . generate d2 = (c2 > 0) . tabulate d1 d2 d2 d1 0 1 Total 0 1 176 656 6 162 182 818 Total 832 168 1,000 We want to estimate the correlation of c1 and c2 from the binary variables d1 and d2. Pearson’s correlation of the binary variables d1 and d2 is 0.170—a seriously biased estimate of the underlying correlation ρ = 0.4. 
2376 tetrachoric — Tetrachoric correlations for binary variables . correlate d1 d2 (obs=1000) d1 d2 d1 d2 1.0000 0.1704 1.0000 The tetrachoric correlation coefficient of d1 and d2 estimates the Pearson correlation of the latent continuous variables, c1 and c2. . tetrachoric d1 d2 Number of obs = 1000 Tetrachoric rho = 0.4790 Std error = 0.0700 Test of Ho: d1 and d2 are independent 2-sided exact P = 0.0000 The estimate of the tetrachoric correlation of d1 and d2, 0.4790, is much closer to the underlying correlation, 0.4, between c1 and c2. Stored results tetrachoric stores the following in r(): Scalars r(rho) r(N) r(nneg) r(se rho) r(p) Macros r(method) Matrices r(Rho) r(Se Rho) r(Nobs) r(P) tetrachoric correlation coefficient between variables 1 and 2 number of observations number of negative eigenvalues (posdef only) standard error of r(rho) exact two-sided significance level estimator used tetrachoric correlation matrix standard errors of r(Rho) number of observations used in computing correlation exact two-sided significance level matrix Methods and formulas tetrachoric provides two estimators for the tetrachoric correlation ρ of two binary variables with the frequencies nij , i, j = 0, 1. tetrachoric defaults to the slower (iterative) maximum likelihood estimator obtained from bivariate probit without explanatory variables (see [R] biprobit) by using the Edwards and Edwards noniterative estimator as the initial value. A fast (noniterative) estimator is also available by specifying the edwards option (Edwards and Edwards 1984; Digby 1983) ρb = α−1 α+1 tetrachoric — Tetrachoric correlations for binary variables 2377 where α= n00 n11 n01 n10 π/4 (π = 3.14 . . .) if all nij > 0. If n00 = 0 or n11 = 0, ρb = −1; if n01 = 0 or n10 = 0, ρb = 1. The asymptotic variance of the Edwards and Edwards estimator of the tetrachoric correlation is easily obtained by the delta method, avar(b ρ) = πα 2(1 + α)2 2 1 1 1 1 + + + n00 n01 n10 n11 provided all nij > 0, otherwise it is left undefined (missing). The Edwards and Edwards estimator is fast, but may be inaccurate if the margins are very skewed. tetrachoric reports exact p-values for statistical independence, computed by the exact option of [R] tabulate twoway. References Brown, M. B. 1977. Algorithm AS 116: The tetrachoric correlation and its asymptotic standard error. Applied Statistics 26: 343–351. Brown, M. B., and J. K. Benedetti. 1977. On the mean and variance of the tetrachoric correlation coefficient. Psychometrika 42: 347–355. Digby, P. G. N. 1983. Approximating the tetrachoric correlation coefficient. Biometrics 39: 753–757. Edwards, J. H., and A. W. F. Edwards. 1984. Approximating the tetrachoric correlation coefficient. Biometrics 40: 563. Golub, G. H., and C. F. Van Loan. 1996. Matrix Computations. 3rd ed. Baltimore: Johns Hopkins University Press. Uebersax, J. S. 2000. Estimating a latent trait model by factor analysis of tetrachoric correlations. http://ourworld.compuserve.com/homepages/jsuebersax/irt.htm. Weesie, J., M. Kalmijn, W. Bernasco, and D. Giesen. 1995. Households in The Netherlands 1995. Utrecht, Netherlands: Datafile, ISCORE, University of Utrecht. 
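As a small numerical check of the Edwards and Edwards formula given in Methods and formulas above, the following sketch plugs in the 2 x 2 cell counts from the simulated-data example (d1 by d2: 176, 6, 656, 162). Because those margins are skewed, the noniterative estimate can differ noticeably from the maximum likelihood estimate reported by tetrachoric; the scalar names are ours, chosen only for this illustration:

. scalar n00 = 176
. scalar n01 = 6
. scalar n10 = 656
. scalar n11 = 162
. scalar a = (n00*n11/(n01*n10))^(_pi/4)
. scalar rho_ee = (a - 1)/(a + 1)
. display "Edwards and Edwards estimate = " rho_ee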
Also see [R] biprobit — Bivariate probit regression [R] correlate — Correlations (covariances) of variables or coefficients [R] spearman — Spearman’s and Kendall’s correlations [R] tabulate twoway — Two-way table of frequencies [MV] factor — Factor analysis [MV] pca — Principal component analysis Title tnbreg — Truncated negative binomial regression Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Options Acknowledgment Syntax tnbreg depvar indepvars if in weight , options options Description Model noconstant ll(# | varname) dispersion(mean) dispersion(constant) exposure(varnamee ) offset(varnameo ) constraints(constraints) collinear suppress constant term truncation point; default value is ll(0), zero truncation parameterization of dispersion; the default constant dispersion for all observations include ln(varnamee ) in model with coefficient constrained to 1 include varnameo in model with coefficient constrained to 1 apply specified linear constraints keep collinear variables SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife Reporting level(#) nolrtest irr nocnsreport display options set confidence level; default is level(95) suppress likelihood-ratio test report incidence-rate ratios do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics indepvars may contain factor variables; see [U] 11.4.3 Factor variables. depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce() and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 2378 tnbreg — Truncated negative binomial regression 2379 Menu Statistics > Count outcomes > Truncated negative binomial regression Description tnbreg estimates the parameters of a truncated negative binomial model by maximum likelihood. The dependent variable depvar is regressed on indepvars, where depvar is a positive count variable whose values are all above the truncation point. Options Model noconstant; see [R] estimation options. ll(# | varname) specifies the truncation point, which is a nonnegative integer. The default is zero truncation, ll(0). dispersion(mean | constant) specifies the parameterization of the model. dispersion(mean), the default, yields a model with dispersion equal to 1 + α exp(xj β + offsetj ); that is, the dispersion is a function of the expected mean: exp(xj β + offsetj ). dispersion(constant) has dispersion equal to 1 + δ ; that is, it is a constant for all observations. exposure(varnamee ), offset(varnameo ), constraints(constraints), collinear; see [R] estimation options. 
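Before turning to the remaining options, here is a minimal sketch contrasting the two dispersion() parameterizations just described, using the medpar data from example 1 below (output omitted; the stored-estimate names are arbitrary):

. use http://www.stata-press.com/data/r13/medpar, clear
. tnbreg los died hmo type2-type3, dispersion(mean)
. estimates store disp_mean
. tnbreg los died hmo type2-type3, dispersion(constant)
. estimates store disp_constant
. estimates table disp_mean disp_constant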
SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. nolrtest suppresses fitting the Poisson model. Without this option, a comparison Poisson model is fit, and the likelihood is used in a likelihood-ratio test of the null hypothesis that the dispersion parameter is zero. irr reports estimated coefficients transformed to incidence-rate ratios, that is, eβi rather than βi . Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated or stored. irr may be specified at estimation or when replaying previously estimated results. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. 2380 tnbreg — Truncated negative binomial regression Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). The following option is available with tnbreg but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Grogger and Carson (1991) showed that overdispersion causes inconsistent estimation of the mean in the truncated Poisson model. To solve this problem, they proposed using the truncated negative binomial model as an alternative. If data are truncated but do not exhibit overdispersion, the truncated Poisson model is more appropriate; see [R] tpoisson. For an introduction to negative binomial regression, see Cameron and Trivedi (2005, 2010) and Long and Freese (2014). For an introduction to truncated negative binomial models, see Cameron and Trivedi (2013) and Long (1997, chap. 8). tnbreg fits the mean-dispersion and the constant-dispersion parameterizations of truncated negative binomial models. These parameterizations extend those implemented in nbreg; see [R] nbreg. Example 1 We illustrate the truncated negative binomial model using the 1997 MedPar dataset (Hilbe 1999). The data are from 1,495 patients in Arizona who were assigned to a diagnostic-related group (DRG) of patients having a ventilator. Length of stay (los), the dependent variable, is a positive integer; it cannot have zero values. The data are truncated because there are no observations on individuals who stayed for zero days. The objective of this example is to determine whether the length of stay was related to the binary variables: died, hmo, type1, type2, and type3. The died variable was recorded as a 0 unless the patient died, in which case, it was recorded as a 1. The other variables also adopted this encoding. The hmo variable was set to 1 if the patient belonged to a health maintenance organization (HMO). The type1–type3 variables indicated the type of admission used for the patient. The type1 variable indicated an emergency admit. 
The type2 variable indicated an urgent admit—that is, the first available bed. The type3 variable indicated an elective admission. Because type1–type3 were mutually exclusive, only two of the three could be used in the truncated negative binomial regression shown below. tnbreg — Truncated negative binomial regression 2381 . use http://www.stata-press.com/data/r13/medpar . tnbreg los died hmo type2-type3, vce(cluster provnum) nolog Truncated negative binomial regression Truncation point: 0 Number of obs = 1495 Dispersion = mean Wald chi2(4) = 36.01 Log likelihood = -4737.535 Prob > chi2 = 0.0000 (Std. Err. adjusted for 54 clusters in provnum) Robust Std. Err. los Coef. died hmo type2 type3 _cons -.2521884 -.0754173 .2685095 .7668101 2.224028 .061533 .0533132 .0666474 .2183505 .034727 /lnalpha -.630108 alpha .5325343 z -4.10 -1.41 4.03 3.51 64.04 P>|z| 0.000 0.157 0.000 0.000 0.000 [95% Conf. Interval] -.3727908 -.1799091 .137883 .338851 2.155964 -.1315859 .0290746 .3991359 1.194769 2.292091 .0764019 -.779853 -.480363 .0406866 .4584734 .6185588 Because observations within the same hospital (provnum) are likely to be correlated, we specified the vce(cluster provnum) option. The results show that whether the patient died in the hospital and the type of admission have significant effects on the patient’s length of stay. Example 2 To illustrate truncated negative binomial regression with more complex data than the previous example, similar data were created from 100 hospitals. Each hospital had its own way of tracking patient data. In particular, hospitals only recorded data from patients with a minimum length of stay, denoted by the variable minstay. Definitions for minimum length of stay varied among hospitals, typically, from 5 to 18 days. The objective of this example is the same as before: to determine whether the length of stay, recorded in los, was related to the binary variables: died, hmo, type1, type2, and type3. The binary variables encode the same information as in example 1 above. The minstay variable was used to allow for varying truncation points. 2382 tnbreg — Truncated negative binomial regression . use http://www.stata-press.com/data/r13/medproviders . tnbreg los died hmo type2-type3, ll(minstay) vce(cluster hospital) nolog Truncated negative binomial regression Truncation points: minstay Number of obs = 2144 Dispersion = mean Wald chi2(4) = 15.22 Log likelihood = -7864.0928 Prob > chi2 = 0.0043 (Std. Err. adjusted for 100 clusters in hospital) Robust Std. Err. los Coef. died hmo type2 type3 _cons .0781044 -.0731128 .0294136 .0626352 3.014964 .0303596 .0368897 .0390167 .054012 .0290886 /lnalpha -.9965131 alpha .3691645 z 2.57 -1.98 0.75 1.16 103.65 P>|z| 0.010 0.047 0.451 0.246 0.000 [95% Conf. Interval] .0186006 -.1454152 -.0470578 -.0432265 2.957951 .1376081 -.0008104 .1058849 .1684969 3.071977 .082867 -1.158929 -.8340967 .0305916 .313822 .4342666 In this analysis, two variables have a statistically significant relationship with length of stay. On average, patients who died in the hospital had longer lengths of stay (p = 0.01). Because the coefficient for HMO is negative, that is, bHMO = −0.073, on average, patients who were insured by an HMO had shorter lengths of stay (p = 0.047). The type of admission was not statistically significant (p > 0.05). 
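One way to see what the fitted model implies for these data is to compare the observed lengths of stay with the conditional means predicted by the model; predict's cm statistic, documented in [R] tnbreg postestimation, accounts for the truncation. A minimal sketch, assuming the model above is still the active estimation result (cm_los is a hypothetical variable name):

. predict cm_los, cm
. summarize los cm_los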
tnbreg — Truncated negative binomial regression 2383 Stored results tnbreg stores the following in e(): Scalars e(N) e(k) e(k aux) e(k eq) e(k eq model) e(k dv) e(df m) e(r2 p) e(ll) e(ll 0) e(ll c) e(alpha) e(N clust) e(chi2) e(chi2 c) e(p) e(rank) e(rank0) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(llopt) e(wtype) e(wexp) e(title) e(clustvar) e(offset) e(chi2type) e(chi2 ct) e(dispers) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) Functions e(sample) number of observations number of parameters number of auxiliary parameters number of equations in e(b) number of equations in overall model test number of dependent variables model degrees of freedom pseudo-R-squared log likelihood log likelihood, constant-only model log likelihood, comparison model value of alpha number of clusters χ2 χ2 for comparison test significance rank of e(V) rank of e(V) for constant-only model number of iterations return code 1 if converged, 0 otherwise tnbreg command as typed name of dependent variable contents of ll(), or 0 if not specified weight type weight expression title in estimation output name of cluster variable linear offset variable Wald or LR; type of model χ2 test Wald or LR; type of model χ2 test corresponding to e(chi2 c) mean or constant vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample 2384 tnbreg — Truncated negative binomial regression Methods and formulas Methods and formulas are presented under the following headings: Mean-dispersion model Constant-dispersion model Mean-dispersion model A negative binomial distribution can be regarded as a gamma mixture of Poisson random variables. The number of times an event occurs, yj , is distributed as Poisson(νj µj ). That is, its conditional likelihood is (νj µj )yj e−νj µj f (yj | νj ) = Γ(yj + 1) where µj = exp(xj β + offsetj ) and νj is an unobserved parameter with a Gamma(1/α, α) density: g(ν) = ν (1−α)/α e−ν/α α1/α Γ(1/α) This gamma distribution has a mean of 1 and a variance of α, where α is our ancillary parameter. The unconditional likelihood for the j th observation is therefore Z ∞ f (yj | ν)g(ν) dν = f (yj ) = 0 Γ(m + yj ) pm (1 − pj )yj Γ(yj + 1)Γ(m) j where pj = 1/(1 + αµj ) and m = 1/α. Solutions for α are handled by searching for lnα because α must be greater than zero. 
The conditional probability of observing yj events given that yj is greater than the truncation point τj is

\[
\Pr(Y = y_j \mid y_j > \tau_j, \mathbf{x}_j) = \frac{f(y_j)}{\Pr(Y > \tau_j \mid \mathbf{x}_j)}
\]

The log likelihood (with weights wj and offsets) is given by

\[
m = 1/\alpha \qquad p_j = \frac{1}{1 + \alpha\mu_j} \qquad \mu_j = \exp(\mathbf{x}_j\boldsymbol\beta + \mathrm{offset}_j)
\]
\[
\ln L = \sum_{j=1}^{n} w_j \Big[\, \ln\{\Gamma(m + y_j)\} - \ln\{\Gamma(y_j + 1)\} - \ln\{\Gamma(m)\} + m\ln(p_j) + y_j\ln(1 - p_j) - \ln\{\Pr(Y > \tau_j \mid p_j, m)\} \,\Big]
\]

Constant-dispersion model

The constant-dispersion model assumes that yj is conditionally distributed as Poisson(µ*j), where µ*j ∼ Gamma(µj/δ, δ) for some dispersion parameter δ [by contrast, the mean-dispersion model assumes that µ*j ∼ Gamma(1/α, αµj)]. The log likelihood is given by

\[
m_j = \mu_j/\delta \qquad p = \frac{1}{1 + \delta}
\]
\[
\ln L = \sum_{j=1}^{n} w_j \Big[\, \ln\{\Gamma(m_j + y_j)\} - \ln\{\Gamma(y_j + 1)\} - \ln\{\Gamma(m_j)\} + m_j\ln(p) + y_j\ln(1 - p) - \ln\{\Pr(Y > \tau_j \mid p, m_j)\} \,\Big]
\]

with everything else defined as shown above in the calculations for the mean-dispersion model.

This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. tnbreg also supports estimation with survey data. For details on variance–covariance estimates with survey data, see [SVY] variance estimation.

Acknowledgment

We gratefully acknowledge the previous work by Joseph Hilbe (1999) of Arizona State University, past editor of the Stata Technical Bulletin and coauthor of the Stata Press book Generalized Linear Models and Extensions.

References

Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. New York: Cambridge University Press.
———. 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press.
———. 2013. Regression Analysis of Count Data. 2nd ed. New York: Cambridge University Press.
Grogger, J. T., and R. T. Carson. 1991. Models for truncated counts. Journal of Applied Econometrics 6: 225–238.
Hilbe, J. M. 1998. sg91: Robust variance estimators for MLE Poisson and negative binomial regression. Stata Technical Bulletin 45: 26–28. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 177–180. College Station, TX: Stata Press.
———. 1999. sg102: Zero-truncated Poisson and negative binomial regression. Stata Technical Bulletin 47: 37–40. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 233–236. College Station, TX: Stata Press.
Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage.
Long, J. S., and J. Freese. 2014. Regression Models for Categorical Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata Press.
Simonoff, J. S. 2003. Analyzing Categorical Data. New York: Springer.
2386 tnbreg — Truncated negative binomial regression Also see [R] tnbreg postestimation — Postestimation tools for tnbreg [R] nbreg — Negative binomial regression [R] poisson — Poisson regression [R] tpoisson — Truncated Poisson regression [R] zinb — Zero-inflated negative binomial regression [R] zip — Zero-inflated Poisson regression [SVY] svy estimation — Estimation commands for survey data [XT] xtnbreg — Fixed-effects, random-effects, & population-averaged negative binomial models [U] 20 Estimation and postestimation commands Title tnbreg postestimation — Postestimation tools for tnbreg Description Methods and formulas Syntax for predict Also see Menu for predict Options for predict Description The following postestimation commands are available after tnbreg: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 forecast is not appropriate with svy estimation results. 2 lrtest is not appropriate with svy estimation results. 2387 2388 tnbreg postestimation — Postestimation tools for tnbreg Syntax for predict predict type predict type newvar if in , statistic nooffset stub* | newvarreg newvardisp if in , scores Description statistic Main number of events; the default incidence rate conditional mean, E(yj | yj > τj ) probability Pr(yj = n) probability Pr(a ≤ yj ≤ b) conditional probability Pr(yj = n | yj > τj ) conditional probability Pr(a ≤ yj ≤ b | yj > τj ) linear prediction standard error of the linear prediction n ir cm pr(n) pr(a,b) cpr(n) cpr(a,b) xb stdp These statistics are available both in and out of sample; type predict the estimation sample. . . . if e(sample) . . . if wanted only for Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main n, the default, calculates the predicted number of events, which is exp(xj β) if neither offset() nor exposure() was specified when the model was fit; exp(xj β + offsetj ) if offset() was specified; or exp(xj β) × exposurej if exposure() was specified. ir calculates the incidence rate exp(xj β), which is the predicted number of events when exposure is 1. This is equivalent to specifying both the n and the nooffset options. cm calculates the conditional mean, E(yj | yj > τj ) = E(yj ) Pr(yj > τj ) where τj is the truncation point found in e(llopt). pr(n) calculates the probability Pr(yj = n), where n is a nonnegative integer that may be specified as a number or a variable. 
pr(a,b) calculates the probability Pr(a ≤ yj ≤ b), where a and b are nonnegative integers that may be specified as numbers or variables;

    b missing (b ≥ .) means +∞;
    pr(20,.) calculates Pr(yj ≥ 20);
    pr(20,b) calculates Pr(yj ≥ 20) in observations for which b ≥ . and calculates Pr(20 ≤ yj ≤ b) elsewhere.

pr(.,b) produces a syntax error. A missing value in an observation of the variable a causes a missing value in that observation for pr(a,b).

cpr(n) calculates the conditional probability Pr(yj = n | yj > τj), where τj is the truncation point found in e(llopt). n is an integer greater than the truncation point that may be specified as a number or a variable.

cpr(a,b) calculates the conditional probability Pr(a ≤ yj ≤ b | yj > τj), where τj is the truncation point found in e(llopt). The syntax for this option is analogous to that used for pr(a,b) except that a must be greater than the truncation point.

xb calculates the linear prediction, which is xjβ if neither offset() nor exposure() was specified when the model was fit; xjβ + offsetj if offset() was specified; or xjβ + ln(exposurej) if exposure() was specified; see nooffset below.

stdp calculates the standard error of the linear prediction.

nooffset is relevant only if you specified offset() or exposure() when you fit the model. It modifies the calculations made by predict so that they ignore the offset or exposure variable; the linear prediction is treated as xjβ rather than as xjβ + offsetj or xjβ + ln(exposurej). Specifying predict ..., nooffset is equivalent to specifying predict ..., ir.

scores calculates equation-level score variables. The first new variable will contain ∂lnL/∂(xjβ). The second new variable will contain ∂lnL/∂(lnα) for dispersion(mean). The second new variable will contain ∂lnL/∂(lnδ) for dispersion(constant).

Methods and formulas

In the following formulas, we use the same notation as in [R] tnbreg.

Methods and formulas are presented under the following headings:
    Mean-dispersion model
    Constant-dispersion model

Mean-dispersion model

The equation-level scores are given by

\[
\mathrm{score}(\mathbf{x}\boldsymbol\beta)_j = p_j(y_j - \mu_j) - \frac{p_j^{(m+1)}\,\mu_j}{\Pr(Y > \tau_j \mid p_j, m)}
\]
\[
\mathrm{score}(\omega)_j = -m\left\{\frac{\alpha(\mu_j - y_j)}{1 + \alpha\mu_j} - \ln(1 + \alpha\mu_j) + \psi(y_j + m) - \psi(m)\right\} - \frac{p_j^{m}\,\{m\ln(p_j) + \mu_j p_j\}}{\Pr(Y > \tau_j \mid p_j, m)}
\]

where ωj = lnαj, ψ(z) is the digamma function, and τj is the truncation point found in e(llopt).

Constant-dispersion model

The equation-level scores are given by

\[
\mathrm{score}(\mathbf{x}\boldsymbol\beta)_j = m_j\left\{\psi(y_j + m_j) - \psi(m_j) + \ln(p) + \frac{p^{m_j}\ln(p)}{\Pr(Y > \tau_j \mid p, m_j)}\right\}
\]
\[
\mathrm{score}(\omega)_j = y_j - (y_j + m_j)(1 - p) - \mathrm{score}(\mathbf{x}\boldsymbol\beta)_j - \frac{\mu_j\,p^{(m_j+1)}}{\Pr(Y > \tau_j \mid p, m_j)}
\]

where ωj = lnδj and τj is the truncation point found in e(llopt).
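To tie the cm statistic to these formulas, the conditional mean for a zero-truncated model can also be assembled by hand from the unconditional mean and the probability of a zero count. A minimal sketch, assuming a model with the default ll(0) truncation has just been fit (the variable names are hypothetical):

. predict mu_hat, n
. predict p0_hat, pr(0)
. generate cm_hand = mu_hat/(1 - p0_hat)
. predict cm_hat, cm
. summarize cm_hand cm_hat

The two constructed variables should agree closely, since the conditional mean given a positive count is the unconditional mean divided by the probability of exceeding the truncation point.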
Also see [R] tnbreg — Truncated negative binomial regression [U] 20 Estimation and postestimation commands Title tobit — Tobit regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax tobit depvar indepvars if in weight , ll (#) ul (#) options Description options Model noconstant ll (#) ∗ ul (#) offset(varname) ∗ suppress constant term left-censoring limit right-censoring limit include varname in model with coefficient constrained to 1 SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, bootstrap, or jackknife Reporting level(#) display options set confidence level; default is level(95) control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics ∗ You must specify at least one of ll (#) or ul (#) . indepvars may contain factor variables; see [U] 11.4.3 Factor variables. depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, nestreg, rolling, statsby, stepwise, and svy are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. vce() and weights are not allowed with the svy prefix; see [SVY] svy. aweights, fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Linear models and related > Censored regression 2391 > Tobit regression 2392 tobit — Tobit regression Description tobit fits a model of depvar on indepvars where the censoring values are fixed. Options Model noconstant; see [R] estimation options. ll (#) and ul (#) indicate the lower and upper limits for censoring, respectively. You may specify one or both. Observations with depvar ≤ ll() are left-censored; observations with depvar ≥ ul() are right-censored; and remaining observations are not censored. You do not have to specify the censoring values at all. It is enough to type ll, ul, or both. When you do not specify a censoring value, tobit assumes that the lower limit is the minimum observed in the data (if ll is specified) and the upper limit is the maximum (if ul is specified). offset(varname); see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: iterate(#), no log, trace, tolerance(#), ltolerance(#), nrtolerance(#), and nonrtolerance; see [R] maximize. These options are seldom used. Unlike most maximum likelihood commands, tobit defaults to nolog — it suppresses the iteration log. 
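Before the remaining options, here is a minimal sketch of the ll() behavior just described, using the automobile data as in the examples below (output omitted). In the first command, tobit takes the observed minimum of mpg as the censoring point; in the second, the limit is stated explicitly:

. use http://www.stata-press.com/data/r13/auto, clear
. generate wgt = weight/1000
. tobit mpg wgt, ll
. tobit mpg wgt, ll(17)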
The following option is available with tobit but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Tobit estimation was originally developed by Tobin (1958). A consumer durable was purchased if a consumer’s desire was high enough, where desire was measured by the dollar amount spent by the purchaser. If no purchase was made, the measure of desire was censored at zero. tobit — Tobit regression 2393 Example 1: Censored from below We will demonstrate tobit with an artificial example, which in the process will allow us to emphasize the assumptions underlying the estimation. We have a dataset containing the mileage ratings and weights of 74 cars. There are no censored variables in this dataset, but we are going to create one. Before that, however, the relationship between mileage and weight in our complete data is . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . generate wgt = weight/1000 . regress mpg wgt Source SS df MS Model Residual 1591.99024 851.469221 1 72 1591.99024 11.8259614 Total 2443.45946 73 33.4720474 mpg Coef. wgt _cons -6.008687 39.44028 Std. Err. .5178782 1.614003 t -11.60 24.44 Number of obs F( 1, 72) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.000 = = = = = = 74 134.62 0.0000 0.6515 0.6467 3.4389 [95% Conf. Interval] -7.041058 36.22283 -4.976316 42.65774 (We divided weight by 1,000 simply to make discussing the resulting coefficients easier. We find that each additional 1,000 pounds of weight reduces mileage by 6 mpg.) mpg in our data ranges from 12 to 41. Let us now pretend that our data were censored in the sense that we could not observe a mileage rating below 17 mpg. If the true mpg is 17 or less, all we know is that the mpg is less than or equal to 17: . replace mpg=17 if mpg<=17 (14 real changes made) . tobit mpg wgt, ll Tobit regression Number of obs LR chi2(1) Prob > chi2 Pseudo R2 Log likelihood = -164.25438 mpg Coef. wgt _cons -6.87305 41.49856 .7002559 2.05838 /sigma 3.845701 .3663309 Obs. summary: Std. Err. t -9.82 20.16 P>|t| 0.000 0.000 = = = = 74 72.85 0.0000 0.1815 [95% Conf. Interval] -8.268658 37.39621 -5.477442 45.6009 3.115605 4.575797 18 left-censored observations at mpg<=17 56 uncensored observations 0 right-censored observations The replace before estimation was not really necessary — we remapped all the mileage ratings below 17 to 17 merely to reassure you that tobit was not somehow using uncensored data. We typed ll after tobit to inform tobit that the data were left-censored. tobit found the minimum of mpg in our data and assumed that was the censoring point. We could also have dispensed with replace and typed ll(17), informing tobit that all values of the dependent variable 17 and below are really censored at 17. In either case, at the bottom of the table, we are informed that there are, as a result, 18 left-censored observations. 2394 tobit — Tobit regression On these data, our estimate is now a reduction of 6.9 mpg per 1,000 extra pounds of weight as opposed to 6.0. The parameter reported as /sigma is the estimated standard error of the regression; the resulting 3.8 is comparable with the estimated root mean squared error reported by regress of 3.4. Technical note You would never want to throw away information by purposefully censoring variables. The regress estimates are in every way preferable to those of tobit. Our example is designed solely to illustrate the relationship between tobit and regress. If you have uncensored data, use regress. 
If your data are censored, you have no choice but to use tobit. Example 2: Censored from above tobit can also fit models that are censored from above. This time, let’s assume that we do not observe the actual mileage rating of cars yielding 24 mpg or better — we know only that it is at least 24. (Also assume that we have undone the change to mpg we made in the previous example.) . use http://www.stata-press.com/data/r13/auto, clear (1978 Automobile Data) . generate wgt = weight/1000 . regress mpg wgt (output omitted ) . tobit mpg wgt, ul(24) Tobit regression Number of obs LR chi2(1) Prob > chi2 Log likelihood = -129.8279 Pseudo R2 mpg Coef. wgt _cons -5.080645 36.08037 .43493 1.432056 /sigma 2.385357 .2444604 Obs. summary: Std. Err. t -11.68 25.19 P>|t| 0.000 0.000 = = = = 74 90.72 0.0000 0.2589 [95% Conf. Interval] -5.947459 33.22628 -4.213831 38.93445 1.898148 2.872566 0 left-censored observations 51 uncensored observations 23 right-censored observations at mpg>=24 tobit — Tobit regression Example 3: Two-limit tobit model tobit can also fit models that are censored from both sides (the so-called two-limit tobit): . tobit mpg wgt, ll(17) ul(24) Tobit regression Number of obs LR chi2(1) Prob > chi2 Pseudo R2 Log likelihood = -104.25976 mpg Coef. wgt _cons -5.764448 38.07469 .7245417 2.255917 /sigma 2.886337 .3952143 Obs. summary: Std. Err. t -7.96 16.88 P>|t| 0.000 0.000 tobit stores the following in e(): Scalars e(N) e(N unc) e(N lc) e(N rc) e(llopt) e(ulopt) e(k aux) e(df m) e(df r) e(r2 p) e(chi2) e(ll) e(ll 0) e(N clust) e(F) e(p) e(rank) e(converged) number of observations number of uncensored observations number of left-censored observations number of right-censored observations contents of ll(), if specified contents of ul(), if specified number of auxiliary parameters model degrees of freedom residual degrees of freedom pseudo-R-squared χ2 log likelihood log likelihood, constant-only model number of clusters F statistic significance rank of e(V) 1 if converged, 0 otherwise 74 77.60 0.0000 0.2712 [95% Conf. Interval] -7.208457 33.57865 -4.320438 42.57072 2.098676 3.673998 18 left-censored observations at mpg<=17 33 uncensored observations 23 right-censored observations at mpg>=24 Stored results = = = = 2395 2396 tobit — Tobit regression Macros e(cmd) e(cmdline) e(depvar) e(wtype) e(wexp) e(title) e(clustvar) e(offset) e(chi2type) e(vce) e(vcetype) e(properties) e(predict) e(footnote) e(asbalanced) e(asobserved) Matrices e(b) e(V) e(V modelbased) Functions e(sample) tobit command as typed name of dependent variable weight type weight expression title in estimation output name of cluster variable linear offset variable LR; type of model χ2 test vcetype specified in vce() title used to label Std. Err. b V program used to implement predict program and arguments to display footnote factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector variance–covariance matrix of the estimators model-based variance marks estimation sample James Tobin (1918–2002) was an American economist who after education and research at Harvard moved to Yale, where he was on the faculty from 1950 to 1988. He made many outstanding contributions to economics and was awarded the Nobel Prize in 1981 “for his analysis of financial markets and their relations to expenditure decisions, employment, production and prices”. He trained in the U.S. 
Navy with the writer Herman Wouk, who later fashioned a character after Tobin in the novel The Caine Mutiny (1951): “A mandarin-like midshipman named Tobit, with a domed forehead, measured quiet speech, and a mind like a sponge, was ahead of the field by a spacious percentage.”

Methods and formulas

See Methods and formulas in [R] intreg. See Tobin (1958) for the original derivation of the tobit model. An introductory description of the tobit model can be found in, for instance, Wooldridge (2013, sec. 17.2), Davidson and MacKinnon (2004, 484–486), Long (1997, 196–210), and Maddala and Lahiri (2006, 333–336). Cameron and Trivedi (2010, chap. 16) discuss the tobit model using Stata examples.

This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. tobit also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation.

References

Amemiya, T. 1973. Regression analysis when the dependent variable is truncated normal. Econometrica 41: 997–1016.
Amemiya, T. 1984. Tobit models: A survey. Journal of Econometrics 24: 3–61.
Burke, W. J. 2009. Fitting and interpreting Cragg’s tobit alternative using Stata. Stata Journal 9: 584–592.
Cameron, A. C., and P. K. Trivedi. 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press.
Cong, R. 2000. sg144: Marginal effects of the tobit model. Stata Technical Bulletin 56: 27–34. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 189–197. College Station, TX: Stata Press.
Davidson, R., and J. G. MacKinnon. 2004. Econometric Theory and Methods. New York: Oxford University Press.
Drukker, D. M. 2002. Bootstrapping a conditional moments test for normality after tobit estimation. Stata Journal 2: 125–139.
Goldberger, A. S. 1983. Abnormal selection bias. In Studies in Econometrics, Time Series, and Multivariate Statistics, ed. S. Karlin, T. Amemiya, and L. A. Goodman, 67–84. New York: Academic Press.
Hurd, M. 1979. Estimation in truncated samples when there is heteroscedasticity. Journal of Econometrics 11: 247–258.
Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage.
Maddala, G. S., and K. Lahiri. 2006. Introduction to Econometrics. 4th ed. New York: Wiley.
McDonald, J. F., and R. A. Moffitt. 1980. The use of tobit analysis. Review of Economics and Statistics 62: 318–321.
Shiller, R. J. 1999. The ET interview: Professor James Tobin. Econometric Theory 15: 867–900.
Stewart, M. B. 1983. On least squares estimation when the dependent variable is grouped. Review of Economic Studies 50: 737–753.
Tobin, J. 1958. Estimation of relationships for limited dependent variables. Econometrica 26: 24–36.
Wooldridge, J. M. 2013. Introductory Econometrics: A Modern Approach. 5th ed. Mason, OH: South-Western.
Also see [R] tobit postestimation — Postestimation tools for tobit [R] heckman — Heckman selection model [R] intreg — Interval regression [R] ivtobit — Tobit model with continuous endogenous regressors [R] regress — Linear regression [R] truncreg — Truncated regression [SVY] svy estimation — Estimation commands for survey data [XT] xtintreg — Random-effects interval-data regression models [XT] xttobit — Random-effects tobit models [U] 20 Estimation and postestimation commands Title tobit postestimation — Postestimation tools for tobit Description Remarks and examples Syntax for predict References Menu for predict Also see Options for predict Description The following postestimation commands are available after tobit: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 hausman lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations Hausman’s specification test point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 2 forecast is not appropriate with svy estimation results. lrtest is not appropriate with svy estimation results. 2398 tobit postestimation — Postestimation tools for tobit 2399 Syntax for predict newvar if in , statistic nooffset predict type stub* | newvarreg newvarsigma if in , scores predict type Description statistic Main xb stdp stdf pr(a,b) e(a,b) ystar(a,b) linear prediction; the default standard error of the linear prediction standard error of the forecast Pr(a < yj < b) E(yj | a < yj < b) E(yj∗ ), yj∗ = max{a, min(yj , b)} These statistics are available both in and out of sample; type predict the estimation sample. . . . if e(sample) . . . if wanted only for stdf is not allowed with svy estimation results. where a and b may be numbers or variables; a missing (a ≥ .) means −∞, and b missing (b ≥ .) means +∞; see [U] 12.2.1 Missing values. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main xb, the default, calculates the linear prediction. stdp calculates the standard error of the prediction, which can be thought of as the standard error of the predicted expected value or mean for the observation’s covariate pattern. The standard error of the prediction is also referred to as the standard error of the fitted value. stdf calculates the standard error of the forecast, which is the standard error of the point prediction for 1 observation. It is commonly referred to as the standard error of the future or forecast value. 
By construction, the standard errors produced by stdf are always larger than those produced by stdp; see Methods and formulas in [R] regress postestimation. pr(a,b) calculates Pr(a < xj b + uj < b), the probability that yj |xj would be observed in the interval (a, b). a and b may be specified as numbers or variable names; lb and ub are variable names; pr(20,30) calculates Pr(20 < xj b + uj < 30); pr(lb,ub) calculates Pr(lb < xj b + uj < ub); and pr(20,ub) calculates Pr(20 < xj b + uj < ub). a missing (a ≥ .) means −∞; pr(.,30) calculates Pr(−∞ < xj b + uj < 30); pr(lb,30) calculates Pr(−∞ < xj b + uj < 30) in observations for which lb ≥ . and calculates Pr(lb < xj b + uj < 30) elsewhere. 2400 tobit postestimation — Postestimation tools for tobit b missing (b ≥ .) means +∞; pr(20,.) calculates Pr(+∞ > xj b + uj > 20); pr(20,ub) calculates Pr(+∞ > xj b + uj > 20) in observations for which ub ≥ . and calculates Pr(20 < xj b + uj < ub) elsewhere. e(a,b) calculates E(xj b + uj | a < xj b + uj < b), the expected value of yj |xj conditional on yj |xj being in the interval (a, b), meaning that yj |xj is truncated. a and b are specified as they are for pr(). ystar(a,b) calculates E(yj∗ ), where yj∗ = a if xj b + uj ≤ a, yj∗ = b if xj b + uj ≥ b, and yj∗ = xj b + uj otherwise, meaning that yj∗ is censored. a and b are specified as they are for pr(). nooffset is relevant only if you specified offset(varname). It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xj b rather than as xj b + offsetj . scores calculates equation-level score variables. The first new variable will contain ∂ ln L/∂(xj β). The second new variable will contain ∂ ln L/∂σ . Remarks and examples Following Cong (2000), write the tobit model as ( yi∗ = yi , if a < yi < b a, if yi ≤ a b, if yi ≥ b yi is a latent variable; instead, we observe yi∗ , which is bounded between a and b if yi is outside those bounds. There are four types of marginal effects that may be of interest in the tobit model, depending on the application: 1. The β coefficients themselves measure how the unobserved variable yi changes with respect to changes in the regressors. 2. The marginal effects of the truncated expected value E(yi∗ |a < yi∗ < b) measure the changes in yi with respect to changes in the regressors among the subpopulation for which yi is not at a boundary. 3. The marginal effects of the censored expected value E(yi∗ ) describe how the observed variable yi∗ changes with respect to the regressors. 4. The marginal effects of Pr(a < yi∗ < b) describe how the probability of being uncensored changes with respect to the regressors. In the next example, we show how to obtain each of these. tobit postestimation — Postestimation tools for tobit 2401 Example 1 In example 3 of [R] tobit, we fit a two-limit tobit model of mpg on wgt. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . generate wgt = weight/1000 . tobit mpg wgt, ll(17) ul(24) Tobit regression Log likelihood = -104.25976 mpg Coef. wgt _cons -5.764448 38.07469 .7245417 2.255917 /sigma 2.886337 .3952143 Obs. summary: Std. Err. t -7.96 16.88 Number of obs LR chi2(1) Prob > chi2 Pseudo R2 P>|t| 0.000 0.000 = = = = 74 77.60 0.0000 0.2712 [95% Conf. Interval] -7.208457 33.57865 -4.320438 42.57072 2.098676 3.673998 18 left-censored observations at mpg<=17 33 uncensored observations 23 right-censored observations at mpg>=24 tobit reports the β coefficients for the latent regression model. 
The marginal effect of xk on y is simply the corresponding βk , because E(y|x) is linear in x. Thus a 1,000-pound increase in a car’s weight (which is a 1-unit increase in wgt) would lower fuel economy by 5.8 mpg. To estimate the means of the marginal effects on the expected value of the censored outcome, conditional on weight being each of three values (2,000; 3,000; and 4,000 pounds), we type . margins, dydx(wgt) predict(ystar(17,24)) at(wgt=(2 3 4)) Conditional marginal effects Number of obs Model VCE : OIM Expression : E(mpg*|17 |z| = 74 [95% Conf. Interval] wgt _at 1 2 3 -1.0861 -4.45315 -1.412822 .311273 .4772541 .3289702 -3.49 -9.33 -4.29 0.000 0.000 0.000 -1.696184 -5.388551 -2.057591 -.4760162 -3.51775 -.768052 The E(y ∗ |x) is nonlinear in x, so the marginal effect for a continuous covariate is not the same as the change in y ∗ induced by a one-unit change in x. Recall that the marginal effect at a point is the slope of the tangent line at that point. In our example, we estimate the mean of the marginal effects for different values of wgt. The estimated mean of the marginal effects is −1.1 mpg for a 2,000 pound car; −4.5 mpg for a 3,000 pound car; and −1.4 mpg for a 4,000 pound car. 2402 tobit postestimation — Postestimation tools for tobit To estimate the means of the marginal effects on the expected value of the truncated outcome at the same levels of wgt, we type . margins, dydx(wgt) predict(e(17,24)) at(wgt=(2 3 4)) Conditional marginal effects Number of obs Model VCE : OIM Expression : E(mpg|17 |z| = 74 [95% Conf. Interval] wgt _at 1 2 3 -1.166572 -2.308842 -1.288896 .0827549 .4273727 .0889259 -14.10 -5.40 -14.49 0.000 0.000 0.000 -1.328768 -3.146477 -1.463188 -1.004375 -1.471207 -1.114604 The mean of the marginal effects of a change in wgt on yi (which is bounded between 17 and 24) is about −1.2 mpg for a 2,000 pound car; −2.3 mpg for a 3,000 pound car; and −1.3 for a 4,000 pound car. References Cong, R. 2000. sg144: Marginal effects of the tobit model. Stata Technical Bulletin 56: 27–34. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 189–197. College Station, TX: Stata Press. McDonald, J. F., and R. A. Moffitt. 1980. The use of tobit analysis. Review of Economics and Statistics 62: 318–321. Also see [R] tobit — Tobit regression [U] 20 Estimation and postestimation commands Title total — Estimate totals Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax total varlist if in weight , options Description options if/in/over over(varlist , nolabel ) group over subpopulations defined by varlist; optionally, suppress group labels SE/Cluster vce(vcetype) vcetype may be analytic, cluster clustvar, bootstrap, or jackknife Reporting level(#) noheader nolegend display options set confidence level; default is level(95) suppress table header suppress table legend control column formats and line width coeflegend display legend instead of statistics bootstrap, jackknife, mi estimate, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce() and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 
Menu Statistics > Summaries, tables, and tests > Summary and descriptive statistics Description total produces estimates of totals, along with standard errors. 2403 > Totals 2404 total — Estimate totals Options if/in/over over(varlist , nolabel ) specifies that estimates be computed for multiple subpopulations, which are identified by the different values of the variables in varlist. When this option is supplied with one variable name, such as over(varname), the value labels of varname are used to identify the subpopulations. If varname does not have labeled values (or there are unlabeled values), the values themselves are used, provided that they are nonnegative integers. Noninteger values, negative values, and labels that are not valid Stata names are substituted with a default identifier. When over() is supplied with multiple variable names, each subpopulation is assigned a unique default identifier. nolabel specifies that value labels attached to the variables identifying the subpopulations be ignored. SE/Cluster vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (analytic), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. vce(analytic), the default, uses the analytically derived variance estimator associated with the sample total. Reporting level(#); see [R] estimation options. noheader prevents the table header from being displayed. This option implies nolegend. nolegend prevents the table legend identifying the subpopulations from being displayed. display options: cformat(% fmt) and nolstretch; see [R] estimation options. The following option is available with total but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Example 1 Suppose that we collected data on incidence of heart attacks. The variable heartatk indicates whether a person ever had a heart attack (1 means yes; 0 means no). We can then estimate the total number of persons who have had heart attacks for each sex in the population represented by the data we collected. total — Estimate totals . use http://www.stata-press.com/data/r13/total . total heartatk [pw=swgt], over(sex) Total estimation Number of obs Male: sex = Male Female: sex = Female = 4946 Over Total Std. Err. [95% Conf. Interval] heartatk Male Female 944559 581590 104372.3 82855.59 739943 419156.3 1149175 744023.7 Stored results total stores the following in e(): Scalars e(N) e(N over) e(N clust) e(k eq) e(df r) e(rank) Macros e(cmd) e(cmdline) e(varlist) e(wtype) e(wexp) e(title) e(cluster) e(over) e(over labels) e(over namelist) e(vce) e(vcetype) e(properties) e(estat cmd) e(marginsnotok) Matrices e(b) e(V) e( N) e(error) Functions e(sample) number of observations number of subpopulations number of clusters number of equations in e(b) sample degrees of freedom rank of e(V) total command as typed varlist weight type weight expression title in estimation output name of cluster variable varlist from over() labels from over() variables names from e(over labels) vcetype specified in vce() title used to label Std. Err. 
b V program used to implement estat predictions disallowed by margins vector of total estimates (co)variance estimates vector of numbers of nonmissing observations error code corresponding to e(b) marks estimation sample 2405 2406 total — Estimate totals Methods and formulas Methods and formulas are presented under the following headings: The total estimator Survey data The survey total estimator The poststratified total estimator Subpopulation estimation The total estimator Let y denote the variable on which to calculate the total and yj , j = 1, . . . , n, denote an individual observation on y . Let wj be the frequency weight (or iweight), and if no weight is specified, define wj = 1 for all j . See the next section for pweighted data. The sum of the weights is an estimate of the population size: n X b= wj N j=1 If the population values of y are denoted by Yj , j = 1, . . . , N , the associated population total is Y = N X Yj = N y j=1 where y is the population mean. The total is estimated as by Yb = N The variance estimator for the total is b 2 Vb (y) Vb (Yb ) = N where Vb (y) is the variance estimator for the mean; see [R] mean. The standard error of the total is the square root of the variance. b are similarly defined for another variable (observed jointly with y ), the If x, xj , x, and X b and Yb is covariance estimator between X d X, b Yb ) = N b 2 Cov(x, d Cov( y) d where Cov(x, y) is the covariance estimator between two means; see [R] mean. Survey data See [SVY] variance estimation and [SVY] poststratification for discussions that provide background information for the following formulas. total — Estimate totals 2407 The survey total estimator Let Yj be a survey item for the j th individual in the population, where j = 1, . . . , M and M is the size of the population. The associated population total for the item of interest is Y = M X Yj j=1 Let yj be the survey item for the j th sampled individual from the population, where j = 1, . . . , m and m is the number of observations in the sample. The estimator Yb for the population total Y is Yb = m X w j yj j=1 where wj is a sampling weight. The estimator for the number of individuals in the population is c= M m X wj j=1 The score variable for the total estimator is the variable itself, zj (Yb ) = yj The poststratified total estimator Let Pk denote the set of sampled observations that belong to poststratum k , and define IPk (j) to indicate if the j th observation is a member of poststratum k , where k = 1, . . . , LP and LP is the number of poststrata. Also, let Mk denote the population size for poststratum k . Pk and Mk are identified by specifying the poststrata() and postweight() options on svyset; see [SVY] svyset. The estimator for the poststratified total is Yb P = LP X Mk k=1 ck M Ybk = LP m X Mk X k=1 where ck = M m X ck M IPk (j) wj yj j=1 IPk (j)wj j=1 The score variable for the poststratified total is bP zj (Y ) = LP X k=1 Mk IPk (j) ck M Ybk yj − ck M ! 2408 total — Estimate totals Subpopulation estimation Let S denote the set of sampled observations that belong to the subpopulation of interest, and define IS (j) to indicate if the j th observation falls within the subpopulation. 
The estimator for the subpopulation total is Yb S = m X IS (j) wj yj j=1 and its score variable is zj (Yb S ) = IS (j) yj The estimator for the poststratified subpopulation total is Yb P S = LP X Mk k=1 ck M YbkS = LP m X Mk X k=1 ck M IPk (j)IS (j) wj yj j=1 and its score variable is bPS zj ( Y )= LP X k=1 Mk IPk (j) ck M ( Yb S IS (j) yj − k ck M ) References Cochran, W. G. 1977. Sampling Techniques. 3rd ed. New York: Wiley. Stuart, A., and J. K. Ord. 1994. Kendall’s Advanced Theory of Statistics: Distribution Theory, Vol I. 6th ed. London: Arnold. Also see [R] total postestimation — Postestimation tools for total [R] mean — Estimate means [R] proportion — Estimate proportions [R] ratio — Estimate ratios [MI] estimation — Estimation commands for use with mi estimate [SVY] direct standardization — Direct standardization of means, proportions, and ratios [SVY] poststratification — Poststratification for survey data [SVY] subpopulation estimation — Subpopulation estimation for survey data [SVY] svy estimation — Estimation commands for survey data [SVY] variance estimation — Variance estimation for survey data [U] 20 Estimation and postestimation commands Title total postestimation — Postestimation tools for total Description Remarks and examples Also see Description The following postestimation commands are available after total: Command Description estat vce estat (svy) estimates lincom variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results point estimates, standard errors, testing, and inference for linear combinations of coefficients point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses nlcom test testnl Remarks and examples Example 1 Continuing with our data on incidence of heart attacks from example 1 in [R] total, we want to test whether there are twice as many heart attacks among men than women in the population. . use http://www.stata-press.com/data/r13/total . total heartatk [pw=swgt], over(sex) (output omitted ) . test _b[Male] = 2*_b[Female] ( 1) [heartatk]Male - 2*[heartatk]Female = 0 F( 1, 4945) = 1.25 Prob > F = 0.2643 Thus we do not reject our hypothesis that the total number of heart attacks for men is twice that for women in the population. 
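lincom, listed among the available postestimation commands above, can report the same linear combination as a point estimate with a standard error and confidence interval rather than as a Wald test. A minimal sketch continuing the heart-attack example:
. lincom _b[Male] - 2*_b[Female]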
Also see [R] total — Estimate totals [U] 20 Estimation and postestimation commands 2409 Title tpoisson — Truncated Poisson regression Syntax Remarks and examples References Menu Stored results Also see Description Methods and formulas Options Acknowledgment Syntax tpoisson depvar indepvars if in weight , options Description options Model noconstant ll(# | varname) exposure(varnamee ) offset(varnameo ) constraints(constraints) collinear suppress constant term truncation point; default value is ll(0), zero truncation include ln(varnamee ) in model with coefficient constrained to 1 include varnameo in model with coefficient constrained to 1 apply specified linear constraints keep collinear variables SE/Robust vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife vce(vcetype) Reporting set confidence level; default is level(95) report incidence-rate ratios do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling level(#) irr nocnsreport display options Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics indepvars may contain factor variables; see [U] 11.4.3 Factor variables. depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce() and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Count outcomes > Truncated Poisson regression 2410 tpoisson — Truncated Poisson regression 2411 Description tpoisson estimates the parameters of a truncated Poisson model by maximum likelihood. The dependent variable depvar is regressed on indepvars, where depvar is a positive count variable whose values are all above the truncation point. Options Model noconstant; see [R] estimation options. ll(# | varname) specifies the truncation point, which is a nonnegative integer. The default is zero truncation, ll(0). exposure(varnamee ), offset(varnameo ), constraints(constraints), collinear; see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. irr reports estimated coefficients transformed to incidence-rate ratios, that is, eβi rather than βi . Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated. irr may be specified at estimation or when replaying previously estimated results. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. 
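As a sketch of the replay behavior just described (the running-shoes model is the one fit in example 1 below), one might fit the model once and then redisplay the coefficient table as incidence-rate ratios:
. use http://www.stata-press.com/data/r13/runshoes
. tpoisson shoes rpweek mpweek male age married
. tpoisson, irr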
Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). The following option is available with tpoisson but is not shown in the dialog box: coeflegend; see [R] estimation options. 2412 tpoisson — Truncated Poisson regression Remarks and examples Truncated Poisson regression is used to model the number of occurrences of an event when that number is restricted to be above the truncation point. If the dependent variable is not truncated, standard Poisson regression may be more appropriate; see [R] poisson. Truncated Poisson regression was first proposed by Grogger and Carson (1991). For an introduction to Poisson regression, see Cameron and Trivedi (2005, 2010) and Long and Freese (2014). For an introduction to truncated Poisson models, see Cameron and Trivedi (2013) and Long (1997, chap. 8). Suppose that the patients admitted to a hospital for a given condition form a random sample from a population of interest and that each admitted patient stays at least one day. You are interested in modeling the length of stay of patients in days. The sample is truncated at zero because you only have data on individuals who stayed at least one day. tpoisson accounts for the truncated sample, whereas poisson does not. Truncation is not the same as censoring. Right-censored Poisson regression was implemented in Stata by Raciborski (2011). Example 1 Consider the Simonoff (2003) dataset of running shoes for a sample of runners who registered an online running log. A running-shoe marketing executive is interested in knowing how the number of running shoes purchased relates to other factors such as gender, marital status, age, education, income, typical number of runs per week, average miles run per week, and the preferred type of running. These data are naturally truncated at zero. A truncated Poisson model is fit to the number of shoes owned on runs per week, miles run per week, gender, age, and marital status. No options are needed because zero truncation is the default for tpoisson. . use http://www.stata-press.com/data/r13/runshoes . tpoisson shoes rpweek mpweek male age married Iteration 0: log likelihood = -88.328151 Iteration 1: log likelihood = -86.272639 Iteration 2: log likelihood = -86.257999 Iteration 3: log likelihood = -86.257994 Truncated Poisson regression Truncation point: 0 Log likelihood = -86.257994 shoes Coef. rpweek mpweek male age married _cons .1575811 .0210673 .0446134 .0185565 -.1283912 -1.205844 Std. Err. .1097893 .0091113 .2444626 .0137786 .2785044 .6619774 z 1.44 2.31 0.18 1.35 -0.46 -1.82 Number of obs LR chi2(5) Prob > chi2 Pseudo R2 P>|z| 0.151 0.021 0.855 0.178 0.645 0.069 = = = = 60 22.75 0.0004 0.1165 [95% Conf. Interval] -.057602 .0032094 -.4345246 -.008449 -.6742498 -2.503296 .3727641 .0389252 .5237513 .045562 .4174674 .0916078 Using the zero-truncated Poisson regression with these data, only the coefficient on average miles per week is statistically significant at the 5% level. tpoisson — Truncated Poisson regression 2413 Example 2 Semiconductor manufacturing requires that silicon wafers be coated with a layer of metal oxide. The depth of this layer is strictly controlled. In this example, a critical oxide layer is designed for 300 ± 20 angstroms (Å). 
After the oxide layer is coated onto a wafer, the wafer enters a photolithography step in which the lines representing the electrical connections are printed on the oxide and later etched and filled with metal. The widths of these lines are measured. In this example, they are controlled to 90±5 micrometers (µm). After these and other steps, each wafer is electrically tested at probe. If too many failures are discovered, the wafer is rejected and sent for engineering analysis. In this example, the maximum number of probe failures tolerated for this product is 10. A major failure at probe has been encountered—88 wafers had more than 10 failures each. The 88 wafers that failed were tested using 4 probe machines. The engineer suspects that the failures were a result of faulty probe machines, poor depth control, or poor line widths. The line widths and depths in these data are the actual measurement minus its specification target, 300 Å for the oxide depths and 90 µm for the line widths. The following table tabulates the average failure rate for each probe using Stata’s mean command; see [R] mean. . use http://www.stata-press.com/data/r13/probe . mean failures, over(probe) nolegend Mean estimation Number of obs Over Mean = 88 Std. Err. [95% Conf. Interval] 1.186293 .5912379 .9279866 .9451117 13.51711 13.78318 14.62611 21.21826 failures 1 2 3 4 15.875 14.95833 16.47059 23.09677 18.23289 16.13348 18.31506 24.97529 The 95% confidence intervals in this table suggest that there are about 5–11 additional failures per wafer on probe 4. These are unadjusted for varying line widths and oxide depths. Possibly, probe 4 received the wafers with larger line widths or extreme oxide depths. Truncated Poisson regression more clearly identifies the root causes for the increased failures by estimating the differences between probes adjusted for the line widths and oxide depths. It also allows us to determine whether the deviations from specifications in line widths or oxide depths might be contributing to the problem. 2414 tpoisson — Truncated Poisson regression . tpoisson failures i.probe depth width, ll(10) nolog Truncated Poisson regression Number of obs Truncation point: 10 LR chi2(5) Prob > chi2 Log likelihood = -239.35746 Pseudo R2 Std. Err. z P>|z| = = = = 88 73.70 0.0000 0.1334 failures Coef. [95% Conf. Interval] probe 2 3 4 -.1113037 .0114339 .4254115 .1019786 .1036032 .0841277 -1.09 0.11 5.06 0.275 0.912 0.000 -.3111781 -.1916245 .2605242 .0885707 .2144924 .5902989 depth width _cons -.0005034 .0330225 2.714025 .0033375 .015573 .0752617 -0.15 2.12 36.06 0.880 0.034 0.000 -.0070447 .0025001 2.566515 .006038 .063545 2.861536 The coefficients listed for the probes are testing the null hypothesis: H0 : probei = probe1 , where i equals 2, 3, and 4. Because the only coefficient that is statistically significant is the one for testing for H0 : probe4 = probe1 , p < 0.001, and because the p-values for the other probes are not statistically significant, that is, p ≥ 0.275, the implication is that there is a difference between probe 4 and the other machines. Because the coefficient for this test is positive, 0.425, the conclusion is that the average failure rate for probe 4, after adjusting for line widths and oxide depths, is higher than the other probes. Possibly, probe 4 needs calibration or the head used with this machine is defective. Line-width control is statistically significant, p = 0.034, but variation in oxide depths is not causing the increased failure rate. 
The engineer concluded that the sudden increase in failures is the result of two problems. First, probe 4 is malfunctioning, and second, there is a possible lithography or etching problem. tpoisson — Truncated Poisson regression 2415 Stored results tpoisson stores the following in e(): Scalars e(N) e(k) e(k eq) e(k eq model) e(k dv) e(df m) e(r2 p) e(ll) e(ll 0) e(N clust) e(chi2) e(p) e(rank) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(llopt) e(wtype) e(wexp) e(title) e(clustvar) e(offset) e(chi2type) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) Functions e(sample) number of observations number of parameters number of equations in e(b) number of equations in overall model test number of dependent variables model degrees of freedom pseudo-R-squared log likelihood log likelihood, constant-only model number of clusters χ2 significance rank of e(V) number of iterations return code 1 if converged, 0 otherwise tpoisson command as typed name of dependent variable contents of ll(), or 0 if not specified weight type weight expression title in estimation output name of cluster variable linear offset variable Wald or LR; type of model χ2 test vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample 2416 tpoisson — Truncated Poisson regression Methods and formulas The conditional probability of observing yj events given that yj > τj , where τj is the truncation point, is given by exp(−λ)λyj Pr(Y = yj | yj > τj , xj ) = yj !Pr(Y > τj | xj ) The log likelihood (with weights wj and offsets) is given by ξj = xj β + offsetj exp{− exp(ξj )} exp(ξj yj ) yj !Pr(Y > τj | ξj ) n X lnL = wj [− exp(ξj ) + ξj yj − ln(yj !) − ln {Pr(Y > τj | ξj )}] f (yj ) = j=1 This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. tpoisson also supports estimation with survey data. For details on variance–covariance estimates with survey data, see [SVY] variance estimation. Acknowledgment We gratefully acknowledge the previous work by Joseph Hilbe (1999) at Arizona State University and past editor of the Stata Technical Bulletin and coauthor of the Stata Press book Generalized Linear Models and Extensions. References Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. New York: Cambridge University Press. . 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press. . 2013. Regression Analysis of Count Data. 2nd ed. New York: Cambridge University Press. Farbmacher, H. 2011. Estimation of hurdle models for overdispersed count data. Stata Journal 11: 82–94. Grogger, J. T., and R. T. Carson. 1991. Models for truncated counts. Journal of Applied Econometrics 6: 225–238. Hilbe, J. M. 1998. sg91: Robust variance estimators for MLE Poisson and negative binomial regression. 
Stata Technical Bulletin 45: 26–28. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 177–180. College Station, TX: Stata Press. . 1999. sg102: Zero-truncated Poisson and negative binomial regression. Stata Technical Bulletin 47: 37–40. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 233–236. College Station, TX: Stata Press. Hilbe, J. M., and D. H. Judson. 1998. sg94: Right, left, and uncensored Poisson regression. Stata Technical Bulletin 46: 18–20. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 186–189. College Station, TX: Stata Press. Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage. Long, J. S., and J. Freese. 2014. Regression Models for Categorical Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata Press. Raciborski, R. 2011. Right-censored Poisson regression model. Stata Journal 11: 95–105. Simonoff, J. S. 2003. Analyzing Categorical Data. New York: Springer. tpoisson — Truncated Poisson regression Also see [R] tpoisson postestimation — Postestimation tools for tpoisson [R] poisson — Poisson regression [R] nbreg — Negative binomial regression [R] tnbreg — Truncated negative binomial regression [R] zinb — Zero-inflated negative binomial regression [R] zip — Zero-inflated Poisson regression [SVY] svy estimation — Estimation commands for survey data [XT] xtpoisson — Fixed-effects, random-effects, and population-averaged Poisson models [U] 20 Estimation and postestimation commands 2417 Title tpoisson postestimation — Postestimation tools for tpoisson Description Methods and formulas Syntax for predict Also see Menu for predict Options for predict Description The following postestimation commands are available after tpoisson: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 forecast is not appropriate with svy estimation results. 2 lrtest is not appropriate with svy estimation results. 
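As a hedged sketch of how one of the commands listed above might be used, margins can follow the running-shoes model from example 1 of [R] tpoisson to estimate average marginal effects on the predicted number of shoes; the choice of covariate here is purely illustrative:
. use http://www.stata-press.com/data/r13/runshoes
. tpoisson shoes rpweek mpweek male age married
. margins, dydx(mpweek)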
2418 tpoisson postestimation — Postestimation tools for tpoisson 2419 Syntax for predict predict type newvar if in , statistic nooffset Description statistic Main number of events; the default incidence rate conditional mean, E(yj | yj > τj ) probability Pr(yj = n) probability Pr(a ≤ yj ≤ b) conditional probability Pr(yj = n | yj > τj ) conditional probability Pr(a ≤ yj ≤ b | yj > τj ) linear prediction standard error of the linear prediction first derivative of the log likelihood with respect to xj β n ir cm pr(n) pr(a,b) cpr(n) cpr(a,b) xb stdp score These statistics are available both in and out of sample; type predict only for the estimation sample. . . . if e(sample) . . . if wanted Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main n, the default, calculates the predicted number of events, which is exp(xj β) if neither offset() nor exposure() was specified when the model was fit; exp(xj β + offsetj ) if offset() was specified; or exp(xj β) × exposurej if exposure() was specified. ir calculates the incidence rate exp(xj β), which is the predicted number of events when exposure is 1. This is equivalent to specifying both the n and the nooffset options. cm calculates the conditional mean, E(yj | yj > τj ) = E(yj ) Pr(yj > τj ) where τj is the truncation point found in e(llopt). pr(n) calculates the probability Pr(yj = n), where n is a nonnegative integer that may be specified as a number or a variable. pr(a,b) calculates the probability Pr(a ≤ yj ≤ b), where a and b are nonnegative integers that may be specified as numbers or variables; b missing (b ≥ .) means +∞; pr(20,.) calculates Pr(yj ≥ 20); pr(20,b) calculates Pr(yj ≥ 20) in observations for which b ≥ . and calculates Pr(20 ≤ yj ≤ b) elsewhere. 2420 tpoisson postestimation — Postestimation tools for tpoisson pr(.,b) produces a syntax error. A missing value in an observation of the variable a causes a missing value in that observation for pr(a,b). cpr(n) calculates the conditional probability Pr(yj = n | yj > τj ), where τj is the truncation point found in e(llopt). n is an integer greater than the truncation point that may be specified as a number or a variable. cpr(a,b) calculates the conditional probability Pr(a ≤ yj ≤ b | yj > τj ), where τj is the truncation point found in e(llopt). The syntax for this option is analogous to that used for pr(a,b) except that a must be greater than the truncation point. xb calculates the linear prediction, which is xj β if neither offset() nor exposure() was specified when the model was fit; xj β + offsetj if offset() was specified; or xj β + ln(exposurej ) if exposure() was specified; see nooffset below. stdp calculates the standard error of the linear prediction. score calculates the equation-level score, ∂ ln L/∂(xj β). nooffset is relevant only if you specified offset() or exposure() when you fit the model. It modifies the calculations made by predict so that they ignore the offset or exposure variable; the linear prediction is treated as xj β rather than as xj β + offsetj or xj β + ln(exposurej ). Specifying predict . . . , nooffset is equivalent to specifying predict . . . , ir. Methods and formulas In the following formula, we use the same notation as in [R] tpoisson. The equation-level scores are given by ξj score(xβ)j = yj − eξj − where τj is the truncation point found in e(llopt). 
Also see [R] tpoisson — Truncated Poisson regression [U] 20 Estimation and postestimation commands e−e eξj Pr(Y > τj | ξj ) Title translate — Print and translate logs Syntax Remarks and examples Description Stored results Options for print Also see Options for translate Syntax Print log and SMCL files print filename , like(ext) name(windowname) override options Translate log files to SMCL files and vice versa translate filenamein filenameout , translator(tname) name(windowname) override options replace View translator parameter settings translator query tname Change translator parameter settings translator set tname setopt setval Return translator parameter settings to default values translator reset tname List current mappings from one extension to another transmap query .ext Specify that files with one extension be treated the same as files with another extension transmap define .extnew .extold filename in print, in addition to being a filename to be printed, may be specified as @Results to mean the Results window and @Viewer to mean the Viewer window. filenamein in translate may be specified just as filename in print. tname in translator specifies the name of a translator; see the translator() option under Options for translate. Description print prints log, SMCL, and text files. Although there is considerable flexibility in how print (and translate, which print uses) can be set to work, they have already been set up and should just work: . print mylog.smcl . print mylog.log 2421 2422 translate — Print and translate logs Unix users may discover that they need to do a bit of setup before print works; see Printing files, Unix below. International Unix users may also wish to modify the default paper size. All users can tailor print and translate to their needs. print may also be used to print the current contents of the Results window or the Viewer. For instance, the current contents of the Results window could be printed by typing . print @Results translate translates log and SMCL files from one format to another, the other typically being suitable for printing. translate can also translate SMCL logs (logs created by typing, say, log using mylog) to plain text: . translate mylog.smcl mylog.log You can use translate to recover a log when you have forgotten to start one. You may type . translate @Results mylog.txt to capture as plain text what is currently shown in the Results window. This entry provides a general overview of print and translate and covers in detail the printing and translation of text (nongraphic) files. translator query, translator set, and translator reset show, change, and restore the default values of the settings for each translator. transmap define and transmap query create and show mappings from one file extension to another for use with print and translate. For example, print myfile.txt knows to use a translator appropriate for printing text files because of the .txt extension. However, it does not know what to do with .xyz files. If you have .xyz files and always wish to treat them as .txt files, you can type transmap define .xyz .txt. Options for print like(ext) specifies how the file should be translated to a form suitable for printing. The default is to determine the translation method from the extension of filename. Thus mylog.smcl is translated according to the rule for translating smcl files, myfile.txt is translated according to the rule for translating txt files, and so on. 
(These rules are, in fact, translate’s smcl2prn and txt2prn translators, but put that aside for the moment.) Rules for the following extensions are predefined: .txt .log .smcl assume input file contains plain text assume input file contains Stata log text assume input file contains SMCL To print a file that has an extension different from those listed above, you can define a new extension, but you do not have to do that. Assume that you wish to print the file read.me, which you know to contain plain text. If you were just to type print read.me, you would be told that Stata cannot translate .me files. (You would actually be told that the translator for me2prn was not found.) You could type print read.me, like(txt) to tell print to print read.me like a .txt file. On the other hand, you could type . transmap define .me .txt translate — Print and translate logs 2423 to tell Stata that .me files are always to be treated like .txt files. If you did that, Stata would remember the new rule, even in future sessions. When you specify the like() option, you override the recorded rules. So, if you were to type print mylog.smcl, like(txt), the file would be printed as plain text (meaning that all the SMCL commands would show). name(windowname) specifies which window to print when printing a Viewer. The default is for Stata to print the topmost Viewer [Unix(GUI) users: See the second technical note in Printing files, Unix]. The name() option is ignored when printing the Results window. The window name is located inside parentheses in the window title. For example, if the title for a Viewer window is Viewer (#1) [help print], the name for the window is #1. override options refers to translate’s options for overriding default values. print uses translate to translate the file into a format suitable for sending to the printer, and thus translate’s override options may also be used with print. The settings available vary between each translator (for example, smcl2ps will have different settings than smcl2txt) and may also differ across operating systems (for example, Windows may have different printing options than Mac OS X). To find out what you can override when printing .smcl files, type . translator query smcl2prn (output omitted ) In the omitted output, you might learn that there is an rmargin # tunable value, which specifies the right margin in inches. You could specify the override option rmargin(#) to temporarily override the default value, or you could type translator set smcl2prn rmargin # beforehand to permanently reset the value. Alternatively, on some computers with some translators, you might discover that nothing can be set. Options for translate translator(tname) specifies the name of the translator to be used to translate the file. 
The available translators are tname Input Output smcl2ps log2ps txt2ps Viewer2ps Results2ps smcl2prn log2prn txt2prn Results2prn Viewer2prn smcl2txt smcl2log Results2txt Viewer2txt smcl2pdf log2pdf txt2pdf Results2pdf Viewer2pdf SMCL Stata text log generic text file Viewer window Results window SMCL Stata text log generic text log Results window Viewer window SMCL SMCL Results window Viewer window SMCL Stata text log generic text log Results window Viewer window PostScript PostScript PostScript PostScript PostScript default printer format default printer format default printer format default printer format default printer format generic text file Stata text log generic text file generic text file PDF PDF PDF PDF PDF 2424 translate — Print and translate logs If translator() is not specified, translate determines which translator to use from extensions of the filenames specified. Typing translate myfile.smcl myfile.ps would use the smcl2ps translator. Typing translate myfile.smcl myfile.ps, translate(smcl2prn) would override the default and use the smcl2prn translator. Actually, when you type translate a.b c.d, translate looks up .b in the transmap extensionsynonym table. If .b is not found, the translator b2d is used. If .b is found in the table, the mapped extension is used (call it b0 ), and then the translator b0 2d is used. For example, Command . translate myfile.smcl myfile.ps Translator used smcl2ps . translate myfile.odd myfile.ps . transmap define .odd .txt . translate myfile.odd myfile.ps odd2ps, which does not exist, so error txt2ps You can list the mappings that translate uses by typing transmap query. name(windowname) specifies which window to translate when translating a Viewer. The default is for Stata to translate the topmost Viewer. The name() option is ignored when translating the Results window. The window name is located inside parentheses in the window title. For example, if the title for a Viewer window is Viewer (#1) [help print], the name for the window is #1. override options override any of the default options of the specified or implied translator. To find out what you can override for, say, log2ps, type . translator query log2ps (output omitted ) In the omitted output, you might learn that there is an rmargin # tunable value, which, for log2ps, specifies the right margin in inches. You could specify the override option rmargin(#) to temporarily override the default value or type translator set log2ps rmargin # beforehand to permanently reset the value. replace specifies that filenameout be replaced if it already exists. Remarks and examples Remarks are presented under the following headings: Printing files Printing files, Mac and Windows Printing files, Unix Translating files from one format to another Printing files Printing should be easy; just type . print mylog.smcl . print mylog.log You can use print to print SMCL files, plain text files, and even the contents of the Results and Viewer windows: translate — Print and translate logs 2425 . print @Results . print @Viewer . print @Viewer, name(#2) For information about printing and translating graph files, see [G-2] graph print and see [G-2] graph export. Printing files, Mac and Windows When you type print, you are using the same facility that you would be using if you had selected Print from the File menu. If you try to print a file that Stata does not know about, Stata will complain: . print read.me translator me2prn not found (perhaps you need to specify the like() option) r(111); Then you could type . 
print read.me, like(txt) to indicate that you wanted read.me sent to the printer in the same fashion as if the file were named readme.txt, or you could type . transmap define .me .txt . print read.me Here you are telling Stata once and for all that you want files ending in .me to be treated in the same way as files ending in .txt. Stata will remember this mapping, even across sessions. To clear the .me mapping, type . transmap define .me To see all the mappings, type . transmap query To print to a file, use the translate command, not print: . translate mylog.smcl mylog.prn translate prints to a file by using the Windows print driver when the new filename ends in .prn. Under Mac, the prn translators are the same as the pdf translators. We suggest that you simply use the .pdf file extension when printing to a file. Printing files, Unix Stata assumes that you have a PostScript printer attached to your Unix computer and that the Unix command lpr(1) can be used to send PostScript files to it, but you can change this. On your Unix system, typing mycomputer$ lpr < filename may not be sufficient to print PostScript files. For instance, perhaps on your system you would need to type mycomputer$ lpr -Plexmark < filename 2426 translate — Print and translate logs or mycomputer$ lpr -Plexmark filename or something else. To set the print command to be lpr -Plexmark filename and to state that the printer expects to receive PostScript files, type . printer define prn ps "lpr -Plexmark @" To set the print command to lpr -Plexmark < filename and to state that the printer expects to receive plain text files, type . printer define prn txt "lpr -Plexmark < @" That is, just type the command necessary to send files to your printer and include an @ sign where the filename should be substituted. Two file formats are available: ps and txt. The default setting, as shipped from the factory, is . printer define prn ps "lpr < @" We will return to the printer command in the technical note that follows because it has some other capabilities you should know about. In any case, after you redefine the default printer, the following should just work: . print mylog.smcl . print mylog.log If you try to print a file that Stata does not know about, it will complain: . print read.me translator me2prn not found r(111); Here you could type . print read.me, like(txt) to indicate that you wanted read.me sent to the printer in the same fashion as if the file were named readme.txt, or you could type . transmap define .me .txt . print read.me Here you are telling Stata once and for all that you want files ending in .me to be treated in the same way as files ending in .txt. Stata will remember this setting for .me, even across sessions. If you want to clear the .me setting, type . transmap define .me If you want to see all your settings, type . transmap query Technical note The syntax of the printer command is printer define printername ps | txt printer query printername "Unix command with @" translate — Print and translate logs 2427 You may define multiple printers. By default, print uses the printer named prn, but print has the syntax print filename , like(ext) printer(printername) override options so, if you define multiple printers, you may route your output to them. For instance, if you have a second printer on your system, you might type . printer define lexmark ps "lpr -Plexmark < @" After doing that, you could type . print myfile.smcl, printer(lexmark) Any printers that you set will be remembered even across sessions. 
You can delete printers: . printer define lexmark You can list all the defined printers by typing printer query, and you can list the definition of a particular printer, say, prn, by typing printer query prn. The default printer prn we have predefined for you is . printer define prn ps "lpr < @" meaning that we assume that it is a PostScript printer and that the Unix command lpr(1), without options, is sufficient to cause files to print. Feel free to change the default definition. If you change it, the change will be remembered across sessions. Technical note Unix(GUI) users should note that X-Windows does not have the concept of a window z-order, which prevents Stata from determining which window is the topmost window. Instead, Stata determines which window is topmost based on which window has the focus. However, some window managers will set the focus to a window without bringing the window to the top. What Stata considers the topmost window may not appear topmost visually. For this reason, you should always use the name() option to ensure that the correct window is printed. Technical note When you select the Results window to print from the Print menu or toolbar button, the result is the same as if you were to issue the print command. When you select a Viewer window to print from the Print menu or toolbar button, the result is the same as if you were to issue the print command with a name() option. The translation to PostScript format is done by translate and, in particular, is performed by the translators smcl2ps, log2ps, and txt2ps. There are many tunable parameters in each of these translators. You can display the current values of these tunable parameters for, say, smcl2ps by typing . translator query smcl2ps (output omitted ) 2428 translate — Print and translate logs and you can set any of the tunable parameters (for instance, setting smcl2ps’s rmargin value to 1) by typing . translator set smcl2ps rmargin 1 (output omitted ) Any settings you make will be remembered across sessions. You can reset smcl2ps to be as it was when Stata was shipped by typing . translator reset smcl2ps Translating files from one format to another If you have a SMCL log, which you might have created by previously typing log using mylog, you can translate it to an text log by typing . translate myfile.smcl myfile.log and you can translate it to a PostScript file by typing . translate myfile.smcl myfile.ps translate translates files from one format to another, and, in fact, print uses translate to produce a file suitable for sending to the printer. When you type . translate a.b c.d translate looks for the predefined translator b2d and uses that to perform the translation. If there is a transmap synonym for b, however, the mapped value b0 is used: b0 2d. Only certain translators exist, and they are listed under the description of the translate() option in Options for translate above, or you can type . translator query for a complete (and perhaps more up-to-date) list. Anyway, translate forms the name b2d or b0 2d, and if the translator does not exist, translate issues an error message. With the translator() option, you can specify exactly which translator to use, and then it does not matter how your files are named. The only other thing to know is that some translators have tunable parameters that affect how they perform their translation. You can type . translator query translator_name to find out what those parameters are. 
Some translators have no tunable parameters, and some have many: translate — Print and translate logs 2429 . translator query smcl2ps header headertext logo user projecttext cmdnumber fontsize pagesize pagewidth pageheight scheme cust1_result_color cust1_standard_color cust1_error_color cust1_input_color cust1_link_color cust1_hilite_color cust1_result_bold cust1_standard_bold cust1_error_bold cust1_input_bold cust1_link_bold cust1_hilite_bold cust1_link_underline cust1_hilite_underline on on on 9 letter 8.50 11.00 lmargin rmargin tmargin bmargin 1.00 1.00 1.00 1.00 monochrome 0 0 0 0 0 0 on off on off off on on off 0 0 0 0 0 0 0 0 0 0 0 0 cust2_result_color cust2_standard_color cust2_error_color cust2_input_color cust2_link_color cust2_hilite_color cust2_result_bold cust2_standard_bold cust2_error_bold cust2_input_bold cust2_link_bold cust2_hilite_bold cust2_link_underline cust2_hilite_underline 0 0 255 0 0 0 on off on off off on on off 0 0 0 0 0 0 0 0 0 255 0 0 You can temporarily override any setting by specifying the setopt(setval) option on the translate (or print) command. For instance, you can type . translate . . . , . . . cmdnumber(off) or you can reset the value permanently by typing . translator set smcl2ps setopt setval For instance, . translator set smcl2ps cmdnumber off If you reset a value, Stata will remember the change, even in future sessions. Mac and Windows users: The smcl2ps (and the other *2ps translators) are not used by print, even when you have a PostScript printer attached to your computer. Instead, the Mac or Windows print driver is used. Resetting smcl2ps values will not affect printing; instead, you change the defaults in the Printers Control Panel in Windows and by selecting Page Setup... from the File menu in Mac. You can, however, translate files yourself using the smcl2ps translator and the other *2ps translators. Stored results transmap query .ext stores in macro r(suffix) the mapped extension (without the leading period) or stores ext if the ext is not mapped. translator query translatorname stores setval in macro r(setopt) for every setopt, setval pair. 2430 translate — Print and translate logs printer query printername (Unix only) stores in macro r(suffix) the “filetype” of the input that the printer expects (currently “ps” or “txt”) and, in macro r(command), the command to send output to the printer. 
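For example, after typing transmap define .me .txt as in Printing files, Mac and Windows above, the mapping can be checked from the command line; r(suffix) will contain txt:
. transmap query .me
. display "`r(suffix)'"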
Also see [R] log — Echo copy of session to file [G-2] graph export — Export current graph [G-2] graph print — Print a graph [G-2] graph set — Set graphics options [P] smcl — Stata Markup and Control Language [U] 15 Saving and printing output—log files Title truncreg — Truncated regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax truncreg depvar indepvars options if in weight , options Description Model noconstant ll(varname | #) ul(varname | #) offset(varname) constraints(constraints) collinear suppress constant term lower limit for left-truncation upper limit for right-truncation include varname in model with coefficient constrained to 1 apply specified linear constraints keep collinear variables SE/Robust vce(vcetype) vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife Reporting level(#) noskip nocnsreport display options set confidence level; default is level(95) perform likelihood-ratio test do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used coeflegend display legend instead of statistics indepvars may contain factor variables; see [U] 11.4.3 Factor variables. depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists. bootstrap, by, fp, jackknife, mi estimate, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. aweights are not allowed with the jackknife prefix; see [R] jackknife. vce(), noskip, and weights are not allowed with the svy prefix; see [SVY] svy. aweights, fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 2431 2432 truncreg — Truncated regression Menu Statistics > Linear models and related > Truncated regression Description truncreg fits a regression model of depvar on indepvars from a sample drawn from a restricted part of the population. Under the normality assumption for the whole population, the error terms in the truncated regression model have a truncated normal distribution, which is a normal distribution that has been scaled upward so that the distribution integrates to one over the restricted range. Options Model noconstant; see [R] estimation options. ll(varname | #) and ul(varname | #) indicate the lower and upper limits for truncation, respectively. You may specify one or both. Observations with depvar ≤ ll() are left-truncated, observations with depvar ≥ ul() are right-truncated, and the remaining observations are not truncated. See [R] tobit for a more detailed description. offset(varname), constraints(constraints), collinear; see [R] estimation options. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. 
noskip specifies that a full maximum-likelihood model with only a constant for the regression equation be fit. This model is not displayed but is used as the base model to compute a likelihood-ratio test for the model test statistic displayed in the estimation header. By default, the overall model test statistic is an asymptotically equivalent Wald test of all the parameters in the regression equation being zero (except the constant). For many models, this option can substantially increase estimation time. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used, but you may use the ltol(#) option to relax the convergence criterion; the default is 1e-6 during specification searches. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). truncreg — Truncated regression 2433 The following option is available with truncreg but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples Truncated regression fits a model of a dependent variable on independent variables from a restricted part of a population. Truncation is essentially a characteristic of the distribution from which the sample data are drawn. If x has a normal distribution with mean µ and standard deviation σ , the density of the truncated normal distribution is f (x | a < x < b) = Φ b−µ σ f (x) −Φ a−µ σ 1 φ x−µ σ σ Φ b−µ − Φ a−µ σ σ = where φ and Φ are the density and distribution functions of the standard normal distribution. Compared with the mean of the untruncated variable, the mean of the truncated variable is greater if the truncation is from below, and the mean of the truncated variable is smaller if the truncation is from above. Moreover, truncation reduces the variance compared with the variance in the untruncated distribution. Example 1 We will demonstrate truncreg with part of the Mroz dataset distributed with Berndt (1996). This dataset contains 753 observations on women’s labor supply. Our subsample is of 250 observations, with 150 market laborers and 100 nonmarket laborers. . use http://www.stata-press.com/data/r13/laborsub . describe Contains data from http://www.stata-press.com/data/r13/laborsub.dta obs: 250 vars: 6 25 Sep 2012 18:36 size: 1,750 variable name lfp whrs kl6 k618 wa we Sorted by: storage type byte int byte byte byte byte display format %9.0g %9.0g %9.0g %9.0g %9.0g %9.0g value label variable label 1 if woman worked in 1975 Wife’s hours of work # of children younger than 6 # of children between 6 and 18 Wife’s age Wife’s educational attainment 2434 truncreg — Truncated regression . summarize, sep(0) Variable Obs Mean lfp whrs kl6 k618 wa we 250 250 250 250 250 250 .6 799.84 .236 1.364 42.92 12.352 Std. Dev. Min Max .4908807 915.6035 .5112234 1.370774 8.426483 2.164912 0 0 0 0 30 5 1 4950 3 8 60 17 We first perform ordinary least-squares estimation on the market laborers. . regress whrs kl6 k618 wa we if whrs > 0 Source SS df MS Model Residual 7326995.15 94793104.2 4 145 1831748.79 653745.546 Total 102120099 149 685369.794 whrs Coef. 
kl6 k618 wa we _cons -421.4822 -104.4571 -4.784917 9.353195 1629.817 Std. Err. 167.9734 54.18616 9.690502 31.23793 615.1301 t -2.51 -1.93 -0.49 0.30 2.65 Number of obs F( 4, 145) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.013 0.056 0.622 0.765 0.009 = = = = = = 150 2.80 0.0281 0.0717 0.0461 808.55 [95% Conf. Interval] -753.4748 -211.5538 -23.9378 -52.38731 414.0371 -89.48953 2.639668 14.36797 71.0937 2845.597 Now we use truncreg to perform truncated regression with truncation from below zero. . truncreg whrs kl6 k618 wa we, ll(0) (note: 100 obs. truncated) Fitting full model: Iteration 0: log likelihood Iteration 1: log likelihood Iteration 2: log likelihood Iteration 3: log likelihood Iteration 4: log likelihood Truncated regression Limit: lower = 0 upper = +inf Log likelihood = -1200.9157 = = = = = -1205.6992 -1200.9873 -1200.9159 -1200.9157 -1200.9157 Std. Err. Number of obs = 150 Wald chi2(4) = 10.05 Prob > chi2 = 0.0395 whrs Coef. z P>|z| [95% Conf. Interval] kl6 k618 wa we _cons -803.0042 -172.875 -8.821123 16.52873 1586.26 321.3614 88.72898 14.36848 46.50375 912.355 -2.50 -1.95 -0.61 0.36 1.74 0.012 0.051 0.539 0.722 0.082 -1432.861 -346.7806 -36.98283 -74.61695 -201.9233 -173.1474 1.030579 19.34059 107.6744 3374.442 /sigma 983.7262 94.44303 10.42 0.000 798.6213 1168.831 truncreg — Truncated regression 2435 If we assume that our data were censored, the tobit model is . tobit whrs kl6 k618 wa we, ll(0) Tobit regression Number of obs LR chi2(4) Prob > chi2 Pseudo R2 Log likelihood = -1367.0903 whrs Coef. kl6 k618 wa we _cons -827.7657 -140.0192 -24.97919 103.6896 589.0001 214.7407 74.22303 13.25639 41.82393 841.5467 /sigma 1309.909 82.73335 Obs. summary: Std. Err. t -3.85 -1.89 -1.88 2.48 0.70 P>|t| 0.000 0.060 0.061 0.014 0.485 = = = = 250 23.03 0.0001 0.0084 [95% Conf. Interval] -1250.731 -286.2129 -51.08969 21.31093 -1068.556 -404.8008 6.174547 1.131317 186.0683 2246.556 1146.953 1472.865 100 left-censored observations at whrs<=0 150 uncensored observations 0 right-censored observations Technical note Whether truncated regression is more appropriate than the ordinary least-squares estimation depends on the purpose of that estimation. If we are interested in the mean of wife’s working hours conditional on the subsample of market laborers, least-squares estimation is appropriate. However if we are interested in the mean of wife’s working hours regardless of market or nonmarket labor status, least-squares estimates could be seriously misleading. Truncation and censoring are different concepts. A sample has been censored if no observations have been systematically excluded but some of the information contained in them has been suppressed. In a truncated distribution, only the part of the distribution above (or below, or between) the truncation points is relevant to our computations. We need to scale it up by the probability that an observation falls in the range that interests us to make the distribution integrate to one. The censored distribution used by tobit, however, is a mixture of discrete and continuous distributions. Instead of rescaling over the observable range, we simply assign the full probability from the censored regions to the censoring points. The truncated regression model is sometimes less well behaved than the tobit model. Davidson and MacKinnon (1993) provide an example where truncation results in more inconsistency than censoring. 
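Both limits may be specified together when the sample is truncated at both ends. For instance, had the sample also excluded women reporting 4,000 or more hours of work, a doubly truncated model with robust standard errors could be fit by typing (the upper limit of 4,000 is purely illustrative):
. truncreg whrs kl6 k618 wa we, ll(0) ul(4000) vce(robust)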
2436 truncreg — Truncated regression Stored results truncreg stores the following in e(): Scalars e(N) e(N bf) e(chi2) e(k eq) e(k eq model) e(k aux) e(df m) e(ll) e(ll 0) e(N clust) e(sigma) e(p) e(rank) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(llopt) e(ulopt) e(depvar) e(wtype) e(wexp) e(title) e(clustvar) e(offset1) e(chi2type) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) e(means) e(dummy) Functions e(sample) number of observations number of obs. before truncation model χ2 number of equations in e(b) number of equations in overall model test number of auxiliary parameters model degrees of freedom log likelihood log likelihood, constant-only model number of clusters estimate of sigma significance rank of e(V) number of iterations return code 1 if converged, 0 otherwise truncreg command as typed contents of ll(), if specified contents of ul(), if specified name of dependent variable weight type weight expression title in estimation output name of cluster variable offset Wald or LR; type of model χ2 test vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance means of independent variables indicator for dummy variables marks estimation sample Methods and formulas Greene (2012, 833–839) and Davidson and MacKinnon (1993, 534–537) provide introductions to the truncated regression model. truncreg — Truncated regression 2437 Let y = Xβ + be the model. y represents continuous outcomes either observed or not observed. Our model assumes that ∼ N (0, σ 2 I). Let a be the lower limit and b be the upper limit. The log likelihood is n n X n a − xj β 1 X b − xj β 2 2 lnL = − log(2πσ ) − (yj − xj β) − −Φ log Φ 2 2σ 2 j=1 σ σ j=1 This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. truncreg also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. References Berndt, E. R. 1996. The Practice of Econometrics: Classic and Contemporary. New York: Addison–Wesley. Cong, R. 1999. sg122: Truncated regression. Stata Technical Bulletin 52: 47–52. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 248–255. College Station, TX: Stata Press. Davidson, R., and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press. Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. 
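In conventional notation, the log likelihood shown in Methods and formulas above may be written (this is only a restatement of the expression given there, with a and b the lower and upper truncation limits and Phi the standard normal distribution function):

\ln L = -\frac{n}{2}\log(2\pi\sigma^2)
        -\frac{1}{2\sigma^2}\sum_{j=1}^{n}(y_j - x_j\beta)^2
        -\sum_{j=1}^{n}\log\left\{\Phi\!\left(\frac{b - x_j\beta}{\sigma}\right)
        -\Phi\!\left(\frac{a - x_j\beta}{\sigma}\right)\right\}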
Also see [R] truncreg postestimation — Postestimation tools for truncreg [R] regress — Linear regression [R] tobit — Tobit regression [MI] estimation — Estimation commands for use with mi estimate [SVY] svy estimation — Estimation commands for survey data [U] 20 Estimation and postestimation commands Title truncreg postestimation — Postestimation tools for truncreg Description Syntax for predict Menu for predict Options for predict Also see Description The following postestimation commands are available after truncreg: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 2 forecast is not appropriate with mi or svy estimation results. lrtest is not appropriate with svy estimation results. 2438 truncreg postestimation — Postestimation tools for truncreg 2439 Syntax for predict predict type predict type newvar if in , statistic nooffset stub* | newvarreg newvarlnsigma if in , scores Description statistic Main xb stdp stdf pr(a,b) e(a,b) ystar(a,b) linear prediction; the default standard error of the prediction standard error of the forecast Pr(a < yj < b) E(yj | a < yj < b) E(yj∗ ), yj∗ = max{a, min(yj , b)} These statistics are available both in and out of sample; type predict the estimation sample. stdf is not allowed with svy estimation results. . . . if e(sample) . . . if wanted only for where a and b may be numbers or variables; a missing (a ≥ .) means −∞, and b missing (b ≥ .) means +∞; see [U] 12.2.1 Missing values. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main xb, the default, calculates the linear prediction. stdp calculates the standard error of the prediction, which can be thought of as the standard error of the predicted expected value or mean for the observation’s covariate pattern. The standard error of the prediction is also referred to as the standard error of the fitted value. stdf calculates the standard error of the forecast, which is the standard error of the point prediction for 1 observation. It is commonly referred to as the standard error of the future or forecast value. By construction, the standard errors produced by stdf are always larger than those produced by stdp; see Methods and formulas in [R] regress postestimation. pr(a,b) calculates Pr(a < xj b + uj < b), the probability that yj |xj would be observed in the interval (a, b). 
a and b may be specified as numbers or variable names; lb and ub are variable names; pr(20,30) calculates Pr(20 < xj b + uj < 30); pr(lb,ub) calculates Pr(lb < xj b + uj < ub); and pr(20,ub) calculates Pr(20 < xj b + uj < ub). a missing (a ≥ .) means −∞; pr(.,30) calculates Pr(−∞ < xj b + uj < 30); pr(lb,30) calculates Pr(−∞ < xj b + uj < 30) in observations for which lb ≥ . and calculates Pr(lb < xj b + uj < 30) elsewhere. 2440 truncreg postestimation — Postestimation tools for truncreg b missing (b ≥ .) means +∞; pr(20,.) calculates Pr(+∞ > xj b + uj > 20); pr(20,ub) calculates Pr(+∞ > xj b + uj > 20) in observations for which ub ≥ . and calculates Pr(20 < xj b + uj < ub) elsewhere. e(a,b) calculates E(xj b + uj | a < xj b + uj < b), the expected value of yj |xj conditional on yj |xj being in the interval (a, b), meaning that yj |xj is truncated. a and b are specified as they are for pr(). ystar(a,b) calculates E(yj∗ ), where yj∗ = a if xj b + uj ≤ a, yj∗ = b if xj b + uj ≥ b, and yj∗ = xj b + uj otherwise, meaning that yj∗ is censored. a and b are specified as they are for pr(). nooffset is relevant only if you specified offset(varname). It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xj b rather than as xj b + offsetj . scores calculates equation-level score variables. The first new variable will contain ∂ ln L/∂(xj β). The second new variable will contain ∂ ln L/∂σ . Also see [R] truncreg — Truncated regression [U] 20 Estimation and postestimation commands Title ttest — t tests (mean-comparison tests) Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax One-sample t test ttest varname == # if in , level(#) Two-sample t test using groups ttest varname if in , by(groupvar) options1 Two-sample t test using variables ttest varname1 == varname2 if in , unpaired unequal welch level(#) if in Paired t test ttest varname1 == varname2 , level(#) Immediate form of one-sample t test ttesti # obs # mean # sd # val , level(#) Immediate form of two-sample t test ttesti # obs1 # mean1 # sd1 # obs2 # mean2 # sd2 options1 Main ∗ by(groupvar) unequal welch level(#) ∗ , options2 Description variable defining the groups unpaired data have unequal variances use Welch’s approximation set confidence level; default is level(95) by(groupvar) is required. options2 Description Main unequal welch level(#) unpaired data have unequal variances use Welch’s approximation set confidence level; default is level(95) by is allowed with ttest; see [D] by. 2441 2442 ttest — t tests (mean-comparison tests) Menu ttest Statistics > Summaries, tables, and tests > Classical tests of hypotheses > t test (mean-comparison test) > Summaries, tables, and tests > Classical tests of hypotheses > t test calculator ttesti Statistics Description ttest performs t tests on the equality of means. In the first form, ttest tests that varname has a mean of #. In the second form, ttest tests that varname has the same mean within the two groups defined by groupvar. In the third form, ttest tests that varname1 and varname2 have the same mean, assuming unpaired data. In the fourth form, ttest tests that varname1 and varname2 have the same mean, assuming paired data. ttesti is the immediate form of ttest; see [U] 19 Immediate commands. 
For the equivalent of a two-sample t test with sampling weights (pweights), use the svy: mean command with the over() option, and then use lincom; see [R] mean and [SVY] svy postestimation. Options Main by(groupvar) specifies the groupvar that defines the two groups that ttest will use to test the hypothesis that their means are equal. Specifying by(groupvar) implies an unpaired (two sample) t test. Do not confuse the by() option with the by prefix; you can specify both. unpaired specifies that the data be treated as unpaired. The unpaired option is used when the two sets of values to be compared are in different variables. unequal specifies that the unpaired data not be assumed to have equal variances. welch specifies that the approximate degrees of freedom for the test be obtained from Welch’s formula (1947) rather than from Satterthwaite’s approximation formula (1946), which is the default when unequal is specified. Specifying welch implies unequal. level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals. Remarks and examples Remarks are presented under the following headings: One-sample t test Two-sample t test Paired t test Two-sample t test compared with one-way ANOVA Immediate form Video examples ttest — t tests (mean-comparison tests) 2443 One-sample t test Example 1 In the first form, ttest tests whether the mean of the sample is equal to a known constant under the assumption of unknown variance. Assume that we have a sample of 74 automobiles. We know each automobile’s average mileage rating and wish to test whether the overall average for the sample is 20 miles per gallon. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . ttest mpg==20 One-sample t test Variable Obs Mean mpg 74 21.2973 mean = mean(mpg) Ho: mean = 20 Ha: mean < 20 Pr(T < t) = 0.9712 Std. Err. Std. Dev. .6725511 5.785503 [95% Conf. Interval] 19.9569 22.63769 t = 1.9289 degrees of freedom = 73 Ha: mean != 20 Ha: mean > 20 Pr(|T| > |t|) = 0.0576 Pr(T > t) = 0.0288 The test indicates that the underlying mean is not 20 with a significance level of 5.8%. Two-sample t test Example 2: Two-sample t test using groups We are testing the effectiveness of a new fuel additive. We run an experiment in which 12 cars are given the fuel treatment and 12 cars are not. The results of the experiment are as follows: treated 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 mpg 20 23 21 25 18 17 18 24 20 24 23 19 24 25 21 22 23 18 17 28 24 27 21 23 2444 ttest — t tests (mean-comparison tests) The treated variable is coded as 1 if the car received the fuel treatment and 0 otherwise. We can test the equality of means of the treated and untreated group by typing . use http://www.stata-press.com/data/r13/fuel3 . ttest mpg, by(treated) Two-sample t test with equal variances Group Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] 0 1 12 12 21 22.75 .7881701 .9384465 2.730301 3.250874 19.26525 20.68449 22.73475 24.81551 combined 24 21.875 .6264476 3.068954 20.57909 23.17091 -1.75 1.225518 -4.291568 .7915684 diff diff = mean(0) - mean(1) t = -1.4280 Ho: diff = 0 degrees of freedom = 22 Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 Pr(T < t) = 0.0837 Pr(|T| > |t|) = 0.1673 Pr(T > t) = 0.9163 We do not find a statistically significant difference in the means. If we were not willing to assume that the variances were equal and wanted to use Welch’s formula, we could type . 
ttest mpg, by(treated) welch Two-sample t test with unequal variances Group Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] 0 1 12 12 21 22.75 .7881701 .9384465 2.730301 3.250874 19.26525 20.68449 22.73475 24.81551 combined 24 21.875 .6264476 3.068954 20.57909 23.17091 -1.75 1.225518 -4.28369 .7836902 diff diff = mean(0) - mean(1) Ho: diff = 0 Ha: diff < 0 Pr(T < t) = 0.0833 t = Welch’s degrees of freedom = Ha: diff != 0 Pr(|T| > |t|) = 0.1666 -1.4280 23.2465 Ha: diff > 0 Pr(T > t) = 0.9167 Technical note In two-sample using groups randomized designs, subjects will sometimes refuse the assigned treatment but still be measured for an outcome. In this case, take care to specify the group properly. You might be tempted to let varname contain missing where the subject refused and thus let ttest drop such observations from the analysis. Zelen (1979) argues that it would be better to specify that the subject belongs to the group in which he or she was randomized, even though such inclusion will dilute the measured effect. ttest — t tests (mean-comparison tests) 2445 Example 3: Two-sample t test using variables There is a second, inferior way to organize the data in the preceding example. We ran a test on 24 cars, 12 without the additive and 12 with. We now create two new variables, mpg1 and mpg2. mpg1 20 23 21 25 18 17 18 24 20 24 23 19 mpg2 24 25 21 22 23 18 17 28 24 27 21 23 This method is inferior because it suggests a connection that is not there. There is no link between the car with 20 mpg and the car with 24 mpg in the first row of the data. Each column of data could be arranged in any order. Nevertheless, if our data are organized like this, ttest can accommodate us. . use http://www.stata-press.com/data/r13/fuel . ttest mpg1==mpg2, unpaired Two-sample t test with equal variances Variable Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] mpg1 mpg2 12 12 21 22.75 .7881701 .9384465 2.730301 3.250874 19.26525 20.68449 22.73475 24.81551 combined 24 21.875 .6264476 3.068954 20.57909 23.17091 -1.75 1.225518 -4.291568 .7915684 diff diff = mean(mpg1) - mean(mpg2) t = -1.4280 Ho: diff = 0 degrees of freedom = 22 Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 Pr(T < t) = 0.0837 Pr(|T| > |t|) = 0.1673 Pr(T > t) = 0.9163 Paired t test Example 4 Suppose that the preceding data were actually collected by running a test on 12 cars. Each car was run once with the fuel additive and once without. Our data are stored in the same manner as in example 3, but this time, there is most certainly a connection between the mpg values that appear in the same row. These come from the same car. The variables mpg1 and mpg2 represent mileage without and with the treatment, respectively. 2446 ttest — t tests (mean-comparison tests) . use http://www.stata-press.com/data/r13/fuel . ttest mpg1==mpg2 Paired t test Variable Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] mpg1 mpg2 12 12 21 22.75 .7881701 .9384465 2.730301 3.250874 19.26525 20.68449 22.73475 24.81551 diff 12 -1.75 .7797144 2.70101 -3.46614 -.0338602 mean(diff) = mean(mpg1 - mpg2) t = -2.2444 Ho: mean(diff) = 0 degrees of freedom = 11 Ha: mean(diff) < 0 Ha: mean(diff) != 0 Ha: mean(diff) > 0 Pr(T < t) = 0.0232 Pr(|T| > |t|) = 0.0463 Pr(T > t) = 0.9768 We find that the means are statistically different from each other at any level greater than 4.6%. 
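The paired test in example 4 is algebraically identical to a one-sample test that the within-car differences have mean zero. Computing the differences directly (the variable name diff is arbitrary) gives the same t statistic, degrees of freedom, and p-values:
. generate diff = mpg1 - mpg2
. ttest diff == 0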
Two-sample t test compared with one-way ANOVA Example 5 In example 2, we saw that ttest can be used to test the equality of a pair of means; see [R] oneway for an extension that allows testing the equality of more than two means. Suppose that we have data on the 50 states. The dataset contains the median age of the population (medage) and the region of the country (region) for each state. Region 1 refers to the Northeast, region 2 to the North Central, region 3 to the South, and region 4 to the West. Using oneway, we can test the equality of all four means. . use http://www.stata-press.com/data/r13/census (1980 Census data by state) . oneway medage region Analysis of Variance Source SS df MS Between groups Within groups 46.3961903 94.1237947 3 46 F 15.4653968 2.04616945 Prob > F 7.56 Total 140.519985 49 2.8677548 Bartlett’s test for equal variances: chi2(3) = 10.5757 0.0003 Prob>chi2 = 0.014 We find that the means are different, but we are interested only in testing whether the means for the Northeast (region==1) and West (region==4) are different. We could use oneway: . oneway medage region if region==1 | region==4 Analysis of Variance Source SS df MS Between groups Within groups 46.241247 46.1969169 1 20 46.241247 2.30984584 F 20.02 Total 92.4381638 21 4.40181733 Bartlett’s test for equal variances: chi2(1) = 2.4679 Prob > F 0.0002 Prob>chi2 = 0.116 ttest — t tests (mean-comparison tests) 2447 We could also use ttest: . ttest medage if region==1 | region==4, by(region) Two-sample t test with equal variances Group Obs Mean NE West 9 13 combined 22 diff Std. Err. Std. Dev. [95% Conf. Interval] 31.23333 28.28462 .3411581 .4923577 1.023474 1.775221 30.44662 27.21186 32.02005 29.35737 29.49091 .4473059 2.098051 28.56069 30.42113 2.948718 .6590372 1.57399 4.323445 diff = mean(NE) - mean(West) Ho: diff = 0 Ha: diff < 0 Pr(T < t) = 0.9999 t = degrees of freedom = Ha: diff != 0 Pr(|T| > |t|) = 0.0002 4.4743 20 Ha: diff > 0 Pr(T > t) = 0.0001 The significance levels of both tests are the same. Immediate form Example 6 ttesti is like ttest, except that we specify summary statistics rather than variables as arguments. For instance, we are reading an article that reports the mean number of sunspots per month as 62.6 with a standard deviation of 15.8. There are 24 months of data. We wish to test whether the mean is 75: . ttesti 24 62.6 15.8 75 One-sample t test x Obs Mean Std. Err. 24 62.6 3.225161 mean = mean(x) Ho: mean = 75 Ha: mean < 75 Pr(T < t) = 0.0004 Std. Dev. 15.8 [95% Conf. Interval] 55.92825 t = degrees of freedom = Ha: mean != 75 Pr(|T| > |t|) = 0.0008 69.27175 -3.8448 23 Ha: mean > 75 Pr(T > t) = 0.9996 Example 7 There is no immediate form of ttest with paired data because the test is also a function of the covariance, a number unlikely to be reported in any published source. For nonpaired data, however, we might type 2448 ttest — t tests (mean-comparison tests) . ttesti 20 20 5 32 15 4 Two-sample t test with equal variances Obs Mean x y 20 32 20 15 1.118034 .7071068 5 4 17.65993 13.55785 22.34007 16.44215 combined 52 16.92308 .6943785 5.007235 15.52905 18.3171 5 1.256135 2.476979 7.523021 diff Std. Err. Std. Dev. [95% Conf. Interval] diff = mean(x) - mean(y) t = 3.9805 Ho: diff = 0 degrees of freedom = 50 Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 Pr(T < t) = 0.9999 Pr(|T| > |t|) = 0.0002 Pr(T > t) = 0.0001 If we had typed ttesti 20 20 5 32 15 4, unequal, the test would have assumed unequal variances. 
Video examples One-sample t test in Stata t test for two independent samples in Stata t test for two paired samples in Stata Immediate commands in Stata: One-sample t test from summary data Immediate commands in Stata: Two-sample t test from summary data Stored results ttest and ttesti store the following in r(): Scalars r(N 1) r(N 2) r(p l) r(p u) r(p) r(se) r(t) sample size n1 sample size n2 lower one-sided p-value upper one-sided p-value two-sided p-value estimate of standard error t statistic r(sd 1) r(sd 2) r(sd) r(mu 1) r(mu 2) r(df t) r(level) standard deviation for first variable standard deviation for second variable combined standard deviation x̄1 mean for population 1 x̄2 mean for population 2 degrees of freedom confidence level Methods and formulas See, for instance, Hoel (1984, 140–161) or Dixon and Massey (1983, 121–130) for an introduction and explanation of the calculation of these tests. Acock (2014, 162–173) and Hamilton (2013, 145–150) describe t tests using applications in Stata. The test for µ = µ0 for unknown σ is given by t= √ (x − µ0 ) n s The statistic is distributed as Student’s t with n− 1 degrees of freedom (Gosset [Student, pseud.] 1908). ttest — t tests (mean-comparison tests) 2449 The test for µx = µy when σx and σy are unknown but σx = σy is given by t= x−y 1/2 (nx −1)s2x +(ny −1)s2y nx +ny −2 1 nx + 1 ny 1/2 The result is distributed as Student’s t with nx + ny − 2 degrees of freedom. You could perform ttest (without the unequal option) in a regression setting given that regression assumes a homoskedastic error model. To compare with the ttest command, denote the underlying observations on x and y by xj , j = 1, . . . , nx , and yj , j = 1, . . . , ny . In a regression framework, typing ttest without the unequal option is equivalent to 1. creating a new variable zj that represents the stacked observations on x and y (so that zj = xj for j = 1, . . . , nx and znx +j = yj for j = 1, . . . , ny ) 2. and then estimating the equation zj = β0 + β1 dj + j , where dj = 0 for j = 1, . . . , nx and dj = 1 for j = nx + 1, . . . , nx + ny (that is, dj = 0 when the z observations represent x, and dj = 1 when the z observations represent y ). The estimated value of β1 , b1 , will equal y − x, and the reported t statistic will be the same t statistic as given by the formula above. The test for µx = µy when σx and σy are unknown and σx 6= σy is given by t= x−y s2x /nx + s2y /ny 1/2 The result is distributed as Student’s t with ν degrees of freedom, where ν is given by (with Satterthwaite’s [1946] formula) 2 s2x /nx + s2y /ny 2 2 s2x /nx nx −1 + s2y /ny ny −1 With Welch’s formula (1947), the number of degrees of freedom is given by 2 s2x /nx + s2y /ny −2 + 2 2 s2x /nx nx +1 + s2y /ny ny +1 The test for µx = µy for matched observations (also known as paired observations, correlated pairs, or permanent components) is given by t= √ d n sd where d represents the mean of xi − yi and sd represents the standard deviation. The test statistic t is distributed as Student’s t with n − 1 degrees of freedom. 2450 ttest — t tests (mean-comparison tests) You can also use ttest without the unpaired option in a regression setting because a paired comparison includes the assumption of constant variance. The ttest with an unequal variance assumption does not lend itself to an easy representation in regression settings and is not discussed here. (xj − yj ) = β0 + j . William Sealy Gosset (1876–1937) was born in Canterbury, England. 
He studied chemistry and mathematics at Oxford and worked as a chemist with the brewers Guinness in Dublin. Gosset became interested in statistical problems, which he discussed with Karl Pearson and later with Fisher and Neyman. He published several important papers under the pseudonym “Student”, and he lent that name to the t test he invented. References Acock, A. C. 2014. A Gentle Introduction to Stata. 4th ed. College Station, TX: Stata Press. Boland, P. J. 2000. William Sealy Gosset—alias ‘Student’ 1876–1937. In Creators of Mathematics: The Irish Connection, ed. K. Houston, 105–112. Dublin: University College Dublin Press. Dixon, W. J., and F. J. Massey, Jr. 1983. Introduction to Statistical Analysis. 4th ed. New York: McGraw–Hill. Gleason, J. R. 1999. sg101: Pairwise comparisons of means, including the Tukey wsd method. Stata Technical Bulletin 47: 31–37. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 225–233. College Station, TX: Stata Press. Gosset, W. S. 1943. “Student’s” Collected Papers. London: Biometrika Office, University College. Gosset [Student, pseud.], W. S. 1908. The probable error of a mean. Biometrika 6: 1–25. Hamilton, L. C. 2013. Statistics with Stata: Updated for Version 12. 8th ed. Boston: Brooks/Cole. Hoel, P. G. 1984. Introduction to Mathematical Statistics. 5th ed. New York: Wiley. Pearson, E. S., R. L. Plackett, and G. A. Barnard. 1990. ‘Student’: A Statistical Biography of William Sealy Gosset. Oxford: Oxford University Press. Preece, D. A. 1982. t is for trouble (and textbooks): A critique of some examples of the paired-samples t-test. Statistician 31: 169–195. Satterthwaite, F. E. 1946. An approximate distribution of estimates of variance components. Biometrics Bulletin 2: 110–114. Senn, S. J., and W. Richardson. 1994. The first t-test. Statistics in Medicine 13: 785–803. Welch, B. L. 1947. The generalization of ‘student’s’ problem when several different population variances are involved. Biometrika 34: 28–35. Zelen, M. 1979. A new design for randomized clinical trials. New England Journal of Medicine 300: 1242–1245. Also see [R] bitest — Binomial probability test [R] ci — Confidence intervals for means, proportions, and counts [R] esize — Effect size based on mean comparison [R] mean — Estimate means [R] oneway — One-way analysis of variance [R] prtest — Tests of proportions [R] sdtest — Variance-comparison tests [MV] hotelling — Hotelling’s T-squared generalized means test Title update — Check for official updates Syntax Remarks and examples Menu Stored results Description Also see Options Syntax Report on update level of currently installed Stata update Set update source update from location Compare update level of currently installed Stata with that of source update query , from(location) Perform update if necessary update all , from(location) detail force exit Set automatic updates (Mac and Windows only) set update query on | off set update interval # set update prompt on | off Menu Help > Check for Updates Description The update command reports on the current update level and installs official updates to Stata. Official updates are updates to Stata as it was originally shipped from StataCorp, not the additions to Stata published in, for instance, the Stata Journal (SJ). Those additions are installed using the net command and updated using the adoupdate command; see [R] net and [R] adoupdate. update without arguments reports on the update level of the currently installed Stata. 
update from sets an update source, where location is a directory name or URL. If you are on the Internet, type update from http://www.stata.com. update query compares the update level of the currently installed Stata with that available from the update source and displays a report. 2451 2452 update — Check for official updates update all updates all necessary files. This is what you should type to check for and install updates. set update query determines if update query is to be automatically performed when Stata is launched. Only Mac and Windows platforms can be set for automatic updating. set update interval # sets the number of days to elapse before performing the next automatic update query. The default # is 7. The interval starts from the last time an update query was performed (automatically or manually). Only Mac and Windows platforms can be set for automatic updating. set update prompt determines whether a dialog is to be displayed before performing an automatic update query. The dialog allows you to perform an update query now, perform one the next time Stata is launched, perform one after the next interval has passed, or disable automatic update query. Only Mac and Windows platforms can be set for automatic updating. Options from(location) specifies the location of the update source. You can specify the from() option on the individual update commands or use the update from command. Which you do makes no difference. You typically do not need to use this option. detail specifies to display verbose output during the update process. force specifies to force downloading of all files even if, based on the date comparison, Stata does not think it is necessary. There is seldom a reason to specify this option. exit instructs Stata to exit when the update has successfully completed. There is seldom a reason to specify this option. Remarks and examples update updates the official components of Stata from the official source: http://www.stata.com. If you are connected to the Internet, the easy thing to do is to type . update all and follow the instructions. If Stata is up to date, update all will do nothing. Otherwise, it will download whatever is necessary and install the files. If you just want to know what updates are available, type . update query update query will check if any updates are available and report that information. If updates are available, it will recommend that you type update all. If you want to report the current update level, type . update update will report the update level of the Stata installation. update will also show you the date that updates were last checked and if any updates were available at that time. 
update — Check for official updates 2453 Stored results update without a subcommand, update from, and update query store the following in r(): Scalars r(inst exe) r(avbl exe) r(inst ado) r(avbl ado) r(inst utilities) r(avbl utilities) r(inst docs) r(avbl docs) Macros r(name exe) r(dir exe) r(dir ado) r(dir utilities) r(dir docs) date date date date date date date date of of of of of of of of executable installed (*) executable available over web (*) (**) ado-files installed (*) ado-files available over web (*) (**) utilities installed (*) utilities available over web (*) (**) documentation installed (*) documentation available over web (*) (**) name of the Stata executable directory in which executable is stored directory in which ado-files are stored directory in which utilities are stored directory in which PDF documentation is stored Notes: * Dates are stored as integers counting the number of days since January 1, 1960; see [D] datetime. ** These dates are not stored by update without a subcommand because update by itself reports information solely about the local computer and does not check what is available on the web. Also see [R] adoupdate — Update user-written ado-files [R] net — Install and manage user-written additions from the Internet [R] ssc — Install and uninstall packages from SSC [P] sysdir — Query and set system directories [U] 28 Using the Internet to keep up to date [GSM] 19 Updating and extending Stata—Internet functionality [GSU] 19 Updating and extending Stata—Internet functionality [GSW] 19 Updating and extending Stata—Internet functionality Title vce option — Variance estimators Syntax Methods and formulas Description Also see Options Remarks and examples Syntax estimation cmd . . . , vce(vcetype) . . . vcetype Description Likelihood based observed information matrix (OIM) outer product of the gradient (OPG) vectors oim opg Sandwich estimators robust cluster clustvar Huber/White/sandwich estimator clustered sandwich estimator Replication based bootstrap , bootstrap options jackknife , jackknife options bootstrap estimation jackknife estimation Description This entry describes the vce() option, which is common to most estimation commands. vce() specifies how to estimate the variance–covariance matrix (VCE) corresponding to the parameter estimates. The standard errors reported in the table of parameter estimates are the square root of the variances (diagonal elements) of the VCE. Options SE/Robust vce(oim) is usually the default for models fit using maximum likelihood. vce(oim) uses the observed information matrix (OIM); see [R] ml. vce(opg) uses the sum of the outer product of the gradient (OPG) vectors; see [R] ml. This is the default VCE when the technique(bhhh) option is specified; see [R] maximize. vce(robust) uses the robust or sandwich estimator of variance. This estimator is robust to some types of misspecification so long as the observations are independent; see [U] 20.21 Obtaining robust variance estimates. 2454 vce option — Variance estimators 2455 If the command allows pweights and you specify them, vce(robust) is implied; see [U] 20.23.3 Sampling weights. vce(cluster clustvar) specifies that the standard errors allow for intragroup correlation, relaxing the usual requirement that the observations be independent. That is, the observations are independent across groups (clusters) but not necessarily within groups. 
clustvar specifies to which group each observation belongs, for example, vce(cluster personid) in data with repeated observations on individuals. vce(cluster clustvar) affects the standard errors and variance–covariance matrix of the estimators but not the estimated coefficients; see [U] 20.21 Obtaining robust variance estimates. vce(bootstrap , bootstrap options ) uses a bootstrap; see [R] bootstrap. After estimation with vce(bootstrap), see [R] bootstrap postestimation to obtain percentile-based or bias-corrected confidence intervals. vce(jackknife , jackknife options ) uses the delete-one jackknife; see [R] jackknife. Remarks and examples Remarks are presented under the following headings: Prefix commands Passing options in vce() Prefix commands Specifying vce(bootstrap) or vce(jackknife) is often equivalent to using the corresponding prefix command. Here is an example using jackknife with regress. . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress mpg turn trunk, vce(jackknife) (running regress on estimation sample) Jackknife replications (74) 1 2 3 4 5 .................................................. ........................ Linear regression 50 Number of obs Replications F( 2, 73) Prob > F R-squared Adj R-squared Root MSE mpg Coef. turn trunk _cons -.7610113 -.3161825 55.82001 Jackknife Std. Err. .150726 .1282326 5.031107 t -5.05 -2.47 11.09 P>|t| 0.000 0.016 0.000 = = = = = = = 74 74 66.26 0.0000 0.5521 0.5395 3.9260 [95% Conf. Interval] -1.061408 -.5717498 45.79303 -.4606147 -.0606152 65.84699 2456 vce option — Variance estimators . jackknife: regress mpg turn trunk (running regress on estimation sample) Jackknife replications (74) 1 2 3 4 5 .................................................. ........................ Linear regression 50 Number of obs Replications F( 2, 73) Prob > F R-squared Adj R-squared Root MSE mpg Coef. turn trunk _cons -.7610113 -.3161825 55.82001 Jackknife Std. Err. .150726 .1282326 5.031107 t -5.05 -2.47 11.09 P>|t| 0.000 0.016 0.000 = = = = = = = 74 74 66.26 0.0000 0.5521 0.5395 3.9260 [95% Conf. Interval] -1.061408 -.5717498 45.79303 -.4606147 -.0606152 65.84699 Here it does not matter whether we specify the vce(jackknife) option or instead use the jackknife prefix. However, vce(jackknife) should be used in place of the jackknife prefix whenever available because they are not always equivalent. For example, to use the jackknife prefix with clogit properly, you must tell jackknife to omit whole groups rather than individual observations. Specifying vce(jackknife) does this automatically. . use http://www.stata-press.com/data/r13/clogitid . jackknife, cluster(id): clogit y x1 x2, group(id) (output omitted ) This extra information is automatically communicated to jackknife by clogit when the vce() option is specified. . clogit y x1 x2, group(id) vce(jackknife) (running clogit on estimation sample) Jackknife replications (66) 1 2 3 4 5 .................................................. 50 ................ Conditional (fixed-effects) logistic regression Number of obs = 369 Replications = 66 F( 2, 65) = 4.58 Prob > F = 0.0137 Log likelihood = -123.41386 Pseudo R2 = 0.0355 (Replications based on 66 clusters in id) y Coef. x1 x2 .653363 .0659169 Jackknife Std. Err. t P>|t| .3010608 .0487858 2.17 1.35 0.034 0.181 [95% Conf. 
Interval] .052103 -.0315151 1.254623 .1633489 vce option — Variance estimators 2457 Passing options in vce() If you wish to specify more options to the bootstrap or jackknife estimation, you can include them within the vce() option. Below we request 300 bootstrap replications and save the replications in bsreg.dta: . use http://www.stata-press.com/data/r13/auto (1978 Automobile Data) . regress mpg turn trunk, vce(bootstrap, nodots seed(123) rep(300) saving(bsreg)) Linear regression Number of obs Replications Wald chi2(2) Prob > chi2 R-squared Adj R-squared Root MSE mpg Observed Coef. Bootstrap Std. Err. turn trunk _cons -.7610113 -.3161825 55.82001 .1361786 .1145728 4.69971 z -5.59 -2.76 11.88 . bstat using bsreg Bootstrap results command: turn trunk _cons P>|z| 0.000 0.006 0.000 = = = = = = = 74 300 127.28 0.0000 0.5521 0.5395 3.9260 Normal-based [95% Conf. Interval] -1.027916 -.540741 46.60875 Number of obs Replications -.4941062 -.0916239 65.03127 = = 74 300 regress mpg turn trunk Observed Coef. Bootstrap Std. Err. -.7610113 -.3161825 55.82001 .1361786 .1145728 4.69971 z -5.59 -2.76 11.88 P>|z| 0.000 0.006 0.000 Normal-based [95% Conf. Interval] -1.027916 -.540741 46.60875 -.4941062 -.0916239 65.03127 Methods and formulas By default, Stata’s maximum likelihood estimators display standard errors based on variance estimates given by the inverse of the negative Hessian (second derivative) matrix. If vce(robust), vce(cluster clustvar), or pweights is specified, standard errors are based on the robust variance estimator (see [U] 20.21 Obtaining robust variance estimates); likelihood-ratio tests are not appropriate here (see [SVY] survey), and the model χ2 is from a Wald test. If vce(opg) is specified, the standard errors are based on the outer product of the gradients; this option has no effect on likelihood-ratio tests, though it does affect Wald tests. If vce(bootstrap) or vce(jackknife) is specified, the standard errors are based on the chosen replication method; here the model χ2 or F statistic is from a Wald test using the respective replicationbased covariance matrix. The t distribution is used in the coefficient table when the vce(jackknife) option is specified. vce(bootstrap) and vce(jackknife) are also available with some commands that are not maximum likelihood estimators. 2458 vce option — Variance estimators Also see [R] bootstrap — Bootstrap sampling and estimation [R] jackknife — Jackknife estimation [XT] vce options — Variance estimators [U] 20 Estimation and postestimation commands Title view — View files and logs Syntax Remarks and examples Menu Also see Description Syntax Display file in Viewer view file " filename " , asis adopath Bring up browser pointed to specified URL view browse " url " Display help results in Viewer view help topic or command name Display search results in Viewer view search keywords Display news results in Viewer view news Display net results in Viewer view net netcmd Display ado-results in Viewer view ado adocmd Display update results in Viewer view update updatecmd Menu File > View... 2459 Options 2460 view — View files and logs Description view displays file contents in the Viewer. view file displays the specified file. file is optional, so if you had a SMCL session log created by typing log using mylog, you could view it by typing view mylog.smcl. view file can properly display .smcl files (logs and the like), .sthlp files, and text files. 
view file’s asis option specifies that the file be displayed as plain text, regardless of the filename’s extension. view browse opens your browser pointed to url. Typing view browse http://www.stata.com would bring up your browser pointed to the http://www.stata.com website. view help displays the specified topic in the Viewer. For example, to review the help for Stata’s print command, you could type help print. See [R] help for more details. view search displays the results of the search command in the Viewer. For instance, to search the system help for information on robust regression, you could type search robust regression. See [R] search for more details. view news does the same as the news command—see [R] news —but displays the results in the Viewer. (news displays the latest news from http://www.stata.com.) view net does the same as the net command—see [R] net —but displays the result in the Viewer. For instance, typing view net search hausman test would search the Internet for additions to Stata related to the Hausman test. Typing view net from http://www.stata.com would go to the Stata additions download site at http://www.stata.com. view ado does the same as the ado command—see [R] net —but displays the result in the Viewer. For instance, typing view ado dir would show a list of files you have installed. view update does the same as the update command—see [R] update —but displays the result in the Viewer. Typing view update would show the dates of what you have installed, and from there you could click to compare those dates with the latest updates available. Typing view update query would skip the first step and show the comparison. Options asis, allowed with view file, specifies that the file be displayed as text, regardless of the filename’s extension. view file’s default action is to display files ending in .smcl and .sthlp as SMCL; see [P] smcl. adopath, allowed with view file, specifies that Stata search the S ADO path for filename and display it, if found. Remarks and examples Most users access the Viewer by selecting File > View... and proceeding from there. Some commands allow you to skip that step. Some common interactive uses of commands that display their results in the Viewer are the following: view — View files and logs . . . . . . . . . . view mysession.smcl view mysession.log help print help regress view news view browse http://www.stata.com search hausman test view net view ado view update query Also see [R] help — Display help in Stata [R] net — Install and manage user-written additions from the Internet [R] news — Report Stata news [R] search — Search Stata documentation and other resources [R] update — Check for official updates [D] type — Display contents of a file [GSM] 3 Using the Viewer [GSU] 3 Using the Viewer [GSW] 3 Using the Viewer 2461 Title vwls — Variance-weighted least squares Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax vwls depvar indepvars if in weight , options Description options Model noconstant sd(varname) suppress constant term variable containing estimate of conditional standard deviation Reporting level(#) display options set confidence level; default is level(95) control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling coeflegend display legend instead of statistics indepvars may contain factor variables; see [U] 11.4.3 Factor variables. 
bootstrap, by, jackknife, rolling, and statsby are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. fweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. Menu Statistics > Linear models and related > Other > Variance-weighted least squares Description vwls estimates a linear regression using variance-weighted least squares. It differs from ordinary least-squares (OLS) regression in that it does not assume homogeneity of variance, but requires that the conditional variance of depvar be estimated prior to the regression. The estimated variance need not be constant across observations. vwls treats the estimated variance as if it were the true variance when it computes the standard errors of the coefficients. You must supply an estimate of the conditional standard deviation of depvar to vwls by using the sd(varname) option, or you must have grouped data with the groups defined by the indepvars variables. In the latter case, vwls treats all indepvars as categorical variables, computes the mean and standard deviation of depvar separately for each subgroup, and computes the regression of the subgroup means on indepvars. 2462 vwls — Variance-weighted least squares 2463 regress with analytic weights can be used to produce another kind of “variance-weighted least squares”; see Remarks and examples for an explanation of the difference. Options Model noconstant; see [R] estimation options. sd(varname) is an estimate of the conditional standard deviation of depvar (that is, it can vary observation by observation). All values of varname must be > 0. If you specify sd(), you cannot use fweights. If sd() is not given, the data will be grouped by indepvars. Here indepvars are treated as categorical variables, and the means and standard deviations of depvar for each subgroup are calculated and used for the regression. Any subgroup for which the standard deviation is zero is dropped. Reporting level(#); see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. The following option is available with vwls but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples The vwls command is intended for use with two special — and different — types of data. The first contains data that consist of measurements from physical science experiments in which all error is due solely to measurement errors and the sizes of the measurement errors are known. You can also use variance-weighted least-squares linear regression for certain problems in categorical data analysis, such as when all the independent variables are categorical and the outcome variable is either continuous or a quantity that can sensibly be averaged. If each of the subgroups defined by the categorical variables contains a reasonable number of subjects, then the variance of the outcome variable can be estimated independently within each subgroup. For the purposes of estimation, vwls treats each subgroup as one observation, with the dependent variable being the subgroup mean of the outcome variable. The vwls command fits the model yi = xi β + εi where the errors εi are independent normal random variables with the distribution εi ∼ N (0, νi ). 
The independent variables xi are assumed to be known without error. As described above, vwls assumes that you already have estimates s2i for the variances νi . The error variance is not estimated in the regression. The estimates s2i are used to compute the standard errors of the coefficients; see Methods and formulas below. In contrast, weighted OLS regression assumes that the errors have the distribution εi ∼ N (0, σ 2 /wi ), where the wi are known weights and σ 2 is an unknown parameter that is estimated in the regression. This is the difference from variance-weighted least squares: in weighted OLS, the magnitude of the error variance is estimated in the regression using all the data. 2464 vwls — Variance-weighted least squares Example 1 An artificial, but informative, example illustrates the difference between variance-weighted least squares and weighted OLS. We measure the quantities xi and yi and estimate that the standard deviation of yi is si . We enter the data into Stata: . use http://www.stata-press.com/data/r13/vwlsxmpl . list x y s 1. 2. 3. 4. 5. 1 2 3 4 5 1.2 1.9 3.2 4.3 4.9 .5 .5 1 1 1 6. 7. 8. 6 7 8 6.0 7.2 7.9 2 2 2 Because we want observations with smaller variance to carry larger weight in the regression, we compute an OLS regression with analytic weights proportional to the inverse of the squared standard deviations: . regress y x [aweight=s^(-2)] (sum of wgt is 1.1750e+01) SS df Source MS Model Residual 22.6310183 .193355117 1 6 22.6310183 .032225853 Total 22.8243734 7 3.26062477 y Coef. x _cons .9824683 .1138554 Std. Err. .0370739 .1120078 t 26.50 1.02 Number of obs F( 1, 6) Prob > F R-squared Adj R-squared Root MSE P>|t| 0.000 0.349 = = = = = = 8 702.26 0.0000 0.9915 0.9901 .17952 [95% Conf. Interval] .8917517 -.1602179 1.073185 .3879288 If we compute a variance-weighted least-squares regression by using vwls, we get the same results for the coefficient estimates but very different standard errors: . vwls y x, sd(s) Variance-weighted least-squares regression Goodness-of-fit chi2(6) = 0.28 Prob > chi2 = 0.9996 y Coef. x _cons .9824683 .1138554 Number of obs Model chi2(1) Prob > chi2 Std. Err. z P>|z| .170409 .51484 5.77 0.22 0.000 0.825 = = = 8 33.24 0.0000 [95% Conf. Interval] .6484728 -.8952124 1.316464 1.122923 vwls — Variance-weighted least squares 2465 Although the values of yi were nicely linear with xi , the vwls regression used the large estimates for the standard deviations to compute large standard errors for the coefficients. For weighted OLS regression, however, the scale of the analytic weights has no effect on the standard errors of the coefficients—only the relative proportions of the analytic weights affect the regression. If we are sure of the sizes of our error estimates for yi , using vwls is valid. However, if we can estimate only the relative proportions of error among the yi , then vwls is not appropriate. Example 2 Let’s now consider an example of the use of vwls with categorical data. Suppose that we have blood pressure data for n = 400 subjects, categorized by gender and race (black or white). Here is a description of the data: . use http://www.stata-press.com/data/r13/bp . table gender race, c(mean bp sd bp freq) row col format(%8.1f) Gender White Race Black Total Female 117.1 10.3 100.0 118.5 11.6 100.0 117.8 10.9 200.0 Male 122.1 10.6 100.0 125.8 15.5 100.0 124.0 13.3 200.0 Total 119.6 10.7 200.0 122.2 14.1 200.0 120.9 12.6 400.0 Performing a variance-weighted regression using vwls gives . 
vwls bp gender race Variance-weighted least-squares regression Goodness-of-fit chi2(1) = 0.88 Prob > chi2 = 0.3486 bp Coef. gender race _cons 5.876522 2.372818 116.6486 Std. Err. 1.170241 1.191683 .9296297 Number of obs Model chi2(2) Prob > chi2 z 5.02 1.99 125.48 = = = 400 27.11 0.0000 P>|z| [95% Conf. Interval] 0.000 0.046 0.000 3.582892 .0371631 114.8266 8.170151 4.708473 118.4707 2466 vwls — Variance-weighted least squares By comparison, an OLS regression gives the following result: . regress bp gender race Source SS df MS Model Residual 4485.66639 58442.7305 2 397 2242.83319 147.210908 Total 62928.3969 399 157.71528 bp Coef. gender race _cons 6.1775 2.5875 116.4862 Std. Err. 1.213305 1.213305 1.050753 t 5.09 2.13 110.86 Number of obs F( 2, 397) Prob > F R-squared Adj R-squared Root MSE = = = = = = 400 15.24 0.0000 0.0713 0.0666 12.133 P>|t| [95% Conf. Interval] 0.000 0.034 0.000 3.792194 .2021938 114.4205 8.562806 4.972806 118.552 Note the larger value for the race coefficient (and smaller p-value) in the OLS regression. The assumption of homogeneity of variance in OLS means that the mean for black men pulls the regression line higher than in the vwls regression, which takes into account the larger variance for black men and reduces its effect on the regression. Stored results vwls stores the following in e(): Scalars e(N) e(df m) e(chi2) e(df gf) e(chi2 gf) e(rank) number of observations model degrees of freedom model χ2 goodness-of-fit degrees of freedom goodness-of-fit χ2 rank of e(V) Macros e(cmd) e(cmdline) e(depvar) e(properties) e(predict) e(asbalanced) e(asobserved) vwls command as typed name of dependent variable b V program used to implement predict factor variables fvset as asbalanced factor variables fvset as asobserved Matrices e(b) e(V) coefficient vector variance–covariance matrix of the estimators Functions e(sample) marks estimation sample Methods and formulas Let y = (y1 , y2 , . . . , yn )0 be the vector of observations of the dependent variable, where n is the number of observations. When sd() is specified, let s1 , s2 , . . . , sn be the standard deviations supplied by sd(). For categorical data, when sd() is not given, the means and standard deviations of y for each subgroup are computed, and n becomes the number of subgroups, y is the vector of subgroup means, and si are the standard deviations for the subgroups. vwls — Variance-weighted least squares 2467 Let V = diag(s21 , s22 , . . . , s2n ) denote the estimate of the variance of y. Then the estimated regression coefficients are b = (X0 V−1 X)−1 X0 V−1 y and their estimated covariance matrix is d Cov(b) = (X0 V−1 X)−1 A statistic for the goodness of fit of the model is Q = (y − Xb)0 V−1 (y − Xb) where Q has a χ2 distribution with n − k degrees of freedom (k is the number of independent variables plus the constant, if any). References Gini, R., and J. Pasquini. 2006. Automatic generation of documents. Stata Journal 6: 22–39. Grizzle, J. E., C. F. Starmer, and G. G. Koch. 1969. Analysis of categorical data by linear models. Biometrics 25: 489–504. Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. 2007. Numerical Recipes: The Art of Scientific Computing. 3rd ed. New York: Cambridge University Press. 
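The formulas in Methods and formulas above are easy to check directly in Mata. The following sketch is not part of vwls; it simply recomputes the coefficients, their standard errors, and the goodness-of-fit statistic for the data of example 1, assuming the variable s holds the conditional standard deviations supplied to sd():
. use http://www.stata-press.com/data/r13/vwlsxmpl, clear
. mata:
: y    = st_data(., "y")
: X    = (st_data(., "x"), J(rows(y), 1, 1))
: Vinv = diag(1 :/ st_data(., "s"):^2)
: b    = invsym(X' * Vinv * X) * X' * Vinv * y
: se   = sqrt(diagonal(invsym(X' * Vinv * X)))
: Q    = (y - X*b)' * Vinv * (y - X*b)
: b, se
: Q
: end
The displayed coefficients, standard errors, and Q should reproduce the coefficient table and the goodness-of-fit χ2 reported by vwls y x, sd(s) in example 1.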
Also see [R] vwls postestimation — Postestimation tools for vwls [R] regress — Linear regression [U] 11.1.6 weight [U] 20 Estimation and postestimation commands Title vwls postestimation — Postestimation tools for vwls Description Syntax for predict Menu for predict Options for predict Also see Description The following postestimation commands are available after vwls: Command Description contrast estat summarize estat vce estimates forecast lincom contrasts and ANOVA-style joint tests of estimates summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients link test for model specification marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses linktest margins marginsplot nlcom predict predictnl pwcompare test testnl Syntax for predict predict type newvar if in , xb stdp These statistics are available both in and out of sample; type predict only for the estimation sample. Menu for predict Statistics > Postestimation > Predictions, residuals, etc. 2468 . . . if e(sample) . . . if wanted vwls postestimation — Postestimation tools for vwls Options for predict Main xb, the default, calculates the linear prediction. stdp calculates the standard error of the linear prediction. Also see [R] vwls — Variance-weighted least squares [U] 20 Estimation and postestimation commands 2469 Title which — Display location and version for an ado-file Syntax Description Option Remarks and examples Also see Syntax which fname .ftype , all Description which looks for fname.ftype along the S ADO path. If Stata finds the file, which displays the full path and filename, along with, if the file is text, all lines in the file that begin with “*!” in the first column. If Stata cannot find the file, which issues the message “file not found along ado-path” and sets the return code to 111. ftype must be a file type for which Stata usually looks along the ado-path to find. Allowable ftypes are .ado, .class, .dlg, .idlg, .sthlp, .ihlp, .hlp, .key, .maint, .mata, .mlib, .mo, .mnu, .plugin, .scheme, .stbcal, and .style If ftype is omitted, which assumes .ado. When searching for .ado files, if Stata cannot find the file, Stata then checks to see if fname is a built-in Stata command, allowing for valid abbreviations. If it is, the message “built-in command” is displayed; if not, the message “command not found as either built-in or ado-file” is displayed and the return code is set to 111. For information about internal version control, see [P] version. Option all forces which to report the location of all files matching the fname.ftype found along the search path. The default is to report just the first one found. Remarks and examples If you write programs, you know that you make changes to the programs over time. If you are like us, you also end up with multiple versions of the program stored on your disk, perhaps in different directories. 
You may even have given copies of your programs to other Stata users, and you may not remember which version of a program you or your friends are using. The which command helps you solve this problem. Example 1 The which command displays the path for filename.ado and any lines in the code that begin with “*!”. For example, we might want information about the test command, described in [R] test, which is an ado-file written by StataCorp. Here is what happens when we type which test: . which test C:\Program Files\Stata13\ado\base\t\test.ado *! version 2.2.2 07feb2012 2470 which — Display location and version for an ado-file 2471 which displays the path for the test.ado file and also a line beginning with “*!” that indicates the version of the file. This is how we, at StataCorp, do version control — see [U] 18.11.1 Version for an explanation of our version control numbers. We do not need to be so formal. which will display anything typed after lines that begin with ‘*!’. For instance, we might write myprog.ado: . which myprog .\myprog.ado *! first written 1/03/2013 *! bug fix on 1/05/2013 (no variance case) *! updated 1/24/2013 to include noconstant option *! still suspicious if variable takes on only two values It does not matter where in the program the lines beginning with *! are — which will list them (in particular, our “still suspicious” comment was buried about 50 lines down in the code). All that is important is that the *! marker appear in the first two columns of a line. Example 2 If we type which command, where command is a built-in command rather than an ado-file, Stata responds with . which summarize built-in command: summarize If command was neither a built-in command nor an ado-file, Stata would respond with . which junk command junk not found as either built-in or ado-file r(111); Also see [P] findfile — Find file in path [P] version — Version control [U] 17 Ado-files [U] 18.11.1 Version Title xi — Interaction expansion Syntax Remarks and examples Menu Stored results Description References Options Also see Syntax xi xi , prefix(string) noomit term(s) , prefix(string) noomit : any stata command varlist with terms . . . where a term has the form i.varname i.varname1 *i.varname2 i.varname1 *varname3 i.varname1 |varname3 or I.varname I.varname1 *I.varname2 I.varname1 *varname3 I.varname1 |varname3 varname, varname1 , and varname2 denote numeric or string categorical variables. varname3 denotes a continuous, numeric variable. Menu Data > Create or change data > Other variable-creation commands > Interaction expansion Most commands in Stata now allow factor variables; see [U] 11.4.3 Factor variables. To determine if a command allows factor variables, see the information printed below the options table for the command. If the command allows factor variables, it will say something like “indepvars may contain factor variables”. We recommend that you use factor variables instead of xi if a command allows factor variables. We include [R] xi in our documentation so that readers can consult it when using a Stata command that does not allow factor variables. Description xi expands terms containing categorical variables into indicator (also called dummy) variable sets by creating new variables and, in the second syntax (xi: any stata command ), executes the specified command with the expanded terms. 
The dummy variables created are i.varname i.varname1 *i.varname2 i.varname1 *varname3 i.varname1 |varname3 creates dummies for categorical variable varname creates dummies for categorical variables varname1 and varname2 : all interactions and main effects creates dummies for categorical variable varname1 and continuous variable varname3 : all interactions and main effects creates dummies for categorical variable varname1 and continuous variable varname3 : all interactions and main effect of varname3 , but no main effect of varname1 2472 xi — Interaction expansion 2473 Options prefix(string) allows you to choose a prefix other than I for the newly created interaction variables. The prefix cannot be longer than four characters. By default, xi will create interaction variables starting with I. When you use xi, it drops all previously created interaction variables starting with the prefix specified in the prefix(string) option or with I by default. Therefore, if you want to keep the variables with a certain prefix, specify a different prefix in the prefix(string) option. noomit prevents xi from omitting groups. This option provides a way to generate an indicator variable for every category having one or more variables, which is useful when combined with the noconstant option of an estimation command. Remarks and examples Remarks are presented under the following headings: Background Indicator variables for simple effects Controlling the omitted dummy Categorical variable interactions Interactions with continuous variables Using xi: Interpreting output How xi names variables xi as a command rather than a command prefix Warnings xi provides a convenient way to include dummy or indicator variables when fitting a model (say, with regress or logistic). For instance, assume that the categorical variable agegrp contains 1 for ages 20 – 24, 2 for ages 25 – 39, 3 for ages 40 – 44, etc. Typing . xi: logistic outcome weight i.agegrp bp estimates a logistic regression of outcome on weight, dummies for each agegrp category, and bp. That is, xi searches out and expands terms starting with “i.” or “I.” but ignores the other variables. xi will expand both numeric and string categorical variables, so if you had a string variable race containing “white”, “black”, and “other”, typing . xi: logistic outcome weight bp i.agegrp i.race would include indicator variables for the race group as well. The i. indicator variables xi expands may appear anywhere in the varlist, so . xi: logistic outcome i.agegrp weight i.race bp would fit the same model. You can also create interactions of categorical variables; typing xi: logistic outcome weight bp i.agegrp*i.race fits a model with indicator variables for all agegrp and race combinations, including the agegrp and race main-effect terms (that is, the terms that are created when you just type i.agegrp i.race). You can interact dummy variables with continuous variables; typing xi: logistic outcome bp i.agegrp*weight i.race 2474 xi — Interaction expansion fits a model with indicator variables for all agegrp categories interacted with weight, plus the main-effect terms weight and i.agegrp. You can get the interaction terms without the agegrp main effect (but with the weight main effect) by typing xi: logistic outcome bp i.agegrp|weight i.race You can also include multiple interactions: xi: logistic outcome bp i.agegrp*weight i.agegrp*i.race We will now back up and describe the construction of dummy variables in more detail. 
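Before doing so, it may help to see xi's handiwork directly. The sketch below uses the automobile data shipped with Stata; the describe and regress calls are purely illustrative:
. use http://www.stata-press.com/data/r13/auto, clear
. xi i.rep78
. describe _Irep78*
. regress mpg weight _Irep78*
Because rep78 takes on the values 1 through 5, xi creates _Irep78_2 through _Irep78_5, omitting the dummy for the smallest value; the new variables may then be used with any estimation command.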
Background The terms continuous, categorical, and indicator or dummy variables are used below. Continuous variables measure something — such as height or weight — and at least conceptually can take on any real number over some range. Categorical variables, on the other hand, take on a finite number of values, each denoting membership in a subclass—for example, excellent, good, and poor, which might be coded 0, 1, 2, or 1, 2, 3, or even “Excellent”, “Good”, and “Poor”. An indicator or dummy variable — the terms are used interchangeably — is a special type of two-valued categorical variable that contains values 0, denoting false, and 1, denoting true. The information contained in any k -valued categorical variable can be equally well represented by k indicator variables. Instead of one variable recording values representing excellent, good, and poor, you can have three indicator variables, indicating the truth or falseness of “result is excellent”, “result is good”, and “result is poor”. xi provides a convenient way to convert categorical variables to dummy or indicator variables when you fit a model (say, with regress or logistic). Example 1 For instance, assume that the categorical variable agegrp contains 1 for ages 20 – 24, 2 for ages 25 – 39, and 3 for ages 40 – 44. (There is no one over 44 in our data.) As it stands, agegrp would be a poor candidate for inclusion in a model even if we thought age affected the outcome. The reason is that the coding would restrict the effect of being in the second age group to be twice the effect of being in the first, and, similarly, the effect of being in the third to be three times the first. That is, if we fit the model, y = β0 + β1 agegrp + Xβ2 the effect of being in the first age group is β1 , the second 2β1 , and the third 3β1 . If the coding 1, 2, and 3 is arbitrary, we could just as well have coded the age groups 1, 4, and 9, making the effects β1 , 4β1 , and 9β1 . The solution is to convert the categorical variable agegrp to a set of indicator variables, a1 , a2 , and a3 , where ai is 1 if the individual is a member of the ith age group and 0 otherwise. We can then fit the model y = β0 + β11 a1 + β12 a2 + β13 a3 + Xβ2 The effect of being in age group 1 is now β11 ; 2, β12 ; and 3, β13 ; and these results are independent of our (arbitrary) coding. The only difficulty at this point is that the model is unidentified in the sense that there are an infinite number of (β0 , β11 , β12 , β13 ) that fit the data equally well. xi — Interaction expansion To see this, pretend that (β0 , β11 , β12 , β13 ) = (1, 1, 3, 4). age groups are ( 1 + 1 + Xβ2 = 2 + Xβ2 y = 1 + 3 + Xβ2 = 4 + Xβ2 1 + 4 + Xβ2 = 5 + Xβ2 2475 The predicted values of y for the various (age group 1) (age group 2) (age group 3) Now pretend that (β0 , β11 , β12 , β13 ) = (2, 0, 2, 3). The predicted values of y are ( y= 2 + 0 + Xβ2 = 2 + Xβ2 2 + 2 + Xβ2 = 4 + Xβ2 2 + 3 + Xβ2 = 5 + Xβ2 (age group 1) (age group 2) (age group 3) These two sets of predictions are indistinguishable: for age group 1, y = 2 + Xβ2 regardless of the coefficient vector used, and similarly for age groups 2 and 3. This arises because we have three equations and four unknowns. Any solution is as good as any other, and, for our purposes, we merely need to choose one of them. The popular selection method is to set the coefficient on the first indicator variable to 0 (as we have done in our second coefficient vector). 
This is equivalent to fitting the model y = β0 + β12 a2 + β13 a3 + Xβ2 How we select a particular coefficient vector (identifies the model) does not matter. It does, however, affect the interpretation of the coefficients. For instance, we could just as well choose to omit the second group. In our artificial example, this would yield (β0 , β11 , β12 , β13 ) = (4, −2, 0, 1) instead of (2, 0, 2, 3). These coefficient vectors are the same in the sense that ( 2 + 0 + Xβ2 = 2 + Xβ2 = 4 − 2 + Xβ2 (age group 1) y = 2 + 2 + Xβ2 = 4 + Xβ2 = 4 + 0 + Xβ2 (age group 2) 2 + 3 + Xβ2 = 5 + Xβ2 = 4 + 1 + Xβ2 (age group 3) But what does it mean that β13 can just as well be 3 or 1? We obtain β13 = 3 when we set β11 = 0, so β13 = β13 − β11 and β13 measures the difference between age groups 3 and 1. In the second case, we obtain β13 = 1 when we set β12 = 0, so β13 − β12 = 1 and β13 measures the difference between age groups 3 and 2. There is no inconsistency. According to our β12 = 0 model, the difference between age groups 3 and 1 is β13 − β11 = 1 − (−2) = 3, the same result we got in the β11 = 0 model. Example 2 The issue of interpretation is important because it can affect the way we discuss results. Imagine that we are studying recovery after a coronary bypass operation. Assume that the age groups are children under 13 (we have two of them), young adults under 25 (we have a handful of them), adults under 46 (of which we have even more), mature adults under 56, older adults under 65, and elderly adults. We follow the prescription of omitting the first group, so all our results are reported relative to children under 13. While there is nothing statistically wrong with this, readers will be suspicious when we make statements like “compared with young children, older and elder adults . . . ”. Moreover, we will probably have to end each statement with “although results are not statistically significant” because we have only two children in our comparison group. Of course, even with results reported in this way, we can do reasonable comparisons (say, with mature adults), but we will have to do extra work to perform the appropriate linear hypothesis test using Stata’s test command. 2476 xi — Interaction expansion Here it would be better to force the omitted group to be more reasonable, such as mature adults. There is, however, a generic rule for automatic comparison group selection that, although less popular, tends to work better than the omit-the-first-group rule. That rule is to omit the most prevalent group. The most prevalent is usually a reasonable baseline. In any case, the prescription for categorical variables is 1. Convert each k -valued categorical variable to k indicator variables. 2. Drop one of the k indicator variables; any one will do, but dropping the first is popular, dropping the most prevalent is probably better in terms of having the computer guess at a reasonable interpretation, and dropping a specified one often eases interpretation the most. 3. Fit the model on the remaining k − 1 indicator variables. xi automates this procedure. We will now consider each of xi’s features in detail. Indicator variables for simple effects When you type i.varname, xi internally tabulates varname (which may be a string or a numeric variable) and creates indicator (dummy) variables for each observed value, omitting the indicator for the smallest value. For instance, say that agegrp takes on the values 1, 2, 3, and 4. 
Typing xi: logistic outcome i.agegrp creates indicator variables named Iagegrp 2, Iagegrp 3, and Iagegrp 4. (xi chooses the names and tries to make them readable; xi guarantees that the names are unique.) The expanded logistic model is . logistic outcome _Iagegrp_2 _Iagegrp_3 _Iagegrp_4 Afterward, you can drop the new variables xi leaves behind by typing ‘drop capitalization). I*’ (note the xi provides the following features when you type i.varname: • varname may be string or numeric. • Dummy variables are created automatically. • By default, the dummy-variable set is identified by dropping the dummy corresponding to the smallest value of the variable (how to specify otherwise is discussed below). • The new dummy variables are left in your dataset. By default, the names of the new dummy variables start with I; therefore, you can drop them by typing ‘drop I*’. You do not have to do this; each time you use xi, any automatically generated dummies with the same prefix as the one specified in the prefix(string) option, or I by default, are dropped and new ones are created. • The new dummy variables have variable labels so that you can determine what they correspond to by typing ‘describe’. • xi may be used with any Stata command (not just logistic). xi — Interaction expansion 2477 Controlling the omitted dummy By default, i.varname omits the dummy corresponding to the smallest value of varname; for a string variable, this is interpreted as dropping the first in an alphabetical, case-sensitive sort. xi provides two alternatives to dropping the first: xi will drop the dummy corresponding to the most prevalent value of varname, or xi will let you choose the particular dummy to be dropped. To change xi’s behavior to dropping the most prevalent dummy, type . char _dta[omit] prevalent although whether you type “prevalent” or “yes” or anything else does not matter. Setting this characteristic affects the expansion of all categorical variables in the dataset. If you resave your dataset, the prevalent preference will be remembered. If you want to change the behavior back to the default drop-the-first rule, type . char _dta[omit] to clear the characteristic. Once you set dta[omit], i.varname omits the dummy corresponding to the most prevalent value of varname. Thus the coefficients on the dummies have the interpretation of change from the most prevalent group. For example, . char _dta[omit] prevalent . xi: regress y i.agegrp might create Iagegrp 1 through Iagegrp 4, resulting in Iagegrp 2 being omitted if agegrp = 2 is most common (as opposed to the default dropping of Iagegrp 1). The model is then y = b0 + b1 Iagegrp 1 + b3 Iagegrp 3 + b4 Iagegrp 4 + u Then Predicted y for agegrp 1 = b0 + b1 Predicted y for agegrp 2 = b0 Predicted y for agegrp 3 = b0 + b3 Predicted y for agegrp 4 = b0 + b4 Thus the model’s reported t or Z statistics are for a test of whether each group is different from the most prevalent group. Perhaps you wish to omit the dummy for agegrp 3 instead. You do this by setting the variable’s omit characteristic: . char agegrp[omit] 3 This overrides dta[omit] if you have set it. Now when you type . xi: regress y i.agegrp Iagegrp 3 will be omitted, and you will fit the model y = b00 + b01 Iagegrp 1 + b02 Iagegrp 2 + b04 Iagegrp 4 + u Later if you want to return to the default omission, type . char agegrp[omit] to clear the characteristic. In summary, i.varname omits the first group by default, but if you define . 
char _dta[omit] prevalent 2478 xi — Interaction expansion the default behavior changes to dropping the most prevalent group. Either way, if you define a characteristic of the form . char varname[omit] # or, if varname is a string, . char varname[omit] string-literal the specified value will be omitted. Examples: . char agegrp[omit] 1 . char race[omit] White . char agegrp[omit] (for race, a string variable) (to restore default for agegrp) Categorical variable interactions i.varname1 *i.varname2 creates the dummy variables associated with the interaction of the categorical variables varname1 and varname2 . The identification rules — which categories are omitted — are the same as those for i.varname. For instance, assume that agegrp takes on four values and race takes on three values. Typing . xi: regress y i.agegrp*i.race results in model : y = a +b2 +c2 +d22 +d32 +d42 +u Iagegrp 2 + b3 Iagegrp 3 + b4 Irace 2 + c3 Irace 3 IageXrac 2 2 + d23 IageXrac IageXrac 3 2 + d33 IageXrac IageXrac 4 2 + d43 IageXrac dummies for: Iagegrp 4 (agegrp) (race) 2 3 3 3 (agegrp*race) 4 3 That is, typing . xi: regress y i.agegrp*i.race is the same as typing . xi: regress y i.agegrp i.race i.agegrp*i.race Although there are many other ways the interaction could have been parameterized, this method has the advantage that you can test the joint significance of the interactions by typing . testparm _IageXrac* When you perform the estimation step, whether you specify i.agegrp*i.race or i.race*i.agegrp makes no difference (other than in the names given to the interaction terms; in the first case, the names will begin with IageXrac; in the second, IracXage). Thus . xi: regress y i.race*i.agegrp fits the same model. You may also include multiple interactions simultaneously: . xi: regress y i.agegrp*i.race i.agegrp*i.sex xi — Interaction expansion 2479 The model fit is model : y = a +b2 +c2 +d22 +d32 +d42 +e2 +f22 +u Iagegrp 2 + b3 Iagegrp 3 + b4 Irace 2 + c3 Irace 3 IageXrac 2 2 + d23 IageXrac IageXrac 3 2 + d33 IageXrac IageXrac 4 2 + d43 IageXrac Isex 2 IageXsex 2 2 + f23 IageXsex Iagegrp 4 dummies for: (agegrp) (race) 2 3 3 3 4 3 (agegrp*race) 2 3 + f24 (sex) IageXsex 2 4 (agegrp*sex) The agegrp dummies are (correctly) included only once. Interactions with continuous variables i.varname1 *varname2 (as distinguished from i.varname1 *i.varname2 —note the second i.) specifies an interaction of a categorical variable with a continuous variable. For instance, . xi: regress y i.agegr*wgt results in the model y = a +b2 Iagegrp 2 + b3 Iagegrp 3 + b4 Iagegrp 4 (agegrp dummies) +c wgt (continuous wgt effect) +d2 IageXwgt 2 + d3 IageXwgt 3 + d4 IageXwgt 4 (agegrp*wgt interactions) +u A variation on this notation, using | rather than *, omits the agegrp dummies. Typing . xi: regress y i.agegrp|wgt fits the model y = a0 +c0 wgt (continuous wgt effect) +d02 IageXwgt 2 + d03 IageXwgt 3 + d04 IageXwgt 4 (agegrp*wgt interactions) +u0 The predicted values of y are agegrp*wgt model agegrp|wgt model y = a + c wgt a + c wgt + b2 + d2 wgt a + c wgt + b3 + d3 wgt a + c wgt + b4 + d4 wgt a0 + c0 wgt a0 + c0 wgt + d02 wgt a0 + c0 wgt + d03 wgt a0 + c0 wgt + d04 wgt That is, typing . xi: regress y i.agegrp*wgt if if if if agegrp = 1 agegrp = 2 agegrp = 3 agegrp = 4 2480 xi — Interaction expansion is equivalent to typing . xi: regress y i.agegrp i.agegrp|wgt In either case, you do not need to specify separately the continuous variable wgt; it is included automatically. Using xi: Interpreting output . 
xi: regress mpg i.rep78 i.rep78 _Irep78_1-5 (output from regress appears ) (naturally coded; _Irep78_1 omitted) Interpretation: i.rep78 expanded to the dummies Irep78 1, Irep78 2, . . . , Irep78 5. The numbers on the end are “natural” in the sense that Irep78 1 corresponds to rep78 = 1, Irep78 2 to rep78 = 2, and so on. Finally, the dummy for rep78 = 1 was omitted. . xi: regress mpg i.make i.make _Imake_1-74 (output from regress appears ) (_Imake_1 for make==AMC Concord omitted) Interpretation: i.make expanded to Imake 1, Imake 2, . . . , Imake 74. The coding is not natural because make is a string variable. Imake 1 corresponds to one make, Imake 2 to another, and so on. You can find out the coding by typing describe. Imake 1 for the AMC Concord was omitted. How xi names variables By default, xi assigns to the dummy variables it creates names having the form Istub groupid You may subsequently refer to the entire set of variables by typing ‘Istub*’. For example, name = Iagegrp 1 Iagegrp 2 IageXwgt 1 IageXrac 1 2 IageXrac 2 1 I + stub + I agegrp I agegrp I ageXwgt I ageXrac I ageXrac + groupid 1 2 1 1 2 2 1 If you specify a prefix in the prefix(string) option, say, starting with the prefix Sstub groupid Entire set Iagegrp* Iagegrp* IageXwgt* IageXrac* IageXrac* S, then xi will name the variables xi — Interaction expansion 2481 xi as a command rather than a command prefix xi can be used as a command prefix or as a command by itself. In the latter form, xi merely creates the indicator and interaction variables. Typing . xi: regress y i.agegrp*wgt i.agegrp _Iagegrp_1-4 i.agegrp*wgt _IageXwgt_1-4 (output from regress appears ) (naturally coded; _Iagegrp_1 omitted) (coded as above) is equivalent to typing . xi i.agegrp*wgt i.agegrp _Iagegrp_1-4 i.agegrp*wgt _IageXwgt_1-4 . regress y _Iagegrp* _IageXwgt* (output from regress appears ) (naturally coded; _Iagegrp_1 omitted) (coded as above) Warnings 1. xi creates new variables in your dataset; most are bytes, but interactions with continuous variables will have the storage type of the underlying continuous variable. You may get the message “insufficient memory”. If so, you will need to increase the amount of memory allocated to Stata’s data areas; see [U] 6 Managing memory. 2. When using xi with an estimation command, you may get the message “matsize too small”. If so, see [R] matsize. Stored results xi stores the following characteristics: dta[ dta[ xi xi Vars Vars Prefix ] To Drop ] prefix names variables created References Hendrickx, J. 1999. dm73: Using categorical variables in Stata. Stata Technical Bulletin 52: 2–8. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 51–59. College Station, TX: Stata Press. . 2000. dm73.1: Contrasts for categorical variables: Update. Stata Technical Bulletin 54: 7. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 60–61. College Station, TX: Stata Press. . 2001a. dm73.2: Contrasts for categorical variables: Update. Stata Technical Bulletin 59: 2–5. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 9–14. College Station, TX: Stata Press. . 2001b. dm73.3: Contrasts for categorical variables: Update. Stata Technical Bulletin 61: 5. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 14–15. College Station, TX: Stata Press. 
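As noted at the beginning of this entry, factor variables are now the preferred way to obtain these expansions for commands that accept them. The sketch below pairs two of the earlier (purely illustrative) xi calls with their factor-variable equivalents; ## requests main effects plus interactions, and c. marks weight as continuous:
. xi: logistic outcome weight bp i.agegrp*i.race
. logistic outcome weight bp i.agegrp##i.race
. xi: logistic outcome bp i.agegrp*weight i.race
. logistic outcome bp i.agegrp##c.weight i.race
With numeric categorical variables, each pair fits the same model (both methods omit the lowest value by default), but factor variables create no new variables in your dataset; unlike xi, they do not accept string categorical variables.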
Also see [U] 11.1.10 Prefix commands [U] 20 Estimation and postestimation commands Title zinb — Zero-inflated negative binomial regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax zinb depvar indepvars if in weight , inflate(varlist , offset(varname) | cons) options Description options Model ∗ inflate( ) noconstant exposure(varnamee ) offset(varnameo ) constraints(constraints) collinear probit equation that determines whether the count is zero suppress constant term include ln(varnamee ) in model with coefficient constrained to 1 include varnameo in model with coefficient constrained to 1 apply specified linear constraints keep collinear variables use probit model to characterize excess zeros; default is logit SE/Robust vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife vce(vcetype) Reporting level(#) irr vuong zip nocnsreport display options set confidence level; default is level(95) report incidence-rate ratios perform Vuong test perform ZIP likelihood-ratio test do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used display legend instead of statistics ∗ inflate(varlist , offset(varname) | cons) is required. coeflegend indepvars and varlist may contain factor variables; see [U] 11.4.3 Factor variables. bootstrap, by, fp, jackknife, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce(), vuong, zip, and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 2482 zinb — Zero-inflated negative binomial regression 2483 Menu Statistics > Count outcomes > Zero-inflated negative binomial regression Description zinb estimates a zero-inflated negative binomial (ZINB) regression of depvar on indepvars, where depvar is a nonnegative count variable. Options Model inflate(varlist , offset(varname) | cons) specifies the equation that determines whether the observed count is zero. Conceptually, omitting inflate() would be equivalent to fitting the model with nbreg. inflate(varlist , offset(varname) ) specifies the variables in the equation. You may optionally include an offset for this varlist. inflate( cons) specifies that the equation determining whether the count is zero contains only an intercept. To run a zero-inflated model of depvar with only an intercept in both equations, type zinb depvar, inflate( cons). noconstant, exposure(varnamee ), offset(varnameo ), constraints(constraints), collinear; see [R] estimation options. probit requests that a probit, instead of logit, model be used to characterize the excess zeros in the data. SE/Robust vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce option. Reporting level(#); see [R] estimation options. 
irr reports estimated coefficients transformed to incidence-rate ratios, that is, eβi rather than βi . Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated or stored. irr may be specified at estimation or when replaying previously estimated results. vuong specifies that the Vuong (1989) test of ZINB versus negative binomial be reported. This test statistic has a standard normal distribution with large positive values favoring the ZINB model and large negative values favoring the negative binomial model. zip requests that a likelihood-ratio test comparing the ZINB model with the zero-inflated Poisson model be included in the output. nocnsreport; see [R] estimation options. display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch; see [R] estimation options. 2484 zinb — Zero-inflated negative binomial regression Maximization maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg). The following option is available with zinb but is not shown in the dialog box: coeflegend; see [R] estimation options. Remarks and examples See Long (1997, 242–247) and Greene (2012, 821–826) for a discussion of zero-modified count models. For information about the test developed by Vuong (1989), see Greene (2012, 823–824) and Long (1997). Greene (1994) applied the test to zero-inflated Poisson and negative binomial models, and there is a description of that work in Greene (2012). Negative binomial regression fits models of the number of occurrences (counts) of an event. You could use nbreg for this (see [R] nbreg), but in some count-data models, you might want to account for the prevalence of zero counts in the data. For instance, you could count how many fish each visitor to a park catches. Many visitors may catch zero, because they do not fish (as opposed to being unsuccessful). You may be able to model whether a person fishes depending on several covariates related to fishing activity and model how many fish a person catches depending on several covariates having to do with the success of catching fish (type of lure/bait, time of day, temperature, season, etc.). This is the type of data for which the zinb command is useful. The zero-inflated (or zero-altered) negative binomial model allows overdispersion through the splitting process that models the outcomes as zero or nonzero. Example 1 We have data on the number of fish caught by visitors to a national park. Some of the visitors do not fish, but we do not have the data on whether a person fished; we have data merely on how many fish were caught, together with several covariates. Because our data have a preponderance of zeros (142 of 250), we use the zinb command to model the outcome. zinb — Zero-inflated negative binomial regression . use http://www.stata-press.com/data/r13/fish . 
zinb count persons livebait, inf(child camper) vuong Fitting constant-only model: Iteration 0: log likelihood = -519.33992 (output omitted ) Iteration 8: log likelihood = -442.66299 Fitting full model: Iteration 0: log likelihood = -442.66299 (not concave) (output omitted ) Iteration 8: log likelihood = -401.54776 Zero-inflated negative binomial regression Number of obs Nonzero obs Zero obs Inflation model = logit LR chi2(2) Log likelihood = -401.5478 Prob > chi2 Std. Err. z P>|z| = = = = = 250 108 142 82.23 0.0000 count Coef. count persons livebait _cons .9742984 1.557523 -2.730064 .1034938 .4124424 .476953 9.41 3.78 -5.72 0.000 0.000 0.000 .7714543 .7491503 -3.664874 1.177142 2.365895 -1.795253 inflate child camper _cons 3.185999 -2.020951 -2.695385 .7468551 .872054 .8929071 4.27 -2.32 -3.02 0.000 0.020 0.003 1.72219 -3.730146 -4.44545 4.649808 -.3117567 -.9453189 /lnalpha .5110429 .1816816 2.81 0.005 .1549535 .8671323 alpha 1.667029 .3028685 1.167604 2.380076 Vuong test of zinb vs. standard negative binomial: z = 2485 [95% Conf. Interval] 5.59 Pr>z = 0.0000 In general, Vuong test statistics that are significantly positive favor the zero-inflated models, whereas those that are significantly negative favor the non–zero-inflated models. Thus, in the above model, the zero inflation is significant. 2486 zinb — Zero-inflated negative binomial regression Stored results zinb stores the following in e(): Scalars e(N) e(N zero) e(k) e(k eq) e(k eq model) e(k aux) e(k dv) e(df m) e(ll) e(ll 0) e(df c) e(N clust) e(chi2) e(p) e(chi2 cp) e(vuong) e(rank) e(ic) e(rc) e(converged) Macros e(cmd) e(cmdline) e(depvar) e(inflate) e(wtype) e(wexp) e(title) e(clustvar) e(offset1) e(offset2) e(chi2type) e(chi2 cpt) e(vce) e(vcetype) e(opt) e(which) e(ml method) e(user) e(technique) e(properties) e(predict) e(asbalanced) e(asobserved) Matrices e(b) e(Cns) e(ilog) e(gradient) e(V) e(V modelbased) Functions e(sample) number of observations number of zero observations number of parameters number of equations in e(b) number of equations in overall model test number of auxiliary parameters number of dependent variables model degrees of freedom log likelihood log likelihood, constant-only model degrees of freedom for comparison test number of clusters χ2 significance of model test χ2 for test of α = 0 Vuong test statistic rank of e(V) number of iterations return code 1 if converged, 0 otherwise zinb command as typed name of dependent variable logit or probit weight type weight expression title in estimation output name of cluster variable offset offset for inflate() Wald or LR; type of model χ2 test Wald or LR; type of model χ2 test corresponding to e(chi2 cp) vcetype specified in vce() title used to label Std. Err. type of optimization max or min; whether optimizer is to perform maximization or minimization type of ml method name of likelihood-evaluator program maximization technique b V program used to implement predict factor variables fvset as asbalanced factor variables fvset as asobserved coefficient vector constraints matrix iteration log (up to 20 iterations) gradient vector variance–covariance matrix of the estimators model-based variance marks estimation sample zinb — Zero-inflated negative binomial regression 2487 Methods and formulas Several models in the literature are (correctly) described as zero inflated. 
The zinb command maximizes the log likelihood lnL, defined by m = 1/α pj = 1/(1 + αµj ) ξjβ = xj β + offsetβj ξjγ = zj γ + offsetγj µj = exp(ξjβ ) X lnL = wj ln F (ξjγ ) + 1 − F (ξjγ ) pm j j∈S + X h wj ln 1 − F (ξjγ ) + lnΓ(m + yj ) − lnΓ(yj + 1) j6∈S − lnΓ(m) + m lnpj + yj ln(1 − pj ) i where wj are the weights, F is the inverse of the logit link (or the inverse of the probit link if probit was specified), and S is the set of observations for which the outcome yj = 0. This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas. zinb also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation. References Desmarais, B. A., and J. J. Harden. 2013. Testing for zero inflation in count models: Bias correction for the Vuong test. Stata Journal 13: 810–835. Greene, W. H. 1994. Accounting for excess zeros and sample selection in Poisson and negative binomial regression models. Working paper EC-94-10, Department of Economics, Stern School of Business, New York University. http://ideas.repec.org/p/ste/nystbu/94-10.html. . 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall. Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage. Long, J. S., and J. Freese. 2001. Predicted probabilities for count models. Stata Journal 1: 51–57. . 2014. Regression Models for Categorical Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata Press. Mullahy, J. 1986. Specification and testing of some modified count data models. Journal of Econometrics 33: 341–365. Vuong, Q. H. 1989. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57: 307–333. 2488 zinb — Zero-inflated negative binomial regression Also see [R] zinb postestimation — Postestimation tools for zinb [R] zip — Zero-inflated Poisson regression [R] nbreg — Negative binomial regression [R] poisson — Poisson regression [R] tnbreg — Truncated negative binomial regression [R] tpoisson — Truncated Poisson regression [SVY] svy estimation — Estimation commands for survey data [XT] xtnbreg — Fixed-effects, random-effects, & population-averaged negative binomial models [U] 20 Estimation and postestimation commands Title zinb postestimation — Postestimation tools for zinb Description Methods and formulas Syntax for predict Reference Menu for predict Also see Options for predict Description The following postestimation commands are available after zinb: Command Description contrast estat ic estat summarize estat vce estat (svy) estimates forecast1 lincom contrasts and ANOVA-style joint tests of estimates Akaike’s and Schwarz’s Bayesian information criteria (AIC and BIC) summary statistics for the estimation sample variance–covariance matrix of the estimators (VCE) postestimation statistics for survey data cataloging estimation results dynamic forecasts and simulations point estimates, standard errors, testing, and inference for linear combinations of coefficients likelihood-ratio test marginal means, predictive margins, marginal effects, and average marginal effects graph the results from margins (profile plots, interaction plots, etc.) 
point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients predictions, residuals, influence statistics, and other diagnostic measures point estimates, standard errors, testing, and inference for generalized predictions pairwise comparisons of estimates seemingly unrelated estimation Wald tests of simple and composite linear hypotheses Wald tests of nonlinear hypotheses lrtest2 margins marginsplot nlcom predict predictnl pwcompare suest test testnl 1 2 forecast is not appropriate with svy estimation results. lrtest is not appropriate with svy estimation results. 2489 2490 zinb postestimation — Postestimation tools for zinb Syntax for predict predict type predict type newvar if in , statistic nooffset stub* | newvarreg newvarinflate newvarlnalpha if in , scores Description statistic Main number of events; the default incidence rate probability of a degenerate zero probability Pr(yj = n) probability Pr(a ≤ yj ≤ b) linear prediction standard error of the linear prediction n ir pr pr(n) pr(a,b) xb stdp These statistics are available both in and out of sample; type predict the estimation sample. . . . if e(sample) . . . if wanted only for Menu for predict Statistics > Postestimation > Predictions, residuals, etc. Options for predict Main n, the default, calculates the predicted number of events, which is (1 − pj ) exp(xj β) if neither offset() nor exposure() was specified when the model was fit, where pj is the predicted probability of a zero outcome; (1 − pj ) exp{(xj β) + offsetj } if offset() was specified; or (1 − pj ){exp(xj β) × exposurej } if exposure() was specified. ir calculates the incidence rate exp(xj β), which is the predicted number of events when exposure is 1. This is equivalent to specifying both the n and the nooffset options. pr calculates the probability Pr(yj = 0), where this zero was obtained from the degenerate distribution F (zj γ). If offset() was specified within the inflate() option, then F (zj γ + offsetγj ) is calculated. pr(n) calculates the probability Pr(yj = n), where n is a nonnegative integer that may be specified as a number or a variable. Note that pr is not equivalent to pr(0). pr(a,b) calculates the probability Pr(a ≤ yj ≤ b), where a and b are nonnegative integers that may be specified as numbers or variables; b missing (b ≥ .) means +∞; pr(20,.) calculates Pr(yj ≥ 20); pr(20,b) calculates Pr(yj ≥ 20) in observations for which b ≥ . and calculates Pr(20 ≤ yj ≤ b) elsewhere. pr(.,b) produces a syntax error. A missing value in an observation of the variable a causes a missing value in that observation for pr(a,b). zinb postestimation — Postestimation tools for zinb 2491 xb calculates the linear prediction, which is xj β if neither offset() nor exposure() was specified; xj β + offsetj if offset() was specified; or xj β + ln(exposurej ) if exposure() was specified; see nooffset below. stdp calculates the standard error of the linear prediction. nooffset is relevant only if you specified offset() or exposure() when you fit the model. It modifies the calculations made by predict so that they ignore the offset or exposure variable; the linear prediction is treated as xj β rather than as xj β + offsetj or xj β + ln(exposurej ). Specifying predict . . . , nooffset is equivalent to specifying predict . . . , ir. scores calculates equation-level score variables. The first new variable will contain ∂ ln L/∂(xj β). The second new variable will contain ∂ ln L/∂(zj γ). The third new variable will contain ∂ ln L/∂ lnα. 
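For instance, after refitting the zinb model of example 1 in [R] zinb, these statistics can be combined to describe the fitted count distribution; the new variable names below are illustrative only:
. use http://www.stata-press.com/data/r13/fish, clear
. zinb count persons livebait, inflate(child camper)
. predict pdeg, pr
. predict pr0, pr(0)
. predict prle5, pr(0,5)
. summarize pdeg pr0 prle5
Here pdeg is the probability of a zero from the degenerate distribution, pr0 is Pr(count = 0), and prle5 is Pr(0 ≤ count ≤ 5); as noted above, pr and pr(0) are not the same thing.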
Methods and formulas The probabilities calculated using the pr(n) option are the probability Pr(yi = n). These are calculated using Pr(0|xi ) = ωi + (1 − ωi ) p2 (0|xi ) Pr(n|xi ) = (1 − ωi ) p2 (n|xi ) for n = 1, 2, . . . where ωi is the probability of obtaining an observation from the degenerate distribution whose mass is concentrated at zero, and p2 (n|xi ) is the probability of yi = n from the nondegenerate, negative binomial distribution. ωi can be obtained from the pr option. See Cameron and Trivedi (2013, sec. 4.6) for further details. Reference Cameron, A. C., and P. K. Trivedi. 2013. Regression Analysis of Count Data. 2nd ed. New York: Cambridge University Press. Also see [R] zinb — Zero-inflated negative binomial regression [U] 20 Estimation and postestimation commands Title zip — Zero-inflated Poisson regression Syntax Remarks and examples Also see Menu Stored results Description Methods and formulas Options References Syntax zip depvar indepvars if in weight , inflate(varlist , offset(varname) | cons) options Description options Model ∗ inflate( ) noconstant exposure(varnamee ) offset(varnameo ) constraints(constraints) collinear probit equation that determines whether the count is zero suppress constant term include ln(varnamee ) in model with coefficient constrained to 1 include varnameo in model with coefficient constrained to 1 apply specified linear constraints keep collinear variables use probit model to characterize excess zeros; default is logit SE/Robust vcetype may be oim, robust, cluster clustvar, opg, bootstrap, or jackknife vce(vcetype) Reporting level(#) irr vuong nocnsreport display options set confidence level; default is level(95) report incidence-rate ratios perform Vuong test do not display constraints control column formats, row spacing, line width, display of omitted variables and base and empty cells, and factor-variable labeling Maximization maximize options control the maximization process; seldom used display legend instead of statistics ∗ inflate(varlist , offset(varname) | cons) is required. coeflegend indepvars and varlist may contain factor variables; see [U] 11.4.3 Factor variables. bootstrap, by, fp, jackknife, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands. Weights are not allowed with the bootstrap prefix; see [R] bootstrap. vce(), vuong, and weights are not allowed with the svy prefix; see [SVY] svy. fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight. coeflegend does not appear in the dialog box. See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 2492 zip — Zero-inflated Poisson regression 2493 Menu Statistics > Count outcomes > Zero-inflated Poisson regression Description zip estimates a zero-inflated Poisson (ZIP) regression of depvar on indepvars, where depvar is a nonnegative count variable. Options Model inflate(varlist , offset(varname) | cons) specifies the equation that determines whether the observed count is zero. Conceptually, omitting inflate() would be equivalent to fitting the model with poisson; see [R] poisson. inflate(varlist , offset(varname) ) specifies the variables in the equation. You may optionally include an offset for this varlist. inflate( cons) specifies that the equation determining whether the count is zero contains only an intercept. To run a zero-inflated model of depvar with only an intercept in both equations, type zip depvar, inflate( cons). 
noconstant, exposure(varname_e), offset(varname_o), constraints(constraints), collinear; see [R] estimation options.

probit requests that a probit, instead of logit, model be used to characterize the excess zeros in the data.

SE/Robust

vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (oim, opg), that are robust to some kinds of misspecification (robust), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce_option.

Reporting

level(#); see [R] estimation options.

irr reports estimated coefficients transformed to incidence-rate ratios, that is, e^b rather than b. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated or stored. irr may be specified at estimation or when replaying previously estimated results.

vuong specifies that the Vuong (1989) test of ZIP versus Poisson be reported. This test statistic has a standard normal distribution with large positive values favoring the ZIP model and large negative values favoring the Poisson model.

nocnsreport; see [R] estimation options.

display_options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(%fmt), pformat(%fmt), sformat(%fmt), and nolstretch; see [R] estimation options.

Maximization

maximize_options: difficult, technique(algorithm_spec), iterate(#), [no]log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and from(init_specs); see [R] maximize. These options are seldom used. Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg).

The following option is available with zip but is not shown in the dialog box:

coeflegend; see [R] estimation options.

Remarks and examples

See Long (1997, 242–247) and Greene (2012, 821–826) for a discussion of zero-modified count models. For information about the test developed by Vuong (1989), see Greene (2012, 823–824) and Long (1997). Greene (1994) applied the test to ZIP and ZINB models, as described in Greene (2012, 824).

Poisson regression fits models of the number of occurrences (counts) of an event. You could use poisson for this (see [R] poisson), but in some count-data models, you might want to account for the prevalence of zero counts in the data.

For instance, you might count how many fish each visitor to a park catches. Many visitors may catch zero, because they do not fish (as opposed to being unsuccessful). You may be able to model whether a person fishes depending on several covariates related to fishing activity and model how many fish a person catches depending on several covariates having to do with the success of catching fish (type of lure/bait, time of day, temperature, season, etc.). This is the type of data for which the zip command is useful.

The zero-inflated (or zero-altered) Poisson model allows overdispersion through the splitting process that models the outcomes as zero or nonzero.

Example 1

We have data on the number of fish caught by visitors to a national park. Some of the visitors do not fish, but we do not have the data on whether a person fished; we merely have data on how many fish were caught together with several covariates. Because our data have a preponderance of zeros (142 of 250), we use the zip command to model the outcome.
. use http://www.stata-press.com/data/r13/fish
. zip count persons livebait, inf(child camper) vuong
Fitting constant-only model:
Iteration 0:   log likelihood = -1347.807
  (output omitted)
Iteration 4:   log likelihood = -1103.9425
Fitting full model:
Iteration 0:   log likelihood = -1103.9425
  (output omitted)
Iteration 5:   log likelihood = -850.70142

Zero-inflated Poisson regression                Number of obs   =        250
                                                Nonzero obs     =        108
                                                Zero obs        =        142
Inflation model = logit                         LR chi2(2)      =     506.48
Log likelihood  = -850.7014                     Prob > chi2     =     0.0000

       count        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
count
     persons     .8068853   .0453288    17.80   0.000      .7180424    .8957281
    livebait     1.757289   .2446082     7.18   0.000      1.277866    2.236713
       _cons    -2.178472   .2860289    -7.62   0.000     -2.739078   -1.617865
inflate
       child     1.602571   .2797719     5.73   0.000      1.054228    2.150913
      camper    -1.015698    .365259    -2.78   0.005     -1.731593   -.2998038
       _cons    -.4922872   .3114562    -1.58   0.114      -1.10273    .1181558

Vuong test of zip vs. standard Poisson:            z =  3.95  Pr>z = 0.0000

In general, Vuong test statistics that are significantly positive favor the zero-inflated models, while those that are significantly negative favor the non–zero-inflated models. Thus, in the above model, the zero inflation is significant.

Stored results

zip stores the following in e():

Scalars
  e(N)               number of observations
  e(N_zero)          number of zero observations
  e(k)               number of parameters
  e(k_eq)            number of equations in e(b)
  e(k_eq_model)      number of equations in overall model test
  e(k_dv)            number of dependent variables
  e(df_m)            model degrees of freedom
  e(ll)              log likelihood
  e(ll_0)            log likelihood, constant-only model
  e(ll_c)            log likelihood, comparison model
  e(df_c)            degrees of freedom for comparison test
  e(N_clust)         number of clusters
  e(chi2)            χ2
  e(p)               significance of model test
  e(vuong)           Vuong test statistic
  e(rank)            rank of e(V)
  e(ic)              number of iterations
  e(rc)              return code
  e(converged)       1 if converged, 0 otherwise

Macros
  e(cmd)             zip
  e(cmdline)         command as typed
  e(depvar)          name of dependent variable
  e(inflate)         logit or probit
  e(wtype)           weight type
  e(wexp)            weight expression
  e(title)           title in estimation output
  e(clustvar)        name of cluster variable
  e(offset1)         offset
  e(offset2)         offset for inflate()
  e(chi2type)        Wald or LR; type of model χ2 test
  e(vce)             vcetype specified in vce()
  e(vcetype)         title used to label Std. Err.
  e(opt)             type of optimization
  e(which)           max or min; whether optimizer is to perform maximization or minimization
  e(ml_method)       type of ml method
  e(user)            name of likelihood-evaluator program
  e(technique)       maximization technique
  e(properties)      b V
  e(predict)         program used to implement predict
  e(asbalanced)      factor variables fvset as asbalanced
  e(asobserved)      factor variables fvset as asobserved

Matrices
  e(b)               coefficient vector
  e(Cns)             constraints matrix
  e(ilog)            iteration log (up to 20 iterations)
  e(gradient)        gradient vector
  e(V)               variance–covariance matrix of the estimators
  e(V_modelbased)    model-based variance

Functions
  e(sample)          marks estimation sample

Methods and formulas

Several models in the literature are (correctly) described as zero inflated. The zip command maximizes the log-likelihood lnL, defined by

    ξjβ = xjβ + offsetβj
    ξjγ = zjγ + offsetγj
    λj = exp(ξjβ)

    lnL = Σ_{j∈S} wj ln[ F(ξjγ) + {1 − F(ξjγ)} exp(−λj) ]
        + Σ_{j∉S} wj [ ln{1 − F(ξjγ)} − λj + ξjβ yj − ln(yj!) ]
where wj are the weights, F is the inverse of the logit link (or the inverse of the probit link if probit was specified), and S is the set of observations for which the outcome yj = 0.

This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] _robust, particularly Maximum likelihood estimators and Methods and formulas.

zip also supports estimation with survey data. For details on VCEs with survey data, see [SVY] variance estimation.

References
Desmarais, B. A., and J. J. Harden. 2013. Testing for zero inflation in count models: Bias correction for the Vuong test. Stata Journal 13: 810–835.
Greene, W. H. 1994. Accounting for excess zeros and sample selection in Poisson and negative binomial regression models. Working paper EC-94-10, Department of Economics, Stern School of Business, New York University. http://ideas.repec.org/p/ste/nystbu/94-10.html.
Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall.
Lambert, D. 1992. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34: 1–14.
Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage.
Long, J. S., and J. Freese. 2001. Predicted probabilities for count models. Stata Journal 1: 51–57.
Long, J. S., and J. Freese. 2014. Regression Models for Categorical Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata Press.
Mullahy, J. 1986. Specification and testing of some modified count data models. Journal of Econometrics 33: 341–365.
Vuong, Q. H. 1989. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57: 307–333.

Also see
[R] zip postestimation — Postestimation tools for zip
[R] zinb — Zero-inflated negative binomial regression
[R] nbreg — Negative binomial regression
[R] poisson — Poisson regression
[R] tnbreg — Truncated negative binomial regression
[R] tpoisson — Truncated Poisson regression
[SVY] svy estimation — Estimation commands for survey data
[XT] xtpoisson — Fixed-effects, random-effects, and population-averaged Poisson models
[U] 20 Estimation and postestimation commands

Title
zip postestimation — Postestimation tools for zip

Description    Syntax for predict    Menu for predict    Options for predict    Remarks and examples    Methods and formulas    Reference    Also see

Description

The following postestimation commands are available after zip:

Command             Description
contrast            contrasts and ANOVA-style joint tests of estimates
estat ic            Akaike's and Schwarz's Bayesian information criteria (AIC and BIC)
estat summarize     summary statistics for the estimation sample
estat vce           variance–covariance matrix of the estimators (VCE)
estat (svy)         postestimation statistics for survey data
estimates           cataloging estimation results
forecast¹           dynamic forecasts and simulations
lincom              point estimates, standard errors, testing, and inference for linear combinations of coefficients
lrtest²             likelihood-ratio test
margins             marginal means, predictive margins, marginal effects, and average marginal effects
marginsplot         graph the results from margins (profile plots, interaction plots, etc.)
nlcom               point estimates, standard errors, testing, and inference for nonlinear combinations of coefficients
predict             predictions, residuals, influence statistics, and other diagnostic measures
predictnl           point estimates, standard errors, testing, and inference for generalized predictions
pwcompare           pairwise comparisons of estimates
suest               seemingly unrelated estimation
test                Wald tests of simple and composite linear hypotheses
testnl              Wald tests of nonlinear hypotheses

¹ forecast is not appropriate with svy estimation results.
² lrtest is not appropriate with svy estimation results.
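These commands combine with zip in the usual way; for instance, predictive margins of the expected count can be computed over values of a covariate and then graphed. The following is a minimal sketch using the fish data from [R] zip; the at() values are arbitrary illustrations and are not part of the original entry.

. use http://www.stata-press.com/data/r13/fish
. zip count persons livebait, inf(child camper)
. margins, at(persons=(1 2 3 4))     // predictive margins of the expected count by party size
. marginsplot                        // profile plot of the margins just computed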
Syntax for predict

    predict [type] newvar [if] [in] [, statistic nooffset]

    predict [type] {stub* | newvarreg newvarinflate} [if] [in], scores

statistic       Description
Main
  n             number of events; the default
  ir            incidence rate
  pr            probability of a degenerate zero
  pr(n)         probability Pr(yj = n)
  pr(a,b)       probability Pr(a ≤ yj ≤ b)
  xb            linear prediction
  stdp          standard error of the linear prediction

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.

Menu for predict
Statistics > Postestimation > Predictions, residuals, etc.

Options for predict

Main

n, the default, calculates the predicted number of events, which is (1 − pj) exp(xjβ) if neither offset() nor exposure() was specified when the model was fit, where pj is the predicted probability of a zero outcome; (1 − pj) exp{xjβ + offsetj} if offset() was specified; or (1 − pj){exp(xjβ) × exposurej} if exposure() was specified.

ir calculates the incidence rate exp(xjβ), which is the predicted number of events when exposure is 1. This is equivalent to specifying both the n and the nooffset options.

pr calculates the probability Pr(yj = 0), where this zero was obtained from the degenerate distribution F(zjγ). If offset() was specified within the inflate() option, then F(zjγ + offsetγj) is calculated.

pr(n) calculates the probability Pr(yj = n), where n is a nonnegative integer that may be specified as a number or a variable. Note that pr is not equivalent to pr(0).

pr(a,b) calculates the probability Pr(a ≤ yj ≤ b), where a and b are nonnegative integers that may be specified as numbers or variables; b missing (b ≥ .) means +∞; pr(20,.) calculates Pr(yj ≥ 20); pr(20,b) calculates Pr(yj ≥ 20) in observations for which b ≥ . and calculates Pr(20 ≤ yj ≤ b) elsewhere. pr(.,b) produces a syntax error. A missing value in an observation of the variable a causes a missing value in that observation for pr(a,b).

xb calculates the linear prediction, which is xjβ if neither offset() nor exposure() was specified; xjβ + offsetj if offset() was specified; or xjβ + ln(exposurej) if exposure() was specified; see nooffset below.

stdp calculates the standard error of the linear prediction.

nooffset is relevant only if you specified offset() or exposure() when you fit the model. It modifies the calculations made by predict so that they ignore the offset or exposure variable; the linear prediction is treated as xjβ rather than as xjβ + offsetj or xjβ + ln(exposurej). Specifying predict ..., nooffset is equivalent to specifying predict ..., ir.

scores calculates equation-level score variables. The first new variable will contain ∂lnL/∂(xjβ). The second new variable will contain ∂lnL/∂(zjγ).
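To illustrate the interval form of pr() documented above, the endpoints may be variables, and a missing upper bound is read as +∞. This is a minimal sketch; the variable names ub and printerval are illustrative and are not part of the original entry.

. use http://www.stata-press.com/data/r13/fish
. zip count persons livebait, inf(child camper)
. generate ub = 5
. replace ub = . if camper                // missing upper bound means +infinity for these observations
. predict printerval, pr(1,ub)            // Pr(1 <= count <= 5) where ub==5; Pr(count >= 1) where ub is missing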
Remarks and examples

Example 1

Continuing with example 1 from [R] zip, we will use predict to compute the predicted number of fish captured by each individual.

. use http://www.stata-press.com/data/r13/fish
. zip count persons livebait, inf(child camper) vuong
  (output omitted)
. predict numfished
(option n assumed; predicted number of events)

predict with the pr option computes the probability that an individual does not fish.

. predict pr, pr

On the other hand, predict with the pr(n) option computes the probability of catching n fish; particularly, the probability of catching zero fish will be

. predict pr0, pr(0)
. list pr pr0 in 1

             pr        pr0
  1.   .3793549   .8609267

Notice that pr0 is always equal to or greater than pr. For example, for the first individual, the probability of not fishing is 0.38; on the other hand, the probability of catching zero fish (0.86) is equal to the sum of the probability of not fishing and the probability of fishing but not catching any fish. pr0 can also be computed as one minus the probability of catching at least one fish, that is:

. predict pr_catch, pr(1,.)
. gen pr0b = 1-pr_catch

Methods and formulas

The probabilities calculated using the pr(n) option are the probability Pr(yi = n). These are calculated using

    Pr(0|xi) = ωi + (1 − ωi) exp(−λi)
    Pr(n|xi) = (1 − ωi) {λi^n exp(−λi)}/n!        for n = 1, 2, . . .

where ωi is the probability of obtaining an observation from the degenerate distribution whose mass is concentrated at zero. ωi can be obtained from the pr option.

See Cameron and Trivedi (2013, sec. 4.6) for further details.

Reference
Cameron, A. C., and P. K. Trivedi. 2013. Regression Analysis of Count Data. 2nd ed. New York: Cambridge University Press.

Also see
[R] zip — Zero-inflated Poisson regression
[U] 20 Estimation and postestimation commands

Author index
This is the author index for the Stata Base Reference Manual.

A Abramowitz, M., [R] contrast, [R] orthog Abrams, K. R., [R] meta Abramson, J. H., [R] kappa Abramson, Z. H., [R] kappa Abrevaya, J., [R] boxcox postestimation Achen, C. H., [R] scobit Acock, A. C., [R] anova, [R] correlate, [R] nestreg, [R] oneway, [R] prtest, [R] ranksum, [R] ttest Adkins, L. C., [R] heckman, [R] regress, [R] regress postestimation Afifi, A. A., [R] anova, [R] stepwise Agresti, A., [R] ci, [R] expoisson, [R] tabulate twoway Aigner, D. J., [R] frontier Aiken, L. S., [R] pcorr Aitchison, J., [R] ologit, [R] oprobit Aitken, A. C., [R] reg3 Aivazian, S. A., [R] ksmirnov Akaike, H., [R] BIC note, [R] estat ic, [R] glm Aldrich, J. H., [R] logit, [R] probit Alexandersson, A., [R] regress Alf, E., Jr., [R] rocfit, [R] rocreg Algina, J., [R] esize Alldredge, J. R., [R] pk, [R] pkcross Allison, P. D., [R] rologit, [R] testnl Almås, I., [R] inequality Alonzo, T. A., [R] rocreg, [R] rocreg postestimation, [R] rocregplot Altman, D. G., [R] anova, [R] fp, [R] kappa, [R] kwallis, [R] meta, [R] mfp, [R] nptrend, [R] oneway Ambler, G., [R] fp, [R] fp postestimation, [R] mfp, [R] regress Amemiya, T., [R] glogit, [R] intreg, [R] ivprobit, [R] nlogit, [R] tobit Andersen, E. B., [R] clogit Andersen, P. K., [R] glm Anderson, J. A., [R] ologit, [R] slogit Anderson, R. E., [R] rologit Anderson, R. L., [R] anova Anderson, S., [R] pkequiv Anderson, T. W., [R] ivregress postestimation Andrews, D. F., [R] rreg Andrews, D. W. K., [R] ivregress Ängquist, L., [R] bootstrap, [R] permute Angrist, J. D., [R] ivregress, [R] ivregress postestimation, [R] qreg, [R] regress Anscombe, F.
J., [R] binreg postestimation, [R] glm, [R] glm postestimation Arbuthnott, J., [R] signrank 2503 Archer, K. J., [R] estat gof, [R] logistic, [R] logit Arellano, M., [R] areg postestimation, [R] gmm Arminger, G., [R] suest Armitage, P., [R] ameans, [R] expoisson, [R] pkcross, [R] sdtest Armstrong, R. D., [R] qreg Arthur, M., [R] symmetry Atella, V., [R] frontier Atkinson, A. C., [R] boxcox, [R] nl Azen, S. P., [R] anova B Babin, B. J., [R] rologit Baker, R. J., [R] glm Baker, R. M., [R] ivregress postestimation Bakker, A., [R] mean Balaam, L. N., [R] pkcross Baltagi, B. H., [R] hausman Bamber, D., [R] rocfit, [R] rocregplot, [R] roctab Bancroft, T. A., [R] stepwise Barnard, G. A., [R] spearman, [R] ttest Barnett, A. G., [R] glm Barrison, I. G., [R] binreg Bartlett, M. S., [R] oneway Bartus, T., [R] margins Basmann, R. L., [R] ivregress, [R] ivregress postestimation Basu, A., [R] glm Bauldry, S., [R] ivregress Baum, C. F., [R] gmm, [R] heckman, [R] heckoprobit, [R] heckprobit, [R] ivregress, [R] ivregress postestimation, [R] margins, [R] net, [R] net search, [R] regress postestimation, [R] regress postestimation time series, [R] ssc Bayart, D., [R] qc Beale, E. M. L., [R] stepwise, [R] test Beaton, A. E., [R] rreg Becketti, S., [R] fp, [R] fp postestimation, [R] regress, [R] runtest, [R] spearman Beggs, S., [R] rologit Belanger, A. J., [R] sktest Bellocco, R., [R] glm, [R] logit Belotti, F., [R] frontier Belsley, D. A., [R] regress postestimation, [R] regress postestimation diagnostic plots Bendel, R. B., [R] stepwise Benedetti, J. K., [R] tetrachoric Beniger, J. R., [R] cumul Bera, A. K., [R] sktest Beran, R. J., [R] regress postestimation time series Berk, K. N., [R] stepwise Berk, R. A., [R] rreg Berkson, J., [R] logit, [R] probit Bern, P. H., [R] nestreg Bernasco, W., [R] tetrachoric Berndt, E. K., [R] glm Berndt, E. R., [R] truncreg 2504 Author index Berry, G., [R] ameans, [R] expoisson, [R] sdtest Berry, K. J., [R] ranksum Bewley, R., [R] reg3 Beyer, W. H., [R] qc Bickeböller, H., [R] symmetry Bickel, P. J., [R] rreg Birdsall, T. G., [R] lroc Black, W. C., [R] rologit Blackwell, J. L., III, [R] areg Bland, M., [R] ranksum, [R] sdtest, [R] signrank, [R] spearman Blevins, J. R., [R] hetprobit Bliese, P. D., [R] icc Bliss, C. I., [R] probit Bloch, D. A., [R] brier Bloomfield, P., [R] qreg Blundell, R., [R] gmm, [R] ivprobit BMDP, [R] symmetry Bofinger, E., [R] qreg Boice, J. D., Jr., [R] bitest Boland, P. J., [R] ttest Bolduc, D., [R] asmprobit Bollen, K. A., [R] regress postestimation Bond, S., [R] gmm Bonferroni, C. E., [R] correlate Borenstein, M., [R] meta Bottai, M., [R] qreg Bound, J., [R] ivregress postestimation Bowker, A. H., [R] symmetry Box, G. E. P., [R] anova, [R] boxcox, [R] lnskew0 Box, J. F., [R] anova Boyd, N. F., [R] kappa Brackstone, G. J., [R] diagnostic plots, [R] swilk Bradley, R. A., [R] signrank Brady, A. R., [R] logistic, [R] spikeplot Brant, R., [R] ologit Breslow, N. E., [R] clogit, [R] dstdize, [R] symmetry Breusch, T. S., [R] regress postestimation, [R] regress postestimation time series, [R] sureg Brier, G. W., [R] brier Brillinger, D. R., [R] jackknife Brook, R. H., [R] brier Brown, D. R., [R] anova, [R] contrast, [R] loneway, [R] oneway, [R] pwcompare Brown, L. D., [R] ci Brown, M. B., [R] sdtest, [R] tetrachoric Brown, S. E., [R] symmetry Brown, W., [R] icc Bru, B., [R] poisson Brzezinski, M., [R] swilk Buchner, D. M., [R] ladder Buis, M. L., [R] constraint, [R] eform option, [R] logistic, [R] logit, [R] margins Bunch, D. S., [R] asmprobit Burke, W. 
J., [R] tobit Burnam, M. A., [R] lincom, [R] mlogit, [R] mprobit, [R] mprobit postestimation, [R] predictnl, [R] slogit Burr, I. W., [R] qc Buskens, V., [R] tabstat C Cai, T., [R] rocreg Cai, T. T., [R] ci Cameron, A. C., [R] asclogit, [R] asmprobit, [R] bootstrap, [R] gmm, [R] heckman, [R] heckoprobit, [R] intreg, [R] ivpoisson, [R] ivregress, [R] ivregress postestimation, [R] logit, [R] mprobit, [R] nbreg, [R] ologit, [R] oprobit, [R] poisson, [R] probit, [R] qreg, [R] regress, [R] regress postestimation, [R] simulate, [R] sureg, [R] tnbreg, [R] tobit, [R] tpoisson, [R] zinb postestimation, [R] zip postestimation Campbell, M. J., [R] ci, [R] kappa, [R] poisson, [R] tabulate twoway Canette, I., [R] nl, [R] nlsur Cappellari, L., [R] asmprobit Cardell, S., [R] rologit Carlile, T., [R] kappa Carlin, J. B., [R] ameans Carpenter, J. R., [R] bootstrap, [R] bstat Carroll, R. J., [R] boxcox, [R] rreg, [R] sdtest Carson, R. T., [R] tnbreg, [R] tpoisson Carter, S. L., [R] frontier, [R] lrtest, [R] nbreg Caudill, S. B., [R] frontier Caulcutt, R., [R] qc Chadwick, J., [R] poisson Chaimani, A., [R] meta Chakraborti, S., [R] ksmirnov Chamberlain, G., [R] clogit, [R] gmm, [R] qreg Chambers, J. M., [R] diagnostic plots, [R] grmeanby, [R] lowess Chang, I. M., [R] margins Charlett, A., [R] fp Chatfield, M., [R] anova Chatterjee, S., [R] poisson, [R] regress, [R] regress postestimation, [R] regress postestimation diagnostic plots Chen, X., [R] logistic, [R] logistic postestimation, [R] logit Chiburis, R., [R] heckman, [R] heckoprobit, [R] heckprobit, [R] oprobit Choi, B. C. K., [R] rocfit, [R] rocreg postestimation, [R] rocregplot, [R] roctab Chow, G. C., [R] contrast Chow, S.-C., [R] pk, [R] pkcross, [R] pkequiv, [R] pkexamine, [R] pkshape Christakis, N., [R] rologit Clark, V. A., [R] stepwise Clarke, R. D., [R] poisson Clarke-Pearson, D. L., [R] roccomp, [R] rocreg, [R] roctab Author index 2505 Clarkson, D. B., [R] tabulate twoway Clayton, D. G., [R] cloglog, [R] cumul Clerget-Darpoux, F., [R] symmetry Cleveland, W. S., [R] diagnostic plots, [R] lowess, [R] lpoly, [R] sunflower Cleves, M. A., [R] binreg, [R] dstdize, [R] logistic, [R] logit, [R] roccomp, [R] rocfit, [R] rocreg, [R] rocreg postestimation, [R] rocregplot, [R] roctab, [R] sdtest, [R] symmetry Clogg, C. C., [R] suest Clopper, C. J., [R] ci Cobb, G. W., [R] anova Cochran, W. G., [R] ameans, [R] anova, [R] correlate, [R] dstdize, [R] mean, [R] oneway, [R] poisson, [R] probit, [R] proportion, [R] ranksum, [R] ratio, [R] signrank, [R] total Coelli, T. J., [R] frontier Cohen, J., [R] esize, [R] kappa, [R] pcorr Cohen, P., [R] pcorr Coleman, J. S., [R] poisson Collett, D., [R] clogit, [R] logistic, [R] logistic postestimation Cone-Wesson, B., [R] rocreg, [R] rocreg postestimation, [R] rocregplot Cong, R., [R] tobit, [R] tobit postestimation, [R] truncreg Conover, W. J., [R] centile, [R] ksmirnov, [R] kwallis, [R] nptrend, [R] sdtest, [R] spearman, [R] tabulate twoway Conroy, R. M., [R] intreg, [R] ranksum Consonni, D., [R] dstdize Cook, A., [R] ci Cook, N. R., [R] rocreg Cook, R. D., [R] boxcox, [R] regress postestimation Coster, D., [R] contrast Coull, B. A., [R] ci Cox, D. R., [R] boxcox, [R] exlogistic, [R] expoisson, [R] lnskew0 Cox, G. M., [R] anova Cox, N. 
J., [R] ci, [R] cumul, [R] diagnostic plots, [R] histogram, [R] inequality, [R] kappa, [R] kdensity, [R] ladder, [R] lowess, [R] lpoly, [R] net, [R] net search, [R] regress postestimation, [R] regress postestimation diagnostic plots, [R] search, [R] serrbar, [R] sktest, [R] smooth, [R] spikeplot, [R] ssc, [R] stem, [R] summarize, [R] sunflower, [R] tabulate oneway, [R] tabulate twoway Cragg, J. G., [R] ivregress postestimation Cramér, H., [R] tabulate twoway Cramer, J. S., [R] logit Cronbach, L. J., [R] icc Croux, C., [R] rreg Crowther, M. J., [R] meta Cui, J., [R] symmetry Cumming, G., [R] esize, [R] regress postestimation Cummings, P., [R] binreg, [R] glm, [R] margins Curts-Garcı́a, J., [R] smooth Cuzick, J., [R] kappa, [R] nptrend D D’Agostino, R. B., [R] sktest D’Agostino, R. B., Jr., [R] sktest Daidone, S., [R] frontier Daniel, C., [R] diagnostic plots, [R] oneway Danuso, F., [R] nl DasGupta, A., [R] ci Davey Smith, G., [R] meta David, F. N., [R] correlate David, H. A., [R] spearman, [R] summarize Davidson, R., [R] boxcox, [R] cnsreg, [R] gmm, [R] intreg, [R] ivregress, [R] ivregress postestimation, [R] mlogit, [R] nl, [R] nlsur, [R] reg3, [R] regress, [R] regress postestimation time series, [R] tobit, [R] truncreg Davison, A. C., [R] bootstrap Day, N. E., [R] clogit, [R] dstdize, [R] symmetry de Irala-Estévez, J., [R] logistic De Luca, G., [R] biprobit, [R] heckoprobit, [R] heckprobit, [R] oprobit, [R] probit de Wolf, I., [R] rologit Deaton, A. S., [R] nlsur Deb, P., [R] nbreg Debarsy, N., [R] lpoly Dehon, C., [R] correlate DeLong, D. M., [R] roccomp, [R] rocreg, [R] roctab DeLong, E. R., [R] roccomp, [R] rocreg, [R] roctab DeMaris, A., [R] regress postestimation Desbordes, R., [R] ivregress Desmarais, B. A., [R] zinb, [R] zip Dewey, M. E., [R] correlate Didelez, V., [R] ivregress Digby, P. G. N., [R] tetrachoric Dixon, W. J., [R] ttest Djulbegovic, B., [R] meta Dobson, A. J., [R] glm Dodd, L. E., [R] rocreg Dohoo, I., [R] regress Doll, R., [R] poisson Donald, A., [R] meta Donald, S. G., [R] ivregress postestimation Donner, A., [R] loneway Donoho, D. L., [R] lpoly Dore, C. J., [R] fp Dorfman, D. D., [R] rocfit, [R] rocreg Doris, A., [R] gmm, [R] inequality Draper, N., [R] eivreg, [R] oneway, [R] regress, [R] stepwise Drukker, D. M., [R] asmprobit, [R] boxcox, [R] frontier, [R] lrtest, [R] nbreg, [R] tobit Duan, N., [R] boxcox postestimation, [R] heckman Duncan, A. J., [R] qc Dunn, G., [R] kappa Dunnett, C. W., [R] mprobit, [R] pwcompare Dunnington, G. W., [R] regress Dupont, W. D., [R] logistic, [R] mkspline, [R] sunflower 2506 Author index Durbin, J., [R] ivregress postestimation, [R] regress postestimation time series Duren, P., [R] regress Duval, R. D., [R] bootstrap, [R] jackknife, [R] rocreg, [R] rocregplot Edgington, E. S., [R] runtest Edwards, A. L., [R] anova Edwards, A. W. F., [R] tetrachoric Edwards, J. H., [R] tetrachoric Efron, B., [R] bootstrap, [R] qreg Efroymson, M. A., [R] stepwise Egger, M., [R] meta Eisenhart, C., [R] correlate, [R] runtest Ellis, C. D., [R] poisson Ellis, P. D., [R] esize, [R] regress postestimation Eltinge, J. L., [R] test Emerson, J. D., [R] lv, [R] stem Ender, P. B., [R] marginsplot Engel, A., [R] boxcox, [R] marginsplot Engle, R. F., [R] regress postestimation time series Erdreich, L. S., [R] roccomp, [R] rocfit, [R] roctab Eubank, R. L., [R] lpoly Evans, M. A., [R] pk, [R] pkcross Everitt, B. S., [R] gllamm, [R] glm Ewens, W. J., [R] symmetry Ezekiel, M., [R] regress postestimation diagnostic plots Fisher, N. 
I., [R] regress postestimation time series Fisher, R. A., [R] anova, [R] anova, [R] esize, [R] ranksum, [R] signrank, [R] tabulate twoway Flannery, B. P., [R] dydx, [R] vwls Fleiss, J. L., [R] dstdize, [R] icc, [R] kappa Fletcher, K., [R] rocreg, [R] rocreg postestimation, [R] rocregplot Flynn, Z. L., [R] gmm Folsom, R. C., [R] rocreg, [R] rocreg postestimation, [R] rocregplot Ford, J. M., [R] frontier Forsythe, A. B., [R] sdtest Forthofer, R. N., [R] dstdize Foster, A., [R] regress Fouladi, R. T., [R] esize Fourier, J. B. J., [R] cumul Fox, J., [R] kdensity, [R] lv Fox, W. C., [R] lroc Francia, R. S., [R] swilk Freese, J., [R] asroprobit, [R] clogit, [R] cloglog, [R] logistic, [R] logit, [R] mlogit, [R] mprobit, [R] nbreg, [R] ologit, [R] oprobit, [R] poisson, [R] probit, [R] regress, [R] regress postestimation, [R] tnbreg, [R] tpoisson, [R] zinb, [R] zip Frölich, M., [R] qreg Frome, E. L., [R] qreg Frydenberg, M., [R] dstdize, [R] roccomp, [R] roctab Fu, V. K., [R] ologit Fuller, W. A., [R] regress, [R] spearman F G Fagerland, M. W., [R] estat gof, [R] mlogit postestimation Fan, J., [R] lpoly Fan, Y.-A., [R] tabulate twoway Fang, K.-T., [R] asmprobit Farbmacher, H., [R] tpoisson Feiveson, A. H., [R] nlcom, [R] ranksum Feldt, L. S., [R] anova Ferri, H. A., [R] kappa Festinger, L., [R] ranksum Field, C. A., [R] bootstrap Fieller, E. C., [R] pkequiv Fienberg, S. E., [R] kwallis, [R] tabulate twoway Filon, L. N. G., [R] correlate Filoso, V., [R] regress Finch, S., [R] esize Findley, D. F., [R] estat ic Findley, T. W., [R] ladder Finlay, K., [R] ivprobit, [R] ivregress, [R] ivtobit Finney, D. J., [R] probit, [R] tabulate twoway Fiorio, C. V., [R] kdensity Fiser, D. H., [R] estat gof, [R] lroc Fishell, E., [R] kappa Fisher, L. D., [R] anova, [R] dstdize, [R] oneway Gail, M. H., [R] rocreg, [R] rocreg postestimation Gall, J.-R. L., [R] estat gof, [R] logistic Gallant, A. R., [R] ivregress, [R] nl Gallup, J. L., [R] estimates table Galton, F., [R] correlate, [R] cumul, [R] regress, [R] summarize Gan, F. F., [R] diagnostic plots Garrett, J. M., [R] logistic, [R] logistic postestimation, [R] regress postestimation Garsd, A., [R] exlogistic Gasser, T., [R] lpoly Gastwirth, J. L., [R] sdtest Gates, R., [R] asmprobit Gauss, J. C. F., [R] regress Gauvreau, K., [R] dstdize, [R] logistic Geisser, S., [R] anova Gel, Y. R., [R] sdtest Gelbach, J., [R] ivprobit, [R] ivtobit Gelman, R., [R] margins Genest, C., [R] diagnostic plots, [R] swilk Gentle, J. E., [R] anova, [R] nl Genton, M. G., [R] sktest Genz, A., [R] asmprobit Gerkins, V. R., [R] symmetry E Author index 2507 Geweke, J., [R] asmprobit Gibbons, J. D., [R] ksmirnov, [R] spearman Giesen, D., [R] tetrachoric Gijbels, I., [R] lpoly Gillham, N. W., [R] regress Gillispie, C. C., [R] regress Gini, R., [R] vwls Glass, G. V., [R] esize Gleason, J. R., [R] anova, [R] bootstrap, [R] ci, [R] correlate, [R] loneway, [R] summarize, [R] ttest Glidden, D. V., [R] logistic Gnanadesikan, R., [R] cumul, [R] diagnostic plots Godfrey, L. G., [R] regress postestimation time series Goeden, G. B., [R] kdensity Goerg, S. J., [R] ksmirnov Goldberger, A. S., [R] intreg, [R] mlexp, [R] tobit Goldstein, R., [R] brier, [R] correlate, [R] inequality, [R] nl, [R] ologit, [R] oprobit, [R] ranksum, [R] regress postestimation Golub, G. H., [R] orthog, [R] tetrachoric Good, P. I., [R] permute, [R] symmetry, [R] tabulate twoway Goodall, C., [R] lowess, [R] rreg Goodman, L. A., [R] tabulate twoway Gordon, M. G., [R] binreg Gorga, M. 
P., [R] rocreg, [R] rocreg postestimation, [R] rocregplot Gorman, J. W., [R] stepwise Gosset [Student, pseud.], W. S., [R] ttest Gosset, W. S., [R] ttest Gould, W. W., [R] bootstrap, [R] bsample, [R] dydx, [R] frontier, [R] gmm, [R] grmeanby, [R] jackknife, [R] kappa, [R] logistic, [R] margins, [R] maximize, [R] mkspline, [R] ml, [R] mlexp, [R] net search, [R] nlcom, [R] ologit, [R] oprobit, [R] poisson, [R] predictnl, [R] qreg, [R] regress, [R] rreg, [R] simulate, [R] sktest, [R] smooth, [R] swilk, [R] testnl Gourieroux, C. S., [R] hausman, [R] suest, [R] test Graubard, B. I., [R] margins, [R] ml, [R] test Graybill, F. A., [R] centile Green, D. M., [R] lroc Greene, W. H., [R] asclogit, [R] asmprobit, [R] biprobit, [R] clogit, [R] cnsreg, [R] frontier, [R] gmm, [R] heckman, [R] heckoprobit, [R] heckprobit, [R] hetprobit, [R] ivregress, [R] logit, [R] lrtest, [R] margins, [R] mkspline, [R] mlexp, [R] mlogit, [R] nlogit, [R] nlsur, [R] pcorr, [R] probit, [R] reg3, [R] regress, [R] regress postestimation time series, [R] sureg, [R] testnl, [R] truncreg, [R] zinb, [R] zip Greenfield, S., [R] lincom, [R] mlogit, [R] mprobit, [R] mprobit postestimation, [R] predictnl, [R] slogit Greenhouse, S. W., [R] anova Greenland, S., [R] ci, [R] glogit, [R] mkspline, [R] ologit, [R] poisson Gregoire, A., [R] kappa Grieve, R., [R] bootstrap, [R] bstat Griffith, J. L., [R] brier Griffith, R., [R] gmm Griffiths, W. E., [R] cnsreg, [R] estat ic, [R] glogit, [R] ivregress, [R] ivregress postestimation, [R] logit, [R] probit, [R] regress, [R] regress postestimation, [R] test Grissom, R. J., [R] esize, [R] regress postestimation Grizzle, J. E., [R] vwls Grogger, J. T., [R] tnbreg, [R] tpoisson Gronau, R., [R] heckman Gropper, D. M., [R] frontier Guan, W., [R] bootstrap Gutierrez, R. G., [R] frontier, [R] lpoly, [R] lrtest, [R] nbreg H Haan, P., [R] asmprobit, [R] mlogit, [R] mprobit Hadi, A. S., [R] poisson, [R] regress, [R] regress postestimation, [R] regress postestimation diagnostic plots Hadorn, D. C., [R] brier Hahn, J., [R] ivregress postestimation Hair, J. F., Jr., [R] rologit Hajian-Tilaki, K. O., [R] rocreg Hajivassiliou, V. A., [R] asmprobit Hald, A., [R] qreg, [R] regress, [R] signrank, [R] summarize Haldane, J. B. S., [R] ranksum Hall, A. D., [R] frontier Hall, A. R., [R] gmm, [R] gmm postestimation, [R] ivpoisson, [R] ivpoisson postestimation, [R] ivregress, [R] ivregress postestimation Hall, B. H., [R] glm Hall, N. S., [R] anova Hall, P., [R] bootstrap, [R] qreg, [R] regress postestimation time series Hall, R. E., [R] glm Hall, W. J., [R] roccomp, [R] rocfit, [R] roctab Hallock, K., [R] qreg Halvorsen, K. T., [R] tabulate twoway Hamerle, A., [R] clogit Hamilton, J. D., [R] gmm Hamilton, L. C., [R] bootstrap, [R] diagnostic plots, [R] estat vce, [R] ladder, [R] lv, [R] mlogit, [R] regress, [R] regress postestimation, [R] regress postestimation diagnostic plots, [R] rreg, [R] simulate, [R] summarize, [R] ttest Hampel, F. R., [R] rreg Hanley, J. A., [R] roccomp, [R] rocfit, [R] rocreg, [R] rocreg postestimation, [R] rocregplot, [R] roctab Hansen, L. P., [R] gmm, [R] ivregress, [R] ivregress postestimation Hao, L., [R] qreg 2508 Author index Harbord, R. M., [R] roccomp, [R] roctab Harden, J. J., [R] zinb, [R] zip Hardin, J. W., [R] binreg, [R] biprobit, [R] estat ic, [R] glm, [R] glm postestimation, [R] lroc, [R] poisson, [R] ranksum, [R] regress postestimation, [R] signrank Haritou, A., [R] suest Harkness, J., [R] ivprobit, [R] ivtobit Harrell, F. E., Jr., [R] mkspline, [R] ologit Harris, R. 
L., [R] qc Harris, T., [R] poisson, [R] qreg, [R] ranksum, [R] signrank Harrison, D. A., [R] histogram, [R] tabulate oneway, [R] tabulate twoway Harrison, J. A., [R] dstdize Hartmann, D. P., [R] icc Harvey, A. C., [R] hetprobit Hastie, T. J., [R] grmeanby, [R] slogit Hauck, W. W., [R] pkequiv Haughton, J. H., [R] inequality Hausman, J. A., [R] glm, [R] hausman, [R] ivregress postestimation, [R] nlogit, [R] rologit, [R] suest Havnes, T., [R] inequality Hayashi, F., [R] gmm, [R] ivpoisson, [R] ivregress, [R] ivregress postestimation Hayes, R. J., [R] permute Hays, R. D., [R] lincom, [R] mlogit, [R] mprobit, [R] mprobit postestimation, [R] predictnl, [R] slogit Hays, W. L., [R] esize, [R] regress postestimation Heagerty, P. J., [R] anova, [R] dstdize, [R] oneway Heckman, J., [R] biprobit, [R] heckman, [R] heckman postestimation, [R] heckoprobit, [R] heckprobit Hedges, L. V., [R] esize, [R] meta Heiss, F., [R] nlogit Henderson, B. E., [R] symmetry Hendrickx, J., [R] mlogit, [R] xi Hensher, D. A., [R] nlogit Hickam, D. H., [R] brier Higgins, J. E., [R] anova Higgins, J. P. T., [R] meta Hilbe, J. M., [R] cloglog, [R] estat ic, [R] glm, [R] glm postestimation, [R] logistic, [R] logit, [R] lroc, [R] nbreg, [R] poisson, [R] probit, [R] simulate, [R] tnbreg, [R] tpoisson Hill, A. B., [R] poisson Hill, R. C., [R] cnsreg, [R] estat ic, [R] glogit, [R] heckman, [R] ivregress, [R] ivregress postestimation, [R] logit, [R] probit, [R] regress, [R] regress postestimation, [R] test Hills, M., [R] cloglog, [R] cumul Hinchliffe, S. R., [R] meta Hinkley, D. V., [R] bootstrap Hirji, K. F., [R] exlogistic, [R] expoisson Hoaglin, D. C., [R] diagnostic plots, [R] lv, [R] regress postestimation, [R] regress postestimation diagnostic plots, [R] smooth, [R] stem Hochberg, Y., [R] oneway Hocking, R. R., [R] stepwise Hoel, P. G., [R] bitest, [R] ttest Hoffmann, J. P., [R] glm Hole, A. R., [R] asmprobit, [R] clogit, [R] mlogit, [R] mprobit Holloway, L., [R] brier Holm, S., [R] test Holmes, S., [R] bootstrap Hood, W. C., [R] ivregress Hosmer, D. W., Jr., [R] clogit, [R] clogit postestimation, [R] estat classification, [R] estat gof, [R] glm, [R] glogit, [R] lincom, [R] logistic, [R] logistic postestimation, [R] logit, [R] logit postestimation, [R] lroc, [R] lrtest, [R] lsens, [R] mlogit, [R] mlogit postestimation, [R] predictnl, [R] stepwise Hotelling, H., [R] roccomp, [R] rocfit, [R] roctab Hozo, I., [R] meta Huang, C., [R] sunflower Huang, D. S., [R] nlsur, [R] sureg Huber, C., [R] esize, [R] regress postestimation Huber, P. J., [R] qreg, [R] rreg, [R] suest Hunter, D. R., [R] qreg Hurd, M., [R] intreg, [R] tobit Hutto, C., [R] exlogistic Huynh, H., [R] anova I Iglewicz, B., [R] lv Ilardi, G., [R] frontier Isaacs, D., [R] fp Ishiguro, M., [R] BIC note J Jackman, R. W., [R] regress postestimation Jacobs, K. B., [R] symmetry Jaeger, D. A., [R] ivregress postestimation James, B. R., [R] rocreg, [R] rocreg postestimation James, K. L., [R] rocreg, [R] rocreg postestimation Janes, H., [R] rocfit, [R] rocreg, [R] rocreg postestimation, [R] rocregplot Jann, B., [R] estimates store, [R] ksmirnov, [R] stored results, [R] tabulate twoway Jarque, C. M., [R] sktest Jeffreys, H., [R] ci, [R] spearman Jenkins, S. P., [R] asmprobit, [R] do, [R] inequality Joe, H., [R] tabulate twoway Johnson, D. E., [R] anova, [R] contrast, [R] pwcompare Johnson, M. E., [R] sdtest Johnson, M. M., [R] sdtest Johnson, N. L., [R] ksmirnov, [R] nbreg, [R] poisson Johnston, J. E., [R] ranksum Jolliffe, D., [R] inequality, [R] qreg, [R] regress Jolliffe, I. 
T., [R] brier Jones, A., [R] heckman, [R] logit, [R] probit Author index 2509 Jones, D. R., [R] meta Jones, M. C., [R] kdensity, [R] lpoly Judge, G. G., [R] estat ic, [R] glogit, [R] ivregress, [R] ivregress postestimation, [R] logit, [R] probit, [R] regress postestimation, [R] test Judson, D. H., [R] poisson, [R] tabulate twoway, [R] tpoisson Juul, S., [R] dstdize, [R] roccomp, [R] roctab K Kahn, H. A., [R] dstdize Kaiser, J., [R] ksmirnov, [R] permute, [R] signrank Kalmijn, M., [R] tetrachoric Keane, M. P., [R] asmprobit Keeler, E. B., [R] brier Kelley, K., [R] esize, [R] regress postestimation Kemp, A. W., [R] nbreg, [R] poisson Kempthorne, P. J., [R] regress postestimation Kendall, M. G., [R] centile, [R] spearman, [R] tabulate twoway Kennedy, W. J., Jr., [R] anova, [R] nl, [R] regress, [R] stepwise Kerlinger, F. N., [R] esize, [R] regress postestimation Keselman, H. J., [R] esize Kettenring, J. R., [R] diagnostic plots Keynes, J. M., [R] ameans Khan, S., [R] hetprobit Khandker, S. R., [R] inequality Kiernan, M., [R] kappa Kim, J. J., [R] esize, [R] regress postestimation Kirk, R. E., [R] esize, [R] regress postestimation Kirkwood, B. R., [R] dstdize, [R] summarize Kish, L., [R] loneway Kitagawa, G., [R] BIC note Klar, J., [R] estat gof Kleiber, C., [R] inequality Klein, L. R., [R] reg3, [R] reg3 postestimation, [R] regress postestimation time series Klein, M., [R] binreg, [R] clogit, [R] logistic, [R] lrtest, [R] mlogit, [R] ologit Kleinbaum, D. G., [R] binreg, [R] clogit, [R] logistic, [R] lrtest, [R] mlogit, [R] ologit Kleiner, B., [R] diagnostic plots, [R] lowess Kline, R. B., [R] esize, [R] regress postestimation Kmenta, J., [R] eivreg, [R] ivregress, [R] regress Koch, G. G., [R] anova, [R] kappa, [R] vwls Koehler, K. J., [R] diagnostic plots Koenker, R., [R] qreg, [R] regress postestimation Kohler, U., [R] estat classification, [R] kdensity, [R] regress, [R] regress postestimation, [R] regress postestimation diagnostic plots Kolmogorov, A. N., [R] ksmirnov Kontopantelis, E., [R] meta Koopman, S. J., [R] regress postestimation time series Koopmans, T. C., [R] ivregress Korn, E. L., [R] margins, [R] ml, [R] test Kotz, S., [R] inequality, [R] ksmirnov, [R] nbreg, [R] nlogit, [R] poisson Kreuter, F., [R] estat classification, [R] kdensity, [R] regress, [R] regress postestimation, [R] regress postestimation diagnostic plots Krushelnytskyy, B., [R] inequality, [R] qreg Kruskal, W. H., [R] kwallis, [R] ranksum, [R] spearman, [R] tabulate twoway Kuehl, R. O., [R] anova, [R] contrast, [R] icc, [R] oneway Kuh, E., [R] regress postestimation, [R] regress postestimation diagnostic plots Kumbhakar, S. C., [R] frontier, [R] frontier postestimation Kung, D. S., [R] qreg Kutner, M. H., [R] pkcross, [R] pkequiv, [R] pkshape, [R] regress postestimation L Lachenbruch, P. A., [R] diagnostic plots Lacy, M. G., [R] permute Lafontaine, F., [R] boxcox Lahiri, K., [R] tobit Lai, S., [R] exlogistic Laird, N. M., [R] expoisson Lambert, D., [R] zip Lambert, P. C., [R] poisson Landis, J. R., [R] kappa Lane, P. W., [R] margins Langan, D., [R] meta Lange, K., [R] qreg Laplace, P.-S., [R] regress Larsen, W. A., [R] regress postestimation diagnostic plots Lash, T. L., [R] ci, [R] glogit, [R] poisson Lauritzen, S. L., [R] summarize Lee, E. S., [R] dstdize Lee, E. T., [R] roccomp, [R] rocfit, [R] roctab Lee, T.-C., [R] estat ic, [R] glogit, [R] ivregress, [R] ivregress postestimation, [R] logit, [R] probit, [R] regress postestimation, [R] test Lee, W. C., [R] roctab Legendre, A.-M., [R] regress Lehmann, E. 
L., [R] oneway Lemeshow, S. A., [R] clogit, [R] clogit postestimation, [R] estat classification, [R] estat gof, [R] glm, [R] glogit, [R] lincom, [R] logistic, [R] logistic postestimation, [R] logit, [R] logit postestimation, [R] lroc, [R] lrtest, [R] lsens, [R] mlogit, [R] predictnl, [R] stepwise Leroy, A. M., [R] qreg, [R] regress postestimation, [R] rreg Levene, H., [R] sdtest Levin, B., [R] dstdize, [R] kappa Levinsohn, J. A., [R] frontier Levy, D. E., [R] sunflower Lewis, H. G., [R] heckman Lewis, I. G., [R] binreg 2510 Author index Lewis, J. D., [R] fp Li, G., [R] rreg Li, W., [R] pkcross, [R] pkequiv, [R] pkshape Libois, F., [R] fp Lim, G. C., [R] cnsreg, [R] regress, [R] regress postestimation Lindley, D. V., [R] ci Lindsey, C., [R] boxcox, [R] lowess, [R] regress postestimation, [R] regress postestimation diagnostic plots, [R] stepwise Linhart, J. M., [R] lpoly Lipset, S. M., [R] histogram Liu, J.-P., [R] pk, [R] pkcross, [R] pkequiv, [R] pkexamine, [R] pkshape Locke, C. S., [R] pkequiv Lockwood, J. R., [R] areg Lokshin, M., [R] biprobit, [R] heckman, [R] heckoprobit, [R] heckprobit, [R] oprobit Long, J. S., [R] asroprobit, [R] clogit, [R] cloglog, [R] intreg, [R] logistic, [R] logit, [R] mlogit, [R] mprobit, [R] nbreg, [R] ologit, [R] oprobit, [R] poisson, [R] probit, [R] regress, [R] regress postestimation, [R] testnl, [R] tnbreg, [R] tobit, [R] tpoisson, [R] zinb, [R] zip Longest, K. C., [R] tabulate twoway Longley, J. D., [R] kappa Longton, G. M., [R] rocfit, [R] rocreg, [R] rocreg postestimation, [R] rocregplot López-Feldman, A., [R] inequality Lorenz, M. O., [R] inequality Louis, T. A., [R] tabulate twoway Lovell, C. A. K., [R] frontier, [R] frontier postestimation Lovie, A. D., [R] spearman Lovie, P., [R] spearman Lucas, H. L., [R] pkcross Luce, R. D., [R] rologit Lumley, T. S., [R] anova, [R] dstdize, [R] oneway Lunt, M., [R] ologit, [R] slogit Lütkepohl, H., [R] estat ic, [R] glogit, [R] ivregress, [R] ivregress postestimation, [R] logit, [R] probit, [R] regress postestimation, [R] test M Ma, G., [R] roccomp, [R] rocfit, [R] roctab Machin, D., [R] ci, [R] kappa, [R] tabulate twoway Mack, T. M., [R] symmetry MacKinnon, J. G., [R] boxcox, [R] cnsreg, [R] gmm, [R] intreg, [R] ivregress, [R] ivregress postestimation, [R] mlogit, [R] nl, [R] nlsur, [R] reg3, [R] regress, [R] regress postestimation time series, [R] tobit, [R] truncreg MacRae, K. D., [R] binreg Madansky, A., [R] runtest Maddala, G. S., [R] nlogit, [R] tobit Magnusson, L. M., [R] gmm, [R] ivprobit, [R] ivregress, [R] ivtobit Mallows, C. L., [R] regress postestimation diagnostic plots Mander, A. P., [R] anova, [R] symmetry Mann, H. B., [R] kwallis, [R] ranksum Manning, W. G., [R] heckman Manski, C. F., [R] gmm Mantel, N., [R] stepwise Marchenko, Y. V., [R] anova, [R] loneway, [R] oneway, [R] sktest Marden, J. I., [R] rologit Markowski, C. A., [R] sdtest Markowski, E. P., [R] sdtest Marschak, J., [R] ivregress Martin, W., [R] regress Martı́nez, M. A., [R] logistic Mascher, K., [R] rocreg, [R] rocreg postestimation, [R] rocregplot Massey, F. J., Jr., [R] ttest Massey, J. T., [R] boxcox, [R] marginsplot Master, I. M., [R] exlogistic Mastrucci, M. T., [R] exlogistic Matthews, J. N. S., [R] ameans, [R] expoisson, [R] sdtest Mátyás, L., [R] gmm Maurer, K., [R] boxcox, [R] marginsplot Maxwell, A. E., [R] symmetry May, S., [R] stepwise McCaffrey, D. F., [R] areg McCleary, S. J., [R] regress postestimation diagnostic plots McClish, D. 
K., [R] rocreg McCullagh, P., [R] binreg, [R] binreg postestimation, [R] glm, [R] glm postestimation, [R] ologit, [R] rologit McCulloch, C. E., [R] logistic McDonald, J. A., [R] sunflower McDonald, J. F., [R] tobit, [R] tobit postestimation McDowell, A., [R] boxcox, [R] marginsplot McDowell, A. W., [R] sureg McFadden, D. L., [R] asclogit, [R] asmprobit, [R] clogit, [R] hausman, [R] maximize, [R] nlogit, [R] suest McGill, R., [R] sunflower McGinnis, R. E., [R] symmetry McGraw, K. O., [R] icc McGuire, T. J., [R] dstdize McKelvey, R. D., [R] ologit McNeil, B. J., [R] roccomp, [R] rocfit, [R] rocreg, [R] rocreg postestimation, [R] rocregplot, [R] roctab McNeil, D., [R] poisson Meeusen, W., [R] frontier Mehta, C. R., [R] exlogistic, [R] exlogistic postestimation, [R] expoisson, [R] tabulate twoway Melly, B., [R] qreg Mensing, R. W., [R] anova postestimation Metz, C. E., [R] lroc Author index 2511 Miao, W., [R] sdtest Michels, K. M., [R] anova, [R] contrast, [R] loneway, [R] oneway, [R] pwcompare Mielke, P. W., Jr., [R] brier, [R] ranksum Mihaly, K., [R] areg Miladinovic, B., [R] meta Miller, A. B., [R] kappa Miller, R. G., Jr., [R] diagnostic plots, [R] oneway, [R] pwcompare Milliken, G. A., [R] anova, [R] contrast, [R] margins, [R] pwcompare Miranda, A., [R] gllamm, [R] heckoprobit, [R] heckprobit, [R] ivprobit, [R] ivtobit, [R] logistic, [R] logit, [R] nbreg, [R] ologit, [R] oprobit, [R] poisson, [R] probit Mitchell, C., [R] exlogistic Mitchell, M. N., [R] anova, [R] anova postestimation, [R] contrast, [R] logistic, [R] logistic postestimation, [R] logit, [R] margins, [R] marginsplot, [R] pwcompare, [R] regress Moffitt, R. A., [R] tobit, [R] tobit postestimation Mogstad, M., [R] inequality Monfort, A., [R] hausman, [R] suest, [R] test Monson, R. R., [R] bitest Montoya, D., [R] rocreg, [R] rocreg postestimation, [R] rocregplot Mood, A. M., [R] centile Mooney, C. Z., [R] bootstrap, [R] jackknife, [R] rocreg, [R] rocregplot Moran, J. L., [R] dstdize Morris, C., [R] bootstrap Morris, N. F., [R] binreg Moskowitz, M., [R] kappa Mosteller, C. F., [R] jackknife, [R] regress, [R] regress postestimation diagnostic plots, [R] rreg Moulton, L. H., [R] permute Muellbauer, J., [R] nlsur Mullahy, J., [R] gmm, [R] ivpoisson, [R] zinb, [R] zip Müller, H.-G., [R] lpoly Muro, J., [R] heckoprobit, [R] heckprobit Murphy, A. H., [R] brier Murray-Lyon, I. M., [R] binreg Muñoz, J., [R] exlogistic N Nachtsheim, C. J., [R] pkcross, [R] pkequiv, [R] pkshape, [R] regress postestimation Nadarajah, S., [R] nlogit Nadaraya, E. A., [R] lpoly Nagler, J., [R] scobit Naiman, D. Q., [R] qreg Narula, S. C., [R] qreg Nee, J. C. M., [R] kappa Neely, S. T., [R] rocreg, [R] rocreg postestimation, [R] rocregplot Nelder, J. A., [R] binreg, [R] binreg postestimation, [R] glm, [R] glm postestimation, [R] margins, [R] ologit Nelson, C. R., [R] ivregress postestimation Nelson, E. C., [R] lincom, [R] mlogit, [R] mprobit, [R] mprobit postestimation, [R] predictnl, [R] slogit Nelson, F. D., [R] logit, [R] probit Neter, J., [R] pkcross, [R] pkequiv, [R] pkshape, [R] regress postestimation Newey, W. K., [R] glm, [R] gmm, [R] ivpoisson, [R] ivprobit, [R] ivregress, [R] ivtobit Newman, S. C., [R] poisson Newson, R. B., [R] centile, [R] glm, [R] glm postestimation, [R] inequality, [R] kwallis, [R] logistic postestimation, [R] logit postestimation, [R] margins, [R] mkspline, [R] ranksum, [R] signrank, [R] spearman, [R] tabulate twoway Newton, H. J., [R] kdensity Neyman, J., [R] ci Ng, E. S.-W., [R] bootstrap, [R] bstat Nicewander, W. 
A., [R] correlate Nichols, A., [R] ivregress, [R] reg3 Nickell, S. J., [R] gmm Nolan, D., [R] diagnostic plots Norton, S. J., [R] rocreg, [R] rocreg postestimation, [R] rocregplot O O’Fallon, W. M., [R] logit O’Neill, D., [R] gmm, [R] inequality Oehlert, G. W., [R] nlcom, [R] rocreg postestimation, [R] rocregplot Olivier, D., [R] expoisson Olkin, I., [R] kwallis Olson, J. M., [R] symmetry Ord, J. K., [R] centile, [R] mean, [R] proportion, [R] qreg, [R] ratio, [R] summarize, [R] total Orsini, N., [R] glm, [R] logit, [R] mkspline, [R] qreg Ostle, B., [R] anova postestimation Over, M., [R] regress P Pacheco, J. M., [R] dstdize Pagan, A. R., [R] frontier, [R] regress postestimation, [R] sureg Pagano, M., [R] dstdize, [R] logistic, [R] margins, [R] tabulate twoway Paik, M. C., [R] dstdize, [R] kappa Palmer, T. M., [R] ivregress Pampel, F. C., [R] logistic, [R] logit, [R] probit Panis, C., [R] mkspline Park, H. J., [R] regress Park, J. Y., [R] boxcox, [R] margins, [R] nlcom, [R] predictnl, [R] rocreg postestimation, [R] rocregplot, [R] testnl Parks, W. P., [R] exlogistic Parner, E. T., [R] glm 2512 Author index Parzen, E., [R] estat ic, [R] kdensity Pasquini, J., [R] vwls Patel, N. R., [R] exlogistic, [R] exlogistic postestimation, [R] expoisson, [R] tabulate twoway Patterson, H. D., [R] pkcross Paul, C., [R] logistic Pearce, M. S., [R] logistic Pearson, E. S., [R] ci, [R] ttest Pearson, K., [R] correlate, [R] correlate, [R] esize, [R] tabulate twoway Penfield, R. D., [R] esize Pepe, M. S., [R] roc, [R] roccomp, [R] rocfit, [R] rocreg, [R] rocreg postestimation, [R] rocregplot, [R] roctab Peracchi, F., [R] regress, [R] regress postestimation Pérez-Hernández, M. A., [R] kdensity Pérez-Hoyos, S., [R] lrtest Perkins, A. M., [R] ranksum Perotti, V., [R] heckoprobit, [R] heckprobit, [R] oprobit Perrin, E., [R] lincom, [R] mlogit, [R] mprobit, [R] mprobit postestimation, [R] predictnl, [R] slogit Pesarin, F., [R] tabulate twoway Peterson, B., [R] ologit Peterson, W. W., [R] lroc Petitclerc, M., [R] kappa Petkova, E., [R] suest Petrin, A. K., [R] frontier Pfeffer, R. I., [R] symmetry Phillips, P. C. B., [R] boxcox, [R] margins, [R] nlcom, [R] predictnl, [R] regress postestimation time series, [R] rocreg postestimation, [R] rocregplot, [R] testnl Pickles, A., [R] gllamm, [R] glm Pike, M. C., [R] symmetry Pindyck, R. S., [R] biprobit, [R] heckprobit Pischke, J.-S., [R] ivregress, [R] ivregress postestimation, [R] qreg, [R] regress Pitblado, J. S., [R] frontier, [R] gmm, [R] lpoly, [R] maximize, [R] ml, [R] mlexp Plackett, R. L., [R] ameans, [R] regress, [R] rologit, [R] summarize, [R] ttest Plummer, W. D., Jr., [R] sunflower Poi, B. P., [R] bootstrap, [R] bstat, [R] frontier, [R] gmm, [R] ivregress, [R] ivregress postestimation, [R] maximize, [R] ml, [R] mlexp, [R] nl, [R] nlsur, [R] reg3 Poirier, D. J., [R] biprobit Poisson, S. D., [R] poisson Pollock, P. H., III, [R] histogram Ponce de Leon, A., [R] roccomp, [R] roctab Porter, T. M., [R] correlate Powers, D. A., [R] logistic postestimation, [R] logit, [R] logit postestimation, [R] probit Preacher, K. J., [R] esize, [R] regress postestimation Preece, D. A., [R] ttest Pregibon, D., [R] glm, [R] linktest, [R] logistic, [R] logistic postestimation, [R] logit, [R] logit postestimation Press, W. H., [R] dydx, [R] vwls Punj, G. 
N., [R] rologit
R
Rabe-Hesketh, S., [R] gllamm, [R] glm, [R] heckoprobit, [R] heckprobit, [R] ivprobit, [R] ivtobit, [R] logistic, [R] logit, [R] nbreg, [R] ologit, [R] oprobit, [R] poisson, [R] probit
Raciborski, R., [R] poisson, [R] tpoisson
Raftery, A. E., [R] BIC note, [R] estat ic, [R] glm
Ramalheira, C., [R] ameans
Ramsahai, R. R., [R] ivregress
Ramsey, J. B., [R] regress postestimation
Ratkowsky, D. A., [R] nl, [R] pk, [R] pkcross
Redelmeier, D. A., [R] brier
Reeves, D., [R] meta
Reichenheim, M. E., [R] kappa, [R] roccomp, [R] roctab
Reid, C., [R] ci
Reilly, M., [R] logistic
Relles, D. A., [R] rreg
Rencher, A. C., [R] anova postestimation
Revankar, N. S., [R] frontier
Richardson, W., [R] ttest
Riffenburgh, R. H., [R] ksmirnov, [R] kwallis
Riley, A. R., [R] net search
Ringquist, E. J., [R] meta
Rivers, D., [R] ivprobit
Roberson, P. K., [R] estat gof, [R] lroc
Robyn, D. L., [R] cumul
Rodgers, J. L., [R] correlate
Rodríguez, G., [R] nbreg, [R] poisson
Rogers, W. H., [R] brier, [R] glm, [R] heckman, [R] lincom, [R] mlogit, [R] mprobit, [R] mprobit postestimation, [R] nbreg, [R] poisson, [R] predictnl, [R] qreg, [R] regress, [R] rocreg, [R] rreg, [R] sktest, [R] slogit, [R] suest
Ronning, G., [R] clogit
Rose, J. M., [R] nlogit
Rosenthal, R., [R] contrast
Rosnow, R. L., [R] contrast
Ross, G. J. S., [R] nl
Rossi, P. E., [R] sureg
Rothman, K. J., [R] ci, [R] dstdize, [R] glogit, [R] poisson
Rothstein, H. R., [R] meta
Rousseeuw, P. J., [R] qreg, [R] regress postestimation, [R] rreg
Rovine, M. J., [R] correlate
Royston, P., [R] bootstrap, [R] centile, [R] cusum, [R] diagnostic plots, [R] dotplot, [R] dydx, [R] estat ic, [R] fp, [R] fp postestimation, [R] glm, [R] kdensity, [R] lnskew0, [R] lowess, [R] marginsplot, [R] mfp, [R] ml, [R] nl, [R] regress, [R] sktest, [R] smooth, [R] swilk
Rubin, D. B., [R] contrast
Rubin, H., [R] ivregress postestimation
Rubinfeld, D. L., [R] biprobit, [R] heckprobit
Rudebusch, G. D., [R] ivregress postestimation
Ruppert, D., [R] boxcox, [R] rreg
Rutherford, E., [R] poisson
Rutherford, M. J., [R] poisson
Ruud, P. A., [R] gmm, [R] rologit, [R] suest
Ryan, T. P., [R] qc
S
Sajaia, Z., [R] biprobit, [R] heckprobit
Sakamoto, Y., [R] BIC note
Salgado-Ugarte, I. H., [R] kdensity, [R] lowess, [R] smooth
Salim, A., [R] logistic
Sanders, F., [R] brier
Santos Silva, J. M. C., [R] gmm, [R] ivpoisson
Sargan, J. D., [R] ivregress postestimation
Sasieni, P. D., [R] dotplot, [R] glm, [R] lowess, [R] nptrend, [R] poisson, [R] smooth
Sass, T. R., [R] areg
Satterthwaite, F. E., [R] esize, [R] ttest
Sauerbrei, W., [R] bootstrap, [R] estat ic, [R] fp, [R] mfp
Savin, N. E., [R] regress postestimation time series
Saw, S. L. C., [R] qc
Sawa, T., [R] estat ic
Saxl, I., [R] correlate
Schaalje, G. B., [R] anova postestimation
Schaffer, M. E., [R] ivregress, [R] ivregress postestimation
Scheffé, H., [R] anova, [R] oneway
Schlesselman, J. J., [R] boxcox
Schlossmacher, E. J., [R] qreg
Schmidt, C. H., [R] brier
Schmidt, P., [R] frontier, [R] regress postestimation
Schneider, H., [R] sdtest
Schnell, D., [R] regress
Schonlau, M., [R] glm, [R] logistic, [R] logit, [R] poisson, [R] regress
Schuirmann, D. J., [R] pkequiv
Schwarz, G., [R] BIC note, [R] estat ic
Scott, D. W., [R] kdensity
Scott, E. L., [R] intro
Scott, G. B., [R] exlogistic
Scotto, M. G., [R] diagnostic plots
Searle, S. R., [R] contrast, [R] margins, [R] pwcompare, [R] pwmean
Seed, P. T., [R] ci, [R] correlate, [R] roccomp, [R] roctab, [R] sdtest, [R] spearman
Seidler, J., [R] correlate
Selvin, S., [R] poisson
Sempos, C. T., [R] dstdize
Semykina, A., [R] inequality, [R] qreg
Seneta, E., [R] correlate
Senn, S. J., [R] glm, [R] ttest
Shapiro, S. S., [R] swilk
Shea, J. S., [R] ivregress postestimation
Sheather, S. J., [R] boxcox, [R] lowess, [R] lpoly, [R] qreg, [R] regress postestimation, [R] regress postestimation diagnostic plots, [R] stepwise
Sheehan, N. A., [R] ivregress
Sheldon, T. A., [R] meta
Shewhart, W. A., [R] qc
Shiboski, S. C., [R] logistic
Shiller, R. J., [R] tobit
Shimizu, M., [R] kdensity, [R] lowess
Shrout, P. E., [R] icc, [R] kappa
Šidák, Z., [R] correlate, [R] oneway
Silverman, B. W., [R] kdensity, [R] qreg
Silvey, S. D., [R] ologit, [R] oprobit
Simonoff, J. S., [R] kdensity, [R] tnbreg, [R] tpoisson
Simor, I. S., [R] kappa
Singleton, K. J., [R] gmm
Sininger, Y., [R] rocreg, [R] rocreg postestimation, [R] rocregplot
Sitgreaves, R., [R] icc
Sjölander, P. C., [R] glm, [R] logit
Skrondal, A., [R] gllamm, [R] glm
Smeeton, N. C., [R] ranksum, [R] signrank
Smirnov, N. V., [R] ksmirnov
Smith, C. A. B., [R] ranksum
Smith, H., [R] eivreg, [R] oneway, [R] regress, [R] stepwise
Smith, J. M., [R] fp
Smith, M. L., [R] esize
Smith, R. J., [R] ivprobit
Smithson, M., [R] esize, [R] regress postestimation
Snedecor, G. W., [R] ameans, [R] anova, [R] correlate, [R] oneway, [R] ranksum, [R] signrank
Snell, E. J., [R] exlogistic, [R] expoisson
Song, F., [R] meta
Soon, T. W., [R] qc
Spearman, C. E., [R] icc, [R] spearman
Speed, F. M., [R] margins
Speed, T., [R] diagnostic plots
Spiegelhalter, D. J., [R] brier
Spieldman, R. S., [R] symmetry
Spitzer, J. J., [R] boxcox
Sprent, P., [R] ranksum, [R] signrank
Sribney, W. M., [R] orthog, [R] ranksum, [R] signrank, [R] stepwise, [R] test
Staelin, R., [R] rologit
Staiger, D. O., [R] ivregress postestimation
Starmer, C. F., [R] vwls
Startz, R., [R] ivregress postestimation
Stegun, I. A., [R] contrast, [R] orthog
Steichen, T. J., [R] kappa, [R] kdensity, [R] sunflower
Steiger, J. H., [R] esize
Steiger, W., [R] qreg
Stein, C., [R] bootstrap
Stephenson, D. B., [R] brier
Stepniewska, K. A., [R] nptrend
Sterne, J. A. C., [R] dstdize, [R] meta, [R] summarize
Stevenson, R. E., [R] frontier
Stewart, M. B., [R] intreg, [R] oprobit, [R] tobit
Stigler, S. M., [R] ameans, [R] ci, [R] correlate, [R] kwallis, [R] qreg, [R] regress, [R] summarize
Stillman, S., [R] ivregress, [R] ivregress postestimation
Stine, R., [R] bootstrap
Stock, J. H., [R] areg postestimation, [R] ivregress, [R] ivregress postestimation
Stoto, M. A., [R] lv
Stover, L., [R] rocreg, [R] rocreg postestimation, [R] rocregplot
Street, J. O., [R] rreg
Stryhn, H., [R] regress
Stuart, A., [R] centile, [R] mean, [R] proportion, [R] qreg, [R] ratio, [R] summarize, [R] symmetry, [R] total
Student, see Gosset, W. S.
Stuetzle, W., [R] sunflower
Sturdivant, R. X., [R] clogit, [R] clogit postestimation, [R] estat classification, [R] estat gof, [R] glm, [R] glogit, [R] lincom, [R] logistic, [R] logistic postestimation, [R] logit, [R] logit postestimation, [R] lroc, [R] lrtest, [R] lsens, [R] mlogit, [R] predictnl, [R] stepwise
Suárez, C., [R] heckoprobit, [R] heckprobit
Suen, H. K., [R] icc
Sullivan, G., [R] regress
Sutton, A. J., [R] meta
Swed, F. S., [R] runtest
Sweetman, O., [R] gmm, [R] inequality
Swets, J. A., [R] lroc
Szroeter, J., [R] regress postestimation
T
Taka, M. T., [R] pkcross
Tamhane, A. C., [R] oneway
Taniuchi, T., [R] kdensity
Tanner, W. P., Jr., [R] lroc
Tanur, J. M., [R] kwallis
Tapia, R. A., [R] kdensity
Tarlov, A. R., [R] lincom, [R] mlogit, [R] mprobit, [R] mprobit postestimation, [R] predictnl, [R] slogit
Tauchmann, H., [R] frontier
Taylor, C., [R] gllamm, [R] glm
Teukolsky, S. A., [R] dydx, [R] vwls
Theil, H., [R] ivregress, [R] reg3
Thiele, T. N., [R] summarize
Thompson, B., [R] esize, [R] regress postestimation
Thompson, J. C., [R] diagnostic plots
Thompson, J. R., [R] kdensity, [R] poisson
Thompson, M. L., [R] rocreg
Thorndike, F., [R] poisson
Thurstone, L. L., [R] rologit
Tibshirani, R. J., [R] bootstrap, [R] qreg
Tidmarsh, C. E., [R] fp
Tilford, J. M., [R] estat gof, [R] lroc
Tobías, A., [R] lrtest, [R] poisson, [R] roccomp, [R] roctab, [R] sdtest
Tobin, J., [R] tobit
Toman, R. J., [R] stepwise
Tong, H., [R] estat ic
Toplis, P. J., [R] binreg
Tosetto, A., [R] logistic, [R] logit
Train, K. E., [R] asmprobit
Trapido, E., [R] exlogistic
Treiman, D. J., [R] eivreg, [R] mlogit
Trivedi, P. K., [R] asclogit, [R] asmprobit, [R] bootstrap, [R] gmm, [R] heckman, [R] heckoprobit, [R] intreg, [R] ivpoisson, [R] ivregress, [R] ivregress postestimation, [R] logit, [R] mprobit, [R] nbreg, [R] ologit, [R] oprobit, [R] poisson, [R] probit, [R] qreg, [R] regress, [R] regress postestimation, [R] simulate, [R] sureg, [R] tnbreg, [R] tobit, [R] tpoisson, [R] zinb postestimation, [R] zip postestimation
Tsiatis, A. A., [R] exlogistic
Tufte, E. R., [R] stem
Tukey, J. W., [R] jackknife, [R] ladder, [R] linktest, [R] lv, [R] regress, [R] regress postestimation diagnostic plots, [R] rreg, [R] smooth, [R] spikeplot, [R] stem
Tukey, P. A., [R] diagnostic plots, [R] lowess
Tyler, J. H., [R] regress
U
Uebersax, J. S., [R] tetrachoric
Uhlendorff, A., [R] asmprobit, [R] mlogit, [R] mprobit
University Group Diabetes Program, [R] glogit
Utts, J. M., [R] ci
V
Valman, H. B., [R] fp
van Belle, G., [R] anova, [R] dstdize, [R] oneway
Van de Ven, W. P. M. M., [R] biprobit, [R] heckoprobit, [R] heckprobit
van den Broeck, J., [R] frontier
Van der Reyden, D., [R] ranksum
Van Kerm, P., [R] inequality, [R] kdensity
Van Loan, C. F., [R] orthog, [R] tetrachoric
Van Pragg, B. M. S., [R] biprobit, [R] heckoprobit, [R] heckprobit
Velleman, P. F., [R] regress postestimation, [R] smooth
Venables, W., [R] esize
Verardi, V., [R] correlate, [R] fp, [R] ivregress, [R] lpoly, [R] rreg
Vetterling, W. T., [R] dydx, [R] vwls
Vidmar, S., [R] ameans
Vittinghoff, E., [R] logistic
Vohr, B. R., [R] rocreg, [R] rocreg postestimation, [R] rocregplot
von Bortkiewicz, L., [R] poisson
von Eye, A., [R] correlate
Von Storch, H., [R] brier
Vondráček, J., [R] correlate
Vuong, Q. H., [R] ivprobit, [R] zinb, [R] zip
W
Wacholder, S., [R] binreg
Wagner, H. M., [R] qreg
Wallis, W. A., [R] kwallis
Walters, S. J., [R] ci, [R] kappa, [R] tabulate twoway
Wand, M. P., [R] kdensity
Wang, D., [R] ci, [R] dstdize, [R] prtest
Wang, Q., [R] ivregress
Wang, Y., [R] asmprobit
Wang, Z., [R] logistic postestimation, [R] lrtest, [R] stepwise
Ware, J. E., Jr., [R] lincom, [R] mlogit, [R] mprobit, [R] mprobit postestimation, [R] predictnl, [R] slogit
Waterson, E. J., [R] binreg
Watson, G. S., [R] lpoly, [R] regress postestimation time series
Watson, M. W., [R] areg postestimation, [R] ivregress
Weber, S., [R] correlate
Webster, A. D., [R] fp
Wedderburn, R. W. M., [R] glm
Weesie, J., [R] constraint, [R] hausman, [R] ladder, [R] reg3, [R] regress, [R] regress postestimation, [R] rologit, [R] simulate, [R] suest, [R] sureg, [R] tabstat, [R] tabulate twoway, [R] test, [R] tetrachoric
Weisberg, H. F., [R] summarize
Weisberg, S., [R] boxcox, [R] regress, [R] regress postestimation
Weiss, M., [R] estimates table
Weisstein, E. W., [R] rocreg postestimation
Welch, B. L., [R] esize, [R] ttest
Wellington, J. F., [R] qreg
Wells, K. B., [R] lincom, [R] mlogit, [R] mprobit, [R] mprobit postestimation, [R] predictnl, [R] slogit
Welsch, R. E., [R] regress postestimation, [R] regress postestimation diagnostic plots
Welsh, A. H., [R] bootstrap
West, K. D., [R] glm, [R] gmm, [R] ivregress
West, S. G., [R] pcorr
Westlake, W. J., [R] pkequiv
White, H. L., Jr., [R] regress, [R] regress postestimation, [R] rocreg, [R] suest
White, I. R., [R] meta, [R] simulate
White, K. J., [R] boxcox, [R] regress postestimation time series
Whitehouse, E., [R] inequality
Whitfield, J. W., [R] ranksum
Whiting, P., [R] roccomp, [R] roctab
Whitney, D. R., [R] kwallis, [R] ranksum
Widen, J. E., [R] rocreg, [R] rocreg postestimation, [R] rocregplot
Wieand, S., [R] rocreg, [R] rocreg postestimation
Wiggins, V. L., [R] regress postestimation, [R] regress postestimation time series
Wilcox, D. W., [R] ivregress postestimation
Wilcoxon, F., [R] kwallis, [R] ranksum, [R] signrank
Wilde, J., [R] gmm
Wilk, M. B., [R] cumul, [R] diagnostic plots, [R] swilk
Wilks, D. S., [R] brier
Williams, R., [R] glm, [R] margins, [R] marginsplot, [R] ologit, [R] oprobit, [R] pcorr, [R] stepwise
Wilson, E. B., [R] ci
Wilson, S. R., [R] bootstrap
Windmeijer, F., [R] gmm, [R] ivpoisson
Winer, B. J., [R] anova, [R] contrast, [R] loneway, [R] oneway, [R] pwcompare
Wolfe, F., [R] correlate, [R] spearman
Wolfe, R., [R] ologit, [R] oprobit, [R] tabulate twoway
Wolfson, C., [R] kappa
Wolpin, K. I., [R] asmprobit
Wong, S. P., [R] icc
Wood, F. S., [R] diagnostic plots
Woodard, D. E., [R] contrast
Wooldridge, J. M., [R] areg postestimation, [R] gmm, [R] heckoprobit, [R] intreg, [R] ivpoisson, [R] ivprobit, [R] ivregress, [R] ivregress postestimation, [R] ivtobit, [R] margins, [R] margins, contrast, [R] qreg, [R] regress, [R] regress postestimation, [R] regress postestimation time series, [R] tobit
Working, H., [R] roccomp, [R] rocfit, [R] roctab
Wright, J. H., [R] ivregress, [R] ivregress postestimation
Wright, J. T., [R] binreg
Wright, P. G., [R] ivregress
Wu, C. F. J., [R] qreg
Wu, D.-M., [R] ivregress postestimation
Wu, N., [R] ivregress
X
Xie, Y., [R] logit, [R] probit
Xu, J., [R] cloglog, [R] logistic, [R] logit, [R] mlogit, [R] ologit, [R] oprobit, [R] probit
Y
Yang, Z., [R] poisson
Yates, J. F., [R] brier
Yee, T. W., [R] slogit
Yellott, J. I., Jr., [R] rologit
Yogo, M., [R] ivregress, [R] ivregress postestimation
Yoshioka, H., [R] logistic postestimation, [R] logit postestimation
Yun, M.-S., [R] logistic postestimation, [R] logit postestimation
Z
Zabell, S. L., [R] kwallis
Zamora, M., [R] heckoprobit, [R] heckprobit
Zavoina, W., [R] ologit
Zelen, M., [R] ttest
Zellner, A., [R] frontier, [R] nlsur, [R] reg3, [R] sureg
Zelterman, D., [R] tabulate twoway
Zheng, X., [R] gllamm
Zimmerman, F., [R] regress
Zubkoff, M., [R] lincom, [R] mlogit, [R] mprobit, [R] mprobit postestimation, [R] predictnl, [R] slogit
Zucchini, W., [R] rocreg
Zwiers, F. W., [R] brier
Subject index
This is the subject index for the Base Reference Manual.
Readers may also want to consult the combined subject index (and the combined author index) in the Glossary and Index. A about command, [R] about absorption in regression, [R] areg acprplot command, [R] regress postestimation diagnostic plots added-variable plots, [R] regress postestimation diagnostic plots adjusted margins, [R] margins, [R] marginsplot means, [R] contrast, [R] margins, [R] marginsplot partial residual plot, [R] regress postestimation diagnostic plots ado command, [R] net ado describe command, [R] net ado dir command, [R] net ado uninstall command, [R] net ado, view subcommand, [R] view ado d, view subcommand, [R] view ado-files, editing, [R] doedit installing, [R] net, [R] sj, [R] ssc location of, [R] which official, [R] update searching for, [R] search, [R] ssc updating user-written, [R] adoupdate adosize, set subcommand, [R] set adoupdate command, [R] adoupdate agreement, interrater, [R] kappa AIC, see Akaike information criterion Akaike information criterion, [R] BIC note, [R] estat, [R] estat ic, [R] estimates stats, [R] glm, [R] lrtest all, update subcommand, [R] update alternative-specific conditional logit (McFadden’s choice) model, [R] asclogit multinomial probit regression, [R] asmprobit rank-ordered probit regression, [R] asroprobit alternatives, estat subcommand, [R] asclogit postestimation, [R] asmprobit postestimation, [R] asroprobit postestimation, [R] nlogit postestimation ameans command, [R] ameans analysis of covariance, [R] anova analysis of variance, [R] anova, [R] contrast, [R] icc, [R] loneway, [R] oneway Kruskal–Wallis, [R] kwallis plots, [R] marginsplot repeated measures, [R] anova analysis-of-variance test of normality, [R] swilk 2519 ANCOVA, see analysis of covariance ANOVA, see analysis of variance anova command, [R] anova, [R] anova postestimation ARCH effects, testing for, [R] regress postestimation time series archlm, estat subcommand, [R] regress postestimation time series area under the curve, [R] lroc, also see pharmacokinetic data, also see receiver operating characteristic analysis areg command, [R] areg, [R] areg postestimation asclogit command, [R] asclogit, [R] asclogit postestimation asmprobit command, [R] asmprobit, [R] asmprobit postestimation asroprobit command, [R] asroprobit, [R] asroprobit postestimation association test, [R] correlate, [R] spearman, [R] tabulate twoway, [R] tetrachoric association, measures of, [R] tabulate twoway asymmetry, see skewness AUC, also see area under the curve augmented component-plus-residual plot, [R] regress postestimation diagnostic plots partial residual plot, [R] regress postestimation diagnostic plots autocorrelation, [R] regress postestimation time series, also see HAC variance estimate autoregressive conditional heteroskedasticity test, [R] regress postestimation time series autotabgraphs, set subcommand, [R] set average marginal effects, [R] margins, [R] marginsplot partial effects (APEs), [R] margins, [R] marginsplot predictions, [R] margins, [R] marginsplot averages, see means avplot and avplots commands, [R] regress postestimation diagnostic plots B backed up message, [R] maximize Bartlett’s test for equal variances, [R] oneway base, fvset subcommand, [R] fvset Bayesian information criterion, [R] BIC note, [R] estat, [R] estat ic, [R] estimates stats, [R] glm, [R] lrtest bcskew0 command, [R] lnskew0 Berndt–Hall–Hall–Hausman algorithm, [R] ml beta coefficients, [R] regress BFGS algorithm, see Broyden–Fletcher–Goldfarb– Shanno algorithm bgodfrey, estat subcommand, [R] regress 
postestimation time series BHHH algorithm, see Berndt–Hall–Hall–Hausman algorithm bias corrected and accelerated, [R] bootstrap postestimation, [R] bstat 2520 Subject index BIC, see Bayesian information criterion Bickenböller test statistic, [R] symmetry binary outcome model, see outcomes, binary binomial distribution, confidence intervals, [R] ci family regression, [R] binreg probability test, [R] bitest binreg command, [R] binreg, [R] binreg postestimation bioequivalence test, [R] pk, [R] pkequiv biopharmaceutical data, see pharmacokinetic data biprobit command, [R] biprobit, [R] biprobit postestimation bitest and bitesti commands, [R] bitest bivariate probit regression, [R] biprobit biweight kernel function, [R] kdensity, [R] lpoly, [R] qreg biweight regression estimates, [R] rreg blogit command, [R] glogit, [R] glogit postestimation Bonferroni’s multiple-comparison adjustment, see multiple comparisons, Bonferroni’s method bootstrap sampling and estimation, [R] bootstrap, [R] bsample, [R] bstat, [R] qreg, [R] rocreg, [R] simulate standard errors, [R] vce option bootstrap prefix command, [R] bootstrap, [R] bootstrap postestimation bootstrap, estat subcommand, [R] bootstrap postestimation Boston College archive, see Statistical Software Components archive Box–Cox power transformations, [R] lnskew0 regression, [R] boxcox boxcox command, [R] boxcox, [R] boxcox postestimation Box’s conservative epsilon, [R] anova bprobit command, [R] glogit, [R] glogit postestimation Breusch–Godfrey test, [R] regress postestimation time series Breusch–Pagan test, [R] sureg Breusch–Pagan/Cook–Weisberg test for heteroskedasticity, [R] regress postestimation brier command, [R] brier Brier score decomposition, [R] brier browse, view subcommand, [R] view Broyden–Fletcher–Goldfarb–Shanno algorithm, [R] ml bsample command, [R] bsample bsqreg command, [R] qreg, [R] qreg postestimation bstat command, [R] bstat C c(cformat) c-class value, [R] set cformat c(pformat) c-class value, [R] set cformat c(seed) c-class value, [R] set emptycells, [R] set seed c(sformat) c-class value, [R] set cformat c(showbaselevels) c-class value, [R] set showbaselevels c(showemptycells) c-class value, [R] set showbaselevels c(showomitted) c-class value, [R] set showbaselevels calculator, [R] display carryover effects, [R] pk, [R] pkcross, [R] pkshape case–control data, [R] clogit, [R] logistic, [R] rocreg, [R] symmetry categorical, also see factor variables contrasts after anova, [R] contrast covariates, [R] anova data, agreement, measures for, [R] kappa graphs, [R] grmeanby, [R] spikeplot outcomes, see outcomes, categorical, also see outcomes, binary, also see outcomes, ordinal regression, also see outcomes subentry absorbing one categorical variable, [R] areg tabulations, [R] table, [R] tabstat, [R] tabulate oneway, [R] tabulate twoway, [R] tabulate, summarize() variable creation, [R] tabulate oneway, [R] xi cchart command, [R] qc cd, net subcommand, [R] net censored observations, [R] heckman, [R] heckoprobit, [R] heckprobit, [R] intreg, [R] ivtobit, [R] tobit, also see truncated observations censored-normal regression, see interval regression centile command, [R] centile centiles, see percentiles, displaying central tendency, measures of, see means, see medians cformat, set subcommand, [R] set, [R] set cformat charset, set subcommand, [R] set check, ml subcommand, [R] ml checksum, set subcommand, [R] set chi-squared hypothesis test, [R] hausman, [R] lrtest, [R] sdtest, [R] tabulate twoway, [R] test, [R] testnl probability plot, [R] 
diagnostic plots quantile plot, [R] diagnostic plots test for marginal homogeneity, [R] symmetry test of independence, [R] tabulate twoway choice models, [R] asclogit, [R] asmprobit, [R] asroprobit, [R] clogit, [R] cloglog, [R] exlogistic, [R] glm, [R] glogit, [R] heckoprobit, [R] heckprobit, [R] hetprobit, [R] ivprobit, [R] logistic, [R] logit, [R] mlogit, [R] mprobit, [R] nlogit, [R] ologit, [R] oprobit, [R] probit, [R] rologit, [R] scobit, [R] slogit, [R] suest Chow test, [R] anova, [R] contrast, [R] lrtest ci and cii commands, [R] ci classification data, see receiver operating characteristic analysis interrater agreement, [R] kappa table, [R] estat classification classification, estat subcommand, [R] estat classification Subject index 2521 clear, estimates subcommand, [R] estimates store fvset subcommand, [R] fvset ml subcommand, [R] ml clearing estimation results, [R] estimates store clogit command, [R] bootstrap, [R] clogit, [R] clogit postestimation, [R] exlogistic, [R] rologit cloglog command, [R] cloglog, [R] cloglog postestimation close, cmdlog subcommand, [R] log log subcommand, [R] log cls command, [R] cls cluster estimator of variance, [R] vce option alternative-specific conditional logit model, [R] asclogit multinomial probit regression, [R] asmprobit rank-ordered probit regression, [R] asroprobit complementary log-log regression, [R] cloglog generalized linear models, [R] glm for binomial family, [R] binreg generalized method of moments, [R] gmm, [R] ivpoisson heckman selection model, [R] heckman instrumental-variables regression, [R] ivregress interval regression, [R] intreg linear regression, [R] regress constrained, [R] cnsreg truncated, [R] truncreg with dummy-variable set, [R] areg logistic regression, [R] logistic, [R] logit, also see logit regression subentry conditional, [R] clogit multinomial, [R] mlogit ordered, [R] ologit rank-ordered, [R] rologit skewed, [R] scobit stereotype, [R] slogit logit regression, [R] logit, also see logistic regression subentry for grouped data, [R] glogit nested, [R] nlogit maximum likelihood estimation, [R] ml, [R] mlexp multinomial logistic regression, [R] mlogit probit regression, [R] mprobit negative binomial regression truncated, [R] nbreg zero-inflated, [R] zinb nonlinear least-squares estimation, [R] nl systems of equations, [R] nlsur Poisson regression, [R] poisson truncated, [R] tpoisson with endogenous regressors, [R] ivpoisson zero-inflated, [R] zip cluster estimator of variance, continued probit regression, [R] probit bivariate, [R] biprobit for grouped data, [R] glogit heteroskedastic, [R] hetprobit multinomial, [R] mprobit ordered, [R] oprobit ordered heckman selection model, [R] heckoprobit with endogenous regressors, [R] ivprobit with sample selection, [R] heckprobit summary statistics, mean, [R] mean proportion, [R] proportion ratio, [R] ratio total, [R] total tobit model, [R] tobit with endogenous regressors, [R] ivtobit truncated negative binomial regression, [R] tnbreg Poisson regression, [R] tpoisson regression, [R] truncreg with endogenous regressors, instrumental-variables regression, [R] ivregress Poisson regression, [R] ivpoisson probit model, [R] ivprobit tobit model, [R] ivtobit zero-inflated negative binomial regression, [R] zinb Poisson regression, [R] zip cluster sampling, [R] bootstrap, [R] bsample, [R] jackknife cmdlog close command, [R] log command, [R] log off command, [R] log on command, [R] log using command, [R] log cnsreg command, [R] cnsreg, [R] cnsreg postestimation coefficient of variation, [R] 
tabstat coefficients (from estimation), cataloging, [R] estimates linear combinations of, see linear combinations of estimators nonlinear combinations of, see nonlinear combinations of estimators testing equality of, [R] test, [R] testnl coeftabresults, set subcommand, [R] set collinearity, display of omitted variables, [R] set showbaselevels handling by regress, [R] regress retaining collinear variables, [R] estimation options, [R] orthog variance inflation factors, [R] regress postestimation command line, launching dialog box from, [R] db commands, reviewing, [R] #review 2522 Subject index comparative scatterplot, [R] dotplot comparison test between nested models, [R] nestreg complementary log-log regression, [R] cloglog, [R] glm completely determined outcomes, [R] logit component-plus-residual plot, [R] regress postestimation diagnostic plots conditional logistic regression, [R] asclogit, [R] clogit, [R] rologit, [R] slogit marginal effects, [R] margins, [R] marginsplot margins, [R] margins, [R] marginsplot confidence interval for bioequivalence, [R] pkequiv for bootstrap statistics, [R] bootstrap postestimation, [R] rocreg, [R] rocreg postestimation for combinations of coefficients, linear, [R] lincom nonlinear, [R] nlcom for contrasts, [R] contrast for counts, [R] ci for false-positive rates, [R] rocregplot for incidence-rate ratios, [R] expoisson, [R] glm, [R] nbreg, [R] poisson, [R] tnbreg, [R] tpoisson, [R] zinb, [R] zip for intragroup correlations, [R] loneway for margins, [R] margins for means, [R] ci, [R] ameans, [R] esize, [R] mean, [R] ttest for medians and percentiles, [R] centile for odds ratios, [R] exlogistic, [R] glm, [R] glogit, [R] logistic, [R] logit, [R] ologit, [R] scobit for proportions, [R] ci, [R] proportion for ratios, [R] ratio for relative-risk ratios, [R] mlogit for ROC area, [R] roccomp, [R] rocfit, [R] rocreg, [R] roctab for ROC values, [R] rocregplot for standardized mortality ratios, [R] dstdize for totals, [R] total confidence interval, set default, [R] level confidence levels, [R] level conjoint analysis, [R] rologit conren, set subcommand, [R] set console, controlling scrolling of output, [R] more constrained estimation, [R] constraint, [R] estimation options alternative-specific conditional logistic model, [R] asclogit multinomial probit regression, [R] asmprobit rank-ordered probit regression, [R] asroprobit complementary log-log regression, [R] cloglog generalized linear models, [R] glm for binomial family, [R] binreg generalized negative binomial regression, [R] nbreg heckman selection model, [R] heckman, [R] heckoprobit interval regression, [R] intreg constrained estimation, continued linear regression, [R] cnsreg seemingly unrelated, [R] sureg stochastic frontier, [R] frontier three-stage least squares, [R] reg3 truncated, [R] truncreg logistic regression, [R] logistic, [R] logit, also see logit regression subentry conditional, [R] clogit multinomial, [R] mlogit ordered, [R] ologit skewed, [R] scobit stereotype, [R] slogit logit regression, [R] logit, also see logistic regression subentry for grouped data, [R] glogit nested, [R] nlogit maximum likelihood estimation, [R] ml multinomial logistic regression, [R] mlogit probit regression, [R] mprobit negative binomial regression, [R] nbreg truncated, [R] tnbreg zero-inflated, [R] zinb Poisson regression, [R] poisson truncated, [R] tpoisson zero-inflated, [R] zip probit regression, [R] probit bivariate, [R] biprobit for grouped data, [R] glogit heteroskedastic, [R] hetprobit multinomial, [R] mprobit 
ordered, [R] oprobit with endogenous regressors, [R] ivprobit with sample selection, [R] heckprobit tobit model with endogenous regressors, [R] ivtobit truncated negative binomial regression, [R] tnbreg Poisson regression, [R] tpoisson regression, [R] truncreg with endogenous regressors probit regression, [R] ivprobit tobit model, [R] ivtobit zero-inflated negative binomial regression, [R] zinb Poisson regression, [R] zip constraint command, [R] constraint define command, [R] constraint dir command, [R] constraint drop command, [R] constraint free command, [R] constraint get command, [R] constraint list command, [R] constraint contingency tables, [R] roctab, [R] symmetry, [R] table, [R] tabulate twoway Subject index 2523 contrast command, [R] anova postestimation, [R] contrast, [R] contrast postestimation, [R] margins, contrast contrasts, [R] contrast, [R] margins, contrast, [R] marginsplot control charts, [R] qc convergence criteria, [R] maximize Cook–Weisberg test for heteroskedasticity, [R] regress postestimation Cook’s D, [R] glm postestimation, [R] regress postestimation copy, ssc subcommand, [R] ssc copycolor, set subcommand, [R] set copyright Apache, [R] copyright apache boost, [R] copyright boost freetype, [R] copyright freetype icu, [R] copyright icu JagPDF, [R] copyright jagpdf lapack, [R] copyright lapack libpng, [R] copyright libpng MiG Layout, [R] copyright miglayout scintilla, [R] copyright scintilla ttf2pt1, [R] copyright ttf2pt1 zlib, [R] copyright zlib copyright command, [R] copyright correlate command, [R] correlate correlated errors, see robust, Huber/White/sandwich estimator of variance, also see autocorrelation correlation, [R] correlate binary variables, [R] tetrachoric continuous variables, [R] correlate intraclass, [R] icc intracluster, [R] loneway Kendall’s rank, [R] spearman matrices, [R] correlate, [R] estat, [R] estat vce pairwise, [R] correlate partial and semipartial, [R] pcorr serial, [R] runtest Spearman’s rank, [R] spearman structure, [R] asmprobit, [R] asroprobit, [R] reg3 tetrachoric, [R] tetrachoric correlation, estat subcommand, [R] asmprobit postestimation, [R] asroprobit postestimation cosine kernel function, [R] kdensity, [R] lpoly, [R] qreg cost frontier model, [R] frontier count data, confidence intervals for counts, [R] ci estimation, [R] expoisson, [R] glm, [R] gmm, [R] ivpoisson, [R] nbreg, [R] poisson, [R] tnbreg, [R] tpoisson, [R] zinb, [R] zip graphs, [R] histogram, [R] kdensity, [R] spikeplot interrater agreement, [R] kappa summary statistics of, [R] table, [R] tabstat, [R] tabulate oneway, [R] tabulate twoway, [R] tabulate, summarize() count data, continued symmetry and marginal homogeneity tests, [R] symmetry count, ml subcommand, [R] ml covariance matrix of estimators, [R] estat, [R] estat vce, [R] estimates store of variables or coefficients, [R] correlate covariance, analysis of, [R] anova covariance, estat subcommand, [R] asmprobit postestimation, [R] asroprobit postestimation covariate patterns, [R] logistic postestimation, [R] logit postestimation, [R] probit postestimation COVRATIO, [R] regress postestimation cprplot command, [R] regress postestimation diagnostic plots Cramér’s V , [R] tabulate twoway crossover designs, [R] pk, [R] pkcross, [R] pkshape cross-tabulations, see tables cumul command, [R] cumul cumulative distribution, empirical, [R] cumul cumulative incidence data, [R] poisson cusum command, [R] cusum cusum test, [R] cusum D data, autocorrelated, see autocorrelation case–control, see case–control data categorical, see 
categorical data, agreement, measures for experimental, see experimental data matched case–control, see matched case–control data observational, see observational data range of, see range of data ranking, see ranking data sampling, see sampling summarizing, see summarizing data survival-time, see survival analysis time-series, see time-series analysis data manipulation, [R] fvrevar, [R] fvset Davidon–Fletcher–Powell algorithm, [R] ml db command, [R] db default settings of system parameters, [R] query, [R] set defaults define, constraint subcommand, [R] constraint transmap subcommand, [R] translate delta beta influence statistic, [R] clogit postestimation, [R] logistic postestimation, [R] logit postestimation delta chi-squared influence statistic, [R] clogit postestimation, [R] logistic postestimation, [R] logit postestimation delta deviance influence statistic, [R] clogit postestimation, [R] logistic postestimation, [R] logit postestimation 2524 Subject index delta method, [R] margins, [R] nlcom, [R] predictnl, [R] testnl density estimation, kernel, [R] kdensity density-distribution sunflower plot, [R] sunflower derivatives, numeric, [R] dydx, [R] testnl describe, ado subcommand, [R] net estimates subcommand, [R] estimates describe net subcommand, [R] net ssc subcommand, [R] ssc descriptive statistics, CIs for means, proportions, and counts, [R] ci correlations, [R] correlate, [R] pcorr, [R] tetrachoric displaying, [R] grmeanby, [R] lv, [R] summarize estimation, [R] mean, [R] proportion, [R] ratio, [R] total means, [R] ameans, [R] summarize percentiles, [R] centile pharmacokinetic data, make dataset of, [R] pkcollapse summarize, [R] pksumm tables, [R] table, [R] tabstat, [R] tabulate oneway, [R] tabulate twoway, [R] tabulate, summarize() design, fvset subcommand, [R] fvset design effects, [R] loneway deviance residual, [R] binreg postestimation, [R] fp postestimation, [R] glm postestimation, [R] logistic postestimation, [R] logit postestimation, [R] probit postestimation DFBETA, [R] regress postestimation dfbeta command, [R] regress postestimation DFITS, [R] regress postestimation DFP algorithm, [R] ml diagnostic plots, [R] diagnostic plots, [R] logistic postestimation, [R] regress postestimation diagnostic plots diagnostics, regression, see regression diagnostics dialog box, [R] db dichotomous outcome model, see outcomes, binary difference of estimated coefficients, see linear combinations of estimators difficult option, [R] maximize dir, ado subcommand, [R] net constraint subcommand, [R] constraint estimates subcommand, [R] estimates store direct standardization, [R] dstdize, [R] mean, [R] proportion, [R] ratio dispersion, measures of, see percentiles, displaying, see range of data, see standard deviations, displaying, see variance, displaying display settings, [R] set showbaselevels width and length, [R] log display command, as a calculator, [R] display display, ml subcommand, [R] ml displaying, also see printing, logs (output) previously typed lines, [R] #review stored results, [R] stored results distributions, examining, [R] ameans, [R] centile, [R] kdensity, [R] mean, [R] pksumm, [R] summarize, [R] total income, [R] inequality plots, [R] cumul, [R] cusum, [R] diagnostic plots, [R] dotplot, [R] histogram, [R] kdensity, [R] ladder, [R] lv, [R] spikeplot, [R] stem standard population, [R] dstdize testing equality of, [R] ksmirnov, [R] kwallis, [R] ranksum, [R] signrank testing for normality, [R] sktest, [R] swilk transformations to achieve normality, [R] boxcox, [R] ladder to achieve 
zero skewness, [R] lnskew0 do command, [R] do dockable, set subcommand, [R] set dockingguides, set subcommand, [R] set documentation, keyword search on, [R] search doedit command, [R] doedit do-files, [R] do editing, [R] doedit dose–response models, [R] binreg, [R] glm, [R] logistic dotplot command, [R] dotplot doublebuffer, set subcommand, [R] set dp, set subcommand, [R] set drop, constraint subcommand, [R] constraint estimates subcommand, [R] estimates store dstdize command, [R] dstdize dummy variables, see indicator variables Duncan’s multiple-comparison adjustment, see multiple comparisons, Duncan’s method Dunnett’s multiple comparison adjustment, see multiple comparisons, Dunnett’s method duration analysis, see survival analysis Durbin–Watson statistic, [R] regress postestimation time series durbinalt, estat subcommand, [R] regress postestimation time series Durbin’s alternative test, [R] regress postestimation time series dwatson, estat subcommand, [R] regress postestimation time series dydx command, [R] dydx E e() stored results, [R] stored results e(sample), resetting, [R] estimates save e-class command, [R] stored results editing ado-files and do-files, [R] doedit files while in Stata, [R] doedit efficiency, query subcommand, [R] query Subject index 2525 eform option, [R] eform option eivreg command, [R] eivreg, [R] eivreg postestimation empirical cumulative distribution function, [R] cumul emptycells, set subcommand, [R] set, [R] set emptycells ending a Stata session, [R] exit endless loop, see loop, endless endogeneity test, [R] ivregress postestimation endogenous covariates, [R] gmm, [R] ivpoisson, [R] ivprobit, [R] ivregress, [R] ivtobit, [R] reg3 endogenous, estat subcommand, [R] ivregress postestimation Engle’s LM test, [R] regress postestimation time series eolchar, set subcommand, [R] set Epanechnikov kernel function, [R] kdensity, [R] lpoly, [R] qreg epidemiology and related, Brier score decomposition, [R] brier interrater agreement, [R] kappa meta-analysis, [R] meta pharmacokinetic data, see pharmacokinetic data ROC analysis, see receiver operating characteristic analysis standardization, [R] dstdize symmetry and marginal homogeneity tests, [R] symmetry tables, [R] tabulate twoway equality test of binomial proportions, [R] bitest coefficients, [R] pwcompare, [R] sureg, [R] test, [R] testnl distributions, [R] ksmirnov, [R] kwallis, [R] ranksum, [R] signrank margins, [R] margins, [R] pwcompare means, [R] contrast, [R] esize, [R] pwmean, [R] ttest medians, [R] ranksum proportions, [R] bitest, [R] prtest ROC areas, [R] roccomp, [R] rocreg variances, [R] sdtest equivalence test, [R] pk, [R] pkequiv ereturn list command, [R] stored results error messages and return codes, [R] error messages searching, [R] search error-bar charts, [R] serrbar errors-in-variables regression, [R] eivreg esample, estimates subcommand, [R] estimates save esize and esizei commands, [R] esize esize, estat subcommand, [R] regress postestimation estat alternatives command, [R] asclogit postestimation, [R] asmprobit postestimation, [R] asroprobit postestimation, [R] nlogit postestimation archlm command, [R] regress postestimation time series estat, continued bgodfrey command, [R] regress postestimation time series bootstrap command, [R] bootstrap postestimation classification command, [R] estat classification correlation command, [R] asmprobit postestimation, [R] asroprobit postestimation covariance command, [R] asmprobit postestimation, [R] asroprobit postestimation durbinalt command, [R] regress 
postestimation time series dwatson command, [R] regress postestimation time series endogenous command, [R] ivregress postestimation esize command, [R] regress postestimation facweights command, [R] asmprobit postestimation, [R] asroprobit postestimation firststage command, [R] ivregress postestimation gof command, [R] estat gof, [R] poisson postestimation hettest command, [R] regress postestimation ic command, [R] estat, [R] estat ic imtest command, [R] regress postestimation mfx command, [R] asclogit postestimation, [R] asmprobit postestimation, [R] asroprobit postestimation nproc command, [R] rocreg postestimation overid command, [R] gmm postestimation, [R] ivpoisson postestimation, [R] ivregress postestimation ovtest command, [R] regress postestimation predict command, [R] exlogistic postestimation se command, [R] exlogistic postestimation, [R] expoisson postestimation summarize command, [R] estat, [R] estat summarize szroeter command, [R] regress postestimation vce command, [R] estat, [R] estat vce vif command, [R] regress postestimation estimates clear command, [R] estimates store command, [R] suest introduction, [R] estimates describe command, [R] estimates describe dir command, [R] estimates store drop command, [R] estimates store esample command, [R] estimates save for command, [R] estimates for notes command, [R] estimates notes query command, [R] estimates store replay command, [R] estimates replay restore command, [R] estimates store save command, [R] estimates save stats command, [R] estimates stats store command, [R] estimates store table command, [R] estimates table 2526 Subject index estimates, continued title command, [R] estimates title use command, [R] estimates save estimation options, [R] estimation options results, clearing, [R] estimates store storing and restoring, [R] estimates store tables of, [R] estimates table sample, summarizing, [R] estat, [R] estat summarize estimators, covariance matrix of, [R] correlate, [R] estat, [R] estat vce linear combinations of, [R] lincom nonlinear combinations of, [R] nlcom event history analysis, see survival analysis exact statistics, binary confidence intervals, [R] ci, [R] exlogistic, [R] roctab centiles, [R] centile indirect standardization, [R] dstdize one-way anova, [R] loneway regression, [R] exlogistic, [R] expoisson test, binomial probability, [R] bitest equality of distributions, [R] ksmirnov equality of medians, [R] ranksum Fisher’s, [R] tabulate twoway symmetry and marginal homogeneity, [R] symmetry tetrachoric correlations, [R] tetrachoric exit command, [R] exit exiting Stata, see exit command exlogistic command, [R] exlogistic, [R] exlogistic postestimation exogeneity test, see endogeneity test experimental data, [R] anova, [R] contrast, [R] correlate, [R] kwallis, [R] logit, [R] mean, [R] regress, [R] summarize, [R] tabulate oneway, [R] tabulate twoway, [R] ttest exploded logit model, [R] rologit expoisson command, [R] expoisson, [R] expoisson postestimation exponentiated coefficients, [R] eform option F factor variables, [R] fvrevar, [R] fvset factorial design, [R] anova factor-variable settings, [R] fvset facweights, estat subcommand, [R] asmprobit postestimation, [R] asroprobit postestimation failure-time model, see survival analysis false-positive rate, [R] estat classification, [R] roc, [R] rocreg, [R] rocreg postestimation, [R] rocregplot FAQs, search, [R] search fastscroll, set subcommand, [R] set feasible generalized least squares, [R] reg3, [R] sureg feasible generalized nonlinear least squares, [R] nlsur 
fences, [R] lv FGLS, see feasible generalized least squares FGNLS, see feasible generalized nonlinear least squares files, downloading, [R] adoupdate, [R] net, [R] sj, [R] ssc, [R] update firststage, estat subcommand, [R] ivregress postestimation Fisher’s exact test, [R] tabulate twoway fixed-effects model, [R] anova, [R] areg, [R] asclogit, [R] clogit flexible functional form, [R] boxcox, [R] fp, [R] mfp floatwindows, set subcommand, [R] set footnote, ml subcommand, [R] ml for, estimates subcommand, [R] estimates for forecast, standard error of, [R] regress postestimation format settings, [R] set cformat fp generate command, [R] fp fp plot command, [R] fp postestimation fp predict command, [R] fp postestimation fp prefix command, [R] fp, [R] fp postestimation fraction defective, [R] qc fractional polynomial regression, [R] fp multivariable, [R] mfp free, constraint subcommand, [R] constraint frequencies, graphical representation, [R] histogram, [R] kdensity table of, [R] table, [R] tabstat, [R] tabulate oneway, [R] tabulate twoway, [R] tabulate, summarize() from, net subcommand, [R] net update subcommand, [R] update from() option, [R] maximize frontier command, [R] frontier, [R] frontier postestimation frontier model, see stochastic frontier model functions, combinations of estimators, [R] lincom, [R] nlcom cumulative distribution, [R] cumul derivatives and integrals of, [R] dydx estimable, [R] margins evaluator program, [R] gmm, [R] nl, [R] nlsur fractional polynomial, [R] fp, [R] mfp index, [R] logistic postestimation, [R] logit postestimation, [R] probit postestimation kernel, [R] kdensity, [R] lpoly link, [R] glm maximizing likelihood, [R] maximize, [R] ml obtaining help for, [R] help orthogonalization, [R] orthog parameters, [R] nlcom piecewise cubic and piecewise linear, [R] mkspline prediction, [R] predict, [R] predictnl Subject index 2527 functions, continued production and cost, [R] frontier variance, [R] glm fvlabel, set subcommand, [R] set, [R] set showbaselevels fvrevar command, [R] fvrevar fvset base command, [R] fvset clear command, [R] fvset design command, [R] fvset report command, [R] fvset fvwrap, set subcommand, [R] set, [R] set showbaselevels fvwrapon, set subcommand, [R] set, [R] set showbaselevels G Gaussian kernel function, [R] kdensity, [R] lpoly, [R] qreg generalized least squares, feasible, see feasible generalized least squares linear latent and mixed models, [R] gllamm linear models, [R] binreg, [R] glm method of moments, see gmm command negative binomial regression, [R] nbreg get, constraint subcommand, [R] constraint net subcommand, [R] net gladder command, [R] ladder GLLAMM, see generalized linear latent and mixed models gllamm command, [R] gllamm GLM, see generalized linear models glm command, [R] glm, [R] glm postestimation glogit command, [R] glogit, [R] glogit postestimation gmm command, [R] gmm, [R] gmm postestimation gnbreg command, [R] nbreg, [R] nbreg postestimation gof, estat subcommand, [R] estat gof, [R] poisson postestimation Goodman and Kruskal’s gamma, [R] tabulate twoway goodness of fit, [R] brier, [R] diagnostic plots, [R] estat gof, [R] ksmirnov, [R] linktest, [R] logistic postestimation, [R] lrtest, [R] poisson postestimation, [R] regress postestimation, also see deviance residual, also see normal distribution and normality, test for gprobit command, [R] glogit, [R] glogit postestimation gradient option, [R] maximize graph, ml subcommand, [R] ml graphics, query subcommand, [R] query set subcommand, [R] set graphs, added-variable plot, [R] 
regress postestimation diagnostic plots graphs, continued adjusted partial residual plot, [R] regress postestimation diagnostic plots augmented component-plus-residual plot, [R] regress postestimation diagnostic plots augmented partial residual plot, [R] regress postestimation diagnostic plots binary variable cumulative sum, [R] cusum component-plus-residual, [R] regress postestimation diagnostic plots cumulative distribution, [R] cumul density, [R] kdensity density-distribution sunflower, [R] sunflower derivatives, [R] dydx, [R] testnl diagnostic, [R] diagnostic plots dotplot, [R] dotplot error-bar charts, [R] serrbar fractional polynomial, [R] fp postestimation histograms, [R] histogram, [R] kdensity integrals, [R] dydx interaction plots, [R] marginsplot ladder-of-power histograms, [R] ladder letter-value display, [R] lv leverage-versus-(squared)-residual, [R] regress postestimation diagnostic plots logistic diagnostic, [R] logistic postestimation, [R] lsens lowess smoothing, [R] lowess margins plots, [R] marginsplot means and medians, [R] grmeanby normal probability, [R] diagnostic plots partial residual, [R] regress postestimation diagnostic plots partial-regression leverage, [R] regress postestimation diagnostic plots profile plots, [R] marginsplot quality control, [R] qc quantile, [R] diagnostic plots quantile–normal, [R] diagnostic plots quantile–quantile, [R] diagnostic plots regression diagnostic, [R] regress postestimation diagnostic plots residual versus fitted, [R] regress postestimation diagnostic plots residual versus predictor, [R] regress postestimation diagnostic plots ROC curve, [R] lroc, [R] roccomp, [R] rocfit postestimation, [R] rocregplot, [R] roctab rootograms, [R] spikeplot smoothing, [R] kdensity, [R] lowess, [R] lpoly spike plot, [R] spikeplot stem-and-leaf, [R] stem sunflower, [R] sunflower symmetry, [R] diagnostic plots time-versus-concentration curve, [R] pk, [R] pkexamine Greenhouse–Geisser epsilon, [R] anova 2528 Subject index grmeanby command, [R] grmeanby group-data regression, [R] glogit, [R] intreg H HAC variance estimate, [R] binreg, [R] glm, [R] gmm, [R] ivregress, [R] nl Hansen’s J statistic, [R] gmm, [R] gmm postestimation, [R] ivpoisson, [R] ivpoisson postestimation, [R] ivregress harmonic mean, [R] ameans hat matrix, see projection matrix, diagonal elements of hausman command, [R] hausman Hausman specification test, [R] hausman haverdir, set subcommand, [R] set hazard ratio, [R] eform option, [R] lincom health ratio, [R] binreg heckman command, [R] heckman, [R] heckman postestimation Heckman selection model, [R] heckman, [R] heckoprobit, [R] heckprobit heckoprobit command, [R] heckoprobit, [R] heckoprobit postestimation heckprobit command, [R] heckprobit, [R] heckprobit postestimation Helmert contrasts, [R] contrast help command, [R] help help, view subcommand, [R] view help d, view subcommand, [R] view hessian option, [R] maximize heteroskedastic probit regression, [R] hetprobit heteroskedasticity, also see HAC variance estimate conditional, [R] regress postestimation time series robust variances, see robust, Huber/White/sandwich estimator of variance test, [R] hetprobit, [R] regress postestimation, [R] regress postestimation time series heteroskedasticity test, [R] sdtest hetprobit command, [R] hetprobit, [R] hetprobit postestimation hettest, estat subcommand, [R] regress postestimation hierarchical regression, [R] nestreg, [R] stepwise samples, [R] anova, [R] gllamm, [R] loneway, [R] areg histogram command, [R] histogram histograms, [R] histogram 
dotplots, [R] dotplot kernel density estimator, [R] kdensity ladder-of-powers, [R] ladder of categorical variables, [R] histogram rootograms, [R] spikeplot stem-and-leaf, [R] stem Holm’s multiple-comparison adjustment, see multiple comparisons, Holm’s method homogeneity of variances, [R] oneway, [R] sdtest homoskedasticity tests, [R] regress postestimation Hosmer–Lemeshow delta chi-squared influence statistic, see delta chisquared influence statistic delta deviance influence statistic, see delta deviance influence statistic goodness-of-fit test, [R] estat gof hot, ssc subcommand, [R] ssc httpproxy, set subcommand, [R] netio, [R] set httpproxyauth, set subcommand, [R] netio, [R] set httpproxyhost, set subcommand, [R] netio, [R] set httpproxyport, set subcommand, [R] netio, [R] set httpproxypw, set subcommand, [R] netio, [R] set httpproxyuser, set subcommand, [R] netio, [R] set Huber weighting, [R] rreg Huber/White/sandwich estimator of variance, see robust, Huber/White/sandwich estimator of variance Huynh–Feldt epsilon, [R] anova hypertext help, [R] help I ic, estat subcommand, [R] estat, [R] estat ic icc command, [R] icc IIA, see independence of irrelevant alternatives immediate commands, [R] bitest, [R] ci, [R] esize, [R] prtest, [R] sdtest, [R] symmetry, [R] tabulate twoway, [R] ttest imtest, estat subcommand, [R] regress postestimation incidence rate, negative binomial regression, [R] nbreg postestimation, [R] tnbreg postestimation, [R] zinb postestimation Poisson regression, [R] poisson postestimation, [R] tpoisson postestimation, [R] zip postestimation incidence-rate ratio, [R] eform option estimation, negative binomial regression, [R] nbreg, [R] tnbreg, [R] zinb Poisson regression, [R] expoisson, [R] ivpoisson, [R] poisson, [R] tpoisson, [R] zip postestimation, [R] contrast, [R] expoisson postestimation, [R] lincom include bitmap, set subcommand, [R] set income distributions, [R] inequality independence of irrelevant alternatives, assumption, [R] clogit, [R] mlogit relaxing assumption, [R] asclogit, [R] asmprobit, [R] asroprobit, [R] nlogit test for, [R] hausman, [R] nlogit, [R] suest independence test, [R] correlate, [R] spearman, [R] tabulate twoway index of probit and logit, [R] logit postestimation, [R] predict, [R] probit postestimation index search, [R] search indicator variables, [R] tabulate oneway, [R] xi, also see factor variables indirect standardization, [R] dstdize Subject index 2529 inequality measures, [R] inequality influence statistics, see delta beta influence statistic, see delta chi-squared influence statistic, see delta deviance influence statistic, see DFBETA information criteria, see Akaike information criterion, see Bayesian information criterion matrix, [R] correlate, [R] maximize matrix test, [R] regress postestimation init, ml subcommand, [R] ml inner fence, [R] lv install, net subcommand, [R] net ssc subcommand, [R] ssc installation of official updates, [R] update of SJ and STB, [R] net, [R] sj of user-written commands (updating), [R] adoupdate instrumental-variables regression, [R] gmm, [R] ivpoisson, [R] ivprobit, [R] ivregress, [R] ivtobit, [R] reg3 integ command, [R] dydx integrals, numeric, [R] dydx interaction, [R] anova, [R] contrast, [R] fvrevar, [R] margins, [R] margins, contrast, [R] margins, pwcompare, [R] marginsplot, [R] pwcompare, [R] set emptycells, [R] xi interaction expansion, [R] xi interaction plots, [R] marginsplot interface, query subcommand, [R] query Internet, commands to control connections to, [R] netio installation of updates 
from, [R] adoupdate, [R] net, [R] sj, [R] update search, [R] net search interquantile range, [R] qreg interquartile range, [R] lv, [R] table, [R] tabstat interrater agreement, [R] kappa interval regression, [R] intreg intraclass correlation, see correlation, intraclass intracluster correlation, see correlation, intracluster intreg command, [R] intreg, [R] intreg postestimation IQR, see interquartile range iqreg command, [R] qreg, [R] qreg postestimation IRLS, see iterated, reweighted least squares istdize command, [R] dstdize iterate() option, [R] maximize iterated, reweighted least squares, [R] binreg, [R] glm, [R] reg3, [R] sureg iterations, controlling the maximum number, [R] maximize ivpoisson command, [R] ivpoisson, [R] ivpoisson postestimation ivprobit command, [R] ivprobit, [R] ivprobit postestimation ivregress command, [R] ivregress, [R] ivregress postestimation ivtobit command, [R] ivtobit, [R] ivtobit postestimation J jackknife estimation, [R] jackknife standard errors, [R] vce option jackknife prefix command, [R] jackknife, [R] jackknife postestimation jackknifed residuals, [R] regress postestimation K kap command, [R] kappa kappa command, [R] kappa kapwgt command, [R] kappa kdensity command, [R] kdensity Kendall’s tau, [R] spearman, [R] tabulate twoway kernel density estimator, [R] kdensity kernel-weighted local polynomial estimator, [R] lpoly Kish design effects, [R] loneway Kolmogorov–Smirnov test, [R] ksmirnov Kruskal–Wallis test, [R] kwallis ksmirnov command, [R] ksmirnov ktau command, [R] spearman kurtosis, [R] lv, [R] pksumm, [R] regress postestimation, [R] sktest, [R] summarize, [R] tabstat kwallis command, [R] kwallis L L1-norm models, [R] qreg LAD regression, [R] qreg ladder command, [R] ladder ladder of powers, [R] ladder Lagrange multiplier test, [R] regress postestimation time series Latin-square designs, [R] anova, [R] pkshape LAV regression, [R] qreg least absolute deviations, [R] qreg residuals, [R] qreg value regression, [R] qreg least squared deviations, see linear regression least squares, see linear regression generalized, see feasible generalized least squares least-squares means, [R] margins, [R] marginsplot letter values, [R] lv level, set subcommand, [R] level, [R] set Levene’s robust test statistic, [R] sdtest leverage, [R] logistic postestimation, [R] regress postestimation diagnostic plots leverage-versus-(squared)-residual plot, [R] regress postestimation diagnostic plots license, [R] about likelihood, see maximum likelihood estimation 2530 Subject index likelihood-ratio chi-squared of association, [R] tabulate twoway test, [R] lrtest limited dependent variables, [R] asclogit, [R] asmprobit, [R] asroprobit, [R] binreg, [R] biprobit, [R] brier, [R] clogit, [R] cloglog, [R] cusum, [R] exlogistic, [R] expoisson, [R] glm, [R] glogit, [R] heckoprobit, [R] heckprobit, [R] hetprobit, [R] ivpoisson, [R] ivprobit, [R] logistic, [R] logit, [R] mlogit, [R] mprobit, [R] nbreg, [R] nlogit, [R] ologit, [R] oprobit, [R] poisson, [R] probit, [R] rocfit, [R] rocreg, [R] rologit, [R] scobit, [R] slogit, [R] tnbreg, [R] tpoisson, [R] zinb, [R] zip limits, [R] limits, [R] matsize lincom command, [R] lincom linear combinations of estimators, [R] lincom hypothesis test after estimation, [R] contrast, [R] lrtest, [R] margins, [R] margins, contrast, [R] margins, pwcompare, [R] pwcompare, [R] test regression, [R] anova, [R] areg, [R] binreg, [R] cnsreg, [R] eivreg, [R] frontier, [R] glm, [R] gmm, [R] heckman, [R] intreg, [R] ivregress, [R] ivtobit, [R] qreg, [R] reg3, [R] 
regress, [R] rreg, [R] sureg, [R] tobit, [R] vwls splines, [R] mkspline linegap, set subcommand, [R] set linesize, set subcommand, [R] log, [R] set link function, [R] glm link, net subcommand, [R] net linktest command, [R] linktest list, constraint subcommand, [R] constraint ereturn subcommand, [R] stored results return subcommand, [R] stored results sreturn subcommand, [R] stored results lnskew0 command, [R] lnskew0 local linear, [R] lpoly local polynomial, [R] lpoly locally weighted smoothing, [R] lowess location, measures of, [R] lv, [R] summarize, [R] table locksplitters, set subcommand, [R] set log close command, [R] log command, [R] log, [R] view off command, [R] log on command, [R] log query command, [R] log using command, [R] log log files, printing, [R] translate, also see log command log or nolog option, [R] maximize log transformations, [R] boxcox, [R] lnskew0 logistic and logit regression, [R] logistic, [R] logit complementary log-log, [R] cloglog conditional, [R] asclogit, [R] clogit, [R] rologit exact, [R] exlogistic fixed-effects, [R] asclogit, [R] clogit fractional polynomial, [R] fp generalized linear model, [R] glm multinomial, [R] asclogit, [R] clogit, [R] mlogit nested, [R] nlogit ordered, [R] ologit polytomous, [R] mlogit rank-ordered, [R] rologit skewed, [R] scobit stereotype, [R] slogit with grouped data, [R] glogit logistic command, [R] logistic, [R] logistic postestimation logit command, [R] logit, [R] logit postestimation logit regression, see logistic and logit regression log-linear model, [R] expoisson, [R] glm, [R] ivpoisson, [R] poisson, [R] tpoisson, [R] zip logtype, set subcommand, [R] log, [R] set loneway command, [R] loneway loop, endless, see endless loop Lorenz curve, [R] inequality lowess, see locally weighted smoothing lowess command, [R] lowess lpoly command, [R] lpoly L-R plots, [R] regress postestimation diagnostic plots lroc command, [R] lroc lrtest command, [R] lrtest lsens command, [R] lsens lstat command, see estat classification command lstretch, set subcommand, [R] set ltolerance() option, [R] maximize lv command, [R] lv lvr2plot command, [R] regress postestimation diagnostic plots M MAD regression, [R] qreg main effects, [R] anova man command, [R] help Mann–Whitney two-sample statistics, [R] ranksum marginal effects, [R] margins, [R] marginsplot homogeneity, test of, [R] symmetry means, [R] contrast, [R] margins, [R] margins, contrast, [R] margins, pwcompare, [R] marginsplot, [R] pwcompare margins command, [R] margins, [R] margins postestimation, [R] margins, contrast, [R] margins, pwcompare, [R] marginsplot margins test, [R] margins, [R] pwcompare marginsplot command, [R] marginsplot Subject index 2531 mata query command, [R] set set matacache command, [R] set set matafavor command, [R] set set matalibs command, [R] set set matalnum command, [R] set set matamofirst command, [R] set set mataoptimize command, [R] set set matastrict command, [R] set mata, query subcommand, [R] query matched case–control data, [R] asclogit, [R] clogit, [R] symmetry matched-pairs tests, [R] signrank, [R] ttest matsize, set subcommand, [R] matsize, [R] set max memory, set subcommand, [R] set maxdb, set subcommand, [R] db, [R] set maximization technique explained, [R] maximize, [R] ml maximize, ml subcommand, [R] ml maximum likelihood estimation, [R] maximize, [R] ml, [R] mlexp limits, [R] limits number of variables in a model, [R] matsize maximums and minimums, reporting, [R] lv, [R] summarize, [R] table maxiter, set subcommand, [R] maximize, [R] set maxvar, set 
subcommand, [R] set McFadden’s choice model, [R] asclogit McNemar’s chi-squared test, [R] clogit mean command, [R] mean, [R] mean postestimation means, arithmetic, geometric, and harmonic, [R] ameans confidence interval and standard error, [R] ci displaying, [R] ameans, [R] summarize, [R] table, [R] tabstat, [R] tabulate, summarize() estimating, [R] mean graphing, [R] grmeanby marginal, [R] margins pairwise comparisons of, [R] pwmean pharmacokinetic data, [R] pksumm robust, [R] rreg testing equality of, see equality test of means measurement error, [R] vwls measures of association, [R] tabulate twoway central tendency, see means, see medians dispersion, see percentiles, displaying, see range of data, see standard deviations, displaying, see variance, displaying inequality, [R] inequality location, [R] lv, [R] summarize median command, [R] ranksum median regression, [R] qreg median test, [R] ranksum medians, displaying, [R] centile, [R] lv, [R] summarize, [R] table, [R] tabstat graphing, [R] grmeanby testing equality of, see equality test of medians memory, matsize, see matsize, set subcommand memory, query subcommand, [R] query messages and return codes, see error messages and return codes meta-analysis, [R] meta mfp prefix command, [R] mfp, [R] mfp postestimation mfx, estat subcommand, [R] asclogit postestimation, [R] asmprobit postestimation, [R] asroprobit postestimation midsummaries, [R] lv mild outliers, [R] lv Mills’ ratio, [R] heckman, [R] heckman postestimation min memory, set subcommand, [R] set minimum absolute deviations, [R] qreg squared deviations, [R] areg, [R] cnsreg, [R] nl, [R] regress, [R] regress postestimation minimums and maximums, see maximums and minimums, reporting missing values, [R] misstable misstable nested command, [R] misstable patterns command, [R] misstable summarize command, [R] misstable tree command, [R] misstable mixed designs, [R] anova mkspline command, [R] mkspline ml check command, [R] ml clear command, [R] ml count command, [R] ml display command, [R] ml footnote command, [R] ml graph command, [R] ml init command, [R] ml maximize command, [R] ml model command, [R] ml plot command, [R] ml query command, [R] ml report command, [R] ml score command, [R] ml search command, [R] ml trace command, [R] ml mleval command, [R] ml mlexp command, [R] mlexp, [R] mlexp postestimation mlmatbysum command, [R] ml mlmatsum command, [R] ml mlogit command, [R] mlogit, [R] mlogit postestimation mlsum command, [R] ml mlvecsum command, [R] ml 2532 Subject index MNP, see outcomes, multinomial model coefficients test, [R] lrtest, [R] test, [R] testnl model specification test, see specification test model, ml subcommand, [R] ml models, maximum number of variables in, [R] matsize modulus transformations, [R] boxcox monotone-missing pattern, [R] misstable Monte Carlo simulations, [R] permute, [R] simulate more command and parameter, [R] more more, set subcommand, [R] more, [R] set mprobit command, [R] mprobit, [R] mprobit postestimation multilevel model, [R] gllamm multinomial outcome model, see outcomes, multinomial multiple comparisons, [R] contrast, [R] margins, [R] pwcompare, [R] pwmean, [R] anova postestimation, [R] correlate, [R] oneway, [R] regress postestimation, [R] roccomp, [R] spearman, [R] test, [R] testnl, [R] tetrachoric Bonferroni’s method, [R] contrast, [R] margins, [R] pwcompare, [R] pwmean, [R] anova postestimation, [R] correlate, [R] oneway, [R] regress postestimation, [R] roccomp, [R] spearman, [R] test, [R] testnl, [R] tetrachoric Duncan’s method, [R] 
pwcompare, [R] pwmean Dunnett’s method, [R] pwcompare, [R] pwmean Holm’s method, [R] anova postestimation, [R] regress postestimation, [R] test, [R] testnl multiple-range method, see Dunnett’s method subentry Scheffé’s method, [R] contrast, [R] margins, [R] pwcompare, [R] pwmean, [R] oneway Šidák’s method, [R] contrast, [R] margins, [R] pwcompare, [R] pwmean, [R] anova postestimation, [R] correlate, [R] oneway, [R] regress postestimation, [R] roccomp, [R] spearman, [R] test, [R] testnl, [R] tetrachoric Studentized-range method, see Tukey’s method subentry Student–Newman–Keuls’ method, [R] pwcompare, [R] pwmean Tukey’s method, [R] pwcompare, [R] pwmean multiple regression, see linear regression multiple-range multiple-comparison adjustment, see multiple comparisons, Dunnett’s method multivariable fractional polynomial regression, [R] mfp multivariate analysis, bivariate probit, [R] biprobit three-stage least squares, [R] reg3 Zellner’s seemingly unrelated, [R] nlsur, [R] sureg N natural splines, [R] mkspline nbreg command, [R] nbreg, [R] nbreg postestimation needle plot, [R] spikeplot negative binomial regression, [R] nbreg generalized linear models, [R] glm truncated, [R] tnbreg zero-inflated, [R] zinb nested designs, [R] anova effects, [R] anova logit, [R] nlogit model statistics, [R] nestreg regression, [R] nestreg nested, misstable subcommand, [R] misstable nestreg prefix command, [R] nestreg net cd command, [R] net describe command, [R] net from command, [R] net get command, [R] net install command, [R] net link command, [R] net query command, [R] net search command, [R] net search set ado command, [R] net set other command, [R] net sj command, [R] net stb command, [R] net net, view subcommand, [R] view net d, view subcommand, [R] view network, query subcommand, [R] query new, ssc subcommand, [R] ssc Newey–West standard errors, [R] glm news command, [R] news news, view subcommand, [R] view Newton–Raphson algorithm, [R] ml niceness, set subcommand, [R] set nl command, [R] nl, [R] nl postestimation nlcom command, [R] nlcom nlogit command, [R] nlogit, [R] nlogit postestimation nlogitgen command, [R] nlogit nlogittree command, [R] nlogit nlsur command, [R] nlsur, [R] nlsur postestimation nolog or log option, [R] maximize nonconformities, quality control, [R] qc nonconstant variance, see robust, Huber/White/sandwich estimator of variance nonlinear combinations of estimators, [R] nlcom hypothesis test after estimation, [R] lrtest, [R] margins, [R] margins, contrast, [R] margins, pwcompare, [R] nlcom, [R] predictnl, [R] testnl least squares, [R] nl regression, [R] boxcox, [R] nl, [R] nlsur nonparametric analysis, hypothesis tests, agreement, [R] kappa association, [R] spearman, [R] tabulate twoway Subject index 2533 nonparametric analysis, hypothesis tests, continued cusum, [R] cusum equality of distributions, [R] ksmirnov, [R] kwallis, [R] ranksum, [R] signrank medians, [R] ranksum proportions, [R] bitest, [R] prtest random order, [R] runtest trend, [R] nptrend percentiles, [R] centile quantile regression, [R] qreg ROC analysis, [R] roc estimation, [R] rocreg graphs, [R] rocregplot test equality of areas, [R] roccomp without covariates, [R] roctab smoothing, [R] kdensity, [R] lowess, [R] lpoly, [R] smooth nonrtolerance option, [R] maximize nonselection hazard, [R] heckman, [R] heckman postestimation normal distribution and normality, examining distributions for, [R] diagnostic plots, [R] lv probability and quantile plots, [R] diagnostic plots test for, [R] sktest, [R] swilk transformations 
to achieve, [R] boxcox, [R] ladder, [R] lnskew0 not concave message, [R] maximize notes on estimation results, [R] estimates notes notes, estimates subcommand, [R] estimates notes notifyuser, set subcommand, [R] set nproc, estat subcommand, [R] rocreg postestimation nptrend command, [R] nptrend NR algorithm, [R] ml nrtolerance() option, [R] maximize N-way analysis of variance, [R] anova O obs, set subcommand, [R] set observational data, [R] correlate, [R] heckman, [R] ivregress, [R] logit, [R] mean, [R] regress, [R] summarize, [R] tabulate oneway, [R] tabulate twoway, [R] ttest observed information matrix, [R] ml, [R] vce option odbcmgr, set subcommand, [R] set odds ratio, [R] eform option estimation, [R] asclogit, [R] binreg, [R] clogit, [R] cloglog, [R] exlogistic, [R] glm, [R] glogit, [R] logistic, [R] logit, [R] mlogit, [R] scobit postestimation, [R] contrast, [R] exlogistic postestimation, [R] lincom off, cmdlog subcommand, [R] log log subcommand, [R] log OIM, see observed information matrix ologit command, [R] ologit, [R] ologit postestimation OLS regression, see linear regression omitted variables test, [R] regress postestimation, also see specification test on, cmdlog subcommand, [R] log log subcommand, [R] log one-way analysis of variance, [R] kwallis, [R] loneway, [R] oneway oneway command, [R] oneway OPG, see outer product of the gradient oprobit command, [R] oprobit, [R] oprobit postestimation order statistics, [R] lv ordered logit, [R] ologit probit, [R] heckoprobit, [R] oprobit ordinal outcome model, see outcomes, ordinal ordinary least squares, see linear regression orthog command, [R] orthog orthogonal polynomial, [R] contrast, [R] margins, contrast, [R] orthog orthpoly command, [R] orthog other, query subcommand, [R] query outcomes, binary, complementary log-log, [R] cloglog glm for binomial family, [R] binreg, [R] glm grouped data, [R] glogit logistic, [R] exlogistic, [R] logistic, [R] logit, [R] scobit probit, [R] biprobit, [R] heckprobit, [R] hetprobit, [R] ivprobit, [R] probit ROC analysis, [R] rocfit, [R] rocreg categorical, logistic, [R] asclogit, [R] clogit, [R] mlogit, [R] nlogit, [R] slogit probit, [R] asmprobit, [R] mprobit count, negative binomial, [R] nbreg, [R] tnbreg, [R] zinb Poisson, [R] expoisson, [R] ivpoisson, [R] poisson, [R] tpoisson, [R] zip multinomial, see categorical subentry, see ordinal subentry, see rank subentry ordinal, logistic, [R] ologit, [R] slogit probit, [R] heckoprobit, [R] oprobit polytomous, see categorical subentry, see ordinal subentry, see rank subentry rank, logistic, [R] rologit probit, [R] asroprobit outer fence, [R] lv outer product of the gradient, [R] ml, [R] vce option outliers, [R] lv, [R] qreg, [R] regress postestimation, [R] rreg out-of-sample predictions, [R] predict, [R] predictnl 2534 Subject index output, query subcommand, [R] query set subcommand, [R] set output, coefficient table, automatically widen, [R] set display settings, [R] set showbaselevels format settings, [R] set cformat controlling the scrolling of, [R] more printing, [R] translate recording, [R] log outside values, [R] lv overid, estat subcommand, [R] gmm postestimation, [R] ivpoisson postestimation, [R] ivregress postestimation overidentifying restrictions test, [R] gmm postestimation, [R] ivpoisson postestimation, [R] ivregress postestimation ovtest, estat subcommand, [R] regress postestimation P P–P plot, [R] diagnostic plots pagesize, set subcommand, [R] more, [R] set paging of screen output, controlling, [R] more pairwise comparisons, [R] margins, 
pwcompare, [R] marginsplot, [R] pwcompare, [R] pwmean pairwise correlation, [R] correlate parameters, system, see system parameters partial correlation, [R] pcorr effects, [R] margins, [R] marginsplot regression leverage plot, [R] regress postestimation diagnostic plots regression plot, [R] regress postestimation diagnostic plots residual plot, [R] regress postestimation diagnostic plots Parzen kernel function, [R] kdensity, [R] lpoly, [R] qreg pattern of missing values, [R] misstable patterns, misstable subcommand, [R] misstable pausing until key is pressed, [R] more pchart command, [R] qc pchi command, [R] diagnostic plots pcorr command, [R] pcorr PDF, [R] translate Pearson goodness-of-fit test, [R] estat gof, [R] logistic postestimation, [R] poisson postestimation Pearson product-moment correlation coefficient, [R] correlate Pearson residual, [R] binreg postestimation, [R] estat gof, [R] glm postestimation, [R] logistic postestimation, [R] logit postestimation percentiles, displaying, [R] centile, [R] lv, [R] summarize, [R] table, [R] tabstat permutation test, [R] permute permute prefix command, [R] permute pformat, set subcommand, [R] set, [R] set cformat pharmaceutical statistics, [R] pk, [R] pksumm pharmacokinetic data, [R] pk, [R] pkcollapse, [R] pkcross, [R] pkequiv, [R] pkexamine, [R] pkshape, [R] pksumm piecewise cubic functions, [R] mkspline linear functions, [R] mkspline pinnable, set subcommand, [R] set pk, see pharmacokinetic data pkcollapse command, [R] pkcollapse pkcross command, [R] pkcross pkequiv command, [R] pkequiv pkexamine command, [R] pkexamine .pkg filename suffix, [R] net pkshape command, [R] pkshape pksumm command, [R] pksumm Plackett–Luce model, [R] rologit playsnd, set subcommand, [R] set plot, ml subcommand, [R] ml pnorm command, [R] diagnostic plots poisson command, [R] nbreg, [R] poisson, [R] poisson postestimation Poisson distribution, confidence intervals, [R] ci regression, see Poisson regression Poisson regression, [R] nbreg, [R] poisson generalized linear model, [R] glm truncated, [R] tpoisson zero-inflated, [R] zip polynomials, fractional, [R] fp, [R] mfp orthogonal, [R] orthog smoothing, see local polynomial polytomous outcome model, see outcomes, polytomous populations, diagnostic plots, [R] diagnostic plots examining, [R] histogram, [R] lv, [R] spikeplot, [R] stem, [R] summarize, [R] table standard, [R] dstdize testing equality of, see distributions, testing equality of testing for normality, [R] sktest, [R] swilk postestimation command, [R] contrast, [R] estat, [R] estat ic, [R] estat summarize, [R] estat vce, [R] estimates, [R] hausman, [R] lincom, [R] linktest, [R] lrtest, [R] margins, [R] margins, contrast, [R] margins, pwcompare, [R] marginsplot, [R] nlcom, [R] predict, [R] predictnl, [R] pwcompare, [R] suest, [R] test, [R] testnl poverty indices, [R] inequality power transformations, [R] boxcox, [R] lnskew0 predict command, [R] predict, [R] regress postestimation predict, estat subcommand, [R] exlogistic postestimation Subject index 2535 predictions, [R] predict, [R] predictnl predictions, standard error of, [R] glm, [R] predict, [R] regress postestimation predictnl command, [R] predictnl prefix command, [R] bootstrap, [R] fp, [R] jackknife, [R] mfp, [R] nestreg, [R] permute, [R] simulate, [R] stepwise, [R] xi Pregibon delta beta influence statistic, see delta beta influence statistic preprocessor commands, [R] #review prevalence studies, see case–control data print command, [R] translate printcolor, set subcommand, [R] set printing, logs 
(output), [R] translate probit command, [R] probit, [R] probit postestimation probit regression, [R] probit alternative-specific multinomial probit, [R] asmprobit alternative-specific rank-ordered, [R] asroprobit bivariate, [R] biprobit generalized linear model, [R] glm heteroskedastic, [R] hetprobit multinomial, [R] mprobit ordered, [R] heckoprobit, [R] oprobit two-equation, [R] biprobit with endogenous regressors, [R] ivprobit with grouped data, [R] glogit with sample selection, [R] heckprobit processors, set subcommand, [R] set production frontier model, [R] frontier product-moment correlation, [R] correlate between ranks, [R] spearman profile plots, [R] marginsplot programming, limits, [R] limits programs, user-written, see ado-files projection matrix, diagonal elements of, [R] binreg postestimation, [R] clogit postestimation, [R] glm postestimation, [R] logistic postestimation, [R] logit postestimation, [R] regress postestimation, [R] rreg postestimation proportion command, [R] proportion, [R] proportion postestimation proportional hazards models, see survival analysis odds assumption, [R] ologit relaxed, [R] slogit odds model, [R] ologit sampling, [R] bootstrap proportions, confidence intervals for, [R] ci estimating, [R] proportion testing equality of, [R] bitest, [R] prtest prtest command, [R] prtest prtesti command, [R] prtest pseudo R-squared, [R] maximize pseudosigmas, [R] lv pwcompare command, [R] pwcompare, [R] pwcompare postestimation pwcorr command, [R] correlate pwmean command, [R] pwmean, [R] pwmean postestimation Q Q–Q plot, [R] diagnostic plots qc charts, see quality control charts qchi command, [R] diagnostic plots qladder command, [R] ladder qnorm command, [R] diagnostic plots qqplot command, [R] diagnostic plots qreg command, [R] qreg, [R] qreg postestimation qtolerance() option, [R] maximize qualitative dependent variables, [R] asclogit, [R] asmprobit, [R] asroprobit, [R] binreg, [R] biprobit, [R] brier, [R] clogit, [R] cloglog, [R] cusum, [R] exlogistic, [R] glm, [R] glogit, [R] heckoprobit, [R] heckprobit, [R] hetprobit, [R] ivprobit, [R] logistic, [R] logit, [R] mlogit, [R] mprobit, [R] nlogit, [R] ologit, [R] oprobit, [R] probit, [R] rocfit, [R] rocreg, [R] rologit, [R] scobit, [R] slogit quality control charts, [R] qc, [R] serrbar quantile command, [R] diagnostic plots quantile–normal plots, [R] diagnostic plots quantile plots, [R] diagnostic plots quantile–quantile plots, [R] diagnostic plots quantile regression, [R] qreg quantiles, see percentiles, displaying query command, [R] query efficiency command, [R] query graphics command, [R] query interface command, [R] query mata command, [R] query memory command, [R] query network command, [R] query other command, [R] query output command, [R] query trace command, [R] query update command, [R] query query, estimates subcommand, [R] estimates store log subcommand, [R] log ml subcommand, [R] ml net subcommand, [R] net translator subcommand, [R] translate transmap subcommand, [R] translate update subcommand, [R] update quitting Stata, see exit command R r() stored results, [R] stored results Ramsey test, [R] regress postestimation 2536 Subject index random order, test for, [R] runtest sample, [R] bootstrap random-effects model, [R] anova, [R] loneway random-order test, [R] runtest range chart, [R] qc range of data, [R] lv, [R] stem, [R] summarize, [R] table, [R] tabstat rank correlation, [R] spearman ranking data, [R] rologit rank-order statistics, [R] signrank, [R] spearman rank-ordered logistic regression, see 
outcomes, rank ranksum command, [R] ranksum rate ratio, see incidence-rate ratio ratio command, [R] ratio, [R] ratio postestimation ratios, estimating, [R] ratio rc (return codes), see error messages and return codes rchart command, [R] qc receiver operating characteristic analysis, [R] roc area under ROC curve, [R] lroc nonparametric analysis without covariates, [R] roctab parametric analysis without covariates, [R] rocfit regression models, [R] rocreg ROC curves after rocfit, [R] rocfit postestimation ROC curves after rocreg, [R] rocregplot test equality of ROC areas, see equality test of ROC areas rectangle kernel function, [R] kdensity, [R] lpoly, [R] qreg reexpression, [R] boxcox, [R] ladder, [R] lnskew0 reg3 command, [R] reg3, [R] reg3 postestimation regress command, [R] regress, [R] regress postestimation, [R] regress postestimation diagnostic plots, [R] regress postestimation time series regression diagnostics, [R] estat classification, [R] estat gof, [R] logistic postestimation, [R] lroc, [R] lsens, [R] poisson postestimation, [R] predict, [R] predictnl, [R] regress postestimation diagnostic plots, [R] regress postestimation time series function, estimating, [R] lpoly regression, constrained, [R] cnsreg creating orthogonal polynomials for, [R] orthog dummy variables, with, [R] anova, [R] areg, [R] xi fixed-effects, [R] areg fractional polynomial, [R] fp, [R] mfp graphing, [R] logistic, [R] regress postestimation diagnostic plots grouped data, [R] intreg increasing number of variables allowed, [R] matsize instrumental variables, [R] gmm, [R] ivpoisson, [R] ivprobit, [R] ivregress, [R] ivtobit linear, see linear regression regression, continued system, [R] gmm, [R] ivpoisson, [R] ivregress, [R] nlsur, [R] reg3, [R] sureg truncated, [R] truncreg relative-risk ratio, [R] eform option, [R] lincom, [R] mlogit reliability, [R] brier, [R] eivreg, [R] icc, [R] intreg, [R] loneway, [R] poisson reliability theory, see survival analysis repeated-measures ANOVA, [R] anova repeating and editing commands, [R] #review replay, estimates subcommand, [R] estimates replay report, fvset subcommand, [R] fvset report, ml subcommand, [R] ml RESET test, [R] regress postestimation reset, translator subcommand, [R] translate residuals, [R] logistic, [R] predict, [R] regress postestimation diagnostic plots, [R] rreg postestimation residual-versus-fitted plot, [R] regress postestimation diagnostic plots residual-versus-predictor plot, [R] regress postestimation diagnostic plots resistant smoothers, [R] smooth restore, estimates subcommand, [R] estimates store restricted cubic splines, [R] mkspline Results window, clearing, [R] cls results, saving, [R] estimates save stored, [R] stored results return codes, see error messages and return codes return list command, [R] stored results reventries, set subcommand, [R] set #review command, [R] #review revkeyboard, set subcommand, [R] set risk ratio, [R] binreg rmsg, set subcommand, [R] set robust regression, [R] regress, [R] rreg, also see robust, Huber/White/sandwich estimator of variance robust test for equality of variance, [R] sdtest robust, Huber/White/sandwich estimator of variance, [R] vce option alternative-specific conditional logit model, [R] asclogit multinomial probit regression, [R] asmprobit rank-ordered probit regression, [R] asroprobit complementary log-log regression, [R] cloglog generalized linear models, [R] glm for binomial family, [R] binreg generalized method of moments, [R] gmm, [R] ivpoisson heckman selection model, [R] heckman 
instrumental-variables regression, [R] ivregress interval regression, [R] intreg Subject index 2537 robust, Huber/White/sandwich estimator of variance, continued linear regression, [R] regress constrained, [R] cnsreg truncated, [R] truncreg with dummy-variable set, [R] areg logistic regression, [R] logistic, [R] logit, also see logit regression subentry conditional, [R] clogit multinomial, [R] mlogit ordered, [R] ologit rank-ordered, [R] rologit skewed, [R] scobit stereotype, [R] slogit logit regression, [R] logistic, [R] logit, also see logistic regression subentry for grouped data, [R] glogit nested, [R] nlogit maximum likelihood estimation, [R] ml, [R] mlexp multinomial logistic regression, [R] mlogit probit regression, [R] mprobit negative binomial regression, [R] nbreg truncated, [R] tnbreg zero-inflated, [R] zinb nonlinear least-squares estimation, [R] nl systems of equations, [R] nlsur Poisson regression, [R] poisson truncated, [R] tpoisson with endogenous regressors, [R] ivpoisson zero-inflated, [R] zip probit regression, [R] probit bivariate, [R] biprobit for grouped data, [R] glogit heteroskedastic, [R] hetprobit multinomial, [R] mprobit ordered, [R] heckoprobit, [R] oprobit with endogenous regressors, [R] ivprobit with sample selection, [R] heckprobit quantile regression, [R] qreg summary statistics, mean, [R] mean proportion, [R] proportion ratio, [R] ratio total, [R] total tobit model, [R] tobit with endogenous regressors, [R] ivtobit truncated negative binomial regression, [R] tnbreg Poisson regression, [R] tpoisson regression, [R] truncreg with endogenous regressors, instrumental-variables regression, [R] ivregress Poisson regression, [R] ivpoisson probit regression, [R] ivprobit tobit regression, [R] ivtobit robust, Huber/White/sandwich estimator of variance, continued zero-inflated negative binomial regression, [R] zinb Poisson regression, [R] zip robust, other methods of, [R] rreg, [R] smooth robvar command, [R] sdtest ROC, see receiver operating characteristic analysis roccomp command, [R] roc, [R] roccomp rocfit command, [R] rocfit, [R] rocfit postestimation rocgold command, [R] roc, [R] roccomp rocplot command, [R] rocfit postestimation rocreg command, [R] rocreg, [R] rocreg postestimation, [R] rocregplot rocregplot command, [R] rocregplot roctab command, [R] roc, [R] roctab roh, [R] loneway rologit command, [R] rologit, [R] rologit postestimation rootograms, [R] spikeplot rreg command, [R] rreg, [R] rreg postestimation run command, [R] do runiform() function, [R] set seed runtest command, [R] runtest rvfplot command, [R] regress postestimation diagnostic plots rvpplot command, [R] regress postestimation diagnostic plots S s() stored results, [R] stored results sample, random, see random sample sampling, [R] bootstrap, [R] bsample, also see cluster sampling sandwich/Huber/White estimator of variance, see robust, Huber/White/sandwich estimator of variance save, estimates subcommand, [R] estimates save saved results, see stored results saving results, [R] estimates save Scheffé’s multiple-comparison adjustment, see multiple comparisons, Scheffé’s method scheme, set subcommand, [R] set Schwarz information criterion, see Bayesian information criterion s-class command, [R] stored results scobit command, [R] scobit, [R] scobit postestimation score, ml subcommand, [R] ml scores, [R] predict scrollbufsize, set subcommand, [R] set scrolling of output, controlling, [R] more sdtest command, [R] sdtest sdtesti command, [R] sdtest se, estat subcommand, [R] exlogistic 
postestimation, [R] expoisson postestimation 2538 Subject index search, ml subcommand, [R] ml net subcommand, [R] net view subcommand, [R] view search command, [R] search search Internet, [R] net search search d, view subcommand, [R] view searchdefault, set subcommand, [R] search, [R] set seed, set subcommand, [R] set, [R] set seed seemingly unrelated estimation, [R] suest regression, [R] nlsur, [R] reg3, [R] sureg segmentsize, set subcommand, [R] set selection models, [R] heckman, [R] heckoprobit, [R] heckprobit sensitivity, [R] estat classification, [R] lroc, [R] lsens, also see receiver operating characteristic analysis model, [R] regress postestimation, [R] rreg serial correlation, see autocorrelation serial independence test, [R] runtest serrbar command, [R] serrbar session, recording, [R] log set adosize command, [R] set autotabgraphs command, [R] set cformat command, [R] set, [R] set cformat charset command, [R] set checksum command, [R] set coeftabresults command, [R] set command, [R] query, [R] set conren command, [R] set copycolor command, [R] set dockable command, [R] set dockingguides command, [R] set doublebuffer command, [R] set dp command, [R] set emptycells command, [R] set, [R] set emptycells eolchar command, [R] set fastscroll command, [R] set floatwindows command, [R] set fvlabel command, [R] set, [R] set showbaselevels fvwrap command, [R] set, [R] set showbaselevels fvwrapon command, [R] set, [R] set showbaselevels graphics command, [R] set haverdir command, [R] set httpproxy command, [R] netio, [R] set httpproxyauth command, [R] netio, [R] set httpproxyhost command, [R] netio, [R] set httpproxyport command, [R] netio, [R] set httpproxypw command, [R] netio, [R] set httpproxyuser command, [R] netio, [R] set include bitmap command, [R] set level command, [R] level, [R] set linegap command, [R] set linesize command, [R] log, [R] set locksplitters command, [R] set set, continued logtype command, [R] log, [R] set lstretch command, [R] set matsize command, [R] matsize, [R] set maxdb command, [R] db, [R] set maxiter command, [R] maximize, [R] set max memory command, [R] set maxvar command, [R] set min memory command, [R] set more command, [R] more, [R] set niceness command, [R] set notifyuser command, [R] set obs command, [R] set odbcmgr command, [R] set output command, [R] set pagesize command, [R] more, [R] set pformat command, [R] set, [R] set cformat pinnable command, [R] set playsnd command, [R] set printcolor command, [R] set processors command, [R] set reventries command, [R] set revkeyboard command, [R] set rmsg command, [R] set scheme command, [R] set scrollbufsize command, [R] set searchdefault command, [R] search, [R] set seed command, [R] set, [R] set seed segmentsize command, [R] set sformat command, [R] set, [R] set cformat showbaselevels command, [R] set, [R] set showbaselevels showemptycells command, [R] set, [R] set showbaselevels showomitted command, [R] set, [R] set showbaselevels smoothfonts command, [R] set timeout1 command, [R] netio, [R] set timeout2 command, [R] netio, [R] set trace command, [R] set tracedepth command, [R] set traceexpand command, [R] set tracehilite command, [R] set traceindent command, [R] set tracenumber command, [R] set tracesep command, [R] set type command, [R] set update interval command, [R] set, [R] update update prompt command, [R] set, [R] update update query command, [R] set, [R] update varabbrev command, [R] set varkeyboard command, [R] set set ado, net subcommand, [R] net set matacache, mata subcommand, [R] set set 
matafavor, mata subcommand, [R] set set matalibs, mata subcommand, [R] set set matalnum, mata subcommand, [R] set Subject index 2539 set matamofirst, mata subcommand, [R] set set mataoptimize, mata subcommand, [R] set set matastrict, mata subcommand, [R] set set other, net subcommand, [R] net set, translator subcommand, [R] translate set defaults command, [R] set defaults settings, display, [R] set showbaselevels format, [R] set cformat sformat, set subcommand, [R] set, [R] set cformat sfrancia command, [R] swilk Shapiro–Francia test for normality, [R] swilk Shapiro–Wilk test for normality, [R] swilk shewhart command, [R] qc showbaselevels, set subcommand, [R] set, [R] set showbaselevels showemptycells, set subcommand, [R] set, [R] set showbaselevels shownrtolerance option, [R] maximize showomitted, set subcommand, [R] set, [R] set showbaselevels showstep option, [R] maximize showtolerance option, [R] maximize Šidák’s multiple-comparison adjustment, see multiple comparisons, Šidák’s method signrank command, [R] signrank signtest command, [R] signrank simulate prefix command, [R] simulate simulation, Monte Carlo, [R] permute, [R] simulate simultaneous quantile regression, [R] qreg systems, [R] reg3 SIR, see standardized incidence ratio SJ, see Stata Journal and Stata Technical Bulletin sj, net subcommand, [R] net skewed logistic regression, [R] scobit skewness, [R] ladder, [R] regress postestimation, [R] summarize, [R] lnskew0, [R] lv, [R] pksumm, [R] sktest, [R] tabstat sktest command, [R] sktest slogit command, [R] slogit, [R] slogit postestimation Small Stata, [R] limits smooth command, [R] smooth smoothfonts, set subcommand, [R] set smoothing, [R] lpoly, [R] smooth graphs, [R] kdensity, [R] lowess SMR, see standardized mortality ratio spearman command, [R] spearman Spearman’s rho, [R] spearman specification test, [R] gmm postestimation, [R] hausman, [R] ivpoisson postestimation, [R] ivregress postestimation, [R] linktest, [R] lnskew0, [R] regress postestimation, [R] suest specificity, [R] estat classification, [R] lroc, [R] lsens, also see receiver operating characteristic analysis Spiegelhalter’s Z statistic, [R] brier spike plot, [R] spikeplot spikeplot command, [R] spikeplot splines linear, [R] mkspline restricted cubic, [R] mkspline split-plot designs, [R] anova spread, [R] lv sqreg command, [R] qreg, [R] qreg postestimation sreturn list command, [R] stored results ssc copy command, [R] ssc describe command, [R] ssc hot command, [R] ssc install command, [R] ssc new command, [R] ssc type command, [R] ssc uninstall command, [R] ssc SSC archive, see Statistical Software Components archive standard deviations, displaying, [R] lv, [R] summarize, [R] table, [R] tabstat, [R] tabulate, summarize() testing equality of, [R] sdtest standard errors, for general predictions, [R] predictnl forecast, [R] predict, [R] regress postestimation mean, [R] ci, [R] mean prediction, [R] glm, [R] predict, [R] regress postestimation residuals, [R] predict, [R] regress postestimation robust, see robust, Huber/White/sandwich estimator of variance standardized incidence ratio, [R] dstdize margins, [R] margins means, [R] mean mortality ratio, [R] dstdize proportions, [R] proportion rates, [R] dstdize ratios, [R] ratio residuals, [R] binreg postestimation, [R] glm postestimation, [R] logistic postestimation, [R] logit postestimation, [R] predict, [R] regress postestimation Stata Journal and Stata Technical Bulletin installation of, [R] net, [R] sj keyword search of, [R] search Stata limits, [R] limits Stata/IC, 
[R] limits Stata/MP, [R] limits Stata/SE, [R] limits stata.key file, [R] search Statistical Software Components archive, [R] ssc stats, estimates subcommand, [R] estimates stats STB, see Stata Journal and Stata Technical Bulletin stb, net subcommand, [R] net stcox, fractional polynomials, [R] fp, [R] mfp stem command, [R] stem 2540 Subject index stem-and-leaf displays, [R] stem stepwise estimation, [R] stepwise stepwise prefix command, [R] stepwise stereotype logistic regression, [R] slogit stochastic frontier model, [R] frontier store, estimates subcommand, [R] estimates store stored results, [R] stored results storing and restoring estimation results, [R] estimates store stratified graphs, [R] dotplot models, [R] asclogit, [R] asmprobit, [R] asroprobit, [R] clogit, [R] exlogistic, [R] expoisson, [R] rocreg, [R] rologit resampling, [R] bootstrap, [R] bsample, [R] bstat, [R] permute standardization, [R] dstdize summary statistics, [R] mean, [R] proportion, [R] ratio, [R] total structural vector autoregressive postestimation, [R] regress postestimation time series Stuart–Maxwell test statistic, [R] symmetry Studentized residuals, [R] predict, [R] regress postestimation Studentized-range multiple-comparison adjustment, see multiple comparisons, Tukey’s method Student–Newman–Keuls’ multiple-comparison adjustment, see multiple comparisons, Student– Newman–Keuls’ method Student’s t distribution, see t distribution subhazard ratio, [R] eform option, [R] lincom suest command, [R] suest summarize, estat subcommand, [R] estat, [R] estat summarize misstable subcommand, [R] misstable summarize command, [R] summarize, [R] tabulate, summarize() summarizing data, [R] summarize, [R] tabstat, [R] lv, [R] table, [R] tabulate oneway, [R] tabulate twoway, [R] tabulate, summarize() summary statistics, see descriptive statistics sums, over observations, [R] summarize sunflower command, [R] sunflower sunflower plots, [R] sunflower sureg command, [R] sureg, [R] sureg postestimation survey sampling, see cluster sampling survival analysis, [R] intreg, [R] logistic, [R] poisson survival-time data, see survival analysis swilk command, [R] swilk symbolic forms, [R] anova symmetry command, [R] symmetry symmetry plots, [R] diagnostic plots symmetry test, [R] symmetry symmi command, [R] symmetry symplot command, [R] diagnostic plots syntax diagrams explained, [R] intro system estimators, [R] gmm, [R] ivpoisson, [R] ivregress, [R] nlsur, [R] reg3, [R] sureg parameters, [R] query, [R] set, [R] set defaults szroeter, estat subcommand, [R] regress postestimation Szroeter’s test for heteroskedasticity, [R] regress postestimation T t distribution confidence interval for mean, [R] ci, [R] mean testing equality of means, [R] esize, [R] ttest tab1 command, [R] tabulate oneway tab2 command, [R] tabulate twoway tabi command, [R] tabulate twoway table command, [R] table table, estimates subcommand, [R] estimates table tables, coefficient, display in exponentiated form, [R] eform option display settings, [R] estimation options, [R] set showbaselevels format settings, [R] set cformat maximum likelihood display options, [R] ml system parameter settings, [R] set contingency, [R] table, [R] tabulate twoway estimation results, [R] estimates table frequency, [R] tabulate oneway, [R] tabulate twoway, [R] table, [R] tabstat, [R] tabulate, summarize() missing values, [R] misstable summary statistics, [R] table, [R] tabstat, [R] tabulate, summarize() tabstat command, [R] tabstat tabulate command, [R] tabulate oneway, [R] tabulate twoway 
summarize(), [R] tabulate, summarize() tau, [R] spearman TDT test, see transmission-disequilibrium test technique() option, [R] maximize test, ARCH, see autoregressive conditional heteroskedasticity test association, see association test autoregressive conditional heteroskedasticity, see autoregressive conditional heteroskedasticity test binomial probability, see binomial probability test bioequivalence, see bioequivalence test Breusch–Godfrey, see Breusch–Godfrey test Breusch–Pagan, see Breusch–Pagan test chi-squared hypothesis, see chi-squared hypothesis test Chow, see Chow test comparison (between nested models), see comparison test between nested models cusum, see cusum test Subject index 2541 test, continued Durbin’s alternative, see Durbin’s alternative test endogeneity, see endogeneity test Engle’s LM, see Engle’s LM test equality of binomial proportions, see equality test of binomial proportions coefficients, see equality test of coefficients distributions, see distributions, testing equality of margins, see equality test of margins means, see equality test of means medians, see equality test of medians proportions, see equality test of proportions ROC areas, see equality test of ROC areas variances, see equality test of variances equivalence, see equivalence test exogeneity, see endogeneity test Fisher’s exact, see Fisher’s exact test goodness-of-fit, see goodness of fit Hausman specification, see Hausman specification test heteroskedasticity, see heteroskedasticity test independence, see independence test, also see Breusch–Pagan test independence of irrelevant alternatives, see independence of irrelevant alternatives information matrix, see information matrix test interrater agreement, see interrater agreement Kolmogorov–Smirnov, see Kolmogorov–Smirnov test Kruskal–Wallis, see Kruskal–Wallis test kurtosis, see kurtosis likelihood-ratio, see likelihood-ratio test linear hypotheses after estimation, see linear hypothesis test after estimation marginal homogeneity, see marginal homogeneity, test of margins, see margins test model coefficients, see model coefficients test model specification, see specification test nonlinear hypotheses after estimation, see nonlinear hypothesis test after estimation normality, see normal distribution and normality omitted variables, see omitted variables test overidentifying restrictions, see overidentifying restrictions test permutation, see permutation test Ramsey, see Ramsey test random-order, see random-order test RESET, see RESET test serial correlation, see autocorrelation serial independence, see serial independence test Shapiro–Francia, see Shapiro–Francia test for normality Shapiro–Wilk, see Shapiro–Wilk test for normality skewness, see skewness symmetry, see symmetry test Szroeter’s, see Szroeter’s test for heteroskedasticity TDT, see transmission-disequilibrium test test, continued transmission-disequilibrium test, see transmissiondisequilibrium test trend, see trend, test for variance-comparison, see variance-comparison test Vuong, see Vuong test weak instrument, see weak instrument test test command, [R] anova postestimation, [R] test testnl command, [R] testnl testparm command, [R] test tetrachoric command, [R] tetrachoric three-stage least squares, [R] reg3 timeout1, set subcommand, [R] netio, [R] set timeout2, set subcommand, [R] netio, [R] set time-series analysis, [R] regress postestimation time series time-versus-concentration curve, [R] pk title, estimates subcommand, [R] estimates title tnbreg command, [R] tnbreg, [R] tnbreg 
postestimation tobit command, [R] tobit, [R] tobit postestimation tobit regression, [R] ivtobit, [R] tobit, also see intreg command, also see truncreg command .toc filename suffix, [R] net tolerance() option, [R] maximize total command, [R] total, [R] total postestimation totals, estimation, [R] total tpoisson command, [R] tpoisson, [R] tpoisson postestimation trace, ml subcommand, [R] ml query subcommand, [R] query set subcommand, [R] set trace option, [R] maximize tracedepth, set subcommand, [R] set traceexpand, set subcommand, [R] set tracehilite, set subcommand, [R] set traceindent, set subcommand, [R] set tracenumber, set subcommand, [R] set tracesep, set subcommand, [R] set tracing iterative maximization process, [R] maximize transformations log, [R] lnskew0 modulus, [R] boxcox power, [R] boxcox, [R] lnskew0 to achieve normality, [R] boxcox, [R] ladder to achieve zero skewness, [R] lnskew0 translate command, [R] translate translate logs, [R] translate translator query command, [R] translate reset command, [R] translate set command, [R] translate transmap define command, [R] translate query command, [R] translate transmission-disequilibrium test, [R] symmetry 2542 Subject index tree, misstable subcommand, [R] misstable trend, test for, [R] nptrend, [R] symmetry triangle kernel function, [R] kdensity, [R] lpoly, [R] qreg truncated negative binomial regression, [R] tnbreg observations, [R] truncreg, also see censored observations Poisson regression, [R] tpoisson regression, [R] truncreg truncreg command, [R] truncreg, [R] truncreg postestimation ttest and ttesti commands, [R] ttest Tukey’s multiple-comparison adjustment, see multiple comparisons, Tukey’s method tuning constant, [R] rreg two-stage least squares, [R] ivregress two-way analysis of variance, [R] anova scatterplots, [R] lowess type, set subcommand, [R] set ssc subcommand, [R] ssc U U statistic, [R] ranksum uniformly distributed random-number function, [R] set seed uninstall, net subcommand, [R] net ssc subcommand, [R] ssc unique values, counting, [R] table, [R] tabulate oneway univariate distributions, displaying, [R] cumul, [R] diagnostic plots, [R] histogram, [R] ladder, [R] lv, [R] stem kernel density estimation, [R] kdensity update all command, [R] update command, [R] update from command, [R] update query command, [R] update update, query subcommand, [R] query view subcommand, [R] view update d, view subcommand, [R] view update interval, set subcommand, [R] set, [R] update update prompt, set subcommand, [R] set, [R] update update query, set subcommand, [R] set, [R] update updates to Stata, [R] adoupdate, [R] net, [R] sj, [R] update use, estimates subcommand, [R] estimates save user-written additions, installing, [R] net, [R] ssc searching for, [R] net search, [R] ssc using, cmdlog subcommand, [R] log log subcommand, [R] log V varabbrev, set subcommand, [R] set variables, categorical, see categorical data, agreement, measures for dummy, see indicator variables factor, see factor variables in model, maximum number, [R] matsize orthogonalize, [R] orthog variance, analysis of, [R] anova, [R] loneway, [R] oneway displaying, [R] summarize, [R] tabstat estimators, [R] vce option Huber/White/sandwich estimator, see robust, Huber/White/sandwich estimator of variance inflation factors, [R] regress postestimation nonconstant, see robust, Huber/White/sandwich estimator of variance stabilizing transformations, [R] boxcox testing equality of, [R] sdtest variance–covariance matrix of estimators, [R] correlate, [R] estat, [R] estat vce 
variance-comparison test, [R] sdtest variance-weighted least squares, [R] vwls varkeyboard, set subcommand, [R] set vce, estat subcommand, [R] estat, [R] estat vce vce() option, [R] vce option version of ado-file, [R] which version of Stata, [R] about view ado command, [R] view ado d command, [R] view browse command, [R] view command, [R] view help command, [R] view help d command, [R] view net command, [R] view net d command, [R] view news command, [R] view search command, [R] view search d command, [R] view update command, [R] view update d command, [R] view view d command, [R] view view d, view subcommand, [R] view viewing previously typed lines, [R] #review vif, estat subcommand, [R] regress postestimation Vuong test, [R] zinb, [R] zip vwls command, [R] vwls, [R] vwls postestimation Subject index 2543 W Wald test, [R] contrast, [R] predictnl, [R] test, [R] testnl weak instrument test, [R] ivregress postestimation weighted least squares, [R] regress for grouped data, [R] glogit generalized linear models, [R] glm generalized method of moments estimation, [R] gmm, [R] ivpoisson instrumental-variables regression, [R] gmm, [R] ivregress nonlinear least-squares estimation, [R] nl nonlinear systems of equations, [R] nlsur variance, [R] vwls Welsch distance, [R] regress postestimation which command, [R] which White/Huber/sandwich estimator of variance, see robust, Huber/White/sandwich estimator of variance White’s test for heteroskedasticity, [R] regress postestimation Wilcoxon rank-sum test, [R] ranksum signed-ranks test, [R] signrank X xchart command, [R] qc xi prefix command, [R] xi Z Zellner’s seemingly unrelated regression, [R] sureg, [R] reg3, [R] suest zero-altered, see zero-inflated zero-inflated negative binomial regression, [R] zinb Poisson regression, [R] zip zero-skewness transform, [R] lnskew0 zinb command, [R] zinb, [R] zinb postestimation zip command, [R] zip, [R] zip postestimation