
A Beginner’s Guide to Structural Equation Modeling
Third Edition

Randall E. Schumacker
The University of Alabama

Richard G. Lomax
The Ohio State University
Routledge
Taylor & Francis Group
711 Third Avenue
New York, NY 10017
Routledge
Taylor & Francis Group
27 Church Road
Hove, East Sussex BN3 2FA
© 2010 by Taylor and Francis Group, LLC
Routledge is an imprint of Taylor & Francis Group, an Informa business
International Standard Book Number: 978-1-84169-890-8 (Hardback) 978-1-84169-891-5 (Paperback)
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
Schumacker, Randall E.
A beginner's guide to structural equation modeling / authors, Randall E.
Schumacker, Richard G. Lomax.-- 3rd ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-84169-890-8 (hardcover : alk. paper) -- ISBN 978-1-84169-891-5
(pbk. : alk. paper)
1. Structural equation modeling. 2. Social sciences--Statistical methods. I.
Lomax, Richard G. II. Title.
QA278.S36 2010
519.53--dc22 2010009456
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the Psychology Press Web site at
http://www.psypress.com
Contents
About the Authors ...........................................................................................xv
Preface ............................................................................................................. xvii
1 Introduction ................................................................................................1
1.1 What Is Structural Equation Modeling? .......................................2
1.2 History of Structural Equation Modeling .................................... 4
1.3 Why Conduct Structural Equation Modeling? ............................ 6
1.4 Structural Equation Modeling Software Programs ....................8
1.5 Summary ......................................................................................... 10
References .................................................................................................. 11
2 Data Entry and Data Editing Issues ..................................................... 13
2.1 Data Entry ....................................................................................... 14
2.2 Data Editing Issues ........................................................................ 18
2.2.1 Measurement Scale ........................................................... 18
2.2.2 Restriction of Range ......................................................... 19
2.2.3 Missing Data ...................................................................... 20
2.2.4 LISREL–PRELIS Missing Data Example........................ 21
2.2.5 Outliers ............................................................................... 27
2.2.6 Linearity ............................................................................. 27
2.2.7 Nonnormality .................................................................... 28
2.3 Summary ......................................................................................... 29
References .................................................................................................. 31
3 Correlation ................................................................................................33
3.1 Types of Correlation Coefficients .................................................33
3.2 Factors Affecting Correlation Coefficients ................................. 35
3.2.1 Level of Measurement and Range of Values ................. 35
3.2.2 Nonlinearity ...................................................................... 36
3.2.3 Missing Data ......................................................................38
3.2.4 Outliers ............................................................................... 39
3.2.5 Correction for Attenuation .............................................. 39
3.2.6 Nonpositive Definite Matrices ........................................ 40
3.2.7 Sample Size ........................................................................ 41
3.3 Bivariate, Part, and Partial Correlations .....................................42
3.4 Correlation versus Covariance .....................................................46
3.5 Variable Metrics (Standardized versus Unstandardized) ........ 47
3.6 Causation Assumptions and Limitations ...................................48
3.7 Summary ......................................................................................... 49
References .................................................................................................. 51
4 SEM Basics ................................................................................................ 55
4.1 Model Specication ........................................................................55
4.2 Model Identication ....................................................................... 56
4.3 Model Estimation ........................................................................... 59
4.4 Model Testing ................................................................................. 63
4.5 Model Modication ....................................................................... 64
4.6 Summary ......................................................................................... 67
References .................................................................................................. 69
5 Model Fit .................................................................................................... 73
5.1 Types of Model-Fit Criteria ........................................................... 74
5.1.1 LISREL–SIMPLIS Example ..............................................77
5.1.1.1 Data .....................................................................77
5.1.1.2 Program ..............................................................80
5.1.1.3 Output ................................................................. 81
5.2 Model Fit ..........................................................................................85
5.2.1 Chi-Square (χ2) .................................................................. 85
5.2.2 Goodness-of-Fit Index (GFI) and Adjusted
Goodness-of-Fit Index (AGFI) .........................................86
5.2.3 Root-Mean-Square Residual Index (RMR) .................... 87
5.3 Model Comparison ........................................................................ 88
5.3.1 Tucker–Lewis Index (TLI) ................................................ 88
5.3.2 Normed Fit Index (NFI) and Comparative Fit
Index (CFI) .........................................................................88
5.4 Model Parsimony ........................................................................... 89
5.4.1 Parsimony Normed Fit Index (PNFI) ............................. 90
5.4.2 Akaike Information Criterion (AIC) .............................. 90
5.4.3 Summary ............................................................................ 91
5.5 Parameter Fit ................................................................................... 92
5.6 Power and Sample Size ................................................................. 93
5.6.1 Model Fit ............................................................................ 94
5.6.1.1 Power ................................................................... 94
5.6.1.2 Sample Size ........................................................99
5.6.2 Model Comparison ......................................................... 108
5.6.3 Parameter Signicance ....................................................111
5.6.4 Summary ...........................................................................113
5.7 Two-Step Versus Four-Step Approach to Modeling ................114
5.8 Summary ........................................................................................116
Chapter Footnote .....................................................................................118
Standard Errors ........................................................................................118
Chi-Squares ...............................................................................................118
References ................................................................................................ 120
6 Regression Models ................................................................................ 125
6.1 Overview ....................................................................................... 126
6.2 An Example ................................................................................... 130
6.3 Model Specication ...................................................................... 130
6.4 Model Identication ..................................................................... 131
6.5 Model Estimation ......................................................................... 131
6.6 Model Testing ............................................................................... 133
6.7 Model Modication ..................................................................... 134
6.8 Summary ....................................................................................... 135
6.8.1 Measurement Error ......................................................... 136
6.8.2 Additive Equation ........................................................... 137
Chapter Footnote .................................................................................... 138
Regression Model with Intercept Term ..................................... 138
LISREL–SIMPLIS Program (Intercept Term) ...................................... 138
References ................................................................................................ 139
7 Path Models ............................................................................................ 143
7.1 An Example ................................................................................... 144
7.2 Model Specication ...................................................................... 147
7.3 Model Identication ..................................................................... 150
7.4 Model Estimation ......................................................................... 151
7.5 Model Testing ............................................................................... 154
7.6 Model Modication ..................................................................... 155
7.7 Summary ....................................................................................... 156
Appendix: LISREL–SIMPLIS Path Model Program ........................... 156
Chapter Footnote .................................................................................... 158
Another Traditional Non-SEM Path Model-Fit Index ............ 158
LISREL–SIMPLIS program ......................................................... 158
References .................................................................................................161
8 Confirmatory Factor Models ............................................................... 163
8.1 An Example ................................................................................... 164
8.2 Model Specication ...................................................................... 166
8.3 Model Identication ......................................................................167
8.4 Model Estimation ......................................................................... 169
8.5 Model Testing ............................................................................... 170
8.6 Model Modication ..................................................................... 173
8.7 Summary ........................................................................................174
Appendix: LISREL–SIMPLIS Confirmatory Factor Model Program ....174
References ................................................................................................ 177
9 Developing Structural Equation Models: Part I.............................. 179
9.1 Observed Variables and Latent Variables ................................. 180
9.2 Measurement Model .................................................................... 184
9.3 Structural Model .......................................................................... 186
9.4 Variances and Covariance Terms .............................................. 189
9.5 Two-Step/Four-Step Approach .................................................. 191
9.6 Summary ....................................................................................... 192
References ................................................................................................ 193
10 Developing Structural Equation Models: Part II ............................ 195
10.1 An Example ................................................................................... 195
10.2 Model Specication ...................................................................... 197
10.3 Model Identication ..................................................................... 200
10.4 Model Estimation ......................................................................... 202
10.5 Model Testing ............................................................................... 203
10.6 Model Modication ..................................................................... 205
10.7 Summary ....................................................................................... 207
Appendix: LISREL–SIMPLIS Structural Equation Model Program .....207
References ................................................................................................ 208
11 Reporting SEM Research: Guidelines and Recommendations ... 209
11.1 Data Preparation .......................................................................... 212
11.2 Model Specication ...................................................................... 213
11.3 Model Identication ..................................................................... 215
11.4 Model Estimation ..........................................................................216
11.5 Model Testing ............................................................................... 217
11.6 Model Modication ..................................................................... 218
11.7 Summary ....................................................................................... 219
References ................................................................................................ 220
12 Model Validation ................................................................................... 223
Key Concepts ........................................................................................... 223
12.1 Multiple Samples .......................................................................... 223
12.1.1 Model A Computer Output ...........................................226
12.1.2 Model B Computer Output ............................................ 227
12.1.3 Model C Computer Output ........................................... 228
12.1.4 Model D Computer Output ...........................................229
12.1.5 Summary .......................................................................... 229
12.2 Cross Validation ........................................................................... 229
12.2.1 ECVI .................................................................................. 230
12.2.2 CVI .................................................................................... 231
12.3 Bootstrap .......................................................................................234
12.3.1 PRELIS Graphical User Interface .................................. 234
12.3.2 LISREL and PRELIS Program Syntax .......................... 237
12.4 Summary ....................................................................................... 241
References ................................................................................................ 243
13 Multiple Sample, Multiple Group, and Structured
Means Models ........................................................................................ 245
13.1 Multiple Sample Models ............................................................. 245
Sample 1 ........................................................................................ 247
Sample 2 ........................................................................................ 247
13.2 Multiple Group Models ...............................................................250
13.2.1 Separate Group Models .................................................. 251
13.2.2 Similar Group Model .....................................................255
13.2.3 Chi-Square Difference Test ............................................ 258
13.3 Structured Means Models .......................................................... 259
13.3.1 Model Specication and Identication ........................ 259
13.3.2 Model Fit .......................................................................... 261
13.3.3 Model Estimation and Testing ...................................... 261
13.4 Summary ....................................................................................... 263
Suggested Readings ................................................................................ 267
Multiple Samples ......................................................................... 267
Multiple Group Models .............................................................. 267
Structured Means Models ........................................................... 267
Chapter Footnote .................................................................................... 268
SPSS ................................................................................................ 268
References ................................................................................................ 269
14 Second-Order, Dynamic, and Multitrait Multimethod Models .....271
14.1 Second-Order Factor Model ....................................................... 271
14.1.1 Model Specication and Identication ........................ 271
14.1.2 Model Estimation and Testing ...................................... 272
14.2 Dynamic Factor Model .................................................................274
14.3 Multitrait Multimethod Model (MTMM) ................................. 277
14.3.1 Model Specication and Identication ........................ 279
14.3.2 Model Estimation and Testing ...................................... 280
14.3.3 Correlated Uniqueness Model ...................................... 281
14.4 Summary ....................................................................................... 286
Suggested Readings ................................................................................ 290
Second-Order Factor Models ...................................................... 290
Dynamic Factor Models .............................................................. 290
Multitrait Multimethod Models ................................................. 290
Correlated Uniqueness Model ................................................... 291
References ................................................................................................ 291
15 Multiple Indicator–Multiple Cause, Mixture,
and Multilevel Models ......................................................................... 293
15.1 Multiple Indicator–Multiple Cause (MIMIC) Models ............. 293
15.1.1 Model Specication and Identication ........................ 294
15.1.2 Model Estimation and Model Testing .......................... 294
15.1.3 Model Modication ........................................................ 297
Goodness-of-Fit Statistics .............................................. 297
Measurement Equations ................................................ 297
Structural Equations ....................................................... 298
15.2 Mixture Models ............................................................................ 298
15.2.1 Model Specication and Identication ........................ 299
15.2.2 Model Estimation and Testing ...................................... 301
15.2.3 Model Modication ........................................................ 302
15.2.4 Robust Statistic ................................................................305
15.3 Multilevel Models ........................................................................ 307
15.3.1 Constant Effects .............................................................. 313
15.3.2 Time Effects ..................................................................... 313
15.3.3 Gender Effects ................................................................. 315
15.3.4 Multilevel Model Interpretation ....................................318
15.3.5 Intraclass Correlation ..................................................... 319
15.3.6 Deviance Statistic ............................................................ 320
15.4 Summary ....................................................................................... 320
Suggested Readings ................................................................................ 324
Multiple Indicator–Multiple Cause Models ............................. 324
Mixture Models ............................................................................ 325
Multilevel Models ........................................................................ 325
References ................................................................................................ 325
16 Interaction, Latent Growth, and Monte Carlo Methods ................ 327
16.1 Interaction Models ....................................................................... 327
16.1.1 Categorical Variable Approach ..................................... 328
16.1.2 Latent Variable Interaction Model ................................ 331
16.1.2.1 Computing Latent Variable Scores ............... 331
16.1.2.2 Computing Latent Interaction Variable ....... 333
16.1.2.3 Interaction Model Output ..............................335
16.1.2.4 Model Modication ......................................... 336
16.1.2.5 Structural Equations—No Latent
Interaction Variable ......................................... 336
16.1.3 Two-Stage Least Squares (TSLS) Approach ................ 337
16.2 Latent Growth Curve Models..................................................... 341
16.2.1 Latent Growth Curve Program ..................................... 343
16.2.2 Model Modication ........................................................344
16.3 Monte Carlo Methods ..................................................................345
16.3.1 PRELIS Simulation of Population Data........................ 346
16.3.2 Population Data from Specied
Covariance Matrix .......................................................... 352
16.3.2.1 SPSS Approach ................................................ 352
16.3.2.2 SAS Approach ..................................................354
16.3.2.3 LISREL Approach ............................................ 355
16.3.3 Covariance Matrix from Specied Model ................... 359
16.4 Summary ....................................................................................... 365
Suggested Readings ................................................................................ 368
Interaction Models ....................................................................... 368
Latent Growth-Curve Models .................................................... 368
Monte Carlo Methods .................................................................. 368
References ................................................................................................ 369
17 Matrix Approach to Structural Equation Modeling ....................... 373
17.1 General Overview of Matrix Notation ...................................... 373
17.2 Free, Fixed, and Constrained Parameters ................................. 379
17.3 LISREL Model Example in Matrix Notation ............................ 382
LISREL 8 Matrix Program Output (Edited and Condensed) ..385
17.4 Other Models in Matrix Notation ..............................................400
17.4.1 Path Model .......................................................................400
17.4.2 Multiple-Sample Model ................................................. 404
17.4.3 Structured Means Model ............................................... 405
17.4.4 Interaction Models .......................................................... 410
PRELIS Computer Output .......................................................... 412
LISREL Interaction Computer Output .......................................416
17.5 Summary ....................................................................................... 421
References ................................................................................................ 423
Appendix A: Introduction to Matrix Operations ...................................425
Appendix B: Statistical Tables ...................................................................439
Answers to Selected Exercises ................................................................... 449
Author Index .................................................................................................. 489
Subject Index ................................................................................................. 495
About the Authors
RANDALL E. SCHUMACKER received his Ph.D. in educational psychology from Southern Illinois University. He is currently professor of educational research at the University of Alabama, where he teaches courses in structural equation modeling, multivariate statistics, multiple regression, and program evaluation. His research interests are varied, including modeling interaction in SEM, robust statistics (normal scores, centering, and variance inflation factor issues), and SEM specification search issues as well as measurement model issues related to estimation, mixed-item formats, and reliability.
He has published in several journals including Academic Medicine,
Educational and Psychological Measurement, Journal of Applied Measurement,
Journal of Educational and Behavioral Statistics, Journal of Research Methodology,
Multiple Linear Regression Viewpoints, and Structural Equation Modeling.
He has served on the editorial boards of numerous journals and is a
member of the American Educational Research Association, American
Psychological Association—Division 5, as well as past-president of the
Southwest Educational Research Association, and emeritus editor of
Structural Equation Modeling journal. He can be contacted at the University
of Alabama College of Education.
RICHARD G. LOMAX received his Ph.D. in educational research methodology from the University of Pittsburgh. He is currently a professor in the School of Educational Policy and Leadership, Ohio State University, where he teaches courses in structural equation modeling, statistics, and quantitative research methodology.
His research primarily focuses on models of literacy acquisition, multivariate statistics, and assessment. He has published in such diverse journals as Parenting: Science and Practice, Understanding Statistics: Statistical Issues in Psychology, Education, and the Social Sciences, Violence Against Women, Journal of Early Adolescence, and Journal of Negro Education. He has served on the editorial boards of numerous journals, and is a member of the American Educational Research Association, the American Statistical Association, and the National Reading Conference. He can be contacted at Ohio State University College of Education and Human Ecology.
Preface
Approach
This book presents a basic introduction to structural equation modeling
(SEM). Readers will find that we have kept to our tradition of keeping
examples rudimentary and easy to follow. The reader is provided with
a review of correlation and covariance, followed by multiple regression,
path, and factor analyses in order to better understand the building blocks
of SEM. The book describes a basic structural equation model followed by
the presentation of several different types of structural equation models.
Our approach in the text is both conceptual and application oriented.
Each chapter covers basic concepts, principles, and practice and then
utilizes SEM software to provide meaningful examples. Each chapter also
features an outline, key concepts, a summary, numerous examples from
a variety of disciplines, tables, and figures, including path diagrams, to
assist with conceptual understanding. Chapters with examples follow the
conceptual sequence of SEM steps known as model specification, identification, estimation, testing, and modification.
The book now uses LISREL 8.8 student version to make the software and
examples readily available to readers. Please be aware that the student
version, although free, does not contain all of the features of a fully
licensed version. Given the advances in SEM software over the past
decade, you should expect updates and patches of this software package
and therefore become familiar with any new features as well as explore the
excellent library of examples and help materials. The LISREL 8.8 student
version is an easy-to-use Windows PC based program with pull-down
menus, dialog boxes, and drawing tools. To access the program, and/or
if you’re a Mac user and are interested in learning about Mac availability,
please check with Scientific Software (http://www.ssicentral.com). There
is also a hotlink to the Scientific Software site from the book page for A
Beginner's Guide to Structural Equation Modeling, 3rd edition, on the Textbook
Resources tab at www.psypress.com.
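To give readers a first sense of what these model files look like, the sketch below shows the general shape of a LISREL–SIMPLIS program; the variable names, data file name, and sample size are hypothetical, and complete worked examples are developed in the chapters that follow.

```
! Hypothetical two-predictor regression model in SIMPLIS syntax
Observed Variables: Y X1 X2
Covariance Matrix from File example.cov
Sample Size: 200
Equation: Y = X1 X2
End of Problem
```

Even this small sketch reflects the basic modeling steps: the equation specifies the model, the covariance matrix and sample size supply the data for estimation, and the resulting output reports the model-fit criteria taken up in Chapter 5.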
The SEM model examples in the book do not require complicated programming skills, nor does the reader need an advanced understanding of statistics and matrix algebra to understand the model applications. We have provided a chapter on the matrix approach to SEM as well as an appendix on matrix operations for the interested reader. We encourage readers to understand the matrices used in SEM models, especially for some of the more advanced SEM models you will encounter in the research literature.
Goals and Content Coverage
Our main goal in this third edition is for students and researchers to be able to conduct their own SEM model analyses, as well as be able to understand and critique published SEM research. These goals are supported by the conceptual and applied examples contained in the book and several journal article references for each advanced SEM model type. We have also included a SEM checklist to guide your model analysis according to the basic steps a researcher takes.
As for content coverage, the book begins with an introduction to SEM
(what it is, some history, why conduct it, and what software is available),
followed by chapters on data entry and editing issues, and correlation.
These early chapters are critical to understanding how missing data, non-normality, scale of measurement, non-linearity, outliers, and restriction of
range in scores affect SEM analysis. Chapter 4 lays out the basic steps of
model specification, identification, estimation, testing, and modification,
followed by Chapter 5, which covers issues related to model fit indices,
power, and sample size. Chapters 6 through 10 follow the basic SEM steps
of modeling, with actual examples from different disciplines, using regression, path, confirmatory factor, and structural equation models. Logically,
the next chapter presents information about reporting SEM research and
includes a SEM checklist to guide decision-making. Chapter 12 presents
different approaches to model validation, an important final step after
obtaining an acceptable theoretical model. Chapters 13 through 16 provide
SEM examples that introduce many of the different types of SEM model
applications. The final chapter describes the matrix approach to structural
equation modeling by using examples from the previous chapters.
Theoretical models are present in every discipline, and therefore can be
formulated and tested. This third edition expands SEM models and applications to provide students and researchers in medicine, political science, sociology, education, psychology, business, and the biological sciences
with the basic concepts, principles, and practice necessary to test their theoretical models. We hope you become more familiar with structural equation
modeling after reading the book, and use SEM in your own research.
New to the Third Edition
The rst edition of this book was one of the rst books published on SEM,
while the second edition greatly expanded knowledge of advanced SEM
models. Since that time, we have had considerable experience utilizing the
book in class with our students. As a result of those experiences, the third
edition represents a more usable book for teaching SEM. As such, it is an
ideal text for introductory graduate-level courses in structural equation
modeling or factor analysis taught in departments of psychology, education, business, and other social and healthcare sciences. An understanding
of correlation is assumed.
The third edition offers several new surprises, namely:
1. Our instruction and examples are now based on freely available
software: LISREL 8.8 student version.
2. More examples presented from more disciplines, including input,
output, and screenshots.
3. Every chapter has been updated and enhanced with additional
material.
4. A website with raw data sets for the book’s examples and exer-
cises so they can be used with any SEM program, all of the book’s
exercises, hotlinks to related websites, and answers to all of the
exercises for instructors only. To access the website visit the book
page or the Textbook Resource page at www.psypress.com.
5. Expanded coverage of advanced models with more on multiple-
group, multi-level, and mixture modeling (Chs. 13 and 15), second-
order and dynamic factor models (Ch. 14), and Monte Carlo
methods (Ch. 16).
6. Increased coverage of sample size and power (Ch. 5), including
software programs, and reporting research (Ch. 11).
7. New journal article references help readers better understand
published research (Chs. 13–17).
8. Troubleshooting tips on how to address the most frequently
encountered problems are found in Chapters 3 and 11.
9. Chapters 13 to 16 now include additional SEM model examples.
10. 25% new exercises with answers to half in the back of the book
for student review (and answers to all for instructors only on the
book and/or Textbook Resource page at www.psypress.com).
11. Added Matrix examples for several models in Chapter 17.
12. Updated references in all chapters on all key topics.
Overall, we believe this third edition is a more complete book that can
be used to teach a full course in SEM. The past several years have seen an
explosion in SEM coursework, books, websites, and training courses. We
are proud to have been considered a starting point for many beginners
to SEM. We hope you find that this third edition expands on many of the
programming tools, trends, and topics in SEM today.
Acknowledgments
The third edition of this book represents more than thirty years of interacting with our colleagues and students who use structural equation
modeling. As before, we are most grateful to the pioneers in the field of
structural equation modeling, particularly to Karl Jöreskog, Dag Sörbom,
Peter Bentler, James Arbuckle, and Linda and Bengt Muthén. These individuals have developed and shaped the new advances in the SEM field as
well as the content of this book, plus provided SEM researchers with software programs. We are also grateful to Gerhard Mels, who answered our
questions and inquiries about SEM programming problems in the chapters. We also wish to thank the reviewers: James Leeper, The University
of Alabama; Philip Smith, Augusta State University; Phil Wood, the
University of Missouri–Columbia; and Ke-Hai Yuan, the University of
Notre Dame.
This book was made possible through the encouragement of Debra
Riegert at Routledge/Taylor & Francis who insisted it was time for a third
edition. We wish to thank her and her editorial assistant, Erin M. Flaherty,
for coordinating all of the activity required to get a book into print. We
also want to thank Suzanne Lassandro at Taylor & Francis Group, LLC
for helping us through the difficult process of revisions, galleys, and final
book copy.
Randall E. Schumacker
The University of Alabama
Richard G. Lomax
The Ohio State University
1
Introduction
Key Concepts
Latent and observed variables
Independent and dependent variables
Types of models
Regression
Path
Conrmatory factor
Structural equation
History of structural equation modeling
Structural equation modeling software programs
Structural equation modeling can be easily understood if the researcher
has a grounding in basic statistics, correlation, and regression analysis.
The first three chapters provide a brief introduction to structural equation
modeling (SEM), basic data entry and editing issues in statistics, and concepts related to the use of correlation coefficients in structural equation
modeling. Chapter 4 covers the essential concepts of SEM: model specification, identification, estimation, testing, and modification. This basic
understanding provides the framework for understanding the material
presented in chapters 5 through 8 on model-fit indices, regression analysis, path analysis, and confirmatory factor analysis models (measurement
models), which form the basis for understanding the structural equation
models (latent variable models) presented in chapters 9 and 10. Chapter 11
provides guidance on reporting structural equation modeling research.
Chapter 12 addresses techniques used to establish model validity and
generalization of findings. Chapters 13 to 16 present many of the advanced
SEM models currently appearing in journal articles: multiple group, multiple indicators–multiple causes, mixture, multilevel, structured means,
multitrait–multimethod, second-order factor, dynamic factor, interaction
models, latent growth curve models, and Monte Carlo studies. Chapter 17
presents matrix notation for one of our SEM applications, covers the differ-
ent matrices used in structural equation modeling, and presents multiple
regression and path analysis solutions using matrix algebra. We include
an introduction to matrix operations in the Appendix for readers who
want a more mathematical understanding of matrix operations. To start
our journey of understanding, we first ask, What is structural equation
modeling? Then, we give a brief history of SEM, discuss the importance of
SEM, and note the availability of SEM software programs.
1.1 What Is Structural Equation Modeling?
Structural equation modeling (SEM) uses various types of models to
depict relationships among observed variables, with the same basic goal
of providing a quantitative test of a theoretical model hypothesized by
the researcher. More specically, various theoretical models can be tested
in SEM that hypothesize how sets of variables dene constructs and
how these constructs are related to each other. For example, an educa-
tional researcher might hypothesize that a students home environment
inuences her later achievement in school. A marketing researcher may
hypothesize that consumer trust in a corporation leads to increased prod-
uct sales for that corporation. A health care professional might believe
that a good diet and regular exercise reduce the risk of a heart attack.
In each example, the researcher believes, based on theory and empirical
research, sets of variables dene the constructs that are hypothesized to be
related in a certain way. The goal of SEM analysis is to determine the extent to
which the theoretical model is supported by sample data. If the sample data
support the theoretical model, then more complex theoretical models can be
hypothesized. If the sample data do not support the theoretical model, then
either the original model can be modied and tested, or other theoretical
models need to be developed and tested. Consequently, SEM tests theoreti-
cal models using the scientic method of hypothesis testing to advance our
understanding of the complex relationships among constructs.
SEM can test various types of theoretical models. Basic models include
regression (chapter 6), path (chapter 7), and confirmatory factor (chapter 8) models. Our reason for covering these basic models is that they
provide a basis for understanding structural equation models (chapters
9 and 10). To better understand these basic models, we need to define a
few terms. First, there are two major types of variables: latent variables
and observed variables. Latent variables (constructs or factors) are variables that are not directly observable or measured. Latent variables are
indirectly observed or measured, and hence are inferred from a set of
observed variables that we actually measure using tests, surveys, and
so on. For example, intelligence is a latent variable that represents a psychological construct. The confidence of consumers in American business
is another latent variable, one representing an economic construct. The
physical condition of adults is a third latent variable, one representing a
health-related construct.
The observed, measured, or indicator variables are a set of variables that
we use to define or infer the latent variable or construct. For example, the
Wechsler Intelligence Scale for Children—Revised (WISC-R) is an instrument that produces a measured variable (scores), which one uses to infer
the construct of a child’s intelligence. Additional indicator variables, that
is, intelligence tests, could be used to indicate or define the construct of
intelligence (latent variable). The Dow-Jones index is a standard measure
of the American corporate economy construct. Other measured variables
might include gross national product, retail sales, or export sales. Blood
pressure is one of many health-related variables that could indicate a
latent variable defined as “fitness.” Each of these observed or indicator
variables represents one definition of the latent variable. Researchers use
sets of indicator variables to define a latent variable; thus, other measurement instruments are used to obtain indicator variables, for example, the
Stanford–Binet Intelligence Scale, the NASDAQ index, and an individual’s
cholesterol level, respectively.
Variables, whether they are observed or latent, can also be defined
as either independent variables or dependent variables. An independent
variable is a variable that is not influenced by any other variable in
the model. A dependent variable is a variable that is influenced by
another variable in the model. Let us return to the previous examples
and specify the independent and dependent variables. The educational
researcher hypothesizes that a student’s home environment (independent latent variable) influences school achievement (dependent latent
variable). The marketing researcher believes that consumer trust in a
corporation (independent latent variable) leads to increased product
sales (dependent latent variable). The health care professional wants to
determine whether a good diet and regular exercise (two independent
latent variables) influence the frequency of heart attacks (dependent
latent variable).
The basic SEM models in chapters 6 through 8 illustrate the use of
observed variables and latent variables when defined as independent
or dependent. A regression model consists solely of observed variables
where a single dependent observed variable is predicted or explained by
one or more independent observed variables; for example, a parent’s education level (independent observed variable) is used to predict his or her
child’s achievement score (dependent observed variable). A path model is
also specied entirely with observed variables, but the exibility allows
for multiple independent observed variables and multiple dependent
observed variables—for example, export sales, gross national product,
and NASDAQ index inuence consumer trust and consumer spending
(dependent observed variables). Path models, therefore, test more com-
plex models than regression models. Conrmatory factor models con-
sist of observed variables that are hypothesized to measure one or more
latent variables (independent or dependent); for example, diet, exercise,
and physiology are observed measures of the independent latent variable
tness. An understanding of these basic models will help in under-
standing structural equation modeling, which combines path and factor
analytic models. Structural equation models consist of observed variables
and latent variables, whether independent or dependent; for example, an
independent latent variable (home environment) inuences a dependent
latent variable (achievement), where both types of latent variables are
measured, dened, or inferred by multiple observed or measured indica-
tor variables.
1.2 History of Structural Equation Modeling
To discuss the history of structural equation modeling, we explain the following four types of related models and their chronological order of development: regression, path, confirmatory factor, and structural equation
models.
The first model involves linear regression models that use a correlation
coefficient and the least squares criterion to compute regression weights.
Regression models were made possible because Karl Pearson created a
formula for the correlation coefficient in 1896 that provides an index for
the relationship between two variables (Pearson, 1938). The regression
model permits the prediction of dependent observed variable scores
(Y scores), given a linear weighting of a set of independent observed
scores (X scores) that minimizes the sum of squared residual error values. The mathematical basis for the linear regression model is found in
basic algebra. Regression analysis provides a test of a theoretical model
that may be useful for prediction (e.g., admission to graduate school or
budget projections). In an example study, regression analysis was used
to predict student exam scores in statistics (dependent variable) from a
series of collaborative learning group assignments (independent variables; Delucchi, 2006). The results provided some support for collaborative learning groups improving statistics exam performance, although
not for all tasks.
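The least squares criterion described above can be sketched numerically. The following is a minimal illustration (not from the book) using Python with NumPy; the data and variable names are hypothetical, loosely echoing the exam-scores example:

```python
import numpy as np

# Hypothetical data: predict statistics exam scores (Y)
# from hours spent on collaborative group assignments (X).
X = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
Y = np.array([65.0, 70.0, 74.0, 81.0, 85.0])

# Design matrix with an intercept column, so Y_hat = b0 + b1 * X.
A = np.column_stack([np.ones_like(X), X])

# Least squares criterion: choose the weights that minimize
# the sum of squared residuals, sum((Y - Y_hat) ** 2).
(b0, b1), *_ = np.linalg.lstsq(A, Y, rcond=None)

Y_hat = b0 + b1 * X
sse = float(np.sum((Y - Y_hat) ** 2))
```

Here `np.linalg.lstsq` returns the intercept and slope that minimize the sum of squared residuals; any other pair of weights would produce a larger `sse`.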
Some years later, Charles Spearman (1904, 1927) used the correlation
coefficient to determine which items correlated or went together to create
the factor model. His basic idea was that if a set of items correlated or
went together, individual responses to the set of items could be summed
to yield a score that would measure, define, or infer a construct. Spearman
was the first to use the term factor analysis in defining a two-factor construct for a theory of intelligence. D. N. Lawley and L. L. Thurstone in 1940
further developed applications of factor models, and proposed instruments (sets of items) that yielded observed scores from which constructs
could be inferred. Most of the aptitude, achievement, and diagnostic
tests, surveys, and inventories in use today were created using factor analytic techniques. The term confirmatory factor analysis (CFA) is used today
based in part on earlier work by Howe (1955), Anderson and Rubin (1956),
and Lawley (1958). The CFA method was more fully developed by Karl
Jöreskog in the 1960s to test whether a set of items defined a construct.
Jöreskog completed his dissertation in 1963, published the first article on
CFA in 1969, and subsequently helped develop the first CFA software program. Factor analysis has been used for over 100 years to create measurement instruments in many academic disciplines, while today CFA is used
to test the existence of these theoretical constructs. In an example study,
CFA was used to confirm the “Big Five” model of personality by Goldberg
(1990). The five-factor model of extraversion, agreeableness, conscientiousness, neuroticism, and intellect was confirmed through the use of multiple
indicator variables for each of the five hypothesized factors.
Sewell Wright (1918, 1921, 1934), a biologist, developed the third type of
model, a path model. Path models use correlation coefficients and regression analysis to model more complex relationships among observed
variables. The first applications of path analysis dealt with models of
animal behavior. Unfortunately, path analysis was largely overlooked
until econometricians reconsidered it in the 1950s as a form of simultaneous equation modeling (e.g., H. Wold) and sociologists rediscovered it in
the 1960s (e.g., O. D. Duncan and H. M. Blalock). In many respects, path
analysis involves solving a set of simultaneous regression equations that
theoretically establish the relationship among the observed variables in
the path model. In an example path analysis study, Walberg’s theoretical
model of educational productivity was tested for fifth- through eighth-grade students (Parkerson et al., 1984). The relations among the following variables were analyzed in a single model: home environment, peer
group, media, ability, social environment, time on task, motivation, and
instructional strategies. All of the hypothesized paths among those variables were shown to be statistically significant, providing support for the
educational productivity model.
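Since path analysis can be viewed as solving a set of simultaneous regression equations, one per dependent variable, a minimal sketch is possible with ordinary least squares. The following Python/NumPy example (not from the book) simulates data for a hypothetical three-variable path model; all variable names and path coefficients are invented for illustration:

```python
import numpy as np

# Hypothetical path model: home_env -> motivation -> achievement,
# plus a direct home_env -> achievement path.
rng = np.random.default_rng(42)
n = 2000
home_env = rng.normal(size=n)
motivation = 0.5 * home_env + rng.normal(size=n)
achievement = 0.4 * motivation + 0.3 * home_env + rng.normal(size=n)

def ols_slopes(y, *xs):
    """Least squares slopes for one regression equation (intercept dropped)."""
    A = np.column_stack([np.ones(len(y)), *xs])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]

# Path analysis: one regression equation per dependent (endogenous) variable.
(p_home_motiv,) = ols_slopes(motivation, home_env)
p_motiv_ach, p_home_ach = ols_slopes(achievement, motivation, home_env)
```

Each estimated slope approximates the path coefficient used to generate the data (0.5, 0.4, and 0.3, within sampling error), illustrating how the simultaneous equations recover the relationships among the observed variables.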
The final model type is structural equation modeling (SEM). SEM models essentially combine path models and confirmatory factor models;
that is, SEM models incorporate both latent and observed variables. The
early development of SEM models was due to Karl Jöreskog (1969, 1973),
Ward Keesling (1972), and David Wiley (1973); this approach was initially
known as the JKW model, but became known as the linear structural relations model (LISREL) with the development of the first software program,
LISREL, in 1973. Since then, many SEM articles have been published; for
example, Shumow and Lomax (2002) tested a theoretical model of parental efficacy for adolescent students. For the overall sample, neighborhood
quality predicted parental efficacy, which predicted parental involvement
and monitoring, both of which predicted academic and social-emotional
adjustment.
Jöreskog and van Thillo originally developed the LISREL software program at the Educational Testing Service (ETS) using a matrix command
language (i.e., involving Greek and matrix notation), which is described
in chapter 17. The first publicly available version, LISREL III, was released
in 1976. Later, in 1993, LISREL 8 was released; it introduced the SIMPLIS
(SIMPle LISrel) command language in which equations are written
using variable names. In 1999, the first interactive version of LISREL was
released. LISREL 8 introduced the dialog box interface using pull-down
menus and point-and-click features to develop models, and the path diagram mode, a drawing program to develop models. Karl Jöreskog was recognized by Cudeck, Du Toit, and Sörbom (2001), who edited a Festschrift
in honor of his contributions to the field of structural equation modeling.
Their volume contains chapters by scholars who address the many topics, concerns, and applications in the field of structural equation modeling today, including milestones in factor analysis; measurement models;
robustness, reliability, and fit assessment; repeated measurement designs;
ordinal data; and interaction models. We cover many of these topics in
this book, although not in as great a depth. The field of structural equation modeling across all disciplines has expanded since 1994. Hershberger
(2003) found that between 1994 and 2001 the number of journal articles
concerned with SEM increased, the number of journals publishing SEM
research increased, SEM became a popular choice amongst multivariate
methods, and the journal Structural Equation Modeling became the primary
source for technical developments in structural equation modeling.
1.3 Why Conduct Structural Equation Modeling?
Why is structural equation modeling popular? There are at least four
major reasons for the popularity of SEM. The first reason suggests that
researchers are becoming more aware of the need to use multiple observed
variables to better understand their area of scientific inquiry. Basic statistical methods only utilize a limited number of variables, which are not
capable of dealing with the sophisticated theories being developed. The
use of a small number of variables to understand complex phenomena is
limiting. For instance, the use of simple bivariate correlations is not sufficient for examining a sophisticated theoretical model. In contrast, structural equation modeling permits complex phenomena to be statistically
modeled and tested. SEM techniques are therefore becoming the preferred
method for confirming (or disconfirming) theoretical models in a quantitative fashion.
A second reason involves the greater recognition given to the validity and reliability of observed scores from measurement instruments.
Specifically, measurement error has become a major issue in many disciplines, but measurement error and statistical analysis of data have
been treated separately. Structural equation modeling techniques explicitly take measurement error into account when statistically analyzing
data. As noted in subsequent chapters, SEM analysis includes latent and
observed variables as well as measurement error terms in certain SEM
models.
A third reason pertains to how structural equation modeling has matured
over the past 30 years, especially the ability to analyze more advanced theoretical SEM models. For example, group differences in theoretical models
can be assessed through multiple-group SEM models. In addition, analyzing educational data collected at more than one level—for example, school
districts, schools, and teachers with student data—is now possible using
multilevel SEM modeling. As a final example, interaction terms can now
be included in an SEM model so that main effects and interaction effects
can be tested. These advanced SEM models and techniques have provided
many researchers with an increased capability to analyze sophisticated
theoretical models of complex phenomena, thus requiring less reliance on
basic statistical methods.
Finally, SEM software programs have become increasingly user-friendly. For example, until 1993 LISREL users had to input the program syntax for their models using Greek and matrix notation. At
that time, many researchers sought help because of the complex programming requirement and knowledge of the SEM syntax that was
required. Today, most SEM software programs are Windows-based
and use pull-down menus or drawing programs to generate the program syntax internally. Therefore, the SEM software programs are now
easier to use and contain features similar to other Windows-based
software packages. However, such ease of use necessitates statistical training in SEM modeling and software via courses, workshops,
or textbooks to avoid mistakes and errors in analyzing sophisticated
theoretical models.
1.4 Structural Equation Modeling Software Programs
Although the LISREL program was the first SEM software program,
other software programs have subsequently been developed since the
mid-1980s. Some of the other programs include AMOS, EQS, Mx, Mplus,
Ramona, and Sepath, to name a few. These software programs are each
unique in their own way, with some offering specialized features for
conducting different SEM applications. Many of these SEM software
programs provide statistical analysis of raw data (e.g., means, correlations, missing data conventions), provide routines for handling missing
data and detecting outliers, generate the program’s syntax, diagram the
model, and provide for import and export of data and figures of a theoretical model. Also, many of the programs come with sets of data and
program examples that are clearly explained in their user guides. Many
of these software programs have been reviewed in the journal Structural
Equation Modeling.
The pricing information for SEM software varies depending on individual, group, or site license arrangements; corporate versus educational settings; and even whether one is a student or faculty member.
Furthermore, newer versions and updates necessitate changes in pricing. Most programs will run in the Windows environment; some run
on Macintosh personal computers. We are often asked to recommend
a software package to a beginning SEM researcher; however, given the
different individual needs of researchers and the multitude of different
features available in these programs, we are not able to make such a recommendation. Ultimately the decision depends upon the researcher’s
needs and preferences. Consequently, with so many software packages,
we felt it important to narrow our examples in the book to LISREL–
SIMPLIS programs.
We will therefore be using the LISREL 8.8 student version in the book
to demonstrate the many different SEM applications, including regression models, path models, confirmatory factor models, and the various
SEM models in chapters 13 through 16. The free student version of the
LISREL software program (Windows, Mac, and Linux editions) can be
downloaded from the website: http://www.ssicentral.com/lisrel/student.
html. (Note: The LISREL 8.8 Student Examples folder is placed in the main
directory C:/ of your computer, not the LISREL folder under C:/Program
Files, when installing the software.)
Once the LISREL software is downloaded, place an icon on your desktop by creating a shortcut to the LISREL icon. The LISREL icon should
look something like this:
LISREL 8.80 Student.lnk
When you click on the icon, an empty dialog box will appear that should
look like this:
NOTE: Nothing appears until you open a program file or data set using
the File or open folder icon; more about this in the next chapter.
We do want to mention the very useful HELP menu. Click on the question mark (?), and a HELP menu will appear; then enter Output Questions in
the search window to find answers to key questions you may have when
going over examples in the Third Edition.
1.5 Summary
In this chapter we introduced structural equation modeling by describing basic types of variables—that is, latent, observed, independent, and
dependent—and basic types of SEM models—that is, regression, path,
confirmatory factor, and structural equation models. In addition, a brief
history of structural equation modeling was provided, followed by a discussion of the importance of SEM. This chapter concluded with a brief
listing of the different structural equation modeling software programs
and where to obtain the LISREL 8.8 student version for use with examples
in the book, including what the dialog box will first appear like and a very
useful HELP menu.
In chapter 2 we consider the importance of examining data for issues
related to measurement level (nominal, ordinal, interval, or ratio), restriction of range (fewer than 15 categories), missing data, outliers (extreme
values), linearity or nonlinearity, and normality or nonnormality, all of
which can affect statistical methods, and especially SEM applications.
Exercises
1. Dene the following terms:
a. Latent variable
b. Observed variable
c. Dependent variable
d. Independent variable
2. Explain the difference between a dependent latent variable and
a dependent observed variable.
3. Explain the difference between an independent latent variable
and an independent observed variable.
4. List the reasons why a researcher would conduct structural
equation modeling.
5. Download and activate the student version of LISREL: http://
www.ssicentral.com
6. Open and import an SPSS or other data file.
References
Anderson, T. W., & Rubin, H. (1956). Statistical inference in factor analysis. In
J. Neyman (Ed.), Proceedings of the third Berkeley symposium on mathemati-
cal statistics and probability, Vol. V (pp. 111–150). Berkeley: University of
California Press.
Cudeck, R., Du Toit, S., & Sörbom, D. (Eds.). (2001). Structural equation modeling:
Present and future. A Festschrift in honor of Karl Jöreskog. Lincolnwood, IL:
Scientific Software International.
Delucchi, M. (2006). The efficacy of collaborative learning groups in an under-
graduate statistics course. College Teaching, 54, 244–248.
Goldberg, L. (1990). An alternative “description of personality”: Big Five factor
structure. Journal of Personality and Social Psychology, 59, 1216–1229.
Hershberger, S. L. (2003). The growth of structural equation modeling: 1994–2001.
Structural Equation Modeling, 10(1), 35–46.
Howe, W. G. (1955). Some contributions to factor analysis (Report No. ORNL-1919).
Oak Ridge National Laboratory, Oak Ridge, Tennessee.
Jöreskog, K. G. (1963). Statistical estimation in factor analysis: A new technique and its
foundation. Stockholm: Almqvist & Wiksell.
Y102005.indb 11 3/22/10 3:25:11 PM
12 A Beginners Guide to Structural Equation Modeling
Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood
factor analysis. Psychometrika, 34, 183–202.
Jöreskog, K. G. (1973). A general method for estimating a linear structural equation
system. In A. S. Goldberger & O. D. Duncan (Eds.), Structural equation models
in the social sciences (pp. 85–112). New York: Seminar.
Keesling, J. W. (1972). Maximum likelihood approaches to causal flow analysis.
Unpublished doctoral dissertation. Chicago: University of Chicago.
Lawley, D. N. (1958). Estimation in factor analysis under various initial assump-
tions. British Journal of Statistical Psychology, 11, 1–12.
Parkerson, J. A., Lomax, R. G., Schiller, D. P., & Walberg, H. J. (1984). Exploring
causal models of educational achievement. Journal of Educational Psychology,
76, 638–646.
Pearson, E. S. (1938). Karl Pearson. An appreciation of some aspects of his life and work.
Cambridge: Cambridge University Press.
Shumow, L., & Lomax, R. G. (2002). Parental efficacy: Predictor of parenting behav-
ior and adolescent outcomes. Parenting: Science and Practice, 2, 127–150.
Spearman, C. (1904). The proof and measurement of association between two
things. American Journal of Psychology, 15, 72–101.
Spearman, C. (1927). The abilities of man. New York: Macmillan.
Wiley, D. E. (1973). The identification problem for structural equation models with
unmeasured variables. In A. S. Goldberger & O. D. Duncan (Eds.), Structural
equation models in the social sciences (pp. 69–83). New York: Seminar.
Wright, S. (1918). On the nature of size factors. Genetics, 3, 367–374.
Wright, S. (1921). Correlation and causation. Journal of Agricultural Research, 20,
557–585.
Wright, S. (1934). The method of path coefficients. Annals of Mathematical Statistics,
5, 161–215.
2
Data Entry and Data Editing Issues
Key Concepts
Importing data file
System file
Measurement scale
Restriction of range
Missing data
Outliers
Linearity
Nonnormality
An important first step in using LISREL is to be able to enter raw data and/
or import data, such as files from other programs (SPSS, SAS, EXCEL, etc.).
Other important steps involve being able to use LISREL–PRELIS to save
a system file, as well as output and save files that contain the variance–
covariance matrix, the correlation matrix, means, and standard deviations
of variables so they can be input into command syntax programs. The
LISREL–PRELIS program will be briefly explained in this chapter to dem-
onstrate how it handles raw data entry, importing of data, and the output
of saved files.
There are several key issues in the field of statistics that impact our
analyses once data have been imported into a software program. These data
issues are commonly referred to as the measurement scale of variables,
restriction in the range of data, missing data values, outliers, linearity, and
nonnormality. Each of these data issues will be discussed because they
not only affect traditional statistics, but present additional problems and
concerns in structural equation modeling.
We use LISREL software throughout the book, so you will need to use
that software and become familiar with its Web site. You should have
downloaded by now the free student version of the LISREL software.
We use some of the data and model examples available in the free stu-
dent version to illustrate SEM applications. (Note: The LISREL 8.8 Student
Examples folder is placed in the main directory C:/ of your computer.)
The free student version of the software has a user guide, help functions,
and tutorials. The Web site also contains important research, documenta-
tion, and information about structural equation modeling. However, be
aware that the free student version of the software does not contain the
full capabilities available in their full licensed version (e.g., restricted to
15 observed variables in SEM analyses). These limitations are spelled out
on their Web site.
2.1 Data Entry
The LISREL software program interfaces with PRELIS, a preprocessor of
data prior to running LISREL (matrix command language) or SIMPLIS
(easier-to-use variable name syntax) programs. The newer Interactive
LISREL uses a spreadsheet format for data with pull-down menu options.
LISREL offers several different options for inputting data and importing
files from numerous other programs. The New, Open, and Import Data
functions provide maximum flexibility for inputting data.
The New option permits the creation of a command syntax language
program (PRELIS, LISREL, or SIMPLIS) to read in a PRELIS data file, or
to open SIMPLIS and LISREL saved projects as well as a previously saved
Path Diagram.
The Open option permits you to browse and locate previously saved
PRELIS (.pr2), LISREL (.ls8), or SIMPLIS (.spl) programs, each with its
unique file extension. The student version has distinct folders containing
several program examples, for example LISREL (LS8EX folder), PRELIS
(PR2EX folder), and SIMPLIS (SPLEX folder).
The Import Data option permits inputting raw data files or SPSS
saved files. The raw data file, lsat6.dat, is in the PRELIS folder (PR2EX).
When selecting this file, you will need to know the number of variables
in the file.
An SPSS saved file, data100.sav, is in the SPSS folder (SPSSEX). Once you
open this file, a PRELIS system file is created.
Once the PRELIS system file becomes active, then it needs to be saved for
future use. (Note: A # symbol may appear if columns are too narrow; simply use
your mouse to expand the columns so that the missing values—999999.00
will appear. Also, if you right-mouse click on the variable names, a menu
appears to define missing values, etc.) The PRELIS system file (.psf) acti-
vates a pull-down menu that permits data editing features, data transfor-
mations, statistical analysis of data, graphical display of data, multilevel
modeling, and many other related features.
The statistical analysis of data includes factor analysis, probit regres-
sion, least squares regression, and two-stage least squares methods.
Other important data editing features include imputing missing values,
a homogeneity test, creation of normal scores, bootstrapping, and data
output options. The data output options permit saving different types of
variance–covariance matrices and descriptive statistics in files for use in
LISREL and SIMPLIS command syntax programs. This capability is very
important, especially when advanced SEM models are analyzed in chap-
ters 13 to 16. We will demonstrate the use of this Output Options dialog
box in this chapter and in some of our other chapter examples.
2.2 Data Editing Issues
2.2.1 Measurement Scale
How variables are measured or scaled influences the type of statistical
analyses we perform (Anderson, 1961; Stevens, 1946). Properties of scale
also guide our understanding of permissible mathematical operations.
For example, a nominal variable implies mutually exclusive groups;
biological gender has two mutually exclusive groups, male and female.
An individual can only be in one of the groups that define the levels
of the variable. In addition, it would not be meaningful to calculate a
mean and a standard deviation on the variable gender. Consequently,
the number or percentage of individuals at each level of the gender
variable is the only mathematical property of scale that makes sense.
An ordinal variable, for example, attitude toward school, that is scaled
strongly agree, agree, neutral, disagree, and strongly disagree, implies mutu-
ally exclusive categories that are ordered or ranked. When levels of a
variable have properties of scale that involve mutually exclusive groups
that are ordered, only certain mathematical operations are meaning-
ful, for example, a comparison of ranks between groups. SEM final
exam scores, an example of an interval variable, possesses the property
of scale, implying equal intervals between the data points, but no true
zero point. This property of scale permits the mathematical operation
of computing a mean and a standard deviation. Similarly, a ratio vari-
able, for example, weight, has the property of scale that implies equal
intervals and a true zero point (weightlessness). Therefore, ratio vari-
ables also permit mathematical operations of computing a mean and
a standard deviation. Our use of different variables requires us to be
aware of their properties of scale and what mathematical operations
are possible and meaningful, especially in SEM, where variance–
covariance (correlation) matrices are used with means and standard
deviations of variables. Different correlations among variables are
therefore possible depending upon the level of measurement, but they
create unique problems in SEM (see chapter 3). PRELIS designates con-
tinuous variables (CO), ordinal variables (OR), and categorical vari-
ables (CL) to make these distinctions.
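The point about permissible operations can be sketched in a few lines of Python (the gender codes and frequencies here are invented for illustration, not taken from PRELIS); for a nominal variable, counts and percentages are the only meaningful summaries:

```python
from collections import Counter

# Nominal variable: the numeric codes are labels only (1 = male, 2 = female),
# so a mean of these codes would be meaningless.
gender = [1, 2, 2, 1, 2, 2, 1, 2]

counts = Counter(gender)
pct_female = 100 * counts[2] / len(gender)
print(dict(counts), pct_female)
```

The frequency count and percentage per level are the defensible summaries; a mean or standard deviation of the codes would change if the arbitrary labels were swapped.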
2.2.2 Restriction of Range
Data values at the interval or ratio level of measurement can be further
defined as being discrete or continuous. For example, SEM final exam
scores could be reported in whole numbers (discrete). Similarly, the
number of children in a family would be considered a discrete level of
measurement—for example, 5 children. In contrast, a continuous variable is
reported using decimal values; for example, a student's grade point aver-
age would be reported as 3.75 on a 5-point scale.
Karl Jöreskog (1996) provided a criterion in the PRELIS program based
on his research that defines whether a variable is ordinal or interval,
based on the presence of 15 distinct scale points. If a variable has fewer
than 15 categories or scale points, it is referenced in PRELIS as ordi-
nal (OR), whereas a variable with 15 or more categories is referenced as
continuous (CO). This 15-point criterion allows Pearson correlation
coefficient values to vary between +/−1.0. Variables with fewer distinct scale
points restrict the value of the Pearson correlation coefficient such that it
may only vary between +/−0.5. Other factors that affect the Pearson cor-
relation coefficient are presented in this chapter and discussed further
in chapter 3.
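The effect of restricted range on the Pearson coefficient can be demonstrated with a short Python simulation (not part of PRELIS; the data are randomly generated for illustration):

```python
import random

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

random.seed(1)
# Two correlated interval-level variables (population correlation about .71).
x = [random.gauss(0, 1) for _ in range(2000)]
y = [a + random.gauss(0, 1) for a in x]
r_full = pearson(x, y)

# Restrict the range: keep only cases scoring high on x,
# as if only high scorers had been sampled.
kept = [(a, b) for a, b in zip(x, y) if a > 1.0]
r_restricted = pearson([a for a, _ in kept], [b for _, b in kept])

print(round(r_full, 2), round(r_restricted, 2))
```

In the restricted subsample the observed correlation drops substantially even though the underlying relation is unchanged, which is why restriction of range is a data screening concern.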
2.2.3 Missing Data
The statistical analysis of data is affected by missing data values in vari-
ables. That is, not every subject has an actual value for every variable in
the dataset, as some values are missing. It is common practice in statis-
tical packages to have default values for handling missing values. The
researcher has the options of deleting subjects who have missing values,
replacing the missing data values, or using robust statistical procedures
that accommodate for the presence of missing data.
The various SEM software handle missing data differently and have
different options for replacing missing data values. Table 2.1 lists many
of the various options for dealing with missing data. These options can
dramatically affect the number of subjects available for analysis, the
magnitude and direction of the correlation coefficient, or create problems
if means, standard deviations, and correlations are computed based on
different sample sizes. The Listwise deletion of cases and Pairwise dele-
tion of cases are not always recommended options due to the possibil-
ity of losing a large number of subjects, thus dramatically reducing the
sample size. Mean substitution works best when only a small number
of missing values is present in the data, whereas regression imputation
provides a useful approach with a moderate amount of missing data.
In LISREL–PRELIS the expectation maximization (EM), Monte Carlo
Markov Chain (MCMC), and matching response pattern approaches
are recommended when larger amounts of data are missing at random.
TABLE 2.1
Options for Dealing with Missing Data
Listwise                         Delete subjects with missing data on any variable
Pairwise                         Delete subjects with missing data on each pair of variables used
Mean substitution                Substitute the mean for missing values of a variable
Regression imputation            Substitute a predicted value for the missing value of a variable
Expectation maximization (EM)    Find expected value based on expectation maximization algorithm
Matching response pattern        Match cases with incomplete data to cases with complete data to determine a missing value
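A minimal Python sketch (using an invented toy data set, not LISREL output) contrasts the first table option, listwise deletion, with mean substitution:

```python
# Toy dataset of (var1, var2) pairs; None marks a missing value on var2.
data = [
    (10, 20), (12, None), (14, 28), (16, None),
    (18, 34), (20, 40), (22, 44),
]

# Listwise deletion: drop any case with a missing value.
listwise = [(x, y) for x, y in data if y is not None]

# Mean substitution: replace missing values with the observed mean.
observed = [y for _, y in data if y is not None]
mean_y = sum(observed) / len(observed)
mean_sub = [(x, y if y is not None else mean_y) for x, y in data]

print(len(listwise), len(mean_sub), round(mean_y, 1))
```

Listwise deletion here loses two of the seven cases, while mean substitution keeps all seven; the trade-off is that substituting a constant shrinks the variance of the variable, which is one reason it is recommended only when few values are missing.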
More information about missing data is available in resources such as
Enders (2006), McKnight, McKnight, Sidani and Aurelio (2007), and
Peng, Harwell, Liou, and Ehman (2007). Davey and Savla (2010) have
more recently published an excellent book with SAS, SPSS, STATA, and
Mplus source programs to handle missing data in SEM in the context of
power analysis.
2.2.4 LISREL–PRELIS Missing Data Example
Imputation of missing values is possible for a single variable (Impute
Missing Values) or several variables simultaneously (Multiple Imputation)
by selecting Statistics from the tool bar menu. The Impute Missing Values
option uses the matching response pattern approach. The value to be sub-
stituted for the missing value of a single case is obtained from another
case (or cases) having a similar response pattern over a set of matching
variables. In data sets where missing values occur on more than one vari-
able, you can use multiple imputation of missing values with mean sub-
stitution, delete cases, or leave the variables with dened missing values
as options in the dialog box. In addition, the Multiple Imputation option
uses either the expectation maximization algorithm (EM) or Monte Carlo
Markov Chain (MCMC, generating random draws from probability dis-
tributions via Markov chains) approaches to replacing missing values
across multiple variables.
We present an example from LISREL–PRELIS involving the choles-
terol levels for 28 patients treated for heart attacks. We assume the data
to be missing at random (MAR) with an underlying multivariate normal
distribution. Cholesterol levels were measured after 2 days (VAR1), after
4 days (VAR2), and after 14 days (VAR3), but were only complete for 19
of the 28 patients. The data are shown from the PRELIS System File,
chollev.psf. The PRELIS system file was created by selecting File, Import
Data, and selecting the raw data file chollev.raw located in the Tutorial
folder [C:\LISREL 8.8 Student Examples\Tutorial]. We must know the
number of variables in the raw data file. We must also select Data, then Define
Variables, and then select 9.00 as the missing value for the VAR3 variable
[Optionally, right mouse click on VAR1 in the PRELIS chollev file].
We now click on Statistics on the tool bar menu and select Impute
Missing Values from the pull-down menu.
We next select Output Options and save the transformed data in a new
PRELIS system file, cholnew.psf, and output the new correlation matrix,
mean, and standard deviation files.
We should examine our data both before (Table 2.2) and after (Table 2.3)
imputation of missing values. Here, we used the matching response pat-
tern method. This comparison provides us with valuable information
about the nature of the missing data.
We can also view our new transformed PRELIS System File, cholnew.psf,
to verify that the missing values were in fact replaced; for example, VAR3
has values replaced for Case 2 = 204, Case 4 = 142, Case 5 = 182, Case 10 =
280, and so on.
TABLE 2.2
Data Before Imputation of Missing Values

Number of Missing Values per Variable
      VAR1      VAR2      VAR3
         0         0         9

Distribution of Missing Values
Total Sample Size = 28
Number of Missing Values      0      1
Number of Cases              19      9

Effective Sample Sizes
Univariate (in Diagonal) and Pairwise Bivariate (off Diagonal)
          VAR1      VAR2      VAR3
VAR1        28
VAR2        28        28
VAR3        19        19        19

Percentage of Missing Values
Univariate (in Diagonal) and Pairwise Bivariate (off Diagonal)
          VAR1      VAR2      VAR3
VAR1      0.00
VAR2      0.00      0.00
VAR3     32.14     32.14     32.14

Correlation Matrix
          VAR1      VAR2      VAR3
VAR1     1.000
VAR2     0.673     1.000
VAR3     0.395     0.665     1.000

Means
      VAR1      VAR2      VAR3
   253.929   230.643   221.474

Standard Deviations
      VAR1      VAR2      VAR3
    47.710    46.967    43.184
We have noticed that selecting matching variables with a higher cor-
relation to the variable with missing values provides better imputed
values for the missing data. We highly recommend comparing any anal-
yses before and after the replacement of missing data values to fully
understand the impact missing data values have on the parameter estimates
and standard errors. LISREL–PRELIS also permits replacement
of missing values using the EM and MCMC approaches, which may be
practical when matching sets of variables are not possible. A comparison
of EM and MCMC is also warranted in multiple imputations to deter-
mine the effect of using a different algorithm on the replacement of miss-
ing values.

TABLE 2.3
Data After Imputation of Missing Values

Number of Missing Values per Variable
      VAR1      VAR2      VAR3
         0         0         9

Imputations for VAR3
Case  2 imputed with value 204 (Variance Ratio = 0.000), NM = 1
Case  4 imputed with value 142 (Variance Ratio = 0.000), NM = 1
Case  5 imputed with value 182 (Variance Ratio = 0.000), NM = 1
Case 10 imputed with value 280 (Variance Ratio = 0.000), NM = 1
Case 13 imputed with value 248 (Variance Ratio = 0.000), NM = 1
Case 16 imputed with value 256 (Variance Ratio = 0.000), NM = 1
Case 18 imputed with value 216 (Variance Ratio = 0.000), NM = 1
Case 23 imputed with value 188 (Variance Ratio = 0.000), NM = 1
Case 25 imputed with value 256 (Variance Ratio = 0.000), NM = 1

Number of Missing Values per Variable After Imputation
      VAR1      VAR2      VAR3
         0         0         0

Total Sample Size = 28

Correlation Matrix
          VAR1      VAR2      VAR3
VAR1     1.000
VAR2     0.673     1.000
VAR3     0.404     0.787     1.000

Means
      VAR1      VAR2      VAR3
   253.929   230.643   220.714

Standard Deviations
      VAR1      VAR2      VAR3
    47.710    46.967    42.771
2.2.5 Outliers
Outliers or influential data points can be defined as data values that are
extreme or atypical on either the independent (X variables) or dependent
(Y variables) variables or both. Outliers can occur as a result of observa-
tion errors, data entry errors, instrument errors based on layout or instruc-
tions, or actual extreme values from self-report data. Because outliers
affect the mean, the standard deviation, and correlation coefficient values,
they must be explained, deleted, or accommodated by using robust sta-
tistics. Sometimes, additional data will need to be collected to fill in the
gap along either the Y or X axes. LISREL–PRELIS has outlier detection
methods available that include the following: box plot display, scatterplot,
histogram, and frequency distributions.
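A brief Python sketch (with invented numbers only, not one of the PRELIS detection methods) shows how a single data entry error distorts the standard deviation and the correlation coefficient:

```python
def mean(v): return sum(v) / len(v)

def sd(v):
    """Sample standard deviation."""
    m = mean(v)
    return (sum((a - m) ** 2 for a in v) / (len(v) - 1)) ** 0.5

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]   # perfectly linear: r = 1.0
r_clean = pearson(x, y)

# One data entry error: the last value 20 mistyped as 200.
y_out = y[:-1] + [200]
r_out = pearson(x, y_out)

print(round(r_clean, 2), round(sd(y), 1), round(sd(y_out), 1), round(r_out, 2))
```

The single outlier inflates the standard deviation of y by roughly an order of magnitude and pulls the correlation well away from 1.0, which is why outliers must be explained, deleted, or accommodated before an SEM analysis.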
2.2.6 Linearity
Some statistical techniques, such as SEM, assume that the variables are lin-
early related to one another. Thus, a standard practice is to visualize the
coordinate pairs of data points of two continuous variables by plotting the
data in a scatterplot. These bivariate plots depict whether the data are lin-
early increasing or decreasing. The presence of curvilinear data reduces the
magnitude of the Pearson correlation coefficient, even resulting in the pres-
ence of a zero correlation. Recall that the Pearson correlation value indicates
the magnitude and direction of the linear relationships between two vari-
ables. Figure 2.1 shows the importance of visually displaying the bivariate
data scatterplot.
FIGURE 2.1
Left: correlation is linear. Right: correlation is nonlinear.
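The same point can be made numerically; in this short Python sketch (with invented data), a perfectly U-shaped curvilinear relation yields a Pearson coefficient of zero:

```python
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]   # perfect curvilinear (U-shaped) relation

print(pearson(x, y))      # 0.0: Pearson detects no *linear* relation
```

The variables are perfectly related, yet the Pearson coefficient is zero, so a scatterplot (as in Figure 2.1) is the only way to see the relation.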
2.2.7 Nonnormality
In basic statistics, several transformations are given to handle issues with
nonnormal data. Some of these common transformations are in Table 2.4.
Inferential statistics often rely on the assumption that the data are nor-
mally distributed. Data that are skewed (lack of symmetry) or more fre-
quently occurring along one part of the measurement scale will affect the
variance–covariance among variables. In addition, kurtosis (peakedness)
in data will impact statistics. Leptokurtic data values are more peaked than
the normal distribution, whereas platykurtic data values are flatter and
more dispersed along the X axis, but have a consistent low frequency on
the Y axis—that is, the frequency distribution of the data appears more
rectangular in shape.
Nonnormal data can occur because of the scaling of variables (e.g.,
ordinal rather than interval) or the limited sampling of subjects. Possible
solutions for skewness are to resample more participants or to perform a
linear transformation as outlined above. Our experience is that a probit
data transformation works best in correcting skewness. Kurtosis in data
is more difficult to resolve; some possible solutions in LISREL–PRELIS
include additional sampling of subjects, or the use of bootstrap meth-
ods, normalizing scores, or alternative methods of estimation (e.g., WLS
or ADF).
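As a rough illustration of how a transformation from Table 2.4 can correct skewness, the following Python sketch (with an invented positively skewed sample) compares skewness before and after a log transformation:

```python
import math

def skewness(v):
    """Sample skewness: the third standardized moment."""
    n = len(v)
    m = sum(v) / n
    s = (sum((a - m) ** 2 for a in v) / n) ** 0.5
    return sum(((a - m) / s) ** 3 for a in v) / n

# Positively skewed data (a long right tail, as in reaction times or income).
x = [1, 1, 2, 2, 3, 3, 4, 5, 8, 20, 60]
x_log = [math.log(a) for a in x]

print(round(skewness(x), 2), round(skewness(x_log), 2))
```

The log transform pulls in the long right tail, so the skewness statistic moves much closer to the zero value expected under normality; the same before-and-after check can be run on any of the transformations in Table 2.4.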
The presence of skewness and kurtosis can be detected in LISREL–
PRELIS using univariate tests, multivariate tests, and measures of skew-
ness and kurtosis that are available in the pull-down menus or output.
One recommended method of handling nonnormal data is to use an
asymptotic covariance matrix as input along with the sample covariance
matrix in the LISREL–PRELIS program, as follows:
LISREL
CM = boy.cov
AC = boy.acm

SIMPLIS
Covariance matrix from file boy.cov
Asymptotic covariance matrix from file boy.acm

TABLE 2.4
Data Transformation Types

y = ln(x), y = log10(x), or y = ln(x + 0.5)   Useful with clustered data or cases where the standard deviation increases with the mean
y = sqrt(x)                                   Useful with Poisson counts
y = arcsin((x + 0.375)/(n + 0.75))            Useful with binomial proportions [0.2 < p = x/n < 0.8]
y = 1/x                                       Useful with gamma-distributed x variable
y = logit(x) = ln(x/(1 - x))                  Useful with binomial proportions x = p
y = normit(x)                                 Quantile of normal distribution for standardized x
y = probit(x) = 5 + normit(x)                 Most useful to resolve nonnormality of data

Note: probit(x) is the same as normit(x) plus 5 to avoid negative values.
We can use the asymptotic covariance matrix in two different ways: (a) as a
weight matrix when specifying the method of estimation as weighted least
squares (WLS), and (b) as a weight matrix that adjusts the normal-theory
weight matrix to correct for bias in standard errors and fit statistics. The
appropriate moment matrix in PRELIS, using OUTPUT OPTIONS, must
be selected before requesting the calculation of the asymptotic covariance
matrix.
PRELIS recognizes data as being continuous (CO), ordinal (OR), or
classes (CL), that is, gender (boy, girl). Different correlations are possible
depending upon the level of measurement. A variance–covariance matrix
with continuous variables would use Pearson correlations, while ordinal
variables would use Tetrachoric correlations. If skewed nonnormal data
is present, then consider a linear transformation using Probit. In SEM,
researchers typically output and use an asymptotic variance–covariance
matrix. When using a PRELIS data set, consider the normal score option
in the menu to correct for nonnormal variables.
2.3 Summary
Structural equation modeling is a correlation research method; therefore,
the measurement scale, restriction of range in the data values, missing
data, outliers, nonlinearity, and nonnormality of data affect the variance–
covariance among variables and thus can impact the SEM analysis.
Researchers should use the built-in menu options to examine, graph, and
test for any of these problems in the data prior to conducting any SEM
model analysis. Basically, researchers should know their data character-
istics. Data screening is a very important first step in structural equation
modeling. The next chapter illustrates in more detail issues related to the
use of correlation and variance–covariance in SEM models. There, we
provide specic examples to illustrate the importance of topics covered
in this chapter. A troubleshooting box summarizing these issues is pro-
vided in Box 2.1.
BOX 2.1 TROUBLESHOOTING TIPS

Measurement scale: Need to take the measurement scale of the variables into account when computing statistics such as means, standard deviations, and correlations.

Restriction of range: Need to consider the range of values obtained for variables, as restricted range of one or more variables can reduce the magnitude of correlations.

Missing data: Need to consider missing data on one or more subjects for one or more variables as this can affect SEM results. Cases are lost with listwise deletion, pairwise deletion is often problematic (e.g., different sample sizes), and thus modern imputation methods are recommended.

Outliers: Need to consider outliers as they can affect statistics such as means, standard deviations, and correlations. They can either be explained, deleted, or accommodated (using either robust statistics or obtaining additional data to fill in). Can be detected by methods such as box plots, scatterplots, histograms, or frequency distributions.

Linearity: Need to consider whether variables are linearly related, as nonlinearity can reduce the magnitude of correlations. Can be detected by scatterplots. Can be dealt with by transformations or deleting outliers.

Nonnormality: Need to consider whether the variables are normally distributed, as nonnormality can affect resulting SEM statistics. Can be detected by univariate tests, multivariate tests, and skewness and kurtosis statistics. Can be dealt with by transformations, additional sampling, bootstrapping, normalizing scores, or alternative methods of estimation.
Exercises
1. LISREL uses which command to import data sets?
a. File, then Export Data
b. File, then Open
c. File, then Import Data
d. File, then New
2. Dene the following levels of measurement.
a. Nominal
b. Ordinal
c. Interval
d. Ratio
3. Mark each of the following statements true (T) or false (F).
a. LISREL can deal with missing data.
b. PRELIS can deal with missing data.
c. LISREL can compute descriptive statistics.
d. PRELIS can compute descriptive statistics.
4. Explain how each of the following affects statistics:
a. Restriction of range
b. Missing data
c. Outliers
d. Nonlinearity
e. Nonnormality
References
Anderson, N. H. (1961). Scales and statistics: Parametric and non-parametric.
Psychological Bulletin, 58, 305–316.
Davey, A., & Savla, J. (2009). Statistical power analysis with missing data: A structural
equation modeling approach. New York: Routledge, Taylor & Francis Group.
Enders, C. K. (2006). Analyzing structural equation models with missing data. In
G.R. Hancock & R.O. Mueller (Eds.), Structural equation modeling: A second
course (pp. 313–342). Greenwich, CT: Information Age.
Jöreskog, K. G., & Sörbom, D. (1996). PRELIS 2: User's reference guide. Lincolnwood,
IL: Scientific Software International.
McKnight, P. E., McKnight, K. M., Sidani, S., & Aurelio, J. F. (2007). Missing data: A
gentle introduction. New York: Guilford.
Peng, C.-Y. J., Harwell, M., Liou, S.-M., & Ehman, L. H. (2007). Advances in missing
data methods and implications for educational research. In S.S. Sawilowsky
(Ed.), Real data analysis. Charlotte: Information Age.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103,
677–680.
3
Correlation
Key Concepts
Types of correlation coefficients
Factors affecting correlation
Correction for attenuation
Nonpositive definite matrices
Bivariate, part, and partial correlation
Suppressor variable
Covariance and causation
In chapter 2 we considered a number of data preparation issues in struc-
tural equation modeling. In this chapter, we move beyond data prepara-
tion in describing the important role that correlation (covariance) plays
in SEM. We also include a discussion of a number of factors that affect
correlation coefficients as well as the assumptions and limitations of cor-
relation methods in structural equation modeling.
3.1 Types of Correlation Coefficients
Sir Francis Galton conceptualized the correlation and regression proce-
dure for examining covariance in two or more traits, and Karl Pearson
(1896) developed the statistical formula for the correlation coefficient and
regression based on his suggestion (Crocker & Algina, 1986; Ferguson &
Takane, 1989; Tankard, 1984). Shortly thereafter, Charles Spearman (1904)
used the correlation procedure to develop a factor analysis technique.
The correlation, regression, and factor analysis techniques have for many
decades formed the basis for generating tests and defining constructs.
Today, researchers are expanding their understanding of the roles that
correlation, regression, and factor analysis play in theory and construct
denition to include latent variable, covariance structure, and conrma-
tory factor measurement models.
The relationships and contributions of Galton, Pearson, and Spearman
to the eld of statistics, especially correlation, regression, and factor anal-
ysis, are quite interesting (Tankard, 1984). In fact, the basis of association
between two variables—that is, correlation or covariance—has played a
major role in statistics. The Pearson correlation coefficient provides the
basis for point estimation (test of significance), explanation (variance
accounted for in a dependent variable by an independent variable), predic-
tion (of a dependent variable from an independent variable through lin-
ear regression), reliability estimates (test–retest, equivalence), and validity
(factorial, predictive, concurrent).
The Pearson correlation coefficient also provides the basis for estab-
lishing and testing models among measured and/or latent variables. The
partial and part correlations further permit the identification of specific
bivariate relationships between variables that allow for the specification
of unique variance shared between two variables while controlling for the
influence of other variables. Partial and part correlations can be tested for
significance, similar to the Pearson correlation coefficient, by simply using
the degrees of freedom, n − 2, in the standard correlation table of signifi-
cance values (Table A.3) or an F test in multiple regression which tests the
difference in R2 values between full and restricted models (Table A.5).
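The t-test form of this significance test, equivalent to using the table with df = n − 2, can be sketched in Python (the r and n values below are invented for illustration):

```python
import math

def t_for_r(r, n):
    """t statistic for testing H0: rho = 0; compare against t with df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# For example, r = .50 based on n = 27 cases (df = 25):
t = t_for_r(0.50, 27)
print(round(t, 2))   # 2.89, exceeding the two-tailed .05 critical value of about 2.06
```

Because the observed t exceeds the critical value, the correlation of .50 would be declared significantly different from zero at the .05 level.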
Although the Pearson correlation coefficient has had a major impact in
the field of statistics, other correlation coefficients have emerged depend-
ing upon the level of variable measurement. Stevens (1968) provided the
properties of scales of measurement that have become known as nominal,
ordinal, interval, and ratio. The types of correlation coefcients developed
for these various levels of measurement are categorized in Table 3.1.
TABLE 3.1
Types of Correlation Coefficients

Correlation Coefficient        Level of Measurement
Pearson product-moment         Both variables interval
Spearman rank, Kendall's tau   Both variables ordinal
Phi, contingency               Both variables nominal
Point biserial                 One variable interval, one variable dichotomous
Gamma, rank biserial           One variable ordinal, one variable nominal
Biserial                       One variable interval, one variable artificial(a)
Polyserial                     One variable interval, one variable ordinal with underlying continuity
Tetrachoric                    Both variables dichotomous (nominal, artificial(a))
Polychoric                     Both variables ordinal with underlying continuities

(a) Artificial refers to recoding variable values into a dichotomy.
Many popular computer programs, for example, SAS and SPSS, typically
do not compute all of these correlation types. Therefore, you may
need to check a popular statistics book or look around for a computer program
that will compute the type of correlation coefficient you need—for
example, the phi and point-biserial coefficients are not readily available. In
SEM analyses, the Pearson coefficient, the tetrachoric or polychoric coefficient
(for several ordinal variable pairs), and the biserial or polyserial coefficient
(for several continuous and ordinal variable pairs) are typically used (see
PRELIS for the use of Kendall's tau-c or tau-b, and canonical correlation).
LISREL permits mixture models, which use variables with both ordinal and
interval-ratio levels of measurement (chapter 15). Although SEM software
programs are now demonstrating how mixture models can be analyzed,
the use of variables with different levels of measurement has traditionally
been a problem in the field of statistics—for example, in multiple regression
and multivariate statistics.
3.2 Factors Affecting Correlation Coefficients
Given the important role that correlation plays in structural equation
modeling, we need to understand the factors that affect establishing relationships
among multivariable data points. The key factors are the level
of measurement, restriction of range in data values (variability, skewness,
kurtosis), missing data, nonlinearity, outliers, correction for attenuation,
and issues related to sampling variation, confidence intervals, effect size,
significance, sample size, and power.
3.2.1 Level of Measurement and Range of Values
Four types or levels of measurement typically define whether the characteristic
or scale interpretation of a variable is nominal, ordinal, interval, or
ratio (Stevens, 1968). In structural equation modeling, each of these types
of scaled variables can be used. However, it is not recommended that they
be included together or mixed in a correlation (covariance) matrix. Instead,
the PRELIS data output option should be used to save an asymptotic covariance
matrix for input along with the sample variance–covariance matrix
into a LISREL or SIMPLIS program.
Initially, SEM required variables measured at the interval or ratio level
of measurement, so the Pearson product-moment correlation coefficient
was used in regression, path, factor, and structural equation modeling.
The interval or ratio scaled variable values should also have a sufficient
range of score values to introduce variance (15 or more scale points). If the
range of scores is restricted, the magnitude of the correlation value is
decreased. Basically, as a group of subjects becomes more homogeneous,
score variance decreases, reducing the correlation value between the variables.
So, there must be enough variation in scores to allow a correlation
relationship to manifest itself between variables. Variables with fewer
than 15 categories are treated as ordinal variables in LISREL–PRELIS, so
if you are assuming continuous interval-level data, you will need to check
whether the variables meet this condition. Also, the use of the same scale
values for variables can help in the interpretation of results and/or relative
comparison among variables. The meaningfulness of a correlation
relationship will depend on the variables employed; hence, your theoretical
perspective is very important. You may recall from your basic statistics
course that a spurious correlation is possible when two sets of scores correlate
significantly, but their relationship is not meaningful or substantive
in nature.
If the distributions of variables are widely divergent, correlation can
also be affected, and so several data transformations are suggested by
Ferguson and Takane (1989) to provide a closer approximation to a normal,
homogeneous variance for skewed or kurtotic data. Some possible
transformations are the square root transformation (√X), the logarithmic
transformation (log X), the reciprocal transformation (1/X), and the
arcsine transformation (arcsin X). The probit transformation appears to be
most effective in handling univariate skewed data.
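The effect of these transformations on a skewed distribution can be checked directly. A minimal Python sketch, using a small hypothetical set of positively skewed scores:

```python
import math

def skewness(xs):
    """Fisher-Pearson skewness coefficient (population form, g1)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

scores  = [1, 2, 2, 3, 3, 3, 4, 5, 8, 15]      # hypothetical, right-skewed
sqrt_t  = [math.sqrt(x) for x in scores]       # square root transformation
log_t   = [math.log(x) for x in scores]        # logarithmic transformation
recip_t = [1.0 / x for x in scores]            # reciprocal transformation

# Each successive transformation pulls in the long right tail, so
# skewness(scores) > skewness(sqrt_t) > skewness(log_t).
```

With these scores the raw skewness is roughly 1.76; the square root transform reduces it to roughly 0.57 and the logarithm to roughly 0.39.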
Consequently, the type of scale used and the range of values for the
measured variables can have profound effects on your statistical analysis
(in particular, on the mean, variance, and correlation). The scale and range
of a variable's numerical values affect statistical methods, and this is no
different in structural equation modeling. The PRELIS program is available
to provide tests of normality, skewness, and kurtosis on variables
and to compute an asymptotic covariance matrix for input into LISREL if
required. The use of normal scores is also an option in PRELIS.
3.2.2 Nonlinearity
The Pearson correlation coefficient indicates the degree of linear relationship
between two variables. It is possible for two variables to show no
correlation even though they have a curvilinear relationship. Thus, the extent to which
the variables deviate from the assumption of a linear relationship will affect
the size of the correlation coefficient. It is therefore important to check for
linearity of the scores; the common method is to graph the coordinate data
points in a scatterplot. The linearity assumption should not be confused
with recent advances in testing interaction in structural equation models
discussed in chapter 16. You should also be familiar with the eta coefficient
as an index of nonlinear relationship between two variables and with the
testing of linear, quadratic, or cubic effects. Consult an intermediate statistics
text, for example, Lomax (2007), to review these basic concepts.
The heuristic data sets in Table 3.2 will demonstrate the dramatic effect
a lack of linearity has on the Pearson correlation coefficient value. In the
first data set, the Y values increase from 1 to 10, and the X values increase
from 1 to 5, then decrease from 5 to 1 (nonlinear). The result is a Pearson
correlation coefficient of r = 0; although a nonlinear relationship does exist
in the data, it is not indicated by the Pearson correlation coefficient. The
restriction of range in values can be demonstrated using the fourth heuristic
data set in Table 3.2. The Y values only range between 3 and 7, and
the X values only range from 1 to 4. The Pearson correlation coefficient is
also r = 0 for these data. The fifth data set indicates how limited sampling
can affect the Pearson coefficient. In these sample data, only three pairs
of scores are sampled, and the Pearson correlation is r = −1.0, a perfect
negative correlation.
TABLE 3.2
Heuristic Data Sets

Nonlinear Data     Complete Data     Missing Data
  Y      X           Y      X          Y      X
 1.00   1.00        8.00   6.00       8.00    --
 2.00   2.00        7.00   5.00       7.00   5.00
 3.00   3.00        8.00   4.00       8.00    --
 4.00   4.00        5.00   2.00       5.00   2.00
 5.00   5.00        4.00   3.00       4.00   3.00
 6.00   5.00        5.00   2.00       5.00   2.00
 7.00   4.00        3.00   3.00       3.00   3.00
 8.00   3.00        5.00   4.00       5.00    --
 9.00   2.00        3.00   1.00       3.00   1.00
10.00   1.00        2.00   2.00       2.00   2.00

Range of Data      Sampling Effect
  Y      X           Y      X
 3.00   1.00        8.00   3.00
 3.00   2.00        9.00   2.00
 4.00   3.00       10.00   1.00
 4.00   4.00
 5.00   1.00
 5.00   2.00
 6.00   3.00
 6.00   4.00
 7.00   1.00
 7.00   2.00
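The three effects described above can be verified directly from the Table 3.2 data. A minimal Python sketch:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Nonlinear data: X rises 1..5 then falls back to 1 as Y rises 1..10
y_nonlin = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x_nonlin = [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]
r1 = pearson_r(x_nonlin, y_nonlin)      # r = 0 despite a clear pattern

# Range of data: Y restricted to 3-7, X restricted to 1-4
y_range = [3, 3, 4, 4, 5, 5, 6, 6, 7, 7]
x_range = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2]
r2 = pearson_r(x_range, y_range)        # r = 0

# Sampling effect: only three pairs of scores sampled
y_sample = [8, 9, 10]
x_sample = [3, 2, 1]
r3 = pearson_r(x_sample, y_sample)      # r = -1.0
```

Each result matches the values discussed in the text: the curvilinear and range-restricted sets both yield r = 0, and the three sampled pairs yield a perfect negative correlation.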
3.2.3 Missing Data
A complete data set is also given in Table 3.2, where the Pearson correlation
coefficient is r = .782, p = .007, for n = 10 pairs of scores. If missing
data were present, the Pearson correlation coefficient would drop to r =
.659, p = .108, for n = 7 pairs of scores. The Pearson correlation coefficient
changes from statistically significant to not statistically significant. More
importantly, in a correlation matrix with several variables, the various
correlation coefficients could be computed on different sample sizes. If
we used listwise deletion of cases, then any variable in the data set with
a missing value would cause a subject to be deleted, possibly causing a
substantial reduction in our sample size, whereas pairwise deletion of cases
would result in different sample sizes for our correlation coefficients in
the correlation matrix.
Researchers have examined various aspects of how to handle or treat
missing data beyond our introductory example using a small heuristic
data set. One basic approach, listwise deletion, is to eliminate any observations
where some of the data are missing. Listwise deletion is not recommended
because of the loss of information on other variables and because statistical
estimates are based on a reduced sample size. Pairwise deletion excludes
data only when they are missing on the pairs of variables selected for
analysis. However, this could lead to different sample sizes for the different
correlations and related statistical estimates. A third approach, data
imputation, replaces a missing value with an estimate, for example, the
mean of the variable computed from the subjects who did report data for
that variable (Beale & Little, 1975; also see chapter 2).
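The drop from r = .782 to r = .659 can be reproduced with a short sketch that computes Pearson r on the complete pairs only, as pairwise deletion would for this pair of variables (missing values are coded as None):

```python
import math

def pairwise_r(x, y):
    """Pearson r computed on the complete pairs only; None marks
    a missing value, and any pair containing one is excluded."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

# Complete and missing data sets from Table 3.2 (None = missing X)
y_scores   = [8, 7, 8, 5, 4, 5, 3, 5, 3, 2]
x_complete = [6, 5, 4, 2, 3, 2, 3, 4, 1, 2]
x_missing  = [None, 5, None, 2, 3, 2, 3, None, 1, 2]

r_complete = pairwise_r(x_complete, y_scores)   # about .782 with n = 10
r_missing  = pairwise_r(x_missing, y_scores)    # about .659 with n = 7
```

Losing three cases is enough to move the coefficient from statistically significant to nonsignificant, exactly as described above.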
Missing data can arise in different ways (Little & Rubin, 1987, 1990).
Missing completely at random (MCAR) implies that data on variable X are
missing unrelated statistically to the values that have been observed
for other variables as well as X itself. Missing at random (MAR) implies that
data values on variable X are missing conditional on other variables,
but are unrelated to the values of X. A third situation, nonignorable missing data,
implies that the missingness carries probabilistic information about the values
that would have been observed. For MCAR data, mean substitution yields biased variance and
covariance estimates, whereas listwise and pairwise deletion methods
yield consistent solutions. For MAR data, mean substitution, listwise,
and pairwise deletion methods produce biased results. When missing
data are nonignorable, all approaches yield biased results. It would be
prudent for the researcher to investigate how parameter estimates are
affected by the use or nonuse of a data imputation method. A few references
are provided to give a more detailed understanding of missing
data (Arbuckle, 1996; Enders, 2006; McKnight, McKnight, Sidani, &
Aurelio, 2007; Peng, Harwell, Liou, & Ehman, 2007; Wothke, 2000; Davey
& Savla, 2009).
3.2.4 Outliers
The Pearson correlation coefficient can be drastically affected by a single
outlier on X or Y. For example, the two data sets in Table 3.3 indicate
a Y = 27 value (Set A) versus a Y = 2 value (Set B) for the last subject. In
the first set of data, r = .524, p = .37, whereas in the second set of data,
r = −.994, p = .001. Is the Y = 27 data value an outlier based on limited
sampling or is it a data entry error? A large body of research has
examined how outliers on X, Y, or both affect correlation relationships,
and how to better analyze the data
using robust statistics (Anderson & Schumacker, 2003; Ho & Naugher,
2000; Huber, 1981; Rousseeuw & Leroy, 1987; Staudte & Sheather, 1990).
TABLE 3.3
Outlier Data Sets

Set A       Set B
X    Y      X    Y
1    9      1    9
2    7      2    7
3    5      3    5
4    3      4    3
5   27      5    2
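The single-outlier effect in Table 3.3 is easy to reproduce; a minimal sketch:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y_set_a = [9, 7, 5, 3, 27]   # Set A: last Y value is an outlier (27)
y_set_b = [9, 7, 5, 3, 2]    # Set B: last Y value follows the trend (2)

r_a = pearson_r(x, y_set_a)  # about  .524
r_b = pearson_r(x, y_set_b)  # about -.994
```

One changed score flips the correlation from a moderate positive value to a near-perfect negative one, which is why outlier screening (or a robust alternative) matters before interpreting r.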
3.2.5 Correction for Attenuation
A basic assumption in psychometric theory is that observed data contain measurement
error. A test score (observed data) is a function of a true score plus
measurement error. A Pearson correlation coefficient will have different values
depending on whether it was computed with observed scores or with true
scores from which measurement error has been removed. The Pearson correlation
coefficient can be corrected for attenuation due to unreliable measurement in
scores, thus yielding a true-score correlation; however, the corrected correlation
coefficient can become greater than 1.0! Low reliability in the independent
and/or dependent variables, coupled with a high correlation between
the independent and dependent variable, can result in correlations greater
than 1.0. For example, given a correlation of r = .90 between the observed
scores on X and Y, a Cronbach alpha reliability coefficient of .60 for the X scores,
and a Cronbach alpha reliability coefficient of .70 for the Y scores, the Pearson
correlation coefficient corrected for attenuation (r*) is greater than 1.0:
r*_xy = r_xy / √(r_xx r_yy) = .90 / √((.60)(.70)) = .90 / .648 = 1.389
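The correction for attenuation is a one-line computation; a minimal sketch using the example values above:

```python
import math

def disattenuated_r(r_xy, r_xx, r_yy):
    """Correlation corrected for attenuation: r_xy / sqrt(r_xx * r_yy),
    where r_xx and r_yy are the score reliabilities of X and Y."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Example from the text: observed r = .90, reliabilities .60 and .70
r_star = disattenuated_r(0.90, 0.60, 0.70)   # .90 / .648, about 1.389 (> 1.0!)
```

Because the denominator shrinks as reliability falls, a high observed correlation combined with low reliabilities can push the corrected value past 1.0, producing an inadmissible result.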
When this happens, a nonpositive definite error message occurs, stopping
the SEM program.
3.2.6 Nonpositive Definite Matrices
Correlation coefficients greater than 1.0 in a correlation matrix cause the
correlation matrix to be nonpositive definite. In other words, the solution is
not admissible, indicating that parameter estimates cannot be computed.
Correction for attenuation is not the only situation that causes nonpositive
definite matrices to occur (Wothke, 1993). Sometimes the ratio of a covariance
to the product of the variables' standard deviations yields a correlation greater than 1.0.
The following variance–covariance matrix is nonpositive definite because
it implies a correlation coefficient greater than 1.0 between the Relations
and Attribute latent variables (denoted by an asterisk):
Variance–Covariance Matrix
Task        1.043
Relations    .994  1.079
Management   .892   .905   .924
Attribute   1.065  1.111   .969  1.120

Correlation Matrix
Task        1.000
Relations    .937  1.000
Management   .908   .906  1.000
Attribute    .985  1.010*  .951  1.000
Nonpositive definite covariance matrices occur when the determinant of
the matrix is zero or the inverse of the matrix is not possible. This can
be caused by correlations greater than 1.0, linear dependency among the
observed variables, multicollinearity among the observed variables, a
variable that is a linear combination of other variables, a sample size less
than the number of variables, the presence of a negative or zero variance
(a Heywood case), variance–covariance (correlation) values outside the
permissible range (for example, a correlation beyond ±1.0), or bad start
values in the user-specified model. A Heywood case also occurs when the
communality estimate is greater than 1.0. Possible solutions to resolve
this error are to reduce communality or fix communality to less than 1.0,
extract a different number of factors (possibly by dropping paths), rescale
observed variables to create a more linear relationship, or eliminate a bad
observed variable that indicates linear dependency or multicollinearity.
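Whether a matrix is positive definite can be checked by attempting a Cholesky factorization, which succeeds only for positive definite matrices (this mirrors the admissibility check an SEM program performs; the code below is a sketch, not any particular program's implementation):

```python
import math

def is_positive_definite(a):
    """Attempt a Cholesky factorization of symmetric matrix a; any
    nonpositive pivot means the matrix is not positive definite."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    return False            # zero or negative pivot
                l[i][i] = math.sqrt(d)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return True

# Variance-covariance matrix from the text
# (order: Task, Relations, Management, Attribute)
s_matrix = [
    [1.043, 0.994, 0.892, 1.065],
    [0.994, 1.079, 0.905, 1.111],
    [0.892, 0.905, 0.924, 0.969],
    [1.065, 1.111, 0.969, 1.120],
]

# The implied Relations-Attribute correlation exceeds 1.0 ...
r_ra = 1.111 / math.sqrt(1.079 * 1.120)      # about 1.011
# ... so the matrix is nonpositive definite
admissible = is_positive_definite(s_matrix)  # False
```

The offending covariance (1.111) exceeds the geometric mean of the two variances, which is exactly the "correlation greater than 1.0" condition listed above.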
Regression, path, factor, and structural equation models mathematically
solve a set of simultaneous equations, typically using ordinary least squares
(OLS) estimates as initial estimates of coefficients in the model. However,
these initial estimates or coefficients are sometimes distorted or too different
from the final admissible solution. When this happens, more reasonable
start values need to be chosen. It is easy to see from the basic regression
coefficient formula that the correlation coefficient value and the standard
deviation values of the two variables affect the initial OLS estimates:
b = r_xy (s_y / s_x)
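As a sketch of how a start value would be derived from this formula (the r and standard deviation values below are hypothetical):

```python
def ols_slope(r_xy, s_y, s_x):
    """OLS regression slope from the correlation and the two
    standard deviations: b = r_xy * (s_y / s_x)."""
    return r_xy * s_y / s_x

# Hypothetical values: r = .50, SD(Y) = 10, SD(X) = 2  ->  b = 2.5
b = ols_slope(0.50, 10.0, 2.0)
```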
3.2.7 Sample Size
A common formula used to determine sample size when estimating means
of variables was given by McCall (1982): n = (Zs/e)², where n is the sample
size needed for the desired level of precision, e is the effect size, Z is the
confidence level, and s is the population standard deviation of scores
(s can be estimated from prior research studies, test norms, or the range of
scores divided by 6). For example, given a random sample of ACT scores
from a defined population with a standard deviation of 100, a desired confidence
level of 1.96 (which corresponds to a .05 level of significance), and
an effect size of 20 (the difference between the sampled ACT mean and the
population ACT mean), the sample size needed would be [1.96(100)/20]² = 96.
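McCall's formula is straightforward to compute; a minimal sketch using the example values from the text:

```python
def sample_size(z, s, e):
    """McCall's (1982) sample-size formula for estimating a mean:
    n = (Z * s / e)^2, where s is the population SD and e the effect size."""
    return (z * s / e) ** 2

# Example from the text: Z = 1.96, s = 100, e = 20
n = sample_size(z=1.96, s=100, e=20)   # 9.8^2 = 96.04, about 96 cases
```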
In structural equation modeling, however, the researcher often requires
a much larger sample size to maintain power and obtain stable parameter
estimates and standard errors. The need for larger sample sizes is also
due in part to the program requirements and the multiple observed variables
used to define latent variables. Hoelter (1983) proposed the critical
N statistic, which indicates the sample size needed to obtain a chi-square
value that would reject the null hypothesis in a structural equation model.
The required sample size and power estimates that provide a reasonable
indication of whether a researcher's data fit their theoretical model or
permit parameter estimation are discussed in more detail in chapter 5.
SEM software programs estimate coefficients based on the user-specified
theoretical model, or implied model, but must also work with the saturated
and independence models. A saturated model is the model with all
parameters indicated, while the independence model is the null model, or
model with no parameters estimated. A saturated model with p observed
variables has p(p + 3)/2 free parameters. [Note: The number of independent
elements in the symmetric covariance matrix is p(p + 1)/2, and the number of
means is p, so the total number of independent elements is p(p + 1)/2 + p =
p(p + 3)/2.] For example, with 10 observed variables, 10(10 + 3)/2 = 65 free
parameters. If the sample size is small, then there is not enough information
to estimate parameters in the saturated model for a large number of
variables. Consequently, the chi-square fit statistic and derived statistics
such as Akaike's Information Criterion (AIC) and the root-mean-square
error of approximation (RMSEA) cannot be computed. In addition, the fit
of the independence model is required to calculate other fit indices such
as the Comparative Fit Index (CFI) and the Normed Fit Index (NFI).
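The free-parameter count for a saturated model follows directly from the note above; a minimal sketch:

```python
def saturated_free_params(p):
    """Free parameters in a saturated mean-and-covariance model:
    p*(p+1)/2 covariance elements plus p means = p*(p+3)/2."""
    return p * (p + 3) // 2

# With 10 observed variables: 10 * 13 / 2 = 65 free parameters
params = saturated_free_params(10)
```

The count grows quadratically in p, which is why a small sample quickly runs out of information as the number of observed variables increases.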
Ding, Velicer, and Harlow (1995) located numerous studies (e.g.,
Anderson & Gerbing, 1988) that were in agreement that 100 to 150 subjects
is the minimum satisfactory sample size when conducting structural equation
models. Boomsma (1982, 1983) recommended 400, and Hu, Bentler,
and Kano (1992) indicated that in some cases 5,000 is insufficient! Many
of us may recall rules of thumb in our statistics texts, for example, 10 subjects
per variable or 20 subjects per variable. Costello and Osborne (2005)
demonstrated in their Monte Carlo study that 20 subjects per variable is
recommended for best practices in factor analysis. In our examination of
published SEM research, we have found that many articles used from 250
to 500 subjects, although the greater the sample size, the more likely it
is one can validate the model using cross-validation (see chapter 12). For
example, Bentler and Chou (1987) suggested that a ratio as low as five subjects
per variable would be sufficient for normal and elliptical distributions
when the latent variables have multiple indicators, and that a ratio of at
least 10 subjects per variable would be sufficient for other distributions.
Determination of sample size is now better understood in SEM modeling
and is discussed further in chapter 5.
3.3 Bivariate, Part, and Partial Correlations
The types of correlations indicated in Table 3.1 are considered bivariate correlations,
or associations between two variables. Cohen and Cohen (1983), in
describing correlation research, further presented the correlation between
two variables controlling for the influence of a third variable. These correlations
are referred to as part and partial correlations, depending upon how
variables are controlled or partialled out. Some of the various ways in which
three variables can be depicted are illustrated in Figure 3.1. The diagrams
illustrate different situations among variables where (a) all the variables are
uncorrelated (Case 1), (b) only one pair of variables is correlated (Cases 2
and 3), (c) two pairs of variables are correlated (Cases 4 and 5), and (d) all of
the variables are correlated (Case 6). It is obvious that with more than three
variables the possibilities become overwhelming. It is therefore important to
have a theoretical perspective to suggest why certain variables are correlated
and/or controlled in a study. A theoretical perspective is essential in specifying
a model and forms the basis for testing a structural equation model.
The partial correlation coefficient measures the association between two
variables while controlling for a third variable, for example, the association
between age and reading comprehension, controlling for reading level.
Controlling for reading level in the correlation between age and comprehension
partials out the correlation of reading level with age and the correlation
of reading level with comprehension. Part correlation, in contrast,
is the correlation between age and comprehension with reading level controlled
for, where only the correlation between comprehension and reading
level is removed before age is correlated with comprehension.
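The distinction can be sketched with the standard first-order formulas; the three correlations below are hypothetical values chosen only to illustrate the computation (variable 1 = age, 2 = comprehension, 3 = reading level):

```python
import math

def partial_r(r12, r13, r23):
    """Partial correlation of variables 1 and 2, with variable 3
    removed from both (e.g., age and comprehension, controlling
    reading level)."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

def part_r(r12, r13, r23):
    """Part (semipartial) correlation: variable 3 is removed from
    variable 2 only before it is correlated with variable 1."""
    return (r12 - r13 * r23) / math.sqrt(1 - r23 ** 2)

# Hypothetical correlations: r(age, comp) = .45, r(age, level) = .60,
# r(comp, level) = .70
rp  = partial_r(0.45, 0.60, 0.70)   # about .053
rsp = part_r(0.45, 0.60, 0.70)      # about .042
```

The two formulas share a numerator; the part correlation divides by a larger quantity (only one variance is adjusted), so its absolute value never exceeds that of the partial correlation.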
Whether a part or partial correlation is used depends on the specific
model or research questio